FN Thomson Reuters Web of Science™
VR 1.0
PT J
AU Vogel, AP
Fletcher, J
Maruff, P
AF Vogel, Adam P.
Fletcher, Janet
Maruff, Paul
TI The impact of task automaticity on speech in noise
SO SPEECH COMMUNICATION
LA English
DT Article
DE Lombard effect; Automaticity; Task selection; Speech; Cognitive load;
Timing
ID FUNDAMENTAL-FREQUENCY; SENSORIMOTOR CONTROL; ACOUSTIC ANALYSIS;
INTERFERENCE; UNCERTAINTY; VARIABILITY; RELIABILITY; IMPAIRMENT;
PRINCIPLES; DISEASE
AB In the control of skeleto-motor movement, it is well established that the less complex, or more automatic, a motor task is, the less variability and uncertainty there is in its performance. It was hypothesized that a similar relationship exists for integrated cognitive-motor tasks such as speech, where the uncertainty with which actions are initiated may increase when the feedback loop is interrupted or dampened. To investigate this, the Lombard effect was exploited to explore the acoustic impact of background noise on speech during tasks increasing in automaticity. Fifteen healthy adults produced five speech tasks with different levels of automaticity (e.g., counting, reading, unprepared monologue) during habitual and altered auditory feedback conditions (Lombard effect). Data suggest that, on measures of timing, speech tasks relatively free of meaning or phonetic complexity are influenced to a lesser degree by compromised auditory feedback than more complex paradigms (e.g., contemporaneous speech). These findings inform understanding of the relative contribution that speech task selection makes to measures of speech. Data also aid in understanding the relationship between task automaticity and altered speech production in neurological conditions where dual impairments of movement and cognition are observed (e.g., Huntington's disease, progressive aphasia). (C) 2014 Elsevier B.V. All rights reserved.
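As a rough illustration of the kind of timing comparison described in the abstract above, the following Python sketch contrasts a simple variability measure across habitual and noise conditions. All duration values, and the choice of coefficient of variation as the measure, are hypothetical placeholders rather than the authors' analysis.

import numpy as np
from scipy import stats

# Hypothetical per-utterance mean pause durations (seconds) for one speech
# task, produced by the same speakers in quiet and in background noise.
quiet = np.array([0.41, 0.38, 0.45, 0.40, 0.43])
noise = np.array([0.52, 0.47, 0.55, 0.49, 0.50])

# Coefficient of variation as a simple index of timing variability.
def cv(x):
    return np.std(x, ddof=1) / np.mean(x)

print("CV quiet: %.3f, CV noise: %.3f" % (cv(quiet), cv(noise)))

# Paired comparison of the condition means (same speakers in both conditions).
t, p = stats.ttest_rel(noise, quiet)
print("paired t = %.2f, p = %.3f" % (t, p))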
C1 [Vogel, Adam P.] Univ Melbourne, Speech Neurosci Unit, Melbourne, Vic, Australia.
[Fletcher, Janet] Univ Melbourne, Sch Languages & Linguist, Melbourne, Vic, Australia.
[Maruff, Paul] Univ Melbourne, Howard Florey Inst Neurosci & Mental Hlth, Melbourne, Vic, Australia.
RP Vogel, AP (reprint author), 550 Swanston St, Melbourne, Vic 3010, Australia.
EM vogela@unimelb.edu.au
FU National Health and Medical Research Council - Australia [1012302]
FX APV was supported by a National Health and Medical Research Council -
Australia, Early Career Fellowship (#1012302).
CR Bays PM, 2007, J PHYSIOL-LONDON, V578, P387, DOI 10.1113/jphysiol.2006.120121
Bays PM, 2005, CURR BIOL, V15, P1125, DOI 10.1016/j.cub.2005.05.023
Blais C., 2010, Q J EXP PSYCHOL, V65, P268, DOI [10.1080/17470211003775234, DOI 10.1080/17470211003775234]
Boersma P., 2001, GLOT INT, V5, P341
Boril H., 2008, THESIS CZECH TECHNIC
BROWN W S JR, 1972, Journal of Auditory Research, V12, P157
Castellanos A, 1996, SPEECH COMMUN, V20, P23, DOI 10.1016/S0167-6393(96)00042-8
DREHER JJ, 1957, J ACOUST SOC AM, V29, P1320, DOI 10.1121/1.1908780
Dunlap WP, 1996, PSYCHOL METHODS, V1, P170, DOI 10.1037//1082-989X.1.2.170
Fredrickson A, 2008, HUM PSYCHOPHARM CLIN, V23, P425, DOI 10.1002/hup.942
Gadesmann M, 2008, INT J LANG COMM DIS, V43, P41, DOI 10.1080/13682820701234444
Gilabert R., 2006, MULTILINGUAL MATTERS, P44
Hanley TD, 1949, J SPEECH HEAR DISORD, V14, P363
Hayhoe M, 2005, TRENDS COGN SCI, V9, P188, DOI 10.1016/j.tics.2005.02.009
Houde JF, 1998, SCIENCE, V279, P1213, DOI 10.1126/science.279.5354.1213
Junqua JC, 1996, SPEECH COMMUN, V20, P13, DOI 10.1016/S0167-6393(96)00041-6
Laures JS, 2003, J COMMUN DISORD, V36, P449, DOI 10.1016/S0021-9924(03)00032-7
Lee GS, 2007, EAR HEARING, V28, P343, DOI 10.1097/AUD.0b013e318047936f
LETOWSKI T, 1993, EAR HEARING, V14, P332
Lombard E., 1911, MALADIES OREILLE LAR, V27, P101
Lu YY, 2009, J ACOUST SOC AM, V126, P1495, DOI 10.1121/1.3179668
MACLEOD CM, 1988, J EXP PSYCHOL LEARN, V14, P126, DOI 10.1037/0278-7393.14.1.126
Mazzoni D., 2012, AUDACITY
Mendoza E, 1998, J VOICE, V12, P263, DOI 10.1016/S0892-1997(98)80017-9
Patel R, 2008, J SPEECH LANG HEAR R, V51, P209, DOI 10.1044/1092-4388(2008/016)
Pittman AL, 2001, J SPEECH LANG HEAR R, V44, P487, DOI 10.1044/1092-4388(2001/038)
RIVERS C, 1985, Journal of Auditory Research, V25, P37
Robinson P, 2001, APPL LINGUIST, V22, P27, DOI 10.1093/applin/22.1.27
Robinson P., 2005, IRAL-INT REV APPL LI, V43, P1, DOI DOI 10.1515/IRA1.2005.43.1.1
Scherer K.R., 2002, ICSLP 2002 DENV CO U, P2017
SIEGEL GM, 1992, J SPEECH HEAR RES, V35, P1358
Stout JC, 2011, NEUROPSYCHOLOGY, V25, P1, DOI 10.1037/a0020937
Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660
TARTTER VC, 1993, J ACOUST SOC AM, V94, P2437, DOI 10.1121/1.408234
Tremblay S, 2003, NATURE, V423, P866, DOI 10.1038/nature01710
Van Riper C, 1963, SPEECH CORRECTION, V4
van Beers RJ, 2002, PHILOS T R SOC B, V357, P1137, DOI 10.1098/rstb.2002.1101
Vogel AP, 2010, MOVEMENT DISORD, V25, P1753, DOI 10.1002/mds.23103
Vogel AP, 2012, NEUROPSYCHOLOGIA, V50, P3273, DOI 10.1016/j.neuropsychologia.2012.09.011
Vogel AP, 2009, BEHAV RES METHODS, V41, P318, DOI 10.3758/BRM.41.2.318
Vogel AP, 2011, J VOICE, V25, P137, DOI 10.1016/j.jvoice.2009.09.003
Wassink AB, 2007, J PHONETICS, V35, P363, DOI 10.1016/j.wocn.2006.07.002
Watson PJ, 2006, J SPEECH LANG HEAR R, V49, P636, DOI 10.1044/1092-4388(2006/046)
Wolpert DM, 2011, NAT REV NEUROSCI, V12, P739, DOI 10.1038/nrn3112
YATES AJ, 1963, PSYCHOL BULL, V60, P213, DOI 10.1037/h0044155
NR 45
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2014
VL 65
BP 1
EP 8
DI 10.1016/j.specom.2014.05.002
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AP2KS
UT WOS:000341901700001
ER
PT J
AU Schwerin, B
Paliwal, K
AF Schwerin, Belinda
Paliwal, Kuldip
TI An improved speech transmission index for intelligibility prediction
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech transmission index; Modulation transfer function; Speech
enhancement; Objective evaluation; Speech intelligibility; Short-time
modulation spectrum
ID MODEL
AB The speech transmission index (STI) is a well-known measure of intelligibility, best suited to evaluating speech intelligibility in rooms, with stimuli subjected to additive noise and reverberation. However, STI and its many variants do not effectively represent the intelligibility of stimuli containing non-linear distortions, such as those resulting from processing by enhancement algorithms. In this paper, we revisit the STI approach and propose a variant that processes the modulation envelope in short-time segments, requiring only an assumption of quasi-stationarity of the modulation signal (rather than the stationarity assumption of STI). Results presented in this work show that the proposed approach improves the measure's correlation with subjective intelligibility scores, compared with traditional STI, for a range of noise types and enhancement approaches. The approach is also shown to have higher correlation than the other coherence, correlation and distance measures tested, but is unsuited to the evaluation of stimuli heavily distorted by, for example, masking-based processing, where an alternative approach such as STOI is recommended. (C) 2014 Elsevier B.V. All rights reserved.
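The following Python sketch illustrates the general idea of short-time processing of the modulation envelope described above: the clean and processed envelopes are compared segment by segment rather than with one global measure. It is a minimal stand-in, not the proposed index; the envelope extraction, segment length and similarity score are assumptions made purely for illustration.

import numpy as np
from scipy.signal import hilbert

def envelope(x, fs, win_ms=8):
    """Crude amplitude envelope: magnitude of the analytic signal, smoothed."""
    env = np.abs(hilbert(x))
    n = max(1, int(fs * win_ms / 1000.0))
    return np.convolve(env, np.ones(n) / n, mode="same")

def short_time_envelope_similarity(clean, processed, fs, seg_ms=300):
    """Mean per-segment correlation of the two envelopes; quasi-stationarity
    is assumed only within each segment, as in the short-time approach above."""
    e1, e2 = envelope(clean, fs), envelope(processed, fs)
    seg = int(fs * seg_ms / 1000.0)
    scores = []
    for start in range(0, len(e1) - seg + 1, seg):
        a, b = e1[start:start + seg], e2[start:start + seg]
        if np.std(a) > 0 and np.std(b) > 0:
            scores.append(np.corrcoef(a, b)[0, 1])
    return float(np.mean(scores)) if scores else 0.0

# Usage with synthetic signals standing in for real clean/enhanced speech.
fs = 16000
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 4 * t) * np.sin(2 * np.pi * 200 * t)
noisy = clean + 0.3 * np.random.randn(len(clean))
print(short_time_envelope_similarity(clean, noisy, fs))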
C1 [Schwerin, Belinda; Paliwal, Kuldip] Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia.
RP Schwerin, B (reprint author), Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia.
EM belsch71@gmail.com
CR [Anonymous], 2001, ITU T REC, P862
ANSI, 1997, S351997 ANSI
Balakrishnan N., 1992, HDB LOGISTIC DISTRIB
Boldt J., 2009, P EUSIPCO, P1849
CARTER GC, 1983, IEEE T AUDIO ELECTRO, V21, P337, DOI DOI 10.1109/TAU.1973.1162496
Christiansen C, 2010, SPEECH COMMUN, V52, P678, DOI 10.1016/j.specom.2010.03.004
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Erkelens JS, 2007, IEEE T AUDIO SPEECH, V15, P1741, DOI 10.1109/TASL.2007.899233
FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407
Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628
Holube I, 1996, J ACOUST SOC AM, V100, P1703, DOI 10.1121/1.417354
HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224
Hu Y, 2007, J ACOUST SOC AM, V122, P1777, DOI 10.1121/1.2766778
Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493
Ma JF, 2011, SPEECH COMMUN, V53, P340, DOI 10.1016/j.specom.2010.10.005
Paliwal K, 2011, SPEECH COMMUN, V53, P327, DOI 10.1016/j.specom.2010.10.004
Payton KL, 1999, J ACOUST SOC AM, V106, P3637, DOI 10.1121/1.428216
Pearce D., 2000, P ICSLP, V4, P29
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
Rothauser E. H., 1969, IEEE T AUDIO ELECTRO, V17, P225, DOI DOI 10.1109/TAU.1969.1162058
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
Taal C., 2010, P ITG FACHT SPRACHK
Taal CH, 2011, IEEE T AUDIO SPEECH, V19, P2125, DOI 10.1109/TASL.2011.2114881
Tribolet J. M., 1978, Proceedings of the 1978 IEEE International Conference on Acoustics, Speech and Signal Processing
NR 27
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2014
VL 65
BP 9
EP 19
DI 10.1016/j.specom.2014.05.003
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AP2KS
UT WOS:000341901700002
ER
PT J
AU Moniz, H
Batista, F
Mata, AI
Trancoso, I
AF Moniz, Helena
Batista, Fernando
Mata, Ana Isabel
Trancoso, Isabel
TI Speaking style effects in the production of disfluencies
SO SPEECH COMMUNICATION
LA English
DT Article
DE Prosody; Disfluencies; Lectures; Dialogues; Speaking styles
ID SPONTANEOUS SPEECH; REPAIR; UM
AB This work explores speaking style effects in the production of disfluencies. University lectures and map-task dialogues are analyzed in order to evaluate whether the prosodic strategies used when uttering disfluencies vary across speaking styles. Our results show that the distribution of disfluency types is not arbitrary across lectures and dialogues. Moreover, although there is a statistically significant cross-style strategy of prosodic contrast marking (pitch and energy increases) between the region to repair and the repair of fluency, this strategy is displayed differently depending on the specific speech task. The overall patterns observed in the lectures, with regularities ascribed for speaker and disfluency types, do not hold with the same strength for the dialogues, due to underlying specificities of the communicative purposes. The tempo patterns found for both speech tasks also confirm their distinct behaviour, evidencing the more dynamic tempo characteristics of dialogues. In university lectures, prosodic cues are given to the listener both for the units inside disfluent regions and between these and the adjacent contexts. This suggests a stronger prosodic contrast marking of the disfluency-fluency repair when compared to dialogues, as if teachers were monitoring the different regions (the introduction to a disfluency, the disfluency itself and the beginning of the repair) and demarcating them in very contrastive ways. (C) 2014 Elsevier B.V. All rights reserved.
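A minimal Python sketch of the prosodic-contrast comparison mentioned above (pitch and energy increases between the region to repair and the repair). The frame-level F0 and energy tracks and the region boundaries below are hypothetical; a real analysis would operate on features extracted from the recordings rather than hand-typed values.

import numpy as np

f0 = np.array([180, 175, 172, 0, 0, 190, 195, 198, 200], dtype=float)   # 0 = unvoiced frame
energy = np.array([62, 61, 60, 50, 49, 66, 67, 68, 68], dtype=float)    # dB
reparandum = slice(0, 3)   # frames of the region to repair
repair = slice(5, 9)       # frames of the repair

def voiced_mean(track, span):
    vals = track[span]
    vals = vals[vals > 0]
    return float(np.mean(vals)) if vals.size else float("nan")

pitch_increase = voiced_mean(f0, repair) - voiced_mean(f0, reparandum)
energy_increase = float(np.mean(energy[repair]) - np.mean(energy[reparandum]))
print("F0 increase: %.1f Hz, energy increase: %.1f dB" % (pitch_increase, energy_increase))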
C1 [Moniz, Helena; Batista, Fernando; Trancoso, Isabel] L2F INESCID, Lisbon, Portugal.
[Moniz, Helena; Mata, Ana Isabel] Univ Lisbon, FLUL CLUL, P-1699 Lisbon, Portugal.
[Batista, Fernando] Inst Univ Lisboa, ISCTE IUL, Lisbon, Portugal.
[Trancoso, Isabel] Univ Lisbon, Inst Super Tecn, P-1699 Lisbon, Portugal.
RP Moniz, H (reprint author), L2F INESCID, Lisbon, Portugal.
EM helenam@l2f.inesc-id.pt; fmmb@l2f.inesc-id.pt; aim@fl.ul.pt;
isabel.trancoso@l2f.inesc-id.pt
RI Batista, Fernando/C-8355-2009
OI Batista, Fernando/0000-0002-1075-0177
FU FCT - Fundacao para a Ciencia e Tecnologia [FCT/SFRH/BD/44671/2008,
SFRH/BPD/95849/2013, PEst-OE/EEI/LA0021/2013, PTDC/CLE-LIN/120017/2010];
European Project EU-IST FP7 project SpeDial [611396]; ISCTE-IUL,
Instituto Universitario de Lisboa
FX This work was supported by national funds through FCT - Fundacao para a
Ciencia e Tecnologia, under Ph.D Grant FCT/SFRH/BD/44671/2008 and
Post-doc fellow researcher Grant SFRH/BPD/95849/2013, projects
PEst-OE/EEI/LA0021/2013 and PTDC/CLE-LIN/120017/2010, by European
Project EU-IST FP7 project SpeDial under Contract 611396, and by
ISCTE-IUL, Instituto Universitario de Lisboa.
CR Allwood J., 1990, NORD J LINGUIST, P3
Amaral R., 2008, INT 2008 BRISB AUSTR
Arnold JE, 2003, J PSYCHOLINGUIST RES, V32, P25, DOI 10.1023/A:1021980931292
Barry W, 1995, ICPHS 1995 STOCKH SW
Batista F, 2011, THESIS I SUPERIOR TE
Batista F., 2012, J SPEECH SCI, P115
Batista F, 2012, IEEE T AUDIO SPEECH, V20, P474, DOI 10.1109/TASL.2011.2159594
Benus S., 2012, 3 IEEE C COGN INF KO
Biber D., 1988, VARIATION SPEECH WRI
Blaauw E., 1995, PERCEPTUAL CLASSIFIC
Brennan SE, 2001, J MEM LANG, V44, P274, DOI 10.1006/jmla.2000.2753
Caseiro D., 2002, PMLA ISCA TUT RES WO
Clark HH, 2002, COGNITION, V84, P73, DOI 10.1016/S0010-0277(02)00017-3
Cole J., 2005, DISS 2005 AIX EN PRO
Conrad S., 2009, REGISTER GENRE STYLE
Cucchiarini C, 2002, J ACOUST SOC AM, V111, P2862, DOI 10.1121/1.1471894
Eklund R., 2004, THESIS U LINKOPING
Erard M., 2007, SLIPS STUMBLES VERBA
Eskenazi M., 1993, EUR 1993 BERL GERM
Fox-Tree J.E., 1995, J MEM LANG, P709
Gravano A., 2011, INT 2011 FLOR IT
Grojean F., 1980, CROSS LINGUISTIC ASS, P144
Heike A., 1981, LANG SPEECH, P147
Hindle D., 1983, ACL, P123
Hirschberg J., 2000, THEORY EXPT STUDIES, P335
Koehn P., 2005, 10 MACH TRANSL SUMM
Levelt W. J. M., 1989, SPEAKING
Levelt W. J. M., 1983, J SEMANT, V2, P205
LEVELT WJM, 1983, COGNITION, V14, P41, DOI 10.1016/0010-0277(83)90026-4
Liu Y, 2006, COMPUT SPEECH LANG, V20, P468, DOI 10.1016/j.csl.2005.06.002
Mata A.I., 2010, SPEECH PROSODY
Mata A.I., 1999, THESIS U LISBON
Moniz H, 2006, THESIS U LISBON
NAKATANI CH, 1994, J ACOUST SOC AM, V95, P1603, DOI 10.1121/1.408547
Neto J, 2008, INT CONF ACOUST SPEE, P1561, DOI 10.1109/ICASSP.2008.4517921
O'Connell DC, 2008, COGN LANG A SER PSYC, P3, DOI 10.1007/978-0-387-77632-3_1
Pellegrini T., 2012, GSCP 2012 BRAZ
Plauche M., 1999, ICPHS 1999 S FRANC U
Ranganath R, 2013, COMPUT SPEECH LANG, V27, P89, DOI 10.1016/j.csl.2012.01.005
Ribeiro R, 2011, J ARTIF INTELL RES, V42, P275
Rose R., 1998, THESIS U BIRMINGHAM
Savova G., 2003, DISS 2003 GOT SWED
Savova G., 2003, INT 2003 GEN SWITZ
Schuller B, 2013, COMPUT SPEECH LANG, V27, P4, DOI 10.1016/j.csl.2012.02.005
Shriberg E., 2001, J INT PHON ASSOC, V31, P153
Shriberg E, 1999, INT C PHON SCI SAN F, P612
Shriberg E. E., 1994, THESIS U CALIFORNIA
Sjolander K., 1998, ICSLP 1998 SYDN AUST, P3217
Swerts M, 1998, J PRAGMATICS, V30, P485, DOI 10.1016/S0378-2166(98)00014-9
Trancoso I., 2008, LREC 2008 LANG RES E
Trancoso I., 1998, PROPOR98 PORT AL BRA
Vaissiere J, 2005, BLACKW HBK LINGUIST, P236, DOI 10.1002/9780470757024.ch10
Viana M.C., 1998, WORKSH LING COMP LIS
NR 53
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2014
VL 65
BP 20
EP 35
DI 10.1016/j.specom.2014.05.004
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AP2KS
UT WOS:000341901700003
ER
PT J
AU Rigoulot, S
Pell, MD
AF Rigoulot, Simon
Pell, Marc D.
TI Emotion in the voice influences the way we scan emotional faces
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech; Prosody; Face; Eye-tracking; Emotion; Cross-modal
ID SPOKEN-WORD RECOGNITION; FACIAL EXPRESSIONS; EYE-MOVEMENTS;
CULTURAL-DIFFERENCES; NEURAL EVIDENCE; SPEECH PROSODY; TIME-COURSE;
PERCEPTION; INFORMATION; ATTENTION
AB Previous eye-tracking studies have found that listening to emotionally-inflected utterances guides visual behavior towards an emotionally congruent face (e.g., Rigoulot and Pell, 2012). Here, we investigated in more detail whether emotional speech prosody influences how participants scan and fixate specific features of an emotional face that is congruent or incongruent with the prosody. Twenty-one participants viewed individual faces expressing fear, sadness, disgust, or happiness while listening to an emotionally-inflected pseudoutterance spoken in a congruent or incongruent prosody. Participants judged whether the emotional meaning of the face and voice were the same or different (match/mismatch). Results confirm that there were significant effects of prosody congruency on eye movements when participants scanned a face, although these varied by emotion type; a matching prosody promoted more frequent looks to the upper part of fear and sad facial expressions, whereas visual attention to upper and lower regions of happy (and to some extent disgust) faces was more evenly distributed. These data suggest that vocal emotion cues guide how humans process facial expressions in ways that could facilitate recognition of salient visual cues, helping perceivers arrive at a holistic impression of intended meanings during interpersonal events. (C) 2014 Elsevier B.V. All rights reserved.
C1 [Rigoulot, Simon; Pell, Marc D.] McGill Univ, Fac Med, Sch Commun Sci & Disorders, Montreal, PQ H3G 1A8, Canada.
McGill Ctr Res Brain Language & Mus, Montreal, PQ, Canada.
RP Rigoulot, S (reprint author), McGill Univ, Fac Med, Sch Commun Sci & Disorders, 1266 Ave Pins Ouest, Montreal, PQ H3G 1A8, Canada.
EM simon.rigoulot@mail.mcgill.ca
FU Natural Sciences and Engineering Research Council of Canada
FX We are grateful to Catherine Knowles and Hope Valeriote for running the
experiment. This research was funded by a Discovery Grant from the
Natural Sciences and Engineering Research Council of Canada (to MDP).
CR Adolphs R, 2005, NATURE, V433, P68, DOI 10.1038/nature03086
Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
BASSILI JN, 1979, J PERS SOC PSYCHOL, V37, P2049, DOI 10.1037//0022-3514.37.11.2049
Bate S, 2009, NEUROPSYCHOLOGY, V23, P658, DOI 10.1037/a0014518
Bayle DJ, 2009, PLOS ONE, V4, DOI 10.1371/journal.pone.0008207
Beaudry O, 2014, COGNITION EMOTION, V28, P416, DOI 10.1080/02699931.2013.833500
Becker MW, 2009, Q J EXP PSYCHOL, V62, P1257, DOI 10.1080/17470210902725753
Blais C, 2012, NEUROPSYCHOLOGIA, V50, P2830, DOI 10.1016/j.neuropsychologia.2012.08.010
Brosch T, 2009, J COGNITIVE NEUROSCI, V21, P1670, DOI 10.1162/jocn.2009.21110
BRUCE V, 1992, PHILOS T ROY SOC B, V335, P121, DOI 10.1098/rstb.1992.0015
Calder AJ, 2005, NAT REV NEUROSCI, V6, P641, DOI 10.1038/nrn1724
Calder AJ, 2000, J EXP PSYCHOL HUMAN, V26, P527, DOI 10.1037/0096-1523.26.2.527
Calvo MG, 2009, COGNITION EMOTION, V23, P782, DOI 10.1080/02699930802151654
Calvo MG, 2005, J EXP PSYCHOL HUMAN, V31, P502, DOI 10.1037/0096-1523.31.3.502
Calvo MG, 2013, MOTIV EMOTION, V37, P202, DOI 10.1007/s11031-012-9298-1
Calvo MG, 2011, VISION RES, V51, P1751, DOI 10.1016/j.visres.2011.06.001
Calvo MG, 2008, J EXP PSYCHOL GEN, V137, P471, DOI 10.1037/a0012771
Campanella S, 2007, TRENDS COGN SCI, V11, P535, DOI 10.1016/j.tics.2007.10.001
Charash M, 2002, J ANXIETY DISORD, V16, P529, DOI 10.1016/S0887-6185(02)00171-8
Cisler JM, 2009, COGNITION EMOTION, V23, P675, DOI 10.1080/02699930802051599
Collignon O, 2008, BRAIN RES, V1242, P126, DOI 10.1016/j.brainres.2008.04.023
COOPER RM, 1974, COGNITIVE PSYCHOL, V6, P84, DOI 10.1016/0010-0285(74)90005-X
Cvejic E, 2010, SPEECH COMMUN, V52, P555, DOI 10.1016/j.specom.2010.02.006
Dahan D, 2001, COGNITIVE PSYCHOL, V42, P317, DOI 10.1006/cogp.2001.0750
Darwin Charles, 1998, EXPRESSION EMOTIONS, V3rd
de Gelder B, 2000, COGNITION EMOTION, V14, P289
Dolan RJ, 2001, P NATL ACAD SCI USA, V98, P10006, DOI 10.1073/pnas.171288598
Eisenbarth H, 2011, EMOTION, V11, P860, DOI 10.1037/a0022758
Ekman P., 2002, FACIAL ACTION CODING
Ekman P., 1990, J PERS SOC PSYCHOL, V58, P343, DOI [DOI 10.1037/0022-3514.58.2.342, 10.1037/0022-3514.58.2.342]
Ekman P., 1976, PICTURES FACIAL AFFE
Gordon MS, 2011, Q J EXP PSYCHOL, V64, P730, DOI 10.1080/17470218.2010.516835
Gosselin F, 2001, VISION RES, V41, P2261, DOI 10.1016/S0042-6989(01)00097-9
Green MJ, 2003, COGNITION EMOTION, V17, P779, DOI 10.1080/02699930302282
Hall JK, 2010, COGNITION EMOTION, V24, P629, DOI 10.1080/02699930902906882
Huettig F, 2005, COGNITION, V96, pB23, DOI 10.1016/j.cognition.2004.10.003
Hunnius S, 2011, COGNITION EMOTION, V25, P193, DOI 10.1080/15298861003771189
Jack RE, 2009, CURR BIOL, V19, P1543, DOI 10.1016/j.cub.2009.07.051
Jaywant A, 2012, SPEECH COMMUN, V54, P1, DOI 10.1016/j.specom.2011.05.011
Juslin PN, 2003, PSYCHOL BULL, V129, P770, DOI 10.1037/0033-2909.129.5.770
Malcolm GL, 2008, J VISION, V8, DOI 10.1167/8.8.2
MATSUMOTO D, 1989, J NONVERBAL BEHAV, V13, P171, DOI 10.1007/BF00987048
Messinger DS, 2012, EMOTION, V12, P430, DOI 10.1037/a0026498
Neath KN, 2014, COGNITION EMOTION, V28, P115, DOI 10.1080/02699931.2013.812557
Niedenthal PM, 2007, SCIENCE, V316, P1002, DOI 10.1126/science.1136930
Palermo R, 2007, NEUROPSYCHOLOGIA, V45, P75, DOI 10.1016/j.neuropsychologia.2006.04.025
Paulmann S, 2011, MOTIV EMOTION, V35, P192, DOI 10.1007/s11031-011-9206-0
Paulmann S, 2012, SPEECH COMMUN, V54, P92, DOI 10.1016/j.specom.2011.07.004
Paulmann S, 2010, COGN AFFECT BEHAV NE, V10, P230, DOI 10.3758/CABN.10.2.230
Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005
Pell MD, 2005, J NONVERBAL BEHAV, V29, P193, DOI 10.1007/s10919-005-7720-z
Pell M.D., 2011, PLOS ONE, V6
Pell MD, 2011, COGNITION EMOTION, V25, P834, DOI 10.1080/02699931.2010.516915
Pell MD, 1999, BRAIN LANG, V69, P161, DOI 10.1006/brln.1999.2065
Pell MD, 1999, CORTEX, V35, P455, DOI 10.1016/S0010-9452(08)70813-X
Pell MD, 1997, BRAIN LANG, V57, P195, DOI 10.1006/brln.1997.1736
Pourtois G, 2005, CORTEX, V41, P49, DOI 10.1016/S0010-9452(08)70177-1
Rigoulot S., BRAIN RES IN PRESS
Rigoulot S, 2012, PLOS ONE, V7, DOI 10.1371/journal.pone.0030740
Rigoulot S, 2012, NEUROPSYCHOLOGIA, V50, P2887, DOI 10.1016/j.neuropsychologia.2012.08.015
Rigoulot S, 2011, NEUROPSYCHOLOGIA, V49, P2013, DOI 10.1016/j.neuropsychologia.2011.03.031
Rousselet GA, 2005, VIS COGN, V12, P852, DOI 10.1080/13506280444000553
SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674
Schyns PG, 2002, PSYCHOL SCI, V13, P402, DOI 10.1111/1467-9280.00472
Stins JF, 2011, EXP BRAIN RES, V212, P603, DOI 10.1007/s00221-011-2767-z
Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001
Tanaka A, 2010, PSYCHOL SCI, V21, P1259, DOI 10.1177/0956797610380698
TANENHAUS MK, 1995, SCIENCE, V268, P1632, DOI 10.1126/science.7777863
Thompson LA, 2009, BRAIN COGNITION, V69, P108, DOI 10.1016/j.bandc.2008.06.002
Tottenham N, 2009, PSYCHIAT RES, V168, P242, DOI 10.1016/j.psychres.2008.05.006
VASEY MW, 1987, PSYCHOPHYSIOLOGY, V24, P479, DOI 10.1111/j.1469-8986.1987.tb00324.x
Vassallo S, 2009, J VISION, V9, DOI 10.1167/9.3.11
Wong B, 2005, NEUROPSYCHOLOGY, V19, P739, DOI 10.1037/0894-4105.19.6.739
Yarbus A. L., 1967, EYE MOVEMENTS VISION
Yee E, 2009, PSYCHON B REV, V16, P869, DOI 10.3758/PBR.16.5.869
Yee E, 2006, J EXP PSYCHOL LEARN, V32, P1, DOI 10.1037/0278-7393.32.1.1
Yuki M, 2007, J EXP SOC PSYCHOL, V43, P303, DOI 10.1016/j.jesp.2006.02.004
NR 77
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2014
VL 65
BP 36
EP 49
DI 10.1016/j.specom.2014.05.006
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AP2KS
UT WOS:000341901700004
ER
PT J
AU Skantze, G
Hjalmarsson, A
Oertel, C
AF Skantze, Gabriel
Hjalmarsson, Anna
Oertel, Catharine
TI Turn-taking, feedback and joint attention in situated human-robot
interaction
SO SPEECH COMMUNICATION
LA English
DT Article
DE Turn-taking; Feedback; Joint attention; Prosody; Gaze; Uncertainty
ID GAZE; CONVERSATIONS; BACKCHANNELS; ADDRESSEES; FEATURES; SPEAKING;
DIALOG; TASK
AB In this paper, we present a study where a robot instructs a human on how to draw a route on a map. The human and robot are seated face-to-face with the map placed on the table between them. The user's and the robot's gaze can thus serve several simultaneous functions: as cues to joint attention, turn-taking, level of understanding and task progression. We have compared this face-to-face setting with a setting where the robot employs a random gaze behaviour, as well as a voice-only setting where the robot is hidden behind a paper board. In addition to this, we have also manipulated turn-taking cues such as completeness and filled pauses in the robot's speech. By analysing the participants' subjective rating, task completion, verbal responses, gaze behaviour, and drawing activity, we show that the users indeed benefit from the robot's gaze when talking about landmarks, and that the robot's verbal and gaze behaviour has a strong effect on the users' turn-taking behaviour. We also present an analysis of the users' gaze and lexical and prosodic realisation of feedback after the robot instructions, and show that these cues reveal whether the user has yet executed the previous instruction, as well as the user's level of uncertainty. (C) 2014 Elsevier B.V. All rights reserved.
C1 [Skantze, Gabriel; Hjalmarsson, Anna; Oertel, Catharine] KTH Royal Inst Technol, Dept Speech Mus & Hearing, Stockholm, Sweden.
RP Skantze, G (reprint author), KTH Royal Inst Technol, Dept Speech Mus & Hearing, Stockholm, Sweden.
EM gabriel@speech.kth.se
FU Swedish research council (VR) [2011-6237, 2011-6152]; GetHomeSafe (EU
7th Framework STREP) [288667]
FX Gabriel Skantze is supported by the Swedish research council (VR)
project Incremental processing in multimodal conversational systems
(2011-6237). Anna Hjalmarsson is supported by the Swedish Research
Council (VR) project Classifying and deploying pauses for flow control
in conversational systems (2011-6152). Catharine Oertel is supported by
GetHomeSafe (EU 7th Framework STREP 288667).
CR Al Moubayed S., 2013, INT J HUMANOID ROB, V10
Allen J.F., 1997, DRAFT DAMSL DI UNPUB
Allopenna PD, 1998, J MEM LANG, V38, P419, DOI 10.1006/jmla.1997.2558
Allwood J., 1992, Journal of Semantics, V9, DOI 10.1093/jos/9.1.1
Anderson A., LANG SPEECH, V34, P351
Baron-Cohen S., 1995, JOINT ATTENTION ITS, P41
Bavelas JB, 2002, J COMMUN, V52, P566, DOI 10.1093/joc/52.3.566
Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464
Boersma P., 2001, GLOT INT, V5, P341
Bohus D., 2010, P ICMI10 BEIJ CHIN
Boucher J.D., 2012, FRONT NEUROROBOTICS, V6
Boye J, 2007, P 8 SIGDIAL WORKSH D
Boye J., 2012, IWSDS2012 INT WORKSH
BOYLE EA, 1994, LANG SPEECH, V37, P1
Buschmeier H, 2011, P 11 INT C INT VIRT, P169, DOI 10.1007/978-3-642-23974-8_19
Buschmeier H., 2012, P 13 ANN M SPEC INT, P295
Cathcart N., 2003, 10 C EUR CHAPT ASS C
Clark H. H., 1981, ELEMENTS DISCOURSE U, P10
Clark H. H., 1996, USING LANGUAGE
Clark HH, 2004, J MEM LANG, V50, P62, DOI 10.1016/j.jml.2003.08.004
DUNCAN S, 1972, J PERS SOC PSYCHOL, V23, P283, DOI 10.1037/h0033031
Edlund J, 2009, LANG SPEECH, V52, P351, DOI 10.1177/0023830909103179
Forbes-Riley K, 2011, SPEECH COMMUN, V53, P1115, DOI 10.1016/j.specom.2011.02.006
Gravano A, 2011, COMPUT SPEECH LANG, V25, P601, DOI 10.1016/j.csl.2010.10.003
Grosz B. J., 1986, Computational Linguistics, V12
Hall M., 2009, SIGKDD EXPLORATIONS, V11, P1, DOI DOI 10.1145/1656274.1656278
Heldner M, 2010, J PHONETICS, V38, P555, DOI 10.1016/j.wocn.2010.08.002
Hjalmarsson A, 2011, SPEECH COMMUN, V53, P23, DOI 10.1016/j.specom.2010.08.003
Hjalmarsson A., 2012, P IVA 2012 WORKSH RE
Huang L., 2011, INTELLIGENT VIRTUAL, P68
Iwase T., 1998, P ICSLP SYDN AUSTR, P1203
Johansson M., 2013, INT C SOC ROB ICSR 2
Katzenmaier M., 2004, P INT C MULT INT ICM
KENDON A, 1967, ACTA PSYCHOL, V26, P22, DOI 10.1016/0001-6918(67)90005-4
Kennington C., 2013, P SIGDIAL 2013 C MET, P173
Koiso H, 1998, LANG SPEECH, V41, P295
Lai C., 2010, P INT MAK JAP
LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310
Liscombe J., 2006, P INT 2006 PITTSB PA
Meena R., 2013, 14 ANN M SPEC INT GR, P375
Morency LP, 2010, AUTON AGENT MULTI-AG, V20, P70, DOI 10.1007/s10458-009-9092-y
Mutlu B, 2006, P 6 IEEE RAS INT C H, P518
Nakano Yukiko I., 2003, P 41 ANN M ASS COMP, V1, P553, DOI 10.3115/1075096.1075166
Neiberg D., 2012, INT WORKSH FEEDB BEH
Oertel C., 2012, P INT 2012 PORTL OR
Okumura Y, 2013, J EXP CHILD PSYCHOL, V116, P86, DOI 10.1016/j.jecp.2013.02.007
Pon-Barry H, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P74
Randolph J. J., 2005, JOENS U LEARN INSTR
Reidsma D, 2011, J MULTIMODAL USER IN, V4, P97, DOI 10.1007/s12193-011-0060-x
SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243
Schegloff Emanuel A., 1982, ANAL DISCOURSE TEXT, P71
Schlangen D, 2011, DIALOGUE DISCOURSE, V2, P83
SCHOBER MF, 1989, COGNITIVE PSYCHOL, V21, P211, DOI 10.1016/0010-0285(89)90008-X
Skantze G., 2013, P INT
Skantze G., 2009, P SIGDIAL 2009 LOND
Skantze G., 2012, P ICMI SANT MON CA
Skantze G., 2012, P INT WORKSH FEEDB B
Skantze G., 2013, 14 ANN M SPEC INT GR
Skantze G., 2009, P 12 C EUR CHAPT ASS
Skantze G, 2013, COMPUT SPEECH LANG, V27, P243, DOI 10.1016/j.csl.2012.05.004
Staudte M, 2011, COGNITION, V120, P268, DOI 10.1016/j.cognition.2011.05.005
Stocksmeier T., 2007, P INT 2007
Velichkovsky B. M., 1995, PRAGMAT COGN, V3, P199, DOI 10.1075/pc.3.2.02vel
Vertegaal R., 2001, P ACM C HUM FACT COM
Wallers A, 2006, LECT NOTES ARTIF INT, V4021, P183
Ward N., 2004, P INT C ISCA SPEC IN, P325
Ward N, 2003, INT J HUM-COMPUT ST, V59, P603, DOI 10.1016/S1071-5819(03)00085-5
Yngve Victor, 1970, 6 REG M CHIC LING SO, P567
NR 68
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2014
VL 65
BP 50
EP 66
DI 10.1016/j.specom.2014.05.005
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AP2KS
UT WOS:000341901700005
ER
PT J
AU Yuan, JH
Liberman, M
AF Yuan, Jiahong
Liberman, Mark
TI F-0 declination in English and Mandarin Broadcast News Speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Declination; F-0; Regression; Convex-hull
ID FUNDAMENTAL-FREQUENCY; PERCEIVED PROMINENCE; SENTENCE INTONATION;
ACCENTED SYLLABLES; MAXIMUM SPEED; PITCH; PERCEPTION; DOWNTREND;
PATTERNS; CONTOURS
AB This study investigates F-0 declination in broadcast news speech in English and Mandarin Chinese. The results demonstrate a strong relationship between utterance length and declination slope. Shorter utterances have steeper declination, even after excluding the initial rising and final lowering effects. Initial F-0 tends to be higher when the utterance is longer, whereas the low bound of final F-0 is independent of the utterance length. Both top line and baseline show declination. The top line and baseline have different patterns in Mandarin Chinese, whereas in English their patterns are similar. Mandarin Chinese has more and steeper declination than English, as well as wider pitch range and more F-0 fluctuations. Our results suggest that F-0 declination is linguistically controlled, not just a by-product of the physics and physiology of talking. (C) 2014 Elsevier B.V. All rights reserved.
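The regression analysis named in the keywords and abstract can be illustrated with a short Python sketch that fits a least-squares line to voiced F0 values over time. The contour below is synthetic, and the paper's additional steps (top line and baseline estimation, exclusion of initial rising and final lowering) are omitted, so this is only a conceptual sketch.

import numpy as np

def declination_slope(times_s, f0_hz):
    """Least-squares slope of voiced F0 values over time, in Hz per second."""
    t = np.asarray(times_s, dtype=float)
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0 > 0
    slope, intercept = np.polyfit(t[voiced], f0[voiced], deg=1)
    return slope, intercept

# Synthetic 2-second utterance with a downward F0 trend plus fluctuation.
t = np.linspace(0, 2.0, 200)
f0 = 220 - 15 * t + 5 * np.sin(2 * np.pi * 3 * t)
slope, intercept = declination_slope(t, f0)
print("declination slope: %.1f Hz/s (initial F0 approx. %.1f Hz)" % (slope, intercept))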
C1 [Yuan, Jiahong; Liberman, Mark] Univ Penn, Linguist Data Consortium, Philadelphia, PA 19104 USA.
RP Yuan, JH (reprint author), Univ Penn, Linguist Data Consortium, Philadelphia, PA 19104 USA.
EM jiahong.yuan@gmail.com
FU NSF [0964556]
FX An earlier version of this paper was presented at Interspeech 2010.
This work was supported in part by NSF Grant 0964556.
CR 't Hart J., 1979, IPO ANN PROGR REPORT, V14, P61
ATKINSON JE, 1978, J ACOUST SOC AM, V63, P211, DOI 10.1121/1.381716
BAER T, 1979, J ACOUST SOC AM, V65, P1271, DOI 10.1121/1.382795
Breckenridge J., 1977, DECLINATION PHONOLOG
COHEN A, 1982, PHONETICA, V39, P254
COLLIER R, 1975, J ACOUST SOC AM, V58, P249, DOI 10.1121/1.380654
Cooper W. E., 1981, FUNDAMENTAL FREQUENC
EADY SJ, 1982, LANG SPEECH, V25, P29
Fujisaki H., 1984, Journal of the Acoustical Society of Japan (E), V5
GARDING E, 1979, PHONETICA, V36, P207
Gelfer C.E., 1983, VOCAL FOLD PHYSL BIO, P113
Han S., 2011, PLOS ONE, V6, P1
Heuven van V.J., 2004, SPEECH LANGUAGES STU, P83
Hirose H., 2010, HDB PHONETIC SCI, P130, DOI DOI 10.1002/9781444317251.CH4
Hollien H., 1983, VOCAL FOLD PHYSL, P361
HOLMES VM, 1984, LANG SPEECH, V27, P115
Honda K, 1999, LANG SPEECH, V42, P401
Keating P, 2012, J ACOUST SOC AM, V132, P1050, DOI 10.1121/1.4730893
Laan GPM, 1997, SPEECH COMMUN, V22, P43, DOI 10.1016/S0167-6393(97)00012-5
Ladd D. R., 1984, PHONOLOGY YB, V1, P53, DOI DOI 10.1017/S0952675700000294
LADD DR, 1993, LANG SPEECH, V36, P435
Ladefoged P., 1967, 3 AREAS EXPT PHONETI
Ladefoged P., 2009, PRELIMINARY STUDIES
Liberman Mark, 1984, LANGUAGE SOUND STRUC, P157
Lieberman P, 1966, THESIS
LIEBERMAN P, 1985, J ACOUST SOC AM, V77, P649, DOI 10.1121/1.391883
Maeda S, 1976, THESIS MIT
MERMELSTEIN P, 1975, J ACOUST SOC AM, V58, P880, DOI 10.1121/1.380738
Ni JF, 2006, SPEECH COMMUN, V48, P989, DOI 10.1016/j.specom.2006.01.002
Nooteboom S.G., 1975, STRUCTURE PROCESS SP, P124
Nooteboom S.G, 1995, PRODUCING SPEECH CON, P3
O'Shaughnessy, 1976, THESIS MIT
OHALA JJ, 1990, NATO ADV SCI I D-BEH, V55, P23
OHALA Johni, 1978, TONE LINGUISTIC SURV, P5
PIERREHUMBERT J, 1979, J ACOUST SOC AM, V66, P363, DOI 10.1121/1.383670
Pierrehumbert J, 1980, THESIS MIT
Prieto P., 2006, P SPEECH PROS 2006, V2006, P803
Prieto P, 1996, J PHONETICS, V24, P445, DOI 10.1006/jpho.1996.0024
Rialland A., 2001, P S CROSS LING STUD, P301
Shih CL, 2000, TEXT SPEECH LANG TEC, V15, P243
Sorensen J.M., 1980, PERCEPTION PRODUCTIO, P399
Sternberg S., 1980, PERCEPTION PRODUCTIO, P507
Stevens K.N., 2000, ACOUSTIC PHONETICS, P55
STRIK H, 1995, J PHONETICS, V23, P203, DOI 10.1016/S0095-4470(95)80043-3
SUNDBERG J, 1979, J PHONETICS, V7, P71
Swerts M, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1501
Talkin D., 1996, GET F0 ONLINE DOCUME
TERKEN J, 1994, J ACOUST SOC AM, V95, P3662, DOI 10.1121/1.409936
TERKEN J, 1991, J ACOUST SOC AM, V89, P1768, DOI 10.1121/1.401019
THORSEN NG, 1980, J ACOUST SOC AM, V67, P1014, DOI 10.1121/1.384069
TITZE IR, 1988, J ACOUST SOC AM, V83, P1536, DOI 10.1121/1.395910
Tondering J., 2011, P ICPHS, VXVII, P2010
UMEDA N, 1982, J PHONETICS, V10, P279
Vaissiere J, 2005, BLACKW HBK LINGUIST, P236, DOI 10.1002/9780470757024.ch10
Vaissiere Jacqueline, 1983, PROSODY MODELS MEASU, P53
Whalen DH, 1997, PHONETICA, V54, P138
Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086
Xu Y, 2002, J ACOUST SOC AM, V111, P1399, DOI 10.1121/1.1445789
Yuan J., 2002, P SPEECH PROS 2002, V2002, P711
Yuan J., 2004, THESIS CORNELL U
NR 60
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2014
VL 65
BP 67
EP 74
DI 10.1016/j.specom.2014.06.001
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AP2KS
UT WOS:000341901700006
ER
PT J
AU Kates, JM
Arehart, KH
AF Kates, James M.
Arehart, Kathryn H.
TI The Hearing-Aid Speech Perception Index (HASPI)
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility; Intelligibility index; Auditory model; Hearing
loss; Hearing aids
ID AUDITORY-NERVE RESPONSES; SPECTRAL-SHAPE-FEATURES; INTELLIGIBILITY
PREDICTION; IMPAIRED LISTENERS; FINE-STRUCTURE; SINUSOIDAL
REPRESENTATION; ARTICULATION INDEX; TRANSMISSION INDEX; VOCODED SPEECH;
WORKING-MEMORY
AB This paper presents a new index for predicting speech intelligibility for normal-hearing and hearing-impaired listeners. The Hearing-Aid Speech Perception Index (HASPI) is based on a model of the auditory periphery that incorporates changes due to hearing loss. The index compares the envelope and temporal fine structure outputs of the auditory model for a reference signal to the outputs of the model for the signal under test. The auditory model for the reference signal is set for normal hearing, while the model for the test signal incorporates the peripheral hearing loss. The new index is compared to indices based on measuring the coherence between the reference and test signals and based on measuring the envelope correlation between the two signals. HASPI is found to give accurate intelligibility predictions for a wide range of signal degradations including speech degraded by noise and nonlinear distortion, speech processed using frequency compression, noisy speech processed through a noise-suppression algorithm, and speech where the high frequencies are replaced by the output of a noise vocoder. The coherence and envelope metrics used for comparison give poor performance for at least one of these test conditions. (C) 2014 Elsevier B.V. All rights reserved.
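As a drastically simplified illustration of the envelope comparison underlying the index described above, the Python sketch below correlates band envelopes of a reference and a test signal. It omits the auditory periphery model, the hearing-loss adjustment and the temporal fine structure term, so it is only a conceptual stand-in; the band edges and test signals are arbitrary assumptions.

import numpy as np
from scipy.signal import butter, filtfilt, hilbert

def band_envelope(x, fs, lo, hi):
    # Fourth-order band-pass filter followed by a Hilbert-based envelope.
    b, a = butter(4, [lo / (fs / 2.0), hi / (fs / 2.0)], btype="band")
    return np.abs(hilbert(filtfilt(b, a, x)))

def envelope_similarity(reference, test, fs, bands=((300, 800), (800, 2000), (2000, 5000))):
    """Mean across bands of the correlation between reference and test envelopes."""
    scores = []
    for lo, hi in bands:
        e_ref = band_envelope(reference, fs, lo, hi)
        e_tst = band_envelope(test, fs, lo, hi)
        scores.append(np.corrcoef(e_ref, e_tst)[0, 1])
    return float(np.mean(scores))

# Usage with synthetic signals standing in for reference and degraded speech.
fs = 16000
t = np.arange(fs) / fs
reference = np.sin(2 * np.pi * 1000 * t) * (1 + 0.5 * np.sin(2 * np.pi * 4 * t))
degraded = reference + 0.2 * np.random.randn(len(reference))
print(envelope_similarity(reference, degraded, fs))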
C1 [Kates, James M.; Arehart, Kathryn H.] Univ Colorado, Dept Speech Language & Hearing Sci, Boulder, CO 80309 USA.
RP Kates, JM (reprint author), Univ Colorado, Dept Speech Language & Hearing Sci, Boulder, CO 80309 USA.
EM James.Kates@colorado.edu; Kathryn.Arehart@colorado.edu
FU GN ReSound; NIH [R01 DC60014]
FX The authors thank Dr. Rosalinda Baca for providing the statistical
analysis used in this paper. Author JMK was supported by a grant from GN
ReSound. Author KHA was supported by a NIH Grant (R01 DC60014) and by
the grant from GN ReSound.
CR Aguilera Munoz C.M., 1999, ELECT CIRCUITS SYSTE, V2, P741
Anderson M.C., 2010, THESIS U COLORADO
ANSI, S351997 ANSI
Arehart K.H., 2013, P M AC POMA JUN 2 7, V19
Arehart KH, 2013, EAR HEARING, V34, P251, DOI 10.1097/AUD.0b013e318271aa5e
Bruce IC, 2003, J ACOUST SOC AM, V113, P369, DOI 10.1121/1.1519544
BYRNE D, 1986, EAR HEARING, V7, P257
CARTER GC, 1983, IEEE T AUDIO ELECTRO, V21, P337, DOI DOI 10.1109/TAU.1973.1162496
Chen F., 2013, P 35 ANN INT C IEEE, P4199
Chen F, 2011, EAR HEARING, V32, P331, DOI 10.1097/AUD.0b013e3181ff3515
Ching TYC, 1998, J ACOUST SOC AM, V103, P1128, DOI 10.1121/1.421224
Christiansen C, 2010, SPEECH COMMUN, V52, P678, DOI 10.1016/j.specom.2010.03.004
Cooke M., 1991, THESIS U SHEFFIELD
Cooper NP, 1997, J NEUROPHYSIOL, V78, P261
Cosentino S., 2012, P 11 INT C INF SCI S, P666
Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959
Dudley H, 1939, J ACOUST SOC AM, V11, P169, DOI 10.1121/1.1916020
Elhilali M, 2003, SPEECH COMMUN, V41, P331, DOI 10.1016/S0167-6393(02)00134-6
Fogerty D, 2011, J ACOUST SOC AM, V129, P977, DOI 10.1121/1.3531954
Glista D, 2009, INT J AUDIOL, V48, P632, DOI 10.1080/14992020902971349
Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628
Gomez AM, 2012, SPEECH COMMUN, V54, P503, DOI 10.1016/j.specom.2011.11.001
GORGA MP, 1981, J ACOUST SOC AM, V70, P1310, DOI 10.1121/1.387145
Greenberg S, 2004, IEICE T INF SYST, VE87D, P1059
HARRIS DM, 1979, J NEUROPHYSIOL, V42, P1083
Hicks ML, 1999, J ACOUST SOC AM, V105, P326, DOI 10.1121/1.424526
Hines A, 2010, SPEECH COMMUN, V52, P736, DOI 10.1016/j.specom.2010.04.006
HOHMANN V, 1995, J ACOUST SOC AM, V97, P1191, DOI 10.1121/1.413092
Holube I, 1996, J ACOUST SOC AM, V100, P1703, DOI 10.1121/1.417354
Hopkins K, 2011, J ACOUST SOC AM, V130, P334, DOI 10.1121/1.3585848
Hopkins K, 2008, J ACOUST SOC AM, V123, P1140, DOI 10.1121/1.2824018
HOUTGAST T, 1971, ACUSTICA, V25, P355
HUMES LE, 1986, J SPEECH HEAR RES, V29, P447
Imai S., 1983, P ICASSP, V8, P93
Immerseel LV, 2003, ACOUST RES LETT ONL, V4, P59
KATES JM, 1991, IEEE T SIGNAL PROCES, V39, P2573, DOI 10.1109/78.107409
KATES JM, 1992, J ACOUST SOC AM, V91, P2236, DOI 10.1121/1.403657
Kates JM, 2010, J AUDIO ENG SOC, V58, P363
Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575
Kates J.M., 2013, P M AC POMA JUN 2 7, V19
Kates J.M., 2008, DIGITAL HEARING AIDS, P1
Kiessling J., 1993, J SPEECH LANG PAT S1, V1, P39
Kjems U, 2009, J ACOUST SOC AM, V126, P1415, DOI 10.1121/1.3179673
Li N, 2008, J ACOUST SOC AM, V123, P1673, DOI 10.1121/1.2832617
Ludvigsen C, 1990, Acta Otolaryngol Suppl, V469, P190
MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910
McDermott HJ, 2011, PLOS ONE, V6, DOI 10.1371/journal.pone.0022358
MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861
Moore BCJ, 2004, HEARING RES, V188, P70, DOI 10.1016/S0378-5955(03)00347-2
Moore BCJ, 1999, J ACOUST SOC AM, V106, P2761, DOI 10.1121/1.428133
Ng EHN, 2013, INT J AUDIOL, V52, P433, DOI 10.3109/14992027.2013.776181
NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469
NOSSAIR ZB, 1991, J ACOUST SOC AM, V89, P2978, DOI 10.1121/1.400735
PATTERSON RD, 1995, J ACOUST SOC AM, V98, P1890, DOI 10.1121/1.414456
PAVLOVIC CV, 1986, J ACOUST SOC AM, V80, P50, DOI 10.1121/1.394082
Payton K., 2008, P AC 2008 PAR, P634
PAYTON KL, 1994, J ACOUST SOC AM, V95, P1581, DOI 10.1121/1.408545
Plack CJ, 2000, J ACOUST SOC AM, V107, P501, DOI 10.1121/1.428318
QUATIERI TF, 1986, IEEE T ACOUST SPEECH, V34, P1449, DOI 10.1109/TASSP.1986.1164985
Rosenthal S., 1969, IEEE T AUDIO ELECTRO, V17, P227
SACHS MB, 1974, J ACOUST SOC AM, V56, P1835, DOI 10.1121/1.1903521
SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303
SHAW JC, 1981, J MED ENG TECHNOL, V5, P279, DOI 10.3109/03091908109009362
Simpson A, 2005, INT J AUDIOL, V44, P281, DOI 10.1080/14992020500060636
Slaney M., 1993, 35 APPL COMP LIB
Smith ZM, 2002, NATURE, V416, P87, DOI 10.1038/416087a
Souza P, 2009, J ACOUST SOC AM, V126, P792, DOI 10.1121/1.3158835
Souza PE, 2013, J SPEECH LANG HEAR R, V56, P1349, DOI 10.1044/1092-4388(2013/12-0151)
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
STEIGER JH, 1980, PSYCHOL BULL, V87, P245, DOI 10.1037//0033-2909.87.2.245
Stone MA, 2008, J ACOUST SOC AM, V124, P2272, DOI 10.1121/1.2968678
Suzuki Y, 2004, J ACOUST SOC AM, V116, P918, DOI 10.1121/1.1763601
Taal CH, 2011, J ACOUST SOC AM, V130, P3013, DOI 10.1121/1.3641373
Taal CH, 2011, IEEE T AUDIO SPEECH, V19, P2125, DOI 10.1109/TASL.2011.2114881
Wang DL, 2008, J ACOUST SOC AM, V124, P2303, DOI 10.1121/1.2967865
WILLIAMS EJ, 1959, J ROY STAT SOC B, V21, P396
Wojtczak M, 2012, J ACOUST SOC AM, V131, P363, DOI 10.1121/1.3665995
YATES GK, 1990, HEARING RES, V45, P203, DOI 10.1016/0378-5955(90)90121-5
ZAHORIAN SA, 1993, J ACOUST SOC AM, V94, P1966, DOI 10.1121/1.407520
ZAHORIAN SA, 1981, J ACOUST SOC AM, V69, P832, DOI 10.1121/1.385539
Zhang XD, 2001, J ACOUST SOC AM, V109, P648, DOI 10.1121/1.1336503
Zilany MSA, 2006, J ACOUST SOC AM, V120, P1446, DOI 10.1121/1.2225512
NR 82
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2014
VL 65
BP 75
EP 93
DI 10.1016/j.specom.2014.06.002
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AP2KS
UT WOS:000341901700007
ER
PT J
AU Salah-Eddine, C
Merouane, B
AF Salah-Eddine, Cheraitia
Merouane, Bouzid
TI Robust coding of wideband speech immittance spectral frequencies
SO SPEECH COMMUNICATION
LA English
DT Article
DE SSVQ quantization; Source-channel coding; Robust speech coding; ISF
parameters; Wideband speech coder
ID CODED VECTOR QUANTIZATION; LSF-PARAMETERS; NOISY CHANNELS; LPC
PARAMETERS; QUANTIZERS
AB In this paper, we propose a reduced-complexity stochastic joint source-channel coding system developed for efficient and robust coding of wideband speech ISF (Immittance Spectral Frequency) parameters. Initially, the aim of this encoding system was to achieve transparent quantization of ISF parameters for ideal transmission over a noiseless channel. It was designed based on the switched split vector quantization (SSVQ) technique and called the "ISF-SSVQ coder". We then turned to improving the robustness of the ISF-SSVQ coder for transmission over a noisy channel. To implicitly protect our ISF coder, we developed a stochastic joint source-channel coding system based on a reduced-complexity version of the channel-optimized SSVQ technique. Simulation results show that our new encoding system, called the ISF-SCOSSVQ coder, provides good implicit protection of the ISF parameters. The ISF-SCOSSVQ coder was further used to encode the ISF parameters of the Adaptive Multi-Rate Wideband (AMR-WB, ITU-T G.722.2) speech coder operating over a noisy channel. We show that the proposed ISF-SCOSSVQ coder contributes significantly to improving AMR-WB performance by ensuring good coding robustness of its ISF parameters. (C) 2014 Elsevier B.V. All rights reserved.
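The split vector quantization step at the core of the coders above can be sketched in a few lines of Python: the ISF vector is divided into sub-vectors, each coded by a nearest-neighbour search in its own codebook. The split boundaries, codebook sizes and random codebooks below are placeholders; the paper's switched codebooks and channel-optimized training are not attempted here.

import numpy as np

rng = np.random.default_rng(0)
splits = [(0, 5), (5, 10), (10, 16)]                                  # sub-vector boundaries
codebooks = [rng.normal(size=(256, hi - lo)) for lo, hi in splits]    # 8 bits per sub-vector

def split_vq_encode(isf, splits, codebooks):
    """Return one codebook index per sub-vector (minimum squared error)."""
    indices = []
    for (lo, hi), cb in zip(splits, codebooks):
        dist = np.sum((cb - isf[lo:hi]) ** 2, axis=1)
        indices.append(int(np.argmin(dist)))
    return indices

def split_vq_decode(indices, splits, codebooks):
    out = np.empty(splits[-1][1])
    for (lo, hi), cb, idx in zip(splits, codebooks, indices):
        out[lo:hi] = cb[idx]
    return out

isf = rng.normal(size=16)                  # stand-in for a real 16-dimensional ISF vector
idx = split_vq_encode(isf, splits, codebooks)
isf_hat = split_vq_decode(idx, splits, codebooks)
print(idx, float(np.mean((isf - isf_hat) ** 2)))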
C1 [Salah-Eddine, Cheraitia; Merouane, Bouzid] USTHB, Elect Fac, Speech Commun & Signal Proc Lab, Algiers 16111, Algeria.
RP Salah-Eddine, C (reprint author), USTHB, Elect Fac, Speech Commun & Signal Proc Lab, POB 32, Algiers 16111, Algeria.
EM cher.salah@yahoo.fr; mbouzid@usthb.dz
CR Azami S.B.Z., 1996, P CNES WORKSH DAT CO
Bessette B, 2002, IEEE T SPEECH AUDI P, V10, P620, DOI 10.1109/TSA.2002.804299
BISTRITZ Y, 1989, IEEE T INFORM THEORY, V35, P675, DOI 10.1109/18.30994
Bistritz Y., 1993, P IEEE INT C AC SPEE, V2, P9
Biundo G., 2002, P 3 COST 276 WORKSH, P114
Bouzid M, 2005, SIGNAL PROCESS, V85, P1675, DOI 10.1016/j.sigpro.2005.03.009
Bouzid M., 2012, P 11 ED INT C INF SC, P1045
Bouzid M, 2007, ANN TELECOMMUN, V62, P426
Chen JH, 1996, INT CONF ACOUST SPEE, P275
Chiang D.M., 1997, IEEE T CIRCUITS SYST, V7, P604
Cordoba J.L.P., 2005, P INTERSPEECH 2005 L, P2745
CUPERMAN V, 1985, IEEE T COMMUN, V33, P685, DOI 10.1109/TCOM.1985.1096372
DARPA TIMIT, 1993, AC PHON CONT SPEECH
Duhamel P., 1997, P C GRETSI GREN FRAN, P699
FARVARDIN N, 1990, IEEE T INFORM THEORY, V36, P799, DOI 10.1109/18.53739
FARVARDIN N, 1991, IEEE T INFORM THEORY, V37, P155, DOI 10.1109/18.61130
Gersho A., 1992, VECTOR QUANTIZATION
Guibe G, 2001, EUR T TELECOMMUN, V12, P535, DOI 10.1002/ett.4460120609
HAMMING RW, 1950, AT&T TECH J, V29, P147
Hussain Y., 1992, P IEEE INT C AC SPEE, V2, P133
Itakura F., 1975, J ACOUST SOC AM, V57, P535
Katsavounidis I, 1994, IEEE SIGNAL PROC LET, V1, P144, DOI 10.1109/97.329844
Kleijn W. B., 1995, SPEECH CODING SYNTHE
Knagenhjelm P., 1993, THESIS CHALMERS U TE
Kovesi B., 1997, P GRETSI 97 GREN FRA, P1065
Krishnan V, 2004, IEEE T SPEECH AUDI P, V12, P1, DOI 10.1109/TSA.2003.819945
LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577
McLoughlin IV, 2008, SIGNAL PROCESS, V88, P448, DOI 10.1016/j.sigpro.2007.09.003
MILLER D, 1994, IEEE T COMMUN, V42, P347, DOI 10.1109/TCOMM.1994.577056
Moreira Jorge C., 2006, ESSENTIALS ERROR CON
Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363
POTTER LC, 1995, IEEE T COMMUN, V43, P804, DOI 10.1109/26.380112
Rabiner L.R., 1978, DIGITAL PROCESSING S
So S., 2004, P INT C SPOK LANG PR
So S, 2007, DIGIT SIGNAL PROCESS, V17, P138, DOI 10.1016/j.dsp.2005.08.005
Xiang Y., 2010, IEEE T INFORM THEORY, V56, P5769
ZEGER K, 1990, IEEE T COMMUN, V38, P2147, DOI 10.1109/26.64657
NR 37
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2014
VL 65
BP 94
EP 108
DI 10.1016/j.specom.2014.07.001
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AP2KS
UT WOS:000341901700008
ER
PT J
AU Patel, R
Kember, H
Natale, S
AF Patel, Rupal
Kember, Heather
Natale, Sara
TI Feasibility of augmenting text with visual prosodic cues to enhance oral
reading
SO SPEECH COMMUNICATION
LA English
DT Article
DE Oral reading; Prosody; Reading software; Fluency; Children
ID VOCAL FUNDAMENTAL-FREQUENCY; COMPREHENSION; CHILDREN; FLUENCY;
INTONATION; SPEECH; READERS; CRIES; LANGUAGE; STRESS
AB Reading fluency has traditionally focused on speed and accuracy yet recent reports suggest that expressive oral reading is an important component that has been largely overlooked. The current study assessed the impact of augmenting text with visual prosodic cues to improve expressive reading in beginning readers. Customized reading software was developed to present text augmented with prosodic cues to convey changes in pitch, duration and/or intensity. Prosodic modulation was derived from the recordings of a fluent adult model and rendered as a set of visual cues that could be presented in isolation or in combination. To establish baseline measures, eight children aged 7-8 first read a five-chapter story in standard text format. In the subsequent three sessions, participants were trained to use each augmented text cue with the guidance of an auditory model. They also had the opportunity to practice reading aloud in each cue condition. At the post-training session, participants re-recorded the baseline story with each chapter read in one of the different cue conditions (standard, pitch, duration, intensity and combination). Post-training and baseline recordings were acoustically analyzed to assess changes in reading expressivity. Despite large individual differences in how each participant implemented the prosodic cues, as a group, there were notable improvements in marking pitch accents and elongating word duration to convey linguistic contrasts. In fact, even after only three training sessions, participants appeared to have generalized implementation of pitch and word duration cues when reading standard text at post-training. In contrast, while participants manipulated pause duration when provided with explicit visual cues, they did not transfer these cues to standard text at post-training. These findings suggest that beginning readers could benefit from explicit visual prosodic cues and that even limited exposure may be sufficient to learn and generalize skills. Further discussion focuses on the implications of this work on struggling readers and second language learners. (C) 2014 Elsevier B.V. All rights reserved.
C1 [Patel, Rupal; Kember, Heather; Natale, Sara] Northeastern Univ, Dept Speech Language Pathol & Audiol, Boston, MA 02115 USA.
[Patel, Rupal] Northeastern Univ, Coll Comp & Informat Sci, Boston, MA 02115 USA.
RP Patel, R (reprint author), Northeastern Univ, Dept Speech Language Pathol & Audiol, 360 Huntington Ave,Room 204 FR, Boston, MA 02115 USA.
EM r.patel@neu.edu
FU National Science Foundation [HCC-0915527]
FX There are a number of individuals who have made significant
contributions to this work. We are indebted to Isabel Meirelles for her
collaboration on designing the visual renderings used here, to Sheelah
Sweeny for her guidance and work on developing the stories and
comprehension questions, and to William Furr for his dedication to
implementing a robust and user-friendly software system. We also thank
our participants and their families for their time and commitment to
this multi-week study. Last but not least, this material is based upon
work supported by the National Science Foundation under Grant No.
HCC-0915527.
CR Adams M. J., 1990, BEGINNING READ THINK
ALLINGTON RL, 1983, READ TEACH, V36, P556
ANDERSONHSIEH J, 1992, LANG LEARN, V42, P529, DOI 10.1111/j.1467-1770.1992.tb01043.x
Beaver J. M., 2006, DEV READING ASSESSME
Blevins W., 2001, BUILDING FLUENCY LES
Boersma P., 2014, PRAAT DOING PHONETIC
Bolinger D., 1989, INTONATION ITS USES
Carlson K, 2009, LANGUAGE LINGUISTICS, V3, P1188, DOI DOI 10.1111/J.1749-818X.2009.00150.X)
Christensen R., 2002, PLANE ANSWERS COMPLE
Cooper W. E., 1981, FUNDAMENTAL FREQUENC
COOPER WE, 1985, J ACOUST SOC AM, V77, P2142, DOI 10.1121/1.392372
Cromer W, 1970, J Educ Psychol, V61, P471, DOI 10.1037/h0030288
CRUTTENDEN A, 1985, J CHILD LANG, V12, P643
Crystal D, 1978, COMMUN COGNITION, P257
Crystal D, 1986, LANGUAGE ACQUISITION
Cutler A, 1997, LANG SPEECH, V40, P141
Daane M. C., 2005, 2006469 NCES US DEP
Dowhower S. L., 1991, THEOR PRACT, V30, P165, DOI DOI 10.1080/00405849109543497
EADY SJ, 1986, J ACOUST SOC AM, V80, P402, DOI 10.1121/1.394091
Eason SH, 2013, SCI STUD READ, V17, P199, DOI 10.1080/10888438.2011.652722
FRY DB, 1955, J ACOUST SOC AM, V27, P765, DOI 10.1121/1.1908022
Gilbert HR, 1996, INT J PEDIATR OTORHI, V34, P237, DOI 10.1016/0165-5876(95)01273-7
Grigos MI, 2007, J SPEECH LANG HEAR R, V50, P119, DOI 10.1044/1092-4388(2007/010)
Jenkins JR, 2003, J EDUC PSYCHOL, V95, P719, DOI 10.1037/0022-0663.95.4.719
Lehiste I., 1970, SUPRASEGMENTALS
LeVasseur VM, 2006, APPL PSYCHOLINGUIST, V27, P423, DOI 10.1017/S0142716406060346
Lind K, 2002, INT J PEDIATR OTORHI, V64, P97, DOI 10.1016/S0165-5876(02)00024-1
Local J., 1980, SOCIOLINGUISTIC VARI
Miller J, 2006, J EDUC PSYCHOL, V98, P839, DOI 10.1037/0022-0663.98.4.839
Morgan J.L., 1995, SIGNAL SYNTAX BOOTST
Neddenriep CE, 2011, PSYCHOL SCHOOLS, V48, P14, DOI 10.1002/pits.20542
OSHEA LJ, 1983, READ RES QUART, V18, P458, DOI 10.2307/747380
Patel R., 2011, ACM CHI C HUM FACT C, P3203
Patel R, 2006, SPEECH COMMUN, V48, P1308, DOI 10.1016/j.specom.2006.06.007
Patel R, 2011, SPEECH COMMUN, V53, P431, DOI 10.1016/j.specom.2010.11.007
Protopapas A, 1997, J ACOUST SOC AM, V102, P3723, DOI 10.1121/1.420403
Schreiber P. A., 1987, COMPREHENDING ORAL W, P243
ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572
SNOW D, 1994, J SPEECH HEAR RES, V37, P831
Snow D, 1998, J SPEECH LANG HEAR R, V41, P1158
Stathopoulos ET, 1997, J SPEECH LANG HEAR R, V40, P595
Therrien WJ, 2004, REM SPEC EDUC, V25, P252, DOI 10.1177/07419325040250040801
TINGLEY BM, 1975, CHILD DEV, V46, P186
Walker L.L., 2005, BEHAV, V14, P21
Wells B, 2004, J CHILD LANG, V31, P749, DOI 10.1017/S030500090400652X
Wermke K, 2002, MED ENG PHYS, V24, P501, DOI 10.1016/S1350-4533(02)00061-9
Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086
Xu Y., 2011, J SPEECH SCI, V1, P85
NR 48
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2014
VL 65
BP 109
EP 118
DI 10.1016/j.specom.2014.07.002
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AP2KS
UT WOS:000341901700009
ER
PT J
AU Schoenenberg, K
Raake, A
Egger, S
Schatz, R
AF Schoenenberg, Katrin
Raake, Alexander
Egger, Sebastian
Schatz, Raimund
TI On interaction behaviour in telephone conversations under transmission
delay
SO SPEECH COMMUNICATION
LA English
DT Article
DE VoIP; Delay; Interactivity; Conversation analysis; Interaction rhythm;
Conversational quality
ID ON-OFF PATTERNS; SPEECH QUALITY
AB This work analyses the interaction behaviour of two interlocutors communicating over telephone connections affected by echo-free delay, for conversation tasks differing in speed and structure. Based on a series of conversation tests, it is shown that transmission delay in a telephone circuit not only results in a longer time until information is exchanged between the interlocutors, but also alters various characteristics of the conversational course. It was observed that with increasing transmission delay, the realities perceived by the interlocutors increasingly diverge. As a measure of utterance pace, a new conversation surface-structure metric, the so-called utterance rhythm (URY), is introduced. Using surface-structure analysis of conversations from different conversation tests, it is shown that people's utterance rhythm stays rather constant in close-to-natural conversations, but is considerably affected in scenarios requiring fast interaction and a clear answering structure. At the same time, the quality of the connection is perceived less critically in close-to-natural tasks than in tasks requiring fast interaction, that is, interactive tasks leading to a delay-dependent utterance rhythm. Hence, the conclusion can be drawn that the degree of necessary adaptation of the utterance rhythm to a given delay condition co-determines the extent to which transmission delay impacts the perceived integral quality of a call. (C) 2014 Elsevier B.V. All rights reserved.
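A toy Python calculation can illustrate why the interlocutors' perceived realities diverge under delay, as described above: the answerer perceives only their own reaction time as a gap, while the original speaker perceives that reaction time plus the round-trip delay. This is not the paper's URY metric, just arithmetic on an assumed fixed reaction time.

def perceived_gaps(reaction_time_s, one_way_delay_s):
    # Gap B perceives between hearing A finish and starting the reply.
    gap_at_b = reaction_time_s
    # Gap A perceives between finishing the utterance and hearing B's reply.
    gap_at_a = reaction_time_s + 2 * one_way_delay_s
    return gap_at_a, gap_at_b

for delay in (0.0, 0.4, 0.8, 1.6):          # one-way transmission delays in seconds
    gap_a, gap_b = perceived_gaps(0.3, delay)
    print("delay %.1f s -> gap at A: %.1f s, gap at B: %.1f s" % (delay, gap_a, gap_b))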
C1 [Schoenenberg, Katrin; Raake, Alexander] Tech Univ Berlin, T Labs, D-10587 Berlin, Germany.
[Egger, Sebastian; Schatz, Raimund] Telecommun Res Ctr Vienna FTW, A-1220 Vienna, Austria.
RP Schoenenberg, K (reprint author), Tech Univ Berlin, T Labs, Ernst Reuter Pl 7, D-10587 Berlin, Germany.
EM katrin.schoenenberg@telekom.de; alexander.raake@telekom.de;
egger@ftw.at; raimund.schatz@ftw.at
CR BRADY PT, 1971, AT&T TECH J, V50, P115
BRADY PT, 1968, AT&T TECH J, V47, P73
BRADY PT, 1965, AT&T TECH J, V44, P1
Egger S., 2012, P INT C COMM ICC, P1320
Egger S, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P1321
Glass L, 2001, NATURE, V410, P277, DOI 10.1038/35065745
Gueguin M, 2008, EURASIP J ADV SIG PR, DOI 10.1155/2008/185248
Hammer F., 2004, P INT C SPOK LANG PR, P1741
Hoeldtke K., 2011, P INT C COMM ICC, P1
KITAWAKI N, 1991, IEEE J SEL AREA COMM, V9, P586, DOI 10.1109/49.81952
KRAUSS RM, 1967, J ACOUST SOC AM, V41, P286, DOI 10.1121/1.1910338
Lakaniemi A., 2001, INT C COMM ICC, V3, P748, DOI 10.1109/ICC.2001.937339
Luengo I., 2010, P LREC C VALL MALT M, P1539
Moller S, 2011, IEEE SIGNAL PROC MAG, V28, P18, DOI 10.1109/MSP.2011.942469
Moller S., 2000, ASSESSMENT PREDICTIO
Raake A., 2006, SPEECH QUALITY VOIP
Richards D.L., 1962, P INT C SAT COMM, P955
RIESZ RR, 1963, AT&T TECH J, V42, P2919
Sat B., 2007, P IEEE INT S MULT TA, P3
Schoenenberg K, 2014, INT J HUM-COMPUT ST, V72, P477, DOI 10.1016/j.ijhcs.2014.02.004
Thomsen G, 2000, IEEE SPECTRUM, V37, P52, DOI 10.1109/6.842135
Trevarthen C., 1993, PERCEIVED SELF ECOLO, P121
Wah Benjamin W, 2009, Journal of Multimedia, V4
NR 23
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP-OCT
PY 2014
VL 63-64
BP 1
EP 14
DI 10.1016/j.specom.2014.04.005
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AJ7NY
UT WOS:000337884600001
ER
PT J
AU De Armas, W
Mamun, KA
Chau, T
AF De Armas, Winston
Mamun, Khondaker A.
Chau, Tom
TI Vocal frequency estimation and voicing state prediction with surface EMG
pattern recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Fundamental frequency; EMG; Electrolarynx; Pitch modulation; Hands free;
Voicing state
ID ELECTROMYOGRAPHIC ACTIVITY; SPEECH; REHABILITATION; CLASSIFICATION;
LARYNGECTOMY; MACHINE; SIGNAL; STRAP; HAND
AB The majority of laryngectomees use the electrolarynx as their primary mode of verbal communication after total laryngectomy surgery. However, the archetypal electrolarynx suffers from a monotonous tone and the inconvenience of requiring manual control. This paper presents the potential of pattern recognition to support electrolarynx use by predicting fundamental frequency (F0) and voicing state (VS) from surface EMG of the infrahyoid and suprahyoid muscles, as well as from a respiratory trace. In this study, surface EMG signals from the infrahyoid and suprahyoid muscle groups and a respiratory trace were collected from 10 able-bodied, adult males (18-60 years old). Participants performed three kinds of vocal tasks: tones, legatos and phrases. Signal features were extracted from the EMG and respiratory trace, and a Support Vector Machine (SVM) classifier with radial basis function kernels was employed to predict F0 and voicing state. An average root mean squared error of 2.81 +/- 0.6 semitones was achieved for the estimation of vocal frequency in the range of 90-360 Hz. An average cross-validation (CV) accuracy of 78.05 +/- 6.3% was achieved for the prediction of voicing state from EMG and 65.24 +/- 7.8% from the respiratory trace. The proposed method has the advantage of being non-invasive compared with studies that relied on intramuscular electrodes, while still maintaining an accuracy above chance. Pattern classification of neck-muscle surface EMG has merit in the prediction of fundamental frequency and voicing state during vocalization, encouraging further study of automatic pitch modulation for electrolarynges and silent speech interfaces. (C) 2014 Elsevier B.V. All rights reserved.
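The pipeline described above reduces to windowing the EMG, extracting features per window, and training RBF-kernel support vector models for voicing and F0. A minimal sketch using scikit-learn follows; the time-domain features (mean absolute value, RMS, zero crossings) and the random placeholder data are illustrative assumptions, not the paper's exact feature set or corpus.

    # Sketch: RBF-kernel SVM prediction of voicing state and F0 from windowed
    # surface EMG. Feature choices, data shapes and values are assumptions.
    import numpy as np
    from sklearn.svm import SVC, SVR

    def emg_features(window):
        """Simple time-domain features for one EMG window."""
        mav = np.mean(np.abs(window))                      # mean absolute value
        rms = np.sqrt(np.mean(window ** 2))                # root mean square
        zc = np.sum(np.abs(np.diff(np.sign(window))) > 0)  # zero crossings
        return [mav, rms, zc]

    # Hypothetical training data: EMG windows, voicing labels, F0 targets (Hz).
    rng = np.random.default_rng(0)
    windows = rng.standard_normal((200, 256))
    voiced = rng.integers(0, 2, 200)
    f0_hz = 90 + 270 * rng.random(200)

    X = np.array([emg_features(w) for w in windows])
    voicing_clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, voiced)
    f0_reg = SVR(kernel="rbf", C=10.0).fit(X[voiced == 1], f0_hz[voiced == 1])

    test = np.array([emg_features(windows[0])])
    if voicing_clf.predict(test)[0] == 1:
        print("predicted F0: %.1f Hz" % f0_reg.predict(test)[0])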
C1 [De Armas, Winston; Mamun, Khondaker A.; Chau, Tom] Univ Toronto, Inst Biomat & Biomed Engn, Toronto, ON M5S 3G9, Canada.
[Mamun, Khondaker A.; Chau, Tom] Holland Bloorview Kids Rehabil Hosp, Bloorview Res Inst, Toronto, ON M4G 1R8, Canada.
RP Chau, T (reprint author), Univ Toronto, Inst Biomat & Biomed Engn, Rosebrugh Bldg,164 Coll St,Room 407, Toronto, ON M5S 3G9, Canada.
EM winston.dearmas@mail.utoronto.ca; k.mamun@utoronto.ca;
tom.chau@utoronto.ca
CR ATKINSON JE, 1978, J ACOUST SOC AM, V63, P211, DOI 10.1121/1.381716
Bishop C.M., 2006, PATTERN RECOGN, P325
Castellini C, 2009, BIOL CYBERN, V100, P35, DOI 10.1007/s00422-008-0278-1
Chen JJ, 2005, SAR QSAR ENVIRON RES, V16, P517, DOI 10.1080/10659360500468468
Daubechies I., 1992, 10 LECT WAVELETS
Goldstein EA, 2004, IEEE T BIO-MED ENG, V51, P325, DOI 10.1109/TBME.2003.820373
GRAY S, 1976, ARCH PHYS MED REHAB, V57, P140
HART J, 1981, Journal of the Acoustical Society of America, V69, P811
Hastie T, 2004, J MACH LEARN RES, V5, P1391
Hillman R E, 1998, Ann Otol Rhinol Laryngol Suppl, V172, P1
Honda K, 1999, LANG SPEECH, V42, P401
Hsu C.W., 2010, PRACTICAL GUIDE SUPP
Huang H.-P., 1999, 1999 IEEE INT C ROB, V3, P2392, DOI [10.1109/ROBOT.1999.770463, DOI 10.1109/ROBOT.1999.770463]
HUDGINS B, 1993, IEEE T BIO-MED ENG, V40, P82, DOI 10.1109/10.204774
Johner C., 2012, ADV AFFECTIVE PLEASU
Khokhar ZO, 2010, BIOMED ENG ONLINE, V9, DOI 10.1186/1475-925X-9-41
Kubert HL, 2009, J COMMUN DISORD, V42, P211, DOI 10.1016/j.jcomdis.2008.12.002
Lee KS, 2008, IEEE T BIO-MED ENG, V55, P930, DOI 10.1109/TBME.2008.915658
Ma K., 1999, IMPROVEMENT ELECTROL
Meltzner G., 2005, CONT CONSIDERATIONS
Meltzner GS, 2005, J SPEECH LANG HEAR R, V48, P766, DOI 10.1044/1092-4388(2005/053)
Mendenhall WM, 2002, J CLIN ONCOL, V20, P2500, DOI 10.1200/JCO.2002.07.047
Merletti R, 2001, P ANN INT IEEE EMBS, V23, P1119
Muller-putz G.R., 2008, INT J BIOELECTROMAGN
Nakamura K, 2011, INT CONF ACOUST SPEE, P573
Nolan F., 2003, P P 15 INT C PHON SC
Ohala J., 1969, AUT 1969 M AC SOC JA, P359
Park W, 2010, IEEE INT CONF ROBOT, P205
Reaz MBI, 2006, BIOL PROCED ONLINE, V8, P11, DOI 10.1251/bpo115
Roubeau B, 1997, ACTA OTO-LARYNGOL, V117, P459, DOI 10.3109/00016489709113421
Saikachi Y, 2009, J SPEECH LANG HEAR R, V52, P1360, DOI 10.1044/1092-4388(2009/08-0167)
Scott R.N, 1984, INTRO MYOELECTRIC PR
SHIPP T, 1979, J ACOUST SOC AM, V66, P678, DOI 10.1121/1.383694
Slim Y, 2010, IRBM, V31, P209, DOI 10.1016/j.irbm.2010.05.002
St-Amant Y., 1996, BIOENG C 1996 P 1996, P93
Stepp CE, 2009, IEEE T NEUR SYS REH, V17, P146, DOI 10.1109/TNSRE.2009.2017805
Sundberg J., 1973, STL QPSR, V14, P39
Talkin D., 1995, SPEECH CODING SYNTHE
Uemi N., 1994, 3 IEEE INT WORKSH RO, P198
Vapnik V, 1998, NONLINEAR MODELING, P55
von Tscharner V, 2011, J ELECTROMYOGR KINES, V21, P683, DOI 10.1016/j.jelekin.2011.03.004
Watson PJ, 2009, AM J SPEECH-LANG PAT, V18, P162, DOI 10.1044/1058-0360(2008/08-0025)
Yang DP, 2009, 2009 IEEE-RSJ INTERNATIONAL CONFERENCE ON INTELLIGENT ROBOTS AND SYSTEMS, P516, DOI 10.1109/IROS.2009.5354544
Zhao JD, 2005, IEEE INT CONF ROBOT, P4482
NR 44
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP-OCT
PY 2014
VL 63-64
BP 15
EP 26
DI 10.1016/j.specom.2014.04.004
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AJ7NY
UT WOS:000337884600002
ER
PT J
AU Xia, XJ
Ling, ZH
Jiang, Y
Dai, LR
AF Xia, Xian-Jun
Ling, Zhen-Hua
Jiang, Yuan
Dai, Li-Rong
TI HMM-based unit selection speech synthesis using log likelihood ratios
derived from perceptual data
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech synthesis; Unit selection; Hidden Markov model; Log likelihood
ratio; Perceptual data
ID HIDDEN MARKOV-MODELS; SYNTHESIS SYSTEM
AB This paper presents a hidden Markov model (HMM) based unit selection speech synthesis method using log likelihood ratios (LLR) derived from perceptual data. The perceptual data is collected by judging the naturalness of each synthetic prosodic word manually. Two acoustic models which represent the natural speech and the unnatural synthetic speech are trained respectively. At synthesis time, the LLRs are derived from the estimated acoustic models and integrated into the unit selection criterion as target cost functions. The experimental results show that our proposed method can synthesize more natural speech than the conventional method using likelihood functions. Due to the inadequacy of the acoustic model estimated for the unnatural synthetic speech, utilizing the LLR-based target cost functions to rescore the pre-selection results or the N-best sequences can achieve better performance than substituting them for the original target cost functions directly. (C) 2014 Elsevier B.V. All rights reserved.
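The LLR-based target cost contrasts the likelihood of a candidate unit's acoustics under a model trained on speech judged natural against a model trained on synthetic speech judged unnatural, and the paper uses it to rescore pre-selection results or N-best sequences. A minimal sketch follows, assuming diagonal-covariance GMMs stand in for the paper's acoustic models; the feature dimensionality and random data are illustrative.

    # Sketch: rescoring unit-selection candidates with a log likelihood ratio
    # (LLR) between a "natural" and an "unnatural" acoustic model. The GMMs
    # and data below are illustrative assumptions, not the paper's models.
    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(1)
    natural_feats = rng.normal(0.0, 1.0, (500, 13))      # e.g. cepstral vectors
    unnatural_feats = rng.normal(0.5, 1.5, (500, 13))

    gmm_nat = GaussianMixture(n_components=4, covariance_type="diag").fit(natural_feats)
    gmm_unnat = GaussianMixture(n_components=4, covariance_type="diag").fit(unnatural_feats)

    def llr_target_cost(candidate_feats):
        """Negative LLR: lower cost means the candidate looks more natural."""
        return -(gmm_nat.score(candidate_feats) - gmm_unnat.score(candidate_feats))

    # Rescore an N-best list of candidate units (each a frames-by-dims matrix).
    nbest = [rng.normal(0.0, 1.0, (20, 13)) for _ in range(5)]
    best_unit = min(nbest, key=llr_target_cost)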
C1 [Xia, Xian-Jun; Ling, Zhen-Hua; Dai, Li-Rong] Univ Sci & Technol China, Natl Engn Lab Speech & Language Informat Proc, Hefei 230027, Anhui, Peoples R China.
[Jiang, Yuan] iFLYTEK Res, Hefei 230088, Anhui, Peoples R China.
RP Ling, ZH (reprint author), Univ Sci & Technol China, 96 JinZhai Rd, Hefei, Anhui, Peoples R China.
EM xxjpjj@mail.ustc.edu.cn; zhling@ustc.edu.cn; yuanjiang@iflytek.com;
lrdai@ustc.edu.cn
FU Fundamental Research Funds for the Central Universities [WK2100060005];
National Nature Science Foundation of China [61273032]
FX This work is partially funded by the Fundamental Research Funds for the
Central Universities (Grant No. WK2100060005) and the National Nature
Science Foundation of China (Grant No. 61273032). We also gratefully
acknowledge the research division of iFLYTEK Co. Ltd., Hefei, China for
providing the speech database and collecting the perceptual data.
CR Hirai T., 2007, 6 ISCA SPEECH SYNTH, P81
Hirai T., 2004, 5 ISCA SPEECH SYNTH, P37
Hunt AJ, 1996, INT CONF ACOUST SPEE, P373, DOI 10.1109/ICASSP.1996.541110
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Ling Z. H., 2007, BLIZZ CHALL WORKSH
Ling Z.-H., 2008, BLIZZ CHALL WORKSH
Ling Z.-H., 2010, ISCSLP, P144
Ling ZH, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2034
Ling ZH, 2007, INT CONF ACOUST SPEE, P1245
Ling Z.-H., 2008, J PR NI, V21, P280
Ling ZH, 2008, INT CONF ACOUST SPEE, P3949
Lu H, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P162
Lu H, 2011, INT CONF ACOUST SPEE, P5352
Oura K., 2009, INTERSPEECH, P1759
Qian Y, 2013, IEEE T AUDIO SPEECH, V21, P280, DOI 10.1109/TASL.2012.2221460
Qian Y., 2002, SPEECH PROSODY, P591
Sagisaka Y., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), DOI 10.1109/ICASSP.1988.196677
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
Song Yang, 2013, Journal of Tsinghua University (Science and Technology), V53
Strom V, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P150
Syrdal A., 2005, INTERSPEECH 2005 LIS, P2813
Toda T., 2004, ICASSP, P657
Tokuda K, 2013, P IEEE, V101, P1234, DOI 10.1109/JPROC.2013.2251852
TOKUDA K, 1999, ACOUST SPEECH SIG PR, P229
Wang RH, 2009, CHINESE SCI BULL, V54, P1963, DOI 10.1007/s11434-009-0267-3
Wei S, 2009, SPEECH COMMUN, V51, P896, DOI 10.1016/j.specom.2009.03.004
Xia XJ, 2012, 2012 8TH INTERNATIONAL SYMPOSIUM ON CHINESE SPOKEN LANGUAGE PROCESSING, P160
Yoshida A, 2008, INT CONF ACOUST SPEE, P4617
Yoshimura T., 1999, EUROSPEECH, P2347
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
NR 30
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP-OCT
PY 2014
VL 63-64
BP 27
EP 37
DI 10.1016/j.specom.2014.04.002
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AJ7NY
UT WOS:000337884600003
ER
PT J
AU White, L
AF White, Laurence
TI Communicative function and prosodic form in speech timing
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech timing; Prosody; Rhythm; Prosodic structure; Speech perception
ID WORD SEGMENTATION; AMERICAN ENGLISH; FUNDAMENTAL-FREQUENCY; VOWEL
DURATION; TURN-TAKING; PERCEPTION; STRESS; PHRASE; BOUNDARY; DOMAIN
AB Listeners can use variation in speech segment duration to interpret the structure of spoken utterances, but there is no systematic description of how speakers manipulate timing for communicative ends. Here I propose a functional approach to prosodic speech timing, with particular reference to English. The disparate findings regarding the production of timing effects are evaluated against the functional requirement that communicative durational variation should be perceivable and interpretable by the listener. In the resulting framework, prosodic structure is held to influence speech timing directly only at the heads and edges of prosodic domains, through large, consistent lengthening effects. As each such effect has a characteristic locus within its domain, speech timing cues are potentially disambiguated for the listener, even in the absence of other information. Diffuse timing effects (in particular, quasi-rhythmical compensatory processes implying a relationship between structure and timing throughout the utterance) are found to be weak and inconsistently observed. Furthermore, it is argued that articulatory and perceptual constraints make shortening processes less useful as structural cues, and they must be regarded as peripheral, at best, in a parsimonious and functionally-informed account. (C) 2014 Elsevier B.V. All rights reserved.
C1 Univ Plymouth, Sch Psychol, Plymouth PL4 8AA, Devon, England.
RP White, L (reprint author), Univ Plymouth, Sch Psychol, Plymouth PL4 8AA, Devon, England.
EM laurence.white@plymouth.ac.uk
CR Abercrombie D, 1967, ELEMENTS GEN PHONETI
Albin DD, 1996, INFANT BEHAV DEV, V19, P401, DOI 10.1016/S0163-6383(96)90002-8
Arvaniti A., 2013, LAB PHONOLOGY, V4, P7, DOI 10.1515/lp-2013-0002
Arvaniti A, 2009, PHONETICA, V66, P46, DOI 10.1159/000208930
BEACH CM, 1991, J MEM LANG, V30, P644, DOI 10.1016/0749-596X(91)90030-N
Beckman M. E., 1990, PAPERS LABORATORY PH, P152
Beckman M. E., 1992, SPEECH PERCEPTION PR, P457
Beckman M.E., 1992, SPEECH PERCEPTION PR, P356
BERKOVITS R, 1994, LANG SPEECH, V37, P237
Bolinger Dwight L., 1965, FORMS ENGLISH ACCENT
Brown W, 1911, PSYCHOL REV, V18, P336, DOI 10.1037/h0074259
Bye Patrik, 1997, ESTONIAN PROSODY PAP, P36
Byrd D, 2003, J PHONETICS, V31, P149, DOI 10.1016/S0095-4470(02)00085-2
Byrd D, 2005, J ACOUST SOC AM, V118, P3860, DOI 10.1121/1.2130950
Byrd D, 2008, J INT PHON ASSOC, V38, P187, DOI 10.1017/S0025100308003460
Cambier-Langeveld T., 2000, THESIS U AMSTERDAM
CAMPBELL WN, 1991, J PHONETICS, V19, P37
Cho TH, 2001, J PHONETICS, V29, P155, DOI 10.1006/jpho.2001.0131
Cho TH, 2007, J PHONETICS, V35, P210, DOI 10.1016/j.wocn.2006.03.003
Chomsky N., 1968, SOUND PATTERN ENGLIS
Christophe A, 1996, LINGUIST REV, V13, P383, DOI 10.1515/tlir.1996.13.3-4.383
Classe A, 1939, RHYTHM ENGLISH PROSE
Couper-Kuhlen E., 1993, ENGLISH SPEECH RHYTH
Couper-Kuhlen E., 1986, INTRO ENGLISH PROSOD
Cumming R, 2011, J PHONETICS, V39, P375, DOI 10.1016/j.wocn.2011.01.004
Cummins F, 1999, J ACOUST SOC AM, V105, P476, DOI 10.1121/1.424576
Cummins F, 2009, J PHONETICS, V37, P16, DOI 10.1016/j.wocn.2008.08.003
Cummins F, 2012, FRONT PSYCHOL, V3, DOI 10.3389/fpsyg.2012.00364
Cummins F, 2011, FRONT HUM NEUROSCI, V5, DOI 10.3389/fnhum.2011.00170
Cummins F, 1998, J PHONETICS, V26, P145, DOI 10.1006/jpho.1998.0070
Cutler A., 1990, PAPERS LAB PHONOLOGY, P208
DAUER RM, 1983, J PHONETICS, V11, P51
Davis MH, 2002, J EXP PSYCHOL HUMAN, V28, P218, DOI 10.1037//0096-1523.28.1.218
Dilley LC, 2010, PSYCHOL SCI, V21, P1664, DOI 10.1177/0956797610384743
Dimitrova S, 2012, J PHONETICS, V40, P403, DOI 10.1016/j.wocn.2012.02.008
EDWARDS J, 1991, J ACOUST SOC AM, V89, P369, DOI 10.1121/1.400674
Fletcher J., 2010, HDB PHONETIC SCI, P523
Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332
FOURAKIS M, 1988, LANG SPEECH, V31, P283
FOWLER CA, 1981, PHONETICA, V38, P35
Fowler C.A., 1990, PAPERS LAB PHONOLOGY, P201
Frota S., 2007, SEGMENTAL PROSODIC I, P131
FRY DB, 1955, J ACOUST SOC AM, V27, P765, DOI 10.1121/1.1908022
Gaitenby J.H., 1965, SR2 HASK LAB
GOW DW, 1995, J EXP PSYCHOL HUMAN, V21, P344, DOI 10.1037//0096-1523.21.2.344
Gussenhoven C., 2002, P SPEECH PROS AIX EN
JONES MR, 1989, PSYCHOL REV, V96, P459, DOI 10.1037//0033-295X.96.3.459
Keating P., 2003, PAPERS LAB PHONOLOGY, VVI, P145
Kim E, 2013, ATTEN PERCEPT PSYCHO, V75, P1547, DOI 10.3758/s13414-013-0490-5
Kim H., 2005, INT 2005 P 9 EUR C S, P2365
KLATT D, 1974, J SPEECH HEAR RES, V17, P51
KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986
Klatt D.H., 1975, J PHONETICS, V3, P129
Klatt D.H., 1975, STRUCTURE PROCESS SP, P69
Knight S., 2013, THESIS U CAMBRIDGE
Kochanski G, 2005, J ACOUST SOC AM, V118, P1038, DOI 10.1121/1.1923349
Kohler K.J., 2003, P 15 INT C PHON SCI, P7
Ladd D. R., 1996, INTONATIONAL PHONOLO
Lee H, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1193
LEHISTE I, 1973, J ACOUST SOC AM, V54, P1228, DOI 10.1121/1.1914379
Lehiste I., 1975, PHONOLOGICA 1972, P115
Lehiste I., 1977, J PHONETICS, V5, P253
LEHISTE I, 1972, J ACOUST SOC AM, V51, P2018, DOI 10.1121/1.1913062
Lindblom B, 1996, J ACOUST SOC AM, V99, P1683, DOI 10.1121/1.414691
LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403
MARSLENWILSON WD, 1992, Q J EXP PSYCHOL-A, V45, P73
Mattys SL, 2000, PERCEPT PSYCHOPHYS, V62, P253, DOI 10.3758/BF03205547
Mattys SL, 2005, J EXP PSYCHOL GEN, V134, P477, DOI 10.1037/0096-3445.134.4.477
MORTON J, 1976, PSYCHOL REV, V83, P405, DOI 10.1037//0033-295X.83.5.405
NAKATANI LH, 1981, PHONETICA, V38, P84
Nespor M., 1986, PROSODIC PHONOLOGY
Nolan F., P ROY SOC B
Nolan F, 2009, PHONETICA, V66, P64, DOI 10.1159/000208931
O'Dell M.L., 1999, P 14 ICPHS SAN FRANC, P1075
O'Dell M.L., 2009, NORD PROS P 10 C HEL, P179
Oller D. K., 1973, J ACOUST SOC AM, V54, P1235
Ortega-Llebaria M., 2007, SEGMENTAL PROSODIC I, P155
FANT G, 1991, J PHONETICS, V19, P351
PISONI DB, 1976, J ACOUST SOC AM, V59, pS39, DOI 10.1121/1.2002669
POINTON GE, 1980, J PHONETICS, V8, P293
Port RF, 2003, J PHONETICS, V31, P599, DOI 10.1016/j.wocn.2003.08.001
PORT RF, 1981, J ACOUST SOC AM, V69, P262, DOI 10.1121/1.385347
PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770
Prieto P, 2012, SPEECH COMMUN, V54, P681, DOI 10.1016/j.specom.2011.12.001
QUENE H, 1992, J PHONETICS, V20, P331
RAKERD B, 1987, PHONETICA, V44, P147
RAPHAEL LJ, 1972, J ACOUST SOC AM, V51, P1296, DOI 10.1121/1.1912974
Reinisch E, 2011, J EXP PSYCHOL HUMAN, V37, P978, DOI 10.1037/a0021923
Remijsen B, 2008, J PHONETICS, V36, P318, DOI 10.1016/j.wocn.2007.09.002
Roach P., 1982, LINGUISTIC CONTROVER, P73
Saffran JR, 1996, J MEM LANG, V35, P606, DOI 10.1006/jmla.1996.0032
Salverda AP, 2003, COGNITION, V90, P51, DOI 10.1016/S0010-0277(03)00139-2
SCOTT DR, 1982, J ACOUST SOC AM, V71, P996, DOI 10.1121/1.387581
SCOTT DR, 1985, J PHONETICS, V13, P155
Scott SK, 2009, NAT REV NEUROSCI, V10, P295, DOI 10.1038/nrn2603
Selkirk E, 1996, SIGNAL TO SYNTAX: BOOTSTRAPPING FROM SPEECH TO GRAMMAR IN EARLY ACQUISITION, P187
ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572
Shen Y., 1962, STUDIES LINGUISTICS
Sluijter A.M.C, 1995, THESIS U LEIDEN
SNOW D, 1994, J SPEECH HEAR RES, V37, P831
Stivers T, 2009, P NATL ACAD SCI USA, V106, P10587, DOI 10.1073/pnas.0903616106
Suomi K, 2007, J PHONETICS, V35, P40, DOI 10.1016/j.wocn.2005.12.001
Suomi K, 2009, J PHONETICS, V37, P397, DOI 10.1016/j.wocn.2009.07.003
Suomi K, 2013, J PHONETICS, V41, P1, DOI 10.1016/j.wocn.2012.09.001
Tabain M, 2003, J ACOUST SOC AM, V113, P2834, DOI 10.1121/1.1564013
Tagliapietra L, 2010, J MEM LANG, V63, P306, DOI 10.1016/j.jml.2010.05.001
Turk AE, 1997, J PHONETICS, V25, P25, DOI 10.1006/jpho.1996.0032
Turk AE, 2007, J PHONETICS, V35, P445, DOI 10.1016/j.wocn.2006.12.001
Turk AE, 2000, J PHONETICS, V28, P397, DOI 10.1006/jpho.2000.0123
Turk AE, 1999, J PHONETICS, V27, P171, DOI 10.1006/jpho.1999.0093
van Santen J.P.H, 1997, COMPUTING PROSODY CO, P225
VANLANCKER D, 1988, J PHONETICS, V16, P339
VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5
White L., 2014, P SPEECH PROS DUBL
White L., 2007, CURRENT ISSUES LINGU, P237
White L, 2012, J MEM LANG, V66, P665, DOI 10.1016/j.jml.2011.12.010
White L, 2010, J PHONETICS, V38, P459, DOI 10.1016/j.wocn.2010.05.002
White L, 2007, J PHONETICS, V35, P501, DOI 10.1016/j.wocn.2007.02.003
White L., LANG LEARN IN PRESS
White L., 2009, PHONETICS PHONOLOGY, P137
White Laurence, 2002, THESIS U EDINBURGH
WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450
Wilson M, 2005, PSYCHON B REV, V12, P957, DOI 10.3758/BF03206432
Xu Y, 2010, J PHONETICS, V38, P329, DOI 10.1016/j.wocn.2010.04.003
Xu Y, 2005, J PHONETICS, V33, P159, DOI 10.1016/j.wocn.2004.11.001
Xu Y., 2006, P SPEECH PROS DRESD
NR 126
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP-OCT
PY 2014
VL 63-64
BP 38
EP 54
DI 10.1016/j.specom.2014.04.003
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AJ7NY
UT WOS:000337884600004
ER
PT J
AU Ning, LH
Shih, C
Loucks, TM
AF Ning, Li-Hsin
Shih, Chilin
Loucks, Torrey M.
TI Mandarin tone learning in L2 adults: A test of perceptual and
sensorimotor contributions
SO SPEECH COMMUNICATION
LA English
DT Article
DE Pitch shift; Internal model; Tone discrimination; L2 learning aptitude;
Language experience classification
ID PITCH FEEDBACK PERTURBATIONS; SHIFTED AUDITORY-FEEDBACK; VOICE F-0
RESPONSES; SPEECH PRODUCTION; LANGUAGE EXPERIENCE; VOCAL RESPONSES;
SUSTAINED VOCALIZATION; INTERNAL-MODELS; NATIVE-LANGUAGE; LEXICAL TONE
AB Adult second language learners (L2) of Mandarin have to acquire both new perceptual categories for discriminating and identifying lexical pitch variation and new sensorimotor skills to produce rapid tone changes. Perceptual learning was investigated using two perceptual tasks, musical tone discrimination and linguistic tone discrimination, which were administered to 10 naive adults (native speakers of English with no tonal language exposure), 10 L2 adults, and 9 Mandarin-speaking adults. Changes in sensorimotor skills were examined with a pitch-shift paradigm that probes rapid responses to unexpected pitch perturbations in auditory feedback. Discrimination of musical tones was correlated significantly with discrimination of Mandarin tones, with the clearest advantage (better performance) among Mandarin speakers and some advantage among L2 learners. Group differences were found in the fundamental frequency (F0) contours of responses to pitch-shift stimuli. The F0 contours of Mandarin speakers were least affected quantitatively by the amplitude and direction of pitch perturbations, suggesting more stable internal tone models, while the F0 contours of naive speakers and L2 learners were significantly altered by the perturbations. Discriminant analysis suggests that pitch-shift responses and tone discrimination predict class membership for the three groups. Discrimination of variations in tone appears to change early in L2 learning, possibly reflecting a process whereby new pitch representations are internalized. These findings indicate that tone discrimination and internal models for audio-vocal control are sensitive to language experience. (C) 2014 Elsevier B.V. All rights reserved.
C1 [Ning, Li-Hsin; Shih, Chilin] Univ Illinois, Dept Linguist, Urbana, IL 61801 USA.
[Shih, Chilin] Univ Illinois, Dept East Asian Languages & Cultures, Urbana, IL 61801 USA.
[Loucks, Torrey M.] Univ Illinois, Dept Speech & Hearing Sci, Champaign, IL 61820 USA.
RP Ning, LH (reprint author), Univ Illinois, Dept Linguist, 4080 Foreign Language Bldg,707 S Mathews Ave, Urbana, IL 61801 USA.
EM uiucning@gmail.com; cls@illinois.edu; tloucks@illinois.edu
CR Bauer JJ, 2003, J ACOUST SOC AM, V114, P1048, DOI 10.1121/1.1592161
Behroozmand R., 2011, BMC NEUROSCI, V12
Burnett TA, 2002, J ACOUST SOC AM, V112, P1058, DOI 10.1121/1.1487844
Burnett TA, 1998, J ACOUST SOC AM, V103, P3153, DOI 10.1121/1.423073
Callan DE, 2004, NEUROIMAGE, V22, P1182, DOI 10.1016/j.neuroimage.2004.03.006
Chandrasekaran B, 2007, BRAIN RES, V1128, P148, DOI 10.1016/j.brainres.2006.10.064
Chang EF, 2013, P NATL ACAD SCI USA, V110, P2653, DOI [10.1073/pnas.1216827110, 10.1073/pnas.1216827110/-/DCSupplemental]
Chen SH, 2007, J ACOUST SOC AM, V121, P1157, DOI 10.1121/1.2404624
Chen ZC, 2010, J ACOUST SOC AM, V128, pEL355, DOI 10.1121/1.3509124
Chen ZC, 2012, BRAIN LANG, V121, P25, DOI 10.1016/j.bandl.2012.02.004
Cooper A, 2012, J ACOUST SOC AM, V131, P4756, DOI 10.1121/1.4714355
Deutsch D., 2000, J ACOUST SOC AM, V108, P2591
Donath TM, 2002, J ACOUST SOC AM, V111, P357, DOI 10.1121/1.1424870
Eliades SJ, 2008, NATURE, V453, P1102, DOI 10.1038/nature06910
Francis AL, 2008, J PHONETICS, V36, P268, DOI 10.1016/j.wocn.2007.06.005
GANDOUR JT, 1978, LANG SPEECH, V21, P1
GANDOUR J, 1983, J PHONETICS, V11, P149
Guenther FH, 2006, J COMMUN DISORD, V39, P350, DOI 10.1016/j.jcomdis.2006.06.013
Guenther FH, 2006, BRAIN LANG, V96, P280, DOI 10.1016/j.bandl.2005.06.001
Guenther FH, 1998, PSYCHOL REV, V105, P611
GUENTHER FH, 1995, PSYCHOL REV, V102, P594
Hain TC, 2000, EXP BRAIN RES, V130, P133, DOI 10.1007/s002219900237
Hain TC, 2001, J ACOUST SOC AM, V109, P2146, DOI 10.1121/1.1366319
Halle PA, 2004, J PHONETICS, V32, P395, DOI 10.1016/S0095-4470(03)00016-0
Heinks-Maldonado TH, 2005, PSYCHOPHYSIOLOGY, V42, P180, DOI 10.1111/j.1469-8986.2005.00272.x
Henthorn T, 2007, AM J MED GENET A, V143A, P102, DOI 10.1002/ajmg.a.31596
Hickok G, 2011, NEURON, V69, P407, DOI 10.1016/j.neuron.2011.01.019
Houde JF, 1998, SCIENCE, V279, P1213, DOI 10.1126/science.279.5354.1213
Houde JF, 2002, J SPEECH LANG HEAR R, V45, P295, DOI 10.1044/1092-4388(2002/023)
Jones JA, 2002, J PHONETICS, V30, P303, DOI 10.1006/jpho.2001.0160
Jones JA, 2000, J ACOUST SOC AM, V108, P1246, DOI 10.1121/1.1288414
JORDAN MI, 1992, COGNITIVE SCI, V16, P307, DOI 10.1207/s15516709cog1603_1
Kawato M, 1999, CURR OPIN NEUROBIOL, V9, P718, DOI 10.1016/S0959-4388(99)00028-8
Kosling K, 2013, LANG SPEECH, V56, P529, DOI 10.1177/0023830913478914
Krishnan A, 2005, COGNITIVE BRAIN RES, V25, P161, DOI 10.1016/j.cogbrainres.2005.05.004
Lalazar H, 2008, CURR OPIN NEUROBIOL, V18, P573, DOI 10.1016/j.conb.2008.11.003
LANE H, 1971, J SPEECH HEAR RES, V14, P677
Larson CR, 2001, J ACOUST SOC AM, V110, P2845, DOI 10.1121/1.1417527
Larson CR, 2000, J ACOUST SOC AM, V107, P559, DOI 10.1121/1.428323
Liu H., 2009, J ACOUST SOC AM, V127
Liu HJ, 2007, J ACOUST SOC AM, V122, P3671, DOI 10.1121/1.2800254
Liu HJ, 2010, J ACOUST SOC AM, V128, P3739, DOI 10.1121/1.3500675
Liu HJ, 2009, J ACOUST SOC AM, V125, P2299, DOI 10.1121/1.3081523
Mandell J., 2009, ADAPTIVE PITCH TEST
Mattock K, 2006, INFANCY, V10, P241, DOI 10.1207/s15327078in1003_3
Mattock K, 2008, COGNITION, V106, P1367, DOI 10.1016/j.cognition.2007.07.002
Mitsuya T, 2013, J ACOUST SOC AM, V133, P2993, DOI 10.1121/1.4795786
Sakai S., 2004, P INT C AC SPEECH SI, V1, P277
Van Lancker D, 1973, J PHONETICS, V6, P19
Wang Y, 2001, BRAIN LANG, V78, P332, DOI 10.1006/brln.2001.2474
Wang Y, 2003, J COGNITIVE NEUROSCI, V15, P1019, DOI 10.1162/089892903770007407
Wong PCM, 2007, APPL PSYCHOLINGUIST, V28, P565, DOI 10.1017/S0142716407070312
Wood SN, 2011, J R STAT SOC B, V73, P3, DOI 10.1111/j.1467-9868.2010.00749.x
Wood SN, 2006, GEN ADDITIVE MODELS
Xu Y, 2004, J ACOUST SOC AM, V116, P1168, DOI 10.1121/1.1763952
Xu YS, 2006, J ACOUST SOC AM, V120, P1063, DOI 10.1121/1.2213572
NR 56
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP-OCT
PY 2014
VL 63-64
BP 55
EP 69
DI 10.1016/j.specom.2014.05.001
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AJ7NY
UT WOS:000337884600005
ER
PT J
AU Medabalimi, AJX
Seshadri, G
Bayya, Y
AF Medabalimi, Anand Joseph Xavier
Seshadri, Guruprasad
Bayya, Yegnanarayana
TI Extraction of formant bandwidths using properties of group delay
functions
SO SPEECH COMMUNICATION
LA English
DT Article
DE Formant frequency; Bandwidth; Group delay function; Short segments;
Closed phase; Open phase
ID PREDICTION PHASE SPECTRA; LINEAR-PREDICTION
AB Formant frequencies represent resonances of the vocal tract system during the production of speech signals. Bandwidths associated with the formant frequencies are important parameters in the analysis and synthesis of speech signals. In this paper, a method is proposed to extract the bandwidths associated with formant frequencies, by analysing short segments (2-3 ms) of the speech signal. The method is based on two important properties of the group delay function (GDF): (a) the GDF exhibits prominent peaks at resonant frequencies and (b) the influence of one resonant frequency on other resonances is negligible in the GDF. The accuracy of the method is demonstrated for synthetic signals generated using all-pole filters. The method is evaluated by extracting bandwidths of synthetic signals in closed phase and open phase regions within a pitch period. The accuracy of the proposed method is also compared with that of two other methods, one based on linear prediction analysis of speech signals, and another based on filterbank arrays for obtaining amplitude envelopes and instantaneous frequency signals. Results indicate that the method based on the properties of the GDF is suitable for accurate extraction of formant bandwidths, even from short segments of the speech signal within a pitch period. (C) 2014 Elsevier B.V. All rights reserved.
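The group delay function is the negative derivative of the Fourier phase and can be computed without phase unwrapping from the signal x[n] and the time-weighted signal n*x[n]. The sketch below shows that standard computation; the Hamming window, the 3 ms synthetic two-resonance segment and the FFT length are illustrative assumptions, not the paper's evaluation setup.

    # Sketch: group delay function (GDF) of a short segment, computed without
    # phase unwrapping as (Xr*Yr + Xi*Yi)/|X|^2 with Y = DFT(n * x[n]).
    # Peaks of the GDF occur near resonant (formant) frequencies; their widths
    # relate to the formant bandwidths the paper extracts.
    import numpy as np

    def group_delay(x, nfft=2048):
        n = np.arange(len(x))
        X = np.fft.rfft(x, nfft)
        Y = np.fft.rfft(n * x, nfft)
        denom = np.abs(X) ** 2
        denom = np.maximum(denom, 1e-12 * denom.max())   # guard spectral nulls
        return (X.real * Y.real + X.imag * Y.imag) / denom

    fs = 8000.0
    t = np.arange(int(0.003 * fs)) / fs                  # a 3 ms segment
    x = np.exp(-300 * t) * np.cos(2 * np.pi * 700 * t) \
        + 0.5 * np.exp(-400 * t) * np.cos(2 * np.pi * 1800 * t)
    tau = group_delay(x * np.hamming(len(x)))
    freqs = np.fft.rfftfreq(2048, d=1.0 / fs)
    print("largest group delay at %.0f Hz" % freqs[np.argmax(tau)])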
C1 [Medabalimi, Anand Joseph Xavier; Bayya, Yegnanarayana] Int Inst Informat Technol, Hyderabad 500032, Andhra Pradesh, India.
[Seshadri, Guruprasad] TATA Consultancy Serv, Innovat Labs, Bangalore 560066, Karnataka, India.
RP Medabalimi, AJX (reprint author), Int Inst Informat Technol, Hyderabad 500032, Andhra Pradesh, India.
EM anandjm@research.iiit.ac.in; guruprasad.seshadri@gmail.com;
yegna@iiit.ac.in
CR Cohen L., 1992, P IEEE 6 SP WORKSH S, P13
Deng L, 2006, INT CONF ACOUST SPEE, P369
Fant G., 1960, ACOUSTIC THEORY SPEE
GAUFFIN J, 1989, J SPEECH HEAR RES, V32, P556
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
Murty KSR, 2008, IEEE T AUDIO SPEECH, V16, P1602, DOI 10.1109/TASL.2008.2004526
Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE
Oppenheim A.V., 1999, DISCRETE TIME SIGNAL
Oppenheim A.V., 1975, DIGIT SIGNAL PROCESS
POTAMIANOS A, 1995, INT CONF ACOUST SPEE, P784, DOI 10.1109/ICASSP.1995.479811
REDDY NS, 1984, IEEE T ACOUST SPEECH, V32, P1136, DOI 10.1109/TASSP.1984.1164456
Tsiakoulis P, 2013, INT CONF ACOUST SPEE, P8032, DOI 10.1109/ICASSP.2013.6639229
Xavier M.A.J., 2006, P INTERSPEECH PITTSB, P1009
Yasojima O., 2006, P IEEE INT S SIGN PR, P589
YEGNANARAYANA B, 1978, J ACOUST SOC AM, V63, P1638, DOI 10.1121/1.381864
Zheng YL, 2003, PROCEEDINGS OF THE 2003 IEEE WORKSHOP ON STATISTICAL SIGNAL PROCESSING, P601
NR 16
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP-OCT
PY 2014
VL 63-64
BP 70
EP 83
DI 10.1016/j.specom.2014.04.006
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AJ7NY
UT WOS:000337884600006
ER
PT J
AU Uzun, E
Sencar, HT
AF Uzun, Erkam
Sencar, Husrev T.
TI A preliminary examination technique for audio evidence to distinguish
speech from non-speech using objective speech quality measures
SO SPEECH COMMUNICATION
LA English
DT Article
DE Preliminary analysis of audio evidence; Speech and non-speech
discrimination; Objective speech quality assessment; Audio encoding;
Audio effects; Surveillance
ID CLASSIFICATION; DISCRIMINATION; NETWORKS; LIBRARY; CODECS
AB Forensic practitioners are faced more and more with large volumes of data. Therefore, there is a growing need for computational techniques to aid in evidence collection and analysis. With this study, we introduce a technique for preliminary analysis of audio evidence to discriminate between speech and non-speech. The novelty of our approach lies in the use of well-established speech quality measures for characterizing speech signals. These measures rely on models of human perception of speech to provide objective and reliable measurements of changes in characteristics that influence speech quality. We utilize this capability to compute quality scores between an audio and its noise-suppressed version and to model variations of these scores in speech as compared to those in non-speech audio. Tests performed on 11 datasets with widely varying characteristics show that the technique has a high discrimination capability, achieving an identification accuracy of 96 to 99% in most test cases, and offers good generalization properties across different datasets. Results also reveal that the technique is robust against encoding at low bit-rates, application of audio effects and degradations due to varying degrees of background noise. Performance comparisons made with existing studies show that the proposed method improves the state-of-the-art in audio content identification. (C) 2014 Elsevier B.V. All rights reserved.
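The core of the technique is to denoise the audio, compute an objective quality score between the original and its noise-suppressed version, and feed the behaviour of that score to a speech/non-speech classifier. A minimal sketch follows, assuming spectral subtraction as the suppression step and segmental SNR as the quality measure; both are simplifying stand-ins for the established speech-quality measures the paper relies on, and the placeholder audio is random.

    # Sketch: quality-score feature for speech vs. non-speech screening.
    # Spectral subtraction and segmental SNR below are assumed stand-ins for
    # the paper's noise-suppression front end and objective quality measures.
    import numpy as np
    from scipy.signal import stft, istft

    def spectral_subtract(x, fs, noise_frames=10):
        f, t, Z = stft(x, fs=fs, nperseg=512)
        mag, phase = np.abs(Z), np.angle(Z)
        noise = mag[:, :noise_frames].mean(axis=1, keepdims=True)
        clean_mag = np.maximum(mag - noise, 0.05 * mag)      # spectral floor
        _, y = istft(clean_mag * np.exp(1j * phase), fs=fs, nperseg=512)
        return y[: len(x)]

    def segmental_snr(ref, proc, frame=512):
        n = min(len(ref), len(proc)) // frame * frame
        r = ref[:n].reshape(-1, frame)
        e = (ref[:n] - proc[:n]).reshape(-1, frame)
        snr = 10 * np.log10((r ** 2).sum(axis=1) / ((e ** 2).sum(axis=1) + 1e-12))
        return float(np.clip(snr, -10, 35).mean())

    fs = 16000
    audio = np.random.default_rng(2).standard_normal(fs)     # placeholder clip
    score = segmental_snr(audio, spectral_subtract(audio, fs))
    # Per-file statistics of such scores would then feed a classifier (e.g. an
    # SVM) trained on labelled speech and non-speech material.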
C1 [Sencar, Husrev T.] TOBB Univ Econ & Technol, Ankara, Turkey.
New York Univ Abu Dhabi, Abu Dhabi, U Arab Emirates.
RP Sencar, HT (reprint author), TOBB Univ Econ & Technol, Ankara, Turkey.
EM euzun@etu.edu.tr; htsencar@etu.edu.tr
CR Alexandre-Cortizo E., 2005, P EUROCON, V2, P1666
Alvarez L., 2011, P GAVTASC, P97
Barbedo J. G. A., 2006, Journal of the Audio Engineering Society, V54
BARNWELL TP, 1979, J ACOUST SOC AM, V66, P1658, DOI 10.1121/1.383664
Barthet M., 2011, P EXPL MUS CONT
Beigi H., 2011, AUDIO SOURCE CLASSIF
Campbell D, 2009, SIGNAL PROCESS, V89, P1489, DOI 10.1016/j.sigpro.2009.02.015
CAREY MJ, 1999, ACOUST SPEECH SIG PR, P149
Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199
Chow D, 2004, LECT NOTES ARTIF INT, V3157, P901
DIMOLITSAS S, 1989, P IEE P I, V136, P317
Dong Y., 2005, P 14 INT C ACM WORLD, P1072
DONOHO DL, 1994, CR ACAD SCI I-MATH, V319, P1317
Dubey Rajesh Kumar, 2013, International Journal of Speech Technology, V16, DOI 10.1007/s10772-012-9162-4
Emiya V, 2011, IEEE T AUDIO SPEECH, V19, P2046, DOI 10.1109/TASL.2011.2109381
Falk T.H., 2008, P IWAENC
Fenton S., 2011, P AES CONV
Fry Dennis B., 1979, PHYS SPEECH
Ghosal A, 2011, Proceedings of the Second International Conference on Emerging Applications of Information Technology (EAIT 2011), DOI 10.1109/EAIT.2011.19
Gonzalez R., 2012, P ICME, P556
Grancharov V, 2006, IEEE T AUDIO SPEECH, V14, P1948, DOI 10.1109/TASL.2006.883250
GRAY AH, 1976, IEEE T ACOUST SPEECH, V24, P380, DOI 10.1109/TASSP.1976.1162849
Haque MA, 2013, MULTIMED TOOLS APPL, V63, P63, DOI 10.1007/s11042-012-1023-2
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
Hu Y., 2006, P INTERSPEECH
Huijbregts M, 2011, SPEECH COMMUN, V53, P143, DOI 10.1016/j.specom.2010.08.008
Itakura F., 1968, P 6 INT C AC, V17, pC17
Jain A, 1997, IEEE T PATTERN ANAL, V19, P153, DOI 10.1109/34.574797
Kim HC, 2002, INT C PATT RECOG, P160
Kitawaki N., 1992, ADV SPEECH SIGNAL PR, P357
Klatt D., 1982, P IEEE INT C AC SPEE, V7, P1278
Kondo K., 2012, SUBJECTIVE QUALITY M
Kubichek R., 1991, 3 GLOBECOM, P1765
Lavner Y., 2009, EURASIP J AUDIO SPEE, V2
Lim C., 2011, ETRI J, V33
Loizou PC, 2011, STUD COMPUT INTELL, V346, P623
MacLean Ken, 2014, VOXFORGE OPEN SOURCE
Manning CD, 2008, INTRO INFORM RETRIEV, V1
McKinney M., 2003, P ISMIR, V3, P151
Meier Paul, 2014, IDEA INT DIALECTS EN
Munoz-Exposito JE, 2007, ENG APPL ARTIF INTEL, V20, P783, DOI 10.1016/j.engappai.2006.10.007
NHK Technology, 2009, OBJ PERC AUD QUAL ME
NOCERINO N, 1985, SPEECH COMMUN, V4, P317, DOI 10.1016/0167-6393(85)90057-3
Ozer H, 2003, P SOC PHOTO-OPT INS, V5020, P55, DOI 10.1117/12.477313
Pikrakis A, 2008, IEEE T MULTIMEDIA, V10, P846, DOI 10.1109/TMM.2008.922870
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
RIX AW, 2001, ACOUST SPEECH SIG PR, P749
Rohdenburg T., 2005, P 9 INT WORKSH AC EC, P169
Sadjadi S. O., 2007, P 6 INT C INF COMM S, P1
Scheirer E, 1997, INT CONF ACOUST SPEE, P1331, DOI 10.1109/ICASSP.1997.596192
Song JH, 2008, IEEE SIGNAL PROC LET, V15, P103, DOI 10.1109/LSP.2007.911184
Tzanetakis G., 2001, P AMTA
Verfaille V, 2006, IEEE T AUDIO SPEECH, V14, P1817, DOI 10.1109/TSA.2005.858531
Voran S, 1999, IEEE T SPEECH AUDI P, V7, P371, DOI 10.1109/89.771259
Wang J, 2008, INT CONF ACOUST SPEE, P2033
WANG SH, 1992, IEEE J SEL AREA COMM, V10, P819, DOI 10.1109/49.138987
Xie L, 2011, MULTIMEDIA SYST, V17, P101, DOI 10.1007/s00530-010-0205-x
Yang W., 1997, P SPEECH COD WORKSH, V10
Yang W., 1999, THESIS TEMPLE U
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7
NR 61
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN-JUL
PY 2014
VL 61-62
BP 1
EP 16
DI 10.1016/j.specom.2014.03.003
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AJ6EO
UT WOS:000337782700001
ER
PT J
AU Vieira, MN
Sansao, JPH
Yehia, HC
AF Vieira, Maurilio N.
Sansao, Joao Pedro H.
Yehia, Hani C.
TI Measurement of signal-to-noise ratio in dysphonic voices by image
processing of spectrograms
SO SPEECH COMMUNICATION
LA English
DT Article
DE Signal-to-noise ratio; Breathiness; Dysphonic voice; 2D speech
processing
ID BREATHY VOCAL QUALITY; PATHOLOGICAL VOICE; ADDITIVE NOISE; PERTURBATION;
SPEECH; HOARSENESS; FILTERS; ENHANCEMENT; COMPUTATION; PERCEPTION
AB The measurement of glottal noise was investigated in human and synthesized dysphonic voices by means of two-dimensional (2D) speech processing. A prime objective was the reduction of measurement sensitivities to fundamental frequency (f0) tracking errors and phonatory aperiodicities. An available fingerprint image enhancement algorithm was used for signal-to-noise measurement in narrow-band spectrographic images. This spectrographic signal-to-noise ratio estimation method (S2NR) creates binary masks, mainly based on the orientation field of the partials, to separate energy in regions with strong harmonics from energy in noisy areas. Synthesized vowels with additive noise were used to calibrate the algorithm, validate the calibration, and systematically evaluate its dependence on f0, shimmer (cycle-to-cycle amplitude perturbation), and jitter (cycle-to-cycle f0 perturbation). In synthesized voices with known signal-to-noise ratios in the 5-40 dB range, S2NR estimates were, on average, accurate within +/- 3.2 dB and robust to variations in f0 (120 Hz or 220 Hz), jitter (0-3%), and shimmer (0-30%). In human /a/ produced by dysphonic speakers, S2NR values and perceptual ratings of breathiness revealed a non-linear but monotonic decay of S2NR with increased breathiness. Comparison between S2NR and related acoustic measurements indicated similar behaviors regarding the relationship with breathiness and immunity to shimmer, but the other methods were markedly influenced by jitter. Overall, the S2NR method did not rely on accurate f0 estimation, was robust to vocal perturbations and largely independent of vowel type, having also potential application in running speech. (C) 2014 Elsevier B.V. All rights reserved.
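The measure rests on splitting spectrogram energy into harmonic and noise regions with binary masks and reporting their ratio in dB. A strongly simplified sketch follows: the "harmonic" mask here is a crude energy threshold, whereas the paper derives its masks from an orientation-field fingerprint-enhancement algorithm; the threshold value and synthetic test signal are illustrative assumptions.

    # Sketch: spectrogram-based SNR estimate from a binary harmonic/noise mask.
    # The simple energy threshold below replaces the paper's orientation-field
    # (fingerprint enhancement) masks and is an illustrative assumption.
    import numpy as np
    from scipy.signal import spectrogram

    fs = 16000
    t = np.arange(fs) / fs
    harmonics = sum(np.cos(2 * np.pi * 150 * k * t) / k for k in range(1, 11))
    noisy_vowel = harmonics + 0.2 * np.random.default_rng(3).standard_normal(fs)

    f, frames, S = spectrogram(noisy_vowel, fs=fs, nperseg=1024, noverlap=768)
    harmonic_mask = S > 10 * np.median(S)          # crude stand-in for the mask
    snr_db = 10 * np.log10(S[harmonic_mask].sum() / S[~harmonic_mask].sum())
    print("spectrographic SNR estimate: %.1f dB" % snr_db)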
C1 [Vieira, Maurilio N.; Yehia, Hani C.] Univ Fed Minas Gerais, Dept Elect Engn, BR-31270010 Belo Horizonte, MG, Brazil.
[Sansao, Joao Pedro H.] Univ Fed Minas Gerais, Programa Posgrad Engn Eletr, BR-31270901 Belo Horizonte, MG, Brazil.
[Sansao, Joao Pedro H.] Univ Fed Sao Joao del Rei, Dept Engn Telecomunicacoes & Mecatron, BR-36420000 Ouro Branco, MG, Brazil.
RP Vieira, MN (reprint author), Univ Fed Minas Gerais, Dept Elect Engn, Ave Antonio Carlos 6627, BR-31270010 Belo Horizonte, MG, Brazil.
EM maurilionunesv@cpdee.ufmg.br; jsansao@gmail.com; hani@cpdee.ufmg.br
FU Conselho Nacional de Desenvolvimento Científico e Tecnológico (CNPq);
Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (Capes);
Fundação de Amparo à Pesquisa do Estado de Minas Gerais (Fapemig)
FX This research was supported by Conselho Nacional de Desenvolvimento
Científico e Tecnológico (CNPq), Coordenação de Aperfeiçoamento de
Pessoal de Nível Superior (Capes), and Fundação de Amparo à Pesquisa do
Estado de Minas Gerais (Fapemig). The human dysphonic voices used in the
study were perceptually rated through the meticulous work of Ana Paula da
Penha and Mariana de Sousa Dutra Borges.
CR Bazen AM, 2002, IEEE T PATTERN ANAL, V24, P905, DOI 10.1109/TPAMI.2002.1017618
Bielamowicz S, 1996, J SPEECH HEAR RES, V39, P126
Chen G, 2013, J ACOUST SOC AM, V133, P1656, DOI 10.1121/1.4789931
COX NB, 1989, J ACOUST SOC AM, V85, P2165, DOI 10.1121/1.397865
DAUGMAN JG, 1985, J OPT SOC AM A, V2, P1160, DOI 10.1364/JOSAA.2.001160
DEJONCKERE PH, 1994, CLIN LINGUIST PHONET, V8, P161, DOI 10.3109/02699209408985304
DEKROM G, 1993, J SPEECH HEAR RES, V36, P254
Ding HJ, 2009, SPEECH COMMUN, V51, P259, DOI 10.1016/j.specom.2008.09.003
ESKENAZI L, 1990, J SPEECH HEAR RES, V33, P298
Ezzat T., 2007, 8 ANN C INIT SPEECH, P506
FANT G, 1979, STL QPSR, V1, P85
FEIJOO S, 1990, J SPEECH HEAR RES, V33, P324
Heman-Ackah YD, 2002, J VOICE, V16, P20, DOI 10.1016/S0892-1997(02)00067-X
HILLENBRAND J, 1994, J SPEECH HEAR RES, V37, P769
HILLENBRAND J, 1987, J SPEECH HEAR RES, V30, P448
Hillenbrand J, 1996, J SPEECH HEAR RES, V39, P311
HIRANO M, 1988, ACTA OTO-LARYNGOL, V105, P432, DOI 10.3109/00016488809119497
HIRAOKA N, 1984, J ACOUST SOC AM, V76, P1648, DOI 10.1121/1.391611
Hong L, 1998, IEEE T PATTERN ANAL, V20, P777
HORII Y, 1979, J SPEECH HEAR RES, V22, P5
Horn T, 1998, ACUSTICA, V84, P175
JAIN AK, 1991, PATTERN RECOGN, V24, P1167, DOI 10.1016/0031-3203(91)90143-S
Jellyman KA, 2009, LECT NOTES COMPUT SC, V5371, P63
KANE M, 1985, FOLIA PHONIATR, V37, P53
Kasuya H., 1986, ICASSP 86 Proceedings. IEEE-IECEJ-ASJ International Conference on Acoustics, Speech and Signal Processing (Cat. No.86CH2243-4)
KASUYA H, 1986, J ACOUST SOC AM, V80, P1329, DOI 10.1121/1.394384
KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940
KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894
Kovesi P.D., 2005, MATLAB OCTAVE FUNCTI
LAVER J, 1992, J VOICE, V6, P115, DOI 10.1016/S0892-1997(05)80125-0
LIEBERMAN P, 1963, J ACOUST SOC AM, V35, P344, DOI 10.1121/1.1918465
Markel JD, 1976, LINEAR PREDICTION SP
MARTIN D, 1995, J SPEECH HEAR RES, V38, P765
Maryn Y, 2009, J ACOUST SOC AM, V126, P2619, DOI 10.1121/1.3224706
Murphy PJ, 2000, J ACOUST SOC AM, V107, P978, DOI 10.1121/1.428272
Murphy PJ, 1999, J ACOUST SOC AM, V105, P2866, DOI 10.1121/1.426901
Murphy PJ, 2007, J ACOUST SOC AM, V121, P1679, DOI 10.1121/1.2427123
MUTA H, 1988, J ACOUST SOC AM, V84, P1292, DOI 10.1121/1.396628
Nixon M. S., 2002, FEATURE EXTRACTION I
Patel S, 2012, J SPEECH LANG HEAR R, V55, P639, DOI 10.1044/1092-4388(2011/10-0337)
PROSEK RA, 1987, J COMMUN DISORD, V20, P105, DOI 10.1016/0021-9924(87)90002-5
QI YG, 1992, J ACOUST SOC AM, V92, P2569, DOI 10.1121/1.404429
QI YY, 1995, J ACOUST SOC AM, V97, P2525, DOI 10.1121/1.411972
Schoentgen J., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90020-6
Shue Y.-L., 2010, VOICESAUCE PROGRAM V
Soon Y.I., 2003, IEEE T SPEECH AUDIO, V111, P717
TITZE IR, 1993, J SPEECH HEAR RES, V36, P1177
Vieira MN, 2002, J ACOUST SOC AM, V111, P1045, DOI 10.1121/1.1430686
Vieira M.N., 1997, THESIS U EDINBURGH U
Wolfe VI, 2000, J SPEECH LANG HEAR R, V43, P697
WOLFE VI, 1987, J SPEECH HEAR RES, V30, P230
YANAGIHA.N, 1967, J SPEECH HEAR RES, V10, P531
YUMOTO E, 1982, J ACOUST SOC AM, V71, P1544, DOI 10.1121/1.387808
Zhang Y, 2005, J ACOUST SOC AM, V118, P2551, DOI 10.1121/1.2005907
NR 54
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN-JUL
PY 2014
VL 61-62
BP 17
EP 32
DI 10.1016/j.specom.2014.04.001
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AJ6EO
UT WOS:000337782700002
ER
PT J
AU Zhang, WQ
Liu, WW
Li, ZY
Shi, YZ
Liu, J
AF Zhang, Wei-Qiang
Liu, Wei-Wei
Li, Zhi-Yi
Shi, Yong-Zhe
Liu, Jia
TI Spoken language recognition based on gap-weighted subsequence kernels
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spoken language recognition; Gap-weighted subsequence kernel (GWSK);
n-Gram; Phone recognizer (PR); Vector space model (VSM)
ID STRING KERNELS; FRONT-END; IDENTIFICATION; CLASSIFICATION
AB Phone recognizers followed by vector space models (PR-VSM) is a state-of-the-art phonotactic method for spoken language recognition. This method resorts to a bag-of-n-grams, with each dimension of the supervector based on the counts of n-gram tokens. The n-gram cannot capture long-context co-occurrence relations due to the restriction on gram order. Moreover, it is vulnerable to the errors induced by the front-end phone recognizer. In this paper, we introduce a gap-weighted subsequence kernel (GWSK) method to overcome the drawbacks of n-grams. GWSK counts the co-occurrence of the tokens in a non-contiguous way and thus is not only error-tolerant but also capable of revealing long-context relations. Beyond this, we further propose a truncated GWSK with constraints on context length in order to remove the interference from remote tokens and lower the computational cost, and extend the idea to lattices to take advantage of multiple hypotheses from the phone recognizer. In addition, we investigate the optimal parameter setting and computational complexity of the proposed methods. Experiments on the NIST 2009 LRE evaluation corpus with several configurations show that the proposed GWSK is consistently more effective than the PR-VSM approach. (C) 2014 Elsevier B.V. All rights reserved.
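The gap-weighted subsequence kernel counts length-p subsequences shared by two token strings, discounting each shared occurrence by lambda raised to the total span it covers in both strings, so non-contiguous matches contribute while long gaps are penalised. Below is a minimal sketch of the standard dynamic-programming formulation (Lodhi et al., 2002) applied to phone-token strings; the normalisation, lambda value and toy phone sequences are illustrative choices rather than the paper's configuration.

    # Sketch: gap-weighted subsequence kernel (GWSK) over phone-token strings
    # via the standard dynamic programme. Parameter values and the toy decoder
    # outputs below are illustrative assumptions.
    import math

    def gwsk(s, t, p, lam):
        """Kernel counting shared subsequences of length p, gap-penalised by lam."""
        n, m = len(s), len(t)
        kprime = [[1.0] * (m + 1) for _ in range(n + 1)]     # level 0
        for _ in range(p - 1):                               # levels 1 .. p-1
            kpp = [[0.0] * (m + 1) for _ in range(n + 1)]
            knew = [[0.0] * (m + 1) for _ in range(n + 1)]
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    kpp[i][j] = lam * kpp[i][j - 1]
                    if s[i - 1] == t[j - 1]:
                        kpp[i][j] += lam * lam * kprime[i - 1][j - 1]
                    knew[i][j] = lam * knew[i - 1][j] + kpp[i][j]
            kprime = knew
        k = 0.0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                if s[i - 1] == t[j - 1]:
                    k += lam * lam * kprime[i - 1][j - 1]
        return k

    def gwsk_normalised(s, t, p=3, lam=0.8):
        return gwsk(s, t, p, lam) / math.sqrt(gwsk(s, s, p, lam) * gwsk(t, t, p, lam))

    # Hypothetical phone-decoder outputs for two utterances.
    a = "sil b a n a n a sil".split()
    b = "sil b a d a m a sil".split()
    print(gwsk_normalised(a, b))

In the paper such kernel evaluations over the token inventory define the utterance representation; the truncated variant additionally constrains the context length so that remote tokens do not contribute.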
C1 [Zhang, Wei-Qiang; Liu, Wei-Wei; Li, Zhi-Yi; Shi, Yong-Zhe; Liu, Jia] Tsinghua Univ, Dept Elect Engn, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China.
RP Zhang, WQ (reprint author), Tsinghua Univ, Dept Elect Engn, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China.
EM wqzhang@tsinghua.edu.cn
RI Zhang, Wei-Qiang/A-7088-2008
OI Zhang, Wei-Qiang/0000-0003-3841-1959
FU National Natural Science Foundation of China [61370034, 61273268,
61005019]
FX This work was supported by the National Natural Science Foundation of
China under Grant Nos. 61370034, 61273268 and 61005019.
CR Campbell W., 2006, P OD SAN JUAN
Campbell W. M., 2004, P ICASSP, P1
Campbell WM, 2007, INT CONF ACOUST SPEE, P989
Dehak N, 2011, IEEE T AUDIO SPEECH, V19, P788, DOI 10.1109/TASL.2010.2064307
Fan RE, 2008, J MACH LEARN RES, V9, P1871
Gauvain J. -L., 2004, P ICSLP, P25
Hazen T. J., 1993, P EUR 93 SEPT, V2, P1303
Hofmann T, 2008, ANN STAT, V36, P1171, DOI 10.1214/009053607000000677
Kim S., 2010, BMC BIOINFORMATICS, V11
Kruengkrai C., 2005, P 5 INT S COMM INF T, P896
Lerma M. A., 2008, SEQUENCES STRINGS
Li HZ, 2007, IEEE T AUDIO SPEECH, V15, P271, DOI 10.1109/TASL.2006.876860
Lodhi H, 2002, J MACH LEARN RES, V2, P419, DOI 10.1162/153244302760200687
Ma B, 2007, IEEE T AUDIO SPEECH, V15, P2053, DOI 10.1109/TASL.2007.902861
Matejka P, 2005, P INT 2005 LISB PORT, P2237
Muthusamy YK, 1994, IEEE SIGNAL PROC MAG, V11, P33, DOI 10.1109/79.317925
Navratil J, 1997, INT CONF ACOUST SPEE, P1115, DOI 10.1109/ICASSP.1997.596137
Navratil J, 2001, IEEE T SPEECH AUDI P, V9, P678, DOI 10.1109/89.943345
NIST, 2009, 2009 NIST LANG REC E
Penagarikano M, 2011, IEEE T AUDIO SPEECH, V19, P2348, DOI 10.1109/TASL.2011.2134088
Rousu J, 2005, J MACH LEARN RES, V6, P1323
Shawe-Taylor J., 2004, KERNEL METHODS PATTE
Stolcke A., 2002, P INT C SPOK LANG PR, V2, P901
Tong R, 2009, IEEE T AUDIO SPEECH, V17, P1335, DOI 10.1109/TASL.2009.2016731
Torres-Carrasquillo P., 2008, P INT 08, P719
Torres-Carrasquillo PA, 2002, THESIS MICHIGAN STAT
Vapnik V., 1995, NATURE STAT LEARNING
Yin CH, 2008, NEUROCOMPUTING, V71, P944, DOI 10.1016/j.neucom.2007.02.005
Zhang W., 2006, P ICSP GUIL, V1
Zhang WQ, 2010, CHINESE J ELECTRON, V19, P124
ZISSMAN MA, 1994, INT CONF ACOUST SPEE, P305
Zissman MA, 2001, SPEECH COMMUN, V35, P115, DOI 10.1016/S0167-6393(00)00099-6
NR 32
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2014
VL 60
BP 1
EP 12
DI 10.1016/j.specom.2014.01.005
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AH0JI
UT WOS:000335804800001
ER
PT J
AU Xia, BY
Bao, CC
AF Xia, Bingyin
Bao, Changchun
TI Wiener filtering based speech enhancement with Weighted Denoising
Auto-encoder and noise classification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Weighted Denoising Auto-encoder; SNR estimation;
Wiener filter; Noise classification; Gaussian mixture model
ID RECOGNITION
AB A novel speech enhancement method based on a Weighted Denoising Auto-encoder (WDA) and noise classification is proposed in this paper. A weighted reconstruction loss function is introduced into the conventional Denoising Auto-encoder (DA), and the relationship between the power spectra of clean speech and the noisy observation is described by the WDA model. First, the sub-band power spectrum of clean speech is estimated by the WDA model from the noisy observation. Then, the a priori SNR is estimated by the a posteriori SNR controlled recursive averaging (PCRA) approach. Finally, the clean speech is obtained by a Wiener filter in the frequency domain. In addition, in order to make the proposed method suitable for various kinds of noise conditions, a Gaussian Mixture Model (GMM) based noise classification method is employed, and the corresponding WDA model is used in the enhancement process. Test results under ITU-T G.160 show that, in comparison with the reference method (Wiener filtering with the decision-directed approach for SNR estimation), the WDA-based speech enhancement methods achieve better objective speech quality, whether or not the noise conditions are included in the training set. A similar amount of noise reduction and SNR improvement is obtained with less distortion of the speech level. (C) 2014 Elsevier B.V. All rights reserved.
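Once an estimate of the clean-speech power spectrum is available (obtained from the WDA model in the paper), the enhancement step is a standard frequency-domain Wiener filter whose per-bin gain is xi/(1+xi), where xi is the a priori SNR. A minimal sketch of that final filtering stage follows; the clean-power estimate here is a simple spectral-subtraction placeholder and the input is random, since the WDA network itself is not reproduced.

    # Sketch: frequency-domain Wiener filtering given a clean-speech power
    # estimate (a placeholder here; the paper obtains it from a WDA model).
    import numpy as np
    from scipy.signal import stft, istft

    fs = 8000
    noisy = np.random.default_rng(4).standard_normal(2 * fs)   # placeholder input

    f, t, Y = stft(noisy, fs=fs, nperseg=512)
    noise_power = np.abs(Y[:, :10]).mean(axis=1, keepdims=True) ** 2  # leading-frame estimate
    clean_power = np.maximum(np.abs(Y) ** 2 - noise_power, 0.0)       # stand-in for the WDA output

    xi = clean_power / np.maximum(noise_power, 1e-12)   # a priori SNR per bin
    gain = xi / (1.0 + xi)                              # Wiener gain
    _, enhanced = istft(gain * Y, fs=fs, nperseg=512)
    enhanced = enhanced[: len(noisy)]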
C1 [Xia, Bingyin; Bao, Changchun] Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China.
RP Bao, CC (reprint author), Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China.
EM baochch@bjut.edu.cn
FU Beijing Natural Science Foundation Program; Beijing Municipal Commission
of Education [KZ201110005005]
FX This work was supported by the Beijing Natural Science Foundation
Program and Scientific Research Key Program of Beijing Municipal
Commission of Education (No. KZ201110005005).
CR 3GPP2, 2010, ENH VAR RAT COD SPEE
Bengio Yoshua, 2009, Foundations and Trends in Machine Learning, V2, DOI 10.1561/2200000006
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Brakel P., 2013, ANN C INT SPEECH COM, P2973
Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717
Dahl GE, 2012, IEEE T AUDIO SPEECH, V20, P30, DOI 10.1109/TASL.2011.2134090
DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009
EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Hinton GE, 2006, SCIENCE, V313, P504, DOI 10.1126/science.1127647
ITU-T, 2008, ITU SER G
ITU-T, 2001, ITU SER P
JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608
Lecun Y, 1998, P IEEE, V86, P2278, DOI 10.1109/5.726791
LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540
Lu X., 2013, ANN C INT SPEECH COM, P436
Maas AL, 2012, 13TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2012 (INTERSPEECH 2012), VOLS 1-3, P22
NTT, 1994, MULT SPEECH DAT TEL
REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379
Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
Vincent P, 2010, J MACH LEARN RES, V11, P3371
Vincent P., 2008, ICML, P1096
Xie J., 2012, ADV NEURAL INF PROCE, P341
Xu H. -T., 2005, EUR C SPEECH COMM TE, P977
NR 25
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2014
VL 60
BP 13
EP 29
DI 10.1016/j.specom.2014.02.001
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AH0JI
UT WOS:000335804800002
ER
PT J
AU Lee, KS
AF Lee, Ki-Seung
TI A unit selection approach for voice transformation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice conversion; Unit selection; Hidden Markov model
ID TO-SPEECH SYNTHESIS; CONVERSION; RECOGNITION; ALGORITHM; FREQUENCY;
NETWORKS; DESIGN; MODELS
AB A voice transformation (VT) method that can make the utterance of a source speaker mimic that of a target speaker is described. Speaker individuality transformation is achieved by altering four feature parameters, which include the linear prediction coefficient cepstrum (LPCC), delta LPCC, the LP residual and the pitch period. The main objective of this study involves construction of an optimal sequence of features selected from a target speaker's database, to maximize both the correlation probabilities between the transformed and the source features and the likelihood of the transformed features with respect to the target model. A set of two-pass conversion rules is proposed, where the feature parameters are first selected from a database, and the optimal sequence of the feature parameters is then constructed in the second pass. The conversion rules were developed using a statistical approach that employed a maximum likelihood criterion. In constructing an optimal sequence of the features, a hidden Markov model (HMM) with global control variables (GCV) was employed to find the most likely combination of the features with respect to the target speaker's model.
The effectiveness of the proposed transformation method was evaluated using objective tests and formal listening tests. We confirmed that the proposed method leads to perceptually more preferred results, compared with the conventional methods. (C) 2014 Elsevier B.V. All rights reserved.
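The search described above amounts to choosing, for every source frame, one candidate unit from the target speaker's database so that the whole sequence trades off closeness to the source features against smoothness of the selected sequence. The sketch below is a generic unit-selection dynamic programme (target cost plus concatenation cost); it simplifies the paper's two-pass HMM-with-GCV formulation, and all costs and data are assumptions.

    # Sketch: generic unit-selection search by dynamic programming, as a
    # simplified stand-in for the paper's HMM/GCV optimal-sequence search.
    # Euclidean target and join costs and the random data are assumptions.
    import numpy as np

    def select_units(source_feats, candidates, w_join=1.0):
        """source_feats: (T, d); candidates: list of (K, d) arrays, one per frame."""
        T = len(source_feats)
        cost = [np.linalg.norm(candidates[0] - source_feats[0], axis=1)]
        back = []
        for t in range(1, T):
            target = np.linalg.norm(candidates[t] - source_feats[t], axis=1)
            join = np.linalg.norm(candidates[t][:, None, :] -
                                  candidates[t - 1][None, :, :], axis=2)
            total = target[:, None] + w_join * join + cost[t - 1][None, :]
            back.append(total.argmin(axis=1))
            cost.append(total.min(axis=1))
        path = [int(cost[-1].argmin())]
        for b in reversed(back):
            path.append(int(b[path[-1]]))
        path.reverse()
        return [candidates[t][path[t]] for t in range(T)]

    rng = np.random.default_rng(5)
    src = rng.standard_normal((50, 12))                        # source LPCC-like frames
    cands = [rng.standard_normal((8, 12)) for _ in range(50)]  # 8 database candidates per frame
    selected = select_units(src, cands)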
C1 Konkuk Univ, Dept Elect Engn, Seoul 143701, South Korea.
RP Lee, KS (reprint author), Konkuk Univ, Dept Elect Engn, 1 Hwayang Dong, Seoul 143701, South Korea.
EM kseung@konkuk.ac.kr
CR Abe M., 1988, P IEEE ICASSP, P565
Arslan LM, 1999, SPEECH COMMUN, V28, P211, DOI 10.1016/S0167-6393(99)00015-1
Beutnagel M., 1999, P JOINT M ASA EAA DA
Bi N, 1997, IEEE T SPEECH AUDI P, V5, P97
Cheng YM, 1994, IEEE T SPEECH AUDI P, V2, P544
Childers D. G., 1985, P ICASSP 85, P748
CHILDERS DG, 1994, IEEE T BIO-MED ENG, V41, P663, DOI 10.1109/10.301733
COX SJ, 1989, P IEEE INT C AC SPEE, P294
Dutoit T., 2007, P ICASSP, P15
Erickson M. L., 2003, 31 ANN S CAR PROF VO, P24
Erro D, 2013, IEEE T AUDIO SPEECH, V21, P556, DOI 10.1109/TASL.2012.2227735
Helander E, 2012, IEEE T AUDIO SPEECH, V20, P806, DOI 10.1109/TASL.2011.2165944
Huang Y. C., 2013, P IEEE T AUDIO SPEEC, V21, P51
IWAHASHI N, 1995, SPEECH COMMUN, V16, P139, DOI 10.1016/0167-6393(94)00051-B
Jian Z. H., 2007, P INT SIGN PROC COMM, P32
KAIN A, 1998, ACOUST SPEECH SIG PR, P285
Kain A, 2001, INT CONF ACOUST SPEE, P813
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Kominek J., 2004, P 5 ISCA SPEECH SYNT, P223
Lee KS, 2002, IEICE T INF SYST, VE85D, P1297
Lee KS, 2007, IEEE T AUDIO SPEECH, V15, P641, DOI 10.1109/TASL.2006.876760
Lee KS., 1996, P ICSLP, P1401
Lee KS, 2008, IEEE T BIO-MED ENG, V55, P930, DOI 10.1109/TBME.2008.915658
LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577
Ma JC, 2005, PROCEEDINGS OF THE 2005 IEEE INTERNATIONAL CONFERENCE ON NATURAL LANGUAGE PROCESSING AND KNOWLEDGE ENGINEERING (IEEE NLP-KE'05), P199
Manabe H, 2004, P ANN INT IEEE EMBS, V26, P4389
MIZUNO H, 1995, SPEECH COMMUN, V16, P153, DOI 10.1016/0167-6393(94)00052-C
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
NARENDRANATH M, 1995, SPEECH COMMUN, V16, P207, DOI 10.1016/0167-6393(94)00058-I
Potamianos G, 2003, P IEEE, V91, P1306, DOI 10.1109/JPROC.2003.817150
Rabiner L, 1993, FUNDAMENTALS SPEECH
Rabiner L.R., 1978, DIGITAL PROCESSING S
REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379
Saito D, 2012, IEEE T AUDIO SPEECH, V20, P1784, DOI 10.1109/TASL.2012.2188628
Savic M., 1991, DIGIT SIGNAL PROCESS, V4, P107
Shuang ZW, 2008, INT CONF ACOUST SPEE, P4661
Rao KS, 2010, COMPUT SPEECH LANG, V24, P474, DOI 10.1016/j.csl.2009.03.003
Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472
Summerfield A. Q., 1992, PHILOS T R SOC LON B, V335, P71
Sundermann D., 2006, P ICASSP, P14
Sundermann D., 2005, P IEEE WORKSH AUT SP, P369
Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344
VALBRET H, 1992, SPEECH COMMUN, V11, P175, DOI 10.1016/0167-6393(92)90012-V
WHITE GM, 1976, IEEE T ACOUST SPEECH, V24, P183, DOI 10.1109/TASSP.1976.1162779
Ye H, 2006, IEEE T AUDIO SPEECH, V14, P1301, DOI 10.1109/TSA.2005.860839
NR 45
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2014
VL 60
BP 30
EP 43
DI 10.1016/j.specom.2014.02.002
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AH0JI
UT WOS:000335804800003
ER
PT J
AU Sun, CL
Zhu, Q
Wan, MH
AF Sun, Chengli
Zhu, Qi
Wan, Minghua
TI A novel speech enhancement method based on constrained low-rank and
sparse matrix decomposition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Matrix decomposition; Low-rank matrix approximation;
Robust principal component analysis
ID SPECTRAL AMPLITUDE ESTIMATOR; SUBSPACE APPROACH; NOISE; SUBTRACTION;
COMPLETION; ALGORITHM
AB In this paper, we present a novel speech enhancement method based on the principle of constrained low-rank and sparse matrix decomposition (CLSMD). In the proposed method, the noise signal is modeled as a low-rank component because noise spectra within different time frames are usually highly correlated with each other, while the speech signal is regarded as a sparse component since it is relatively sparse in the time-frequency domain. Based on these assumptions, we develop an alternating projection algorithm that separates the speech and noise magnitude spectra by imposing rank and sparsity constraints; the enhanced time-domain speech is then reconstructed from the sparse matrix by inverse discrete Fourier transform and overlap-add synthesis. The proposed method differs significantly from existing speech enhancement methods. It estimates the enhanced speech in a straightforward manner and does not need a voice activity detector to find noise-only excerpts for noise estimation. Moreover, it obtains better performance in low-SNR conditions and does not need to know the exact distribution of the noise signal. Experimental results show that the new method performs better than conventional methods under many types of strong noise, yielding less residual noise and lower speech distortion. (C) 2014 Elsevier B.V. All rights reserved.
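A minimal numerical sketch of the low-rank/sparse split the abstract describes, assuming a magnitude spectrogram Y and using a plain alternating scheme: truncated SVD for the low-rank noise part and non-negative soft thresholding for the sparse speech part. The rank, threshold and stopping rule are illustrative assumptions, not the paper's exact constrained algorithm.

# Sketch of an alternating low-rank/sparse split of a magnitude spectrogram,
# in the spirit of CLSMD (parameters and stopping rule are illustrative).
import numpy as np

def lowrank_sparse_split(Y, rank=2, thresh=0.5, n_iter=30):
    """Y: (F, T) non-negative magnitude spectrogram.
    Returns (L, S): low-rank 'noise' part and sparse 'speech' part."""
    L = np.zeros_like(Y)
    S = np.zeros_like(Y)
    for _ in range(n_iter):
        # project Y - S onto the set of rank-`rank` matrices via truncated SVD
        U, s, Vt = np.linalg.svd(Y - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # project Y - L onto a sparse, non-negative set via soft thresholding
        S = np.maximum(Y - L - thresh, 0.0)
    return L, S

# S would then be combined with the noisy phase and inverted by an inverse
# STFT with overlap-add to give the enhanced waveform.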
C1 [Sun, Chengli] Sci & Technol Avion Integrat Lab, Shanghai 200233, Peoples R China.
[Sun, Chengli; Wan, Minghua] Nanchang Hang kong Univ, Sch Informat, Nanchang 330063, Peoples R China.
[Zhu, Qi] Nanjing Univ Aeronaut & Astronaut, Dept Comp Sci & Engn, Nanjing 210016, Jiangsu, Peoples R China.
RP Sun, CL (reprint author), Nanchang Hang kong Univ, Sch Informat, Nanchang 330063, Peoples R China.
EM sun_chengli@163.com
FU Science and Technology on Avionics Integration Laboratory; China's
Aviation Science Fund [20115556007]; National Nature Science Committee
of China [61362031, 61203243, 61263040, 61263032]
FX We thank the anonymous reviews for their constructive comments and
suggestions. This article is partially supported by the Science and
Technology on Avionics Integration Laboratory and China's Aviation
Science Fund (No. 20115556007), and the funds of National Nature Science
Committee of China (Nos. 61362031, 61203243, 61263040, 61263032).
CR [Anonymous], 2001, PERC EV SPEECH QUAL
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Cai JF, 2010, SIAM J OPTIMIZ, V20, P1956, DOI 10.1137/080738970
Candes EJ, 2011, J ACM, V58, DOI 10.1145/1970392.1970395
Candes EJ, 2010, IEEE T INFORM THEORY, V56, P2053, DOI 10.1109/TIT.2010.2044061
Chang SG, 2000, IEEE T IMAGE PROCESS, V9, P1532, DOI 10.1109/83.862633
Cohen I, 2004, IEEE SIGNAL PROC LET, V11, P725, DOI [10.1109/LSP.2004.833478, 10.1109/LSP.2004.833278]
Doclo S, 2002, IEEE T SIGNAL PROCES, V50, P2230, DOI 10.1109/TSP.2002.801937
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Hai-yan W., 2011, J CHIN U POSTS TELEC, V1, P13
Hermus K, 2007, EURASIP J ADV SIG PR, DOI 10.1155/2007/45821
Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
Huang P.-S., 2012, ICASSP
Jolliffe I. T., 2002, PRINCIPAL COMPONENT, V2nd
Larsen R. M., 1998, LANCZOS BIDIAGONALIZ
Lin Z, 2009, UILUENG092215 UIUC
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lu Y, 2008, SPEECH COMMUN, V50, P453, DOI 10.1016/j.specom.2008.01.003
Manohar K, 2006, SPEECH COMMUN, V48, P96, DOI 10.1016/j.specom.2005.08.002
Mardani M, 2013, IEEE T INFORM THEORY, V59, P5186, DOI 10.1109/TIT.2013.2257913
Moor B. D., 1993, IEEE T SIGNAL PROCES, V41, P2826
Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004
Paliwal K, 2012, SPEECH COMMUN, V54, P282, DOI 10.1016/j.specom.2011.09.003
Peng Y., 2012, IEEE T PATTERN ANAL
Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621
Quatieri T. F., 2002, DISCRETE TIME SPEECH
Scalart P., 1996, P 21 IEEE INT C AC S
Shannon B., 2006, P INT C SPOK LANG PR
Soon IY, 2000, IEE P-VIS IMAGE SIGN, V147, P247, DOI 10.1049/ip-vis:20000323
Stark A, 2011, SPEECH COMMUN, V53, P51, DOI 10.1016/j.specom.2010.08.001
Toh KC, 2010, PAC J OPTIM, V6, P615
Vaseghi S. V., 2006, ADV DIGITAL SIGNAL P
Wiener N., 1949, EXTRAPOLATION INTERP
Wright J., 2009, NIPS
Xu H, 2012, IEEE T INFORM THEORY, V58, P3047, DOI 10.1109/TIT.2011.2173156
Zhang Y, 2013, SPEECH COMMUN, V55, P509, DOI 10.1016/j.specom.2012.09.005
Zhou X., 2013, IEEE T PATTERN ANAL, V35
NR 40
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2014
VL 60
BP 44
EP 55
DI 10.1016/j.specom.2014.03.002
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AH0JI
UT WOS:000335804800004
ER
PT J
AU Larcher, A
Lee, KA
Ma, B
Li, HZ
AF Larcher, Anthony
Lee, Kong Aik
Ma, Bin
Li, Haizhou
TI Text-dependent speaker verification: Classifiers, databases and RSR2015
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker recognition; Text-dependent; Database
ID GAUSSIAN MIXTURE-MODELS; DATA FUSION; RECOGNITION; IDENTIFICATION;
SPEECH; CORPUS; HMM; NORMALIZATION; FEATURES
AB The RSR2015 database, designed to evaluate text-dependent speaker verification systems under different durations and lexical constraints, has been collected and released by the Human Language Technology (HLT) department at the Institute for Infocomm Research (I2R) in Singapore. English speakers were recorded with a balanced diversity of the accents commonly found in Singapore. More than 151 h of speech data were recorded using mobile devices. The pool of speakers consists of 300 participants (143 female and 157 male speakers) between 17 and 42 years old, making the RSR2015 database one of the largest publicly available databases targeted at text-dependent speaker verification. We provide an evaluation protocol for each of the three parts of the database, together with the results of two speaker verification systems: the HiLAM system, based on a three-layer acoustic architecture, and an i-vector/PLDA system. We thus provide a reference evaluation scheme and reference performance on the RSR2015 database to the research community. The HiLAM system outperforms the state-of-the-art i-vector system in most of the scenarios. (C) 2014 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/3.0/).
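As a hedged illustration of the kind of reference scoring such an evaluation protocol supports, the sketch below computes the equal error rate (EER) from target and impostor trial scores. It is a generic utility under the usual definition of EER, not the authors' evaluation code.

# Generic EER computation for a verification protocol (illustrative only).
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    tar = np.asarray(target_scores, dtype=float)
    imp = np.asarray(impostor_scores, dtype=float)
    scores = np.concatenate([tar, imp])
    labels = np.concatenate([np.ones(len(tar)), np.zeros(len(imp))])
    order = np.argsort(-scores)                    # accept highest scores first
    labels = labels[order]
    fa = np.cumsum(1.0 - labels) / max(len(imp), 1)   # false acceptance rate
    fr = 1.0 - np.cumsum(labels) / max(len(tar), 1)   # false rejection rate
    idx = int(np.argmin(np.abs(fa - fr)))
    return 0.5 * (fa[idx] + fr[idx])

# Example: equal_error_rate([2.1, 1.7, 0.9], [1.2, 0.3, -0.5, 0.1])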
C1 [Larcher, Anthony; Lee, Kong Aik; Ma, Bin; Li, Haizhou] Human Language Technol Dept 1, Inst Infocomm Res I2R, Singapore 138632, Singapore.
RP Larcher, A (reprint author), Human Language Technol Dept 1, Inst Infocomm Res I2R, Fusionopolis Way 21-01,Connexis South Tower, Singapore 138632, Singapore.
EM alarcher@i2r.a-star.edu.sg; kalee@i2r.a-star.edu.sg;
mabin@i2r.a-star.edu.sg; hli@i2r.a-star.edu.sg
CR Amino K, 2009, FORENSIC SCI INT, V185, P21, DOI 10.1016/j.forsciint.2008.11.018
Aronowitz Hagai, 2012, OD SPEAK LANG REC WO
Avinash B, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P1073
Bailly-Bailliere E, 2003, LECT NOTES COMPUT SC, V2688, P625
BenZeghiba MF, 2006, SPEECH COMMUN, V48, P1200, DOI 10.1016/j.specom.2005.08.008
Boakye K., 2004, OD SPEAK LANG REC WO, P1
Boies D., 2004, OD SPEAK LANG REC WO, P1
Bonastre J.F., 2003, EUR C SPEECH COMM TE, P2013
Bousquet P.M., 2011, ANN C INT SPEECH COM, P485
Bousquet P.M., 2012, OD SPEAK LANG REC WO, P1
Brummer N., 2010, OD SPEAK LANG REC WO, P1
Campbell J., 1994, YOHO SPEAKER VERIFIC
Campbell JP, 2009, IEEE SIGNAL PROC MAG, V26, P95, DOI 10.1109/MSP.2008.931100
CAMPBELL JP, 1995, INT CONF ACOUST SPEE, P341, DOI 10.1109/ICASSP.1995.479543
CAMPBELL JP, 1999, ACOUST SPEECH SIG PR, P829
Charlet D, 2000, SPEECH COMMUN, V31, P113, DOI 10.1016/S0167-6393(99)00072-2
Charlet D, 1997, PATTERN RECOGN LETT, V18, P873, DOI 10.1016/S0167-8655(97)00064-0
Chatzis S, 2007, ICSPC: 2007 IEEE INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING AND COMMUNICATIONS, VOLS 1-3, PROCEEDINGS, P804
Che CW, 1996, INT CONF ACOUST SPEE, P673
Chen K, 1996, IEEE T NEURAL NETWOR, V7, P1309
Chen W., 2012, INT C AUD LANG IM PR, P432
Chollet G., 1996, TECHNICAL REPORT
Cole R., 1998, P INT C SPOK LANG PR, P3167
Cumani S, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4361
Cumani S, 2013, INT CONF ACOUST SPEE, P7644, DOI 10.1109/ICASSP.2013.6639150
Das A., 2010, IEEE INT C AC SPEECH, P4510
Dehak N, 2011, IEEE T AUDIO SPEECH, V19, P788, DOI 10.1109/TASL.2010.2064307
Dehak N., 2011, ANN C INT SPEECH COM, P857
Dehak N., 2010, OD SPEAK LANG REC WO, P1
Dessimoz D., 2008, FORENSIC SCI INT, V167, P154
Dialogues Spotlight Technology, 2000, TECHNICAL REPORT
Doddington G., 2012, OD SPEAK LANG REC WO, P1
Doddington G.R., 1998, WORKSH SPEAK REC ITS, P20
Dong C., 2008, OD SPEAK LANG REC WO, P1
Dumas B., 2005, BIOMETRICS INTERNET, V275, P59
Dutta T, 2008, CISP 2008: FIRST INTERNATIONAL CONGRESS ON IMAGE AND SIGNAL PROCESSING, VOL 2, PROCEEDINGS, P354, DOI 10.1109/CISP.2008.560
Dutta T., 2007, IMAGE VISION COMPUT, P238
ELDA - Evaluations and Language resources Distribution Agency, 2003, S0050 RUSTEN RUSS SW
FARRELL KR, 1995, INT CONF ACOUST SPEE, P349, DOI 10.1109/ICASSP.1995.479545
FARRELL KR, 1998, ACOUST SPEECH SIG PR, P1129
Faundez-Zanuy M, 2006, IEEE AERO EL SYS MAG, V21, P29, DOI 10.1109/MAES.2006.1703234
Fauve B., 2009, THESIS SWANSEA U
Fierrez J, 2010, PATTERN ANAL APPL, V13, P235, DOI 10.1007/s10044-009-0151-4
Fierrez J, 2007, PATTERN RECOGN, V40, P1389, DOI 10.1016/j.patcog.2006.10.014
Finan RA, 1996, IEEE IJCNN, P1992, DOI 10.1109/ICNN.1996.549207
FORSYTH M, 1995, SPEECH COMMUN, V17, P117, DOI 10.1016/0167-6393(95)00020-O
Fox N.A., 2005, INT C AUD VID BAS PE, P777
FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530
FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P342, DOI 10.1109/TASSP.1981.1163605
Garcia-Romero D., 2011, ANN C INT SPEECH COM, P249
Garcia-Salicetti S., 2003, AUDIO VIDEO BASED BI
Garofolo J.S., 1993, TIMIT ACOUSTIC PHONE, P1
Gu Y., 1998, ANN C INT SPEECH COM, P125
Hasan T, 2013, INT CONF ACOUST SPEE, P7663, DOI 10.1109/ICASSP.2013.6639154
Hebert M., 2008, HDB SPEECH PROCESSIN, P743
Hebert M., 2003, EUR C SPEECH COMM TE, P1665
Hebert M, 2005, INT CONF ACOUST SPEE, P729
Heck Larry, 2001, OD SPEAK LANG REC WO, P249
Hennebert J, 2000, SPEECH COMMUN, V31, P265, DOI 10.1016/S0167-6393(99)00082-5
Jiang Y., 2012, ANN C INT SPEECH COM, P1680
Kahn J., 2011, INT C PHON SCI ICPHS, P1002
Kahn J., 2010, OD SPEAK LANG REC WO, P109
Kanagasundaram A., 2011, ANN C INT SPEECH COM, P2341
Karam ZN, 2011, 2011 IEEE STATISTICAL SIGNAL PROCESSING WORKSHOP (SSP), P525, DOI 10.1109/SSP.2011.5967749
Karlsson I, 2000, SPEECH COMMUN, V31, P121, DOI 10.1016/S0167-6393(99)00073-4
Karlsson I., 1999, GOTHENBURG PAPERS TH, P93
KATO T, 2003, ACOUST SPEECH SIG PR, P57
Kekre H., 2010, INT J BIOMETRICS BIO, V4, P100
Kelly F, 2011, LECT NOTES COMPUT SC, V6583, P113, DOI 10.1007/978-3-642-19530-3_11
Kelly F., 2012, INT C BIOM ICB, P478
Kenny P, 2013, INT CONF ACOUST SPEE, P7649, DOI 10.1109/ICASSP.2013.6639151
Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1435, DOI 10.1109/TASL.2006.881693
Kenny P., 2004, P IEEE INT C AC SPEE, V1, P37
Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009
Kong Aik Lee, 2011, ANN C INT SPEECH COM, P3317
Larcher A, 2013, DIGIT SIGNAL PROCESS, V23, P1910, DOI 10.1016/j.dsp.2013.07.007
Larcher A, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4773
Larcher A., 2013, ANN C INT SPEECH COM, P2768
Larcher A, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P371
Larcher A, 2013, INT CONF ACOUST SPEE, P7673, DOI 10.1109/ICASSP.2013.6639156
Larcher Anthony, 2012, ANN C INT SPEECH COM, P1580
Lawson A.D., 2009, ANN C INT SPEECH COM, P2899
Lee K.A., 2013, ANN C INT SPEECH COM, P3651
Lee Kong Aik, 2013, SLTC NEWSLETTER
Lei Y., 2009, ANN C INT SPEECH COM, P2371
Li HZ, 2013, P IEEE, V101, P1136, DOI 10.1109/JPROC.2012.2237151
Li Q, 2002, IEEE T SPEECH AUDI P, V10, P146
Luan J., 2006, OD SPEAK LANG REC WO, P1
Mandasari M.I., 2011, ANN C INT SPEECH COM, P21
Marcel S, 2010, LECT NOTES COMPUT SC, V6388, P210, DOI 10.1007/978-3-642-17711-8_22
Martin A.F., 2010, ANN C INT SPEECH COM, P2726
Martin A.F., 2009, ANN C INT SPEECH COM, P2579
Martinez D., 2011, ANN C INT SPEECH COM, P861
Mason J.S., 1996, TECHNICAL REPORT
Matsui T., 1993, IEEE INT C AC SPEECH, V2, P391, DOI 10.1109/ICASSP.1993.319321
Meng H., 2006, INT WORKSH MULT US A, P1
Messer K., 1999, 2 INT C AUD VID BAS, V964, P965
MISTRETTA W, 1998, ACOUST SPEECH SIG PR, P113
Nakano S., 2004, International Astronomical Union Circular
Nosratighods M, 2010, SPEECH COMMUN, V52, P753, DOI 10.1016/j.specom.2010.04.007
Ortega-Garcia J, 2010, IEEE T PATTERN ANAL, V32, P1097, DOI 10.1109/TPAMI.2009.76
Ortega-Garcia J, 2000, SPEECH COMMUN, V31, P255, DOI 10.1016/S0167-6393(99)00081-3
Pigeon Stephane, 1997, AUDIO VIDEO BASED BI
Prazak J., 2011, INT C INT DAT ACQ AD, P347
Prince S., 2007, IEEE 11 INT C COMP V, P1
Przybocki M.A., 2006, OD SPEAK LANG REC 2, P1
Ramasubramanian V., 2006, IEEE VLSI DESIGN, P1
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
ROSENBERG AE, 1991, INT CONF ACOUST SPEE, P381, DOI 10.1109/ICASSP.1991.150356
Rosenberg AE, 2000, SPEECH COMMUN, V31, P131, DOI 10.1016/S0167-6393(99)00074-6
Schmidt M, 1996, INT CONF ACOUST SPEE, P105, DOI 10.1109/ICASSP.1996.540301
Senoussaoui M., 2011, ANN C INT SPEECH COM, P25
Silovsky J, 2011, 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, P2920
Stafylakis T., 2013, ANN C INT SPEECH COM, P3684
Steininger Silke, 2002, LREC WORKSH MULT RES
Stolcke A, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4397
Sturim DE, 2002, INT CONF ACOUST SPEE, P677
Subramanya A., 2007, IEEE INT C AC SPEECH, P4
Toledano D.T., 2008, LREC
Toledo-Ronen O., 2011, ANN C INT SPEECH COM, P9
van Leeuwen D.A., 2013, ANN C INT SPEECH COM, P1619
Vogt R, 2008, COMPUT SPEECH LANG, V22, P17, DOI 10.1016/j.csl.2007.05.003
Vogt R.J., 2009, ANN C INT SPEECH COM, P1563
Vogt R.J., 2008, OD SPEAK LANG REC WO, P1
Wagner M., 2006, OD SPEAK LANG REC WO, P1
Wong YW, 2011, PATTERN RECOGN LETT, V32, P1503, DOI 10.1016/j.patrec.2011.06.011
Woo R.H., 2006, OD SPEAK LANG REC WO
Woo S.C., 2000, P IEEE REG 10 INT C
Wu D., 2008, SPEECH RECOGNITION T
Xu J., 2011, ANN C INT SPEECH COM
Yegnanarayana B, 2005, IEEE T SPEECH AUDI P, V13, P575, DOI 10.1109/TSA.2005.848892
Yoma NB, 2002, SPEECH COMMUN, V38, P77, DOI 10.1016/S0167-6393(01)00044-9
You CH, 2010, IEEE T AUDIO SPEECH, V18, P1300, DOI 10.1109/TASL.2009.2032950
Young S.J., 2008, SPRINGER HDB SPEECH
Young S. J., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. No.92CH3103-9), DOI 10.1109/ICASSP.1992.225844
Zheng T.F., 2005, ORIENTAL COCOSDA
NR 136
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2014
VL 60
BP 56
EP 77
DI 10.1016/j.specom.2014.03.001
PG 22
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AH0JI
UT WOS:000335804800005
ER
PT J
AU Ablimit, M
Kawahara, T
Hamdulla, A
AF Ablimit, Mijit
Kawahara, Tatsuya
Hamdulla, Askar
TI Lexicon optimization based on discriminative learning for automatic
speech recognition of agglutinative language
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Language model; Lexicon; Morpheme; Discriminative
learning; Uyghur
ID UNITS
AB For automatic speech recognition (ASR) of agglutinative languages, the selection of a lexical unit is not obvious. The morpheme unit is usually adopted to ensure sufficient coverage, but many morphemes are short, resulting in weak constraints and possible confusion. We propose a discriminative approach for lexicon optimization that directly contributes to ASR error reduction by taking into account not only linguistic constraints but also acoustic-phonetic confusability. It is based on an evaluation function for each word defined by a set of features and their weights, which are optimized by the difference in word error rates (WERs) between ASR hypotheses obtained with the morpheme-based model and those obtained with the word-based model. Word or sub-word entries with higher evaluation scores are then selected to be added to the lexicon. We investigate several discriminative models to realize this approach; specifically, we implement it with support vector machines (SVM), a logistic regression (LR) model, and the simple perceptron algorithm. This approach was successfully applied to an Uyghur large-vocabulary continuous speech recognition system, resulting in a significant reduction of WER with a modest lexicon size and a small out-of-vocabulary rate. The use of SVM for a sub-word lexicon gives the best performance, outperforming the word-based model as well as conventional statistical concatenation approaches. The proposed learning approach is realized in an unsupervised manner because it does not require correct transcriptions of the training data. (C) 2013 Elsevier B.V. All rights reserved.
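A minimal sketch of the perceptron variant of the described lexicon scoring, under the assumption that each candidate word or sub-word entry is represented by a feature vector and labeled by whether adding it reduced the WER of the morpheme-based hypotheses. The feature vectors, labels and thresholds shown here are hypothetical placeholders, not the paper's feature set.

# Sketch: perceptron-style scoring of candidate lexical entries and
# selection of high-scoring entries for the lexicon (names are hypothetical).
import numpy as np

def train_entry_scorer(features, wer_gain, n_epochs=10, lr=0.1):
    """features: (N, D) feature vectors of candidate lexical entries.
    wer_gain: (N,) +1 if adding the entry reduced WER, -1 otherwise.
    Returns a weight vector w used to rank entries."""
    N, D = features.shape
    w = np.zeros(D)
    for _ in range(n_epochs):
        for x, y in zip(features, wer_gain):
            if y * np.dot(w, x) <= 0:      # misclassified entry: update weights
                w += lr * y * x
    return w

def select_entries(features, entries, w, top_k=1000):
    scores = features @ w
    order = np.argsort(-scores)
    return [entries[i] for i in order[:top_k] if scores[i] > 0]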
C1 [Ablimit, Mijit; Kawahara, Tatsuya] Kyoto Univ, Sch Informat, Kyoto, Japan.
[Hamdulla, Askar] Xinjiang Univ, Inst Informat Engn, Urumqi, Peoples R China.
RP Kawahara, T (reprint author), Sakyo Ku, Kyoto 6068501, Japan.
EM kawahara@i.kyoto-u.ac.jp
FU JSPS; National Natural Science Foundation of China (NSFC) [61163032]
FX This work is supported by JSPS Grant-in-Aid-for Scientific Research
(KAKENHI) and National Natural Science Foundation of China (NSFC; grant
61163032).
CR Ablimit M., 2010, P ICSP BEIJ
Ablimit M., 2012, P IEEE ICASSP
Afify M., 2006, P INTERSPEECH
Arisoy E, 2009, IEEE T AUDIO SPEECH, V17, P874, DOI 10.1109/TASL.2008.2012313
Arisoy E, 2012, IEEE T AUDIO SPEECH, V20, P540, DOI 10.1109/TASL.2011.2162323
Berton A., 1996, P ICSLP
Carki K., 2000, P IEEE ICASSP
Collins M., 2002, P EMNLP
Collins M., 2005, P ACL, P507, DOI 10.3115/1219840.1219903
Creutz M., 2006, THESIS HELSINKI U TE
Creutz M., 2007, ACM T SPEECH LANG PR, V5, P1, DOI 10.1145/1322391.1322394
DELIGNE S, 1995, INT CONF ACOUST SPEE, P169, DOI 10.1109/ICASSP.1995.479391
El-Desoky A., 2009, P INTERSPEECH
Fan RE, 2008, J MACH LEARN RES, V9, P1871
Geutner P., 1995, P IEEE ICASSP
Goldsmith J., 2001, COMPUT LINGUIST, V2, P78
Hacioglu K., 2003, P EUR
Ircing P., 2001, P EUR
Jeff Kuo H.-K., 1999, P EUR
Jongtaveesataporn M., 2009, SPEECH COMMUN, V2009, P379
Kawahara T., 2000, P INT C SPOK LANG PR, V4, P476
Kiecza D., 1999, P ICSP SEOUL
KWON OW, 2000, ACOUST SPEECH SIG PR, P1567
Kwon OW, 2003, SPEECH COMMUN, V39, P287, DOI 10.1016/S0167-6393(02)00031-6
Larson M., 2000, P INTERSPEECH
Lee A., 2001, EUROSPEECH, P1691
Masataki H, 1996, INT CONF ACOUST SPEE, P188, DOI 10.1109/ICASSP.1996.540322
Mihajlik P., 2007, P INT 2007, P1497
Nussbaum-Thom M., 2011, P INT
Pellegrini T., 2007, P INTERSPEECH
Pellegrini T, 2009, IEEE T AUDIO SPEECH, V17, P863, DOI 10.1109/TASL.2009.2022295
Puurula A., 2007, P ACL
Roark B, 2007, COMPUT SPEECH LANG, V21, P373, DOI 10.1016/j.csl.2006.06.006
Sak H, 2012, IEEE T AUDIO SPEECH, V20, P2341, DOI 10.1109/TASL.2012.2201477
Saon G, 2001, IEEE T SPEECH AUDI P, V9, P327, DOI 10.1109/89.917678
Sarikaya R, 2008, IEEE T AUDIO SPEECH, V16, P1330, DOI 10.1109/TASL.2008.924591
Shinozaki T., 2002, P ICSLP, P717
Whittaker E., 2003, P ISCSLP BEIJ
Xiang B., 2006, IEEE ICASSP
NR 39
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2014
VL 60
BP 78
EP 87
DI 10.1016/j.specom.2013.09.011
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AH0JI
UT WOS:000335804800006
ER
PT J
AU Amano-Kusumoto, A
Hosom, JP
Kain, A
Aronoff, JM
AF Amano-Kusumoto, Akiko
Hosom, John-Paul
Kain, Alexander
Aronoff, Justin M.
TI Determining the relevance of different aspects of formant contours to
intelligibility
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility; Vowel perception; Speech synthesis
ID CONVERSATIONAL SPEECH; CLEAR SPEECH; VOWEL INTELLIGIBILITY;
NORMAL-HEARING; PERCEPTION; TRANSITION; LISTENERS; CHILDREN; HARD
AB Previous studies have shown that "clear" speech, in which the speaker intentionally tries to enunciate, has better intelligibility than "conversational" speech, which is produced in regular conversation. However, conversational and clear speech vary along a number of acoustic dimensions, and it is unclear which aspects of clear speech lead to better intelligibility. Previously, Kain et al. (2008) showed that a combination of short-term spectra and duration was responsible for the improved intelligibility of one speaker. This study investigates subsets of specific features of short-term spectra, including temporal aspects. As in Kain's study, hybrid stimuli were synthesized with a combination of features from clear speech and complementary features from conversational speech to determine which acoustic features cause the improved intelligibility of clear speech. Our results indicate that, although steady-state formant values of tense vowels contributed to the intelligibility of clear speech, neither the steady-state portion nor the formant transition was sufficient to yield intelligibility comparable to that of clear speech. In contrast, when the entire formant contour of conversational speech, including the phoneme durations, was replaced by that of clear speech, intelligibility was comparable to that of clear speech. This indicates that the combination of formant contour and duration information is relevant to the improved intelligibility of clear speech. The study provides a better understanding of the relevance of different aspects of formant contours to the improved intelligibility of clear speech. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Amano-Kusumoto, Akiko; Aronoff, Justin M.] House Res Inst, Dept Human Commun Sci Devices, Los Angeles, CA 90057 USA.
[Amano-Kusumoto, Akiko; Hosom, John-Paul; Kain, Alexander] Oregon Hlth & Sci Univ, Ctr Spoken Language Understanding CSLU, Beaverton, OR 97006 USA.
RP Amano-Kusumoto, A (reprint author), House Res Inst, Dept Human Commun Sci Devices, 2100 West Third St, Los Angeles, CA 90057 USA.
EM akiko.amano@gmail.com
FU NSF [0826654]; NIH [T32DC009975]
FX This work was supported in part by NSF grant 0826654 and NIH grant
T32DC009975.
CR Amano-Kusumoto A., 2011, 11002 CSLU, P1
Amano-Kusumoto A, 2009, INT CONF ACOUST SPEE, P4677, DOI 10.1109/ICASSP.2009.4960674
[Anonymous], 1996, EL SOUND LEV MET
Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464
Bradlow AR, 2003, J SPEECH LANG HEAR R, V46, P80, DOI 10.1044/1092-4388(2003/007)
Ferguson SH, 2004, J ACOUST SOC AM, V116, P2365, DOI 10.1121/1.1788730
Ferguson SH, 2002, J ACOUST SOC AM, V112, P259, DOI 10.1121/1.1482078
FURUI S, 1986, J ACOUST SOC AM, V80, P1016, DOI 10.1121/1.393842
Hazan V, 2004, J ACOUST SOC AM, V116, P3108, DOI 10.1121/1.1806826
Helfer K S, 1998, J Am Acad Audiol, V9, P234
Hillenbrand JM, 1999, J ACOUST SOC AM, V105, P3509, DOI 10.1121/1.424676
Hosom JP, 2009, SPEECH COMMUN, V51, P352, DOI 10.1016/j.specom.2008.11.003
Kain A, 2008, J ACOUST SOC AM, V124, P2308, DOI 10.1121/1.2967844
Kain AB, 2007, SPEECH COMMUN, V49, P743, DOI 10.1016/j.specom.2007.05.001
Krause JC, 2004, J ACOUST SOC AM, V115, P362, DOI 10.1121/1.1635842
Krause JC, 2002, J ACOUST SOC AM, V112, P2165, DOI 10.1121/1.1509432
Kusumoto A., 2007, P INTERSPEECH, P370
Liu S, 2004, J ACOUST SOC AM, V116, P2374, DOI 10.1121/1.1787528
MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492
Perkell JS, 2002, J ACOUST SOC AM, V112, P1627, DOI 10.1121/1.1506369
PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96
PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434
WINGFIELD A, 1985, J GERONTOL, V40, P579
NR 23
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2014
VL 59
BP 1
EP 9
DI 10.1016/j.specom.2013.12.001
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AD8DI
UT WOS:000333496300001
ER
PT J
AU Kane, J
Aylett, M
Yanushevskaya, I
Gobl, C
AF Kane, John
Aylett, Matthew
Yanushevskaya, Irena
Gobl, Christer
TI Phonetic feature extraction for context-sensitive glottal source
processing
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice quality; Phonation type; Glottal source; Expressive speech; Speech
synthesis
ID SPEAKER RECOGNITION; SPEECH RECOGNITION; NEURAL-NETWORKS; FLOW
AB The effectiveness of glottal source analysis is known to be dependent on the phonetic properties of its concomitant supraglottal features. Phonetic classes like nasals and fricatives are particularly problematic. Their acoustic characteristics, including zeros in the vocal tract spectrum and aperiodic noise, can have a negative effect on glottal inverse filtering, a necessary prerequisite to glottal source analysis. In this paper, we first describe and evaluate a set of binary feature extractors for phonetic classes with relevance to glottal source analysis. As voice quality classification is typically achieved using feature data derived by glottal source analysis, we then investigate the effect of removing data from certain detected phonetic regions on the classification accuracy. For the phonetic feature extraction, classification algorithms based on Artificial Neural Networks (ANNs), Gaussian Mixture Models (GMMs) and Support Vector Machines (SVMs) are compared. Experiments demonstrate that the discriminative classifiers (i.e., ANNs and SVMs) generally give better results than the generative learning algorithm (i.e., GMMs). Accuracy generally decreases with the sparseness of the feature (e.g., accuracy is lower for nasals than for syllabic regions). We find the best voice quality classification when using only glottal source parameter data derived within detected syllabic regions. (C) 2013 Elsevier B.V. All rights reserved.
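A generic sketch of the discriminative-versus-generative comparison for one binary phonetic feature detector (e.g., nasal vs. non-nasal), assuming frame-level feature vectors and labels are already available. The classifiers shown (an RBF SVM and a pair of GMMs) stand in for the paper's ANN/SVM/GMM systems and are not its exact configuration.

# Sketch: train one discriminative and one generative detector for a single
# binary phonetic feature (X_train: frame features, y_train: 0/1 labels).
import numpy as np
from sklearn.svm import SVC
from sklearn.mixture import GaussianMixture

def train_detectors(X_train, y_train):
    svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)
    gmm_pos = GaussianMixture(n_components=4).fit(X_train[y_train == 1])
    gmm_neg = GaussianMixture(n_components=4).fit(X_train[y_train == 0])
    return svm, gmm_pos, gmm_neg

def gmm_decide(gmm_pos, gmm_neg, X):
    # generative decision: compare class log-likelihoods frame by frame
    return (gmm_pos.score_samples(X) > gmm_neg.score_samples(X)).astype(int)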
C1 [Kane, John; Yanushevskaya, Irena; Gobl, Christer] Trinity Coll Dublin, Sch Linguist Speech & Commun Sci, Phonet & Speech Lab, Dublin, Ireland.
[Aylett, Matthew] Univ Edinburgh, Sch Informat, Edinburgh EH8 9YL, Midlothian, Scotland.
[Aylett, Matthew] CereProc Ltd, Edinburgh, Midlothian, Scotland.
RP Kane, J (reprint author), Trinity Coll Dublin, Sch Linguist Speech & Commun Sci, Phonet & Speech Lab, Dublin, Ireland.
EM kanejo@tcd.ie; matthewa@cereproc.com; yanushei@tcd.ie; cegobl@tcd.ie
FU Science Foundation Ireland [09/IN.1/I2631]
FX The first, third and fourth authors are supported by the Science
Foundation Ireland Grant 09/IN.1/I2631 (FASTNET).
CR Airas M., 2007, P INT 2007, P1410
Ali AMA, 1999, ISCAS '99: PROCEEDINGS OF THE 1999 IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOL 3, P118
ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R
Alku P, 2002, J ACOUST SOC AM, V112, P701, DOI 10.1121/1.1490365
Alku P, 1997, SPEECH COMMUN, V22, P67, DOI 10.1016/S0167-6393(97)00020-4
Alku P, 2011, SADHANA-ACAD P ENG S, V36, P623, DOI 10.1007/s12046-011-0041-5
Alku P, 2013, J ACOUST SOC AM, V134, P1295, DOI 10.1121/1.4812756
Alku P., 1994, P INT C SPOK LANG PR, P1619
Aylett M.P., 2007, ARTIFICIAL INTELLIGE
Bishop C. M., 2006, PATTERN RECOGNITION
Campbell N., 2003, P 15 INT C PHON SCI, P2417
Chan WN, 2007, IEEE T AUDIO SPEECH, V15, P1884, DOI 10.1109/TASL.2007.900103
Chomsky N., 1968, SOUND PATTERN ENGLIS
Cullen A., 2013, P WASSS GREN FRANC
Drugman T., 2011, COMPUTER SPEECH LANG, V26, P20
Fant G, 1987, SPEECH TRANSMISSION, V1, P13
Fant G., 1970, ACOUSTIC THEORY SPEE
Fant G., 1985, KTH SPEECH TRANSMISS, P21
Fant Gunnar, 1985, STL QPSR, V4, P1
Gobl C, 2013, J VOICE, V27, P155, DOI 10.1016/j.jvoice.2012.09.004
Hacki T., 1989, FOLIA PHONIATR, P43
Hanson HM, 1997, J ACOUST SOC AM, V101, P466, DOI 10.1121/1.417991
HORNIK K, 1991, NEURAL NETWORKS, V4, P251, DOI 10.1016/0893-6080(91)90009-T
Iliev I., 2010, COMPUT SPEECH LANG, V24, P445
Kane J, 2013, SPEECH COMMUN, V55, P295, DOI 10.1016/j.specom.2012.08.011
Kane J., 2013, P NOLISP MONS BELG, P1
Kane J, 2013, IEEE T AUDIO SPEECH, V21, P1170, DOI 10.1109/TASL.2013.2245653
Kane J, 2013, SPEECH COMMUN, V55, P397, DOI 10.1016/j.specom.2012.12.004
Kane J., 2013, P INT LYON FRANC
Kane J., 2013, P ICASSP VANC CAN
Kanokphara S., 2006, P 19 INT C IND ENG O
King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148
Kominek J., 2004, ISCA SPEECH SYNTH WO, P223
Launay B, 2002, INT CONF ACOUST SPEE, P817
Lin Q., 1987, KTH SPEECH TRANSMISS, V28, P1
Lugger M, 2008, INT CONF ACOUST SPEE, P4945, DOI 10.1109/ICASSP.2008.4518767
Mokhtari P, 2003, IEICE T INF SYST, VE86D, P574
Mokhtari P., 2002, P LANG RES EV LREC
Murty KR, 2006, IEEE SIGNAL PROC LET, V13, P52, DOI 10.1109/LSP.2005.860538
Raitio T, 2014, COMPUT SPEECH LANG, V28, P648, DOI 10.1016/j.csl.2013.03.003
Raitio T, 2011, IEEE T AUDIO SPEECH, V19, P153, DOI 10.1109/TASL.2010.2045239
Richmond K., 2007, P BLIZZ CHALL WORKSH
Siniscalchi SM, 2009, SPEECH COMMUN, V51, P1139, DOI 10.1016/j.specom.2009.05.004
Siniscalchi SM, 2013, NEUROCOMPUTING, V106, P148, DOI 10.1016/j.neucom.2012.11.008
Sturmel N., 2006, 7 C ADV QUANT LAR GR
Szekely E, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4593
Tarek A., 2003, P NONL SPEECH PROC W
Teager H. M, 1990, INT C AC SPEECH SIGN, V4, P241
Walker J, 2007, LECT NOTES COMPUT SC, V4391, P1
Young Steve J., 2007, HTK BOOK VERSION 3 4
Yu D, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4169
Zheng NH, 2007, IEEE SIGNAL PROC LET, V14, P181, DOI 10.1109/LSP.2006.884031
NR 52
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2014
VL 59
BP 10
EP 21
DI 10.1016/j.specom.2013.12.003
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AD8DI
UT WOS:000333496300002
ER
PT J
AU Liang, S
Liu, WJ
Jiang, W
Xue, W
AF Liang, Shan
Liu, WenJu
Jiang, Wei
Xue, Wei
TI The analysis of the simplification from the ideal ratio to binary mask
in signal-to-noise ratio sense
SO SPEECH COMMUNICATION
LA English
DT Article
DE Ideal binary mask; Ideal ratio mask; W-Disjoint Orthogonality
ID AUTOMATIC SPEECH RECOGNITION; SEGREGATION; SEPARATION; INTELLIGIBILITY;
ALGORITHM
AB For speech separation systems, the ideal binary mask (IBM) can be viewed as a simplified goal of the ideal ratio mask (IRM), which is derived from the Wiener filter. Available research usually justifies this simplification from the perspective of speech intelligibility. However, the difference between the two masks has not been addressed rigorously in the signal-to-noise ratio (SNR) sense. In this paper, we analytically investigate the difference between the two ideal masks under the assumption of approximate W-Disjoint Orthogonality (AWDO), which almost holds under many kinds of interference because of the sparse nature of speech. From the analysis, a theoretical upper bound on the difference is obtained under the AWDO assumption. Other findings include a new ratio mask that achieves higher SNR gains than the IRM, and the essential relation between the degree of AWDO and the SNR gain of the IRM. (C) 2013 Elsevier B.V. All rights reserved.
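A small numerical illustration, assuming the clean speech and noise power spectrograms of a mixture are known: it computes the IRM (Wiener-style ratio mask) and the 0 dB IBM, and the SNR gain each ideal mask yields under a WDO-like approximation in which masked speech and masked noise powers are scaled by the squared mask. This is a toy check of the quantities the abstract analyses, not the paper's analytical derivation.

# Toy computation of the IRM, the IBM, and their SNR gains (illustrative).
import numpy as np

def masks_and_snr_gain(S, N):
    """S, N: (F, T) speech and noise power spectrograms of the same mixture."""
    irm = S / (S + N + 1e-12)                # Wiener-style ideal ratio mask
    ibm = (S > N).astype(float)              # ideal binary mask (0 dB criterion)
    def masked_snr(mask):
        return 10 * np.log10(np.sum(mask ** 2 * S) / (np.sum(mask ** 2 * N) + 1e-12))
    snr_in = 10 * np.log10(np.sum(S) / np.sum(N))
    return irm, ibm, masked_snr(irm) - snr_in, masked_snr(ibm) - snr_in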
C1 [Liang, Shan; Liu, WenJu; Jiang, Wei; Xue, Wei] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit NLPR, Beijing, Peoples R China.
RP Liu, WJ (reprint author), Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit NLPR, Beijing, Peoples R China.
EM sliang@nlpr.ia.ac.cn; lwj@nlpr.ia.ac.cn; wjiang@nlpr.ia.ac.cn;
wxue@nlpr.ia.ac.cn
FU China National Nature Science Foundation [91120303, 61273267, 90820011]
FX This research was supported in part by the China National Nature Science
Foundation (No. 91120303, No. 61273267 and No. 90820011).
CR Barker J., 2000, P ICSLP BEIJ CHIN, P373
Bregman S., 1990, AUDITORY SCENE ANAL
BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016
Brown G.J., 1993, J ACOUST SOC AM, V94, P2454, DOI 10.1121/1.407441
Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929
Cooke M, 2006, J ACOUST SOC AM, V120, P2421, DOI 10.1121/1.2229005
Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0
Cooke M.P., 1993, MODELING AUDITORY PR
Ellis D.P.W., 1996, THESIS
Ellis D.P.W., 2006, COMPUTATIONAL AUDITO, P15
Ellis D.P.W., 1995, WORKSH COMP AUD SCEN, P111
Han K, 2012, J ACOUST SOC AM, V132, P3475, DOI 10.1121/1.4754541
Hu GN, 2010, IEEE T AUDIO SPEECH, V18, P2067, DOI 10.1109/TASL.2010.2041110
Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812
Hu G.N., 2001, IEEE WORKSH APPL SIG, P79
Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949
Kim G, 2009, J ACOUST SOC AM, V126, P1486, DOI 10.1121/1.3184603
KRYTER KD, 1962, J ACOUST SOC AM, V34, P1689, DOI 10.1121/1.1909094
Lehmann EA, 2008, J ACOUST SOC AM, V124, P269, DOI 10.1121/1.2936367
Li N, 2008, J ACOUST SOC AM, V123, P1673, DOI 10.1121/1.2832617
Li Y.P., 2009, SPEECH COMMUN, V51, P1486
Liang S, 2012, IEEE SIGNAL PROC LET, V19, P627, DOI 10.1109/LSP.2012.2209643
Liang S, 2013, IEEE T AUDIO SPEECH, V21, P476, DOI 10.1109/TASL.2012.2226156
Loizou PC, 2011, IEEE T AUDIO SPEECH, V19, P47, DOI 10.1109/TASL.2010.2045180
Ma WY, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P402
Mallat S., 1998, WAVELET TOUR SIGNAL
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Melia T., 2007, THESIS U COLL DUBLIN
Patterson R. D., 1988, 2341 MRC APPL PSYCH
Peharz R, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P249
Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007
Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005
Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463
Sawada H, 2011, IEEE T AUDIO SPEECH, V19, P516, DOI 10.1109/TASL.2010.2051355
Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006
Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003
Wang D. L., 2006, COMPUTATIONAL AUDITO, P1
Wang DLL, 1999, IEEE T NEURAL NETWOR, V10, P684, DOI 10.1109/72.761727
Weintraub M., 1985, THESIS STANFORD U
Wiener N., 1949, EXTRAPOLATION INTERP
Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI [10.1109/TSP.2004.828896, 10.1109/TSP.2004.V8896]
NR 41
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2014
VL 59
BP 22
EP 30
DI 10.1016/j.specom.2013.12.002
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AD8DI
UT WOS:000333496300003
ER
PT J
AU van der Zande, P
Jesse, A
Cutler, A
AF van der Zande, Patrick
Jesse, Alexandra
Cutler, Anne
TI Hearing words helps seeing words: A cross-modal word repetition effect
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech perception; Audiovisual speech; Word repetition priming;
Cross-modal priming
ID SPEECH-PERCEPTION; SPOKEN WORDS; RECOGNITION MEMORY; CONSONANT
RECOGNITION; TALKER VARIABILITY; VISUAL PROSODY; VOICE; INTELLIGIBILITY;
IDENTIFICATION; REPRESENTATION
AB Watching a speaker say words benefits subsequent auditory recognition of the same words. In this study, we tested whether hearing words also facilitates subsequent phonological processing from visual speech, and if so, whether speaker repetition influences the magnitude of this word repetition priming. We used long-term cross-modal repetition priming as a means to investigate the underlying lexical representations involved in listening to and seeing speech. In Experiment 1, listeners identified auditory-only words during exposure and visual-only words at test. Words at test were repeated or new and produced by the exposure speaker or a novel speaker. Results showed a significant effect of cross-modal word repetition priming but this was unaffected by speaker changes. Experiment 2 added an explicit recognition task at test. Listeners' lipreading performance was again improved by prior exposure to auditory words. Explicit recognition memory was poor, and neither word repetition nor speaker repetition improved it. This suggests that cross-modal repetition priming is neither mediated by explicit memory nor improved by speaker information. Our results suggest that phonological representations in the lexicon are shared across auditory and visual processing, and that speaker information is not transferred across modalities at the lexical level. (C) 2014 Elsevier B.V. All rights reserved.
C1 [van der Zande, Patrick; Cutler, Anne] Max Planck Inst Psycholinguist, NL-6500 AH Nijmegen, Netherlands.
[Jesse, Alexandra] Univ Massachusetts, Dept Psychol, Amherst, MA 01003 USA.
[Cutler, Anne] Univ Western Sydney, MARCS Inst, Penrith, NSW 2751, Australia.
RP van der Zande, P (reprint author), Max Planck Inst Psycholinguist, POB 310, NL-6500 AH Nijmegen, Netherlands.
EM P.Zande@gmail.com; AJesse@psych.umass.edu; A.Cutler@uws.edu.au
RI Cutler, Anne/C-9467-2012
CR Arnold P, 2001, BRIT J PSYCHOL, V92, P339, DOI 10.1348/000712601162220
Baayen R. H., 1993, CELEX LEXICAL DATABA
Bates D., 2007, LME4 LINEAR MIXED EF
BOND ZS, 1994, SPEECH COMMUN, V14, P325, DOI 10.1016/0167-6393(94)90026-4
Bradlow AR, 1999, PERCEPT PSYCHOPHYS, V61, P206, DOI 10.3758/BF03206883
Buchwald AB, 2009, LANG COGNITIVE PROC, V24, P580, DOI 10.1080/01690960802536357
CRAIK FIM, 1974, Q J EXP PSYCHOL, V26, P274, DOI 10.1080/14640747408400413
CREELMAN CD, 1957, J ACOUST SOC AM, V29, P655, DOI 10.1121/1.1909003
Cvejic E, 2012, COGNITION, V122, P442, DOI 10.1016/j.cognition.2011.11.013
Dodd B., 1989, VISIBLE LANG, V22, P58
Dohen M, 2004, SPEECH COMMUN, V44, P155, DOI 10.1016/j.specom.2004.10.009
Ellis A., 1982, CURR PSYCHOL RES REV, V2, P123
Ferguson SH, 2004, J ACOUST SOC AM, V116, P2365, DOI 10.1121/1.1788730
Foulkes P, 2006, J PHONETICS, V34, P409, DOI 10.1016/j.wocn.2005.08.002
Fowler CA, 2003, J MEM LANG, V49, P396, DOI 10.1016/S0749-596X(03)00072-X
Gagne J. P., 1994, J ACAD REHABIL AUDIO, V27, P135
Goldinger SD, 1996, J EXP PSYCHOL LEARN, V22, P1166, DOI 10.1037/0278-7393.22.5.1166
Grant KW, 1998, J ACOUST SOC AM, V103, P2677, DOI 10.1121/1.422788
Irwin JR, 2006, PERCEPT PSYCHOPHYS, V68, P582, DOI 10.3758/BF03208760
JACKSON A, 1984, MEM COGNITION, V12, P568, DOI 10.3758/BF03213345
Jesse A, 2014, Q J EXP PSYCHOL, V67, P793, DOI 10.1080/17470218.2013.834371
Jesse A, 2011, PSYCHON B REV, V18, P943, DOI 10.3758/s13423-011-0129-2
Jesse A, 2010, ATTEN PERCEPT PSYCHO, V72, P209, DOI 10.3758/APP.72.1.209
Kamachi M, 2003, CURR BIOL, V13, P1709, DOI 10.1016/j.cub.2003.09.005
Kim J, 2004, COGNITION, V93, pB39, DOI 10.1016/j.cognition.2003.11.003
Krahmer E, 2004, HUM COM INT, V7, P191
KRICOS PB, 1982, VOLTA REV, V84, P219
Lachs L, 2004, J EXP PSYCHOL HUMAN, V30, P378, DOI 10.1037/0096-1523.30.2.378
LADEFOGED P, 1980, LANGUAGE, V56, P485, DOI 10.2307/414446
Laver J, 1979, SOCIAL MARKERS SPEEC, P1
LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6
Luce PA, 1998, MEM COGNITION, V26, P708, DOI 10.3758/BF03211391
MACLEOD A, 1987, British Journal of Audiology, V21, P131, DOI 10.3109/03005368709077786
MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0
McLennan CT, 2003, J EXP PSYCHOL LEARN, V29, P539, DOI 10.1037/0278-7393.29.4.539
McQueen JM, 2006, COGNITIVE SCI, V30, P1113, DOI 10.1207/s15516709cog0000_79
MULLENNIX JW, 1989, J ACOUST SOC AM, V85, P365, DOI 10.1121/1.397688
Munhall KG, 2004, PSYCHOL SCI, V15, P133, DOI 10.1111/j.0963-7214.2004.01502010.x
Norris D, 2006, COGNITIVE PSYCHOL, V53, P146, DOI 10.1016/j.cogpsych.2006.03.001
NYGAARD LC, 1994, PSYCHOL SCI, V5, P42, DOI 10.1111/j.1467-9280.1994.tb00612.x
Nygaard LC, 1998, PERCEPT PSYCHOPHYS, V60, P355, DOI 10.3758/BF03206860
PALMERI TJ, 1993, J EXP PSYCHOL LEARN, V19, P309, DOI 10.1037/0278-7393.19.2.309
R Development Core Team, 2007, R LANG ENV STAT COMP
Reisberg D., 1987, HEARING EYE PSYCHOL, P97
Rosenblum LD, 2007, PSYCHOL SCI, V18, P392, DOI 10.1111/j.1467-9280.2007.01911.x
Rosenblum LD, 2008, CURR DIR PSYCHOL SCI, V17, P405, DOI 10.1111/j.1467-8721.2008.00615.x
SCHACTER DL, 1992, J EXP PSYCHOL LEARN, V18, P915, DOI 10.1037/0278-7393.18.5.915
Sheffert SM, 1998, MEM COGNITION, V26, P591, DOI 10.3758/BF03201165
SHEFFERT SM, 1995, J MEM LANG, V34, P665, DOI 10.1006/jmla.1995.1030
Strelnikov K, 2009, NEUROPSYCHOLOGIA, V47, P972, DOI 10.1016/j.neuropsychologia.2008.10.017
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
Summerfield Q, 1987, HEARING EYE PSYCHOL, P3
Sumner M, 2005, J MEM LANG, V52, P322, DOI 10.1016/j.jml.2004.11.004
van der Zande P, 2013, J ACOUST SOC AM, V134, P562, DOI 10.1121/1.4807814
VANSON N, 1994, J ACOUST SOC AM, V96, P1341, DOI 10.1121/1.411324
WALDEN BE, 1974, J SPEECH HEAR RES, V17, P270
Yakel DA, 2000, PERCEPT PSYCHOPHYS, V62, P1405, DOI 10.3758/BF03212142
Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X
NR 58
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2014
VL 59
BP 31
EP 43
DI 10.1016/j.specom.2014.01.001
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AD8DI
UT WOS:000333496300004
ER
PT J
AU Clapham, R
Middag, C
Hilgers, F
Martens, JP
van den Brekel, M
van Son, R
AF Clapham, Renee
Middag, Catherine
Hilgers, Frans
Martens, Jean-Pierre
van den Brekel, Michiel
van Son, Rob
TI Developing automatic articulation, phonation and accent assessment
techniques for speakers treated for advanced head and neck cancer
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic evaluation; Head and neck cancer; Perceptual evaluation;
Phonemic features; Phonological features; AMPEX
ID OROPHARYNGEAL CANCER; SUBSTITUTION VOICES; SPEECH; INTELLIGIBILITY;
QUALITY; DISORDERS; FEATURES; OUTCOMES
AB Purpose: To develop models for automatically assessing the articulation, phonation and accent of speakers with head and neck cancer (Experiment 1), and to investigate whether the models can track changes over time (Experiment 2).
Method: Several speech analysis methods for extracting a compact acoustic feature set that characterizes a speaker's speech are investigated. The effectiveness of a feature set for assessing a variable is measured by feeding it to a linear regression model and computing the mean difference between the outputs of that model for a set of recordings and the corresponding perceptual scores for the assessed variable (Experiment 1). The models are trained and tested on recordings of 55 speakers treated non-surgically for advanced oral cavity, pharynx and larynx cancer. The perceptual scores are average unscaled ratings from a group of 13 raters. The ability of the models to track changes in perceptual scores over time is also investigated (Experiment 2).
Results: Experiment 1 demonstrated that combinations of feature sets generally result in better models, that the best articulation model outperforms the average human rater, and that the best accent and phonation models are competitive with it. Scatter plots of computed and observed scores show, however, that low perceptual scores in particular are difficult to assess automatically. Experiment 2 showed that the articulation and phonation models have only variable success in tracking trends over time, and only for one of the time pairs are they deemed to compete with the average human rater. Nevertheless, there is a significant level of agreement between computed and observed trends when considering only a coarse classification of the trend into three classes: clearly positive, clearly negative and minor differences.
Conclusions: A baseline tool to support the multi-dimensional evaluation of speakers treated non-surgically for advanced head and neck cancer now exists. More work is required to further improve the models, particularly with respect to their ability to assess low-quality speech. (C) 2014 Elsevier B.V. All rights reserved.
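A minimal sketch of the evaluation recipe described under Method, assuming speaker-level feature vectors and perceptual ratings are already available: a linear regression model is fit from a feature set to the mean perceptual scores, and its cross-validated mean absolute error is compared against the average deviation of individual human raters from the mean score. The feature extraction itself (phonemic, phonological and AMPEX features) is not reproduced, and the variable names are illustrative.

# Sketch: compare a linear regression model's error against the average
# human rater's deviation for one assessed variable.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

def evaluate_feature_set(X, y_mean, rater_scores):
    """X: (n_speakers, D) acoustic features; y_mean: (n_speakers,) mean perceptual
    score; rater_scores: (n_speakers, n_raters) individual ratings."""
    pred = cross_val_predict(LinearRegression(), X, y_mean, cv=5)
    model_mae = np.mean(np.abs(pred - y_mean))
    rater_mae = np.mean(np.abs(rater_scores - y_mean[:, None]))  # average rater
    return model_mae, rater_mae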
C1 [Clapham, Renee; Hilgers, Frans; van den Brekel, Michiel; van Son, Rob] Univ Amsterdam, Amsterdam Ctr Language & Commun, NL-1012 VT Amsterdam, Netherlands.
[Clapham, Renee; Hilgers, Frans; van den Brekel, Michiel; van Son, Rob] Netherlands Canc Inst, NL-1066 CX Amsterdam, Netherlands.
[Middag, Catherine; Martens, Jean-Pierre] Univ Ghent, Multimedia Lab ELIS, B-9000 Ghent, Belgium.
RP Clapham, R (reprint author), Univ Amsterdam, Amsterdam Ctr Language & Commun, Spuistra 210, NL-1012 VT Amsterdam, Netherlands.
EM r.p.clapham@uva.nl; Catherine.Middag@UGent.be; f.hilgers@nki.nl;
martens@elis.ugent.be; M.W.M.vandenBrekel@uva.nl; r.v.son@nki.nl
FU Atos Medical (Horby, Sweden); Verwelius Foundation (Naarden, the
Netherlands)
FX Part of this research was funded by unrestricted research grants from
Atos Medical (Horby, Sweden) and the Verwelius Foundation (Naarden, the
Netherlands). All speech recordings were collected by Lisette van der
Molen, SLP PhD, and Irene Jacobi, PhD. The authors wish to thank Maya
van Rossum, SLP PhD for her input on collecting the perceptual scores.
Prof. Louis Pols is greatly acknowledged for his critical and
constructive review of the manuscript.
CR Breiman L., 1996, MACH LEARN, V24, P23
C Middag, 2011, P INT C SPOK LANG PR, P3005
Clapham RP, 2012, LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, P3350
De Bruijn M. J., 2011, SPEECH COMMUN, V54, P632
De Bodt MS, 2002, J COMMUN DISORD, V35, P283, DOI 10.1016/S0021-9924(02)00065-5
de Bruijn M, 2011, LOGOP PHONIATR VOCO, V36, P168, DOI 10.3109/14015439.2011.606227
de Bruijn MJ, 2009, FOLIA PHONIATR LOGO, V61, P180, DOI 10.1159/000219953
Haderlein T, 2007, EUR ARCH OTO-RHINO-L, V264, P1315, DOI 10.1007/s00405-007-0363-4
Jacobi I, 2010, EUR ARCH OTO-RHINO-L, V267, P1495, DOI 10.1007/s00405-010-1316-x
Jacobi I, 2013, ANN OTO RHINOL LARYN, V122, P754
Jacobi Irene, 2009, THESIS U AMSTERDAM
Kissine M., 2003, LINGUISTICS NETHERLA, P93
Maier A, 2009, SPEECH COMMUN, V51, P425, DOI 10.1016/j.specom.2009.01.004
Manfredi C, 2011, LOGOP PHONIATR VOCO, V36, P78, DOI 10.3109/14015439.2011.578077
Maryn Y, 2010, J COMMUN DISORD, V43, P161, DOI 10.1016/j.jcomdis.2009.12.004
Middag C, 2014, COMPUT SPEECH LANG, V28, P467, DOI 10.1016/j.csl.2012.10.007
Middag C., 2009, EURASIP J ADV SIGNAL
Middag C., 2010, P INT 2010
Moerman M, 2004, EUR ARCH OTO-RHINO-L, V261, P541, DOI 10.1007/s00405-003-0681-0
Moerman MBJ, 2006, EUR ARCH OTO-RHINO-L, V263, P183, DOI 10.1007/s00405-005-0960-z
Newman L., 2001, HEAD NECK-J SCI SPEC, V24, P68
Rietveld A. C. M., 1997, ALGEMENE FONETIEK
Schuurman I., 2003, P 4 INT WORKSH LANG, P340
Shrivastav R, 2005, J SPEECH LANG HEAR R, V48, P323, DOI 10.1044/1092-4388(2005/022)
Stouten F, 2006, INT CONF ACOUST SPEE, P329
van der Molen L, 2012, J VOICE, V26, pe25, DOI 10.1016/j.jvoice.2011.08.016
VANIMMERSEEL LM, 1992, J ACOUST SOC AM, V91, P3511, DOI 10.1121/1.402840
Van Nuffelen G, 2009, INT J LANG COMM DIS, V44, P716, DOI 10.1080/13682820802342062
Verdonck-de Leeuw I.M., 2007, HEAD NECK CANC TREAT, P27
NR 29
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2014
VL 59
BP 44
EP 54
DI 10.1016/j.specom.2014.01.003
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AD8DI
UT WOS:000333496300005
ER
PT J
AU Deng, F
Bao, F
Bao, CC
AF Deng, Feng
Bao, Feng
Bao, Chang-chun
TI Speech enhancement using generalized weighted beta-order spectral
amplitude estimator
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Auditory masking properties; Generalized weighted
spectral amplitude estimator; A priori SNR estimation
ID NOISE; MODELS
AB In this paper, a single-channel speech enhancement method based on a generalized weighted beta-order spectral amplitude estimator is proposed. First, we derive a new generalized weighted beta-order Bayesian spectral amplitude estimator, which takes full advantage of both the traditional perceptually weighted estimators and the beta-order spectral amplitude estimators and yields a flexible and effective gain function. Second, according to the masking properties of the human auditory system, an adaptive estimation method for the perceptual weighting order p is proposed, based on the criterion that inaudible noise may be masked rather than removed; thereby, the distortion of the enhanced speech is reduced. Third, based on the compressive nonlinearity of the cochlea, the spectral amplitude order beta can be interpreted as the compression rate of the spectral amplitude, and an adaptive calculation method for the parameter beta is proposed. In addition, because of its one-frame delay, the a priori SNR estimate of the decision-directed method is inaccurate during speech activity. To overcome this drawback, we present a new a priori SNR estimation method that combines a predicted estimate with the decision-directed rule. The subjective and objective test results indicate that the proposed Bayesian spectral amplitude estimator, combined with the proposed a priori SNR estimation method, achieves a larger segmental SNR improvement, lower log-spectral distortion and better speech quality than the reference methods. (C) 2014 Elsevier B.V. All rights reserved.
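For context, the following is a sketch of the classical decision-directed a priori SNR rule that the abstract builds on; the paper's combination with a predicted estimate and its adaptive p and beta parameters are not reproduced here, and the Wiener gain used to carry the previous-frame amplitude estimate is only one simple choice.

# Sketch of the classical decision-directed a priori SNR estimator
# (illustrative baseline, not the paper's proposed combination).
import numpy as np

def decision_directed_snr(Y_mag, noise_psd, alpha=0.98, snr_floor=1e-3):
    """Y_mag: (F, T) noisy magnitude spectrogram; noise_psd: (F,) noise power estimate."""
    F, T = Y_mag.shape
    xi = np.zeros((F, T))
    prev_amp2 = np.zeros(F)                       # |A_hat|^2 of the previous frame
    for t in range(T):
        gamma = (Y_mag[:, t] ** 2) / noise_psd    # a posteriori SNR
        xi_t = alpha * prev_amp2 / noise_psd + (1 - alpha) * np.maximum(gamma - 1, 0)
        xi[:, t] = np.maximum(xi_t, snr_floor)
        gain = xi[:, t] / (1 + xi[:, t])          # simple Wiener gain as an example
        prev_amp2 = (gain * Y_mag[:, t]) ** 2
    return xi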
C1 [Deng, Feng; Bao, Feng; Bao, Chang-chun] Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China.
RP Bao, CC (reprint author), Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China.
EM baochch@bjut.edu.cn
FU Beijing Natural Science Foundation program; Scientific Research Key
Program of Beijing Municipal Commission of Education [KZ201110005005];
National Natural Science Foundation of China [61072089]
FX This work was supported by the Beijing Natural Science Foundation
program and Scientific Research Key Program of Beijing Municipal
Commission of Education (Grant No. KZ201110005005), the National Natural
Science Foundation of China (Grant No. 61072089).
CR Abramson A, 2007, IEEE T AUDIO SPEECH, V15, P2348, DOI 10.1109/TASL.2007.904231
[Anonymous], 2001, REC P 862 PERC EV SP
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Chang JH, 2006, IEEE T SIGNAL PROCES, V54, P1965, DOI 10.1109/TSP.2006.874403
Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717
Deng F., 2011, 2011 INT C WIR COMM, P1
DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090
EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Gradshteyn I. S., 2000, TABLE INTEGRALS SERI
GREENWOOD DD, 1990, J ACOUST SOC AM, V87, P2592, DOI 10.1121/1.399052
ITU-T, 1993, REC P 56 OBJ MEAS AC
JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608
Loizou P.C., 2007, SPEECH ENHANCEMENT T, P213
Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929
MALAH D, 1999, ACOUST SPEECH SIG PR, P789
Moore BC., 2003, INTRO PSYCHOL HEARIN
Plourde E, 2008, IEEE T AUDIO SPEECH, V16, P1614, DOI 10.1109/TASL.2008.2004304
Poblete V, 2014, SPEECH COMMUN, V56, P19, DOI 10.1016/j.specom.2013.07.006
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
Robles L, 2001, PHYSIOL REV, V81, P1305
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
You CH, 2005, IEEE T SPEECH AUDI P, V13, P475, DOI 10.1109/TSA.2005.848883
You C.H., 2004, IEEE INT C AC SPEECH, V1, P725
You CH, 2006, SPEECH COMMUN, V48, P57, DOI 10.1016/j.specom.2005.05.012
Zhao DY, 2007, IEEE T AUDIO SPEECH, V15, P882, DOI 10.1109/TASL.2006.885256
NR 29
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2014
VL 59
BP 55
EP 68
DI 10.1016/j.specom.2014.01.002
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AD8DI
UT WOS:000333496300006
ER
PT J
AU Kanagasundaram, A
Dean, D
Sridharan, S
Gonzalez-Dominguez, J
Gonzalez-Rodriguez, J
Ramos, D
AF Kanagasundaram, A.
Dean, D.
Sridharan, S.
Gonzalez-Dominguez, J.
Gonzalez-Rodriguez, J.
Ramos, D.
TI Improving short utterance i-vector speaker verification using utterance
variance modelling and compensation techniques
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker verification; I-vector; PLDA; SN-LDA; SUVN; SUV
ID VARIABILITY
AB This paper proposes techniques to improve the performance of i-vector based speaker verification systems when only short utterances are available. Short-utterance i-vectors vary with the speaker, session variations, and the phonetic content of the utterance. Well-established methods such as linear discriminant analysis (LDA), source-normalized LDA (SN-LDA) and within-class covariance normalization (WCCN) exist for compensating the session variation, but we have identified the variability introduced by phonetic content due to utterance variation as an additional source of degradation when short-duration utterances are used. To compensate for utterance variations in short-utterance i-vector speaker verification systems using cosine similarity scoring (CSS), we introduce a short utterance variance normalization (SUVN) technique and a short utterance variance (SUV) modelling approach at the i-vector feature level. A combination of SUVN with LDA and SN-LDA is proposed to compensate the session and utterance variations and is shown to improve performance over the traditional approach of using LDA and/or SN-LDA followed by WCCN. An alternative approach using probabilistic linear discriminant analysis (PLDA) is also introduced to directly model the SUV. The combination of SUVN, LDA and SN-LDA followed by SUV PLDA modelling provides an improvement over the baseline PLDA approach. We also show that, for this combination of techniques, the utterance variation information needs to be artificially added to full-length i-vectors for PLDA modelling. (C) 2014 Elsevier B.V. All rights reserved.
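A minimal sketch of cosine similarity scoring of i-vectors after an LDA projection and length normalization, i.e. the baseline pipeline on which the proposed SUVN/SN-LDA compensation operates. The SUVN transform itself is not reproduced, and lda_matrix is assumed to be trained elsewhere.

# Sketch: cosine similarity scoring of two i-vectors after LDA projection
# and length normalization (baseline only; SUVN not included).
import numpy as np

def css_score(w_enrol, w_test, lda_matrix):
    def project(w):
        v = lda_matrix.T @ w
        return v / (np.linalg.norm(v) + 1e-12)    # length normalization
    return float(np.dot(project(w_enrol), project(w_test)))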
C1 [Kanagasundaram, A.; Dean, D.; Sridharan, S.] Queensland Univ Technol, SAIVT, Speech Res Lab, Brisbane, Qld 4001, Australia.
[Gonzalez-Dominguez, J.; Gonzalez-Rodriguez, J.; Ramos, D.] Univ Autonoma Madrid, ATVS Biometr Recognit Grp, E-28049 Madrid, Spain.
RP Kanagasundaram, A (reprint author), Queensland Univ Technol, SAIVT, Speech Res Lab, Brisbane, Qld 4001, Australia.
EM a.kanagasundaram@qut.edu.au; d.dean@qut.edu.au; s.sridharan@qut.edu.au;
javier.gonzalez@uam.es; joaquin.gonzalez@uam.es; daniel.ramos@uam.es
FU Australian Research Council (ARC) [LP130100110]; European Commission
Marie Curie ITN Bayesian Biometrics for Forensics (BBfor2) network;
Spanish Ministerio de Economia y Competitividad [TEC2012-37585-C02-01]
FX This project was supported by an Australian Research Council (ARC)
Linkage Grant LP130100110 and by the European Commission Marie Curie ITN
Bayesian Biometrics for Forensics (BBfor2) network and the Spanish
Ministerio de Economia y Competitividad under the project
TEC2012-37585-C02-01.
CR Dehak N., 2010, OD SPEAK LANG REC WO
Dehak N., 2010, IEEE T AUDIO SPEECH, P1
Dehak N., 2009, P INT C SPOK LANG PR, P1559
Garcia-Romero D, 2011, INTERSPEECH, P249
Hasan T., 2013, IEEE INT C AC SPEECH
Kanagasundaram A., 2011, P INTERSPEECH, P2341
Kanagasundaram A., 2012, P OD WORKSH
Kanagasundaram Ahilan, 2012, SPEAK LANG REC WORKS
KENNY P, 2006, IEEE OD 2006 SPEAK L, P1
Kenny P, 2008, IEEE T AUDIO SPEECH, V16, P980, DOI 10.1109/TASL.2008.925147
Kenny P, 2005, JOINT FACTOR ANAL SP
Kenny P, 2010, P OD SPEAK LANG REC
Kenny P., 2013, IEEE INT C AC SPEECH
McLaren M, 2011, INT CONF ACOUST SPEE, P5456
McLaren M, 2011, INT CONF ACOUST SPEE, P5460
McLaren M., 2010, P OD WORKSH
McLaren M, 2012, IEEE T AUDIO SPEECH, V20, P755, DOI 10.1109/TASL.2011.2164533
NIST, 2008, NIST YEAR 2008 SPEAK
NIST, 2010, NIST YEAR 2010 SPEAK
Shum S., 2010, P OD
Vogt R., 2008, INT 2008 BRISB AUSTR
VOGT R, 2008, OD SPEAK LANG REC WO
Vogt R, 2008, COMPUT SPEECH LANG, V22, P17, DOI 10.1016/j.csl.2007.05.003
Zhao XY, 2009, INT CONF ACOUST SPEE, P4049
NR 24
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2014
VL 59
BP 69
EP 82
DI 10.1016/j.specom.2014.01.004
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA AD8DI
UT WOS:000333496300007
ER
PT J
AU Jeong, Y
AF Jeong, Yongwon
TI Joint speaker and environment adaptation using Tensor Voice for robust
speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Acoustic model adaptation; Environment adaptation; Speaker adaptation;
Speech recognition; Tensor analysis
ID HIDDEN MARKOV-MODELS; MAXIMUM-LIKELIHOOD; NOISE
AB We present an adaptation of a hidden Markov model (HMM)-based automatic speech recognition system to the target speaker and noise environment. Given HMMs built from various speakers and noise conditions, we build tensorvoices that capture the interaction between the speaker and noise by using a tensor decomposition. We express the updated model for the target speaker and noise environment as a product of the tensorvoices and two weight vectors, one each for the speaker and noise. An iterative algorithm is presented to determine the weight vectors in the maximum likelihood (ML) framework. With the use of separate weight vectors, the tensorvoice approach can adapt to the target speaker and noise environment differentially, whereas the eigenvoice approach, which is based on a matrix decomposition technique, cannot differentially adapt to those two factors. In supervised adaptation tests using the AURORA4 corpus, the relative improvement of performance obtained by the tensorvoice method over the eigenvoice method is approximately 10% on average for adaptation data of 6-24 s in length, and the relative improvement of performance obtained by the tensorvoice method over the maximum likelihood linear regression (MLLR) method is approximately 5.4% on average for adaptation data of 6-18 s in length. Therefore, the tensorvoice approach provides an effective means of jointly adapting to the target speaker and noise environment for robust speech recognition. (C) 2013 Elsevier B.V. All rights reserved.
C1 Pusan Natl Univ, Sch Elect Engn, Pusan 609735, South Korea.
RP Jeong, Y (reprint author), Pusan Natl Univ, Sch Elect Engn, Pusan 609735, South Korea.
EM jeongy@pusan.ac.kr
CR Acero A., 2000, P ICSLP, P869
Anastasakos T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1137
CARROLL JD, 1970, PSYCHOMETRIKA, V35, P283, DOI 10.1007/BF02310791
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Gales MJF, 1998, SPEECH COMMUN, V25, P49, DOI 10.1016/S0167-6393(98)00029-6
Gales MJF, 2000, IEEE T SPEECH AUDI P, V8, P417, DOI 10.1109/89.848223
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Jeong Y, 2010, INT CONF ACOUST SPEE, P4870, DOI 10.1109/ICASSP.2010.5495117
Jeong Y, 2011, IEEE SIGNAL PROC LET, V18, P347, DOI 10.1109/LSP.2011.2136335
Jolliffe I. T., 2002, PRINCIPAL COMPONENT, V2nd
Kolda TG, 2009, SIAM REV, V51, P455, DOI 10.1137/07070111X
Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308
Lathauwer LD, 2000, SIAM J MATRIX ANAL A, V21, P1253
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Li J, 2007, PROCEEDINGS OF THE 3RD INTERNATIONAL YELLOW RIVER FORUM ON SUSTAINABLE WATER RESOURCES MANAGEMENT AND DELTA ECOSYSTEM MAINTENANCE, VOL VI, P65
Lu HP, 2008, IEEE T NEURAL NETWOR, V19, P18, DOI 10.1109/TNN.2007.901277
NGUYEN P, 1999, P EUROSPEECH, P2519
O'Shaughnessy D, 2008, PATTERN RECOGN, V41, P2965, DOI 10.1016/j.patcog.2008.05.008
Pallett D.S., 1994, P HUM LANG TECHN WOR, P49, DOI 10.3115/1075812.1075824
Parihar N., 2002, AURORA WORKING GROUP
PAUL DB, 1992, P WORKSH SPEECH NAT, P357, DOI 10.3115/1075527.1075614
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
Rigazio L., 2001, P INTERSPEECH, P2347
Sagayama S., 1997, P IEEE WORKSH AUT SP, P396
Seltzer M. L., 2011, P INTERSPEECH, P1097
Seltzer M. L., 2011, P IEEE WORKSH AUT SP, P146
SIROVICH L, 1987, J OPT SOC AM A, V4, P519, DOI 10.1364/JOSAA.4.000519
Tsao Y, 2009, IEEE T AUDIO SPEECH, V17, P1025, DOI 10.1109/TASL.2009.2016231
TURK M, 1991, J COGNITIVE NEUROSCI, V3, P71, DOI 10.1162/jocn.1991.3.1.71
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
Vasilescu M. A. O., 2007, P ICCV, P1
Vasilescu M.A.O., 2002, LNCS, V2350, P447
Vasilescu MAO, 2002, INT C PATT RECOG, P511
Vasilescu MAO, 2005, PROC CVPR IEEE, P547
Vlasic D, 2005, ACM T GRAPHIC, V24, P426, DOI 10.1145/1073204.1073209
Wang YQ, 2012, IEEE T AUDIO SPEECH, V20, P2149, DOI 10.1109/TASL.2012.2198059
NR 37
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2014
VL 58
BP 1
EP 10
DI 10.1016/j.specom.2013.10.001
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 293FE
UT WOS:000329956000001
ER
PT J
AU De Looze, C
Scherer, S
Vaughan, B
Campbell, N
AF De Looze, Celine
Scherer, Stefan
Vaughan, Brian
Campbell, Nick
TI Investigating automatic measurements of prosodic accommodation and its
dynamics in social interaction
SO SPEECH COMMUNICATION
LA English
DT Article
DE Prosodic accommodation; Dynamics; Interactional conversation;
Information exchange; Speakers' involvement and affinity
ID CROSS-LANGUAGE; ADULT SPEECH; MIMICRY; COMMUNICATION; CONVERSATION;
COORDINATION; CONVERGENCE; PERCEPTION; SYNCHRONY; DESIRABILITY
AB Spoken dialogue systems are increasingly being used to facilitate and enhance human communication. While these interactive systems can process the linguistic aspects of human communication, they are not yet capable of processing the complex dynamics involved in social interaction, such as the adaptation on the part of interlocutors. Providing interactive systems with the capacity to process and exhibit this accommodation could however improve their efficiency and make machines more socially-competent interactants.
At present, no automatic system is available to process prosodic accommodation, nor do any clear measures exist that quantify its dynamic manifestation. While it can be observed to be a monotonically manifest property, it is our hypothesis that it evolves dynamically with functional social aspects.
In this paper, we propose an automatic system for its measurement and the capture of its dynamic manifestation. We investigate the evolution of prosodic accommodation in 41 Japanese dyadic telephone conversations and discuss its manifestation in relation to its functions in social interaction. Overall, our study shows that prosodic accommodation changes dynamically over the course of a conversation and across conversations, and that these dynamics inform about the naturalness of the conversation flow, the speakers' degree of involvement and their affinity in the conversation. (C) 2013 Elsevier B.V. All rights reserved.
C1 [De Looze, Celine; Vaughan, Brian; Campbell, Nick] Trinity Coll Dublin, Speech Commun Lab, Dublin 2, Ireland.
[Scherer, Stefan] Univ So Calif, Inst Creat Technol, Los Angeles, CA 90089 USA.
RP De Looze, C (reprint author), Trinity Coll Dublin, Speech Commun Lab, 7-9 South Leinster St, Dublin 2, Ireland.
EM deloozec@tcd.ie
FU Science Foundation Ireland (SFI) [09/IN.I/12631]; FASTNET project -
Focus on Action in Social Talk: Network Enabling Technology
FX This work was undertaken as part of the FASTNET project - Focus on
Action in Social Talk: Network Enabling Technology funded by Science
Foundation Ireland (SFI) 09/IN.1/I2631.
CR Agarwal S.K., 2011, 2 INT WORKSH INT US
Apple Inc, 2011, APPL SIRI HOM
Aubanel V, 2010, SPEECH COMMUN, V52, P577, DOI 10.1016/j.specom.2010.02.008
Babel M., 2011, LANG SPEECH, V55, P231
BAILLY G, 2010, INT 2010, P1153
Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
BAVELAS JB, 1986, J PERS SOC PSYCHOL, V50, P322, DOI 10.1037/0022-3514.50.2.322
Bell L., 2003, P ICPHS CIT, V3, P833
Bernieri F.J., 1991, FUNDAMENTALS NONVERB, P401
Black JW, 1949, J SPEECH HEAR DISORD, V14, P16
Boersma P., 2006, PRAAT DOING PHONETIC
Boylan P., 2004, TECHNICAL REPORT
Branigan HP, 2010, J PRAGMATICS, V42, P2355, DOI 10.1016/j.pragma.2009.12.012
Breazeal C, 2002, INT J ROBOT RES, V21, P883, DOI 10.1177/0278364902021010096
Brennan S, 1996, P INT S SPOK DIAL, P41
Burgoon J.K., 1995, NUMBER CAMBRIDGE
Campbell N., 2004, P LANG RES EV C LREC, P183
Chartrand TL, 1999, J PERS SOC PSYCHOL, V76, P893, DOI 10.1037//0022-3514.76.6.893
Cleland AA, 2003, J MEM LANG, V49, P214, DOI 10.1016/S0749-596X(03)00060-3
Collins B., 1998, 5 INT C SPOK LANG PR
CONDON WS, 1974, CHILD DEV, V45, P456, DOI 10.2307/1127968
Coulston R., 2002, P 7 INT C SPOK LANG, V4, P2689
CROWNE DP, 1960, J CONSULT PSYCHOL, V24, P349, DOI 10.1037/h0047358
De Looze C., 2010, THESIS U PROVENCE
de Looze C., 2011, P 17 INT C PHON SCI, P1294
De Looze C., 2011, P INT, P1393
de Jong NH, 2009, BEHAV RES METHODS, V41, P385, DOI 10.3758/BRM.41.2.385
Delvaux V, 2007, PHONETICA, V64, P145, DOI 10.1159/000107914
Edlund J., 2009, 10 ANN C INT SPEECH, P2779
FERGUSON CA, 1975, ANTHROPOL LINGUIST, V17, P1
FERNALD A, 1989, J CHILD LANG, V16, P477
Gallois C., 1991, CONTEXTS ACCOMMODATI, P245, DOI 10.1017/CBO9780511663673.008
GALLOIS C, 1988, LANG COMMUN, V8, P271, DOI 10.1016/0271-5309(88)90022-5
Garrod Simon C., 2006, RES LANGUAGE COMPUTA, V4, P203, DOI DOI 10.1007/S11168-006-9004-0
Giles H., 1991, CONTEXTS ACCOMMODATI, P1, DOI DOI 10.1017/CBO9780511663673
GOLDMANEISLER F, 1961, LANG SPEECH, V4, P171
Goldman-Eisler F., 1968, PSYCHOLINGUISTICS EX
Google, 2011, GOOGL VOIC SEARCH
Gregory SW, 1997, J NONVERBAL BEHAV, V21, P23
GREGORY S, 1993, LANG COMMUN, V13, P195, DOI 10.1016/0271-5309(93)90026-J
GREGORY SW, 1982, J PSYCHOLINGUIST RES, V11, P35, DOI 10.1007/BF01067500
GROSJEAN F, 1975, PHONETICA, V31, P144
Haywood SL, 2005, PSYCHOL SCI, V16, P362, DOI 10.1111/j.0956-7976.2005.01541.x
Heldner J., 2010, P INT 2010, P1
Hess U, 2001, INT J PSYCHOPHYSIOL, V40, P129, DOI 10.1016/S0167-8760(00)00161-6
Jaffe J., 2001, RHYTHMS DIALOGUE INF
Jaffe J., 1970, RHYTHMS DIALOGUE
Juslin P. N., 2005, NEW HDB METHODS NONV
Kleinberger T, 2007, LECT NOTES COMPUT SC, V4555, P103
Kopp S, 2010, SPEECH COMMUN, V52, P587, DOI 10.1016/j.specom.2010.02.007
Kousidis S., 2009, P SPECOM 2009 ST PET, P2
Kousidis S., 2008, P INT
Lakin JL, 2003, PSYCHOL SCI, V14, P334, DOI 10.1111/1467-9280.14481
Lee CC, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P793
Levelt W. J. M., 1982, PROCESSES BELIEFS QU, P199
Levitan R., P ACL 2011, P113
Levitan R., 2011, INTERSPEECH 2011, P3081
Levitan R., 2011, 12 ANN C INT SPEECH
Lu H, 2011, LECT NOTES COMPUT SC, V6696, P188
Maganti H.K., 2007, P IEEE INT C AC SPEE, V4
MATARAZZO JOSEPH D., 1967, J EXP RES PERSONALITY, V2, P56
MAURER RE, 1983, J COUNS PSYCHOL, V30, P158, DOI 10.1037/0022-0167.30.2.158
McGarva AR, 2003, J PSYCHOLINGUIST RES, V32, P335, DOI 10.1023/A:1023547703110
MELTZER L, 1971, J PERS SOC PSYCHOL, V18, P392, DOI 10.1037/h0030993
MELTZOFF AN, 1977, SCIENCE, V198, P75, DOI 10.1126/science.198.4312.75
Miles LK, 2009, J EXP SOC PSYCHOL, V45, P585, DOI 10.1016/j.jesp.2009.02.002
Mondada L., 2001, MARGES LINGUISTIQUES, V1, P1
NATALE M, 1975, J PERS SOC PSYCHOL, V32, P790, DOI 10.1037/0022-3514.32.5.790
Nenkova A., 2008, P 46 ANN M ASS COMP, P169, DOI 10.3115/1557690.1557737
Nishimura R, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P534
OHALA JJ, 1983, PHONETICA, V40, P1
Oviatt S, 1996, IEEE MULTIMEDIA, V3, P26, DOI 10.1109/93.556458
Pardo JS, 2006, J ACOUST SOC AM, V119, P2382, DOI 10.1121/1.2178720
Parrill F, 2006, J NONVERBAL BEHAV, V30, P157, DOI 10.1007/s10919-006-0014-2
Pentland A.S., 2008, HONEST SIGNALS THEY
Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169
Pickering MJ, 2008, PSYCHOL BULL, V134, P427, DOI 10.1037/0033-2909.134.3.427
Putman W.B., 1984, INT J SOCIOL LANG, V1984, P97, DOI 10.1515/ijsl.1984.46.97
Ramseyer F, 2010, LECT NOTES COMPUT SC, V5967, P182, DOI 10.1007/978-3-642-12397-9_15
Richardson MJ, 2007, HUM MOVEMENT SCI, V26, P867, DOI 10.1016/j.humov.2007.07.002
Rumsey F., 2002, SOUND RECORDING INTR
SCHERER KR, 1994, J PERS SOC PSYCHOL, V66, P310, DOI 10.1037/0022-3514.66.2.310
Shepard C. A., 2001, NEW HDB LANGUAGE SOC, P33
Shockley K, 2009, TOP COGN SCI, V1, P305, DOI 10.1111/j.1756-8765.2009.01021.x
Shockley K, 2007, J EXP PSYCHOL HUMAN, V33, P201, DOI 10.1037/0096-1523.33.1.201
Smith C. L., 2007, P 16 INT C PHON SCI, P313
Stanford G.W., 1996, J PERS SOC PSYCHOL, V70, P1231
Street Jr Richard L., 1983, LANG SCI, V5, P79
Suzuki N, 2007, CONNECT SCI, V19, P131, DOI 10.1080/09540090701369125
Tickle-Degnen L, 1990, PSYCHOL INQ, V1, P285, DOI DOI 10.1207/S15327965PLI0104_1
VANSUMMERS W, 1988, J ACOUST SOC AM, V84, P917
Vaughan B., 2011, P INT 2011, P1865
Vinciarelli A, 2009, IEEE SIGNAL PROC MAG, V26, P133, DOI 10.1109/MSP.2009.933382
Ward Diane, 2007, ISCA TUT RES WORKSH, P4
Ward N., 2002, 7 INT C SPOK LANG PR
Webb J. T., 1972, STUDIES DYADIC COMMU, P115
WELKOWIT.J, 1973, J CONSULT CLIN PSYCH, V41, P472, DOI 10.1037/h0035328
Woodall G.W., 1983, J NONVERBAL BEHAV, V8, P126
ZEBROWITZ LA, 1992, J NONVERBAL BEHAV, V16, P143, DOI 10.1007/BF00988031
ZEINE L, 1988, J COMMUN DISORD, V21, P373, DOI 10.1016/0021-9924(88)90022-6
Zhou J., 2012, GERONTOLOGIST, P73
Zuengler J., 1991, CONTEXTS ACCOMMODATI, P223, DOI 10.1017/CBO9780511663673.007
NR 102
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2014
VL 58
BP 11
EP 34
DI 10.1016/j.specom.2013.10.002
PG 24
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 293FE
UT WOS:000329956000002
ER
PT J
AU Lu, CT
AF Lu, Ching-Ta
TI Reduction of musical residual noise using block-and-directional-median
filter adapted by harmonic properties
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Spectral subtraction; Musical residual noise;
Block-and-directional median filter; Post-processing; Median filter
ID SPEECH ENHANCEMENT; MASKING PROPERTIES; TRANSFORM
AB Many speech enhancement systems can efficiently remove background noise. However, most of them suffer from musical residual noise, which is very annoying to the human ear. This study proposes a post-processing system to efficiently reduce the effect of musical residual noise, improving the quality of the enhanced speech. Noisy speech is firstly enhanced by a speech enhancement algorithm to reduce background noise. The enhanced speech is then post-processed by a block-and-directional-median (BDM) filter adapted by harmonic properties, so that the musical effect of residual noise is efficiently reduced. In the case of a speech-like spectrum, directional-median filtering is performed to slightly reduce the musical effect of residual noise, while the strong harmonic spectrum of a vowel is well maintained. The quality of post-processed speech is then ensured. In contrast, block-median filtering is performed to greatly reduce the spectral variation in noise-dominant regions, enabling the spectral peaks of musical tones to be significantly smoothed. The musical effect of residual noise is therefore reduced. Finally, the preprocessed and post-processed spectra are integrated according to the speech-presence probability. Experimental results show that the proposed post-processor can efficiently improve the performance of a speech enhancement system by reducing the musical effect of residual noise. (C) 2013 Elsevier B.V. All rights reserved.
C1 Asia Univ, Dept Informat Commun, Taichung 41354, Taiwan.
RP Lu, CT (reprint author), Asia Univ, Dept Informat Commun, 500 Lioufeng Rd, Taichung 41354, Taiwan.
EM lucas1@ms26.hinet.net
FU National Science Council, Taiwan [NSC 101-2221-E-468-010]
FX This research was supported by the National Science Council, Taiwan,
under contract number NSC 101-2221-E-468-010. We thank the reviewers for
their valuable comments, which greatly improved the quality of this paper.
Our gratitude also goes to Dr. Timothy Williams (Asia University) for
his help in English proofreading.
CR Cho E, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4569
Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717
Ding HJ, 2009, SPEECH COMMUN, V51, P259, DOI 10.1016/j.specom.2008.09.003
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Esch T, 2009, INT CONF ACOUST SPEE, P4409, DOI 10.1109/ICASSP.2009.4960607
Ghanbari Y, 2006, SPEECH COMMUN, V48, P927, DOI 10.1016/j.specom.2005.12.002
Goh Z, 1998, IEEE T SPEECH AUDI P, V6, P287
Hsu CC, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4001
Hu Y, 2004, IEEE SIGNAL PROC LET, V11, P270, DOI 10.1109/LSP.2003.821714
Ibne M.N., 2003, P IEEE INT S SIGN PR, P749
ITU-T, 2001, INT TEL UN P
ITU-T, 2003, INT TEL UN P
Jensen J, 2012, IEEE T AUDIO SPEECH, V20, P92, DOI 10.1109/TASL.2011.2157685
Jensen J.R., 2012, IEEE T AUDIO SPEECH, V20, P948
Jin W, 2010, IEEE T AUDIO SPEECH, V18, P356, DOI 10.1109/TASL.2009.2028916
Jo S, 2010, IEEE T AUDIO SPEECH, V18, P2099, DOI 10.1109/TASL.2010.2041119
KLEIN M, 2002, ACOUST SPEECH SIG PR, P537
Leitner C., 2012, P INT C SYST SIGN IM, P464
Lu C-T, 2007, DIGIT SIGNAL PROCESS, V17, P171
Lu CT, 2003, SPEECH COMMUN, V41, P409, DOI 10.1016/S0167-6393(03)00011-6
Lu CT, 2010, COMPUT SPEECH LANG, V24, P632, DOI 10.1016/j.csl.2009.09.001
Lu C.-T., 2012, P IEEE SPRING WORLD, V2, P458
Lu CT, 2007, PATTERN RECOGN LETT, V28, P1300, DOI 10.1016/j.patrec.2007.03.001
Lu CT, 2004, ELECTRON LETT, V40, P394, DOI 10.1049/el:20040266
Lu C.-T., 2011, P IEEE INT C COMP SC, V3, P475
Lu CT, 2011, SPEECH COMMUN, V53, P495, DOI 10.1016/j.specom.2010.11.008
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Miyazaki R, 2012, IEEE T AUDIO SPEECH, V20, P2080, DOI 10.1109/TASL.2012.2196513
Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621
Rix A., 2001, P IEEE INT C AC SPEE, P749
Shimamura T, 2001, IEEE T SPEECH AUDI P, V9, P727, DOI 10.1109/89.952490
Udrea RM, 2008, SIGNAL PROCESS, V88, P1293
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
Yu HJ, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4573
NR 34
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2014
VL 58
BP 35
EP 48
DI 10.1016/j.specom.2013.11.002
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 293FE
UT WOS:000329956000003
ER
PT J
AU Schwerin, B
Paliwal, K
AF Schwerin, Belinda
Paliwal, Kuldip
TI Using STFT real and imaginary parts of modulation signals for MMSE-based
speech enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Modulation domain; Analysis-modification-synthesis (AMS); Speech
enhancement; MMSE short-time spectral magnitude estimator; Modulation
spectrum; Modulation magnitude spectrum; Modulation signal; Modulation
spectral subtraction
ID SPECTRAL AMPLITUDE ESTIMATOR; SUBTRACTION; NOISE
AB In this paper we investigate an alternate, RI-modulation (R = real, I = imaginary) AMS framework for speech enhancement, in which the real and imaginary parts of the modulation signal are processed in secondary AMS procedures. This framework offers theoretical advantages over the previously proposed modulation AMS frameworks in that noise is additive in the modulation signal and noisy acoustic phase is not used to reconstruct speech. Using MMSE magnitude estimation to modify modulation magnitude spectra, initial experiments presented in this work evaluate whether these advantages translate into improvements in processed speech quality. The effect of speech presence uncertainty and log-domain processing on MMSE magnitude estimation in the RI-modulation framework is also investigated. Finally, a comparison of different enhancement approaches applied in the RI-modulation framework is presented. Using subjective and objective experiments as well as spectrogram analysis, we show that RI-modulation MMSE magnitude estimation with speech presence uncertainty produces stimuli that listeners prefer over the other RI-modulation types. In comparison to similar approaches in the modulation AMS framework, results showed that the theoretical advantages of the RI-modulation framework did not translate to an improvement in overall quality, with both frameworks yielding very similar-sounding stimuli, but a clear improvement (compared to the corresponding modulation AMS based approach) in speech intelligibility was found. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Schwerin, Belinda; Paliwal, Kuldip] Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia.
RP Schwerin, B (reprint author), Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia.
EM b.schwerin@griffith.edu.au; k.paliwal@griffith.edu.au
CR Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Paliwal K, 2011, SPEECH COMMUN, V53, P327, DOI 10.1016/j.specom.2010.10.004
Paliwal K., 1987, P IEEE INT C AC SPEE, V12, P297
Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004
Paliwal K, 2012, SPEECH COMMUN, V54, P282, DOI 10.1016/j.specom.2011.09.003
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
Quatieri T. F., 2002, DISCRETE TIME SPEECH
Rix A., 2001, PERCEPTUAL EVALUATIO
Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1
Vary P, 2006, DIGITAL SPEECH TRANS
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
Wiener N., 1949, EXTRAPOLATION INTERP
Zhang Y, 2013, SPEECH COMMUN, V55, P509, DOI 10.1016/j.specom.2012.09.005
Zhang Y, 2011, INT CONF ACOUST SPEE, P4744
NR 23
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2014
VL 58
BP 49
EP 68
DI 10.1016/j.specom.2013.11.001
PG 20
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 293FE
UT WOS:000329956000004
ER
PT J
AU Lee, K
AF Lee, Kyogu
TI Application of non-negative spectrogram decomposition with sparsity
constraints to single-channel speech enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Single-channel speech enhancement; Non-negative spectrogram
decomposition; Sparsity constraint; Unsupervised source separation
ID NOISE
AB We propose an algorithm for single-channel speech enhancement that requires no pre-trained models - neither speech nor noise models - using non-negative spectrogram decomposition with sparsity constraints. To this end, before starting the EM algorithm for spectrogram decomposition, we divide the spectral basis vectors into two disjoint groups - speech and noise groups - and impose sparsity constraints only on those in the speech group as we update the parameters. After the EM algorithm converges, the proposed algorithm successfully separates speech from noise, and no post-processing is required for speech reconstruction. Experiments with various types of real-world noises show that the proposed algorithm achieves performance significantly better than other classical algorithms or comparable to the spectrogram decomposition method using pre-trained noise models. (C) 2013 Elsevier B.V. All rights reserved.
C1 Seoul Natl Univ, Grad Sch Convergence Sci & Technol, Mus & Audio Res Grp, Seoul, South Korea.
RP Lee, K (reprint author), Seoul Natl Univ, Grad Sch Convergence Sci & Technol, Mus & Audio Res Grp, 1 Gwanak Ro, Seoul, South Korea.
EM kglee@snu.ac.kr
FU MSIP (Ministry of Science, ICT and Future Planning), Korea, under the
ITRC (Information Technology Research Center) [NIPA-2013-H0301-13-4005];
NIPA (National IT Industry Promotion Agency)
FX This research was supported by the MSIP (Ministry of Science, ICT and
Future Planning), Korea, under the ITRC (Information Technology Research
Center) support program (NIPA-2013-H0301-13-4005) supervised by the NIPA
(National IT Industry Promotion Agency).
CR Brand M, 1999, NEURAL COMPUT, V11, P1155, DOI 10.1162/089976699300016395
Brand M.E., 1999, UNCERTAINTY 99
Duan Z., 2012, P LVA ICA
Duan Z., 2012, P INTERSPEECH
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
Fevotte C, 2009, NEURAL COMPUT, V21, P793, DOI 10.1162/neco.2008.04-08-771
FIELD DJ, 1994, NEURAL COMPUT, V6, P559, DOI 10.1162/neco.1994.6.4.559
Gaussier E., 2005, P 28 ANN INT ACM SIG, P601, DOI DOI 10.1145/1076034.1076148
Hofmann Thomas, 1999, P 15 C UNC ART INT U
Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458
Joder C., 2012, P LVA ICA
Kamath S., 2002, P ICASSP
Lauberg H., 2008, P IEEE AS C SIGN SYS
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Raj B., 2005, P IEEE WORKSH APPL S
Rix A., 2012, P ICASSP, P749
Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199
Shashanka M., 2007, P IEEE C AC SPEECH S
Shashanka M., 2007, P NIPS
Smaragdis P., 2006, P NIPS
Smaragdis P., 2008, P IEEE C AC SPEECH S
Vincent E, 2006, IEEE T AUDIO SPEECH, V14, P1462, DOI 10.1109/TSA.2005.858005
NR 22
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2014
VL 58
BP 69
EP 80
DI 10.1016/j.specom.2013.11.008
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 293FE
UT WOS:000329956000005
ER
PT J
AU Morrison, GS
Lindh, J
Curran, JM
AF Morrison, Geoffrey Stewart
Lindh, Jonas
Curran, James M.
TI Likelihood ratio calculation for a disputed-utterance analysis with
limited available data
SO SPEECH COMMUNICATION
LA English
DT Article
DE Forensic; Likelihood ratio; Disputed utterance; Reliability; Hotelling's
T-2; Posterior predictive density
ID ELEMENTAL COMPOSITION MEASUREMENTS; FORENSIC GLASS EVIDENCE; RELIABILITY
AB We present a disputed-utterance analysis using relevant data, quantitative measurements and statistical models to calculate likelihood ratios. The acoustic data were taken from an actual forensic case in which the amount of data available to train the statistical models was small and the data point from the disputed word was far out on the tail of one of the modelled distributions. A procedure based on single multivariate Gaussian models for each hypothesis led to an unrealistically high likelihood ratio value with extremely poor reliability, but a procedure based on Hotelling's T-2 statistic and a procedure based on calculating a posterior predictive density produced more acceptable results. The Hotelling's T-2 procedure attempts to take account of the sampling uncertainty of the mean vectors and covariance matrices due to the small number of tokens used to train the models, and the posterior-predictive-density analysis integrates out the values of the mean vectors and covariance matrices as nuisance parameters. Data scarcity is common in forensic speech science and we argue that it is important not to accept extremely large calculated likelihood ratios at face value, but to consider whether such values can be supported given the size of the available data and modelling constraints. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Morrison, Geoffrey Stewart] Univ New S Wales, Sch Elect Engn & Telecommun, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia.
[Lindh, Jonas] Univ Gothenburg, Sahlgrenska Acad, Inst Neurosc & Physiol, Dept Clin Neurosci & Rehabil,Div Speech & Languag, SE-40530 Gothenburg, Sweden.
[Curran, James M.] Univ Auckland, Dept Stat, Auckland 1142, New Zealand.
RP Morrison, GS (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia.
EM geoff-morrison@forensic-evaluation.net
FU Australian Research Council; Australian Federal Police; New South Wales
Police; Queensland Police; National Institute of Forensic Science;
Australasian Speech Science and Technology Association; Guardia Civil
through Linkage Project [LP100200142]; US National Institute of Justice
[2011-DN-BX-K541]
FX Morrison's contribution to this research was supported by the Australian
Research Council, Australian Federal Police, New South Wales Police,
Queensland Police, National Institute of Forensic Science, Australasian
Speech Science and Technology Association, and the Guardia Civil through
Linkage Project LP100200142. Unless otherwise explicitly attributed, the
opinions expressed in this paper are those of the authors and do not
necessarily represent the policies or opinions of any of the above
mentioned organizations. Curran's contribution to this research was
supported by grant 2011-DN-BX-K541 from the US National Institute of
Justice. Points of view in this document are those of the authors and do
not necessarily represent the official position or policies of the US
Department of Justice.
CR Anderson N, 1978, MODERN SPECTRUM ANAL, P252
Bernardo JM, 1994, BAYESIAN THEORY
Boersma P., 2008, PRAAT DOING PHONETIC
Champod C., 2000, FORENSIC LINGUIST, V7, P238, DOI 10.1558/sll.2000.7.2.238
Curran J. M., 2005, LAW PROBABILITY RISK, V4, P115, DOI [10.1093/1pr/mgi009, DOI 10.1093/LPR/MGI009, 10 1093/Ipr/mgi009]
Curran JM, 1997, SCI JUSTICE, V37, P241, DOI 10.1016/S1355-0306(97)72197-X
Curran JM, 1997, SCI JUSTICE, V37, P245, DOI 10.1016/S1355-0306(97)72198-1
Curran JM, 2002, SCI JUSTICE, V42, P29, DOI 10.1016/S1355-0306(02)71794-2
Genz A, 2012, MVTNORM MULTIVARIATE
Genz A., 2009, LECT NOTES STAT, V195, DOI [DOI 10.1007/978-3-642-01689-9, 10.1007/978-3-642-01689-9]
Gosset WS., 1908, BIOMETRIKA, V6, P1, DOI DOI 10.1093/BIOMET/6.1.1
Hotelling H, 1931, ANN MATH STAT, V2, P360, DOI 10.1214/aoms/1177732979
Kaye David H., 2009, CORNELL JL PUB POLY, V19, P145
Lagarias JC, 1998, SIAM J OPTIMIZ, V9, P112, DOI 10.1137/S1052623496303470
Lindh Jonas, 2009, Proceedings of the 2009 5th IEEE International Conference on e-Science (e-Science 2009), DOI 10.1109/e-Science.2009.15
Lundeborg I, 2012, LOGOP PHONIATR VOCO, V37, P117, DOI 10.3109/14015439.2012.664654
MathWorks Inc, 2010, MATL SOFTW REL 2010A
Morrison G.S., 2012, P 46 AUD ENG SOC AES, P203
Morrison GS, 2011, SCI JUSTICE, V51, P91, DOI 10.1016/j.scijus.2011.03.002
Nordgaard A, 2012, LAW PROBAB RISK, V11, P1, DOI DOI 10.1093/1PR/MGR020
R Development Core Team, 2013, R LANG ENV STAT COMP
Stoel R.D., 2012, HDB RISK THEORY EPIS, P135, DOI DOI 10.1007/978-94-007-1433-5
Villalba J., 2011, P INTERSPEECH, P505
Zhang CL, 2013, J ACOUST SOC AM, V133, pEL54, DOI 10.1121/1.4773223
NR 24
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2014
VL 58
BP 81
EP 90
DI 10.1016/j.specom.2013.11.004
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 293FE
UT WOS:000329956000006
ER
PT J
AU Etz, T
Reetz, H
Wegener, C
Bahlmann, F
AF Etz, Tanja
Reetz, Henning
Wegener, Carla
Bahlmann, Franz
TI Infant cry reliability: Acoustic homogeneity of spontaneous cries and
pain-induced cries
SO SPEECH COMMUNICATION
LA English
DT Article
DE Infant cry; Reliability; Acoustic analysis
ID AUTISM SPECTRUM DISORDER; NEWBORN-INFANTS; HEARING IMPAIRMENT; SOUND
SPECTROGRAM; AGREEMENT; DIAGNOSIS; EXPOSURE; STIMULI; MELODY; RISK
AB Infant cries can indicate certain developmental disorders and therefore may be suited for early diagnosis. An open research question is which type of crying (spontaneous, pain-induced) is best suited for infant cry analysis. For estimating the degree of consistency among single cries in an episode of crying, healthy infants were recorded and allocated to four groups: spontaneous cries, spontaneous non-distressed cries, pain-induced cries, and pain-induced cries without the first cry after the pain stimulus. Nineteen acoustic parameters were computed and their reliability was statistically analyzed with Krippendorff's Alpha. Krippendorff's Alpha values between 0.184 and 0.779 were reached over all groups. No significant differences between the cry groups were found. However, the non-distressed cries reached the highest alpha values in 16 out of 19 acoustic parameters by trend. The results show that the single cries within an infant's episode of crying are not very reliable in general. Among the cry types, the non-distressed cry is the one with the best reliability, making it the preferred choice for infant cry analysis. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Etz, Tanja; Wegener, Carla] Fresenius Univ Appl Sci Idstein, D-65510 Idstein, Germany.
[Etz, Tanja; Reetz, Henning] Goethe Univ Frankfurt, Dept Phonet, D-60325 Frankfurt, Germany.
[Bahlmann, Franz] Burgerhosp Frankfurt Main, D-60318 Frankfurt, Germany.
RP Etz, T (reprint author), Mullerwies 14, D-65232 Taunusstein, Germany.
EM tanja.etz@hs-fresenius.de; reetz@em.uni-frankfurt.de;
wegener@hs-fresenius.de; f.bahlmann@buergerhospital-ffm.de
CR ANDERSEN N, 1974, GEOPHYSICS, V39, P69, DOI 10.1190/1.1440413
APGAR V, 1953, Curr Res Anesth Analg, V32, P260
Arch-Tirado Emilio, 2004, Cir Cir, V72, P271
Artstein R, 2008, COMPUT LINGUIST, V34, P555, DOI 10.1162/coli.07-034-R2
Barr R.G., 2000, CLIN DEV MED, V152
BLINICK G, 1971, AM J OBSTET GYNECOL, V110, P948
Boersma P, 2009, FOLIA PHONIATR LOGO, V61, P305, DOI 10.1159/000245159
Boersma P., 1993, P I PHONETIC SCI, V17, P97
Boersma Paul, 2013, PRAAT DOING PHONETIC
Branco A, 2007, INT J PEDIATR OTORHI, V71, P539, DOI 10.1016/j.ijporl.2006.11.009
Childers D.G., 1978, MODERN SPECTRUM ANAL
CORWIN MJ, 1992, PEDIATRICS, V89, P1199
CROWE HP, 1992, CHILD ABUSE NEGLECT, V16, P19, DOI 10.1016/0145-2134(92)90005-C
Esposito G, 2013, RES DEV DISABIL, V34, P2717, DOI 10.1016/j.ridd.2013.05.036
Etz T, 2012, FOLIA PHONIATR LOGO, V64, P254, DOI 10.1159/000343994
Fisichelli V.R., 1966, PSYCHON SCI, V6, P195
Fort A, 1998, MED ENG PHYS, V20, P432, DOI 10.1016/S1350-4533(98)00045-9
Furlow FB, 1997, EVOL HUM BEHAV, V18, P175, DOI 10.1016/S1090-5138(97)00006-8
GOLUB HL, 1982, PEDIATRICS, V69, P197
GRAU SM, 1995, J SPEECH HEAR RES, V38, P373
Green JA, 1998, CHILD DEV, V69, P271, DOI 10.1111/j.1467-8624.1998.tb06187.x
Hayes Andrew F., 2007, COMMUNICATION METHOD, V1, P77, DOI DOI 10.1080/19312450709336664
IBM, 2011, SPSS STAT SOFTW VERS
KARELITZ S, 1962, J PEDIATR-US, V61, P679, DOI 10.1016/S0022-3476(62)80338-2
Krippendorff K., 2003, CONTENT ANAL INTRO I
LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310
LESTER BM, 1976, CHILD DEV, V47, P237
Lester BM, 2002, PEDIATRICS, V110, P1182, DOI 10.1542/peds.110.6.1182
Lind J., 1967, ACTA PAEDIATR SCAND, V177, P113
LIND J, 1970, DEV MED CHILD NEUROL, V12, P478
Lind K, 2002, INT J PEDIATR OTORHI, V64, P97, DOI 10.1016/S0165-5876(02)00024-1
Manfredi C, 2009, MED ENG PHYS, V31, P528, DOI 10.1016/j.medengphy.2008.10.003
Michelsson K, 2002, FOLIA PHONIATR LOGO, V54, P190, DOI 10.1159/000063190
MICHELSSON K, 1977, DEV MED CHILD NEUROL, V19, P309
Moller S, 1999, SPEECH COMMUN, V28, P175, DOI 10.1016/S0167-6393(99)00016-3
Nugent JK, 1996, CHILD DEV, V67, P1806, DOI 10.1111/j.1467-8624.1996.tb01829.x
PORTER FL, 1986, CHILD DEV, V57, P790, DOI 10.2307/1130355
Press W. H., 2002, NUMERICAL RECIPES C, V2nd
Rietveld T., 1993, STAT TECHNIQUES STUD
Robb MP, 1997, FOLIA PHONIATR LOGO, V49, P35
Runefors P, 2005, FOLIA PHONIATR LOGO, V57, P90, DOI 10.1159/000083570
Runefors P, 2000, ACTA PAEDIATR, V89, P68, DOI 10.1080/080352500750029095
Sheinkopf SJ, 2012, AUTISM RES, V5, P331, DOI 10.1002/aur.1244
SHROUT PE, 1979, PSYCHOL BULL, V86, P420, DOI 10.1037//0033-2909.86.2.420
SIRVIO P, 1976, FOLIA PHONIATR, V28, P161
Thoden C.J., 1980, INFANT COMMUNICATION, P124
THODEN CJ, 1979, DEV MED CHILD NEUROL, V21, P400
Truby H.M., 1965, ACTA PAEDIATR, V54, P8, DOI 10.1111/j.1651-2227.1965.tb09308.x
Varallyay G, 2007, INT J PEDIATR OTORHI, V71, P1699, DOI 10.1016/j.ijport.2007.07.005
Verduzco-Mendoza A, 2012, CIR CIR, V80, P3
Vuorenkoski V, 1966, Ann Paediatr Fenn, V12, P174
Wasz-Hockert O., 1968, CLIN DEV MED, V29
Wermke K, 2002, MED ENG PHYS, V24, P501, DOI 10.1016/S1350-4533(02)00061-9
Wermke K, 2011, CLEFT PALATE-CRAN J, V48, P321, DOI 10.1597/09-055
Wolff PH, 1969, DETERMINATS INFANT B, P81
NR 55
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2014
VL 58
BP 91
EP 100
DI 10.1016/j.specom.2013.11.006
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 293FE
UT WOS:000329956000007
ER
PT J
AU Yousefian, N
Loizou, PC
Hansen, JHL
AF Yousefian, Nima
Loizou, Philipos C.
Hansen, John H. L.
TI A coherence-based noise reduction algorithm for binaural hearing aids
SO SPEECH COMMUNICATION
LA English
DT Article
DE Binaural signal processing; Coherence function; Noise reduction; SNR
estimation; Speech intelligibility
ID DATA SPEECH RECOGNITION; ROOM REVERBERATION; ENHANCEMENT; SYSTEMS; MODEL
AB In this study, we present a novel coherence-based noise reduction technique and show how it can be employed in binaural hearing aid instruments in order to suppress any potential noise present in a realistic, low-reverberation environment. The technique is based on particular assumptions about the spatial properties of the target and undesired interfering signals and suppresses (coherent) interferences without prior statistical knowledge of the noise environment. The proposed algorithm is simple, easy to implement and has the advantage of high performance in coping with adverse signal conditions such as scenarios in which competing talkers are present. The technique was assessed by measurements with normal-hearing subjects, and the processed outputs in each ear showed significant improvements in terms of speech intelligibility (measured by an adaptive speech reception threshold (SRT) sentence test) over the unprocessed signals (baseline). In a mildly reverberant room with T-60 = 200 ms, the average improvement in SRT obtained relative to the baseline was approximately 6.5 dB. In addition, the proposed algorithm was found to yield higher intelligibility and quality than those obtained by a well-established interaural time difference (ITD)-based speech enhancement algorithm. These attractive features make the proposed method a potential candidate for future use in commercial hearing aid and cochlear implant devices. (C) 2013 Published by Elsevier B.V.
C1 [Yousefian, Nima; Loizou, Philipos C.; Hansen, John H. L.] Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75080 USA.
RP Hansen, JHL (reprint author), Univ Texas Dallas, Dept Elect Engn, 800 West Campbell Rd EC33, Richardson, TX 75080 USA.
EM nimayou@utdallas.edu; john.hansen@utdallas.edu
CR Aarabi P, 2004, IEEE T SYST MAN CY B, V34, P1763, DOI 10.1109/TSMCB.2004.830345
ALLEN JB, 1977, J ACOUST SOC AM, V62, P912, DOI 10.1121/1.381621
Brandstein M., 2001, MICROPHONE ARRAYS SI, V1st
Bronkhorst AW, 2000, ACUSTICA, V86, P117
Chen F., 2011, SPEECH COMMUN, V54, P272
Desloge JG, 1997, IEEE T SPEECH AUDI P, V5, P529, DOI 10.1109/89.641298
Doclo S, 2002, IEEE T SIGNAL PROCES, V50, P2230, DOI 10.1109/TSP.2002.801937
Doclo S., 2006, P INT WORKSH AC ECH
ERBER NP, 1975, J SPEECH HEAR DISORD, V40, P481
Griffiths L., 1982, IEEE T ANTENN PROPAG, V130, P27
HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901
Harding S, 2006, IEEE T AUDIO SPEECH, V14, P58, DOI 10.1109/TSA.2005.860354
Hawley M.L., 2004, J ACOUST SOC AM, V115, P843
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
IEEE Subcommittee, 1969, IEEE T AUDIO ELECTRO, V3, P225, DOI DOI 10.1109/TAU.1969.1162058
ITU, 2000, PERC EV SPEECH QUAL
Kim C, 2011, INT CONF ACOUST SPEE, P5072
Krishnamurthy N, 2009, IEEE T AUDIO SPEECH, V17, P1394, DOI 10.1109/TASL.2009.2015084
LeBouquinJeannes R, 1997, IEEE T SPEECH AUDI P, V5, P484
LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375
LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540
Loizou PC, 2009, J ACOUST SOC AM, V125, P372, DOI 10.1121/1.3036175
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Maj JB, 2006, SPEECH COMMUN, V48, P957, DOI 10.1016/j.specom.2005.12.005
Palomaki KJ, 2004, SPEECH COMMUN, V43, P361, DOI 10.1016/j.specom.2004.03.005
PLOMP R, 1986, J SPEECH HEAR RES, V29, P146
RIX AW, 2001, ACOUST SPEECH SIG PR, P749
Spahr AJ, 2012, EAR HEARING, V33, P112, DOI 10.1097/AUD.0b013e31822c2549
Spriet A, 2007, EAR HEARING, V28, P62, DOI 10.1097/01.aud.0000252470.54246.54
Spriet A, 2004, SIGNAL PROCESS, V84, P2367, DOI 10.1016/j.sigpro.2004.07.028
Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003
Van den Bogaert T, 2007, INT CONF ACOUST SPEE, P565
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
Yousefian N, 2012, IEEE T AUDIO SPEECH, V20, P599, DOI 10.1109/TASL.2011.2162406
Yousefian N, 2009, INT CONF ACOUST SPEE, P4653, DOI 10.1109/ICASSP.2009.4960668
Yu T, 2009, INT CONF ACOUST SPEE, P213
NR 36
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2014
VL 58
BP 101
EP 110
DI 10.1016/j.specom.2013.11.003
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 293FE
UT WOS:000329956000008
ER
PT J
AU Garcia, JE
Ortega, A
Miguel, A
Lleida, E
AF Enrique Garcia, Jose
Ortega, Alfonso
Miguel, Antonio
Lleida, Eduardo
TI Low bit rate compression methods of feature vectors for distributed
speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Distributed speech recognition; Neural Networks; Multi-layer perceptron;
Predictive vector quantizer optimization; IP networks
ID PACKET LOSS CONCEALMENT; WORLD-WIDE-WEB; CEPSTRAL PARAMETERS;
QUANTIZATION; ROBUST; NETWORKS; CHANNELS
AB In this paper, we present a family of compression methods based on differential vector quantization (DVQ) for encoding Mel frequency cepstral coefficients (MFCC) in distributed speech recognition (DSR) applications. The proposed techniques benefit from the existence of temporal correlation across consecutive MFCC frames as well as the presence of intra-frame redundancy. We present DVQ schemes based on linear prediction and non-linear methods with multi-layer perceptrons (MLP). In addition to this, we propose the use of a multipath search coding strategy based on the M-algorithm that obtains the sequence of centroids that minimizes the quantization error globally instead of selecting the centroids that minimize the quantization error locally on a frame-by-frame basis. We have evaluated the performance of the proposed methods for two different tasks. On the one hand, two small-vocabulary databases, SpeechDat-Car and Aurora 2, have been considered, obtaining negligible degradation in terms of Word Accuracy (around 1%) compared to the unquantized scheme for bit-rates as low as 0.5 kbps. On the other hand, for a large vocabulary task (Aurora 4), the proposed method achieves a WER comparable to the unquantized scheme with only 1.6 kbps. Moreover, we propose a combined scheme (differential/non-differential) that gives the system the same sensitivity to transmission errors as previous multi-frame coding proposals for DSR. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Enrique Garcia, Jose; Ortega, Alfonso; Miguel, Antonio; Lleida, Eduardo] Univ Zaragoza, Aragon Inst Engn Res I3A, Commun Technol Grp GTC, Zaragoza 50018, Spain.
RP Garcia, JE (reprint author), Univ Zaragoza, Elect Engn & Commun Dept, C Maria de Luna 1, Zaragoza 50018, Spain.
EM jegarlai@unizar.es; ortega@unizar.es; amiguel@unizar.es;
lleida@unizar.es
RI Ortega, Alfonso/J-6280-2014; Lleida, Eduardo/K-8974-2014
OI Ortega, Alfonso/0000-0002-3886-7748; Lleida, Eduardo/0000-0001-9137-4013
FU Spanish Government; European Union (FEDER) [TIN2011-28169-C05-02]
FX This work has been partially funded by the Spanish Government and the
European Union (FEDER) under Project TIN2011-28169-C05-02.
CR [Anonymous], 2003, 202212 ETSI ES
[Anonymous], 2002, 202050 ETSI ES
[Anonymous], 2000, 201108 ETSI ES
Bernard A, 2002, IEEE T SPEECH AUDI P, V10, P570, DOI 10.1109/TSA.2002.808141
Boulis C, 2002, IEEE T SPEECH AUDI P, V10, P580, DOI 10.1109/TSA.2002.804532
DIGALAKIS V, 1998, ACOUST SPEECH SIG PR, P989
Digalakis VV, 1999, IEEE J SEL AREA COMM, V17, P82, DOI 10.1109/49.743698
Flynn R, 2010, DIGIT SIGNAL PROCESS, V20, P1559, DOI 10.1016/j.dsp.2010.03.009
Flynn R, 2012, SPEECH COMMUN, V54, P881, DOI 10.1016/j.specom.2012.03.001
Garcia J.E., 2009, INTERSPEECH, P2587
Garcia JE, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P2378
Gomez A., 2004, COST278 ISCA TUT RES
Gomez AM, 2009, SPEECH COMMUN, V51, P390, DOI 10.1016/j.specom.2008.12.002
Gomez AM, 2006, IEEE T MULTIMEDIA, V8, P1228, DOI 10.1109/TMM.2006.884611
Hirsch G., 2000, ISCA ITRW ASR 2000
Hirsch G., 2001, ETSI STQ AURORA DSR
Hsu W.H., 2004, P IEEE INT C AC SPEE, V1, P69
Huang X., 2001, SPOKEN LANGUAGE PROC
James A.B., 2004, P ICSLP 04
James A.B., 2004, P IEEE INT C AC SPEE, V1, P853
Jayant N.S., 1984, PRINCIPLES APPL SPEE
JELINEK F, 1971, IEEE T INFORM THEORY, V17, P118, DOI 10.1109/TIT.1971.1054572
Kelleher H., 2002, P ICSLP 02 DENV US
Kiss I., 2000, P INT C SPOK LANG PR
LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577
LLOYD SP, 1982, IEEE T INFORM THEORY, V28, P129, DOI 10.1109/TIT.1982.1056489
MACKAY, 1992, NEURAL COMPUT, V4, P415
Milner B, 2006, IEEE T AUDIO SPEECH, V14, P223, DOI 10.1109/TSA.2005.852997
Milner B, 2000, PIMRC 2000: 11TH IEEE INTERNATIONAL SYMPOSIUM ON PERSONAL, INDOOR AND MOBILE RADIO COMMUNICATIONS, VOLS 1 AND 2, PROCEEDINGS, P1197
Moreno A., 2000, LREC
So S, 2006, SPEECH COMMUN, V48, P746, DOI 10.1016/j.specom.2005.10.002
Pearce D., 2000, P AVIOS 00 SPEECH AP
Pearce D., 2004, P COST278 ISCA TUT R
Peinado AM, 2005, INT CONF ACOUST SPEE, P329
POTAMIANOS A, 2001, ACOUST SPEECH SIG PR, P269
RAMASWAMY GN, 1998, ACOUST SPEECH SIG PR, P977
Schulzrinne H., 2003, 3550 RFC
Segura J., 2007, HIWIRE DATABASE NOIS
So S, 2008, ADV PATTERN RECOGNIT, P131, DOI 10.1007/978-1-84800-143-5_7
Srinivasamurthy N, 2006, SPEECH COMMUN, V48, P888, DOI 10.1016/j.specom.2005.11.003
Subramaniam AD, 2003, IEEE T SPEECH AUDI P, V11, P130, DOI 10.1109/TSA.2003.809192
Tan ZH, 2003, ELECTRON LETT, V39, P1619, DOI 10.1049/el:20031026
Tan Z.H., 2004, P IEEE INT C AC SPEE, P57
Tan ZH, 2005, SPEECH COMMUN, V47, P220, DOI 10.1016/j.specom.2005.05.007
Tan ZH, 2007, IEEE T AUDIO SPEECH, V15, P1391, DOI 10.1109/TASL.2006.889799
Vetterli M., 1995, WAVELETS SUBBAND COD
Weerackody V, 2002, IEEE T WIREL COMMUN, V1, P282, DOI 10.1109/7693.994822
Young S., 2002, HTK BOOK, P3
NR 48
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2014
VL 58
BP 111
EP 123
DI 10.1016/j.specom.2013.11.007
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 293FE
UT WOS:000329956000009
ER
PT J
AU Xu, N
Tang, YB
Bao, JY
Jiang, AM
Liu, XF
Yang, Z
AF Xu, Ning
Tang, Yibing
Bao, Jingyi
Jiang, Aiming
Liu, Xiaofeng
Yang, Zhen
TI Voice conversion based on Gaussian processes by coherent and asymmetric
training with limited training data
SO SPEECH COMMUNICATION
LA English
DT Article
DE Asymmetric training; Coherent training; Gaussian processes; Gaussian
mixture model; Voice conversion
ID ARTIFICIAL NEURAL-NETWORKS; TRANSFORMATION
AB Voice conversion (VC) is a technique that aims to map the individuality of a source speaker to that of a target speaker, wherein Gaussian mixture model (GMM) based methods are evidently prevalent. Despite their wide use, two major problems remain to be resolved, i.e., over-smoothing and over-fitting. The latter arises naturally when the structure of the model is too complicated given a limited amount of training data.
Recently, a new voice conversion method based on Gaussian processes (GPs) was proposed, whose nonparametric nature ensures that the over-fitting problem can be alleviated significantly. Meanwhile, the GP framework allows flexible non-linear mapping through the introduction of sophisticated kernel functions. This kind of method therefore deserves to be explored thoroughly, as is done in this paper. To further improve the performance of the GP-based method, a strategy for mapping prosodic and spectral features coherently is adopted, making the best use of the intercorrelations embedded among both excitation and vocal tract features. Moreover, the accuracy in computing the kernel functions of the GP can be improved by resorting to an asymmetric training strategy that allows the dimensionality of the input vectors to be reasonably higher than that of the output vectors without additional computational costs. Experiments have been conducted to confirm the effectiveness of the proposed method both objectively and subjectively, which have demonstrated that improvements can be obtained by the GP-based method compared to the traditional GMM-based approach. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Xu, Ning; Tang, Yibing; Jiang, Aiming; Liu, Xiaofeng] Hohai Univ, Coll IoT Engn, Changzhou, Peoples R China.
[Xu, Ning; Jiang, Aiming; Liu, Xiaofeng] Hohai Univ, Changzhou Key Lab Robot & Intelligent Technol, Changzhou, Peoples R China.
[Xu, Ning] Nanjing Univ Posts & Telecommun, Minist Educ, Key Lab Broadband Wireless Commun & Sensor Networ, Nanjing, Jiangsu, Peoples R China.
[Bao, Jingyi] Changzhou Inst Technol, Sch Elect Informat & Elect Engn, Changzhou, Peoples R China.
[Yang, Zhen] Nanjing Univ Posts & Telecommun, Coll Telecommun & Informat Engn, Nanjing, Jiangsu, Peoples R China.
RP Xu, N (reprint author), Hohai Univ, Coll IoT Engn, Changzhou, Peoples R China.
EM xuningdlts@gmail.com; tangyb@hhuc.edu.cn; baojy@czu.cn;
jiangam@hhuc.edu.cn; liuxf@hhuc.edu.cn; yangz@njupt.edu.cn
FU National Natural Science Foundation of China [11274092, 61271335];
Fundamental Research Funds for the Central Universities [2011B11114,
2011B11314, 2012B07314, 2012B04014]; National Natural Science Foundation
for Young Scholars of China [61101158, 61201301, 31101643]; Jiangsu
Province Natural Science Foundation for Young Scholars of China
[BK20130238]; Key Lab of Broadband Wireless Communication and Sensor
Network Technology (Nanjing University of Posts and Telecommunications),
Ministry of Education [NYKL201305]
FX The work is supported in part by the Grant from the National Natural
Science Foundation of China (11274092, 61271335), the Grant from the
Fundamental Research Funds for the Central Universities (2011B11114,
2011B11314, 2012B07314, 2012B04014), the Grant from the National Natural
Science Foundation for Young Scholars of China (61101158, 61201301,
31101643), the Grant from the Jiangsu Province Natural Science
Foundation for Young Scholars of China (BK20130238), and the open
research fund of Key Lab of Broadband Wireless Communication and Sensor
Network Technology (Nanjing University of Posts and Telecommunications),
Ministry of Education (NYKL201305).
CR Abe M., 1998, P IEEE INT C AC SPEE, P655
Arslan LM, 1999, SPEECH COMMUN, V28, P211, DOI 10.1016/S0167-6393(99)00015-1
Bishop C. M., 2006, PATTERN RECOGNITION
Chatzis SP, 2012, IEEE T NEUR NET LEAR, V23, P1862, DOI 10.1109/TNNLS.2012.2217986
Chen Y, 2003, P EUR, P2413
Desai S, 2010, IEEE T AUDIO SPEECH, V18, P954, DOI 10.1109/TASL.2010.2047683
Erro D., 2007, P 6 ISCA WORKSH SPEE, P194
Erro D, 2010, IEEE T AUDIO SPEECH, V18, P922, DOI 10.1109/TASL.2009.2038663
Erro D., 2008, THESIS U POLITECNICA
Helander E, 2008, INT CONF ACOUST SPEE, P4669, DOI 10.1109/ICASSP.2008.4518698
Helander E, 2012, IEEE T AUDIO SPEECH, V20, P806, DOI 10.1109/TASL.2011.2165944
Kain A., 2001, THESIS OREGON HLTH S
Lee KS, 2007, IEEE T AUDIO SPEECH, V15, P641, DOI 10.1109/TASL.2006.876760
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
Marume M., 2007, TECHNICAL REPORT IEI, V107, P103
NARENDRANATH M, 1995, SPEECH COMMUN, V16, P207, DOI 10.1016/0167-6393(94)00058-I
Pilkington N., 2011, P INTERSPEECH, P2761
Rabiner L.R., 2009, THEORY APPL DIGITAL
Rasmussen CE, 2005, ADAPT COMPUT MACH LE, P1
Snelson E.L., 2007, THESIS U LONDON
Stylianou Y, 2009, INT CONF ACOUST SPEE, P3585, DOI 10.1109/ICASSP.2009.4960401
Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472
Stylianou Y., 1996, THESIS ECOLE NATL SU
Toda T, 2008, SPEECH COMMUN, V50, P215, DOI 10.1016/j.specom.2007.09.001
Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344
Xu Ning, 2010, Journal of Nanjing University of Posts and Telecommunications, V30
Ye H, 2006, IEEE T AUDIO SPEECH, V14, P1301, DOI 10.1109/TSA.2005.860839
NR 27
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2014
VL 58
BP 124
EP 138
DI 10.1016/j.specom.2013.11.005
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 293FE
UT WOS:000329956000010
ER
PT J
AU Mariooryad, S
Busso, C
AF Mariooryad, Soroosh
Busso, Carlos
TI Compensating for speaker or lexical variabilities in speech for emotion
recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotion recognition; Factor analysis; Feature normalization; Speaker
variability
ID EXPRESSION; LEVEL
AB Affect recognition is a crucial requirement for future human machine interfaces to effectively respond to nonverbal behaviors of the user. Speech emotion recognition systems analyze acoustic features to deduce the speaker's emotional state. However, human voice conveys a mixture of information including speaker, lexical, cultural, physiological and emotional traits. The presence of these communication aspects introduces variabilities that affect the performance of an emotion recognition system. Therefore, building robust emotional models requires careful considerations to compensate for the effect of these variabilities. This study aims to factorize speaker characteristics, verbal content and expressive behaviors in various acoustic features. The factorization technique consists in building phoneme level trajectory models for the features. We propose a metric to quantify the dependency between acoustic features and communication traits (i.e., speaker, lexical and emotional factors). This metric, which is motivated by the mutual information framework, estimates the uncertainty reduction in the trajectory models when a given trait is considered. The analysis provides important insights on the dependency between the features and the aforementioned factors. Motivated by these results, we propose a feature normalization technique based on the whitening transformation that aims to compensate for speaker and lexical variabilities. The benefit of employing this normalization scheme is validated with the presented factor analysis method. The emotion recognition experiments show that the normalization approach can attenuate the variability imposed by the verbal content and speaker identity, yielding 4.1% and 2.4% relative performance improvements on a selected set of features, respectively. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Mariooryad, Soroosh; Busso, Carlos] Univ Texas Dallas, Dept Elect Engn, Multimodal Signal Proc MSP Lab, Richardson, TX 75083 USA.
RP Busso, C (reprint author), Univ Texas Dallas, Dept Elect Engn, Multimodal Signal Proc MSP Lab, 800 West Campbell Rd, Richardson, TX 75083 USA.
EM busso@utdallas.edu
FU Samsung Telecommunications America; US National Science Foundation [IIS
1217104, IIS: 1329659]
FX This study was funded by Samsung Telecommunications America and the US
National Science Foundation under Grants IIS 1217104 and IIS: 1329659.
CR Batliner Anton, 2010, Advances in Human-Computing Interaction, DOI 10.1155/2010/782802
Busso Carlos, 2007, IEEE 9th Workshop on Multimedia Signal Processing, 2007. MMSP 2007
Busso C, 2009, IEEE T AUDIO SPEECH, V17, P582, DOI 10.1109/TASL.2008.2009578
Busso C, 2011, INT CONF ACOUST SPEE, P5692
Busso C., 2006, 7 INT SEM SPEECH PRO, P549
BUSSO C, 2008, J LANGUAGE RESOURCES, V42, P335
Busso C., 2007, INTERSPEECH EUROSPEE, P2225
CHAPPELL DT, 1998, ACOUST SPEECH SIG PR, P885
Chauhan R, 2011, COMM COM INF SC, V168, P359
Cover T. M., 2006, ELEMENTS INFORM THEO, V2nd
Dawes R. M., 2001, RATIONAL CHOICE UNCE
Dehak N., 2011, INTERSPEECH, P857
Dehak N., 2009, INTERSPEECH, P1559
Eyben Florian, 2010, ACM INT C MULT MM 20, V25, P1459
Fu L, 2008, PAC AS WORKSH COMP I, P140
Gish H., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319337
Gish H, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P466
Gong YF, 1997, IEEE T SPEECH AUDI P, V5, P33
Hall M, 2009, ACM SIGKDD EXPLORATI, V11, P10, DOI DOI 10.1145/1656274.1656278
Hansen JHL, 1996, IEEE T SPEECH AUDI P, V4, P307, DOI 10.1109/89.506935
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1448, DOI 10.1109/TASL.2007.894527
Lee C. M., 2004, 8 INT C SPOK LANG PR, V1, P889
Lee S., 2005, 9 EUR C SPEECH COMM, P497
Lei X, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1237
Li M, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P1937
Mariooryad S, 2012, IEEE IMAGE PROC, P2605, DOI 10.1109/ICIP.2012.6467432
Mariooryad S, 2013, IEEE T AFFECT COMPUT, V4, P183, DOI 10.1109/T-AFFC.2013.11
Mariooryad S., 2013, IEEE INT C AUT FAC G
Metallinou A, 2010, INT CONF ACOUST SPEE, P2462, DOI 10.1109/ICASSP.2010.5494890
Metallinou A, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P2401
Metallinou A, 2010, INT CONF ACOUST SPEE, P2474, DOI 10.1109/ICASSP.2010.5494893
Mower E., 2009, INT C AFF COMP INT I
Park S., 2010, INT J AERONAUT SPACE, V11, P327
Prince SJD, 2008, IEEE T PATTERN ANAL, V30, P970, DOI 10.1109/TPAMI.2008.48
RABINER LR, 1975, AT&T TECH J, V54, P297
Rahman T, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P5117
Schuller B., 2006, 3 INT C SPEECH PROS
Schuller B., 2011, 12 ANN C INT SPEECH, P3201
Schuller B, 2011, SPEECH COMMUN, V53, P1062, DOI 10.1016/j.specom.2011.01.011
Schuller B, 2010, IEEE T AFFECT COMPUT, V1, P119, DOI 10.1109/T-AFFC.2010.8
Sethu V., 2007, 15 INT C DIG SIGN PR, P611
Shriberg E, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P609
Tenenbaum JB, 2000, NEURAL COMPUT, V12, P1247, DOI 10.1162/089976600300015349
Vlasenko B., 2011, 12 ANN C INT SPEECH, P1577
Vlasenko B., 2011, IEEE INT C MULT EXP
Vlasenko B, 2007, LECT NOTES COMPUT SC, V4738, P139
Wollmer M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P597
Wu W, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2102
Xia R., 2012, INT 2012 PORTL OR US, P2230
Yildirim S., 2004, 8 INT C SPOK LANG PR, P2193
Zeng ZH, 2008, IEEE T MULTIMEDIA, V10, P570, DOI 10.1109/TMM.2008.921737
NR 52
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 1
EP 12
DI 10.1016/j.specom.2013.07.011
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100001
ER
PT J
AU Chappel, R
Paliwal, K
AF Chappel, Roger
Paliwal, Kuldip
TI An educational platform to demonstrate speech processing techniques on
Android based smart phones and tablets
SO SPEECH COMMUNICATION
LA English
DT Article
DE Digital signal processing (DSP); Electronic engineering education;
Speech enhancement; Android; Smart phone
ID SPECTRAL AMPLITUDE ESTIMATOR; ENHANCEMENT ALGORITHMS; PHASE;
RECOGNITION; TECHNOLOGY; COURSES
AB This work highlights the need to adapt teaching methods in digital signal processing (DSP) on speech to suit shifts in generational learning behavior; furthermore, it suggests integrating theory into a practical smart phone or tablet application as a means to bridge the gap between traditional teaching styles and current learning styles. The application presented here is called "Speech Enhancement for Android (SEA)" and aims at assisting the development of an intuitive understanding of course content by allowing students to interact with theoretical concepts through their personal device. SEA not only allows students to interact with speech processing methods, but also enables them to interact with their surrounding environment by recording and processing their own voice. A case study on students studying DSP for speech processing found that using SEA as an additional learning tool enhanced their understanding and helped to motivate them to engage in course work by giving them ready access to interactive content on a hand-held device. This paper describes the platform in detail, acting as a road-map for educational institutions, and shows how it can be integrated into a DSP-based speech processing education framework. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Chappel, Roger; Paliwal, Kuldip] Griffith Univ, Sch Engn, Signal Proc Lab, Brisbane, Qld 4111, Australia.
RP Chappel, R (reprint author), Griffith Univ, Sch Engn, Signal Proc Lab, Nathan Campus, Brisbane, Qld 4111, Australia.
EM roger.chappel@griffithuni.edu.au
CR Acero A., 2001, SPOKEN LANGUAGE PROC
Allen I.E., 2007, STAT ROUND TABLE QUA, P64
Alsteris L., 2004, P IEEE INT C AC SPEE, V1, P573
ARM, 2009, WHIT PAP ARM CORT A9
ARMSTRONG RL, 1987, PERCEPT MOTOR SKILL, V64, P359
Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208
Billings DM, 2005, J PROF NURS, V21, P126, DOI 10.1016/j.profnurs.2005.01.002
Caillaud B, 2000, EUR ECON REV, V44, P1091, DOI 10.1016/S0014-2921(00)00047-7
Chassaing R., 2002, DSP APPL USING C TMS
CROCHIERE RE, 1981, AT&T TECH J, V60, P1633
Deci E. L., 1985, INTRINSIC MOTIVATION
Deci EL, 2001, REV EDUC RES, V71, P1, DOI 10.3102/00346543071001001
Donath J, 2004, BT TECHNOL J, V22, P71, DOI 10.1023/B:BTTJ.0000047585.06264.cc
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Finkelstein J, 2010, SECOND INTERNATIONAL CONFERENCE ON MOBILE, HYBRID, AND ON-LINE LEARNING (ELML 2010), P77, DOI 10.1109/eLmL.2010.36
Freeman D.K., 1989, IEEE INT C AC SPEECH, V1, P369
Gartner, 2011, FOR MED TABL OP SYST
Gartner, 2011, MARK SHAR MOB COMM D
Gonvalves F., 2001, IEEE POW EL SPEC C, V1, P85
Hassanlou K., 2009, TECHNOL SOC, V31, P125
Hayes M. H., 1996, STAT DIGITAL SIGNAL, V1st
Haykin S., 1991, INFORM SYSTEM SCI SE, V2
Helmholtz H., 1954, SENSATIONS TONE PHYS
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
Hu Y, 2006, INT CONF ACOUST SPEE, P153
Huang X., 2001, SPOKEN LANGUAGE PROC
Humphreys L, 2010, NEW MEDIA SOC, V12, P763, DOI 10.1177/1461444809349578
JACKSON J, 2001, ACOUST SPEECH SIG PR, P2721
Joiner R, 2006, COMPUT HUM BEHAV, V22, P67, DOI 10.1016/j.chb.2005.01.001
Kay S. M., 1993, SIGNAL PROCESSING SE
Ko Y., 2003, P 33 ANN FRONT ED FI, V1, pT3E
Kong JSL, 2012, INFORM MANAGE-AMSTER, V49, P1, DOI 10.1016/j.im.2011.10.004
Liu J., 2011, FRONT ED C FIE OCT, pF2G
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lubis MA, 2010, PROCD SOC BEHV, V9, DOI 10.1016/j.sbspro.2010.12.285
Maiti A., 2011, 2011 IEEE EUROCON IN, P1
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
MARTIN R, 1994, P EUSPICO, V2, P1182
Martin S, 2011, COMPUT EDUC, V57, P1893, DOI 10.1016/j.compedu.2011.04.003
Milanesi C., 2011, IPAD FUTURE TABLET M
MILNER B, 2002, ACOUST SPEECH SIG PR, P797
Mitra S., 2006, DIGITAL SIGNAL PROCE, V3
Nussbaumer H., 1981, FAST FOURIER TRANSFO, V2
[NVIDIA NVIDIA Corporation], 2010, BEN MULT CPU COR MOB
O'Neil H. F., 2003, TECHNOLOGY APPL ED L
Ohm G. S., 1843, ANN PHYS CHEM, V59, P513
Oppenheim A. V., 1979, ICASSP 79. 1979 IEEE International Conference on Acoustics, Speech and Signal Processing
OPPENHEIM AV, 1981, P IEEE, V69, P529, DOI 10.1109/PROC.1981.12022
Paliwal K. K., 2003, P EUR 2003, P2117
Alsteris LD, 2006, SPEECH COMMUN, V48, P727, DOI 10.1016/j.specom.2005.10.005
PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532
Potts J, 2011, IEEE SOUTHEASTCON, P293, DOI 10.1109/SECON.2011.5752952
Price S, 2011, INTERACT COMPUT, V23, P499, DOI 10.1016/j.intcom.2011.06.003
Price S, 2003, INTERACT COMPUT, V15, P169, DOI 10.1016/S0953-5438(03)00006-7
Proakis J., 2007, DIGITAL SIGNAL PROCE
Quatieri T. F., 2002, DISCRETE TIME SPEECH
Ranganath S., 2012, ASEE ANN C JUN
Rienties B, 2009, COMPUT HUM BEHAV, V25, P1195, DOI 10.1016/j.chb.2009.05.012
Rosenbaum E, 2008, SOC SCI RES, V37, P350, DOI 10.1016/j.ssresearch.2007.03.003
Rosenberg M., 1976, AUTOMATIC SPEAKER VE, V64, P475
Salajan FD, 2010, COMPUT EDUC, V55, P1393, DOI 10.1016/j.compedu.2010.06.017
Shi GJ, 2006, IEEE T AUDIO SPEECH, V14, P1867, DOI 10.1109/TSA.2005.858512
Sims R, 1997, COMPUT HUM BEHAV, V13, P157, DOI 10.1016/S0747-5632(97)00004-6
Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1
Soliman S., 1990, INFORM SYSTEM SCI SE, V2
Soloway E, 2001, COMMUN ACM, V44, P15, DOI 10.1145/376134.376140
Song J., 2012, AUSTRALASIAN MARKETI, V20, P80
Srinivasan K., 1993, IEEE WORKSH SPEECH C, P85
Stark A, 2011, SPEECH COMMUN, V53, P51, DOI 10.1016/j.specom.2010.08.001
Steinfield Charles, 2007, J COMPUT-MEDIAT COMM, V12, P1143, DOI DOI 10.1111/J.1083-6101.2007.00367.X
Talor F., 1983, DIGITAL FILTER DESIG
Teng C.-C., 2010, 7 INT C INF TECHN NE, P471
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920
Wellman B, 2001, INT J URBAN REGIONAL, V25, P227, DOI 10.1111/1468-2427.00309
White M.-A., 1989, Education & Computing, V5, DOI 10.1016/S0167-9287(89)80003-2
Wojcicki KK, 2007, INT CONF ACOUST SPEE, P729
Yerushalmy M., 2004, MOBILE PHONES ED CAS
NR 79
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 13
EP 38
DI 10.1016/j.specom.2013.08.002
PG 26
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100002
ER
PT J
AU Wu, L
Wan, CY
Xiao, K
Wang, SP
Wan, MX
AF Wu, Liang
Wan, Congying
Xiao, Ke
Wang, Supin
Wan, Mingxi
TI Evaluation of a method for vowel-specific voice source control of an
electrolarynx using visual information
SO SPEECH COMMUNICATION
LA English
DT Article
DE Electrolarynx; Intelligibility evaluation; Visual information;
Vowel-specific voice source
ID SPEECH-PERCEPTION; LARYNGECTOMY; COMMUNICATION
AB The electrolarynx (EL) is a widely used device for alaryngeal communication, but its low voice quality seriously reduces the intelligibility of EL speech. To improve EL speech quality, a vowel-specific voice source based on visual information of lip shape and movement and an artificial neural network (ANN) is implemented in real time in an experimental EL (SGVS-EL) system. Five volunteers (one laryngectomee and four normal speakers) participated in the experimental evaluation of the method and the SGVS-EL system. Using the ANN, participants achieved high vowel precision, with identification rates above 90% after training. The results of voicing control indicated that all subjects using SGVS-EL could achieve good vowel control performance in real time, although control errors still occurred frequently at voice initiation. However, these control errors had no significant impact on the perception of SGVS-EL speech. Intelligibility evaluation demonstrated that the vowels and words produced with the SGVS-EL were more intelligible than those produced with a commercial EL, by 30% and 18% respectively. Using a controlled vowel-specific voice source is thus a feasible and effective way to improve EL speech quality and produce more intelligible words. (C) 2013 Elsevier B.V. All rights reserved.
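The control idea can be caricatured with a toy ANN that maps lip-shape features to a vowel class, which then selects a vowel-specific voice source. The feature set, the source file names and the use of a scikit-learn MLP are illustrative assumptions, not the authors' real-time implementation.

import numpy as np
from sklearn.neural_network import MLPClassifier

# Train a small ANN on (toy) lip-shape features labeled with vowel classes.
rng = np.random.default_rng(0)
lip_features = rng.normal(size=(300, 4))      # e.g. lip width, height, area, velocity
vowels = rng.integers(0, 5, size=300)         # five vowel classes
ann = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                    random_state=0).fit(lip_features, vowels)

# At run time, each incoming lip frame selects the matching voice source.
sources = {v: f"source_vowel_{v}.wav" for v in range(5)}
frame = rng.normal(size=(1, 4))
print(sources[int(ann.predict(frame)[0])])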
C1 [Wu, Liang; Wan, Congying; Xiao, Ke; Wang, Supin; Wan, Mingxi] Xi An Jiao Tong Univ, Key Lab Biomed Informat Engn, Minist Educ, Dept Biomed Engn,Sch Life Sci & Technol, Xian 710049, Peoples R China.
RP Wang, SP (reprint author), 28 Xianning West Rd, Xian 710049, Shaanxi, Peoples R China.
EM spwang@mail.xjtu.edu.cn; mxwan@mail.xjtu.edu.cn
FU National Natural Science Foundation of China [11274250, 61271087];
Research Fund for the Doctoral Program of Higher Education of China
[20120201110049]
FX This work was supported by the National Natural Science Foundation of
China under Grants 11274250 and 61271087, and the Research Fund for the
Doctoral Program of Higher Education of China under Grant
20120201110049. The authors would like to express special appreciation
to Ye Wenyuan and Rui Baozhen for their work in the experiments.
CR Carlson R., 1975, AUDITORY ANAL PERCEP, P55
Carr MM, 2000, OTOLARYNG HEAD NECK, V122, P39, DOI 10.1016/S0194-5998(00)70141-0
Cho T, 1999, J PHONETICS, V27, P207, DOI 10.1006/jpho.1999.0094
Clements KS, 1997, ARCH OTOLARYNGOL, V123, P493
Dupont S, 2000, IEEE T MULTIMEDIA, V2, P141, DOI 10.1109/6046.865479
Evitts PM, 2010, J COMMUN DISORD, V43, P92, DOI 10.1016/j.jcomdis.2009.10.002
Goldstein EA, 2007, J SPEECH LANG HEAR R, V50, P335, DOI 10.1044/1092-4388(2007/024)
Goldstein EA, 2004, IEEE T BIO-MED ENG, V51, P325, DOI 10.1109/TBME.2003.820373
Hillman R E, 1998, Ann Otol Rhinol Laryngol Suppl, V172, P1
MASSARO DW, 1983, J EXP PSYCHOL HUMAN, V9, P753, DOI 10.1037/0096-1523.9.5.753
Meltzner G.S., 2003, THESIS MIT
Meltzner G.S., 2005, ELECTROLARYNGEAL SPE, P571
MORRIS HL, 1992, ANN OTO RHINOL LARYN, V101, P503
Neti C., 2000, AUD VIS SPEECH REC F, P764
Pawar PV, 2008, J CANC RES THER, V4, P186
QI YY, 1991, J SPEECH HEAR RES, V34, P1250
Stepp CE, 2009, IEEE T NEUR SYS REH, V17, P146, DOI 10.1109/TNSRE.2009.2017805
SUMMERFIELD Q, 1992, PHILOS T ROY SOC B, V335, P71, DOI 10.1098/rstb.1992.0009
Takahashi H, 2008, J VOICE, V22, P420, DOI 10.1016/j.jvoice.2006.10.004
TRAUNMULLER H, 1987, SPEECH COMMUN, V6, P143, DOI 10.1016/0167-6393(87)90037-9
Uemi N., 1995, Japanese Journal of Medical Electronics and Biological Engineering, V33
UMEDA N, 1977, J ACOUST SOC AM, V61, P846, DOI 10.1121/1.381374
Wan C.Y., 2012, J VOICE, V26
WEISS MS, 1979, J ACOUST SOC AM, V65, P1298, DOI 10.1121/1.382697
Wu L, 2013, IEEE T BIO-MED ENG, V60, P1965, DOI 10.1109/TBME.2013.2246789
Wu L., 2013, J VOICE, V27, p259e7
NR 26
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 39
EP 49
DI 10.1016/j.specom.2013.09.006
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100003
ER
PT J
AU Norouzian, A
Rose, R
AF Norouzian, Atta
Rose, Richard
TI An approach for efficient open vocabulary spoken term detection
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spoken term detection; Automatic speech recognition; Index
ID RECOGNITION; SEARCH; SPEECH
AB A hybrid two-pass approach for fast and efficient open vocabulary spoken term detection (STD) is presented in this paper. A large vocabulary continuous speech recognition (LVCSR) system is deployed to produce word lattices from audio recordings. An index construction technique is used to enable very fast search of the lattices for occurrences of both in-vocabulary (IV) and out-of-vocabulary (OOV) query terms. The search for query terms is performed in two passes. In the first pass, a subword approach is used to identify, from the index, audio segments that are likely to contain occurrences of the IV and OOV query terms. A more detailed subword-based search is performed in the second pass to verify the occurrence of the query terms in the candidate segments.
The performance of this STD system is evaluated in an open vocabulary STD task defined on a lecture domain corpus. It is shown that the indexing method presented here results in an index that is nearly two orders of magnitude smaller than the LVCSR lattices while preserving most of the information relevant for STD. Furthermore, despite using word lattices for constructing the index, 67% of the segments containing occurrences of the OOV query terms are identified from the index in the first pass. Finally, it is shown that the detection performance of the subword based term detection performed in the second pass has the effect of reducing the performance gap between OOV and IV query terms. (C) 2013 Elsevier B.V. All rights reserved.
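The indexing idea can be illustrated with a toy inverted index over word hypotheses. The sketch below assumes the lattices have already been reduced to (word, start, end, posterior) tuples per recording; the subword handling of OOV terms and the second-pass verification are not reproduced, and all names and the threshold are illustrative.

from collections import defaultdict

def build_index(recordings):
    # recordings: dict mapping a recording id to a list of
    # (word, start_time, end_time, posterior) hypotheses from the lattices.
    index = defaultdict(list)
    for rec_id, hypotheses in recordings.items():
        for word, start, end, posterior in hypotheses:
            index[word].append((rec_id, start, end, posterior))
    return index

def search(index, query, threshold=0.1):
    # Return candidate occurrences of an in-vocabulary query term.
    return [hit for hit in index.get(query, []) if hit[3] >= threshold]

recordings = {"lecture_01": [("fourier", 12.3, 12.9, 0.82),
                             ("transform", 12.9, 13.6, 0.75)]}
print(search(build_index(recordings), "fourier"))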
C1 [Norouzian, Atta; Rose, Richard] McGill Univ, Montreal, PQ, Canada.
RP Norouzian, A (reprint author), McGill Univ, Montreal, PQ, Canada.
EM atta.norouzian@mail.mcgill.ca
CR Allauzen C., 2004, WORKSH INT APPR SPEE, P33
Bisani M, 2008, SPEECH COMMUN, V50, P434, DOI 10.1016/j.specom.2008.01.002
Brno University Super Lectures, 2012, SUP LECT
Can D, 2011, IEEE T AUDIO SPEECH, V19, P2338, DOI 10.1109/TASL.2011.2134087
Chaudhari UV, 2012, IEEE T AUDIO SPEECH, V20, P1633, DOI 10.1109/TASL.2012.2186805
Chelba C., 2005, P 43 ANN M ASS COMP, P443, DOI 10.3115/1219840.1219895
Chen YN, 2011, INT CONF ACOUST SPEE, P5644
COOL, 2012, COURS ONL
Hain T., 2008, P NIST RT07 WORKSH
Hori T., 2007, INT C AC SPEECH SIGN, V4, pIV
Iwata K, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2195
Jansen A, 2011, INT CONF ACOUST SPEE, P5180
Jansen A., 2009, 10 ANN C INT SPEECH
Koumpis K, 2005, IEEE SIGNAL PROC MAG, V22, P61, DOI 10.1109/MSP.2005.1511824
Mamou J., 2006, 29 ANN INT C RES DEV, P51
Mangu L, 2000, COMPUT SPEECH LANG, V14, P373, DOI 10.1006/csla.2000.0152
Manning C. D., 1999, FDN STAT NATURAL LAN
Microsoft MAVIS, 2012, MAV
Miller D., 2007, 8 ANN C INT SPEECH C, P314
Norouzian A., 2013, INT C AC SPEECH SIGN
Norouzian A., 2013, 14 ANN C INT SPEECH
Norouzian A, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P5169
Norouzian A., 2010, WORKSH SPOK LANG TEC, P194
Norouzian A., 2012, 13 ANN C INT SPEECH
Rose R, 2010, INT CONF ACOUST SPEE, P5282, DOI 10.1109/ICASSP.2010.5494982
Saraclar M., 2004, HUM LANG TECHN C N A
Schwarz P, 2004, LECT NOTES COMPUT SC, V3206, P465
Siohan O., 2005, 9 EUR C SPEECH COMM
Stolcke A., 2002, P INT C SPOK LANG PR, V2, P901
Szoke I, 2008, LECT NOTES COMPUT SC, V4892, P237
Szoke I., 2007, WORKSH SIGN PROC APP, P1
Tu T.-W., 2011, AUT SPEECH REC UND W, P383
Wang D, 2010, INT CONF ACOUST SPEE, P5294, DOI 10.1109/ICASSP.2010.5494968
Yu P, 2005, INT CONF ACOUST SPEE, P481
NR 34
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 50
EP 62
DI 10.1016/j.specom.2013.09.002
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100004
ER
PT J
AU Szekely, E
Ahmed, Z
Hennig, S
Cabral, JP
Carson-Berndsen, J
AF Szekely, Eva
Ahmed, Zeeshan
Hennig, Shannon
Cabral, Joao P.
Carson-Berndsen, Julie
TI Predicting synthetic voice style from facial expressions. An application
for augmented conversations
SO SPEECH COMMUNICATION
LA English
DT Article
DE Expressive speech synthesis; Facial expressions; Multimodal application;
Augmentative and alternative communication
ID SPEECH SYNTHESIS; RESPONSES; EMOTIONS
AB The ability to efficiently facilitate social interaction and emotional expression is an important, yet unmet requirement for speech-generating devices aimed at individuals with speech impairment. Using gestures such as facial expressions to control aspects of expressive synthetic speech could contribute to an improved communication experience for both the user of the device and the conversation partner. For this purpose, a mapping model between facial expressions and speech is needed that is high level (utterance-based), versatile and personalisable. In the mapping developed in this work, visual and auditory modalities are connected based on the intended emotional salience of a message, mapping the intensity of the user's facial expressions to the emotional intensity of the synthetic speech. The mapping model has been implemented in a system called WinkTalk that uses estimated facial expression categories and their intensity values to automatically select between three expressive synthetic voices reflecting three degrees of emotional intensity. An evaluation is conducted through an interactive experiment using simulated augmented conversations. The results have shown that automatic control of synthetic speech through facial expressions is fast, non-intrusive, sufficiently accurate and helps the user feel more involved in the conversation. It can be concluded that the system has the potential to facilitate a more efficient communication process between user and listener. (C) 2013 Elsevier B.V. All rights reserved.
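A minimal sketch of the selection step is given below: an estimated facial-expression intensity picks one of three expressive voices. The thresholds and voice names are illustrative assumptions; the WinkTalk system additionally uses the estimated expression category, which is not modeled here.

def select_voice(expression_intensity, low=0.33, high=0.66):
    # Map an intensity value in [0, 1] to one of three synthetic voices
    # of increasing emotional intensity.
    if expression_intensity < low:
        return "voice_mild"
    if expression_intensity < high:
        return "voice_moderate"
    return "voice_intense"

print(select_voice(0.8))  # prints "voice_intense"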
C1 [Szekely, Eva; Ahmed, Zeeshan; Cabral, Joao P.; Carson-Berndsen, Julie] Univ Coll Dublin, Sch Comp Sci & Informat, CNGL, Dublin 2, Ireland.
[Hennig, Shannon] Ist Italian Tecnol, Genoa, Italy.
RP Szekely, E (reprint author), Univ Coll Dublin, Sch Comp Sci & Informat, CNGL, Dublin 2, Ireland.
EM eva.szekely@ucdconnect.ie; zeeshan.ahmed@ucdconnect.ie;
shannon.hennig@iit.it; joao.cabral@ucd.ie; julie.berndsen@ucd.ie
FU Science Foundation Ireland of the Centre for Next Generation
Localisation at University College Dublin (UCD) [07/CE/I1142]; Istituto
Italiano di Tecnologia; Universita degli Studi di Genova
FX This research is supported by the Science Foundation Ireland (Grant
07/CE/I1142) as part of the Centre for Next Generation Localisation
(www.cngl.ie) at University College Dublin (UCD). The opinions,
findings, and conclusions or recommendations expressed in this material
are those of the authors and do not necessarily reflect the views of
Science Foundation Ireland. This research is further supported by the
Istituto Italiano di Tecnologia and the Universita degli Studi di Genova.
The authors would also like to thank Nick Campbell and the Speech
Communication Lab (TCD) for their invaluable help with the interactive
evaluation.
CR Asha, 2005, ROL RESP SPEECH LANG
Bedrosian J., 1995, AUGMENTATIVE ALTERNA, V11, P6, DOI 10.1080/07434619512331277089
Beukelman D., 2012, COMMUNICATION ENHANC
Braunschweiler N, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P2222
Breuer S., 2006, P LREC GEN
Bulut M., 2002, P INT C SPOK LANG PR, P1265
Cabral J., 2007, P 6 ISCA WORKSH SPEE
Cahn J.E., 1990, J AM VOICE I O SOC, V8, P1
Cave C., 2002, P ICSLP DENV
CHILDERS DG, 1995, J ACOUST SOC AM, V97, P505, DOI 10.1121/1.412276
Cowie R., 2000, P ISCA WORKSH SPEECH, P14
Creer S., 2010, COMPUTER SYNTHESISED, P92
Critchley HD, 2000, J NEUROSCI, V20, P3033
Cvejic I., 2011, P INT FLOR
Dawson M.E., 2000, ELECTRODERMAL SYSTEM, V2, P200
EKMAN P, 1976, ENVIRON PSYCH NONVER, V1, P56, DOI 10.1007/BF01115465
Fant G., 1988, SPEECH TRANSMISSION, V29, P1
Fontaine JRJ, 2007, PSYCHOL SCI, V18, P1050, DOI 10.1111/j.1467-9280.2007.02024.x
Gallo LC, 2000, PSYCHOPHYSIOLOGY, V37, P289, DOI 10.1017/S0048577200982222
Gobl C., 1989, STL QPSR, P9
Guenther FH, 2009, PLOS ONE, V4, DOI 10.1371/journal.pone.0008218
Hennig S., 2012, P ISAAC PITTSB
Higginbotham D. J., 1995, AUGMENTATIVE ALTERNA, V11, P2, DOI 10.1080/07434619512331277079
Higginbotham J, 2010, COMPUTER SYNTHESIZED, P50
Higginbotham J., 2002, ASSIST TECHNOL, V24, P14
HTS, 2008, HTS 2 1 TOOLK HMM BA
Kawanami H., 2003, P EUR GEN
Kueblbeck C., 2006, J IMAGE VISION COMPU, V24, P564
Liao C., 2012, P ICASSP KYOT
Light J, 2007, AUGMENT ALTERN COMM, V23, P204, DOI 10.1080/07434610701553635
Light J., 2003, COMMUNICATIVE COMPET, P3
MACWHINNEY B, 1982, MEM COGNITION, V10, P308, DOI 10.3758/BF03202422
Massaro D.W., 2004, MULTISENSORY INTEGRA
Moubayed S.A., 2011, LECT NOTES COMPUTER, V6456, P55
Mullennix J.W., 2010, COMPUTER SYNTHESIZED, P1
MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558
Schroder M, 2004, THESIS SAARLAND U
SHORE, 2012, SHOR FAC DET ENG
Sprengelmeyer R, 2006, NEUROPSYCHOLOGIA, V44, P2899, DOI 10.1016/j.neuropsychologia.2006.06.020
Stern SE, 2002, J APPL PSYCHOL, V87, P411, DOI 10.1037//0021-9010.87.2.411
Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001
Szekely E., 2011, P INT FLOR ISCA, P2409
Szekely E., 2012, P SLPAT MONTR
Szekely E., 2012, P LREC IST
Szekely E., 2013, J MULTIMODA IN PRESS
Taylor P, 2009, TEXT TO SPEECH SYNTH
Wilkinson KM, 2011, AM J SPEECH-LANG PAT, V20, P288, DOI 10.1044/1058-0360(2011/10-0065)
Wisenburn B, 2008, AUGMENT ALTERN COMM, V24, P100, DOI 10.1080/07434610701740448
Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325
Zhao Y., 2006, P INT PITTSB
NR 50
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 63
EP 75
DI 10.1016/j.specom.2013.09.003
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100005
ER
PT J
AU Chao, YH
AF Chao, Yi-Hsiang
TI Using LR-based discriminant kernel methods with applications to speaker
verification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Likelihood Ratio; Speaker verification; Support Vector Machine; Kernel
Fisher Discriminant; Multiple Kernel Learning
ID SUPPORT VECTOR MACHINES; RECOGNITION; MODELS; SCORE
AB Kernel methods are powerful techniques that have been widely discussed and successfully applied to pattern recognition problems. Kernel-based speaker verification has also been developed using the concept of a sequence kernel, which can deal with variable-length patterns such as speech. However, constructing a proper kernel that is closely tied to speaker verification is still an open issue. In this paper, we propose newly defined kernels derived from the Likelihood Ratio (LR) test, named LR-based kernels, in an attempt to integrate kernel methods tightly and intuitively with the LR-based speaker verification framework by embedding an LR in the kernel function. The proposed kernels have two advantages over existing methods. The first is that they can compute the kernel function without needing to represent the variable-length speech as a fixed-dimension vector in advance. The second is that they have a trainable mechanism in the kernel computation using the Multiple Kernel Learning (MKL) algorithm. Our experimental results show that the proposed methods outperform conventional speaker verification approaches. (C) 2013 Elsevier B.V. All rights reserved.
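For orientation, the sketch below computes the likelihood ratio score that such kernels embed: the average per-frame log-likelihood difference between a target speaker GMM and a background GMM. The scikit-learn GaussianMixture models and the toy data are assumptions for illustration; the kernel construction and MKL training themselves are not shown.

import numpy as np
from sklearn.mixture import GaussianMixture

def llr_score(frames, target_gmm, background_gmm):
    # frames: array of shape (num_frames, num_dims) for one test utterance.
    # Average per-frame log p(X | target) - log p(X | background).
    return float(np.mean(target_gmm.score_samples(frames)
                         - background_gmm.score_samples(frames)))

# Fit both models on training data, then threshold the score for accept/reject.
rng = np.random.default_rng(0)
target_gmm = GaussianMixture(n_components=2, random_state=0).fit(rng.normal(1.0, 1.0, (200, 5)))
background_gmm = GaussianMixture(n_components=2, random_state=0).fit(rng.normal(0.0, 1.0, (200, 5)))
print(llr_score(rng.normal(1.0, 1.0, (50, 5)), target_gmm, background_gmm) > 0.0)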
C1 Chien Hsin Univ Sci & Technol, Dept Appl Geomat, Tao Yuan, Taiwan.
RP Chao, YH (reprint author), Chien Hsin Univ Sci & Technol, Dept Appl Geomat, Tao Yuan, Taiwan.
EM yschao@uch.edu.tw
FU National Science Council, Taiwan [NSC101-2221-E-231-026]
FX This work was funded by the National Science Council, Taiwan, under
Grant: NSC101-2221-E-231-026.
CR Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360
Bengio S., 2004, P OD SPEAK LANG REC
Bengio S, 2001, INT CONF ACOUST SPEE, P425, DOI 10.1109/ICASSP.2001.940858
Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555
Campbell WM, 2006, COMPUT SPEECH LANG, V20, P210, DOI 10.1016/j.csl.2005.06.003
Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086
Chao Y.H., 2006, P ICPR2006
Chao Y.H., 2007, INT J COMPUTATIONAL, V12, P255
Chao YH, 2008, IEEE T AUDIO SPEECH, V16, P1675, DOI 10.1109/TASL.2008.2004297
Chao Y.H., 2006, P INT ICSLP
Dehak N., 2010, P OD SPEAK LANG REC
Dehak N., 2009, P INT C AC SPEECH SI, P4237
Gonen M, 2011, J MACH LEARN RES, V12, P2211
Herbrich R., 2002, LEARNING KERNEL CLAS
Huang X., 2001, SPOKEN LANGUAGE PROC
Karam Z.N., 2008, P ICASSP
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Liu CS, 1996, IEEE T SPEECH AUDI P, V4, P56
Luettin J., 1998, 9805 IDIAPCOM
Martin A.F., 1997, P EUR
Messer K., 1999, P AVBPA
Mika S, 2002, THESIS U TECHNOLOGY
Rakotomamonjy A, 2008, J MACH LEARN RES, V9, P2491
Ratsch G., 1999, P IEEE INT WORKSH NE, V9, P41
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Rosenberg A. E., 1992, P INT C SPOK LANG PR, P599
Smith N, 2002, ADV NEUR IN, V14, P1197
Vapnik V, 1998, STAT LEARNING THEORY
Wan V, 2005, IEEE T SPEECH AUDI P, V13, P203, DOI 10.1109/TSA.2004.841042
NR 30
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 76
EP 86
DI 10.1016/j.specom.2013.09.005
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100006
ER
PT J
AU Hyon, S
Dang, JW
Feng, H
Wang, HC
Honda, K
AF Hyon, Songgun
Dang, Jianwu
Feng, Hui
Wang, Hongcui
Honda, Kiyoshi
TI Detection of speaker individual information using a phoneme effect
suppression method
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker identification; Frequency warping; MFCC; Speech production;
Phoneme-related effects
ID ACOUSTIC CHARACTERISTICS; RECOGNITION; IDENTIFICATION; FREQUENCY;
MODELS; NASAL; BAND
AB Feature extraction of speaker information from speech signals is a key procedure for exploring individual speaker characteristics and is also the most critical part of a speaker recognition system, which needs to preserve individual information while attenuating linguistic information. However, it is difficult to separate individual from linguistic information in a given utterance. For this reason, we investigated a number of potential effects on speaker individual information that arise from differences in articulation due to speaker-specific morphology of the speech organs, comparing English, Chinese and Korean. We found that voiced and unvoiced phonemes have different frequency distributions of speaker information and that these effects are consistent across the three languages, while the effect of nasal sounds on speaker individuality is language dependent. Because these differences are confounded with speaker individual information, feature extraction is negatively affected. Accordingly, a new feature extraction method is proposed to more accurately detect speaker individual information by suppressing phoneme-related effects, where phoneme alignment is required only once, in constructing a filter bank for phoneme effect suppression, and is not necessary during feature extraction. The proposed method was evaluated by implementing it in GMM speaker models for speaker identification experiments. It is shown that the proposed approach outperformed both Mel Frequency Cepstrum Coefficients (MFCC) and the traditional F-ratio based features (FFCC). The use of the proposed feature reduced recognition errors by 32.1-67.3% for the three languages compared with MFCC, and by 6.6-31% compared with FFCC. When combining an automatic phoneme aligner with the proposed method, the results demonstrated that the proposed method can detect speaker individuality with about the same accuracy as that based on manual phoneme alignment. (C) 2013 Elsevier B.V. All rights reserved.
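For reference, the sketch below computes the classical F-ratio (between-speaker variance over within-speaker variance) per frequency band, the statistic underlying F-ratio based features such as the FFCC baseline mentioned above; the proposed phoneme effect suppression filter bank is not reproduced. The data layout and the variance floor are illustrative assumptions.

import numpy as np

def f_ratio(band_energies_by_speaker):
    # band_energies_by_speaker: list of arrays, one per speaker,
    # each of shape (num_frames, num_bands).
    speaker_means = np.stack([s.mean(axis=0) for s in band_energies_by_speaker])
    grand_mean = speaker_means.mean(axis=0)
    between = ((speaker_means - grand_mean) ** 2).mean(axis=0)  # between-speaker variance
    within = np.stack([s.var(axis=0) for s in band_energies_by_speaker]).mean(axis=0)
    return between / np.maximum(within, 1e-12)

# Bands with a high F-ratio carry relatively more speaker-specific information.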
C1 [Hyon, Songgun; Dang, Jianwu; Wang, Hongcui; Honda, Kiyoshi] Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China.
[Hyon, Songgun] Kim Il Sung Univ, Sch Comp Sci, Ryongnam, South Korea.
[Dang, Jianwu] Japan Adv Inst Sci & Technol, Sch Informat Sci, Kanazawa, Ishikawa, Japan.
[Dang, Jianwu; Feng, Hui; Wang, Hongcui; Honda, Kiyoshi] Tianjin Key Lab Cognit Comp & Applicat, Tianjin, Peoples R China.
[Feng, Hui] Tianjin Univ, Sch Liberal Arts & Law, Tianjin, Peoples R China.
RP Dang, JW (reprint author), Tianjin Univ, Sch Comp Sci & Technol, Tianjin, Peoples R China.
EM h_star1020@yahoo.com; jdang@jaist.ac.jp; fenghui@tju.edu.cn;
hcwang@tju.edu.cn; khonda@sannet.ne.jp
FU National Basic Research Program of China [2013CB329301]; National
Natural Science Foundation of China [61233009, 6117501]; JSPS KAKENHI
[25330190]
FX The authors would like to thank Dr. Jiahong Yuan and M.S. Yuan Ma for
conducting the automatic phoneme alignment and for running part of the
experiments. The authors would also like to thank Dr. Mark Tiede for his
helpful comments. This work is supported in part by the National Basic
Research Program of China (No. 2013CB329301), and in part by the
National Natural Science Foundation of China under contract Nos.
61233009 and 6117501. This study was also supported in part by JSPS
KAKENHI Grant (NO. 25330190).
CR ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155
BLOMBERG M, 1991, SPEECH COMMUN, V10, P453, DOI 10.1016/0167-6393(91)90048-X
Bozkurt B., 2005, P EUSIPCO AN TURK
Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714
Dang J, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P965
DANG JW, 1994, J ACOUST SOC AM, V96, P2088, DOI 10.1121/1.410150
Dang JW, 1996, J ACOUST SOC AM, V100, P3374, DOI 10.1121/1.416978
Dang JW, 1997, J ACOUST SOC AM, V101, P456, DOI 10.1121/1.417990
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Fant G., 1960, ACOUSTIC THEORY SPEE
Feng G, 1996, J ACOUST SOC AM, V99, P3694, DOI 10.1121/1.414967
Garofolo J.S., 1990, DARPA TIMIT ACOUSTIC
Gutman D., 2002, P EUSIPCO 2002 TOUL
Hansen E.G., 2004, P OD SPEAK REC WORKS
HAYAKAWA S, 1994, INT CONF ACOUST SPEE, P137
Kajarekar S. S., 2001, P SPEAK OD 2001 CRET
Kitamura Tatsuya, 2007, Acoustical Science and Technology, V28, DOI 10.1250/ast.28.434
Kitamura T., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.16
Lu XG, 2008, SPEECH COMMUN, V50, P312, DOI 10.1016/j.specom.2007.10.005
Matsui T., 1993, P ICASSP1993 MINN MN
Miyajima C, 2001, SPEECH COMMUN, V35, P203, DOI 10.1016/S0167-6393(00)00079-0
Miyajima C., 1999, P EUR 1999, P779
Moore BCJ, 1996, ACUSTICA, V82, P335
O'Shaughnessy D., 1987, SPEECH COMMUN, P150
Orman O., 2001, P SPEAK OD SPEAK REC, P219
Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109
Rabiner L, 1993, FUNDAMENTALS SPEECH
REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379
Stevens K.N., 1998, ACOUSTIC PHONETICS
SUNDBERG J, 1974, J ACOUST SOC AM, V55, P838, DOI 10.1121/1.1914609
Suzuki H., 1990, P ICSLP90, P437
Takemoto H, 2006, J ACOUST SOC AM, V120, P2228, DOI 10.1121/1.2261270
Weber F., 2002, P ICASSP 2002 ORL FL
WOLF JJ, 1972, J ACOUST SOC AM, V51, P2044, DOI 10.1121/1.1913065
Yu Yibiao, 2008, Acta Acustica, V33
Yu Yibiao, 2005, Acta Acustica, V30
Yuan J., 2008, P AC 08
ZWICKER E, 1980, J ACOUST SOC AM, V68, P1523, DOI 10.1121/1.385079
NR 38
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 87
EP 100
DI 10.1016/j.specom.2013.09.004
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100007
ER
PT J
AU Trawicki, MB
Johnson, MT
AF Trawicki, Marek B.
Johnson, Michael T.
TI Speech enhancement using Bayesian estimators of the
perceptually-motivated short-time spectral amplitude (STSA) with Chi
speech priors
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Probability; Amplitude estimation; Phase estimation;
Parameter estimation
AB In this paper, the authors propose new perceptually-motivated Weighted Euclidean (WE) and Weighted Cosh (WCOSH) estimators that utilize more appropriate Chi statistical models for the speech prior with Gaussian statistical models for the noise likelihood. Whereas the perceptually-motivated WE and WCOSH cost functions emphasized spectral valleys rather than spectral peaks (formants) and indirectly accounted for auditory masking effects, the incorporation of the Chi distribution statistical models demonstrated a distinct improvement over the Rayleigh statistical models for the speech prior. The estimators incorporate both weighting law and shape parameters on the cost functions and distributions. Performance is evaluated in terms of the Segmental Signal-to-Noise Ratio (SSNR), Perceptual Evaluation of Speech Quality (PESQ), and Signal-to-Noise Ratio (SNR) Loss objective quality measures to determine the amount of noise reduction along with overall speech quality and speech intelligibility improvement. Based on experimental results across three different input SNRs and eight unique noises along with various weighting law and shape parameters, the two general, less complicated, closed-form WE and WCOSH estimators with Chi speech priors provide significant gains in noise reduction and noticeable gains in overall speech quality and speech intelligibility over the baseline WE and WCOSH estimators with the standard Rayleigh speech priors. Overall, the goal of the work is to capitalize on the mutual benefits of the WE and WCOSH cost functions and the Chi distributions for the speech prior to improve enhancement. (C) 2013 Elsevier B.V. All rights reserved.
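In practice, estimators of this family are applied as a per-bin gain on the noisy short-time magnitude spectrum while the noisy phase is retained. The sketch below shows that outer step with a simple Wiener-style gain standing in for the WE/WCOSH estimators with Chi priors, whose closed-form expressions are not reproduced here; decision-directed a priori SNR smoothing is also omitted.

import numpy as np

def enhance_frame(noisy_spectrum, noise_psd):
    # noisy_spectrum: complex STFT coefficients of one frame.
    # noise_psd: estimated noise power per frequency bin.
    noisy_power = np.abs(noisy_spectrum) ** 2
    snr_post = noisy_power / np.maximum(noise_psd, 1e-12)   # a posteriori SNR
    snr_prio = np.maximum(snr_post - 1.0, 0.0)              # crude a priori SNR estimate
    gain = snr_prio / (snr_prio + 1.0)                      # placeholder Wiener-style gain
    return gain * noisy_spectrum                            # noisy phase is kept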
C1 [Trawicki, Marek B.; Johnson, Michael T.] Marquette Univ, Dept Elect & Comp Engn, Speech & Signal Proc Lab, Milwaukee, WI 53201 USA.
RP Trawicki, MB (reprint author), Marquette Univ, Dept Elect & Comp Engn, Speech & Signal Proc Lab, POB 1881, Milwaukee, WI 53201 USA.
EM marek.trawicki@marquette.edu; mike.johnson@marquette.edu
CR Andrianakis I, 2009, SPEECH COMMUN, V51, P1, DOI 10.1016/j.specom.2008.05.018
Breithaupt C., 2008, INT C AC SPEECH SIGN
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Gradshteyn IS, 2007, TABLES INTEGRALS SER, V7th
GRAY RM, 1980, IEEE T ACOUST SPEECH, V28, P367, DOI 10.1109/TASSP.1980.1163421
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
ITU, 2003, SUBJ TEST METH EV SP
Johnson N.L., 1994, CONTINUOUS UNIVARIAT, VI
Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Ma JF, 2011, SPEECH COMMUN, V53, P340, DOI 10.1016/j.specom.2010.10.005
Papamichalis P.E., 1987, PRACTICAL APPROACHES
Pearce D., 2000, 6 INT C SPOK LANG PR
Rix A., 2001, IEEE INT C AC SPEECH
Subcommittee I., 1969, IEEE T AUDIO ELECTRO, P225
NR 17
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 101
EP 113
DI 10.1016/j.specom.2013.09.009
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100008
ER
PT J
AU McLachlan, NM
Grayden, DB
AF McLachlan, Neil M.
Grayden, David B.
TI Enhancement of speech perception in noise by periodicity processing: A
neurobiological model and signal processing algorithm
SO SPEECH COMMUNICATION
LA English
DT Article
DE Neurocognitive; Model; Periodicity; Segregation; Algorithm; Speech
ID VENTRAL COCHLEAR NUCLEUS; DIFFERENT FUNDAMENTAL FREQUENCIES; ITERATED
RIPPLED NOISE; AUDITORY-NERVE FIBERS; INFERIOR COLLICULUS; PITCH
STRENGTH; TEMPORAL INTEGRATION; AMPLITUDE-MODULATION; LATERAL LEMNISCUS;
CONCURRENT VOWELS
AB The perceived loudness of sound increases with its tonality or periodicity, and the pitch strength of tones is linearly proportional to their sound pressure level. These observations suggest a fundamental relationship between pitch strength and loudness. This relationship may be explained by the superimposition of inputs to inferior colliculus neurons from cochlear nucleus chopper cells and phase-locked spike trains from the lateral lemniscus. The regularity of chopper cell outputs increases for stimuli with periodicity at the same frequency as their intrinsic chopping rate. Inputs to inferior colliculus cells therefore become synchronized for periodic stimuli, leading to an increased likelihood that they will fire and increased salience of periodic signal components at the characteristic frequency of the inferior colliculus cell. A computer algorithm to enhance speech in noise was based on this model. The periodicity of the outputs of a Gammatone filter bank after each sound onset was determined by first sampling each filter channel at a range of typical chopper cell frequencies and then passing these amplitudes through a step function to simulate the firing of coincidence-detecting neurons in the inferior colliculus. Filter channel amplification was based on the maximum accumulated spike count after each onset, resulting in increased amplitudes for filter channels with greater periodicity. The speech intelligibility of stimuli in noise was not changed when the algorithm was used to remove around 14 dB of noise from stimuli with signal-to-noise ratios of around 0 dB. This mechanism is a likely candidate for enhancing speech recognition in noise, and raises the proposition that pitch itself is an epiphenomenon that evolved from neural mechanisms that boost the hearing sensitivity of animals to vocalizations. (C) 2013 Elsevier B.V. All rights reserved.
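A loose functional analogue of the channel weighting can be sketched as follows: measure the periodicity of each filter-bank channel envelope after an onset and boost channels whose envelopes are more periodic. A normalized autocorrelation peak stands in for the chopper-cell and coincidence-detector model, and the lag range and boost factor are illustrative assumptions.

import numpy as np

def periodicity_gain(envelope, min_lag, max_lag, boost=2.0):
    # envelope: amplitude envelope of one Gammatone channel after an onset (1-D array).
    env = envelope - envelope.mean()
    denom = float(np.dot(env, env)) + 1e-12
    peaks = [float(np.dot(env[:-lag], env[lag:])) / denom
             for lag in range(min_lag, max_lag)]
    periodicity = max(0.0, max(peaks))        # near 0 for noise-like, near 1 for periodic
    return 1.0 + (boost - 1.0) * periodicity  # gain between 1 and the boost factor

# Channels with stronger periodicity receive a larger gain before resynthesis.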
C1 [McLachlan, Neil M.] Univ Melbourne, Melbourne Sch Psychol Sci, Melbourne, Vic 3010, Australia.
[Grayden, David B.] Univ Melbourne, Dept Elect & Elect Engn, NeuroEngn Lab, Melbourne, Vic 3010, Australia.
[Grayden, David B.] Univ Melbourne, Ctr Neural Engn, Melbourne, Vic 3010, Australia.
RP McLachlan, NM (reprint author), Univ Melbourne, Melbourne Sch Psychol Sci, Melbourne, Vic 3010, Australia.
EM mcln@unimelb.edu.au; grayden@unimelb.edu.au
FU Australian Research Council [DP1094830, DP120103039]
FX This work was supported by Australian Research Council Discovery Project
Grants DP1094830 and DP120103039.
CR Alain C, 2007, HEARING RES, V229, P225, DOI 10.1016/j.heares.2007.01.011
ASSMANN PF, 1990, J ACOUST SOC AM, V88, P680, DOI 10.1121/1.399772
BEERENDS JG, 1989, J ACOUST SOC AM, V85, P813, DOI 10.1121/1.397974
Bench J., 1979, SPEECH HEARING TESTS, P481
BLACKBURN CC, 1992, J NEUROPHYSIOL, V68, P124
Blackburn C.C., 1991, J NEUROPHYSIOL, V65, P606
Brons I, 2013, EAR HEARING, V34, P29, DOI 10.1097/AUD.0b013e31825f299f
Cant NB, 2003, BRAIN RES BULL, V60, P457, DOI 10.1016/S0361-9230(03)00050-9
Cariani PA, 2001, NEURAL NETWORKS, V14, P737, DOI 10.1016/S0893-6080(01)00056-9
CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427
COVEY E, 1991, J NEUROSCI, V11, P3456
Dau T, 1997, J ACOUST SOC AM, V102, P2892, DOI 10.1121/1.420344
de Cheveigne A, 1999, SPEECH COMMUN, V27, P175, DOI 10.1016/S0167-6393(98)00074-0
Dicke U, 2007, J ACOUST SOC AM, V121, P310, DOI 10.1121/1.2400670
Ehret G., 2005, INFERIOR COLLICULUS, P319
Fastl H., 1989, P 13 INT C AC BELGR, P11
FASTL H, 1979, HEARING RES, V1, P293, DOI 10.1016/0378-5955(79)90002-9
Ferragamo MJ, 2002, J NEUROPHYSIOL, V87, P2262, DOI 10.1152/jn.00587.2001
Frisina R.D., 1990, HEARING RES, V44, P90
Guerin A, 2006, HEARING RES, V211, P54, DOI 10.1016/j.heares.2005.10.001
Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9
Herzke T, 2007, ACTA ACUST UNITED AC, V93, P498
HOWELL P, 1983, SPEECH COMMUN, V2, P164, DOI 10.1016/0167-6393(83)90018-3
Hsieh IH, 2007, HEARING RES, V233, P108, DOI 10.1016/j.heares.2007.08.005
Hu GN, 2008, J ACOUST SOC AM, V124, P1306, DOI 10.1121/1.2939132
Hutchins S., 2011, J EXP PSYCHOL GEN, V141, P76, DOI DOI 10.1037/A0025064
Ishizuka K, 2006, SPEECH COMMUN, V48, P1447, DOI 10.1016/j.specom.2006.06.008
Kidd Jr G., 1989, J ACOUST SOC AM, V38, P106
Krumbholz K, 2003, CEREB CORTEX, V13, P765, DOI 10.1093/cercor/13.7.765
KRYTER KD, 1965, J ACOUST SOC AM, V38, P106, DOI 10.1121/1.1909578
Langner G, 2002, HEARING RES, V168, P110, DOI 10.1016/S0378-5955(02)00367-2
McLachlan N, 2010, PSYCHOL REV, V117, P175, DOI 10.1037/a0018063
McLachlan N, 2011, J ACOUST SOC AM, V130, P2845, DOI 10.1121/1.3643082
McLachlan N, 2009, HEARING RES, V249, P23, DOI 10.1016/j.heares.2009.01.003
McLachlan N, 2013, J EXP PSYCHOL GEN, V142, P1142, DOI 10.1037/a0030830
MEDDIS R, 1992, J ACOUST SOC AM, V91, P233, DOI 10.1121/1.402767
Meddis R, 2006, J ACOUST SOC AM, V120, P3861, DOI 10.1121/1.2372595
Milczynski M, 2012, HEARING RES, V285, P1, DOI 10.1016/j.heares.2012.02.006
MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861
Nelson PC, 2004, J ACOUST SOC AM, V116, P2173, DOI 10.1121/1.1784442
Oertel D, 2000, P NATL ACAD SCI USA, V97, P11773, DOI 10.1073/pnas.97.22.11773
Oertel D, 1997, NEURON, V19, P959, DOI 10.1016/S0896-6273(00)80388-8
Oertel D., 1985, J ACOUST SOC AM, V78, P329
Patel AD, 2006, J ACOUST SOC AM, V119, P3034, DOI 10.1121/1.2179657
PATTERSON RD, 1995, J ACOUST SOC AM, V98, P1890, DOI 10.1121/1.414456
Pickles J.O., 2008, INTRO PHYSL HEARING, P155
PLOMP R, 1994, EAR HEARING, V15, P2
Rakowski A, 1996, ACUSTICA, V82, pS80
Richardson U, 2004, DYSLEXIA, V10, P215, DOI 10.1002/dys.276
Riquelme R, 2001, J COMP NEUROL, V432, P409, DOI 10.1002/cne.1111
ROBINSON K, 1995, J ACOUST SOC AM, V98, P1858, DOI 10.1121/1.414405
Schofield B.R., 2005, INFERIOR COLLICULUS, P140
Seither-Preisler A, 2006, HEARING RES, V218, P50, DOI 10.1016/j.heares.2006.04.005
Slaney M., 1993, 35 APPL COMP
SMITH RL, 1975, BIOL CYBERN, V17, P169, DOI 10.1007/BF00364166
Soeta Y, 2004, J ACOUST SOC AM, V116, P3275, DOI 10.1121/1.1782931
Soeta Y, 2007, J SOUND VIB, V304, P415, DOI 10.1016/j.jsv.2007.03.007
STIEBLER I, 1986, NEUROSCI LETT, V65, P336, DOI 10.1016/0304-3940(86)90285-5
Strait DL, 2012, CORTEX, V48, P360, DOI 10.1016/j.cortex.2011.03.015
STRANGE W, 1983, J ACOUST SOC AM, V74, P695, DOI 10.1121/1.389855
Turicchia L, 2005, IEEE T SPEECH AUDI P, V13, P243, DOI 10.1109/TSA.2004.841044
Vandali AE, 2011, J ACOUST SOC AM, V129, P4023, DOI 10.1121/1.3573988
VIEMEISTER NF, 1991, J ACOUST SOC AM, V90, P858, DOI 10.1121/1.401953
Wiegrebe L, 2001, J NEUROPHYSIOL, V85, P1206
Wiegrebe L, 2004, J ACOUST SOC AM, V115, P1207, DOI 10.1121/1.1643359
Yoo SD, 2007, J ACOUST SOC AM, V122, P1138, DOI 10.1121/1.2751257
Zwicker E., 1999, PSYCHOACOUSTICS FACT, P111
NR 67
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 114
EP 125
DI 10.1016/j.specom.2013.09.007
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100009
ER
PT J
AU Dileep, AD
Sekhar, CC
AF Dileep, A. D.
Sekhar, C. Chandra
TI Class-specific GMM based intermediate matching kernel for classification
of varying length patterns of long duration speech using support vector
machines
SO SPEECH COMMUNICATION
LA English
DT Article
DE Varying length pattern; Long duration speech; Set of local feature
vectors; Dynamic kernels; Intermediate matching kernel; Support vector
machine; Speech emotion recognition; Speaker identification
ID SPEAKER VERIFICATION; RECOGNITION; IDENTIFICATION; MODELS
AB Dynamic kernel based support vector machines are used for classification of varying length patterns. This paper explores the use of the intermediate matching kernel (IMK) as a dynamic kernel for classification of varying length patterns of long duration speech represented as sets of feature vectors. The main issue in the construction of an IMK is the choice of the set of virtual feature vectors used to select the local feature vectors for matching. The components of a class-independent GMM (CIGMM) have been used earlier as a representation for the set of virtual feature vectors. For every component of the CIGMM, the local feature vector from each of the two sets of local feature vectors that has the highest probability of belonging to that component is selected, and a base kernel is computed between the two selected vectors. The IMK is computed as the sum of all the base kernels corresponding to the different components of the CIGMM. The construction of the CIGMM-based IMK does not use class-specific information, as the local feature vectors are selected using the components of a CIGMM that is common to all classes. We propose two novel methods to build a more discriminatory IMK-based SVM classifier by considering a set of virtual feature vectors specific to each class, depending on the approach to multiclass classification using SVMs. In the first method, we propose a class-wise IMK based SVM for every class by using the components of a GMM built for that class as the set of virtual feature vectors for that class in the one-against-the-rest approach to multiclass pattern classification. In the second method, we propose a pairwise IMK based SVM for every pair of classes by using the components of a GMM built for that pair of classes as the set of virtual feature vectors for that pair of classes in the one-against-one approach to multiclass classification. We also propose to use the mixture coefficient weighted and responsibility term weighted base kernels in the computation of class-specific IMKs to improve their discrimination ability. This paper also proposes posterior probability weighted dynamic kernels to improve classification performance and reduce the number of support vectors. The performance of the SVM-based classifiers using the proposed class-specific IMKs is studied for speech emotion recognition and speaker identification tasks and compared with that of SVM-based classifiers using state-of-the-art dynamic kernels. (C) 2013 Elsevier B.V. All rights reserved.
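A minimal sketch of the intermediate matching kernel between two sets of local feature vectors is given below, using the components of a fitted GMM as virtual feature vectors and a Gaussian base kernel. Whether the GMM is class-independent, class-wise or pairwise only changes the data it was fitted on; the scikit-learn GaussianMixture API and the kernel width are illustrative assumptions, and the mixture-coefficient, responsibility and posterior probability weightings are omitted.

import numpy as np
from sklearn.mixture import GaussianMixture

def imk(set_x, set_y, gmm, gamma=0.1):
    # set_x, set_y: two sets of local feature vectors, shapes (Nx, D) and (Ny, D).
    resp_x = gmm.predict_proba(set_x)            # responsibilities, shape (Nx, Q)
    resp_y = gmm.predict_proba(set_y)
    value = 0.0
    for q in range(gmm.n_components):
        x_star = set_x[np.argmax(resp_x[:, q])]  # vector in X most associated with component q
        y_star = set_y[np.argmax(resp_y[:, q])]
        value += np.exp(-gamma * np.sum((x_star - y_star) ** 2))  # Gaussian base kernel
    return value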
C1 [Dileep, A. D.; Sekhar, C. Chandra] Indian Inst Technol, Dept Comp Sci & Engn, Madras 600036, Tamil Nadu, India.
RP Dileep, AD (reprint author), IIT Madras, Speech & Vis Lab, Dept CSE, Madras 600036, Tamil Nadu, India.
EM addileep@gmail.com; chandra@c-se.iitm.ac.in
CR Boughorbel S, 2005, IEEE IJCNN, P889
Boughorbel S., 2004, P BRIT MACH VIS C BM, P137
Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555
Burkhardt F., 2005, P INT, P1517
Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086
Chandrakala S, 2009, Proceedings 2009 International Joint Conference on Neural Networks (IJCNN 2009 - Atlanta), DOI 10.1109/IJCNN.2009.5178777
Chang C. C., 2011, ACM T INTELLIGENT SY, V2
Dileep A.D., 2011, SPEAKER FORENSICS NE, P389
Gonen M, 2008, IEEE T NEURAL NETWOR, V19, P130, DOI 10.1109/TNN.2007.903157
Jaakkola T, 2000, J COMPUT BIOL, V7, P95, DOI 10.1089/10665270050081405
Lee K. A., 2007, P INT, P294
Neiberg D., 2006, P INT 2006 PITTSB US
NIST, 2002, NIST YEAR 2002 SPEAK
NIST, 2003, NIST YEAR 2003 SPEAK
Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-581(02)00141-6
Rabiner L.R., 2003, FUNDAMENTALS SPEECH
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Sato N., 2007, J NATURAL LANGUAGE P, V14, P83
Sha F, 2006, INT CONF ACOUST SPEE, P265
Shawe-Taylor J., 2004, KERNEL METHODS PATTE
Smith N., 2001, DATA DEPENDENT KERNE
Steidl S., 2009, THESIS U ERLANGEN NU
Tao Q, 2005, IEEE T NEURAL NETWOR, V16, P1561, DOI 10.1109/tnn.2005.857955
Wallraven C., 2003, Proceedings Ninth IEEE International Conference on Computer Vision
WAN V, 2002, ACOUST SPEECH SIG PR, P669
You CH, 2010, IEEE T AUDIO SPEECH, V18, P1300, DOI 10.1109/TASL.2009.2032950
NR 27
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 126
EP 143
DI 10.1016/j.specom.2013.09.010
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100010
ER
PT J
AU Maeno, Y
Nose, T
Kobayashi, T
Koriyama, T
Ijima, Y
Nakajima, H
Mizuno, H
Yoshioka, O
AF Maeno, Yu
Nose, Takashi
Kobayashi, Takao
Koriyama, Tomoki
Ijima, Yusuke
Nakajima, Hideharu
Mizuno, Hideyuki
Yoshioka, Osamu
TI Prosodic variation enhancement using unsupervised context labeling for
HMM-based expressive speech synthesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE HMM-based expressive speech synthesis; Prosodic context; Unsupervised
labeling; Audiobook; Prosody control
ID CORPUS-BASED SPEECH; SYNTHESIS SYSTEM; MODEL
AB This paper proposes an unsupervised labeling technique using phrase-level prosodic contexts for HMM-based expressive speech synthesis, which enables users to manually enhance prosodic variations of synthetic speech without degrading the naturalness. In the proposed technique, HMMs are first trained using conventional labels that include only linguistic information, and prosodic features are generated from the HMMs. The average difference between original and generated prosodic features for each accent phrase is then calculated and classified into three classes, e.g., low, neutral, and high in the case of fundamental frequency. The created prosodic context label has a practical meaning, such as high or low relative pitch at the phrase level, and hence users can be expected to modify the prosodic characteristics of synthetic speech intuitively by manually changing the proposed labels. In the experiments, we evaluate the proposed technique in both ideal and practical conditions using sales-talk and fairy-tale speech recorded in a realistic domain. In the evaluation under the practical condition, we examine whether users achieve their intended prosodic modification by changing the proposed context label of a certain accent phrase in a given sentence. (C) 2013 Elsevier B.V. All rights reserved.
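The phrase-level labeling step can be sketched as below: compare original and HMM-generated F0 over one accent phrase and assign a three-way prosodic context. The log-F0 representation and the threshold are illustrative assumptions; other prosodic features would be labeled in the same way.

import numpy as np

def label_phrase(original_f0, generated_f0, threshold=0.1):
    # Both inputs: log-F0 values over the voiced frames of one accent phrase.
    diff = float(np.mean(original_f0) - np.mean(generated_f0))
    if diff > threshold:
        return "high"
    if diff < -threshold:
        return "low"
    return "neutral"

# A phrase whose original pitch sits well above the model's prediction: prints "high".
print(label_phrase([5.2, 5.3, 5.4], [5.0, 5.0, 5.1]))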
C1 [Maeno, Yu; Nose, Takashi; Kobayashi, Takao; Koriyama, Tomoki] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan.
[Ijima, Yusuke; Nakajima, Hideharu; Mizuno, Hideyuki; Yoshioka, Osamu] NTT Corp, NTT Media Intelligence Labs, Yokosuka, Kanagawa 2390847, Japan.
RP Nose, T (reprint author), Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan.
EM takashi.nose@ip.titech.ac.jp; takao.kobayashi@ip.titech.ac.jp;
koriyama.t.aa@m.titech.ac.jp
RI Koriyama, Tomoki/B-9321-2015
OI Koriyama, Tomoki/0000-0002-8347-5604
CR Pitrelli JF, 2006, IEEE T AUDIO SPEECH, V14, P1099, DOI 10.1109/TASL.2006.876123
Braunschweiler N, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P2222
Campbell N, 2005, IEICE T INF SYST, VE88D, P376, DOI 10.1093/ietisy/e88-d.3.376
Chen L., 2013, P ICASSP 2013, P7977
Doukhan D., 2011, P INTERSPEECH 2011, P3129
Erickson D., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.317
Eyben F, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4009
Iida A, 2003, SPEECH COMMUN, V40, P161, DOI 10.1016/S0167-6393(02)00081-X
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Maeno Y., 2011, P INTERSPEECH 2011, P1849
Morizane K, 2009, ORIENTAL COCOSDA 2009 - INTERNATIONAL CONFERENCE ON SPEECH DATABASE AND ASSESSMENTS, P76
Nakajima H, 2009, 2009 EIGHTH INTERNATIONAL SYMPOSIUM ON NATURAL LANGUAGE PROCESSING, PROCEEDINGS, P137, DOI 10.1109/SNLP.2009.5340932
Nakajima H., 2010, CREATION ANAL JAPANE
Nose T, 2013, SPEECH COMMUN, V55, P347, DOI 10.1016/j.specom.2012.09.003
Prahallad K., 2007, P INTERSPEECH, P2901
Schroder M, 2009, AFFECTIVE INFORMATION PROCESSING, P111, DOI 10.1007/978-1-84800-306-4_7
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
Strom V., 2007, P INT 2007 ANTW BELG, P1282
Suni A., 2012, P BLIZZ CHALL 2012 W
Szekely E., 2011, P INT FLOR ISCA, P2409
Tsuzuki R, 2004, P INTERSPEECH 2004 I, P1185
Vainio M., 2005, P 10 INT C SPEECH CO, P309
Yamagishi J., 2003, P INTERSPEECH 2003 E, P2461
Yu K, 2010, INT CONF ACOUST SPEE, P4238, DOI 10.1109/ICASSP.2010.5495690
Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825
Zhao Y, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1750
NR 26
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 144
EP 154
DI 10.1016/j.specom.2013.09.014
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100011
ER
PT J
AU Origlia, A
Cutugno, F
Galata, V
AF Origlia, A.
Cutugno, F.
Galata, V.
TI Continuous emotion recognition with phonetic syllables
SO SPEECH COMMUNICATION
LA English
DT Article
DE Affective computing; Feature extraction; Phonetic syllables;
Valence-Activation-Dominance space
ID FUNDAMENTAL-FREQUENCY; TONAL PERCEPTION; SPEECH; MODEL; FEATURES;
DISCRIMINATION; STYLIZATION; MODULATION; AMPLITUDE; RHYTHM
AB As research on the extraction of acoustic properties of speech for emotion recognition progresses, the need to investigate feature extraction methods that take into account the requirements of real-time processing systems becomes more important. Past work has shown the importance of syllables for the transmission of emotions, while classical research methods adopted in prosody show that it is important to concentrate on specific areas of the speech signal to study intonation phenomena. Technological approaches, however, are often designed to use the whole speech signal without taking into account the qualitative variability of the spectral content. Given this contrast with the theoretical basis around which prosodic research is pursued, we present here a feature extraction method built on a phonetic interpretation of the concept of the syllable. In particular, we concentrate on the spectral content of syllabic nuclei, thus reducing the amount of information to be processed. Moreover, we introduce feature weighting based on syllabic prominence, so that not all units of analysis are treated as equally important. The method is evaluated on a continuous, three-dimensional model of emotions built on the classical axes of Valence, Activation and Dominance and is shown to be competitive with state-of-the-art performance. The potential impact of this approach on the design of affective computing systems is also analysed. (C) 2013 Elsevier B.V. All rights reserved.
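A rough illustration of the prominence-weighted pooling idea mentioned in this abstract: per-nucleus acoustic feature vectors are combined into one utterance-level vector with a weighted mean, so that prominent nuclei count more than others. The feature set, the weighting rule and the normalization are assumptions for this sketch and do not reproduce the authors' implementation.

import numpy as np

def pooled_features(nucleus_features, prominence):
    """Weighted mean of per-syllable-nucleus feature vectors.

    nucleus_features : array-like of shape (n_nuclei, n_features)
    prominence       : array-like of shape (n_nuclei,), non-negative weights
    """
    feats = np.asarray(nucleus_features, dtype=float)
    w = np.asarray(prominence, dtype=float)
    w = w / w.sum()            # normalize weights so they sum to one
    return w @ feats           # utterance-level vector of shape (n_features,)

# Toy usage: 3 nuclei x 2 features (e.g. mean F0 in Hz, energy), middle nucleus most prominent.
feats = [[180.0, 0.4], [220.0, 0.9], [190.0, 0.5]]
print(pooled_features(feats, [0.2, 1.0, 0.3]))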
C1 [Origlia, A.; Cutugno, F.] Univ Naples Federico II, Dept Elect Engn & Informat Technol DIETI, Language Understanding & Speech Interfaces LUSI L, I-80125 Naples, Italy.
[Galata, V.] CNR, Inst Cognit Sci & Technol ISTC, Padua, Italy.
RP Origlia, A (reprint author), Univ Naples Federico II, Dept Elect Engn & Informat Technol DIETI, Language Understanding & Speech Interfaces LUSI L, Via Claudio 21, I-80125 Naples, Italy.
EM antonio.origlia@unina.it; cutugno@unina.it;
vincenzo.galata@pd.istc.cnr.it
FU European Community [600958]
FX Antonio Origlia and Francesco Cutugno's work was supported by the
European Community, within the FP7 SHERPA IP #600958 project. The
authors would like to thank the two anonymous reviewers and the editor
for the helpful comments and the constructive suggestions provided.
CR Arvaniti A, 2009, PHONETICA, V66, P46, DOI 10.1159/000208930
Avanzi M., 2010, P SPEECH PROS
Baltruktitis T., 2013, P IEEE INT C AUT FAC
Barry WJ, 2003, P 15 INT C PHON SCI, P2693
Batliner Anton, 2010, Advances in Human-Computing Interaction, DOI 10.1155/2010/782802
Boersma P., 2011, PRAAT DOING PHONETIC
Boersma P., 1993, P I PHONETIC SCI, V17, P97
Breitenstein C, 2001, COGNITION EMOTION, V15, P57, DOI 10.1080/0269993004200114
Burkhardt F., 2005, P INT, P1517
Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199
Collier R., 1990, PERCEPTUAL STUDY INT
Cowie R., 2001, IEEE SIGNAL PROCESSI, V18, P33
Cowie R., 2000, P ISCA WORKSH SPEECH, P19
Cummins F, 1998, J PHONETICS, V26, P145, DOI 10.1006/jpho.1998.0070
DALESSANDRO C, 1995, COMPUT SPEECH LANG, V9, P257, DOI 10.1006/csla.1995.0013
Dellwo V, 2003, P 15 INT C PHON SCI, P471
Dellwo V., 2006, LANGUAGE LANGUAGE PR, P231
Drioli C., 2003, P VOIC QUAL GEN, P127
EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068
Espinosa HP, 2010, INT CONF ACOUST SPEE, P5138
Fernandez R, 2011, SPEECH COMMUN, V53, P1088, DOI 10.1016/j.specom.2011.05.003
FETH LL, 1972, ACUSTICA, V26, P67
Fragopanagos N, 2005, NEURAL NETWORKS, V18, P389, DOI 10.1016/j.neunet.2005.03.006
Galata V., 2010, THESIS U CALABRIA IT
Gharavian D., 2012, NEURAL COMPUT APPL, V22, P1
Goudbeek M., 2009, P INT, P1575
Grimm M, 2008, 2008 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-4, P865, DOI 10.1109/ICME.2008.4607572
Grimm M., 2005, P IEEE AUT SPEECH RE, P381
Grimm M, 2007, SPEECH COMMUN, V49, P787, DOI 10.1016/j.specom.2007.01.010
Gunes Hatice, 2011, Proceedings 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2011), DOI 10.1109/FG.2011.5771357
Hall M. A., 1998, THESIS HAMILTON
House D, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2048
House D., 1995, P EUR, P949
House David, 1990, TONAL PERCEPTION SPE
Jespersen O., 1920, LEHRBUCH PHONETIC
Jia J, 2011, IEEE T AUDIO SPEECH, V19, P570, DOI 10.1109/TASL.2010.2052246
Jittiwarangkul N., 1998, P IEEE AS PAC C CIRC, P169
Kaiser J., 1990, P IEEE INT C AC SPEE, V1, P381
Kao YH, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1814
KLATT DH, 1973, J ACOUST SOC AM, V53, P8, DOI 10.1121/1.1913333
Ludusan B., 2011, P INT, P2413
MAIWALD D, 1967, ACUSTICA, V18, P81
Martin P., 2010, P SPEECH PROS
Mary L, 2008, SPEECH COMMUN, V50, P782, DOI 10.1016/j.specom.2008.04.010
McKeown G, 2010, IEEE INT CON MULTI, P1079, DOI 10.1109/ICME.2010.5583006
Mehrabian A, 1996, CURR PSYCHOL, V14, P261, DOI 10.1007/BF02686918
Mermelstein D., 1975, J ACOUST SOC AM, V54, P880
Mertens P., 2004, P SPEECH PROS
Moller E., 1974, P FACTS MOD HEAR, P227
Nicolaou MA, 2011, IEEE T AFFECT COMPUT, V2, P92, DOI 10.1109/T-AFFC.2011.9
Origlia A, 2013, COMPUT SPEECH LANG, V27, P190, DOI 10.1016/j.csl.2012.04.003
Patel S, 2011, BIOL PSYCHOL, V87, P93, DOI 10.1016/j.biopsycho.2011.02.010
Petrillo M., 2003, P EUR, P2913
Pfitzinger HR, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1261
POLLACK I, 1968, J EXP PSYCHOL, V77, P535, DOI 10.1037/h0026051
Ramus F, 1999, COGNITION, V73, P265, DOI 10.1016/S0010-0277(99)00058-X
Roach P., 2000, PRACTICAL COURSE
ROSSI M, 1978, LANG SPEECH, V21, P384
ROSSI M, 1971, PHONETICA, V23, P1
RUSSELL JA, 1980, J PERS SOC PSYCHOL, V39, P1161, DOI 10.1037/h0077714
Scherer KR, 2003, SER AFFECTIVE SCI, P433
Schouten H.E.M., 1985, PERCEPT PSYCHOPHYS, V37, P369
Schuller B, 2008, 2008 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-4, P1333, DOI 10.1109/ICME.2008.4607689
Schuller B, 2011, SPEECH COMMUN, V53, P1062, DOI 10.1016/j.specom.2011.01.011
Schuller Bjorn, 2009, P INTERSPEECH, P312
Seppi D., 2010, P SPEECH PROS
SERGEANT RL, 1962, J ACOUST SOC AM, V34, P1625, DOI 10.1121/1.1909065
Silipo R., 1999, P 14 INT C PHON SCI, P2351
Tamburini Fabio, 2007, P INT 2007 ANTW BELG, P1809
TERKEN J, 1991, J ACOUST SOC AM, V89, P1768, DOI 10.1121/1.401019
Vlasenko B., 2011, P ICME, P4230
Vlasenko B, 2007, LECT NOTES COMPUT SC, V4738, P139
Wu SQ, 2011, SPEECH COMMUN, V53, P768, DOI 10.1016/j.specom.2010.08.013
ZWICKER EBERHARD, 1962, JOUR ACOUSTICAL SOC AMER, V34, P1425, DOI 10.1121/1.1918362
NR 74
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 155
EP 169
DI 10.1016/j.specom.2013.09.012
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100012
ER
PT J
AU Wolf, M
Nadeu, C
AF Wolf, Martin
Nadeu, Climent
TI Channel selection measures for multi-microphone speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic speech recognition; Channel (microphone) selection; Signal
quality; Multi-microphone; Reverberation
ID NOISE
AB Automatic speech recognition in a room with distant microphones is strongly affected by noise and reverberation. In scenarios where the speech signal is captured by several arbitrarily located microphones, the degree of distortion differs from one channel to another. In this work we deal with measures extracted from a given distorted signal that either estimate its quality or measure how well it fits the acoustic models of the recognition system. We then apply them to solve the problem of selecting the signal (i.e. the channel) that presumably leads to the lowest recognition error rate. New channel selection techniques are presented and compared experimentally in reverberant environments with other approaches reported in the literature. Significant improvements in recognition rate are observed for most of the measures. A new measure based on the variance of the speech intensity envelope shows a good trade-off between recognition accuracy, latency and computational cost. Also, the combination of measures allows a further improvement in recognition rate. (C) 2013 Elsevier B.V. All rights reserved.
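The envelope-variance measure named in this abstract lends itself to a short sketch. The version below is an illustration under assumed settings (frame length, hop size, log frame energy), not the authors' code: reverberation tends to flatten the intensity envelope, so the channel whose envelope varies most is selected.

import numpy as np

def envelope_variance(signal, frame_len=400, hop=160):
    """Variance of the log frame-energy envelope of one channel."""
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    energy = np.array([np.sum(f ** 2) + 1e-12 for f in frames])
    return np.var(np.log(energy))

def select_channel(channels):
    """Index of the channel with the largest envelope variance."""
    return int(np.argmax([envelope_variance(ch) for ch in channels]))

# Toy usage: a strongly modulated signal vs. a flat noisy one.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
modulated = np.sin(2 * np.pi * 5 * t) * rng.standard_normal(16000)
flat = 0.5 * rng.standard_normal(16000)
print(select_channel([flat, modulated]))  # expected: 1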
C1 [Wolf, Martin; Nadeu, Climent] Univ Politecn Cataluna, TALP Res Ctr, Dept Signal Theory & Commun, ES-08034 Barcelona, Spain.
RP Wolf, M (reprint author), Univ Politecn Cataluna, TALP Res Ctr, Dept Signal Theory & Commun, Jordi Girona 1-3, ES-08034 Barcelona, Spain.
EM martin.wolf@upc.edu; climent.nadeu@upc.edu
RI Nadeu, Climent/B-9638-2014
OI Nadeu, Climent/0000-0002-5863-0983
FU Spanish project SARAI [TEC2010-21040-C02-01]
FX This work was supported by the Spanish project SARAI (reference number
TEC2010-21040-C02-01).
CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702
Brandstein M., 2001, MICROPHONE ARRAYS
de la Torre A., 2002, P ICASSP
Fisher RA, 1936, ANN EUGENIC, V7, P179
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224
Hui Jiang, 2005, Speech Communication, V45, DOI 10.1016/j.specom.2004.12.004
ICSI, 2003, ICSI M REC DIG CORP
Janin A., 2004, P ICASSP 2004 M REC
Jeub M., 2011, P EUR SIGN PROC C EU
Kumar K, 2011, P HSCMA ED UK, P1
Leonard R. G., 1984, P ICASSP 84, P111
Molau S, 2001, ASRU 2001: IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, CONFERENCE PROCEEDINGS, P21
NIST, 2000, SPEECH QUAL ASS SPQA
Obuchi Y, 2006, ELECTRON COMM JPN 2, V89, P9, DOI 10.1002/ecjb.20281
Obuchi Y., 2004, WORKSH STAT PERC AUD
OPENSHAW JP, 1994, INT CONF ACOUST SPEE, P49
Petrick R., 2007, P INT, P1094
Shimizu Y, 2000, INT CONF ACOUST SPEE, P1747, DOI 10.1109/ICASSP.2000.862090
Wolf M., 2009, P 1 JOINT SIG IL MIC, P61
Wolf M., 2012, P IBERSPEECH MADR SP, P513
Wolf M., 2010, P INTERSPEECH TOK JA, P80
Wolfel M., 2007, P INT ANTW AUG, P582
Wolfel M., 2009, DISTANT SPEECH RECOG
Wolfel M., 2006, P INTERSPEECH
Young S., 2006, HTK BOOK HTK VERSION
NR 26
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 170
EP 180
DI 10.1016/j.specom.2013.09.015
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100013
ER
PT J
AU Xu, Y
Prom-on, S
AF Xu, Yi
Prom-on, Santitham
TI Toward invariant functional representations of variable surface
fundamental frequency contours: Synthesizing speech melody via
model-based stochastic learning
SO SPEECH COMMUNICATION
LA English
DT Article
DE Prosody modeling; Target approximation; Parallel encoding;
Analysis-by-synthesis; Simulated annealing
ID COMMAND-RESPONSE MODEL; STANDARD CHINESE; MATCHED STATEMENTS; MANDARIN
CHINESE; FOCUS LOCATION; INTONATION; TONE; PROSODY; ENGLISH; PITCH
AB Variability has been one of the major challenges for both theoretical understanding and computer synthesis of speech prosody. In this paper we show that economical representation of variability is the key to effective modeling of prosody. Specifically, we report the development of PENTAtrainer, a trainable yet deterministic prosody synthesizer based on an articulatory functional view of speech. We show with testing results on Thai, Mandarin and English that it is possible to achieve high-accuracy predictive synthesis of fundamental frequency contours with very small sets of parameters obtained through stochastic learning from real speech data. The first key component of this system is syllable-synchronized sequential target approximation implemented as the qTA model, which is designed to simulate, for each tonal unit, a wide range of contextual variability with a single invariant target. The second key component is the automatic learning of function-specific targets through stochastic global optimization, guided by a layered pseudo-hierarchical functional annotation scheme, which requires the manual labeling of only the temporal domains of the functional units. The results in terms of synthesis accuracy demonstrate that effective modeling of the contextual variability is the key also to effective modeling of function-related variability. Additionally, we show that, being both theory-based and trainable (hence data-driven), computational systems like PENTAtrainer can serve as an effective modeling tool in basic research, with which the level of falsifiability in theory testing can be raised, and also a closer link between basic and applied research in speech science can be developed. (C) 2013 Elsevier B.V. All rights reserved.
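The qTA model referred to in this abstract has a compact closed form (see the Prom-on et al., 2009 reference below): within each syllable, F0 approaches a linear pitch target under a third-order critically damped system, and F0 value, velocity and acceleration are carried over at syllable boundaries. The sketch below implements that single-syllable response; parameter values are purely illustrative and nothing here reproduces the stochastic learning of targets.

import numpy as np

def qta_syllable(m, b, lam, duration, f0_0, df0_0, ddf0_0, fs=200):
    """F0 contour of one syllable under the qTA target-approximation equations.

    m, b     : slope (st/s) and height (st) of the linear pitch target m*t + b
    lam      : rate of target approximation (1/s)
    duration : syllable duration in seconds
    f0_0, df0_0, ddf0_0 : initial F0, velocity, acceleration (from the previous syllable)
    """
    t = np.arange(0.0, duration, 1.0 / fs)
    c1 = f0_0 - b
    c2 = df0_0 + c1 * lam - m
    c3 = (ddf0_0 + 2.0 * c2 * lam - c1 * lam ** 2) / 2.0
    return t, (m * t + b) + (c1 + c2 * t + c3 * t ** 2) * np.exp(-lam * t)

# Toy usage: a static high target (10 st) approached from a lower starting pitch (6 st).
t, f0 = qta_syllable(m=0.0, b=10.0, lam=40.0, duration=0.2,
                     f0_0=6.0, df0_0=0.0, ddf0_0=0.0)
print(round(f0[0], 2), round(f0[-1], 2))  # starts at 6.0 st, ends close to the 10 st target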
C1 [Xu, Yi; Prom-on, Santitham] UCL, Dept Speech Hearing & Phonet Sci, London WC1N 1PF, England.
[Prom-on, Santitham] King Mongkuts Univ Technol Thonburi, Dept Comp Engn, Fac Engn, Bangkok 10140, Thailand.
RP Prom-on, S (reprint author), King Mongkuts Univ Technol Thonburi, Dept Comp Engn, Fac Engn, 126 Prachauthit Rd, Bangkok 10140, Thailand.
EM yi.xu@ucl.ac.uk; santitham@cpe.kmutt.ac.th
FU Royal Society; Royal Academy of Engineering through the Newton
International Fellowship Scheme; Thai Research Fund through the Research
Grant for New Researcher [TRG5680096]; National Science Foundation
FX We would like to thank for the financial supports the Royal Society and
the Royal Academy of Engineering through the Newton International
Fellowship Scheme (to SP), the Thai Research Fund through the Research
Grant for New Researcher (Grant Number TRG5680096 to SP), and the
National Science Foundation (to YX). We thank Fang Liu for providing the
English and Mandarin Chinese corpora used in this work. We would further
like to thank the Organizers of Speech Prosody 2012 for inviting us to
give the tutorial about this work at the conference.
CR Abramson A.S., 1962, INT J AM LINGUIST, V28, P3
Anderson M.D., 1984, P ICASSP 1984 SAN DI, P77
Arvaniti A, 2009, PHONOLOGY, V26, P43, DOI 10.1017/S0952675709001717
Bailly G, 2005, SPEECH COMMUN, V46, P348, DOI 10.1016/j.specom.2005.04.008
Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X
Black AW, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1385
Boersma Paul, 2012, PRAAT DOING PHONETIC
Breen M, 2012, CORPUS LINGUIST LING, V8, P277, DOI 10.1515/cllt-2012-0011
Chao Yuen Ren, 1968, GRAMMAR SPOKEN CHINE
Chen G.-P., 2004, P ISCSLP, P177
Chen Mathew Y., 2000, TONE SANDHI PATTERNS
Chen YY, 2006, PHONETICA, V63, P47, DOI 10.1159/000091406
Collier R., 1990, PERCEPTUAL STUDY INT
COOPER WE, 1985, J ACOUST SOC AM, V77, P2142, DOI 10.1121/1.392372
Duanmu S., 2000, PHONOLOGY STANDARD C
EADY SJ, 1986, J ACOUST SOC AM, V80, P402, DOI 10.1121/1.394091
EFRON B, 1979, ANN STAT, V7, P1, DOI 10.1214/aos/1176344552
Fujisaki H, 2005, SPEECH COMMUN, V47, P59, DOI 10.1016/j.specom.2005.06.009
Fujisaki H., 1990, P ICSLP 90, P841
GANDOUR J, 1994, J PHONETICS, V22, P477
Grabe E, 2007, LANG SPEECH, V50, P281
Gu WT, 2006, IEEE T AUDIO SPEECH, V14, P1155, DOI 10.1109/TASL.2006.876132
Gu WT, 2007, PHONETICA, V64, P29, DOI 10.1159/0000100060
HADDINGKOCH K, 1964, PHONETICA, V11, P175
Hermes DJ, 1998, J SPEECH LANG HEAR R, V41, P73
Hirst D. J., 2011, J SPEECH SCI, V1, P55
Hirst DJ, 2005, SPEECH COMMUN, V46, P334, DOI 10.1016/j.specom.2005.02.020
HO AT, 1977, PHONETICA, V34, P446
HOWIE JM, 1974, PHONETICA, V30, P129
Jilka M, 1999, SPEECH COMMUN, V28, P83, DOI 10.1016/S0167-6393(99)00008-4
Jokisch O., 2000, P ICSLP 2000 BEIJ, P645
KIRKPATRICK S, 1983, SCIENCE, V220, P671, DOI 10.1126/science.220.4598.671
KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275
Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X
Konishi S, 1996, BIOMETRIKA, V83, P875, DOI 10.1093/biomet/83.4.875
Kuo YC, 2007, PHONOL PHONET, V12-2, P211
Ladd DR, 2008, CAMB STUD LINGUIST, V79, P1
Ladefoged P., 1967, 3 AREAS EXPT PHONETI
Lee Y.-C., 2010, P SPEECH PROS 2010 C
Liu F., 2013, J SPEECH SCI, V3, P85
Liu F, 2005, PHONETICA, V62, P70, DOI 10.1159/000090090
Mixdorff H., 2003, P EUR 2003, P873
Myers Scott, 1998, PHONOLOGY, V15, P367, DOI 10.1017/S0952675799003620
Ni J., 2004, P INT C SPEECH PROS, P95
Ni JF, 2006, J ACOUST SOC AM, V119, P1764, DOI 10.1121/1.2165071
Ni JF, 2006, SPEECH COMMUN, V48, P989, DOI 10.1016/j.specom.2006.01.002
OSHAUGHNESSY D, 1983, J ACOUST SOC AM, V74, P1155, DOI 10.1121/1.390039
Pell MD, 2001, J ACOUST SOC AM, V109, P1668, DOI 10.1121/1.1352088
Peng S.-H., 2000, PAPERS LAB PHONOLOGY, VV, P152
Perkell J. S., 1986, INVARIANCE VARIABILI
Perrier P, 1996, J SPEECH HEAR RES, V39, P365
PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875
PIERREHUMBERT J, 1981, J ACOUST SOC AM, V70, P985, DOI 10.1121/1.387033
Pierrehumbert Janet, 1980, THESIS MIT CAMBRIDGE
Potisuk S., 1997, PHONETICA, V42, P22
Prom-on S., 2011, P 17 INT C PHON SCI, P1638
Prom-on S, 2009, J ACOUST SOC AM, V125, P405, DOI 10.1121/1.3037222
Prom-on S, 2012, J ACOUST SOC AM, V132, P421, DOI 10.1121/1.4725762
Raidt S., 2004, P SPEECH PROS 2004 N, P417
Rose P.J., 1988, PROSODIC ANAL ASIAN, P55
Ross KN, 1999, IEEE T SPEECH AUDI P, V7, P295, DOI 10.1109/89.759037
Sakurai A, 2003, SPEECH COMMUN, V40, P535, DOI 10.1016/S0167-6393(02)00177-2
Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2
ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572
Shih C., 1987, PHONETICS CHINESE TO
Silverman K., 1992, P INT C SPOK LANG PR, V2, P867
SILVERMAN K, 1986, PHONETICA, V43, P76
Sun X., 2002, THESIS NW U
Sun XJ, 2002, J VOICE, V16, P443, DOI 10.1016/S0892-1997(02)00119-4
Syrdal A. K., 2000, P INT C SPOK LANG PR, V3, P235
Taylor P, 2009, TEXT TO SPEECH SYNTH
Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453
Vainio M., 2009, P SPECOM 2009 ST PET, P164
Vainio M, 2010, J ACOUST SOC AM, V128, P1313, DOI 10.1121/1.3467767
van Santen JPH, 2000, TEXT SPEECH LANG TEC, V15, P269
Wagner M, 2010, LANG COGNITIVE PROC, V25, P905, DOI 10.1080/01690961003589492
WANG WSY, 1967, J SPEECH HEAR RES, V10, P629
WHALEN DH, 1995, J PHONETICS, V23, P349, DOI 10.1016/S0095-4470(95)80165-0
Wightman C., 1999, P IEEE ASRU 1999 KEY, P333
Wightman C., 2002, P SPEECH PROS 2002 A, P25
Wong Y.W., 2007, P 16 INT C PHON SCI, P1293
Wu W.L., 2010, P SPEECH PROS 2010 C
Wu Zong Ji, 1984, ZHONGGUO YUYAN XUEBA, V2, P70
Xiaonan Shen, 1990, PROSODY MANDARIN CHI
Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086
Xu Y, 2012, LINGUIST REV, V29, P131, DOI 10.1515/tlr-2012-0006
Xu Y, 1997, J PHONETICS, V25, P61, DOI 10.1006/jpho.1996.0034
Xu Y, 2013, PLOS ONE, V8, DOI 10.1371/journal.pone.0062397
Xu Y., 2011, J SPEECH SCI, V1, P85
Xu Y, 2005, SPEECH COMMUN, V46, P220, DOI 10.1016/j.specom.2005.02.014
Xu Y., 2009, J PHONETICS, V37, P507
Xu Y., 2013, PROSODY ICONICITY, P33
Xu Y., 2010, PENTATRAINER1
Xu Y, 2005, J PHONETICS, V33, P159, DOI 10.1016/j.wocn.2004.11.001
Xu Y, 1998, PHONETICA, V55, P179, DOI 10.1159/000028432
Xu Y, 2001, SPEECH COMMUN, V33, P319, DOI 10.1016/S0167-6393(00)00063-7
Xu Y, 2002, J ACOUST SOC AM, V111, P1399, DOI 10.1121/1.1445789
Yang XH, 2012, PROCEEDINGS OF THE 6TH INTERNATIONAL CONFERENCE ON SPEECH PROSODY, VOLS I AND II, P543
Yip M., 2002, TONE
Yuan J., 2002, P 1 INT C SPEECH PRO, P711
Zhang J., 2004, PHONETICALLY BASED P
NR 101
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 181
EP 208
DI 10.1016/j.specom.2013.09.013
PG 28
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100014
ER
PT J
AU Wagner, P
Malisz, Z
Kopp, S
AF Wagner, Petra
Malisz, Zofia
Kopp, Stefan
TI Gesture and speech in interaction: An overview
SO SPEECH COMMUNICATION
LA English
DT Editorial Material
ID EMBODIED CONVERSATIONAL AGENTS; BEHAVIOR MARKUP LANGUAGE; HEAD
MOVEMENTS; COLLABORATIVE PROCESS; PROSODIC PROMINENCE; LISTENER
RESPONSES; ANNOTATION-SCHEME; VIRTUAL HUMANS; HAND; COMMUNICATION
AB Gestures and speech interact. They are linked in language production and perception, with their interaction contributing to felicitous communication. The multifaceted nature of these interactions has attracted considerable attention from the speech and gesture community. This article provides an overview of our current understanding of manual and head gesture form and function, and of the principal functional interactions between gesture and speech aiding communication, transporting meaning and producing speech. Furthermore, we present an overview of research on temporal speech-gesture synchrony, including the special role of prosody in speech-gesture alignment. In addition, we provide a summary of tools and data available for gesture analysis, and describe speech-gesture interaction models and simulations in technical systems. This overview also serves as an introduction to a Special Issue covering a wide range of articles on these topics. We provide links to the Special Issue throughout this paper. (C) 2013 Elsevier B.V. All rights reserved.
EM zofia.malisz@uni-bielefeld.de
CR Abercrombie David, 1954, ELT J, V9, P3
Akakin HC, 2011, IMAGE VISION COMPUT, V29, P470, DOI 10.1016/j.imavis.2011.03.001
Alibali MW, 2001, J MEM LANG, V44, P169, DOI 10.1006/jmla.2000.2752
Alibali MW, 2000, LANG COGNITIVE PROC, V15, P593
Alibali MW, 1999, COGNITIVE DEV, V14, P37, DOI 10.1016/S0885-2014(99)80017-3
Allwood J, 2007, LANG RESOUR EVAL, V41, P273, DOI 10.1007/s10579-007-9061-5
Allwood J., 2003, 1 NORD S MULT COMM C, P7
Al Moubayed S, 2010, J MULTIMODAL USER IN, V3, P299, DOI 10.1007/s12193-010-0054-0
Altorfer A, 2000, BEHAV RES METH INS C, V32, P17, DOI 10.3758/BF03200785
Bavelas J, 2008, J MEM LANG, V58, P495, DOI 10.1016/j.jml.2007.02.004
Bavelas JB, 1994, RES LANG SOC INTERAC, V27, P201, DOI DOI 10.1207/S15327973RLSI2703_3
Bavelas JB, 2002, J COMMUN, V52, P566, DOI 10.1093/joc/52.3.566
Becker Raymond, 2011, P GESPIN2011 GEST SP
Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X
Bergmann K, 2006, P 10 WORKSH SEM PRAG, P90
Bergmann K, 2010, LECT NOTES ARTIF INT, V6356, P104, DOI 10.1007/978-3-642-15892-6_11
Bergmann K., 2013, LECT NOTES ARTIF INT, P139
Bergmann K, 2010, LECT NOTES ARTIF INT, V5934, P182, DOI 10.1007/978-3-642-12553-9_16
Bergmann Kirsten, 2011, P GESPIN2011 GEST SP
Bergmann Kirsten, 2013, P INT C INT VIRT AG
Bergmann Kirsten, 2009, P 8 INT C AUT AG MUL, P361
Beskow J., 2006, FOCAL ACCENT FACIAL, P52
Beskow J, 2007, LECT NOTES COMPUT SC, V4775, P250
Bevacqua Elisabetta, 2009, THESIS U PARIS 8 PAR
Birdwhistell Ray L., 1970, ESSAYS BODY MOTION C
Birdwhistell Ray L., 1952, INTRO KINESICS ANNOT
Boersma P., 2008, PRAAT DOING PHONETIC
BOLINGER D, 1983, AM SPEECH, V58, P156, DOI 10.2307/455326
Bolinger D., 1986, INTONATION ITS PARTS
Bolinger Dwight, 1961, LANGUAGE, V37, P87
Bolinger Dwight, 1982, CHICAGO LINGUISTICS, P1
Bousmalis Konstantinos, 2012, IMAGE VISION COMPUT, V31, P203
Bressem J, 2011, SEMIOTICA, V184, P53, DOI 10.1515/semi.2011.022
Browman Catherine, 1986, PHONOLOGY YB, V3, P219
Brugman Hennie, 2004, P 4 INT C LANG RES E, P2065
Brugman Hennie, 2002, 3 INT C LANG RES EV, P176
BULL P, 1985, J NONVERBAL BEHAV, V9, P169, DOI 10.1007/BF01000738
Buss S. R., 2003, 3D COMPUTER GRAPHICS
Butterworth B., 1978, RECENT ADV PSYCHOL L
Cafaro Angelo, 2012, IVA, P67
Caldognetto Emanuela M., 2004, P LREC WORKSH MULT C, P29
Caldognetto Magno, 2001, MULTIMODALITA MULTIM
Cassell J., 2000, EMBODIED CONVERSATIO
Cassell Justine, 1996, COMPUTER VISION HUMA
Cerrato Loredana, 2007, THESIS KTH COMPUTER
Cerrato Loredana, 2005, GOTHENBURG PAPERS TH, P153
Chiu Chung-Cheng, 2011, 11 INT C INT VIRT AG
CHRISTENFELD N, 1991, J PSYCHOLINGUIST RES, V20, P1, DOI 10.1007/BF01076916
Chui K, 2005, J PRAGMATICS, V37, P871, DOI 10.1016/j.pragma.2004.10.016
Cienki A, 2008, CAMB HANDB PSYCHOL, P483
Clark H. H., 1991, PERSPECTIVES SOCIALL, V13, P127, DOI DOI 10.1037/10096-006
CLARK HH, 1986, COGNITION, V22, P1, DOI 10.1016/0010-0277(86)90010-7
Condon W. S., 1971, PERCEPTION LANGUAGE
Corradini A., 2002, Gesture and Sign Language in Human-Computer Interaction. International Gesture Workshop, GW 2001. Revised Papers (Lecture Notes in Artificial Intelligence Vol.2298)
De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018
De Ruiter JP, 2000, LANGUAGE GESTURE, P248
de Ruiter JP, 2010, INTERACT STUD, V11, P51, DOI 10.1075/is.11.1.05rui
de Ruiter JP, 2012, TOP COGN SCI, V4, P232, DOI 10.1111/j.1756-8765.2012.01183.x
DeSteno D, 2012, PSYCHOL SCI, V23, P1549, DOI 10.1177/0956797612448793
DITTMANN A. T., 1969, J PERS SOC PSYCHOL, V23, P283
DITTMANN AT, 1968, J PERS SOC PSYCHOL, V9, P79, DOI 10.1037/h0025722
Dobrogaev S.M., 1929, IAZYKOVEDENIE MAT, P105
DUNCAN S, 1972, J PERS SOC PSYCHOL, V23, P283, DOI 10.1037/h0033031
Eickeler S, 1998, INT C PATT RECOG, P1206
Ekman P., 1979, HUMAN ETHOLOGY, P169
EKMAN P, 1972, J COMMUN, V22, P353, DOI 10.1111/j.1460-2466.1972.tb00163.x
Erdem U. M., 2002, Proceedings 16th International Conference on Pattern Recognition, DOI 10.1109/ICPR.2002.1044759
Eriksson A., 2001, PROCEEDINGS OF EUROS, P399
Ferre G., 2010, LANGUAGE RESOURCES E
Feyereisen P., 1987, PSYCHOL REV, V94, P168
Gentilucci Maurizio, 2007, GESTURE, V7, P159, DOI 10.1075/gest.7.2.03gen
GOLDINMEADOW S, 1993, PSYCHOL REV, V100, P279, DOI 10.1037/0033-295X.100.2.279
Goldin-Meadow S, 2000, CHILD DEV, V71, P231, DOI 10.1111/1467-8624.00138
Goldin-Meadow S, 2001, PSYCHOL SCI, V12, P516, DOI 10.1111/1467-9280.00395
Goldin-Meadow S, 1999, TRENDS COGN SCI, V3, P419, DOI 10.1016/S1364-6613(99)01397-2
Goldsmith J., 1990, AUTOSEGMENTAL METRIC
Goodwin C., 1981, CONVERSATIONAL ORG I
Granstrom Bjorn, 2007, P 16 INT C PHON SCI, P11
Gravano A, 2011, COMPUT SPEECH LANG, V25, P601, DOI 10.1016/j.csl.2010.10.003
Gussenhoven Carlos, 1999, LANG SPEECH, V42, P283
Gut Ulrike, 2003, KI, V17, P34
Habets B, 2011, J COGNITIVE NEUROSCI, V23, P1845, DOI 10.1162/jocn.2010.21462
HADAR U, 1984, HUM MOVEMENT SCI, V3, P237, DOI 10.1016/0167-9457(84)90018-6
HADAR U, 1983, HUM MOVEMENT SCI, V2, P35, DOI 10.1016/0167-9457(83)90004-0
BUTTERWORTH B, 1989, PSYCHOL REV, V96, P168, DOI 10.1037//0033-295X.96.1.168
Hadar Uri, 1985, J NONVERBAL BEHAV, V9, P214
Harling P. A., 1997, Progress in Gestural Interaction. Proceedings of Gesture Workshop '96
Harrison S., 2013, P TIGER 2013 TILB NL
Hartmann B, 2006, LECT NOTES ARTIF INT, V3881, P188
Heldner Mattias, 2012, NORD PROS P 11 C TAR, P137
Heylen D, 2011, COGN TECHNOL, P321, DOI 10.1007/978-3-642-15184-2_17
Heylen D, 2008, LECT NOTES ARTIF INT, V4930, P241
Heylen D, 2006, INT J HUM ROBOT, V3, P241, DOI 10.1142/S0219843606000746
Heylen Dirk, 2005, P JOINT S VIRT SOC A, P45
Holler J, 2007, J LANG SOC PSYCHOL, V26, P4, DOI 10.1177/0261927X06296428
Holler J., 2009, J LANG SOC PSYCHOL, V26, P4
Holler J, 2003, GESTURE, V3, P127, DOI DOI 10.1075/GEST.3.2.02HOL
Hostetter A., 2007, GESTURE, V7, P73, DOI DOI 10.1075/GEST.7.1.05HOS
Hostetter AB, 2007, LANG COGNITIVE PROC, V22, P313, DOI 10.1080/01690960600632812
Hostetter AB, 2012, GESTURE, V12, P62, DOI 10.1075/gest.12.1.04hos
Hostetter AB, 2008, PSYCHON B REV, V15, P495, DOI 10.3758/PBR.15.3.495
Ishii R, 2008, LECT NOTES COMPUT SC, V5208, P200
Iverson J. M., 1999, J CONSCIOUSNESS STUD, V6, P19
Iverson JM, 1998, NATURE, V396, P228, DOI 10.1038/24300
Jakobson Roman, 1972, LANGUAGE SOC, V1, P91
Jannedy S., 2005, INTERDISCIPLINARY ST, V03, P199
Jokinen K, 2010, P WORKSH EYE GAZ INT, P118
Jun Sun-Ah, 2007, PROSODIC TYPOLOGY PH, P430
Kapoor A, 2001, P 2001 WORKSH PERC U, P1, DOI 10.1145/971478.971509
Karpitiski Maciej, 2009, SPEECH LANGUAGE TECH, V11, P113
Kelso J. A. S., 1983, PRODUCTION SPEECH, P138
Kendon A., 2004, GESTURE VISIBLE ACTI
Kendon Adam, 1972, STUDIES DYADIC COMMU, P177
Kendon Adam, 1980, NONVERBAL COMMUNICAT, P207
Kendon Adam, 2003, GESTURE, V2, P147
Kipp M, 2007, LANG RESOUR EVAL, V41, P325, DOI 10.1007/s10579-007-9053-5
Kipp M., 2004, THESIS SAARLAND U
Kipp M., 2009, P INT C AFF COMP INT
Kipp Michael, 2001, P WORKSH MULT COMM C
Kipp Michael, 2009, LECT NOTES COMPUTER, V5509
Kipp Michael, 2012, MULTIMEDIA INFORM EX, P531
Kirchhof Carolin, 2012, 5 C INT SOC GEST STU
Kirchhof Carolin, 2011, P GESPIN2011 GEST SP
Kita S., 2000, LANGUAGE CULTURE COG, P162
Kita S, 2003, J MEM LANG, V48, P16, DOI 10.1016/S0749-596X(02)00505-3
Kita S., 2009, LANG COGNITIVE PROC, V24, P795
Knight Dawn, 2011, RBLA, V2, P391
Kopp S., 2004, P INT C MULT INT ICM, P97, DOI 10.1145/1027933.1027952
Kopp S, 2006, LECT NOTES ARTIF INT, V4133, P205
Kopp S, 2008, LECT NOTES ARTIF INT, V4930, P18
Kopp Stefan, 2013, P 35 ANN M COGN SCI
Kousidis Spyros, 2013, TIGER 2013 TILB GEST
Kousidis Spyros, 2012, P INT WORKSH FEEDB B
Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005
Krauss R., 1999, GESTURE SPEECH SIGN, P93
Krenn B., 2004, P AISB 2004 S LANG S, P107
Kunin M, 2007, J NEUROPHYSIOL, V98, P3095, DOI 10.1152/jn.00764.2007
Lakoff John, 1980, METAPHORS WE LIVE
Lausberg H, 2009, BEHAV RES METHODS, V41, P841, DOI 10.3758/BRM.41.3.841
Lee J, 2006, LECT NOTES ARTIF INT, V4133, P243
Leonard Thomas, 2010, LANG COGNITIVE PROC, V26, P1457
Leonard Thomas, 2009, P GESPIN2009 GEST SP, P1
LEVELT WJM, 1985, J MEM LANG, V24, P133, DOI 10.1016/0749-596X(85)90021-X
Li Renxiang, 2007, MOBILITY 07, P572
Loehr D. P., 2012, LAB PHONOL, V3, P71
Loehr DP, 2004, THESIS GEORGETOWN U
LOEvenbruck H., 2009, SOME ASPECTS SPEECH, P211
Louwerse MM, 2012, COGNITIVE SCI, V36, P1404, DOI 10.1111/j.1551-6709.2012.01269.x
Lu Peng, 2005, LECT NOTES COMPUTER, P495
Lucking A, 2013, J MULTIMODAL USER IN, V7, P5, DOI 10.1007/s12193-012-0106-8
Martell C., 2002, P ICSLP 02, P353
Martell CH, 2005, TEXT SPEECH LANG TEC, V30, P79
Mayer RE, 2012, J EXP PSYCHOL-APPL, V18, P239, DOI 10.1037/a0028616
McClave E., 1991, THESIS GEORGETOWN U
MCCLAVE E, 1994, J PSYCHOLINGUIST RES, V23, P45, DOI 10.1007/BF02143175
McClave EZ, 2000, J PRAGMATICS, V32, P855, DOI 10.1016/S0378-2166(99)00079-X
McNeill D., 1992, HAND MIND WHAT GESTU
McNeill D., 2005, GESTURE AND THOUGHT
MCNEILL D, 1985, PSYCHOL REV, V92, P350, DOI 10.1037//0033-295X.92.3.350
McNeill David, 1989, PSYCHOL REV, V94, P499
Mertins Inge, 2000, HDB MULTIMODAL SPOKE
Morency L-P, 2005, P 7 INT C MULT INT, P18, DOI 10.1145/1088463.1088470
MORRELSAMUELS P, 1992, J EXP PSYCHOL LEARN, V18, P615, DOI 10.1037/0278-7393.18.3.615
Munhall KG, 2004, PSYCHOL SCI, V15, P133, DOI 10.1111/j.0963-7214.2004.01502010.x
Neff M, 2008, ACM T GRAPHIC, V27, DOI 10.1145/1330511.1330516
NGUYEN L, 2012, P INT C MULT INT ACM, P289
Nobe S, 2000, LANGUAGE GESTURE, P186, DOI 10.1017/CBO9780511620850.012
Oertel C, 2013, J MULTIMODAL USER IN, V7, P19, DOI 10.1007/s12193-012-0108-6
OHALA JJ, 1984, PHONETICA, V41, P1
Ozyurek A., 2007, J COGNITIVE NEUROSCI, V19, P605
Parrell B, 2011, P 9 INT SEM SPEECH P
Poggi I., 2010, P INT C LANG RES EV, P17
Poggi Isabella, 2001, P INT VIRT AG 3 INT, P235
Poggi I, 2013, J MULTIMODAL USER IN, V7, P67, DOI 10.1007/s12193-012-0102-z
Priesters M.A., 2013, P TILB GEST RES M TI
Rietveld Toni, 1985, J PHONETICS, V13, P299
Rochet-Capellan A, 2008, J SPEECH LANG HEAR R, V51, P1507, DOI 10.1044/1092-4388(2008/07-0173)
Rosenfeld H.M., 1980, RELATIONSHIP VERBAL, P193
Roth WM, 2001, REV EDUC RES, V71, P365, DOI 10.3102/00346543071003365
Roustan Benjamin, 2010, SPEECH PROSODY 2010
Ruttkay Z, 2007, LECT NOTES COMPUT SC, V4775, P23
Salem Maha, 2012, INT J SOC ROBOT, V1875-4805, P201
Sargin Mehmet, 2006, P IEEE INT C MULT, P893
Sargin ME, 2008, IEEE T PATTERN ANAL, V30, P1330, DOI 10.1109/TPAMI.2007.70797
Schegloff E.A., 1984, STRUCTURES SOCIAL AC, P266
Schmidt T., 2004, P LREC WORKSH XML BA
SCHRODER M, 2001, P EUR 2001 AALB, V1, P87
Selting M., 1998, LINGUISTISCHE BERICH, V173, P91
Selting Margret, 2009, GESPRACHSFORSCHUNG O, V10, P353
Shattuck-Hufnagel Stefanie, 2007, NATO SECURITY SCI E, V18
Slobin Dan Isaac, 1996, RETHINKING LINGUISTI, P70
So WC, 2009, COGNITIVE SCI, V33, P115, DOI 10.1111/j.1551-6709.2008.01006.x
Spoons D., 1993, INTELLIGENT MULTIMED, P257
Stetson Raymond Herbert, 1951, MOTOR PHONETICS
Stone M, 2004, ACM T GRAPHIC, V23, P506, DOI 10.1145/1015706.1015753
Streeck J, 2008, GESTURE, V8, P285, DOI 10.1075/gest.8.3.02str
SWERTS M, 1994, LANG SPEECH, V37, P21
Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001
Tamburini Fabio, 2007, P INT 2007 ANTW BELG, P1809
TERKEN J, 1991, J ACOUST SOC AM, V89, P1768, DOI 10.1121/1.401019
Theune M, 2010, LECT NOTES ARTIF INT, V5934, P195, DOI 10.1007/978-3-642-12553-9_17
Tomlinson RD, 2000, IEEE ENG MED BIOL, V19, P43, DOI 10.1109/51.827404
Treffner P, 2008, ECOL PSYCHOL, V20, P32, DOI 10.1080/10407410701766643
Trippel Thorsten, 2004, P LREC 2004 LISB POR
Truong KP, 2011, P INT FLOR IT, P2973
TUITE K, 1993, SEMIOTICA, V93, P83, DOI 10.1515/semi.1993.93.1-2.83
Urban Christian, 2011, THESIS BIELEFELD U
Vendler Z, 1967, LINGUISTICS PHILOS
Vilhjalmsson H, 2007, LECT NOTES ARTIF INT, V4722, P99
Wachsmuth I., 1998, LECT NOTES ARTIF INT, V1317, P23
Wagner Petra, NEW THEORY COMMUNICA
Wexelblat A, 1995, ACM T COMPUT-HUM INT, V2, P179, DOI 10.1145/210079.210080
Wilson Andrew D., 1996, INT C AUT FAC GEST R
Wlodarczak Marcin, 2012, P INT WORKSH FEEDB B
Wu Y, 1999, LECT NOTES ARTIF INT, V1739, P103
Yassinik Y., 2004, P INT C SOUND SENS M, pC97
Yehia C., 2002, J PHONETICS, V30, P555
Yoganandan N, 2009, J BIOMECH, V42, P1177, DOI 10.1016/j.jbiomech.2009.03.029
NR 218
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 209
EP 232
DI 10.1016/j.specom.2013.09.008
PG 24
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100015
ER
PT J
AU Ishi, CT
Ishiguro, H
Hagita, N
AF Ishi, Carlos Toshinori
Ishiguro, Hiroshi
Hagita, Norihiro
TI Analysis of relationship between head motion events and speech in
dialogue conversations
SO SPEECH COMMUNICATION
LA English
DT Article
DE Head motion; Paralinguistic information; Dialogue act; Inter-personal
relationship; Spontaneous speech
ID ANIMATION
AB Head motion naturally occurs in synchrony with speech and may convey paralinguistic information (such as intentions, attitudes and emotions) in dialogue communication. With the aim of verifying the relationship between head motion events and speech utterances, analyses were conducted on motion-captured data of multiple speakers during spontaneous dialogue conversations. The relationship between head motion events and dialogue acts was analyzed first. Among the head motion types, nods occurred most frequently during speech utterances, not only for expressing dialogue acts of agreement or affirmation, but also appeared at the end of phrases with strong boundaries (including both turn-keeping and turn-giving dialogue act functions). Head shakes usually appeared for expressing negation, while head tilts appeared mostly in interjections expressing denial, and in phrases with weak boundaries, where the speaker is thinking or has not finished uttering. The synchronization of head motion events and speech was also analyzed with a focus on the timing of nods relative to the last syllable of a phrase. Results showed that nods were highly synchronized with the center portion of backchannels, whereas they were more synchronized with the end portion of the last syllable in phrases with strong boundaries. Speaker variability analyses indicated that the inter-personal relationship with the interlocutor is one factor influencing the frequency of head motion events. It was found that the frequency of nods was lower for dialogue partners with a close relationship (such as family members), where speakers do not have to express careful attitudes. On the other hand, the frequency of nods (especially of multiple nods) clearly increased when the inter-personal relationship between the dialogue partners was distant. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Ishi, Carlos Toshinori; Hagita, Norihiro] ATR Intelligent Robot & Commun Lab, Keihanna Sci City, Kyoto 6190288, Japan.
[Ishiguro, Hiroshi] ATR Hiroshi Ishiguro Special Lab, Kyoto, Japan.
RP Ishi, CT (reprint author), ATR Intelligent Robot & Commun Lab, 2-2-2 Hikaridai, Keihanna Sci City, Kyoto 6190288, Japan.
EM carlos@atr.jp
FU Ministry of Internal Affairs and Communication
FX This work is partly supported by the Ministry of Internal Affairs and
Communication. We thank Kyoko Nakanishi, Maiko Hirano, Chaoran Liu,
Hiroaki Hatano and Mika Morita for their contributions in data
annotation and analysis. We also thank Freerk Wilbers and Judith Haas
for their contributions in the collection and processing of motion data.
CR Beskow J, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1272
Burnham D., 2007, P 8 ANN C INT SPEECH, P698
Busso C, 2007, IEEE T AUDIO SPEECH, V15, P1075, DOI 10.1109/TASL.2006.885910
Dohen M., 2006, P SPEECH PROSODY, P221
Foster ME, 2007, LANG RESOUR EVAL, V41, P305, DOI 10.1007/s10579-007-9055-3
Graf H.P., 2002, P IEEE INT C AUT FAC
Hofer G., 2007, P INT 2007, P722
Ishi Carlos T, 2010, Proceedings of the 5th ACM/IEEE International Conference on Human-Robot Interaction (HRI 2010), DOI 10.1109/HRI.2010.5453183
Ishi C.T., 2006, J PHONETIC SOC JPN, V10, P18
Ishi C.T., 2007, P INT 2007, P670
Ishi C.T., 2008, P INT C AUD VIS SPEE, P37
Iwano Y., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607233
Morency LP, 2007, ARTIF INTELL, V171, P568, DOI 10.1016/j.artint.2007.04.003
Munhall KG, 2004, PSYCHOL SCI, V15, P133, DOI 10.1111/j.0963-7214.2004.01502010.x
Sargin M.E., 2006, P IEEE INT C MULT
Sidner C.L., 2006, P HRI 2006, P290, DOI DOI 10.1145/1121241.1121291
Stegmann M. B., 2002, BRIEF INTRO STAT SHA
Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001
Venditti J., 1997, OSU WORKING PAPERS L, V50, P127
Watanabe T, 2004, INT J HUM-COMPUT INT, V17, P43, DOI 10.1207/s15327590ijhc1701_4
Watanuki K., 2000, P ICSLP
Yehia HC, 2002, J PHONETICS, V30, P555, DOI 10.1006/jpho.2002.0165
NR 22
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 233
EP 243
DI 10.1016/j.specom.2013.06.008
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100016
ER
PT J
AU Rowbotham, S
Holler, J
Lloyd, D
Wearden, A
AF Rowbotham, Samantha
Holler, Judith
Lloyd, Donna
Wearden, Alison
TI Handling pain: The semantic interplay of speech and co-speech hand
gestures in the description of pain sensations
SO SPEECH COMMUNICATION
LA English
DT Article
DE Co-speech gesture; Gesture-speech redundancy; Pain communication; Pain
sensation; Semantic interplay of gesture and speech
ID FACE-TO-FACE; ICONIC GESTURES; COMMUNICATION; VISIBILITY; TELEPHONE;
DIALOGUE; BEHAVIOR; RECORD; WORDS; MODEL
AB Pain is a private and subjective experience about which effective communication is vital, particularly in medical settings. Speakers often represent information about pain sensation in both speech and co-speech hand gestures simultaneously, but it is not known whether gestures merely replicate spoken information or complement it in some way. We examined the representational contribution of gestures in a range of consecutive analyses. Firstly, we found that 78% of speech units containing pain sensation were accompanied by gestures, with 53% of these gestures representing pain sensation. Secondly, in 43% of these instances, gestures represented pain sensation information that was not contained in speech, contributing additional, complementary information to the pain sensation message. Finally, when applying a specificity analysis, we found that in contrast with research in different domains of talk, gestures did not make the pain sensation information in speech more specific. Rather, they complemented the verbal pain message by representing different aspects of pain sensation, contributing to a fuller representation of pain sensation than speech alone. These findings highlight the importance of gestures in communicating about pain sensation and suggest that this modality provides additional information to supplement and clarify the often ambiguous verbal pain message. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Rowbotham, Samantha; Holler, Judith; Lloyd, Donna; Wearden, Alison] Univ Manchester, Sch Psychol Sci, Manchester M13 9PL, Lancs, England.
[Holler, Judith] Max Planck Inst Psycholinguist, NL-6525 XD Nijmegen, Netherlands.
[Lloyd, Donna] Univ Leeds, Inst Psychol Sci, Leeds LS2 9JT, W Yorkshire, England.
RP Rowbotham, S (reprint author), Univ Manchester, Sch Psychol Sci, Coupland Bldg,Oxford Rd, Manchester M13 9PL, Lancs, England.
EM samantha.rowbotham@manchester.ac.uk
CR Alibali MW, 1997, J EDUC PSYCHOL, V89, P183, DOI 10.1037/0022-0663.89.1.183
Alibali MW, 2001, J MEM LANG, V44, P169, DOI 10.1006/jmla.2000.2752
Bavelas J., 2002, GESTURE, V2, P1, DOI 10.1075/gest.2.1.02bav
Bavelas J, 2008, J MEM LANG, V58, P495, DOI 10.1016/j.jml.2007.02.004
Bavelas JB, 1994, RES LANG SOC INTERAC, V27, P201, DOI DOI 10.1207/S15327973RLSI2703_3
BAVELAS JB, 1992, DISCOURSE PROCESS, V15, P469
Bavelas JB, 2000, J LANG SOC PSYCHOL, V19, P163, DOI 10.1177/0261927X00019002001
Beattie G, 1999, SEMIOTICA, V123, P1, DOI 10.1515/semi.1999.123.1-2.1
Beattie G, 1999, J LANG SOC PSYCHOL, V18, P438, DOI 10.1177/0261927X99018004005
Bergmann K., 2006, BRANDIAL 06, P90
Briggs Emma, 2010, Nurs Stand, V25, P35
Bruckner CT, 2006, AM J MENT RETARD, V111, P433, DOI 10.1352/0895-8017(2006)111[433:IKIORB]2.0.CO;2
BUTTERWO.B, 1975, J PSYCHOLINGUIST RES, V4, P75, DOI 10.1007/BF01066991
Cook SW, 2009, COGNITION, V113, P98, DOI 10.1016/j.cognition.2009.06.006
Craig KD, 2009, CAN PSYCHOL, V50, P22, DOI 10.1037/a0014772
CRAIG KD, 1992, APS J, V1, P153, DOI 10.1016/1058-9139(92)90001-S
Ehlich K, 1985, Theor Med, V6, P177, DOI 10.1007/BF00489662
Ekman P, 1968, RES PSYCHOTHER, P179, DOI 10.1037/10546-011
EMMORREY K., 2001, GESTURE, V1, P35, DOI DOI 10.1075/GEST.1.1.04EMM
Frank A. W., 1991, WILL BODY REFLECTION
Gerwing J, 2009, GESTURE, V9, P312, DOI 10.1075/gest.9.3.03ger
Gerwing J, 2011, GESTURE, V11, P308, DOI 10.1075/gest.11.3.03ger
GRAHAM JA, 1975, INT J PSYCHOL, V10, P57, DOI 10.1080/00207597508247319
Hartzband P, 2008, NEW ENGL J MED, V358, P1656, DOI 10.1056/NEJMp0802221
Heath C, 2002, J COMMUN, V52, P597, DOI 10.1093/joc/52.3.597
Holler J, 2003, SEMIOTICA, V146, P81
Holler J, 2002, SEMIOTICA, V142, P31
Holler J, 2009, J NONVERBAL BEHAV, V33, P73, DOI 10.1007/s10919-008-0063-9
Holler J, 2003, GESTURE, V3, P127, DOI DOI 10.1075/GEST.3.2.02HOL
Hostetter AB, 2011, PSYCHOL BULL, V137, P297, DOI 10.1037/a0022128
Hyden LC, 2002, HEALTH, V6, P325, DOI 10.1177/136345930200600305
Jacobs N, 2007, J MEM LANG, V56, P291, DOI 10.1016/j.jml.2006.07.011
Kallai I, 2004, PAIN, V112, P142, DOI 10.1016/j.pain.2004.08.008
Kendon A, 1997, ANNU REV ANTHROPOL, V26, P109, DOI 10.1146/annurev.anthro.26.1.109
Kendon A., 1980, RELATIONSHIP VERBAL, P207
Kendon A., 2004, GESTURE VISIBLE ACTI
Kendon Adam, 1985, PERSPECTIVES SILENCE, P215
Labus JS, 2003, PAIN, V102, P109, DOI 10.1016/S0304-3959(02)00354-8
LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310
LEVINE FM, 1991, PAIN, V44, P69, DOI 10.1016/0304-3959(91)90149-R
Loeser JD, 1999, LANCET, V353, P1607, DOI 10.1016/S0140-6736(99)01311-2
Makoul G, 2001, J AM MED INFORM ASSN, V8, P610
Margalit RS, 2006, PATIENT EDUC COUNS, V61, P134, DOI 10.1016/j.pec.2005.03.004
McCahon S, 2005, CLIN J PAIN, V21, P223, DOI 10.1097/00002508-200505000-00005
McGrath John M, 2007, Health Informatics J, V13, P105, DOI 10.1177/1460458207076466
McNeill D., 1992, HAND MIND WHAT GESTU
McNeill D., 2005, GESTURE AND THOUGHT
MCNEILL D, 1985, PSYCHOL REV, V92, P350, DOI 10.1037//0033-295X.92.3.350
National Cancer Institute, 2011, PAIN ASS
OLDFIELD RC, 1971, NEUROPSYCHOLOGIA, V9, P97, DOI 10.1016/0028-3932(71)90067-4
Padfield D., 2003, PERCEPTIONS PAIN
PRKACHIN KM, 1995, J NONVERBAL BEHAV, V19, P191, DOI 10.1007/BF02173080
Rowbotham S, 2012, J NONVERBAL BEHAV, V36, P1, DOI 10.1007/s10919-011-0122-5
Rowbotham S., HLTH PSYCHO IN PRESS, V22, P19
Ruusuvuori J, 2001, SOC SCI MED, V52, P1093, DOI 10.1016/S0277-9536(00)00227-6
Ryle G., 1949, CONCEPT MIND
Salovey P., 1992, VITAL HLTH STAT, P6
Scarry E., 1985, BODY PAIN MAKING UNM
Schott GD, 2004, PAIN, V108, P209, DOI 10.1016/j.pain.2004.01.037
Swann J, 2010, NURS RESIDENTIAL CAR, V12, P212
Tian TN, 2011, PAIN, V152, P1210, DOI 10.1016/j.pain.2011.02.022
Treasure T, 1998, BRIT MED J, V317, P602
Turk D.C., 2001, HDB PAIN
WAGSTAFF S, 1985, ANN RHEUM DIS, V44, P262, DOI 10.1136/ard.44.4.262
NR 64
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 244
EP 256
DI 10.1016/j.specom.2013.04.002
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100017
ER
PT J
AU Hoetjes, M
Krahmer, E
Swerts, M
AF Hoetjes, Marieke
Krahmer, Emiel
Swerts, Marc
TI Does our speech change when we cannot gesture?
SO SPEECH COMMUNICATION
LA English
DT Article
DE Gesture; Speech; Gesture prevention; Director-matcher task
ID NONVERBAL BEHAVIOR; LEXICAL RETRIEVAL; HAND GESTURES; COMMUNICATION;
ADDRESSEES; VISIBILITY; DIALOGUE; SPEAKING; ACCESS
AB Do people speak differently when they cannot use their hands? Previous studies have suggested that speech becomes less fluent and more monotonous when speakers cannot gesture, but the evidence for this claim remains inconclusive. The present study attempts to find support for this claim in a production experiment in which speakers had to give addressees instructions on how to tie a tie; half of the participants had to perform this task while sitting on their hands. Other factors that influence the ease of communication, such as mutual visibility and previous experience, were also taken into account. No evidence was found for the claim that the inability to gesture affects speech fluency or monotony. An additional perception task showed that people were also not able to hear whether someone gestures or not. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Hoetjes, Marieke; Krahmer, Emiel; Swerts, Marc] Tilburg Univ, Sch Humanities, Tilburg Ctr Cognit & Commun TiCC, NL-5000 LE Tilburg, Netherlands.
RP Hoetjes, M (reprint author), Tilburg Univ, Room D404,POB 90153, NL-5000 LE Tilburg, Netherlands.
EM m.w.hoetjes@tilburguniversity.edu; e.j.krahmer@tilburguniversity.edu;
m.g.j.swerts@tilburguniversity.edu
RI Swerts, Marc/C-8855-2013
FU Netherlands Organization for Scientific Research [27770007]
FX We would like to thank Bastiaan Roset and Nick Wood for statistical and
technical support and help in creating the stimuli, Joost Driessen for
help in transcribing the data, Martijn Goudbeek for statistical support
and Katya Chown for providing background information on Dobrogaev. We
received financial support from The Netherlands Organization for
Scientific Research, via a Vici grant (NWO grant 27770007), which is
gratefully acknowledged. Parts of this paper were presented at the Tabu
dag 2009 in Groningen, at the Gesture Centre at the Max Planck Institute
for Psycholinguistics, at the 2009 AVSP conference, at LabPhon 2010 and
at ISGS 2010. We would like to thank the audiences for their suggestions
and comments. Finally, thanks to the anonymous reviewers for their
useful and constructive comments.
CR Alibali MW, 2001, J MEM LANG, V44, P169, DOI 10.1006/jmla.2000.2752
Alibali MW, 2000, LANG COGNITIVE PROC, V15, P593
Bavelas J, 2008, J MEM LANG, V58, P495, DOI 10.1016/j.jml.2007.02.004
Beattie G, 1999, BRIT J PSYCHOL, V90, P35, DOI 10.1348/000712699161251
Bernardis P, 2006, NEUROPSYCHOLOGIA, V44, P178, DOI 10.1016/j.neuropsychologia.2005.05.007
Boersma P., 2010, PRAAT DOING PHONETIC
BOLINGER D, 1983, AM SPEECH, V58, P156, DOI 10.2307/455326
Cave C., 1996, 4 INT C SPOK LANG PR
Chown K, 2008, STUD E EUR THOUGHT, V60, P307, DOI 10.1007/s11212-008-9063-x
Chu M., 2007, INTEGRATING GESTURES
CLARK HH, 1986, COGNITION, V22, P1, DOI 10.1016/0010-0277(86)90010-7
Clark HH, 2004, J MEM LANG, V50, P62, DOI 10.1016/j.jml.2003.08.004
De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018
De Ruiter J.-P., 2006, ADV SPEECH LANGUAGE, V8, P124, DOI DOI 10.1080/14417040600667285
Dobrogaev S.M., 1929, IAZYKOVEDENIE MAT, P105
Duncan S., 2000, LANGUAGE GESTURE, P141, DOI 10.1017/CBO9780511620850.010
EMMORREY K., 2001, GESTURE, V1, P35, DOI DOI 10.1075/GEST.1.1.04EMM
Finlayson S., 2003, DISFLUENCY SPONTANEO
Flecha-Garcia ML, 2010, SPEECH COMMUN, V52, P542, DOI 10.1016/j.specom.2009.12.003
GRAHAM JA, 1975, EUR J SOC PSYCHOL, V5, P189, DOI 10.1002/ejsp.2420050204
Gullberg M, 2006, LANG LEARN, V56, P155, DOI 10.1111/j.0023-8333.2006.00344.x
Hostetter AB, 2010, J MEM LANG, V63, P245, DOI 10.1016/j.jml.2010.04.003
Hostetter A.B., 2007, 29 ANN M COGN SCI SO, P1097
Kendon A., 2004, GESTURE VISIBLE ACTI
Kendon Adam, 1980, NONVERBAL COMMUNICAT, P207
Kita S, 2003, J MEM LANG, V48, P16, DOI 10.1016/S0749-596X(02)00505-3
Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005
Krahmer E, 2004, HUM COM INT, V7, P191
Krauss RM, 1998, CURR DIR PSYCHOL SCI, V7, P54, DOI 10.1111/1467-8721.ep13175642
Krauss R.M., 2001, GESTURE SPEECH SIGN, P93
Krauss RM, 1996, ADV EXP SOC PSYCHOL, V28, P389, DOI 10.1016/S0065-2601(08)60241-5
Kuhlen AK, 2013, PSYCHON B REV, V20, P54, DOI 10.3758/s13423-012-0341-8
Levelt W. J., 1989, SPEAKING INTENTION A
McClave E, 1998, J PSYCHOLINGUIST RES, V27, P69, DOI 10.1023/A:1023274823974
McNeill D., 1992, HAND MIND WHAT GESTU
Mol L, 2009, GESTURE, V9, P97, DOI 10.1075/gest.9.1.04mol
Morsella E, 2005, J PSYCHOLINGUIST RES, V34, P415, DOI 10.1007/s10936-005-6141-9
Ozyurek A, 2002, J MEM LANG, V46, P688, DOI 10.1006/jmla.2001.2826
Pine KJ, 2007, DEVELOPMENTAL SCI, V10, P747, DOI 10.1111/j.1467-7687.2007.00610.x
Rauscher FH, 1996, PSYCHOL SCI, V7, P226, DOI 10.1111/j.1467-9280.1996.tb00364.x
RIME B, 1984, MOTIV EMOTION, V8, P311, DOI 10.1007/BF00991870
Wittenburg P., 2006, LREC 2006
NR 42
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 257
EP 267
DI 10.1016/j.specom.2013.06.007
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100018
ER
PT J
AU Ferre, G
AF Ferre, Gaelle
TI A multimodal approach to markedness in spoken French
SO SPEECH COMMUNICATION
LA English
DT Article
DE Contrast; Fronted syntactic constructions; Prosodic emphasis; Gesture
reinforcement; Gesture-speech production models
ID PROSODIC PROMINENCE; CONTRASTIVE FOCUS; VISUAL-PERCEPTION; SPEECH;
GESTURE
AB This study aims at examining the links between marked structures in the syntactic and prosodic domains (fronting and focal accent), and the way the two types of contrast can be reinforced by gestures. It was conducted on a corpus of 1h30 of spoken French, involving three pairs of speakers in dialogues. Results show that although the tendency is for marked constructions both in syntax and prosody not to be reinforced by gestures, there is still a higher proportion of gesture reinforcing with prosodic marking than with syntactic fronting. The paper describes which eyebrow and head movements as well as hand gestures are more liable to accompany the two operations. Beyond these findings, the study gives an insight into the current models proposed in the literature for gesture speech production. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Ferre, Gaelle] Univ Nantes, Sch Languages & Linguist Lab LLING, F-44312 Nantes 3, France.
RP Ferre, G (reprint author), Univ Nantes, Fac Langues & Cultures Etrangeres, BP 81227, F-44312 Nantes 3, France.
EM Gaelle.Ferre@univ-nantes.fr
FU French National Research Agency [ANR BLAN08-2_349062]
FX This research is supported by the French National Research Agency
(Project number: ANR BLAN08-2_349062) and is based on a corpus and
transcriptions made by various team members beside the author of the
current paper, whom we would like to thank here. The OTIM project is
referenced on the following webpage: http://aune.lpl.univ-aix.fr/similar
to otim/. Many thanks as well to two anonymous reviewers for their
useful comments on previous versions of the paper.
CR Al Moubayed S., 2010, AUTONOMOUS ADAPTIVE, P55
Al Moubayed S, 2010, J MULTIMODAL USER IN, V3, P299, DOI 10.1007/s12193-010-0054-0
Astesano C., 2004, P JOURN ET PAR JEP F, P1
Beavin Bavelas J., 2000, PERSPECTIVES FLUENCY, P91
Bertrand R., 2008, TRAITEMENT AUTOMATIQ, V49, P105
Blache P, 2009, LECT NOTES ARTIF INT, V5509, P38
Boersma P., 2005, PRAAT DOING PHONETIC
BOLINGER D, 1983, AM SPEECH, V58, P156, DOI 10.2307/455326
Buring D., 2007, OXFORD HDB LINGUISTI, P445
Calhoun S, 2009, STUD PRAGMAT, V8, P53
Combettes Bernard, 1999, THEMATISATION LANGUE, P231
Creissels D., 2004, COURS SYNTAXE GEN, P1
DAHAN D, 1994, J PHYS IV, V4, P501, DOI 10.1051/jp4:19945106
De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018
Dohen M., 2004, P 8 ICSLP, P1313
Dohen M., 2005, P INT 2005, P2413
Dohen M, 2004, SPEECH COMMUN, V44, P155, DOI 10.1016/j.specom.2004.10.009
Dooley R.A., 2000, ANAL DISCOURSE MANUA
Ferre G., 2007, P INT 2007 ANTW BELG
Ferre G., 2011, MULTIMODAL COMMUNICA, V1, P5
Ferre G., 2010, P LREC WORKSH MULT C, P86
Fery C., 2003, NOUVEAUX DEPARTS PHO, P161
Fery Caroline, 2001, AUDIATUR VOX SAPIENT, P153
Gregory ML, 2001, J PRAGMATICS, V33, P1665, DOI 10.1016/S0378-2166(00)00063-1
Halliday M. A. K., 1967, J LINGUIST, V3, P199, DOI DOI 10.1017/S0022226700016613
Herment-Dujardin S., 2002, SPEECH PROSODY 2002, P379
Katz J, 2011, LANGUAGE, V87, P771
Kipp M., 2001, P 7 EUR C SPEECH COM, P1367
Kita S, 2007, LANG COGNITIVE PROC, V22, P1212, DOI 10.1080/01690960701461426
Kita S, 2003, J MEM LANG, V48, P16, DOI 10.1016/S0749-596X(02)00505-3
Kohler K.J., 2006, P SPEECH PROS DRESD, P748
Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005
Krahmer E., 2002, P SPEECH PROS 2002 A, P443
Krahmer E, 2001, SPEECH COMMUN, V34, P391, DOI 10.1016/S0167-6393(00)00058-3
Krauss R. M., 2000, LANGUAGE GESTURE, P261, DOI DOI 10.1017/CBO9780511620850.017
Lacheret-Dujour A., 2003, B SOC LINGUISTIQUE P, VXIII, P137
Lambrecht K., 1994, INFORM STRUCTURE SEN
McNeill D., 1992, HAND MIND WHAT GESTU
McNeill D., 2005, GESTURE AND THOUGHT
Prevost S., 2003, CAHIERS PRAXEMATIQUE, V40, P97
Rialland A., 2002, SPEECH PROSODY 2002, P595
SELKIRK E., 1978, NORDIC PROSODY, P111
Stark E., 1999, THEMATISATION LANGUE, P337
Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001
Wilmes K.A., 2009, THESIS U OSNABRUCK O
NR 45
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 268
EP 282
DI 10.1016/j.specom.2013.06.002
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100019
ER
PT J
AU Rusiewicz, HL
Shaiman, S
Iverson, JM
Szuminsky, N
AF Rusiewicz, Heather Leavy
Shaiman, Susan
Iverson, Jana M.
Szuminsky, Neil
TI Effects of perturbation and prosody on the coordination of speech and
gesture
SO SPEECH COMMUNICATION
LA English
DT Article
DE Gesture; Perturbation; Prosody; Entrainment; Speech-hand coordination;
Capacitance
ID DELAYED AUDITORY-FEEDBACK; FINGER MOVEMENTS; LEXICAL STRESS; BODY
MOVEMENT; RESPONSES; HAND; JAW; TASK; TIME; SUSCEPTIBILITY
AB The temporal alignment of speech and gesture is widely acknowledged as primary evidence of the integration of spoken language and gesture systems. Yet there is a disconnect between the lack of experimental research on the variables that affect the temporal relationship of speech and gesture and the overwhelming acceptance that speech and gesture are temporally coordinated. Furthermore, the mechanism of the temporal coordination of speech and gesture is poorly represented. Recent experimental research suggests that gestures overlap prosodically prominent points in the speech stream, though the effects of other variables such as perturbation of speech are not yet studied in a controlled paradigm. The purpose of the present investigation was to further investigate the mechanism of this interaction according to a dynamic systems framework. Fifteen typical young adults completed a task that elicited the production of contrastive prosodic stress on different syllable positions with and without delayed auditory feedback while pointing to corresponding pictures. The coordination of deictic gestures and spoken language was examined as a function of perturbation, prosody, and position of the target syllable. Results indicated that the temporal parameters of gesture were affected by all three variables. The findings suggest that speech and gesture may be coordinated due to internal pulse-based temporal entrainment of the two motor systems. (C) 2013 Elsevier B.V. All rights reserved.
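The perturbation used here is delayed auditory feedback (DAF): the talker hears their own speech played back with a short delay. A minimal sketch of how such a delayed playback signal can be constructed, in Python; the 200 ms delay and 16 kHz sampling rate are illustrative assumptions, not the study's settings.

import numpy as np

def delayed_auditory_feedback(signal, fs, delay_ms=200.0):
    # Return the playback signal heard under DAF: the input delayed by delay_ms.
    # 200 ms is a commonly used delay; the settings in the study may differ.
    delay_samples = int(round(fs * delay_ms / 1000.0))
    return np.concatenate([np.zeros(delay_samples), np.asarray(signal, dtype=float)])

fs = 16000
tone = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)  # one second of a 440 Hz tone
print(delayed_auditory_feedback(tone, fs, delay_ms=200.0).shape)  # (19200,)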
C1 [Rusiewicz, Heather Leavy] Duquesne Univ, Dept Speech Language Pathol, Pittsburgh, PA 15282 USA.
[Shaiman, Susan; Szuminsky, Neil] Univ Pittsburgh, Dept Commun Sci & Disorders, Pittsburgh, PA 15260 USA.
[Iverson, Jana M.] Univ Pittsburgh, Dept Psychol, Pittsburgh, PA 15260 USA.
RP Rusiewicz, HL (reprint author), Duquesne Univ, Dept Speech Language Pathol, 412 Fisher Hall,600 Forbes Ave, Pittsburgh, PA 15282 USA.
EM rusiewih@duq.edu
FU University of Pittsburgh School of Rehabilitation and Health Sciences
Development Fund
FX This project was supported by the University of Pittsburgh School of
Rehabilitation and Health Sciences Development Fund. We are grateful for
the time of the participants and notable contributions to all stages of
this work by Christine Dollaghan, Thomas Campbell, J. Scott Yaruss, and
Diane Williams. We also thank Jordana Birnbaum, Megan Pelletierre, and
Alyssa Milloy for assistance in data analysis and entry and Megan Murray
for assistance in the construction of this manuscript. Portions of this
work were presented at the 2009 and 2010 Annual Convention of the
American-Speech-Language-Hearing Association and the 2011 conference on
Gesture and Speech in Interaction (GESPIN).
CR ABBS JH, 1984, J NEUROPHYSIOL, V51, P705
ABBS JH, 1976, J SPEECH HEAR RES, V19, P19
ABBS J H, 1982, Journal of the Acoustical Society of America, V71, pS33, DOI 10.1121/1.2019343
Barbosa P.A., 2002, SPEECH PROS AIX EN P
Bard C, 1999, EXP BRAIN RES, V125, P410, DOI 10.1007/s002210050697
Birdwhistell R. L., 1970, KINESICS CONTEXT ESS
Birdwhistell Ray L., 1952, INTRO KINESICS ANNOT
Bluedorn A.C., 2002, HUMAN ORG TIME TEMPO
BULL P, 1985, J NONVERBAL BEHAV, V9, P169, DOI 10.1007/BF01000738
BURKE BD, 1975, J COMMUN DISORD, V8, P75, DOI 10.1016/0021-9924(75)90028-3
Cutler A., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90004-0
CHANG P, 1987, J MOTOR BEHAV, V19, P265
Corriveau KH, 2009, CORTEX, V45, P119, DOI 10.1016/j.cortex.2007.09.008
Cummins F, 1998, J PHONETICS, V26, P145, DOI 10.1006/jpho.1998.0070
De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018
De Ruiter J. P., 1998, THESIS KATHOLIEKE U
DITTMANN AT, 1969, J PERS SOC PSYCHOL, V11, P98, DOI 10.1037/h0027035
ELMAN JL, 1983, J SPEECH HEAR RES, V26, P106
Esteve-Gibert N., 2011, P 2 GEST SPEECH INT
FOLKINS JW, 1975, J SPEECH HEAR RES, V18, P207
FOLKINS JW, 1981, J ACOUST SOC AM, V69, P1441, DOI 10.1121/1.385828
Franks IM, 1998, CAN J EXP PSYCHOL, V52, P93, DOI 10.1037/h0087284
FRANZ EA, 1992, J MOTOR BEHAV, V24, P281
Gentilucci M, 2004, EUR J NEUROSCI, V19, P190, DOI 10.1111/j.1460-9568.2004.03104.x
Gentilucci M, 2001, J NEUROPHYSIOL, V86, P1685
Goffman L, 1999, J SPEECH LANG HEAR R, V42, P1003
Goldin-Meadow S, 1999, TRENDS COGN SCI, V3, P419, DOI 10.1016/S1364-6613(99)01397-2
Gracco V.L., 1986, EXP BRAIN RES, V65, P155
GRACCO VL, 1985, J NEUROPHYSIOL, V54, P418
Guenther FH, 1998, PSYCHOL REV, V105, P611
HAKEN H, 1985, BIOL CYBERN, V51, P347, DOI 10.1007/BF00336922
HISCOCK M, 1986, NEUROPSYCHOLOGIA, V24, P691, DOI 10.1016/0028-3932(86)90008-4
Howell P, 2002, J ACOUST SOC AM, V111, P2842, DOI 10.1121/1.1474444
HOWELL P, 1984, PERCEPT PSYCHOPHYS, V36, P296, DOI 10.3758/BF03206371
Iverson J. M., 1999, J CONSCIOUSNESS STUD, V6, P19
JONES MR, 1989, PSYCHOL REV, V96, P459, DOI 10.1037//0033-295X.96.3.459
Kelso J.A., 1984, J EXPER PSYCH HUM PE, V6, P812
Kelso J.A., 1984, AM J PHYSIO REGULAT, V246, pR935
KELSO JAS, 1983, J SPEECH HEAR RES, V26, P217
Kelso J.A.S., 1981, PRODUCTION SPEECH, P137
Kendon A., 1980, RELATIONSHIP VERBAL, P207
Kendon Adam, 1972, STUDIES DYADIC COMMU, P177
KOMILIS E, 1993, J MOTOR BEHAV, V25, P299
Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005
Krauss R. M., 2000, LANGUAGE GESTURE, P261, DOI DOI 10.1017/CBO9780511620850.017
Large EW, 1999, PSYCHOL REV, V106, P119, DOI 10.1037/0033-295X.106.1.119
Leonard T, 2011, LANG COGNITIVE PROC, V26, P1457, DOI 10.1080/01690965.2010.500218
LEVELT WJM, 1985, J MEM LANG, V24, P133, DOI 10.1016/0749-596X(85)90021-X
Liberman AM, 2000, TRENDS COGN SCI, V4, P187, DOI 10.1016/S1364-6613(00)01471-6
Loehr D., 2007, GESTURE, V7, P179, DOI [10.1075/gest.7.2.04loe, DOI 10.1075/GEST.7.2.04LOE]
Loehr D.P., 2004, DISS ABSTR INT A, V65, P2180
Lorenz S, 2007, J COMPUT CHEM, V28, P1384, DOI 10.1002/jcc.20674
MARSLENWILSON WD, 1981, PHILOS T ROY SOC B, V295, P317, DOI 10.1098/rstb.1981.0143
Mayberry R., 2000, LANGUAGE GESTURE, P199, DOI 10.1017/CBO9780511620850.013
Mayberry R. I., 1998, NEW DIR CHILD ADOLES, V79, P77
McClave E, 1998, J PSYCHOLINGUIST RES, V27, P69, DOI 10.1023/A:1023274823974
MCCLAVE E, 1994, J PSYCHOLINGUIST RES, V23, P45, DOI 10.1007/BF02143175
McNeill D., 1992, HAND AND MIND
Merker B., 2009, CORTEX, V45, P1
Nobe S., 1996, DISS ABSTR INT A, V57, P4736
O'Dell M., 1999, P 14 INT C PHON SCI, V2, P1075
CONDON WS, 1966, J NERV MENT DIS, V143, P338, DOI 10.1097/00005053-196610000-00005
PAULIGNAN Y, 1991, EXP BRAIN RES, V87, P407
Peter B, 2011, TOP LANG DISORD, V31, P145, DOI 10.1097/TLD.0b013e318217b855
Port R., 1998, NONLINEAR ANAL DEV P, P5
Port RF, 2003, J PHONETICS, V31, P599, DOI 10.1016/j.wocn.2003.08.001
PRABLANC C, 1992, J NEUROPHYSIOL, V67, P455
Ringel R.L., 1963, J SPEECH HEAR RES, V13, P369
Rochet-Capellan A, 2008, J SPEECH LANG HEAR R, V51, P1507, DOI 10.1044/1092-4388(2008/07-0173)
Rusiewicz HL, 2013, J SPEECH LANG HEAR R, V56, P458, DOI 10.1044/1092-4388(2012/11-0283)
Rusiewicz H.L., 2011, P 2 GEST SPEECH INT
Sager Rebecca, 2004, ESEM COUNTERPOINT, V1, P1
Saltzman E, 2000, HUM MOVEMENT SCI, V19, P499, DOI 10.1016/S0167-9457(00)00030-0
Saltzman E.L., 1989, ECOLOG PSYCH, V1, P330
Schmidt RA, 1999, MOTOR CONTROL LEARNI
Shaiman S, 2002, EXP BRAIN RES, V146, P411, DOI 10.1007/s00221-002-1195-5
SHAIMAN S, 1989, J ACOUST SOC AM, V86, P78, DOI 10.1121/1.398223
Shriberg LD, 2003, CLIN LINGUIST PHONET, V17, P549, DOI 10.1080/0269920031000138123
SMITH A, 1992, CRIT REV ORAL BIOL M, V3, P233
SMITH A, 1986, J SPEECH HEAR RES, V29, P471
Stuart A, 2002, J ACOUST SOC AM, V111, P2237, DOI 10.1121/1.1466868
Thelen E., 2002, DYNAMIC SYSTEM APPRO
Tuite K., 1993, SEMIOTICA, V93, P8
Tuller B., 1982, J ACOUST SOCI AM, V72, pS103, DOI 10.1121/1.2019693
TULLER B, 1982, J ACOUST SOC AM, V71, P1534, DOI 10.1121/1.387807
TULLER B, 1983, J EXP PSYCHOL HUMAN, V9, P829, DOI 10.1037/0096-1523.9.5.829
TURVEY MT, 1990, AM PSYCHOL, V45, P938, DOI 10.1037/0003-066X.45.8.938
van Lieshout P., 2004, SPEECH MOTOR CONTROL, P51
van Kuijk D, 1999, SPEECH COMMUN, V27, P95, DOI 10.1016/S0167-6393(98)00069-7
von Holst E., 1973, COLLECTED PAPERS E V
Wlodarczak M., 2012, 6 INT C SHANGH CHIN
Yasinnik Y., 2004, M SOUND SENS 50 YEAR
NR 92
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 283
EP 300
DI 10.1016/j.specom.2013.06.004
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100020
ER
PT J
AU Esteve-Gibert, N
Prieto, P
AF Esteve-Gibert, Nuria
Prieto, Pilar
TI Infants temporally coordinate gesture-speech combinations before they
produce their first words
SO SPEECH COMMUNICATION
LA English
DT Article
DE Early gestures; Early acquisition of multimodality; Early gesture-speech
temporal coordination
ID LANGUAGE-DEVELOPMENT; YOUNG-CHILDREN; BEHAVIOR; INTENTIONALITY;
PERCEPTION; SYSTEM; FOCUS
AB This study explores the patterns of gesture and speech combinations from the babbling period to the one-word stage and the temporal alignment between the two modalities. The communicative acts of four Catalan children at 0;11, 1;1, 1;3, 1;5, and 1;7 were gesturally and acoustically analyzed. Results from the analysis of a total of 4,507 communicative acts extracted from approximately 24 h of at-home recordings showed that (1) from the early single-word period onwards gesture starts being produced mainly in combination with speech rather than as a gesture-only act; (2) in these early gesture-speech combinations most of the gestures are deictic gestures (pointing and reaching gestures) with a declarative communicative purpose; and (3) there is evidence of temporal coordination between gesture and speech already at the babbling stage because gestures start before the vocalizations associated with them, the stroke onset coincides with the onset of the prominent syllable in speech, and the gesture apex is produced before the end of the accented syllable. These results suggest that during the transition between the babbling stage and single-word period infants start combining deictic gestures and speech and, when combined, the two modalities are temporally coordinated. (C) 2013 Elsevier B.V. All rights reserved.
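The coordination measures described (gesture onset vs. vocalization onset, stroke onset vs. prominent-syllable onset, apex vs. end of the accented syllable) reduce to differences between annotated time points. A minimal sketch under that reading; the timestamps and field names are hypothetical, not the study's annotations.

def alignment_lags(gesture, speech):
    # gesture/speech: dicts of annotated time points in seconds (hypothetical values).
    # A negative lag means the gesture landmark precedes the speech landmark.
    return {
        "onset_lag": gesture["onset"] - speech["vocalization_onset"],
        "stroke_lag": gesture["stroke_onset"] - speech["prominent_syllable_onset"],
        "apex_lag": gesture["apex"] - speech["accented_syllable_end"],
    }

gesture = {"onset": 0.32, "stroke_onset": 0.58, "apex": 0.71}
speech = {"vocalization_onset": 0.40, "prominent_syllable_onset": 0.57,
          "accented_syllable_end": 0.85}
print(alignment_lags(gesture, speech))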
C1 [Esteve-Gibert, Nuria; Prieto, Pilar] Univ Pompeu Fabra, Dpt Translat & Language Sci, Barcelona, Spain.
RP Esteve-Gibert, N (reprint author), Roc Boronat 138, Barcelona 08018, Spain.
EM nuria.esteve@upf.edu
RI Esteve-Gibert, Nuria/D-5342-2014
OI Esteve-Gibert, Nuria/0000-0003-2408-5849
FU Spanish Ministry of Science and Innovation [FFI2009-07648/FILO,
FFI2012-31995]; Generalitat de Catalunya [2009SGR-701]; Obra Social 'La
Caixa'; Consolider-Ingenio grant [CSD2007-00012]
FX This research has been funded by two research grants awarded by the
Spanish Ministry of Science and Innovation (FFI2009-07648/FILO "The role
of tonal scaling and tonal alignment in distinguishing intonational
categories in Catalan and Spanish", and FFI2012-31995 "Gestures, prosody
and linguistic structure"), by a grant awarded by the Generalitat de
Catalunya (2009SGR-701) to the Grup d'Estudis de Prosodia, by the
Generalitat de Catalunya (2009SGR-701) to the Grup d'Estudis de
Prosodia, by the grant RECERCAIXA 2012 for the project "Els precursors
del llenguatge. Una guia TIC per a pares i educadors" awarded by Obra
Social 'La Caixa', and by the Consolider-Ingenio 2010 (CSD2007-00012)
grant.
CR Astruc L., 2013, LANG SPEECH, V56, P78
Baayen RH, 2008, J MEM LANG, V59, P390, DOI 10.1016/j.jml.2007.12.005
Bangerter A, 2004, PSYCHOL SCI, V15, P415, DOI 10.1111/j.0956-7976.2004.00694.x
BATES E, 1975, MERRILL PALMER QUART, V21, P205
Bergmann K., 2011, P 2 WORKSH GEST SPEE
BLAKE J, 1994, INFANT BEHAV DEV, V17, P195, DOI 10.1016/0163-6383(94)90055-8
Blake J., 2005, GESTURE, V5, P201, DOI 10.1075/gest.5.1.14bla
Boersma Paul, 2012, PRAAT DOING PHONETIC
Bonsdroff L., 2005, 18 SWED PHON C DEP L, P59
Butcher C., 2000, LANGUAGE GESTURE, P235, DOI DOI 10.1017/CBO9780511620850.015
Butterworth B., 1978, RECENT ADV PSYCHOL L, P347
Camaioni L, 2004, INFANCY, V5, P291, DOI 10.1207/s15327078in0503_3
Capone NC, 2004, J SPEECH LANG HEAR R, V47, P173, DOI [10.1044/1092-4388(2004/015), 10.1044/1092-4388(2004/15)]
Cochet H, 2010, GESTURE, V10, P129, DOI 10.1075/gest.10.2-3.02coc
Colonnesi C, 2010, DEV REV, V30, P352, DOI 10.1016/j.dr.2010.10.001
De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018
De Ruiter J.-P., 2006, ADV SPEECH LANGUAGE, V8, P124, DOI DOI 10.1080/14417040600667285
De Ruiter J. P., 1998, THESIS KATHOLIEKE U
Ejiri K, 2001, DEVELOPMENTAL SCI, V4, P40, DOI 10.1111/1467-7687.00147
Ekman P., 1969, SEMIOTICA, V1, P49
Engstrand O., 2004, 17 SWED PHON C STOCK, P64
Esteve-Gibert N., 2013, J SPEECH LANG HEAR R
Esteve-Gibert N., 2012, ESTEVE PRIETO CATALA
Feldman R, 1996, INFANT BEHAV DEV, V19, P483, DOI 10.1016/S0163-6383(96)90008-9
Ferre G., 2010, P LREC WORKSH MULT C, P86
Frota S., 2008, P 11 INT C STUD CHIL
Iverson J. M., 1999, J CONSCIOUSNESS STUD, V6, P19
Iverson JM, 2000, J NONVERBAL BEHAV, V24, P105, DOI 10.1023/A:1006605912965
Iverson JM, 2005, PSYCHOL SCI, V16, P367, DOI 10.1111/j.0956-7976.2005.01542.x
Iverson JM, 2004, CHILD DEV, V75, P1053, DOI 10.1111/j.1467-8624.2004.00725.x
Kendon A., 1980, RELATIONSHIP VERBAL, P207
Kendon A., 2004, GESTURE VISIBLE ACTI
Kita S., 2000, LANGUAGE CULTURE COG, P162
Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005
Krauss R. M., 2000, LANGUAGE GESTURE, P261, DOI DOI 10.1017/CBO9780511620850.017
Lausberg H, 2009, BEHAV RES METHODS, V41, P841, DOI 10.3758/BRM.41.3.841
Leonard T., 2010, LANG COGNITIVE PROC, V26, P1295
LEUNG EHL, 1981, DEV PSYCHOL, V17, P215, DOI 10.1037//0012-1649.17.2.215
LEVELT WJM, 1985, J MEM LANG, V24, P133, DOI 10.1016/0749-596X(85)90021-X
Liszkowski U., 2007, GESTURAL COMMUNICATI, P124
Loehr D., 2007, GESTURE, V7, P179, DOI [10.1075/gest.7.2.04loe, DOI 10.1075/GEST.7.2.04LOE]
Masataka N, 2003, POINTING: WHERE LANGUAGE, CULTURE, AND COGNITION MEET, P69
McNeill D., 1992, HAND MIND WHAT GESTU
Melinger A., 2004, GESTURE, V4, P119, DOI DOI 10.1075/GEST.4.2.02MEL
Nobe S., 1996, THESIS U CHICAGO CHI
Oller D.K., 1976, J CHILD LANG, V3, P1
Olson S.L., 1982, INFANT BEHAV DEV, V12, P77
Ozcaliskan S., 2005, COGNITION, V96, P101, DOI [10.1016/j.cognition.2005.01.001, DOI 10.1016/J.C0GNITI0N.2005.01.001]
Papaeliou CF, 2006, J CHILD LANG, V33, P163, DOI 10.1017/S0305000905007300
Prieto P., 2013, INTONATIONAL VARATIO
Prieto P., 2006, LANG SPEECH, V49, P233
Rochat P, 2007, ACTA PSYCHOL, V124, P8, DOI 10.1016/j.actpsy.2006.09.004
Rochet-Capellan A, 2008, J SPEECH LANG HEAR R, V51, P1507, DOI 10.1044/1092-4388(2008/07-0173)
Roustan B., 2010, P SPEECH PROS CHIC
Rusiewicz H.L., 2010, THESIS U PITTSBURGH
So WC, 2010, APPL PSYCHOLINGUIST, V31, P209, DOI 10.1017/S0142716409990221
Sperry LA, 2003, J AUTISM DEV DISORD, V33, P281, DOI 10.1023/A:1024454517263
Tomasello M, 2007, CHILD DEV, V78, P705, DOI 10.1111/j.1467-8624.2007.01025.x
Vanrell M.M., 2011, ANEJO QUADERNS FILOL, P71
DEBOYSSONBARDIES B, 1991, LANGUAGE, V67, P297, DOI 10.2307/415108
VIHMAN MM, 1985, LANGUAGE, V61, P397, DOI 10.2307/414151
Vihman MM, 2009, CAMB HB LANG LINGUIS, P163
West B. T., 2007, LINEAR MIXED MODELS
NR 63
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 301
EP 316
DI 10.1016/j.specom.2013.06.006
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100021
ER
PT J
AU Kim, J
Cvejic, E
Davis, C
AF Kim, Jeesun
Cvejic, Erin
Davis, Chris
TI Tracking eyebrows and head gestures associated with spoken prosody
SO SPEECH COMMUNICATION
LA English
DT Article
DE Visual prosody; Eyebrow movements; Focus; Sentence modality; Guided
principal component analysis
ID SPEECH ACOUSTICS; FACE; STRESS; PROMINENCE; KINEMATICS; ENGLISH; MOTION;
VOICE
AB Although it is clear that eyebrow and head movements are in some way associated with spoken prosody, the precise form of this association is unclear. To examine this, eyebrow and head movements were recorded from six talkers producing 30 sentences (with two repetitions) in three prosodic conditions (Broad focus, Narrow focus and Echoic question) in a face-to-face dialogue exchange task. Movement displacement and peak velocity were measured for the prosodically marked constituents (critical region) as well as for the preceding and following regions. The amount of eyebrow movement in the Narrow focus and Echoic question conditions tended to be larger at the beginning of an utterance (in the pre-critical and critical regions) than at the end (in the post-critical region). Head rotation (nodding) tended to occur later, being maximal in the critical region and still occurring often in the post-critical one. For eyebrow movements, peak velocity tended to distinguish the regions better than the displacement measure. The extent to which eyebrow and head movements co-occurred was also examined. Compared to the Broad focus condition, both movement types occurred more often in the Narrow focus and Echoic question ones. When these double movements occurred in Narrow focus utterances, brow raises tended to begin before the onset of the critical constituent and reach a peak displacement at the time of the critical constituent, whereas rigid pitch movements tended to begin at the time of the critical constituent and reach peak displacement after this region. The pattern for echoic questions was similar for eyebrow motion; however, head rotations tended to begin earlier compared to the Narrow focus condition. These results are discussed in terms of the differences these types of visual cues may have in production and perception. (C) 2013 Elsevier B.V. All rights reserved.
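Region-wise displacement and peak velocity of a tracked movement, as measured here, can be computed directly from the sampled trajectory. A minimal sketch assuming a hypothetical one-dimensional trajectory and frame rate; the values are illustrative, not the recorded marker data.

import numpy as np

def region_displacement_and_peak_velocity(position, fs=60.0):
    # position: 1-D array of one tracked coordinate (e.g. brow height) within a
    # region; fs: frames per second. Both are illustrative assumptions.
    position = np.asarray(position, dtype=float)
    displacement = position.max() - position.min()
    velocity = np.gradient(position) * fs  # units per second
    return displacement, np.abs(velocity).max()

traj = [0.0, 0.4, 1.1, 1.9, 2.3, 2.4, 2.2]  # hypothetical samples for one region
print(region_displacement_and_peak_velocity(traj))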
C1 [Kim, Jeesun; Cvejic, Erin; Davis, Chris] Univ Western Sydney, MARCS Inst, Penrith, NSW 2751, Australia.
[Cvejic, Erin] Univ New S Wales, Sch Psychiat, Sydney, NSW 2052, Australia.
RP Davis, C (reprint author), Univ Western Sydney, MARCS Inst, Locked Bag 1797, Penrith, NSW 2751, Australia.
EM chris.davis@uws.edu.au
FU Australian Research Council [DP120104298]
FX We thank Virginie Attina and Guillaume Gibert for their contribution of
Matlab scripts used for the gPCA analysis, and all of the speakers. We
also thank two anonymous reviewers for helpful comments. Some of the
results reported here were from the second author's PhD thesis. We
acknowledge support from Australian Research Council (DP120104298).
CR Badin P, 2002, J PHONETICS, V30, P533, DOI 10.1006/jpho.2002.0166
Bailly G, 2009, EURASIP J AUDIO SPEE, DOI 10.1155/2009/769494
Beautemps D, 2001, J ACOUST SOC AM, V109, P2165, DOI 10.1121/1.1361090
Boersma P., 2001, GLOT INT, V5, P341
Brooks V.B., 1998, J MOTOR BEHAV, V2, P117
Cave C, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2175
Cvejic E, 2012, COGNITION, V122, P442, DOI 10.1016/j.cognition.2011.11.013
Cvejic E, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P1433
Cvejic E, 2012, J ACOUST SOC AM, V131, P1011, DOI 10.1121/1.3676605
DEJONG KJ, 1995, J ACOUST SOC AM, V97, P491, DOI 10.1121/1.412275
Dohen M., 2009, VISUAL SPEECH RECOGN, P416
Dohen M, 2009, LANG SPEECH, V52, P177, DOI 10.1177/0023830909103166
EDWARDS J, 1991, J ACOUST SOC AM, V89, P369, DOI 10.1121/1.400674
Fagel S., 2008, P INT C AUD VIS SPEE, P59
Fitzpatrick M., 2011, P INT 2011, P2829
Flecha-Garcia ML, 2010, SPEECH COMMUN, V52, P542, DOI 10.1016/j.specom.2009.12.003
Guaitella I, 2009, LANG SPEECH, V52, P207, DOI 10.1177/0023830909103167
HADAR U, 1983, LANG SPEECH, V26, P117
HORN BKP, 1987, J OPT SOC AM A, V4, P629, DOI 10.1364/JOSAA.4.000629
House D., 2001, P EUR 2001, P387
IEEE, 1969, IEEE T AUDIO ELECTRO, VAE-17, P227
Kendon A., 2004, GESTURE VISIBLE ACTI
Kim J, 2011, PERCEPTION, V40, P853, DOI 10.1068/p6941
Lucero JC, 2005, J ACOUST SOC AM, V118, P405, DOI 10.1121/1.1928807
MAEDA S, 2005, ZAS PAPERS LINGUISTI, V40, P95
Nam H, 2012, J ACOUST SOC AM, V132, P3980, DOI 10.1121/1.4763545
Schroder M., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025708916924
Schwartz JL, 2004, COGNITION, V93, pB69, DOI 10.1016/j.cognition.2004.01.006
Srinivasan RJ, 2003, LANG SPEECH, V46, P1
Swerts M, 2010, J PHONETICS, V38, P197, DOI 10.1016/j.wocn.2009.10.002
Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001
VANSUMMERS W, 1987, J ACOUST SOC AM, V82, P847, DOI 10.1121/1.395284
WING AM, 1984, PSYCHOL RES-PSYCH FO, V46, P121, DOI 10.1007/BF00308597
Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X
Yehia HC, 2002, J PHONETICS, V30, P555, DOI 10.1006/jpho.2002.0165
NR 35
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 317
EP 330
DI 10.1016/j.specom.2013.06.003
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100022
ER
PT J
AU Fernandez-Baena, A
Montano, R
Antonijoan, M
Roversi, A
Miralles, D
Alias, F
AF Fernandez-Baena, Adso
Montano, Raul
Antonijoan, Marc
Roversi, Arturo
Miralles, David
Alias, Francesc
TI Gesture synthesis adapted to speech emphasis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Human computer interaction; Body language; Speech analysis; Speech
emphasis; Character animation; Motion capture; Motion graphs
ID ANIMATION; PROMINENCE; CHARACTERS; EXPRESSION; PROSODY; BEAT
AB Avatars communicate through speech and gestures to appear realistic and to enhance interaction with humans. In this context, several works have analyzed the relationship between speech and gestures, while others have focused on their synthesis, following different approaches. In this work, we address both goals by linking speech to gestures in terms of time and intensity, to then use this knowledge to drive a gesture synthesizer from a manually annotated speech signal. To that effect, we define strength indicators for speech and motion. After validating them through perceptual tests, we obtain an intensity rule from their correlation. Moreover, we derive a synchrony rule to determine temporal correspondences between speech and gestures. These analyses have been conducted on aggressive and neutral performances to cover a broad range of emphatic levels, for which the speech signal and motion were manually annotated. Next, intensity and synchrony rules are used to drive a gesture synthesizer called gesture motion graph (GMG). These rules are then validated through perceptual tests in which users rate GMG output animations. Results show that animations using intensity and synchrony rules perform better than those only using the synchrony rule (which in turn enhance realism with respect to random animation). Finally, we conclude that the extracted rules allow GMG to properly synthesize gestures adapted to speech emphasis from annotated speech. (C) 2013 Elsevier B.V. All rights reserved.
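One simple way to derive an intensity rule from the correlation between paired speech and motion strength indicators is a linear fit over those pairs. A minimal sketch under that assumption; the indicator values and the linear form are illustrative, not the rule reported in the paper.

import numpy as np

# Hypothetical paired strength indicators (speech emphasis vs. motion intensity).
speech_strength = np.array([0.2, 0.4, 0.5, 0.7, 0.9])
motion_strength = np.array([0.25, 0.35, 0.55, 0.65, 0.95])

r = np.corrcoef(speech_strength, motion_strength)[0, 1]  # correlation between streams
slope, intercept = np.polyfit(speech_strength, motion_strength, 1)  # linear fit

def intensity_rule(s):
    # Map an annotated speech-emphasis strength to a target gesture intensity
    # via the fitted linear relation (an assumed form, for illustration only).
    return slope * s + intercept

print(round(r, 3), round(intensity_rule(0.6), 3))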
C1 [Fernandez-Baena, Adso; Montano, Raul; Antonijoan, Marc; Roversi, Arturo; Miralles, David; Alias, Francesc] La Salle Univ Ramon Llull, GTM, Barcelona 08022, Spain.
RP Fernandez-Baena, A (reprint author), La Salle Univ Ramon Llull, GTM, Barcelona 08022, Spain.
EM adso@salle.url.edu
RI Alias, Francesc/L-1088-2014; Antonijoan, Marc/L-5249-2014
OI Alias, Francesc/0000-0002-1921-2375;
FU CENIT program [CEN-20101019]; Ministry of Science and Innovation of
Spain
FX We thank Eduard Ruesga and Meritxell Aragones for their work on the
acquisition and processing of motion capture data. We also want to thank
Dani Arguedas for his work as an actor. We acknowledge Anna Fuste for
assisting in programming. This work was supported by the CENIT program
number CEN-20101019, granted by the Ministry of Science and Innovation
of Spain.
CR Abete G., 2010, P SPEECH PROS 2010 S
Antonijoan M., 2012, THESIS R LLULL U LA
Arikan O, 2002, ACM T GRAPHIC, V21, P483
Beckman Mary, 2002, PROBUS, V14, P9, DOI 10.1515/prbs.2002.008
Boersma P., 2001, GLOT INT, V5, P341
Bolinger D., 1986, INTONATION ITS PARTS
Bulut M, 2007, INT CONF ACOUST SPEE, P1237
Cassell J., 1994, P SIGGRAPH 94, P413, DOI 10.1145/192161.192272
Cassell J, 2001, COMP GRAPH, P477
Chiu C.C., 2011, P 10 INT C INT VIRT, P127
Condon W.S., 1976, ANAL BEHAV ORG, P285
Escudero-Mancebo D., 2010, P INT C SPEECH PROS
Fernandez-Baena A., 2012, FAST RESPONSE QUICK, P77
Goldman J.-P., 2011, INTERSPEECH 2011, P3233
Hall M., 2009, SIGKDD EXPLORATIONS, V11, DOI DOI 10.1145/1656274.1656278
ITU-T, 1996, P 800 METH SUBJ DET
Kalinli O, 2009, IEEE T AUDIO SPEECH, V17, P1009, DOI 10.1109/TASL.2009.2014795
Kendon Adam, 1980, NONVERBAL COMMUNICAT, P207
Kipp M, 2007, LECT NOTES ARTIF INT, V4722, P15
Kipp M, 2004, THESIS BOCA RATON
Kipp M., 2001, P 7 EUR C SPEECH COM, P1367
Kipp M., 2009, P INT C AFF COMP INT
KOPP S., 2008, INT J SEMANTIC COMPU, V2, P115, DOI 10.1142/S1793351X08000361
Kovar L, 2002, ACM T GRAPHIC, V21, P473
Lee JH, 2002, ACM T GRAPHIC, V21, P491
Leonard T, 2011, LANG COGNITIVE PROC, V26, P1457, DOI 10.1080/01690965.2010.500218
Levine S., 2010, ACM T GRAPH, V29
Levine S., 2009, ACM T GRAPH, V28
Llull L.S.U.R., 2012, MEDIALAB MOTION CAPT
Loehr DP, 2004, THESIS GEORGETOWN U
Luo PC, 2009, LECT NOTES ARTIF INT, V5773, P405
MATTHEWS BW, 1975, BIOCHIM BIOPHYS ACTA, V405, P442, DOI 10.1016/0005-2795(75)90109-9
McNeill D., 1992, HAND MIND WHAT GESTU
McNeill D., 2008, PHOENIX POETS SERIES
MCNEILL D, 1985, PSYCHOL REV, V92, P350, DOI 10.1037//0033-295X.92.3.350
Neff M., 2008, ACM T GRAPH, V27, P5
Nobe Shuichi, 1996, REPRESENTATIONAL GES
Onuma K., 2008, P EUROGRAPHICS
Ortega-Llebaria M, 2011, LANG SPEECH, V54, P73, DOI 10.1177/0023830910388014
Ortiz-Lira H., 1999, ONOMAZEIN, V4, P429
Oshita M., 2011, P 4 INT C MOT GAM, P120
Pejsa T, 2010, COMPUT GRAPH FORUM, V29, P202, DOI 10.1111/j.1467-8659.2009.01591.x
Pelachaud C., 2005, P 13 ANN ACM INT C M, P683, DOI 10.1145/1101149.1101301
Pierrehumbert Janet, 1980, THESIS MIT CAMBRIDGE
Planet S., 2008, P 2 INT WORKSH EMOTI
Platt J. C., 1999, FAST TRAINING SUPPOR, P185
Prieto Pilar, 2009, ESTUDIOS FONETICA EX, V18, P263
Qin Y., 2010, INT C SIGN PROC SYST
Quintilian, 1920, I ORATORIA QUINTILIA
Ren Z., 2011, 2011 8 INT C INF COM, P1
Renwick M., 2004, SOUND SENSE 50 YEARS, P97
Roekhaut S., 2010, 5 INT C SPEECH PROS
RUSSELL JA, 1980, J PERS SOC PSYCHOL, V39, P1161, DOI 10.1037/h0077714
Shalev-Shwartz S., 2007, 24 INT C MACH LEARN, P807
Shattuck-Hufnagel S., 2007, NATO PUBLISHING SU E, V18
Silipo R., 1999, P 14 INT C PHON SCI, P2351
Silverman K., 1992, ICSLP ISCA
Stone M, 2004, ACM T GRAPHIC, V23, P506, DOI 10.1145/1015706.1015753
Streefkerk B., 1999, EUROSPEECH
Syrdal AK, 2001, SPEECH COMMUN, V33, P135, DOI 10.1016/S0167-6393(00)00073-X
Tamburini F., 2007, INT 2007, P1809
TERKEN J, 1991, J ACOUST SOC AM, V89, P1768, DOI 10.1121/1.401019
Valbonesi L., 2002, MULTIMODAL SIGNAL AN
Van Basten B.J.H., 2009, P 4 INT C FDN DIG GA, P199, DOI 10.1145/1536513.1536551
van der Sluis I, 2007, DISCOURSE PROCESS, V44, P145
VANRIJSBERGEN CJ, 1974, J DOC, V30, P365
van Welbergen H, 2010, COMPUT GRAPH FORUM, V29, P2530, DOI 10.1111/j.1467-8659.2010.01822.x
Vilhjalmsson H., 2003, THESIS MIT
Wachsmuth I., 1998, LECT NOTES ARTIF INT, V1317, P23
Wallbott HG, 1998, EUR J SOC PSYCHOL, V28, P879, DOI 10.1002/(SICI)1099-0992(1998110)28:6<879::AID-EJSP901>3.0.CO;2-W
Wang J., 2008, ACM T GRAPH, V27, P1
Xu Jing, 2011, Instrument Techniques and Sensor
NR 72
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2014
VL 57
BP 331
EP 350
DI 10.1016/j.specom.2013.06.005
PG 20
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 268OJ
UT WOS:000328180100023
ER
PT J
AU Cucu, H
Buzo, A
Besacier, L
Burileanu, C
AF Cucu, Horia
Buzo, Andi
Besacier, Laurent
Burileanu, Corneliu
TI SMT-based ASR domain adaptation methods for under-resourced languages:
Application to Romanian
SO SPEECH COMMUNICATION
LA English
DT Article
DE Under-resourced languages; Domain adaptation; Automatic speech
recognition; Statistical machine translation; Language modeling
AB This study investigates the possibility of using statistical machine translation to create domain-specific language resources. We propose a methodology that aims to create a domain-specific automatic speech recognition (ASR) system for a low-resourced language when in-domain text corpora are available only in a high-resourced language. Several translation scenarios (both unsupervised and semi-supervised) are used to obtain domain-specific textual data. Moreover, this paper shows that a small amount of manually post-edited text is enough to develop other natural language processing systems that, in turn, can be used to automatically improve the machine-translated text, leading to a significant boost in ASR performance. An in-depth analysis explaining why and how the machine-translated text improves the performance of the domain-specific ASR is also provided at the end of this paper. As by-products of this core domain-adaptation methodology, this paper also presents the first large vocabulary continuous speech recognition system for Romanian, and introduces a diacritics restoration module to process the Romanian text corpora, as well as an automatic phonetization module needed to extend the Romanian pronunciation dictionary. (C) 2013 Elsevier B.V. All rights reserved.
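A common way to exploit machine-translated in-domain text for ASR domain adaptation is to interpolate a language model built on it with a generic one. A minimal sketch of such interpolation at the level of unigram probabilities; the words, probabilities, and interpolation weight are illustrative, and this is not necessarily the exact recipe used in the paper.

def interpolate_lm(p_generic, p_translated, lam=0.5):
    # Linear interpolation of word probabilities from a generic LM and an LM
    # trained on machine-translated in-domain text. lam is an assumed weight.
    return {w: lam * p_translated.get(w, 0.0) + (1 - lam) * p_generic.get(w, 0.0)
            for w in set(p_generic) | set(p_translated)}

# Hypothetical unigram probabilities (Romanian words chosen only for flavor).
p_generic = {"parlament": 0.001, "guvern": 0.002, "masina": 0.004}
p_translated = {"parlament": 0.010, "guvern": 0.008, "lege": 0.005}
print(interpolate_lm(p_generic, p_translated))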
C1 [Cucu, Horia; Buzo, Andi; Burileanu, Corneliu] Univ Politehn Bucuresti, Bucharest, Romania.
[Cucu, Horia; Besacier, Laurent] Univ Grenoble 1, LIG, Grenoble, France.
RP Cucu, H (reprint author), Univ Politehn Bucuresti, Bucharest, Romania.
EM horia.cucu@upb.ro; andi.buzo@upb.ro; laurent.besacier@imag.fr;
cburileanu@messnet.pub.ro
RI Buzo, Andi/B-1834-2013
OI Buzo, Andi/0000-0001-6545-5338
CR Abdillahi N, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P289
Berment V, 2004, THESIS U J FOURIER G
Bertoldi N., 2009, PRAGUE B MATH LINGUI, P1
Bick E., MULTILINGUALITY INTE, P169
BILLA J, 2002, ACOUST SPEECH SIG PR, P5
Bisani M., 2003, EUR GEN SWITZ, P933
Bisani M., SPEECH COMMUNICATION, V50, P434
Bonaventura P., 1998, ACL
Bonneau-Maynard H., 2005, INTERSPEECH, P3457
Burileanu D., 1999, ICPHS SAN FRANC US, V1, P503
Cai J., 2008, SLTU HAN VIETN
Cucu H., 2011, U POLITEHNICA BUCH C, P179
Cucu H, 2012, EUR SIGNAL PR CONF, P1648
Cucu H., 2011, SPECOM KAZ RUSS, P81
Cucu H., 2011, THESIS POLITEHNICA U
Cucu H., 2011, ASRU HAW US, P260
Domokos J., 2009, THESIS TU CLUJ NAPOC
Domokos J., 2011, SPED BRAS ROM, P1
Draxler C., 2007, INTERSPEECH, P1509
Dumitru C.-O., 2008, ADV ROBOTICS AUTOMAT, P472
Gizaw S., 2008, SLTU HAN VIETN
Jabaian B, 2011, INT CONF ACOUST SPEE, P5612
Jensson A., 2008, SLTU HAN VIETN
Jitca D., 2003, SPED BUCH ROM, P43
Kabir A., 2011, 10 INT C SIGN PROC R, P323
Karanasou P, 2010, LECT NOTES ARTIF INT, V6233, P167, DOI 10.1007/978-3-642-14770-8_20
Koehn P., 2007, ACL PRAG CZECH REP
Koehn P., 2005, MACHINE TRANSLATION, P79
Laurent A., 2009, INT 2009 BRIGHT UK, P708
Le V.B., 2003, EUROSPEECH 2003, P3117
Le V.B., IEEE T AUDIO SPEECH, V17, P1471
Lita L., 2003, ACL 2003 SAPP JAP, P152
Macoveiciuc M., MULTILINGUALITY INTE, P151
Mihajlik P., 2007, INTERSPEECH, P1497
Militaru D, 2009, FROM SPEECH PROCESSING TO SPOKEN LANGUAGE TECHNOLOGY, P21
Munteanu D.-P., 2006, THESIS TECHNICAL MIL
Nakajima H., 2002, COLING 2002, V2, P716
Oancea E., 2004, COMMUNICATIONS, P221
Ordean M.A., 2009, SYNASC TIM ROM, P401
Papineni K., 2002, ACL, P311
Pellegrini T, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P285
Petrea C.S., 2010, ECIT IAS ROM, P13
Sam S, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P254
Schultz T., MULTILINGUAL SPEECH
Simard M., 2007, NAACL HLT ROCH US, P508
Stuker S., 2008, SLTU HAN VIETN
Suenderman K., 2009, INT 2009 BRIGHT UK, P1475
Toma S.-A., 2009, COMPUTATION WORLD AT, P682
Tufi D., 2008, LREC MARR MOR
Tufis D., 1999, INT WORKSH COMP LEX, P185
Ungurean Catalin, 2008, University "Politehnica" of Bucharest, Scientific Bulletin Series C: Electrical Engineering, V70
Ungurean C., 2011, SPED BRAS ROM, P1
Vlad A, 2007, LECT NOTES COMPUT SC, V4705, P409
NR 53
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 195
EP 212
DI 10.1016/j.specom.2013.05.003
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800016
ER
PT J
AU Dufour, R
Esteve, Y
Deleglise, P
AF Dufour, Richard
Esteve, Yannick
Deleglise, Paul
TI Characterizing and detecting spontaneous speech: Application to speaker
role recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spontaneous speech; Speaker role; Feature extraction; Speech
classification; Automatic speech recognition; Role recognition
ID IDENTIFICATION; DISFLUENCIES
AB Processing spontaneous speech is one of the many challenges that automatic speech recognition systems have to deal with. The main characteristics of this kind of speech are disfluencies (filled pause, repetition, false start, etc.) and many studies have focused on their detection and correction. Spontaneous speech is defined in opposition to prepared speech, where utterances contain well-formed sentences close to those found in written documents.
Acoustic and linguistic features made available by the use of an automatic speech recognition system are proposed to characterize and detect spontaneous speech segments from large audio databases. To better define this notion of spontaneous speech, segments of an 11-hour corpus (French Broadcast News) were manually labeled according to three classes of spontaneity.
Firstly, we present a study of these features. We then propose a two-level strategy to automatically assign a class of spontaneity to each speech segment. The proposed system reaches a 73.0% precision and a 73.5% recall on high spontaneous speech segments, and a 66.8% precision and a 69.6% recall on prepared speech segments.
A quantitative study shows that the classes of spontaneity provide useful information for characterizing speaker roles. This is confirmed by extending the speech spontaneity characterization approach to build an efficient automatic speaker role recognition system. (C) 2013 Elsevier B.V. All rights reserved.
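The precision and recall figures quoted above follow the standard per-class definitions. A minimal sketch of how they can be computed from segment-level labels; the reference and hypothesis lists are hypothetical, not the paper's data.

def precision_recall(ref, hyp, target):
    # Per-class precision/recall over aligned segment labels (hypothetical inputs).
    tp = sum(1 for r, h in zip(ref, hyp) if h == target and r == target)
    fp = sum(1 for r, h in zip(ref, hyp) if h == target and r != target)
    fn = sum(1 for r, h in zip(ref, hyp) if h != target and r == target)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

ref = ["high", "prepared", "high", "low", "prepared"]   # illustrative reference labels
hyp = ["high", "prepared", "prepared", "low", "prepared"]  # illustrative system output
print(precision_recall(ref, hyp, "high"))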
C1 [Dufour, Richard; Esteve, Yannick; Deleglise, Paul] Univ Le Mans, LIUM, Le Mans, France.
RP Dufour, R (reprint author), Univ Avignon, LIA, Avignon, France.
EM richard.dufour@univ-avignon.fr; yannick.esteve@lium.univ-lemans.fr;
paul.deleglise@lium.univ-lemans.fr
CR Amaral R., 2003, ISCA WORKSH MULT SPO, P31
Barzilay R., 2000, SEVENTEENTH NATIONAL, P679
Bazillon T., 2008, THE SIXTH INTERNATIO, P1067
Bigot B., 2010, INTERNATIONAL WORKSH, P5
Boula de Mareuil P., 2005, PROCEEDING OF THE WO
Caelen-Haumont G., 2002, P 1 INT C SPEECH PRO, P195
COHEN J, 1960, EDUC PSYCHOL MEAS, V20, P37, DOI 10.1177/001316446002000104
Damnati G., 2011, INTERNATIONAL CONFER
Deleglise P., 2009, CONFERENCE OF THE IN, P2123
DUEZ D, 1982, LANG SPEECH, V25, P11
Dufour R., 2010, ACM WORKSHOP ON SEAR
Dufour R., 2011, CONFERENCE OF THE IN
Dufour R., 2009, 13TH INTERNATIONAL C
Dufour R., 2010, CONFERENCE OF THE IN
Dufour R., 2009, AUTOMATIC SPEECH REC
Esteve Y., 2010, LREC VALL MALT, P1686
Eugenio B.D., 2004, COMPUTATIONAL LINGUI, V30, P95, DOI 10.1162/089120104773633402
Galliano S., 2005, CONFERENCE OF THE IN
Garg P.N., 2008, ACM MULTIMEDIA CONFE, P693
Goto M., 1999, SIXTH EUROPEAN CONFE, P227
Gravier G., 2012, INTERNATIONAL CONFER
Hakkani-Tur D., 2007, INTERNATIONAL CONFER, V4, P1
Heeman PA, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P362
Jousse V, 2009, INT CONF ACOUST SPEE, P4557, DOI 10.1109/ICASSP.2009.4960644
Lease M, 2006, IEEE T AUDIO SPEECH, V14, P1566, DOI 10.1109/TASL.2006.878269
Liu Y, 2006, IEEE T AUDIO SPEECH, V14, P1526, DOI 10.1109/TASL.2006.878255
Liu Y., 2006, HUM LANG TECHN C NAA, P81
Liu Y., 2005, CONFERENCE OF THE IN, P3313
Luzzati D., 2004, WORKSHOP MODELISATIO, P13
MCDONOUGH J, 1994, INT CONF ACOUST SPEE, P385
Meignier S., 2010, CMU SPHINX USERS AND
Mohri M, 2002, COMPUT SPEECH LANG, V16, P69, DOI 10.1006/csla.2001.0184
O'Shaughnessy D., 1993, INTERNATIONAL CONFER, V2, P724, DOI 10.1109/ICASSP.1993.319414
Peskin B., 1993, HLT 93 P WORKSH HUM, P119
Rousseau A., 2011, PROCEEDINGS OF IWSLT
Salamin H, 2009, IEEE T MULTIMEDIA, V11, P1373, DOI 10.1109/TMM.2009.2030740
Schapire RE, 2000, MACH LEARN, V39, P135, DOI 10.1023/A:1007649029923
Schuller B., 2012, CONFERENCE OF THE IN
Shriberg E., 1999, P INT C PHON SCI SAN, P619
Siu M, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P386
Vinciarelli A, 2007, IEEE T MULTIMEDIA, V9, P1215, DOI 10.1109/TMM.2007.902882
Yeh JF, 2006, IEEE T AUDIO SPEECH, V14, P1574, DOI 10.1109/TASL.2006.878267
NR 42
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 1
EP 18
DI 10.1016/j.specom.2013.07.007
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800001
ER
PT J
AU Poblete, V
Yoma, NB
Stern, RM
AF Poblete, Victor
Yoma, Nestor Becerra
Stern, Richard M.
TI Optimization of the parameters characterizing sigmoidal rate-level
functions based on acoustic features
SO SPEECH COMMUNICATION
LA English
DT Article
DE Sigmoidal function; Auditory systems; Optimization; Acoustic features;
Speech enhancement
ID ROBUST SPEECH RECOGNITION; AUDITORY-NERVE FIBERS; DYNAMIC-RANGE
ADAPTATION; CONTRAST GAIN-CONTROL; SPEAKER VERIFICATION; NOISY
ENVIRONMENTS; NEURAL ADAPTATION; BASILAR-MEMBRANE; SCENE ANALYSIS; MODEL
AB This paper describes the development of an optimal sigmoidal rate-level function that is a component of many models of the peripheral auditory system. The optimization makes use of a set of criteria defined exclusively on the basis of physical attributes of the input sound that are inspired by physiological evidence. The criteria developed attempt to discriminate between a degraded speech signal and noise to preserve the maximum amount of information in the linear region of the sigmoidal curve, and to minimize the effects of distortion in the saturating regions. The performance of the proposed optimal sigmoidal function is validated by text-independent speaker-verification experiments with signals corrupted by additive noise at different SNRs. The experimental results suggest that the approach presented in combination with cepstral variance normalization can lead to relative reductions in equal error rate as great as 40% when compared with the use of baseline MFCC coefficients for some SNRs. (C) 2013 Elsevier B.V. All rights reserved.
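A sigmoidal rate-level function maps input level (in dB) to a firing rate bounded by spontaneous and saturation rates. A minimal sketch of one common parameterization; the spontaneous rate, saturation rate, mid-level, and slope values are illustrative assumptions, not the optimized parameters obtained in the paper.

import numpy as np

def rate_level(level_db, r_spont=10.0, r_sat=200.0, theta=30.0, slope=6.0):
    # Sigmoidal rate-level function: firing rate as a function of level in dB.
    # r_spont/r_sat bound the curve, theta is the mid-level, slope sets steepness;
    # all parameter values here are illustrative, not the paper's optimized ones.
    return r_spont + (r_sat - r_spont) / (1.0 + np.exp(-(level_db - theta) / slope))

levels = np.linspace(0, 80, 9)  # illustrative input levels in dB
print(rate_level(levels))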
C1 [Poblete, Victor; Yoma, Nestor Becerra] Univ Chile, Speech Proc & Transmiss Lab, Santiago, Chile.
[Stern, Richard M.] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA.
[Stern, Richard M.] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA.
[Poblete, Victor] Univ Austral Chile, Inst Acoust, Valdivia, Chile.
RP Yoma, NB (reprint author), Univ Chile, Speech Proc & Transmiss Lab, Av Tupper 2007,POB 412-3, Santiago, Chile.
EM nbecerra@ing.uchile.cl
FU Conicyt-Chile [Fondecyt 1100195, ACT 1120]
FX This research was funded by Conicyt-Chile under grants Fondecyt 1100195
and Team Research in Science and Technology ACT 1120.
CR Ajmera PK, 2011, PATTERN RECOGN, V44, P2749, DOI 10.1016/j.patcog.2011.04.009
Allen J. B., 1985, IEEE ASSP Magazine, V2, DOI 10.1109/MASSP.1985.1163723
Barbour DL, 2011, NEUROSCI BIOBEHAV R, V35, P2064, DOI 10.1016/j.neubiorev.2011.04.009
Bures Z, 2010, EUR J NEUROSCI, V32, P155, DOI 10.1111/j.1460-9568.2010.07280.x
Campbell J., 1994, YOHO SPEAKER VERIFIC
Chiu YHB, 2012, IEEE T AUDIO SPEECH, V20, P900, DOI 10.1109/TASL.2011.2168209
Chiu YHB, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1000
COHEN JR, 1989, J ACOUST SOC AM, V85, P2623, DOI 10.1121/1.397756
COSTALUPES JA, 1984, J NEUROPHYSIOL, V51, P1326
Darwin CJ, 2008, PHILOS T R SOC B, V363, P1011, DOI 10.1098/rstb.2007.2156
Dean I, 2008, J NEUROSCI, V28, P6430, DOI 10.1523/JNEUROSCI.0470-08.2008
Dean I, 2005, NAT NEUROSCI, V8, P1684, DOI 10.1038/nn1541
Dimitriadis D, 2011, IEEE T AUDIO SPEECH, V19, P1504, DOI 10.1109/TASL.2010.2092766
Gao F, 2009, PHYSIOL BEHAV, V97, P369, DOI 10.1016/j.physbeh.2009.03.004
Garcia-Lazaro JA, 2007, EUR J NEUROSCI, V26, P2359, DOI 10.1111/j.1460-9568.2007.05847.x
Ghitza O., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80018-3
Ghitza O, 1994, IEEE T SPEECH AUDI P, V2, P115, DOI 10.1109/89.260357
Hanilci C, 2012, IEEE SIGNAL PROC LET, V19, P163, DOI 10.1109/LSP.2012.2184284
Hasan T, 2013, IEEE T AUDIO SPEECH, V21, P842, DOI 10.1109/TASL.2012.2226161
Hirsch H.G., 2000, INT WORKSH AUT SPEEC, P181
Jankowski C.R., 1992, P WORKSH SPEECH NAT, P453, DOI 10.3115/1075527.1075637
Kang SY, 2010, JARO-J ASSOC RES OTO, V11, P245, DOI 10.1007/s10162-009-0194-7
Kim C., 2006, P INT PITTSB PENNS, P1975
Kim C, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4101
Kim DS, 1999, IEEE T SPEECH AUDI P, V7, P55
Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009
Kinnunen T, 2012, IEEE T AUDIO SPEECH, V20, P1990, DOI 10.1109/TASL.2012.2191960
Li Q, 2011, IEEE T AUDIO SPEECH, V19, P1791, DOI 10.1109/TASL.2010.2101594
Li Q, 2010, INT CONF ACOUST SPEE, P4514, DOI 10.1109/ICASSP.2010.5495589
Lyon R., 1982, IEEE INT C AC SPEECH, V7, P1282, DOI 10.1109/ICASSP.1982.1171644
MAY BJ, 1992, J NEUROPHYSIOL, V68, P1589
Middlebrooks JC, 2004, J ACOUST SOC AM, V116, P452, DOI 10.1121/1.1760795
Miller CA, 2011, JARO-J ASSOC RES OTO, V12, P219, DOI 10.1007/s10162-010-0249-9
Ming J, 2007, IEEE T AUDIO SPEECH, V15, P1711, DOI 10.1109/TASL.2007.899278
Moore B.C.J., 2003, INTRO PSYCHOL HEARIN, P39
Nizami L, 2005, HEARING RES, V208, P26, DOI 10.1016/j.heares.2005.05.002
OHZAWA I, 1985, J NEUROPHYSIOL, V54, P651
Patterson R.D., 1992, AUDITORY PROCESSING, P67
Pfingst BE, 2011, HEARING RES, V281, P65, DOI 10.1016/j.heares.2011.05.002
Pickles J.O., 2008, INTRO PHYSL HEARING
Rabinowitz NC, 2011, NEURON, V70, P1178, DOI 10.1016/j.neuron.2011.04.030
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
RHODE WS, 1993, HEARING RES, V66, P31, DOI 10.1016/0378-5955(93)90257-2
Robles L, 2001, PHYSIOL REV, V81, P1305
SACHS MB, 1974, J ACOUST SOC AM, V56, P1835, DOI 10.1121/1.1903521
Saeidi R, 2010, IEEE SIGNAL PROC LET, V17, P599, DOI 10.1109/LSP.2010.2048649
Schneider BA, 2011, ATTEN PERCEPT PSYCHO, V73, P1562, DOI 10.3758/s13414-011-0097-7
SENEFF S, 1988, J PHONETICS, V16, P55
SHAMMA S, 1988, J PHONETICS, V16, P77
SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1612, DOI 10.1121/1.392799
Shao Y, 2008, INT CONF ACOUST SPEE, P1589
Shao Y, 2010, COMPUT SPEECH LANG, V24, P77, DOI 10.1016/j.csl.2008.03.004
Shao Y, 2007, INT CONF ACOUST SPEE, P277
Shin JW, 2008, IEEE SIGNAL PROC LET, V15, P257, DOI 10.1109/LSP.2008.917027
Slaney M., 1998, 1998010 INT RES CORP
Stern R., 2012, TECHNIQUES NOISE ROB
Stern RM, 2012, IEEE SIGNAL PROC MAG, V29, P34, DOI 10.1109/MSP.2012.2207989
Taberner AM, 2005, J NEUROPHYSIOL, V93, P557, DOI 10.1152/jn.00574.2004
Wang KS, 1994, IEEE T SPEECH AUDI P, V2, P421
Wang N, 2011, IEEE T AUDIO SPEECH, V19, P196, DOI 10.1109/TASL.2010.2045800
Watkins PV, 2011, CEREB CORTEX, V21, P178, DOI 10.1093/cercor/bhq079
Wen B, 2012, J NEUROPHYSIOL, V108, P69, DOI 10.1152/jn.00055.2012
Wen B, 2009, J NEUROSCI, V29, P13797, DOI 10.1523/JNEUROSCI.5610-08.2009
Werblin F, 1996, IEEE SPECTRUM, V33, P30, DOI 10.1109/6.490054
WINSLOW RL, 1987, J NEUROPHYSIOL, V57, P1002
Wu W, 2007, IEEE T AUDIO SPEECH, V15, P1893, DOI 10.1109/TASL.2007.899297
YATES GK, 1990, HEARING RES, V45, P203, DOI 10.1016/0378-5955(90)90121-5
Young ED, 2008, PHILOS T R SOC B, V363, P923, DOI 10.1098/rstb.2007.2151
Zilany MSA, 2010, J NEUROSCI, V30, P10380, DOI 10.1523/JNEUROSCI.0647-10.2010
NR 70
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 19
EP 34
DI 10.1016/j.specom.2013.07.006
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800002
ER
PT J
AU Tisljar-Szabo, E
Pleh, C
AF Tisljar-Szabo, Eszter
Pleh, Csaba
TI Ascribing emotions depending on pause length in native and foreign
language speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotion ascribing; Cross-linguistic; Speech pauses; Silent pause
duration; Foreign language
ID CROSS-CULTURAL DIFFERENCES; FACIAL EXPRESSIONS; VOCAL COMMUNICATION;
MEXICAN CHILDREN; AMERICAN SAMPLE; SPEAKER AFFECT; RECOGNITION;
PERCEPTION; DISTURBANCES; ANXIETY
AB Although the relationship between emotions and speech is well documented, little is known about the role of speech pauses in emotion expression and emotion recognition. The present study investigated how speech pause length influences how listeners ascribe emotional states to the speaker. Emotionally neutral Hungarian speech samples were taken, and speech pauses were systematically manipulated to create five variants of all passages. Hungarian and Austrian participants rated the emotionality of these passages by indicating on a 1-6 point scale how angry, sad, disgusted, happy, surprised, scared, positive, and heated the speaker could have been. The data reveal that the length of silent pauses influences listeners in attributing emotional states to the speaker. Our findings argue that pauses play a relevant role in ascribing emotions and that this phenomenon might be partly independent of language. (C) 2013 Elsevier B.V. All rights reserved.
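The stimulus manipulation described (creating passage variants that differ only in silent-pause length) can be expressed as rescaling the durations of pause intervals while leaving speech intervals untouched. A minimal sketch under that reading; the segment durations and the scaling factor are illustrative, not the study's materials.

def scale_pauses(segments, factor):
    # segments: list of (label, duration_in_seconds) with label 'speech' or 'pause'.
    # Returns a variant with silent-pause durations scaled by 'factor'; the five
    # factors used in the study are not given here, so 1.5 below is illustrative.
    return [(lab, dur * factor if lab == "pause" else dur) for lab, dur in segments]

passage = [("speech", 2.1), ("pause", 0.4), ("speech", 1.8), ("pause", 0.6)]
print(scale_pauses(passage, 1.5))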
C1 [Tisljar-Szabo, Eszter; Pleh, Csaba] Budapest Univ Technol & Econ, Dept Cognit Sci, H-1111 Budapest, Hungary.
RP Tisljar-Szabo, E (reprint author), Univ Debrecen, Med & Hlth Sci Ctr, Dept Behav Sci, Nagyerdei krt 98,POB 45, H-4032 Debrecen, Hungary.
EM jszaboeszter@gmail.com; pleh.csaba@ektf.hu
FU Austria-Hungary Action Foundation
FX We thank Florian Menz, professor at the Linguistic Institute, the
University of Vienna for helping in providing conditions for the
experiment. This work was supported by a scholarship from the
Austria-Hungary Action Foundation to the first author. We thank Csaba
Szabo, Agnes Lukacs, Dezso Nemeth, Karolina Janacsek, and Istvan Winkler
for valuable feedback and suggestions.
CR ALBAS DC, 1976, J CROSS CULT PSYCHOL, V7, P481, DOI 10.1177/002202217674009
Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016
Beaupre MG, 2005, J CROSS CULT PSYCHOL, V36, P355, DOI 10.1177/0022022104273656
BEIER EG, 1972, J CONSULT CLIN PSYCH, V39, P166, DOI 10.1037/h0033170
BERGMANN G, 1988, Z EXP ANGEW PSYCHOL, V35, P167
Biehl M, 1997, J NONVERBAL BEHAV, V21, P3, DOI 10.1023/A:1024902500935
Breitenstein C., 1996, NEUROLOGIE REHABILIT, V2
Breitenstein C, 2001, COGNITION EMOTION, V15, P57, DOI 10.1080/0269993004200114
Burkhardt F., 2006, P SPEECH PROS 2006 D
Burkhardt F., 2000, ISCA WORKSH ITRW SPE
Cahn J.E., 1990, J AM VOICE I O SOC, V8, P1
CARLSON R, 1992, SPEECH COMMUN, V11, P159, DOI 10.1016/0167-6393(92)90010-5
Deppermann A., 2005, Z QUALITATIVE FORSCH, V1, P35
Ekman A, 1969, Lakartidningen, V66, P5021
EKMAN P, 1987, J PERS SOC PSYCHOL, V53, P712, DOI 10.1037/0022-3514.53.4.712
EKMAN P, 1971, J PERS SOC PSYCHOL, V17, P124, DOI 10.1037/h0030377
EKMAN P, 1969, SCIENCE, V164, P86, DOI 10.1126/science.164.3875.86
ELDRED SH, 1958, PSYCHIATR, V21, P115
Elfenbein HA, 2003, J CROSS CULT PSYCHOL, V34, P92, DOI 10.1177/0022022102239157
Elfenbein HA, 2007, EMOTION, V7, P131, DOI 10.1037/1528-3542.7.1.131
Fairbanks G, 1941, SPEECH MONOGR, V8, P85
Fonagy I., 1963, Z PHONETIK SPRACHWIS, V16, P293
Fontaine JRJ, 2007, PSYCHOL SCI, V18, P1050, DOI 10.1111/j.1467-9280.2007.02024.x
GOLDMANEISLER F, 1958, Q J EXP PSYCHOL, V10, P96, DOI 10.1080/17470215808416261
Goldman-Eisler F., 1968, PSYCHOLINGUISTICS EX
Gosy M., 2008, BESZEDKUTATAS, V2008, P116
Gosy M., 2008, HUMAN FACTORS VOICE, P193
Gosy M., 2003, MAGYAR NYELVOR, V127, P257
HENDERSO.A, 1966, LANG SPEECH, V9, P207
Hofmann SG, 1997, J ANXIETY DISORD, V11, P573, DOI 10.1016/S0887-6185(97)00040-6
IZARD CE, 1994, PSYCHOL BULL, V115, P288, DOI 10.1037/0033-2909.115.2.288
Izard C.E., 1971, FACE EMOTION
JOHNSON WF, 1986, ARCH GEN PSYCHIAT, V43, P280
Jovicic S.T., SPECOM 2004, P77
Juslin P. N., 2005, NEW HDB METHODS NONV, P65
Juslin PN, 2001, EMOTION, V1, P381, DOI 10.1037//1528-3542.1.4.381
KASL SV, 1965, J PERS SOC PSYCHOL, V1, P425, DOI 10.1037/h0021918
LADD DR, 1985, J ACOUST SOC AM, V78, P435, DOI 10.1121/1.392466
Laukka P, 2008, J NONVERBAL BEHAV, V32, P195, DOI 10.1007/s10919-008-0055-9
MAHL GF, 1956, J ABNORM SOC PSYCH, V53, P1, DOI 10.1037/h0047552
MATSUMOTO D, 1993, MOTIV EMOTION, V17, P107, DOI 10.1007/BF00995188
MCCLUSKEY KW, 1981, INT J PSYCHOL, V16, P119, DOI 10.1080/00207598108247409
MCCLUSKEY KW, 1975, DEV PSYCHOL, V11, P551, DOI 10.1037/0012-1649.11.5.551
Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005
Pell MD, 2008, SPEECH COMMUN, V50, P519, DOI 10.1016/j.specom.2008.03.006
POPE B, 1970, J CONSULT CLIN PSYCH, V35, P128, DOI 10.1037/h0029659
ROCHESTE.SR, 1973, J PSYCHOLINGUIST RES, V2, P51, DOI 10.1007/BF01067111
SCHERER KR, 1984, J ACOUST SOC AM, V76, P1346, DOI 10.1121/1.391450
Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
Schonpflug U, 2008, COGNITIVE DEV, V23, P385, DOI 10.1016/j.cogdev.2008.05.002
Schrober M, 2003, SPEECH COMMUN, V40, P99, DOI 10.1016/S0167-6393(02)00078-X
Szabo E., 2008, MAGYAR PSZICHOLOGIAI, V63, P651
Thompson WF, 2006, SEMIOTICA, V158, P407, DOI 10.1515/SEM.2006.017
VANBEZOOIJEN R, 1983, J CROSS CULT PSYCHOL, V14, P387, DOI 10.1177/0022002183014004001
WILLIAMS EJ, 1949, AUST J SCI RES SER A, V2, P149
NR 57
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 35
EP 48
DI 10.1016/j.specom.2013.07.009
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800003
ER
PT J
AU Sun, Y
Gemmeke, JF
Cranen, B
ten Bosch, L
Boves, L
AF Sun, Yang
Gemmeke, Jort F.
Cranen, Bert
ten Bosch, Louis
Boves, Lou
TI Fusion of parametric and non-parametric approaches to noise-robust ASR
SO SPEECH COMMUNICATION
LA English
DT Article
DE Dynamic Bayesian Network; Virtual Evidence; Sparse Classification; Early
fusion; Robust speech recognition
ID AUTOMATIC SPEECH RECOGNITION
AB In this paper we present a principled method for the fusion of independent estimates of the state likelihood in a Dynamic Bayesian Network (DBN) by means of the Virtual Evidence option for improving speech recognition in the AURORA-2 task. A first estimate is derived from a conventional parametric Gaussian Mixture Model; a second estimate is obtained from a non-parametric Sparse Classification (SC) system. During training the parameters pertaining to the input streams can be optimized independently, but also jointly, provided that all streams represent true probability functions. During decoding the weights of the streams can be varied much more freely. It appeared that the state likelihoods in the GMM and SC streams are very different, and that this makes it necessary to apply different weights to the streams in decoding. When using optimal weights, the dual-input system can outperform the individual GMM or the SC systems for all SNR levels in test sets A and B in the AURORA-2 task. (C) 2013 Elsevier B.V. All rights reserved.
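At its core, combining the GMM and SC likelihood streams with separate weights amounts to a weighted (log-linear) combination of per-state log-likelihoods. A minimal sketch under that reading; the likelihood values and weights are made up, and this is not the GMTK/Virtual Evidence implementation used in the paper.

import numpy as np

def fuse_log_likelihoods(ll_gmm, ll_sc, w_gmm=0.7, w_sc=0.3):
    # Log-linear fusion of two per-state log-likelihood streams.
    # ll_gmm, ll_sc: arrays of shape (frames, states); weights are illustrative.
    return w_gmm * np.asarray(ll_gmm) + w_sc * np.asarray(ll_sc)

# Hypothetical per-frame, per-state likelihoods from the two streams.
ll_gmm = np.log(np.array([[0.6, 0.3, 0.1], [0.2, 0.5, 0.3]]))
ll_sc = np.log(np.array([[0.5, 0.4, 0.1], [0.1, 0.6, 0.3]]))
print(fuse_log_likelihoods(ll_gmm, ll_sc))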
C1 [Sun, Yang; Cranen, Bert; ten Bosch, Louis; Boves, Lou] Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6525 HT Nijmegen, Netherlands.
[Gemmeke, Jort F.] Katholieke Univ Leuven, Dept Elect Engn ESAT, B-3001 Heverlee, Belgium.
RP Sun, Y (reprint author), Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6525 HT Nijmegen, Netherlands.
EM jonathan.eric.sun@gmail.com; jgemmeke@amadana.nl; B.Cranen@let.ru.nl;
L.tenBosch@let.ru.nl; L.Boves@let.ru.nl
FU European Community [213850 SCALE]; IWT-SBO project ALADIN [100049];
FP7-SME project OPTI-FOX [262266]
FX The research leading to these results has received funding from the
European Community's Seventh Framework Programme FP7/2007-2013 under
Grant agreement no 213850 SCALE. The research of Jort F. Gemmeke was
funded by IWT-SBO project ALADIN contract 100049. Louis ten Bosch
received funding from the FP7-SME project OPTI-FOX, project reference
262266.
CR Aradilla G., 2008, THESIS ECOLE POLYTEC
Bilmes J., 2004, UWEETR20040016 U WAS
Bilmes J., 2001, DISCR STRUCT GRAPH M
Bilmes J., 2002, GMTK DOCUMENTATION
Bilmes J., 2001, UWEETR20010005 U WAS
Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9
Cetin O., 2005, THESIS U WASHINGTON
Chen CP, 2007, IEEE T AUDIO SPEECH, V15, P257, DOI 10.1109/TASL.2006.876717
Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0
Ellis D.P.W., 2000, STREAM COMBINATION A, P1635
ETSI, 2007, 202050 ETSI ES
Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347
Gemmeke J. F., 2011, P EUSIPCO, P1490
Gemmeke JF, 2011, IEEE T AUDIO SPEECH, V19, P2067, DOI 10.1109/TASL.2011.2112350
Gemmeke J.F., 2011, P IEEE WORKSH AUT SP
Hirsch H., 2000, P ICSLP, V4, P29
Hurmalainen A., 2011, P INT WORKSH MACH LI
Kirchhoff K., 2000, P ISCA ITRW WORKSH A, P17
Kirchhoff K, 2000, INT CONF ACOUST SPEE, P1435, DOI 10.1109/ICASSP.2000.861883
MISRA H, 2003, ACOUST SPEECH SIG PR, P741
Misra H., 2005, THESIS ECOLE POLYTEC
Morris J, 2008, IEEE T AUDIO SPEECH, V16, P617, DOI 10.1109/TASL.2008.916057
Pearl J., 1988, PROBABILISTIC REASON
Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828
Rasipuram R, 2011, INT CONF ACOUST SPEE, P5192
Saenko K, 2009, IEEE T PATTERN ANAL, V31, P1700, DOI 10.1109/TPAMI.2008.303
Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006
Subramanya A., 2007, P IEEE WORKSH AUT SP
Sun Y., 2011, P EUSIPCO BARC SPAIN, P1495
Sun Y., 2010, P INT MAK JAP
Sun Y., 2011, P INTERSPEECH, P1669
Valente F, 2010, SPEECH COMMUN, V52, P213, DOI 10.1016/j.specom.2009.10.002
van Dalen RC, 2011, IEEE T AUDIO SPEECH, V19, P733, DOI 10.1109/TASL.2010.2061226
Wolmer M., 2012, COMPUTER SPEECH LANG, V27, P780
Wu S., 1998, P ICASSP 98, P459
WU SL, 1998, ACOUST SPEECH SIG PR, P721
NR 36
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 49
EP 62
DI 10.1016/j.specom.2013.07.003
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800004
ER
PT J
AU Kim, JW
Nam, CM
Kim, YW
Kim, HH
AF Kim, Jung Wan
Nam, Chung Mo
Kim, Yong Wook
Kim, Hyang Hee
TI The development of the Geriatric Index of Communicative Ability (GICA)
for measuring communicative competence of elderly: A pilot study
SO SPEECH COMMUNICATION
LA English
DT Article
DE Communicative ability; Elderly; Language; Cognition; Index
ID SPEECH RECOGNITION; YOUNG; LISTENERS; NOISE; LANGUAGE; LIFE
AB A change in communicative ability, among various changes arising during the aging process, may cause a range of difficulties for the elderly. This study aims to develop a Geriatric Index of Communicative Ability (GICA) and verify its reliability and validity. After organizing the areas required for GICA and defining the categories for the sub-domains, relevant questions were arranged. The final version of GICA was completed through the stages of content and face validity, expert review, and pilot study. The overall reliability of GICA was good and the internal consistency (Cronbach's alpha = .786) and test-retest reliability (range of Pearson's correlation coefficients: .58-.98) were high. Based on this verification of the instrument's reliability and validity, the completed GICA was organized with three questions in each of six sub-domains: hearing, language comprehension & production, attention & memory, communication efficiency, voice, and reading/writing/calculation. As a tool to measure the communicative ability of elderly people reliably and appropriately, GICA is very useful in the early identification of those with communication difficulties among the elderly. (C) 2013 Published by Elsevier B.V.
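The internal-consistency figure reported (Cronbach's alpha = .786) follows the standard formula alpha = k/(k-1) * (1 - sum of item variances / variance of the total score). A minimal sketch of that computation; the item-response matrix is hypothetical, not the pilot-study data.

import numpy as np

def cronbach_alpha(items):
    # items: 2-D array, respondents x questionnaire items (hypothetical data).
    # Standard formula: k/(k-1) * (1 - sum(item variances)/variance(total score)).
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1.0 - item_vars.sum() / total_var)

# Hypothetical ratings from 5 respondents on 4 items (illustrative only).
data = [[3, 4, 3, 5], [2, 2, 3, 3], [4, 5, 4, 5], [3, 3, 2, 4], [5, 5, 4, 5]]
print(round(cronbach_alpha(data), 3))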
C1 [Kim, Jung Wan] Daegu Univ, Dept Speech & Language Pathol, Gyongsan 712714, South Korea.
[Nam, Chung Mo] Yonsei Univ, Coll Med, Dept Prevent Med, Seoul 120752, South Korea.
[Kim, Yong Wook; Kim, Hyang Hee] Yonsei Univ, Coll Med, Dept & Res Inst Rehabil Med, Seoul 120752, South Korea.
[Kim, Hyang Hee] Yonsei Univ, Grad Program Speech & Language Pathol, Seoul 120752, South Korea.
RP Kim, HH (reprint author), Yonsei Univ, Coll Med, Dept & Res Inst Rehabil Med, Seoul 120752, South Korea.
EM h.kim@yonsei.ac.kr
CR Brod M, 1999, GERONTOLOGIST, V39, P25
Burzynski C.M., 1987, COMMUNICATION DISORD, P214
Christensen K. J., 1991, PSYCHOL ASSESSMENT J, V3, P168, DOI 10.1037//1040-3590.3.2.168
FEHRING RJ, 1987, HEART LUNG, V16, P625
Frattali C, 1995, AM SPEECH LANGUAGE H
Friedenberg L, 1995, PSYCHOL TESTING DESI
Frisina DR, 1997, HEARING RES, V106, P95, DOI 10.1016/S0378-5955(97)00006-3
Glorig A., 1965, HEARING LEVELS AD 11, V11
GORDONSALANT S, 1993, J SPEECH HEAR RES, V36, P1276
Gronlund N., 1988, CONSTRUCT ACHIEVEMEN
Hegde M.N., 2001, INTRO COMMUNICATIVE
Holland A. L., 1999, COMMUNICATION ACTIVI
Kang Yeonwook, 2006, [Korean Journal of Psychology: General, 한국심리학회지:일반], V25, P1
Ki B, 1996, J KOREAN NEUROPSYCHI, V35, P298
Kim J.W., 2009, KOREAN J COMMUNICATI, V14, P495
Kim JY, COMMUNICATION
Kim Y.S., 2001, J KOREAN ACAD FAMILY, V22, P878
Likert R., 1932, ARCH PSYCHOL, V40, P1
LOMAS J, 1989, J SPEECH HEAR DISORD, V54, P113
Mayerson M.D., 1976, J GERONTOL, V31, P29
MUELLER PB, 1984, EAR NOSE THROAT J, V63, P292
Park J., 2001, GRADE RESPONSE MODEL
PICHORAFULLER MK, 1995, J ACOUST SOC AM, V97, P593, DOI 10.1121/1.412282
RIEGEL KF, 1973, GERONTOLOGIST, V13, P478
Statistics Korea, 2010, ELD STAT 2010
The Korean Gerontological Society, 2002, REV GER STUD
Thorndike R. M., 1991, MEASUREMENT EVALUATI
Tun P. A., 1999, J GERONTOL B-PSYCHOL, V54B, P317
Versfeld NJ, 2002, J ACOUST SOC AM, V111, P401, DOI 10.1121/1.1426376
Watson BC, 1998, J ACOUST SOC AM, V103, P3642, DOI 10.1121/1.423068
Wingfield A, 2006, J AM ACAD AUDIOL, V17, P487, DOI 10.3766/jaaa.17.7.4
Wingfield A, 2006, J NEUROPHYSIOL, V96, P2830, DOI 10.1152/jn.00628.2006
Yonan CA, 2000, PSYCHOL AGING, V15, P88, DOI 10.1037//0882-7974.15.1.88
NR 33
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 63
EP 69
DI 10.1016/j.specom.2013.08.001
PG 7
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800005
ER
PT J
AU Amino, K
Osanai, T
AF Amino, Kanae
Osanai, Takashi
TI Native vs. non-native accent identification using Japanese spoken
telephone numbers
SO SPEECH COMMUNICATION
LA English
DT Article
DE Foreign accent identification; Non-native speech; Spoken telephone
numbers; Prosody; Forensic speech science
ID AUTOMATIC LANGUAGE IDENTIFICATION; PERCEIVED FOREIGN ACCENT; EMOTION
RECOGNITION; SPEECH; CLASSIFICATION; ADVANTAGE; ENGLISH
AB In forensic investigations, it would be helpful to be able to identify a speaker's native language based on the sound of their speech. Previous research on foreign accent identification suggested that the identification accuracy can be improved by using linguistic forms in which non-native characteristics are reflected. This study investigates how native and non-native speakers of Japanese differ in reading Japanese telephone numbers, which have a specific prosodic structure called a bipodic template. Spoken Japanese telephone numbers were recorded from native speakers, and Chinese and Korean learners of Japanese. Twelve utterances were obtained from each speaker, and their F0 contours were compared between native and non-native speakers. All native speakers realised the prosodic pattern of the bipodic template while reading the telephone numbers, whereas non-native speakers did not. The metric rhythm and segmental properties of the speech samples were also analysed, and a foreign accent identification experiment was carried out using six acoustic features. By applying a logistic regression analysis, this method yielded an 81.8% correct identification rate, which is slightly better than that achieved in other studies. Discrimination accuracy between native and non-native accents was better than 90%, although discrimination between the two non-native accents was not that successful. A perceptual accent identification experiment was also conducted in order to compare automatic and human identifications. The results revealed that human listeners could discriminate between native and non-native speakers better, while they were inferior at identifying foreign accents. (C) 2013 Elsevier B.V. All rights reserved.
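Editor's note: the abstract reports a logistic-regression classifier over six acoustic features. The sketch below only illustrates that general setup with scikit-learn on synthetic data; the feature values, class labels and data are placeholders, not those of the study.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
# Placeholder design matrix: one row per utterance, six acoustic features.
# The paper's actual feature definitions are not reproduced here.
X = rng.normal(size=(120, 6))
y = rng.integers(0, 3, size=120)          # toy labels: 0 = native, 1/2 = two non-native groups

clf = LogisticRegression(max_iter=1000)   # handles the three classes directly
scores = cross_val_score(clf, X, y, cv=5)
print(scores.mean())                      # near chance on random data, as expected
```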
C1 [Amino, Kanae; Osanai, Takashi] Natl Res Inst Police Sci, Kashiwa, Chiba 2770882, Japan.
RP Amino, K (reprint author), Natl Res Inst Police Sci, 6-3-1 Kashiwanoha, Kashiwa, Chiba 2770882, Japan.
EM amino@nrips.go.jp; osanai@nrips.go.jp
FU MEXT [25350488, 24810034, 21300060]
FX Portions of this work were presented at ICPhS 2011 (K. Amino and T.
Osanai, Realisation of the prosodic structure of spoken telephone
numbers by native and non-native speakers of Japanese, in: Proc.
International Congress of Phonetic Sciences, pp. 236-239, Hong Kong,
August 2011) and the ASJ meeting in 2011 (K. Amino and T. Osanai,
Identification of native and nonnative speech by using Japanese spoken
telephone numbers, in: Proc. Autumn Meeting Acoust. Soc. Jpn., pp.
407-410, Matsue, September 2011). This work was supported by
Grants-in-Aid for Scientific Research from MEXT (25350488, 24810034,
21300060).
CR Arslan LM, 1996, SPEECH COMMUN, V18, P353, DOI 10.1016/0167-6393(96)00024-6
Arslan LM, 1997, INT CONF ACOUST SPEE, P1123, DOI 10.1109/ICASSP.1997.596139
Baumann S., 2001, P EUR 2001, P557
Beaupre MG, 2006, PERS SOC PSYCHOL B, V32, P16, DOI 10.1177/0146167205277097
Berkling K., 1998, P INT C SPOK LANG PR
Blackburn C.S., 1993, P EUROSPEECH, P1241
Bloch B, 1950, LANGUAGE, V26, P86, DOI 10.2307/410409
Boersma P., 2001, GLOT INT, V5, P341
BREWER MB, 1979, PSYCHOL BULL, V86, P307, DOI 10.1037/0033-2909.86.2.307
Brousseau J., 1992, P INT C SPOK LANG PR, P1003
Cleirigh C., 1994, P INT C SPOK LANG PR, P375
Elfenbein HA, 2002, PSYCHOL BULL, V128, P243, DOI 10.1037//0033-2909.128.2.243
FLEGE JE, 1992, J ACOUST SOC AM, V91, P370, DOI 10.1121/1.402780
FLEGE JE, 1988, J ACOUST SOC AM, V84, P70, DOI 10.1121/1.396876
Fullana N., 2009, RECENT RES 2 LANGUAG, P97
Fung P., 1999, P ICASSP, P221
Hall M, 2009, ACM SIGKDD EXPLORATI, V11, P10, DOI DOI 10.1145/1656274.1656278
HANSEN JHL, 1995, INT CONF ACOUST SPEE, P836, DOI 10.1109/ICASSP.1995.479824
Hirano H., 2006, IEICE TECHNICAL REPO, V105, P23
Hirano H., 2006, IEICE TECHNICAL REPO, V106, P19
Hollien H., 2002, FORENSIC VOICE IDENT
Itahashi S., 1992, P INT C SPOK LANG PR, P1015
Itahashi S., 1993, P EUR, P639
Ito C., 2006, MIT WORKING PAPERS L, V52, P65
Joh H., 2011, DICT BASIC PHONETIC
Katagiri K.L., 2008, P S ED JAP AS CHUL U, P103
KIM CW, 1968, LANGUAGE, V44, P516, DOI 10.2307/411719
Kindaichi H., 2001, DICT JAPANESE ACCENT
Koshimizu M., 1998, GUIDE WORLDS LANGU 2
Kulshreshtha M, 2012, FORENSIC SPEAKER RECOGNITION: LAW ENFORCEMENT AND COUNTER-TERRORISM, P71, DOI 10.1007/978-1-4614-0263-3_4
Kumpf K, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1740
Lee J.P., 2004, P INTERSPEECH, P1245
Lee O.H., 2005, SPEECH SCI, V12, P95
Lin H, 2007, J CHINESE LINGUISTIC, V17, P127
Mackay I.R.A., 2009, RECENT RES 2 LANGUAG, P43
Manning C. D., 1999, FDN STAT NATURAL LAN
Markham D., 1999, P INT C SPOK LANG PR, P1187
Matsuzaki H., 1999, J PHONETIC SOC JAPAN, V3, P26
Mehlhorn G., 2007, P 16 INT C PHON SCI, P1745
Min K.J., 1996, THESIS TOHOKU U SEND
Mixdorff H, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1469
Mok P., 2008, P SPEECH PROS 2008 C, P423
Muthusamy YK, 1994, IEEE SIGNAL PROC MAG, V11, P33, DOI 10.1109/79.317925
Nasu A., 2001, J OSAKA U FOREIGN ST, V25, P115
Noma H., 1998, GUIDE WORLDS LANGU 2
Piat M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P759
POSER WJ, 1990, LANGUAGE, V66, P78, DOI 10.2307/415280
Ramus F, 1999, COGNITION, V73, P265, DOI 10.1016/S0010-0277(99)00058-X
Rogers H., 1998, INT J SPEECH LANG LA, V5, P203
Saito Y., 2009, INTRO JAPANESE PHONE
Tate D.A., 1979, CURRENT ISSUES PHONE, P847
Teixeira C, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1784
Toki S., 2010, PHONETICS RES BASED
Tsujimura N., 1996, INTRO JAPANESE LINGU
Utsugi A., 2004, J PHONET SOC JPN, V8, P96
Vance T., 2008, SOUNDS JAPANESE
Vieru-Dimulescu B., 2007, P INT WORKSH PAR SPE, P47
Wrembel M., 2009, RECENT RES 2 LANGUAG, P291
Yanguas L.R., 1998, P INT C SPOK LANG PR
Zissman MA, 1996, INT CONF ACOUST SPEE, P777, DOI 10.1109/ICASSP.1996.543236
Zissman MA, 1996, IEEE T SPEECH AUDI P, V4, P31, DOI 10.1109/TSA.1996.481450
NR 61
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 70
EP 81
DI 10.1016/j.specom.2013.07.010
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800006
ER
PT J
AU Besacier, L
Barnard, E
Karpov, A
Schultz, T
AF Besacier, Laurent
Barnard, Etienne
Karpov, Alexey
Schultz, Tanja
TI Introduction to the special issue on processing under-resourced
languages
SO SPEECH COMMUNICATION
LA English
DT Editorial Material
C1 [Besacier, Laurent] Lab Informat Grenoble, Grenoble, France.
[Barnard, Etienne] North West Univ, Vanderbijlpark, South Africa.
[Karpov, Alexey] Russian Acad Sci, St Petersburg Inst Informat & Automat, St Petersburg 196140, Russia.
[Schultz, Tanja] Karlsruhe Inst Technol, D-76021 Karlsruhe, Germany.
RP Besacier, L (reprint author), Lab Informat Grenoble, Grenoble, France.
RI Karpov, Alexey/A-8905-2012
OI Karpov, Alexey/0000-0003-3424-652X
NR 0
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 83
EP 84
DI 10.1016/j.specom.2013.09.001
PG 2
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800007
ER
PT J
AU Besacier, L
Barnard, E
Karpov, A
Schultz, T
AF Besacier, Laurent
Barnard, Etienne
Karpov, Alexey
Schultz, Tanja
TI Automatic speech recognition for under-resourced languages: A survey
SO SPEECH COMMUNICATION
LA English
DT Article
DE Under-resourced languages; Automatic speech recognition (ASR); Language
portability; Speech and language resources acquisition; Statistical
language modeling; Crosslingual acoustic modeling and adaptation;
Automatic pronunciation generation; Lexical modeling
ID ASR
AB Speech processing for under-resourced languages is an active field of research, which has experienced significant progress during the past decade. We propose, in this paper, a survey that focuses on automatic speech recognition (ASR) for these languages. Under-resourced languages and the challenges associated with them are first defined. The main part of the paper is a literature review of the recent (last 8 years) contributions made in ASR for under-resourced languages. Examples of past projects and future trends when dealing with under-resourced languages are also presented. We believe that this paper will be a good starting point for anyone interested in initiating research in (or operational development of) ASR for one or several under-resourced languages. It should be clear, however, that many of the issues and approaches presented here apply to speech technology in general (text-to-speech synthesis for instance). (C) 2013 Published by Elsevier B.V.
C1 [Besacier, Laurent] Lab Informat Grenoble, Grenoble, France.
[Barnard, Etienne] North West Univ, Vanderbijlpark, South Africa.
[Karpov, Alexey] Russian Acad Sci, St Petersburg Inst Informat & Automat, St Petersburg 196140, Russia.
[Schultz, Tanja] Karlsruhe Inst Technol, D-76021 Karlsruhe, Germany.
RP Besacier, L (reprint author), Lab Informat Grenoble, Grenoble, France.
RI Karpov, Alexey/A-8905-2012
OI Karpov, Alexey/0000-0003-3424-652X
CR Abdillahi N, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P289
Ablimit M, 2010, 2010 IEEE 10TH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING PROCEEDINGS (ICSP2010), VOLS I-III, P581, DOI 10.1109/ICOSP.2010.5656065
Adda-Decker M., 2003, P EUR C SPEECH COMM, P257
[Anonymous], 2009, US NIST 2009 RT 09 R
Arisoy E, 2006, SIGNAL PROCESS, V86, P2844, DOI 10.1016/j.sigpro.2005.12.002
Arisoy E., 2012, P NAACL HLT 2012 WOR, P20
Barnard E., 2009, P INTERSPEECH, P2847
Barnard E., 2010, P AAAI SPRING S ART, P8
Barnett J, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2191
Berment V., 2004, THESIS J FOURIER U G
Besacier L., 2006, IEEE ACL SLT 2006 AR
Bhanuprasad K., 2008, P 3 INT JOINT C NAT, P805
Billa J., 1997, P EUR, P363
Cai J., 2008, SLTU 08 HAN VIETN
Carki K., 2000, IEEE ICASSP
Cetin O., 2008, SLTU 08 HAN VIETN
Chan H.Y., 2012, P 2 ACM S COMP DEV
Charniak E., 2003, P MT SUMM 9 NEW ORL, P40
Charoenpornsawat P., 2006, HUM LANG TECHN C HLT
Chelba C, 2000, COMPUT SPEECH LANG, V14, P283, DOI 10.1006/csla.2000.0147
Cohen P., 1997, P AUT SPEECH REC UND, P591
Constantinescu A., 1997, P ASRU, P606
Creutz M., 2005, A81 HELS U TECHN
Creutz M., 2007, ACM T SPEECH LANGUAG, V5
Crystal D., 2000, LANGUAGE DEATH
Cucu H., 2011, P ASRU 2011 HAW US
Cucu H., 2012, EUSIPCO 2012 BUC ROM
Cucu H., 2013, SPEECH COMMUN, DOI 10.1016/j.specom.2013.05.003
Davel M.H., 2011, P INTERSPEECH, P3153
De Vries N.J., 2011, P INTERSPEECH, P3177
De Vries N.J., 2013, SPEECH COMMUN, DOI 10.1016/j.specom.2013.07.001
Denoual E., 2006, NLP ANN M TOK JAP, P731
Do T, 2010, WORKSH SPOK LANG TEC
Dugast C., 1995, P EUROSPEECH, P197
Ekpenyong M., 2013, SPEECH COMMUN, DOI 10.1016/j.specom.2013.02.003
Ganapathiraju A., 2000, P INT C SPOK LANG PR, V4, P504
Gebreegziabher M., 2012, SLTU WORKSH SPOK LAN
Gelas H., 2011, INT 2011 FLOR IT 28
Gelas H., 2010, LAPH 12 NEW MEX US J
Gemmeke J.F., 2011, IEEE ASRU 2011 HI US
Ghoshal A., 2009, IEEE ICASSP
Gizaw S., 2008, SLTU 2008 HAN VIETN
GLASS J, 1995, SPEECH COMMUN, V17, P1, DOI 10.1016/0167-6393(95)00008-C
Godfrey J. J., 1992, P ICASSP, V1, P517
Gokcen S., 1997, P AUT SPEECH REC UND, P599
Grezl F., 2007, P ICASSP
Hermansky H., 2000, P ICASSP
Huang C., 2000, P ICSLP, P818
Huet S, 2010, COMPUT SPEECH LANG, V24, P663, DOI 10.1016/j.csl.2009.10.001
Hughes T, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P1914
IPA, 1999, HDB INT PHON ASS GUI
Jensson A., 2008, SLTU 08 HAN VIETN
Jing Z., 2010, P INT C COMP MECH CO, V5, P320
Kanejiya D.P., 2003, P TIFR WORKSH SPOK L, P93
Kanthak S., 2003, EUR 2003 GEN SWITZ, P1145
Karanasou P, 2010, LECT NOTES ARTIF INT, V6233, P167, DOI 10.1007/978-3-642-14770-8_20
Karpov A., 2011, P INT 2011 FLOR IT, P3161
Karpov A., 2013, SPEECH COMMUN, DOI 10.1016/j.specom.2013.07.004
Kiecza D., 1999, P INT C SPEECH PROC, P323
Killer M., 2003, INTERSPEECH
Kipyatkova I., 2012, Proceedings of the 2012 Federated Conference on Computer Science and Information Systems (FedCSIS)
KOHLER J, 1998, ACOUST SPEECH SIG PR, P417
Krauwer S., 2003, P 2003 INT WORKSH SP, P8
Kuo HKJ, 2009, 2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), P327, DOI 10.1109/ASRU.2009.5373470
Kurimo M., 2006, P HLT NAACL NY US
Kurimo M., 2006, P INT 06 PITTSB PA U, P1021
Lamel L., 1995, P EUR, P185
Laurent A., 2009, INT 2009 BRIGHT UK, P708
Le V.B., 2003, EUROSPEECH 2003, P3117
Le VB, 2009, IEEE T AUDIO SPEECH, V17, P1471, DOI 10.1109/TASL.2009.2021723
Lee DG, 2009, IEEE T AUDIO SPEECH, V17, P945, DOI 10.1109/TASL.2009.2019922
Loof J., 2009, INT 2009 BRIGHT UK
Lopatkova M, 2005, LECT NOTES ARTIF INT, V3658, P140
Mihajlik P., 2007, INT 07 ANTW BELG
MIKOLOV T, 2010, P INT, P1045
Mohamed AR, 2012, IEEE T AUDIO SPEECH, V20, P14, DOI 10.1109/TASL.2011.2109382
Muthusamy Y.K., 1992, 2 INT C SPOK LANG PR
Nakajima H., 2002, COLING 2002, V2, P716
Nanjo H, 2005, INT CONF ACOUST SPEE, P1053
Oparin I., 2008, P IEEE WORKSH SPOK L
Parent G, 2010, Proceedings 2010 IEEE Spoken Language Technology Workshop (SLT 2010), DOI 10.1109/SLT.2010.5700870
Patel N, 2009, CHI2009: PROCEEDINGS OF THE 27TH ANNUAL CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, VOLS 1-4, P51
Patel N, 2010, CHI2010: PROCEEDINGS OF THE 28TH ANNUAL CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, VOLS 1-4, P733
Pellegrini T., 2006, ICSLP 06 PITTSB
Pellegrini T., 2008, SLTU 08 HAN VIETN
Pellegrini T, 2009, IEEE T AUDIO SPEECH, V17, P863, DOI 10.1109/TASL.2009.2022295
Plahl C., 2011, P ASRU US
Rastrow A., 2012, P 50 ANN M ASS COMP, P175
Ronzhin A L, 2007, Pattern Recognition and Image Analysis, V17, DOI 10.1134/S1054661807020216
Rotovnik T, 2007, SPEECH COMMUN, V49, P437, DOI 10.1016/j.specom.2007.02.010
Roux J.C., 2000, P 2 INT C LANG RES E, P975
Sak H, 2010, INT CONF ACOUST SPEE, P5402, DOI 10.1109/ICASSP.2010.5494927
Sarikaya R, 2007, INT CONF ACOUST SPEE, P181
Schlippe T., 2010, INT 2010 MAK JAP 26
Schlippe T., 2012, ICASSP 2012 KYOT JAP
Schlippe T., 2012, INT 2012 PORTL OR 9
Schlippe T., 2013, SPEECH COMMUN, DOI 10.1016/j.specom.2013.06.015
Schultz T., 2002, P ICSLP, V1, P345
Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7
Schultz T., 2007, INT 2007 ANTW BELG
Schultz T., 2013, ICASSP 2013 VANC CAN
Schultz T., 2006, MULTILINGUAL SPEECH
SCHULTZ T, 1998, P ICSLP SYDN, P1819
Seide F., 2011, P ASRU, P24
Siniscalchi SM, 2013, COMPUT SPEECH LANG, V27, P209, DOI 10.1016/j.csl.2012.05.001
Solera-Urena R, 2007, SPEECH COMMUN, V49, P253, DOI 10.1016/j.specom.2007.01.013
Stahlberg F., 2012, P 4 IEEE WORKSH SPOK
Stahlberg F., 2013, P 1 INT C STAT LANG
Stephenson T.A., 2002, IDIAPRR242002, P10
Stolcke A., 2006, P ICASSP 2006
Stuker S., 2009, INT 2009 BRIGHT UK
Stuker S., 2008, SLTU 08 HAN VIETN
Stuker S., 2003, ICASSP 2003
Stuker S., 2003, P ICASSP 03 IEEE INT
Suenderman K., 2009, INT 2009 BRIGHT UK, P1475
Szarvas M., 2003, P ICASSP HONG KONG C, P368
Tachbelie M., 2013, SPEECH COMMUN, DOI 10.1016/j.specom.2013.01.008
Tachbelie M., 2012, SLTU WORKSH SPOK LAN
Tarjan B., 2010, P 2 INT WORKSH SPOK, P10
Thomas S., 2012, P INTERSPEECH
Thomas Samuel, 2012, P ICASSP
Toth L., 2008, P INTERSPEECH
Trentin E, 2001, NEUROCOMPUTING, V37, P91, DOI 10.1016/S0925-2312(00)00308-8
van Heerden C., 2010, P WORKSH SPOK LANG T, P17
van Niekerk D.R., 2013, SPEECH COMMUN, DOI 10.1016/j.specom.2013.01.009
Vergyri D., 2004, P ICSLP, P2245
Vesely K., 2012, P SLT US
Vu Ngoc Thang, 2012, P INTERSPEECH
Vu N.T., 2011, P INTERSPEECH
Vu N.T., 2012, P SLTU S AFR
Vu N.T., 2010, P SLT US
WHEATLEY B, 1994, INT CONF ACOUST SPEE, P237
Whittaker EWD, 2001, INT CONF ACOUST SPEE, P545, DOI 10.1109/ICASSP.2001.940889
Whittaker E.W.D., 2000, THESIS CAMBRIDGE U, P140
Wissing D., 2008, SO AFRICAN LINGUISTI, V26, P255
Young S., 2008, SPRINGER HDB SPEECH, P539, DOI 10.1007/978-3-540-49127-9_27
Young SJ, 1997, COMPUT SPEECH LANG, V11, P73, DOI 10.1006/csla.1996.0023
Yu D, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4169
NR 138
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 85
EP 100
DI 10.1016/j.specom.2013.07.008
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800008
ER
PT J
AU Schlippe, T
Ochs, S
Schultz, T
AF Schlippe, Tim
Ochs, Sebastian
Schultz, Tanja
TI Web-based tools and methods for rapid pronunciation dictionary creation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Web-derived pronunciations; Pronunciation modeling; Rapid bootstrapping;
Multilingual speech recognition
AB In this paper we study the potential as well as the challenges of using the World Wide Web as a seed for the rapid generation of pronunciation dictionaries in new languages. In particular, we describe Wiktionary, a community-driven resource of pronunciations in IPA notation, which is available in many different languages. First, we analyze Wiktionary in terms of language and vocabulary coverage and compare it in terms of quality and coverage with another source of pronunciation dictionaries in multiple languages (GlobalPhone). Second, we investigate the performance of statistical grapheme-to-phoneme models in ten different languages and measure the model performance for these languages over the amount of training data. The results show that for the studied languages about 15k phone tokens are sufficient to train stable grapheme-to-phoneme models. Third, we create grapheme-to-phoneme models for ten languages using both the GlobalPhone and the Wiktionary resources. The resulting pronunciation dictionaries are carefully evaluated along several quality checks, i.e. in terms of consistency, complexity, model confidence, grapheme n-gram coverage, and phoneme perplexity. Fourth, as a crucial prerequisite for a fully automated process of dictionary generation, we implement and evaluate methods to automatically remove flawed and inconsistent pronunciations from dictionaries. Last but not least, speech recognition experiments in six languages evaluate the usefulness of the dictionaries in terms of word error rates. Our results indicate that the web resources of Wiktionary can be successfully leveraged to fully automatically create pronunciation dictionaries in new languages. (C) 2013 Elsevier B.V. All rights reserved.
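Editor's note: grapheme-to-phoneme models such as those described above are commonly scored by phoneme error rate, i.e. edit distance between hypothesis and reference pronunciations. The sketch below is a generic illustration of that metric on made-up pronunciations; it is not the evaluation code used in the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return d[-1][-1]

def phoneme_error_rate(pairs):
    errors = sum(edit_distance(r, h) for r, h in pairs)
    total = sum(len(r) for r, _ in pairs)
    return errors / total

# Hypothetical (reference, G2P hypothesis) pronunciation pairs in IPA.
pairs = [(["h", "ɛ", "l", "oʊ"], ["h", "ɛ", "l", "o"]),
         (["w", "ɜː", "l", "d"], ["w", "ɜː", "l", "d"])]
print(phoneme_error_rate(pairs))  # 1 error over 8 reference phones = 0.125
```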
C1 [Schlippe, Tim; Ochs, Sebastian; Schultz, Tanja] Karlsruhe Inst Technol, Inst Anthropomat, Cognit Syst Lab, D-76131 Karlsruhe, Germany.
RP Schlippe, T (reprint author), Karlsruhe Inst Technol, Inst Anthropomat, Cognit Syst Lab, Adenauerring 4, D-76131 Karlsruhe, Germany.
EM tim.schlippe@kit.edu
FU OSEO, French State agency for innovation
FX This work was partly realized as part of the Quaero Programme, funded by
OSEO, French State agency for innovation.
CR Besling S., 1994, KONVENS
Bisani M., 2008, SPEECH COMMUNICATION
Black A. W., 1998, ESCA WORKSH SPEECH S
Can D., 2009, 32 ANN INT ACM SIGIR
Chen S.F., 2003, EUROSPEECH
Davel M., 2006, INTERSPEECH
Davel M., 2009, INTERSPEECH
Davel M., 2010, INTERSPEECH
Davel M., 2004, ICSLP
Gerosa M., 2009, P 2009 IEEE INT C AC, DOI DOI 10.1109/ICASSP.2009.4960583
Ghoshal A., 2009, ICASSP
Hahn S., 2012, INT 2012
IPA I. P. A., 1999, HDB INT PHON ASS GUI
Jiampojamarn S., 2007, HLT
Kanthak S., 2002, ICASSP
Kaplan R.M., 1994, COMPUTATIONAL LINGUI
Karanasou P., 2010, P 7 INT C ADV NAT LA
Killer M., 2003, EUROSPEECH
Kneser R., 2000, WYTP409100002 PHIL S
Kominek J., 2009, THESIS
Kominek J., 2006, HLT C NAACL
Laurent A., 2009, INTERSPEECH
Llitjos A.F., 2002, LREC
Martirosian O., 2007, PRASA
Novak J., 2011, PHONETISAURUS WFST D
Novak J., 2012, INT WORKSH FIN STAT
Schlippe T., 2010, INTERSPEECH
Schlippe T., 2012, SLT U
Schlippe T., 2012, INTERSPEECH
Schlippe T., 2012, ICASSP
Schultz T., 2002, ICSLP
Schultz T., 2007, INTERSPEECH
Stueker S., 2004, SPECOM
Vozila P., 2003, EUROSPEECH
Vu N. T., 2010, INTERSPEECH
Wells JC, 1997, HDB STANDARDS RESOUR
Wikimedia, 2012, LIST WIK ED RANK ART
Wolff M., 2002, PMLA
Zhu X., 2001, ICASSP
NR 39
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 101
EP 118
DI 10.1016/j.specom.2013.06.015
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800009
ER
PT J
AU de Vries, NJ
Davel, MH
Badenhorst, J
Basson, WD
de Wet, F
Barnard, E
de Waal, A
AF de Vries, Nic J.
Davel, Marelie H.
Badenhorst, Jaco
Basson, Willem D.
de Wet, Febe
Barnard, Etienne
de Waal, Alta
TI A smartphone-based ASR data collection tool for under-resourced
languages
SO SPEECH COMMUNICATION
LA English
DT Article
DE Smartphone-based; ASR data collection; Under-resourced languages;
Automatic speech recognition; ASR corpora; Speech resources; Speech data
collection; Broadband speech corpora; Woefzela; On-device quality
control; QC-on-the-go; Android
ID DEVELOPING REGIONS; TECHNOLOGY
AB Acoustic data collection for automatic speech recognition (ASR) purposes is a particularly challenging task when working with under-resourced languages, many of which are found in the developing world. We provide a brief overview of related data collection strategies, highlighting some of the salient issues pertaining to collecting ASR data for under-resourced languages. We then describe the development of a smartphone-based data collection tool, Woefzela, which is designed to function in a developing world context. Specifically, this tool is designed to function without any Internet connectivity, while remaining portable and allowing for the collection of multiple sessions in parallel; it also simplifies the data collection process by providing process support to various role players during the data collection process, and performs on-device quality control in order to maximise the use of recording opportunities.
The use of the tool is demonstrated as part of a South African data collection project, during which almost 800 hours of ASR data was collected, often in remote, rural areas, and subsequently used to successfully build acoustic models for eleven languages. The on-device quality control mechanism (referred to as QC-on-the-go) is an interesting aspect of the Woefzela tool and we discuss this functionality in more detail. We experiment with different uses of quality control information, and evaluate the impact of these on ASR accuracy. Woefzela was developed for the Android Operating System and is freely available for use on Android smartphones. (C) 2013 Elsevier B.V. All rights reserved.
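Editor's note: the abstract describes on-device quality control ("QC-on-the-go") but does not spell out the checks. The sketch below shows the kind of utterance-level screening (duration, clipping, audible energy) such a tool might apply; the function name and thresholds are assumptions for illustration, not Woefzela's actual criteria.

```python
import numpy as np

def basic_quality_check(samples: np.ndarray, sample_rate: int,
                        min_dur_s: float = 1.0, max_clip_frac: float = 0.01,
                        min_rms: float = 0.01) -> bool:
    """Rough utterance-level checks: duration, clipping, and audible energy.
    Thresholds are illustrative, not those used by Woefzela."""
    duration = len(samples) / sample_rate
    clip_frac = np.mean(np.abs(samples) >= 0.99)
    rms = np.sqrt(np.mean(samples ** 2))
    return duration >= min_dur_s and clip_frac <= max_clip_frac and rms >= min_rms

sr = 16000
t = np.linspace(0, 2.0, 2 * sr, endpoint=False)
utterance = 0.1 * np.sin(2 * np.pi * 220 * t)   # stand-in for a recorded prompt
print(basic_quality_check(utterance, sr))       # True
```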
C1 [de Vries, Nic J.; Badenhorst, Jaco; Basson, Willem D.; de Wet, Febe; de Waal, Alta] CSIR, Meraka Inst, Human Language Technol Res Grp, Pretoria, South Africa.
[Davel, Marelie H.; Badenhorst, Jaco; Basson, Willem D.; Barnard, Etienne] North West Univ, Multilingual Speech Technol, ZA-1900 Vanderbijlpark, South Africa.
[de Wet, Febe] Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7602 Matieland, South Africa.
RP de Vries, NJ (reprint author), CSIR, Meraka Inst, Human Language Technol Res Grp, Pretoria, South Africa.
EM ndevries@csir.co.za; marelie.davel@nwu.ac.za
CR Abney S., 2010, P 48 ANN M ASS COMP, P88
Ackermann U., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607244
Badenhorst J, 2011, LANG RESOUR EVAL, V45, P289, DOI 10.1007/s10579-011-9152-1
Badenhorst J., 2012, P WORKSH SPOK LANG T, P139
Badenhorst J., 2011, P PATT REC ASS S AFR, P1
Barnard E, 2008, 2008 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY: SLT 2008, PROCEEDINGS, P13, DOI 10.1109/SLT.2008.4777828
Barnard E, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P282
Barnard E., 2003, 9 INT WIN C STELL S, P1
Barnard E., 2009, P INTERSPEECH, P2847
Barnard E., 2010, AAAI S ART INT DEV A, P8
Basson W.D., 2012, P PATT REC ASS S AFR, P144
Botha G., 2005, P 16 ANN S PATT REC, P194
Brewer E, 2006, IEEE PERVAS COMPUT, V5, P15, DOI 10.1109/MPRV.2006.40
Brewer E, 2005, COMPUTER, V38, P25, DOI 10.1109/MC.2005.204
Constantinescu A., 1997, AUT SPEECH REC UND A, P606
Davel M., 2009, INTERSPEECH, P2851
Davel M, 2008, COMPUT SPEECH LANG, V22, P374, DOI 10.1016/j.csl.2008.01.001
Davel M.H., 2012, P WORKSH SPOK LANG T, P68
Davel M.H., 2011, P INTERSPEECH, P3153
De Vries N.J., 2011, P INTERSPEECH, P3177
De Vries N.J., 2012, THESIS N W U S AFRIC
De Wet F., 2011, NATL CTR HUMAN LANGU
De Wet F., 2006, P PATT REC ASS S AFR, P1
De Wet F., 2011, P INTERSPEECH, P3185
Draxler C., 2007, P INT 07 ANTW BELG, P1509
Gauvain J.-L., 1988, P ICSLP 88 SYDN AUST, P1335
Giwa O., 2011, P PATT REC ASS S AFR, P49
Grover A.S., 2011, P 20 INT C WORLD WID, P433
Grover A.S., 2009, IEEE INT C ICTD ICTD, P95
Huang X., 2001, SPOKEN LANGUAGE PROC
Hughes T, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P1914
Kamper H., 2012, P WORKSH SPOK LANG T, P102
Kamvar M, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P1966
Kim DY, 2003, ASRU'03: 2003 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING ASRU '03, P105
Kleynhans N., 2012, P PATT REC ASS S AFR, P165
Lamel L, 2002, COMPUT SPEECH LANG, V16, P115, DOI 10.1006/csla.2001.0186
Lane Ian, 2010, P NAACL HLT, P184
Lata S., 2010, P 7 C INT LANG RES E, P2851
Lee C., 2011, P INT, P3041
Maxwell M., 2006, P WORKSH FRONT LING, P29, DOI 10.3115/1641991.1641996
McGraw I., 2010, P LREC, P19
Modipa T.I., 2012, P PATT REC ASS S AFR, P173
Parent G., 2011, P INTERSPEECH, P3037
Pentland AS, 2004, COMPUTER, V37, P78, DOI 10.1109/MC.2004.1260729
Roux JC, 2004, P LREC LISB PORT, P93
Schiel F., 2003, PRODUCTION SPEECH CO
Schultz T., 2002, P ICSLP, V1, P345
Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7
Schultz T, 2006, MULTILINGUAL SPEECH
Schuster M, 2010, LECT NOTES ARTIF INT, V6230, P8, DOI 10.1007/978-3-642-15246-7_3
Shan JL, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P354
Grover AS, 2011, LANG RESOUR EVAL, V45, P271, DOI 10.1007/s10579-011-9151-2
Sherwani J., 2007, INT C INF COMM TECHN, P1
Sung Y., 2011, P INTERSPEECH, P2865
van Heerden C., 2010, P WORKSH SPOK LANG T, P17
Van Heerden C., 2011, P PATT REC ASS S AFR, P138
Van Heerden C.J., 2012, P WORKSH SPOK LANG T, P146
van den Heuvel H, 2008, LANG RESOUR EVAL, V42, P41, DOI 10.1007/s10579-007-9049-1
WHEATLEY B, 1994, INT CONF ACOUST SPEE, P237
Young S., 2009, HTK BOOK VERSION 3 4
NR 60
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 119
EP 131
DI 10.1016/j.specom.2013.07.001
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800010
ER
PT J
AU Ko, T
Mak, B
AF Ko, Tom
Mak, Brian
TI Eigentrigraphemes for under-resourced languages
SO SPEECH COMMUNICATION
LA English
DT Article
DE Eigentriphone; Eigentrigrapheme; Under-resourced language; Grapheme;
Regularization; Weighted PCA
AB Grapheme-based modeling has an advantage over phone-based modeling in automatic speech recognition for under-resourced languages when a good dictionary is not available. Recently we proposed a new method for parameter estimation of context-dependent hidden Markov models (HMMs) called eigentriphone modeling. Eigentriphone modeling outperforms conventional tied-state HMM by eliminating the quantization errors among the tied states. The eigentriphone modeling framework is very flexible and can be applied to any group of modeling units provided that they may be represented by vectors of the same dimension. In this paper, we would like to port the eigentriphone modeling method from a phone-based system to a grapheme-based system; the new method will be called eigentrigrapheme modeling. Experiments on four official South African under-resourced languages (Afrikaans, South African English, Sesotho, siSwati) show that the new eigentrigrapheme modeling method reduces the word error rates of conventional tied-state trigrapheme modeling by an average of 4.08% relative. (C) 2013 Elsevier B.V. All rights reserved.
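Editor's note: eigentriphone/eigentrigrapheme modeling represents each context-dependent unit as a vector and re-estimates it in a low-dimensional eigen-subspace (the paper uses weighted PCA with regularization). The sketch below shows only the plain, unweighted PCA idea on synthetic supervectors; it is a simplification, not the authors' estimation procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy "supervectors": one row per context-dependent grapheme model
# (e.g., stacked Gaussian mean parameters), purely synthetic here.
supervectors = rng.normal(size=(50, 200))

# Plain (unweighted) PCA via SVD of mean-centred vectors.
mean = supervectors.mean(axis=0)
centred = supervectors - mean
U, S, Vt = np.linalg.svd(centred, full_matrices=False)

k = 10                                   # retain the top-k eigenvectors
basis = Vt[:k]                           # (k, 200) eigen-directions
weights = centred @ basis.T              # per-model coordinates in the subspace
reconstructed = mean + weights @ basis   # low-rank approximation of each model
print(np.linalg.norm(supervectors - reconstructed) / np.linalg.norm(supervectors))
```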
C1 [Ko, Tom; Mak, Brian] Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Hong Kong, Peoples R China.
RP Mak, B (reprint author), Hong Kong Univ Sci & Technol, Dept Comp Sci & Engn, Hong Kong, Hong Kong, Peoples R China.
EM tomko@cse.ust.hk; mak@cse.ust.hk
FU Research Grants Council of the Hong Kong SAR [FSGRF12EG31, FSGRF13EG20,
SRFI11EG15]
FX This research is partially supported by the Research Grants Council of
the Hong Kong SAR under the grant numbers FSGRF12EG31, FSGRF13EG20, and
SRFI11EG15.
CR Andersen O., 1996, P INT C SPOK LANG PR
Bellegarda J.R., 2003, P IEEE INT C AC SPEE
Beyerlein P., 1999, P IEEE AUT SPEECH RE
Burget L., 2010, P IEEE INT C AC SPEE
Charoenpornsawat S.H.P., 2006, P HUM LANG TECHN C N
Daniels Peter T., 1996, WORLDS WRITING SYSTE
Davel M., 2009, P INTERSPEECH
Davel M., 2004, P INTERSPEECH
de Wet F., 2011, P INTERSPEECH
Kamper H., 2011, P INTERSPEECH
Kanthak S., 2003, P EUR C SPEECH COMM
Kanthak S., 2002, P IEEE INT C AC SPEE
Ko T., 2011, P INTERSPEECH
Ko T., 2011, P IEEE INT C AC SPEE
Ko T., 2012, P IEEE INT C AC SPEE
Ko T, 2013, IEEE T AUDIO SPEECH, V21, P1285, DOI 10.1109/TASL.2013.2248722
Kohler J., 1996, P INT C SPOK LANG PR
Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308
Le VB, 2009, IEEE T AUDIO SPEECH, V17, P1471, DOI 10.1109/TASL.2009.2021723
Lu L., 2011, 2011 IEEE WORKSH AUT, P365
Meng H, 1996, SPEECH COMMUN, V18, P47, DOI 10.1016/0167-6393(95)00032-1
Meraka-Institute, LWAZ ASR CORP
Ogbureke K.U., 2010, P IEEE INT C AC SPEE
Povey D., 2010, P IEEE INT C AC SPEE
Roux J.C., 2004, P LREC
Schukat-Talamazzini E.G., 1993, P EUR C SPEECH COMM
Sharma-Grover A., 2010, P 2 WORKSH AFR LANG
Sooful J.J., 2001, P PATT REC ASS S AFR
Stuker S., 2009, THESIS U FRIDERICIAN
Stuker S., 2008, P IEEE INT C AC SPEE
Tempest M., 2009, DICT MAKER 2 16 USER
van Heerden C., 2009, P INTERSPEECH
van Huyssteen G.B., 2009, P 20 ANN S PATT REC
Young S. J., 1994, P WORKSH HUM LANG TE
Young S. J., 2006, HTK BOOK VERSION 3 4
NR 35
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 132
EP 141
DI 10.1016/j.specom.2013.01.010
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800011
ER
PT J
AU Imseng, D
Motlicek, P
Bourlard, H
Garner, PN
AF Imseng, David
Motlicek, Petr
Bourlard, Herve
Garner, Philip N.
TI Using out-of-language data to improve an under-resourced speech
recognizer
SO SPEECH COMMUNICATION
LA English
DT Article
DE Multilingual speech recognition; Posterior features; Subspace Gaussian
mixture models; Under-resourced languages; Afrikaans
AB Under-resourced speech recognizers may benefit from data in languages other than the target language. In this paper, we report how to boost the performance of an Afrikaans automatic speech recognition system by using already available Dutch data. We successfully exploit available multilingual resources through (1) posterior features, estimated by multilayer perceptrons (MLP) and (2) subspace Gaussian mixture models (SGMMs). Both the MLPs and the SGMMs can be trained on out-of-language data. We use three different acoustic modeling techniques, namely Tandem, Kullback-Leibler divergence based HMMs (KL-HMM) as well as SGMMs and show that the proposed multilingual systems yield 12% relative improvement compared to a conventional monolingual HMM/GMM system only trained on Afrikaans. We also show that KL-HMMs are extremely powerful for under-resourced languages: using only six minutes of Afrikaans data (in combination with out-of-language data), KL-HMM yields about 30% relative improvement compared to conventional maximum likelihood linear regression and maximum a posteriori based acoustic model adaptation. (C) 2013 Elsevier B.V. All rights reserved.
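Editor's note: in KL-HMM acoustic modeling, each HMM state is a categorical distribution over phone classes and is scored against MLP posterior features with a Kullback-Leibler divergence. The sketch below illustrates that divergence on toy distributions; the direction of the divergence and all numbers are illustrative assumptions, not the exact local score used in the paper.

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-10) -> float:
    """KL(p || q) for two categorical distributions over phone classes."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float(np.sum(p * np.log(p / q)))

# Toy example: an HMM state modelled by a categorical distribution over
# three phone classes, scored against one MLP posterior frame.
state_distribution = np.array([0.7, 0.2, 0.1])
frame_posterior = np.array([0.6, 0.3, 0.1])
print(kl_divergence(state_distribution, frame_posterior))
```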
C1 [Imseng, David; Motlicek, Petr; Bourlard, Herve; Garner, Philip N.] Idiap Res Inst, Martigny, Switzerland.
[Imseng, David; Bourlard, Herve] Ecole Polytech Fed Lausanne, CH-1015 Lausanne, Switzerland.
RP Imseng, D (reprint author), Idiap Res Inst, Martigny, Switzerland.
EM dimseng@idiap.ch
FU Swiss NSF through the project Interactive Cognitive Systems (ICS)
[200021_132619/1]; National Centre of Competence in Research (NCCR) in
Interactive Multimodal Information Management (IM2)
FX This research was supported by the Swiss NSF through the project
Interactive Cognitive Systems (ICS) under contract number
200021_132619/1 and the National Centre of Competence in Research (NCCR)
in Interactive Multimodal Information Management (IM2)
http://wwww.im2.ch.
CR Aradilla G, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P928
Barnard E., 2009, P INTERSPEECH, P2847
Bisani M., 2004, P IEEE INT C AC SPEE, V1, P409
Bloomfield Leonard, 1933, LANGUAGE
Burget L, 2010, INT CONF ACOUST SPEE, P4334, DOI 10.1109/ICASSP.2010.5495646
Cover T M, 1991, ELEMENTS INFORM THEO
Davel M., 2009, P INTERSPEECH, P2851
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Gauvain J.-L., 1993, P ICASSP, V2, P558
Grezl F., 2011, P ASRU, P359
Heeringa W., 2008, P C PATT REC ASS S A, P159
Hermansky H, 2000, P ICASSP, V3, P1635
Imseng D, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4869
Imseng D., 2011, P INT, P537
Imseng D., 2012, P 3 INT WORKSH SPOK, P60
Imseng D., 2012, P INT
Johnson D., 2005, QUICKNET
KOHLER J, 1998, ACOUST SPEECH SIG PR, P417
KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694
KULLBACK S, 1987, AM STAT, V41, P340
Niesler T, 2007, SPEECH COMMUN, V49, P453, DOI 10.1016/j.specom.2007.04.001
Oostdijk NHJ, 2000, P LREC 2000 ATH, V2, P887
Povey D, 2011, INT CONF ACOUST SPEE, P4504
Povey D, 2010, INT CONF ACOUST SPEE, P4330, DOI 10.1109/ICASSP.2010.5495662
Qian Y., 2011, P ASRU HAW IEEE, P354
Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7
Shinoda K., 1997, P EUROSPEECH, V1, P99
Toth L, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2695
van Heerden C., 2009, P INT BRIGHT UK, P3003
Zen H., 2007, HMM BASED SPEECH SYN
NR 30
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 142
EP 151
DI 10.1016/j.specom.2013.01.007
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800012
ER
PT J
AU Kempton, T
Moore, RK
AF Kempton, Timothy
Moore, Roger K.
TI Discovering the phoneme inventory of an unwritten language: A
machine-assisted approach
SO SPEECH COMMUNICATION
LA English
DT Article
DE Phonemic analysis; Endangered languages; Field linguistics
AB There is a consensus between many linguists that half of all languages risk disappearing by the end of the century. Documentation is agreed to be a priority. This includes the process of phonemic analysis to discover the contrastive sounds of a language with the resulting benefits of further linguistic analysis, literacy, and access to speech technology. A machine-assisted approach to phonemic analysis has the potential to greatly speed up the process and make the analysis more objective.
It is demonstrated that a machine-assisted approach can make a measurable contribution to a phonemic analysis for all the procedures investigated: phonetic similarity, complementary distribution, and minimal pairs. The evaluation measures introduced in this paper allow a comprehensive quantitative comparison between these phonemic analysis procedures. Given the best available data and the machine-assisted procedures described, there is a strong indication that phonetic similarity is the most important piece of evidence in a phonemic analysis. (C) 2013 Elsevier B.V. All rights reserved.
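Editor's note: one of the phonemic-analysis procedures evaluated above is the search for minimal pairs. The sketch below shows a naive way to list candidate minimal pairs from a small phonemically transcribed word list; the lexicon and function are hypothetical and ignore the alignment and similarity evidence the paper combines.

```python
from itertools import combinations

def minimal_pairs(lexicon):
    """Return pairs of words whose transcriptions differ in exactly one segment."""
    pairs = []
    for (w1, t1), (w2, t2) in combinations(lexicon.items(), 2):
        if len(t1) == len(t2) and sum(a != b for a, b in zip(t1, t2)) == 1:
            pairs.append((w1, w2))
    return pairs

# Hypothetical mini-lexicon with tuple-of-phone transcriptions.
lexicon = {
    "pat": ("p", "a", "t"),
    "bat": ("b", "a", "t"),
    "pad": ("p", "a", "d"),
    "sun": ("s", "u", "n"),
}
print(minimal_pairs(lexicon))   # [('pat', 'bat'), ('pat', 'pad')]
```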
C1 [Kempton, Timothy; Moore, Roger K.] Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, S Yorkshire, England.
RP Kempton, T (reprint author), Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, S Yorkshire, England.
EM drtimothykempton@gmail.com
FU UK Engineering and Physical Sciences Research Council (EPSRC)
[EP/P502748/1]
FX This work is funded by the UK Engineering and Physical Sciences Research
Council (EPSRC grant number EP/P502748/1). We are grateful to Andy
Castro for providing the Kua-nsi recordings, Mary Pearce and Cathy
Bartram for helpful discussions about phonemic analysis heuristics, and
Sharon Peperkamp for answering our questions regarding previous
experiments. We would also like to thank Nic de Vries for comments on
the application of the tools. Image component credits: Zscout370
(Lesotho flag).
CR Aslam J. A., 2005, P 14 ACM INT C INF K, P664, DOI 10.1145/1099554.1099721
BAMBER D, 1975, J MATH PSYCHOL, V12, P387, DOI 10.1016/0022-2496(75)90001-2
Burquest D., 2006, PHONOLOGICAL ANAL FU
Castro A., 2010, SIL ELECT SURVEY REP, V1, P96
Clark J., 2007, INTRO PHONETICS PHON
Crystal D., 2000, LANGUAGE DEATH
Davis J., 2006, P 23 INT C MACH LEAR, P233, DOI DOI 10.1145/1143844.1143874
Demuth K., 2007, SESOTHO SPEECH ACQUI, P528
Dingemanse M., 2008, LANG DOC CONSERV, V2, P325
Fitt S., 1999, 6 EUR C SPEECH COMM
Garofolo JS, 1993, TIMIT ACOUSTIC PHONE
Gildea D, 1996, COMPUT LINGUIST, V22, P497
Gleason H., 1961, INTRO DESCRIPTIVE LI
Grenoble L. A., 2006, SAVING LANGUAGES INT
HAYES BRUCE, 2009, INTRO PHONOLOGY
Himmelmann N., 2002, LECT ENDANGERED LANG, V5, P37
Hockett C, 1955, ASTOUNDING SCI FICTI, P97
Huckvale M., 2004, P ICSLP
Jeffreys H., 1948, THEORY PROBABILITY, V2nd
Kempton T., 2012, THESIS U SHEFFIELD
Kondrak G, 2003, COMPUT HUMANITIES, V37, P273, DOI 10.1023/A:1025071200644
KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694
Kurtic E., 2012, C LANG RES EV INST T
Ladefoged Peter, 2003, PHONETIC DATA ANAL I
Le Calvez R., 2007, P 2 EUR COGN SCI C, P167
Moseley C, 2009, ATLAS WORLDS LANGUAG, P2009
Peperkamp S, 2006, COGNITION, V101, pB31, DOI 10.1016/j.cognition.2005.10.006
Pike Kenneth, 1947, PHONEMICS TECHNIQUE
Poser B., 2008, MINPAIR VERSION 5 1
Postal P., 1968, ASPECTS PHONOLOGICAL
SIL, 2008, PHON ASS V3 0 1
Siniscalchi SM, 2008, INT CONF ACOUST SPEE, P4261, DOI 10.1109/ICASSP.2008.4518596
Sproat R., 1993, J PHON, V21
Wells John C., 1982, ACCENTS ENGLISH, V3
Williams L, 1977, J PHONETICS, V5, P169
NR 35
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 152
EP 166
DI 10.1016/j.specom.2013.02.006
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800013
ER
PT J
AU Mohan, A
Rose, R
Ghalehjegh, SH
Umesh, S
AF Mohan, Aanchan
Rose, Richard
Ghalehjegh, Sina Hamidi
Umesh, S.
TI Acoustic modelling for speech recognition in Indian languages in an
agricultural commodities task domain
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic speech recognition; Multi-lingual speech recognition; Subspace
modelling; Under-resourced languages; Acoustic-normalization
AB In developing speech recognition based services for any task domain, it is necessary to account for the support of an increasing number of languages over the life of the service. This paper considers a small vocabulary speech recognition task in multiple Indian languages. To configure a multi-lingual system in this task domain, an experimental study is presented using data from two linguistically similar languages Hindi and Marathi. We do so by training a subspace Gaussian mixture model (SGMM) (Povey et al., 2011; Rose et al., 2011) under a multi-lingual scenario (Burget et al., 2010; Mohan et al., 2012a). Speech data was collected from the targeted user population to develop spoken dialogue systems in an agricultural commodities task domain for this experimental study. It is well known that acoustic, channel and environmental mismatch between data sets from multiple languages is an issue while building multi-lingual systems of this nature. As a result, we use a cross-corpus acoustic normalization procedure which is a variant of speaker adaptive training (SAT) (Mohan et al., 2012a). The resulting multi-lingual system provides the best speech recognition performance for both languages. Further, the effect of sharing "similar" context-dependent states from the Marathi language on the Hindi speech recognition performance is presented. (C) 2013 Elsevier B.V. All rights reserved.
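Editor's note: the abstract mentions a cross-corpus acoustic normalization procedure (a variant of speaker adaptive training). As a much simpler stand-in, the sketch below normalizes feature statistics per corpus; it is only meant to convey why corpus-level mismatch is compensated before pooling data, and the data and function are synthetic assumptions, not the paper's SAT-style transform.

```python
import numpy as np

def per_corpus_cmvn(features_by_corpus):
    """Mean/variance-normalize each corpus separately (a simple stand-in for
    cross-corpus normalization; the paper uses a SAT-style transform instead)."""
    normalised = {}
    for name, feats in features_by_corpus.items():   # feats: (frames, dims)
        mean = feats.mean(axis=0)
        std = feats.std(axis=0) + 1e-8
        normalised[name] = (feats - mean) / std
    return normalised

rng = np.random.default_rng(2)
corpora = {
    "hindi": rng.normal(loc=2.0, scale=1.5, size=(1000, 13)),   # synthetic MFCC-like data
    "marathi": rng.normal(loc=-1.0, scale=0.5, size=(800, 13)),
}
out = per_corpus_cmvn(corpora)
print({k: (round(v.mean(), 3), round(v.std(), 3)) for k, v in out.items()})
```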
C1 [Mohan, Aanchan; Rose, Richard; Ghalehjegh, Sina Hamidi] McGill Univ, Dept Elect & Comp Engn, Montreal, PQ, Canada.
[Umesh, S.] Indian Inst Technol, Madras 600036, Tamil Nadu, India.
RP Mohan, A (reprint author), McGill Univ, Dept Elect & Comp Engn, Montreal, PQ, Canada.
EM aanchan.mohan@mail.mcgill.ca; rose@ece.mcgill.ca;
sina.hamidighalehjegh@mail.mcgill.ca; umeshs@ee.iitm.ac.in
FU Government of India
FX The authors would like to thank all of the members involved in the data
collection effort and the development of the dialogue system for the
project "Speech-based Access 1 for Agricultural Commodity Prices in Six
Indian Languages" sponsored by the Government of India. We would also
like to thank M.S. Research Scholar Raghavengra Bilgi at IIT Madras for
his timely help with providing resources for this experimental study.
CR Bowonder B., 2003, DEV RURAL MAZRKET EH
Burget L, 2010, INT CONF ACOUST SPEE, P4334, DOI 10.1109/ICASSP.2010.5495646
Cardona G., 2003, INDOARYAN LANGUAGES, V2
Central Hindi Directorate I., 1977, DEV DEV AMPL STAND
Chopde A., 2006, ITRANS INDIAN LANGUA
Gales M., 2001, P IEEE INT C AC SPEE
Gales M. J. F., 1998, COMPUTER SPEECH LANG, V12
Gales M. J. F., 2001, P ASRU 2001 TRENT IT, P77
Gillick L., 1989, P ICASSP, V1, P532
Hakkani-Tur D., 2006, P IEEE INT C AC SPEE, V1, pI
Killer M., 2003, 8 EUR C SPEECH COMM
Lee K.-F., 1989, AUTOMATIC SPEECH REC
Lu L, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4877
Lu L, 2011, P IEEE ASRU, P365
Mantena G., 2011, JOINT WORKSH HANDS F, P153
Mohan A., 2012, IEEE C INF SCI SIGN
Mohan A, 2012, P IEEE INT C AC SPEE
Patel N, 2010, CHI2010: PROCEEDINGS OF THE 28TH ANNUAL CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, VOLS 1-4, P733
Plauche M., 2007, INF TECHNOL INT DEV, V4, P69, DOI 10.1162/itid.2007.4.1.69
Povey D., 2009, TUTORIAL STYLE INTRO
Povey D, 2011, COMPUT SPEECH LANG, V25, P404, DOI 10.1016/j.csl.2010.06.003
Qian Y., 2011, 12 ANN C INT SPEECH
Rose RC, 2011, P IEEE INT C AC SPEE
SARACLAR M, 2000, ACOUST SPEECH SIG PR, P1679
Scharf P., 2009, LINGUISTIC ISSUES EN
Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7
Schultz T, 2006, MULTILINGUAL SPEECH
Seltzer M. L., 2011, P INTERSPEECH, P1097
Shrishrimal Pukhraj P., 2012, INT J COMPUTER APPL, V47, P17
SINGH R, 1999, P INT C SPOK LANG PR, V1, P117
Sulaiman R., 2003, INNOVATIONS AGR EXTE
Vu NT, 2011, INT CONF ACOUST SPEE, P5000
Weilhammer K., 2006, 9 INT C SPOK LANG PR
Young S., 2006, HTK BOOK HTK VERSION
NR 34
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 167
EP 180
DI 10.1016/j.specom.2013.07.005
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800014
ER
PT J
AU Tachbelie, MY
Abate, ST
Besacier, L
AF Tachbelie, Martha Yifiru
Abate, Solomon Teferra
Besacier, Laurent
TI Using different acoustic, lexical and language modeling units for ASR of
an under-resourced language - Amharic
SO SPEECH COMMUNICATION
LA English
DT Article
DE Syllable-based acoustic modeling; Hybrid (phone syllable) acoustic
modeling; Morpheme-based; Speech recognition; Under-resourced languages;
Amharic
ID SPEECH RECOGNITION
AB State-of-the-art large vocabulary continuous speech recognition systems use mostly phone based acoustic models (AMs) and word based lexical and language models. However, phone based AMs are not efficient in modeling long-term temporal dependencies and the use of words in lexical and language models leads to the out-of-vocabulary (OOV) problem, which is a serious issue for morphologically rich languages. This paper presents the results of our contributions on the use of different units for acoustic, lexical and language modeling for an under-resourced language (Amharic spoken in Ethiopia). Triphone, syllable and hybrid (syllable-phone) units have been investigated for acoustic modeling. Words and morphemes have been investigated for lexical and language modeling. We have also investigated the use of longer (syllable) acoustic units and shorter (morpheme) lexical as well as language modeling units in a speech recognition system.
Although hybrid AMs did not bring much improvement over context dependent syllable based recognizers in speech recognition performance with word based lexical and language model (i.e. word based speech recognition), we observed a significant word error rate (WER) reduction compared to triphone-based systems in morpheme-based speech recognition. Syllable AMs also led to a WER reduction over the triphone-based systems both in word based and morpheme based speech recognition. It was possible to obtain a 3% absolute WER reduction as a result of using syllable acoustic units in morpheme-based speech recognition. Overall, our result shows that syllable and hybrid AMs are best fitted in morpheme-based speech recognition. (C) 2013 Elsevier B.V. All rights reserved.
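Editor's note: the motivation for morpheme-based lexical and language modeling is the OOV problem mentioned above. The sketch below illustrates, on an invented toy lexicon with made-up segmentations, how splitting word forms into morph-like units lowers the OOV rate; it is not the segmentation used for Amharic in the paper.

```python
def oov_rate(test_tokens, vocabulary):
    unknown = sum(1 for tok in test_tokens if tok not in vocabulary)
    return unknown / len(test_tokens)

# Hypothetical illustration: segmenting words into morph-like units shrinks
# the effective vocabulary and removes some OOVs (segmentations are made up).
train_words = ["bet", "bet-och", "sew", "sew-och"]   # "-och" as a plural-like marker
test_words = ["bet", "ketema", "ketema-och"]         # two unseen word forms

word_vocab = set(train_words)
morph_vocab = {m for w in train_words for m in w.split("-")}
test_morphs = [m for w in test_words for m in w.split("-")]

print(oov_rate(test_words, word_vocab))    # 2/3 of test word forms are OOV
print(oov_rate(test_morphs, morph_vocab))  # "ketema" is still OOV, "och" is not
```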
C1 [Tachbelie, Martha Yifiru; Abate, Solomon Teferra] Univ Addis Ababa, Sch Informat Sci, Addis Ababa, Ethiopia.
[Besacier, Laurent] Univ Grenoble 1, LIG, Grenoble 1, France.
RP Tachbelie, MY (reprint author), Univ Addis Ababa, Sch Informat Sci, Addis Ababa, Ethiopia.
EM marthayifiru@gmail.com; solomon_teferra_7@yahoo.com;
laurent.besacier@imag.fr
CR Abate S. T., 2005, P INT LISB PORT, P1601
Abate Solomon Teferra, 2007, P INT ANTW BELG, P1541
Abate Solomon Teferra, 2006, THESIS U HAMBURG GER
Abate Solomon Teferra, 2007, P 2007 WORKSH COMP A, P33, DOI 10.3115/1654576.1654583
Abhinav Sethy, 2002, P ISCA PRON MOD WORK, P30
Appleyard David, 1995, C AMHARIC COMPLETE C
Azim Mohamed Mostafa, 2008, WSEAS T SIGNAL PROCE, V4, P211
Bazzi I., 2002, THESIS MIT
Bender M., 1976, LANGUAGES ETHIOPIA
Berhanu Solomon, 2001, THESIS ADDIS ABABA U
Berment V, 2004, THESIS U J FOURIER G
Besacier L, 2006, INT CONF ACOUST SPEE, P1221
Carki, 2000, ICASSP 2000 IST TURK, V3, P1563
Creutz M., 2005, A81 HELS U TECHN NEU
El-Desoky A., 2009, P INT 2009, P2679
Gales Mark, 2006, RECENT PROGR LARGE V
Ganapathiraju A, 2001, IEEE T SPEECH AUDI P, V9, P358, DOI 10.1109/89.917681
Gelas Hadrien, 2011, P INTERSPEECH FLOR I
Geutner Petra, 1995, P ICASSP, VI, P445
Girmaw Molalgne, 2004, THESIS ROYAL I TECHN
Gruenstein Alexander, 2009, P SLATE BRIGHT UK
Haile Alemayehu, 1995, J ETHOPIAN STUD, V28, P15
Hamalainen Annika, 2005, P SPECOM 2005, P499
Hirsimaki T., 2005, P INT INT C AD KNOWL, P121
Ircing P., 2001, P 7 EUR C SPEECH COM, P487
Kirchhoff Katrin, 2002, NOVEL SPEECH RECOGNI
Leslau Wolf, 2000, INTRO GRAMMAR AMHARI
Liu X, 2011, INT CONF ACOUST SPEE, P4872
Marge Matthew, 2010, P NAACL HLT
Mariam Sebsibe H., 2004, P 5 ISCA SPEECH SYNT, P103
McGraw Ian, 2009, P INTERSPEECH
Mohri M, 1998, LECT NOTES COMPUT SC, V1436, P144
Mulugeta Seyoum, 2001, THESIS DEP LINGUISTI
Pellegrini T., 2007, P INTERSPEECH 2007, P1797
Pellegrini T, 2009, IEEE T AUDIO SPEECH, V17, P863, DOI 10.1109/TASL.2009.2022295
Pellegrini Thomas, 2006, P LREC
Pellegrini Thomas, 2006, P INTERSPEECH 2006
Scott Novotney, 2010, P NAACL HLT, P207
Seid Hussien, 2005, P INT LISB PORT, P3349
Seifu Zegaye, 2003, THESIS ADDIS ABABA U
Snow R., 2008, P C EMP METH NAT LAN, P254, DOI 10.3115/1613715.1613751
Stolcke Andreas, 2002, P ICSLP 2002 DENB CO, P901
Tachbelie Martha Yifiru, P SLTU 10 PEN MAL, P68
Tachbelie Martha Yifiru, 2010, THESIS U HAMBURG GER
Tachbelie Martha Yifiru, 2003, THESIS ADDIS ABABA U
Tachbelie M.Y., 2009, P 4 LANG TECHN C LT, P114
Tadesse Kinfe, 2002, THESIS ADDIS ABABA U
Thangarajan R., 2008, S ASIAN LANGUAGE REV, V17, P71
Vesa Siivola, P EUR, P2293
Voigt Rainer M, 1987, JSS, V32, P1
Whittaker E.W.D., 2000, P 6 INT C SPOK LANG, P170
Whittaker E.W.D., 2001, IEEE WORKSH AUT SPEE, P315, DOI 10.1109/ASRU.2001.1034650
WOODLAND PC, 1995, INT CONF ACOUST SPEE, P73, DOI 10.1109/ICASSP.1995.479276
Yifiru Tachbelie Martha, 2011, P HLTD 2011, P50
Yifiru Tachbelie Martha, 2011, LECT NOTES COMPUTER, V6562, P82
Yimam Baye, 2007, YEAMARINA SEWASEW
NR 56
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 181
EP 194
DI 10.1016/j.specom.2013.01.008
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800015
ER
PT J
AU Karpov, A
Markov, K
Kipyatkova, I
Vazhenina, D
Ronzhin, A
AF Karpov, Alexey
Markov, Konstantin
Kipyatkova, Irina
Vazhenina, Dania
Ronzhin, Andrey
TI Large vocabulary Russian speech recognition using syntactico-statistical
language modeling
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic speech recognition; Slavic languages; Russian speech; Language
modeling; Syntactical analysis
ID SYSTEMS
AB Speech is the most natural way of human communication and in order to achieve convenient and efficient human computer interaction implementation of state-of-the-art spoken language technology is necessary. Research in this area has been traditionally focused on several main languages, such as English, French, Spanish, Chinese or Japanese, but some other languages, particularly Eastern European languages, have received much less attention. However, recently, research activities on speech technologies for Czech, Polish, Serbo-Croatian, Russian languages have been steadily increasing.
In this paper, we describe our efforts to build an automatic speech recognition (ASR) system for the Russian language with a large vocabulary. Russian is a synthetic and highly inflected language with lots of roots and affixes. This greatly reduces the performance of the ASR systems designed using traditional approaches. In our work, we have paid special attention to the specifics of the Russian language when developing the acoustic, lexical and language models. A special software tool for pronunciation lexicon creation was developed. For the acoustic model, we investigated a combination of knowledge-based and statistical approaches to create several different phoneme sets, the best of which was determined experimentally. For the language model (LM), we introduced a new method that combines syntactical and statistical analysis of the training text data in order to build better n-gram models.
Evaluation experiments were performed using two different Russian speech databases and an internally collected text corpus. Among the several phoneme sets we created, the one which achieved the fewest word level recognition errors was the set with 47 phonemes and thus we used it in the following language modeling evaluations. Experiments with 204 thousand words vocabulary ASR were performed to compare the standard statistical n-gram LMs and the language models created using our syntactico-statistical method. The results demonstrated that the proposed language modeling approach is capable of reducing the word recognition errors. (C) 2013 Elsevier B.V. All rights reserved.
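Editor's note: the language models discussed above are n-gram models enriched with syntactic analysis. The sketch below shows only a minimal bigram model with add-one smoothing and per-token perplexity on toy Russian text; the smoothing choice and data are illustrative assumptions, far simpler than the syntactico-statistical method of the paper.

```python
import math
from collections import Counter

def train_bigram_counts(sentences):
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return unigrams, bigrams

def bigram_prob(w_prev, w, unigrams, bigrams, vocab_size):
    # Add-one (Laplace) smoothing; real systems use Kneser-Ney or similar.
    return (bigrams[(w_prev, w)] + 1) / (unigrams[w_prev] + vocab_size)

train = ["мама мыла раму", "мама мыла окно"]
test = "мама мыла раму"
uni, bi = train_bigram_counts(train)
V = len(uni)

tokens = ["<s>"] + test.split() + ["</s>"]
log_prob = sum(math.log(bigram_prob(a, b, uni, bi, V)) for a, b in zip(tokens, tokens[1:]))
print(math.exp(-log_prob / (len(tokens) - 1)))   # per-token perplexity
```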
C1 [Karpov, Alexey; Kipyatkova, Irina; Ronzhin, Andrey] Russian Acad Sci SPIIRAS, St Petersburg Inst Informat & Automat, St Petersburg, Russia.
[Markov, Konstantin; Vazhenina, Dania] Univ Aizu, Human Interface Lab, Fukushima, Japan.
RP Kipyatkova, I (reprint author), Russian Acad Sci SPIIRAS, St Petersburg Inst Informat & Automat, St Petersburg, Russia.
EM kipyatkova@iias.spb.su
RI Karpov, Alexey/A-8905-2012
OI Karpov, Alexey/0000-0003-3424-652X
FU Ministry of Education and Science of Russia [07.514.11.4139]; grant of
the President of Russia [MK-1880.2012.8]; Russian Foundation for Basic
Research [12-08-01265]; Russian Humanitarian Scientific Foundation
[12-04-12062]
FX This research is supported by the Ministry of Education and Science of
Russia (contract No. 07.514.11.4139), by the grant of the President of
Russia (project No. MK-1880.2012.8), by the Russian Foundation for Basic
Research (project No. 12-08-01265) and by the Russian Humanitarian
Scientific Foundation (project No. 12-04-12062).
CR Anisimovich K., 2012, P DIAL 2012 MOSC RUS, V2, P91
Antonova A., 2012, P INT C DIAL 2012 MO, V2, P104
Arisoy E, 2010, INT CONF ACOUST SPEE, P5538, DOI 10.1109/ICASSP.2010.5495226
Arlazarov V., 2004, P INT C SPECOM 2004, P650
Bechet F., 2009, P INT 2009 BRIGHT UK, P1039
Bellegarda JR, 2004, SPEECH COMMUN, V42, P93, DOI 10.1016/j.specom.2003.08.002
Bhanuprasad K., 2008, P 3 INT JOINT C NAT, P805
Chelba C, 2000, COMPUT SPEECH LANG, V14, P283, DOI 10.1006/csla.2000.0147
Cubberley P., 2002, RUSSIAN LINGUISTIC I
Deoras A., 2012, P INT 2012 PORTL OR
Huet S, 2010, COMPUT SPEECH LANG, V24, P663, DOI 10.1016/j.csl.2009.10.001
Iomdin L., 2012, P DIAL 2012 MOSC RUS, V2, P119
Ircing P., 2006, P INT C LANG RES EV, P2600
Jokisch O., 2009, P SPECOM 2009 ST PET, P515
Kanejiya D.P., 2003, P TIFR WORKSH SPOK L, P93
Kanevsky D., 1996, P 1 INT C SPEECH COM, P117
Karpov A., 2011, P INT 2011 FLOR IT, P3161
Karpov A., 2012, P 3 INT WORKSH SPOK, P84
Kipyatkova I, 2012, 2012 FEDERATED CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS (FEDCSIS), P719
Kouznetsov V., 1999, P SPECOM 1999 MOSC R, P179
Kuo HKJ, 2009, 2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), P327, DOI 10.1109/ASRU.2009.5373470
Kurimo M., 2006, P HLT NAACL NEW YORK, P487, DOI 10.3115/1220835.1220897
Lamel L., 2012, P SLTU 2012 CAP TOWN, P156
Lamel L., 2011, P INT WORKSH SPOK LA, P121
Lee A., 2009, P APSIPA ASC, P131
Leontyeva A, 2008, LECT NOTES ARTIF INT, V5246, P373, DOI 10.1007/978-3-540-87391-4_48
Moore G.L., 2001, THESIS CAMBRIDGE U
Nozhov I., 2003, THESIS, P140
Odell JJ, 1995, THESIS CAMBRIDGE U
Oparin I, 2008, 2008 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY: SLT 2008, PROCEEDINGS, P189, DOI 10.1109/SLT.2008.4777872
Oparin I., 2005, P SPECOM PATR GREEC, P575
Padgett J, 2005, PHONETICA, V62, P14, DOI 10.1159/000087223
Potapova R., 2011, P INT C SPEECH COMP, P13
Psutka J., 2005, P EUR 2005 LISB PORT, P1349
Pylypenko V, 2007, INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, P1809
Rastrow A., 2012, P 50 ANN M ASS COMP, P175
Roark B., 2002, P 40 ANN M ASS COMP, P287
Ronzhin A., 2004, P INT C SPECOM 2004, P291
Schalkwyk J., 2010, ADV SPEECH RECOGNITI, P61, DOI 10.1007/978-1-4419-5951-5_4
Schultz T., 1998, P TSD 1998 BRN CZECH, P311
Shirokova A., 2007, P SPECOM 2007 MOSC R, P877
Shvedova N., 1980, RUSSIAN GRAMMAR, V1, P783
Sidorov G., 2012, LECT NOTES ARTIF INT, V7630, P1
Singh R, 2002, IEEE T SPEECH AUDI P, V10, P89, DOI 10.1109/89.985546
Skatov D., 2012, DICTASCOPE SYNTAX NA
Skrelin P, 2010, LECT NOTES ARTIF INT, V6231, P392, DOI 10.1007/978-3-642-15760-8_50
Smirnova J., 2011, P 17 INT C PHON SCI, P1870
Sokirko A., 2004, P 10 INT C DIAL 2004, P559
Starostin A., 2007, P INT C DIAL 2007 MO
Stolcke A., 2011, P IEEE AUT SPEECH RE
Stuker S, 2008, INT CONF ACOUST SPEE, P4249, DOI 10.1109/ICASSP.2008.4518593
Stuker S., 2004, P SPECOM 2004 SAINT, P297
Szarvas M., 2003, P ICASSP HONG KONG C, P368
Tatarnikova M., 2006, P SPECOM 2006 ST PET, P83
Vaiciunas A., 2006, THESIS VYTAUTAS MAGN
Vazhenina D., 2011, P 7 INT C NLP KNOWL, P475
Vazhenina D., 2012, P JOINT INT C HUM CT, P59
Viktorov A., 2009, SPEECH TECHNOL, V2, P39
Vintsyuk T., 1968, KIBERNETICA, V1, P15
Whittaker EWD, 2001, INT CONF ACOUST SPEE, P545, DOI 10.1109/ICASSP.2001.940889
Whittaker E.W.D., 2000, THESIS CAMBRIDGE U, P140
Young S., 2009, HTK BOOK, P384
Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885
Zablotskiy S, 2012, LREC 2012 - EIGHTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, P3374
Zaliznjak A.A., 2003, GRAMMATICAL DICT RUS, P800
Zhang JS, 2008, IEICE T INF SYST, VE91D, P508, DOI 10.1093/ietisy/e91-d.3.508
NR 66
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 213
EP 228
DI 10.1016/j.specom.2013.07.004
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800017
ER
PT J
AU Van Niekerk, DR
Barnard, E
AF Van Niekerk, Daniel R.
Barnard, Etienne
TI Predicting utterance pitch targets in Yoruba for tone realisation in
speech synthesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Yoruba; Tone language; Speech synthesis; Fundamental frequency
ID UNIVERSALITY; INTONATION
AB Pitch is a fundamental acoustic feature of speech and as such needs to be determined during the process of speech synthesis. While a range of communicative functions are attributed to pitch variation in all languages, it plays a vital role in distinguishing the meaning of lexical items in tone languages. As a number of factors are assumed to affect the realisation of pitch, it is important to know which mechanisms are systematically responsible for pitch realisation in order to model them effectively and thus develop robust speech synthesis systems in under-resourced environments. To this end, features influencing syllable pitch targets in continuous Yoruba utterances are investigated in a small speech corpus of four speakers. It is found that the pitch level of the previous syllable is strongly correlated with pitch changes between syllables, and a number of approaches and features are evaluated in this context. The resulting models can be used to predict utterance pitch targets for speech synthesisers (whether concatenative or statistical parametric), and may also prove useful in speech recognition systems. (C) 2013 Elsevier B.V. All rights reserved.
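As a minimal sketch of the reported correlation (an assumption for illustration, not the authors' model), the following fits a linear predictor of the inter-syllable pitch change from the previous syllable's pitch level; the numeric values are toy data, not measurements from the corpus.

```python
# Toy illustration: predict the pitch change to the next syllable from the
# previous syllable's pitch level with a simple linear fit.
import numpy as np

# prev_pitch: pitch level of syllable i-1 (e.g. semitones re. speaker mean);
# delta: observed pitch change F0(i) - F0(i-1). Values are made up.
prev_pitch = np.array([2.1, 4.0, 5.5, 3.2, 1.0, 4.8])
delta      = np.array([1.8, 0.9, -1.2, -0.5, 2.0, -0.7])

slope, intercept = np.polyfit(prev_pitch, delta, 1)  # least-squares linear fit

def predict_next_pitch(prev):
    """Predicted pitch target of the next syllable under the toy linear model."""
    return prev + (slope * prev + intercept)
```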
C1 [Van Niekerk, Daniel R.; Barnard, Etienne] North West Univ, Vanderbijlpark, South Africa.
[Van Niekerk, Daniel R.] CSIR, Meraka Inst, Human Language Technol Res Grp, ZA-0001 Pretoria, South Africa.
RP Van Niekerk, DR (reprint author), North West Univ, Ctr Text Technol, Potchefstroom, South Africa.
EM daniel.vanniekerk@nwu.ac.za
CR Adegbola T., 2009, P EACL 2009 WORKSH L, P53, DOI 10.3115/1564508.1564519
Adegbola T., 2012, 3 INT WORKSH SPOK LA, P48
Boersma P., 2001, PRAAT SYSTEM DOING P
Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199
Connell B, 2002, J PHONETICS, V30, P101, DOI 10.1006/jpho.2001.0156
Connell B., 1990, PHONOLOGY, V7, P1, DOI 10.1017/S095267570000110X
Courtenay K., 1971, STUDIES AFRICAN LING, V2, P239
Davel M, 2008, COMPUT SPEECH LANG, V22, P374, DOI 10.1016/j.csl.2008.01.001
Ekpenyong Moses E., 2008, International Journal of Speech Technology, V11, DOI 10.1007/s10772-009-9037-5
Fujisaki H., 1998, 3 ESCA COCOSDA WORKS, P26
Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X
Laniran YO, 2003, J PHONETICS, V31, P203, DOI 10.1016/S0095-4470(02)00098-0
Louw J.A., 2006, S AFRICAN J AFRICAN, V2, P1
Odejobi Odetunji A., 2008, Computer Speech & Language, V22, DOI 10.1016/j.csl.2007.05.002
Odejobi O.A., 2007, INFOCOMP J COMPUT SC, V6, P47
Odejobi OA, 2006, COMPUT SPEECH LANG, V20, P563, DOI 10.1016/j.csl.2005.08.006
Olshen R., 1984, CLASSIFICATION REGRE, V1st
Pedregosa F, 2011, J MACH LEARN RES, V12, P2825
Peng G, 2005, SPEECH COMMUN, V45, P49, DOI 10.1016/j.specom.2004.09.004
Prom-on S, 2009, J ACOUST SOC AM, V125, P405, DOI 10.1121/1.3037222
Tao JH, 2006, IEEE T AUDIO SPEECH, V14, P1145, DOI 10.1109/TASL.2006.876113
Van Niekerk D.R., 2012, 3 INT WORKSH SPOK LA, P54
van Niekerk D.R., 2009, P INT 2009 BRIGHT UK, P880
WHALEN DH, 1995, J PHONETICS, V23, P349, DOI 10.1016/S0095-4470(95)80165-0
Xu Y., 2000, 6 INT C SPOK LANG PR, P666
Xu Y, 2005, SPEECH COMMUN, V46, P220, DOI 10.1016/j.specom.2005.02.014
Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885
Zen H., 2006, 6 INT WORKSH SPEECH, P294
NR 28
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 229
EP 242
DI 10.1016/j.specom.2013.01.009
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800018
ER
PT J
AU Ekpenyong, M
Urua, EA
Watts, O
King, S
Yamagishi, J
AF Ekpenyong, Moses
Urua, Eno-Abasi
Watts, Oliver
King, Simon
Yamagishi, Junichi
TI Statistical parametric speech synthesis for Ibibio
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech synthesis; Ibibio; Low-resource languages; HTS
ID TONE CORRECTNESS
AB Ibibio is a Nigerian tone language, spoken in the south-east coastal region of Nigeria. Like most African languages, it is resource-limited. This presents a major challenge to conventional approaches to speech synthesis, which typically require the training of numerous predictive models of linguistic features such as the phoneme sequence (i.e., a pronunciation dictionary plus a letter-to-sound model) and prosodic structure (e.g., a phrase break predictor). This training is invariably supervised, requiring a corpus of training data labelled with the linguistic feature to be predicted. In this paper, we investigate what can be achieved in the absence of many of these expensive resources, and also with a limited amount of speech recordings. We employ a statistical parametric method, because this has been found to offer good performance even on small corpora, and because it is able to directly learn the relationship between acoustics and whatever linguistic features are available, potentially mitigating the absence of explicit representations of intermediate linguistic layers such as prosody.
We present an evaluation that compares systems that have access to varying degrees of linguistic structure. The simplest system only uses phonetic context (quinphones), and this is compared to systems with access to a richer set of context features, with or without tone marking. It is found that the use of tone marking contributes significantly to the quality of synthetic speech. Future work should therefore address the problem of tone assignment using a dictionary and the building of a prediction module for out-of-vocabulary words. (C) 2013 Elsevier B.V. All rights reserved.
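The comparison between context-feature sets can be pictured, under assumptions about the label format, by building a plain quinphone context and a tone-extended context for each phone; the HTS-style label syntax, the tone inventory and the example word below are hypothetical and not taken from the paper.

```python
# Hypothetical sketch of context labels with and without a tone feature.
def quinphone(phones, i):
    """Quinphone context for phone i (padded with 'sil' at the edges)."""
    padded = ["sil", "sil"] + phones + ["sil", "sil"]
    j = i + 2
    return "{}^{}-{}+{}={}".format(*padded[j - 2:j + 3])

def quinphone_with_tone(phones, tones, i):
    """Same context, extended with an assumed H/L tone mark for phone i."""
    return quinphone(phones, i) + "@T:" + tones[i]

phones = ["i", "b", "i", "b", "i", "o"]   # toy segmentation, not real Ibibio data
tones  = ["H", "H", "L", "L", "H", "H"]
print(quinphone_with_tone(phones, tones, 2))
```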
C1 [Ekpenyong, Moses] Univ Uyo, Dept Comp Sci, Uyo 520003, Akwa Ibom State, Nigeria.
[Urua, Eno-Abasi] Univ Uyo, Dept Linguist & Nigerian Languages, Uyo 520003, Akwa Ibom State, Nigeria.
[Watts, Oliver; King, Simon; Yamagishi, Junichi] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland.
RP Ekpenyong, M (reprint author), Univ Uyo, Dept Comp Sci, PMB 1017, Uyo 520003, Akwa Ibom State, Nigeria.
EM mosesekpenyong@gmail.com
CR Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X
Chomphan S, 2008, SPEECH COMMUN, V50, P392, DOI 10.1016/j.specom.2007.12.002
Chunwijitra V, 2012, SPEECH COMMUN, V54, P245, DOI 10.1016/j.specom.2011.08.006
Clark R., 2007, P BLZ3 2007 P SSW6
Ekpenyong Moses E., 2008, International Journal of Speech Technology, V11, DOI 10.1007/s10772-009-9037-5
Ekpenyong M., 2009, USEM J LANG LINGUIST, V2, P71
Essien O.E., 1990, GRAMMAR IBIBIO LANGU
Gibbon D., 2004, DATA CREATION IBIBIO
Gibbon D., 2001, P EUR, P83
Gibbon D., 2006, INT TUT RES WORKSH M, P1
King S., 2009, P BLIZZ CHALL WORKSH
Louw J.A., 2008, P 19 ANN S PATT REC, P165
Podsiadlo M., 2007, THESIS U EDINBURGH E
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
Taylor P, 2009, TEXT TO SPEECH SYNTH
Tucker R., 2005, P INT 2005 EUR LISB, P453
Urua E.-A., 2001, UYO IBIBIO IN PRESS
Watts O., 2012, THESIS U EDINBURGH E
Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P1208, DOI 10.1109/TASL.2009.2016394
Zen H., 2009, P APSIPA ASC OCT, P121
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325
NR 22
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2014
VL 56
BP 243
EP 251
DI 10.1016/j.specom.2013.02.003
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 244AZ
UT WOS:000326360800019
ER
PT J
AU Yokoyama, R
Nasu, Y
Iwano, K
Shinoda, K
AF Yokoyama, Ryo
Nasu, Yu
Iwano, Koji
Shinoda, Koichi
TI Detection of overlapped speech using lapel microphones in meeting
SO SPEECH COMMUNICATION
LA English
DT Article
DE Overlap speech detection; Spectral subtraction; Cosine distance
ID BLIND SEPARATION
AB We propose an overlapped speech detection method for speech recognition and speaker diarization of meetings in which each speaker wears a lapel microphone. Two novel features are used as inputs to a GMM-based detector. One is the speech power after cross-channel spectral subtraction, which reduces the power contributed by the other speakers. The other is an amplitude spectral cosine correlation coefficient, which effectively captures the correlation of spectral components under relatively quiet conditions. We evaluated our method using a meeting speech corpus of four speakers. The accuracy of the proposed method, 75.7%, was significantly better than that of the conventional method, 66.8%, which uses raw speech power and the Pearson correlation coefficient of the power spectra. (c) 2013 Elsevier B.V. All rights reserved.
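A hedged sketch of the two features named in the abstract, computed per frame from the magnitude spectra of one speaker's lapel microphone and a competing channel; the subtraction factor and flooring constants are assumptions, not the authors' settings.

```python
# Sketch of per-frame overlap-detection features under stated assumptions.
import numpy as np

def overlap_features(spec_own, spec_other, alpha=1.0, floor=1e-10):
    """spec_own / spec_other: magnitude spectra of one frame from two lapel mics."""
    # (i) cross-channel spectral subtraction, then power of the residual frame
    residual = np.maximum(spec_own - alpha * spec_other, floor)
    power_after_ss = float(np.sum(residual ** 2))

    # (ii) cosine similarity between the two amplitude spectra
    cosine = float(np.dot(spec_own, spec_other) /
                   (np.linalg.norm(spec_own) * np.linalg.norm(spec_other) + floor))
    return power_after_ss, cosine
```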
C1 [Yokoyama, Ryo; Nasu, Yu; Shinoda, Koichi] Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan.
[Iwano, Koji] Tokyo City Univ, Fac Environm & Informat Studies, Tsuzuki Ku, Yokohama, Kanagawa 2248551, Japan.
RP Yokoyama, R (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1 Ookayama, Tokyo 1528552, Japan.
EM yokoyama@ks.cs.titech.ac.jp
RI Shinoda, Koichi/D-3198-2014
OI Shinoda, Koichi/0000-0003-1095-3203
CR Aoki M., 2001, Acoustical Science and Technology, V22, DOI 10.1250/ast.22.149
BELL AJ, 1995, NEURAL COMPUT, V7, P1129, DOI 10.1162/neco.1995.7.6.1129
Ben-Harush O., 2009, IEEE INT WORKSH MACH, P1
Boakye K, 2008, INT CONF ACOUST SPEE, P4353, DOI 10.1109/ICASSP.2008.4518619
Boakye K., 2011, P INTERSPEECH, P941
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Ghosh PK, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P3098
Janin A, 2003, 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, P364
Maganti HK, 2007, INT CONF ACOUST SPEE, P1037
Moore D. C., 2003, P ICASSP, P497
Nasu Y, 2011, INT CONF ACOUST SPEE, P4812
Pfau T, 2001, IEEE WORKSH AUT SPEE, V1, P107, DOI 10.1109/ASRU.2001.1034599
Rickard S., 2001, P ICA2001 DEC, P651
Rozgic Viktor, 2010, Journal of Multimedia, V5, DOI 10.4304/jmm.5.4.322-331
Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 10.1016/S0925-2312(98)00047-2
Stolcke A, 2011, INT CONF ACOUST SPEE, P4992
Stolcke A, 2010, INT CONF ACOUST SPEE, P4390, DOI 10.1109/ICASSP.2010.5495626
Sun H, 2011, P INTERSPEECH, P2345
Sun HW, 2010, INT CONF ACOUST SPEE, P4982, DOI 10.1109/ICASSP.2010.5495077
Valente F, 2011, INT CONF ACOUST SPEE, P4416
Valente F, 2010, INT CONF ACOUST SPEE, P4954, DOI 10.1109/ICASSP.2010.5495087
Vijayasenan D, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4173
Vijayasenan D, 2010, INT CONF ACOUST SPEE, P4950, DOI 10.1109/ICASSP.2010.5495086
Wrigley SN, 2005, IEEE T SPEECH AUDI P, V13, P84, DOI 10.1109/TSA.2004.838531
Xiao B, 2011, INT CONF ACOUST SPEE, P5216
Yamamoto K, 2006, IEICE T FUND ELECTR, VE89A, P2158, DOI 10.1093/ietfec/e89-a.8.2158
Yella S. H., 2011, P INTERSPEECH, P953
Zhu M, 2004, 09 U WAT DEP STAT AC
Zwyssig E, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4177
NR 29
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 941
EP 949
DI 10.1016/j.specom.2013.06.013
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500001
ER
PT J
AU Zhang, WL
Qu, D
Zhang, WQ
Li, BC
AF Zhang, Wen-Lin
Qu, Dan
Zhang, Wei-Qiang
Li, Bi-Cheng
TI Rapid speaker adaptation using compressive sensing
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker adaptation; Speaker subspace; Compressive sensing; Matching
pursuit; l(1) regularization
ID SPEECH RECOGNITION; INVERSE PROBLEMS; RECONSTRUCTION; REPRESENTATION;
SUBSPACE
AB Speaker-space-based speaker adaptation methods can obtain good performance even if the amount of adaptation data is limited. However, it is difficult to determine the optimal dimension and basis vectors of the subspace for a particular unknown speaker. Conventional methods, such as eigenvoice (EV) and reference speaker weighting (RSW), can only obtain a sub-optimal speaker subspace. In this paper, we present a new speaker-space-based speaker adaptation framework using compressive sensing. The mean vectors of all mixture components of a conventional Gaussian-Mixture-Model-Hidden-Markov-Model (GMM-HMM)-based speech recognition system are concatenated to form a supervector. The speaker adaptation problem is viewed as recovering the speaker-dependent supervector from limited speech signal observations. A redundant speaker dictionary is constructed from a combination of all the training speaker supervectors and the supervectors derived from the EV method. Given the adaptation data, the best subspace for a particular speaker is constructed in a maximum a posteriori manner by selecting a proper set of items from this dictionary. Two algorithms, matching pursuit and l(1)-regularized optimization, are adapted to solve this problem. With an efficient redundant basis vector removal mechanism and iterative updating of the speaker coordinates, the matching pursuit based speaker adaptation method is fast and efficient. The matching pursuit algorithm is greedy and sub-optimal, while direct optimization of the likelihood of the adaptation data with an explicit l(1) regularization term can obtain a better approximation of the unknown speaker model. The projected gradient optimization algorithm is adopted, and a few iterations of the matching pursuit algorithm can provide a good initial value. Experimental results show that the matching pursuit algorithm outperforms the conventional methods under all testing conditions. Better performance is obtained when direct l(1)-regularized optimization is applied. Both methods can automatically select a proper mixed set of eigenvoice and reference speaker supervectors for estimation of the unknown speaker models. (c) 2013 Elsevier B.V. All rights reserved.
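As an illustration of the greedy atom selection described in the abstract, the following is a minimal matching-pursuit sketch over a speaker dictionary with unit-norm columns; it deliberately omits the paper's redundant-basis removal, iterative coordinate updates and MAP formulation.

```python
# Minimal matching-pursuit sketch: approximate a speaker supervector y as a
# sparse combination of dictionary columns (training-speaker and eigenvoice
# supervectors). Illustrative only, not the paper's full recipe.
import numpy as np

def matching_pursuit(y, D, n_atoms=5):
    """y: target supervector; D: dictionary (dim x n_columns), unit-norm columns."""
    residual = y.copy()
    selected, weights = [], []
    for _ in range(n_atoms):
        corr = D.T @ residual                 # correlation of residual with each atom
        k = int(np.argmax(np.abs(corr)))      # best-matching atom
        w = float(corr[k])
        selected.append(k)
        weights.append(w)
        residual = residual - w * D[:, k]     # greedy update of the residual
    return selected, weights, residual
```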
C1 [Zhang, Wen-Lin; Qu, Dan; Li, Bi-Cheng] Zhengzhou Informat Sci & Technol Inst, Zhengzhou 450002, Peoples R China.
[Zhang, Wei-Qiang] Tsinghua Univ, Dept Elect Engn, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China.
RP Zhang, WL (reprint author), Zhengzhou Informat Sci & Technol Inst, Zhengzhou 450002, Peoples R China.
EM zwlin_2004@163.com; qudanqu-dan@sina.com; wqzhang@tsinghua.edu.cn;
lbclm@163.com
RI Zhang, Wei-Qiang/A-7088-2008
OI Zhang, Wei-Qiang/0000-0003-3841-1959
FU National Natural Science Foundation of China [61175017, 61005019];
National High-Tech Research and Development Plan of China [2012AA011603]
FX This work was supported in part by the National Natural Science
Foundation of China (No. 61175017 and No. 61005019) and the National
High-Tech Research and Development Plan of China (No. 2012AA011603).
CR Boominathan V, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4381
Boyd S., 2004, CONVEX OPTIMIZATION
Bruckstein AM, 2009, SIAM REV, V51, P34, DOI 10.1137/060657704
Candes EJ, 2006, IEEE T INFORM THEORY, V52, P489, DOI 10.1109/TIT.2005.862083
Chang E, 2001, P EUR, P2799
Chen S., 1999, P EUR C SPEECH COMM, V3, P1087
Cho HY, 2010, ETRI J, V32, P795, DOI 10.4218/etrij.10.1510.0062
Daubechies I, 2004, COMMUN PUR APPL MATH, V57, P1413, DOI 10.1002/cpa.20042
Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P294, DOI 10.1109/89.506933
Donoho DL, 2006, IEEE T INFORM THEORY, V52, P1289, DOI 10.1109/TIT.2006.871582
Figueiredo MAT, 2007, IEEE J-STSP, V1, P586, DOI 10.1109/JSTSP.2007.910281
Gemmeke JF, 2011, IEEE T AUDIO SPEECH, V19, P2067, DOI 10.1109/TASL.2011.2112350
Hahm S, 2010, INT CONF ACOUST SPEE, P4302, DOI 10.1109/ICASSP.2010.5495672
Hazen T. J., 1997, P EUR, P2047
Huo Q, 1997, IEEE T SPEECH AUDI P, V5, P161
Kenny P, 2004, IEEE T SPEECH AUDI P, V12, P579, DOI 10.1109/TSA.2004.825668
Kua JMK, 2011, INT CONF ACOUST SPEE, P4548
Kua JMK, 2013, SPEECH COMMUN, V55, P707, DOI 10.1016/j.specom.2013.01.005
Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308
Li JY, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P1656
Lu L, 2011, IEEE SIGNAL PROC LET, V18, P419, DOI 10.1109/LSP.2011.2157820
Mak B., 2006, P ICASSP, V1
MALLAT SG, 1993, IEEE T SIGNAL PROCES, V41, P3397, DOI 10.1109/78.258082
Naseem Imran, 2010, Proceedings of the 2010 20th International Conference on Pattern Recognition (ICPR 2010), DOI 10.1109/ICPR.2010.1083
Olsen P. A, 2011, P ASRU, P53
Olsen PA, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4317
Petersen KB, 2008, MATRIX COOKBOOK
Povey D, 2012, COMPUT SPEECH LANG, V26, P35, DOI 10.1016/j.csl.2011.04.002
Shinoda K, 2010, IEICE T INF SYST, VE93D, P2348, DOI 10.1587/transinf.E93.D.2348
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
Teng W. X., 2007, P INTERSPEECH, P258
Teng WX, 2009, INT CONF ACOUST SPEE, P4381, DOI 10.1109/ICASSP.2009.4960600
Tibshirani R, 1996, J ROY STAT SOC B MET, V58, P267
Tropp JA, 2007, IEEE T INFORM THEORY, V53, P4655, DOI 10.1109/TIT.2007.909108
Wiesler S, 2011, INT CONF ACOUST SPEE, P5324
WOODLAND PC, 1994, INT CONF ACOUST SPEE, P125
Young S., 2009, HTK BOOK HTK VERSION
Yu D, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4409
NR 38
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 950
EP 963
DI 10.1016/j.specom.2013.06.012
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500002
ER
PT J
AU Yu, RS
AF Yu, Rongshan
TI Speech enhancement based on soft audible noise masking and noise power
estimation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Speech processing; Auditory model; Perceptual model;
Noise estimation; Noise suppression; Noise tracking
ID SPECTRAL AMPLITUDE ESTIMATOR; AUDITORY-SYSTEM; SUPPRESSION; EPHRAIM
AB This paper presents a perceptual-model-based speech enhancement algorithm. The proposed algorithm measures the amount of audible noise in the input noisy speech from an estimate of the short-time spectral power of the noise signal and a masking threshold calculated from the estimated spectrum of the clean speech. An appropriate amount of noise reduction is then chosen to achieve good noise suppression without introducing significant distortion to the clean speech. To mitigate the problem of "musical noise", the amount of noise reduction is linked directly to the estimated short-term noise spectral amplitude rather than to the noise variance, so that the spectral peaks of the noise can be better suppressed. Good performance of the proposed speech enhancement system is confirmed through objective and subjective tests. (c) 2013 Elsevier B.V. All rights reserved.
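One simple way to realize the described idea, sketched here under assumptions (the paper's exact rule may differ), is a per-bin gain that attenuates the estimated noise amplitude only down to the level where the residual becomes inaudible under the masking threshold derived from the clean-speech estimate.

```python
# Hedged sketch of a masking-threshold-driven suppression gain per frequency bin.
import numpy as np

def perceptual_gain(noise_amp_est, mask_threshold, g_min=0.05):
    """noise_amp_est: estimated short-term noise spectral amplitude per bin;
    mask_threshold: masking threshold (power) per bin from the clean-speech estimate."""
    # choose gain so that residual noise power g^2 * N^2 does not exceed the threshold
    g = np.sqrt(mask_threshold / np.maximum(noise_amp_est ** 2, 1e-12))
    return np.clip(g, g_min, 1.0)  # no attenuation where noise is already masked
```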
C1 [Yu, Rongshan] Inst Infocomm Res I2R, Dept Signal Proc, Singapore, Singapore.
[Yu, Rongshan] Dolby Labs Inc, San Francisco, CA 94103 USA.
RP Yu, RS (reprint author), Inst Infocomm Res I2R, Dept Signal Proc, 1 Fusionopolis Way,21-01 Connexis South Tower, Singapore, Singapore.
EM ryu@i2r.a-star.edu.sg
CR 3GPP, 2001, 26978 3GPP TR
3GPP, 2001, 26090 3GPP TS
3GPP2, 2004, CS00140 3GPP2
[Anonymous], 2003, P835 ITUT
[Anonymous], 2001, PERC EV SPEECH QUAL, P862
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Erkelens J. S, 2008, IEEE INT C AC SPEECH
Fisher W. M., 1986, P DARPA WORKSH SPEEC, P93
Gustafsson S, 1998, IEEE INT C AC SPEECH
Hansen JHL, 2006, IEEE T AUDIO SPEECH, V14, P2049, DOI 10.1109/TASL.2006.876883
Hu Y, 2006, 2006 IEEE INT C AC S
Hu Y, 2004, IEEE SIGNAL PROC LET, V11, P270, DOI 10.1109/LSP.2003.821714
ISO/IEC, 1992, JTC1SC29WG11IS111723
ITU, 1993, P56 ITUT, P56
ITU-T, 1996, P800 ITUT, P800
Jabloun F, 2003, IEEE T SPEECH AUDI P, V11, P700, DOI 10.1109/TSA.2003.818031
JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608
Lin L, 2003, IEEE INT C AC SPEECH
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Princen J. P., 1987, P ICASSP 87 DALL TX, V4, P2161
RICE SO, 1948, AT&T TECH J, V27, P109
Tsoukalas DE, 1997, IEEE T SPEECH AUDI P, V5, P497, DOI 10.1109/89.641296
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
Widrow B, 1985, ADAPTIVE SIGNAL PROC
Wolfe PJ, 2003, EURASIP J APPL SIG P, V2003, P1043, DOI 10.1155/S1110865703304111
Yu RS, 2009, INT CONF ACOUST SPEE, P4421
NR 29
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 964
EP 974
DI 10.1016/j.specom.2013.05.006
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500003
ER
PT J
AU Djendi, M
Scalart, P
Gilloire, A
AF Djendi, Mohamed
Scalart, Pascal
Gilloire, Andre
TI Analysis of two-sensors forward BSS structure with post-filters in the
presence of coherent and incoherent noise
SO SPEECH COMMUNICATION
LA English
DT Article
DE Cepstral distance; Spectral distortion; Signal to noise ratio (SNR);
Speech to coherent noise (SCR) ratio; Coherent noise to diffuse noise
(CDR) ratio; BSS; Speech enhancement; Forward and Backward BSS structure
ID LOW SIGNAL-DISTORTION; SPECTRAL AMPLITUDE ESTIMATOR; SPEECH ENHANCEMENT;
SEPARATION; CANCELER; INTELLIGIBILITY; CANCELLATION; ALGORITHMS;
REDUCTION; FIELD
AB We consider the speech enhancement problem in a moving car using a blind source separation (BSS) scheme involving two sensors. To correct the distortion introduced by this structure, we proposed in previous work (Djendi et al., 2007) two frequency-domain methods to compute the post-filters placed at the output of the forward BSS structure (FBSS). In this work, we consider the case where the noise at the sensor inputs contains coherent and non-coherent components. We analyze the performance (output SNR and distortion criterion) of the FBSS structure with post-filters as a function of two new parameters: the coherent-to-diffuse ratio (CDR) and the speech-to-coherent ratio (SCR). Simulations show perfect agreement between theoretical and experimental results. Crown Copyright (c) 2013 Published by Elsevier B.V. All rights reserved.
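For orientation, plausible power-ratio definitions of the two parameters named in the abstract are given below; these are assumptions consistent with the wording of the abstract, and the paper's exact formulation may differ.

```latex
% Assumed definitions (in dB) of the speech-to-coherent ratio and the
% coherent-to-diffuse ratio; P_x denotes the average power of component x.
\mathrm{SCR} = 10\log_{10}\frac{P_{\mathrm{speech}}}{P_{\mathrm{coherent}}},
\qquad
\mathrm{CDR} = 10\log_{10}\frac{P_{\mathrm{coherent}}}{P_{\mathrm{diffuse}}}
```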
C1 [Djendi, Mohamed] Blida Univ, LATSI Lab, Blida 09000, Algeria.
[Djendi, Mohamed; Scalart, Pascal] Univ Rennes, IRISA ENSSAT, F-22305 Lannion, France.
[Gilloire, Andre] France Telecom R&D, TECHISSTP, F-22307 Lannion, France.
RP Djendi, M (reprint author), Univ Rennes, IRISA ENSSAT, F-22305 Lannion, France.
EM m_djendi@yahoo.fr; pascal.scalart@enssat.fr; andre.gilloire@wanadoo.fr
CR ALKINDI MJ, 1989, SIGNAL PROCESS, V17, P241, DOI 10.1016/0165-1684(89)90005-4
Araki S, 2005, INT CONF ACOUST SPEE, P81
Bentler R, 2008, INT J AUDIOL, V47, P447, DOI 10.1080/14992020802033091
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Charkani N. H., 1996, THESIS U RENNES 1 FR
DJENDI M, 2006, INT CONF ACOUST SPEE, P744
Djendi M, 2007, P EUSIPCO POZN, V1, P218
Djendi M, 2009, P EUSICPO GLASG UK, V1, P165
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Even J, 2009, 2009 IEEE/SP 15TH WORKSHOP ON STATISTICAL SIGNAL PROCESSING, VOLS 1 AND 2, P513, DOI 10.1109/SSP.2009.5278525
FERRARA ER, 1980, IEEE T ACOUST SPEECH, V28, P474, DOI 10.1109/TASSP.1980.1163432
Gabrea M, 1996, P EUISIPCO 1996 TRIE, V2, P983
Gabrea M, 2003, P ICASSP, V2, P904
Goodwin G. C, 1985, INFO SYST SCI
Guerin A, 2002, THESIS NATL POLLYTEC
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458
Ikeda S, 1999, IEICE T FUND ELECTR, VE82A, P1517
Ikeda S, 1999, IEEE T SIGNAL PROCES, V47, P665, DOI 10.1109/78.747774
Jan T, 2009, INT CONF ACOUST SPEE, P1713, DOI 10.1109/ICASSP.2009.4959933
Kim G, 2009, J ACOUST SOC AM, V126, P1486, DOI 10.1121/1.3184603
Lim J, 1978, IEEE T ACOUST SPEECH, VASSP-37, P471
Marro C, 1998, IEEE T SPEECH AUDI P, V6, P240, DOI 10.1109/89.668818
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Matsuoka K., 2002, P 41 SICE ANN C AUG, V4, P2138, DOI 10.1109/SICE.2002.1195729
McCowan IA, 2003, IEEE T SPEECH AUDI P, V11, P709, DOI 10.1109/TSA.2003.818212
Mourjopoulos J. N, 1982, P ICASSP, V1, P1858
MOURJOPOULOS JN, 1994, J AUDIO ENG SOC, V42, P884
Plapous C, 2005, INT CONF ACOUST SPEE, P157
Plapous C., 2004, P IEEE INT C AC SPEE, V1, P289
Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621
Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005
Sato M, 2005, IEICE T FUND ELECTR, VE88A, P2055, DOI 10.1093/ietfec/e88-a.8.2055
Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199
Simmer K.U., 2001, MICROPHONE ARRAYS, P36
Sugiyama A, 1989, P IEEE ICASSP, V2, P892
VANGERVEN S, 1995, IEEE T SIGNAL PROCES, V43, P1602, DOI 10.1109/78.398721
Wang DL, 2009, J ACOUST SOC AM, V125, P2336, DOI 10.1121/1.3083233
Weiss RJ, 2010, COMPUT SPEECH LANG, V24, P16, DOI 10.1016/j.csl.2008.03.003
Zelinski R, 1988, P ICASSP, V5
NR 41
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 975
EP 987
DI 10.1016/j.specom.2013.06.001
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500004
ER
PT J
AU Shi, YY
Wiggers, P
Jonker, CM
AF Shi, Yangyang
Wiggers, Pascal
Jonker, Catholijn M.
TI Classifying the socio-situational settings of transcripts of spoken
discourses
SO SPEECH COMMUNICATION
LA English
DT Article
DE Socio-situational setting; Support vector machine; Dynamic Bayesian
networks; Genre classification; Part of speech
ID CLASSIFICATION; LANGUAGE; TEXT
AB In this paper, we investigate automatic classification of the socio-situational settings of transcripts of spoken discourse. Knowledge of the socio-situational setting can be used to search for content recorded in a particular setting or to select context-dependent models, for example in speech recognition. The subjective experiment reported in this paper shows that people correctly classify 68% of the socio-situational settings. Based on the cues that participants mentioned in the experiment, we developed two types of automatic socio-situational setting classification methods: a static method using support vector machines (S3C-SVM) and a dynamic method applying dynamic Bayesian networks (S3C-DBN). Using these two methods, we developed classifiers applying various features and combinations of features. The S3C-SVM method with sentence length, function word ratio, single-occurrence word ratio, part-of-speech (POS) tags and words as features results in a classification accuracy of almost 90%. A bigram S3C-DBN with POS tag and word features yields a dynamic classifier that obtains nearly 89% classification accuracy. The dynamic classifiers not only achieve results similar to those of the static classifiers, but can also track the socio-situational setting while processing a transcript or conversation. On discourses with a static socio-situational setting, the dynamic classifiers need only the initial 25% of the data to achieve a classification accuracy close to that achieved when all the data of a transcript are used. (c) 2013 Elsevier B.V. All rights reserved.
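A hedged sketch of three of the surface features listed in the abstract (sentence length, function-word ratio, single-occurrence-word ratio) fed to a linear SVM; the function-word list, tokenization and classifier settings are illustrative assumptions rather than the authors' configuration.

```python
# Illustrative feature extraction for socio-situational setting classification.
from collections import Counter
from sklearn.svm import LinearSVC  # classifier used once labelled data is available

FUNCTION_WORDS = {"de", "het", "een", "en", "van", "dat", "ik", "je"}  # toy Dutch list

def transcript_features(tokens):
    """Three surface features computed from a tokenized transcript."""
    counts = Counter(tokens)
    n = len(tokens)
    return [
        n,                                                    # transcript length
        sum(counts[w] for w in FUNCTION_WORDS) / n,           # function-word ratio
        sum(1 for w, c in counts.items() if c == 1) / n,      # single-occurrence ratio
    ]

print(transcript_features("ik denk dat het een goed idee is".split()))
# With feature vectors X and setting labels y (assumed given):
# clf = LinearSVC().fit(X, y)
```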
C1 [Shi, Yangyang; Jonker, Catholijn M.] Delft Univ Technol, Intelligent Syst Dept, NL-2600 AA Delft, Netherlands.
[Wiggers, Pascal] Amsterdam Univ Appl Sci HvA, CREATE IT Appl Res, Amsterdam, Netherlands.
RP Shi, YY (reprint author), HB12-290,Mekelweg 4, NL-2628 CD Delft, Netherlands.
EM shiyang1983@gmail.com
CR Argamon S., 1998, P 1 INT WORKSH INN I
Baeza-Yates R., 1999, MODERN INFORM RETRIE
Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199
Chen GY, 2008, APPLIED COMPUTING 2008, VOLS 1-3, P2353
CORTES C, 1995, MACH LEARN, V20, P273, DOI 10.1023/A:1022627411411
Dean T., 1989, Computational Intelligence, V5, DOI 10.1111/j.1467-8640.1989.tb00324.x
Fan RE, 2008, J MACH LEARN RES, V9, P1871
Feldman S, 2009, INT CONF ACOUST SPEE, P4781, DOI 10.1109/ICASSP.2009.4960700
Firth J. R., 1957, STUDIES LINGUISTIC A, P1930
Iyer R., 1994, P ARPA WORKSH HUM LA, P82, DOI 10.3115/1075812.1075828
Joachims T., 1998, MACH LEARN ECML 98, V1398, P137, DOI DOI 10.1007/BFB0026683
KARLGREN J, 1994, P 15 INT C COMP LING, V2, P1071, DOI 10.3115/991250.991324
Kessler B., 1997, P 35 ANN M ASS COMP, P32
Labov William, 1972, SOCIOLINGUISTIC PATT
Langley P., 1992, P 10 NAT C ART INT, P223
Lee Y., 2002, P ACL SIGIR C RES DE, P145
LEVINSON SC, 1979, LINGUISTICS, V17, P365, DOI 10.1515/ling.1979.17.5-6.365
Murphy K.P., 2002, THESIS U CALIFORNIA
Obin N, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P3070
Oostdijk N., 1999, BUILDING CORPUS SPOK
Oostdijk N., 2002, P 3 INT C LANG RES E, P340
Pearl J., 1988, PROBABILISTIC REASON
Peng FC, 2003, LECT NOTES COMPUT SC, V2633, P335
Ries K, 2000, P INT C LANG RESS EV
Rosenfeld R, 2000, P IEEE, V88, P1270, DOI 10.1109/5.880083
Santini M., 2004, 7 ANN CLUK RES C
Santini M., 2006, JADT 2006 8 JOURN
Shi Y, 2010, 22 BEN C ART INT, P154
Stamatatos E., 2000, P 18 INT C COMP LING, V2, P808, DOI 10.3115/992730.992763
Theodoridis S, 2009, PATTERN RECOGNITION, 4RTH EDITION, P1
Tong S, 2000, J MACHINE LEARNING R, V2, P45, DOI DOI 10.1162/153244302760185243
Van Gijsel S, 2006, P 8 INT C STAT AN TE, V2, P961
Wiggers P., P INT C TEXT SPEECH, P366
NR 33
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 988
EP 1002
DI 10.1016/j.specom.2013.06.011
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500005
ER
PT J
AU Yook, S
Nam, KW
Kim, H
Kwon, SY
Kim, D
Lee, S
Hong, SH
Jang, DP
Kim, IY
AF Yook, Sunhyun
Nam, Kyoung Won
Kim, Heepyung
Kwon, See Youn
Kim, Dongwook
Lee, Sangmin
Hong, Sung Hwa
Jang, Dong Pyo
Kim, In Young
TI Modified segmental signal-to-noise ratio reflecting spectral masking
effect for evaluating the performance of hearing aid algorithms
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech quality test; Hearing aid; Masking threshold; Segmental signal to
noise ratio
ID SPEECH ENHANCEMENT; QUALITY; RECOGNITION
AB Most traditional objective indices do not distinguish between a real sound and a perceived sound, and therefore have limitations when evaluating the real effect of an algorithm under investigation on the auditory perception of a hearing-impaired person. Although several objective indices that reflect psychoacoustic factors, such as the perceptual evaluation of speech quality (PESQ) and composite measures, are already in use, it is helpful to develop further objective indices that take human psychoacoustic factors into account in order to evaluate the performance of hearing aid algorithms accurately. In this study, a new objective index that incorporates the spectral masking effect into the calculation of the conventional segmental signal-to-noise ratio (segSNR) is proposed. The performance of this index was evaluated by analyzing its correlation with (1) the mean opinion scores and (2) the speech recognition thresholds of 15 normal-hearing volunteers and 15 hearing-impaired patients. The correlation values of the proposed index were relatively high (0.83-0.97) across various ambient noise situations. Based on these experimental results, the proposed index has the potential to be widely used for the performance evaluation of various hearing aid algorithms prior to clinical experiments. (c) 2013 Elsevier B.V. All rights reserved.
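As an illustrative sketch only (the paper's exact weighting of the masking effect may differ), a segmental SNR can be modified so that noise components falling below an assumed per-bin masking threshold are treated as inaudible and excluded from the per-frame ratio.

```python
# Hedged sketch of a masking-aware segmental SNR.
import numpy as np

def masked_seg_snr(clean_spec, noise_spec, mask_thresh, eps=1e-12):
    """All inputs: (frames x bins) power spectra / thresholds of time-aligned signals."""
    audible = noise_spec > mask_thresh                  # keep only audible noise components
    noise_audible = np.where(audible, noise_spec, 0.0)
    frame_snr = 10.0 * np.log10(
        (clean_spec.sum(axis=1) + eps) / (noise_audible.sum(axis=1) + eps))
    frame_snr = np.clip(frame_snr, -10.0, 35.0)         # usual segSNR limiting range
    return float(frame_snr.mean())
```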
C1 [Yook, Sunhyun; Nam, Kyoung Won; Kim, Heepyung; Jang, Dong Pyo; Kim, In Young] Hanyang Univ, Dept Biomed Engn, Seoul 133791, South Korea.
[Kwon, See Youn; Hong, Sung Hwa] Samsung Med Ctr, Dept Otolaryngol Head & Neck Surg, Seoul 135710, South Korea.
[Kim, Dongwook] Samsung Adv Inst Technol, Bio & Hlth Lab, Yongin 446712, South Korea.
[Lee, Sangmin] Inha Univ, Dept Elect Engn, Inchon 402751, South Korea.
RP Kim, IY (reprint author), Hanyang Univ, Dept Biomed Engn, Seoul 133791, South Korea.
EM soonhyun@bme.hanyang.ac.kr; kwnam@bme.hanyang.ac.kr;
heepyung@bme.hanyang.ac.kr; sykwon@bme.hanyang.ac.kr;
steve7.kim@samsung.com; sanglee@inha.ac.kr; hongsh@skku.edu;
dongpjang@gmail.com; iykim@hanyang.ac.kr
FU Strategic Technology Development Program of Ministry of Knowledge
Economy, KOREA [10031764]; Seoul R&BD Program, KOREA [SS100022]
FX This work was supported by grants from the Strategic Technology
Development Program of Ministry of Knowledge Economy, KOREA (No.
10031764) and from Seoul R&BD Program, KOREA (No. SS100022). The authors
are grateful to Silvia Allegro and Phonak for allowing access to their
database.
CR Alam M. J, 2009, P IEEE INT C ICCIT, V12, P483
Amehraye A, 2009, INT J INF COMMUN ENG, V5, P2
Arehart KH, 2010, EAR HEARING, V31, P420, DOI 10.1097/AUD.0b013e3181d3d4f3
Beerends JG, 2002, J AUDIO ENG SOC, V50, P765
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Brandenburg K, 1992, INT C TEST MEAS, P11
Buchler M, 2005, EURASIP J APPL SIG P, V18, P2991
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
Garcia R. A, 1999, AUDIO ENG SOC, V107, P5073
Han H, 2011, INT J AUDIOL, V50, P59, DOI 10.3109/14992027.2010.526637
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
IEEE Subcommittee, 1969, IEEE T AUDIO ELECTRO, V3, P225, DOI DOI 10.1109/TAU.1969.1162058
ISO, 2003, 2262003 ISO
JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608
Kim J. S, 2008, KOREAN ACAD AUDIOL, V4, P126
Li JF, 2011, J ACOUST SOC AM, V129, P3291, DOI 10.1121/1.3571422
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Mourjopoulos J, 1991, AUDIO ENG SOC, V90, P3064
Nam KW, 2013, SPEECH COMMUN, V55, P544, DOI 10.1016/j.specom.2012.11.002
Quackenbush S, 1988, OBJECTIVE MEASURES S, P47
Richards D.L., 1965, Electronics Letters, V1, DOI 10.1049/el:19650037
Scalart P, 1996, P IEEE INT C AC SPEE, V2, P629
Tribolet J., 1988, P IEEE INT C AC SPEE, V3, P586
Turner CW, 2004, J ACOUST SOC AM, V115, P1729, DOI 10.1121/1.1687425
Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666
Yost W. A., 1994, FUNDAMENTALS HEARING
NR 26
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 1003
EP 1010
DI 10.1016/j.specom.2013.05.005
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500006
ER
PT J
AU Chen, F
Wong, LLN
Hu, Y
AF Chen, Fei
Wong, Lena L. N.
Hu, Yi
TI A Hilbert-fine-structure-derived physical metric for predicting the
intelligibility of noise-distorted and noise-suppressed speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility; Hilbert fine-structure signal; Speech
transmission index
ID HEARING-IMPAIRED LISTENERS; TEMPORAL ENVELOPE; REDUCTION ALGORITHMS;
AUDITORY-PERCEPTION; ARTICULATION INDEX; CUES; COHERENCE; QUALITY;
ABILITY; RECOGNITION
AB Despite the established importance of temporal fine structure (TFS) for speech perception in noise, existing speech transmission metrics rely primarily on envelope information to model speech intelligibility variance. This study proposes a new physical metric for predicting speech intelligibility using information obtained from the Hilbert-derived TFS waveform. It is found that by making explicit use of coherence information contained in the complex spectra of the Hilbert-derived TFS waveforms of the clean and corrupted speech signals, and assessing the extent to which the coherence in the Hilbert fine structure is affected by the linear or non-linear processing (e.g., noise distortion or speech enhancement) of the stimulus, the predictive power of the intelligibility measure can be significantly improved for noise-distorted and noise-suppressed speech signals. When evaluated with speech recognition scores obtained with normal-hearing listeners, covering a total of sixty-four noise-suppressed conditions with nonlinear distortions and eight noisy conditions without subsequent noise reduction, the proposed TFS-based measure was found to predict speech intelligibility better than most envelope- and coherence-based measures. High correlation was maintained for all types of maskers tested, with a maximum correlation of r = 0.95 achieved in car and street noise conditions. (c) 2013 Elsevier B.V. All rights reserved.
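A minimal single-band sketch of the underlying signal processing, assuming time-aligned clean and processed band signals; the full measure described in the abstract operates band-wise on complex spectra of the TFS waveforms, which is not reproduced here.

```python
# Minimal sketch: Hilbert fine structure of clean and processed band signals,
# compared by a simple correlation as a stand-in for the coherence computation.
import numpy as np
from scipy.signal import hilbert

def tfs_correlation(clean, processed):
    """clean / processed: time-aligned 1-D signals from the same frequency band."""
    tfs_c = np.cos(np.angle(hilbert(clean)))       # Hilbert-derived temporal fine structure
    tfs_p = np.cos(np.angle(hilbert(processed)))
    return float(np.corrcoef(tfs_c, tfs_p)[0, 1])  # similarity of the two TFS waveforms
```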
C1 [Chen, Fei; Wong, Lena L. N.] Univ Hong Kong, Prince Philip Dent Hosp, Div Speech & Hearing Sci, Hong Kong, Hong Kong, Peoples R China.
[Hu, Yi] Univ Wisconsin, Dept Elect Engn & Comp Sci, Milwaukee, WI 53211 USA.
RP Chen, F (reprint author), Univ Hong Kong, Prince Philip Dent Hosp, Div Speech & Hearing Sci, 34 Hosp Rd, Hong Kong, Hong Kong, Peoples R China.
EM feichen1@hku.hk; huy@uwm.edu
FU Faculty Research Fund, Faculty of Education, The University of Hong
Kong; Seed Funding for Basic Research, The University of Hong Kong;
General Research Fund (GRF)
FX This research was supported by Faculty Research Fund, Faculty of
Education, The University of Hong Kong, by Seed Funding for Basic
Research, The University of Hong Kong, and by General Research Fund
(GRF), administered by the Hong Kong Research Grants council. The
authors thank the Associate Editor, Dr. Karen Livescu, and two anonymous
reviewers for their constructive and helpful comments.
CR ANSI, 1997, S351997 ANSI
Arehart KH, 2007, J ACOUST SOC AM, V122, P1150, DOI 10.1121/1.2754061
Baher H, 2001, ANALOG DIGITAL SIGNA
Carter C, 1973, IEEE T AUDIO ELECTRO, VAU-21, P337
Chen F, 2013, J ACOUST SOC AM, V133, pEL405, DOI 10.1121/1.4800189
Chen F, 2010, J ACOUST SOC AM, V128, P3715, DOI 10.1121/1.3502473
Chen F, 2012, J ACOUST SOC AM, V131, P4104, DOI 10.1121/1.3695401
Drennan WR, 2005, EAR HEARING, V26, P461, DOI 10.1097/01.aud.0000179690.30137.21
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
FLETCHER H, 1950, J ACOUST SOC AM, V22, P89, DOI 10.1121/1.1906605
FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407
Gilbert G, 2006, J ACOUST SOC AM, V119, P2438, DOI 10.1121/1.2173522
Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628
Gomez AM, 2012, SPEECH COMMUN, V54, P503, DOI 10.1016/j.specom.2011.11.001
GREENWOOD DD, 1990, J ACOUST SOC AM, V87, P2592, DOI 10.1121/1.399052
Gustafsson H, 2001, IEEE T SPEECH AUDI P, V9, P799, DOI 10.1109/89.966083
Hirsch H, 2000, ISCA TUT RES WORKSH
Hollube I., 1996, J ACOUST SOC AM, V100, P1703
Hopkins K, 2008, J ACOUST SOC AM, V123, P1140, DOI 10.1121/1.2824018
HOUTGAST T, 1971, ACUSTICA, V25, P355
HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224
Hu Y, 2007, J ACOUST SOC AM, V122, P1777, DOI 10.1121/1.2766778
Jorgensen S, 2011, J ACOUST SOC AM, V130, P1475, DOI 10.1121/1.3621502
KATES JM, 1992, J ACOUST SOC AM, V91, P2236, DOI 10.1121/1.403657
Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575
Kong YY, 2006, J ACOUST SOC AM, V120, P2830, DOI 10.1121/1.2346009
KRYTER KD, 1962, J ACOUST SOC AM, V34, P1689, DOI 10.1121/1.1909094
KRYTER KD, 1962, J ACOUST SOC AM, V34, P1698, DOI 10.1121/1.1909096
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lorenzi C, 2006, P NATL ACAD SCI USA, V103, P18866, DOI 10.1073/pnas.0607364103
Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493
Moore BCJ, 2008, JARO-J ASSOC RES OTO, V9, P399, DOI 10.1007/s10162-008-0143-x
Myers R, 1990, CLASSICAL MODERN REG, V2nd
PAVLOVIC CV, 1987, J ACOUST SOC AM, V82, P413, DOI 10.1121/1.395442
SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303
Smith ZM, 2002, NATURE, V416, P87, DOI 10.1038/416087a
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
STEENEKEN HJM, 1982, ACUSTICA, V51, P229
STEIGER JH, 1980, PSYCHOL BULL, V87, P245, DOI 10.1037//0033-2909.87.2.245
Taal CH, 2011, IEEE T AUDIO SPEECH, V19, P2125, DOI 10.1109/TASL.2011.2114881
van Buuren RA, 1999, J ACOUST SOC AM, V105, P2903, DOI 10.1121/1.426943
Zeng FG, 2004, J ACOUST SOC AM, V116, P1351, DOI 10.1121/1.1777938
NR 43
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 1011
EP 1020
DI 10.1016/j.specom.2013.06.016
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500007
ER
PT J
AU Ozimek, E
Kocinski, J
Kutzner, D
Sek, A
Wicher, A
AF Ozimek, Edward
Kocinski, Jedrzej
Kutzner, Dariusz
Sek, Aleksander
Wicher, Andrzej
TI Speech intelligibility for different spatial configurations of target
speech and competing noise source in a horizontal and median plane
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility; Speech-in-noise test; Spatial perception;
Monaural and binaural perception
ID IMPAIRED LISTENERS; UNMASKING; HEARING; MASKING; RECOGNITION;
SEPARATION; THRESHOLD; LOCATION; RELEASE
AB Speech intelligibility for different configurations of a target signal (speech) and a masker (babble noise) in the horizontal and median planes was investigated. The sources were placed in front of, behind, or to the right-hand side of a dummy head, at different angular configurations. The speech signals were presented to listeners via headphones at different signal-to-noise ratios (SNR). Three listening modes (binaural, and monaural for the right or left ear) were tested. It was found that the binaural mode gave the lowest, i.e. 'best', speech reception threshold (SRT) values compared to the other modes, except for the cases when the target and masker were in the same position. For the monaural modes, SRTs were generally worse than those for the binaural mode. The new data gathered for the median plane revealed that a change in elevation of the speech source had a small, but statistically significant, influence on speech intelligibility: when speech elevation was increased, speech intelligibility decreased. (c) 2013 Elsevier B.V. All rights reserved.
C1 [Ozimek, Edward; Kocinski, Jedrzej; Kutzner, Dariusz; Sek, Aleksander; Wicher, Andrzej] Adam Mickiewicz Univ, Fac Phys, Inst Acoust, PL-61614 Poznan, Poland.
RP Ozimek, E (reprint author), Adam Mickiewicz Univ, Inst Acoust, Ul Umultowska 85, PL-61614 Poznan, Poland.
EM ozimaku@amu.edu.pl
FU European Union [004171]; Polish-Norwegian Research Fund, State Ministry
of Science and Higher Education [N N518 502139]; National Science Centre
[UMO-2011/03/B/HS6/03709]
FX This work was supported by Grants from: the European Union FP6: Project
004171 HEARCOM, the Polish-Norwegian Research Fund, State Ministry of
Science and Higher Education: Project Number N N518 502139 and National
Science Centre: Project Number UMO-2011/03/B/HS6/03709.
CR Allen K, 2008, J ACOUST SOC AM, V123, P1562, DOI 10.1121/1.2831774
Bosman AJ, 1995, AUDIOLOGY, V34, P260
Brand T, 2002, J ACOUST SOC AM, V111, P2801, DOI 10.1121/1.1479152
BRONKHORST AW, 1988, J ACOUST SOC AM, V83, P1508, DOI 10.1121/1.395906
Brungart DS, 2012, J ACOUST SOC AM, V132, P2545, DOI 10.1121/1.4747005
Brungart DS, 2002, J ACOUST SOC AM, V112, P664, DOI 10.1121/1.1490592
COX RM, 1991, J SPEECH HEAR RES, V34, P904
Drullman R, 2000, J ACOUST SOC AM, V107, P2224, DOI 10.1121/1.428503
Edmonds BA, 2006, J ACOUST SOC AM, V120, P1539, DOI 10.1121/1.2228573
Freyman RL, 1999, J ACOUST SOC AM, V106, P3578, DOI 10.1121/1.428211
Freyman RL, 2001, J ACOUST SOC AM, V109, P2112, DOI 10.1121/1.1354984
Garadat S. N., 2006, J ACOUST SOC AM, V121, P1047
Hawley ML, 2004, J ACOUST SOC AM, V115, P833, DOI 10.1121/1.1639908
Kocinski J., 2005, Archives of Acoustics, V30
Kollmeier B., 1997, J ACOUST SOC AM, V102, P1085
Lin WY, 2003, J NEUROSCI, V23, P8143
Litovsky RY, 2005, J ACOUST SOC AM, V117, P3091, DOI 10.1121/1.1873913
Muller C, 1992, PERZEPTIVE ANAL WEIT
Ozimek E., 2009, INT J AUDIOL, V48, P440
Peissig J, 1997, J ACOUST SOC AM, V101, P1660, DOI 10.1121/1.418150
Shinn-Cunningham BG, 2001, J ACOUST SOC AM, V110, P1118, DOI 10.1121/1.1386633
Versfeld NJ, 2000, J ACOUST SOC AM, V107, P1671, DOI 10.1121/1.428451
Yost W. A., 1993, HUMAN PSYCHOPHYSICS
NR 23
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 1021
EP 1032
DI 10.1016/j.specom.2013.06.009
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500008
ER
PT J
AU Cerva, P
Silovsky, J
Zdansky, J
Nouza, J
Seps, L
AF Cerva, Petr
Silovsky, Jan
Zdansky, Jindrich
Nouza, Jan
Seps, Ladislav
TI Speaker-adaptive speech recognition using speaker diarization for
improved transcription of large spoken archives
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker adaptive; Automatic speech recognition; Speaker adaptation;
Speaker diarization; Automatic transcription; Large spoken archives
ID SYSTEM; NORMALIZATION; ADAPTATION; ACCESS; WORD
AB This paper deals with speaker-adaptive speech recognition for large spoken archives. The goal is to improve the recognition accuracy of an automatic speech recognition (ASR) system that is being deployed for transcription of a large archive of Czech radio. This archive represents a significant part of Czech cultural heritage, as it contains recordings covering 90 years of broadcasting. A large portion of these documents (100,000 h) is to be transcribed and made public for browsing. To improve the transcription results, an efficient speaker-adaptive scheme is proposed. The scheme is based on integration of speaker diarization and adaptation methods and is designed to achieve a low Real-Time Factor (RTF) of the entire adaptation process, because the archive's size is enormous. It thus employs just two decoding passes, where the first one is carried out using the lexicon with a reduced number of items. Moreover, the transcripts from the first pass serve not only for adaptation, but also as the input to the speaker diarization module, which employs two-stage clustering. The output of diarization is then utilized for a cluster-based unsupervised Speaker Adaptation (SA) approach that also utilizes information based on the gender of each individual speaker. Presented experimental results on various types of programs show that our adaptation scheme yields a significant Word Error Rate (WER) reduction from 22.24% to 18.85% over the Speaker Independent (SI) system while operating at a reasonable RTF. (c) 2013 Elsevier B.V. All rights reserved.
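A high-level pseudo-pipeline of the two-pass scheme described in the abstract; all helper objects and their methods (decode, cluster, adapt) are hypothetical stand-ins, not an existing toolkit API.

```python
# Hedged sketch of the two-pass speaker-adaptive transcription flow.
def transcribe_recording(audio, asr, diarizer, adapt):
    """asr, diarizer and adapt are assumed components; method names are hypothetical."""
    hyp1 = asr.decode(audio, lexicon="reduced")        # first pass with a reduced lexicon
    clusters = diarizer.cluster(audio, hyp1)           # two-stage speaker clustering on pass-1 output
    transcripts = []
    for cluster in clusters:                           # one adapted model per speaker cluster
        model = adapt(asr.acoustic_model, cluster.segments, gender=cluster.gender)
        transcripts.append(asr.decode(cluster.audio, model=model, lexicon="full"))
    return transcripts
```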
C1 [Cerva, Petr; Silovsky, Jan; Zdansky, Jindrich; Nouza, Jan; Seps, Ladislav] Tech Univ Liberec, Inst Informat Technol & Elect, Liberec 46117, Czech Republic.
RP Cerva, P (reprint author), Tech Univ Liberec, Inst Informat Technol & Elect, Studentska 2, Liberec 46117, Czech Republic.
EM petr.cerva@tul.cz; jan.silovsky@tul.cz; jindrich.zdansky@tul.cz;
jan.nouza@tul.cz; ladislav.seps@tul.cz
RI Nouza, Jan/E-9914-2011
FU Czech Science Foundation [P103/11/P499]; Student Grant Scheme (SGS) at
the Technical University of Liberec
FX This work was supported by the Czech Science Foundation (project no.
P103/11/P499) and by the Student Grant Scheme (SGS) at the Technical
University of Liberec.
CR Acero A, 1991, ACOUSTICS SPEECH SIG, V2, P893
Anastasakos T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1137
Anguera X, 2006, INT 06 PITTSB PA US
Breslin C, 2011, INTERSPEECH, P1085
Byrne W, 2004, IEEE T SPEECH AUDI P, V12, P420, DOI 10.1109/TSA.2004.828702
Cerva P, 2011, 12TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2011 (INTERSPEECH 2011), VOLS 1-5, P2576
Chen S. S., 1998, P DARPA BROADC NEWS, P127
Chu SM, 2008, INT CONF ACOUST SPEE, P4329, DOI 10.1109/ICASSP.2008.4518613
Chu SM, 2010, INT CONF ACOUST SPEE, P4374, DOI 10.1109/ICASSP.2010.5495639
Dehak N, 2011, IEEE T AUDIO SPEECH, V19, P788, DOI 10.1109/TASL.2010.2064307
Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347
Franco-Pedroso J, 2010, FALA 2010, V2010, P415
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Glass J., 2007, P INTERSPEECH, P2553
Hain T, 2012, IEEE T AUDIO SPEECH, V20, P486, DOI 10.1109/TASL.2011.2163395
Hansen JH, 2005, IEEE T SPEECH AUDI P, V13, P712, DOI 10.1109/TSA.2005.852088
Kenny P, 2005, IEEE T SPEECH AUDI P, V13, P345, DOI 10.1109/TSA.2004.840940
Kim D. Y., 2004, P 2004 INT C SPOK LA
KNESER R, 1995, INT CONF ACOUST SPEE, P181, DOI 10.1109/ICASSP.1995.479394
Lee L, 1996, INT CONF ACOUST SPEE, P353
Liu D, 2005, INTERPEECH 2005 LISB, P281
luc Gauvain J., 1998, ICSLP 98, P1335
Mangu L, 2000, COMPUT SPEECH LANG, V14, P373, DOI 10.1006/csla.2000.0152
Matsoukas S, 2006, IEEE T AUDIO SPEECH, V14, P1541, DOI 10.1109/TASL.2006.878257
Moattar MH, 2012, SPEECH COMMUN, V54, P1065, DOI 10.1016/j.specom.2012.05.002
Ng T, 2012, P INT 2012 INT SPEEC, P1967
NIST, 2009, 2009 RT 09 RICH TRAN
Nouza J, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1650
Nouza J, 2010, IEEE MEDITERR ELECT, P202, DOI 10.1109/MELCON.2010.5476306
Nouza J, 2012, COMM COM INF SC, V247, P27
Ordelman R, 2006, P 17 EUR C ART INT E
Povey D, 2011, ASRU, P158
Shinoda K, 2005, ELECTRON COMM JPN 3, V88, P25, DOI 10.1002/ecjc.20207
Silovsky J, 2012, P INT 2012 INT SPEEC, P478
Silovsky J, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4193
Stolcke A, 2008, LECT NOTES COMPUT SC, V4625, P450
Uebel L. F, 1999, 6 EUR C SPEECH COMM
Welling L, 2002, IEEE T SPEECH AUDI P, V10, P415, DOI 10.1109/TSA.2002.803435
Zdansky J, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2182
NR 41
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 1033
EP 1046
DI 10.1016/j.specom.2013.06.017
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500009
ER
PT J
AU Tiemounou, S
Jeannes, RL
Barriac, V
AF Tiemounou, Sibiri
Jeannes, Regine Le Bouquin
Barriac, Vincent
TI On the identification of relevant degradation indicators in super
wideband listening quality assessment models
SO SPEECH COMMUNICATION
LA English
DT Article
DE Perceptual dimensions; Super wideband; Voice quality assessment;
Objective models; Diagnostic
ID PERCEPTUAL EVALUATION; ITU STANDARD; SPEECH; PESQ
AB Recently, new objective speech quality evaluation methods, designed and adapted to new high voice quality contexts, have been developed. One interest of these methods is that they integrate voice quality perceptual dimensions reflecting the effects of frequency response distortions, discontinuities, noise and/or speech level deviations. This also makes it possible to use these methods to provide diagnostic information about specific aspects of the quality of transmission systems, as perceived by end-users. In this paper, we present and analyze in depth two of these approaches, namely POLQA (Perceived Objective Listening Quality Assessment) and DIAL (Diagnostic Instrumental Assessment of Listening quality), in terms of the quality degradation indicators related to the perceptual dimensions these models could embed. The main goal of our work is to find and propose the most robust quality degradation indicators to reliably characterize the impact of degradations relative to the perceptual dimensions described above and to identify the underlying technical causes in super wideband telephone communications ([50, 14000] Hz). To do so, the first step of our study was to identify in both models the correspondence between perceptual dimensions and quality degradation indicators. Such indicators could either be present in the model itself or derived from our own investigation of the model. In a second step, we analyzed the performance and robustness of the identified quality degradation indicators on speech samples impaired by only one degradation (representative of one perceptual dimension) at a time. This study highlighted the reliability of some of the quality degradation indicators embedded in the two models under study and represents a first step in evaluating the ability of these indicators to quantify the degradation for which they were designed. (c) 2013 Elsevier B.V. All rights reserved.
C1 [Tiemounou, Sibiri; Barriac, Vincent] Orange Labs Lannion, F-22307 Lannion, France.
[Tiemounou, Sibiri; Jeannes, Regine Le Bouquin] INSERM, U1099, F-35000 Rennes, France.
[Tiemounou, Sibiri; Jeannes, Regine Le Bouquin] Univ Rennes 1, LTSI, F-35000 Rennes, France.
RP Tiemounou, S (reprint author), Orange Labs Lannion, 2 Av Pierre Marzin, F-22307 Lannion, France.
EM sibiri.tiemounou@orange.com; regine.le-bouquin-jeannes@univ-rennes1.fr;
vincent.barriac@orange.com
FU Rohde & Schwarz -SwissQual; Opticom
FX The authors wish to thank the following companies for granting us
permission to use their SWB subjectively scored databases and publish
results based on them: Rohde & Schwarz - SwissQual, TNO, Deutsche
Telekom, Netscout-Psytechnics. The use and analysis of the new
POLQA/P.863 standard model have been made possible thanks to the
support of Rohde & Schwarz - SwissQual and Opticom.
CR [Anonymous], 2011, P 863 PERC OBJ LIST, P863
[Anonymous], 1996, METH SUBJ DET TRANSM, P800
[Anonymous], 2001, PERC EV SPEECH QUAL, P862
Beerends J. G, 2004, JOINT C GERM FRENCH
Beerends JG, 2007, J AUDIO ENG SOC, V55, P1059
Beerends JG, 2002, J AUDIO ENG SOC, V50, P765
BEERENDS JG, 1994, J AUDIO ENG SOC, V42, P115
Cote N., 2011, INTEGRAL DIAGNOSTIC, P133
Cote N, 2006, P 2 ISCA DEGA TUT RE, P115
Danno V, 2013, ADAPTIVE MULTIRATE W
FU Q, 2000, ACOUST SPEECH SIG PR, P1511
Huo L., 2008, P 155 M AC SOC AM 5, P1
Huo L, 2008, P 155 M AC SOC AM 5
Huo L., 2007, IEEE WORKSH APPL SIG
ITU-T Rec, 2005, WID EXT REC P 862 AS
ITU-T Rec, 1988, SPEC INT REF SYST, P48
Leman A., 2010, P 38 INT C AUD ENG S
MCDERMOT.BJ, 1969, J ACOUST SOC AM, V45, P774, DOI 10.1121/1.1911465
Raake A, 2006, SPEECH QUALITY VOIP, P74
Rix AW, 2002, J AUDIO ENG SOC, V50, P755
Scholz K, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1523
Sen D, 2008, J AUDIO ENG SOC, V131, P4087
Sen D, 2012, INT TELECOMMUNICATIO
Waltermann M, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2170
Waltermann M, 2008, VOIC COMM ITG C DE A, P1
Zielinski S, 2008, J AUDIO ENG SOC, V56, P427
NR 26
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 1047
EP 1063
DI 10.1016/j.specom.2013.06.010
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500010
ER
PT J
AU Short, G
Hirose, K
Minematsu, N
AF Short, Greg
Hirose, Keikichi
Minematsu, Nobuaki
TI Japanese lexical accent recognition for a CALL system by deriving
classification equations with perceptual experiments
SO SPEECH COMMUNICATION
LA English
DT Article
DE Japanese pitch accent; Automatic recognition; Perception; Fundamental
frequency; Resynthesis; Stimulus continua
AB For non-native learners of Japanese, the pitch accent can be cumbersome to acquire without proper instruction. A Computer Assisted Language Learning (CALL) system could aid these learners in this acquisition provided that it can generate helpful feedback based on automatic analysis of the learner's utterance. For this, it is necessary to consider that the characteristics of a given learner's Japanese production will be largely influenced by his or her native tongue. For example, non-natives may produce pitch contours that natives do not produce. A standard approach to carry out recognition for error detection is to use a machine learning algorithm operating on an array of features. However, a method motivated by perceptual analysis may be better for a CALL system. With such a method, it should be possible to better understand the human recognition process and the causal relationships between contour and perception, which could be useful for feedback. Also, since accent recognition is a perceptual process, it may be possible to improve automatic recognition for non-native speech with such a method. Thus, we carry out listening tests using resynthesized speech to construct a method. First, we inspect which variables the probability of a pitch level transition depends on, and from this inspection derive equations to calculate the probability at the disyllable level. Then, to recognize the word-level pattern, the location of each transition is determined from the probabilities for each two-syllable pair. This method makes it possible to recognize all pitch patterns and to give more in-depth feedback. We conduct recognition experiments using these functions and achieve results that are comparable to the inter-labeler agreement rate and outperform SVM-based methods for non-native speech. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Short, Greg; Hirose, Keikichi] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan.
[Minematsu, Nobuaki] Univ Tokyo, Grad Sch Engn, Tokyo, Japan.
RP Short, G (reprint author), Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo, Japan.
EM shortg29@gmail.com
CR Boersma Paul, 2012, PRAAT DOING PHONETIC
Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199
Fujisaki H., 1968, 6TH INT C AC, P95
Fujisaki Hiroya, 1977, NIHONGO 5 ONIN, P63
Hirano-Cook E., 2011, THESIS
Ishi C, 2001, I ELECT INFORM COMMU, V101, P65
Ishi C. T, 2003, SPEECH COMMUN, V101, P23
Kato S, 2012, P SPEECH PROS, V2, P198
Kawai G, 1999, P EUR 99, V1, P177
Kumagai Y, 1999, IPSJ SIG NOTES, V99, P22
Lee A., 2001, P EUR C SPEECH COMM, P1691
Liberman A, 1957, J EXP PSYCHOL, P358
Masuda-Katsuse I., 2006, Acoustical Science and Technology, V27, DOI 10.1250/ast.27.97
Mattingly I, 1971, COGNITIVE PSYCHOL, P131
Minematsu N, 1996, TECHNICAL REPORT IEI, P69
Nakagawa C., 2002, TOKYOGO AKUSENTO SHU, P73
Neri A, 2002, P CALL C, P179
NHK Broadcasting Culture Research Institute, 2005, JAP ACC DICT
Oyakawa T, 1971, 19712 U CAL BERK LIN, V1971
Short G, 2011, SP2010128 IEICE, P79
Short G, 2012, NONNATIVE CORPUS
Shport I., 2008, ACQUISITION JAPANESE, P165
Sinha A., 2009, INT J PSYCHOL COUNSE, V1, P117
Takahashi S, 1999, INT WORKSH EMBR IMPL, P23
Valbret H., 1992, SPEECH COMMUN, V1, P145
Yoshimura T, 1994, P AUT M AC SOC JAP, P173
NR 26
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 1064
EP 1080
DI 10.1016/j.specom.2013.07.002
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500011
ER
PT J
AU Zhang, Y
Zhao, YX
AF Zhang, Yi
Zhao, Yunxin
TI Modulation domain blind speech separation in noisy environments
SO SPEECH COMMUNICATION
LA English
DT Article
DE Time-frequency masking; Direction of arrival; Modulation frequency;
Blind speech separation
ID INDEPENDENT COMPONENT ANALYSIS; PERMUTATION ALIGNMENT; NONSTATIONARY
SOURCES; SPECTRAL SUBTRACTION; MIXTURE-MODELS; FREQUENCY; LOCALIZATION;
SIGNALS; REAL
AB We propose a noise robust blind speech separation (BSS) method by using two microphones. We perform BSS in the modulation domain to take advantage of the improved signal sparsity and reduced musical tone noise in this domain over the conventional acoustic frequency domain processing. We first use modulation domain real and imaginary spectral subtraction (MRISS) to enhance both magnitude and phase spectra of the noisy speech mixture inputs. We then estimate the direction of arrivals (DOAs) of the speech sources from subband inter-sensor phase differences (IPDs) by using an asymmetric Laplacian mixture model (ALMM), cluster the full-band IPDs via the estimated DOAs, and perform time frequency masking to separate the source signals, all in the modulation domain. Experimental evaluations in five types of noises have shown that the performance of the proposed method is robust in 0-10 dB SNRs and it is superior to acoustic domain separation without MRISS. (C) 2013 Elsevier B.V. All rights reserved.
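The separation step summarized above assigns time-frequency bins to sources by comparing observed inter-sensor phase differences (IPDs) with those predicted by candidate directions of arrival. Below is a minimal Python sketch of that masking step only, assuming precomputed STFTs and two known candidate DOAs; the paper's modulation-domain processing, MRISS enhancement and asymmetric Laplacian mixture modelling are not reproduced, and all names and parameter values are illustrative.

    import numpy as np

    def ipd_masks(X1, X2, freqs, doas_rad, mic_dist=0.04, c=343.0):
        """Assign each time-frequency bin to the closer of the candidate DOAs
        via inter-sensor phase differences and return one binary mask per source.
        X1, X2: complex STFTs (n_freq x n_frames); freqs: bin frequencies in Hz.
        Illustrative sketch of phase-difference-based masking only."""
        ipd = np.angle(X2 * np.conj(X1))                   # observed IPD per bin
        errs = []
        for theta in doas_rad:
            tau = mic_dist * np.sin(theta) / c             # inter-mic delay for this DOA
            expected = 2.0 * np.pi * freqs[:, None] * tau  # predicted IPD per frequency
            # wrap the phase error to (-pi, pi] before comparing magnitudes
            errs.append(np.abs(np.angle(np.exp(1j * (ipd - expected)))))
        winner = np.argmin(np.stack(errs), axis=0)         # closest DOA per bin
        return [(winner == i).astype(float) for i in range(len(doas_rad))]

    # usage: S1_hat = masks[0] * X1, then inverse STFT to resynthesize source 1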
C1 [Zhang, Yi; Zhao, Yunxin] Univ Missouri, Dept Comp Sci, Columbia, MO 65211 USA.
RP Zhao, YX (reprint author), Univ Missouri, Dept Comp Sci, Columbia, MO 65211 USA.
EM yzcb3@mail.missouri.edu; ZhaoY@missouri.edu
CR Aichner R, 2006, SIGNAL PROCESS, V86, P1260, DOI 10.1016/j.sigpro.2005.06.022
Amari S, 1997, FIRST IEEE SIGNAL PROCESSING WORKSHOP ON SIGNAL PROCESSING ADVANCES IN WIRELESS COMMUNICATIONS, P101, DOI 10.1109/SPAWC.1997.630083
Araki S, 2003, EURASIP J APPL SIG P, V2003, P1157, DOI 10.1155/S1110865703305074
Araki S., 2003, IEEE INT C AC SPEECH, P509
Araki S, 2006, INT CONF ACOUST SPEE, P33
ASANO F, 2001, ACOUST SPEECH SIG PR, P2729
Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Choi S, 2000, ELECTRON LETT, V36, P848, DOI 10.1049/el:20000623
COMON P, 1994, SIGNAL PROCESS, V36, P287, DOI 10.1016/0165-1684(94)90029-9
Douglas SC, 2003, SPEECH COMMUN, V39, P65, DOI 10.1016/S0167-6393(02)00059-6
Ellis DPW, 2006, INT CONF ACOUST SPEE, P957
Falk T. H., 2007, P ISCA C INT SPEECH, P970
Hazewinkel M., 2001, ENCY MATH
Hu R, 2008, EURASIP, V4, DOI 10.1155/2008/349214
HUANG J, 1995, IEEE T INSTRUM MEAS, V44, P733
Hurley N, 2009, IEEE T INFORM THEORY, V55, P4723, DOI 10.1109/TIT.2009.2027527
Ichir MM, 2006, IEEE T IMAGE PROCESS, V15, P1887, DOI 10.1109/TIP.2006.877068
IKRAM MZ, 2002, ACOUST SPEECH SIG PR, P881
Jian M., 1998, P IEEE INT S CIRC SY, V5, P293
Joho M., 2000, INDEPENDENT COMPONEN, P81
JOURJINE A, 2000, ACOUST SPEECH SIG PR, P2985
Kawamoto M, 1998, NEUROCOMPUTING, V22, P157, DOI 10.1016/S0925-2312(98)00055-1
Khademul M, 2007, IEEE T AUDIO SPEECH, V15, P893
Kinnunen T, 2008, P ISCA SPEAK LANG RE
KURITA S, 2000, ACOUST SPEECH SIG PR, P3140
Lee DD, 1999, NATURE, V401, P788
Lu X, 2010, SPEECH COMMUN, V52, P1, DOI 10.1016/j.specom.2009.08.006
Winter S, 2007, EURASIP J ADV SIG PR, DOI 10.1155/2007/24717
Mandel MI, 2010, IEEE T AUDIO SPEECH, V18, P382, DOI 10.1109/TASL.2009.2029711
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Matsuoka K., 2001, P INT C IND COMP AN, P722
Mitchell T.M., 1997, MACHINE LEARNING
Mitianoudis N, 2007, IEEE T AUDIO SPEECH, V15, P1818, DOI 10.1109/TASL.2007.899281
Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004
Paliwal K. K., 2011, P INTERSPEECH FLOR I, P1209
Papoulis A, 1991, PROBABILITY RANDOM V, V3rd
Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214
Pearlmutter BA, 2004, LECT NOTES COMPUT SC, V3195, P478
Peterson J. M., 2003, P ICASSP 2003, VVI, P581
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463
Roweis ST, 2001, ADV NEUR IN, V13, P793
RWCP, 2001, RWCP SOUND SCEN DAT
Sawada H., 2007, P IEEE WORKSH APPL S, P139
Sawada H, 2004, IEEE T SPEECH AUDI P, V12, P530, DOI 10.1109/TSA.2004.832994
Sawada H, 2011, IEEE T AUDIO SPEECH, V19, P516, DOI 10.1109/TASL.2010.2051355
Schimidt M. N, 2006, P ICSLP, V2, P2
Schobben L, 2002, IEEE T SIGNAL PROCES, V50, P1855
Thompson JK, 2003, P IEEE INT C AC SPEE, V5, P397
VIELVA L, 2002, ACOUST SPEECH SIG PR, P3049
Vincent E, 2006, IEEE T AUDIO SPEECH, V14, P1462, DOI 10.1109/TSA.2005.858005
Vu D. H. T, 2008, ITG C VOIC COMM, P1
Yamanouchi K, 2007, PROC WRLD ACAD SCI E, V19, P113
Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI 10.1109/TSP.2004.828896
Yu KM, 2005, COMMUN STAT-THEOR M, V34, P1867, DOI 10.1080/03610920500199018
ZADEH LA, 1950, P IRE, V38, P291, DOI 10.1109/JRPROC.1950.231083
Zhang Y, 2013, SPEECH COMMUN, V55, P509, DOI 10.1016/j.specom.2012.09.005
ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7
NR 59
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2013
VL 55
IS 10
BP 1081
EP 1099
DI 10.1016/j.specom.2013.06.014
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 215RH
UT WOS:000324227500012
ER
PT J
AU Milner, B
AF Milner, Ben
TI Enhancing speech at very low signal-to-noise ratios using non-acoustic
reference signals
SO SPEECH COMMUNICATION
LA English
DT Article
DE Noise estimation; Speech enhancement; Speech quality; Speech
intelligibility
ID SPECTRAL AMPLITUDE ESTIMATOR; ENHANCEMENT ALGORITHMS; RECOGNITION
AB An investigation is made into whether non-acoustic noise reference signals can be used for noise estimation, and subsequently speech enhancement, in very low signal-to-noise ratio (SNR) environments where conventional noise estimation methods may be less effective. The environment selected is Formula 1 motor racing, where SNRs frequently fall to -15 dB. Analysis reveals three primary noise sources (engine, airflow and tyre) which are found to relate to data parameters measured by the car's onboard computer, namely engine speed, road speed and throttle opening. This leads to the proposal of a two-stage noise reduction system that first uses engine speed to cancel engine noise within an adaptive filtering framework. Secondly, a maximum a posteriori (MAP) framework is developed to estimate airflow and tyre noise from the data parameters; this estimate is subsequently removed. Objective measurements comparing noise estimation with conventional methods show the proposed method to be substantially more accurate. Comparative mean opinion score listening tests found that the proposed method achieves a subjective quality score of +1.43 compared to +0.66 for a conventional method. In subjective intelligibility tests, 81.8% of words were recognised correctly using the proposed method in comparison to 76.7% with no noise compensation and 66.0% for the conventional method. (C) 2013 Published by Elsevier B.V.
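A minimal sketch of the first, adaptive-filtering stage described above, assuming a noise reference waveform has already been synthesised from the engine-speed parameter; the function name, step size and tap count are illustrative and not taken from the paper.

    import numpy as np

    def lms_cancel(noisy, reference, n_taps=64, mu=1e-3):
        """Normalised-LMS noise canceller: adapts an FIR filter so the filtered
        reference best predicts the correlated noise in `noisy`; the residual
        e[n] is the enhanced signal.  Illustrative sketch only."""
        noisy = np.asarray(noisy, dtype=float)
        reference = np.asarray(reference, dtype=float)
        w = np.zeros(n_taps)
        out = np.zeros_like(noisy)
        for n in range(n_taps, len(noisy)):
            x = reference[n - n_taps:n][::-1]            # most recent reference samples
            y = np.dot(w, x)                             # estimate of correlated noise
            e = noisy[n] - y                             # error = speech + residual noise
            w += (mu / (np.dot(x, x) + 1e-8)) * e * x    # NLMS weight update
            out[n] = e
        return out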
C1 Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England.
RP Milner, B (reprint author), Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England.
EM b.milner@uea.ac.uk
CR Aboulnasr T, 1997, IEEE T SIGNAL PROCES, V45, P631, DOI 10.1109/78.558478
Afify M, 2009, IEEE T AUDIO SPEECH, V17, P1325, DOI 10.1109/TASL.2009.2018017
Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Fransen J., 1994, CUEDFINFENGTR192
Gillick L., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266481
Hillier V, 2004, FUNDAMENTALS MOTOR V
HIRSCH HG, 1995, INT CONF ACOUST SPEE, P153, DOI 10.1109/ICASSP.1995.479387
Hu Y, 2006, INT CONF ACOUST SPEE, P153
ITU-T, 1996, P 800 METH SUB DET T
ITU-T, 2003, P 835 SUB TEST METH
KWONG RH, 1992, IEEE T SIGNAL PROCES, V40, P1633, DOI 10.1109/78.143435
Linde Y., 1980, IEEE T COMMUN, V28, P94
Loizou PC, 2011, IEEE T AUDIO SPEECH, V19, P47, DOI 10.1109/TASL.2010.2045180
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Milner B., 2011, INTERSPEECH
Pearce D., 2000, ICSLP, V4, P29
Puder H., 2003, EUROSPEECH, P1397
Puder H., 2000, EUSIPCO, P1851
Rangachari S., 2006, SPEECH COMMUN, V48, P22
Taghia J., 2011, ICASSP
Therrien C., 1992, DISCRETE RANDOM SIGN
TUCKER R, 1992, IEE PROC-I, V139, P377
Vaseghi S., 2000, ICASSP, P213
Vaseghi S. V., 2006, ADV DIGITAL SIGNAL P
Widrow B, 1985, ADAPTIVE SIGNAL PROC
NR 28
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2013
VL 55
IS 9
BP 879
EP 892
DI 10.1016/j.specom.2013.04.004
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 189US
UT WOS:000322294400001
ER
PT J
AU Zhang, XR
Demuynck, K
Van Hamme, H
AF Zhang, Xueru
Demuynck, Kris
Van Hamme, Hugo
TI Rapid speaker adaptation in latent speaker space with non-negative
matrix factorization
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker adaptation; NMF; SAT; fMLLR; Eigenvoice
ID MAXIMUM-LIKELIHOOD
AB A novel speaker adaptation algorithm based on Gaussian mixture weight adaptation is described. A small number of latent speaker vectors are estimated with non-negative matrix factorization (NMF). These latent vectors encode the distinctive systematic patterns of Gaussian usage observed when modeling the individual speakers that make up the training data. Expressing the speaker dependent Gaussian mixture weights as a linear combination of a small number of latent vectors reduces the number of parameters that must be estimated from the enrollment data. The resulting fast adaptation algorithm, using 3 s of enrollment data only, achieves similar performance as fMLLR adapting on 100+ s of data. In order to learn richer Gaussian usage patterns from the training data, the NMF-based weight adaptation is combined with vocal tract length normalization (VTLN) and speaker adaptive training (SAT), or with a simple Gaussian exponentiation scheme that lowers the dynamic range of the Gaussian likelihoods. Evaluation on the Wall Street Journal tasks shows a 5% relative word error rate (WER) reduction over the speaker independent recognition system which already incorporates VTLN. The WER can be lowered further by combining weight adaptation with Gaussian mean adaptation by means of eigenvoice speaker adaptation. (C) 2013 Elsevier B.V. All rights reserved.
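The factorisation idea summarized above, expressing speaker-dependent Gaussian mixture weights as combinations of a few non-negative latent speaker vectors, can be sketched with generic Lee-and-Seung-style multiplicative NMF updates. This is an illustrative sketch, not the authors' training recipe, and all names are hypothetical.

    import numpy as np

    def nmf(V, r, n_iter=200, eps=1e-10):
        """Factorise a non-negative matrix V (n_weights x n_speakers) as W @ H,
        with r latent speaker vectors in the columns of W.  Multiplicative
        updates minimising squared error; illustrative sketch only."""
        n, m = V.shape
        rng = np.random.default_rng(0)
        W = rng.random((n, r)) + eps
        H = rng.random((r, m)) + eps
        for _ in range(n_iter):
            H *= (W.T @ V) / (W.T @ W @ H + eps)
            W *= (V @ H.T) / (W @ H @ H.T + eps)
        return W, H

    # Adaptation idea: keep W fixed, estimate a new speaker's column h from
    # enrollment statistics, and use W @ h as that speaker's mixture weights
    # (after renormalising each state's weights to sum to one).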
C1 [Zhang, Xueru; Demuynck, Kris; Van Hamme, Hugo] Katholieke Univ Leuven, Dept Elect Engn ESAT, B-3001 Louvain, Belgium.
RP Zhang, XR (reprint author), Katholieke Univ Leuven, Dept Elect Engn ESAT, Kasteelpk Arenberg 10,Bus 2441, B-3001 Louvain, Belgium.
EM Xueru.Zhang@esat.kuleuven.be; Kris.Demuynck@esat.kuleuven.be;
Hugo.Vanhamme@esat.kuleuven.be
RI Van hamme, Hugo/D-6581-2012
FU Dutch-Flemish IMPact program (NWO-iMinds)
FX This work is funded by the Dutch-Flemish IMPact program (NWO-iMinds).
CR Anastasakos T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1137
Anastasakos T, 1997, INT CONF ACOUST SPEE, P1043, DOI 10.1109/ICASSP.1997.596119
Baum L. E., 1972, INEQUALITIES, V3, P1
CHEN KT, 2000, P ICSLP, V3, P742
Chen S. F., 1998, TECHNICAL REPORT
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
Demuynck K., 2001, THESIS KATHOLIEKE U
Duchateau J., 2006, P ITRW SPEECH REC IN, P59
Duchateau J, 2008, INT CONF ACOUST SPEE, P4269, DOI 10.1109/ICASSP.2008.4518598
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Gaussier E., 2005, P 28 ANN INT ACM SIG, P601, DOI DOI 10.1145/1076034.1076148
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Hofmann T, 1999, SIGIR'99: PROCEEDINGS OF 22ND INTERNATIONAL CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, P50
Hofmann T., 1999, P 15 C UNC AI
Huang X., 2001, SPOKEN LANGUAGE PROC
Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308
Lee DD, 2001, ADV NEUR IN, V13, P556
Lee DD, 1999, NATURE, V401, P788
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Leggetter C.J., 1994, CUEDFINFENGTR181
Nguyens P., 1998, THESIS I EURECOM
Smaragdis P., 2003, IEEE WORKSH APPL SIG, P177
Smaragdis P., 2006, P ADV MOD AC PROC WO
Tuerk C., 1993, P EUR, P351
Van hamme H., 2008, P ISCA TUT RES WORKS
Virtanen T., 2007, IEEE T AUDIO SPEECH, V15, P291
Woodland P.C., 2001, ITRW ADAPTATION METH, P11
Xu W, 2003, P 26 ANN INT ACM SIG, P267, DOI DOI 10.1145/860435.860485
Zavaliagkos G, 1996, INT CONF ACOUST SPEE, P725, DOI 10.1109/ICASSP.1996.543223
Zhang XR, 2011, INT CONF ACOUST SPEE, P4456
Zhang XR, 2012, 2012 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), P4349
NR 31
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2013
VL 55
IS 9
BP 893
EP 908
DI 10.1016/j.specom.2013.05.001
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 189US
UT WOS:000322294400002
ER
PT J
AU Rasilo, H
Rasanen, O
Laine, UK
AF Rasilo, Heikki
Rasanen, Okko
Laine, Unto K.
TI Feedback and imitation by a caregiver guides a virtual infant to learn
native phonemes and the skill of speech inversion
SO SPEECH COMMUNICATION
LA English
DT Article
DE Language acquisition; Speech inversion; Articulatory modeling;
Imitation; Phonetic learning; Caregiver feedback
ID VOCAL-TRACT; MODEL; ACQUISITION; PERCEPTION; VOWEL; SHAPES;
DISCRIMINATION; CATEGORIES; MOVEMENTS; SOUNDS
AB Despite large-scale research, development of robust machines for imitation and inversion of human speech into articulatory movements has remained an unsolved problem. We propose a set of principles that can partially explain real infants' speech acquisition processes and the emergence of imitation skills and demonstrate a simulation where a learning virtual infant (LeVI) learns to invert and imitate a virtual caregiver's speech. Based on recent findings in infants' language acquisition, LeVI learns the phonemes of his native language in a babbling phase using only caregiver's feedback as guidance and to map acoustically differing caregiver's speech into its own articulation in a phase where LeVI is imitated by the caregiver with similar, but not exact, utterances. After the learning stage, LeVI is able to recognize vowels from the virtual caregiver's VCVC utterances perfectly and all 25 Finnish phonemes with an average accuracy of 88.42%. The place of articulation of consonants is recognized with an accuracy of 96.81%. LeVI is also able to imitate the caregiver's speech since the recognition occurs directly in the domain of articulatory programs for phonemes. The learned imitation ability (speech inversion) is strongly language dependent since it is based on the phonemic programs learned from the caregiver. The findings suggest that caregivers' feedback can act as an important signal in guiding infants' articulatory learning, and that the speech inversion problem can be effectively approached from the perspective of early speech acquisition. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Rasilo, Heikki; Rasanen, Okko; Laine, Unto K.] Aalto Univ, Sch Elect Engn, Dept Signal Proc & Acoust, FI-00076 Aalto, Finland.
[Rasilo, Heikki] Vrije Univ Brussel, Artificial Intelligence Lab, B-1050 Brussels, Belgium.
RP Rasilo, H (reprint author), Aalto Univ, Sch Elect Engn, Dept Signal Proc & Acoust, POB 13000, FI-00076 Aalto, Finland.
EM heikki.rasilo@aalto.fi; okko.rasanen@aalto.fi; unto.laine@aalto.fi
FU graduate school of Electronics, Telecommunications and Automation (ETA);
Finnish Foundation for Technology Promotion (TES); KAUTE foundation;
Nokia foundation
FX This study was supported by the graduate school of Electronics,
Telecommunications and Automation (ETA), Finnish Foundation for
Technology Promotion (TES), KAUTE foundation and the Nokia foundation.
The authors would like to thank Bart de Boer for his valuable comments
on the manuscript.
CR Ananthakrishnan G., 2011, P INTERSPEECH 2011, P765
ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848
Beaumont S. L., 1993, 1 LANGUAGE, V13, P235, DOI 10.1177/014272379301303805
Bickley C.A., 1989, THESIS MIT
Bresch E, 2006, J ACOUST SOC AM, V120, P1791, DOI 10.1121/1.2335423
D'Ausillo A, 2009, CURR BIOL, V19, P381, DOI 10.1016/j.cub.2009.01.017
DAVIS BL, 1995, J SPEECH HEAR RES, V38, P1199
EIMAS PD, 1971, SCIENCE, V171, P303, DOI 10.1126/science.171.3968.303
ELBERS L, 1982, COGNITION, V12, P45, DOI 10.1016/0010-0277(82)90029-4
FLANAGAN JL, 1980, J ACOUST SOC AM, V68, P780, DOI 10.1121/1.384817
FLASH T, 1985, J NEUROSCI, V5, P1688
Goldstein MH, 2008, PSYCHOL SCI, V19, P515, DOI 10.1111/j.1467-9280.2008.02117.x
Goldstein MH, 2003, P NATL ACAD SCI USA, V100, P8030, DOI 10.1073/pnas.1332441100
Goodluck H, 1991, LANGUAGE ACQUISITION
Gros-Louis J, 2006, INT J BEHAV DEV, V30, P509, DOI 10.1177/0165025406071914
Guenther FH, 2006, J COMMUN DISORD, V39, P350, DOI 10.1016/j.jcomdis.2006.06.013
Guenther FH, 2006, BRAIN LANG, V96, P280, DOI 10.1016/j.bandl.2005.06.001
GUENTHER FH, 1995, PSYCHOL REV, V102, P594
Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636
Hornstein J., 2007, IEEE RSJ INT C INT R, P3442
Hornstein J., 2007, S LANG ROB AV PORT
Hornstein J., 2008, IROS 2008 WORKSH FRO
Houston DM, 2000, J EXP PSYCHOL HUMAN, V26, P1570, DOI 10.1037//0096-1523.26.5.1570
Howard IS, 2011, MOTOR CONTROL, V15, P85
HUANG XD, 1992, IEEE T SIGNAL PROCES, V40, P1062, DOI 10.1109/78.134469
Ishihara H, 2009, IEEE T AUTON MENT DE, V1, P217, DOI 10.1109/TAMD.2009.2038988
Jones SS, 2007, PSYCHOL SCI, V18, P593, DOI 10.1111/j.1467-9280.2007.01945.x
KENT RD, 1982, J ACOUST SOC AM, V72, P353, DOI 10.1121/1.388089
Kokkinaki T, 2000, J REPROD INFANT PSYC, V18, P173
KUHL PK, 1991, PERCEPT PSYCHOPHYS, V50, P93, DOI 10.3758/BF03212211
Kuhl PK, 1996, J ACOUST SOC AM, V100, P2425, DOI 10.1121/1.417951
LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6
MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131
Markey K. L., 1994, THESIS U COLORADO BO
MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0
Meltzoff AN, 1999, J COMMUN DISORD, V32, P251, DOI 10.1016/S0021-9924(99)00009-X
Meltzoff AN, 1990, SELF TRANSITION INFA, P139
MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427
Miura K, 2007, ADV ROBOTICS, V21, P1583
Miura K, 2008, INT C DEVEL LEARN, P262, DOI 10.1109/DEVLRN.2008.4640840
MIYAWAKI K, 1975, PERCEPT PSYCHOPHYS, V18, P331, DOI 10.3758/BF03211209
Narayanan S., 2011, P INT, P837
Oller D. K., 2000, EMERGENCE SPEECH CAP
Ouni S., 2005, J ACOUST SOC AM, V118, P411
Plummer A.R., 2012, P INT PORTL OR US
Rasanen O, 2012, PATTERN RECOGN, V45, P606, DOI 10.1016/j.patcog.2011.05.005
Rasanen O., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6289052
Rasanen O., 2009, P INT 09 BRIGHT ENGL, P852
Rasanen O., 2012, P INT 2012 PORTL OR
Rasanen O, 2012, SPEECH COMMUN, V54, P975, DOI 10.1016/j.specom.2012.05.001
Rasilo H., 2013, WORKSH SPEECH UNPUB
Rasilo H, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 3 AND 4, P2414
Rasilo H., 2011, P INT 11 FLOR IT, P2693
Rasilo H., 2013, ARTICULATORY MODEL S, DOI 10.1016/j.specom.2013.05.002
Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356
Sorokin VN, 2000, SPEECH COMMUN, V30, P55, DOI 10.1016/S0167-6393(99)00031-X
Stark RE, 1980, CHILD PHONOLOGY, P73
Tikhonov A. N., 1977, SOLUTION ILL POSED P
Toda T., 2004, P ICSLP JEJ ISL KOR, P1129
TREHUB SE, 1976, CHILD DEV, V47, P466, DOI 10.2307/1128803
Vaz M.J.L.R.M., 2009, THESIS U MINHO ESCOL
WERKER JF, 1981, CHILD DEV, V52, P349, DOI 10.2307/1129249
WERKER JF, 1984, INFANT BEHAV DEV, V7, P49, DOI 10.1016/S0163-6383(84)80022-3
Westermann G, 2004, BRAIN LANG, V89, P393, DOI 10.1016/S0093-934X(03)00345-6
Wiik K., 1965, THESIS U TURKU
Wilson SM, 2004, NAT NEUROSCI, V7, P701, DOI 10.1038/nn1263
Yoshikawa Y, 2003, CONNECT SCI, V15, P245, DOI 10.1080/09540090310001655075
NR 67
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2013
VL 55
IS 9
BP 909
EP 931
DI 10.1016/j.specom.2013.05.002
PG 23
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 189US
UT WOS:000322294400003
ER
PT J
AU Bao, XL
Zhu, J
AF Bao, Xulei
Zhu, Jie
TI An Improved Method for Late-Reverberant Suppression Based on Statistical
Model
SO SPEECH COMMUNICATION
LA English
DT Article
DE Dereverberation; Late reverberant spectral variance estimator; Shape
parameter; Maximum likelihood; Probabilistic speech model
ID SPEECH DEREVERBERATION; LINEAR PREDICTION; TIME; RECOGNITION; SIGNALS
AB The model-based late reverberant spectral variance (LRSV) estimator is considered an effective approach for speech dereverberation, as it constructs a simple expression for the LRSV from the past spectral variance of the reverberant signal. In this paper, we develop a new LRSV estimator based on time-varying room impulse responses (RIRs), under the assumption that the background noise in a noisy and reverberant environment comprises reverberant noise and direct-path noise. In the LRSV estimator, more than one past spectral variance term of the reverberant signal is used to obtain a smoother shape parameter, which leads to better dereverberation performance than the classic methods. Since this shape parameter is affected by the LRSV estimation error and may in turn affect subsequent LRSV estimation, we combine the smoother-shape-parameter-based LRSV estimator with a maximum likelihood (ML) algorithm in the spectral domain in order to obtain a more reliable LRSV estimate. Furthermore, we apply the proposed LRSV estimator before, rather than after, speech enhancement in the noisy and reverberant environment. Experimental results demonstrate that our new LRSV estimator is more effective for both noise-free and noisy reverberant speech. (C) 2013 Elsevier B.V. All rights reserved.
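For orientation, here is a minimal sketch of the classic exponential-decay form of a model-based LRSV estimator that work in this area builds on, assuming the reverberation time T60 is known; the paper's multi-frame shape parameter and maximum-likelihood refinement are not reproduced, and the function name and arguments are illustrative.

    import numpy as np

    def lrsv_exponential(spec_var, t60, frame_shift, fs, delay_frames):
        """Classic model-based late reverberant spectral variance estimate:
        the late part is modelled as an exponentially decayed copy of the
        earlier reverberant spectral variance.  spec_var: (n_freq x n_frames)
        spectral variance of the observed reverberant speech; frame_shift in
        samples; delay_frames sets where 'late' reverberation starts.
        Illustrative sketch only."""
        delta = 3.0 * np.log(10.0) / t60                     # decay constant from T60
        decay = np.exp(-2.0 * delta * delay_frames * frame_shift / fs)
        lrsv = np.zeros_like(spec_var)
        lrsv[:, delay_frames:] = decay * spec_var[:, :-delay_frames]
        return lrsv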
C1 [Bao, Xulei; Zhu, Jie] Shanghai Jiao Tong Univ, Dept Elect Engn, Shanghai 200240, Peoples R China.
RP Bao, XL (reprint author), Shanghai Jiao Tong Univ, Dept Elect Engn, Shanghai 200240, Peoples R China.
EM qunzhong@sjtu.edu.cn
CR Bao X., 2012, P IEEE INT C MULT EX, P467
Delcroix M, 2009, IEEE T AUDIO SPEECH, V17, P324, DOI 10.1109/TASL.2008.2010214
Erkelens JS, 2010, IEEE T AUDIO SPEECH, V18, P1746, DOI 10.1109/TASL.2010.2051271
Erkelens JS, 2010, INT CONF ACOUST SPEE, P4706, DOI 10.1109/ICASSP.2010.5495178
Erkelens JS, 2008, IEEE T AUDIO SPEECH, V16, P1112, DOI 10.1109/TASL.2008.2001108
Gillespie B., 2003, P 2003 IEEE INT C AC, VI, P676
Gomez R., 2009, P INTERSPEECH, P1223
Habets EAP, 2009, IEEE SIGNAL PROC LET, V16, P770, DOI 10.1109/LSP.2009.2024791
Hikichi T, 2007, EURASIP J ADV SIG PR, DOI 10.1155/2007/31013
Huang YTA, 2005, IEEE T SPEECH AUDI P, V13, P882, DOI 10.1109/TSA.2005.851941
Jeub M., 2010, P INT C DIGT SIGN PR, P1
Kingsbury BED, 1997, INT CONF ACOUST SPEE, P1259, DOI 10.1109/ICASSP.1997.596174
Kinoshita K, 2006, INT CONF ACOUST SPEE, P817
Kumar K, 2010, INT CONF ACOUST SPEE, P4282, DOI 10.1109/ICASSP.2010.5495667
Lebert K., 2001, ACTA ACOUST, V87, P359
Lu XG, 2011, COMPUT SPEECH LANG, V25, P571, DOI 10.1016/j.csl.2010.10.002
Nakatani T, 2007, IEEE T AUDIO SPEECH, V15, P80, DOI 10.1109/TASL.2006.872620
Nakatani T, 2008, INT CONF ACOUST SPEE, P85, DOI 10.1109/ICASSP.2008.4517552
Nakatani T, 2010, IEEE T AUDIO SPEECH, V18, P1717, DOI 10.1109/TASL.2010.2052251
SCHROEDE.MR, 1965, J ACOUST SOC AM, V37, P409, DOI 10.1121/1.1909343
Sehr A, 2010, IEEE T AUDIO SPEECH, V18, P1676, DOI 10.1109/TASL.2010.2050511
Yoshioka T, 2009, IEEE T AUDIO SPEECH, V17, P231, DOI 10.1109/TASL.2008.2008042
NR 22
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2013
VL 55
IS 9
BP 932
EP 940
DI 10.1016/j.specom.2013.04.003
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 189US
UT WOS:000322294400004
ER
PT J
AU Santos, JF
Cosentino, S
Hazrati, O
Loizou, PC
Falk, TH
AF Santos, Joao F.
Cosentino, Stefano
Hazrati, Oldooz
Loizou, Philipos C.
Falk, Tiago H.
TI Objective speech intelligibility measurement for cochlear implant users
in complex listening environments
SO SPEECH COMMUNICATION
LA English
DT Article
DE Intelligibility; Cochlear implants; Reverberation; Noise; Objective
metrics
ID NORMAL-HEARING; NOISE; REVERBERATION; PERCEPTION; QUALITY; MODEL;
MASKING; IDENTIFICATION; STIMULATION; ENHANCEMENT
AB Objective intelligibility measurement allows for reliable, low-cost, and repeatable assessment of innovative speech processing technologies, thus dispensing with costly and time-consuming subjective tests. To date, existing objective measures have focused on models of normal hearing, and they have found limited use for restorative hearing instruments such as cochlear implants (CIs). In this paper, we evaluate the performance of five existing objective measures and propose two refinements to one particular measure to better emulate CI hearing, under complex listening conditions involving noise only, reverberation only, and noise plus reverberation. Performance is assessed against subjectively rated data. Experimental results show that the proposed CI-inspired objective measures outperformed all existing measures; gains of as much as 22% in rank correlation could be achieved. (c) 2013 Elsevier B.V. All rights reserved.
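The figure of merit quoted above (rank correlation against subjectively rated data) can be sketched as follows, assuming one objective score and one subjective intelligibility score per listening condition; this is an illustrative evaluation snippet, not the authors' code.

    import numpy as np
    from scipy.stats import spearmanr

    def rank_correlation(objective_scores, subjective_scores):
        """Spearman rank correlation between an objective intelligibility metric
        and subjectively rated intelligibility, computed over listening conditions."""
        rho, _ = spearmanr(objective_scores, subjective_scores)
        return rho

    # e.g. rank_correlation([0.61, 0.48, 0.35], [82.0, 65.5, 41.0])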
C1 [Santos, Joao F.; Falk, Tiago H.] Inst Natl Rech Sci INRS EMT, Montreal, PQ, Canada.
[Cosentino, Stefano] UCL, Ear Inst, London, England.
[Hazrati, Oldooz; Loizou, Philipos C.] Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA.
RP Falk, TH (reprint author), Inst Natl Rech Sci INRS EMT, Montreal, PQ, Canada.
EM jfsantos@emt.inrs.ca; stefano.cosentino.10@ucl.ac.uk;
hazrati@utdallas.edu; falk@emt.inrs.ca
FU Natural Sciences and Engineering Research Council of Canada; UCL;
Neurelec; National Institute of Deafness and Other Communication
Disorders Grant [R01 DC 010494]
FX THF and JFS thank the Natural Sciences and Engineering Research Council
of Canada for their financial support. SC acknowledges funding from UCL
and Neurelec. PCL and OH were supported by a National Institute of
Deafness and Other Communication Disorders Grant (R01 DC 010494).
CR [Anonymous], 2004, P563 ITUT
ANSI, 1997, S351997 ANSI
Arai T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2490
Chen F, 2013, BIOMED SIGNAL PROCES, V8, P311, DOI 10.1016/j.bspc.2012.11.007
Chen F, 2012, J MED BIOL ENG, V32, P189, DOI 10.5405/jmbe.885
Chen F, 2011, EAR HEARING, V32, P331, DOI 10.1097/AUD.0b013e3181ff3515
Cosentino S., 2012, P INT C INF SCI SIGN, P4710
Dorman MF, 1997, J ACOUST SOC AM, V102, P2403, DOI 10.1121/1.419603
Drgas S, 2010, HEARING RES, V269, P162, DOI 10.1016/j.heares.2010.06.016
Dudley H, 1939, J ACOUST SOC AM, V11, P169, DOI 10.1121/1.1916020
Ewert SD, 2000, J ACOUST SOC AM, V108, P1181, DOI 10.1121/1.1288665
Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P90, DOI 10.1109/TASL.2009.2023679
Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P1766, DOI 10.1109/TASL.2010.2052247
GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T
Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628
Hazrati O., 2012, INT J AUDIOLOGY
Holube I, 1996, J ACOUST SOC AM, V100, P1703, DOI 10.1121/1.417354
ITU, 2001, P862 ITUT
Kates JM, 2005, INT INTEG REL WRKSP, P53, DOI 10.1109/ASPAA.2005.1540166
Kokkinakis K, 2011, J ACOUST SOC AM, V129, P3221, DOI 10.1121/1.3559683
Kokkinakis K, 2011, J ACOUST SOC AM, V130, P1099, DOI 10.1121/1.3614539
Kokkinakis K, 2011, INT CONF ACOUST SPEE, P2420
Loizou PC, 2005, J ACOUST SOC AM, V118, P2791, DOI 10.1121/1.2065847
Lorenzi C., 2008, 1 INT S AUD AUD RES, P263
Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493
Malfait L, 2006, IEEE T AUDIO SPEECH, V14, P1924, DOI 10.1109/TASL.2006.883177
Moller S, 2011, IEEE SIGNAL PROC MAG, V28, P18, DOI 10.1109/MSP.2011.942469
Moore BCJ, 2008, JARO-J ASSOC RES OTO, V9, P399, DOI 10.1007/s10162-008-0143-x
Moore BCJ, 1996, ACUSTICA, V82, P335
Nabelek A., 1989, J ACOUST SOC AM, V86, P318
Nabelek A. K., 1993, ACOUSTICAL FACTORS A, P15
Neuman AC, 2010, EAR HEARING, V31, P336, DOI 10.1097/AUD.0b013e3181d3d514
Pearson K., 1894, Philosophical Transactions, V185a, P71, DOI 10.1098/rsta.1894.0003
PLOMP R, 1986, J SPEECH HEAR RES, V29, P146
Poissant SF, 2006, J ACOUST SOC AM, V119, P1606, DOI 10.1121/1.2168428
Qin MK, 2003, J ACOUST SOC AM, V114, P446, DOI 10.1121/1.1579009
Rothauser E. H., 1969, IEEE T AUDIO ELECTRO, V17, P225, DOI DOI 10.1109/TAU.1969.1162058
Santos J., 2012, INTERSPEECH
Schroder J., 2009, P INT C AC NAG DAGA, P606
Vandali AE, 2000, EAR HEARING, V21, P608, DOI 10.1097/00003446-200012000-00008
Van den Bogaert T, 2009, J ACOUST SOC AM, V125, P360, DOI 10.1121/1.3023069
Watkins AJ, 2000, ACUSTICA, V86, P532
Wilson BS, 2008, HEARING RES, V242, P3, DOI 10.1016/j.heares.2008.06.005
Xu L, 2008, HEARING RES, V242, P132, DOI 10.1016/j.heares.2007.12.010
Yang LP, 2005, J ACOUST SOC AM, V117, P1001, DOI 10.1121/1.1852873
Zheng YF, 2011, EAR HEARING, V32, P569, DOI 10.1097/AUD.0b013e318216eba6
NR 46
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2013
VL 55
IS 7-8
BP 815
EP 824
DI 10.1016/j.specom.2013.04.001
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 178YU
UT WOS:000321484100001
ER
PT J
AU Lutfi, SL
Fernandez-Martinez, F
Lucas-Cuesta, JM
Lopez-Lebon, L
Montero, JM
AF Lebai Lutfi, Syaheerah
Fernandez-Martinez, Fernando
Manuel Lucas-Cuesta, Juan
Lopez-Lebon, Lorena
Manuel Montero, Juan
TI A satisfaction-based model for affect recognition from conversational
features in spoken dialog systems
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic affect detection; Affective spoken dialog system; Domestic
environment; HiFi agent; Social intelligence; Dialog features;
Conversational cues; User bias; Predicting user satisfaction
ID AFFECTIVE SPEECH; EMOTION; ANNOTATION; LANGUAGE; TUTORS; CUES
AB Detecting user affect automatically during real-time conversation is the main challenge towards our greater aim of infusing social intelligence into a natural-language mixed-initiative High-Fidelity (Hi-Fi) audio control spoken dialog agent. In recent years, studies on affect detection from voice have moved on to using realistic, non-acted data, which is subtler. However, it is more challenging to perceive subtler emotions and this is demonstrated in tasks such as labeling and machine prediction. This paper attempts to address part of this challenge by considering the role of user satisfaction ratings and also conversational/dialog features in discriminating contentment and frustration, two types of emotions that are known to be prevalent within spoken human-computer interaction. However, given the laboratory constraints, users might be positively biased when rating the system, indirectly making the reliability of the satisfaction data questionable. Machine learning experiments were conducted on two datasets, users and annotators, which were then compared in order to assess the reliability of these datasets. Our results indicated that standard classifiers were significantly more successful in discriminating the abovementioned emotions and their intensities (reflected by user satisfaction ratings) from annotator data than from user data. These results corroborated that: first, satisfaction data could be used directly as an alternative target variable to model affect, and that they could be predicted exclusively by dialog features. Second, these were only true when trying to predict the abovementioned emotions using annotator's data, suggesting that user bias does exist in a laboratory-led evaluation. (c) 2013 Elsevier B.V. All rights reserved.
C1 [Lebai Lutfi, Syaheerah; Manuel Lucas-Cuesta, Juan; Lopez-Lebon, Lorena; Manuel Montero, Juan] Univ Politecn Madrid, Speech Technol Grp, E-28040 Madrid, Spain.
[Lebai Lutfi, Syaheerah] Univ Sci Malaysia, Sch Comp Sci, George Town, Malaysia.
[Fernandez-Martinez, Fernando] Univ Carlos III Madrid, Dept Signal Theory & Commun, Multimedia Proc Grp GPM, E-28903 Getafe, Spain.
RP Montero, JM (reprint author), Univ Politecn Madrid, Speech Technol Grp, E-28040 Madrid, Spain.
EM syaheerah@die.upm.es; ffm@tsc.uc3m.es; juanmak@die.upm.es;
lorena.llebon@alumnos.upm.es; juancho@die.upm.es
RI Montero, Juan M/K-2381-2014; Fernandez-Martinez, Fernando/M-2935-2014
OI Montero, Juan M/0000-0002-7908-5400;
FU European Union [287678]; TIMPANO [TIN2011-28169-C05-03];
ITALIHA(CAM-UPM); INAPRA [DPI2010-21247-C02-02]; SD-TEAM
[TIN2008-06856-C05-03]; MA2VICMR (Comunidad Autonoma de Madrid)
[S2009/TIC-1542]; University Science of Malaysia; Malaysian Ministry of
Higher Education
FX The work leading to these results has received funding from the European
Union under Grant agreement No. 287678. It has also been supported by
TIMPANO(TIN2011-28169-C05-03), ITALIHA(CAM-UPM), INAPRA
(DPI2010-21247-C02-02), SD-TEAM (TIN2008-06856-C05-03) and MA2VICMR
(Comunidad Autonoma de Madrid, S2009/TIC-1542) projects. The
corresponding author thanks University Science of Malaysia and the
Malaysian Ministry of Higher Education for the PhD funding. Authors also
thank all the other members of the Speech Technology Group for the
continuous and fruitful discussion on these topics.
CR Ai H, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P797
Ai H., 2007, 8 SIGDIAL WORKSH DIS
Ang J., 2002, P INT C SPOK LANG PR
[Anonymous], 2001, PERC EV SPEECH QUAL
BAILEY JE, 1983, MANAGE SCI, V29, P530, DOI 10.1287/mnsc.29.5.530
Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
Barra-Chicote R., 2007, P WISP
Barra-Chicote R, 2010, SPEECH COMMUN, V52, P394, DOI 10.1016/j.specom.2009.12.007
Barra-Chicote R., 2009, P INT, P336
Barra R, 2006, INT CONF ACOUST SPEE, P1085
Batliner A, 2011, COMPUT SPEECH LANG, V25, P4, DOI 10.1016/j.csl.2009.12.003
Burkhardt F., 2009, P IEEE
Callejas Z, 2008, LECT NOTES ARTIF INT, V5078, P221, DOI 10.1007/978-3-540-69369-7_25
Callejas Z, 2008, SPEECH COMMUN, V50, P416, DOI 10.1016/j.specom.2008.01.001
Callejas Z, 2008, SPEECH COMMUN, V50, P646, DOI 10.1016/j.specom.2008.04.004
Charfuelan M., 2000, TALN
Cowie R., 2010, ESSENTIAL ROLE HUMAN, P151
DANIELI M, 1995, AAAI SPRING S EMP ME, P34
Devillers L., 2002, LREC
Devillers L, 2011, COMPUT SPEECH LANG, V25, P1, DOI 10.1016/j.csl.2010.07.002
D'Mello SK, 2008, USER MODEL USER-ADAP, V18, P45, DOI 10.1007/s11257-007-9037-6
DOLL WJ, 1988, MIS QUART, V12, P259, DOI 10.2307/248851
Dybkjaer L, 2004, SPEECH COMMUN, V43, P33, DOI 10.1016/j.specom.2004.02.001
Ekman P., 1978, FACIAL ACTION CODING
Engelbrecht K.-P., 2009, P SIGDIAL WORKSH DIS, P170, DOI 10.3115/1708376.1708402
Fernandez-Martinez F., 2010, P 7 C INT LANG RES E
Fernandez-Martinez F., 2010, P IEEE WORKSH DAT EX
Fernandez-Martinez F., 2008, P IEEE WORKSH SPOK L
Field A., 2005, DISCOVERING STAT USI, V2nd
Forbes-Riley K, 2011, SPEECH COMMUN, V53, P1115, DOI 10.1016/j.specom.2011.02.006
Forbes-Riley K, 2011, COMPUT SPEECH LANG, V25, P105, DOI 10.1016/j.csl.2009.12.002
Gelbrich K., 2009, SCHMALENBACH BUSINES, V61, P40
Grichkovtsova I, 2012, SPEECH COMMUN, V54, P414, DOI 10.1016/j.specom.2011.10.005
Grothendieck J, 2009, INT CONF ACOUST SPEE, P4745, DOI 10.1109/ICASSP.2009.4960691
Hone K.S., 2000, NAT LANG ENG, V6, P287, DOI 10.1017/S1351324900002497
Kernbach S., 2005, J SERV MARK, V19, P438, DOI 10.1108/08876040510625945
Laukka P, 2011, COMPUT SPEECH LANG, V25, P84, DOI 10.1016/j.csl.2010.03.004
Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534
Liscombe J., 2005, P INT LISB PORT, P1845
Litman DJ, 2006, SPEECH COMMUN, V48, P559, DOI 10.1016/j.specom.2005.09.008
Locke E. A., 1976, NATURE CAUSES JOB SA
Lutfi S., 2010, P BRAIN INSP COGN SY
Lutfi S. L., 2009, P 9 INT C EP ROB EPI, P221
Lutfi SL, 2009, HEALTHINF 2009: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON HEALTH INFORMATICS, P488
Mairesse F, 2007, J ARTIF INTELL RES, V30, P457
Moller S, 2007, COMPUT SPEECH LANG, V21, P26, DOI 10.1016/j.csl.2005.11.003
Moller S., 2005, QUALITY TELEPHONE BA
Nicholson J, 2000, NEURAL COMPUT APPL, V9, P290, DOI 10.1007/s005210070006
Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-581(02)00141-6
Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005
Picard R. W., 1999, P HCI INT 8 INT C HU, V1, P829
Podsakoff PM, 2003, J APPL PSYCHOL, V88, P879, DOI 10.1037/0021-9101.88.5.879
Porayska-Pomsta K, 2008, USER MODEL USER-ADAP, V18, P125, DOI 10.1007/s11257-007-9041-x
Reeves B., 1996, MEDIA EQUATION PEOPL
Riccardi G, 2005, LECT NOTES COMPUT SC, V3814, P144
Saris W. E., 2010, SURVEY RES METHODS, V4, P61
Schuller B., 2012, COMPUTER SPEECH LANG
Shami M., 2007, LECT NOTES COMPUTER, V4441, P43, DOI DOI 10.1007/978-3-540-74122-05
Tcherkassof A, 2007, EUR J SOC PSYCHOL, V37, P1325, DOI 10.1002/ejsp.427
Toivanen J, 2004, LANG SPEECH, V47, P383
Truong KP, 2012, SPEECH COMMUN, V54, P1049, DOI 10.1016/j.specom.2012.04.006
Vidrascu L., 2005, INTERSPEECH 05, P1841
Vogt T, 2005, 2005 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), VOLS 1 AND 2, P474, DOI 10.1109/ICME.2005.1521463
Walker M., 2000, P LANG RES EV C LREC
Witten I.H., 2005, DATA MINING PRACTICA
NR 65
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2013
VL 55
IS 7-8
BP 825
EP 840
DI 10.1016/j.specom.2013.04.005
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 178YU
UT WOS:000321484100002
ER
PT J
AU Tan, LN
Alwan, A
AF Tan, Lee Ngee
Alwan, Abeer
TI Multi-band summary correlogram-based pitch detection for noisy speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Pitch detection; Multi-band; Correlogram; Comb-filter; Noise-robust
ID DETECTION ALGORITHMS; CLASSIFICATION; RECOGNITION; SIGNALS; ROBUST
AB A multi-band summary correlogram (MBSC)-based pitch detection algorithm (PDA) is proposed. The PDA performs pitch estimation and voiced/unvoiced (V/UV) detection via novel signal processing schemes that are designed to enhance the MBSC's peaks at the most likely pitch period. These peak-enhancement schemes include comb-filter channel-weighting to yield each individual subband's summary correlogram (SC) stream, and stream-reliability-weighting to combine these SCs into a single MBSC. V/UV detection is performed by applying a constant threshold to the maximum peak of the enhanced MBSC. Narrowband noisy speech sampled at 8 kHz is generated from the Keele (development set) and CSTR (Centre for Speech Technology Research; evaluation set) corpora. Both 4-kHz full-band speech and G.712-filtered telephone speech are simulated. When evaluated solely on pitch estimation accuracy, assuming voicing detection is perfect, the proposed algorithm has the lowest gross pitch error for noisy speech in the evaluation set among the algorithms evaluated (RAPT, YIN, etc.). The proposed PDA also achieves the lowest average pitch detection error when both pitch estimation and voicing detection errors are taken into account. (c) 2013 Elsevier B.V. All rights reserved.
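A minimal single-band sketch of the decision rule described above: take the lag of the maximum correlogram peak as the pitch period and declare the frame voiced only if that peak exceeds a constant threshold. The paper's multi-band, comb-filter-weighted summary correlogram is not reproduced, and all names and thresholds below are illustrative.

    import numpy as np

    def frame_pitch(frame, fs, f0_min=60.0, f0_max=400.0, vuv_thresh=0.5):
        """Single-frame pitch estimate from a normalised autocorrelation
        (a stand-in for the summary correlogram).  Returns (f0_hz, voiced_flag);
        f0_hz is 0 for frames judged unvoiced.  Illustrative sketch only."""
        frame = frame - np.mean(frame)
        ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
        if ac[0] <= 0:
            return 0.0, False
        ac = ac / ac[0]                              # normalise so lag 0 equals 1
        lo = int(fs / f0_max)
        hi = min(int(fs / f0_min), len(ac) - 1)
        lag = lo + np.argmax(ac[lo:hi])
        voiced = ac[lag] > vuv_thresh                # constant V/UV threshold
        return (fs / lag if voiced else 0.0), bool(voiced)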
C1 [Tan, Lee Ngee; Alwan, Abeer] Univ Calif Los Angeles, Dept Elect Engn, Los Angeles, CA 90095 USA.
RP Tan, LN (reprint author), Univ Calif Los Angeles, Dept Elect Engn, 56-125B Engn 4 Bldg,Box 951594, Los Angeles, CA 90095 USA.
EM ngee@seas.ucla.edu; alwan@ee.ucla.edu
FU Defense Advanced Research Projects Agency (DARPA) [D10PC20024]
FX This material is based on work supported in part by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. D10PC20024. Any
opinions, findings, and conclusions or recommendations expressed in this
material are those of the authors and do not necessarily reflect the
view of the DARPA or its Contracting Agent, the US Department of the
Interior, National Business Center, Acquisition and Property Management
Division, Southwest Branch. The views expressed are those of the author
and do not reflect the official policy or position of the Department of
Defense or the US Government. Approved for Public Release, Distribution
Unlimited. The authors thank the editor and the reviewers for their
comments.
CR Ahmadi S, 1999, IEEE T SPEECH AUDI P, V7, P333, DOI 10.1109/89.759042
ATAL BS, 1976, IEEE T ACOUST SPEECH, V24, P201, DOI 10.1109/TASSP.1976.1162800
Bagshaw P., 1993, P EUR C SPEECH COMM, P1003
Beritelli F, 2007, ELECTRON LETT, V43, P249, DOI 10.1049/el:20073800
Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464
Camacho A, 2008, J ACOUST SOC AM, V124, P1638, DOI 10.1121/1.2951592
Cariani PA, 1996, J NEUROPHYSIOL, V76, P1698
Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917
Chu W, 2009, INT CONF ACOUST SPEE, P3969
DELGUTTE B, 1980, J ACOUST SOC AM, V68, P843, DOI 10.1121/1.384824
DRULLMAN R, 1995, J ACOUST SOC AM, V97, P585, DOI 10.1121/1.413112
Frerking M. E., 1994, DIGITAL SIGNAL PROCE
HERMES DJ, 1988, J ACOUST SOC AM, V83, P257, DOI 10.1121/1.396427
Hirsch H-Guenter, 2005, FANT FILTERING NOISE
ITU, 1996, REC G 712 TRANSM PER
LICKLIDER JCR, 1951, EXPERIENTIA, V7, P128, DOI 10.1007/BF02156143
Loughlin PJ, 1996, J ACOUST SOC AM, V100, P1594, DOI 10.1121/1.416061
Luengo I, 2007, INT CONF ACOUST SPEE, P1057
MEDAN Y, 1991, IEEE T SIGNAL PROCES, V39, P40, DOI 10.1109/78.80763
MEDDIS R, 1991, J ACOUST SOC AM, V89, P2866, DOI 10.1121/1.400725
Nakatani T, 2004, J ACOUST SOC AM, V116, P3690, DOI 10.1121/1.1787522
Oh K., 1984, P IEEE ICASSP, P85
PATTERSON RD, 1992, ADV BIOSCI, V83, P429
Plante F., 1995, P EUR, P837
RABINER LR, 1976, IEEE T ACOUST SPEECH, V24, P399, DOI 10.1109/TASSP.1976.1162846
ROSS MJ, 1974, IEEE T ACOUST SPEECH, VAS22, P353, DOI 10.1109/TASSP.1974.1162598
Rouat J, 1997, SPEECH COMMUN, V21, P191, DOI 10.1016/S0167-6393(97)00002-2
Secrest B. G., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing
Secrest B. G., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing
Shah J., 2004, P IEEE INT C AC SPEE, P17
Slaney M., 1990, P IEEE INT C AC SPEE, P357
SUN XJ, 2002, ACOUST SPEECH SIG PR, P333
Talkin D., 1995, SPEECH CODING SYNTHE, P497
Tan LN, 2011, INT CONF ACOUST SPEE, P4464
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
Walker K., 2012, ISCA ODYSSEY
Wu MY, 2003, IEEE T SPEECH AUDI P, V11, P229, DOI 10.1109/TSA.2003.811539
NR 37
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2013
VL 55
IS 7-8
BP 841
EP 856
DI 10.1016/j.specom.2013.03.001
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 178YU
UT WOS:000321484100003
ER
PT J
AU Mattheyses, W
Latacz, L
Verhelst, W
AF Mattheyses, Wesley
Latacz, Lukas
Verhelst, Werner
TI Comprehensive many-to-many phoneme-to-viseme mapping and its application
for concatenative visual speech synthesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Visual speech synthesis; Viseme classification; Phoneme-to-viseme
mapping; Context-dependent visemes; Visemes
ID WORD-RECOGNITION; ANIMATION; PERFORMANCE; CONSONANTS; COARTICULATION;
IMPLEMENTATION; PERCEPTION; SELECTION; MODEL; FACE
AB The use of visemes as atomic speech units in visual speech analysis and synthesis systems is well-established. Viseme labels are determined using a many-to-one phoneme-to-viseme mapping. However, due to visual coarticulation effects, an accurate mapping from phonemes to visemes should define a many-to-many mapping scheme instead. In this research it was found that neither the use of standardized nor speaker-dependent many-to-one viseme labels could satisfy the quality requirements of concatenative visual speech synthesis. Therefore, a novel technique to define a many-to-many phoneme-to-viseme mapping scheme is introduced, which makes use of both tree-based and k-means clustering approaches. We show that these many-to-many viseme labels more accurately describe the visual speech information as compared to both phoneme-based and many-to-one viseme-based speech labels. In addition, we found that the use of these many-to-many visemes improves the precision of the segment selection phase in concatenative visual speech synthesis using limited speech databases. Furthermore, the resulting synthetic visual speech was both objectively and subjectively found to be of higher quality when the many-to-many visemes are used to describe the speech database and the synthesis targets. (c) 2013 Elsevier B.V. All rights reserved.
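A minimal sketch of the data-driven clustering idea behind many-to-many viseme labels: pool a visual feature vector for every realised phone, cluster the pool with k-means, and read off, per phoneme, the distribution over the resulting viseme clusters. This is a simplified stand-in for the paper's combined tree-based and k-means procedure; the function name and cluster count are illustrative.

    import numpy as np
    from collections import defaultdict
    from sklearn.cluster import KMeans

    def viseme_mapping(visual_feats, phone_labels, n_visemes=16):
        """visual_feats: (n_instances x dim) visual features, one row per realised
        phone; phone_labels: phoneme label per row.  Returns, for each phoneme,
        the distribution of viseme-cluster labels its instances fall into,
        i.e. a many-to-many phoneme-to-viseme table.  Illustrative sketch only."""
        clusters = KMeans(n_clusters=n_visemes, n_init=10,
                          random_state=0).fit_predict(visual_feats)
        table = defaultdict(lambda: np.zeros(n_visemes))
        for phone, vis in zip(phone_labels, clusters):
            table[phone][vis] += 1
        return {p: counts / counts.sum() for p, counts in table.items()}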
C1 [Mattheyses, Wesley; Latacz, Lukas; Verhelst, Werner] Vrije Univ Brussel, Dept ETRO DSSP, B-1050 Brussels, Belgium.
[Verhelst, Werner] IMinds, B-9050 Ghent, Belgium.
RP Mattheyses, W (reprint author), Vrije Univ Brussel, Dept ETRO DSSP, Pl Laan 2, B-1050 Brussels, Belgium.
EM wmatthey@etro.vub.ac.be; llatacz@etro.vub.ac.be;
wverhels@etro.vub.ac.be
CR Arslan LM, 1999, SPEECH COMMUN, V27, P81, DOI 10.1016/S0167-6393(98)00068-5
Aschenberner B., 2005, PHONEME VISEME MAPPI
Auer J., 1997, J ACOUST SOC AM, V102, P3704
Bailly G., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025700715107
BAUM LE, 1970, ANN MATH STAT, V41, P164, DOI 10.1214/aoms/1177697196
Beskow J., 2005, P INT 2005 LISB PORT, P793
BINNIE CA, 1974, J SPEECH HEAR RES, V17, P619
Bozkurt E., 2007, P SIGN PROC COMM APP, P1
Breen AP, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2159
Bregler C., 1997, P ACM SIGGRAPH, P353, DOI 10.1145/258734.258880
Cappelletta Luca, 2012, Proceedings of the 1st International Conference on Pattern Recognition Applications and Methods. ICPRAM 2012
Cohen M. M., 1993, Models and Techniques in Computer Animation
COHEN MM, 1990, BEHAV RES METH INSTR, V22, P260
Cootes TF, 2001, IEEE T PATTERN ANAL, V23, P681, DOI 10.1109/34.927467
Corthals P., 1984, TIJDSCHR LOG AUDIOLJ, V14, P126
Costa P., 2010, P ACM SSPNET INT S F, P20, DOI 10.1145/1924035.1924047
Dahai Yu, 2010, IPSJ T COMPUTER VISI, V2, P25
De Martino JM, 2006, COMPUT GRAPH-UK, V30, P971, DOI 10.1016/j.cag.2006.08.017
DEMUYNCK K, 2008, P ICSLP, P495
Deng Z, 2008, COMPUT GRAPH FORUM, V27, P2096, DOI 10.1111/j.1467-8659.2008.01192.x
EBERHARDT SP, 1990, J ACOUST SOC AM, V88, P1274, DOI 10.1121/1.399704
Eggermont J. P. M., 1964, TAALVERWERVING BIJ G
Elisei F., 2001, P AUD VIS SPEECH PRO, P90
Ezzat T, 2002, ACM T GRAPHIC, V21, P388
Ezzat T, 2000, INT J COMPUT VISION, V38, P45, DOI 10.1023/A:1008166717597
Fagel S, 2004, SPEECH COMMUN, V44, P141, DOI 10.1016/j.specom.2004.10.006
FISHER CG, 1968, J SPEECH HEAR RES, V11, P796
Galanes F., 1998, P INT C AUD VIS SPEE, P191
Govokhina O., 2007, P 6 ISCA WORKSH SPEE, P1
Govokhina O., 2006, P JOURN ET PAR, P305
Hazen T.J., 2004, P INT C MULT INT, P235, DOI 10.1145/1027933.1027972
Hilder S., 2010, P INT C AUD VIS SPEE, P154
Hunt AJ, 1996, INT CONF ACOUST SPEE, P373, DOI 10.1109/ICASSP.1996.541110
JACKSON PL, 1988, VOLTA REV, V90, P99
Jeffers J., 1971, SPEECHREADING LIPREA
Keating Patricia A., 1988, PHONOLOGY, V5, P275, DOI 10.1017/S095267570000230X
Kent R. D., 1977, J PHONETICS, V15, P115
LeGoff B, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2163
Lesner S. A., 1981, J ACADEMY REHABILITA, V14, P252
Liu K., 2011, P IEEE GLOB TEL C GL, P1
Lloyd S., 1982, IEEE T INFORMATION T, V28, P129, DOI DOI 10.1109/TIT.1982.1056489
Mattheyses W., 2010, P INT C AUD VIS SPEE, P148
Mattheyses W., 2011, P INT C AUD VIS SPEE, P1113
Mattheyses W, 2008, LECT NOTES COMPUT SC, V5237, P125, DOI 10.1007/978-3-540-85853-9_12
Mattheyses W, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P1113
Mattheyses W., 2009, EURASIP J
Mattys SL, 2002, PERCEPT PSYCHOPHYS, V64, P667, DOI 10.3758/BF03194734
Melenchon J., 2007, P INT C AUD VIS SPEE, P191
Melenchon J, 2009, IEEE T AUDIO SPEECH, V17, P459, DOI 10.1109/TASL.2008.2010213
MONTGOMERY AA, 1983, J ACOUST SOC AM, V73, P2134, DOI 10.1121/1.389537
MYERS C, 1980, IEEE T ACOUST SPEECH, V28, P623, DOI 10.1109/TASSP.1980.1163491
OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310
Olshen R., 1984, CLASSIFICATION REGRE, V1st
OWENS E, 1985, J SPEECH HEAR RES, V28, P381
Pandzic I., 2003, MPEG 4 FACIAL ANIMAT
Potamianos G., 2004, ISSUES VISUAL AUDIO
Rogozan A., 1999, International Journal on Artificial Intelligence Tools (Architectures, Languages, Algorithms), V8, DOI 10.1142/S021821309900004X
Saenko E., 2004, ARTICULARY FEATURES
Senin P., 2008, DYNAMIC TIME WARPING
Tamura M, 1998, P INT C AUD VIS SPEE
Taylor S.L., 2012, P 11 ACM SIGGRAPH EU, P275
Tekalp AM, 2000, SIGNAL PROCESS-IMAGE, V15, P387, DOI 10.1016/S0923-5965(99)00055-7
Theobald B., 2008, P INTERSPEECH, P1875
Theobald BJ, 2004, SPEECH COMMUN, V44, P127, DOI 10.1016/j.specom.2004.07.002
Theobald BJ, 2012, IEEE T AUDIO SPEECH, V20, P2378, DOI 10.1109/TASL.2012.2202651
VANSON N, 1994, J ACOUST SOC AM, V96, P1341, DOI 10.1121/1.411324
Verma A., 2003, P IEEE INT C AC SPEE, P720
Visser M, 1999, LECT NOTES ARTIF INT, V1692, P349
Ypsilos I. A., 2004, Proceedings. 2nd International Symposium on 3D Data Processing, Visualization, and Transmission, DOI 10.1109/TDPVT.2004.1335143
Zelezny M, 2006, SIGNAL PROCESS, V86, P3657, DOI 10.1016/j.sigpro.2006.02.039
NR 70
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2013
VL 55
IS 7-8
BP 857
EP 876
DI 10.1016/j.specom.2013.02.005
PG 20
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 178YU
UT WOS:000321484100004
ER
PT J
AU Mattheyses, W
Latacz, L
Verhelst, W
AF Mattheyses, Wesley
Latacz, Lukas
Verhelst, Werner
TI Comprehensive many-to-many phoneme-to-viseme mapping and its application
for concatenative visual speech synthesis (vol 55, pg 857, 2013)
SO SPEECH COMMUNICATION
LA English
DT Correction
C1 [Mattheyses, Wesley; Latacz, Lukas; Verhelst, Werner] Vrije Univ Brussel, Dept ETRO DSSP, B-1050 Brussels, Belgium.
[Verhelst, Werner] IMinds, B-9050 Ghent, Belgium.
RP Mattheyses, W (reprint author), Vrije Univ Brussel, Dept ETRO DSSP, Pl Laan 2, B-1050 Brussels, Belgium.
EM wmatthey@etro.vub.ac.be
CR Mattheyses W, 2013, SPEECH COMMUN, V55, P857, DOI 10.1016/j.specom.2013.02.005
NR 1
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2013
VL 55
IS 7-8
BP 877
EP 877
DI 10.1016/j.specom.2013.05.004
PG 1
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 178YU
UT WOS:000321484100005
ER
PT J
AU Rao, KS
Vuppala, AK
AF Rao, K. Sreenivasa
Vuppala, Anil Kumar
TI Non-uniform time scale modification using instants of significant
excitation and vowel onset points
SO SPEECH COMMUNICATION
LA English
DT Article
DE Instants of significant excitation; Epochs; Vowel onset point; Time
scale modification; Non-uniform time scale modification; Uniform time
scale modification
ID SPEECH
AB In this paper, a non-uniform time scale modification (TSM) method is proposed for increasing or decreasing speech rate. The proposed method modifies the durations of vowel and pause segments by different modification factors. Vowel segments are modified by factors based on their identities, and pause segments by uniform factors based on the desired speaking rate. Consonant and transition (consonant-to-vowel) segments are not modified in the proposed TSM. These modification factors are derived from the analysis of slow and fast speech collected from professional radio artists. In the proposed TSM method, vowel onset points (VOPs) are used to mark the consonant, transition and vowel regions, and instants of significant excitation (ISE) are used to perform TSM as required. The VOPs indicate the instants at which the onsets of vowels take place. The ISE, also known as epochs, indicate the instants of glottal closure during voiced speech, and some random excitations such as burst onset during non-voiced speech. In this work, VOPs are determined using multiple sources of evidence from excitation source, spectral peaks, modulation spectrum and uniformity in epoch intervals. The ISEs are determined using a zero-frequency filter method. The performance of the proposed non-uniform TSM scheme is compared with uniform and existing non-uniform TSM schemes using epoch and time domain pitch synchronous overlap and add (TD-PSOLA) methods. (C) 2013 Elsevier B.V. All rights reserved.
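The segment-dependent modification policy described above (vowel factors tied to vowel identity, a uniform factor for pauses, consonants and transitions left unchanged) can be sketched in a few lines of Python. The factor values, segment labels, and function names below are illustrative assumptions only; the epoch-based waveform modification itself is not shown.

# Minimal sketch (not the authors' implementation): assign non-uniform
# time-scale modification factors to labelled speech segments as described
# in the abstract. Vowel factors depend on vowel identity, pause factors are
# uniform, and consonants and CV transitions are left unmodified.
# The factor values below are illustrative placeholders, not values from the paper.

VOWEL_FACTORS = {"a": 1.4, "i": 1.3, "u": 1.35, "e": 1.25, "o": 1.3}  # hypothetical
PAUSE_FACTOR = 1.6                                                     # hypothetical

def modification_factors(segments):
    """segments: list of (label, kind) with kind in {'vowel','pause','consonant','transition'}.
    Returns one duration-scaling factor per segment (slow-down > 1, speed-up < 1)."""
    factors = []
    for label, kind in segments:
        if kind == "vowel":
            factors.append(VOWEL_FACTORS.get(label, 1.3))
        elif kind == "pause":
            factors.append(PAUSE_FACTOR)
        else:                      # consonants and transitions are not modified
            factors.append(1.0)
    return factors

if __name__ == "__main__":
    segs = [("sil", "pause"), ("k", "consonant"), ("k-a", "transition"), ("a", "vowel")]
    print(modification_factors(segs))   # [1.6, 1.0, 1.0, 1.4]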
C1 [Rao, K. Sreenivasa] Indian Inst Technol, Sch Informat Technol, Kharagpur 721302, W Bengal, India.
[Vuppala, Anil Kumar] IIIT Hyderabad, LTRC, Hyderabad, Andhra Pradesh, India.
RP Rao, KS (reprint author), Indian Inst Technol, Sch Informat Technol, Kharagpur 721302, W Bengal, India.
EM ksrao@iitkgp.ac.in; anil.vuppala@gmail.com
CR Bonada J, 2000, P 2000 INT COMP MUS, P396
Deller J. R., 1993, DISCRETE TIME PROCES
di Marino J., 2001, P IEEE INT C AC SPEE, V2, P853
Donnellan Olivia, 2003, P 3 IEEE INT C ADV L
Duxbury C, 2001, P DIG AUD EFF C DAFX, P1
Duxbury C, 2002, P AES 112 CONV MUN G, P5530
Gangashetty S. V., 2004, THESIS IIT MADRAS
Grofit S, 2008, IEEE T AUDIO SPEECH, V16, P106, DOI 10.1109/TASL.2007.909444
Hainsworth SW, 2001, PROCEEDINGS OF THE 2001 IEEE WORKSHOP ON THE APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, P23, DOI 10.1109/ASPAA.2001.969533
Hogg RV, 1987, ENG STAT
Ilk HG, 2006, SIGNAL PROCESS, V86, P127, DOI 10.1016/j.sigpro.2005.05.006
Klapuri A., 1999, ACOUST SPEECH SIG PR, P3089
Kumar Anil Vuppala, 2012, INT J ELECT COMMUNIC, V66, P697
Mahadeva Prasanna S R, 2009, IEEE Transactions on Audio, Speech and Language Processing, V17, DOI 10.1109/TASL.2008.2010884
Moulines, 1995, SPEECH COMMUN, V16, P175
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
Pickett J. M., 1999, ACOUSTICS SPEECH COM
PORTNOFF MR, 1981, IEEE T ACOUST SPEECH, V29, P374, DOI 10.1109/TASSP.1981.1163581
Prakash Dixit R, 1991, J PHONETICS, V19, P213
QUATIERI TF, 1992, IEEE T SIGNAL PROCES, V40, P497, DOI 10.1109/78.120793
Rodet X, 2001, P INT COMP MUS C ICM, P30
Roebel A, 2003, P INT C DIG AUD EFF, P344
Slaney Malcolm, P IEEE INT C AC SPEE
Sreenivasa Rao K., 2009, SPEECH COMMUN, V51, P1263
Sreenivasa-Rao K, 2006, IEEE T SPEECH AUDIO, V14, P972
Sri Rama Murty K, 2008, IEEE T SPEECH AUDIO, V16, P1602
Stevens K.N., 1999, ACOUSTIC PHONETICS
NR 27
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2013
VL 55
IS 6
BP 745
EP 756
DI 10.1016/j.specom.2013.03.002
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 165IQ
UT WOS:000320477500001
ER
PT J
AU Low, SY
Pham, DS
Venkatesh, S
AF Low, Siow Yong
Duc Son Pham
Venkatesh, Svetha
TI Compressive speech enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Compressed sensing; Speech enhancement; Sparsity
ID VOICE ACTIVITY DETECTION; SIGNAL RECOVERY; BEAMFORMER; NOISE; PURSUIT
AB This paper presents an alternative approach to speech enhancement using compressed sensing (CS). CS is a sampling theory which states that sparse signals can be reconstructed from far fewer measurements than Nyquist sampling requires. As such, CS can be exploited to reconstruct only the sparse components (e.g., speech) from a mixture of sparse and non-sparse components (e.g., noise). This is possible because, in a time-frequency representation, the speech signal is sparse whilst most noise is not. Derivation shows that, on average, the signal-to-noise ratio (SNR) in the compressed domain is greater than or equal to that in the uncompressed domain. Experimental results concur with the derivation, and the proposed CS scheme achieves better or similar perceptual evaluation of speech quality (PESQ) scores and segmental SNR compared with conventional methods over a wide range of input SNRs. (C) 2013 Elsevier B.V. All rights reserved.
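As a rough illustration of the sparse-recovery idea behind this approach, the following Python sketch measures a synthetic sparse spectrum with a random Gaussian matrix and recovers it with iterative soft thresholding (ISTA). The dimensions, the regularisation weight, and the choice of ISTA are assumptions made for illustration; this is not the paper's enhancement scheme.

# Toy compressed-sensing sketch: a sparse "speech-like" vector is sampled with a
# random Gaussian matrix and recovered by ISTA (iterative soft thresholding).
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 256, 80, 8                      # signal length, measurements, sparsity
x = np.zeros(n)
x[rng.choice(n, k, replace=False)] = rng.normal(0, 1, k)   # sparse component

Phi = rng.normal(0, 1.0 / np.sqrt(m), (m, n))              # random measurement matrix
y = Phi @ x                                                 # compressed measurements

def ista(Phi, y, lam=0.05, iters=500):
    """Minimise 0.5*||y - Phi x||^2 + lam*||x||_1 by iterative soft thresholding."""
    L = np.linalg.norm(Phi, 2) ** 2        # Lipschitz constant of the gradient
    x_hat = np.zeros(Phi.shape[1])
    for _ in range(iters):
        grad = Phi.T @ (Phi @ x_hat - y)
        z = x_hat - grad / L
        x_hat = np.sign(z) * np.maximum(np.abs(z) - lam / L, 0.0)
    return x_hat

x_hat = ista(Phi, y)
print("reconstruction SNR (dB):",
      10 * np.log10(np.sum(x**2) / np.sum((x - x_hat)**2)))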
C1 [Low, Siow Yong] Curtin Univ, Miri, Malaysia.
[Duc Son Pham] Curtin Univ, Dept Comp, Bentley, WA, Australia.
[Venkatesh, Svetha] Deakin Univ, Ctr Pattern Recognit & Data Analyt, Geelong, Vic 3217, Australia.
RP Low, SY (reprint author), Curtin Univ, Sarawak Campus, Miri, Malaysia.
EM siowyong@curtin.edu.my; DucSon.Pham@curtin.edu.au;
svetha.venkatesh@deakin.edu.au
CR Benesty J., 2005, SPEECH ENHANCEMENT S
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Boufounos Petros, 2007, IEEE WORKSH STAT SIG, P299
Brandstein M., 2001, DIGITAL SIGNAL PROCE
Candes E. J., 2006, P INT C MATH MADR SP
Candes EJ, 2008, IEEE SIGNAL PROC MAG, V25, P21, DOI 10.1109/MSP.2007.914731
Candes EJ, 2006, IEEE T INFORM THEORY, V52, P5406, DOI 10.1109/TIT.2006.885507
Candes EJ, 2006, IEEE T INFORM THEORY, V52, P489, DOI 10.1109/TIT.2005.862083
Chen SSB, 1998, SIAM J SCI COMPUT, V20, P33, DOI 10.1137/S1064827596304010
Christensen M. G., 2009, P AS C SIGN SYST COM, P356
Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544
Dam H.Q, 2004, IEEE INT S CIRC SYST, V3, P433
Davis A, 2005, INT CONF ACOUST SPEE, P65
Donoho DL, 2006, IEEE T INFORM THEORY, V52, P1289, DOI 10.1109/TIT.2006.871582
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
Gardner TJ, 2006, P NATL ACAD SCI USA, V103, P6094, DOI 10.1073/pnas.0601707103
Garofolo J., 1988, GETTING STARTED DARP
Ghosh PK, 2011, IEEE T AUDIO SPEECH, V19, P600, DOI 10.1109/TASL.2010.2052803
Giacobello D, 2012, IEEE T AUDIO SPEECH, V20, P1644, DOI 10.1109/TASL.2012.2186807
Golub G.H., 1996, MATRIX COMPUTATIONS
Griffin A, 2011, IEEE T AUDIO SPEECH, V19, P1382, DOI 10.1109/TASL.2010.2090656
Hoshuyama O, 1999, IEEE T SIGNAL PROCES, V47, P2677, DOI 10.1109/78.790650
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
ITU, 2001, 862 ITU, P862
Jancovic P, 2012, SPEECH COMMUN, V54, P108, DOI 10.1016/j.specom.2011.07.005
Karvanen J, 2003, P 4 INT S IND COMP A, P125
Kim SJ, 2007, IEEE J-STSP, V1, P606, DOI 10.1109/JSTSP.2007.910971
Kokkinakis K, 2008, J ACOUST SOC AM, V123, P2379, DOI 10.1121/1.2839887
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lotter T, 2003, EURASIP J APPL SIG P, V2003, P1147, DOI 10.1155/S1110865703305025
Low S.Y, 2002, ICCS 2002 8 INT C CO, V2, P1020
Low S.Y, 2005, IEEE INT C AC SPEECH, V3, P69
Lu CT, 2011, SPEECH COMMUN, V53, P495, DOI 10.1016/j.specom.2010.11.008
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Miyazaki R, 2012, IEEE T AUDIO SPEECH, V20, P2080, DOI 10.1109/TASL.2012.2196513
O'Shaughnessy D, 2000, SPEECH COMMUNICATION, V2nd
Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004
Pham D.T, 2009, LECT NOTES COMPUTER, V5441
Pinter I, 1996, COMPUT SPEECH LANG, V10, P1, DOI 10.1006/csla.1996.0001
Principi E, 2010, J ELECT COMPUTER ENG, P1
Rachlin Y, 2008, 46 ALL C COMM CONTR
Shawe-Taylor J, 2002, ADV NEUR IN, V14, P511
Sreenivas TV, 2009, INT CONF ACOUST SPEE, P4125, DOI 10.1109/ICASSP.2009.4960536
Tropp JA, 2007, IEEE T INFORM THEORY, V53, P4655, DOI 10.1109/TIT.2007.909108
Uemura Y, 2008, INT WORKSH AC ECH NO
Veen B. V., 1988, IEEE ASSP MAG APR, V5, P4
Wahlberg B, 2012, IFAC S SYST ID, V1, P16
Wu DS, 2011, ANN OPER RES, V185, P1, DOI 10.1007/s10479-010-0822-y
Yang J., 1993, INT C AC SPEECH SIGN, V2, P363, DOI 10.1109/ICASSP.1993.319313
Yu T, 2009, INT CONF ACOUST SPEE, P213
NR 50
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2013
VL 55
IS 6
BP 757
EP 768
DI 10.1016/j.specom.2013.03.003
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 165IQ
UT WOS:000320477500002
ER
PT J
AU Hansen, JHL
Suh, JW
Leonard, MR
AF Hansen, John H. L.
Suh, Jun-Won
Leonard, Matthew R.
TI In-set/out-of-set speaker recognition in sustained acoustic scenarios
using sparse data
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker recognition; In-set/out-of-set; Sparse data; Environmental noise
ID GAUSSIAN MIXTURE-MODELS; VERIFICATION; IDENTIFICATION; NOISE;
ENROLLMENT; DISTANCE; SYSTEMS; CORPUS
AB This study addresses the problem of identifying in-set versus out-of-set speakers in noise for limited train/test durations in situations where rapid detection and tracking are required. The objective is to decide whether the current input speaker is accepted as a member of an enrolled in-set group or rejected as an outside speaker. A new scoring algorithm that combines log-likelihood scores across an energy-frequency grid is developed, in which high-energy, speaker-dependent frames are fused with weighted scores from low-energy, noise-dependent frames. By leveraging the balance between the speaker and the background noise environment, it is possible to realize an improvement in overall equal error rate (EER) performance. Using speakers from the TIMIT database with 5 s of train and 2 s of test data, the average optimum relative EER improvement for the proposed full selective leveraging approach is +31.6%. The optimum relative EER improvement using 10 s of NIST SRE-2008 data is +10.8% with the proposed approach. The results confirm that, for situations in which the background environment type remains constant between train and test, an in-set/out-of-set speaker recognition system that takes advantage of information gathered from the environmental noise can be formulated, realizing significant improvement when only extremely limited amounts of train/test data are available. (C) 2013 Elsevier B.V. All rights reserved.
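A toy sketch of the general idea of weighting per-frame log-likelihoods by frame energy is given below (Python, using scikit-learn GMMs). The weighting rule, the alpha parameter, and the acceptance threshold are illustrative assumptions; the paper's energy-frequency grid and score-fusion algorithm are considerably more elaborate.

# Toy sketch of energy-weighted frame scoring for in-set/out-of-set decisions.
# High-energy (speaker-dominated) frames receive more weight than low-energy
# (noise-dominated) frames; random features stand in for real MFCCs.
import numpy as np
from sklearn.mixture import GaussianMixture

def weighted_score(test_feats, frame_energy, speaker_gmm, alpha=0.8):
    """Average frame log-likelihoods under an enrolled speaker GMM, weighting
    frames by normalised energy (alpha is an illustrative mixing parameter)."""
    ll = speaker_gmm.score_samples(test_feats)           # per-frame log-likelihood
    w = frame_energy / (frame_energy.sum() + 1e-12)      # energy-based weights
    w = alpha * w + (1.0 - alpha) / len(w)               # mix with uniform weighting
    return float(np.sum(w * ll))

def accept_in_set(test_feats, frame_energy, in_set_gmms, threshold):
    """Accept the speaker if the best weighted in-set score clears a threshold."""
    scores = [weighted_score(test_feats, frame_energy, g) for g in in_set_gmms]
    return max(scores) >= threshold

rng = np.random.default_rng(1)
train = rng.normal(size=(500, 13))
gmm = GaussianMixture(n_components=4, covariance_type="diag", random_state=0).fit(train)
test = rng.normal(size=(200, 13))
energy = np.abs(rng.normal(size=200))
print(accept_in_set(test, energy, [gmm], threshold=-20.0))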
C1 [Hansen, John H. L.; Suh, Jun-Won; Leonard, Matthew R.] Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, Ctr Robust Speech Syst, Dept Elect Engn, Richardson, TX 75080 USA.
RP Hansen, JHL (reprint author), Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, Ctr Robust Speech Syst, Dept Elect Engn, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA.
EM John.Hansen@utdallas.edu
FU AFRL [FA8750-09-C-0067]; University of Texas at Dallas from the
Distinguished University Chair in Telecommunications Engineering
FX This project was funded by AFRL through a subcontract to RADC Inc. under
FA8750-09-C-0067, and partially by the University of Texas at Dallas
from the Distinguished University Chair in Telecommunications
Engineering held by J.H.L. Hansen.
CR Akbacak M, 2007, IEEE T AUDIO SPEECH, V15, P465, DOI 10.1109/TASL.2006.881694
Angkititrakul P, 2004, IEEE ICASSP, V1, P169
Angkititrakul P, 2007, IEEE T AUDIO SPEECH, V15, P498, DOI 10.1109/TASL.2006.881689
Ariyaeeinia AM, 2006, IEE P-VIS IMAGE SIGN, V153, P618, DOI 10.1049/ip-vis:20050273
Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360
BEN M, 2003, ACOUST SPEECH SIG PR, P69
Do MN, 2003, IEEE SIGNAL PROC LET, V10, P115, DOI 10.1109/LSP.2003.809034
DODDINGTON GR, 1985, P IEEE, V73, P1651, DOI 10.1109/PROC.1985.13345
Garofolo JS, 1993, TIMIT ACOUSTIC PHONE
Hansen JH, 2005, IEEE T SPEECH AUDI P, V13, P712, DOI 10.1109/TSA.2005.852088
HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P169, DOI 10.1109/89.388143
Logan B, 2001, IEEE INT C MULTIMEDI, V0, P190
Muller C, 2005, ESTIMATING ACOUSTIC
Muller C, 2007, SPEAKER CLASSIFICATI, V1
NIST SRE: U.S. National Institute of Standards and Technology, 2011, NIST YEAR 2008 SPEAK
Prakash V, 2007, IEEE T AUDIO SPEECH, V15, P2044, DOI 10.1109/TASL.2007.902058
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Rose RC, 1994, IEEE T SPEECH AUDI P, V2, P245, DOI 10.1109/89.279273
Shao Y, 2007, INT CONF ACOUST SPEE, P277
Stahl V, 2000, INT CONF ACOUST SPEE, P1875, DOI 10.1109/ICASSP.2000.862122
Suh JW, 2012, J ACOUST SOC AM, V131, P1515, DOI 10.1121/1.3672707
Xiang B, 2003, IEEE T SPEECH AUDI P, V11, P447, DOI 10.1109/TSA.2003.815822
Hansen J.H.L, 2004, GETTING STARTED CU M
NR 23
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2013
VL 55
IS 6
BP 769
EP 781
DI 10.1016/j.specom.2013.01.006
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 165IQ
UT WOS:000320477500003
ER
PT J
AU Bayya, Y
Gowda, DN
AF Bayya, Yegnanarayana
Gowda, Dhananjaya N.
TI Spectro-temporal analysis of speech signals using zero-time windowing
and group delay function
SO SPEECH COMMUNICATION
LA English
DT Article
DE Zero-time windowing; Zero-frequency filtering; Group delay function; NGD
spectrum; HNGD spectrum
ID VOCAL-TRACT; LINEAR-PREDICTION; EXTRACTION; REPRESENTATIONS;
RECOGNITION; RESONANCES; SPECTRUM; F0
AB Traditional methods for estimating the vocal-tract system characteristics typically compute the spectrum using a window size of 20-30 ms. The resulting spectrum gives the average characteristics of the vocal-tract system within the window segment. Also, the effect of pitch harmonics needs to be countered in the process of spectrum estimation. In this paper, we propose a new approach for estimating the spectrum using a highly decaying window function. The impulse-like window function used is an approximation to an integration operation in the frequency domain, and the operation is referred to as zero-time windowing, analogous to the zero-frequency filtering operation in the frequency domain. The apparent loss in spectral resolution due to the use of a highly decaying window function is restored by successive differencing in the frequency domain. The spectral resolution is further improved by the use of the group delay function, which has an additive property over the individual resonances, in contrast to the multiplicative nature of the magnitude spectrum. The effectiveness of the proposed approach in estimating the spectrum is evaluated in terms of its robustness to additive noise and in formant estimation. (C) 2013 Elsevier B.V. All rights reserved.
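The following numpy sketch shows the standard group-delay computation that underlies this kind of analysis, applied to a frame multiplied by a generic heavily decaying window. The particular window shape, the successive frequency-domain differencing, and the HNGD refinement described in the abstract are not reproduced; the window used here is an assumption for illustration.

# Minimal numpy sketch of a group-delay spectrum from an impulse-like window.
# Uses the standard formula tau(w) = (XR*YR + XI*YI)/|X|^2 with Y the DFT of n*x[n].
import numpy as np

def group_delay_spectrum(x, nfft=1024):
    n = np.arange(len(x))
    X = np.fft.rfft(x, nfft)
    Y = np.fft.rfft(n * x, nfft)
    denom = np.abs(X) ** 2 + 1e-12               # avoid division by zero
    return (X.real * Y.real + X.imag * Y.imag) / denom

rng = np.random.default_rng(0)
frame = rng.normal(size=400)                      # stand-in for a speech frame
win = 1.0 / (1.0 + np.arange(len(frame))) ** 2    # heavily decaying window (illustrative)
tau = group_delay_spectrum(frame * win)
print(tau.shape)                                  # one value per rfft bin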
C1 [Bayya, Yegnanarayana] Int Inst Informat Technol, Hyderabad, Andhra Pradesh, India.
[Gowda, Dhananjaya N.] Aalto Univ, Dept Informat & Comp Sci, Espoo, Finland.
RP Gowda, DN (reprint author), Aalto Univ, Dept Informat & Comp Sci, Espoo, Finland.
EM dhananjaya.gowda@aalto.fi
FU Department of Information Technology, Government of India; Academy of
Finland (Finnish Centre of Excellence in Computational Inference
Research COIN) [251170]
FX The authors would like to thank the Department of Information
Technology, Government of India for supporting this activity through
sponsored research projects. The second author would also like to thank
The Academy of Finland (Finnish Centre of Excellence in Computational
Inference Research COIN, 251170) for supporting his stay in Finland as a
Postdoctoral Researcher.
CR Abe T, 2006, IEEE T AUDIO SPEECH, V14, P1292, DOI 10.1109/TSA.2005.858545
Anand M. Joseph, 2006, P INT C SPOK LANG PR, P1009
Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464
Deller J., 2000, DISCRETE TIME PROCES
Deng L, 2006, IEEE T AUDIO SPEECH, V14, P425, DOI 10.1109/TSA.2005.855841
Deng L, 2006, P INT C AC SPEECH SI, P1
Garofolo JS, 1993, TIMIT ACOUSTIC PHONE
Gianfelici F, 2007, IEEE T AUDIO SPEECH, V15, P823, DOI 10.1109/TASL.2006.889744
Kawahara H, 2008, INT CONF ACOUST SPEE, P3933, DOI 10.1109/ICASSP.2008.4518514
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
Murty KSR, 2008, IEEE T AUDIO SPEECH, V16, P1602, DOI 10.1109/TASL.2008.2004526
Oppenheim A.V, 1975, DIGIT SIGNAL PROCESS, P1
Rabiner L. R., 2010, THEORY APPL DIGITAL
Santhanam B, 2000, IEEE T COMMUN, V48, P473, DOI 10.1109/26.837050
Yegnanarayana B, 2009, IEEE T AUDIO SPEECH, V17, P614, DOI 10.1109/TASL.2008.2012194
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
Vargas J, 2008, IEEE T AUDIO SPEECH, V16, P1, DOI 10.1109/TASL.2007.907573
Welling L, 1998, IEEE T SPEECH AUDI P, V6, P36, DOI 10.1109/89.650308
YEGNANARAYANA B, 1992, IEEE T SIGNAL PROCES, V40, P2281, DOI 10.1109/78.157227
YEGNANARAYANA B, 1978, J ACOUST SOC AM, V63, P1638, DOI 10.1121/1.381864
Yegnanarayana B, 1998, IEEE T SPEECH AUDI P, V6, P313, DOI 10.1109/89.701359
NR 22
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2013
VL 55
IS 6
BP 782
EP 795
DI 10.1016/j.specom.2013.02.007
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 165IQ
UT WOS:000320477500004
ER
PT J
AU Zhang, CL
Morrison, GS
Enzinger, E
Ochoa, F
AF Zhang, Cuiling
Morrison, Geoffrey Stewart
Enzinger, Ewald
Ochoa, Felipe
TI Effects of telephone transmission on the performance of
formant-trajectory-based forensic voice comparison - Female voices
SO SPEECH COMMUNICATION
LA English
DT Article
DE Formant; Telephone; Landline; Mobile; Forensic; Validity
ID ACOUSTIC CHARACTERISTICS; ENGLISH VOWELS; SPEECH; RECOGNITION; TRACKING;
RELIABILITY
AB In forensic-voice-comparison casework a common scenario is that the suspect's voice is recorded directly using a microphone in an interview room but the offender's voice is recorded via a telephone system. Acoustic-phonetic approaches to forensic voice comparison often include analysis of vowel formants, and the second formant is often assumed to be relatively robust to telephone-transmission effects. This study assesses the effects of telephone transmission on the performance of formant-trajectory-based forensic-voice-comparison systems. The effectiveness of both human-supervised and fully-automatic formant tracking is investigated. Human-supervised formant tracking is generally considered to be more accurate and reliable but requires a substantial investment of human labor. Measurements were made of the formant trajectories of haul tokens in a database of recordings of 60 female speakers of Chinese using one human-supervised and five fully-automatic formant trackers. Measurements were made under high-quality, landline-to-landline, mobile-to-mobile, and mobile-to-landline conditions. High-quality recordings were treated as suspect samples and telephone-transmitted recordings as offender samples. Discrete cosine transforms (DCT) were fitted to the formant trajectories and likelihood ratios were calculated on the basis of the DCT coefficients. For each telephone-transmission condition the formant-trajectory system was fused with a baseline mel-frequency cepstral-coefficient (MFCC) system, and performance was assessed relative to the baseline system. The systems based on human-supervised formant measurement always outperformed the systems based on fully-automatic formant measurement; however, in conditions involving mobile telephones neither the former nor the latter type of system provided meaningful improvement over the baseline system, and even in the other conditions the high cost in skilled labor for human-supervised formant-trajectory measurement is probably not warranted given the relatively good performance that can be obtained using other less-costly procedures. (C) 2013 Elsevier B.V. All rights reserved.
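As a small illustration of the trajectory parameterisation mentioned above, the Python sketch below fits a few DCT coefficients to a synthetic second-formant trajectory and reconstructs a smoothed version from them. The trajectory, the number of coefficients, and the function names are assumptions for illustration; the likelihood-ratio modelling and fusion steps of the paper are not shown.

# Illustrative sketch: parameterise a formant trajectory with a few DCT-II
# coefficients, the kind of compact representation fed into likelihood-ratio
# computation in formant-trajectory-based forensic voice comparison.
import numpy as np
from scipy.fft import dct, idct

def dct_fit(trajectory, n_coefs=5):
    """Return the first n_coefs DCT-II coefficients of a formant trajectory."""
    return dct(trajectory, norm="ortho")[:n_coefs]

def dct_reconstruct(coefs, length):
    """Rebuild a smooth trajectory of the given length from truncated coefficients."""
    full = np.zeros(length)
    full[:len(coefs)] = coefs
    return idct(full, norm="ortho")

t = np.linspace(0.0, 1.0, 40)
f2 = 1800.0 + 300.0 * np.sin(np.pi * t)           # synthetic F2 trajectory (Hz)
coefs = dct_fit(f2, n_coefs=5)                    # per-token feature vector
approx = dct_reconstruct(coefs, len(f2))
print(coefs.round(1), float(np.max(np.abs(f2 - approx))))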
C1 [Zhang, Cuiling] China Criminal Police Univ, Dept Forens Sci & Technol, Shenyang 110854, Liaoning, Peoples R China.
[Zhang, Cuiling; Morrison, Geoffrey Stewart; Enzinger, Ewald; Ochoa, Felipe] Univ New S Wales, Sch Elect Engn & Telecommun, UNSW Sydney, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia.
RP Morrison, GS (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, UNSW Sydney, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia.
EM geoff-morrison@forensic-voice-comparison.net
FU Australian Research Council; Australian Federal Police; New South Wales
Police; Queensland Police; National Institute of Forensic Science;
Australasian Speech Science and Technology Association; Guardia Civil
through Linkage Project [LP100200142]; China Scholarship Council;
Ministry of Education of the People's Republic of China [NCET-11-0836];
International Association of Forensic Phonetics and Acoustics Research
Grant
FX This research received support from multiple sources, including the
following: Australian Research Council, Australian Federal Police, New
South Wales Police, Queensland Police, National Institute of Forensic
Science, Australasian Speech Science and Technology Association, and the
Guardia Civil through Linkage Project LP100200142; China Scholarship
Council State-Sponsored Scholarship Program for Visiting Scholars;
Ministry of Education of the People's Republic of China "Program for New
Century Excellent Talents in University" (NCET-11-0836); International
Association of Forensic Phonetics and Acoustics Research Grant. Thanks
to Terrance M. Nearey for providing clarifications on some of the
algorithmic details of NAH2002 and FORMANTMEASURER. Unless otherwise
explicitly attributed, the opinions expressed are those of the authors
and do not necessarily represent the policies or opinions of any of the
above mentioned organizations or individuals. Earlier versions of this
paper were presented at the Special Session on Forensic Acoustics at the
162nd Meeting of the Acoustical Society of America, San Diego, November
2011 [J. Acoust. Soc. Amer. 130, 2519, doi: 10.1121/1.3655044]; at the
21st Annual Conference of the International Association for Forensic
Phonetics and Acoustics, Santander, August 2012; and at the UNSW
Forensic Speech Science Conference, Sydney, December 2012.
CR Aitken CGG, 2004, J ROY STAT SOC C-APP, V53, P109, DOI 10.1046/j.0035-9254.2003.05271.x
Aitken CGG, 2004, J ROY STAT SOC C-APP, V53, P665, DOI 10.1111/j.1467-9876.2004.02031.x
Anderson N, 1978, MODERN SPECTRUM ANAL, P252
ASSMANN PF, 1987, J ACOUST SOC AM, V81, P520, DOI 10.1121/1.394918
Boersma P., 2011, PRAAT DOING PHONETIC
Boersma P., 1993, P I PHONETIC SCI, V17, P97
Brummer N, 2006, COMPUT SPEECH LANG, V20, P230, DOI 10.1016/j.csl.2005.08.001
BRUMMER N, 2005, FOCAL TOOLBOX TOOLS
Byrne C, 2004, INT J SPEECH LANG LA, V11, P83, DOI 10.1558/sll.2004.11.1.83
CHEN NF, 2009, P INT 2009 INT SPEEC, P2203
de Castro A, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P2343
Deng L, 2007, IEEE T AUDIO SPEECH, V15, P13, DOI 10.1109/TASL.2006.876724
Duckworth M, 2011, INT J SPEECH LANG LA, V18, P35, DOI 10.1558/ijsll.v18i1.35
FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788
Gold E., 2011, INT J SPEECH LANG LA, V18, P143, DOI 10.1558/ijsll.v18i2.293
Gonzalez-Rodriguez J, 2007, IEEE T AUDIO SPEECH, V15, P2104, DOI 10.1109/TASL.2007.902747
Gonzalez-Rodriguez J., 2011, P INT 2011 INT SPEEC, P133
Guillemin BJ, 2008, INT J SPEECH LANG LA, V15, P193, DOI 10.1558/ijsll.v15i2.193
HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872
Kondaurova MV, 2012, J ACOUST SOC AM, V132, P1039, DOI 10.1121/1.4728169
Kunzel HJ, 2002, FORENSIC LINGUIST, V9, P83
Kunzel H.J., 2001, FORENSIC LINGUIST, V8, P80
Lawrence S, 2008, INT J SPEECH LANG LA, V15, P161, DOI 10.1558/ijsll.v15i2.161
Markel JD, 1976, LINEAR PREDICTION SP
Morrison G, 2010, SOUNDLABELLER ERGONO
Morrison G., 2013, VOWEL INHERENT SPECT, P263, DOI 10.1007/978-3-642-14209-3_11
Morrison G, 2009, ROBUST VERSION TRAIN
Morrison GS, 2011, SCI JUSTICE, V51, P91, DOI 10.1016/j.scijus.2011.03.002
Morrison GS, 2012, AUST J FORENSIC SCI, V44, P155, DOI 10.1080/00450618.2011.630412
Morrison GS, 2009, J ACOUST SOC AM, V125, P2387, DOI 10.1121/1.3081384
Morrison GS, 2011, SPEECH COMMUN, V53, P242, DOI 10.1016/j.specom.2010.09.005
MORRISON GS, 2007, FORENSIC LIKELIHOOD
Morrison GS, 2013, AUST J FORENSIC SCI, V45, P173, DOI 10.1080/00450618.2012.733025
Mustafa K, 2006, IEEE T AUDIO SPEECH, V14, P435, DOI 10.1109/TSA.2005.855840
Nearey T. M., 2002, J ACOUST SOC AM, V112, P2323
Nolan F, 2002, FORENSIC LINGUIST, V9, P74, DOI 10.1558/sll.2002.9.1.74
OLIVE JP, 1971, J ACOUST SOC AM, V50, P661, DOI 10.1121/1.1912681
Pelecanos J., 2001, P SPEAK OD SPEAK REC, P213
Pigeon S, 2000, DIGIT SIGNAL PROCESS, V10, P237, DOI 10.1006/dspr.1999.0358
Remez RE, 2011, J ACOUST SOC AM, V130, P2173, DOI 10.1121/1.3631667
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
ROSE P, 2003, EXPERT EVIDENCE
Rudoy D., 2010, THESIS HARVARD U CAM
Rudoy D., 2007, P INT, P526
SCHAFER RW, 1970, J ACOUST SOC AM, V47, P634, DOI 10.1121/1.1911939
SJOLANDER K, 2011, WAVESURFER VERSION 1
Sjolander K., 2000, P ICSLP, P464
Talkin D., 1987, J ACOUST SOC AM S, V82, pS55
Thomson RI, 2009, J ACOUST SOC AM, V126, P1447, DOI 10.1121/1.3177260
Vallabha GK, 2002, SPEECH COMMUN, V38, P141, DOI 10.1016/S0167-6393(01)00049-8
van Leeuwen David A, 2007, Speaker Classification I. Fundamentals, Features, and Methods. (Lecture Notes in Artificial Intelligence vol. 4343), DOI 10.1007/978-3-540-74200-5_19
Xue SA, 2006, J VOICE, V20, P391, DOI 10.1016/j.jvoice.2005.05.001
Zhang C., 2011, FORENSIC DATABASE AU
Zhang C, 2011, P 17 INT C PHON SCI, P2280
ZHANG C, 2012, HUMAN SUPERVISED FUL
Zhang CL, 2013, J ACOUST SOC AM, V133, pEL54, DOI 10.1121/1.4773223
NR 56
TC 4
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2013
VL 55
IS 6
BP 796
EP 813
DI 10.1016/j.specom.2013.01.011
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 165IQ
UT WOS:000320477500005
ER
PT J
AU Pardede, HF
Iwano, K
Shinoda, K
AF Pardede, Hilman F.
Iwano, Koji
Shinoda, Koichi
TI Feature normalization based on non-extensive statistics for speech
recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Robust speech recognition; Normalization; q-Logarithm; Non-extensive
statistics
ID VECTOR TAYLOR-SERIES; NONEXTENSIVE STATISTICS; CROSS-TERMS; NOISE;
MODEL; ENTROPY; ENHANCEMENT; ENVIRONMENT; CALCULUS; SPECTRA
AB Most compensation methods used to improve the robustness of speech recognition systems in noisy environments, such as spectral subtraction, CMN, and MVN, rely on the assumption that noise and speech spectra are independent. However, the use of a finite analysis window in signal processing may introduce a cross-term between them, which degrades speech recognition accuracy. To tackle this problem, we introduce the q-logarithmic (q-log) spectral domain of non-extensive statistics and propose q-log spectral mean normalization (q-LSMN), which is an extension of log spectral mean normalization (LSMN) to this domain. Recognition experiments on a synthesized noisy speech database, the Aurora-2 database, showed that q-LSMN was consistently better than the conventional normalization methods CMN, LSMN, and MVN. Furthermore, q-LSMN was even more effective when applied to a real noisy environment in the CENSREC-2 database, where it significantly outperformed the ETSI AFE front-end. (C) 2013 Elsevier B.V. All rights reserved.
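The q-logarithm of non-extensive (Tsallis) statistics, ln_q(x) = (x^(1-q) - 1)/(1-q), is a standard formula; the sketch below applies it to a spectral representation and removes the per-channel utterance mean, in the spirit of the proposed q-LSMN. The choice of q, the use of raw power spectra, and the flooring constant are illustrative assumptions rather than the paper's exact configuration.

# Simplified sketch of the q-logarithm and a mean normalisation in the q-log
# spectral domain. ln_q(x) reduces to the ordinary logarithm as q -> 1.
import numpy as np

def q_log(x, q):
    x = np.asarray(x, dtype=float)
    if abs(q - 1.0) < 1e-8:
        return np.log(x)                       # ordinary log in the limit q -> 1
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def q_lsmn(power_spectra, q=0.9):
    """Subtract the per-channel utterance mean in the q-log spectral domain.
    power_spectra: (frames, bins) array of non-negative spectral values."""
    ql = q_log(power_spectra + 1e-12, q)
    return ql - ql.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
spec = rng.gamma(shape=2.0, scale=1.0, size=(100, 20))   # stand-in for noisy spectra
norm = q_lsmn(spec, q=0.9)
print(norm.mean(axis=0).round(6))                        # ~0 per channel after normalisation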
C1 [Pardede, Hilman F.; Shinoda, Koichi] Tokyo Inst Technol, Dept Comp Sci, Grad Sch Informat Sci & Engn, Meguro Ku, Tokyo 1528552, Japan.
[Iwano, Koji] Tokyo City Univ, Fac Environm & Informat Studies, Tsuzuki Ku, Yokohama, Kanagawa 2248551, Japan.
RP Pardede, HF (reprint author), Tokyo Inst Technol, Dept Comp Sci, Grad Sch Informat Sci & Engn, Meguro Ku, Ookayama 2-12-1, Tokyo 1528552, Japan.
EM hilman@ks.cs.titech.ac.jp
RI Shinoda, Koichi/D-3198-2014
OI Shinoda, Koichi/0000-0003-1095-3203
FU [24650079]
FX This work is supported by Grant in Aid for Challenging Exploratory
Research No. 24650079.
CR Acero A., 2000, P ICSLP, P869
Agarwal A., 1999, P IEEE WORKSH AUT SP, P12
Avendano C, 1997, IEEE T SPEECH AUDI P, V5, P372, DOI 10.1109/89.593318
Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208
Bezerianos A, 2003, ANN BIOMED ENG, V31, P221, DOI 10.1114/1.1541013
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Borges EP, 2004, PHYSICA A, V340, P95, DOI 10.1016/j.physa.2004.03.082
Cohen I, 2005, IEEE T SPEECH AUDI P, V13, P870, DOI 10.1109/TSA.2005.851940
Deng L, 2004, IEEE T SPEECH AUDI P, V12, P133, DOI 10.1109/TSA.2003.820201
Doblinger G., 1995, P 4 EUR C SPEECH COM, P1513
ETSI standard doc, 2002, 2002050 ETSI ES
Evans N., 2006, P IEEE INT C AC SPEE, V1, P1520
Faubel F, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P553
FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530
Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929
Gradojevic N, 2011, IEEE SIGNAL PROC MAG, V28, P116, DOI 10.1109/MSP.2011.941843
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Hirsch H.-G, 2000, P ISCA ITRW ASR2000, P181
HIRSCH HG, 1995, INT CONF ACOUST SPEE, P153, DOI 10.1109/ICASSP.1995.479387
Itahashi S., 1990, P INT C SPOKEN LANGU, P1081
Ito Y., 2000, P INTERSPEECH, P530
JEONG JC, 1992, IEEE T SIGNAL PROCES, V40, P2608, DOI 10.1109/78.157305
Jiulin D., 2007, ASTROPHYS SPACE SCI, V312, P47, DOI 10.1007/s10509-007-9611-8
KADAMBE S, 1992, IEEE T SIGNAL PROCES, V40, P2498, DOI 10.1109/78.157292
Kim C, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P28
Kim DY, 1998, SPEECH COMMUN, V24, P39, DOI 10.1016/S0167-6393(97)00061-7
KOBAYASHI T, 1984, IEEE T ACOUST SPEECH, V32, P1087, DOI 10.1109/TASSP.1984.1164416
Li JY, 2009, COMPUT SPEECH LANG, V23, P389, DOI 10.1016/j.csl.2009.02.001
LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z
Mauuary L., 1996, P EUSIPCO
McAuley J, 2005, IEEE T SPEECH AUDI P, V13, P956, DOI 10.1109/TSA.2005.851952
Ming J, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1061
Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225
Moret MA, 2011, PHYSICA A, V390, P3055, DOI 10.1016/j.physa.2011.04.008
Nakamura S, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2330
Nivanen L, 2003, REP MATH PHYS, V52, P437, DOI 10.1016/S0034-4877(03)80040-X
Olemskoi A, 2010, EPL-EUROPHYS LETT, V89, DOI 10.1209/0295-5075/89/50007
Pardede H.F., 2011, P INTERSPEECH, P1645
Plastino AR, 2004, ASTROPHYS SPACE SCI, V290, P275, DOI 10.1023/B:ASTR.0000032529.67037.21
Rufiner HL, 2004, PHYSICA A, V332, P496, DOI 10.1016/j.physa.2003.09.050
SCHROEDER MR, 1979, J ACOUST SOC AM, V66, P1647, DOI 10.1121/1.383662
TSALLIS C, 1988, J STAT PHYS, V52, P479, DOI 10.1007/BF01016429
Viikki O, 1998, SPEECH COMMUN, V25, P133, DOI 10.1016/S0167-6393(98)00033-8
Weili S., 2009, P INT C MECH AUT, P1004
Wilk G, 2002, PHYSICA A, V305, P227, DOI 10.1016/S0378-4371(01)00666-5
Zhang YD, 2008, SENSORS-BASEL, V8, P7518, DOI 10.3390/s8117518
Zhu QF, 2002, IEEE SIGNAL PROC LET, V9, P275, DOI 10.1109/LSP.2002.801722
NR 47
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2013
VL 55
IS 5
BP 587
EP 599
DI 10.1016/j.specom.2013.02.004
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 141RS
UT WOS:000318744800001
ER
PT J
AU Lander, K
Capek, C
AF Lander, Karen
Capek, Cheryl
TI Investigating the impact of lip visibility and talking style on
speechreading performance
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speechreading; Speechreadability; Lip visibility; Speaking style
ID AUDIOVISUAL SPEECH-PERCEPTION; CONVERSATIONAL SPEECH; CLEAR SPEECH;
HEARING; FACE; INTELLIGIBILITY; LISTENERS
AB It has long been known that visual information from a talker's mouth and face plays an important role in the perception and understanding of spoken language. The reported experiments explore the impact of lip visibility (Experiments 1 & 2) and speaking style (Experiment 2) on talker speechreadability. Specifically we compare speechreading performance (words in Experiment 1; sentences in Experiment 2 with low level auditory input) from talkers with natural lips, with brightly coloured lips and with concealed lips. Results reveal that highlighting the lip area by the application of lipstick or concealer improves speechreading, relative to natural lips. Furthermore, speaking in a clear (rather than conversational) manner improves speechreading performance, with no interaction between lip visibility and speaking style. Results are discussed in relation to practical methods of improving speechreading and in relation to attention and movement parameters. (C) 2013 Elsevier B.V. All rights reserved.
C1 [Lander, Karen; Capek, Cheryl] Univ Manchester, Sch Psychol Sci, Manchester M13 9PL, Lancs, England.
RP Lander, K (reprint author), Univ Manchester, Sch Psychol Sci, Oxford Rd, Manchester M13 9PL, Lancs, England.
EM karen.lander@manchester.ac.uk
CR Auer ET, 2010, J AM ACAD AUDIOL, V21, P163, DOI 10.3766/jaaa.21.3.4
Bench J., 1979, SPEECH HEARING TESTS
Bernstein LE, 2001, J SPEECH LANG HEAR R, V44, P5, DOI 10.1044/1092-4388(2001/001)
Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837
DeCarlo D, 2000, INT J COMPUT VISION, V38, P99, DOI 10.1023/A:1008122917811
DEMOREST ME, 1992, J SPEECH HEAR RES, V35, P876
Ferguson SH, 2002, J ACOUST SOC AM, V112, P259, DOI 10.1121/1.1482078
Gagne JP, 2002, SPEECH COMMUN, V37, P213, DOI 10.1016/S0167-6393(01)00012-7
Helfer KS, 1997, J SPEECH LANG HEAR R, V40, P432
IJSSELDIJK FJ, 1992, J SPEECH HEAR RES, V35, P466
Irwin A, 2011, SPEECH COMMUN, V53, P807, DOI 10.1016/j.specom.2011.01.010
Krause J.C., 2004, J ACOUST SOC AM, V115, P363
KRICOS PB, 1985, VOLTA REV, V87, P5
Lander K, 2008, Q J EXP PSYCHOL, V61, P961, DOI 10.1080/17470210801908476
Lansing L.R., 2003, PERCEPT PSYCHOPHYS, V65, P536
MARASSA LK, 1995, J SPEECH HEAR RES, V38, P1387
Massaro D. W., 1998, PERCEIVING TALKING F
MASSARO DW, 1993, PERCEPT PSYCHOPHYS, V53, P549, DOI 10.3758/BF03205203
McGrath M., 1985, THESIS U NOTTINGHAM
Mills A. E., 1987, HEARING EYE PSYCHOL, P145
Munhall KG, 1998, HEARING EYE, P123
PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96
Preminger JE, 1998, J SPEECH LANG HEAR R, V41, P564
Reisberg D., 1987, HEARING EYE PSYCHOL
Rosenblum LD, 1996, J SPEECH HEAR RES, V39, P1159
Smiljanic R., 2009, LANGUAGE LINGUISTICS, V3, P236, DOI 10.1111/J.1749-818X.2008.00112.X
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
SUMMERFIELD Q, 1992, PHILOS T ROY SOC B, V335, P71, DOI 10.1098/rstb.1992.0009
Valentine G, 2008, SOC CULT GEOGR, V9, P469, DOI 10.1080/14649360802175691
Vatikiotis-Bateson E, 1998, PERCEPT PSYCHOPHYS, V60, P926, DOI 10.3758/BF03211929
Vogt M., 1997, P ESCA WORKSH AUD VI
NR 31
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2013
VL 55
IS 5
BP 600
EP 605
DI 10.1016/j.specom.2013.01.003
PG 6
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 141RS
UT WOS:000318744800002
ER
PT J
AU Maia, R
Akamine, M
Gales, MJF
AF Maia, Ranniery
Akamine, Masami
Gales, Mark J. F.
TI Complex cepstrum for statistical parametric speech synthesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech synthesis; Statistical parametric speech synthesis; Spectral
analysis; Cepstral analysis; Complex cepstrum; Glottal source models
ID ALGORITHM
AB Statistical parametric synthesizers have typically relied on a simplified model of speech production. In this model, speech is generated using a minimum-phase filter, implemented from coefficients derived from spectral parameters, driven by a zero or random phase excitation signal. This excitation signal is usually constructed from fundamental frequencies and parameters used to control the balance between the periodicity and aperiodicity of the signal. The application of this approach to statistical parametric synthesis has partly been motivated by speech coding theory. However, in contrast to most real-time speech coders, parametric speech synthesizers do not require causality. This allows the standard simplified model to be extended to represent the natural mixed-phase characteristics of speech signals. This paper proposes the use of the complex cepstrum to model the mixed phase characteristics of speech through the incorporation of phase information in statistical parametric synthesis. The phase information is contained in the anti-causal portion of the complex cepstrum. These parameters have a direct connection with the shape of the glottal pulse of the excitation signal. Phase parameters are extracted on a frame-basis and are modeled in the same fashion as the minimum-phase synthesis filter parameters. At synthesis time, phase parameter trajectories are generated and used to modify the excitation signal. Experimental results show that the use of such complex cepstrum-based phase features results in better synthesized speech quality. Listening test results yield an average preference of 60% for the system with the proposed phase feature on both female and male voices. (C) 2013 Elsevier B.V. All rights reserved.
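A minimal numpy sketch of the textbook complex-cepstrum computation is given below: log magnitude plus unwrapped phase, inverse-transformed to the quefrency domain, with the anti-causal part separated out. Linear-phase handling and the frame-wise parameter extraction and modelling used in the paper are omitted; the frame and FFT size are illustrative.

# Minimal complex-cepstrum sketch. The anti-causal (negative-quefrency) part is
# the portion the abstract relates to mixed-phase / glottal information.
import numpy as np

def complex_cepstrum(x, nfft=1024):
    X = np.fft.fft(x, nfft)
    log_X = np.log(np.abs(X) + 1e-12) + 1j * np.unwrap(np.angle(X))
    ccep = np.real(np.fft.ifft(log_X))
    causal = ccep[:nfft // 2]            # minimum-phase-related part
    anticausal = ccep[nfft // 2:]        # anti-causal (mixed-phase) part
    return causal, anticausal

rng = np.random.default_rng(0)
frame = rng.normal(size=512) * np.hanning(512)   # stand-in for a windowed speech frame
causal, anticausal = complex_cepstrum(frame)
print(causal.shape, anticausal.shape)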
C1 [Maia, Ranniery; Gales, Mark J. F.] Toshiba Res Europe Ltd, Cambridge Res Lab, Cambridge CB4 0GZ, England.
[Akamine, Masami] Toshiba Co Ltd, Ctr Corp Res & Dev, Saiwai Ku, Kawasaki, Kanagawa 2128582, Japan.
RP Maia, R (reprint author), Toshiba Res Europe Ltd, Cambridge Res Lab, 208 Cambridge Sci Pk,Milton Rd, Cambridge CB4 0GZ, England.
EM ranniery.maia@crl.toshiba.co.uk; masa.akamine@toshiba.co.jp;
mjfg@crl.toshiba.co.uk
CR BEDNAR JB, 1985, IEEE T ACOUST SPEECH, V33, P1014, DOI 10.1109/TASSP.1985.1164655
Bhanu B., 1980, IEEE T ACOUSTICS SPE, P583
Buchholz S., 2011, P INT, P3053
Buchholz S., 2007, TOSHIBA ENTRY 2007 B
Cabral J., 2007, P 6 ISCA SPEECH SYNT, P113
Chu WC, 2003, SPEECH CODING ALGORI
Deller J., 2000, DISCRETE TIME PROCES
Drugman T., 2009, P INTERSPEECH, P1779
Drugman T, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P657
Drugman T, 2011, SPEECH COMMUN, V53, P855, DOI 10.1016/j.specom.2011.02.004
Jackson PJB, 2001, IEEE T SPEECH AUDI P, V9, P713, DOI 10.1109/89.952489
Kawahara H., 2001, P MAVEBA, P13
Maia R., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288938
Maia R., 2007, P ISCA SSW6, P131
Oppenheim Alan V., 2010, DISCRETE TIME SIGNAL, V3rd
QUATIERI TF, 1979, IEEE T ACOUST SPEECH, V27, P328, DOI 10.1109/TASSP.1979.1163252
Raitio T, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1881
SAMPA, COMPUTER READABLE PH
Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816
Tokuda K., 1994, P INT C SPOK LANG PR, P1043
TRIBOLET JM, 1977, IEEE T ACOUST SPEECH, V25, P170, DOI 10.1109/TASSP.1977.1162923
VERHELST W, 1986, IEEE T ACOUST SPEECH, V34, P43, DOI 10.1109/TASSP.1986.1164787
Vondra M, 2011, LECT NOTES COMPUT SC, V6456, P324, DOI 10.1007/978-3-642-18184-9_27
Yamagishi J., 2010, THE CSTR EMIME HTS S
Yoshimura T., 2001, P EUROSPEECH, P2263
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325
NR 27
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2013
VL 55
IS 5
BP 606
EP 618
DI 10.1016/j.specom.2012.12.008
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 141RS
UT WOS:000318744800003
ER
PT J
AU Xia, BY
Bao, CC
AF Xia, Bingyin
Bao, Changchun
TI Compressed domain speech enhancement method based on ITU-T G.722.2
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Compressed domain; CELP; G.722.2; Parameter
modification
AB Based on the bit-stream of the ITU-T G.722.2 speech coding standard, a compressed-domain speech enhancement method that works through the modification of codebook gains in the codec and is compatible with the discontinuous transmission (DTX) mode and frame erasure conditions is proposed in this paper. In non-DTX mode, Voice Activity Detection (VAD) is carried out in the compressed domain, and the background noise is classified into full-band distributed noise and low-frequency distributed noise. Then, the noise intensity is estimated based on the algebraic codebook power, and the a priori SNR is estimated according to the noise type. Next, the codebook gains are jointly modified under the rule of energy compensation. In particular, an adaptive comb filter is adopted to remove residual noise from the excitation signal under low-frequency distributed noise. Finally, the modified codebook gains are re-quantized in the speech or excitation domain. For non-speech frames in DTX mode, the logarithmic frame energy is attenuated to remove the noise, while the spectral envelope is kept unchanged. When frame erasure occurs, the recovered algebraic codebook gain is exponentially attenuated and, based on the reconstructed algebraic codebook vector, all the codec parameters are re-quantized to form the error-concealed bit-stream. Performance evaluation under ITU-T G.160 shows that, with much lower computational complexity, the proposed method achieves better noise reduction, SNR improvement, and objective speech quality than state-of-the-art compressed-domain methods. A subjective speech quality test shows that the speech quality of the proposed method is better than that of a method that modifies only the algebraic codebook gain, and similar to that of a method assisted by linear-domain speech enhancement. (C) 2013 Elsevier B.V. All rights reserved.
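The following Python sketch illustrates, at a conceptual level only, how a codebook gain could be attenuated using a decision-directed a priori SNR estimate and a Wiener-like factor. It is not an ITU-T G.722.2 implementation: the gain values, smoothing constant, and the mapping from SNR to gain are assumptions for illustration, and bit-stream parsing and re-quantisation are not shown.

# Conceptual sketch of compressed-domain gain modification: estimate an a priori
# SNR with the decision-directed rule and scale a frame's algebraic-codebook gain.
import numpy as np

def decision_directed_snr(frame_power, noise_power, prev_clean_power, alpha=0.98):
    """Classic decision-directed a priori SNR estimate (Ephraim-Malah style)."""
    post_snr = frame_power / (noise_power + 1e-12)
    return alpha * prev_clean_power / (noise_power + 1e-12) + \
           (1.0 - alpha) * max(post_snr - 1.0, 0.0)

def modified_codebook_gain(gain, snr_prio):
    """Scale the algebraic-codebook gain by a Wiener-like attenuation factor."""
    wiener = snr_prio / (1.0 + snr_prio)
    return gain * np.sqrt(wiener)

g_c = 1.2                          # decoded algebraic codebook gain (illustrative)
snr = decision_directed_snr(frame_power=4.0, noise_power=1.0, prev_clean_power=2.5)
print(round(float(modified_codebook_gain(g_c, snr)), 3))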
C1 [Xia, Bingyin; Bao, Changchun] Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China.
RP Bao, CC (reprint author), Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China.
EM baochch@bjut.edu.cn
FU Beijing Natural Science Foundation Program; Scientific Research Key
Program of Beijing Municipal Commission of Education [KZ201110005005];
Funding Project for Academic Human Resources Development in Institutions
of Higher Learning under the Jurisdiction of Beijing Municipality;
Postgraduate Science Foundation of Beijing University of Technology
[ykj-2012-7284]; Huawei Technologies Co., Ltd.
FX This work was supported by the Beijing Natural Science Foundation
Program and Scientific Research Key Program of Beijing Municipal
Commission of Education (No. KZ201110005005), the Funding Project for
Academic Human Resources Development in Institutions of Higher Learning
under the Jurisdiction of Beijing Municipality, the 10th Postgraduate
Science Foundation of Beijing University of Technology (ykj-2012-7284),
and Huawei Technologies Co., Ltd.
CR [Anonymous], 2001, P862 ITUT
Chandran R, 2000, PROCEEDINGS OF THE 43RD IEEE MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOLS I-III, P10, DOI 10.1109/MWSCAS.2000.951575
Duetsch N., 2004, P 5 ITG FACHB, P357
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Fapi E.T., 2008, P 7 INT C SOURC CHAN
ITU-T, 2002, G7222 ITUT
ITU-T, 2002, G7222 ITUT G
ITU-T, 2003, G7222 ITUT G
ITU-T, 2008, G160 ITUT
ITU-T (Telecommunication Standardization Sector International Telecommunication Union), 2005, G191 ITUT
Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929
MARTIN R, 1994, P EUSPICO, V2, P1182
Schroeder M.R., 1985, P IEEE INT C AC SPEE, V3, P937
Sukkar R.A., 2006, United States Patent Application, Patent No. [US 2006/0217970 A1, 20060217970]
Taddei H., 2004, P IEEE INT C AC SPEE, V1, P1497
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
NR 16
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2013
VL 55
IS 5
BP 619
EP 640
DI 10.1016/j.specom.2013.02.001
PG 22
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 141RS
UT WOS:000318744800004
ER
PT J
AU Daqrouq, K
Al Azzawi, KY
AF Daqrouq, K.
Al Azzawi, K. Y.
TI Arabic vowels recognition based on wavelet average framing linear
prediction coding and neural network
SO SPEECH COMMUNICATION
LA English
DT Article
DE Arabic vowel; LPC; Average framing; Wavelet; Probabilistic neural
network
ID SPEECH RECOGNITION; ROBUSTNESS; ALGORITHM; ENTROPY
AB In this work, an average framing linear prediction coding (AFLPC) technique for a speaker-independent Arabic vowel recognition system is proposed. Linear prediction coding (LPC) has been applied in many speech recognition applications; in this study, however, the combination of a modified LPC, termed AFLPC, with the wavelet transform (WT) is proposed for vowel recognition. The investigation procedure was based on feature extraction and classification. In the feature extraction stage, the distinguishing vocal-tract resonance characteristics of Arabic vowels were extracted using the AFLPC technique. An LPC order of 30 was found to give the best system performance. In the classification phase, a probabilistic neural network (PNN) was applied because of its rapid response and ease of implementation. In the practical investigation, the performances of different wavelet transforms in conjunction with AFLPC were compared with one another. In addition, the capability of the proposed system was examined by comparison with other systems proposed in the recent literature. According to our experimental results, the PNN classifier achieved the best recognition rate with the discrete wavelet transform combined with AFLPC as the feature extraction method (termed LPCDWTF). (C) 2013 Elsevier B.V. All rights reserved.
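The "average framing" idea, computing LPC coefficients frame by frame and averaging them into one feature vector per token, can be sketched as below in Python. The frame length, hop, LPC order, and function names are illustrative assumptions; the wavelet subband decomposition and the PNN classifier used in the paper are not reproduced.

# Simplified sketch of average-framing LPC: per-frame autocorrelation-method LPC
# coefficients are averaged into a single feature vector for a vowel token.
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc(frame, order):
    """Autocorrelation-method LPC coefficients (without the leading 1)."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:len(frame) + order]
    return solve_toeplitz(r[:order], r[1:order + 1])

def average_framing_lpc(signal, frame_len=320, hop=160, order=12):
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len + 1, hop)]
    coeffs = np.array([lpc(f * np.hamming(frame_len), order) for f in frames])
    return coeffs.mean(axis=0)          # one averaged LPC vector per token

rng = np.random.default_rng(0)
token = rng.normal(size=4000)           # stand-in for a recorded vowel token
print(average_framing_lpc(token).shape) # (12,)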
C1 [Daqrouq, K.] King Abdulaziz Univ, Elect & Comp Eng Dept, Jeddah 21413, Saudi Arabia.
[Al Azzawi, K. Y.] Univ Technol Baghdad, Electromech Engn Dept, Baghdad, Iraq.
RP Daqrouq, K (reprint author), King Abdulaziz Univ, Elect & Comp Eng Dept, Jeddah 21413, Saudi Arabia.
EM haleddaq@yahoo.com
RI Daqrouq, Khaled/K-1293-2012
CR Abu-Rabia A., 1999, J PSYCHOLINGUIST RES, V28, P93
Alghamdi M., 1998, J KING SAUD U, V10, P3
Alotaibi Y., 2009, P 1 INT C DEC FGIT J, P10
Alotaibi Y., 2009, P BIOID MULTICOMM MA
Alotaibi YA, 2005, INFORM SCIENCES, V173, P115, DOI 10.1016/j.ins.2004.07.008
Amrouche A., 2009, ENG APPL ARTIFICIAL
Amrouche A, 2003, Proceedings of the 46th IEEE International Midwest Symposium on Circuits & Systems, Vols 1-3, P689
Anani M., 1999, P 14 INT C PHON SCI, V9, P2117
Andrianopoulos MV, 2001, J VOICE, V15, P194, DOI 10.1016/S0892-1997(01)00021-2
Atal BS, 2006, IEEE SIGNAL PROC MAG, V23, P154, DOI 10.1109/MSP.2006.1598091
Avci D, 2009, EXPERT SYST APPL, V36, P6295, DOI 10.1016/j.eswa.2008.07.012
Avci E., 2006, EXPERT SYST APPL, V33, P582
Avci E, 2007, EXPERT SYST APPL, V32, P485, DOI 10.1016/j.eswa.2005.12.004
Cherif A, 2001, APPL ACOUST, V62, P1129, DOI 10.1016/S0003-682X(01)00007-X
Daqrouq K, 2011, ENG APPL ARTIF INTEL, V24, P796, DOI 10.1016/j.engappai.2011.01.001
Daqrouq K., 2009, INT J INFORM SCI COM, V1
Daqrouq Khaled, 2010, International Journal of Speech Technology, V13, DOI 10.1007/s10772-010-9073-1
DAUBECHIES I, 1988, COMMUN PUR APPL MATH, V41, P909, DOI 10.1002/cpa.3160410705
Delac K, 2009, IMAGE VISION COMPUT, V27, P1108, DOI 10.1016/j.imavis.2008.10.007
Engin A., 2007, EXPERT SYSTEMS APPL, V32, P485
GOWDY JN, 2000, ACOUST SPEECH SIG PR, P1351
Hachkar Z., 2011, MACHINE LEARNING PAT, V2
Hermansky H., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319236
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Jongman A, 2011, J PHONETICS, V39, P85, DOI 10.1016/j.wocn.2010.11.007
Junqua JC, 1994, IEEE T SPEECH AUDI P, V2, P406, DOI 10.1109/89.294354
Karray L, 2003, SPEECH COMMUN, V40, P261, DOI 10.1016/S0167-6393(02)00066-3
Kirchoff K., 2002, TECHNICAL REPORT
Kirschhoff K., 2003, P INT C ASSP ICASSP, P344
Kotnik B., 2003, P IEEE EUROCON 2003, P131
Lazli L, 2003, LECT NOTES ARTIF INT, V2734, P379
Lee S, 2008, CLIN LINGUIST PHONET, V22, P523, DOI 10.1080/02699200801945120
Lei Z., 2005, CIRC SYST SIGNAL PRO, V24, P287, DOI 10.1007/s00034-004-0529-x
MACKOWIAK PA, 1992, JAMA-J AM MED ASSOC, V268, P1578, DOI 10.1001/jama.268.12.1578
Mallat S., 1998, WAVELET TOUR SIGNAL
MALLAT SG, 1989, IEEE T PATTERN ANAL, V11, P674, DOI 10.1109/34.192463
Mayo R., 1995, J NATL BLACK ASS SPE, V17, P32
Mokbel C., 1995, EUR C SPEECH COMM TE, P141
Mokbel C, 1997, SPEECH COMMUN, V23, P141, DOI 10.1016/S0167-6393(97)00042-3
Natour Y.S., 2010, J VOICE, V25, pe75
Saeed K., 2005, P IEEE 7 INT C DSPA, P528
Saeed K, 2005, INFORMATION PROCESSING AND SECURITY SYSTEMS, P55, DOI 10.1007/0-387-26325-X_6
Selouani SA, 2001, ISSPA 2001: SIXTH INTERNATIONAL SYMPOSIUM ON SIGNAL PROCESSING AND ITS APPLICATIONS, VOLS 1 AND 2, PROCEEDINGS, P719
Titze IR, 1995, WORKSH AC VOIC AN SU
Tryon WW, 2001, PSYCHOL METHODS, V6, P371, DOI 10.1037//1082-989X.6.4.371
Tufekci Z., 2000, Proceedings of the IEEE SoutheastCon 2000. `Preparing for The New Millennium' (Cat. No.00CH37105), DOI 10.1109/SECON.2000.845444
Uchida S, 2002, INT C PATT RECOG, P572
VISHWANATH M, 1994, IEEE T SIGNAL PROCES, V42, P673, DOI 10.1109/78.277863
Wu J.-D., 2009, SPEAKER IDENTIFICATI
Wu J.-D., 2009, EXPERT SYSTEMS APPL
Xue SA, 2006, CLIN LINGUIST PHONET, V20, P691, DOI 10.1080/02699200500297716
Zitouni I, 2009, COMPUT SPEECH LANG, V23, P257, DOI 10.1016/j.csl.2008.06.001
NR 52
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2013
VL 55
IS 5
BP 641
EP 652
DI 10.1016/j.specom.2013.01.002
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 141RS
UT WOS:000318744800005
ER
PT J
AU Mehrabani, M
Hansen, JHL
AF Mehrabani, Mahnoosh
Hansen, John H. L.
TI Singing speaker clustering based on subspace learning in the GMM mean
supervector space
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker clustering; Singing; Speaking styles; Subspace learning
ID SPEECH; RECOGNITION; MODELS; VERIFICATION; MIXTURE; COMPENSATION;
ADAPTATION; RECORDINGS; SEPARATION; STRESS
AB In this study, we propose algorithms based on subspace learning in the GMM mean supervector space to improve performance of speaker clustering with speech from both reading and singing. As a speaking style, singing introduces changes in the time-frequency structure of a speaker's voice. The purpose of this study is to introduce advancements for speech systems such as speech indexing and retrieval which improve robustness to intrinsic variations in speech production. Speaker clustering techniques such as k-means and hierarchical are explored for analysis of acoustic space differences of a corpus consisting of reading and singing of lyrics for each speaker. Furthermore, a distance based on fuzzy c-means membership degrees is proposed to more accurately measure clustering difficulty or speaker confusability. Two categories of subspace learning methods are studied: unsupervised based on LPP, and supervised based on PLDA. Our proposed clustering method based on PLDA is a two stage algorithm: where first, initial clusters are obtained using full dimension supervectors, and next, each cluster is refined in a PLDA subspace resulting in a more speaker dependent representation that is less sensitive to speaking style. It is shown that LPP improves average clustering accuracy by 5.1% absolute versus a hierarchical baseline for a mixture of reading and singing, and PLDA based clustering increases accuracy by 9.6% absolute versus a k-means baseline. The advancements offer novel techniques to improve model formulation for speech applications including speaker ID, audio search, and audio content analysis. (C) 2012 Elsevier B.V. All rights reserved.
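A minimal sketch of GMM mean-supervector construction followed by k-means clustering is shown below, using scikit-learn and a simple relevance-MAP mean adaptation. The relevance factor, feature dimensions, and synthetic data are assumptions for illustration; the LPP and PLDA subspace learning and the fuzzy-c-means confusability measure proposed in the paper are not included.

# Minimal sketch: adapt a UBM's means to each utterance (relevance MAP), stack
# the adapted means into a supervector, and cluster supervectors with k-means.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.cluster import KMeans

def map_adapted_supervector(ubm, feats, relevance=16.0):
    """Relevance-MAP adapt the UBM means to one utterance and stack them."""
    post = ubm.predict_proba(feats)                     # (frames, components)
    n_k = post.sum(axis=0) + 1e-12
    ex_k = (post.T @ feats) / n_k[:, None]              # per-component data means
    alpha = (n_k / (n_k + relevance))[:, None]
    adapted = alpha * ex_k + (1.0 - alpha) * ubm.means_
    return adapted.ravel()                              # GMM mean supervector

rng = np.random.default_rng(0)
ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(rng.normal(size=(2000, 13)))

utterances = [rng.normal(loc=rng.normal(scale=0.5, size=13), size=(300, 13))
              for _ in range(10)]                        # stand-ins for read/sung utterances
supervecs = np.array([map_adapted_supervector(ubm, u) for u in utterances])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(supervecs)
print(labels)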
C1 [Mehrabani, Mahnoosh; Hansen, John H. L.] Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, CRSS, Dallas, TX 75230 USA.
RP Hansen, JHL (reprint author), Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, CRSS, Dallas, TX 75230 USA.
EM mahmehrabani@utdallas.edu; john.hansen@utdallas.edu
CR Belkin M, 2002, ADV NEUR IN, V14, P585
Ben M., 2004, P ICSLP
Bezdek J.C., 1981, PATTERN RECOGNITION
Bishop C. M., 1995, NEURAL NETWORKS PATT
Bou-Ghazale SE, 1998, IEEE T SPEECH AUDI P, V6, P201, DOI 10.1109/89.668815
Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086
Chu SM, 2009, IEEE INT CON MULTI, P494
Chu SM, 2009, INT CONF ACOUST SPEE, P4089, DOI 10.1109/ICASSP.2009.4960527
Dunn J.C., 1973, FUZZY RELATIVE ISODA
Faltlhauser R., 2001, IEEE WORKSH AUT SPEE, P57, DOI 10.1109/ASRU.2001.1034588
Fan X, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1313
Hansen J. H. L., 2000, IMPACT SPEECH STRESS
Hansen JHL, 2009, IEEE T AUDIO SPEECH, V17, P366, DOI 10.1109/TASL.2008.2009019
Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7
He XF, 2005, IEEE T PATTERN ANAL, V27, P328
He Xiaofei, 2003, P C ADV NEUR INF PRO, V16, P153
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Ioffe S, 2006, LECT NOTES COMPUT SC, V3954, P531
Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308
Li YP, 2007, IEEE T AUDIO SPEECH, V15, P1475, DOI 10.1109/TASL.2006.889789
Lippmann R., 1987, P INT C AC SPEECH SI, V12, P705
Makhoul J, 2000, P IEEE, V88, P1338, DOI 10.1109/5.880087
Mehrabani M., 2012, P INTERSPEECH
Ozerov A, 2007, IEEE T AUDIO SPEECH, V15, P1564, DOI 10.1109/TASL.2007.899291
Prince S., 2007, IEEE 11 INT C COMP V, P1
Rabiner L., 1993, FUNDAM SPEECH RECOGN, V103
Reynolds D., 2009, P INTERSPEECH, P6
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Shriberg E., 2008, P INTERSPEECH
SOLOMONOFF A, 1998, ACOUST SPEECH SIG PR, P757
Tang H, 2012, IEEE T PATTERN ANAL, V34, P959, DOI 10.1109/TPAMI.2011.174
Tang H, 2009, INT CONF ACOUST SPEE, P4101
Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256
Tsai WH, 2004, COMPUT MUSIC J, V28, P68, DOI 10.1162/0148926041790630
Tsai WH, 2005, INT CONF ACOUST SPEE, P725
WARD JH, 1963, J AM STAT ASSOC, V58, P236, DOI 10.2307/2282967
Wooters C, 2008, LECT NOTES COMPUT SC, V4625, P509
Wu W, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2102
Zhang C, 2011, IEEE T AUDIO SPEECH, V19, P883, DOI 10.1109/TASL.2010.2066967
Zhang C., 2007, P INTERSPEECH, V2007, P2289
NR 41
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2013
VL 55
IS 5
BP 653
EP 666
DI 10.1016/j.specom.2012.11.001
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 141RS
UT WOS:000318744800006
ER
PT J
AU Erath, BD
Zanartu, M
Stewart, KC
Plesniak, MW
Sommer, DE
Peterson, SD
AF Erath, Byron D.
Zanartu, Matias
Stewart, Kelley C.
Plesniak, Michael W.
Sommer, David E.
Peterson, Sean D.
TI A review of lumped-element models of voiced speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Lumped-mass models; Glottal aerodynamics; Acoustics; Vocal fold models;
Vocal tract; Subglottal system; Acoustic interaction
ID VOCAL-FOLD MODEL; TRACT AREA FUNCTION; GLOTTAL AIR-FLOW; IN-VITRO
MODELS; VISCOELASTIC SHEAR PROPERTIES; INTRINSIC LARYNGEAL MUSCLES;
INVERSE FILTERING TECHNIQUE; VERTICAL-BAR UTTERANCES; DOMAIN ACOUSTIC
MODEL; 2-MASS MODEL
AB Voiced speech is a highly complex process involving coupled interactions between the vocal fold structure, aerodynamics, and acoustic field. Reduced-order lumped-element models of the vocal fold structure, coupled with various aerodynamic and acoustic models, have proven useful in a wide array of speech investigations. These simplified models of speech, in which the vocal folds are approximated as arrays of lumped masses connected to one another via springs and dampers to simulate the viscoelastic tissue properties, have been used to study phenomena ranging from sustained vowels and pitch glides to polyps and vocal fold paralysis. Over the past several decades a variety of structural, aerodynamic, and acoustic models have been developed and deployed into the lumped-element modeling framework. This paper aims to provide an overview of advances in lumped-element models and their constituents, with particular emphasis on their physical foundations and limitations. Examples of the application of lumped-element models to speech studies will also be addressed, as well as an outlook on the direction and future of these models. (C) 2013 Elsevier B.V. All rights reserved.
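The sketch below is a loose, assumption-laden illustration of the structural idea behind the lumped-element models surveyed here: vocal fold tissue represented as a mass-spring-damper driven by an intraglottal pressure. All parameters are assumed order-of-magnitude values, the driving pressure is a simple sinusoid rather than a flow-derived force, and no aerodynamic or acoustic coupling is included, so the sketch shows only the tissue representation, not self-sustained oscillation.

import numpy as np

# Assumed, order-of-magnitude parameters for a single lumped "fold" element.
m, d, k = 1e-4, 0.02, 40.0                   # mass (kg), damping (N s/m), stiffness (N/m)
f_drive, P_amp, area = 120.0, 300.0, 1e-4    # drive frequency (Hz), pressure amplitude (Pa), loaded area (m^2)
dt, T = 1e-5, 0.05                           # time step and duration (s)

x, v = 0.0, 0.0
trace = []
for n in range(int(T / dt)):
    t = n * dt
    force = P_amp * area * np.sin(2 * np.pi * f_drive * t)   # stand-in for the intraglottal pressure force
    a = (force - d * v - k * x) / m                          # Newton's second law for the lumped mass
    v += a * dt                                              # semi-implicit Euler update
    x += v * dt
    trace.append(x)

print("peak displacement (mm):", 1e3 * max(trace))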
C1 [Erath, Byron D.] Clarkson Univ, Dept Mech & Aeronaut Engn, Potsdam, NY 13699 USA.
[Zanartu, Matias] Univ Tecn Federico Santa Maria, Dept Elect Engn, Valparaiso, Chile.
[Stewart, Kelley C.; Plesniak, Michael W.] George Washington Univ, Dept Mech & Aerosp Engn, Washington, DC 20052 USA.
[Sommer, David E.; Peterson, Sean D.] Univ Waterloo, Dept Mech & Mechatron Engn, Waterloo, ON N2L 3G1, Canada.
RP Erath, BD (reprint author), Clarkson Univ, Dept Mech & Aeronaut Engn, Potsdam, NY 13699 USA.
EM berath@clarkson.edu; matias.zanartu@usm.cl; kstewart@gwu.edu;
plesniak@gwu.edu; peterson@mme.uwaterloo.ca
RI Zanartu, Matias/I-3133-2012
OI Zanartu, Matias/0000-0001-5581-4392
FU National Science Foundation [CBET 1036280]; UTFSM; CONICYT; [FONDECYT
11110147]
FX This material is based upon work supported by the National Science
Foundation under Grant No. CBET 1036280. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of
the authors and do not necessarily reflect the views of the National
Science Foundation. The work of Matias Zanartu was supported by UTFSM and
CONICYT, Grant FONDECYT 11110147.
CR Agarwal M, 2003, J VOICE, V17, P97, DOI 10.1016/S0892-1997(03)00012-2
Agarwal M, 2004, THESIS BOWLING GREEN
Alipour F, 2001, ANN OTO RHINOL LARYN, V110, P550
ALIPOURHAGHIGHI F, 1991, J ACOUST SOC AM, V90, P1326, DOI 10.1121/1.401924
Alipour F., 2013, J ACOUST SOC AM, V132, P1017
Alku P, 2009, J ACOUST SOC AM, V125, P3289, DOI 10.1121/1.3095801
ARNOLD GODFREY E., 1961, LARYNGOSCOPE, V71, P687
Avanzini F, 2008, SPEECH COMMUN, V50, P95, DOI 10.1016/j.specom.2007.07.002
Avanzini F., 2001, 7 EUR C SPEECH COMM, P51
Avanzini F, 2006, ACTA ACUST UNITED AC, V92, P731
Baer T., 1981, 11 ASHA, V11, P38
Bailly L, 2008, J ACOUST SOC AM, V124, P3296, DOI 10.1121/1.2977740
Bailly L, 2010, J ACOUST SOC AM, V127, P3212, DOI 10.1121/1.3365220
Baken R.J., 1997, PROFESSIONAL VOICE S
Benjamin B., 1987, ANN OTO RHINOL LARYN, V99, P530
Birkholz P., 2011, P INT 2011 FLOR IT, P2681
Birkholz P., 2011, STUDIENTEXTE SPRACHK, P184
Birkholz P, 2011, IEEE T AUDIO SPEECH, V19, P1422, DOI 10.1109/TASL.2010.2091632
Birkholz P., 2011, 1 INT WORKSH PERF SP
Birkhoz P, 2007, IEEE T AUDIO SPEECH, V15, P1218, DOI 10.1109/TASL.2006.889731
Bocklet T., 2011, P IEEE WORKSH AUT SP, P478
BROWN BL, 1974, J ACOUST SOC AM, V55, P313, DOI 10.1121/1.1914504
Bunton K, 2010, J ACOUST SOC AM, V127, pEL146, DOI 10.1121/1.3313921
Bunton K., 2011, J ACOUST SOC AM, V129, P2626
Chan RW, 2000, J ACOUST SOC AM, V107, P565, DOI 10.1121/1.428354
Chan RW, 1999, J ACOUST SOC AM, V106, P2008, DOI 10.1121/1.427947
Chen LQ, 2008, LECT NOTES COMPUT SC, V5209, P1
CHILDERS DG, 1986, J ACOUST SOC AM, V80, P1309, DOI 10.1121/1.394382
Cisonni J, 2011, ACTA ACUST UNITED AC, V97, P291, DOI 10.3813/AAA.918409
Cook D.D., 2010, 7 INT C VOIC PHYS BI
CRANEN B, 1995, J PHONETICS, V23, P165, DOI 10.1016/S0095-4470(95)80040-9
CRANEN B, 1987, J ACOUST SOC AM, V81, P734, DOI 10.1121/1.394842
Dejonckere PH, 2009, FOLIA PHONIATR LOGO, V61, P171, DOI 10.1159/000219952
de Vries MP, 1999, J ACOUST SOC AM, V106, P3620, DOI 10.1121/1.428214
de Vries MP, 2002, J ACOUST SOC AM, V111, P1847, DOI 10.1121/1.1323716
Dollinger M, 2002, IEEE T BIO-MED ENG, V49, P773, DOI 10.1109/TBME.2002.800755
Drechsel JS, 2008, J ACOUST SOC AM, V123, P4434, DOI 10.1121/1.2897040
Dresel Christian, 2006, Logoped Phoniatr Vocol, V31, P61, DOI 10.1080/14015430500363232
Drioli C, 2002, MED ENG PHYS, V24, P453, DOI 10.1016/S1350-4533(02)00057-7
Dursun G, 1996, J VOICE, V10, P206, DOI 10.1016/S0892-1997(96)80048-8
Erath BD, 2011, CHAOS, V21, DOI 10.1063/1.3615726
Erath BD, 2012, INT J HEAT FLUID FL, V35, P93, DOI 10.1016/j.ijheatfluidflow.2012.03.006
Erath BD, 2006, J ACOUST SOC AM, V120, P1000, DOI 10.1121/1.2213522
Erath BD, 2010, INT J HEAT FLUID FL, V31, P468, DOI 10.1016/j.ijheatfluidflow.2010.02.014
Erath BD, 2006, EXP FLUIDS, V40, P683, DOI 10.1007/s00348-006-0106-0
Erath BD, 2011, J ACOUST SOC AM, V130, P389, DOI 10.1121/1.3586785
Erath BD, 2006, EXP FLUIDS, V41, P735, DOI 10.1007/s00348-006-0196-8
Erath B.D., 2010, J ACOUST SOC AM, V129, pEL64
Erath BD, 2010, EXP FLUIDS, V49, P131, DOI 10.1007/s00348-009-0809-0
ERIKSSON LJ, 1980, J ACOUST SOC AM, V68, P545, DOI 10.1121/1.384768
Fant G., 1960, ACOUSTIC THEORY SPEE
Fant G., 1987, STL QPSR, V28, P13
Flanagan J., 1972, SPEECH ANAL SYNTHESI
FLANAGAN JL, 1968, IEEE T ACOUST SPEECH, VAU16, P57, DOI 10.1109/TAU.1968.1161949
Fraile R, 2012, BIOMED SIGNAL PROCES, V7, P65, DOI 10.1016/j.bspc.2011.04.002
Fulcher LP, 2006, AM J PHYS, V74, P386, DOI 10.1119/1.2173272
GAY T, 1972, ANN OTO RHINOL LARYN, V81, P401
Goldberg D. E, 1989, GENETIC ALGORITHMS S
Gunter HE, 2003, J ACOUST SOC AM, V113, P994, DOI 10.1121/1.1534100
GUPTA V, 1973, J ACOUST SOC AM, V54, P1607, DOI 10.1121/1.1914457
HANSON DG, 1988, LARYNGOSCOPE, V98, P541
Hanson HM, 1997, J ACOUST SOC AM, V101, P466, DOI 10.1121/1.417991
HARTMAN DE, 1984, ARCH OTOLARYNGOL, V110, P394
HERZEL H, 1995, CHAOS, V5, P30, DOI 10.1063/1.166078
HERZEL H, 1995, NONLINEAR DYNAM, V7, P53
Hess MM, 1998, J VOICE, V12, P50, DOI 10.1016/S0892-1997(98)80075-1
HILLMAN RE, 1989, J SPEECH HEAR RES, V32, P373
Hirano M., 1975, OTOLOGIA FUKUOKA S1, V21, P239
Hirano M., 1977, DYNAMIC ASPECTS SPEE, P13
Hirano M., 1981, VOCAL FOLD PHYSL, P33
HIRANO M, 1974, FOLIA PHONIATR, V26, P89
Hirano M., 1983, VOCAL FOLD PHYSL CON, P22
HIRANO M, 1970, FOLIA PHONIATR, V22, P1
Hirschberg A, 1996, VOCAL FOLD, P31
Ho JC, 2011, J ACOUST SOC AM, V129, P1531, DOI 10.1121/1.3543971
Hofmans GCJ, 2003, J ACOUST SOC AM, V113, P1658, DOI 10.1121/1.1547459
HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829
Honda K, 2004, IEICE T INF SYST, VE87D, P1050
Horacek J, 2005, J FLUID STRUCT, V20, P853, DOI 10.1016/j.jfluidstructs.2005.05.003
ISHIZAKA K, 1976, J ACOUST SOC AM, V60, P1193, DOI 10.1121/1.381221
ISHIZAKA K, 1976, J ACOUST SOC AM, V60, P190, DOI 10.1121/1.381064
Ishizaka K, 1968, J ACOUST SOC JPN, V24, P312
ISHIZAKA K, 1972, AT&T TECH J, V51, P1233
Isshiki N., 1977, FUNCTIONAL SURG LARY
Jiang J, 2000, OTOLARYNG CLIN N AM, V33, P699, DOI 10.1016/S0030-6665(05)70238-3
Jiang JJ, 2001, J ACOUST SOC AM, V110, P2120, DOI 10.1121/1.1395596
JIANG JJQ, 1994, J VOICE, V8, P132, DOI 10.1016/S0892-1997(05)80305-4
Johns M.M., 2003, HEAD NECK SURG, V11, P456
Kaneko T., 1972, J JPN SOC BRONCHOESO, V25, P133
Kelly J.L., 1973, STANFORD REV SPR, P1
Khosla S, 2007, ANN OTO RHINOL LARYN, V116, P217
Khosla S, 2008, CURR OPIN OTOLARYNGO, V16, P183, DOI 10.1097/MOO.0b013e3282ff5fc5
Khosla S, 2008, ANN OTO RHINOL LARYN, V117, P134
KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894
Kob M., 2002, THESIS U TECHNOLOGY
KOIZUMI T, 1993, LARYNGOSCOPE, V103, P1035
KOIZUMI T, 1987, J ACOUST SOC AM, V82, P1179, DOI 10.1121/1.395254
Krane M, 2007, J ACOUST SOC AM, V122, P3659, DOI 10.1121/1.2409485
Kroger BJ, 2011, LECT NOTES COMPUT SC, V6456, P354, DOI 10.1007/978-3-642-18184-9_31
Kroger BJ, 2007, LARYNGO RHINO OTOL, V86, P365, DOI 10.1055/s-2006-944981
Kroger BJ, 2011, COGN COMPUT, V3, P449, DOI 10.1007/s12559-010-9071-2
Kroger B.J., 2011, PALADYN J BEHAV ROBO, V2, P82
Kuo J., 1998, THESIS HARVARD MIT D
Li S, 2007, LECT NOTES COMPUT SC, V4561, P147
Liljencrants J., 1985, THESIS ROYAL I TECHN
Liljencrants J., 1991, STL QPSR, V32, P1
Lo CY, 2000, ARCH SURG-CHICAGO, V135, P204, DOI 10.1001/archsurg.135.2.204
LOFQVIST A, 1995, SPEECH COMMUN, V16, P49, DOI 10.1016/0167-6393(94)00049-G
LOGEMANN JA, 1978, J SPEECH HEAR DISORD, V43, P47
Lohscheller J, 2008, IEEE T MED IMAGING, V27, P300, DOI 10.1109/TMI.2007.903690
Lohscheller J, 2007, MED IMAGE ANAL, V11, P400, DOI 10.1016/j.media.2007.04.005
Lous NJC, 1998, ACUSTICA, V84, P1135
Lowell SY, 2006, J ACOUST SOC AM, V120, P386, DOI 10.1121/1.2204442
Lucero JC, 2005, J SOUND VIB, V282, P1247, DOI 10.1016/j.jsv.2004.05.008
Lucero JC, 2005, J ACOUST SOC AM, V117, P1362, DOI 10.1121/1.1853235
Luo HX, 2009, J ACOUST SOC AM, V126, P816, DOI 10.1121/1.3158942
Maeda S., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90017-6
MASSEY EW, 1985, SOUTHERN MED J, V78, P316
MCGOWAN RS, 1995, SPEECH COMMUN, V16, P67, DOI 10.1016/0167-6393(94)00048-F
MCGOWAN RS, 1988, J ACOUST SOC AM, V83, P696, DOI 10.1121/1.396165
McGowan RS, 2010, J ACOUST SOC AM, V127, pEL215, DOI 10.1121/1.3397283
Mehta DD, 2011, J ACOUST SOC AM, V130, P3999, DOI 10.1121/1.3658441
Mergell P, 1997, SPEECH COMMUN, V22, P141, DOI 10.1016/S0167-6393(97)00016-2
Miller DG, 2005, FOLIA PHONIATR LOGO, V57, P278, DOI 10.1159/000087081
Mittal R, 2013, ANNU REV FLUID MECH, V45, P437, DOI 10.1146/annurev-fluid-011212-140636
Mokhtari P, 2008, SPEECH COMMUN, V50, P179, DOI 10.1016/j.specom.2007.08.001
Mongeau L, 1997, J ACOUST SOC AM, V102, P1121, DOI 10.1121/1.419864
Neubauer J, 2007, J ACOUST SOC AM, V121, P1102, DOI 10.1121/1.2409488
Park JB, 2008, J ACOUST SOC AM, V124, P1171, DOI 10.1121/1.2945116
Park JB, 2007, J ACOUST SOC AM, V121, P442, DOI 10.1121/1.2401652
PELORSON X, 1995, ACTA ACUST, V3, P191
PELORSON X, 1994, J ACOUST SOC AM, V96, P3416, DOI 10.1121/1.411449
Pelorson X, 1996, ACUSTICA, V82, P358
Perlman A. L., 1985, THESIS U IOWA IOWA C
Qin XL, 2009, IEEE T BIO-MED ENG, V56, P1744, DOI 10.1109/TBME.2009.2015772
Qiu Q.J., 2002, P 2 INT S INSTR SCI, V3, P541
ROTHENBE.M, 1973, J ACOUST SOC AM, V53, P1632, DOI 10.1121/1.1913513
ROTHENBERG M, 1977, J ACOUST SOC AM, V61, P1063, DOI 10.1121/1.381392
Rothenberg M., 1984, VOCAL FOLD PHYSL BIO, P465
Rothenberg M., 1981, STL QPSR, V4, P1
Rupitsch SJ, 2011, J SOUND VIB, V330, P4447, DOI 10.1016/j.jsv.2011.05.008
Ruty N, 2007, J ACOUST SOC AM, V121, P479, DOI 10.1121/1.2384846
Scherer RC, 2010, J ACOUST SOC AM, V128, P828, DOI 10.1121/1.3455838
Schlichting H., 1968, BOUNDARY LAYER THEOR
Schroeter J., 2008, HDB SPEECH PROCESSIN, P413
Schwarz R, 2008, J ACOUST SOC AM, V123, P2717, DOI [10.1121/1.2902167, 10.1121/1.29021671]
Schwarz R, 2006, IEEE T BIO-MED ENG, V53, P1099, DOI 10.1109/TBME.2006.873396
Sciamarella D., 2003, EUROSPEECH
Sciamarella D, 2009, SPEECH COMMUN, V51, P344, DOI 10.1016/j.specom.2008.10.004
Sciamarella D, 2004, ACTA ACUST UNITED AC, V90, P746
SERCARZ JA, 1992, ANN OTO RHINOL LARYN, V101, P567
SMITH ME, 1992, J SPEECH HEAR RES, V35, P545
SOBEY IJ, 1983, J FLUID MECH, V134, P247, DOI 10.1017/S0022112083003341
Sommer DE, 2013, J ACOUST SOC AM, V133, pEL214, DOI 10.1121/1.4790662
Sommer DE, 2012, J ACOUST SOC AM, V132, pEL271, DOI 10.1121/1.4734013
STEINECKE I, 1995, J ACOUST SOC AM, V97, P1874, DOI 10.1121/1.412061
Stevens K.N., 1998, ACOUSTIC PHONETICS
STEVENS KN, 1955, J ACOUST SOC AM, V27, P484, DOI 10.1121/1.1907943
Story B. H., 1995, THESIS U IOWA IOWA C
Story B. H., 2002, Acoustical Science and Technology, V23, DOI 10.1250/ast.23.195
Story BH, 2010, J SPEECH LANG HEAR R, V53, P1514, DOI 10.1044/1092-4388(2010/09-0127)
STORY BH, 1995, J ACOUST SOC AM, V97, P1249, DOI 10.1121/1.412234
Story B.H., 2007, EMOTIONS HUMAN VOICE, V1, P123
Story BH, 2008, J ACOUST SOC AM, V123, P327, DOI 10.1121/1.2805683
Story BH, 1996, J ACOUST SOC AM, V100, P537, DOI 10.1121/1.415960
Story BH, 2007, J ACOUST SOC AM, V121, P3770, DOI 10.1121/1.2730621
Story BH, 2005, J ACOUST SOC AM, V117, P3231, DOI 10.1121/1.1869752
Story B.H., 2009, J ACOUST SOC AM, V125, P2637
Takemoto H, 2006, J ACOUST SOC AM, V119, P1037, DOI 10.1121/1.2151823
Tao C, 2008, PHYS REV E, V77, DOI 10.1103/PhysRevE.77.061922
Tao C, 2007, IEEE T BIO-MED ENG, V54, P794, DOI 10.1109/TBME.2006.889182
Tao C, 2007, J BIOMECH, V40, P2191, DOI 10.1016/j.jbiomech.2006.10.030
Tao C, 2007, J ACOUST SOC AM, V122, P2270, DOI 10.1121/1.2773960
Titze I, 2008, J ACOUST SOC AM, V123, P1902, DOI 10.1121/1.2832339
Titze I. R., 2006, MYOELASTIC AERODYNAM
TITZE IR, 1994, J VOICE, V8, P99, DOI 10.1016/S0892-1997(05)80302-9
Titze IR, 2002, J ACOUST SOC AM, V112, P1064, DOI 10.1121/1.1496080
TITZE IR, 1973, PHONETICA, V28, P129
Titze IR, 2009, J ACOUST SOC AM, V126, P1530, DOI 10.1121/1.3160296
TITZE IR, 1988, J ACOUST SOC AM, V83, P1536, DOI 10.1121/1.395910
Titze IR, 2002, J ACOUST SOC AM, V111, P367, DOI 10.1121/1.1417526
Titze IR, 2004, J VOICE, V18, P292, DOI 10.1016/j.jvoice.2003.12.010
Titze I.R., 1979, TRANSCR 9 S CAR PROF, P23
Titze IR, 1994, PRINCIPLES VOICE PRO
Titze IR, 2008, J ACOUST SOC AM, V123, P2733, DOI 10.1121/1.2832337
Titze IR, 1997, J ACOUST SOC AM, V101, P2234, DOI 10.1121/1.418246
TITZE IR, 1974, PHONETICA, V29, P1
Tokuda IT, 2007, J ACOUST SOC AM, V122, P519, DOI 10.1121/1.2741210
Tokuda IT, 2008, CHAOS, V18, DOI 10.1063/1.2825295
Tokuda IT, 2010, J ACOUST SOC AM, V127, P1528, DOI 10.1121/1.3299201
Triep M, 2005, EXP FLUIDS, V39, P232, DOI 10.1007/s00348-005-1015-3
Triep M, 2010, J ACOUST SOC AM, V127, P1537, DOI 10.1121/1.3299202
VANDENBERG J, 1958, J SPEECH HEAR RES, V1, P227
VANDENBE.JW, 1968, ANN NY ACAD SCI, V155, P129
van den BERG J., 1959, PRACTICA OTO RHINO LARYNGOL, V21, P425
Vilain CE, 2004, J SOUND VIB, V276, P475, DOI 10.1016/j.jsv.2003.07.035
Voigt D, 2010, J ACOUST SOC AM, V128, pEL347, DOI 10.1121/1.3493637
Wegel RL, 1930, J ACOUST SOC AM, V1, P1, DOI 10.1121/1.1915199
WODICKA GR, 1989, IEEE T BIO-MED ENG, V36, P925, DOI 10.1109/10.35301
WONG D, 1991, J ACOUST SOC AM, V89, P383, DOI 10.1121/1.400472
Wurzbacher T., 2004, INT C VOIC PHYS BIOM
Wurzbacher T, 2008, J ACOUST SOC AM, V123, P2324, DOI 10.1121/1.2835435
Wurzbacher T, 2006, J ACOUST SOC AM, V120, P1012, DOI 10.1121/1.2211550
Xue Q, 2010, J ACOUST SOC AM, V128, P818, DOI 10.1121/1.3458839
Yamana T, 2000, J VOICE, V14, P1, DOI 10.1016/S0892-1997(00)80089-2
Yang AX, 2012, J ACOUST SOC AM, V131, P1378, DOI 10.1121/1.3676622
Yang AX, 2011, J ACOUST SOC AM, V130, P948, DOI 10.1121/1.3605551
Yang AX, 2010, J ACOUST SOC AM, V127, P1014, DOI 10.1121/1.3277165
Yumoto E, 2002, AURIS NASUS LARYNX, V29, P41, DOI 10.1016/S0385-8146(01)00122-5
Zanartu M, 2007, J ACOUST SOC AM, V121, P1119, DOI 10.1121/1.2409491
Zanartu M, 2011, J ACOUST SOC AM, V129, P326, DOI 10.1121/1.3514536
Zanartu M., 2006, THESIS PURDUE U
Zanartu M., 2010, THESIS PURDUE U
Zhang C, 2002, J ACOUST SOC AM, V112, P2147, DOI 10.1121/1.1506694
Zhang Y, 2008, J SOUND VIB, V316, P248, DOI 10.1016/j.jsv.2008.02.026
Zhang Y, 2004, J ACOUST SOC AM, V115, P2270, DOI 10.1121/1.699392
Zhang Y, 2008, CHAOS, V18, DOI 10.1063/1.2988251
Zhang Y, 2005, CHAOS, V15, DOI 10.1063/1.1916186
Zhang Y, 2004, J ACOUST SOC AM, V115, P1266, DOI 10.1121/1.1648974
Zhang ZY, 2006, J ACOUST SOC AM, V119, P3995, DOI 10.1121/1.2195268
Zhang ZY, 2006, J ACOUST SOC AM, V120, P1558, DOI 10.1121/1.2225682
Zhao W, 2002, J ACOUST SOC AM, V112, P2134, DOI 10.1121/1.1506693
Zheng X, 2011, J ACOUST SOC AM, V130, P404, DOI 10.1121/1.3592216
Zheng XD, 2009, ANN BIOMED ENG, V37, P625, DOI 10.1007/s10439-008-9630-9
Zhuang P, 2009, LARYNGOSCOPE, V119, P811, DOI 10.1002/lary.20165
NR 225
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2013
VL 55
IS 5
BP 667
EP 690
DI 10.1016/j.specom.2013.02.002
PG 24
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 141RS
UT WOS:000318744800007
ER
PT J
AU Koniaris, C
Salvi, G
Engwall, O
AF Koniaris, Christos
Salvi, Giampiero
Engwall, Olov
TI On mispronunciation analysis of individual foreign speakers using
auditory periphery models
SO SPEECH COMMUNICATION
LA English
DT Article
DE Second language learning; Auditory model; Distortion measure; Perceptual
assessment; Pronunciation error detection; Phoneme
ID SPEECH RECOGNITION; QUANTITATIVE MODEL; PRONUNCIATION; PERCEPTION;
SYSTEM; ACCENT; REPRESENTATIONS
AB In second language (L2) learning, a major difficulty is to discriminate between the acoustic diversity within an L2 phoneme category and that between different categories. We propose a general method for automatic diagnostic assessment of the pronunciation of non-native speakers based on models of the human auditory periphery. Considering each phoneme class separately, the geometric shape similarity between the native auditory domain and the non-native speech domain is measured. The phonemes that deviate the most from the native pronunciation for a set of L2 speakers are detected by comparing the geometric shape similarity measure with that calculated for native speakers on the same phonemes. To evaluate the system, we have tested it with different non-native speaker groups from various language backgrounds. The experimental results are in accordance with linguistic findings and human listeners' ratings, particularly when both the spectral and temporal cues of the speech signal are utilized in the pronunciation analysis. (C) 2013 Elsevier B.V. All rights reserved.
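As a rough, non-authoritative illustration of the comparison logic in this abstract, the sketch below scores each phoneme by how far its L2 token distribution sits from the native distribution relative to a native-to-native baseline. The features, speakers, and the symmetric-KL distance between Gaussian fits are all placeholders for the auditory-model representation and geometric shape similarity measure used in the paper.

import numpy as np

def gauss_fit(X):
    return X.mean(axis=0), np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])

def sym_kl(p, q):
    # Symmetric KL divergence between two Gaussian fits (a stand-in distance).
    (m1, S1), (m2, S2) = p, q
    d = m1.size
    def kl(ma, Sa, mb, Sb):
        Sb_inv = np.linalg.inv(Sb)
        diff = mb - ma
        return 0.5 * (np.trace(Sb_inv @ Sa) + diff @ Sb_inv @ diff - d
                      + np.log(np.linalg.det(Sb) / np.linalg.det(Sa)))
    return kl(m1, S1, m2, S2) + kl(m2, S2, m1, S1)

rng = np.random.default_rng(5)
report = {}
for ph in ["i", "y", "e", "a"]:
    native_a = rng.normal(size=(80, 4))          # placeholder auditory-model features, native group A
    native_b = rng.normal(size=(80, 4))          # second native group, used as the baseline
    shift = 0.8 if ph == "y" else 0.1            # pretend /y/ is the mispronounced phoneme
    l2 = rng.normal(loc=shift, size=(80, 4))     # placeholder L2 speaker features
    baseline = sym_kl(gauss_fit(native_a), gauss_fit(native_b))
    l2_dist = sym_kl(gauss_fit(native_a), gauss_fit(l2))
    report[ph] = l2_dist / baseline              # ratios well above 1 flag deviant phonemes

print(sorted(report.items(), key=lambda kv: -kv[1]))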
C1 [Koniaris, Christos; Salvi, Giampiero; Engwall, Olov] KTH Royal Inst Technol, Sch Comp Sci & Commun, Ctr Speech Technol, SE-10044 Stockholm, Sweden.
RP Koniaris, C (reprint author), KTH Royal Inst Technol, Sch Comp Sci & Commun, Ctr Speech Technol, Lindstedtsvagen 24, SE-10044 Stockholm, Sweden.
EM koniaris@kth.se; giampi@kth.se; engwall@kth.se
FU Swedish Research Council [80449001]
FX This work is supported by the Swedish Research Council project 80449001
Computer-Animated LAnguage TEAchers (CALATEA). The authors wish to thank
our colleague Dr. Mats Blomberg, Dr. Saikat Chatterjee from the
Communication Theory Laboratory, KTH - Royal Institute of Technology,
and the anonymous reviewers for helpful suggestions.
CR ANDERSONHSIEH J, 1992, LANG LEARN, V42, P529, DOI 10.1111/j.1467-1770.1992.tb01043.x
Andrianopoulos MV, 2001, J VOICE, V15, P61, DOI 10.1016/S0892-1997(01)00007-8
Bannert R., 1984, FOLIA LINGUIST, V18, P193, DOI 10.1515/flin.1984.18.1-2.193
Bregman AS., 1990, AUDITORY SCENE ANAL
Chatterjee S, 2011, IEEE T AUDIO SPEECH, V19, P1813, DOI 10.1109/TASL.2010.2101597
Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959
Dau T, 1996, J ACOUST SOC AM, V99, P3623, DOI 10.1121/1.414960
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Digalakis V., 1992, THESIS BOSTON U BOST
Digalakis V, 1993, IEEE T SPEECH AUDI P, V1, P431, DOI 10.1109/89.242489
Eddins D.A., 1995, TEMPORAL INTEGRATION
Eskenazi M., 1998, SPEECH TECH LANG LEA, P77
Flege J. E., 1995, SPEECH PERCEPTION LI, P233
Franco H, 1997, INT CONF ACOUST SPEE, P1471, DOI 10.1109/ICASSP.1997.596227
Franco H, 2010, LANG TEST, V27, P401, DOI 10.1177/0265532210364408
GARDNER WR, 1995, IEEE T SPEECH AUDI P, V3, P367, DOI 10.1109/89.466658
Guion SG, 2000, J ACOUST SOC AM, V107, P2711, DOI 10.1121/1.428657
Kawai G., 1998, INT C SPOK LANG PROC, P1823
Kluender K. R., 1989, ECOL PSYCHOL, V1, P121, DOI 10.1207/s15326969eco0102_2
Koniaris C, 2011, INT CONF ACOUST SPEE, P5704
Koniaris C, 2010, INT CONF ACOUST SPEE, P4342, DOI 10.1109/ICASSP.2010.5495648
Koniaris C., 2012, INT S AUT DET ERR PR, P59
Koniaris C., 2011, INTERSPEECH, P1157
Koniaris C, 2010, J ACOUST SOC AM, V127, pEL73, DOI 10.1121/1.3284545
Koniaris C., 2012, INTERSPEECH
KUHL PK, 1993, J PHONETICS, V21, P125
Menzel W., 2000, WORK INT SPEECH TECH, P49
Moustroufas N, 2007, COMPUT SPEECH LANG, V21, P219, DOI 10.1016/j.csl.2006.04.001
MUNRO MJ, 1995, LANG LEARN, V45, P73, DOI 10.1111/j.1467-1770.1995.tb00963.x
Neumeyer L, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1457
Piske T, 2001, J PHONETICS, V29, P191, DOI 10.1006/jpho.2001.0134
Plasberg JH, 2007, IEEE T AUDIO SPEECH, V15, P310, DOI 10.1109/TASL.2006.876722
Pressnitzer D, 2008, CURR BIOL, V18, P1124, DOI 10.1016/j.cub.2008.06.053
Richardson M, 2003, SPEECH COMMUN, V41, P511, DOI 10.1016/S0167-6393(03)00031-1
Schmid PM, 1999, J SPEECH LANG HEAR R, V42, P56
Sjolander K., 2003, FONETIK, P93
Stevens K.N., 1998, ACOUSTIC PHONETICS
Strik H, 2009, SPEECH COMMUN, V51, P845, DOI 10.1016/j.specom.2009.05.007
Tepperman J, 2008, IEEE T AUDIO SPEECH, V16, P8, DOI 10.1109/TASL.2007.909330
Thoren B., 2008, THESIS STOCKHOLM U S
van de Par S., 2002, ACOUST SPEECH SIG PR, P1805
Wei S, 2009, SPEECH COMMUN, V51, P896, DOI 10.1016/j.specom.2009.03.004
WERKER JF, 1984, INFANT BEHAV DEV, V7, P49, DOI 10.1016/S0163-6383(84)80022-3
Wik P, 2009, SPEECH COMMUN, V51, P1024, DOI 10.1016/j.specom.2009.05.006
Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8
NR 45
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2013
VL 55
IS 5
BP 691
EP 706
DI 10.1016/j.specom.2013.01.004
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 141RS
UT WOS:000318744800008
ER
PT J
AU Kua, JMK
Epps, J
Ambikairajah, E
AF Kua, Jia Min Karen
Epps, Julien
Ambikairajah, Eliathamby
TI i-Vector with sparse representation classification for speaker
verification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker recognition; Sparse representation classification;
l(1)-Minimization; i-Vectors; Support vector machine; Cosine Distance
Scoring
ID MACHINES; RECOGNITION; SELECTION; MODELS; SYSTEMS; KERNEL;
RECONSTRUCTION; REGULARIZATION; IDENTIFICATION; NORMALIZATION
AB Sparse representation-based methods have recently shown promise for speaker recognition systems. This paper investigates and develops an i-vector based sparse representation classification (SRC) as an alternative classifier to the support vector machine (SVM) and Cosine Distance Scoring (CDS) classifiers, producing an approach we term i-vector sparse representation classification (i-SRC). Unlike SVM, which fixes the support vectors for each target example, SRC allows the supports, which we term sparse coefficient vectors, to be adapted to the test signal being characterized. Furthermore, similarly to CDS, SRC does not require a training phase. We also analyze different types of sparseness methods and dictionary compositions to determine the best configuration for speaker recognition. We observe that including an identity matrix in the dictionary helps to remove sensitivity to outliers and that sparseness methods based on the l(1) and l(2) norms offer the best performance. A combination of both techniques achieves an 18% relative reduction in EER over an SRC system based on the l(1) norm and without the identity matrix. Experimental results on NIST 2010 SRE show that i-SRC consistently outperforms i-SVM and i-CDS in EER by 0.14-0.81%, and the fusion of i-CDS and i-SRC achieves a relative EER reduction of 8-19% over i-SRC alone. (C) 2013 Elsevier B.V. All rights reserved.
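A minimal sketch of the sparse representation classification idea described here, assuming random placeholder "i-vectors" rather than real speaker data: an l1-regularized solver (scikit-learn's Lasso, standing in for the various sparseness methods the paper compares) encodes the test vector over a dictionary of enrollment vectors, and each speaker is scored by the reconstruction residual from its own dictionary columns. The identity-matrix dictionary extension is not reproduced.

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
dim, per_spk, n_spk = 100, 5, 4
labels = np.repeat(np.arange(n_spk), per_spk)
dictionary = rng.normal(size=(dim, n_spk * per_spk))
dictionary /= np.linalg.norm(dictionary, axis=0)        # unit-norm dictionary columns

true_spk = 2                                            # build a test vector from speaker 2's columns
test = dictionary[:, labels == true_spk] @ rng.uniform(0.5, 1.0, per_spk)
test += 0.05 * rng.normal(size=dim)

coef = Lasso(alpha=0.01, max_iter=10000).fit(dictionary, test).coef_   # sparse coefficient vector
residuals = [np.linalg.norm(test - dictionary[:, labels == s] @ coef[labels == s])
             for s in range(n_spk)]                     # class-wise reconstruction residuals
print("decided speaker:", int(np.argmin(residuals)), "true speaker:", true_spk)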
C1 [Kua, Jia Min Karen; Epps, Julien; Ambikairajah, Eliathamby] Univ New S Wales, Sch Elect Engn & Telecommun, Unsw Sydney, NSW 2052, Australia.
[Epps, Julien; Ambikairajah, Eliathamby] NICTA, ATP Res Lab, Eveleigh 2015, Australia.
RP Kua, JMK (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, Unsw Sydney, NSW 2052, Australia.
EM j.kua@unswalumni.com; j.epps@unsw.edu.au; ambi@ee.unsw.edu.au
CR Aharon M, 2006, IEEE T SIGNAL PROCES, V54, P4311, DOI 10.1109/TSP.2006.881199
Alex Solomonoff C.Q., 2004, P OD SPEAK LANG REC, P57
Amaldi E, 1998, THEOR COMPUT SCI, V209, P237, DOI 10.1016/S0304-3975(97)00115-1
Ariki Y, 1996, INT CONF ACOUST SPEE, P319, DOI 10.1109/ICASSP.1996.541096
Ariki Y., 1994, ICSLP 94. 1994 International Conference on Spoken Language Processing
Baraniuk RG, 2007, IEEE SIGNAL PROC MAG, V24, P118, DOI 10.1109/MSP.2007.4286571
Blake C. L., 1998, UCI REPOSITORY MACHI, V460
Bruckstein AM, 2009, SIAM REV, V51, P34, DOI 10.1137/060657704
Brummer N., 2010, P NIST 2010 SPEAK RE
Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555
Campadelli P, 2005, NEUROCOMPUTING, V68, P281, DOI 10.1016/j.neucom.2005.03.005
Campbell WM, 2006, COMPUT SPEECH LANG, V20, P210, DOI 10.1016/j.csl.2005.06.003
Campbell WM, 2006, INT CONF ACOUST SPEE, P97
Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086
Candes E., 2005, 1 MAGIC RECOVERY SPA
Candes E. J., 2006, P INT C MATH
Candes EJ, 2006, IEEE T INFORM THEORY, V52, P5406, DOI 10.1109/TIT.2006.885507
Candes EJ, 2006, IEEE T INFORM THEORY, V52, P489, DOI 10.1109/TIT.2005.862083
Chapelle O, 2002, MACH LEARN, V46, P131, DOI 10.1023/A:1012450327387
Dehak N., 2009, P INTERSPEECH
Dehak N, 2011, IEEE T AUDIO SPEECH, V19, P788, DOI 10.1109/TASL.2010.2064307
Dehak N, 2009, INT CONF ACOUST SPEE, P4237, DOI 10.1109/ICASSP.2009.4960564
Donoho D., 2005, SPARSELAB, P25
Donoho DL, 2006, COMMUN PUR APPL MATH, V59, P797, DOI 10.1002/cpa.20132
Fauve BGB, 2007, IEEE T AUDIO SPEECH, V15, P1960, DOI 10.1109/TASL.2007.902877
Figueiredo MAT, 2007, IEEE J-STSP, V1, P586, DOI 10.1109/JSTSP.2007.910281
Friedman J, 2010, J STAT SOFTW, V33, P1
Frohlich H, 2005, IEEE IJCNN, P1431
Georghiades AS, 2001, IEEE T PATTERN ANAL, V23, P643, DOI 10.1109/34.927464
Gunasekara N.A., 2010, METALEARNING STRING
Hatch AO, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1471
Huang K., 2007, ADV NEURAL INFORM PR, V19, P609
Ji SH, 2008, IEEE T SIGNAL PROCES, V56, P2346, DOI 10.1109/TSP.2007.914345
Kanevsky D., 2010, P INTERSPEECH
Karam ZN, 2008, INT CONF ACOUST SPEE, P4117, DOI 10.1109/ICASSP.2008.4518560
Kenny P, 2005, JOINT FACTOR ANAL SP
Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009
Koh K., 2007, 11 IS MATLAB SOLVER
Kreutz-Delgado K, 2003, NEURAL COMPUT, V15, P349, DOI 10.1162/089976603762552951
Kua JMK, 2011, INT CONF ACOUST SPEE, P4548
Lei Y., 2010, CRSS SYSTEMS 2010 NI
Li M., 2011, P INTERSPEECH
Li Ming, 2011, P ICASSP
Mairal J., 2009, ADV NEURAL INFORM PR, V21, P1033
Martin A.F., 2010, NIST 2010 SPEAKER RE
McLaren M., 2009, P ICASSP, P4041
McLaren M, 2009, LECT NOTES COMPUT SC, V5558, P474, DOI 10.1007/978-3-642-01793-3_49
Moreno P.J., 2003, P 8 EUR C SPEECH COM, P2965
Naseem Imran, 2010, Proceedings of the 2010 20th International Conference on Pattern Recognition (ICPR 2010), DOI 10.1109/ICPR.2010.1083
Plumbley MD, 2007, LECT NOTES COMPUT SC, V4666, P406
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Sainath T.N., 2010, SPARSE REPRESENTATIO
Sainath TN, 2010, INT CONF ACOUST SPEE, P4370, DOI 10.1109/ICASSP.2010.5495638
Sedlak F, 2011, INT CONF ACOUST SPEE, P4544
Solomonoff A, 2005, INT CONF ACOUST SPEE, P629
Suh J.W., 2011, P ICASSP
Tao DC, 2006, IEEE T PATTERN ANAL, V28, P1088
Tibshirani R, 1996, J ROY STAT SOC B MET, V58, P267
Tikhonov A. N., 1977, SOLUTION ILL POSED P
Vainsencher D., 2010, PREPRINT
Villalba J., 2010, I3A NIST SRE2010 SYS
Wan V., 2000, IEEE INT WORKSH NEUR, P775
WAN V, 2002, ACOUST SPEECH SIG PR, P669
Webb A. R., 2002, STAT PATTERN RECOGNI, V2nd
Wright J., 2008, IEEE T PATTERN ANAL, V31, P210
Yang A., 2007, FEATURE SELECTION FA
Yang AY, 2010, P IEEE, V98, P1077, DOI 10.1109/JPROC.2010.2040797
Zou H, 2005, J ROY STAT SOC B, V67, P301, DOI 10.1111/j.1467-9868.2005.00503.x
NR 69
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2013
VL 55
IS 5
BP 707
EP 720
DI 10.1016/j.specom.2013.01.005
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 141RS
UT WOS:000318744800009
ER
PT J
AU Kurtic, E
Brown, GJ
Wells, B
AF Kurtic, Emina
Brown, Guy J.
Wells, Bill
TI Resources for turn competition in overlapping talk
SO SPEECH COMMUNICATION
LA English
DT Article
DE Overlapping talk; Turn competition; Prosody; Turn-taking; Turn end
projection
ID ORGANIZATION; GAZE; CONVERSATION; ALIGNMENT; PAUSES; GAPS
AB Overlapping talk occurs frequently in multi-party conversations, and is a domain in which speakers may pursue various communicative goals. The current study focuses on turn competition. Specifically, we seek to identify the phonetic differences that discriminate turn-competitive from non-competitive overlaps. Conversation analysis techniques were used to identify competitive and non-competitive overlaps in a corpus of multi-party recordings. We then generated a set of potentially predictive features relating to prosody (F0, intensity, speech rate, pausing) and overlap placement (overlap duration, point of overlap onset, recycling, etc.). Decision tree classifiers were trained on the features and tested on a classification task, in order to determine which features and feature combinations best differentiate competitive overlaps from non-competitive overlaps. It was found that overlap placement features played a greater role than prosodic features in indicating turn competition. Among the prosodic features tested, F0 and intensity were the most effective predictors of turn competition. Also, our decision tree models suggest that turn-competitive and non-competitive overlaps can be initiated by a new speaker at many different points in the current speaker's turn. These findings have implications for the design of dialogue systems, and suggest novel hypotheses about how speakers deploy phonetic resources in everyday talk. (C) 2012 Elsevier B.V. All rights reserved.
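A minimal sketch of the classification setup described in this abstract, with wholly synthetic placeholder features and labels in place of the annotated overlap corpus: a feature table combining prosodic and placement descriptors is fed to a decision tree and scored by cross-validation.

import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n = 200
# Hypothetical per-overlap features: prosody (F0, intensity, rate, pausing) and
# placement (duration, onset position, recycling). All values are placeholders.
X = np.column_stack([
    rng.normal(size=n),        # F0 excursion (z-scored placeholder)
    rng.normal(size=n),        # intensity (placeholder)
    rng.normal(size=n),        # speech rate (placeholder)
    rng.uniform(0, 1, n),      # pausing proportion (placeholder)
    rng.uniform(0, 2, n),      # overlap duration in seconds (placeholder)
    rng.uniform(0, 1, n),      # relative point of overlap onset in the turn (placeholder)
    rng.integers(0, 2, n),     # recycling of overlapped material (0/1)
])
y = rng.integers(0, 2, n)      # 1 = turn-competitive, 0 = non-competitive (placeholder labels)

clf = DecisionTreeClassifier(max_depth=4, random_state=0)
print("cross-validated accuracy:", cross_val_score(clf, X, y, cv=5).mean())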
C1 [Kurtic, Emina; Wells, Bill] Univ Sheffield, Dept Human Commun Sci, Sheffield S10 2TA, S Yorkshire, England.
[Kurtic, Emina; Brown, Guy J.] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England.
RP Wells, B (reprint author), Univ Sheffield, Dept Human Commun Sci, 31 Claremont Crescent, Sheffield S10 2TA, S Yorkshire, England.
EM bill.wells@sheffield.ac.uk
FU University of Sheffield; UK Arts and Humanities Research Council
[1-62874195]
FX The research reported here was supported by a University of Sheffield
Project Studentship. Preparation of the article was facilitated by UK
Arts and Humanities Research Council Grant 1-62874195. We are grateful
to our annotators for their time and effort; to Ahmed Aker for
invaluable assistance at various stages of the research; to Gareth
Walker and John Local for their sustained interest and encouragement;
and to Jens Edlund and an anonymous reviewer for their constructive
comments on an earlier draft.
CR Adda-Decker M., 2008, P 6 INT LANG RES EV
Barkhuysen P, 2008, J ACOUST SOC AM, V123, P354, DOI 10.1121/1.2816561
Bavelas JB, 2002, J COMMUN, V52, P566, DOI 10.1093/joc/52.3.566
Boersma P., 2001, GLOT INT, V5, P341
Carletta J, 2007, LANG RESOUR EVAL, V41, P181, DOI 10.1007/s10579-007-9040-x
Cetin O., 2006, P 3 JOINT WORKSH MUL
Couper-Kuhlen E., 1993, ENGLISH SPEECH RHYTH
Dellwo V., 2006, P SPEECH PROS 2006 D
Dhillon R, 2004, TR04002 ICSI
French P., 1983, J PRAGMATICS, V7, P701
Gardner R., 2001, PRAGMATICS NEW SERIE, V92
GOODWIN C, 1980, SOCIOL INQ, V50, P272, DOI 10.1111/j.1475-682X.1980.tb00023.x
GOODWIN MH, 1986, SEMIOTICA, V62, P51
Gorisch J, 2012, LANG SPEECH, V55, P57, DOI 10.1177/0023830911428874
Gravano A, 2011, COMPUT SPEECH LANG, V25, P601, DOI 10.1016/j.csl.2010.10.003
Hain T, 2012, IEEE T AUDIO SPEECH, V20, P486, DOI 10.1109/TASL.2011.2163395
Heldner M, 2010, J PHONETICS, V38, P555, DOI 10.1016/j.wocn.2010.08.002
Heldner M, 2011, J ACOUST SOC AM, V130, P508, DOI 10.1121/1.3598457
Hjalmarsson A, 2011, SPEECH COMMUN, V53, P23, DOI 10.1016/j.specom.2010.08.003
Janin A, 2003, 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, P364
Jefferson G., 1987, INTERACTION LANGUAGE, V9, P153
Jefferson G., 1983, 2 EXPLORATIONS ORG O
Jefferson G., 2003, CONVERSATION ANAL ST
Jefferson Gail, 2004, CONVERSATION ANAL ST, P13
KENDON A, 1967, ACTA PSYCHOL, V26, P22, DOI 10.1016/0001-6918(67)90005-4
Kurtic E, 2009, STUD PRAGMAT, V8, P183
Kurtic E., 2012, P INT C LANG RES EV
Kurtic E., 2010, P INT 2010 MAK JAP
Lee C., 2008, P INT C SPOK LANG PR
Lerner G, 1999, LANGUAGE TURN SEQUEN, P225
Lerner G., 1999, CONVERSATION ANAL ST
Levinson Stephen C, 2006, ROOTS HUMAN SOCIALIT, P39
Local J, 2005, PHONETICA, V62, P120, DOI 10.1159/000090093
Local J, 2005, FIGURE OF SPEECH: A FESTSCHRIFT FOR JOHN LAVER, P263
Mondada L., 2011, INTEGRATING GESTURES
Olshen R., 1984, CLASSIFICATION REGRE, V1st
Quinlan R., 1994, C4 5 PROGRAMS MACHIN
Reidsma D, 2011, J MULTIMODAL USER IN, V4, P97, DOI 10.1007/s12193-011-0060-x
SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243
Schegloff Emanuel A., 1982, ANAL DISCOURSE TEXT, P71
Schegloff Emanuel A., 2001, HDB SOCIOLOGICAL THE, P287, DOI 10.1007/0-387-36274-6_15
Schegloff EA, 2000, LANG SOC, V29, P1
Schegloff Emanuel Abraham, 1987, TALK SOCIAL ORG, P70
Selting M., 1998, INTERACTION LINGUIST, V4, P1
Shriberg E., 2001, P 7 EUR C SPEECH COM
Shriberg E., 2001, ISCA TUT RES WORKSH
Sidnell J, 2001, J PRAGMATICS, V33, P1263, DOI 10.1016/S0378-2166(00)00062-X
Stivers T, 2008, RES LANG SOC INTERAC, V41, P31, DOI 10.1080/08351810701691123
Szczepek-Reed Beatrice, 2006, PROSODIC ORIENTATION
WALKER MB, 1982, J SOC PSYCHOL, V117, P305
Wells B, 1998, LANG SPEECH, V41, P265
Wells B., 2004, SOUND PATTERNS INTER, P119
Yngve Victor, 1970, 6 REG M CHIC LING SO, P567
NR 53
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2013
VL 55
IS 5
BP 721
EP 743
DI 10.1016/j.specom.2012.10.002
PG 23
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 141RS
UT WOS:000318744800010
ER
PT J
AU Zhang, Y
Zhao, YX
AF Zhang, Yi
Zhao, Yunxin
TI Real and imaginary modulation spectral subtraction for speech
enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spectral subtraction; Noise reduction; Speech phase; Modulation
frequency
ID ADDITIVE NOISE; RECOGNITION; PHASE; PERFORMANCE; SEPARATION
AB In this paper, we propose a novel spectral subtraction method for noisy speech enhancement. Instead of taking the conventional approach of carrying out subtraction on the magnitude spectrum in the acoustic frequency domain, we propose to perform subtraction on the real and imaginary spectra separately in the modulation frequency domain, a method we refer to as MRISS. By doing so, we are able to enhance magnitude as well as phase through spectral subtraction. We conducted objective and subjective evaluation experiments to compare the performance of the proposed MRISS method with three existing methods: modulation frequency domain magnitude spectral subtraction (MSS), nonlinear spectral subtraction (NSS), and minimum mean square error estimation (MMSE). The objective evaluation used the criteria of segmental signal-to-noise ratio (segmental SNR), PESQ, and average Itakura-Saito spectral distance (ISD). The subjective evaluation used a mean preference score with 14 participants. Both objective and subjective evaluation results demonstrated that the proposed method outperformed the three existing speech enhancement methods. A further analysis showed that the winning performance of the proposed MRISS method comes from improvements in the recovery of both the acoustic magnitude and phase spectra. (C) 2012 Elsevier B.V. All rights reserved.
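A heavily simplified sketch of the idea named in this abstract, not the authors' algorithm: subtract noise separately from the real and imaginary acoustic spectra after a modulation transform. The signals are synthetic placeholders, a separate noise-only reference of equal length is assumed, and a single FFT across all STFT frames stands in for the framed modulation-domain analysis used in the paper.

import numpy as np
from scipy.signal import stft, istft

fs = 16000
rng = np.random.default_rng(2)
t = np.arange(fs) / fs
clean = np.sin(2 * np.pi * 440 * t)              # placeholder "speech"
noise_ref = 0.3 * rng.normal(size=fs)            # assumed separate noise-only reference
noisy = clean + 0.3 * rng.normal(size=fs)

_, _, Z = stft(noisy, fs, nperseg=512)           # acoustic STFT of the noisy signal
_, _, N = stft(noise_ref, fs, nperseg=512)       # acoustic STFT of the noise reference

def subtract(part, noise_part):
    # Magnitude subtraction in a crude one-block modulation domain (FFT over frames).
    M = np.fft.fft(part, axis=1)
    Mn = np.abs(np.fft.fft(noise_part, axis=1))
    mag = np.maximum(np.abs(M) - Mn, 0.05 * np.abs(M))      # spectral floor at 5%
    return np.real(np.fft.ifft(mag * np.exp(1j * np.angle(M)), axis=1))

enhanced = subtract(Z.real, N.real) + 1j * subtract(Z.imag, N.imag)
_, out = istft(enhanced, fs, nperseg=512)
print("enhanced signal length:", len(out))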
C1 [Zhang, Yi; Zhao, Yunxin] Univ Missouri, Dept Comp Sci, Columbia, MO 65211 USA.
RP Zhao, YX (reprint author), Univ Missouri, Dept Comp Sci, Columbia, MO 65211 USA.
EM yzcb3@mail.missouri.edu; Zhaoy@missouri.edu
CR Aarabi P, 2004, IEEE T SYST MAN CY B, V34, P1763, DOI 10.1109/TSMCB.2004.830345
[Anonymous], 2001, RWCP SOUND SC DAT RE
Araki S, 2006, INT CONF ACOUST SPEE, P33
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Chen JD, 2006, IEEE T AUDIO SPEECH, V14, P1218, DOI 10.1109/TSA.2005.860851
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Evans NWD, 2006, INT CONF ACOUST SPEE, P145
Flanagan J. L., 1992, P IEEE INT C AC SPEE, V1, P285
Hansen J. H. L., 1998, P INT C SPOK LANG PR, V7, P2819
Hegde RM, 2007, IEEE T AUDIO SPEECH, V15, P190, DOI 10.1109/TASL.2006.876858
HIRSCH HG, 1995, INT CONF ACOUST SPEE, P153, DOI 10.1109/ICASSP.1995.479387
KAMATH S, 2002, ACOUST SPEECH SIG PR, P4164
Kleinschmidt T, 2011, COMPUT SPEECH LANG, V25, P585, DOI 10.1016/j.csl.2010.09.001
Lin L, 2003, ELECTRON LETT, V39, P754, DOI 10.1049/el:20030480
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lu Y, 2008, SPEECH COMMUN, V50, P453, DOI 10.1016/j.specom.2008.01.003
Makhoul J., 1979, P IEEE INT C AC SPEE, V23, P208
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Martin R., 1994, P 7 EUR SIGN PROC C, P1182
Nakagawa S., 2004, P INT C SPOK LANG PR, V23, P477
Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004
Papoulis A, 1991, PROBABILITY RANDOM V, V3rd
Savoji M. H., 2010, TELECOMMUN IST, P895
Schluter R, 2001, INT CONF ACOUST SPEE, P133
Shannon BJ, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1423
WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920
Wojcicki K, 2008, IEEE SIGNAL PROC LET, V15, P461, DOI 10.1109/LSP.2008.923579
Yellin D, 1996, IEEE T SIGNAL PROCES, V44, P106, DOI 10.1109/78.482016
Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI [10.1109/TSP.2004.828896, 10.1109/TSP.2004.V8896]
Yiteng Huang, 2004, AUDIO SIGNAL PROCESS
Yoma NB, 1998, IEEE T SPEECH AUDI P, V6, P579, DOI 10.1109/89.725325
Zhu D, 2004, P IEEE INT C AC SPEE, V1, P125
Zhu QF, 2002, IEEE SIGNAL PROC LET, V9, P275, DOI 10.1109/LSP.2002.801722
NR 33
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2013
VL 55
IS 4
BP 509
EP 522
DI 10.1016/j.specom.2012.09.005
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 122NW
UT WOS:000317325800001
ER
PT J
AU Mirzahasanloo, TS
Kehtarnavaz, N
Gopalakrishna, V
Loizou, PC
AF Mirzahasanloo, Taher S.
Kehtarnavaz, Nasser
Gopalakrishna, Vanishree
Loizou, Philipos C.
TI Environment-adaptive speech enhancement for bilateral cochlear implants
using a single processor
SO SPEECH COMMUNICATION
LA English
DT Article
DE Bilateral cochlear implants; Single-processor speech enhancement for
bilateral cochlear implants; Environment-adaptive speech enhancement
ID SPECTRAL AMPLITUDE ESTIMATOR; NOISE-REDUCTION; ALGORITHMS; CHILDREN;
RECOGNITION; PERCEPTION; BENEFITS
AB A computationally efficient speech enhancement pipeline for noisy environments, based on a single-processor implementation, is developed for use in bilateral cochlear implant systems. A two-channel joint objective function is defined and a closed-form solution is obtained based on the weighted-Euclidean distortion measure. The pipeline's computational efficiency, together with the absence of any synchronization requirement, makes it a suitable solution for real-time deployment. A speech quality measure is used to show its effectiveness in six different noisy environments, compared to a similar one-channel enhancement pipeline running on two separate processors or processing the two channels independently and sequentially. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Mirzahasanloo, Taher S.; Kehtarnavaz, Nasser; Gopalakrishna, Vanishree; Loizou, Philipos C.] Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75080 USA.
RP Kehtarnavaz, N (reprint author), Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75080 USA.
EM mirzahasanloo@utdallas.edu; kehtar@utdallas.edu
FU NIH/NIDCD [DC010494]
FX This work was supported by Grant No. DC010494 from NIH/NIDCD.
CR Abramowitz M., 1965, HDB MATH FUNCTIONS
Algazi V. R., 2001, IEEE WORKSH APPL SIG, P99
[Anonymous], 2000, P862 ITUT
Chen J., 2006, EURASIP J APPL SIG P, V26, P19
Ching T Y C, 2007, Trends Amplif, V11, P161, DOI 10.1177/1084713807304357
Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Erkelens J, 2007, SPEECH COMMUN, V49, P530, DOI 10.1016/j.specom.2006.06.012
Erkelens JS, 2008, IEEE T AUDIO SPEECH, V16, P1112, DOI 10.1109/TASL.2008.2001108
Fetterman BL, 2002, OTOLARYNG HEAD NECK, V126, P257, DOI 10.1067/mhn.2002.123044
Gopalakrishna V, 2012, IEEE T BIO-MED ENG, V59, P1691, DOI 10.1109/TBME.2012.2191968
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
Hu Y., 2007, J ACOUST SOC AM, V128, P128
IEEE Subcommittee, 1969, IEEE T AUDIO ELECTRO, V3, P225, DOI DOI 10.1109/TAU.1969.1162058
Kokkinakis K, 2010, J ACOUST SOC AM, V127, P3136, DOI 10.1121/1.3372727
Kuhn-Inacker H, 2004, INT J PEDIATR OTORHI, V68, P1257, DOI 10.1016/j.ijporl.2004.04.029
Litovsky RY, 2006, INT J AUDIOL, V45, pS78, DOI 10.1080/14992020600782956
Litovsky RY, 2004, ARCH OTOLARYNGOL, V130, P648, DOI 10.1001/archotol.130.5.648
Loizou Philipos C, 2006, Adv Otorhinolaryngol, V64, P109
Loizou PC, 2011, STUD COMPUT INTELL, V346, P623
Loizou PC, 2005, J ACOUST SOC AM, V118, P2791, DOI 10.1121/1.2065847
Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lotter T, 2005, EURASIP J APPL SIG P, V2005, P1110, DOI 10.1155/ASP.2005.1110
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Mirzahasanloo T. S., 2012, IEEE INT C ENG MED B, V2012, P2271
Muller J, 2002, EAR HEARING, V23, P198
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
Remus JJ, 2005, EURASIP J APPL SIG P, V2005, P2979, DOI 10.1155/ASP.2005.2979
van Hoesel RJM, 2003, J ACOUST SOC AM, V113, P1617, DOI 10.1121/1.1539520
van Hoesel RJM, 2004, AUDIOL NEURO-OTOL, V9, P234, DOI 10.1159/000078393
NR 32
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2013
VL 55
IS 4
BP 523
EP 534
DI 10.1016/j.specom.2012.10.004
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 122NW
UT WOS:000317325800002
ER
PT J
AU Dam, HH
Rimantho, D
Nordholm, S
AF Dam, Hai Huyen
Rimantho, Dedi
Nordholm, Sven
TI Second-order blind signal separation with optimal step size
SO SPEECH COMMUNICATION
LA English
DT Article
DE Blind signal separation; Convolutive mixture; Conjugate gradient; Fast
convergent; Optimal step size
ID NONSTATIONARY SOURCES; CONJUGATE-GRADIENT
AB This paper proposes a new computational procedure for solving the second-order gradient-based blind signal separation (BSS) problem with convolutive mixtures. The problem is formulated as a constrained optimization problem in which time-domain constraints on the unmixing matrices are added to mitigate the permutation effects associated with convolutive mixtures. A linear transformation using QR factorization is developed to transform the constrained optimization problem into an unconstrained one. A conjugate gradient procedure with the step size derived optimally at each iteration is then proposed to solve the optimization problem. The advantage of the procedure is its low computational complexity, as it does not require multiple evaluations of the objective function. In addition, the fast convergence of the conjugate gradient algorithm makes it suitable for online implementation. The convergence of the conjugate gradient algorithm with optimal step size is compared to the fixed step size case and to the optimal step size steepest descent algorithm. Evaluations are performed in real and simulated environments. Crown Copyright (C) 2012 Published by Elsevier B.V. All rights reserved.
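A minimal sketch of the core numerical idea named in this abstract, applied to a generic quadratic objective rather than the BSS cost: conjugate gradient with the step size computed in closed form at each iteration, so no repeated objective evaluations (line searches) are needed. The matrix and vector below are random placeholders standing in for the transformed, unconstrained separation problem.

import numpy as np

# Minimize f(w) = 0.5 w^T A w - b^T w for a symmetric positive definite A.
rng = np.random.default_rng(4)
M = rng.normal(size=(8, 8))
A = M @ M.T + 8 * np.eye(8)            # symmetric positive definite stand-in Hessian
b = rng.normal(size=8)

w = np.zeros(8)
g = A @ w - b                          # gradient of the quadratic
p = -g                                 # initial search direction
for _ in range(20):
    alpha = -(g @ p) / (p @ (A @ p))   # optimal step size along p, in closed form
    w = w + alpha * p
    g_new = A @ w - b
    beta = (g_new @ g_new) / (g @ g)   # Fletcher-Reeves conjugacy update
    p = -g_new + beta * p
    g = g_new

print("residual norm:", np.linalg.norm(A @ w - b))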
C1 [Dam, Hai Huyen; Rimantho, Dedi] Curtin Univ Technol, Dept Math & Stat, Perth, WA, Australia.
[Nordholm, Sven] Curtin Univ Technol, Dept Elect & Comp Engn, Perth, WA, Australia.
RP Dam, HH (reprint author), Curtin Univ Technol, Dept Math & Stat, Perth, WA, Australia.
EM H.dam@curtin.edu.au; dedi.rimantho@student.curtin.edu.au;
S.Nordholm@curtin.edu.au
RI Nordholm, Sven/J-5247-2014
FU ARC [DP120103859]
FX This research was supported by ARC Discovery Project DP120103859.
CR Benesty J., 2005, SPEECH ENHANCEMENT
BORAY GK, 1992, IEEE T CIRCUITS-I, V39, P1, DOI 10.1109/81.109237
Buchner H, 2005, IEEE T SPEECH AUDI P, V13, P120, DOI 10.1109/TSA.2004.838775
Dam HH, 2008, IEEE SIGNAL PROC LET, V15, P79, DOI 10.1109/LSP.2007.910234
Dam HH, 2007, IEEE T SIGNAL PROCES, V55, P4198, DOI 10.1109/TSP.2007.894406
FLETCHER R, 1964, COMPUT J, V7, P149, DOI 10.1093/comjnl/7.2.149
Garofolo JS, 1993, TIMIT ACOUSTIC PHONE
Hyvarinen A, 2001, INDEPENDENT COMPONEN
McCormick G., 1983, NONLINEAR PROGRAMMIN
Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214
Schobben DWE, 2002, IEEE T SIGNAL PROCES, V50, P1855, DOI 10.1109/TSP.2002.800417
Wang WW, 2005, IEEE T SIGNAL PROCES, V53, P1654, DOI 10.1109/TSP.2005.845433
Yong P.C., 2011, P 19 EUR SIGN PROC C, P211
NR 13
TC 0
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2013
VL 55
IS 4
BP 535
EP 543
DI 10.1016/j.specom.2012.10.003
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 122NW
UT WOS:000317325800003
ER
PT J
AU Nam, KW
Ji, YS
Han, J
Lee, S
Kim, D
Hong, SH
Jang, DP
Kim, IY
AF Nam, Kyoung Won
Ji, Yoon Sang
Han, Jonghee
Lee, Sangmin
Kim, Dongwook
Hong, Sung Hwa
Jang, Dong Pyo
Kim, In Young
TI Clinical evaluation of the performance of a blind source separation
algorithm combining beamforming and independent component analysis in
hearing aid use
SO SPEECH COMMUNICATION
LA English
DT Article
DE Hearing aid; Independent component analysis; Beamforming; Noise
reduction
ID FREQUENCY-DOMAIN ICA; SPEECH RECOGNITION; NOISE; QUALITY; SYSTEMS
AB There have been several reports on improved blind source separation algorithms that combine beamforming and independent component analysis. However, none of the prior reports verified the clinical efficacy of such combinational algorithms in real hearing aid situations. In the current study, we evaluated the clinical efficacy of such a combinational algorithm using the mean opinion score and speech recognition threshold tests in various types of real-world hearing aid situations involving environmental noise. Parameters of the testing algorithm were adjusted to match the geometric specifications of a real behind-the-ear hearing aid housing. The study included 15 normal-hearing volunteers and 15 hearing-impaired patients. Experimental results demonstrated that the testing algorithm improved the speech intelligibility of all participants in noisy environments, and that its clinical efficacy was superior to that of either the beamforming or the independent component analysis algorithm alone. Despite the computational complexity of the testing algorithm, our experimental results and the rapid advancement of hardware technology indicate that it has the potential to be applied to real hearing aids in the near future, thereby improving the speech intelligibility of hearing-impaired patients in noisy environments. (C) 2012 Elsevier B.V. All rights reserved.
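A rough, hedged sketch of only the independent component analysis stage of the kind of combined algorithm evaluated here: scikit-learn's FastICA separating a synthetic two-source, two-microphone instantaneous mixture. The real system operates on convolutive mixtures and couples ICA with a beamformer, neither of which is reproduced in this placeholder example.

import numpy as np
from sklearn.decomposition import FastICA

fs = 8000
t = np.arange(2 * fs) / fs
s1 = np.sign(np.sin(2 * np.pi * 3 * t))          # placeholder "speech" source
s2 = np.sin(2 * np.pi * 440 * t)                 # placeholder "noise" source
S = np.column_stack([s1, s2])

A = np.array([[1.0, 0.6],                        # assumed instantaneous mixing matrix
              [0.4, 1.0]])
X = S @ A.T                                      # two-microphone observations

ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)                     # estimated sources (up to scale and permutation)
print("correlation of estimate 1 with the two sources:",
      np.corrcoef(S_est[:, 0], s1)[0, 1], np.corrcoef(S_est[:, 0], s2)[0, 1])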
C1 [Nam, Kyoung Won; Ji, Yoon Sang; Jang, Dong Pyo; Kim, In Young] Hanyang Univ, Dept Biomed Engn, Seoul 133791, South Korea.
[Han, Jonghee; Kim, Dongwook] Samsung Adv Inst Technol, Bio & Hlth Lab, Yongin 446712, South Korea.
[Lee, Sangmin] Inha Univ, Dept Elect Engn, Inchon 402751, South Korea.
[Hong, Sung Hwa] Samsung Med Ctr, Dept Otolaryngol Head & Neck Surg, Seoul 135710, South Korea.
RP Kim, IY (reprint author), Hanyang Univ, Dept Biomed Engn, Seoul 133791, South Korea.
EM kwnam@bme.hanyang.ac.kr; ysji@bme.hanyang.ac.kr;
apaper@bme.hanyang.ac.kr; sanglee@inha.ac.kr; steve7.kim@samsung.com;
hongsh@skku.edu; dongpjang@gmail.com; iykim@hanyang.ac.kr
FU Strategic Technology Development Program of the Ministry of Knowledge
Economy [10031764]; Seoul University Industry Collaboration Foundation
[SS100022]
FX This work was supported by grants No. 10031764, from the Strategic
Technology Development Program of the Ministry of Knowledge Economy, and
No. SS100022, from the Seoul University Industry Collaboration
Foundation.
CR Duran-Diaz I, 2012, DIGIT SIGNAL PROCESS, V22, P1126, DOI 10.1016/j.dsp.2012.05.014
[Anonymous], 2012, P800 ITUT
Arehart KH, 2010, EAR HEARING, V31, P420, DOI 10.1097/AUD.0b013e3181d3d4f3
Bentler Ruth A, 2005, J Am Acad Audiol, V16, P473, DOI 10.3766/jaaa.16.7.7
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Chien JT, 2006, IEEE T AUDIO SPEECH, V14, P1245, DOI 10.1109/TSA.2005.858061
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
Guo W., 2012, CECNET 12, P3001
Han H, 2011, INT J AUDIOL, V50, P59, DOI 10.3109/14992027.2010.526637
KEATING PA, 1994, SPEECH COMMUN, V14, P131, DOI 10.1016/0167-6393(94)90004-3
Kocinski J, 2011, SPEECH COMMUN, V53, P390, DOI 10.1016/j.specom.2010.11.002
Kocinski J, 2008, SPEECH COMMUN, V50, P29, DOI 10.1016/j.specom.2007.06.003
Kokkinakis K, 2008, J ACOUST SOC AM, V123, P2379, DOI 10.1121/1.2839887
Lee TW, 1999, NEURAL COMPUT, V11, P417, DOI 10.1162/089976699300016719
Liquan Z., 2008, WICOM 08, P1
Luo FL, 2002, IEEE T SIGNAL PROCES, V50, P1583
Lv Q, 2005, LECT NOTES COMPUT SC, V3497, P538
Madhu N., 2006, IWAENC 06, P1
Mangalathu-Arumana J, 2012, NEUROIMAGE, V60, P2247, DOI 10.1016/j.neuroimage.2012.02.030
Marques I, 2012, SOFT COMPUT, V16, P1525, DOI 10.1007/s00500-012-0826-4
Mitianoudis N, 2004, LECT NOTES COMPUT SC, V3195, P669
Park HM, 1999, ELECTRON LETT, V35, P2011, DOI 10.1049/el:19991358
Parra LC, 2002, IEEE T SPEECH AUDI P, V10, P352, DOI 10.1109/TSA.2002.803443
Saruwatari H, 2003, EURASIP J APPL SIG P, V2003, P1135, DOI 10.1155/S1110865703305104
SARUWATARI H, 2001, ACOUST SPEECH SIG PR, P2733
Saruwatari H, 2006, IEEE T AUDIO SPEECH, V14, P666, DOI 10.1109/TSA.2005.855832
Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199
Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 10.1016/S0925-2312(98)00047-2
Thiede T, 2000, J AUDIO ENG SOC, V48, P3
Tunner C.W., 2004, J ACOUST SOC AM, V115, P1729
Ukai S, 2005, INT CONF ACOUST SPEE, P85
Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
NR 33
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2013
VL 55
IS 4
BP 544
EP 552
DI 10.1016/j.specom.2012.11.002
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 122NW
UT WOS:000317325800004
ER
PT J
AU Gonseth, C
Vilain, A
Vilain, C
AF Gonseth, Chloe
Vilain, Anne
Vilain, Coriandre
TI An experimental study of speech/gesture interactions and distance
encoding
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech/gesture interaction; Pointing; Distance encoding; Sound symbolism
ID SPEECH PRODUCTION; GESTURE; LANGUAGE
AB This paper explores the possible encoding of distance information in vocal and manual pointing and its relationship with the linguistic structure of deictic words, as well as speech/gesture cooperation within the process of deixis. Two experiments required participants to point at and/or name a close or distant target, with speech only, with gesture only, or with both speech and gesture. Acoustic, articulatory, and manual data were recorded. We investigated the interaction between vocal and manual pointing with respect to the distance to the target. There are two major findings. First, distance significantly affects both articulatory and manual pointing, since participants perform larger vocal and manual gestures to designate a more distant target. Second, modality influences both deictic speech and gesture, since pointing is more emphatic when either modality is used alone than when both are used together, compensating for the loss of the other modality. These findings suggest that distance is encoded in both vocal and manual pointing. We also demonstrate that the correlates of distance encoding in the vocal modality can be related to the typology of deictic words. Finally, our data suggest a two-way interaction between speech and gesture, and support the hypothesis that these two modalities cooperate within a single communication system. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Gonseth, Chloe; Vilain, Anne; Vilain, Coriandre] Grenoble Univ, CNRS UMR 5216, Gipsa Lab, Speech & Cognit Dept, F-38402 St Martin Dheres, France.
RP Gonseth, C (reprint author), Grenoble Univ, CNRS UMR 5216, Gipsa Lab, Speech & Cognit Dept, 11 Rue Math,Grenoble Campus,BP 46, F-38402 St Martin Dheres, France.
EM chloe.gonseth@gipsa-lab.grenoble-inp.fr;
anne.vilain@gipsa-lab.grenoble-inp.fr;
coriandre.vilain@gipsa-lab.grenoble-inp.fr
CR Arbib MA, 2005, BEHAV BRAIN SCI, V28, P105, DOI 10.1017/S0140525X05000038
Astafiev SV, 2003, J NEUROSCI, V23, P4689
Bernardis P, 2006, NEUROPSYCHOLOGIA, V44, P178, DOI 10.1016/j.neuropsychologia.2005.05.007
Boersma P., 2001, GLOT INT, V5, P341
Bonfiglioli C, 2009, COGNITION, V111, P270, DOI 10.1016/j.cognition.2009.01.006
Bruner J.S., 1983, CHILD TALK
Butterworth G., 1998, DEV SENSORY MOTOR CO, P171
Chieffi S, 2009, BEHAV BRAIN RES, V203, P200, DOI 10.1016/j.bbr.2009.05.003
de Ruiter JP, 1998, THESIS CATHOLIC U NI
Diessel H., 1999, DEMONSTRATIVES FORM, P42
DIESSEL HOLGER, 2011, WORLD ATLAS LANGUAGE
Enfield NJ, 2003, LANGUAGE, V79, P82, DOI 10.1353/lan.2003.0075
Feyereisen P, 1997, J MEM LANG, V36, P13, DOI 10.1006/jmla.1995.2458
GENTILUCCI M, 1991, NEUROPSYCHOLOGIA, V29, P361, DOI 10.1016/0028-3932(91)90025-4
BUTTERWORTH B, 1989, PSYCHOL REV, V96, P168, DOI 10.1037//0033-295X.96.1.168
Hostetter AB, 2008, PSYCHON B REV, V15, P495, DOI 10.3758/PBR.15.3.495
Iverson JM, 2005, PSYCHOL SCI, V16, P367, DOI 10.1111/j.0956-7976.2005.01542.x
Johansson N., 2011, THESIS LUNDS U
Kendon A., 2004, GESTURE VISIBLE ACTI
Kita S., 2003, POINTING LANGUAGE CU
Kita S, 2003, J MEM LANG, V48, P16, DOI 10.1016/S0749-596X(02)00505-3
Engberg-Pedersen E, 2003, POINTING: WHERE LANGUAGE, CULTURE, AND COGNITION MEET, P269
Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005
Krause MA, 1997, J COMP PSYCHOL, V111, P330, DOI 10.1037/0735-7036.111.4.330
Krauss R. M., 2000, LANGUAGE GESTURE, P261, DOI DOI 10.1017/CBO9780511620850.017
Leavens DA, 2005, CURR DIR PSYCHOL SCI, V14, P185, DOI 10.1111/j.0963-7214.2005.00361.x
Levelt W. J., 1989, SPEAKING INTENTION A
LEVELT WJM, 1985, J MEM LANG, V24, P133, DOI 10.1016/0749-596X(85)90021-X
Levelt WJM, 1999, BEHAV BRAIN SCI, V22, P1
LINDBLOM B, 1979, J PHONETICS, V7, P147
Loevenbruck H, 2005, J NEUROLINGUIST, V18, P237, DOI 10.1016/j.jneuroling.2004.12.002
MAEDA S, 1991, J PHONETICS, V19, P321
McNeill D., 2005, GESTURE AND THOUGHT
McNeill D., 2000, LANGUAGE AND GESTURE
Rochet-Capellan A., 2008, P INT SEM SPEECH PRO
Sapir E., 1949, SELECTED WRITINGS E, P61
Tomasello M, 2005, BEHAV BRAIN SCI, V28, P675, DOI 10.1017/S0140525X05000129
Traunmuller H., 1987, PSYCHOPHYSICS SPEECH, P293
Traunmuller H., 1996, TMH QPSR, V2, P147
Ultan R., 1978, UNIVERSALS HUMAN LAN, V2
Volterra Virginia, 2005, NATURE NURTURE ESSAY, P3
WOODWORTH NL, 1991, LINGUISTICS, V29, P273, DOI 10.1515/ling.1991.29.2.273
NR 42
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2013
VL 55
IS 4
BP 553
EP 571
DI 10.1016/j.specom.2012.11.003
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 122NW
UT WOS:000317325800005
ER
PT J
AU Cooke, M
Mayo, C
Valentini-Botinhao, C
Stylianou, Y
Sauert, B
Tang, Y
AF Cooke, Martin
Mayo, Catherine
Valentini-Botinhao, Cassia
Stylianou, Yannis
Sauert, Bastian
Tang, Yan
TI Evaluating the intelligibility benefit of speech modifications in known
noise conditions
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility; Speech modification; Synthetic speech
ID LISTENING CONDITIONS; HEARING; ENVIRONMENTS; ENHANCEMENT; RECOGNITION;
ALGORITHM; MODEL; CLEAR
AB The use of live and recorded speech is widespread in applications where correct message reception is important. Furthermore, the deployment of synthetic speech in such applications is growing. Modifications to natural and synthetic speech have therefore been proposed which aim at improving intelligibility in noise. The current study compares the benefits of speech modification algorithms in a large-scale speech intelligibility evaluation and quantifies the equivalent intensity change, defined as the amount in decibels that unmodified speech would need to be adjusted by in order to achieve the same intelligibility as modified speech. Listeners identified keywords in phonetically-balanced sentences representing ten different types of speech: plain and Lombard speech, five types of modified speech, and three forms of synthetic speech. Sentences were masked by either a stationary or a competing speech masker. Modification methods varied in the manner and degree to which they exploited estimates of the masking noise. The best-performing modifications led to equivalent intensity changes of around 5 dB in moderate and high noise levels for the stationary masker, and 3-4 dB in the presence of competing speech. These gains exceed those produced by Lombard speech. Synthetic speech in noise was always less intelligible than plain natural speech, but modified synthetic speech reduced this deficit by a significant amount. (C) 2013 Elsevier B.V. All rights reserved.
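The equivalent intensity change defined in the abstract above can be made concrete with a small sketch: given a keyword-accuracy curve for plain speech measured at several SNRs and one score for modified speech at a fixed SNR, the equivalent change is the level offset (in dB) at which plain speech would reach the modified score. This is only an illustration under invented numbers and a linear-interpolation assumption, not the paper's evaluation code.

import numpy as np

# Hypothetical keyword-accuracy data (percent correct) for plain speech at
# several SNRs, plus one score for modified speech at -4 dB SNR.
snr_db = np.array([-9.0, -6.0, -3.0, 0.0, 3.0])
plain_acc = np.array([18.0, 35.0, 55.0, 72.0, 84.0])
modified_acc_at_minus4 = 62.0

# SNR at which plain speech would reach the same score
# (linear interpolation of the inverse psychometric function).
snr_equivalent = np.interp(modified_acc_at_minus4, plain_acc, snr_db)

# Equivalent intensity change: the level boost (dB) plain speech would need
# at -4 dB SNR to match the modified speech's intelligibility.
eic_db = snr_equivalent - (-4.0)
print(f"equivalent intensity change ~ {eic_db:.1f} dB")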
C1 [Cooke, Martin] Ikerbasque Basque Sci Fdn, Bilbao, Spain.
[Cooke, Martin; Tang, Yan] Univ Basque Country, Language & Speech Lab, Vitoria, Spain.
[Mayo, Catherine; Valentini-Botinhao, Cassia] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland.
[Stylianou, Yannis] ICS FORTH, Inst Comp Sci, Iraklion, Greece.
[Sauert, Bastian] Rhein Westfal TH Aachen, Inst Commun Syst & Data Proc, Aachen, Germany.
RP Cooke, M (reprint author), Univ Basque Country, Language & Speech Lab, Vitoria, Spain.
EM m.cooke@ikerbasque.org
FU European Community [213850]; Future and Emerging Technologies (FET)
programme under FET-Open grant [256230]
FX We thank Vasilis Karaiskos for help in running the listening tests,
Julian Villegas for contributions to the recording of speech material,
and T-C. Zorila, V. Kandia and D. Erro for useful discussions on
developing SSDRC and TMDRC. The research leading to these results was
partly funded from the European Community's Seventh Framework Programme
(FP7/2007-2013) under grant agreement 213850 (SCALE) and by the Future
and Emerging Technologies (FET) programme under FET-Open grant number
256230 (LISTA).
CR ANSI, 1997, S351997 ANSI
Bell S.T., 1992, J SPEECH HEAR RES, V35, P950
Blesser B.A., 1969, IEEE T AUDIO ELECTRO, V17
Boersma P., 2001, GLOT INT, V5, P341
Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103
Christiansen C, 2010, SPEECH COMMUN, V52, P678, DOI 10.1016/j.specom.2010.03.004
Cooke M, 2010, J ACOUST SOC AM, V128, P2059, DOI 10.1121/1.3478775
Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600
DREHER JJ, 1957, J ACOUST SOC AM, V29, P1320, DOI 10.1121/1.1908780
Dreschler WA, 2001, AUDIOLOGY, V40, P148
Erro D., 2012, P INTERSPEECH
FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247
Hazan V., 1996, SPEECH HEARING LANGU, V9, P43
Hazan V, 2011, J ACOUST SOC AM, V130, P2139, DOI 10.1121/1.3623753
Taal C. H., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288810
Holland J. H., 1975, ADAPTATION NATURAL A, V2nd
Howell P, 2006, PERCEPT PSYCHOPHYS, V68, P139, DOI 10.3758/BF03193664
Huang D., 2010, P SSW7 KYOT JAP, P258
Kates JM, 1998, SPRING INT SER ENG C, P235
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Langner B, 2005, INT CONF ACOUST SPEE, P265
Lindblom B., 1990, SPEECH PRODUCTION SP, V55, P403
Lombard E., 1911, ANN MALADIES OREILLE, V37, P101
Lu YY, 2008, J ACOUST SOC AM, V124, P3261, DOI 10.1121/1.2990705
McLoughlin IV, 1997, DSP 97: 1997 13TH INTERNATIONAL CONFERENCE ON DIGITAL SIGNAL PROCESSING PROCEEDINGS, VOLS 1 AND 2, P591
Moore R.K., 2011, 17 INT C PHON SCI, P1422
NIEDERJOHN RJ, 1976, IEEE T ACOUST SPEECH, V24, P277, DOI 10.1109/TASSP.1976.1162824
Patel R, 2008, J SPEECH LANG HEAR R, V51, P209, DOI 10.1044/1092-4388(2008/016)
PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96
Pittman AL, 2001, J SPEECH LANG HEAR R, V44, P487, DOI 10.1044/1092-4388(2001/038)
Raitio T., 2012, P ICASSP, P4015
Raitio T., 2011, P INTERSPEECH, P2781
RIX AW, 2001, ACOUST SPEECH SIG PR, P749
Rothauser E. H., 1969, IEEE T AUDIO ELECTRO, V17, P225, DOI DOI 10.1109/TAU.1969.1162058
Sauert B., 2010, P ITG FACHT SPRACHK
Sauert B, 2006, INT CONF ACOUST SPEE, P493
Sauert B., 2011, P C EL SPRACHS ESSV, P333
Skowronski MD, 2006, SPEECH COMMUN, V48, P549, DOI 10.1016/j.specom.2005.09.003
SoX, 2012, SOX SOUND EXCHANGE S
STUDEBAKER GA, 1987, J ACOUST SOC AM, V81, P1130, DOI 10.1121/1.394633
Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660
Taal CH, 2011, IEEE T AUDIO SPEECH, V19, P2125, DOI 10.1109/TASL.2011.2114881
Tang Y., 2010, P INTERSPEECH, P1636
Tang Y., 2011, P INTERSPEECH, P345
Tang Y., 2012, P INTERSPEECH
Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816
Uther M, 2007, SPEECH COMMUN, V49, P2, DOI 10.1016/j.specom.2006.10.003
Valentini-Botinhao C., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288794
Valentini-Botinhao C., 2012, P INTERSPEECH
Yamagishi J., 2008, P BLIZZ CHALL WORKSH
Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P66, DOI 10.1109/TASL.2008.2006647
Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P1208, DOI 10.1109/TASL.2009.2016394
Yoo SD, 2007, J ACOUST SOC AM, V122, P1138, DOI 10.1121/1.2751257
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
Zorila T. C., 2012, P INTERSPEECH
ZWICKER E, 1961, J ACOUST SOC AM, V33, P248, DOI 10.1121/1.1908630
NR 56
TC 11
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2013
VL 55
IS 4
BP 572
EP 585
DI 10.1016/j.specom.2013.01.001
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 122NW
UT WOS:000317325800006
ER
PT J
AU Dai, P
Soon, IY
AF Dai, Peng
Soon, Ing Yann
TI An improved model of masking effects for robust speech recognition
system
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic speech recognition; Auditory modeling; Simultaneous masking;
Temporal masking; AURORA2
ID AUDITORY-SYSTEM; LATERAL INHIBITION; WORD RECOGNITION; FRONT-END;
FREQUENCY; NOISE; ADAPTATION; PERCEPTION; FEATURES; NERVE
AB The performance of an automatic speech recognition system drops dramatically in the presence of background noise, unlike the human auditory system, which is far more adept at recognizing noisy speech. This paper proposes a novel auditory modeling algorithm, named LTFC, which is integrated into the feature extraction front-end of a hidden Markov model (HMM) recognizer. It simulates properties of the human auditory system and applies them to the speech recognition system to enhance its robustness, integrating simultaneous masking, temporal masking and cepstral mean and variance normalization into the ordinary mel-frequency cepstral coefficient (MFCC) feature extraction algorithm for robust speech recognition. The proposed method sharpens the power spectrum of the signal in both the frequency domain and the time domain. Evaluation tests are carried out on the AURORA2 database. Experimental results show that the word recognition rate obtained with the proposed feature extraction method is effectively increased. (c) 2012 Elsevier B.V. All rights reserved.
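One component named in the abstract above, cepstral mean and variance normalization, is simple to state concretely. The sketch below normalizes an utterance's MFCC matrix per coefficient; the masking stages of LTFC are not reproduced here, and the random matrix is only a stand-in for real MFCC features.

import numpy as np

def cmvn(features, eps=1e-8):
    """Cepstral mean and variance normalization over one utterance.
    features: (num_frames, num_coeffs) array of MFCCs."""
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Stand-in for a real (frames x 13) MFCC matrix.
mfcc = np.random.randn(200, 13) * 5.0 + 2.0
normalized = cmvn(mfcc)
print(normalized.mean(axis=0).round(3), normalized.std(axis=0).round(3))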
C1 [Dai, Peng; Soon, Ing Yann] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore.
RP Dai, P (reprint author), Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore.
EM daip0001@e.ntu.edu.sg; eiysoon@ntu.edu.sg
CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615
Bouquin L., 1995, P IEEE INT C AC SPEE, V1, P800
Brookes M., VOICEBOX
Chen CP, 2007, IEEE T AUDIO SPEECH, V15, P257, DOI 10.1109/TASL.2006.876717
Chen JD, 2007, SPEECH COMMUN, V49, P305, DOI 10.1016/j.specom.2007.02.002
CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427
Dai P., 2010, SPEECH COMMUN, V53, P229
Dai P., 2009, P ICICS MAC, P1
Dai P, 2012, SPEECH COMMUN, V54, P402, DOI 10.1016/j.specom.2011.10.004
FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788
Gold B., 2000, SPEECH AUDIO SIGNAL
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2
Holmberg M, 2006, IEEE T AUDIO SPEECH, V14, P43, DOI 10.1109/TSA.2005.860349
HOUTGAST T, 1972, J ACOUST SOC AM, V51, P1885, DOI 10.1121/1.1913048
JESTEADT W, 1982, J ACOUST SOC AM, V71, P950, DOI 10.1121/1.387576
Lu X., 2000, P IEEE SIGN PROC SOC, V2, P785
McGovern S. G., MODEL ROOM ACOUSTICS
MILNER B, 2002, ACOUST SPEECH SIG PR, P797
Mokbel C, 1996, SPEECH COMMUN, V19, P185, DOI 10.1016/0167-6393(96)00032-5
Oxenham AJ, 2001, J ACOUST SOC AM, V109, P732, DOI 10.1121/1.1336501
Palomaki KJ, 2011, SPEECH COMMUN, V53, P924, DOI 10.1016/j.specom.2011.03.005
Park KY, 2003, NEUROCOMPUTING, V52-4, P615, DOI 10.1016/S0925-2312(02)00791-9
Pearce D., 2000, P ISCA ITRW ASR, P17
Roeser R.J., 2000, AUDIOLOGY DIAGNOSIS
SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1612, DOI 10.1121/1.392799
SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1622, DOI 10.1121/1.392800
Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569
Tchorz J, 1999, J ACOUST SOC AM, V106, P2040, DOI 10.1121/1.427950
Togneri R., 2010, P IEEE INT C AC SPEE, P1618
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
Zhang B, 2012, ACTA ACUST UNITED AC, V98, P328, DOI 10.3813/AAA.918516
Zhu WZ, 2005, INT CONF ACOUST SPEE, P245
NR 33
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2013
VL 55
IS 3
BP 387
EP 396
DI 10.1016/j.specom.2012.12.005
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 115UI
UT WOS:000316837000001
ER
PT J
AU Kane, J
Gobl, C
AF Kane, John
Gobl, Christer
TI Automating manual user strategies for precise voice source analysis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice source; Glottal source; LF model; Inverse filtering; Voice quality
ID PLUS NOISE MODEL; GLOTTAL FLOW; SPEECH SYNTHESIS; SIGNALS; WAVE;
PARAMETRIZATION; MINIMIZATION; PREDICTION; ALGORITHM
AB A large part of the research carried out at the Phonetics and Speech Laboratory is concerned with the role of the voice source in the prosody of spoken language, including its linguistic and expressive dimensions. Due to the lack of robustness of automatic voice source analysis methods, we have tended to use labour-intensive methods that require pulse-by-pulse manual optimisation. This has limited the feasibility of conducting analysis on large volumes of data. To address this, a new method is proposed for automatic parameterisation of the deterministic component of the voice source by simulating the strategies used in the manual optimisation approach. The method involves a combination of exhaustive search, dynamic programming and optimisation methods, with settings derived from analysis of previous manual voice source analysis. A quantitative evaluation demonstrated that the proposed method produces model parameter values clearly closer to our reference values than a standard time domain-based approach and a phase minimisation method. A complementary qualitative analysis illustrated broadly similar findings, in terms of voice source dynamics in various placements of focus, when using the proposed algorithm compared with a previous study which employed the manual optimisation approach. (c) 2012 Elsevier B.V. All rights reserved.
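The combination of exhaustive (grid) search followed by local optimisation described above can be illustrated generically. The sketch below fits two parameters of a toy pulse model by coarse grid search and then Nelder-Mead refinement; the model, the error function and the parameter names are invented for illustration and this is not the authors' LF-model parameterisation.

import numpy as np
from scipy.optimize import minimize

t = np.linspace(0.0, 1.0, 200)
target = np.exp(-5.0 * t) * np.sin(2 * np.pi * 3.0 * t)  # toy "observed" pulse

def model(params):
    decay, freq = params
    return np.exp(-decay * t) * np.sin(2 * np.pi * freq * t)

def error(params):
    return np.sum((model(params) - target) ** 2)

# Stage 1: exhaustive search on a coarse parameter grid.
grid = [(d, f) for d in np.linspace(1, 10, 10) for f in np.linspace(1, 6, 11)]
best = min(grid, key=error)

# Stage 2: local refinement with Nelder-Mead starting from the grid optimum.
result = minimize(error, x0=np.array(best), method="Nelder-Mead")
print("grid start:", best, "refined:", result.x)

The two-stage structure mirrors the general idea of using a cheap exhaustive pass to land near the global optimum before a local optimiser polishes the estimate.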
C1 [Kane, John; Gobl, Christer] Trinity Coll Dublin, Phonet & Speech Lab, Ctr Language & Commun Studies, Sch Linguist Speech & Commun Sci, Dublin, Ireland.
RP Kane, J (reprint author), Trinity Coll Dublin, Phonet & Speech Lab, Ctr Language & Commun Studies, Sch Linguist Speech & Commun Sci, Dublin, Ireland.
EM kanejo@tcd.ie; cegobl@tcd.ie
FU Science Foundation Ireland [07/CE/I1142, 09/IN.1/I2631]; Irish
Department of Arts, Heritage and the Gaeltacht (ABAIR project)
FX This work is supported by the Science Foundation Ireland, Grant
07/CE/I1142 (Centre for Next Generation Localisation, www.cngl.ie) and
Grant 09/IN.1/I2631 (FASTNET) as well as by the Irish Department of
Arts, Heritage and the Gaeltacht (ABAIR project). We would like to thank
Dr. Irena Yanushevskaya for carrying out the manually optimised voice
source analysis used in this study. The authors would like to thank the
anonymous reviewers whose comments and suggestions have helped us to
significantly improve this paper.
CR Airas M., 2007, P INT 2007, P1410
ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R
Alku P, 2002, J ACOUST SOC AM, V112, P701, DOI 10.1121/1.1490365
Alku P, 2011, SADHANA-ACAD P ENG S, V36, P623, DOI 10.1007/s12046-011-0041-5
Alku P, 2009, J ACOUST SOC AM, V125, P3289, DOI 10.1121/1.3095801
Arroabarren I., 2003, P EUROSPEECH 0901, P57
Baayen R. Harald, 2008, ANAL LINGUISTIC DATA
Bozkurt B, 2005, IEEE SIGNAL PROC LET, V12, P344, DOI 10.1109/LSP.2005.843770
Brent RP, 1973, ALGORITHMS MINIMIZAT
Cabral JP, 2011, INT CONF ACOUST SPEE, P4704
Degottex G, 2011, IEEE T AUDIO SPEECH, V19, P1080, DOI 10.1109/TASL.2010.2076806
Degottex G, 2011, INT CONF ACOUST SPEE, P5128
Degottex G., 2009, P SPECOM ST PET, P226
Doval B., 2001, P EUROSPEECH SCAND
Drugman T., 2009, P INTERSPEECH, P1779
Drugman T., 2009, P INTERSPEECH, P116
Drugman T., 2009, P INTERSPEECH, P2891
Drugman T, 2012, IEEE T AUDIO SPEECH, V20, P994, DOI 10.1109/TASL.2011.2170835
Drugman T, 2012, COMPUT SPEECH LANG, V26, P20, DOI 10.1016/j.csl.2011.03.003
Fant G., 1985, Q PROGR STATUS REPOR, V4, P1
Fant G., 1995, STL QPSR, V36, P119
Frohlich M, 2001, J ACOUST SOC AM, V110, P479, DOI 10.1121/1.1379076
Gobl C., 2003, THESIS KTH SPEECH MU
Gobl C., 2003, AMPLITUDE BASED SOUR, P151
Gobl C., 2010, P INT 2010, P2606
HACKI T, 1989, FOLIA PHONIATR, V41, P43
Hanson HM, 2001, J PHONETICS, V29, P451, DOI 10.1006/jpho.2001.0146
Hanson HM, 1997, J ACOUST SOC AM, V101, P466, DOI 10.1121/1.417991
ITAKURA F, 1975, IEEE T ACOUST SPEECH, VAS23, P67, DOI 10.1109/TASSP.1975.1162641
Kominek J., 2004, ISCA SPEECH SYNTH WO, P223
Kreiman J, 2006, ANAL SYNTHESIS PATHO
NELDER JA, 1965, COMPUT J, V7, P308
NEY H, 1983, IEEE T SYST MAN CYB, V13, P208
Ni Chasaide A., 2011, P ICPHS HONG KONG, P1470
Ni Chasaide A., 1999, COARTICULATION THEOR, P300
O' Brien D., 2011, P IR SIGN SYST C ISS
O' Cinneide A., 2011, P INTERSPEECH, P57
Pantazis Y, 2008, INT CONF ACOUST SPEE, P4609, DOI 10.1109/ICASSP.2008.4518683
Raitio T, 2011, IEEE T AUDIO SPEECH, V19, P153, DOI 10.1109/TASL.2010.2045239
Rodet X., 2007, P C DIG AUD EFF DAFX, P1
Strik H., 1993, P 3 EUR C SPEECH TEC, V1, P103
Strik H, 1998, J ACOUST SOC AM, V103, P2659, DOI 10.1121/1.422786
Stylianou Y, 2001, IEEE T SPEECH AUDI P, V9, P21, DOI 10.1109/89.890068
Talkin D., 1995, SPEECH CODING SYNTHE, P495
Thomas MRP, 2009, IEEE T AUDIO SPEECH, V17, P1557, DOI 10.1109/TASL.2009.2022430
TIMCKE R, 1958, ARCHIV OTOLARYNGOL, V68, P1
Vainio M, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P921
Veeneman D., 1985, ACOUSTICS SPEECH SIG, V33, P369
Walker J, 2007, LECT NOTES COMPUT SC, V4391, P1
WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260
Yanushevskaya I, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P462
Yegnanarayana B, 1998, IEEE T SPEECH AUDI P, V6, P313, DOI 10.1109/89.701359
NR 52
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2013
VL 55
IS 3
BP 397
EP 414
DI 10.1016/j.specom.2012.12.004
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 115UI
UT WOS:000316837000002
ER
PT J
AU Hahm, SJ
Watanabe, S
Ogawa, A
Fujimoto, M
Hori, T
Nakamura, A
AF Hahm, Seong-Jun
Watanabe, Shinji
Ogawa, Atsunori
Fujimoto, Masakiyo
Hori, Takaaki
Nakamura, Atsushi
TI Prior-shared feature and model space speaker adaptation by consistently
employing MAP estimation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Speaker adaptation; Feature space normalization;
Model space adaptation; Prior distribution sharing
ID CONTINUOUS SPEECH RECOGNITION; MAXIMUM-LIKELIHOOD APPROACH; HIDDEN
MARKOV-MODELS; LINEAR-REGRESSION; NORMALIZATION; PARAMETERS; INFERENCE
AB The purpose of this paper is to describe the development of a speaker adaptation method that improves speech recognition performance regardless of the amount of adaptation data. For that purpose, we propose the consistent employment of a maximum a posteriori (MAP)-based Bayesian estimation for both feature space normalization and model space adaptation. Namely, constrained structural maximum a posteriori linear regression (CSMAPLR) is first performed in a feature space to compensate for the speaker characteristics, and then, SMAPLR is performed in a model space to capture the remaining speaker characteristics. A prior distribution stabilizes the parameter estimation, especially when the amount of adaptation data is small. In the proposed method, CSMAPLR and SMAPLR are performed based on the same acoustic model, so the dimension-dependent variations of the feature and model spaces can be similar, and these dimension-dependent variations of the transformation matrix are explained well by the prior distribution. Therefore, by sharing the same prior distribution between CSMAPLR and SMAPLR, their parameter estimations can be appropriately regularized in both spaces. Experiments on large vocabulary continuous speech recognition using the Corpus of Spontaneous Japanese (CSJ) and the MIT OpenCourseWare corpus (MIT-OCW) confirm the effectiveness of the proposed method compared with other conventional adaptation methods with and without using speaker adaptive training. (c) 2012 Elsevier B.V. All rights reserved.
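The stabilising role of the prior described above can be illustrated with the simplest MAP estimate, a Gaussian mean under a conjugate prior: with little data the estimate stays near the prior, with plenty of data it follows the data. This is a generic textbook illustration, not the paper's CSMAPLR/SMAPLR update, and tau is a hypothetical prior weight.

import numpy as np

def map_mean(data, prior_mean, tau):
    """MAP estimate of a Gaussian mean with a conjugate prior.
    tau behaves like a count of pseudo-observations backing the prior."""
    n = len(data)
    return (tau * prior_mean + n * np.mean(data)) / (tau + n)

prior_mean = 0.0                         # e.g. a speaker-independent value
few = np.array([2.1, 1.7])               # very little adaptation data
many = np.random.normal(2.0, 0.5, 500)   # plenty of adaptation data

print(map_mean(few, prior_mean, tau=10.0))   # pulled strongly toward the prior
print(map_mean(many, prior_mean, tau=10.0))  # dominated by the data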
C1 [Hahm, Seong-Jun; Watanabe, Shinji; Ogawa, Atsunori; Fujimoto, Masakiyo; Hori, Takaaki; Nakamura, Atsushi] NTT Corp, NTT Commun Sci Labs, Seika, Kyoto 6190237, Japan.
RP Hahm, SJ (reprint author), NTT Corp, NTT Commun Sci Labs, 2-4 Hikaridai, Seika, Kyoto 6190237, Japan.
EM seongjun.hahm@lab.ntt.co.jp
RI Hahm, Seong-Jun/I-6719-2013
CR Anastasakos T, 1997, INT CONF ACOUST SPEE, P1043, DOI 10.1109/ICASSP.1997.596119
BAHL LR, 1983, IEEE T PATTERN ANAL, V5, P179
Breslin C., 2010, P INT JAP, P1644
CHEN KT, 2000, P ICSLP, V3, P742
Chou W., 1999, P ICASSP, P1
DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659
Eide E, 1996, INT CONF ACOUST SPEE, P346, DOI 10.1109/ICASSP.1996.541103
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Glass J., 2007, P INTERSPEECH, P2553
Hahm S, 2010, INT CONF ACOUST SPEE, P4302, DOI 10.1109/ICASSP.2010.5495672
Hahm SJ, 2010, IEICE T INF SYST, VE93D, P1927, DOI 10.1587/transinf.E93.D.1927
Hazen TJ, 2000, SPEECH COMMUN, V31, P15, DOI 10.1016/S0167-6393(99)00059-X
Hori T, 2007, IEEE T AUDIO SPEECH, V15, P1352, DOI 10.1109/TASL.2006.889790
Huang J., 2005, P INT C MULT EXP, P338
Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308
LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Lei X, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P773
Maekawa K, 2000, P LREC2000, V2, P947
MENG XL, 1993, BIOMETRIKA, V80, P267, DOI 10.2307/2337198
Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225
Nakano Y, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2286
Povey D, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1145
Pye D, 1997, INT CONF ACOUST SPEE, P1047, DOI 10.1109/ICASSP.1997.596120
Rabiner L, 1993, FUNDAMENTALS SPEECH
Shinoda K, 2001, IEEE T SPEECH AUDI P, V9, P276, DOI 10.1109/89.906001
Siohan O, 2001, IEEE T SPEECH AUDI P, V9, P417, DOI 10.1109/89.917687
Siohan O, 2002, COMPUT SPEECH LANG, V16, P5, DOI 10.1006/csla.2001.0181
Strang G., 2003, INTRO LINEAR ALGEBRA
Watanabe S, 2004, IEEE T SPEECH AUDI P, V12, P365, DOI 10.1109/TSA.2004.828640
Watanabe S, 2010, IEEE T AUDIO SPEECH, V18, P395, DOI 10.1109/TASL.2009.2029717
Watanabe S., 2011, IEEE INT WORKSH MACH, P1
Woodland P. C., 2001, ISCA TUT RES WORKSH
Yu K, 2007, IEEE T AUDIO SPEECH, V15, P1932, DOI 10.1109/TASL.2007.901300
Yu K, 2006, INT CONF ACOUST SPEE, P217
Zajic Z, 2009, LECT NOTES ARTIF INT, V5729, P274, DOI 10.1007/978-3-642-04208-9_39
NR 37
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2013
VL 55
IS 3
BP 415
EP 431
DI 10.1016/j.specom.2012.12.002
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 115UI
UT WOS:000316837000003
ER
PT J
AU Xu, T
Wang, WW
Dai, W
AF Xu, Tao
Wang, Wenwu
Dai, Wei
TI Sparse coding with adaptive dictionary learning for underdetermined
blind speech separation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Underdetermined blind speech separation (BSS); Sparse representation;
Signal recovery; Adaptive dictionary learning
ID AUDIO SOURCE SEPARATION; OVERCOMPLETE DICTIONARIES; NONSTATIONARY
SOURCES; MATCHING PURSUITS; LEAST-SQUARES; MIXTURES; REPRESENTATIONS;
IDENTIFICATION; APPROXIMATION; DECOMPOSITION
AB A block-based approach coupled with adaptive dictionary learning is presented for underdetermined blind speech separation. The proposed algorithm, derived as a multi-stage method, is established by reformulating the underdetermined blind source separation problem as a sparse coding problem. First, the mixing matrix is estimated in the transform domain by a clustering algorithm. Then a dictionary is learned by an adaptive learning algorithm; three such algorithms have been tested, including the simultaneous codeword optimization (SimCO) technique that we proposed recently. Using the estimated mixing matrix and the learned dictionary, the sources are recovered from the blocked mixtures by a signal recovery approach. The separated source components from all the blocks are concatenated to reconstruct the whole signal. The block-based operation has the advantage of considerably improving the computational efficiency of the source recovery process without degrading its separation performance. Numerical experiments are provided to show the competitive separation performance of the proposed algorithm, as compared with the state-of-the-art approaches. Using mutual coherence and a sparsity index, the performance of a variety of dictionaries applied in underdetermined speech separation is compared and analyzed, including dictionaries learned from speech mixtures and from ground-truth speech sources, as well as dictionaries predefined by mathematical transforms such as the discrete cosine transform (DCT) and the short-time Fourier transform (STFT). (c) 2013 Elsevier B.V. All rights reserved.
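The source-recovery stage summarised above reduces to a sparse coding problem: each mixture block x is approximately A D c, with A the estimated mixing matrix, D the learned dictionary and c a sparse coefficient vector. The sketch below solves such a problem with a plain orthogonal matching pursuit on random toy matrices; it is a generic illustration of sparse recovery, not the SimCO-based system described in the paper, and all sizes are invented.

import numpy as np

def omp(Phi, x, n_nonzero):
    """Greedy orthogonal matching pursuit: x ~ Phi @ c with sparse c."""
    residual, support = x.copy(), []
    c = np.zeros(Phi.shape[1])
    for _ in range(n_nonzero):
        support.append(int(np.argmax(np.abs(Phi.T @ residual))))
        sub = Phi[:, support]
        coeffs, *_ = np.linalg.lstsq(sub, x, rcond=None)
        residual = x - sub @ coeffs
    c[support] = coeffs
    return c

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))          # toy 2-mixture / 3-source mixing matrix
D = rng.normal(size=(3 * 64, 256))   # toy overcomplete dictionary over stacked source blocks
Phi = np.kron(A, np.eye(64)) @ D     # combined operator acting on the coefficients
c_true = np.zeros(256)
c_true[[10, 87, 200]] = [1.0, -2.0, 0.5]
x = Phi @ c_true                     # observed mixture block
c_hat = omp(Phi, x, n_nonzero=3)
print(np.nonzero(c_hat)[0])          # recovers the true support {10, 87, 200}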
C1 [Xu, Tao; Wang, Wenwu] Univ Surrey, Dept Elect Engn, Guildford GU2 7XH, Surrey, England.
[Dai, Wei] Univ London Imperial Coll Sci Technol & Med, Dept Elect & Elect Engn, London SW7 2AZ, England.
RP Xu, T (reprint author), Univ Surrey, Dept Elect Engn, Guildford GU2 7XH, Surrey, England.
EM t.xu@surrey.ac.uk; w.wang@surrey.ac.uk; wei.dai1@imperial.ac.uk
FU Engineering and Physical Sciences Research Council (EPSRC) of the UK
[EP/H050000/1, EP/H012842/1]; Centre for Vision Speech and Signal
Processing (CVSSP); China Scholarship Council (CSC); MOD University
Defence Research Centre (UDRC) in Signal Processing
FX We thank the Associate Editor Dr. Bin Ma and the anonymous reviewers for
their helpful comments for improving our paper, and Dr. Mark Barnard for
proofreading the manuscript. This work was supported in part by the
Engineering and Physical Sciences Research Council (EPSRC) of the UK
(Grant Nos. EP/H050000/1 and EP/H012842/1), the Centre for Vision Speech
and Signal Processing (CVSSP), and the China Scholarship Council (CSC),
and in part by the MOD University Defence Research Centre (UDRC) in
Signal Processing.
CR Aharon M, 2006, IEEE T SIGNAL PROCES, V54, P4311, DOI 10.1109/TSP.2006.881199
Alinaghi A, 2011, INT CONF ACOUST SPEE, P209
Araki S, 2007, SIGNALS COMMUN TECHN, P243, DOI 10.1007/978-1-4020-6479-1_9
Arberet S, 2010, IEEE T SIGNAL PROCES, V58, P121, DOI 10.1109/TSP.2009.2030854
Beck A, 2009, SIAM J IMAGING SCI, V2, P183, DOI 10.1137/080716542
Berg EVD, 2008, SIAM J SCI COMPUT, V31, P890, DOI DOI 10.1137/080714488
Blumensath T, 2008, IEEE T SIGNAL PROCES, V56, P2370, DOI 10.1109/TSP.2007.916124
Bofill P, 2001, SIGNAL PROCESS, V81, P2353, DOI 10.1016/S0165-1684(01)00120-7
Chen S. S., 1999, SIAM J SCI COMPUT, V20, P33
Cichocki A., 2009, NONNEGATIVE MATRIX T
Cichocki A., 2003, ADAPTIVE BLIND SIGNA
Comon P, 1998, P SOC PHOTO-OPT INS, V3461, P2, DOI 10.1117/12.325670
Comon P, 2004, IEEE T SIGNAL PROCES, V52, P11, DOI 10.1109/TSP.2003.820073
Dai W, 2009, IEEE T INFORM THEORY, V55, P2230, DOI 10.1109/TIT.2009.2016006
Dai W, 2012, IEEE T SIGNAL PROCES, V60, P6340, DOI 10.1109/TSP.2012.2215026
Daubechies I, 2010, COMMUN PUR APPL MATH, V63, P1
Demmel J.W., 1997, APPL NUMERICAL LINEA
Donoho D., 2005, SPARSELAB, P25
Donoho DL, 2006, IEEE T INFORM THEORY, V52, P1289, DOI 10.1109/TIT.2006.871582
Donoho DL, 2006, COMMUN PUR APPL MATH, V59, P797, DOI 10.1002/cpa.20132
Edelman A, 1998, SIAM J MATRIX ANAL A, V20, P303, DOI 10.1137/S0895479895290954
Elad M., 2010, IEEE T SIGNAL PROCES, V58, P1558
Elad M, 2006, IEEE T IMAGE PROCESS, V15, P3736, DOI 10.1109/TIP.2006.881969
Friedlander M.P., 2008, SPG11 SPECTRAL PROJE
Gowreesunker BV, 2008, INT CONF ACOUST SPEE, P33
Gowreesunker BV, 2009, LECT NOTES COMPUT SC, V5441, P34, DOI 10.1007/978-3-642-00599-2_5
Gribonval R, 2006, IEEE T INFORM THEORY, V52, P255, DOI 10.1109/TIT.2005.860474
Gribonval R., 2006, ESANN 06, P323
Hulle M.V., 1999, IEEE WORKSH NEUR NET, P315
Hyvarinen A, 2001, INDEPENDENT COMPONEN
Jafari MG, 2011, IEEE J-STSP, V5, P1025, DOI 10.1109/JSTSP.2011.2157892
Jan T, 2011, SPEECH COMMUN, V53, P524, DOI 10.1016/j.specom.2011.01.002
JOURJINE A, 2000, ACOUST SPEECH SIG PR, P2985
Kim SJ, 2007, IEEE J-STSP, V1, P606, DOI 10.1109/JSTSP.2007.910971
Kim S.-J., 2007, L1LS L1 REGULARISED
Kowalski M, 2010, IEEE T AUDIO SPEECH, V18, P1818, DOI 10.1109/TASL.2010.2050089
Luo YH, 2006, IEEE T SIGNAL PROCES, V54, P2198, DOI 10.1109/TSP.2006.873367
Mailhe B., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288688
Makino S, 2007, SIGNALS COMMUN TECHN, P1, DOI 10.1007/978-1-4020-6479-1
MALLAT SG, 1993, IEEE T SIGNAL PROCES, V41, P3397, DOI 10.1109/78.258082
Mandel MI, 2010, IEEE T AUDIO SPEECH, V18, P382, DOI 10.1109/TASL.2009.2029711
Mohimani H, 2009, IEEE T SIGNAL PROCES, V57, P289, DOI 10.1109/TSP.2008.2007606
Needell D., 2008, APPL COMPUT HARMON A, V26, P301, DOI DOI 10.1016/J.ACHA.2008.07.002
Nion D, 2008, SIGNAL PROCESS, V88, P749, DOI 10.1016/j.sigpro.2007.07.024
Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214
Pati Y. C., 1993, P 27 AS C SIGN SYST, V1, P40, DOI DOI 10.1109/ACSSC.1993.342465
Pedersen MS, 2008, IEEE T NEURAL NETWOR, V19, P475, DOI 10.1109/TNN.2007.911740
Peleg T, 2012, IEEE T SIGNAL PROCES, V60, P2286, DOI 10.1109/TSP.2012.2188520
Plumbley M.D., 2012, P ICML WORKSH SPARS
Plumbley MD, 2010, P IEEE, V98, P995, DOI 10.1109/JPROC.2009.2030345
Sawada H, 2011, IEEE T AUDIO SPEECH, V19, P516, DOI 10.1109/TASL.2010.2051355
Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 10.1016/S0925-2312(98)00047-2
Sudhakar P., 2011, THESIS U RENNES 1 FR
Tibshirani R, 1996, J ROY STAT SOC B MET, V58, P267
Tichavsky P, 2011, IEEE T SIGNAL PROCES, V59, P1037, DOI 10.1109/TSP.2010.2096221
Tropp JA, 2004, IEEE T INFORM THEORY, V50, P2231, DOI 10.1109/TIT.2004.834793
Vincent E, 2009, LECT NOTES COMPUT SC, V5441, P734, DOI 10.1007/978-3-642-00599-2_92
Vincent E, 2006, IEEE T AUDIO SPEECH, V14, P1462, DOI 10.1109/TSA.2005.858005
Wang W., 2008, P ICARN LIV UK SEP 2, P5
Wang W, 2007, PROC MONOGR ENG WATE, P347
Wang WW, 2005, IEEE T SIGNAL PROCES, V53, P1654, DOI 10.1109/TSP.2005.845433
Wang WW, 2009, IEEE T SIGNAL PROCES, V57, P2858, DOI 10.1109/TSP.2009.2016881
Wang WW, 2008, IEEE IJCNN, P3681
Wang WW, 2008, EURASIP J ADV SIG PR, DOI 10.1155/2008/231367
Xu T, 2010, INT CONF ACOUST SPEE, P2022, DOI 10.1109/ICASSP.2010.5494935
Xu T, 2009, 2009 IEEE/SP 15TH WORKSHOP ON STATISTICAL SIGNAL PROCESSING, VOLS 1 AND 2, P493
Xu T.C., 2011, P IEEE INT C MACH LE, P1
Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI [10.1109/TSP.2004.828896, 10.1109/TSP.2004.V8896]
Zibulevsky M, 2001, NEURAL COMPUT, V13, P863, DOI 10.1162/089976601300014385
NR 69
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2013
VL 55
IS 3
BP 432
EP 450
DI 10.1016/j.specom.2012.12.003
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 115UI
UT WOS:000316837000004
ER
PT J
AU Neiberg, D
Salvi, G
Gustafson, J
AF Neiberg, Daniel
Salvi, Giampiero
Gustafson, Joakim
TI Semi-supervised methods for exploring the acoustics of simple productive
feedback
SO SPEECH COMMUNICATION
LA English
DT Article
DE Social signal processing; Affective annotation; Feedback modelling;
Grounding
ID TURN-TAKING; EMOTION; DIALOGUE; EXPRESSION; CUES; CONVERSATION;
ORGANIZATION; RESPONSES; AGENTS; MODEL
AB This paper proposes methods for exploring acoustic correlates to feedback functions. A sub-language of Swedish, simple productive feedback, is introduced to facilitate investigations of the functional contributions of base tokens, phonological operations and prosody. The function of feedback is to convey the listeners' attention, understanding and affective states. In order to handle the large number of possible affective states, the current study starts by performing a listening experiment where humans annotated the functional similarity of feedback tokens with different prosodic realizations. By selecting a set of stimuli that had different prosodic distances from a reference token, it was possible to compute a generalised functional distance measure. The resulting generalised functional distance measure was shown to be correlated with prosodic distance, but the correlations varied as a function of base tokens and phonological operations. In a subsequent listening test, a small representative sample of feedback tokens was rated for understanding, agreement, interest, surprise and certainty. These ratings were found to explain a significant proportion of the generalised functional distance. By combining the acoustic analysis with an explorative visualisation of the prosody, we have established a mapping between human perception of the similarity of feedback tokens, their measured distance in acoustic space, and the perceived function of feedback tokens with varying realisations. (c) 2013 Elsevier B.V. All rights reserved.
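The core quantitative step above, relating a perceptually derived functional distance to a prosodic (acoustic) distance, amounts to correlating two distance measures over token pairs. The sketch below uses Spearman rank correlation as one plausible statistic; the numbers are invented and this is not the study's analysis pipeline.

import numpy as np
from scipy.stats import spearmanr

# Hypothetical distances for eight feedback-token pairs relative to a reference.
prosodic_distance = np.array([0.2, 0.5, 0.9, 1.4, 1.8, 2.3, 2.9, 3.5])
functional_distance = np.array([0.1, 0.4, 0.7, 1.5, 1.6, 2.0, 3.1, 3.3])

rho, p = spearmanr(prosodic_distance, functional_distance)
print(f"Spearman rho = {rho:.2f}, p = {p:.3f}")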
C1 [Neiberg, Daniel; Salvi, Giampiero; Gustafson, Joakim] KTH Royal Inst Technol, Dept Speech Mus & Hearing, S-10044 Stockholm, Sweden.
RP Neiberg, D (reprint author), KTH Royal Inst Technol, Dept Speech Mus & Hearing, Lindstedtsv 24, S-10044 Stockholm, Sweden.
EM neiberg@speech.kth.se
FU Swedish Research Council (VR) project "Introducing interactional
phenomena in speech synthesis" [2009-4291]; Swedish Research Council
(VR) project "Biologically inspired statistical methods for flexible
automatic speech understanding" [2009-4599]
FX Funding was provided by the Swedish Research Council (VR) projects
"Introducing interactional phenomena in speech synthesis" (2009-4291)
and "Biologically inspired statistical methods for flexible automatic
speech understanding" (2009-4599).
CR Al Moubayed S., 2010, P FONETIK, P11
Allwood J, 2007, LANG RESOUR EVAL, V41, P273, DOI 10.1007/s10579-007-9061-5
Allwood J., 1987, TEMA KOMMUNIKATION, V1, P89
Allwood J., 1992, Journal of Semantics, V9, DOI 10.1093/jos/9.1.1
Audibert N., 2011, COGNITION EMOTION, P37
Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
Baron-Cohen S., 2004, MIND READING INTERAC
BARSALOU LW, 1985, J EXP PSYCHOL LEARN, V11, P629, DOI 10.1037/0278-7393.11.1-4.629
Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1
Bell L., 2000, 6 INT C SPOK LANG PR
Benus S., 2007, P 16 INT C PHON SCI, P1065
Bunt H., 2007, P 8 SIGDIAL WORKSH D, P283
Buschmeier H, 2011, P 11 INT C INT VIRT, P169, DOI 10.1007/978-3-642-23974-8_19
Cassell J., 2007, P WORKSH EMB LANG PR, P41, DOI 10.3115/1610065.1610071
Cerrato L., 2006, THESIS KTH ROYAL I T
CETIN O, 2006, P ICSLP PITTSB, P293
Chen A, 2004, LANG SPEECH, V47, P311
CLARK HH, 1994, SPEECH COMMUN, V15, P243, DOI 10.1016/0167-6393(94)90075-2
CLARK HH, 1989, COGNITIVE SCI, V13, P259, DOI 10.1207/s15516709cog1302_7
Dietrich S, 2006, PROG BRAIN RES, V156, P295, DOI 10.1016/S0079-6123(06)56016-9
DITTMANN AT, 1968, J PERS SOC PSYCHOL, V9, P79, DOI 10.1037/h0025722
Duncan Jr S., 1972, J PERSONALITY SOCIAL, P23
Duncan S., 1974, LANG SOC, V3, P161, DOI DOI 10.1017/S0047404500004322
Duncan Jr S, 1977, FACE TO FACE INTERAC
Edlund J., 2009, NORD PROS P 10 C OCT, P57
Edlund J, 2008, SPEECH COMMUN, V50, P630, DOI 10.1016/j.specom.2008.04.002
Edlund J., 2010, P 7 C INT LANG RES E, P2992
Edlund J., 2005, P INT 2005 LISB PORT, P2389
Ekman P, 1972, NEBRASKA S MOTIVATIO, P207
EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068
Fries Charles C., 1952, STRUCTURE ENGLISH IN
Fujimoto D. T., 2007, J OSAKA JOGAKUIN 2 Y, V37, P35
Gardner R, 2001, LISTENERS TALK RESPO
Goodwin C., 1981, CONVERSATIONAL ORG I
Goudbeek M, 2010, J ACOUST SOC AM, V128, P1322, DOI 10.1121/1.3466853
Gratch J, 2007, LECT NOTES ARTIF INT, V4722, P125
Gravano A, 2012, COMPUT LINGUIST, V38, P1, DOI 10.1162/COLI_a_00083
Gravano A., 2008, P 4 SPEECH PROS C CA
Greenberg Joseph H., 1978, WORD STRUCTURE, V3, P297
Gustafson J., 2002, P ISCA WORKSH MULT D
Gustafson J., 2010, 5 WORKSH DISFL SPONT
Gustafson J, 2008, LECT NOTES ARTIF INT, V5078, P240, DOI 10.1007/978-3-540-69369-7_27
Heldner M., 2011, 12 ANN C INT SPEECH
Hirschberg J., 1999, P AUT SPEECH REC UND, P349
Hjalmarsson A., 2008, P SIGDIAL 2008 COL O
Hjalmarsson A., 2010, THESIS ROYAL I TECHN
House David, 1990, TONAL PERCEPTION SPE
KENDON A, 1967, ACTA PSYCHOL, V26, P22, DOI 10.1016/0001-6918(67)90005-4
Kopp S., 2006, ZIF WORKSH, P18
Kopp S, 2010, SPEECH COMMUN, V52, P587, DOI 10.1016/j.specom.2010.02.007
Krahmer E, 2002, SPEECH COMMUN, V36, P133, DOI 10.1016/S0167-6393(01)00030-9
Lai C., 2009, P INT 09 BRIGHT UK
Lai C., 2010, P INT 2010 MAK JAP
Larsson S., 2002, THESIS GOTEBORG U
Laskowski K., 2004, P ISLP 2004 JEJ ISL, P973
Levitt E. A., 1964, COMMUNICATION EMOTIO, P87
Liscombe J., 2003, P EUR 2003
LLOYD SP, 1982, IEEE T INFORM THEORY, V28, P129, DOI 10.1109/TIT.1982.1056489
McGraw KO, 1996, PSYCHOL METHODS, V1, P390, DOI 10.1037//1082-989X.1.4.390
Neiberg D., 2012, INT WORKSH FEEDB BEH
Neiberg D., 2010, INTERSPEECH 2010, P2562
Neiberg D., 2011, INTERSPEECH 2011
Neiberg D., 2011, INTERSPEECH 2011, P1581
Neiberg D, 2011, INT CONF ACOUST SPEE, P5836
NILSENOVA MARIE, 2006, THESIS U AMSTERDAM
Payr S, 2011, APPL ARTIF INTELL, V25, P441, DOI 10.1080/08839514.2011.586616
Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005
Reese Brian, 2007, THESIS U TEXAS AUSTI
Reidsma D, 2011, J MULTIMODAL USER IN, V4, P97, DOI 10.1007/s12193-011-0060-x
Russell JA, 2003, PSYCHOL REV, V110, P145, DOI 10.1037/0033-295X.110.1.145
SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243
Sauter D., 2010, P NATL ACAD SCI USA, P107
Sauter DA, 2010, Q J EXP PSYCHOL, V63, P2251, DOI 10.1080/17470211003721642
Schegloff EA, 2000, LANG SOC, V29, P1
SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143
Scherer KR, 2009, COGNITION EMOTION, V23, P1307, DOI 10.1080/02699930902928969
SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674
Schroder M., 2009, 3 INT C AFF COMP INT, P1
Sigurd B., 1984, SPRAKVARD, P3
Skantze G., 2007, THESIS ROYAL I TECHN
Sloman A., 2010, CLOSE ENGAGEMENTS AR
SOKAL ROBERT R., 1962, TAXON, V11, P33, DOI 10.2307/1217208
Stocksmeier T., 2007, P INT, P1290
Stolcke A, 2000, COMPUT LINGUIST, V26, P339, DOI 10.1162/089120100561737
Stromqvist S, 1999, J PRAGMATICS, V31, P1245, DOI 10.1016/S0378-2166(98)00104-0
Traum D., 1994, THESIS U ROCHESTER
Wallers A., 2006, THESIS KTH STOCKHOLM
Ward N., 2006, PRAGMAT COGN, V14, P129, DOI 10.1075/pc.14.1.08war
Ward N, 2000, J PRAGMATICS, V32, P1177, DOI 10.1016/S0378-2166(99)00109-5
Ward Nigel G, 2007, Computer Assisted Language Learning, V20, DOI 10.1080/09588220701745825
Yngve Victor, 1970, 6 REG M CHIC LING SO, P567
NR 91
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2013
VL 55
IS 3
BP 451
EP 469
DI 10.1016/j.specom.2012.12.007
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 115UI
UT WOS:000316837000005
ER
PT J
AU Nakagawa, S
Iwami, K
Fujii, Y
Yamamoto, K
AF Nakagawa, Seiichi
Iwami, Keisuke
Fujii, Yasuhisa
Yamamoto, Kazumasa
TI A robust/fast spoken term detection method based on a syllable n-gram
index with a distance metric
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spoken term detection; Syllable recognition; N-gram; Distant n-gram;
Out-of-Vocabulary; Mis-recognition
ID RETRIEVAL; SPEECH
AB For spoken document retrieval, it is crucial to account for out-of-vocabulary (OOV) words and the mis-recognition of spoken words. Consequently, sub-word unit based recognition and retrieval methods have been proposed. This paper describes a Japanese spoken term detection method for spoken documents that is robust to OOV words and mis-recognition. To solve the problem of OOV keywords, we use individual syllables as the sub-word unit in continuous speech recognition. To handle OOV words and recognition errors while enabling high-speed retrieval, we propose a distant n-gram indexing/retrieval method that incorporates a distance metric in a syllable lattice. When applied to syllable sequences, our proposed method outperformed a conventional DTW method between syllable sequences and was about 100 times faster. The retrieval results show that we can detect OOV words in a database containing 44 h of audio in less than 10 ms per query with an F-measure of 0.54. (c) 2012 Elsevier B.V. All rights reserved.
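The indexing idea summarised above, posting lists keyed by syllable n-grams so that a query term's n-grams can be looked up directly instead of scanning whole documents, can be sketched in a few lines. The toy "recognized" syllable sequences are invented, and the paper's distance metric over the syllable lattice is not reproduced here.

from collections import defaultdict

def ngrams(syllables, n=3):
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]

# Toy recognized documents as syllable sequences.
docs = {
    "doc1": ["to", "yo", "ha", "shi", "gi", "jyu", "tsu"],
    "doc2": ["o", "n", "se", "i", "ke", "n", "sa", "ku"],
}

# Build an inverted index: syllable n-gram -> list of (doc, position).
index = defaultdict(list)
for doc_id, syls in docs.items():
    for pos, gram in enumerate(ngrams(syls)):
        index[gram].append((doc_id, pos))

# Query lookup: every n-gram of the query term is checked against the index.
query = ["ke", "n", "sa"]
for gram in ngrams(query):
    print(gram, "->", index.get(gram, []))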
C1 [Nakagawa, Seiichi; Iwami, Keisuke; Fujii, Yasuhisa; Yamamoto, Kazumasa] Toyohashi Univ Technol, Dept Comp Sci & Engn, Toyohashi, Aichi 4418580, Japan.
RP Iwami, K (reprint author), Toyohashi Univ Technol, Dept Comp Sci & Engn, 1-1 Hibarigaoka,Tempaku Cho, Toyohashi, Aichi 4418580, Japan.
EM nakagawa@slp.cs.tut.ac.jp; iwami@slp.cs.tut.ac.jp;
fujii@slp.cs.tut.ac.jp; kyama@slp.cs.tut.ac.jp
CR Akbacak M, 2008, INT CONF ACOUST SPEE, P5240, DOI 10.1109/ICASSP.2008.4518841
Akiba T., 2011, 9 NTCIR WORKSH SPOK, P1
Allauzen C., 2004, WORKSH INT APPR SPEE, P33
Can D, 2009, INT CONF ACOUST SPEE, P3957, DOI 10.1109/ICASSP.2009.4960494
Chaudhari UV, 2012, IEEE T AUDIO SPEECH, V20, P1633, DOI 10.1109/TASL.2012.2186805
Chen B., 2000, ICASSP, P2985
Dharanipragada S, 2002, IEEE T SPEECH AUDI P, V10, P542, DOI 10.1109/TSA.2002.804543
Fujii Y., 2011, MUSP, P110
Itoh Y, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P677
Iwami K., 2010, SLT, P200
Iwami K, 2011, INT CONF ACOUST SPEE, P5664
Iwami K., 2013, IPSJ, V54
K Ng, 1998, ICSLP, P1088
Kanda N., 2008, MMSP, P939
Katsurada K., 2009, INTERSPEECH, P2147
Larson M., 2003, EUROSPEECH, P1217
Mamou J, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2106
Mamou Jonathan, 2007, SIGIR 07, P615
Meng H.M., 2000, ICSLP, P101
Natori S, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P681
Ng C, 2000, SPEECH COMMUN, V32, P61, DOI 10.1016/S0167-6393(00)00024-8
Nishizaki H., 2002, HLT, P144
Parada C., 2009, ASRU, P404
Parada C, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P1269
Saito H., 2012, 6 WORKSH SPOK DOC PR
Sakamoto N., 2013, SPRING M ASJ
Saraclar M, 2004, HLT-NAACL 2004: HUMAN LANGUAGE TECHNOLOGY CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE MAIN CONFERENCE, P129
Wang HM, 2000, SPEECH COMMUN, V32, P49, DOI 10.1016/S0167-6393(00)00023-6
Wechsler M., 1998, SIGIR 98, P20
NR 29
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2013
VL 55
IS 3
BP 470
EP 485
DI 10.1016/j.specom.2012.12.001
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 115UI
UT WOS:000316837000006
ER
PT J
AU Winters, S
O'Brien, MG
AF Winters, Stephen
O'Brien, Mary Grantham
TI Perceived accentedness and intelligibility: The relative contributions
of F0 and duration
SO SPEECH COMMUNICATION
LA English
DT Article
DE Foreign accent; Intelligibility; Second language; Perception; Prosody
ID GLOBAL FOREIGN ACCENT; IN-NOISE RECOGNITION; VOICE-ONSET-TIME; L2
SPEECH; ENGLISH; PERCEPTION; LISTENERS; PROSODY; EXPERIENCE; SPEAKERS
AB The current study sought to determine the relative contributions of suprasegmental and segmental features to the perception of foreign accent and intelligibility in both first language (L1) and second language (L2) German and English speech. Suprasegmental and segmental features were manipulated independently by transferring (1) native intonation contours and/or syllable durations onto non-native segments and (2) non-native intonation contours and/or syllable durations onto native segments in both English and German. These resynthesized stimuli were then presented, in an intelligibility task, to native speakers of German and English who were proficient in both languages. Both of these groups of speakers and monolingual native speakers of English also rated the foreign accentedness of the manipulated stimuli. In general, tokens became more accented and less intelligible, the more they were manipulated. Tokens were also less accented and more intelligible when produced by speakers of (and in) the listeners' L1. Nonetheless, in certain L2 productions, there was both a reduction in perceived accentedness and decreased intelligibility for tokens in which native prosody was applied to non-native segments, indicating a disconnect between the perceptual processing of intelligibility and accent. (c) 2012 Elsevier B.V. All rights reserved.
C1 [Winters, Stephen] Univ Calgary, Dept Linguist, Calgary, AB T2N 1N4, Canada.
[O'Brien, Mary Grantham] Univ Calgary, Dept German Slav & East Asian Studies, Calgary, AB T2N 1N4, Canada.
RP O'Brien, MG (reprint author), Univ Calgary, Dept German Slav & East Asian Studies, Craigie Hall C208,2500 Univ Dr NW, Calgary, AB T2N 1N4, Canada.
EM swinters@ucalgary.ca; mgobrien@ucalgary.ca
FU German Academic Exchange Service (DAAD)
FX Both authors contributed equally to this project. We would like to thank
Kelly-Ann Casey, Roswita Dressler and Tara Dainton for providing paid
research assistance and the members of the audience of the Germanic
Linguistics Annual Conference 2009 for their feedback on the pilot study
data. Any errors that remain are our own. This work was supported by a
grant from the German Academic Exchange Service (DAAD).
CR ANDERSONHSIEH J, 1992, LANG LEARN, V42, P529, DOI 10.1111/j.1467-1770.1992.tb01043.x
Atterer M, 2004, J PHONETICS, V32, P177, DOI 10.1016/S0095-4470(03)00039-1
Backman N., 1979, INTERLANGUAGE STUDIE, V4, P239
Baker RE, 2011, J PHONETICS, V39, P1, DOI 10.1016/j.wocn.2010.10.006
Baker W, 2008, LANG SPEECH, V51, P317, DOI 10.1177/0023830908099068
Bannert R., 1995, PHONUM, V3, P7
Baumann S., 2000, LINGUISTISCHE BERICH, V181, P1
Baumann Stefan, 2006, METHODS EMPIRICAL PR, P153
Beckman M. E., 1994, TOBI ANNOTATION CONV
Bel B., 2004, P SPEECH PROS 2004 N, P721
Bent T, 2003, J ACOUST SOC AM, V114, P1600, DOI 10.1121/1.1603234
Boersma P., 2007, PRAAT DOING PHONETIC
Bolinger D., 1998, INTONATION SYSTEMS S, P45
Boula de Mareuil P., 2004, P SPEECH PROS 2004 N, P681
Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103
DELATTRE PC, 1955, J ACOUST SOC AM, V27, P769, DOI 10.1121/1.1908024
de Mareuil PB, 2006, PHONETICA, V63, P247, DOI 10.1159/000097308
Derwing T. M., 1997, STUDIES 2 LANGUAGE A, V19, P1, DOI [DOI 10.1017/S0272263197001010, 10.1017/S0272263197001010]
Derwing T.M., 2005, TESOL Q, V39, P379
Eckert H, 1994, MENSCHEN IHRE STIMME
Escudero P, 2009, J PHONETICS, V37, P452, DOI 10.1016/j.wocn.2009.07.006
ESSER J, 1978, PHONETICA, V35, P41
Fery C, 1993, GERMAN INTONATIONAL
FLEGE JE, 1984, J ACOUST SOC AM, V76, P692, DOI 10.1121/1.391256
Flege JE, 1997, J PHONETICS, V25, P437, DOI 10.1006/jpho.1997.0052
Fournier R, 2006, J PHONETICS, V34, P29, DOI 10.1016/j.wocn.2005.03.002
Fox A., 1984, GERMAN INTONATION
Fries Charles C., 1964, HONOUR D JONES, P242
Gibbon D, 1998, INTONATION SYSTEMS S, P78
Goethe-Institut, 2004, EINST
Gonzalez-Bueno M, 1997, IRAL-INT REV APPL LI, V35, P251, DOI 10.1515/iral.1997.35.4.251
GRABE E, 1998, COMP INTONATIONAL PH
Grosser W., 1997, 2 LANGUAGE SPEECH ST, P211
Gulikers L., 1995, LEON CELEX LEXICAL D
Gut U., 2009, NONNATIVE SPEECH COR
Gut U., 2003, FREMDSPRACHEN LEHREN, V32, P133
Hahn LD, 2004, TESOL QUART, V38, P201
Heinrich A, 2010, SPEECH COMMUN, V52, P1038, DOI 10.1016/j.specom.2010.09.009
Hoehle B., 2009, LINGUISTICS, V47, P359
Holm S., 2008, THESIS NORWEGIAN U S
James A., 2000, NEW SOUNDS 2000, P1
Jilka M., 2007, NONNATIVE PROSODY PH, P77
Jilka M., 2000, THESIS U STUTTGARD S
Kennedy S, 2008, CAN MOD LANG REV, V64, P459, DOI 10.3138/cmlr.64.3.459
Kohler K. J., 2004, TRADITIONAL PHONOLOG, P205
Ladefoged Peter, 2011, COURSE PHONETICS, V6th
Lai YH, 2009, LANG COGNITIVE PROC, V24, P1265, DOI 10.1080/01690960802113850
Magen HS, 1998, J PHONETICS, V26, P381, DOI 10.1006/jpho.1998.0081
Major R. C., 1987, STUDIES 2ND LANGUAGE, V9, P63, DOI 10.1017/S0272263100006513
MAJOR RC, 1986, SECOND LANG RES, V2, P53, DOI 10.1177/026765838600200104
Major RC, 2007, STUD SECOND LANG ACQ, V29, P539, DOI 10.1017/S0272263107070428
Mennen I, 2004, J PHONETICS, V32, P543, DOI 10.1016/j.wocn.2004.02.002
Mennen I., 2007, NONNATIVE PROSODY PH, P53
MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526
Moyer A, 1999, STUDIES 2 LANGUAGE A, V21, P81, DOI 10.1017/S0272263199001035
Munro M. J., 1995, STUDIES 2 LANGUAGE A, V17, P17, DOI 10.1017/S0272263100013735
Munro M. J., 2001, STUDIES 2 LANGUAGE A, V23, P451
Munro MJ, 1995, LANG SPEECH, V38, P289
Munro MJ, 2006, STUD SECOND LANG ACQ, V28, P111, DOI 10.1017/S0272263106060049
Munro MJ, 2010, SPEECH COMMUN, V52, P626, DOI 10.1016/j.specom.2010.02.013
Munro Murray J., 2008, PHONOLOGY 2 LANGUAGE, P193
Ng ML, 2008, INT J SPEECH-LANG PA, V10, P404, DOI 10.1080/17549500802399007
O'Brien M.G., 2011, ACHIEVEMENTS PERSPEC, P205
Oxford University Press, 2009, OXFORD ONLINE PLACEM
Pennington MC, 2000, MOD LANG J, V84, P372, DOI 10.1111/0026-7902.00075
Pfitzinger H.P., 2010, P 5 INT C SPEECH PRO, P1
Pinet M, 2010, J ACOUST SOC AM, V128, P1357, DOI 10.1121/1.3466857
Ramirez Verdugo D., 2002, ICAME J, V26, P115
Riney TJ, 1999, LANG LEARN, V49, P275, DOI 10.1111/0023-8333.00089
Riney TJ, 2005, TESOL QUART, V39, P441
Riney TJ, 2000, TESOL QUART, V34, P711, DOI 10.2307/3587782
Shah A.P., 2003, THESIS CITY U NEW YO
Sidaras SK, 2009, J ACOUST SOC AM, V125, P3306, DOI 10.1121/1.3101452
Spitzer SM, 2007, J ACOUST SOC AM, V122, P3678, DOI 10.1121/1.2801545
Stibbard RM, 2006, J ACOUST SOC AM, V120, P433, DOI 10.1121/1.2203595
SYRDAL A, 1998, ACOUST SPEECH SIG PR, P273
Tajima K, 1997, J PHONETICS, V25, P1, DOI 10.1006/jpho.1996.0031
Trofimovich P, 2006, STUD SECOND LANG ACQ, V28, P1, DOI 10.1017/S0272263106060013
van Leyden K, 2006, PHONETICA, V63, P149, DOI 10.1159/000095306
Wegener H., 1998, 2 SPRACHE LERNEN EMP, P21
Winters S., 2011, 162 ANN M AC SOC AM
NR 81
TC 1
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2013
VL 55
IS 3
BP 486
EP 507
DI 10.1016/j.specom.2012.12.006
PG 22
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 115UI
UT WOS:000316837000007
ER
PT J
AU Veisi, H
Sameti, H
AF Veisi, Hadi
Sameti, Hossein
TI Speech enhancement using hidden Markov models in Mel-frequency domain
SO SPEECH COMMUNICATION
LA English
DT Article
DE HMM-based speech enhancement; Mel-frequency; Parallel cepstral and
spectral (PCS)
ID PARAMETER GENERATION; NOISY SPEECH; RECOGNITION
AB This paper focuses on hidden Markov model (HMM)-based minimum mean square error speech enhancement in the Mel-frequency domain and proposes a parallel cepstral and spectral (PCS) modeling method. Both Mel-frequency spectral (MFS) and Mel-frequency cepstral (MFC) features are studied and evaluated for speech enhancement. To estimate the clean speech waveform from a noisy signal, an inversion from the Mel-frequency domain to the spectral domain is required, which introduces distortion artifacts into the spectrum estimation and the filtering. To reduce the corrupting effects of this inversion, the PCS modeling performs concurrent modeling in both the cepstral and the magnitude spectral domains. In addition to the spectrum estimator, magnitude spectrum, log-magnitude spectrum and power spectrum estimators are also studied and evaluated in the HMM-based speech enhancement framework.
The performances of the proposed methods are evaluated in the presence of five noise types with different SNR levels and the results are compared with several established speech enhancement methods especially auto-regressive HMM-based speech enhancement. The experimental results for both subjective and objective tests confirm the superiority of the proposed methods in the Mel-frequency domain over the reference methods, particularly for non-stationary noises. (C) 2012 Elsevier B.V. All rights reserved.
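The inversion problem noted above, going from Mel-domain features back to a linear-frequency spectrum, can be made concrete with a toy triangular filterbank and its pseudo-inverse: reconstructing the spectrum from far fewer filterbank channels than frequency bins is underdetermined, which is the source of the distortion the PCS modeling is designed to mitigate. The evenly spaced filterbank below is a simplified stand-in (not an actual Mel-scale front-end), and the sizes are arbitrary.

import numpy as np

def toy_triangular_filterbank(n_filters, n_bins):
    """Evenly spaced triangular filters (a crude stand-in for a Mel filterbank)."""
    centers = np.linspace(0, n_bins - 1, n_filters + 2)
    fb = np.zeros((n_filters, n_bins))
    bins = np.arange(n_bins)
    for m in range(1, n_filters + 1):
        left, center, right = centers[m - 1], centers[m], centers[m + 1]
        rising = (bins - left) / (center - left)
        falling = (right - bins) / (right - center)
        fb[m - 1] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

n_bins, n_filters = 129, 23
fb = toy_triangular_filterbank(n_filters, n_bins)
spectrum = np.abs(np.random.randn(n_bins)) + 0.1   # stand-in magnitude spectrum
mel_spec = fb @ spectrum                           # forward: spectral -> filterbank domain
recon = np.linalg.pinv(fb) @ mel_spec              # inversion is underdetermined
print("relative reconstruction error:",
      np.linalg.norm(recon - spectrum) / np.linalg.norm(spectrum))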
C1 [Veisi, Hadi; Sameti, Hossein] Sharif Univ Technol, Dept Comp Engn, Tehran, Iran.
RP Veisi, H (reprint author), Sharif Univ Technol, Dept Comp Engn, Tehran, Iran.
EM veisi@ce.sharif.edu; sameti@sharif.edu
FU Iranian Telecommunication Research Center (ITRC)
FX This research was partially supported by the Iranian Telecommunication
Research Center (ITRC).
CR Arakawa T., 2006, P IEEE INT C AC SPEE, P1
Berouti M., 1979, P IEEE INT C AC SPEE, P208
Chen B, 2007, SPEECH COMMUN, V49, P134, DOI 10.1016/j.specom.2006.12.005
EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P1303, DOI 10.1109/78.139237
EPHRAIM Y, 1989, IEEE T ACOUST SPEECH, V37, P1846, DOI 10.1109/29.45532
EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947
EPHRAIM Y, 1984, P IEEE T SPEECH AUD, V32, P1109
EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664
EPHRAIM Y, 1985, P IEEE INT C AC SPEE, V33, P443
Gales M.J.F., 1995, THESIS U CAMBRIDGE S
GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317
Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949
Imai S., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing
LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197
Logan B.T., 1998, THESIS CAMBRIDGE U
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Martin R, 2005, IEEE T SPEECH AUDI P, V13, P845, DOI 10.1109/TSA.2005.851927
Perceptual Evaluation of Speech Quality (PESQ), 2001, OBJECTIVE METHOD END, P862
Porter J., 1984, P IEEE INT C AC SPEE, P53
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
Sameti H., 1994, THESIS U WATERLOO WA
Sameti H, 1998, IEEE T SPEECH AUDI P, V6, P445, DOI 10.1109/89.709670
Sasou A, 2006, SPEECH COMMUN, V48, P1100, DOI 10.1016/j.specom.2006.03.002
Segura J. C., 2001, P EUROSPEECH2001, P221
Srinivasan S, 2007, IEEE T AUDIO SPEECH, V15, P441, DOI 10.1109/TASL.2006.881696
Stouten V, 2006, SPEECH COMMUN, V48, P1502, DOI 10.1016/j.specom.2005.12.006
TOKUDA K, 1995, INT CONF ACOUST SPEE, P660, DOI 10.1109/ICASSP.1995.479684
Tokuda K, 2000, INT CONF ACOUST SPEE, P1315, DOI 10.1109/ICASSP.2000.861820
Veisi H, 2011, DIGIT SIGNAL PROCESS, V21, P36, DOI 10.1016/j.dsp.2010.07.004
Wolfe PJ, 2003, EURASIP J APPL SIG P, V2003, P1043, DOI 10.1155/S1110865703304111
You C.H., 2003, P IEEE INT C AC SPEE, P852
Yu D, 2008, IEEE T AUDIO SPEECH, V16, P1061, DOI 10.1109/TASL.2008.921761
Zhao DY, 2007, IEEE T AUDIO SPEECH, V15, P882, DOI 10.1109/TASL.2006.885256
NR 33
TC 6
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 205
EP 220
DI 10.1016/j.specom.2012.08.005
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900001
ER
PT J
AU Bolanos, D
Cole, RA
Ward, WH
Tindal, GA
Schwanenflugel, PJ
Kuhn, MR
AF Bolanos, Daniel
Cole, Ronald A.
Ward, Wayne H.
Tindal, Gerald A.
Schwanenflugel, Paula J.
Kuhn, Melanie R.
TI Automatic assessment of expressive oral reading
SO SPEECH COMMUNICATION
LA English
DT Article
DE Oral reading fluency; Expressive reading; Prosody; Children's read
speech; Education
ID YOUNG READERS; FLUENCY; PROSODY; CLASSIFICATION; AGREEMENT; CHILDREN;
THINGS
AB We investigated the automatic assessment of children's expressive oral reading of grade-level text passages using a standardized rubric. After a careful review of the reading literature and a close examination of the rubric, we designed a novel set of prosodic and lexical features to characterize fluent expressive reading.
A number of complementary sources of information were used to design the features, each motivated by research on a different component of reading fluency. The features capture the child's reading rate; the presence and number of pauses, filled pauses, and word repetitions; the correlation between punctuation marks and pauses; the length of word groupings; syllable stress and duration; and the location of pitch peaks and contours.
The proposed features were evaluated on a corpus of 783 one-minute reading sessions from 313 students reading grade-leveled passages without assistance (cold unassisted reading). Experimental results show that the proposed lexical and prosodic features provide complementary information and are able to capture the characteristics of expressive reading. The results showed that on both the 2-point and the 4-point expressiveness scales, computer-generated ratings of expressiveness agreed with human raters better than the human raters agreed with each other. The results of the study suggest that automatic assessment of expressive oral reading can be combined with automatic measures of word accuracy and reading rate to produce an accurate multidimensional estimate of children's oral reading ability. (C) 2012 Elsevier B.V. All rights reserved.
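The feature set described above lends itself to a brief illustration. The Python sketch below computes two of the simpler session-level features named in the abstract, reading rate and pause statistics, from word-level time alignments; the data structure, pause threshold, and function names are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch: two fluency features computed from word-level alignments.
from dataclasses import dataclass
from typing import List

@dataclass
class AlignedWord:
    text: str          # recognized word
    start: float       # start time in seconds
    end: float         # end time in seconds

def reading_rate_wpm(words: List[AlignedWord], session_seconds: float) -> float:
    """Words read per minute over the scored session."""
    return 60.0 * len(words) / session_seconds

def pause_features(words: List[AlignedWord], min_pause: float = 0.3):
    """Count and total duration of inter-word pauses longer than min_pause seconds."""
    gaps = [b.start - a.end for a, b in zip(words, words[1:])]
    pauses = [g for g in gaps if g >= min_pause]
    return len(pauses), sum(pauses)

if __name__ == "__main__":
    demo = [AlignedWord("the", 0.0, 0.2), AlignedWord("dog", 0.25, 0.6),
            AlignedWord("ran", 1.4, 1.7)]
    print(reading_rate_wpm(demo, 60.0))   # 3 words in a one-minute session
    print(pause_features(demo))           # one 0.8 s pause
```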
C1 [Bolanos, Daniel; Cole, Ronald A.; Ward, Wayne H.] Boulder Language Technol, Boulder, CO 80301 USA.
[Ward, Wayne H.] Univ Colorado, Boulder, CO 80309 USA.
[Tindal, Gerald A.] Univ Oregon, Eugene, OR 97403 USA.
[Schwanenflugel, Paula J.] Univ Georgia, Athens, GA 30602 USA.
[Kuhn, Melanie R.] Boston Univ, Boston, MA 02215 USA.
RP Bolanos, D (reprint author), Boulder Language Technol, 2960 Ctr Green Court,Suite 200, Boulder, CO 80301 USA.
EM dani@bltek.com; rcole@bltek.com; wward@bltek.com; geraldt@uoregon.edu;
pschwan@uga.edu; melaniek@bu.edu
FU U.S. Department of Education [R305B070434]; National Science Foundation
[0733323]; NIH [R43 DC009926-01]
FX We gratefully acknowledge the help of Angel Stobaugh, Director of
Literacy Education at Boulder Valley School District, and the principals
and teachers who allowed us to visit their schools and classrooms. We
appreciate the amazing efforts of Jennifer Borum, who organized the
FLORA data collection effort, and the efforts of Linda Hill, Suzan
Heglin and the rest of the human experts who scored the text passages.
This work was supported by U.S. Department of Education award number
R305B070434, National Science Foundation award number 0733323 and NIH
award number R43 DC009926-01.
CR Alonzo J., 2006, EASYCBM ONLINE PROGR
Altman D, 1991, PRACTICAL STAT MED R
Benjamin RG, 2010, READ RES QUART, V45, P388, DOI 10.1598/RRQ.45.4.2
Bolanos D., 2011, ACM T SPEECH LANGUAG, V7
Bolanos D., 2012, IEEE SPOK LANG TECHN
Brenier J.M., 2005, P EUR 9 EUR C SPEECH, P3297
Chafe W., 1988, COMMUNICATION, P396
Chang C.-C., 2001, LIBSVM LIB SUPPORT V
Chang Y., 2008, JMLR WORKSH C P WCCI, V3
Chard DJ, 2002, J LEARN DISABIL-US, V35, P386, DOI 10.1177/00222194020350050101
Clay M., 1971, J VERB LEARN VERB BE, P133
COHEN J, 1968, PSYCHOL BULL, V70, P213, DOI 10.1037/h0026256
COHEN J, 1960, EDUC PSYCHOL MEAS, V20, P37, DOI 10.1177/001316446002000104
Cole R., 2006, TRCSLR200602 U COL
Cole R., 2006, TRCSLR200603 U COL
Cowie R, 2002, LANG SPEECH, V45, P47
Daane M. C., 2005, 2006469 NCES US DEP
Duong M, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P769
Duong M., 2011, ACM T SPEECH LANGUAG, V7
Fisher B., 1996, TSYLB2 1 1 SYLLABIFI
Fuchs L. S., 2001, SCI STUD READ, V5, P239, DOI DOI 10.1207/S1532799XSSR0503_3
Good R. H., 2002, DYNAMIC INDICATORS B, V6th
Good R. H., 2001, SCI STUD READ, V5, P257, DOI DOI 10.1207/S1532799XSSR0503_
Good R. H., 2007, DYNAMIC INDICATORS B
Guyon I, 2002, MACH LEARN, V46, P389, DOI 10.1023/A:1012487302797
Hasbrouck J, 2006, READ TEACH, V59, P636, DOI 10.1598/RT.59.7.3
Kane M., 2006, ED MEASUREMENT, V4th, P17
Kuhn M., 2005, READING PSYCHOL, V26, P127, DOI [10.1080/02702710590930492, DOI 10.1080/02702710590930492]
Kuhn M., 2000, TECHNICAL REPORT
Kuhn M. R., 2010, READING RES Q, V45, P230, DOI DOI 10.1598/RRQ.45.2.4
Liu Y, 2006, IEEE T AUDIO SPEECH, V14, P1526, DOI 10.1109/TASL.2006.878255
LOGAN GD, 1988, PSYCHOL REV, V95, P492, DOI 10.1037//0033-295X.95.4.492
Miller J, 2006, J EDUC PSYCHOL, V98, P839, DOI 10.1037/0022-0663.98.4.839
Miller J, 2008, READ RES QUART, V43, P336, DOI 10.1598/RRQ.43.4.2
Mostow J, 2009, FR ART INT, V200, P189, DOI 10.3233/978-1-60750-028-5-189
National Reading Panel N., 2000, TECHNICAL REPORT
Patel R, 2011, SPEECH COMMUN, V53, P431, DOI 10.1016/j.specom.2010.11.007
Pinnell G., 1995, 1995726 NCES US DEP
Platt J., 1999, ADV LARGE MARGIN CLA, V10, P61
Platt JC, 2000, ADV NEUR IN, V12, P547
Rasinski T., 2004, ASSESSING READING FL
Rasinski T. V., 2009, LITERACY RES INSTRUC, V48, P350, DOI DOI 10.1080/19388070802468715
Rasinski T. V., 1991, THEOR PRACT, V30, P211, DOI DOI 10.1080/00405849109543502
Rosenfeld R., 1994, CMU STAT LANGUAGE MO
Schwanenflugel P., 2011, COMMUNICATION
Schwanenflugel P., 2012, FLUENCY INS IN PRESS
Schwanenflugel PJ, 2004, J EDUC PSYCHOL, V96, P119, DOI 10.1037/0022-0663.96.1.119
Shinn, 1998, ADV APPL CURRICULUM
Shinn M. R., 2002, AIMSWEB TRAINING WOR
Shobaki K., 2000, P ICSLP 2000 BEIJ CH
Sjolander K., 1997, TECHNICAL REPORT
Snow C. E., 1998, PREVENTING READING D
Spearman C, 1904, AM J PSYCHOL, V15, P72, DOI 10.2307/1412159
Steidl S, 2005, INT CONF ACOUST SPEE, P317
Valencia SW, 2010, READ RES QUART, V45, P270, DOI 10.1598/RRQ.45.3.1
Vapnik V., 1995, NATURE STAT LEARNING
Vicsi K, 2010, SPEECH COMMUN, V52, P413, DOI 10.1016/j.specom.2010.01.003
Wayman MM, 2007, J SPEC EDUC, V41, P85
NR 58
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 221
EP 236
DI 10.1016/j.specom.2012.08.002
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900002
ER
PT J
AU Alam, MJ
Kinnunen, T
Kenny, P
Ouellet, P
O'Shaughnessy, D
AF Alam, Md Jahangir
Kinnunen, Tomi
Kenny, Patrick
Ouellet, Pierre
O'Shaughnessy, Douglas
TI Multitaper MFCC and PLP features for speaker verification using
i-vectors
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker verification; Multi-taper spectrum; Feature extraction;
i-Vectors; MFCC; PLP
ID SPECTRAL ESTIMATION; HARMONIC-ANALYSIS; VARIANCE; RECOGNITION; WINDOWS;
SPEECH; SPHERE
AB In this paper we study the performance of the low-variance multi-taper Mel-frequency cepstral coefficient (MFCC) and perceptual linear prediction (PLP) features in a state-of-the-art i-vector speaker verification system. The MFCC and PLP features are usually computed from a Hamming-windowed periodogram spectrum estimate. Such a single-tapered spectrum estimate has large variance, which can be reduced by averaging spectral estimates obtained using a set of different tapers, leading to a so-called multi-taper spectral estimate. The multi-taper spectrum estimation method has proven to be powerful especially when the spectrum of interest has a large dynamic range or varies rapidly. Multi-taper MFCC features were also recently studied in speaker verification with promising preliminary results. In this study our primary goal is to validate those findings using an up-to-date i-vector classifier on the latest NIST 2010 SRE data. In addition, we also propose to compute robust perceptual linear prediction (PLP) features using multitapers. Furthermore, we provide a detailed comparison between different taper weight selections in the Thomson multi-taper method in the context of speaker verification. Speaker verification results on the telephone (det5) and microphone speech (det1, det2, det3 and det4) of the latest NIST 2010 SRE corpus indicate that the multi-taper methods outperform the conventional periodogram technique. Instead of simply averaging (using uniform weights) the individual spectral estimates in forming the multi-taper estimate, weighted averaging (using non-uniform weights) improves performance. Compared to the MFCC and PLP baseline systems, the sine-weighted cepstrum estimator (SWCE) based multi-taper method provides average relative reductions of 12.3% and 7.5% in equal error rate, respectively. For the multi-peak multi-taper method, the corresponding reductions are 12.6% and 11.6%, respectively. Finally, the Thomson multi-taper method provides error reductions of 9.5% and 5.0% in EER for MFCC and PLP features, respectively. We conclude that both the MFCC and PLP features computed via multitapers provide systematic improvements in recognition accuracy. (C) 2012 Elsevier B.V. All rights reserved.
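As a rough illustration of the multi-taper idea discussed in the abstract, the following Python sketch forms a spectrum estimate by averaging periodograms obtained with a set of sine tapers; uniform taper weights are used here, whereas the SWCE, multi-peak, and Thomson variants evaluated in the paper differ in the tapers and their weights.

```python
# Minimal sketch of a multi-taper spectrum estimate with sine tapers,
# as an alternative to a single Hamming-windowed periodogram.
import numpy as np

def sine_tapers(frame_len: int, num_tapers: int) -> np.ndarray:
    n = np.arange(frame_len)
    return np.stack([np.sqrt(2.0 / (frame_len + 1)) *
                     np.sin(np.pi * (k + 1) * (n + 1) / (frame_len + 1))
                     for k in range(num_tapers)])

def multitaper_spectrum(frame: np.ndarray, num_tapers: int = 6,
                        nfft: int = 512) -> np.ndarray:
    tapers = sine_tapers(len(frame), num_tapers)
    # Average the per-taper periodograms (uniform weights).
    spectra = np.abs(np.fft.rfft(tapers * frame, n=nfft)) ** 2
    return spectra.mean(axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = np.sin(2 * np.pi * 0.1 * np.arange(400)) + 0.1 * rng.standard_normal(400)
    print(multitaper_spectrum(frame).shape)   # (257,)
```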
C1 [Alam, Md Jahangir] Univ Quebec, INRS EMT, Montreal, PQ H5A 1K6, Canada.
[Alam, Md Jahangir; Kenny, Patrick; Ouellet, Pierre] CRIM, Montreal, PQ, Canada.
[Kinnunen, Tomi] Univ Eastern Finland, Sch Comp, Joensuu, Finland.
RP Alam, MJ (reprint author), Univ Quebec, INRS EMT, 800 La Gauchetiere W,Suite 6900, Montreal, PQ H5A 1K6, Canada.
EM alam@emt.inrs.ca; tkinnu@cs.joensuu.fi; Patrick.Kenny@crim.ca;
Pierre.Ouellet@crim.ca; dougo@emt.inrs.ca
FU Academy of Finland [132129]
FX The work of T. Kinnunen was supported by the Academy of Finland (Project
No. 132129). We would like to thank the anonymous reviewers for their
comments that helped to improve the content of this paper.
CR Alam J., 2011, LECT NOTES ARTIF INT, V7015, P239
Alam M.J., 2011, P IEEE AUT SPEECH RE, P547
Brummer N, 2010, P OD SPEAK LANG REC
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Dehak N, 2011, IEEE T AUDIO SPEECH, V19, P788, DOI 10.1109/TASL.2010.2064307
Djuric P., 1999, DIGITAL SIGNAL PROCE
Garcia-Romero D., 2011, P INTERSPEECH, P249
Gold B., 2000, SPEECH AUDIO SIGNAL
Hansson M, 1997, IEEE T SIGNAL PROCES, V45, P778, DOI 10.1109/78.558503
Hansson-Sandsten M, 2009, INT CONF ACOUST SPEE, P3077, DOI 10.1109/ICASSP.2009.4960274
HARRIS FJ, 1978, P IEEE, V66, P51, DOI 10.1109/PROC.1978.10837
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Honig Florian, 2005, P INTERSPEECH, P2997
Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949
Kay S. M., 1988, MODERN SPECTRAL ESTI
Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1448, DOI 10.1109/TASL.2007.894527
Kenny P, 2010, P OD SPEAK LANG REC
Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1435, DOI 10.1109/TASL.2006.881693
Kinnunen T, 2012, IEEE T AUDIO SPEECH, V20, P1990, DOI 10.1109/TASL.2012.2191960
Kinnunen T., 2010, P INTERSPEECH, P2734
Matejka Pavel, 2006, P IEEE OD 2006 SPEAK, P57
McCoy EJ, 1998, IEEE T SIGNAL PROCES, V46, P655, DOI 10.1109/78.661333
National Institute of Standards and Technology, NIST SPEAK REC EV
Pelecanos J., 2001, P SPEAK OD SPEAK REC, P213
Percival D.B., 1993, SPECTRAL ANAL PHYS A
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Reynolds D.A., 2008, UNIVERSAL BACKGROUND
RIEDEL KS, 1995, IEEE T SIGNAL PROCES, V43, P188, DOI 10.1109/78.365298
Sandberg J, 2010, IEEE SIGNAL PROC LET, V17, P343, DOI 10.1109/LSP.2010.2040228
Senoussaoui M., 2011, P INTERSPEECH, P25
Senoussaoui M., 2010, P OD SPEAK LANG REC
SLEPIAN D, 1961, AT&T TECH J, V40, P43
THOMSON DJ, 1990, PHILOS T ROY SOC A, V332, P539, DOI 10.1098/rsta.1990.0130
THOMSON DJ, 1982, P IEEE, V70, P1055, DOI 10.1109/PROC.1982.12433
WALDEN AT, 1994, IEEE T SIGNAL PROCES, V42, P479, DOI 10.1109/78.275635
Wieczorek MA, 2007, J FOURIER ANAL APPL, V13, P665, DOI 10.1007/s00041-006-6904-1
Wieczorek MA, 2005, GEOPHYS J INT, V162, P655, DOI 10.1111/j-1365-246X.2005.02687.x
XIANG B, 2002, ACOUST SPEECH SIG PR, P681
Young S., 2006, HTK BOOK
NR 39
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 237
EP 251
DI 10.1016/j.specom.2012.08.007
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900003
ER
PT J
AU Wollmer, M
Schuller, B
Rigoll, G
AF Woellmer, Martin
Schuller, Bjoern
Rigoll, Gerhard
TI Keyword spotting exploiting Long Short-Term Memory
SO SPEECH COMMUNICATION
LA English
DT Article
DE Keyword spotting; Long Short-Term Memory; Recurrent neural networks;
Dynamic Bayesian Networks
ID BIDIRECTIONAL LSTM NETWORKS; RECURRENT NEURAL-NETWORKS; SPEECH
RECOGNITION; ARCHITECTURES; MODELS
AB We investigate various techniques for keyword spotting which are exclusively based on acoustic modeling and do not presume the existence of an in-domain language model. Since adequate context modeling is nevertheless necessary for word spotting, we show how the principle of Long Short-Term Memory (LSTM) can be incorporated into the decoding process. We propose a novel technique that exploits LSTM in combination with Connectionist Temporal Classification in order to improve performance by using a self-learned amount of contextual information. All considered approaches are evaluated on read speech as contained in the TIMIT corpus as well as on the SEMAINE database which consists of spontaneous and emotionally colored speech. As further evidence for the effectiveness of LSTM modeling for keyword spotting, results on the CHiME task are shown. (C) 2012 Elsevier B.V. All rights reserved.
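For readers unfamiliar with the gating that gives LSTM its "self-learned amount of contextual information", the following minimal numpy sketch runs a single LSTM cell over a few feature frames; it is an illustrative toy with random weights, not the bidirectional LSTM/CTC systems evaluated in the paper.

```python
# Single LSTM cell step in numpy, showing the input/forget/output gating.
import numpy as np

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W: (4H, D), U: (4H, H), b: (4H,)."""
    H = h_prev.shape[0]
    z = W @ x + U @ h_prev + b
    i = 1 / (1 + np.exp(-z[:H]))          # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))       # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))     # output gate
    g = np.tanh(z[3*H:])                  # candidate cell update
    c = f * c_prev + i * g                # memory cell keeps a learned amount of context
    return o * np.tanh(c), c

if __name__ == "__main__":
    D, H = 39, 8                           # e.g. a 39-dim feature frame, 8 memory cells
    rng = np.random.default_rng(0)
    W, U, b = rng.standard_normal((4*H, D)), rng.standard_normal((4*H, H)), np.zeros(4*H)
    h = c = np.zeros(H)
    for frame in rng.standard_normal((5, D)):   # 5 feature frames
        h, c = lstm_step(frame, h, c, W, U, b)
    print(h.shape)                         # (8,)
```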
C1 [Woellmer, Martin; Schuller, Bjoern; Rigoll, Gerhard] Tech Univ Munich, Inst Human Machine Commun, D-80333 Munich, Germany.
RP Wollmer, M (reprint author), Tech Univ Munich, Inst Human Machine Commun, Theresienstr 90, D-80333 Munich, Germany.
EM woellmer@tum.de
FU Federal Republic of Germany through the German Research Foundation (DFG)
[SCHU2508/4-1]
FX The research leading to these results has received funding from the
Federal Republic of Germany through the German Research Foundation (DFG)
under Grant No. SCHU2508/4-1.
CR Benayed Y., 2003, P ICASSP, P588
BILMES J, 2002, ACOUST SPEECH SIG PR, P3916
Bilmes J. A., 2003, MATH FDN SPEECH LANG, P191
Bilmes JA, 2005, IEEE SIGNAL PROC MAG, V22, P89
Charles F, 2007, LECT NOTES COMPUT SC, V4871, P210
Chen B., 2004, P INTERSPEECH
Christensen H., 2010, P INTERSPEECH, P1918
Dekel O, 2004, WORKSH MULT INT REL, P146
Fernandez S, 2007, LECT NOTES COMPUT SC, V4669, P220
Gemmeke JF, 2011, IEEE T AUDIO SPEECH, V19, P2067, DOI 10.1109/TASL.2011.2112350
Gers FA, 2000, NEURAL COMPUT, V12, P2451, DOI 10.1162/089976600300015015
Graves A, 2005, NEURAL NETWORKS, V18, P602, DOI 10.1016/j.neunet.2005.06.042
Graves A., 2006, P 23 INT C MACH LEAR, P369, DOI DOI 10.1145/1143844.1143891
Graves A., 2008, THESIS TU MUNCHEN
Graves A., 2008, ADV NEURAL INFORM PR, V20, P1
Grezl F, 2008, INT CONF ACOUST SPEE, P4729, DOI 10.1109/ICASSP.2008.4518713
Hermansky H., 2008, P EUR C SPEECH COMM, P361
Heylen D., 2008, P 4 INT WORKSH HUM C, P1
Hochreiter S, 1997, NEURAL COMPUT, V9, P1735, DOI 10.1162/neco.1997.9.8.1735
Hochreiter S., 2001, FIELD GUIDE DYNAMICA, P1
Jensen F. V., 1996, INTRO BAYESIAN NETWO
Keshet J., 2007, THESIS HEBREW U
Keshet J, 2009, SPEECH COMMUN, V51, P317, DOI 10.1016/j.specom.2008.10.002
Ketabdar H., 2006, IDAIP RR, P1
McKee GJ, 2010, AGR ISSUES POLICIES, P1
Nijholt A, 2000, STUD FUZZ SOFT COMP, V45, P148
Parveen S., 2004, P ICASSP
Principi E., 2009, P HSI CAT IT, P216
Rigoll G., 1994, IEEE T AUDIO SPEECH, V2
ROSE RC, 1995, COMPUT SPEECH LANG, V9, P309, DOI 10.1006/csla.1995.0015
Rosevear R. D., 1990, Power Technology International
Schroder M, 2012, IEEE T AFFECT COMPUT, V3, P165, DOI 10.1109/T-AFFC.2011.34
Schuller B., 2011, P 1 INT AUD VIS EM C, P415
Schuster M, 1997, IEEE T SIGNAL PROCES, V45, P2673, DOI 10.1109/78.650093
Szoke Igor, 2010, Proceedings 2010 IEEE Spoken Language Technology Workshop (SLT 2010), DOI 10.1109/SLT.2010.5700849
Trentin E, 2001, NEUROCOMPUTING, V37, P91, DOI 10.1016/S0925-2312(00)00308-8
Vergyri D., 2007, P INTERSPEECH
Wang H.C., 1997, P ROCLING 10 INT C, P325
Weninger F., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288963
Weninger F., 2011, P CHIME WORKSH FLOR, P24
Wik P, 2009, SPEECH COMMUN, V51, P1024, DOI 10.1016/j.specom.2009.05.006
Wollmer M, 2009, INT CONF ACOUST SPEE, P3949, DOI 10.1109/ICASSP.2009.4960492
Wollmer M, 2009, NEUROCOMPUTING, V73, P366, DOI 10.1016/j.neucom.2009.08.005
Wollmer M, 2010, COGN COMPUT, V2, P180, DOI 10.1007/s12559-010-9041-8
Wollmer M, 2011, IEEE T INTELL TRANSP, V12, P574, DOI 10.1109/TITS.2011.2119483
Wollmer Martin, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5373544
Wollmer M, 2011, INT CONF ACOUST SPEE, P4860
Wollmer M, 2010, IEEE J-STSP, V4, P867, DOI 10.1109/JSTSP.2010.2057200
Wollmer M, 2010, INT CONF ACOUST SPEE, P5274, DOI 10.1109/ICASSP.2010.5494980
Young S., 2006, HTK BOOK V3 4
Zhu QF, 2005, LECT NOTES COMPUT SC, V3361, P223
NR 51
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 252
EP 265
DI 10.1016/j.specom.2012.08.006
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900004
ER
PT J
AU Yeh, CY
Chang, SC
Hwang, SH
AF Yeh, Cheng-Yu
Chang, Shun-Chieh
Hwang, Shaw-Hwa
TI A consistency analysis on an acoustic module for Mandarin text-to-speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Consistency analysis; Hidden Markov model (HMM); Vector quantization
(VQ); Acoustic module; Text-to-speech (TTS); Speech synthesis
ID PROSODIC INFORMATION; CHINESE; SYSTEM; UNITS; CONVERSION; ALGORITHM
AB In this work, a consistency analysis of an acoustic module for Mandarin text-to-speech (TTS) is presented as a way to improve speech quality. Based on an inspection of the human pronunciation process, the consistency can be interpreted as a high correlation of the warping curves between the spectrum and the prosody within a syllable. The consistency analysis proceeds in three steps. First, the HMM algorithm is used to decode the HMM-state sequences within a syllable and, at the same time, to divide them into three segments. Second, for a designated syllable, vector quantization (VQ) with the Linde-Buzo-Gray (LBG) algorithm is used to train a VQ codebook for each segment. Third, the prosodic vector of each segment is encoded as an index by the VQ codebooks, and the probability of each possible path is evaluated as a prerequisite for analyzing the consistency. It is demonstrated experimentally that consistency is clearly present when the syllable occurs in the same word. These results point to a research direction in which the warping process between the spectrum and the prosody within a syllable must be considered in a TTS system to improve speech quality. (C) 2012 Elsevier B.V. All rights reserved.
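The second step of the procedure relies on LBG codebook training; the following Python sketch shows a generic LBG routine (split-and-refine) applied to toy prosodic vectors. The perturbation factor, stopping rule, and codebook size are illustrative assumptions, not the paper's settings.

```python
# Generic LBG vector quantization: codebook splitting followed by k-means refinement.
import numpy as np

def lbg_codebook(data: np.ndarray, size: int, eps: float = 0.01,
                 iters: int = 20) -> np.ndarray:
    codebook = data.mean(axis=0, keepdims=True)
    while codebook.shape[0] < size:
        # Split every codeword into a +/- perturbed pair.
        codebook = np.vstack([codebook * (1 + eps), codebook * (1 - eps)])
        for _ in range(iters):                     # Lloyd/k-means refinement
            d = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            nearest = d.argmin(axis=1)
            for k in range(codebook.shape[0]):
                members = data[nearest == k]
                if len(members):
                    codebook[k] = members.mean(axis=0)
    return codebook

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prosody = rng.standard_normal((500, 3))        # toy (f0, energy, duration) vectors
    cb = lbg_codebook(prosody, size=8)
    print(cb.shape)                                # (8, 3)
```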
C1 [Yeh, Cheng-Yu] Natl Chin Yi Univ Technol, Dept Elect Engn, Taichung 41170, Taiwan.
[Chang, Shun-Chieh; Hwang, Shaw-Hwa] Natl Taipei Univ Technol, Dept Elect Engn, Taipei 10608, Taiwan.
RP Yeh, CY (reprint author), Natl Chin Yi Univ Technol, Dept Elect Engn, 57,Sec 2,Zhongshan Rd, Taichung 41170, Taiwan.
EM cy.yeh@ncut.edu.tw; t6319011@ntut.edu.tw; hsf@ntut.edu.tw
FU Ministry of Economic Affairs [100-EC-17-A-03-S1-123]; National Science
Council, Taiwan, Republic of China [NSC 95-2221-E-027-090]
FX This research was financially supported by the Ministry of Economic
Affairs under Grant No. 100-EC-17-A-03-S1-123 and the National Science
Council under Grant No. NSC 95-2221-E-027-090, Taiwan, Republic of
China.
CR Bellegarda JR, 2010, IEEE T AUDIO SPEECH, V18, P1455, DOI 10.1109/TASL.2009.2035209
Chalamandaris A, 2010, IEEE T CONSUM ELECTR, V56, P1890, DOI 10.1109/TCE.2010.5606343
Chen SH, 1998, IEEE T SPEECH AUDI P, V6, P226
Chou FC, 2002, IEEE T SPEECH AUDI P, V10, P481, DOI 10.1109/TSA.2002.803437
CHOU FC, 1998, ACOUST SPEECH SIG PR, P893
Chou FC, 1997, INT CONF ACOUST SPEE, P923
Chou FC, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1624
Dey S, 2007, ASIA S PACIF DES AUT, P298
Guo Q., 2008, P TRIANGL S ADV ICT, P1
Huang X.D., 2001, HIDDEN MARKOV MODELS, P377
Hwang SH, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1421
Hwang SH, 2005, GESTS INT T SPEECH S, V2, P91
Karabetsos S, 2009, IEEE T CONSUM ELECTR, V55, P613, DOI 10.1109/TCE.2009.5174430
KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275
LEE LS, 1989, IEEE T ACOUST SPEECH, V37, P1309
LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
OMALLEY MH, 1990, COMPUTER, V23, P17, DOI 10.1109/2.56867
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
Spelta Cristiano, 2010, IEEE Embedded Systems Letters, V2, DOI 10.1109/LES.2010.2052019
Wu CH, 2001, SPEECH COMMUN, V35, P219, DOI 10.1016/S0167-6393(00)00075-3
Yeh CY, 2005, IEE P-VIS IMAGE SIGN, V152, P793, DOI 10.1049/ip-vis:20045095
Yeh C.Y., 2010, P ISCCSP, P1
YING ZW, 2001, ACOUST SPEECH SIG PR, P809
Yoshimura T., 2000, P ICASSP, P1315
Yue D. J., 2010, P ICALIP, P1652
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
Zhu Y., 2002, P TENCON, P204
NR 28
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 266
EP 277
DI 10.1016/j.specom.2012.08.009
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900005
ER
PT J
AU Degottex, G
Lanchantin, P
Roebel, A
Rodet, X
AF Degottex, Gilles
Lanchantin, Pierre
Roebel, Axel
Rodet, Xavier
TI Mixed source model and its adapted vocal tract filter estimate for voice
transformation and synthesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Mixed source; Glottal model; Vocal tract filter; Voice quality; Voice
transformation; Speech synthesis
ID SPEECH SYNTHESIS; HMM; REPRESENTATION; EXCITATION; NOISE
AB In current methods for voice transformation and speech synthesis, the vocal tract filter is usually assumed to be excited by a flat amplitude spectrum. In this article, we present a method using a mixed source model defined as a mixture of the Liljencrants-Fant (LF) model and Gaussian noise. Because it uses the LF model, the approach taken in this work is close to vocoders with exogenous input, such as ARX-based methods or the Glottal Spectral Separation (GSS) method. Such approaches are dedicated to voice processing and promise improved naturalness compared to generic signal models. To estimate the Vocal Tract Filter (VTF) by spectral division, as in GSS, we show that a glottal source model can be used with any envelope estimation method, in contrast to the ARX approach, where a least-squares AR solution is used. We therefore derive a VTF estimate that takes into account the amplitude spectra of both the deterministic and random components of the glottal source. The proposed mixed source model is controlled by a small set of intuitive and independent parameters. The relevance of this voice production model is evaluated, through listening tests, in the context of resynthesis, HMM-based speech synthesis, breathiness modification and pitch transposition. (C) 2012 Elsevier B.V. All rights reserved.
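The central operation, estimating the VTF by spectral division with a mixed (deterministic plus noise) source, can be illustrated numerically. The Python sketch below divides a toy spectral envelope by the amplitude spectrum of a stand-in source; the stand-in spectra are placeholders for the LF model and the estimated noise level used in the paper.

```python
# Spectral-division idea: |VTF| = |envelope| / |mixed source|, all amplitude spectra.
import numpy as np

def vtf_amplitude(envelope: np.ndarray, glottal_amp: np.ndarray,
                  noise_amp: np.ndarray) -> np.ndarray:
    """Divide the envelope by the combined (deterministic + noise) source amplitude."""
    source = np.sqrt(glottal_amp ** 2 + noise_amp ** 2)
    return envelope / np.maximum(source, 1e-12)

if __name__ == "__main__":
    f = np.linspace(1.0, 8000.0, 257)
    envelope = 1.0 + 0.5 * np.cos(2 * np.pi * f / 2000.0)   # toy speech spectral envelope
    glottal = 100.0 / f                                      # crude -6 dB/oct tilt, not the LF model
    noise = np.full_like(f, 1e-3)                            # flat aspiration-noise floor (assumed)
    print(vtf_amplitude(envelope, glottal, noise).shape)     # (257,)
```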
C1 [Degottex, Gilles; Lanchantin, Pierre; Roebel, Axel; Rodet, Xavier] Ircam, CNRS, Anal Synth Team, STMS,UMR9912, F-75004 Paris, France.
RP Degottex, G (reprint author), Univ Crete, Dept Comp Sci, Iraklion 71409, Crete, Greece.
EM gilles.degottex@ircam.fr
FU Affective Avatar ANR project; Respoken and AngelStudio FEDER projects;
Centre National de la Recherche Scientifique (CNRS)
FX This research was partly supported by the Affective Avatar ANR project,
by the Respoken and AngelStudio FEDER projects and the Centre National
de la Recherche Scientifique (CNRS) for the PhD grant. Authors also
would like to thank Chunghsin Yeh and Joao Cabral for the discussions,
their time and their code, the reviewers for their numerous and precise
remarks and especially the listeners for their precious ears.
CR Agiomyrgiannakis Y., 2009, P IEEE INT C AC SPEE, P3589
Agiomyrgiannakis Y., 2008, P INTERSPEECH, P1849
Alku P, 1999, CLIN NEUROPHYSIOL, V110, P1329, DOI 10.1016/S1388-2457(99)00088-7
Assembly T.I.R., 2003, TECHNICAL REPORT
BANNO H, 1998, ACOUST SPEECH SIG PR, P861
Bechet F., 2001, TRAITEMENT AUTOMATIQ, V42, P47
Bonada J., 2008, THESIS U POMPEU FABR
Cabral J. P., 2010, THESIS U EDINBURGH U
Cabral JP, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1829
Cabral JP, 2011, INT CONF ACOUST SPEE, P4704
de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024
Degottex G, 2011, IEEE T AUDIO SPEECH, V19, P1080, DOI 10.1109/TASL.2010.2076806
Degottex G, 2011, INT CONF ACOUST SPEE, P5128
Degottex G., 2010, THESIS UPMC FRANCE
del Pozo A, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1457
Drugman T., 2009, P INTERSPEECH, P116
Drugman T., 2009, INTERSPEECH
Fant G., 1995, STL QPSR, V36, P119
Fant Gunnar, 1985, STL QPSR, V4, P1
Flanagan J.L., 1966, BELL SYSTEM TECHNICA
Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034
GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651
Hamon C., 1989, P INT C AC SPEECH SI, V89, P238
Hedelin P., 1984, P IEEE INT C AC SPEE, P21
Henrich N., 2001, THESIS UPMC FRANCE
HERMES DJ, 1991, SPEECH COMMUN, V10, P497, DOI 10.1016/0167-6393(91)90053-V
Imai S., 1979, ELECT COMM, V62-A, P10
Kawahara H., 2001, MAVEBA
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Kim SJ, 2007, IEICE T INF SYST, VE90D, P378, DOI 10.1093/ietisy/e90-d.1.378
Lanchantin P., 2008, P INT C LANG RES EV, P2403
Lanchantin P, 2010, INT CONF ACOUST SPEE, P4630, DOI 10.1109/ICASSP.2010.5495550
Laroche J., 1993, P IEEE INT C AC SPEE, P550
Markel JD, 1976, LINEAR PREDICTION SP
MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910
Mehta D, 2005, INT INTEG REL WRKSP, P199, DOI 10.1109/ASPAA.2005.1540204
MILLER RL, 1959, J ACOUST SOC AM, V31, P667, DOI 10.1121/1.1907771
OPPENHEI.AV, 1968, PR INST ELECTR ELECT, V56, P1264, DOI 10.1109/PROC.1968.6570
Pantazis Y., 2010, IEEE T AUDIO SPEECH, V19, P290
Peelers G., 2001, THESIS UPMC FRANCE
Raitio T, 2011, IEEE T AUDIO SPEECH, V19, P153, DOI 10.1109/TASL.2010.2045239
RODET X, 1984, COMPUT MUSIC J, V8, P15, DOI 10.2307/3679810
Roebel A., 2007, PATTERN RECOGN, V28, P1343
STEVENS KN, 1971, J ACOUST SOC AM, V50, P1180, DOI 10.1121/1.1912751
Stylianou Y, 2001, IEEE T SPEECH AUDI P, V9, P21, DOI 10.1109/89.890068
Stylianou Y., 1996, THESIS TELECOMPARIS
Tokuda K., 1995, P EUROSPEECH, P757
Tokuda K, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P227
Tokuda K, 2002, IEICE T INF SYST, VE85D, P455
Tooher M., 2003, P ISCA VOIC QUAL FUN, P41
Valbret H., 1992, P ICASSP SAN FRANC U, V1, P145, DOI 10.1109/ICASSP.1992.225951
Vincent D, 2007, INT CONF ACOUST SPEE, P525
Young S., 1994, TECHNICAL REPORT
Zen H, 2007, P ISCA WORKSH SPEECH
Zen H., 2004, P ICSLP
Zivanovic M, 2008, COMPUT MUSIC J, V32, P57, DOI 10.1162/comj.2008.32.2.57
NR 56
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 278
EP 294
DI 10.1016/j.specom.2012.08.010
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900006
ER
PT J
AU Kane, J
Gobl, C
AF Kane, John
Gobl, Christer
TI Evaluation of glottal closure instant detection in a range of voice
qualities
SO SPEECH COMMUNICATION
LA English
DT Article
DE Glottal closure instant; Phonation type; Voice quality; Glottal epochs
ID GROUP DELAY FUNCTION; SIGNIFICANT EXCITATION; SPEECH SIGNALS;
ELECTROGLOTTOGRAPHIC SIGNALS; WAVELET TRANSFORM; EPOCH EXTRACTION; DYPSA
ALGORITHM; PREDICTION; PHONATION; ENVELOPE
AB Recently developed speech technology platforms, such as statistical speech synthesis and voice transformation systems, facilitate the modification of voice characteristics. To fully exploit the potential of such platforms, speech analysis algorithms need to be able to handle the different acoustic characteristics of a variety of voice qualities. Glottal closure instant (GCI) detection is typically required in the analysis stages, and thus the importance of robust GCI algorithms is evident. The current study examines some important analysis signals relevant to GCI detection, for a range of phonation types. Furthermore, a new algorithm is proposed which builds on an existing GCI algorithm to optimise the performance when analysing speech involving different phonation types. Results suggest improvements in the GCI detection rate for creaky voice due to a reduction in false positives. When there is a lack of prominent peaks in the Linear Prediction residual, as found for breathy and harsh voice, the results further indicate some enhancement of GCI identification accuracy for the proposed method. (C) 2012 Elsevier B.V. All rights reserved.
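One common ingredient of GCI detection referred to in the abstract, peak picking in the Linear Prediction residual, can be sketched as follows in Python; the LP order, threshold, and spacing rule are illustrative assumptions and do not reproduce the paper's algorithm.

```python
# LP inverse filtering and naive peak picking as GCI candidates.
import numpy as np

def lp_coefficients(x: np.ndarray, order: int = 18) -> np.ndarray:
    """Autocorrelation-method LP coefficients a[1..p] (a[0] = 1 implied)."""
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R + 1e-8 * np.eye(order), r[1:order + 1])

def gci_candidates(x: np.ndarray, fs: int, order: int = 18,
                   min_gap_s: float = 0.002) -> np.ndarray:
    a = lp_coefficients(x, order)
    residual = x.copy()                       # residual of A(z) = 1 - sum a_k z^-k
    for k, ak in enumerate(a, start=1):
        residual[k:] -= ak * x[:-k]
    picks = []
    threshold = 3.0 * np.std(residual)
    for n in np.argsort(np.abs(residual))[::-1]:
        if np.abs(residual[n]) < threshold:
            break
        if all(abs(n - p) > min_gap_s * fs for p in picks):
            picks.append(int(n))
    return np.sort(np.array(picks))

if __name__ == "__main__":
    fs = 16000
    excitation = np.zeros(800)
    excitation[::100] = 1.0                   # pulse every 100 samples (160 Hz)
    x = np.zeros_like(excitation)             # simple one-pole "vocal tract"
    for n in range(len(x)):
        x[n] = excitation[n] + (0.95 * x[n - 1] if n else 0.0)
    print(gci_candidates(x, fs))              # peaks near multiples of 100
```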
C1 [Kane, John; Gobl, Christer] Trinity Coll Dublin, Sch Linguist Speech & Commun Sci, Phonet & Speech Lab, Dublin, Ireland.
RP Kane, J (reprint author), Trinity Coll Dublin, Sch Linguist Speech & Commun Sci, Phonet & Speech Lab, Dublin, Ireland.
EM kanejo@tcd.ie; cegobl@tcd.ie
FU Science Foundation Ireland [07/CE/I1142, 09/IN.1/I 2631]
FX This research was supported by the Science Foundation Ireland, Grant
07/CE/I1142 (Centre for Next Generation Localisation,
http://www.cngl.ie) and Grant 09/IN.1/I 2631 (FASTNET). The authors
would like to thank the anonymous reviewers for their insightful
comments and suggestions.
CR Agiomyrgiannakis Y., 2009, P IEEE INT C AC SPEE, P3589
ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R
ANANTHAPADMANABHA TV, 1979, IEEE T ACOUST SPEECH, V27, P309, DOI 10.1109/TASSP.1979.1163267
Cabral J., 2011, P INTERSPEECH, P1989
Cabral JP, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1829
Carlson R, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1300
CHENG YM, 1989, IEEE T ACOUST SPEECH, V37, P1805, DOI 10.1109/29.45529
d'Alessandro C., 2011, SADHANA 5, V36
Degottex G., 2009, P SPECOM ST PET, P226
Drugman T., 2009, P INTERSPEECH, P116
Drugman T., 2009, P INTERSPEECH, P2891
Drugman T., 2011, THESIS U MONS
Drugman T, 2012, IEEE T AUDIO SPEECH, V20, P994, DOI 10.1109/TASL.2011.2170835
Drugman T., 2012, P INTERSPEECH, P130
Drugman T., 2011, P INTERSPEECH, P1973
Esling J. H., 2003, P 15 ICPHS BARC, P1049
Fant G., 1985, 4 KTH SPEECH TRANSM, V4, P1
GOBL C, 1992, SPEECH COMMUN, V11, P481, DOI 10.1016/0167-6393(92)90055-C
Guruprasad S., 2007, P INT ANTW BELG, P554
Henrich N, 2004, J ACOUST SOC AM, V115, P1321, DOI 10.1121/1.1646401
Ishi CT, 2008, IEEE T AUDIO SPEECH, V16, P47, DOI 10.1109/TASL.2007.910791
Ishi C.T., 2008, SPEECH COMMUN, V50, P531
KADAMBE S, 1992, IEEE T INFORM THEORY, V38, P917, DOI 10.1109/18.119752
Kay S. M., 1988, MODERN SPECTRAL ESTI
Kominek J., 2004, ISCA SPEECH SYNTH WO, P223
KOUNOUDES A, 2002, ACOUST SPEECH SIG PR, P349
Laver J, 1980, PHONETIC DESCRIPTION
MONSEN RB, 1977, J ACOUST SOC AM, V62, P981, DOI 10.1121/1.381593
MOULINES E, 1990, SPEECH COMMUN, V9, P401, DOI 10.1016/0167-6393(90)90017-4
Murty KSR, 2008, IEEE T AUDIO SPEECH, V16, P1602, DOI 10.1109/TASL.2008.2004526
Naylor PA, 2007, IEEE T AUDIO SPEECH, V15, P34, DOI 10.1109/TASL.2006.876878
Ni Chasaide A., 1993, LANG SPEECH, V2, P303
O' Cinneide A., 2011, P INTERSPEECH, P57
OGDEN RICHARD, 2001, J INT PHON ASSOC, V31, P139
Podesva RJ, 2007, J SOCIOLING, V11, P478, DOI 10.1111/j.1467-9841.2007.00334.x
Raitio T, 2011, IEEE T AUDIO SPEECH, V19, P153, DOI 10.1109/TASL.2010.2045239
Rao KS, 2006, IEEE T AUDIO SPEECH, V14, P972, DOI 10.1109/TSA.2005.858051
Rao KS, 2007, IEEE SIGNAL PROC LET, V14, P762, DOI 10.1109/LSP.2007.896454
Schnell K, 2007, LECT NOTES ARTIF INT, V4885, P221
Schroder M., 2001, P EUROSPEECH 2001 SE, P561
Schwartz R., 1990, P IEEE INT C AC SPEE, P81
SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662
Sturmel N, 2009, INT CONF ACOUST SPEE, P4517, DOI 10.1109/ICASSP.2009.4960634
Stylianou Y., 1999, 6 EUR C SPEECH COMM
Talkin D., 1989, J ACOUST SOC AM, V85, P149
Talkin D., 1995, SPEECH CODING SYNTHE
Thomas MRP, 2009, IEEE T AUDIO SPEECH, V17, P1557, DOI 10.1109/TASL.2009.2022430
Thomas MRP, 2012, IEEE T AUDIO SPEECH, V20, P82, DOI 10.1109/TASL.2011.2157684
Tuan V., 1999, 6 EUR C SPEECH COMM
Van den Berg J., 1968, MANUAL PHONETICS, P278
Villavicencio F, 2006, INT CONF ACOUST SPEE, P869
Vincent D, 2005, P INT LISB, P333
NR 52
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 295
EP 314
DI 10.1016/j.specom.2012.08.01
PG 20
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900007
ER
PT J
AU Dromey, C
Jang, GO
Hollis, K
AF Dromey, Christopher
Jang, Gwi-Ok
Hollis, Kristi
TI Assessing correlations between lingual movements and formants
SO SPEECH COMMUNICATION
LA English
DT Article
DE Formant; Kinematic; Diphthong; Articulation
ID MOTOR SPEECH DISORDERS; VOCAL-TRACT STEADINESS; LOUDNESS MANIPULATIONS;
SPASMODIC DYSPHONIA; TONGUE MOVEMENTS; SPEAKING RATE; DYSARTHRIA;
SPEAKERS; SYSTEM
AB First and second formant histories have been used in studies of both normal and disordered speech to indirectly measure the activity of the vocal tract. The purpose of the present study was to determine the extent to which formant measures are reflective of lingual movements during diphthong production. Twenty native speakers of American English from the western United States produced four diphthongs in a sentence context while tongue movement was measured with a magnetic tracking system. Correlations were computed between the vertical tongue movements and the first formant, as well as between the anteroposterior movements and the second formant during the transition phase of the diphthong. In many instances the acoustic measures were clearly reflective of the kinematic data. However, there were also exceptions, where the acoustic and kinematic records were not congruent. These instances were evaluated quantitatively and qualitatively in an effort to understand the cause of the discrepancy. Factors such as coarticulation, motor equivalence (including the influence of structures other than the tongue), and nonlinearities in the linkage between movement and acoustics could account for these findings. Recognizing potential influences on the acoustic-kinematic relationship may be valuable in the interpretation of articulatory-acoustic data at the individual speaker level. (C) 2012 Elsevier B.V. All rights reserved.
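The core measurement, a correlation between a kinematic trace and the corresponding formant history over the diphthong transition, is straightforward to illustrate; the Python sketch below uses synthetic traces as placeholders for the tracked tongue positions and measured formants.

```python
# Correlating kinematic and formant traces over a diphthong transition (toy data).
import numpy as np

def pearson_r(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.corrcoef(a, b)[0, 1])

if __name__ == "__main__":
    t = np.linspace(0.0, 1.0, 100)                  # normalized transition time
    tongue_height = 5.0 * t                         # tongue rises, mm (e.g. /ai/)
    f1 = 800.0 - 400.0 * t + 10.0 * np.sin(8 * t)   # F1 falls as the tongue rises, Hz
    tongue_advance = 8.0 * t                        # tongue fronts, mm
    f2 = 1200.0 + 900.0 * t                         # F2 rises, Hz
    print(pearson_r(tongue_height, f1))             # strongly negative: congruent traces
    print(pearson_r(tongue_advance, f2))            # strongly positive: congruent traces
```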
C1 [Dromey, Christopher; Jang, Gwi-Ok; Hollis, Kristi] Brigham Young Univ, Dept Commun Disorders, Provo, UT 84602 USA.
RP Dromey, C (reprint author), Brigham Young Univ, Dept Commun Disorders, 133 John Taylor Bldg, Provo, UT 84602 USA.
EM dromey@byu.edu
FU David O. McKay School of Education at Brigham Young University
FX We express appreciation to the individuals who volunteered their time to
participate in this study. Funding was provided by the David O. McKay
School of Education at Brigham Young University. This paper was based on
the master's theses of the second and third authors.
CR BARLOW SM, 1983, J SPEECH HEAR RES, V26, P283
Beddor PS, 2009, LANGUAGE, V85, P785
Boersma P., 2007, PRAAT
CANNITO MP, 1989, RECENT ADVANCES IN CLINICAL DYSARTHRIA, P243
Dagli AS, 1997, EUR ARCH OTO-RHINO-L, V254, P78, DOI 10.1007/BF01526184
Dromey C, 2006, SPEECH COMMUN, V48, P463, DOI 10.1016/j.specom.2005.05.003
Dromey C, 2008, J SPEECH LANG HEAR R, V51, P196, DOI 10.1044/1092-4388(2008/015)
Ferrand C. T., 2007, SPEECH SCI INTEGRATE
GERRATT BR, 1983, J SPEECH HEAR RES, V26, P297
Glass G V., 1984, STAT METHODS ED PSYC
Hollis K.L., 2009, THESIS B YOUNG U
HUGHES OM, 1976, PHONETICA, V33, P199
Jong G.O., 2010, THESIS B YOUNG U
Kent RD, 1999, J COMMUN DISORD, V32, P141, DOI 10.1016/S0021-9924(99)00004-0
LISS JM, 1992, J ACOUST SOC AM, V92, P2984, DOI 10.1121/1.404364
Mathworks, 2009, MATLAB
Mefferd AS, 2010, J SPEECH LANG HEAR R, V53, P1206, DOI 10.1044/1092-4388(2010/09-0083)
Nissen SL, 2007, PHONETICA, V64, P201, DOI [10.1159/000121373, 10.1159/006121373]
OHALA JJ, 1993, LANG SPEECH, V36, P155
PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204
SCHONLE PW, 1987, BRAIN LANG, V31, P26, DOI 10.1016/0093-934X(87)90058-7
SMITH A, 1995, EXP BRAIN RES, V104, P493
SPSS, 2012, SPSS
STEVENS KN, 1989, J PHONETICS, V17, P3
Stevens KN, 2010, J PHONETICS, V38, P10, DOI 10.1016/j.wocn.2008.10.004
Stone M, 1996, J ACOUST SOC AM, V99, P3728, DOI 10.1121/1.414969
Story BH, 2009, J ACOUST SOC AM, V126, P825, DOI 10.1121/1.3158816
Tasko SM, 2002, J SPEECH LANG HEAR R, V45, P127, DOI 10.1044/1092-4388(2002/010)
Tingley S, 2000, J MED SPEECH-LANG PA, V8, P249
Tjaden K, 2004, J SPEECH LANG HEAR R, V47, P766, DOI 10.1044/1092-4388(2004/058)
WEISMER G, 1992, J ACOUST SOC AM, V91, P1085, DOI 10.1121/1.402635
Weismer G, 2003, J ACOUST SOC AM, V113, P3362, DOI 10.1121/1.1572142
Zwirner P, 1997, EUR ARCH OTO-RHINO-L, V254, P391, DOI 10.1007/BF01642557
ZWIRNER P, 1992, J SPEECH HEAR RES, V35, P761
NR 34
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 315
EP 328
DI 10.1016/j.specom.2012.09.001
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900008
ER
PT J
AU Iwata, T
Watanabe, S
AF Iwata, Tomoharu
Watanabe, Shinji
TI Influence relation estimation based on lexical entrainment in
conversation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Conversation analysis; Influence; Latent variable model; Entrainment
ID EM ALGORITHM; SPEECH RECOGNITION; LANGUAGE MODEL; NETWORK
AB In conversations, people tend to mimic their companions' behavior depending on their level of trust. This phenomenon is known as entrainment. We propose a probabilistic model for estimating influences among speakers from conversation data involving multiple people by modeling lexical entrainment. The proposed model estimates word use as a function of the weighted sum of the earlier word use of other speakers. The weights represent influences between speakers. The influences can be efficiently estimated by using the expectation maximization (EM) algorithm. We also develop online inference procedures for sequentially modeling the dynamics of influence relations. Experiments performed on two meeting data sets, one in Japanese and one in English, demonstrate the effectiveness of the proposed method. (C) 2012 Elsevier B.V. All rights reserved.
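The mixture formulation lends itself to a compact illustration: a target speaker's word counts are modeled as draws from a weighted sum of the other participants' earlier word distributions, and the weights are re-estimated with EM. The Python sketch below implements this batch estimation on synthetic data; the priors, smoothing, and online updates described in the paper are omitted.

```python
# EM estimation of influence (mixture) weights over other speakers' word distributions.
import numpy as np

def estimate_influence(target_counts, source_dists, iters=50):
    """target_counts: (T, V) word counts of the target speaker per turn.
    source_dists: (S, T, V) earlier word distributions of the S other speakers."""
    S = source_dists.shape[0]
    alpha = np.full(S, 1.0 / S)                           # influence weights, sum to 1
    for _ in range(iters):
        # E-step: responsibility of each source speaker for each (turn, word).
        mix = np.einsum("s,stv->tv", alpha, source_dists) + 1e-12
        resp = alpha[:, None, None] * source_dists / mix   # (S, T, V)
        # M-step: re-estimate weights from expected token counts.
        expected = (resp * target_counts[None]).sum(axis=(1, 2))
        alpha = expected / expected.sum()
    return alpha

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, T = 30, 40
    src = rng.dirichlet(np.ones(V), size=(3, T))            # 3 other speakers
    true_alpha = np.array([0.7, 0.2, 0.1])
    counts = np.array([rng.multinomial(20, true_alpha @ src[:, t]) for t in range(T)])
    print(estimate_influence(counts, src).round(2))         # roughly recovers [0.7, 0.2, 0.1]
```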
C1 [Iwata, Tomoharu; Watanabe, Shinji] NTT Commun Sci Labs, Kyoto 6190237, Japan.
RP Iwata, T (reprint author), NTT Commun Sci Labs, Kyoto 6190237, Japan.
EM iwata.tomoharu@lab.ntt.co.jp
CR Auyeung A., 2010, P ACM C HYP HYP HT 1, P245
Blei D, 2006, P INT C MACH LEARN, V1, P113
Brennan S. E., 1996, INT S SPOK DIAL, P41
Chang F, 2006, PSYCHOL REV, V113, P234, DOI 10.1037/0033-295X.113.2.234
Chartrand TL, 1999, J PERS SOC PSYCHOL, V76, P893, DOI 10.1037//0022-3514.76.6.893
Coulston R., 2002, P 7 INT C SPOK LANG, V4, P2689
De Mori R., 2011, P 12 ANN C INT SPEEC, P3081
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
DIMBERG U, 1982, PSYCHOPHYSIOLOGY, V19, P643, DOI 10.1111/j.1469-8986.1982.tb02516.x
Fiscus J.G., 2007, MULTIMODAL TECHNOLOG, P373
Gildea D., 1999, P EUROSPEECH, P2167
Giles H., 1991, ACCOMMODATION THEORY
Hori T, 2010, IEEE WORKSH SPOK LAN, P412
Hori T., 2011, IEEE T AUDIO SPEECH
Iwata Tomoharu, 2011, P 12 ANN C INT SPEEC, P3089
Ji Gang, 2004, P HLT NACAC 2004 MAS, P133, DOI 10.3115/1613984.1614018
KUHN R, 1990, IEEE T PATTERN ANAL, V12, P570, DOI 10.1109/34.56193
Mehler A, 2010, ENTROPY-SWITZ, V12, P1440, DOI 10.3390/e12061440
Neal RM, 1998, NATO ADV SCI I D-BEH, V89, P355
Nenkova A., 2008, P 46 ANN M ASS COMP, P169, DOI 10.3115/1557690.1557737
Otsuka K., 2008, P ACM ICMI, P257, DOI 10.1145/1452392.1452446
Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169
Purver M, 2006, COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, P17
Reitter D, 2011, COGNITIVE SCI, V35, P587, DOI 10.1111/j.1551-6709.2010.01165.x
Reitter D., 2006, P HUM LANG TECHN C N, P121, DOI 10.3115/1614049.1614080
Renals S., 2007, P IEEE WORKSH AUT SP, P238
Sato M, 2000, NEURAL COMPUT, V12, P407, DOI 10.1162/089976600300015853
Scissors LE, 2008, CSCW: 2008 ACM CONFERENCE ON COMPUTER SUPPORTED COOPERATIVE WORK, CONFERENCE PROCEEDINGS, P277
Valente F., 2011, P 12 ANN C INT SPEEC, P3077
Vinciarelli A, 2007, IEEE T MULTIMEDIA, V9, P1215, DOI 10.1109/TMM.2007.902882
Watanabe S, 2011, COMPUT SPEECH LANG, V25, P440, DOI 10.1016/j.csl.2010.07.006
NR 31
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 329
EP 339
DI 10.1016/j.specom.2012.08.012
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900009
ER
PT J
AU Jeong, Y
AF Jeong, Yongwon
TI Unified framework for basis-based speaker adaptation based on sample
covariance matrix of variable dimension
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Speaker adaptation; Two-dimensional principal
component analysis; Eigenvoice speaker adaptation
ID HIDDEN MARKOV-MODELS; 2-DIMENSIONAL PCA; RECOGNITION
AB We present a unified framework for basis-based speaker adaptation techniques, which subsumes eigenvoice speaker adaptation using principal component analysis (PCA) and speaker adaptation using two-dimensional PCA (2DPCA). The basic idea is to partition a Gaussian mean vector of a hidden Markov model (HMM) for each state and mixture component into a group of subvectors and stack all the subvectors of a training speaker model into a matrix. The dimension of the matrix varies according to the dimension of the subvector. As a result, the basis vectors derived from the PCA of training model matrices have variable dimension and so does the speaker weight in the adaptation equation. When the amount of adaptation data is small, adaptation using the speaker weight of small dimension with the basis vectors of large dimension can give good performance, whereas when the amount of adaptation data is large, adaptation using the speaker weight of large dimension with the basis vectors of small dimension can give good performance. In the experimental results, when the dimension of basis vectors was chosen between those of the eigenvoice method and the 2DPCA-based method, the model showed the balanced performance between the eigenvoice method and the 2DPCA-based method. (C) 2012 Elsevier B.V. All rights reserved.
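The eigenvoice end of the framework can be sketched briefly: stack each training speaker's Gaussian means into a supervector, derive basis vectors by PCA, and estimate a new speaker's weights by least squares. The Python sketch below is a simplified illustration and does not show the subvector partitioning or the 2DPCA variant.

```python
# Eigenvoice-style basis estimation and least-squares speaker weights (simplified).
import numpy as np

def train_eigenvoices(supervectors: np.ndarray, num_bases: int):
    """supervectors: (num_speakers, D) stacked Gaussian means of training models."""
    mean_voice = supervectors.mean(axis=0)
    _, _, vt = np.linalg.svd(supervectors - mean_voice, full_matrices=False)
    return mean_voice, vt[:num_bases]                 # bases: (num_bases, D)

def adapt(mean_voice, bases, observed_mean):
    """Least-squares speaker weights, then the adapted supervector."""
    w, *_ = np.linalg.lstsq(bases.T, observed_mean - mean_voice, rcond=None)
    return mean_voice + w @ bases, w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    train = rng.standard_normal((100, 240))           # 100 training speakers, D = 240
    mean_voice, bases = train_eigenvoices(train, num_bases=10)
    target = rng.standard_normal(240)                 # crude stand-in for an adaptation estimate
    adapted, w = adapt(mean_voice, bases, target)
    print(adapted.shape, w.shape)                     # (240,) (10,)
```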
C1 Pusan Natl Univ, Sch Elect Engn, Pusan 609735, South Korea.
RP Jeong, Y (reprint author), Pusan Natl Univ, Sch Elect Engn, Pusan 609735, South Korea.
EM jeongy@pusan.ac.kr
CR Chen SC, 2004, PATTERN RECOGN, V37, P1081, DOI 10.1016/j.patcog.2003.09.004
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Gottumukkal R, 2004, PATTERN RECOGN LETT, V25, P429, DOI 10.1016/j.patrec.2003.11.005
Jeong Y, 2010, IEEE SIGNAL PROC LET, V17, P193, DOI 10.1109/LSP.2009.2036696
Jolliffe I. T., 2002, PRINCIPAL COMPONENT, V2nd
Jung HY, 2002, ETRI J, V24, P469, DOI 10.4218/etrij.02.0202.0003
Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
LIM YJ, 1995, INT CONF ACOUST SPEE, P89
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
Shankar S., 2008, P IEEE C COMP VIS PA, P1, DOI 10.1145/1544012.1544028
Wang LW, 2005, PATTERN RECOGN LETT, V26, P57, DOI 10.1016/j.patrec.2004.08.016
Yang J, 2004, IEEE T PATTERN ANAL, V26, P131
NR 13
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 340
EP 346
DI 10.1016/j.specom.2012.09.002
PG 7
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900010
ER
PT J
AU Nose, T
Kobayashi, T
AF Nose, Takashi
Kobayashi, Takao
TI An intuitive style control technique in HMM-based expressive speech
synthesis using subjective style intensity and multiple-regression
global variance model
SO SPEECH COMMUNICATION
LA English
DT Article
DE HMM-based expressive speech synthesis; Multiple-regression HSMM; Style
control; Style intensity; Multiple-regression global variance model
ID HIDDEN MARKOV MODEL; SYNTHESIS SYSTEM; RECOGNITION; EMOTION
AB To control intuitively the intensities of emotional expressions and speaking styles for synthetic speech, we introduce subjective style intensities and multiple-regression global variance (MRGV) models into hidden Markov model (HMM)-based expressive speech synthesis. A problem in the conventional parametric style modeling and style control techniques is that the intensities of styles appearing in synthetic speech strongly depend on the training data. To alleviate this problem, the proposed technique explicitly takes into account subjective style intensities perceived for respective training utterances using multiple-regression hidden semi-Markov models (MRHSMMs). As a result, synthetic speech becomes less sensitive to the variation of style expressivity existing in the training data. Another problem is that the synthetic speech generally suffers from the over-smoothing effect of model parameters in the model training, so the variance of the generated speech parameter trajectory becomes smaller than that of the natural speech. To alleviate this problem for the case of style control, we extend the conventional variance compensation method based on a GV model for a single-style speech to the case of multiple styles with variable style intensities by deriving the MRGV modeling. The objective and subjective experimental results show that these two techniques significantly enhance the intuitive style control of synthetic speech, which is essential for the speech synthesis system to communicate para-linguistic information correctly to the listeners. (C) 2012 Elsevier B.V. All rights reserved.
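The multiple-regression mean underlying the style control can be illustrated in a few lines of Python: each Gaussian mean is an affine function of a style vector, so scaling a style dimension by the desired intensity moves the generated parameters along that style's direction. The regression matrix and dimensions below are arbitrary placeholders; HSMM training, the subjective-intensity weighting, and MRGV compensation are not shown.

```python
# Multiple-regression mean: mu = H @ [1; style], so style intensity scales the shift.
import numpy as np

def regression_mean(H: np.ndarray, style: np.ndarray) -> np.ndarray:
    """H: (D, S+1) regression matrix; style: (S,) style/intensity vector."""
    return H @ np.concatenate(([1.0], style))          # bias column + style columns

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    H = rng.standard_normal((25, 3))                   # D = 25 dims, bias + 2 styles
    neutral = regression_mean(H, np.array([0.0, 0.0]))
    mild = regression_mean(H, np.array([0.5, 0.0]))    # first style at half intensity
    strong = regression_mean(H, np.array([1.5, 0.0]))  # pushed past the training range
    print(np.linalg.norm(mild - neutral) < np.linalg.norm(strong - neutral))  # True
```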
C1 [Nose, Takashi; Kobayashi, Takao] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan.
RP Nose, T (reprint author), Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan.
EM takashi.nose@ip.titech.ac.jp; takao.kobayashi@ip.titech.ac.jp
FU JSPS [23700195, 24300071]
FX Part of this work was supported by JSPS Grant-in-Aid for Scientific
Research 23700195 and 24300071.
CR Anastasakos T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1137
Pitrelli JF, 2006, IEEE T AUDIO SPEECH, V14, P1099, DOI 10.1109/TASL.2006.876123
Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7
Erickson D., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.317
FUJINAGA K, 2001, ACOUST SPEECH SIG PR, P513
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Gales MJF, 2000, IEEE T SPEECH AUDI P, V8, P417, DOI 10.1109/89.848223
Iida A, 2003, SPEECH COMMUN, V40, P161, DOI 10.1016/S0167-6393(02)00081-X
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Koriyama T, 2011, P INTERSPEECH, P2657
KUREMATSU A, 1990, SPEECH COMMUN, V9, P357, DOI 10.1016/0167-6393(90)90011-W
Miyanaga K., 2004, P INTERSPEECH 2004 I, P1437
Niwase N, 2005, IEICE T INF SYST, VE88D, P2492, DOI 10.1093/ietisy/e88-d.11.2492
Nose T, 2009, IEICE T INF SYST, VE92D, P489, DOI 10.1587/transinf.E92.D.489
Nose T, 2007, IEICE T INF SYST, VE90D, P1406, DOI 10.1093/ietisy/e90-d.9.1406
Scherer K. R., 1977, MOTIV EMOTION, V1, P331, DOI 10.1007/BF00992539
Schroder M., 2001, P EUROSPEECH 2001 SE, P561
SCHULLER B, 2003, ACOUST SPEECH SIG PR, P1
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
Tachibana M, 2005, IEICE T INF SYST, VE88D, P2484, DOI 10.1093/ietisy/e88-d.11.2484
Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816
Tsuzuki R, 2004, P INTERSPEECH 2004 I, P1185
Yamagishi J., 2008, HTS 2008 SYSTEM YET
Yamagishi J, 2003, IEICE T FUND ELECTR, VE86A, P1956
Yamagishi J, 2003, IEICE T INF SYST, VE86D, P534
Yamagishi J., 2003, P INTERSPEECH 2003 E, P2461
Yoshimura T, 1999, P EUR, P2347
Yu K, 2011, SPEECH COMMUN, V53, P914, DOI 10.1016/j.specom.2011.03.003
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
NR 29
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 347
EP 357
DI 10.1016/j.specom.2012.09.003
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900011
ER
PT J
AU Yong, PC
Nordholm, S
Dam, HH
AF Yong, Pei Chee
Nordholm, Sven
Dam, Hai Huyen
TI Optimization and evaluation of sigmoid function with a priori SNR
estimate for real-time speech enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; SNR estimation; Decision-directed approach; Sigmoid
function; Objective evaluation
ID SPECTRAL AMPLITUDE ESTIMATOR; NOISE
AB In this paper, an a priori signal-to-noise ratio (SNR) estimator with a modified sigmoid gain function is proposed for real-time speech enhancement. The proposed sigmoid gain function has three parameters, which can be optimized such that they match conventional gain functions. In addition, the joint temporal dynamics between the SNR estimate and the spectral gain function is investigated to improve the performance of the speech enhancement scheme. As the widely-used decision-directed (DD) a priori SNR estimate has a well-known one-frame delay that leads to the degradation of speech quality, a modified a priori SNR estimator is proposed for the DD approach to overcome this delay. Evaluations are performed by utilizing the objective evaluation metric that measures the trade-off between the noise reduction, the speech distortion and the musical noise in the enhanced signal. The results are compared using the PESQ and the SNRseg measures as well as subjective listening tests. Simulation results show that the proposed gain function, which can flexibly model exponential distributions, is a potential alternative speech enhancement gain function. (C) 2012 Elsevier B.V. All rights reserved.
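The two ingredients combined in the abstract, a decision-directed a priori SNR estimate and a parametric sigmoid gain, can be sketched as follows in Python; the sigmoid parameters and the smoothing factor are illustrative and are not the optimized values reported in the paper.

```python
# Decision-directed a priori SNR estimate feeding a three-parameter sigmoid gain.
import numpy as np

def sigmoid_gain(xi_db, slope=0.3, midpoint=0.0, floor=0.05):
    """Gain in (floor, 1] as a function of a priori SNR in dB."""
    return floor + (1.0 - floor) / (1.0 + np.exp(-slope * (xi_db - midpoint)))

def enhance(noisy_power, noise_power, alpha=0.98):
    """Frame-by-frame DD estimate; rows are frames, columns frequency bins."""
    prev_clean = np.zeros(noisy_power.shape[1])
    gains = np.empty_like(noisy_power)
    for t, frame in enumerate(noisy_power):
        gamma = frame / noise_power                                  # a posteriori SNR
        xi = alpha * prev_clean / noise_power + (1 - alpha) * np.maximum(gamma - 1.0, 0.0)
        g = sigmoid_gain(10.0 * np.log10(xi + 1e-12))
        gains[t] = g
        prev_clean = (g ** 2) * frame                                # clean-speech power estimate
    return gains

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    noise_power = np.full(129, 0.1)
    noisy_power = noise_power * rng.chisquare(2, size=(50, 129)) / 2 + 0.5   # toy spectra
    print(enhance(noisy_power, noise_power).shape)                   # (50, 129)
```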
C1 [Yong, Pei Chee; Nordholm, Sven; Dam, Hai Huyen] Curtin Univ Technol, Bentley, WA 6102, Australia.
RP Yong, PC (reprint author), 1-40 Marquis St, Bentley, WA 6102, Australia.
EM peichee.yong@postgrad.curtin.edu.au
CR Alam M. J., 2009, J ELECT ELECT ENG, V9, P809
Andrianakis I, 2009, SPEECH COMMUN, V51, P1, DOI 10.1016/j.specom.2008.05.018
Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Breithaupt C, 2011, IEEE T AUDIO SPEECH, V19, P277, DOI 10.1109/TASL.2010.2047681
Breithaupt C, 2008, INT CONF ACOUST SPEE, P4037, DOI 10.1109/ICASSP.2008.4518540
Breithaupt C, 2008, INT CONF ACOUST SPEE, P4897, DOI 10.1109/ICASSP.2008.4518755
Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283
Chang JH, 2006, IEEE T SIGNAL PROCES, V54, P1965, DOI 10.1109/TSP.2006.874403
Cohen I, 2004, IEEE SIGNAL PROC LET, V11, P725, DOI [10.1109/LSP.2004.833478, 10.1109/LSP.2004.833278]
Davis A., 2006, P 14 EUR SIGN PROC C
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Gustafsson S, 2002, IEEE T SPEECH AUDI P, V10, P245, DOI 10.1109/TSA.2002.800553
Hansen J. H. L., 1998, P INT C SPOK LANG PR, V7, P2819
Hendriks RC, 2010, INT CONF ACOUST SPEE, P4266, DOI 10.1109/ICASSP.2010.5495680
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004
Paliwal K, 2012, SPEECH COMMUN, V54, P282, DOI 10.1016/j.specom.2011.09.003
Park YS, 2007, IEICE T COMMUN, VE90B, P2182, DOI 10.1093/ietcom/e90-b.8.2182
Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621
Plourde E, 2009, IEEE SIGNAL PROC LET, V16, P485, DOI 10.1109/LSP.2009.2018225
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
RIX AW, 2001, ACOUST SPEECH SIG PR, P749
Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199
Suhadi S, 2011, IEEE T AUDIO SPEECH, V19, P186, DOI 10.1109/TASL.2010.2045799
Uemura Y., 2008, P INT WORKSH AC ECH
Yong P.C., 2011, P 19 EUR SIGN PROC C, P211
NR 29
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 358
EP 376
DI 10.1016/j.specom.2012.09.004
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900012
ER
PT J
AU Oonishi, T
Iwano, K
Furui, S
AF Oonishi, Tasuku
Iwano, Koji
Furui, Sadaoki
TI A noise-robust speech recognition approach incorporating normalized
speech/non-speech likelihood into hypothesis scores
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Noise robustness; Gaussian mixture model adaptation
ID GAUSSIAN MIXTURE-MODELS
AB In noisy environments, speech recognition decoders often incorrectly produce speech hypotheses for non-speech periods, and non-speech hypotheses, such as silence or a short pause, for speech periods. It is crucial to reduce such errors to improve the performance of speech recognition systems. This paper proposes an approach using normalized speech/non-speech likelihoods calculated using adaptive speech and non-speech GMMs to weight the scores of recognition hypotheses produced by the decoder. To achieve good decoding performance, the GMMs are adapted to the variations of acoustic characteristics of input utterances and environmental noise, using either of the two modern on-line unsupervised adaptation methods, switching Kalman filter (SKF) or maximum a posteriori (MAP) estimation. Experimental results on real-world in-car speech, the Drivers' Japanese Speech Corpus in a Car Environment (DJSC), and the AURORA-2 database show that the proposed method significantly improves recognition accuracy compared to a conventional approach using front-end voice activity detection (VAD). Results also confirm that our method significantly improves recognition accuracy under various noise and task conditions. (C) 2012 Elsevier B.V. All rights reserved.
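The scoring idea can be illustrated with a small Python sketch: a normalized speech/non-speech log-likelihood ratio computed from two diagonal-covariance GMMs is added, with a weight, to a hypothesis score. GMM adaptation (SKF or MAP) and the decoder integration are not shown, and the weight value is an assumption.

```python
# Normalized speech/non-speech log-likelihood ratio added to a hypothesis score.
import numpy as np

def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of frames x (T, D) under a diagonal-covariance GMM."""
    diff = x[:, None, :] - means[None]                               # (T, M, D)
    log_comp = (np.log(weights)
                - 0.5 * (np.log(2 * np.pi * variances).sum(axis=1)
                         + (diff ** 2 / variances[None]).sum(axis=2)))
    m = log_comp.max(axis=1, keepdims=True)                          # log-sum-exp
    return m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1))

def reweighted_score(base_score, frames, speech_gmm, nonspeech_gmm, weight=0.2):
    llr = gmm_loglik(frames, *speech_gmm) - gmm_loglik(frames, *nonspeech_gmm)
    return base_score + weight * llr.mean()                          # normalized per frame

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    D, M = 13, 4
    make = lambda shift: (np.full(M, 1.0 / M),
                          rng.standard_normal((M, D)) + shift,
                          np.ones((M, D)))
    speech, nonspeech = make(1.0), make(-1.0)
    frames = rng.standard_normal((30, D)) + 1.0                      # "speech-like" frames
    print(reweighted_score(-5000.0, frames, speech, nonspeech))
```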
C1 [Oonishi, Tasuku; Furui, Sadaoki] Tokyo Inst Technol, Meguro Ku, Tokyo 1528552, Japan.
[Iwano, Koji] Tokyo City Univ, Tsuzuki Ku, Yokohama, Kanagawa 2248551, Japan.
RP Oonishi, T (reprint author), Tokyo Inst Technol, Meguro Ku, 2-12-1 Ookayama, Tokyo 1528552, Japan.
EM oonishi@furui.cs.titech.ac.jp; iwano@tcu.ac.jp; furui@cs.titech.ac.jp
CR Dixon P. R., 2007, P IEEE ASRU, P443
ETSI, 2002, 202050 ETSI ES
Fujimoto M, 2007, INT CONF ACOUST SPEE, P797
Fujimoto M., 2007, P INT 07 AUG, P2933
Gales M.J.F, 1995, THESIS CAMBRIDGE U
Hiraki K., 2008, IEICE TECHNICAL REPO, P93
Itou K., 1999, Journal of the Acoustical Society of Japan (E), V20
Iwano Koji, 2002, P ICSLP, P941
Metze F., 2002, P INT C SPOK LANG PR, P2133
Oonishi Tasuku, 2010, P INTERSPEECH, P3122
Pearce D., 2000, P ICSLP, V4, P29
Ramirez J., 2007, ROBUST SPEECH RECOGN, P1
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Satoshi Tamura, 2004, P INT C AC SPEECH SI, V1, P857
Shiell D., 2009, VISUAL SPEECH RECOGN, P1
SINGH R, 2001, ACOUST SPEECH SIG PR, P273
Zhang YX, 2008, PATTERN RECOGN LETT, V29, P735, DOI 10.1016/j.patrec.2007.12.006
NR 17
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2013
VL 55
IS 2
BP 377
EP 386
DI 10.1016/j.specom.2012.10.001
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 056SO
UT WOS:000312511900013
ER
PT J
AU Black, MP
Katsamanis, A
Baucom, BR
Lee, CC
Lammert, AC
Christensen, A
Georgiou, PG
Narayanan, SS
AF Black, Matthew P.
Katsamanis, Athanasios
Baucom, Brian R.
Lee, Chi-Chun
Lammert, Adam C.
Christensen, Andrew
Georgiou, Panayiotis G.
Narayanan, Shrikanth S.
TI Toward automating a human behavioral coding system for married couples'
interactions using speech acoustic features
SO SPEECH COMMUNICATION
LA English
DT Article
DE Behavioral signal processing (BSP); Couple therapy; Dyadic interaction;
Human behavior analysis; Prosody; Emotion recognition
ID RANDOMIZED CLINICAL-TRIAL; FUNDAMENTAL-FREQUENCY; SPEAKER DIARIZATION;
NONVERBAL BEHAVIOR; THERAPY; SATISFACTION; EMOTIONS; DISTRESS;
COMMUNICATION; RECOGNITION
AB Observational methods are fundamental to the study of human behavior in the behavioral sciences. For example, in the context of research on intimate relationships, psychologists' hypotheses are often empirically tested by video recording interactions of couples and manually coding relevant behaviors using standardized coding systems. This coding process can be time-consuming, and the resulting coded data may have a high degree of variability because of a number of factors (e.g., inter-evaluator differences). These challenges provide an opportunity to employ engineering methods to aid in automatically coding human behavioral data. In this work, we analyzed a large corpus of married couples' problem-solving interactions. Each spouse was manually coded with multiple session-level behavioral observations (e.g., level of blame toward other spouse), and we used acoustic speech features to automatically classify extreme instances for six selected codes (e.g., "low" vs. "high" blame). Specifically, we extracted prosodic, spectral, and voice quality features to capture global acoustic properties for each spouse and trained gender-specific and gender-independent classifiers. The best overall automatic system correctly classified 74.1% of the instances, an improvement of 3.95% absolute (5.63% relative) over our previously reported best results. We compare performance for the various factors: across codes, gender, classifier type, and feature type. (C) 2011 Elsevier B.V. All rights reserved.
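As a rough illustration of the gender-specific versus gender-independent classifier comparison described above (the features, labels, and logistic-regression classifier below are synthetic placeholders, not the authors' feature set or classifier configuration):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(1)
    X = rng.normal(size=(200, 30))            # placeholder session-level acoustic features
    y = rng.integers(0, 2, size=200)          # e.g., "low" vs. "high" blame code
    gender = rng.integers(0, 2, size=200)     # 0 = wife, 1 = husband (illustrative split)

    clf = LogisticRegression(max_iter=1000)
    acc_gi = cross_val_score(clf, X, y, cv=5).mean()          # gender-independent classifier

    accs, ns = [], []
    for g in (0, 1):                                          # gender-specific classifiers
        idx = gender == g
        accs.append(cross_val_score(clf, X[idx], y[idx], cv=5).mean())
        ns.append(idx.sum())
    acc_gs = np.average(accs, weights=ns)
    print(f"gender-independent: {acc_gi:.3f}  gender-specific: {acc_gs:.3f}")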
C1 [Black, Matthew P.; Katsamanis, Athanasios; Lee, Chi-Chun; Lammert, Adam C.; Georgiou, Panayiotis G.; Narayanan, Shrikanth S.] Univ So Calif, Signal Anal & Interpretat Lab, Los Angeles, CA 90089 USA.
[Baucom, Brian R.; Narayanan, Shrikanth S.] Univ So Calif, Dept Psychol, Los Angeles, CA 90089 USA.
[Christensen, Andrew] Univ Calif Los Angeles, Dept Psychol, Los Angeles, CA 90095 USA.
RP Black, MP (reprint author), Univ So Calif, Signal Anal & Interpretat Lab, 3710 McClintock Ave, Los Angeles, CA 90089 USA.
EM matthepb@usc.edu
FU National Science Foundation; Viterbi Research Innovation Fund
FX This research was supported in part by the National Science Foundation
and the Viterbi Research Innovation Fund. Special thanks to the Couple
Therapy research staff for collecting, transcribing, and coding the
data.
CR Atkins D., 2005, ANN M ASS BEH COGN T
Batliner A, 2011, COGN TECHNOL, P71, DOI 10.1007/978-3-642-15184-2_6
Baucom B, 2007, J SOC CLIN PSYCHOL, V26, P689, DOI 10.1521/jscp.2007.26.6.689
Baucom BR, 2009, J CONSULT CLIN PSYCH, V77, P160, DOI 10.1037/a0014405
Baucom D., 2004, COUPLE OBSERVATIONAL, P159
Baucom DH, 1998, J CONSULT CLIN PSYCH, V66, P53, DOI 10.1037//0022-006X.66.1.53
Beck JG, 2006, BEHAV RES THER, V44, P737, DOI 10.1016/j.brat.2005.05.004
Black M., 2011, P INTERSPEECH
BLACK M, 2010, P INT CHIB JAP, P2030
Boersma P., 2001, GLOT INT, V5, P341
Brune M, 2008, J NERV MENT DIS, V196, P282, DOI 10.1097/NMD.0b013e31816a4922
Bulut M, 2008, J ACOUST SOC AM, V123, P4547, DOI 10.1121/1.2909562
Burkhardt F, 2009, INT CONF ACOUST SPEE, P4761, DOI 10.1109/ICASSP.2009.4960695
Busso C, 2009, IEEE T AUDIO SPEECH, V17, P582, DOI 10.1109/TASL.2008.2009578
Busso C., 2009, ROLE PROSODY AFFECTI, P309
Campbell N., 2000, ISCA TUT RES WORKSH
Chen K, 2004, P ICASSP, V1, P509
Christensen A, 2006, J CONSULT CLIN PSYCH, V74, P1180, DOI 10.1037/0022-006X.74.6.1180
Christensen A, 2004, J CONSULT CLIN PSYCH, V72, P176, DOI 10.1037/0022-006X.72.2.176
Christensen A., 1995, CLIN HDB COUPLE THER, P31
Christensen A., 1990, GENDER ISSUES CONT S, V6, P113
Christensen A, 2010, J CONSULT CLIN PSYCH, V78, P225, DOI 10.1037/a0018132
Cowie R, 2009, PHILOS T R SOC B, V364, P3515, DOI 10.1098/rstb.2009.0139
Coy A, 2007, SPEECH COMMUN, V49, P384, DOI 10.1016/j.specom.2006.11.002
de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024
Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007
Devillers L, 2011, COMPUT SPEECH LANG, V25, P1, DOI 10.1016/j.csl.2010.07.002
Douglas-Cowie E, 2007, LECT NOTES COMPUT SC, V4738, P488
Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5
Eyben F., 2010, ACM MULTIMEDIA, P1459
Fan RE, 2008, J MACH LEARN RES, V9, P1871
Fredman SJ, 2008, J FAM PSYCHOL, V22, P71, DOI 10.1037/0893-3200.22.1.71
Georgiou P.G., 2011, AFFECTIVE COMPUTING
Ghosh P.K., 2010, IEEE T AUDIO SPEECH, V19, P600
Gibson J., 2011, P INTERSPEECH
Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1
Gonzaga GC, 2007, J PERS SOC PSYCHOL, V93, P34, DOI 10.1037/0022-3514.93.1.34
GOTTMAN JM, 1989, J CONSULT CLIN PSYCH, V57, P47, DOI 10.1037/0022-006X.57.1.47
GOTTMAN J, 1977, J MARRIAGE FAM, V39, P461, DOI 10.2307/350902
Grimm M, 2007, SPEECH COMMUN, V49, P787, DOI 10.1016/j.specom.2007.01.010
Han KJ, 2008, IEEE T AUDIO SPEECH, V16, P1590, DOI 10.1109/TASL.2008.2002085
Heavey C., 2002, COUPLES INTERACTION
Heyman RE, 2001, PSYCHOL ASSESSMENT, V13, P5, DOI 10.1037//1040-3590.13.1.5
Hops H., 1971, TECHNICAL REPORT
Joachims T., 1998, EUR C MACH LEARN CHE, V1398, P137
Jones J., 1998, COUPLES INTERACTION
Jurafsky D., 2009, HUMAN LANGUAGE TECHN, P638
Juslin P. N., 2005, NEW HDB METHODS NONV, P65
KARNEY BR, 1995, PSYCHOL BULL, V118, P3, DOI 10.1037/0033-2909.118.1.3
Katsamanis A., 2011, AFFECTIVE COMPUTING
Katsamanis A., 2011, VER LARG SCAL PHON W
Keen D, 2005, RES DEV DISABIL, V26, P243, DOI 10.1016/j.ridd.2004.07.002
Kerig P. K., 2004, COUPLE OBSERVATIONAL
Lee C., 2011, P INTERSPEECH
Lee C, 2009, P INT, P320
Lee CC, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P793
Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534
Margolin G, 1998, Clin Child Fam Psychol Rev, V1, P195, DOI 10.1023/A:1022608117322
Margolin G, 2004, DEV PSYCHOPATHOL, V16, P753, DOI 10.1017/S0954579404004766
McNemar Q, 1947, PSYCHOMETRIKA, V12, P153, DOI 10.1007/BF02295996
Mendenhall W., 2007, STAT ENG SCI, P302
Moreno P.J., 1998, P ICSLP SYDN AUSTR
Murray K., 2001, SIGDIAL WORKSH DISC
O'Brien M, 1994, Violence Vict, V9, P45
Ranganath R., 2009, C EMP METH NAT LANG, P334
Rozgic V., 2010, P INTERSPEECH
Rozgic V, 2011, INT CONF ACOUST SPEE, P2368
Schuller B., 2009, P INT BRIGHT UK, V2009, P312
Schuller B., 2007, P INT, P2253
SCHULLER B., 2010, P INTERSPEECH 2010 M, P2794
Schuller B., 2009, ROLE PROSODY AFFECTI, P285
Schuller B, 2008, INT CONF ACOUST SPEE, P4501, DOI 10.1109/ICASSP.2008.4518656
Sevier M, 2008, BEHAV THER, V39, P137, DOI 10.1016/j.beth.2007.06.001
Shoham V, 1998, J FAM PSYCHOL, V12, P557
Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256
Traunmuller H., 1994, TECHNICAL REPORT
Vinciarelli A, 2009, IMAGE VISION COMPUT, V27, P1743, DOI 10.1016/j.imavis.2008.11.007
Williams-Baucom KJ, 2010, PERS RELATIONSHIP, V17, P41
Yildirim S., 2010, COMPUT SPEECH LANG, V25, P29
NR 79
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 1
EP 21
DI 10.1016/j.specom.2011.12.003
PG 21
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900001
ER
PT J
AU Hofe, R
Ell, SR
Fagan, MJ
Gilbert, JM
Green, PD
Moore, RK
Rybchenko, SI
AF Hofe, Robin
Ell, Stephen R.
Fagan, Michael J.
Gilbert, James M.
Green, Phil D.
Moore, Roger K.
Rybchenko, Sergey I.
TI Small-vocabulary speech recognition using a silent speech interface
based on magnetic sensing
SO SPEECH COMMUNICATION
LA English
DT Article
DE Silent speech interfaces; Clinical speech technology; Articulography;
Multi-modal speech recognition; Speech articulation
AB This paper reports on word recognition experiments using a silent speech interface based on magnetic sensing of articulator movements. A magnetic field was generated by permanent magnet pellets fixed to relevant speech articulators. Magnetic field sensors mounted on a wearable frame measured the fluctuations of the magnetic field during speech articulation. These sensor data were used in place of conventional acoustic features for the training of hidden Markov models. Both small vocabulary isolated word recognition and connected digit recognition experiments are presented. Their results demonstrate the ability of the system to capture phonetic detail at a level that is surprising for a device without any direct access to voicing information. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Hofe, Robin; Green, Phil D.; Moore, Roger K.] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England.
[Ell, Stephen R.] Hull & E Yorkshire Hosp Trust, Castle Hill Hosp, Cottingham HU16 5JQ, England.
[Fagan, Michael J.; Gilbert, James M.; Rybchenko, Sergey I.] Univ Hull, Dept Engn, Kingston Upon Hull HU6 7RX, Yorks, England.
RP Hofe, R (reprint author), Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England.
EM r.hofe@sheffield.ac.uk
FU Henry Smith Charity; Action Medical Research [AP 1164]
FX The MVOCA was developed under the REdRESS project, which has been funded
by a generous grant from The Henry Smith Charity and Action Medical
Research (Grant Number AP 1164).
CR BORDEN GJ, 1979, BRAIN LANG, V7, P307, DOI 10.1016/0093-934X(79)90025-7
Brumberg J.S., 2010, SPEECH COMMUNICATION, V52
Denby B., 2006, IEEE INT C AC SPEECH
Denby B, 2010, SPEECH COMMUN, V52, P270, DOI 10.1016/j.specom.2009.08.002
ETSI, 2000, 201108V111 ETSI
Fagan MJ, 2008, MED ENG PHYS, V30, P419, DOI 10.1016/j.medengphy.2007.05.003
Gilbert JM, 2010, MED ENG PHYS, V32, P1189, DOI 10.1016/j.medengphy.2010.08.011
Gillick L., 1989, P IEEE C AC SPEECH S
Hofe R., 2011, P INT 2011 FLOR IT
Hofe R., 2010, P INT 2010 MAK JAP
Kroos C., 2008, 8 INT SEM SPEECH PRO
LEONARD R., 1984, IEEE INT C AC SPEECH
Levinson SE, 2005, MATHEMATICAL MODELS FOR SPEECH TECHNOLOGY, P1, DOI 10.1002/0470020911
Maier-Hein L., 2005, P AUT SPEECH REC UND
Petajan E., 1988, CHI 88 P SIGCHI C HU
Qin C., 2008, P INT 2008 BRISB AUS
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
SCHONLE PW, 1983, BIOMED TECH, V28, P263, DOI 10.1515/bmte.1983.28.11.263
Schultz T, 2010, SPEECH COMMUN, V52, P341, DOI 10.1016/j.specom.2009.12.002
Wand M., 2011, INT C BIOINSP SYST S
Young S., 2009, HTK BOOK HTK VERSION
NR 21
TC 1
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 22
EP 32
DI 10.1016/j.specom.2012.02.001
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900002
ER
PT J
AU Boersma, P
Chladkova, K
AF Boersma, Paul
Chladkova, Katerina
TI Detecting categorical perception in continuous discrimination data
SO SPEECH COMMUNICATION
LA English
DT Article
DE Categorical perception; Dense sampling; Discrimination; Maximum
likelihood
ID VOWEL PERCEPTION; SPEECH; IDENTIFICATION; MEMORY
AB We present a method for assessing categorical perception from continuous discrimination data. Until recently, categorical perception of speech has exclusively been measured by discrimination and identification experiments with a small number of different stimuli, each of which is presented multiple times. Experiments by Rogers and Davis (2009), however, suggest that using non-repeating stimuli yields a more reliable measure of categorization. If this idea is applied to a single phonetic continuum, the continuum has to be densely sampled and the obtained discrimination data is nearly continuous. In the present study, we describe a maximum-likelihood method that is appropriate for analysing such continuous discrimination data. (C) 2012 Elsevier B.V. All rights reserved.
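A hedged sketch of the maximum-likelihood idea described above, under a deliberately simplified response model: the probability of a "different" response takes one value when the category boundary lies between the two stimuli and another value otherwise, and a likelihood-ratio test compares the fit against a boundary-free null model. The model form, starting values, and simulated data are assumptions for illustration, not the paper's exact formulation:

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import chi2

    rng = np.random.default_rng(2)
    # Non-repeating stimulus pairs along one densely sampled continuum (arbitrary units).
    x1 = rng.uniform(0.0, 1.0, 400)
    x2 = x1 + 0.15                               # fixed step between pair members
    true_boundary = 0.5
    crosses = (x1 < true_boundary) & (x2 > true_boundary)
    p_true = np.where(crosses, 0.85, 0.35)       # higher "different" rate across the boundary
    resp = rng.random(400) < p_true              # simulated "different" responses

    def neg_loglik(params):
        b, p_within, p_across = params
        p = np.where((x1 < b) & (x2 > b), p_across, p_within)
        p = np.clip(p, 1e-6, 1 - 1e-6)
        return -np.sum(resp * np.log(p) + (~resp) * np.log1p(-p))

    fit = minimize(neg_loglik, x0=[0.4, 0.5, 0.5], method="Nelder-Mead")
    b_hat = fit.x[0]

    # Likelihood-ratio test against a single-probability (no-boundary) null model.
    p0 = resp.mean()
    ll_null = np.sum(resp * np.log(p0) + (~resp) * np.log1p(-p0))
    lr = 2 * (-fit.fun - ll_null)
    print(f"boundary ~ {b_hat:.2f}, LR = {lr:.1f}, p = {chi2.sf(lr, df=2):.2g}")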
C1 [Boersma, Paul; Chladkova, Katerina] Univ Amsterdam, Amsterdam Ctr Language & Commun, NL-1012 WX Amsterdam, Netherlands.
RP Boersma, P (reprint author), Univ Amsterdam, Amsterdam Ctr Language & Commun, NL-1012 WX Amsterdam, Netherlands.
EM paul.boersma@uva.nl; k.chladkova@uva.nl
CR AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705
BABAUD J, 1986, IEEE T PATTERN ANAL, V8, P26
Best C.T., 1992, SR109110 HASK LAB, P89
Boersma P., 1993, P I PHONETIC SCI, V17, P97
Boersma Paul, 1997, P I PHONETIC SCI U A, V21, P43
EIMAS PD, 1963, LANG SPEECH, V6, P206
Fisher R.A., 1922, PHILOS T R SOC A, V222, P309, DOI DOI 10.1098/RSTA.1922.0009
Flannery B. P., 1992, NUMERICAL RECIPES C
Gerrits E, 2004, PERCEPT PSYCHOPHYS, V66, P363, DOI 10.3758/BF03194885
KEWLEYPORT D, 1995, J ACOUST SOC AM, V97, P3139, DOI 10.1121/1.413106
KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894
LIBERMAN AM, 1957, J EXP PSYCHOL, V54, P358, DOI 10.1037/h0044417
MERMELSTEIN P, 1978, J ACOUST SOC AM, V63, P572, DOI 10.1121/1.381756
NEAREY TM, 1990, J PHONETICS, V18, P347
PISONI DB, 1975, MEM COGNITION, V3, P7, DOI 10.3758/BF03198202
PISONI DB, 1973, PERCEPT PSYCHOPHYS, V13, P253, DOI 10.3758/BF03214136
Pitt MA, 2002, PSYCHOL REV, V109, P472, DOI 10.1037//0033-295X.109.3.472
Polka L, 2003, SPEECH COMMUN, V41, P221, DOI 10.1016/S0167-6393(02)00105-X
Rogers J. C., 2009, P INT, P376
Weenink D., 1992, PRAAT DOING PHONETIC
Wilks SS, 1938, ANN MATH STAT, V9, P60, DOI 10.1214/aoms/1177732360
NR 21
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 33
EP 39
DI 10.1016/j.specom.2012.05.003
PG 7
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900003
ER
PT J
AU Qiu, W
Li, BZ
Li, XW
AF Qiu, Wei
Li, Bing-Zhao
Li, Xue-Wen
TI Speech recovery based on the linear canonical transform
SO SPEECH COMMUNICATION
LA English
DT Article
DE Linear canonical transform; Chirp signal; The AM-FM model; Speech;
Signal reconstruction
ID FRACTIONAL FOURIER-TRANSFORM; UNCERTAINTY PRINCIPLES; ENERGY OPERATORS;
OPTICS; SEPARATION; FREQUENCY; SIGNALS; REPRESENTATIONS; DEMODULATION;
AMPLITUDE
AB Speech signal processing is one of the most active areas of signal processing, and many speech signal models exist, such as the sinusoidal model, the STRAIGHT model, the AM-FM model, and the Gaussian mixture model. This paper investigates the AM-FM speech model using the linear canonical transform (LCT). The LCT can be considered a generalization of the traditional Fourier transform and the fractional Fourier transform, and has proved to be a powerful tool for non-stationary signal processing; this has opened up a range of potentially promising applications based on the LCT. Two novel speech recovery methods based on the AM-FM model are presented in this paper: one relies on LCT-domain filtering; the other restores the speech signal by estimating chirp-signal parameters in the LCT domain. Experimental results are then presented to verify the performance of the proposed methods, and the paper closes with a summary and conclusions. (C) 2012 Published by Elsevier B.V.
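For reference, the LCT mentioned in the abstract is conventionally defined as follows (this is the standard textbook definition, not a formula quoted from the paper): for a parameter matrix (a, b; c, d) with ad - bc = 1 and b != 0,

    L^{(a,b,c,d)}\{f\}(u) \;=\; \sqrt{\frac{1}{j\,2\pi b}} \int_{-\infty}^{\infty}
        f(t)\, \exp\!\left(\frac{j}{2b}\left(a t^{2} - 2 u t + d u^{2}\right)\right) dt ,

which reduces to the ordinary Fourier transform for (a, b, c, d) = (0, 1, -1, 0) and, up to a phase factor, to the fractional Fourier transform for (cos(alpha), sin(alpha), -sin(alpha), cos(alpha)).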
C1 [Qiu, Wei; Li, Bing-Zhao; Li, Xue-Wen] Beijing Inst Technol, Dept Math, Beijing 100081, Peoples R China.
RP Li, BZ (reprint author), Beijing Inst Technol, Dept Math, Beijing 100081, Peoples R China.
EM li_bingzhaobit@bit.edu.cn
RI Li, Bing-Zhao/B-5165-2009
OI Li, Bing-Zhao/0000-0002-3850-4656
FU National Natural Science Foundation of China [60901058, 61171195];
Beijing Natural Science Foundation [1102029]
FX This work was supported by the National Natural Science Foundation of
China (Nos. 60901058 and 61171195), and also supported partially by
Beijing Natural Science Foundation (No. 1102029).
CR ABATZOGLOU TJ, 1986, IEEE T AERO ELEC SYS, V22, P708, DOI 10.1109/TAES.1986.310805
Aizenberg I, 2006, IEEE T SIGNAL PROCES, V54, P4261, DOI 10.1109/TSP.2006.881189
ALMEIDA LB, 1994, IEEE T SIGNAL PROCES, V42, P3084, DOI 10.1109/78.330368
BARGMANN V, 1961, COMMUN PUR APPL MATH, V14, P187, DOI 10.1002/cpa.3160140303
BOVIK AC, 1993, IEEE T SIGNAL PROCES, V41, P3245, DOI 10.1109/78.258071
CHEW KC, 1994, IEEE T SIGNAL PROCES, V42, P1939
COLLINS SA, 1970, J OPT SOC AM, V60, P1168
Dimitriadis D., 2005, IEEE SIGNAL PROCESSI, V12, P425
Fan HY, 2006, OPT LETT, V31, P2622, DOI 10.1364/OL.31.002622
FRIEDLANDER B, 1995, IEEE T SIGNAL PROCES, V43, P917, DOI 10.1109/78.376844
Gianfelici F, 2007, IEEE T AUDIO SPEECH, V15, P823, DOI 10.1109/TASL.2006.889744
Huang NE, 1998, P ROY SOC A-MATH PHY, V454, P903
James DFV, 1996, OPT COMMUN, V126, P207, DOI 10.1016/0030-4018(95)00708-3
Koc A, 2008, IEEE T SIGNAL PROCES, V56, P2383, DOI 10.1109/TSP.2007.912890
KOSTENBAUDER AG, 1990, IEEE J QUANTUM ELECT, V26, P1148, DOI 10.1109/3.108113
Li BZ, 2009, SIGNAL PROCESS, V89, P851, DOI 10.1016/j.sigpro.2008.10.030
Li BZ, 2007, SIGNAL PROCESS, V87, P983, DOI 10.1016/j.sigpro.2006.09.008
Li CP, 2012, SIGNAL PROCESS, V92, P1658, DOI 10.1016/j.sigpro.2011.12.024
MARAGOS P, 1993, IEEE T SIGNAL PROCES, V41, P1532, DOI 10.1109/78.212729
Moshinsky M., 2006, J MATH PHYS, V27, P665
Pei SC, 2002, IEEE T SIGNAL PROCES, V50, P11
Qi L, 2003, SCI CHINA SER E [Science in China, Series E: Technological Sciences], V33, P749
Santhanam B, 2000, IEEE T COMMUN, V48, P473, DOI 10.1109/26.837050
Sharma KK, 2008, IEEE T SIGNAL PROCES, V56, P2677, DOI 10.1109/TSP.2008.917384
Sharma KK, 2006, OPT COMMUN, V265, P454, DOI 10.1016/j.optcom.2006.03.062
Smith III J.O., 1992, P ICMC 87
Stern A, 2008, J OPT SOC AM A, V25, P647, DOI 10.1364/JOSAA.25.000647
Tao R., 2009, FRACTIONAL FOURIER T
Tao R, 2008, IEEE T SIGNAL PROCES, V56, P4199, DOI 10.1109/TSP.2008.925579
Teager H. M., 1989, NATO ADV STUDY I SPE
Xu Tian-Zhou, 2012, MATH PROBL ENG, P19
Yao J, 2002, IEEE T BIO-MED ENG, V49, P1299, DOI 10.1109/TMBE.2002.804590
Yin H., 2008, 6 INT S CHIN SPOK LA
Zhao J, 2009, IEEE T SIGNAL PROCES, V57, P2856, DOI 10.1109/TSP.2009.2020039
NR 34
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 40
EP 50
DI 10.1016/j.specom.2012.06.002
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900004
ER
PT J
AU Arsikere, H
Leung, GKF
Lulich, SM
Alwan, A
AF Arsikere, Harish
Leung, Gary K. F.
Lulich, Steven M.
Alwan, Abeer
TI Automatic estimation of the first three subglottal resonances from
adults' speech signals with application to speaker height estimation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Subglottal resonances; Automatic estimation; Bilingual speakers; Speaker
height
ID AMERICAN ENGLISH VOWELS; BODY-SIZE; NORMALIZATION; RECOGNITION; WEIGHT;
SCALE
AB Recent research has demonstrated the usefulness of subglottal resonances (SGRs) in speaker normalization. However, existing algorithms for estimating SGRs from speech signals have limited applicability: they are effective with isolated vowels only. This paper proposes a novel algorithm for estimating the first three SGRs (Sg1, Sg2 and Sg3) from adults' continuous speech. While Sg1 and Sg2 are estimated based on the phonological distinction they provide between vowel categories, Sg3 is estimated based on its correlation with Sg2. The RMS estimation errors (approximately 30, 60 and 100 Hz for Sg1, Sg2 and Sg3, respectively) are not only comparable to the standard deviations in the measurements, but are also independent of vowel content and language (English and Spanish). Since SGRs correlate with speaker height while remaining roughly constant for a given speaker (unlike vocal tract parameters), the proposed algorithm is applied to the task of height estimation using speech signals. The proposed height estimation method matches state-of-the-art algorithms in performance (mean absolute error = 5.3 cm), but uses much less training data and a much smaller feature set. Our results, with additional analysis of physiological data, suggest the existence of a limit to the accuracy of speech-based height estimation. (C) 2012 Elsevier B.V. All rights reserved.
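A minimal sketch of the height-estimation application described above: a plain linear regression from the three estimated SGRs to body height. The data, the synthetic relation to Sg2, and the use of ordinary least squares are placeholder assumptions; the paper's actual regression details may differ:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(3)
    n = 120
    # Placeholder SGR estimates in Hz (Sg1, Sg2, Sg3) and heights in cm.
    sgr = np.column_stack([rng.normal(600, 40, n),
                           rng.normal(1400, 80, n),
                           rng.normal(2200, 120, n)])
    height_cm = 260.0 - 0.05 * sgr[:, 1] + rng.normal(0, 4, n)   # synthetic inverse relation to Sg2

    model = LinearRegression().fit(sgr[:80], height_cm[:80])      # train on the first 80 speakers
    pred = model.predict(sgr[80:])                                # predict the held-out speakers
    mae = np.mean(np.abs(pred - height_cm[80:]))
    print(f"mean absolute error: {mae:.1f} cm")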
C1 [Arsikere, Harish; Leung, Gary K. F.; Alwan, Abeer] Univ Calif Los Angeles, Dept Elect Engn, Los Angeles, CA 90095 USA.
[Lulich, Steven M.] Washington Univ, Dept Psychol, St Louis, MO 63130 USA.
RP Arsikere, H (reprint author), Univ Calif Los Angeles, Dept Elect Engn, 56-125B Engn 4 Bldg,Box 951594, Los Angeles, CA 90095 USA.
EM hari.arsikere@gmail.com; garyleung@ucla.edu; slulich@wustl.edu;
alwan@ee.ucla.edu
FU NSF [0905381]
FX We would like to thank John R. Morton for his role in recording and
labeling the WashU-UCLA corpora. We are also thankful to Dr. Mitchell S.
Sommers for his valuable suggestions, and to Melissa Erickson for help
with manual measurements. The work was supported in part by NSF Grant
no. 0905381.
CR [Anonymous], 2001, TRANSMISSION PERFORM
Arsikere H, 2011, INT CONF ACOUST SPEE, P4616
Arsikere H., 2010, J ACOUST SOC AM, V128, P2288, DOI [10.1121/1.3508029, DOI 10.1121/1.3508029]
Arsikere H., 2011, J ACOUST SOC AM EXPR, V129, P197
Arsikere H., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6288792
Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464
Boersma P., PRAAT SPEECH PROCESS
Cheyne H.A., 2002, THESIS MIT
Chi X., 2004, J ACOUST SOC AM, V115, P2540
Chi XM, 2007, J ACOUST SOC AM, V122, P1735, DOI 10.1121/1.2756793
CHISTOVICH LA, 1985, J ACOUST SOC AM, V77, P789, DOI 10.1121/1.392049
Csapo T. G., 2009, P INT, P484
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Dogil G., 2011, TONES FEATURES PHONE, P137
Dusan S., 2005, P EUROSPEECH 2005 LI, P1989
Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148
Ganchev T, 2010, LECT NOTES ARTIF INT, V6040, P81
Ganchev T., 2010, P 2010 EUR SIGN PROC, P800
Garofolo J., 1988, GETTING STARTED DARP
Gonzalez J, 2004, J PHONETICS, V32, P277, DOI 10.1016/S0095-4470(03)00049-4
Graczi T.E., 2011, P INT
HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872
Honda K, 2010, J PHONETICS, V38, P33, DOI 10.1016/j.wocn.2008.11.002
Hrdlicka A., 1925, OLD AM
Jung Y., 2009, THESIS MIT HARVARD
KUNZEL HJ, 1989, PHONETICA, V46, P117
Lulich S. M., 2010, J ACOUST SOC AM, V128
Lulich SM, 2011, J ACOUST SOC AM, V130, P2108, DOI 10.1121/1.3632091
Lulich SM, 2010, J PHONETICS, V38, P20, DOI 10.1016/j.wocn.2008.10.006
Lulich Steven M., 2006, THESIS MIT
Madsack A, 2008, P LABPHON, V11, P91
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
Nelson D, 1997, INT CONF ACOUST SPEE, P1643, DOI 10.1109/ICASSP.1997.598822
Pellom B. L., 1997, 40 MIDW S CIRC SYST, P873
PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875
Rendall D, 2005, J ACOUST SOC AM, V117, P944, DOI 10.1121/1.1848011
Sjolander K., 1997, SNACK SOUND TOOLKIT
Sonderegger M., 2004, THESIS MIT
Stevens K.N., 1998, ACOUSTIC PHONETICS
SYRDAL AK, 1986, J ACOUST SOC AM, V79, P1086, DOI 10.1121/1.393381
TRAUNMULLER H, 1990, J ACOUST SOC AM, V88, P97
Umesh S, 1999, IEEE T SPEECH AUDI P, V7, P40, DOI 10.1109/89.736329
vanDommelen WA, 1995, LANG SPEECH, V38, P267
Wang S., 2009, P INT, P1619
Wang SZ, 2008, INT CONF ACOUST SPEE, P4277
Wang SZ, 2009, J ACOUST SOC AM, V126, P3268, DOI 10.1121/1.3257185
Wang SZ, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1717
NR 47
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 51
EP 70
DI 10.1016/j.specom.2012.06.004
PG 20
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900005
ER
PT J
AU Ming, Y
Ruan, QQ
Gao, GD
AF Ming, Yue
Ruan, Qiuqi
Gao, Guodong
TI A Mandarin edutainment system integrated virtual learning environments
SO SPEECH COMMUNICATION
LA English
DT Article
DE Mandarin learning; Pronunciation evaluation; Virtual Reality (VR);
Edutainment; Virtual learning environment (VLE); 3D face recognition
ID REALITY; GAMES
AB In this paper, a novel Mandarin edutainment system is developed for learning Mandarin in immersive, interactive Virtual Learning Environments (VLEs). Our system comprises two main parts: speech technology support and virtual 3D game design. First, 3D face recognition technology is introduced to distinguish different learners and to provide personalized learning services based on the characteristics of the individual. Then, a Mandarin pronunciation recognition and assessment scheme is constructed using state-of-the-art speech processing technology. Given the distinctive differences between Mandarin rhythm and that of Western languages, we integrate prosodic parameters into the recognition and evaluation model to highlight Mandarin characteristics and improve evaluation performance. To promote the engagement of foreign learners, we embed our technology framework into a Virtual Reality (VR) game environment. The character design reflects traditional Chinese culture, and the plots balance pronunciation learning with learners' interest while providing scoring feedback. In the experiments, we first test how the recognition results and machine scores for different error types correlate with human scores. Then, we evaluate the usability, likeability, and knowledgeability of the whole VLE system. We divide the learners into three categories according to their Mandarin level, and they provide feedback via a questionnaire. The results show that our system can effectively promote foreign learners' engagement and improve their Mandarin level. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Ming, Yue; Ruan, Qiuqi] Beijing JiaoTong Univ, Inst Informat Sci, Beijing 100044, Peoples R China.
[Gao, Guodong] Beijing Traff Control Technol CO Ltd, Beijing 100044, Peoples R China.
RP Ming, Y (reprint author), Beijing JiaoTong Univ, Inst Informat Sci, Beijing 100044, Peoples R China.
EM myname35875235@126.com
FU National Natural Science Foundation [60973060]; Research Fund for the
Doctoral Program [20080004001]; Beijing Program [YB20081000401];
Fundamental Research Funds for the Central Universities [2009YJS025]
FX This work is supported by National Natural Science Foundation
(60973060), the Research Fund for the Doctoral Program (20080004001) and
Beijing Program (YB20081000401) and the Fundamental Research Funds for
the Central Universities (2009YJS025). Informedia digital video
understanding lab in Carneige Mellon University provides the portions of
experimental materials and environments. The authors would like also
thank the Associate Editor and the anonymous reviewers for their helpful
comments.
CR Amory A., 1998, P ED MEDIA ED TELECO, P50
Asakawa S., 2005, P EUROSPEECH, P165
Conati C, 2002, LECT NOTES COMPUT SC, V2363, P944
Cucchiarini C., 1998, P ICSLP, P1738
Cucchiarini C., 1997, P IEEE WORKSH ASRU S, P622
de Wet F, 2009, SPEECH COMMUN, V51, P864, DOI 10.1016/j.specom.2009.03.002
Franco H, 2000, SPEECH COMMUN, V30, P121, DOI 10.1016/S0167-6393(99)00045-X
Franco H.L., 1997, P ICASSP, P1465
Hauptmann Alex, 2012, IEEE INT C MULT EXP
Hodges B, 2009, PRAGMAT COGN, V17, P628, DOI 10.1075/p&c.17.3.08hod
Huang Jui-Ting, 2006, SPEECH PROSODY
Kearney P., 2004, P ED MEDIA WORLD C E, P3915
Li XX, 2007, LECT NOTES COMPUT SC, V4402, P188
Minematsu N., 2004, P ICSLP, P1317
Neumeyer L., 2000, SPEECH COMMUN, V30, P83
Neumeyer L., 1996, P ICSLP, P217
Pan ZG, 2006, COMPUT GRAPH-UK, V30, P20, DOI 10.1016/j.cag.2005.10.004
Rabiner L, 1993, FUNDAMENTALS SPEECH
Raux A., 2002, P ICSLP, P737
Tsubota Y., 2004, P ICSLP, P849
Van Lier V., 2000, SOCIOCULTURAL THEORY, P245
Virvou M, 2008, COMPUT EDUC, V50, P154, DOI 10.1016/j.compedu.2006.04.004
Williams G, 1999, COMPUT SPEECH LANG, V13, P395, DOI 10.1006/csla.1999.0129
Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8
Witt S.M., 1999, THESIS
Young MF, 2000, THEORETICAL FOUNDATIONS OF LEARNING ENVIRONMENTS, P147
Young M.F., 2004, HDB RES ED COMMUNICA
Yue Ming, 2012, IMAGE VISIO IN PRESS
Zhang YB, 2008, INT CONF ACOUST SPEE, P5065
NR 29
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 71
EP 83
DI 10.1016/j.specom.2012.06.007
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900006
ER
PT J
AU Wang, FJ
Swegles, K
AF Wang, Fangju
Swegles, Kyle
TI Modeling user behavior online for disambiguating user input in a spoken
dialogue system
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spoken dialogue system; Automatic speech recognition; Natural language
processing; Disambiguation; Reinforcement learning; User behavior
modeling
ID SIMULATION; STATE
AB A spoken dialogue system (SDS) interacts with its user in a spoken natural language. It interprets user speech input and responds to the user. User speech in a spoken natural language may be ambiguous. A challenge in building an SDS is dealing with ambiguity. Without good abilities for disambiguation, an SDS can hardly have meaningful and smooth dialogues with its user in practical applications. The existing techniques for disambiguation are mainly based on statistical knowledge about language use. In practical situations, such knowledge alone is inadequate. In our research, we develop a new disambiguation technique, which is based on application of knowledge about user activity behavior, in addition to knowledge about language use. The technique is named MUBOD, standing for modeling user behavior online for disambiguation. The core component of MUBOD is an online reinforcement learning algorithm that is used to learn the knowledge and apply the knowledge for disambiguation. In this paper, we describe the technique and its implementation, and present and analyze some initial experimental results. (C) 2012 Elsevier B.V. All rights reserved.
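The abstract describes an online reinforcement-learning component that learns which interpretation of an ambiguous user input to prefer. A generic tabular Q-learning update of that flavor is sketched below; this is standard Q-learning, not the MUBOD algorithm itself, and the states, candidate interpretations, and rewards are hypothetical placeholders:

    import random
    from collections import defaultdict

    Q = defaultdict(float)            # Q[(state, action)] -> value
    alpha, gamma, epsilon = 0.1, 0.9, 0.2

    def choose(state, actions):
        """Epsilon-greedy choice among candidate interpretations of the user input."""
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    def update(state, action, reward, next_state, next_actions):
        """One online Q-learning step after observing the user's reaction (reward)."""
        best_next = max((Q[(next_state, a)] for a in next_actions), default=0.0)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

    # Toy interaction: the user asked to "play it"; the referent is ambiguous.
    state = ("context=music_browsing",)
    actions = ["play_current_album", "play_last_song", "play_playlist"]
    a = choose(state, actions)
    update(state, a, reward=+1.0 if a == "play_last_song" else -1.0,
           next_state=("context=playing",), next_actions=actions)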
C1 [Wang, Fangju; Swegles, Kyle] Univ Guelph, Sch Comp Sci, Guelph, ON N1G 2W1, Canada.
RP Wang, FJ (reprint author), Univ Guelph, Sch Comp Sci, Guelph, ON N1G 2W1, Canada.
EM fjwang@uoguelph.ca
FU Natural Sciences and Engineering Research Council of Canada (NSERC)
FX This research is supported by the Natural Sciences and Engineering
Research Council of Canada (NSERC). This paper is based on Kyle
Swegles's MSc thesis work.
CR Chickering DM, 2007, USER MODEL USER-ADAP, V17, P71, DOI 10.1007/s11257-006-9020-7
Chotimongkol A., 2001, P 7 EUR C SPEECH COM, P1829
Cifarelli C, 2007, J CLASSIF, V24, P205, DOI 10.1007/s00357-007-0012-z
Frampton M, 2009, KNOWL ENG REV, V24, P375, DOI 10.1017/S0269888909990166
Frampton M, 2006, COLING/ACL 2006, VOLS 1 AND 2, PROCEEDINGS OF THE CONFERENCE, P185
Gabsdil M., 2004, P ANN M ASS COMP LIN, P343, DOI 10.3115/1218955.1218999
Gasic M., 2011, P IEEE WORKSH AUT SP, P312
Griol D, 2008, SPEECH COMMUN, V50, P666, DOI 10.1016/j.specom.2008.04.001
Gruenstein A., 2008, P SIGDIAL, P11, DOI 10.3115/1622064.1622067
Higashinaka R, 2006, SPEECH COMMUN, V48, P417, DOI 10.1016/j.specom.2005.06.011
JOKINEN K., 2010, SPOKEN DIALOGUE SYST
Jonson R., 2006, P SPOK LANG TECHN WO, P174
Jung S, 2009, COMPUT SPEECH LANG, V23, P479, DOI 10.1016/j.csl.2009.03.002
Jurafsky D., 2007, SPEECH LANGUAGE PROC
Jurcicek F, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P90
Keizer S, 2008, 2008 IEEE WORKSHOP ON SPOKEN LANGUAGE TECHNOLOGY: SLT 2008, PROCEEDINGS, P121, DOI 10.1109/SLT.2008.4777855
Lemon O., 2009, P EACL, P505, DOI 10.3115/1609067.1609123
Levin E., 2000, IEEE T SPEECH AUDIO, V8, P1
Lidstone George James, 1920, T FACULTY ACTUARIES, V8, P182
Litman D.J., 2000, P 18 C COMP LING ASS, V1, P502, DOI 10.3115/990820.990893
Manning C. D., 1999, FDN STAT NATURAL LAN
Png SW, 2011, INT CONF ACOUST SPEE, P2156
Schatzmann J., 2006, KNOWL ENG REV, V21, P1
Scheffler Konrad, 2002, P 2 INT C HUM LANG T, P12, DOI 10.3115/1289189.1289246
Sutton R. S., 2005, REINFORCEMENT LEARNI
Tetreault JR, 2008, SPEECH COMMUN, V50, P683, DOI 10.1016/j.specom.2008.05.002
Thomson B, 2010, COMPUT SPEECH LANG, V24, P562, DOI 10.1016/j.csl.2009.07.003
Thomson B., 2010, SPOK LANG TECHN WORK, P271
Torres F, 2008, COMPUT SPEECH LANG, V22, P230, DOI 10.1016/j.csl.2007.09.002
Walker W., 2004, SPHINX 4 FLEXIBLE OP
Williams J., 2005, P KNOWL REAS PRACT D
Williams JD, 2007, COMPUT SPEECH LANG, V21, P393, DOI 10.1016/j.csl.2006.06.008
NR 32
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 84
EP 98
DI 10.1016/j.specom.2012.06.006
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900007
ER
PT J
AU Sepulveda, A
Guido, RC
Castellanos-Dominguez, G
AF Sepulveda, Alexander
Guido, Rodrigo Capobianco
Castellanos-Dominguez, G.
TI Estimation of relevant time-frequency features using Kendall coefficient
for articulator position inference
SO SPEECH COMMUNICATION
LA English
DT Article
DE Articulatory information; Wavelet-packet transform; Kendall coefficient;
GMM; Neural networks
ID VOCAL-TRACT
AB The determination of relevant acoustic information for inferring articulator positions is an open issue. This paper presents a method to estimate the acoustic features most closely related to articulator movement. The input feature set is based on a time-frequency representation calculated from the speech signal, parametrized using the wavelet-packet transform. The main focus is on measuring the relevant acoustic information, in terms of statistical association, for the inference of articulator positions. The Kendall rank correlation coefficient is used as the relevance measure, and the attained statistical association is validated using the chi(2) information measure. Maps of relevant time-frequency features are calculated for the MOCHA-TIMIT database, where the articulatory information is represented by trajectories of specific positions in the vocal tract. Relevance maps are estimated over the whole speech signal as well as on specific phones for which a given articulator is known to be critical. The usefulness of the relevance maps is tested in an acoustic-to-articulatory mapping system based on Gaussian mixture models. (C) 2012 Elsevier B.V. All rights reserved.
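A minimal sketch of the relevance-map idea described above: compute the Kendall rank correlation between each time-frequency feature and an articulator trajectory. The random placeholder data below stand in for the wavelet-packet features and MOCHA-TIMIT EMA trajectories used in the paper:

    import numpy as np
    from scipy.stats import kendalltau

    rng = np.random.default_rng(4)
    n_frames, n_tf_features = 500, 64
    features = rng.normal(size=(n_frames, n_tf_features))             # placeholder time-frequency features
    articulator = 0.8 * features[:, 10] + rng.normal(size=n_frames)   # e.g., tongue-tip vertical position

    # Relevance map: |Kendall tau| between every feature and the articulator trajectory.
    relevance = []
    for j in range(n_tf_features):
        tau, _ = kendalltau(features[:, j], articulator)
        relevance.append(abs(tau))
    relevance = np.array(relevance)
    print("most relevant feature index:", int(relevance.argmax()))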
C1 [Guido, Rodrigo Capobianco] Univ Sao Paulo, Inst Phys Sao Carlos IFSC, Dept Phys & Informat, BR-13566590 Sao Carlos, SP, Brazil.
[Sepulveda, Alexander; Castellanos-Dominguez, G.] Univ Nacl Colombia, Signal Proc & Recognit Grp, Manizales, Colombia.
RP Guido, RC (reprint author), Univ Sao Paulo, Inst Phys Sao Carlos IFSC, Dept Phys & Informat, Ave Trabalhador Saocarlense 400, BR-13566590 Sao Carlos, SP, Brazil.
EM guido@ifsc.usp.br
FU Administrative Department of Science, Technology and Innovation of
Colombia (COLCIENCIAS); Red de Macrouniversidades de America
Latina-Grupo Santander
FX We thank the reviewers for valuable suggestions and constructive
criticisms of an earlier version of this paper. This work was supported
mainly by Administrative Department of Science, Technology and
Innovation of Colombia (COLCIENCIAS); and, it was also financed by the
Red de Macrouniversidades de America Latina-Grupo Santander, who
provided the facilities to carry out a six month internship at
SpeechLab, IFSC, University of Sao Paulo (Sao Carlos, SP, Brazil).
CR Addison P.S., 2002, ILLUSTRATED WAVELET
Akansu AN, 2001, MULTIRESOLUTION SIGN
Al-Moubayed S., 2010, INTERSPEECH 2010
Bishop C. M., 2006, PATTERN RECOGNITION
Choueiter G., 2007, IEEE T AUDIO SPEECH
Dickinson J., 2003, NONPARAMETRIC STAT I
Farooq O, 2001, IEEE SIGNAL PROC LET, V8, P196, DOI 10.1109/97.928676
Hasegawa-Johnson M., 2000, INT C SPOK LANG PROC, P133
Hogden J., 1996, J ACOUSTICAL SOC AM, V100
Jackson P., 2009, SPEECH COMMUNICATION
Kent R. D., 2002, ACOUSTIC ANAL SPEECH
Maji P, 2009, IEEE T BIO-MED ENG, V56, P1063, DOI 10.1109/TBME.2008.2004502
Mallat S., 1998, WAVELET TOUR SIGNAL
Ozbek Y., 2011, IEEE T AUDIO SPEECH
Papcun G., 1992, J ACOUSTICAL SOC AM
Richmond K, 2003, COMPUT SPEECH LANG, V17, P153, DOI 10.1016/S0885-2308(03)00005-6
Richmond Korin, 2001, THESIS U EDINBURGH E
Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356
Silva J., 2009, IEEE T SIGNAL PROCES, V57
Sorokin VN, 2000, SPEECH COMMUN, V30, P55, DOI 10.1016/S0167-6393(99)00031-X
Suzuki T., 2009, BMC BIOINFORMATICS, V10
Toda T, 2008, SPEECH COMMUN, V50, P215, DOI 10.1016/j.specom.2007.09.001
Toutios A., 2008, 16 EUR SIGN PROC C E
Wrench A., 1999, MOCHA TIMIT ARTICULA
Yang HH, 2000, SPEECH COMMUN, V31, P35, DOI 10.1016/S0167-6393(00)00007-8
Zhang L, 2008, IEEE SIGNAL PROC LET, V15, P245, DOI 10.1109/LSP.2008.917004
NR 26
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 99
EP 110
DI 10.1016/j.specom.2012.06.005
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900008
ER
PT J
AU Yagli, C
Turan, MAT
Erzin, E
AF Yagli, Can
Turan, M. A. Tugtekin
Erzin, Engin
TI Artificial bandwidth extension of spectral envelope along a Viterbi path
SO SPEECH COMMUNICATION
LA English
DT Article
DE Artificial bandwidth extension; Source-filter separation; Line spectral
frequency; Joint temporal analysis
ID NARROW-BAND; SPEECH
AB In this paper, we propose a hidden Markov model (HMM)-based wideband spectral envelope estimation method for the artificial bandwidth extension problem. The proposed HMM-based estimator decodes an optimal Viterbi path based on the temporal contour of the narrowband spectral envelope and then performs the minimum mean square error (MMSE) estimation of the wideband spectral envelope on this path. Experimental evaluations are performed to compare the proposed estimator to the state-of-the-art HMM and Gaussian mixture model based estimators using both objective and subjective evaluations. Objective evaluations are performed with the log-spectral distortion (LSD) and the wideband perceptual evaluation of speech quality (PESQ) metrics. Subjective evaluations are performed with the A/B pair comparison listening test. Both objective and subjective evaluations yield that the proposed wideband spectral envelope estimator consistently improves performances over the state-of-the-art estimators. (C) 2012 Elsevier B.V. All rights reserved.
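A compact sketch of the two-stage idea in the abstract (decode a Viterbi state path from the narrowband envelope features, then take the state-conditional MMSE estimate of the wideband envelope on that path), using jointly Gaussian toy state models. This is not the authors' trained system; the dimensions, state models, and uniform transition probabilities are placeholder assumptions:

    import numpy as np

    rng = np.random.default_rng(5)
    S, Dx, Dy, T = 4, 6, 10, 50           # states, narrowband dim, wideband dim, frames

    # Toy per-state joint Gaussian models of (narrowband x, wideband y) envelope features.
    mu_x = rng.normal(size=(S, Dx)); mu_y = rng.normal(size=(S, Dy))
    Sigma_xx = np.stack([np.eye(Dx) for _ in range(S)])
    Sigma_yx = rng.normal(scale=0.3, size=(S, Dy, Dx))
    logA = np.log(np.full((S, S), 1.0 / S))            # uniform transitions (placeholder)
    logpi = np.log(np.full(S, 1.0 / S))

    x = rng.normal(size=(T, Dx))                       # observed narrowband envelope features

    def log_emis(t):
        d = x[t] - mu_x                                # (S, Dx)
        return -0.5 * np.einsum('sd,sd->s', d, d)      # unit-covariance Gaussian log-likelihood (up to a constant)

    # Viterbi decoding of the state path from the narrowband temporal contour.
    delta = logpi + log_emis(0)
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + logA                 # (previous state, current state)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_emis(t)
    path = np.zeros(T, dtype=int)
    path[-1] = delta.argmax()
    for t in range(T - 2, -1, -1):
        path[t] = psi[t + 1, path[t + 1]]

    # MMSE estimate of the wideband envelope along the decoded path (per-state conditional mean).
    y_hat = np.array([mu_y[s] + Sigma_yx[s] @ np.linalg.solve(Sigma_xx[s], x[t] - mu_x[s])
                      for t, s in enumerate(path)])
    print(y_hat.shape)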
C1 [Yagli, Can; Turan, M. A. Tugtekin; Erzin, Engin] Koc Univ, Coll Engn, Multimedia Vis & Graph Lab, TR-34450 Istanbul, Turkey.
RP Erzin, E (reprint author), Koc Univ, Coll Engn, Multimedia Vis & Graph Lab, TR-34450 Istanbul, Turkey.
EM canyagli@ku.edu.tr; mturan@ku.edu.tr; eerzin@ku.edu.tr
CR Agiomyrgiannakis Y, 2007, IEEE T AUDIO SPEECH, V15, P377, DOI 10.1109/TASL.2006.881702
[Anonymous], 1993, SPEC INT REF SYST
Cheng YM, 1994, IEEE T SPEECH AUDI P, V2, P544
Enbom N., 1999, IEEE WORKSH SPEECH C, P171
Erzin E, 2009, IEEE T AUDIO SPEECH, V17, P1316, DOI 10.1109/TASL.2009.2016733
ITU- T, 2005, WID EXT REC P 862 AS
Jax P., 2004, AUDIO BANDWIDTH EXTE, P171
Jax P, 2003, SIGNAL PROCESS, V83, P1707, DOI 10.1016/S0165-1684(03)00082-3
Park KY, 2000, INT CONF ACOUST SPEE, P1843
Voran S., 1997, IEEE WORKSH SPEECH C, P81
Yagli C., 2012, ABE SPEECH SYNTHESIS
Yagli C, 2011, INT CONF ACOUST SPEE, P5096
NR 12
TC 2
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 111
EP 118
DI 10.1016/j.specom.2012.07.003
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900009
ER
PT J
AU Fan, X
Hansen, JHL
AF Fan, Xing
Hansen, John H. L.
TI Acoustic analysis and feature transformation from neutral to whisper for
speaker identification within whispered speech audio streams
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker identification; Whispered speech; Vocal effort; Robust speaker
verification
ID MAXIMUM-LIKELIHOOD; EM ALGORITHM; RECOGNITION; VOWELS
AB Whispered speech is an alternative speech production mode to neutral speech, used intentionally by talkers in natural conversational scenarios to protect privacy and to avoid certain content being overheard or made public. Due to the profound differences between whispered and neutral speech in vocal excitation and vocal tract function, the performance of automatic speaker identification systems trained with neutral speech degrades significantly. In order to better understand these differences and to further develop efficient model adaptation and feature compensation methods, this study first analyzes the speaker and phoneme dependency of these differences via maximum-likelihood estimation of a transformation from neutral speech towards whispered speech. Based on the analysis results, this study then considers a feature transformation method in the training phase that leads to a more robust speaker model for speaker ID on whispered speech without using whispered adaptation data from test speakers. Three estimation methods that model the transformation from neutral to whispered speech are applied, including convolutional transformation (ConvTran), constrained maximum likelihood linear regression (CMLLR), and factor analysis (FA). A speech-mode-independent (SMI) universal background model (UBM) is trained using collected real neutral features and transformed pseudo-whisper features generated with the estimated transformation. Text-independent closed-set speaker ID results using the UT-VocalEffort II corpus show performance improvement with the proposed training framework. The best performance of 88.87% is achieved using the ConvTran model, which represents a relative improvement of 46.26% compared to the 79.29% accuracy of the GMM-UBM baseline system. This result suggests that synthesizing pseudo-whispered speaker and background training data with the ConvTran model results in improved speaker ID robustness to whispered speech. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Hansen, John H. L.] Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, Ctr Robust Speech Syst, Richardson, TX 75080 USA.
RP Hansen, JHL (reprint author), Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, Ctr Robust Speech Syst, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA.
EM John.Hansen@utdallas.edu
FU AFRL [FA8750-09-C-0067]; University of Texas at Dallas from the
Distinguished University Chair in Telecommunications Engineering
FX This project was funded by AFRL through a subcontract to RADC Inc. under
FA8750-09-C-0067 (Approved for public release, distribution unlimited),
and partially by the University of Texas at Dallas from the
Distinguished University Chair in Telecommunications Engineering held by
J. Hansen.
CR Bou-Ghazale SE, 1998, IEEE T SPEECH AUDI P, V6, P201, DOI 10.1109/89.668815
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Deguchi D., 2010, APSIPA ASC BIOP SING, P502
Dehak N., 2009, INTERSPEECH, P1559
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
Deng L, 2004, IEEE T SPEECH AUDI P, V12, P218, DOI 10.1109/TSA.2003.822627
Eklund I, 1996, PHONETICA, V54, P1
Fan X, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1313
Fan X., 2009, ISCA INTERSPEECH 09, P896
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Garofolo JS, 1993, LINGUISTIC DATA CONS
GavidiaCeballos L, 1996, IEEE T BIO-MED ENG, V43, P373, DOI 10.1109/10.486257
Ito T, 2005, SPEECH COMMUN, V45, P139, DOI 10.1016/j.specom.2003.10.005
Jin Q., 2007, IEEE INT C MULT EXP, P1027
Jovicic ST, 1998, ACUSTICA, V84, P739
KALLAIL KJ, 1984, J SPEECH HEAR RES, V27, P245
Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1435, DOI 10.1109/TASL.2006.881693
Kenny P, 2005, IEEE T SPEECH AUDI P, V13, P345, DOI 10.1109/TSA.2004.840940
Kullback S., 1968, INFORM THEORY STAT
Lei Y, 2009, INT CONF ACOUST SPEE, P4337, DOI 10.1109/ICASSP.2009.4960589
Li JY, 2009, COMPUT SPEECH LANG, V23, P389, DOI 10.1016/j.csl.2009.02.001
Matsuura M, 1999, ADV EARTHQ ENGN, P133
MEYEREPPLER W, 1957, J ACOUST SOC AM, V29, P104, DOI 10.1121/1.1908631
Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225
Morris RW, 2002, MED ENG PHYS, V24, P515, DOI 10.1016/S1350-4533(02)00060-7
THOMAS IB, 1969, J ACOUST SOC AM, V46, P468, DOI 10.1121/1.1911712
Tipping ME, 1999, J ROY STAT SOC B, V61, P611, DOI 10.1111/1467-9868.00196
Ye H, 2006, IEEE T AUDIO SPEECH, V14, P1301, DOI 10.1109/TSA.2005.860839
Zhang C., 2007, INTERSPEECH 07, P2289
Zhang C., 2009, INTERSPEECH, P860
NR 30
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 119
EP 134
DI 10.1016/j.specom.2012.07.002
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900010
ER
PT J
AU Gibert, G
Leung, Y
Stevens, CJ
AF Gibert, Guillaume
Leung, Yvonne
Stevens, Catherine J.
TI Control of speech-related facial movements of an avatar from video
SO SPEECH COMMUNICATION
LA English
DT Article
DE Talking head; Auditory-visual speech; Puppetry; Facial animation; Face
tracking
ID ACTIVE SHAPE MODEL; EXPRESSION; FACE; TRACKING; HEAD
AB Several puppetry techniques have been recently proposed to transfer emotional facial expressions to an avatar from a user's video. Whereas generation of facial expressions may not be sensitive to small tracking errors, generation of speech-related facial movements would be severely impaired. Since incongruent facial movements can drastically influence speech perception, we proposed a more effective method to transfer speech-related facial movements from a user to an avatar. After a facial tracking phase, speech articulatory parameters (controlling the jaw and the lips) were determined from the set of landmark positions. Two additional processes calculated the articulatory parameters which controlled the eyelids and the tongue from the 2D Discrete Cosine Transform coefficients of the eyes and inner mouth images.
A speech-in-noise perception experiment was conducted with 25 participants to evaluate the system. An increase in intelligibility was shown for the avatar and human auditory-visual conditions compared to the avatar and human auditory-only conditions, respectively. The results of the avatar auditory-visual presentation depended on vocalic context: all consonants were better perceived in the /a/ vocalic context than in /i/ and /u/, because of the lack of depth information retrievable from video. This method could be used to accurately animate avatars for hearing-impaired people using information and telecommunication technologies. (C) 2012 Elsevier B.V. All rights reserved.
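A small sketch of the 2D DCT feature step mentioned above: the low-order DCT coefficients of an eye or inner-mouth image patch serve as a compact descriptor. The patch size and number of retained coefficients below are arbitrary illustrative choices, not the paper's settings:

    import numpy as np
    from scipy.fft import dctn

    rng = np.random.default_rng(6)
    patch = rng.random((32, 32))                 # placeholder grayscale inner-mouth patch

    coeffs = dctn(patch, norm="ortho")           # 2D DCT-II of the patch
    k = 6
    descriptor = coeffs[:k, :k].ravel()          # keep the k*k lowest-frequency coefficients
    print(descriptor.shape)                      # (36,)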
C1 [Gibert, Guillaume] INSERM, U846, F-69500 Bron, France.
[Gibert, Guillaume] Stem Cell & Brain Res Inst, F-69500 Bron, France.
[Gibert, Guillaume] Univ Lyon 1, F-69003 Lyon, France.
[Gibert, Guillaume; Leung, Yvonne; Stevens, Catherine J.] Univ Western Sydney, Marcs Inst, Penrith, NSW 2751, Australia.
[Stevens, Catherine J.] Univ Western Sydney, Sch Social Sci & Psychol, Penrith, NSW 2751, Australia.
RP Gibert, G (reprint author), Stem Cell & Brain Res Inst, INSERM, U846, 18 Ave Doyen Lepine, F-69675 Bron, France.
EM guillaume.gibert@inserm.fr; y.leung@uws.edu.au; kj.stevens@uws.edu.au
RI Gibert, Guillaume/M-5816-2014
FU Australian Research Council; National Health and Medical Research
Council [TS0669874]; SWoOZ project [ANR 11 PDOC 019 01]
FX We thank James Heathers for manually segmenting the images. This work
was supported by the Thinking Head project, a Special Initiative scheme
of the Australian Research Council and the National Health and Medical
Research Council (TS0669874) (Burnham et al., 2006) and by the SWoOZ
project (ANR 11 PDOC 019 01).
CR Allbeck J., 2010, INTELLIGENT VIRTUAL, V6356, P420
Badin P., 2006, 7 INT SEM SPEECH PRO
Bailly G., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025700715107
Benoit C, 1998, SPEECH COMMUN, V26, P117, DOI 10.1016/S0167-6393(98)00045-4
Besle J, 2004, EUR J NEUROSCI, V20, P2225, DOI 10.1111/j.1460-9568.2004.03670.x
Boker SM, 2009, PHILOS T R SOC B, V364, P3485, DOI 10.1098/rstb.2009.0152
Brand M, 1999, P 26 ANN C COMP GRAP
Burnham D., 2006, TS0669874 ARCNH MRC
Caridakis G., 2007, LANGUAGE RESOURCES E, P41
Chibelushi C., 2003, CVONLINE LINE COMPEN
COOTES TF, 1995, COMPUT VIS IMAGE UND, V61, P38, DOI 10.1006/cviu.1995.1004
Fanelli G., 2007, 3DTV C, P1
Forster KI, 2003, BEHAV RES METH INS C, V35, P116, DOI 10.3758/BF03195503
Gallou S., 2007, P 7 INT C INT VIRT A
Gibert G, 2005, J ACOUST SOC AM, V118, P1144, DOI 10.1121/1.1944587
Gross R, 2010, IMAGE VISION COMPUT, V28, P807, DOI 10.1016/j.imavis.2009.08.002
Jordan T. R., 1998, HEARING EYE, P155
Kuratate T., 1998, INT C AUD VIS SPEECH, P185
Lee SW, 2007, IET COMPUT VIS, V1, P17, DOI 10.1049/iet-cvi:20045243
Massaro D, 2011, AM J PSYCHOL, V124, P341, DOI 10.5406/amerjpsyc.124.3.0341
Massaro D. W., 1998, PERCEIVING TALKING F
MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0
Milborrow S, 2008, LECT NOTES COMPUT SC, V5305, P504
Morishima S, 2001, IEEE SIGNAL PROC MAG, V18, P26, DOI 10.1109/79.924886
Ouni S., 2007, EURASIP J AUDIO SPEE
Ouni S., 2003, INT C PHON SCI ICPHS
Reveret L., 2000, 6 INT C SPOK LANG PR
Rosenblum LD, 1996, J SPEECH HEAR RES, V39, P1159
Saragih JM, 2011, AUTOMATIC FACE GESTU
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
Theobald BJ, 2009, LANG SPEECH, V52, P369, DOI 10.1177/0023830909103181
Theobald B.-J., 2007, P 9 861 INT C MULT I
Viola P., 2001, P IEEE COMP SOC C CO, V1, P1
Vlasic D, 2005, ACM T GRAPHIC, V24, P426, DOI 10.1145/1073204.1073209
Weise T., 2009, 8 ACM SIGGRAPH EUR S
NR 35
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 135
EP 146
DI 10.1016/j.specom.2012.07.001
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900011
ER
PT J
AU Lammert, A
Goldstein, L
Narayanan, S
Iskarous, K
AF Lammert, Adam
Goldstein, Louis
Narayanan, Shrikanth
Iskarous, Khalil
TI Statistical methods for estimation of direct and differential kinematics
of the vocal tract
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech production; Direct kinematics; Differential kinematics; Task
dynamics; Articulatory synthesis; Kinematic estimation; Statistical
machine learning; Locally-weighted regression; Artificial neural
networks
ID TO-ARTICULATORY INVERSION; LOCALLY WEIGHTED REGRESSION; NEURAL-NETWORK
MODEL; SPEECH PRODUCTION-MODEL; MARQUARDT ALGORITHM; MOTOR CONTROL;
MOVEMENTS; ACOUSTICS; MANIPULATORS; GESTURES
AB We present and evaluate two statistical methods for estimating kinematic relationships of the speech production system: artificial neural networks and locally-weighted regression. The work is motivated by the need to characterize this motor system, with particular focus on estimating differential aspects of kinematics. Kinematic analysis will facilitate progress in a variety of areas, including the nature of speech production goals, articulatory redundancy and, relatedly, acoustic-to-articulatory inversion. Statistical methods must be used to estimate these relationships from data, since the relationships are infeasible to express in closed form. Statistical models are optimized and evaluated, using a held-out data validation procedure, on two sets of synthetic speech data. The theoretical and practical advantages of both methods are also discussed. It is shown that both direct and differential kinematics can be estimated with high accuracy, even for complex, nonlinear relationships. Locally-weighted regression displays the best overall performance, which may be due to practical advantages in its training procedure. Moreover, accurate estimation can be achieved using only a modest amount of training data, as judged by convergence of performance. The algorithms are also applied to real-time MRI data, and the results are generally consistent with those obtained from synthetic data. (C) 2012 Elsevier B.V. All rights reserved.
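Locally-weighted regression, which the abstract reports as the best-performing estimator, can be sketched as follows: for each query point, fit a weighted local linear model (Gaussian kernel on distance to the query) and read the direct kinematic prediction from the fit; the local slope matrix then approximates the differential (Jacobian) relationship. This is a generic LWR sketch on toy data, not the authors' configuration or feature set:

    import numpy as np

    def lwr_predict(X, Y, x_query, bandwidth=0.5):
        """Locally-weighted linear regression: prediction and local Jacobian at x_query."""
        d2 = np.sum((X - x_query) ** 2, axis=1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))             # Gaussian kernel weights
        Xa = np.hstack([X, np.ones((X.shape[0], 1))])        # affine design matrix
        W = np.sqrt(w)[:, None]
        B, *_ = np.linalg.lstsq(W * Xa, W * Y, rcond=None)   # weighted least squares, B is (d+1, k)
        jacobian = B[:-1].T                                  # k x d local slope matrix (differential kinematics)
        pred = np.append(x_query, 1.0) @ B                   # direct kinematics at the query point
        return pred, jacobian

    # Toy articulatory-to-acoustic pair: 3 articulator variables -> 2 "acoustic" outputs.
    rng = np.random.default_rng(7)
    X = rng.uniform(-1, 1, size=(400, 3))
    Y = np.column_stack([np.sin(X[:, 0]) + X[:, 1] ** 2,
                         X[:, 2] * X[:, 0]]) + rng.normal(0, 0.01, (400, 2))
    pred, J = lwr_predict(X, Y, np.array([0.2, -0.3, 0.5]))
    print(pred, J.shape)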
C1 [Lammert, Adam; Narayanan, Shrikanth] Univ So Calif, SAIL, Los Angeles, CA 90089 USA.
[Goldstein, Louis; Narayanan, Shrikanth; Iskarous, Khalil] Univ So Calif, Dept Linguist, Los Angeles, CA 90089 USA.
[Goldstein, Louis; Iskarous, Khalil] Haskins Labs Inc, New Haven, CT 06511 USA.
RP Lammert, A (reprint author), Univ So Calif, SAIL, 3710 McClintock Ave, Los Angeles, CA 90089 USA.
EM lammert@usc.edu
FU NIH NIDCD [02717]; NIH [DC008780, DC007124]; Annenberg Foundation
FX This work was supported by NIH NIDCD Grant 02717, NIH R01 Grant
DC008780, NIH Grant DC007124, as well as a graduate fellowship from the
Annenberg Foundation. We would also like to acknowledge Elliot Saltzman
for his technical insights, and Hosung Nam for his help with
understanding TADA.
CR ABBS JH, 1984, J NEUROPHYSIOL, V51, P705
Al Moubayed S, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P937
Ananthakrishnan G., 2009, P INT BRIGHT UK, P2799
ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848
Atal B.S., 1989, J ACOUSTICAL SOC AM, P86
Atkeson CG, 1997, ARTIF INTELL REV, V11, P11, DOI 10.1023/A:1006559212014
BAILLY G, 1991, J PHONETICS, V19, P9
Balestrino A., 1984, P 9 IFAC WORLD C, V5, P2435
BENNETT DJ, 1991, IEEE T ROBOTIC AUTOM, V7, P597, DOI 10.1109/70.97871
Bernstein N., 1967, COORDINATION REGULAT
Bishop C. M., 2006, PATTERN RECOGNITION
BOE LJ, 1992, J PHONETICS, V20, P27
Bresch E, 2009, IEEE T MED IMAGING, V28, P323, DOI 10.1109/TMI.2008.928920
BULLOCK D, 1993, J COGNITIVE NEUROSCI, V5, P408, DOI 10.1162/jocn.1993.5.4.408
CLEVELAND WS, 1979, J AM STAT ASSOC, V74, P829, DOI 10.2307/2286407
CLEVELAND WS, 1988, J ECONOMETRICS, V37, P87, DOI 10.1016/0304-4076(88)90077-2
CLEVELAND WS, 1988, J AM STAT ASSOC, V83, P596, DOI 10.2307/2289282
Cootes TF, 2001, IEEE T PATTERN ANAL, V23, P681, DOI 10.1109/34.927467
Cybenko G., 1989, Mathematics of Control, Signals, and Systems, V2, DOI 10.1007/BF02551274
D'Souza A., 2001, P CIRAS
Duch W., 1999, NEURAL COMPUTING SUR, V2, P163
Fels S., 2005, P AUD VIS SPEECH PRO, P119
Gerard J.-M., 2006, SPEECH PRODUCTION MO, P85
Gerard JM, 2003, REC RES DEV BIOMECH, V1, P49
Ghosh PK, 2011, J ACOUST SOC AM, V130, pEL251, DOI 10.1121/1.3634122
Ghosh PK, 2010, J ACOUST SOC AM, V128, P2162, DOI 10.1121/1.3455847
GUENTHER FH, 1994, BIOL CYBERN, V72, P43, DOI 10.1007/BF00206237
Guenther FH, 1998, PSYCHOL REV, V105, P611
GUENTHER FH, 1995, PSYCHOL REV, V102, P594
Guigon E, 2007, J NEUROPHYSIOL, V97, P331, DOI 10.1152/jn.00290.2006
HAGAN MT, 1994, IEEE T NEURAL NETWOR, V5, P989, DOI 10.1109/72.329697
Hinton GE, 2006, NEURAL COMPUT, V18, P1527, DOI 10.1162/neco.2006.18.7.1527
HIROYA S, 2002, ACOUST SPEECH SIG PR, P437
Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636
Hiroya S., 2003, P INT WORKSH SPEECH, P9
Hiroya S., 2002, P ICSLP, P2305
Hogden J, 1996, J ACOUST SOC AM, V100, P1819, DOI 10.1121/1.416001
HOLLERBACH JM, 1982, TRENDS NEUROSCI, V5, P189, DOI 10.1016/0166-2236(82)90111-4
Hollerbach JM, 1996, INT J ROBOT RES, V15, P573, DOI 10.1177/027836499601500604
Hornik K., 1989, NEURAL NETWORKS, V2
Iskarous K., 2003, P ICPHS
Jacobs R. A., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.1.79
JORDAN MI, 1992, J MATH PSYCHOL, V36, P396, DOI 10.1016/0022-2496(92)90029-7
JORDAN MI, 1992, COGNITIVE SCI, V16, P307, DOI 10.1207/s15516709cog1603_1
Jordan M.I., 1995, HDB BRAIN THEORY NEU
Kaburagi T., 1998, P ICSLP
Kello CT, 2004, J ACOUST SOC AM, V116, P2354, DOI 10.1121/1.1715112
Kelso S., 1984, J EXP PSYCHOL, V10, P812
KHATIB O, 1987, IEEE T ROBOTIC AUTOM, V3, P43
Lammert A., 2011, J ACOUST SOC AM, V130, P2549
Lammert A., 2010, P INTERSPEECH
Lammert A.C., 2008, P SAPA 08, P29
Lawrence S., 1996, Proceedings of the Seventh Australian Conference on Neural Networks (ACNN'96)
McGowan RS, 2009, J ACOUST SOC AM, V126, P2011, DOI 10.1121/1.3184581
MERMELST.P, 1965, J ACOUST SOC AM, V37, P1186, DOI 10.1121/1.1939448
MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427
Mitra V, 2010, IEEE J-STSP, V4, P1027, DOI 10.1109/JSTSP.2010.2076013
Mitra V., 2009, P ICASSP
Mitra V., 2011, IEEE T AUDIO SPEECH
Mooring B.W., 1991, FUNDAMENTALS MANIPUL
Mottet D, 2001, J EXP PSYCHOL HUMAN, V27, P1275, DOI 10.1037/0096-1523.27.6.1275
Nakamura K., 2006, P ICASSP
NAKAMURA Y, 1986, J DYN SYST-T ASME, V108, P163
Nakanishi J, 2008, INT J ROBOT RES, V27, P737, DOI 10.1177/0278364908091463
Nam H., 2004, J ACOUST SOC AM, V115, P2430, DOI DOI 10.1016/J.SPECOM.2005.07.003
Nam H., 2006, TADA TASK DYNAMICS A
Nam H., 2010, P ICASSP
Narayanan S., 2004, JASA, V109, P2446
Narayanan S., 2011, P INTERSPEECH
Panchapagesan S, 2011, J ACOUST SOC AM, V129, P2144, DOI 10.1121/1.3514544
PAPCUN G, 1992, J ACOUST SOC AM, V92, P688, DOI 10.1121/1.403994
Payan Y., 1997, SPEECH COMMUN, V22, P187
Perrier P, 1996, J PHONETICS, V24, P53, DOI 10.1006/jpho.1996.0005
Perrier P, 2003, J ACOUST SOC AM, V114, P1582, DOI 10.1121/1.1587737
Qin C., 2007, P INTERSPEECH
Qin C., 2010, P INTERSPEECH
Rahim M. G., 1991, P IEEE INT C AC SPEE, P485, DOI 10.1109/ICASSP.1991.150382
Richmond K., 2010, P INTERSPEECH, P577
Rubin P., 1996, P 1 ETRW SPEECH PROD
Rumelhart D., 1986, PARALLEL DISTRIBUTED, V1
Rumelhart D., 1986, NATURE, V323
Saltzman E, 2000, HUM MOVEMENT SCI, V19, P499, DOI 10.1016/S0167-9457(00)00030-0
Saltzman E., 2006, DYNAMICS SPEECH PROD
SALTZMAN E, 1987, PSYCHOL REV, V94, P84, DOI 10.1037//0033-295X.94.1.84
Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2
Scholz J. P., 1999, EXP BRAIN RES, V126, P189
Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356
Sciavicco L., 2005, MODELLING CONTROL RO
Shiga Y., 2004, P INTERSPEECH
Sklar M.E., 1989, P 2 C REC ADV ROB, P178
SOECHTING JF, 1982, BRAIN RES, V248, P392, DOI 10.1016/0006-8993(82)90601-1
Ting J.A., 2008, P ICRA PAS CA
Toda T., 2004, P 5 ISCA SPEECH SYNT, P31
Toda T., 2008, SPEECH COMMUNICATION, V50
Toledo A, 2005, IEEE T NEURAL NETWOR, V16, P988, DOI 10.1109/TNN.2005.849849
Vogt F., 2005, J ACOUST SOC AM, V117, P2542
Vogt F., 2006, P ISSP, P51
WAKITA H, 1973, IEEE T ACOUST SPEECH, VAU21, P417, DOI 10.1109/TAU.1973.1162506
WAMPLER CW, 1986, IEEE T SYST MAN CYB, V16, P93, DOI 10.1109/TSMC.1986.289285
WHITNEY DE, 1969, IEEE T MAN MACHINE, VMM10, P47, DOI 10.1109/TMMS.1969.299896
Wilamowski BM, 2008, IEEE T IND ELECTRON, V55, P3784, DOI 10.1109/TIE.2008.2003319
Winkler R., 2011, P INTERSPEECH
Winkler R., 2011, P ISSP
Wolovich W. A., 1984, Proceedings of the 23rd IEEE Conference on Decision and Control (Cat. No. 84CH2093-3)
Wrench A. A., 2000, P 5 SEM SPEECH PROD, P305
Wu JM, 2008, IEEE T NEURAL NETWOR, V19, P2032, DOI 10.1109/TNN.2008.2003271
NR 106
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 147
EP 161
DI 10.1016/j.specom.2012.08.001
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900012
ER
PT J
AU Deoras, A
Mikolov, T
Kombrink, S
Church, K
AF Deoras, Anoop
Mikolov, Tomas
Kombrink, Stefan
Church, Kenneth
TI Approximate inference: A sampling based modeling technique to capture
complex dependencies in a language model
SO SPEECH COMMUNICATION
LA English
DT Article
DE Long-span language models; Recurrent neural networks; Speech
recognition; Decoding
AB In this paper, we present strategies to incorporate long context information directly during the first pass decoding and also for the second pass lattice re-scoring in speech recognition systems. Long-span language models that capture complex syntactic and/or semantic information are seldom used in the first pass of large vocabulary continuous speech recognition systems due to the prohibitive increase in the size of the sentence-hypotheses search space. Typically, n-gram language models are used in the first pass to produce N-best lists, which are then re-scored using long-span models. Such a pipeline produces biased first pass output, resulting in sub-optimal performance during re-scoring. In this paper we show that computationally tractable variational approximations of the long-span and complex language models are a better choice than the standard n-gram model for the first pass decoding and also for lattice re-scoring. (C) 2012 Elsevier B.V. All rights reserved.
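One common way to build such a tractable first-pass approximation (a sketch of the general sample-and-count idea, with a stand-in sentence generator and add-one smoothing rather than the paper's exact recipe) is to draw word sequences from the long-span model and estimate n-gram statistics from them:

    import random
    from collections import defaultdict

    def sample_from_long_span_model(n_sentences, max_len=12):
        # Stand-in for drawing sentences from a trained RNN/long-span LM;
        # in practice these samples would come from that model.
        vocab = ["the", "model", "speech", "decoder", "lattice", "rescoring"]
        for _ in range(n_sentences):
            length = random.randint(3, max_len)
            yield ["<s>"] + [random.choice(vocab) for _ in range(length)] + ["</s>"]

    def estimate_bigram(samples, alpha=1.0):
        # Count bigrams over the sampled text; the resulting smoothed bigram
        # model is the tractable approximation used for first-pass decoding.
        pair, ctx, vocab = defaultdict(float), defaultdict(float), set()
        for sent in samples:
            for u, v in zip(sent[:-1], sent[1:]):
                pair[(u, v)] += 1.0
                ctx[u] += 1.0
                vocab.update((u, v))
        V = len(vocab)
        return lambda u, v: (pair[(u, v)] + alpha) / (ctx[u] + alpha * V)

    bigram = estimate_bigram(sample_from_long_span_model(1000))
    print(bigram("the", "model"))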
C1 [Deoras, Anoop] Microsoft Corp, Mountain View, CA 94043 USA.
[Mikolov, Tomas; Kombrink, Stefan] Brno Univ Technol, Speech FIT, Brno, Czech Republic.
[Church, Kenneth] IBM TJ Watson Res Ctr, Yorktown Hts, NY USA.
RP Deoras, A (reprint author), Microsoft Corp, 1065 La Ave, Mountain View, CA 94043 USA.
EM Anoop.Deoras@microsoft.com; imikolov@fit.vutbr.cz;
kombrink@fit.vutbr.cz; kwchurch@us.ibm.com
FU HLT-COE Johns Hopkins University; Technology Agency of the Czech
Republic [TA01011328]; Grant Agency of Czech Republic [102/08/0707]
FX HLT-COE Johns Hopkins University partially funded this research. BUT
researchers were partly funded by the Technology Agency of the Czech
Republic Grant No. TA01011328, and Grant Agency of Czech Republic
Project No. 102/08/0707. Frederick Jelinek's contribution is
acknowledged towards this work. He would be a co-author if he were
available and willing to give his consent. We thank anonymous reviewers
for their many helpful comments and suggestions.
CR Allauzen C., 2003, P ASS COMP LING ACL
Bengio Y., 2007, SCALING LEARNING ALG
Bickel P. J., 1977, MATH STAT BASIC IDEA
Bishop C. M., 2006, PATTERN RECOGNITION
Boden M., 2002, GUIDE RECURRENT NEUR
Calzolari N., 2010, P 7 C INT LANG RES E
Chelba C, 2000, COMPUT SPEECH LANG, V14, P283, DOI 10.1006/csla.2000.0147
Chen SF, 2009, 2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), P299, DOI 10.1109/ASRU.2009.5373380
Cover T M, 1991, ELEMENTS INFORM THEO
Deng L., 2011, P AS PAC SIGN INF PR
Deoras A., 2010, P IEEE SPOK LANG TEC
Deoras A., 2011, P IEEE INT C AC SPEE
Deoras A., 2011, THESIS J HOPKINS U
Deoras A., 2011, P 2011 C EMP METH NA
Deoras A, 2009, 2009 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION & UNDERSTANDING (ASRU 2009), P282, DOI 10.1109/ASRU.2009.5373438
ELMAN JL, 1990, COGNITIVE SCI, V14, P179, DOI 10.1207/s15516709cog1402_1
Filimonov D., 2009, P 2009 C EMP METH NA
GEMAN S, 1984, IEEE T PATTERN ANAL, V6, P721
Glass J., 2007, P ICSLP INT
Hain T, 2005, P RICH TRANSCR 2005
Jordan M.I., 1998, LEARNING GRAPHICAL M
Mikolov T., 2011, P ICSLP INT
Mikolov T., 2010, P ICSLP INT
Mikolov T., 2011, P IEEE WORKSH AUT SP
Mikolov T., 2011, P IEEE INT C AC SPEE
Momtazi S., 2010, P ICSLP INT
Nederhof M.-J., 2004, P ASS COMP LING ACL
Nederhof M.-J., 2005, P ASS COMP LING ACL
Roark B, 2001, COMPUT LINGUIST, V27, P249, DOI 10.1162/089120101750300526
Rumelhart D.E., 1986, PARALLEL DISTRIBUTED, V1, P318
SHANNON CE, 1951, AT&T TECH J, V30, P50
Soltau H., 2010, P IEEE WORKSH SPOK L
Stolcke A., 1998, P DARPA BROADC NEWS, P8
Stolcke A., 1994, P ASS COMP LING ACL
Xu P., 2005, THESIS J HOPKINS U
NR 35
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 162
EP 177
DI 10.1016/j.specom.2012.08.004
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900013
ER
PT J
AU Cmejla, R
Rusz, J
Bergl, P
Vokral, J
AF Cmejla, Roman
Rusz, Jan
Bergl, Petr
Vokral, Jan
TI Bayesian changepoint detection for the automatic assessment of fluency
and articulatory disorders
SO SPEECH COMMUNICATION
LA English
DT Article
DE Changepoint detection; Speech pathology; Speech signal processing;
Disfluency; Articulation disorder
ID CHANGE-POINT ANALYSIS; PARKINSONS-DISEASE; TIME-SERIES; SPEECH; VOICE;
DYSARTHRIA; SEVERITY; SPEAKERS; HEALTHY; MODEL
AB Accurate changepoint detection between different signal segments is a frequent challenge in a wide range of applications. With regard to speech utterances, the changepoints are related to significant spectral changes, mostly represented by the borders between two phonemes. The main aim of this study is to design a novel Bayesian autoregressive changepoint detector (BACD) and test its feasibility in the evaluation of fluency and articulatory disorders. The originality of the proposed method lies in normalizing the a posteriori probability using the Bayesian evidence and in designing a recursive algorithm that is reliable in practice. For further evaluation of the BACD, we used data from (a) 118 people with varying severities of stuttering to assess the extent of speech disfluency using a short reading passage, and (b) 24 patients with early Parkinson's disease and 22 healthy speakers for evaluation of articulation accuracy using fast syllable repetition. Subsequently, we designed two measures for each type of disorder. While speech disfluency has been related to greater distances between spectral changes, inaccurate dysarthric articulation has instead been associated with lower spectral changes. These findings were confirmed by statistically significant differences in separating several degrees of disfluency and in distinguishing healthy from parkinsonian speakers. In addition, a significant correlation was found between the automatic assessment of speech fluency and the judgment of human experts. In conclusion, the proposed method provides a cost-effective, easily applicable and freely available means of evaluating speech disorders, and can also serve other areas requiring reliable changepoint detection techniques. In a more modest scope, the BACD may be used to assess disease severity, monitor treatment, and support therapist evaluation. (C) 2012 Elsevier B.V. All rights reserved.
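The BACD itself is not reproduced here; as a rough sketch of the underlying idea (comparing the evidence for one autoregressive model of a window against two models split at a candidate changepoint), the following uses least-squares AR fits and a BIC penalty as a stand-in for the article's exact Bayesian formulation, with an illustrative model order and toy signal:

    import numpy as np

    def ar_rss(x, order=4):
        # Least-squares AR(order) fit; return residual sum of squares and sample count.
        X = np.column_stack([x[order - k - 1:len(x) - k - 1] for k in range(order)])
        y = x[order:]
        coef, res, *_ = np.linalg.lstsq(X, y, rcond=None)
        rss = float(res[0]) if res.size else float(np.sum((y - X @ coef) ** 2))
        return rss, len(y)

    def changepoint_score(x, t, order=4):
        # BIC-style evidence gain for splitting x at t into two AR segments
        # versus one AR model for the whole window (larger = stronger changepoint).
        def bic(rss, n, k):
            return n * np.log(rss / n + 1e-12) + k * np.log(n)
        r_all, n_all = ar_rss(x, order)
        r1, n1 = ar_rss(x[:t], order)
        r2, n2 = ar_rss(x[t:], order)
        return bic(r_all, n_all, order) - (bic(r1, n1, order) + bic(r2, n2, order))

    rng = np.random.default_rng(1)
    sig = np.concatenate([np.sin(0.2 * np.arange(300)),
                          np.sin(0.7 * np.arange(300))]) + 0.1 * rng.standard_normal(600)
    cands = list(range(50, 550, 10))
    scores = [changepoint_score(sig, t) for t in cands]
    print(cands[int(np.argmax(scores))])   # should land near the true change at 300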
C1 [Cmejla, Roman; Rusz, Jan; Bergl, Petr] Czech Tech Univ, Dept Circuit Theory, Fac Elect Engn, Prague 16627 6, Czech Republic.
[Rusz, Jan] Charles Univ Prague, Fac Med 1, Dept Neurol, Prague, Czech Republic.
[Vokral, Jan] Gen Univ Hosp Prague, Fac Med 1, Dept Phoniatr, Prague, Czech Republic.
RP Cmejla, R (reprint author), Czech Tech Univ, Dept Circuit Theory, Fac Elect Engn, Prague 16627 6, Czech Republic.
EM cmejla@fel.cvut.cz
FU Czech Science Foundation [GACR P102/12/2230]; Czech Ministry of Health
[NT 12288-5/2011, NT11460-4/2010]; Czech Ministry of Education [MSM
0021620849]
FX The authors are obliged to Miroslava Hrbkova, Libor Cerny, Hana
Ruzickova, Jiri Klempir, Veronika Majerova, Jana Picmausova, Jan Roth,
and Evzen Ruzicka for provision of clinical data. The study was partly
supported by the Czech Science Foundation, project GACR P102/12/2230,
Czech Ministry of Health, projects NT 12288-5/2011 and NT11460-4/2010,
and Czech Ministry of Education, project MSM 0021620849.
CR Ackermann H, 1997, BRAIN LANG, V56, P312, DOI 10.1006/brln.1997.1851
Ajmera J, 2004, IEEE SIGNAL PROC LET, V11, P649, DOI 10.1109/LSP.2004.831666
APPEL U, 1983, INFORM SCIENCES, V29, P27, DOI 10.1016/0020-0255(83)90008-7
Asgari M, 2010, IEEE ENG MED BIO, P5201, DOI 10.1109/IEMBS.2010.5626104
Asgari M., 2010, IEEE INT WORKSH MACH, P462
Basseville M., 1993, INFORM SYSTEM SCI SE
Bergl P., 2010, THESIS CZECH TU PRAG
Bergl P, 2007, Proceedings of the Fifth IASTED International Conference on Biomedical Engineering, P171
Bergl P, 2006, PROC WRLD ACAD SCI E, V18, P33
Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464
Bocklet T, 2012, J VOICE, V26, P390, DOI 10.1016/j.jvoice.2011.04.010
Boersma P., 2001, GLOT INT, V5, P341
Brown J. R., 1969, J SPEECH HEAR RES, V12, P249
Chu PS, 2004, J CLIMATE, V17, P4893, DOI 10.1175/JCLI-3248.1
Cmejla R., 2004, EUROSIPCO P WIEN, P245
Cmejla R., 2001, 4 INT C TEXT SPEECH, P291
Conture E., 2001, STUTTERING ITS NATUR
Couvreur L., 1999, P ESCA ETRW WORKSH A, P84
DARLEY FL, 1969, J SPEECH HEAR RES, V12, P462
Dejonckere PH, 2001, EUR ARCH OTO-RHINO-L, V258, P77, DOI 10.1007/s004050000299
deRijk MC, 1997, J NEUROL NEUROSUR PS, V62, P10, DOI 10.1136/jnnp.62.1.10
Dobigeon N., 2005, 13 IEEE WORKSH STAT, P335
Duffy JR, 2005, MOTOR SPEECH DISORDE, P592
Eyben F., 2010, P ACM MULT MM FLOR I, P1459, DOI 10.1145/1873951.1874246
Goberman AM, 2010, J NEUROLINGUIST, V23, P470, DOI 10.1016/j.jneuroling.2008.11.001
Godino-Llorente JI, 2004, IEEE T BIO-MED ENG, V51, P380, DOI 10.1109/TBME.2003.820386
Harel BT, 2004, J NEUROLINGUIST, V17, P439, DOI 10.1016/j.jneuroling.2004.06.001
Hariharan M, 2012, J MED SYST, V36, P1821, DOI 10.1007/s10916-010-9641-6
HARTELIUS L, 1994, FOLIA PHONIATR LOGO, V46, P9
Hawkins DM, 2005, TECHNOMETRICS, V47, P164, DOI 10.1198/004017004000000644
Henriquez P, 2009, IEEE T AUDIO SPEECH, V17, P1186, DOI 10.1109/TASL.2009.2016734
Hirano M, 1981, CLIN EXAMINATION VOI
Hornykiewicz O, 1998, NEUROLOGY, V51, pS2
Jacobson BH, 1997, AM J SPEECH-LANG PAT, V6, P66
Kay Elemetrics Corp, 2003, MULTIDIMENSIONAL VOI
Kent RD, 1999, J COMMUN DISORD, V32, P141, DOI 10.1016/S0021-9924(99)00004-0
KENT RD, 1989, J SPEECH HEAR DISORD, V54, P482
Lastovka M., 1998, 2371998C1LF
Lechta V, 2004, STUTTERING
Maier A, 2009, SPEECH COMMUN, V51, P425, DOI 10.1016/j.specom.2009.01.004
Mak B., 1996, 4 INT C SPOK LANG PR, V4
Middag C, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1745
Noth E., 2000, P 4 INT C SPOK LANG, P65
Prochazka A, 2008, 2008 3RD INTERNATIONAL SYMPOSIUM ON COMMUNICATIONS, CONTROL AND SIGNAL PROCESSING, VOLS 1-3, P719, DOI 10.1109/ISCCSP.2008.4537317
Ravikumar K.M., 2009, ICGST INT J DIGITAL, V9, P19
Reeves J, 2007, J APPL METEOROL CLIM, V46, P900, DOI 10.1175/JAM2493.1
RILEY GD, 1972, J SPEECH HEAR DISORD, V37, P314
Robertson S., 1986, WORKING DYSARTHRICS
Rosen KM, 2011, INT J SPEECH-LANG PA, V13, P165, DOI 10.3109/17549507.2011.529939
Rosen KM, 2006, J SPEECH LANG HEAR R, V49, P395, DOI 10.1044/1092-4388(2006/031)
Ruanaidh J. J., 1996, SERIES STAT COMPUTIN
Ruben RJ, 2000, LARYNGOSCOPE, V110, P241, DOI 10.1097/00005537-200002010-00010
Ruggiero C, 1999, J TELEMED TELECARE, V5, P11, DOI 10.1258/1357633991932333
Rusz J, 2011, J ACOUST SOC AM, V129, P350, DOI 10.1121/1.3514381
Rusz J, 2011, MOVEMENT DISORD, V26, P1951, DOI 10.1002/mds.23680
Sapir S, 2010, J SPEECH LANG HEAR R, V53, P114, DOI 10.1044/1092-4388(2009/08-0184)
Singh N, 2007, PROG NEUROBIOL, V81, P29, DOI 10.1016/j.pneurobio.2006.11.009
Sooful J. J., 2001, P 12 S PATT REC ASS, P99
Su HY, 2008, INT CONF ACOUST SPEE, P4513
Teesson K, 2003, J SPEECH LANG HEAR R, V46, P1009, DOI 10.1044/1092-4388(2003/078)
Tsanas A, 2011, J R SOC INTERFACE, V8, P842, DOI 10.1098/rsif.2010.0456
Ureten O., 1999, IEEE EURASIP WORKSH, P830
Van Borsel J, 2003, INT J LANG COMM DIS, V38, P119, DOI 10.1080/1368282021000042902
Western B, 2004, POLIT ANAL, V12, P354, DOI 10.1093/pan/mph023
Wisniewski M., 2007, COMPUTER RECOGNITION, V2, P445
Wong H, 2006, J HYDROL, V324, P323, DOI 10.1016/j.jhydrol.2005.10.007
Yairi E., 1999, J SPEECH LANG HEAR R, V42, P1098
NR 67
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 178
EP 189
DI 10.1016/j.specom.2012.08.003
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900014
ER
PT J
AU Zuo, X
Sumii, T
Iwahashi, N
Nakano, M
Funakoshi, K
Oka, N
AF Zuo, Xiang
Sumii, Taisuke
Iwahashi, Naoto
Nakano, Mikio
Funakoshi, Kotaro
Oka, Natsuki
TI Correcting phoneme recognition errors in learning word pronunciation
through speech interaction
SO SPEECH COMMUNICATION
LA English
DT Article
DE Pronunciation learning; Interactive Phoneme Update; Phoneme recognition
ID AUTOMATIC TRANSCRIPTION; UNKNOWN WORDS; SYSTEM; ALGORITHM
AB This paper presents a method called Interactive Phoneme Update (IPU) that enables users to teach systems the pronunciation (phoneme sequences) of words in the course of speech interaction. Using the method, users can correct mis-recognized phoneme sequences by repeatedly making correction utterances according to the system's responses. The novel aspects of this method are: (1) word-segment-based correction, which allows users to use word segments for locating mis-recognized phonemes based on open-begin-end dynamic programming matching and generalized posterior probability, and (2) history-based correction, which utilizes information from the phoneme sequences recognized and corrected earlier in the course of interactive learning of each word. Experimental results show that the proposed IPU method reduces the error rate by a factor of three over a previously proposed maximum-likelihood-based method. (C) 2012 Elsevier B.V. All rights reserved.
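A minimal sketch of open-begin-end DP matching as used to locate a corrected segment inside a longer recognized phoneme string (unit edit costs and the toy phoneme strings are illustrative assumptions, and the generalized-posterior weighting described in the abstract is omitted):

    import numpy as np

    def open_begin_end_match(segment, sequence, sub=1.0, ins=1.0, dele=1.0):
        # DP alignment of a short segment against any substring of a longer
        # sequence: the start and end positions in the long sequence are free.
        m, n = len(segment), len(sequence)
        D = np.full((m + 1, n + 1), np.inf)
        D[0, :] = 0.0                              # free beginning
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                match = D[i - 1, j - 1] + (0.0 if segment[i - 1] == sequence[j - 1] else sub)
                D[i, j] = min(match, D[i - 1, j] + dele, D[i, j - 1] + ins)
        end = int(np.argmin(D[m, 1:])) + 1         # free ending
        return float(D[m, end]), end

    recognized = "k o m p u t a a".split()         # toy recognized phoneme string
    correction = "p j u t a".split()               # toy correction-utterance segment
    cost, end = open_begin_end_match(correction, recognized)
    print(cost, end)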
C1 [Zuo, Xiang; Sumii, Taisuke; Oka, Natsuki] Kyoto Inst Technol, Sakyo Ku, Kyoto 6068585, Japan.
[Iwahashi, Naoto] Natl Inst Informat & Commun Technol, Sora Ku, Kyoto 6190289, Japan.
[Nakano, Mikio; Funakoshi, Kotaro] Honda Res Inst Japan Co Ltd, Wako, Saitama 3510188, Japan.
RP Zuo, X (reprint author), Kyoto Inst Technol, Sakyo Ku, Hashigami Cho, Kyoto 6068585, Japan.
EM edgarzx@gmail.com
CR Bael C., 2007, COMPUT SPEECH LANG, V21, P652
Bansal D, 2009, INT CONF ACOUST SPEE, P4293, DOI 10.1109/ICASSP.2009.4960578
Bazzi I., 2002, P ICSLP, P1613
Chang S., 2000, P 6 INT C SPOK LANG, P330
Chung G., 2003, P HLT NAACL EDM CAN, P32
ELVIRA JM, 1998, ACOUST SPEECH SIG PR, P849
HAEBUMBACH R, 1995, INT CONF ACOUST SPEE, P840, DOI 10.1109/ICASSP.1995.479825
Holzapfel H, 2008, ROBOT AUTON SYST, V56, P1004, DOI 10.1016/j.robot.2008.08.012
KUREMATSU A, 1990, SPEECH COMMUN, V9, P357, DOI 10.1016/0167-6393(90)90011-W
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Leitner C., 2010, P 7 C INT LANG RES E, P3278
Mohamed A.-R., 2009, P INT C COMP GRAPH I, P1
Nakagawa S., 2006, P IEICE GEN C, P13
Nakamura S, 2006, IEEE T AUDIO SPEECH, V14, P365, DOI 10.1109/TSA.2005.860774
Nakano M., 2010, P NAT C ART INT AAAI, P74
Parada C., 2010, P N AM CHAPT ASS COM, P216
Rastrow A, 2009, INT CONF ACOUST SPEE, P3953, DOI 10.1109/ICASSP.2009.4960493
SAKOE H, 1979, IEEE T ACOUST SPEECH, V27, P588, DOI 10.1109/TASSP.1979.1163310
Soong F.K., 2004, P SPEC WORKSH MAUI S
Sun H., 2003, P EUR, P2713
Svendsen T., 1995, P EUROSPEECH, P783
Waibel A., 2000, VERBMOBIL FDN SPEECH, P33
Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002
WU JX, 1999, ACOUST SPEECH SIG PR, P589
Yazgan A., 2004, IEEE ICASP PROCESS, V1, P745
Zuo X., 2010, P 3 IEEE WORKSH SPOK, P348
NR 26
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2013
VL 55
IS 1
BP 190
EP 203
DI 10.1016/j.specom.2012.08.008
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 055NF
UT WOS:000312422900015
ER
PT J
AU Moattar, MH
Homayounpour, MM
AF Moattar, M. H.
Homayounpour, M. M.
TI A review on speaker diarization systems and approaches
SO SPEECH COMMUNICATION
LA English
DT Review
DE Speaker indexing; Speaker diarization; Speaker segmentation; Speaker
clustering; Speaker tracking
ID BAYESIAN INFORMATION CRITERION; LONG-TERM FEATURES; BROADCAST NEWS;
SPEECH RECOGNITION; AUTOMATIC SEGMENTATION; TRANSCRIPTION SYSTEM;
MICROPHONE MEETINGS; AUDIO SEGMENTATION; MODEL; CLASSIFICATION
AB Speaker indexing or diarization is an important task in audio processing and retrieval. Speaker diarization is the process of labeling a speech signal with labels corresponding to the identity of speakers. This paper presents a comprehensive review of the evolution of the technology and of the different approaches to speaker indexing, and offers a detailed discussion of these approaches and their contributions. It reviews the most common features for speaker diarization, in addition to the most important approaches for speech activity detection (SAD) in diarization frameworks. The two main tasks of speaker indexing are speaker segmentation and speaker clustering, and the approaches proposed for each of these subtasks are reviewed separately. Speaker diarization systems that combine the two tasks in a unified framework are also introduced. A further discussion concerns approaches for online speaker indexing, which differ fundamentally from traditional offline approaches. The most common performance measures and evaluation datasets are also introduced. To conclude, a complete framework for speaker indexing is proposed, which is intended to be domain-independent, parameter-free, and applicable to both online and offline applications. (c) 2012 Elsevier B.V. All rights reserved.
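As one concrete example of the segmentation techniques surveyed, here is a sketch of the widely used delta-BIC speaker-change criterion (full-covariance Gaussians, an illustrative penalty weight, and synthetic features; this is a generic textbook formulation, not a method proposed by the review):

    import numpy as np

    def delta_bic(X, t, lam=1.0):
        # Delta-BIC at frame t for a window of feature vectors X (frames x dims):
        # one full-covariance Gaussian for the whole window versus one per side.
        # Positive values favour placing a speaker-change point at t.
        def logdet_cov(Z):
            cov = np.cov(Z, rowvar=False) + 1e-6 * np.eye(Z.shape[1])
            return np.linalg.slogdet(cov)[1]
        n, d = X.shape
        penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
        gain = 0.5 * (n * logdet_cov(X) - t * logdet_cov(X[:t]) - (n - t) * logdet_cov(X[t:]))
        return gain - lam * penalty

    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(0.0, 1.0, (200, 12)),     # "speaker 1" frames
                   rng.normal(1.5, 1.0, (200, 12))])    # "speaker 2" frames
    print(delta_bic(X, 200) > 0)                        # True: a change is detected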
C1 [Moattar, M. H.; Homayounpour, M. M.] Amirkabir Univ Technol, Lab Intelligent Multimedia Proc IMP, Comp Engn & Informat Technol Dept, Tehran, Iran.
RP Moattar, MH (reprint author), Amirkabir Univ Technol, Lab Intelligent Multimedia Proc IMP, Comp Engn & Informat Technol Dept, Tehran, Iran.
EM moattar@aut.ac.ir; homayoun@aut.ac.ir
FU Telecommunication Research Center (ITRC) [T/500/14939]
FX The authors would like to thank Iran Telecommunication Research Center
(ITRC) for supporting this work under contract No. T/500/14939.
CR Ajmera J, 2004, IEEE SIGNAL PROC LET, V11, P649, DOI 10.1109/LSP.2004.831666
Ajmera J., 2002, P INT C SPOK LANG PR, P573
Ajmera J., 2004, P ICASSP, V1, P605
Ajmera J, 2003, P IEEE WORKSH AUT SP, P411
Ajmera J., 2002, IMPROVED UNKNOWN MUL
Akita Y., 2003, P EUROSPEECH, P2985
Alabiso J., 1998, ENGLISH BROADCAST NE
Anguera X., 2006, P OD, P1
Anguera X., 2006, THESIS U POLITECNICA
Anguera X, BEAMFORMIT FAST ROBU
Anguera X., 2004, 3 JORN TECN HABL VAL
Anguera X., 2011, P ICASSP
Anguera X., 2007, LECT NOTES COMPUTER, V4299
Anguera X., 2006, P 2 INT WORKSH MULT
Anguera X., 2006, P 9 INT C SPOK LANG
Anguera X, 2005, LECT NOTES COMPUT SC, V3869, P402
[Anonymous], OP SOURC PLATF BIOM
[Anonymous], 2006, NIST FALL RICH TRANS
[Anonymous], 2006, P 2 INT C IM VID RET, P488
[Anonymous], CHAIR COMP SCI 6
Antolin A.G., 2007, IEEE T COMPUT, V56, P1212
Arias J.A., 2005, P 13 EUR SIGN PROC C
Bakis R., 1997, P SPEECH REC WORKSH, P67
Barras C., 2004, P FALL RICH TRANSCR
Barras C, 2006, IEEE T AUDIO SPEECH, V14, P1505, DOI 10.1109/TASL.2006.878261
BARRAS C, 2003, ACOUST SPEECH SIG PR, P49
Basseville M., 1993, DETECTION ABRUPT CHA
BENHARUSH O, 2008, P INTERSPEECH, P24
Biatov K, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2082
Bijankhan M, 2002, D91 SECURESCM
Bimbot F., 1993, P EUROSPEECH, P169
Boakye K, 2008, INT CONF ACOUST SPEE, P4353, DOI 10.1109/ICASSP.2008.4518619
Boakye K., 2008, THESIS U CALIFORNIA
Boehm C, 2009, INT CONF ACOUST SPEE, P4081, DOI 10.1109/ICASSP.2009.4960525
Bozonnet S, 2010, INT CONF ACOUST SPEE, P4958, DOI 10.1109/ICASSP.2010.5495088
Brandstein M., 2001, EXPLICIT SPEECH MODE
Burger S, 2002, P INT C SPOK LANG PR, P301
Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714
Campbell N., 2006, P WORKSH PROGR
Castaldo F, 2008, INT CONF ACOUST SPEE, P4133, DOI 10.1109/ICASSP.2008.4518564
Cetin O, 2006, INT CONF ACOUST SPEE, P357
Cettolo M., 2005, COMPUT SPEECH LANG, V19, P1004
Cettolo M., 2003, P ICASSP HONG KONG C, P537
Chen SS, 2002, SPEECH COMMUN, V37, P69, DOI 10.1016/S0167-6393(01)00060-7
CHEN SS, 1998, ACOUST SPEECH SIG PR, P645
Chen T., 1996, P ICASSP, V4, P2056
Cheng S., 2004, P 8 INT C SPOK LANG, P1617
Chu SM, 2008, INT CONF ACOUST SPEE, P4329, DOI 10.1109/ICASSP.2008.4518613
Chu SM, 2009, INT CONF ACOUST SPEE, P4089, DOI 10.1109/ICASSP.2009.4960527
Cohen I., 2002, P 22 CONV EL EL ENG
Collet M., 2003, P 2003 IEEE INT C AC, P713
COX H, 1987, IEEE T ACOUST SPEECH, V35, P1365, DOI 10.1109/TASSP.1987.1165054
COX H, 1986, IEEE T ACOUST SPEECH, V34, P393, DOI 10.1109/TASSP.1986.1164847
Delacourt P., 1999, P EUR C SPEECH COMM, V3, P1195
Delacourt P, 2000, SPEECH COMMUN, V32, P111, DOI 10.1016/S0167-6393(00)00027-3
Delphine C, 2010, INT CONF ACOUST SPEE, P4966, DOI 10.1109/ICASSP.2010.5495090
Deshayes J., 1986, ONLINE STAT ANAL CHA
DESOBRY F, 2003, ACOUST SPEECH SIG PR, P872
El-Khoury E, 2009, INT CONF ACOUST SPEE, P4097, DOI 10.1109/ICASSP.2009.4960529
Ellis D.P.W., 2004, P NIST M REC WORKSH
Evans NWD, 2009, INT CONF ACOUST SPEE, P4061, DOI 10.1109/ICASSP.2009.4960520
Fernandez D., 2009, P INT BRIGHT UK, P849
Hoshuyama O., 1999, IEEE T SIGNAL PROCES
Fischer S., 1997, P ICASSP
Fiscus J. G., 2004, P FALL 2004 RICH TRA
Fiscus JG, 2005, LECT NOTES COMPUT SC, V3869, P369
Fisher JW, 2004, IEEE T MULTIMEDIA, V6, P406, DOI 10.1109/TMM.2004.827503
Fisher J.W., 2000, P NEUR INF PROC SYST, P772
Flanagan J, 1994, J ACOUST SOC AM, V78, P1508
Fredouille C., 2004, P NIST 2004 SPRING R
Fredouille C., 2006, P MLMI 06 WASH US
Friedland AG, 2009, INT CONF ACOUST SPEE, P4077, DOI 10.1109/ICASSP.2009.4960524
Friedland G, 2009, IEEE T AUDIO SPEECH, V17, P985, DOI 10.1109/TASL.2009.2015089
Gales MJF, 2006, IEEE T AUDIO SPEECH, V14, P1513, DOI 10.1109/TASL.2006.878264
Galliano S, 2006, P LANG EV RES C
Garofolo J, 2002, NIST RICH TRANSCRIPT
Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC
Gauvain J.L., 1998, P INT C SPOK LANG PR, P1335
Ghahramani Z, 1997, MACH LEARN, V29, P245, DOI 10.1023/A:1007425814087
Graff D, 2001, TDT3 MANDARIN AUDIO
Gravier G., 2010, AUDIOSEG AUDIO SEGME
Griffiths L., 1982, IEEE T ANTENNAS PROP
Hain T., 1998, P DARPA BROADC NEWS, P133
Han K. J., 2008, P ICASSP 2008 MAR, P4373
Hansen JH, 2005, IEEE T SPEECH AUDI P, V13, P712, DOI 10.1109/TSA.2005.852088
Harb H., 2006, INT J DIGITAL LIB, V6, P70, DOI 10.1007/s00799-005-0120-5
Heck L., 1997, P EUR RHOD GREEC
Huang C.H., 2004, P INT S CHIN SPOK LA, P109
Huang Y., 2007, P IEEE AUT SPEECH RE, P693
Huijbregts M., 2007, LECT NOTES COMPUTER, V4816
Hung H., 2008, P CVPR WORKSH HUM CO
Hung J., 2000, P ICASLP BEIJ CHIN
Imseng D, 2010, IEEE T AUDIO SPEECH, V18, P2028, DOI 10.1109/TASL.2010.2040796
Istrate D, 2005, LECT NOTES COMPUT SC, V3869, P428
Izmirli O, 2000, P INT S MUS INF RETR
Jain A. K., 1988, ALGORITHMS CLUSTERIN
Janin A., 2003, P ICCASP HONG KONG
Jin H., 1997, P DARPA SPEECH REC W, P108
Jin Q., 2004, P NIST M REC WORKSH, P112
Johnson D, 1993, ARRAY SIGNAL PROCESS
Johnson S., 1998, P 5 INT C SPOK LANG, P1775
Johnson S, 1999, P EUR BUD HUNG
JOHNSON SE, 2000, ACOUST SPEECH SIG PR, P1427
Jothilakshmi S., 2009, ENG APPL ARTIFICIAL, V22
Juang B., 1985, AT T TECHNICAL J
Kaneda Y., 1991, Journal of the Acoustical Society of Japan (E), V12
Kataoka A., 1990, Journal of the Acoustical Society of Japan (E), V11
Kemp T, 2000, INT CONF ACOUST SPEE, P1423, DOI 10.1109/ICASSP.2000.861862
Kim HG, 2005, INT CONF ACOUST SPEE, P745
Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6
Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009
Kinnunen T., 2008, P SPEAK LANG REC WOR
Koshinaka T, 2009, INT CONF ACOUST SPEE, P4093, DOI 10.1109/ICASSP.2009.4960528
Kotti M., 2006, P 2006 IEEE INT C MU, P1101
Kotti M, 2008, SIGNAL PROCESS, V88, P1091, DOI 10.1016/j.sigpro.2007.11.017
Kotti M, 2008, IEEE T AUDIO SPEECH, V16, P920, DOI 10.1109/TASL.2008.925152
Kotti M., 2006, P 2006 IEEE INT S CI
Kristjansson T., 2005, P ICSLP LISB PORT
Kubala F, 1997, P SPEECH REC WORKSH, P90
Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308
Kuhn R., 1998, P ICSLP, P1771
Kwon S., 2002, P INT C SPOK LANG PR, P2537
Kwon S., 2004, P ICSLP, P1517
Kwon S., 2004, IEEE T SPEECH AUDIO, V13, P1004
Kwon S., 2003, P IEEE AUT SPEECH RE, P423
Kwon S., 2003, P EUR
Lapidot I., 2002, 0260 IDIAP
Lapidot I., 2001, P 2001 SPEAK OD SPEA, P169
Lathoud G., 2004, P ICASSP NIST M REC
Leeuwen D., 2008, LNCS, V4625, P475
Liu D., 2004, P ICASSP MAY, P333
LIU D, 1999, P ESCA EUR 99 BUD HU, V3, P1031
Liu D., 2003, P IEEE INT C AC SPEE, P572
Lopez J.F., 2000, P ICSLP BEIJ CHIN
Lu L., 2002, P ICPR QUEB CIT CAN, V2
Lu L, 2005, MULTIMEDIA SYST, V10, P332, DOI 10.1007/s00530-004-0160-5
Lu L, 2002, IEEE T SPEECH AUDI P, V10, P504, DOI 10.1109/TSA.2002.804546
Lu L., 2002, P 10 ACM INT C MULT, P602
Malegaonkar A, 2006, IEEE SIGNAL PROC LET, V13, P509, DOI 10.1109/LSP.2006.873656
Markov K, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P363
Markov K., 2007, P ASRU, P699
Markov K., 2007, P INTERSPEECH
Marro C., 1998, IEEE T SPEECH AUDIO
Martin A., 2001, P EUR C SPEECH COMM, V2, P787
McCowan IA, 2000, INT CONF ACOUST SPEE, P1723, DOI 10.1109/ICASSP.2000.862084
McNeill D., 2000, LANGUAGE AND GESTURE
Meignier S., 2001, P OD SPEAK LANG REC, P175
Meignier S., 2005, COMPUT SPEECH LA SEP, P303
Meignier S, 2006, COMPUT SPEECH LANG, V20, P303, DOI 10.1016/j.csl.2005.08.002
Meignier Sylvain, 2010, CMU SPUD WORKSH
Meinedo H., 2003, P ICASSP HONG KONG C
Mesgarani N., 2004, P ICASSP, V1, P601
Mirghafori N., 2006, P ICASSP
Moattar M. H., 2009, P EUSIPCO, P2549
Moattar M.H., 2009, P 14 INT COMP SOC IR, P501
Moh Y., 2003, P IEEE INT C AC SPEE, P85
MORARU D, 2003, ACOUST SPEECH SIG PR, P89
Moraru D., 2004, P OD 2004 SPEAK LANG
Moraru D., 2003, P HUM COMP DIAL BUCH
Moraru D., 2004, P ICASSP MONTR CAN
Mori K, 2001, INT CONF ACOUST SPEE, P413, DOI 10.1109/ICASSP.2001.940855
Muthusamy Y. K., 1992, P INT C SPOK LANG PR, P895
Neal RM, 1998, NATO ADV SCI I D-BEH, V89, P355
Nguyen P., 2002, P RICH TRANSCR WORKS
Nguyen TH, 2009, INT CONF ACOUST SPEE, P4085, DOI 10.1109/ICASSP.2009.4960526
Nishida M, 2003, 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, P172
NIST, 2002, RICH TRANSCR EV PROJ
Nock H.J., 2003, LECT NOTES COMPUTER, V2728, P565
Noulas A.K., 2009, P COMP VIS IM UND
Noulas AK, 2007, ICMI'07: PROCEEDINGS OF THE NINTH INTERNATIONAL CONFERENCE ON MULTIMODAL INTERFACES, P350
Nwe T.L., 2010, P ICASSP, P4073
Omar M., 2005, P ICASSP
Otero P.L., 2010, P ICASSP, P4970
Ouellet P., 2005, P ICASSP LISB PORT
Pardo JM, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2194
Pardo JM, 2006, LECT NOTES COMPUT SC, V4299, P257
Pelecanos J., 2001, P ISCA SPEAK REC WOR
Pellom BL, 1998, SPEECH COMMUN, V25, P97, DOI 10.1016/S0167-6393(98)00031-4
Pfau T., 2001, P EUR
Rao R., 1996, P INT PICT COD S
Reynolds D. A., 2004, P FALL 2004 RICH TRA
Roch M., 2004, P SPEAK OD, P349
Rosca J., 2003, P ICASSP
Rougui J., 2006, P ICASSP TOUL FRANC
Sanchez-Bote J., 2003, P ICASSP
Sankar A., 1998, P DARPA BROADC NEWS
Sankar A., 1995, P EUR MADR SPAIN
SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136
Shriberg E., 2007, SERIES LECT NOTES CO, V4343, P241, DOI 10.1007/978-3-540-74200-5_14
Shriberg Elizabeth, 2001, P EUROSPEECH, P1359
Sian Cheng S., 2003, P EUR GEN SWITZ
Siegler M., 1997, P DARPA SPEECH REC W, P97
Sinha R., 2005, P EUR C SPEECH COMM, P2437
Siracusa M., 2007, P ICASSP
Sivakumaran P., 2001, P EUR SCAND
SOLOMONOFF A, 1998, ACOUST SPEECH SIG PR, P757
Stafylakis T., 2009, P INTERSPEECH
Stafylakis T, 2010, INT CONF ACOUST SPEE, P4978, DOI 10.1109/ICASSP.2010.5495076
Stern R., 1997, P DARPA SPEECH REC W
Stolcke A, 2010, INT CONF ACOUST SPEE, P4390, DOI 10.1109/ICASSP.2010.5495626
Sturim D., 2001, P ICASSP SALT LAK CI
Sun H., 2009, P INTERSPEECH
Sun HW, 2010, INT CONF ACOUST SPEE, P4982, DOI 10.1109/ICASSP.2010.5495077
Thyes O., 2000, P INT C SPOK LANG PR, V2, P242
Tranter S, 2005, P ICASSP MONTR CAN
Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256
Tranter S.E., 2004, P 2004 IEEE INT C AC, P433
Tritschler A, 1999, P EUROSPEECH, P679
Trueba-Hornero B., 2008, THESIS U POLITECNICA
Tsai W.H., 2004, P ICASLP JEJ ISL KOR
Vajaria H, 2006, INT C PATT RECOG, P1150
Valin J., 2004, P ICASSP
van Leeuwan D.A, 2005, P MACH LEARN MULT IN, P440
Vandecatseye A., 2004, P LREC LISB PORT
Vescovi M., 2003, P 8 EUR C SPEECH COM, P2997
Vijayasenan D., 2007, P IEEE AUT SPEECH RE, P250
Voitovetsky I., 1998, P 1 WORKSH TEXT SPEE, P321
Voitovetsky I, 1997, P IEEE WORKSH NEUR N, P578
Wactlar H., 1996, P ARPA STL WORKSH
Wang D, 2003, 2003 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, P468
Wang W, 2007, LECT NOTES COMPUT SC, V4477, P555
WARD JH, 1963, J AM STAT ASSOC, V58, P236, DOI 10.2307/2282967
Wegman S, 1997, P DARPA BROADC NEWS
Wolfel M., 2009, P INTERSPEECH
Woodland P., 1997, P SPEECH REC WORKSH, P73
Wooters C., 2004, P FALL 2004 RICH TRA
Wooters C., 2008, LNCS, V4625
Wu CH, 2006, IEEE T AUDIO SPEECH, V14, P647, DOI 10.1109/TSA.2005.852988
Wu T., 2003, P ICME 03, V2, P721
Wu T., 2003, P INT C MULT MOD
WU TY, 2003, ACOUST SPEECH SIG PR, P193
Yamaguchi M., 2005, P ICASLP
Yoo IC, 2009, ETRI J, V31, P451, DOI 10.4218/etrij.09.0209.0104
Zamalloa, 2010, P ICASSP, P4962
Zdansky J., 2006, P INTERSPEECH, P2186
Zelinski R., 1988, P IEEE INT C AC SPEE, V5, P2578
Zhang C., 2006, P IEEE INT WORKSH MU
Zhang X., 2004, P ICASSP
Zhou B., 2000, P INT C SPOK LANG PR, P714
Zhou BW, 2005, IEEE T SPEECH AUDI P, V13, P467, DOI 10.1109/TSA.2005.845790
Zhu X, 2006, LECT NOTES COMPUT SC, V4299, P396
Zhu X, 2008, LECT NOTES COMPUT SC, V4625, P533
Zhu X., 2005, P EUR C SPEECH COMM
Zhu YM, 2003, IEEE T SYST MAN CY A, V33, P502, DOI 10.1109/TSMCA.2003.809211
Zochova P., 2005, P ICSLP LISB PORT
NR 245
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2012
VL 54
IS 10
BP 1065
EP 1103
DI 10.1016/j.specom.2012.05.002
PG 39
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 988LK
UT WOS:000307492300001
ER
PT J
AU Karjigi, V
Rao, P
AF Karjigi, V.
Rao, P.
TI Classification of place of articulation in unvoiced stops with
spectro-temporal surface modeling
SO SPEECH COMMUNICATION
LA English
DT Article
DE Place of articulation; Unvoiced stops; Spectro-temporal features
ID SPEECH RECOGNITION; AUDITORY-CORTEX; CONSONANTS; PERCEPTION; FEATURES;
UTTERANCES
AB Unvoiced stops are rapidly varying sounds with acoustic cues to place identity linked to the temporal dynamics. Neurophysiological studies have indicated the importance of joint spectro-temporal processing in the human perception of stops. In this study, two distinct approaches to modeling the spectro-temporal envelope of unvoiced stop phone segments are investigated with a view to obtaining a low-dimensional feature vector for automatic place classification. Classification accuracies on the TIMIT database and a Marathi words dataset show the overall superiority of the classifier combination of polynomial surface coefficients and 2D-DCT. A comparison of performance with published results on the place classification of stops revealed that the proposed spectro-temporal feature systems improve upon the best previous systems' performances. The results indicate that joint spectro-temporal features may be usefully incorporated in hierarchical phone classifiers based on diverse class-specific features. (c) 2012 Elsevier B.V. All rights reserved.
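A minimal sketch of the 2D-DCT half of such a feature set (patch size, number of retained coefficients, and the toy input are illustrative assumptions; the polynomial-surface coefficients used alongside it are not shown):

    import numpy as np
    from scipy.fftpack import dct

    def spectro_temporal_dct(patch, keep=(4, 4)):
        # Low-dimensional spectro-temporal features: 2-D DCT of a (freq x time)
        # log-spectral patch, keeping only the lowest-order coefficients.
        C = dct(dct(patch, type=2, norm='ortho', axis=0), type=2, norm='ortho', axis=1)
        return C[:keep[0], :keep[1]].ravel()

    rng = np.random.default_rng(0)
    patch = np.log(np.abs(rng.standard_normal((40, 20))) + 1e-3)  # toy 40-band x 20-frame segment
    print(spectro_temporal_dct(patch).shape)                      # (16,)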
C1 [Karjigi, V.; Rao, P.] Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India.
RP Karjigi, V (reprint author), Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India.
EM veena@ee.iitb.ac.in; prao@ee.iitb.ac.in
FU IIT Bombay under the National Programme on Perception Engineering;
Department of Information Technology, MCIT, Government of India
FX The authors are grateful to the anonymous reviewers for their valuable
comments on a previous version of the submission. The research is partly
supported by a project grant to IIT Bombay under the National Programme
on Perception Engineering, sponsored by the Department of Information
Technology, MCIT, Government of India.
CR AHMED R, 1969, J ACOUST SOC AM, V45, P758, DOI 10.1121/1.1911459
Bonneau A, 1996, J ACOUST SOC AM, V100, P555, DOI 10.1121/1.415866
Bouvrie J, 2008, INT CONF ACOUST SPEE, P4733, DOI 10.1109/ICASSP.2008.4518714
Bunnell H.T., 2004, P INT 04 JEJ ISL KOR, P1313
Chen B., 2004, P ICSLP, P612
Chi T, 2005, J ACOUST SOC AM, V118, P887, DOI 10.1121/1.1945807
DATTA AK, 1980, IEEE T ACOUST SPEECH, V28, P85, DOI 10.1109/TASSP.1980.1163354
Depireux DA, 2001, J NEUROPHYSIOL, V85, P1220
Dyer SA, 2001, IEEE INSTRU MEAS MAG, V4, P4
Ellis D.P.W., 1997, CORRELATION FEATURE
Enzinger E., 2010, P AUD ENG SOC INT C, P47
Ezzat T., 2007, P INT C SPOK LANG PR, P506
Fant G., 1973, STOPS CV SYLLABLES S
Feijoo S, 1999, SPEECH COMMUN, V27, P1, DOI 10.1016/S0167-6393(98)00064-8
FORREST K, 1988, J ACOUST SOC AM, V84, P115, DOI 10.1121/1.396977
Gish H, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P466
Glass JR, 2003, COMPUT SPEECH LANG, V17, P137, DOI 10.1016/S0885-2308(03)00006-8
Goldenthal W.D., 1994, THESIS MIT CAMBRIDGE
Halberstadt A. K., 1998, THESIS MIT CAMBRIDGE
Harder D.W., 2010, NUMERICAL ANAL ENG
Hazen T., 1998, ACOUST SPEECH SIG PR, P653
Hermansky H., 1999, P AUT SPEECH REC UND, P63
Hou J., 2007, P INT 07 ANTW BELG, P1929
Kajarekar SS, 2001, INT CONF ACOUST SPEE, P137, DOI 10.1109/ICASSP.2001.940786
KEWLEYPORT D, 1983, J ACOUST SOC AM, V73, P322, DOI 10.1121/1.388813
Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416
Lamel L.F., 1986, DARPA SPEECH REC WOR, P61
Lee DD, 1999, NATURE, V401, P788
Mesgarani N., 2009, P INT 09, P2983
Meyer BT, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P906
Morgan N, 2005, IEEE SIGNAL PROC MAG, V22, P81, DOI 10.1109/MSP.2005.1511826
Neagu A., 1998, P ICSLP, P2127
Niyogi P, 2003, SPEECH COMMUN, V41, P349, DOI 10.1016/S0167-6393(02)00151-6
NOSSAIR ZB, 1991, J ACOUST SOC AM, V89, P2978, DOI 10.1121/1.400735
Obleser J, 2010, FRONT PSYCHOL, V1, DOI 10.3389/fpsyg.2010.00232
Ohala M., 1995, INT C PHON SCI STOCK, P22
Ohala M., 1998, P INT C SPOK LANG PR, P2795
Pandey PC, 2009, IEEE T AUDIO SPEECH, V17, P277, DOI 10.1109/TASL.2008.2010285
Patil V., 2009, P INT 09 BRIGHT UK, P2543
Prasanna SRM, 2001, P SIGN P COMM BANG I, P81
Rifkin R., 2007, MITCSAILTR2007007
Sekhar CC, 2002, IEEE T SPEECH AUDI P, V10, P472, DOI 10.1109/TSA.2002.804298
Smits R, 1996, J ACOUST SOC AM, V100, P3852, DOI 10.1121/1.417241
Sroka JJ, 2005, SPEECH COMMUN, V45, P401, DOI 10.1016/j.specom.2004.11.009
Wang XH, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1221
NR 45
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2012
VL 54
IS 10
BP 1104
EP 1120
DI 10.1016/j.specom.2012.04.007
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 988LK
UT WOS:000307492300002
ER
PT J
AU Ozimek, E
Kutzner, D
Libiszewski, P
AF Ozimek, Edward
Kutzner, Dariusz
Libiszewski, Pawel
TI Speech intelligibility tested by the Pediatric Matrix Sentence test in
3-6 year old children
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility in children; Sentence matrix test; Speech
reception threshold
ID RECEPTION THRESHOLD; NOISE
AB Objective: The present study was aimed at the development and application of the Polish Pediatric Matrix Sentence Test (PPMST) for testing speech intelligibility in normal-hearing (NH) and hearing-impaired (HI) children aged 3-6.
Methods & Procedures: The test was based on sentences of the subject-verb-object pattern. Pictures illustrating PPMST utterances were prepared and the picture-point (PP) method was used for administering the 1-up/1-down adaptive procedure converging the signal to noise ratio (SNR) to the speech reception threshold (SRT). The correctness of verbal responses (VR), preceding PP responses, was also judged.
Outcomes & Results: The normative SRT for the PP method was shown to decrease with age. The guessing rate (gamma) turned out to be close to the theoretical value for forced-choice procedures, gamma = 1/n, where n = 6 for the six-alternative PP method (gamma approximate to 0.166) and n = 4 for the four-alternative PP method (gamma approximate to 0.25). Test optimization resulted in minimizing the lapse rate (lambda) (ratio gamma/lambda approximate to 8.0 for n = 4 and gamma/lambda approximate to 5.6 for n = 6, both for NH and HI children). Significantly higher SRTs were observed for HI children than for the NH group.
Conclusions & Implications: For children aged 3-6, tested by the developed PPMST, speech intelligibility performance, for both the VR and PP method, increases with age. For each age group, significantly worse intelligibility was observed for HI children than for NH children. The PPMST combined with the PP method is a reliable tool for pediatric speech intelligibility measurements. (c) 2012 Elsevier B.V. All rights reserved.
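A rough sketch of a 1-up/1-down adaptive SNR track of the kind described (the step size, trial count, reversal-averaging rule, and simulated listener are illustrative assumptions, not the PPMST parameters):

    import math
    import random

    def one_up_one_down(run_trial, start_snr=10.0, step=2.0, n_trials=30):
        # 1-up/1-down staircase: lower the SNR after a correct response and raise
        # it after an error; average the later reversal points as a crude SRT estimate.
        snr, reversals, last_correct = start_snr, [], None
        for _ in range(n_trials):
            correct = run_trial(snr)
            if last_correct is not None and correct != last_correct:
                reversals.append(snr)
            snr += -step if correct else step
            last_correct = correct
        tail = reversals[-6:] if reversals else [snr]
        return sum(tail) / len(tail)

    def toy_listener(snr, srt_true=-2.0, slope=0.3):
        # Simulated listener: probability of a correct response rises with SNR.
        p = 1.0 / (1.0 + math.exp(-slope * (snr - srt_true)))
        return random.random() < p

    print(one_up_one_down(toy_listener))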
C1 [Ozimek, Edward; Kutzner, Dariusz; Libiszewski, Pawel] Adam Mickiewicz Univ, Inst Acoust, PL-61614 Poznan, Poland.
RP Ozimek, E (reprint author), Adam Mickiewicz Univ, Inst Acoust, 85 Umultowska St, PL-61614 Poznan, Poland.
EM ozimaku@amu.edu.pl
FU Norway through the Norwegian Financial Mechanism [PNRF-167-AI-1/07]
FX Supported by a grant from Norway through the Norwegian Financial
Mechanism (project no. PNRF-167-AI-1/07).
CR Bell T S, 2001, J Am Acad Audiol, V12, P514
Bench J, 1979, Br J Audiol, V13, P108, DOI 10.3109/03005367909078884
Brand T, 2002, J ACOUST SOC AM, V111, P2801, DOI 10.1121/1.1479152
Bulczynska K., 1987, SLOWNICTWO DZIECI WI
Cameron S, 2006, INT J AUDIOL, V45, P99, DOI 10.1080/14992020500377931
Elliott L. L., 1980, NW U CHILDRENS PERCE
ELLIOTT LL, 1979, J ACOUST SOC AM, V66, P651, DOI 10.1121/1.383691
ELLIOTT LL, 1979, J ACOUST SOC AM, V66, P12, DOI 10.1121/1.383065
Fallon M, 2000, J ACOUST SOC AM, V108, P3023, DOI 10.1121/1.1323233
Gescheider George A., 1997, PSYCHOPHYSICS FUNDAM
HAGERMAN B, 1982, SCAND AUDIOL, V11, P79, DOI 10.3109/01050398209076203
Jerger S., 1984, PEDIAT SPEECH INTELL
Kollmeier B., 1997, J ACOUST SOC AM, V102, P1085
Kollmeier B., 1990, THESIS U GOTTINGEN G
LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375
MACKIE K, 1986, J SPEECH HEAR RES, V29, P275
Mendel LL, 2008, INT J AUDIOL, V47, P546, DOI 10.1080/14992020802252261
Nilsson M. J., 1996, DEV HEARING NOISE TE
Ozimek E., 2011, SPEECH INTELLI UNPUB
Ozimek E, 2010, INT J AUDIOL, V49, P444, DOI 10.3109/14992021003681030
Ozimek E, 2009, INT J AUDIOL, V48, P433, DOI 10.1080/14992020902725521
PLOMP R, 1979, AUDIOLOGY, V18, P43
ROSS M, 1970, J SPEECH HEAR RES, V13, P44
Smits C, 2006, J ACOUST SOC AM, V120, P1608, DOI 10.1121/1.2221405
Steffens T, 2003, HNO, V51, P1012, DOI 10.1007/s00106-003-0848-4
SZMEJA Z, 1963, Otolaryngol Pol, V17, P367
Szuchnik J., 2002, PROGRAM ROZWIJANIA O
Versfeld NJ, 2000, J ACOUST SOC AM, V107, P1671, DOI 10.1121/1.428451
Wagener K, 2005, Z AUDIOL, V44, P134
Wagener K, 2003, INT J AUDIOL, V42, P10, DOI 10.3109/14992020309056080
Wagener K, 1999, Z AUDIOL, V38, P4
ZAKRZEWSKI A, 1971, Otolaryngologia Polska, V25, P297
Zheng Y, 2009, INT J AUDIOL, V48, P718, DOI 10.1080/14992020902902658
NR 33
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2012
VL 54
IS 10
BP 1121
EP 1131
DI 10.1016/j.specom.2012.06.001
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 988LK
UT WOS:000307492300003
ER
PT J
AU Baum, D
AF Baum, Doris
TI Recognising speakers from the topics they talk about
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker recognition; Topic classification; High-level features
ID MODELS; RECOGNITION
AB We investigate how a speaker's preference for specific topics can be used for speaker identification. In domains like broadcast news or parliamentary speeches, speakers have a field of expertise they are associated with. We explore how topic information for a segment of speech, extracted from an automatic speech recognition transcript, can be employed to identify the speaker. Two methods for modelling topic preferences are compared: implicitly, based on speaker-characteristic keywords, and explicitly, by using automatically derived topic models to assign topics to the speech segments. In the keyword-based approach, the segments' tf-idf vectors are classified with Support Vector Machine speaker models. For the topic-model-based approach, a domain-specific topic model is used to represent each segment as a mixture of topics; the speakers' score is derived from the Kullback-Leibler divergence between the topic mixtures of their training data and of the segment.
The methods were tested on political speeches given in German parliament by 235 politicians. We found that topic cues do carry speaker information, as the topic-model-based system yielded an equal error rate (EER) of 16.3%. The topic-based approach combined well with a spectral baseline system, improving the EER from 8.6% for the spectral to 6.2% for the fused system. (c) 2012 Elsevier B.V. All rights reserved.
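A small sketch of the topic-mixture scoring step (the smoothing constant, direction of the divergence, and toy mixtures are assumptions; the tf-idf/SVM keyword branch and the topic-model training are omitted):

    import numpy as np

    def kl_divergence(p, q, eps=1e-10):
        # KL(p || q) between two discrete topic mixtures, with light smoothing.
        p = np.asarray(p, dtype=float) + eps
        q = np.asarray(q, dtype=float) + eps
        p, q = p / p.sum(), q / q.sum()
        return float(np.sum(p * np.log(p / q)))

    def score_speakers(segment_topics, speaker_topic_mixtures):
        # Score each enrolled speaker by the negated divergence between the
        # speaker's training topic mixture and the test segment's topic mixture.
        return {spk: -kl_divergence(mix, segment_topics)
                for spk, mix in speaker_topic_mixtures.items()}

    speakers = {"finance": [0.7, 0.2, 0.1], "defence": [0.1, 0.1, 0.8]}
    print(score_speakers([0.6, 0.3, 0.1], speakers))   # the "finance" speaker scores higher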
C1 Fraunhofer IAIS, St Augustin, Germany.
RP Baum, D (reprint author), Fraunhofer IAIS, St Augustin, Germany.
EM dorisbaum@gmx.net
FU German Federal Ministry of Economics and Technology through the
CONTENTUS scenario of the THESEUS project
FX Part of this study was funded by the German Federal Ministry of
Economics and Technology through the CONTENTUS scenario of the THESEUS
project.
CR Baum D, 2009, IEEE AUT SPEECH REC
Biatov K, 2002, P 3 INT C LANG RES E
Blei D, 2003, P 26 ANN INT ACM SIG, P127, DOI DOI 10.1145/860435.860460
Blei DM, 2003, J MACH LEARN RES, V3, P993, DOI 10.1162/jmlr.2003.3.4-5.993
Branavan S, 2008, P ACL, P263
Brummer N, 2006, COMPUT SPEECH LANG, V20, P230, DOI 10.1016/j.csl.2005.08.001
Canseco L, 2005, P IEEE WORKSH AUT SP
Canseco-Rodriguez L., 2004, ICSLP 2004, P1272
Chang J., 2009, NEURAL INFORM PROCES, V31
Doddington G.R., 2001, 7 EUR C SPEECH COMM, P2521
Feng Y., 2010, HUM LANG TECHN 2010, P831
Ferrer L, 2010, INT CONF ACOUST SPEE, P4414, DOI 10.1109/ICASSP.2010.5495632
Gillick L., 1989, ICASSP 1989 GLASG UK, V1, P532
Griffiths T., 2007, LATENT SEMANTIC ANAL
Hatch AO, 2005, INT CONF ACOUST SPEE, P169
Joachims T., 1999, ADV KERNEL METHODS S
Joachims T., 1998, P EUR C MACH LEARN
Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009
Kockmann M, 2010, INT CONF ACOUST SPEE, P4418, DOI 10.1109/ICASSP.2010.5495616
Lee A., 2001, 7 EUR C SPEECH COMM
Martin A., 1997, EUROSPEECH, P1895
Mauclair J., 2006, IEEE SPEAK LANG REC, P1
McCallum A, 2002, MALLET MACHINE LEARN
McNemar Q, 1947, PSYCHOMETRIKA, V12, P153, DOI 10.1007/BF02295996
Mertens T, 2009, INT CONF ACOUST SPEE, P4885, DOI 10.1109/ICASSP.2009.4960726
Morik K, 1999, P 16 INT C MACH LEAR, P268
Porter M, 2001, SNOWBALL LANGUAGE ST
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Rosen-Zvi M, 2010, ACM T INFORM SYST, V28, DOI 10.1145/1658377.1658381
Shriberg E, 2005, SPEECH COMMUN, V46, P455, DOI 10.1016/j.specom.2005.02.018
NR 31
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2012
VL 54
IS 10
BP 1132
EP 1142
DI 10.1016/j.specom.2012.06.003
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 988LK
UT WOS:000307492300004
ER
PT J
AU Rasanen, O
AF Rasanen, Okko
TI Computational modeling of phonetic and lexical learning in early
language acquisition: Existing models and future directions
SO SPEECH COMMUNICATION
LA English
DT Article
DE Language acquisition; Distributional learning; Computer simulation;
Phonetic learning; Lexical learning
ID SPEECH-PERCEPTION; WORD SEGMENTATION; VOCABULARY DEVELOPMENT;
CONNECTIONIST MODEL; PATTERN DISCOVERY; INFANT VOCABULARY; RECOGNITION;
CUES; CATEGORIES; FEEDBACK
AB This work reviews a number of existing computational studies that address the question of how spoken language can be learned from continuous speech in the absence of linguistically or phonetically motivated background knowledge, a situation faced by human infants when they first attempt to learn their native language. Specifically, the focus is on how phonetic categories and word-like units can be acquired purely on the basis of the statistical structure of speech signals, possibly aided by some articulatory or visual constraints. The outcomes and shortcomings of the existing work are considered in the light of findings from experimental and theoretical studies. Finally, some of the open questions and possible future research directions related to computational models of language acquisition are discussed. (c) 2012 Elsevier B.V. All rights reserved.
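As one concrete instance of the distributional-learning mechanisms reviewed, here is a sketch of transitional-probability word segmentation over a toy syllable stream (the threshold and syllable inventory are illustrative; most models in the review operate on acoustic input rather than symbolic syllables):

    from collections import Counter

    def segment_by_transitional_probability(syllables, threshold=0.75):
        # Insert a word boundary wherever the forward transitional probability
        # P(next | current) between adjacent syllables drops below the threshold.
        pair_counts = Counter(zip(syllables[:-1], syllables[1:]))
        unigram_counts = Counter(syllables[:-1])
        words, current = [], [syllables[0]]
        for a, b in zip(syllables[:-1], syllables[1:]):
            tp = pair_counts[(a, b)] / unigram_counts[a]
            if tp < threshold:
                words.append(current)
                current = []
            current.append(b)
        words.append(current)
        return words

    stream = ["pa", "bi", "ku", "go", "la", "tu", "pa", "bi", "ku",
              "da", "ro", "pi", "go", "la", "tu"]
    print(segment_by_transitional_probability(stream))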
C1 Aalto Univ, Sch Elect Engn, Dept Signal Proc & Acoust, FI-00076 Aalto, Finland.
RP Rasanen, O (reprint author), Aalto Univ, Sch Elect Engn, Dept Signal Proc & Acoust, POB 13000, FI-00076 Aalto, Finland.
EM okko.rasanen@aalto.fi
FU Finnish Graduate School of Language Studies (Langnet); Nokia Research
Center Tampere
FX This research was funded by the Finnish Graduate School of Language
Studies (Langnet) and Nokia Research Center Tampere. The author would
also like to thank Heikki Rasilo, Roger K. Moore, and the two anonymous
reviewers for their invaluable comments on the manuscript.
CR Ahissar E, 2005, AUDITORY CORTEX: SYNTHESIS OF HUMAN AND ANIMAL RESEARCH, P295
Aimetti G., 2009, P STUD RES WORKSH EA, P1, DOI 10.3115/1609179.1609180
Almpanidis G, 2008, SPEECH COMMUN, V50, P38, DOI 10.1016/j.specom.2007.06.005
Altosaar T., 2010, P INT C LANG RES EV, P1062
Aversano G, 2001, PROCEEDINGS OF THE 44TH IEEE 2001 MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOLS 1 AND 2, P516, DOI 10.1109/MWSCAS.2001.986241
Beal J., 2009, P 31 ANN C COGN SCI, P99
Best CC, 2003, LANG SPEECH, V46, P183
Blanchard D, 2010, J CHILD LANG, V37, P487, DOI 10.1017/S030500090999050X
Brent MR, 1996, COGNITION, V61, P93, DOI 10.1016/S0010-0277(96)00719-6
Brent MR, 1999, MACH LEARN, V34, P71, DOI 10.1023/A:1007541817488
Brent MR, 2001, COGNITION, V81, pB33, DOI 10.1016/S0010-0277(01)00122-6
Brosch M, 2005, AUDITORY CORTEX: SYNTHESIS OF HUMAN AND ANIMAL RESEARCH, P127
Buttery P., 2006, 675 U CAMBR COMP LAB
CASELLI MC, 1995, COGNITIVE DEV, V10, P159, DOI 10.1016/0885-2014(95)90008-X
Christiansen MH, 2009, DEVELOPMENTAL SCI, V12, P388, DOI 10.1111/j.1467-7687.2009.00824.x
Christiansen MH, 1998, LANG COGNITIVE PROC, V13, P221
Coen MH, 2005, P 20 NAT C ART INT A, P932
Coen Michael H., 2006, P 21 NAT C ART INT A, V2, P1451
Curtin S, 2005, COGNITION, V96, P233, DOI 10.1016/j.cognition.2004.08.005
Curtin S, 2001, PROC ANN BUCLD, P190
CUTLER A, 1994, LINGUA, V92, P81, DOI 10.1016/0024-3841(94)90338-7
Daland R, 2011, COGNITIVE SCI, V35, P119, DOI 10.1111/j.1551-6709.2010.01160.x
de Marcken C., 1995, 1558 AI MIT
de Boer B, 2003, ACOUST RES LETT ONL, V4, P129, DOI 10.1121/1.1613311
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
Demuynck K., 2002, P 5 INT C TEXT SPEEC, P277
Driesen J., 2009, P INT 2009, P1731
Duran D., 2010, RES LANGUAGE COMPUTA, V8, P133, DOI 10.1007/s11168-011-9075-4
EIMAS PD, 1971, SCIENCE, V171, P303, DOI 10.1126/science.171.3968.303
ELMAN JL, 1990, COGNITIVE SCI, V14, P179, DOI 10.1207/s15516709cog1402_1
Emmorey K, 2006, ACTION TO LANGUAGE VIA THE MIRROR NEURON SYSTEM, P110, DOI 10.1017/CBO9780511541599.005
Esposito A, 2005, LECT NOTES ARTIF INT, V3445, P261
Estevan YP, 2007, INT CONF ACOUST SPEE, P937
Fant G., 1985, Q PROGR STATUS REPOR, V4, P1
Feldman N. H., 2009, P 31 ANN C COGN SCI, P2208
Feldman NH, 2009, PSYCHOL REV, V116, P752, DOI 10.1037/a0017196
Fenson L., 2003, MACARTHUS BATES COMM
Gentner D., 1983, LANGUAGE COGNITION C, V2
Gleitman Lila, 1990, LANG ACQUIS, V1, P3, DOI DOI 10.1207/S153278171A0101_2
Goldstein MH, 2008, PSYCHOL SCI, V19, P515, DOI 10.1111/j.1467-9280.2008.02117.x
GOLINKOFF RM, 1994, J CHILD LANG, V21, P125
Gros-Louis J, 2006, INT J BEHAV DEV, V30, P509, DOI 10.1177/0165025406071914
Guenther FH, 1996, J ACOUST SOC AM, V100, P1111, DOI 10.1121/1.416296
Hamilton A, 2000, J CHILD LANG, V27, P689, DOI 10.1017/S0305000900004414
HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872
Hirsch H.-G., 2000, P ISCA ITRW ASR2000, P29
Houston DM, 2003, J EXP PSYCHOL HUMAN, V29, P1143, DOI 10.1037/0096-1523.29.6.1143
Howard IS, 2011, MOTOR CONTROL, V15, P85
Iverson P., 1994, J ACOUST SOC AM, V95, P2976, DOI 10.1121/1.408983
Jones SS, 2007, PSYCHOL SCI, V18, P593, DOI 10.1111/j.1467-9280.2007.01945.x
JUSCZYK PW, 1993, PROCEEDINGS OF THE FIFTEENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, P49
JUSCZYK PW, 1993, J PHONETICS, V21, P3
Kanerva P, 2009, COGN COMPUT, V1, P139, DOI 10.1007/s12559-009-9009-8
Kanerva P., 2000, P 22 ANN C COGN SCI, P103
Kaplan F, 2006, INTERACT STUD, V7, P135, DOI 10.1075/is.7.2.04kap
Keshet J., 2005, P INT 05, P2961
Kirchhoff K, 2005, J ACOUST SOC AM, V117, P2238, DOI 10.1121/1.1869172
KOHONEN T, 1990, P IEEE, V78, P1464, DOI 10.1109/5.58325
Kokkinaki T, 2000, J REPROD INFANT PSYC, V18, P173
Kouki M., 2010, P INT 2010, P2914
Kuhl P., 2005, LANGUAGE LEARNING DE, V1, P237, DOI DOI 10.1080/15475441.2005.9671948
Kuhl PK, 2006, DEVELOPMENTAL SCI, V9, pF13, DOI 10.1111/j.1467-7687.2006.00468.x
Kuhl PK, 2004, NAT REV NEUROSCI, V5, P831, DOI 10.1038/nrn1533
Kuhl PK, 2008, PHILOS T R SOC B, V363, P979, DOI 10.1098/rstb.2007.2154
KUHL PK, 1986, EXP BIOL, V45, P233
Kuwahara H., 1972, ACOUSTICAL SOC JAPAN, V28, P225
Lake BM, 2009, IEEE T AUTON MENT DE, V1, P35, DOI 10.1109/TAMD.2009.2021703
Lee DD, 1999, NATURE, V401, P788
LEVITT AG, 1992, J CHILD LANG, V19, P19
LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6
MacQueen J., 1967, P 5 BERK S MATH STAT, P281
MACWHINNEY B, 1985, J CHILD LANG, V12, P271
MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131
Markey K.L., 1994, THESIS U COLORADO CO
Marr D., 1982, VISION COMPUTATIONAL
Maye J, 2002, COGNITION, V82, pB101, DOI 10.1016/S0010-0277(01)00157-3
MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0
McInnes F., 2011, P 33 ANN M COGN SCI, P2006
McMurray B, 2009, DEVELOPMENTAL SCI, V12, P365, DOI 10.1111/j.1467-7687.2009.00821.x
McMurray B, 2009, DEVELOPMENTAL SCI, V12, P369, DOI 10.1111/j.1467-7687.2009.00822.x
Mehler J., 1990, COGNITIVE MODELS SPE
Meltzoff AN, 2009, SCIENCE, V325, P284, DOI 10.1126/science.1175626
MILLER KD, 1992, NEUROREPORT, V3, P73, DOI 10.1097/00001756-199201000-00019
Miller M., 2009, PERSONALMAGAZIN, P12
Newman M.E.J., 2004, PHYS REV E, V69
Norris D, 2000, BEHAV BRAIN SCI, V23, P299, DOI 10.1017/S0140525X00003241
NORRIS D, 1994, COGNITION, V52, P189, DOI 10.1016/0010-0277(94)90043-4
Nowlan S, 1990, ADV NEURAL INFORMATI, V2, P574
Oates T., 2002, Proceedings 2002 IEEE International Conference on Data Mining. ICDM 2002, DOI 10.1109/ICDM.2002.1183920
Oates T., 2001, THESIS U MASSACHUSET
Park A, 2006, INT CONF ACOUST SPEE, P409
Park A., 2005, P ASRU SAN JUAN PUER, P53
PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875
Pinker Steven, 1989, LEARNABILITY COGNITI
Pisoni D. B., 1997, TALKER VARIABILITY S, P9
Port R, 2007, NEW IDEAS PSYCHOL, V25, P143, DOI 10.1016/j.newideapsych.2007.02.001
Quine W.V.O., 1960, WORD OBJECT
Rasanen O., P 34 ANN C COGN SCI
Rasanen O, 2011, COGNITION, V120, P149, DOI 10.1016/j.cognition.2011.04.001
Rasanen O., STRUCTURE CONT UNPUB
RASANEN O, 2008, P INT 08, P1980
Rasanen O, 2012, PATTERN RECOGN, V45, P606, DOI 10.1016/j.patcog.2011.05.005
Rasanen O., 2012, Proceedings of the 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2012), DOI 10.1109/ICASSP.2012.6289052
Rasanen Okko, 2011, Speech Technologies
Rasanen O., 2009, P 17 NORD C COMP LIN, P255
Roy D, 2003, IEEE T MULTIMEDIA, V5, P197, DOI 10.1109/TMM.2003.811618
Saffran JR, 1996, SCIENCE, V274, P1926, DOI 10.1126/science.274.5294.1926
Saffran JR, 2001, COGNITION, V81, P149, DOI 10.1016/S0010-0277(01)00132-9
Saffran JR, 1996, J MEM LANG, V35, P606, DOI 10.1006/jmla.1996.0032
Scharenborg O., 2007, P INT C SPOK LANG PR, P1953
Scharenborg O, 2010, PRAGMAT COGN, V18, P136, DOI 10.1075/pc.18.1.06sch
Smith K, 2006, LECT NOTES ARTIF INT, V4211, P31
Smith K, 2011, COGNITIVE SCI, V35, P480, DOI 10.1111/j.1551-6709.2010.01158.x
Smith L, 2008, COGNITION, V106, P1558, DOI 10.1016/j.cognition.2007.06.010
Stager CL, 1997, NATURE, V388, P381, DOI 10.1038/41102
Steels L., 2000, EVOLUTION COMMUNICAT, V4, P3
Steels L, 2003, TRENDS COGN SCI, V7, P308, DOI 10.1016/S1364-6613(03)00129-3
Stouten V., 2007, IEEE SIGNAL PROCESSI, V15, P131
Swingley D, 2005, COGNITIVE PSYCHOL, V50, P86, DOI 10.1016/j.cogpsych.2004.06.001
Ten Bosch L, 2007, P INTERSPEECH2007, P1481
ten Bosch L., 2009, P WORKSH CHILD COMP
ten Bosch Louis, 2008, Speech Recognition - Technologies and Applications
ten Bosch L, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P704
ten Bosch L, 2007, SPEECH COMMUN, V49, P331, DOI 10.1016/j.specom.2007.03.001
ten Bosch L, 2009, FUND INFORM, V90, P229, DOI 10.3233/FI-2009-0016
Thiessen ED, 2004, PERCEPT PSYCHOPHYS, V66, P779, DOI 10.3758/BF03194972
Thiessen ED, 2003, DEV PSYCHOL, V39, P706, DOI 10.1037/0012-1649.39.4.706
Toledano DT, 2003, IEEE T SPEECH AUDI P, V11, P617, DOI 10.1109/TSA.2003.813579
Tomasello M, 1995, JOINT ATTENTION ITS, V16, P103
Tomasello M., 2000, PRAGMATICS, V10, P401
Toscano JC, 2010, COGNITIVE SCI, V34, P434, DOI 10.1111/j.1551-6709.2009.01077.x
TREHUB SE, 1976, CHILD DEV, V47, P466, DOI 10.2307/1128803
Tsao FM, 2004, CHILD DEV, V75, P1067, DOI 10.1111/j.1467-8624.2004.00726.x
Unal F.A., 1992, P INT JOINT C NEUR N, P715
Vallabha GK, 2007, P NATL ACAD SCI USA, V104, P13273, DOI 10.1073/pnas.0705369104
Van hamme H., 2008, P INT C SPOK LANG PR, P2554
Venkataraman A, 2001, COMPUT LINGUIST, V27, P351, DOI 10.1162/089120101317066113
Versteegh M., 2010, P INT 10 CHIB JAP, P2930
Villing R., 2006, P ISSC, P521
Warren RM, 2000, BEHAV BRAIN SCI, V23, P350, DOI 10.1017/S0140525X00503240
Waterson N., 1971, J LINGUIST, V7, P179, DOI 10.1017/S0022226700002917
Werker J., 2005, LANGUAGE LEARNING DE, V1, P197, DOI DOI 10.1080/15475441.2005.9684216
WERKER JF, 1984, INFANT BEHAV DEV, V7, P49, DOI 10.1016/S0163-6383(84)80022-3
Witner S., 2010, P 11 INT C COMP LING, P86
NR 144
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2012
VL 54
IS 9
BP 975
EP 997
DI 10.1016/j.specom.2012.05.001
PG 23
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 987LV
UT WOS:000307419400001
ER
PT J
AU Irino, T
Aoki, Y
Kawahara, H
Patterson, RD
AF Irino, Toshio
Aoki, Yoshie
Kawahara, Hideki
Patterson, Roy D.
TI Comparison of performance with voiced and whispered speech in word
recognition and mean-formant-frequency discrimination
SO SPEECH COMMUNICATION
LA English
DT Article
DE Whispered word recognition; Mean formant frequency discrimination; Modes
of vocal excitation
ID SPEAKER SIZE; VOCAL-TRACT; BODY-SIZE; VOWELS; IDENTIFICATION; SEX;
PITCH; INFORMATION; PERCEPTION; PARAMETERS
AB There has recently been a series of studies concerning the interaction of glottal pulse rate (GPR) and mean-formant-frequency (MFF) in the perception of speaker characteristics and speech recognition. This paper extends the research by comparing the recognition and discrimination performance achieved with voiced words to that achieved with whispered words. The recognition experiment shows that performance with whispered words is slightly worse than with voiced words at all MFFs when the GPR of the voiced words is in the middle of the normal range. But, as GPR decreases below this range, voiced-word performance decreases and eventually becomes worse than whispered-word performance. The discrimination experiment shows that the just noticeable difference (JND) for MFF is essentially independent of the mode of vocal excitation; the JND is close to 5% for both voiced and voiceless words for all speaker types. The interaction between GPR and VTL is interpreted in terms of the stability of the internal representation of speech which improves with GPR across the range of values used in these experiments. (c) 2012 Elsevier B.V. All rights reserved.
C1 [Irino, Toshio; Aoki, Yoshie; Kawahara, Hideki] Wakayama Univ, Fac Syst Engn, Wakayama 6408510, Japan.
[Patterson, Roy D.] Univ Cambridge, Dept Physiol Dev & Neurosci, Ctr Neural Basis Hearing, Cambridge CB2 3EG, England.
RP Irino, T (reprint author), Wakayama Univ, Fac Syst Engn, 930 Sakaedani, Wakayama 6408510, Japan.
EM irino@sys.wakayama-u.ac.jp; kawahara@sys.wakayama-u.ac.jp;
rdp1@cam.ac.uk
CR Aoki Y., 2008, ARO 31 MIDW M PHOEN
Aoki Y., 2008, J ACOUST SOC AM 2, V123, P3718, DOI 10.1121/1.2935170
Assmann PF, 2005, J ACOUST SOC AM, V117, P886, DOI 10.1121/1.1852549
Assmann PF, 2008, J ACOUST SOC AM, V124, P3203, DOI 10.1121/1.2980456
Boersma P., 2001, GLOT INT, V5, P341
Chiba T., 1942, VOWEL ITS NATURE STR
CORNSWEE.TN, 1965, J PHYSIOL-LONDON, V176, P294
Fant G., 1970, ACOUSTIC THEORY SPEE
Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148
FUJISAKI H, 1968, IEEE T ACOUST SPEECH, VAU16, P73, DOI 10.1109/TAU.1968.1161952
Ghazanfar A. A., 2008, CURR BIOL, V18, P457
Gonzalez J, 2004, J PHONETICS, V32, P277, DOI 10.1016/S0095-4470(03)00049-4
HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872
Huber JE, 1999, J ACOUST SOC AM, V106, P1532, DOI 10.1121/1.427150
Irino T, 2002, SPEECH COMMUN, V36, P181, DOI 10.1016/S0167-6393(00)00085-6
Irino T, 2006, IEEE T AUDIO SPEECH, V14, P2222, DOI 10.1109/TASL.2006.874669
Ives DT, 2005, J ACOUST SOC AM, V118, P3816, DOI 10.1121/1.2118427
Kawahara H., 2004, SPEECH SEPARATION HU, P167
Kawahara H., 2006, Acoustical Science and Technology, V27, DOI 10.1250/ast.27.349
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
LASS NJ, 1976, J ACOUST SOC AM, V59, P675, DOI 10.1121/1.380917
Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686
Liu C, 2004, ACOUST RES LETT ONL, V5, P31, DOI 10.1121/1.1635431
MARCUS SM, 1981, PERCEPT PSYCHOPHYS, V30, P247, DOI 10.3758/BF03214280
Markel J., 1975, LINEAR PREDICTION SP
MILLER GA, 1947, J ACOUST SOC AM, V19, P609, DOI 10.1121/1.1916528
Nearey T. M., 2002, J ACOUST SOC AM, V112, P2323
PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875
Pisanski K, 2011, J ACOUST SOC AM, V129, P2201, DOI 10.1121/1.3552866
Rendall D, 2005, J ACOUST SOC AM, V117, P944, DOI 10.1121/1.1848011
Rendall D, 2007, J EXP PSYCHOL HUMAN, V33, P1208, DOI 10.1037/0096-1523.33.5.1208
Sakamoto S., 2004, ACOUST SCI TECHNOL, V106, P1511
SCHWARTZ MF, 1968, J ACOUST SOC AM, V43, P1178, DOI 10.1121/1.1910954
SCHWARTZ MF, 1970, J SPEECH HEAR RES, V13, P445
SCHWARTZ MF, 1968, J ACOUST SOC AM, V43, P1448, DOI 10.1121/1.1911007
SCHWARTZ MF, 1968, J ACOUST SOC AM, V44, P1736, DOI 10.1121/1.1911324
SINNOTT JM, 1987, J COMP PSYCHOL, V101, P126, DOI 10.1037/0735-7036.101.2.126
Smith DRR, 2005, J ACOUST SOC AM, V117, P305, DOI 10.1121/1.1828637
Smith D.R.R., 2005, M BRIT SOC AUD CARD
TARTTER VC, 1989, J ACOUST SOC AM, V86, P1678, DOI 10.1121/1.398598
TARTTER VC, 1991, PERCEPT PSYCHOPHYS, V49, P365, DOI 10.3758/BF03205994
TARTTER VC, 1994, J ACOUST SOC AM, V96, P2101, DOI 10.1121/1.410151
Traunmuller H, 2000, J ACOUST SOC AM, V107, P3438, DOI 10.1121/1.429414
Tsujimura N, 2007, INTRO JAPANESE LINGU
Turner RE, 2009, J ACOUST SOC AM, V125, P2374, DOI 10.1121/1.3079772
vanDommelen WA, 1995, LANG SPEECH, V38, P267
Vestergaard MD, 2009, J ACOUST SOC AM, V126, P2860, DOI 10.1121/1.3257582
Wichmann FA, 2001, PERCEPT PSYCHOPHYS, V63, P1293, DOI 10.3758/BF03194544
NR 48
TC 6
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2012
VL 54
IS 9
BP 998
EP 1013
DI 10.1016/j.specom.2012.04.002
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 987LV
UT WOS:000307419400002
ER
PT J
AU Ogawa, A
Nakamura, A
AF Ogawa, Atsunori
Nakamura, Atsushi
TI Joint estimation of confidence and error causes in speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Confidence estimation; Error cause detection; Joint
estimation; Discriminative model
ID COMBINATION
AB Speech recognition errors are essentially unavoidable under the severe conditions of real-field deployment, so confidence estimation, which scores the reliability of a recognition result, plays a critical role in the development of real-field application systems based on speech recognition. However, to develop an application system that provides a high-quality service, in addition to achieving accurate confidence estimation, we also need to extract and exploit further supplementary information from a speech recognition engine. As a first step in this direction, this paper proposes a method for estimating the confidence of a recognition result while jointly detecting the causes of recognition errors based on a discriminative model. The confidence of a recognition result and the presence or absence of error causes are naturally correlated. By directly capturing these correlations between the confidence and the error causes, the proposed method enhances its estimation performance for the confidence and for each error cause in a complementary way. In initial speech recognition experiments, the proposed method provided higher confidence estimation accuracy than a state-of-the-art confidence estimation method based on a discriminative model. Moreover, detailed analyses confirmed the effectiveness of the proposed estimation mechanism. (c) 2012 Elsevier B.V. All rights reserved.
C1 [Ogawa, Atsunori; Nakamura, Atsushi] NTT Corp, NTT Commun Sci Labs, Seika, Kyoto, Japan.
RP Ogawa, A (reprint author), NTT Corp, NTT Commun Sci Labs, 2-4 Hikaridai, Seika, Kyoto, Japan.
EM ogawa.atsunori@lab.ntt.co.jp; naka-mura.atsushi@lab.ntt.co.jp
CR Berger AL, 1996, COMPUT LINGUIST, V22, P39
Burget L, 2008, INT CONF ACOUST SPEE, P4081, DOI 10.1109/ICASSP.2008.4518551
Chase L., 1997, P EUR C SPEECH COMM, P815
Fayolle J., 2010, P INTERSPEECH, P1492
Google Inc, 2011, ANDR DEV
HAZEN TJ, 2001, ACOUST SPEECH SIG PR, P397
Hori T, 2007, IEEE T AUDIO SPEECH, V15, P1352, DOI 10.1109/TASL.2006.889790
ICASSP Special Session (SS-6), 2008, P ICASSP PRAG CZECH, P5240
Interspeech Special Highlight Session, 2010, P INTERSPEECH
Interspeech Special Session (Wed-Ses-S1), 2009, P INTERSPEECH
Jiang H, 2005, SPEECH COMMUN, V45, P455, DOI 10.1016/j.specom.2004.12.004
Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881
Kombrink S., 2009, P INTERSPEECH, P80
Lafferty John D., 2001, ICML, P282
Lee C.-H., 2001, P INT WORKSH HANDS F, P27
Nakagawa S., 1994, J ACOUSTICAL SOC JAP, V50, P849
Nakano T., 2007, P IEEE ASRU, P601
Ogawa A, 2010, 11TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2010 (INTERSPEECH 2010), VOLS 1-2, P242
Ogawa A., 2009, P INT, P1199
Ogawa A, 2010, INT CONF ACOUST SPEE, P4454, DOI 10.1109/ICASSP.2010.5495608
Pearl J., 1988, PROBABILISTIC REASON
Schalkwyk J., 2010, ADV SPEECH RECOGNITI, P61, DOI 10.1007/978-1-4419-5951-5_4
Sukkar RA, 1997, SPEECH COMMUN, V22, P333, DOI 10.1016/S0167-6393(97)00031-9
Wang YY, 2008, IEEE SIGNAL PROC MAG, V25, P29, DOI 10.1109/MSP.2008.918411
Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002
White C, 2007, INT CONF ACOUST SPEE, P809
White C, 2008, INT CONF ACOUST SPEE, P4085, DOI 10.1109/ICASSP.2008.4518552
Yoma NB, 2005, IEEE SIGNAL PROC LET, V12, P745, DOI [10.1109/LSP.2005.856888, 10.1109/LSP.2005.856988]
YOUNG SR, 1994, INT CONF ACOUST SPEE, P21
Yu D, 2009, PATTERN RECOGN LETT, V30, P1295, DOI 10.1016/j.patrec.2009.06.005
Zhou B., 2011, COMPUTER SPEECH LANG
NR 31
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2012
VL 54
IS 9
BP 1014
EP 1028
DI 10.1016/j.specom.2012.04.004
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 987LV
UT WOS:000307419400003
ER
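The abstract above describes a discriminative model that jointly scores recognition confidence and the presence of individual error causes, exploiting the correlation between them. The following is a minimal sketch of that idea, assuming a simple log-linear model over one binary confidence label and two binary error-cause labels with hand-set, purely illustrative weights and features; it is not the authors' model or feature set.

import itertools
import math

def joint_scores(x, w_unary, w_pair):
    """Return unnormalized log-scores for every (conf, cause1, cause2) assignment."""
    scores = {}
    for c, e1, e2 in itertools.product([0, 1], repeat=3):
        s = 0.0
        # unary terms: each active label contributes a weighted sum of features x
        for label, name in ((c, "conf"), (e1, "cause1"), (e2, "cause2")):
            if label:
                s += sum(w * f for w, f in zip(w_unary[name], x))
        # pairwise terms: capture the correlation between confidence and each cause
        s += w_pair["conf_cause1"] * c * e1
        s += w_pair["conf_cause2"] * c * e2
        scores[(c, e1, e2)] = s
    return scores

def joint_posterior(x, w_unary, w_pair):
    scores = joint_scores(x, w_unary, w_pair)
    z = sum(math.exp(s) for s in scores.values())
    return {k: math.exp(s) / z for k, s in scores.items()}

if __name__ == "__main__":
    x = [0.7, -0.2, 1.0]                       # e.g. acoustic / LM posterior features (illustrative)
    w_unary = {"conf":   [2.0, 0.5, 0.3],
               "cause1": [-1.5, 0.8, 0.1],
               "cause2": [-0.5, -1.0, 0.2]}
    w_pair = {"conf_cause1": -2.0, "conf_cause2": -1.0}  # detected causes lower confidence
    post = joint_posterior(x, w_unary, w_pair)
    p_correct = sum(p for (c, _, _), p in post.items() if c == 1)
    print("P(correct) =", round(p_correct, 3))

Summing the joint posterior over the error-cause labels gives the marginal confidence, which is how the pairwise terms let detected causes pull the confidence estimate down.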
PT J
AU Clemente, IA
Heckmann, M
Wrede, B
AF Clemente, Irene Ayllon
Heckmann, Martin
Wrede, Britta
TI Incremental word learning: Efficient HMM initialization and large margin
discriminative adaptation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Incremental word learning; Learning from few examples; Speech
recognition; Multiple sequence alignment; Discriminative training;
Bootstrapping
ID HIDDEN MARKOV-MODELS; SPEECH RECOGNITION; LANGUAGE; MATRICES
AB In this paper we present an incremental word learning system that is able to cope with few training data samples, to enable speech acquisition in on-line human-robot interaction. As with most automatic speech recognition (ASR) systems, our architecture relies on a Hidden Markov Model (HMM) framework in which the different word models are trained sequentially and the system has little prior knowledge. To achieve good performance, HMMs depend on the amount of training data, the initialization procedure and the efficiency of the discriminative training algorithms. We therefore propose several approaches to improve the system. One major problem of using a small amount of training data is over-fitting. Hence we present a novel estimation of the variance floor that depends on the number of available training samples. Next, we propose a bootstrapping approach to obtain a good initialization of the HMM parameters. This method is based on unsupervised training of the parameters and subsequent construction of a new HMM by aligning and merging Viterbi-decoded sequences. Finally, we investigate large margin discriminative training techniques to improve the generalization performance of the models, using several strategies suitable for limited training data. In the evaluation of the results, we examine the contribution of the different proposed stages to the overall system performance. This includes a comparison of different state-of-the-art methods with the presented techniques and an investigation of the possible reduction of the number of training data samples. We compare our algorithms on isolated and continuous digit recognition tasks. In sum, we show that the proposed algorithms yield significant improvements and are a step towards efficient learning with few examples. (c) 2012 Elsevier B.V. All rights reserved.
C1 [Clemente, Irene Ayllon; Wrede, Britta] Univ Bielefeld, Res Inst Cognit & Robot, CoR Lab, D-33615 Bielefeld, Germany.
[Clemente, Irene Ayllon; Heckmann, Martin] Honda Res Inst Europe GmbH, D-63073 Offenbach, Germany.
RP Clemente, IA (reprint author), Univ Bielefeld, Res Inst Cognit & Robot, CoR Lab, D-33615 Bielefeld, Germany.
EM iayllon@cor-lab.uni-bielefeld.de; martin.heckmann@honda-ri.de;
bwrede@cor-lab.uni-bielefeld.de
CR Ayllon Clemente I., 2010, P INTERSPEECH
Ayllon Clemente I., 2010, P IEEE INT C AC SPEE
BAKIS R, 1976, P ASA M WASH DC
Bertsekas DP, 1999, NONLINEAR PROGRAMMIN
Bilmes J., 2002, WHAT HMMS CAN DO
Bimbot F., 1998, OVERVIEW CAVE PROJEC
Bishop C. M., 2006, PATTERN RECOGNITION
Bosch L., 2009, FUNDAMENTA INFORM, V90, P229
Bosch L., 2008, P INT C TEXT SPEECH, P261
Boves L, 2007, PROCEEDINGS OF THE SIXTH IEEE INTERNATIONAL CONFERENCE ON COGNITIVE INFORMATICS, P349
Brandl H., 2010, THESIS BIELEFELD U
Brandl H, 2008, INT C DEVEL LEARN, P31, DOI 10.1109/DEVLRN.2008.4640801
Chang T.-H., 2008, P IEEE INT C AC SPEE
Charles W., 1992, TRULY INTELLIGENT CO
Devijver P. A., 1982, PATTERN RECOGNITION
Dong Yu, 2008, Computer Speech & Language, V22, DOI 10.1016/j.csl.2008.03.002
Fink G. A., 2008, MARKOV MODELS PATTER
Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034
Ganapathiraju A., 2000, P SPEECH TRANSCR WOR
Garofolo J.S., 1993, 20041068 USGS
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Geiger J.T., 2011, P INTERSPEECH
Ghahramani Z, 2001, INT J PATTERN RECOGN, V15, P9, DOI 10.1142/S0218001401000836
Ghoshal A., 2005, P 28 ANN INT ACM SIG, P544, DOI 10.1145/1076034.1076127
He XD, 2008, IEEE SIGNAL PROC MAG, V25, P14, DOI 10.1109/MSP.2008.926652
Hermansky H., 1994, IEEE T SPEECH AUDIO, V2, P587
Huang X., 2001, SPOKEN LANGUAGE PROC
Itaya Y, 2005, IEICE T INF SYST, VE88D, P425, DOI 10.1093/ietisy/e88-d.3.425
Iwahashi N, 2006, LECT NOTES ARTIF INT, V4211, P143
JONES DT, 1992, COMPUT APPL BIOSCI, V8, P275
Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257
Kadous MW, 2002, THESIS U NEW S WALES
Lee C.-H., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90022-V
Leonard R., 1984, P IEEE INT C AC SPEE
Li X., 2006, P INTERSPEECH
Li X., 2005, P IEEE INT C AC SPEE
Lin H., 2011, P INTERSPEECH
Liu F.-H., 1994, THESIS CARNEGIE MELL
Markov K., 2007, P INT
McDermott E., 1997, THESIS WASEDA U
Melin H., 1998, P INT C SPOK LANG PR
Melin H., 1998, OPTIMIZING VARIANCE
Melin H., 1999, P EUR C SPEECH COMM, P5
Minematsu N., 2010, P INT C SPEECH PROS
Mokbel C., 1999, P IEEE INT C AC SPEE
Moore RK, 2007, SPEECH COMMUN, V49, P418, DOI 10.1016/j.specom.2007.01.011
Morgan N., 1993, International Journal of Pattern Recognition and Artificial Intelligence, V7, DOI 10.1142/S0218001493000455
Nabney I. T., 2002, ADV PATTERN RECOGNIT
Nathan K, 1996, INT CONF ACOUST SPEE, P3502, DOI 10.1109/ICASSP.1996.550783
NEEDLEMA.SB, 1970, J MOL BIOL, V48, P443, DOI 10.1016/0022-2836(70)90057-4
Neukirchen C., 1998, P ICSLP, P2999
Nocedal J., 1999, NUMERICAL OPTIMIZATI
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
ROSENBERG A., 1998, P IEEE INT C AC SPEE
Roy D, 2003, IEEE T MULTIMEDIA, V5, P197, DOI 10.1109/TMM.2003.811618
Scholkopf B., 2001, LEARNING KERNELS SUP
Sha F., 2007, ADV NEURAL INFORM PR
Sim KC, 2006, IEEE T AUDIO SPEECH, V14, P882, DOI 10.1109/TSA.2005.858062
SIU MH, 1999, ACOUST SPEECH SIG PR, P105
Smith K., 2002, HIDDEN MARKOV MODELS
SMITH TF, 1981, J MOL BIOL, V147, P195, DOI 10.1016/0022-2836(81)90087-5
Stuttle M.N., 2004, THESIS HUGHES HALL C
Theodoridis S, 2009, PATTERN RECOGNITION, 4RTH EDITION, P1
Van hamme H., 2008, P INT
Van Segbroeck M, 2009, SPEECH COMMUN, V51, P1124, DOI 10.1016/j.specom.2009.05.003
Vapnik V, 1998, STAT LEARNING THEORY
Wu Y., 2007, DATA SELECTION SPEEC
Young S. J., 2006, HTK BOOK VERSION 3 4
NR 68
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2012
VL 54
IS 9
BP 1029
EP 1048
DI 10.1016/j.specom.2012.04.005
PG 20
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 987LV
UT WOS:000307419400004
ER
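The sample-count-dependent variance floor mentioned in the abstract above can be illustrated with a short sketch: when a word model is trained from only a handful of examples, its Gaussian variances are floored at a larger fraction of the global variance than when plenty of data is available. The decay shape, constants and function name below are assumptions chosen for illustration, not the formula from the paper.

import numpy as np

def sample_dependent_variance_floor(state_var, global_var, n_samples,
                                    base_factor=0.01, few_sample_factor=0.5,
                                    n_ref=50):
    """Floor each variance at a fraction of the global variance.

    The fraction interpolates between `few_sample_factor` (very little data)
    and `base_factor` (plenty of data); `n_ref` sets how fast it decays.
    All constants here are illustrative.
    """
    weight = np.exp(-n_samples / n_ref)
    factor = base_factor + (few_sample_factor - base_factor) * weight
    floor = factor * global_var
    return np.maximum(state_var, floor)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    global_var = np.ones(13)                      # e.g. 13-dim MFCC variances
    state_var = rng.uniform(0.001, 0.2, size=13)  # possibly over-fitted per-state estimates
    print(sample_dependent_variance_floor(state_var, global_var, n_samples=5))
    print(sample_dependent_variance_floor(state_var, global_var, n_samples=500))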
PT J
AU Truong, KP
van Leeuwen, DA
de Jong, FMG
AF Truong, Khiet P.
van Leeuwen, David A.
de Jong, Franciska M. G.
TI Speech-based recognition of self-reported and observed emotion in a
dimensional space
SO SPEECH COMMUNICATION
LA English
DT Article
DE Affective computing; Automatic emotion recognition; Emotional speech;
Emotion database; Audiovisual database; Emotion perception; Emotion
annotation; Emotion elicitation; Videogames; Support Vector Regression
ID SUPPORT VECTOR REGRESSION; AUTOMATIC RECOGNITION; GAME; FEATURES; MODEL
AB The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two types of ratings affect the development and performance of automatic emotion recognizers developed with these ratings. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus that contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and that includes emotion annotations obtained via self-report and observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a 2-dimensional arousal-valence space. The results of these recognizers show that the self-reported emotion is much harder to recognize than the observed emotion, and that averaging ratings from multiple observers improves performance. (c) 2012 Elsevier B.V. All rights reserved.
C1 [Truong, Khiet P.; de Jong, Franciska M. G.] Univ Twente, NL-7500 AE Enschede, Netherlands.
[van Leeuwen, David A.] Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6500 HD Nijmegen, Netherlands.
RP Truong, KP (reprint author), Univ Twente, POB 217, NL-7500 AE Enschede, Netherlands.
EM k.p.truong@utwente.nl; d.vanleeuwen@let.ru.nl; f.m.g.dejong@utwente.nl
FU MultimediaN; European Community's Seventh Framework Programme (FP7)
[231287]
FX We would like to thank the anonymous reviewers for their helpful
comments. This work was supported by MultimediaN and the European
Community's Seventh Framework Programme (FP7/2007-2013) under grant
agreement no. 231287 (SSPNet).
CR Ang J, 2002, P INT C SPOK LANG PR, P2037
Auberge V., 2006, P 5 INT C LANG RES E
Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
Batliner A., 2006, LANGUAGE TECHNOLOGIE, P240
Batliner A, 2000, P ISCA WORKSH SPEECH, P195
Biersack S., 2005, P ISCA WORKSH PLAST, P211
Boersma P., 2009, PRAAT DOING PHONETIC
Busso C, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P257
Chang C.-C., 2001, LIBSVM LIB SUPPORT V
Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197
Cowie R., 2000, P ISCA WORKSH SPEECH, P19
Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7
Dellaert F, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1970
den Uyl M.J., 2005, P MEAS BEH, P589
Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007
Devillers L., 2003, P ICME 2003, P549
Douglas-Cowie E, 2005, P INT 2005 LISB PORT, P813
Ekman P, 1972, NEBRASKA S MOTIVATIO, P207
Ekman P, 1975, UNMASKING FACE GUIDE
Eyben F, 2010, J MULTIMODAL USER IN, V3, P7, DOI 10.1007/s12193-009-0032-6
Giannakopoulos T, 2009, INT CONF ACOUST SPEE, P65, DOI 10.1109/ICASSP.2009.4959521
Grimm M, 2007, INT CONF ACOUST SPEE, P1085
Grimm M, 2007, SPEECH COMMUN, V49, P787, DOI 10.1016/j.specom.2007.01.010
Gunes Hatice, 2011, Proceedings 2011 IEEE International Conference on Automatic Face & Gesture Recognition (FG 2011), DOI 10.1109/FG.2011.5771357
Hanjalic A, 2005, IEEE T MULTIMEDIA, V7, P143, DOI 10.1109/TMM.2004.840618
Joachims Thorsten, 1998, P 10 EUR C MACH LEAR, P137
Johnstone T, 2005, EMOTION, V5, P513, DOI 10.1037/1528-3542.5.4.513
Kim J., 2005, P 9 EUR C SPEECH COM, P809
KWON O.W., 2003, P 8 EUR C SPEECH COM, P125
Lang P. J., 1995, AM PSYCHOL, V50, P371, DOI [10.1037/0003-066X.50.5.372, DOI 10.1037/0003-066X.50.5.372]
Lazarro N., 2004, WHY WE PLAY GAMES 4
Lee C. M., 2002, P INT C MULT EXP SWI, P737
Liscombe J., 2003, P EUR C SPEECH COMM, P725
Mower E., 2009, P INTERSPEECH, P1583
Nicolaou M., 2010, P LREC INT WORKSH MU, P43
Nicolaou MA, 2011, IEEE T AFFECT COMPUT, V2, P92, DOI 10.1109/T-AFFC.2011.9
Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S0167-6393(03)00099-2
Petrushin V.A., 1999, P 1999 C ART NEUR NE
POLZIN TS, 1998, P COOP MULT COMM CMC
Ravaja N, 2006, PRESENCE-TELEOP VIRT, V15, P381, DOI 10.1162/pres.15.4.381
RUSSELL JA, 1980, J PERS SOC PSYCHOL, V39, P1161, DOI 10.1037/h0077714
SALTON G, 1988, INFORM PROCESS MANAG, V24, P513, DOI 10.1016/0306-4573(88)90021-0
Scherer K. R., 2010, BLUEPRINT AFFECTIVE, P47
SCHLOSBERG H, 1954, PSYCHOL REV, V61, P81, DOI 10.1037/h0054570
SCHULLER B, 2003, ACOUST SPEECH SIG PR, P1
Smola AJ, 2004, STAT COMPUT, V14, P199, DOI 10.1023/B:STCO.0000035301.49549.88
Tato R.S., 2002, P INT C SPOK LANG PR, P2029
Truong K. P., 2009, P INTERSPEECH, P2027
Truong K.P., 2008, P INTERSPEECH, P381
Truong KP, 2008, LECT NOTES COMPUT SC, V5237, P161, DOI 10.1007/978-3-540-85853-9_15
Vapnik V., 2002, NATURE STAT LEARNING
Ververidis D, 2006, SPEECH COMMUN, V48, P1162, DOI 10.1016/j.specom.2006.04.003
Ververidis D., 2005, P IEEE INT C MULT EX, P1500
Wang N, 2006, LECT NOTES ARTIF INT, V4133, P282
WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238
Wollmer M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P597
Wollmer M., 2009, P INT BRIGHT UK, P1595
Wundt W., 1874, GRUNDZUGE PHYSL PSYC
Yildirim S., 2005, P 9 EUR C SPEECH COM, P2209
Yu C., 2004, P 8 INT C SPOK LANG, P1329
Zeng Z., 2005, P IEEE INT C MULT EX, P828
NR 61
TC 7
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2012
VL 54
IS 9
BP 1049
EP 1063
DI 10.1016/j.specom.2012.04.006
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 987LV
UT WOS:000307419400005
ER
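As a concrete illustration of the dimensional recognizers described above, the sketch below trains one Support Vector Regression model per dimension (arousal, valence) and maps each test utterance to a point in the 2-D arousal-valence space. It uses scikit-learn and random placeholder features purely for illustration; the paper's own setup (LIBSVM, acoustic and textual features, the TNO-Gaming Corpus ratings) is not reproduced here.

import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 30))           # 30 per-utterance features (placeholder)
arousal = rng.uniform(-1, 1, size=200)   # continuous ratings on an arousal scale
valence = rng.uniform(-1, 1, size=200)   # continuous ratings on a valence scale

# One regressor per emotion dimension, with feature standardization.
arousal_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
valence_model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.1))
arousal_model.fit(X[:150], arousal[:150])
valence_model.fit(X[:150], valence[:150])

# Each test utterance becomes a point in the 2-D arousal-valence space.
points = np.column_stack([arousal_model.predict(X[150:]),
                          valence_model.predict(X[150:])])
print(points[:5])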
PT J
AU Yunusova, Y
Baljko, M
Pintilie, G
Rudy, K
Faloutsos, P
Daskalogiannakis, J
AF Yunusova, Yana
Baljko, Melanie
Pintilie, Grigore
Rudy, Krista
Faloutsos, Petros
Daskalogiannakis, John
TI Acquisition of the 3D surface of the palate by in-vivo digitization with
Wave
SO SPEECH COMMUNICATION
LA English
DT Article
DE Hard palate modeling; Thin plate spline (TPS) technique; Palate casts;
The Wave system
ID ACCURACY; SHAPE
AB An accurate characterization of the morphology of the hard palate is essential for understanding its role in human speech. The position of the tongue is adjusted in the oral cavity, of which the hard palate is a key anatomical structure. Methods for modeling the palate are limited at present. This paper evaluated the use of a thin plate spline (TPS) technique for reconstructing the palate surface from a series of in-vivo tracings obtained with electromagnetic articulography using Wave (NDI). Twenty-four individuals (13 females and 11 males) provided upper dental casts and in-vivo tracings. Models of the palate surfaces were derived from data acquired in-vivo and compared to the scanned casts. The optimal value for the smoothness parameter for the TPS technique, which provided the smallest error of fit between the modeled and scanned surfaces, was determined empirically (the value of 0.05). Significant predictors of the quality of the fit were determined and included the individuals' palate characteristics such as palate slope and curvature. The tracing protocol composed of four different traces produced the best palate models for the in-vivo procedure. Evidence demonstrated that the TPS procedure as a whole is suitable for modeling the palate surface using a small number of in-vivo tracings. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Yunusova, Yana; Pintilie, Grigore; Rudy, Krista] Univ Toronto, Dept Speech Language Pathol, Toronto, ON M5G 1V7, Canada.
[Baljko, Melanie; Faloutsos, Petros] York Univ, Dept Comp Sci & Engn, Toronto, ON M3J 1P3, Canada.
[Daskalogiannakis, John] Univ Toronto, Dept Orthodont, Toronto, ON M5G 1G6, Canada.
RP Yunusova, Y (reprint author), Univ Toronto, Dept Speech Language Pathol, Rehabil Sci Bldg,160-500 Univ Ave, Toronto, ON M5G 1V7, Canada.
EM yana.yunusova@utoronto.ca
RI Yunusova, Yana/E-3428-2010
OI Yunusova, Yana/0000-0002-2353-2275
FU Natural Sciences and Engineering Research Council of Canada (NSERC);
ASHA Foundation
FX This research was supported by the Natural Sciences and Engineering
Research Council of Canada (NSERC) Discovery Grant and ASHA Foundation
New Investigator Award.
CR Berry JJ, 2011, J SPEECH LANG HEAR R, V54, P1295, DOI 10.1044/1092-4388(2011/10-0226)
BESL PJ, 1992, IEEE T PATTERN ANAL, V14, P239, DOI 10.1109/34.121791
Bhagyalakshmi G., 2007, SYNDR RES PRACT, V12, P55
Brunner J, 2009, J ACOUST SOC AM, V125, P3936, DOI 10.1121/1.3125313
Brunner J., 2005, ZAS PAPERS LINGUISTI, V42, P43
Ferrario VF, 1998, CLEFT PALATE-CRAN J, V35, P396, DOI 10.1597/1545-1569(1998)035<0396:QDOTMO>2.3.CO;2
Ferrario VF, 2001, CLIN ORTHOD RES, V4, P141, DOI 10.1034/j.1600-0544.2001.040304.x
Fuchs S., 2010, TURBULENT SOUNDS INT, P281
Hamilton C., 1993, SYND RES PRACT, V1, P15
HIKI S, 1986, SPEECH COMMUN, V5, P141, DOI 10.1016/0167-6393(86)90004-X
Kumar V, 2008, ANGLE ORTHOD, V78, P873, DOI 10.2319/082907-399.1
Mooshammer C., 2004, AIPUK, V36, P47
Perkell J.S., 1998, HDB PHONETIC SCI, P333
PERRIER P, 1992, J SPEECH HEAR RES, V35, P53
Schneider PJ, 2003, GEOMETRIC TOOLS COMP
Thompson G.W., 1977, J DENT RES, V56
Vorperian H.K., 2005, J ACOUST SOC AM, V117, P338
Wahba G., 1990, CBMS NSF REG C SER A, V59
Weirich M., 2011, P ISSP, P251
Westbury J.R., 1994, XRAY MICROBEAM SPEEC
Yunusova Y, 2009, J SPEECH LANG HEAR R, V52, P547, DOI 10.1044/1092-4388(2008/07-0218)
NR 21
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2012
VL 54
IS 8
BP 923
EP 931
DI 10.1016/j.specom.2012.03.006
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 961UC
UT WOS:000305494000001
ER
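A minimal sketch of thin plate spline surface fitting with a smoothness parameter, the technique evaluated above, is given below. It uses SciPy's RBFInterpolator with the 'thin_plate_spline' kernel and smoothing=0.05 (the value the abstract reports as empirically optimal) on synthetic dome-shaped points standing in for palate tracings; the data, evaluation grid and error measure are illustrative assumptions, not the paper's protocol.

import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(0)
n = 200
xy = rng.uniform(-1, 1, size=(n, 2))                   # points along a few "tracings"
z = 1.0 - 0.6 * xy[:, 0] ** 2 - 0.4 * xy[:, 1] ** 2    # dome-shaped "palate" height
z += rng.normal(scale=0.01, size=n)                    # measurement noise

# Thin plate spline fit with a small amount of smoothing.
tps = RBFInterpolator(xy, z, kernel="thin_plate_spline", smoothing=0.05)

# Evaluate the fitted surface on a regular grid and report the fit error.
gx, gy = np.meshgrid(np.linspace(-1, 1, 50), np.linspace(-1, 1, 50))
grid = np.column_stack([gx.ravel(), gy.ravel()])
surface = tps(grid).reshape(gx.shape)

true = 1.0 - 0.6 * gx ** 2 - 0.4 * gy ** 2
print("RMS error:", np.sqrt(np.mean((surface - true) ** 2)))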
PT J
AU Sun, QH
Hirose, K
Minematsu, N
AF Sun, Qinghua
Hirose, Keikichi
Minematsu, Nobuaki
TI A method for generation of Mandarin F-0 contours based on tone nucleus
model and superpositional model
SO SPEECH COMMUNICATION
LA English
DT Article
DE Mandarin; Tone; Fundamental frequency contour; Tone nucleus; Phrase
component; Tone component
ID SPEECH SYSTEM; CHINESE; RULES
AB A new method was proposed for synthesizing sentence fundamental frequency (F-0) contours of Mandarin speech. The method is based on representing a sentence logarithmic F-0 contour as a superposition of tone components on phrase components, as in the generation process model (F-0 model). However, the method does not depend fully on that model, in that tone components are generated in a corpus-based way by concatenating F-0 patterns predicted for the constituent syllables. Furthermore, the prediction is done only for the stable part of the syllable tone component, known as the tone nucleus. The entire tone components were obtained by concatenating the predicted patterns. Since the effect of tone coarticulation is minor for tone nuclei, a better prediction is possible than with conventional methods that handle full syllable F-0 contours, especially when the size of the training corpus is limited. While tone components are highly language specific, phrase components are assumed to be more language universal: a control scheme of phrase components developed for one language may, by analogy, be applicable to other languages. Also, phrase components cover a wider range of speech (phrase, clause, etc.) and are tightly related to higher-level linguistic information (syntax); concatenation of short F-0 contour fragments predicted in a corpus-based way would therefore not be appropriate. Taking these points into consideration, rules similar to those for Japanese were constructed to control phrase commands, from which phrase components were generated by simple mathematical calculations in the framework of the generation process model. There is a tight relation between phrase and tone components, so the two components cannot be generated independently. To ensure that the correct relation holds in the synthesized F-0 contour, a two-step scheme was developed, in which information about the generated phrase components was utilized for the prediction of tone components. A listening test was conducted for speech synthesized using F-0 contours generated by the developed method. The synthetic speech sounded highly natural, showing the validity of the method. Furthermore, an experiment on word emphasis showed that flexible F-0 control is possible with the proposed method. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Sun, Qinghua] Univ Tokyo, Grad Sch Engn, Tokyo 1138654, Japan.
[Hirose, Keikichi; Minematsu, Nobuaki] Univ Tokyo, Grad Sch Informat Sci & Technol, Tokyo 1138654, Japan.
RP Sun, QH (reprint author), Univ Tokyo, Grad Sch Engn, Tokyo 1138654, Japan.
EM qinghua@gavo.t.u-tokyo.ac.jp; hirose@gavo.t.u-tokyo.ac.jp;
mine@gavo.t.u-tokyo.ac.jp
CR Chao Y. R., 1968, GRAMMAR SPOKEN CHINE, P1
Chen SH, 1998, IEEE T SPEECH AUDI P, V6, P226
Fujisaki H., 1984, Journal of the Acoustical Society of Japan (E), V5
Gu W., 2006, P SPEECH PROSODY, P561
Gu W., 2005, P EUR LISB PORT, P1825
Gu WT, 2004, IEICE T INF SYST, VE87D, P1079
Hirose K., 1986, P IEEE IECEJ ASJ ICA, P2415
HIROSE K, 1993, IEICE T FUND ELECTR, VE76A, P1971
LEE LS, 1989, IEEE T ACOUST SPEECH, V37, P1309
Lee LS, 1993, IEEE T SPEECH AUDI P, V1, P287, DOI 10.1109/89.232612
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
Sun Q., 2008, P INT C SPE IN PRESS
Sun Q., 2006, P INT C SPEECH PROS, P561
Sun Q., 2008, P INT WORKSH NONL CI, P112
Sun Q., 2007, P ISCA WORKSH SPEECH, P154
Sun Q., 2005, P INT, P3265
Tao J., 2002, P INT C SPEECH LANG, P2097
Tokuda K., 1997, P IEEE ICASSP, P229
Wang R., 2006, ADV CHINESE SPOKEN L
Zhang JS, 2004, SPEECH COMMUN, V42, P447, DOI 10.1016/j.specom.2004.01.001
NR 20
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2012
VL 54
IS 8
BP 932
EP 945
DI 10.1016/j.specom.2012.03.005
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 961UC
UT WOS:000305494000002
ER
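The superpositional structure described above (log F0 as a baseline plus phrase components plus tone components, in the framework of the generation process model) can be sketched as follows. The component shapes follow the standard Fujisaki-type formulation; the command times, amplitudes and time constants are invented for illustration, and the corpus-based tone-nucleus prediction is not shown.

import numpy as np

ALPHA, BETA, GAMMA = 3.0, 20.0, 0.9   # illustrative time constants and ceiling

def phrase_component(t, t0, ap):
    """Response of the phrase control mechanism to an impulse command at t0."""
    tau = np.maximum(t - t0, 0.0)
    return ap * ALPHA ** 2 * tau * np.exp(-ALPHA * tau)

def tone_component(t, t1, t2, aa):
    """Response of the tone/accent mechanism to a pedestal command between t1 and t2."""
    def step(tau):
        tau = np.maximum(tau, 0.0)
        return np.minimum(1.0 - (1.0 + BETA * tau) * np.exp(-BETA * tau), GAMMA)
    return aa * (step(t - t1) - step(t - t2))

t = np.linspace(0.0, 3.0, 300)
log_f0 = np.log(120.0)                                  # baseline F0 of 120 Hz
log_f0 = log_f0 + phrase_component(t, 0.0, 0.5) + phrase_component(t, 1.5, 0.3)
log_f0 = log_f0 + tone_component(t, 0.2, 0.5, 0.4) + tone_component(t, 1.7, 2.1, 0.35)
f0 = np.exp(log_f0)
print("F0 range: %.1f-%.1f Hz" % (f0.min(), f0.max()))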
PT J
AU Mok, PPK
AF Mok, Peggy P. K.
TI Effects of consonant cluster syllabification on vowel-to-vowel
coarticulation in English
SO SPEECH COMMUNICATION
LA English
DT Article
DE Vowel-to-vowel coarticulation; Syllable structure; English; Consonant
clusters
ID SYLLABLE STRUCTURE; ACOUSTIC ANALYSIS; FINAL CONSONANTS; VCV UTTERANCES;
STOP PLACE; SEQUENCES; PERCEPTION; PATTERNS; CATALAN; ARTICULATION
AB This paper investigates how different syllable affiliations of an intervocalic /st/ cluster affect vowel-to-vowel coarticulation in English. Very few studies have examined the effect of syllable structure on vowel-to-vowel coarticulation. Previous studies show that onset and coda consonants differ acoustically, articulatorily, perceptually and typologically. Onsets are stronger, more stable, more common and more distinguishable than codas. Since codas are less constrained, it was hypothesized that coda /st./ would allow more vowel-to-vowel coarticulation than onset /.st/. Three vowels (/i a u/) were used to form the target sequences with the /st/ cluster in English: onset /CV.stVC/, heterosyllabic /CVs.tVC/, coda /CVst.VC/. F1 and F2 frequencies at vowel edges and the durations of the first vowel and the intervocalic consonants were measured from six speakers of Standard Southern British English. The factors included in the experiment are Direction, Syllable Form, Target and Context. Results show that coda /st./ allows more vowel-to-vowel coarticulation than onset /.st/, and that heterosyllabic /s.t/ is the most resistant of the Syllable Forms. Vowels in heterosyllabic /s.t/ are also more extreme than in the other two Syllable Forms in the carryover direction. These findings suggest that vowel-to-vowel coarticulation is sensitive to syllable structure even when the segmental composition is the same. Possible factors contributing to the observed patterns are discussed. (C) 2012 Elsevier B.V. All rights reserved.
C1 Chinese Univ Hong Kong, Dept Linguist & Modern Languages, Shatin, Hong Kong, Peoples R China.
RP Mok, PPK (reprint author), Chinese Univ Hong Kong, Dept Linguist & Modern Languages, Leung Kau Kui Bldg, Shatin, Hong Kong, Peoples R China.
EM peggymok@cuhk.edu.hk
FU Sir Edward Youde Memorial Fellowship for Overseas Studies from Hong
Kong; Overseas Research Studentship from the United Kingdom
FX The author would like to thank Sarah Hawkins for help and guidance
throughout the project. Thanks also go to Rachel Smith and Francis Nolan
for helpful discussion. She thanks the Editor and two anonymous
reviewers for their constructive comments. This research was supported
by the Sir Edward Youde Memorial Fellowship for Overseas Studies from
Hong Kong and an Overseas Research Studentship from the United Kingdom.
CR Adank P, 2004, J ACOUST SOC AM, V116, P3099, DOI 10.1121/1.1795335
ANDERSON S, 1994, J PHONETICS, V22, P283
Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X
Beddor PS, 2002, J PHONETICS, V30, P591, DOI 10.1006/jpho.2002.0177
Beddor P.S., 1995, 13 INT C PHON SCI IC, P44
Bell A., 1978, SYLLABLES SEGMENTS, P3
BOUCHER VJ, 1988, J PHONETICS, V16, P299
Browman C. P., 1995, PRODUCING SPEECH CON, P19
BROWMAN CP, 1988, PHONETICA, V45, P140
Byrd D, 1996, J PHONETICS, V24, P209, DOI 10.1006/jpho.1996.0012
BYRD D, 1995, PHONETICA, V52, P285
Byrd D, 1996, J PHONETICS, V24, P263, DOI 10.1006/jpho.1996.0014
Cho T, 2001, PHONETICA, V58, P129, DOI 10.1159/000056196
Cho TH, 2004, J PHONETICS, V32, P141, DOI 10.1016/S0095-4470(03)00043-3
CHRISTIE WM, 1974, J ACOUST SOC AM, V55, P819, DOI 10.1121/1.1914606
Cutler A, 1987, COMPUT SPEECH LANG, V2, P133
De Jong KJ, 2001, LANG SPEECH, V44, P197
de Jong KJ, 2004, LANG SPEECH, V47, P241
Fabricius AH, 2009, LANG VAR CHANGE, V21, P413, DOI 10.1017/S0954394509990160
Ferragne E, 2010, J INT PHON ASSOC, V40, P1, DOI 10.1017/S0025100309990247
Fowler C.A., 1981, J SPEECH HEAR RES, V46, P127
Gick B, 2006, J PHONETICS, V34, P49, DOI 10.1016/j.wocn.2005.03.005
Greenberg J., 1978, PHONOLOGY, V2, P243
Haggard M., 1973, J PHONETICS, V1, P9
Haggard M., 1973, J PHONETICS, V1, P111
Hawkins S., 2005, J INT PHON ASSOC, V35, P183, DOI 10.1017/S0025100305002124
Hertrich I, 1995, LANG SPEECH, V38, P159
Honorof D.N., 1995, 13 INT C PHON SCI IC, P552
Hosung Nam, 2003, 15 INT C PHON SCI IC, P2253
Keating P., 2003, PAPERS LAB PHONOLOGY, P143
Kochetov A., 2006, PAPERS LAB PHONOLOGY, V8, P565
Krakow RA, 1999, J PHONETICS, V27, P23, DOI 10.1006/jpho.1999.0089
Lehiste I., 1970, SUPRASEGMENTALS
Low E., 2000, LANG SPEECH, V43, P377
Low E.L., 1999, LANG SPEECH, V42, P39
MACCHI M, 1988, PHONETICA, V45, P109
MacNeilage PF, 2000, SCIENCE, V288, P527, DOI 10.1126/science.288.5465.527
Maddieson I., 1984, PATTERNS SOUNDS
Magen HS, 1997, J PHONETICS, V25, P187, DOI 10.1006/jpho.1996.0041
MANUEL SY, 1990, J ACOUST SOC AM, V88, P1286, DOI 10.1121/1.399705
Modarresi G, 2004, J PHONETICS, V32, P291, DOI 10.1016/j.wocn.2003.11.002
Modarresi G, 2004, PHONETICA, V61, P2, DOI 10.1159/000078660
Mok P., LANGUAGE SP IN PRESS
Mok PKP, 2010, J ACOUST SOC AM, V128, P1346, DOI 10.1121/1.3466859
Mok PPK, 2011, LANG SPEECH, V54, P527, DOI 10.1177/0023830911404961
OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151
PICKETT JM, 1995, PHONETICA, V52, P1
QUENE H, 1992, J PHONETICS, V20, P331
Recasens D, 2009, J ACOUST SOC AM, V125, P2288, DOI 10.1121/1.3089222
RECASENS D, 1989, SPEECH COMMUN, V8, P293, DOI 10.1016/0167-6393(89)90012-5
Recasens D, 2001, J PHONETICS, V29, P273, DOI 10.1006/jpho.2001.0139
Recasens D, 2002, J ACOUST SOC AM, V111, P2828, DOI 10.1121/1.1479146
RECASENS D, 1987, J PHONETICS, V15, P299
Recasens D, 1997, J ACOUST SOC AM, V102, P544, DOI 10.1121/1.419727
Redford MA, 1999, J ACOUST SOC AM, V106, P1555, DOI 10.1121/1.427152
SAMUEL AG, 1989, PERCEPT PSYCHOPHYS, V45, P485, DOI 10.3758/BF03208055
Schwab S, 2008, PHONETICA, V65, P173, DOI 10.1159/000144078
Smith Caroline L., 1995, PAPERS LAB PHONOLOGY, P205
SPROAT R, 1993, J PHONETICS, V21, P291
Stetson R.H., 1988, RETROSPECTIVE EDITIO, P235
Sussman HM, 1997, J ACOUST SOC AM, V101, P2826, DOI 10.1121/1.418567
Tuller B., 1990, ATTENTION PERFORM, P429
TULLER B, 1991, J SPEECH HEAR RES, V34, P501
Turk AE, 2000, J PHONETICS, V28, P397, DOI 10.1006/jpho.2000.0123
Vihman M. M., 1996, PHONOLOGICAL DEV
Watt D., 2010, SOCIOPHONETICS STUDE, P107
Watt D., 2002, LEEDS WORKING PAPERS, V9, P159
Wells J., 1990, STUDIES PRONUNCIATIO, P76
ZSIGA EC, 1994, J PHONETICS, V22, P121
NR 69
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2012
VL 54
IS 8
BP 946
EP 956
DI 10.1016/j.specom.2012.04.001
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 961UC
UT WOS:000305494000003
ER
PT J
AU Li, ZB
Zhao, SH
Bruhn, S
Wang, J
Kuang, JM
AF Li, Zhongbo
Zhao, Shenghui
Bruhn, Stefan
Wang, Jing
Kuang, Jingming
TI Comparison and optimization of packet loss recovery methods based on
AMR-WB for VoIP
SO SPEECH COMMUNICATION
LA English
DT Article
DE VoIP; AMR-WB; Packet loss; FEC; MDC
ID IP; NETWORKS; SPEECH; AUDIO
AB The AMR-WB codec, which has been standardized for wideband conversational speech applications, has a broad range of potential uses in the migration of wireless and wireline networks towards a single converged IP network. Forward error control (FEC) and multiple description coding (MDC) are two promising techniques for making transmission robust against packet loss in Voice over IP (VoIP). However, how to achieve the optimal reconstructed speech quality with these methods for AMR-WB under different packet loss rate conditions is still an open problem. In this paper, we compare the performance of various FEC and MDC schemes for the AMR-WB codec both analytically and experimentally. Based on the comparison results, some advantageous configurations of FEC and MDC for the AMR-WB codec are obtained, and an optimization system is proposed that selects the optimal packet loss recovery scheme in accordance with the varying network conditions. Subjective AB test results show that the optimization leads to clear improvements in perceived speech quality in the IP environment. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Li, Zhongbo; Zhao, Shenghui; Wang, Jing; Kuang, Jingming] Beijing Inst Technol, Sch Informat & Elect, Beijing 100081, Peoples R China.
[Li, Zhongbo] Inst Chinese Elect Equipment Syst Engn Co, Beijing 100141, Peoples R China.
[Bruhn, Stefan] Multimedia Technol, Ericsson Res, Stockholm, Sweden.
RP Zhao, SH (reprint author), Beijing Inst Technol, Sch Informat & Elect, Beijing 100081, Peoples R China.
EM shzhao@bit.edu.cn
FU cooperation project between Ericsson and BIT
FX The work in this paper is supported by the cooperation project between
Ericsson and BIT.
CR Altman E, 2002, COMPUT NETW, V39, P185, DOI 10.1016/S1389-1286(01)00309-7
[Anonymous], 2006, DUAL RAT SPEECH COD
[Anonymous], 2001, 26191 3GPP TS
[Anonymous], 2009, 181005 ETSI TS
[Anonymous], 2001, 2690 3GPP TS
[Anonymous], 1988, PULS COD MOD VOIC FR
[Anonymous], 2001, 26190 3GPP TS
[Anonymous], 2003, WID COD SPEECH 16 KB
[Anonymous], 1992, COD SPEECH 16 KBITS
[Anonymous], 1990, 40 32 24 16 KBITS AD
Apostolopoulos J, 2002, IEEE INFOCOM SER, P1736
Bastiaan Kleijn W, 2006, IEEE T COMMUN, V54
De Martin J.C, 2001, P IEEE INT C AC SPEE, P753
Degermark M., 1999, 2507 RFC
DONG H, 2004, IEEE ICASSP 04 QUEB, P277
ELGAMAL AA, 1982, IEEE T INFORM THEORY, V28, P851
Gunnar Karlsson, IEEE C ICC2006 IST T, P1002
International Telecommunication Union, 2007, COD SPEECH 8 KBITS U
ITU- T, 2005, WID EXT REC P 862 AS
ITU-T Rec, 1988, 7 KHZ AUD COD 64 KBI
ITU-T Rec, 2005, LOW COMPL COD 24 32
Jingming Kuang, 2007, IEEE C ICITA2007 HAR
Johansson I., 2002, IEEE SPEECH COD WORK
Morinaga T., 2002, ACOUST SPEECH SIG PR
OZAROW L, 1980, AT&T TECH J, V59, P1909
Perkins C, 1998, IEEE NETWORK, V12, P40, DOI 10.1109/65.730750
PETR DW, 1989, IEEE J SEL AREA COMM, V7, P644, DOI 10.1109/49.32328
Petracca M., 2004, 1 INT S CONTR COMM S, P587, DOI 10.1109/ISCCSP.2004.1296457
Podolsky M, 1998, IEEE INFOCOM SER, P505
Schulzrinne H, 1996, RFC1889
Wah BW, 2005, IEEE T MULTIMEDIA, V7, P167, DOI 10.1109/TMM.2004.840593
NR 31
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2012
VL 54
IS 8
BP 957
EP 974
DI 10.1016/j.specom.2012.04.003
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 961UC
UT WOS:000305494000004
ER
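As a toy illustration of the packet-loss recovery setting discussed above, the sketch below simulates a two-state (Gilbert-type) burst-loss channel and a simple media-specific FEC scheme in which each packet also carries a redundant copy of the previous frame, so an isolated loss can be repaired from the following packet. The transition probabilities and recovery rule are assumptions chosen for clarity, not the FEC/MDC configurations compared in the paper.

import random

def gilbert_losses(n, p_good_to_bad=0.05, p_bad_to_good=0.4, seed=0):
    """Simulate per-packet loss flags from a two-state burst-loss channel."""
    rng = random.Random(seed)
    lost, bad = [], False
    for _ in range(n):
        bad = rng.random() < (1 - p_bad_to_good if bad else p_good_to_bad)
        lost.append(bad)
    return lost

def frames_unrecoverable(lost):
    """Frame i is unrecoverable only if packet i and packet i+1 are both lost."""
    n = len(lost)
    return sum(1 for i in range(n) if lost[i] and (i + 1 >= n or lost[i + 1]))

lost = gilbert_losses(10000)
print("packet loss rate            : %.3f" % (sum(lost) / len(lost)))
print("residual frame loss with FEC: %.3f" % (frames_unrecoverable(lost) / len(lost)))

Because losses on this channel arrive in bursts, the residual frame loss stays well above what an independent-loss assumption would predict, which is why the choice of recovery scheme has to follow the network conditions.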
PT J
AU Wang, L
Chen, H
Li, S
Meng, HM
AF Wang, Lan
Chen, Hui
Li, Sheng
Meng, Helen M.
TI Phoneme-level articulatory animation in pronunciation training
SO SPEECH COMMUNICATION
LA English
DT Article
DE Phoneme-based articulatory models; HMM-based visual synthesis; 3D
articulatory animation
ID SPEECH SYNTHESIS; VISIBLE SPEECH; MODEL; HMM; MRI
AB Speech visualization is extended to the use of animated talking heads for computer assisted pronunciation training. In this paper, we design a data-driven 3D talking head system for articulatory animation with synthesized articulator dynamics at the phoneme level. A database of AG500 EMA recordings of three-dimensional articulatory movements is presented to explore the distinctions involved in producing the sounds. Visual synthesis methods are then investigated, including a phoneme-based articulatory model with a modified blending method. A commonly used HMM-based synthesis is also performed, with a Maximum Likelihood Parameter Generation algorithm for smoothing. The 3D articulators are then controlled by the synthesized articulatory movements, to illustrate both internal and external motions. Experimental results show the performance of the visual synthesis methods in terms of root mean square error. A perception test is then presented to evaluate the 3D animations, in which word identification accuracy is 91.6% over 286 tests and the average realism score is 3.5 (1 = bad to 5 = excellent). (C) 2012 Elsevier B.V. All rights reserved.
C1 [Wang, Lan; Li, Sheng; Meng, Helen M.] Chinese Acad Sci, Shenzhen Inst Adv Technol, Beijing 100864, Peoples R China.
[Meng, Helen M.] Chinese Univ Hong Kong, Hong Kong, Hong Kong, Peoples R China.
[Chen, Hui] Chinese Acad Sci, Inst Software, Beijing 100864, Peoples R China.
RP Wang, L (reprint author), Chinese Acad Sci, Shenzhen Inst Adv Technol, Beijing 100864, Peoples R China.
EM lan.wang@siat.ac.cn; chenhui@iscas.ac.cn; sheng.li@siat.ac.cn;
hmmeng@se.cuhk.edu.hk
FU National Nature Science Foundation of China [NSFC 61135003, NSFC
90920002]; National Fundamental Research Grant of Science and Technology
(973 Project) [2009CB320804]; Knowledge Innovation Program of the
Chinese Academy of Sciences [KJCXZ-YW-617]
FX Our work is supported by National Nature Science Foundation of China
(NSFC 61135003, NSFC 90920002), National Fundamental Research Grant of
Science and Technology (973 Project: 2009CB320804), and The Knowledge
Innovation Program of the Chinese Academy of Sciences (KJCXZ-YW-617).
CR Badin P., 2008, P 5 C ART MOT DEF OB, P132
Badin P, 2010, SPEECH COMMUN, V52, P493, DOI 10.1016/j.specom.2010.03.002
Blackburn S.C., 2000, J ACOUST SOC AM, V107, P659
Chen H, 2010, VISUAL COMPUT, V26, P477, DOI 10.1007/s00371-010-0434-1
Cohen M. M., 1993, Models and Techniques in Computer Animation
Dang J.W., 2005, P INT LISB, P1025
Deng A., 2008, COMPUTER GRAPHICS, V27, P2096
Engwall O, 2003, SPEECH COMMUN, V41, P303, DOI [10.1016/S0167-6393(02)00132-2, 10.1016/S0167-6393(03)00132-2]
Fagel S, 2004, SPEECH COMMUN, V44, P141, DOI 10.1016/j.specom.2004.10.006
Grauwinkel K., 2007, P INT C PHON SCI 200, P2173
GUENTHER FH, 1995, PSYCHOL REV, V102, P594
Hoole P., 2003, P INT C PHONETIC SCI, P265
King S., 2001, FACIAL MODEL ANIMATI
Lado R., 1957, LINGUISTICS CULTURES
Ling Z.H., 2008, P INT, P573
Ma JY, 2004, COMPUT ANIMAT VIRT W, V15, P485, DOI 10.1002/cav.11
Massaro DW, 2004, J SPEECH LANG HEAR R, V47, P304, DOI 10.1044/1092-4388(2004/025)
Murray N., 1993, 4 NCVS, P41
Rathinavelu A., 2007, LECT NOTES COMPUT SC, P786
Serrurier A, 2008, J ACOUST SOC AM, V123, P2335, DOI 10.1121/1.2875111
Tamura M., 1999, P EUR C SPEECH COMM, P959
Tarabalka Y., 2007, P ASSISTH 2007 FRANC, P187
TOKUDA K, 1995, INT CONF ACOUST SPEE, P660, DOI 10.1109/ICASSP.1995.479684
Wang L., 2009, P INTERSPEECH 2009, P2247
Wik P., 2008, P FONETIK 2008, P57
Wik P, 2009, SPEECH COMMUN, V51, P1024, DOI 10.1016/j.specom.2009.05.006
Youssef B., 2009, P INT, P2255
Zhang L, 2008, IEEE SIGNAL PROC LET, V15, P245, DOI 10.1109/LSP.2008.917004
Zierdt A., 2010, SPEECH MOTOR CONTROL, P331
NR 29
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2012
VL 54
IS 7
BP 845
EP 856
DI 10.1016/j.specom.2012.02.003
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 961VA
UT WOS:000305496400001
ER
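The phoneme-level blending idea mentioned above can be illustrated with a very small sketch: each phoneme contributes a target articulator position and a weight curve over time, and the animated trajectory is their weighted average. The targets, timing and raised-cosine weights below are invented placeholders; the paper's EMA-derived models, modified blending method and HMM/MLPG synthesis are not reproduced here.

import numpy as np

def raised_cosine_weight(t, center, width):
    """Blending weight for one phoneme, peaking at its center and zero beyond +/- width."""
    return 0.5 * (1.0 + np.cos(np.pi * np.clip((t - center) / width, -1.0, 1.0)))

# phoneme -> illustrative articulator target (x, y of tongue tip, in mm)
targets = {"s": np.array([55.0, 10.0]),
           "a": np.array([40.0, -5.0]),
           "t": np.array([57.0, 12.0])}
sequence = [("s", 0.10), ("a", 0.30), ("t", 0.50)]    # (phoneme, center time in s)

t = np.linspace(0.0, 0.6, 120)
weights = np.stack([raised_cosine_weight(t, c, 0.15) for _, c in sequence])
weights /= weights.sum(axis=0, keepdims=True) + 1e-9   # normalize blending weights per frame
trajectory = weights.T @ np.stack([targets[p] for p, _ in sequence])
print(trajectory.shape)   # (120, 2): a smooth tongue-tip path through the phoneme targets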
PT J
AU Hashimoto, K
Yamagishi, J
Byrne, W
King, S
Tokuda, K
AF Hashimoto, Kei
Yamagishi, Junichi
Byrne, William
King, Simon
Tokuda, Keiichi
TI Impacts of machine translation and speech synthesis on speech-to-speech
translation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech-to-speech translation; Machine translation; Speech synthesis;
Subjective evaluation
AB This paper analyzes the impacts of machine translation and speech synthesis on speech-to-speech translation systems. A typical speech-to-speech translation system consists of three components: speech recognition, machine translation and speech synthesis. Many techniques have been proposed for integration of speech recognition and machine translation. However, corresponding techniques have not yet been considered for speech synthesis. The focus of the current work is machine translation and speech synthesis, and we present a subjective evaluation designed to analyze their impact on speech-to-speech translation. The results of these analyses show that the naturalness and intelligibility of the synthesized speech are strongly affected by the fluency of the translated sentences. In addition, several features were found to correlate well with the average fluency of the translated sentences and the average naturalness of the synthesized speech. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Hashimoto, Kei; Tokuda, Keiichi] Nagoya Inst Technol, Dept Comp Sci & Engn, Nagoya, Aichi, Japan.
[Yamagishi, Junichi; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh, Midlothian, Scotland.
[Byrne, William] Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England.
RP Hashimoto, K (reprint author), Nagoya Inst Technol, Dept Comp Sci & Engn, Nagoya, Aichi, Japan.
EM bonanza@sp.nitech.ac.jp; jyamagis@inf.ed.ac.uk;
bill.byrne@eng.cam.ac.uk; Simon.King@ed.ac.uk; tokuda@nitech.ac.jp
FU European Community's Seventh Framework Programme [213845]; Strategic
Information and Communications R&D Promotion Programme (SCOPE) of the
Ministry of Internal Affairs and Communication, Japan; JSPS (Japan
Society for the Promotion of Science)
FX The research leading to these results was partly funded from the
European Community's Seventh Framework Programme (FP7/2007-2013) under
grant agreement 213845 (the EMIME project http://www.emime.org) and the
Strategic Information and Communications R&D Promotion Programme (SCOPE)
of the Ministry of Internal Affairs and Communication, Japan. A part of
this research was supported by JSPS (Japan Society for the Promotion of
Science) Research Fellowships for Young Scientists.
CR Boidin C., 2009, P INT SPEC SESS MACH, P2487
Bulyko I, 2002, COMPUT SPEECH LANG, V16, P533, DOI 10.1016/S0885-2308(02)00023-2
Byrne William, 2009, P HLT NAACL BOULD CO, P433, DOI 10.3115/1620754.1620817
Callison-Burch C., 2010, P NAACL HLT 2010 WOR, P1
Callison-Burch C., 2009, P 2009 C EMP METH NA, P286
Casacuberta F, 2008, IEEE SIGNAL PROC MAG, V25, P80, DOI 10.1109/MSP.2008.917989
Chae J., 2009, P 12 C EUR CHAPT ASS, P139, DOI 10.3115/1609067.1609082
Fort K, 2011, COMPUT LINGUIST, V37, P413, DOI 10.1162/COLI_a_00057
Gispert A., 2009, P NAACL HLT 2009, P73
Heilman M., 2010, P NAACL HLT 2010 WOR, P35
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
KNESER R, 1995, INT CONF ACOUST SPEE, P181, DOI 10.1109/ICASSP.1995.479394
Koehn P, 2005, P 10 MACH TRANSL SUM, P79
Kunath S.A., 2010, P NAACL HLT 2010 WOR, P168
Malfait L, 2006, IEEE T AUDIO SPEECH, V14, P1924, DOI 10.1109/TASL.2006.883177
Mutton A., 2007, P 45 ANN M ASS COMP, P344
NAKATSU C, 2006, P 44 ANN M ASS COMP, P1113, DOI 10.3115/1220175.1220315
Ney H., 1999, P IEEE INT C AC SPEE, P1149
Noth E, 2000, IEEE T SPEECH AUDI P, V8, P519, DOI 10.1109/89.861370
Parlikar A., 2010, P INT 2010, P194
Snow R., 2008, P C EMP METH NAT LAN, P254, DOI 10.3115/1613715.1613751
Stolcke A., 2002, P INT C SPOK LANG PR, P901
Tokuda K., 2000, P ICASSP, P936
TOKUDA K, 1999, ACOUST SPEECH SIG PR, P229
Vidal E., 1997, P INT C AC SPEECH SI, P111
Wan S., 2005, P 10 EUR WORKSH NAT
White J. S., 1994, P 1 C ASS MACH TRANS, P193
Wolters M., 2010, P SSW7, P136
Wu Y.J., 2009, P INTERSPEECH, P528
Yamada S., 2005, P MT SUMM, P55
Yoshimura T, 1999, P EUR, P2347
Zen H., 2004, P ICSLP, P1185
NR 32
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2012
VL 54
IS 7
BP 857
EP 866
DI 10.1016/j.specom.2012.02.004
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 961VA
UT WOS:000305496400002
ER
PT J
AU Ikbal, S
Misra, H
Hermansky, H
Magimai-Doss, M
AF Ikbal, Shajith
Misra, Hemant
Hermansky, Hynek
Magimai-Doss, Mathew
TI Phase Auto Correlation (PAC) features for noise robust speech
recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Noise robust speech recognition; Phase AutoCorrelation (PAC); Energy
normalization; Inverse cosine; Inverse entropy; MLP feature combination
ID WORD RECOGNITION
AB In this paper, we introduce a new class of noise robust features derived from an alternative measure of autocorrelation that represents the phase variation of a speech signal frame over time. These features, referred to as Phase AutoCorrelation (PAC) features, include PAC-spectrum and PAC-MFCC, among others. In traditional autocorrelation, the correlation between two time-delayed signal vectors is computed as their dot product, whereas in PAC, the angle between the vectors in the signal vector space is used to compute the correlation. PAC features are more noise robust because the angle is typically less affected by noise than the dot product. However, the use of the angle as the correlation estimate makes the PAC features inferior in clean speech. In this paper, we circumvent this problem by introducing another set of features in which complementary information from the PAC features and the traditional features is combined adaptively to retain the best of both. An entropy-based feature combination method in a multi-layer perceptron (MLP) based multi-stream framework is used to derive an adaptively combined representation of the component feature streams. An evaluation of the combined features using the OGI Numbers95 database and the Aurora-2 database under various noise conditions and noise levels shows significant improvements in recognition accuracy in clean as well as noisy conditions. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Ikbal, Shajith] IBM Res Corp, Bangalore, Karnataka, India.
[Misra, Hemant] Philips Res, Bangalore, Karnataka, India.
[Hermansky, Hynek] Johns Hopkins Univ, Baltimore, MD USA.
[Magimai-Doss, Mathew] Idiap Res Inst, Martigny, Switzerland.
RP Ikbal, S (reprint author), IBM Res Corp, Bangalore, Karnataka, India.
EM shajmoha@in.ibm.com; hemant.misra@philips.com; hynek@jhu.edu;
mathew@idiap.ch
FU Swiss National Science Foundation [MULTI: FN 2000-068231.02/1]; National
Centre of Competence in Research (NCCR); DARPA
FX The authors thank the Swiss National Science Foundation for the support
of their work during their stay at Idiap Research Institute, through
grant MULTI: FN 2000-068231.02/1 and through National Centre of
Competence in Research (NCCR) on "Interactive Multimodal Information
Management (IM2)". The authors also thank DARPA for supporting through
the EARS (Effective, Affordable, Reusable Speech-to-Text) project.
CR Alexandre P., 1993, P IEEE ICASSP 93, P99
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Bourlard H., 1996, CC AI J SPECIAL ISSU
Bourlard H., 1996, P EUR SIGN PROC C TR, P1579
Bourlard H.A., 1993, CONNECTIONIST SPEECH
Cole R., 1995, P EUR C SPEECH COMM, P821
Cooke M, 1997, INT CONF ACOUST SPEE, P863, DOI 10.1109/ICASSP.1997.596072
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Ellis D., 2001, P IEEE ICASSP 01 SAL
FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788
Furui S., 1992, P ESCA WORKSH SPEECH, P31
Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929
Hagen A., 2001, THESIS EPFL LAUSANNE
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
Hermansky H., 2000, P IEEE ICASSP 00 IST
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Hernando J., 1997, IEEE T SPEECH AUDIO, V5
Hirsch H. G., 2000, ISCA ITRW ASR2000 AU
Ikbal S., 2003, P IEEE ASRU 03 ST TH
IKBAL S, 2003, ACOUST SPEECH SIG PR, P133
Ikbal S., 2004, P INT 04 JEJ ISL KOR
Ikbal S., 2008, P INT 08 BRISB AUSTR
Ikbal S., 2004, P IEEE ICASSP 04 MON
Ikbal S., 2004, THESIS EPFL LAUSANNE
Kalgoankar K., 2009, P ASRU 09 TRENT IT
Kim W, 2011, SPEECH COMMUN, V53, P1, DOI 10.1016/j.specom.2010.08.005
Klatt D., 1986, P IEEE ICASSP 86, P741
Legetter C.J., 1995, P ARPA WORKSH SPOK L, P110
Li J., 2007, P ASRU 07 KYOT JAP
LIM JS, 1979, IEEE T ACOUST SPEECH, V27, P223
Lockwood P., 1992, P ICASSP SAN FRANC C, P265, DOI 10.1109/ICASSP.1992.225921
Mansour D., 1988, P ICASSP 88, P36
MISRA H, 2003, P ICASSP HONG KONG, V2, P741
Nolazco-Flores J.A., 1994, P ICASSP AD AUSTR, P409
Okawa S., 1998, P IEEE ICASSP 98 SEA
Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE
Rabiner L, 1993, FUNDAMENTALS SPEECH
Raj B., 1998, P ICSLP 98 SYDN AUST
Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006
Shannon BJ, 2006, SPEECH COMMUN, V48, P1458, DOI 10.1016/j.specom.2006.08.003
Sharma S., 1999, THESIS OGI PORTLAND
Stephenson TA, 2004, IEEE T SPEECH AUDI P, V12, P189, DOI 10.1109/TSA.2003.822631
Varga A., 1992, NOISEX 92 STUDY AFFE
Varga A., 1989, P EUR 89, P167
Varga A.P., 1990, P ICASSP, P845
Young S., 1992, HTK BOOK VERSION 3 2
Zhu Q., 2004, P INT 04 JEJ ISL KOR
NR 47
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2012
VL 54
IS 7
BP 867
EP 880
DI 10.1016/j.specom.2012.02.005
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 961VA
UT WOS:000305496400003
ER
PT J
AU Flynn, R
Jones, E
AF Flynn, Ronan
Jones, Edward
TI Reducing bandwidth for robust distributed speech recognition in
conditions of packet loss
SO SPEECH COMMUNICATION
LA English
DT Article
DE Robust distributed speech recognition; Auditory front-end; Wavelet;
Bandwidth reduction; Packet loss
ID NETWORKS; CHANNELS
AB This paper proposes a method to reduce the bandwidth requirements of a distributed speech recognition (DSR) system with minimal impact on recognition performance. Bandwidth reduction is achieved by applying a wavelet decomposition to feature vectors extracted from speech using an auditory-based front-end. The resulting vectors undergo vector quantisation and are then combined in pairs for transmission over a statistically modeled channel that is subject to packet burst loss. Recognition performance is evaluated in the presence of both background noise and packet loss. When there is no packet loss, results show that the proposed method can reduce the required bandwidth to 50% of that of the baseline system, in which the proposed method is not used, without compromising recognition performance. The bandwidth can be further reduced to 25% of the baseline at the cost of a slight decrease in recognition performance. Furthermore, in the presence of packet loss, the proposed bandwidth reduction method, when combined with a suitable redundancy scheme, gives a 29% reduction in bandwidth for recognition performance comparable to that of an established packet loss mitigation technique. (C) 2012 Elsevier B.V. All rights reserved.
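As a rough illustration of how a wavelet decomposition can halve the number of feature values to be quantised and transmitted, the sketch below applies a single-level Haar transform to one feature vector. The Haar wavelet, the vector dimension and the data are stand-in assumptions; the paper's auditory front-end, vector quantisation and pairing scheme are not reproduced.

```python
import numpy as np

def haar_dwt_level1(vec):
    """One level of a Haar wavelet transform of a feature vector.
    Keeping only the approximation coefficients halves the number of
    values that must be quantised and transmitted."""
    v = np.asarray(vec, dtype=float)
    if len(v) % 2:                       # pad to even length if needed
        v = np.append(v, v[-1])
    approx = (v[0::2] + v[1::2]) / np.sqrt(2.0)
    detail = (v[0::2] - v[1::2]) / np.sqrt(2.0)
    return approx, detail

# Toy usage: a 14-dimensional front-end feature vector
feature = np.random.default_rng(1).standard_normal(14)
approx, detail = haar_dwt_level1(feature)
print(len(feature), "->", len(approx))   # 14 -> 7 values to send
```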
C1 [Flynn, Ronan] Athlone Inst Technol, Sch Engn, Athlone, Ireland.
[Jones, Edward] Natl Univ Ireland, Coll Engn & Informat, Galway, Ireland.
RP Flynn, R (reprint author), Athlone Inst Technol, Sch Engn, Athlone, Ireland.
EM rflynn@ait.ie; edward.jones@nuigalway.ie
CR Agarwal A., 1999, P ASRU, P67
[Anonymous], HTK SPEECH REC TOOLK
Bernard A, 2002, IEEE T SPEECH AUDI P, V10, P570, DOI 10.1109/TSA.2002.808141
Boulis C, 2002, IEEE T SPEECH AUDI P, V10, P580, DOI 10.1109/TSA.2002.804532
El Safty S, 2009, INT J ELEC POWER, V31, P604, DOI 10.1016/j.ijepes.2009.06.003
Ephraim Y., 2006, ELECT ENG HDB, P15
ETSI, 2003, 201108 ETSI ES
ETSI, 2007, 202050 ETSI ES
Flynn R., 2006, P IET IR SIGN SYST C, P111
Flynn R, 2008, SPEECH COMMUN, V50, P797, DOI 10.1016/j.specom.2008.05.004
Gallardo-Antolin G., 2005, IEEE T SPEECH AUDIO, V13, P1186
Gomez AM, 2009, SPEECH COMMUN, V51, P390, DOI 10.1016/j.specom.2008.12.002
Gomez AM, 2006, IEEE T MULTIMEDIA, V8, P1228, DOI 10.1109/TMM.2006.884611
Hirsch H. G., 2000, P ISCA ITRW ASR2000, P181
Hsu W.H., 2004, P IEEE INT C AC SPEE, V1, P69
James A, 2006, SPEECH COMMUN, V48, P1402, DOI 10.1016/j.specom.2006.07.005
James A.B., 2004, P 2 COST278 ISCA TUT
James A.B., 2004, P IEEE INT C AC SPEE, V1, P853
James A.B., 2005, P INT 2005 LISB PORT, P2857
Jayasree T, 2009, INT J COMP ELECT ENG, V1, P590
Karapantazis S, 2009, COMPUT NETW, V53, P2050, DOI 10.1016/j.comnet.2009.03.010
Li Q., 2000, P 6 INT C SPOK LANG, V3, P51
LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577
Macho D., 2001, P IEEE INT C AC SPEE, V1, P305
Macho D., 2002, P ICSLP, P17
MALLAT SG, 1989, IEEE T PATTERN ANAL, V11, P674, DOI 10.1109/34.192463
Milner B.P., 2004, P 8 INT C SPOK LANG, P1549
Perkins C, 1998, IEEE NETWORK, V12, P40, DOI 10.1109/65.730750
Quercia D., 2002, P IEEE INT C AC SPEE, V4, P3820
Russo M, 2005, CONSUM COMM NETWORK, P493
Tan ZH, 2010, IEEE J-STSP, V4, P798, DOI 10.1109/JSTSP.2010.2057192
Tan ZH, 2005, SPEECH COMMUN, V47, P220, DOI 10.1016/j.specom.2005.05.007
Tan Z.-H., 2005, P IEEE MMSP NOV, P1
Tan ZH, 2007, IEEE T AUDIO SPEECH, V15, P1391, DOI 10.1109/TASL.2006.889799
Xie Q., 2005, 202050 ETSI ES
   Xie Q., 2003, 201108 ETSI ES
NR 36
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2012
VL 54
IS 7
BP 881
EP 892
DI 10.1016/j.specom.2012.03.001
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 961VA
UT WOS:000305496400004
ER
PT J
AU Smit, T
Turckheim, F
Mores, R
AF Smit, Thorsten
Tuerckheim, Friedrich
Mores, Robert
TI Fast and robust formant detection from LP data
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Formant tracking; Speech analysis; Male and female
voices
ID LINEAR-PREDICTION; SPEECH; EXTRACTION; SPECTRA
AB This paper introduces a method for real-time selective root finding from linear prediction (LP) coefficients using a combination of spectral peak picking and complex contour integration (CI). The proposed method locates roots within predefined areas of the complex z-plane, for instance roots which correspond to formants, while other roots are ignored. It includes an approach to limit the search area (SEA) as much as possible. For this purpose, peaks of the group delay function (GDF) serve as pointers. A frequency-weighted GDF (wGDF) is introduced in which a simple modification enables a parametric emphasis of the GDF spikes to separate merged formants. Thus, a nearly defect-free separation of peaks is possible even when these are very closely spaced. The performance and efficiency of the proposed wGDF-CI method are demonstrated by comparative error analysis evaluated on a subset of the DARPA TIMIT corpus. (C) 2012 Elsevier B.V. All rights reserved.
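The use of group delay function (GDF) peaks as pointers to formant candidates can be sketched as follows. The all-pole coefficients below are synthetic (two resonances standing in for LP analysis of real speech), and the paper's frequency weighting (wGDF) and contour-integration root refinement are not reproduced.

```python
import numpy as np
from scipy.signal import group_delay, find_peaks

# Peaks of the group delay of an all-pole (LP-style) model point at
# candidate formant regions.
fs = 8000.0
poles = []
for f0, bw in [(700.0, 80.0), (1200.0, 100.0)]:          # two formant-like resonances
    r = np.exp(-np.pi * bw / fs)
    poles += [r * np.exp(1j * 2 * np.pi * f0 / fs),
              r * np.exp(-1j * 2 * np.pi * f0 / fs)]
a = np.real(np.poly(poles))                               # denominator A(z)
b = np.array([1.0])

w, gd = group_delay((b, a), w=1024)                       # group delay of 1/A(z)
freqs = w * fs / (2 * np.pi)                              # rad/sample -> Hz
peaks, _ = find_peaks(gd)
print("candidate formant frequencies (Hz):", freqs[peaks])
```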
C1 [Smit, Thorsten; Tuerckheim, Friedrich; Mores, Robert] Univ Appl Sci Hamburg, D-20081 Hamburg, Germany.
RP Smit, T (reprint author), Univ Appl Sci Hamburg, Finkenau 35, D-20081 Hamburg, Germany.
EM thorsten.smit@mt.haw-hamburg.de; friedrich.tuerckheim@haw-hamburg.de;
mores@mt.haw-hamburg.de
FU German Federal Ministry of Education and Research (AiF) [1767X07]
FX The authors wish to thank the German Federal Ministry of Education and
Research (AiF Project No. 1767X07).
CR Alciatore D.G., 1995, WINDING NUMBER POINT
ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679
Boersma P., 2005, PRAAT DOING PHONETIC
   Deller J.R., 1999, DISCRETE TIME PROCES
Deng L., 2006, P IEEE INT C AUD SPE, P60
DUNN HK, 1961, J ACOUST SOC AM, V33, P1737, DOI 10.1121/1.1908558
Fant G., 1960, ACOUSTIC THEORY SPEE
FLANAGAN JL, 1964, J ACOUST SOC AM, V36, P1030, DOI 10.1121/1.2143268
Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC
Hanson HM, 1994, IEEE T SPEECH AUDI P, V2, P436, DOI 10.1109/89.294358
IEEE, P INT C AC SPEECH SI, V3, P1381
International Speech Communication Association (ISCA), EUR 95, V1
Itakura F., 1975, J ACOUST SOC AM, V57, P35
Kim C., 2006, EURASIP J APPL SIG P, V2006, P1
Kim HK, 1999, IEEE T SPEECH AUDI P, V7, P87
Knuth D. E., 1998, ART COMPUTER PROGRAM, V2
   Kuwabara H., 1995, SPEECH COMMUN, V16, P365
Markel JD, 1976, LINEAR PREDICTION SP
   MCCANDLESS SS, 1974, IEEE T ACOUST SPEECH, VSP22, P135, DOI 10.1109/TASSP.1974.1162559
MURTHY HA, 1989, ELECTRON LETT, V25, P1609, DOI 10.1049/el:19891080
Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE
Peterson G., 1951, J ACOUST SOC AM, V24, P1441
Pfitzinger H., 2005, ZAS PAPERS LINGUISTI, V40, P133
REDDY NS, 1984, IEEE T ACOUST SPEECH, V32, P1136, DOI 10.1109/TASSP.1984.1164456
Rudin Walter, 1974, REAL COMPLEX ANAL
Sandler M, 1991, IEE P, V9
Schafer R.W., 1970, J ACOUST SOC AM, V47, P637
Schleicher D, 2002, ERGOD THEOR DYN SYST, V22, P935, DOI 10.1017/S0143385702000482
Sjolander K., 2000, P INT C SPOK LANG PR
Snell RC, 1993, IEEE T SPEECH AUDI P, V1, P129, DOI 10.1109/89.222882
Snell R.C., 1983, INVESTIGATION SPEAKE
Stevens K.N., 2000, ACOUSTIC PHONETICS
Talkin D., 1987, J ACOUST SOC AM S1, VS1, P55
Ueda Yuichi, 2007, Acoustical Science and Technology, V28, DOI 10.1250/ast.28.271
Welling L, 1998, IEEE T SPEECH AUDI P, V6, P36, DOI 10.1109/89.650308
Williams C. S., 1986, DESIGNING DIGITAL FI
Wong D., 1980, IEEE T ACOUST SPEECH, V80, P263
NR 37
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2012
VL 54
IS 7
BP 893
EP 902
DI 10.1016/j.specom.2012.03.002
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 961VA
UT WOS:000305496400005
ER
PT J
AU Hassan, A
Damper, RI
AF Hassan, A.
Damper, R. I.
TI Classification of emotional speech using 3DEC hierarchical classifier
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech processing; Emotion recognition; Valence-arousal model;
Multiclass support vector machines
ID SYNTHETIC SPEECH; RECOGNITION; FEATURES; SIMULATION; DATABASES
AB The recognition of emotion from speech acoustics is an important problem in human-machine interaction, with many potential applications. In this paper, we first compare four ways to extend binary support vector machines (SVMs) to multiclass classification for recognising emotions from speech, namely two standard SVM schemes (one-versus-one and one-versus-rest) and two other methods (DAG and UDT) that form a hierarchy of classifiers, each making a distinct binary decision about class membership. These are trained and tested using 6552 features per speech sample extracted from three databases of acted emotional speech (DES, Berlin and Serbian) and a database of spontaneous speech (FAU Aibo Emotion Corpus) using the OpenEAR toolkit. Analysis of the errors made by these classifiers leads us to apply non-metric multi-dimensional scaling (NMDS) to produce a compact (two-dimensional) representation of the data suitable for guiding the choice of decision hierarchy. This representation can be interpreted in terms of the well-known valence-arousal model of emotion. We find that this model does not give a particularly good fit to the data: although the arousal dimension can be identified easily, valence is not well represented in the transformed data. We describe a new hierarchical classification technique whose structure is based on NMDS, which we call Data-Driven Dimensional Emotion Classification (3DEC). This new method is compared with the best of the four classifiers studied earlier and a state-of-the-art classification method on all four databases. We find no significant difference between these three approaches with respect to speaker-dependent performance. However, for the much more interesting and important case of speaker-independent emotion classification, 3DEC significantly outperforms the competitors. (C) 2012 Elsevier B.V. All rights reserved.
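A minimal sketch of the NMDS step, assuming a small invented confusion matrix in place of real classifier errors: classes that are frequently confused are treated as close, and a two-dimensional non-metric embedding is computed with scikit-learn. The labels, counts and dissimilarity construction are illustrative assumptions only.

```python
import numpy as np
from sklearn.manifold import MDS

labels = ["anger", "happiness", "neutral", "sadness"]
confusion = np.array([[50,  8,  3,  1],
                      [ 9, 47,  5,  2],
                      [ 2,  6, 44, 10],
                      [ 1,  2, 12, 48]], dtype=float)   # invented counts

# Classes confused often are 'close'; turn symmetrised confusions into
# dissimilarities in [0, 1] with a zero diagonal.
sym = (confusion + confusion.T) / 2.0
dissim = 1.0 - sym / sym.max()
np.fill_diagonal(dissim, 0.0)

nmds = MDS(n_components=2, metric=False, dissimilarity="precomputed",
           random_state=0)
coords = nmds.fit_transform(dissim)
for name, (x, y) in zip(labels, coords):
    print(f"{name:10s} {x:+.2f} {y:+.2f}")
```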
C1 [Hassan, A.; Damper, R. I.] Univ Southampton, Syst Res Grp, Sch Elect & Comp Sci, Southampton SO17 1BJ, Hants, England.
RP Damper, RI (reprint author), Univ Southampton, Syst Res Grp, Sch Elect & Comp Sci, Southampton SO17 1BJ, Hants, England.
EM ah07r@ecs.soton.ac.uk; rid@ecs.soton.ac.uk
CR Balentine B, 2002, BUILD SPEECH RECOGNI
Batliner A., 2004, P 4 INT C LANG RES E, P171
Batliner A, 2008, USER MODEL USER-ADAP, V18, P175, DOI 10.1007/s11257-007-9039-4
Batliner A, 2006, P IS LTC 2006 LJUBL, P240
Breazeal C, 2002, AUTON ROBOT, V12, P83, DOI 10.1023/A:1013215010749
Burkhardt F., 2005, INTERSPEECH, P1517
Casale S., 2008, IEEE INT C SEM COMP, P158
Chawla NV, 2002, J ARTIF INTELL RES, V16, P321
Cristianini N., 2000, INTRO SUPPORT VECTOR
DELLAERT F, 1996, P 4 INT C SPOK LANG, V3, P1970, DOI 10.1109/ICSLP.1996.608022
Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5
El Ayadi M, 2011, PATTERN RECOGN, V44, P572, DOI 10.1016/j.patcog.2010.09.020
Engberg I. S., 1996, DOCUMENTATION DANISH
Eyben F., 2010, 7 INT C LANG RES EV, P77
Eyben F., 2009, P 4 INT HUMAINE ASS, P1
Hansen J.H.L., 1997, P EUR C SPEECH COMM, P1743
Hassan A., 2009, INTERSPEECH 09, P2403
Hassan A., 2010, INTERSPEECH 10, P2354
Hsu C., 2001, IEEE T NEURAL NETWOR, V13, P415
Hsu C. W., 2003, PRACTICAL GUIDE SUPP
Jovicic S. T, 2004, P 9 C SPEECH COMP SP, P77
KRUSKAL JB, 1964, PSYCHOMETRIKA, V29, P115, DOI 10.1007/BF02289694
Lee CC, 2011, SPEECH COMMUN, V53, P1162, DOI 10.1016/j.specom.2011.06.004
Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534
Li Y., 1998, P INT C SPOK LANG PR, P2255
   Luengo I., 2009, INTERSPEECH 09, P332
Martin A., 1997, P EUR RHOD GREEC, V97, P1895
McTear MF, 2002, ACM COMPUT SURV, V34, P90, DOI 10.1145/505282.505285
Murray IR, 2008, COMPUT SPEECH LANG, V22, P107, DOI 10.1016/j.csl.2007.06.001
MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558
ORTONY A, 1990, PSYCHOL REV, V97, P315, DOI 10.1037//0033-295X.97.3.315
Paeschke A., 2004, P SPEECH PROS NAR JA, P671
Picard R. W., 1997, AFFECTIVE COMPUTING
Planet S., 2009, INTERSPEECH 09, P316
PLATT J. C., 2000, P NEUR INF PROC SYST, P547
Ramanan A., 2007, INT C IND INF SYST I, P291
RUSSELL JA, 1980, J PERS SOC PSYCHOL, V39, P1161, DOI 10.1037/h0077714
Schiel F., 2002, P 3 LANG RES EV C LR, P200
SCHLOSBERG H, 1954, PSYCHOL REV, V61, P81, DOI 10.1037/h0054570
Schuller B., 2011, P INTERSPEECH, V12, P3201
Schuller B., 2009, INTERSPEECH, P312
Schuller B., 2011, P 1 INT AUD VIS EM C, P415
SCHULLER B, 2009, IEEE WORKSH AUT SPEE, P552
Schuller B, 2010, IEEE T AFFECT COMPUT, V1, P119, DOI 10.1109/T-AFFC.2010.8
Shahid S., 2008, P 4 INT C SPEECH PRO, P669
Shami M, 2007, SPEECH COMMUN, V49, P201, DOI 10.1016/j.specom.2007.01.006
Shaukat A, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2771
Siegel S., 1988, NONPARAMETRIC STAT B
Slaney M, 2003, SPEECH COMMUN, V39, P367, DOI 10.1016/S0167-6393(02)00049-3
Steidl S., 2009, THESIS U ERLANGEN NU
Ververidis D, 2006, SPEECH COMMUN, V48, P1162, DOI 10.1016/j.specom.2006.04.003
Ververidis D., 2004, P IEEE INT C AC SPEE, V1, P593
Vidrascu L., 2005, INTERSPEECH 05, P1841
Vogt T., 2005, IEEE INT C MULT EXP, P474
Wilting J., 2006, INTERSPEECH 2006, P1093
Witten I.H., 2005, DATA MINING PRACTICA
Yang B, 2010, SIGNAL PROCESS, V90, P1415, DOI 10.1016/j.sigpro.2009.09.009
NR 57
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2012
VL 54
IS 7
BP 903
EP 916
DI 10.1016/j.specom.2012.03.003
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 961VA
UT WOS:000305496400006
ER
PT J
AU Quene, H
Semin, GR
Foroni, F
AF Quene, Hugo
Semin, Gun R.
Foroni, Francesco
TI Audible smiles and frowns affect speech comprehension
SO SPEECH COMMUNICATION
LA English
DT Article
DE Smiles; Speech comprehension; Emotion; Affect perception; Motor
resonance
ID LANGUAGE COMPREHENSION; COMMUNICATING EMOTION; VOCAL COMMUNICATION;
EXPRESSIONS; PROSODY; WORD; INTERFERENCE; RECOGNITION; RESPONSES
AB Motor resonance processes are involved both in language comprehension and in affect perception. Therefore, we predict that listeners understand spoken affective words more slowly if the phonetic form of a word is incongruent with its affective meaning. A language comprehension study involving an interference paradigm confirmed this prediction. This interference suggests that affective phonetic cues contribute to language comprehension. A perceived smile or frown affects the listener, and hearing an incongruent smile or frown impedes our comprehension of spoken words. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Quene, Hugo] Univ Utrecht, Utrecht Inst Linguist, OTS, NL-3512 JK Utrecht, Netherlands.
[Semin, Gun R.; Foroni, Francesco] Univ Utrecht, Fac Social & Behav Sci, NL-3584 CS Utrecht, Netherlands.
RP Quene, H (reprint author), Univ Utrecht, Utrecht Inst Linguist, OTS, Trans 10, NL-3512 JK Utrecht, Netherlands.
EM h.quene@uu.nl; g.r.semin@uu.nl; f.foroni@uu.nl
RI Foroni, Francesco/G-5469-2012
CR Adank P, 2010, PSYCHOL SCI, V21, P1903, DOI 10.1177/0956797610389192
Baayen RH, 2008, J MEM LANG, V59, P390, DOI 10.1016/j.jml.2007.12.005
Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1
Boersma P., 2011, PRAAT DOING PHONETIC
Chuenwattanapranithi S, 2008, PHONETICA, V65, P210, DOI 10.1159/000192793
Dimberg U, 2000, PSYCHOL SCI, V11, P86, DOI 10.1111/1467-9280.00221
Drahota A, 2008, SPEECH COMMUN, V50, P278, DOI 10.1016/j.specom.2007.10.001
   Tesink CMJY, 2008, J COGNITIVE NEUROSCI, V21, P2085
Foroni F, 2009, PSYCHOL SCI, V20, P974, DOI 10.1111/j.1467-9280.2009.02400.x
FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412
Galantucci B, 2006, PSYCHON B REV, V13, P361, DOI 10.3758/BF03193857
Gallese V, 2003, PSYCHOPATHOLOGY, V36, P171, DOI 10.1159/000072786
Gallese V, 2009, PSYCHOL RES-PSYCH FO, V73, P486, DOI 10.1007/s00426-009-0232-4
Gallese V, 1996, BRAIN, V119, P593, DOI 10.1093/brain/119.2.593
Grimshaw GM, 1998, BRAIN COGNITION, V36, P108, DOI 10.1006/brcg.1997.0949
Hawk ST, 2012, J PERS SOC PSYCHOL, V102, P796, DOI 10.1037/a0026234
Hietanen JK, 1998, PSYCHOPHYSIOLOGY, V35, P530, DOI 10.1017/S0048577298970445
Kitayama S, 2002, COGNITION EMOTION, V16, P29, DOI 10.1080/0269993943000121
KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940
Klauer KC, 2003, PSYCHOLOGY OF EVALUATION, P7
Kohler E, 2002, SCIENCE, V297, P846, DOI 10.1126/science.1070311
Lasarcyk E., 2008, 8 INT SPEECH PROD SE
MEHRABIA.A, 1967, J PERS SOC PSYCHOL, V6, P109, DOI 10.1037/h0024532
Neumann R, 2000, J PERS SOC PSYCHOL, V79, P211, DOI 10.1037//0022-3514.79.2.211
Niedenthal PM, 2010, BEHAV BRAIN SCI, V33, P417, DOI 10.1017/S0140525X10000865
Niedenthal PM, 2007, SCIENCE, V316, P1002, DOI 10.1126/science.1136930
Nygaard LC, 2008, J EXP PSYCHOL HUMAN, V34, P1017, DOI 10.1037/0096-1523.34.4.1017
Ohala J.J., 1980, J ACOUST SOC AM S, pS33
OHALA JJ, 1983, PHONETICA, V40, P1
Paulmann S, 2012, SPEECH COMMUN, V54, P92, DOI 10.1016/j.specom.2011.07.004
Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005
Quene H, 2008, J MEM LANG, V59, P413, DOI 10.1016/j.jml.2008.02.002
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
Schirmer A, 2002, COGNITIVE BRAIN RES, V14, P228, DOI 10.1016/S0926-6410(02)00108-8
Schirmer A, 2003, J COGNITIVE NEUROSCI, V15, P1135, DOI 10.1162/089892903322598102
Schroder M, 2006, IEEE T AUDIO SPEECH, V14, P1128, DOI 10.1109/TASL.2006.876118
Scott SK, 1997, NATURE, V385, P254, DOI 10.1038/385254a0
Stroop JR, 1935, J EXP PSYCHOL, V18, P643, DOI 10.1037/0096-3445.121.1.15
TARTTER VC, 1994, J ACOUST SOC AM, V96, P2101, DOI 10.1121/1.410151
Van Berkum J.J.A., 2007, J COGNITIVE NEUROSCI, V20, P580
WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238
Wilson SM, 2004, NAT NEUROSCI, V7, P701, DOI 10.1038/nn1263
Xu Y., 2010, PERCEPTION
Zwaan RA, 2006, J EXP PSYCHOL GEN, V135, P1, DOI 10.1037/0096-3445.135.1.1
NR 44
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2012
VL 54
IS 7
BP 917
EP 922
DI 10.1016/j.specom.2012.03.004
PG 6
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 961VA
UT WOS:000305496400007
ER
PT J
AU Prieto, P
Vanrell, MD
Astruc, L
Payne, E
Post, B
AF Prieto, Pilar
Vanrell, Maria del Mar
Astruc, Lluisa
Payne, Elinor
Post, Brechtje
TI Phonotactic and phrasal properties of speech rhythm. Evidence from
Catalan, English, and Spanish
SO SPEECH COMMUNICATION
LA English
DT Article
DE Rhythm; Rhythm index measures; Accentual lengthening; Final lengthening;
Spanish language; Catalan language; English language
ID DURATION; LANGUAGE; STRESS; FRENCH; MUSIC
AB The goal of this study is twofold: first, to examine in greater depth the claimed contribution of differences in syllable structure to measures of speech rhythm for three languages that are reported to belong to different rhythmic classes, namely, English, Spanish, and Catalan; and second, to investigate differences in the durational marking of prosodic heads and final edges of prosodic constituents between the three languages and test whether this distinction correlates in any way with the rhythmic distinctions. Data from a total of 24 speakers reading 720 utterances from these three languages show that differences in the rhythm metrics emerge even when syllable structure is controlled for in the experimental materials, at least between English on the one hand and Spanish/Catalan on the other, suggesting that important differences in durational patterns exist between these languages that cannot simply be attributed to differences in phonotactic properties. In particular, the vocalic variability measures nPVI-V, Delta V, and VarcoV are shown to be robust tools for discrimination above and beyond such phonotactic properties. Further analyses of the data indicate that the rhythmic class distinctions under consideration finely correlate with differences in the way these languages instantiate two prosodic timing processes, namely, the durational marking of prosodic heads, and pre-final lengthening at prosodic boundaries. (C) 2011 Elsevier B.V. All rights reserved.
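The vocalic variability measures named above have standard definitions; the sketch below computes Delta-V, VarcoV and nPVI-V from a list of vocalic interval durations. The example durations are invented, and the segmentation into vocalic intervals (the hard part) is assumed to have been done already.

```python
import numpy as np

def rhythm_metrics(vocalic_durations):
    """Standard vocalic rhythm metrics from vocalic interval durations
    (seconds): Delta-V, VarcoV and the normalised PVI (nPVI-V)."""
    d = np.asarray(vocalic_durations, dtype=float)
    delta_v = d.std(ddof=0)                       # raw durational variability
    varco_v = 100.0 * delta_v / d.mean()          # rate-normalised variability
    npvi_v = 100.0 * np.mean([abs(a - b) / ((a + b) / 2.0)
                              for a, b in zip(d[:-1], d[1:])])
    return delta_v, varco_v, npvi_v

# Toy usage with invented vocalic interval durations (s)
durations = [0.08, 0.15, 0.06, 0.18, 0.07, 0.12]
print(rhythm_metrics(durations))
```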
C1 [Prieto, Pilar] ICREA Univ Pompeu Fabra, Barcelona 08018, Spain.
[Vanrell, Maria del Mar] Univ Pompeu Fabra, Barcelona 08018, Spain.
[Astruc, Lluisa] Open Univ, Dept Languages, Fac Languages & Language Studies, Milton Keynes MK7 6AA, Bucks, England.
[Payne, Elinor] Univ Oxford, Phonet Lab, Oxford OX1 2JF, England.
   [Post, Brechtje] Univ Cambridge, Dept Theoret & Appl Linguist, Cambridge CB3 4DA, England.
RP Prieto, P (reprint author), ICREA Univ Pompeu Fabra, Carrer Roc Boronat 138,Despatx 51-600, Barcelona 08018, Spain.
EM pilar.prieto@upf.edu
RI Consolider Ingenio 2010, BRAINGLOT/D-1235-2009; Prieto,
Pilar/E-7390-2013
OI Prieto, Pilar/0000-0001-8175-1081
FU The acquisition of rhythm in Catalan, Spanish and English [2007 PBR 29];
The acquisition of intonation in Catalan, Spanish and English [2009 PBR
00018]; Generalitat de Catalunya; British Academy [SG-51777]; Spanish
Ministerio de Ciencia e Innovacion; SGR 701; [FFI2009-07648/FILO]
FX We are grateful to the audience at these three conferences, and
especially to M. D'Imperio, S. Frota, J.I. Hualde, F. Nolan, and L.
White for fruitful discussions on some of the issues raised in this
article. We also thank the action editor Jan van Santen and three
anonymous reviewers for their comments on an earlier version of this
paper. The idea for conducting the first experiment stems from an
informal conversation with J.I. Hualde. We would like to thank N.
Argemi, A. Barbera, M. Bell, A. Estrella, and F. Torres-Tamarit for
recording the data in the three languages, and to N. Hilton and P.
Roseano for carrying out the segmentation and coding of the data.
Special thanks are due to F. Ramus et al. for making available to us the
sentences used in their 1999 paper. This research has been funded by two
Batista i Roca research projects entitled "The acquisition of rhythm in
Catalan, Spanish and English" and "The acquisition of intonation in
Catalan, Spanish and English" (Refs. 2007 PBR 29 and 2009 PBR 00018,
respectively) awarded by the Generalitat de Catalunya, by the project "A
cross-linguistic study of intonational development in young infants and
children" awarded by the British Academy (Ref. SG-51777), and by the
projects FFI2009-07648/FILO and CONSOLIDER-INGENIO 2010 "Bilinguismo y
Neurociencia Cognitiva CSD2007-00012" awarded by the Spanish Ministerio
de Ciencia e Innovacion, and 2009 SGR 701, awarded by the Generalitat de
Catalunya.
CR Abercrombie D, 1967, ELEMENTS GEN PHONETI
Arvaniti A, 2009, PHONETICA, V66, P46, DOI 10.1159/000208930
Astruc L., 2009, CAMBRIDGE OCCASIONAL, V5, P1
Astruc L., 2006, P SPEECH PROS 2006, P337
Asu E., 2006, P SPEECH PROS 2006, P49
Baayen RH, 2008, J MEM LANG, V59, P390, DOI 10.1016/j.jml.2007.12.005
Barnes J, 2001, NE LING SOC NELS 32
Barry W, 2009, PHONETICA, V66, P78, DOI 10.1159/000208932
Beckman M., 1994, PAPERS LAB PHONOLOGY, P7
Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X
Beckman M. E., 1992, SPEECH PERCEPTION PR, P457
Beckman Mary, 2002, PROBUS, V14, P9, DOI 10.1515/prbs.2002.008
Boersma P., 2007, PRAAT DOING PHONETIC
Bolinger D., 1965, PITCH ACCENT SENTENC
Byrd D, 2000, PHONETICA, V57, P3, DOI 10.1159/000028456
Carter PM, 2005, AMST STUD THEORY HIS, V272, P63
Cummins F, 2002, P SPEECH PROS AIR EN, P121
Cummins F, 1998, J PHONETICS, V26, P145, DOI 10.1006/jpho.1998.0070
Dauer R., 1987, P 11 INT C PHON SCI, P447
DAUER RM, 1983, J PHONETICS, V11, P51
DELATTRE P, 1966, IRAL-INT REV APPL LI, V4, P183, DOI 10.1515/iral.1966.4.1-4.183
Dellwo V., 2004, P 8 ICSLP JEJ ISL KO
Dellwo V, 2003, P 15 INT C PHON SCI, P471
Dellwo V., 2006, P 38 LING C LANG FRA, P231
DEMANRIQUE AMB, 1983, J PHONETICS, V11, P117
den Os E., 1988, RHYTHM TEMPO DUTCH I
Dilley L., 1996, J PHONETICS, V26, P45
Estebas-Vilaplana Eva, 2010, TRANSCRIPTION INTONA, P17
Fant G., 1991, 12 INT C PHON SCI AI, P118
Ferragne E., 2004, P MOD ID LANG PAR
Fougeron C., 1996, UCLA WORKING PAPERS, V92, P61
Frota S., 2007, SEGMENTAL PROSODIC I, P131
Frota S., 2001, PROBUS, V13, P247, DOI 10.1515/prbs.2001.005
Gavalda-Ferre N, 2007, THESIS U COLL LONDON
Grabe Esther, 2002, LAB PHONOLOGY, V7, P515
Hasegawa-Johnson M., 2007, P 16 INT C PHON SCI, P1264
Hasegawa-Johnson M, 2004, P ICSA INT C SPOK LA, P2729
Hualde J.I., 2005, SOUNDS SPANISH
IBM Corporation, 2010, IBM SPSS STAT VERS 1
LLOYD J., 1940, SPEECH SIGNALS TELEP
Low E., 2000, LANG SPEECH, V43, P377
MASCARO IGNASI, 2010, P 5 INT C SPEECH PRO, P1
Mascaro J., 2002, GRAMATICA CATALA CON, P89
Nespor M., 1990, LOGICAL ISSUES LANGU, P157
Nolan F, 2009, PHONETICA, V66, P64, DOI 10.1159/000208931
O'Rourke E., 2008, P 4 C SPEECH PROS MA
OLLER DK, 1973, J ACOUST SOC AM, V54, P1235, DOI 10.1121/1.1914393
Ortega-Llebaria M., 2010, LANG SPEECH, V54, P1
Ortega-Llebaria M., 2007, CURRENT ISSUES LINGU, P155
FANT G, 1991, J PHONETICS, V19, P351
Patel A., 2008, MUSIC LANGUAGE BRAIN
Patel AD, 2003, COGNITION, V87, pB35, DOI 10.1016/S0010-0277(02)00187-7
Patel AD, 2006, J ACOUST SOC AM, V119, P3034, DOI 10.1121/1.2179657
Payne E., LANGUAGE SPEECH, V55
Payne E., 2009, OXFORD U WORKING PAP, V12, P123
Payne E., 2010, ATT CONV PROS GLI UN, P147
PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183
Pike K. L., 1945, INTONATION AM ENGLIS
Prieto P., 2010, J PHONETICS, V38, P688
Prieto Pilar, PROSODIC TY IN PRESS
Ramus F, 1999, COGNITION, V73, P265, DOI 10.1016/S0010-0277(99)00058-X
Ramus F., 2003, P 15 INT C PHON SCI, P337
Randolph J. J., 2005, JOENS U LEARN INSTR
Randolph JJ, 2008, ONLINE KAPPA CALCULA
Roach P., 1982, LINGUISTIC CONTROVER
Russo M., 2008, P 4 C SPEECH PROS MA
Shen Y., 1962, U BUFFALO STUDIES LI, V9, P1
SYRDAL ANN, 2000, P INT C SPOK LANG PR, P235
Turk AE, 1997, J PHONETICS, V25, P25, DOI 10.1006/jpho.1996.0032
Turk AE, 1999, J PHONETICS, V27, P171, DOI 10.1006/jpho.1999.0093
Warrens MJ, 2010, ADV DATA ANAL CLASSI, V4, P271, DOI 10.1007/s11634-010-0073-4
WENK BJ, 1982, J PHONETICS, V10, P193
White L., 2007, CURRENT ISSUES LINGU, P237
White L., 2009, PHONETICS PHONOLOGY
White L, 2007, J PHONETICS, V35, P501, DOI 10.1016/j.wocn.2007.02.003
WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450
NR 76
TC 15
Z9 15
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 681
EP 702
DI 10.1016/j.specom.2011.12.001
PG 22
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400001
ER
PT J
AU Oura, K
Yamagishi, J
Wester, M
King, S
Tokuda, K
AF Oura, Keiichiro
Yamagishi, Junichi
Wester, Mirjam
King, Simon
Tokuda, Keiichi
TI Analysis of unsupervised cross-lingual speaker adaptation for HMM-based
speech synthesis using KLD-based transform mapping
SO SPEECH COMMUNICATION
LA English
DT Article
DE HMM-based speech synthesis; Unsupervised speaker adaptation;
Cross-lingual speaker adaptation; Speech-to-speech translation
ID SYNTHESIS SYSTEM; RECOGNITION; ALGORITHM; MODEL; TTS
AB In the EMIME project, we developed a mobile device that performs personalized speech-to-speech translation such that a user's spoken input in one language is used to produce spoken output in another language, while continuing to sound like the user's voice. We integrated two techniques into a single architecture: unsupervised adaptation for HMM-based TTS using word-based large-vocabulary continuous speech recognition, and cross-lingual speaker adaptation (CLSA) for HMM-based TTS. The CLSA is based on a state-level transform mapping learned using minimum Kullback-Leibler divergence between pairs of HMM states in the input and output languages. Thus, an unsupervised cross-lingual speaker adaptation system was developed. End-to-end speech-to-speech translation systems for four languages (English, Finnish, Mandarin, and Japanese) were constructed within this framework. In this paper, the English-to-Japanese adaptation is evaluated. Listening tests demonstrate that adapted voices sound more similar to a target speaker than average voices and that differences between supervised and unsupervised cross-lingual speaker adaptation are small. Calculating the KLD state-mapping on only the first 10 mel-cepstral coefficients leads to huge savings in computational costs, without any detrimental effect on the quality of the synthetic speech. (C) 2012 Elsevier B.V. All rights reserved.
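A minimal sketch of a KLD-based state mapping, under the simplifying assumption that each HMM state is a single diagonal-covariance Gaussian (the systems in the paper use richer state models): each input-language state is mapped to the output-language state with the smallest symmetrised KL divergence.

```python
import numpy as np

def kld_diag_gauss(mu_p, var_p, mu_q, var_q):
    """KL divergence KL(p||q) between two diagonal-covariance Gaussians."""
    mu_p, var_p = np.asarray(mu_p, float), np.asarray(var_p, float)
    mu_q, var_q = np.asarray(mu_q, float), np.asarray(var_q, float)
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q
                        - 1.0)

def map_states(states_in, states_out):
    """For each input-language state, return the index of the output-language
    state with minimum symmetrised KL divergence. Each state is a
    (mean_vector, variance_vector) pair."""
    mapping = []
    for mu_i, var_i in states_in:
        d = [kld_diag_gauss(mu_i, var_i, mu_o, var_o)
             + kld_diag_gauss(mu_o, var_o, mu_i, var_i)
             for mu_o, var_o in states_out]
        mapping.append(int(np.argmin(d)))
    return mapping

# Toy usage: three input states mapped onto two output states, using 10-dim
# vectors (e.g. the first 10 mel-cepstral coefficients mentioned above)
rng = np.random.default_rng(2)
mk = lambda: (rng.standard_normal(10), np.full(10, 0.5))
print(map_states([mk(), mk(), mk()], [mk(), mk()]))
```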
C1 [Oura, Keiichiro; Tokuda, Keiichi] Nagoya Inst Technol, Dept Comp Sci & Engn, Showa Ku, Nagoya, Aichi 4668555, Japan.
[Yamagishi, Junichi; Wester, Mirjam; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland.
RP Oura, K (reprint author), Nagoya Inst Technol, Dept Comp Sci & Engn, Showa Ku, Gokiso Cho, Nagoya, Aichi 4668555, Japan.
EM uratec@nitech.ac.jp
FU European Community [213845]; Ministry of Internal Affairs and
Communication, Japan
FX The research leading to these results was partly funded by the European
Community's Seventh Framework Programme (FP7/2007-2013) under grant
agreement 213845 (the EMIME project), and the Strategic Information and
Communications R&D Promotion Programme (SCOPE), Ministry of Internal
Affairs and Communication, Japan.
CR Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807
Baumann O, 2010, PSYCHOL RES-PSYCH FO, V74, P110, DOI 10.1007/s00426-008-0185-z
Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X
Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464
Dines J., 2009, P INT 2009 BRIGHT UK
Dines J, 2010, IEEE J-STSP, V4, P1046, DOI 10.1109/JSTSP.2010.2079315
Fitt S., 2000, DOCUMENTATION USER G
Fukada T., 1992, P ICASSP, P137, DOI 10.1109/ICASSP.1992.225953
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Gao Y., 2003, P EUR 2003 GEN, P365
Gibson M., 2009, P INT BRIGHT UK SEP, P1791
Goldberger J, 2003, ICCV 03, P487
Hashimoto K., SPEECH COMM
Hashimoto K, 2011, INT CONF ACOUST SPEE, P5108
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Hirsimaki T, 2009, IEEE T AUDIO SPEECH, V17, P724, DOI 10.1109/TASL.2008.2012323
Imai S, 1983, ICASSP 83, P93
Iskra D., 2002, P LREC, P329
Itou K., 1998, P ICSLP, P3261
Kawahara H., 2001, P 2 MAVEBA FIR IT
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
King S., 2008, P INT, P1869
Kneser R, 1995, P IEEE INT C AC SPEE, V1, P181
Ladefoged P., 1996, SOUNDS WORLDS LANGUA
Liang H, 2010, INT CONF ACOUST SPEE, P4598, DOI 10.1109/ICASSP.2010.5495559
Liu F., 2003, P ICASSP, P636
MCCREE AV, 1995, IEEE T SPEECH AUDI P, V3, P242, DOI 10.1109/89.397089
Moore D., 2006, 3 JOINT WORKSH MULT
NEY H, 1999, ACOUST SPEECH SIG PR, P517
Noth E, 2000, IEEE T SPEECH AUDI P, V8, P519, DOI 10.1109/89.861370
PAUL DB, 1992, P WORKSH SPEECH NAT, P357, DOI 10.3115/1075527.1075614
Qian Y, 2009, IEEE T AUDIO SPEECH, V17, P1231, DOI 10.1109/TASL.2009.2015708
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
Sjolander K., 1998, P ICSLP 1998
Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816
Tokuda K, 2002, IEICE T INF SYST, VE85D, P455
Tokuda K., 1994, P INT C SPOK LANG PR, P1043
Tsujimura N, 2006, INTRO JAPANESE LINGU
Tsuzaki M., 2011, INTERSPEECH 2011 FLO, P157
Wester M, 2011, INT CONF ACOUST SPEE, P5372
Wester M., 2010, P 7 ISCA SPEECH SYNT
Wester M, 2010, P INT 2010 TOK JAP
Woodland P. C., 2001, P ISCA WORKSH AD MET, P11
Wu Y.J., 2009, P INTERSPEECH, P528
Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P66, DOI 10.1109/TASL.2008.2006647
Yamagishi J., 2009, P BLIZZ CHALL WORKSH
Yamagishi J, 2010, IEEE T AUDIO SPEECH, V18, P984, DOI 10.1109/TASL.2010.2045237
Yoneyama K, 2004, P INT 2004 JEJ ISL K
Yoshimura T., 1999, P EUROSPEECH 99 SEPT, P2374
Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885
Yu Z., 2008, P ICSP 2008, P655
Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825
Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325
NR 53
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 703
EP 714
DI 10.1016/j.specom.2011.12.004
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400002
ER
PT J
AU Kaufmann, T
Pfister, B
AF Kaufmann, Tobias
Pfister, Beat
TI Syntactic language modeling with formal grammars
SO SPEECH COMMUNICATION
LA English
DT Article
DE Large-vocabulary continuous speech recognition; Language modeling;
Formal grammar; Discriminative reranking
ID SYSTEM
AB It has repeatedly been demonstrated that automatic speech recognition can benefit from syntactic information. However, virtually all syntactic language models for large-vocabulary continuous speech recognition are based on statistical parsers. In this paper, we investigate the use of a formal grammar as a source of syntactic information. We describe a novel approach to integrating formal grammars into speech recognition and evaluate it in a series of experiments. For a German broadcast news transcription task, the approach was found to reduce the word error rate by 9.7% (relative) compared to a competitive baseline speech recognizer. We provide an extensive discussion on various aspects of the approach, including the contribution of different kinds of information, the development of a precise formal grammar and the acquisition of lexical information. (C) 2012 Elsevier B.V. All rights reserved.
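A much-simplified sketch of reranking recogniser hypotheses with a grammaticality feature in a log-linear score. The hypotheses, scores, binary 'parses' feature and weight are invented for illustration; the paper's actual feature set and discriminative training are not reproduced.

```python
# Each N-best entry carries a recogniser score and a flag saying whether the
# formal grammar covers the hypothesis (values below are made up).
nbest = [
    {"hyp": "der mann sieht den hund", "asr_score": -102.3, "parses": True},
    {"hyp": "der mann sieht der hund", "asr_score": -101.9, "parses": False},
    {"hyp": "der mann sie den hund",   "asr_score": -104.0, "parses": False},
]

GRAMMAR_WEIGHT = 2.0   # would be tuned on held-out data in a real system

def rerank(hypotheses, weight=GRAMMAR_WEIGHT):
    """Combine the recogniser score with a binary 'covered by the grammar'
    feature and return hypotheses sorted by the combined score."""
    def combined(h):
        return h["asr_score"] + weight * (1.0 if h["parses"] else 0.0)
    return sorted(hypotheses, key=combined, reverse=True)

# The grammatical hypothesis wins despite a slightly worse recogniser score.
print(rerank(nbest)[0]["hyp"])
```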
C1 [Kaufmann, Tobias; Pfister, Beat] ETH, Speech Proc Grp, Comp Engn & Networks Lab, CH-8092 Zurich, Switzerland.
RP Kaufmann, T (reprint author), ETH, Speech Proc Grp, Comp Engn & Networks Lab, Gloriastr 35, CH-8092 Zurich, Switzerland.
EM tobias.kaufmann@stairwell.ch
FU Swiss National Science Foundation
FX This work was supported by the Swiss National Science Foundation. We
cordially thank Jean-Luc Gauvain for providing us with word lattices
from the LIMSI German broadcast news transcription system. We further
thank Canoo Engineering AG for granting us access to their morphological
database.
CR Baldwin T., 2005, P 6 M PAC ASS COMP L, P23
Baldwin Timothy, 2005, P ACL SIGLEX WORKSH, P67, DOI 10.3115/1631850.1631858
Beutler R, 2007, THESIS ETH ZURICH
Beutler R., 2005, P IEEE ASRU WORKSH S, P104
Brants S., 2002, P WORKSH TREEB LING
Brants T., 2000, P 6 C APPL NAT LANG, P224, DOI 10.3115/974147.974178
Carrol J., 2005, P 2 INT JOINT C NAT, P165
   Charniak E., 2001, P 39 ANN M ASS COMP, P124, DOI 10.3115/1073012.1073029
Charniak E, 2005, P 43 ANN M ASS COMP, P173, DOI 10.3115/1219840.1219862
Chelba C, 2000, COMPUT SPEECH LANG, V14, P283, DOI 10.1006/csla.2000.0147
Chen S. F., 1999, GAUSSIAN PRIOR SMOOT
Collins C., 2004, P ACL
Collins M, 2005, COMPUT LINGUIST, V31, P25, DOI 10.1162/0891201053630273
Collins M., 2005, P ACL, P507, DOI 10.3115/1219840.1219903
   Crysmann Berthold, 2003, P RANLP
Crysmann Berthold, 2005, RES LANGUAGE COMPUTA, V3, P61, DOI 10.1007/s11168-005-1287-z
Duden, 1999, GROSSE WORTERBUCH DT
ERMAN LD, 1980, COMPUT SURV, V12, P213
Fano R. M., 1961, TRANSMISSION INFORM
Gauvain JL, 2002, SPEECH COMMUN, V37, P89, DOI 10.1016/S0167-6393(01)00061-9
Gillick L., 1989, P ICASSP, P532
Goddeau D, 1992, P ICSLP
Goodine D., 1991, P 2 EUR C SPEECH COM
Hall K., 2003, P IEEE AUT SPEECH RE, P507
Church K. W., 1990, Computational Linguistics, V16
Johnson Mark, 1999, P 37 ANN M ASS COMP, P535, DOI 10.3115/1034678.1034758
Jurafsky D., 1995, P 1995 INT C AC SPEE, VI, P189
Kaplan R. M., 2004, P HUM LANG TECHN C N, P97
Kaufmann T., 2009, P INT 09 BRIGHT ENGL
Kaufmann T, 2009, THESIS ETH ZURICH
Kiefer B, 2000, ART INTEL, P280
Kita K., 1991, P IEEE INT C AC SPEE, P269, DOI 10.1109/ICASSP.1991.150329
Malouf R., 2004, P IJCNLP 04 WORKSH S
Malouf R., 2002, P 6 C NAT LANG LEARN, P1
MANBER U, 1990, PROCEEDINGS OF THE FIRST ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, P319
McNeill W. P., 2006, P IEEE SPOK LANG TEC, P90
McTait K., 2003, P EUR 03 GEN SWITZ
Moore R., 1995, P ARPA SPOK LANG SYS
Muller S, 2000, ART INTEL, P238
Muller S, 2007, STAUFFENBURG EINFUHR
Muller Stefan, 1999, LINGUISTISCHE ARBEIT, V394
Nederhof M. J., 1997, P ACL EACL WORKSH SP, P66
Nicholson J., 2008, P 6 INT C LANG RES E
Osborne M., 2000, P 18 INT C COMP LING, P586
Pollard Carl, 1994, HEAD DRIVEN PHRASE S
Price P., 1988, P IEEE INT C AC SPEE, V1, P651
Price P., 1990, P DARPA SPEECH NAT L, P91, DOI 10.3115/116580.116612
Prins R., 2003, TRAITEMENT AUTOMATIQ, V44, P121
Rayner M., 2006, PUTTING LINGUISTICS
Riezler S., 2002, P 40 ANN M ASS COMP, P271
Roark B, 2001, COMPUT LINGUIST, V27, P249, DOI 10.1162/089120101750300526
Roark B, 2001, P ACL 01 PHIL US, P287, DOI 10.3115/1073083.1073131
Sag I., 2002, LECT NOTES COMPUT SC, V2276, P1
Sag I. A., 1987, CSLI LECT NOTES, VI
Sag I. A., 2003, SYNTACTIC THEORY FOR
Schone P, 2001, PROCEEDINGS OF THE 2001 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P100
Tomita M., 1991, GEN LR PARSING
Toutanova K., 2005, J RES LANGUAGE COMPU, V3, P83
van Noord G., 2004, P ACL
van Noord G, 2001, ROBUSTNESS LANGUAGE
van Noord Gertjan, 2006, ACT 13 C TRAIT AUT L, P20
Wahlster W., 2000, VERBMOBIL FDN SPEECH
Wang W., 2003, P IEEE AUT SPEECH RE, P519
Wang W., 2002, P C EMP METH NAT LAN, V10, P238, DOI 10.3115/1118693.1118724
Wang W., 2004, P IEEE ICASSP 04 MON, VI, P261
Xu P., 2002, P 40 ANN M ACL PHIL, P191
Yamamoto M, 2001, COMPUT LINGUIST, V27, P1, DOI 10.1162/089120101300346787
Zhang Yi, 2007, P WORKSH DEEP LING P, P128, DOI 10.3115/1608912.1608932
Zue V., 1990, P ICASSP 90, P73
NR 69
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 715
EP 731
DI 10.1016/j.specom.2012.01.001
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400003
ER
PT J
AU Zelinka, P
Sigmund, M
Schimmel, J
AF Zelinka, Petr
Sigmund, Milan
Schimmel, Jiri
TI Impact of vocal effort variability on automatic speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Vocal effort level; Robust speech recognition; Machine learning
ID STRESS; NOISE
AB The impact of changes in a speaker's vocal effort on the performance of automatic speech recognition has largely been overlooked by researchers and virtually no speech resources exist for the development and testing of speech recognizers at all vocal effort levels. This study deals with speech properties in the whole range of vocal modes - whispering, soft speech, normal speech, loud speech, and shouting. Fundamental acoustic and phonetic changes are documented. The impact of vocal effort variability on the performance of an isolated-word recognizer is shown and effective means of improving the system's robustness are tested. The proposed multiple model framework approach reaches a 50% relative reduction of word error rate compared to the baseline system. A new specialized speech database, BUT-VE1, is presented, which contains speech recordings of 13 speakers at 5 vocal effort levels with manual phonetic segmentation and sound pressure level calibration. (C) 2012 Elsevier B.V. All rights reserved.
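The abstract does not spell out the multiple model framework, so the sketch below shows one common realisation only: a per-level Gaussian mixture model detects the vocal effort of an utterance, and the utterance is then routed to an effort-matched recogniser. Features, data and model sizes are synthetic placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)
levels = ["whisper", "soft", "normal", "loud", "shout"]

# Pretend training data: 2-D features per level, drawn around increasing
# means purely for illustration.
train = {name: rng.normal(loc=i, scale=0.4, size=(200, 2))
         for i, name in enumerate(levels)}

models = {name: GaussianMixture(n_components=2, random_state=0).fit(X)
          for name, X in train.items()}

def detect_effort(frames):
    """Pick the vocal-effort level whose GMM gives the highest average
    log-likelihood for the utterance frames."""
    scores = {name: gmm.score(frames) for name, gmm in models.items()}
    return max(scores, key=scores.get)

test_utt = rng.normal(loc=3, scale=0.4, size=(50, 2))   # 'loud'-like frames
print(detect_effort(test_utt))                           # routes to that model
```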
C1 [Zelinka, Petr; Sigmund, Milan] Brno Univ Technol, Dept Radio Elect, Brno 61200, Czech Republic.
[Schimmel, Jiri] Brno Univ Technol, Dept Telecommun, Brno 61200, Czech Republic.
RP Zelinka, P (reprint author), Brno Univ Technol, Dept Radio Elect, Purkynova 118, Brno 61200, Czech Republic.
EM xzelin06@stud.feec.vutbr.cz; sigmund@feec.vutbr.cz;
schimmel@feec.vutbr.cz
FU Czech Grant Agency [102/08/H027]; European Social Fund
[CZ.1.07/2.3.00/20.0007, MSM 0021630513]
FX The described research was supported by the Czech Grant Agency under
Grant No. 102/08/H027. The research leading to the results has also
received funding from the European Social Fund under Grant agreement
CZ.1.07/2.3.00/20.0007 (the WICOMT project) and under the research
program MSM 0021630513 (ELCOM).
CR Acero A., 2000, P ICSLP, P869
Bou-Ghazale SE, 1998, IEEE T SPEECH AUDI P, V6, P201, DOI 10.1109/89.668815
Brungart D.S., 2001, P EUR AALB DENM, P747
Chang C.-C., 2011, ACM T INTEL SYST TEC, V2, P1, DOI DOI 10.1145/1961189.1961199
Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929
Greenberg J. H., 2002, INDO EUROPEAN ITS CL
Ikeno A., 2007, P 2007 IEEE AER C
Ito T, 2005, SPEECH COMMUN, V45, P139, DOI 10.1016/j.specom.2003.10.005
Itoh T., 2001, P ASRU 01, P429
Junqua JC, 1996, SPEECH COMMUN, V20, P13, DOI 10.1016/S0167-6393(96)00041-6
Kuttruff H., 2000, ROOM ACOUSTICS
Lippmann R. P., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0)
Lu YY, 2009, SPEECH COMMUN, V51, P1253, DOI 10.1016/j.specom.2009.07.002
Meyer BT, 2011, SPEECH COMMUN, V53, P753, DOI 10.1016/j.specom.2010.07.002
Murray-Smith Roderick, 1997, MULTIPLE MODEL APPRO
Nouza J., 1997, Radioengineering, V6
Oppenheim A. V, 2002, DISCRETE TIME SPEECH
Paliwal K.K, 1996, AUTOMATIC SPEECH SPE
Ternstrom S, 2006, J ACOUST SOC AM, V119, P1648, DOI 10.1121/1.2161435
Traunmuller H, 2000, J ACOUST SOC AM, V107, P3438, DOI 10.1121/1.429414
Wu TF, 2004, J MACH LEARN RES, V5, P975
Xu H., 2005, P INT 2005 LISB PORT, P977
Zelinka Petr, 2010, Proceedings of the 2010 IEEE International Workshop on Machine Learning for Signal Processing (MLSP), DOI 10.1109/MLSP.2010.5589174
Zhang BB, 2004, PATTERN RECOGN, V37, P131, DOI 10.1016/S0031-3203(03)00140-7
Zhang C., 2008, P INT 2008
Zhang C, 2007, P INT 2007, P2289
Zhang C., 2009, P INT 2009
Zhang C., 2008, P ITRW 2008
NR 28
TC 12
Z9 13
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 732
EP 742
DI 10.1016/j.specom.2012.01.002
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400004
ER
PT J
AU Kotsakis, R
Kalliris, G
Dimoulas, C
AF Kotsakis, R.
Kalliris, G.
Dimoulas, C.
TI Investigation of broadcast-audio semantic analysis scenarios employing
radio-programme-adaptive pattern classification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Audio-semantics; Radio-programmes; Content-management; Speech/non-speech
segmentation; Pattern classification; Neural networks
ID NEWS TRANSCRIPTION SYSTEM; SPEECH RECOGNITION; BIOACOUSTICS APPLICATION;
LONG-TERM; SEGMENTATION; FEATURES; WAVELETS
AB The present paper focuses on the investigation of various audio pattern classifiers in broadcast-audio semantic analysis, using radio-programme-adaptive classification strategies with supervised training. Multiple neural network topologies and training configurations are evaluated and compared in combination with feature-extraction, ranking and feature-selection procedures. Different pattern classification taxonomies are implemented, using programme-adapted multi-class definitions and hierarchical schemes. Hierarchical and hybrid classification taxonomies are deployed in speech analysis tasks, facilitating efficient speaker recognition/identification, speech/music discrimination, and generally speech/non-speech detection-segmentation. Exhaustive qualitative and quantitative evaluation is conducted, including indicative comparison with non-neural approaches. Hierarchical approaches offer classification-similarities for easy adaptation to generic radio-broadcast semantic analysis tasks. The proposed strategy exhibits increased efficiency in radio-programme content segmentation and classification, which is one of the most demanding audio semantics tasks. This strategy can be easily adapted in broader audio detection and classification problems, including additional real-world speech-communication demanding scenarios. (C) 2012 Elsevier B.V. All rights reserved.
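A toy sketch of a two-stage hierarchical taxonomy of the kind described above: a first classifier separates speech from non-speech, and a second classifier, trained on speech segments only, identifies the speaker. The features, class inventory and network sizes are invented; they merely illustrate the hierarchical routing.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(4)

def fake_segments(center, n=100, dim=8):
    """Synthetic per-segment feature vectors clustered around a centre."""
    return rng.normal(loc=center, scale=0.5, size=(n, dim))

X_speech_a, X_speech_b = fake_segments(0.0), fake_segments(1.0)   # two speakers
X_music, X_jingle = fake_segments(3.0), fake_segments(4.0)        # non-speech

X = np.vstack([X_speech_a, X_speech_b, X_music, X_jingle])
y_stage1 = np.array([1] * 200 + [0] * 200)            # speech vs non-speech
y_stage2 = np.array([0] * 100 + [1] * 100)            # speaker id (speech only)

stage1 = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                       random_state=0).fit(X, y_stage1)
stage2 = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000,
                       random_state=0).fit(X[:200], y_stage2)

def classify(segment_features):
    if stage1.predict([segment_features])[0] == 1:     # speech branch
        return "speaker_" + str(stage2.predict([segment_features])[0])
    return "non-speech"

print(classify(fake_segments(0.0, n=1)[0]))
```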
C1 [Kotsakis, R.; Kalliris, G.; Dimoulas, C.] Aristotle Univ Thessaloniki, Dept Journalism & Mass Commun, Lab Elect Media, Thessaloniki, Greece.
RP Dimoulas, C (reprint author), Aristotle Univ Thessaloniki, Dept Journalism & Mass Commun, Lab Elect Media, Thessaloniki, Greece.
EM babis@eng.auth.gr
CR Ajmera J, 2003, SPEECH COMMUN, V40, P351, DOI 10.1016/S0167-6393(02)00087-0
Avdelidis K, 2010, LECT NOTES ARTIF INT, V6086, P100, DOI 10.1007/978-3-642-13529-3_12
Avdelidis K., 2010, P 128 AES CONV
Bach JH, 2011, SPEECH COMMUN, V53, P690, DOI 10.1016/j.specom.2010.07.003
Benzeghiba M, 2007, SPEECH COMMUN, V49, P763, DOI 10.1016/j.specom.2007.02.006
Beyerlein P, 2002, SPEECH COMMUN, V37, P109, DOI 10.1016/S0167-6393(01)00062-0
Bishop M, 1995, NEURAL NETWORKS PATT
Burnett I.S., 2006, MPEG 21 BOOK
Burred JJ, 2004, J AUDIO ENG SOC, V52, P724
Celma O, 2008, J WEB SEMANT, V6, P162, DOI 10.1016/j.websem.2008.01.003
Dhanalakshmi P, 2009, EXPERT SYST APPL, V36, P6069, DOI 10.1016/j.eswa.2008.06.126
Dhanalakshmi P, 2011, ENG APPL ARTIF INTEL, V24, P350, DOI 10.1016/j.engappai.2010.10.011
Dhanalakshmi P., 2010, ASIAN J INFORM TECHN, V9, P323
Dimoulas C, 2008, EXPERT SYST APPL, V34, P26, DOI 10.1016/j.eswa.2006.08.014
Dimoulas C., 2010, 4 PAN HELL C AC ELIN
Dimoulas C., 2007, P 122 AES CONV
Dimoulas C, 2007, COMPUT BIOL MED, V37, P438, DOI 10.1016/j.compbiomed.2006.08.013
Dimoulas CA, 2011, EXPERT SYST APPL, V38, P13082, DOI 10.1016/j.eswa.2011.04.115
Gauvain JL, 2002, SPEECH COMMUN, V37, P89, DOI 10.1016/S0167-6393(01)00061-9
Hall M., 2009, SIGKDD EXPLORATIONS, V11, P10, DOI DOI 10.1145/1656274.1656278
Hosmer DW, 2000, APPL LOGISTIC REGRES, V2nd
   Huijbregts M, 2011, SPEECH COMMUN, V53, P143, DOI 10.1016/j.specom.2010.08.008
Jain AK, 2000, IEEE T PATTERN ANAL, V22, P4, DOI 10.1109/34.824819
Jothilakshmi S, 2009, EXPERT SYST APPL, V36, P9799, DOI 10.1016/j.eswa.2009.02.040
Kakumanu P, 2006, SPEECH COMMUN, V48, P598, DOI 10.1016/j.specom.2005.09.005
Kalliris G., 2009, P INT C NEW MED INF
Kalliris G.M., 2002, P 12 AES CONV
Kim H.-G., 2006, MPEG 7 AUDIO AUDIO C
Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009
Koenen R, 2000, SIGNAL PROCESS-IMAGE, V15, P463, DOI 10.1016/S0923-5965(99)00058-2
Kotsakis R., 2011, THESIS ARISTOTLE U T
Lartillot O., 2007, P INT C MUS INF RETR
Lee CC, 2011, SPEECH COMMUN, V53, P1162, DOI 10.1016/j.specom.2011.06.004
Loviscach J., 2010, P 128 AES CONV LOND
Markaki M, 2011, SPEECH COMMUN, V53, P726, DOI 10.1016/j.specom.2010.08.007
Matsiola M., 2009, THESIS ARISTOTLE U T
MOODY JE, 1992, ADV NEUR IN, V4, P847
Nguyen MN, 2010, EURASIP J AUDIO SPEE, DOI 10.1155/2010/572571
Pater N., 2005, MACH LEARN C PAP ECE
Platt JC, 1999, ADVANCES IN KERNEL METHODS, P185
QUINLAN JR, 1986, MACH LEARN, V1, P1, DOI DOI 10.1023/A:1022643204877
Rongqing Huang, 2006, IEEE Transactions on Audio, Speech and Language Processing, V14, DOI 10.1109/TSA.2005.858057
Sankar A, 2002, SPEECH COMMUN, V37, P133, DOI 10.1016/S0167-6393(01)00063-2
Seidl T., 1998, P ACM SIGMOD INT C M, P154, DOI DOI 10.1145/276304.276319
Sjolander K., 2000, P INT C SPOK LANG PR
Spyridou P.L., INT COMMUNI IN PRESS
Spyridou P.L., 2009, THESIS ARISTOTLE U T
Stouten F, 2006, SPEECH COMMUN, V48, P1590, DOI 10.1016/j.specom.2006.04.004
Taniguchi T, 2008, SPEECH COMMUN, V50, P547, DOI 10.1016/j.specom.2008.03.007
van Rossum G, 1991, PYTHON LANGUAGE
Vegiris C., 2009, P 126 AES CONV AUD E
Veglis A., 2008, 1 MONDAY, V13
Veglis A., 2008, JOURNALISM, V9, P52, DOI 10.1177/1464884907084340
Woodland PC, 2002, SPEECH COMMUN, V37, P47, DOI 10.1016/S0167-6393(01)00059-0
Wu CH, 2009, IEEE T AUDIO SPEECH, V17, P1612, DOI 10.1109/TASL.2009.2021304
Wu JD, 2009, EXPERT SYST APPL, V36, P8056, DOI 10.1016/j.eswa.2008.10.051
NR 56
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 743
EP 762
DI 10.1016/j.specom.2012.01.004
PG 20
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400005
ER
PT J
AU Moattar, MH
Homayounpour, MM
AF Moattar, M. H.
Homayounpour, M. M.
TI Variational conditional random fields for online speaker detection and
tracking
SO SPEECH COMMUNICATION
LA English
DT Article
DE Conditional random fields; Gaussian mixture model; Variational
approximation; Speaker verification; Speaker diarization; Speaker
tracking
ID SUPPORT VECTOR MACHINES; VERIFICATION; RECOGNITION; MODELS; SYSTEMS;
TEXT
AB There are many references that concern a specific aspect of speaker tracking. This paper focuses on the speaker modeling issue and proposes conditional random fields (CRF) for this purpose. CRF is a class of undirected graphical models for classifying sequential data. CRF has some interesting characteristics which have encouraged us to use this model in a speaker modeling and tracking task. The main concern with the CRF model is its training. Known approaches for CRF training are prone to overfitting and unreliable convergence. To solve this problem, variational approaches are proposed in this paper. The main novelty of this paper is to adapt the variational framework for CRF training. The resulting approach is evaluated in three different areas. First, the best CRF model configuration for speaker modeling is evaluated on text-independent speaker verification. Next, the selected model is used in a speaker detection task, in which the models of the existing speakers in the conversation are known a priori. Then, the proposed CRF approach is compared with GMM in an online speaker tracking framework. The results show that the proposed CRF model is superior to GMM in speaker detection and tracking, due to its capability for sequence modeling and segmentation. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Moattar, M. H.; Homayounpour, M. M.] Amirkabir Univ Technol, Comp Engn & Informat Technol Dept, Lab Intelligent Sound & Speech Proc, Tehran, Iran.
RP Homayounpour, MM (reprint author), Amirkabir Univ Technol, Comp Engn & Informat Technol Dept, Lab Intelligent Sound & Speech Proc, Tehran, Iran.
EM moattar@aut.ac.ir; homayoun@aut.ac.ir
FU Iran Telecommunication Research Center (ITRC) [T/500/14939]
FX The authors would like to thank Iran Telecommunication Research Center
(ITRC) for supporting this work under contract No. T/500/14939.
CR Anguera X., 2006, P 2 INT WORKSH MULT
Anguera X, 2011, IEEE TASLP IN PRESS
[Anonymous], 2009, 2009 RT 09 RICH TRAN
[Anonymous], 2009, NIST YEAR 2010 SPEAK
Attias H., 1999, P 15 C UNC ART INT, P21
Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360
Beal M.J., 2003, THESIS U CAMBRIDGE U
Bijankhan M, 2003, P EUROSPEECH 2003, P1525
Bijankhan M, 2002, GREAT FARSDAT DATABA
Bishop C. M., 2006, PATTERN RECOGNITION
Blei DM, 2006, BAYESIAN ANAL, V1, P121
Campbell WM, 2006, COMPUT SPEECH LANG, V20, P210, DOI 10.1016/j.csl.2005.06.003
Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086
CASELLA G, 1992, AM STAT, V46, P167, DOI 10.2307/2685208
Cournapeau D, 2010, INT CONF ACOUST SPEE, P4462, DOI 10.1109/ICASSP.2010.5495610
DARROCH JN, 1972, ANN MATH STAT, V43, P1470, DOI 10.1214/aoms/1177692379
Davy M., 2000, ACOUST SPEECH SIG PR, P33
Ding N, 2010, INT CONF ACOUST SPEE, P2098, DOI 10.1109/ICASSP.2010.5495125
Garofolo J., 2002, P LREC MAY 29 31
Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC
Gonina E., 2011, P AUT SPEECH REC UND
Gunawardana A., 2005, P INTERSPEECH, P1117
Izmirli O, 2000, P INT S MUS INF RETR, P284
Jordan M. I., 1998, LEARNING GRAPHICAL M
Jordan M. I., 1999, LEARNING GRAPHICAL M, P105
Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009
Kotti M, 2008, SIGNAL PROCESS, V88, P1091, DOI 10.1016/j.sigpro.2007.11.017
Kumar S., 2004, ADV NEURAL INFOR PRO, V16
Kwon S., 2004, P ICSLP, P1517
Kwon S., 2004, IEEE T SPEECH AUDIO, V13, P1004
Lafferty John D., 2001, ICML, P282
Li H., 2009, P INT C AC SPEECH SI, P4201
Liao CP, 2010, INT CONF ACOUST SPEE, P2002, DOI 10.1109/ICASSP.2010.5495215
Liu Y., 2005, P 9 ANN INT C COMP B, P408
Markov K., 2007, P INT, P1437
MARKOV K, 2008, P INT, P363
Markov K., 2007, P ASRU, P699
Martin A., 2001, P EUR C SPEECH COMM, V2, P787
McCallum A., 2003, P 19 C UNC ART INT U, P403
Mirghafori N., 2006, P ICASSP
Mishra H.K., 2009, P INT C ADV PATT REC, P183
Moattar M. H., 2009, P EUSIPCO, P2549
Morency L. P., 2007, MITCSAILTR2007002
Muthusamy Y.K., 1992, P INT C SPOK LANG PR, V2, P895
Nasios N, 2006, IEEE T SYST MAN CY B, V36, P849, DOI 10.1109/TSMCB.2006.872273
NIST, 2002, RICH TRANSCR EV PROJ
Parisi G., 1988, STAT FIELD THEORY
Prabhavalkar R, 2010, INT CONF ACOUST SPEE, P5534, DOI 10.1109/ICASSP.2010.5495222
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Reynolds D.A., 2000, DIGIT SIGNAL PROCESS, V10, P1
Sahu V.P., 2009, LECT NOTES COMPUT SC, P513
Sato K, 2005, BIOINFORMATICS, V21, P237, DOI 10.1093/bioinformatics/bti1139
Schmidt M., 2008, P IEEE INT S PAR DIS, P1
Schnitzspan P, 2009, PROC CVPR IEEE, P2238
Settles B, 2005, BIOINFORMATICS, V21, P3191, DOI 10.1093/bioinformatics/bti475
Sha F., 2003, P HLT NAACL, P213
Shen YA, 2010, J SIGNAL PROCESS SYS, V61, P51, DOI 10.1007/s11265-008-0299-y
Somervuo P., 2002, P ICSLP, P1245
Su D, 2010, INT CONF ACOUST SPEE, P4890, DOI 10.1109/ICASSP.2010.5495122
Sung Y.H., 2007, P ASRU, P347
Sutton C., 2006, INTRO STAT RELATIONA
Teh Y.W., 2008, ADV NEURAL INFOR PRO, V20
Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256
Valente F, 2005, THESIS EURECOM
You CH, 2009, IEEE SIGNAL PROC LET, V16, P49, DOI 10.1109/LSP.2008.2006711
Yu D, 2010, INT CONF ACOUST SPEE, P5030, DOI 10.1109/ICASSP.2010.5495072
Zamalloa, 2010, P ICASSP, P4962
Zhao XY, 2009, INT CONF ACOUST SPEE, P4049
Zhu DL, 2009, INT CONF ACOUST SPEE, P4045, DOI 10.1109/ICASSP.2009.4960516
NR 69
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 763
EP 780
DI 10.1016/j.specom.2012.01.005
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400006
ER
PT J
AU Wester, M
AF Wester, Mirjam
TI Talker discrimination across languages
SO SPEECH COMMUNICATION
LA English
DT Article
DE Human speech perception; Talker discrimination; Cross-language
ID VOICE IDENTIFICATION; SPEECH; RECOGNITION; SPEAKERS
AB This study investigated the extent to which listeners are able to discriminate between bilingual talkers in three language pairs - English-German, English-Finnish and English-Mandarin. Native English listeners were presented with two sentences spoken by bilingual talkers and were asked to judge whether they thought the sentences were spoken by the same person. Equal numbers of cross-language and matched-language trials were presented. The results show that native English listeners are able to carry out this task well, achieving percent correct levels well above chance for all three language pairs. Previous research has shown this for English-German; the present research shows that listeners also extend this ability to Finnish and Mandarin, languages that are quite distinct from English from a genetic and phonetic similarity perspective. However, listeners are significantly less accurate on cross-language talker trials (English-foreign) than on matched-language trials (English-English and foreign-foreign). Understanding listeners' behaviour in cross-language talker discrimination using natural speech is the first step in developing principled evaluation techniques for synthesis systems in which the goal is for the synthesised voice to sound like the original speaker, for instance, in speech-to-speech translation systems, voice conversion and reconstruction. (C) 2012 Elsevier B.V. All rights reserved.
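For readers who want to reproduce the kind of accuracy summary described in this abstract, the sketch below computes percent correct and a signal-detection d' score from same/different trial counts in a talker-discrimination task (d' following the conventions of the cited Stanislaw and Todorov, 1999). This is an illustrative sketch only; the counts and the function name are invented and do not come from the paper.

```python
import numpy as np
from scipy.stats import norm

def discrimination_scores(same_correct, same_total, diff_correct, diff_total):
    """Percent correct and d' for a same/different talker-discrimination task.

    'Same' trials answered 'same' are hits; 'different' trials answered
    'same' are false alarms.  Rates are clipped away from 0 and 1 before
    the z-transform (a common convention, cf. Stanislaw & Todorov, 1999).
    """
    pc = (same_correct + diff_correct) / (same_total + diff_total)
    hit_rate = np.clip(same_correct / same_total, 0.01, 0.99)
    fa_rate = np.clip((diff_total - diff_correct) / diff_total, 0.01, 0.99)
    d_prime = norm.ppf(hit_rate) - norm.ppf(fa_rate)
    return pc, d_prime

# Hypothetical counts for one listener and one language pair.
print(discrimination_scores(same_correct=40, same_total=50,
                            diff_correct=35, diff_total=50))
```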
C1 Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland.
RP Wester, M (reprint author), Univ Edinburgh, Ctr Speech Technol Res, 10 Crichton St, Edinburgh EH8 9AB, Midlothian, Scotland.
EM mwester@inf.ed.ac.uk
FU European Community [213845]
FX The research leading to these results was partly funded from the
European Community's Seventh Framework Programme (FP7/2007-2013) under
the grant agreement 213845 (the EMIME project). The author thanks
Vasilis Karaiskos for running the perception experiments.
CR ABE M, 1991, J ACOUST SOC AM, V90, P76, DOI 10.1121/1.402284
Bradlow A, 2010, SPEECH COMMUN, V52, P930, DOI 10.1016/j.specom.2010.06.003
GOGGIN JP, 1991, MEM COGNITION, V19, P448, DOI 10.3758/BF03199567
Karhila R., 2011, P INT 11 FLOR IT
KREIMAN J, 1991, SPEECH COMMUN, V10, P265, DOI 10.1016/0167-6393(91)90016-M
Latorre J, 2006, SPEECH COMMUN, V48, P1227, DOI 10.1016/j.specom.2006.05.003
Lewis M. P., 2009, ETHNOLOGUE LANGUAGES
Liang H., 2010, P ICASSP 10
Mashimo M., 2001, P EUR 01
Nygaard LC, 1998, PERCEPT PSYCHOPHYS, V60, P355, DOI 10.3758/BF03206860
Nygaard LC, 2005, BLACKW HBK LINGUIST, P390, DOI 10.1002/9780470757024.ch16
Perrachione TK, 2007, NEUROPSYCHOLOGIA, V45, P1899, DOI 10.1016/j.neuropsychologia.2006.11.015
Perrachione TK, 2009, J EXP PSYCHOL HUMAN, V35, P1950, DOI 10.1037/a0015869
Philippon AC, 2007, APPL COGNITIVE PSYCH, V21, P539, DOI 10.1002/acp.1296
R Development Core Team, 2010, R LANG ENV STAT COMP
SAMMON JW, 1969, IEEE T COMPUT, VC 18, P401, DOI 10.1109/T-C.1969.222678
Stanislaw H, 1999, BEHAV RES METH INS C, V31, P137, DOI 10.3758/BF03207704
Stockmal V., 2004, P 17 C LING PRAG
Stockmal V, 2000, APPL PSYCHOLINGUIST, V21, P383, DOI 10.1017/S0142716400003052
Sundermann D., 2006, P INT 06
THOMPSON CP, 1987, APPL COGNITIVE PSYCH, V1, P121, DOI 10.1002/acp.2350010205
Tsuzaki M., 2011, P INT 11 FLOR IT
VANLANCKER D, 1987, NEUROPSYCHOLOGIA, V25, P829, DOI 10.1016/0028-3932(87)90120-5
Wester M., 2010, P SSW7
Wester M., 2010, EDIINFRR1388 U ED
Wester M., 2011, P INT 11 FLOR IT
Wester M., 2011, P ICASSP
Wester M., 2011, EDIINFRR1396 U ED
Wester M., 2010, P INT 10
Winters S., 2005, ENCY LANGUAGE LINGUI, V12, P31
Winters SJ, 2008, J ACOUST SOC AM, V123, P4524, DOI 10.1121/1.2913046
NR 31
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 781
EP 790
DI 10.1016/j.specom.2012.01.006
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400007
ER
PT J
AU Oba, T
Hori, T
Nakamura, A
AF Oba, Takanobu
Hori, Takaaki
Nakamura, Atsushi
TI Efficient training of discriminative language models by sample selection
SO SPEECH COMMUNICATION
LA English
DT Article
DE Discriminative language model; Error correction; Discriminative
training; Sample selection
AB This paper focuses on discriminative language models (DLMs) for large vocabulary speech recognition tasks. To train such models, we usually use a large number of hypotheses generated for each utterance by a speech recognizer, namely an n-best list or a lattice. Since the data size is large, we usually need a high-end machine or a large-scale distributed computation system consisting of many computers for model training. However, it is still unclear whether or not such a large number of sentence hypotheses are necessary. Furthermore, we do not know which kinds of sentences are necessary. In this paper, we show that we can generate a high-performance model using small subsets of the n-best lists by choosing samples properly, i.e., we describe a sample selection method for DLMs. Sample selection reduces the memory footprint needed for holding training samples and allows us to train models on a standard machine. Furthermore, it enables us to generate a highly accurate model using various types of features. Specifically, experimental results show that even training with only two samples from each list can provide an accurate model with a small memory footprint. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Oba, Takanobu; Hori, Takaaki; Nakamura, Atsushi] NTT Corp, NTT Commun Sci Labs, Seika, Kyoto, Japan.
RP Oba, T (reprint author), NTT Corp, NTT Commun Sci Labs, 2-4 Hikaridai, Seika, Kyoto, Japan.
EM oba.takanobu@lab.ntt.co.jp; hori.t@lab.ntt.co.jp;
nakamura.atsushi@lab.ntt.co.jp
CR Arisoy E, 2010, INT CONF ACOUST SPEE, P5538, DOI 10.1109/ICASSP.2010.5495226
Collins M, 2005, COMPUT LINGUIST, V31, P25, DOI 10.1162/0891201053630273
Collins M., 2005, P ACL, P507, DOI 10.3115/1219840.1219903
Collins M, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P1
Hori T., 2005, P INT, P284
KNESER R, 1995, INT CONF ACOUST SPEE, P181, DOI 10.1109/ICASSP.1995.479394
LIU DC, 1989, MATH PROGRAM, V45, P503, DOI 10.1007/BF01589116
Maekawa K., 2000, P 2 INT C LANG RES E, V2, P947
McDermott E, 2007, IEEE T AUDIO SPEECH, V15, P203, DOI 10.1109/TASL.2006.876778
Oba T, 2010, INT CONF ACOUST SPEE, P5126, DOI 10.1109/ICASSP.2010.5495028
Oba Takanobu, 2007, P INT, P1753
Och F.J, 2003, P 41 ANN M ASS COMP, P160
Roark B., 2004, P ICASSP, V1, P749
Roark B, 2007, COMPUT SPEECH LANG, V21, P373, DOI 10.1016/j.csl.2006.06.006
Roark B., 2004, P ACL, P47, DOI 10.3115/1218955.1218962
Shafran I., 2006, P EMNLP SYDN AUSTR, P390, DOI 10.3115/1610075.1610130
Zhou Z., 2006, P ICASSP, V1, P141
NR 17
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 791
EP 800
DI 10.1016/j.specom.2012.01.007
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400008
ER
PT J
AU Kamper, H
Mukanya, FJM
Niesler, T
AF Kamper, Herman
Mukanya, Felicien Jeje Muamba
Niesler, Thomas
TI Multi-accent acoustic modelling of South African English
SO SPEECH COMMUNICATION
LA English
DT Article
DE Multi-accent acoustic modelling; Multi-accent speech recognition;
Under-resourced languages; South African English accents
ID SPEECH RECOGNITION
AB Although English is spoken throughout South Africa, it is most often used as a second or third language, resulting in several prevalent accents within the same population. When dealing with multiple accents in this under-resourced environment, automatic speech recognition (ASR) is complicated by the need to compile multiple, accent-specific speech corpora. We investigate how best to combine speech data from five South African accents of English in order to improve overall speech recognition performance. Three acoustic modelling approaches are considered: separate accent-specific models, accent-independent models obtained by pooling training data across accents, and multi-accent models. The latter approach extends the decision-tree clustering process normally used to construct tied-state hidden Markov models (HMMs) by allowing questions relating to accent. We find that multi-accent modelling outperforms accent-specific and accent-independent modelling in both phone and word recognition experiments, and that these improvements are statistically significant. Furthermore, we find that the relative merits of the accent-independent and accent-specific approaches depend on the particular accents involved. Multi-accent modelling therefore offers a mechanism by which speech recognition performance can be optimised automatically, avoiding hard decisions regarding which data to pool and which to separate. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Kamper, Herman; Mukanya, Felicien Jeje Muamba; Niesler, Thomas] Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa.
RP Niesler, T (reprint author), Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa.
EM kamperh@sun.ac.za; trn@sun.ac.za
FU National Research Foundation (NRF)
FX The authors would like to thank Febe de Wet for her helpful comments and
suggestions. The financial assistance of the National Research
Foundation (NRF) towards this research is hereby acknowledged. Opinions
expressed and conclusions arrived at are those of the authors and are
not necessarily to be attributed to the NRF. Parts of this work were
executed using the High Performance Computer (HPC) facility at
Stellenbosch University.
CR Badenhorst J.A.C., 2008, P PRASA CAP TOWN S A
Beattie V, 1995, P EUR 95, P1123
Bisani M., 2004, P IEEE INT C AC SPEE, P409
Bowerman S., 2004, HDB VARIETIES ENGLIS, P949
Bowerman S, 2004, HDB VARIETIES ENGLIS, P931
Caballero M, 2009, SPEECH COMMUN, V51, P217, DOI 10.1016/j.specom.2008.08.003
Chengalvarayan R, 2001, P EUR AALB DENM, P2733
Crystal D., 1991, DICT LINGUISTICS PHO
Despres J., 2009, P INT BRIGHT, P96
Diakoloukas V., 1997, P IEEE INT C AC SPEE, P1455
Finn P., 2004, HDB VARIETIES ENGLIS, P964
Fischer V., 1998, P ICSLP SYDN AUSTR, P787
Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd
Hershey JR, 2008, INT CONF ACOUST SPEE, P4557, DOI 10.1109/ICASSP.2008.4518670
Humphries J. J., 1997, P EUR, P2367
KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125
Kirchhoff K, 2005, SPEECH COMMUN, V46, P37, DOI 10.1016/j.specom.2005.01.004
Mak B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607191
McCormick K., 2004, HDB VARIETIES ENGLIS, P993
Mesthrie R., 2004, HDB VARIETIES ENGLIS, P953
Mesthrie R., 2004, HDB VARIETIES ENGLIS, P974
NEY H, 1994, COMPUT SPEECH LANG, V8, P1, DOI 10.1006/csla.1994.1001
Niesler T, 2007, SPEECH COMMUN, V49, P453, DOI 10.1016/j.specom.2007.04.001
Olsen P.A., 2007, P INT C, P46
Roux JC, 2004, P LREC LISB PORT, P93
Schneider EW, 2004, HDB VARIETIES ENGLIS
Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7
Statistics South Africa, 2004, CENS 2001 PRIM TABL
Stolcke A., 2002, P INT C SPOK LANG PR, P901
ten Bosch L., 2000, P ICSLP BEIJ CHIN, P1009
Van Compernolle D., 1991, P EUR 91, P723
Van Rooy B., 2004, HDB VARIETIES ENGLIS, V1, P943
Vihola M., 2002, ACOUST SPEECH SIG PR, P933
Wang Z., 2003, P IEEE INT C AC SPEE, P540
Young S., 2009, HTK BOOK VERSION 3 4
Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885
NR 36
TC 2
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 801
EP 813
DI 10.1016/j.specom.2012.01.008
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400009
ER
PT J
AU Pavez, E
Silva, JF
AF Pavez, Eduardo
Silva, Jorge F.
TI Analysis and design of Wavelet-Packet Cepstral coefficients for
automatic speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Wavelet Packets; Filter-bank analysis; Automatic speech recognition;
Filter-bank selection; Cepstral coefficients; The Gray code
ID HIDDEN MARKOV-MODELS; FEATURE-EXTRACTION; SAMPLING THEOREM; BASIS
SELECTION; SIGNAL; CLASSIFICATION; REPRESENTATIONS; SUBSPACES;
FRAMEWORK; FILTERS
AB This work proposes using Wavelet-Packet Cepstral coefficients (WPCCs) as an alternative way to do filter-bank energy-based feature extraction (FE) for automatic speech recognition (ASR). The rich coverage of time-frequency properties of Wavelet Packets (WPs) is used to obtain new sets of acoustic features whose performance is competitive with, and in some cases better than, that of the widely adopted Mel-Frequency Cepstral coefficients (MFCCs) on the TIMIT corpus. In the analysis, concrete filter-bank design considerations are stipulated to obtain most of the phone-discriminating information embedded in the speech signal; filter-bank frequency selectivity and finer discrimination in the lower frequency range [200 Hz-1 kHz] of the acoustic spectrum are important aspects to consider. (C) 2012 Elsevier B.V. All rights reserved.
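The abstract describes cepstral coefficients computed from filter-bank subband energies, with a Wavelet-Packet tree taking the place of the Mel filter bank. The sketch below shows only the generic final stage (log subband energies followed by a DCT-II); the rectangular band split is a placeholder and not the paper's wavelet-packet decomposition, and all names and parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def subband_cepstral_coeffs(frame, sr, band_edges_hz, n_ceps=13):
    """Cepstral coefficients from log subband energies (DCT-II).

    The band split below is a crude rectangular partition used only as a
    placeholder; in the paper the energies come from a wavelet-packet
    filter bank, and MFCCs would instead use a Mel-spaced filter bank.
    """
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    energies = []
    for lo, hi in zip(band_edges_hz[:-1], band_edges_hz[1:]):
        band = spectrum[(freqs >= lo) & (freqs < hi)]
        energies.append(band.sum() + 1e-12)       # avoid log(0)
    log_e = np.log(np.asarray(energies))
    return dct(log_e, type=2, norm='ortho')[:n_ceps]

# Hypothetical usage on a 25 ms frame of 16 kHz speech.
sr = 16000
frame = np.random.randn(400)                      # stand-in for real speech
edges = np.linspace(0, sr / 2, 25)                # 24 uniform bands
print(subband_cepstral_coeffs(frame, sr, edges))
```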
C1 [Pavez, Eduardo; Silva, Jorge F.] Univ Chile, Dept Elect Engn, Santiago 4123, Chile.
RP Silva, JF (reprint author), Univ Chile, Dept Elect Engn, Av Tupper 2007, Santiago 4123, Chile.
EM epavez@ing.uchile.cl; josilva@ing.uchile.cl
RI Silva, Jorge/M-8750-2013
FU FONDECYT, CONICYT-Chile [1110145]
FX The work was supported by funding from FONDECYT Grant 1110145,
CONICYT-Chile. We are grateful to the anonymous reviewers for their
suggestions and comments that contribute to improve the quality and
organization of the work. We thank S. Beckman for proofreading this
material.
CR Atto A.M., 2010, IEEE T INFORM THEORY, V56, P429
Atto AM, 2007, SIGNAL PROCESS, V87, P2320, DOI 10.1016/j.sigpro.2007.03.014
BOHANEC M, 1994, MACH LEARN, V15, P223, DOI 10.1007/BF00993345
Chang T, 1993, IEEE T IMAGE PROCESS, V2, P429, DOI 10.1109/83.242353
CHOU PA, 1989, IEEE T INFORM THEORY, V35, P299, DOI 10.1109/18.32124
Choueiter GF, 2007, IEEE T AUDIO SPEECH, V15, P939, DOI 10.1109/TASL.2006.889793
Coifman R., 1990, SIGNAL PROCESSING CO
Coifman R., 1992, WAVELETS THEIR APPL, P153
COIFMAN RR, 1992, IEEE T INFORM THEORY, V38, P713, DOI 10.1109/18.119732
Cormen T. H., 1990, INTRO ALGORITHMS
Cover T M, 1991, ELEMENTS INFORM THEO
Crouse MS, 1998, IEEE T SIGNAL PROCES, V46, P886, DOI 10.1109/78.668544
Daubechies I., 1992, 10 LECT WAVELETS
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Duda R.O., 1983, PATTERN CLASSIFICATI
Etemad K, 1998, IEEE T IMAGE PROCESS, V7, P1453, DOI 10.1109/83.718485
Farooq O, 2001, IEEE SIGNAL PROC LET, V8, P196, DOI 10.1109/97.928676
GRAY R. M., 1990, ENTROPY INFORM THEOR
Kim K, 2000, IEEE SYS MAN CYBERN, P2891
Kullback S., 1958, INFORM THEORY STAT
Learned R.E., 1992, WAVELET PACKET BASED, P109
LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546
Mallat S, 2009, WAVELET TOUR OF SIGNAL PROCESSING: THE SPARSE WAY, P1
MALLAT SG, 1989, IEEE T PATTERN ANAL, V11, P674, DOI 10.1109/34.192463
Olshen R., 1984, CLASSIFICATION REGRE, V1st
Padmanabhan M, 2005, IEEE T SPEECH AUDI P, V13, P512, DOI 10.1109/TSA.2005.848876
Quatieri T. F., 2002, DISCRETE TIME SPEECH
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
Ramchandran K, 1996, P IEEE, V84, P541, DOI 10.1109/5.488699
SAITO N, 1994, P SOC PHOTO-OPT INS, V2303, P2, DOI 10.1117/12.188763
Scott C, 2005, IEEE T SIGNAL PROCES, V53, P4518, DOI 10.1109/TSP.2005.859220
Scott C, 2004, IEEE T SIGNAL PROCES, V52, P2264, DOI 10.1109/TSP.2004.831121
Shen JH, 1998, APPL COMPUT HARMON A, V5, P312, DOI 10.1006/acha.1997.0234
Shen JH, 1996, P AM MATH SOC, V124, P3819, DOI 10.1090/S0002-9939-96-03557-5
Silva J, 2009, IEEE T SIGNAL PROCES, V57, P1796, DOI 10.1109/TSP.2009.2013898
Silva J., 2007, IEEE WORKSH MACH LEA
Silva JF, 2012, PATTERN RECOGN, V45, P1853, DOI 10.1016/j.patcog.2011.11.015
Tan BT, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2431
Vaidyanathan P. P., 1993, MULTIRATE SYSTEMS FI
Vasconcelos N, 2004, IEEE T SIGNAL PROCES, V52, P2322, DOI 10.1109/TSP.2004.831125
Vetterli M., 1995, WAVELET SUBBAND CODI
WALTER GG, 1992, IEEE T INFORM THEORY, V38, P881, DOI 10.1109/18.119745
Willsky AS, 2002, P IEEE, V90, P1396, DOI 10.1109/JPROC.2002.800717
Young S., 2009, HTK BOOK HTK VERSION
Zhou XW, 1999, J FOURIER ANAL APPL, V5, P347, DOI 10.1007/BF01259375
NR 45
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 814
EP 835
DI 10.1016/j.specom.2012.02.002
PG 22
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400010
ER
PT J
AU Flynn, R
Jones, E
AF Flynn, Ronan
Jones, Edward
TI Feature selection for reduced-bandwidth distributed speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Distributed speech recognition; Feature selection; Bandwidth reduction;
HLDA
ID MODELS
AB The impact on speech recognition performance in a distributed speech recognition (DSR) environment of two methods used to reduce the dimension of the feature vectors is examined in this paper. The motivation behind reducing the dimension of the feature set is to reduce the bandwidth required to send the feature vectors over a channel from the client front-end to the server back-end in a DSR system. In the first approach, the features are empirically chosen to maximise recognition performance. A data-centric transform-based dimensionality-reduction technique is applied in the second case. Test results for the empirical approach show that individual coefficients have different impacts on the speech recognition performance, and that certain coefficients should always be present in an empirically selected reduced feature set for given training and test conditions. Initial results show that for the empirical method, the number of elements in a feature vector produced by an established DSR front-end can be reduced by 23% with low impact on the recognition performance (less than 8% relative performance drop compared to the full bandwidth case). Using the transform-based approach, for a similar impact on recognition performance, the number of feature vector elements can be reduced by 30%. Furthermore, for best recognition performance, the results indicate that the SNR of the speech signal should be considered using either approach when selecting the feature vector elements that are to be included in a reduced feature set. (C) 2012 Elsevier B.V. All rights reserved.
C1 [Flynn, Ronan] Athlone Inst Technol, Dept Elect Engn, Athlone, Ireland.
[Jones, Edward] Natl Univ Ireland, Coll Engn & Informat, Galway, Ireland.
RP Flynn, R (reprint author), Athlone Inst Technol, Dept Elect Engn, Athlone, Ireland.
EM rflynn@ait.ie; edward.jones@nuigalway.ie
CR [Anonymous], HTK SPEECH RECOGNITI
[Anonymous], 2007, 202050 ETSI ES
[Anonymous], 2003, 201108 ETSI ES
Bocchieri E. L., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1012
Cernak M., 2007, P 12 INT C SPEECH CO, VI, P188
Chakroborty S, 2010, SPEECH COMMUN, V52, P693, DOI 10.1016/j.specom.2010.04.002
Choi E., 2002, P 9 AUSTR INT C SPEE, P166
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034
Hirsch H. G., 2000, P ISCA ITRW ASR2000, P181
Jafari A, 2010, SPEECH COMMUN, V52, P725, DOI 10.1016/j.specom.2010.04.005
Koniaris C, 2010, INT CONF ACOUST SPEE, P4342, DOI 10.1109/ICASSP.2010.5495648
Koniaris C, 2010, J ACOUST SOC AM, V127, pEL73, DOI 10.1121/1.3284545
Kumar N, 1998, SPEECH COMMUN, V26, P283, DOI 10.1016/S0167-6393(98)00061-2
Nicholson S., 1997, P EUR SPEECH C SPEEC, P413
Paliwal K. K., 1992, Digital Signal Processing, V2, DOI 10.1016/1051-2004(92)90005-J
Peinado A. M., 2006, SPEECH RECOGNITION D
Plasberg JH, 2007, IEEE T AUDIO SPEECH, V15, P310, DOI 10.1109/TASL.2006.876722
Tan ZH, 2008, ADV PATTERN RECOGNIT, P1, DOI 10.1007/978-1-84800-143-5
NR 19
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 836
EP 843
DI 10.1016/j.specom.2012.01.003
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400011
ER
PT J
AU Howard, DM
Abberton, E
Fourcin, A
AF Howard, David M.
Abberton, Evelyn
Fourcin, Adrian
TI Disordered voice measurement and auditory analysis (vol 54, pg 611,
2012)
SO SPEECH COMMUNICATION
LA English
DT Correction
C1 [Howard, David M.] Univ York, Dept Elect, Audio Lab, York YO10 5DD, N Yorkshire, England.
[Abberton, Evelyn; Fourcin, Adrian] UCL, Dept Phonet & Linguist, London WC1E 6BT, England.
RP Howard, DM (reprint author), Univ York, Dept Elect, Audio Lab, York YO10 5DD, N Yorkshire, England.
EM dh@ohm.york.ac.uk
CR Howard DM, 2012, SPEECH COMMUN, V54, P611, DOI 10.1016/j.specom.2011.03.008
NR 1
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2012
VL 54
IS 6
BP 844
EP 844
DI 10.1016/j.specom.2012.03.007
PG 1
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 940QP
UT WOS:000303908400012
ER
PT J
AU Rodriguez, WR
Saz, O
Lleida, E
AF Rodriguez, William R.
Saz, Oscar
Lleida, Eduardo
TI A prelingual tool for the education of altered voices
SO SPEECH COMMUNICATION
LA English
DT Article
DE Altered voice; Voice therapy; Speech processing; Formant normalization;
Vocal tract length estimation
AB This paper addresses the problem of Computer-Aided Voice Therapy for altered voices. The aim of this work is to develop a set of free activities called PreLingua that provide interactive voice therapy to individuals with voice disorders. The interactive tools are designed to train voice skills such as voice production, intensity, blowing, vocal onset, phonation time, tone, and vocalic articulation for the Spanish language. The development of these interactive tools, along with the underlying speech technologies that support them, requires speech processing algorithms that are robust to the sources of speech variability characteristic of this population of speakers. One of the main problems addressed is how to reliably estimate formant frequencies in high-pitched speech (typical of children and women) and how to normalize these estimates independently of the characteristics of the speakers. Linear predictive coding, homomorphic analysis and modeling of the vocal tract are the core of the speech processing techniques used to allow such normalization through vocal tract length. This paper also presents the results of an experimental study in which PreLingua was applied in a population with voice disorders and pathologies in special education centers in Spain and Colombia. Promising results were obtained in this preliminary study after 12 weeks of therapy, with improvements in the voice capabilities of a remarkable number of users and evidence of the tool's ability to educate users with voice alterations. This improvement was assessed by the evaluations of the educators before and after the study and also by the performance of the subjects in the activities of PreLingua. The results are encouraging for continued work in this direction, with the overall aim of providing further functionality and robustness to the system. (C) 2011 Elsevier B.V. All rights reserved.
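The abstract mentions formant normalization through vocal tract length (VTL). As a rough illustration of the underlying idea (not the paper's LPC/homomorphic algorithm), the sketch below uses the standard uniform lossless-tube approximation F_n = (2n - 1)c / (4L) to estimate VTL from measured formants and to rescale them toward a reference length; the constants and example values are assumptions.

```python
import numpy as np

SPEED_OF_SOUND = 35000.0  # cm/s, approximate value in warm, moist air

def vtl_from_formants(formants_hz):
    """Estimate vocal tract length (cm) from measured formants using the
    uniform lossless-tube approximation F_n = (2n - 1) * c / (4 * L).

    Each formant gives one length estimate; the mean is returned."""
    formants_hz = np.asarray(formants_hz, dtype=float)
    n = np.arange(1, len(formants_hz) + 1)
    lengths = (2 * n - 1) * SPEED_OF_SOUND / (4.0 * formants_hz)
    return lengths.mean()

def normalize_formants(formants_hz, reference_vtl_cm=17.5):
    """Scale formants to a reference vocal tract length (~adult male)."""
    vtl = vtl_from_formants(formants_hz)
    return np.asarray(formants_hz) * (vtl / reference_vtl_cm)

# Hypothetical child formants for a vowel (higher than typical adult values).
child = [950.0, 2800.0, 4400.0]
print(vtl_from_formants(child), normalize_formants(child))
```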
C1 [Rodriguez, William R.; Saz, Oscar; Lleida, Eduardo] Univ Zaragoza, Commun Technol Grp GTC, Aragon Inst Engn Res I3A, Zaragoza 50018, Spain.
RP Rodriguez, WR (reprint author), Univ Zaragoza, Commun Technol Grp GTC, Aragon Inst Engn Res I3A, Maria de Luna 1, Zaragoza 50018, Spain.
EM wricardo@unizar.es; oskarsaz@unizar.es; lleida@unizar.es
RI Lleida, Eduardo/K-8974-2014; Saz Torralba, Oscar/L-7329-2014
OI Lleida, Eduardo/0000-0001-9137-4013;
FU MEC of the Spanish government [TIN-2008-06856-C05-04]; Santander Bank
FX This work was supported under TIN-2008-06856-C05-04 from MEC of the
Spanish government and Santander Bank scholarships. The authors want to
acknowledge Center of Special Education "CEDESNID" in Bogota (Colombia),
and the Public School for Especial Education "Alborada" in Zaragoza
(Spain), for their collaboration applying and testing PreLingua.
CR Arias C., 2005, DISFONIA INFANTIL
Aronso A., 1993, CLIN VOICE DISORDERS
Fant G., 1960, ACOUSTIC THEORY SPEE
Gurlekian J., 2000, CARACTERIZACION ARTI
Kenneth D., 1966, P ANN CONV AM SPEECH
Kirschning I., 2007, IDEA GROUP
Kornilov A.-U., 2004, P 9 INT C SPEECH COM
Martinez-Celdran E., 1989, FONOLOGIA GEN ESPANO
Necioglu B., 2000, ACOUST SPEECH SIGNAL, V3, P1319
Rabiner L., 1978, DIGITAL PROCESSING S
Rabiner L. R., 2007, INTRO DIGITAL SPEECH
Rodriguez W.-R., 2009, P 2009 WORKSH SPEECH
Sakhnov K., 2009, IAENG INT J COMPUT S, V36
Saz O, 2009, SPEECH COMMUN, V51, P948, DOI 10.1016/j.specom.2009.04.006
Shahidur Rahman M., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.502
Traunmuller H., 1997, P EUROSPEECH 1997, V1, P477
VERHELST W, 1986, IEEE T ACOUST SPEECH, V34, P43, DOI 10.1109/TASSP.1986.1164787
WAKITA H, 1977, IEEE T ACOUST SPEECH, V25, P183, DOI 10.1109/TASSP.1977.1162929
Watt D., 2002, LEEDS WORKING PAPERS, V9, P159
NR 19
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2012
VL 54
IS 5
SI SI
BP 583
EP 600
DI 10.1016/j.specom.2011.05.006
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 925JD
UT WOS:000302756600001
ER
PT J
AU Vaiciukynas, E
Verikas, A
Gelzinis, A
Bacauskiene, M
Uloza, V
AF Vaiciukynas, Evaldas
Verikas, Antanas
Gelzinis, Adas
Bacauskiene, Marija
Uloza, Virgilijus
TI Exploring similarity-based classification of larynx disorders from human
voice
SO SPEECH COMMUNICATION
LA English
DT Article
DE Laryngeal disorder; Pathological voice; Mel-frequency cepstral
coefficients; Sequence kernel; Kullback-Leibler divergence; Earth
mover's distance; GMM; SVM
AB In this paper, the identification of laryngeal disorders using cepstral parameters of the human voice is investigated. Mel-frequency cepstral coefficients (MFCCs), extracted from audio recordings of patients' voices, are further approximated using various strategies (sampling, averaging, and clustering by Gaussian mixture model). The effectiveness of similarity-based classification techniques in categorizing such pre-processed data into normal voice, nodular, and diffuse vocal fold lesion classes is explored, and schemes to combine binary decisions of support vector machines (SVMs) are evaluated. The widely used RBF kernel was compared to several custom kernels: (i) a sequence kernel, defined over a pair of matrices rather than over a pair of vectors, calculating the kernelized principal angle (KPA) between subspaces; (ii) a simple supervector kernel using only the means of the patient's GMM; (iii) two distance kernels, specifically tailored to exploit the covariance matrices of the GMM, using the approximation of the Kullback-Leibler divergence by Monte-Carlo sampling (KL-MCS) and the Kullback-Leibler divergence combined with the Earth mover's distance (KL-EMD) as similarity metrics.
The sequence kernel and the distance kernels both outperformed the popular RBF kernel, but the difference is statistically significant only for the distance kernels. When tested on voice recordings collected from 410 subjects (130 normal voice, 140 diffuse, and 140 nodular vocal fold lesions), the KL-MCS kernel, using GMMs with full covariance matrices, and the KL-EMD kernel, using GMMs with diagonal covariance matrices, provided the best overall performance. In most cases, SVM reached higher accuracy than least squares SVM, except for common binary classification using distance kernels. The results indicate that features modeled with GMMs, combined with kernel methods that exploit this information, constitute an interesting fusion of generative (probabilistic) and discriminative (hyperplane) models for similarity-based classification. (C) 2011 Elsevier B.V. All rights reserved.
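The KL-MCS kernel mentioned above relies on a Monte-Carlo approximation of the Kullback-Leibler divergence between two GMMs, for which no closed form exists. The sketch below illustrates that approximation for diagonal-covariance GMMs and turns the symmetrised divergence into an RBF-style kernel value; the kernel construction details, component counts and example parameters are assumptions, not the paper's exact recipe.

```python
import numpy as np

def gmm_logpdf(x, weights, means, variances):
    """Log density of a diagonal-covariance GMM at points x of shape (n, d)."""
    x = np.atleast_2d(x)
    log_comp = []
    for w, m, v in zip(weights, means, variances):
        ll = -0.5 * (np.sum(np.log(2 * np.pi * v))
                     + np.sum((x - m) ** 2 / v, axis=1))
        log_comp.append(np.log(w) + ll)
    return np.logaddexp.reduce(np.stack(log_comp, axis=0), axis=0)

def gmm_sample(n, weights, means, variances, rng):
    """Draw n samples from a diagonal-covariance GMM."""
    comp = rng.choice(len(weights), size=n, p=weights)
    noise = rng.standard_normal((n, means.shape[1]))
    return means[comp] + noise * np.sqrt(variances[comp])

def kl_monte_carlo(p, q, n_samples=5000, seed=0):
    """KL(p || q) approximated as the mean of log p(x) - log q(x), x ~ p."""
    rng = np.random.default_rng(seed)
    x = gmm_sample(n_samples, *p, rng)
    return np.mean(gmm_logpdf(x, *p) - gmm_logpdf(x, *q))

# Two hypothetical 2-component GMMs over 2-D MFCC-like features.
p = (np.array([0.6, 0.4]), np.array([[0.0, 0.0], [3.0, 1.0]]),
     np.array([[1.0, 1.0], [0.5, 2.0]]))
q = (np.array([0.5, 0.5]), np.array([[0.5, 0.0], [2.5, 1.5]]),
     np.array([[1.0, 1.5], [1.0, 1.0]]))
d_sym = kl_monte_carlo(p, q) + kl_monte_carlo(q, p)   # symmetrised divergence
print(d_sym, np.exp(-d_sym))                          # an RBF-style kernel value
```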
C1 [Vaiciukynas, Evaldas; Verikas, Antanas; Gelzinis, Adas; Bacauskiene, Marija] Kaunas Univ Technol, Dept Elect & Control Equipment, LT-51368 Kaunas, Lithuania.
[Verikas, Antanas] Halmstad Univ, Intelligent Syst Lab, S-30118 Halmstad, Sweden.
[Uloza, Virgilijus] Lithuanian Univ Hlth Sci, Dept Otolaryngol, LT-50009 Kaunas, Lithuania.
RP Vaiciukynas, E (reprint author), Kaunas Univ Technol, Dept Elect & Control Equipment, Studentu 50, LT-51368 Kaunas, Lithuania.
EM evaldas.vaiciukynas@stud.ktu.lt; antanas.verikas@hh.se;
adas.gelzinis@ktu.lt; marija.bacauskiene@ktu.lt;
virgilijus.ulozas@kmuk.lt
CR Benesty J., 2007, SPRINGER HDB SPEECH
Chen YH, 2009, J MACH LEARN RES, V10, P747
Doremalen J., 2007, THESIS RADBOUD U NIJ
Dubnov S., 2008, COMPUTER AUDITION TO
Gelzinis A, 2008, COMPUT METH PROG BIO, V91, P36, DOI 10.1016/j.cmpb.2008.01.008
Godino-Llorente JI, 2005, LECT NOTES ARTIF INT, V3817, P219
Kuroiwa S, 2006, IEICE T INF SYST, VE89D, P1074, DOI 10.1093/ietisy/e89-d.3.1074
Levina Elizaveta, 2001, P IEEE 8 INT C COMP, P251, DOI 10.1109/ICCV.2001.937632
Markaki M, 2010, INT CONF ACOUST SPEE, P5162, DOI 10.1109/ICASSP.2010.5495020
McLaren M, 2011, COMPUT SPEECH LANG, V25, P327, DOI 10.1016/j.csl.2010.02.004
Pampalk E., 2004, P 5 INT C MUS INF RE
Pouchoulin G., 2007, P 8 ANN C INT SPEECH
Suykens JAK, 1999, NEURAL PROCESS LETT, V9, P293, DOI 10.1023/A:1018628609742
TVERSKY A, 1982, PSYCHOL REV, V89, P123, DOI 10.1037/0033-295X.89.2.123
Wang X., 2011, J VOICE
Weston J., 2006, MATLAB TOOLBOX KERNE
Weston J., 1999, P 7 EUR S ART NEUR N
Wolf L., 2003, J MACHINE LEARNING R, V4, P913, DOI 10.1162/jmlr.2003.4.6.913
Zheng WM, 2006, NEURAL COMPUT, V18, P979
NR 19
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2012
VL 54
IS 5
SI SI
BP 601
EP 610
DI 10.1016/j.specom.2011.04.004
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 925JD
UT WOS:000302756600002
ER
PT J
AU Howard, DM
Abberton, E
Fourcin, A
AF Howard, David M.
Abberton, Evelyn
Fourcin, Adrian
TI Disordered voice measurement and auditory analysis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Hearing modelling; Pathological voice; Dysphonia; Temporal analysis;
Spectral analysis; Laryngograph; Clinical voice analysis
ID FILTER SHAPES; SPEECH; NOISE; IDENTIFICATION; RECOGNITION; FEATURES
AB Although voice disorder is ordinarily first detected by listening, hearing is little used in voice measurement. Auditory critical band approaches to the quantitative analysis of dysphonia are compared with the results of applying cycle-by-cycle time-based methods and the results from a listening test. The comparisons show that quite large rough/smooth differences, which are readily perceptible, are not as robustly measurable using either GammaTone spectrograms based on peripheral human hearing or a cepstral prominence algorithm as they may be when using cycle-by-cycle computations that are linked to temporal criteria. The implications of these tentative observations are discussed for the development of clinically relevant analyses of pathological voice signals, with special reference to the analytic advantages of employing appropriate auditory criteria. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Howard, David M.] Univ York, Dept Elect, Audio Lab, York YO10 5DD, N Yorkshire, England.
[Abberton, Evelyn; Fourcin, Adrian] UCL, Dept Phonet & Linguist, London WC1E 6BT, England.
RP Howard, DM (reprint author), Univ York, Dept Elect, Audio Lab, York YO10 5DD, N Yorkshire, England.
EM dh@ohm.york.ac.uk
CR Abberton Evelyn, 2005, Logoped Phoniatr Vocol, V30, P175
ABBERTON ERM, 1989, CLIN LINGUIST PHONET, V3, P281, DOI 10.3109/02699208908985291
Abdulla Waleed H, 2010, International Journal of Biometrics, V2, DOI 10.1504/IJBM.2010.035448
Baken RJ, 2000, CLIN MEASUREMENT SPE
Baken R.J., 1991, READINGS CLIN SPECTR
Bridle J., 1974, 1003 JSRU
Brookes T, 2000, Logoped Phoniatr Vocol, V25, P72
BROWN JC, 1991, J ACOUST SOC AM, V89, P425, DOI 10.1121/1.400476
Buder EH., 2000, VOICE QUALITY MEASUR, P119
Caeiros A.M., 2010, J APPL RES TECHNOL, V8, P56
Cavalli L, 2010, LOGOP PHONIATR VOCO, V35, P60, DOI 10.3109/14015439.2010.482860
Cooke M, 2010, COMPUT SPEECH LANG, V24, P1, DOI 10.1016/j.csl.2009.02.006
DEBOER E, 1978, J ACOUST SOC AM, V63, P115, DOI 10.1121/1.381704
de Cheveigne A., 2010, OXFORD HDB AUDITORY, P71
Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P90, DOI 10.1109/TASL.2009.2023679
Fletcher H, 1940, REV MOD PHYS, V12, P0047, DOI 10.1103/RevModPhys.12.47
Fourcin A, 2009, FOLIA PHONIATR LOGO, V61, P126, DOI 10.1159/000219948
Fourcin A., 2000, VOICE QUALITY MEASUR
Fourcin A, 2008, LOGOP PHONIATR VOCO, V33, P35, DOI 10.1080/14015430701251574
FOURCIN AJ, 1971, MED BIOL ILLUS, V21, P172
GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T
Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2
Hillenbrand J, 1996, J SPEECH HEAR RES, V39, P311
HOWARD D., 1997, ORG SOUND, V2, P65, DOI 10.1017/S1355771897009011
Howard D., 2009, ACOUSTICS PSYCHOACOU
Howard D.M., 1995, FORENSIC LINGUIST, V2, P28
Howard D.M., 2009, LARYNX
Koike Y., 1973, STUDIA PHONOLOGICA, V7, P17
Li Q, 2010, INT CONF ACOUST SPEE, P4514, DOI 10.1109/ICASSP.2010.5495589
LISKER L, 1964, WORD, V20, P384
Malyska N, 2005, INT CONF ACOUST SPEE, P873
Mermelstein P., 1976, SR47 HASK LAB
Moore B. C., 2004, INTRO PSYCHOL HEARIN
PATTERSON RD, 1976, J ACOUST SOC AM, V59, P640, DOI 10.1121/1.380914
Ptok M., 2006, Z HNO, V54, P1326
Ramig LA, 1987, J VOICE, V1, P162, DOI 10.1016/S0892-1997(87)80040-1
Sayles M, 2008, J NEUROSCI, V28, P11925, DOI 10.1523/JNEUROSCI.3137-08.2008
Scharf B., 1970, F MODERN AUDITORY TH, V1, P159
Shao Y, 2010, COMPUT SPEECH LANG, V24, P77, DOI 10.1016/j.csl.2008.03.004
Slaney M., 1993, 35 APPL COMP
Slaney M., 2003, 45 APPL COMP
Sumner CJ, 2002, J ACOUST SOC AM, V111, P2178, DOI 10.1121/1.1453451
Wang D., 2008, P INT C AUD LANG IM, P1340
Watkins AJ, 2007, J ACOUST SOC AM, V121, P257, DOI 10.1121/1.2387134
Weyer E.G., 1936, J PSYCHOL, V3, P101
ZWICKER E, 1961, J ACOUST SOC AM, V33, P248, DOI 10.1121/1.1908630
NR 46
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2012
VL 54
IS 5
SI SI
BP 611
EP 621
DI 10.1016/j.specom.2011.03.008
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 925JD
UT WOS:000302756600003
ER
PT J
AU Falk, TH
Chan, WY
Shein, F
AF Falk, Tiago H.
Chan, Wai-Yip
Shein, Fraser
TI Characterization of atypical vocal source excitation, temporal dynamics
and prosody for objective measurement of dysarthric word intelligibility
SO SPEECH COMMUNICATION
LA English
DT Article
DE Dysarthria; Vocal source excitation; Temporal dynamics; Intelligibility;
Linear prediction
ID SPEECH RECOGNITION; DISORDERED SPEECH; QUALITY; SYSTEM; RECEPTION; VOICE
AB Objective measurement of dysarthric speech intelligibility can assist clinicians in the diagnosis of speech disorder severity as well as in the evaluation of dysarthria treatments. In this paper, several objective measures are proposed and tested as correlates of subjective intelligibility. More specifically, the kurtosis of the linear prediction residual is proposed as a measure of vocal source excitation oddity. Additionally, temporal perturbations resulting from imprecise articulation and atypical speech rates are characterized by short- and long-term temporal dynamics measures, which, in turn, are based on log-energy dynamics and on an auditory-inspired modulation spectral signal representation, respectively. Motivated by recent insights in the communication disorders literature, a composite measure is developed based on linearly combining a salient subset of the proposed measures with conventional prosodic parameters. Experiments with the publicly available 'Universal Access' database of spastic dysarthric speech (10 patient speakers; 300 words spoken in isolation, per speaker) show that the proposed composite measure can achieve correlation with subjective intelligibility ratings as high as 0.97; thus the measure can serve as an accurate indicator of dysarthric speech intelligibility. (C) 2011 Elsevier B.V. All rights reserved.
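As a rough illustration of the first proposed measure (kurtosis of the linear prediction residual), the sketch below computes autocorrelation-method LPC for one frame, inverse-filters the frame to obtain the residual, and returns its excess kurtosis. The LPC order, windowing and frame length are assumptions, not the paper's settings.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter
from scipy.stats import kurtosis

def lp_residual_kurtosis(frame, order=12):
    """Excess kurtosis of the LP residual of one speech frame.

    LPC is computed with the autocorrelation method (Toeplitz solve); the
    residual is obtained by inverse filtering with A(z) = 1 - sum a_k z^-k.
    Order, windowing and frame length are illustrative choices."""
    frame = frame * np.hamming(len(frame))
    r = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])     # predictor coefficients
    residual = lfilter(np.concatenate(([1.0], -a)), [1.0], frame)
    return kurtosis(residual)                          # Fisher (excess) kurtosis

# Hypothetical 30 ms frame at 16 kHz; real use would loop over voiced frames.
frame = np.random.randn(480)
print(lp_residual_kurtosis(frame))
```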
C1 [Falk, Tiago H.] INRS EMT, Inst Natl Rech Sci, Montreal, PQ, Canada.
[Chan, Wai-Yip] Queens Univ, Dept Elect & Comp Engn, Kingston, ON, Canada.
[Shein, Fraser] Holland Bloorview Kids Rehabil Hosp, Bloorview Res Inst, Toronto, ON, Canada.
[Shein, Fraser] Univ Toronto, Dept Comp Sci, Toronto, ON, Canada.
RP Falk, TH (reprint author), INRS EMT, Inst Natl Rech Sci, Montreal, PQ, Canada.
EM tiago.falk@ieee.org
FU Natural Sciences and Engineering Research Council of Canada
FX The authors wish to acknowledge Dr. Mark Hasegawa-Johnson for making the
UA-Speech database available, the Natural Sciences and Engineering
Research Council of Canada for their financial support, and the
anonymous reviewers for their insightful comments.
CR ANANTHAPADMANABHA TV, 1979, IEEE T ACOUST SPEECH, V27, P309, DOI 10.1109/TASSP.1979.1163267
[Anonymous], 2004, P563 ITUT, P563
Arai T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607318
Baken RJ, 2000, CLIN MEASUREMENT SPE
Benesty J., 2008, SPRINGER HDB SPEECH
Bunton K, 2000, CLIN LINGUIST PHONET, V14, P13, DOI 10.1080/026992000298922
COLCORD RD, 1979, J SPEECH HEAR RES, V22, P468
Constantinescu G, 2010, INT J LANG COMM DIS, V45, P630, DOI 10.3109/13682820903470569
Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959
Dau T, 1997, J ACOUST SOC AM, V102, P2892, DOI 10.1121/1.420344
De Bodt MS, 2002, J COMMUN DISORD, V35, P283, DOI 10.1016/S0021-9924(02)00065-5
Doyle PC, 1997, J REHABIL RES DEV, V34, P309
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
Duffy J., 2005, DIFFERENTIAL DIAGNOS
Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P1766, DOI 10.1109/TASL.2010.2052247
Falk TH, 2010, IEEE T INSTRUM MEAS, V59, P978, DOI 10.1109/TIM.2009.2024697
Falk TH, 2006, IEEE T AUDIO SPEECH, V14, P1935, DOI 10.1109/TASL.2006.883253
Fant G., 1960, NASAL SOUNDS NASALIZ
Ferrier L., 1995, AUGMENTATIVE ALTERNA, V11, P165, DOI 10.1080/07434619512331277289
Gillespie B., 2001, P IEEE INT C AC SPEE, V6
GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T
Green P., 2003, P 8 EUR C SPEECH COM, P4
Gu LY, 2005, EURASIP J APPL SIG P, V2005, P1400, DOI 10.1155/ASP.2005.1400
HASEGAWAJOHNSON M, 2006, INT CONF ACOUST SPEE, P1060
Hill AJ, 2006, AM J SPEECH-LANG PAT, V15, P45, DOI 10.1044/1058-0360(2006/006)
HOUSE AS, 1956, J SPEECH HEAR DISORD, V21, P218
Huang X., 2001, ALGORITHM SYSTEM DEV
KENT RD, 1989, CLIN LINGUIST PHONET, V3, P347, DOI 10.3109/02699208908985295
Kim DS, 2004, IEEE SIGNAL PROC LET, V11, P849, DOI 10.1109/LSP.2004.835466
KIM H, 2008, P INT C SPOK LANG PR, P1741
Klopfenstein M., 2009, INT J SPEECH LANGUAG, V11, P326
LeGendre S. J., 2009, J ACOUST SOC AM, V125, P2530, DOI [10.1121/1.4783544, DOI 10.1121/1.4783544]
Maier A, 2009, SPEECH COMMUN, V51, P425, DOI 10.1016/j.specom.2009.01.004
Middag C, 2009, EURASIP J ADV SIG PR, DOI 10.1155/2009/629030
O'Shaughnessy D., 2008, HDB SPEECH PROCESSIN, P213
PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532
Raghavendra P., 2001, AUGMENTATIVE ALTERNA, V17, P265, DOI 10.1080/714043390
Rudzicz F, 2007, P INT ACM SIGACCESS, P256
Saz O., 2008, P 1 WORKSH CHILD COM, P6
SCHLENCK KJ, 1993, CLIN LINGUIST PHONET, V7, P119, DOI 10.3109/02699209308985549
Sharma H., 2009, P 10 ANN C INT SPEEC, P4
Sjolander K., 2000, P INT C SPOK LANG PR, P4
Slaney M, 1993, EFFICIENT IMPLEMENTA
Talkin D., 1995, ROBUST ALGORITHM PIT, P495
Talkin D., 1987, J ACOUST SOC AM S, V82, pS55
Van Nuffelen G, 2009, INT J LANG COMM DIS, V44, P716, DOI 10.1080/13682820802342062
Zecevic A., 2002, THESIS U MANNHEIM
NR 48
TC 10
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2012
VL 54
IS 5
SI SI
BP 622
EP 631
DI 10.1016/j.specom.2011.03.007
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 925JD
UT WOS:000302756600004
ER
PT J
AU de Bruijn, MJ
ten Bosch, L
Kuik, DJ
Witte, BI
Langendijk, JA
Leemans, CR
Verdonck-de Leeuw, IM
AF de Bruijn, Marieke J.
ten Bosch, Louis
Kuik, Dirk J.
Witte, Birgit I.
Langendijk, Johannes A.
Leemans, C. Rene
Verdonck-de Leeuw, Irma M.
TI Acoustic-phonetic and artificial neural network feature analysis to
assess speech quality of stop consonants produced by patients treated
for oral or oropharyngeal cancer
SO SPEECH COMMUNICATION
LA English
DT Article
DE Head and neck cancer; Oral cancer; Reconstructive surgery; Speech
quality; Artificial neural network; Voice-onset-time (VOT)
ID VOICE-ONSET-TIME; OVERLAPPING ARTICULATORY FEATURES; FOREARM FREE-FLAP;
OF-LIFE; LONGITUDINAL ASSESSMENT; EUROPEAN-ORGANIZATION; MAXILLECTOMY
PATIENTS; PARTIAL GLOSSECTOMY; MICROVASCULAR FLAP; SURGICAL-TREATMENT
AB Speech impairment often occurs in patients after treatment for head and neck cancer. A specific speech characteristic that influences intelligibility and speech quality is voice-onset-time (VOT) in stop consonants. VOT is one of the functionally most relevant parameters that distinguish voiced and voiceless stops. The goal of the present study is to investigate the role and validity of acoustic-phonetic and artificial neural network (ANN) analysis of stop consonants in a multidimensional speech assessment protocol. Speech recordings of 51 patients 6 months after treatment for oral or oropharyngeal cancer and of 18 control speakers were evaluated by trained speech pathologists regarding intelligibility and articulation. Acoustic-phonetic analyses and artificial neural network analysis of the phonological feature voicing were performed on voiced /b/, /d/ and voiceless /p/ and /t/. Results revealed that objective acoustic-phonetic analysis and feature analysis for /b, d, p/ distinguish between patients and controls. Within patients, /t, d/ distinguish according to tumour location and tumour stage. Measurements of the phonological feature voicing in almost all consonants were significantly correlated with articulation and intelligibility, but not with self-evaluations. Overall, objective acoustic-phonetic and feature analyses of stop consonants are feasible and contribute to further development of a multidimensional speech quality assessment protocol. (C) 2011 Elsevier B.V. All rights reserved.
C1 [de Bruijn, Marieke J.; Leemans, C. Rene; Verdonck-de Leeuw, Irma M.] Vrije Univ Amsterdam, Dept Otolaryngol Head & Neck Surg, Med Ctr, NL-1007 MB Amsterdam, Netherlands.
[ten Bosch, Louis] Univ Nijmegen, Dept Language & Speech, Nijmegen, Netherlands.
[Kuik, Dirk J.; Witte, Birgit I.] Vrije Univ Amsterdam, Dept Epidemiol & Biostat, Med Ctr, NL-1007 MB Amsterdam, Netherlands.
[Langendijk, Johannes A.] Univ Groningen, Dept Radiat Oncol, Univ Med Ctr Groningen, Groningen, Netherlands.
RP Verdonck-de Leeuw, IM (reprint author), Vrije Univ Amsterdam, Dept Otolaryngol Head & Neck Surg, Med Ctr, POB 7057, NL-1007 MB Amsterdam, Netherlands.
EM im.verdonck@vumc.nl
CR AARONSON NK, 1993, J NATL CANCER I, V85, P365, DOI 10.1093/jnci/85.5.365
Allen JS, 2003, J ACOUST SOC AM, V113, P544, DOI 10.1121/1.1528172
[Anonymous], 2007, PRAAT DOING PHON COM
Bjordal K, 1999, J CLIN ONCOL, V17, P1008
Borggreven PA, 2005, HEAD NECK-J SCI SPEC, V27, P785, DOI 10.1002/hed.20236
Borggreven PA, 2007, ORAL ONCOL, V43, P1034, DOI 10.1016/j.oraloncology.2006.11.017
Bressmann T, 2004, J ORAL MAXIL SURG, V62, P298, DOI 10.1016/j.joms.2003.04.017
Bridle J., 1998, CLSP JHU SUMM WORKSH
CHRISTENSEN JM, 1978, J SPEECH HEAR RES, V21, P56
de Bruijn MJ, 2009, FOLIA PHONIATR LOGO, V61, P180, DOI 10.1159/000219953
DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839
Erler K, 1996, J ACOUST SOC AM, V100, P2500, DOI 10.1121/1.417358
Furia CLB, 2001, ARCH OTOLARYNGOL, V127, P877
Graupe D., 2007, PRINCIPLES ARTIFICIA
Haderlein T, 2009, FOLIA PHONIATR LOGO, V61, P12, DOI 10.1159/000187620
Hara I, 2003, BRIT J ORAL MAX SURG, V41, P161, DOI 10.1016/S0266-4356(03)00068-8
Houde J., 1998, SCIENCE, P1213
Karnell LH, 2000, HEAD NECK-J SCI SPEC, V22, P6, DOI 10.1002/(SICI)1097-0347(200001)22:1<6::AID-HED2>3.0.CO;2-P
Kazi R, 2007, INT J LANG COMM DIS, V42, P521, DOI 10.1080/13682820601056566
Kent RA., 1992, INTELLIGIBILITY SPEE
King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148
KLATT DH, 1975, J SPEECH HEAR RES, V18, P686
Ladefoged P., 1996, SOUNDS WORLDS LANGUA
Markkanen-Leppanen M, 2005, J CRANIOFAC SURG, V16, P990, DOI 10.1097/01.scs.0000179753.14037.7a
McConnel FMS, 1998, ARCH OTOLARYNGOL, V124, P625
MICHI K, 1989, J CRANIO MAXILL SURG, V17, P162, DOI 10.1016/S1010-5182(89)80015-0
Nabil N., 1995, SIGNAL REPRESENTATIO, P310
Nabil N., 1996, KNOWLEDGE BASED SIGN, P29
Ng ML, 2009, J SPEECH LANG HEAR R, V52, P780, DOI 10.1044/1092-4388(2008/07-0182)
Pauloski BR, 1998, OTOLARYNG HEAD NECK, V118, P616, DOI 10.1177/019459989811800509
Rinkel RN, 2008, HEAD NECK-J SCI SPEC, V30, P868, DOI 10.1002/hed.20795
ROBBINS J, 1986, J SPEECH HEAR RES, V29, P499
Robinson T., 1996, AUTOMATIC SPEECH SPE, P233
Savariaux C., 2001, SPEECH PRODUCTION GL
Schuster M, 2006, EUR ARCH OTO-RHINO-L, V263, P188, DOI 10.1007/s00405-005-0974-6
Seikaly H, 2003, LARYNGOSCOPE, V113, P897, DOI 10.1097/00005537-200305000-00023
Su W.F., 2003, ARCH OTOLARYNGOL, P412
Sumita YI, 2002, J ORAL REHABIL, V29, P649, DOI 10.1046/j.1365-2842.2002.00911.x
Terai H, 2004, BRIT J ORAL MAX SURG, V42, P190, DOI 10.1016/j.bjoms.2004.02.007
van der Molen L, 2009, EUR ARCH OTO-RHINO-L, V266, P901
Whitehill TL, 2006, CLIN LINGUIST PHONET, V20, P135, DOI 10.1080/02699200400026694
Windrich M, 2008, FOLIA PHONIATR LOGO, V60, P151, DOI 10.1159/000121004
Yoshida H, 2000, J ORAL REHABIL, V27, P723, DOI 10.1046/j.1365-2842.2000.00537.x
NR 43
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2012
VL 54
IS 5
SI SI
BP 632
EP 640
DI 10.1016/j.specom.2011.06.005
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 925JD
UT WOS:000302756600005
ER
PT J
AU Karakozoglou, SZ
Henrich, N
d'Alessandro, C
Stylianou, Y
AF Karakozoglou, Sevasti-Zoi
Henrich, Nathalie
d'Alessandro, Christophe
Stylianou, Yannis
TI Automatic glottal segmentation using local-based active contours and
application to glottovibrography
SO SPEECH COMMUNICATION
LA English
DT Article
DE High-speed videoendoscopy; Vocal-fold vibration; Active contours;
Representation; Glottovibrogram; Electroglottography
ID HIGH-SPEED VIDEOENDOSCOPY; VOCAL FOLD VIBRATIONS; ELECTROGLOTTOGRAPHY;
MECHANISMS; ALGORITHM
AB The use of high-speed videoendoscopy (HSV) for the assessment of vocal-fold vibrations dictates the development of efficient techniques for glottal image segmentation. We present a new glottal segmentation method using a local-based active contour framework. The use of local-based features and the exploitation of the vibratory pattern allow the method to deal effectively with image noise and with cases where the glottal area consists of multiple regions. A scheme for precise glottis localization is introduced, which facilitates the segmentation procedure. The method has been tested on a database of 60 HSV recordings. Comparisons with manual verification resulted in less than 1% difference on the average glottal area. These errors mainly come from detection failure in the posterior or anterior parts of the glottal area. Comparisons with automatic threshold-based glottal detection point out the necessity of complete frameworks for automatic detection. The glottovibrogram (GVG), a representation of glottal vibration, is also presented. This easily readable representation depicts the time-varying distance of the vocal-fold edges. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Karakozoglou, Sevasti-Zoi; d'Alessandro, Christophe] LIMSI CNRS, Orsay, France.
[Henrich, Nathalie] Univ Grenoble 3, Dept Speech & Cognit, GIPSA Lab, UMR 5216,CNRS,INPG, Grenoble, France.
[Karakozoglou, Sevasti-Zoi; Stylianou, Yannis] Univ Crete, Dept Comp Sci, Iraklion, Greece.
[Karakozoglou, Sevasti-Zoi] Univ Paris 11, Dept Comp Sci, Orsay, France.
RP Karakozoglou, SZ (reprint author), LIMSI CNRS, Orsay, France.
EM skarako@csd.uoc.gr; Nathalie.Henrich@gipsa-lab.grenoble-inp.fr;
Christophe.D'Alessandro@limsi.fr; yannis@csd.uoc.gr
RI d'Alessandro, Christophe/I-6991-2013
CR ADAMS R, 1994, IEEE T PATTERN ANAL, V16, P641, DOI 10.1109/34.295913
Allin S., 2004, P IEEE INT S BIOM IM, P812
Bailly L., 2009, THESIS U MAINE
BEZIER P., 1972, NUMERICAL CONTROL MA
Chan TF, 2001, IEEE T IMAGE PROCESS, V10, P266, DOI 10.1109/83.902291
CHILDERS DG, 1985, CRIT REV BIOMED ENG, V12, P131
CHILDERS DG, 1995, SPEECH COMMUN, V16, P127, DOI 10.1016/0167-6393(94)00050-K
Deliyski D, 2003, P 6 INT C ADV QUANT, P1
Deliyski DD, 2008, FOLIA PHONIATR LOGO, V60, P33, DOI 10.1159/000111802
Demeyer J., 2009, 3 ADV VOIC FUNCT ASS
Dollinger Michael, 2011, Advances in Vibration Analysis Research
Einig D., 2010, THESIS TRIER U APPL
GILBERT HR, 1984, J SPEECH HEAR RES, V27, P178
Glasbey C. A., 1993, GRAPH MODEL IM PROC, V55, P532, DOI 10.1006/gmip.1993.1040
HARALICK RM, 1985, COMPUT VISION GRAPH, V29, P100, DOI 10.1016/S0734-189X(85)90153-7
Henrich N., 2001, THESIS U P M CURIE P
Henrich N, 2004, J ACOUST SOC AM, V115, P1321, DOI 10.1121/1.1646401
CHILDERS DG, 1990, J SPEECH HEAR RES, V33, P245
KARAKOZOGLOU SZ, 2010, THESIS U PARIS SUD 1
Kass M., 1988, INT J COMPUT VISION, V1, P321, DOI DOI 10.1007/BF00133570
KOHLER R, 1981, COMPUT VISION GRAPH, V15, P319, DOI 10.1016/S0146-664X(81)80015-9
Lankton S, 2008, IEEE T IMAGE PROCESS, V17, P2029, DOI 10.1109/TIP.2008.2004611
Lohscheller J, 2004, IEEE T BIO-MED ENG, V51, P1394, DOI [10.1109/TBME.2004.827938, 10.1109/TMBE.2004.827938]
Lohscheller J, 2008, IEEE T MED IMAGING, V27, P300, DOI 10.1109/TMI.2007.903690
Lohscheller J, 2007, MED IMAGE ANAL, V11, P400, DOI 10.1016/j.media.2007.04.005
Marendic B, 2001, IEEE IMAGE PROC, P397
Mehnert A, 1997, PATTERN RECOGN LETT, V18, P1065, DOI 10.1016/S0167-8655(97)00131-1
Mehta DD, 2010, ANN OTO RHINOL LARYN, V119, P1
Mehta DD, 2011, J SPEECH LANG HEAR R, V54, P47, DOI 10.1044/1092-4388(2010/10-0026)
Moukalled H., 2009, MAVEBA, V1, P137
Neubauer J, 2001, J ACOUST SOC AM, V110, P3179, DOI 10.1121/1.1406498
ROTHENBERG M, 1992, J VOICE, V6, P36, DOI 10.1016/S0892-1997(05)80007-4
Roubeau B, 2009, J VOICE, V23, P425, DOI 10.1016/j.jvoice.2007.10.014
Samet H., 1988, IEEE T PATTERN ANAL, V10, P586
Scherer R. C., 1988, VOCAL PHYSL VOICE PR, P279
Sethian J. A., 1999, LEVEL SET METHODS FA
WESTPHAL LC, 1983, IEEE T ACOUST SPEECH, V31, P766, DOI 10.1109/TASSP.1983.1164104
Yan Y., 2006, IEEE T BIOMED ENG, V53
Zuiderveld K., 1994, GRAPHICS GEMS, VIV, P474
NR 39
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2012
VL 54
IS 5
SI SI
BP 641
EP 654
DI 10.1016/j.specom.2011.07.010
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 925JD
UT WOS:000302756600006
ER
PT J
AU Alpan, A
Schoentgen, J
Maryn, Y
Grenez, F
Murphy, P
AF Alpan, A.
Schoentgen, J.
Maryn, Y.
Grenez, F.
Murphy, P.
TI Assessment of disordered voice via the first rahmonic
SO SPEECH COMMUNICATION
LA English
DT Article
DE Disordered voice analysis; Cepstrum; First rahmonic; Correlation
analysis; Sustained vowel; Connected speech
ID CEPSTRAL PEAK PROMINENCE; BREATHY VOCAL QUALITY; DYSPHONIA SEVERITY;
SIGNALS; SPEECH; PREDICTION; PARAMETERS; INDEX; NOISE
AB A number of studies have shown that the amplitude of the first rahmonic peak (R1) in the cepstrum can be usefully employed to indicate hoarse voice quality. The cepstrum is obtained by taking the inverse Fourier transform of the log-magnitude spectrum. In the present study, a number of spectral pre-processing steps are investigated prior to computing the cepstrum; the pre-processing steps include period-synchronous, period-asynchronous, harmonic-synchronous and harmonic-asynchronous spectral band-limitation analysis. The analysis is applied on both sustained vowels [a] and connected speech signals. The correlation between R1 (the amplitude of the first rahmonic) and perceptual ratings is examined for a corpus comprising 251 speakers. It is observed that the correlation between R1 and perceptual ratings increases when the spectrum is band-limited prior to computing the cepstrum. In addition, comparisons are made with a previously reported cepstral cue, cepstral peak prominence (CPP). (C) 2011 Elsevier B.V. All rights reserved.
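The abstract defines the cepstrum as the inverse Fourier transform of the log-magnitude spectrum and studies the amplitude of the first rahmonic (R1). A minimal sketch of that computation is given below, omitting the spectral band-limitation pre-processing investigated in the paper; the quefrency search range and the toy signal are assumptions.

```python
import numpy as np

def first_rahmonic(frame, sr, f0_range=(60.0, 400.0)):
    """Amplitude and quefrency of the first rahmonic peak.

    The real cepstrum is the inverse FFT of the log-magnitude spectrum; the
    first rahmonic is its peak in the quefrency range corresponding to the
    expected fundamental period.  The spectral band-limitation steps studied
    in the paper are omitted here."""
    frame = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.fft(frame)) + 1e-12)
    cepstrum = np.real(np.fft.ifft(log_mag))
    q = np.arange(len(frame)) / sr                    # quefrency axis (s)
    lo, hi = 1.0 / f0_range[1], 1.0 / f0_range[0]
    idx = np.where((q >= lo) & (q <= hi))[0]
    peak = idx[np.argmax(cepstrum[idx])]
    return cepstrum[peak], q[peak]

# Toy periodic signal standing in for a 40 ms sustained-vowel frame at 44.1 kHz.
sr = 44100
t = np.arange(int(0.04 * sr)) / sr
frame = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
print(first_rahmonic(frame, sr))
```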
C1 [Alpan, A.; Schoentgen, J.; Grenez, F.] Univ Libre Bruxelles, Lab Images Signals & Telecommun Devices, Brussels, Belgium.
[Maryn, Y.] Sint Jan Gen Hosp, Dept Otorhinolaryngol & Head & Neck Surg, Dept Speech Language Pathol & Audiol, Brugge, Belgium.
[Murphy, P.] Univ Limerick, Dept Phys, Limerick, Ireland.
RP Alpan, A (reprint author), ULB LIST, CP 165/51, Av F Roosevelt 50, B-1050 Brussels, Belgium.
EM aalpan@ulb.ac.be; jschoent@ulb.ac.be; Youri.Maryn@azbrugge.be;
fgrenez@ulb.ac.be; peter.murphy@ul.ie
FU COST ACTION at the University of Limerick [2103]; "Region Wallonne",
Belgium
FX This research has been supported by COST ACTION 2103 "Advanced Voice
Function Assessment" in the framework of a short-term scientific mission
at the University of Limerick, and by the "Region Wallonne", Belgium, in
the framework of the "WALEO II" programme.
CR Alpan A, 2011, SPEECH COMMUN, V53, P131, DOI 10.1016/j.specom.2010.06.010
Awan SN, 2009, J SPEECH LANG HEAR R, V52, P482, DOI 10.1044/1092-4388(2009/08-0034)
Awan SN, 2006, CLIN LINGUIST PHONET, V20, P35, DOI 10.1080/02699200400008353
Awan SN, 2005, J VOICE, V19, P268, DOI 10.1016/j.jvoice.2004.03.005
Balasubramanium R.K., 2010, J VOICE, V24, P651
Balasubramanium R.K., J VOICE IN PRESS
Boersma P., 1993, IFA P, V17, P97
Boersma P., 2007, PRAAT DOING PHONETIC
DEJONCKERE PH, 1994, CLIN LINGUIST PHONET, V8, P161, DOI 10.3109/02699209408985304
DEKROM G, 1993, J SPEECH HEAR RES, V36, P254
DUNN OJ, 1969, J AM STAT ASSOC, V64, P366, DOI 10.2307/2283746
Eadie TL, 2006, J VOICE, V20, P527, DOI 10.1016/j.jvoice.2005.08.007
Heman-Ackah YD, 2003, ANN OTO RHINOL LARYN, V112, P324
Heman-Ackah YD, 2002, J VOICE, V16, P20, DOI 10.1016/S0892-1997(02)00067-X
HILLENBRAND J, 1994, J SPEECH HEAR RES, V37, P769
Hillenbrand J, 1996, J SPEECH HEAR RES, V39, P311
Hotelling H, 1940, ANN MATH STAT, V11, P271, DOI 10.1214/aoms/1177731867
Maryn Y, 2010, J VOICE, V24, P540, DOI 10.1016/j.jvoice.2008.12.014
Murphy PJ, 2006, J ACOUST SOC AM, V120, P2896, DOI 10.1121/1.2355483
Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE
Rabiner L.R., 1978, DIGITAL PROCESSING S
Schoentgen J, 2003, J ACOUST SOC AM, V113, P553, DOI 10.1121/1.1523384
Wolfe V, 1997, J COMMUN DISORD, V30, P403, DOI 10.1016/S0021-9924(96)00112-8
NR 23
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2012
VL 54
IS 5
SI SI
BP 655
EP 663
DI 10.1016/j.specom.2011.04.001
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 925JD
UT WOS:000302756600007
ER
PT J
AU Ghio, A
Pouchoulin, G
Teston, B
Pinto, S
Fredouille, C
De Looze, C
Robert, D
Viallet, F
Giovanni, A
AF Ghio, A.
Pouchoulin, G.
Teston, B.
Pinto, S.
Fredouille, C.
De Looze, C.
Robert, D.
Viallet, F.
Giovanni, A.
TI How to manage sound, physiological and clinical data of 2500 dysphonic
and dysarthric speakers?
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice/speech disorders; Dysphonia; Dysarthria; Database; Clinical
phonetics
ID OBJECTIVE VOICE ANALYSIS; AERODYNAMIC MEASUREMENTS; SPEECH; DISEASE
AB The aim of this contribution is to propose a database model designed for the storage and accessibility of various speech disorder data, including signals, clinical evaluations and patients' information. This model is the result of 15 years of experience in the management and analysis of this type of data. We present two important French corpora of voice and speech disorders that we have been recording in hospitals in Marseilles (MTO corpus) and Aix-en-Provence (AHN corpus). The population consists of 2500 dysphonic, dysarthric and control subjects, a number of speakers which, as far as we know, currently constitutes one of the largest corpora of "pathological" speech. The originality of these data lies in the presence of physiological data (such as oral airflow or estimated sub-glottal pressure) associated with the acoustic recordings. This activity led us to ask how sound, physiological and clinical data can be managed in such quantities. Consequently, we developed a database model that we present here. Recommendations and technical solutions based on MySQL, a relational database management system, are discussed. (C) 2011 Elsevier B.V. All rights reserved.
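To make the kind of relational organization described above concrete, here is a minimal, purely hypothetical schema sketch: patients, sessions, recordings (file paths rather than embedded signals) and clinical scores. The table and column names are invented for illustration and do not reproduce the schema of the article; sqlite3 is used only to keep the sketch self-contained, whereas the authors discuss MySQL.

```python
import sqlite3

# Illustrative relational layout only: table and column names are hypothetical.
schema = """
CREATE TABLE patient (
    patient_id   INTEGER PRIMARY KEY,
    birth_year   INTEGER,
    sex          TEXT,
    pathology    TEXT              -- e.g. 'dysphonia', 'dysarthria', 'control'
);
CREATE TABLE session (
    session_id   INTEGER PRIMARY KEY,
    patient_id   INTEGER REFERENCES patient(patient_id),
    hospital     TEXT,
    session_date TEXT
);
CREATE TABLE recording (
    recording_id INTEGER PRIMARY KEY,
    session_id   INTEGER REFERENCES session(session_id),
    task         TEXT,             -- sustained vowel, reading, ...
    signal_type  TEXT,             -- 'acoustic', 'oral airflow', 'subglottal pressure'
    file_path    TEXT              -- signals kept on disk, only paths in the DB
);
CREATE TABLE clinical_score (
    score_id     INTEGER PRIMARY KEY,
    session_id   INTEGER REFERENCES session(session_id),
    scale        TEXT,             -- a perceptual or motor rating scale
    value        REAL
);
"""

if __name__ == "__main__":
    con = sqlite3.connect(":memory:")
    con.executescript(schema)
    con.execute("INSERT INTO patient VALUES (1, 1950, 'F', 'dysphonia')")
    con.execute("INSERT INTO session VALUES (1, 1, 'Marseille', '2008-03-10')")
    con.execute("INSERT INTO recording VALUES "
                "(1, 1, 'sustained vowel [a]', 'acoustic', '/data/p1/s1/a.wav')")
    # Typical query: all acoustic recordings of dysphonic speakers
    rows = con.execute("""
        SELECT r.file_path FROM recording r
        JOIN session s ON s.session_id = r.session_id
        JOIN patient p ON p.patient_id = s.patient_id
        WHERE p.pathology = 'dysphonia' AND r.signal_type = 'acoustic'
    """).fetchall()
    print(rows)
```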
C1 [Ghio, A.; Pouchoulin, G.; Teston, B.; Pinto, S.; De Looze, C.; Robert, D.; Viallet, F.; Giovanni, A.] Aix Marseille Univ, LPL, CNRS, UMR 6057, Marseille, France.
[Pouchoulin, G.; Fredouille, C.] Avignon Univ, LIA, Avignon, France.
[Viallet, F.] Ctr Hosp Pays Aix, Serv Neurol, Aix En Provence, France.
[Robert, D.; Giovanni, A.] Ctr Hosp Univ Timone, Serv ORL, Marseille, France.
RP Ghio, A (reprint author), Univ Aix Marseille 1, CNRS, Lab Parole & Langage, 5 Ave Pasteur,BP 80975, F-13604 Aix En Provence 1, France.
EM alain.ghio@lpl-aix.fr
FU PHRC (projet hospitalier de recherche clinique); French National
Research Agency [ANR BLAN08-0125]; COST Action [2103]; France Parkinson
Association
FX The authors would like to thank the financial supports: PHRC (projet
hospitalier de recherche clinique), ANR BLAN08-0125 of the French
National Research Agency, COST Action 2103 "Advanced Voice Function
Assessment" and "France Parkinson Association".
CR Auzou P, 2006, BATTERIE EVALUATION
Baken RJ, 2000, CLIN MEASUREMENT SPE
Bombien L., 2006, P 11 AUSTR INT C SPE, P313
Bonastre J., 2007, P INT ICSLP ANTW BEL, P1194
CARDEBAT D, 1990, ACTA NEUROL BELG, V90, P207
Carre R., 1984, P INT C AC SPEECH SI, P324
CNIL, 2006, METH REF TRAIT DONN
Darley F.L, 1975, MOTOR SPEECH DISORDE
DELLER JR, 1993, J ACOUST SOC AM, V93, P3516, DOI 10.1121/1.405684
DENT H, 1995, EUR J DISORDER COMM, V30, P264
Descout R., 1986, P 12 INT C AC TOR CA, VA, P4
Duez D., 2006, J MULTILINGUAL COMMU, V4, P45, DOI 10.1080/14769670500485513
Duez D, 2009, CLIN LINGUIST PHONET, V23, P781, DOI 10.3109/02699200903144788
Durand J., 2002, B PFC, V1, P1
Enderby P. M., 1983, FRENCHAY DYSARTHRIA
FABRE P, 1957, Bull Acad Natl Med, V141, P66
Fahn S, 1987, RECENT DEV PARKINSON, P153
FOLSTEIN MF, 1975, J PSYCHIAT RES, V12, P189, DOI 10.1016/0022-3956(75)90026-6
Fougeron C., 2010, P LANG RES EV LRC VA, P2831
Fourcin A., 1989, SPEECH INPUT OUTPUT
Fredouille C., 2005, P 9 EUR C SPEECH COM, P149
Fredouille C., 2009, EUR J ADV SIGNAL PRO, P1
Ghio A, 2004, P INT C VOIC PHYSL B, P55
Gibbon D, 1997, HDB STANDARDS RESOUR
Gibbon F, 1998, Int J Lang Commun Disord, V33 Suppl, P44
Giovanni A, 1999, LARYNGOSCOPE, V109, P656, DOI 10.1097/00005537-199904000-00026
Giovanni A, 2002, FOLIA PHONIATR LOGO, V54, P304, DOI 10.1159/000066152
Giovanni A, 1999, J VOICE, V13, P341, DOI 10.1016/S0892-1997(99)80040-X
HAMMARBERG B, 1980, ACTA OTO-LARYNGOL, V90, P441, DOI 10.3109/00016488009131746
Heaton RK, 1993, WISCONSIN CARD SORTI
Helsinki, 2004, DECLARATION HELSINKI
Hirano M, 1981, CLIN EXAMINATION VOI
Ketelslagers K., 2006, EUR ARCH OTO-RHINO-L, V264, P519
KIM H, 2008, P INT C SPOK LANG PR, P1741
Kim Y, 2011, J SPEECH LANG HEAR R, V54, P417, DOI 10.1044/1092-4388(2010/10-0020)
Klatt D., 1980, TRENDS SPEECH RECOGN, P49
Laver J, 1980, PHONETIC DESCRIPTION
MARCHAL A, 1993, LANG SPEECH, V36, P137
MARCHAL A, 1993, J ACOUST SOC AM, V93, P2990, DOI 10.1121/1.405820
Mattis S., 1988, DEMENTIA RATING SCAL
McVeigh A., 1992, Proceedings of the Fourth Australian International Conference on Speech Science and Technology
Menendez-Pidal X., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608020
Nixon T., 2008, PRECLINICAL SPEECH S
Parsa V, 2001, J SPEECH LANG HEAR R, V44, P327, DOI 10.1044/1092-4388(2001/027)
Pinto S, 2010, REV NEUROL-FRANCE, V166, P800, DOI 10.1016/j.neurol.2010.07.005
Pouchoulin G., 2007, P 8 INTERSPEECH C IN, P1198
Revis J, 2002, FOLIA PHONIATR LOGO, V54, P19, DOI 10.1159/000048593
Robert D, 1999, ACTA OTO-LARYNGOL, V119, P724
Saenz-Lechon N, 2006, BIOMED SIGNAL PROCES, V1, P120, DOI 10.1016/j.bspc.2006.06.003
Sarr MM, 2009, REV NEUROL-FRANCE, V165, P1055, DOI 10.1016/j.neurol.2009.03.012
SMITHERAN JR, 1981, J SPEECH HEAR DISORD, V46, P138
Teston B, 1995, P EUR C SPEECH COMM, P1883
Viallet F., 2002, P SPEECH PROS, P679
VIALLET F., 2003, P DYSPH DYS DYSPH, P53
Viallet F, 2004, MOVEMENT DISORD, V19, pS237
Wester M., 1998, P S DAT VOIC QUAL RE, P92
Wilson D. K., 1987, VOICE PROBLEMS CHILD
Yu P, 2007, FOLIA PHONIATR LOGO, V59, P20, DOI 10.1159/000096547
Yu P, 2001, J VOICE, V15, P529, DOI 10.1016/S0892-1997(01)00053-4
YUMOTO E, 1982, J ACOUST SOC AM, V71, P1544, DOI 10.1121/1.387808
Zeiliger J., 1992, P JOURN ET PAR JEP B, P213
Zeiliger J., 1994, P 10 JOURN ET PAR, P287
NR 62
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2012
VL 54
IS 5
SI SI
BP 664
EP 679
DI 10.1016/j.specom.2011.04.002
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 925JD
UT WOS:000302756600008
ER
PT J
AU Ben Aicha, A
Ben Jebara, S
AF Ben Aicha, Anis
Ben Jebara, Sofia
TI Perceptual speech quality measures separating speech distortion and
additive noise degradations
SO SPEECH COMMUNICATION
LA English
DT Article
DE Upper bound of perceptual equivalence; Lower bound of perceptual
equivalence; Class of perceptual equivalence; Objective criteria
ID ENHANCEMENT; AUDIO
AB In this paper, novel perceptual criteria measuring speech distortion, additive noise and overall quality are presented. Based on the masking concept, they are built to measure only the audible degradations perceived by the human ear. The class of perceptual equivalence (CPE) is introduced, which makes it possible to specify the nature of the degradations affecting denoised speech. The CPE is defined in the frequency domain using perceptual tools and is bounded by two curves: the upper bound of perceptual equivalence (UBPE) and the lower bound of perceptual equivalence (LBPE). Denoised speech components belonging to this class are perceptually equivalent to the clean speech components; otherwise, audible degradations are noticed. Based on this concept, new perceptual criteria are developed to assess denoised speech signals. After the criteria are introduced and explained, they are validated by examining their relationship, in terms of scatter plots and Pearson correlation, with ITU-T recommendation P.835, which specifies three subjective tests to evaluate independently the speech distortion (SIG), the residual background noise (BAK) and the overall quality (MOS). Moreover, the proposed criteria are compared with conventional criteria, indicating an improved ability to predict subjective test scores. (C) 2011 Elsevier B.V. All rights reserved.
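The bookkeeping implied by the CPE idea can be illustrated with a toy check of each denoised spectral component against an upper and a lower bound around the clean component. In the paper those bounds are derived from a psychoacoustic masking model; the flat dB margin below is only a stand-in, used so the classification into equivalent, distorted and noise-like components can be shown.

```python
import numpy as np

def classify_components(clean_mag, denoised_mag, masking_margin_db=6.0):
    """Toy classification of denoised spectral components against a class of
    perceptual equivalence. The true bounds (UBPE/LBPE) come from a masking
    model; here a flat +/- margin in dB stands in for them, purely to
    illustrate the bookkeeping."""
    clean_db = 20 * np.log10(clean_mag + 1e-12)
    den_db = 20 * np.log10(denoised_mag + 1e-12)
    ubpe = clean_db + masking_margin_db            # surrogate upper bound
    lbpe = clean_db - masking_margin_db            # surrogate lower bound
    noise_like = den_db > ubpe                     # audible additive noise
    distortion = den_db < lbpe                     # audible speech distortion
    equivalent = ~(noise_like | distortion)
    return equivalent, distortion, noise_like

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.abs(rng.normal(size=257)) + 0.5
    denoised = clean * rng.uniform(0.3, 3.0, size=clean.shape)
    eq, dist, noise = classify_components(clean, denoised)
    print("equivalent: %d, distortion: %d, noise: %d"
          % (eq.sum(), dist.sum(), noise.sum()))
```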
C1 [Ben Aicha, Anis; Ben Jebara, Sofia] Univ Carthage, Ecole Super Commun Tunis, Res Unit TECHTRA, Ariana 2083, Tunisia.
RP Ben Aicha, A (reprint author), Univ Carthage, Ecole Super Commun Tunis, Res Unit TECHTRA, Route Raoued 3-5 Km, Ariana 2083, Tunisia.
EM anis_ben_aicha@yahoo.fr; sofia.benjebara@supcom.rnu.tn
CR [Anonymous], 2000, PERC EV SPEECH QUAL
[Anonymous], 1996, METH SUBJ DET TRANSM, P800
[Anonymous], 2003, SUBJ TEST METH EV SP, P835
Benesty J., 2005, SPEECH ENHANCEMENT
Benesty J., 2008, HDB SPEECH PROCESSIN, P843
Beruoti M., 1979, P IEEE INT C AC SPEE, P208
Chetouani M., 2007, ADV NONLINEAR SPEECH, P230
Dimolitsas S, 1984, P IEEE, V136
Dreiseitel P, 2001, P INT WORKSH AC ECH
Garofolo J., 1988, GETTING STARTED DARP
Gustafsson S, 2002, IEEE T SPEECH AUDI P, V10, P245, DOI 10.1109/TSA.2002.800553
Hansen J.H.L., 1998, P INT C SPOK LANG PR
Hensen J.H.L., 1998, P INT C SPOK LANG PR, V7, P2819
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
Hu Y, 2006, P IEEE INT C AC SPEE, V1, P153
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
HU Y, 2006, P INT, P1447
JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608
Klatt D., 1982, P IEEE INT C AC SPEE, V7, P1278
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Painter T, 2000, P IEEE, V88, P451, DOI 10.1109/5.842996
Quanckenbush S., 1988, OBJECTIVE MEASURES S
Rix A., 2001, P IEEE INT C AC SPEE, P749
Rix AW, 2006, IEEE T AUDIO SPEECH, V14, P1890, DOI 10.1109/TASL.2006.883260
Scalart P., 1996, P IEEE INT C AC SPEE
Tribolet J. M., 1978, Proceedings of the 1978 IEEE International Conference on Acoustics, Speech and Signal Processing
Tsoukalas DE, 1997, IEEE T SPEECH AUDI P, V5, P497, DOI 10.1109/89.641296
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
Yang W., 1998, P IEEE INT C AC SPEE, V1, P541
Zwicker E., 1990, PSYCHOACOUSTICS FACT
NR 30
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2012
VL 54
IS 4
BP 517
EP 528
DI 10.1016/j.specom.2011.11.005
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 902DA
UT WOS:000301017600001
ER
PT J
AU Wu, MH
Li, HH
Hong, ZL
Xian, XC
Li, JY
Wu, XH
Li, L
AF Wu, Meihong
Li, Huahui
Hong, Zhiling
Xian, Xinchi
Li, Jingyu
Wu, Xihong
Li, Liang
TI Effects of aging on the ability to benefit from prior knowledge of
message content in masked speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Auditory aging; "Cocktail-party" problem; Content priming; Energetic
masking; Informational masking; Speech recognition; Working memory
ID PERCEIVED SPATIAL SEPARATION; INFORMATIONAL MASKING; OLDER-ADULTS;
COMPETING SPEECH; HEARING-LOSS; AUDITORY ATTENTION; ENERGETIC MASKING;
CHINESE SPEECH; NOISE; RELEASE
AB In the presence of competing talkers, presenting the early part of a target sentence in quiet improves recognition of the last keyword of the sentence. This content-priming effect depends on a working-memory resource holding the information of the earlier-presented part of the target speech (the content prime). Older adults usually exhibit a decline in working memory and experience more difficulties in speech recognition under "cocktail-party" conditions. This study investigated whether speech masking also affects recall of the content prime and whether the content-priming effect declines in older adults. The results show that in both younger adults and older adults, although the content prime was heard in quiet, recall of keywords in the prime was significantly affected by the signal-to-masker ratio of the target/masker presentation. The vulnerability of prime recall to speech masking was larger in older adults than in younger adults. Also, the content-priming effect disappeared in older adults, even though older adults are able to use the content prime to determine the target speech in the presence of competing talkers. Thus, a speech masker affects not only recognition but also recall of speech, and there is an age-related decline in both content-priming-based unmasking of the target speech and recall of the prime. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Li, Liang] Peking Univ, Dept Psychol, Speech & Hearing Res Ctr, Key Lab Machine Percept,Minist Educ, Beijing 100871, Peoples R China.
Peking Univ, Dept Machine Intelligence, Speech & Hearing Res Ctr, Key Lab Machine Percept,Minist Educ, Beijing 100871, Peoples R China.
RP Li, L (reprint author), Peking Univ, Dept Psychol, Speech & Hearing Res Ctr, Key Lab Machine Percept,Minist Educ, Beijing 100871, Peoples R China.
EM liangli@pku.edu.cn
FU "973" National Basic Research Program of China [2009CB320901,
2010DFA31520, 2011CB707805]; National Natural Science Foundation of
China [31170985, 30711120563, 90920302, 60811140086]; Chinese Ministry
of Education [20090001110050]; Peking University
FX This work was supported by the "973" National Basic Research Program of
China (2009CB320901; 2010DFA31520; 2011CB707805), the National Natural
Science Foundation of China (31170985; 30711120563, 90920302,
60811140086), the Chinese Ministry of Education (20090001110050), and
"985" grants from Peking University.
CR Agus TR, 2009, J ACOUST SOC AM, V126, P1926, DOI 10.1121/1.3205403
Arbogast TL, 2002, J ACOUST SOC AM, V112, P2086, DOI 10.1121/1.1510141
Baddeley A. D., 1986, WORKING MEMORY
Bell R, 2008, PSYCHOL AGING, V23, P377, DOI 10.1037/0882-7974.23.2.377
Best V, 2008, P NATL ACAD SCI USA, V105, P13174, DOI 10.1073/pnas.0803718105
Best V, 2007, JARO-J ASSOC RES OTO, V8, P294, DOI 10.1007/s10162-007-0073-z
Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946
Cao SY, 2011, J ACOUST SOC AM, V129, P2227, DOI 10.1121/1.3559707
Cheesman MF, 1995, AUDIOLOGY, V34, P321
Cherry CE, 1953, J ACOUST SOC AM, V25, P975, DOI DOI 10.1121/1.1907229
Ezzatian P, 2011, EAR HEARING, V32, P84, DOI 10.1097/AUD.0b013e3181ee6b8a
DUQUESNOY AJ, 1983, J ACOUST SOC AM, V74, P739, DOI 10.1121/1.389859
Freyman RL, 1999, J ACOUST SOC AM, V106, P3578, DOI 10.1121/1.428211
Freyman RL, 2001, J ACOUST SOC AM, V109, P2112, DOI 10.1121/1.1354984
Freyman RL, 2004, J ACOUST SOC AM, V115, P2246, DOI 10.1121/1.689343
Frisina DR, 1997, HEARING RES, V106, P95, DOI 10.1016/S0378-5955(97)00006-3
Fukada T., 1992, P ICASSP, P137, DOI 10.1109/ICASSP.1992.225953
GELFAND SA, 1988, J ACOUST SOC AM, V83, P248, DOI 10.1121/1.396426
Grant KW, 2000, J ACOUST SOC AM, V108, P1197, DOI 10.1121/1.1288668
Hasher L., 1988, PSYCHOL LEARN MOTIV, V22, P193, DOI DOI 10.1016/S0079-7421(08)60041-9
HELFER KS, 1990, J SPEECH HEAR RES, V33, P149
Helfer KS, 2009, J ACOUST SOC AM, V125, P447, DOI 10.1121/1.3035837
Helfer KS, 1997, J SPEECH LANG HEAR R, V40, P432
Helfer KS, 2005, J ACOUST SOC AM, V117, P842, DOI [10.1121/1.1836832, 10.1121/1.183682]
Helfer KS, 2008, EAR HEARING, V29, P87
Helfer KS, 2010, J ACOUST SOC AM, V128, P3625, DOI 10.1121/1.3502462
Huang Y, 2010, EAR HEARING, V31, P579, DOI 10.1097/AUD.0b013e3181db6dc2
Huang Y, 2009, J EXP PSYCHOL HUMAN, V35, P1618, DOI 10.1037/a0015791
Huang Y, 2008, HEARING RES, V244, P51, DOI 10.1016/j.heares.2008.07.006
HUMES LE, 1990, J SPEECH HEAR RES, V33, P726
Humes LE, 2007, J AM ACAD AUDIOL, V18, P590, DOI 10.3766/jaaa.18.7.6
JERGER J, 1991, EAR HEARING, V12, P103
KIDD G, 1994, J ACOUST SOC AM, V95, P3475, DOI 10.1121/1.410023
Kidd G, 1998, J ACOUST SOC AM, V104, P422, DOI 10.1121/1.423246
Kidd G, 2005, J ACOUST SOC AM, V118, P3804, DOI 10.1121/1.2109187
King S., 2009, P BLIZZ CHALL WORKSH
LEEK MR, 1991, PERCEPT PSYCHOPHYS, V50, P205, DOI 10.3758/BF03206743
Li L, 2004, J EXP PSYCHOL HUMAN, V30, P1077, DOI 10.1037/0096-1523.30.6.1077
Newman RS, 2007, J PHONETICS, V35, P85, DOI 10.1016/j.wocn.2005.10.004
Rakerd B, 2006, J ACOUST SOC AM, V119, P1597, DOI 10.1121/1.2161438
Rosenblum LD, 1996, J SPEECH HEAR RES, V39, P1159
Rossi-Katz J, 2009, J SPEECH LANG HEAR R, V52, P435, DOI 10.1044/1092-4388(2008/07-0243)
Rudmann DS, 2003, HUM FACTORS, V45, P329, DOI 10.1518/hfes.45.2.329.27237
SALTHOUSE TA, 1991, PSYCHOL SCI, V2, P179, DOI 10.1111/j.1467-9280.1991.tb00127.x
Schneider B. A., 1997, J SPEECH LANGUAGE PA, V21, P111
Schneider B. A., 2007, J AM ACAD AUDIOL, V18, P578
Schneider BA, 2000, PSYCHOL AGING, V15, P110, DOI 10.1037//0882-7974.15.1.110
Shinoda K., 1997, P EUR
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
Summerfleld A. Q., 1979, PHONETICA, V36, P314
Tun PA, 2002, PSYCHOL AGING, V17, P453, DOI 10.1037//0882-7974.17.3.453
Verhaeghen P., 1993, J GERONTOL, V48, P157
Wolfram S., 1991, MATH SYSTEM DOING MA
Wu X.-H., 2007, EFFECT NUMBER MASKIN, P390
Wu XH, 2005, HEARING RES, V199, P1, DOI 10.1016/j.heares.2004.03.010
Yang ZG, 2007, SPEECH COMMUN, V49, P892, DOI 10.1016/j.specom.2007.05.005
Yoshimura T, 1999, P EUR, P2347
Zen H., 2007, P 6 ISCA WORKSH SPEE
Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325
NR 59
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2012
VL 54
IS 4
BP 529
EP 542
DI 10.1016/j.specom.2011.11.003
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 902DA
UT WOS:000301017600002
ER
PT J
AU Sahidullah, M
Saha, G
AF Sahidullah, Md.
Saha, Goutam
TI Design, analysis and experimental evaluation of block based
transformation in MFCC computation for speaker recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker recognition; MFCC; DCT; Correlation matrix; Decorrelation
technique; Linear transformation; Block transform; Narrow-band noise;
Missing feature theory
ID SUBBAND DCT; NOISE; IDENTIFICATION; VERIFICATION; ALGORITHM
AB The standard Mel frequency cepstrum coefficient (MFCC) computation technique utilizes the discrete cosine transform (DCT) for decorrelating the log energies of the filter bank outputs. The use of the DCT is reasonable here as the covariance matrix of the Mel filter bank log energies (MFLE) can be compared with that of a highly correlated Markov-I process. This full-band based MFCC computation technique, in which each filter bank output contributes to all coefficients, has two main disadvantages. First, the covariance matrix of the log energies does not exactly follow the Markov-I property. Second, the full-band based MFCC feature is severely degraded when the speech signal is corrupted with narrow-band channel noise, even though a few filter bank outputs may remain unaffected. In this work, we have studied a class of linear transformation techniques based on block-wise transformation of the MFLE which effectively decorrelates the filter bank log energies and also captures speech information in an efficient manner. A thorough study has been carried out on the block-based transformation approach by investigating a new partitioning technique that highlights its associated advantages. This article also reports a novel feature extraction scheme which captures information complementary to the wide-band information that otherwise remains undetected by the standard MFCC and the proposed block transform (BT) techniques. The proposed features are evaluated on NIST SRE databases using a Gaussian mixture model-universal background model (GMM-UBM) based speaker recognition system. We have obtained significant performance improvements over the baseline features for both matched and mismatched conditions, as well as for standard and narrow-band noises. The proposed method achieves a significant performance improvement in the presence of narrow-band noise when combined with a missing-feature-theory based score computation scheme. Crown Copyright (C) 2011 Published by Elsevier B.V. All rights reserved.
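The core manipulation, replacing one DCT over all Mel filter-bank log energies by separate DCTs over blocks of them, is easy to sketch. The block boundaries below are arbitrary; choosing them well is precisely what the article investigates, so this is a minimal illustration rather than the authors' partitioning.

```python
import numpy as np

def dct_ii(x):
    """Orthonormal DCT-II of a 1-D array (the transform standard MFCC applies
    to the full set of filter-bank log energies)."""
    n = len(x)
    k = np.arange(n)[:, None]
    m = np.arange(n)[None, :]
    basis = np.cos(np.pi * (m + 0.5) * k / n)
    scale = np.full(n, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale * (basis @ x)

def block_transform(mfle, blocks=((0, 10), (10, 20))):
    """Block-wise DCT of Mel filter-bank log energies (MFLE): a separate DCT
    per block, so a narrow-band disturbance only corrupts the coefficients of
    its own block instead of the whole feature vector."""
    return np.concatenate([dct_ii(mfle[lo:hi]) for lo, hi in blocks])

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    mfle = rng.normal(size=20)          # pretend 20 Mel filter log energies
    print(block_transform(mfle).shape)  # (20,): same dimensionality, block-local
```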
C1 [Sahidullah, Md.; Saha, Goutam] Indian Inst Technol, Dept Elect & Elect Commun Engn, Kharagpur 721302, W Bengal, India.
RP Sahidullah, M (reprint author), Indian Inst Technol, Dept Elect & Elect Commun Engn, Kharagpur 721302, W Bengal, India.
EM sahidullahmd@gmail.com; gsaha@ece.iitkgp.ernet.in
RI Sahidullah, Md/E-2953-2013
CR AHMED N, 1974, IEEE T COMPUT, VC 23, P90, DOI 10.1109/T-C.1974.223784
Akansu A. N., 1992, MULTIRESOLUTION SIGN
Benesty J., 2007, SPRINGER HDB SPEECH
Besacier L., 1997, LECT NOTES COMPUT SC, V1206, P193
Besacier L, 2000, SIGNAL PROCESS, V80, P1245, DOI 10.1016/S0165-1684(00)00033-5
Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714
Campbell JP, 2009, IEEE SIGNAL PROC MAG, V26, P95, DOI 10.1109/MSP.2008.931100
Chakroborty S., 2008, THESIS INDIAN I TECH
Chakroborty S, 2010, SPEECH COMMUN, V52, P693, DOI 10.1016/j.specom.2010.04.002
Chetouani M, 2009, PATTERN RECOGN, V42, P487, DOI 10.1016/j.patcog.2008.08.008
Damper RI, 2003, PATTERN RECOGN LETT, V24, P2167, DOI 10.1016/S0167-8655(03)00082-5
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
Douglas O, 2009, SPEECH COMMUNICATION
Finan R. A., 2001, International Journal of Speech Technology, V4, DOI 10.1023/A:1009652732313
Garreton C, 2010, IEEE T AUDIO SPEECH, V18, P1082, DOI 10.1109/TASL.2010.2049671
Hung WW, 2001, IEEE SIGNAL PROC LET, V8, P70
Jain A.K., 2010, FUNDAMENTALS DIGITAL
Jingdong C., 2000, P INT C SPOK LANG PR, VIV, P117
Jingdong Chen, 2004, IEEE Signal Processing Letters, V11, DOI 10.1109/LSP.2003.821689
Jung SH, 1996, IEEE T CIRC SYST VID, V6, P273
Kajarekar S., 2001, P ICASSP, V1, P137
Kim S, 2008, ETRI J, V30, P89, DOI 10.4218/etrij.08.0107.0108
Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009
Kinnunen T., 2004, THESIS U JOENSUU
Kwon OW, 2004, SIGNAL PROCESS, V84, P1005, DOI 10.1016/j.sigpro.2004.03.004
LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577
Lippmann R., 1997, EUROSPEECH, pKN37
Mak B, 2002, IEEE SIGNAL PROC LET, V9, P241, DOI 10.1109/LSP.2002.803007
Martin A., 2006, 2004 NIST SPEAKER RE
Ming J, 2007, IEEE T AUDIO SPEECH, V15, P1711, DOI 10.1109/TASL.2007.899278
Mukherjee J, 2002, IEEE T CIRC SYST VID, V12, P620, DOI 10.1109/TCSVT.2002.800509
Nasersharif B, 2007, PATTERN RECOGN LETT, V28, P1320, DOI 10.1016/j.patrec.2006.11.019
Nitta T., 2000, ICSLP, V1, P385
Oppenheim A. V., 1979, DIGITAL SIGNAL PROCE
Przybocki M., 2002, 2001 NIST SPEAKER RE
Quatieri T, 2006, DISCRETE TIME SPEECH
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Sahidullah Md, 2010, International Journal of Biometrics, V2, DOI 10.1504/IJBM.2010.035450
Sahidullah M., 2009, IEEE VTC OCT, P1
Sivakumaran P, 2003, SPEECH COMMUN, V41, P485, DOI 10.1016/S0167-6393(03)00017-7
MALVAR HS, 1989, IEEE T ACOUST SPEECH, V37, P553, DOI 10.1109/29.17536
Takiguchi Tetsuya, 2007, Journal of Multimedia, V2, DOI 10.4304/jmm.2.5.13-18
Vale EE, 2008, ELECTRON LETT, V44, P1280, DOI 10.1049/el:20082455
NR 43
TC 13
Z9 15
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2012
VL 54
IS 4
BP 543
EP 565
DI 10.1016/j.specom.2011.11.004
PG 23
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 902DA
UT WOS:000301017600003
ER
PT J
AU Escudero, D
Aguilar, L
Vanrell, MD
Prieto, P
AF Escudero, David
Aguilar, Lourdes
del Mar Vanrell, Maria
Prieto, Pilar
TI Analysis of inter-transcriber consistency in the Cat_ToBI prosodic
labeling system
SO SPEECH COMMUNICATION
LA English
DT Article
DE Prosody; Prosodic labeling; Inter-transcriber consistency; ToBI
ID RELIABILITY; AGREEMENT; SPEECH; CORPUS
AB A set of tools to analyze inconsistencies observed in a Cat_ToBI labeling experiment is presented. We formalize and use the metrics that are commonly used in inconsistency tests. The metrics are systematically applied to analyze the robustness of every symbol and every pair of transcribers. The results reveal agreement rates for this study that are comparable to those of previous ToBI inter-transcriber reliability tests. The inter-transcriber confusion rates are transformed into distance matrices so that multidimensional scaling can be used to visualize the confusion between the different ToBI symbols and the disagreement between the raters. Potentially different labeling criteria are identified, and subsets of symbols that are candidates to be fused are proposed. (C) 2011 Elsevier B.V. All rights reserved.
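Two of the computations mentioned above, a chance-corrected agreement coefficient between a transcriber pair and the conversion of symbol confusions into distances suitable for multidimensional scaling, can be sketched as follows. The label sequences and the confusion-to-distance mapping are illustrative; the article's exact metrics and MDS setup may differ.

```python
import numpy as np

def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa between two transcribers over the same set of events."""
    symbols = sorted(set(labels_a) | set(labels_b))
    idx = {s: i for i, s in enumerate(symbols)}
    conf = np.zeros((len(symbols), len(symbols)))
    for a, b in zip(labels_a, labels_b):
        conf[idx[a], idx[b]] += 1
    n = conf.sum()
    po = np.trace(conf) / n                              # observed agreement
    pe = (conf.sum(0) * conf.sum(1)).sum() / n ** 2      # chance agreement
    return (po - pe) / (1 - pe), conf

def confusion_to_distance(conf):
    """Turn a symbol confusion matrix into a symmetric distance matrix
    (1 - normalized confusion), the kind of input an MDS routine expects.
    The exact transformation used in the paper may differ."""
    p = conf / conf.sum()
    sym = p + p.T
    d = 1.0 - sym / sym.max()
    np.fill_diagonal(d, 0.0)
    return d

if __name__ == "__main__":
    a = ["H*", "L*", "H*", "L+H*", "H*", "L*"]      # toy tone labels
    b = ["H*", "L*", "L+H*", "L+H*", "H*", "H*"]
    kappa, conf = cohen_kappa(a, b)
    print("kappa = %.2f" % kappa)
    print(confusion_to_distance(conf))
```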
C1 [Escudero, David] Univ Valladolid, Dpt Comp Sci, E-47002 Valladolid, Spain.
[Aguilar, Lourdes] Univ Autonoma Barcelona, Dpt Spanish Philol, Barcelona, Spain.
[del Mar Vanrell, Maria] Univ Autonoma Barcelona, Dpt Catalan Philol, Barcelona, Spain.
[Prieto, Pilar] Univ Pompeu Fabra, Dpt Translat & Language Sci, ICREA, Barcelona, Spain.
RP Escudero, D (reprint author), Univ Valladolid, Dpt Comp Sci, E-47002 Valladolid, Spain.
EM descuder@infor.uva.es
RI Consolider Ingenio 2010, BRAINGLOT/D-1235-2009; Prieto,
Pilar/E-7390-2013; Escudero, David/K-7905-2014
OI Prieto, Pilar/0000-0001-8175-1081; Escudero, David/0000-0003-0849-8803
FU Spanish Ministerio de Ciencia e Innovacion [FFI2008-04982-C003-02,
FFI2008-04982-C003-03, FFI2011-29559-C02-01, FFI2011-29559-C02-02,
FFI2009-07648/FILO, CSD2007-00012]; Generalitat de Catalunya
[2009SGR-701]
FX This research has been funded by six research grants awarded by the
Spanish Ministerio de Ciencia e Innovacion, namely the Glissando project
FFI2008-04982-C003-02, FFI2008-04982-C003-03, FFI2011-29559-C02-01,
FFI2011-29559-C02-02, FFI2009-07648/FILO and CONSOLIDER-INGENIO 2010
Programme CSD2007-00012, and by a grant awarded by the Generalitat de
Catalunya to the Grup d'Estudis de Prosodia (2009SGR-701)
CR Aguilar L., 2009, CAT TOBI TRAINING MA
Ananthakrishnan S, 2008, IEEE T AUDIO SPEECH, V16, P216, DOI 10.1109/TASL.2007.907570
Arvaniti A., 2005, PROSODIC TYPOLOGY PH, P84
Beckman M., 2000, KOREAN J SPEECH SCI, V7, P143
Beckman M., 2005, PROSODIC TYPOLOGY PH, P9
Beckman Mary, 2002, PROBUS, V14, P9, DOI 10.1515/prbs.2002.008
Beckman M.E., 2000, INTONATION SPANISH T
Boersma P., 2011, PRAAT DOING PHONETIC
Bonafonte A., 2008, P LREC MARR, P3325
Borg I., 2005, MODERN MULTIMENSIONA
BRUGOS ALEJNA, 2008, P SPEECH PROS 2008, P273
Buhmann J., 2002, P 3 INT C LANG RES E, P779
Cabre T., 2007, INTERACTIVE ATLAS CA
COHEN J, 1960, EDUC PSYCHOL MEAS, V20, P37, DOI 10.1177/001316446002000104
Estebas E., 2009, ESTUDIOS FONETICA EX, VXVIII, P263
Estebas Vilaplana E., 2010, TRANSCRIPTION INTONA, P17
FLEISS JL, 1971, PSYCHOL BULL, V76, P378, DOI 10.1037/h0031619
Godfrey J. J., 1992, P ICASSP, V1, P517
Gonzalez C., 2010, P INT 2010, P142
Grice M., 1995, PHONUS, V1, P33
Grice M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607958
Gwet K. L., 2010, HDB INTERRATER RELIA
Gwet K.L, 2008, BRIT J MATH STAT PSY, V61, P26
Hasegawa-Johnson M, 2005, SPEECH COMMUN, V46, P418, DOI 10.1016/j.specom.2005.01.009
Hasegawa-Johnson M, 2004, P ICSA INT C SPOK LA, P2729
Herman R, 2002, LANG SPEECH, V45, P1
Hirst DJ, 2005, SPEECH COMMUN, V46, P334, DOI 10.1016/j.specom.2005.02.020
Ihaka R., 1996, J COMPUTATIONAL GRAP, V5, P299, DOI DOI 10.2307/1390807
Jun S., 2000, P ICSLP, V3, P211
Kruskal J. B., 1978, SAGE U PAPER SERIES
LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310
Mayo C., 1997, P ESCA WORKSH INT TH, P231
Ohio-State-University, 2006, WHAT IS TOBI
Pena D, 1999, ESTADISTICA MODELOS
Pierrehumbert J, 1980, THESIS MIT
Pitrelli J., 1994, P 3 INT C SPOK LANG, P123
Pitt MA, 2005, SPEECH COMMUN, V45, P89, DOI 10.1016/j.specom.2004.09.001
Prieto P., 2009, ESTUDIOS FONETICA EX, VXVIII, P287
Prieto P, 2012, PROSODIC TYPOLOGY
Prom-on S, 2009, J ACOUST SOC AM, V125, P405, DOI 10.1121/1.3037222
Rosenberg A., 2010, HLT NAACL, P721
Rosenberg A, 2009, THESIS COLUMBIA U US
Scott William, 1955, PUBLIC OPIN QUART, P321, DOI DOI 10.1086/266577
Silverman K., 1992, P INT C SPOK LANG PR, P867
Sim J, 2005, PHYS THER, V85, P257
Sridhar VKR, 2008, IEEE T AUDIO SPEECH, V16, P797, DOI 10.1109/TASL.2008.917071
Syrdal A. K., 2000, P INT C SPOK LANG PR, V3, P235
Syrdal AK, 2001, SPEECH COMMUN, V33, P135, DOI 10.1016/S0167-6393(00)00073-X
UEBERSAX JS, 1987, PSYCHOL BULL, V101, P140, DOI 10.1037/0033-2909.101.1.140
Venditti Jennifer J., 2005, PROSODIC TYPOLOGY PH, P172
Wightman Colin W, 2002, P SPEECH PROS
NR 51
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2012
VL 54
IS 4
BP 566
EP 582
DI 10.1016/j.specom.2011.12.002
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 902DA
UT WOS:000301017600004
ER
PT J
AU Reveil, B
Martens, JP
van den Heuvel, H
AF Reveil, Bert
Martens, Jean-Pierre
van den Heuvel, Henk
TI Improving proper name recognition by means of automatically learned
pronunciation variants
SO SPEECH COMMUNICATION
LA English
DT Article
DE Proper name recognition; Pronunciation variation modeling;
Cross-linguality
ID SPEECH RECOGNITION; CORPORA; MODELS; UNITS
AB This paper introduces a novel lexical modeling approach that aims to improve large vocabulary proper name recognition for native and non-native speakers. The method uses one or more so-called phoneme-to-phoneme (P2P) converters to add useful pronunciation variants to a baseline lexicon. Each P2P converter is a stochastic automaton that applies context-dependent transformation rules to a baseline transcription that is generated by a standard grapheme-to-phoneme (G2P) converter. The paper focuses on the inclusion of different types of features to describe the rule context, ranging from the identities of neighboring phonemes to morphological and even semantic features such as the language of origin of the name, and on the development and assessment of methods that can cope with cross-lingual issues. Another aim is to ensure that the proposed solutions are applicable to new names (not seen during system development) and useful in the hands of product developers with good knowledge of their application domain but little expertise in automatic speech recognition (ASR) and speech corpus acquisition. The proposed method was evaluated on person name and geographical name recognition, two economically interesting domains in which non-native speakers as well as non-native names occur very frequently. For the recognition experiments a state-of-the-art commercial ASR engine was employed. The experimental results demonstrate that significant improvements in recognition accuracy can be achieved: large gains (up to 40% relative) when prior knowledge of the speaker's tongue and the name origin is available, and still significant gains when no such prior information is available. (C) 2011 Elsevier B.V. All rights reserved.
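A phoneme-to-phoneme converter of the kind described applies context-dependent rewrite rules to a baseline transcription in order to generate pronunciation variants. The toy rules and the greedy single-pass application below are invented for illustration; the converters in the paper are stochastic automata whose rules and context features are learned from data.

```python
# Hypothetical context-dependent phoneme-to-phoneme rules: each rule rewrites a
# target phoneme given its left/right phoneme context. Rule notation, rule
# learning (the stochastic part) and the feature set are much richer in the paper.
RULES = [
    # (left context, target, right context, replacement)
    ("",  "x",  "",  "k"),   # toy rule: /x/ realized as /k/
    ("",  "EI", "n", "e"),   # toy rule: vowel substitution before /n/
]

def apply_p2p(phones, rules):
    """Apply the first matching rule at every position of a phoneme list and
    return one variant pronunciation (greedy, single pass)."""
    out = []
    for i, ph in enumerate(phones):
        for left, tgt, right, repl in rules:
            if ph != tgt:
                continue
            if left and (i == 0 or phones[i - 1] != left):
                continue
            if right and (i + 1 >= len(phones) or phones[i + 1] != right):
                continue
            out.extend(repl.split())
            break
        else:
            out.append(ph)
    return out

if __name__ == "__main__":
    baseline = ["s", "x", "EI", "n"]       # toy G2P output for a name
    print(apply_p2p(baseline, RULES))      # one automatically generated variant
```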
C1 [Reveil, Bert; Martens, Jean-Pierre] UGent, ELIS, DSSP Grp, B-9000 Ghent, Belgium.
[van den Heuvel, Henk] Radboud Univ Nijmegen, Fac Arts, CLST, Nijmegen, Netherlands.
RP Reveil, B (reprint author), UGent, ELIS, DSSP Grp, Sint Pietersnieuwstr 41, B-9000 Ghent, Belgium.
EM breveil@elis.ugent.be; martens@elis.ugent.be; h.vandenheuvel@let.ru.nl
FU Flanders FWO
FX The presented work was carried out in the context of two research
projects: the Autonomata Too project, granted under the Dutch-Flemish
STEVIN program, and the TELEX project, granted by Flanders FWO.
CR Adda-Decker M, 1999, SPEECH COMMUN, V29, P83, DOI 10.1016/S0167-6393(99)00032-1
Amdal I., 2000, P ISCA ITRW ASR2000, P85
Amdal I., 2000, P ICSLP, P622
Cremelie N, 1999, SPEECH COMMUN, V29, P115, DOI 10.1016/S0167-6393(99)00034-5
Bartkova K, 2006, INT CONF ACOUST SPEE, P1037
Bartkova K, 2007, SPEECH COMMUN, V49, P836, DOI 10.1016/j.specom.2006.12.009
Bisani M, 2008, SPEECH COMMUN, V50, P434, DOI 10.1016/j.specom.2008.01.002
Bonaventura P., 1998, P ESCA WORKSH MOD PR, P17
Bouselmi G, 2006, INT CONF ACOUST SPEE, P345
CMU, 2010, CARN MELL U PRON DIC
Conover W., 1999, PRACTICAL NONPARAMET, V3
Cremelie N., 2001, P ISCA ITRW AD METH, P151
Daelemans W., 2005, MEMORY BASED LANGUAG, VI
Fosler-Lussier E, 2005, SPEECH COMMUN, V46, P153, DOI 10.1016/j.specom.2005.03.003
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Goronzy S., 2004, SPEECH COMM, P42
Humphries J., 1997, P EUR, P317
Jurafsky D, 2001, INT CONF ACOUST SPEE, P577, DOI 10.1109/ICASSP.2001.940897
Lawson A., 2003, P EUR GEN SWITZ, P1505
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Li X, 2007, P ASRU, P130
Loots L, 2011, SPEECH COMMUN, V53, P75, DOI 10.1016/j.specom.2010.07.006
Maison B., 2003, P ASRU VIRG ISL US, P429
Mayfield-Tomokiyo L., 2001, P WORKSH MULT SPOK L
PMLA, 2002, P ITRW PRON MOD LEX
Raux A., 2004, P ICSLP 04 INT C SPO, P613
Reveil B., 2010, P LREC, P2149
Revell B., 2009, P INT, P2995
Riley M, 1999, SPEECH COMMUN, V29, P209, DOI 10.1016/S0167-6393(99)00037-0
Schaden S., 2003, P 15 INT C PHON SCI, P2545
Schaden S., 2003, P 10 EACL C, P159
Schraagen M., 2010, P LREC, P612
Stemmer G., 2001, P EUR AALB DENM, P2745
Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2
Van den Heuvel H., 2009, P INT BRIGHT UK, P2991
van den Heuvel H., 2008, P LREC, P140
Van Bael C, 2007, COMPUT SPEECH LANG, V21, P652, DOI 10.1016/j.csl.2007.03.003
Van Compernolle D, 2001, SPEECH COMMUN, V35, P71, DOI 10.1016/S0167-6393(00)00096-0
Wester M., 2000, P INT C SPOK LANG PR, P488
Yang Q., 2002, P PMLA, P123
You H., 2005, P EUR LISB PORT, P749
NR 41
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 321
EP 340
DI 10.1016/j.specom.2011.10.007
PG 20
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100001
ER
PT J
AU Kulkarni, PN
Pandey, PC
Jangamashetti, DS
AF Kulkarni, Pandurangarao N.
Pandey, Prem C.
Jangamashetti, Dakshayani S.
TI Multi-band frequency compression for improving speech perception by
listeners with moderate sensorineural hearing loss
SO SPEECH COMMUNICATION
LA English
DT Article
DE Frequency compression; Sensorineural hearing loss; Spectral masking
ID SPECTRAL CONTRAST ENHANCEMENT; IMPAIRED LISTENERS; LOUDNESS RECRUITMENT;
THRESHOLD ELEVATION; RESPONSE-TIMES; INTELLIGIBILITY; TRANSPOSITION;
SIMULATION; NOISE; DISCRIMINATION
AB In multi-band frequency compression, the speech spectrum is divided into a number of analysis bands, and the spectral samples in each band are compressed towards the band center by a constant compression factor, resulting in presentation of the speech energy in relatively narrow bands, for reducing the effect of increased intraspeech spectral masking associated with sensorineural hearing loss. Earlier investigation assessing the quality of the processed speech showed best results for auditory critical bandwidth based compression using spectral segment mapping and pitch-synchronous analysis-synthesis. The objective of the present investigation is to evaluate the effectiveness of the technique in improving speech perception by listeners with moderate to severe sensorineural loss and to optimize the technique with respect to the compression factor. The listening tests showed maximum improvement in speech perception for a compression factor of 0.6, with an improvement of 9%-21% in the recognition scores for consonants and a significant reduction in response times. (C) 2011 Elsevier B.V. All rights reserved.
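The central operation, moving the spectral samples of each analysis band towards the band centre by a constant compression factor, can be sketched directly on a magnitude spectrum. The equal-width bands below are a simplification made for the sketch; the article uses auditory critical bands together with pitch-synchronous analysis-synthesis, which are not reproduced here.

```python
import numpy as np

def compress_band(mag, lo, hi, c):
    """Move the spectral samples of one band [lo, hi) towards the band centre
    by a constant compression factor c (0 < c <= 1), accumulating samples that
    land on the same bin."""
    out = np.zeros_like(mag)
    center = (lo + hi) / 2.0
    for k in range(lo, hi):
        new_k = int(round(center + c * (k - center)))   # compressed position
        out[new_k] += mag[k]
    return out

def multiband_compress(mag, band_edges, c=0.6):
    """Apply band-wise compression over a list of (lo, hi) bin ranges."""
    out = np.zeros_like(mag)
    for lo, hi in band_edges:
        out += compress_band(mag, lo, hi, c)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    spectrum = np.abs(rng.normal(size=256))
    edges = [(i, i + 32) for i in range(0, 256, 32)]    # 8 equal-width toy bands
    compressed = multiband_compress(spectrum, edges, c=0.6)
    # Total magnitude is preserved: samples are only relocated within bands.
    print(spectrum.sum(), compressed.sum())
```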
C1 [Kulkarni, Pandurangarao N.; Pandey, Prem C.] Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India.
[Jangamashetti, Dakshayani S.] Basaveshwar Engn Coll, Dept Elect & Elect Engn, Bagalkot 587102, Karnataka, India.
RP Pandey, PC (reprint author), Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India.
EM pnkulkarni@ee.iitb.ac.in; pcpandey@ee.iitb.ac.in; dsj1869@rediffmail.com
FU Department of Information Technology, MCIT, Government of India
FX The authors are grateful to Dr. Kiran Kalburgi and Dr. S. S. Doddamani
for providing support in conducting listening tests on hearing-impaired
listeners. The research is partly supported by a project grant to IIT
Bombay under the National Programme on Perception Engineering, sponsored
by the Department of Information Technology, MCIT, Government of India.
CR ANSI, 1989, S321989 ANSI AM STAN
Apoux F, 2001, HEARING RES, V153, P123, DOI 10.1016/S0378-5955(00)00265-3
Arai T., 2004, P 18 INT C AC ICA, P1389
BAER T, 1993, J REHABIL RES DEV, V30, P49
Baskent D, 2006, J ACOUST SOC AM, V119, P1156, DOI 10.1121/1.2151825
BUNNELL HT, 1990, J ACOUST SOC AM, V88, P2546, DOI 10.1121/1.399976
CARNEY AE, 1983, J ACOUST SOC AM, V73, P268, DOI 10.1121/1.388860
Chaudhari D.S., 1998, ACOUST SPEECH SIG PR, P3601
Cheeran A. N., 2004, P ICASSP MONTR QUEB, VIV, P17
CHILDERS DG, 1994, J ACOUST SOC AM, V96, P2026, DOI 10.1121/1.411319
Cohen I, 2006, SIGNAL PROCESS, V86, P698, DOI 10.1016/j.sigpro.2005.06.005
Delogu C., 1991, P EUROSPEECH 91 GENO, P353
DUBNO JR, 1989, J ACOUST SOC AM, V85, P1666, DOI 10.1121/1.397955
Fraga F. J., 2008, P INT BRISB AUSTR, P2238
GATEHOUSE S J G, 1990, British Journal of Audiology, V24, P63, DOI 10.3109/03005369009077843
GLASBERG BR, 1986, J ACOUST SOC AM, V79, P1020, DOI 10.1121/1.393374
HOUSE AS, 1965, J ACOUST SOC AM, V37, P158, DOI 10.1121/1.1909295
Jangamashetti D. S., 2010, P 20 INT C AC ICA 20
Kreul K.J., 1968, J SPEECH HEAR RES, V11, P536
Kulkarni P. N., 2009, INT J SPEECH TECH, V10, P219
Kulkarni P.N., 2009, P 16 INT C DIG SIGN
Kulkarni P.N, 2010, THESIS INDIAN I TECH
Kulkarni P.N., 2006, J ACOUST SOC AM, V120, P3253
Lunner T, 1993, Scand Audiol Suppl, V38, P75
Lyregaard P E, 1982, Scand Audiol Suppl, V15, P113
Mackersie C, 1999, EAR HEARING, V20, P140, DOI 10.1097/00003446-199904000-00005
McDermott HJ, 2000, BRIT J AUDIOL, V34, P353
Meftah M., 1996, P ICSLP, V1, P74, DOI 10.1109/ICSLP.1996.607033
Miller RL, 1999, J ACOUST SOC AM, V106, P2693, DOI 10.1121/1.428135
Mitra S.K., 1998, COMPUTER BASED APPRO
MOORE BCJ, 1993, J ACOUST SOC AM, V94, P2050, DOI 10.1121/1.407478
Moore BCJ, 1997, INTRO PSYCHOL HEARIN
Murase A., 2004, P 18 INT C AC KYOT J, VII, P1519
Nejime Y, 1997, J ACOUST SOC AM, V102, P603, DOI 10.1121/1.419733
Proakis J. G., 1992, DIGITAL SIGNAL PROCE
Rabiner L.R., 1978, DIGITAL PROCESSING S
REED CM, 1983, J ACOUST SOC AM, V74, P409, DOI 10.1121/1.389834
Robinson JD, 2007, INT J AUDIOL, V46, P293, DOI 10.1080/14992020601188591
Sakamoto S, 2000, Auris Nasus Larynx, V27, P327, DOI 10.1016/S0385-8146(00)00066-3
Simpson A, 2006, INT J AUDIOL, V45, P619, DOI 10.1080/14992020600825508
Simpson A, 2005, INT J AUDIOL, V44, P281, DOI 10.1080/14992020500060636
STONE MA, 1992, J REHABIL RES DEV, V29, P39, DOI 10.1682/JRRD.1992.04.0039
TERKEURS M, 1992, J ACOUST SOC AM, V91, P2872, DOI 10.1121/1.402950
Turner Christopher W., 1999, Journal of the Acoustical Society of America, V106, P877, DOI 10.1121/1.427103
VILLCHUR E, 1974, J ACOUST SOC AM, V56, P1601, DOI 10.1121/1.1903484
VILLCHUR E, 1977, J ACOUST SOC AM, V62, P665, DOI 10.1121/1.381579
Yang J, 2003, SPEECH COMMUN, V39, P33, DOI 10.1016/S0167-6393(02)00057-2
Yang WY, 2006, J ACOUST SOC AM, V120, P801, DOI 10.1121/1.2216768
Yasu K., 2002, P CHIN JAP JOINT C A, P159
Yoo SD, 2007, J ACOUST SOC AM, V122, P1138, DOI 10.1121/1.2751257
ZWICKER E, 1961, J ACOUST SOC AM, V33, P248, DOI 10.1121/1.1908630
NR 51
TC 4
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 341
EP 350
DI 10.1016/j.specom.2011.09.005
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100002
ER
PT J
AU Moreno-Daniel, A
Wilpon, J
Juang, BH
AF Moreno-Daniel, Antonio
Wilpon, Jay
Juang, B. H.
TI Index-based incremental language model for scalable directory assistance
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice search; Directory assistance; Listings search; Spoken query
processing; Incremental language model
ID CONTINUOUS SPEECH RECOGNITION; FINITE-STATE TRANSDUCERS
AB As ubiquitous access to vast and remote information sources from portable devices becomes commonplace, users' need to perform searches in keyboard-unfriendly situations grows substantially, thus increasing the demand for voice search sessions. This paper proposes a methodology that addresses different dimensions of scalability of mixed-initiative voice search in automatic spoken dialog systems.
The strategy is based on splitting the complexity of the fully-constrained grammar (one that tightly covers the entire hypothesis space) into a fixed/low complexity phonotactic grammar followed by an index mechanism that dynamically assembles a second-pass grammar that consists of only a handful of hypotheses. The experimental analysis demonstrates different dimensions of scalability achieved by the proposed method using actual WHITEPAGEs-residential data. (C) 2011 Elsevier B.V. All rights reserved.
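The index mechanism can be pictured as an inverted index from phone n-grams of the listings to listing identifiers: the phone string decoded with the low-complexity first-pass grammar queries the index, and the handful of retrieved listings forms the second-pass grammar. The listings, phone notation and scoring below are toy assumptions for illustration, not the paper's system or its WHITEPAGES data.

```python
from collections import defaultdict

def phone_ngrams(phones, n=3):
    return {tuple(phones[i:i + n]) for i in range(len(phones) - n + 1)}

def build_index(listings, n=3):
    """Inverted index from phone n-grams to listing ids. A first-pass phone
    string would query an index like this one to assemble a small second-pass
    grammar containing only the retrieved hypotheses."""
    index = defaultdict(set)
    for lid, phones in listings.items():
        for g in phone_ngrams(phones, n):
            index[g].add(lid)
    return index

def retrieve(index, decoded_phones, n=3, top=5):
    """Rank listings by the number of shared phone n-grams with the decoding."""
    votes = defaultdict(int)
    for g in phone_ngrams(decoded_phones, n):
        for lid in index.get(g, ()):
            votes[lid] += 1
    return sorted(votes, key=votes.get, reverse=True)[:top]

if __name__ == "__main__":
    # Toy listing lexicon: id -> phone sequence (characters stand in for phones)
    listings = {
        "smith_john":  list("smITdZAn"),
        "smythe_joan": list("smaIDdZoUn"),
        "miller_jane": list("mIl@rdZeIn"),
    }
    index = build_index(listings)
    first_pass = list("smITdZan")          # noisy first-pass phone string
    print(retrieve(index, first_pass))     # small second-pass candidate set
```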
C1 [Moreno-Daniel, Antonio; Juang, B. H.] Georgia Inst Technol, Atlanta, GA 30332 USA.
[Wilpon, Jay] AT&T Labs Res, Florham Pk, NJ USA.
RP Moreno-Daniel, A (reprint author), Georgia Inst Technol, Atlanta, GA 30332 USA.
EM amoreno@gmail.com
CR Allauzen C., 2005, LECT NOTES COMPUT SC, P23
Allauzen C., 2004, P ICASSP2003, V1, P352
Allen J. E., 1999, IEEE Intelligent Systems, V14, DOI 10.1109/5254.796083
Aubert XL, 2002, COMPUT SPEECH LANG, V16, P89, DOI 10.1006/csla.2001.0185
Bangalore S., 2006, P C EUR CHAPT ASS CO, P361
Bangalore S., 2003, P IEEE WORKSH AUT SP, P221
Bayer R., 1972, Acta Informatica, V1, DOI 10.1007/BF00289509
Brin S, 1998, COMPUT NETWORKS ISDN, V30, P107, DOI 10.1016/S0169-7552(98)00110-X
Chang W., 2002, IEEE T SPEECH AUDIO, V10, P531
Crestani F, 2002, DATA KNOWL ENG, V41, P105, DOI 10.1016/S0169-023X(02)00024-1
David R., 1995, STL TUTORIAL REFEREN
Dolfing H.J.G.A., 2001, P IEEE AUT SPEECH RE, P194
FREDKIN E, 1960, COMMUN ACM, V3, P490, DOI 10.1145/367390.367400
Goffin V., 2005, P IEEE INT C AC SPEE, VI, P1033, DOI 10.1109/ICASSP.2005.1415293
Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X
HARRIS R. A., 2004, VOICE INTERACTION DE
Hori T., 2004, P ICSLP, V1, P289
Hori T, 2007, IEEE T AUDIO SPEECH, V15, P1352, DOI 10.1109/TASL.2006.889790
Johnston M., 2006, P IEEE INT C AC SPEE, VI, P617
Kanthak S., 2002, P INT C SPOK LANG PR, P1309
KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125
Mohri M, 2002, COMPUT SPEECH LANG, V16, P69, DOI 10.1006/csla.2001.0184
Mohri M, 1997, COMPUT LINGUIST, V23, P269
Mohri M., 2008, SPEECH RECOGNITION W, P559, DOI 10.1007/978-3-540-49127-9_28
Mohri M, 1998, LECT NOTES COMPUT SC, V1436, P144
Moreno-Daniel A., 2009, P IEEE INT C AC SPEE, VI, P3945
Moreno-Daniel A., 2007, P IEEE INT C AC SPEE, VIV, P121
Natarajan P., 2002, P ICASSP 2002 MAY, VI, P21
Pack T., 2008, P ISCA INT SEPT, P53
Pack T., 2008, P 21 ANN S US INT SO, P141, DOI 10.1145/1449715.1449738
Parthasarathy S., 2005, P INTERSPEECH, P2493
Parthasarathy S., 2007, P IEEE INT C AC SPEE, VIV, P161
Pereira F.C., 1996, SPEECH RECOGNITION C, P431
Rabiner L. R., 1986, IEEE ASSP Magazine, V3, DOI 10.1109/MASSP.1986.1165342
Rose R.C., 2001, P ICASSP, VI, P17
Salomaa A., 1978, AUTOMATA THEORETIC A
Sproat R., 1999, TEXT TO SPEECH SYNTH
Wang Y.-Y., 2008, SIGNAL PROCESSING MA, V25, P28
Willet D., 2002, P ICASSP, VI, P713
Wilpon J.G., 1994, P ISCA INT C SPOK LA, P667
Yu D., 2007, P INT, P2709
NR 41
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 351
EP 367
DI 10.1016/j.specom.2011.09.006
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100003
ER
PT J
AU Recasens, D
AF Recasens, Daniel
TI A cross-language acoustic study of initial and final allophones of /l/
SO SPEECH COMMUNICATION
LA English
DT Article
DE Clear and dark /l/; Intrinsic and extrinsic allophones; Darkness degree;
Vowel coarticulation; Spectral analysis
AB Formant frequency data for /l/ in 23 languages/dialects where the consonant may be typically clear or dark show that the two varieties of /l/ are set in contrast mostly in the context of /i/ but also next to /a/, and that a few languages/dialects may exhibit intermediate degrees of darkness in the consonant. F2 for /l/ is higher utterance initially than utterance finally, more so if the lateral is clear than if it is dark; moreover, the initial and final allophones may be characterized as intrinsic (in most languages/dialects) or extrinsic (in several English dialects, Czech and Dutch) depending on whether the position-dependent frequency difference in question is below or above 200/300 Hz. The paper also reports a larger degree of vowel coarticulation for clear /l/ than for dark /l/ and in initial than in final position. These results are interpreted in terms of the production mechanisms involved in the realization of the two /l/ varieties in the different positional and vowel context conditions subjected to investigation. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Recasens, Daniel] Univ Autonoma Barcelona, Dept Filologia Catalana, E-08193 Barcelona, Spain.
[Recasens, Daniel] Inst Estudis Catalans, Lab Fonet, Barcelona 08001, Spain.
RP Recasens, D (reprint author), Univ Autonoma Barcelona, Dept Filologia Catalana, E-08193 Barcelona, Spain.
EM daniel.recasens@uab.es
FU Ministry of Innovation and Science of Spain [FFI2009-09339]; Catalan
Government [2009SGR003]
FX This research was funded by the Project FFI2009-09339 of the Ministry of
Innovation and Science of Spain and by the research group 2009SGR003 of
the Catalan Government. We would like to acknowledge three reviewers for
comments on a previous version of the manuscript, and several scholars
for providing acoustic recordings or formant frequency data: (Alguercse)
Francesco Ballone; (Czech) Jan Volin; (Danish) John Tondering; (Dutch)
Louis Pols; (Finnish) olli Aaltonen; (German) Marzena Zygis and Micaela
Mertins; (Hungarian) Maria Gosy; (Italian) Silvia Calamai; (Norwegian)
Hanne Gram Simonsen and Inger Moen; (Occitan) Daniela Muller;
(Portuguese) Antonio Texeira; (Romanian) Ioana Chitoran; (Russian)
Alexei Kochetov; (Swedish) Francisco Lacerda. Most of these scholars
read a preliminary paper version and made remarks for improvement. We
are also grateful to the PETRA lab (Plateau d'Etudes Techniques et de
Recherche en Audition; http://petra.univ-tlse2fr/) where the Occitan
recordings were carried out.
CR Bladon R. A. W., 1979, CURRENT TRENDS PHONE, P501
Bladon R. A. W., 1976, J PHONETICS, V3, P137
Bladon R.A.W., 1978, J ITALIAN LINGUISTIC, V3, P43
Browman C.P., 1995, FESTSCHRIFT KS HARRI, P19
Charcouloff M., 1985, TRAVAUX I PHONETIQUE, V10, P63
Cruz-Ferreira M., 1995, J INT PHON ASSOC, V25, P90
Dankovicova J., 1999, HDB INT PHONETIC ASS, P70
Delattre P, 1965, COMP PHONETIC FEATUR
Espinosa A., 2005, J INT PHON ASSOC, V35, P1, DOI DOI 10.1017/S0025100305001878
Fant G., 1960, ACOUSTIC THEORY SPEE
Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332
Gosy M., 2004, FONETIKA BESZED TUDO
Gronnum Nina, 2005, FONETIK FONOLOGI ALM
Iivonen A., 2000, SUOMEN FONETIIKKAA P
Jones Daniel, 1969, PHONETICS RUSSIAN
Kohler KJ, 1977, EINFUHRUNG PHONETIK
Kristoffersen G., 2007, PHONOLOGY NORWEGIAN
Lacerda F., 2006, ACOUSTIC ANAL CENTRA
Ladefoged P., 1968, NATURE GEN PHONETIC, P283
LADEFOGED P, 1965, LANGUAGE, V41, P332, DOI 10.2307/411884
Lavoie L, 2001, CONSONANT STRENGTH P
Lehiste I., 1964, ACOUSTICAL CHARACTER
Lindblad P., 2003, P 15 ICPHS, P1899
Lindblom B., 2004, SOUND SENSE, P86
Local J. K., 2002, STRUCTURAL VARIATION
Marques I., 2010, THESIS U AVEIRO PORT
Martinez Celdran E., 2007, MANUAL FONETICA ESPA
Narayanan SS, 1997, J ACOUST SOC AM, V101, P1064, DOI 10.1121/1.418030
Newton D.E., 1996, YORK PAPERS LINGUIST, V17, P167
Quilis A., 1979, LINGUISTICA ESPANOLA, V1, P233
Recasens D, 2004, CLIN LINGUIST PHONET, V18, P593, DOI 10.1080/02699200410001703556
Recasens D., 1996, FONETICA DESCRIPTIVA
Recasens D., 1986, ESTUDIS FONETICA EXP
Recasens D., 1994, PHONOLOGICA 1992, P195
Recasens D, 2004, J PHONETICS, V32, P435, DOI 10.1016/j.wocn.2004.02.001
SPROAT R, 1993, J PHONETICS, V21, P291
Stevens K.N., 1998, ACOUSTIC PHONETICS
Tomas Tomas Navarro, 1972, MANUAL PRONUNCIACION
Tranel B., 1987, SOUND FRENCH INTRO
Warner Natasha, 2001, PHONOLOGY, V18, P387
Wells John, 1982, ACCENTS ENGLISH
Wheeler M., 1988, ROMANCE LANGUAGES
Wiik K., 1966, PUBLICATIONS PHONETI
Zhou X., 2009, THESIS U MARYLAND
NR 44
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 368
EP 383
DI 10.1016/j.specom.2011.10.001
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100004
ER
PT J
AU Nose, T
Kobayashi, T
AF Nose, Takashi
Kobayashi, Takao
TI Very low bit-rate F0 coding for phonetic vocoders using MSD-HMM with
quantized F0 symbols
SO SPEECH COMMUNICATION
LA English
DT Article
DE Phonetic vocoder; HMM-based speech synthesis; Very low bit-rate speech
coding; MSD-HMM; MSD-VQ
ID SPEECH PARAMETER GENERATION
AB This paper presents a technique of very low bit-rate F0 coding for phonetic vocoders based on a hidden Markov model (HMM) using phone-level quantized F0 symbols. In the proposed technique, an input F0 sequence is converted into an F0 symbol sequence at the phone level using scalar quantization. The quantized F0 symbols represent the rough shape of the original F0 contour and are used as the prosodic context for the HMM in the decoding process. To model the F0 that has voiced and unvoiced regions, we use multi-space probability distribution HMM (MSD-HMM). Synthetic speech is generated from the context-dependent labels and pre-trained MSD-HMMs by using the HMM-based parameter generation algorithm. By taking into account the preceding and succeeding contexts as well as the current one in the modeling and synthesis, we can generate a smooth F0 trajectory similar to that of the original with only a small number of quantization bits. The experimental results reveal that the proposed F0 coding outperforms the conventional segment-based F0 coding technique using MSD-VQ. We also demonstrate that the decoded speech of the proposed vocoder has acceptable quality even when the F0 bit-rate is less than 50 bps. (C) 2011 Elsevier B.V. All rights reserved.
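A minimal sketch of the encoder side, assigning each phone a symbol obtained by scalar quantization of its mean log-F0 (plus a dedicated unvoiced symbol), is given below. The codebook, the symbol inventory and their use as MSD-HMM context in the paper are richer than this; the frame segmentation, bit count and F0 range here are illustrative assumptions.

```python
import numpy as np

def quantize_phone_f0(f0_hz, phone_bounds, n_bits=3, f0_range=(70.0, 400.0)):
    """Phone-level scalar quantization of an F0 contour into symbols.

    For every phone segment the mean log-F0 of its voiced frames is quantized
    with a uniform n_bits scalar quantizer; unvoiced phones get a dedicated
    'UV' symbol. This only illustrates the bit budget of phone-level F0 symbols.
    """
    lo, hi = np.log(f0_range[0]), np.log(f0_range[1])
    levels = 2 ** n_bits
    symbols = []
    for start, end in phone_bounds:
        seg = f0_hz[start:end]
        voiced = seg[seg > 0]
        if voiced.size == 0:
            symbols.append("UV")
            continue
        x = np.clip(np.log(voiced.mean()), lo, hi)
        q = int((x - lo) / (hi - lo) * (levels - 1) + 0.5)
        symbols.append("F0_%d" % q)
    return symbols

if __name__ == "__main__":
    f0 = np.concatenate([np.full(20, 120.0), np.zeros(10), np.full(15, 180.0)])
    bounds = [(0, 20), (20, 30), (30, 45)]     # three phones (frame indices)
    print(quantize_phone_f0(f0, bounds, n_bits=3))
    # e.g. ['F0_2', 'UV', 'F0_4']: ~3 bits per phone, i.e. a few tens of bps
    # at typical phone rates.
```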
C1 [Nose, Takashi; Kobayashi, Takao] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan.
RP Nose, T (reprint author), Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan.
EM takashi.nose@ip.titech.ac.jp; takao.kobayashi@ip.titech.ac.jp
FU JSPS [21300063, 21800020]
FX Part of this work was supported by JSPS Grant-in-Aid for Scientific
Research 21300063 and 21800020.
CR DUDLEY H, 1958, J ACOUST SOC AM, V30, P733, DOI 10.1121/1.1909744
Hoshiya T., 2003, P ICASSP 2003, P800
Katsaggelos A., 2002, P IEEE 2, V86, P1126
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Lee KS, 2001, IEEE T SPEECH AUDI P, V9, P482
Nose T., 2010, IEICE T INF SYSTEMS, P2483
Nose T, 2010, INT CONF ACOUST SPEE, P4622, DOI 10.1109/ICASSP.2010.5495548
Picone J., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0)
Picone J., 1989, P ICASSP 89, P580
Scheffers M., 1988, P 7 FASE S, P981
Schwartz R., 1980, P ICASSP 80 IEEE, P32
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
Soong F., 1989, P ICASSP, P584
Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816
Tokuda K., 1995, P EUROSPEECH, P757
TOKUDA K, 1995, INT CONF ACOUST SPEE, P660, DOI 10.1109/ICASSP.1995.479684
TOKUDA K, 1999, ACOUST SPEECH SIG PR, P229
TOKUDA K, 1998, ACOUST SPEECH SIG PR, P609
NR 18
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 384
EP 392
DI 10.1016/j.specom.2011.10.002
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100005
ER
PT J
AU de Lima, AA
Prego, TD
Netto, SL
Lee, B
Said, A
Schafer, RW
Kalker, T
Fozunbal, M
AF de Lima, Amaro A.
Prego, Thiago de M.
Netto, Sergio L.
Lee, Bowon
Said, Amir
Schafer, Ronald W.
Kalker, Ton
Fozunbal, Majid
TI On the quality-assessment of reverberated speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech quality evaluation; Reverberation assessment; Intrusive approach
ID ENERGY RATIO; ACOUSTICS
AB This paper addresses the problem of quantifying the reverberation effect in speech signals. The perception of reverberation is assessed based on a new measure combining the characteristics of reverberation time, room spectral variance, and direct-to-reverberant energy ratio, which are estimated from the associated room impulse response (RIR). The practical aspects behind a robust RIR estimation are underlined, allowing an effective feature extraction for reverberation evaluation. The resulting objective metric achieves a correlation factor of about 90% with the subjective scores of two distinct speech databases, illustrating the system's ability to assess the reverberation effect in a reliable manner. (C) 2011 Elsevier B.V. All rights reserved.
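Two of the three RIR-derived quantities named above, the direct-to-reverberant energy ratio and the reverberation time, can be estimated as sketched below. The direct-path window length and the decay-fit range are illustrative choices, and the way the article combines the features (including room spectral variance) into its final metric is not reproduced.

```python
import numpy as np

def direct_to_reverberant_db(rir, fs, direct_ms=2.5):
    """Direct-to-reverberant energy ratio: energy up to a few ms after the
    main peak of the RIR versus the energy of the remaining tail."""
    peak = int(np.argmax(np.abs(rir)))
    split = peak + int(direct_ms * 1e-3 * fs)
    e_direct = np.sum(rir[:split] ** 2)
    e_reverb = np.sum(rir[split:] ** 2) + 1e-12
    return 10 * np.log10(e_direct / e_reverb)

def reverberation_time(rir, fs, decay_db=(-5.0, -25.0)):
    """RT60 estimate from the Schroeder backward-integrated energy decay curve,
    fitting the decay over decay_db and extrapolating to -60 dB."""
    edc = np.cumsum(rir[::-1] ** 2)[::-1]
    edc_db = 10 * np.log10(edc / edc[0] + 1e-12)
    hi, lo = decay_db
    idx = np.where((edc_db <= hi) & (edc_db >= lo))[0]
    slope, _ = np.polyfit(idx / fs, edc_db[idx], 1)    # dB per second
    return -60.0 / slope

if __name__ == "__main__":
    fs = 16000
    t = np.arange(0, 0.5, 1.0 / fs)
    rng = np.random.default_rng(3)
    rir = rng.normal(size=t.size) * np.exp(-t / 0.05)  # synthetic exponential tail
    rir[0] = 5.0                                       # pronounced direct path
    print("DRR  %.1f dB" % direct_to_reverberant_db(rir, fs))
    print("RT60 %.2f s" % reverberation_time(rir, fs))
```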
C1 [de Lima, Amaro A.] Fed Ctr Technol Educ Celso Suckow da Fonseca CEFE, Nova Iguacu, RJ, Brazil.
[de Lima, Amaro A.; Prego, Thiago de M.; Netto, Sergio L.] Univ Fed Rio de Janeiro, COPPE, Program Elect Engn, BR-21945 Rio De Janeiro, RJ, Brazil.
[Lee, Bowon; Said, Amir; Schafer, Ronald W.; Kalker, Ton; Fozunbal, Majid] Hewlett Packard Labs, Palo Alto, CA 94304 USA.
RP de Lima, AA (reprint author), Fed Ctr Technol Educ Celso Suckow da Fonseca CEFE, Nova Iguacu, RJ, Brazil.
EM amaro@lps.ufrj.br; thprego@lps.ufrj.br; sergioln@lps.ufrj.br;
bowon.lee@hp.com; amir_said@hp.com; ron.schafer@hp.com
CR Allen J. B., 1982, J ACOUSTIC SOC AM, V71
ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599
[Anonymous], 2004, P563 ITUT, P563
[Anonymous], 2001, P862 ITUT
Berkley D. A., 1993, ACOUSTICAL FACTORS A
Cole D., 1994, P INT S SPEECH IM PR, P241, DOI 10.1109/SIPNN.1994.344922
de Lima A. A., 2009, P INT WORKSH MULT SI
de Lima A. A., 2008, P INT C SIGN PROC MU, P257
Falk T., 2010, IEEE T AUDIO SPEECH, V18
Figueiredo F.L., 2005, P INT C EXP NOIS CON
Gardner WG, 1998, SPRING INT SER ENG C, P85
Goetze S., 2010, P IEEE INT C AC SPEE
Griesinger D., 2009, 157 M AC SOC AM PORT
ITU- T Rec, 1996, P800 ITUT
ITU-T, 2005, P8622 ITUT
JETZT JJ, 1979, J ACOUST SOC AM, V65, P1204, DOI 10.1121/1.382786
Jeub M., 2009, P 16 INT C DIG SIGN
Jot J.-M., 1991, P 90 CONV AM ENG SOC
Karjalainen M., 2001, P CONV AUD ENG SOC A, P867
Kay S. M., 1993, FUNDAMENTALS STAT SI
Kuster M, 2008, J ACOUST SOC AM, V124, P982, DOI 10.1121/1.2940585
Kuttruff H., 2000, ROOM ACOUSTICS
Kuttruff H., 2007, ACOUSTICS INTRO
Larsen E, 2008, J ACOUST SOC AM, V124, P450, DOI 10.1121/1.2936368
SCHROEDE.MR, 1965, J ACOUST SOC AM, V37, P409, DOI 10.1121/1.1909343
Wen J. Y. C., 2006, P IEEE INT WORKSH AC
Wen J.Y.C., 2006, P EUR SIGN PROC C FL
Zahorik P, 2002, J ACOUST SOC AM, V112, P2110, DOI 10.1121/1.1506692
Zahorik P, 2002, J ACOUST SOC AM, V111, P1832, DOI 10.1121/1.1458027
Zielinski S, 2008, J AUDIO ENG SOC, V56, P427
NR 30
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 393
EP 401
DI 10.1016/j.specom.2011.10.003
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100006
ER
PT J
AU Dai, P
Soon, IY
AF Dai, Peng
Soon, Ing Yann
TI A temporal frequency warped (TFW) 2D psychoacoustic filter for robust
speech recognition system
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic speech recognition; 2D mask; Simultaneous masking; Temporal
masking; Temporal frequency warping; Temporal integration
ID LATERAL INHIBITION; MASKING
AB In this paper, a novel hybrid feature extraction algorithm is proposed, which implements forward masking, lateral inhibition, and temporal integration with a simple 2D psychoacoustic filter. The proposed algorithm consists of two key parts, the 2D psychoacoustic filter and cepstral mean variance normalization (CMVN). Mathematical derivation is provided to show the correctness of the 2D psychoacoustic filter based on the characteristic functions of masking effects. The effectiveness of the proposed algorithm is tested on the AURORA2 database. Extensive comparison is made against lateral inhibition (LI), forward masking (FM), CMVN, RASTA filter, the ETSI standard advanced front-end feature extraction algorithm (AFE), and the temporal warped 2D psychoacoustic filter. Experimental results show significant improvements from the proposed algorithm, a relative improvement of nearly 46.78% over the baseline mel-frequency cepstral coefficients (MFCC) system in noisy conditions. (C) 2011 Elsevier B.V. All rights reserved.
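Of the two key parts named above, cepstral mean and variance normalization is the simpler one and is sketched below; the 2D psychoacoustic filtering itself operates on the spectro-temporal representation and is not reproduced here.

```python
import numpy as np

def cmvn(features):
    """Cepstral mean and variance normalization: per-utterance zero mean and
    unit variance for every feature dimension (frames in rows)."""
    mu = features.mean(axis=0, keepdims=True)
    sigma = features.std(axis=0, keepdims=True) + 1e-8
    return (features - mu) / sigma

if __name__ == "__main__":
    rng = np.random.default_rng(4)
    mfcc = rng.normal(loc=3.0, scale=2.0, size=(200, 13))   # 200 frames x 13 coeffs
    norm = cmvn(mfcc)
    print(norm.mean(axis=0).round(3), norm.std(axis=0).round(3))
```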
C1 [Dai, Peng; Soon, Ing Yann] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore.
RP Dai, P (reprint author), Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore.
EM daip0001@e.ntu.edu.sg; eiysoon@ntu.edu.sg
CR Brookes M., 1997, VOICEBOX
CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427
Dai P., 2010, SPEECH COMMUN, V53, P229
Dai P., 2009, P ICICS DEC
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
ETSI, 2007, 202050 ETSI ES
Gold B., 2000, SPEECH AUDIO SIGNAL
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2
Hirsch H., 2000, P ISCA ITRW ASR, V181, P188
HOUTGAST T, 1972, J ACOUST SOC AM, V51, P1885, DOI 10.1121/1.1913048
Ishizuka K, 2010, SPEECH COMMUN, V52, P41, DOI 10.1016/j.specom.2009.08.003
JESTEADT W, 1982, J ACOUST SOC AM, V71, P950, DOI 10.1121/1.387576
Luo X., 2008, P ICALIP, V1105, P1109
Oxenham AJ, 2001, J ACOUST SOC AM, V109, P732, DOI 10.1121/1.1336501
Oxenham AJ, 2000, HEARING RES, V150, P258, DOI 10.1016/S0378-5955(00)00206-9
Park KY, 2003, NEUROCOMPUTING, V52-4, P615, DOI 10.1016/S0925-2312(02)00791-9
SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1622, DOI 10.1121/1.392800
NR 18
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 402
EP 413
DI 10.1016/j.specom.2011.10.004
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100007
ER
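The Dai and Soon record above builds its front end from a 2D psychoacoustic filter followed by cepstral mean and variance normalization (CMVN). The sketch below illustrates only the generic CMVN step on a frames-by-coefficients feature matrix; the array shapes and the random stand-in "MFCC" data are assumptions for illustration, and the paper's 2D filter itself is not reproduced.

import numpy as np

def cmvn(features, eps=1e-10):
    """Cepstral mean and variance normalization.

    features: array of shape (num_frames, num_coeffs), e.g. MFCC frames.
    Returns features normalized to zero mean and unit variance per
    coefficient, computed over the whole utterance.
    """
    mean = features.mean(axis=0)
    std = features.std(axis=0)
    return (features - mean) / (std + eps)

# Random placeholder "MFCC" frames (200 frames, 13 coefficients).
mfcc = 3.0 * np.random.randn(200, 13) + 1.5
normalized = cmvn(mfcc)
print(normalized.mean(axis=0).round(6))  # approximately 0 per coefficient
print(normalized.std(axis=0).round(6))   # approximately 1 per coefficient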
PT J
AU Grichkovtsova, I
Morel, M
Lacheret, A
AF Grichkovtsova, Ioulia
Morel, Michel
Lacheret, Anne
TI The role of voice quality and prosodic contour in affective speech
perception
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech perception; Affective prosody; Voice quality; Prosodic contour;
Speech synthesis; Prosody transplantation paradigm; Attitudes; Emotions;
French
ID RECOGNIZING EMOTIONS; FOREIGN-LANGUAGE; VOCAL EXPRESSION; SYNTHETIC
SPEECH; SIMULATION
AB We explore the use of voice quality and prosodic contour in the identification of emotions and attitudes in French. For this purpose, we develop a corpus of affective speech based on one lexically neutral utterance and apply the prosody transplantation method in our perception experiment. We apply logistic regression to analyze our categorical data, and we observe differences in the identification of these two affective categories. Listeners primarily use the prosodic contour in the identification of the studied attitudes. Emotions are identified on the basis of both voice quality and prosodic contour. However, their use is not homogeneous within individual emotions. Depending on the stimuli, listeners may use both voice quality and prosodic contour, or privilege just one of them, for the successful identification of emotions. The results of our study are discussed in view of their importance for speech synthesis. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Grichkovtsova, Ioulia; Morel, Michel] Univ Caen Basse Normandie, Lab CRISCO, EA 4255, F-14032 Caen, France.
[Lacheret, Anne] Univ Paris Ouest Nanterre Def, CNRS, Lab MODYCO, UFR LLPHI,Dept Sci Langage,UMR 7114, F-92001 Nanterre, France.
[Lacheret, Anne] Inst Univ France, F-75005 Paris, France.
RP Grichkovtsova, I (reprint author), Univ Caen Basse Normandie, Lab CRISCO, EA 4255, F-14032 Caen, France.
EM grichkovtsova@hotmail.com
FU Conseil Regional de Basse-Normandie
FX We acknowledge the financial support from the Conseil Regional de
Basse-Normandie, under the research funding program "Transfert de
Technologie". I.G. is pleased to thank Dr. Henni Ouerdane for careful
reading of the manuscript, useful discussions and assistance with
statistics.
CR Alessandro C., 2006, METHOD EMPIRICAL PRO, P63
Auberge V, 2003, SPEECH COMMUN, V40, P87, DOI 10.1016/S0167-6393(02)00077-8
Baayen R. H., 2004, MENTAL LEXICON WORKI, V1, P1
Bachorowski JA, 1999, CURR DIR PSYCHOL SCI, V8, P53, DOI 10.1111/1467-8721.00013
Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016
Barkhuysen P, 2010, LANG SPEECH, V53, P3, DOI 10.1177/0023830909348993
Beck S, 2009, J SEMANT, V26, P159, DOI 10.1093/jos/ffp001
Campbell N., 2006, P 15 INT C PHON SCI, P2417
Chen A.J., 2005, THESIS
Dromey C, 2005, SPEECH COMMUN, V47, P351, DOI 10.1016/j.specom.2004.09.010
Dutoit T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1393
Ekman P., 1999, HDB COGNITION EMOTIO
Elfenbein HA, 2003, CURR DIR PSYCHOL SCI, V12, P159, DOI 10.1111/1467-8721.01252
Erickson D., 2010, SPEECH PROSODY 2010
Garcia M.N., 2006, P INT C LANG RES EV, P307
Ghio A., 2007, TIPA, V22, P115
Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1
Grichkovtsova I., 2007, P INT
Grichkovtsova I., 2009, ROLE PROSODY AFFECTI, P371
Hancil Sylvie, 2009, ROLE PROSODY AFFECTI
Hox J., 2002, MULTILEVEL ANAL TECH
Izard C.E., 1971, FACE EMOTION
Jaeger TF, 2008, J MEM LANG, V59, P434, DOI 10.1016/j.jml.2007.11.007
Johnson-Laird P. N., 1987, COGNITION EMOTION, V1, P29, DOI [10.1080/02699938708408362, DOI 10.1080/02699938708408362]
Johnstone T., 2000, HDB EMOTIONS, V2nd, P220
Hammerschmidt Kurt, 2007, J Voice, V21, P531, DOI 10.1016/j.jvoice.2006.03.002
Lakshminarayanan K, 2003, BRAIN LANG, V84, P250, DOI 10.1016/S0093-934X(02)00516-3
Laver J., 1994, P INT C SPEECH PROS
Laver J, 1980, PHONETIC DESCRIPTION
Laver John, 1994, PRINCIPLES PHONETICS
Moineddin R., 2007, CAHIERS I LINGUISTIQ, V7, P1, DOI 10.2143/CILL.30.1.519219
Moineddin R., 2007, BMC MED RES METHODOL, V7, P1, DOI DOI 10.1186/1471-2288-7-34
Morel M., 2004, TRAITEMENT AUTOMATIQ, V30, P207
Murray IR, 2008, COMPUT SPEECH LANG, V22, P107, DOI 10.1016/j.csl.2007.06.001
MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558
ORTONY A, 1990, PSYCHOL REV, V97, P315, DOI 10.1037//0033-295X.97.3.315
Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005
Pell MD, 2009, J NONVERBAL BEHAV, V33, P107, DOI 10.1007/s10919-008-0065-7
PLUTCHIK R, 1993, HDB EMOTIONS, DOI DOI 10.1016/J.WOCN.2009.07.005
Power M., 2008, COGNITION EMOTION OR
Prudon R., 2004, TEXT TO SPEECH SYNTH, P203
R Development Core Team, 2008, R LANG ENV STAT COMP
Rilliard A, 2009, LANG SPEECH, V52, P223, DOI 10.1177/0023830909103171
RODERO E, 2011, J VOICE, V25, P25, DOI DOI 10.1177/0023830909103171
Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
SCHRODER M, 2008, EMOTIONS HUMAN VOICE, P307
Schroder M., 2008, P INT C SPEECH PROS, P307
Shochi T., 2009, ROLE PROSODY AFFECTI, P31
Thompson WF, 2004, EMOTION, V4, P46, DOI 10.1037/1528-3542.4.1.46
WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238
YANUSHEVSKAYA I, 2006, P 3 INT C SPEECH PRO
NR 53
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 414
EP 429
DI 10.1016/j.specom.2011.10.005
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100008
ER
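The Grichkovtsova, Morel and Lacheret record above analyzes categorical identification responses with logistic regression. The sketch below is a simplified analogue in Python on simulated data; the predictor coding, the simulated effect sizes, and the use of plain (rather than mixed-effects) logistic regression are assumptions for illustration only.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 400

# Hypothetical predictors: whether the prosodic contour and the voice
# quality of the stimulus match the target affect (coded 1 for a match).
contour_match = rng.integers(0, 2, n)
voice_quality_match = rng.integers(0, 2, n)

# Hypothetical binary outcome: 1 if the listener identified the target
# affect, simulated so that the contour matters more than voice quality.
logit = -1.0 + 2.0 * contour_match + 0.7 * voice_quality_match
p = 1.0 / (1.0 + np.exp(-logit))
correct = (rng.random(n) < p).astype(int)

X = np.column_stack([contour_match, voice_quality_match])
model = LogisticRegression().fit(X, correct)
print("coefficients (contour, voice quality):", model.coef_[0])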
PT J
AU Rudzicz, F
AF Rudzicz, Frank
TI Using articulatory likelihoods in the recognition of dysarthric speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Dysarthria; Speech recognition; Acoustic-articulatory inversion;
Task-dynamics
ID GAUSSIAN MIXTURE MODEL; RECOVERING ARTICULATION; ACOUSTICS
AB Millions of individuals have congenital or acquired neuro-motor conditions that limit control of their muscles, including those that manipulate the vocal tract. These conditions, collectively called dysarthria, result in speech that is very difficult to understand for both human listeners and traditional automatic speech recognition (ASR) systems, which in some cases are rendered completely unusable.
In this work we first introduce a new method for acoustic-to-articulatory inversion which estimates positions of the vocal tract given acoustics using a nonlinear Hammerstein system. This is accomplished based on the theory of task-dynamics using the TORGO database of dysarthric articulation. Our approach uses adaptive kernel canonical correlation analysis and is found to be significantly more accurate than mixture density networks, at or above the 95% level of confidence for most vocal tract variables.
Next, we introduce a new method for ASR in which acoustic-based hypotheses are re-evaluated according to the likelihoods of their articulatory realizations in task-dynamics. This approach incorporates high-level, long-term aspects of speech production and is found to be significantly more accurate than hidden Markov models, dynamic Bayesian networks, and switching Kalman filters. (C) 2011 Elsevier B.V. All rights reserved.
C1 Univ Toronto, Dept Comp Sci, Toronto, ON, Canada.
RP Rudzicz, F (reprint author), Univ Toronto, Dept Comp Sci, Toronto, ON, Canada.
EM frank@cs.toronto.edu
CR Ananthakrishnan G., 2009, P INT 2009 BRIGHT UK
Aschbacher E., 2005, P 13 STAT SIGN PROC
Bahr RH, 2005, TOP LANG DISORD, V25, P254
Browman Catherine, 1986, PHONOLOGY YB, V3, P219
Deng J., 2005, AC SPEECH SIGN PROC, V1, P1121
Dogil G., 1998, PHONOLOGY, V15
Enderby P. M., 1983, FRENCHAY DYSARTHRIA
Friedland B., 2005, CONTROL SYSTEM DESIG
Fukuda T., 2003, P ICASSP 03, V2, P25
Goldstein L, 2006, ACTION TO LANGUAGE VIA THE MIRROR NEURON SYSTEM, P215, DOI 10.1017/CBO9780511541599.008
Goldstein L.M., 2003, ARTICULATORY PHONOLO
Hasegawa-Johnson M., 2007, INT SPEECH LEXICON P
Hawley MS, 2007, MED ENG PHYS, V29, P586, DOI 10.1016/j.medengphy.2006.06.009
Hogden J, 2007, SPEECH COMMUN, V49, P361, DOI 10.1016/j.specom.2007.02.008
Havstam Christina, 2003, Logoped Phoniatr Vocol, V28, P81, DOI 10.1080/14015430310015372
King S, 2007, J ACOUST SOC AM, V121, P723, DOI 10.1121/1.2404622
Kirchhoff K., 1999, THESIS U BIELEFELD G
Lai P L, 2000, Int J Neural Syst, V10, P365, DOI 10.1016/S0129-0657(00)00034-X
LEE LJ, 2001, ACOUST SPEECH SIG PR, P797
Levelt WJM, 1999, BEHAV BRAIN SCI, V22, P1
Livescu Karen, 2007, P INT C AC SPEECH SI
Matsumasa Hironori, 2009, Journal of Multimedia, V4, DOI 10.4304/jmm.4.4.254-261
Menendez-Pidal X., 1996, P 4 INT C SPOK LANG
Metze F, 2007, SPEECH COMMUN, V49, P348, DOI 10.1016/j.specom.2007.02.009
Morales S. O. C., 2009, EURASIP J ADV SIGNAL
Murphy Kevin P., 1998, SWITCHING KALMAN FIL
Murphy K.P., 2002, THESIS U CALIFORNIA
Nam H., 2006, TADA TASK DYNAMICS A
Nam H, 2003, P 15 INT C PHON SCI, P2253
Ozbek IY, 2011, IEEE T AUDIO SPEECH, V19, P1180, DOI 10.1109/TASL.2010.2087751
Polur PD, 2006, MED ENG PHYS, V28, P741, DOI 10.1016/j.medengphy.2005.11.002
Raghavendra P., 2001, AUGMENTATIVE ALTERNA, V17, P265, DOI 10.1080/714043390
Ramsay J. O., 2005, FITTING DIFFERENTIAL, P327
Reimer M., 2010, P INT 2010 MAK JAP, P1608
Richardson M., 2000, P ICSLP BEIJ CHIN, P131
Richmond K, 2003, COMPUT SPEECH LANG, V17, P153, DOI 10.1016/S0885-2308(03)00005-6
Rosen K., 2000, AUGMENTATIVE ALTERNA, V16, P48, DOI DOI 10.1080/07434610012331278904
Rudzicz F., 2010, P 48 ANN M ASS COMP
Rudzicz F., 2008, P 8 INT SEM SPEECH P
RUDZICZ F, 2007, P 9 INT ACM SIGACCES, P255, DOI DOI 10.1145/1296843.1296899
Rudzicz F., 2011, THESIS U TORONTO TOR
Rudzicz Frank, 2010, P 2010 IEEE INT C AC
SAKOE H, 1978, IEEE T ACOUST SPEECH, V26, P43, DOI 10.1109/TASSP.1978.1163055
Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2
Saltzman E.M., 1986, TASK DYNAMIC COORDIN, P129
Smith A., 2004, SPEECH MOTOR CONTROL, P227
Stevens KN, 2010, J PHONETICS, V38, P10, DOI 10.1016/j.wocn.2008.10.004
Sun JP, 2002, J ACOUST SOC AM, V111, P1086, DOI 10.1121/1.1420380
Toda T, 2008, SPEECH COMMUN, V50, P215, DOI 10.1016/j.specom.2007.09.001
Vaerenbergh S.V., 2008, EURASIP J ADV SIG PR, V8, P1
Vaerenbergh S.V., 2006, P 2006 IEEE INT C AC
Vaerenbergh S.V., 2006, P 2006 INT JOINT C N, P1198
Van Lieshout P., 2008, ASIA PACIFIC J SPEEC, V11, P283
Wrench A., 1999, MOCHA TIMIT ARTICULA
Wrench Alan, 2000, P INT C SPOK LANG PR
Yorkston K. M., 1981, ASSESSMENT INTELLIGI
Yunusova Y, 2009, J SPEECH LANG HEAR R, V52, P547, DOI 10.1044/1092-4388(2008/07-0218)
Zheng WM, 2006, IEEE T NEURAL NETWOR, V17, P233, DOI 10.1109/TNN.2005.860849
NR 58
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 430
EP 444
DI 10.1016/j.specom.2011.10.006
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100009
ER
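The Rudzicz record above re-evaluates acoustic recognition hypotheses according to the likelihoods of their articulatory realizations. The sketch below shows only the generic idea of log-linear re-scoring of an N-best list; the scores, the interpolation weight, and the word sequences are made up for illustration, and the paper's task-dynamic and kernel CCA models are not reproduced.

def rescore(hypotheses, weight=0.3):
    """Combine acoustic and articulatory log-likelihoods per hypothesis.

    hypotheses: list of (words, acoustic_loglik, articulatory_loglik).
    weight: interpolation weight on the articulatory term (illustrative).
    Returns the hypotheses sorted by combined score, best first.
    """
    scored = [(words, (1.0 - weight) * ac + weight * art)
              for words, ac, art in hypotheses]
    return sorted(scored, key=lambda h: h[1], reverse=True)

# Hypothetical 3-best list: the acoustically best hypothesis has an
# implausible articulatory realization and is demoted after re-scoring.
nbest = [("bat the dog", -118.0, -140.0),
         ("pat the dog", -120.0, -95.0),
         ("pad the dog", -125.0, -100.0)]
for words, score in rescore(nbest):
    print(f"{score:9.2f}  {words}")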
PT J
AU Jeon, JH
Liu, Y
AF Jeon, Je Hun
Liu, Yang
TI Automatic prosodic event detection using a novel labeling and selection
method in co-training
SO SPEECH COMMUNICATION
LA English
DT Article
DE Prosodic event; ToBI; Pitch accent; Break index; Intonational phrase
boundary; Co-training
ID FEATURES
AB Most previous approaches to automatic prosodic event detection are based on supervised learning, relying on the availability of a corpus that is annotated with the prosodic labels of interest in order to train the classification models. However, creating such resources is an expensive and time-consuming task. In this paper, we exploit semi-supervised learning with the co-training algorithm for automatic detection of a coarse-level representation of prosodic events such as pitch accents, intonational phrase boundaries, and break indices. Since co-training works on the condition that the views are compatible and uncorrelated, and real data often do not satisfy these conditions, we propose a method to label and select examples in co-training. In our experiments on the Boston University radio news corpus, when using only a small amount of labeled data as the initial training set, our proposed labeling method can effectively use unlabeled data to improve performance, ultimately reaching performance close to that of the supervised method trained with more labeled data. We perform a thorough analysis of various factors impacting the learning curves, including the labeling error rate and informativeness of the added examples, the performance of the individual classifiers and their difference, and the initial and added data sizes. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Jeon, Je Hun; Liu, Yang] Univ Texas Dallas, Dept Comp Sci, Richardson, TX 75083 USA.
RP Jeon, JH (reprint author), Univ Texas Dallas, Dept Comp Sci, Richardson, TX 75083 USA.
EM jhjeon@hlt.utdallas.edu; yangl@hlt.utdallas.edu
FU US Air Force Office of Scientific Research [FA9550-10-1-0388]
FX This work is partly supported by an award from the US Air Force Office
of Scientific Research, FA9550-10-1-0388.
CR Ananthakrishnan S., 2007, P ICASSP, P65
Ananthakrishnan S, 2008, IEEE T AUDIO SPEECH, V16, P216, DOI 10.1109/TASL.2007.907570
Ananthakrishnan S., 2006, P INT C SPOK LANG PR, P297
Balcan MF, 2005, ADV NEURAL INFORM PR, V17, P89
Bartlett S., 2009, P NAACL HLT, P308, DOI 10.3115/1620754.1620799
Blum A., 1998, Proceedings of the Eleventh Annual Conference on Computational Learning Theory, DOI 10.1145/279943.279962
Boersma P., 2001, GLOT INT, V5, P341
Brants T., 2000, P 6 C APPL NAT LANG, P224, DOI 10.3115/974147.974178
Chen K., 2004, P ICASSP, P1509
Chen K, 2006, IEEE T AUDIO SPEECH, V14, P232, DOI 10.1109/TSA.2005.853208
Clark S., 2003, P CONLL EDM CAN, P49
Dasgupta S, 2002, ADV NEUR IN, V14, P375
Dehak N, 2007, IEEE T AUDIO SPEECH, V15, P2095, DOI 10.1109/TASL.2007.902758
Goldman S.A., 2000, P 17 INT C MACH LEAR, P327
Grabe E., 2003, P SASRTLM, P45
Gregory Michelle L., 2004, P 42 M ASS COMP LING, P677, DOI 10.3115/1218955.1219041
Gus U., 2007, P INT, P2597
HIRSCHBERG J, 1993, ARTIF INTELL, V63, P305, DOI 10.1016/0004-3702(93)90020-C
Jeon J. H., 2009, P ACL IJCNLP, P540, DOI DOI 10.3115/1690219.1690222]
Jeon J. H., 2010, P INT, P1772
Jeon J.H., 2011, ACL HLT, P732
Jeon J.-H., 2009, P ICASSP, P4565
Kudoh T., 2000, P CONLL 2000 LLL 200, P142
Levow G. A., 2008, P IJCNLP, P217
Levow G.-A., 2006, P HLT NAACL, P224, DOI 10.3115/1220835.1220864
Lin HT, 2007, MACH LEARN, V68, P267, DOI 10.1007/s10994-007-5018-6
Muslea I, 2002, P 19 INT C MACH LEAR, P435
Muslea I., 2000, Proceedings Seventeenth National Conference on Artificial Intelligence (AAAI-2000). Twelfth Innovative Applications of Artificial Intelligence Conference (IAAI-2000)
Nakatani C.H., 1995, WORK NOT AAAI 95 SPR, P106
Nenkova A, 2007, P HLT ACL ROCH NY AP, P9
Nigam K., 2000, Proceedings of the Ninth International Conference on Information and Knowledge Management. CIKM 2000, DOI 10.1145/354756.354805
Price P., 1991, P HLT, P372, DOI 10.3115/112405.112738
Rosenberg A, 2007, P INT, P2777
Shriberg E., 1998, LANG SPEECH, V41, P439
Silverman K., 1992, P INT C SPOK LANG PR, P867
Sridhar VKR, 2008, IEEE T AUDIO SPEECH, V16, P797, DOI 10.1109/TASL.2008.917071
Steedman M., 2003, CLSP WS 02 FINAL REP
Tur G., 2003, P EUR, P2793
Wang W., 2007, P ICASSP, V4, pIV137
Wightman CW, 1994, IEEE T SPEECH AUDI P, V2, P469, DOI 10.1109/89.326607
Xie S., 2011, P INT, P2522
NR 41
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 445
EP 458
DI 10.1016/j.specom.2011.10.008
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100010
ER
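The Jeon and Liu record above relies on co-training: two classifiers trained on different feature views repeatedly label unlabeled examples, and reliable examples are added to the training set. The sketch below is a generic co-training loop over synthetic two-view data; the confidence-sum selection rule and all data are illustrative assumptions and are simpler than the labeling and selection method proposed in the paper.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

# Synthetic two-view data: view A and view B are noisy copies of the label.
n, n_labeled = 1000, 50
y = rng.integers(0, 2, n)
view_a = y[:, None] + rng.normal(0, 1.0, (n, 3))
view_b = y[:, None] + rng.normal(0, 1.2, (n, 3))

labeled = np.arange(n_labeled)           # initially labeled pool
unlabeled = np.arange(n_labeled, n)      # unlabeled pool

# Pseudo-label array: true labels are used only for the seed set; examples
# added later get classifier predictions written here before being used.
pseudo_y = y.copy()

clf_a, clf_b = LogisticRegression(), LogisticRegression()
for _ in range(5):                       # a few co-training rounds
    clf_a.fit(view_a[labeled], pseudo_y[labeled])
    clf_b.fit(view_b[labeled], pseudo_y[labeled])
    if len(unlabeled) == 0:
        break
    # Each classifier scores the unlabeled pool; keep the most confident ones.
    proba_a = clf_a.predict_proba(view_a[unlabeled]).max(axis=1)
    proba_b = clf_b.predict_proba(view_b[unlabeled]).max(axis=1)
    confident = np.argsort(-(proba_a + proba_b))[:20]
    newly = unlabeled[confident]
    # Label the selected examples with classifier A's predictions (a simple
    # stand-in for the agreement-based labeling described in the paper).
    pseudo_y[newly] = clf_a.predict(view_a[newly])
    labeled = np.concatenate([labeled, newly])
    unlabeled = np.delete(unlabeled, confident)

print("accuracy of view-A classifier:", clf_a.score(view_a, y).round(3))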
PT J
AU Adell, J
Escudero, D
Bonafonte, A
AF Adell, Jordi
Escudero, David
Bonafonte, Antonio
TI Production of filled pauses in concatenative speech synthesis based on
the underlying fluent sentence
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech synthesis; Conversational speech; Talking speech synthesiser;
Filled pause; Disfluency; Underlying fluent sentence; Prosody; Ogmios;
Perceptual evaluation
ID UM; UH; DISFLUENCIES; COMPUTERS; SPEAKING; REPAIR; CORPUS
AB Until now, speech synthesis has mainly involved reading-style speech. Today, however, text-to-speech systems must provide a variety of styles because users expect these interfaces to do more than just read information. If synthetic voices are to be integrated into future technology, they must simulate the way people talk instead of the way people read. Existing knowledge about how disfluencies occur has made it possible to propose a general framework for synthesising disfluencies. We propose a model based on the definition of disfluency and the concept of the underlying fluent sentence. The model incorporates the parameters of standard prosodic models for fluent speech with local modifications of prosodic parameters near the interruption point. The constituents of the local models for filled pauses are derived from the analysis corpus, and the constituents' prosodic parameters are predicted via linear regression analysis. We also discuss the implementation details of the model when used in a real speech synthesis system. Objective and perceptual evaluations showed that the proposed models outperformed the baseline model. Perceptual evaluations of the system showed that it is possible to synthesise filled pauses without decreasing the overall naturalness of the system, and users stated that the speech produced is even more natural than that produced without filled pauses. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Adell, Jordi; Bonafonte, Antonio] Univ Politecn Cataluna, Barcelona, Spain.
[Escudero, David] Univ Valladolid, Valladolid, Spain.
RP Adell, J (reprint author), Univ Politecn Cataluna, Barcelona, Spain.
EM jordi.adell@upc.edu
RI Escudero, David/K-7905-2014
OI Escudero, David/0000-0003-0849-8803
CR Aaron A, 2005, SCI AM, V292, P64
Adell J., 2006, P 3 INT C SPEECH PRO
Adell J., 2010, P ISCA SPEECH PROS C
Adell J., 2007, LECT NOTES ARTIF INT, V1, P358
Adell J., 2010, P ICASSP DALL US
Adell J., 2008, P INT BRISB AUSTR, P2278
Adell J., 2005, P INT C AC SPEECH SI
Agiiero P.D., 2004, P 5 ISCA SPEECH SYNT
Andersson S., 2010, P INT C SPEECH PROS
Campbell N, 2007, TEXT SPEECH LANG TEC, V37, P29
Arslan L., 1998, P ICASSP SEATL US
Batliner A., 1995, P 13 INT C PHON SCI, V3, P472
Bennett C. L., 2005, P INT EUR, P105
Bernstein J., 1991, HLT 91, P423
Boersma P., 2005, PRAAT DOING PHONETIC
Bonafonte A., 2006, P TC STAR WORKSH SPE
Bonafonte A., 2005, TTS PROGR REPORT DEL
Bonafonte A., 2008, P 6 LANG RES EV C LR
Bonafonte A., 1998, P 8 JORN TEL I D TEL
Campbell N., 1998, P 3 ESCA COCOSDA WOR, P17
Carlson R., 2006, P INT, P1300
Chung H., 2002, P INT C SPEECH PROS
Clark H. H., 1996, USING LANGUAGE
Clark HH, 2002, COGNITION, V84, P73, DOI 10.1016/S0010-0277(02)00017-3
Clark HH, 2002, SPEECH COMMUN, V36, P5, DOI 10.1016/S0167-6393(01)00022-X
Clark R.A.J., 2007, P 3 BLIZZ CHALL BONN
Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014
Clinton Hillary Rodham, 2003, LIVING HIST
Conejero D., 2003, P EUR GEN SWITZ
Corder G. W., 2009, NONPARAMETRIC STAT N
Dahl D.A., 1994, P ARPA WORKSH HUM LA, P43, DOI 10.3115/1075812.1075823
Dusterhoff K.E., 1999, P EUR
Eide E, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P127, DOI 10.1109/WSS.2002.1224388
Erro D, 2010, IEEE T AUDIO SPEECH, V18, P974, DOI 10.1109/TASL.2009.2038658
Escudero Mancebo D., 2007, SPEECH COMMUN, V49, P213
Tree JEF, 1995, J MEM LANG, V34, P709, DOI 10.1006/jmla.1995.1032
Fraser M., 2007, P 3 BLIZZ CHALL BONN
Gabrea M., 2000, P ICSLP BEIJ, V3, P678
GARNHAM A, 1981, LINGUISTICS, V19, P805, DOI 10.1515/ling.1981.19.7-8.805
Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858
Goedertier W., 2000, P INT C LANG RES EV, P909
Hamza W., 2004, P INT C SPOK LANG PR
Hirschberg J, 2002, SPEECH COMMUN, V36, P31, DOI 10.1016/S0167-6393(01)00024-3
Iriondo I., 2007, INT C AC SPEECH SIGN, V4, P821
ITU-T.P85, 1994, ITUTP85
Kowtko J.C., 1989, P DARPA SPEECH NAT L
Krishna N. S., 2004, P ISCA WORKSH SPEECH
Lee EJ, 2010, COMPUT HUM BEHAV, V26, P665, DOI 10.1016/j.chb.2010.01.003
Levelt W. J., 1993, SPEAKING INTENTION A
Levelt W.J., 1983, J SEMANT, P205
LEVELT WJM, 1983, COGNITION, V14, P41, DOI 10.1016/0010-0277(83)90026-4
Likert R., 1932, ARCH PSYCHOL, V140, P1, DOI DOI 10.1111/J.1540-5834.2010.00585
MARCUSROBERTS HM, 1987, J EDUC STAT, V12, P383, DOI 10.3102/10769986012004383
NAKATANI CH, 1994, J ACOUST SOC AM, V95, P1603, DOI 10.1121/1.408547
O'Shaughnessy D., 1992, P ICASSP, P521, DOI 10.1109/ICASSP.1992.225857
O'Connell DC, 2004, J PSYCHOLINGUIST RES, V33, P459, DOI 10.1007/s10936-004-2666-6
O'Connell DC, 2005, J PSYCHOLINGUIST RES, V34, P555, DOI 10.1007/s10936-005-9164-3
Pakhomov S.V., 1999, P 37 ANN M ASS COMP
Perez J., 2006, P INT C LANG RES EV
Quimbo F.C.M., 1998, P INT C SPEECH LANG
ROCHESTE.SR, 1973, J PSYCHOLINGUIST RES, V2, P51, DOI 10.1007/BF01067111
Rose R., 1998, THESIS U BIRMINGHAM
Savino M., 2000, LECT NOTES ARTIF INT, P421
Shriberg E., 1997, P EUR RHOD GREEC
Shriberg E. E., 1994, THESIS U CALIFORNIA
SHRIBERG EE, 1993, PHONETICA, V50, P172
Shriberg Elizabeth E., 1999, P INT C PHON SCI ICP, V1, P619
Stolcke A, 1996, INT CONF ACOUST SPEE, P405, DOI 10.1109/ICASSP.1996.541118
Stouten F, 2006, SPEECH COMMUN, V48, P1590, DOI 10.1016/j.specom.2006.04.004
Sundaram S, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P203, DOI 10.1109/WSS.2002.1224409
Sundaram S., 2003, P EUROSPEECH GEN SWI, P1221
SVARTVIK J, 1980, LUND STUDIES ENGLISH, V56
Svartvik Jan, 1990, LONDON LUND CORPUS S
Taylor P, 2009, TEXT TO SPEECH SYNTH
Taylor P., 1998, P INT C SPEECH LANG
Tree JEF, 2001, MEM COGNITION, V29, P320
Tseng S.-C., 1999, THESIS U BIELEFELD
Umbert M., 2006, P INT C LANG RES EV
van Santen J., 1997, PROGR SPEECH SYNTHES
Vazquez Y., 2002, P INT C SPEECH LANG, P329
Watanabe M., 2005, P EUR SEPT LISB PORT, P37
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
Zhao Y., 2005, P 13 IEEE INT C NETW, P179
NR 83
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 459
EP 476
DI 10.1016/j.specom.2011.10.010
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100011
ER
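The Adell, Escudero and Bonafonte record above predicts the prosodic parameters of filled-pause constituents with linear regression on the local context around the interruption point. The sketch below fits such a regression on simulated data; the feature names (speaking rate, preceding silence, phrase position), the target (filled-pause duration), and all coefficients are hypothetical and not taken from the paper.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(2)
n = 300

# Hypothetical contextual features around the interruption point.
speaking_rate = rng.normal(5.0, 1.0, n)      # syllables per second
pause_before = rng.exponential(0.2, n)       # silent pause before the "uh" (s)
phrase_position = rng.random(n)              # relative position in the phrase

# Hypothetical target: duration of the filled pause in seconds.
duration = (0.35 - 0.02 * speaking_rate + 0.4 * pause_before
            + 0.1 * phrase_position + rng.normal(0, 0.03, n))

X = np.column_stack([speaking_rate, pause_before, phrase_position])
model = LinearRegression().fit(X, duration)

# Predict the duration to synthesise for a new context.
new_context = np.array([[4.5, 0.15, 0.6]])
print("predicted filled-pause duration (s):",
      model.predict(new_context)[0].round(3))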
PT J
AU Choi, JH
Chang, JH
AF Choi, Jae-Hun
Chang, Joon-Hyuk
TI On using acoustic environment classification for statistical model-based
speech enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Noise classification; Gaussian mixture model; DFT
ID SPECTRAL AMPLITUDE ESTIMATOR; SOFT-DECISION; NOISE; SUPPRESSION;
SUBTRACTION
AB In this paper, we present a statistical model-based speech enhancement technique using acoustic environment classification supported by a Gaussian mixture model (GMM). In the data training stage, the principal parameters of the statistical model-based speech enhancement algorithm, such as the weighting parameter in the decision-directed (DD) method, the long-term smoothing parameter of the noise estimation, and the control parameter of the minimum gain value, are uniquely set as optimal operating points according to the given noise information to ensure the best performance for each noise. These optimal operating points, which are specific to the different background noises, are estimated from composite measures, i.e., the objective quality measures showing the highest correlation with the actual quality of speech processed by noise suppression algorithms.
In the on-line, environment-aware speech enhancement step, noise classification is performed on a frame-by-frame basis using a maximum likelihood (ML)-based GMM. The speech absence probability (SAP) is used to detect speech absence periods and to update the likelihood of the GMM. According to the classified noise information for each frame, we assign the optimal values to the aforementioned three parameters for speech enhancement. We evaluated the performance of the proposed methods using objective speech quality measures and subjective listening tests under various noise environments. Our experimental results showed that the proposed method yields better performance than a conventional algorithm with fixed parameters. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Choi, Jae-Hun; Chang, Joon-Hyuk] Hanyang Univ, Sch Elect Engn, Seoul 133791, South Korea.
RP Chang, JH (reprint author), Hanyang Univ, Sch Elect Engn, Seoul 133791, South Korea.
EM jchang@hanyang.ac.kr
FU MKE/KEIT [2009-S-036-01]; National Research Foundation of Korea (NRF);
Korean Government (MEST) [NRF-2011-0009182]; Hanyang University
[HY-2011-201100000000]
FX This work was supported by the IT R&D program of MKE/KEIT
[2009-S-036-01, Development of New Virtual Machine Specification and
Technology]. This work was also supported by a National Research
Foundation of Korea (NRF) grant funded by the Korean Government (MEST)
(NRF-2011-0009182), and by the research fund of Hanyang University
(HY-2011-201100000000).
CR Akbacak M, 2007, IEEE T AUDIO SPEECH, V15, P465, DOI 10.1109/TASL.2006.881694
[Anonymous], 2002, TMS320C55X DSP LIB P
[Anonymous], 1996, TIAEIAIS127
[Anonymous], 2005, 3GPP2CR00300
[Anonymous], 2000, P862 ITUT
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283
Chang JH, 2005, IEEE T CIRCUITS-II, V52, P535, DOI 10.1109/TCSII.2005.850448
Chang JH, 2001, IEICE T INF SYST, VE84D, P1231
Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Hu Y., 2008, IEEE T AUDIO SPEECH
HU Y, 2006, P INT, P1447
Kim NS, 2000, IEEE SIGNAL PROC LET, V7, P108
Kraft F., 2005, P IEEE INT 2005 LISB, P2689
Krishnamurthy N., 2006, P INT, P1431
Ma L., 2006, ACM T SPEECH LANGUAG, V3, P1, DOI 10.1145/1149290.1149292
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394
Park YS, 2007, IEICE T COMMUN, VE90B, P2182, DOI 10.1093/ietcom/e90-b.8.2182
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
Sangwan A., 2007, P INT 2007 ANTW BELG, P2929
Sim BL, 1998, IEEE T SPEECH AUDI P, V6, P328
Sohn J., 1998, P INT C AC SPEECH SI, V1, P365
Song JH, 2008, IEEE SIGNAL PROC LET, V15, P103, DOI 10.1109/LSP.2007.911184
NR 26
TC 4
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 477
EP 490
DI 10.1016/j.specom.2011.10.009
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100012
ER
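The Choi and Chang record above classifies the acoustic environment frame by frame with maximum-likelihood GMMs and then switches three enhancement parameters to noise-specific operating points. The sketch below shows that control flow with one GMM per noise class over simulated feature vectors; the noise classes, the feature dimensionality, and the operating-point values in the lookup table are placeholders, not the values estimated in the paper.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(3)

# Train one GMM per noise class on hypothetical training features
# (e.g., spectral vectors collected under each noise type).
train = {
    "babble": rng.normal(0.0, 1.0, (500, 8)),
    "car":    rng.normal(2.0, 0.8, (500, 8)),
    "street": rng.normal(-1.5, 1.2, (500, 8)),
}
gmms = {name: GaussianMixture(n_components=4, random_state=0).fit(x)
        for name, x in train.items()}

# Hypothetical per-noise operating points: (DD weighting, noise smoothing,
# minimum gain). Values here are placeholders, not the paper's.
operating_points = {
    "babble": (0.98, 0.95, 0.10),
    "car":    (0.96, 0.98, 0.05),
    "street": (0.97, 0.96, 0.08),
}

def classify_and_select(frame_features):
    """Pick the maximum-likelihood noise class for one frame and return
    the parameter set tuned for that class."""
    scores = {name: g.score_samples(frame_features[None, :])[0]
              for name, g in gmms.items()}
    best = max(scores, key=scores.get)
    return best, operating_points[best]

frame = rng.normal(2.0, 0.8, 8)          # a frame that resembles "car" noise
noise_class, params = classify_and_select(frame)
print(noise_class, params)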
PT J
AU Kurata, G
Itoh, N
Nishimura, M
Sethy, A
Ramabhadran, B
AF Kurata, Gakuto
Itoh, Nobuyasu
Nishimura, Masafumi
Sethy, Abhinav
Ramabhadran, Bhuvana
TI Leveraging word confusion networks for named entity modeling and
detection from conversational telephone speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Named entity detection; Conversational telephone speech; Word confusion
networks; Maximum entropy model
AB Named Entity (NE) detection from Conversational Telephone Speech (CTS) is important from a business perspective. However, the results of Automatic Speech Recognition (ASR) inevitably contain errors, and this makes NE detection from CTS more difficult than from written text. One option for detecting NEs is to use a statistical NE model. In order to capture the nature of ASR errors, the NE model is usually trained with the ASR one-best results instead of manually transcribed text and is then applied to the ASR one-best results of speech that contains NEs. To make NE detection more robust to ASR errors, we propose using Word Confusion Networks (WCNs), sequences of bundled words, for both NE modeling and detection by regarding the word bundles as units instead of independent words. We realize this by clustering similar word bundles that may originate from the same word. We trained NE models that predict NE tag sequences from sequences of word bundles with the maximum entropy principle. Note that the clustering of word bundles is conducted in advance of NE modeling, and thus our proposed method can be combined with any NE modeling method. We conducted experiments using real-life call-center data. The experimental results showed that by using the WCNs, the accuracy of NE detection improved regardless of the NE modeling method. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Kurata, Gakuto; Itoh, Nobuyasu; Nishimura, Masafumi] IBM Japan Ltd, IBM Res Tokyo, Yamato, Kanagawa, Japan.
[Sethy, Abhinav; Ramabhadran, Bhuvana] IBM Corp, IBM Res TJ Watson Res Ctr, Yorktown Hts, NY USA.
RP Kurata, G (reprint author), IBM Japan Ltd, IBM Res Tokyo, Yamato, Kanagawa, Japan.
EM gakuto@jp.ibm.com
FU IBM T.J. Watson Research Center
FX We thank Stanley Chen of IBM T.J. Watson Research Center for his
support.
CR Bechet F., 2002, P INT C SPOK LANG PR, P597
Bender O., 2003, P CONLL 2003 EDM CAN, P148
Bunescu R., 2006, P EACL, V6, P9
Chen S. F., 2009, P HLT NAACL, P450, DOI 10.3115/1620754.1620820
Chen S. F., 2010, P INT, P1037
Chen SF, 1999, COMPUT SPEECH LANG, V13, P359, DOI 10.1006/csla.1999.0128
Chen SF, 2006, IEEE T AUDIO SPEECH, V14, P1596, DOI 10.1109/TASL.2006.879814
Chen S.F, 2009, P HLT NAACL, P468, DOI 10.3115/1620754.1620822
CHIEU H., 2002, P 19 INT C COMP LING, P190
Chieu H.L., 2003, P CONLL 2003, P160
Chiticariu L., 2010, P 2010 C EMP METH NA, P1002
Doddington George, 2004, P 4 INT C LANG RES E, P837
Favre B., 2005, P HLT EMNLP, P491, DOI 10.3115/1220575.1220637
Feng J., 2009, P EACL, P238, DOI 10.3115/1609067.1609093
Finkel J.R., 2005, P 43 ANN M ASS COMP, P363, DOI 10.3115/1219840.1219885
Grishman Ralph, 1996, P 16 INT C COMP LING, P466
Hakkani-Tfir D., 2003, P ICASSP, V1, P596
Hakkani-Tur D, 2006, COMPUT SPEECH LANG, V20, P495, DOI 10.1016/j.csl.2005.07.005
Hori T., 2007, P ICASSP, V4, P73
Huang J., 2001, P ACL, P298, DOI 10.3115/1073012.1073051
Huang Liang, 2009, P ACL, P522
KAZAWA H., 2002, P 19 INT C COMP LING, P1
Kubala F., 1998, P DARPA BROADC NEWS, P287
Kudo T., 2001, P 2 M N AM CHAPT ASS, P1
Kudo T., 2004, P 2004 C EMP METH NA, P230
Kurata G., 2011, P ICASSP, P5576
Mamou J., 2006, Proceedings of the Twenty-Ninth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, DOI 10.1145/1148170.1148183
Mangu L, 2000, COMPUT SPEECH LANG, V14, P373, DOI 10.1006/csla.2000.0152
Minkov E., 2005, P C HUM LANG TECHN E, P443, DOI 10.3115/1220575.1220631
Nagata M., 1994, P 15 INT C COMP LING, P201
NIST, 2008, ACE08 EV PLAN
Palmer D.D., 2001, P HUM LANG TECHN WOR, P1, DOI 10.3115/1072133.1072186
Papageorgiou C.P., 1994, P HLT WORKSH, P283, DOI 10.3115/1075812.1075875
Povey D, 2008, INT CONF ACOUST SPEE, P4057, DOI 10.1109/ICASSP.2008.4518545
Ramshaw L. A., 1995, P 3 WORKSH VER LARG, P88
Sang E.F.T.K., 1999, P 9 C EUR CHAPT ASS, P173, DOI 10.3115/977035.977059
Saraclar M., 2004, P HLT NAACL, P129
Sarikaya R., 2010, P INT, P1804
Shao J., 2007, P INT, P2405
Shen D, 2004, P 42 ANN M ASS COMP, P589, DOI 10.3115/1218955.1219030
SUDOH K, 2006, P 21 INT C COMP LING, P617, DOI 10.3115/1220175.1220253
Tjong Kim Sang E.F, 2003, P 7 C NAT LANG LEARN, V4, P142
Tjong Kim Sang E.F., 2002, P CONLL, V20, P1
Tsujii J., 2000, P 18 INT C COMP LING, P201
Xue N, 2005, NAT LANG ENG, V11, P207, DOI 10.1017/S135132490400364X
Yu S., 2001, PROCESSING NORMS MOD
Zhai L., 2004, P HLT NAACL, P37, DOI 10.3115/1613984.1613994
Zhao Y, 2005, DATA MIN KNOWL DISC, V10, P141, DOI 10.1007/s10618-005-0361-3
Zhao Y., 2002, 02014 U MINN
Zhou G., 2002, P 40 ANN M ASS COMP, P473
NR 50
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 491
EP 502
DI 10.1016/j.specom.2011.11.002
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100013
ER
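The Kurata et al. record above treats a word confusion network (WCN) as a sequence of word bundles and clusters similar bundles before named-entity modeling. The sketch below shows one simple way to represent bundles and to group them greedily by shared posterior mass; the example posteriors, the similarity function, and the threshold are illustrative assumptions that only stand in for the clustering actually used in the paper.

# A word confusion network (WCN) is a sequence of "bundles": competing word
# hypotheses with posterior probabilities for one slot of the utterance.
# Hypothetical ASR output and posteriors:
wcn = [
    [("i", 0.9), ("eye", 0.1)],
    [("called", 0.6), ("cold", 0.3), ("culled", 0.1)],
    [("acme", 0.5), ("acne", 0.3), ("at", 0.2)],
]

def similarity(bundle_a, bundle_b):
    """Overlap of posterior mass on the words shared by two bundles."""
    pa, pb = dict(bundle_a), dict(bundle_b)
    return sum(min(pa[w], pb[w]) for w in pa.keys() & pb.keys())

def cluster_bundles(bundles, threshold=0.3):
    """Greedy grouping of similar bundles (a schematic stand-in for the
    clustering step; each cluster is represented by its first member)."""
    clusters, labels = [], []
    for b in bundles:
        for idx, rep in enumerate(clusters):
            if similarity(b, rep) >= threshold:
                labels.append(idx)
                break
        else:
            clusters.append(b)
            labels.append(len(clusters) - 1)
    return labels

# Bundles pooled from many utterances would be clustered here; each cluster
# id can then serve as a unit (instead of a single word) in the NE model.
print(cluster_bundles(wcn + [[("called", 0.7), ("cold", 0.3)]]))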
PT J
AU Gomez, AM
Schwerin, B
Paliwal, K
AF Gomez, Angel M.
Schwerin, Belinda
Paliwal, Kuldip
TI Improving objective intelligibility prediction by combining correlation
and coherence based methods with a measure based on the negative
distortion ratio
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility; Objective measures; Speech enhancement;
Distance-based methods; Correlation-based methods; Coherence-based
methods; Negative spectral distortion
ID SPEECH-INTELLIGIBILITY; ENHANCEMENT; ALGORITHMS; QUALITY; INDEX
AB In this paper we propose a novel objective method for intelligibility prediction of enhanced speech which is based on the negative distortion ratio (NDR) - that is, the amount of spectral power that has been removed in comparison to the original clean speech signal, likely due to a bad noise estimate during the speech enhancement procedure. While negative spectral distortions can be of significant importance in the subjective intelligibility assessment of processed speech, most objective measures in the literature do not account well for this type of distortion. The proposed method focuses on a very specific type of distortion, so it is not intended to be used alone but in combination with other techniques, to jointly achieve a better intelligibility prediction. In order to find an appropriate technique to combine with, in this paper we also review a number of recently proposed methods based on correlation and coherence measures. These methods have already shown a high correlation with human recognition scores, as they effectively detect the presence of nonlinearities frequently found in noise-suppressed speech. However, when these techniques are applied jointly with the proposed method, significantly higher correlations (above r = 0.9) are shown to be achieved. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Gomez, Angel M.] Univ Granada, Fac Ciencias, Dept Teoria Senal Telemat & Comunicac, E-18071 Granada, Spain.
[Gomez, Angel M.; Schwerin, Belinda; Paliwal, Kuldip] Griffith Univ, Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia.
RP Gomez, AM (reprint author), Univ Granada, Fac Ciencias, Dept Teoria Senal Telemat & Comunicac, Campus Fuentenueva S-N, E-18071 Granada, Spain.
EM amgg@ugr.es; b.schwerin@griffith.edu.au; k.paliwal@griffith.edu.au
RI Gomez Garcia, Angel Manuel/C-6856-2012
OI Gomez Garcia, Angel Manuel/0000-0002-9995-3068
FU Spanish Government [JC2010-0194, CEB09-0010]
FX This work has been supported by the Spanish Government Grant JC2010-0194
and project CEI BioTIC GENIL (CEB09-0010).
CR [Anonymous], 2001, P862 ITUT
ANSI, 1997, S351997 ANSI
Balakrishnan N., 1992, HDB LOGISTIC DISTRIB
Boldt J., 2009, P EUSIPCO, P1849
CARTER GC, 1983, IEEE T AUDIO ELECTRO, V21, P337, DOI DOI 10.1109/TAU.1973.1162496
Christiansen C, 2010, SPEECH COMMUN, V52, P678, DOI 10.1016/j.specom.2010.03.004
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467
Falk T., 2008, P INT WORKSH AC ECH
FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407
Gibbons J. D., 1985, NONPARAMETRIC STAT I
Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628
Gomez A., 2011, P ISCA EUR C SPEECH
Hu Y, 2007, J ACOUST SOC AM, V122, P1777, DOI 10.1121/1.2766778
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
KATES JM, 1992, J ACOUST SOC AM, V91, P2236, DOI 10.1121/1.403657
Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575
Kim G., 2010, P IEEE INT C AC SPEE, V1, P4738
Loizou PC, 2011, IEEE T AUDIO SPEECH, V19, P47, DOI 10.1109/TASL.2010.2045180
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Ludvigsen C, 1993, Scand Audiol Suppl, V38, P50
Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493
Ma JF, 2011, SPEECH COMMUN, V53, P340, DOI 10.1016/j.specom.2010.10.005
MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861
Pearce D., 2000, P ICSLP, V4, P29
Rothauser E. H., 1969, IEEE T AUDIO ELECTRO, V17, P225, DOI DOI 10.1109/TAU.1969.1162058
Scalart P., 1996, P ICASSP, V2, P629
SILVERMAN HF, 1976, IEEE T ACOUST SPEECH, V24, P289, DOI 10.1109/TASSP.1976.1162814
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
Taal CH, 2011, IEEE T AUDIO SPEECH, V19, P2125, DOI 10.1109/TASL.2011.2114881
NR 29
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2012
VL 54
IS 3
BP 503
EP 515
DI 10.1016/j.specom.2011.11.001
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 899IH
UT WOS:000300809100014
ER
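The Gomez, Schwerin and Paliwal record above bases its measure on the negative distortion ratio, i.e. the spectral power removed by enhancement relative to the clean signal. The sketch below computes one plausible reading of such a ratio from aligned clean and enhanced power spectrograms; the exact definition, any perceptual weighting, and the mapping to intelligibility scores used in the paper are not reproduced.

import numpy as np

def negative_distortion_ratio(clean_power, enhanced_power):
    """One plausible form of a negative-distortion measure.

    clean_power, enhanced_power: aligned (frames x bins) power spectrograms.
    Sums only the power that enhancement removed (enhanced < clean) and
    normalizes by the total clean power.
    """
    removed = np.clip(clean_power - enhanced_power, 0.0, None)
    return removed.sum() / clean_power.sum()

# Toy example: an "enhancement" that zeroes out half of the bins removes
# roughly half of the clean power.
rng = np.random.default_rng(4)
clean = rng.random((100, 257))
mask = rng.random((100, 257)) > 0.5
enhanced = clean * mask
print(round(negative_distortion_ratio(clean, enhanced), 3))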
PT J
AU Ward, NG
Vega, A
Baumann, T
AF Ward, Nigel G.
Vega, Alejandro
Baumann, Timo
TI Prosodic and temporal features for language modeling for dialog
SO SPEECH COMMUNICATION
LA English
DT Article
DE Dialog dynamics; Dialog state; Prosody; Interlocutor behavior; Word
probabilities; Prediction; Perplexity; Speech recognition; Switchboard
corpus; Verbmobil corpus
ID SPEECH RECOGNITION; COMMUNICATION; CONVERSATION; FRAMEWORK; SPEAKING;
ENGLISH
AB If we can model the cognitive and communicative processes underlying speech, we should be able to better predict what a speaker will do. With this idea as inspiration, we examine a number of prosodic and timing features as potential sources of information on what words the speaker is likely to say next. In spontaneous dialog we find that word probabilities do vary with such features. Using perplexity as the metric, the most informative of these included recent speaking rate, volume, and pitch, and time until the end of the utterance. Using simple combinations of such features to augment trigram language models gave up to an 8.4% perplexity benefit on the Switchboard corpus, and up to a 1.0% relative reduction in word error rate (0.3% absolute) on the Verbmobil II corpus. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Ward, Nigel G.; Vega, Alejandro] Univ Texas El Paso, El Paso, TX 79968 USA.
[Baumann, Timo] Univ Potsdam, Dept Linguist, D-14476 Potsdam, Germany.
RP Ward, NG (reprint author), Univ Texas El Paso, 500 W Univ Ave, El Paso, TX 79968 USA.
EM nigelward@acm.org; avega5@miners.utep.edu; mail@timobaumann.de
FU NSF [IIS-0415150, IIS-0914868]; US Army Research, Development and
Engineering Command; USC Institute for Creative Technologies; DFG
FX We thank Shreyas Karkhedkar, Nisha Kiran, Shubhra Datta, Gary
Beverungen, and Justin McManus for contributing to the code and
analysis; David Novick, Olac Fuentes and many thoughtful reviewers for
Speech Communication, Interspeech, ASRU, and the National Science
Foundation for comments; and Joe Picone and Andreas Stolcke for making
available the Switchboard labeling and the SRILM toolkit, respectively.
This work was supported in part by NSF Grants IIS-0415150 and
IIS-0914868 and REU supplements thereto, by the US Army Research,
Development and Engineering Command via a subcontract to the USC
Institute for Creative Technologies, and by a DFG grant in the Emmy
Noether program.
CR Ananthakrishnan S, 2007, INT CONF ACOUST SPEE, P873
Bard E. G., 2002, EDILOG 2002, P29
Barsalou Lawrence W, 2007, Cogn Process, V8, P79, DOI 10.1007/s10339-007-0163-1
Batliner A., 1995, 88 VERBM PROJ
Batliner A., 2001, ISCA WORKSH PROS SPE
Beebe B, 2008, J PSYCHOLINGUIST RES, V37, P293, DOI 10.1007/s10936-008-9078-y
Bell A, 2009, J MEM LANG, V60, P92, DOI 10.1016/j.jml.2008.06.003
Bellegarda JR, 2004, SPEECH COMMUN, V42, P93, DOI 10.1016/j.specom.2003.08.002
Bengio Y, 2003, J MACH LEARN RES, V3, P1137, DOI 10.1162/153244303322533223
Bradlow A. R., 2009, LANG SPEECH, V52, P391
Braun B, 2011, LANG COGNITIVE PROC, V26, P350, DOI 10.1080/01690965.2010.492641
BRENNAN SE, 1995, KNOWL-BASED SYST, V8, P143, DOI 10.1016/0950-7051(95)98376-H
Campbell N, 2007, LECT NOTES COMPUT SC, V4775, P117
Chen K., 2007, ROBUST SPEECH RECOGN, P319
Chen S., 1998, DARPA BROADC NEWS TR
Clark H. H., 1996, USING LANGUAGE
Clark HH, 2002, COGNITION, V84, P73, DOI 10.1016/S0010-0277(02)00017-3
Clark HH, 2002, SPEECH COMMUN, V36, P5, DOI 10.1016/S0167-6393(01)00022-X
Ferrer L., 2003, ICASSP
Fujisaki H, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1
Garrod S, 2004, TRENDS COGN SCI, V8, P8, DOI 10.1016/j.tics.2003.10.016
Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858
GOFFMAN E, 1978, LANGUAGE, V54, P787, DOI 10.2307/413235
Gratch J., 2006, 6 INT C INT VIRT AG, P14
Hamaker J., 1998, RULES GUIDELINES TRA
Huang S., 2007, LNCS, V4892, P191
ISIP, 2003, MAN CORR SWITCHB WOR
Jaffe J., 1978, NONVERBAL BEHAV COMM, P55
Jahr E., 2007, J SPEECH LANG PATHOL, V2, P190
Jekat S., 1997, 62 VM LMU MUNCH U HA
Jelinek F., 1997, STAT METHODS SPEECH
Ji G., 2010, ICASSP 2010
Ji G., 2004, C HUM LANG TECHN
Levinson Stephen C, 2006, ROOTS HUMAN SOCIALIT, P39
Ma KW, 2000, SPEECH COMMUN, V31, P51, DOI 10.1016/S0167-6393(99)00060-6
Macrae CN, 2008, COGNITION, V109, P152, DOI 10.1016/j.cognition.2008.07.007
Morgan N., 1998, ICASSP, P721
O'Connell DC, 2008, COGN LANG A SER PSYC, P3, DOI 10.1007/978-0-387-77632-3_1
Petukhova V., 2009, NAACL HLT
Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169
Raux A., 2009, HUMAN LANGUAGE TECHN
SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243
Sebanz N, 2006, TRENDS COGN SCI, V10, P70, DOI 10.1016/j.tics.2005.12.009
Shriberg E, 2004, IMA V MATH, V138, P105
Shriberg E., 2004, P INT C SPEECH PROS, P575
Shriberg E., 2001, J INT PHON ASSOC, V31, P153
Stolcke A., 1999, P 6 EUR C SPEECH COM
Stolcke A., 2002, P INT C SPOK LANG PR
Streeck J, 2009, DISCOURSE PROCESS, V46, P93, DOI 10.1080/01638530902728777
Truong KP, 2007, SPEECH COMMUN, V49, P144, DOI 10.1016/j.specom.2007.01.001
Vicsi K, 2010, SPEECH COMMUN, V52, P413, DOI 10.1016/j.specom.2010.01.003
Ward NG, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P160
Walker W., 2004, TR20040811 SMLI SUN
Ward N., 2006, PRAGMAT COGN, V14, P113
Ward N, 2000, J PRAGMATICS, V32, P1177, DOI 10.1016/S0378-2166(99)00109-5
Ward N. G., 2010, INTERSPEECH
Ward N. G., 1999, ESCA WORKSH DIAL PRO, P83
Ward NG, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1606
Ward N. G., 2010, SPEECH PROS
Ward N. G., 2010, UTEPCS22 DEP COMP SC
Ward N. G., 2009, 11 IEEE WORKSH AUT S, P323
Xu P, 2007, COMPUT SPEECH LANG, V21, P105, DOI 10.1016/j.csl.2006.01.003
Yngve Victor, 1970, 6 REG M CHIC LING SO, P567
NR 63
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 161
EP 174
DI 10.1016/j.specom.2011.07.009
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600001
ER
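The Ward, Vega and Baumann record above evaluates prosody- and timing-augmented trigram language models by perplexity. The sketch below computes perplexity from the per-word probabilities a model assigns to a test sequence; the probability values for the "baseline" and "augmented" models are made up solely to illustrate how a relative perplexity reduction is reported.

import math

def perplexity(word_probabilities):
    """Perplexity of a model over a test sequence.

    word_probabilities: probability the model assigned to each test word
    given its context (values in (0, 1]).
    """
    n = len(word_probabilities)
    log_sum = sum(math.log(p) for p in word_probabilities)
    return math.exp(-log_sum / n)

# Hypothetical per-word probabilities from a baseline trigram model and
# from a model whose estimates are sharpened by prosodic/timing context.
baseline = [0.02, 0.10, 0.005, 0.08, 0.03]
augmented = [0.025, 0.11, 0.006, 0.09, 0.035]
print("baseline perplexity: ", round(perplexity(baseline), 1))
print("augmented perplexity:", round(perplexity(augmented), 1))
print("relative reduction:  ",
      round(100 * (1 - perplexity(augmented) / perplexity(baseline)), 1), "%")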
PT J
AU Andersson, S
Yamagishi, J
Clark, RAJ
AF Andersson, Sebastian
Yamagishi, Junichi
Clark, Robert A. J.
TI Synthesis and evaluation of conversational characteristics in HMM-based
speech synthesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech synthesis; HMM; Conversation; Spontaneous speech; Filled pauses;
Discourse marker
ID SPEAKING; READ
AB Spontaneous conversational speech has many characteristics that are currently not modelled well by HMM-based speech synthesis, and in order to build synthetic voices that can give an impression of someone partaking in a conversation, we need to utilise data that exhibits more of the speech phenomena associated with conversations than the more generally used, carefully read-aloud sentences. In this paper we show that synthetic voices built with HMM-based speech synthesis techniques from conversational speech data preserved segmental and prosodic characteristics of frequent conversational speech phenomena. An analysis of an evaluation investigating the perception of quality and speaking style of HMM-based voices confirms that speech with conversational characteristics is instrumental for listeners to perceive a successful integration of conversational speech phenomena in synthetic speech. The achieved synthetic speech quality provides an encouraging start for the continued use of conversational speech in HMM-based speech synthesis. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Andersson, Sebastian; Yamagishi, Junichi; Clark, Robert A. J.] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland.
RP Andersson, S (reprint author), Univ Edinburgh, Ctr Speech Technol Res, 10 Crichton St, Edinburgh EH8 9AB, Midlothian, Scotland.
EM J.S.Andersson@sms.ed.ac.uk; jyamagis@inf.ed.ac.uk; robert@cstr.ed.ac.uk
FU Marie Curie Early Stage Training Site EdSST [MEST-CT-2005-020568]
FX The authors are grateful to David Traum and Kallirroi Georgila at the
USC Institute for Creative Technologies (http://www.ict.usc.edu) for
making the speech data available to us. The first author is supported by
Marie Curie Early Stage Training Site EdSST (MEST-CT-2005-020568).
CR Adell J., 2008, P INT BRISB AUSTR, P2278
Adell J., 2010, P SPEECH PROS CHIC U, V100624, P1
Anderson S. H., 2010, Proceedings of the 19th World Congress of Soil Science: Soil solutions for a changing world, Brisbane, Australia, 1-6 August 2010. Symposium 4.4.2 Attracting (young) people to a soils career, P1, DOI 10.1109/NEBC.2010.5458172
Andersson S., 2010, P SSW7 KYOT JAP, P173
Aylett M, 2006, J ACOUST SOC AM, V119, P3048, DOI 10.1121/1.2188331
Aylett M. P., 2007, P AISB 2007 NEWC UK, P174
Badino L., 2009, P INT BRIGHT UK, P520
Bell A, 2003, J ACOUST SOC AM, V113, P1001, DOI 10.1121/1.1534836
BLAAUW E, 1994, SPEECH COMMUN, V14, P359, DOI 10.1016/0167-6393(94)90028-0
CADIC D, 2008, P INT BRISB AUSTR, P1861
Campbell N, 2007, P SSW6 BONN GERM, P22
Campbell N, 2006, P SPEECH PROS DRESD
Clark HH, 2002, COGNITION, V84, P73, DOI 10.1016/S0010-0277(02)00017-3
Gravano A., 2007, P 45 ANN M ASS COMP, P800
Gustafsson K., 2004, P SSW5 PITTSB US, P145
Jurafsky D., 1998, COLING ACL 98 WORKSH
Kawahara H., 2001, P 2 MAVEBA FIR IT
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
King S., 2009, P BLIZZ CHALL WORKSH
Laan GPM, 1997, SPEECH COMMUN, V22, P43, DOI 10.1016/S0167-6393(97)00012-5
Lasarcyk E., 2010, P SSW7 KYOT JAP, P230
Lee CH, 2010, INT CONF ACOUST SPEE, P4826, DOI 10.1109/ICASSP.2010.5495140
Nakamura M, 2008, COMPUT SPEECH LANG, V22, P171, DOI 10.1016/j.csl.2007.07.003
O'Shaughnessy D., 1992, P ICASSP, P521, DOI 10.1109/ICASSP.1992.225857
Odell JJ, 1995, THESIS CAMBRIDGE U
Romportl J., 2010, P SSW7 KYOT JAP, P120
Schiffrin Deborah, 1987, DISCOURSE MARKERS
Shriberg E., 1999, P INT C PHON SCI SAN, P619
Shriberg E., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607996
Sundaram S, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P203, DOI 10.1109/WSS.2002.1224409
Tokuda K, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P227
Traum D. R., 2008, RECENT TRENDS DISCOU, P45, DOI [10.1007/978-1-4020-6821-8_3, DOI 10.1007/978-1-4020-6821-8_3]
Wahlster W., 2000, VERBMOBIL FDN SPEECH
Yamagishi J, 2005, IEICE T INF SYST, VE88D, P502, DOI 10.1093/ietisy/e88-d.3.502
Young S., 2006, HTK BOOK
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325
NR 37
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 175
EP 188
DI 10.1016/j.specom.2011.08.001
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600002
ER
PT J
AU Bouton, S
Cole, P
Serniclaes, W
AF Bouton, Sophie
Cole, Pascale
Serniclaes, Willy
TI The influence of lexical knowledge on phoneme discrimination in deaf
children with cochlear implants
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech perception; Phonological features; Cochlear implant; Speech
development; Lexicality effect
ID AUDIOVISUAL SPEECH-PERCEPTION; WORD RECOGNITION; FINE-STRUCTURE;
TEMPORAL CUES; USERS; FRENCH; CATEGORIZATION; INFORMATION; ACCESS;
SKILLS
AB This paper addresses the questions of whether lexical information influences phoneme discrimination in children with cochlear implants (CI) and whether this influence is similar to what occurs in normal-hearing (NH) children. Previous research with CI children evidenced poor accuracy in phonemic perception, which might affect the use of lexical information in phoneme discrimination. A discrimination task with French vowels and consonants in minimal pairs of words (e.g., mouche/bouche) or pseudowords (e.g., moute/boute) was used to search for possible differences in the use of lexical knowledge between CI children and NH children matched for listening age. Minimal pairs differed in a single consonant or vowel feature (e.g., nasality, vocalic aperture, voicing) to unveil possible interactions between phonological/acoustic and lexical processing. The results showed that both the word and pseudoword discrimination of CI children were inferior to those of NH children, with the magnitude of the deficit depending on the feature. However, word discrimination was better than pseudoword discrimination, and this lexicality effect was equivalent for both CI and NH children. Further, this lexicality effect did not depend on the feature in either group. Our results support the idea that a period of auditory deprivation may not affect the lexical processes involved in speech perception. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Bouton, Sophie; Cole, Pascale] Aix Marseille Univ, Lab Cognit Psychol, CNRS, UMR 6146, Aix En Provence, France.
[Bouton, Sophie; Serniclaes, Willy] Paris Descartes Univ, Lab Psychol Percept, CNRS, UMR 8158, Paris, France.
RP Bouton, S (reprint author), Univ Aix Marseille 1, Lab Psychol Cognit, Batiment 9,Case D,3 Pl Victor Hugo, F-13331 Marseille 3, France.
EM Sophie.Bouton@univ-provence.fr
RI Imhof, Margarete/F-8471-2011
CR Abramson A. S., 1985, PHONETIC LINGUISTICS, P25
ANDRUSKI JE, 1994, COGNITION, V52, P163, DOI 10.1016/0010-0277(94)90042-6
Bergeson TR, 2005, EAR HEARING, V26, P149, DOI 10.1097/00003446-200504000-00004
Bergeson TR, 2003, VOLTA REV, V103, P347
Bertoncini J, 2009, J SPEECH LANG HEAR R, V52, P682, DOI 10.1044/1092-4388(2008/07-0273)
Bouton S., J SPEECH LA IN PRESS
Brancazio L, 2004, J EXP PSYCHOL HUMAN, V30, P445, DOI 10.1037/0096-1523.30.3.445
Burns TC, 2007, APPL PSYCHOLINGUIST, V28, P455, DOI 10.1017/S0142716407070257
Chiappe P, 2001, J EXP CHILD PSYCHOL, V80, P58, DOI 10.1006/jecp.2000.2624
CONNINE CM, 1987, J EXP PSYCHOL HUMAN, V13, P291, DOI 10.1037/0096-1523.13.2.291
CONTENT A, 1988, CAH PSYCHOL COGN, V8, P399
Dorman Michael F, 2002, Am J Audiol, V11, P119, DOI 10.1044/1059-0889(2002/014)
EILERS RE, 1979, CHILD DEV, V50, P14
Fort M, 2010, SPEECH COMMUN, V52, P525, DOI 10.1016/j.specom.2010.02.005
FOX RA, 1984, J EXP PSYCHOL HUMAN, V10, P526, DOI 10.1037//0096-1523.10.4.526
GANONG WF, 1980, J EXP PSYCHOL HUMAN, V6, P110, DOI 10.1037/0096-1523.6.1.110
Gaskell MG, 1997, LANG COGNITIVE PROC, V12, P613
Geers A., 2003, EAR HEARING, V24, P24
Harnsberger JD, 2001, J ACOUST SOC AM, V109, P2135, DOI 10.1121/1.1350403
Hoonhorst I, 2009, J EXP CHILD PSYCHOL, V104, P353, DOI 10.1016/j.jecp.2009.07.005
Hoonhorst I, 2009, CLIN NEUROPHYSIOL, V120, P897, DOI 10.1016/j.clinph.2009.02.174
Kaiser AR, 2003, J SPEECH LANG HEAR R, V46, P390, DOI 10.1044/1092-4388(2003/032)
KIRK KI, 1995, EAR HEARING, V16, P470, DOI 10.1097/00003446-199510000-00004
Knudsen EI, 2004, J COGNITIVE NEUROSCI, V16, P1412, DOI 10.1162/0898929042304796
Lachs L, 2001, EAR HEARING, V22, P236, DOI 10.1097/00003446-200106000-00007
Lete B, 2004, BEHAV RES METH INS C, V36, P156, DOI 10.3758/BF03195560
Leybaert J., 2007, ENFANCE, V59, P245, DOI [10.3917/enf.593.0245, DOI 10.3917/ENF.593.0245]
MacMillan N. A., 2005, DETECTION THEORY USE
MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0
MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0
McMurray B, 2003, J PSYCHOLINGUIST RES, V32, P77, DOI 10.1023/A:1021937116271
MCQUEEN JM, 1991, J EXP PSYCHOL HUMAN, V17, P433, DOI 10.1037/0096-1523.17.2.433
Medina V., 2009, REV LOGOPEDIA FONIAT, V29, P186, DOI DOI 10.1016/S0214-4603(09)70027-0
Misiurski C, 2005, BRAIN LANG, V93, P64, DOI 10.1016/j.bandl.2004.08.001
Norris D, 2003, COGNITIVE PSYCHOL, V47, P204, DOI 10.1016/S0010-0285(03)00006-9
O'Donoghue GM, 2000, LANCET, V356, P466, DOI 10.1016/S0140-6736(00)02555-1
Raven JC, 1947, COLOURED PROGR MATRI
REPP BH, 1982, PSYCHOL BULL, V92, P81, DOI 10.1037//0033-2909.92.1.81
Repp B.H, 1984, SPEECH LANGUAGE ADV, V10, P234
Rivera-Gaxiola M, 2005, DEVELOPMENTAL SCI, V8, P162, DOI 10.1111/j.1467-7687.2005.00403.x
ROSEN S, 1992, PHILOS T ROY SOC B, V336, P367, DOI 10.1098/rstb.1992.0070
Rouger J, 2008, BRAIN RES, V1188, P87, DOI 10.1016/j.brainres.2007.10.049
SAERENS M, 1989, LANG SPEECH, V32, P291
Schatzer R, 2010, ACTA OTO-LARYNGOL, V130, P1031, DOI 10.3109/00016481003591731
Serniclaes W., 2010, PORTUGUESE J LINGUIS, V9
SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303
Smith ZM, 2002, NATURE, V416, P87, DOI 10.1038/416087a
Spencer LJ, 2009, J DEAF STUD DEAF EDU, V14, P1, DOI 10.1093/deafed/enn013
Stevens KN, 2000, PHONETICA, V57, P139, DOI 10.1159/000028468
TYEMURRAY N, 1995, J ACOUST SOC AM, V98, P2454, DOI 10.1121/1.413278
Tyler RS, 1997, OTOLARYNG HEAD NECK, V117, P180, DOI 10.1016/S0194-5998(97)70172-4
van Linden S, 2007, NEUROSCI LETT, V420, P49, DOI 10.1016/j.neulet.2007.04.006
Verschuur C, 2001, BRIT J AUDIOL, V35, P209
WHALEN DH, 1991, PERCEPT PSYCHOPHYS, V50, P351, DOI 10.3758/BF03212227
Won JH, 2010, EAR HEARING, V31, P796, DOI 10.1097/AUD.0b013e3181e8b7bd
Xu L, 2007, J ACOUST SOC AM, V122, P1758, DOI 10.1121/1.2767000
NR 56
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 189
EP 198
DI 10.1016/j.specom.2011.08.002
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600003
ER
PT J
AU Gudnason, J
Thomas, MRP
Ellis, DPW
Naylor, PA
AF Gudnason, Jon
Thomas, Mark R. P.
Ellis, Daniel P. W.
Naylor, Patrick A.
TI Data-driven voice source waveform analysis and synthesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice source signal; Inverse filtering; Vocal tract modeling; Principal
component analysis; Gaussian mixture model; Segmental signal to
reconstruction ratio
ID VOCAL-FOLD VIBRATION; LINEAR PREDICTION; GLOTTAL CLOSURE; SPEAKER
IDENTIFICATION; SPEECH ANALYSIS; TUTORIAL; MODEL; ALGORITHM; QUALITY;
PHASE
AB A data-driven approach is introduced for studying, analyzing and processing the voice source signal. Existing approaches parameterize the voice source signal by using models that are motivated, for example, by a physical model or function-fitting. Such parameterization is often difficult to achieve and produces a poor approximation to the large variety of real voice source waveforms found in human speech. This paper presents a novel data-driven approach to analyze different types of voice source waveforms using principal component analysis and Gaussian mixture modeling. This approach models certain voice source features that many other approaches fail to model. Prototype voice source waveforms are obtained from each mixture component and analyzed with respect to speaker, phone and pitch. An analysis/synthesis scheme was set up to demonstrate the effectiveness of the method. Compression of the proposed voice source representation by discarding 75% of the features yields a segmental signal-to-reconstruction error ratio of 13 dB and a Bark spectral distortion of 0.14. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Gudnason, Jon] Reykjavik Univ, Sch Sci & Engn, Reykjavik, Iceland.
[Thomas, Mark R. P.; Naylor, Patrick A.] Univ London Imperial Coll Sci Technol & Med, Dept Elect & Elect Engn, London SW7 2AZ, England.
[Naylor, Patrick A.] Columbia Univ, LabROSA, New York, NY 10027 USA.
RP Gudnason, J (reprint author), Reykjavik Univ, Sch Sci & Engn, Reykjavik, Iceland.
EM jg@ru.is; mark.r.thomas02@imperial.ac.uk; dpwe@ee.columbia.edu;
p.naylor@imperial.ac.uk
CR ABBERTON ERM, 1989, CLIN LINGUIST PHONET, V3, P281, DOI 10.3109/02699208908985291
Alipour F, 2000, J ACOUST SOC AM, V108, P3003, DOI 10.1121/1.1324678
ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R
Alku P, 2009, J ACOUST SOC AM, V125, P3289, DOI 10.1121/1.3095801
Ananthapadmanabha T., 1984, 2 ROYAL I TECHN SPEE, P1
ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679
Backstrom T, 2002, IEEE T SPEECH AUDI P, V10, P186, DOI 10.1109/TSA.2002.1001983
BLACK AW, 2007, STAT PARAMETRIC SPEE, V4, P1229
Brookes D.M., 1994, P I ACOUSTICS, V15, P501
Carlson R., 1989, P INT C AC SPEECH SI, P223
Cataldo E., 2006, J BRAZILIAN SOC MECH, P28
CHAN DSF, 1989, P EUR C SPEECH COMM, V33
CHILDERS DG, 1995, IEEE T SPEECH AUDI P, V3, P209, DOI 10.1109/89.388148
CUMMINGS KE, 1995, DIGIT SIGNAL PROCESS, V5, P21, DOI 10.1006/dspr.1995.1003
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
Ding C, 2004, P INT C MACH LEARN I, P225, DOI 10.1145/1015330.1015408
Drugman T., 2009, P IEEE INT C AC SPEE
Duda R. O., 2001, PATTERN CLASSIFICATI
Eysholdt U, 1996, FOLIA PHONIATR LOGO, V48, P163
Fant G., 1960, ACOUSTIC THEORY SPEE
Fant G., 1985, STL QPSR, V26, P1
Flanagan J., 1972, SPEECH ANAL SYNTHESI
FLANAGAN JL, 1968, IEEE T ACOUST SPEECH, VAU16, P57, DOI 10.1109/TAU.1968.1161949
FUJISAKI H, 1987, P IEEE INT C AC SPEE, V12, P637
Gudnason J., 2009, P INT C BRIGHT UK
Gudnason J, 2008, INT CONF ACOUST SPEE, P4821, DOI 10.1109/ICASSP.2008.4518736
Hartigan J. A., 1979, Applied Statistics, V28, DOI 10.2307/2346830
Hirano M, 1981, CLIN EXAMINATION VOI
IEC, 2003, 616722003 IEC
ISHIZAKA K, 1972, AT&T TECH J, V51, P1233
Jurafsky Daniel, 2000, SPEECH LANGUAGE PROC
KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894
Kumar A., 1997, P IEEE WORKSH SPEECH, P3
LECLUSE F, 1975, FOLIA PHONIATR, V17, P215
Lindsey G., 1987, SPARS ARCH ACTUAL WO
Ma C, 1994, IEEE T SPEECH AUDI P, V2, P258
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
Markel JD, 1976, LINEAR PREDICTION SP
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
Naylor PA, 2007, IEEE T AUDIO SPEECH, V15, P34, DOI 10.1109/TASL.2006.876878
Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109
ROSENBERG AE, 1971, J ACOUST SOC AM, V49, P583, DOI 10.1121/1.1912389
SPANIAS AS, 1994, P IEEE, V82, P1541, DOI 10.1109/5.326413
STORY BH, 1995, J ACOUST SOC AM, V97, P1249, DOI 10.1121/1.412234
STRUBE HW, 1974, J ACOUST SOC AM, V56, P1625, DOI 10.1121/1.1903487
Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472
Svec JG, 1996, J VOICE, V10, P201, DOI 10.1016/S0892-1997(96)80047-6
Thomas M. R. P., 2008, P EUR SIGN PROC C EU
Thomas M.R.P., 2010, P IEEE INT C AC SPEE
Thomas M.R.P., 2010, DETECTION GLOT UNPUB
Titze IR, 2000, J ACOUST SOC AM, V107, P581, DOI 10.1121/1.428324
WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260
Zhu W., 1996, P INT C SPOK LANG PR, P1413
NR 53
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 199
EP 211
DI 10.1016/j.specom.2011.08.003
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600004
ER
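Illustrative note on the preceding record (Gudnason et al.): the pipeline described in the abstract - principal component analysis of voice source cycles followed by Gaussian mixture modeling, with prototype waveforms read off the mixture means - can be sketched as below. This is a minimal sketch under assumed array shapes and component counts, run on synthetic placeholder data rather than inverse-filtered speech; it is not the authors' implementation.

```python
# Sketch of PCA + GMM prototype analysis of voice source cycles (illustrative only).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Assume each row is one length-normalised glottal-cycle waveform (e.g. from inverse filtering).
rng = np.random.default_rng(0)
cycles = rng.standard_normal((500, 128))           # placeholder data: 500 cycles of 128 samples

pca = PCA(n_components=16).fit(cycles)             # compact, data-driven waveform basis
features = pca.transform(cycles)                   # per-cycle feature vectors

gmm = GaussianMixture(n_components=8, covariance_type="full", random_state=0).fit(features)

# A prototype voice source waveform per mixture component: its mean mapped back to the time domain.
prototypes = pca.inverse_transform(gmm.means_)     # shape (8, 128)

# Crude "compression": keep only 4 of the 16 PCA features (~75% discarded) and measure reconstruction error.
kept = features.copy()
kept[:, 4:] = 0.0
recon = pca.inverse_transform(kept)
err_db = 10 * np.log10(np.sum(cycles**2) / np.sum((cycles - recon) ** 2))
print(prototypes.shape, round(err_db, 1))
```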
PT J
AU Saon, G
Soltau, H
AF Saon, George
Soltau, Hagen
TI Boosting systems for large vocabulary continuous speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Boosting; Acoustic modeling
AB We employ a variant of the popular Adaboost algorithm to train multiple acoustic models such that the aggregate system exhibits improved performance over the individual recognizers. Each model is trained sequentially on re-weighted versions of the training data. At each iteration, the weights are decreased for the frames that are correctly decoded by the current system. These weights are then multiplied with the frame-level statistics for the decision trees and Gaussian mixture components of the next iteration system. The composite system uses a log-linear combination of HMM state observation likelihoods. We report experimental results on several broadcast news transcription setups which differ in the language being spoken (English and Arabic) and amounts of training data. Additionally, we study the impact of boosting on maximum likelihood (ML) and discriminatively trained acoustic models. Our findings suggest that significant gains can be obtained for small amounts of training data even after feature and model-space discriminative training. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Saon, George; Soltau, Hagen] IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA.
RP Saon, G (reprint author), IBM TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA.
EM gsaon@us.ibm.com
FU DARPA [HR0011-06-2-0001]
FX The authors acknowledge the support of DARPA under Grant
HR0011-06-2-0001 for funding part of this work. The views, opinions,
and/or findings contained in this article/presentation are those of the
author/presenter and should not be interpreted as representing the
official views or policies, either expressed or implied, of the Defense
Advanced Research Projects Agency or the Department of Defense.
CR Breslin C., 2007, INT 07, P1441
Dimitrakakis C., 2004, ICASSP 04, P621
Du J., 2010, INT 10, P2942
Eibl G., 2002, ECML 02, P72
Freund Y, 1997, J COMPUT SYST SCI, V55, P119, DOI 10.1006/jcss.1997.1504
Povey D, 2008, INT CONF ACOUST SPEE, P4057, DOI 10.1109/ICASSP.2008.4518545
Saon G, 2010, INT CONF ACOUST SPEE, P4378, DOI 10.1109/ICASSP.2010.5495640
Saon G, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P920
Schapire RE, 1999, MACH LEARN, V37, P297, DOI 10.1023/A:1007614523901
Siohan O, 2005, INT CONF ACOUST SPEE, P197
Soltau H., 2010, WORKSH SPEECH LANG T, P97
Tang H., 2010, ICASSP 10, P2274
Zhang R., 2004, ICSLP 04
Zhu J, 2009, STAT INTERFACE, V2, P349
Zweig G., 2000, ACOUST SPEECH SIG PR, P1527
NR 15
TC 9
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 212
EP 218
DI 10.1016/j.specom.2011.07.011
PG 7
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600005
ER
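Illustrative note on the preceding record (Saon and Soltau): the core boosting step - decreasing the weights of correctly decoded frames before accumulating statistics for the next model, and combining the models log-linearly at decode time - is sketched below as a toy. The update factor beta, the frame count, the correctness mask, and the combination weights are placeholders; the actual system operates on decision-tree and Gaussian mixture statistics and lattices not shown here.

```python
# Toy frame-weight update for a boosted ensemble (illustrative; not IBM's training pipeline).
import numpy as np

def update_frame_weights(weights, correct, beta=0.5):
    """Decrease the weights of correctly decoded frames, then renormalise."""
    w = weights * np.where(correct, beta, 1.0)
    return w / w.sum()

def loglinear_combine(frame_loglikes, model_weights):
    """Log-linear combination of per-model HMM state log-likelihoods for one frame."""
    return np.dot(model_weights, frame_loglikes)

weights = np.full(10, 1.0 / 10)                               # uniform weights over 10 frames
correct = np.array([1, 1, 0, 1, 0, 0, 1, 1, 1, 0], bool)      # assumed decoding outcome
weights = update_frame_weights(weights, correct)              # re-weighted stats feed the next model
score = loglinear_combine(np.array([-10.0, -12.5]), np.array([0.6, 0.4]))
print(weights.round(3), round(score, 2))
```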
PT J
AU Kurata, G
Sethy, A
Ramabhadran, B
Rastrow, A
Itoh, N
Nishimura, M
AF Kurata, Gakuto
Sethy, Abhinav
Ramabhadran, Bhuvana
Rastrow, Ariya
Itoh, Nobuyasu
Nishimura, Masafumi
TI Acoustically discriminative language model training with
pseudo-hypothesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Discriminative training; Language model; Phonetic confusability; Finite
state transducer
ID HIDDEN MARKOV-MODELS; SPEECH RECOGNITION; DIVERGENCE
AB Recently proposed methods for discriminative language modeling require alternate hypotheses in the form of lattices or N-best lists. These are usually generated by an Automatic Speech Recognition (ASR) system on the same speech data used to train the system. This requirement restricts the scope of these methods to corpora where both the acoustic material and the corresponding true transcripts are available. Typically, the text data available for language model (LM) training is an order of magnitude larger than manually transcribed speech. This paper provides a general framework to take advantage of this volume of textual data in the discriminative training of language models. We propose to generate probable N-best lists directly from the text material, which resemble the N-best lists produced by an ASR system by incorporating phonetic confusability estimated from the acoustic model of the ASR system. We present experiments with Japanese spontaneous lecture speech data, which demonstrate that discriminative LM training with the proposed framework is effective and provides modest gains in ASR accuracy. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Kurata, Gakuto; Itoh, Nobuyasu; Nishimura, Masafumi] IBM Japan Ltd, IBM Res Tokyo, Yamato, Kanagawa, Japan.
[Sethy, Abhinav; Ramabhadran, Bhuvana] IBM Corp, IBM Res TJ Watson Res Ctr, Yorktown Hts, NY USA.
[Rastrow, Ariya] Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD USA.
RP Kurata, G (reprint author), IBM Japan Ltd, IBM Res Tokyo, Yamato, Kanagawa, Japan.
EM gakuto@jp.ibm.com
CR Bahl L.R., 1977, J ACOUST SOC AM 1, V62, pS63
Berger A., 1998, P ICASSP, VII, P705, DOI 10.1109/ICASSP.1998.675362
Bhattacharyya A., 1943, Bulletin of the Calcutta Mathematical Society, V35
Chen J.-Y., 2007, P INTERSPEECH 2007 A, P2089
Chen S. F., 2009, P HLT NAACL, P450, DOI 10.3115/1620754.1620820
Chen SF, 1999, COMPUT SPEECH LANG, V13, P359, DOI 10.1006/csla.1999.0128
Chen Z., 2000, P ICSLP, V1, P493
Collins M, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P1
Gauvain J., 2003, P RICH TRANSCR WORKS
Hershey J. R., 2007, P ICASSP, V4, P317
Hershey John R., 2007, P ASRU, P323
Hershey JR, 2008, INT CONF ACOUST SPEE, P4557, DOI 10.1109/ICASSP.2008.4518670
Kuo H., 2007, P ICASSP, V4, P45
Kuo H. K. J., 2002, P ICASSP, V1, P325
Kuo J.-W., 2005, P INTERSPEECH, P1277
Kurata G, 2009, INT CONF ACOUST SPEE, P4717, DOI 10.1109/ICASSP.2009.4960684
Lamel L, 2002, COMPUT SPEECH LANG, V16, P115, DOI 10.1006/csla.2001.0186
Lin S., 2005, P INTERSPEECH, P733
Minematsu N., 2002, P ICSLP, P529
Mohanty B, 2008, INT CONF ACOUST SPEE, P4953, DOI 10.1109/ICASSP.2008.4518769
Mohri M, 2002, COMPUT SPEECH LANG, V16, P69, DOI 10.1006/csla.2001.0184
Nguyen L., 2003, P RICH TRANSCR WORKS
Nishimura R., 2001, P EUROSPEECH, P2127
Oba Takanobu, 2007, P INT, P1753
Okanohara D., 2007, P ACL, P73
Pallet David S, 1990, P ICASSP, P97
Povey D., 2007, P ICASSP, V4, P321
Printz H, 2002, COMPUT SPEECH LANG, V16, P131, DOI 10.1006/csla.2001.0188
Rastrow Ariya, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5373338
Roark B, 2007, COMPUT SPEECH LANG, V21, P373, DOI 10.1016/j.csl.2006.06.006
Roark B., 2004, P ACL, P47, DOI 10.3115/1218955.1218962
Sandbank B, 2008, P EMNLP, P51, DOI 10.3115/1613715.1613723
Schwenk H., 2005, P C HUM LANG TECHN E, P201, DOI 10.3115/1220575.1220601
Silva J, 2006, IEEE T AUDIO SPEECH, V14, P890, DOI 10.1109/TSA.2005.858059
Silva J., 2006, P IEEE INT S INF THE, P2299
Smith N A, 2005, P 43 ANN M ASS COMP, P354, DOI 10.3115/1219840.1219884
Woodland PC, 2002, COMPUT SPEECH LANG, V16, P25, DOI 10.1006/csla.2001.0182
Xu P., 2009, P ASRU, P317
NR 38
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 219
EP 228
DI 10.1016/j.specom.2011.08.004
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600006
ER
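Illustrative note on the preceding record (Kurata et al.): the idea of generating pseudo-hypotheses directly from text can be caricatured with a word-level confusion table, as below. The real method estimates phonetic confusability from the acoustic model and works with finite state transducers; the confusion sets, function name, and N-best truncation here are assumptions for illustration only.

```python
# Toy pseudo-hypothesis generation from text via a word-level confusability table
# (illustrative; the paper uses phonetic confusability and FSTs, not a hand-written table).
import itertools

confusable = {            # assumed confusion sets; in practice derived from the acoustic model
    "right": ["write", "rite"],
    "two":   ["too", "to"],
}

def pseudo_nbest(reference_words, n=5):
    """Enumerate likely competitor sentences for discriminative LM training."""
    options = [[w] + confusable.get(w, []) for w in reference_words]
    hyps = [" ".join(h) for h in itertools.product(*options)]
    return hyps[:n]

print(pseudo_nbest(["turn", "right", "at", "two"]))
```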
PT J
AU Fujimoto, M
Watanabe, S
Nakatani, T
AF Fujimoto, Masakiyo
Watanabe, Shinji
Nakatani, Tomohiro
TI Frame-wise model re-estimation method based on Gaussian pruning with
weight normalization for noise robust voice activity detection
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice activity detection; Switching Kalman filter; Gaussian pruning;
Posterior probability; Gaussian weight normalization
ID HIGHER-ORDER STATISTICS; SPEECH RECOGNITION
AB This paper proposes a robust voice activity detection (VAD) method that operates in the presence of noise. For noise robust VAD, we have already proposed statistical models and a switching Kalman filter (SKF)-based technique. In this paper, we focus on a model re-estimation method using Gaussian pruning with weight normalization. The statistical model for SKF-based VAD is constructed using Gaussian mixture models (GMMs), and consists of pre-trained silence and clean speech GMMs and a sequentially estimated noise GMM. However, the composed model is not optimal in that it does not fully reflect the characteristics of the observed signal. Thus, to ensure the optimality of the composed model, we investigate a method for its re-estimation that reflects the characteristics of the observed signal sequence. Since our VAD method works through the use of frame-wise sequential processing, processing with the smallest latency is very important. In this case, there are insufficient re-training data for a re-estimation of all the Gaussian parameters. To solve this problem, we propose a model re-estimation method that involves the extraction of reliable characteristics using Gaussian pruning with weight normalization. Namely, the proposed method re-estimates the model by pruning non-dominant Gaussian distributions that express the local characteristics of each frame and by normalizing the Gaussian weights of the remaining distributions. In an experiment using a speech corpus for VAD evaluation, CENSREC-1-C, the proposed method significantly improved the VAD performance compared with that of the original SKF-based VAD. This result confirmed that the proposed Gaussian pruning contributes to an improvement in VAD accuracy. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Fujimoto, Masakiyo; Watanabe, Shinji; Nakatani, Tomohiro] NTT Corp 2 4, NTT Commun Sci Labs, Seika, Kyoto 6190237, Japan.
RP Fujimoto, M (reprint author), NTT Corp 2 4, NTT Commun Sci Labs, Seika, Kyoto 6190237, Japan.
EM fujimoto.masakiyo@lab.ntt.co.jp
CR [Anonymous], 1999, 301708 ETSI EN
[Anonymous], 2006, 202050 ETSI ES
Chang JH, 2006, IEEE T SIGNAL PROCES, V54, P1965, DOI 10.1109/TSP.2006.874403
Cournapeau D., 2007, P INT, P2945
Fischer V., 1999, P EUROSPEECH99 SEPT, V3, P1099
Fujimoto M., 2009, P INT 09 SEPT, P1235
Fujimoto M., 2007, P INT 07 AUG, P2933
Fujimoto M, 2008, INT CONF ACOUST SPEE, P4441
Hirsch H. G., 2000, P ISCA ITRW ASR2000, P18
Hori T, 2007, IEEE T AUDIO SPEECH, V15, P1352, DOI 10.1109/TASL.2006.889790
Ishizuka K, 2010, SPEECH COMMUN, V52, P41, DOI 10.1016/j.specom.2009.08.003
ITU-T, 1996, G729 ITUT
Kato H., 2008, SPEECH COMMUN, V50, P476
Kitaoka N., 2007, P IEEE WORKSH AUT SP, P607
KRISTIANSSON T., 2005, P INTERSPEECH, P369
Li K, 2005, IEEE T SPEECH AUDI P, V13, P965, DOI 10.1109/TSA.2005.851955
Mohri M., 2000, P ASR2000, P97
Nakamura S, 2005, IEICE T INF SYST, VE88D, P535, DOI 10.1093/ietisy/e88-d.3.535
Nemer E, 2001, IEEE T SPEECH AUDI P, V9, P217, DOI 10.1109/89.905996
Ogawa A, 2008, INT CONF ACOUST SPEE, P4173
RABINER LR, 1975, AT&T TECH J, V54, P297
Ramirez J, 2007, INT CONF ACOUST SPEE, P801
Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002
Ramirez J, 2005, IEEE SIGNAL PROC LET, V12, P689, DOI 10.1109/LSP.2005.855551
Shinoda K., 2002, P ICASSP2002, VI, P869
Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1
Tahmasbi R, 2007, IEEE T AUDIO SPEECH, V15, P1129, DOI 10.1109/TASL.2007.894521
Weiss R.J., 2008, P INT 08, P127
NR 28
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 229
EP 244
DI 10.1016/j.specom.2011.08.005
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600007
ER
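Illustrative note on the preceding record (Fujimoto et al.): the frame-wise re-estimation step - pruning non-dominant Gaussians and renormalizing the weights of the survivors - is sketched below in isolation. The keep ratio, example posteriors, and function name are assumptions; in the paper this step is embedded in a switching-Kalman-filter VAD driven by sequentially estimated noise models.

```python
# Sketch of frame-wise Gaussian pruning with weight normalisation (illustrative only).
import numpy as np

def prune_and_normalise(weights, posteriors, keep_ratio=0.5):
    """Keep the Gaussians with the largest frame posteriors; renormalise their weights."""
    k = max(1, int(np.ceil(keep_ratio * len(weights))))
    keep = np.argsort(posteriors)[::-1][:k]          # indices of the dominant components
    new_w = np.zeros_like(weights)
    new_w[keep] = weights[keep]
    return new_w / new_w.sum()

w = np.array([0.4, 0.3, 0.2, 0.1])                   # mixture weights of the composed model
post = np.array([0.05, 0.6, 0.3, 0.05])              # assumed per-frame component posteriors
print(prune_and_normalise(w, post))                  # mass concentrated on components 1 and 2
```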
PT J
AU Chunwijitra, V
Nose, T
Kobayashi, T
AF Chunwijitra, Vataya
Nose, Takashi
Kobayashi, Takao
TI A tone-modeling technique using a quantized F0 context to improve tone
correctness in average-voice-based speech synthesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE HMM-based speech synthesis; Average voice model; F0 modeling; F0
quantization
ID SYNTHESIS SYSTEM; PITCH; ADAPTATION
AB This paper proposes a technique of improving tone correctness in speech synthesis of a tonal language based on an average-voice model trained with a corpus from nonprofessional speakers' speech. We focused on reducing tone disagreements in speech data acquired from nonprofessional speakers without manually modifying the labels. To reduce the distortion in tone caused by inconsistent tonal labeling, quantized F0 symbols were utilized as the context for F0 to obtain an appropriate F0 model. With this technique, the tonal context could be directly extracted from the original speech and this prevented inconsistency between speech data and F0 labels generated from transcriptions, which affect naturalness and the tone correctness in synthetic speech. We examined two types of labeling for the tonal context using phone-based and sub-phone-based quantized F0 symbols. Subjective and objective evaluations of the synthetic voice were carried out in terms of the intelligibility of tone and its naturalness. The experimental results from both the objective and subjective tests revealed that the proposed technique could improve not only naturalness but also the tone correctness of synthetic speech under conditions where a small amount of speech data from nonprofessional target speakers was used. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Chunwijitra, Vataya; Nose, Takashi; Kobayashi, Takao] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan.
RP Chunwijitra, V (reprint author), Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan.
EM chunwijitra.v.aa@m.titech.ac.jp; takashi.nose@ip.titech.ac.jp;
takao.kobayashi@ip.titech.ac.jp
FU JSPS [21300063]; Thai government
FX This work was supported in part by the JSPS Grant-in-Aid for Scientific
Research 21300063. The first author was supported by a Science and
Technology Scholarship from the Thai government. We would like to thank
NECTEC, Thailand, for providing us with the LOTUS and the TSynC-1 speech
corpora.
CR Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X
Chomphan S, 2009, SPEECH COMMUN, V51, P330, DOI 10.1016/j.specom.2008.10.003
Chomphan S, 2008, SPEECH COMMUN, V50, P392, DOI 10.1016/j.specom.2007.12.002
CHOU PA, 1991, IEEE T PATTERN ANAL, V13, P340, DOI 10.1109/34.88569
Fujisaki H., 1984, J ACOUST SOC JPN ASJ, V5, P133
Hansakunbuntheung C., 2005, P SNLP 2005, P127
Hunt AJ, 1996, INT CONF ACOUST SPEE, P373, DOI 10.1109/ICASSP.1996.541110
Kasuriya S., 2003, Proceedings of the Oriental COCOSDA 2003. International Coordinating Committee on Speech Databases and Speech I/O System Assessment
Kawahara H., 1997, P IEEE INT C AC SPEE, V1, P1303
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Li Y., 2004, P SPEECH PROS MARCH, P169
Mittrapiyanuruk P., 2000, NECTEC ANN C BANGK, P483
Nose T, 2010, INT CONF ACOUST SPEE, P4622, DOI 10.1109/ICASSP.2010.5495548
Ogata K., 2006, P INTERSPEECH 2006 I, P1328
Raux A., 2003, P WORKSH AUT SPEECH, P700
Sornlertlamvanich V., 1998, P OR COCOSDA WORKSH
TAMURA M, 2001, ACOUST SPEECH SIG PR, P805
Tamura M., 2001, P EUROSPEECH 2001 SE, P345
Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453
Thangthai A., 2008, P INT C SPEECH SCI T, P2270
TOKUDA K, 1995, P ICASSP, P660
TOKUDA K, 1999, ACOUST SPEECH SIG PR, P229
Yamagishi J, 2003, IEICE T FUND ELECTR, VE86A, P1956
Yamagishi J, 2003, IEICE T INF SYST, VE86D, P534
Yamagishi J, 2010, IEEE T AUDIO SPEECH, V18, P984, DOI 10.1109/TASL.2010.2045237
Yamagishi J, 2007, IEICE T INF SYST, VE90D, P533, DOI 10.1093/ietisy/e90-d.2.533
Yoshimura T, 1999, P EUR, P2347
Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825
NR 28
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 245
EP 255
DI 10.1016/j.specom.2011.08.006
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600008
ER
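Illustrative note on the preceding record (Chunwijitra et al.): using quantized F0 symbols as tonal context amounts to mapping each phone's (or sub-phone's) F0 onto a small symbol inventory derived from the speech itself rather than from transcriptions. A minimal sketch follows, assuming a log-scale quantizer, five levels, and an 80-350 Hz range; these are illustrative choices, not the paper's settings.

```python
# Toy phone-level F0 quantisation into context symbols (illustrative only).
import numpy as np

def quantise_f0(f0_hz, n_levels=5, fmin=80.0, fmax=350.0):
    """Map a mean F0 to one of n_levels symbols 'Q0'..'Q4' on a log scale; unvoiced -> 'QX'."""
    if f0_hz <= 0:
        return "QX"
    x = (np.log(f0_hz) - np.log(fmin)) / (np.log(fmax) - np.log(fmin))
    level = int(np.clip(np.floor(x * n_levels), 0, n_levels - 1))
    return f"Q{level}"

print([quantise_f0(f) for f in (0.0, 110.0, 180.0, 260.0)])   # -> ['QX', 'Q1', 'Q2', 'Q3']
```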
PT J
AU Tohidypour, HR
Seyyedsalehi, SA
Behbood, H
Roshandel, H
AF Tohidypour, Hamid Reza
Seyyedsalehi, Seyyed Ali
Behbood, Hossein
Roshandel, Hossein
TI A new representation for speech frame recognition based on redundant
wavelet filter banks
SO SPEECH COMMUNICATION
LA English
DT Article
DE Redundant wavelet filter-bank (RWFB); Wavelet transform (WT); Speech
frame recognition; Representation; Frame wavelet; Zero moments;
Four-channel higher density discrete wavelet; Time delay neural network
(TDNN)
ID TRANSFORM; MODEL
AB Although the conventional wavelet transform possesses multi-resolution properties, it is not optimized for speech recognition systems. It suffers from lower performance compared with Mel Frequency Cepstral Coefficients (MFCCs), whose Mel scale is based on human auditory perception. In this paper, some new speech representations based on redundant wavelet filter-banks (RWFB) are proposed. RWFB parameters are much less shift-sensitive than those of the critically sampled discrete wavelet transform (DWT), so they promise better performance in speech recognition tasks owing to their better time-frequency localization. However, this improvement comes at the expense of higher redundancy. Several types of wavelet representations are introduced, including a combination of critically sampled DWT and different multi-channel redundant filter-banks down-sampled by 2. In order to find appropriate filter values for the multi-channel filter-banks, the effects of changing the zero moments of the proposed wavelets are discussed. The performance of the corresponding methods is compared in a phoneme recognition task using time delay neural networks. It is revealed that redundant multi-channel wavelet filter-banks work better than the conventional DWT in speech recognition systems. The proposed four-channel higher density discrete wavelet filter-bank yields a recognition rate increase of up to approximately 8.95% compared with the critically sampled two-channel wavelet filter-bank. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Tohidypour, Hamid Reza; Seyyedsalehi, Seyyed Ali; Behbood, Hossein] Amirkabir Univ Technol, Dept Biomed Engn, Tehran 158754413, Iran.
[Roshandel, Hossein] Amirkabir Univ Technol, Dept Elect Engn, Tehran Polytech, Tehran 158754413, Iran.
RP Tohidypour, HR (reprint author), Amirkabir Univ Technol, Dept Biomed Engn, 424 Hafez Ave, Tehran 158754413, Iran.
EM hamidto86@aut.ac.ir
CR Abdelnour A.F, 2005, P SOC PHOTO-OPT INS, V5914, P133
Abdelnour AF, 2005, IEEE T SIGNAL PROCES, V53, P231, DOI 10.1109/TSP.2004.838959
Bijankhan M., 1994, P SPEECH SCI TECHN C, P826
Bresolin AD, 2008, INT CONF ACOUST SPEE, P1545
Farooq O, 2001, IEEE SIGNAL PROC LET, V8, P196, DOI 10.1109/97.928676
Favero R. F., 1994, Proceedings of the IEEE-SP International Symposium on Time-Frequency and Time-Scale Analysis (Cat. No.94TH8007), DOI 10.1109/TFSA.1994.467280
Gillick L., 1989, P ICASSP, V1, P532
GOWDY JN, 2000, P IEEE INT C AC SPEE, V3, P1351
Karam J.R., 2000, P CAN C EL COMP ENG, V1, P331
Karami Sh., 2001, THESIS AMIRKABIR U T
Lebrun J, 2004, J SYMB COMPUT, V37, P227, DOI 10.1016/j.jsc.2002.06.002
Nejadgholi I, 2009, NEURAL COMPUT APPL, V18, P45, DOI 10.1007/s00521-007-0151-5
Rahiminejad M., 2002, THESIS AMIRKABIR U T
Selesnick I. W., 2001, WAVELETS SIGNAL IMAG
Selesnick IW, 2004, APPL COMPUT HARMON A, V17, P211, DOI 10.1016/j.acha.2004.05.003
Selesnick IW, 2005, IEEE SIGNAL PROC MAG, V22, P123, DOI 10.1109/MSP.2005.1550194
Selesnick IW, 2006, IEEE T SIGNAL PROCES, V54, P3039, DOI 10.1109/TSP.2006.875388
Shao Y., 2010, IEEE T SYST MAN CY A, V41, P284
Tufekei Z, 2006, SPEECH COMMUN, V48, P1294, DOI 10.1016/j.specom.2006.06.006
Wu JD, 2009, EXPERT SYST APPL, V36, P3136, DOI 10.1016/j.eswa.2008.01.038
Xueying Z., 2004, P 7 INT C SIGN PROC, VI, P695, DOI 10.1109/ICOSP.2004.1452758
Yao J, 2001, IEEE T BIO-MED ENG, V48, P856
NR 22
TC 2
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 256
EP 271
DI 10.1016/j.specom.2011.09.001
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600009
ER
PT J
AU Chen, F
Loizou, PC
AF Chen, Fei
Loizou, Philipos C.
TI Impact of SNR and gain-function over- and under-estimation on speech
intelligibility
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Speech intelligibility; SNR estimation
ID SPECTRAL AMPLITUDE ESTIMATOR; ENHANCEMENT ALGORITHMS; MINIMUM
STATISTICS; DENSITY-ESTIMATION; NOISE-ESTIMATION
AB Most noise reduction algorithms rely on obtaining reliable estimates of the SNR of each frequency bin. For that reason, much work has been done in analyzing the behavior and performance of SNR estimation algorithms in the context of improving speech quality and reducing speech distortions (e.g., musical noise). Comparatively little work has been reported, however, regarding the analysis and investigation of the effect of errors in SNR estimation on speech intelligibility. It is not known, for instance, whether it is the errors in SNR overestimation, errors in SNR underestimation, or both that are harmful to speech intelligibility. Errors in SNR estimation produce concomitant errors in the computation of the gain (suppression) function, and the impact of gain estimation errors on speech intelligibility is unclear. The present study assesses the effect of SNR estimation errors on gain function estimation via sensitivity analysis. Intelligibility listening studies were conducted to validate the sensitivity analysis. Results indicated that speech intelligibility is severely compromised when SNR and gain over-estimation errors are introduced in spectral components with negative SNR. A theoretical upper bound on the gain function is derived that can be used to constrain the values of the gain function so as to ensure that SNR overestimation errors are minimized. Speech enhancement algorithms that can limit the values of the gain function to fall within this upper bound can improve speech intelligibility. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Chen, Fei; Loizou, Philipos C.] Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA.
RP Loizou, PC (reprint author), Univ Texas Dallas, Dept Elect Engn, 800 W Campbell Rd,EC33, Richardson, TX 75080 USA.
EM loizou@utdallas.edu
FU National Institute of Deafness and other Communication Disorders, NIH
[R01 DC010494]
FX This research was supported by Grant No. R01 DC010494 from the National
Institute of Deafness and other Communication Disorders, NIH.
CR Berouti M., 1979, P IEEE INT C AC SPEE, P208
Breithaupt C, 2011, IEEE T AUDIO SPEECH, V19, P277, DOI 10.1109/TASL.2010.2047681
Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929
Cappe O., 1994, IEEE T SPEECH AUDIO, V2, P346
Cohen I, 2005, IEEE T SPEECH AUDI P, V13, P870, DOI 10.1109/TSA.2005.851940
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Erkelens J, 2007, SPEECH COMMUN, V49, P530, DOI 10.1016/j.specom.2006.06.012
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
IEEE Subcommittee, 1969, IEEE T AUDIO ELECTRO, V3, P225, DOI DOI 10.1109/TAU.1969.1162058
Kim G, 2009, J ACOUST SOC AM, V126, P1486, DOI 10.1121/1.3184603
Kim G, 2011, J ACOUST SOC AM, V130, P1581, DOI 10.1121/1.3619790
Kim G., 2010, P INT, P1632
KRYTER KD, 1962, J ACOUST SOC AM, V34, P1698, DOI 10.1121/1.1909096
Li N., 2009, J ACOUST SOC AM, V123, P1673
Li YP, 2009, SPEECH COMMUN, V51, P230, DOI 10.1016/j.specom.2008.09.001
Loizou PC, 2011, IEEE T AUDIO SPEECH, V19, P47, DOI 10.1109/TASL.2010.2045180
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lu Y, 2010, INT CONF ACOUST SPEE, P4754, DOI 10.1109/ICASSP.2010.5495156
Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Martin R, 2006, SIGNAL PROCESS, V86, P1215, DOI 10.1016/j.sigpro.2005.07.037
Martin R, 2005, SIG COM TEC, P43, DOI 10.1007/3-540-27489-8_3
Papoulis A., 2002, PROBABILITY RANDOM V
Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005
Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199
Wang D., 2006, COMPUTATIONAL AUDITO
Whitehead PS, 2011, INT CONF ACOUST SPEE, P5080
NR 31
TC 4
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 272
EP 281
DI 10.1016/j.specom.2011.09.002
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600010
ER
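Illustrative note on the preceding record (Chen and Loizou): a gain (suppression) function computed from an SNR estimate, with an optional upper bound that limits the damage done by SNR over-estimation in noise-dominated bins, can be sketched as below. The Wiener-type gain, the fixed ceiling value, and the example SNRs are illustrative assumptions; the paper derives its upper bound analytically, which is not reproduced here.

```python
# Sketch of SNR-driven gain estimation with an upper bound on the gain (illustrative only).
import numpy as np

def wiener_gain(snr_estimate, gain_ceiling=None):
    """Wiener-type suppression gain per frequency bin, optionally clipped from above."""
    g = snr_estimate / (1.0 + snr_estimate)
    if gain_ceiling is not None:
        g = np.minimum(g, gain_ceiling)     # limits the effect of SNR over-estimation errors
    return g

true_snr = np.array([0.1, 0.5, 4.0])        # linear a priori SNRs; the first two bins are noise-dominated
over_est = true_snr * 10.0                  # a gross SNR over-estimate
print(wiener_gain(over_est).round(2))                     # gains pushed toward 1.0, so noise leaks through
print(wiener_gain(over_est, gain_ceiling=0.6).round(2))   # a ceiling caps the gain in the over-estimated bins
```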
PT J
AU Paliwal, K
Schwerin, B
Wojcicki, K
AF Paliwal, Kuldip
Schwerin, Belinda
Wojcicki, Kamil
TI Speech enhancement using a minimum mean-square error short-time spectral
modulation magnitude estimator
SO SPEECH COMMUNICATION
LA English
DT Article
DE Modulation domain; Analysis-modification-synthesis (AMS); Speech
enhancement; MMSE short-time spectral magnitude estimator (AME);
Modulation spectrum; Modulation magnitude spectrum; MMSE short-time
modulation magnitude estimator (MME)
ID AMPLITUDE ESTIMATOR; QUALITY ESTIMATION; STATISTICAL-MODEL; NOISE;
INTELLIGIBILITY; RECOGNITION; SUPPRESSION; SUBTRACTION; SNR
AB In this paper we investigate the enhancement of speech by applying MMSE short-time spectral magnitude estimation in the modulation domain. For this purpose, the traditional analysis-modification-synthesis framework is extended to include modulation domain processing. We compensate the noisy modulation spectrum for additive noise distortion by applying the MMSE short-time spectral magnitude estimation algorithm in the modulation domain. A number of subjective experiments were conducted. Initially, we determine the parameter values that maximise the subjective quality of stimuli enhanced using the MMSE modulation magnitude estimator. Next, we compare the quality of stimuli processed by the MMSE modulation magnitude estimator to those processed using the MMSE acoustic magnitude estimator and the modulation spectral subtraction method, and show that good improvement in speech quality is achieved through use of the proposed approach. Then we evaluate the effect of including speech presence uncertainty and log-domain processing on the quality of enhanced speech, and find that this method works better with speech presence uncertainty. Finally we compare the quality of speech enhanced using the MMSE modulation magnitude estimator (when used with speech presence uncertainty) with that enhanced using different acoustic domain MMSE magnitude estimator formulations, and those enhanced using different modulation domain based enhancement algorithms. Results of these tests show that the MMSE modulation magnitude estimator improves the quality of processed stimuli, without introducing musical noise or spectral smearing distortion. The proposed method is shown to have better noise suppression than MMSE acoustic magnitude estimation, and improved speech quality compared to other modulation domain based enhancement methods considered. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Paliwal, Kuldip; Schwerin, Belinda; Wojcicki, Kamil] Griffith Univ, Signal Proc Lab, Griffith Sch Engn, Nathan, Qld 4111, Australia.
RP Schwerin, B (reprint author), Griffith Univ, Signal Proc Lab, Griffith Sch Engn, Nathan, Qld 4111, Australia.
EM belinda.schwerin@griffithuni.edu.au
CR Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013
Atlas L., 2004, P IEEE INT C AC SPEE, V2, P761
Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Breithaupt C, 2011, IEEE T AUDIO SPEECH, V19, P277, DOI 10.1109/TASL.2010.2047681
Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283
Cohen I, 2005, IEEE T SPEECH AUDI P, V13, P870, DOI 10.1109/TSA.2005.851940
Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Falk T., 2008, P INT WORKSH AC ECH
Falk T. H., 2007, P ISCA C INT SPEECH, P970
Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P90, DOI 10.1109/TASL.2009.2023679
Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P1766, DOI 10.1109/TASL.2010.2052247
GRAY RM, 1980, IEEE T ACOUST SPEECH, V28, P367, DOI 10.1109/TASSP.1980.1163421
GREENBERG S, 1997, P ICASSP, V3, P1647
Hermansky H., 1995, P IEEE INT C AC SPEE, V1, P405
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
Huang X., 2001, SPOKEN LANGUAGE PROC
ITU-T P. 835, 2007, P835 ITUT
Kim DS, 2004, IEEE SIGNAL PROC LET, V11, P849, DOI 10.1109/LSP.2004.835466
Kim DS, 2005, IEEE T SPEECH AUDI P, V13, P821, DOI 10.1109/TSA.2005.851924
Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6
LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540
Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lyons J., 2008, P ISCA C INT SPEECH, P387
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Martin R., 1994, P 7 EUR SIGN PROC C, P1182
MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394
Paliwal K, 2011, SPEECH COMMUN, V53, P327, DOI 10.1016/j.specom.2010.10.004
Paliwal K, 2008, IEEE SIGNAL PROC LET, V15, P785, DOI 10.1109/LSP.2008.2005755
Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004
PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
Quatieri T. F., 2002, DISCRETE TIME SPEECH
Rabiner L. R., 2010, THEORY APPL DIGITAL
Rix A., 2001, P862 ITUT
Scalart P., 1996, P ICASSP, V2, P629
Shannon B., 2006, P INT C SPOK LANG PR, P1423
Sim BL, 1998, IEEE T SPEECH AUDI P, V6, P328
So S, 2011, SPEECH COMMUN, V53, P818, DOI 10.1016/j.specom.2011.02.001
Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1
Thompson JK, 2003, P IEEE INT C AC SPEE, V5, P397
Tyagi V., 2003, P ISCA EUR C SPEECH, P981
Vary P, 2006, DIGITAL SPEECH TRANS
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920
Wiener N., 1949, EXTRAPOLATION INTERP
Wu S., 2009, INT C DIG SIGN PROC
ZADEH LA, 1950, P IRE, V38, P291, DOI 10.1109/JRPROC.1950.231083
NR 52
TC 10
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 282
EP 305
DI 10.1016/j.specom.2011.09.003
PG 24
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600011
ER
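Illustrative note on the preceding record (Paliwal et al.): the extension of the analysis-modification-synthesis framework to the modulation domain rests on a second short-time transform taken along the time axis of the acoustic magnitude spectrogram. The fragment below sketches only that second transform; frame lengths, windowing, the MMSE gain itself, and the synthesis stage are omitted, and all sizes are assumed values.

```python
# Minimal sketch of the two-stage analysis behind modulation-domain processing (illustrative only).
import numpy as np

def modulation_magnitudes(acoustic_mag, mod_frame=32):
    """Second short-time transform along the time axis of an acoustic magnitude spectrogram.
    acoustic_mag: (n_frames, n_freq_bins) array of |STFT| values."""
    segment = acoustic_mag[:mod_frame]                 # one modulation frame per acoustic frequency bin
    mod_spec = np.fft.rfft(segment, axis=0)            # modulation spectrum for each acoustic bin
    return np.abs(mod_spec), np.angle(mod_spec)        # magnitude is modified; phase is kept for synthesis

spec = np.abs(np.random.default_rng(0).standard_normal((64, 129)))   # placeholder spectrogram
mag, phase = modulation_magnitudes(spec)
print(mag.shape)                                       # (17, 129): modulation bins x acoustic bins
```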
PT J
AU Hines, A
Harte, N
AF Hines, Andrew
Harte, Naomi
TI Speech intelligibility prediction using a Neurogram Similarity Index
Measure
SO SPEECH COMMUNICATION
LA English
DT Article
DE Auditory periphery model; Simulated performance intensity function;
NSIM; SSIM; Speech Intelligibility
ID AUDITORY-NERVE RESPONSES; QUALITY ASSESSMENT; PHENOMENOLOGICAL MODEL;
STRUCTURAL SIMILARITY; TEMPORAL INFORMATION; NORMAL-HEARING;
RECOGNITION; PERCEPTION; PERIPHERY; LOUDNESS
AB Discharge patterns produced by fibres from normal and impaired auditory nerves in response to speech and other complex sounds can be discriminated subjectively through visual inspection. Similarly, responses from auditory nerves where speech is presented at diminishing sound levels progressively deteriorate from those at normal listening levels. This paper presents a Neurogram Similarity Index Measure (NSIM) that automates this inspection process, and translates the response pattern differences into a bounded discrimination metric.
Performance intensity functions can be used to provide additional information over measurement of speech reception threshold and maximum phoneme recognition by plotting a test subject's recognition probability over a range of sound intensities. A computational model of the auditory periphery was used to replace the human subject and develop a methodology that simulates a real listener test. The newly developed NSIM is used to evaluate the model outputs in response to Consonant-Vowel-Consonant (CVC) word lists and produce phoneme discrimination scores. The simulated results are rigorously compared to those from normal hearing subjects in both quiet and noise conditions. The accuracy of the tests and the minimum number of word lists necessary for repeatable results is established and the results are compared to predictions using the speech intelligibility index (SII). The experiments demonstrate that the proposed simulated performance intensity function (SPIF) produces results with confidence intervals within the human error bounds expected with real listener tests. This work represents an important step in validating the use of auditory nerve models to predict speech intelligibility. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Hines, Andrew; Harte, Naomi] Trinity Coll Dublin, Sigmedia Grp, Dept Elect & Elect Engn, Dublin, Ireland.
RP Hines, A (reprint author), Trinity Coll Dublin, Sigmedia Grp, Dept Elect & Elect Engn, Dublin, Ireland.
EM hinesa@tcd.ie
CR American National Standards Institute (ANSI), 1997, S351997R2007 ANSI
Bondy J, 2004, ADV NEUR IN, V16, P1409
Boothroyd A, 1968, SOUND, V2, P3
Boothroyd A., 2006, COMPUTER AIDED SPEEC
Boothroyd A, 2008, EAR HEARING, V29, P479, DOI 10.1097/AUD.0b013e318174f067
BOOTHROY.A, 1968, J ACOUST SOC AM, V43, P362, DOI 10.1121/1.1910787
Bruce IC, 2003, J ACOUST SOC AM, V113, P369, DOI 10.1121/1.1519544
Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959
Dillon H., 2001, HEARING AIDS
Elhilali M, 2003, SPEECH COMMUN, V41, P331, DOI 10.1016/S0167-6393(02)00134-6
Gallun F, 2008, EAR HEARING, V29, P800, DOI 10.1097/AUD.0b013e31817e73ef
Gelfand SA, 1998, J SPEECH LANG HEAR R, V41, P1088
Hines A, 2010, SPEECH COMMUN, V52, P736, DOI 10.1016/j.specom.2010.04.006
HOCHBERG I, 1975, AUDIOLOGY, V14, P27
Huber R, 2006, IEEE T AUDIO SPEECH, V14, P1902, DOI 10.1109/TASL.2006.883259
Ibrahim RA, 2010, NEUROPHYSIOLOGICAL BASES OF AUDITORY PERCEPTION, P429, DOI 10.1007/978-1-4419-5686-6_40
JERGER J, 1971, ARCHIV OTOLARYNGOL, V93, P573
Jurgens T, 2009, J ACOUST SOC AM, V126, P2635, DOI 10.1121/1.3224721
Jurgens T., 2010, INTERSPEECH 2010 MAK, P2478
Kandadai S, 2008, INT CONF ACOUST SPEE, P221, DOI 10.1109/ICASSP.2008.4517586
Mackersie C L, 2001, J Am Acad Audiol, V12, P390
Markides A, 1978, Br J Audiol, V12, P40, DOI 10.3109/03005367809078852
McCreery R, 2010, EAR HEARING, V31, P95, DOI 10.1097/AUD.0b013e3181bc7702
Moore B.C.J, 2007, PSYCHOL TECHNICAL IS
Olsen WO, 1997, EAR HEARING, V18, P175, DOI 10.1097/00003446-199706000-00001
PREMINGER JE, 1995, J SPEECH HEAR RES, V38, P714
ROSEN S, 1992, PHILOS T ROY SOC B, V336, P367, DOI 10.1098/rstb.1992.0070
Sachs MB, 2002, ANN BIOMED ENG, V30, P157, DOI 10.1114/1.1458592
SAMMETH CA, 1989, EAR HEARING, V10, P94
Smith ZM, 2002, NATURE, V416, P87, DOI 10.1038/416087a
STUDEBAKER GA, 1993, J SPEECH HEAR RES, V36, P799
Wang Z, 2004, IEEE T IMAGE PROCESS, V13, P600, DOI 10.1109/TIP.2003.819861
Young ED, 2008, PHILOS T R SOC B, V363, P923, DOI 10.1098/rstb.2007.2151
Zhang XD, 2001, J ACOUST SOC AM, V109, P648, DOI 10.1121/1.1336503
Zilany MSA, 2009, J ACOUST SOC AM, V126, P2390, DOI 10.1121/1.3238250
Zilany MSA, 2007, 3 INT IEEE EMBS C NE, P481
Zilany MSA, 2006, J ACOUST SOC AM, V120, P1446, DOI 10.1121/1.2225512
NR 37
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2012
VL 54
IS 2
BP 306
EP 320
DI 10.1016/j.specom.2011.09.004
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 853RY
UT WOS:000297444600012
ER
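Illustrative note on the preceding record (Hines and Harte): NSIM is built on an SSIM-style comparison of neurograms. The sketch below computes a global luminance/structure similarity between two equally sized arrays; the paper's measure uses local windows, specific component weights, and auditory-nerve model outputs, none of which are reproduced here, and the data below are random placeholders.

```python
# Sketch of an SSIM-style similarity between two "neurograms" (illustrative only).
import numpy as np

def ssim_like(a, b, c1=0.01, c2=0.03):
    """Global luminance/contrast-structure comparison of two equally sized arrays in [0, 1]."""
    mu_a, mu_b = a.mean(), b.mean()
    va, vb = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    lum = (2 * mu_a * mu_b + c1) / (mu_a**2 + mu_b**2 + c1)
    struct = (2 * cov + c2) / (va + vb + c2)
    return lum * struct

ref = np.random.default_rng(1).random((64, 100))                           # placeholder reference neurogram
deg = np.clip(ref + 0.1 * np.random.default_rng(2).standard_normal(ref.shape), 0, 1)
print(round(ssim_like(ref, ref), 3), round(ssim_like(ref, deg), 3))        # identical vs. degraded
```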
PT J
AU Jaywant, A
Pell, MD
AF Jaywant, Abhishek
Pell, Marc D.
TI Categorical processing of negative emotions from speech prosody
SO SPEECH COMMUNICATION
LA English
DT Article
DE Prosody; Facial expressions; Emotion; Nonverbal cues; Priming;
Category-specific processing
ID AFFECT DECISION TASK; FACIAL EXPRESSION; VOCAL EXPRESSION; AUTOMATIC
ACTIVATION; NONVERBAL EMOTION; CIRCUMPLEX MODEL; PERCEPTION; FACE;
VOICE; RECOGNITION
AB Everyday communication involves processing nonverbal emotional cues from auditory and visual stimuli. To characterize whether emotional meanings are processed with category-specificity from speech prosody and facial expressions, we employed a cross-modal priming task (the Facial Affect Decision Task; Pell, 2005a) using emotional stimuli with the same valence but that differed by emotion category. After listening to angry, sad, disgusted, or neutral vocal primes, subjects rendered a facial affect decision about an emotionally congruent or incongruent face target. Our results revealed that participants made fewer errors when judging face targets that conveyed the same emotion as the vocal prime, and responded significantly faster for most emotions (anger and sadness). Surprisingly, participants responded slower when the prime and target both conveyed disgust, perhaps due to attention biases for disgust-related stimuli. Our findings suggest that vocal emotional expressions with similar valence are processed with category specificity, and that discrete emotion knowledge implicitly affects the processing of emotional faces between sensory modalities. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Jaywant, Abhishek; Pell, Marc D.] McGill Univ, Sch Commun Sci & Disorders, Montreal, PQ H3G 1A8, Canada.
RP Pell, MD (reprint author), McGill Univ, Sch Commun Sci & Disorders, 1266 Ave Pins Ouest, Montreal, PQ H3G 1A8, Canada.
EM marc.pell@mcgill.ca
FU Natural Sciences and Engineering Research Council of Canada; McGill
University
FX This research was financially supported by the Natural Sciences and
Engineering Research Council of Canada (Discovery grants competition)
and by McGill University (William Dawson Scholar Award to MDP). We thank
Catherine Knowles, Shoshana Gal, Laura Monetta, Pan Liu, and Hope
Valeriote for their input and help with data collection and manuscript
preparation.
CR Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
Batty M, 2003, COGNITIVE BRAIN RES, V17, P613, DOI 10.1016/S0926-6410(03)00174-5
Bimler D, 2001, COGNITION EMOTION, V15, P633, DOI 10.1080/02699930143000077
Borod JC, 2000, COGNITION EMOTION, V14, P193
Bowers D., 1993, NEUROPSYCHOLOGY, V7, P433, DOI 10.1037//0894-4105.7.4.433
Carroll NC, 2005, Q J EXP PSYCHOL-A, V58, P1173, DOI 10.1080/02724980443000539
Charash M, 2002, J ANXIETY DISORD, V16, P529, DOI 10.1016/S0887-6185(02)00171-8
Cisler JM, 2009, COGNITION EMOTION, V23, P675, DOI 10.1080/02699930802051599
de Gelder B, 2000, COGNITION EMOTION, V14, P289
de Gelder B, 1999, NEUROSCI LETT, V260, P133, DOI 10.1016/S0304-3940(98)00963-X
EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068
ETCOFF NL, 1992, COGNITION, V44, P227, DOI 10.1016/0010-0277(92)90002-Y
Fazio RH, 2001, COGNITION EMOTION, V15, P115, DOI 10.1080/0269993004200024
Gerber AJ, 2008, NEUROPSYCHOLOGIA, V46, P2129, DOI 10.1016/j.neuropsychologia.2008.02.032
Goldstone RL, 2010, WIRES COGN SCI, V1, P69, DOI 10.1002/wcs.26
HERMANS D, 1994, COGNITION EMOTION, V8, P515, DOI 10.1080/02699939408408957
Hietanen J, 2004, EUR J COGN PSYCHOL, V16, P769, DOI 10.1080/09541440340000330
Hinojosa JA, 2009, EMOTION, V9, P164, DOI 10.1037/a0014680
Juslin PN, 2003, PSYCHOL BULL, V129, P770, DOI 10.1037/0033-2909.129.5.770
Juslin PN, 2001, EMOTION, V1, P381, DOI 10.1037//1528-3542.1.4.381
Kreifelts B, 2007, NEUROIMAGE, V37, P1445, DOI 10.1016/j.neuroimage.2007.06.020
Kreifelts B, 2009, NEUROPSYCHOLOGIA, V47, P3059, DOI 10.1016/j.neuropsychologia.2009.07.001
Krolak-Salmon P, 2001, EUR J NEUROSCI, V13, P987, DOI 10.1046/j.0953-816x.2001.01454.x
Laukka P, 2005, COGNITION EMOTION, V19, P633, DOI 10.1080/02699930441000445
Laukka P, 2005, EMOTION, V5, P277, DOI 10.1037/1528-3542.5.3.277
LEVENSON RW, 1990, PSYCHOPHYSIOLOGY, V27, P363, DOI 10.1111/j.1469-8986.1990.tb02330.x
Massaro DW, 1996, PSYCHON B REV, V3, P215, DOI 10.3758/BF03212421
NIEDENTHAL PM, 1994, PERS SOC PSYCHOL B, V20, P401, DOI 10.1177/0146167294204007
Palermo R, 2004, BEHAV RES METH INS C, V36, P634, DOI 10.3758/BF03206544
Paulmann S, 2009, J PSYCHOPHYSIOL, V23, P63, DOI 10.1027/0269-8803.23.2.63
PAULMANN S, SPEECH COMM IN PRESS
Paulmann S, 2010, COGN AFFECT BEHAV NE, V10, P230, DOI 10.3758/CABN.10.2.230
Paulmann S, 2009, NEUROREPORT, V20, P1603, DOI 10.1097/WNR.0b013e3283320e3f
Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005
PELL MD, TIMECOURSE IN PRESS
Pell MD, 2009, J NONVERBAL BEHAV, V33, P107, DOI 10.1007/s10919-008-0065-7
Pell MD, 2005, J NONVERBAL BEHAV, V29, P193, DOI 10.1007/s10919-005-7720-z
Pell MD, 2008, SPEECH COMMUN, V50, P519, DOI 10.1016/j.specom.2008.03.006
Pell MD, 2011, COGNITION EMOTION, V25, P834, DOI 10.1080/02699931.2010.516915
Pell MD, 2002, BRAIN COGNITION, V48, P499, DOI 10.1006/brxg.2001.1406
Pell MD, 2005, J NONVERBAL BEHAV, V29, P45, DOI 10.1007/s10919-004-0889-8
Posner J, 2005, DEV PSYCHOPATHOL, V17, P715, DOI 10.1017/S0954579405050340
Pourtois G, 2000, NEUROREPORT, V11, P1329, DOI 10.1097/00001756-200004270-00036
SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674
Schirmer A, 2002, COGNITIVE BRAIN RES, V14, P228, DOI 10.1016/S0926-6410(02)00108-8
Schirmer A, 2005, COGNITIVE BRAIN RES, V24, P442, DOI 10.1016/j.cogbrainres.2005.02.022
Schrober M, 2003, SPEECH COMMUN, V40, P99, DOI 10.1016/S0167-6393(02)00078-X
Scott S. K., 2009, HDB MAMMALIAN VOCALI, P187
Simon-Thomas ER, 2009, EMOTION, V9, P838, DOI 10.1037/a0017810
Spruyt A, 2007, EXP PSYCHOL, V54, P44, DOI 10.1027/1618-3169.54.1.44
Tracy JL, 2008, EMOTION, V8, P81, DOI 10.1037/1528-3542.8.1.81
Vroomen J, 2001, COGN AFFECT BEHAV NE, V1, P382, DOI 10.3758/CABN.1.4.382
Williams LM, 2009, J CLIN EXP NEUROPSYC, V31, P257, DOI 10.1080/13803390802255635
Young AW, 1997, COGNITION, V63, P271, DOI 10.1016/S0010-0277(97)00003-6
Zhang Q, 2006, BRAIN RES BULL, V71, P316, DOI 10.1016/j.brainresbull.2006.09.023
NR 55
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2012
VL 54
IS 1
BP 1
EP 10
DI 10.1016/j.specom.2011.05.011
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 831QI
UT WOS:000295745800001
ER
PT J
AU Fersini, E
Messina, E
Archetti, F
AF Fersini, E.
Messina, E.
Archetti, F.
TI Emotional states in judicial courtrooms: An experimental investigation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotion recognition; Pattern recognition; Application
ID SPEECH; CLASSIFICATION; RECOGNITION
AB Thanks to recent progress in the management of judicial proceedings, especially the introduction of audio/video recording facilities, the challenge of identifying emotional states can now be tackled. Discovering affective states embedded in speech signals could help in the semantic retrieval of multimedia clips, and therefore in a deeper understanding of the mechanisms behind courtroom debates and judges'/jurors' decision-making processes. In this paper two main contributions are made: (1) the collection of real-world human emotions from courtroom audio recordings; (2) the investigation of a hierarchical classification system, based on a risk minimization method, able to recognize emotional states from speech signatures. The accuracy of the proposed classification approach - named Multilayer Support Vector Machines - has been evaluated by comparing its performance with that of traditional machine learning approaches on both benchmark datasets and real courtroom recordings. The recognition results obtained by the proposed technique outperform those achieved by traditional approaches such as SVM, k-Nearest Neighbors, Naive Bayes, Decision Trees and Bayesian Networks. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Fersini, E.; Messina, E.; Archetti, F.] Univ Milano Bicocca, I-20126 Milan, Italy.
RP Fersini, E (reprint author), Univ Milano Bicocca, Viale Sarca 336, I-20126 Milan, Italy.
EM fersini@disco.unimib.it
FU European Community [214306]
FX This work has been supported by the European Community FP-7 under the
JUMAS Project (ref.: 214306). The authors would like to thank Gaia
Arosio for the development of the Multi-layer SVM software.
CR AHA DW, 1991, MACH LEARN, V6, P37, DOI 10.1007/BF00153759
Albornoz EM, 2010, LECT NOTES COMPUT SC, V5967, P242
ARCHETTI F, 2008, 1 INT C ICT SOL JUST
BARRACHICOTE R, 2009, 10 ANN C INT SPEECH, P336
BATLINER A, 2004, 4 INT C LANG RES EV, P171
Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1
Bouckaert RR, 2010, J MACH LEARN RES, V11, P2533
Burkhardt F., 2005, INTERSPEECH, P1517
CAMPBELL N, 2002, 3 INT C LANG RES EV
Chavhan Y., 2010, INT J COMPUTER APPL, V1, P6
Cichosz J., 2004, Proceedings of International Conference on Signals and Electronic Systems ICSES'04
COOPER GF, 1992, MACH LEARN, V9, P309, DOI 10.1007/BF00994110
Dellaert F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608022
DEVILLERS L, 2007, REAL LIFE EMOTION RE, P34
Engbert I. S., 2007, DOCUMENTATION DANISH
Fernandez R, 2003, SPEECH COMMUN, V40, P145, DOI 10.1016/S0167-6393(02)00080-8
Fersini E, 2009, LECT NOTES ARTIF INT, V5632, P594, DOI 10.1007/978-3-642-03070-3_45
France DJ, 2000, IEEE T BIO-MED ENG, V47, P829, DOI 10.1109/10.846676
GRIMM M, 2008, IEEE INT C MULT EXP, P865
Guyon I., 2003, Journal of Machine Learning Research, V3, DOI 10.1162/153244303322753616
Hansen J.H.L., 1997, P EUR C SPEECH COMM, P1743
Hastie T, 1998, ADV NEUR IN, V10, P507
JOVICIC S, 2004, P 9 C SPEECH COMP
Karstedt S, 2002, THEOR CRIMINOL, V6, P299
KIM D, 2009, P 16 INT C NEUR INF, P649
Kohavi R., 1996, THESIS STANFORD
LAZARUS R, 2001, RELATIONAL MEANING D, P37
Lee C., 2009, P INTERSPEECH
MAO X, 2009, CSIE 09 P 2009 WRI W, P225
Martin O., 2006, P 22 INT C DAT ENG W, P8, DOI 10.1109/ICDEW.2006.145
Mozziconacci S., 1999, P 14 INT C PHON SCI, P2001
Olshen R., 1984, CLASSIFICATION REGRE, V1st
Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-581(02)00141-6
PAO TL, 2007, ICIC 07 P INT COMP 3, P997
Pereira C., 2000, ITRW SPEECH EMOTION, P25
Petrushin V.A., 2000, P 6 INT C SPOK LANG, P222
REDDY P, 2010, ABS10064548 CORR
Roach P., 1998, J INT PHON ASSOC, V28, P83
Schuller Bjorn, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5372886
Schuller B, 2006, SPEECH PROSODY
Schuller B, 2006, P INT C MULT EXP ICM, P5
Schuller B, 2009, IMAGE VISION COMPUT, V27, P1760, DOI 10.1016/j.imavis.2009.02.013
SCHULLER B, 2007, ICASSP, V2, P733
SEDAAGHI MH, 2007, P 15 EUR SIGN PROC C, P2209
Sethu V, 2008, INT CONF ACOUST SPEE, P5017, DOI 10.1109/ICASSP.2008.4518785
Steininger S., 2002, P WORKSH MULT RES MU, P33
Tato R.S., 2002, P INT C SPOK LANG PR, P2029
VERVERIDIS D, 2004, SIGNAL PROCESS, V1, P593
Vogt Thurid, 2006, P LANG RES EV C LREC
WALKER MA, 2001, EUR C SPEECH LANG PR, P1371
Wollmer M., 2008, P INT BRISB AUSTR, P597
Xiao ZZ, 2010, MULTIMED TOOLS APPL, V46, P119, DOI 10.1007/s11042-009-0319-3
Yacoub S, 2003, P EUR GEN, V1, P729
Yang B, 2010, SIGNAL PROCESS, V90, P1415, DOI 10.1016/j.sigpro.2009.09.009
ZHOU Y, 2009, INT C RES CHALL COMP, P73
NR 55
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2012
VL 54
IS 1
BP 11
EP 22
DI 10.1016/j.specom.2011.06.001
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 831QI
UT WOS:000295745800002
ER
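Illustrative note on the preceding record (Fersini et al.): a hierarchical ("multilayer") SVM routes each sample through a coarse first-layer classifier and then through a second-layer classifier specific to the selected group. The sketch below uses invented placeholder features, labels, and groupings, and plain RBF SVMs; it is not the authors' Multilayer SVM formulation or their risk-minimization training.

```python
# Sketch of a two-layer (hierarchical) SVM classifier (illustrative only).
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))              # placeholder acoustic feature vectors
y = rng.integers(0, 4, 200)                     # 4 emotion classes; assume 0/1 vs 2/3 form two macro-groups

macro = (y >= 2).astype(int)                    # layer 1: coarse grouping of the classes
top = SVC(kernel="rbf").fit(X, macro)
low = {g: SVC(kernel="rbf").fit(X[macro == g], y[macro == g]) for g in (0, 1)}

def predict(x):
    g = int(top.predict(x.reshape(1, -1))[0])           # route through the first layer
    return int(low[g].predict(x.reshape(1, -1))[0])     # refine within the selected group

print(predict(X[0]), int(y[0]))
```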
PT J
AU Djamah, M
O'Shaughnessy, D
AF Djamah, Mouloud
O'Shaughnessy, Douglas
TI Fine granularity scalable speech coding using embedded tree-structured
vector quantization
SO SPEECH COMMUNICATION
LA English
DT Article
DE Embedded quantization; Fast search; Scalable speech coding;
Tree-structured vector quantization
ID LPC PARAMETERS; ALGORITHM; SEARCH; DESIGN
AB This paper proposes an efficient codebook design for tree-structured vector quantization (TSVQ) that is embedded in nature. We modify two speech coding standards by replacing their original quantizers for line spectral frequencies (LSF's) and/or Fourier magnitudes quantization with TSVQ-based quantizers. The modified coders are fine-granular bit-rate scalable with gradual change in quality for the synthetic speech. A fast search encoding algorithm using multistage tree-structured vector quantization (MTVQ) is proposed for quantization of LSF's. The proposed method is compared to the multipath sequential tree-assisted search (MSTS) and to the well known multipath sequential search (MSS) or M-L search algorithms. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Djamah, Mouloud] Univ Quebec, INRS EMT, Bur 6900, Montreal, PQ H5A 1K6, Canada.
RP Djamah, M (reprint author), Univ Quebec, INRS EMT, Bur 6900, 800 Gauchetiere Ouest, Montreal, PQ H5A 1K6, Canada.
EM djamah@emt.inrs.ca; dougo@emt.inrs.ca
CR [Anonymous], 2006, G7291 ITUT
[Anonymous], 1993, P56 ITUT
[Anonymous], 2001, P862 ITUT
Bao K, 2000, IEEE T CIRC SYST VID, V10, P833
BHATTACHARYA B, 1992, IEEE INT C AC SPEECH
CHAN W, 1991, IEEE T COMMUN JAN, P11
CHAN W, 1994, IEEE ICASSP, P521
Chan W.-Y., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. No.92CH3103-9), DOI 10.1109/ICASSP.1992.226194
CHAN WY, 1990, INT CONF ACOUST SPEE, P1109, DOI 10.1109/ICASSP.1990.116130
CHAN WY, 1991, INT CONF ACOUST SPEE, P3597, DOI 10.1109/ICASSP.1991.151052
CHANG RF, 1991, INT CONF ACOUST SPEE, P2281
BEI CD, 1985, IEEE T COMMUN, V33, P1132
CHEMLA D, 1993, P IEEE SPEECH COD WO, P71, DOI 10.1109/SCFT.1993.762344
Chen FC, 2003, INT CONF ACOUST SPEE, P145
Chu WC, 2003, SPEECH CODING ALGORI
CHU WC, 2004, SIGNALS SYSTEMS COMP, V1, P425
Chu WC, 2006, IEEE T AUDIO SPEECH, V14, P1205, DOI 10.1109/TSA.2005.860831
DARPA TIMIT, 1993, AC PHON CONT SPEECH
Djamah M, 2010, INT CONF ACOUST SPEE, P4686, DOI 10.1109/ICASSP.2010.5495190
DJAMAH M, 2009, INT C SIGN IM PROC H, P42
DJAMAH M, 2009, 10 ANN C INT SPEECH, P2603
DONG H, 2002, P IEEE ISCAS, P859
Gersho A., 1992, VECTOR QUANTIZATION
HIWASAKI Y, 2004, NTT TECH REV, V2
*ITU, 1990, G727 ITU
*ITU T, 2007, G729 ITUT CSACELP
*ITU T, 2005, G191 ITUT SOFTW TOOL
*ITU T STUD GROUP, 1995, SQ4695R3 ITUT STUD G
JAFARKHANI H, 1995, P IEEE INT C IM PROC, V2, P81
JUNG S, 2004, P IEEE ICASSP, P285
KABAL P, 1986, IEEE T ACOUST SPEECH, V34, P1419, DOI 10.1109/TASSP.1986.1164983
Katsavounidis I, 1996, IEEE T IMAGE PROCESS, V5, P398, DOI 10.1109/83.480778
LeBlanc WP, 1993, IEEE T SPEECH AUDI P, V1, P373, DOI 10.1109/89.242483
Li WP, 2001, IEEE T CIRC SYST VID, V11, P301
LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577
LYONS DF, 1993, IEEE ICASSP, V5, P602, DOI 10.1109/ICASSP.1993.319883
McCree AV, 1997, IEEE ICASSP, P1591
MCCREE AV, 1995, IEEE T SPEECH AUDI P, V3, P242, DOI 10.1109/89.397089
NISHIGUCHI M, 1999, IEEE SPEECH COD WORK, P84
Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363
RISKIN EA, 1994, IEEE T IMAGE PROCESS, V3, P307, DOI 10.1109/83.287025
SUGAMURA N, 1986, ELSEVIER SPEECH COMM, P199
TAI HM, 1999, IEEE IND ELECT SOC, P762
Tsou SL, 2003, ICICS-PCM 2003, VOLS 1-3, PROCEEDINGS, P1389
NR 44
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2012
VL 54
IS 1
BP 23
EP 39
DI 10.1016/j.specom.2011.06.002
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 831QI
UT WOS:000295745800003
ER
PT J
AU Sangwan, A
Hansen, JHL
AF Sangwan, Abhijeet
Hansen, John H. L.
TI Automatic analysis of Mandarin accented English using phonological
features
SO SPEECH COMMUNICATION
LA English
DT Article
DE Phonological features; Accent analysis; Non-native speaker traits
ID ARTICULATORY FEATURES; SPEECH RECOGNITION; AMERICAN ENGLISH; FOREIGN
ACCENT; CLASSIFICATION; PRONUNCIATION; PERCEPTION; NETWORKS; SPEAKERS;
VOWELS
AB The problem of accent analysis and modeling has been considered from a variety of domains, including linguistic structure, statistical analysis of speech production features, and HMM/GMM (hidden Markov model/Gaussian mixture model) model classification. These studies however fail to connect speech production from a temporal perspective through a final classification strategy. Here, a novel accent analysis system and methodology which exploits the power of phonological features (PFs) is presented. The proposed system exploits the knowledge of articulation embedded in phonology by building Markov models (MMs) of PFs extracted from accented speech. The Markov models capture information in the PF space along two dimensions of articulation: PF state-transitions and state-durations. Furthermore, by utilizing MMs of native and non-native accents, a new statistical measure of "accentedness" is developed which rates the articulation of a word by a speaker on a scale of native-like (+1) to non-native like (-1). The proposed methodology is then used to perform an automatic cross-sectional study of accented English spoken by native speakers of Mandarin Chinese (N-MC). The experimental results demonstrate the capability of the proposed system to perform quantitative as well as qualitative analysis of foreign accents. The work developed in this study can be easily expanded into language learning systems, and has potential impact in the areas of speaker recognition and ASR (automatic speech recognition). (C) 2011 Elsevier B.V. All rights reserved.
C1 [Sangwan, Abhijeet; Hansen, John H. L.] Univ Texas Dallas, CRSS, Richardson, TX 75083 USA.
RP Hansen, JHL (reprint author), Univ Texas Dallas, CRSS, Richardson, TX 75083 USA.
EM john.hansen@utdallas.edu
FU USAF [FA8750-09-C-0067]
FX This work was supported by the USAF under a subcontract to RADC, Inc.,
Contract FA8750-09-C-0067. (Approved for public release. Distribution
unlimited.)
CR Angkititrakul P, 2006, IEEE T AUDIO SPEECH, V14, P634, DOI 10.1109/TSA.2005.851980
Arslan LM, 1996, SPEECH COMMUN, V18, P353, DOI 10.1016/0167-6393(96)00024-6
Arslan LM, 1997, J ACOUST SOC AM, V102, P28, DOI 10.1121/1.419608
Choueiter G, 2008, INT CONF ACOUST SPEE, P4265, DOI 10.1109/ICASSP.2008.4518597
Chreist F., 1964, FOREIGN ACCENT
DAS S, 2004, IEEE NORSIG, P344
FLEGE JE, 1988, J ACOUST SOC AM, V84, P70, DOI 10.1121/1.396876
Flege JE, 1997, J PHONETICS, V25, P437, DOI 10.1006/jpho.1997.0052
Frankel J, 2007, COMPUT SPEECH LANG, V21, P620, DOI 10.1016/j.csl.2007.03.002
FRANKEL J, 2007, INTERSPEECH
Garofolo JS, 1993, TIMIT ACOUSTIC PHONE
Hansen JHL, 2010, SPEECH COMMUN, V52, P777, DOI 10.1016/j.specom.2010.05.004
Jia G, 2006, J ACOUST SOC AM, V119, P1118, DOI 10.1121/1.2151806
Jou SC, 2005, INT CONF ACOUST SPEE, P1009
King S, 2007, J ACOUST SOC AM, V121, P723, DOI 10.1121/1.2404622
King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148
Leung KY, 2006, SPEECH COMMUN, V48, P71, DOI 10.1016/j.specom.2005.05.013
MAK B, 2003, HUMAN LANGUAGE TECHN, V2, P217
MANGAYYAGARI S, 2008, INT C PATT REC
Markov K, 2006, SPEECH COMMUN, V48, P161, DOI 10.1016/j.specom.2005.07.003
METZE F, 2002, ICSLP
Metze F, 2007, SPEECH COMMUN, V49, P348, DOI 10.1016/j.specom.2007.02.009
Morris J, 2008, IEEE T AUDIO SPEECH, V16, P617, DOI 10.1109/TASL.2008.916057
NERI A, 2006, INTERSPEECH
PEDERSEN C, 2007, 6 INT C COMP INF SCI
SALVI G, 2003, EUROSPEECH, P2677
SANGWAN A, 2007, IEEE AUT SPEECH REC, P582
Sangwan A, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1525
Scharenborg O, 2007, SPEECH COMMUN, V49, P811, DOI 10.1016/j.specom.2007.01.005
Tepperman J, 2008, IEEE T AUDIO SPEECH, V16, P8, DOI 10.1109/TASL.2007.909330
WEI S, 2006, INTERSPEECH 06
Zheng Y., 2005, INTERSPEECH, P217
NR 32
TC 3
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2012
VL 54
IS 1
BP 40
EP 54
DI 10.1016/j.specom.2011.06.003
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 831QI
UT WOS:000295745800004
ER
PT J
AU Vijayasenan, D
Valente, F
Bourlard, H
AF Vijayasenan, Deepu
Valente, Fabio
Bourlard, Herve
TI Multistream speaker diarization of meetings recordings beyond MFCC and
TDOA features
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker diarization; Meeting recordings; Multi-stream modeling; NIST
rich transcription; Information bottleneck diarization
ID INFORMATION; SYSTEM
AB Many state-of-the-art diarization systems for meeting recordings are based on the HMM/GMM framework and the combination of spectral (MFCC) and time delay of arrival (TDOA) features. This paper presents an extensive study on how multistream diarization can be improved beyond these two sets of features. While several other features have been proven effective for speaker diarization, little effort has been devoted to integrating them into the state-of-the-art MFCC + TDOA baseline and, to the authors' best knowledge, no positive results have been reported so far. The first contribution of this paper is an analysis of the reasons for this, investigating through a set of oracle experiments the robustness of HMM/GMM diarization when other features (modulation spectrum features and frequency domain linear prediction features) are also integrated. The second contribution is a non-parametric multistream diarization method based on the information bottleneck (IB) approach. In contrast to the HMM/GMM, which relies on log-likelihood combination, it combines the feature streams in a normalized space of relevance variables. The previous analysis is repeated, revealing that the proposed approach is more robust and can actually benefit from sources of information beyond the conventional MFCC and TDOA features. Experiments based on the rich transcription data (heterogeneous meeting data recorded in several different rooms) show that it achieves a very competitive error of only 6.3% when four feature streams are used, compared to 14.9% for the HMM/GMM system. Those results are analyzed in terms of error sensitivity to the stream weightings. To the authors' best knowledge, this is the first successful attempt to reduce the speaker error by combining other features with the MFCC and TDOA features, and the first study to show the shortcomings of the HMM/GMM in going beyond this baseline. As a last contribution, the paper also addresses issues related to the computational complexity of multistream approaches. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Vijayasenan, Deepu; Valente, Fabio; Bourlard, Herve] Idiap Res Inst, CH-1920 Martigny, Switzerland.
RP Valente, F (reprint author), Idiap Res Inst, CH-1920 Martigny, Switzerland.
EM deepu.vijayasenan@idiap.ch; fabio.valente@idiap.ch;
herve.bourlard@idiap.ch
FU Swiss Science Foundation; EU; Hasler Foundation
FX The authors would like to thank colleagues involved in the AMI and IM2
projects, Dr. John Dines (IDIAP) and Samuel Thomas (Johns Hopkins
University) for their help with this work as well as the anonymous
reviewers for their comments. This work was funded by the Swiss Science
Foundation through IM2 grant, by the EU through SSPnet grant and by the
Hasler Foundation through the SESAME grant.
CR Ajmera J, 2004, IEEE SIGNAL PROC LET, V11, P649, DOI 10.1109/LSP.2004.831666
Ajmera J., 2004, THESIS ECOLE POLYTEC
Anguera X., 2006, THESIS U POLITECNICA
Anguera X., 2006, BEAMFORMIT FAST ROBU
Anguera X, 2005, LECT NOTES COMPUT SC, V3869, P402
ANGUERA X, 2006, P AUT SPEECH REC UND, P426
Athineos M., 2003, P IEEE WORKSH AUT SP
Chen S. S., 1998, P DARPA BROADC NEWS, P127
Friedland G, 2009, IEEE T AUDIO SPEECH, V17, P985, DOI 10.1109/TASL.2009.2015089
GANAPATHY S, 2008, P INTERSPEECH BRISB
GUILLERMO A, 2008, THESIS ECOLE POLYTEC
HARREMOES P, 2007, IEEE INT S INF THEOR, P566
KINGSBURY B, 1998, SPEECH COMMUN, V25, P117
Kinnunen T., 2008, P OD SPEAK LANG REC
NOULAS A, 2007, P INT C MULT INT ICM, P350, DOI 10.1145/1322192.1322254
Pardo JM, 2007, IEEE T COMPUT, V56, P1212, DOI 10.1109/TC.2007.1077
PARDO JM, 2006, INT C SPEECH LANG PR
Slonim N, 2002, THESIS HEBREW U JERU
Slonim N., 1999, P ADV NEUR INF PROC, P617
Thomas S, 2008, IEEE SIGNAL PROC LET, V15, P681, DOI 10.1109/LSP.2008.2002708
TISHBY N, 1998, NEC RES I TR
Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256
van Leeuwen DA, 2008, LECT NOTES COMPUT SC, V4625, P475
Vijayasenan D., 2009, 10 ANN C INT SPEECH
VIJAYASENAN D, 2008, INTERSPEECH 2008
Vijayasenan D, 2009, IEEE T AUDIO SPEECH, V17, P1382, DOI 10.1109/TASL.2009.2015698
VINYALS O, 2008, P INT 2008
Wooters C, 2008, LECT NOTES COMPUT SC, V4625, P509
NR 28
TC 0
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2012
VL 54
IS 1
BP 55
EP 67
DI 10.1016/j.specom.2011.07.001
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 831QI
UT WOS:000295745800005
ER
PT J
AU Chiosain, MN
Welby, P
Espesser, R
AF Chiosain, Maire Ni
Welby, Pauline
Espesser, Robert
TI Is the syllabification of Irish a typological exception? An experimental
study
SO SPEECH COMMUNICATION
LA English
DT Article
DE Syllabification; Ambisyllabicity; Irish; Speech perception; Pattern
frequency
ID INTERVOCALIC CONSONANTS; ENGLISH SYLLABICATION; SYLLABLE STRUCTURE;
SEGMENTATION; LANGUAGE; FRENCH; DURATION; VOWEL; SYLLABIFICATION;
PERCEPTION
AB We examined whether Irish speakers syllabify intervocalic consonants as codas (e.g., poca 'pocket' /po:k.ə/ CVC.V), as claimed by many authors, but contrary to claims in phonological theory of a universal preference for syllables with onsets. We conducted a perception experiment using a part-repetition task and presented auditory stimuli consisting of VCV items with a single medial consonant (Cm), varying in the length of V1 and the manner of articulation of Cm (e.g., poca /po:kə/ 'pocket', lofa /lofə/ 'rotten'), as well as VCCV items varying in the length of V1 and consonant sequence type (e.g., masla /maslə/ 'insult', canta /ka:ntə/ 'hunk'). Response patterns were in line with many, though not all, of the findings in the literature for other languages: Listeners preferred syllables with onsets, often treated Cm as ambisyllabic, syllabified Cm as a coda more often when V1 was short, and dispreferred stops as codas.
The results, however, did not completely support the Syllable Onset Segmentation Hypothesis (SOSH), which proposes differing roles for syllable onsets and offsets in word segmentation. For VCCV items, listeners show a great deal of variability in decisions not only about where the first syllable ends, but also about where the second syllable begins, a variability that could not be explained by the number of legal onsets possible for a given consonant sequence.
We examined the hypotheses that variability in perception can be accounted for by (1) variability in production (in the signal) and (2) phoneme pattern frequency. We searched pattern frequencies in an electronic dictionary of Irish, but found no support for an account in which language-specific syllabification patterns reflect patterns of word-initial phoneme sequences. Our investigation of potential acoustic cues to syllable boundaries showed a gradient effect of vowel length on syllabification judgments: the longer the V1 duration, the less likely a closed syllable, in line with results for other languages. For Irish, though, this pattern interestingly holds for all consonant manners except stops. The phonetic analyses point to other language-specific differences in phonetic patterns that cue syllable boundaries. For Irish, unlike English, consonant duration was not a more important cue to syllable boundaries than vowel duration, and there was no evidence that relative duration between the two consonants of a medial sequence signals syllable boundaries. The findings have implications not only for the syllable structure of Irish but also for theories of syllabification more generally. They are relevant to all theoretical and applied work on Irish that makes reference to the syllable. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Chiosain, Maire Ni; Welby, Pauline] Univ Coll Dublin, Sch Irish Celt Studies Irish Folklore & Linguist, Dublin 4, Ireland.
[Welby, Pauline; Espesser, Robert] Univ Aix Marseille 1, CNRS, Lab Parole & Langage, F-13100 Aix En Provence, France.
[Welby, Pauline; Espesser, Robert] Univ Aix Marseille 2, F-13100 Aix En Provence, France.
RP Chiosain, MN (reprint author), Univ Coll Dublin, Sch Irish Celt Studies Irish Folklore & Linguist, John Henry Newman Bldg, Dublin 4, Ireland.
EM maire.nichiosain@ucd.ie; pauline.welby@lpl-aix.fr;
robert.espesser@lpl-aix.fr
FU Foras na Gaeilge
FX We thank Cliona Ni Chiosain, Niall O Ciosain, Anna Ni Ghallachair, Kayla
Reed, and John Walsh for their help in recruiting participants and
providing space to run the experiments, Brian O Raghallaigh and Michelle
Tooher for technical assistance, Michal Boleslav Mechura and Kevin
Scannell for their help with the frequency analyses, Christine "Ni
Mhuilleoir" Meunier for her help with the acoustic analyses, the
audiences at our presentations at the Formal Approaches to Celtic
Linguistics Conference and the Laboratoire Parole et Langage (LPL) for
their helpful feedback, and Foras na Gaeilge for financial support. We
thank two anonymous reviewers for their valuable comments on an earlier
version of the manuscript. We also thank our participants.
CR Arnason Kristjan, 1980, QUANTITY HIST PHONOL
Baayen R. H., 1995, CELEX LEXICAL DATABA
Baayen R. Harald, 2008, ANAL LINGUISTIC DATA
Barry William, 1999, PHONUS I PHONETICS U, V4, P87
Bates D., 2005, R NEWS, V5, P27, DOI DOI 10.1111/J.1523-1739.2005.00280.X
Bell Alan, 1978, SYLLABLES SEGMENTS
Berg T, 2000, J PHONETICS, V28, P187, DOI 10.1006/jpho.2000.0112
Berg Thomas, 2001, NORD J LINGUIST, V24, P71, DOI 10.1080/033258601750266196
Bertinetto Pier Marco, 2004, ITALIAN J LINGUISTIC, V16, P349
BERTINETTO PM, 1994, QUADERNI LAB LINGUIS, V8, P1
Blevins J., 1995, HDB PHONOLOGICAL THE, P206
Boersma P., 2008, PRAAT DOING PHONETIC
BORGSTROM C, 1937, NORSK TIDSSKRIFT SPO, V7, P71
BOSCH A, 1998, TEXAS LINGUISTIC FOR, V41, P1
Bosch Anna, 1998, SCOTTISH GAELIC STUD, V18, P1
BOUCHER VJ, 1988, J PHONETICS, V16, P299
Breatnach Risteard B., 1947, IRISH RING CO WATERF
Breen G, 1999, LINGUIST INQ, V30, P1, DOI 10.1162/002438999553940
BREEN G, 1990, SYLLABLE ARRER UNPUB
CAIRNS CE, 2011, BRILLS HDB LINGUISTI
CHRISTIE WM, 1974, J ACOUST SOC AM, V55, P819, DOI 10.1121/1.1914606
Clements G. N., 1990, PAPERS LAB PHONOLOGY, P283
Clements George N., 1983, LINGUISTIC INQUIRY M, V9
Clements GN, 2009, CURR STUD LINGUIST, P165
Content A, 2001, J MEM LANG, V45, P177, DOI 10.1006/jmla.2000.2775
Content A, 2001, LANG COGNITIVE PROC, V16, P609
COTE MH, OXFORD HDB IN PRESS
Cote MH, 2011, BRILL HANDB LINGUIST, V1, P273
CRYSTAL TH, 1988, J PHONETICS, V16, P285
Cutler A, 2001, LANG SPEECH, V44, P171
Dalton M., 2008, THESIS TRINITY COLL
Dalton M, 2005, LANG SPEECH, V48, P441
DAVIDSEN-NIELSEN N., 1974, J PHONETICS, V2, P15
De Bhaldraithe Tomas, 1945, IRISH COIS FHAIRRGE
de Burca Sean, 1958, IRISH TOURMAKEADY CO
DERWING BL, 1992, LANG SPEECH, V35, P219
DILWORTH A, 1972, MAINLAND DIALECTS SC
Dixon R., 2002, AUSTR LANGUAGES THEI
Dubach Green Antony, 1997, THESIS CORNELL U
Dumay N, 2002, BRAIN LANG, V81, P144, DOI 10.1006/brln.2001.2513
ELLISON TM, 1998, INSTRUMENTAL S UNPUB
Evans N, 2009, BEHAV BRAIN SCI, V32, P429, DOI 10.1017/S0140525X0999094X
FALLOWS D, 1981, J LINGUIST, V17, P309, DOI 10.1017/S0022226700007027
Fery Caroline, 2003, SYLLABLE OPTIMALITY
FHAILLIGH EMA, 1968, IRISH ERRIS
Forster K. I., 2008, J MEMORY LANGUAGE, V59
Gillies William, 1993, CELTIC LANGUAGES, P145
Gillis S, 1996, J CHILD LANG, V23, P487
Giollagain C, 2007, STAIDEAR CUIMSITHEAC
GORDEEVA OB, 2007, WP12 QUEEN MARGARET
Goslin J, 2008, LANG SPEECH, V51, P199, DOI 10.1177/0023830908098540
Goslin J, 2001, LANG SPEECH, V44, P409
GUOMUNDSSON V, 1922, ISLANDSK GRAMMATIK
Harrington J, 2010, PHONETIC ANAL SPEECH
Hay J., 2004, PAPERS LAB PHONOLOGY, VVI, P58
Holmer Nils, 1962, GAELIC KINTYRE
HOOPER JB, 1972, LANGUAGE, V48, P525, DOI 10.2307/412031
Hothorn T, 2008, BIOMETRICAL J, V50, P346, DOI 10.1002/bimj.200810425
Ishikawa K, 2002, LANG SPEECH, V45, P355
ITO J, 1988, THESIS U MASSACHUSET
Jaeger TF, 2008, J MEM LANG, V59, P434, DOI 10.1016/j.jml.2007.11.007
Jakobson Roman, 1956, FUNDAMENTALS LANGUAG
JOANISSE M, 1999, INT C PHON SCI SAN F, P731
Jones D, 1936, OUTLINE ENGLISH PHON
KAHN D, 1980, THESIS MASSACHUSETTS
KEATING PA, 1984, LANGUAGE, V60, P286, DOI 10.2307/413642
KHARLAMOV V, 2009, ANN C CAN LING ASS, P12
KNOTT E, 1974, INTRO IRISH SYLLABIC
Kochetav A, 2004, LANG SPEECH, V47, P351
Ladefoged P., 1996, SOUNDS WORLDS LANGUA
Levin J., 1985, THESIS MIT
LUCE PA, 1985, J ACOUST SOC AM, V78, P1949, DOI 10.1121/1.392651
MACKEN MA, 1990, PAR SYLL PHON PHON, P273
MADDIESON IAN, 1985, PHONETIC LINGUISTICS, P203
McQueen JM, 1998, J MEM LANG, V39, P21, DOI 10.1006/jmla.1998.2568
MEYNADIER Y, 2001, TRAVAUX INTERDISCIPL, V20, P91
NEW B, 2006, TRAITEMENT AUTOMATIQ
Ni Chasaide A., 1999, HDB INT PHONETIC ASS, P111
Ni Chiosain Maire, 1991, THESIS U MASSACHUSET
NICHASAIDE A, 1987, INT C PHON SCI TALL, P28
NICHIOSAIN M, 2007, LANG VAR CHANGE, V19, P51
NICHIOSAIN M, 1994, PHONOLOGICA 1992, P157
O Baoill Donall, 1986, LARCHANUINT DON GHAE
O Cuiv Brian, 1944, IRISH W MUSKERRY CO
O Murchu Mairtin, 1989, E PERTHSHIRE GAELIC
O Searcaigh Seamus, 1925, FOGHRAIDHEACHT GHAED
O Siadhail Micheal, 1975, CORAS FUAIMEANNA GAE
OBAOILL D, 1986, FOCLOIR POCA
O'Connor JD, 1953, WORD, V9, P103
OFTEDAL M, 1956, NORSK TIDSSKRIFT S S, V4
Ohala M, 1999, STUD GENERA GRAMMAR, V45, P93
ORAGHALLAIGH B, 2010, THESIS TRINITY COLL
PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183
PRINCE A, 2004, 2 RUTG U CTR COGN SC
PULGRAM E, 1970, JANUA LINGUARUM SERI, V81
Quene H, 2004, SPEECH COMMUN, V43, P103, DOI 10.1016/j.specom.2004.02.004
Quiggin Edmund Crosby, 1906, DIALECT DONEGAL BEIN
Redford MA, 2005, J PHONETICS, V33, P27, DOI 10.1016/j.wocn.2004.05.003
Redford MA, 1999, J ACOUST SOC AM, V106, P1555, DOI 10.1121/1.427152
Richtsmeier PT, 2011, LAB PHONOLOGY, V2, P157
RUBACH J, 1999, RIV LINGUISTICA, V11, P273
Schiller NO, 1997, LANG SPEECH, V40, P103
SELKIRK L, 1982, STRUCTURE PHONOLOGIC, V2, P337
Shatzman KB, 2006, PERCEPT PSYCHOPHYS, V68, P1, DOI 10.3758/BF03193651
Siadhail Micheal, 1989, MODERN IRISH GRAMMAT
Sjoestedt Marie-Louise, 1931, PHONETIQUE PARLER IR
SOMMER BA, 1970, INT J AM LINGUIST, V36, P57, DOI 10.1086/465090
SOMMER B, 1981, PHONOLOGY 1980S, P231
Spinelli E, 2003, J MEM LANG, V48, P233, DOI 10.1016/S0749-596X(02)00513-2
Spinelli E, 2010, ATTEN PERCEPT PSYCHO, V72, P775, DOI 10.3758/APP.72.3.775
Spinelli E, 2007, LANG COGNITIVE PROC, V22, P828, DOI 10.1080/01690960601076472
Steriade Donca, 1982, THESIS MIT
Summerfield Q, 1987, HEARING EYE PSYCHOL, P3
Suomi K, 1997, J MEM LANG, V36, P422, DOI 10.1006/jmla.1996.2495
Tabain M., 2004, J INT PHON ASSOC, V34, P175, DOI 10.1017/S0025100304001719
Ternes Elmar, 1973, PHONEMIC ANAL SCOTTI
TREIMAN R, 1990, J MEM LANG, V29, P66, DOI 10.1016/0749-596X(90)90010-W
TREIMAN R, 1992, J PHONETICS, V20, P383
TREIMAN R, 1988, J MEM LANG, V27, P87, DOI 10.1016/0749-596X(88)90050-2
Tuller B., 1990, ATTENTION PERFORM, P429
TULLER B, 1991, J SPEECH HEAR RES, V34, P501
van der Hulst Harry, 1999, SYLLABLE VIEWS FACTS
VANDERLUGT A, 1999, THESIS KATHOLIEKE U
Vennemann Th, 1988, PREFERENCE LAWS SYLL
Zec D, 2007, CAMBRIDGE HANDBOOK OF PHONOLOGY, P161
ZIOLKOWSKI MS, 1990, PAR SYLL PHON PHON
ZUE VW, 1976, THESIS MIT
NR 127
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2012
VL 54
IS 1
BP 68
EP 91
DI 10.1016/j.specom.2011.07.002
PG 24
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 831QI
UT WOS:000295745800006
ER
PT J
AU Paulmann, S
Titone, D
Pell, MD
AF Paulmann, Silke
Titone, Debra
Pell, Marc D.
TI How emotional prosody guides your way: Evidence from eye movements
SO SPEECH COMMUNICATION
LA English
DT Article
DE Eye-tracking; Gaze; Speech processing; Affective prosody; Semantics
ID EVENT-RELATED POTENTIALS; SPOKEN-WORD RECOGNITION; AFFECT DECISION TASK;
VISUAL-SEARCH; FACIAL EXPRESSION; SEX-DIFFERENCES; ERP EVIDENCE;
TIME-COURSE; BASIC EMOTIONS; FACE
AB This study investigated cross-modal effects of emotional voice tone (prosody) on face processing during instructed visual search. Specifically, we evaluated whether emotional prosodic cues in speech have a rapid, mandatory influence on eye movements to an emotionally-related face, and whether these effects persist as semantic information unfolds. Participants viewed an array of six emotional faces while listening to instructions spoken in an emotionally congruent or incongruent prosody (e.g., "Click on the happy face" spoken in a happy or angry voice). The duration and frequency of eye fixations were analyzed when only prosodic cues were emotionally meaningful (pre-emotional label window: "Click on the/ ... "), and after emotional semantic information was available (post-emotional label window: " ... /happy face"). In the pre-emotional label window, results showed that participants made immediate use of emotional prosody, as reflected in significantly longer and more frequent fixations to emotionally congruent versus incongruent faces. However, when explicit semantic information in the instructions became available (post-emotional label window), the influence of prosody on measures of eye gaze was relatively minimal. Our data show that emotional prosody has a rapid impact on gaze behavior during social information processing, but that prosodic meanings can be overridden by semantic cues when linguistic information is task relevant. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Paulmann, Silke] Univ Essex, Dept Psychol, Colchester CO4 3SQ, Essex, England.
[Titone, Debra] McGill Univ, Dept Psychol, Montreal, PQ, Canada.
[Pell, Marc D.] McGill Univ, Sch Commun Sci & Disorders, Montreal, PQ, Canada.
[Titone, Debra; Pell, Marc D.] McGill Ctr Res Language Mind & Brain, Montreal, PQ, Canada.
RP Paulmann, S (reprint author), Univ Essex, Dept Psychol, Wivenhoe Pk, Colchester CO4 3SQ, Essex, England.
EM paulmann@essex.ac.uk
FU Center for Research on Language, Mind and Brain (CRLMB); German Academic
Exchange Service (DAAD); McGill University
FX The authors wish to thank Matthieu Couturier for help with programming
the experiment, Abhishek Jaywant for help with data acquisition, Stephen
Hopkins, Moritz Dannhauer, and Cord Plasse for help with data analysis,
and Catherine Knowles for help with tables and figures. This work was
supported by a new initiative fund awarded to the authors by the Center
for Research on Language, Mind and Brain (CRLMB). Support received from
the German Academic Exchange Service (DAAD) to the first author and
McGill University (William Dawson Scholar award) to the third author is
gratefully acknowledged.
CR Allopenna PD, 1998, J MEM LANG, V38, P419, DOI 10.1006/jmla.1997.2558
Ashley V, 2004, NEUROREPORT, V15, P211, DOI 10.1097/01.wnr.0000091411.19795.f5
Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
Batty M, 2003, COGNITIVE BRAIN RES, V17, P613, DOI 10.1016/S0926-6410(03)00174-5
Berman JMJ, 2010, J EXP CHILD PSYCHOL, V107, P87, DOI 10.1016/j.jecp.2010.04.012
Besson M, 2002, TRENDS COGN SCI, V6, P405, DOI 10.1016/S1364-6613(02)01975-7
Boersma P., 2009, PRAAT DOING PHONETIC
Borod JC, 2000, COGNITION EMOTION, V14, P193
Bostanov V, 2004, PSYCHOPHYSIOLOGY, V41, P259, DOI 10.1111/j.1469-8986.2003.00142.x
BOWER GH, 1981, AM PSYCHOL, V36, P129, DOI 10.1037//0003-066X.36.2.129
Bowers D., 1993, NEUROPSYCHOLOGY, V7, P433, DOI 10.1037//0894-4105.7.4.433
BOWERS D, 1987, NEUROPSYCHOLOGIA, V25, P317, DOI 10.1016/0028-3932(87)90021-2
Brosch T., 2008, J COGNITIVE NEUROSCI, V21, P1670
Calvo MG, 2008, EXP PSYCHOL, V55, P359, DOI 10.1027/1618-3169.55.6.359
Calvo MG, 2008, J EXP PSYCHOL GEN, V137, P471, DOI 10.1037/a0012771
Carroll JM, 1996, J PERS SOC PSYCHOL, V70, P205, DOI 10.1037/0022-3514.70.2.205
Carroll NC, 2005, Q J EXP PSYCHOL-A, V58, P1173, DOI 10.1080/02724980443000539
Dahan D, 2005, PSYCHON B REV, V12, P453, DOI 10.3758/BF03193787
de Gelder B, 2000, COGNITION EMOTION, V14, P289
Eastwood JD, 2001, PERCEPT PSYCHOPHYS, V63, P1004, DOI 10.3758/BF03194519
Eimer M, 2002, NEUROREPORT, V13, P427, DOI 10.1097/00001756-200203250-00013
Eimer M, 2003, COGN AFFECT BEHAV NE, V3, P97, DOI 10.3758/CABN.3.2.97
EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068
EKMAN P, 1969, SCIENCE, V164, P86, DOI 10.1126/science.164.3875.86
Fazio RH, 2001, COGNITION EMOTION, V15, P115, DOI 10.1080/0269993004200024
Frischen A, 2008, PSYCHOL BULL, V134, P662, DOI 10.1037/0033-2909.134.5.662
Grimshaw GM, 1998, BRAIN COGNITION, V36, P108, DOI 10.1006/brcg.1997.0949
HANSEN CH, 1988, J PERS SOC PSYCHOL, V54, P917, DOI 10.1037/0022-3514.54.6.917
HANSEN CH, 1995, PERS SOC PSYCHOL B, V21, P548, DOI 10.1177/0146167295216001
Henderson J., 2004, INTERFACE LANGUAGE V
HESS U, 1988, MULTICHANNEL COMMUNI
Hietanen J, 2004, EUR J COGN PSYCHOL, V16, P769, DOI 10.1080/09541440340000330
Horstmann G, 2007, VIS COGN, V15, P799, DOI 10.1080/13506280600892798
Innes-Ker A, 2002, J PERS SOC PSYCHOL, V83, P804, DOI 10.1037//0022-3514.83.4.804
Isaacowitz D. M., 2008, PSYCHOL SCI, V19, P843
ITO K, 2008, J MEM LANG, P541
Johnstone T., 2000, HDB EMOTIONS, V2nd, P220
Kissler J, 2008, EXP BRAIN RES, V188, P215, DOI 10.1007/s00221-008-1358-0
Kitayama S, 2002, COGNITION EMOTION, V16, P29, DOI 10.1080/0269993943000121
Koelsch S, 2005, TRENDS COGN SCI, V9, P578, DOI 10.1016/j.tics.2005.10.001
Kotz SA, 2007, BRAIN RES, V1151, P107, DOI 10.1016/j.brainres.2007.03.015
Massaro DW, 1996, PSYCHON B REV, V3, P215, DOI 10.3758/BF03212421
MATIN E, 1993, PERCEPT PSYCHOPHYS, V53, P372, DOI 10.3758/BF03206780
MUMANUS MS, 2009, GAZE FIXATION PERCEP
NIEDENTHAL PM, 1995, PSYCHOL LEARN MOTIV, V33, P23, DOI 10.1016/S0079-7421(08)60371-0
Nummenmaa L, 2009, J EXP PSYCHOL HUMAN, V35, P305, DOI 10.1037/a0013626
Nygaard LC, 2002, MEM COGNITION, V30, P583, DOI 10.3758/BF03194959
Ohman A, 2001, J EXP PSYCHOL GEN, V130, P466, DOI 10.1037/0096-3445.130.3.466
Paulmann S, 2011, MOTIV EMOTION, V35, P192, DOI 10.1007/s11031-011-9206-0
PAULMANN S, 2006, P ARCH MECH LANG PRO, P37
Paulmann S, 2008, BRAIN LANG, V105, P59, DOI 10.1016/j.bandl.2007.11.005
Paulmann S, 2009, J PSYCHOPHYSIOL, V23, P63, DOI 10.1027/0269-8803.23.2.63
Paulmann S, 2008, NEUROREPORT, V19, P209, DOI 10.1097/WNR.0b013e3282f454db
Paulmann S, 2010, COGN AFFECT BEHAV NE, V10, P230, DOI 10.3758/CABN.10.2.230
Paulmann S, 2008, BRAIN LANG, V104, P262, DOI 10.1016/j.bandl.2007.03.002
Paulmann S, 2009, NEUROREPORT, V20, P1603, DOI 10.1097/WNR.0b013e3283320e3f
Pell MD, 2009, J PHONETICS, V37, P417, DOI 10.1016/j.wocn.2009.07.005
Pell MD, 2006, BRAIN LANG, V96, P221, DOI 10.1016/j.bandl.2005.04.007
Pell MD, 2009, J NONVERBAL BEHAV, V33, P107, DOI 10.1007/s10919-008-0065-7
Pell MD, 2005, J NONVERBAL BEHAV, V29, P193, DOI 10.1007/s10919-005-7720-z
Pell MD, 2008, SPEECH COMMUN, V50, P519, DOI 10.1016/j.specom.2008.03.006
PELL MD, COGNITION E IN PRESS
Pell MD, 2005, J NONVERBAL BEHAV, V29, P45, DOI 10.1007/s10919-004-0889-8
Pittman J., 1993, HDB EMOTIONS, P185
Russell JA, 2000, HDB EMOTION
Sauter DA, 2010, J COGNITIVE NEUROSCI, V22, P474, DOI 10.1162/jocn.2009.21215
Sauter DA, 2010, Q J EXP PSYCHOL, V63, P2251, DOI 10.1080/17470211003721642
Scherer K. R., 1989, VOCAL MEASUREMENT EM
Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009
Schirmer A, 2005, NEUROREPORT, V16, P635, DOI 10.1097/00001756-200504250-00024
Schirmer A, 2006, TRENDS COGN SCI, V10, P24, DOI 10.1016/j.tics.2005.11.009
Schirmer A, 2002, COGNITIVE BRAIN RES, V14, P228, DOI 10.1016/S0926-6410(02)00108-8
Schirmer A, 2005, COGNITIVE BRAIN RES, V24, P442, DOI 10.1016/j.cogbrainres.2005.02.022
Schirmer A, 2003, J COGNITIVE NEUROSCI, V15, P1135, DOI 10.1162/089892903322598102
Spivey MJ, 2001, PSYCHOL SCI, V12, P282, DOI 10.1111/1467-9280.00352
TANENHAUS MK, 1995, SCIENCE, V268, P1632, DOI 10.1126/science.7777863
Thompson WF, 2008, COGNITION EMOTION, V22, P1457, DOI 10.1080/02699930701813974
Vroomen J, 2001, COGN AFFECT BEHAV NE, V1, P382, DOI 10.3758/CABN.1.4.382
Vroomen J, 2000, J EXP PSYCHOL HUMAN, V26, P1583, DOI 10.1037/0096-1523.26.5.1583
Wambacq IJA, 2004, NEUROREPORT, V15, P555, DOI 10.1097/01.wnr.0000109989.85243.8f
Weber A, 2006, COGNITION, V99, pB63, DOI 10.1016/j.cognition.2005.07.001
Wilson D, 2006, J PRAGMATICS, V38, P1559, DOI 10.1016/j.pragma.2005.04.012
Young AW, 1997, COGNITION, V63, P271, DOI 10.1016/S0010-0277(97)00003-6
Zeelenberg R, 2010, COGNITION, V115, P202, DOI 10.1016/j.cognition.2009.12.004
NR 84
TC 10
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2012
VL 54
IS 1
BP 92
EP 107
DI 10.1016/j.specom.2011.07.004
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 831QI
UT WOS:000295745800007
ER
PT J
AU Jancovic, P
Zou, X
Kokuer, M
AF Jancovic, Peter
Zou, Xin
Koekueer, Muenevver
TI Speech enhancement based on Sparse Code Shrinkage employing multiple
speech models
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Sparse Code Shrinkage; Independent Component
Analysis; Multiple models; Clustering; Super-Gaussian distribution;
Gaussian mixture model (GMM)
ID NONSTATIONARY NOISE; SIGNAL ENHANCEMENT
AB This paper presents a single-channel speech enhancement system based on the Sparse Code Shrinkage (SCS) algorithm and employment of multiple speech models. The enhancement system consists of two stages: training and enhancement. In the training stage, the Gaussian mixture modelling (GMM) is employed to cluster speech signals in ICA-based transform domain into several categories, and for each category a super-Gaussian model is estimated that is used during the enhancement stage. In the enhancement stage, the estimate of each signal frame is obtained as a weighted average of estimates obtained by using each speech category model. The weights are calculated according to the probability of each category, given the signal enhanced using the conventional SCS algorithm. During the enhancement, the individual speech category models are further adapted at each signal frame. Experimental evaluations are performed on speech signals from the TIMIT database, corrupted by Gaussian noise and three real-world noises, Subway, Street, and Railway noise, from the NOISEX-92 database. Evaluations are performed in terms of segmental SNR, spectral distortion and PESQ measure. Experimental results show that the proposed multi-model SCS enhancement algorithm significantly outperforms the conventional WF, SCS and multi-model WF algorithms. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Jancovic, Peter; Zou, Xin; Koekueer, Muenevver] Univ Birmingham, Sch Elect Elect & Comp Engn, Birmingham B15 2TT, W Midlands, England.
RP Jancovic, P (reprint author), Univ Birmingham, Sch Elect Elect & Comp Engn, Pritchatts Rd, Birmingham B15 2TT, W Midlands, England.
EM p.jancovic@bham.ac.uk; x.zou@ulster.ac.uk; m.kokuer@bham.ac.uk
FU UK EPSRC [EP/F036132/1]
FX This work was supported by UK EPSRC Grant EP/F036132/1.
CR BOLL S, 1979, SIGNAL PROCESS, V27, P120
CHOI C, 2001, P INT C IND COMP AN
EPHRAIM Y, 1992, P IEEE, V80, P1524
EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947
Ephraim Y., 1985, IEEE T ACOUST SPEECH, V33, P251
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Gazor S, 2005, IEEE T SPEECH AUDI P, V13, P896, DOI 10.1109/TSA.2005.851943
Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P457, DOI 10.1109/TSA.2003.815936
Hyvarinen A, 1999, NEURAL COMPUT, V11, P1739, DOI 10.1162/089976699300016214
Hyvarinen A, 2001, INDEPENDENT COMPONEN
Jancovic P, 2007, IEEE SIGNAL PROC LET, V14, P66, DOI 10.1109/LSP.2006.881517
JANCOVIC P, 2011, RECENT ADV ROBUST SP, P103
Kundu A, 2008, INT CONF ACOUST SPEE, P4893, DOI 10.1109/ICASSP.2008.4518754
Lee JH, 2000, ELECTRON LETT, V36, P1506, DOI 10.1049/el:20001028
LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540
Lotter T, 2005, EURASIP J APPL SIG P, V2005, P1110, DOI 10.1155/ASP.2005.1110
Martin R, 2002, INT CONF ACOUST SPEE, P253
MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394
Ming J, 2011, IEEE T AUDIO SPEECH, V19, P822, DOI 10.1109/TASL.2010.2064312
Potamitis I, 2001, INT CONF ACOUST SPEE, P621, DOI 10.1109/ICASSP.2001.940908
Sameti H, 1998, IEEE T SPEECH AUDI P, V6, P445, DOI 10.1109/89.709670
Srinivasan S, 2006, IEEE T AUDIO SPEECH, V14, P163, DOI 10.1109/TSA.2005.854113
Vaseghi Saeed V., 2005, ADV DIGITAL SIGNAL P
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
Wolfe PJ, 2003, EURASIP J APPL SIG P, V2003, P1043, DOI 10.1155/S1110865703304111
YOU C, 2006, SPEECH COMMUN, V48, P50
YOUNG S, 2000, HTK BOOK V3 1
Zhao DY, 2007, IEEE T AUDIO SPEECH, V15, P882, DOI 10.1109/TASL.2006.885256
Zou X, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P415
Zou X, 2008, IEEE T SIGNAL PROCES, V56, P1812, DOI 10.1109/TSP.2007.910555
NR 30
TC 3
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2012
VL 54
IS 1
BP 108
EP 118
DI 10.1016/j.specom.2011.07.005
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 831QI
UT WOS:000295745800008
ER
PT J
AU Do, CT
Pastor, D
Goalic, A
AF Cong-Thanh Do
Pastor, Dominique
Goalic, Andre
TI A novel framework for noise robust ASR using cochlear implant-like
spectrally reduced speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Aurora 2; Cochlear implant; Kullback-Leibler divergence; HMM-based ASR;
Noise robust ASR; Spectrally reduced speech
ID HIDDEN MARKOV-MODELS; MAXIMUM-LIKELIHOOD; AMPLITUDE ESTIMATOR; SUBSPACE
APPROACH; RECOGNITION; ENHANCEMENT; ENVIRONMENTS; SUBTRACTION;
DIVERGENCE; ALGORITHM
AB We propose a novel framework for noise robust automatic speech recognition (ASR) based on cochlear implant-like spectrally reduced speech (SRS). Two experimental protocols (EPs) are proposed in order to clarify the advantage of using SRS for noise robust ASR. These two EPs assess the SRS in both the training and testing environments. Speech enhancement was used in one of two EPs to improve the quality of testing speech. In training, SRS is synthesized from original clean speech whereas in testing, SRS is synthesized directly from noisy speech or from enhanced speech signals. The synthesized SRS is recognized with the ASR systems trained on SRS signals, with the same synthesis parameters. Experiments show that the ASR results, in terms of word accuracy, calculated with ASR systems using SRS, are significantly improved compared to the baseline non-SRS ASR systems. We propose also a measure of the training and testing mismatch based on the Kullback-Leibler divergence. The numerical results show that using the SRS in ASR systems helps in reducing significantly the training and testing mismatch due to environmental noise. The training of the HMM-based ASR systems and the recognition tests were performed by using the HTK toolkit and the Aurora 2 speech database. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Cong-Thanh Do] Idiap Res Inst, Ctr Parc, CH-1920 Martigny, Switzerland.
[Pastor, Dominique; Goalic, Andre] Telecom Bretagne, UMR CNRS Lab STICC 3192, F-29238 Brest 3, France.
RP Do, CT (reprint author), Idiap Res Inst, Ctr Parc, Rue Marconi 19,POB 592, CH-1920 Martigny, Switzerland.
EM cong-thanh.do@idiap.ch
FU Bretagne Regional Council, Bretagne, France
FX This work was initially performed when Cong-Thanh Do was with Telecom
Bretagne, UMR CNRS 3192 Lab-STICC. It was supported by the Bretagne
Regional Council, Bretagne, France.
CR Bhattacharyya A., 1943, Bulletin of the Calcutta Mathematical Society, V35
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Chen J., 2008, SPRINGER HDB SPEECH, P843, DOI 10.1007/978-3-540-49127-9_43
Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717
Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
Do CT, 2010, IEEE T AUDIO SPEECH, V18, P1065, DOI 10.1109/TASL.2009.2032945
DO CT, 2010, P JEP 2010 JOURN ET, P49
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
FURUI S, 1980, IEEE T ACOUST SPEECH, V32, P357
Gales M., 1996, THESIS CAMBRIDGE U
Gales MJF, 1998, SPEECH COMMUN, V25, P49, DOI 10.1016/S0167-6393(98)00029-6
Gauvain JL, 2000, P IEEE, V88, P1181, DOI 10.1109/5.880079
GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J
Gunawan T. S., 2004, P 10 INT C SPEECH SC, P420
Gustafsson H, 2001, IEEE T SPEECH AUDI P, V9, P799, DOI 10.1109/89.966083
Hansen JHL, 2006, IEEE T AUDIO SPEECH, V14, P2049, DOI 10.1109/TASL.2006.876883
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
Hershey J. R., 2007, P ICASSP, V4, P317
HIRSCH HG, 2000, P ISCA ASR2000 AUT S
Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458
Jabloun F, 2003, IEEE T SPEECH AUDI P, V11, P700, DOI 10.1109/TSA.2003.818031
KUBIN G, 1999, P IEEE INT C AC SPEE, V1, P205
KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Leonard R., 1984, P ICASSP, V9, P328
Loizou PC, 1999, IEEE ENG MED BIOL, V18, P32, DOI 10.1109/51.740962
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P1659, DOI 10.1109/29.46548
NADAS A, 1983, IEEE T ACOUST SPEECH, V31, P814, DOI 10.1109/TASSP.1983.1164173
Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005
Shannon BJ, 2006, SPEECH COMMUN, V48, P1458, DOI 10.1016/j.specom.2006.08.003
SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303
Silva J, 2008, IEEE T SIGNAL PROCES, V56, P4176, DOI 10.1109/TSP.2008.924137
Silva J, 2006, IEEE T AUDIO SPEECH, V14, P890, DOI 10.1109/TSA.2005.858059
Young S., 2008, SPRINGER HDB SPEECH, P539, DOI 10.1007/978-3-540-49127-9_27
Young S., 2006, HTK BOOK HTK VERSION
NR 38
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2012
VL 54
IS 1
BP 119
EP 133
DI 10.1016/j.specom.2011.07.006
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 831QI
UT WOS:000295745800009
ER
PT J
AU Nakamura, K
Toda, T
Saruwatari, H
Shikano, K
AF Nakamura, Keigo
Toda, Tomoki
Saruwatari, Hiroshi
Shikano, Kiyohiro
TI Speaking-aid systems using GMM-based voice conversion for
electrolaryngeal speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Electrolaryngeal speech; Voice conversion; Speaking-aid system; Speech
enhancement; Air-pressure sensor; Silence excitation; Non-audible murmur;
Laryngectomee
ID MAXIMUM-LIKELIHOOD; LARYNGECTOMY; CANCER
AB An electrolarynx (EL) is a medical device that generates sound source signals to provide laryngectomees with a voice. In this article we focus on two problems of speech produced with an EL (EL speech). One problem is that EL speech is extremely unnatural and the other is that sound source signals with high energy are generated by an EL, and therefore, the signals often annoy surrounding people. To address these two problems, in this article we propose three speaking-aid systems that enhance three different types of EL speech signals: EL speech, EL speech using an air-pressure sensor (EL-air speech), and silent EL speech. The air-pressure sensor enables a laryngectomee to manipulate the F-0 contours of EL speech using exhaled air that flows from the tracheostoma. Silent EL speech is produced with a new sound source unit that generates signals with extremely low energy. Our speaking-aid systems address the poor quality of EL speech using voice conversion (VC), which transforms acoustic features so that it appears as if the speech is uttered by another person. Our systems estimate spectral parameters, F-0 and aperiodic components independently. The result of experimental evaluations demonstrates that the use of an air-pressure sensor dramatically improves F-0 estimation accuracy. Moreover, it is revealed that the converted speech signals are preferred to source EL speech. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Nakamura, Keigo; Toda, Tomoki; Saruwatari, Hiroshi; Shikano, Kiyohiro] Nara Inst Sci & Technol, Grad Sch Informat Sci, Ikoma, Nara 6300192, Japan.
RP Nakamura, K (reprint author), Nara Inst Sci & Technol, Grad Sch Informat Sci, 8916-5 Takayama Cho, Ikoma, Nara 6300192, Japan.
EM kei_go@nifty.com
FU MIC SCOPE; JSPS
FX The authors are grateful to Professor Hideki Kawahara of Wakayama
University, Japan, for permission to use the STRAIGHT analysis-synthesis
method. This research was also supported in part by MIC SCOPE and
Grant-in-Aid for JSPS Fellows.
CR Carr MM, 2000, OTOLARYNG HEAD NECK, V122, P39, DOI 10.1016/S0194-5998(00)70141-0
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
Forastiere AA, 2003, NEW ENGL J MED, V349, P2091, DOI 10.1056/NEJMoa031317
Fukada T., 1992, P ICASSP 92, V1, P137
Goldstein E.A, 2004, IEEE T BIOMED ENG, V51
HASHIBA M, 2001, IEICE T D 2, V94, P1240
Hatamura Y, 2001, J ARTIF ORGANS, V4, P288, DOI 10.1007/BF02480019
HAYES M, 1951, CANC J CLIN, V1, P147
Hocevar-Boltezar I, 2001, RADIOL ONCOL, V35, P249
HOSOI Y, 2003, SP2003105 IEICE, P13
IFUKUBE T, 2003, SOUND BASED ASSISTIV
Imai S., 1983, ELECTR COMMUN JPN, V66, P10, DOI 10.1002/ecja.4400660203
Jemal A, 2008, CA-CANCER J CLIN, V58, P71, DOI 10.3322/CA.2007.0010
KAIN A, 1998, ACOUST SPEECH SIG PR, P285
KAWAHARA H, 2001, 2 MODELS ANAL VOCAL
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Laccourreye O, 1996, LARYNGOSCOPE, V106, P495, DOI 10.1097/00005537-199604000-00019
Liu HJ, 2006, IEEE T BIO-MED ENG, V53, P865, DOI 10.1109/TBME.2006.872821
Miyamoto D, 2009, INT CONF ACOUST SPEE, P3901, DOI 10.1109/ICASSP.2009.4960480
Murakami K., 2004, IEICE T D 1, VJ87-D-I, P1030
NAKAGIRI M, 2006, P INTERSPEECH PITTSB, P2270
Nakajima Y., 2005, P INT 2005, P293
Nakajima Y, 2006, IEICE T INF SYST, VE89D, P1, DOI 10.1093/ietisy/e89-d.1.1
Nakamura K., 2007, IEICE T INF SYST, VJ90-D, P780
OHTANI Y, 2006, P ICSLP SEPT, P2266
Saikachi Y, 2009, J SPEECH LANG HEAR R, V52, P1360, DOI 10.1044/1092-4388(2009/08-0167)
SINGER MI, 1980, ANN OTO RHINOL LARYN, V89, P529
Talkin D., 1995, SPEECH CODING SYNTHE, P495
Toda T., 2009, P INTERSPEECH, P632
Toda T., 2005, P INTERSPEECH LISB P, P1957
Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344
Uemi N., 1994, Proceedings. 3rd IEEE International Workshop on Robot and Human Communication. RO-MAN '94 Nagoya (Cat. No.94TH0679-1), DOI 10.1109/ROMAN.1994.365931
WILLIAMS SE, 1985, ARCH OTOLARYNGOL, V111, P216
NR 33
TC 15
Z9 15
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2012
VL 54
IS 1
BP 134
EP 146
DI 10.1016/j.specom.2011.07.007
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 831QI
UT WOS:000295745800010
ER
PT J
AU Kong, YY
Mullangi, A
AF Kong, Ying-Yee
Mullangi, Ala
TI On the development of a frequency-lowering system that enhances
place-of-articulation perception
SO SPEECH COMMUNICATION
LA English
DT Article
DE Frequency lowering; Speech perception; Place of articulation; Fricatives
ID HEARING-LOSS; ENGLISH FRICATIVES; ACOUSTIC CHARACTERISTICS; CONSONANT
IDENTIFICATION; SPEECH-PERCEPTION; TRANSPOSITION; COMPRESSION;
LISTENERS; CHILDREN; DISCRIMINATION
AB Frequency lowering is a form of signal processing designed to deliver high-frequency speech cues to the residual hearing region of a listener with a high-frequency hearing loss. While this processing technique has been shown to improve the intelligibility of fricative and affricate consonants, perception of place of articulation has remained a challenge for hearing-impaired listeners, especially when the bandwidth of the speech signal is reduced during the frequency-lowering processing. This paper describes a modified vocoder-based frequency-lowering system similar to one reported by Posen et al. (1993), with the goal of improving place-of-articulation perception by enhancing the spectral differences of fricative consonants. In this system, frequency lowering is conditional; it suppresses the processing whenever the high-frequency portion (>400 Hz) of the speech signal is a periodic signal. In addition, the system separates non-sonorant consonants into three classes based on the spectral information (slope and peak location) of fricative consonants. Results from a group of normal-hearing listeners with our modified system show improved perception of frication and affrication features, as well as place-of-articulation distinction, without degrading the perception of nasals and semivowels compared to low-pass filtering and Posen et al's system. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Kong, Ying-Yee] Northeastern Univ, Dept Speech Language Pathol & Audiol, Boston, MA 02115 USA.
[Kong, Ying-Yee; Mullangi, Ala] Northeastern Univ, Bioengn Program, Boston, MA 02115 USA.
RP Kong, YY (reprint author), Northeastern Univ, Dept Speech Language Pathol & Audiol, 106A Forsyth Bldg, Boston, MA 02115 USA.
EM yykong@neu.edu
FU NIH/NIDCD [R03 DC009684-03]
FX We would like to thank Professor Louis Braida for helpful suggestions
and Ikaro Silva for technical support. We also thank Dr. Qian-Jie Fu for
allowing us to use his Matlab programs for performing information
transmission analysis. This work was supported by NIH/NIDCD (R03
DC009684-03 and ARRA supplement, PI: YYK).
CR Ali AMA, 2001, J ACOUST SOC AM, V109, P2217, DOI 10.1121/1.1357814
Baer T, 2002, J ACOUST SOC AM, V112, P1133, DOI 10.1121/1.1498853
BEASLEY DS, 1976, AUDIOLOGY, V15, P395
BEHRENS S, 1988, J ACOUST SOC AM, V84, P861, DOI 10.1121/1.396655
BEHRENS SJ, 1988, J PHONETICS, V16, P295
Boersma P., 2009, PRAAT DOING PHONETIC
BRAIDA LD, 1979, ASHA MONOGRAPH, V19
Dudley H., 1939, J ACOUST SOC AM, V11, P165, DOI 10.1121/1.1902137
Fox RA, 2005, J SPEECH LANG HEAR R, V48, P753, DOI 10.1044/1092-4388(2005/052)
Fullgrabe C, 2010, INT J AUDIOL, V49, P420, DOI 10.3109/14992020903505521
Glista D, 2009, INT J AUDIOL, V48, P632, DOI 10.1080/14992020902971349
HARRIS KS, 1958, LANG SPEECH, V1, P1
Henry Belinda A., 1998, Australian Journal of Audiology, V20, P79
Hogan CA, 1998, J ACOUST SOC AM, V104, P432, DOI 10.1121/1.423247
HUGHES GW, 1956, J ACOUST SOC AM, V28, P303, DOI 10.1121/1.1908271
Jongman A, 2000, J ACOUST SOC AM, V108, P1252, DOI 10.1121/1.1288413
Kong YY, 2011, J SPEECH LANG HEAR R, V54, P959, DOI 10.1044/1092-4388(2010/10-0197)
Korhonen P, 2008, J AM ACAD AUDIOL, V19, P639, DOI 10.3766/jaaa.19.8.7
Kuk F, 2009, J AM ACAD AUDIOL, V20, P465, DOI 10.3766/jaaa.20.8.2
Lippmann R. P., 1980, J ACOUST SOC AM, V67, pS78, DOI 10.1121/1.2018401
Maniwa K, 2009, J ACOUST SOC AM, V125, P3962, DOI 10.1121/1.2990715
McDermott H, 2010, J AM ACAD AUDIOL, V21, P380, DOI 10.3766/jaaa.21.6.3
McDermott HJ, 2000, BRIT J AUDIOL, V34, P353
MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526
Moore F. R., 1990, ELEMENTS COMPUTER MU
Nissen SL, 2005, J ACOUST SOC AM, V118, P2570, DOI 10.1121/1.2010407
NITTROUER S, 1989, J SPEECH HEAR RES, V32, P120
Onaka A., 2000, 8 AUST INT C SPEECH, P134
POSEN MP, 1993, J REHABIL RES DEV, V30, P26
REED CM, 1991, J REHABIL RES DEV, V28, P6782
REED CM, 1983, J ACOUST SOC AM, V74, P409, DOI 10.1121/1.389834
Robinson JD, 2007, INT J AUDIOL, V46, P293, DOI 10.1080/14992020601188591
Shannon RV, 1999, J ACOUST SOC AM, V106, pL71, DOI 10.1121/1.428150
Simpson A, 2006, INT J AUDIOL, V45, P619, DOI 10.1080/14992020600825508
Simpson A, 2005, INT J AUDIOL, V44, P281, DOI 10.1080/14992020500060636
Simpson Andrea, 2009, Trends Amplif, V13, P87, DOI 10.1177/1084713809336421
Stelmachowicz PG, 2004, ARCH OTOLARYNGOL, V130, P556, DOI 10.1001/archotol.130.5.556
Turner Christopher W., 1999, Journal of the Acoustical Society of America, V106, P877, DOI 10.1121/1.427103
VELMANS M, 1973, LANG SPEECH, V16, P224
Velmans M., 1974, BRIT J AUDIOL, V8, P1, DOI 10.3109/03005367409086943
Wolfe J, 2010, J AM ACAD AUDIOL, V21, P618, DOI 10.3766/jaaa.21.10.2
Wolfe J, 2011, INT J AUDIOL, V50, P396, DOI 10.3109/14992027.2010.551788
NR 42
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2012
VL 54
IS 1
BP 147
EP 160
DI 10.1016/j.specom.2011.07.008
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 831QI
UT WOS:000295745800011
ER
PT J
AU Schuller, B
Batliner, A
Steidl, S
AF Schuller, Bjorn
Batliner, Anton
Steidl, Stefan
TI Introduction to the special issue on sensing emotion and affect - Facing
realism in speech processing
SO SPEECH COMMUNICATION
LA English
DT Editorial Material
C1 [Schuller, Bjorn] Tech Univ Munich, Inst Human Machine Commun, D-8000 Munich, Germany.
[Batliner, Anton; Steidl, Stefan] Univ Erlangen Nurnberg, Pattern Recognit Lab, Nurnberg, Germany.
RP Schuller, B (reprint author), Tech Univ Munich, Inst Human Machine Commun, D-8000 Munich, Germany.
EM schuller@tum.de
NR 0
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1059
EP 1061
DI 10.1016/j.specom.2011.07.003
PG 3
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000001
ER
PT J
AU Schuller, B
Batliner, A
Steidl, S
Seppi, D
AF Schuller, Bjorn
Batliner, Anton
Steidl, Stefan
Seppi, Dino
TI Recognising realistic emotions and affect in speech: State of the art
and lessons learnt from the first challenge
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotion; Affect; Automatic classification; Feature types; Feature
selection; Noise robustness; Adaptation; Standardisation; Usability;
Evaluation
ID HIDDEN MARKOV-MODELS; AFFECT RECOGNITION; LINEAR PREDICTION; ALGORITHM;
MEMORY; PERFORMANCE; CONTROVERSY; PERCEPTION; FRAMEWORK; SELECTION
AB More than a decade has passed since research on automatic recognition of emotion from speech became a new field of research in line with its 'big brothers', speech and speaker recognition. This article attempts to provide a short overview of where we are today, how we got there, and what this can tell us about where to go next and how we might get there. In the first part, we address the basic phenomenon, reflecting on the last fifteen years and commenting on databases, modelling and annotation, the unit of analysis, and prototypicality. We then shift to automatic processing, including discussions of features, classification, robustness, evaluation, and implementation and system integration. From there we move on to the first comparative challenge on emotion recognition from speech, the INTERSPEECH 2009 Emotion Challenge organised by (part of) the authors, covering the Challenge's database, Sub-Challenges, participants and their approaches, the winners, the fusion of results, and the lessons learnt, before we finally address the long-standing problems and promising directions for future work. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Schuller, Bjorn] Tech Univ Munich, Inst Human Machine Commun, D-8000 Munich, Germany.
[Batliner, Anton; Steidl, Stefan] Univ Erlangen Nurnberg, Pattern Recognit Lab, Nurnberg, Germany.
[Seppi, Dino] Katholieke Univ Leuven, ESAT, Louvain, Belgium.
RP Schuller, B (reprint author), Tech Univ Munich, Inst Human Machine Commun, D-8000 Munich, Germany.
EM schuller@tum.de
FU European Union [211486, IST-2001-37599, IST-2002-50742,
RTN-CT-2006-035561]; HUMAINE Association; Deutsche Telekom Laboratories
FX This work was partly funded by the European Union under Grant Agreement
No. 211486 (FP7/2007-2013, SEMAINE), IST-2001-37599 (PF-STAR),
IST-2002-50742 (HUMAINE), and RTN-CT-2006-035561 (S2S). The authors
would further like to thank the sponsors of the challenge, the HUMAINE
Association and Deutsche Telekom Laboratories. The responsibility lies
with the authors.
CR AI H, 2006, P INT PITTSB PA, P797
ALHAMES M, 2006, P ICASSP TOUL FRANC, P757
Altun H, 2009, EXPERT SYST APPL, V36, P8197, DOI 10.1016/j.eswa.2008.10.005
Ang J, 2002, P INT C SPOK LANG PR, P2037
Armstrong JS, 2007, INT J FORECASTING, V23, P321, DOI 10.1016/j.ijforecast.2007.03.004
Arunachalam S., 2001, P EUR AALB DENM, P2675
ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679
Athanaselis T, 2005, NEURAL NETWORKS, V18, P437, DOI 10.1016/j.neunet.2005.03.008
Ayadi M. M. H. E., 2007, P ICASSP HON HY, P957
Baggia P., 2007, EMMA EXTENSIBLE MULT
BARRACHICOTE R, 2009, ACOUSTIC EMOTION REC, P336
Batliner A, 2001, P EUR 2001 AALB DENM, P2781
Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1
Batliner A., 2010, ADV HUMAN COMPUTER I, P15, DOI 10.1155/2010/782802
BATLINER A, 2004, P TUT RES WORKSH AFF, P1
Batliner A., 2005, P INT LISB, P489
Batliner A, 2008, USER MODEL USER-ADAP, V18, P175, DOI 10.1007/s11257-007-9039-4
Batliner A., 2003, P EUROSPEECH, P733
Batliner A, 2006, P IS LTC 2006 LJUBL, P240
Batliner A., 2008, P ICASSP 2008 LAS VE, P4497
Batliner A, 2011, COMPUT SPEECH LANG, V25, P4, DOI 10.1016/j.csl.2009.12.003
Batliner A, 2000, P ISCA WORKSH SPEECH, P195
Batliner A., 2007, P INT WORKSH PAR SPE, P17
Batliner A., 2007, P 16 INT C PHONETICS, P2201
Batliner Anton, 2006, P IS LTC 2006 LJ SLO, P246
Bellman RE, 1961, ADAPTIVE CONTROL PRO
Bengio S., 2003, ADV NIPS
Bengio Y, 1995, ADV NEURAL INFORMATI, V7, P427
BODA PP, 2004, P COLING 2004 SAT WO, P22
Boersma P., 2005, PRAAT DOING PHONETIC
Boersma P., 1993, P I PHONETIC SCI, V17, P97
Bogert B, 1963, S TIM SER AN, P209
Bozkurt E., 2009, P INT BRIGHT, P324
BREESE J, 1998, MSTR9841
Brendel M., 2010, P 3 INT WORKSH EMOTI, P58
Burkhardt F., 2005, P INT, P1517
Burkhardt F, 2009, P ACII AMST NETH, P1
Busso C., 2004, P 6 INT C MULT INT, P205, DOI 10.1145/1027933.1027968
Campbell N., 2005, P INT LISB PORT, P465
Chen L. S., 1998, P IEEE WORKSH MULT S, P83
CHENG YM, 2006, P INT WORKSH MULT SI, P238
Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917
Chuang ZJ, 2004, P IEEE INT C MULT EX, V1, P53
Cohen J., 1988, STAT POWER ANAL BEHA, V2nd
COWIE R, 2005, NEURAL NETWORKS, V18, P3388
Cowie R., 2000, P ISCA WORKSH SPEECH, P19
Cowie R., 1999, COMPUT INTELL, P109
COWIE R, 2010, HUMAINE HDB
DAUBECHIES I, 1990, IEEE T INFORM THEORY, V36, P961, DOI 10.1109/18.57199
DAVIS S, 1980, IEEE T ACOUST SPEECH, V29, P917
de Gelder B, 2000, COGNITION EMOTION, V14, P289
de Gelder B, 1999, NEUROSCI LETT, V260, P133, DOI 10.1016/S0304-3940(98)00963-X
Dellaert F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608022
Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007
DEVILLERS L, 2005, P 1 INT C AFF COMP I, P519
Devillers L., 2003, P ICME 2003, P549
Devillers L., 2007, LECT NOTES COMPUTER, V4441, P34
Ding Hui, 2006, P 2006 INT C INT INF, P537
Dumouche P, 2009, P INT C INT 2009 BRI, P344
Elliott C., 1992, THESIS NW U
Engberg I., 1997, P EUR RHOD GREEC, P1695
ERICKSON D, 2004, PHONETICA, V63, P1
Eyben F, 2010, P 3 INT WORKSH EM SA, P77
Eyben F, 2010, J MULTIMODAL USER IN, V3, P7, DOI 10.1007/s12193-009-0032-6
Eyben F., 2010, P ACM MULT MM FLOR I, P1459, DOI 10.1145/1873951.1874246
Eyben F., 2009, P ACII AMST NETH, P576
EYSENCK HJ, 1960, PSYCHOL REV, V67, P269, DOI 10.1037/h0048412
FATTAH SA, 2008, INT C NEUR NETW SIGN, P114
FEHR B, 1984, J EXP PSYCHOL GEN, V113, P464, DOI 10.1037/0096-3445.113.3.464
Fei ZC, 2006, PACLIC 20: Proceedings of the 20th Pacific Asia Conference on Language, Information and Computation, P257
Ferguson CJ, 2009, PROF PSYCHOL-RES PR, V40, P532, DOI 10.1037/a0015808
Fernandez R, 2003, SPEECH COMMUN, V40, P145, DOI 10.1016/S0167-6393(02)00080-8
FILLENBA.S, 1966, LANG SPEECH, V9, P217
Nasoz F., 2004, Cognition, Technology & Work, V6, DOI 10.1007/s10111-003-0143-x
Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347
FLEISS JL, 1969, PSYCHOL BULL, V72, P323, DOI 10.1037/h0028106
Forbes-Riley K., 2004, P HUM LANG TECHN C N
FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412
Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd
Gaussier E., 2005, P 28 ANN INT ACM SIG, P601, DOI DOI 10.1145/1076034.1076148
Gigerenzer G., 2004, J SOCIO-ECON, V33, P587, DOI [DOI 10.1016/J.SOCEC.2004.09.033, DOI 10.1016/J.S0CEC.2004.09.033]
Godbole N., 2007, P INT C WEBL SOC MED
Goertzel B., 2000, P ANN C SOC STUD ART
Grimm M, 2008, P IEEE INT C MULT EX, P865
Grimm M, 2007, LECT NOTES COMPUT SC, V4738, P126
GUNES H, 2005, IEEE INT C SYST MAN, V4, P3437
Hall M., 1998, THESIS WAIKATO U HAM
Hansen J., 1997, P EUROSPEECH 97 RHOD, V4, P1743
HARNAD S, 1987, GROUNDWORK COGNITION
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
HESS W, 1996, COMPUTING PROSODY, P363
HIRSCHBERG J, 2003, P ISCA IEEE WORKSH S, P1
HUI L, 2006, P ICASSP TOUL FRANC, P1
Hyvarinen A, 2001, INDEPENDENT COMPONEN
Inanoglu Z., 2005, P 10 INT C INT US IN, P251, DOI 10.1145/1040830.1040885
Joachims Thorsten, 1998, P 10 EUR C MACH LEAR, P137
Johnstone T., 2000, HDB EMOTIONS, V2nd, P220
Jolliffe I. T., 2002, PRINCIPAL COMPONENT, V2nd
KHARAT GU, 2008, WSEAS T COMPUT, V7
KIESSLING A, 1997, BERICHTE INFORM
Kim E., 2007, P INT C ADV INT MECH, P1
Kim J., 2005, P 9 EUR C SPEECH COM, P809
Kim KH, 2004, MED BIOL ENG COMPUT, V42, P419, DOI 10.1007/BF02344719
Kim S.-M., 2005, P INT JOINT C NAT LA, P61
Kockmann M, 2009, P INT BRIGHT, P348
KWON O.W., 2003, P 8 EUR C SPEECH COM, P125
LASKOWSKI K, 2009, P ICASSP IEEE TAIP T, P4765
Lee C, 2009, P INT, P320
Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534
Lee CM, 2002, P INT C SPOK LANG PR, P873
Lee DD, 1999, NATURE, V401, P788
LEFTER I, 2010, P 11 INT C COMP SYST, P287, DOI 10.1145/1839379.1839430
Liscombe J., 2003, P EUR C SPEECH COMM, P725
Liscombe J, 2005, P INT, P1837
Liscombe J., 2005, P INT LISB PORT, P1845
Litman D., 2003, P ASRU VIRG ISL, P25
Liu H., 2003, P 8 INT C INT US INT, P125
Lizhong Wu, 1999, IEEE Transactions on Multimedia, V1, DOI 10.1109/6046.807953
LOVINS JB, 1968, MECH TRANSL, V11, P22
LUENGO I, 2009, P INT BRIGHT UK SEP, P332
Lugger M., 2008, SPEECH RECOGNITION I, P1
LUGGER M, 2006, P ICASSP TOUL FRANC, P1097
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
MARTIN JC, 2006, INT J HUM ROBOT, V3, P1
Martinez C.A., 2005, IEEE INT WORKSH ROB, P19
Matos S, 2006, IEEE T BIO-MED ENG, V53, P1078, DOI 10.1109/TBME.2006.873548
McGilloway S., 2000, P ISCA WORKSH SPEECH, P207
MEYER D, 2002, REPORT SERIES ADAPTI, V78
MISSEN M, 2009, SCIENCE, V2009, P729
Morrison D, 2007, SPEECH COMMUN, V49, P98, DOI 10.1016/j.specom.2006.11.004
Morrison D, 2007, J NETW COMPUT APPL, V30, P1356, DOI 10.1016/j.jnca.2006.09.005
Morrison D., 2007, International Journal of Intelligent Systems Technologies and Applications, V2, DOI 10.1504/IJISTA.2007.012486
Mower E., 2009, P ACII AMST NETH, P662
Nefian A. V., 2002, P IEEE ICASSP 2002, P2013
Neiberg D, 2006, P INT C SPOK LANG PR, P809
Nickerson RS, 2000, PSYCHOL METHODS, V5, P241, DOI 10.1037//1082-989X.5.2.241
NOGUEIRAS A, 2001, P EUROSPEECH 2001, P2267
NOLL AM, 1967, J ACOUST SOC AM, V41, P293, DOI 10.1121/1.1910339
Nose T., 2007, P INTERSPEECH 2007 A, P2285
Noth E, 2002, SPEECH COMMUN, V36, P45, DOI 10.1016/S0167-6393(01)00025-5
Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S0167-6393(03)00099-2
PACHET F, 2009, EURASIP J AUDIO SPEE
Pal P., 2006, P ICASSP TOUL FRANC, P809
Pang B, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P79
Pantic M, 2003, P IEEE, V91, P1370, DOI 10.1109/JPROC.2003.817122
Perneger TV, 1998, BRIT MED J, V316, P1236
Petrushin V., 1999, P ART NEUR NETW ENG, P7
Picard RW, 2001, IEEE T PATTERN ANAL, V23, P1175, DOI 10.1109/34.954607
Planet S, 2009, P 10 ANN C INT SPEEC, P316
Polzehl T., 2009, P INT BRIGHT, P340
Polzin T., 2000, P ISCA WORKSH SPEECH, P201
Popescu A.-M., 2005, P C HUM LANG TECHN E, P339, DOI 10.3115/1220575.1220618
PORTER MF, 1980, PROGRAM-AUTOM LIBR, V14, P130, DOI 10.1108/eb046814
Potthast M., 2012, HLT NAACL 2004 ASS C, V3, P68
PUDIL P, 1994, PATTERN RECOGN LETT, V15, P1119, DOI 10.1016/0167-8655(94)90127-9
RABINER LR, 1977, IEEE T ACOUST SPEECH, V25, P24, DOI 10.1109/TASSP.1977.1162905
RAHURKAR MA, 2003, P 4 INT S IND COMP A, P1017
Rong J, 2007, P 6 INT C COMP INF S, P419
ROSCH E, 1975, J EXP PSYCHOL GEN, V104, P192, DOI 10.1037//0096-3445.104.3.192
ROZEBOOM WW, 1960, PSYCHOL BULL, V57, P416, DOI 10.1037/h0042040
Russell James A., 2003, V54, P329
SACHS JS, 1967, PERCEPT PSYCHOPHYS, V2, P437, DOI 10.3758/BF03208784
Said Christopher P, 2010, Front Syst Neurosci, V4, P6, DOI 10.3389/fnsys.2010.00006
Salzberg SL, 1997, DATA MIN KNOWL DISC, V1, P317, DOI 10.1023/A:1009752403260
Sato N., 2007, INFORM MEDIA TECHNOL, V2, P835
Scherer K. R., 2003, HDB AFFECTIVE SCI, P433
Schiel F, 1999, P 14 INT C PHON SCI, P607
Schroder M, 2007, LECT NOTES COMPUT SC, V4738, P440
SchrOder M., 2008, P 4 INT WORKSH HUM C
Schroder M, 2006, P LREC 06 WORKSH COR, P88
SCHULLER B., 2005, P INT LISB PORT, P805
SCHULLER B, 2006, P INTERSPEECH 2006 I, P1818
Schuller B, 2009, P INT BRIGHT UK, P1999
Schuller B., 2004, P IEEE INT C AC SPEE, V1
Schuller Bjorn, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5372886
SCHULLER B, 2007, P INT C MULT INT ACM, P30, DOI 10.1145/1322192.1322201
Schuller B., 2007, P INT, P2253
SCHULLER B, 2008, P 1 WORKSH CHILD COM
Schuller B., 2009, P ICASSP TAIP TAIW, P4585
Schuller B., 2005, P IEEE INT C AC SPEE, VI, P325, DOI 10.1109/ICASSP.2005.1415116
SCHULLER B, 2006, P SPEECH PROS 2006 D
SCHULLER B, 2008, P ICASSP LAS VEG NV, P4501
Schuller B, 2009, EURASIP J AUDIO SPEE, DOI 10.1155/2009/942617
Schuller B., 2009, STUDIES LANGUAGE COM, V97, P285
SCHULLER B, 2006, P INTERSPEECH 2006 I, P793
Schuller B., 2007, P ICASSP 2007 HON, P941
Schuller B, 2003, P ICASSP HONG KONG, P1
SCHULLER B, 2008, P 9 INT 2008 INC 12, P265
Schuller Bjorn, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495017
Schuller B, 2006, P INT C MULT EXP ICM, P5
Schuller B, 2009, IMAGE VISION COMPUT, V27, P1760, DOI 10.1016/j.imavis.2009.02.013
Schuller B., 2005, P IEEE INT C MULT EX, P864
SCHULLER B., 2010, P INTERSPEECH 2010 M, P2794
Schuller Bjorn, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495061
Schuller B, 2009, P INT C DOC AN REC B, P858
Schuller B, 2008, LECT NOTES ARTIF INT, V5078, P99, DOI 10.1007/978-3-540-69369-7_12
Schuller Bjorn, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5494986
SCHULLER B, 2008, P ICME HANN GERM, P1333
Schuller Bjorn, 2009, P INTERSPEECH, P312
Schulte B, 2010, Proceedings of the 2010 40th European Microwave Conference (EuMC)
Seppi D., 2010, P SPEECH PROS 2010 C
SEPPI D, 2008, P 1 WORKSH CHILD COM
Seppi D., 2008, P INT BRISB AUSTR, P601
Sethu V, 2007, Proceedings of the 2007 15th International Conference on Digital Signal Processing, P611
Shami M., 2007, LECT NOTES COMPUTER, V4441, P43, DOI 10.1007/978-3-540-74122-0_5
Shaver P. R., 1992, EMOTION, P175
SOOD S, 2004, P 12 ANN ACM INT C M, P280, DOI 10.1145/1027527.1027591
Steidl S., 2009, P INT C AFF COMP INT, P690
STEIDL S, 2008, 11 INT C TEXT SPEECH, P525
Steidl S, 2010, EURASIP J AUDIO SPEE, DOI 10.1155/2010/783954
Steidl S., 2009, THESIS FAU ERLANGEN
STEIDL S, 2004, 7 INT C TEXT SPEECH, P629
Takahashi K., 2004, P 2 INT C AUT ROB AG, P186
ten Bosch L, 2003, SPEECH COMMUN, V40, P213, DOI 10.1016/S0167-6393(02)00083-3
TOMLINSON MJ, 1996, P ICASSP ATL GA US, P812
Truong K.P., 2005, P EUR C SPEECH COMM, P485
Ververidis D., 2003, PCI 2003 9 PANH C IN, P560
Ververidis D., 2006, P EUR SIGN PROC C EU
VIDRASCU L, 2007, P INT WORKSH PAR SPE, P11
Vinciarelli A, 2008, P 10 INT C MULT INT, P61, DOI DOI 10.1145/1452392.1452405
Vlasenko B., 2009, P INT 2009, P2039
Vlasenko B, 2007, P INT, P2249
VLASENKO B, 2008, P INT BRISB AUSTR, P805
Vlasenko B, 2007, LECT NOTES COMPUT SC, V4738, P139
Vogt T, 2008, LECT NOTES ARTIF INT, V5078, P188, DOI 10.1007/978-3-540-69369-7_21
VOGT T, 2009, P ACII AMST NETH, P670
Vogt T, 2009, P INT SPEECH COMM AS, P328
Vogt T., 2005, P MULT EXP AMST, P474
Wagner J., 2005, P IEEE INT C MULT EX, P940
Wagner J, 2007, LECT NOTES COMPUT SC, V4738, P114
WANG Y, 2005, P IEEE C AC SPEECH S, V2, P1125, DOI 10.1109/ICASSP.2005.1415607
WILSON T, 2004, P C AM ASS ART INT A
WIMMER M, 2008, P 3 INT C COMP VIS T, P145
Witten I.H., 2005, DATA MINING PRACTICA
Wollmer M, 2009, NEUROCOMPUTING, V73, P366, DOI 10.1016/j.neucom.2009.08.005
Wollmer M, 2010, IEEE J-STSP, V4, P867, DOI 10.1109/JSTSP.2010.2057200
WOLLMER M, 2009, P ICASSP TAIP TAIW, P3949
Wollmer M., 2008, P INT BRISB AUSTR, P597
WOLPERT DH, 1992, NEURAL NETWORKS, V5, P241, DOI 10.1016/S0893-6080(05)80023-1
WU CH, 2008, AFFECTIVE INFORM PRO, V2, P93
WU S, 2008, P INTERSPEECH BRISB, P638
Wu T., 2005, FDN DATA MINING KNOW, P319
Yi J, 2003, P 3 IEEE INT C DAT M, P427
You M, 2006, PROC IEEE INTL CONF, P1653
Young S., 2006, HTK BOOK HTK VERSION
YU C, 2004, P ICSLP, V1, P1329
Zeng ZH, 2007, LECT NOTES COMPUT SC, V4451, P72
Zeng ZH, 2009, IEEE T PATTERN ANAL, V31, P39, DOI 10.1109/TPAMI.2008.52
Zeng ZH, 2007, IEEE T MULTIMEDIA, V9, P424, DOI 10.1109/TMM.2006.886310
Zhe X., 2002, P INT S COMM SYST NE, P164
Zwicker E, 1999, PSYCHOACOUSTICS FACT
NR 251
TC 75
Z9 76
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1062
EP 1087
DI 10.1016/j.specom.2011.01.011
PG 26
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000002
ER
PT J
AU Fernandez, R
Picard, R
AF Fernandez, Raul
Picard, Rosalind
TI Recognizing affect from speech prosody using hierarchical graphical
models
SO SPEECH COMMUNICATION
LA English
DT Article
DE Affective speech; Prosodic modeling; Graphical models; Paralinguistics
ID EMOTION
AB In this work we develop and apply a class of hierarchical directed graphical models on the task of recognizing affective categories from prosody in both acted and natural speech. A strength of this new approach is the integration and summarization of information using both local (e.g., syllable level) and global prosodic phenomena (e.g., utterance level). In this framework speech is structurally modeled as a dynamically evolving hierarchical model in which levels of the hierarchy are determined by prosodic constituency and contain parameters that evolve according to dynamical systems. The acoustic parameters have been chosen to reflect four main components of speech thought to reflect paralinguistic and affect-specific information: intonation, loudness, rhythm and voice quality. The work is first evaluated on a database of acted emotions and compared to human perceptual recognition of five affective categories where it achieves rates within nearly 10% of human recognition accuracy despite only focusing on prosody. The model is then evaluated on two different corpora of fully spontaneous, affectively-colored, naturally occurring speech between people: Call Home English and BT Call Center. Here the ground truth labels are obtained from examining the agreement of 29 human coders labeling arousal and valence. The best discrimination performance on the natural spontaneous speech, using only the prosody features, obtains a 70% detection rate with 30% false alarms when detecting high arousal negative valence speech in call centers. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Fernandez, Raul] IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA.
[Picard, Rosalind] MIT, Media Lab, Cambridge, MA 02139 USA.
RP Fernandez, R (reprint author), IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA.
EM fernanra@us.ibm.com; picard@media.mit.edu
CR Barra-Chicote R., 2009, P INT, P336
BATLINER A, 2010, ADV HUMAN COMPUT INT, V15
BECKMAN ME, 1996, PROSODY PARSING SPEC, P17
BLACK M, 2010, P INT CHIB JAP, P2030
Burkhardt F., 2005, P INT, P1517
Campbell N., 2003, P 15 INT C PHON SCI, P2417
Chen DQ, 2009, PROCEEDINGS OF THE FIFTH INTERNATIONAL CONFERENCE ON IMAGE AND GRAPHICS (ICIG 2009), P912, DOI 10.1109/ICIG.2009.120
Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7
Cummins F, 1998, J PHONETICS, V26, P145, DOI 10.1006/jpho.1998.0070
DOUGLASCOWIE E, 2005, P INT LISB PORT, P88
DURSTON PJ, 2001, P EUR AALB DENM, P1323
Eyben F, 2010, J MULTIMODAL USER IN, V3, P7, DOI 10.1007/s12193-009-0032-6
Fernandez R., 2004, THESIS MIT
Fernandez R., 2005, P INT 2005 LISB PORT, P473
Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1
Hayes Bruce, 1989, PHONETICS PHONOLOGY, V1, P201
HIRSCHBERG J, 1998, P AAAI SPRING S APPL, P52
KOLLER D, 2009, PRINCIPLES TECHNIQUE
KOMPE R, 1994, P INT C AC SPEECH SI, V2, P173
LADD DR, 1996, CAMBRIDGE STUDIES LI, V79
Laver John, 1994, PRINCIPLES PHONETICS
Lehiste I., 1970, SUPRASEGMENTALS
Mao X, 2009, P WRI WORLD C COMP S, P225
Murphy K, 2007, BAYES NET TOOLBOX MA
Noth E, 2000, IEEE T SPEECH AUDI P, V8, P519, DOI 10.1109/89.861370
Pereira C., 2000, P ISCA WORKSH SPEECH, P25
POLZIN T, 2000, THESIS CARNEGIE MELL
SCHERER K, 2005, PROPOSAL EXEMPLARS W
Schroder M, 2006, P LREC 06 WORKSH COR, P88
Schuller B., 2003, P IEEE INT C AC SPEE, V2, P1
SCHULLER B., 2010, P INTERSPEECH 2010 M, P2794
Schuller Bjorn, 2009, P INTERSPEECH, P312
SEBE N, 2006, P 18 INT C PATT REC, P1136
Selkirk E. O., 1984, PHONOLOGY SYNTAX REL
ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572
VLASENKO B, 2008, P INT BRISB AUSTR, P805
Wang M. Q., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90025-Y
WIGHTMAN CW, 1992, P ICASSP 92, V1, P221
Wightman CW, 1994, IEEE T SPEECH AUDI P, V2, P469, DOI 10.1109/89.326607
WIGHTMAN CW, 1991, P INT C ACOUST SPEEC, V1, P321
WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450
Yanushevskaya I., 2005, P INT LISB PORT, P1849
YANUSHEVSKAYA I, 2008, P SPEECH PROS 2008 I, P709
Zwicker E., 1999, SPRINGER SERIES INFO, V22
NR 44
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1088
EP 1103
DI 10.1016/j.specom.2011.05.003
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000003
ER
PT J
AU Worgan, SF
Moore, RK
AF Worgan, Simon F.
Moore, Roger K.
TI Towards the detection of social dominance in dialogue
SO SPEECH COMMUNICATION
LA English
DT Article
DE Social dominance; Rapport; Affordances; Social kinesthesis
ID SPEECH-PERCEPTION; VOCAL COMMUNICATION; ACCOMMODATION; AFFORDANCES;
RECOGNITION; PSYCHOLOGY
AB When developing human-computer conversational systems, the complex co-acting processes of human-human dialogue present a significant challenge. Para-linguistic features establish rapport between individuals and steer the conversation in directions that cannot be captured by semantic analysis alone. This paper attempts to address part of this challenge by considering the role of para-linguistic features in establishing and manipulating social dominance. We propose that social dominance can be understood as an interaction affordance, revealing action potentials for each signalling participant, and can be detected as a feature of rapport, not of the individual. An analysis of F0 and long-term averaged spectra (LTAS) correlation values for conversational pairs reveals a high degree of accommodation. The nature of this accommodation demonstrates that interlocutors adjust their speech to match the currently dominant individual. We conclude by exploring the implications of these results for the role of rapport and outline potential advances for the detection of emotion in speech by encompassing the entirety of pleasure-arousal-dominance emotional space. (C) 2011 Elsevier B.V. All rights reserved.
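As an illustration of the accommodation measure described above, the following minimal Python sketch correlates per-speaker F0 series and long-term averaged spectra for a conversational pair. The function names, window choices, and toy data are illustrative assumptions, not the authors' procedure.

    import numpy as np

    def ltas(frame_power_spectra):
        """Long-term averaged spectrum: mean power spectrum over all frames of one speaker."""
        return np.asarray(frame_power_spectra, dtype=float).mean(axis=0)

    def accommodation(series_a, series_b):
        """Pearson correlation between two per-speaker feature series
        (e.g. windowed median F0, or LTAS bins) as a crude accommodation measure."""
        a, b = np.asarray(series_a, dtype=float), np.asarray(series_b, dtype=float)
        n = min(len(a), len(b))
        return float(np.corrcoef(a[:n], b[:n])[0, 1])

    rng = np.random.default_rng(0)
    spectra_a = rng.random((100, 64))      # toy per-frame power spectra, speaker A
    spectra_b = rng.random((120, 64))      # toy per-frame power spectra, speaker B
    f0_a = [182.0, 190.5, 175.2, 201.3]    # toy median F0 per analysis window
    f0_b = [176.4, 185.0, 171.8, 196.7]

    print("LTAS correlation:", accommodation(ltas(spectra_a), ltas(spectra_b)))
    print("F0 correlation:", accommodation(f0_a, f0_b))

High correlation values for a pair would then be read as convergence toward the dominant speaker, as the abstract argues.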
C1 [Worgan, Simon F.; Moore, Roger K.] Univ Sheffield, Speech & Hearing Res Grp, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England.
RP Worgan, SF (reprint author), Univ Sheffield, Speech & Hearing Res Grp, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England.
EM S.Worgan@dcs.shef.ac.uk
CR ANDERSON AH, 1991, LANG SPEECH, V34, P351
[Anonymous], SNACK SOUND TOOLKIT
Bock J. K., 1990, COGNITION, V35, P1
Burkhardt F., 2005, P INT, P1517
CATIZONE R, 2010, ARTIFICIAL COMPANION, P157
Crystal D, 1980, 1 DICT LINGUISTICS P
DAWKINS MS, 1991, ANIM BEHAV, V41, P865, DOI 10.1016/S0003-3472(05)80353-7
FOWLER CA, 1986, J PHONETICS, V14, P3
Gibson J. J., 1986, ECOLOGICAL APPROACH
Giles Howard, 1991, LANGUAGE CONTEXTS CO
Good JMM, 2007, THEOR PSYCHOL, V17, P265, DOI 10.1177/0959354307075046
Gratch J., 2006, 6 INT C INT VIRT AG
Gregory SW, 1997, J NONVERBAL BEHAV, V21, P23
Gregory SW, 2002, SOC PSYCHOL QUART, V65, P298
Gregory SW, 2001, LANG COMMUN, V21, P37, DOI 10.1016/S0271-5309(00)00011-2
Hockett C. F., 1955, MANUAL PHONOLOGY
Hodges BH, 2006, PERS SOC PSYCHOL REV, V10, P2, DOI 10.1207/s15327957pspr1001_1
JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631
KANG S, 2009, 8 INT C IND COMP AN
Lauria S, 2007, CIRC SYST SIGNAL PR, V26, P513, DOI 10.1007/s00034-007-4005-9
Levelt W. J. M., 1983, J SEMANT, V2, P205
LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403
Lu YY, 2009, SPEECH COMMUN, V51, P1253, DOI 10.1016/j.specom.2009.07.002
MEHRABIAN A, 1996, BEHAV SCI, V14, P261
*MIT, 2010, MIT AM ENGL MAP TASK
Moore RK, 2007, IEEE T COMPUT, V56, P1176, DOI 10.1109/TC.2007.1080
Nearey TM, 1997, J ACOUST SOC AM, V101, P3241, DOI 10.1121/1.418290
Nittrouer S, 2006, J ACOUST SOC AM, V120, P1799, DOI 10.1121/1.2335273
OHALA JJ, 1986, J PHONETICS, V14, P75
OHALA JJ, 1982, J ACOUST SOC AM, V72, P66
Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI 10.1016/S1071-5819(02)00141-6
Parrill F, 2006, J NONVERBAL BEHAV, V30, P157, DOI 10.1007/s10919-006-0014-2
Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169
Port RF, 2005, LANGUAGE, V81, P927, DOI 10.1353/lan.2005.0195
PORTER RJ, 1986, J PHONETICS, V14, P83
REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191
Schafer AJ, 2000, J PSYCHOLINGUIST RES, V29, P169, DOI 10.1023/A:1005192911512
Scherer K. R., 2003, HDB AFFECTIVE SCI, P433
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
Thorisson KR, 1999, APPL ARTIF INTELL, V13, P449, DOI 10.1080/088395199117342
Vogt T, 2008, LECT NOTES ARTIF INT, V5078, P188, DOI 10.1007/978-3-540-69369-7_21
Ward N., 2004, International Journal of Speech Technology, V7, DOI 10.1023/B:IJST.0000037070.31146.f9
WARD N, 2009, INT 2009 BRIGHT UK, P2431
Ward N, 2000, J PRAGMATICS, V32, P1177, DOI 10.1016/S0378-2166(99)00109-5
WARD NG, 2005, UTEPCS0523
WORGAN SF, 2010, THESIS U SOUTHAMPTON
Worgan SF, 2010, ECOL PSYCHOL, V22, P327, DOI 10.1080/10407413.2010.517125
NR 47
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1104
EP 1114
DI 10.1016/j.specom.2010.12.004
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000004
ER
PT J
AU Forbes-Riley, K
Litman, D
AF Forbes-Riley, Kate
Litman, Diane
TI Benefits and challenges of real-time uncertainty detection and
adaptation in a spoken dialogue computer tutor
SO SPEECH COMMUNICATION
LA English
DT Article
DE Fully automated spoken dialogue tutoring system; Automatic affect
detection and adaptation; Accompanying manual annotation; Controlled
experimental evaluation; Error analysis
ID EMOTIONS
AB We evaluate the performance of a spoken dialogue system that provides substantive dynamic responses to automatically detected user affective states. We then present a detailed system error analysis that reveals challenges for real-time affect detection and adaptation. This research is situated in the tutoring domain, where the user is a student and the spoken dialogue system is a tutor. Our adaptive system detects uncertainty in each student turn via a model that combines a machine learning approach with hedging-phrase heuristics; the learned model uses acoustic-prosodic and lexical features extracted from the speech signal, as well as dialogue features. The adaptive system varies its content based on the automatic uncertainty and correctness labels for each turn. Our controlled experimental evaluation shows that the adaptive system yields higher global performance than two non-adaptive control systems, but the difference is only significant for a subset of students. Our system error analysis indicates that noisy affect labeling is a major performance bottleneck, yielding fewer adaptations than expected and thus lower than expected performance. However, the percentage of received adaptation correlates with higher performance over all students. Moreover, when uncertainty is accurately recognized and adapted to, local performance is significantly improved. (C) 2011 Elsevier B.V. All rights reserved.
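The turn-level logic described above (a learned uncertainty detector combined with hedging-phrase heuristics, and content adaptation keyed on the uncertainty and correctness labels) can be sketched as follows. The phrase list, threshold, and response strings are hypothetical placeholders; the actual ITSPOKE features and adaptation content are not reproduced here.

    HEDGE_PHRASES = ("i guess", "maybe", "i'm not sure", "probably")  # illustrative list

    def label_uncertainty(turn_text, learned_prob, threshold=0.5):
        """Combine a learned uncertainty probability with a hedging-phrase heuristic."""
        hedged = any(p in turn_text.lower() for p in HEDGE_PHRASES)
        return hedged or learned_prob >= threshold

    def choose_tutor_response(correct, uncertain):
        """Vary tutoring content by the (correctness, uncertainty) label of the turn."""
        if correct and not uncertain:
            return "brief confirmation, move on"
        if correct and uncertain:
            return "confirm and add a short explanation to firm up the concept"
        if not correct and uncertain:
            return "remediate with a targeted sub-dialogue"
        return "remediate with the full explanation"

    print(choose_tutor_response(correct=True,
                                uncertain=label_uncertainty("I guess it's gravity?", 0.3)))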
C1 [Forbes-Riley, Kate; Litman, Diane] Univ Pittsburgh, Ctr Learning Res & Dev, Dept Comp Sci, Pittsburgh, PA 15260 USA.
RP Forbes-Riley, K (reprint author), Univ Pittsburgh, Ctr Learning Res & Dev, Dept Comp Sci, Pittsburgh, PA 15260 USA.
EM forbesk@pitt.edu
FU National Science Foundation (NSF) [0914615, 0631930]
FX This work is funded by National Science Foundation (NSF) awards #0914615
and #0631930. We thank Pam Jordan and the ITSPOKE Group for their help
with this research.
CR AI H, 2006, P INT PITTSB PA, P797
AIST G, 2002, P INT TUT SYST WORKS
Ang J, 2002, P INT C SPOK LANG PR, P2037
Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1
Batliner A, 2008, USER MODEL USER-ADAP, V18, P175, DOI 10.1007/s11257-007-9039-4
Bhatt K, 2004, PROCEEDINGS OF THE TWENTY-SIXTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, P114
BLACK A, 1997, 83 U ED HUM COMM RES
Burleson W, 2007, IEEE INTELL SYST, V22, P62, DOI 10.1109/MIS.2007.69
Chawla NV, 2002, J ARTIF INTELL RES, V16, P321
Conati C, 2009, USER MODEL USER-ADAP, V19, P267, DOI 10.1007/s11257-009-9062-8
CONATI C, 2004, P INT TUT SYST C ITS, P55
D'Mello S., 2007, P 29 ANN M COGN SCI, P203
De Vicente A., 2002, P 6 INT C INT TUT SY, P933
Devillers L., 2005, P INTERSPEECH LISB P
DEVILLERS L, 2003, P IEEE INT C MULT EX
D'Mello SK, 2008, USER MODEL USER-ADAP, V18, P45, DOI 10.1007/s11257-007-9037-6
Ekman P., 1978, FACIAL ACTION CODING
FORBESRILEY K, 2006, P FLOR ART INT RES S, P509
Forbes-Riley K., 2009, P 14 INT C ART INT E
FORBESRILEY K, 2005, P 9 EUR C SPEECH COM
FORBESRILEY K, 2010, P INT INT TUT SYST C
Forbes-Riley K., 2004, P HUM LANG TECHN C N, P201
Forbes-Riley K., 2010, COMPUTER SPEECH LANG, V25, P105
FORBESRILEY K, 2007, P AFF COMP INT INT A, P678
Forbes-Riley K., 2008, P 9 INT C INT TUT SY
Gholson J., 2004, J ED MEDIA, V29, P241, DOI DOI 10.1080/1358165042000283101
Gratch J., 2003, BROWN J WORLD AFFAIR, VX, P63
Hake R.R., 2002, P UNESCO ASPEN WORKS
HALL L, 2004, P INT TUT SYST C ITS, P604
Hall M., 2009, SIGKDD EXPLORATIONS, V11, P10, DOI DOI 10.1145/1656274.1656278
HUANG X, 1993, COMPUT SPEECH LA FEB, P137
JORDAN PW, 2007, P AIED 2007, P43
Klein J, 2002, INTERACT COMPUT, V14, P119
Kort B., 2001, Proceedings IEEE International Conference on Advanced Learning Technologies, DOI 10.1109/ICALT.2001.943850
Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534
Lee CM, 2002, P INT C SPOK LANG PR, P873
LITMAN D, 2009, COGN MET ED SYST AAA
Litman D., 2005, INT J ARTIFICIAL INT, V16, P145
LITMAN D, 2004, P SIGDIAL WORKSH DIS, P144
LITMAN D, 2009, P INT BRIGHT UK
Litman D. J., 2004, P 42 ANN M ASS COMP, P352
Litman DJ, 2006, SPEECH COMMUN, V48, P559, DOI 10.1016/j.specom.2005.09.008
LIU K, 2005, CHI WORKSH HCI CHALL
Mairesse F., 2008, P 46 ANN M ASS COMP
McQuiggan S. W., 2008, P 9 INT INT TUT SYST
McQuiggan SW, 2008, USER MODEL USER-ADAP, V18, P81, DOI 10.1007/s11257-007-9040-y
OUDEYER P, 2002, P I INT C PROS AIX E, P551
Paiva A., 2007, AFFECTIVE COMPUTING
Pon-Barry Heather, 2006, INT J ARTIFICIAL INT, V16, P171
Porayska-Pomsta K, 2008, USER MODEL USER-ADAP, V18, P125, DOI 10.1007/s11257-007-9041-x
Prendinger H, 2005, APPL ARTIF INTELL, V19, P267, DOI 10.1080/08839510590910174
Schuller B., 2010, P 11 ANN C INT SPEEC
Schuller B., 2009, P 10 ANN C INT SPEEC
Shafran I., 2003, P IEEE AUT SPEECH RE, P31
Talkin D., 1996, GET F0 ONLINE DOCUME
Talkin D., 1995, SPEECH CODING SYNTHE
TSUKAHARA W, 2001, P SIG CHI HUM FACT C
VanLehn K, 2003, COGNITION INSTRUCT, V21, P209, DOI 10.1207/S1532690XCI2103_01
VANLEHN K, 2002, P INT TUT SYST
Wang N, 2005, P 2005 INT C INT US, P12, DOI 10.1145/1040830.1040845
Wang N, 2008, INT J HUM-COMPUT ST, V66, P98, DOI 10.1016/j.ijhcs.2007.09.003
NR 61
TC 16
Z9 16
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1115
EP 1136
DI 10.1016/j.specom.2011.02.006
PG 22
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000005
ER
PT J
AU Acosta, JC
Ward, NG
AF Acosta, Jaime C.
Ward, Nigel G.
TI Achieving rapport with turn-by-turn, user-responsive emotional coloring
SO SPEECH COMMUNICATION
LA English
DT Article
DE Affective computing; Dimensional emotions; Prosody; Persuasion;
Prediction; Immediate response patterns; Responsiveness; User modeling;
Social interaction in dialog; Interpersonal adaptation
ID HUMAN-COMPUTER INTERACTION; AGENTS; ALIGNMENT
AB People in dialog use a rich set of nonverbal behaviors, including variations in the prosody of their utterances. Such behaviors, often emotion-related, call for appropriate responses, but today's spoken dialog systems lack the ability to do this. Recent work has shown how to recognize user emotions from prosody and how to express system-side emotions with prosody, but demonstrations of how to combine these functions to improve the user experience have been lacking. Working with a corpus of conversations with students about graduate school, we analyzed the emotional states of the interlocutors, utterance-by-utterance, using three dimensions: activation, evaluation, and power. We found that the emotional coloring of the speaker's utterance could be largely predicted from the emotion shown by her interlocutor in the immediately previous utterance. This finding enabled us to build Gracie, the first spoken dialog system that recognizes a user's emotional state from his or her speech and gives a response with appropriate emotional coloring. Evaluation with 36 subjects showed that they felt significantly more rapport with Gracie than with either of two controls. This shows that dialog systems can tap into this important level of interpersonal interaction using today's technology. (C) 2010 Elsevier B.V. All rights reserved.
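The central finding above is that the emotional coloring (activation, evaluation, power) of the next utterance is largely predictable from the interlocutor's previous utterance. A minimal sketch of such a predictor is given below as an ordinary least-squares linear map; the toy annotations and the linear form are assumptions for illustration, not the model behind Gracie.

    import numpy as np

    # Toy utterance-by-utterance annotations: each row is (activation, evaluation, power)
    # of the interlocutor's utterance; the matching row in next_utts is the speaker's reply.
    prev_utts = np.array([[0.2, 0.5, -0.1],
                          [0.7, -0.3, 0.4],
                          [-0.2, 0.1, 0.0],
                          [0.5, 0.6, 0.2]])
    next_utts = np.array([[0.3, 0.4, 0.0],
                          [0.6, -0.2, 0.3],
                          [-0.1, 0.2, 0.1],
                          [0.4, 0.5, 0.2]])

    # Least-squares linear map (with bias) from previous to next emotional coloring.
    X = np.hstack([prev_utts, np.ones((len(prev_utts), 1))])
    W, *_ = np.linalg.lstsq(X, next_utts, rcond=None)

    def predict_coloring(prev):
        return np.append(prev, 1.0) @ W

    print(predict_coloring([0.6, 0.0, 0.1]))  # predicted (activation, evaluation, power)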
C1 [Acosta, Jaime C.; Ward, Nigel G.] Univ Texas El Paso, El Paso, TX 79968 USA.
RP Ward, NG (reprint author), Univ Texas El Paso, 500 W Univ Ave, El Paso, TX 79968 USA.
EM jcacosta@miners.utep.edu; nigelward@acm.org
FU ACTEDS Scholarship; NSF [IIS-0415150, IIS-0914868]; US Army
FX We thank Anais Rivera and Sue Walker for the persuasive dialog corpus,
Rafael Escalante-Ruiz for the persuasive letter generator, Alejandro
Vega, Shreyas Karkhedkar, and Tam Acosta for their help during testing,
Ben Walker and Josh McCartney for running subjects, and David Novick,
Steven Crites and Anton Batliner for discussion and comments. This work
was supported in part by an ACTEDS Scholarship, by NSF Awards
IIS-0415150 and IIS-0914868, and by the US Army Research, Development
and Engineering Command via a contract to the USC Institute for Creative
Technologies.
CR Acosta J. C., 2009, THESIS U TEXAS EL PA
Batliner A, 2011, COMPUT SPEECH LANG, V25, P4, DOI 10.1016/j.csl.2009.12.003
BATLINER A, 2009, ADV HUMAN COMPUTER I
Beale R, 2009, INT J HUM-COMPUT ST, V67, P755, DOI 10.1016/j.ijhcs.2009.05.001
Becker C, 2004, LECT NOTES COMPUT SC, V3068, P154
Berry DC, 2005, INT J HUM-COMPUT ST, V63, P304, DOI 10.1016/j.ijhcs.2005.03.006
BEVACQUA E, 2008, P 8 INT C INT VIRT A
Branigan HP, 2010, J PRAGMATICS, V42, P2355, DOI 10.1016/j.pragma.2009.12.012
BRAVE S, 2007, HUMAN COMPUTER INTER, P77
Burgoon J., 1995, INTERPERSONAL ADAPTA
CACIOPPO JT, 1982, ATTITUDES LANGUAGE V, P189
Calvo RA, 2010, IEEE T AFFECT COMPUT, V1, P18, DOI 10.1109/T-AFFC.2010.1
Cappella J. N., 1991, COMMUN THEORY, V1, P4, DOI 10.1111/j.1468-2885.1991.tb00002.x
Cassell J, 2003, USER MODEL USER-ADAP, V13, P89, DOI 10.1023/A:1024026532471
Chartrand TL, 1999, J PERS SOC PSYCHOL, V76, P893, DOI 10.1037//0022-3514.76.6.893
Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197
Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007
D'Mello S, 2009, LECT NOTES COMPUT SC, V5612, P595, DOI 10.1007/978-3-642-02580-8_65
Fogg B. J., 2003, PERSUASIVE TECHNOLOG
Forbes-Riley K, 2011, COMPUT SPEECH LANG, V25, P105, DOI 10.1016/j.csl.2009.12.002
Frank E, 1998, MACH LEARN, V32, P63, DOI 10.1023/A:1007421302149
Grahe JE, 1999, J NONVERBAL BEHAV, V23, P253, DOI 10.1023/A:1021698725361
Gratch J., 2006, 6 INT C INT VIRT AG, P14
Gratch J, 2007, LECT NOTES ARTIF INT, V4722, P125
HOLTGRAVES TM, 2008, LANGUAGE MEANING SOC
Huggins-Daines D., 2006, IEEE INT C AC SPEECH
Klein J, 2002, INTERACT COMPUT, V14, P119
Komatani K, 2005, USER MODEL USER-ADAP, V15, P169, DOI 10.1007/s11257-004-5659-0
Kopp S, 2010, SPEECH COMMUN, V52, P587, DOI 10.1016/j.specom.2010.02.007
Lee C.-C., 2009, INTERSPEECH, P320
Mazzotta I, 2009, LECT NOTES ARTIF INT, V5773, P527
Pittermann Johannes, 2010, International Journal of Speech Technology, V13, DOI 10.1007/s10772-010-9068-y
PONBARRY H, 2008, INTERSPEECH, V12, P74
Saerbeck M, 2010, CHI2010: PROCEEDINGS OF THE 28TH ANNUAL CHI CONFERENCE ON HUMAN FACTORS IN COMPUTING SYSTEMS, VOLS 1-4, P1613
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
SCHRODER M, 2004, SPEECH EMOTION RES A
Schroder M., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025708916924
SCHROEDER M, 2009, EMOTION MARKUP LANGU
Schuller Bjorn, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5372886
Shepard C. A., 2001, NEW HDB LANGUAGE SOC, P33
Suzuki N, 2007, CONNECT SCI, V19, P131, DOI 10.1080/09540090701369125
WARD N, 2009, INT 2009 BRIGHT UK, P2431
Ward N, 2003, INT J HUM-COMPUT ST, V59, P603, DOI 10.1016/S1071-5819(03)00085-5
Ward NG, 2010, J CROSS CULT PSYCHOL, V41, P270, DOI 10.1177/0022022109354644
Winton Ward M., 1990, HDB LANGUAGE SOCIAL, P33
Witten I. H., 2002, ACM SIGMOD RECORD, V31, P76, DOI 10.1145/507338.507355
Witten I.H., 2005, DATA MINING PRACTICA
NR 47
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1137
EP 1148
DI 10.1016/j.specom.2010.11.006
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000006
ER
PT J
AU Mahdhaoui, A
Chetouani, M
AF Mahdhaoui, Ammar
Chetouani, Mohamed
TI Supervised and semi-supervised infant-directed speech classification for
parent-infant interaction analysis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Infant-directed speech; Emotion recognition; Face-to-face interaction;
Data fusion; Semi-supervised learning
ID MOTHERESE; PREFERENCE; LANGUAGE
AB This paper describes the development of an infant-directed speech discrimination system for parent infant interaction analysis. Different feature sets for emotion recognition were investigated using two classification techniques: supervised and semi-supervised. The classification experiments were carried out with short pre-segmented adult-directed speech and infant-directed speech segments extracted from real-life family home movies (with durations typically between 0.5 s and 4 s). The experimental results show that in the case of supervised learning, spectral features play a major role in the infant-directed speech discrimination. However, a major difficulty of using natural corpora is that the annotation process is time-consuming, and the expression of emotion is much more complex than in acted speech. Furthermore, interlabeler agreement and annotation label confidences are important issues to address. To overcome these problems, we propose a new semi-supervised approach based on the standard co-training algorithm exploiting labelled and unlabelled data. It offers a framework to take advantage of supervised classifiers trained by different features. The proposed dynamic weighted co-training approach combines various features and classifiers usually used in emotion recognition in order to learn from different views. Our experiments demonstrate the validity and effectiveness of this method for a real-life corpus such as home movies. (C) 2011 Elsevier B.V. All rights reserved.
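The semi-supervised idea described above can be sketched as classic two-view co-training: each view's classifier labels its most confident unlabelled segments, which are added to the shared training set. The sketch below uses Gaussian naive Bayes classifiers and synthetic features as stand-ins; the paper's specific feature views and its dynamic weighting of the classifiers are not reproduced.

    import numpy as np
    from sklearn.naive_bayes import GaussianNB

    def cotrain(Xa, Xb, y, Xa_u, Xb_u, rounds=3, add_per_round=2):
        """Two-view co-training sketch: each view labels confident unlabelled examples."""
        clf_a, clf_b = GaussianNB(), GaussianNB()
        unlabeled = list(range(len(Xa_u)))
        for _ in range(rounds):
            clf_a.fit(Xa, y)
            clf_b.fit(Xb, y)
            for clf, Xu in ((clf_a, Xa_u), (clf_b, Xb_u)):
                if not unlabeled:
                    break
                probs = clf.predict_proba(Xu[unlabeled])
                picks = np.argsort(probs.max(axis=1))[-add_per_round:]
                for p in picks:
                    idx = unlabeled[p]
                    Xa = np.vstack([Xa, Xa_u[idx]])
                    Xb = np.vstack([Xb, Xb_u[idx]])
                    y = np.append(y, clf.classes_[probs[p].argmax()])
                unlabeled = [u for k, u in enumerate(unlabeled) if k not in set(picks)]
        return clf_a, clf_b

    rng = np.random.default_rng(0)
    Xa_l, Xb_l = rng.normal(size=(10, 4)), rng.normal(size=(10, 6))
    y_l = np.array([0, 1] * 5)                      # 0 = adult-directed, 1 = infant-directed
    Xa_u, Xb_u = rng.normal(size=(20, 4)), rng.normal(size=(20, 6))
    clf_a, clf_b = cotrain(Xa_l, Xb_l, y_l, Xa_u, Xb_u)
    print(clf_a.predict(Xa_u[:3]), clf_b.predict(Xb_u[:3]))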
C1 [Mahdhaoui, Ammar] Univ Paris 06, F-75005 Paris, France.
CNRS, ISIR, Inst Syst Intelligents & Robot, UMR 7222, F-75005 Paris, France.
RP Mahdhaoui, A (reprint author), Univ Paris 06, F-75005 Paris, France.
EM Ammar.Mahdhaoui@isir.upmc.fr; Mohamed.Chetouani@upmc.fr
RI CHETOUANI, Mohamed/F-5854-2010
FU La Fondation de France
FX The authors would like to thank Filippo Muratori and Fabio Apicella from
Scientific Institute Stella Maris of the University of Pisa, Italy, who have
provided the data: family home movies. We would also like to extend our
thanks to David Cohen and his staff, Raquel Sofia Cassel and Catherine
Saint-Georges, from the Department of Child and Adolescent Psychiatry,
AP-HP, Groupe Hospitalier Pitie-Salpetriere, Universite Pierre et Marie
Curie, Paris France, for their collaboration and the manual database
annotation and data analysis. Finally, this work has been partially
funded by La Fondation de France.
CR Association A. P., 1994, DIAGNOSTIC STAT MANU, VIV
Bishop C. M., 1995, NEURAL NETWORKS PATT
Blum A., 1998, C COMP LEARN THEOR
Boersma P., 2005, PRAAT DOING PHONETIC
BREFELD U, 2006, EFFICIENT COREGULARI
BURNHAM C, 2002, WHATS NEW PUSSYCAT T, V296, P1435
CARLSON A, 2009, THESIS CARNEGIE MELL
Chang C.-C., 2001, LIBSVM LIB SUPPORT V
Chetouani M, 2009, COGN COMPUT, V1, P194, DOI 10.1007/s12559-009-9016-9
Cohen J., 1960, EDUC PSYCHOL MEAS, V20, P3746
COOPER RP, 1990, CHILD DEV, V61, P1584, DOI 10.1111/j.1467-8624.1990.tb02885.x
Cooper RP, 1997, INFANT BEHAV DEV, V20, P477, DOI 10.1016/S0163-6383(97)90037-0
Duda R., 2000, PATTERN CLASSIFICATI
EIBE F, 1999, MORGAN KAUFMANN SERI
Esposito A, 2005, LECT NOTES ARTIF INT, V3445, P1
FERNALD A, 1985, INFANT BEHAV DEV, V8, P181, DOI 10.1016/S0163-6383(85)80005-9
FERNALD A, 1987, INFANT BEHAV DEV, V10, P279, DOI 10.1016/0163-6383(87)90017-8
Goldman S., 2000, INT C MACH LEARN, P327
GRIESER DL, 1988, DEV PSYCHOL, V24, P14, DOI 10.1037/0012-1649.24.1.14
INOUEA T, 2011, NEUROSCI RES, P1
KESSOUS L, 2007, PARALING
Kuhl PK, 2004, NAT REV NEUROSCI, V5, P831, DOI 10.1038/nrn1533
LAZNIK M, 2005, COMMENCEMENT TAIT VO, P81
MAESTRO S, 2005, CHILD PSYCHIAT HUMAN, V35, P83
MANDHAOUI A, 2008, INT C PATT REC ICPR, P8
MANDHAOUI A, 2011, INT J METH PSYCH RES, V20, pE6
MANDHAOUI A, 2009, MULTIMODAL SIGNALS C, P248
Muratori F., 2007, INT J DIALOGICAL SCI, V2, P93
Muslea I., 2000, Proceedings Seventeenth National Conference on Artificial Intelligence (AAAI-2000). Twelfth Innovative Applications of Artificial Intelligence Conference (IAAI-2000)
NIGAM K, 2000, INT C MACH LEARN
Nigam K., 2000, 9 INT C INF KNOWL MA, P86
Platt J. C., 1999, ADV LARGE MARGIN CLA
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Saint-Georges C, 2010, RES AUTISM SPECT DIS, V4, P355, DOI 10.1016/j.rasd.2009.10.017
Schapire RE, 1999, MACH LEARN, V37, P297, DOI 10.1023/A:1007614523901
Schuller B., 2007, INTERSPEECH, P2253
Shami M, 2007, SPEECH COMMUN, V49, P201, DOI 10.1016/j.specom.2007.01.006
SHAMI M, 2005, IEEE MULTIMEDIA EXPO
Slaney M, 2003, SPEECH COMMUN, V39, P367, DOI 10.1016/S0167-6393(02)00049-3
Truong KP, 2007, SPEECH COMMUN, V49, P144, DOI 10.1016/j.specom.2007.01.001
Vapnik V., 1995, NATURE STAT LEARNING
Vapnik V, 1998, STAT LEARNING THEORY
Zhang QJ, 2010, PATTERN RECOGN, V43, P3113, DOI 10.1016/j.patcog.2010.04.004
Zhou D., 2005, ADV NEURAL INFORM PR, V17, P1633
Zhu X., 2003, ICML, P912
Zwicker E, 1999, PSYCHOACOUSTICS FACT
Zwicker E., 1961, ACOUSTICAL SOC AM, V33, P248
NR 47
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1149
EP 1161
DI 10.1016/j.specom.2011.05.005
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000007
ER
PT J
AU Lee, CC
Mower, E
Busso, C
Lee, S
Narayanan, S
AF Lee, Chi-Chun
Mower, Emily
Busso, Carlos
Lee, Sungbok
Narayanan, Shrikanth
TI Emotion recognition using a hierarchical binary decision tree approach
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotion recognition; Hierarchical structure; Support Vector Machine;
Bayesian Logistic Regression
AB Automated emotion state tracking is a crucial element in the computational study of human communication behaviors. It is important to design robust and reliable emotion recognition systems that are suitable for real-world applications both to enhance analytical abilities to support human decision making and to design human machine interfaces that facilitate efficient communication. We introduce a hierarchical computational structure to recognize emotions. The proposed structure maps an input speech utterance into one of the multiple emotion classes through subsequent layers of binary classifications. The key idea is that the levels in the tree are designed to solve the easiest classification tasks first, allowing us to mitigate error propagation. We evaluated the classification framework on two different emotional databases using acoustic features, the AIBO database and the USC IEMOCAP database. In the case of the AIBO database, we obtain a balanced recall on each of the individual emotion classes using this hierarchical structure. The performance measure of the average unweighted recall on the evaluation data set improves by 3.37% absolute (8.82% relative) over a Support Vector Machine baseline model. In the USC IEMOCAP database, we obtain an absolute improvement of 7.44% (14.58%) over a baseline Support Vector Machine modeling. The results demonstrate that the presented hierarchical approach is effective for classifying emotional utterances in multiple database contexts. (C) 2011 Elsevier B.V. All rights reserved.
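The hierarchical structure described above maps an utterance through successive binary decisions, with the easiest separations placed near the root. The following minimal sketch uses SVMs and an arbitrary class grouping; the actual tree layout, acoustic features, and Bayesian logistic regression nodes from the paper are not reproduced.

    import numpy as np
    from sklearn.svm import SVC

    class BinaryNode:
        """One level of the hierarchy: splits the remaining classes into two groups."""
        def __init__(self, left_labels, right_labels):
            self.left, self.right = set(left_labels), set(right_labels)
            self.clf = SVC()
            self.left_child = self.right_child = None

        def fit(self, X, y):
            mask = np.isin(y, list(self.left | self.right))
            side = np.isin(y[mask], list(self.left)).astype(int)   # 1 = left group
            self.clf.fit(X[mask], side)
            return self

        def predict_one(self, x):
            go_left = self.clf.predict(x.reshape(1, -1))[0] == 1
            group, child = (self.left, self.left_child) if go_left else (self.right, self.right_child)
            if child is None:            # leaf: the group holds a single emotion class
                return next(iter(group))
            return child.predict_one(x)

    rng = np.random.default_rng(1)
    X, y = rng.normal(size=(80, 5)), rng.integers(0, 4, size=80)    # toy features, 4 classes
    root = BinaryNode([0, 1], [2, 3]).fit(X, y)                     # easiest split first
    root.left_child = BinaryNode([0], [1]).fit(X, y)
    root.right_child = BinaryNode([2], [3]).fit(X, y)
    print(root.predict_one(X[0]))

Because each node only separates two groups, errors made deeper in the tree cannot contaminate the earlier, easier decisions, which is the stated motivation for the design.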
C1 [Lee, Chi-Chun; Mower, Emily; Lee, Sungbok; Narayanan, Shrikanth] Univ So Calif, Signal Anal & Interpretat Lab SAIL, Dept Elect Engn, Los Angeles, CA 90089 USA.
[Busso, Carlos] Univ Texas Dallas, Dept Elect Engn, Dallas, TX 75080 USA.
RP Lee, CC (reprint author), Univ So Calif, Signal Anal & Interpretat Lab SAIL, Dept Elect Engn, Los Angeles, CA 90089 USA.
EM chiclee@usc.edu
RI Narayanan, Shrikanth/D-5676-2012
CR Agresti A., 1990, WILEY SERIES PROBABI
ALBORNOZ EM, 2011, COMPUT SPEE IN PRESS, V25
AMIR N, 2010, SPEECH PROSODY
BLACK M, 2010, P INT MAK JAP 2010
Black M., 2008, P WORKSH CHILD COMP
Brave S, 2005, INT J HUM-COMPUT ST, V62, P161, DOI 10.1016/j.ijhcs.2004.11.002
BRODY L, 2008, GENDER EMOTION CONTE
Busso C, 2009, IEEE T AUDIO SPEECH, V17, P582, DOI 10.1109/TASL.2008.2009578
Busso C, 2008, LANG RESOUR EVAL, V42, P335, DOI 10.1007/s10579-008-9076-6
Chawla NV, 2002, J ARTIF INTELL RES, V16, P321
Eyben F, 2009, SPEECH MUSIC INTERPR
Genkin A, 2007, TECHNOMETRICS, V49, P291, DOI 10.1198/004017007000000245
HASSAN A, 2010, P INT, P2354
HERM O, 2008, P INT
Hosmer DW, 2000, WILEY SERIES PROBABI, V2nd
Kanda T, 2004, HUM-COMPUT INTERACT, V19, P61, DOI 10.1207/s15327051hci1901&2_4
Kapoor A., 2005, P 13 ANN ACM INT C M, P677, DOI DOI 10.1145/1101149.1101300
LAZARUS R, 2001, EMOTION THEOR METHOD, P37
LEE CC, 2010, P INT MAK JAP
Lee C.-C., 2009, P INT
Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534
Mao QR, 2010, COMPUT SCI INF SYST, V7, P211, DOI 10.2298/CSIS1001211Q
Metallinou A., 2010, P INT C AC SPEECH SI
Mower E, 2011, IEEE T AUDIO SPEECH, V19, P1057, DOI 10.1109/TASL.2010.2076804
Pantic M, 2005, P 13 ANN ACM INT C M, P669, DOI 10.1145/1101149.1101299
Prendinger H, 2005, INT J HUM-COMPUT ST, V62, P231, DOI 10.1016/j.ijhcs.2004.11.009
Schuller B., 2007, P INT
SCHULLER B, 2009, P INT BRIGHT UK
Steidl S., 2009, AUTOMATIC CLASSIFICA
Vapnik V., 1995, NATURE STAT LEARNING
WAGNER HL, 1993, J NONVERBAL BEHAV, V17, P3
XIAO Z, 2007, P INT S MULT WORKSH, P291
YILDIRIM S, 2005, P EUR LISB PORT
Yildirim S, 2011, COMPUT SPEECH LANG, V25, P29, DOI 10.1016/j.csl.2009.12.004
NR 34
TC 26
Z9 29
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1162
EP 1171
DI 10.1016/j.specom.2011.06.004
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000008
ER
PT J
AU Kockmann, M
Burget, L
Cernocky, J
AF Kockmann, Marcel
Burget, Lukas
Cernocky, Jan Honza
TI Application of speaker- and language identification state-of-the-art
techniques for emotion recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotion recognition; Gaussian mixture models;
Maximum-mutual-information; Intersession variability compensation;
Score-level fusion
ID VERIFICATION
AB This paper describes our efforts in transferring feature extraction and statistical modeling techniques from the fields of speaker and language identification to the related field of emotion recognition. We give detailed insight into our acoustic and prosodic feature extraction and show how to apply Gaussian Mixture Modeling techniques on top of it. We focus on different flavors of Gaussian Mixture Models (GMMs), including more sophisticated approaches such as discriminative training using the Maximum-Mutual-Information (MMI) criterion and InterSession Variability (ISV) compensation. Both techniques show superior performance in language and speaker identification. Furthermore, we combine multiple system outputs by score-level fusion to exploit the complementary information in diverse systems. Our proposal is evaluated with several experiments on the FAU Aibo Emotion Corpus containing non-acted spontaneous emotional speech. Within the Interspeech 2009 Emotion Challenge we achieved the best results for the 5-class task of the Open Performance Sub-Challenge with an unweighted average recall of 41.7%. Additional experiments on the acted Berlin Database of Emotional Speech show the capability of intersession variability compensation for emotion recognition. (C) 2011 Elsevier B.V. All rights reserved.
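A minimal sketch of the GMM back-end carried over from speaker and language identification: one mixture model per emotion over frame-level features, utterance scoring by average log-likelihood, and an optional weighted score-level fusion of two subsystems. MMI training and intersession variability compensation are beyond this sketch, and the toy features and weights are assumptions.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    def train_gmms(frames_per_class, n_components=4, seed=0):
        """Fit one GMM per emotion class on stacked frame-level features."""
        return {c: GaussianMixture(n_components, covariance_type="diag",
                                   random_state=seed).fit(frames)
                for c, frames in frames_per_class.items()}

    def score_utterance(gmms, frames):
        """Average per-frame log-likelihood for each class (higher = better match)."""
        return {c: float(g.score(frames)) for c, g in gmms.items()}

    def fuse_scores(scores_a, scores_b, w=0.5):
        """Weighted score-level fusion of two systems (e.g. acoustic and prosodic)."""
        return {c: w * scores_a[c] + (1 - w) * scores_b[c] for c in scores_a}

    rng = np.random.default_rng(0)
    train = {"anger": rng.normal(1.0, 1.0, size=(200, 13)),
             "neutral": rng.normal(-1.0, 1.0, size=(200, 13))}
    gmms = train_gmms(train)
    utt = rng.normal(1.0, 1.0, size=(50, 13))       # toy test utterance (frames x features)
    scores = score_utterance(gmms, utt)
    print(max(scores, key=scores.get), scores)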
C1 [Kockmann, Marcel; Burget, Lukas; Cernocky, Jan Honza] Brno Univ Technol, Speech FIT, Brno, Czech Republic.
RP Kockmann, M (reprint author), Brno Univ Technol, Speech FIT, Brno, Czech Republic.
EM kockmann@fit.vutbr.cz
FU European project MOBIO [FP7-214324]; Grant Agency of Czech Republic
[102/08/0707]; Czech Ministry of Education [MSM0021630528]; SVOX
Deutschland GmbH, Munich, Germany
FX This work was partly supported by European project MOBIO (FP7-214324),
by Grant Agency of Czech Republic project No. 102/08/0707, and by the
Czech Ministry of Education project No. MSM0021630528. Marcel Kockmann
is supported by SVOX Deutschland GmbH, Munich, Germany.
CR Batliner A, 2006, P IS LTC 2006 LJUBL, P240
Bishop C. M., 2006, PATTERN RECOGNITION
Brummer N, 2007, IEEE T AUDIO SPEECH, V15, P2072, DOI 10.1109/TASL.2007.902870
Brummer N., 2004, P NIST SPEAK REC EV
Brummer N, 2006, COMPUT SPEECH LANG, V20, P230, DOI 10.1016/j.csl.2005.08.001
Burget L, 2007, IEEE T AUDIO SPEECH, V15, P1979, DOI 10.1109/TASL.2007.902499
Burkhardt F., 2005, 9 EUR C SPEECH COMM
Cohen J, 1995, J ACOUST SOC AM, V97, P3246, DOI 10.1121/1.411700
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Dehak N., 2009, IEEE T AUDIO SPEECH, P1
GAURAV M, 2008, SPOK LANG TECHN WORK, P313
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
HUBEIKA V, 2008, P INT, P1990
Kenny P, 2008, IEEE T AUDIO SPEECH, V16, P980, DOI 10.1109/TASL.2008.925147
Kinnunen T, 2010, SPEECH COMMUN, V52, P12, DOI 10.1016/j.specom.2009.08.009
KOCKMANN M, 2008, SPOK LANG TECHN WORK, P45
Kockmann M, 2009, P INT BRIGHT, P348
Matejka P., 2008, P INT
Matejka P., 2006, P OD
*NIST, 2005, 2005 NIST LANG REC E, P1
POVEY D, 2003, THESIS CAMBRIDGE U E, P1
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
SCHULLER B, 2006, SPEECH PROSODY DRESD
SCHULLER B, 2007, INTERSPEECH 2007 1 4
Schuller B, 2009, P INT BRIGHT, P1, DOI 10.1007/978-3-0346-0198-6_1
Schwarz P., 2006, P ICASSP, P325
SEPPI D, 2008, P INT
SHRIBERG E, 2009, INTERSPEECH BRIGHTON
STEIDL S, 2009, STUDIEN MUSTERERKENN, V28, P1
Torres-Carrasquillo P.A., 2002, 7 INT C SPOK LANG PR
Ververidis D., 2003, P 1 RICHM C, P109
Vlasenko B, 2007, P INT, P2249
Young S. J., 2006, HTK BOOK VERSION 3 4
NR 33
TC 9
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1172
EP 1185
DI 10.1016/j.specom.2011.01.007
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000009
ER
PT J
AU Bozkurt, E
Erzin, E
Erdem, CE
Erdem, AT
AF Bozkurt, Elif
Erzin, Engin
Erdem, Cigdem Eroglu
Erdem, A. Tanju
TI Formant position based weighted spectral features for emotion
recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotion recognition; Emotional speech classification; Spectral features;
Formant frequency; Line spectral frequency; Decision fusion
AB In this paper, we propose novel spectrally weighted mel-frequency cepstral coefficient (WMFCC) features for emotion recognition from speech. The idea is based on the fact that formant locations carry emotion-related information, and therefore critical spectral bands around formant locations can be emphasized during the calculation of MFCC features. The spectral weighting is derived from the normalized inverse harmonic mean function of the line spectral frequency (LSF) features, which are known to be localized around formant frequencies. The above approach can be considered as an early data fusion of spectral content and formant location information. We also investigate methods for late decision fusion of unimodal classifiers. We evaluate the proposed WMFCC features together with the standard spectral and prosody features using HMM-based classifiers on the spontaneous FAU Aibo emotional speech corpus. The results show that unimodal classifiers with the WMFCC features perform significantly better than the classifiers with standard spectral features. Late decision fusion of classifiers provides further significant performance improvements. (C) 2011 Elsevier B.V. All rights reserved.
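The weighting idea can be illustrated with the inverse harmonic mean of distances to neighbouring LSFs (in the spirit of the Laroia et al. reference cited below): weights are large where LSFs cluster, i.e. near formants, and the resulting curve can multiply the power spectrum before mel filtering. The exact normalization used in the paper and the point in the MFCC pipeline where the weights are applied may differ; this is only a sketch with toy LSF values.

    import numpy as np

    def lsf_inverse_harmonic_mean_weights(lsf):
        """Per-LSF weights: large where neighbouring LSFs are close together,
        i.e. near formant locations (Laroia-style inverse harmonic mean)."""
        lsf = np.asarray(lsf, dtype=float)              # LSFs in rad, sorted in (0, pi)
        padded = np.concatenate(([0.0], lsf, [np.pi]))
        d_prev = padded[1:-1] - padded[:-2]
        d_next = padded[2:] - padded[1:-1]
        w = 1.0 / d_prev + 1.0 / d_next
        return w / w.sum()                              # normalize to sum to 1

    def spectral_weighting(freq_axis, lsf, weights):
        """Interpolate the LSF weights onto the FFT frequency axis, giving a curve
        that can multiply the power spectrum before the mel filterbank."""
        return np.interp(freq_axis, lsf, weights)

    lsf = np.array([0.28, 0.35, 0.9, 1.0, 1.7, 1.85, 2.4, 2.6])   # toy 8th-order LSFs
    weights = lsf_inverse_harmonic_mean_weights(lsf)
    freqs = np.linspace(0, np.pi, 257)
    curve = spectral_weighting(freqs, lsf, weights)
    print(curve.argmax(), curve.max())                  # peak sits near the closest LSF pair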
C1 [Bozkurt, Elif; Erzin, Engin] Koc Univ, Multimedia Vis & Graph Lab, Coll Engn, TR-34450 Istanbul, Turkey.
[Erdem, Cigdem Eroglu] Bahcesehir Univ, Dept Elect & Elect Engn, TR-34353 Istanbul, Turkey.
[Erdem, A. Tanju] Ozyegin Univ, Dept Elect & Elect Engn, TR-34662 Istanbul, Turkey.
RP Erzin, E (reprint author), Koc Univ, Multimedia Vis & Graph Lab, Coll Engn, TR-34450 Istanbul, Turkey.
EM ebozkurt@ku.edu.tr; eerzin@ku.edu.tr; cigdem.eroglu@bahcesehir.edu.tr;
tanju.erdem@ozyegin.edu.tr
RI Erzin, Engin/H-1716-2011; Eroglu Erdem, Cigdem/J-4216-2012
OI Eroglu Erdem, Cigdem/0000-0002-9264-5652
FU Turkish Scientific and Technical Research Council (TUBITAK) [106E201,
COST2102, 110E056]
FX This work was supported in part by the Turkish Scientific and Technical
Research Council (TUBITAK) under projects 106E201 (COST2102 action) and
110E056. The authors would like to acknowledge and thank the anonymous
referees for their valuable comments that have significantly improved
the quality of the paper.
CR Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1
Boersma P., 2010, PRAAT DOING PHONETIC
Deller J. R., 1993, DISCRETE TIME PROCES
Dietterich TG, 1998, NEURAL COMPUT, V10, P1895, DOI 10.1162/089976698300017197
DUMOUCHEL P, 2009, INT 2009 10 ANN C IN, P1
Erzin E, 2005, IEEE T MULTIMEDIA, V7, P840, DOI 10.1109/TMM.2005.854464
GOUDBEEK MB, 2009, P INT BRIGHT UK
GRIMM M, 2006, P 14 EUR SIGN PROC C
ITAKURA F, 1975, J ACOUST SOC AM, V57, pS35, DOI 10.1121/1.1995189
Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881
KOCKMANN M, 2009, INT 2009 10 ANN C IN, V1, P316
Laroia R., 1991, P IEEE INT C AC SPEE, P641, DOI 10.1109/ICASSP.1991.150421
LEE C, 2004, INT C SPOK LANG PROC
Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534
Morris RW, 2002, IEEE SIGNAL PROC LET, V9, P19, DOI 10.1109/97.988719
Morrison D, 2007, SPEECH COMMUN, V49, P98, DOI 10.1016/j.specom.2006.11.004
Nakatsu R, 2000, KNOWL-BASED SYST, V13, P497, DOI 10.1016/S0950-7051(00)00070-8
Neiberg D, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2755
POLZIN T, 2000, INTERSPEECH 2008
Sargin ME, 2007, IEEE T MULTIMEDIA, V9, P1396, DOI 10.1109/TMM.2007.906583
SCHERER KR, 1995, P 13 INT C PHON SCI, P90
SCHULLER B, 2006, DAGA, P57
SCHULLER B, 2009, INTERSPEECH 2009
Schuller B., 2003, P INT C AC SPEECH SI
Steidl S., 2009, AUTOMATIC CLASSIFICA
Ververidis D, 2006, SPEECH COMMUN, V48, P1162, DOI 10.1016/j.specom.2006.04.003
VLASENKO B, 2007, P 2 INT C AFF COMP I, P139
Zeng ZH, 2009, IEEE T PATTERN ANAL, V31, P39, DOI 10.1109/TPAMI.2008.52
NR 28
TC 8
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1186
EP 1197
DI 10.1016/j.specom.2011.04.003
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000010
ER
PT J
AU Polzehl, T
Schmitt, A
Metze, F
Wagner, M
AF Polzehl, Tim
Schmitt, Alexander
Metze, Florian
Wagner, Michael
TI Anger recognition in speech using acoustic and linguistic cues
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotion detection; Anger classification; Linguistic and prosodic
acoustic modeling; IGR ranking; Decision fusion; IVR speech
ID EMOTIONS
AB The present study elaborates on the exploitation of both linguistic and acoustic feature modeling for anger classification. In terms of acoustic modeling we generate statistics from acoustic audio descriptors, e.g. pitch, loudness, spectral characteristics. Ranking our features we see that loudness and MFCC seem most promising for all databases. For the English database, pitch features are also important. In terms of linguistic modeling we apply probabilistic and entropy-based models of words and phrases, e.g. Bag-of-Words (BOW), Term Frequency (TF), Term Frequency - Inverse Document Frequency (TF.IDF) and the Self-Referential Information (SRI). SRI clearly outperforms vector space models. Modeling phrases slightly improves the scores. After classification of both acoustic and linguistic information on separate levels, we fuse the information on the decision level by adding confidences. We compare the obtained scores on three different databases. Two databases are taken from the IVR customer care domain; another database accounts for a WoZ data collection. All corpora are of realistic speech condition. We observe promising results for the IVR databases, while the WoZ database shows lower scores overall. In order to provide comparability between the results, we evaluate classification success using the f1 measure in addition to overall accuracy figures. As a result, acoustic modeling clearly outperforms linguistic modeling. Fusion slightly improves overall scores. With a baseline of approximately 60% accuracy and a .40 f1 measure by constant majority class voting, we obtain an accuracy of 75% with a respective .70 f1 for the WoZ database. For the IVR databases we obtain approximately 79% accuracy with a respective .78 f1, over a baseline of 60% accuracy with a respective .38 f1. (C) 2011 Elsevier B.V. All rights reserved.
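The decision-level fusion described above (adding per-class confidences from the acoustic and linguistic classifiers) and a classic TF.IDF term weight can be sketched as follows. The confidence values and token lists are illustrative; the SRI measure and the IGR feature ranking are not reproduced here.

    import math

    def tf_idf(term, doc_tokens, corpus):
        """Classic TF.IDF weight of a term in one document against a small corpus."""
        tf = doc_tokens.count(term) / max(len(doc_tokens), 1)
        df = sum(1 for d in corpus if term in d)
        idf = math.log(len(corpus) / (1 + df))
        return tf * idf

    def fuse_confidences(acoustic, linguistic):
        """Decision-level fusion: add per-class confidences and pick the argmax."""
        summed = {c: acoustic[c] + linguistic[c] for c in acoustic}
        return max(summed, key=summed.get), summed

    corpus = [["this", "is", "unacceptable"], ["thank", "you"], ["not", "acceptable"]]
    print(tf_idf("unacceptable", corpus[0], corpus))

    acoustic = {"anger": 0.62, "non-anger": 0.38}      # illustrative confidences
    linguistic = {"anger": 0.44, "non-anger": 0.56}
    print(fuse_confidences(acoustic, linguistic))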
C1 [Polzehl, Tim] Tech Univ Berlin, Qual & Usabil Lab, D-10587 Berlin, Germany.
[Polzehl, Tim] Deutsch Telekom Labs, Qual & Usabil Lab, D-10587 Berlin, Germany.
[Schmitt, Alexander] Univ Ulm, Dialogue Syst Grp, D-89081 Ulm, Germany.
[Schmitt, Alexander] Univ Ulm, Inst Informat Technol, D-89081 Ulm, Germany.
[Metze, Florian] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA.
[Wagner, Michael] Univ Canberra, Natl Ctr Biometr Studies, Canberra, ACT 2601, Australia.
RP Polzehl, T (reprint author), Tech Univ Berlin, Qual & Usabil Lab, Ernst Reuter Pl 7, D-10587 Berlin, Germany.
EM tim.polzehl@gmail.com; alexander.schmitt@uni-ulm.de; fmetze@cs.cmu.edu;
michael.wagner@canberra.edu.au
RI Metze, Florian/N-4661-2014
OI Metze, Florian/0000-0002-6663-8600
CR Batliner A, 2000, ISCA WORKSH SPEECH E
Bitouk D, 2010, SPEECH COMMUN, V52, P613, DOI 10.1016/j.specom.2010.02.010
Boersma P., 2009, PRAAT DOING PHONETIC
BURKHARDT F, 2005, P ANN C INT SPEECH C
Burkhardt F., 2005, P EL SPEECH SIGN PRO
BURKHARDT F, 2009, P IEEE ICASSP, P4761
CORTES C, 1995, MACH LEARN, V20, P273, DOI 10.1023/A:1022627411411
DAVIES M, 1982, MEASURING AGREEMENT
Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007
Duda R., 2000, PATTERN CLASSIFICATI
DUMOUCHEL P, 2009, P ANN C INT SPEECH C
ENBERG IS, 1996, DOCUMENTATION DANISH
Fastl H, 2005, PSYCHOACOUSTICS FACT
HOZJAN V, 2003, INT J SPEECH TECHNOL, V6, P11
Huang X., 2001, SPOKEN LANGUAGE PROC
Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534
LEE FM, 2008, INT C SIGN PROC ROB, P171
METZE F, 2008, GETTING CLOSER TAILO
Metze F., 2009, P INT C SEM COMP ICS
METZE F, 2009, P ANN C INT SPEECH C, P1
Polzehl T, 2010, SPOKEN DIALOGUE SYST, P81
Polzehl T., 2009, P INT BRIGHT, P340
POLZEHL T, 2009, P INT WORKSH SPOK DI
Schmitt A., 2010, 6 INT C INT ENV IE 1
Schuller B, 2009, INT C AC SPEECH SIGN
SCHULLER B, 2006, THESIS TU MUNCHEN MU
SCHULLER B, 2009, P ANN C INT SPEECH C
Schuller B., 2004, IEEE INT C AC SPEECH
SHAFRAN I, 2003, IEEE WORKSH AUT SPEE, P31
SHAFRAN I, 2005, IEEE INT C AC SPEECH
Steidl S, 2005, INT CONF ACOUST SPEE, P317
Steidl S., 2009, THESIS
VIDRASCU L, 2007, PARALING
VLASENKO B, 2008, P INT BRISB AUSTR, P805
Vlasenko B., 2007, P INTERSPEECH 2007, P2225
VLASENKO B, 2009, P ANN C INT SPEECH C
WANG Y, 2009, EL MEAS INSTR ICEMI
Yacoub S., 2003, EUROSPEECH, P1
NR 38
TC 11
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1198
EP 1209
DI 10.1016/j.specom.2011.05.002
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000011
ER
PT J
AU Lopez-Cozar, R
Silovsky, J
Kroul, M
AF Lopez-Cozar, Ramon
Silovsky, Jan
Kroul, Martin
TI Enhancement of emotion detection in spoken dialogue systems by combining
several information sources
SO SPEECH COMMUNICATION
LA English
DT Article
DE Adaptive spoken dialogue systems; Combination of classifiers;
Information fusion; Emotion detection; Human computer interaction
ID RECOGNITION; AGREEMENT; COMPUTER; CLASSIFIERS; SPEECH; TUTORS; USER
AB This paper proposes a technique to enhance emotion detection in spoken dialogue systems by means of two modules that combine different information sources. The first module, called Fusion-0, combines emotion predictions generated by a set of classifiers that deal with different kinds of information about each sentence uttered by the user. To do this, the module employs several information fusion methods that produce further predictions about the emotional state of the user. These predictions are the input to the second information fusion module, called Fusion-1, where they are combined to deduce the emotional state of the user. Fusion-0 represents a method employed in previous studies to enhance classification rates, whereas Fusion-1 represents the novelty of the technique, namely the combination of the emotion predictions generated by Fusion-0. One advantage of the technique is that it can be applied as a post-processing stage to any other method that combines information from different information sources at the decision level. This is because the technique works on the predictions (outputs) of the methods, without interfering with the procedure used to obtain these predictions. Another advantage is that the technique can be implemented as a modular architecture, which facilitates setting it up within a spoken dialogue system and deducing the emotional state of the user in real time. Experiments have been carried out with classifiers that deal with prosodic, acoustic, lexical, and dialogue-act information, and three methods to combine information: multiplication of probabilities, average of probabilities, and unweighted vote. The results show that the technique enhances the classification rates of the standard fusion by 2.27% and 3.38% absolute in experiments carried out with two and three emotion categories, respectively. (C) 2011 Elsevier B.V. All rights reserved.
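The abstract names three decision-level combination rules (multiplication of probabilities, average of probabilities, unweighted vote) applied in a two-stage Fusion-0/Fusion-1 cascade. Below is a minimal sketch of those rules applied to per-class probability vectors; the two-stage wiring shown, the choice of the average rule at the second stage, and the toy numbers are illustrative assumptions rather than the authors' configuration.

```python
# Minimal sketch of the three decision-level combination rules named in the
# abstract (product, average, unweighted vote), applied twice to mimic a
# Fusion-0 / Fusion-1 cascade; toy probabilities are illustrative only.
import numpy as np

def product_rule(probs):
    """probs: (n_classifiers, n_classes) array of class probabilities."""
    fused = np.prod(probs, axis=0)
    return fused / fused.sum()

def average_rule(probs):
    return probs.mean(axis=0)

def vote_rule(probs):
    votes = np.bincount(probs.argmax(axis=1), minlength=probs.shape[1])
    return votes / votes.sum()

if __name__ == "__main__":
    # Predictions of prosodic, acoustic, lexical and dialogue-act classifiers
    # for two emotion categories (e.g. non-negative vs. negative).
    probs = np.array([[0.60, 0.40],
                      [0.70, 0.30],
                      [0.45, 0.55],
                      [0.80, 0.20]])
    # Fusion-0: one prediction per combination method.
    fusion0 = np.vstack([product_rule(probs),
                         average_rule(probs),
                         vote_rule(probs)])
    # Fusion-1: combine the Fusion-0 outputs again (average rule here).
    fusion1 = average_rule(fusion0)
    print("Fusion-0:", fusion0)
    print("Fusion-1:", fusion1, "-> class", int(fusion1.argmax()))
```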
C1 [Lopez-Cozar, Ramon] Univ Granada, Dept Languages & Comp Syst, Fac Comp Sci, E-18071 Granada, Spain.
[Silovsky, Jan; Kroul, Martin] Tech Univ Liberec, Inst Informat Technol & Elect, Fac Mechatron, Liberec, Czech Republic.
RP Lopez-Cozar, R (reprint author), Univ Granada, Dept Languages & Comp Syst, Fac Comp Sci, E-18071 Granada, Spain.
EM rlopezc@ugr.es; jan.silovsky@tul.cz; martin.kroul@tul.cz
RI Prieto, Ignacio/B-5361-2013; Lopez-Cozar, Ramon/A-7686-2012
OI Lopez-Cozar, Ramon/0000-0003-2078-495X
FU Spanish project HADA [TIN2007-64718]; Czech Grant Agency [102/08/0707];
Technical University of Liberec
FX This research has been funded by the Spanish project HADA TIN2007-64718,
the Czech Grant Agency Project No. 102/08/0707 and the Student Grant
Scheme (SGS) at the Technical University of Liberec. The authors would
like to thank the reviewers and the guest editors for their comments,
suggestions and corrections that significantly improved the quality of
this paper.
CR AI H, 2006, P INT PITTSB PA, P797
Ang J, 2002, P INT C SPOK LANG PR, P2037
Barra-Chicote R., 2009, P INT, P336
Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1
Bozkurt E., 2009, P INT BRIGHT, P324
Carletta J, 1996, COMPUT LINGUIST, V22, P249
COHEN J, 1960, EDUC PSYCHOL MEAS, V20, P37, DOI 10.1177/001316446002000104
Cover T M, 1991, ELEMENTS INFORM THEO
Dellaert F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608022
Devillers L., 2006, P INT C SPOK LANG PR, P801
HASTIE HW, 2002, P M ASS COMP LING, P384
HUBER R, 2000, P INT C SPOK LANG PR, V1, P665
Kanda T, 2004, HUM-COMPUT INTERACT, V19, P61, DOI 10.1207/s15327051hci1901&2_4
Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881
Klein J, 2002, INTERACT COMPUT, V14, P119
Kuncheva LI, 2001, PATTERN RECOGN, V34, P299, DOI 10.1016/S0031-3203(99)00223-X
LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310
Le CA, 2005, LECT NOTES ARTIF INT, V3518, P262
Lee C, 2003, P EUR, P157
Lee C, 2009, P INT, P320
Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534
Lee C-M, 2001, P IEEE WORKSH AUT SP, P240
Lee CM, 2002, P INT C SPOK LANG PR, P873
Liscombe J., 2005, P INT LISB PORT, P1845
Litman DJ, 2006, SPEECH COMMUN, V48, P559, DOI 10.1016/j.specom.2005.09.008
Lopez-Cozar R., 2005, SPOKEN MULTILINGUAL
LOPEZCOZAR R, 2005, COMPUTER SPEECH LANG, V20, P420, DOI DOI 10.1016/J.CSL.2005.05.003
Luengo I, 2005, P INTERSPEECH, P493
LUENGO I, 2009, P INT BRIGHT UK SEP, P332
LUGGER M, 2009, P INT, P1995
MOLLER S, 2004, QUALITY TELEPHONE BA
Morrison D, 2007, SPEECH COMMUN, V49, P98, DOI 10.1016/j.specom.2006.11.004
Nakatsu R., 1999, P INT C MULT COMP SY
Neiberg D, 2006, P INT C SPOK LANG PR, P809
Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S0167-6393(03)00099-2
Ortony A., 1990, COGNITIVE STRUCTURE
PETRUSHIN V, 2000, P ICSLP BEIJ CHIN
Plutchik R, 1994, PSYCHOL BIOL EMOTION, V1st
Polzehl T., 2009, P INT BRIGHT, P340
ROLI F, 2004, P 5 INT WORKSH MSC 2, V3077
Scheirer J, 2002, INTERACT COMPUT, V14, P93, DOI 10.1016/S0953-5438(01)00059-5
Schuller Bjorn, 2009, P INTERSPEECH, P312
Steidl S., 2009, AUTOMATIC CLASSIFICA
Stolcke A, 2000, COMPUT LINGUIST, V26, P339, DOI 10.1162/089120100561737
TAYLOR P, 1998, LANG SPEECH, V41, P489
XU L, 2009, P INT, P2035
Yacoub S., 2003, P EUROSPEECH, P729
NR 47
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2011
VL 53
IS 9-10
SI SI
BP 1210
EP 1228
DI 10.1016/j.specom.2011.01.006
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 810CW
UT WOS:000294104000012
ER
PT J
AU Garner, PN
AF Garner, Philip N.
TI Cepstral normalisation and the signal to noise ratio spectrum in
automatic speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic speech recognition; Cepstral normalisation; Noise robustness;
Aurora
ID ENHANCEMENT; SUPPRESSION
AB Cepstral normalisation in automatic speech recognition is investigated in the context of robustness to additive noise. In this paper, it is argued that such normalisation leads naturally to a speech feature based on signal to noise ratio rather than absolute energy (or power). Explicit calculation of this SNR-cepstrum by means of a noise estimate is shown to have theoretical and practical advantages over the usual (energy based) cepstrum. The relationship between the SNR-cepstrum and the articulation index, known in psycho-acoustics, is discussed. Experiments are presented suggesting that the combination of the SNR-cepstrum with the well known perceptual linear prediction method can be beneficial in noisy environments. (C) 2011 Elsevier B.V. All rights reserved.
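The abstract describes replacing the absolute-energy spectrum with a signal-to-noise-ratio spectrum, computed explicitly from a noise estimate, before the cepstral step. The following is a minimal sketch of that idea under stated assumptions: the framing parameters, the filterbank-free formulation, and the crude noise estimate (mean of the first few frames) are illustrative choices, not the paper's exact recipe.

```python
# Minimal sketch of an "SNR-cepstrum": divide each frame's power spectrum by
# a noise power estimate before the log/DCT step, instead of using absolute
# energy.  Framing and the simple leading-frame noise estimate are assumptions.
import numpy as np
from scipy.fft import dct

def frame_power_spectra(signal, frame_len=400, hop=160, n_fft=512):
    frames = [signal[i:i + frame_len] * np.hamming(frame_len)
              for i in range(0, len(signal) - frame_len, hop)]
    return np.abs(np.fft.rfft(frames, n=n_fft)) ** 2

def snr_cepstrum(signal, n_noise_frames=10, n_ceps=13):
    power = frame_power_spectra(signal)
    noise = power[:n_noise_frames].mean(axis=0) + 1e-10   # noise estimate
    snr = np.maximum(power / noise, 1e-10)                # SNR spectrum
    return dct(np.log(snr), type=2, axis=1, norm="ortho")[:, :n_ceps]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sig = rng.normal(size=16000)        # 1 s of noise at 16 kHz as a stand-in
    print(snr_cepstrum(sig).shape)      # (n_frames, n_ceps)
```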
C1 Idiap Res Inst, Ctr Parc, CH-1920 Martigny, Switzerland.
RP Garner, PN (reprint author), Idiap Res Inst, Ctr Parc, Rue Marconi 19,POB 592, CH-1920 Martigny, Switzerland.
EM pgarner@idiap.ch
FU Swiss National Science Foundation under the National Center of
Competence in Research (NCCR) on Interactive Multi-modal Information
Management (IM2)
FX This work was supported by the Swiss National Science Foundation under
the National Center of Competence in Research (NCCR) on Interactive
Multi-modal Information Management (IM2). This paper only reflects the
authors' views and funding agencies are not liable for any use that may
be made of the information contained herein.
CR ACERO A, 1990, THESIS CARNEGIE MELL, P15213
Acero A., 1990, P IEEE INT C AC SPEE, V2, P849
Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615
Allen JB, 2005, J ACOUST SOC AM, V117, P2212, DOI 10.1121/1.1856231
Au Yeung S.-K., 2004, P INT C SPOK LANG PR
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Cohen I, 2005, IEEE T SPEECH AUDI P, V13, P870, DOI 10.1109/TSA.2005.851940
de la Torre A, 2005, IEEE T SPEECH AUDI P, V13, P355, DOI 10.1109/TSA.2005.845805
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
*ETSI, 2002, 202050 ETSI
FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530
GARNER PN, 2009, P IEEE WORKSH AUT SP
Garner P.N., 2010, P INT MAK JAP
HAIN T, 2006, P NIST RT06 SPRING W
Hain T., 2010, P INT MAK JAP
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Hirsch H. G., 2000, ISCA ITRW ASR2000 AU
LATHOUD G, 2006, 0609 IDIAPRR
LATHOUD G, 2005, P IEEE WORKSH AUT SP
LI J, 2007, P IEEE WORKSH AUT SP
LINDBERG B, 2001, DANISH SPEECHDAT CAR
LOBDELL BE, 2008, P INT BRISB AUSTR
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394
MORENO PJ, 1996, THESIS CARNEGIE MELL, P15213
Moreno PJ, 1996, P IEEE INT C AC SPEE, V2, P733
NETSCH L, 2001, AU27300 STQ AUR DSR
PARIHAR N, 2004, P 12 EUR SIGN PROC C
Plapous C., 2004, P IEEE INT C AC SPEE, V1, P289
Ris C, 2001, SPEECH COMMUN, V34, P141, DOI 10.1016/S0167-6393(00)00051-0
Segura J.C., 2002, P ICSLP 02, P225
STEVENS SS, 1957, PSYCHOL REV, V64, P153, DOI 10.1037/h0046162
Van Compernolle D., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90027-2
Viikki O, 1998, SPEECH COMMUN, V25, P133, DOI 10.1016/S0167-6393(98)00033-8
VIKKI O, 1997, ROBUST SPEECH RECOGN, P107
NR 37
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2011
VL 53
IS 8
BP 991
EP 1001
DI 10.1016/j.specom.2011.05.007
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 804RJ
UT WOS:000293671400001
ER
PT J
AU D'Haro, LF
de Cordoba, R
San-Segundo, R
Ferreiros, J
Pardo, JM
AF Fernando D'Haro, Luis
de Cordoba, Ricardo
San-Segundo, Ruben
Ferreiros, Javier
Manuel Pardo, Jose
TI Design and evaluation of acceleration strategies for speeding up the
development of dialog applications
SO SPEECH COMMUNICATION
LA English
DT Article
DE Development tools; Automatic design; VoiceXML; Data mining; Speech-based
dialogs
ID PLATFORM; SYSTEMS
AB In this paper, we describe a complete development platform that features different innovative acceleration strategies, not included in any other current platform, that simplify and speed up the definition of the different elements required to design a spoken dialog service. The proposed accelerations are mainly based on using the information from the backend database schema and contents, as well as cumulative information produced throughout the different steps in the design. Thanks to these accelerations, the interaction between the designer and the platform is improved, and in most cases the design is reduced to simple confirmations of the "proposals" that the platform dynamically provides at each step.
In addition, the platform provides several other accelerations, such as configurable templates that can be used to define the different tasks in the service or the dialogs to obtain or show information to the user, automatic proposals for the best way to request slot contents from the user (i.e. using mixed-initiative or directed forms), an assistant that offers the set of most probable actions required to complete the definition of the different tasks in the application, and another assistant for resolving specific modality details such as confirmations of user answers or how to present the lists of results retrieved from the backend database to the user. The platform also allows the creation of speech grammars and prompts, database access functions, and the use of mixed-initiative and over-answering dialogs. In the paper we also describe each assistant in the platform in detail, emphasizing the different kinds of methodologies followed to facilitate the design process in each one.
Finally, we describe the results obtained in both a subjective and an objective evaluation with different designers, which confirm the viability, usefulness, and functionality of the proposed accelerations. Thanks to the accelerations, the design time is reduced by more than 56% and the number of keystrokes by 84%. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Fernando D'Haro, Luis; de Cordoba, Ricardo; San-Segundo, Ruben; Ferreiros, Javier; Manuel Pardo, Jose] Univ Politecn Madrid, Grp Tecnol Habla, Madrid, Spain.
RP D'Haro, LF (reprint author), ETSI Telecomunicac, Ciudad Univ S-N, Madrid 28040, Spain.
EM lfdharo@die.upm.es; cordoba@die.upm.es; lapiz@die.upm.es;
jfl@die.upm.es; pardo@die.upm.es
RI Pardo, Jose/H-3745-2013; Cordoba, Ricardo/B-5861-2008
OI Cordoba, Ricardo/0000-0002-7136-9636
FU ROBONAUTA [DPI2007-66846-c02-02]; SD-TEAM [TIN2008-06856-C05-03]
FX This work has been supported by ROBONAUTA (DPI2007-66846-c02-02) and
SD-TEAM (TIN2008-06856-C05-03). We want to thank the following people
for their contribution in the coding of the platform and runtime system:
to Rosalia Ramos, Jose Ramon Jimenez, Javier Morante, Ignacio Ibarz, and
Ruben Martin from the Universidad Politecnica de Madrid, and to all the
members of the GEMINI project for making possible the creation of the
platform described in this paper.
CR Agah A, 2000, INTERACT COMPUT, V12, P529, DOI 10.1016/S0953-5438(99)00022-3
BALENTINE B, 2001, BUILD SPEECH RECOGNI, P414
Bohus D, 2009, COMPUT SPEECH LANG, V23, P332, DOI 10.1016/j.csl.2008.10.001
CHUNG G, 2004, ACL, P63
CORDOBA R, 2004, INT C SPOK LANG PROC, P257
D'Haro L. F., 2009, THESIS U POLITECNICA
DHARO LF, 2004, INT C SPOK LANG PROC, P3057
D'Haro LF, 2006, SPEECH COMMUN, V48, P863, DOI 10.1016/j.specom.2005.11.001
EBERMAN B, 2002, 11 INT C WWW, P713
Feng J., 2003, WORKSH AUT SPEECH RE, P168
GEORGILA K, 2004, 4 INT C LANG RES EV
Hamerich S. W., 2008, 9 SIGDIAL WORKSH DIS, P92
HAMERICH SW, 2003, XML BASED DIALOG DES, P404
Jung S., 2008, SPEECH COMMUN, V50, P683
Lopez-Cozar R., 2005, SPOKEN MULTILINGUAL
McGlashan S., 2004, VOICE EXTENSIBLE MAR
McTear M, 2005, SPEECH COMMUN, V45, P249, DOI 10.1016/j.specom.2004.11.006
McTear M., 1998, INT C SPOK LANG PROC, P1223
Pargellis AN, 2004, SPEECH COMMUN, V42, P329, DOI 10.1016/j.specom.2003.10.003
Polifroni J., 2006, INT C LANG RES EV LR, P143
Schubert V., 2005, EUR C SPEECH COMM TE, P789
Tsai MJ, 2006, EXPERT SYST APPL, V31, P684, DOI 10.1016/j.eswa.2006.01.010
Wang YY, 2006, SPEECH COMMUN, V48, P390, DOI 10.1016/j.specom.2005.07.001
Wolters M, 2009, INTERACT COMPUT, V21, P276, DOI 10.1016/j.intcom.2009.05.009
NR 24
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2011
VL 53
IS 8
BP 1002
EP 1025
DI 10.1016/j.specom.2011.05.008
PG 24
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 804RJ
UT WOS:000293671400002
ER
PT J
AU Polyakova, T
Bonafonte, A
AF Polyakova, Tatyana
Bonafonte, Antonio
TI Introducing nativization to Spanish TTS systems
SO SPEECH COMMUNICATION
LA English
DT Article
DE Nativization; Pronunciation by analogy; Multilingual TTS;
Grapheme-to-phoneme conversion; Phoneme-to-phoneme conversion
ID TO-PHONEME CONVERSION; ANALOGY
AB In the modern world, speech technologies must be flexible and adaptable to any framework. Mass media globalization introduces multilingualism as a challenge for the most popular speech applications such as text-to-speech synthesis and automatic speech recognition. Mixed-language texts vary in their nature, and when they are processed, some essential characteristics must be considered. In Spain and other Spanish-speaking countries, the use of Anglicisms and other words of foreign origin is constantly growing. A particularity of peninsular Spanish is the tendency to nativize the pronunciation of non-Spanish words so that they fit properly into Spanish phonetic patterns. In our previous work, we proposed hand-crafted nativization tables that were capable of correctly nativizing 24% of the words in the test data. In this work, our goal was to approach the nativization challenge with data-driven methods, because they are transferable to other languages and do not lose performance compared with explicit rules written manually by experts. The training and test corpora for nativization consisted of 1000 and 100 words, respectively, and were crafted manually. Different specifications of nativization by analogy and learning from errors focused on finding the best nativized pronunciation of foreign words. The best objective nativization results showed an improvement from 24% to 64% in word accuracy compared with our previous work. Furthermore, a subjective evaluation of the synthesized speech led to the conclusion that nativization by analogy is clearly the method preferred by listeners of different backgrounds when compared with previously proposed methods. These results are quite encouraging and prove that even a small training corpus is sufficient for achieving significant improvements in naturalness for English inclusions of variable length in Spanish utterances. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Polyakova, Tatyana; Bonafonte, Antonio] Univ Politecn Cataluna, Signal Theory Dept, ES-08034 Barcelona, Spain.
RP Polyakova, T (reprint author), Univ Politecn Cataluna, Signal Theory Dept, Jordi Girona 1-3, ES-08034 Barcelona, Spain.
EM tatyana.polyakova@upc.edu; antonio.bonafonte@upc.edu
CR [Anonymous], 1999, HDB INT PHONETIC ASS
Bear D., 2003, ENGLISH LEARNERS REA, P71
Bellegarda JR, 2005, SPEECH COMMUN, V46, P140, DOI 10.1016/j.specom.2005.03.002
Bisani M, 2008, SPEECH COMMUN, V50, P434, DOI 10.1016/j.specom.2008.01.002
BLACK A, 1998, SSW3, P77
BLACK AW, 2004, P ICASSP MAY, V3, P761
BONAFONTE A, 2008, UPC TTS SYSTEM DESCR
Bonafonte Antonio, 2006, P INT C LANG RES EV, P311
Brill E, 1995, COMPUT LINGUIST, V21, P543
Canellada Maria Josefa, 1987, PRONUNCIACION ESPANO
CONDE X, 2001, IANUA ROMANCE PHI S4
DAMPER RI, 2004, P 5 INT SPEECH COMM, P209
Dedina M. J., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90017-K
*ED 1, 2011, ENGL PROF IND
FOX RA, 1995, J ACOUST SOC AM, V97, P2540, DOI 10.1121/1.411974
GLUSHKO R, 1981, INTERACTIVE PROCESSE, P1
Hammond Robert M., 2001, SOUNDS SPANISH ANAL
Hartikainen E., 2003, P EUR C SPEECH COMM, P1529
LADEFOGED P, 2003, VOWELS CONSONANTS
LINDSTROM A, 2004, THESIS U LINKOPING L
Llitjos Ariadna Font, 2001, P EUROSPEECH, P1919
LLORENTE J, 2004, LIBRO ESTILO CANAL T
Marchand Y, 2000, COMPUT LINGUIST, V26, P195, DOI 10.1162/089120100561674
Pfister Beat, 2003, P EUR, P2037
POLYAKOVA T, 2006, P INT 2006 PITTSB US, P2442
POLYAKOVA T, 2009, P IEEE INT C AC SPEE, P4261
POLYAKOVA T, 2008, ACT 5 JORN TECN HABL, P207
RAYNOLDS L, 2009, READ WRIT, P1
Real Academia Espanola, 1992, DICC LENG ESP
SEJNOWSKI T, 1993, NETTALK CORPUS
SOONKLANG T, 2008, NAT LANG ENG, V14, P527
Swan M., 2001, LEARNER ENGLISH TEAC, V2nd
Taylor P., 2005, P INT, P1973
TRANCOSO I, 1999, P EUR 1999 5 9 SEPT, P195
TRANCOSO I, 1995, P WORKSH INT LANG SP, P193
Van den Heuvel H., 2009, P INT BRIGHT UK, P2991
VANDENBOSCH A, 1993, P 6 C EUR CHAPT ASS, P45, DOI 10.3115/976744.976751
Wells J. C., 1982, ACCENTS ENGLISH INTR
Yavas M., 2006, APPL ENGLISH PHONOLO
Zheng M, 2005, LECT NOTES ARTIF INT, V3614, P600
NR 40
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2011
VL 53
IS 8
BP 1026
EP 1041
DI 10.1016/j.specom.2011.05.009
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 804RJ
UT WOS:000293671400003
ER
PT J
AU Davidson, L
AF Davidson, Lisa
TI Characteristics of stop releases in American English spontaneous speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Stop releases; Stop-consonant sequences; Spontaneous speech;
Articulatory coordination
ID CONSONANT SEQUENCES; GLOTTALIZATION; DURATION; OVERLAP
AB This study examines the factors affecting the production of stop releases in American English spontaneous speech. Previous research has shown that releases are conditioned by phonetic and social factors. However, previous studies either rely exclusively on read speech, or for sociolinguistic studies, focus on phrase-final stops. In this study, spontaneous speech is collected from two sources: interviews from the non-profit StoryCorps project and from sentences spontaneously generated in a picture description task. Stop releases were examined before obstruents and nasals in word-medial position (e.g. rugby), word-final, phrase-medial position (e.g. They crack nuts), and prepausally (e.g. I look up). Phonetic factors taken into account include identity of the stop, directionality of place of articulation in the consonant cluster (front-to-back vs. back-to-front) and manner of C2. For the StoryCorps data, race of the speaker was also found to be an important predictor. Results showed that approximately a quarter of the stops followed by a consonant were released, but release was strongly affected by the place of the stop and the manner of the following consonant. Release of pre-pausal stops differed between black and white speakers; the latter had double the amount of final release. Other realizations of the stops, such as deletion, lenition, and glottalization are also analyzed. (C) 2011 Elsevier B.V. All rights reserved.
C1 NYU, Dept Linguist, New York, NY 10003 USA.
RP Davidson, L (reprint author), NYU, Dept Linguist, 10 Washington Pl, New York, NY 10003 USA.
EM lisa.davidson@nyu.edu
FU NSF [BCS-0449560]
FX The author would like to thank Marcos Rohena-Madrazo and Vincent
Chanethom for their assistance in data collection and analysis, and Jon
Brennan for statistical consulting. Thanks also to the members of the
NYU Phonetics and Experimental Phonology Lab, Cohn Wilson, and to the
audience at the Acoustical Society of America in Cancun, Mexico for
valuable feedback. This research was supported by NSF CAREER Grant
BCS-0449560.
CR Balota DA, 2007, BEHAV RES METHODS, V39, P445, DOI 10.3758/BF03193014
Batliner A., 1993, P ESCA WORKSH PROS L, P176
BENOR S, 2001, SELECTED PAPERS NWAV, V29, P1
BOERSMA P, 2011, PRAAT 5 2 DOING PHON
Browman C. P., 1990, PAPERS LABORATORY PH, P341
Bucholtz M., 1998, P 4 BERK WOM LANG C, P119
Byrd D, 1996, J PHONETICS, V24, P209, DOI 10.1006/jpho.1996.0012
Byrd D., 1993, UCLA WORKING PAPERS, V83, P97
Catford John C., 1977, FUNDAMENTAL PROBLEMS
Chitoran I., 2002, PAPERS LAB PHONOLOGY, V7
Cho T, 1999, J PHONETICS, V27, P207, DOI 10.1006/jpho.1999.0094
CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1553, DOI 10.1121/1.395911
CRYSTAL TH, 1988, J PHONETICS, V16, P285
Davidson L, 2008, J INT PHON ASSOC, V38, P137, DOI 10.1017/S0025100308003447
Dilley L, 1996, J PHONETICS, V24, P423, DOI 10.1006/jpho.1996.0023
Eckert P, 2008, J SOCIOLING, V12, P453, DOI 10.1111/j.1467-9841.2008.00374.x
Eddington D, 2010, AM SPEECH, V85, P338, DOI 10.1215/00031283-2010-019
Francis W.N., 1958, STRUCTURE AM ENGLISH
Gafos A. I., 2010, LAB PHONOLOGY, V10, P657
Ghosh PK, 2009, J ACOUST SOC AM, V126, pEL1, DOI 10.1121/1.3141876
Guy Gregory, 1997, LANG VAR CHANGE, V9, P149
Guy Gregory R, 1980, LOCATING LANGUAGE TI, P1
Hardcastle W. J., 1979, CURRENT ISSUES PHONE, P531
Harris J., 1994, ENGLISH SOUND STRUCT
HENDERSON JB, 1982, PHONETICA, V39, P71
HENTON CG, 1987, LANGUAGE SPEECH MIND, P3
Kahn D., 1980, SYLLABLE BASED GEN E
KIPARSKY P, 1979, LINGUIST INQ, V10, P421
Kochetov A., 2002, PRODUCTION PERCEPTIO
Labov William, 1972, SOCIOLINGUISTIC PATT
Ladefoged Peter, 2005, COURSE PHONETICS
Lamothe Peter, 2006, J AM HIST, V93, P171
Nespor M., 1986, PROSODIC PHONOLOGY
Neu H., 1980, LOCATING LANGUAGE TI, P37
Niyogi P, 2002, J ACOUST SOC AM, V111, P1063, DOI 10.1121/1.1427666
Ortega-Llebaria M, 2004, LAB APPROACHES SPANI, P237
R Development Core Team, 2011, R LANG ENV STAT COMP
Raymond William D., 2006, LANG VAR CHANGE, V18, P55
Redi L, 2001, J PHONETICS, V29, P407, DOI 10.1006/jpho.2001.0145
Roberts J, 2006, AM SPEECH, V81, P227, DOI 10.1215/00031283-2006-016
Selkirk E., 1982, STRUCTURE PHONOLOG 2, P337
SMITH DB, 2009, REF REV, V23, P57
Sumner M, 2005, J MEM LANG, V52, P322, DOI 10.1016/j.jml.2004.11.004
SURPRENANT A, 1988, J ACOUST SOC AM, V104, P518
Tiede M. K., 2001, J ACOUST SOC AM, V110, P2657
Trager George L., 1951, OUTLINE ENGLISH STRU
Tsukada K, 2004, PHONETICA, V61, P67, DOI 10.1159/000082557
Wolfram Walt, 1969, SOCIOLINGUISTIC DESC
Podesva RJ, 2002, LANGUAGE AND SEXUALITY: CONTESTING MEANING IN THEORY AND PRACTICE, P175
Yuan J., 2008, P AC 08, P5687
Zsiga E., 2003, STUDIES 2 LANGUAGE A, V25, P399
Zsiga EC, 2000, J PHONETICS, V28, P69, DOI 10.1006/jpho.2000.0109
ZSIGA EC, 1994, J PHONETICS, V22, P121
NR 53
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2011
VL 53
IS 8
BP 1042
EP 1058
DI 10.1016/j.specom.2011.05.010
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 804RJ
UT WOS:000293671400004
ER
PT J
AU Chen, TH
Massaro, DW
AF Chen, Trevor H.
Massaro, Dominic W.
TI Evaluation of synthetic and natural Mandarin visual speech: Initial
consonants, single vowels, and syllables
SO SPEECH COMMUNICATION
LA English
DT Article
DE Mandarin; Visual speech; Visemes; Speechreading; Synthetic; Natural;
Talking head
ID VISIBLE SPEECH; HEARING-LOSS; PERCEPTION; INDIVIDUALS; VOCABULARY;
CHILDREN
AB Although the auditory aspects of Mandarin speech are more heavily researched and better known in the field, this study addresses its visual aspects by examining the perception of both natural and synthetic Mandarin visual speech. In perceptual experiments, the synthetic visual speech of a computer-animated Mandarin talking head was evaluated and subsequently improved. In addition, the basic (or "minimum") units of Mandarin visual speech were determined for initial consonants and final single vowels. Overall, the current study achieved solid improvements in synthetic visual speech, one step towards a Mandarin synthetic talking head with realistic speech. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Chen, Trevor H.; Massaro, Dominic W.] Univ Calif Santa Cruz, Dept Psychol, Santa Cruz, CA 95064 USA.
RP Chen, TH (reprint author), Univ Calif Santa Cruz, Dept Psychol, 1156 High St, Santa Cruz, CA 95064 USA.
EM river_rover_t@yahoo.com
FU Federico and Rena Perlino Scholarship Award; Eugene Cota-Robles
Fellowship; Psychology Department at the University of California, Santa
Cruz
FX The research and writing of this article were supported by the Federico
and Rena Perlino Scholarship Award, the Eugene Cota-Robles Fellowship,
and the Psychology Department at the University of California, Santa
Cruz (the Doctoral Student Sabbatical Fellowship and the Mini-Grant
Research Fellowship). The authors thank Michael M. Cohen for offering
expert technical assistance.
CR Andersson U., 2001, J DEAF STUD DEAF EDU, V6, P103, DOI DOI 10.1093/DEAFED/6.2.103
BAILLY G, 2000, COST254 WORKSH FRIEN
Bosseler A, 2003, J AUTISM DEV DISORD, V33, P653, DOI 10.1023/B:JADD.0000006002.82367.4f
Campbell CS, 1997, PERCEPTION, V26, P627, DOI 10.1068/p260627
Caplier A, 2007, EURASIP J IMAGE VIDE, DOI 10.1155/2007/45641
CHEN F, 2005, P 2005 INT C CYB
CHEN HC, 1991, MANDARIN CONSONANT V, P149
CHEN HC, 1992, MANDARIN VOWEL DIPHT, P179
Chen TH, 2008, J ACOUST SOC AM, V123, P2356, DOI 10.1121/1.2839004
COHEN MM, 1990, BEHAV RES METH INSTR, V22, P260
COLE R, 1998, STILL ESCA WORKSH SP, P163
COSI P, 2002, ICSLP 2002 7 INT C S
FISHER CG, 1968, J SPEECH HEAR RES, V11, P796
GAILEY L, 1987, HEARING EYE PSYCHOL, P115
JACKSON PL, 1988, VOLTA REV, V90, P99
LADEFOGED PETER, 2001, COURSE PHONETICS
Lee W-S., 2003, J INT PHON ASSOC, V33, P109, DOI 10.1017/S0025100303001208
MacMillan N. A., 2005, DETECTION THEORY USE
Massaro D. W., 1998, PERCEIVING TALKING F
Massaro D. W., 1989, EXPT PSYCHOL INFORM
Massaro D. W., 1987, SPEECH PERCEPTION EA
Massaro D. W., 2003, P EUR INT 8 EUR C SP
Massaro DW, 2004, J SPEECH LANG HEAR R, V47, P304, DOI 10.1044/1092-4388(2004/025)
MASSARO DW, 2005, P 38 ANN HAW INT C S
MASSARO DW, 2006, P 9 INT C SPOK LANG, P825
MASSARO DW, 2008, P INT 2008 BRISB QUE, P2623
Massaro DW, 2004, VOLTA REV, V104, P141
MING O, 1999, ICAT 99
Mohammed T, 2006, CLIN LINGUIST PHONET, V20, P621, DOI 10.1080/02699200500266745
OUNI S, 2003, P 15 INT C PHON SCI
Ouni S, 2005, SPEECH COMMUN, V45, P115, DOI 10.1016/j.specom.2004.11.008
Pei Y, 2007, IEEE T VIS COMPUT GR, V13, P58, DOI 10.1109/TVCG.2007.22
Pei YR, 2006, LECT NOTES COMPUT SC, V3851, P591
PULLUM Geoffrey, 1996, PHONETIC SYMBOL GUID
WALDEN BE, 1977, J SPEECH HEAR RES, V20, P130
WANG AH, 2000, P INT S CHIN SPOK LA, P215
WANG ZM, 2003, J SOFTWARE, V16, P1054
WU Z, 2006, P 9 INT C SPOK LANG
ZHOU W, 2007, 6 IEEE ACIS INT C CO, P924
NR 39
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2011
VL 53
IS 7
BP 955
EP 972
DI 10.1016/j.specom.2011.03.009
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 791OZ
UT WOS:000292675800001
ER
PT J
AU Nose, T
Kobayashi, T
AF Nose, Takashi
Kobayashi, Takao
TI Speaker-independent HMM-based voice conversion using adaptive
quantization of the fundamental frequency
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice conversion; Hidden Markov model (HMM); HMM-based speech synthesis;
Speaker-independent model; Fundamental frequency quantization; Prosody
conversion
ID SPEECH SYNTHESIS
AB This paper describes a speaker-independent HMM-based voice conversion technique that incorporates context-dependent prosodic symbols obtained using adaptive quantization of the fundamental frequency (F0). In the HMM-based conversion of our previous study, the input utterance of a source speaker is decoded into phonetic and prosodic symbol sequences, and the converted speech is generated using the decoded information from the pre-trained target speaker's phonetically and prosodically context-dependent HMM. In our previous work, we generated the F0 symbol by quantizing the average log F0 value of each phone using the global mean and variance calculated from the training data. In the current study, these statistical parameters are obtained from each utterance itself, and this adaptive method improves on the F0 conversion performance of the conventional one. We also introduce a speaker-independent model for decoding the input speech and model adaptation for training the target speaker's model, in order to reduce the required amount of training data under a condition where the phonetic transcription is available for the input speech. Objective and subjective experimental results for Japanese speech demonstrate that the adaptive quantization method gives better F0 conversion performance than the conventional one. Moreover, our technique with only ten sentences of the target speaker's adaptation data outperforms the conventional GMM-based one using parallel data of 200 sentences. (C) 2011 Elsevier B.V. All rights reserved.
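The abstract describes quantizing the average log F0 of each phone into prosodic symbols using the mean and variance of the utterance itself rather than global training-data statistics. The sketch below illustrates that kind of per-utterance z-score quantization; the five-level symbol set and the threshold values are illustrative assumptions, not the paper's actual settings.

```python
# Minimal sketch of adaptive F0 quantization: the average log F0 of each
# phone is z-scored with the utterance's own mean and standard deviation and
# mapped to a small set of prosodic symbols.  The number of levels and the
# thresholds are illustrative assumptions only.
import numpy as np

def quantize_f0(phone_logf0, thresholds=(-1.5, -0.5, 0.5, 1.5)):
    """phone_logf0: average log F0 per phone for one utterance (voiced phones)."""
    values = np.asarray(phone_logf0, dtype=float)
    mean, std = values.mean(), values.std() + 1e-10   # utterance-level statistics
    z = (values - mean) / std
    # symbol 0 = lowest F0 band, len(thresholds) = highest band
    return np.digitize(z, thresholds)

if __name__ == "__main__":
    logf0 = np.log([110, 125, 140, 180, 150, 120])    # toy phone averages (Hz)
    print(quantize_f0(logf0))                          # array of symbols 0..4
```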
C1 [Nose, Takashi; Kobayashi, Takao] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan.
RP Nose, T (reprint author), Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Yokohama, Kanagawa 2268502, Japan.
EM takashi.nose@ip.titech.ac.jp; takao.kobayashi@ip.titech.ac.jp
CR ABE M, 1991, INT CONF ACOUST SPEE, P765, DOI 10.1109/ICASSP.1991.150451
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Imai S., 1983, IECE T A, P122
Kain A, 1998, INT CONF ACOUST SPEE, P285, DOI 10.1109/ICASSP.1998.674423
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
MASHIMO M, 2002, P ICSLP, P293
NAKANO Y, 2006, P INTERSPEECH 2006 I, P2286
Nose T., 2010, IEICE T INF SYSTEMS, P2483
NOSE T, 2010, P 7 ISCA SPEECH SYNT, P80
NOSE T, 2010, P INT 2010, P1724
Nose Takashi, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495548
Rabiner L, 1993, FUNDAMENTALS SPEECH
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472
STYLIANOU Y, 2009, P ICASSP 2009, P3585
Tamura M., 2001, P EUROSPEECH 2001 SE, P345
TANAKA T, 2002, P ICASSP 2002, P329
Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344
Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816
Tokuda K, 1998, INT CONF ACOUST SPEE, P609, DOI 10.1109/ICASSP.1998.675338
TOKUDA K, 1995, P ICASSP, P660
Tokuda K, 2002, IEICE T INF SYST, VE85D, P455
TURK O, 2009, P ICASSP 2009, P3597
TURK O, 2002, P INT C SPOK LANG PR, P289
VERMA A, 2005, ACM T SPEECH LANGUAG, V2, P1, DOI 10.1145/1075389.1075393
Ye H, 2006, IEEE T AUDIO SPEECH, V14, P1301, DOI 10.1109/TSA.2005.860839
YOKOMIZO S, 2010, P INT 2010, P430
Yoshimura T, 1999, P EUR, P2347
YUTANI K, 2009, P ICASSP 2009, P3897
Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825
NR 30
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2011
VL 53
IS 7
BP 973
EP 985
DI 10.1016/j.specom.2011.05.001
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 791OZ
UT WOS:000292675800002
ER
PT J
AU Peng, JX
Bei, CX
Sun, HT
AF Peng, Jianxin
Bei, Chengxun
Sun, Haitao
TI Relationship between Chinese speech intelligibility and speech
transmission index in rooms based on auralization
SO SPEECH COMMUNICATION
LA English
DT Article
DE Chinese speech intelligibility; Speech transmission index; Room impulse
response; Phonetically balanced test; Diotic listening; Dichotic
listening
ID OCTAVE-BAND WEIGHTS; RASTI-METHOD; VALIDATION; ENGLISH
AB Based on simulated monaural and binaural room impulse responses, the relationship between Chinese speech intelligibility scores and the speech transmission index (STI), including the effect of noise, is investigated using a phonetically balanced test in virtual rooms. The results show that Chinese speech intelligibility scores increase monotonically with STI values. The correlation coefficients are 0.95 and 0.90, and the standard deviations are 5.6% and 6.7%, under diotic and dichotic listening conditions, respectively. Compared with diotic listening based on monaural room impulse responses, dichotic listening based on binaural room impulse responses provides a 2.7 dB signal-to-noise ratio benefit for Chinese speech intelligibility. The STI method is well suited to predicting and evaluating Chinese speech intelligibility in rooms. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Peng, Jianxin; Bei, Chengxun] S China Univ Technol, Dept Phys, Sch Sci, Guangzhou 510640, Peoples R China.
[Peng, Jianxin; Sun, Haitao] S China Univ Technol, State Key Lab Subtrop Bldg Sci, Guangzhou 510640, Peoples R China.
RP Peng, JX (reprint author), S China Univ Technol, Dept Phys, Sch Sci, Guangzhou 510640, Peoples R China.
EM phjxpeng@163.com
FU National Natural Science Foundation of China [10774048]; Science and
Technology Planning Project of Guangdong Province, China
[2008B080701020]; State Key Laboratory of Subtropical Building Science,
South China University of Technology, China [2008KB32]
FX The authors thank the students who participated in subjective evaluation
of Chinese speech intelligibility. This work was supported by the National
Natural Science Foundation of China (Grant No. 10774048), Science and
Technology Planning Project of Guangdong Province, China (Grant No.
2008B080701020) and Opening Project of the State Key Laboratory of
Subtropical Building Science, South China University of Technology,
China (Grant No. 2008KB32).
CR ANDERSON BW, 1987, J ACOUST SOC AM, V81, P1982, DOI 10.1121/1.394764
[Anonymous], 2003, 6026816 IEC
[Anonymous], 1995, 15508 GBT
BRADLEY JS, 1986, J ACOUST SOC AM, V80, P837, DOI 10.1121/1.393907
Christensen C.L., 2009, ODEON ROOM ACOUSTICS
Diaz C, 1995, APPL ACOUST, V46, P363, DOI 10.1016/0003-682X(95)00016-3
HOUTGAST T, 1984, ACUSTICA, V54, P185
HOUTGAST T, 1973, ACUSTICA, V28, P66
JACOB KD, 1991, J AUDIO ENG SOC, V39, P232
Kang J, 1998, J ACOUST SOC AM, V103, P1213, DOI 10.1121/1.421253
Kruger K., 1991, Canadian Acoustics, V19
MIJIC M, 1991, ACUSTICA, V74, P143
Peng JX, 2007, SPEECH COMMUN, V49, P933, DOI 10.1016/j.specom.2007.06.001
Peng JX, 2005, APPL ACOUST, V66, P591, DOI 10.1016/j.apacoust.2004.08.006
Peng JX, 2006, ACTA ACUST UNITED AC, V92, P79
Peng JX, 2008, CHINESE SCI BULL, V53, P2748, DOI 10.1007/s11434-008-0383-5
SHAO L, 1989, P 5 ARCH PHYS BEIJ
SHEN H, 1993, AUDIO ENG, P2
Steeneken HJM, 1999, SPEECH COMMUN, V28, P109, DOI 10.1016/S0167-6393(99)00007-2
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
Steeneken HJM, 2002, SPEECH COMMUN, V38, P413, DOI 10.1016/S0167-6393(02)00010-9
Steeneken HJM, 2002, SPEECH COMMUN, V38, P399, DOI 10.1016/S0167-6393(02)00011-0
Wijngaarden S., 2008, J ACOUST SOC AM, V123, P4514
Yang W, 2007, ACTA ACUST UNITED AC, V93, P991
ZHANG JL, 1981, ACTA ACUST, V7, P237
NR 25
TC 5
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2011
VL 53
IS 7
BP 986
EP 990
DI 10.1016/j.specom.2011.05.004
PG 5
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 791OZ
UT WOS:000292675800003
ER
PT J
AU Beukelman, DR
Childes, J
Carrell, T
Funk, T
Ball, LJ
Pattee, GL
AF Beukelman, David R.
Childes, Jana
Carrell, Tom
Funk, Trisha
Ball, Laura J.
Pattee, Gary L.
TI Perceived attention allocation of listeners who transcribe the speech of
speakers with amyotrophic lateral sclerosis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Amyotrophic lateral sclerosis; Dysarthria; Perception
AB The purpose of this study was to investigate the self-perceived attention allocation of listeners as they transcribed the speech samples of speakers with mild to severe dysarthria as a result of amyotrophic lateral sclerosis. Listeners reported that their perceived attention allocation increased consistently as speech intelligibility for sentences decreased from 100% to 75%. In this study, self-perceptions of attention allocation peaked between 75% and 80% intelligibility. These results support the conclusion that listeners experience a considerable perceptual load as they attempt to comprehend the messages of persons whose speech has relatively high intelligibility but is distorted by dysarthria. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Beukelman, David R.; Childes, Jana; Carrell, Tom; Funk, Trisha] Univ Nebraska, Barkley Mem Ctr 301, Lincoln, NE 68583 USA.
[Beukelman, David R.; Pattee, Gary L.] Univ Nebraska, Med Ctr, Omaha, NE 68198 USA.
[Beukelman, David R.] Madonna Rehabil Hosp, Inst Rehabil Sci & Engn, Lincoln, NE 68506 USA.
[Ball, Laura J.] E Carolina Univ, Greenville, NC 27858 USA.
RP Beukelman, DR (reprint author), Univ Nebraska, Barkley Mem Ctr 202, POB 830732, Lincoln, NE 68583 USA.
EM dbeukelman1@unl.edu
FU Barkley Trust
FX This research was partially funded by the Barkley Trust. The authors
wish to thank the staff and patients of the Muscular Dystrophy Clinic
and ALS Clinic for their support of the Nebraska ALS Database Project.
The authors report no conflicts of interest. The authors alone are
responsible for the content and writing of the article.
CR Ball L., 2004, AUGMENTATIVE ALTERNA, V20, P113, DOI 10.1080/0743461042000216596
Ball L., 2007, AUGMENTATIVE COMMUNI, P287
Ball LJ, 2002, J MED SPEECH-LANG PA, V10, P231
BECKER CA, 1976, J EXP PSYCHOL HUMAN, V2, P556, DOI 10.1037//0096-1523.2.4.556
BEUKELMAN D, 1979, J COMMUN DISORD, V12, P89
Broadbent D.E., 1958, PERCEPTION COMMUNICA
CRANDALL J, 2007, ATTENTION ALLOCATION, P1
Duffy J.R, 2005, MOTOR SPEECH DISORDE
Green M., 2000, TRANSPORTATION HUMAN, V2, P195, DOI DOI 10.1207/STHF0203_1
HENDY K, 1993, HUM FACTORS, V23, P579
Kahneman D., 1973, ATTENTION EFFORT
Pisoni D. B., 1982, Speech Technology, V1
SITVER M, 1982, ASHA, V24, P783
Yorkston K., 2007, SENTENCE INTELLIGIBI
Yorkston K., 2010, CLIN MANAGEMENT SPEA
YORKSTON K, 1991, J MED SPEECH-LANG PA, V1, P35
Yorkston K. M., 2004, MANAGEMENT SPEECH SW
NR 17
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 801
EP 806
DI 10.1016/j.specom.2010.12.005
PG 6
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500001
ER
PT J
AU Irwin, A
Pilling, M
Thomas, SM
AF Irwin, Amy
Pilling, Michael
Thomas, Sharon M.
TI An analysis of British regional accent and contextual cue effects on
speechreading performance
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speechreading; Accent; Speech perception; Talker speechreadability
ID AUDIOVISUAL SPEECH-PERCEPTION; NORMAL-HEARING; SPOKEN WORDS;
VARIABILITY; INTELLIGIBILITY; TALKER; RECOGNITION; SENTENCES; ADULTS
AB The aim of this paper was to examine the effect of regional accent on speechreading accuracy and the utility of contextual cues in reducing accent effects.
Study 1: Participants were recruited from Nottingham (n = 24) and Glasgow (n = 17). Their task was to speechread 240 visually presented sentences spoken by 12 talkers, half with a Glaswegian accent and half with a Nottingham accent. Both participant groups found the Glaswegian talkers less intelligible (p < 0.05). A significant interaction between participant location and accent type (p < 0.05) indicated that both participant groups showed an advantage for speechreading talkers with their own accent over talkers with the other accent.
Study 2: Participants were recruited from Nottingham (n = 15). The same visual sentences were used, but each one was presented with a contextual cue. The results showed that speechreading performance was significantly improved when a contextual cue was used (p < 0.05). However the Nottingham observers still found the Glaswegian talkers less intelligible than the Nottingham talkers (p < 0.05).
The findings of this paper suggest that accent type may have an influence upon visual speech intelligibility and as such may impact upon the design, and results, of tests of speechreading ability. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Irwin, Amy; Pilling, Michael; Thomas, Sharon M.] Nottingham Univ Sect, MRC Inst Hearing Res, Nottingham NG7 2RD, England.
RP Irwin, A (reprint author), Univ Aberdeen, Sch Psychol, Aberdeen AB24 2UB, Scotland.
EM a.irwin@abdn.ac.uk
CR Abercrombie D, 1967, ELEMENTS GEN PHONETI
Arnold P, 2001, BRIT J PSYCHOL, V92, P339, DOI 10.1348/000712601162220
Arnold P., 1997, J DEAF STUD DEAF EDU, V2, P199
Auer ET, 1997, J ACOUST SOC AM, V102, P3704, DOI 10.1121/1.420402
Bench J, 1979, Br J Audiol, V13, P108, DOI 10.3109/03005367909078884
Bernstein LE, 2001, J SPEECH LANG HEAR R, V44, P5, DOI 10.1044/1092-4388(2001/001)
Beskow J, 2004, LECT NOTES COMPUT SC, V3118, P1178
BOOTHROYD A, 1988, VOLTA REV, V90, P77
Clopper C.G., 2006, J ACOUST SOC AM, V119, P3424
Conrey B, 2006, VISION RES, V46, P3243, DOI 10.1016/j.visres.2006.03.020
COX RM, 1987, J ACOUST SOC AM, V81, P1598, DOI 10.1121/1.394512
DEMOREST ME, 1992, J SPEECH HEAR RES, V35, P876
ELLIS T, 2001, P INT C AUD VIS SPEE, P13
ERBER NP, 1992, J ACAD R, V25, P113
Ferguson SH, 2002, J ACOUST SOC AM, V112, P259, DOI 10.1121/1.1482078
Floccia C, 2006, J EXP PSYCHOL HUMAN, V32, P1276, DOI 10.1037/0096-1523.32.5.1276
Flynn MC, 1999, J SPEECH LANG HEAR R, V42, P540
Gagne J. P., 1994, J ACAD REHABIL AUDIO, V27, P135
Grant KW, 2000, J ACOUST SOC AM, V107, P1000, DOI 10.1121/1.428280
Jordan TR, 2000, PERCEPT PSYCHOPHYS, V62, P1394, DOI 10.3758/BF03212141
KRICOS PB, 1982, VOLTA REV, V84, P219
KRICOS PB, 1985, VOLTA REV, V87, P5
Labov W., 1997, LANGUAGE VARIETY S R, P508
LABOV W, 1989, CHICAGO LINGUISTIC S, V25, P171
Lander K, 2008, Q J EXP PSYCHOL, V61, P961, DOI 10.1080/17470210801908476
LANSING CR, 1995, J SPEECH HEAR RES, V38, P1377
LESNER SA, 1988, VOLTA REV, V90, P89
Lidestam B, 2001, SCAND AUDIOL, V30, P89, DOI 10.1080/010503901300112194
MACLEOD A, 1987, British Journal of Audiology, V21, P131, DOI 10.3109/03005368709077786
Massaro D. W., 1998, PERCEIVING TALKING F
MASSARO DW, 1993, PERCEPT PSYCHOPHYS, V53, P549, DOI 10.3758/BF03205203
MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0
Most T, 2001, EAR HEARING, V22, P252, DOI 10.1097/00003446-200106000-00008
Munro MJ, 1995, LANG SPEECH, V38, P289
Nathan L, 1998, J CHILD LANG, V25, P343, DOI 10.1017/S0305000998003444
Nathan L, 2001, APPL PSYCHOLINGUIST, V22, P343, DOI 10.1017/S0142716401003046
OWENS E, 1985, J SPEECH HEAR RES, V28, P381
Reisberg D., 1987, HEARING EYE PSYCHOL, P97
Ronnberg J, 1999, J SPEECH LANG HEAR R, V42, P5
SHEFFERT SM, 1995, J MEM LANG, V34, P665, DOI 10.1006/jmla.1995.1030
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
SUMMERFIELD Q, 1992, PHILOS T ROY SOC B, V335, P71, DOI 10.1098/rstb.1992.0009
SUMMERFIELD Q, 1987, HEARING EYE PSYCHOL
SUMMERFIELD Q, 1984, Q J EXP PSYCHOL-A, V36, P51
Valentine G, 2008, SOC CULT GEOGR, V9, P469, DOI 10.1080/14649360802175691
Wells J. C., 1982, ACCENTS ENGLISH
Wells J. C, 1982, ACCENTS ENGLISH, V1
Yakel DA, 2000, PERCEPT PSYCHOPHYS, V62, P1405, DOI 10.3758/BF03212142
NR 48
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 807
EP 817
DI 10.1016/j.specom.2011.01.010
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500002
ER
PT J
AU So, S
Paliwal, KK
AF So, Stephen
Paliwal, Kuldip K.
TI Modulation-domain Kalman filtering for single-channel speech enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Modulation domain; Kalman filtering; Speech enhancement
ID SPECTRAL AMPLITUDE ESTIMATOR; NOISE; RECEPTION
AB In this paper, we investigate the modulation-domain Kalman filter (MDKF) and compare its performance with other time-domain and acoustic-domain speech enhancement methods. In contrast to previously reported modulation-domain enhancement methods based on fixed bandpass filtering, the MDKF is an adaptive, linear MMSE estimator that uses models of the temporal changes of the magnitude spectrum for both speech and noise. Also, because the Kalman filter is a joint magnitude and phase spectrum estimator under non-stationarity assumptions, it is highly suited for modulation-domain processing, as phase information has been shown to play an important role in the modulation domain. We have found that the Kalman filter is better suited to processing in the modulation domain than in the time domain, since a low-order linear predictor is sufficient for modelling the dynamics of the slow changes in the modulation domain, while it is insufficient for modelling the long-term correlations of speech in the time domain. As a result, the MDKF method produces enhanced speech with very little distortion and residual noise in the ideal case. The results from objective experiments and blind subjective listening tests using the NOIZEUS corpus show that the MDKF (with clean speech parameters) outperforms all the acoustic and time-domain enhancement methods that were evaluated, including the time-domain Kalman filter with clean speech parameters. A practical MDKF that uses the MMSE-STSA method to enhance noisy speech in the acoustic domain prior to LPC analysis was also evaluated and showed promising results. (C) 2011 Elsevier B.V. All rights reserved.
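The abstract describes the MDKF as a linear MMSE estimator that tracks the temporal trajectory of each spectral magnitude with low-order linear-prediction models. Below is a minimal sketch of that idea reduced to one frequency bin with an AR(1) speech model; the AR coefficient and variance values are illustrative assumptions, not the paper's trained models.

```python
# Minimal sketch of Kalman filtering in the modulation domain: each frequency
# bin's magnitude trajectory across frames is tracked with a low-order linear
# predictor (AR(1) here) plus observation noise.  Parameter values are
# illustrative assumptions only.
import numpy as np

def modulation_kalman_1d(noisy_traj, a=0.95, q=0.01, r=0.1):
    """Filter one bin's noisy magnitude trajectory.

    noisy_traj: observed magnitudes of one frequency bin over time,
    a: AR(1) coefficient of the clean-magnitude model,
    q: process (model) noise variance, r: observation noise variance.
    """
    x, p = noisy_traj[0], 1.0           # initial state estimate and variance
    out = []
    for y in noisy_traj:
        x_pred, p_pred = a * x, a * a * p + q     # predict
        k = p_pred / (p_pred + r)                 # Kalman gain
        x = x_pred + k * (y - x_pred)             # update with observation y
        p = (1.0 - k) * p_pred
        out.append(x)
    return np.array(out)

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    clean = np.abs(np.sin(np.linspace(0, 3, 100)))     # toy magnitude trajectory
    noisy = clean + 0.3 * rng.normal(size=clean.size)
    print(modulation_kalman_1d(noisy)[:5])
```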
C1 [So, Stephen; Paliwal, Kuldip K.] Griffith Univ, Signal Proc Lab, Griffith Sch Engn, Brisbane, Qld 4111, Australia.
RP So, S (reprint author), Griffith Univ, Signal Proc Lab, Griffith Sch Engn, Brisbane, Qld 4111, Australia.
EM s.so@griffith.edu.au; k.paliwal@griffith.edu.au
RI So, Stephen/D-6649-2011
CR Arai T, 1999, J ACOUST SOC AM, V105, P2783, DOI 10.1121/1.426895
Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Chen JD, 2006, IEEE T AUDIO SPEECH, V14, P1218, DOI 10.1109/TSA.2005.860851
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Falk T. H., 2007, P ISCA C INT SPEECH, P970
Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367
GIBSON JD, 1991, IEEE T SIGNAL PROCES, V39, P1732, DOI 10.1109/78.91144
Greenberg S., 2001, P 7 EUR C SPEECH COM, P473
Greenberg S., 1998, P INT C SPOK LANG PR, P2803
Hermansky H., 1995, P IEEE INT C AC SPEE, V1, P405
Kalman R.E, 1960, J BASIC ENG, V82, P35, DOI DOI 10.1115/1.3662552
KANEDERA N, 1998, P IEEE INT C AC SPEE, V2, P613, DOI 10.1109/ICASSP.1998.675339
Li C. J., 2006, THESIS AARLBORG U DE
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lyons J., 2008, P ISCA C INT SPEECH, P387
Mesgarani N., 2005, P IEEE INT C AC SPEE, P1105
Paliwal K, 2011, SPEECH COMMUN, V53, P327, DOI 10.1016/j.specom.2010.10.004
Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004
Paliwal K., 1987, P IEEE INT C AC SPEE, V12, P177
Quatieri T. F., 2002, DISCRETE TIME SPEECH
Rix A., 2001, P862 ITUT
SAMBUR MR, 1976, IEEE T ACOUST SPEECH, V24, P488, DOI 10.1109/TASSP.1976.1162870
SO S, 2009, P IEEE INT C AC SPEE, P4405
So S, 2011, SPEECH COMMUN, V53, P355, DOI 10.1016/j.specom.2010.10.006
SORQVIST P, 1997, P IEEE INT C AC SPEE, V2, P1219
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
Wiener N., 1949, EXTRAPOLATION INTERP
Wu WR, 1998, IEEE T CIRCUITS-II, V45, P1072
NR 32
TC 8
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 818
EP 829
DI 10.1016/j.specom.2011.02.001
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500003
ER
PT J
AU Muller, F
Mertins, A
AF Mueller, Florian
Mertins, Alfred
TI Contextual invariant-integration features for improved
speaker-independent speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Speaker-independency; Invariant-integration
ID VOCAL-TRACT NORMALIZATION; HIDDEN MARKOV-MODELS; PATTERN-RECOGNITION;
TRANSFORM
AB This work presents a feature-extraction method based on the theory of invariant integration. The invariant-integration features are derived from an extended time period, and their computation has very low complexity. Recognition experiments show superior performance of the presented feature type compared to cepstral coefficients using a mel filterbank (MFCCs) or a gammatone filterbank (GTCCs), in matching as well as in mismatched training-testing conditions. Even without any speaker adaptation, the presented features yield higher accuracies than MFCCs combined with vocal tract length normalization (VTLN) in matching training-testing conditions. It is also shown that the invariant-integration features (IIFs) can be successfully combined with additional speaker-adaptation methods to further increase the accuracy. In addition to standard MFCCs, contextual MFCCs are also introduced; their performance lies between that of MFCCs and IIFs. (C) 2011 Elsevier B.V. All rights reserved.
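The abstract describes features obtained by invariant integration over an extended time context, intended to reduce speaker-dependent spectral variability. The sketch below illustrates one common reading of that idea: averaging a monomial of time-frequency bins over a set of subband shifts, so small spectral translations change the feature little. The monomial, window, and shift range are illustrative assumptions and not the paper's feature set.

```python
# Minimal sketch of an invariant-integration-style feature: a monomial of
# time-frequency bins is averaged over a set of subband shifts of a
# log-frequency spectrogram, so small spectral translations (a rough model of
# vocal-tract length differences) change the feature little.
import numpy as np

def invariant_integration_feature(logspec, t, bins=(5, 9), offsets=(0, 2),
                                  exponents=(1, 1), shifts=range(-3, 4)):
    """Average one monomial over subband shifts around frame t.

    logspec: (n_frames, n_subbands) log-magnitude spectrogram,
    bins/offsets/exponents: subband index, time offset and exponent of each
    monomial factor, shifts: subband translations integrated over.
    """
    values = []
    for s in shifts:
        prod = 1.0
        for b, dt, e in zip(bins, offsets, exponents):
            prod *= logspec[t + dt, b + s] ** e
        values.append(prod)
    return float(np.mean(values))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    spec = rng.random((50, 40))                 # toy spectrogram-like array
    print(invariant_integration_feature(spec, t=10))
```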
C1 [Mueller, Florian; Mertins, Alfred] Med Univ Lubeck, Inst Signal Proc, D-23538 Lubeck, Germany.
RP Muller, F (reprint author), Med Univ Lubeck, Inst Signal Proc, Ratzeburger Allee 160, D-23538 Lubeck, Germany.
EM mueller@isip.uni-luebeck.de; mertins@isip.uni-luebeck.de
FU German Research Foundation [ME1170/2-1]
FX This work has been supported by the German Research Foundation under
Grant No. ME1170/2-1.
CR Benzeghiba M, 2007, SPEECH COMMUN, V49, P763, DOI 10.1016/j.specom.2007.02.006
Boe L.-J., 2006, P 7 INT SEM SPEECH P, P75
BURKHARDT H, 1980, IEEE T ACOUST SPEECH, V28, P517, DOI 10.1109/TASSP.1980.1163439
Burkhardt H., 2001, NONLINEAR MODEL BASE, P269
COHEN L, 1993, IEEE T SIGNAL PROCES, V41, P3275, DOI 10.1109/78.258073
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Deller J. R., 1993, DISCRETE TIME PROCES
Ellis DPW, 2009, GAMMATONE LIKE SPECT
FANG M, 1989, APPL OPTICS, V28, P1257, DOI 10.1364/AO.28.001257
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC
Gramss Tino, 1991, P IEEE WORKSH NEUR N, P289
Haeb-Umbach R., 1992, P IEEE INT C AC SPEE, V1, P13
Halberstadt A., 1998, THESIS MIT
Huang X., 2001, SPOKEN LANGUAGE PROC
HURWITZ A, 1897, UEBER ERZEUGUNG INVA, P71
Irino T, 2002, SPEECH COMMUN, V36, P181, DOI 10.1016/S0167-6393(00)00085-6
Ishizuka K, 2006, J ACOUST SOC AM, V120, P443, DOI 10.1121/1.2205131
Kleinschmidt M., 2002, P INT C SPOK LANG PR, P25
Kleinschmidt Michael, 2002, THESIS U OLDENBURG
LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546
Lee L, 1998, IEEE T SPEECH AUDI P, V6, P49, DOI 10.1109/89.650310
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
LEONARD RG, 1993, TIDIGITS LINGUISTIC
Lohweg V, 2004, EURASIP J APPL SIG P, V2004, P1912, DOI 10.1155/S1110865704404247
MERTINS A, 2006, P IEEE INT C AC SPEE, V5, P1025
Mertins A., 2005, P 2005 IEEE AUT SPEE, P308
Monaghan J. J., 2008, J ACOUST SOC AM, V123, P3066, DOI 10.1121/1.2932824
MOORE BCJ, 1996, ACTA ACUST UNITED AC, V82, P245
Muller Florian, 2009, Proceedings of the 2009 IEEE Workshop on Automatic Speech Recognition & Understanding (ASRU 2009), DOI 10.1109/ASRU.2009.5373284
Muller F., 2009, P INT C SPOK LANG PR, P2975
Muller F, 2010, LECT NOTES ARTIF INT, V5933, P111
NOETHER E, 1916, MATH ANN, V77, P89
Patterson R. D., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.183
PATTERSON RD, 1992, ADV BIOSCI, V83, P429
Pitz M, 2005, IEEE T SPEECH AUDI P, V13, P930, DOI 10.1109/TSA.2005.848881
RADEMACHER J, 2006, P INT C SPOK LANG PR, P1499
REITBOEC.H, 1969, INFORM CONTROL, V15, P130, DOI 10.1016/S0019-9958(69)90387-8
Saon G., 2000, P INT C AC SPEECH SI, V2, P1129
SCHLUTER R, 2006, P INT C SPOK LANG PR, P345
SCHULZMIRBACH H, 1995, MUSTERERKENNUNG 1995, V17, P1
SCHULZMIRBACH H, 1992, P 11 INT C PATT REC, V2, P178
SCHULZMIRBACH H, 1995, TR40295018 U HAMB
SENA AD, 2005, P SOUND MUS C SAL JU
SIGGELKOW S, 2002, THESIS ALBERTLUDWIGS
SINHA R, 2002, P IEEE INT C AC SPEE, V1
Umesh S, 1999, IEEE T SPEECH AUDI P, V7, P40, DOI 10.1109/89.736329
UMESH S, 2002, P INT C AC SPEECH SI, V1
Welling L, 2002, IEEE T SPEECH AUDI P, V10, P415, DOI 10.1109/TSA.2002.803435
Young S., 2009, HTK BOOK HTK VERSION
NR 50
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 830
EP 841
DI 10.1016/j.specom.2011.02.002
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500004
ER
PT J
AU Feng, YQ
Hao, GJ
Xue, SA
Max, L
AF Feng, Yongqiang
Hao, Grace J.
Xue, Steve A.
Max, Ludo
TI Detecting anticipatory effects in speech articulation by means of
spectral coefficient analyses
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech production; Articulation; Anticipatory coarticulation; Acoustic
analysis; Spectral coefficients
ID TO-VOWEL COARTICULATION; FRICATIVE-STOP COARTICULATION; CATALAN VCV
SEQUENCES; LOCUS EQUATIONS; STATISTICAL-ANALYSIS; PARKINSONS-DISEASE;
MULTIPLE-SCLEROSIS; ACOUSTIC ANALYSIS; CO-ARTICULATION; CONSONANTS
AB Few acoustic studies have attempted to examine anticipatory effects in the earliest part of the release of stop consonants. We investigated the ability of spectral coefficients to reveal anticipatory coarticulation in the burst and early aspiration of stops in monosyllables. Twenty American English speakers produced stop (/k,t,p/) - vowel (/ae,i,o/) - stop (/k,t,p/) sequences in two phrase positions. The first four spectral coefficients (mean, standard deviation, skewness, kurtosis) were calculated for one window centered on the burst of the onset consonant and two subsequent, non-overlapping windows. All coefficients showed an influence of vowel-to-consonant anticipatory coarticulation. Which onset consonant showed the strongest vowel effect depended on the specific coefficient under consideration. A context-dependent consonant-to-consonant anticipatory effect was observed for onset /p/. Findings demonstrate that spectral coefficients can reveal subtle anticipatory adjustments as early as the burst of stop consonants. Different results for the four coefficients suggest that comprehensive spectral analyses offer advantages over other approaches. Studies using these techniques may expose previously unobserved articulatory adjustments among phonetic contexts or speaker populations. (C) 2011 Elsevier B.V. All rights reserved.
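As a rough illustration of the kind of analysis described above, the sketch below computes the first four spectral moments (mean, standard deviation, skewness, excess kurtosis) of a single analysis window, in the spirit of Forrest et al. (1988). It is not the authors' pipeline; the window length, placement and variable names (x, sr, burst_idx) are assumptions.

# Sketch: first four spectral moments of a windowed stop burst.
# Assumes a mono signal `x` at sample rate `sr`; window placement and
# windowing choices are illustrative, not those of the study above.
import numpy as np

def spectral_moments(frame, sr):
    """Return spectral mean, SD, skewness and excess kurtosis of one frame."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame)))) ** 2
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    p = spec / spec.sum()                       # treat the power spectrum as a distribution
    mean = np.sum(freqs * p)
    sd = np.sqrt(np.sum(((freqs - mean) ** 2) * p))
    skew = np.sum(((freqs - mean) ** 3) * p) / sd ** 3
    kurt = np.sum(((freqs - mean) ** 4) * p) / sd ** 4 - 3.0
    return mean, sd, skew, kurt

# Example: one 10 ms window centred on the burst (burst_idx is a hypothetical label).
# win = int(0.010 * sr)
# m = spectral_moments(x[burst_idx - win // 2 : burst_idx + win // 2], sr)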
C1 [Max, Ludo] Univ Washington, Dept Speech & Hearing Sci, Seattle, WA 98105 USA.
[Feng, Yongqiang] Chinese Acad Sci, Inst Acoust, Beijing 100190, Peoples R China.
[Hao, Grace J.] N Carolina Cent Univ, Dept Commun Disorders, Durham, NC 27707 USA.
[Xue, Steve A.] Univ Hong Kong, Div Speech & Hearing Sci, Hong Kong, Hong Kong, Peoples R China.
[Max, Ludo] Haskins Labs Inc, New Haven, CT 06511 USA.
RP Max, L (reprint author), Univ Washington, Dept Speech & Hearing Sci, 1417 NE 42nd St, Seattle, WA 98105 USA.
EM LudoMax@uw.edu
FU National Institute on Deafness and Other Communication Disorders
[R01DC007603]
FX This work was supported by Grant R01DC007603 from the National Institute
on Deafness and Other Communication Disorders (L. Max PI). The content
is solely the responsibility of the authors and does not necessarily
represent the official views of the National Institute on Deafness and
Other Communication Disorders or the National Institutes of Health.
CR BLUMSTEIN SE, 1979, J ACOUST SOC AM, V66, P1001, DOI 10.1121/1.383319
BROAD DJ, 1970, J ACOUST SOC AM, V47, P1572, DOI 10.1121/1.1912090
Cho T, 1999, J PHONETICS, V27, P207, DOI 10.1006/jpho.1999.0094
DANILOFF R, 1968, J SPEECH HEAR RES, V11, P707
FARNETANI E, 1993, LANG SPEECH, V36, P279
FORREST K, 1988, J ACOUST SOC AM, V84, P115, DOI 10.1121/1.396977
Fowler CA, 2000, LANG SPEECH, V43, P1
GOLDMANEISLER F, 1961, LANG SPEECH, V4, P220
Gordon M., 2002, J INT PHON ASSOC, V32, P141, DOI 10.1017/S0025100302001020
Hawkins S, 2004, J PHONETICS, V32, P199, DOI 10.1016/S0095-4470(03)00031-7
HAWKINS S, 2000, SWAP, P167
Hertrich I, 1999, J SPEECH LANG HEAR R, V42, P367
HOUSE AS, 1953, J ACOUST SOC AM, V25, P105, DOI 10.1121/1.1906982
Joos M., 1948, LANGUAGE SUPPL, V24, P1, DOI DOI 10.2307/522229
Kent R. D., 1977, J PHONETICS, V15, P115
KEWLEYPORT D, 1983, J ACOUST SOC AM, V73, P322, DOI 10.1121/1.388813
Kirk RR, 1995, EXPT DESIGN PROCEDUR
LISKER L, 1964, WORD, V20, P384
LOFQVIST A, 1994, PHONETICA, V51, P52
MACNEILA.PF, 1969, J ACOUST SOC AM, V45, P1217, DOI 10.1121/1.1911593
Max L, 1999, J SPEECH LANG HEAR R, V42, P261
Max L, 1998, J SPEECH LANG HEAR R, V41, P1265
MENZERATH P, 1933, KOARTIKULATION STEUC
MOLL KL, 1971, J ACOUST SOC AM, V50, P678, DOI 10.1121/1.1912683
NEWELL KM, 1984, J MOTOR BEHAV, V16, P320
NITTROUER S, 1995, J ACOUST SOC AM, V97, P520, DOI 10.1121/1.412278
OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310
OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151
Ostry DJ, 1996, J NEUROSCI, V16, P1570
PARUSH A, 1983, J ACOUST SOC AM, V74, P1115, DOI 10.1121/1.390035
PLOMP R, 1967, J ACOUST SOC AM, V41, P707, DOI 10.1121/1.1910398
PURCELL ET, 1979, J ACOUST SOC AM, V66, P1691, DOI 10.1121/1.383641
RABINER LR, 1975, AT&T TECH J, V54, P297
RECASENS D, 1984, J PHONETICS, V12, P61
RECASENS D, 1984, J ACOUST SOC AM, V76, P1624, DOI 10.1121/1.391609
RECASENS D, 1987, J PHONETICS, V15, P299
RECASENS D, 1985, LANG SPEECH, V28, P97
Recasens D, 1997, J ACOUST SOC AM, V102, P544, DOI 10.1121/1.419727
REPP BH, 1983, J ACOUST SOC AM, V74, P420, DOI 10.1121/1.389835
REPP BH, 1986, J ACOUST SOC AM, V79, P1616, DOI 10.1121/1.393298
REPP BH, 1982, J ACOUST SOC AM, V71, P1562, DOI 10.1121/1.387810
REPP BH, 1981, J ACOUST SOC AM, V69, P1154, DOI 10.1121/1.385695
SERENO JA, 1987, J ACOUST SOC AM, V81, P512, DOI 10.1121/1.394917
SIREN KA, 1995, J SPEECH HEAR RES, V38, P351
SKIPPER JK, 1972, STAT ISSUES READER B, P141
Stevens K.N., 1998, ACOUSTIC PHONETICS
STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102
STEVENS KN, 1963, J SPEECH HEAR RES, V6, P111
Sussman HM, 1997, J ACOUST SOC AM, V101, P2826, DOI 10.1121/1.418567
SUSSMAN HM, 1991, J ACOUST SOC AM, V90, P1309, DOI 10.1121/1.401923
SUSSMAN HM, 1992, J SPEECH HEAR RES, V35, P769
SUSSMAN HM, 1994, PHONETICA, V51, P119
Tjaden K, 2000, J SPEECH LANG HEAR R, V43, P1466
Tjaden K, 2005, J SPEECH LANG HEAR R, V48, P261, DOI 10.1044/1092-4388(2005/018)
Tjaden K, 2003, J SPEECH LANG HEAR R, V46, P990, DOI 10.1044/1092-4388(2003/077)
WHALEN DH, 1990, J PHONETICS, V18, P3
WINITZ H, 1972, J ACOUST SOC AM, V51, P1309, DOI 10.1121/1.1912976
NR 57
TC 1
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 842
EP 854
DI 10.1016/j.specom.2011.02.003
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500005
ER
PT J
AU Drugman, T
Bozkurt, B
Dutoit, T
AF Drugman, Thomas
Bozkurt, Baris
Dutoit, Thierry
TI Causal-anticausal decomposition of speech using complex cepstrum for
glottal source estimation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Complex cepstrum; Homomorphic analysis; Glottal source estimation;
Source-tract separation
ID VOICED SPEECH; FLOW; PERCEPTION; SIGNALS
AB Complex cepstrum is known in the literature for linearly separating causal and anticausal components. Relying on advances achieved by the Zeros of the Z-Transform (ZZT) technique, we here investigate the possibility of using complex cepstrum for glottal flow estimation on a large-scale database. Via a systematic study of the windowing effects on the deconvolution quality, we show that the complex cepstrum causal-anticausal decomposition can be effectively used for glottal flow estimation when specific windowing criteria are met. It is also shown that this complex cepstral decomposition gives similar glottal estimates as obtained with the ZZT method. However, as complex cepstrum uses FFT operations instead of requiring the factoring of high-degree polynomials, the method benefits from a much higher speed. Finally in our tests on a large corpus of real expressive speech, we show that the proposed method has the potential to be used for voice quality analysis. (C) 2011 Elsevier B.V. All rights reserved.
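The following minimal Python sketch illustrates the causal/anticausal split via the complex cepstrum that the abstract refers to. It deliberately omits the GCI-synchronous windowing criteria that the paper shows to be decisive, and the plain phase unwrapping and zero-quefrency convention used here are assumptions rather than the authors' implementation.

# Sketch: complex-cepstrum causal/anticausal decomposition of one windowed frame.
import numpy as np

def complex_cepstrum(frame, nfft=4096):
    X = np.fft.fft(frame, nfft)
    # NOTE: careful phase unwrapping and linear-phase removal matter in practice;
    # the plain np.unwrap used here is a simplification.
    log_X = np.log(np.abs(X) + 1e-12) + 1j * np.unwrap(np.angle(X))
    return np.real(np.fft.ifft(log_X))

def causal_anticausal(frame, nfft=4096):
    c = complex_cepstrum(frame, nfft)
    causal = np.zeros(nfft)
    anticausal = np.zeros(nfft)
    causal[1:nfft // 2] = c[1:nfft // 2]        # positive quefrencies: minimum-phase (vocal tract) part
    anticausal[nfft // 2:] = c[nfft // 2:]      # negative quefrencies: maximum-phase (glottal open phase) part
    causal[0] = anticausal[0] = c[0] / 2.0      # split the zero-quefrency term (one possible convention)
    to_signal = lambda ceps: np.real(np.fft.ifft(np.exp(np.fft.fft(ceps))))
    return to_signal(anticausal), to_signal(causal)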
C1 [Drugman, Thomas; Dutoit, Thierry] Univ Mons, TCTS Lab, B-7000 Mons, Belgium.
[Bozkurt, Baris] Izmir Inst Technol, Dept Elect & Elect Engn, Izmir, Turkey.
RP Drugman, T (reprint author), Univ Mons, TCTS Lab, B-7000 Mons, Belgium.
EM thomas.drugman@umons.ac.be
FU Fonds National de la Recherche Scientifique (FNRS); Scientific and
Technological Research Council of Turkey (TUBITAK)
FX Thomas Drugman is supported by the Fonds National de la Recherche
Scientifique (FNRS). Baris Bozkurt is supported by the Scientific and
Technological Research Council of Turkey (TUBITAK). The authors also
would like to thank N. Henrich and B. Doval for providing us the speech
recording used to create Fig. 10 and M. Schroeder for the De7 database
(Schroeder and Grice, 2003) used in the second experiment on real
speech. Authors also would like to thank reviewers for their fruitful
feedback.
CR ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R
Alku P, 2002, J ACOUST SOC AM, V112, P701, DOI 10.1121/1.1490365
ALKU P, 1994, 3 INT C SPOK LANG PR, P1619
Alku P, 2009, J ACOUST SOC AM, V125, P3289, DOI 10.1121/1.3095801
Ananthapadmanabha T. V., 1982, SPEECH COMMUN, V1, P167, DOI 10.1016/0167-6393(82)90015-2
[Anonymous], SNACK SOUND TOOLKIT
Bozkurt B, 2007, SPEECH COMMUN, V49, P159, DOI 10.1016/j.specom.2006.12.004
BOZKURT B, 2005, IEEE SIGNAL PROCESS, V12
Bozkurt B., 2003, VOQUAL 03, P21
BOZKURT B, 2004, P INT
Childers D. G., 1999, SPEECH PROCESSING SY
CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044
DALESSANDRO C, 2008, LNCS, V4885, P1
Deng HQ, 2006, IEEE T AUDIO SPEECH, V14, P445, DOI 10.1109/TSA.2005.857811
Doval B., 2003, P ISCA ITRW VOQUAL G, P15
Doval B, 2006, ACTA ACUST UNITED AC, V92, P1026
Drugman T., 2009, P INT
Drugman T., 2009, ISCA WORKSH NONL SPE
FANT G, 1997, 4 PARAMETER MODEL GL, P1
Fant G., 1995, STL QPSR, V36, P119
Fant Gunnar, 1985, STL QPSR, V4, P1
HANSON HM, 1995, P IEEE ICASSP, P772
KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894
Naylor PA, 2007, IEEE T AUDIO SPEECH, V15, P34, DOI 10.1109/TASL.2006.876878
NORDEN F, 2001, ICASSP, V2, P717, DOI DOI 10.1109/TASL.2006.876878
Oppenheim A., 1989, DISCRETE TIME SIGNAL
Oppenheim A.V., 1983, SIGNALS SYSTEMS
Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363
PEDERSEN C, 2009, P INT
Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109
QUATIERI TF, 1979, IEEE T ACOUST SPEECH, V27, P328, DOI 10.1109/TASSP.1979.1163252
QUATIERI T, 2002, DISCRETE TIME SIGNAL, pCH6
Schroder M., 2003, P 15 INT C PHON SCI, P2589
SITTON G, 2003, IEEE SIGNAL PROCESS, P27
STEIGLITZ K, 1982, IEEE T ACOUST SPEECH, V30, P984, DOI 10.1109/TASSP.1982.1163975
STEIGLITZ K, 1977, P ICASSP77, V2, P723
STURMEL N, 2007, P INT
TITZE IR, 1992, J ACOUST SOC AM, V91, P2936, DOI 10.1121/1.402929
TRIBOLET J, 1977, P ICASSP77, V2, P716
VEENEMAN DE, 1985, IEEE T ACOUST SPEECH, V33, P369, DOI 10.1109/TASSP.1985.1164544
VERHELST W, 1986, IEEE T ACOUST SPEECH, V34, P43, DOI 10.1109/TASSP.1986.1164787
WALKER J, 2007, IEEE T ACOUST SPEECH, P1, DOI 10.1049/ic:20070799
WANGRAE J, 2005, SCIENCE
WANGRAE J, 2005, LECT NOTES COMPUTER
NR 44
TC 13
Z9 13
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 855
EP 866
DI 10.1016/j.specom.2011.02.004
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500006
ER
PT J
AU Goberman, AM
Hughes, S
Haydock, T
AF Goberman, Alexander M.
Hughes, Stephanie
Haydock, Todd
TI Acoustic characteristics of public speaking: Anxiety and practice
effects
SO SPEECH COMMUNICATION
LA English
DT Article
DE Acoustic analysis; Public speaking; Anxiety; Practice; Illusion of
transparency
ID COMMUNICATION APPREHENSION; SPEECH ANXIETY; TRANSPARENCY; PERFORMANCE;
ILLUSION; OTHERS; FEAR; DISTURBANCES; ASSESSMENTS; SELF
AB This study describes the relationship between acoustic characteristics, self-ratings, and listener-ratings of public speaking. The specific purpose of this study was to examine the effects of anxiety and practice on speech and voice during public speaking. Further examination of the data was completed to examine the illusion of transparency, which hypothesizes that public speakers think their anxiety is more noticeable to listeners than it really is. Self-rating and acoustic speech data were reported on two separate speeches produced by 16 college-aged individuals completing coursework in interpersonal communication. Results indicated that there were significant relationships between acoustic characteristics of speech and both self- and listener-ratings of anxiety in public speaking. However, self-ratings of anxiety were higher than listener ratings, indicating possible confirmation of the illusion of transparency. Finally, data indicate that practice patterns have a significant effect on the fluency characteristics of public speaking performance, as speakers who started practicing earlier were less disfluent than those who started later. Data are also discussed relative to rehabilitation for individuals with communication disorders that can be associated with public speaking anxiety. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Goberman, Alexander M.] Bowling Green State Univ, Dept Commun Sci & Disorders, Bowling Green, OH 43403 USA.
[Hughes, Stephanie] Governors State Univ, Commun Disorders Dept, University Pk, IL 60484 USA.
[Haydock, Todd] Encore Rehabil Serv, Whitehouse, OH 43571 USA.
RP Goberman, AM (reprint author), Bowling Green State Univ, Dept Commun Sci & Disorders, 200 Hlth Ctr Bldg, Bowling Green, OH 43403 USA.
EM goberma@bgsu.edu; s-hughes@govst.edu; THaydock@encorerehabilitation.com
CR BALDWIN HJ, 1983, PATIENT COUNS COMMUN, V2, P8
BLOTE AW, 2009, J ANXIETY DISORD, V23, P5
Bodie G. D., 2010, COMMUN EDUC, V59, P70, DOI DOI 10.1080/03634520903443849
Boersma P., 2007, PRAAT DOING PHONETIC
Bortfeld H, 2001, LANG SPEECH, V44, P123
BRANIGAN HP, 1999, P 14 C PHON SCI SAN
Breakey Lisa K., 2005, Seminars in Speech and Language, V26, P107, DOI 10.1055/s-2005-871206
Cho YR, 2004, BEHAV RES THER, V42, P13, DOI 10.1016/S0005-7967(03)00067-6
Christenfeld N, 1996, J PERS SOC PSYCHOL, V70, P451
DALY JA, 1975, J COUNS PSYCHOL, V22, P309, DOI 10.1037/h0076748
DRISKELL JE, 1994, J APPL PSYCHOL, V79, P481, DOI 10.1037/0021-9010.79.4.481
Ezrati-Vinacour R, 2004, J FLUENCY DISORD, V29, P135, DOI 10.1016/j.jfludis.2004.02.003
Gilovich T, 1999, CURR DIR PSYCHOL SCI, V8, P165, DOI 10.1111/1467-8721.00039
Gilovich T, 1998, J PERS SOC PSYCHOL, V75, P332, DOI 10.1037/0022-3514.75.2.332
Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1
GOLDMANEISLER F, 1968, PSYCHOLINGUSITCS EXP
Guitar B, 2006, STUTTERING INTEGRATE
GUNEY F, 2010, J CLIN NEUROSCI, V16, P1311
Hagenaars MA, 2005, J ANXIETY DISORD, V19, P521, DOI 10.1016/j.janxdix.2004.04.008
Hancock AB, 2010, J VOICE, V24, P302, DOI 10.1016/j.jvoice.2008.09.007
Harris SR, 2002, CYBERPSYCHOL BEHAV, V5, P543, DOI 10.1089/109493102321018187
HEISHMAN SJ, 2010, PSYCHOPHARMACOLOGY, V2, P453
Hofmann SG, 1997, J ANXIETY DISORD, V11, P573, DOI 10.1016/S0887-6185(97)00040-6
Lee JM, 2002, CYBERPSYCHOL BEHAV, V5, P191, DOI 10.1089/109493102760147169
MAHL GF, 1956, J ABNORM SOC PSYCH, V53, P1, DOI 10.1037/h0047552
Mansell W, 1999, BEHAV RES THER, V37, P419, DOI 10.1016/S0005-7967(98)00148-X
McCroskey J. C., 1976, W SPEECH COMMUNICATI, V40, P14, DOI 10.1080/10570317609373881
McCroskey J. C., 1989, COMMUNICATION Q, V37, P100
McCroskey J. C., 1977, HUMAN COMMUNICATION, V4, P78, DOI DOI 10.1111/J.1468-2958.1977.TB00599.X
MCCROSKEY JC, 1970, SPEECH MONOGR, V37, P269
McCroskey J.C., 1975, HUMAN COMMUNICATION, V2, P51, DOI 10.1111/j.1468-2958.1975.tb00468.x
McCroskey J.C., 1976, HUMAN COMMUNICATION, V2, P376, DOI 10.1111/j.1468-2958.1976.tb00498.x
MCCROSKEY JC, 1976, FLA SPEECH COMMUN J, V4, P1
McCroskey JC, 1976, HUMAN COMMUNICATION, V3, P73, DOI 10.1111/j.1468-2958.1976.tb00506.x
Merritt L, 2001, J VOICE, V15, P257, DOI 10.1016/S0892-1997(01)00026-1
Protopapas A, 1997, J ACOUST SOC AM, V101, P2267, DOI 10.1121/1.418247
RAPEE RM, 1992, J ABNORM PSYCHOL, V101, P728, DOI 10.1037/0021-843X.101.4.728
ROCHESTE.SR, 1973, J PSYCHOLINGUIST RES, V2, P51, DOI 10.1007/BF01067111
Ruiz R, 1996, SPEECH COMMUN, V20, P111, DOI 10.1016/S0167-6393(96)00048-9
Savitsky K, 2003, J EXP SOC PSYCHOL, V39, P618, DOI 10.1016/S0022-1031(03)00056-8
SCOTT MD, 1978, J COMMUN, V28, P104, DOI 10.1111/j.1460-2466.1978.tb01571.x
Slater M, 2006, CYBERPSYCHOL BEHAV, V9, P627, DOI 10.1089/cpb.2006.9.627
NR 42
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 867
EP 876
DI 10.1016/j.specom.2011.02.005
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500007
ER
PT J
AU Stephens, JDW
Holt, LL
AF Stephens, Joseph D. W.
Holt, Lori L.
TI A standard set of American-English voiced stop-consonant stimuli from
morphed natural speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech stimuli; Consonants; Linear predictive coding
ID PHONETIC CATEGORIZATION; PRECEDING LIQUID; PERCEPTION; COARTICULATION;
INVARIANCE; PLACE; IDENTIFICATION; COMPENSATION; ARTICULATION; LANGUAGE
AB Linear predictive coding (LPC) analysis was used to create morphed natural tokens of English voiced stop consonants ranging from /b/ to /d/ and /d/ to /g/ in four vowel contexts (/i/, /ae/, /a/, /u/). Both vowel-consonant-vowel (VCV) and consonant-vowel (CV) stimuli were created. A total of 320 natural-sounding acoustic speech stimuli were created, comprising 16 stimulus series. A behavioral experiment demonstrated that the stimuli varied perceptually from /b/ to /d/ to /g/, and provided useful reference data for the ambiguity of each token. Acoustic analyses indicated that the stimuli compared favorably to standard characteristics of naturally-produced consonants, and that the LPC morphing procedure successfully modulated multiple acoustic parameters associated with place of articulation. The entire set of stimuli is freely available on the Internet (http://www.psy.cmu.edu/~lholt/php/StephensHoltStimuli.php) for use in research applications. (C) 2011 Elsevier B.V. All rights reserved.
C1 Carnegie Mellon Univ, Dept Psychol, Pittsburgh, PA 15213 USA.
Carnegie Mellon Univ, Ctr Neural Basis Cognit, Pittsburgh, PA 15213 USA.
RP Stephens, JDW (reprint author), N Carolina Agr & Tech State Univ, Dept Psychol, 1601 E Market St, Greensboro, NC 27411 USA.
EM jdstephe@ncat.edu; lholt@andrew.cmu.edu
FU National Institute on Deafness and Other Communication Disorders [1 F31
DC007284-01]; National Science Foundation [BCS-0345773]; Center for the
Neural Basis of Cognition
FX The authors thank Christi Adams Gomez and Tony Kelly for help with data
collection and preparation of online materials, respectively. This work
was supported by a National Research Service Award (1 F31 DC007284-01)
from the National Institute on Deafness and Other Communication
Disorders to J.D.W.S., by a grant from the National Science Foundation
(BCS-0345773) to L.L.H., and by the Center for the Neural Basis of
Cognition.
CR ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679
BLUMSTEIN SE, 1979, J ACOUST SOC AM, V66, P1001, DOI 10.1121/1.383319
BLUMSTEIN SE, 1980, J ACOUST SOC AM, V67, P648, DOI 10.1121/1.383890
Boersma P., 2001, GLOT INT, V5, P341
ELMAN JL, 1988, J MEM LANG, V27, P143, DOI 10.1016/0749-596X(88)90071-X
Engstrand O., 2000, P 13 SWED PHON C FON, P53
Fant G., 1960, ACOUSTIC THEORY SPEE
GANONG WF, 1980, J EXP PSYCHOL HUMAN, V6, P110, DOI 10.1037/0096-1523.6.1.110
HOLT LL, 1999, THESIS
Holt LL, 2005, PSYCHOL SCI, V16, P305, DOI 10.1111/j.0956-7976.2005.01532.x
Holt LL, 2002, HEARING RES, V167, P156, DOI 10.1016/S0378-5955(02)00383-0
KEWLEYPORT D, 1982, J ACOUST SOC AM, V72, P379, DOI 10.1121/1.388081
KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940
KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894
Krull D., 1988, PHONETIC EXPT RES I, VVII, P66
LIBERMAN AM, 1957, J EXP PSYCHOL, V54, P358, DOI 10.1037/h0044417
Lindblom B., 1963, 29 ROYAL I TECHN SPE
LISKER L, 1964, WORD, V20, P384
Lotto AJ, 2006, PERCEPT PSYCHOPHYS, V68, P178, DOI 10.3758/BF03193667
Lotto AJ, 1998, PERCEPT PSYCHOPHYS, V60, P602, DOI 10.3758/BF03206049
MANN VA, 1980, PERCEPT PSYCHOPHYS, V28, P407, DOI 10.3758/BF03204884
Markel JD, 1976, LINEAR PREDICTION SP
Massaro D. W., 1998, PERCEIVING TALKING F
Massaro D. W., 1987, SPEECH PERCEPTION EA
McCandliss BD, 2002, COGN AFFECT BEHAV NE, V2, P89, DOI 10.3758/CABN.2.2.89
MCQUEEN JM, 1991, J EXP PSYCHOL HUMAN, V17, P433, DOI 10.1037/0096-1523.17.2.433
Newman RS, 1997, J EXP PSYCHOL HUMAN, V23, P873, DOI 10.1037/0096-1523.23.3.873
OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151
Pfitzinger H. R., 2004, P 10 AUSTR INT C SPE, P545
Pitt MA, 1998, J MEM LANG, V39, P347, DOI 10.1006/jmla.1998.2571
SLANEY M, 1996, P 1996 IEEE ICASSP A, P1001
Stephens JDW, 2010, J ACOUST SOC AM, V128, P2138, DOI 10.1121/1.3479537
Stevens K.N., 1998, ACOUSTIC PHONETICS
STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102
Sussman HM, 1997, J ACOUST SOC AM, V101, P2826, DOI 10.1121/1.418567
SUSSMAN HM, 1991, J ACOUST SOC AM, V90, P1309, DOI 10.1121/1.401923
NR 36
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 877
EP 888
DI 10.1016/j.specom.2011.02.007
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500008
ER
PT J
AU Akdemir, E
Ciloglu, T
AF Akdemir, Eren
Ciloglu, Tolga
TI Bimodal automatic speech segmentation based on audio and visual
information fusion
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic speech segmentation; Audiovisual; Lip motion; Text-to-speech
ID SENSORY INTEGRATION; AUDIOVISUAL SPEECH; RECOGNITION; MODELS; PERCEPTION
AB Bimodal automatic speech segmentation using visual information together with audio data is introduced. The accuracy of automatic segmentation directly affects the quality of speech processing systems using the segmented database. The collaboration of audio and visual data results in lower average absolute boundary error between the manual segmentation and automatic segmentation results. The information from the two modalities is fused at the feature level and used in an HMM-based speech segmentation system. A Turkish audiovisual speech database has been prepared and used in the experiments. The average absolute boundary error decreases by up to 18% when different audiovisual feature vectors are used. The benefits of incorporating visual information are discussed for different phoneme boundary types. Each audiovisual feature vector results in a different performance at different types of phoneme boundaries. The average absolute boundary error decreases by approximately 25% when audiovisual feature vectors are used selectively for different boundary classes. Visual data is collected using an ordinary webcam. The proposed method is therefore very convenient to use in practice. (C) 2011 Elsevier B.V. All rights reserved.
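A hedged sketch of what feature-level fusion can look like in practice: frame-synchronous audio features (e.g., MFCCs) and lip features are concatenated into a single observation vector before HMM training and alignment. The linear upsampling of the video-rate stream and all variable names are illustrative assumptions, not the authors' exact procedure.

# Sketch: feature-level audiovisual fusion by concatenation.
import numpy as np

def fuse_features(mfcc, lip):
    """mfcc: (T_audio, D_a) at the audio frame rate; lip: (T_video, D_v) at the video rate."""
    t_audio = np.linspace(0.0, 1.0, mfcc.shape[0])
    t_video = np.linspace(0.0, 1.0, lip.shape[0])
    # Upsample the slower video stream to the audio frame rate (an assumed choice).
    lip_up = np.stack([np.interp(t_audio, t_video, lip[:, d])
                       for d in range(lip.shape[1])], axis=1)
    return np.hstack([mfcc, lip_up])    # (T_audio, D_a + D_v) fused observation vectors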
C1 [Akdemir, Eren; Ciloglu, Tolga] Middle E Tech Univ, Elect & Elect Engn Dept, TR-06531 Ankara, Turkey.
RP Akdemir, E (reprint author), Middle E Tech Univ, Elect & Elect Engn Dept, TR-06531 Ankara, Turkey.
EM erenakdemir@gmail.com
FU Scientific and Technological Research Council of Turkey (TUBITAK)
[107e101]
FX This work is supported by Scientific and Technological Research Council
of Turkey (TUBITAK) Project no: 107e101.
CR Akdemir E, 2008, SPEECH COMMUN, V50, P594, DOI 10.1016/j.specom.2008.04.005
Bayati M., 2006, INF THEOR 2006 IEEE, V2006, P557
Bonafonte A., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607841
BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W
Chen T, 1998, P IEEE, V86, P837
Cosi P., 1991, P EUROSPEECH 91, P693
Dodd B., 1998, HEARING EYE
ENGWALL A, 2003, 6 INT SEM SPEECH PRO, P43
Hall DL, 1997, P IEEE, V85, P6, DOI 10.1109/5.554205
ITAKURA F, 1975, J ACOUST SOC AM, V57, pS35, DOI 10.1121/1.1995189
Jarifi S., 2008, SPEECH COMMUN, V50, P67, DOI 10.1016/j.specom.2007.07.001
Kawai H., 2004, P IEEE INT C AC SPEE
Kaynak MN, 2004, IEEE T SYST MAN CY A, V34, P564, DOI 10.1109/TSMCA.2004.826274
MAK MW, 1994, SPEECH COMMUN, V14, P279, DOI 10.1016/0167-6393(94)90067-1
MAKASHAY MJ, 2000, P INT C SPOK LANG PR, P431
Malfrere F, 2003, SPEECH COMMUN, V40, P503, DOI 10.1016/S0167-6393(02)00131-0
Massaro DW, 1998, AM SCI, V86, P236, DOI 10.1511/1998.25.861
Munhall KG, 2004, PSYCHOL SCI, V15, P133, DOI 10.1111/j.0963-7214.2004.01502010.x
MUNHALL KG, 1995, J ACOUST SOC AM, V98, P1222, DOI 10.1121/1.413621
NETI C, 2000, WORKSH 2000 FIN REP
Park SS, 2007, IEEE T AUDIO SPEECH, V15, P2202, DOI 10.1109/TASL.2007.903933
Potamianos G, 2003, P IEEE, V91, P1306, DOI 10.1109/JPROC.2003.817150
Prasad VK, 2004, SPEECH COMMUN, V42, P429, DOI 10.1016/j.specom.2003.12.002
Smeele PMT, 1998, J EXP PSYCHOL HUMAN, V24, P1232, DOI 10.1037//0096-1523.24.4.1232
STORK DG, 1996, P 2 INT C AUT FAC GE, V16, P14
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
SUMMERFIELD AQ, 1989, HDB RES FACE PROCESS
Summerfield Q, 1987, HEARING EYE PSYCHOL, P3
Toledano DT, 2003, IEEE T SPEECH AUDI P, V11, P617, DOI 10.1109/TSA.2003.813579
VEPA J, 2003, P 8 EUR C SPEECH COM, P293
Wells J., 1997, HDB STANDARDS RE 4 B
WOUTERS J, 1998, P ICSLP, V6, P2747, DOI DOI 10.1109/ICASSP.2001.941045
Young S., 2002, HTK BOOK HTK VERSION
YUHAS BP, 1990, P IEEE, V78, P1658, DOI 10.1109/5.58349
Zhang J, 1997, P IEEE, V85, P1423
NR 35
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 889
EP 902
DI 10.1016/j.specom.2011.03.001
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500009
ER
PT J
AU Prendergast, G
Johnson, SR
Green, GGR
AF Prendergast, Garreth
Johnson, Sam R.
Green, Gary G. R.
TI Extracting amplitude modulations from speech in the time domain
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech; Amplitude modulation; Vocoder; Intelligibility
ID AUDITORY-SYSTEM; QUANTITATIVE MODEL; NATURAL SOUNDS; FREQUENCY;
INTELLIGIBILITY; RECOGNITION; RECEPTION; CUES
AB Natural sounds can be characterised by patterns of changes in loudness (amplitude modulations), and human speech perception studies have focused on the low frequencies contained in the gross temporal structure of speech. Low-pass filtering the temporal envelopes of sub-band filtered speech maintains intelligibility, but it remains unclear how the human auditory system could perform such a modulation domain analysis or even if it does so at all. It is difficult to further manipulate amplitude modulations through frequency-domain filtering to investigate cues the system may use. The current work focuses on a time-domain decomposition of filter output envelopes into pulses of amplitude modulation. The technique demonstrates that signals low-pass filtered in the modulation domain maintain bursts of energy which are comparable to those that can be extracted entirely within the time-domain. This paper presents preliminary work that suggests a time-domain approach, which focuses on the instantaneous features of transient changes in loudness, can be used to study the content of human speech. This approach should be pursued as it allows human speech intelligibility mechanisms to be investigated from a new perspective. (C) 2011 Elsevier B.V. All rights reserved.
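For orientation, the sketch below implements the conventional frequency-domain route the paper contrasts its time-domain approach with: band-pass filtering into one sub-band, Hilbert envelope extraction, and low-pass filtering of that envelope in the modulation domain. Band edges, filter orders and the 16 Hz cutoff are illustrative choices only.

# Sketch: sub-band envelope extraction and modulation-domain low-pass filtering.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def lowpassed_envelope(x, sr, band=(1000.0, 2000.0), mod_cutoff=16.0):
    sos_band = butter(4, band, btype="bandpass", fs=sr, output="sos")
    sub = sosfiltfilt(sos_band, x)                 # one auditory-like sub-band
    env = np.abs(hilbert(sub))                     # temporal envelope of that band
    sos_lp = butter(4, mod_cutoff, btype="lowpass", fs=sr, output="sos")
    return sosfiltfilt(sos_lp, env)                # envelope restricted to slow modulations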
C1 [Prendergast, Garreth; Johnson, Sam R.; Green, Gary G. R.] Univ York, York Neuroimaging Ctr, York YO10 5DG, N Yorkshire, England.
[Prendergast, Garreth; Green, Gary G. R.] Univ York, Hull York Med Sch, York YO10 5DG, N Yorkshire, England.
RP Prendergast, G (reprint author), Univ York, York Neuroimaging Ctr, York YO10 5DG, N Yorkshire, England.
EM garreth.prendergast@ynic.york.ac.uk
RI Green, Gary/D-3543-2009
CR BACON SP, 1989, J ACOUST SOC AM, V85, P2575, DOI 10.1121/1.397751
CAPRANICA R R, 1972, Physiologist, V15, P55
Chi TS, 1999, J ACOUST SOC AM, V106, P2719, DOI 10.1121/1.428100
Clark P, 2009, IEEE T SIGNAL PROCES, V57, P4323, DOI 10.1109/TSP.2009.2025107
Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959
Dau T, 1996, J ACOUST SOC AM, V99, P3623, DOI 10.1121/1.414960
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
Dudley H, 1939, J ACOUST SOC AM, V11, P169, DOI 10.1121/1.1916020
Elliott TM, 2009, PLOS COMPUT BIOL, V5, DOI 10.1371/journal.pcbi.1000302
GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T
GREEN GGR, 1974, J PHYSIOL-LONDON, V241, pP29
Greenberg S, 2004, IEICE T INF SYST, VE87D, P1059
Greenberg S., 1998, INT C SPOK LANG PROC
HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224
Joris PX, 2004, PHYSIOL REV, V84, P541, DOI 10.1152/physrev.00029.2003
KAY RH, 1982, PHYSIOL REV, V62, P894
Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6
Krebs B, 2008, J NEUROPHYSIOL, V100, P1602, DOI 10.1152/jn.90374.2008
Lewicki MS, 2002, NAT NEUROSCI, V5, P356, DOI 10.1038/nn831
Patterson R. D., 1988, 2341 MRC APPL PSYCH
PRENDERGAST G, 2010, EUR J NEUROSCI OCT
SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303
Singh NC, 2003, J ACOUST SOC AM, V114, P3394, DOI 10.1121/1.1624067
Slaney M., 1993, 35 APPL COMP INC PER
Slaney M., 1994, 45 APPL COMP INC
VIEMEISTER NF, 2002, GENETICS FUNCTION AU, P273
NR 27
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 903
EP 913
DI 10.1016/j.specom.2011.03.002
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500010
ER
PT J
AU Yu, K
Zen, H
Mairesse, F
Young, S
AF Yu, Kai
Zen, Heiga
Mairesse, Francois
Young, Steve
TI Context adaptive training with factorized decision trees for HMM-based
statistical parametric speech synthesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE HMM-based speech synthesis; Context adaptive training; Factorized
decision tree; State clustering
ID HIDDEN MARKOV-MODELS; REGRESSION
AB To achieve natural high quality synthesized speech in HMM-based speech synthesis, the effective modelling of complex acoustic and linguistic contexts is critical. Traditional approaches use context-dependent HMMs with decision-tree-based parameter clustering to model the full combinatorial space of contexts. However, weak contexts, such as word-level emphasis in natural speech, are difficult to capture using this approach. Also, due to combinatorial explosion, incorporating new contexts within the traditional framework may easily lead to the problem of insufficient data coverage. To effectively model weak contexts and reduce the data sparsity problem, different types of contexts should be treated independently. Context adaptive training provides a structured framework for this whereby standard HMMs represent normal contexts and transforms represent the additional effects of weak contexts. In contrast to speaker adaptive training in speech recognition, separate decision trees have to be built for different types of context factors. This paper describes the general framework of context adaptive training and investigates three concrete forms: MLLR-, CMLLR- and CAT-based systems. Experiments on a word-level emphasis synthesis task show that all context adaptive training approaches can outperform the standard full-context-dependent HMM approach. However, the MLLR-based system achieved the best performance. (C) 2011 Elsevier B.V. All rights reserved.
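As a minimal illustration of the adaptation step referred to above, the sketch below applies an MLLR-style mean transform to a canonical HMM state mean; such a transform would be estimated for a weak-context regression class defined by a factorized decision tree. Shapes and names are assumptions, not the paper's code.

# Sketch: applying an MLLR-style mean transform for a weak context (e.g., emphasis).
import numpy as np

def mllr_adapt_mean(mu, W):
    """mu: (D,) canonical state mean; W: (D, D+1) transform [b | A] with the bias in column 0."""
    xi = np.concatenate(([1.0], mu))    # extended mean vector
    return W @ xi                       # adapted mean A @ mu + b

# mu_emphasised = mllr_adapt_mean(mu_neutral, W_emphasis_class)   # names hypothetical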
C1 [Yu, Kai; Mairesse, Francois; Young, Steve] Univ Cambridge, Dept Engn, Machine Intelligence Lab, Cambridge CB2 1PZ, England.
[Zen, Heiga] Toshiba Res Europe Ltd, Cambridge Res Lab, Cambridge CB4 0GZ, England.
RP Yu, K (reprint author), Univ Cambridge, Dept Engn, Machine Intelligence Lab, Cambridge CB2 1PZ, England.
EM ky219@cam.ac.uk
RI Yu, Kai/B-1772-2012
OI Yu, Kai/0000-0002-7102-9826
FU UK EPSRC [EP/F013930/1]; EU [216594]
FX This research was partly funded by the UK EPSRC under grant agreement
EP/F013930/1 and by the EU FP7 Programme under grant agreement 216594
(CLASSIC project: www.classic-project.org). The original version of this
paper was selected as one of the best papers from Interspeech 2010. It
is presented here in revised form following additional peer review.
CR Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807
Chou W., 1999, P ICASSP 99, V1, P345
Fukada T., 1992, P ICASSP, P137, DOI 10.1109/ICASSP.1992.225953
GALES M, 2010, P INT, P58
Gales M. J. F., 1996, CUEDFINFENGTR263
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Gales MJF, 2000, IEEE T SPEECH AUDI P, V8, P417, DOI 10.1109/89.848223
Imai S., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing
Iwahashi N, 2000, IEICE T INF SYST, VE83D, P1550
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Kominek J., 2003, CMULTI03177
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Nankaku Y., 2008, P ICASSP, P4469
Povey Daniel, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495662
Saino K., 2008, THESIS NAGOYA I TECH
Shinoda K., 1997, P EUR
Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816
Tokuda K., HMM BASED SPEECH SYN
TOKUDA K, 2000, P ICASSP, V3, P1315
Yoshimura T, 1999, P EUR, P2347
Young S., 2009, HTK BOOK HTK VERSION
Young S.J., 1994, ARPA WORKSH HUM LANG, P307
Yu K, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495690
YU K, 2009, P ICASSP, P3773
Zen H., 2009, P INT, P2091
Zen H., 2010, P INT, P410
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325
NR 28
TC 13
Z9 13
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 914
EP 923
DI 10.1016/j.specom.2011.03.003
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500011
ER
PT J
AU Palomaki, KJ
Brown, GJ
AF Palomaki, Kalle J.
Brown, Guy J.
TI A computational model of binaural speech recognition: Role of
across-frequency vs. within-frequency processing and internal noise
SO SPEECH COMMUNICATION
LA English
DT Article
DE Binaural model; Speech recognition; Equalization-cancellation model;
Missing data
ID INTERAURAL TIME-DELAY; AUDITORY-NERVE DATA; PERCEPTUAL SEGREGATION;
SOUND LOCALIZATION; SPATIAL UNMASKING; LEVEL DIFFERENCES;
INTELLIGIBILITY; REVERBERATION; EQUALIZATION; RESTORATION
AB This study describes a model of binaural speech recognition that is tested against psychoacoustic findings on binaural speech intelligibility in noise. It consists of models of the auditory periphery, binaural pathway and recognition of speech from glimpses based on the missing data approach, which allows the speech reception threshold (SRT) of the model and listeners to be compared. The binaural advantage based on differences between the interaural time differences (ITD) of the target and masker is modelled using the equalization cancellation (EC) mechanism, either independently within each frequency channel or across all channels. The model is tested using a stimulus paradigm in which the target speech and noise interference are split into low- and high-frequency bands, so that the ITD in each band can be varied independently. The match between the model and listener data is quantified by a normalized SRT distance and a correlation metric, which demonstrate a slightly better match for the within-channel model (SRT: 0.5 dB, correlation: 0.94), than for the across-channel model (SRT: 0.7 dB, correlation: 0.90). However, as the differences between the approaches are small and non-significant, our results suggest that listeners exploit ITD via a mechanism that is neither fully frequency-dependent nor fully frequency-independent. (C) 2011 Elsevier B.V. All rights reserved.
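A simplified sketch of the within-channel equalization-cancellation idea: in each frequency channel, one ear's signal is delayed over a range of candidate ITDs and subtracted from the other, and the delay minimising the residual energy is taken as the equalizing delay. The integer-sample delays, the delay range and the absence of internal noise are simplifications relative to the model described above.

# Sketch: one EC step for a single frequency channel.
import numpy as np

def ec_residual(left, right, sr, max_itd=0.0007):
    max_lag = int(round(max_itd * sr))
    best, best_lag = None, 0
    for lag in range(-max_lag, max_lag + 1):
        shifted = np.roll(left, lag)              # integer-sample circular delay, for brevity
        energy = np.sum((right - shifted) ** 2)
        if best is None or energy < best:
            best, best_lag = energy, lag
    return best_lag / sr, best                    # equalizing delay (s) and cancelled-channel energy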
C1 [Palomaki, Kalle J.] Aalto Univ, Sch Sci & Technol, Dept Comp & Informat Sci, Adapt Informat Res Ctr, FI-00076 Aalto, Finland.
[Brown, Guy J.] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England.
RP Palomaki, KJ (reprint author), Aalto Univ, Sch Sci & Technol, Dept Comp & Informat Sci, Adapt Informat Res Ctr, POB 15400, FI-00076 Aalto, Finland.
EM kalle.palomaki@tkk.fi; g.brown@dcs.shef.ac.uk
CR Akeroyd MA, 2004, J ACOUST SOC AM, V116, P1135, DOI 10.1121/1.1768959
Akeroyd MA, 2001, J ACOUST SOC AM, V110, P1498, DOI 10.1121/1.1390336
Barker J., 2006, COMPUTATIONAL AUDITO, P297
Beutelmann R, 2006, J ACOUST SOC AM, V120, P331, DOI 10.1121/1.2202888
BREEBAART DJ, 2001, THESIS TU EINDHOVEN
BRONKHORST AW, 1988, J ACOUST SOC AM, V83, P1508, DOI 10.1121/1.395906
BROWN GJ, 2005, P INT LISB SEPT 4 8, P1753
COLBURN HS, 1973, J ACOUST SOC AM, V54, P1458, DOI 10.1121/1.1914445
COLBURN HS, 1977, J ACOUST SOC AM, V61, P525, DOI 10.1121/1.381294
Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0
Cooke M., 1994, ICSLP 94. 1994 International Conference on Spoken Language Processing
Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600
Culling JF, 1998, J ACOUST SOC AM, V103, P3509, DOI 10.1121/1.423059
CULLING JF, 1995, J ACOUST SOC AM, V98, P785, DOI 10.1121/1.413571
Darwin CJ, 1997, J ACOUST SOC AM, V102, P2316, DOI 10.1121/1.419641
Drennan WR, 2003, J ACOUST SOC AM, V114, P2178, DOI 10.1121/1.1609994
DURLACH NI, 1963, J ACOUST SOC AM, V35, P1206, DOI 10.1121/1.1918675
Durlach N.I., 1972, F MODERN AUDITORY TH, P371
EDMONDS B, 2004, THESIS CARDIFF U
Edmonds BA, 2006, J ACOUST SOC AM, V120, P1539, DOI 10.1121/1.2228573
Edmonds BA, 2005, J ACOUST SOC AM, V117, P3069, DOI 10.1121/1.1880752
GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T
Hawley ML, 1999, J ACOUST SOC AM, V105, P3436, DOI 10.1121/1.424670
HIRSH IJ, 1950, J ACOUST SOC AM, V22, P196, DOI 10.1121/1.1906588
JEFFRESS LA, 1948, J COMP PHYSIOL PSYCH, V41, P35, DOI 10.1037/h0061495
KOCK WE, 1950, J ACOUST SOC AM, V22, P801, DOI 10.1121/1.1906692
Leonard R. G., 1984, P ICASSP 84, P111
Liu C, 2001, J ACOUST SOC AM, V110, P3218, DOI 10.1121/1.1419090
Lyon R. F., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing
Mathworks, 2008, MATLAB
McArdle Rachel A, 2005, J Am Acad Audiol, V16, P726, DOI 10.3766/jaaa.16.9.9
Palomaki KJ, 2004, SPEECH COMMUN, V43, P361, DOI 10.1016/j.specom.2004.03.005
Palomaki KJ, 2004, SPEECH COMMUN, V43, P123, DOI 10.1016/j.specom.2004.02.005
PALOMAKI KJ, 2008, J ACOUST SOC AM, V123, P3715
PLOMP R, 1979, AUDIOLOGY, V18, P43
Ramkissoon Ishara, 2002, Am J Audiol, V11, P23, DOI 10.1044/1059-0889(2002/005)
Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463
SCHUBERT ED, 1956, J ACOUST SOC AM, V28, P895, DOI 10.1121/1.1908508
Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006
SPIETH W, 1954, J ACOUST SOC AM, V26, P391, DOI 10.1121/1.1907347
STEVENS SS, 1957, PSYCHOL REV, V64, P153, DOI 10.1037/h0046162
van der Heijden M, 1999, J ACOUST SOC AM, V105, P388, DOI 10.1121/1.424628
VOMHOVEL H, 1984, THESIS RHEINISCH WES
WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392
Warren RM, 1997, PERCEPT PSYCHOPHYS, V59, P275, DOI 10.3758/BF03211895
Wilson Richard H., 2004, Seminars in Hearing, V25, P93
Wilson RH, 2005, J REHABIL RES DEV, V42, P499, DOI 10.1682/JRRD.2004.10.0134
NR 47
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 924
EP 940
DI 10.1016/j.specom.2011.03.005
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500012
ER
PT J
AU Zimmerer, F
Scharinger, M
Reetz, H
AF Zimmerer, Frank
Scharinger, Mathias
Reetz, Henning
TI When BEAT becomes HOUSE: Factors of word-final /t/-deletion in German
SO SPEECH COMMUNICATION
LA English
DT Article
DE Segment deletion; Segment reduction; Natural speech; Production;
Phonology
ID AMERICAN ENGLISH; SPEECH; PERCEPTION; SPEAKING; INTELLIGIBILITY;
FREQUENCY; REDUCTION; EXCERPTS
AB The deletion and reduction of alveolar /t/ is a phenomenon that has been given considerable attention in the research on speech production and perception. Data have mainly been drawn from spoken language corpora, where a tight control over contributing factors of /t/-deletion is hardly possible. Here, we present a new way of creating a spoken language corpus adhering to some crucial factors we wanted to hold constant for the investigation of word-final /t/-deletion in German. German is especially interesting with regard to /t/-deletion due to its rich suffixal morphology, attributing morphological status to word-final /t/ in many paradigms. We focused on verb inflection and employed a verb form production task for creating a concise corpus of naturally spoken language in which we could control for factors previously established to affect /t/-deletion. We then determined the best predictors of /t/-productions (i.e. canonical, deleted, or reduced) in our corpus. The influence of extra-linguistic factors was comparable to that found in previous studies. We suggest that our method of constructing a natural language corpus with carefully selected characteristics is a viable way for the examination of deletions and reductions during speech production. Furthermore, we found that the best predictor for non-canonical productions and deletions was the following phonological context. (C) 2011 Elsevier B.V. All rights reserved.
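To make the statistical step concrete, the toy sketch below fits an ordinary logistic regression of /t/-deletion on a following-context factor and speech rate. The paper itself uses mixed-effects models (lme4 in R) with random effects; the Python simplification, the synthetic data and the column names are illustrative assumptions only.

# Sketch: which factors predict non-canonical /t/ productions (toy data, no random effects).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "deleted": rng.integers(0, 2, n),                               # 1 = /t/ deleted (toy labels)
    "following_context": rng.choice(["vowel", "consonant", "pause"], n),
    "speech_rate": rng.normal(5.0, 1.0, n),                         # syllables per second (toy)
})
model = smf.logit("deleted ~ C(following_context) + speech_rate", data=df).fit()
print(model.summary())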
C1 [Zimmerer, Frank; Reetz, Henning] Goethe Univ Frankfurt, Inst Phonet, D-60054 Frankfurt, Germany.
[Scharinger, Mathias] Univ Maryland, Dept Linguist, College Pk, MD 20742 USA.
[Zimmerer, Frank; Scharinger, Mathias] Univ Konstanz, Dept Linguist, D-78457 Constance, Germany.
RP Zimmerer, F (reprint author), Goethe Univ Frankfurt, Inst Phonet, Box 170, D-60054 Frankfurt, Germany.
EM zimmerer@em.uni-frankfurt.de
FU Deutsche Forschungs Gesellschaft, DFG [SPP 1234, SFB 471]
FX This work was supported by the Deutsche Forschungs Gesellschaft, DFG
(SPP 1234 and SFB 471). We also wish to thank our reviewers for their
very useful comments and suggestions.
CR Agresti A., 2002, CATEGORICAL DATA ANA
ARVANITI A, 2007, P 16 INT C PHON SCI, P19
Baayen H., 2008, ANAL LINGUISTIC DATA
Baayen H. R., 1995, CELEX LEXICAL DATABA
Bates D., 2010, IME4 LINEAR MIXED EF
Boersma P., 2007, PRAAT DOING PHONETIC
BRESLOW NE, 1993, J AM STAT ASSOC, V88, P9, DOI 10.2307/2290687
Byrd D, 1996, J PHONETICS, V24, P263, DOI 10.1006/jpho.1996.0014
BYRD D, 1994, SPEECH COMMUN, V15, P39, DOI 10.1016/0167-6393(94)90039-6
Fasold Ralph W., 1972, TENSE MARKING BLACK
Fosler-Lussier E, 1999, SPEECH COMMUN, V29, P137, DOI 10.1016/S0167-6393(99)00035-7
Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332
GREENBERG S, 2002, P 2 INT C HUM LANG T, P36, DOI 10.3115/1289189.1289251
Greenberg S, 1999, SPEECH COMMUN, V29, P159, DOI 10.1016/S0167-6393(99)00050-3
GUY GR, 1992, CHANGE, V3, P223
Guy Gregory R, 1980, LOCATING LANGUAGE TI, P1
Hay J, 2001, LINGUISTICS, V39, P1041, DOI 10.1515/ling.2001.041
Hume E., 2007, BUCKEYE CORPUS CONVE
IPDS, 1994, KIEL CORP SPONT SPEE
Johnson Keith, 2004, P 1 SESS 10 INT S
Jurafsky Daniel, 2001, FREQUENCY EMERGENCE, P229, DOI 10.1075/tsl.45.13jur
KINGSTON J, 2006, P 3 C LAB APPR SPAN
Kohler Klaus, 1995, EINFUHRUNG PHONETIK, VSecond
Koreman J, 2006, J ACOUST SOC AM, V119, P582, DOI 10.1121/1.2133436
LABOV W, 1967, NEW DIRECTIONS ELEME, P1
LAHIRI A, 2007, P 16 INT C PHON SCI, P19
LIEBERMAN P, 1963, LANG SPEECH, V6, P172
Mitterer H, 2008, J MEM LANG, V59, P133, DOI 10.1016/j.jml.2008.02.004
Mitterer H, 2006, J PHONETICS, V34, P73, DOI 10.1016/j.wocn.2005.03.003
Neu H., 1980, LOCATING LANGUAGE TI, P37
Nolan F., 1992, PAPERS LABORATORY PH, P261
PICKETT JM, 1963, LANG SPEECH, V6, P151
Pinheiro J. C., 2000, MIXED EFFECTS MODELS
POLLACK I, 1963, LANG SPEECH, V6, P165
PYCHA A, 2010, J INT PHON ASSOC, V39, P1
R Development Core Team, 2010, R LANG ENV STAT COMP
Raymond William D., 2006, LANG VAR CHANGE, V18, P55
Shriberg E., 1999, P INT C PHON SCI SAN, P619
Sumner M, 2005, J MEM LANG, V52, P322, DOI 10.1016/j.jml.2004.11.004
Tree Jean E. Fox, 1997, Cognition, V62, P151
Turk AE, 2007, J PHONETICS, V35, P445, DOI 10.1016/j.wocn.2006.12.001
Turk AE, 1999, J PHONETICS, V27, P171, DOI 10.1006/jpho.1999.0093
Wolfram Walt, 1969, SOCIOLINGUISTIC DESC
ZIMMERER F, 2009, THESIS U FRANKFURT
Zimmerer F, 2009, J ACOUST SOC AM, V125, P2307, DOI 10.1121/1.3021438
ZUE VW, 1979, J ACOUST SOC AM, V66, P1039, DOI 10.1121/1.383323
NR 46
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2011
VL 53
IS 6
BP 941
EP 954
DI 10.1016/j.specom.2011.03.006
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 767BE
UT WOS:000290829500013
ER
PT J
AU Heckmann, M
Raj, B
Smaragdis, P
AF Heckmann, Martin
Raj, Bhiksha
Smaragdis, Paris
TI Special Issue: Perceptual and Statistical Audition Preface
SO SPEECH COMMUNICATION
LA English
DT Editorial Material
NR 0
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 591
EP 591
DI 10.1016/j.specom.2011.03.004
PG 1
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900001
ER
PT J
AU Dietz, M
Ewert, SD
Hohmann, V
AF Dietz, Mathias
Ewert, Stephan D.
Hohmann, Volker
TI Auditory model based direction estimation of concurrent speakers from
binaural signals
SO SPEECH COMMUNICATION
LA English
DT Article
DE Binaural processing; Auditory modeling; Direction estimation
ID TIME-FREQUENCY MASKS; SOURCE LOCALIZATION; NOISE-REDUCTION;
CONTRALATERAL INHIBITION; INTERAURAL PARAMETERS; ROOM REVERBERATION;
SPEECH RECOGNITION; SOUND LOCALIZATION; CROSS-CORRELATION; HEARING-AIDS
AB Humans show a very robust ability to localize sounds in adverse conditions. Computational models of binaural sound localization and technical approaches of direction-of-arrival (DOA) estimation also show good performance; however, both their binaural feature extraction and the strategies for further analysis partly differ from what is currently known about the human auditory system. This study investigates auditory model based DOA estimation, emphasizing known features and limitations of auditory binaural processing such as (i) high temporal resolution, (ii) restricted frequency range to exploit temporal fine-structure, (iii) use of temporal envelope disparities, and (iv) a limited range to compensate for interaural time delay. DOA estimation performance was investigated for up to five concurrent speakers in free field and for up to three speakers in the presence of noise. The DOA errors in these conditions were always smaller than 5 degrees. A condition with moving speakers was also tested, and up to three moving speakers could be tracked simultaneously. Analysis of DOA performance as a function of the binaural temporal resolution showed that short time constants of about 5 ms employed by the auditory model were crucial for robustness against concurrent sources. (C) 2010 Elsevier B.V. All rights reserved.
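As a point of reference only, the sketch below shows a basic cross-correlation ITD estimate in one band and a free-field mapping from ITD to azimuth. The model summarised above instead derives interaural phase differences and envelope disparities from an auditory front end with fine temporal resolution; the ear distance and other constants here are assumptions.

# Sketch: cross-correlation ITD estimate and a simple ITD-to-azimuth mapping.
import numpy as np

def itd_to_azimuth(left, right, sr, ear_distance=0.18, c=343.0):
    max_lag = int(round(ear_distance / c * sr))
    lags = np.arange(-max_lag, max_lag + 1)
    xcorr = [np.dot(left, np.roll(right, lag)) for lag in lags]   # circular shift, for brevity
    itd = lags[int(np.argmax(xcorr))] / sr
    sin_az = np.clip(itd * c / ear_distance, -1.0, 1.0)
    return np.degrees(np.arcsin(sin_az))          # azimuth estimate in degrees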
C1 [Dietz, Mathias; Ewert, Stephan D.; Hohmann, Volker] Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany.
RP Dietz, M (reprint author), Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany.
EM mathias.dietz@uni-oldenburg.de
FU DFG [SFB/TRR31]; International Graduate School
FX This study was supported by the DFG (SFB/TRR31 'The Active Auditory
System') and the International Graduate School "Neurosensory Science,
Systems and Applications". We would like to thank the members of the
Medical Physics group and Birger Kollmeier for continuous support and
fruitful discussions. We are grateful to Hendrik Kayser for his valuable
help with the head-related impulse responses.
CR Ajmera J., 2004, P ICASSP, V1, P605
ALLEN JB, 1977, J ACOUST SOC AM, V62, P912, DOI 10.1121/1.381621
BERNSTEIN LR, 1994, J ACOUST SOC AM, V95, P3561, DOI 10.1121/1.409973
Bernstein LR, 2002, J ACOUST SOC AM, V112, P1026, DOI 10.1121/1.1497620
BLAUERT J, 1986, J ACOUST SOC AM, V80, P533, DOI 10.1121/1.394048
Braasch J, 2002, ACTA ACUST UNITED AC, V88, P956
Brand A, 2002, NATURE, V417, P543, DOI 10.1038/417543a
Breebaart J, 2001, J ACOUST SOC AM, V110, P1105, DOI 10.1121/1.1383299
Cherry CE, 1953, J ACOUST SOC AM, V25, P975, DOI DOI 10.1121/1.1907229
Cooke M, 2003, J PHONETICS, V31, P579, DOI 10.1016/S0095-4470(03)00013-5
COOKE MP, 1993, SPEECH COMMUN, V13, P391, DOI 10.1016/0167-6393(93)90037-L
Culling JF, 2000, J ACOUST SOC AM, V107, P517, DOI 10.1121/1.428320
de Cheveigne A, 1999, SPEECH COMMUN, V27, P175, DOI 10.1016/S0167-6393(98)00074-0
de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024
Dietz M, 2009, J ACOUST SOC AM, V125, P1622, DOI 10.1121/1.3076045
Dietz M, 2008, BRAIN RES, V1220, P234, DOI 10.1016/j.brainres.2007.09.026
Drullman R, 2000, J ACOUST SOC AM, V107, P2224, DOI 10.1121/1.428503
Ewert SD, 2000, J ACOUST SOC AM, V108, P1181, DOI 10.1121/1.1288665
Faller C, 2004, J ACOUST SOC AM, V116, P3075, DOI 10.1121/1.1791872
Garofolo J., 1990, DARPA TIMIT ACOUSTIC
Goupell MJ, 2006, J ACOUST SOC AM, V119, P3971, DOI 10.1121/1.2200147
Grimm G, 2009, IEEE T AUDIO SPEECH, V17, P1408, DOI 10.1109/TASL.2009.2020531
Hartikainen J., 2008, OPTIMAL FILTERING KA
Haykin S, 2005, NEURAL COMPUT, V17, P1875, DOI 10.1162/0899766054322964
Heil P, 2003, SPEECH COMMUN, V41, P123, DOI 10.1016/S0167-6393(02)00099-7
Hohmann V, 2002, ACTA ACUST UNITED AC, V88, P433
Joris PX, 2006, J NEUROSCI, V26, P279, DOI 10.1523/JNEUROSCI.2285-05.2006
Kayser H, 2009, EURASIP J ADV SIG PR, DOI 10.1155/2009/298605
KNAPP CH, 1976, IEEE T ACOUST SPEECH, V24, P320, DOI 10.1109/TASSP.1976.1162830
KOEHNKE J, 1986, J ACOUST SOC AM, V79, P1558, DOI 10.1121/1.393682
KOLLMEIER B, 1990, J ACOUST SOC AM, V87, P1709, DOI 10.1121/1.399419
KUHN GF, 1977, J ACOUST SOC AM, V62, P157, DOI 10.1121/1.381498
Li YP, 2009, SPEECH COMMUN, V51, P230, DOI 10.1016/j.specom.2008.09.001
LINDEMANN W, 1986, J ACOUST SOC AM, V80, P1608, DOI 10.1121/1.394325
Liu C, 2000, J ACOUST SOC AM, V108, P1888, DOI 10.1121/1.1290516
Louage DHG, 2006, J NEUROSCI, V26, P96, DOI 10.1523/JNEUROSCI.2339-05.2006
MARQUARDT T, 2007, HEARING SENSORY PROC, P312
May T, 2011, IEEE T AUDIO SPEECH, V19, P1, DOI 10.1109/TASL.2010.2042128
McAlpine D, 2001, NAT NEUROSCI, V4, P396, DOI 10.1038/86049
MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861
Nix J, 2007, IEEE T AUDIO SPEECH, V15, P995, DOI 10.1109/TASL.2006.889788
Nix J, 2006, J ACOUST SOC AM, V119, P463, DOI 10.1121/1.2139619
PALMER AR, 1986, HEARING RES, V24, P1, DOI 10.1016/0378-5955(86)90002-X
Palomaki KJ, 2004, SPEECH COMMUN, V43, P361, DOI 10.1016/j.specom.2004.03.005
Park HM, 2009, SPEECH COMMUN, V51, P15, DOI 10.1016/j.specom.2008.05.012
Patterson RD, 1987, M IOC SPEECH GROUP A
Peissig J, 1997, J ACOUST SOC AM, V101, P1660, DOI 10.1121/1.418150
POLLACK I, 1959, J ACOUST SOC AM, V31, P1250, DOI 10.1121/1.1907852
Puria S, 1997, J ACOUST SOC AM, V101, P2754, DOI 10.1121/1.418563
Rohdenburg T, 2008, INT CONF ACOUST SPEE, P2449, DOI 10.1109/ICASSP.2008.4518143
Roman N, 2008, IEEE T AUDIO SPEECH, V16, P728, DOI 10.1109/TASL.2008.918978
Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463
RUGGERO MA, 1991, J NEUROSCI, V11, P1057
Sarkka S, 2007, INFORM FUSION, V8, P2, DOI 10.1016/j.inffus.2005.09.009
SAYERS BM, 1964, J ACOUST SOC AM, V36, P923, DOI 10.1121/1.1919121
Siveke I, 2008, J NEUROSCI, V28, P2043, DOI 10.1523/JNEUROSCI.4488-07.2008
Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003
Supper B, 2006, IEEE T AUDIO SPEECH, V14, P1008, DOI 10.1109/TSA.2005.857787
Thompson S. P., 1882, PHILOS MAG, V13, P406
van de Par S, 1999, J ACOUST SOC AM, V106, P1940, DOI 10.1121/1.427942
Wang D., 2006, COMPUTATIONAL AUDITO
Wittkop T, 2003, SPEECH COMMUN, V39, P111, DOI 10.1016/S0167-6393(02)00062-6
Wu MY, 2003, IEEE T SPEECH AUDI P, V11, P229, DOI 10.1109/TSA.2003.811539
NR 63
TC 11
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 592
EP 605
DI 10.1016/j.specom.2010.05.006
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900002
ER
PT J
AU Weiss, RJ
Mandel, MI
Ellis, DPW
AF Weiss, Ron J.
Mandel, Michael I.
Ellis, Daniel P. W.
TI Combining localization cues and source model constraints for binaural
source separation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Source separation; Binaural; Source models; Eigenvoices; EM
ID DATA SPEECH RECOGNITION; STATISTICS
AB We describe a system for separating multiple sources from a two-channel recording based on interaural cues and prior knowledge of the statistics of the underlying source signals. The proposed algorithm effectively combines information derived from low level perceptual cues, similar to those used by the human auditory system, with higher level information related to speaker identity. We combine a probabilistic model of the observed interaural level and phase differences with a prior model of the source statistics and derive an EM algorithm for finding the maximum likelihood parameters of the joint model. The system is able to separate more sound sources than there are observed channels in the presence of reverberation. In simulated mixtures of speech from two and three speakers the proposed algorithm gives a signal-to-noise ratio improvement of 1.7 dB over a baseline algorithm which uses only interaural cues. Further improvement is obtained by incorporating eigenvoice speaker adaptation to enable the source model to better match the sources present in the signal. This improves performance over the baseline by 2.7 dB when the speakers used for training and testing are matched. However, the improvement is minimal when the test data is very different from that used in training. (C) 2011 Elsevier B.V. All rights reserved.
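A minimal sketch of the low-level binaural observations such a model is fitted to: per time-frequency-bin interaural level differences (in dB) and interaural phase differences computed from a two-channel STFT. The STFT settings are assumptions, and the probabilistic model and EM updates themselves are not shown.

# Sketch: per-bin interaural level and phase differences from a stereo STFT.
import numpy as np
from scipy.signal import stft

def interaural_features(left, right, sr, nperseg=1024):
    _, _, L = stft(left, fs=sr, nperseg=nperseg)
    _, _, R = stft(right, fs=sr, nperseg=nperseg)
    ild = 20.0 * np.log10((np.abs(L) + 1e-12) / (np.abs(R) + 1e-12))   # ILD in dB
    ipd = np.angle(L * np.conj(R))                                     # IPD wrapped to (-pi, pi]
    return ild, ipd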
C1 [Weiss, Ron J.; Mandel, Michael I.; Ellis, Daniel P. W.] Columbia Univ, Dept Elect Engn, LabROSA, New York, NY 10027 USA.
RP Weiss, RJ (reprint author), Columbia Univ, Dept Elect Engn, LabROSA, New York, NY 10027 USA.
EM ronw@ee.columbia.edu; mim@ee.columbia.edu; dpwe@ee.columbia.edu
FU NSF [IIS-0238301, IIS-0535168]; EU
FX This work was supported by the NSF under Grants No. IIS-0238301 and
IIS-0535168, and by EU project AMIDA. Any opinions, findings, and
conclusions or recommendations expressed in this material are those of
the authors and do not necessarily reflect the views of the Sponsors.
CR AARABI P, 2002, IEEE T SYSTEMS MAN C, V32
Algazi V. R., 2001, IEEE WORKSH APPL SIG, P99
BLAUERT J, 1997, SPATIAL HEARNING PSY
CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229
Cooke M, 2006, J ACOUST SOC AM, V120, P2421, DOI 10.1121/1.2229005
Cooke M, 2010, COMPUT SPEECH LANG, V24, P1, DOI 10.1016/j.csl.2009.02.006
Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC
Harding S, 2006, IEEE T AUDIO SPEECH, V14, P58, DOI 10.1109/TSA.2005.860354
JOURJINE A, 2000, P IEEE INT C AC SPEE, V5, P2985
Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Mandel M. I., 2007, P IEEE WORKSH APPL S, P275
Mandel MI, 2010, IEEE T AUDIO SPEECH, V18, P382, DOI 10.1109/TASL.2009.2029711
Nix J, 2006, J ACOUST SOC AM, V119, P463, DOI 10.1121/1.2139619
Palomaki KJ, 2004, SPEECH COMMUN, V43, P361, DOI 10.1016/j.specom.2004.03.005
RENNIE S, 2003, P IEEE INT C AC SPEE, V1, P88
RENNIE SI, 2005, P 10 INT WORKSH ART, P293
ROMAN N, 2004, ADV NEURAL INFORM PR
Roman N, 2006, J ACOUST SOC AM, V120, P458, DOI 10.1121/1.2204590
Roweis S. T., 2003, P EUR, P1009
Sawada H., 2007, P IEEE WORKSH APPL S, P139
Shinn-Cunningham BG, 2005, J ACOUST SOC AM, V117, P3100, DOI 10.1121/1.1872572
WANG D, 2005, IDEAL BINARY MASK CO, P181
Weiss RJ, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P419
Weiss RJ, 2010, COMPUT SPEECH LANG, V24, P16, DOI 10.1016/j.csl.2008.03.003
Weiss Ron J, 2009, THESIS COLUMBIA U
WIGHTMAN FL, 1992, J ACOUST SOC AM, V91, P1648, DOI 10.1121/1.402445
Wilson K, 2007, INT CONF ACOUST SPEE, P33
Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI 10.1109/TSP.2004.828896
NR 29
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 606
EP 621
DI 10.1016/j.specom.2011.01.003
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900003
ER
PT J
AU Lu, YC
Cooke, M
AF Lu, Yan-Chen
Cooke, Martin
TI Motion strategies for binaural localisation of speech sources in azimuth
and distance by artificial listeners
SO SPEECH COMMUNICATION
LA English
DT Article
DE Active hearing; Sound source localisation; Interaural time difference;
Motion parallax; Particle filtering
ID TIME-DELAY ESTIMATION; SOUND LOCALIZATION; PARTICLE FILTERS; HEAD
MOVEMENTS; REVERBERATION; PERCEPTION; SIMULATION; TRACKING; CUES
AB Localisation in azimuth and distance of sound sources such as speech is an important ability for both human and artificial listeners. While progress has been made, particularly for azimuth estimation, most work has been directed at the special case of static listeners and static sound sources. Although dynamic sound sources create their own localisation challenges such as motion blur, moving listeners have the potential to exploit additional cues not available in the static situation. An example is motion parallax, based on a sequence of azimuth estimates, which can be used to triangulate sound source location. The current study examines what types of listener (or sensor) motion are beneficial for localisation. Is any kind of motion useful, or do certain motion trajectories deliver robust estimates rapidly? Eight listener motion strategies and a no motion baseline were tested, including simple approaches such as random walks and motion limited to head rotations only, as well as more sophisticated strategies designed to maximise the amount of new information available at each time step or to minimise the overall estimate uncertainty. Sequential integration of estimates was achieved using a particle filtering framework. Evaluations, performed in a simulated acoustic environment with single sources under both anechoic and reverberant conditions, demonstrated that two strategies were particularly effective for localisation. The first was simply to move towards the most likely source location, which is beneficial in increasing signal-to-noise ratio, particularly in reverberant conditions. The other high performing approach was based on moving in the direction which led to the largest reduction in the uncertainty of the location estimate. Both strategies achieved estimation errors nearly an order of magnitude less than those obtainable with a static approach, demonstrating the power of motion-based cues to sound source localisation. (C) 2010 Elsevier B.V. All rights reserved.
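A minimal Python sketch (ours, under simplifying assumptions) of the sequential integration the abstract relies on: a bootstrap particle filter triangulates a static 2-D source from noisy azimuth estimates gathered at successive listener positions, i.e. motion parallax. The room size, observation noise, and jitter values are arbitrary placeholders, and no motion strategy selection is modelled.

```python
# Illustrative sketch only: particle-filter triangulation from azimuth estimates.
import numpy as np

rng = np.random.default_rng(0)

def azimuth(source_xy, listener_xy):
    d = source_xy - listener_xy
    return np.arctan2(d[..., 1], d[..., 0])

def particle_filter_localise(listener_path, observed_az, n_part=2000,
                             obs_std=np.radians(5)):
    # particles: candidate source positions drawn uniformly in a 10 m x 10 m area
    parts = rng.uniform(-5, 5, size=(n_part, 2))
    weights = np.full(n_part, 1.0 / n_part)
    for pos, az in zip(listener_path, observed_az):
        err = np.angle(np.exp(1j * (azimuth(parts, pos) - az)))   # wrapped angular error
        weights *= np.exp(-0.5 * (err / obs_std) ** 2)
        weights /= weights.sum()
        # multinomial resampling keeps the particle cloud focused
        idx = rng.choice(n_part, size=n_part, p=weights)
        parts = parts[idx] + rng.normal(0, 0.05, size=(n_part, 2))  # small jitter
        weights.fill(1.0 / n_part)
    return parts.mean(axis=0)

# toy usage: listener walks along the x axis, true source at (2, 3)
path = np.stack([np.linspace(0, 2, 10), np.zeros(10)], axis=1)
true_src = np.array([2.0, 3.0])
obs = azimuth(true_src, path) + rng.normal(0, np.radians(5), 10)
print(particle_filter_localise(path, obs))
```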
C1 [Lu, Yan-Chen] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England.
[Cooke, Martin] Ikerbasque Basque Fdn Sci, Bilbao 48011, Spain.
[Cooke, Martin] Univ Basque Country, Language & Speech Lab, Fac Letters, Bilbao, Spain.
RP Lu, YC (reprint author), Univ Sheffield, Dept Comp Sci, 211 Portobello St, Sheffield S1 4DP, S Yorkshire, England.
EM y.c.lu@dcs.shef.ac.uk
CR Aarabi P, 2002, IEEE T SYST MAN CY C, V32, P474, DOI 10.1109/TSMCB.2002.804369
ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599
Arulampalam MS, 2002, IEEE T SIGNAL PROCES, V50, P174, DOI 10.1109/78.978374
ASHMEAD DH, 1995, J EXP PSYCHOL HUMAN, V21, P239, DOI 10.1037/0096-1523.21.2.239
Asoh H., 2004, P FUS, P805
Blauert J., 1997, SPATIAL HEARING PSYC
Bodden M., 1993, Acta Acustica, V1
Brandstein MS, 1997, INT CONF ACOUST SPEE, P375, DOI 10.1109/ICASSP.1997.599651
BRANDSTEIN MS, 1997, P WASPAA
Campbell D. R., 2005, Computing and Information Systems, V9
Champagne B, 1996, IEEE T SPEECH AUDI P, V4, P148, DOI 10.1109/89.486067
Chen JD, 2006, EURASIP J APPL SIG P, DOI 10.1155/ASP/2006/26503
COOKE M, 2008, AUDITORY SIGNAL PROC
DEFREITAS N, 1998, 328 CUEDFINFENGTR
Del Moral P., 1996, MARKOV PROCESS RELAT, V2, P555
Douc R., 2005, Proceedings of the 4th International Symposium on Image and Signal Processing and Analysis (IEEE Cat. No. 05EX1094)
Doucet A, 2000, STAT COMPUT, V10, P197, DOI 10.1023/A:1008935410038
Eyring C. F., 1930, J ACOUST SOC AM, V1, P168, DOI 10.1121/1.1901884
Faller C, 2004, J ACOUST SOC AM, V116, P3075, DOI 10.1121/1.1791872
GAIK W, 1993, J ACOUST SOC AM, V94, P98, DOI 10.1121/1.406947
GARDNER WG, 1996, STANDARDS COMPUTER G
GORDON NJ, 1993, IEE PROC-F, V140, P107
Jan EE, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1321
JEFFRESS LA, 1948, J COMP PHYSIOL PSYCH, V41, P35, DOI 10.1037/h0061495
Kidd G, 2005, J ACOUST SOC AM, V118, P3804, DOI 10.1121/1.2109187
Kitagawa G., 1996, J COMPUTATIONAL GRAP, V5, P1, DOI DOI 10.2307/1390750
KNAPP CH, 1976, IEEE T ACOUST SPEECH, V24, P320, DOI 10.1109/TASSP.1976.1162830
LINDEMANN W, 1986, J ACOUST SOC AM, V80, P1608, DOI 10.1121/1.394325
Liu JS, 1998, J AM STAT ASSOC, V93, P1032, DOI 10.2307/2669847
LOOMIS JM, 1990, J ACOUST SOC AM, V88, P1757, DOI 10.1121/1.400250
Lu Y.-C., 2007, P INT ANTW BELG AUG, P574
Lukowicz P, 2004, LECT NOTES COMPUT SC, V3001, P18
MACKENSEN P, 2004, THESIS TU BERLIN BER
MARTINSON E, 2006, P IEEE RSJ INT C INT, P1139
OTANI M, 2007, P JAP CHIN JOINT C A
Patterson R.D., 1988, 2341 APU
Pitt MK, 1999, J AM STAT ASSOC, V94, P590, DOI 10.2307/2670179
REKLEITIS IM, 2003, THESIS MCGILL U MONT
RUI Y, 2004, ACOUST SPEECH SIG PR, P133
Sasaki Y., 2006, P IEEE RSJ INT C INT, P380
Sawhney N., 2000, ACM Transactions on Computer-Human Interaction, V7, DOI 10.1145/355324.355327
Speigle J. M., 1993, Proceedings IEEE 1993 Symposium on Research Frontiers in Virtual Reality (Cat. No.93TH0585-0), DOI 10.1109/VRAIS.1993.378257
THURLOW WR, 1967, J ACOUST SOC AM, V42, P489, DOI 10.1121/1.1910605
Viste H, 2004, P 7 INT C DIG AUD EF, P145
Wallach H, 1940, J EXP PSYCHOL, V27, P339, DOI 10.1037/h0054629
Wang H, 1997, INT CONF ACOUST SPEE, P187, DOI 10.1109/ICASSP.1997.599595
Ward DB, 2003, IEEE T SPEECH AUDI P, V11, P826, DOI 10.1109/TSA.2003.818112
West M., 1997, BAYESIAN FORECASTING, V2nd
Zahorik P, 2005, ACTA ACUST UNITED AC, V91, P409
NR 49
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 622
EP 642
DI 10.1016/j.specom.2010.06.001
PG 21
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900004
ER
PT J
AU Pichevar, R
Najaf-Zadeh, H
Thibault, L
Landili, H
AF Pichevar, Ramin
Najaf-Zadeh, Hossein
Thibault, Louis
Landili, Hassan
TI Auditory-inspired sparse representation of audio signals
SO SPEECH COMMUNICATION
LA English
DT Article
DE Sparse representations; Masking; Quantization; Temporal data mining;
Episode discovery; Audio coding; Matching pursuit; Auditory pattern
recognition
ID MATCHING PURSUITS; EPISODES; SPIKES; FILTER
AB This article deals with the generation of auditory-inspired spectro-temporal features aimed at audio coding. To do so, we first generate sparse audio representations we call spikegrams, using projections on gammatone/gammachirp kernels that generate neural spikes. Unlike Fourier-based representations, these representations are powerful at identifying auditory events, such as onsets, offsets, transients, and harmonic structures. We show that the introduction of adaptiveness in the selection of gammachirp kernels enhances the compression rate compared to the case where the kernels are non-adaptive. We also integrate a masking model that helps reduce bitrate without loss of perceptible audio quality. We finally propose a method to extract frequent audio objects (patterns) in the aforementioned sparse representations. The extracted frequency-domain patterns (audio objects) help us address spikes (audio events) collectively rather than individually. When audio compression is needed, the different patterns are stored in a small codebook that can be used to efficiently encode audio materials in a lossless way. The approach is applied to different audio signals and results are discussed and compared. This work is a first step towards the design of a high-quality auditory-inspired "object-based" audio coder. Crown Copyright (C) 2010 Published by Elsevier B.V. All rights reserved.
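A minimal Python sketch (ours, not the published coder) of the spikegram idea: matching pursuit over a small gammatone dictionary, producing a list of (kernel index, time offset, amplitude) atoms for a 1-D signal. The gammatone parameters, dictionary size, and stopping rule are illustrative assumptions; the paper additionally uses adaptive gammachirp kernels, masking, and pattern discovery.

```python
# Illustrative sketch only: matching pursuit with gammatone kernels ("spikegram").
import numpy as np

def gammatone(fc, sr=16000, dur=0.03, order=4, b=150.0):
    t = np.arange(int(dur * sr)) / sr
    g = t ** (order - 1) * np.exp(-2 * np.pi * b * t) * np.cos(2 * np.pi * fc * t)
    return g / np.linalg.norm(g)

def matching_pursuit(x, centre_freqs, sr=16000, n_atoms=50):
    kernels = [gammatone(fc, sr) for fc in centre_freqs]
    residual = x.astype(float).copy()
    spikes = []
    for _ in range(n_atoms):
        best = None
        for k, g in enumerate(kernels):
            corr = np.correlate(residual, g, mode="valid")  # inner products at all shifts
            i = int(np.argmax(np.abs(corr)))
            if best is None or abs(corr[i]) > abs(best[2]):
                best = (k, i, corr[i])
        k, i, amp = best
        residual[i:i + len(kernels[k])] -= amp * kernels[k]  # subtract the chosen atom
        spikes.append((k, i, amp))
    return spikes, residual

# toy usage: decompose a noisy 1 kHz gammatone burst
sr = 16000
burst = 3.0 * gammatone(1000.0, sr)
x = np.zeros(1200)
x[300:300 + len(burst)] += burst
x += 0.01 * np.random.randn(len(x))
spikes, res = matching_pursuit(x, centre_freqs=[500.0, 1000.0, 2000.0], sr=sr)
print(spikes[0])
```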
C1 [Pichevar, Ramin; Najaf-Zadeh, Hossein; Thibault, Louis; Landili, Hassan] Commun Res Ctr, Ottawa, ON K2H 8S2, Canada.
RP Pichevar, R (reprint author), Commun Res Ctr, 3701 Carling Ave, Ottawa, ON K2H 8S2, Canada.
EM Ramin.Pichevar@usherbrooke.ca
FU University of Sherbrooke
FX The authors would like to thank Richard Boudreau, Hunter Hong, and
Frederic Mustiere for proofreading the paper. They also express their
gratitude to Debprakash Patnaik and Koniparambil Unnikrishnan for
providing them with the GMiner toolbox and for fruitful discussions on
frequent episode discovery. The first author would also like to thank
Jean Rouat for fruitful discussions on machine learning, as well as the
University of Sherbrooke for a travel grant that made the discussions on
machine learning and frequent episode discovery possible. Many thanks
also to the three anonymous reviewers for their constructive comments.
CR Abdallah SA, 2006, IEEE T NEURAL NETWOR, V17, P179, DOI 10.1109/TNN.2005.861031
Abeles M., 1991, CORTICONICS NEURAL C
Bech S., 2006, PERCEPTUAL AUDIO EVA
Christensen MG, 2006, IEEE T AUDIO SPEECH, V14, P1340, DOI 10.1109/TSA.2005.858038
Feldbauer C, 2005, EURASIP J APPL SIG P, V2005, P1334, DOI 10.1155/ASP.2005.1334
Goodwin MM, 1999, IEEE T SIGNAL PROCES, V47, P1890, DOI 10.1109/78.771038
Graham D., 2006, EVOLUTION NERVOUS SY
Gribonval R, 2001, IEEE T SIGNAL PROCES, V49, P994, DOI 10.1109/78.917803
HEUSDENS R, 2001, IEEE INT C AUD SPEEC
IRINO T, 2006, IEEE T AUDIO SPEECH, V14, P2008
Irino T, 2001, J ACOUST SOC AM, V109, P2008, DOI 10.1121/1.1367253
Izhikevich EM, 2006, NEURAL COMPUT, V18, P245, DOI 10.1162/089976606775093882
Laxman S, 2007, IEEE T KNOWL DATA EN, V19, P1188, DOI [10.1109/TKDE.2007.1055, 10.1109/TKDE.2007.1055.]
MALLAT SG, 1993, IEEE T SIGNAL PROCES, V41, P3397, DOI 10.1109/78.258082
Mannila H, 1997, DATA MIN KNOWL DISC, V1, P259, DOI 10.1023/A:1009748302351
NAJAFZADEH H, 2008, AUD ENG SOC CONV NET
Patnaik D, 2008, SCI PROGRAMMING-NETH, V16, P49, DOI 10.3233/SPR-2008-0242
Patterson RD, 1986, FREQUENCY SELECTIVIT, P123
PICHEVAR R, 2007, AUD ENG SOC CONV AUS
PICHEVAR R, 2008, EUR SIGN PROC C LAUS
PICHEVAR R, 2010, INT JOINT C NEUR NET
PICHEVAR R, 2008, AUD ENG SOC CONV NET
Ravelli E, 2008, IEEE T AUDIO SPEECH, V16, P1361, DOI 10.1109/TASL.2008.2004290
Rozell CJ, 2008, NEURAL COMPUT, V20, P2526, DOI 10.1162/neco.2008.03-07-486
SMITH E, 2006, NATURE, V7079, P978
Smith E, 2005, NEURAL COMPUT, V17, P19, DOI 10.1162/0899766052530839
STRAHL S, 2008, BRAIN RES, V1220, P3
THIEDE T, 2000, AUD ENG C, P3
VERMA TS, 1999, ACOUST SPEECH SIG PR, P981
Zwicker E., 1990, PSYCHOACOUSTICS FACT
ZWICKER E, 1984, J ACOUST SOC AM, V75, P219, DOI 10.1121/1.390398
NR 31
TC 6
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 643
EP 657
DI 10.1016/j.specom.2010.09.008
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900005
ER
PT J
AU Le Roux, J
Kameoka, H
Ono, N
de Cheveigne, A
Sagayama, S
AF Le Roux, Jonathan
Kameoka, Hirokazu
Ono, Nobutaka
de Cheveigne, Alain
Sagayama, Shigeki
TI Computational auditory induction as a missing-data model-fitting problem
with Bregman divergence
SO SPEECH COMMUNICATION
LA English
DT Article
DE Auditory induction; Acoustical scene analysis; Missing data; Auxiliary
function; Bregman divergence; EM algorithm; Non-negative matrix
factorization; Harmonic-temporal clustering
ID NONNEGATIVE MATRIX FACTORIZATION; SPEECH RECOGNITION; AUTOREGRESSIVE
PROCESSES; ACOUSTIC-SIGNALS; LONG GAPS; INTERPOLATION; RECONSTRUCTION;
REPRESENTATION; MUSIC
AB The human auditory system has the ability, known as auditory induction, to estimate the missing parts of a continuous auditory stream briefly covered by noise and perceptually resynthesize them. In this article, we formulate this ability as a model-based spectrogram analysis and clustering problem with missing data, show how to solve it using an auxiliary function method, and explain how this method is generally related to the expectation-maximization (EM) algorithm for a certain type of divergence measures called Bregman divergences, thus enabling the use of prior distributions on the parameters. We illustrate how our method can be used to simultaneously analyze a scene and estimate missing information with two algorithms: the first, based on non-negative matrix factorization (NMF), performs analysis of polyphonic multi-instrumental musical pieces. Our method allows this algorithm to cope with gaps within the audio data, estimating the timbre of the instruments and their pitch, and reconstructing the missing parts. The second, based on a recently introduced technique for the analysis of complex acoustical scenes called harmonic-temporal clustering (HTC), enables us to perform robust fundamental frequency estimation from incomplete speech data. (C) 2010 Elsevier B.V. All rights reserved.
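A minimal Python sketch (ours, not the authors' algorithm as published) of the missing-data idea in its simplest NMF form: multiplicative KL updates weighted by a binary observation mask, with the KL divergence standing in as one member of the Bregman family the paper treats in general. V is a magnitude spectrogram, M marks observed (1) versus missing (0) cells, and missing cells are reconstructed as W @ H; the rank and iteration count are arbitrary.

```python
# Illustrative sketch only: NMF with a binary missing-data mask (weighted KL updates).
import numpy as np

def masked_kl_nmf(V, M, rank=8, n_iter=200, eps=1e-9):
    rng = np.random.default_rng(0)
    n_f, n_t = V.shape
    W = rng.random((n_f, rank)) + eps
    H = rng.random((rank, n_t)) + eps
    for _ in range(n_iter):
        R = W @ H + eps
        W *= ((M * V / R) @ H.T) / (M @ H.T + eps)   # update bases on observed cells only
        R = W @ H + eps
        H *= (W.T @ (M * V / R)) / (W.T @ M + eps)   # update activations likewise
    return W, H

# toy usage: reconstruct a gap of ten frames in a synthetic low-rank "spectrogram"
rng = np.random.default_rng(1)
V = rng.random((40, 3)) @ rng.random((3, 100))
M = np.ones_like(V)
M[:, 40:50] = 0.0
W, H = masked_kl_nmf(V, M)
print(np.abs(V[:, 40:50] - (W @ H)[:, 40:50]).mean())
```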
C1 [Le Roux, Jonathan; Kameoka, Hirokazu] NTT Corp, NTT Commun Sci Labs, Kanagawa 2430198, Japan.
[Le Roux, Jonathan; Ono, Nobutaka; Sagayama, Shigeki] Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan.
[de Cheveigne, Alain] Univ Paris 05, Ctr Natl Rech Sci, F-75230 Paris 05, France.
[de Cheveigne, Alain] Ecole Normale Super, F-75230 Paris 05, France.
RP Le Roux, J (reprint author), NTT Corp, NTT Commun Sci Labs, 3-1 Morinosato Wakamiya, Kanagawa 2430198, Japan.
EM leroux@cs.brl.ntt.co.jp; kameoka@cs.brl.ntt.co.jp;
onono@hil.t.u-tokyo.ac.jp; Alain.de.Cheveigne@ens.fr;
sagayama@hil.t.u-tokyo.ac.jp
RI de Cheveigne, Alain/F-4947-2012
CR ACHAN K, 2005, P 2005 IEEE INT C AC, V5, P221, DOI 10.1109/ICASSP.2005.1416280
Bagshaw P., 1993, P EUR C SPEECH COMM, P1003
Banerjee A, 2005, J MACH LEARN RES, V6, P1705
Barker J., 2006, COMPUTATIONAL AUDITO, P297
Barker JP, 2005, SPEECH COMMUN, V45, P5, DOI 10.1016/j.specom.2004.05.002
Barlow H, 2001, BEHAV BRAIN SCI, V24, P602
Bertalmio M, 2000, COMP GRAPH, P417
Bregman AS., 1990, AUDITORY SCENE ANAL
BREGMAN LM, 1967, COMP MATH MATH PHYS, V7, P620
Cemgil A. T., 2008, CUEDFINFENGTR609
CEMGIL AT, 2005, P EUR SIGN PROC C EU
CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229
Clark P, 2008, INT CONF ACOUST SPEE, P3741, DOI 10.1109/ICASSP.2008.4518466
Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0
Criminisi A, 2004, IEEE T IMAGE PROCESS, V13, P1200, DOI 10.1109/TIP.2004.833105
CSISZAR I, 1975, ANN PROBAB, V3, P146, DOI 10.1214/aop/1176996454
de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024
Eggert J., 2004, P IEEE INT JOINT C N, V4, P2529
Ellis D. P. W., 1996, THESIS MIT
ELLIS DPW, 1993, P IEEE WORKSH APPL S
Esquef PAA, 2006, IEEE T AUDIO SPEECH, V14, P1391, DOI 10.1109/TSA.2005.858018
Fevotte C, 2009, NEURAL COMPUT, V21, P793, DOI 10.1162/neco.2008.04-08-771
Fujisaki H., 1969, Annual Report of the Engineering Research Institute, Faculty of Engineering, University of Tokyo, V28
Godsill S.J., 1998, DIGITAL AUDIO RESTOR
GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317
Grunwald P. D., 2007, MINIMUM DESCRIPTION
Helmholtz H., 1954, SENSATIONS TONE PHYS
Helmholtz Hermann von, 1885, SENSATIONS TONE PHYS
JANSSEN AJEM, 1986, IEEE T ACOUST SPEECH, V34, P317, DOI 10.1109/TASSP.1986.1164824
Kameoka H, 2007, IEEE T AUDIO SPEECH, V15, P982, DOI 10.1109/TASL.2006.885248
Kameoka H., 2007, THESIS U TOKYO
Kameoka H, 2009, INT CONF ACOUST SPEE, P3437, DOI 10.1109/ICASSP.2009.4960364
Kashino M., 2006, Acoustical Science and Technology, V27, DOI 10.1250/ast.27.318
Lee DD, 2001, ADV NEUR IN, V13, P556
Lee DD, 1999, NATURE, V401, P788
LEROUX J, 2007, P AC SOC JPN AUT M, P351
Le Roux J, 2007, IEEE T AUDIO SPEECH, V15, P1135, DOI 10.1109/TASL.2007.894510
LEROUX J, 2008, 200811 METR U TOK
LU L, 2003, P ICASSP, V5, P636
MAHER RC, 1994, J AUDIO ENG SOC, V42, P350
MAHER RC, 1993, 95 AES CONV NEW YORK, P1
MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910
MENG XL, 1993, BIOMETRIKA, V80, P267, DOI 10.2307/2337198
Morup M., 2006, SPARSE NONNEGATIVE M
Ono N., 2008, P SAPA SEPT, P23
Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828
Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007
Rajan JJ, 1997, IEE P-VIS IMAGE SIGN, V144, P249, DOI 10.1049/ip-vis:19971305
RAYNER PJW, 1991, P IEEE WORKSH APPL S
REYESGOMEZ M, 2004, P ISCA WORKSH STAT P, P25
Sajda P, 2003, P SOC PHOTO-OPT INS, V5207, P321, DOI 10.1117/12.504676
Schmidt M. N., 2006, P 6 INT C IND COMP A, P700
SMARAGDIS P, 2009, P IEEE WORKSH MACH L
Smaragdis P., 2004, P INT C IND COMP AN, P494
VASEGHI SV, 1990, IEE PROC-I, V137, P38
Veldhuis R., 1990, RESTORATION LOST SAM
Virtanen T., 2008, P ISCA TUT RES WORKS, P17
von Helmholtz H.L.F., 1863, SENSATIONS TONE PHYS
Wang D., 2006, COMPUTATIONAL AUDITO
Warren R. M., 1982, AUDITORY PERCEPTION
WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392
WOLFE PJ, 2005, P INT C AC SPEECH SI, V5, P517, DOI 10.1109/ICASSP.2005.1416354
NR 62
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 658
EP 676
DI 10.1016/j.specom.2010.08.009
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900006
ER
PT J
AU Li, JF
Sakamoto, S
Hongo, S
Akagi, M
Suzuki, Y
AF Li, Junfeng
Sakamoto, Shuichi
Hongo, Satoshi
Akagi, Masato
Suzuki, Yoiti
TI Two-stage binaural speech enhancement with Wiener filter for
high-quality speech communication
SO SPEECH COMMUNICATION
LA English
DT Article
DE Binaural masking level difference; Equalization-cancellation model;
Two-stage binaural speech enhancement (TS-BASE); Binaural cue
preservation; Sound localization
ID ARRAY HEARING-AIDS; NOISE-REDUCTION; ENVIRONMENTS; OUTPUT
AB Speech enhancement has been researched extensively for many years to provide high-quality speech communication in the presence of background noise and concurrent interference signals. Human listening is robust against these acoustic interferences using only two ears, but state-of-the-art two-channel algorithms function poorly. Motivated by psychoacoustic studies of binaural hearing (equalization-cancellation (EC) theory), in this paper, we propose a two-stage binaural speech enhancement with Wiener filter (TS-BASE/WF) approach that is a two-input two-output system. In this proposed TS-BASE/WF, interference signals are first estimated by equalizing and cancelling the target signal in a way inspired by the EC theory, a time-variant Wiener filter is then applied to enhance the target signal given the noisy mixture signals. The main advantages of the proposed TS-BASE/WF are (1) effectiveness in dealing with non-stationary multiple-source interference signals, and (2) success in preserving binaural cues after processing. These advantages were confirmed according to the comprehensive objective and subjective evaluations in different acoustical spatial configurations in terms of speech enhancement and binaural cue preservation. (C) 2010 Elsevier B.V. All rights reserved.
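A heavily simplified Python sketch (our reading of the idea, not the published TS-BASE/WF): stage 1 cancels a frontal target by subtracting the two ear channels to obtain an interference reference, and stage 2 applies one common time-variant Wiener gain to both channels so that the target's interaural cues are preserved. The frontal-target assumption, gain floor, and variable names are ours.

```python
# Illustrative sketch only: EC-style interference estimate followed by a shared Wiener gain.
import numpy as np

def ts_base_wf_sketch(L, R, floor=0.05):
    # L, R: complex STFTs of the left/right ear signals; the target is assumed frontal,
    # so it is approximately equal in both channels and cancels in the difference.
    noise_ref = 0.5 * (L - R)                    # stage 1: interference reference
    mix = 0.5 * (L + R)
    noise_psd = np.abs(noise_ref) ** 2
    mix_psd = np.abs(mix) ** 2
    snr = np.maximum(mix_psd - noise_psd, 0.0) / (noise_psd + 1e-12)
    gain = np.maximum(snr / (1.0 + snr), floor)  # stage 2: time-variant Wiener gain
    return gain * L, gain * R                    # identical gain on both ears keeps ILD/IPD
```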
C1 [Li, Junfeng; Akagi, Masato] Japan Adv Inst Sci & Technol, Sch Informat Sci, Tokyo, Japan.
[Sakamoto, Shuichi; Suzuki, Yoiti] Tohoku Univ, Elect Commun Res Inst, Sendai, Miyagi 980, Japan.
[Hongo, Satoshi] Miyagi Natl Coll Technol, Dept Design & Comp Applicat, Sendai, Miyagi, Japan.
RP Li, JF (reprint author), Japan Adv Inst Sci & Technol, Sch Informat Sci, Tokyo, Japan.
EM junfeng@jaist.ac.jp
CR AICHNER R, 2007, P ICASSP2007
Blauert J., 1997, SPATIAL HEARING PSYC
BOGAERT TV, 2007, ICASSP2007, P565
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Brandstein M., 2001, MICROPHONE ARRAYS SI, V1st
Campbell DR, 2003, SPEECH COMMUN, V39, P97, DOI 10.1016/S0167-6393(02)00061-4
Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283
Desloge JG, 1997, IEEE T SPEECH AUDI P, V5, P529, DOI 10.1109/89.641298
Doclo S, 2007, SPEECH COMMUN, V49, P636, DOI 10.1016/j.specom.2007.02.001
DORBECKER M, 1996, EUSIPCO1996, P995
Durlach N. I., 1972, F MODERN AUDITORY TH, V2, P369
DURLACH NI, 1963, J ACOUST SOC AM, V35, P1206, DOI 10.1121/1.1918675
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Gannot S, 2001, IEEE T SIGNAL PROCES, V49, P1614, DOI 10.1109/78.934132
Griffiths J., 1982, IEEE T ANTENN PROPAG, V30, P27
Klasen TJ, 2007, IEEE T SIGNAL PROCES, V55, P1579, DOI 10.1109/TSP.2006.888897
KOCK WE, 1950, J ACOUST SOC AM, V22, P801, DOI 10.1121/1.1906692
Kollmeier B, 1993, Scand Audiol Suppl, V38, P28
Li JF, 2008, 2008 INTERNATIONAL CONFERENCE ON AUDIO, LANGUAGE AND IMAGE PROCESSING, VOLS 1 AND 2, PROCEEDINGS, P97
Li JF, 2008, IEICE T FUND ELECTR, VE91A, P1337, DOI 10.1093/ietfec/e91-a.6.1337
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lotter T., 2005, P EUROSPEECH2005, P2285
Nakashima H., 2003, Acoustical Science and Technology, V24, DOI 10.1250/ast.24.172
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
Roman N, 2006, J ACOUST SOC AM, V120, P4040, DOI 10.1121/1.2355480
Scalart P., 1996, IEEE INT C AC SPEECH, V2, P629
Shields PW, 2001, J ACOUST SOC AM, V110, P3232, DOI 10.1121/1.1413750
Suzuki Y, 1999, IEICE T FUND ELECTR, VE82A, P588
Vincent E, 2006, IEEE T AUDIO SPEECH, V14, P1462, DOI 10.1109/TSA.2005.858005
Waibel Alex, 2008, 2008 Second International Symposium on Universal Communication, DOI 10.1109/ISUC.2008.78
Wang D., 2006, COMPUTATIONAL AUDITO
Welker DP, 1997, IEEE T SPEECH AUDI P, V5, P543, DOI 10.1109/89.641299
Wiener N., 1949, EXTRAPOLATION INTERP
NR 33
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 677
EP 689
DI 10.1016/j.specom.2010.04.009
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900007
ER
PT J
AU Bach, JH
Anemuller, J
Kollmeier, B
AF Bach, Joerg-Hendrik
Anemueller, Joern
Kollmeier, Birger
TI Robust speech detection in real acoustic backgrounds with perceptually
motivated features
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech detection; Pattern classification; Amplitude modulations;
Fluctuating noise; Real-world scenario
ID AUDITORY-CORTEX; RECOGNITION; CLASSIFICATION; PERIODICITY; FREQUENCY;
PLP
AB The current study presents an analysis of the robustness of a speech detector in real background sounds. One of the most important aspects of automatic speech/nonspeech classification is robustness in the presence of strongly varying external conditions. These include variations of the signal-to-noise ratio as well as fluctuations of the background noise. These variations are systematically evaluated by choosing different mismatched conditions between training and testing of the speech/nonspeech classifiers. The detection performance of the classifier with respect to these mismatched conditions is used as a measure of robustness and generalisation. The generalisation towards un-trained SNR conditions and unknown background noises is evaluated and compared to a matched baseline condition.
The classifier consists of a feature front-end, which computes amplitude modulation spectral features (AMS), and a support vector machine (SVM) back-end. The AMS features are based on Fourier decomposition over time of short-term spectrograms. Mel-frequency cepstral coefficients (MFCC) as well as relative spectral features (RASTA) based on perceptual linear prediction (PLP) serve as baseline.
The results show that RASTA-filtered PLP features perform best in the matched task. In the generalisation tasks however, the AMS features emerge as more robust in most cases, while MFCC features are outperformed by both other feature types.
In a second set of experiments, a hierarchical approach is analysed which employs a background classification step prior to the speech/nonspeech classifier in order to improve the robustness of the detection scores in novel backgrounds. The background sounds used are recorded in typical everyday scenarios. The hierarchy provides a benefit in overall performance if the robust AMS features are employed.
The generalisation capabilities of the hierarchy towards novel backgrounds and SNRs are found to be optimal when a limited number of training backgrounds is used (compared to the inclusion of all available background data). The best backgrounds in terms of generalisation capabilities are found to be backgrounds in which some component of speech (such as unintelligible background babble) is present, which corroborates the hypothesis that the AMS features provide a decomposition of signals which is by itself very suitable for training very general speech/nonspeech detectors. This is also supported by the finding that the SVMs combined with RASTA-PLPs require nonlinear kernels to reach performance similar to that of the AMS patterns with linear kernels. (C) 2010 Elsevier B.V. All rights reserved.
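A minimal Python sketch (ours) of amplitude modulation spectrogram (AMS) style features: a Fourier decomposition over time of the log-magnitude trajectory in each spectral band, computed per longer analysis block. Frame rate, block length, and the number of modulation bins are illustrative placeholders, not the paper's settings; each resulting row could then be fed to an SVM speech/nonspeech classifier.

```python
# Illustrative sketch only: AMS-style features from a short-term spectrogram.
import numpy as np
from scipy.signal import stft

def ams_features(x, sr=16000, n_fft=512, hop=160, block=32, mod_bins=16):
    # short-term spectrogram with a 10 ms hop
    _, _, X = stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
    logmag = np.log(np.abs(X) + 1e-9)                  # (n_bands, n_frames)
    feats = []
    for start in range(0, logmag.shape[1] - block + 1, block // 2):
        seg = logmag[:, start:start + block]
        seg = seg - seg.mean(axis=1, keepdims=True)    # remove DC per band
        mod = np.abs(np.fft.rfft(seg * np.hanning(block), axis=1))[:, :mod_bins]
        feats.append(mod.ravel())                      # one AMS pattern per block
    return np.array(feats)
```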
C1 [Bach, Joerg-Hendrik; Anemueller, Joern; Kollmeier, Birger] Carl von Ossietzky Univ Oldenburg, Dept Med Phys, D-26111 Oldenburg, Germany.
RP Bach, JH (reprint author), Carl von Ossietzky Str 8-11, D-26111 Oldenburg, Germany.
EM j.bach@uni-oldenburg.de
FU European Union; European Graduate School for Neurosensory Science and
Systems
FX This work was supported by the European Union within the 6th Framework
Programme through the IP IST DIRAC, and by the European Graduate School
for Neurosensory Science and Systems. We are indebted to Nina Pohl and
Georg Klump of the Group for Animal Physiology and Behaviour at the
University of Oldenburg for providing high quality, undisturbed
recordings of natural scenes, and to Hendrik Kayser, who contributed
significantly to this work, most notably the office and city recordings.
We thank Bernd T. Meyer for giving helpful advice in numerous
discussions, and for proofreading an earlier version of this manuscript.
CR ANEMULLER J, 2008, P INT BRISB AUSTR
ANEMULLER J, 2008, 1 INT C COGN SYST CO
Bee MA, 2004, J NEUROPHYSIOL, V92, P1088, DOI 10.1152/jn.00884.2003
Bregman AS., 1990, AUDITORY SCENE ANAL
Buchler M, 2005, EURASIP J APPL SIG P, V2005, P2991, DOI 10.1155/ASP.2005.2991
Chang C.-C., 2001, LIBSVM LIB SUPPORT V
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
DRESCHLER WA, 1999, J ACOUST SOC AM, V105, P1296, DOI 10.1121/1.426174
Garofolo JS, 1993, TIMIT ACOUSTIC PHONE
GRAMSS T, 1990, SPEECH COMMUN, V9, P35, DOI 10.1016/0167-6393(90)90043-9
GREENBERG S, 1997, P ICASSP MUN GERM
Happel MFK, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P670
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
*ITU, 1996, REC G X729 ANN B
Kingsbury BED, 1997, INT CONF ACOUST SPEE, P1259, DOI 10.1109/ICASSP.1997.596174
Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6
KLEINSCHMIDT M, 2002, ROBUST SPEECH RECOGN
KOLLMEIER B, 1994, J ACOUST SOC AM, V95, P1593, DOI 10.1121/1.408546
Langner G, 1997, J COMP PHYSIOL A, V181, P665, DOI 10.1007/s003590050148
Lin HT, 2007, MACH LEARN, V68, P267, DOI 10.1007/s10994-007-5018-6
LUO J, 2008, P ICVS SANT GREEC
MAGANTI HK, 2007, P ICASSP HON
MARKAKI M, 2008, WORKSH STAT PERC AUD, P7
Martin A. F., 1997, P EUROSPEECH, P1895
Marzinzik M, 2002, IEEE T SPEECH AUDI P, V10, P109, DOI 10.1109/89.985548
Mesgarani N, 2008, J ACOUST SOC AM, V123, P899, DOI 10.1121/1.2816572
Mesgarani N, 2006, IEEE T AUDIO SPEECH, V14, P920, DOI 10.1109/TSA.2005.858055
Meyer BT, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P906
OSTENDORF M, 1998, P FORTSCHR AK DAGA 9, P402
Platt J., 2000, PROBABILISTIC OUTPUT
RABINER LR, 1975, AT&T TECH J, V54, P297
SCHREINER CE, 1988, J NEUROPHYSIOL, V60, P1823
SHIRE ML, 2000, ICASSP IST
Tchorz J, 2003, IEEE T SPEECH AUDI P, V11, P184, DOI 10.1109/TSA.2003.811542
Vapnik V., 1995, NATURE STAT LEARNING
Young S., 2002, HTK BOOK HTK VERSION
ZWICKER E, 1961, J ACOUST SOC AM, V33, P248, DOI 10.1121/1.1908630
NR 38
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 690
EP 706
DI 10.1016/j.specom.2010.07.003
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900008
ER
PT J
AU Yin, H
Hohmann, V
Nadeu, C
AF Yin, Hui
Hohmann, Volker
Nadeu, Climent
TI Acoustic features for speech recognition based on Gammatone filterbank
and instantaneous frequency
SO SPEECH COMMUNICATION
LA English
DT Article
DE Gammatone filterbank; Instantaneous frequency; Speech recognition
AB Most of the features used by modern automatic speech recognition systems, such as mel-frequency cepstral coefficients (MFCC) and perceptual linear predictive (PLP) coefficients, represent the spectral envelope of the speech signal only. Nevertheless, phase or frequency modulation as represented in recent perceptual models of the peripheral auditory system might also contribute to speech decoding. Furthermore, such features can be complementary to the envelope features. This paper proposes a variety of features based on a linear auditory filterbank, the Gammatone filterbank. Envelope features are derived from the envelope of the subband filter outputs. Phase/frequency modulation is represented by the subband instantaneous frequency (IF) and is used explicitly by concatenating envelope-based and IF-based features or is used implicitly by IF-based frequency reassignment. Speech recognition experiments using a standard HMM-based recognizer under both clean training and multi-condition training are conducted on a Chinese Mandarin digits corpus. The experimental results show that the proposed envelope- and phase-based features can improve recognition rates in clean and noisy conditions compared to the reference MFCC-based recognizer. (C) 2010 Elsevier B.V. All rights reserved.
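A minimal Python sketch (ours) of the two per-band quantities the features build on, envelope and instantaneous frequency (IF) from the analytic signal. The paper uses a Gammatone filterbank; here a plain Butterworth band-pass stands in, and all parameter values are illustrative.

```python
# Illustrative sketch only: per-band envelope (AM) and instantaneous frequency (FM).
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def band_envelope_and_if(x, sr, f_lo, f_hi):
    sos = butter(4, [f_lo, f_hi], btype="bandpass", fs=sr, output="sos")
    band = sosfiltfilt(sos, x)
    analytic = hilbert(band)
    envelope = np.abs(analytic)                          # AM part
    phase = np.unwrap(np.angle(analytic))
    inst_freq = np.diff(phase) * sr / (2 * np.pi)        # FM part, in Hz
    return envelope, inst_freq

# toy usage: the IF of a pure 1 kHz tone should hover around 1000 Hz
sr = 16000
t = np.arange(sr) / sr
env, inst = band_envelope_and_if(np.sin(2 * np.pi * 1000 * t), sr, 800, 1200)
print(inst[100:110].round(1))
```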
C1 [Yin, Hui] Beijing Inst Technol, Dept Elect Engn, Beijing 100081, Peoples R China.
[Yin, Hui; Hohmann, Volker; Nadeu, Climent] Univ Politecn Cataluna, TALP Res Ctr, Barcelona, Spain.
[Hohmann, Volker] Carl von Ossietzky Univ Oldenburg, D-2900 Oldenburg, Germany.
RP Yin, H (reprint author), Beijing Inst Technol, Dept Elect Engn, Beijing 100081, Peoples R China.
EM hchhuihui@gmail.com; volker.hohmann@uni-oldenburg.de;
climent.nadeu@upc.edu
RI Nadeu, Climent/B-9638-2014
OI Nadeu, Climent/0000-0002-5863-0983
FU Spanish Ministry of Education and Science [TEC2007-65470]; National
Nature Science Foundation of China [NSFC 60605015]
FX This research was partially supported by the Spanish project SAPIRE
(TEC2007-65470) as well as a research grant to Volker Hohmann from the
Spanish Ministry of Education and Science, and partially supported by
the National Nature Science Foundation of China under Grant NSFC
60605015.
CR Alsteris LD, 2007, DIGIT SIGNAL PROCESS, V17, P578, DOI 10.1016/j.dsp.2006.06.007
BOER E, 1978, J ACOUST SOC AM, V63, P115
Dimitriadis D, 2005, IEEE SIGNAL PROC LET, V12, P621, DOI 10.1109/LSP.2005.853050
*ETSI, 2003, 2002212 ETSI ES
Gardner TJ, 2005, J ACOUST SOC AM, V117, P2896, DOI 10.1121/1.1863072
GU L, 2001, IEEE ICASSP 2001
Haque S, 2009, SPEECH COMMUN, V51, P58, DOI 10.1016/j.specom.2008.06.002
Herzke T, 2005, EURASIP J APPL SIG P, V2005, P3034, DOI 10.1155/ASP.2005.3034
HOHMANN V, 2006, INT S HEAR ISH 2006, P11
Hohmann V, 2002, ACTA ACUST UNITED AC, V88, P433
HOLMBERG M, 2007, SPEECH COMMUN, P917
IKBAL S, 2003, ACOUST SPEECH SIG PR, P133
Johannesma PIM, 1972, P IPO S HEARING THEO, P58
Kleinschmidt M, 2001, SPEECH COMMUN, V34, P75, DOI 10.1016/S0167-6393(00)00047-9
KUBO Y, 2008, IEICE T INF SYST, V91, P8
KUMARESAN R, 2003, AS C SIGN SYST COMP, V2, P2078
Potamianos A, 1996, J ACOUST SOC AM, V99, P3795, DOI 10.1121/1.414997
MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861
Munkong R., 2008, IEEE SIGNAL PROCESSI, P98
Pagano M., 2000, PRINCIPLES BIOSTATIS
Patterson RD, 1987, M IOC SPEECH GROUP A
Plante F, 1998, IEEE T SPEECH AUDI P, V6, P282, DOI 10.1109/89.668821
Potamianos A, 2001, IEEE T SPEECH AUDI P, V9, P196, DOI 10.1109/89.905994
Schluter R, 2007, INT CONF ACOUST SPEE, P649
STARK AP, 2008, INTERSPEECH
WANG Y, 2003, EPIDEMIC SPREADING R, P25
Wang YR, 2006, LECT NOTES COMPUT SC, V4274, P370
Young S., 1995, HIDDEN MARKOV MODEL
NR 28
TC 2
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 707
EP 715
DI 10.1016/j.specom.2010.04.008
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900009
ER
PT J
AU Kubo, Y
Okawa, S
Kurematsu, A
Shirai, K
AF Kubo, Yotaro
Okawa, Shigeki
Kurematsu, Akira
Shirai, Katsuhiko
TI Temporal AM-FM combination for robust speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Frequency modulation; Multistream speech
recognition; HMM/MLP-tandem approach
ID NOISY; FREQUENCY; HUMANS
AB A novel method for feature extraction from the frequency modulation (FM) in speech signals is proposed for robust speech recognition. To fully exploit multistream speech recognizers, each stream should compensate for the shortcomings of the other streams. In this light, FM features are promising as complementary features to amplitude modulation (AM). In order to extract effective features from FM patterns, the proposed method applies data-driven modulation analysis to the instantaneous frequency. By evaluating the frequency responses of the temporal filters obtained by the proposed method, we confirmed that the modulation observed around 4 Hz is important for the discrimination of FM patterns, as in the case of AM features. We evaluated the robustness of our method by performing noisy speech recognition experiments. We confirmed that our FM features can improve the noise robustness of speech recognizers even when the FM features are not combined with conventional AM and/or spectral envelope features. We also performed multistream speech recognition experiments. The experimental results show that combination of the conventional AM system and the proposed FM system reduced the word error rate by 43.6% at 10 dB SNR as compared to the baseline MFCC system and by 20.2% as compared to the conventional AM system. We investigated the complementarity of the AM and FM features by performing speech recognition experiments in artificial noisy environments. We found the FM features to be robust to wide-band noise, which certainly degrades the performance of AM features. Further, we evaluated the efficiency of multiconditional training. Although the performance of the proposed combination method was degraded by multiconditional training, we confirmed that the performance of the proposed FM method improved. Through a series of experiments, we confirmed that our FM features can be used as independent features as well as complementary features. (C) 2010 Elsevier B.V. All rights reserved.
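A minimal Python sketch (ours; the paper itself uses an HMM/MLP-tandem system) of a generic log-linear combination of per-frame phoneme posteriors from an AM stream and an FM stream, the kind of stream fusion a multistream setup relies on. The weighting scheme and names are illustrative assumptions.

```python
# Illustrative sketch only: log-linear fusion of two posterior streams.
import numpy as np

def combine_streams(post_am, post_fm, w_am=0.5):
    # post_*: (n_frames, n_classes) posterior matrices from the two stream classifiers
    log_comb = w_am * np.log(post_am + 1e-12) + (1.0 - w_am) * np.log(post_fm + 1e-12)
    comb = np.exp(log_comb - log_comb.max(axis=1, keepdims=True))
    return comb / comb.sum(axis=1, keepdims=True)

# toy usage with two dummy 3-class posterior frames
p_am = np.array([[0.7, 0.2, 0.1], [0.3, 0.4, 0.3]])
p_fm = np.array([[0.5, 0.3, 0.2], [0.1, 0.8, 0.1]])
print(combine_streams(p_am, p_fm).round(2))
```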
C1 [Kubo, Yotaro; Kurematsu, Akira; Shirai, Katsuhiko] Waseda Univ, Dept Comp Sci, Shinjuku Ku, Tokyo 1698555, Japan.
[Okawa, Shigeki] Chiba Inst Technol, Chiba 2750016, Japan.
RP Kubo, Y (reprint author), Waseda Univ, Dept Comp Sci, Shinjuku Ku, 3-4-1 Ohkubo, Tokyo 1698555, Japan.
EM yotaro@ieee.org
FU Ministry of Education, Culture, Sports, Science and Technology, Japan
[21.04190]
FX The authors would like to thank the anonymous reviewers and the editor
for their valuable comments and suggestions for improving the quality of
this paper. The authors also would like to thank Prof. Mikio Tohyama for
introducing them to perceptual studies on zero-crossing points of
signals. This study was partly supported by a Grant-in-Aid for JSPS
Fellows (21.04190) from the Ministry of Education, Culture, Sports,
Science and Technology, Japan.
CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615
BOASHASH B, 1992, P IEEE, V80, P520, DOI 10.1109/5.135376
CHEN B, 2005, P ICASSP 05, V1, P945, DOI 10.1109/ICASSP.2005.1415271
CHEN B, 2003, P EUROSPEECH, P853
Chen JD, 2004, IEEE SIGNAL PROC LET, V11, P258, DOI 10.1109/LSP.2003.821689
Dimitriadis D, 2005, IEEE SIGNAL PROC LET, V12, P621, DOI 10.1109/LSP.2005.853050
GAJIC B, 2003, P ICASSP 2006 HONG K, P62
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
HERMANSKY H, 1998, P ICSLP 98 SYDN AUST
Hermansky H., 2000, ACOUST SPEECH SIG PR, P1635
Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224
IKBAL S, 2004, P INTERSPPECH ICSLP, P2553
Janin A., 1999, P 6 EUR C SPEECH COM, P591
Kaiser J., 1993, P IEEE ICASSP 93, P149
Kim DS, 1999, IEEE T SPEECH AUDI P, V7, P55
Kubo Y, 2008, INT CONF ACOUST SPEE, P4709
Kubo Y, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P642
Kubo Y, 2008, IEICE T INF SYST, VE91D, P448, DOI 10.1093/ietisy/e91-d.3.448
Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6
Morgan N., 1995, IEEE SIGNAL PROC MAY, P25
Morgan N, 2005, IEEE SIGNAL PROC MAG, V22, P81, DOI 10.1109/MSP.2005.1511826
Nakamura S, 2005, IEICE T INF SYST, VE88D, P535, DOI 10.1093/ietisy/e88-d.3.535
Rumelhart DH, 1986, PARALLEL DISTRIBUTED, V1
Sharma S., 1999, THESIS OREGON GRADUA
Suzuki H., 2006, Acoustical Science and Technology, V27, DOI 10.1250/ast.27.163
VUUREN S, 1997, P EUROSPEECH 1997, P409
Wang Y., 2003, P 22 INT S REL DISTR, P25
YOSHIDA K, 2002, P ICASSP 2002
NR 30
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 716
EP 725
DI 10.1016/j.specom.2010.08.012
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900010
ER
PT J
AU Markaki, M
Stylianou, Y
AF Markaki, Maria
Stylianou, Yannis
TI Discrimination of speech from nonspeech in broadcast news based on
modulation frequency features
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech discrimination; Modulation spectrum; Mutual information; Higher
order singular value decomposition
ID SPEAKER DIARIZATION; SEGMENTATION; SYSTEMS
AB In audio content analysis, the discrimination of speech and non-speech is the first processing step before speaker segmentation and recognition, or speech transcription. Speech/non-speech segmentation algorithms usually consist of a frame-based scoring phase using MFCC features, combined with a smoothing phase. In this paper, a content-based speech discrimination algorithm is designed to exploit long-term information inherent in the modulation spectrum. In order to address the varying degrees of redundancy and discriminative power of the acoustic and modulation frequency subspaces, we first employ a generalization of SVD to tensors (Higher Order SVD) to reduce dimensions. Projection of modulation spectral features on the principal axes with the highest energy in each subspace results in a compact set of features with minimum redundancy. We further estimate the relevance of these projections to speech discrimination based on mutual information to the target class. This system is built upon a segment-based SVM classifier in order to recognize the presence of voice activity in the audio signal. Detection experiments using Greek and US English broadcast news data composed of many speakers in various acoustic conditions suggest that the system provides complementary information to state-of-the-art mel-cepstral features. (C) 2010 Elsevier B.V. All rights reserved.
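A minimal Python sketch (ours) of the dimensionality-reduction step the abstract describes: a Higher Order SVD (HOSVD) of a 3-way feature tensor (acoustic frequency x modulation frequency x segment) obtained from SVDs of the mode unfoldings, keeping the leading singular vectors of each mode. Tensor sizes and retained ranks are arbitrary placeholders.

```python
# Illustrative sketch only: HOSVD via SVD of mode unfoldings, with projection.
import numpy as np

def unfold(T, mode):
    return np.moveaxis(T, mode, 0).reshape(T.shape[mode], -1)

def hosvd_project(T, ranks):
    # ranks: how many principal axes to keep per mode
    factors = []
    for mode, r in enumerate(ranks):
        U, _, _ = np.linalg.svd(unfold(T, mode), full_matrices=False)
        factors.append(U[:, :r])
    core = T
    for mode, U in enumerate(factors):
        core = np.moveaxis(np.tensordot(U.T, np.moveaxis(core, mode, 0), axes=1), 0, mode)
    return core, factors

# toy usage: compress a random 40 x 20 x 100 feature tensor to an 8 x 6 x 100 core
T = np.random.default_rng(0).random((40, 20, 100))
core, factors = hosvd_project(T, ranks=(8, 6, 100))
print(core.shape)
```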
C1 [Markaki, Maria; Stylianou, Yannis] Univ Crete, Dept Comp Sci, Iraklion, Greece.
RP Markaki, M (reprint author), Univ Crete, Dept Comp Sci, Iraklion, Greece.
EM mmarkaki@csd.uoe.gr
CR Aronowitz H, 2007, INT CONF ACOUST SPEE, P393
Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013
ATLAS L, 2005, MODULATION TOOLBOX M
Barras C, 2006, IEEE T AUDIO SPEECH, V14, P1505, DOI 10.1109/TASL.2006.878261
Cover T M, 1991, ELEMENTS INFORM THEO
De Lathauwer L, 2000, SIAM J MATRIX ANAL A, V21, P1253, DOI 10.1137/S0895479896305696
GREENBERG S, 1997, P ICASSP, V3, P1647
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Jain A, 2005, PATTERN RECOGN, V38, P2270, DOI 10.1016/j.patcog.2005.01.012
Joachims T., 1999, ADV KERNEL METHODS S, P41
Kinnunen T., 2008, P OD SPEAK LANG REC
KINNUNEN T, 2007, P SPECOM 2007
Lu L, 2003, MULTIMEDIA SYST, V8, P482, DOI 10.1007/s00530-002-0065-0
Malyska N, 2005, INT CONF ACOUST SPEE, P873
MARKAKI M, 2009, P IEEE EMBC 09
Martin A., 1997, DET CURVE ASSESSMENT, P1895
Mesgarani N, 2006, IEEE T AUDIO SPEECH, V14, P920, DOI 10.1109/TSA.2005.858055
Peng HC, 2005, IEEE T PATTERN ANAL, V27, P1226
QUATIERI TF, 2003, IEEE WORKSH APPL SIG
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Sanderson C., 2002, 0233 IDIAPRR
Saunders J, 1996, INT CONF ACOUST SPEE, P993, DOI 10.1109/ICASSP.1996.543290
SCHEIRER E, 1997, P IEEE INT C AC SPEE, P1331
Schimmel SM, 2007, INT CONF ACOUST SPEE, P605
Slonim N, 2005, ESTIMATING MUTUAL IN
Spina MS, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P594
Sukittanon S, 2004, IEEE T SIGNAL PROCES, V52, P3023, DOI 10.1109/TSP.2004.833861
Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256
Wooters C., 2004, P FALL 2004 RICH TRA
NR 29
TC 7
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 726
EP 735
DI 10.1016/j.specom.2010.08.007
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900011
ER
PT J
AU Heckmann, M
Domont, X
Joublin, F
Goerick, C
AF Heckmann, Martin
Domont, Xavier
Joublin, Frank
Goerick, Christian
TI A hierarchical framework for spectro-temporal feature extraction
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spectro-temporal; Auditory; Robust speech recognition; Image processing;
Learning; Competition; Hierarchical
ID INDEPENDENT COMPONENT ANALYSIS; AUTOMATIC SPEECH RECOGNITION; OBJECT
RECOGNITION; RECEPTIVE-FIELDS; AUDITORY-CORTEX; SOUNDS; ENHANCEMENT;
MODULATIONS; PERCEPTION; COMPLEX
AB In this paper we present a hierarchical framework for the extraction of spectro-temporal acoustic features. The design of the features targets higher robustness in dynamic environments. Motivated by the large gap between human and machine performance in such conditions, we take inspiration from the organization of the mammalian auditory cortex in the design of our features. This includes the joint processing of spectral and temporal information, the organization in hierarchical layers, competition between coequal features, the use of high-dimensional sparse feature spaces, and the learning of the underlying receptive fields in a data-driven manner. Due to these properties we term these features hierarchical spectro-temporal (HIST) features. For the learning of the features at the first layer we use Independent Component Analysis (ICA). At the second layer of our feature hierarchy we apply Non-Negative Sparse Coding (NNSC) to obtain features spanning a larger frequency and time region. We investigate the contribution of the different subparts of this feature extraction process to the overall performance. This includes an analysis of the benefits of the hierarchical processing, the comparison of different feature extraction methods on the first layer, the evaluation of the feature competition, and the investigation of the influence of different receptive field sizes on the second layer. Additionally, we compare our features to MFCC and RASTA-PLP features in a continuous digit recognition task in noise, both on a wideband dataset we constructed based on the Aurora-2 task and on the actual Aurora-2 database. We show that a combination of the proposed HIST features and RASTA-PLP features yields significant improvements and that the proposed features carry complementary information to RASTA-PLP and MFCC features. (C) 2010 Elsevier B.V. All rights reserved.
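A minimal Python sketch (ours, loosely following the data-driven first-layer idea): spectro-temporal receptive fields learned by running ICA on small spectrogram patches, whose filter responses could then serve as features. Patch size, component count, and the use of scikit-learn's FastICA are illustrative assumptions, not the HIST implementation.

```python
# Illustrative sketch only: ICA-learned spectro-temporal receptive fields from patches.
import numpy as np
from sklearn.decomposition import FastICA

def learn_ica_rfs(spectrogram, patch=(9, 9), n_components=32, n_patches=5000, seed=0):
    rng = np.random.default_rng(seed)
    n_f, n_t = spectrogram.shape
    pf, pt = patch
    # sample random spectro-temporal patches and flatten them to vectors
    patches = np.stack([
        spectrogram[i:i + pf, j:j + pt].ravel()
        for i, j in zip(rng.integers(0, n_f - pf, n_patches),
                        rng.integers(0, n_t - pt, n_patches))])
    ica = FastICA(n_components=n_components, random_state=seed, max_iter=500)
    ica.fit(patches - patches.mean(axis=0))
    return ica.components_.reshape(n_components, pf, pt)   # learned receptive fields
```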
C1 [Heckmann, Martin; Domont, Xavier; Joublin, Frank; Goerick, Christian] Honda Res Inst Europe GmbH, D-63073 Offenbach, Germany.
[Domont, Xavier] Tech Univ Darmstadt, Control Theory & Robot Lab, D-64283 Darmstadt, Germany.
RP Heckmann, M (reprint author), Honda Res Inst Europe GmbH, D-63073 Offenbach, Germany.
EM martin.heckmann@honda-ri.de; xavier.domont@rtr.tu-darmstadt.de;
frank.joublin@honda-ri.de; christian.goerick@honda-ri.de
CR BAER T, 1993, J REHABIL RES DEV, V30, P49
Behnke S., 2003, P INT JOINT C NEUR N, V4, P2758, DOI 10.1109/IJCNN.2003.1224004
CHEN B, 2004, P 8 INT C SPOK LANG
CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044
Cho YC, 2005, PATTERN RECOGN LETT, V26, P1327, DOI 10.1016/j.patrec.2004.11.026
COMON P, 1994, SIGNAL PROCESS, V36, P287, DOI 10.1016/0165-1684(94)90029-9
CRICK F, 1984, P NATL ACAD SCI-BIOL, V81, P4586, DOI 10.1073/pnas.81.14.4586
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
deCharms RC, 1998, SCIENCE, V280, P1439, DOI 10.1126/science.280.5368.1439
Domont X, 2008, INT CONF ACOUST SPEE, P4417, DOI 10.1109/ICASSP.2008.4518635
DOMONT X, 2007, LECT NOTES COMPUTER, P142
DUSAN S, 2005, 9 EUR C SPEEC COMM T
ELHILALI M, 2006, P INT C AC SPEECH SI
EZZAT T, 2007, P INTERSPEECH ISCA A
Fant G., 1970, ACOUSTIC THEORY SPEE
FANT G, 1979, SPEECH TRANSMISS LAB, V1, P70
Felleman DJ, 1991, CEREB CORTEX, V1, P1, DOI 10.1093/cercor/1.1.1
FERGUS R, 2003, P IEEE COMP SOC C CO, V2
Flynn R, 2008, SPEECH COMMUN, V50, P797, DOI 10.1016/j.specom.2008.05.004
Fritz J, 2003, NAT NEUROSCI, V6, P1216, DOI 10.1038/nn1141
FUKUSHIMA K, 1980, BIOL CYBERN, V36, P193, DOI 10.1007/BF00344251
Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC
Glaser C, 2010, IEEE T AUDIO SPEECH, V18, P224, DOI 10.1109/TASL.2009.2025536
Haque S., 2009, SPEECH COMMUN, V51, P58
HECKMANN M, 2010, ISCA TUT RE IN PRESS
HECKMANN M, 2009, P INTERSPEECH ISCA B
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
HERMANSKY H, 2000, P INT C AC SPEECH SI, V3
HERMANSKY H, 1998, 5 INT C SPOK LANG PR
Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Hirsch G., 2005, FANT FILTERING NOISE
Hoyer PO, 2004, J MACH LEARN RES, V5, P1457
HUBEL DH, 1965, J NEUROPHYSIOL, V28, P229
Hyvarinen A, 1999, IEEE T NEURAL NETWOR, V10, P626, DOI 10.1109/72.761722
KIM C, 2010, P INT C AC SPEECH SI, P4574
King AJ, 2009, NAT NEUROSCI, V12, P698, DOI 10.1038/nn.2308
Klein DJ, 2003, EURASIP J APPL SIG P, V2003, P659, DOI 10.1155/S1110865703303051
Kleinschmidt M., 2002, P INT C SPOK LANG PR
Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416
Leonard R., 1984, INT C AC SPEECH SIGN, V9
Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6
Mesgarani N, 2006, IEEE T AUDIO SPEECH, V14, P920, DOI 10.1109/TSA.2005.858055
MEYER B, 2008, P INTERSPEECH ISCA B
Morgan N, 2005, IEEE SIGNAL PROC MAG, V22, P81, DOI 10.1109/MSP.2005.1511826
Morgan N., 1995, IEEE Signal Processing Magazine, V12, DOI 10.1109/79.382443
Olshausen BA, 1996, NATURE, V381, P607, DOI 10.1038/381607a0
PATTERSON RD, 1992, ADV BIOSCI, V83, P429
Pearce D., 2000, P INT C SPOK LANG PR
Rauschecker JP, 1998, CURR OPIN NEUROBIOL, V8, P516, DOI 10.1016/S0959-4388(98)80040-8
Read HL, 2002, CURR OPIN NEUROBIOL, V12, P433, DOI 10.1016/S0959-4388(02)00342-2
Riesenhuber M, 1999, NAT NEUROSCI, V2, P1019
Schreiner CE, 1994, AUDIT NEUROSCI, V1, P39
Scott SK, 2003, TRENDS NEUROSCI, V26, P100, DOI 10.1016/S0166-2236(02)00037-1
Shamma S, 2001, TRENDS COGN SCI, V5, P340, DOI 10.1016/S1364-6613(00)01704-6
SHERRY Y, 2008, P INTERSPEECH ISCA B
Slaney M., 1993, 35 APPL COMP CO
Sroka JJ, 2005, SPEECH COMMUN, V45, P401, DOI 10.1016/j.specom.2004.11.009
Stevens K.N., 2000, ACOUSTIC PHONETICS
SUR M, 1988, SCIENCE, V242, P1437, DOI 10.1126/science.2462279
van Hateren JH, 1998, P ROY SOC B-BIOL SCI, V265, P2315
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
Vilar JM, 2008, INT CONF ACOUST SPEE, P5101, DOI 10.1109/ICASSP.2008.4518806
WANG H, 2008, P INTERSPEECH ISCA B
Wersing H, 2003, NEURAL COMPUT, V15, P1559, DOI 10.1162/089976603321891800
Young ED, 2008, PHILOS T R SOC B, V363, P923, DOI 10.1098/rstb.2007.2151
NR 66
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 736
EP 752
DI 10.1016/j.specom.2010.08.006
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900012
ER
PT J
AU Meyer, BT
Kollmeier, B
AF Meyer, Bernd T.
Kollmeier, Birger
TI Robustness of spectro-temporal features against intrinsic and extrinsic
variations in automatic speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spectro-temporal feature extraction; Automatic speech recognition;
Robustness; Intrinsic variability
ID RECOGNIZERS
AB The effect of bio-inspired spectro-temporal processing for automatic speech recognition (ASR) is analyzed for two different tasks with a focus on the robustness of spectro-temporal Gabor features in comparison to mel-frequency cepstral coefficients (MFCCs). Experiments aiming at extrinsic factors such as additive noise and changes of the transmission channel were carried out on a digit classification task (AURORA 2) for which spectro-temporal features were found to be more robust than the MFCC baseline against a wide range of noise sources. Intrinsic variations, i.e., changes in speaking rate, speaking effort and pitch, were analyzed on a phoneme recognition task with matched training and test conditions. The sensitivity of Gabor and MFCC features to various speaking styles was found to be different in a systematic way. An analysis based on phoneme confusions for both feature types suggests that spectro-temporal and purely spectral features carry complementary information. The usefulness of the combined information was demonstrated in a system using a combination of both types of features which yields a decrease in word-error rate of 16% compared to the best single-stream recognizer and 47% compared to an MFCC baseline. (C) 2010 Elsevier B.V. All rights reserved.
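A minimal Python sketch (ours) of one spectro-temporal Gabor filter: a complex sinusoid with chosen spectral and temporal modulation frequencies under a Hann envelope, applied to a log-mel spectrogram. The feature set in the paper uses a whole bank of such filters; the sizes and modulation frequencies below are placeholders.

```python
# Illustrative sketch only: a single 2-D Gabor filter response on a log-mel spectrogram.
import numpy as np
from scipy.signal import convolve2d

def gabor_filter(omega_f, omega_t, size_f=9, size_t=15):
    f = np.arange(size_f) - size_f // 2
    t = np.arange(size_t) - size_t // 2
    carrier = np.exp(1j * (omega_f * f[:, None] + omega_t * t[None, :]))
    env = np.outer(np.hanning(size_f), np.hanning(size_t))   # Hann envelope
    return carrier * env

def gabor_response(log_mel_spec, omega_f=0.5, omega_t=0.3):
    g = gabor_filter(omega_f, omega_t)
    # convolve with real and imaginary parts, then take the magnitude response
    resp = (convolve2d(log_mel_spec, g.real, mode="same")
            + 1j * convolve2d(log_mel_spec, g.imag, mode="same"))
    return np.abs(resp)
```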
C1 [Meyer, Bernd T.; Kollmeier, Birger] Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany.
RP Meyer, BT (reprint author), Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany.
EM bernd.meyer@uni-oldenburg.de
FU DFG [SFB/TRR 31]
FX Supported by the DFG (SFB/TRR 31 'The active auditory system'; URL:
http://www.uni-oldenburg.de/sfbtr31). The OLLO speech database has
been developed as part of the EU DIVINES Project IST-2002-002034.
CR BARKER J, 2007, SPEECH COMM
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Depireux DA, 2001, J NEUROPHYSIOL, V85, P1220
DEVALOIS RL, 1980, ANNU REV PSYCHOL, V31, P309, DOI 10.1146/annurev.ps.31.020180.001521
Dimitriadis D, 2005, IEEE SIGNAL PROC LET, V12, P621, DOI 10.1109/LSP.2005.853050
DRESCHLER WA, 1999, J ACOUST SOC AM, V105, P1296, DOI 10.1121/1.426174
Dreschler WA, 2001, AUDIOLOGY, V40, P148
Ellis D., 2003, RASTA PLP MATLAB
EZZAT T, 2007, P INT
GRAMSS T, 1991, P IEEE 2 INT C ART N, P180
GRAMSS T, 1990, SPEECH COMMUN, V9, P35, DOI 10.1016/0167-6393(90)90043-9
Happel MFK, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P670
HECKMANN M, 2008, P INT, P4417
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
Hermansky H., 2000, ACOUST SPEECH SIG PR, P1635
Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2
HERMANSKY H, 1999, ACOUST SPEECH SIG PR, P289
HIRSCH H, 2000, P ISCA ITRW ASR, P2697
JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631
KAERNBACH C, 2000, P CONTR PSYCH AC RES, P295
KLEINSCHMIDT M, 2003, THESIS
Kleinschmidt M., 2002, P ICSLP
Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416
Leonard R., 1984, P ICASSP, V9, P328
LIEB M, 2002, P ICSLP, P449
Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6
MESGARANI N, 2007, P INT
MEYER B, 2006, WORKSH SPEECH INTR V, P95
MEYER B, 2008, P INT
Meyer B., 2007, P INT, P1485
Qiu AQ, 2003, J NEUROPHYSIOL, V90, P456, DOI 10.1152/jn.00851.2002
Scharenborg O, 2007, SPEECH COMMUN, V49, P336, DOI 10.1016/j.specom.2007.01.009
Tchorz J, 1999, J ACOUST SOC AM, V106, P2040, DOI 10.1121/1.427950
Wesker T., 2005, P INT, P1273
Young S., 1995, HTK BOOK
Zhao SY, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P898
NR 36
TC 13
Z9 14
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 753
EP 767
DI 10.1016/j.specom.2010.07.002
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900013
ER
PT J
AU Wu, SQ
Falk, TH
Chan, WY
AF Wu, Siqing
Falk, Tiago H.
Chan, Wai-Yip
TI Automatic speech emotion recognition using modulation spectral features
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotion recognition; Speech modulation; Spectro-temporal representation;
Affective computing; Speech analysis
ID FREQUENCY; CLASSIFICATION; ENVELOPE
AB In this study, modulation spectral features (MSFs) are proposed for the automatic recognition of human affective information from speech. The features are extracted from an auditory-inspired long-term spectro-temporal representation. Obtained using an auditory filterbank and a modulation filterbank for speech analysis, the representation captures both acoustic frequency and temporal modulation frequency components, thereby conveying information that is important for human speech perception but missing from conventional short-term spectral features. In an experiment assessing classification of discrete emotion categories, the MSFs show promising performance in comparison with features that are based on mel-frequency cepstral coefficients and perceptual linear prediction coefficients, two commonly used short-term spectral representations. The MSFs further render a substantial improvement in recognition performance when used to augment prosodic features, which have been extensively used for emotion recognition. Using both types of features, an overall recognition rate of 91.6% is obtained for classifying seven emotion categories. Moreover, in an experiment assessing recognition of continuous emotions, the proposed features in combination with prosodic features attain estimation performance comparable to human evaluation. (C) 2010 Elsevier B.V. All rights reserved.
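A minimal Python sketch (ours, not the published MSF front end) of the joint acoustic/modulation decomposition: per acoustic band, take the Hilbert envelope, then sum its spectrum into a few logarithmically spaced modulation-frequency bands, so that the (acoustic band x modulation band) energies form a feature vector. The band edges and filter settings below are arbitrary placeholders.

```python
# Illustrative sketch only: modulation spectral energies per (acoustic, modulation) band.
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def modulation_spectral_features(x, sr, ac_edges=(100, 400, 1000, 2500, 6000),
                                 mod_edges=(2, 4, 8, 16, 32, 64)):
    feats = []
    for lo, hi in zip(ac_edges[:-1], ac_edges[1:]):
        sos = butter(4, [lo, hi], btype="bandpass", fs=sr, output="sos")
        env = np.abs(hilbert(sosfiltfilt(sos, x)))          # temporal envelope of the band
        spec = np.abs(np.fft.rfft(env - env.mean())) ** 2
        freqs = np.fft.rfftfreq(len(env), 1.0 / sr)         # modulation frequencies (Hz)
        feats.extend(spec[(freqs >= m_lo) & (freqs < m_hi)].sum()
                     for m_lo, m_hi in zip(mod_edges[:-1], mod_edges[1:]))
    return np.log(np.array(feats) + 1e-12)
```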
C1 [Wu, Siqing; Chan, Wai-Yip] Queens Univ, Dept Elect & Comp Engn, Kingston, ON K7L 3N6, Canada.
[Falk, Tiago H.] Univ Toronto, Inst Biomat & Biomed Engn, Toronto, ON M5S 3G9, Canada.
RP Wu, SQ (reprint author), Queens Univ, Dept Elect & Comp Engn, Kingston, ON K7L 3N6, Canada.
EM siqing.wu@queensu.ca; tiago.falk@ieee.org; chan@queensu.ca
CR Abelin A., 2000, P ISCA WORKSH SPEECH, P110
AERTSEN AMHJ, 1980, BIOL CYBERN, V38, P223, DOI 10.1007/BF00337015
[Anonymous], 1996, G729 ITUT
Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013
BARRA R, 2006, P ICASSP 06, V1, P1085
Batliner A, 2006, P IS LTC 2006 LJUBL, P240
Bishop C. M., 2006, PATTERN RECOGNITION
Burkhardt F., 2005, P INT, P1517
Busso C, 2009, IEEE T AUDIO SPEECH, V17, P582, DOI 10.1109/TASL.2008.2009578
Chang CC, 2009, LIBSVM LIB SUPPORT V
Chih T., 2005, J ACOUST SOC AM, V118, P887
Clavel C, 2008, SPEECH COMMUN, V50, P487, DOI 10.1016/j.specom.2008.03.012
COWIE R, 1996, P 4 INT C SPOK LANG, V3, P1989, DOI 10.1109/ICSLP.1996.608027
Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Depireux DA, 2001, J NEUROPHYSIOL, V85, P1220
Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5
Ekman P., 1999, BASIC EMOTIONS HDB C, P45
Ewert SD, 2000, J ACOUST SOC AM, V108, P1181, DOI 10.1121/1.1288665
Falk T., 2008, P INT WORKSH AC ECH
Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P90, DOI 10.1109/TASL.2009.2023679
Falk TH, 2010, IEEE T INSTRUM MEAS, V59, P978, DOI 10.1109/TIM.2009.2024697
Fletcher H, 1940, REV MOD PHYS, V12, P0047, DOI 10.1103/RevModPhys.12.47
Giannakopoulos T, 2009, INT CONF ACOUST SPEE, P65, DOI 10.1109/ICASSP.2009.4959521
GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T
Grimm M, 2008, 2008 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-4, P865, DOI 10.1109/ICME.2008.4607572
Grimm M., 2007, P IEEE INT C AC SPEE, VIV, P1085
Grimm M, 2007, SPEECH COMMUN, V49, P787, DOI 10.1016/j.specom.2007.01.010
Gunes H, 2007, J NETW COMPUT APPL, V30, P1334, DOI 10.1016/j.jnca.2006.09.007
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Hsu CW, 2007, PRACTICAL GUIDE SUPP
International Telecommunications Union, 1993, P56 ITUT
Ishi CT, 2010, EURASIP J AUDIO SPEE, DOI 10.1155/2010/528193
Kaiser J., 1990, P IEEE INT C AC SPEE, V1, P381
Kanedera N, 1999, SPEECH COMMUN, V28, P43, DOI 10.1016/S0167-6393(99)00002-3
Kittler J., 1978, Pattern Recognition and Signal Processing
Lee S., 2005, P EUR LISB PORT, P497
LUGGER M, 2008, P INT C AC SPEECH SI, V4, P4945
Morgan N, 2005, IEEE SIGNAL PROC MAG, V22, P81, DOI 10.1109/MSP.2005.1511826
Mozziconacci S, 2002, P 1 INT C SPEECH PRO, P1
Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2
Picard R. W., 1997, AFFECTIVE COMPUTING
Rabiner L, 1993, FUNDAMENTALS SPEECH
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
Scherer S., 2007, P INT ENV 2007, P152
Schuller B., 2007, P INT, P2253
Schuller B, 2007, P ICASSP, V4, P941
Schuller B., 2006, P SPEECH PROS
Shami M, 2007, SPEECH COMMUN, V49, P201, DOI 10.1016/j.specom.2007.01.006
Shamma S, 2001, TRENDS COGN SCI, V5, P340, DOI 10.1016/S1364-6613(00)01704-6
Shamma S., 2003, IETE J RES, V49, P193
Slaney M, 1993, EFFICIENT IMPLEMENTA
Sun R, 2009, INT CONF ACOUST SPEE, P4509, DOI 10.1109/ICASSP.2009.4960632
TALKIN D, 1995, ROBUST ALGORITHM PIT, pCH14
Vapnik V., 1995, NATURE STAT LEARNING
Ververidis D, 2006, SPEECH COMMUN, V48, P1162, DOI 10.1016/j.specom.2006.04.003
Vlasenko B., 2007, P INTERSPEECH 2007, P2225
Wollmer M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P597
Wu S., 2009, P INT C DIG SIGN PRO, P1
Zhou GJ, 2001, IEEE T SPEECH AUDI P, V9, P201, DOI 10.1109/89.905995
NR 60
TC 32
Z9 35
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 768
EP 785
DI 10.1016/j.specom.2010.08.013
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900014
ER
PT J
AU Alias, F
Formiga, L
Llora, X
AF Alias, Francesc
Formiga, Lluis
Llora, Xavier
TI Efficient and reliable perceptual weight tuning for unit-selection
text-to-speech synthesis based on active interactive genetic algorithms:
A proof-of-concept
SO SPEECH COMMUNICATION
LA English
DT Article
DE Perceptual weight tuning; Unit selection text-to-speech synthesis;
Active interactive genetic algorithms
AB Unit-selection speech synthesis is one of the current corpus-based text-to-speech synthesis techniques. The quality of the generated speech depends on the accuracy of the unit selection process, which in turn relies on the cost function definition. This function should map the user's perceptual preferences when selecting synthesis units, which is still an open research issue. This paper proposes a complete methodology for the tuning of the cost function weights by fusing the human judgments with the cost function, through efficient and reliable interactive weight tuning. To that effect, active interactive genetic algorithms (aiGAs) are used to guide the subjective weight adjustments. The application of aiGAs to this process allows user fatigue and frustration to be mitigated by improving user consistency. However, it is still infeasible to subjectively adjust the weights of all corpus units (diphones and triphones in this work). This makes it mandatory to perform unit clustering before conducting the tuning process. The aiGA-based weight tuning proposal is evaluated on a small speech corpus as a proof-of-concept and results in more natural synthetic speech when compared to previous objective- and subjective-based approaches. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Alias, Francesc; Formiga, Lluis] La Salle Univ Ramon Llull, GTM Grp Rec Tecnol Media, Barcelona 08022, Spain.
[Llora, Xavier] Univ Illinois, Natl Ctr Supercomp Applicat, Urbana, IL 61801 USA.
RP Alias, F (reprint author), La Salle Univ Ramon Llull, GTM Grp Rec Tecnol Media, C Quatre Camins 2, Barcelona 08022, Spain.
EM falias@salle.url.edu
RI Alias, Francesc/L-1088-2014
OI Alias, Francesc/0000-0002-1921-2375
FU European Commission [FP6 IST-4-027122-IP]
FX This work has been partially supported by the European Commission,
Project SALERO (FP6 IST-4-027122-IP). We would like to thank The Andrew
W. Mellon Foundation and the National Center for Supercomputing
Applications for their support during the preparation of this
manuscript.
CR Alias F., 2004, P 8 INT C SPOK LANG, P1221
Alias F., 2003, P 8 EUR C SPEECH COM, P1333
ALIAS F, 2006, P ICASSP TOUL FRANC, V1, P865
BLACK A, 2002, IEEE WORKSH SPEECH S
BLACK A, 2007, P ICASSP HON HI, V4, P1229
Black A. B., 2005, P INT 2005 LISB PORT, P77
Black A. W., 1997, P EUR, P601
Black A.W., 1997, HCRCTR83 U ED
BREUER R, 2004, P 8 INT C SPOK LANG, P1217
Campillo F. D., 2006, SPEECH COMMUN, V48, P941, DOI 10.1016/j.specom.2005.12.004
CHU M, 2002, ACOUST SPEECH SIG PR, P453
Chu M, 2001, P 7 EUR C SPEECH COM, P2087
Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014
COELLOCOELLO CA, 1998, LANIARD0908
Cristianini N., 2000, INTRO SUPPORT VECTOR
Deb K., 2000, 200001 KANGAL IND I
Durant EA, 2004, IEEE T SPEECH AUDI P, V12, P144, DOI 10.1109/TSA.2003.822640
FERNANDEZ R, 2006, TC STAR WORKSH SPEEC, P175
FORMIGA L, 2010, P 2010 IEEE C EV COM, P2322
Goldberg D., 1989, COMPLEX SYSTEMS, V3, P493
Goldberg D. E., 2002, DESIGN INNOVATION LE
Goldberg D. E, 1989, GENETIC ALGORITHMS S
GUNTER S, 2001, 3 IAPR TC15 WORKSH G, P229
Hunt A., 1996, P INT C AC SPEECH SI, V1, P373
Kim NS, 2004, IEEE SIGNAL PROC LET, V11, P40, DOI 10.1109/LSP.2003.819345
LEE M, 2001, P 4 ISCA WORKSH SPEE, P75
Llora X., 2005, P GEN EV COMP C, P1363, DOI 10.1145/1068009.1068228
MENG H, 2002, P 7 INT C SPOK LANG, P2373
Meron Y., 1999, P EUROSPEECH BUD HUN, V5, P2319
Pareto f, 1896, COURS EC POLITIQUE, VII
Pareto V., 1896, COURS EC POLITIQUE, P1
PARK SS, 2003, P EUR GEN SWITZ, V1, P281
Peng H., 2002, P 7 INT C SPOK LANG, P1341
Sebag M, 1998, LECT NOTES COMPUT SC, V1498, P418
Strom V, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1873
Takagi H, 2001, P IEEE, V89, P1275, DOI 10.1109/5.949485
Toda T., 2004, P ICASSP 2004 MONTR, P657
Toda T, 2006, SPEECH COMMUN, V48, P45, DOI 10.1016/j.specom.2005.05.011
TODA T, 2003, THESIS NARA I SCI TE
Wu CH, 2001, SPEECH COMMUN, V35, P219, DOI 10.1016/S0167-6393(00)00075-3
YI JRW, 2003, THESIS MIT CAMBRIDGE
NR 41
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY-JUN
PY 2011
VL 53
IS 5
SI SI
BP 786
EP 800
DI 10.1016/j.specom.2011.01.004
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 757DV
UT WOS:000290065900015
ER
PT J
AU Kim, W
Hansen, JHL
AF Kim, Wooil
Hansen, John H. L.
TI Variational noise model composition through model perturbation for
robust speech recognition with time-varying background noise
SO SPEECH COMMUNICATION
LA English
DT Article
DE Variational model composition (VMC); Time-varying noise; Feature
compensation; Multiple environmental models; Robust speech recognition
ID MISSING-FEATURE RECONSTRUCTION; COMPENSATION; ENHANCEMENT; COMBINATION
AB This study proposes a novel model composition method to improve speech recognition performance in time-varying background noise conditions. It is suggested that each element of the cepstral coefficients represents the frequency degree of the changing components in the envelope of the log-spectrum. With this motivation, in the proposed method, variational noise models are formulated by selectively applying perturbation factors to the mean parameters of a basis model, resulting in a collection of noise models that more accurately reflect the natural range of spectral patterns seen in the log-spectral domain. The basis noise model is obtained from the silence segments of the input speech. The perturbation factors are designed separately for changes in the energy level and spectral envelope. The proposed variational model composition (VMC) method is employed to generate multiple environmental models for our previously proposed parallel combined Gaussian mixture model (PCGMM) based feature compensation algorithm. The mixture sharing technique is integrated to reduce computational expenses caused by employing the variational models. Experimental results show that the proposed method is considerably more effective at increasing speech recognition performance in time-varying background noise conditions, with +31.31%, +10.65%, and +20.54% average relative improvements in word error rate for speech babble, background music, and real-life in-vehicle noise conditions, respectively, compared to the original basic PCGMM method. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Kim, Wooil; Hansen, John H. L.] Univ Texas Dallas, CRSS, Erik Jonsson Sch Engn & Comp Sci, Dept Elect Engn, Richardson, TX 75080 USA.
RP Hansen, JHL (reprint author), Univ Texas Dallas, CRSS, Erik Jonsson Sch Engn & Comp Sci, Dept Elect Engn, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA.
EM john.hansen@utdallas.edu
FU USAF [FA8750-09-C-0067]
FX This work was supported by the USAF under a subcontract to RADC, Inc.,
Contract FA8750-09-C-0067 (Approved for public release. Distribution
unlimited). A preliminary study of this work was presented at the
Interspeech-2009, Brighton, UK, September 2009 (Kim and Hansen, 2009c).
CR Angkititrakul P., 2007, IEEE INT VEH S, P566
ANGKITITRAKUL P, 2009, INVEHICLE CORPUS SIG, pCH5
[Anonymous], 2000, 201108 ETSI ES
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0
Deller J., 2000, DISCRETE TIME PROCES
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
ETSI, 2002, 202050 ETSI ES
FREY J, 2001, EUR 2001
Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Hansen JH, 2005, IEEE T SPEECH AUDI P, V13, P712, DOI 10.1109/TSA.2005.852088
Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618
HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901
Hansen JHL, 2004, DSP IN VEHICLE MOBIL
HIRSCH HG, 1995, INT CONF ACOUST SPEE, P153, DOI 10.1109/ICASSP.1995.479387
Hirsch H.G., 2000, ISCA ITRW ASR2000
KIM NS, 1997, IEEE WORKSH SPEECH R, P389
Kim NS, 2002, SPEECH COMMUN, V37, P231, DOI 10.1016/S0167-6393(01)00013-9
Kim W, 2009, SPEECH COMMUN, V51, P83, DOI 10.1016/j.specom.2008.06.004
Kim W, 2006, INT CONF ACOUST SPEE, P305
KIM W, 2009, INT 2009, P2399
Kim W, 2007, 2007 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, VOLS 1 AND 2, P687
Kim W, 2009, IEEE T AUDIO SPEECH, V17, P1292, DOI 10.1109/TASL.2009.2015080
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Martin R., 1994, EUSIPCO 94, P1182
Moreno P.J., 1996, THESIS CARNEGIE MELL
Moreno PJ, 1998, SPEECH COMMUN, V24, P267, DOI 10.1016/S0167-6393(98)00025-9
Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007
Sasou A., 2004, ICSLP2004, P121
Stouten V., 2004, ICASSP2004, P949
VARGA AP, 1990, INT CONF ACOUST SPEE, P845, DOI 10.1109/ICASSP.1990.115970
YAO K, 2001, ADV NEURAL INFO PROC, V14
NR 34
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2011
VL 53
IS 4
BP 451
EP 464
DI 10.1016/j.specom.2010.12.001
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 742GJ
UT WOS:000288929000001
ER
PT J
AU Paliwal, K
Wojcicki, K
Shannon, B
AF Paliwal, Kuldip
Wojcicki, Kamil
Shannon, Benjamin
TI The importance of phase in speech enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Analysis window; Short-time Fourier analysis;
Analysis-modification-synthesis (AMS); Magnitude spectrum; Minimum
mean-square error (MMSE) short-time spectral amplitude (STSA) estimator;
Phase spectrum; Phase spectrum compensation (PSC); MMSE PSC
ID TIME FOURIER-TRANSFORM; SPECTRAL AMPLITUDE ESTIMATOR; HUMAN LISTENING
TESTS; SIGNAL RECONSTRUCTION; MAGNITUDE; RECOGNITION; SUBTRACTION;
PERCEPTION; SUPPRESSION; NOISE
AB Typical speech enhancement methods, based on the short-time Fourier analysis-modification-synthesis (AMS) framework, modify only the magnitude spectrum and keep the phase spectrum unchanged. In this paper our aim is to show that by modifying the phase spectrum in the enhancement process the quality of the resulting speech can be improved. For this we use analysis windows of 32 ms duration and investigate a number of approaches to phase spectrum computation. These include the use of matched or mismatched analysis windows for magnitude and phase spectra estimation during AMS processing, as well as the phase spectrum compensation (PSC) method. We consider four cases and conduct a series of objective and subjective experiments that examine the importance of the phase spectrum for speech quality in a systematic manner. In the first (oracle) case, our goal is to determine maximum speech quality improvements achievable when accurate phase spectrum estimates are available, but when no enhancement is performed on the magnitude spectrum. For this purpose speech stimuli are constructed, where (during AMS processing) the phase spectrum is computed from clean speech, while the magnitude spectrum is computed from noisy speech. While such a situation does not arise in practice, it does provide us with a useful insight into how much a precise knowledge of the phase spectrum can contribute towards speech quality. In this first case, matched and mismatched analysis window approaches are investigated. Particular attention is given to the choice of analysis window type used during phase spectrum computation, where the effect of spectral dynamic range on speech quality is examined. In the second (non-oracle) case, we consider a more realistic scenario where only the noisy spectra (observable in practice) are available. We study the potential of the mismatched window approach for speech quality improvements in this non-oracle case. We would also like to determine how much room for improvement exists between this case and the best (oracle) case. In the third case, we use the PSC algorithm to enhance the phase spectrum. We compare this approach with the oracle and non-oracle matched and mismatched window techniques investigated in the preceding cases. While in the first three cases we consider the usefulness of various approaches to phase spectrum computation within the AMS framework when noisy magnitude spectrum is used, in the fourth case we examine the usefulness of these techniques when enhanced magnitude spectrum is employed. Our aim (in the context of traditional magnitude spectrum-based enhancement methods) is to determine how much benefit in terms of speech quality can be attained by also processing the phase spectrum. For this purpose, the minimum mean-square error (MMSE) short-time spectral amplitude (STSA) estimates are employed instead of noisy magnitude spectra. The results of the oracle experiments show that accurate phase spectrum estimates can considerably contribute towards speech quality, as well as that the use of mismatched analysis windows (in the computation of the magnitude and phase spectra) provides significant improvements in both objective and subjective speech quality - especially when the choice of analysis window used for phase spectrum computation is carefully considered. The mismatched window approach was also found to improve speech quality in the non-oracle case.
While the improvements were found to be statistically significant, they were only modest compared to those observed in the oracle case.
This suggests that research into better phase spectrum estimation algorithms, while a challenging task, could be worthwhile. The results of the PSC experiments indicate that the PSC method achieves better speech quality improvements than the other non-oracle methods considered. The results of the MMSE experiments suggest that accurate phase spectrum estimates have the potential to significantly improve the performance of existing magnitude spectrum-based methods. Out of the non-oracle approaches considered, the combination of the MMSE STSA method with the PSC algorithm produced significantly better speech quality improvements than those achieved by these methods individually. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Paliwal, Kuldip; Wojcicki, Kamil; Shannon, Benjamin] Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia.
RP Wojcicki, K (reprint author), Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia.
EM kamil.wojcicki@ieee.org
CR Alsteris L., 2004, P IEEE INT C AC SPEE, V1, P573
ALSTERIS L, 2005, P INT S SIGN PROC AP
Alsteris LD, 2007, COMPUT SPEECH LANG, V21, P174, DOI 10.1016/j.csl.2006.03.001
Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Ghitza O, 2001, J ACOUST SOC AM, V110, P1628, DOI 10.1121/1.1396325
GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317
HARRIS FJ, 1978, P IEEE, V66, P51, DOI 10.1109/PROC.1978.10837
Hayes M. H., 1996, STAT DIGITAL SIGNAL, V1st
HAYES MH, 1980, IEEE T ACOUST SPEECH, V28, P672, DOI 10.1109/TASSP.1980.1163463
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
*ITU T, 2001, 0862 ITUT
Kim DS, 2003, IEEE T SPEECH AUDI P, V11, P355, DOI 10.1109/TSA.2003.814409
LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540
Liu L, 1997, SPEECH COMMUN, V22, P403, DOI 10.1016/S0167-6393(97)00054-X
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Loveimi E., 2010, P INT S COMM CONTR S, P1
Lu Y, 2008, SPEECH COMMUN, V50, P453, DOI 10.1016/j.specom.2008.01.003
Martin R., 1994, P 7 EUR SIGN PROC C, P1182
MCAULAY RJ, 1995, PSYCHOACOUSTICS FACT, P121
Nakagawa S, 2007, INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, P1065
NAWAB SH, 1983, IEEE T ACOUST SPEECH, V31, P986, DOI 10.1109/TASSP.1983.1164162
OPPENHEIM AV, 1979, P IEEE INT C AC SPEE, V4, P632
OPPENHEIM AV, 1981, P IEEE, V69, P529, DOI 10.1109/PROC.1981.12022
Paliwal K., 1987, P IEEE INT C AC SPEE, V12, P177
Paliwal K. K., 2003, P EUR 2003, P2117
Alsteris LD, 2006, SPEECH COMMUN, V48, P727, DOI 10.1016/j.specom.2005.10.005
Paliwal K.K., 2003, P IPSJ SPOK LANG PRO, P1
Paliwal KK, 2005, SPEECH COMMUN, V45, P153, DOI 10.1016/j.specom.2004.08.001
PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532
POBLOTH H, 1999, ACOUST SPEECH SIG PR, P29
Quatieri T. F., 2002, DISCRETE TIME SPEECH
REDDY NS, 1985, IEEE T CIRCUITS SYST, V32, P616, DOI 10.1109/TCS.1985.1085749
Schluter R., 2001, P IEEE INT C AC SPEE, P133
Shannon BJ, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1423
Shi GJ, 2006, IEEE T AUDIO SPEECH, V14, P1867, DOI 10.1109/TSA.2005.858512
Sim BL, 1998, IEEE T SPEECH AUDI P, V6, P328
SKOGLUND J, 1997, P IEEE SPEECH COD WO, P51
Stark AP, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P549
VARY P, 1985, SIGNAL PROCESS, V8, P387, DOI 10.1016/0165-1684(85)90002-7
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
Wackerly D., 2007, MATH STAT APPL
WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920
Wang LB, 2009, INT CONF ACOUST SPEE, P4529, DOI 10.1109/ICASSP.2009.4960637
Wiener N., 1949, EXTRAPOLATION INTERP
Wojcicki K, 2008, IEEE SIGNAL PROC LET, V15, P461, DOI 10.1109/LSP.2008.923579
WOJCICKI K, 2008, P ISCA TUT RES WORKS
WOJCICKI K, 2007, P IEEE INT C AC SPEE, V4, P729
YEGNANARAYANA B, 1987, P IEEE INT C AC SPEE, V12, P301
NR 54
TC 17
Z9 18
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2011
VL 53
IS 4
BP 465
EP 494
DI 10.1016/j.specom.2010.12.003
PG 30
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 742GJ
UT WOS:000288929000002
ER
PT J
AU Lu, CT
AF Lu, Ching-Ta
TI Enhancement of single channel speech using perceptual-decision-directed
approach
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Power spectral subtraction; Masking property;
Decision directed; Perceptual gain factor
ID MASKING PROPERTIES; NOISE; TRANSFORM; QUALITY; CODERS
AB The masking properties of the human ear have been successfully applied to adapt a speech enhancement system, yielding an improvement in speech quality. The accuracy of estimated speech spectra plays a major role in computing the noise masking threshold. Although traditional methods that use power spectral subtraction to roughly estimate the speech spectra can provide acceptable performance, the estimated speech spectra can be further improved for computing the noise masking threshold. In this article, we aim to find a better spectral estimate of speech using the two-step-decision-directed method. In turn, this estimate is employed to compute the noise masking threshold of a perceptual gain factor. Experimental results show that the amount of residual noise can be efficiently suppressed by embedding the two-step-decision-directed algorithm in the perceptual gain factor. (C) 2010 Elsevier B.V. All rights reserved.
C1 Asia Univ, Dept Informat Commun, Taichung 41354, Taiwan.
RP Lu, CT (reprint author), Asia Univ, Dept Informat Commun, 500 Lioufeng Rd, Taichung 41354, Taiwan.
EM lucas1@ms26.hinet.net
FU National Science Council, Taiwan [NSC 98-2221-E-468-006]
FX This research was supported by the National Science Council, Taiwan,
under Contract Number NSC 98-2221-E-468-006. The author would like to
thank the anonymous reviewers for their valuable comments which improve
the quality of this paper.
CR Amehraye A, 2008, INT CONF ACOUST SPEE, P2081, DOI 10.1109/ICASSP.2008.4518051
[Anonymous], 2003, SUBJ TEST METH EV SP, P835
Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283
Ding HJ, 2009, SPEECH COMMUN, V51, P259, DOI 10.1016/j.specom.2008.09.003
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Ghanbari Y, 2006, SPEECH COMMUN, V48, P927, DOI 10.1016/j.specom.2005.12.002
Hu Y, 2004, IEEE SIGNAL PROC LET, V11, P270, DOI 10.1109/LSP.2003.821714
*ITU T, 2001, 0862 ITUT
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lu C-T, 2007, DIGIT SIGNAL PROCESS, V17, P171
Lu CT, 2003, SPEECH COMMUN, V41, P409, DOI 10.1016/S0167-6393(03)00011-6
Lu CT, 2007, PATTERN RECOGN LETT, V28, P1300, DOI 10.1016/j.patrec.2007.03.001
Lu CT, 2004, ELECTRON LETT, V40, P394, DOI 10.1049/el:20040266
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621
Rix A., 2001, P IEEE INT C AC SPEE, P749
SCHROEDER MR, 1979, J ACOUST SOC AM, V66, P1647, DOI 10.1121/1.383662
Udrea RM, 2008, SIGNAL PROCESS, V88, P1293
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
WANG SH, 1992, IEEE J SEL AREA COMM, V10, P819, DOI 10.1109/49.138987
YANG WH, 1998, ACOUST SPEECH SIG PR, P541
Yu G, 2008, IEEE T SIGNAL PROCES, V56, P1830, DOI 10.1109/TSP.2007.912893
NR 22
TC 7
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2011
VL 53
IS 4
BP 495
EP 507
DI 10.1016/j.specom.2010.11.008
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 742GJ
UT WOS:000288929000003
ER
PT J
AU Gao, J
Zhao, QW
Yan, YH
AF Gao, Jie
Zhao, Qingwei
Yan, Yonghong
TI Towards precise and robust automatic synchronization of live speech and
its transcripts
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatically synchronizing spoken utterances with their transcripts;
Frame-synchronous likelihood ratio test; Hidden markov models;
Simultaneous broadcast news subtitling
ID SUBTITLING SYSTEM; RECOGNITION
AB This paper presents our efforts in automatically synchronizing spoken utterances with their transcripts (textual contents) (ASUT), where the speech is a live stream and its corresponding transcripts are known. This task is first simplified to the problem of online detection of the end times of spoken utterances, and then a solution based on a novel frame-synchronous likelihood ratio test (FSLRT) procedure is proposed. We detail the formulation and implementation of the proposed FSLRT procedure under the hidden Markov model (HMM) framework, and we study its properties and parameter settings empirically.
Because synchronization failures may occur in FSLRT-based ASUT systems, this paper also extends the FSLRT procedure to a multiple-instance version to increase the robustness of the system. The proposed multiple-instance FSLRT can detect synchronization failures and restart the system from an appropriate point, so a fully automatic FSLRT-based ASUT system can be constructed.
The FSLRT-based ASUT system is evaluated on a simultaneous broadcast news subtitling task. Experimental results show that the proposed method achieves satisfactory performance and outperforms an automatic speech recognition-based method in terms of both robustness and precision. Finally, the FSLRT-based news subtitling system can correctly subtitle about 90% of the sentences with an average time deviation of about 100 ms, running at 0.37 times real time (RT). (C) 2011 Elsevier B.V. All rights reserved.
C1 [Gao, Jie; Zhao, Qingwei; Yan, Yonghong] Chinese Acad Sci, ThinkIT Speech Lab, Inst Acoust, Beijing 100190, Peoples R China.
RP Gao, J (reprint author), Chinese Acad Sci, ThinkIT Speech Lab, Inst Acoust, Beijing 100190, Peoples R China.
EM jgao@hccl.ioa.ac.cn
FU National Science & Technology Pillar Program [2008BAI50B03]; National
Natural Science Foundation of China [10925419, 90920302, 10874203,
60875014]
FX This work is partially supported by The National Science & Technology
Pillar Program (2008BAI50B03), National Natural Science Foundation of
China (No. 10925419, 90920302, 10874203, 60875014). We thank the members
of ThinkIT Speech Lab for the fruitful discussions on the mathematical
formulation of FSLRT. We also thank the reviewers for their insightful
comments and suggestions, which are very helpful for improving the
quality of our manuscript.
CR Ando A, 2003, IEICE T INF SYST, VE86D, P15
Boulianne G, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P273
Duda R., 2000, PATTERN CLASSIFICATI
Gao J, 2009, LECT NOTES COMPUT SC, V5553, P576
GAO J, 2009, P INT 2009 ISCA, P2115
GUO Y, 2007, P INT, P2949
Hosom J.-P., 2000, THESIS OREGON GRADUA
HUANG C, 2003, AUTOMATIC CLOSED CAP
Jurafsky Daniel, 2000, SPEECH LANGUAGE PROC
Kan MY, 2008, IEEE T AUDIO SPEECH, V16, P338, DOI 10.1109/TASL.2007.911559
MANUEL J, 2002, IEEE T MUTIMEDIAS, V3, P88
MORENO P, 1998, P ICSLP 1998 ISCA
Neto J, 2008, INT CONF ACOUST SPEE, P1561, DOI 10.1109/ICASSP.2008.4517921
Ney H, 1999, IEEE SIGNAL PROC MAG, V16, P64, DOI 10.1109/79.790984
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
Rahim MG, 1997, IEEE T SPEECH AUDI P, V5, P266, DOI 10.1109/89.568733
ROBERTRIBES J, 1997, P EUR ISCA, P903
Shao J, 2008, IEICE T INF SYST, VE91D, P529, DOI 10.1093/ietisy/e91-d.3.529
WEINTRAUB M, 1995, INT CONF ACOUST SPEE, P297, DOI 10.1109/ICASSP.1995.479532
WHEATLEY B, 1992, P IEEE INT C AC SPEE, P533, DOI 10.1109/ICASSP.1992.225853
NR 20
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2011
VL 53
IS 4
BP 508
EP 523
DI 10.1016/j.specom.2011.01.001
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 742GJ
UT WOS:000288929000004
ER
PT J
AU Jan, T
Wang, WW
Wang, DL
AF Jan, Tariqullah
Wang, Wenwu
Wang, DeLiang
TI A multistage approach to blind separation of convolutive speech mixtures
SO SPEECH COMMUNICATION
LA English
DT Article
DE Independent component analysis (ICA); Convolutive mixtures; Ideal binary
mask (IBM); Estimated binary mask; Cepstral smoothing; Musical noise
ID TIME-FREQUENCY MASKING; NONSTATIONARY SOURCES; PERMUTATION PROBLEM;
NATURAL GRADIENT; DOMAIN; REPRESENTATION; SENSORS
AB We propose a novel algorithm for the separation of convolutive speech mixtures using two-microphone recordings, based on the combination of independent component analysis (ICA) and ideal binary mask (IBM), together with a post-filtering process in the cepstral domain. The proposed algorithm consists of three steps. First, a constrained convolutive ICA algorithm is applied to separate the source signals from two-microphone recordings. In the second step, we estimate the IBM by comparing the energy of corresponding time frequency (T-F) units from the separated sources obtained with the convolutive ICA algorithm. The last step is to reduce musical noise caused by T-F masking using cepstral smoothing. The performance of the proposed approach is evaluated using both reverberant mixtures generated using a simulated room model and real recordings in terms of signal to noise ratio measurement. The proposed algorithm offers considerably higher efficiency and improved speech quality while producing similar separation performance compared with a recent approach. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Jan, Tariqullah; Wang, Wenwu] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 5XH, Surrey, England.
[Wang, DeLiang] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA.
[Wang, DeLiang] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA.
RP Jan, T (reprint author), Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 5XH, Surrey, England.
EM t.jan@surrey.ac.uk; w.wang@surrey.ac.uk; dwang@cse.ohio-state.edu
FU Royal Academy of Engineering [IJB/AM/08-587]; EPSRC [EP/H012842/1];
AFOSR [FA9550-08-1-0155]; NSF [IIS-0534707]; NWFP UET Peshawar, Pakistan
FX We are grateful to M. S. Pedersen for providing the Matlab code of
Pedersen et al. (2008) and the assistance in the preparation of this
work. Part of the work was conducted while W. Wang was visiting OSU. T.
U. Jan was supported by the NWFP UET Peshawar, Pakistan. W. Wang was
supported in part by a Royal Academy of Engineering travel Grant
(IJB/AM/08-587) and an EPSRC Grant (EP/H012842/1). D. L. Wang was
supported in part by an AFOSR Grant (FA9550-08-1-0155) and an NSF Grant
(IIS-0534707).
CR Aissa-El-Bey A, 2007, IEEE T AUDIO SPEECH, V15, P1540, DOI 10.1109/TASL.2007.898455
Amari S, 1997, FIRST IEEE SIGNAL PROCESSING WORKSHOP ON SIGNAL PROCESSING ADVANCES IN WIRELESS COMMUNICATIONS, P101, DOI 10.1109/SPAWC.1997.630083
Araki S., 2005, P IEEE INT C AC SPEE, V3, P81, DOI 10.1109/ICASSP.2005.1415651
Araki S, 2003, IEEE T SPEECH AUDI P, V11, P109, DOI 10.1109/TSA.2003.809193
ARAKI S, 2004, P INT C IND COMP AN, P898
Araki S, 2007, SIGNAL PROCESS, V87, P1833, DOI 10.1016/j.sigpro.2007.02.003
BACK AD, 1994, IEEE WORKSH NEUR NET, P565
Buchner H, 2004, AUDIO SIGNAL PROCESSING: FOR NEXT-GENERATION MULTIMEDIA COMMUNICATION SYSTEMS, P255, DOI 10.1007/1-4020-7769-6_10
CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229
Cichocki A., 2002, ADAPTIVE BLIND SIGNA
Dillon H., 2001, HEARING AIDS
Douglas SC, 2007, IEEE T AUDIO SPEECH, V15, P1511, DOI 10.1109/TASL.2007.899176
Douglas SC, 2005, IEEE T SPEECH AUDI P, V13, P92, DOI 10.1109/TSA.2004.838538
Douglas SC, 2003, SPEECH COMMUN, V39, P65, DOI 10.1016/S0167-6393(02)00059-6
Gaubitch N. D., 1979, ALLEN BERKLEY IMAGE
Han S, 2009, P 7 INT C INF COMM S, P356
He ZS, 2007, IEEE T AUDIO SPEECH, V15, P1551, DOI 10.1109/TASL.2007.898457
Hoel P.G., 1976, ELEMENTARY STAT
Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812
Hyvarinen A, 2001, INDEPENDENT COMPONEN
Jan T, 2009, INT CONF ACOUST SPEE, P1713, DOI 10.1109/ICASSP.2009.4959933
Lambert RH, 1997, INT CONF ACOUST SPEE, P423, DOI 10.1109/ICASSP.1997.599665
Lee T. W., 1998, INDEPENDENT COMPONEN
Lee TW, 1997, 1997 IEEE INTERNATIONAL CONFERENCE ON NEURAL NETWORKS, VOLS 1-4, P2129, DOI 10.1109/ICNN.1997.614235
Madhu N, 2008, INT CONF ACOUST SPEE, P45, DOI 10.1109/ICASSP.2008.4517542
Makino S, 2005, IEICE T FUND ELECTR, VE88A, P1640, DOI 10.1093/ietfec/e88-a.7.1640
Matsuoka K., 2001, INT WORKSH ICA BSS, P722
Mazur R, 2009, IEEE T AUDIO SPEECH, V17, P117, DOI 10.1109/TASL.2008.2005349
MITIANONDIS N, 2002, INT J ADAPT CONTROL, P1
Mukai R, 2004, LECT NOTES COMPUT SC, V3195, P461
Murata N, 2001, NEUROCOMPUTING, V41, P1, DOI 10.1016/S0925-2312(00)00345-3
Nesta F., 2009, IEEE WORKSH APPL SIG, P105
NESTA F, 2008, P HANDS FREE SPEECH, P232
Nickel RM, 2006, INT CONF ACOUST SPEE, P629
Olsson RK, 2006, INT CONF ACOUST SPEE, P657
Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE
Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214
Pedersen MS, 2008, IEEE T NEURAL NETWOR, V19, P475, DOI 10.1109/TNN.2007.911740
Rahbar K, 2005, IEEE T SPEECH AUDI P, V13, P832, DOI 10.1109/TSA.2005.851925
Reju VG, 2010, IEEE T AUDIO SPEECH, V18, P101, DOI 10.1109/TASL.2009.2024380
RODRIGUES GF, 2009, P 8 IND C AN SIGN SE, P621
Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463
Sawada H, 2004, IEEE T SPEECH AUDI P, V12, P530, DOI 10.1109/TSA.2004.832994
Sawada H, 2006, IEEE T AUDIO SPEECH, V14, P2165, DOI 10.1109/TASL.2006.872599
Sawada H, 2003, IEICE T FUND ELECTR, VE86A, P590
Sawada H, 2007, IEEE T AUDIO SPEECH, V15, P1592, DOI 10.1109/TASL.2007.899218
Schobben L, 2002, IEEE T SIGNAL PROCES, V50, P1855
Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 10.1016/S0925-2312(98)00047-2
SOON VC, 1993, P IEEE INT S CIRC SY, V1, P703
Wang D., 2008, TRENDS AMPLIF, V12, P332
Wang D., 2006, COMPUTATIONAL AUDITO
Wang DL, 2005, SPEECH SEPARATION BY HUMANS AND MACHINES, P181, DOI 10.1007/0-387-22794-6_12
Wang DL, 2009, J ACOUST SOC AM, V125, P2336, DOI 10.1121/1.3083233
WANG W, 2004, OSA
Wang WW, 2005, IEEE T SIGNAL PROCES, V53, P1654, DOI 10.1109/TSP.2005.845433
Yoshioka T., 2009, P 17 EUR SIGN PROC C, P1432
NR 56
TC 6
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2011
VL 53
IS 4
BP 524
EP 539
DI 10.1016/j.specom.2011.01.002
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 742GJ
UT WOS:000288929000005
ER
PT J
AU Le, N
Ambikairajah, E
Epps, J
Sethu, V
Choi, EHC
AF Phu Ngoc Le
Ambikairajah, Eliathamby
Epps, Julien
Sethu, Vidhyasaharan
Choi, Eric H. C.
TI Investigation of spectral centroid features for cognitive load
classification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Cognitive load; Gaussian mixture model; Spectral centroid feature;
Frequency scale; Kullback-Leibler distance
ID SPEECH RECOGNITION; STRESS; SYSTEM; WORKLOAD
AB Speech is a promising modality for the convenient measurement of cognitive load, and recent years have seen the development of several cognitive load classification systems. Many of these systems have utilised mel frequency cepstral coefficients (MFCC) and prosodic features like pitch and intensity to discriminate between different cognitive load levels. However, the accuracies obtained by these systems are still not high enough to allow for their use outside of laboratory environments. One reason for this might be the imperfect acoustic description of speech provided by MFCCs. Since these features do not characterise the distribution of the spectral energy within subbands, in this paper, we investigate the use of spectral centroid frequency (SCF) and spectral centroid amplitude (SCA) features, applying them to the problem of automatic cognitive load classification. The effect of varying the number of filters and the frequency scale used is also evaluated, in terms of the effectiveness of the resultant spectral centroid features in discriminating between cognitive loads. The results of classification experiments show that the spectral centroid features consistently and significantly outperform a baseline system employing MFCC, pitch, and intensity features. Experimental results reported in this paper indicate that the fusion of an SCF based system with an SCA based system results in a relative reduction in error rate of 39% and 29% for two different cognitive load databases. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Phu Ngoc Le; Ambikairajah, Eliathamby; Epps, Julien; Sethu, Vidhyasaharan] Univ New S Wales, Sch Elect Engn & Telecommun, UNSW, Sydney, NSW 2052, Australia.
[Phu Ngoc Le; Ambikairajah, Eliathamby; Epps, Julien; Choi, Eric H. C.] Natl ICT Australia NICTA, ATP Res Lab, Eveleigh 2015, Australia.
RP Le, N (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, UNSW, Sydney, NSW 2052, Australia.
EM phule@unsw.edu.au; ambi@ee.unsw.edu.au; j.epps@unsw.edu.au;
vidhyasaharan@gmail.com; Eric.Choi@nicta.com.au
CR Berthold A, 1999, CISM COUR L, P235
BORIL H, 2010, P INT MAKH CHIB JAP, P502
Fernandez R, 2003, SPEECH COMMUN, V40, P145, DOI 10.1016/S0167-6393(02)00080-8
Gajic B, 2006, IEEE T AUDIO SPEECH, V14, P600, DOI 10.1109/TSA.2005.855834
GERVEN PWM, 2004, PSYCHOPHYSIOLOGY, V41, P167
Goldberger J., 2005, P INT, P1985
GRIFFIN GR, 1987, AVIAT SPACE ENVIR MD, V58, P1165
HE L, 2009, P 5 INT C NAT COMP T, P260
Hosseinzadeh D, 2008, EURASIP J ADV SIG PR, DOI 10.1155/2008/258184
Khawaja M. A., 2007, P 19 AUSTR C COMP HU, P57, DOI 10.1145/1324892.1324902
Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416
Kua J. M. K., 2010, P OD SPEAK LANG REC, P34
Le P. N., 2009, P 7 INT C INF COMM S, P1
LE PN, 2010, P 20 INT C PATT REC, P4516
LIVELY SE, 1993, J ACOUST SOC AM, V93, P2962, DOI 10.1121/1.405815
LU X, 2007, SPEECH COMMUN, P312
Mendoza E, 1998, J VOICE, V12, P263, DOI 10.1016/S0892-1997(98)80017-9
*MET, 2007, LEX FRAM READ
MULLER C, 2001, LECT NOTES COMPUTER, P24
Paas F, 2003, EDUC PSYCHOL, V38, P1, DOI 10.1207/S15326985EP3801_1
PALIWAL KK, 1998, ACOUST SPEECH SIG PR, P617
Pass F.G.W.C., 1994, J EDUC PSYCHOL, V86, P122
Pelecanos J., 2001, ODYSSEY 2001 CRET GR, P213
PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Scherer K. R., 2002, P ICSLP, P2017
SHANNON BJ, 2003, P MICR ENG RES C BRI, P1
SHRIBERG E, 1992, P DARPA SPEECH NAT L, P419, DOI 10.3115/1075527.1075628
STEENEKEN HJM, 1999, ACOUST SPEECH SIG PR, P2079
Talkin D., 1995, SPEECH CODING SYNTHE, P495
Thiruvaran T., 2006, P 11 AUSTR INT C SPE, P148
Yap T. F., 2010, P ISSPA KUAL LUMP MA, P221
YAP TF, 2010, P INTERSPEECH MAK CH, P2022
Yap TF, 2009, INT CONF ACOUST SPEE, P4825
YAP TF, 2010, P ICASSP DALL TEX US, P5234
Yin B, 2008, INT CONF ACOUST SPEE, P2041, DOI 10.1109/ICASSP.2008.4518041
Yin B., 2007, P CHISIG, P249, DOI 10.1145/1324892.1324946
NR 37
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2011
VL 53
IS 4
BP 540
EP 551
DI 10.1016/j.specom.2011.01.005
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 742GJ
UT WOS:000288929000006
ER
PT J
AU Legat, M
Matousek, J
Tihelka, D
AF Legat, M.
Matousek, J.
Tihelka, D.
TI On the detection of pitch marks using a robust multi-phase algorithm
SO SPEECH COMMUNICATION
LA English
DT Article
DE Glottal closure instant; Pitch mark; Speech signal polarity; Fundamental
frequency
ID SPEECH SYNTHESIS; GLOTTAL CLOSURE; VOICED SPEECH; INSTANTS; EXCITATION;
MODEL; WAVE
AB A large number of methods for identifying glottal closure instants (GCIs) in voiced speech have been proposed in recent years. In this paper, we propose to take advantage of both glottal and speech signals in order to increase the accuracy of detection of GCIs. All aspects of this particular issue, from determining speech polarity to handling the delay between the glottal and corresponding speech signals, are addressed. A robust multi-phase algorithm (MPA), which combines different methods applied to both signals in a unique way, is presented. Within the process, special attention is paid to the determination of speech waveform polarity, as it was found to considerably influence the performance of the detection algorithms. Another feature of the proposed method is that every detected GCI is given a confidence score, which makes it possible to locate potentially inaccurate GCI subsequences. The performance of the proposed algorithm was tested and compared with other freely available GCI detection algorithms. The MPA algorithm was found to be more robust in terms of detection accuracy over various sets of sentences, languages and phone classes. Finally, some pitfalls of GCI detection are discussed. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Legat, M.; Matousek, J.; Tihelka, D.] Univ W Bohemia, Fac Sci Appl, Dept Cybernet, Plzen 30614, Czech Republic.
RP Matousek, J (reprint author), Univ W Bohemia, Fac Sci Appl, Dept Cybernet, Univ 8, Plzen 30614, Czech Republic.
EM legatm@kky.zcu.cz; jmatouse@kky.zcu.cz
RI Matousek, Jindrich/C-2146-2011; Tihelka, Daniel/A-4318-2012
OI Matousek, Jindrich/0000-0002-7408-7730; Tihelka,
Daniel/0000-0002-3149-2330
FU Ministry of Education of the Czech Republic [2C06020]; Grant Agency of
the Czech Republic [GACR 102/09/0989]
FX This research was supported by the Ministry of Education of the Czech
Republic, Project No. 2C06020, and by the Grant Agency of the Czech
Republic, Project No. GACR 102/09/0989. The access to the METACentrum
clusters provided under the research intent MSM6383917201 is highly
appreciated.
CR BANGA ER, 2002, IMPROVEMENTS SPEECH, P52
BELLEGARDA JR, 2004, P 5 ISCA SPEECH SYNT, P133
Boersma P., 2005, PRAAT SOFTWARE SPEEC
Brookes M, 2006, IEEE T AUDIO SPEECH, V14, P456, DOI 10.1109/TSA.2005.857810
Chambers JM, 1983, GRAPHICAL METHODS DA
Chen J., 2001, COMPUTATIONAL LINGUI, V6, P1
CHENGYUAN L, 2004, P INTERSPEECH JEJ KO, P1189
de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024
Dutoit T., 2008, SPRINGER HDB SPEECH, P437, DOI 10.1007/978-3-540-49127-9_21
Dutoit T, 1996, SPEECH COMMUN, V19, P119, DOI 10.1016/0167-6393(96)00029-5
Hagmuller M, 2006, SPEECH COMMUN, V48, P1650, DOI 10.1016/j.specom.2006.07.008
Hamon C., 1989, P INT C AC SPEECH SI, P238
Hanzlicek Z, 2007, INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, P681
Huang X., 2001, SPOKEN LANGUAGE PROC
HUSSEIN H, 2007, P INTERSPEECH ANTW B, P1653
Hussein H, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P135
KLEIJN WB, 2002, P IEEE WORKSH SPEECH, P163
Legat M, 2007, LECT NOTES ARTIF INT, V4629, P502
Legat M, 2007, INTERSPEECH 2007: 8TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION, VOLS 1-4, P1641
Ma C, 1994, IEEE T SPEECH AUDI P, V2, P258
MATOUSEK J, 2008, P 6 INT C LANG RES E
MATOUSEK J, 2001, P EUR 2001 ALB, V3, P2047
MATOUSEK J, 2004, P ICSLP JEJ KOR, V3, P1933
Matousek J, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1626
MCKENNA JG, 2001, P 4 ISCA TUT RES WOR
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
Murthy PS, 1999, IEEE T SPEECH AUDI P, V7, P609
Noth E, 2002, SPEECH COMMUN, V36, P45, DOI 10.1016/S0167-6393(01)00025-5
ROTHENBERG M, 1988, J SPEECH HEAR RES, V31, P338
ROTHENBERG M, 1992, J VOICE, V6, P36, DOI 10.1016/S0892-1997(05)80007-4
SAKAMOTO M, 2000, P INT C SPOK LANG PR, V3, P650
Schoentgen J, 2003, J VOICE, V17, P114, DOI 10.1016/S0892-1997(03)0014-6
SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662
STRUBE HW, 1974, J ACOUST SOC AM, V56, P1625, DOI 10.1121/1.1903487
Stylianou Y, 2001, IEEE T SPEECH AUDI P, V9, P21, DOI 10.1109/89.890068
Tuan V. N., 1999, P EUR C SPEECH TECHN, P2805
Valbret H., 1992, P ICASSP92 MARCH, V1, P145
Verhelst W., 1993, P IEEE INT C AC SPEE, V2, P554
WEN D, 1998, P ICASSP SEATTL WA, V2, P857, DOI 10.1109/ICASSP.1998.675400
YEGNANARAYANA B, 1995, INT CONF ACOUST SPEE, P776, DOI 10.1109/ICASSP.1995.479809
NR 40
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2011
VL 53
IS 4
BP 552
EP 566
DI 10.1016/j.specom.2011.01.008
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 742GJ
UT WOS:000288929000007
ER
PT J
AU Ananthakrishnan, G
Engwall, O
AF Ananthakrishnan, G.
Engwall, Olov
TI Mapping between acoustic and articulatory gestures
SO SPEECH COMMUNICATION
LA English
DT Article
DE Acoustic gestures; Articulatory gestures; Acoustic-to-articulatory
inversion; Critical trajectory error
ID SPEECH; SEGMENTATION; MOVEMENTS; FEATURES; MODEL; RECOGNITION;
PERCEPTION; HMM
AB This paper proposes a definition for articulatory as well as acoustic gestures along with a method to segment the measured articulatory trajectories and acoustic waveforms into gestures. Using a simultaneously recorded acoustic-articulatory database, the gestures are detected based on finding critical points in the utterance, both in the acoustic and articulatory representations. The acoustic gestures are parameterized using 2-D cepstral coefficients. The articulatory trajectories are essentially the horizontal and vertical movements of Electromagnetic Articulography (EMA) coils placed on the tongue, jaw and lips along the midsagittal plane. The articulatory movements are parameterized using the same 2-D DCT transformation that is applied to the acoustics. The relationship between the detected acoustic and articulatory gestures in terms of the timing as well as the shape is studied. In order to study this relationship further, acoustic-to-articulatory inversion is performed using GMM-based regression. The accuracy of predicting the articulatory trajectories from the acoustic waveforms is on par with state-of-the-art frame-based methods with dynamical constraints (with an average error of 1.45-1.55 mm for the two speakers in the database). In order to evaluate the acoustic-to-articulatory inversion in a more intuitive manner, a method based on the error in estimated critical points is suggested. Using this method, it was noted that the estimated articulatory trajectories using the acoustic-to-articulatory inversion methods were still not accurate enough to be within the perceptual tolerance of audio-visual asynchrony. (C) 2011 Elsevier B.V. All rights reserved.
C1 [Ananthakrishnan, G.; Engwall, Olov] KTH Royal Inst Technol, Sch Comp Sci & Commun, Ctr Speech Technol CTT, SE-10044 Stockholm, Sweden.
RP Ananthakrishnan, G (reprint author), KTH Royal Inst Technol, Sch Comp Sci & Commun, Ctr Speech Technol CTT, SE-10044 Stockholm, Sweden.
EM agopal@kth.se; engwall@kth.se
FU Swedish Research Council [621-2008-4490]
FX This work is supported by the Grant 621-2008-4490 from the Swedish
Research Council.
CR ANANTHAKRISHNAN G, 2006, P INT C INT SENS INF, P115
Ananthakrishnan G., 2009, P INT BRIGHT UK, P2799
ARIKI Y, 1989, IEE PROC-I, V136, P133
Atal S., 1978, J ACOUST SOC AM, V63, P1535
Bilmes J., 1998, INT COMPUT SCI I, V4, P1
Browman Catherine, 1986, PHONOLOGY YB, V3, P219
CHILDERS DG, 1995, SPEECH COMMUN, V16, P127, DOI 10.1016/0167-6393(94)00050-K
Diehl RL, 2004, ANNU REV PSYCHOL, V55, P149, DOI 10.1146/annurev.psych.55.090902.142028
DUSAN S, 2000, P 5 SEM SPEECH PROD, P237
ENGWALL O, 2006, P 7 INT SEM SPEECH P, P469
FARHAT A, 1993, P EUROSPEECH, P657
Fowler CA, 1996, J ACOUST SOC AM, V99, P1730, DOI 10.1121/1.415237
GHOLAMPOUR I, 1998, P INT C SPOK LANG PR, P1555
Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636
Hoole P., 1996, FORSCHUNGSBERICHTE I, V34, P158
KATSAMANIS A, 2008, P EUR SIGN PROC C LA
KEATING PA, 1984, LANGUAGE, V60, P286, DOI 10.2307/413642
Kjellstrom H, 2009, SPEECH COMMUN, V51, P195, DOI 10.1016/j.specom.2008.07.005
LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279
Liu SA, 1996, J ACOUST SOC AM, V100, P3417, DOI 10.1121/1.416983
MACNEILA.PF, 1970, PSYCHOL REV, V77, P182, DOI 10.1037/h0029070
MAEDA S, 1988, J ACOUST SOC AM, V84, pS146, DOI 10.1121/1.2025845
Markov K, 2006, SPEECH COMMUN, V48, P161, DOI 10.1016/j.specom.2005.07.003
McGowan RS, 2009, J ACOUST SOC AM, V126, P2011, DOI 10.1121/1.3184581
MILLER JD, 1989, J ACOUST SOC AM, V85, P2114, DOI 10.1121/1.397862
MILNER B, 1995, P EUROSPEECH, P519
MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861
Neiberg D, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1485
NEIBERG D, 2009, P INTERSPEECH BRIGHT, P1387
OUNI S, 2002, P INT C SPOK LANG PR, P2301
OZBEK IY, 2009, P INT BRIGHT UK, P2807
PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204
Perrier P, 2008, J NEUROPHYSIOL, V100, P1171, DOI 10.1152/jn.01116.2007
Qin C, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2306
Reeves B., 1993, EFFECTS AUDIO VIDEO
Richmond K., 2002, THESIS CTR SPEECH TE
Richmond K, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P577
SAITO T, 1998, P ICSLP SYDN AUSTR, V7, P2839
Sarkar A, 2005, INT CONF ACOUST SPEE, P397
SCHMIDT RA, 1979, PSYCHOL REV, V86, P415, DOI 10.1037//0033-295X.86.5.415
SENEFF S, 1988, TRANSCRIPTION ALIGNM
Stephenson T., 2000, P INT C SPOK LANG PR, P951
Stevens KN, 2002, J ACOUST SOC AM, V111, P1872, DOI 10.1121/1.1458026
Sung H. G., 2004, THESIS RICE U HOUSTO
Svendsen T., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0)
Toda T., 2004, 5 ISCA SPEECH SYNTH, P31
Toda T., 2004, P ICSLP JEJ ISL KOR, P1129
Toda T, 2008, SPEECH COMMUN, V50, P215, DOI 10.1016/j.specom.2007.09.001
Toledano DT, 2003, IEEE T SPEECH AUDI P, V11, P617, DOI 10.1109/TSA.2003.813579
Toutios A., 2003, 6 HELL EUR C COMP MA, P1
VANHEMERT JP, 1991, IEEE T SIGNAL PROCES, V39, P1008, DOI 10.1109/78.80941
VIVIANI P, 1982, NEUROSCIENCE, V7, P431, DOI 10.1016/0306-4522(82)90277-9
Wrench A., 1999, MOCHA TIMIT ARTICULA
Wrench A. A., 2000, P ICSLP BEIJ CHIN, P145
Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X
Zhang L, 2008, IEEE SIGNAL PROC LET, V15, P245, DOI 10.1109/LSP.2008.917004
ZLOKARNIK I, 1993, P 3 EUR C SPEECH COM, P2215
ZUE V, 1989, P IEEE INT C ACOUSTI, P389
NR 58
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2011
VL 53
IS 4
BP 567
EP 589
DI 10.1016/j.specom.2011.01.009
PG 23
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 742GJ
UT WOS:000288929000008
ER
PT J
AU Vayrynen, E
Toivanen, J
Seppanen, T
AF Vayrynen, Eero
Toivanen, Juhani
Seppanen, Tapio
TI Classification of emotion in spoken Finnish using vowel-length segments:
Increasing reliability with a fusion technique
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic classification of emotion; Prosodic features; Vocal source
features; Classifier fusion; Vowel segments; Spoken Finnish
ID FLOATING SEARCH METHODS; FEATURE-SELECTION; RECOGNITION; SPEECH;
QUOTIENT; STRESS
AB Classification of the emotional content of short Finnish emotional [a:] vowel speech samples is performed using prosodic features derived from vocal source parameters and traditional intonation contour parameters. A decision-level fusion classification architecture based on multiple kNN classifiers is proposed for multimodal fusion of speech prosody and vocal source experts. The sum fusion rule and the sequential forward floating search (SFFS) algorithm are used to produce leveraged expert classifiers. Automatic classification tests on five emotional classes demonstrate that emotional content classification performance significantly above chance level is achievable using both prosodic and vocal source features. The fusion classification approach is further shown to be capable of emotional content classification in the vowel domain at a level approaching that of the human reference. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Vayrynen, Eero; Seppanen, Tapio] Univ Oulu, Elect & Informat Engn Dept, Comp Engn Lab, FI-90014 Oulu, Finland.
[Toivanen, Juhani] Acad Finland, FI-90014 Oulu, Finland.
[Toivanen, Juhani] Univ Oulu, Elect & Informat Engn Dept, Informat Proc Lab, FI-90014 Oulu, Finland.
RP Vayrynen, E (reprint author), Univ Oulu, Elect & Informat Engn Dept, Comp Engn Lab, POB 4500, FI-90014 Oulu, Finland.
EM eero.vayrynen@ee.oulu.fi; juhani.toivanen@ee.oulu.fi;
tapio.seppanen@ee.oulu.fi
FU Academy of Finland [1114920]; Infotech Oulu Graduate School of the
University of Oulu
FX Academy of Finland (project number 1114920) and Infotech Oulu Graduate
School of the University of Oulu are gratefully acknowledged for
financial support.
CR Airas M., 2005, P INT 2005 LISB PORT, P2145
ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R
Alku P, 2002, J ACOUST SOC AM, V112, P701, DOI 10.1121/1.1490365
Alku P, 1996, SPEECH COMMUN, V18, P131, DOI 10.1016/0167-6393(95)00040-2
Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
Barra R, 2006, INT CONF ACOUST SPEE, P1085
Batliner A, 2000, P ISCA WORKSH SPEECH, P195
Breazeal C, 2002, AUTON ROBOT, V12, P83, DOI 10.1023/A:1013215010749
CAMPBELL JP, 2003, 9 8 EUR C SPEECH COM, P2665
Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197
Cruttenden Alan, 1997, INTONATION, V2nd
DELLAERT F, 1996, P 4 INT C SPOK LANG, V3, P1970, DOI 10.1109/ICSLP.1996.608022
Drugman T., 2008, P IEEE INT C SIGN PR
Duda, 2001, PATTERN CLASSIFICATI
Engberg I. S., 1996, DOCUMENTATION DANISH
Gobl C, 2003, P ISCA TUT RES WORKS, P151
Kim S, 2007, 2007 IEEE NINTH WORKSHOP ON MULTIMEDIA SIGNAL PROCESSING, P48, DOI 10.1109/MMSP.2007.4412815
Kittler J, 2000, LECT NOTES COMPUT SC, V1876, P45
Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881
Laukkanen AM, 1996, J PHONETICS, V24, P313, DOI 10.1006/jpho.1996.0017
Lee C.M., 2001, P IEEE WORKSH AUT SP
McGilloway S., 2000, P ISCA WORKSH SPEECH, P207
Morrison D, 2007, SPEECH COMMUN, V49, P98, DOI 10.1016/j.specom.2006.11.004
Mozziconacci S., 1999, P 14 INT C PHON SCI, P2001
Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S0167-6393(03)00099-2
Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-5819(02)00141-6
Paeschke A., 2000, P ISCA WORKSH SPEECH, P75
Polzin T., 2000, P ISCA WORKSH SPEECH, P201
PUDIL P, 1994, PATTERN RECOGN LETT, V15, P1119, DOI 10.1016/0167-8655(94)90127-9
Pulakka H., 2005, THESIS HELSINKI U TE
Schaefer A, 2003, NEUROIMAGE, V18, P938, DOI 10.1016/S1053-8119(03)00009-0
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
SCHULLER B, 2006, P 32 DTSCH JAHR AK D
Seppanen Tapio, 2003, P EUR GEN SWITZ, P717
Slaney M, 2003, SPEECH COMMUN, V39, P367, DOI 10.1016/S0167-6393(02)00049-3
Somol P, 1999, PATTERN RECOGN LETT, V20, P1157, DOI 10.1016/S0167-8655(99)00083-5
Suomi K., 2008, STUDIA HUMANIORA OUL, V9
Suomi K, 2003, J PHONETICS, V31, P113, DOI 10.1016/S0095-4470(02)00074-8
ten Bosch L, 2003, SPEECH COMMUN, V40, P213, DOI 10.1016/S0167-6393(02)00083-3
Toivanen J., 2003, P 15 INT C PHON SCI, V3, P2469
Toivanen J, 2004, LANG SPEECH, V47, P383
Ververidis D., 2004, P ICASSP2004
Vroomen J, 1998, J MEM LANG, V38, P133, DOI 10.1006/jmla.1997.2548
NR 43
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 269
EP 282
DI 10.1016/j.specom.2010.09.007
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300001
ER
PT J
AU Shinoda, K
Watanabe, Y
Iwata, K
Liang, YA
Nakagawa, R
Furui, S
AF Shinoda, Koichi
Watanabe, Yasushi
Iwata, Kenji
Liang, Yuan
Nakagawa, Ryuta
Furui, Sadaoki
TI Semi-synchronous speech and pen input for mobile user interfaces
SO SPEECH COMMUNICATION
LA English
DT Article
DE User interfaces; Speech recognition; Handwritten character recognition;
Multi-modal recognition; Adaptation
AB This paper proposes new interfaces using semi-synchronous speech and pen input for mobile environments. A user speaks while writing, and the pen input complements the speech so that recognition performance will be higher than with speech alone. Since the input speed and input information are different between the two modes, speaking and writing, a time lag always exists between them. Therefore, conventional multi-modal recognition algorithms cannot be directly applied to this interface. To tackle this problem, we developed a multi-modal recognition algorithm that can handle this asynchronicity (time-lag) by using a segment-based unification scheme and a method of adapting to the time-lag characteristics of individual users. Five different pen-input interfaces, each of which is assumed to be given for a phrase unit in speech, were evaluated in speech recognition experiments using noisy speech data. The recognition accuracy of the proposed method was higher than that of speech alone in all five interfaces. We also carried out a subjective test to examine the usability of each interface. We found a trade-off between usability and improvement in recognition performance. (C) 2010 Elsevier B.V. All rights reserved.
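A toy late-fusion sketch (not the authors' algorithm): speech phrase hypotheses are rescored by pen strokes whose onsets fall inside an assumed per-user lag window, which is the basic asynchrony idea described above. All scores, lag values and candidate strings are invented.

def rescore(phrase_candidates, pen_strokes, lag_mean=0.8, lag_tol=0.5, bonus=2.0):
    # phrase_candidates: list of (text, acoustic_score, start_time)
    # pen_strokes: list of (first_char, time); pen input is assumed to lag
    # behind speech by roughly lag_mean seconds, within lag_tol.
    rescored = []
    for text, score, t_start in phrase_candidates:
        for char, t_pen in pen_strokes:
            if text.startswith(char) and abs((t_pen - t_start) - lag_mean) <= lag_tol:
                score += bonus
        rescored.append((text, score))
    return max(rescored, key=lambda x: x[1])

candidates = [("tokyo", 4.1, 0.2), ("kyoto", 4.3, 0.2)]
strokes = [("t", 1.1)]                      # the user wrote "t" about 0.9 s after speaking
print(rescore(candidates, strokes))         # -> ('tokyo', 6.1)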
C1 [Shinoda, Koichi; Watanabe, Yasushi; Iwata, Kenji; Liang, Yuan; Nakagawa, Ryuta; Furui, Sadaoki] Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan.
RP Shinoda, K (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1 Ookayama, Tokyo 1528552, Japan.
EM shinoda@cs.titech.ac.jp
RI Shinoda, Koichi/D-3198-2014
OI Shinoda, Koichi/0000-0003-1095-3203
FU JSPS [15300054, 20300063]
FX This research was partially supported by JSPS Grants-in-Aid for
Scientific Research (B) 15300054 and 20300063.
CR BAN H, 2004, P INTERSPEECH 2004 I
HAYAMIZU S, 1993, IEICE T INF SYST, VE76D, P17
Hui PY, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1197
Itahashi S., 1991, Journal of the Acoustical Society of Japan, V47
Itou K., 1998, P ICSLP, P3261
Lee A., 2001, P EUR C SPEECH COMM, P1691
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
NAKAGAWA M, 1995, TECHNICAL REPORT IEI, V95, P43
NAKAI N, 2001, P ICDAR 2001, P491
Tamura S, 2004, J VLSI SIG PROC SYST, V36, P117, DOI 10.1023/B:VLSI.0000015091.47302.07
Watanabe Y, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2675
WATANABE Y, 2007, P ICASSP 2007 HON HA, V4, P409
Wu LZ, 1999, IEEE T MULTIMEDIA, V1, P334
ZHOU X, 2006, P ICASSP 2006, V1, P609
JAPANESE DICTATION T
NR 15
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 283
EP 291
DI 10.1016/j.specom.2010.10.001
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300002
ER
PT J
AU Vieru, B
de Mareueil, PB
Adda-Decker, M
AF Vieru, Bianca
de Mareueil, Philippe Boula
Adda-Decker, Martine
TI Characterisation and identification of non-native French accents
SO SPEECH COMMUNICATION
LA English
DT Article
DE Foreign accents; Non-native French; Perceptual experiments; Automatic
speech alignment; Pronunciation variants; Data mining techniques;
Automatic classification
ID FOREIGN ACCENT; PRONUNCIATION VARIANTS; ENGLISH; PERCEPTION; LANGUAGE;
SPEECH
AB This paper focuses on foreign accent characterisation and identification in French. How many accents may a native French speaker recognise and which cues does (s)he use? Our interest concentrates on French productions stemming from speakers of six different mother tongues: Arabic, English, German, Italian, Portuguese and Spanish, also compared with native French speakers (from the Ile-de-France region). Using automatic speech processing, our objective is to identify the most reliable acoustic cues distinguishing these accents, and to link these cues with human perception. We measured acoustic parameters such as duration and voicing for consonants, the first two formant values for vowels, word-final schwa-related prosodic features and the percentages of confusions obtained using automatic alignment including non-standard pronunciation variants. Machine learning techniques were used to select the most discriminant cues distinguishing different accents and to classify speakers according to their accents. The results obtained in automatic identification of the different linguistic origins under investigation compare favourably to perceptual data. Major identified accent-specific cues include the devoicing of voiced stop consonants, /b/similar to/v/ and /s/similar to/z/ confusions, the "rolled r" and schwa fronting or raising. These cues can contribute to improve pronunciation modeling in automatic speech recognition of accented speech. (C) 2010 Elsevier B.V. All rights reserved.
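A hypothetical sketch of the data-mining step summarised above: univariate selection of the most discriminant cues followed by a classifier, evaluated by cross-validation. The cue names and data are invented, and the classifier choice is illustrative rather than the one used in the study.

import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
cues = ["stop_voicing_ratio", "schwa_F2", "r_trill_rate", "s_z_confusion", "vowel_dur"]
y = rng.integers(0, 6, size=120)                                     # six L1 backgrounds
X = rng.normal(size=(120, len(cues))) + np.outer(y, rng.normal(size=len(cues))) * 0.3

pipe = make_pipeline(SelectKBest(f_classif, k=3), LogisticRegression(max_iter=1000))
print("cross-validated accuracy:", cross_val_score(pipe, X, y, cv=5).mean())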
C1 [Vieru, Bianca; de Mareueil, Philippe Boula; Adda-Decker, Martine] LIMSI CNRS, F-91403 Orsay, France.
RP Adda-Decker, M (reprint author), LIMSI CNRS, BP 133, F-91403 Orsay, France.
EM madda@limsi.fr
CR Abdelli-Beruh NB, 2004, PHONETICA, V61, P201, DOI 10.1159/000084158
Adank Patti, 2003, THESIS RADBOUD U NIJ
Adda-Decker M, 1999, SPEECH COMMUN, V29, P83, DOI 10.1016/S0167-6393(99)00032-1
Adda-Decker M., 2007, P INT C PHON SCI ICP, P613
ALBA O, 2001, MANUAL FONETICA HISP
ANGKITITRAKUL P, 2003, P INTERSPEECH 2003 E, P1353
Arai T., 1997, P EUR RHOD GREEC, P1011
Arslan LM, 1997, J ACOUST SOC AM, V102, P28, DOI 10.1121/1.419608
BARTKOVA K, 2004, P SPECOM 04 INT C SP, P22
Berkling K, 2001, SPEECH COMMUN, V35, P125, DOI 10.1016/S0167-6393(00)00100-X
Boersma P., 2001, GLOT INT, V5, P341
BOSCH L, 2002, P ISCA WORKSH PRON M, P111
Boula de Mareuil P., 2004, P INT JEJ ISL KOR, P341
Bouselmi G, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P109
Calliope, 1989, PAROLE SON TRAITEMEN
CINCAREK T, 2004, P ICSLP JEJ ISL KOR, P1509
Clopper CG, 2004, J PHONETICS, V32, P111, DOI 10.1016/S0095-4470(03)00009-3
Delattre P, 1965, COMP PHONETIC FEATUR
DELLWO V, 2010, THESIS U BONN GERMAN
de Mareuil PB, 2006, PHONETICA, V63, P247, DOI 10.1159/000097308
DISNER SF, 1980, J ACOUST SOC AM, V67, P253, DOI 10.1121/1.383734
Durand J., 2003, TRIBUNE INT LANGUES, V33, P3
FERRAGNE E, 2007, SPEAKER CLASSIFICATI, V2, P243
Flege J., 1982, STUDIES 2 LANGUAGE A, V5, P1, DOI 10.1017/S0272263100004563
FLEGE JE, 1984, J ACOUST SOC AM, V76, P692, DOI 10.1121/1.391256
FLEGE JE, 1981, LANG SPEECH, V24, P125
Flege JE, 2003, SPEECH COMMUN, V40, P467, DOI 10.1016/S0167-6393(02)00128-0
FRELANDRICARD M, 1996, REV PHONET APPL, P61
Frota S., 2007, SEGMENTAL PROSODIC I, P131
Gauvain J.L., 2005, P INT LISB PORT, P1665
Gendrot C., 2005, P INT LISB, P2453
Ghazali S., 2002, P 1 INT C SPEECH PRO, P331
Goronzy S, 2004, SPEECH COMMUN, V42, P109, DOI 10.1016/j.specom.2003.09.003
Grabe Esther, 2002, LAB PHONOLOGY, V7, P515
Gutierrez Diez F., 2008, J ACOUST SOC AM, V123, P3886
GUYON I, 2003, J MACHINE LEARN RES, V3, P1265
Harrington J, 2000, NATURE, V408, P927, DOI 10.1038/35050160
HUCKVALE M, 2007, P INT C PHON SCI SAA, P1821
HUCKVALE M, 2004, P ICSLP, P29
Ihaka R., 1996, J COMPUTATIONAL GRAP, V5, P299, DOI DOI 10.2307/1390807
Jilka M., 2000, THESIS U STUTTGART G
King R. W., 1997, P EUROSPEECH 97, P2323
Lamel L, 2007, INT CONF ACOUST SPEE, P997
LIVESCU K, 2000, ACOUST SPEECH SIG PR, P1683
Magen HS, 1998, J PHONETICS, V26, P381, DOI 10.1006/jpho.1998.0081
Martin A. F., 2008, P OD SPEAK LANG REC
NEAREY TM, 1989, J ACOUST SOC AM, V85, P2088, DOI 10.1121/1.397861
Piske T, 2001, J PHONETICS, V29, P191, DOI 10.1006/jpho.2001.0134
Quilis Antonio, 1993, TRATADO FONOLOGIA FO
Ramus F., 1999, THESIS EHESS PARIS
Raux A., 2004, P ICSLP 04 INT C SPO, P613
ROMANO A, 2010, DIMENSIONE TEMPORALE, P45
Rouas JL, 2008, SPEECH COMMUN, V50, P965, DOI 10.1016/j.specom.2008.05.006
Sangwan A, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P172
SCHADEN S, 2003, P INT C LANG RES EV, P1395
SILKE G, 2004, SPEECH COMMUN, V42, P109
VELOSO J, 2007, ACT JOURN ET LING NA, P55
VIERUDIMULESCU B, 2008, THESIS U PARIS SUD O
Witten I.H., 2005, DATA MINING PRACTICA
WOEHRLING C, 2009, P INT BRIGHT UK, P2183
Woehrling Cecile, 2006, REV PAROLE, V37, P25
Yamada R. A., 1994, P INT C SPOK LANG PR, P2023
NR 62
TC 9
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 292
EP 310
DI 10.1016/j.specom.2010.10.002
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300003
ER
PT J
AU Mayo, C
Clark, RAJ
King, S
AF Mayo, Catherine
Clark, Robert A. J.
King, Simon
TI Listeners' weighting of acoustic cues to synthetic speech naturalness: A
multidimensional scaling analysis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech synthesis; Evaluation; Speech perception; Acoustic cue weighting;
Multidimensional scaling
ID VOICE QUALITY ASSESSMENT; FINAL STOP CONSONANTS; COMPLEX SOUNDS;
CHILDREN; ADULTS; PERCEPTION; CATEGORIZATION; ENGLISH; RATINGS
AB The quality of current commercial speech synthesis systems is now so high that system improvements are being made at subtle sub- and supra-segmental levels. Human perceptual evaluation of such subtle improvements requires a highly sophisticated level of perceptual attention to specific acoustic characteristics or cues. However, it is not well understood what acoustic cues listeners attend to by default when asked to evaluate synthetic speech. It may, therefore, be potentially quite difficult to design an evaluation method that allows listeners to concentrate on only one dimension of the signal, while ignoring others that are perceptually more important to them.
The aim of the current study was to determine which acoustic characteristics of unit-selection synthetic speech are most salient to listeners when evaluating the naturalness of such speech. This study made use of multidimensional scaling techniques to analyse listeners' pairwise comparisons of synthetic speech sentences. Results indicate that listeners place a great deal of perceptual importance on the presence of artifacts and discontinuities in the speech, somewhat less importance on aspects of segmental quality, and very little importance on stress/intonation appropriateness. These relative differences in importance will impact on listeners' ability to attend to these different acoustic characteristics of synthetic speech, and should therefore be taken into account when designing appropriate methods of synthetic speech evaluation. (C) 2010 Elsevier B.V. All rights reserved.
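A minimal sketch of the analysis idea, assuming listeners' pairwise judgements have already been aggregated into a symmetric dissimilarity matrix (the numbers below are invented): non-metric multidimensional scaling recovers a low-dimensional configuration whose axes can then be interpreted against acoustic characteristics of the stimuli.

import numpy as np
from sklearn.manifold import MDS

# Symmetric dissimilarity matrix for 4 synthetic-speech stimuli (toy values).
D = np.array([[0.0, 0.2, 0.7, 0.8],
              [0.2, 0.0, 0.6, 0.9],
              [0.7, 0.6, 0.0, 0.3],
              [0.8, 0.9, 0.3, 0.0]])

mds = MDS(n_components=2, metric=False, dissimilarity="precomputed", random_state=0)
coords = mds.fit_transform(D)
print(coords)        # stimulus positions in the 2-D perceptual space
print(mds.stress_)   # badness of fit of the 2-D configuration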
C1 [Mayo, Catherine; Clark, Robert A. J.; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland.
RP Mayo, C (reprint author), Univ Edinburgh, Ctr Speech Technol Res, 10 Crichton St, Edinburgh EH8 9AB, Midlothian, Scotland.
EM catherin@ling.ed.ac.uk; robert@cstr.ed.ac.uk; simon.king@ed.ac.uk
CR Allen P, 1997, J ACOUST SOC AM, V102, P2255, DOI 10.1121/1.419637
Allen P, 2002, J ACOUST SOC AM, V112, P211, DOI 10.1121/1.1482075
BAILLY G, 2003, ISCA SPEC SESS HOT T
BEST CT, 1981, PERCEPT PSYCHOPHYS, V29, P191, DOI 10.3758/BF03207286
Black A., 1997, FESTIVAL SPEECH SYNT
Bradlow AR, 1999, PERCEPT PSYCHOPHYS, V61, P206, DOI 10.3758/BF03206883
Cernak M, 2005, P EUR C AC, P2725
CERNAK M, 2009, P ICSVI INT C SOUND
CHEN JD, 1999, P EUROSPEECH, P611
Christensen LA, 1997, J ACOUST SOC AM, V102, P2297, DOI 10.1121/1.419639
Clark R. A. J., 1999, P EUR 99 6 EUR C SPE, P1623
Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014
CLARK RAJ, 2003, INT C PHON SCI BARCE, P1141
*EXP ADV GROUP LAN, 1996, EV NAT LANG PROC SYS
FALK TH, 2008, P BLIZZ WORKSH BRISB
Fisher C, 1996, SIGNAL TO SYNTAX: BOOTSTRAPPING FROM SPEECH TO GRAMMAR IN EARLY ACQUISITION, P343
Francis AL, 2008, J ACOUST SOC AM, V124, P1234, DOI 10.1121/1.2945161
Garofolo J., 1988, GETTING STARTED DARP
GORDON PC, 1993, COGNITIVE PSYCHOL, V25, P1, DOI 10.1006/cogp.1993.1001
Hall JL, 2001, J ACOUST SOC AM, V110, P2167, DOI 10.1121/1.1397322
Hazan V, 2000, J PHONETICS, V28, P377, DOI 10.1006/jpho.2000.0121
Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9
HAZAN V, 1998, ICSLP SYD AUSTR, P2163
HIRST D, 1998, P ESCA COCOSDA WORKS
*ITU T, 1994, P85 ITUT
Iverson P, 2005, J ACOUST SOC AM, V118, P3267, DOI 10.1121/1.2062307
Iverson P, 2003, COGNITION, V87, pB47, DOI 10.1016/S0010-0277(02)00198-1
JILKA M, 2003, P 15 ICPHS BARC, P2549
JILKA M, 2005, P INT 2005 LISB PORT, P2393
Jusczyk P. W., 1997, DISCOVERY SPOKEN LAN
KLABBERS E, 2001, IEEE T SPEECH AUDIO, V9
Klabbers E., 1998, P ICSLP, P1983
Kreiman J, 1998, J ACOUST SOC AM, V104, P1598, DOI 10.1121/1.424372
Kreiman J, 2007, J ACOUST SOC AM, V122, P2354, DOI 10.1121/1.2770547
Kreiman J, 2000, J ACOUST SOC AM, V108, P1867, DOI 10.1121/1.1289362
KREIMAN J, 2004, J ACOUST SOC AM, V115, P2609
Kruskal J. B., 1978, SAGE U PAPER SERIES
LAMEL LF, 1989, P SPEECH I O ASS SPE, P2161
Marozeau J, 2003, J ACOUST SOC AM, V114, P2946, DOI 10.1121/1.1618239
Mayo C, 2004, J ACOUST SOC AM, V115, P3184, DOI 10.1121/1.1738838
Mayo C, 2005, J ACOUST SOC AM, V118, P1730, DOI 10.1121/1.1979451
MAYO C, 2005, P INT 2005 LISB PORT
MOLLER S, 2009, P NAG DAGA 2009 ROTT, P1168
Nittrouer S, 2004, J ACOUST SOC AM, V115, P1777, DOI 10.1121/1.1651192
CUTLER A, 1994, J MEM LANG, V33, P824, DOI 10.1006/jmla.1994.1039
PLUMPE M, 1998, P ESCA COCOSDA WORKS
RABINOV CR, 1995, J SPEECH HEAR RES, V38, P26
Schneider W., 2002, E PRIME USERS GUIDE
STYLIANOU Y, 2001, P ICASSP INT C AC SP
SYRDAL A, 2004, J ACOUST SOC AM, V115, P2543
SYRDAL AK, 2001, P EUROSPEECH AALB DE, P979
Turk A., 2006, METHODS EMPIRICAL PR, P1
VAINIO M, 2002, P IEEE 2002 WORKSH S
Vepa J., 2004, TEXT SPEECH SYNTHESI
WARDRIPFRUIN C, 1985, J ACOUST SOC AM, V77, P1907, DOI 10.1121/1.391833
WARDRIPFRUIN C, 1982, J ACOUST SOC AM, V71, P187, DOI 10.1121/1.387346
Watson J. M. M., 1997, THESIS QUEEN MARGARE
WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450
WOUTERS J, 1998, P ICSLP, V6, P2747, DOI DOI 10.1109/ICASSP.2001.941045
Zen H, 2007, COMPUT SPEECH LANG, V21, P153, DOI 10.1016/j.csl.2006.01.002
NR 60
TC 3
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 311
EP 326
DI 10.1016/j.specom.2010.10.003
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300004
ER
PT J
AU Paliwal, K
Schwerin, B
Wojcicki, K
AF Paliwal, Kuldip
Schwerin, Belinda
Wojcicki, Kamil
TI Role of modulation magnitude and phase spectrum towards speech
intelligibility
SO SPEECH COMMUNICATION
LA English
DT Article
DE Analysis frame duration; Modulation frame duration; Modulation domain;
Modulation magnitude spectrum; Modulation phase spectrum; Speech
intelligibility; Speech transmission index (STI);
Analysis-modification-synthesis (AMS)
ID TRANSMISSION INDEX; QUALITY ESTIMATION; RECOGNITION; ENHANCEMENT
AB In this paper our aim is to investigate the properties of the modulation domain and more specifically, to evaluate the relative contributions of the modulation magnitude and phase spectra towards speech intelligibility. For this purpose, we extend the traditional (acoustic domain) analysis modification synthesis framework to include modulation domain processing. We use this framework to construct stimuli that retain only selected spectral components, for the purpose of objective and subjective intelligibility tests. We conduct three experiments. In the first, we investigate the relative contributions to intelligibility of the modulation magnitude, modulation phase, and acoustic phase spectra. In the second experiment, the effect of modulation frame duration on intelligibility for processing of the modulation magnitude spectrum is investigated. In the third experiment, the effect of modulation frame duration on intelligibility for processing of the modulation phase spectrum is investigated. Results of these experiments show that both the modulation magnitude and phase spectra are important for speech intelligibility, and that significant improvement is gained by the inclusion of acoustic phase information. They also show that smaller modulation frame durations improve intelligibility when processing the modulation magnitude spectrum, while longer frame durations improve intelligibility when processing the modulation phase spectrum. (C) 2010 Elsevier B.V. All rights reserved.
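A sketch of the two-stage analysis implied above, with assumed frame settings: an acoustic STFT, followed by a second STFT along the frame axis of each acoustic-frequency bin's magnitude trajectory, yielding the modulation magnitude and phase spectra. The modification and synthesis stages are omitted.

import numpy as np
from scipy.signal import stft

fs = 8000
x = np.random.default_rng(2).normal(size=fs * 2)            # stand-in for a speech signal

# Stage 1: acoustic STFT (32 ms frames, 8 ms hop).
f_ac, t_ac, X = stft(x, fs=fs, nperseg=256, noverlap=192)
acoustic_mag, acoustic_phase = np.abs(X), np.angle(X)

# Stage 2: modulation STFT along the frame axis of each acoustic bin
# (32 acoustic frames = 256 ms modulation frame).
frame_rate = fs / (256 - 192)                                # acoustic frames per second
f_mod, t_mod, M = stft(acoustic_mag, fs=frame_rate, nperseg=32, noverlap=24, axis=-1)
mod_mag, mod_phase = np.abs(M), np.angle(M)
print(acoustic_mag.shape, mod_mag.shape)   # (freq, frames) -> (freq, mod_freq, mod_frames)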
C1 [Paliwal, Kuldip; Schwerin, Belinda; Wojcicki, Kamil] Griffith Univ, Signal Proc Lab, Sch Engn, Brisbane, Qld 4111, Australia.
RP Schwerin, B (reprint author), Griffith Univ, Signal Proc Lab, Sch Engn, Nathan Campus, Brisbane, Qld 4111, Australia.
EM belsch71@gmail.com
CR Atlas L., 2004, P IEEE INT C AC SPEE, V2, P761
Atlas LE, 2001, P SOC PHOTO-OPT INS, V4474, P1, DOI 10.1117/12.448636
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
Falk T., 2008, P INT WORKSH AC ECH
Falk T. H., 2007, P ISCA C INT SPEECH, P970
Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P90, DOI 10.1109/TASL.2009.2023679
Falk TH, 2010, IEEE T AUDIO SPEECH, V18, P1766, DOI 10.1109/TASL.2010.2052247
Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628
GREENBERG S, 1998, P INT C SPOK LANG PR, V6, P2803
GREENBERG S, 1997, P ICASSP, V3, P1647
GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317
Hanson B. A., 1993, P IEEE INT C AC SPEE, VII, P79
HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
Huang X., 2001, SPOKEN LANGUAGE PROC
KANEDERA N, 1998, ACOUST SPEECH SIG PR, P613
Kim DS, 2004, IEEE SIGNAL PROC LET, V11, P849, DOI 10.1109/LSP.2004.835466
Kim DS, 2005, IEEE T SPEECH AUDI P, V13, P821, DOI 10.1109/TSA.2005.851924
Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6
Liu L, 1997, SPEECH COMMUN, V22, P403, DOI 10.1016/S0167-6393(97)00054-X
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lyons JG, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P387
OPPENHEIM AV, 1981, P IEEE, V69, P529, DOI 10.1109/PROC.1981.12022
PALIWAL K, 2010, SPL101 GRIFF U
Paliwal K, 2008, IEEE SIGNAL PROC LET, V15, P785, DOI 10.1109/LSP.2008.2005755
Paliwal K, 2010, SPEECH COMMUN, V52, P450, DOI 10.1016/j.specom.2010.02.004
Paliwal KK, 2005, SPEECH COMMUN, V45, P153, DOI 10.1016/j.specom.2004.08.001
Payton KL, 1999, J ACOUST SOC AM, V106, P3637, DOI 10.1121/1.428216
PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
Quatieri T. F., 2002, DISCRETE TIME SPEECH
Rix A., 2001, P862 ITUT
SCHROEDER MR, 1975, P IEEE, V63, P1332, DOI 10.1109/PROC.1975.9941
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
Thompson JK, 2003, P IEEE INT C AC SPEE, V5, P397
Tyagi V., 2003, P ISCA EUR C SPEECH, P981
WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920
WOJCICKI K, 2007, P IEEE INT C AC SPEE, V4, P729
Wu S., 2009, INT C DIG SIGN PROC
NR 39
TC 9
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 327
EP 339
DI 10.1016/j.specom.2010.10.004
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300005
ER
PT J
AU Ma, JF
Loizou, PC
AF Ma, Jianfen
Loizou, Philipos C.
TI SNR loss: A new objective measure for predicting the intelligibility of
noise-suppressed speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility; Speech enhancement; Speech intelligibility
indices
ID SPECTRAL AMPLITUDE ESTIMATOR; RECEPTION THRESHOLD; VECTOR QUANTIZATION;
FLUCTUATING NOISE; SUBSPACE APPROACH; ENHANCEMENT; REDUCTION; INDEX;
ALGORITHMS; PARAMETERS
AB Most of the existing intelligibility measures do not account for the distortions present in processed speech, such as those introduced by speech-enhancement algorithms. In the present study, we propose three new objective measures that can be used for prediction of intelligibility of processed (e.g., via an enhancement algorithm) speech in noisy conditions. All three measures use a critical-band spectral representation of the clean and noise-suppressed signals and are based on the measurement of the SNR loss incurred in each critical band after the corrupted signal goes through a speech enhancement algorithm. The proposed measures are flexible in that they can provide different weights to the two types of spectral distortions introduced by enhancement algorithms, namely spectral attenuation and spectral amplification distortions. The proposed measures were evaluated with intelligibility scores obtained by normal-hearing listeners in 72 noisy conditions involving noise-suppressed speech (consonants and sentences) corrupted by four different maskers (car, babble, train and street interferences). Highest correlation (r = -0.85) with sentence recognition scores was obtained using a variant of the SNR loss measure that only included vowel/consonant transitions and weak consonant information. High correlation was maintained for all noise types, with a maximum correlation (r = -0.88) achieved in street noise conditions. (C) 2010 Elsevier B.V. All rights reserved.
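A simplified, hypothetical reading of the per-band measurement (not the paper's exact formula): compare the band SNR of the noisy input with the band SNR of the enhanced output, both referenced to the clean signal. Band edges, FFT size and the toy signals are placeholders.

import numpy as np

def band_energies(mag, band_edges_hz, fs, nfft):
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return np.array([np.sum(mag[(freqs >= lo) & (freqs < hi)] ** 2)
                     for lo, hi in band_edges_hz])

def snr_loss(clean, noisy, enhanced, fs=8000, nfft=512,
             bands=((100, 300), (300, 700), (700, 1500), (1500, 3500))):
    C = np.abs(np.fft.rfft(clean, nfft))
    N = np.abs(np.fft.rfft(noisy - clean, nfft))       # additive-noise component
    E = np.abs(np.fft.rfft(enhanced - clean, nfft))    # residual distortion after enhancement
    c, n, e = (band_energies(S, bands, fs, nfft) for S in (C, N, E))
    snr_in = 10 * np.log10(c / (n + 1e-12))
    snr_out = 10 * np.log10(c / (e + 1e-12))
    return snr_in - snr_out      # positive = SNR lost in that band; negative = SNR gained

rng = np.random.default_rng(3)
clean = rng.normal(size=4000)
noise = 0.5 * rng.normal(size=4000)
print(snr_loss(clean, clean + noise, clean + 0.3 * noise))  # mostly negative: enhancement helped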
C1 [Ma, Jianfen; Loizou, Philipos C.] Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA.
[Ma, Jianfen] Taiyuan Univ Technol, Taiyuan 030024, Shanxi, Peoples R China.
RP Loizou, PC (reprint author), Univ Texas Dallas, Dept Elect Engn, POB 830688,EC 33, Richardson, TX 75083 USA.
EM loizou@utdallas.edu
CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615
[Anonymous], 2000, P862 ITUT
ANSI, 1997, S351997 ANSI
Beerends J., 2004, P WORKSH MEAS SPEECH
Benesty J, 2009, SPRINGER TOP SIGN PR, V2, P1, DOI 10.1007/978-3-642-00296-0_1
Benesty J, 2008, IEEE T AUDIO SPEECH, V16, P757, DOI 10.1109/TASL.2008.919072
Berouti M., 1979, P IEEE INT C AC SPEE, P208
Chen JD, 2006, IEEE T AUDIO SPEECH, V14, P1218, DOI 10.1109/TSA.2005.860851
COHEN I, 2008, HDB SPEECH PROCESSIN, P873
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
FLETCHER H, 1950, J ACOUST SOC AM, V22, P89, DOI 10.1121/1.1906605
FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407
GREENBERG JE, 1993, J ACOUST SOC AM, V94, P3009, DOI 10.1121/1.407334
Gustafsson H, 2001, IEEE T SPEECH AUDI P, V9, P799, DOI 10.1109/89.966083
HIRSCH H, 2000, P ISCA ITRW ASR200
HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224
Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949
Hu Y, 2007, J ACOUST SOC AM, V122, P1777, DOI 10.1121/1.2766778
Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458
IEEE Subcommittee, 1969, IEEE T AUDIO ELECTRO, V3, P225, DOI DOI 10.1109/TAU.1969.1162058
Jabloun F, 2003, IEEE T SPEECH AUDI P, V11, P700, DOI 10.1109/TSA.2003.818031
Kamath S.D., 2002, IEEE INT C AC SPEECH
Kates J M, 1987, J Rehabil Res Dev, V24, P271
KATES JM, 1992, J ACOUST SOC AM, V91, P2236, DOI 10.1121/1.403657
Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575
KRYTER KD, 1962, J ACOUST SOC AM, V34, P1698
KRYTER KD, 1962, J ACOUST SOC AM, V34, P1689, DOI 10.1121/1.1909094
Loizou PC, 2011, IEEE T AUDIO SPEECH, V19, P47, DOI 10.1109/TASL.2010.2045180
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Ma JF, 2009, J ACOUST SOC AM, V125, P3387, DOI 10.1121/1.3097493
MATTILA V, 2003, P ONL WORKSH MEAS SP
Nein HW, 2001, IEEE T SPEECH AUDI P, V9, P73
PAAJANEN E, 2000, IEEE SPEECH COD WORK, P23
Paliwal K, 2008, IEEE SIGNAL PROC LET, V15, P785, DOI 10.1109/LSP.2008.2005755
Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363
PAVLOVIC CV, 1987, J ACOUST SOC AM, V82, P413, DOI 10.1121/1.395442
Quackenbush S. R., 1988, OBJECTIVE MEASURES S
Rhebergen KS, 2006, J ACOUST SOC AM, V120, P3988, DOI 10.1121/1.2358008
Rhebergen KS, 2005, J ACOUST SOC AM, V117, P2181, DOI 10.1121/1.1861713
Scalart P, 1996, INT CONF ACOUST SPEE, P629, DOI 10.1109/ICASSP.1996.543199
Spriet A, 2005, IEEE T SPEECH AUDI P, V13, P487, DOI 10.1109/TSA.2005.845821
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
YOON YS, 2006, SNR LOSS HEARING IMP
NR 44
TC 19
Z9 20
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 340
EP 354
DI 10.1016/j.specom.2010.10.005
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300006
ER
PT J
AU So, S
Paliwal, KK
AF So, Stephen
Paliwal, Kuldip K.
TI Suppressing the influence of additive noise on the Kalman gain for low
residual noise speech enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Kalman filtering; Speech enhancement; Linear prediction; Dolph-Chebychev
windows
ID LINEAR PREDICTION; FILTER
AB In this paper, we present a detailed analysis of the Kalman filter for the application of speech enhancement and identify its shortcomings when the linear predictor model parameters are estimated from speech that has been corrupted with additive noise. We show that when only noise-corrupted speech is available, the poor performance of the Kalman filter may be attributed to the presence of large values in the Kalman gain during low speech energy regions, which cause a large degree of residual noise to be present in the output. These large Kalman gain values result from poor estimates of the LPCs due to the presence of additive noise. This paper presents the analysis and application of the Kalman gain trajectory as a useful indicator of Kalman filter performance, which can be used to motivate further methods of improvement. As an example, we analyse the previously-reported application of long and overlapped tapered windows using Kalman gain trajectories to explain the reduction and smoothing of residual noise in the enhanced output. In addition, we investigate further extensions, such as Dolph-Chebychev windowing and iterative LPC estimation. This modified Kalman filter was found to have improved on the conventional and iterative versions of the Kalman filter in both objective and subjective testing. (C) 2011 Elsevier B.V. All rights reserved.
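A generic AR-model Kalman-filter enhancement sketch to make the framework analysed above concrete; the LPC order, framing, noise-variance estimate and toy signal are placeholder choices, and none of the paper's windowing or iterative refinements are included.

import numpy as np

def lpc(x, order):
    # LPC coefficients by the autocorrelation method (normal equations).
    r = np.correlate(x, x, mode="full")[len(x) - 1:len(x) + order]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return a, r[0] - a @ r[1:order + 1]              # coefficients, prediction-error variance

def kalman_enhance(noisy, order=10, frame=256, noise_var=None):
    if noise_var is None:
        noise_var = np.var(noisy[:frame])            # crude placeholder noise estimate
    out = np.zeros_like(noisy)
    x, P = np.zeros(order), np.eye(order)            # state: the last `order` clean samples
    H = np.zeros(order); H[-1] = 1.0                 # observation: newest sample plus noise
    for start in range(0, len(noisy) - frame + 1, frame):
        seg = noisy[start:start + frame]
        a, q = lpc(seg, order)                       # LPCs estimated from the *noisy* segment
        F = np.eye(order, k=1); F[-1, :] = a[::-1]   # companion-form transition matrix
        for n, y in enumerate(seg):
            x, P = F @ x, F @ P @ F.T
            P[-1, -1] += q
            K = P @ H / (H @ P @ H + noise_var)      # Kalman gain
            x = x + K * (y - H @ x)
            P = P - np.outer(K, H) @ P
            out[start + n] = x[-1]
    return out

# Toy AR(2) "speech" plus white noise.
rng = np.random.default_rng(4)
clean = np.zeros(8192)
for n in range(2, clean.size):
    clean[n] = 1.3 * clean[n - 1] - 0.6 * clean[n - 2] + 0.1 * rng.normal()
noisy = clean + 0.3 * rng.normal(size=clean.size)
enhanced = kalman_enhance(noisy)
print("noisy MSE:", np.mean((noisy - clean) ** 2),
      " enhanced MSE:", np.mean((enhanced - clean) ** 2))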
C1 [So, Stephen; Paliwal, Kuldip K.] Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Brisbane, Qld 4111, Australia.
RP So, S (reprint author), Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Brisbane, Qld 4111, Australia.
EM s.so@griffith.edu.au; k.paliwal@griffith.edu.au
RI So, Stephen/D-6649-2011
CR Astrom K. J., 1997, PRENTICE HALL INFORM
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
CROCHIERE RE, 1980, IEEE T ACOUST SPEECH, V28, P99, DOI 10.1109/TASSP.1980.1163353
EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Erkelens JS, 1997, IEEE T SPEECH AUDI P, V5, P116, DOI 10.1109/89.554773
Gabrea M, 1999, IEEE SIGNAL PROC LET, V6, P55, DOI 10.1109/97.744623
Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367
GIBSON JD, 1991, IEEE T SIGNAL PROCES, V39, P1732, DOI 10.1109/78.91144
Hayes M. H., 1996, STAT DIGITAL SIGNAL, V1st
Haykin S., 2002, PRENTICE HALL INFORM
Holmes J., 2001, SPEECH SYNTHESIS REC
Hu Y, 2006, P IEEE INT C AC SPEE, V1, P153
Hu Y, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1447
Kalman R.E, 1960, J BASIC ENG, V82, P35, DOI DOI 10.1115/1.3662552
Kay S. M., 1993, PRENTICE HALL SIGNAL, V1
Li C. J., 2006, THESIS AALBORG U DE
LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197
Ma N, 2006, IEEE T AUDIO SPEECH, V14, P19, DOI 10.1109/TSA.2005.858515
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
MEHRA RK, 1970, IEEE T AUTOMAT CONTR, V15, P175
OHYA T, 1994, IEEE 44 VEH TECHN C, P1680
Paliwal K., 1987, P IEEE INT C AC SPEE, V12, P177
Rix A., 2001, P862 ITUT
So S, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P391
SORQVIST P, 1997, P IEEE INT C AC SPEE, V2, P1219
WANG T, 2002, IEEE WORKSH SPEECH C
Wiener N., 1949, EXTRAPOLATION INTERP
NR 28
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 355
EP 378
DI 10.1016/j.specom.2010.10.006
PG 24
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300007
ER
PT J
AU Krause, JC
Pelley-Lopez, KA
Tessler, MP
AF Krause, Jean C.
Pelley-Lopez, Katherine A.
Tessler, Morgan P.
TI A method for transcribing the manual components of Cued Speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Transcription; Coarticulation; Cue accuracy; Handshape identification;
Placement identification
ID RECOGNITION; PERCEPTION; FRENCH; HAND; AID
AB Designed to allow visual communication of speech signals, Cued Speech consists of discrete hand signals that are produced in synchrony with the visual mouth movements of speech. The purpose of this paper is to describe a method for transcribing these hand signals. Procedures are presented for identifying (1) the steady-state portion of the cue to be analyzed, (2) the cue's handshape, and (3) the cue's placement. Reliability is evaluated, using materials from 12 cuers that were transcribed on two separate occasions (either by the original rater or a second rater). Results show very good intra-rater and inter-rater reliability on average, which remained good across a variety of individual cuers, even when the cuer's hand gestures were heavily coarticulated. Given its high reliability, this transcription method may be of benefit to applications that require systematic and quantitative analysis of Cued Speech production in various populations. In addition, some of the transcription principles from this method may be helpful in improving accuracy of automatic Cued Speech recognition systems. (C) 2010 Elsevier B.V. All rights reserved.
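A toy sketch of the reliability check described above: two raters' handshape codes for the same cues, scored as percent agreement and Cohen's kappa. The labels are invented, and the paper may have used different agreement statistics.

import numpy as np

def cohens_kappa(a, b):
    a, b = np.asarray(a), np.asarray(b)
    labels = np.unique(np.concatenate([a, b]))
    po = np.mean(a == b)                                        # observed agreement
    pe = sum(np.mean(a == l) * np.mean(b == l) for l in labels) # chance agreement
    return (po - pe) / (1.0 - pe)

rater1 = [1, 1, 2, 3, 5, 5, 4, 2, 2, 8]   # handshape codes 1-8 (invented)
rater2 = [1, 1, 2, 3, 5, 6, 4, 2, 2, 8]
print("agreement:", np.mean(np.array(rater1) == np.array(rater2)))
print("kappa:", cohens_kappa(rater1, rater2))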
C1 [Krause, Jean C.; Pelley-Lopez, Katherine A.; Tessler, Morgan P.] Univ S Florida, Dept Commun Sci & Disorders, Tampa, FL 33620 USA.
RP Krause, JC (reprint author), Univ S Florida, Dept Commun Sci & Disorders, Tampa, FL 33620 USA.
EM jeankrause@usf.edu
FU National Institute on Deafness and Other Communication Disorders (NIH)
[5 R03 DC 007355]
FX The authors wish to thank Dana Herrington and Jessica Vick for many
helpful technical discussions, and Joe Frisbie for the cue chart used in
Fig. 1. Financial support for this work was provided in part by a grant
from the National Institute on Deafness and Other Communication
Disorders (NIH Grant No. 5 R03 DC 007355).
CR Alegria J, 2005, J DEAF STUD DEAF EDU, V10, P122, DOI 10.1093/deafed/eni013
Attina V, 2004, SPEECH COMMUN, V44, P197, DOI 10.1016/j.specom.2004.10.013
Auer ET, 2007, J SPEECH LANG HEAR R, V50, P1157, DOI 10.1044/1092-4388(2007/080)
CORNETT RO, 1967, AM ANN DEAF, V112, P3
CORNETT RO, 2001, CUED SPEECH RESOURCE
CORNETT RO, 1977, PROCESS AIDS DEAF, P224
*CUED SPEECH ASS U, 2009, WRIT CUES CUE SCRIPT
Duchnowski P, 2000, IEEE T BIO-MED ENG, V47, P487, DOI 10.1109/10.828148
EBRAHIMI D, 1991, IEEE T BIOMED ENG, V38, P44
ERBER NP, 1975, J SPEECH HEAR DISORD, V40, P481
*FILMS HUM, 1989, LIF CYCL PLANTS
Gibert G, 2005, J ACOUST SOC AM, V118, P1144, DOI 10.1121/1.1944587
Grant KW, 1998, J ACOUST SOC AM, V103, P2677, DOI 10.1121/1.422788
Heracleous P, 2009, IEEE SIGNAL PROC LET, V16, P339, DOI 10.1109/LSP.2009.2016011
Krause JC, 2008, J DEAF STUD DEAF EDU, V13, P432, DOI 10.1093/deafed/enm059
Leybaert J., 2003, OXFORD HDB DEAF STUD, P261
Leybaert J, 2001, J SPEECH LANG HEAR R, V44, P949, DOI 10.1044/1092-4388(2001/074)
MASSARO DW, 2009, CUED SPEECH CUED LAN, pCH20
*NAT CUED SPEECH A, 1994, CUED SPEECH J, V5, P73
NICHOLLS GH, 1982, J SPEECH HEAR RES, V25, P262
UCHANSKI RM, 1994, J REHABIL RES DEV, V31, P20
UPTON HW, 1968, AM ANN DEAF, V113, P222
NR 22
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 379
EP 389
DI 10.1016/j.specom.2010.11.001
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300008
ER
PT J
AU Kocinski, J
Libiszewski, P
Sek, A
AF Kocinski, Jedrzej
Libiszewski, Pawel
Sek, Aleksander
TI Spatial efficiency of blind source separation based on decorrelation -
subjective and objective assessment
SO SPEECH COMMUNICATION
LA English
DT Article
DE Blind source separation; Beamforming; Speech intelligibility; Speech
enhancement
ID FREQUENCY-DOMAIN; CONVOLUTIVE MIXTURES; SPEECH MIXTURES;
NOISE-REDUCTION; INTELLIGIBILITY
AB Blind source separation (BSS) is one of the newest multisensor methods that exploit the statistical properties of simultaneously recorded independent signals in order to separate them. The objective of this method is similar to that of beamforming, namely a set of spatial filters that separate the source signals is calculated. Thus, it seems reasonable to investigate the spatial efficiency of BSS, which is reported in this study. A dummy head with two microphones was used to record two signals in an anechoic chamber: target speech and babble noise in different spatial configurations. The speech reception thresholds (SRTs, i.e. the signal-to-noise ratio, SNR, yielding 50% speech intelligibility) before and after the BSS algorithm (Parra and Spence, 2000) were then determined for audiologically normal subjects. A significant speech intelligibility improvement was observed after the BSS was applied. This happened in most cases in which the target and masker sources were spatially separated. Moreover, a comparison of the objective (SNR enhancement) and subjective (intelligibility improvement) assessment methods is reported here. It must be emphasized that these measures give different results. (C) 2010 Elsevier B.V. All rights reserved.
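A toy sketch of how an SRT (the 50% intelligibility point) can be tracked adaptively, here with a simple 1-up/1-down SNR staircase and a simulated listener; the true SRT, psychometric slope and step size are assumed values, and the study's actual procedure may differ.

import numpy as np

rng = np.random.default_rng(5)
true_srt, slope = -6.0, 1.0

def listener_correct(snr_db):
    # Simulated listener: logistic psychometric function for P(sentence correct).
    p = 1.0 / (1.0 + np.exp(-slope * (snr_db - true_srt)))
    return rng.random() < p

snr, step, track = 0.0, 2.0, []
for trial in range(30):
    track.append(snr)
    snr += -step if listener_correct(snr) else step   # down after correct, up after incorrect
print("estimated SRT:", np.mean(track[10:]))          # average over the converged portion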
C1 [Kocinski, Jedrzej; Libiszewski, Pawel; Sek, Aleksander] Adam Mickiewicz Univ, Inst Acoust, PL-61614 Poznan, Poland.
RP Kocinski, J (reprint author), Adam Mickiewicz Univ, Inst Acoust, 85 Umultowska Str, PL-61614 Poznan, Poland.
EM jedrzej.kocinski@amu.edu.pl
FU Polish-Norwegian Research Fund; European Union
FX This work was supported by Polish-Norwegian Research Fund and FP6 of the
European Union 'HearCom'. The authors would like to thank two anonymous
reviewers for useful comments and remarks on the earlier version of this
manuscript.
CR ANEMULLER J, 2000, ICA 2000, P215
Araki S, 2003, EURASIP J APPL SIG P, V2003, P1157, DOI 10.1155/S1110865703305074
Berouti M, 1979, IEEE INT C AC SPEECH, V4, P208, DOI 10.1109/ICASSP.1979.1170788
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Brandstein M., 2001, MICROPHONE ARRAYS SI, V1st
CARDOSO JF, 1989, P IC ASSP, V89, P2109
Deller J., 2000, DISCRETE TIME PROCES
Douglas SC, 2003, SPEECH COMMUN, V39, P65, DOI 10.1016/S0167-6393(02)00059-6
Drgas S, 2008, ARCH ACOUST, V33, P455
Ephraim Y., 1984, IEEE T ACOUST SPEECH, VASSP-32, P1109
GAUTHAM JM, 2010, 9 INT C LAT VAR AN S
Hao JC, 2009, IEEE T AUDIO SPEECH, V17, P24, DOI 10.1109/TASL.2008.2005342
HARMELING S, 2001, CONVBSS
Hyvarinen A, 2001, INDEPENDENT COMPONEN
Johnson D, 1993, ARRAY SIGNAL PROCESS
KAJALA M, 2001, IEEE INT C AC SPEECH, V5, P2917
KITAWAKI N, 2007, ETSI WORKSH SPEECH
Kocinski J., 2005, Archives of Acoustics, V30
Kocinski J, 2008, SPEECH COMMUN, V50, P29, DOI 10.1016/j.specom.2007.06.003
Kokkinakis K, 2008, J ACOUST SOC AM, V123, P2379, DOI 10.1121/1.2839887
Lee J. K., 2003, 32 INT C EXP NOIS CO
LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375
Libiszewski P, 2007, ARCH ACOUST, V32, P337
Makino S, 2005, IEICE T FUND ELECTR, VE88A, P1640, DOI 10.1093/ietfec/e88-a.7.1640
MATSUOKA K, 1995, NEURAL NETWORKS, V8, P411, DOI 10.1016/0893-6080(94)00083-X
Moore BC., 2003, INTRO PSYCHOL HEARIN
MUKAI R, 2004, ISCAS 2004
Ozimek E., 2006, ARCH ACOUST, V31, P431
Ozimek E, 2009, INT J AUDIOL, V48, P433, DOI 10.1080/14992020902725521
Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214
Parra LC, 2006, J ACOUST SOC AM, V119, P3839, DOI 10.1121/1.2197606
Peissig J, 1997, J ACOUST SOC AM, V101, P1660, DOI 10.1121/1.418150
Pham D.-T., 2003, ICA 2003 NAR JAP
SARUWATARI H, 2003, 4 INT S IND COMP AN
Sawada H., 2005, SPEECH ENHANCEMENT
Scalart P., 1996, IEEE INT C AC SPEECH, V1, P629
Shinn-Cunningham BG, 2001, J ACOUST SOC AM, V110, P1118, DOI 10.1121/1.1386633
Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 10.1016/S0925-2312(98)00047-2
SMARAGDIS P, 1997, INFORM THEORETIC APP
WILSON KW, 2008, ICASSP LAS VEG NEV U
Yilmaz O, 2004, IEEE T SIGNAL PROCES, V52, P1830, DOI 10.1109/TSP.2004.828896
Yu T, 2009, INT CONF ACOUST SPEE, P213
Zhou Y, 2003, SIGNAL PROCESS, V83, P2037, DOI 10.1016/S0165-1684(03)00134-8
NR 43
TC 2
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 390
EP 402
DI 10.1016/j.specom.2010.11.002
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300009
ER
PT J
AU Stark, A
Paliwal, K
AF Stark, Anthony
Paliwal, Kuldip
TI MMSE estimation of log-filterbank energies for robust speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Robust speech recognition; MMSE estimation; Speech enhancement methods
ID SPECTRAL AMPLITUDE ESTIMATOR; NOISE SUPPRESSION FILTER; ENHANCEMENT
AB In this paper, we derive a minimum mean square error log-filterbank energy estimator for environment-robust automatic speech recognition. While several such estimators exist within the literature, most involve trade-offs between simplifications of the log-filterbank noise distortion model and analytical tractability. To avoid this limitation, we extend a well-known spectral domain noise distortion model for use in the log-filterbank energy domain. To do this, several mathematical transformations are developed to transform spectral domain models into filterbank and log-filterbank energy models. As a result, a new estimator is developed that allows for robust estimation of both log-filterbank energies and subsequent Mel-frequency cepstral coefficients. The proposed estimator is evaluated over the Aurora2 and RM speech recognition tasks, with results showing a significant reduction in word recognition error over both baseline results and several competing estimators. (C) 2010 Elsevier B.V. All rights reserved.
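For orientation, a standard log-mel-filterbank front end of the kind such an estimator operates on; this is a textbook construction with illustrative parameter values, not code from the paper.

import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_filterbank_energies(power_spectrum, fs=8000, nfft=512, n_filters=23):
    # Triangular mel-spaced filters applied to a one-sided power spectrum.
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(fs / 2), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mel_pts) / fs).astype(int)
    fb = np.zeros((n_filters, nfft // 2 + 1))
    for j in range(1, n_filters + 1):
        l, c, r = bins[j - 1], bins[j], bins[j + 1]
        fb[j - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising slope
        fb[j - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling slope
    return np.log(fb @ power_spectrum + 1e-12)

x = np.random.default_rng(6).normal(size=512)                     # stand-in frame
ps = np.abs(np.fft.rfft(x, 512)) ** 2
print(log_filterbank_energies(ps).shape)                          # (23,)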
C1 [Stark, Anthony; Paliwal, Kuldip] Griffith Univ, Signal Proc Lab, Brisbane, Qld 4111, Australia.
RP Paliwal, K (reprint author), Griffith Univ, Signal Proc Lab, Nathan Campus, Brisbane, Qld 4111, Australia.
EM a.stark@griffith.edu.au; k.paliwal@griffith.edu.au
CR ACERO A, 2000, P INT
Barker J., 2000, P ICSLP BEIJ CHIN, P373
Cohen I., 2002, SIGNAL PROCESSING LE, V9, P113
Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0
DAVIS S, 1990, READINGS SPEECH RECO
Deng L., 2000, P ICSLP
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1991, IEEE T SIGNAL PROCES, V39, P795
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Erell A, 1993, IEEE T SPEECH AUDI P, V1, P84, DOI 10.1109/89.221370
FUJIMOTO M, 2000, IEEE INT C AC SPEECH, V3, P1727
GALES L, 1995, THESIS U CAMBRIDGE U
Gemello R, 2006, IEEE SIGNAL PROC LET, V13, P56, DOI 10.1109/LSP.2005.860535
Gillick L., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266481
Gradshteyn I. S., 2007, TABLE INTEGRALS SERI, V7th
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Hermus K, 2007, EURASIP J ADV SIG PR, DOI 10.1155/2007/45821
Indrebo KM, 2008, IEEE T AUDIO SPEECH, V16, P1654, DOI 10.1109/TASL.2008.2002083
Lathoud G., 2005, P 2005 IEEE ASRU WOR, P343
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
MALAH D, 1999, ACOUST SPEECH SIG PR, P789
MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394
Moreno P.J., 1996, THESIS CARNEGIE MELL
Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225
Pearce D., 2000, ISCA ITRW ASR2000, P29
PRICE P, 1988, IEEE ICASSP 88 NEW Y, V1, P651
Rabiner L.R., 1978, DIGITAL PROCESSING S
Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828
Soon IY, 1999, SIGNAL PROCESS, V75, P151, DOI 10.1016/S0165-1684(98)00230-8
SPOUGE JL, 1994, SIAM J NUMER ANAL, V31, P931, DOI 10.1137/0731050
Stouten V., 2006, THESIS KATHOLIEKE U
Young S., 2000, HTK BOOK VERSION 3 0
Yu D, 2008, IEEE T AUDIO SPEECH, V16, P1061, DOI 10.1109/TASL.2008.921761
NR 33
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 403
EP 416
DI 10.1016/j.specom.2010.11.004
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300010
ER
PT J
AU Hoonhorst, I
Medina, V
Colin, C
Markessis, E
Radeau, M
Deltenre, P
Serniclaes, W
AF Hoonhorst, I.
Medina, V.
Colin, C.
Markessis, E.
Radeau, M.
Deltenre, P.
Serniclaes, W.
TI Categorical perception of voicing, colors and facial expressions: A
developmental study
SO SPEECH COMMUNICATION
LA English
DT Article
DE Categorical perception; Boundary precision; Development; Voice onset
time; Colors; Facial expressions
ID SPEECH-PERCEPTION; CROSS-LANGUAGE; ONSET TIME; CV SYLLABLES;
DISCRIMINATION; CHILDREN; INFANTS; SOUNDS; ADULTS; COARTICULATION
AB The aim of the present paper was to compare the development of perceptual categorization of voicing, colors and facial expressions in French-speaking children (from 6 to 8 years) and adults. Differences in both categorical perception, i.e. the correspondence between identification and discrimination performances, and in boundary precision, indexed by the steepness of the identification slope, were investigated. Whereas there was no significant effect of age on categorical perception, boundary precision increased with age, both for voicing and facial expressions though not for colors. These results suggest that the development of boundary precision arises from a general cognitive maturation across different perceptual domains. However, this is not without domain specific effects since we found (1) a correlation between the development of voicing perception and some reading performances and (2) an earlier maturation of boundary precision for colors compared to voicing and facial expressions. These comparative data indicate that whereas general cognitive maturation has some influence on the development of perceptual categorization, this is not without domain-specific effects, the structural complexity of the categories being one of them. (C) 2010 Elsevier B.V. All rights reserved.
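A sketch of how boundary precision can be quantified from identification data: fit a sigmoid to the proportion of one response category along the continuum and read off the boundary and slope. A logistic fit stands in for the probit analysis cited in the references; the VOT values and response proportions are invented.

import numpy as np
from scipy.optimize import curve_fit

def logistic(x, boundary, slope):
    return 1.0 / (1.0 + np.exp(-slope * (x - boundary)))

vot_ms = np.array([-30, -20, -10, 0, 10, 20, 30])                  # stimulus continuum
p_voiceless = np.array([0.02, 0.05, 0.20, 0.55, 0.85, 0.97, 0.99]) # identification proportions

(boundary, slope), _ = curve_fit(logistic, vot_ms, p_voiceless, p0=[0.0, 0.2])
print("boundary ~ %.1f ms VOT, slope ~ %.2f (steeper = more precise)" % (boundary, slope))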
C1 [Hoonhorst, I.; Colin, C.; Radeau, M.; Deltenre, P.] ULB, UNESCOG, B-1050 Brussels, Belgium.
[Medina, V.] Univ Paris 07, UFR, F-75013 Paris, France.
[Medina, V.; Serniclaes, W.] CNRS, LPP, F-75006 Paris, France.
[Medina, V.; Serniclaes, W.] Univ Paris 05, F-75006 Paris, France.
[Markessis, E.] ULB, Fac Med, B-1070 Brussels, Belgium.
[Deltenre, P.] CHU Brugmann, Clin Neurophysiol, B-1020 Brussels, Belgium.
RP Hoonhorst, I (reprint author), ULB, UNESCOG, 50 Ave Franklin Roosevelt CP 191, B-1050 Brussels, Belgium.
EM ihoonhor@ulb.ac.be; medina_vicky@yahoo.fr; ccolin@ulb.ac.be;
emarkess@ulb.ac.be; moradeau@ulb.ac.be; paul.deltenre@chu-brugmann.be;
willy.serniclaes@parisdescartes.fr
FU Belgian National Fund for Scientific Research (FNRS); U.L.B; Loicq
Foundation; Van Goethem-Brichant Foundation; Brugmann Foundation;
Belgian Kids Foundation; ANR (France) [ANR-07-BLAN-0014-01]
FX This work was supported financially from funds given to I. Hoonhorst by
the Belgian National Fund for Scientific Research (FNRS); to M. Radeau
by an FER grant from U.L.B.; to P. Deltenre by the Loicq Foundation, the
Van Goethem-Brichant Foundation and the Brugmann Foundation; to E.
Markessis by the Belgian Kids Foundation; and to W. Serniclaes by the
ANR Program PBELA ANR-07-BLAN-0014-01 (France). The authors are grateful
to R. Carre (Dynamique du Langage Lab., CNRS- Lyon 2 University) for
providing the speech synthesis software; to C. Van Nechel (Brugmann
Hospital, Brussels), M. Vanhaelen (U.C.L., Louvain-la-Neuve) and R.
Bruyer (U.C.L., Louvain-la-Neuve) for their help in the creation of
color and facial expression stimuli; to Mr. Chaussard, schools inspector
and to S. Lacourthiade and C. Moreul, headmistresses of the primary
school of Capens and Marquefave, where the main part of the study took
place.
CR ASLIN RN, 1981, CHILD DEV, V52, P1135, DOI 10.1111/j.1467-8624.1981.tb03159.x
BEALE JM, 1995, COGNITION, V57, P217, DOI 10.1016/0010-0277(95)00669-X
BEAUPRE M, 2005, MONTREAL SET FACIAL
Beddor PS, 2002, J PHONETICS, V30, P591, DOI 10.1006/jpho.2002.0177
Bogliotti C, 2008, J EXP CHILD PSYCHOL, V101, P137, DOI 10.1016/j.jecp.2008.03.006
BORNSTEIN MH, 1976, SCIENCE, V191, P201, DOI 10.1126/science.1246610
Bruyer R, 2007, EUR REV APPL PSYCHOL, V57, P37, DOI 10.1016/j.erap.2006.02.001
Burnham D., 2003, READ WRIT, V16, P573, DOI DOI 10.1023/A:1025593911070
BURNHAM DK, 1991, J CHILD LANG, V18, P231
Campanella S, 2001, VIS COGN, V8, P237
Damper RI, 2000, PERCEPT PSYCHOPHYS, V62, P843, DOI 10.3758/BF03206927
DARCY I, 2007, PAPERS LAB PHONOLOGY, V9
Dunn L. M., 1981, PEABODY PICTURE VOCA
Dunn L. M., 1993, ECHELLE VOCABULAIRE
EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068
ELLIOTT LL, 1981, J ACOUST SOC AM, V70, P669, DOI 10.1121/1.386929
ELLIOTT LL, 1986, CHILD DEV, V57, P628
Finney D. J., 1971, PROBIT ANAL, V3rd
Franklin A, 2004, BRIT J DEV PSYCHOL, V22, P349, DOI 10.1348/0261510041552738
Franklin A, 2005, J EXP CHILD PSYCHOL, V90, P114, DOI 10.1016/j.jecp.2004.10.001
Gao XQ, 2009, J EXP CHILD PSYCHOL, V102, P503, DOI 10.1016/j.jecp.2008.11.002
Hazan V, 2000, J PHONETICS, V28, P377, DOI 10.1006/jpho.2000.0121
Hoonhorst I, 2009, J EXP CHILD PSYCHOL, V104, P353, DOI 10.1016/j.jecp.2009.07.005
Hoonhorst I, 2009, CLIN NEUROPHYSIOL, V120, P897, DOI 10.1016/j.clinph.2009.02.174
KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940
Kotsoni E, 2001, PERCEPTION, V30, P1115, DOI 10.1068/p3155
Kraljic T, 2006, PSYCHON B REV, V13, P262, DOI 10.3758/BF03193841
KRAUSE SE, 1982, J ACOUST SOC AM, V71, P990, DOI 10.1121/1.387580
Lalonde CE, 1995, INFANT BEHAV DEV, V18, P459, DOI 10.1016/0163-6383(95)90035-7
LIBERMAN AM, 1957, J EXP PSYCHOL, V54, P358, DOI 10.1037/h0044417
MacMillan N. A., 2005, DETECTION THEORY USE
Martin-Malivel J, 2007, BEHAV NEUROSCI, V121, P1145, DOI 10.1037/0735-7044.121.6.1145
McCullagh P, 1983, GEN LINEAR MODELS
Medina V, 2010, J PHONETICS, V38, P493, DOI 10.1016/j.wocn.2010.06.002
MEDINA V, 2009, 10 ISCA C INT BRIGTH
Mitterer H, 2008, PSYCHOL SCI, V19, P629, DOI 10.1111/j.1467-9280.2008.02133.x
Mitterer H, 2006, PERCEPT PSYCHOPHYS, V68, P1227, DOI 10.3758/BF03193723
Mondloch CJ, 2002, PERCEPTION, V31, P553, DOI 10.1068/p3339
NEAREY TM, 1990, J PHONETICS, V18, P347
POLLACK I, 1971, PSYCHON SCI, V24, P299
RASKIN LA, 1983, PSYCHOL RES-PSYCH FO, V45, P135, DOI 10.1007/BF00308665
Roberson D, 2005, BEHAV BRAIN SCI, V28, P505
SANDELL JH, 1979, J COMP PHYSIOL PSYCH, V93, P626, DOI 10.1037/h0077594
Schouten B, 2003, SPEECH COMMUN, V41, P71, DOI 10.1016/S0167-6393(02)00094-8
Schwarzer G, 2000, CHILD DEV, V71, P391, DOI 10.1111/1467-8624.00152
Serniclaes W, 2005, COGNITION, V98, pB35, DOI 10.1016/j.cognition.2005.03.002
Serniclaes W., 1987, THESIS U LIBRE BRUXE
SIMON C, 1978, J ACOUST SOC AM, V63, P925, DOI 10.1121/1.381772
SNOWDON CT, 1987, CATEGORICAL PERCEPTI, P332
STEVENS KN, 1974, J ACOUST SOC AM, V55, P653, DOI 10.1121/1.1914578
STREETER LA, 1976, J ACOUST SOC AM, V59, P448, DOI 10.1121/1.380864
SUMMERFIELD Q, 1975, SPEECH PERCEPTION SE, V2
von Frisch K, 1964, BEES THEIR VISION CH
WERKER JF, 1985, PERCEPT PSYCHOPHYS, V37, P35, DOI 10.3758/BF03207136
WOOD CC, 1976, J ACOUST SOC AM, V60, P1381, DOI 10.1121/1.381231
WRIGHT AA, 1972, VISION RES, V12, P1447, DOI 10.1016/0042-6989(72)90171-X
ZLATIN MA, 1975, J SPEECH HEAR RES, V18, P541
NR 57
TC 8
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 417
EP 430
DI 10.1016/j.specom.2010.11.005
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300011
ER
PT J
AU Patel, R
McNab, C
AF Patel, Rupal
McNab, Catherine
TI Displaying prosodic text to enhance expressive oral reading
SO SPEECH COMMUNICATION
LA English
DT Article
DE Children; Oral reading; Prosody; Reading software; Expressive reading
ID VOCAL FUNDAMENTAL-FREQUENCY; 11-YEAR-OLD CHILDREN; CONTRASTIVE STRESS;
POOR READERS; SIMPLE VIEW; FLUENCY; COMPREHENSION; INTONATION; LANGUAGE;
CRIES
AB This study assessed the effectiveness of software designed to facilitate expressive oral reading through text manipulations that convey prosody. The software presented stories in standard (S) and manipulated formats corresponding to variations in fundamental frequency (F), intensity (I), duration (D), and combined cues (C) indicating modulation of pitch, loudness and length, respectively. Ten early readers (mean age = 7.6 years) attended three sessions. During the first session, children read two stories in standard format to establish a baseline. The second session provided training and practice in the manipulated formats. In the third, post-training session, sections of each story were read in each condition (S, F, I, D, C in random order). Recordings were acoustically examined for changes in word duration, peak intensity and peak F0 from baseline to post-training. When provided with pitch cues (F), children increased utterance-wide peak F0 range (mean = 34.5 Hz) and absolute peak F0 for accented words. Pitch cues were more effective in isolation (F) than in combination (C). Although Condition I elicited increased intensity of salient words, Conditions S and D had minimal impact on prosodic variation. Findings suggest that textual manipulations conveying prosody can be readily learned by children to improve reading expressivity. (C) 2010 Elsevier B.V. All rights reserved.
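A minimal sketch of the kind of word-level measurement reported above (word duration and peak frame intensity), assuming word time alignments are already available; the dB reference is arbitrary and peak F0 extraction is omitted.

import numpy as np

def word_measures(signal, fs, word_times, frame=0.02):
    # word_times: list of (word, t_start, t_end) in seconds.
    hop = int(frame * fs)
    results = {}
    for word, t0, t1 in word_times:
        seg = signal[int(t0 * fs):int(t1 * fs)]
        frames = [seg[i:i + hop] for i in range(0, max(len(seg) - hop, 1), hop)]
        peak_db = max(20 * np.log10(np.sqrt(np.mean(f ** 2)) + 1e-12) for f in frames)
        results[word] = {"duration_s": t1 - t0, "peak_intensity_db": peak_db}
    return results

fs = 16000
sig = np.random.default_rng(7).normal(size=fs) * np.hanning(fs)   # stand-in utterance
print(word_measures(sig, fs, [("big", 0.1, 0.35), ("dog", 0.4, 0.75)]))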
C1 [Patel, Rupal; McNab, Catherine] Northeastern Univ, Dept Speech Language Pathol & Audiol, Boston, MA 02115 USA.
RP Patel, R (reprint author), Northeastern Univ, Dept Speech Language Pathol & Audiol, 360 Huntington Ave,102 Forsyth Bldg, Boston, MA 02115 USA.
EM r.patel@neu.edu
FU American Speech and Hearing Association; Northeastern University;
National Science Foundation [IIS-0915527]
FX This study was conducted in the Communication Analysis and Design
Laboratory in the Department of Speech Language Pathology and Audiology
at Northeastern University. This work was supported in part by funding
from the American Speech and Hearing Association SPARC award (Students
Preparing for Academic and Research Careers), the Northeastern
University Provost Award, and the National Science Foundation (Grant
IIS-0915527). The authors are grateful to Kevin Reilly, Michael Epstein
and Timothy Mills for their time, effort, and suggestions and to Ghadeer
Rahhal, who was instrumental in implementing the ReadN'Karaoke program
and supplemental acoustic analysis software. Last, the authors thank the
children and families who participated for their time, effort, and
enthusiasm.
CR Aylett M, 2006, J ACOUST SOC AM, V119, P3048, DOI 10.1121/1.2188331
BATES E, 1976, J CHILD LANG, V1, P227
Blevins W., 2001, BUILDING FLUENCY LES
BOERSMA P, 2007, SYSTEM DOING PHONETI
BOLINGER DL, 1961, LANGUAGE, V37, P83, DOI 10.2307/411252
Bolinger D., 1989, INTONATION ITS USES
BREWSTER K, 1989, LINGUISTICS CLIN PRA, P186
CARVER RP, 1993, J READING BEHAV, V25, P439
Chafe W., 1988, WRIT COMMUN, V5, P396, DOI DOI 10.1177/0741088388005004001
Cooper W. E., 1980, SYNTAX SPEECH
COOPER WE, 1985, J ACOUST SOC AM, V77, P2142, DOI 10.1121/1.392372
Cowie R, 2002, LANG SPEECH, V45, P47
Cromer W, 1970, J Educ Psychol, V61, P471, DOI 10.1037/h0030288
CRUTTENDEN A, 1985, J CHILD LANG, V12, P643
Crystal D., 1979, LANG ACQUIS, p[33, 174]
Cutler A, 1997, LANG SPEECH, V40, P141
CUTLER A, 1987, J CHILD LANG, V14, P145
DOWHOWER SL, 1987, READING RES Q, P390
FRY DB, 1955, J ACOUST SOC AM, V27, P765, DOI 10.1121/1.1908022
FRY DB, 1958, LANG SPEECH, V1, P126
Fuchs L. S., 2001, SCI STUD READ, V5, P239, DOI DOI 10.1207/S1532799XSSR0503_3
FURROW D, 1984, J CHILD LANG, V11, P203
Gibson E. J., 1975, PSYCHOL READING
Gilbert HR, 1996, INT J PEDIATR OTORHI, V34, P237, DOI 10.1016/0165-5876(95)01273-7
Grigos MI, 2007, J SPEECH LANG HEAR R, V50, P119, DOI 10.1044/1092-4388(2007/010)
HERMAN PA, 1985, READING RES Q, V20, P535
HOOVER WA, 1990, READ WRIT, V2, P127, DOI 10.1007/BF00401799
Hudson RF, 2005, READ TEACH, V58, P702, DOI 10.1598/RT.58.8.1
Kent R., 1997, SPEECH SCI
Kuhn MR, 2003, J EDUC PSYCHOL, V95, P3, DOI 10.1037/0022-0663.95.1.3
LABERGE D, 1974, COGNITIVE PSYCHOL, V6, P293, DOI 10.1016/0010-0285(74)90015-2
Ladd DR, 2008, CAMB STUD LINGUIST, V79, P1
Lehiste I., 1970, SUPRASEGMENTALS
LeVasseur VM, 2006, APPL PSYCHOLINGUIST, V27, P423, DOI 10.1017/S0142716406060346
Lind K, 2002, INT J PEDIATR OTORHI, V64, P97, DOI 10.1016/S0165-5876(02)00024-1
Locke J. L., 1993, CHILDS PATH SPOKEN L
Miller J, 2006, J EDUC PSYCHOL, V98, P839, DOI 10.1037/0022-0663.98.4.839
Morgan JL, 1996, SIGNAL TO SYNTAX: BOOTSTRAPPING FROM SPEECH TO GRAMMAR IN EARLY ACQUISITION, P1
MORRIS D, 2002, EVERY CHILD READING
*NAEP, 1995, LIST CHILD READ AL O
National Institute of Child Health and Human Development, 2000, NIH PUBL, V00-4769
OSHEA LJ, 1983, READ RES QUART, V18, P458, DOI 10.2307/747380
Patel R, 2006, SPEECH COMMUN, V48, P1308, DOI 10.1016/j.specom.2006.06.007
Patel R, 2009, J SPEECH LANG HEAR R, V52, P790, DOI 10.1044/1092-4388(2008/07-0137)
Pinnell G., 1995, LISTENING CHILDREN R
Protopapas A, 1997, J ACOUST SOC AM, V102, P3723, DOI 10.1121/1.420403
Rasinski T. V., 2003, FLUENT READER ORAL R
RASINSKI TV, 1990, J EDUC RES, V83, P147
*READ NAT, READ NAT STRAT
*READ THEAT, READ THEAT SCRIPTS P
SAMUELS SJ, 1988, READ TEACH, V41, P756
Schreiber P., 1982, LANG ACQUIS, P78
Schreiber P. A., 1987, COMPREHENDING ORAL W, P243
Schreiber P. A., 1991, THEOR PRACT, V30, P158, DOI [10.2307/1476877, DOI 10.1080/00405849109543496]
SCHREIBER PA, 1980, J READING BEHAV, V12, P177
Schwanenflugel PJ, 2004, J EDUC PSYCHOL, V96, P119, DOI 10.1037/0022-0663.96.1.119
ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572
Snow D, 1998, J SPEECH LANG HEAR R, V41, P576
SNOW D, 1994, J SPEECH HEAR RES, V37, P831
Stanovich K., 1996, HDB READING RES, V2, P418
Stathopoulos ET, 1997, J SPEECH LANG HEAR R, V40, P595
Titze IR, 1994, PRINCIPLES VOICE PRO
Traunmuller H, 2000, J ACOUST SOC AM, V107, P3438, DOI 10.1121/1.429414
Wells B, 2004, J CHILD LANG, V31, P749, DOI 10.1017/S030500090400652X
WERKER JF, 1994, INFANT BEHAV DEV, V17, P323, DOI 10.1016/0163-6383(94)90012-4
Whalley K, 2006, J RES READ, V29, P288, DOI 10.1111/j.1467-9817.2006.00309.x
Wiig E. H., 2004, CLIN EVALUATION LANG
WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238
YOUNG A, 1995, J EXP CHILD PSYCHOL, V60, P428, DOI 10.1006/jecp.1995.1048
Young AR, 1996, APPL PSYCHOLINGUIST, V17, P59, DOI 10.1017/S0142716400009462
NR 70
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 431
EP 441
DI 10.1016/j.specom.2010.11.007
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300012
ER
PT J
AU Stan, A
Yamagishi, J
King, S
Aylett, M
AF Stan, Adriana
Yamagishi, Junichi
King, Simon
Aylett, Matthew
TI The Romanian speech synthesis (RSS) corpus: Building a high quality
HMM-based speech synthesis system using a high sampling rate
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech synthesis; HTS; Romanian; HMMs; Sampling frequency; Auditory
scale
AB This paper first introduces a newly-recorded high quality Romanian speech corpus designed for speech synthesis, called "RSS", along with Romanian front-end text processing modules and HMM-based synthetic voices built from the corpus. All of these are now freely available for academic use in order to promote Romanian speech technology research. The RSS corpus comprises 3500 training sentences and 500 test sentences uttered by a female speaker and was recorded using multiple microphones at 96 kHz sampling frequency in a hemianechoic chamber. The details of the new Romanian text processor we have developed are also given.
Using the database, we then revisit some basic configuration choices of speech synthesis, such as waveform sampling frequency and auditory frequency warping scale, with the aim of improving speaker similarity, which is an acknowledged weakness of current HMM-based speech synthesisers. As we demonstrate using perceptual tests, these configuration choices can make substantial differences to the quality of the synthetic speech. Contrary to common practice in automatic speech recognition, higher waveform sampling frequencies can offer enhanced feature extraction and improved speaker similarity for HMM-based speech synthesis. (C) 2010 Elsevier B.V. All rights reserved.
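To make the auditory frequency warping scales mentioned above concrete, the short Python sketch below maps a few frequencies up to the Nyquist limit onto the mel and Bark scales at two candidate sampling rates. The formulas are common textbook approximations (O'Shaughnessy mel, Traunmuller Bark) and are illustrative only, not the warping used in the authors' system.

import numpy as np

def hz_to_mel(f_hz):
    # O'Shaughnessy mel approximation, common in MFCC-style front-ends
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)

def hz_to_bark(f_hz):
    # Traunmuller (1990) approximation of the Bark scale
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

for fs in (16000, 48000):                 # two candidate waveform sampling rates
    for f in np.linspace(0, fs / 2.0, 5):  # a few points up to the Nyquist frequency
        print(f"fs={fs:5d} Hz  f={f:7.0f} Hz  "
              f"mel={hz_to_mel(f):7.1f}  Bark={hz_to_bark(f):5.2f}")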
C1 [Stan, Adriana] Tech Univ Cluj Napoca, Dept Commun, Cluj Napoca 400027, Romania.
[Yamagishi, Junichi; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland.
[Aylett, Matthew] CereProc Ltd, Edinburgh EH8 9LE, Midlothian, Scotland.
RP Stan, A (reprint author), Tech Univ Cluj Napoca, Dept Commun, 26-28 George Baritiu St, Cluj Napoca 400027, Romania.
EM adriana.stan@com.utcluj.ro; jyamagis@staffmail.ed.ac.uk;
simon.king@ed.ac.uk; matthew@cereproc.com
RI Stan, Adriana /G-1257-2014
OI Stan, Adriana /0000-0003-2894-5770
FU European Social Fund [POSDRU/6/1.5/S/5]; European Community [213845];
eDIKT initiative
FX Adriana Stan is funded by the European Social Fund, project
POSDRU/6/1.5/S/5 and was visiting CSTR at the time of this work. Junichi
Yamagishi and Simon King are partially funded by the European
Community's Seventh Framework Programme (FP7/2007-2013) under Grant
agreement 213845 (the EMIME project).This work has made use of the
resources provided by the Edinburgh Compute and Data Facility (ECDF -
http://www.ecdf.ed.ac.uk). The ECDF is partially supported by the eDIKT
initiative (http://www.edikt.org.uk).
CR Aylett M. P., 2007, P AISB 2007 NEWC UK, P174
Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X
Black A. W., 1995, P EUROSPEECH MADR SP, P581
BURILEANU D, 1999, P EUROSPEECH 99 BUD, P2063
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
Fant G, 2005, TEXT SPEECH LANG TEC, V24, P199
FERENCZ A, 1997, THESIS U CLUJ NAPOCA
FRUNZA O, 2005, P EUROLAN 2005 WORKS
Hunt AJ, 1996, INT CONF ACOUST SPEE, P373, DOI 10.1109/ICASSP.1996.541110
Karaiskos V., 2008, P BLIZZ CHALL WORKSH
KAWAHARA H, 2001, 2 MAVEBAW
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
MURAOKA T, 1978, J AUDIO ENG SOC, V26, P252
Ohtani Y, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2266
Olshen R., 1984, CLASSIFICATION REGRE, V1st
PATTERSON RD, 1982, J ACOUST SOC AM, V76, P640
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
Smith JO, 1999, IEEE T SPEECH AUDI P, V7, P697, DOI 10.1109/89.799695
TOKUDA K, 1991, IEICE T FUND ELECTR, V74, P1240
TOKUDA K, 1994, TECHNICAL REPORT NAG
Tokuda K, 2002, IEICE T INF SYST, VE85D, P455
Tokuda K., 1994, P INT C SPOK LANG PR, P1043
Yamagishi J, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P581
Yamagishi Junichi, 2010, Proceedings 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2010, DOI 10.1109/ICASSP.2010.5495562
Yamagishi Junichi, 2008, P BLIZZ CHALL 2008 B
Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885
Zen H, 2007, P 6 ISCA WORKSH SPEE, P294
Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325
ZWICKER E, 1965, PSYCH REV, V72, P2
NR 31
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2011
VL 53
IS 3
BP 442
EP 450
DI 10.1016/j.specom.2010.12.002
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 724FK
UT WOS:000287561300013
ER
PT J
AU Huijbregts, M
de Jong, F
AF Huijbregts, Marijn
de Jong, Franciska
TI Robust speech/non-speech classification in heterogeneous multimedia
content
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech/non speech classification; Rich transcription; SHoUT toolkit
ID RECOGNITION
AB In this paper we present a speech/non-speech classification method that allows high quality classification without the need to know in advance what kinds of audible non-speech events are present in an audio recording and that does not require a single parameter to be tuned on in-domain data. Because no parameter tuning is needed and no training data is required to train models for specific sounds, the classifier is able to process a wide range of audio types with varying conditions and thereby contributes to the development of a more robust automatic speech recognition framework.
Our speech/non-speech classification system does not attempt to classify all audible non-speech in a single run. Instead, first a bootstrap speech/silence classification is obtained using a standard speech/non-speech classifier. Next, models for speech, silence and audible non-speech are trained on the target audio using the bootstrap classification. The experiments show that the performance of the proposed system is 83% and 44% (relative) better than that of a common broadcast news speech/non-speech classifier when applied to a collection of meetings recorded with table-top microphones and a collection of Dutch television broadcasts used for TRECVID 2007. (C) 2010 Elsevier B.V. All rights reserved.
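The bootstrap-and-retrain idea summarized above can be sketched roughly as follows. This is a schematic two-class (speech/silence) illustration using generic scikit-learn GMMs and synthetic feature data; it omits the separate audible non-speech model and is not the SHoUT implementation.

import numpy as np
from sklearn.mixture import GaussianMixture

def bootstrap_speech_silence(features, bootstrap_speech, n_iter=2):
    """features: (n_frames, n_dims) acoustic features for the target recording.
    bootstrap_speech: boolean bootstrap labels (True = speech) from a generic
    classifier. Returns refined frame labels: 0 = silence, 1 = speech."""
    labels = bootstrap_speech.astype(int)
    for _ in range(n_iter):
        if min(np.sum(labels == 0), np.sum(labels == 1)) < 20:
            break                                  # avoid fitting on an empty class
        models = [GaussianMixture(n_components=4, covariance_type='diag',
                                  random_state=0).fit(features[labels == c])
                  for c in (0, 1)]
        scores = np.stack([m.score_samples(features) for m in models], axis=1)
        labels = scores.argmax(axis=1)             # relabel frames, then retrain
    return labels

# toy usage with random data standing in for real MFCC-like features
rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 13))
labels = bootstrap_speech_silence(feats, rng.random(1000) > 0.5)
print(np.bincount(labels))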
C1 [Huijbregts, Marijn; de Jong, Franciska] Univ Twente, Dept Comp Sci, NL-7500 AE Enschede, Netherlands.
RP Huijbregts, M (reprint author), Univ Twente, Dept Comp Sci, POB 217, NL-7500 AE Enschede, Netherlands.
FU Dutch government; EU [IST-FP6-506811, IST-FP6-027685, IST-PF6-027413]
FX The work reported here was partly supported by the bsik-program
MultimediaN, which is funded by the Dutch government
(http://www.multimedian.nl), and the EU projects AMI (IST-FP6-506811), MESH
(IST-FP6-027685), and Media Campaign (IST-PF6-027413). We would like to
thank IDIAP for providing us with the speech/music benchmark files.
CR Ajmera J, 2003, SPEECH COMMUN, V40, P351, DOI 10.1016/S0167-6393(02)00087-0
Anguera X., 2006, THESIS U POLITECNICA
Anguera X., 2007, LECT NOTES COMPUTER, V4299
Byrne W, 2004, IEEE T SPEECH AUDI P, V12, P420, DOI 10.1109/TSA.2004.828702
Cassidy S, 2004, P NIST RT04S EV WORK
Chen S., 1998, P DARPA BROADC NEWS
Fiscus JG, 2006, LECT NOTES COMPUT SC, V4299, P309
Garofolo J., 2000, P RECH INF ASS ORD C
Gauvain JL, 1999, P DARPA BROADC NEWS, P99
Goldman J., 2005, INT J DIGITAL LIB, V5, P287, DOI 10.1007/s00799-004-0101-0
Hain T., 1998, P DARPA BROADC NEWS, P133
Huang J, 2007, P NIST RICH TRANSCR
Huijbregts M, 2001, PROSODY BASED BOUNDA
HUIJBREGTS M., 2007, LECT NOTES COMPUTER
Istrate D, 2006, LECT NOTES COMPUTER
ITO MR, 1971, IEEE T ACOUST SPEECH, VAU19, P235, DOI 10.1109/TAU.1971.1162189
OOSTDIJK N, 2000, 2 INT C LANG RES EV, V2, P887
Pellom B., 2003, P ICASSP
Rentzeperis E, 2007, LECT NOTES COMPUTER
SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136
Stolcke A, 2007, P NIST RICH TRANSCR
van Leeuwen D, 2007, LECT NOTES COMPUTER
Wolfel M, 2007, P NIST RICH TRANSCR
NR 23
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2011
VL 53
IS 2
BP 143
EP 153
DI 10.1016/j.specom.2010.08.008
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LQ
UT WOS:000285665300001
ER
PT J
AU Krishnamoorthy, P
Prasanna, SRM
AF Krishnamoorthy, P.
Prasanna, S. R. M.
TI Enhancement of noisy speech by temporal and spectral processing
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Temporal processing; Spectral processing; Temporal
and spectral processing
ID LINEAR PREDICTION; AMPLITUDE ESTIMATOR; SUBTRACTION METHOD;
REPRESENTATION; REDUCTION; DATABASE; DOMAIN
AB This paper presents a noisy speech enhancement method that combines linear prediction (LP) residual weighting in the time domain and spectral processing in the frequency domain to provide better noise suppression as well as better enhancement in the speech regions. The noisy speech is initially processed by the excitation source (LP residual) based temporal processing, which involves identifying and enhancing the excitation source based speech-specific features present at the gross and fine temporal levels. The gross level features are identified by estimating the following speech parameters: sum of the peaks in the discrete Fourier transform (DFT) spectrum, smoothed Hilbert envelope of the LP residual, and modulation spectrum values, all from the noisy speech signal. The fine level features are identified using the knowledge of the instants of significant excitation. A weight function is derived from the gross and fine weight functions to obtain the temporally processed speech signal. The temporally processed speech is further subjected to spectral domain processing. Spectral processing involves estimation and removal of degrading components, and also identification and enhancement of speech-specific spectral components. The proposed method is evaluated using different objective and subjective quality measures. The quality measures show that the proposed combined temporal and spectral processing method provides better enhancement compared to either temporal or spectral processing alone. (C) 2010 Elsevier B.V. All rights reserved.
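For readers unfamiliar with LP residual processing, the sketch below shows one conventional way to obtain the residual by inverse filtering a frame with its LPC polynomial and to resynthesize the frame from a weighted residual. The scalar weight is a placeholder, not the gross/fine weight function derived in the paper, and the librosa/scipy calls are merely one convenient implementation.

import numpy as np
import librosa
from scipy.signal import lfilter

def lp_residual(frame, order=10):
    """Inverse-filter a frame with its LPC polynomial A(z) to get the residual."""
    a = librosa.lpc(np.asarray(frame, dtype=float), order=order)  # [1, a1, ..., ap]
    return a, lfilter(a, [1.0], frame)

def enhance_frame(frame, weight, order=10):
    """Scale the residual by a weight and resynthesize through 1/A(z)."""
    a, res = lp_residual(frame, order)
    return lfilter([1.0], a, weight * res)

# toy usage: a 30 ms synthetic voiced-like frame at 16 kHz
sr = 16000
t = np.arange(int(0.03 * sr)) / sr
frame = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
print(enhance_frame(frame, weight=1.2).shape)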
C1 [Prasanna, S. R. M.] Indian Inst Technol, Dept Elect & Commun Engn, Gauhati 781039, Assam, India.
[Krishnamoorthy, P.] Samsung India Software Ctr, Noida 201301, India.
RP Prasanna, SRM (reprint author), Indian Inst Technol, Dept Elect & Commun Engn, Gauhati 781039, Assam, India.
CR ANANTHAPADMANABHA TV, 1979, IEEE T ACOUST SPEECH, V27, P309, DOI 10.1109/TASSP.1979.1163267
Berouti M., 1979, P IEEE INT C AC SPEE, P208
Munkong R, 2008, IEEE SIGNAL PROC MAG, V25, P98, DOI 10.1109/MSP.2008.918418
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Chang JH, 2007, PATTERN RECOGN, V40, P1123, DOI 10.1016/j.patcog.2006.07.006
Chen B, 2007, SPEECH COMMUN, V49, P134, DOI 10.1016/j.specom.2006.12.005
CHEN B, 2005, P IEEE ICASSP, V1, P1097
Deller J. R., 1993, DISCRETE TIME PROCES
DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Greenberg S, 1997, INT CONF ACOUST SPEE, P1647, DOI 10.1109/ICASSP.1997.598826
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
Hu Y, 2006, P INT PHIL PA US
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
Jin W, 2006, SPEECH COMMUN, V48, P1349, DOI 10.1016/j.specom.2006.07.001
Kamath S., 2002, P IEEE INT C AC SPEE
Kim W, 2000, IEE P-VIS IMAGE SIGN, V147, P423, DOI 10.1049/ip-vis:20000408
Krishnamoorthy P, 2009, IEEE T AUDIO SPEECH, V17, P253, DOI 10.1109/TASL.2008.2008039
Krishnamoorthy P, 2008, ADCOM: 2008 16TH INTERNATIONAL CONFERENCE ON ADVANCED COMPUTING AND COMMUNICATIONS, P112
LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lu CT, 2007, PATTERN RECOGN LETT, V28, P1300, DOI 10.1016/j.patrec.2007.03.001
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
Marple J., 1999, IEEE T SIGNAL PROCES, V47, P2600
Martin R, 2005, IEEE T SPEECH AUDI P, V13, P845, DOI 10.1109/TSA.2005.851927
Marzinzik M, 2002, IEEE T SPEECH AUDI P, V10, P109, DOI 10.1109/89.985548
MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910
PRASANNA SRM, 2004, P IEEE INT C AC SPEE, V1, pI109
Prasanna SRM, 2005, IEEE P 3 INT C INT S, P140
Press W.H., 1992, NUMERICAL RECIPES C
Proakis J. G., 1996, DIGITAL SIGNAL PROCE
Rix AW, 2002, J AUDIO ENG SOC, V50, P755
SCHROEDE.MR, 1970, PR INST ELECTR ELECT, V58, P707, DOI 10.1109/PROC.1970.7725
Senapati S, 2008, SPEECH COMMUN, V50, P504, DOI 10.1016/j.specom.2008.03.004
Seok JW, 1999, ELECTRON LETT, V35, P123, DOI 10.1049/el:19990122
Shao Y, 2007, IEEE T SYST MAN CY B, V37, P877, DOI 10.1109/TSMCB.2007.895365
SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662
Sri Rama Murty K, 2007, P INT ANTW BELG, P2941
Yegnanarayana B, 2009, IEEE T AUDIO SPEECH, V17, P614, DOI 10.1109/TASL.2008.2012194
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
Yamashita K, 2005, IEEE SIGNAL PROC LET, V12, P465, DOI 10.1109/LSP.2005.847864
Yang LP, 2005, J ACOUST SOC AM, V117, P1001, DOI 10.1121/1.1852873
Yegnanarayana B., 2002, P IEEE INT C AC SPEE, V1, P541
Yegnanarayana B, 1999, SPEECH COMMUN, V28, P25, DOI 10.1016/S0167-6393(98)00070-3
ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7
NR 46
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2011
VL 53
IS 2
BP 154
EP 174
DI 10.1016/j.specom.2010.08.011
PG 21
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LQ
UT WOS:000285665300002
ER
PT J
AU Wang, RL
Lu, JL
AF Wang, Ruili
Lu, Jingli
TI Investigation of golden speakers for second language learners from
imitation preference perspective by voice modification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Computer assisted language learning (CALL); Computer Assisted
Pronunciation Training (CAPT); Voice modification; Pitch; Speech rate
AB This paper investigates what voice features (e.g., speech rate and pitch-formants) make a teacher's voice preferable for second language learners to imitate when they practice sentence pronunciation using Computer-Assisted Pronunciation Training (CAPT) systems. The CAPT system employed in our investigation uses a single teacher's voice as the source to automatically resynthesize several sample voices with different voice features based on the features of a learner's voice. Our approach is different from that in the study conducted by Probst et al., which uses multiple native speakers' voices as sample voices [Probst, K., Ke, Y., Eskenazi, M., 2002. Enhancing foreign language tutors - in search of the golden speaker. Speech Communication 37 (3-4), 161-173]. Our approach can reduce the influence of characteristics of teachers' voices (e.g., voice quality and clarity) on the investigation. Our experimental results show that a teacher's voice, which has similar speech rate and pitch-formants to a learner's voice, is not always the learner's first imitation preference. Many factors can influence learners' imitation preferences, e.g., background and proficiency of the language that they are learning. Also, a learner's preferences may change at different learning stages. We thus advocate an automatic voice modification function in CAPT systems to provide speech learning material with a wide variety of voice features, e.g., different speech rates or different pitch-formants. Learners then can control the voice modifications according to their preferences. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Wang, Ruili] Massey Univ, Sch Engn & Adv Technol, Palmerston North, New Zealand.
Nanjing Univ, State Key Lab Novel Software Technol, Nanjing, Peoples R China.
RP Wang, RL (reprint author), Massey Univ, Sch Engn & Adv Technol, Palmerston North, New Zealand.
CR Arnett M.K., 1952, J SO STATES COMM ASS, V17, P203
Bissiri MP, 2009, SPEECH COMMUN, V51, P933, DOI 10.1016/j.specom.2009.03.001
Black A, 2007, P ISCA ITRW SLATE WO
Boersma P., 2009, PRAAT DOING PHONETIC
Clark J., 2007, INTRO PHONETICS PHON
Derwing TM, 2003, CAN MOD LANG REV, V59, P546
Dyck C, 2002, LANG LEARN TECHNOL, V6, P27
Erro D, 2007, INTERSPEECH 2007 EUR
Eskenazi M., 2000, P INSTIL 2000 INT SP, P73
Eskenazi M, 1998, P SPEECH TECHN LANG, P77
Eskenazi M, 2009, SPEECH COMMUN, V51, P832, DOI 10.1016/j.specom.2009.04.005
Fant G., 1960, ACOUSTIC THEORY SPEE
Felps D, 2009, SPEECH COMMUN, V51, P920, DOI 10.1016/j.specom.2008.11.004
Hirose K, 2004, P INT S TON ASP LANG, P77
Hismanoglu M., 2006, J LANGUAGE LINGUISTI, V2, P101
Jacob A, 2008, ADV HUMAN COMPUTER I
Lee S. T., 2008, THESIS AUSTR CATHOLI
Lu J, 2010, INTERSPEECH2010 MAK, P606
Meszaros K, 2005, FOLIA PHONIATR LOGO, V57, P111, DOI 10.1159/000083572
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
Nagano K, 1990, 1 INT C SPOK LANG PR, P1169
Nolan Francis, 2003, P 15 INT C PHON SCI, P771
Ostendorf M, 1995, 95001 ECS BOST U
Probst K, 2002, SPEECH COMMUN, V37, P161, DOI 10.1016/S0167-6393(01)00009-7
Sundstrom A., 1998, P ISCA WORKSH SPEECH, P49
NR 25
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2011
VL 53
IS 2
BP 175
EP 184
DI 10.1016/j.specom.2010.08.015
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LQ
UT WOS:000285665300003
ER
PT J
AU Lobdell, BE
Allen, JB
Hasegawa-Johnson, MA
AF Lobdell, B. E.
Allen, J. B.
Hasegawa-Johnson, M. A.
TI Intelligibility predictors and neural representation of speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech perception; Articulation; Index; Speech recognition; Speech
representation
ID ARTICULATION INDEX; - DISTINCTION; WORD RECOGNITION; PERCEPTION;
NOISE; MODEL; IDENTIFICATION; PLACE; CUES; INTEGRATION
AB Intelligibility predictors tell us a great deal about human speech perception, in particular which acoustic factors strongly affect human behavior and which do not. A particular intelligibility predictor, the Articulation Index (AI), is interesting because it models human behavior in noise, and its form has implications about the representation of speech in the brain. Specifically, the Articulation Index implies that a listener pre-consciously estimates the masking noise distribution and uses it to classify time/frequency samples as speech or non-speech. We classify consonants using representations of speech and noise which are consistent with this hypothesis and determine whether their error rate and error patterns are more or less consistent with human behavior than representations typical of automatic speech recognition systems. The new representations resulted in error patterns more similar to humans in cases where the testing and training data sets do not have the same masking noise spectrum. (C) 2010 Elsevier B.V. All rights reserved.
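For orientation, one common textbook form of the band-based Articulation Index referred to above is shown below. The band-importance weights w_k, the speech-peak correction c (often around 12 dB), and the assumed 30 dB per-band dynamic range vary across formulations (French and Steinberg, 1947; Kryter, 1962), so this should be read as indicative rather than the specific variant analyzed in the paper:

\mathrm{AI} \;=\; \sum_{k} w_k \,\min\!\left\{\max\!\left(\frac{\mathrm{SNR}_k + c}{30},\, 0\right),\, 1\right\}

where SNR_k is the signal-to-noise ratio in dB in frequency band k.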
C1 [Lobdell, B. E.; Allen, J. B.; Hasegawa-Johnson, M. A.] Univ Illinois, Beckman Inst, Urbana, IL 61820 USA.
RP Lobdell, BE (reprint author), Univ Illinois, Beckman Inst, 405 N Mathews Ave, Urbana, IL 61820 USA.
CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615
Allen JB, 2005, J ACOUST SOC AM, V117, P2212, DOI 10.1121/1.1856231
ANSI, 1969, S35 ANSI
ANSI, 1997, S35 ANSI
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
COOPER FS, 1952, J ACOUST SOC AM, V24, P597, DOI 10.1121/1.1906940
Cover T. M., 2006, ELEMENTS INFORM THEO, V2nd
Darwin C. J., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90006-1
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
Drullman R, 1996, J ACOUST SOC AM, V99, P2358, DOI 10.1121/1.415423
DURLACH NI, 1986, J ACOUST SOC AM, V80, P63, DOI 10.1121/1.394084
FLETCHER H, 1950, J ACOUST SOC AM, V22, P89, DOI 10.1121/1.1906605
Fletcher H, 1938, J ACOUST SOC AM, V9, P275, DOI 10.1121/1.1915935
FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407
FURUI S, 1986, J ACOUST SOC AM, V80, P1016, DOI 10.1121/1.393842
Hant JJ, 2003, SPEECH COMMUN, V40, P291, DOI 10.1016/S0167-6393(02)00068-7
HEDRICK MS, 1993, J ACOUST SOC AM, V94, P2005, DOI 10.1121/1.407503
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
JONGMAN A, 1989, J ACOUST SOC AM, V85, P1718, DOI 10.1121/1.397961
Kewley Port D, 1983, J ACOUST SOC AM, V73, P1779
KRYTER K, 1926, J ACOUST SOC AM, V34, P1698
KRYTER KD, 1962, J ACOUST SOC AM, V34, P1689, DOI 10.1121/1.1909094
LEE B, 2007, BIENNIAL DSP VEHICLE
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Musch H, 2000, ACOUST RES LET ONLIN, V2, P25
Omar MK, 2004, IEEE T SIGNAL PROCES, V52, P2701, DOI 10.1109/TSP.2004.834344
Padmanabhan M, 2005, IEEE T SPEECH AUDI P, V13, P512, DOI 10.1109/TSA.2005.848876
PAVLOVIC CV, 1984, J ACOUST SOC AM, V75, P1606, DOI 10.1121/1.390870
Phatak SA, 2007, J ACOUST SOC AM, V121, P2312, DOI 10.1121/1.2642397
Phatak SA, 2008, J ACOUST SOC AM, V124, P1220, DOI 10.1121/1.2913251
REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191
REPP BH, 1988, J ACOUST SOC AM, V83, P237, DOI 10.1121/1.396529
REPP BH, 1986, J ACOUST SOC AM, V79, P1987, DOI 10.1121/1.393207
Ronan D, 2004, J ACOUST SOC AM, V116, P1749, DOI 10.1121/1.1777858
SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303
SHARF DJ, 1972, J ACOUST SOC AM, V51, P652, DOI 10.1121/1.1912890
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102
Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569
VIEMEISTER NF, 1991, J ACOUST SOC AM, V90, P858, DOI 10.1121/1.401953
NR 44
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2011
VL 53
IS 2
BP 185
EP 194
DI 10.1016/j.specom.2010.08.016
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LQ
UT WOS:000285665300004
ER
PT J
AU Alwan, A
Jiang, JT
Chen, W
AF Alwan, Abeer
Jiang, Jintao
Chen, Willa
TI Perception of place of articulation for plosives and fricatives in noise
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech perception; Place of articulation; Plosives; Fricatives; Noise;
Psychoacoustics
ID VOICELESS STOP CONSONANTS; LOCUS EQUATIONS; NONNATIVE LISTENERS;
IMPAIRED LISTENERS; ENGLISH FRICATIVES; RELATIVE AMPLITUDE;
SPEECH-PERCEPTION; WORD RECOGNITION; CUES; CONFUSIONS
AB This study aims at uncovering perceptually-relevant acoustic cues for the labial versus alveolar place of articulation distinction in syllable-initial plosives {/b/, /d/, /p/, /t/} and fricatives {/f/, /s/, /v/, /z/} in noise. Speech materials consisted of naturally-spoken consonant-vowel (CV) syllables from four talkers where the vowel was one of {/a/, /i/, /u/}. Acoustic analyses using logistic regression show that formant frequency measurements, relative spectral amplitude measurements, and burst/noise durations are generally reliable cues for labial/alveolar classification. In a subsequent perceptual experiment, each pair of syllables with the labial/alveolar distinction (e.g., /ba, da/) was presented to listeners at various levels of signal-to-noise ratio (SNR) in a 2-AFC task. A threshold SNR was obtained for each syllable pair using sigmoid fitting of the percent correct scores. Results show that the perception of the labial/alveolar distinction in noise depends on the manner of articulation, the vowel context, and the interaction between voicing and manner of articulation. Correlation analyses of the acoustic measurements and the threshold SNRs show that formant frequency measurements (such as F1 and F2 onset frequencies and F2 and F3 frequency changes) become increasingly important for the perception of labial/alveolar distinctions as the SNR degrades. (C) 2010 Elsevier B.V. All rights reserved.
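The threshold-SNR estimation mentioned above (sigmoid fitting of 2-AFC percent-correct scores) can be illustrated with a generic logistic psychometric function that rises from the 50% chance floor. The data points and the curve_fit-based procedure below are illustrative, not taken from the study.

import numpy as np
from scipy.optimize import curve_fit

def psychometric(snr_db, threshold, slope):
    # 2-AFC psychometric function: 50% chance floor rising towards 100%
    return 0.5 + 0.5 / (1.0 + np.exp(-slope * (snr_db - threshold)))

# illustrative percent-correct data (proportions) for one syllable pair
snrs = np.array([-20.0, -15.0, -10.0, -5.0, 0.0, 5.0])
pc   = np.array([0.52, 0.58, 0.70, 0.86, 0.95, 0.99])

(th, sl), _ = curve_fit(psychometric, snrs, pc, p0=[-10.0, 0.5])
print(f"estimated threshold SNR ~ {th:.1f} dB, slope {sl:.2f}")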
C1 [Alwan, Abeer; Jiang, Jintao; Chen, Willa] Univ Calif Los Angeles, Dept Elect Engn, Los Angeles, CA 90095 USA.
RP Alwan, A (reprint author), Univ Calif Los Angeles, Dept Elect Engn, Los Angeles, CA 90095 USA.
FU NIH-NIDCD [R29-DC02033]; NSF; Radcliffe Institute
FX This work was supported in part by the NIH-NIDCD Grant R29-DC02033, the
NSF, and a Fellowship from the Radcliffe Institute to Abeer Alwan. We
thank Marcia Chen for her help in data analysis and Wendy Espeland,
Marwa Elshakry, and Christine Stansell for commenting on an earlier
version of this manuscript. Thanks also to Steven Lulich for constructive
comments. The views expressed here are those of the authors and do not
necessarily represent those of the NSF.
CR Alwan A, 1992, P INT C SPOK LANG PR, P1063
BEHRENS S, 1988, J ACOUST SOC AM, V84, P861, DOI 10.1121/1.396655
Benki JR, 2003, J ACOUST SOC AM, V113, P1689, DOI 10.1121/1.1534102
BLUMSTEIN SE, 1979, J ACOUST SOC AM, V66, P1001, DOI 10.1121/1.383319
Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103
Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952
Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600
Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292
Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707
DELATTRE PC, 1955, J ACOUST SOC AM, V27, P769, DOI 10.1121/1.1908024
ENGEN KJV, 2007, J ACOUST SOC AM, V121, P519
Fant G., 1973, SPEECH SOUNDS FEATUR, P110
Farar CL, 1987, J ACOUST SOC AM, V81, P1085
Fruchter D, 1997, J ACOUST SOC AM, V102, P2997, DOI 10.1121/1.421012
GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T
Guerlekian JA, 1981, J ACOUST SOC AM, V70, P1624
Hant JJ, 2000, THESIS U CALIFORNIA
Hant JJ, 2003, SPEECH COMMUN, V40, P291, DOI 10.1016/S0167-6393(02)00068-7
Hant JJ, 2000, P 6 INT C SPOK LANG, P941
HARRIS KS, 1958, LANG SPEECH, V1, P1
HEDRICK MS, 1993, J ACOUST SOC AM, V94, P2005, DOI 10.1121/1.407503
Hedrick MS, 1996, J ACOUST SOC AM, V100, P3398, DOI 10.1121/1.416981
Hedrick MS, 2007, J SPEECH LANG HEAR R, V50, P254, DOI 10.1044/1092-4388(2007/019)
HEDRICK MS, 1995, J ACOUST SOC AM, V98, P1292, DOI 10.1121/1.413466
HEINZ JM, 1961, J ACOUST SOC AM, V33, P589, DOI 10.1121/1.1908734
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Jiang JT, 2006, J ACOUST SOC AM, V119, P1092, DOI 10.1121/1.2149841
JONGMAN A, 1989, J ACOUST SOC AM, V85, P1718, DOI 10.1121/1.397961
Kewley Port D, 1982, J ACOUST SOC AM, V72, P379
Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1021/1.2180210
LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375
Li N, 2007, J ACOUST SOC AM, V122, P1165, DOI 10.1121/1.2749454
Liberman AM, 1954, PSYCHOL MONOGR-GEN A, V68, P1
MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526
Nittrouer S, 2003, J ACOUST SOC AM, V113, P2254
OHDE RN, 1983, J ACOUST SOC AM, V74, P706, DOI 10.1121/1.389856
Parikh G, 2005, J ACOUST SOC AM, V118, P3874, DOI 10.1121/1.2118407
Potter R. K., 1947, VISIBLE SPEECH
Redford MA, 1999, J ACOUST SOC AM, V106, P1555, DOI 10.1121/1.427152
Shadle CH, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1521
SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303
Simpson SA, 2005, J ACOUST SOC AM, V118, P2775, DOI 10.1121/1.2062650
SOLI SD, 1979, J ACOUST SOC AM, V66, P46, DOI 10.1121/1.382972
Stevens K. N., 1985, PHONETIC LINGUISTICS, P243
Stevens KN, 1999, P INT C PHON SCI SAN, P1117
Stevens K.N., 1998, ACOUSTIC PHONETICS
STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102
Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569
Suchato A., 2004, THESIS MIT CAMBRIDGE
SUSSMAN HM, 1995, J ACOUST SOC AM, V97, P3112, DOI 10.1121/1.411873
SUSSMAN HM, 1993, J ACOUST SOC AM, V94, P1256, DOI 10.1121/1.408178
SUSSMAN HM, 1991, J ACOUST SOC AM, V90, P1309, DOI 10.1121/1.401923
WANG MD, 1973, J ACOUST SOC AM, V54, P1248, DOI 10.1121/1.1914417
You H. Y., 1979, THESIS U EDMONTON ED
Zue V. W., 1976, THESIS MIT CAMBRIDGE
NR 55
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2011
VL 53
IS 2
BP 195
EP 209
DI 10.1016/j.specom.2010.09.001
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LQ
UT WOS:000285665300005
ER
PT J
AU Borowicz, A
Petrovsky, A
AF Borowicz, Adam
Petrovsky, Alexandr
TI Signal subspace approach for psychoacoustically motivated speech
enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; KLT; Psychoacoustics
ID NOISE
AB In this paper we deal with perceptually motivated signal subspace methods for speech enhancement. We focus on the extended spectral-domain-constrained (SDC) estimator, which is obtained using the Lagrange multipliers method. We present an algorithm for a precise computation of the Lagrange multipliers, allowing for a direct shaping of the residual noise power spectrum. In addition, the SDC estimator is presented in a new, possibly more effective form. As a practical implementation of the estimator, we propose a perceptually constrained signal subspace (PCSS) method for speech enhancement. The approach utilizes masking phenomena for residual noise shaping and is optimal for the case of coloured noise. Also, a less demanding approximate version of this method is derived. Finally, a comparative evaluation of the best-known subspace-based methods is performed using objective speech quality measures and listening tests. Results show that the PCSS method outperforms other methods, providing high noise attenuation and better speech quality. (C) 2010 Elsevier B.V. All rights reserved.
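To make the signal subspace idea concrete, here is a minimal KLT-domain sketch: eigendecompose an estimated clean-speech covariance and apply an eigenvalue-dependent gain before reconstruction. The Wiener-like gain rule is a simple placeholder and does not reproduce the Lagrange-multiplier computation or the psychoacoustic residual-noise shaping of the PCSS method.

import numpy as np

def subspace_enhance_frame(noisy_frame, noise_cov, mu=1.0):
    """KLT-based enhancement of one analysis frame (vector of samples)."""
    n = len(noisy_frame)
    # sample covariance of the noisy frame (rank-1 here; real systems average frames)
    ry = np.outer(noisy_frame, noisy_frame)
    rx = ry - noise_cov                        # estimated clean-speech covariance
    eigvals, v = np.linalg.eigh(rx)            # KLT basis
    eigvals = np.maximum(eigvals, 0.0)         # discard the noise-only subspace
    sigma_n2 = np.trace(noise_cov) / n
    gains = eigvals / (eigvals + mu * sigma_n2)  # simple Wiener-like gain per eigenvalue
    h = v @ np.diag(gains) @ v.T               # estimator matrix
    return h @ noisy_frame

# toy usage: white-noise covariance and a random 64-sample "noisy" frame
rng = np.random.default_rng(0)
noise_cov = 0.01 * np.eye(64)
print(subspace_enhance_frame(rng.normal(size=64), noise_cov).shape)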
C1 [Borowicz, Adam; Petrovsky, Alexandr] Bialystok Tech Univ, Dept Real Time Syst, PL-15351 Bialystok, Poland.
RP Borowicz, A (reprint author), Bialystok Tech Univ, Dept Real Time Syst, Wiejska Str 45A, PL-15351 Bialystok, Poland.
CR EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090
Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC
GUSTAFSSON S, 1998, P IEEE INT C AC SPEE, V1, P397, DOI 10.1109/ICASSP.1998.674451
Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458
JABLOUN F, 2002, P IEEE INT C AC SPEE, V1, P569
Jabloun F, 2003, IEEE T SPEECH AUDI P, V11, P700, DOI 10.1109/TSA.2003.818031
JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608
Lev An H, 2003, IEEE SIGN PROCESS LE, V10, P104
Mittal P, 2000, IEEE T SPEECH AUDIO, V8, P159
Petrovsky A., 2004, AES CONV 116 BERL GE
Rezayee A, 2001, IEEE T SPEECH AUDI P, V9, P87, DOI 10.1109/89.902276
Vetter R, 1999, P EUR C SPEECH COMM, P2411
YANG B, 1995, IEEE T SIGNAL PROCES, V43, P95
YANG WH, 1998, ACOUST SPEECH SIG PR, P541
NR 14
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2011
VL 53
IS 2
BP 210
EP 219
DI 10.1016/j.specom.2010.09.002
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LQ
UT WOS:000285665300006
ER
PT J
AU Feld, J
Sommers, M
AF Feld, Julia
Sommers, Mitchell
TI There goes the neighborhood: Lipreading and the structure of the mental
lexicon
SO SPEECH COMMUNICATION
LA English
DT Article
DE Lipreading; Spoken word recognition; Lexical competition; Lexical
neighborhoods
ID AUDIOVISUAL WORD RECOGNITION; SPEECH-PERCEPTION; SPOKEN WORDS;
DISTINCTIVENESS; MODEL; COMPETITION; ADULTS
AB A central question in spoken word recognition research is whether words are recognized relationally, in the context of other words in the mental lexicon (McClelland and Elman, 1986; Luce and Pisoni, 1998). The current research evaluated metrics for measuring the influence of the mental lexicon on visually perceived (lipread) spoken word recognition. Lexical competition (the extent to which perceptually similar words influence recognition of a stimulus word) was quantified using metrics that are well-established in the literature, as well as a novel statistical method for calculating perceptual confusability, based on the Phi-square statistic. The Phi-square statistic proved an effective measure for assessing lexical competition and explained significant variance in visual spoken word recognition beyond that accounted for by traditional metrics. Because these values include the influence of a large subset of the lexicon (rather than only perceptually similar words), it suggests that even perceptually distant words may receive some activation, and therefore provide competition, during spoken word recognition. This work supports and extends earlier research (Auer, 2002; Mattys et al., 2002) that proposed a common recognition system underlying auditory and visual spoken word recognition, and provides support for the use of the Phi-square statistic for quantifying lexical competition. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Feld, Julia; Sommers, Mitchell] Washington Univ, Dept Psychol, St Louis, MO 63103 USA.
RP Feld, J (reprint author), Washington Univ, Dept Psychol, Campus Box 1125, St Louis, MO 63103 USA.
RI Strand, Julia/J-5432-2014
OI Strand, Julia/0000-0001-5950-0139
FU National Institute on Aging [RO1 AG 18029-4]
FX We thank Lorin Lachs, Luis Hernandez, and David Pisoni for generously
sharing with us their identification data from the Hoosier Audiovisual
Multitalker Database. This work was supported in part by grant award
number RO1 AG 18029-4 from the National Institute on Aging.
CR AUER E, 2008, J ACOUST SOC AM, V124, P2459
Auer ET, 2002, PSYCHON B REV, V9, P341, DOI 10.3758/BF03196291
Auer ET, 1997, J ACOUST SOC AM, V102, P3704, DOI 10.1121/1.420402
Balota DA, 2007, BEHAV RES METHODS, V39, P445, DOI 10.3758/BF03193014
BERNSTEIN LE, 1997, P ESCA ESCOP WORKSH, P21
GOLDINGER SD, 1989, J MEM LANG, V28, P501, DOI 10.1016/0749-596X(89)90009-0
Iverson P, 1998, SPEECH COMMUN, V26, P45, DOI 10.1016/S0167-6393(98)00049-1
JACKSON PL, 1988, NEW REFLECTIONS SPEE, P99
Jusczyk PW, 2002, EAR HEARING, V23, P2, DOI 10.1097/00003446-200202000-00002
Kaiser AR, 2003, J SPEECH LANG HEAR R, V46, P390, DOI 10.1044/1092-4388(2003/032)
LACHS L, 1998, RES SPOKEN LANGUAGE, V22, P377
Luce PA, 1998, EAR HEARING, V19, P1, DOI 10.1097/00003446-199802000-00001
Lund K, 1996, BEHAV RES METH INSTR, V28, P203, DOI 10.3758/BF03204766
Mattys SL, 2002, PERCEPT PSYCHOPHYS, V64, P667, DOI 10.3758/BF03194734
MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0
MORTON J, 1979, PSYCHOLINGUISTICS ST, V2
MURRAY NT, 2008, FDN AURAL REHABILITA
MURRAY NT, 2007, TRENDS AMPLIF, V11, P233
NORRIS D, 1994, COGNITION, V52, P189, DOI 10.1016/0010-0277(94)90043-4
OWENS E, 1985, J SPEECH HEAR RES, V28, P381
Seitz P. F., 1998, PHLEX PHONOLOGICALLY
Sommers MS, 1999, PSYCHOL AGING, V14, P458, DOI 10.1037/0882-7974.14.3.458
SUMMERFIELD Q, 1992, PHILOS T ROY SOC B, V335, P71, DOI 10.1098/rstb.1992.0009
Vitevitch MS, 1998, PSYCHOL SCI, V9, P325, DOI 10.1111/1467-9280.00064
WALDEN BE, 1977, J SPEECH HEAR RES, V20, P130
NR 25
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2011
VL 53
IS 2
BP 220
EP 228
DI 10.1016/j.specom.2010.09.003
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LQ
UT WOS:000285665300007
ER
PT J
AU Morrison, GS
AF Morrison, Geoffrey Stewart
TI A comparison of procedures for the calculation of forensic likelihood
ratios from acoustic-phonetic data Multivariate kernel density (MVKD)
versus Gaussian mixture model-universal background model (GMM-UBM)
SO SPEECH COMMUNICATION
LA English
DT Article
DE Forensic voice comparison; Likelihood ratio; Acoustic-phonetic;
Multivariate kernel density; GMM-UBM
ID SPEAKER RECOGNITION; CASEWORK; SYSTEMS; SCIENCE; FUSION; IMPACT
AB Two procedures for the calculation of forensic likelihood ratios were tested on the same set of acoustic-phonetic data. One procedure was a multivariate kernel density procedure (MVKD), which is common in acoustic-phonetic forensic voice comparison, and the other was a Gaussian mixture model-universal background model (GMM-UBM), which is common in automatic forensic voice comparison. The data were coefficient values from discrete cosine transforms fitted to second-formant trajectories of /ar/, /er/, /ou/, /au/, and /oi/ tokens produced by 27 male speakers of Australian English. Scores were calculated separately for each phoneme and then fused using logistic regression. The performance of the fused GMM-UBM system was much better than that of the fused MVKD system, both in terms of accuracy (as measured using the log-likelihood-ratio cost, C-llr) and precision (as measured using an empirical estimate of the 95% credible interval for the likelihood ratios from the different-speaker comparisons). (C) 2010 Elsevier B.V. All rights reserved.
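For reference, the log-likelihood-ratio cost C-llr used above as the accuracy metric is usually written as follows (Brummer and du Preez, 2006), where LR_i are likelihood ratios from the same-speaker comparisons and LR_j those from the different-speaker comparisons:

C_{\mathrm{llr}} \;=\; \frac{1}{2}\left[\frac{1}{N_{\mathrm{ss}}}\sum_{i=1}^{N_{\mathrm{ss}}}\log_2\!\left(1+\frac{1}{LR_i}\right) \;+\; \frac{1}{N_{\mathrm{ds}}}\sum_{j=1}^{N_{\mathrm{ds}}}\log_2\!\left(1+LR_j\right)\right]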
C1 [Morrison, Geoffrey Stewart] Univ New S Wales, Sch Elect Engn & Telecommun, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia.
[Morrison, Geoffrey Stewart] Australian Natl Univ, Sch Language Studies, Canberra, ACT 0200, Australia.
RP Morrison, GS (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, Forens Voice Comparison Lab, Sydney, NSW 2052, Australia.
FU Australian Research Council [DP0774115]
FX This research was funded by Australian Research Council Discovery Grant
No. DP0774115. Thanks to Yuko Kinoshita for supplying the original audio
data. Thanks to Julien Epps for comments on an earlier version of this
paper.
CR Agresti A, 2007, INTRO CATEGORICAL DA, V2nd
Aitken CGG, 2004, STAT EVALUATION FORE
Aitken CGG, 2004, J ROY STAT SOC C-APP, V53, P109, DOI 10.1046/j.0035-9254.2003.05271.x
Aitken C.G.G., 1991, USE STAT FORENSIC SC
Aitken CGG, 2004, APPL STAT, V53, P665, DOI [10 1111/j 1467 9876 2004 02031 x, DOI 10.1111/J.1467.9876.2004.02031.X]
Alderman T., 2004, P 10 AUSTR INT C SPE, P177
Alexander A, 2005, INT J SPEECH LANG LA, V12, P214, DOI 10.1558/sll.2005.12.2.214
Alexander A., 2004, P 8 INT C SPOK LANG
ALEXANDER A, 2005, THESIS ECOLE POLYTEC
[Anonymous], 2002, FORENSIC SPEAKER IDE
Association of Forensic Science Providers, 2009, SCI JUSTICE, V49, P161, DOI [DOI 10.1016/J.SCIJUS.2009.07.004, 10.1016/j.scijus.2009.07.004]
Balding DJ, 2005, STAT PRACT, P1, DOI 10.1002/9780470867693
Becker T, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1505
Becker T, 2010, P OD 2010 LANG SPEAK, P58
Becker T, 2009, P NAG DAGA INT C AC
Broeders APA, 1995, P INT C PHON SCI STO, V3, P154
Brummer N, 2007, IEEE T AUDIO SPEECH, V15, P2072, DOI 10.1109/TASL.2007.902870
Brummer N, 2006, COMPUT SPEECH LANG, V20, P230, DOI 10.1016/j.csl.2005.08.001
Buckleton J, 2005, FORENSIC DNA EVIDENCE INTERPRETATION, P27
Castro D.R., 2007, THESIS U AUTONOMA MA
Champod C, 2000, SPEECH COMMUN, V31, P193, DOI 10.1016/S0167-6393(99)00078-3
Cook R, 1998, SCI JUSTICE, V38, P231, DOI 10.1016/S1355-0306(98)72117-3
Curran J. M., 2005, LAW PROBABILITY RISK, V4, P115, DOI [10.1093/1pr/mgi009, DOI 10.1093/LPR/MGI009, 10 1093/Ipr/mgi009]
Curran JM, 2002, SCI JUSTICE, V42, P29, DOI 10.1016/S1355-0306(02)71794-2
Drygajlo A, 2007, IEEE SIGNAL PROC MAR, P132, DOI [10 1109/MSP 2007 323278, DOI 10.1109/MSP.2007.323278]
Duda R., 2000, PATTERN CLASSIFICATI
Enzinger E, 2010, P 39 AUD ENG SOC C A
Evett I, 2000, SCI JUSTICE, V40, P233, DOI 10.1016/S1355-0306(00)71993-9
Evett IW, 1996, ADV FOREN H, V6, P79
Evett I.W., 1990, FORENSIC SCI PROGR, V4, P141
Evett IW, 1998, SCI JUSTICE, V38, P198, DOI 10.1016/S1355-0306(98)72105-7
Ferrer L, 2009, J MACH LEARN RES, V10, P2079
Friedman J., 2009, ELEMENTS STAT LEARNI, V2nd
Gonzalez Rodriguez J, 2007, IEEE T AUDIO SPEECH, V15, P2104, DOI [10 1109/ TASL 2007 902747, DOI 10.1109/TASL.2007.902747]
Gonzalez Rodriguez J, 2005, FORENSIC SCI INT, V155, P126, DOI [10 1016/j forsciint 2004 11 007, DOI 10.1016/J.FORSCIINT.2004.11.007]
Guillemin BJ, 2008, INT J SPEECH LANG LA, V15, P193, DOI 10.1558/ijsll.v15i2.193
Hosmer DW, 2000, APPL LOGISTIC REGRES, V2nd
Ishihara S, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1941
Jessen M., 2008, LANG LING COMPASS, V2, P671, DOI [DOI 10.1111/J.1749-818X.2008.00066.X, 10.1111/j.1749-818x.2008.00066.x]
Kahn J, 2010, P OD 2010 LANG SPEAK, P109
Kinoshita Y, 2008, P OD 2008 SPEAK LANG
Kinoshita Y, 2009, INT J SPEECH LANG LA, V16, P91, DOI 10.1558/ijsll.v16i1.91
Kinoshita Y., 2006, P 11 AUSTR INT C SPE, P112
Lewis S., 1984, P I ACOUSTICS, V6, P69
LINDLEY DV, 1977, BIOMETRIKA, V64, P207, DOI 10.1093/biomet/64.2.207
Lucy D., 2005, INTRO STAT FORENSIC
Menard S., 2002, APPL LOGISTIC REGRES
Meuwly D, 2001, THESIS U LAUSANNE LA
Meuwly D, 2001, P 2001 SPEAK OD SPEA
Morrison G. S., 2010, P OD 2010 LANG SPEAK, P63
MORRISON GS, 2008, INT J SPEECH LANG LA, V15, P247, DOI DOI 10.1558/IJSLL.V15I2.249
Morrison GS, 2009, SCI JUSTICE, V49, P298, DOI 10.1016/j.scijus.2009.09.002
Morrison GS, 2009, J ACOUST SOC AM, V125, P2387, DOI 10.1121/1.3081384
Morrison GS, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1501
Morrison GS, 2010, EXPERT EVIDENCE
National Research Council, 2009, STRENGTH FOR SCI US
Pigeon S, 2000, DIGIT SIGNAL PROCESS, V10, P237, DOI 10.1006/dspr.1999.0358
Ramos Castro D, 2006, P OD 2006 SPEAK LANG, DOI [10 1109/ODYSSEY 2006 248088, DOI 10.1109/ODYSSEY.2006.248088]
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Robertson B., 1995, INTERPRETING EVIDENC
Rodriguez J., 2006, COMPUT SPEECH LANG, V20, P331, DOI DOI 10.1016/J.CSL.2005.08.005
Rose P., 2004, P 10 AUSTR INT C SPE, P492
Rose P, 2006, COMPUT SPEECH LANG, V20, P159, DOI 10.1016/j.csl.2005.07.003
Rose P, 2009, INT J SPEECH LANG LA, V16, P139, DOI 10.1558/ijsll.v16i1.139
Rose P., 2006, P 11 AUSTR INT C SPE, P329
Rose P, 2006, P OD 2006 SPEAK LANG, DOI [10 1109/ODYSSEY 2006 248095, DOI 10.1109/ODYSSEY.2006.248095]
Rose P., 2007, P 16 INT C PHON SCI, P1817
Rose Phil, 2006, P 11 AUSTR INT C SPE, P64
Thiruvaran T, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1497
van Leeuwen D., 2007, SPEAKER CLASSIFICATI, VI, P330, DOI [DOI 10.1007/978-3-540-74200-5_19, 10.1007/978- 3-540-74200-5_19]
NR 70
TC 11
Z9 13
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2011
VL 53
IS 2
BP 242
EP 256
DI 10.1016/j.specom.2010.09.005
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LQ
UT WOS:000285665300009
ER
PT J
AU Lei, Y
Hansen, JHL
AF Lei, Yun
Hansen, John H. L.
TI Mismatch modeling and compensation for robust speaker verification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker verification; Factor analysis; Eigenchannel; Mismatch
compensation
ID SPEECH RECOGNITION; CHANNEL COMPENSATION; VARIABILITY; NOISE
AB In this study, the primary channel mismatch scenario between enrollment and test conditions in a speaker verification task is analyzed and modeled. A novel Gaussian mixture model with universal background model (GMM-UBM) frame-based compensation model related to the mismatch is formulated and evaluated using National Institute of Standards and Technology (NIST) speaker recognition evaluation (SRE) 2008 data, along with a comparison to the well-known eigenchannel model. The proposed compensation method shows significant improvement versus an eigenchannel model when only the supervector of the UBM is employed. Here the supervector of the enrollment speaker model is not included in the estimation of the mismatch, since it is difficult to obtain the real supervector of the speaker based on the limited 5 min of channel-dependent speech data only. The proposed mismatch compensation model therefore shows that construction of the supervector obtained from a UBM model can more accurately describe the mismatch between enrollment and test data, resulting in effective classification performance improvement for speaker/speech applications. (C) 2010 Elsevier B.V. All rights reserved.
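As background, the eigenchannel compensation against which the proposed frame-based model is compared is commonly written as a low-rank shift of a GMM mean supervector; in the setup described above the UBM supervector is used in place of an enrollment-speaker supervector:

\mathbf{M}_{\mathrm{session}} \;=\; \mathbf{m} \;+\; \mathbf{U}\,\mathbf{x}

where m is the mean supervector, U is a low-rank eigenchannel matrix estimated from development data, and x contains the session-dependent channel factors estimated from the test utterance.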
C1 [Hansen, John H. L.] Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, CRSS, Dept Elect Engn, Richardson, TX 75080 USA.
RP Hansen, JHL (reprint author), Univ Texas Dallas, Erik Jonsson Sch Engn & Comp Sci, CRSS, Dept Elect Engn, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA.
FU AFRL [FA8750-09-C-0067]
FX This project was supported by AFRL under a subcontract to RADC Inc. under
FA8750-09-C-0067. Approved for public release; distribution unlimited.
CR Acero A., 1990, THESIS CARNEGIE MELL
Acero Alex, 2000, INTERSPEECH, P869
Burget L, 2007, IEEE T AUDIO SPEECH, V15, P1979, DOI 10.1109/TASL.2007.902499
Campbell WM, 2006, INT CONF ACOUST SPEE, P97
Castaldo F, 2007, IEEE T AUDIO SPEECH, V15, P1969, DOI 10.1109/TASL.2007.901823
GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014
Gales M.J.F, 1995, THESIS CAMBRIDGE U
Hank Liao, 2007, THESIS U CAMBRIDGE
Hansen J. H. L., 1988, THESIS GEORGIA I TEC
Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7
Kenny P, 2008, IEEE T AUDIO SPEECH, V16, P980, DOI 10.1109/TASL.2008.925147
Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1435, DOI 10.1109/TASL.2006.881693
Kenny P., 2004, IEEE ICASSP 2004 MON, P37
Kenny P, 2005, IEEE T SPEECH AUDI P, V13, P345, DOI 10.1109/TSA.2004.840940
Kenny P, 2005, INT CONF ACOUST SPEE, P637
Kim DY, 1998, SPEECH COMMUN, V24, P39, DOI 10.1016/S0167-6393(97)00061-7
Matrouf D, 2007, INTERSPEECH, P1242
Moreno P.J., 1996, THESIS CARNEGIE MELL
Pelecanos J., 2001, IEEE OD 2001 SPEAK L, P18
REYNOLDS DA, 1995, INT CONF ACOUST SPEE, P329, DOI 10.1109/ICASSP.1995.479540
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Solomonoff A, 2005, INT CONF ACOUST SPEE, P629
Stolcke A., 2005, INTERSPEECH, P2425
Vair C, 2006, IEEE OD SPEAK LANG R, P1
VARGA AP, 1990, INT CONF ACOUST SPEE, P845, DOI 10.1109/ICASSP.1990.115970
Vogt R, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P853
Vogt R., 2005, INTERSPEECH, P3117
NR 27
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2011
VL 53
IS 2
BP 257
EP 268
DI 10.1016/j.specom.2010.09.006
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LQ
UT WOS:000285665300010
ER
PT J
AU Dai, P
Soon, IY
AF Dai, Peng
Soon, Ing Yann
TI A temporal warped 2D psychoacoustic modeling for robust speech
recognition system
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic speech recognition; 2D mask; Simultaneous masking; Temporal
masking; Temporal warping
ID WORD RECOGNITION; AUDITORY-SYSTEM; MASKING; ADAPTATION; FREQUENCY
AB The human auditory system performs better than speech recognition systems under noisy conditions, which leads us to the idea of incorporating the human auditory system into automatic speech recognition engines. In this paper, a hybrid feature extraction method, which utilizes forward masking, backward masking, and lateral inhibition, is incorporated into mel-frequency cepstral coefficients (MFCC). The integration is implemented using a warped 2D psychoacoustic filter. The AURORA2 database is utilized for testing, and the Hidden Markov Model (HMM) is used for recognition. Comparison is made against lateral inhibition (LI), forward masking (FM), cepstral mean and variance normalization (CMVN), the original 2D psychoacoustic filter, and the RASTA filter. Experimental results show that the word recognition rate is significantly improved, especially under noisy conditions. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Dai, Peng; Soon, Ing Yann] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore.
RP Dai, P (reprint author), Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore.
RI Soon, Ing Yann/A-5173-2011
CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615
Ambikairajah E, 1997, ELECTRON COMMUN ENG, V9, P165, DOI 10.1049/ecej:19970403
Brookes M., 1997, VOICEBOX
CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427
Dai P., 2009, P ICICS DEC
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Golden B, 2000, SPEECH AUDIO SIGNAL
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2
Hirsch H. G., 2000, P ISCA ITRW ASR2000, P181
JESTEADT W, 1982, J ACOUST SOC AM, V71, P950, DOI 10.1121/1.387576
Kvale MN, 2004, J NEUROPHYSIOL, V91, P604, DOI 10.1152/jn.00484.2003
LEONARD RG, 1984, P ICASSP, V3, P42
Lois LE, 1962, J ACOUST SOC AM, V34, P1116
Luo XW, 2008, 2008 INTERNATIONAL CONFERENCE ON AUDIO, LANGUAGE AND IMAGE PROCESSING, VOLS 1 AND 2, PROCEEDINGS, P1105
Nghia PT, 2008, P ICATC, P349
Oxenham AJ, 2001, J ACOUST SOC AM, V109, P732, DOI 10.1121/1.1336501
Oxenham AJ, 2000, HEARING RES, V150, P258, DOI 10.1016/S0378-5955(00)00206-9
Park KY, 2003, NEUROCOMPUTING, V52-4, P615, DOI 10.1016/S0925-2312(02)00791-9
SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1622, DOI 10.1121/1.392800
Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
NR 22
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2011
VL 53
IS 2
BP 229
EP 241
DI 10.1016/j.specom.2010.09.004
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LQ
UT WOS:000285665300008
ER
PT J
AU Kim, W
Stern, RM
AF Kim, Wooil
Stern, Richard M.
TI Mask classification for missing-feature reconstruction for robust speech
recognition in unknown background noise
SO SPEECH COMMUNICATION
LA English
DT Article
DE Robust speech recognition; Missing-feature reconstruction;
Frequency-dependent mask classification; Colored-noise masker
generation; Multi-band partition method
AB "Missing-feature" techniques to improve speech recognition accuracy are based on the blind determination of which cells in a spectrogram-like display of speech are corrupted by the effects of noise or other types of disturbance (and hence are "missing"). In this paper we present three new approaches that improve the speech recognition accuracy obtained using missing-feature techniques. It had been found in previous studies (e.g. Seltzer et al., 2004) that Bayesian approaches to missing-feature classification are effective in ameliorating the effects of various types of additive noise. While Seltzer et al. primarily used white noise for training their Bayesian classifier, we have found that this is not the best type of training signal when noise with greater spectral and/or temporal variation is encountered in the testing environment. The first innovation introduced in this paper, referred to as frequency-dependent classification, involves independent classification in each of the various frequency bands in which the incoming speech is analyzed based on parallel sets of frequency-dependent features. The second innovation, referred to as colored-noise generation using multi-band partitioning, involves the use of masking noises with artificially-introduced spectral and temporal variation in training the Bayesian classifier used to determine which spectro-temporal components of incoming speech are corrupted by noise in unknown testing environments. The third innovation consists of an adaptive method to estimate the a priori values of the mask classifier that determines whether a particular time-frequency segment of the test data should be considered to be reliable or not. It is shown that these innovations provide improved speech recognition accuracy on a small vocabulary test when missing-feature restoration is applied to incoming speech that is corrupted by additive noise of an unknown nature, especially at lower signal-to-noise ratios. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Kim, Wooil] Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, CRSS, Richardson, TX 75080 USA.
[Stern, Richard M.] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA.
[Stern, Richard M.] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA.
RP Kim, W (reprint author), Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, CRSS, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA.
EM wikim@utdallas.edu
FU Korea Science and Engineering Foundation (KOSEF); National Science
Foundation [IIS-0420866]
FX This work was supported by the Postdoctoral Fellow Program of the Korea
Science and Engineering Foundation (KOSEF) and by the National Science
Foundation (Grant IIS-0420866). Much of the work described here was
previously presented at the Interspeech 2005 (Kim et al., 2005) and IEEE
ICASSP 2006 (Kim and Stern, 2006) conferences, although with somewhat
different notation.
CR Barker J., 2000, ICSLP 2000, P373
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Cooke M, 1997, INT CONF ACOUST SPEE, P863, DOI 10.1109/ICASSP.1997.596072
Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
*ETSI, 2000, 201108V112 ETSI ES
Harding S, 2005, INT CONF ACOUST SPEE, P537
Hirsch H.G., 2000, ISCA ITRW ASR 2000
Jancovic P., 2003, EUROSP GEN SWITZ, P2161
JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608
Josifovski L., 1999, EUROSPEECH 99, P2837
Kim W, 2006, INT CONF ACOUST SPEE, P305
Kim W., 2005, INTERSPEECH 2005, P2637
Lippmann R.P., 1997, EUROSPEECH, P37
Martin R., 1994, EUSIPCO 94, P1182
Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225
Park HM, 2009, SPEECH COMMUN, V51, P15, DOI 10.1016/j.specom.2008.05.012
Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828
Raj B., 2005, ASRU 2005, P65
Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007
Renevey P., 2001, CRAC2001
Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006
Seltzer M.L., 2000, THESIS CARNEGIE MELL
Singh R., 2002, CRC HDB NOISE REDUCT
Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003
NR 25
TC 7
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 1
EP 11
DI 10.1016/j.specom.2010.08.005
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700001
ER
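To make the notion of a spectro-temporal reliability mask concrete, the sketch below (in Python with NumPy, an assumption of this illustration) marks a cell of a spectrogram-like display as "reliable" whenever its local SNR exceeds a threshold. This is only a common baseline for mask estimation; the frequency-dependent Bayesian classifier trained on multi-band colored noise described in the abstract above is not implemented here.

    import numpy as np

    def snr_threshold_mask(speech_power, noise_power, threshold_db=0.0):
        # speech_power, noise_power: (frames x frequency bands) power estimates
        # Returns a binary mask: 1 where the local SNR exceeds threshold_db.
        snr_db = 10.0 * np.log10(speech_power / np.maximum(noise_power, 1e-12))
        return (snr_db > threshold_db).astype(float)

Missing-feature recognition would then treat masked-out cells as unreliable during decoding, or reconstruct them before computing cepstral features.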
PT J
AU Knoll, M
Scharrer, L
Costall, A
AF Knoll, Monja
Scharrer, Lisa
Costall, Alan
TI "Look at the shark": Evaluation of student- and actress-produced
standardised sentences of infant- and foreigner-directed speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE IDS; Standardised sentences; Simulated speech; Actresses;
Hyperarticulation
ID VOCAL EMOTION EXPRESSION; CONVERSATIONAL SPEECH; IRONIC COMMUNICATION;
ACOUSTIC PROFILES; CUES; INTELLIGIBILITY; PERCEPTION; INTONATION;
LANGUAGE; VOICE
AB Standardised sentence production is routinely used in speech research to avoid the content variability typical of natural speech production. However, the validity of such standardised material is not well understood. Here, we evaluated the use of standardised sentences by comparing them to two existing, non-standardised datasets of simulated free and natural speech (the latter produced by mothers in real interactions). Standardised sentences and simulated free speech were produced by students and actresses without an interaction partner. Each dataset comprised recordings of infant- (IDS), foreigner- (FDS) and adult-directed (ADS) speech, which were analysed for mean F-0, vowel duration and hyperarticulation. Whilst the students' mean F-0 pattern in standardised speech was closer to the natural speech than their previous 'simulated free speech', no difference in vowel hyperarticulation and duration patterns was found for students' standardised sentences between the three speech styles. The actresses' F-0, vowel duration and hyperarticulation patterns in standardised speech were similar to the natural speech, and a partial improvement on their 'simulated free speech'. These results suggest that successful reproduction of some acoustic measures (e.g., F-0) can be achieved with standardised content regardless of the type of speaker, whereas the production of other acoustic measures (e.g., hyperarticulation) is context- and speaker-dependent. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Knoll, Monja] Univ W Scotland, Sch Social Sci, Paisley PA1 2BE, Renfrew, Scotland.
[Scharrer, Lisa] Univ Munster, Dept Psychol, D-48149 Munster, Germany.
[Knoll, Monja; Costall, Alan] Univ Portsmouth, Dept Psychol, Portsmouth PO1 2DY, Hants, England.
RP Knoll, M (reprint author), Univ W Scotland, Sch Social Sci, Paisley PA1 2BE, Renfrew, Scotland.
EM monja.knoll@uws.ac.uk; Lisa.Scharrer@uni-muenster.de;
alan.costall@port.ac.uk
FU Economic and Social Research Council (ESRC)
FX We are grateful to two anonymous reviewers for helpful comments that
greatly improved this manuscript. Our particular thanks go to David
Bauckham and the 'The Bridge Theatre Training Company' for their
invaluable help in recruiting the actresses used in this study. This
research was supported by a research bursary from the Economic and
Social Research Council (ESRC) to Monja Knoll.
CR Anolli L, 2002, INT J PSYCHOL, V37, P266, DOI 10.1080/00207590244000106
Anolli L, 2000, J PSYCHOLINGUIST RES, V29, P275, DOI 10.1023/A:1005100221723
Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016
Biersack S, 2005, PHONETICA, V62, P106, DOI 10.1159/000090092
Biersack S., 2005, P 9 EUR C SPEECH COM, P2401
Boersma P., 2006, PRAAT DOING PHONETIC
Bradlow A.R., 2007, J ACOUSTICAL SOC A 2, V121, P3072
Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5
Bradlow AR, 2003, J SPEECH LANG HEAR R, V46, P80, DOI 10.1044/1092-4388(2003/007)
Breitenstein C, 2001, COGNITION EMOTION, V15, P57, DOI 10.1080/0269993004200114
Burnham D., 2002, SCIENCE, V296, P1095
Cheang HS, 2008, SPEECH COMMUN, V50, P366, DOI 10.1016/j.specom.2007.11.003
COHEN A, 1961, AM J PSYCHOL, V74, P90, DOI 10.2307/1419829
Ferguson SH, 2002, J ACOUST SOC AM, V112, P259, DOI 10.1121/1.1482078
FRIEND M, 1994, J ACOUST SOC AM, V96, P1283, DOI 10.1121/1.410276
JAKOBOVITS LA, 1965, PSYCHOL REP, V17, P785
Johnson N. L., 1970, CONTINUOUS UNIVARIAT, V1
Johnson N. L., 1970, CONTINUOUS UNIVARIAT, V2
KALIKOW DN, 1977, J ACOUST SOC AM, V61, P1337, DOI 10.1121/1.381436
Knoll M, 2009, SPEECH COMMUN, V51, P296, DOI 10.1016/j.specom.2008.10.001
Knower FH, 1941, J SOC PSYCHOL, V14, P369
KRAMER E, 1964, J ABNORM SOC PSYCH, V68, P390, DOI 10.1037/h0042473
Kuhl PK, 1997, SCIENCE, V277, P684, DOI 10.1126/science.277.5326.684
LADD DR, 1985, J ACOUST SOC AM, V78, P435, DOI 10.1121/1.392466
Laukka P, 2005, COGNITION EMOTION, V19, P633, DOI 10.1080/02699930441000445
LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403
MILMOE S, 1967, J ABNORM PSYCHOL, V72, P78, DOI 10.1037/h0024219
Oviatt S, 1998, SPEECH COMMUN, V24, P87, DOI 10.1016/S0167-6393(98)00005-3
PAPOUSEK M, 1991, APPL PSYCHOLINGUIST, V12, P481, DOI 10.1017/S0142716400005889
Parkinson B, 1996, BRIT J PSYCHOL, V87, P663
PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96
Rockwell P, 2000, J PSYCHOLINGUIST RES, V29, P483, DOI 10.1023/A:1005120109296
ROGERS PL, 1971, BEHAV RES METH INSTR, V3, P16, DOI 10.3758/BF03208115
ROSS M, 1973, AM ANN DEAF, V118, P37
Schaeffler F., 2006, 3 SPEECH PROS C DRES
SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143
SCHERER KR, 1971, J EXP RES PERS, V5, P155
Scherer KR, 2007, EMOTION, V7, P158, DOI 10.1037/1528-3542.7.1.158
Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009
SCHERER KR, 1972, J PSYCHOLINGUIST RES, V1, P269, DOI 10.1007/BF01074443
SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
Smiljanic R, 2005, J ACOUST SOC AM, V118, P1677, DOI 10.1121/1.2000788
Soskin W.F., 1961, J COMMUN, V11, P73, DOI 10.1111/j.1460-2466.1961.tb00331.x
STARKWEATHER JA, 1956, AM J PSYCHOL, V69, P121, DOI 10.2307/1418129
Starkweather J.A., 1967, RES VERBAL BEHAV SOM, P253
Tischer B., 1988, SPRACHE KOGNIT, V7, P205
Trainor LJ, 2000, PSYCHOL SCI, V11, P188, DOI 10.1111/1467-9280.00240
Uther M, 2007, SPEECH COMMUN, V49, P2, DOI 10.1016/j.specom.2006.10.003
Viscovich N, 2003, PERCEPT MOTOR SKILL, V96, P759, DOI 10.2466/PMS.96.3.759-771
WALLBOTT HG, 1986, J PERS SOC PSYCHOL, V51, P690, DOI 10.1037//0022-3514.51.4.690
WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238
NR 53
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 12
EP 22
DI 10.1016/j.specom.2010.08.004
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700002
ER
PT J
AU Hjalmarsson, A
AF Hjalmarsson, Anna
TI The additive effect of turn-taking cues in human and synthetic voice
SO SPEECH COMMUNICATION
LA English
DT Article
DE Turn-taking; Speech synthesis; Human-like interaction; Conversational
interfaces
ID CONVERSATION
AB A previous line of research suggests that interlocutors identify appropriate places to speak from cues in the behaviour of the preceding speaker. When used in combination, these cues have an additive effect on listeners' turn-taking attempts. The present study further explores these findings by examining the effect of such turn-taking cues experimentally. The objective is to investigate the possibility of generating turn-taking cues with a synthetic voice. Thus, in addition to stimuli realized with a human voice, the experiment included dialogues in which one of the speakers was replaced with a synthetic voice. The turn-taking cues investigated include intonation, phrase-final lengthening, semantic completeness, stereotyped lexical expressions and non-lexical speech production phenomena such as lexical repetitions, breathing and lip-smacks. The results show that turn-taking cues realized with a synthetic voice affect the judgements similarly to the corresponding human version, and there is no difference in reaction times between these two conditions. Furthermore, the results support Duncan's findings: the more turn-taking cues with the same pragmatic function, turn-yielding or turn-holding, the higher the agreement among subjects on the expected outcome. In addition, the number of turn-taking cues affects the reaction times for these decisions: the more cues, the faster the reaction time. (C) 2010 Elsevier B.V. All rights reserved.
C1 KTH, SE-10044 Stockholm, Sweden.
RP Hjalmarsson, A (reprint author), KTH, Lindstedsvagen 24, SE-10044 Stockholm, Sweden.
EM annah@speech.kth.se
FU Swedish Research Council [2007-6431, GENDIAL]
FX This research was carried out at Centre for Speech Technology, KTH. The
research is also supported by the Swedish Research Council Project
#2007-6431, GENDIAL. Many thanks to Rolf Carlson, Jens Edlund, Joakim
Gustafson, Mattias Heldner, Julia Hirschberg and Gabriel Skantze for
help with valuable comments and annotation of data. Many thanks also to
the reviewers for valuable comments that helped to improve the paper.
CR BEATTIE GW, 1982, NATURE, V300, P744, DOI 10.1038/300744a0
Bruce Gosta, 1977, SWEDISH WORD ACCENTS
Butterworth B., 1975, J PSYCHOLINGUIST RES, V4
Campione E., 2002, ESCA WORKSH SPEECH P, P199
Clark H., 2002, SPEECH COMMUN, V36
Cutler Anne, 1986, INTONATION DISCOURSE, P139
De Ruiter JP, 2006, LANGUAGE, V82, P515
DUNCAN S, 1972, J PERS SOC PSYCHOL, V23, P283, DOI 10.1037/h0033031
Duncan Jr S, 1977, FACE TO FACE INTERAC
Edlund J, 2005, PHONETICA, V62, P215, DOI 10.1159/000090099
Edlund J, 2008, SPEECH COMMUN, V50, P630, DOI 10.1016/j.specom.2008.04.002
Fernandez R., 2007, P 11 WORKSH SEM PRAG, P25
Ferrer L., 2003, IEEE INT C AC SPEECH
Ferrer L., 2002, P INT C SPOK LANG PR, P2061
Ford C. E., 1996, INTERACTION GRAMMAR, P134, DOI 10.1017/CBO9780511620874.003
Gravano A., 2009, THESIS COLUMBIA U
Gustafson J, 2008, LECT NOTES ARTIF INT, V5078, P293, DOI 10.1007/978-3-540-69369-7_35
Heldner M, 2010, J PHONETICS, V38, P555, DOI 10.1016/j.wocn.2010.08.002
Hjalmarsson A., 2007, P SIGDIAL ANTW BELG, P132
Hjalmarsson A., 2008, P SIGDIAL 2008 COL O
IZDEBSKI K, 1978, J SPEECH HEAR RES, V21, P638
KENDON A, 1967, ACTA PSYCHOL, V26, P22, DOI 10.1016/0001-6918(67)90005-4
Kilger A., 1995, RR9511 GERM RES CTR
Koiso H, 1998, LANG SPEECH, V41, P295
Levelt W. J. M., 1989, SPEAKING FROM INTENT
LOCAL J, 1986, HUM STUD, V9, P185, DOI 10.1007/BF00148126
LOCAL JK, 1986, J LINGUIST, V22, P411, DOI 10.1017/S0022226700010859
METZ CE, 1978, SEMIN NUCL MED, V8, P283, DOI 10.1016/S0001-2998(78)80014-2
Oliveira Jr M., 2008, SPEECH PROSODY 2008, P485
Reiter E., 1997, NAT LANG ENG, V3, P57, DOI DOI 10.1017/S1351324997001502
SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243
SCHAFFER D, 1983, J PHONETICS, V11, P243
Schourup L, 1999, LINGUA, V107, P227, DOI 10.1016/S0024-3841(96)90026-1
Selting M., 1996, PRAGMATICS, V6, P357
Shriberg E. E., 1994, THESIS U CALIFORNIA
Watanabe M, 2008, SPEECH COMMUN, V50, P81, DOI 10.1016/j.specom.2007.06.002
Weilhammer K., 2003, ICPHS 2003 BARC SPAI
WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450
NR 38
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 23
EP 35
DI 10.1016/j.specom.2010.08.003
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700003
ER
PT J
AU Stark, A
Paliwal, K
AF Stark, Anthony
Paliwal, Kuldip
TI Use of speech presence uncertainty with MMSE spectral energy estimation
for robust automatic speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Robust speech recognition; MMSE estimation; Speech enhancement methods
ID AMPLITUDE ESTIMATOR; ENHANCEMENT; NOISE; SUPPRESSION; EPHRAIM
AB In this paper, we investigate the use of the minimum mean square error (MMSE) spectral energy estimator in environment-robust automatic speech recognition (ASR). In the past, it has been common to use the MMSE log-spectral amplitude estimator for this task. However, that estimator was originally derived under subjective human listening criteria, so its complex suppression rule may not be optimal for ASR. On the other hand, it can be shown that the MMSE spectral energy estimator is closely related to the MMSE Mel-frequency cepstral coefficient (MFCC) estimator. Despite this, the spectral energy estimator has tended to suffer from excessive residual noise. We examine the cause of this residual noise and show that the introduction of a heuristic-based speech presence uncertainty (SPU) can significantly improve its performance as a front-end ASR enhancement regime. The proposed spectral energy SPU estimator is evaluated on the Aurora2, RM and OLLO2 speech recognition tasks and is shown to significantly improve additive noise robustness over the more common spectral amplitude and log-spectral amplitude estimators. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Stark, Anthony; Paliwal, Kuldip] Griffith Univ, Signal Proc Lab, Nathan, Qld 4111, Australia.
RP Paliwal, K (reprint author), Griffith Univ, Signal Proc Lab, Nathan, Qld 4111, Australia.
EM a.stark@griffith.edu.au; k.paliwal@griffith.edu.au
CR Berouti M, 1979, IEEE INT C AC SPEECH, V4, P208, DOI 10.1109/ICASSP.1979.1170788
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283
CROCHIERE RE, 1980, IEEE T ACOUST SPEECH, V28, P99, DOI 10.1109/TASSP.1980.1163353
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090
EPHRAIM Y, 1991, IEEE T SIGNAL PROCES, V39, P795
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
FUJIMOTO M, 2000, IEEE INT C AC SPEECH, V3, P1727
GALES MJF, 1995, THESIS U CAMBRIDGE U
Gemello R, 2006, IEEE SIGNAL PROC LET, V13, P56, DOI 10.1109/LSP.2005.860535
HERMUS K, 2007, EURASIP J APPL SIG P, P195
Huang X., 2001, SPOKEN LANGUAGE PROC
Lathoud G., 2005, P 2005 IEEE ASRU WOR, P343
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Malah D, 1999, INT CONF ACOUST SPEE, P789
MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394
PALIWAL K, 1987, IEEE INT C AC SPEECH, P297
Pearce D., 2000, ISCA ITRW ASR2000, P29
Price P., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), DOI 10.1109/ICASSP.1988.196669
Wesker T., 2005, P INT, P1273
Wiener N., 1949, EXTRAPOLATION INTERP
Young S., 2000, HTK BOOK VERSION 3 0
NR 23
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 51
EP 61
DI 10.1016/j.specom.2010.08.001
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700005
ER
PT J
AU Raab, M
Gruhn, R
Noth, E
AF Raab, Martin
Gruhn, Rainer
Noeth, Elmar
TI A scalable architecture for multilingual speech recognition on embedded
devices
SO SPEECH COMMUNICATION
LA English
DT Article
DE Multilingual speech recognition; Non-native speech; Projections between
Gaussian spaces; Gaussian Mixture Model distances
ID MIXTURE; MODELS
AB In-car infotainment and navigation devices are typical examples where speech-based interfaces are successfully applied. While classical applications are monolingual, such as voice commands or monolingual destination input, the trend goes towards multilingual applications; examples are music player control and multilingual destination input. As soon as more languages are considered, the training and decoding complexity of the speech recognizer increases. For large multilingual systems, some kind of parameter tying is needed to keep the decoding task feasible on embedded systems with limited resources. A traditional technique for this is to use a semi-continuous Hidden Markov Model as the acoustic model. The monolingual codebook on which such a system relies is not appropriate for multilingual recognition. We introduce Multilingual Weighted Codebooks that give good results with low decoding complexity. These codebooks depend on the actual language combination and increase the training complexity; therefore an algorithm is needed that can reduce the training complexity. Our first proposal is a set of mathematically motivated projections between Hidden Markov Models defined in Gaussian spaces. Although theoretically optimal, these projections are difficult to employ directly in speech decoders. We found approximated projections to be most effective for practical application, giving good performance without requiring major modifications to the common speech recognizer architecture. With a combination of the Multilingual Weighted Codebooks and Gaussian Mixture Model projections, we create an efficient and scalable architecture for non-native speech recognition. Our new architecture offers a solution to the combinatorial problems of training and decoding for multiple languages. It builds new multilingual systems in only 0.002% of the time of a traditional HMM training, and achieves comparable performance on foreign languages. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Raab, Martin; Gruhn, Rainer] Harman Becker Automot Syst, Speech Dialog Syst, Ulm, Germany.
[Raab, Martin; Noeth, Elmar] Univ Erlangen Nurnberg, Chair Pattern Recognit, Erlangen, Germany.
RP Raab, M (reprint author), Harman Becker Automot Syst, Speech Dialog Syst, Ulm, Germany.
EM martin.raab@informatik.uni-erlangen.de; rainer.gruhn@alumni.uni-ulm.de
CR Bartkova K, 2006, INT CONF ACOUST SPEE, P1037
Biehl M., 1990, P ASI SUMM WORKSH NE
BOUSELMI G, 2007, P INT ANTW BELG AUG, P1449
Boyd S., 2004, CONVEX OPTIMIZATION
Dalsgaard P., 1998, P ICSLP SYDN AUSTR, P2623
Fuegen C., 2003, P ASRU, P441
Goronzy S., 2001, P EUROSPEECH 2001, P309
Harbeck S., 1998, P WORKSH TEXT SPEECH, P375
Hershey JR, 2007, INT CONF ACOUST SPEE, P317
Huang X.D., 1990, P ICASSP, P689
Iskra D., 2002, P LREC, P329
Jensen J.H., 2007, P INT C MUS INF RETR, P107
Jian B, 2005, IEEE I CONF COMP VIS, P1246
JUANG BH, 1985, AT&T TECH J, V64, P391
Koch W., 2004, THESIS U KIEL KIEL
Koehler J., 2001, SPEECH COMMUN, V35, P21
Kuhn H.W., 1951, P 2 BERK S MATH STAT, P481
LADEFOGED P, 1990, LANGUAGE, V66, P550, DOI 10.2307/414611
Lang H., 2009, THESIS U ULM ULM
Lieb E.H., 2001, ANALYSIS
Menzel W., 2000, P LREC ATH GREEC, P957
Niesler T., 2006, P ITRW STELL S AFR
Noord G., 2009, TEXTCAT
Noth E., 1999, NATO ASI SERIES F, P363
Park J., 2004, P INT C SPOK LANG PR, P693
Petersen KB, 2008, MATRIX COOKBOOK
Platt J.C., 1988, CONSTRAINED DIFFEREN
Raab M., 2009, P DAGA ROTT NETH, P411
Raab M, 2008, INT CONF ACOUST SPEE, P4257, DOI 10.1109/ICASSP.2008.4518595
RAAB M, 2007, P ASRU KYOT JAP, P413
RAAB M, 2008, P TSD BRNO CZECH REP, P485
SCHADEN S, 2006, THESIS U DUISBURG ES
Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7
Segura J., 2007, HIWIRE DATABASE NOIS
Tan TP, 2007, INT CONF ACOUST SPEE, P1009
Tomokiyo L.M., 2001, P MSLP AALB DENM, P39
Uebler U, 2001, SPEECH COMMUN, V35, P53, DOI 10.1016/S0167-6393(00)00095-9
UEDA Y, 1990, P INT C SPOKEN LANGU, P1209
Wang Z., 2002, P ICMI OCT, P247
Weng F., 1997, P EUR RHOD, P359
Witt S., 1999, THESIS CAMBRIDGE U C
Zhang F., 2005, SCHUR COMPLEMENT ITS
NR 42
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 62
EP 74
DI 10.1016/j.specom.2010.07.007
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700006
ER
PT J
AU Loots, L
Niesler, T
AF Loots, Linsen
Niesler, Thomas
TI Automatic conversion between pronunciations of different English accents
SO SPEECH COMMUNICATION
LA English
DT Article
DE English accents; Pronunciation modelling; G2P; P2P; GP2P; Decision
trees; South African English
AB We describe the application of decision trees to the automatic conversion of pronunciations between American, British and South African English accents. The resulting phoneme-to-phoneme (P2P) conversion technique derives the pronunciation of a word in a new target accent by taking advantage of its existing available pronunciation in a different source accent. We find that it is substantially more accurate to derive pronunciations in this way than directly from the orthography and available target accent pronunciations using more conventional grapheme-to-phoneme (G2P) conversion. Furthermore, by including both the graphemes and the phonemes of the source accent, grapheme-and-phoneme-to-phoneme (GP2P) conversion delivers additional increases in accuracy in relation to P2P. These findings are particularly important for less-resourced varieties of English, for which extensive manually-prepared pronunciation dictionaries are not available. By means of the P2P and GP2P approaches, the pronunciations of new words can be obtained with better accuracy than is possible using G2P methods. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Loots, Linsen; Niesler, Thomas] Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa.
RP Niesler, T (reprint author), Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa.
EM linsen@sun.ac.za; trn@sun.ac.za
FU South African National Research Foundation (NRF) [FA2007022300015,
TTK2007041000010]
FX This material is based on work supported financially by the South
African National Research Foundation (NRF) under Grants FA2007022300015
and TTK2007041000010, and was executed using the High-Performance
Computing (HPC) facility at Stellenbosch University.
CR Andersen O., 1996, P ICSLP PHIL US
BEEP, 2009, BRIT EXAMPLE PRONUNC
Bekker I., 2009, THESIS NW U POTCHEFS
Bisani M., 2004, P ICASSP MONTR CAN
Black A., 1998, P ESCA SPEECH SYNTH
Bowerman Sean, 2004, HDB VARIETIES ENGLIS, VI, P931
CMU, 2009, CARN MELL U PRON DIC
Daelemans W., 1994, PROGR SPEECH SYNTHES, P77
Daelemans W, 1999, MACH LEARN, V34, P11, DOI 10.1023/A:1007585615670
Damper R., 2004, P 5 ISCA SPEECH SYNT
Damper RI, 1999, COMPUT SPEECH LANG, V13, P155, DOI 10.1006/csla.1998.0117
Demberg V., 2007, P ACL PRAG CZECH REP
Dines J., 2008, P SPOK LANG TECHN WO
Han K.-S., 2004, P ICSLP JEJ KOR
Humphries J., 2001, P ICASSP SALT LAK CI
Kienappel A.K., 2001, P EUR AALB DENM
LDC, 2009, COMLEX ENGL PRON LEX
Loots L., 2009, P INT BRIGHT UK
Marchand Y, 2000, COMPUT LINGUIST, V26, P195, DOI 10.1162/089120100561674
Niesler T., 2005, SO AFRICAN LINGUISTI, V23, P459, DOI 10.2989/16073610509486401
Olshen R., 1984, CLASSIFICATION REGRE, V1st
Pagel V., 1998, P ICSLP SYDN AUSTR
Rabiner L, 1993, FUNDAMENTALS SPEECH
Suontausta J., 2000, P ICSLP BEIJ CHIN
Taylor P., 2005, P INT LISB PORT
Torkkola K., 1993, P ICASSP MINN US
van den Heuvel H., 2007, P INT ANTW BELG
Webster G., 2008, P INT BRISB AUSTR
Wells John, 1982, ACCENTS ENGLISH
NR 29
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 75
EP 84
DI 10.1016/j.specom.2010.07.006
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700007
ER
PT J
AU Lazaridis, A
Mporas, I
Ganchev, T
Kokkinakis, G
Fakotakis, N
AF Lazaridis, Alexandros
Mporas, Iosif
Ganchev, Todor
Kokkinakis, George
Fakotakis, Nikos
TI Improving phone duration modelling using support vector regression
fusion
SO SPEECH COMMUNICATION
LA English
DT Article
DE Duration modelling; Parallel fusion scheme; Phone duration prediction;
Support vector regression; Text-to-speech synthesis
ID TEXT-TO-SPEECH; SEGMENTAL DURATION; NETWORKS; DATABASE; APPROXIMATION;
RECOGNITION
AB In the present work, we propose a scheme for the fusion of different phone duration models operating in parallel. Specifically, the predictions from a group of dissimilar and mutually independent duration models are fed to a machine learning algorithm, which reconciles and fuses the outputs of the individual models, yielding more precise phone duration predictions. The performance of the individual duration models and of the proposed fusion scheme is evaluated on the American-English KED TIMIT and the Greek WCL-1 databases. On both databases, the SVR-based individual model demonstrates the lowest error rate. When compared to the second-best individual algorithm, it achieves a relative reduction of the mean absolute error (MAE) and the root mean square error (RMSE) of 5.5% and 3.7% on KED TIMIT, and of 6.8% and 3.7% on WCL-1. At the fusion stage, we evaluate the performance of 12 fusion techniques. The proposed fusion scheme, when implemented with SVR-based fusion, improves the phone duration prediction accuracy over that of the best individual model by 1.9% and 2.0% in terms of relative reduction of the MAE and RMSE on KED TIMIT, and by 2.6% and 1.8% on the WCL-1 database. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Lazaridis, Alexandros; Mporas, Iosif; Ganchev, Todor; Kokkinakis, George; Fakotakis, Nikos] Univ Patras, Dept Elect & Comp Engn, Wire Commun Lab, Artificial Intelligence Grp, Rion 26500, Greece.
RP Lazaridis, A (reprint author), Univ Patras, Dept Elect & Comp Engn, Wire Commun Lab, Artificial Intelligence Grp, Rion 26500, Greece.
EM alaza@upatras.gr; imporas@upatras.gr; tganchev@ieee.org;
gkokkin@wel.ee.upatras.gr; fakotaki@upatras.gr
CR AHA DW, 1991, MACH LEARN, V6, P37, DOI 10.1007/BF00153759
AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705
Allen J., 1987, TEXT SPEECH MITALK S
BARTKOVA K, 1987, SPEECH COMMUN, V6, P245, DOI 10.1016/0167-6393(87)90029-X
Beckman M. E., 1994, GUIDELINES TOBI LABE
Bellegarda JR, 2001, IEEE T SPEECH AUDI P, V9, P52, DOI 10.1109/89.890071
Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9
Breiman L, 1996, MACH LEARN, V24, P123, DOI 10.1023/A:1018054314350
Campbell W. N., 1992, TALKING MACHINES THE, P211
CARLSON R, 1986, PHONETICA, V43, P140
Chen SH, 2003, IEEE T SPEECH AUDI P, V11, P308, DOI 10.1109/TSA.2003.814377
Chen SH, 1998, IEEE T SPEECH AUDI P, V6, P226
CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1553, DOI 10.1121/1.395911
*CSTR, 2001, CSTR US KED TIMIT
Dutoit T., 1997, INTRO TEXT TO SPEECH
EDWARDS J, 1988, PHONETICA, V45, P156
Epitropakis G., 1993, P EUROSPEECH 93 BERL, P1995
Ferrer L., 2003, P EUR GEN, P2017
Freedman D. A., 2007, STATISTICS
Friedman JH, 2002, COMPUT STAT DATA AN, V38, P367, DOI 10.1016/S0167-9473(01)00065-2
Friedman JH, 2001, ANN STAT, V29, P1189, DOI 10.1214/aos/1013203451
Furui S., 2000, DIGITAL SPEECH PROCE
Goubanova O, 2008, SPEECH COMMUN, V50, P301, DOI 10.1016/j.specom.2007.10.002
Goubanova O., 2000, P ICSLP 2000 BEIJ CH, P427
Huang X., 2001, SPOKEN LANGUAGE PROC
Iwahashi N, 2000, IEICE T INF SYST, VE83D, P1550
Jennequin N., 2007, P ICASSP 2007 HON HA, P641
Kaariainen M, 2004, J MACH LEARN RES, V5, P1107
Klatt D. H., 1976, J ACOUST SOC AM, V59, P1209
KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275
Kohavi R, 1997, ARTIF INTELL, V97, P273, DOI 10.1016/S0004-3702(97)00043-X
Kohler K.J., 1988, DIGITALE SPRACHVERAR, P165
KOMINEK J, 2004, P 8 INT C SPOK LANG, P1385
KOMINEK J, 2003, CMULTI03177 SCH COMP
Laver J, 1980, PHONETIC DESCRIPTION
Laver John, 1994, PRINCIPLES PHONETICS
Lazaridis A, 2007, PROC INT C TOOLS ART, P518, DOI 10.1109/ICTAI.2007.33
Lee S., 1999, P OR COCOSDA 99, P109
Levinson S, 1986, P IEEE ICASSP, P1241
MITCHELL CD, 1995, DIGIT SIGNAL PROCESS, V5, P43, DOI 10.1006/dspr.1995.1004
MONKOWSKI MD, 1995, INT CONF ACOUST SPEE, P528, DOI 10.1109/ICASSP.1995.479645
Olive J.P., 1985, J ACOUST SOC AM S1, V78, pS6, DOI 10.1121/1.2022951
PARK J, 1993, NEURAL COMPUT, V5, P305, DOI 10.1162/neco.1993.5.2.305
Platt JC, 1999, ADVANCES IN KERNEL METHODS, P185
Pols LCW, 1996, SPEECH COMMUN, V19, P161, DOI 10.1016/0167-6393(96)00033-7
Quinlan R. J., 1992, P 5 AUSTR JOINT C AR, P343
Rao KS, 2007, COMPUT SPEECH LANG, V21, P282, DOI 10.1016/j.csl.2006.06.003
Rao KS, 2005, 2005 INTERNATIONAL CONFERENCE ON INTELLIGENT SENSING AND INFORMATION PROCESSING, PROCEEDINGS, P258
Riley M., 1992, TALKING MACHINES THE, P265
Scholkopf B., 2002, LEARNING KERNELS
Shih C., 1997, PROGR SPEECH SYNTHES, P383
Silverman K., 1992, P INT C SPOK LANG PR, P867
Simoes A.R.M., 1990, P WORKSH SPEECH SYNT, P173
Smola A.J., 1998, 1998030 NEUROCOLT TR
TAKEDA K, 1989, J ACOUST SOC AM, V86, P2081, DOI 10.1121/1.398467
Van Santen J. P. H., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90016-Y
VANSANTEN JPH, 1994, COMPUT SPEECH LANG, V8, P95, DOI 10.1006/csla.1994.1005
VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5
Vapnik V., 1995, NATURE STAT LEARNING
Vapnik V, 1998, STAT LEARNING THEORY
Vilalta R., 2002, ARTIF INTELL REV, V18, p[2, 77]
Wang L., 2004, P ICASSP 2004, P641
Wang Y, 1997, 9 EUR C MACH LEARN P, P128
WILCOXON F, 1945, BIOMETRICS BULL, V1, P80, DOI 10.2307/3001968
Witten I.H., 1999, DATA MINING PRACTICA
Yallop C., 1995, INTRO PHONETICS PHON
Yamagishi J, 2008, SPEECH COMMUN, V50, P405, DOI 10.1016/j.specom.2007.12.003
Yiourgalis N, 1996, SPEECH COMMUN, V19, P21, DOI 10.1016/0167-6393(96)00012-X
Zervas P, 2008, J QUANT LINGUIST, V15, P154, DOI 10.1080/09296170801961827
NR 69
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 85
EP 97
DI 10.1016/j.specom.2010.07.005
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700008
ER
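The fusion stage described above can be pictured as a stacking regressor: the per-phone predictions of the individual duration models become the input features of a second-level learner. The following minimal sketch uses scikit-learn's SVR with made-up numbers purely for illustration; the feature sets, kernel and hyperparameters of the actual study are not reproduced.

    import numpy as np
    from sklearn.svm import SVR

    # Each row holds the duration predictions (ms) of three hypothetical
    # individual models for one phone; y holds the reference durations.
    X_meta = np.array([[62.0, 70.0, 65.0],
                       [88.0, 95.0, 91.0],
                       [47.0, 50.0, 52.0],
                       [120.0, 110.0, 115.0]])
    y = np.array([66.0, 90.0, 49.0, 114.0])

    fusion = SVR(kernel="rbf", C=10.0, epsilon=1.0)
    fusion.fit(X_meta, y)           # train the fusion model on base-model predictions
    fused = fusion.predict(X_meta)  # reconciled phone duration estimates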
PT J
AU Ghosh, PK
Narayanan, SS
AF Ghosh, Prasanta Kumar
Narayanan, Shrikanth S.
TI Joint source-filter optimization for robust glottal source estimation in
the presence of shimmer and jitter
SO SPEECH COMMUNICATION
LA English
DT Article
DE Glottal flow derivative; Shimmer; Jitter; Glottal source estimation
ID SPEECH; MODEL; FLOW; WAVE; QUALITY
AB We propose a glottal source estimation method robust to shimmer and jitter in the glottal flow. The proposed estimation method is based on a joint source-filter optimization technique. The glottal source is modeled by the Liljencrants-Fant (LF) model and the vocal-tract filter is modeled by an auto-regressive filter, which is common in the source-filter approach to speech production. The optimization estimates the parameters of the LF model, the amplitudes of the glottal flow in each pitch period, and the vocal-tract filter coefficients so that the speech production model best describes the observed speech samples. Experiments with synthetic and real speech data show that the proposed estimation method is robust to different phonation types with varying shimmer and jitter characteristics. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Ghosh, Prasanta Kumar; Narayanan, Shrikanth S.] Univ So Calif, Dept Elect Engn, Signal Anal & Interpretat Lab, Los Angeles, CA 90089 USA.
RP Ghosh, PK (reprint author), Univ So Calif, Dept Elect Engn, Signal Anal & Interpretat Lab, Los Angeles, CA 90089 USA.
EM prasantg@usc.edu
RI Narayanan, Shrikanth/D-5676-2012
FU National Science Foundation (NSF); US Army
FX This research work was supported by the National Science Foundation
(NSF) and the US Army.
CR Airas M, 2006, PHONETICA, V63, P26, DOI 10.1159/000091405
Airas M, 2008, LOGOP PHONIATR VOCO, V33, P49, DOI 10.1080/14015430701855333
ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R
Baken R. J., 1999, CLIN MEASUREMENT SPE
Carre R., 1981, 4 FASE S APR 1981 VE
Childers D.G., 2000, SPEECH PROCESSING SY
DING W, 1995, IEICE T INF SYST, VE78D, P738
Drugman T., 2009, P INTERSPEECH, P2891
Fant G., 1985, STL QPSR 4 85 R I TE
Flanagan J. L., 1965, SPEECH ANAL SYNTHESI
Frohlich M, 2001, J ACOUST SOC AM, V110, P479, DOI 10.1121/1.1379076
Fu Q, 2006, IEEE T AUDIO SPEECH, V14, P492, DOI 10.1109/TSA.2005.857807
HALL MG, 1983, SIGNAL PROCESS, V5, P267, DOI 10.1016/0165-1684(83)90074-9
Hess W., 1983, PITCH DETERMINATION
KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894
KRISHNAMURTHY AK, 1992, IEEE T SIGNAL PROCES, V40, P682, DOI 10.1109/78.120812
KRISHNAMURTHY AK, 1986, IEEE T ACOUST SPEECH, V34, P730, DOI 10.1109/TASSP.1986.1164909
Markel JD, 1976, LINEAR PREDICTION SP
MILLER RL, 1959, J ACOUST SOC AM, V31, P667, DOI 10.1121/1.1907771
Moore E., 2003, P 25 ANN C ENG MED B, V3, P2849
Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109
Quatieri T. F., 2001, DISCRETE TIME SPEECH
Rabiner L. R., 2010, THEORY APPL DIGITAL
ROSENBERG AE, 1971, J ACOUST SOC AM, V49, P583, DOI 10.1121/1.1912389
Saratxaga I., 2009, P INT BRIGHT UK, P1075
Shimamura T, 2001, IEEE T SPEECH AUDI P, V9, P727, DOI 10.1109/89.952490
Strik H, 1998, J ACOUST SOC AM, V103, P2659, DOI 10.1121/1.422786
Veldhuis R, 1998, J ACOUST SOC AM, V103, P566, DOI 10.1121/1.421103
WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260
Yoshiyuki H., 1982, J SPEECH HEAR RES, V25, P12
NR 30
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 98
EP 109
DI 10.1016/j.specom.2010.07.004
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700009
ER
PT J
AU Apsingekar, VR
De Leon, PL
AF Apsingekar, Vijendra Raj
De Leon, Phillip L.
TI Speaker verification score normalization using speaker model clusters
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker verification; Score normalization
ID IDENTIFICATION
AB Among the various proposed score normalizations, T- and Z-norm are most widely used in speaker verification systems. The main idea in these normalizations is to reduce the variations in impostor scores in order to improve accuracy. These normalizations require selection of a set of cohort models or utterances in order to estimate the impostor score distribution. In this paper we investigate basing this selection on recently-proposed speaker model clusters (SMCs). We evaluate this approach using the NTIMIT and NIST-2002 corpora and compare against T- and Z-norm which use other cohort selection methods. We also propose three new normalization techniques, Delta-, Delta T- and TC-norm, which also use SMCs to estimate the normalization parameters. Our results show that we can lower the equal error rate and minimum decision cost function with fewer cohort models using SMC-based score normalization approaches. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Apsingekar, Vijendra Raj; De Leon, Phillip L.] New Mexico State Univ, Klipsch Sch Elect & Comp Engn, Las Cruces, NM 88003 USA.
RP De Leon, PL (reprint author), New Mexico State Univ, Klipsch Sch Elect & Comp Engn, Las Cruces, NM 88003 USA.
EM vijendra@nmsu.edu; pde-leon@nmsu.edu
RI De Leon, Phillip/N-8884-2014
OI De Leon, Phillip/0000-0002-7665-9632
CR Apsingekar VR, 2009, IEEE T AUDIO SPEECH, V17, P848, DOI 10.1109/TASL.2008.2010882
Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360
Bimbot F, 2004, EURASIP J APPL SIG P, V2004, P430, DOI 10.1155/S1110865704310024
Judge GG, 1988, THEORY PRACTICE ECON
Li K. P., 1998, P IEEE INT C AC SPEE, V1, P595
Longworth C, 2009, IEEE T AUDIO SPEECH, V17, P748, DOI 10.1109/TASL.2008.2012193
RAMASWAMY GN, 2003, P ICASSP, V2, P61
Ramos-Castro D, 2007, PATTERN RECOGN LETT, V28, P90, DOI 10.1016/j.patrec.2006.06.008
Ravulakollu K, 2008, 42ND ANNUAL 2008 IEEE INTERNATIONAL CARNAHAN CONFERENCE ON SECURITY TECHNOLOGY, PROCEEDINGS, P56, DOI 10.1109/CCST.2008.4751277
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Reynolds D. A., 1995, Lincoln Laboratory Journal, V8
REYNOLDS DA, 2003, ACOUST SPEECH SIG PR, P53
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Sturim D. E., 2005, P IEEE INT C AC SPEE, P741, DOI 10.1109/ICASSP.2005.1415220
van Leeuwen D. A., 2005, INTERSPEECH, P1981
Zhang SX, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P1275
NR 16
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 110
EP 118
DI 10.1016/j.specom.2010.07.001
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700010
ER
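For readers unfamiliar with the score normalizations compared above: T-norm and Z-norm both standardize a raw verification score with impostor-score statistics, differing only in whether those statistics come from cohort models scored against the test utterance (T-norm) or from impostor utterances scored offline against the claimed speaker's model (Z-norm). A minimal sketch in Python/NumPy follows; the SMC-based cohort selection and the proposed Delta-, Delta T- and TC-norms are not shown.

    import numpy as np

    def z_norm(raw_score, impostor_scores):
        # impostor_scores: scores of impostor utterances against the target model
        mu, sigma = np.mean(impostor_scores), np.std(impostor_scores)
        return (raw_score - mu) / sigma

    def t_norm(raw_score, cohort_scores):
        # cohort_scores: scores of the same test utterance against cohort models
        mu, sigma = np.mean(cohort_scores), np.std(cohort_scores)
        return (raw_score - mu) / sigma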
PT J
AU Mak, MW
Rao, W
AF Mak, Man-Wai
Rao, Wei
TI Utterance partitioning with acoustic vector resampling for GMM-SVM
speaker verification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker verification; GMM-supervectors (GSV); Utterance partitioning;
GMM-SVM; Support vector machine; Random resampling; Data imbalance
ID RECOGNITION; MACHINES; CLASSIFICATION; ENSEMBLE
AB Recent research has demonstrated the merit of combining Gaussian mixture models and support vector machine (SVM) for text-independent speaker verification. However, one unaddressed issue in this GMM-SVM approach is the imbalance between the numbers of speaker-class utterances and impostor-class utterances available for training a speaker-dependent SVM. This paper proposes a resampling technique - namely utterance partitioning with acoustic vector resampling (UP-AVR) - to mitigate the data imbalance problem. Briefly, the sequence order of acoustic vectors in an enrollment utterance is first randomized, which is followed by partitioning the randomized sequence into a number of segments. Each of these segments is then used to produce a GMM supervector via MAP adaptation and mean vector concatenation. The randomization and partitioning processes are repeated several times to produce a sufficient number of speaker-class supervectors for training an SVM. Experimental evaluations based on the NIST 2002 and 2004 SRE suggest that UP-AVR can reduce the error rate of GMM-SVM systems. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Mak, Man-Wai; Rao, Wei] Hong Kong Polytech Univ, Elect & Informat Engn Dept, Ctr Signal Proc, Hong Kong, Hong Kong, Peoples R China.
RP Mak, MW (reprint author), Hong Kong Polytech Univ, Elect & Informat Engn Dept, Ctr Signal Proc, Hong Kong, Hong Kong, Peoples R China.
EM enmwmak@polyu.edu.hk
FU Center for Signal Processing, The Hong Kong Polytechnic University [1-BB9W];
Research Grant Council of The Hong Kong SAR [PolyU 5264/09E]
FX This work was in part supported by Center for Signal Processing, The
Hong Kong Polytechnic University (1-BB9W) and Research Grant Council of The
Hong Kong SAR (PolyU 5264/09E).
CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702
Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360
Bar-Yosef Y., 2009, INTERSPEECH, P1271
Bolle R., 2004, GUIDE BIOMETRICS
Bolle R. M., 1999, P AUTOID 99, P9
Bonastre J. F., 2005, ICASSP, V1, P737
Campbell W. M., 2006, P IEEE INT C AC SPEE, V1, P97
Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086
Chawla N. V., 2003, 7 EUR C PRINC PRACT, P107
Chawla NV, 2002, J ARTIF INTELL RES, V16, P321
Cieri C., 2004, P 4 INT C LANG RES E, P69
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Dehak N, 2009, INT CONF ACOUST SPEE, P4237, DOI 10.1109/ICASSP.2009.4960564
EFRON B, 1983, AM STAT, V37, P36, DOI 10.2307/2685844
Fauve B., 2008, ODYSSEY 2008
Ferrer L., 2004, ICASSP, P173
Gillick L., 1989, P ICASSP, P532
Gonzalez-Rodriguez J, 2006, COMPUT SPEECH LANG, V20, P331, DOI 10.1016/j.csl.2005.08.005
Kang PS, 2006, LECT NOTES COMPUT SC, V4232, P837
Kenny P., 2003, EUROSPEECH, P2961
LeCun Y., 2005, P 10 INT WORKSH ART
Lin Y, 2002, MACH LEARN, V46, P191, DOI 10.1023/A:1012406528296
Lin ZY, 2009, LECT NOTES COMPUT SC, V5678, P536
Longworth C, 2009, IEEE T AUDIO SPEECH, V17, P748, DOI 10.1109/TASL.2008.2012193
Martin A. F., 1997, P EUROSPEECH, P1895
Pelecanos J., 2001, P SPEAK OD SPEAK REC, P213
Ramaswamy G. N., 2003, ICASSP, V2, P61
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Solomonoff A., 2004, P OD SPEAK LANG REC, P41
Solomonoff A, 2005, INT CONF ACOUST SPEE, P629
Sun AX, 2009, DECIS SUPPORT SYST, V48, P191, DOI 10.1016/j.dss.2009.07.011
Tang YC, 2009, IEEE T SYST MAN CY B, V39, P281, DOI 10.1109/TSMCB.2008.2002909
van Leeuwen D. A., 2005, INTERSPEECH, P1981
Veropoulos K., 1999, P INT JOINT C ART IN, P55
Wu G, 2005, IEEE T KNOWL DATA EN, V17, P786, DOI 10.1109/TKDE.2005.95
Zhang SX, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P1275
NR 36
TC 8
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 119
EP 130
DI 10.1016/j.specom.2010.06.011
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700011
ER
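The randomization-and-partitioning step of UP-AVR summarized above can be sketched in a few lines of Python/NumPy. The MAP adaptation and mean-vector concatenation that turn each segment into a GMM supervector are outside the scope of this illustration, and the number of partitions and repetitions here is arbitrary.

    import numpy as np

    def utterance_partitioning(frames, n_partitions=4, n_repeats=3, seed=0):
        # frames: (num_frames x feature_dim) acoustic vectors of one enrollment utterance.
        # Returns n_partitions * n_repeats frame subsets, each of which would be
        # MAP-adapted into a GMM supervector for SVM training.
        rng = np.random.default_rng(seed)
        segments = []
        for _ in range(n_repeats):
            shuffled = rng.permutation(frames)            # randomize frame order
            segments.extend(np.array_split(shuffled, n_partitions))
        return segments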
PT J
AU Alpan, A
Maryn, Y
Kacha, A
Grenez, F
Schoentgen, J
AF Alpan, A.
Maryn, Y.
Kacha, A.
Grenez, F.
Schoentgen, J.
TI Multi-band dysperiodicity analyses of disordered connected speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Connected disordered speech; Variogram; Signal-to-dysperiodicity ratio;
Cepstral peak prominence; Multi-band analysis; Multi-variable analysis
ID VOCAL DYSPERIODICITIES; SPASMODIC DYSPHONIA; ACOUSTIC MEASURES;
SUSTAINED VOWELS; RUNNING SPEECH; CEPSTRAL PEAK; VOICES; DISCRIMINATION;
LARYNGOGRAPH; PREDICTION
AB The objective is to analyse vocal dysperiodicities in connected speech produced by dysphonic speakers. The analysis involves a variogram-based method that enables tracking instantaneous vocal dysperiodicities. The dysperiodicity trace is summarized by means of the signal-to-dysperiodicity ratio, which has been shown to correlate strongly with the perceived degree of hoarseness of the speaker. Previously, this method has been evaluated on small corpora only. In this article, analyses have been carried out on two corpora comprising over 250 and 700 speakers. This has enabled carrying out multi-frequency band and multi-cue analyses without risking overfitting. The analysis results are compared to the cepstral peak prominence, which is a popular cue that indirectly summarizes vocal dysperiodicities frame-wise. A perceptual rating has been available for the first corpus whereas speakers in the second corpus have been categorized as normal or pathological only. For the first corpus, results show that the correlation with perceptual scores increases statistically significantly for multi-band analysis compared to conventional full-band analysis. Also, combining the cepstral peak prominence with the low-frequency band signal-to-dysperiodicity ratio statistically significantly increases their combined correlation with perceptual scores. The signal-to-dysperiodicity ratios of the two corpora have been separately submitted to principal component analysis. The results show that the first two principal components are interpretable in terms of the degree of dysphonia and the spectral slope, respectively. The clinical relevance of the principal components has been confirmed by linear discriminant analysis. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Alpan, A.] Univ Libre Bruxelles, Lab Images Signals & Telecommun Devices, LIST, B-1050 Brussels, Belgium.
[Maryn, Y.] Sint Jan Gen Hosp, Dept Speech Language Pathol & Audiol, Dept Otorhinolaryngol & Head & Neck Surgery, Brugge, Belgium.
RP Alpan, A (reprint author), Univ Libre Bruxelles, Lab Images Signals & Telecommun Devices, LIST, CP165-51,Av F Roosevelt 50, B-1050 Brussels, Belgium.
EM aalpan@ulb.ac.be; Youri.Maryn@azbrugge.be; akacha@ulb.ac.be;
fgrenez@ulb.ac.be; jschoent@ulb.ac.be
FU "Region Wallonne", Belgium
FX This research was supported by the "Region Wallonne", Belgium, in the
framework of the "WALEO II" programme. We are grateful to Dr. Marc
Remacle, Department of Otorhinolaryngology and Head and Neck Surgery,
University Hospital of Louvain at Mont-Godinne, Belgium, for providing
the Kay Elemetrics Database.
CR Alpan A., 2009, P INT BRIGHT UK, P959
Alpan A., 2007, P INT ANTW BELG AUG, P1178
Awan SN, 2005, J VOICE, V19, P268, DOI 10.1016/j.jvoice.2004.03.005
Bettens F, 2005, J ACOUST SOC AM, V117, P328, DOI 10.1121/1.1835511
DOLANSKY L, 1968, IEEE T ACOUST SPEECH, VAU16, P51, DOI 10.1109/TAU.1968.1161962
DUNN OJ, 1969, J AM STAT ASSOC, V64, P366, DOI 10.2307/2283746
Edgar JD, 2001, J VOICE, V15, P362, DOI 10.1016/S0892-1997(01)00038-8
Fairbanks G, 1960, VOICE ARTICULATION D
Halberstam B., 2004, ORL, V66, P70
FOURCIN AJ, 1971, MED BIOL ILLUS, V21, P172
FOURCIN AJ, 1977, PHONETICA, V34, P313
Fredouille C, 2009, EURASIP J ADV SIG PR, DOI 10.1155/2009/982102
Haslett J, 1997, STATISTICIAN, V46, P475
HECKER MHL, 1971, J ACOUST SOC AM, V49, P1275, DOI 10.1121/1.1912490
Heman-Ackah YD, 2002, J VOICE, V16, P20, DOI 10.1016/S0892-1997(02)00067-X
Heman-Ackah YD, 2004, J VOICE, V18, P203, DOI 10.1016/j.jvoice.2004.01.005
Heman-Ackah Y.D., 2003, ANN OTO RHINOL LARYN, V112, P324
Hernandez-Espinosa C., 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, DOI 10.1109/IJCNN.2000.860781
Hillenbrand J, 1996, J SPEECH HEAR RES, V39, P311
Hinkle DE, 1998, APPL STAT BEHAV SCI
Hirano M, 1981, CLIN EXAMINATION VOI
Hotelling H, 1940, ANN MATH STAT, V11, P271, DOI 10.1214/aoms/1177731867
Jayant N. S., 1984, CODING WAVEFORMS PRI
Jolliffe I. T., 2002, PRINCIPAL COMPONENT, V2nd
Kacha A, 2006, SPEECH COMMUN, V48, P1365, DOI 10.1016/j.specom.2006.07.003
Kacha A, 2006, BIOMED SIGNAL PROCES, V1, P137, DOI 10.1016/j.bspc.2006.07.002
Kay Elemetrics Corp, 1994, DIS VOIC DAT VER 1 0
Klingholz F., 1987, SPEECH COMMUN, V6, P1
KLINGHOLZ F, 1990, J ACOUST SOC AM, V87, P2218, DOI 10.1121/1.399189
LAVER J, 1986, J PHONETICS, V14, P517
LIEBERMAN P, 1963, J ACOUST SOC AM, V35, P344, DOI 10.1121/1.1918465
Maryn Y., J VOICE IN PRESS
MURRY T, 1980, J SPEECH HEAR RES, V23, P361
MUTA H, 1988, J ACOUST SOC AM, V84, P1292, DOI 10.1121/1.396628
Oppenheim A.V., 1975, DIGITAL SIGNAL PROCE
Parsa V, 2001, J SPEECH LANG HEAR R, V44, P327, DOI 10.1044/1092-4388(2001/027)
Parsa V., 2002, P INT C SPOK LANG PR, P2505
Qi YY, 1999, J ACOUST SOC AM, V105, P2532, DOI 10.1121/1.426860
Sapienza CM, 2002, J SPEECH LANG HEAR R, V45, P830, DOI 10.1044/1092-4388(2002/067)
Schoentgen J, 2003, J ACOUST SOC AM, V113, P553, DOI 10.1121/1.1523384
Stevens SS, 1937, J ACOUST SOC AM, V8, P185, DOI 10.1121/1.1915893
Umapathy K, 2005, IEEE T BIO-MED ENG, V52, P421, DOI 10.1109/TBME.2004.842962
Yiu E, 2000, CLIN LINGUIST PHONET, V14, P295
NR 43
TC 11
Z9 12
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 131
EP 141
DI 10.1016/j.specom.2010.06.010
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700012
ER
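As a rough illustration of the signal-to-dysperiodicity ratio used above to summarize the dysperiodicity trace, the sketch below forms an energy ratio in dB between the speech signal and a dysperiodicity trace assumed to have been obtained separately (e.g., by the variogram-based analysis); the frame-wise weighting and multi-band filtering of the published method are not reproduced.

    import numpy as np

    def signal_to_dysperiodicity_ratio(speech, dysperiodicity):
        # speech, dysperiodicity: 1-D arrays of equal length (samples)
        signal_energy = np.sum(speech ** 2)
        dysp_energy = max(np.sum(dysperiodicity ** 2), 1e-12)
        return 10.0 * np.log10(signal_energy / dysp_energy)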
PT J
AU Mori, H
Satake, T
Nakamura, M
Kasuya, H
AF Mori, Hiroki
Satake, Tomoyuki
Nakamura, Makoto
Kasuya, Hideki
TI Constructing a spoken dialogue corpus for studying paralinguistic
information in expressive conversation and analyzing its
statistical/acoustic characteristics
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotional state; Expressive speech; Annotation; Abstract dimensions;
Spontaneous speech; Spoken dialogue
ID FACIAL EXPRESSIONS; EMOTIONAL STATES; SPEECH; GENERATION; RATINGS
AB The Utsunomiya University (UU) Spoken Dialogue Database for Paralinguistic Information Studies is introduced. The UU Database is especially intended for use in understanding the usage, structure and effect of paralinguistic information in expressive Japanese conversational speech. Paralinguistic information refers to meaningful information, such as emotion or attitude, delivered along with linguistic messages. The UU Database comes with labels of perceived emotional states for all utterances. The emotional states were annotated with six abstract dimensions: pleasant-unpleasant, aroused-sleepy, dominant-submissive, credible-doubtful, interested-indifferent, and positive-negative. To stimulate expressively-rich and vivid conversation, the "4-frame cartoon sorting task" was devised. In this task, four cards each containing one frame extracted from a cartoon are shuffled, and each participant with two cards out of the four then has to estimate the original order. The effectiveness of the method was supported by a broad distribution of subjective emotional state ratings. Preliminary annotation experiments by a large number of annotators confirmed that most annotators could provide fairly consistent ratings for a repeated identical stimulus, and the inter-rater agreement was good (W ≈ 0.5) for three of the six dimensions. Based on the results, three annotators were selected for labeling all 4840 utterances. The high degree of agreement was verified using such measures as Kendall's W. The results of correlation analyses showed that not only prosodic parameters such as intensity and f(0) but also a voice quality parameter were related to the dimensions. Multiple correlation of above 0.7 and RMS error of about 0.6 were obtained for the recognition of some dimensions using linear combinations of the speech parameters. Overall, the perceived emotional states of speakers can be accurately estimated from the speech parameters in most cases. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Mori, Hiroki; Satake, Tomoyuki; Kasuya, Hideki] Utsunomiya Univ, Grad Sch Engn, Utsunomiya, Tochigi 3218585, Japan.
[Nakamura, Makoto] Utsunomiya Univ, Fac Int Studies, Utsunomiya, Tochigi 3218505, Japan.
RP Mori, H (reprint author), Utsunomiya Univ, Grad Sch Engn, 7-1-2 Yoto, Utsunomiya, Tochigi 3218585, Japan.
EM hiroki@klab.ee.utsunomiya-u.ac.jp
CR ANDERSON AH, 1991, LANG SPEECH, V34, P351
ARGYLE M, 1965, SOCIOMETRY, V28, P289, DOI 10.2307/2786027
ARGYLE M, 1971, EUR J SOC PSYCHOL, V1, P385, DOI 10.1002/ejsp.2420010307
Arimoto Y, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P322
Auberge V., 2003, P EUR 2003, P185
Burgoon J. K., 1989, NONVERBAL COMMUNICAT
Campbell N., 2003, 1 JST CREST INT WORK, P61
Campbell N., 2004, J PHONET SOC JPN, V8, P9
Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197
Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7
Devillers L, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P801
Douglas-Cowie E., 2000, P ISCA WORKSH SPEECH, P39
Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5
Duck S.W., 1993, SOCIAL CONTEXT RELAT, V3
Erickson D., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.317
Horiuchi Y., 1999, Journal of Japanese Society for Artificial Intelligence, V14
Greasley P., 1995, P 13 INT C PHON SCI, P242
Grimm M, 2008, 2008 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO, VOLS 1-4, P865, DOI 10.1109/ICME.2008.4607572
HUTTAR GL, 1968, J SPEECH HEAR RES, V11, P481
Kasuya H., 1999, P INT C PHON SCI SAN, P2505
Kasuya H., 2000, P INT C SPOK LANG PR, P345
Keeley M., 1994, UNDERSTANDING RELATI, V4, P35
MacIntyre R, 1995, DYSFLUENCY ANNOTATIO
Mehrabian A., 1974, THEORY AFFILIATION
Miller G.R., 1993, INTERPERSONAL COMMUN, V14
Mori Hiroki, 2009, Acoustical Science and Technology, V30, DOI 10.1250/ast.30.376
Mori H., 2005, J ACOUST SOC JPN, V61, P690
Mori H., 2007, SIGSLUDA60303 JSAI
Mori H, 2008, IEICE T INF SYST, VE91D, P1628, DOI 10.1093/ietisy/e91-d.6.1628
Mori H., 2007, P INT 2007, P102
Morimoto T., 1994, P ICSLP, P1791
MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558
Ohtsuka T., 2001, P EUR 2001, V3, P2267
PALMER MT, 1989, COMMUN MONOGR, V56, P1
Patterson M.L., 1991, FUNDAMENTALS NONVERB, P58
Rosenthal R., 1979, SENSITIVITY NONVERBA
RUSSELL JA, 1985, J PERS SOC PSYCHOL, V48, P1290, DOI 10.1037/0022-3514.48.5.1290
RUSSELL JA, 1989, J PERS SOC PSYCHOL, V57, P493, DOI 10.1037/0022-3514.57.3.493
Satake T., 2009, SIGSLUDA80313 JSAI
Scherer K. R., 1977, MOTIV EMOTION, V1, P331, DOI 10.1007/BF00992539
Scherer K. R., 1989, HDB SOCIAL PSYCHOPHY, P165
SCHLOSBERG H, 1952, J EXP PSYCHOL, V44, P229, DOI 10.1037/h0055778
Schroder M, 2004, THESIS SAARLAND U
STANG DJ, 1973, J PERS SOC PSYCHOL, V27, P405, DOI 10.1037/h0034940
The Japan Ministry of Education Culture Sports Science and Technology, 2000, PHYS FITN TEST, P1
Truong KP, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P318
WARNER RM, 1987, J NONVERBAL BEHAV, V11, P57, DOI 10.1007/BF00990958
Witten I, 2005, DATA MINING
NR 48
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2011
VL 53
IS 1
BP 36
EP 50
DI 10.1016/j.specom.2010.08.002
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 699LA
UT WOS:000285663700004
ER
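Kendall's coefficient of concordance W, used above to quantify inter-rater agreement on the emotional-state dimensions, can be computed as follows for an (annotators x utterances) rating matrix. This sketch converts ratings to ranks per annotator and applies the basic formula without a tie correction, so its values will differ slightly from tie-corrected implementations.

    import numpy as np

    def kendalls_w(ratings):
        # ratings: (n_annotators x n_items) matrix of scores on one dimension
        ranks = np.argsort(np.argsort(ratings, axis=1), axis=1) + 1.0
        m, n = ranks.shape
        rank_sums = ranks.sum(axis=0)
        s = np.sum((rank_sums - rank_sums.mean()) ** 2)
        return 12.0 * s / (m ** 2 * (n ** 3 - n))

For example, three annotators who rank five utterances identically give W = 1.0, while unrelated rankings give values near 0.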
PT J
AU Cutler, A
Cooke, M
Lecumberri, MLG
AF Cutler, Anne
Cooke, Martin
Lecumberri, M. Luisa Garcia
TI Special Issue: Non-native Speech Perception in Adverse Conditions
Preface
SO SPEECH COMMUNICATION
LA English
DT Editorial Material
RI Cutler, Anne/C-9467-2012
NR 0
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 863
EP 863
DI 10.1016/j.specom.2010.11.003
PG 1
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700001
ER
PT J
AU Lecumberri, MLG
Cooke, M
Cutler, A
AF Garcia Lecumberri, Maria Luisa
Cooke, Martin
Cutler, Anne
TI Non-native speech perception in adverse conditions: A review
SO SPEECH COMMUNICATION
LA English
DT Review
DE Non-native; Speech perception; Noise; Review
ID TRAINING JAPANESE LISTENERS; SPOKEN-WORD RECOGNITION; INFORMATIONAL
MASKING; LANGUAGE-ACQUISITION; ENGLISH CONSONANTS; NATIVE-LANGUAGE;
LEXICAL ACCESS; 2ND LANGUAGE; FOREIGN ACCENT; 2ND-LANGUAGE ACQUISITION
AB If listening in adverse conditions is hard, then listening in a foreign language is doubly so: non-native listeners have to cope with both imperfect signals and imperfect knowledge. Comparison of native and non-native listener performance in speech-in-noise tasks helps to clarify the role of prior linguistic experience in speech perception, and, more directly, contributes to an understanding of the problems faced by language learners in everyday listening situations. This article reviews experimental studies on non-native listening in adverse conditions, organised around three principal contributory factors: the task facing listeners, the effect of adverse conditions on speech, and the differences among listener populations. Based on a comprehensive tabulation of key studies, we identify robust findings, research trends and gaps in current knowledge. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Garcia Lecumberri, Maria Luisa; Cooke, Martin] Univ Basque Country, Fac Letras, Language & Speech Lab, Vitoria 01006, Spain.
[Cooke, Martin] Basque Fdn Sci, IKERBASQUE, Bilbao 48011, Spain.
[Cutler, Anne] Max Planck Inst Psycholinguist, NL-6500 AH Nijmegen, Netherlands.
[Cutler, Anne] Univ Western Sydney, MARCS Auditory Labs, Penrith, NSW 1797, Australia.
[Cutler, Anne] Radboud Univ Nijmegen, Donders Inst Brain Cognit & Behav, Nijmegen, Netherlands.
RP Lecumberri, MLG (reprint author), Univ Basque Country, Fac Letras, Language & Speech Lab, Paseo Univ 5, Vitoria 01006, Spain.
EM garcia.lecumberri@ehu.es
RI Cutler, Anne/C-9467-2012
FU EU, Basque Government [IT311-10]; Spanish Government [FFI2009-10264]
FX M.L. Garcia Lecumberri and M. Cooke were supported by the EU Marie Curie
RTN "Sound to Sense", Basque Government Grant IT311-10 and Spanish
Government grant FFI2009-10264. We thank Madhu Shashanka for recording
the reverberant sound example.
CR Adank P, 2009, J EXP PSYCHOL HUMAN, V35, P520, DOI 10.1037/a0013552
Akker Evelien, 2003, BILING-LANG COGN, V6, P81, DOI 10.1017/S1366728903001056
Assmann P. F., 2004, SPRINGER HDB AUDITOR, V18
BASHFORD JA, 1988, J ACOUST SOC AM, V84, P1635, DOI 10.1121/1.397178
BASHFORD JA, 1987, PERCEPT PSYCHOPHYS, V42, P114, DOI 10.3758/BF03210499
Bent T, 2003, J ACOUST SOC AM, V114, P1600, DOI 10.1121/1.1603234
BERGMAN, 1980, AGING PERCEPTION SPE, P123
Best C. T., 1995, SPEECH PERCEPTION LI, P171
Best CT, 2001, J ACOUST SOC AM, V109, P775, DOI 10.1121/1.1332378
BLACK JW, 1962, J SPEECH HEAR RES, V5, P70
Bloomfield Leonard, 1933, LANGUAGE
Bohn O. S., 1995, SPEECH PERCEPTION LI, P279
BOHN OS, 1990, APPL PSYCHOLINGUIST, V11, P303, DOI 10.1017/S0142716400008912
Bohn OS, 2000, AMST STUD THEORY HIS, V198, P1
BOHN OS, 2007, HONOR JE FLEGE
BOSCH L, 1997, EUROSPEECH 97, V1, P231
Bradlow A, 2010, SPEECH COMMUN, V52, P930, DOI 10.1016/j.specom.2010.06.003
Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103
Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837
Broersma M, 2008, SYSTEM, V36, P22, DOI 10.1016/j.system.2007.11.003
BROERSMA M, 2010, Q J EXP PSYCHOL
Broersma M, 2010, SPEECH COMMUN, V52, P980, DOI 10.1016/j.specom.2010.08.010
Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946
Burki-Cohen J, 2001, LANG SPEECH, V44, P149
BUUS S, 1986, P INT 86, P895
CARHART R, 1969, J ACOUST SOC AM, V45, P694, DOI 10.1121/1.1911445
Cebrian J, 2006, J PHONETICS, V34, P372, DOI 10.1016/j.wocn.2005.08.003
Cieslicka A, 2006, SECOND LANG RES, V22, P115, DOI 10.1191/0267658306sr263oa
Clahsen H, 2006, APPL PSYCHOLINGUIST, V27, P3, DOI 10.1017/S0142716406060024
Clopper C.G., 2006, J ACOUST SOC AM, V119, P3424
Cooke M, 2010, SPEECH COMMUN, V52, P954, DOI 10.1016/j.specom.2010.04.004
Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952
Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600
Cooper N, 2002, LANG SPEECH, V45, P207
Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292
Cutler A, 2005, SPEECH COMMUN, V47, P32, DOI 10.1016/j.specom.2005.02.001
Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707
Cutler A, 2006, J PHONETICS, V34, P269, DOI 10.1016/j.wocn.2005.06.002
Cutler A, 2007, P INTERSPEECH 2007 A, P1585
Darwin CJ, 2003, J ACOUST SOC AM, V114, P2913, DOI 10.1121/1.1616924
Darwin CJ, 2008, PHILOS T R SOC B, V363, P1011, DOI 10.1098/rstb.2007.2156
ECKMAN FR, 1977, LANG LEARN, V27, P315, DOI 10.1111/j.1467-1770.1977.tb00124.x
Eisner F, 2005, PERCEPT PSYCHOPHYS, V67, P224, DOI 10.3758/BF03206487
Ellis Rod, 1994, STUDY 2 LANGUAGE ACQ
Ezzatian P, 2010, SPEECH COMMUN, V52, P919, DOI 10.1016/j.specom.2010.04.001
FAIRBANKS G, 1958, J ACOUST SOC AM, V30, P596, DOI 10.1121/1.1909702
FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247
Flege J. E., 1988, HUMAN COMMUNICATION, V1, P224
Flege J. E., 1995, SPEECH PERCEPTION LI, P233
Flege J. E., 1987, APPLIED LINGUISTICS, V8, P162, DOI [10.1093/applin/8.2.162, DOI 10.1093/APPLIN/8.2.162]
Flege James Emil, 2001, STUDIES 2 LANGUAGE A, V23, P527
Flege JE, 1999, SEC LANG ACQ RES, P101
Flege JE, 1999, J MEM LANG, V41, P78, DOI 10.1006/jmla.1999.2638
FLEGE JE, 1995, J ACOUST SOC AM, V97, P3125, DOI 10.1121/1.413041
Flege JE, 1997, J PHONETICS, V25, P169, DOI 10.1006/jpho.1996.0040
Flege JE, 1997, J PHONETICS, V25, P437, DOI 10.1006/jpho.1997.0052
FOX RA, 1995, J ACOUST SOC AM, V97, P2540, DOI 10.1121/1.411974
Florentine M., 1985, P INT 85, P1021
Florentine M, 1984, J ACOUST SOC AM, V75, pS84, DOI 10.1121/1.2021645
Frauenfelder U. H., 1998, LANGUAGE COMPREHENSI, P1
Freyman RL, 2004, J ACOUST SOC AM, V115, P2246, DOI 10.1121/1.689343
GARLAND S, 2007, BILINGUAL SPECTRUM
Garnier M., 2007, THESIS U PARIS 6
Gaskell MG, 1997, LANG COGNITIVE PROC, V12, P613
GAT IB, 1978, AUDIOLOGY, V17, P339
Golestani N, 2009, BILING-LANG COGN, V12, P385, DOI 10.1017/S1366728909990150
Gooskens C, 2010, SPEECH COMMUN, V52, P1022, DOI 10.1016/j.specom.2010.06.005
Grosjean F, 1998, LANGUAGE COGNITION, V1, P131
Grosjean F., 2010, BILINGUAL LIFE REALI
Grosjean F., 2001, ONE MIND 2 LANGUAGES, P1
Guion SG, 2000, J PHONETICS, V28, P27, DOI 10.1006/jpho.2000.0104
Hardison DM, 1996, LANG LEARN, V46, P3, DOI 10.1111/j.1467-1770.1996.tb00640.x
HARLEY B, 1995, LANG LEARN, V45, P43, DOI 10.1111/j.1467-1770.1995.tb00962.x
Hazan V, 2000, LANG SPEECH, V43, P273
Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9
Hazan V, 2010, SPEECH COMMUN, V52, P996, DOI 10.1016/j.specom.2010.05.003
Heinrich A, 2010, SPEECH COMMUN, V52, P1038, DOI 10.1016/j.specom.2010.09.009
Hoen M., 2007, SPEECH COMMUN, V49, P905, DOI 10.1016/j.specom.2007.05.008
HOWES D, 1957, J ACOUST SOC AM, V29, P296, DOI 10.1121/1.1908862
Imai S, 2005, J ACOUST SOC AM, V117, P896, DOI 10.1121/1.1823291
IOUP G, 1984, LANG LEARN, V34, P1, DOI 10.1111/j.1467-1770.1984.tb01001.x
IVERSON P, 1995, J ACOUST SOC AM, V97, P553, DOI 10.1121/1.412280
Jiang JT, 2006, J ACOUST SOC AM, V119, P1092, DOI 10.1121/1.2149841
Jones C, 2007, COMPUT SPEECH LANG, V21, P641, DOI 10.1016/j.csl.2007.03.001
KREUL EJ, 1968, J SPEECH HEAR RES, V11, P536
Kuhl PK, 2004, NAT REV NEUROSCI, V5, P831, DOI 10.1038/nrn1533
KUHL PK, 1993, J ACOUST SOC AM, V93, P2423
Kuhl PK, 2000, P NATL ACAD SCI USA, V97, P11850, DOI 10.1073/pnas.97.22.11850
KUHL PK, 1993, J PHONETICS, V21, P125
Lado R., 1957, LINGUISTICS CULTURES
LANE H, 1971, J SPEECH HEAR RES, V14, P677
Leather J., 1991, STUDIES 2ND LANGUAGE, V13, P305, DOI 10.1017/S0272263100010019
Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1121/1.2180210
Lee CY, 2010, SPEECH COMMUN, V52, P900, DOI 10.1016/j.specom.2010.01.004
Lenneberg E., 1967, BIOL FDN LANGUAGE
LIVELY SE, 1993, J ACOUST SOC AM, V94, P1242, DOI 10.1121/1.408177
LIVELY SE, 1994, J ACOUST SOC AM, V96, P2076, DOI 10.1121/1.410149
LOGAN JS, 1991, J ACOUST SOC AM, V89, P874, DOI 10.1121/1.1894649
Lombard E., 1911, ANN MALADIES OREILLE, V37, P101
Long M., 1990, STUDIES 2ND LANGUAGE, V12, P251, DOI DOI 10.1017/S0272263100009165
Lovitt A, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2154
LU Y, 2010, THESIS U SHEFFIELD
Luce PA, 1998, EAR HEARING, V19, P1, DOI 10.1097/00003446-199802000-00001
MacKay IRA, 2001, J ACOUST SOC AM, V110, P516, DOI 10.1121/1.1377287
MacKay IRA, 2001, PHONETICA, V58, P103, DOI 10.1159/000028490
Macnamara J. T., 1969, DESCRIPTION MEASUREM, P80
Maddieson I., 1984, PATTERNS SOUNDS
Major R., 1998, STUDIES 2 LANGUAGE A, V20, P131
Major R. C, 2001, FOREIGN ACCENT ONTOG
MAJOR RC, 1999, PHONOLOGICAL ISSUES, P151
Markham D., 1997, PHONETIC IMITATION A
MARSLENWILSON W, 1994, PSYCHOL REV, V101, P653, DOI 10.1037//0033-295X.101.4.653
Mattys SL, 2009, COGNITIVE PSYCHOL, V59, P203, DOI 10.1016/j.cogpsych.2009.04.001
Mattys SL, 2010, SPEECH COMMUN, V52, P887, DOI 10.1016/j.specom.2010.01.005
Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686
MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0
McQueen J. M., 2007, OXFORD HDB PSYCHOLIN, P37
MCQUEEN JM, SPEECH COMMUNI UNPUB
McQueen JM, 2006, COGNITIVE SCI, V30, P1113, DOI 10.1207/s15516709cog0000_79
McQueen JM, 1999, J EXP PSYCHOL HUMAN, V25, P1363, DOI 10.1037//0096-1523.25.5.1363
Meador D, 2000, BILING-LANG COGN, V3, P55, DOI 10.1017/S1366728900000134
MILLER GA, 1947, PSYCHOL BULL, V44, P105, DOI 10.1037/h0055960
MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526
MULLENNIX JW, 1989, J ACOUST SOC AM, V85, P365, DOI 10.1121/1.397688
Munro M. J., 1998, STUDIES 2 LANGUAGE A, V20, P139
NABELEK AK, 1982, J ACOUST SOC AM, V71, P1242
NABELEK AK, 1974, J SPEECH HEAR RES, V17, P724
NABELEK AK, 1988, J ACOUST SOC AM, V84, P476
NABELEK AK, 1984, J ACOUST SOC AM, V75, P632
Nelson P, 2005, LANG SPEECH HEAR SER, V36, P219, DOI 10.1044/0161-1461(2005/022)
NEUMAN AC, 1983, J ACOUST SOC AM, V73, P2145, DOI 10.1121/1.389538
Norris D, 2008, PSYCHOL REV, V115, P357, DOI 10.1037/0033-295X.115.2.357
Norris D, 2006, COGNITIVE PSYCHOL, V53, P146, DOI 10.1016/j.cogpsych.2006.03.001
Norris D, 2003, COGNITIVE PSYCHOL, V47, P204, DOI 10.1016/S0010-0285(03)00006-9
NORRIS D, 1994, COGNITION, V52, P189, DOI 10.1016/0010-0277(94)90043-4
NYGAARD LC, 1994, PSYCHOL SCI, V5, P42, DOI 10.1111/j.1467-9280.1994.tb00612.x
Penfield W, 1959, SPEECH BRAIN MECH
Picard M, 2001, AUDIOLOGY, V40, P221
PICHENY MA, 1981, THESIS MIT
Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169
PISKE T, 1999, P 14 INT C PHON SCI, P1433
Polivanov E., 1931, TRAVAUX CERCLE LINGU, P79
POLLACK I, 1959, J ACOUST SOC AM, V31, P273, DOI 10.1121/1.1907712
Quene H, 2010, SPEECH COMMUN, V52, P911, DOI 10.1016/j.specom.2010.03.005
REPP BH, 1988, J ACOUST SOC AM, V84, P1929, DOI 10.1121/1.397159
Rhebergen KS, 2005, J ACOUST SOC AM, V118, P1274, DOI 10.1121/1.2000751
Rogers CL, 2004, LANG SPEECH, V47, P139
Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X
SAMUEL AG, 1981, J EXP PSYCHOL GEN, V110, P474, DOI 10.1037/0096-3445.110.4.474
SAVIN HB, 1963, J ACOUST SOC AM, V35, P200, DOI 10.1121/1.1918432
Schegloff EA, 2000, LANG SOC, V29, P1
SCOVEL T, 1969, LANG LEARN, V19, P245, DOI 10.1111/j.1467-1770.1969.tb00466.x
Scovel T., 1988, TIME SPEAK PSYCHOLIN
Seliger H. W., 1978, 2 LANGUAGE ACQUISITI, P11
Shimizu T, 2002, AURIS NASUS LARYNX, V29, P121, DOI 10.1016/S0385-8146(01)00133-X
Simpson SA, 2005, J ACOUST SOC AM, V118, P2775, DOI 10.1121/1.2062650
SINGH S, 1966, J ACOUST SOC AM, V40, P635, DOI 10.1121/1.1910130
Singleton D., 1989, LANGUAGE ACQUISITION
Sjerps MJ, 2010, J EXP PSYCHOL HUMAN, V36, P195, DOI 10.1037/a0016803
SLOWIACZEK LM, 1987, J EXP PSYCHOL LEARN, V13, P64, DOI 10.1037//0278-7393.13.1.64
Sorace Antonella, 1993, SECOND LANG RES, V9, P22
Soto-Faraco S, 2001, J MEM LANG, V45, P412, DOI 10.1006/jmla.2000.2783
Spivey MJ, 1999, PSYCHOL SCI, V10, P281, DOI 10.1111/1467-9280.00151
SPOLSKY B, 1968, LANG LEARN, V18, P79, DOI 10.1111/j.1467-1770.1968.tb00224.x
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
Stockwell Robert P., 1965, SOUNDS ENGLISH SPANI
STRANGE W., 1995, SPEECH PERCEPTION LI, P3
Taft M, 1986, LANG COGNITIVE PROC, V1, P297, DOI 10.1080/01690968608404679
TAKATA Y, 1990, J ACOUST SOC AM, V88, P663, DOI 10.1121/1.399769
Tremblay A, 2008, APPL PSYCHOLINGUIST, V29, P553, DOI 10.1017/S0142716408080247
Trubetzkoy N. S., 1939, PRINCIPLES PHONOLOGY
Vanlancker-Sidtis D, 2003, APPL PSYCHOLINGUIST, V24, P45, DOI 10.1017/S0142716403000031
VANDERVLUGT M, 1986, 21 IPO, P41
van Dommelen WA, 2010, SPEECH COMMUN, V52, P968, DOI 10.1016/j.specom.2010.05.001
Van Engen KJ, 2010, SPEECH COMMUN, V52, P943, DOI 10.1016/j.specom.2010.05.002
Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666
VANSUMMERS W, 1988, J ACOUST SOC AM, V84, P917
van Wijngaarden SJ, 2004, J ACOUST SOC AM, V115, P1281, DOI 10.1121/1.1647145
van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928
Volin J, 2010, SPEECH COMMUN, V52, P1010, DOI 10.1016/j.specom.2010.06.009
von Hapsburg Deborah, 2004, J Am Acad Audiol, V15, P88, DOI 10.3766/jaaa.15.1.9
von Hapsburg D, 2002, J SPEECH LANG HEAR R, V45, P202
Walsh T., 1981, INDIVIDUAL DIFFERENC, P3
WANG MD, 1973, J ACOUST SOC AM, V54, P1248, DOI 10.1121/1.1914417
WANG Y, 2008, J PHONETICS, V37, P344
WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392
Watkins AJ, 2005, ACTA ACUST UNITED AC, V91, P892
Weber A, 2004, J MEM LANG, V50, P1, DOI 10.1016/S0749-596X(03)00105-0
WEISS W, 2008, J AM ACAD AUDIOL, V19, P5
WODE H, 1980, 2 LANGUAGE DEV TREND
Wright R., 2004, PHONETICALLY BASED P
ZWITSERLOOD P, 1989, COGNITION, V32, P25, DOI 10.1016/0010-0277(89)90013-9
NR 192
TC 21
Z9 21
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 864
EP 886
DI 10.1016/j.specom.2010.08.014
PG 23
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700002
ER
PT J
AU Mattys, SL
Carroll, LM
Li, CKW
Chan, SLY
AF Mattys, Sven L.
Carroll, Lucy M.
Li, Carrie K. W.
Chan, Sonia L. Y.
TI Effects of energetic and informational masking on speech segmentation by
native and non-native speakers
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spoken-word recognition; Speech segmentation; Bilingualism; Processing
load; Cognitive load; Energetic masking; Informational masking
ID SPOKEN-WORD RECOGNITION; NOISE; PERCEPTION; LISTENERS; LANGUAGE;
INTELLIGIBILITY; 2ND-LANGUAGE; INTEGRATION; ENGLISH; CONTEXT
AB In this study, we asked whether native and non-native speakers of English use a similar balance of lexical knowledge and acoustic cues, e.g., juncture-specific allophones, to segment spoken English, and whether the two groups are equally affected by energetic masking (a competing talker) and by cognitive load (a simultaneous visual search task). In intact speech, as well as in both adverse conditions, non-native speakers gave relatively less weight to lexical plausibility than to acoustic cues. Under energetic masking, overall segmentation accuracy decreased, but this decrease was of comparable magnitude in native and non-native speakers. Under cognitive load, native speakers relied relatively more on lexical plausibility than on acoustic cues. This lexical drift was not observed in the non-native group. These results indicate that non-native speakers pay less attention to lexical information-and relatively more attention to acoustic detail-than previously thought. They also suggest that the penetrability of the speech system by cognitive factors depends on listeners' proficiency with the language, and especially their level of lexical-semantic knowledge. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Mattys, Sven L.; Carroll, Lucy M.; Li, Carrie K. W.; Chan, Sonia L. Y.] Univ Bristol, Dept Expt Psychol, Bristol BS8 1TU, Avon, England.
RP Mattys, SL (reprint author), Univ Bristol, Dept Expt Psychol, 12A Priory Rd, Bristol BS8 1TU, Avon, England.
EM Sven.Mattys@bris.ac.uk
FU Leverhulme Trust [F/00 182/BG]; Marie Curie foundation
[MRTN-CT-2006-035561]
FX This study was made possible thanks to a grant from the Leverhulme Trust
(F/00 182/BG) to S.L. Mattys, and a Research Training Network grant from
the Marie Curie foundation (MRTN-CT-2006-035561). We thank Martin Cooke
for calibrating the babble noise and calculating the glimpsing
percentages. We also thank Lukas Wiget for contributing to data
collection and Jeff Bowers for comments on an earlier draft.
CR Altenberg EP, 2005, SECOND LANG RES, V21, P325, DOI 10.1191/0267658305sr250oa
Baayen R. H., 1995, CELEX LEXICAL DATABA
Baayen RH, 2008, J MEM LANG, V59, P390, DOI 10.1016/j.jml.2007.12.005
Barker J, 2007, SPEECH COMMUN, V49, P402, DOI 10.1016/j.specom.2006.11.003
Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103
Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837
Brungart DS, 2001, J ACOUST SOC AM, V109, P1101, DOI 10.1121/1.1345696
Casini L, 2009, COGNITION, V112, P318, DOI 10.1016/j.cognition.2009.04.005
Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952
Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600
Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292
Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707
Davis MH, 2002, J EXP PSYCHOL HUMAN, V28, P218, DOI 10.1037//0096-1523.28.1.218
Dilley L, 1996, J PHONETICS, V24, P423, DOI 10.1006/jpho.1996.0023
Elston-Guttler KE, 2005, J MEM LANG, V52, P256, DOI 10.1016/j.jml.2004.11.002
Farris C, 2008, TESOL QUART, V42, P397
FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247
GOW DW, 1995, J EXP PSYCHOL HUMAN, V21, P344, DOI 10.1037//0096-1523.21.2.344
Hazan V, 2000, LANG SPEECH, V43, P273
Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1121/1.2180210
Love T, 2003, EXP PSYCHOL, V50, P204, DOI 10.1027//1618-3169.50.3.204
Mattys SL, 2009, COGNITIVE PSYCHOL, V59, P203, DOI 10.1016/j.cogpsych.2009.04.001
Mattys SL, 2005, J EXP PSYCHOL GEN, V134, P477, DOI 10.1037/0096-3445.134.4.477
Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686
NABELEK AK, 1984, J ACOUST SOC AM, V75, P632
NORRIS D, 1995, J EXP PSYCHOL LEARN, V21, P1209
OLLER DK, 1973, J ACOUST SOC AM, V54, P1235, DOI 10.1121/1.1914393
Rosenhouse J., 2006, INT J BILINGUAL, V10, P119
Sanders LD, 2002, J SPEECH LANG HEAR R, V45, P519, DOI 10.1044/1092-4388(2002/041)
Sanders LD, 2003, COGNITIVE BRAIN RES, V15, P214, DOI 10.1016/S0926-6410(02)00194-5
Ito Kikuyo, 2009, J Acoust Soc Am, V125, P2348, DOI 10.1121/1.3082103
Styles E., 1997, PSYCHOL ATTENTION
TAKANO Y, 1993, J CROSS CULT PSYCHOL, V24, P445, DOI 10.1177/0022022193244005
Thorn ASC, 1999, Q J EXP PSYCHOL-A, V52, P303, DOI 10.1080/027249899391089
TREISMAN AM, 1980, COGNITIVE PSYCHOL, V12, P97, DOI 10.1016/0010-0285(80)90005-5
Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666
van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928
WHITE L, 2010, Q J EXP PSYCHOL
NR 38
TC 9
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 887
EP 899
DI 10.1016/j.specom.2010.01.005
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700003
ER
PT J
AU Lee, CY
Tao, LA
Bond, ZS
AF Lee, Chao-Yang
Tao, Liang
Bond, Z. S.
TI Identification of multi-speaker Mandarin tones in noise by native and
non-native listeners
SO SPEECH COMMUNICATION
LA English
DT Article
DE Mandarin tones; Speech perception; Speaker variability; Noise
ID SPEAKER NORMALIZATION; TALKER VARIABILITY; SPEECH-PERCEPTION; LEXICAL
TONE; RECOGNITION MEMORY; WORD RECOGNITION; CHINESE TONES; SPOKEN WORDS;
ACQUISITION; INFORMATION
AB The similarities and contrasts between native and non-native identification of multi-speaker Mandarin tones in quiet and in noise were explored in a perception experiment. Mandarin tone materials produced by three male and three female speakers were presented with five levels of signal-to-noise ratios (quiet, 0, -5, -10, and -15 dB) in two presentation formats (blocked by speaker and mixed across speakers) to listeners with various Mandarin experience (native, first-year, second-year, third-year, and fourth-year students). Stimuli blocked by speaker yielded higher accuracy and shorter reaction time. The additional demand of processing mixed-speaker stimuli, however, did not compromise non-native performance more than native performance. Noise expectedly compromised identification performance, although it did not compromise non-native identification more than native identification. Native listeners expectedly outperformed non-native listeners, although identification performance did not vary systematically as a function of duration of Mandarin experience. It is speculated that sources of variability in speech would affect non-native more than native tone identification only if syllable-internal, canonical F0 information is removed or altered. Published by Elsevier B.V.
C1 [Lee, Chao-Yang] Ohio Univ, Sch Hearing Speech & Language Sci, Athens, OH 45701 USA.
[Tao, Liang; Bond, Z. S.] Ohio Univ, Dept Linguist, Athens, OH 45701 USA.
RP Lee, CY (reprint author), Ohio Univ, Sch Hearing Speech & Language Sci, Athens, OH 45701 USA.
EM leec1@ohio.edu
FU Ohio University [RC-09-088]
FX We would like to thank Ning Zhou for assistance in speech processing, Na
Wang and Lauren Dutton for assistance in administering the experiment,
and Jessica Stillwell and Jana Van Hooser for assistance in stimulus
preparation. We are also grateful to three anonymous reviewers for their
helpful comments. This research was partially supported by Research
Challenge Award RC-09-088 from Ohio University.
CR Bluhme H., 1971, STUD LINGUIST, V22, P51
Bradlow AR, 1999, J ACOUST SOC AM, V106, P2074, DOI 10.1121/1.427952
Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837
Chin T., 1987, J CHINESE LANGUAGE T, V22, P87
CHURCH BA, 1994, J EXP PSYCHOL LEARN, V20, P521, DOI 10.1037//0278-7393.20.3.521
CREELMAN CD, 1957, J ACOUST SOC AM, V29, P655, DOI 10.1121/1.1909003
Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292
FOX RA, 1985, J CHINESE LINGUIST, V13, P69
FOX RA, 1990, J CHINESE LINGUIST, V18, P261
GANDOUR J, 1983, J PHONETICS, V11, P149
GANDOUR JT, 1978, LANG SPEECH, V22, P1
Goldinger SD, 1996, J EXP PSYCHOL LEARN, V22, P1166, DOI 10.1037/0278-7393.22.5.1166
Goldinger SD, 1998, PSYCHOL REV, V105, P251, DOI 10.1037/0033-295X.105.2.251
Gottfried TL, 1997, J PHONETICS, V25, P207, DOI 10.1006/jpho.1997.0042
Halle PA, 2004, J PHONETICS, V32, P395, DOI 10.1016/S0095-4470(03)00016-0
Hardison DM, 2003, APPL PSYCHOLINGUIST, V24, P495, DOI 10.1017/S0142716403000250
Johnson K, 2005, BLACKW HBK LINGUIST, P363, DOI 10.1002/9780470757024.ch15
KIRILOFF C, 1969, PHONETICA, V20, P63
Kong YY, 2006, J ACOUST SOC AM, V120, P2830, DOI 10.1121/1.2346009
LEATHER J, 1983, J PHONETICS, V11, P373
LEE CY, 2010, LANG SPEECH, P53
Lee CY, 2008, J PHONETICS, V36, P537, DOI 10.1016/j.wocn.2008.01.002
Lee CY, 2009, J ACOUST SOC AM, V125, P1125, DOI 10.1121/1.3050322
Lee CY, 2009, J PHONETICS, V37, P1, DOI 10.1016/j.wocn.2008.08.001
Lin T., 1984, ZHONGGUO YUYAN XUEBA, V2, P59
Lin W. C. J., 1985, RELC J, V16, P31, DOI 10.1177/003368828501600207
LIVELY SE, 1993, J ACOUST SOC AM, V94, P1242, DOI 10.1121/1.408177
Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686
Meador D, 2000, BILING-LANG COGN, V3, P55, DOI 10.1017/S1366728900000134
Mertus J. A., 2000, BROWN LAB INTERACTIV
Miracle W. C., 1989, J CHINESE LANGUAGE T, V24, P49
Moore CB, 1997, J ACOUST SOC AM, V102, P1864, DOI 10.1121/1.420092
MULLENNIX JW, 1989, J ACOUST SOC AM, V85, P365, DOI 10.1121/1.397688
NABELEK AK, 1984, J ACOUST SOC AM, V75, P632
NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469
Nusbaum H. C., 1992, SPEECH PERCEPTION PR, P113
PALMERI TJ, 1993, J EXP PSYCHOL LEARN, V19, P309, DOI 10.1037/0278-7393.19.2.309
REPP BH, 1990, J PHONETICS, V18, P481
Sebastian-Galles N, 2005, BLACKW HBK LINGUIST, P546, DOI 10.1002/9780470757024.ch22
Sereno J. A., 2007, LANGUAGE EXPERIENCE, P239
Shen X. S., 1989, J CHINESE LANGUAGE T, V24, P27
Summerfield Q., 1973, REPORT SPEECH RES PR, V2, P12
Takayanagi S, 2002, J SPEECH LANG HEAR R, V45, P585, DOI 10.1044/1092-4388(2002/047)
TSAI CH, 2000, CH TSAIS TECHNOLOGY
VERBRUGGE RR, 1976, J ACOUST SOC AM, V60, P198, DOI 10.1121/1.381065
Wang Y, 1999, J ACOUST SOC AM, V106, P3649, DOI 10.1121/1.428217
Wei CG, 2007, EAR HEARING, V28, p62S, DOI 10.1097/AUD.0b013e318031512c
Wong PCM, 2003, J SPEECH LANG HEAR R, V46, P413, DOI 10.1044/1092-4388(2003/034)
Xu Y, 1997, J PHONETICS, V25, P61, DOI 10.1006/jpho.1996.0034
XU Y, 1994, J ACOUST SOC AM, V95, P2240, DOI 10.1121/1.408684
Zhou N, 2008, EAR HEARING, V29, P326, DOI 10.1097/AUD.0b013e3181662c42
NR 51
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 900
EP 910
DI 10.1016/j.specom.2010.01.004
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700004
ER
PT J
AU Quene, H
van Delft, LE
AF Quene, Hugo
van Delft, L. E.
TI Non-native durational patterns decrease speech intelligibility
SO SPEECH COMMUNICATION
LA English
DT Article
DE Non-native speech; Duration patterns; Segmental durations; Speech
Reception Threshold; Intelligibility
ID RECEPTION THRESHOLD; SEGMENTAL DURATION; ENGLISH; 2ND-LANGUAGE;
PERCEPTION; LANGUAGE; DUTCH; PROSODY; ACCENT; NOISE
AB In native speech, durational patterns convey linguistically relevant phenomena such as phrase structure, lexical stress, rhythm, and word boundaries. The lower intelligibility of non-native speech may be partly due to its deviant durational patterns. The present study aims to quantify the relative contributions of non-native durational patterns and of non-native speech sounds to intelligibility. In a Speech Reception Threshold study, duration patterns were transplanted between native and non-native versions of Dutch sentences. Intelligibility thresholds (critical speech-to-noise ratios) differed by about 4 dB between the matching versions with unchanged durational patterns. Results for manipulated versions suggest that about 0.4-1.1 dB of this difference was due to the durational patterns, and that this contribution was larger if the native and non-native patterns were more deviant. The remainder of the difference must have been due to non-native speech sounds in these materials. This finding supports recommendations to attend to durational patterns as well as native-like speech sounds, when learning to speak a foreign language. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Quene, Hugo; van Delft, L. E.] Univ Utrecht, Utrecht Inst Linguist OTS, NL-3512 JK Utrecht, Netherlands.
RP Quene, H (reprint author), Univ Utrecht, Utrecht Inst Linguist OTS, Trans 10, NL-3512 JK Utrecht, Netherlands.
EM h.quene@uu.nl
CR Adams C., 1979, ENGLISH SPEECH RHYTH
ANDERSONHSIEH J, 1992, LANG LEARN, V42, P529, DOI 10.1111/j.1467-1770.1992.tb01043.x
Baayen RH, 2008, J MEM LANG, V59, P390, DOI 10.1016/j.jml.2007.12.005
Bates D., 2005, R NEWS, V5, P27, DOI DOI 10.1111/J.1523-1739.2005.00280.X
Bent T, 2008, PHONETICA, V65, P131, DOI 10.1159/000144077
Boersma P., 2008, PRAAT DOING PHONETIC
CAMBIERLANGEVEL.G, 2000, THESIS U AMSTERDAM
Chun Dorothy, 2002, DISCOURSE INTONATION
Cutler A, 1997, LANG SPEECH, V40, P141
Derwing T. M., 1997, STUDIES 2 LANGUAGE A, V20, P1
EEFTING W, 1993, ANAL SYNTHESIS SPEEC, P225
Faraway J. J., 2006, EXTENDING LINEAR MOD
FLEGE JE, 1993, J ACOUST SOC AM, V93, P1589, DOI 10.1121/1.406818
FLEGE JE, 1981, LANG SPEECH, V24, P125
Goetry V, 2000, PSYCHOL BELG, V40, P115
Holm S., 2008, THESIS NORWEGIAN U S
Hox J., 2002, MULTILEVEL ANAL TECH
KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986
LAEUFER C, 1992, J PHONETICS, V20, P411
MAASSEN B, 1984, SPEECH COMMUN, V3, P123, DOI 10.1016/0167-6393(84)90034-7
Mareuil P., 2006, PHONETICA, V63, P247, DOI 10.1159/000097308
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
Munro MJ, 1999, LANG LEARN, V49, P285, DOI 10.1111/0023-8333.49.s1.8
Munro Murray J., 2008, PHONOLOGY 2 LANGUAGE, P193
Nooteboom S., 1997, HDB PHONETIC SCI, P640
Patel A., 2008, MUSIC LANGUAGE BRAIN
PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183
Pinheiro J. C., 2000, STAT COMPUTING
PLOMP R, 1979, AUDIOLOGY, V18, P43
PLOMP R, 1986, J SPEECH HEAR RES, V29, P146
Quene H, 2004, SPEECH COMMUN, V43, P103, DOI 10.1016/j.specom.2004.02.004
Quene H, 2008, J MEM LANG, V59, P413, DOI 10.1016/j.jml.2008.02.002
QUENE H, 1992, J PHONETICS, V20, P331
Quene H, 2005, PHONETICA, V62, P1, DOI 10.1159/000087222
R Development Core Team, 2008, R LANG ENV STAT COMP
Rajadurai J., 2007, WORLD ENGLISH, V26, P87, DOI 10.1111/j.1467-971X.2007.00490.x
Shatzman KB, 2006, PERCEPT PSYCHOPHYS, V68, P1, DOI 10.3758/BF03193651
SLIS IH, 1969, LANG SPEECH, V12, P80
SLUIJTER AMC, 1995, PHONETICA, V52, P71
Smith R., 2004, THESIS U CAMBRIDGE
STUDEBAKER GA, 1985, J SPEECH HEAR RES, V28, P455
Tajima K, 1997, J PHONETICS, V25, P1, DOI 10.1006/jpho.1996.0031
VANSANTEN JPH, 1994, COMPUT SPEECH LANG, V8, P95, DOI 10.1006/csla.1994.1005
van Wijngaarden SJ, 2002, J ACOUST SOC AM, V112, P3004, DOI 10.1121/1.1512289
van Wijngaarden SJ, 2001, SPEECH COMMUN, V35, P103, DOI 10.1016/S0167-6393(00)00098-4
White L., 2007, CURRENT ISSUES LINGU, P237
White L, 2007, J PHONETICS, V35, P501, DOI 10.1016/j.wocn.2007.02.003
NR 47
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 911
EP 918
DI 10.1016/j.specom.2010.03.005
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700005
ER
PT J
AU Ezzatian, P
Avivi, M
Schneider, BA
AF Ezzatian, Payam
Avivi, Meital
Schneider, Bruce A.
TI Do nonnative listeners benefit as much as native listeners from spatial
cues that release speech from masking?
SO SPEECH COMMUNICATION
LA English
DT Article
DE Second language; Bilingualism; Speech comprehension; Speech perception;
Informational masking; Energetic masking; Stream segregation
ID INFORMATIONAL MASKING; COMPETING SPEECH; SIMULTANEOUS TALKERS; ENERGETIC
MASKING; COCKTAIL PARTY; PERCEPTION; NOISE; RECOGNITION; SEPARATION;
HEARING
AB Since most everyday communication takes place in less than optimal acoustic settings, it is important to understand how such environments affect nonnative listeners. In this study we compare the speech reception abilities of native and nonnative English speakers when they are asked to repeat semantically anomalous sentences masked by steady-state noise or two other talkers in two conditions: when the target and masker appear to be colocated; and when the target and masker appear to emanate from different loci. We found that the later the age of language acquisition, the higher the threshold for speech reception under all conditions, suggesting that the ability to extract speech information from masking sounds in complex acoustic situations depends on language competency. Interestingly, however, native and nonnative listeners benefited equally from perceived spatial separation (an acoustic cue that releases speech from masking) independent of whether the speech target was masked by speech or noise, suggesting that the acoustic factors that release speech from masking are not affected by linguistic competence. In addition speech reception thresholds were correlated with vocabulary scores in all individuals, both native and nonnative. The implications of these findings for nonnative listeners in acoustically complex environments are discussed. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Ezzatian, Payam; Avivi, Meital; Schneider, Bruce A.] Univ Toronto Mississauga, Ctr Res Biol Commun Syst, Dept Psychol, Mississauga, ON L5L 1C6, Canada.
RP Schneider, BA (reprint author), Univ Toronto Mississauga, Ctr Res Biol Commun Syst, Dept Psychol, Mississauga, ON L5L 1C6, Canada.
EM bruce.schneider@utoronto.ca
FU Canadian Institute of Health Research [MT15359]; Natural Sciences and
Engineering Research Council of Canada [RGPIN 9952]
FX This work was supported by the Canadian Institute of Health Research
(MT15359) and Natural Sciences and Engineering Research Council of
Canada (RGPIN 9952). We would like to thank James Qi for creating the
program used to run our experiments and Lulu Li for help in recruiting
participants.
CR Abrahamsson N, 2009, LANG LEARN, V59, P249, DOI 10.1111/j.1467-9922.2009.00507.x
Akeroyd MA, 2000, J ACOUST SOC AM, V107, P3394, DOI 10.1121/1.429410
BILGER RC, 1984, J SPEECH HEAR RES, V27, P32
Bradlow AR, 1999, J ACOUST SOC AM, V106, P2074, DOI 10.1121/1.427952
Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103
Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837
Bregman AS., 1990, AUDITORY SCENE ANAL
Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946
Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929
Brungart DS, 2002, J ACOUST SOC AM, V112, P664, DOI 10.1121/1.1490592
Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952
Darwin CJ, 2003, J ACOUST SOC AM, V114, P2913, DOI 10.1121/1.1616924
Flege J. E., 1995, SPEECH PERCEPTION LI, P233
FLORENTINE M, 1985, P ACOUST SOC JAPAN
Freyman RL, 1999, J ACOUST SOC AM, V106, P3578, DOI 10.1121/1.428211
Freyman RL, 2001, J ACOUST SOC AM, V109, P2112, DOI 10.1121/1.1354984
Freyman RL, 2004, J ACOUST SOC AM, V115, P2246, DOI 10.1121/1.689343
Freyman RL, 2007, J ACOUST SOC AM, V121, P1040, DOI 10.1121/1.2427117
Hasher L., 1988, PSYCHOL LEARN MOTIV, V22, P193, DOI DOI 10.1016/S0079-7421(08)60041-9
Hawley ML, 2004, J ACOUST SOC AM, V115, P833, DOI 10.1121/1.1639908
Heinrich A, 2008, Q J EXP PSYCHOL, V61, P735, DOI 10.1080/17470210701402372
Helfer KS, 1997, J SPEECH LANG HEAR R, V40, P432
Helfer KS, 2005, J ACOUST SOC AM, V117, P842, DOI 10.1121/1.1836832
Helfer KS, 2008, EAR HEARING, V29, P87
KALIKOW DN, 1977, J ACOUST SOC AM, V61, P1337, DOI 10.1121/1.381436
Kidd G, 1998, J ACOUST SOC AM, V104, P422, DOI 10.1121/1.423246
Kidd G, 2005, J ACOUST SOC AM, V118, P3804, DOI 10.1121/1.2109187
Li Ang, 2004, Journal of Experimental Psychology Human Perception and Performance, V30, P1077
Litovsky RY, 2005, J ACOUST SOC AM, V117, P3091, DOI 10.1121/1.1873913
Marrone N, 2008, J ACOUST SOC AM, V124, P3064, DOI 10.1121/1.2980441
Marslen-Wilson W. D., 1989, LEXICAL REPRESENTATI, P3
Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686
Meador D, 2000, BILING-LANG COGN, V3, P55, DOI 10.1017/S1366728900000134
Noble W, 2002, PERCEPT PSYCHOPHYS, V64, P1325, DOI 10.3758/BF03194775
Rogers CL, 2008, J ACOUST SOC AM, V124, P1278, DOI 10.1121/1.2939127
Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X
Saffran JR, 1996, J MEM LANG, V35, P606, DOI 10.1006/jmla.1996.0032
Schneider B. A., 2010, SPRINGER HDB AUDITOR, P167
Schneider BA, 2007, J AM ACAD AUDIOL, V18, P559, DOI 10.3766/jaaa.18.7.4
van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928
Verhaeghen P, 1997, PSYCHOL BULL, V122, P231, DOI 10.1037/0033-2909.122.3.231
Yang ZG, 2007, SPEECH COMMUN, V49, P892, DOI 10.1016/j.specom.2007.05.005
NR 42
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 919
EP 929
DI 10.1016/j.specom.2010.04.001
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700006
ER
PT J
AU Bradlow, A
Clopper, C
Smiljanic, R
Walter, MA
AF Bradlow, Ann
Clopper, Cynthia
Smiljanic, Rajka
Walter, Mary Ann
TI A perceptual phonetic similarity space for languages: Evidence from five
native language listener groups
SO SPEECH COMMUNICATION
LA English
DT Article
DE Phonetic similarity; Cross-language speech intelligibility; Language
classification
ID SPEECH-INTELLIGIBILITY BENEFIT; FREE CLASSIFICATION; ENGLISH; NOISE;
RECOGNITION; CONTRASTS; FEATURES; TALKER
AB The goal of the present study was to devise a means of representing languages in a perceptual similarity space based on their overall phonetic similarity. In Experiment 1, native English listeners performed a free classification task in which they grouped 17 diverse languages based on their perceived phonetic similarity. A similarity matrix of the grouping patterns was then submitted to clustering and multidimensional scaling analyses. In Experiment 2, an independent group of native English listeners sorted the group of 17 languages in terms of their distance from English. Experiment 3 repeated Experiment 2 with four groups of non-native English listeners: Dutch, Mandarin, Turkish and Korean listeners. Taken together, the results of these three experiments represent a step towards establishing an approach to assess the overall phonetic similarity of languages. This approach could potentially provide the basis for developing predictions regarding foreign-accented speech intelligibility for various listener groups, and regarding speech perception accuracy in the context of background noise in various languages. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Bradlow, Ann] Northwestern Univ, Dept Linguist, Evanston, IL 60208 USA.
[Clopper, Cynthia] Ohio State Univ, Dept Linguist, Columbus, OH 43210 USA.
[Smiljanic, Rajka] Univ Texas Austin, Dept Linguist, Austin, TX 78712 USA.
[Walter, Mary Ann] Middle E Tech Univ, Ankara, Turkey.
RP Bradlow, A (reprint author), Northwestern Univ, Dept Linguist, 2016 Sheridan Rd, Evanston, IL 60208 USA.
EM abradlow@northwestern.edu
FU NIH [F32 DC007237, R01 DC005794]
FX We are grateful to Rachel Baker, Arim Choi and Susanne Brouwer for
research assistance. This work was supported by NIH Grants F32 DC007237
and R01 DC005794.
CR [Anonymous], 1999, HDB INT PHONETIC ASS
BARKAT M, 2001, P EUR 2001, P1065
Bent T, 2003, J ACOUST SOC AM, V114, P1600, DOI 10.1121/1.1603234
Bent T, 2008, PHONETICA, V65, P131, DOI 10.1159/000144077
Best CT, 2001, J ACOUST SOC AM, V109, P775, DOI 10.1121/1.1332378
Boersma P., 2009, PRAAT DOING PHONETIC
CALANDRUCCIO L, J SPEECH LA IN PRESS
Clopper CG, 2007, J PHONETICS, V35, P421, DOI 10.1016/j.wocn.2006.06.001
Clopper CG, 2008, BEHAV RES METHODS, V40, P575, DOI 10.3758/BRM.40.2.575
CORTER JE, 1982, BEHAV RES METH INSTR, V14, P353
Derwing T. M., 1997, STUDIES 2 LANGUAGE A, V20, P1
Dunn M, 2005, SCIENCE, V309, P2072, DOI 10.1126/science.1114615
Flege J. E., 1995, SPEECH PERCEPTION LI, P233
Hayes-Harb R, 2008, J PHONETICS, V36, P664, DOI 10.1016/j.wocn.2008.04.002
Heeringa W, 2009, SPEECH COMMUN, V51, P167, DOI 10.1016/j.specom.2008.07.006
Imai S, 2005, J ACOUST SOC AM, V117, P896, DOI 10.1121/1.1823291
Kruskal JB, 1978, MULTIDIMENSIONAL SCA
Kuhl PK, 2008, PHILOS T R SOC B, V363, P979, DOI 10.1098/rstb.2007.2154
Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1121/1.2180210
MEYER J, 2003, P INT C PHON SCI
MUNRO MJ, 1995, LANG LEARN, V45, P73, DOI 10.1111/j.1467-1770.1995.tb00963.x
Munro MJ, 2006, STUD SECOND LANG ACQ, V28, P111, DOI 10.1017/S0272263106060049
Ramus F, 1999, COGNITION, V73, P265, DOI 10.1016/S0010-0277(99)00058-X
RHEBERGEN KS, 2005, J ACOUST SOC AM, V118, P1274
SMILJANIC R, 2007, P 26 INT C PHON SCI
Stibbard RM, 2006, J ACOUST SOC AM, V120, P433, DOI 10.1121/1.2203595
Stockmal V, 2000, APPL PSYCHOLINGUIST, V21, P383, DOI 10.1017/S0142716400003052
Strange W., 2007, LANGUAGE EXPERIENCE, P35, DOI 10.1075/lllt.17.08str
TAKANE Y, 1977, PSYCHOMETRIKA, V42, P7, DOI 10.1007/BF02293745
Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666
van Wijngaarden SJ, 2001, SPEECH COMMUN, V35, P103, DOI 10.1016/S0167-6393(00)00098-4
van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928
VASILESCU I., 2005, P INT, P1773
VASILESCU I, 2000, P INT C SPOK LANG PR, V2, P543
NR 34
TC 9
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 930
EP 942
DI 10.1016/j.specom.2010.06.003
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700007
ER
PT J
AU Van Engen, KJ
AF Van Engen, Kristin J.
TI Similarity and familiarity: Second language sentence recognition in
first- and second-language multi-talker babble
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech-in-noise perception; Informational masking; Multi-talker babble;
Bilingual speech perception
ID NONNATIVE LISTENERS; SPEECH-PERCEPTION; INFORMATIONAL MASKING;
NATIVE-LANGUAGE; CLEAR SPEECH; NOISE; REVERBERATION; ENGLISH;
INTELLIGIBILITY; IDENTIFICATION
AB The intelligibility of speech in noisy environments depends not only on the functionality of listeners' peripheral auditory systems, but also on cognitive factors such as their language learning experience. Previous studies have shown, for example, that normal-hearing listeners attending to a non-native language have more difficulty in identifying speech targets in noisy conditions than do native listeners. Furthermore, native listeners have more difficulty in understanding speech targets in the presence of speech noise in their native language versus a foreign language. The present study addresses the role of listeners' experience with both the target and noise languages by examining second-language sentence recognition in first- and second-language noise. Native English speakers and non-native English speakers whose native language is Mandarin were tested on English sentence recognition in English and Mandarin 2-talker babble. Results show that both listener groups experienced greater difficulty in English versus Mandarin babble, but that native Mandarin listeners experienced a smaller release from masking in Mandarin babble relative to English babble. These results indicate that both the similarity between the target and noise and the language experience of the listeners contribute to the amount of interference listeners experience when listening to speech in the presence of speech noise. (C) 2010 Elsevier B.V. All rights reserved.
C1 Northwestern Univ, Dept Linguist, Evanston, IL 60208 USA.
RP Van Engen, KJ (reprint author), Northwestern Univ, Dept Linguist, 2016 Sheridan Rd, Evanston, IL 60208 USA.
EM k-van@northwestern.edu
FU NIH-NIDCD [F31DC009516, R01-DC005794]
FX The author thanks Ann Bradlow for helpful discussions at various stages
of this project. Special thanks also to Chun Chan for software
development and technical support, to Page Piccinini for assistance in
data collection, and to Matt Goldrick for assistance with data analysis.
This research was supported by Award No. F31DC009516 (Kristin Van Engen,
PI) and Grant No. R01-DC005794 from NIH-NIDCD (Ann Bradlow, PI). The
content is solely the responsibility of the author and does not
necessarily represent the official views of the NIDCD or the NIH.
CR Bamford J., 1979, SPEECH HEARING TESTS, P148
Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103
Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837
Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946
CALANDRUCCIO L, J SPEECH LA IN PRESS
Callan DE, 2004, NEUROIMAGE, V22, P1182, DOI 10.1016/j.neuroimage.2004.03.006
Clahsen H, 2006, TRENDS COGN SCI, V10, P564, DOI 10.1016/j.tics.2006.10.002
Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952
Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600
Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292
Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707
Durlach N, 2006, J ACOUST SOC AM, V120, P1787, DOI 10.1121/1.2335426
*ETS, 2005, TOEFL INT BAS TEST S
Felty RA, 2009, J ACOUST SOC AM, V125, pEL93, DOI 10.1121/1.3073733
Freyman RL, 2004, J ACOUST SOC AM, V115, P2246, DOI 10.1121/1.689343
Hazan V, 2000, LANG SPEECH, V43, P273
Jaeger TF, 2008, J MEM LANG, V59, P434, DOI 10.1016/j.jml.2007.11.007
Kidd Jr G., 2007, AUDITORY PERCEPTION, P143
Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1121/1.2180210
Marion V, 2007, J SPEECH LANG HEAR R, V50, P940, DOI 10.1044/1092-4388(2007/067)
Mattys SL, 2009, COGNITIVE PSYCHOL, V59, P203, DOI 10.1016/j.cogpsych.2009.04.001
Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686
Mueller JL, 2005, SECOND LANG RES, V21, P152, DOI 10.1191/0267658305sr256oa
NABELEK AK, 1984, J ACOUST SOC AM, V75, P632
NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469
R Development Core Team, 2005, R LANG ENV STAT COMP
RHEBERGEN KS, 2005, J ACOUST SOC AM, V118, P1274
Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X
Smiljanic R, 2005, J ACOUST SOC AM, V118, P1677, DOI 10.1121/1.2000788
Sperry J L, 1997, J Am Acad Audiol, V8, P71
STUDEBAKER GA, 1985, J SPEECH HEAR RES, V28, P455
TAKATA Y, 1990, J ACOUST SOC AM, V88, P663, DOI 10.1121/1.399769
TICE R, 1998, LEVEL16
VANENGEN KJ, 2007, J ACOUST SOC AM, V122, P2994, DOI 10.1121/1.2942684
Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666
van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928
VONHAPSBURG D, 2004, J AM ACAD AUDIOL, V14, P559
von Hapsburg D, 2002, J SPEECH LANG HEAR R, V45, P202
NR 38
TC 17
Z9 17
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 943
EP 953
DI 10.1016/j.specom.2010.05.002
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700008
ER
PT J
AU Cooke, M
Lecumberri, MLG
Scharenborg, O
van Dommelen, WA
AF Cooke, Martin
Garcia Lecumberri, Maria Luisa
Scharenborg, Odette
van Dommelen, Wim A.
TI Language-independent processing in speech perception: Identification of
English intervocalic consonants by speakers of eight European languages
SO SPEECH COMMUNICATION
LA English
DT Article
DE Consonant identification; Non-native; Cross-language; Noise
ID NONNATIVE LISTENERS; BACKGROUND-NOISE; CUE-ENHANCEMENT; NATIVE SPEAKERS;
NORMAL-HEARING; INTELLIGIBILITY; RECOGNITION; CONFUSIONS; TALKER;
MASKING
AB Processing speech in a non-native language requires listeners to cope with influences from their first language and to overcome the effects of limited exposure and experience. These factors may be particularly important when listening in adverse conditions. However, native listeners also suffer in noise, and the intelligibility of speech in noise clearly depends on factors which are independent of a listener's first language. The current study explored the issue of language-independence by comparing the responses of eight listener groups differing in native language when confronted with the task of identifying English intervocalic consonants in three masker backgrounds, viz. stationary speech-shaped noise, temporally-modulated speech-shaped noise and competing English speech. The study analysed the effects of (i) noise type, (ii) speaker, (iii) vowel context, (iv) consonant, (v) phonetic feature classes, (vi) stress position, (vii) gender and (viii) stimulus onset relative to noise onset. A significant degree of similarity in the response to many of these factors was evident across all eight language groups, suggesting that acoustic and auditory considerations play a large role in determining intelligibility. Language-specific influences were observed in the rankings of individual consonants and in the masking effect of competing speech relative to speech-modulated noise. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Cooke, Martin] Basque Fdn Sci, Bilbao 48011, Spain.
[Cooke, Martin; Garcia Lecumberri, Maria Luisa] Univ Basque Country, Fac Letters, Language & Speech Lab, Vitoria 01006, Spain.
[Scharenborg, Odette] Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6500 HD Nijmegen, Netherlands.
[van Dommelen, Wim A.] NTNU, Dept Language & Commun Studies, NO-7491 Trondheim, Norway.
RP Cooke, M (reprint author), Basque Fdn Sci, Bilbao 48011, Spain.
EM m.cooke@ikerbasque.org
RI Scharenborg, Odette/E-2056-2012
FU EU; Netherlands Organisation for Scientific Research (NWO)
FX Corpus recording, annotation and native English listening tests took
place while Martin Cooke was at the University of Sheffield, UK. We
extend our thanks to Francesco Cutugno, Mircea Giurgiu, Bernd Meyer and
Jan Volin for coordinating listener groups in Naples, Cluj-Napoca,
Oldenburg and Prague; Youyi Lu (University of Sheffield) for speech
material; Stuart Rosen (UCL) for making available the FIX software
package; and the developers of the R statistical language R Development
Core Team (2008). All authors were supported by the EU Marie Curie
Research Training Network "Sound to Sense". Odette Scharenborg was
supported by a Veni-grant from the Netherlands Organisation for
Scientific Research (NWO). We also thank Marc Swerts and the reviewers
for their insightful comments on an earlier version of the paper.
CR AINSWORTH WA, 1994, J ACOUST SOC AM, V96, P687, DOI 10.1121/1.410306
Alamsaputra DM, 2006, AUGMENT ALTERN COMM, V22, P258, DOI 10.1080/00498250600718555
Barker J, 2007, SPEECH COMMUN, V49, P402, DOI 10.1016/j.specom.2006.11.003
Benki JR, 2003, PHONETICA, V60, P129, DOI 10.1159/000071450
Best C. T., 1995, SPEECH PERCEPTION LI, P171
Bradlow AR, 1999, J ACOUST SOC AM, V106, P2074, DOI 10.1121/1.427952
Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5
Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103
Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837
BRUNGART DS, 2001, J ACOUST SOC AM, V110, P2527
CARHART R, 1969, J ACOUST SOC AM, V45, P694, DOI 10.1121/1.1911445
Cervera T, 2005, ACTA ACUST UNITED AC, V91, P132
Cooke M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1765
Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952
Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600
Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292
Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707
CUTLER A, 1979, SENTENCE PROCESSING, P171
CUTLER A, 1987, COGNITIVE PSYCHOL, V19, P141, DOI 10.1016/0010-0285(87)90010-7
Darwin CJ, 2008, PHILOS T R SOC B, V363, P1011, DOI 10.1098/rstb.2007.2156
Detey S, 2008, LINGUA, V118, P66, DOI 10.1016/j.lingua.2007.04.003
DUBNO JR, 1981, J ACOUST SOC AM, V69, P249, DOI 10.1121/1.385345
FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247
Flege J. E., 1995, SPEECH PERCEPTION LI, P233
Florentine M, 1984, J ACOUST SOC AM, V75, pS84, DOI 10.1121/1.2021645
FOSS DJ, 1980, COGNITIVE PSYCHOL, V12, P1, DOI 10.1016/0010-0285(80)90002-X
Fullgrabe C, 2006, HEARING RES, V211, P74, DOI 10.1016/j.heares.2005.09.001
Gamer M, 2007, IRR VARIOUS COEFFICI
Hazan V, 2000, LANG SPEECH, V43, P273
Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9
Hazan V, 2004, J ACOUST SOC AM, V116, P3108, DOI 10.1121/1.1806826
Imai S, 2005, J ACOUST SOC AM, V117, P896, DOI 10.1121/1.1823291
Jiang JT, 2006, J ACOUST SOC AM, V119, P1092, DOI 10.1121/1.2149841
Kendall MG, 1948, RANK CORRELATION MET
KUHL PK, 1993, J ACOUST SOC AM, V93, P2423
Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1121/1.2180210
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lovitt A, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2154
LU Y, 2010, THESIS U SHEFFIELD
MacKay IRA, 2001, PHONETICA, V58, P103, DOI 10.1159/000028490
Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686
MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526
Moore B. C., 2004, INTRO PSYCHOL HEARIN
Parikh G, 2005, J ACOUST SOC AM, V118, P3874, DOI 10.1121/1.2118407
PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96
Pinheiro J, 2008, NLME LINEAR NONLINEA
Pinheiro J. C., 2000, MIXED EFFECTS MODELS
R Development Core Team, 2008, R LANG ENV STAT COMP
Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X
STUDEBAKER GA, 1985, J SPEECH HEAR RES, V28, P455
TAKATA Y, 1990, J ACOUST SOC AM, V88, P663, DOI 10.1121/1.399769
Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666
van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928
WANG MD, 1973, J ACOUST SOC AM, V54, P1248, DOI 10.1121/1.1914417
Wright R., 2004, PHONETICALLY BASED P, P34, DOI 10.1017/CBO9780511486401.002
NR 55
TC 9
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 954
EP 967
DI 10.1016/j.specom.2010.04.004
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700009
ER
PT J
AU van Dommelen, WA
Hazan, V
AF van Dommelen, Wim A.
Hazan, Valerie
TI Perception of English consonants in noise by native and Norwegian
listeners
SO SPEECH COMMUNICATION
LA English
DT Article
DE Consonant identification; Non-native; English; Norwegian; Noise
ID NONNATIVE LISTENERS; SPEECH-PERCEPTION; INFORMATIONAL MASKING;
RECOGNITION; LANGUAGE; JAPANESE; VOWELS; L2; IDENTIFICATION;
ASSIMILATION
AB The aim of this study was to investigate factors that affect second language speech perception. Listening tests were run in which native and non-native (Norwegian) participants identified English consonants in VCV syllables in quiet and in different noise conditions. An assimilation test investigated the mapping of English consonants onto Norwegian counterparts. Results of the identification test showed a lower non-native performance but there was no evidence that the non-native disadvantage was greater in noise than in quiet. Poorer identification was found for sounds that occur only in English ('novel category' consonants) but this was the case for both English and Norwegian listeners, and thus likely to be related to the acoustic-phonetic properties of consonants in that category. Information transfer analyses revealed a certain impact of phonological factors on L2 perception, as the transmission of the voicing feature was more affected for Norwegian listeners than the transmission of place or manner information. The relation between the results of the identification in noise and assimilation tasks suggests that, at least in higher proficiency L2 learners, assimilation patterns may not be predictive of listeners' ability to hear non-native speech sounds. (C) 2010 Elsevier B.V. All rights reserved.
C1 [van Dommelen, Wim A.] Norwegian Univ Sci & Technol, Dept Language & Commun Studies, N-7491 Trondheim, Norway.
[Hazan, Valerie] UCL, Dept Speech Hearing & Phonet Sci, London WC1N 1PF, England.
RP van Dommelen, WA (reprint author), Norwegian Univ Sci & Technol, Dept Language & Commun Studies, N-7491 Trondheim, Norway.
EM wim.van.dommelen@ntnu.no; v.hazan@ucl.ac.uk
RI Hazan, Valerie/C-9722-2009
OI Hazan, Valerie/0000-0001-6572-6679
FU EU
FX The speech material used in this study is part of the corpus from the
Consonant Challenge project organized by Martin Cooke, Maria Luisa
Garcia Lecumberri and Odette Scharenborg, whom we thank for the
preparation of the listening test materials, and for making the data
for native speakers available to us.
Part of the study was financially supported by the EU Marie Curie
Research Training Network "Sound to Sense". We acknowledge the useful
comments on an earlier version of this paper given by two anonymous
reviewers.
CR Aoyama K, 2004, J PHONETICS, V32, P233, DOI 10.1016/S0095-4470(03)00036-6
Best C. T., 1995, SPEECH PERCEPTION LI, P171
BEST CT, 1988, J EXP PSYCHOL HUMAN, V14, P345, DOI 10.1037/0096-1523.14.3.345
Best CT, 2007, LANGUAGE EXPERIENCE, P13
Boersma P., 2008, PRAAT DOING PHONETIC
Bradlow AR, 1999, J ACOUST SOC AM, V106, P2074, DOI 10.1121/1.427952
Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103
Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837
Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946
Cebrian J, 2006, J PHONETICS, V34, P372, DOI 10.1016/j.wocn.2005.08.003
Cooke M, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1765
Cooke M, 2010, SPEECH COMMUN, V52, P954, DOI 10.1016/j.specom.2010.04.004
Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952
Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292
Cutler A, 2005, SPEECH COMMUN, V47, P32, DOI 10.1016/j.specom.2005.02.001
Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707
DAVIDSENNIELSEN N, 1975, ENGLISH PHONETICS
Docherty Gerard J., 1992, TIMING VOICING BRIT
EDWARDS TJ, 1981, J ACOUST SOC AM, V69, P535, DOI 10.1121/1.385482
Flege J. E., 1995, SPEECH PERCEPTION LI, P229
Goedegebure A, 2002, INT J AUDIOL, V41, P414, DOI 10.3109/14992020209090419
Guion SG, 2000, J ACOUST SOC AM, V107, P2711, DOI 10.1121/1.428657
Halle PA, 2007, J ACOUST SOC AM, V121, P2899, DOI 10.1121/1.2534656
HAUGEN E, 1995, NORWEGIAN ENGLISH DI
Hazan V, 2000, LANG SPEECH, V43, P273
Iverson P, 2007, J ACOUST SOC AM, V122, P2842, DOI 10.1121/1.2783198
Iverson P, 2003, COGNITION, V87, pB47, DOI 10.1016/S0010-0277(02)00198-1
Kang KH, 2006, J ACOUST SOC AM, V119, P1672, DOI 10.1121/1.2166607
Kingston J, 2003, LANG SPEECH, V46, P295
Kristoffersen G., 2000, PHONOLOGY NORWEGIAN
Lecumberri MLG, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1781
Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1121/1.2180210
Lengeris A, 2009, PHONETICA, V66, P169, DOI 10.1159/000235659
Levy ES, 2009, J ACOUST SOC AM, V125, P1138, DOI 10.1121/1.3050256
Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686
MCALLISTER R, 2007, LANGUAGE EXPERIENCE, P153
MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526
Piske T, 2001, J PHONETICS, V29, P191, DOI 10.1006/jpho.2001.0134
Rhebergen KS, 2005, J ACOUST SOC AM, V118, P1274, DOI 10.1121/1.2000751
Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X
Sagi E, 2008, J ACOUST SOC AM, V123, P2848, DOI 10.1121/1.2897914
Simpson SA, 2005, J ACOUST SOC AM, V118, P2775, DOI 10.1121/1.2062650
Strange W., 2007, LANGUAGE EXPERIENCE, P35, DOI 10.1075/lllt.17.08str
Strange W., 2004, J ACOUST SOC AM, V115, P2606
TAKATA Y, 1990, J ACOUST SOC AM, V88, P663, DOI 10.1121/1.399769
van der Horst R, 1999, J ACOUST SOC AM, V105, P1801, DOI 10.1121/1.426718
VANDOMMELEN WA, 2007, P FON 2007 STOCKH MA, V50, P5
VANENGEN KJ, LANG SPEECH
Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666
NR 49
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 968
EP 979
DI 10.1016/j.specom.2010.05.001
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700010
ER
PT J
AU Broersma, M
Scharenborg, O
AF Broersma, Mirjam
Scharenborg, Odette
TI Native and non-native listeners' perception of English consonants in
different types of noise
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech Perception; Consonants; Identification; Noise; Non-native
Language
ID SPEECH-IN-NOISE; NORMAL-HEARING; LANGUAGE; INTELLIGIBILITY;
IDENTIFICATION; RECOGNITION; CONFUSIONS; BABBLE
AB This paper shows that the effect of different types of noise on recognition of different phonemes by native versus non-native listeners is highly variable, even within classes of phonemes with the same manner or place of articulation. In a phoneme identification experiment, English and Dutch listeners heard all 24 English consonants in VCV stimuli in quiet and in three types of noise: competing talker, speech-shaped noise, and modulated speech-shaped noise (all with SNRs of -6 dB). Differential effects of noise type for English and Dutch listeners were found for eight consonants (/p t k g m n ŋ r/) but not for the other 16 consonants. For those eight consonants, effects were again highly variable: each noise type hindered non-native listeners more than native listeners for some of the target sounds, but none of the noise types did so for all of the target sounds, not even for phonemes with the same manner or place of articulation. The results imply that the noise types employed will strongly affect the outcomes of any study of native and non-native speech perception in noise. (C) 2010 Elsevier B.V. All rights reserved.
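The listening conditions above specify a fixed signal-to-noise ratio of -6 dB. As an illustration of how such stimuli are commonly prepared (not necessarily how these authors did it), a short numpy sketch for scaling a noise signal to a target SNR before adding it to a token:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then add
    it to the speech signal.  Both inputs are 1-D float arrays of the same
    length (trim or loop the noise segment beforehand)."""
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    target_p_noise = p_speech / (10 ** (snr_db / 10.0))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

# e.g. mixed = mix_at_snr(vcv_token, babble_segment, snr_db=-6)
```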
C1 [Broersma, Mirjam] Max Planck Inst Psycholinguist, NL-6500 AH Nijmegen, Netherlands.
[Broersma, Mirjam] Radboud Univ Nijmegen, Donders Inst Brain Cognit & Behav, NL-6500 HE Nijmegen, Netherlands.
Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6500 HD Nijmegen, Netherlands.
RP Broersma, M (reprint author), Max Planck Inst Psycholinguist, POB 310, NL-6500 AH Nijmegen, Netherlands.
EM mirjam@mirjambroersma.nl; o.scharenborg@let.ru.nl
RI Scharenborg, Odette/E-2056-2012; Broersma, Mirjam/B-2032-2015
OI Broersma, Mirjam/0000-0001-8511-2877
FU Netherlands Organisation for Scientific Research (NWO)
FX Each of the authors was supported by an individual (separate) Veni grant
from the Netherlands Organisation for Scientific Research (NWO). We
would like to thank Martin Cooke (University of the Basque Country) for
kindly providing the English VCV data and two anonymous reviewers for
helpful comments.
CR BOHN OS, 2007, HONOR JE FLEGE
Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103
Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837
Broersma M, 2005, J ACOUST SOC AM, V117, P3890, DOI 10.1121/1.1906060
Broersma M, 2010, J ACOUST SOC AM, V127, P1636, DOI 10.1121/1.3292996
Broersma M, 2008, J ACOUST SOC AM, V124, P712, DOI 10.1121/1.2940578
COOKE M, 2008, P INT 2008 BRISB AUS
Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952
Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600
Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292
Cutler A, 2005, SPEECH COMMUN, V47, P32, DOI 10.1016/j.specom.2005.02.001
Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707
Felty RA, 2009, J ACOUST SOC AM, V125, pEL93, DOI 10.1121/1.3073733
FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247
Golestani N, 2009, BILING-LANG COGN, V12, P385, DOI 10.1017/S1366728909990150
Gussenhoven C., 1999, HDB INT PHONETIC ASS, P74
Hazan V, 2000, LANG SPEECH, V43, P273
LECUMBERRI MLG, 2008, P INT 2008 BRISB AUS
LU Y, 2010, THESIS U SHEFFIELD U
Maniwa K, 2008, J ACOUST SOC AM, V123, P1114, DOI 10.1121/1.2821966
Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686
NABELEK AK, 1984, J ACOUST SOC AM, V75, P632
Phatak SA, 2008, J ACOUST SOC AM, V124, P1220, DOI 10.1121/1.2913251
Simpson SA, 2005, J ACOUST SOC AM, V118, P2775, DOI 10.1121/1.2062650
Strange W., 1995, SPEECH PERCEPTION LI
Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666
van Wijngaarden SJ, 2001, SPEECH COMMUN, V35, P103, DOI 10.1016/S0167-6393(00)00098-4
van Wijngaarden SJ, 2002, J ACOUST SOC AM, V111, P1906, DOI 10.1121/1.1456928
Wagner A, 2006, J ACOUST SOC AM, V120, P2267, DOI 10.1121/1.2335422
WANG MD, 1973, J ACOUST SOC AM, V54, P1248, DOI 10.1121/1.1914417
NR 30
TC 10
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 980
EP 995
DI 10.1016/j.specom.2010.08.010
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700011
ER
PT J
AU Hazan, V
Kim, J
Chen, YC
AF Hazan, Valerie
Kim, Jeesun
Chen, Yuchun
TI Audiovisual perception in adverse conditions: Language, speaker and
listener effects
SO SPEECH COMMUNICATION
LA English
DT Article
DE Audiovisual; L2 perception; Speaker/listener variability
ID SPEECH-PERCEPTION; CONSONANT RECOGNITION; NONNATIVE LISTENERS; HEARING
LIPS; VISUAL CUES; NOISE; ENGLISH; REVERBERATION; 2ND-LANGUAGE;
INTEGRATION
AB This study investigated the relative contribution of auditory and visual information to speech perception by looking at the effect of visual and auditory degradation on the weighting given to visual cues for native and non-native speakers. Multiple iterations of /ba/, /da/ and /ga/ by five Australian English and five Mandarin Chinese speakers were presented to Australian English, British English and Mandarin Chinese participants. Tokens were presented in auditory, visual and congruent/incongruent audiovisual (AV) modes, either in clear or with visual degradation (blurring), auditory degradation (noise) or combined degradations. In the AV clear condition, English-speaking participants showed greater visual weighting for non-native speakers, but this was not found for Chinese participants. In 'single-channel degradation' conditions, the weighting of the intact channel increased significantly, with little influence of speaker language. There was no strong evidence of native-language effects on the weighting of visual cues. The degree of visual weighting varied widely across individual participants, and was also affected by individual speaker characteristics. The weighting of auditory and visual cues is therefore highly flexible and dependent on the information load of each channel; non-native speaker and language-background effects may influence visual weighting but individual perceiver and speaker strategies also have a strong impact. (C) 2010 Elsevier B.V. All rights reserved.
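One common way to quantify the visual weighting discussed above is the proportion of responses that follow the visual component on incongruent audiovisual trials. The sketch below computes such an index; the scoring convention (counting only exact visual matches) is an assumption for illustration, not the paper's exact measure.

```python
def visual_weighting(responses, auditory, visual):
    """Crude index of visual weighting: on incongruent AV trials, the
    proportion of responses matching the visual rather than the auditory
    component.  Whether fused responses (e.g. auditory /ba/ + visual /ga/
    heard as 'da') count as visually influenced is a design decision;
    here only exact visual matches are counted."""
    incongruent = [(r, a, v) for r, a, v in zip(responses, auditory, visual) if a != v]
    if not incongruent:
        return float("nan")
    visual_matches = sum(1 for r, a, v in incongruent if r == v)
    return visual_matches / len(incongruent)
```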
C1 [Hazan, Valerie] UCL, London WC1N 1PF, England.
[Kim, Jeesun] Univ Western Sydney, MARCS Auditory Labs, Penrith, NSW 1797, Australia.
[Chen, Yuchun] Natl Taiwan Normal Univ, Dept Special Educ, Taipei 10644, Taiwan.
RP Hazan, V (reprint author), UCL, Chandler House,2 Wakefield St, London WC1N 1PF, England.
EM v.hazan@ucl.ac.uk; j.kim@uws.edu.au
RI Hazan, Valerie/C-9722-2009
OI Hazan, Valerie/0000-0001-6572-6679
FU Australian Research Council [DP0666857, TS0669874]
FX We are extremely grateful to Chris Davis for his advice at all stages of
this study. We also thank the following for their valuable contribution
to the study: Steve Nevard and Andrew Faulkner for help with the
audiovisual recordings, Erin Cvejic and Michael Fitzpatrick for help in
the processing of the speech materials, Jennifer Le in running the data
collection in Australia. The second author acknowledges the support of
Australian Research Council, Grant Nos. DP0666857 and TS0669874.
CR Best C. T., 1995, SPEECH PERCEPTION LI, P171
Best CT, 2007, LANGUAGE EXPERIENCE, P13
BINNIE CA, 1974, J SPEECH HEAR RES, V17, P619
BRANDY WT, 1966, J SPEECH HEAR RES, V9, P461
Chen TH, 2004, PERCEPT PSYCHOPHYS, V66, P820, DOI 10.3758/BF03194976
Chen Y., 2007, P 16 INT C PHON SCI, P2177
Chen YC, 2009, J ACOUST SOC AM, V126, P858, DOI 10.1121/1.3158823
Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292
Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707
Davis C, 2001, ARTIF INTELL REV, V16, P37, DOI 10.1023/A:1011086120667
Davis C, 2004, Q J EXP PSYCHOL-A, V57, P1103, DOI 10.1080/02724980343000701
DEGELDER B, 1992, COGNITIVE PROCESSING, P413
DEGELDER B, 1995, P 4 EUR C SPEECH COM, P1699
DODD B, 1977, PERCEPTION, V6, P31, DOI 10.1068/p060031
ERBER NP, 1969, J SPEECH HEAR RES, V12, P423
Fixmer E., 1998, P AUD VIS SPEECH PRO, P27
Flege J. E., 1995, SPEECH PERCEPTION LI, P229
Forster KI, 2003, BEHAV RES METH INS C, V35, P116, DOI 10.3758/BF03195503
Fuster-Duran A., 1996, SPEECHREADING HUMANS, P135
Gagne J. P., 1994, J ACAD REHABIL AUDIO, V27, P135
Gagne JP, 2002, SPEECH COMMUN, V37, P213, DOI 10.1016/S0167-6393(01)00012-7
Grant KW, 1998, J ACOUST SOC AM, V103, P2677, DOI 10.1121/1.422788
Grant KW, 2000, J ACOUST SOC AM, V108, P1197, DOI 10.1121/1.1288668
Hardison DM, 1999, LANG LEARN, V49, P213, DOI 10.1111/0023-8333.49.s1.7
HAYASHI Y, 1998, P AUD VIS SPEECH PRO, P61
Hazan V, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1191
Hazan V, 2006, J ACOUST SOC AM, V119, P1740, DOI 10.1121/1.2166611
Kuhl P. K., 1994, P INT C SPOK LANG PR, P539
Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1121/1.2180210
MacDonald J, 2000, PERCEPTION, V29, P1155, DOI 10.1068/p3020
Massaro D. W., 1998, PERCEIVING TALKING F
MASSARO DW, 1995, MEM COGNITION, V23, P113, DOI 10.3758/BF03210561
Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686
MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0
Navarra J, 2007, PSYCHOL RES-PSYCH FO, V71, P4, DOI 10.1007/s00426-005-0031-5
Nielsen K., 2004, P INTERSPEECH 2004, P2533
Ortega-Llebaria M., 2001, P INT C AUD VIS SPEE, P149
Rogers CL, 2006, APPL PSYCHOLINGUIST, V27, P465, DOI 10.1017/S014271640606036X
Ross Lars A, 2007, Cereb Cortex, V17, P1147, DOI 10.1093/cercor/bhl024
SEKIYAMA K, 2003, P INT C AUD VIS SPEE, P61
SEKIYAMA K, 1993, J PHONETICS, V21, P427
Sekiyama K, 1997, PERCEPT PSYCHOPHYS, V59, P73, DOI 10.3758/BF03206849
Sekiyama K, 2008, DEVELOPMENTAL SCI, V11, P303
SEKIYAMA K, 1995, P 13 INT C PHON SCI, V3, P214
SEKIYAMA K, 1991, J ACOUST SOC AM, V90, P1805
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
TAKATA Y, 1990, J ACOUST SOC AM, V88, P663, DOI 10.1121/1.399769
Thomas SM, 2002, PERCEPT PSYCHOPHYS, V64, P932, DOI 10.3758/BF03196797
Wang Y, 2008, J ACOUST SOC AM, V124, P1716, DOI 10.1121/1.2956483
Wang Y, 2009, J PHONETICS, V37, P344, DOI 10.1016/j.wocn.2009.04.002
NR 50
TC 8
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 996
EP 1009
DI 10.1016/j.specom.2010.05.003
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700012
ER
PT J
AU Volin, J
Skarnitzl, R
AF Volin, Jan
Skarnitzl, Radek
TI The strength of foreign accent in Czech English under adverse listening
conditions
SO SPEECH COMMUNICATION
LA English
DT Article
DE Foreign accent; Czech English; Low-pass filtering; Perception; Rhythm
metrics; Signal-to-noise ratio
ID SPEAKER NORMALIZATION; SPONTANEOUS SPEECH; PERCEIVED ACCENT;
INTELLIGIBILITY; PERCEPTION; COMMUNICATION; COMPREHENSION; LANGUAGE;
NOISE; AGE
AB The study connects two major topics in current speech research: foreign accentedness and speech in adverse conditions. We parallel the research in intelligibility of non-native speech, but instead of linguistic unit recognition we focus on the perception of the foreign accent strength. First, the question of type and degree of perceptual deficiencies occurring along with certain types of signal degradation is tackled. Second, we measure correlations between the accent ratings and certain candidate phenomena that may influence them, e.g., articulation rate, temporal patterning, contrasts in sound pressure levels on selected syllables and F0 variation. The impacts of different types of signal degradation help to estimate the role of segmental/suprasegmental information in assessments of foreignness in Czech English. The full appreciation of the strength of foreign accent is apparently not possible without fine phonetic detail on the segmental level. However, certain suprasegmental features of foreignness are robust enough to manifest at severe levels of signal degradation. Pair-wise variability indices of vowel durations and variation in F0 tracks seem to guide the listener even better in the degraded than in the 'clean' speech signal. (C) 2010 Elsevier B.V. All rights reserved.
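The pair-wise variability index mentioned above has a standard normalised formulation (Grabe and Low, 2002, cited below). A minimal Python version, for orientation only:

```python
def npvi(durations):
    """Normalised pairwise variability index for a sequence of vowel (or
    other interval) durations: 100 * mean of |d_k - d_{k+1}| divided by the
    mean of each adjacent pair."""
    if len(durations) < 2:
        raise ValueError("need at least two intervals")
    pairs = zip(durations[:-1], durations[1:])
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * sum(terms) / len(terms)

# e.g. npvi([0.08, 0.14, 0.06, 0.12]) gives a rhythm-metric score
```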
C1 [Volin, Jan; Skarnitzl, Radek] Charles Univ Prague, Fac Arts, Inst Phonet, Prague 11638 1, Czech Republic.
RP Volin, J (reprint author), Charles Univ Prague, Fac Arts, Inst Phonet, Nam Jana Palacha 2, Prague 11638 1, Czech Republic.
EM jan.volin@ff.cuni.cz; radek.skarnitzl@ff.cuni.cz
FU European Union [MRTN-CT-2006-035561]; Czech Ministry of Education [VZ
MSM0021620825]
FX We thank our anonymous reviewers for many helpful comments on earlier
versions of this paper. This work was supported by the European Union
Grant MRTN-CT-2006-035561 - Sound to Sense, and by the Czech Ministry of
Education Grant VZ MSM0021620825.
CR ANDERSONHSIEH J, 1988, LANG LEARN, V38, P561, DOI 10.1111/j.1467-1770.1988.tb00167.x
Asu E. L., 2006, P SPEECH PROS 2006 T, P249
Barry WJ, 2003, P 15 INT C PHON SCI, P2693
Bent T, 2003, J ACOUST SOC AM, V114, P1600, DOI 10.1121/1.1603234
BLADON RAW, 1984, LANG COMMUN, V4, P59, DOI 10.1016/0271-5309(84)90019-3
Boersma P., 2009, PRAAT DOING PHONETIC
Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5
Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837
BRENNAN EM, 1981, J PSYCHOLINGUIST RES, V10, P487, DOI 10.1007/BF01076735
Cooke M, 2008, J ACOUST SOC AM, V123, P414, DOI 10.1121/1.2804952
Derwing TM, 2009, LANG TEACH, V42, P476, DOI 10.1017/S026144480800551X
Eskenazi M, 2009, SPEECH COMMUN, V51, P832, DOI 10.1016/j.specom.2009.04.005
FLEGE JE, 1984, J ACOUST SOC AM, V76, P692, DOI 10.1121/1.391256
Flege J. E., 2007, LAB PHONOLOGY, P353
FLEGE JE, 1988, J ACOUST SOC AM, V84, P70, DOI 10.1121/1.396876
FLEGE JE, 1995, J ACOUST SOC AM, V97, P3125, DOI 10.1121/1.413041
GHESQUIERE P, 2002, P INT C AC SPEECH SI, V1, P749
Gibbon D., 2001, P EUROSPEECH, P91
Grabe Esther, 2002, LAB PHONOLOGY, V7, P515
Hahn LD, 2004, TESOL QUART, V38, P201
Hazan V, 2004, J ACOUST SOC AM, V116, P3108, DOI 10.1121/1.1806826
HENTON C, 1990, J PHONETICS, V18, P203
Ikeno A, 2006, INT CONF ACOUST SPEE, P401
INGRAM J, 1987, J PHONETICS, V15, P127
Johnson K, 2005, BLACKW HBK LINGUIST, P363, DOI 10.1002/9780470757024.ch15
Koreman J, 2006, J ACOUST SOC AM, V119, P582, DOI 10.1121/1.2133436
Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1121/1.2180210
LIEBERMAN P, 1985, J ACOUST SOC AM, V77, P649, DOI 10.1121/1.391883
Mackay IRA, 2006, APPL PSYCHOLINGUIST, V27, P157, DOI 10.1017/S0142716406060231
Magen HS, 1998, J PHONETICS, V26, P381, DOI 10.1006/jpho.1998.0081
MARKOVA P, 2009, P 19 CZECH GERM WORK, P56
Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686
Meador D, 2000, BILING-LANG COGN, V3, P55, DOI 10.1017/S1366728900000134
Munro M. J., 2001, STUDIES 2 LANGUAGE A, V23, P451
MUNRO MJ, 1995, LANG LEARN, V45, P73, DOI 10.1111/j.1467-1770.1995.tb00963.x
Munro MJ, 2006, STUD SECOND LANG ACQ, V28, P111, DOI 10.1017/S0272263106060049
Munro MJ, 1998, LANG LEARN, V48, P159, DOI 10.1111/1467-9922.00038
Paeschke A., 2000, P ISCA WORKSH SPEECH, P75
PFITZINGER HR, 2006, P 3 INT C SPEECH PRO, V1, P105
PFITZINGER HR, 1998, P ICSLP 98 SYDN AUST, P1087
Pickering L, 2001, TESOL QUART, V35, P233, DOI 10.2307/3587647
Ramus F, 1999, COGNITION, V73, P265, DOI 10.1016/S0010-0277(99)00058-X
RUBIN DL, 1992, RES HIGH EDUC, V33, P511, DOI 10.1007/BF00973770
Scott SK, 2009, J ACOUST SOC AM, V125, P1737, DOI 10.1121/1.3050255
SKARNITZL R, 2005, P 2 PRAG C LING LIT, P11
Southwood MH, 1999, CLIN LINGUIST PHONET, V13, P335
STRIK H, 2003, P 15 ICPHS, P227
Volin J, 2009, INTERSPEECH 2009: 10TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2009, VOLS 1-5, P3051
Wagner P. S., 2004, P SPEECH PROS, P227
White L, 2007, P 16 INT C PHON SCI, P1009
Wu TY, 2010, SPEECH COMMUN, V52, P83, DOI 10.1016/j.specom.2009.08.010
NR 51
TC 4
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 1010
EP 1021
DI 10.1016/j.specom.2010.06.009
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700013
ER
PT J
AU Gooskens, C
van Heuven, VJ
van Bezooijen, R
Pacilly, JJA
AF Gooskens, Charlotte
van Heuven, Vincent J.
van Bezooijen, Renee
Pacilly, Jos J. A.
TI Is spoken Danish less intelligible than Swedish?
SO SPEECH COMMUNICATION
LA English
DT Article
DE Mutual intelligibility; Danish; Swedish; Babble noise; Semantically
unpredictable sentences; Map tasks; Cognates
ID WORD RECOGNITION; SPEECH; CORPUS; NOISE
AB The most straightforward way to explain why Danes understand spoken Swedish relatively better than Swedes understand spoken Danish would be that spoken Danish is intrinsically a more difficult language to understand than spoken Swedish. We discuss circumstantial evidence suggesting that Danish is intrinsically poorly intelligible. We then report on a formal experiment in which we tested the intelligibility of Danish and Swedish materials spoken by three representative male speakers per language (isolated cognate and non-cognate words, words in semantically unpredictable sentences, words in spontaneous interaction in map tasks) presented in descending levels of noise to native listeners of Danish (N = 18) and Swedish (N = 24), respectively. The results show that Danish is as intelligible to Danish listeners as Swedish is to Swedish listeners. In a separate task, the same listeners recognized the same materials (presented without noise) in the neighboring language. The asymmetry that has traditionally been claimed was indeed found, even when differences in familiarity with the non-native language were controlled for. Possible reasons for the asymmetry are discussed. (C) 2010 Elsevier B.V. All rights reserved.
C1 [van Heuven, Vincent J.] Leiden Univ, Ctr Linguist, Phonet Lab, NL-2300 RA Leiden, Netherlands.
[Gooskens, Charlotte; van Bezooijen, Renee] Univ Groningen, NL-9700 AB Groningen, Netherlands.
RP van Heuven, VJ (reprint author), Leiden Univ, Ctr Linguist, Phonet Lab, POB 9515, NL-2300 RA Leiden, Netherlands.
EM v.j.j.p.van.heuven@hum.leidenuniv.nl
CR Allen Sture, 1970, NUSVENSK FREKVENSORD
ANDERSON AH, 1991, LANG SPEECH, V34, P351
BASBOL H, 2005, PHONOLOGY DANISH
Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X
Bergenholtz H., 1992, DANSK FREKVENSORDBOG
Bezooijen Renee van, 1999, LINGUISTICS NETHERLA, P1
Bleses D, 2008, J CHILD LANG, V35, P619, DOI 10.1017/S0305000908008714
Bleses Dorthe, 2004, 20 DAN 2003 S BRAIN, P165
Bo I., 1978, 4 ROG
Borestam Ulla, 1987, DANSK SVENSK SPRAKGE
BRAUNMULLER K, 2002, APPL LINGUIST, V12, P1
Brink Lars, 1975, DANSK RIGSMAL LYDUDV
Delsing Lars-Olof, 2005, HALLER SPRAKET IHOP
Derwing T. M., 1997, STUDIES 2 LANGUAGE A, V19, P1, DOI [DOI 10.1017/S0272263197001010, 10.1017/S0272263197001010]
ELBERLING C, 1989, SCAND AUDIOL, V18, P169, DOI 10.3109/01050398909070742
Elert C-C., 1970, LJUD ORD SVENSKAN
ENGSTRAND O, 1990, SWEDISH J INT PHONET, V20, P42
Garlen C., 1984, SVENSKANS FONOLOGI K
GOOSKENS C, 2007, NEAR LANGUAGES COLLA, P99
GRONNUM N, 1998, DANISH J INT PHONETI, V28, P99
GRONNUM N, 2003, TAKE DANISH FOR INST
Gronnum Nina, 1998, FONETIK FONOLOGI ALM
Gronnum N, 2009, SPEECH COMMUN, V51, P594, DOI 10.1016/j.specom.2008.11.002
HAGERMAN B, 1984, SCAND AUDIOL, V13, P57, DOI 10.3109/01050398409076258
HANSEN PM, 1990, UDTALEORDBOG
Hedelin Per, 1997, NORSTEDTS SVENSKA UT
JENSEN JB, 1989, HISPANIA-J DEV INTER, V72, P848, DOI 10.2307/343562
Kennedy S, 2008, CAN MOD LANG REV, V64, P459, DOI 10.3138/cmlr.64.3.459
LIDEN G, 1954, Acta Otolaryngol Suppl, V116, P189
MALMBERG B, 1968, SVENSK FONETIK
Maurud Oivind, 1976, NABOSPRAKSFORSTAELSE
Perre L, 2008, BRAIN RES, V1188, P132, DOI 10.1016/j.brainres.2007.10.084
Perre L, 2009, PSYCHOPHYSIOLOGY, V46, P739, DOI 10.1111/j.1469-8986.2009.00813.x
Tang Chaoju, 2009, LINGUA, V119, P709, DOI DOI 10.1016/J.LINGUA.2008.10.001
TELEMAN U, 1987, NORDISK SPRAKSEKRETA, V8, P70
van Bezooijen Renee, 1997, HDB STANDARDS RESOUR, P481
Van Heuven Vincent J, 2008, International Journal of Humanities and Arts Computing, V2, DOI 10.3366/E1753854809000305
van Wijngaarden SJ, 2001, SPEECH COMMUN, V35, P103, DOI 10.1016/S0167-6393(00)00098-4
Wagener K, 2003, INT J AUDIOL, V42, P10, DOI 10.3109/14992020309056080
Yule G, 1984, TEACHING TALK STRATE
NR 40
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 1022
EP 1037
DI 10.1016/j.specom.2010.06.005
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700014
ER
PT J
AU Heinrich, A
Flory, Y
Hawkins, S
AF Heinrich, Antje
Flory, Yvonne
Hawkins, Sarah
TI Influence of English r-resonances on intelligibility of speech in noise
for native English and German listeners
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility; L2 acquisition; r-resonances
ID SHORT-TERM-MEMORY; VERTICAL-BAR; NONNATIVE LISTENERS; NATURAL SPEECH;
PERCEPTION; COARTICULATION; ORGANIZATION; RECOGNITION; CONTRASTS;
PATTERNS
AB Non-rhotic British English speakers and Germans living in England were compared in their use of short- and long-domain r-resonances (cues to an upcoming [ɹ]) in read English sentences heard in noise. The sentences comprised 52 pairs differing only in /r/ or /l/ in a minimal-pair target word (mirror, miller). Target words were cross-spliced into a different utterance of the same sentence-base (match) and into a base originally containing the other target word (mismatch), making a four-stimulus set for each sentence-pair. Intelligibility of target and some preceding unspliced words was measured. English listeners were strongly influenced by r-resonances in the sonorant immediately preceding the critical /r/. A median split of the German group showed that those who had lived in southeast England for 3-20 months used the weaker long-domain r-resonances, whereas Germans who had lived in England for 21-105 months ignored all r-resonances, possibly in favour of word frequency. A preliminary study of German speech showed differences in temporal extent and spectral balance (frequency of F3 and higher formants) between English and German r-resonances. The perception and production studies together suggest sophisticated application of exposure-induced changes in acoustic phonetic and phonological knowledge of L1 to a partially similar sound in L2. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Heinrich, Antje; Hawkins, Sarah] Univ Cambridge, Dept Linguist, Cambridge CB3 9DA, England.
[Flory, Yvonne] Univ Saarland, Saarbrucken, Germany.
RP Heinrich, A (reprint author), Univ Cambridge, Dept Linguist, Sidgwick Ave, Cambridge CB3 9DA, England.
EM ah540@cam.ac.uk
FU ERA-AGE (FLARE) [RG49525]; Marie Curie Research Training Network
[MRTN-CT-2006-035561-S2S]
FX Funded by an ERA-AGE (FLARE) Grant (RG49525) to the first author, and a
Marie Curie Research Training Network Grant, Sound to Sense (S2S:
MRTN-CT-2006-035561-S2S) to the third author. Experiment 1 was conducted
by the second author as part of a Master's thesis at the Universitat des
Saarlandes. We thank Pia Rubig for helping with stimulus preparation and
data collection, and Bill Barry for discussion. Data for the English
listeners comprised part of a paper presented at the 2009
CR American National Standards Institute (ANSI), 1996, S361996 ANSI
BARRY WJ, 1995, PHONETICA, V52, P228
BELLBERTI F, 1982, J ACOUST SOC AM, V71, P449, DOI 10.1121/1.387466
BENGUERE.AP, 1974, PHONETICA, V30, P41
Best C. T., 1994, DEV SPEECH PERCEPTIO, P167
Best CT, 2001, J ACOUST SOC AM, V109, P775, DOI 10.1121/1.1332378
Bradlow AR, 2007, J ACOUST SOC AM, V121, P2339, DOI 10.1121/1.2642103
Bronkhorst AW, 2000, ACUSTICA, V86, P117
Carter P., 2003, PAPERS LAB PHONOLOGY, P237
CARTER P, 1999, P 14 INT C PHON SCI, V1, P105
CARTER P, 2002, THESIS U YORK
Cohen J., 1988, STAT POWER ANAL BEHA, V2nd
Coleman J, 2003, J PHONETICS, V31, P351, DOI 10.1016/j.wocn.2003.10.001
Craik FIM, 1996, J EXP PSYCHOL GEN, V125, P159, DOI 10.1037/0096-3445.125.2.159
Cruttenden Alan, 2001, GIMSONS PRONUNCIATIO
Cutler A, 2008, J ACOUST SOC AM, V124, P1264, DOI 10.1121/1.2946707
DARWIN CJ, 1985, SPEECH COMMUN, V4, P231, DOI 10.1016/0167-6393(85)90049-4
Faul F, 2007, BEHAV RES METHODS, V39, P175, DOI 10.3758/BRM.41.4.1149
GRUNKE ME, 1982, PERCEPT PSYCHOPHYS, V31, P210, DOI 10.3758/BF03202525
Hawkins S, 2003, J PHONETICS, V31, P373, DOI 10.1016/j.wocn.2003.09.006
Hawkins S, 2004, J PHONETICS, V32, P199, DOI 10.1016/S0095-4470(03)00031-7
HAWKINS S, 1996, SOUND PATTERNS CONNE, P173
Hawkins S, 2010, J PHONETICS, V38, P60, DOI 10.1016/j.wocn.2009.02.001
HAWKINS S, 1995, P 13 INT C PHON SCI
Hawkins S., 1994, P INT C SPOK LANG PR, P57
Heid S., 2000, P 5 SEM SPEECH PROD, P77
Heinrich A, 2008, Q J EXP PSYCHOL, V61, P735, DOI 10.1080/17470210701402372
HUGGINS AWF, 1972, J ACOUST SOC AM, V51, P1279, DOI 10.1121/1.1912972
HUGGINS AWF, 1972, J ACOUST SOC AM, V51, P1270, DOI 10.1121/1.1912971
Kahneman D., 1973, ATTENTION EFFORT
KELLY J, 1986, C PUBLICATION, V258, P304
Kelly J., 1989, DOING PHONOLOGY
LADEFOGED PETER, 2001, COURSE PHONETICS
Laver John, 1994, PRINCIPLES PHONETICS
Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1121/1.2180210
Leech G., 2001, WORD FREQUENCIES WRI
Lindau M., 1985, PHONETIC LINGUISTICS, P157
Local J, 2003, J PHONETICS, V31, P321, DOI 10.1016/S0095-4470(03)00045-7
Lodge K, 2003, LINGUA, V113, P931, DOI 10.1016/S0024-3841(02)00142-0
LOFTUS GR, 1994, PSYCHON B REV, V1, P476, DOI 10.3758/BF03210951
LUCE PA, 1983, HUM FACTORS, V25, P17
Luce PA, 1998, EAR HEARING, V19, P1, DOI 10.1097/00003446-199802000-00001
MACHELETT K, 1996, LESEN SONAGRAMMEN V1
Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686
Moore RK, 2007, SPEECH COMMUN, V49, P418, DOI 10.1016/j.specom.2007.01.011
Nieto-Castanon A, 2005, J ACOUST SOC AM, V117, P3196, DOI 10.1121/1.1893271
Ogden R., 2009, INTRO ENGLISH PHONET
Raven J., 1998, RAVEN MANUAL SECTION
Raven J. C., 1982, MILL HILL VOCABULARY
REMEZ RE, 1994, PSYCHOL REV, V101, P129, DOI 10.1037/0033-295X.101.1.129
Remez RE, 2003, J PHONETICS, V31, P293, DOI 10.1016/S0095-4470(03)00042-1
REMEZ RE, 1994, HDB PSYCHOLINGUISTIC, P145
Secord W., 2007, ELICITING SOUNDS TEC
Simpson Adrian P., 1998, ZAS WORKING PAPERS L, V11, P91
Sommers MS, 1996, PSYCHOL AGING, V11, P333, DOI 10.1037/0882-7974.11.2.333
Stevens K.N., 1998, ACOUSTIC PHONETICS
Ito Kikuyo, 2009, J Acoust Soc Am, V125, P2348, DOI 10.1121/1.3082103
Tunley A., 1999, THESIS U CAMBRIDGE
ULBRICHT H, 1972, INSTRUMENTALPHONETIS
Wagener KC, 2005, INT J AUDIOL, V44, P144, DOI 10.1080/14992020500057517
WEST P, 1999, P 14 INT C PHON SCI, V3, P1901
WEST P, 2001, J PHONETICS, V27, P405
Zhou XH, 2008, J ACOUST SOC AM, V123, P4466, DOI 10.1121/1.2902168
NR 63
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV-DEC
PY 2010
VL 52
IS 11-12
SI SI
BP 1038
EP 1055
DI 10.1016/j.specom.2010.09.009
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 692BO
UT WOS:000285126700015
ER
PT J
AU Hansen, JHL
Gray, SS
Kim, W
AF Hansen, John H. L.
Gray, Sharmistha S.
Kim, Wooil
TI Automatic voice onset time detection for unvoiced stops (/p/,/t/,/k/)
with application to accent classification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice Onset Time (VOT); Voice Onset Region (VOR); Teager Energy Operator
(TEO); Accent classification
ID AMERICAN ENGLISH; FOREIGN ACCENT; SPEECH; RECOGNITION; TRANSFORMS;
PERCEPTION; INVARIANT; ROTATION; OPERATOR; FRENCH
AB Articulation characteristics of particular phonemes can provide cues to distinguish accents in spoken English. For example, as shown in Arslan and Hansen (1996, 1997), Voice Onset Time (VOT) can be used to classify Mandarin, Turkish, German and American accented English. Our goal in this study is to develop an automatic system that classifies accents using VOT in unvoiced stops. VOT is an important temporal feature which is often overlooked in speech perception, speech recognition, as well as accent detection. Fixed-length frame-based speech processing inherently ignores VOT. In this paper, a more effective VOT detection scheme using the non-linear energy tracking algorithm Teager Energy Operator (TEO), across a sub-frequency band partition for unvoiced stops (/p/, /t/ and /k/), is introduced. The proposed VOT detection algorithm also incorporates spectral differences in the Voice Onset Region (VOR) and the succeeding vowel of a given stop-vowel sequence to classify speakers having accents due to different ethnic origin. The spectral cues are enhanced using one of the four types of feature parameter extractions - Discrete Mellin Transform (DMT), Discrete Mellin Fourier Transform (DMFT) and Discrete Wavelet Transform using the lowest and the highest frequency resolutions (DWTlfr and DWThfr). A Hidden Markov Model (HMM) classifier is employed with these extracted parameters and applied to the problem of accent classification. Three different language groups (American English, Chinese, and Indian) are used from the CU-Accent database. The VOT is detected with less than 10% error when compared to the manually detected VOT, with a success rate of 79.90%, 87.32% and 47.73% for English, Chinese and Indian speakers (including atypical cases for the Indian group), respectively. It is noted that the DMT and DWTlfr features are good for parameterizing speech samples which exhibit substitution of the succeeding vowel after the stop in accented speech. The successful accent classification rates of DMT and DWTlfr features are 66.13% and 71.67%, for /p/ and /t/ respectively, for pairwise accent detection. Alternatively, the DMFT feature works on all accent sensitive words considered, with a success rate of 70.63%. This study shows that effective VOT detection can be achieved using an integrated TEO processing with spectral difference analysis in the VOR that can be employed for accent classification. (C) 2010 Elsevier B.V. All rights reserved.
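The Teager Energy Operator at the core of the proposed VOT detector has a standard discrete form, psi[x(n)] = x(n)^2 - x(n-1) x(n+1) (Kaiser, 1990, cited below). A minimal sketch of the operator alone; the sub-band partitioning, VOR spectral analysis and HMM classifier of the full system are not shown.

```python
import numpy as np

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[x(n)] = x(n)^2 - x(n-1)*x(n+1).
    Returns an array two samples shorter than the input; the paper applies
    an operator of this kind per sub-frequency band before locating the
    voice onset region."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]
```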
C1 [Hansen, John H. L.; Gray, Sharmistha S.; Kim, Wooil] Univ Texas Dallas, CRSS, Erik Jonsson Sch Engn & Comp Sci, Dept Elect Engn, Richardson, TX 75080 USA.
RP Hansen, JHL (reprint author), Univ Texas Dallas, CRSS, Erik Jonsson Sch Engn & Comp Sci, Dept Elect Engn, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA.
EM john.hansen@utdallas.edu
FU US Air Force Research Laboratory, Rome NY [FA8750-04-1-0058]
FX This work was supported by the US Air Force Research Laboratory, Rome
NY, under contract number FA8750-04-1-0058.
CR Allen JS, 2003, J ACOUST SOC AM, V113, P544, DOI 10.1121/1.1528172
Arslan LM, 1996, SPEECH COMMUN, V18, P353, DOI 10.1016/0167-6393(96)00024-6
Arslan LM, 1997, J ACOUST SOC AM, V102, P28, DOI 10.1121/1.419608
Bahoura M, 2001, IEEE SIGNAL PROC LET, V8, P10, DOI 10.1109/97.889636
BERKLING K, 2002, SPEECH COMMUN, V35, P125
CAIRNS DA, 1994, J ACOUST SOC AM, V96, P3392, DOI 10.1121/1.410601
CHAN AYW, 2000, CULTURE CURRICULUM, V13, P67
CHEN QS, 1994, IEEE T PATTERN ANAL, V16, P1156
Comrie B., 1990, WORLDS MAJOR LANGUAG
DAS S, 2004, IEEE NORSIG, P344
Deller J., 1999, DISCRETE TIME PROCES
Esposito A, 2002, PHONETICA, V59, P197, DOI 10.1159/000068347
FANG M, 1990, APPL OPTICS, V29, P704, DOI 10.1364/AO.29.000704
FLEGE JE, 1984, J ACOUST SOC AM, V76, P692, DOI 10.1121/1.391256
FLEGE JE, 1988, J ACOUST SOC AM, V84, P70, DOI 10.1121/1.396876
Francis AL, 2003, J ACOUST SOC AM, V113, P1025, DOI 10.1121/1.1536169
Ghesquiere PJ, 2002, INT CONF ACOUST SPEE, P749
GRAY SS, 2005, IEEE AUT SPEECH REC
GRAY SS, 2005, THESIS U COLORADO BO
GROVER C, 1987, LANG SPEECH, V30, P277
Hansen JHL, 1998, IEEE T BIO-MED ENG, V45, P300, DOI 10.1109/10.661155
HOSHINO A, 2003, IEEE ICASSP 03, P472
Johnson C., 2002, INT J BILINGUAL, V6, P271, DOI 10.1177/13670069020060030401
KAISER JF, 1990, INT CONF ACOUST SPEE, P381, DOI 10.1109/ICASSP.1990.115702
KAZEMZADEH A, 2006, INT 2006
Kumpf K, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1740
KUMPF K, 1997, EUROSPEECH 97
Ladefoged Peter, 1993, COURSE PHONETICS
Levkovitz J, 1997, APPL OPTICS, V36, P3035, DOI 10.1364/AO.36.003035
LOPEZBASCUAS LE, 2004, J ACOUST SOC AM, V115, P2465
MAHADEVA PSR, 2009, IEEE T AUDIO SIGNAL, V17, P556
Major R. C, 2001, FOREIGN ACCENT ONTOG
MARAGOS P, 1993, IEEE T SIGNAL PROCES, V41, P3024, DOI 10.1109/78.277799
MCGORY J, 2001, J ACOUST SOC AM, V109, P2474
NIST/SEMATECH, 2005, E HDB STAT METH
Peng Long, 2004, WORLD ENGLISH, V23, P535, DOI 10.1111/j.0083-2919.2004.00376.x
POSER WJ, 2004, LANGUAGE LOG POST 20
ROSNER BS, 1984, J ACOUST SOC AM, V75, P1231, DOI 10.1121/1.390775
SHENG Y, 1988, OPT ENG
SHENG Y, 1986, J OPT SOC AM
STEINSCHNEIDER M, 1999, AM PHYSL SOC, P2346
Stouten V, 2009, SPEECH COMMUN, V51, P1194, DOI 10.1016/j.specom.2009.06.003
Sundaram N., 2003, INSTANTANEOUS NONLIN
Teager H. M., 1983, SPEECH SCI RECENT AD, P73
TEAGER HM, 1980, IEEE T ACOUST SPEECH, V28, P599, DOI 10.1109/TASSP.1980.1163453
TORRENCE C, 1998, PRACTICAL GUIDE WAVE
Zhou GJ, 2001, IEEE T SPEECH AUDI P, V9, P201, DOI 10.1109/89.905995
ZWICKE PE, 1983, IEEE T PATTERN ANAL, V5, P191
2010, CU ACCENT
NR 49
TC 12
Z9 12
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2010
VL 52
IS 10
BP 777
EP 789
DI 10.1016/j.specom.2010.05.004
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 638ST
UT WOS:000280917000001
ER
PT J
AU Valente, F
AF Valente, Fabio
TI Hierarchical and parallel processing of auditory and modulation
frequencies for automatic speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic speech recognition (ASR); TANDEM features; Multi Layer
Perceptron (MLP); Auditory and modulation frequencies
AB This paper investigates, from an automatic speech recognition perspective, the most effective way of combining Multi Layer Perceptron (MLP) classifiers trained on different ranges of auditory and modulation frequencies. Two different combination schemes based on MLPs are considered. The first one operates in parallel fashion and is invariant to the order in which feature streams are introduced. The second one operates in hierarchical fashion and is sensitive to the order in which feature streams are introduced. The study is carried out on a Large Vocabulary Continuous Speech Recognition system for transcription of meetings data using the TANDEM approach. Results reveal that (1) the combination of MLPs trained on different ranges of auditory frequencies is more effective if performed in parallel fashion; (2) the combination of MLPs trained on different ranges of modulation frequencies is more effective if performed in hierarchical fashion moving from high to low modulations; (3) the improvement obtained from separate processing of two modulation frequency ranges (12% relative WER reduction w.r.t. the single classifier approach) is considerably larger than the improvement obtained from separate processing of two auditory frequency ranges (4% relative WER reduction w.r.t. the single classifier approach). Similar results are also verified on other LVCSR systems and on other languages. Furthermore, the paper extends the discussion to the combination of classifiers trained on separate auditory-modulation frequency channels, showing that the previous conclusions also hold in this scenario. (C) 2010 Elsevier B.V. All rights reserved.
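To make the two combination schemes concrete, the sketch below gives one plausible reading of them: a parallel, order-invariant merge of two posterior streams, and a hierarchical, order-sensitive cascade in which the first classifier's posteriors are appended to the second stream's features. The merging rule and the `mlp_second` callable are assumptions for illustration, not the paper's exact TANDEM setup.

```python
import numpy as np

def parallel_combination(post_a, post_b):
    """Order-invariant combination of two phone-posterior streams
    (frames x classes): average in the log domain and renormalise."""
    logp = 0.5 * (np.log(post_a + 1e-10) + np.log(post_b + 1e-10))
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def hierarchical_combination(mlp_second, post_first, feats_second):
    """Order-sensitive combination: posteriors of the first classifier are
    appended to the second stream's features and passed to a second MLP
    (here any callable mapping frames x dims -> frames x classes)."""
    augmented = np.hstack([feats_second, post_first])
    return mlp_second(augmented)
```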
C1 IDIAP Res Inst, CH-1920 Martigny, Switzerland.
RP Valente, F (reprint author), IDIAP Res Inst, CH-1920 Martigny, Switzerland.
EM fabio.valente@idiap.ch
FU Defense Advanced Research Projects Agency (DARPA) [HR0011-06-C-0023];
European Union
FX This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023 and
by the European Union under the integrated project AMIDA. Any opinions,
findings and conclusions or recommendations expressed in this material
are those of the author(s) and do not necessarily reflect the views of
the Defense Advanced Research Projects Agency (DARPA). The author thanks
colleagues from the AMIDA and GALE projects for their help with the
different LVCSR systems and the reviewers for their comments.
CR Allen J., 2005, ARTICULATION INTELLI
ALLEN JB, 1994, IEEE T SPEECH AUDIO, V2
BOURLARD H, 2004, P DARPA EARS EFF AFF
BOURLARD H, 1996, P ICSLP 96
Bourlard Ha, 1994, CONNECTIONIST SPEECH
CHEN B, 2003, P EUR 2003
Dau T, 1997, J ACOUST SOC AM, V102, P2892, DOI 10.1121/1.420344
Fletcher H., 1953, SPEECH HEARING COMMU
HAIN T, 2005, NIST RT05 WORKSH ED
HERMANSKY H, 1996, P ICSLP 96
HERMANSKY H, 1999, P ICASSP 99
HERMANSKY H, 2000, P ICASSP 2000
Hermansky H., 2005, P INT 2005
HERMANSKY H, 2003, P ASRU 2003
Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2
HERMANSKY H, 1997, P EUR 97
Hermansky Hynek, 1994, IEEE T SPEECH AUDIO, V2
HOUTGAST T, 1989, J ACOUST SOC AM, V88
Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416
MILLER, 2002, J NEUROPHYSIOL, V87
MISRA H, 2003, P ICASSP 2003
MOORE D, 2006, P MLMI 2006
MORGAN N, 2004, P ICASSP 2004
PLAHL C, 2009, P 10 ANN C INT SPEEC
RUMELHART DE, 1986, NATURE, V323, P533, DOI 10.1038/323533a0
SIVADAS S, 2002, P ICASSP 2002
VALENTE F, 2008, P ICASSP 2008
VALENTE F, 2009, P 10 ANN C INT SPEEC
VALENTE F, 2007, INT 2007
VALENTE F, 2007, P ICASSP 2007
ZHAO S, 2009, P INT 2009
NR 31
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2010
VL 52
IS 10
BP 790
EP 800
DI 10.1016/j.specom.2010.05.007
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 638ST
UT WOS:000280917000002
ER
PT J
AU Riedhammer, K
Favre, B
Hakkani-Tur, D
AF Riedhammer, Korbinian
Favre, Benoit
Hakkani-Tuer, Dilek
TI Long story short - Global unsupervised models for keyphrase based
meeting summarization
SO SPEECH COMMUNICATION
LA English
DT Article
DE Multi-party meetings speech; Summarization; Keyphrases; Global
optimization
ID TEXT
AB We analyze and compare two different methods for unsupervised extractive spontaneous speech summarization in the meeting domain. Based on utterance comparison, we introduce an optimal formulation for the widely used greedy maximum marginal relevance (MMR) algorithm. Following the idea that information is spread over the utterances in form of concepts, we describe a system which finds an optimal selection of utterances covering as many unique important concepts as possible. Both optimization problems are formulated as an integer linear program (ILP) and solved using public domain software. We analyze and discuss the performance of both approaches using various evaluation setups on two well studied meeting corpora. We conclude on the benefits and drawbacks of the presented models and give an outlook on future aspects to improve extractive meeting summarization. (C) 2010 Elsevier B.V. All rights reserved.
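For reference, the greedy maximum marginal relevance baseline that the paper reformulates as an integer linear program can be sketched in a few lines; the similarity inputs and length budget below are placeholders, and the ILP formulations themselves (the paper's actual contribution) are not shown.

```python
import numpy as np

def greedy_mmr(similarity_to_doc, pairwise_sim, lengths, budget, lam=0.7):
    """Greedy maximum marginal relevance selection of utterances.

    similarity_to_doc[i] : relevance of utterance i to the whole meeting
    pairwise_sim[i, j]   : similarity between utterances i and j
    lengths[i]           : length of utterance i (e.g. in words)
    budget               : maximum total summary length
    """
    n = len(lengths)
    selected, total = [], 0
    while True:
        best, best_score = None, -np.inf
        for i in range(n):
            if i in selected or total + lengths[i] > budget:
                continue
            redundancy = max((pairwise_sim[i, j] for j in selected), default=0.0)
            score = lam * similarity_to_doc[i] - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        if best is None:
            return selected
        selected.append(best)
        total += lengths[best]
```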
C1 [Riedhammer, Korbinian] Univ Erlangen Nurnberg, Lehrstuhl Informat 5, D-91058 Erlangen, Germany.
[Riedhammer, Korbinian; Favre, Benoit; Hakkani-Tuer, Dilek] Int Comp Sci Inst, Berkeley, CA 94704 USA.
[Favre, Benoit] Univ Maine, Lab Informat, F-72085 Le Mans 9, France.
RP Riedhammer, K (reprint author), Univ Erlangen Nurnberg, Lehrstuhl Informat 5, Martensstr 3, D-91058 Erlangen, Germany.
EM korbinian.riedhammer@informatik.uni-erlangen.de;
benoit.favre@lium.univ-lemans.fr; dilek@icsi.berkeley.edu
RI Riedhammer, Korbinian/A-2293-2012
OI Riedhammer, Korbinian/0000-0003-3582-2154
CR Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P2, DOI DOI 10.1023/A:1009715923555
Carbonell J., 1998, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, DOI 10.1145/290941.291025
Christensen H, 2004, LECT NOTES COMPUT SC, V2997, P223
Filatova E., 2004, P ACL WORKSH SUMM
Furui S, 2004, IEEE T SPEECH AUDI P, V12, P401, DOI 10.1109/TSA.2004.828699
GARG N, 2009, P ANN C INT SPEECH C, P1499
GILLICK D, 2009, P IEEE INT C AC SPEE, P4769
Gillick D., 2009, P ACL HLT WORKSH INT, P10, DOI 10.3115/1611638.1611640
GILLICK D, 2008, P TEXT AN C WORKSH, P227
Ha L.Q., 2002, P 19 INT C COMP LING, P1
HORI C, 2000, P INT C SPOK LANG PR, P326
HORI C, 2002, P INT C AC SPEECH SI, P9
HOVY E, 2006, P INT C LANG RES EV
Huang Z., 2007, P EMNLP CONLL, P1093
INOUE A, 2004, P ICASSP, P599
JANIN A, 2003, P ICASSP, P364
Lin C., 2004, P WORKSH TEXT SUMM B, P25
LIN H, 2009, P IEEE WORKSH SPEECH, P381
Liu F., 2009, P HLT NAACL, P620, DOI 10.3115/1620754.1620845
Liu F., 2009, P ACL IJCNLP, P261, DOI 10.3115/1667583.1667664
Liu F., 2008, P ACL, P201, DOI 10.3115/1557690.1557747
Liu FH, 2008, LECT NOTES COMPUT SC, V5299, P181
LIU Y, 2008, P ICASSP, P5009
Maskey S., 2005, P EUR C SPEECH COMM, P621
McCowan I., 2005, P MEAS BEH
McDonald R, 2007, LECT NOTES COMPUT SC, V4425, P557
Mieskes M, 2007, Proceedings of the IASTED International Conference on Artificial Intelligence and Applications, P627
MROZINSKI J, 2005, P IEEE INT C AC SPEE, P981
MURRAY G, 2007, P ACM WORKSH MACH LE, P156
Murray G., 2005, P ACL 2005 WORKSH IN, P33
Murray G., 2005, P INT 2005 LISB PORT, P593
Murray G, 2008, P INT WORKSH MACH LE, P349
Murray G., 2006, P HUM LANG TECHN C N, P367, DOI 10.3115/1220835.1220882
NENKOVA A, 2004, P JOINT ANN M HLT NA
Penn G., 2008, P ACL, P470
RENALS S, 2007, P IEEE WORKSH SPEECH
RIEDHAMMER K, 2008, P INT, P2434
RIEDHAMMER K, 2008, P IEEE WORKSH SPOK L, P153
Santorini Beatrice, 1990, MSCIS9047 U PENNS DE
Schapire RE, 2000, MACH LEARN, V39, P135, DOI 10.1023/A:1007649029923
Takamura H., 2009, P 12 C EUR CHAPT ACL, P781, DOI 10.3115/1609067.1609154
Thede S. M., 1999, P 37 ANN M ACL, P175, DOI 10.3115/1034678.1034712
Xie S., 2009, P ANN C INT SPEECH C, P1503
XIE S, 2009, P IEEE WORKSH SPEECH
Zechner K, 2002, COMPUT LINGUIST, V28, P447, DOI 10.1162/089120102762671945
Zhang J., 2007, P NAACL HLT COMP VOL, P213
Zhu Q., 2005, P INT, P2141
ZHU X, 2006, IEEE INT C MULT EXP, P793
Zhu XR, 2009, PROCEEDINGS OF THE 2009 WRI GLOBAL CONGRESS ON INTELLIGENT SYSTEMS, VOL III, P549, DOI 10.1109/GCIS.2009.201
NR 49
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2010
VL 52
IS 10
BP 801
EP 815
DI 10.1016/j.specom.2010.06.002
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 638ST
UT WOS:000280917000003
ER
PT J
AU Engelbrecht, KP
Moller, S
AF Engelbrecht, Klaus-Peter
Moeller, Sebastian
TI Sequential classifiers for the prediction of user judgments about spoken
dialog systems
SO SPEECH COMMUNICATION
LA English
DT Article
DE Prediction model; Evaluation; PARADISE; Spoken dialog system; Usability
ID QUALITY
AB So far, predictions of user quality judgments in response to spoken dialog systems have been achieved on the basis of interaction parameters describing the dialog, e.g. in the PARADISE framework. These parameters do not take into account the temporal position of events happening in the dialog. It seems promising to apply sequence classification algorithms to the raw annotations of the data, instead of interaction parameters describing the overall dialog.
As dialogs can be of very different length, Hidden Markov Models (HMM) and Markov Chains (MC) are handy, because they describe the likelihood of traversing to a state given only the previous state and the transition probability, thus they can be trained and applied to sequences of different lengths.
This paper analyzes the feasibility of predicting user judgments with HMMs and MCs. In order to test the models, we acquire data with different types of users, forcing users to do as similar interactions as possible, and asking for user judgments after each turn. This allows comparing predicted distributions of judgments to the distributions measured empirically. We also apply the models to less rich corpora and compare them with results from Linear Regression models as used in the PARADISE framework. (C) 2010 Elsevier B.V. All rights reserved.
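A minimal sketch of how a Markov chain can serve as a sequence classifier of the kind described: transition probabilities are estimated per judgment class, and a new dialog of arbitrary length is assigned to the class whose chain gives it the highest likelihood. The add-one smoothing and integer state coding are assumptions, not details taken from the paper.

```python
import numpy as np

def train_markov_chain(sequences, n_states):
    """Estimate first-order transition probabilities from annotated dialog
    state sequences (each sequence is a list of integer state ids)."""
    counts = np.ones((n_states, n_states))          # add-one smoothing
    for seq in sequences:
        for a, b in zip(seq[:-1], seq[1:]):
            counts[a, b] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def sequence_loglik(seq, trans):
    """Log-likelihood of one dialog under a trained chain; train one chain
    per judgment class and predict the class with the highest value."""
    return float(np.sum([np.log(trans[a, b]) for a, b in zip(seq[:-1], seq[1:])]))
```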
C1 [Engelbrecht, Klaus-Peter; Moeller, Sebastian] TU Berlin, Qual & Usabil Lab, Deutsch Telekom Labs, D-10587 Berlin, Germany.
RP Engelbrecht, KP (reprint author), TU Berlin, Qual & Usabil Lab, Deutsch Telekom Labs, Ernst Reuter Pl 7, D-10587 Berlin, Germany.
EM klaus-peter.engelbrecht@telekom.de; sebastian.moeller@telekom.de
FU Deutsche Telekom Laboratories, TU Berlin
FX The work presented in this paper was supported by a number of students
and colleagues from Deutsche Telekom Laboratories, TU Berlin. Felix
Hartard and Florian Godde helped collecting the data described in the
first part of the paper. Babette Wiezorek made a large part of the
annotations of features. Hamed Ketabdar gave advice for the
implementation of the HMM-related algorithms, and Benjamin Weiss helped
with comments to a first draft of the paper. We are very grateful for
this support. In addition, we would like to thank the anonymous
reviewers for their kind reviews and valuable comments, which helped to
improve this paper.
CR Ai H., 2008, P 9 SIGDIAL WORKSH D, P164, DOI 10.3115/1622064.1622097
Bortz J., 2005, STAT HUMAN SOZIALWIS, V6
Cuayahuitl H., 2005, P IEEE WORKSH AUT SP, P290
DORNER D, 2002, NEURONALE THEORIE HA
Eckert W., 1997, P IEEE WORKSH AUT SP
Engelbrecht KP, 2009, SPEECH COMMUN, V51, P1234, DOI 10.1016/j.specom.2009.06.007
ENGELBRECHT KP, 2009, P 1 INT WORKSH SPOK
ENGELBRECHT KP, 2007, P 8 SIGDIAL WORKSH D, P291
Engelbrecht K.-P., 2008, P ESSV, P86
Engelbrecht K.-P., 2009, P SIGDIAL WORKSH DIS, P170, DOI 10.3115/1708376.1708402
EVANINI K, 2008, P SPOK LANG TECHN WO, P129
Fraser N., 1997, HDB STANDARDS RESOUR, P564
Guski R., 1999, NOISE HEALTH, V1, P45
HASTIE HW, 2002, P 3 INT C LANG RES E, V2, P641
HONE KS, 2001, P EUROSPEECH AALB DE, P2083
*ITU T, 2003, 851 ITU T
Jekosch U., 2005, ASSESSMENT EVALUATIO
Levenshtein V., 1966, SOV PHYS DOKL, V10, P707
Litman D., 2007, P 8 SIGDIAL WORKSH D, P124
McTear M., 2004, SPOKEN DIALOGUE TECH
Moller S, 2008, SPEECH COMMUN, V50, P730, DOI 10.1016/j.specom.2008.03.001
MOLLER S, 2005, P 4 EUR C AC FOR AC, P2681
Moller S., 2005, QUALITY TELEPHONE BA
MOLLER S, 2006, P 9 INT C SPOK LANG, P1786
Nielsen J., 1993, USABILITY ENG
Okun MA, 1990, EDUC PSYCHOL REV, V2, P59, DOI 10.1007/BF01323529
Raake A., 2006, SPEECH QUALITY VOIP
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
Rieser V., 2008, P 6 INT C LANG RES E, P2356
RUSSEL S, 2004, MODERNER ANSATZ
SCHLEICHER R, 2008, THESIS U COLOGNE
Schmitt A, 2008, LECT NOTES ARTIF INT, V5078, P72, DOI 10.1007/978-3-540-69369-7_9
SKOWRONEK J, 2002, THESIS RUHR U BOCHUM
WALKER M, 2000, P 2 INT C LANG RES E, V1, P189
Walker M. A., 2000, NAT LANG ENG, V6, P363, DOI 10.1017/S1351324900002503
Walker Marilyn A, 1997, P 35 ANN M ASS COMP, P271
Weiss B, 2009, ACTA ACUST UNITED AC, V95, P1140, DOI 10.3813/AAA.918245
Witten I.H., 2005, DATA MINING PRACTICA
NR 38
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2010
VL 52
IS 10
BP 816
EP 833
DI 10.1016/j.specom.2010.06.004
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 638ST
UT WOS:000280917000004
ER
PT J
AU Ling, ZH
Richmond, K
Yamagishi, J
AF Ling, Zhen-Hua
Richmond, Korin
Yamagishi, Junichi
TI An Analysis of HMM-based prediction of articulatory movements
SO SPEECH COMMUNICATION
LA English
DT Article
DE Hidden Markov model; Articulatory features; Parameter generation
ID SPEECH PRODUCTION; MODEL; ACOUSTICS
AB This paper presents an investigation into predicting the movement of a speaker's mouth from text input using hidden Markov models (HMM). A corpus of human articulatory movements, recorded by electromagnetic articulography (EMA), is used to train HMMs. To predict articulatory movements for input text, a suitable model sequence is selected and a maximum-likelihood parameter generation (MLPG) algorithm is used to generate output articulatory trajectories. Unified acoustic-articulatory HMMs are introduced to integrate acoustic features when an acoustic signal is also provided with the input text. Several aspects of this method are analyzed in this paper, including the effectiveness of context-dependent modeling, the role of supplementary acoustic input, and the appropriateness of certain model structures for the unified acoustic-articulatory models. When text is the sole input, we find that fully context-dependent models significantly outperform monophone and quinphone models, achieving an average root mean square (RMS) error of 1.945 mm and an average correlation coefficient of 0.600. When both text and acoustic features are given as input to the system, the difference between the performance of quinphone models and fully context-dependent models is no longer significant. The best performance overall is achieved using unified acoustic-articulatory quinphone HMMs with separate clustering of acoustic and articulatory model parameters, a synchronous-state sequence, and a dependent-feature model structure, with an RMS error of 0.900 mm and a correlation coefficient of 0.855 on average. Finally, we also apply the same quinphone HMMs to the acoustic-articulatory, or inversion, mapping problem, where only acoustic input is available. An average root mean square (RMS) error of 1.076 mm and an average correlation coefficient of 0.812 are achieved. Taken together, our results demonstrate how text and acoustic inputs both contribute to the prediction of articulatory movements in the method used. (C) 2010 Elsevier B.V. All rights reserved.
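The RMS error and correlation figures quoted above can be computed per articulator channel and then averaged; a short sketch of that evaluation (averaging over channels is an assumption about the exact protocol, and the MLPG generation step itself is not shown):

```python
import numpy as np

def trajectory_scores(predicted, measured):
    """Per-channel RMS error (in the data's unit, e.g. mm) and Pearson
    correlation between predicted and measured articulator trajectories
    (arrays of shape frames x channels), averaged over channels."""
    err = predicted - measured
    rms = np.sqrt(np.mean(err ** 2, axis=0))
    corr = np.array([np.corrcoef(predicted[:, k], measured[:, k])[0, 1]
                     for k in range(predicted.shape[1])])
    return rms.mean(), corr.mean()
```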
C1 [Ling, Zhen-Hua] Univ Sci & Technol China, iFLYTEK Speech Lab, Hefei 230027, Anhui, Peoples R China.
[Richmond, Korin; Yamagishi, Junichi] Univ Edinburgh, CSTR, Edinburgh EH8 9LW, Midlothian, Scotland.
RP Ling, ZH (reprint author), Univ Sci & Technol China, iFLYTEK Speech Lab, Hefei 230027, Anhui, Peoples R China.
EM zhling@ustc.edu; korin@cstr.ed.ac.uk; jyamagis@inf.ed.ac.uk
FU Marie Curie Early Stage Training (EST) Network; National Nature Science
Foundation of China [60905010]; Engineering and Physical Sciences
Research Council (EPSRC); EC
FX The authors thank Phil Hoole of Ludwig-Maximilian University, Munich for
his great effort in helping record the EMA database. This work was
supported by the Marie Curie Early Stage Training (EST) Network,
"Edinburgh Speech Science and Technology (EdSST)". Zhen-Hua Ling is
funded by the National Nature Science Foundation of China (Grant No.
60905010). Korin Richmond is funded by the Engineering and Physical
Sciences Research Council (EPSRC). Junichi Yamagishi is funded by EPSRC
and an EC FP7 collaborative project called the EMIME project.
CR BAER T, 1987, MAGNETIC RESONANCE I, V5, P7
Fitt S., 1999, EUROSPEECH, V2, P823
Hiroya S, 2006, SPEECH COMMUN, V48, P1677, DOI 10.1016/j.specom.2006.08.002
Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
KIRITANI S, 1986, SPEECH COMMUN, V5, P119, DOI 10.1016/0167-6393(86)90003-8
Ling ZH, 2009, IEEE T AUDIO SPEECH, V17, P1171, DOI 10.1109/TASL.2009.2014796
PAPCUN G, 1992, J ACOUST SOC AM, V92, P688, DOI 10.1121/1.403994
RICHMOND K, 2007, NOLISP, P263
RICHMOND K, 2009, P INT BRIGHT UK
SCHONLE PW, 1987, BRAIN LANG, V31, P26, DOI 10.1016/0093-934X(87)90058-7
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
Blackburn CS, 2000, J ACOUST SOC AM, V107, P1659, DOI 10.1121/1.428450
Tamura M, 1999, EUROSPEECH, P959
Taylor P. A., 1998, 3 ESCA WORKSH SPEECH, P147
Toda T, 2008, SPEECH COMMUN, V50, P215, DOI 10.1016/j.specom.2007.09.001
Tokuda K, 1999, INT CONF ACOUST SPEE, P229
Tokuda K, 2000, ICASSP, V3, P1315
Tokuda K., 2004, TEXT SPEECH SYNTHESI
Yoshimura T, 1998, P ICSLP, P29
Young S., 2002, HTK BOOK HTK VERSION
Zen H, 2007, 6 ISCA WORKSH SPEECH, P294
Zhang L, 2008, IEEE SIGNAL PROC LET, V15, P245, DOI 10.1109/LSP.2008.917004
NR 23
TC 8
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2010
VL 52
IS 10
BP 834
EP 846
DI 10.1016/j.specom.2010.06.006
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 638ST
UT WOS:000280917000005
ER
PT J
AU Pontes, JDA
Furui, S
AF Pontes, Josafa de Jesus Aguiar
Furui, Sadaoki
TI Predicting the phonetic realizations of word-final consonants in context
- A challenge for French grapheme-to-phoneme converters
SO SPEECH COMMUNICATION
LA English
DT Article
DE Decision trees; Liaison in French; Post-lexical rules; Speech synthesis;
Grapheme-to-phoneme conversions
AB One of the main problems in developing a text-to-speech (TTS) synthesizer for French lies in grapheme-to-phoneme conversion. Automatic converters still produce too many errors in their phoneme sequences to be helpful for people learning French as a foreign language. The phonetic realizations of word-final consonants (WFCs) in general, and liaison in particular (les haricots vs. les escargots), are among the main causes of such conversion errors. Rule-based methods have been used to solve these issues. Yet, the number of rules and their complex interaction make maintenance a problem. In order to alleviate such problems, we propose here an approach that, starting from a database (compiled from cases documented in the literature), allows us to build C4.5 decision trees and, subsequently, to automate the generation of the required phonetic rules. We investigated the relative efficiency of this method both for classification of contexts and word-final consonant phoneme prediction. A prototype based on this approach reduced Obligatory context classification errors by 52%. Our method has the advantage of sparing us the trouble of coding rules manually, since they are already contained in the training database. Our results suggest that predicting the realization of WFCs as well as context classification is still a challenge for the development of a TTS application for teaching French pronunciation. (C) 2010 Elsevier B.V. All rights reserved.
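As a rough illustration of the tree-induction step, the sketch below trains an entropy-based decision tree on hand-made liaison context features; scikit-learn's CART with an entropy criterion is used as a stand-in for C4.5, and every feature name and training case shown is invented for the example.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction import DictVectorizer

# Hypothetical feature dictionaries for word-final consonant contexts; the
# real system derives its features from the cases documented in the
# literature (word category, onset of the following word, syntactic link, ...).
contexts = [
    {"w1": "les", "w2_onset": "vowel", "link": "det+noun"},
    {"w1": "les", "w2_onset": "h_aspire", "link": "det+noun"},
]
labels = ["obligatory", "forbidden"]          # liaison realisation classes

vec = DictVectorizer(sparse=False)
X = vec.fit_transform(contexts)
# criterion="entropy" approximates C4.5's information-gain splitting;
# scikit-learn implements CART, so this is only a stand-in for C4.5.
tree = DecisionTreeClassifier(criterion="entropy").fit(X, labels)
print(tree.predict(vec.transform([{"w1": "les", "w2_onset": "vowel",
                                    "link": "det+noun"}])))
```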
C1 [Pontes, Josafa de Jesus Aguiar; Furui, Sadaoki] Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan.
RP Pontes, JDA (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1-W8-77 Ookayama, Tokyo 1528552, Japan.
EM josafa@furui.cs.titech.ac.jp; furui@furui.cs.titech.ac.jp
CR BATTYE A, 2000, FRENCH LANGUAGE TODA, P109
Black A., 1998, P 3 ESCA WORKSH SPEE, P77
Black A. W., 1999, FESTIVAL SPEECH SYNT
Boe L.-J., 1992, ZUT DICT PHONETIQUE
CORREARD MH, 2003, OXFORD HACHETTE FREN
COTE MH, 2005, PHONOLOGIC FRANCAISE, V58
DELATTRE PIERRE, 1951, PRINCIPES PHONETIQUE
Durand J, 2008, J FR LANG STUD, V18, P33, DOI 10.1017/S0959269507003158
Encreve P., 1988, LIAISON AVEC SANS EN
Fouche P., 1959, TRAITE PRONONCIATION
GOLDMAN JP, 1999, ACT TALN99 CARG CORS, P165
GREVISSE M, 1997, USAGE GRAMMAIRE FRAN, P45
MAREUIL PN, 2003, LIAISONS FRENCH CORP
NEW B, 2003, LEXIQUE, V2
*OFF QUEB LANG FRA, 2002, BDL BANQ DEP LING
Pierret J-M., 1994, PHONETIQUE HIST FRAN, P98
Quinlan J. R., 1993, C4 5 PROGRAMS MACHIN
ROBERT P, 2005, GRAND ROBERT LANGUE
RONKOHAVI R, 1998, LEARNING KNOWLEDGE D, V30, P1998
SANDERS C, 1993, FRENCH TODAY LANGUAG, P263
SPROAT R, 1998, MULTILINGUAL TEXT TO, P56
STAMMERJOHANN H, 1976, NEUEREN SPRACHEN, V75, P489
*SYN, 2006, BIBL CORD ET LEMM, V11
Tzoukermann E., 1998, P ICSLP98 SYDN AUSTR, V5, P2039
*U SO DENM, 2008, CORP
VERLUYTEN SP, 1987, LING ANTVERP NEW SER, V21, P175
Yvon F, 1998, COMPUT SPEECH LANG, V12, P393, DOI 10.1006/csla.1998.0104
NR 27
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2010
VL 52
IS 10
BP 847
EP 862
DI 10.1016/j.specom.2010.06.007
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 638ST
UT WOS:000280917000006
ER
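The Pontes and Furui record above describes building C4.5 decision trees from a database of documented liaison contexts to predict whether word-final consonants are realized. As a rough, hypothetical illustration (not the authors' implementation), the sketch below trains a CART-style tree from scikit-learn on made-up categorical context features; the feature names, example rows and labels are invented for illustration only.

# Illustrative sketch: predicting whether a word-final consonant is realized
# (e.g., French liaison) from categorical context features with a decision tree.
# scikit-learn's tree is CART-based, not C4.5, so it is only a stand-in for the
# C4.5 trees described in the paper. All data below is hypothetical.
from sklearn.tree import DecisionTreeClassifier
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

# Each row: (word-final consonant, following context, syntactic link)
X = [
    ["s", "vowel", "det+noun"],      # "les escargots" -> liaison expected
    ["s", "h_aspire", "det+noun"],   # "les haricots"  -> no liaison
    ["t", "vowel", "verb+pronoun"],
    ["t", "consonant", "verb+noun"],
]
y = ["realized", "silent", "realized", "silent"]  # phonetic realization label

model = make_pipeline(OneHotEncoder(handle_unknown="ignore"),
                      DecisionTreeClassifier(max_depth=3))
model.fit(X, y)

print(model.predict([["s", "vowel", "det+noun"]]))  # -> ['realized']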
PT J
AU Chakroborty, S
Saha, G
AF Chakroborty, Sandipan
Saha, Goutam
TI Feature selection using singular value decomposition and QR
factorization with column pivoting for text-independent speaker
identification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker identification; MFCC; LFCC; GMFCC; GMM; Divergence; Subband;
Correlation; SVD; QRcp; F-Ratio
ID RECOGNITION SYSTEM; SPEECH RECOGNITION; MUTUAL INFORMATION;
VERIFICATION; MODELS
AB Feature selection is one of the important tasks in applications such as Speaker Identification (SI) and other pattern recognition problems. When multiple features are extracted from the same frame of speech, the resulting feature vector is likely to contain redundant features. Redundant features confuse the speaker model in multidimensional space, degrading system performance. Careful selection of potential features can remove this redundancy while helping to achieve a higher rate of accuracy at lower computational cost. Although feature selection is difficult without an exhaustive search, this paper proposes an alternative and straightforward technique based on Singular Value Decomposition (SVD) followed by QR Decomposition with Column Pivoting (QRcp). The idea is to capture the most salient part of the information in the speakers' data by choosing those features that explain different dimensions while showing minimal similarity (or maximum acoustic variability) among them in an orthogonal sense. Performance after feature selection using the proposed criterion has been compared with that obtained using Mel-frequency Cepstral Coefficients (MFCC), Linear Frequency Cepstral Coefficients (LFCC) and a new feature proposed in this paper based on Gaussian-shaped filters on the mel scale. It is shown that the proposed SVD-QRcp based feature selection outperforms the F-Ratio based method and that the proposed feature extraction tool is superior to baseline MFCC and LFCC. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Chakroborty, Sandipan; Saha, Goutam] Indian Inst Technol Kharagpur, Dept Elect & Elect Commun Engn, Kharagpur 721302, W Bengal, India.
RP Chakroborty, S (reprint author), Indian Inst Technol Kharagpur, Dept Elect & Elect Commun Engn, Kharagpur 721302, W Bengal, India.
EM mail2sandi@gmail.com; gsaha@ece.iitkgp.ernet.in
CR ARI S, 2007, INT J BIOMED SCI, V2
ARI S, 2008, J APPL SOFT COMPUT, DOI DOI 10.1016/J.ASOC.2008.04.010
ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155
BATTITI R, 1994, IEEE T NEURAL NETWOR, V5, P537, DOI 10.1109/72.298224
Bellman R. E., 1957, DYNAMIC PROGRAMMING
Besacier L, 2000, SIGNAL PROCESS, V80, P1245, DOI 10.1016/S0165-1684(00)00033-5
Burget L, 2007, IEEE T AUDIO SPEECH, V15, P1979, DOI 10.1109/TASL.2007.902499
Campbell J., 1995, P INT C AC SPEECH SI, P341
CHAKROBORTY S, 2007, INT J SIGNAL PROCESS, V5, P11
CHANG CY, 1973, IEEE T SYST MAN CYB, VSMC3, P166
Charlet D, 1997, PATTERN RECOGN LETT, V18, P873, DOI 10.1016/S0167-8655(97)00064-0
CHEUNG RS, 1978, IEEE T ACOUST SPEECH, V26, P397, DOI 10.1109/TASSP.1978.1163142
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
DHARANIPRAGDA S, 2007, IEEE T AUDIO SPEECH, V15
Ellis D. P. W., 2000, P INT C SPOK LANG PR, P79
Eriksson T, 2005, IEEE SIGNAL PROC LET, V12, P500, DOI 10.1109/LSP.2005.849495
Errity A, 2007, Proceedings of the 2007 15th International Conference on Digital Signal Processing, P587
Gajic B, 2006, IEEE T AUDIO SPEECH, V14, P600, DOI 10.1109/TSA.2005.855834
Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034
Ganchev T, 2006, P 5 INT S COMM SYST, P314
GJELSVIK E, 1999, P 5 INT S SIGN PROC, P637
Golub G. H., 1996, MATRIX COMPUTATIONS, P48
Haydar A, 1998, ELECTRON LETT, V34, P1457, DOI 10.1049/el:19981069
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Hua YB, 1998, IEEE SIGNAL PROC LET, V5, P141
Kanjilal PP, 1999, IEEE T SYST MAN CY B, V29, P1, DOI 10.1109/3477.740161
KANJILAL PP, 1995, ADAPTIVE PREDICTION, P56
Kumar N, 1998, SPEECH COMMUN, V26, P283, DOI 10.1016/S0167-6393(98)00061-2
Kumar N, 1997, THESIS J HOPKINS U B
Kwak N, 2002, IEEE T PATTERN ANAL, V24, P1667, DOI 10.1109/TPAMI.2002.1114861
MELIN H, 1996, P COST 250 WORKSH AP, P59
Milner B, 2007, IEEE T AUDIO SPEECH, V15, P24, DOI 10.1109/TASL.2006.876880
Murty KR, 2006, IEEE SIGNAL PROC LET, V13, P52, DOI 10.1109/LSP.2005.860538
NELSON GD, 1968, IEEE T SYST SCI CYB, VSSC4, P145, DOI 10.1109/TSSC.1968.300141
Nicholson S., 1997, P EUR SPEECH C SPEEC, P413
Paliwal K. K., 1992, Digital Signal Processing, V2, DOI 10.1016/1051-2004(92)90005-J
PANDIT M, 1998, P ICASSP, V2, P769, DOI 10.1109/ICASSP.1998.675378
Papoulis A., 2002, PROBABILITY RANDOM V, P72
PEACOCKE RD, 1990, COMPUTER, V23, P26, DOI 10.1109/2.56868
Petrovska D., 1998, POLYCOST TELEPHONE S, P211
PRASAD KS, 2007, P INT C SIGN PROC CO, P20
PRUZANSKY S, 1964, J ACOUST SOC AM, V36, P2041, DOI 10.1121/1.1919320
REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379
SAHA G, 2005, P IEEE ANN C IND 200, P70
SAMBUR MR, 1975, IEEE T ACOUST SPEECH, VAS23, P176, DOI 10.1109/TASSP.1975.1162664
SILVERMAN BW, 1986, MONOGRAPHS STAT APPL, P75
Wolf JJ, 1971, J ACOUST SOC AM, V51, P2044
ZHENG F, 2000, P INT C SPOK LANG PR, P389
Zilca RD, 2006, IEEE T AUDIO SPEECH, V14, P467, DOI [10.1109/TSA.2005.857809, 10.1109/FSA.2005.857809]
NR 49
TC 7
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2010
VL 52
IS 9
BP 693
EP 709
DI 10.1016/j.specom.2010.04.002
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 631BD
UT WOS:000280320600001
ER
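The Chakroborty and Saha abstract above outlines feature selection via SVD followed by QR decomposition with column pivoting (QRcp). A minimal numpy/scipy sketch of that generic SVD-QRcp subset-selection step is given below; the rank k, the synthetic data and the centering step are assumptions for illustration, not the paper's settings.

# Minimal sketch of SVD followed by QR with column pivoting for feature
# subset selection, in the spirit of the record above.
import numpy as np
from scipy.linalg import qr

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 20))                    # 500 frames x 20 features
X[:, 5] = X[:, 3] + 0.01 * rng.standard_normal(500)   # inject one redundant feature

k = 10                                                # number of features to keep
# SVD: rows of Vt span the dominant directions in feature space
_, _, Vt = np.linalg.svd(X - X.mean(axis=0), full_matrices=False)

# QR with column pivoting on the top-k right singular vectors: the pivot order
# ranks original feature columns by how independently they explain those directions
_, _, piv = qr(Vt[:k, :], pivoting=True)
selected = np.sort(piv[:k])
print("selected feature indices:", selected)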
PT J
AU Bao, CC
Xu, H
Xia, BY
Liu, ZY
Qiu, JW
AF Bao, Changchun
Xu, Hao
Xia, Bingyin
Liu, Zhangyu
Qiu, Jianwei
TI An efficient transcoding algorithm between AMR-NB and G.729ab
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech coding; Speech transcoding; CELP; G.729ab; AMR-NB
AB In this paper, an efficient transcoding algorithm between AMR-NB and G.729ab is proposed. The proposed algorithm elaborates on solutions for the case in which a discontinuous transmission (DTX) function is adopted between the source and destination coding systems. When neither, either or both of the source and destination coding systems adopt the DTX function, the proposed algorithm can carry out the transcoding operation between the two coding systems efficiently. When neither of the two coding systems adopts the DTX function, transcoding methods in different domains are proposed. A scalable distortion measure method in the parameter domain, specifically related to codebook gain conversion, is proposed to preserve the amplitude of the synthesized speech. The effect of the synthesized-speech amplitude on subjective speech quality is cancelled out by the proposed method, and the computational complexity is reduced as well. When either or both of the two coding systems adopt the DTX function, transcoding methods between speech frames and non-speech frames are proposed, depending on the type of the destination frame. When a frame is declared as an erased frame, a linear prediction-based pitch recovery and transcoding method is used. By employing the proposed algorithm in transcoders, complexity is reduced by about 26-82% and quality is also improved compared to the conventional DTE method. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Bao, Changchun; Xu, Hao; Xia, Bingyin; Liu, Zhangyu; Qiu, Jianwei] Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China.
RP Bao, CC (reprint author), Beijing Univ Technol, Sch Elect Informat & Control Engn, Speech & Audio Signal Proc Lab, Beijing 100124, Peoples R China.
EM baochch@bjut.edu.cn
FU Huawei Technologies Co., Ltd.; Funding Project for Academic Human
Resources Development in Institutions of Higher Learning Under the
Jurisdiction of Beijing Municipality
FX This work was supported by Huawei Technologies Co., Ltd. and Funding
Project for Academic Human Resources Development in Institutions of
Higher Learning Under the Jurisdiction of Beijing Municipality.
CR *3GPP, 2007, 26094 3GPP REC TS
*3GPP, 2007, 26092 3GPP TS
*3GPP, 1999, 26090 3GPP REC TS
[Anonymous], 2003, P8621 ITUT
Benyassine A, 1997, IEEE COMMUN MAG, V35, P64, DOI 10.1109/35.620527
CHOI JK, 2004, ICASSP 04, P269
CHRISTOPHE B, 2007, IEEE 9 WORKSH MULT S, P155
GHENANIA M, 2004, P EUR, P1681
*ITU T, 2005, G7291 ITUT
ITU-T, 1996, G729 ITUT
ITU-T (Telecommunication Standardization Sector International Telecommunication Union), 2005, G191 ITUT
Kang HG, 2003, IEEE T MULTIMEDIA, V5, P24, DOI 10.1109/TMM.2003.808823
KANG HG, 2000, IEEE WORKSH SPEECH C, P78
LEE ED, 2007, ICHIT 06, P178
NETO AFC, 1999, IEEE P INT C AC SPEE, P177
OTA Y, 2002, IEEE INT C COMM, P114
TSUCHINAGA Y, 2002, Patent No. 020072104
NR 17
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2010
VL 52
IS 9
BP 710
EP 724
DI 10.1016/j.specom.2010.04.003
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 631BD
UT WOS:000280320600002
ER
PT J
AU Jafari, A
Almasganj, F
AF Jafari, Ayyoob
Almasganj, Farshad
TI Using Laplacian eigenmaps latent variable model and manifold learning to
improve speech recognition accuracy
SO SPEECH COMMUNICATION
LA English
DT Article
DE Laplacian eigenmaps; Latent variable model; Speech recognition
ID NONLINEAR DIMENSIONALITY REDUCTION
AB This paper demonstrates the application of the Laplacian eigenmaps latent variable model (LELVM) to the task of speech recognition. LELVM is a new dimension reduction method that combines the benefits of latent variable models (a multimodal probability density over latent and observed variables, and globally differentiable nonlinear mappings for reconstruction and dimensionality reduction) with those of spectral manifold learning methods (no local optima, the ability to unfold nonlinear manifolds, and excellent practical scaling to latent spaces of high dimension). LELVM is obtained by defining an out-of-sample mapping for Laplacian eigenmaps using a semi-supervised learning procedure. LELVM is simple, non-parametric and computationally inexpensive. In this research, LELVM is used to project MFCC features onto a new subspace that yields more discrimination among phonetic categories. To evaluate the performance of the proposed feature modification system, an HMM-based speech recognition system and the TIMIT speech database are employed. The experiments show an accuracy improvement of about 5% in an isolated phoneme recognition task and imply the superiority of the proposed method over the usual PCA methods. Moreover, the proposed method retains its benefits in noisy environments and does not degrade in such conditions. Crown Copyright (C) 2010 Published by Elsevier B.V. All rights reserved.
C1 [Jafari, Ayyoob; Almasganj, Farshad] Amirkabir Univ Technol, Dept Biomed Engn, Tehran, Iran.
RP Jafari, A (reprint author), Amirkabir Univ Technol, Dept Biomed Engn, Tehran, Iran.
EM ajafari20@aut.ac.ir; almas@aut.ac.ir
CR Belkin M, 2003, NEURAL COMPUT, V15, P1373, DOI 10.1162/089976603321780317
Belkin M, 2002, ADV NEUR IN, V14, P585
CARREIRAPERPINA.MA, 2001, THESIS U SHEELD UK
CARREIRAPERPINA.MA, 2007, PEOPLE TRACKING USIN
Fant G., 1970, ACOUSTIC THEORY SPEE
Ham J., 2005, AISTATS, P120
Jansen A., 2005, GEOMETRIC PERSPECTIV
JIAYAN J, 2006, INT JOINT C NEUR NET
Jolliffe I, 1986, SPRINGER SERIES STAT
LANG C, P ICSP 04
Lawrence N, 2005, J MACH LEARN RES, V6, P1783
MORSE PM, 1953, METHODS THEORETICA 1, P43
ROSENBERG S, 1997, LAPLACIAN REIMMANNIA
Roweis ST, 2000, SCIENCE, V290, P2323, DOI 10.1126/science.290.5500.2323
Tenenbaum JB, 2000, SCIENCE, V290, P2319, DOI 10.1126/science.290.5500.2319
TOGNERI R, 1992, IEE PROC-I, V139, P123
Wand M. P., 1995, KERNEL SMOOTHING
Yang X., 2006, ICML 06, P1065
Zhu X., 2003, ICML, P912
NR 19
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2010
VL 52
IS 9
BP 725
EP 735
DI 10.1016/j.specom.2010.04.005
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 631BD
UT WOS:000280320600003
ER
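The Jafari and Almasganj record above projects MFCC features with the Laplacian eigenmaps latent variable model. The sketch below shows only the plain Laplacian eigenmaps embedding step, via scikit-learn's SpectralEmbedding; the LELVM out-of-sample mapping described in the paper is not implemented here, and the MFCC-like data is synthetic.

# Sketch of the Laplacian eigenmaps step only (via SpectralEmbedding).
import numpy as np
from sklearn.manifold import SpectralEmbedding

rng = np.random.default_rng(1)
mfcc = rng.standard_normal((300, 13))        # 300 frames x 13 MFCCs (synthetic)

embedder = SpectralEmbedding(n_components=3, n_neighbors=10,
                             affinity="nearest_neighbors")
low_dim = embedder.fit_transform(mfcc)       # 300 x 3 latent coordinates
print(low_dim.shape)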
PT J
AU Hines, A
Harte, N
AF Hines, Andrew
Harte, Naomi
TI Speech intelligibility from image processing
SO SPEECH COMMUNICATION
LA English
DT Article
DE Auditory periphery model; Hearing aids; Sensorineural hearing loss;
Structural similarity; MSSIM; Speech intelligibility
ID AUDITORY-NERVE RESPONSES; FINE-STRUCTURE; VOWEL EPSILON; MODEL;
PERCEPTION; PERIPHERY; ENVELOPE; QUALITY; SOUNDS; FIBERS
AB Hearing loss research has traditionally been based on perceptual criteria, speech intelligibility and threshold levels. The development of computational models of the auditory periphery has allowed experimentation via simulation to provide quantitative, repeatable results at a more granular level than would be practical with clinical research on human subjects. The responses of the model used in this study have been previously shown to be consistent with a wide range of physiological data from both normal and impaired ears for stimulus presentation levels spanning the dynamic range of hearing.
The model output can be assessed by examination of the spectro-temporal output visualised as neurograms. The effect of sensorineural hearing loss (SNHL) on phonemic structure was evaluated in this study using two types of neurograms: temporal fine structure (TFS) and average discharge rate or temporal envelope. A new systematic way of assessing phonemic degradation is proposed using the outputs of an auditory nerve model for a range of SNHLs. The mean structured similarity index (MSSIM) is an objective measure originally developed to assess perceptual image quality. The measure is adapted here for use in measuring the phonemic degradation in neurograms derived from impaired auditory nerve outputs. A full evaluation of the choice of parameters for the metric is presented using a large amount of natural human speech.
The metric's boundedness and the results for TFS neurograms indicate that it is superior to the standard point-to-point metrics of relative mean absolute error and relative mean squared error. MSSIM as an indicative score of intelligibility is also promising, with results similar to those of the standard speech intelligibility index metric. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Hines, Andrew; Harte, Naomi] Trinity Coll Dublin, Dept Elect & Elect Engn, Sigmedia Grp, Dublin, Ireland.
RP Hines, A (reprint author), Trinity Coll Dublin, Dept Elect & Elect Engn, Sigmedia Grp, Dublin, Ireland.
EM hinesa@tcd.ie
CR American National Standards Institute (ANSI), 1997, S351997R2007 ANSI
BONDY J, 2004, NIPS 2003 ADV NEURAL, V16, P1409
Bruce IC, 2003, J ACOUST SOC AM, V113, P369, DOI 10.1121/1.1519544
Bruce I.C., 2007, AUD SIGN PROC HEAR I, P73
DARPA UDC, 1990, 111 NIST
DENG L, 1987, J ACOUST SOC AM, V82, P2001, DOI 10.1121/1.395644
Dillon H., 2001, HEARING AIDS
DINATH F, 2008, P 30 INT IEEE ENG ME, P1793
Elhilali M, 2003, SPEECH COMMUN, V41, P331, DOI 10.1016/S0167-6393(02)00134-6
FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407
Heinz MG, 2009, JARO-J ASSOC RES OTO, V10, P407, DOI 10.1007/s10162-009-0169-8
Hines A., 2009, INT 09 BRIGHT ENGL, P1119
Houtgast T., 2001, J ACOUST SOC AM, V110, P529
JERGER J, 1971, ARCHIV OTOLARYNGOL, V93, P573
Kandadai S., 2008, P IEEE INT C AC SPEE, P221
LIBERMAN MC, 1978, J ACOUST SOC AM, V63, P442, DOI 10.1121/1.381736
Lopez-Poveda EA, 2005, INT REV NEUROBIOL, V70, P7, DOI 10.1016/S0074-7742(05)70001-5
Lorenzi C, 2006, P NATL ACAD SCI USA, V103, P18866, DOI 10.1073/pnas.0607364103
ROSEN S, 1992, PHILOS T ROY SOC B, V336, P367, DOI 10.1098/rstb.1992.0070
Sachs MB, 2002, ANN BIOMED ENG, V30, P157, DOI 10.1114/1.1458592
Smith ZM, 2002, NATURE, V416, P87, DOI 10.1038/416087a
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
Steeneken HJM, 2002, SPEECH COMMUN, V38, P399, DOI 10.1016/S0167-6393(02)00011-0
Studebaker GA, 1999, J ACOUST SOC AM, V105, P2431, DOI 10.1121/1.426848
Wang Z, 2004, IEEE T IMAGE PROCESS, V13, P600, DOI 10.1109/TIP.2003.819861
Wang Z., 2005, P IEEE INT C AC SPEE, V2, P573
WIENER FM, 1946, J ACOUST SOC AM, V18, P401, DOI 10.1121/1.1916378
Wong JC, 1998, HEARING RES, V123, P61, DOI 10.1016/S0378-5955(98)00098-7
Xu L, 2003, J ACOUST SOC AM, V114, P3024, DOI 10.1121/1.1623786
Zhang XD, 2001, J ACOUST SOC AM, V109, P648, DOI 10.1121/1.1336503
Zilany M. S. A., 2007, THESIS MCMASTER U HA
Zilany MSA, 2007, J ACOUST SOC AM, V122, P402, DOI 10.1121/1.2735117
Zilany MSA, 2006, J ACOUST SOC AM, V120, P1446, DOI 10.1121/1.2225512
NR 33
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2010
VL 52
IS 9
BP 736
EP 752
DI 10.1016/j.specom.2010.04.006
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 631BD
UT WOS:000280320600004
ER
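The Hines and Harte record above adapts the mean structured similarity index (MSSIM) to compare neurograms derived from an auditory nerve model. Below is a minimal sketch using scikit-image's SSIM on two synthetic 2D arrays standing in for reference and degraded neurograms; the window size and data are placeholders, not the paper's tuned parameters.

# Sketch: comparing two neurogram-like 2D arrays with the structural similarity
# index as implemented in scikit-image. The arrays are synthetic stand-ins.
import numpy as np
from skimage.metrics import structural_similarity as ssim

rng = np.random.default_rng(2)
reference = rng.random((64, 200))                     # CF channels x time bins
degraded = reference + 0.2 * rng.random((64, 200))    # simulated degradation

score = ssim(reference, degraded,
             data_range=degraded.max() - degraded.min(),
             win_size=7)
print(f"MSSIM-style score: {score:.3f}")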
PT J
AU Nosratighods, M
Ambikairajah, E
Epps, J
Carey, MJ
AF Nosratighods, Mohaddeseh
Ambikairajah, Eliathamby
Epps, Julien
Carey, Michael John
TI A segment selection technique for speaker verification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker verification; Segment selection; Null hypothesis
ID RECOGNITION; NORMALIZATION; GMM
AB The performance of speaker verification systems degrades considerably when the test segments are utterances of very short duration. This might be due either to variations in score matching arising from the unobserved speech sounds of short utterances, or to the fact that the shorter the utterance, the greater the effect of individual speech sounds on the average likelihood score. In other words, the effects of individual speech sounds are not cancelled out by a large number of speech sounds in very short utterances. This paper presents a score-based segment selection technique for discarding portions of speech that result in poor discrimination ability in a speaker verification task. Theory is developed to detect the most significant and reliable speech segments based on the probability that the test segment comes from a fixed set of cohort models. This approach, suitable for any duration of test utterance, reduces the effect of acoustic regions of the speech that are not accurately modelled due to sparse training data, and makes a decision based only on the segments that provide the best-matched scores from the segment selection algorithm. The proposed segment selection technique provides reductions in relative error rate of 22% and 7% in terms of minimum Detection Cost Function (DCF) and Equal Error Rate (EER) compared with a baseline that used segment-based normalization, when evaluated on the short utterances of the NIST 2002 Speaker Recognition Evaluation dataset. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Nosratighods, Mohaddeseh; Ambikairajah, Eliathamby; Epps, Julien] Univ New S Wales, Sch Elect Engn & Telecommun, Sydney, NSW 2052, Australia.
[Carey, Michael John] Univ Birmingham, Dept Elect Elect & Comp Engn, Birmingham B15 2TT, W Midlands, England.
[Ambikairajah, Eliathamby] NICTA, Eveleigh 1430, Australia.
RP Nosratighods, M (reprint author), Univ New S Wales, Sch Elect Engn & Telecommun, Sydney, NSW 2052, Australia.
EM hadis@unsw.edu.au; am-bi@ee.unsw.edu; j.epps@unsw.edu.au;
m.carey@bham.ac.uk
CR ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155
Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360
AUCKENTHALER R, 1999, IEEE INT C ACOUSTICS, V1, P313
BARRAS C, 2003, IEEE INT C AC SPEECH, V2, P49
CAREY MJ, 1997, IEEE INT C AC SPEECH, V2, P1083
Devore J.L., 1995, PROBABILITY STAT ENG
Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Gauvain JL, 2002, SPEECH COMMUN, V37, P89, DOI 10.1016/S0167-6393(01)00061-9
Heck LP, 1997, INT CONF ACOUST SPEE, P1071, DOI 10.1109/ICASSP.1997.596126
Koolwaaij J, 2000, DIGIT SIGNAL PROCESS, V10, P113, DOI 10.1006/dspr.1999.0357
Li K.-P., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), DOI 10.1109/ICASSP.1988.196655
Malayath N, 2000, DIGIT SIGNAL PROCESS, V10, P55, DOI 10.1006/dspr.1999.0363
MARTIN AF, 2002, NIST SPEAKER EVALUAT
Nosratighods M, 2007, INT CONF ACOUST SPEE, P269
NOSRATIGHODS M, 2006, INT C SPEECH SCI TEC, P136
Pelecanos J, 2006, INT CONF ACOUST SPEE, P109
PELECANOS J, 2001, SPEAKER ODYSSEY SPEA, P175
PRZYBOCKI M, 2004, SPEAKER ODYSSEY SPEA, P15
REYNOLDS D, 1994, EUR C SPEECH COMM TE
Reynolds DA, 1992, GAUSSIAN MIXTURE MOD
ROSE RC, 1991, INT CONF ACOUST SPEE, P401, DOI 10.1109/ICASSP.1991.150361
Wackerly D. D., 1996, MATH STAT APPL
NR 23
TC 3
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2010
VL 52
IS 9
BP 753
EP 761
DI 10.1016/j.specom.2010.04.007
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 631BD
UT WOS:000280320600005
ER
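The Nosratighods et al. record above scores a verification trial using only the best-matched test segments. The toy sketch below keeps a fixed fraction of segments whose claimed-speaker score stands out most from a cohort average; the paper derives its selection rule from a null-hypothesis test, so this fraction-based variant and the synthetic scores are only hypothetical stand-ins.

# Toy sketch of score-based segment selection for speaker verification.
import numpy as np

def selected_score(claimant_ll, cohort_ll, keep_fraction=0.5):
    """claimant_ll: per-segment log-likelihoods under the claimed model.
       cohort_ll:   per-segment log-likelihoods under a cohort set
                    (n_segments x n_cohort)."""
    # segment-level log-likelihood ratio against the cohort average
    llr = claimant_ll - cohort_ll.mean(axis=1)
    n_keep = max(1, int(keep_fraction * len(llr)))
    best = np.sort(llr)[-n_keep:]          # best-matched segments only
    return best.mean()

rng = np.random.default_rng(3)
claimant = rng.normal(-40, 3, size=20)      # 20 segments (synthetic scores)
cohort = rng.normal(-43, 3, size=(20, 5))
print(selected_score(claimant, cohort))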
PT J
AU Laska, B
Bolic, M
Goubran, R
AF Laska, Brady
Bolic, Miodrag
Goubran, Rafik
TI Discrete cosine transform particle filter speech enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Noise reduction; Particle filtering; Discrete cosine
transform (DCT)
ID SPECTRAL AMPLITUDE ESTIMATOR; SUBSPACE APPROACH; KALMAN FILTER; NOISY
SPEECH; ALGORITHMS
AB A discrete cosine transform (DCT) domain speech enhancement algorithm is proposed that models the evolution of speech DCT coefficients as a time-varying autoregressive process. Rao-Blackwellized particle filter (RBPF) techniques are used to estimate the model parameters and recover the clean signal coefficients. Using very low-order models for each coefficient and operating at a decimated frame rate, the proposed approach provides a significant complexity reduction compared to the standard full-band RBPF speech enhancement algorithm. In addition to the complexity gains, performance is also improved. Modeling the speech signal in the DCT-domain is shown to provide a better fit in spectral troughs, leading to more noise reduction and less speech distortion. To illustrate possible frequency-dependent processing strategies, a hybrid structure is proposed that offers a complexity/performance trade-off by substituting a simple DCT Wiener filter for the DCT-RBPF in some bands. In comparisons with high performing speech enhancement algorithms using wide-band speech and noise, the proposed DCT-RBPF algorithm achieves higher scores on objective quality and intelligibility measures. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Laska, Brady] Res Mot Ltd, Ottawa, ON K2K 3K1, Canada.
[Goubran, Rafik] Carleton Univ, Dept Syst & Comp Engn, Ottawa, ON K1S 5B6, Canada.
[Bolic, Miodrag] Univ Ottawa, Sch Informat Technol & Engn, Ottawa, ON K1N 6N5, Canada.
RP Laska, B (reprint author), Res Mot Ltd, 4000 Innovat Dr, Ottawa, ON K2K 3K1, Canada.
EM blaska@rim.com; mbolic@site.uottawa.ca; goubran@sce.carleton.ca
FU Siemens AG; NSERC
FX This work was funded in part by Siemens AG, and NSERC. Thanks to Rajbabu
Velmurugan and the anonymous reviewers for their valuable comments and
to Frederic Mustiere for sharing his RBPF expertise.
CR Arulampalam MS, 2002, IEEE T SIGNAL PROCES, V50, P174, DOI 10.1109/78.978374
BOLIC M, 2003, P IEEE INT C AC SPEE, V2, P589
CHEVALIER M, 1985, P IEEE ICASSP, V10, P501
Daum F., 2003, P IEEE C AER, V4, P1979
DENDRINOS M, 1991, SPEECH COMMUN, V10, P45, DOI 10.1016/0167-6393(91)90027-Q
DENG Y, 2006, P EUR EUSIPCO FLOR I
Doucet A., 2000, P 16 C UNC ART INT, P176
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Fong W, 2002, IEEE T SIGNAL PROCES, V50, P438, DOI 10.1109/78.978397
Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367
GORDON NJ, 1993, IEE PROC-F, V140, P107
Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
JENSEN SH, 1995, IEEE T SPEECH AUDI P, V3, P439, DOI 10.1109/89.482211
Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575
Labarre D, 2007, IEEE T SIGNAL PROCES, V55, P5195, DOI 10.1109/TSP.2007.899587
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lotter T., 2005, EURASIP J APPL SIG P, V7, P1110
Ma N, 2006, IEEE T AUDIO SPEECH, V14, P19, DOI 10.1109/TSA.2005.858515
Martin R, 2005, SIG COM TEC, P43, DOI 10.1007/3-540-27489-8_3
MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394
MUSTIERE F, 2006, P IEEE ICASSP TOUL F, V3, P21
MUSTIERE F, 2007, P IEEE ICASSP HON HI, P1197
Paliwal K. K., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0)
PUDER H, 2006, TOPICS ACOUSTIC ECHO
Rao S, 1996, IEEE T INFORM THEORY, V42, P1160, DOI 10.1109/18.508839
Soon IY, 1998, SPEECH COMMUN, V24, P249, DOI 10.1016/S0167-6393(98)00019-3
TRIKI M, 2009, P IEEE ICASSP, P29
Vermaak J, 2002, IEEE T SPEECH AUDI P, V10, P173, DOI 10.1109/TSA.2002.1001982
Wu WR, 1998, IEEE T CIRCUITS-II, V45, P1072
NR 32
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2010
VL 52
IS 9
BP 762
EP 775
DI 10.1016/j.specom.2010.05.005
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 631BD
UT WOS:000280320600006
ER
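The Laska et al. record above proposes a DCT-domain Rao-Blackwellized particle filter and mentions a simple DCT Wiener filter as the low-cost branch of a hybrid structure. The sketch below shows only such a per-frame DCT-domain Wiener gain, not the RBPF; the frame length, noise-variance estimate and test signal are assumptions for illustration.

# Sketch of a per-frame DCT-domain Wiener gain (the simple hybrid branch only).
import numpy as np
from scipy.fft import dct, idct

def dct_wiener_frame(noisy_frame, noise_var):
    coeffs = dct(noisy_frame, norm="ortho")
    signal_var = np.maximum(coeffs**2 - noise_var, 1e-12)   # crude per-coefficient estimate
    gain = signal_var / (signal_var + noise_var)             # Wiener gain
    return idct(gain * coeffs, norm="ortho")

rng = np.random.default_rng(4)
clean = np.sin(2 * np.pi * 0.05 * np.arange(256))
noisy = clean + 0.3 * rng.standard_normal(256)
enhanced = dct_wiener_frame(noisy, noise_var=0.09)
print(np.mean((enhanced - clean)**2) < np.mean((noisy - clean)**2))  # usually True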
PT J
AU Bitouk, D
Verma, R
Nenkova, A
AF Bitouk, Dmitri
Verma, Ragini
Nenkova, Ani
TI Class-level spectral features for emotion recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotions; Emotional speech classification; Spectral features
ID SPEECH
AB The most common approaches to automatic emotion recognition rely on utterance-level prosodic features. Recent studies have shown that utterance-level statistics of segmental spectral features also contain rich information about expressivity and emotion. In our work we introduce a more fine-grained yet robust set of spectral features: statistics of Mel-Frequency Cepstral Coefficients computed over three phoneme type classes of interest (stressed vowels, unstressed vowels and consonants) in the utterance. We investigate the performance of our features in the task of speaker-independent emotion recognition using two publicly available datasets. Our experimental results clearly indicate that indeed both the richer set of spectral features and the differentiation between phoneme type classes are beneficial for the task. Classification accuracies are consistently higher for our features compared to prosodic or utterance-level spectral features. Combining our phoneme class features with prosodic features leads to even further improvement. Given the large number of class-level spectral features, we expected feature selection to improve results even further, but none of several selection methods led to clear gains. Further analyses reveal that spectral features computed from consonant regions of the utterance contain more information about emotion than either stressed or unstressed vowel features. We also explore how emotion recognition accuracy depends on utterance length. We show that, while there is no significant dependence for utterance-level prosodic features, the accuracy of emotion recognition using class-level spectral features increases with utterance length. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Bitouk, Dmitri; Verma, Ragini] Univ Penn, Dept Radiol, Sect Biomed Image Anal, Philadelphia, PA 19104 USA.
[Nenkova, Ani] Univ Penn, Dept Comp & Informat Sci, Philadelphia, PA 19104 USA.
RP Bitouk, D (reprint author), Univ Penn, Dept Radiol, Sect Biomed Image Anal, 3600 Market St,Suite 380, Philadelphia, PA 19104 USA.
EM Dmitri.Bitouk@uphs.upenn.edu
FU NIH [R01 MHO73174]
FX This work is supported by NIH Grant R01 MHO73174. The authors would like
to thank Dr. Jiahong Yuan for providing us with the code and English
acoustic models for forced alignment.
CR Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
BITOUK D, 2009, P INT 2009
Boersma P., 2001, GLOT INT, V5, P341
Burkhardt F., 2005, P INT 2005, P1
Chang C.-C., 2001, LIBSVM LIB SUPPORT V
Dellaert F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608022
Fernandez R, 2003, SPEECH COMMUN, V40, P145, DOI 10.1016/S0167-6393(02)00080-8
GRIMM M, 2006, P 14 EUR SIGN PROC C
Hall M. A., 1997, P 4 INT C NEUR INF P, P855
Hall MA, 1998, AUST COMP S, V20, P181
HASEGAWAJOHNSON M, 2004, P INT 2004
HU H, 2007, P IEEE INT C AC SPEE, V4, P413
HUANG R, 2006, P INT C PATT REC ICP, P1204
KIM S, 2007, P IEEE 9 WORKSH MULT
KWON O.W., 2003, P 8 EUR C SPEECH COM, P125
Lee CM, 2004, P INT 2004, P205
LEGETTER C, 1996, COMPUT SPEECH LANG, V10, P249
*LING DAT CONS, 2002, LDC2002528 U PENNS
Luengo I, 2005, P INTERSPEECH, P493
McGilloway S., 2000, P ISCA WORKSH SPEECH, P200
MENG H, 2007, P IEEE INT C SIGN PR, P1179
Neiberg D, 2006, P INT C SPOK LANG PR, P809
Nicholson J, 2000, NEURAL COMPUT APPL, V9, P290, DOI 10.1007/s005210070006
Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2
ODELL J, 2002, HTK BOOK
PAO T, 2005, COMPUT LINGUISTICS C, V10
Sato N., 2007, INFORM MEDIA TECHNOL, V2, P835
Scherer S., 2007, P INT ENV 2007, P152
SCHULLER B., 2005, P INT LISB PORT, P805
SCHULLER B, 2006, P INTERSPEECH 2006 I, P1818
SETHU V, 2008, P INT 2008, P617
Shafran I., 2003, P IEEE AUT SPEECH RE, P31
Shamia M., 2005, P IEEE INT C MULT EX
Song ML, 2004, LECT NOTES COMPUT SC, V3046, P406
TABATABAEI T, 2007, P IEEE INT S CIRC SY, P345
Vlasenko B, 2008, LECT NOTES ARTIF INT, V5078, P217, DOI 10.1007/978-3-540-69369-7_24
Vlasenko B, 2007, LECT NOTES COMPUT SC, V4738, P139
Vondra M, 2009, LECT NOTES ARTIF INT, V5398, P256
WANG Y, 2005, P IEEE INT C AC SPEE, P1125
Yacoub S., 2003, P EUROSPEECH, P729
Ye CX, 2008, LECT NOTES COMPUT SC, V5353, P61
NR 41
TC 28
Z9 30
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL-AUG
PY 2010
VL 52
IS 7-8
BP 613
EP 625
DI 10.1016/j.specom.2010.02.010
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 608HH
UT WOS:000278573700001
ER
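The Bitouk et al. record above computes MFCC statistics separately over stressed vowels, unstressed vowels and consonants. The sketch below pools per-frame MFCCs by such class labels into one utterance-level vector; the alignment labels and MFCC values here are synthetic, whereas the paper obtains them from forced alignment.

# Sketch of class-level spectral features: per-class MFCC means and standard deviations.
import numpy as np

CLASSES = ("stressed_vowel", "unstressed_vowel", "consonant")

def class_level_features(mfcc, frame_classes):
    """mfcc: (n_frames, n_coeffs); frame_classes: per-frame class label."""
    feats = []
    frame_classes = np.asarray(frame_classes)
    for c in CLASSES:
        frames = mfcc[frame_classes == c]
        if len(frames) == 0:
            feats.extend(np.zeros(2 * mfcc.shape[1]))   # back off if a class is absent
        else:
            feats.extend(frames.mean(axis=0))
            feats.extend(frames.std(axis=0))
    return np.array(feats)   # one fixed-length vector per utterance

rng = np.random.default_rng(5)
mfcc = rng.standard_normal((120, 13))
labels = rng.choice(CLASSES, size=120)
print(class_level_features(mfcc, labels).shape)   # (78,) = 3 classes x 13 coeffs x 2 stats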
PT J
AU Munro, MJ
Derwing, TM
Burgess, CS
AF Munro, Murray J.
Derwing, Tracey M.
Burgess, Clifford S.
TI Detection of nonnative speaker status from content-masked speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Foreign accent; Voice quality; Backwards speech
ID PERCEIVED FOREIGN ACCENT; LANGUAGE DISCRIMINATION; MAGNITUDE ESTIMATION;
ESL LEARNERS; L2 SPEECH; ENGLISH; VOICE; 2ND-LANGUAGE; INTELLIGIBILITY;
PRONUNCIATION
AB Listeners are highly sensitive to divergences from native-speaker patterns of speech production, such that they can recognize second-language speakers even from very short stretches of speech. However, the processes by which nonnative speaker detection is accomplished are not fully understood. In this investigation, we used content-masked (backwards) speech to assess listeners' sensitivity to nonnative speaker status when potential segmental, grammatical, and lexical cues were removed. The listeners performed at above-chance levels across utterances of varying lengths and across three accents (Mandarin, Cantonese, and Czech). Reduced sensitivity was observed when variability in speaking rates and F0 was removed, and when temporal integrity was severely disrupted. The results indicate that temporal properties, pitch, and voice quality probably played a role in the listeners' judgments. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Munro, Murray J.; Burgess, Clifford S.] Simon Fraser Univ, Dept Linguist, Burnaby, BC V5A 1S6, Canada.
[Derwing, Tracey M.] Univ Alberta, Dept Educ Psychol, Edmonton, AB T6G 2G5, Canada.
RP Munro, MJ (reprint author), Simon Fraser Univ, Dept Linguist, 8888 Univ Dr, Burnaby, BC V5A 1S6, Canada.
EM mjmunro@sfu.ca
FU Social Sciences and Humanities Research Council of Canada
FX The authors thank H. Li, N. Penner, R. Thomson, and K. Jamieson for
their assistance with stimulus preparation and data collection. We also
thank Bruce Derwing for his comments. This research was supported by the
Social Sciences and Humanities Research Council of Canada.
CR ANDERSONHSIEH J, 1992, LANG LEARN, V42, P529, DOI 10.1111/j.1467-1770.1992.tb01043.x
Andrianopoulos MV, 2001, J VOICE, V15, P61, DOI 10.1016/S0892-1997(01)00007-8
BLACK JW, 1973, J SPEECH HEAR RES, V16, P165
Boersma P., 2008, PRAAT DOING PHONETIC
BRENNAN EM, 1981, LANG SPEECH, V24, P207
BRENNAN EM, 1975, J PSYCHOLINGUIST RES, V4, P27, DOI 10.1007/BF01066988
Chan AY, 2007, CAN J LING/REV CAN L, V52, P231
Collins SA, 2000, ANIM BEHAV, V60, P773, DOI 10.1006/anbe.2000.1523
DAVILA A, 1993, SOC SCI QUART, V74, P902
Derwing T. M., 1997, STUDIES 2 LANGUAGE A, V19, P1, DOI [DOI 10.1017/S0272263197001010, 10.1017/S0272263197001010]
Derwing T. M., 2002, J MULTILING MULTICUL, V23, P245, DOI DOI 10.1080/01434630208666468
Derwing TM, 2008, APPL LINGUIST, V29, P359, DOI 10.1093/applin/amm041
DONALDSON W, 1993, B PSYCHONOMIC SOC, V31, P271
Esling J. H., 2000, VOICE QUALITY MEASUR, P25
ESLING JH, 1983, TESOL QUART, V17, P89, DOI 10.2307/3586426
ESLING JH, 1994, PRONUNCIATION PEDAGO, P49
Fitch WT, 2000, TRENDS COGN SCI, V4, P258, DOI 10.1016/S1364-6613(00)01494-7
Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148
Flege J. E., 1988, HUMAN COMMUNICATION, P224
FLEGE JE, 1984, J ACOUST SOC AM, V76, P692, DOI 10.1121/1.391256
FLEGE JE, 1991, J ACOUST SOC AM, V89, P395, DOI 10.1121/1.400473
FLEGE JE, 1992, J ACOUST SOC AM, V91, P370, DOI 10.1121/1.402780
FRIEND M, 1994, J ACOUST SOC AM, V96, P1283, DOI 10.1121/1.410276
Gick B., 2008, PHONOLOGY 2 LANGUAGE, P309
HANLEY TD, 1966, PHONETICA, V14, P97
Hartmann W. M., 1998, SIGNALS SOUND SENSAT
HONIKMAN B, 1964, PAPERS CONTRIBUTED O, P73
Jones R. H., 1995, ELT J, V49, P244, DOI 10.1093/elt/49.3.244
KIMURA D, 1968, SCIENCE, V161, P395, DOI 10.1126/science.161.3839.395
Ladefoged Peter, 2005, VOWELS CONSONANTS IN
Laver J, 1980, PHONETIC DESCRIPTION
Levi SV, 2007, J ACOUST SOC AM, V121, P2327, DOI 10.1121/1.2537345
Lippi-Green Rosina, 1997, ENGLISH ACCENT LANGU
Mackay IRA, 2006, APPL PSYCHOLINGUIST, V27, P157, DOI 10.1017/S0142716406060231
Magen HS, 1998, J PHONETICS, V26, P381, DOI 10.1006/jpho.1998.0081
Major RC, 2007, STUD SECOND LANG ACQ, V29, P539, DOI 10.1017/S0272263107070428
Mennen I, 2004, J PHONETICS, V32, P543, DOI 10.1016/j.wocn.2004.02.002
Munro M., 2003, P 15 INT C PHON SCI, P535
Munro M., 2003, TESL CANADA J, V20, P38
Munro M. J., 1995, STUDIES 2 LANGUAGE A, V17, P17, DOI 10.1017/S0272263100013735
Munro M. J., 2001, STUDIES 2 LANGUAGE A, V23, P451
MUNRO MJ, 1995, LANG LEARN, V45, P73, DOI 10.1111/j.1467-1770.1995.tb00963.x
Munro MJ, 2006, STUD SECOND LANG ACQ, V28, P111, DOI 10.1017/S0272263106060049
Munro Murray J., 2008, PHONOLOGY 2 LANGUAGE, P193
Nolan F., 1998, FORENSIC LINGUIST, V3, P39
Ramus F, 2000, SCIENCE, V288, P349, DOI 10.1126/science.288.5464.349
RAUPACH M, 1980, STUDIES HONOUR F GOL, P263
RIGGENBACH H, 1991, DISCOURSE PROCESS, V14, P423
Scovel T., 1988, TIME SPEAK PSYCHOLIN
Sherman D, 1954, J SPEECH HEAR DISORD, V19, P312
Southwood MH, 1999, CLIN LINGUIST PHONET, V13, P335
Tajima K, 1997, J PHONETICS, V25, P1, DOI 10.1006/jpho.1996.0031
THOMPSON I, 1991, LANG LEARN, V41, P177, DOI 10.1111/j.1467-1770.1991.tb00683.x
Toro JM, 2003, ANIM COGN, V6, P131, DOI 10.1007/s10071-003-0172-0
Trofimovich P, 2006, STUD SECOND LANG ACQ, V28, P1, DOI 10.1017/S0272263106060013
Tsukada K, 2004, PHONETICA, V61, P67, DOI 10.1159/000082557
vanDommelen WA, 1995, LANG SPEECH, V38, P267
VANELS T, 1987, MOD LANG J, V71, P147, DOI 10.2307/327199
VANLANCKER D, 1985, J PHONETICS, V13, P19
Varonis E. M., 1982, STUDIES 2ND LANGUAGE, V4, P114, DOI 10.1017/S027226310000437X
White L, 2007, J PHONETICS, V35, P501, DOI 10.1016/j.wocn.2007.02.003
Wilson I. L., 2006, THESIS U BRIT COLUMB
NR 62
TC 12
Z9 13
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL-AUG
PY 2010
VL 52
IS 7-8
BP 626
EP 637
DI 10.1016/j.specom.2010.02.013
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 608HH
UT WOS:000278573700002
ER
PT J
AU Reubold, U
Harrington, J
Kleber, F
AF Reubold, Ulrich
Harrington, Jonathan
Kleber, Felicitas
TI Vocal aging effects on F-0 and the first formant: A longitudinal
analysis in adult speakers
SO SPEECH COMMUNICATION
LA English
DT Article
DE Vocal aging; F-0; Formants; Longitudinal analysis; Perception;
Source-tract-interaction
ID SPEAKING FUNDAMENTAL-FREQUENCY; LARYNGEAL AIRWAY-RESISTANCE;
AGE-RELATED-CHANGES; ACOUSTIC CHARACTERISTICS; VOWEL PRODUCTION;
VERTICAL-BAR; SPEECH; VOICE; TRACT; ENGLISH
AB This paper presents a longitudinal analysis of the extent to which age affects F-0 and formant frequencies. Five speakers at two time intervals showed a clear effect for F-0 and F-1 but no systematic effects for F-2 or F-3. In two speakers for whom recordings were available in successive years over a 50 year period, results showed with increasing age a decrease in both F-0 and F-1 for a female speaker and a V-shaped pattern, i.e. a decrease followed by an increase in both F-0 and F-1, for a male speaker. This analysis also provided strong evidence that F-1 approximately tracked F-0 across the years: i.e., the rates of change of (the logarithm of) F-0 and F-1 were generally the same. We also verified that the changes in F-1 were not an acoustic artifact of the changing F-0. Perception experiments with the main aim of assessing whether changes in F-1 contributed to age judgments beyond those from F-0 showed that the contribution of F-1 was inconsistent and negligible. The general conclusion is that age-related changes in F-1 may be compensatory to offset a physiologically induced decline in F-0 and thereby maintain a relatively constant auditory distance between F-0 and F-1. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Reubold, Ulrich; Harrington, Jonathan; Kleber, Felicitas] Univ Munich, Inst Phonet & Speech Proc IPS, D-80799 Munich, Germany.
Ocean Univ Qingdao, Inst Comp Sci & Engn, Shandong, Peoples R China.
RP Reubold, U (reprint author), Univ Munich, Inst Phonet & Speech Proc IPS, Schellingstr 3-2, D-80799 Munich, Germany.
EM reubold@phonetik.uni-muenchen.de
CR Abitbol J, 1999, J VOICE, V13, P424, DOI 10.1016/S0892-1997(99)80048-4
Baayen R. Harald, 2008, ANAL LINGUISTIC DATA
BADIN P, 1984, 23 STL QPSR, P53
Bailey Guy, 1991, LANG VAR CHANGE, V3.3, P241, DOI [10.1017/S0954394500000569, DOI 10.1017/S0954394500000569]
Baken RJ, 2005, J VOICE, V19, P317, DOI 10.1016/j.jvoice.2004.07.005
Barney A, 2007, ACTA ACUST UNITED AC, V93, P1046
Beckman Mary E., 2010, HDB PHONETIC SCI, P603, DOI 10.1002/9781444317251.ch16
BENJAMIN BJ, 1982, J PSYCHOLINGUIST RES, V11, P159
Beyerlein P., 2008, VOCAL AGING EXPLAINE
BROWN WS, 1991, J VOICE, V5, P310, DOI 10.1016/S0892-1997(05)80061-X
Campbell J, 2004, DIGIT SIGNAL PROCESS, V14, P295, DOI 10.1016/j.dsp.2004.06.001
Chambers J. K., 2002, HDB LANGUAGE VARIATI
CHILDERS DG, 1994, IEEE T BIO-MED ENG, V41, P663, DOI 10.1109/10.301733
Cox F., 2006, AUSTR J LINGUISTICS, V26, P147, DOI 10.1080/07268600600885494
DEJONG KJ, 1995, J ACOUST SOC AM, V97, P491, DOI 10.1121/1.412275
Decoster W, 2000, J VOICE, V14, P184, DOI 10.1016/S0892-1997(00)80026-0
DEPINTO O, 1982, J PHONETICS, V10, P367
DOCHERTY G, OXFORD HDB IN PRESS
ENDRES W, 1971, J ACOUST SOC AM, V49, P1842, DOI 10.1121/1.1912589
Flanagan J. L., 1965, SPEECH ANAL SYNTHESI
FLUGEL C, 1991, MECH AGEING DEV, V61, P65, DOI 10.1016/0047-6374(91)90007-M
Foulkes P, 2006, J PHONETICS, V34, P409, DOI 10.1016/j.wocn.2005.08.002
GUGATSCHKA M, J VOICE IN PRESS
Harnsberger James D, 2008, J Voice, V22, P58, DOI 10.1016/j.jvoice.2006.07.004
Harrington J., 2000, PAPERS LAB PHONOLOGY, VV, P40
Harrington J, 2008, J ACOUST SOC AM, V123, P2825, DOI 10.1121/1.2897042
Harrington J, 2000, NATURE, V408, P927, DOI 10.1038/35050160
HARRINGTON J, 2005, GIFT SPEECH, P227
Harrington J, 2006, J PHONETICS, V34, P439, DOI 10.1016/j.wocn.2005.08.001
HARRINGTON J, 2007, LAB PHONOLOGY, V9
Harrington Jonathan, 2000, J INT PHON ASSOC, V30, P63, DOI [10.1017/S0025100300006666, DOI 10.1017/S0025100300006666]
HOIT JD, 1987, J SPEECH HEAR RES, V30, P351
HOIT JD, 1992, J SPEECH HEAR RES, V35, P309
HOLLIEN H, 1972, J SPEECH HEAR RES, V15, P155
Holmes J., 2001, SPEECH SYNTHESIS REC
Honda K, 1999, LANG SPEECH, V42, P401
Huber JE, 2008, J SPEECH LANG HEAR R, V51, P651, DOI 10.1044/1092-4388(2008/047)
KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894
LABOV W, 1994, SOCIOLINGUISTIC PATT, V1
Labov William, 1972, SOCIOLINGUISTIC PATT
Labov William, 2001, PRINCIPLES LINGUISTI, V2
LINDBLOM BE, 1971, J ACOUST SOC AM, V50, P1166, DOI 10.1121/1.1912750
Linville SE, 2001, J VOICE, V15, P323, DOI 10.1016/S0892-1997(01)00034-0
Linville SE, 2001, VOCAL AGING
Linville SE, 1996, J VOICE, V10, P190, DOI 10.1016/S0892-1997(96)80046-4
Linville SE, 1987, J VOICE, V1, P44, DOI 10.1016/S0892-1997(87)80023-1
LINVILLE SE, 1985, J GERONTOL, V40, P324
LINVILLE SE, 1985, J ACOUST SOC AM, V78, P40, DOI 10.1121/1.392452
MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131
Maeda S., 1979, ACT 10 JOURN ET PAR, P152
MELCON MC, 1989, J SPEECH HEAR DISORD, V54, P282
MENDOZADENTON N, INTER INTRA IN PRESS
Morris RJ, 1987, J VOICE, V1, P38, DOI 10.1016/S0892-1997(87)80022-X
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
MWANGI S, 2009, P INT C AC NAG DAGA, P1761
Nishio M, 2008, FOLIA PHONIATR LOGO, V60, P120, DOI 10.1159/000118510
Palethorpe Sallyanne, 2007, INTERSPEECH 2007, P2753
Pedhazur Elazar J., 1997, MULTIPLE REGRESSION
Rastatter MP, 1997, FOLIA PHONIATR LOGO, V49, P1
RUSSELL A, 1995, J SPEECH HEAR RES, V38, P101
Sankoff G, 2007, LANGUAGE, V83, P560, DOI 10.1353/lan.2007.0106
Sato K, 1997, ANN OTO RHINOL LARYN, V106, P44
SCUKANEC GP, 1991, PERCEPT MOTOR SKILL, V73, P203, DOI 10.2466/PMS.73.4.203-208
SEGRE R, 1971, EYE EAR NOSE THROAT, V50, P62
SHIPP T, 1975, J SPEECH HEAR RES, V18, P707
SLAWSON AW, 1968, J ACOUST SOC AM, V43, P87, DOI 10.1121/1.1910769
SMITH BL, 1987, J SPEECH HEAR RES, V30, P522
SYRDAL AK, 1986, J ACOUST SOC AM, V79, P1086, DOI 10.1121/1.393381
TRAUNMULLER H, 1981, J ACOUST SOC AM, V69, P1465
TRAUNMULLER H, 1984, SPEECH COMMUN, V3, P49, DOI 10.1016/0167-6393(84)90008-6
TRAUNMULLER H, 1991, P 12 INT C PHON SCI, V5, P62
Trudgill Peter, 1979, SOCIAL MARKERS SPEEC
Verdonck-de Leeuw Irma M, 2004, J Voice, V18, P193, DOI 10.1016/j.jvoice.2003.10.002
Weinreich U., 1968, DIRECTIONS HIST LING
Wind J., 1970, PHYLOGENY ONTOGENY H
WINKLER R, 2007, P 16 INT C PHON SCI
WINKLER R, 2007, P INT 2007 ANTW
Xue SA, 2003, J SPEECH LANG HEAR R, V46, P689, DOI 10.1044/1092-4388(2003/054)
Zemlin WR, 1998, SPEECH HEARING SCI A
NR 79
TC 18
Z9 18
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL-AUG
PY 2010
VL 52
IS 7-8
BP 638
EP 651
DI 10.1016/j.specom.2010.02.012
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 608HH
UT WOS:000278573700003
ER
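The Reubold et al. record above reports that log F-0 and log F-1 change at roughly the same rate across years. As a purely illustrative check of that idea, the sketch below fits yearly slopes to synthetic log F0 and log F1 series and compares them; the data are invented, not the paper's longitudinal measurements.

# Illustrative check of the "F1 tracks F0" idea on synthetic yearly data.
import numpy as np

years = np.arange(1950, 2000)
rng = np.random.default_rng(6)
f0 = 120 * np.exp(-0.004 * (years - 1950)) * np.exp(rng.normal(0, 0.01, years.size))
f1 = 550 * np.exp(-0.004 * (years - 1950)) * np.exp(rng.normal(0, 0.01, years.size))

slope_f0 = np.polyfit(years, np.log(f0), 1)[0]
slope_f1 = np.polyfit(years, np.log(f1), 1)[0]
print(f"log-F0 slope {slope_f0:.4f}/yr, log-F1 slope {slope_f1:.4f}/yr")
# similar slopes mean F1 roughly tracks F0 on a log scale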
PT J
AU Yu, K
Gales, M
Wang, L
Woodland, PC
AF Yu, Kai
Gales, Mark
Wang, Lan
Woodland, Philip C.
TI Unsupervised training and directed manual transcription for LVCSR
SO SPEECH COMMUNICATION
LA English
DT Article
DE Unsupervised training; Discriminative training; Automatic transcription;
Data selection
AB A significant cost in obtaining acoustic training data is the generation of accurate transcriptions. When no transcription is available, unsupervised training techniques must be used. Furthermore, the use of discriminative training has become a standard feature of state-of-the-art large vocabulary continuous speech recognition (LVCSR) systems. In unsupervised training, unlabelled data are recognised using a seed model and the hypotheses from the recognition system are used as transcriptions for training. In contrast to maximum likelihood training, the performance of discriminative training is more sensitive to the quality of the transcriptions. One approach to deal with this issue is data selection, where only well recognised data are selected for training. More effectively, as the key contribution of this work, an active learning technique, directed manual transcription, can be used: a relatively small amount of poorly recognised data is manually transcribed to supplement the automatic transcriptions. Experiments show that using the data selection approach for discriminative training yields disappointing performance improvements on data that is mismatched to the training data type of the seed model. However, using the directed manual transcription approach can yield significant improvements in recognition accuracy on all types of data. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Yu, Kai; Gales, Mark; Wang, Lan; Woodland, Philip C.] Univ Cambridge, Dept Engn, Machine Intelligence Lab, Cambridge CB2 1PZ, England.
RP Yu, K (reprint author), Univ Cambridge, Dept Engn, Machine Intelligence Lab, Cambridge CB2 1PZ, England.
EM ky219@cam.ac.uk
RI Yu, Kai/B-1772-2012
OI Yu, Kai/0000-0002-7102-9826
FU Defense Advanced Research Projects Agency [HR0011-06-C-0022]
FX This work was supported in part under the GALE program of the Defense
Advanced Research Projects Agency, Contract No. HR0011-06-C-0022. Many
thanks go to X.A. Liu for training some of the language models used in
the experiments.
CR CHAN H, 2004, ICASSP MONTR
COHN D, 1994, MACH LEARN, V15, P201, DOI 10.1023/A:1022673506211
Cox S. J., 1989, P ICASSP GLASG
DOUMPIOTIS V, 2003, P EUROSPEECH
EVERMANN G, 2003, P ASRU ST THOM
EVERMANN G, 2005, P ICASSP PHIL
GALES MJF, 2005, P ICASSP PHIL
KAMM TM, 2002, P HUM LANG TECHN SAN
KEMP T, 1999, P EUROSPEECH BUD
Kumar N, 1997, THESIS J HOPKINS U
LAMEL L, 2001, P ICASSP SALT LAK CI
Lamel L, 2002, COMPUT SPEECH LANG, V16, P115, DOI 10.1006/csla.2001.0186
MA J, 2006, P ICASSP TOUL
MANGU L, 2000, THESIS J HOPKINS U
NAKAMURA M, 2007, COMPUTER SPEECH LANG, V22, P171
PALLETT DS, 1990, P ICASSP
POVEY D, 2002, P ICASSP ORL
Povey D., 2003, THESIS CAMBRIDGE U
RICCARDI G, 2003, P EUROSPEECH GEN
SINHA R, 2006, P ICASSP TOUL
WANG L, 2007, P ICASSP HON
Wessel F, 2005, IEEE T SPEECH AUDI P, V13, P23, DOI 10.1109/TSA.2004.838537
WOODLAND PC, 1995, ARPA WORKSH SPOK LAN, P104
WOODLAND PC, 1997, DARPA SPEECH REC WOR, P73
YU K, 2007, P INTERSPEECH ANTW
NR 25
TC 8
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL-AUG
PY 2010
VL 52
IS 7-8
BP 652
EP 663
DI 10.1016/j.specom.2010.02.014
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 608HH
UT WOS:000278573700004
ER
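The Yu et al. record above contrasts confidence-based data selection with directed manual transcription of poorly recognised data. The toy sketch below splits utterances by a recogniser confidence score into an automatically transcribed set and a set routed to human transcribers; the scores and thresholds are hypothetical placeholders, not values from the paper.

# Toy sketch of the two selection ideas: keep confident utterances for
# unsupervised training, direct the least confident ones to manual transcription.
def split_for_training(utts, keep_above=0.8, transcribe_below=0.4):
    """utts: list of (utt_id, confidence) pairs from a seed recogniser."""
    auto = [u for u, c in utts if c >= keep_above]          # use ASR hypothesis as transcript
    manual = [u for u, c in utts if c < transcribe_below]   # send to human transcribers
    return auto, manual

utts = [("u1", 0.95), ("u2", 0.62), ("u3", 0.30), ("u4", 0.85)]
auto, manual = split_for_training(utts)
print(auto, manual)   # ['u1', 'u4'] ['u3']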
PT J
AU Gorriz, JM
Ramirez, J
Lang, EW
Puntonet, CG
Turias, I
AF Gorriz, J. M.
Ramirez, J.
Lang, E. W.
Puntonet, C. G.
Turias, I.
TI Improved likelihood ratio test based voice activity detector applied to
speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice activity detection; Generalized complex Gaussian probability
distribution function; Robust speech recognition
ID NOISE; LRT; INFORMATION; MODEL; VAD
AB Nowadays, the accuracy of speech processing systems is strongly affected by acoustic noise. This is a serious obstacle to meeting the demands of modern applications. Therefore, these systems often need a noise reduction algorithm working in combination with a precise voice activity detector (VAD). The computation needed to achieve denoising and speech detection must not exceed the limitations imposed by real-time speech processing systems. This paper presents a novel VAD for improving speech detection robustness in noisy environments and the performance of speech recognition systems in real-time applications. The algorithm is based on a Multivariate Complex Gaussian (MCG) observation model and defines an optimal likelihood ratio test (LRT) involving multiple and correlated observations (MCO) based on a jointly Gaussian probability distribution (jGpdf) and a symmetric covariance matrix. The complete derivation of the jGpdf-LRT for the general case of a symmetric covariance matrix is shown in terms of the Cholesky decomposition, which allows the VAD decision rule to be computed efficiently. An extensive analysis of the proposed methodology for a low dimensional observation model demonstrates: (i) the improved robustness of the proposed approach by means of a clear reduction of the classification error as the number of observations is increased, and (ii) the trade-off between the number of observations and the detection performance. The proposed strategy is also compared to different VAD methods including the G.729, AMR and AFE standards, as well as other recently reported algorithms, showing a sustained advantage in speech/non-speech detection accuracy and speech recognition performance on the AURORA databases. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Gorriz, J. M.; Ramirez, J.] Univ Granada, Dpt Signal Theory Networking & Commun, E-18071 Granada, Spain.
[Lang, E. W.] Univ Regensburg, Inst Biophys, D-93040 Regensburg, Germany.
[Puntonet, C. G.] Univ Granada, Dpt Comp Architecture & Technol, E-18071 Granada, Spain.
[Turias, I.] Univ Cadiz, Dpt Lenguajes & Sistemas Informat, Algeciras 11202, Spain.
RP Gorriz, JM (reprint author), Univ Granada, Dpt Signal Theory Networking & Commun, E-18071 Granada, Spain.
EM gorriz@ugr.es
RI Puntonet, Carlos/B-1837-2012; Prieto, Ignacio/B-5361-2013; Gorriz,
Juan/C-2385-2012; Turias, Ignacio/L-7211-2014; Ramirez,
Javier/B-1836-2012
OI Turias, Ignacio/0000-0003-4627-0252; Ramirez, Javier/0000-0002-6229-2921
CR [Anonymous], 1999, 301708 ETSI EN
[Anonymous], 2000, 201108 ETSI ES
Benyassine A, 1997, IEEE COMMUN MAG, V35, P64, DOI 10.1109/35.620527
Berouti M., 1979, P IEEE INT C AC SPEE, P208
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Bouquin-Jeannes R. L., 1995, SPEECH COMMUN, V16, P245
BOUQUINJEANNES RL, 1994, ELECTRON LETT, V30, P930
Chang JH, 2004, ELECTRON LETT, V40, P1561, DOI 10.1049/el:20047090
Chengalvarayan R, 1999, P EUROSPEECH 1999 BU, P61
Cho D, 2005, SIGNAL PROCESS-IMAGE, V20, P77, DOI 10.1016/j.image.2004.10.003
Cho YD, 2001, P INT C AC SPEECH SI, V2, P737
ETSI, 2002, 202050 ETSI ES
Golub G.H., 1996, MATRIX COMPUTATIONS
Gorriz J. M., 2006, Speech Communication, V48, DOI 10.1016/j.specom.2006.07.006
Gorriz J, 2009, IEEE T AUDIO SPEECH, V16, P1565
Gorriz JM, 2006, J ACOUST SOC AM, V120, P470, DOI 10.1121/1.2208450
Gorriz JM, 2005, ELECTRON LETT, V41, P877, DOI 10.1049/el:20051761
Gorriz JM, 2006, IEEE SIGNAL PROC LET, V13, P636, DOI 10.1109/LSP.2006.876340
Hirsch H. G., 2000, ISCA ITRW ASR2000 AU
International Telecommunication Union (ITU), 1996, G729 ITUT
KARRAY L, 2003, SPEECH COMMUN, V3, P261
Li Q, 2002, IEEE T SPEECH AUDI P, V10, P146
Manly B.F.J., 1986, MULTIVARIATE STAT ME
Marzinzik M., 2002, IEEE T SPEECH AUDIO, V10, P341
MORENO A, 2000, P 2 LREC C
Niehsen W, 1999, IEEE T SIGNAL PROCES, V47, P217, DOI 10.1109/78.738256
Ramirez J, 2005, IEEE T SPEECH AUDI P, V13, P1119, DOI 10.1109/TSA.2005.853212
Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002
Ramirez J, 2006, IEEE SIGNAL PROC LET, V13, P497, DOI 10.1109/LSP.2006.873147
RAMIREZ J, 2001, IEEE SIGNAL PROCESS, V12, P837
SOHN J, 1999, IEEE SIGNAL PROCESSI, V16, P1
Tanyer SG, 2000, IEEE T SPEECH AUDI P, V8, P478, DOI 10.1109/89.848229
TUCKER R, 1992, IEE PROC-I, V139, P377
Woo KH, 2000, ELECTRON LETT, V36, P180, DOI 10.1049/el:20000192
Yamani HA, 1997, J PHYS A-MATH GEN, V30, P2889, DOI 10.1088/0305-4470/30/8/029
YOUNG S, 1997, HTK BOOK
NR 36
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL-AUG
PY 2010
VL 52
IS 7-8
BP 664
EP 677
DI 10.1016/j.specom.2010.03.003
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 608HH
UT WOS:000278573700005
ER
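The Gorriz et al. record above derives a likelihood ratio test over multiple correlated observations using a Cholesky decomposition of the covariance matrix. The sketch below is a heavily simplified, real-valued version of that idea: it compares the joint Gaussian log-density of an observation window under "speech" and "noise" covariances via Cholesky factors. Covariance estimation, the complex-valued observation model and the decision threshold are all simplified assumptions here.

# Simplified multiple-observation Gaussian likelihood-ratio VAD sketch.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_loglik(x, cov):
    c, low = cho_factor(cov)
    quad = x @ cho_solve((c, low), x)
    logdet = 2 * np.sum(np.log(np.diag(c)))
    return -0.5 * (quad + logdet + len(x) * np.log(2 * np.pi))

def vad_decision(window, cov_speech, cov_noise, threshold=0.0):
    llr = gaussian_loglik(window, cov_speech) - gaussian_loglik(window, cov_noise)
    return llr > threshold, llr

rng = np.random.default_rng(7)
cov_noise = np.eye(8)
cov_speech = np.eye(8) * 4 + 0.5          # correlated, higher-energy speech model
frame = rng.multivariate_normal(np.zeros(8), cov_speech)
print(vad_decision(frame, cov_speech, cov_noise))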
PT J
AU Christiansen, C
Pedersen, MS
Dau, T
AF Christiansen, Claus
Pedersen, Michael Syskind
Dau, Torsten
TI Prediction of speech intelligibility based on an auditory preprocessing
model
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility; Auditory processing model; Ideal binary mask;
Speech intelligibility index; Speech transmission index
ID SHORT-TERM ADAPTATION; RECEPTION THRESHOLD; AMPLITUDE-MODULATION; AUDIO
QUALITY; QUANTITATIVE MODEL; TRANSMISSION INDEX; FLUCTUATING NOISE;
NERVE RESPONSES; NORMAL-HEARING; ITU STANDARD
AB Classical speech intelligibility models, such as the speech transmission index (STI) and the speech intelligibility index (SII), are based on calculations on the physical acoustic signals. The present study predicts speech intelligibility by combining a psychoacoustically validated model of auditory preprocessing [Dau et al., 1997. J. Acoust. Soc. Am. 102, 2892-2905] with a simple central stage that describes the similarity of the test signal to the corresponding reference signal at the level of the internal representation of the signals. The model was compared with previous approaches, whereby a speech-in-noise experiment was used for training and an ideal binary mask experiment was used for evaluation. All three models were able to capture the trends in the speech-in-noise training data well, but the proposed model provides a better prediction of the binary mask test data, particularly when the binary masks degenerate to a noise vocoder. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Christiansen, Claus; Dau, Torsten] Tech Univ Denmark, Dept Elect Engn, Ctr Appl Hearing Res, DK-2800 Lyngby, Denmark.
[Pedersen, Michael Syskind] Oticon AS, DK-2765 Smorum, Denmark.
RP Christiansen, C (reprint author), Tech Univ Denmark, Dept Elect Engn, Ctr Appl Hearing Res, DK-2800 Lyngby, Denmark.
EM cfc@elektro.dtu.dk; msp@oticon.dk; tda@elektro.dtu.dk
FU Danish research council; Oticon; Widex; GN Resound
FX We are grateful to Ulrik Kjems for providing all the speech material as
well as all the measured data from the psychoacoustic measurements. We
also thank two anonymous reviewers for their very helpful comments on an
earlier version of this paper. This work has been partly supported by
the Danish research council and partly by Oticon, Widex and GN Resound
through a research consortium.
CR ANSI, 1997, S351997 ANSI
BEERENDS JG, 1992, J AUDIO ENG SOC, V40, P963
Beerends JG, 2002, J AUDIO ENG SOC, V50, P765
Bekesy G., 1960, EXPT HEARING
Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929
CARTER GC, 1973, IEEE T ACOUST SPEECH, VAU21, P337, DOI 10.1109/TAU.1973.1162496
Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959
Dau T, 1997, J ACOUST SOC AM, V102, P2906, DOI 10.1121/1.420345
Dau T, 1996, J ACOUST SOC AM, V99, P3623, DOI 10.1121/1.414960
Dau T, 1997, J ACOUST SOC AM, V102, P2892, DOI 10.1121/1.420344
Derleth RP, 2000, J ACOUST SOC AM, V108, P285, DOI 10.1121/1.429464
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
Elhilali M, 2003, SPEECH COMMUN, V41, P331, DOI 10.1016/S0167-6393(02)00134-6
FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247
FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407
Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628
HAGERMAN B, 1984, SCAND AUDIOL, V13, P57, DOI 10.3109/01050398409076258
HAGERMAN B, 1982, SCAND AUDIOL, V11, P79, DOI 10.3109/01050398209076203
HAGERMAN B, 1984, Scandinavian Audiology Supplementum, P1
HAGERMAN B, 1982, SCAND AUDIOL, V11, P191, DOI 10.3109/01050398209076217
HAGERMAN B, 1995, SCAND AUDIOL, V24, P71, DOI 10.3109/01050399509042213
Holube I, 1996, J ACOUST SOC AM, V100, P1703, DOI 10.1121/1.417354
Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812
Huber R, 2006, IEEE T AUDIO SPEECH, V14, P1902, DOI 10.1109/TASL.2006.883259
Jepsen ML, 2008, J ACOUST SOC AM, V124, P422, DOI 10.1121/1.2924135
Karjalainen M., 1985, P IEEE INT C AC SPEE, V10, P608
Kates JM, 2005, J ACOUST SOC AM, V117, P2224, DOI 10.1121/1.1862575
Kim DS, 2005, IEEE T SPEECH AUDI P, V13, P821, DOI 10.1109/TSA.2005.851924
Kjems U, 2009, J ACOUST SOC AM, V126, P1415, DOI 10.1121/1.3179673
KOCH R, 1992, THESIS G AUGUST U GO
LUDVIGSEN C, 1990, ACTA OTO-LARYNGOL, P190
MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0
MILLER GA, 1950, J ACOUST SOC AM, V22, P167, DOI 10.1121/1.1906584
MOORE BCJ, 1977, INTRO PSYCHOL HEARIN
NIELSEN LB, 1993, THESIS TU DENMARK OT
PALMER AR, 1986, HEARING RES, V24, P1, DOI 10.1016/0378-5955(86)90002-X
Patterson RD, 1987, M IOC SPEECH GROUP A
Payton K. L., 2002, PAST PRESENT FUTURE, P125
Payton KL, 1999, J ACOUST SOC AM, V106, P3637, DOI 10.1121/1.428216
Peters RW, 1998, J ACOUST SOC AM, V103, P577, DOI 10.1121/1.421128
Pickles JO, 1988, INTRO PHYSL HEARING
Plack C. J., 2005, THE SENSE OF HEARING
Rhebergen KS, 2005, J ACOUST SOC AM, V117, P2181, DOI 10.1121/1.1861713
Ruggero MA, 1997, J ACOUST SOC AM, V101, P2151, DOI 10.1121/1.418265
SMITH RL, 1977, J NEUROPHYSIOL, V40, P1098
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
Thiede T, 2000, J AUDIO ENG SOC, V48, P3
Verhey JL, 1999, J ACOUST SOC AM, V106, P2733, DOI 10.1121/1.428101
Wagener K, 2003, INT J AUDIOL, V42, P10, DOI 10.3109/14992020309056080
Wang DL, 2005, SPEECH SEPARATION BY HUMANS AND MACHINES, P181, DOI 10.1007/0-387-22794-6_12
WESTERMAN LA, 1984, HEARING RES, V15, P249, DOI 10.1016/0378-5955(84)90032-7
Zilany MSA, 2007, 3 INT IEEE EMBS C NE, P481
Zilany MSA, 2006, J ACOUST SOC AM, V120, P1446, DOI 10.1121/1.2225512
NR 53
TC 21
Z9 21
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL-AUG
PY 2010
VL 52
IS 7-8
BP 678
EP 692
DI 10.1016/j.specom.2010.03.004
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 608HH
UT WOS:000278573700006
ER
PT J
AU Dohen, M
Schwartz, JL
Bailly, G
AF Dohen, Marion
Schwartz, Jean-Luc
Bailly, Gerard
TI Speech and face-to-face communication - An introduction
SO SPEECH COMMUNICATION
LA English
DT Editorial Material
DE Multimodality; Interaction; Nonverbal communication
AB This issue focuses on face-to-face speech communication. Research has demonstrated that this communicative situation is essential to language acquisition and development (e.g., naming). Face-to-face communication is in fact much more than speaking, and speech is greatly influenced, in both substance and content, by this essential form of communication.
Face-to-face communication is multimodal: interaction relies to a large extent on multimodality and nonverbal communication. Speakers not only hear but also see each other producing sounds as well as facial and, more generally, body gestures. Gaze, together with speech, contributes to maintaining mutual attention and to regulating turn-taking, for example. Moreover, speech communication involves not only linguistic but also psychological, affective and social aspects of interaction.
Face-to-face communication is situated: the true challenge of spoken communication is to take into account and integrate information not only from the speakers but also from the entire physical environment in which the interaction takes place. The communicative setting, the "task" in which the interlocutors are involved, their respective roles and the environmental conditions of the conversation indeed greatly influence how the spoken interaction unfolds.
The present issue aims to synthesize the most recent developments in this topic, considering its various aspects from complementary perspectives: cognitive and neurocognitive (multisensory and perceptuo-motor interactions), linguistic (dialogic face-to-face interactions), paralinguistic (emotions and affects, turn-taking, mutual attention), and computational (animated conversational agents, multimodal interactive communication systems). (C) 2010 Elsevier B.V. All rights reserved.
C1 [Dohen, Marion; Schwartz, Jean-Luc; Bailly, Gerard] Grenoble Univ, GIPSA Lab, Speech & Cognit Dept, CNRS,UMR 5216, Grenoble, France.
RP Dohen, M (reprint author), INPG, 961 Rue Houille Blanche,Domaine Univ,BP 46, F-38402 St Martin Dheres, France.
EM Marion.Dohen@gipsa-lab.grenoble-inp.fr
NR 0
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 477
EP 480
DI 10.1016/j.specom.2010.02.016
PG 4
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000001
ER
PT J
AU Weiss, B
Kuehnel, C
Wechsung, I
Fagel, S
Moller, S
AF Weiss, Benjamin
Kuehnel, Christine
Wechsung, Ina
Fagel, Sascha
Moeller, Sebastian
TI Quality of talking heads in different interaction and media contexts
SO SPEECH COMMUNICATION
LA English
DT Article
DE Embodied conversational agent; Smart home; Talking head; Usability; WOZ
ID AGENT; INTERFACE; FACE
AB We investigate the impact of three different factors on the quality of talking heads as metaphors of a spoken dialogue system in the smart home domain. The main focus lies on the effect of voice and head characteristics on audio and video quality, as well as overall quality. Furthermore, the influence of interactivity and of media context on user perception is analysed. For this purpose two subsequent experiments were conducted: the first was designed as a non-interactive rating test of videos of talking heads, while the second experiment was interactive. Here, the participants had to solve a number of tasks in dialogue with a talking head. To assess the impact of the media context, redundant information was provided via an additional visual output channel to half of the participants. As a secondary effect, the importance of participants' gender is examined. It is shown that perceived quality differences observed in the non-interactive setting are blurred when the interactivity and media contexts provide distraction from the talking head. Furthermore, a simple additional feedback screen improves the perceived quality of the talking heads. Gender effects are negligible concerning the ratings in interaction, but female and male participants exhibit different behaviour in the experiment. This advocates for more realistic evaluation settings in order to increase the external validity of the obtained quality judgements. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Weiss, Benjamin; Kuehnel, Christine; Wechsung, Ina; Moeller, Sebastian] TU Berlin, Deutsch Telekom Labs, Qual & Usabil Lab, D-10587 Berlin, Germany.
[Fagel, Sascha] TU Berlin, Inst Sprache & Kommunikat, D-10587 Berlin, Germany.
RP Weiss, B (reprint author), TU Berlin, Deutsch Telekom Labs, Qual & Usabil Lab, Ernst Reuter Pl 7, D-10587 Berlin, Germany.
EM BWeiss@telekom.de; Christine.Kuehnel@telekom.de;
Ina.Wechsung@telekom.de; Sascha.Fagel@tu-berlin.de;
Sebastian.Moeller@telekom.de
FU Deutsche Forschungsgemeinschaft DFG (German Research Community) [MO
1038/6-1]
FX The project was financially supported by the Deutsche
Forschungsgemeinschaft DFG (German Research Community), Grant MO
1038/6-1.
CR Adcock A. B., 2005, J INTERACTIVE LEARNI, V16, P195
Andre E, 1998, KNOWL-BASED SYST, V11, P25, DOI 10.1016/S0950-7051(98)00057-4
Berry DC, 2005, INT J HUM-COMPUT ST, V63, P304, DOI 10.1016/j.ijhcs.2005.03.006
BREITFUSS W, 2008, P AISB 2008 S MULT O, P18
Buisine S, 2004, HUM COM INT, V7, P217
Burnham D., 2008, P INT C AUD VIS SPEE
CANADA K, 1991, ETR&D-EDUC TECH RES, V39, P43, DOI 10.1007/BF02298153
Cassell J., 2000, EMBODIED CONVERSATIO
Costello A.B., 2005, PRACTICAL ASSESSMENT, V10
Cowell AJ, 2005, INT J HUM-COMPUT ST, V62, P281, DOI 10.1016/j.ijhcs.2004.11.008
Dehn DM, 2000, INT J HUM-COMPUT ST, V52, P1, DOI 10.1006/ijhc.1999.0325
Dutoit T., 1996, P ICSLP 96 PHIL, V3, P1393, DOI 10.1109/ICSLP.1996.607874
ERICKSON T, 1997, DESIGNING AGENTS PEO
Fagel S., 2007, P INT C AUD VIS SPEE
FAGEL S, 2008, P INT C AUD VIS SPEE
Fagel S, 2004, SPEECH COMMUN, V44, P141, DOI 10.1016/j.specom.2004.10.006
Frokjaer E, 2000, P ACM C HUM FACT COM, P345
Gong L, 2008, COMPUT HUM BEHAV, V24, P2074, DOI 10.1016/j.chb.2007.09.008
Gong L, 2007, HUM COMMUN RES, V33, P163, DOI 10.1111/j.1468-2958.2007.00295.x
HORN JL, 1965, PSYCHOMETRIKA, V30, P179, DOI 10.1007/BF02289447
Hutcheson G. D., 1999, MULTIVARIATE SOCIAL
KING WJ, 1996, P C HUM FACT COMP SY
Kipp M, 2004, THESIS BOCA RATON
Koda T., 1996, Proceedings. 5th IEEE International Workshop on Robot and Human Communication RO-MAN'96 Tsukuba (Cat. No.96TH8179), DOI 10.1109/ROMAN.1996.568812
Kramer N.C., 2002, VIRTUELLE REALITATEN, P203
Kramer N.C., 2008, SOZIALE WIRKUNGEN VI
KUHNEL C, 2008, P INT C MULT INT ICM
Lester J. C., 1997, P 8 WORLD C ART INT, P23
MASSARO DW, 2000, EMBODIED CONVERSATIO, P286
MCBREEN HM, 2000, P AAAI FALL S SOC IN, P122
MOLLER S, 2003, P 8 EUR C SPEECH COM, V3, P1953
NASS C, 1999, P INT C AUD VIS SPEE
Nass C, 1997, J APPL SOC PSYCHOL, V27, P864, DOI 10.1111/j.1559-1816.1997.tb00275.x
Nowak K. L., 2004, J COMPUTER MEDIATED, V9
Nowak K. L., 2005, J COMPUTER MEDIATED, V11
Pandzic IS, 1999, VISUAL COMPUT, V15, P330, DOI 10.1007/s003710050182
Pelachaud C., 2004, BROWS TRUST EVALUATI
Pollick F. E., 2010, P SIGCHI C HUM FACT, V40, P69, DOI 10.1145/1240624.1240626
PRENDINGER H, 2004, KI ZEITSCHRIFT GERMA, V1, P4
Schroder M., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025708916924
Sproull L, 1996, HUM-COMPUT INTERACT, V11, P97, DOI 10.1207/s15327051hci1102_1
Takeuchi A., 1995, Human Factors in Computing Systems. CHI'95 Conference Proceedings
THEOBALD B, 2008, P INT 2008 BRISB AUS, P2310
VANMULKEN S, 1998, P HCI PEOPL COMP
Walker MA, 1998, COMPUT SPEECH LANG, V12, P317, DOI 10.1006/csla.1998.0110
WEISS B, 2009, P HUM COMP INT INT H, P349
XIAO J, 2002, EMBODIED CONVERSATIO
Xiao J., 2006, THESIS GEORGIA I TEC
Zimmerman J, 2005, P C DES PLEAS PROD I, P233
NR 49
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 481
EP 492
DI 10.1016/j.specom.2010.02.011
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000002
ER
PT J
AU Badin, P
Tarabalka, Y
Elisei, F
Bailly, G
AF Badin, Pierre
Tarabalka, Yuliya
Elisei, Frederic
Bailly, Gerard
TI Can you 'read' tongue movements? Evaluation of the contribution of
tongue display to speech understanding
SO SPEECH COMMUNICATION
LA English
DT Article
DE Lip reading; Tongue reading; Audiovisual speech perception; Virtual
audiovisual talking head; Augmented speech; ElectroMagnetic
Articulography (EMA)
ID PERCEPTION; FRENCH; MODELS
AB Lip reading relies on visible articulators to ease speech understanding. However, the lips and face alone provide very incomplete phonetic information: the tongue, which is generally not entirely visible, carries an important part of the articulatory information not accessible through lip reading. The question is thus whether the direct and full vision of the tongue allows tongue reading. We have therefore generated a set of audiovisual VCV stimuli with an audiovisual talking head that can display all speech articulators, including the tongue, in an augmented speech mode. The talking head is a virtual clone of a human speaker, and the articulatory movements have also been captured on this speaker using ElectroMagnetic Articulography (EMA). These stimuli have been played to subjects in audiovisual perception tests in various presentation conditions (audio signal alone, audiovisual signal with profile cutaway display with or without tongue, complete face), at various Signal-to-Noise Ratios. The results indicate: (1) the possibility of implicit learning of tongue reading, (2) better consonant identification with the cutaway presentation with the tongue than without the tongue, (3) no significant difference between the cutaway presentation with the tongue and the more ecological rendering of the complete face, (4) a predominance of lip reading over tongue reading, but (5) a certain natural human capability for tongue reading when the audio signal is strongly degraded or absent. We conclude that these tongue reading capabilities could be used for applications in the domains of speech therapy for speech-retarded children, perception and production rehabilitation of hearing-impaired children, and pronunciation training for second-language learners. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Badin, Pierre] Grenoble Univ, GIPSA Lab, DPC, ICP,CNRS,ENSE3,UMR 5216, F-38402 St Martin Dheres, France.
RP Badin, P (reprint author), Grenoble Univ, GIPSA Lab, DPC, ICP,CNRS,ENSE3,UMR 5216, 961 Rue Houille Blanche,BP 46, F-38402 St Martin Dheres, France.
EM Pierre.Badin@gipsa-lab.grenoble-inp.fr;
Yuliya.Tarabalka@gipsa-lab.grenoble-inp.fr;
Frederic.Elisei@gipsa-lab.grenoble-inp.fr;
Gerard.Bailly@gipsa-lab.grenoble-inp.fr
CR Badin P., 2006, 7 INT SEM SPEECH PRO, P395
Badin P, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2635
BALTER O, 2005, 7 INT ACM SIGACCESS, P36
Benoit C, 1998, SPEECH COMMUN, V26, P117, DOI 10.1016/S0167-6393(98)00045-4
Benoit C., 1996, SPEECHREADING HUMANS, P315
CATHIARD MA, 1996, SPEECHREADING HUMANS, P211
CORNETT RO, 1967, AM ANN DEAF, V112, P3
Engwall O, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2631
ERBER NP, 1975, J SPEECH HEAR DISORD, V40, P481
Fagel S, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2643
GRAUWINKEL K, 2007, INTERSPEECH, P706
Hoole P., 1997, FORSCHUNGSBERICHTE I, V35, P177
IJSSELDIJK FJ, 1992, J SPEECH HEAR RES, V35, P466
BENOIT C, 1994, J SPEECH HEAR RES, V37, P1195
Kroger BJ, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2639
Massaro D., 2003, EUR 2003 GEN SWITZ, P2249
Massaro DW, 2004, J SPEECH LANG HEAR R, V47, P304, DOI 10.1044/1092-4388(2004/025)
Massaro DW, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2623
Mills A. E., 1987, HEARING EYE PSYCHOL, P145
MONTGOMERY D, 1981, PSYCHOL RES, V43
Mulford R., 1988, EMERGENT LEXICON CHI, P293
Narayanan S, 2004, J ACOUST SOC AM, V115, P1771, DOI 10.1121/1.1652588
Odisio M, 2004, SPEECH COMMUN, V44, P63, DOI 10.1016/j.specom.2004.10.008
PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204
Serrurier A, 2008, J ACOUST SOC AM, V123, P2335, DOI 10.1121/1.2875111
STOELGAMMON C, 1988, J SPEECH HEAR DISORD, V53, P302
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
TYEMURRAY N, 1993, NCVS STATUS PROG REP, V4, P41
VIHMAN MM, 1985, LANGUAGE, V61, P397, DOI 10.2307/414151
Wik P, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P2627
NR 30
TC 8
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 493
EP 503
DI 10.1016/j.specom.2010.03.002
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000003
ER
PT J
AU Heracleous, P
Beautemps, D
Aboutabit, N
AF Heracleous, Panikos
Beautemps, Denis
Aboutabit, Noureddine
TI Cued Speech automatic recognition in normal-hearing and deaf subjects
SO SPEECH COMMUNICATION
LA English
DT Article
DE French Cued Speech; Hidden Markov models; Automatic recognition; Feature
fusion; Multi-stream HMM decision fusion
ID LANGUAGE
AB This article discusses the automatic recognition of Cued Speech in French based on hidden Markov models (HMMs). Cued Speech is a visual mode which, by using hand shapes in different positions and in combination with lip patterns of speech, makes all the sounds of a spoken language clearly understandable to deaf people. The aim of Cued Speech is to overcome the problems of lipreading and thus enable deaf children and adults to understand spoken language completely. In the current study, the authors demonstrate that visible gestures are as discriminant as audible orofacial gestures. Phoneme recognition and isolated word recognition experiments have been conducted using data from a normal-hearing cuer. The results obtained were very promising, and the study has been extended by applying the proposed methods to a deaf cuer. The results showed no significant differences compared to automatic Cued Speech recognition in a normal-hearing subject. In automatic recognition of Cued Speech, both lip shape and gesture recognition are required, and the integration of the two modalities is of great importance. In this study, the lip shape component is fused with the hand component to realize Cued Speech recognition. Using concatenative feature fusion and multi-stream HMM decision fusion, vowel recognition, consonant recognition, and isolated word recognition experiments have been conducted. For vowel recognition, an 87.6% vowel accuracy was obtained, showing a 61.3% relative improvement compared to the sole use of lip shape parameters. In the case of consonant recognition, a 78.9% accuracy was obtained, showing a 56% relative improvement compared to the use of lip shape only. In addition to vowel and consonant recognition, a complete phoneme recognition experiment using concatenated feature vectors and Gaussian mixture model (GMM) discrimination was conducted, obtaining a 74.4% phoneme accuracy. Isolated word recognition experiments in both normal-hearing and deaf subjects were also conducted, providing word accuracies of 94.9% and 89%, respectively. The obtained results were compared with those obtained using the audio signal, and comparable accuracies were observed. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Heracleous, Panikos] ATR, Intelligent Robot & Commun Labs, Kyoto 6190288, Japan.
[Heracleous, Panikos; Beautemps, Denis; Aboutabit, Noureddine] Univ Grenoble 3, Speech & Cognit Dept, GIPSA Lab, CNRS,UJF,INPG,UMR 5216, F-38402 St Martin Dheres, France.
RP Heracleous, P (reprint author), ATR, Intelligent Robot & Commun Labs, 2-2-2 Hikaridai Seika Cho, Kyoto 6190288, Japan.
EM panikos@atr.jp
FU ANR
FX The authors would like to thank the volunteer cuers Sabine Chevalier,
Myriam Diboui, and Clementine Huriez for the time they spent on Cued
Speech data recording, and also for accepting the recording constraints.
Also the authors would like to thank Christophe Savariaux and Coriandre
Vilain for their help in the Cued Speech material recording. This work
was mainly performed at GIPSA-lab, Speech and Cognition Department and
was supported by the TELMA project (ANR, 2005 edition).
CR ABOUTABIT N, 2006, P ICASSP2006, P633
ABOUTABIT N, 2007, P INT C AUD VIS SPEE
Aboutabit N., 2007, THESIS I NATL POLYTE
Adjoudani A., 1996, SPEECHREADING HUMANS, P461
Auer ET, 2007, J SPEECH LANG HEAR R, V50, P1157, DOI 10.1044/1092-4388(2007/080)
BERNSTEIN L, 2007, CUED SPEECH CUED LAN
Bourlard H., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607145
CORNETT RO, 1967, AM ANN DEAF, V112, P3
Dreuw P, 2007, P INT, P2513
FLEETWOOD E, 1999, CUED LANGUAGE STRUCT
Gibert G, 2005, J ACOUST SOC AM, V118, P1144, DOI 10.1121/1.1944587
Gillick L., 1989, P ICASSP, P532
Hennecke M. E., 1996, SPEECHREADING HUMANS, P331
Leybaert J, 2000, J EXP CHILD PSYCHOL, V75, P291, DOI 10.1006/jecp.1999.2539
MERKX P, 2005, FINAL PROJECT MATH S, V196, P1
MONTGOMERY AA, 1983, J ACOUST SOC AM, V73, P2134, DOI 10.1121/1.389537
Nakamura S., 2002, Proceedings Fourth IEEE International Conference on Multimodal Interfaces, DOI 10.1109/ICMI.2002.1167011
Nefian A. V., 2002, P ICASSP
NICHOLLS GH, 1982, J SPEECH HEAR RES, V25, P262
Ong SCW, 2005, IEEE T PATTERN ANAL, V27, P873, DOI 10.1109/TPAMI.2005.112
Potamianos G, 2003, P IEEE, V91, P1306, DOI 10.1109/JPROC.2003.817150
UCHANSKI RM, 1994, J REHABIL RES DEV, V31, P20
Young S., 2001, HTK BOOK
NR 23
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 504
EP 512
DI 10.1016/j.specom.2010.03.001
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000004
ER
PT J
AU Troille, E
Cathiard, MA
Abry, C
AF Troille, Emilie
Cathiard, Marie-Agnes
Abry, Christian
TI Speech face perception is locked to anticipation in speech production
SO SPEECH COMMUNICATION
LA English
DT Article
DE Auditory-visual speech perception; Speech production; Anticipation
AB At the beginning of the 1990s, it was definitively demonstrated that speech identification can proceed as soon as visual speech information becomes perceivable. Cathiard et al. [Cathiard, M.-A., Tiberghien, G., Tseva, A., Lallouache, M.-T., Escudier, P., 1991. Visual perception of anticipatory rounding during acoustic pauses: a cross-language study. In: Proceedings of the XIIth International Congress of Phonetic Sciences, Aix-en-Provence, France, 4, pp. 50-53] used different V-to-V anticipatory spans, with articulatory measurements, along silent pauses, in a perceptual gating paradigm, and established that up to 200 ms "speech can be seen before it is heard". These results were later framed within a general anticipatory control model, the Movement Expansion Model [Abry, C., Lallouache, M.-T., Cathiard, M.-A., 1996. How can coarticulation models account for speech sensitivity to audio-visual desynchronization? In: Stork, D., Hennecke, M. (Eds.), Speechreading by Humans and Machines, NATO ASI Series F: Computer, Vol. 150. Springer-Verlag, Berlin, Tokyo, pp. 247-255]. Surprisingly, the timing of the vowel and consonant auditory and visual streams has until now remained poorly understood within the typical CVCV span. A first preliminary test was published by Escudier, Benoit and Lallouache [Escudier, P., Benoit, C., Lallouache, M.-T., 1990. Identification visuelle de stimuli associes a l'opposition /i/-/y/: etude statique. Colloque de physique, supplement au no. 2, tome 51, 1er Congres Francais d'Acoustique, C2-541-544]: this is the issue we took up again more than 10 years later. For the first time, we found that "speech can be heard before it is seen". The main purpose of the present contribution is to bring new data in order to clear up apparent contradictions, essentially due to misconceptions of variability and lawfulness in speakers' behavior. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Troille, Emilie; Cathiard, Marie-Agnes; Abry, Christian] Univ Grenoble 3, CRI, EA 610, F-38040 Grenoble 9, France.
[Troille, Emilie] Univ Grenoble 3, GIPSA Lab, CNRS, ICP,INPG,UMR 5216, F-38040 Grenoble 9, France.
RP Cathiard, MA (reprint author), 10 Chemin Ruy, F-38690 Chabons, France.
EM emilie.troille@gmail.com; marieagnes.cathiard@u-grenoble3.fr;
chris.abry@orange.fr
FU Region Rhone-Alpes, France
FX We thank: Deborah Kowalski and Jean-Luc Schwartz, our speakers; Alain
Arnal, Helene Loevenbruck, Solange Rossato and Christophe Savariaux for
their technical assistance at the Institut de la Communication Parlee,
Grenoble, France. This work was supported by a grant from Region
Rhone-Alpes, France.
CR ABRY C, 1995, P ICPHS STOCKH SUED, V4, P152
Abry C., 1995, B COMM PARLEE, V3, P85
Abry C, 1996, NATO ASI SERIES F, V150, P247
ABRY C, 1989, J PHONETICS, V17, P47
BENOIT C, 1986, J ACOUST SOC AM, V80, P1846, DOI 10.1121/1.394302
BENOIT C, 1986, P 12 INT C AC TOR
Byrd D, 2003, J PHONETICS, V31, P149, DOI 10.1016/S0095-4470(02)00085-2
CATHIARD MA, 1996, NATO ASI SERIES F, V150, P211
CATHIARD MA, 2007, SPEC SESS AUD SPEECH, P291
CATHIARD MA, 1991, P 12 INT C PHON SCI, V4, P50
CATHIARD MA, 1994, THESIS GRENOBLE
ESCUDIER P, 1990, C PHYS C FRANC AC S2, V51
Evans N, 2009, BEHAV BRAIN SCI, V32, P429, DOI 10.1017/S0140525X0999094X
Farnetani E., 1999, COARTICULATION THEOR, P144
Finney D. J., 1971, PROBIT ANAL, V3rd
GAITENBY J, 1965, SR2 HASK LAB, P1
Ghazanfar AA, 2005, J NEUROSCI, V25, P5004, DOI 10.1523/JNEUROSCI.0799-05.2005
GROSJEAN F, 1980, PERCEPT PSYCHOPHYS, V28, P267, DOI 10.3758/BF03204386
Grosjean Francois, 1997, GUIDE SPOKEN WORD RE, P597
Lallouache M. T., 1991, THESIS ENSERG GRENOB
Munson B, 2004, J SPEECH LANG HEAR R, V47, P58, DOI [10.1044/1092-4388(2004/006), 10.1044/1092-4388(2204/006)]
NITTROUER S, 1988, J ACOUST SOC AM, V84, P1653, DOI 10.1121/1.397180
Noiray A, 2008, EMERGENCE LANGUAGE A, P100
NOIRAY A, 2006, P 7 INT SEM SPEECH P, P319
SCHWARTZ JL, 1993, J PHONETICS, V21, P411
SMEELE PMT, 1994, P INT C SPOK LANG PR, V3, P1431
SMEELE PMT, 1994, THESIS TU DELFT DELF
Summerfield Q, 1987, HEARING EYE PSYCHOL, P3
Troille E., 2007, P INT C AUD VIS SPEE, P281
TULLER B, 1984, J ACOUST SOC AM, V76, P1030, DOI 10.1121/1.391421
WHALEN DH, 1984, PERCEPT PSYCHOPHYS, V35, P49, DOI 10.3758/BF03205924
NR 31
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 513
EP 524
DI 10.1016/j.specom.2009.12.005
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000005
ER
PT J
AU Fort, M
Spinelli, E
Savariaux, C
Kandel, S
AF Fort, Mathilde
Spinelli, Elsa
Savariaux, Christophe
Kandel, Sonia
TI The word superiority effect in audiovisual speech perception
SO SPEECH COMMUNICATION
LA English
DT Article
DE Audiovisual speech; Lexical access; Speech perception in noise; Word
recognition
ID RECOGNITION; FRENCH; IDENTIFICATION; RESTORATION; CONTEXT
AB Seeing the facial gestures of a speaker enhances phonemic identification in noise. The goal of this study was to assess whether the visual information regarding consonant articulation activates lexical representations. We conducted a phoneme monitoring task with words and pseudo-words in audio-only (A) and audiovisual (AV) contexts with two levels of white noise masking the acoustic signal. The results confirmed that visual information enhances consonant detection in noisy conditions and also revealed that it accelerates the phoneme detection process. The consonants were detected faster in the AV than in the A-only condition. Furthermore, when the acoustic signal was degraded, the consonant phonemes were better recognized when they were embedded in words rather than in pseudo-words in the AV condition. This provides evidence indicating that visual information on phoneme identity can contribute to lexical activation processes during word recognition. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Fort, Mathilde; Spinelli, Elsa; Kandel, Sonia] Univ Pierre Mendes France, Lab Psychol & NeuroCognit, CNRS, UMR 5105, F-38040 Grenoble 9, France.
[Spinelli, Elsa; Kandel, Sonia] Inst Univ France, F-75005 Paris, France.
[Savariaux, Christophe] Univ Grenoble 3, GIPSA Lab, Dpt Parole & Cognit, CNRS,UMR 5216, F-38040 Grenoble 9, France.
RP Fort, M (reprint author), Univ Pierre Mendes France, Lab Psychol & NeuroCognit, CNRS, UMR 5105, BP 47, F-38040 Grenoble 9, France.
EM mathilde.fort@upmf-grenoble.fr; elsa.spinelli@upmf-grenoble.fr;
christophe.savariaux@gipsa-lab.inpg.fr; sonia.kandel@upmf-grenoble.fr
CR AMANO J, 1998, P AUD VIS SPEECH PRO, P43
Barutchu A, 2008, EUR J COGN PSYCHOL, V20, P1, DOI 10.1080/09541440601125623
Brancazio L, 2004, J EXP PSYCHOL HUMAN, V30, P445, DOI 10.1037/0096-1523.30.3.445
Buchwald AB, 2009, LANG COGNITIVE PROC, V24, P580, DOI 10.1080/01690960802536357
COLIN C, 2003, ANN PSYCHOL, V104, P497
Connine CM, 1996, LANG COGNITIVE PROC, V11, P635
CUTLER A, 1987, COGNITIVE PSYCHOL, V19, P141, DOI 10.1016/0010-0285(87)90010-7
ERBER NP, 1969, J SPEECH HEAR RES, V12, P423
FRAUENFELDER UH, 1990, J EXP PSYCHOL HUMAN, V16, P77, DOI 10.1037/0096-1523.16.1.77
GANONG WF, 1980, J EXP PSYCHOL HUMAN, V6, P110, DOI 10.1037/0096-1523.6.1.110
Gow DW, 2003, PERCEPT PSYCHOPHYS, V65, P575, DOI 10.3758/BF03194584
Green K. P., 1998, ADV PSYCHOL SPEECHRE, P3
BENOIT C, 1994, J SPEECH HEAR RES, V37, P1195
Kim J, 2004, COGNITION, V93, pB39, DOI 10.1016/j.cognition.2003.11.003
LOCASTO PC, 2007, LANG SPEECH, V50, P54
MARSLENWILSON W, 1990, ACL MIT NAT, P148
MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0
MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0
MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526
New B, 2001, ANN PSYCHOL, V101, P447
Norris D, 2000, BEHAV BRAIN SCI, V23, P299, DOI 10.1017/S0140525X00003241
Robert-Ribes J, 1998, J ACOUST SOC AM, V103, P3677, DOI 10.1121/1.423069
Sams M, 1998, SPEECH COMMUN, V26, P75, DOI 10.1016/S0167-6393(98)00051-X
SAMUEL AG, 1981, J EXP PSYCHOL GEN, V110, P474, DOI 10.1037/0096-3445.110.4.474
Spinelli E., 2005, PSYCHOL LANGAGE ECRI
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
Summerfield Q, 1987, HEARING EYE PSYCHOL, P3
Tiippana K, 2004, EUR J COGN PSYCHOL, V16, P457, DOI 10.1080/09541440340000268
WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392
NR 29
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 525
EP 532
DI 10.1016/j.specom.2010.02.005
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000006
ER
PT J
AU Sato, M
Buccino, G
Gentilucci, M
Cattaneo, L
AF Sato, Marc
Buccino, Giovanni
Gentilucci, Maurizio
Cattaneo, Luigi
TI On the tip of the tongue: Modulation of the primary motor cortex during
audiovisual speech perception
SO SPEECH COMMUNICATION
LA English
DT Article
DE Audiovisual speech perception; Transcranial magnetic stimulation; Motor
system; Mirror-neuron system; Motor theory of speech perception; McGurk
effect
ID TRANSCRANIAL MAGNETIC STIMULATION; CORTICO-HYPOGLOSSAL PROJECTIONS;
HUMAN AUDITORY-CORTEX; HUMAN BRAIN-STEM; VISUAL SPEECH; LINGUAL MUSCLES;
SEEING VOICES; HEARING LIPS; BROCAS AREA; EXCITABILITY
AB Recent neurophysiological studies show that cortical brain regions involved in the planning and execution of speech gestures are also activated in processing speech sounds. These findings suggest that speech perception is in part mediated by reference to the motor actions afforded in the speech signal. Since interactions between auditory and visual modalities are beneficial in speech perception and face-to-face communication, we used single-pulse transcranial magnetic stimulation (TMS) to investigate whether audiovisual speech perception might induce excitability changes in the left tongue-related primary motor cortex and whether acoustic and visual speech inputs might differentially modulate motor excitability. To this aim, motor-evoked potentials obtained with focal TMS applied over the left tongue primary motor cortex were recorded from participants' tongue muscles during the perception of matching and conflicting audiovisual syllables incorporating tongue- and/or lip-related phonemes (i.e. visual and acoustic /ba/, /ga/ and /da/, visual /ba/ and acoustic /ga/, visual /ga/ and acoustic /ba/). Compared to the presentation of congruent /ba/ syllable, which primarily involves lip movements when pronounced, exposure to syllables incorporating visual and/or acoustic tongue-related phonemes induced a greater excitability of the left tongue primary motor cortex as early as 100-200 ms after the consonantal onset of the acoustically presented syllable. These results provide evidence that both visual and auditory modalities specifically modulate activity in the tongue primary motor cortex at an early stage during audiovisual speech perception. Because no interaction between the two modalities was observed, these results suggest that information from each sensory channel is recoded separately in the primary motor cortex at that point of time. These findings are discussed in relation to theories assuming a link between perception and action in the human speech processing system and theoretical models of audiovisual interaction. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Sato, Marc] CNRS, GIPSA Lab, UMR 5216, Dept Parole & Cognit, F-38040 Grenoble 9, France.
[Sato, Marc] Grenoble Univ, F-38040 Grenoble 9, France.
[Buccino, Giovanni] Magna Graecia Univ Catanzaro, Dept Med Sci, Catanzaro, Italy.
[Gentilucci, Maurizio] Univ Parma, Dept Neurosci, Sez Fisiol, I-43100 Parma, Italy.
[Cattaneo, Luigi] Univ Trent, Ctr Mind Brain Sci, CIMeC, I-38100 Trento, Italy.
RP Sato, M (reprint author), CNRS, GIPSA Lab, UMR 5216, Dept Parole & Cognit, 1180 Ave Cent,BP 25, F-38040 Grenoble 9, France.
EM marc.sato@gipsa-lab.inpg.fr
FU MIUR (Ministero Italiano dell'Istruzione, dell'Universita e della
Ricerca); CNRS (Centre National de la Recherche Scientifique)
FX We wish to thank Elena Borra and Helene Loevenbruck for their help with
this study. This research was supported by MIUR (Ministero Italiano
dell'Istruzione, dell'Universita e della Ricerca) and CNRS (Centre
National de la Recherche Scientifique).
CR Arbib MA, 2005, BEHAV BRAIN SCI, V28, P105, DOI 10.1017/S0140525X05000038
Besle J, 2004, EUR J NEUROSCI, V20, P2225, DOI 10.1111/j.1460-9568.2004.03670.x
Brancazio L, 2005, PERCEPT PSYCHOPHYS, V67, P759, DOI 10.3758/BF03193531
Callan DE, 2004, J COGNITIVE NEUROSCI, V16, P805, DOI 10.1162/089892904970771
Callan DE, 2003, NEUROREPORT, V14, P2213, DOI 10.1097/00001756-200312020-00016
Calvert GA, 2000, CURR BIOL, V10, P649, DOI 10.1016/S0960-9822(00)00513-3
Calvert GA, 2003, J COGNITIVE NEUROSCI, V15, P57, DOI 10.1162/089892903321107828
Chen CH, 1999, NEUROLOGY, V52, P411
D'Ausillo A, 2009, CURR BIOL, V19, P381, DOI 10.1016/j.cub.2009.01.017
Fadiga L, 2005, CURR OPIN NEUROBIOL, V15, P213, DOI 10.1016/j.conb.2005.03.013
Fadiga L, 2002, EUR J NEUROSCI, V15, P399, DOI 10.1046/j.0953-816x.2001.01874.x
Galantucci B, 2006, PSYCHON B REV, V13, P361, DOI 10.3758/BF03193857
Gentilucci M, 2006, NEUROSCI BIOBEHAV R, V30, P949, DOI 10.1016/j.neubiorev.2006.02.004
Gentilucci M, 2005, EXP BRAIN RES, V167, P66, DOI 10.1007/s00221-005-0008-z
Grant KW, 1998, J ACOUST SOC AM, V103, P2677, DOI 10.1121/1.422788
Green K P, 1998, HEARING EYE, P3
Hertrich I, 2007, NEUROPSYCHOLOGIA, V45, P1342, DOI 10.1016/j.neuropsychologia.2006.09.019
Jones JA, 2003, NEUROREPORT, V14, P1129, DOI 10.1097/01.wnr.0000074343.81633.2a
BENOIT C, 1994, J SPEECH HEAR RES, V37, P1195
Klucharev Vasily, 2003, Brain Res Cogn Brain Res, V18, P65
Liberman A. M., 2000, TRENDS COGN SCI, V3, P254
LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6
LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279
MACLEOD A, 1987, British Journal of Audiology, V21, P131, DOI 10.3109/03005368709077786
Massaro D. W., 1998, PERCEIVING TALKING F
MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0
Meister IG, 2007, CURR BIOL, V17, P1692, DOI 10.1016/j.cub.2007.08.064
Mottonen R, 2004, NEUROSCI LETT, V363, P112, DOI 10.1016/j.neulet.2004.03.076
Möttönen Riikka, 2002, Brain Res Cogn Brain Res, V13, P417
MOTTONEN R, 2004, THESIS HELSINKI U TE
Muellbacher W, 1997, BRAIN, V120, P1909, DOI 10.1093/brain/120.10.1909
MUELLBACHER W, 1994, J NEUROL NEUROSUR PS, V57, P309, DOI 10.1136/jnnp.57.3.309
Nishitani N, 2002, NEURON, V36, P1211, DOI 10.1016/S0896-6273(02)01089-9
Ojanen V, 2005, NEUROIMAGE, V25, P333, DOI 10.1016/j.neuroimage.2004.12.001
OJANEN V, 2005, THESIS HELSINKI U TE
OLDFIELD RC, 1971, NEUROPSYCHOLOGIA, V9, P97, DOI 10.1016/0028-3932(71)90067-4
Paulesu E, 2003, J NEUROPHYSIOL, V90, P2005, DOI 10.1152/jn.00926.2002
Pekkola J, 2006, NEUROIMAGE, V29, P797, DOI 10.1016/j.neuroimage.2005.09.069
Pulvermuller F, 2006, P NATL ACAD SCI USA, V103, P7865, DOI 10.1073/pnas.0509989103
Reisberg D., 1987, HEARING EYE PSYCHOL, P97
Rizzolatti G, 2004, ANNU REV NEUROSCI, V27, P169, DOI 10.1146/annurev.neuro.27.070203.144230
Rizzolatti G, 1998, TRENDS NEUROSCI, V21, P188, DOI 10.1016/S0166-2236(98)01260-0
Rodel RMW, 2003, ANN OTO RHINOL LARYN, V112, P71
ROSSINI PM, 1994, ELECTROEN CLIN NEURO, V91, P79, DOI 10.1016/0013-4694(94)90029-9
Roy AC, 2008, J PHYSIOLOGY-PARIS, V102, P101, DOI 10.1016/j.jphysparis.2008.03.006
SAMS M, 1991, NEUROSCI LETT, V127, P141, DOI 10.1016/0304-3940(91)90914-F
Sato M, 2009, BRAIN LANG, V111, P1, DOI 10.1016/j.bandl.2009.03.002
SCHWARTZ JL, 1998, ADV PSYCHOL SPEECHRE, P85
Schwartz JL, 2008, REV FR LING APPL, V13, P9
Sekiyama K, 2003, NEUROSCI RES, V47, P277, DOI 10.1016/S0168-0102(03)00214-1
Skipper JI, 2007, CEREB CORTEX, V17, P2387, DOI 10.1093/cercor/bhl147
Skipper JI, 2005, NEUROIMAGE, V25, P76, DOI 10.1016/j.neuroimage.2004.11.006
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
Sundara M, 2001, NEUROREPORT, V12, P1341, DOI 10.1097/00001756-200105250-00010
Urban PP, 1996, BRAIN, V119, P1031, DOI 10.1093/brain/119.3.1031
van Wassenhove V, 2005, P NATL ACAD SCI USA, V102, P1181, DOI 10.1073/pnas.0408949102
Wassermann EM, 1998, EVOKED POTENTIAL, V108, P1, DOI 10.1016/S0168-5597(97)00096-8
Watkins K, 2004, J COGNITIVE NEUROSCI, V16, P978, DOI 10.1162/0898929041502616
Watkins KE, 2003, NEUROPSYCHOLOGIA, V41, P989, DOI 10.1016/S0028-3932(02)00316-0
Wilson SM, 2004, NAT NEUROSCI, V7, P701, DOI 10.1038/nn1263
Wilson SM, 2006, NEUROIMAGE, V33, P316, DOI 10.1016/j.neuroimage.2006.05.032
NR 61
TC 11
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 533
EP 541
DI 10.1016/j.specom.2009.12.004
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000007
ER
PT J
AU Flecha-Garcia, ML
AF Flecha-Garcia, Maria L.
TI Eyebrow raises in dialogue and their relation to discourse structure,
utterance function and pitch accents in English
SO SPEECH COMMUNICATION
LA English
DT Article
DE Dialogue structure; Pitch accent; Facial movement; Non-verbal
communication; Multimodality
ID PROSODIC PROMINENCE; SPEECH
AB Face-to-face interaction involves both verbal and non-verbal communication. Studies have suggested a relationship between eyebrow raises and the verbal message, but our knowledge is still limited. If we could characterise a relation between eyebrow raises and the linguistic signal we could better understand and reproduce human multimodal communication behaviour. Based on previous observations on body movement, this research investigated eyebrow raising in face-to-face dialogue in English in connection with (1) discourse structure and utterance function and (2) pitch accents. Small but significant results partially supported the predictions, suggesting a link between eyebrow raising and spoken language. Eyebrow raises occurred more frequently at the start of high-level discourse segments than anywhere else in the dialogue, and more frequently in instructions than in requests for or acknowledgements of information. Interestingly, contrary to the hypothesis queries did not have more raises than any other type of utterance. Additionally, as predicted, eyebrow raises seemed to be aligned with pitch accents, preceding them by an average of 0.06 s. Possible linguistic functions are proposed, namely the structuring and emphasising of information in the verbal message. Finally, methodological issues and practical applications are briefly discussed. (C) 2009 Elsevier B.V. All rights reserved.
C1 Univ Edinburgh, Sch Philosophy Psychol & Language Sci, Edinburgh EH8 9AD, Midlothian, Scotland.
RP Flecha-Garcia, ML (reprint author), Univ Edinburgh, Sch Philosophy Psychol & Language Sci, Dugald Stewart Bldg,3 Charles St, Edinburgh EH8 9AD, Midlothian, Scotland.
EM marisaflecha@gmail.com
FU EPSRC
FX I am very grateful to Dr. Ellen G. Bard and Prof. D. Robert Ladd for
their valuable help in the supervision of this project. I also thank Dr.
Holly Brannigan and Dr. Marc Swerts for their constructive comments.
This project was partially funded by EPSRC.
CR ANDERSON AH, 1991, LANG SPEECH, V34, P351
Bolinger D., 1986, INTONATION ITS PARTS
Carletta J, 1997, COMPUT LINGUIST, V23, P13
Cassell J., 2001, P 41 ANN M ASS COMP, P106
Cave C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607235
Cave C, 2002, P 7 INT C SPOK LANG, P2353
Chovil N., 1991, RES LANG SOC INTERAC, V25, P163
De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018
Ekman P., 1979, HUMAN ETHOLOGY, P169
FLECHAGARCIA ML, 2006, THESIS U EDINBURGH E
Granstrom B, 2005, SPEECH COMMUN, V46, P473, DOI 10.1016/j.specom.2005.02.017
Keating P., 2003, P 16 INT C PHON SCI, P2071
Kendon A., 1980, RELATIONSHIP VERBAL, P207
Kendon Adam, 1972, STUDIES DYADIC COMMU, P177
Krahmer E, 2007, J MEM LANG, V57, P396, DOI 10.1016/j.jml.2007.06.005
KRAHMER E, 2004, BROWS TRUST EVALUATI
Ladd D. R., 1996, INTONATIONAL PHONOLO
Ladd DR, 2003, J PHONETICS, V31, P81, DOI 10.1016/S0095-4470(02)00073-6
McClave E, 1998, J PSYCHOLINGUIST RES, V27, P69, DOI 10.1023/A:1023274823974
McClave EZ, 2000, J PRAGMATICS, V32, P855, DOI 10.1016/S0378-2166(99)00079-X
McNeill D., 1992, HAND MIND WHAT GESTU
McNeill D., 2001, GESTURE, V1, P9, DOI DOI 10.1075/GEST.1.1.03MCN)
Neter J, 1996, APPL LINEAR STAT MOD
Pitrelli J., 1994, P 3 INT C SPOK LANG, P123
Silverman K., 1992, P INT C SPOK LANG PR, P867
Srinivasan RJ, 2003, LANG SPEECH, V46, P1
Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001
NR 27
TC 10
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 542
EP 554
DI 10.1016/j.specom.2009.12.003
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000008
ER
PT J
AU Cvejic, E
Kim, J
Davis, C
AF Cvejic, Erin
Kim, Jeesun
Davis, Chris
TI Prosody off the top of the head: Prosodic contrasts can be discriminated
by head motion
SO SPEECH COMMUNICATION
LA English
DT Article
DE Prosody; Suprasegmental; Visual speech; Rigid head motion; Upper head
ID AUDIOVISUAL SPEECH-PERCEPTION; VISUAL-PERCEPTION; INTONATION;
STATEMENTS; QUESTIONS; ENGLISH; STRESS; FOCUS
AB The current study investigated people's ability to discriminate prosody related head and face motion from videos showing only the upper face of the speaker saying the same sentence with different prosody. The first two experiments used a visual-visual matching task. These videos were either fully textured (Experiment 1) or showed only the outline of the speaker's head (Experiment 2). Participants were presented with two stimulus pairs of silent videos, with their task to select the pair that had the same prosody. The overall results of the visual-visual matching experiments showed that people could discriminate same- from different-prosody sentences with a high degree of accuracy. Similar levels of discrimination performance were obtained for the fully textured (containing rigid and non-rigid motions) and the outline only (rigid motion only) videos. Good visual-visual matching performance shows that people are sensitive to the underlying factor that determined whether the movements were the same or not, i.e., the production of prosody. However, testing auditory-visual matching provides a more direct test concerning people's sensitivity to how head motion/face motion relates to spoken prosody. Experiments 3 (with fully textured videos) and 4 (with outline only videos) employed a cross-modal matching task that required participants to match auditory with visual tokens that had the same prosody. As with the previous experiments, participants performed this discrimination very well. Similarly, no decline in performance was observed for the outline only videos. This result supports the proposal that rigid head motion provides an important visual cue to prosody. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Cvejic, Erin; Kim, Jeesun; Davis, Chris] Univ Western Sydney, MARCS Auditory Labs, Penrith, NSW, Australia.
RP Cvejic, E (reprint author), Univ Western Sydney, MARCS Auditory Labs, Bldg 5,Bankstown Campus,Locked Bag 1797, Penrith, NSW, Australia.
EM e.cvejic@uws.edu.au; j.kim@uws.edu.au; chris.davis@uws.edu.au
FU School of Psychology, University of Western Sydney and MARCS Auditory
Laboratories; Australian Research Council [DP0666857, TS0669874]
FX The authors wish to thank Bronson Harry for his patient assistance with
the recording of audio-visual stimuli, and two anonymous reviewers and
the guest editors for their helpful suggestions to improve the
manuscript. The first author also wishes to acknowledge the support of
the School of Psychology, University of Western Sydney and MARCS
Auditory Laboratories for providing generous financial support. The
second and third authors acknowledge support from Australian Research
Council (DP0666857 & TS0669874).
CR Benoit C, 1998, SPEECH COMMUN, V26, P117, DOI 10.1016/S0167-6393(98)00045-4
BERNSTEIN LE, 1989, J ACOUST SOC AM, V85, P397, DOI 10.1121/1.397690
Boersma P., 2008, PRAAT DOING PHONETIC
BOLINGER D, 1972, LANGUAGE, V48, P633, DOI 10.2307/412039
Bolinger D., 1989, INTONATION ITS USES
BURNHAM D, 2007, INTERSPEECH, P698
BURNHAM D, 2007, INTERSPEECH, P701
Cave C, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2175
Cutler A, 1997, LANG SPEECH, V40, P141
Davis C, 2006, COGNITION, V100, pB21, DOI 10.1016/j.cognition.2005.09.002
Davis C, 2004, Q J EXP PSYCHOL-A, V57, P1103, DOI 10.1080/02724980343000701
DOHEN M, 2005, INTERSPEECH 2005, P2413
DOHEN M, 2005, INTERSPEECH, P2416
DOHEN M, 2006, SPEECH PROSODY, P224
DOHEN M, 2006, SPEECH PROSODY 2006, P221
Dohen M, 2004, SPEECH COMMUN, V44, P155, DOI 10.1016/j.specom.2004.10.009
EADY SJ, 1986, J ACOUST SOC AM, V80, P402, DOI 10.1121/1.394091
Erickson D, 1998, LANG SPEECH, V41, P399
FLECHAGARCIA ML, 2006, THESIS U EDINBURGH
Forster KI, 2003, BEHAV RES METH INS C, V35, P116, DOI 10.3758/BF03195503
GUAITELLA I, 2009, LANG SPEECH, V52, P207
IEEE, 1969, IEEE T AUDIO ELECTRO, VAE-17, P227
Ishi C. T., 2007, IEEE RSJ INT C INT R, P548
Krahmer E, 2001, SPEECH COMMUN, V34, P391, DOI 10.1016/S0167-6393(00)00058-3
Krahmer E, 2004, HUM COM INT, V7, P191
Lansing CR, 1999, J SPEECH LANG HEAR R, V42, P526
Lappin JS, 2009, J VISION, V9, DOI 10.1167/9.1.30
Lee HM, 2008, J HIGH ENERGY PHYS
MCKEE SP, 1984, VISION RES, V24, P25, DOI 10.1016/0042-6989(84)90140-8
Nooteboom S., 1997, HDB PHONETIC SCI, P640
Pare M, 2003, PERCEPT PSYCHOPHYS, V65, P553, DOI 10.3758/BF03194582
Scarborough R, 2009, LANG SPEECH, V52, P135, DOI 10.1177/0023830909103165
Srinivasan RJ, 2003, LANG SPEECH, V46, P1
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
SUMMERFIELD Q, 1992, PHILOS T ROY SOC B, V335, P71, DOI 10.1098/rstb.1992.0009
Swerts M, 2008, J PHONETICS, V36, P219, DOI 10.1016/j.wocn.2007.05.001
Vatikiotis-Bateson E, 1998, PERCEPT PSYCHOPHYS, V60, P926, DOI 10.3758/BF03211929
Yehia HC, 2002, J PHONETICS, V30, P555, DOI 10.1006/jpho.2002.0165
NR 38
TC 9
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 555
EP 564
DI 10.1016/j.specom.2010.02.006
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000009
ER
PT J
AU Colletta, JM
Pellenq, C
Guidetti, M
AF Colletta, Jean-Marc
Pellenq, Catherine
Guidetti, Michele
TI Age-related changes in co-speech gesture and narrative: Evidence from
French children and adults
SO SPEECH COMMUNICATION
LA English
DT Article
DE Co-speech gestures; Narratives; Multimodality; French children and
adults
ID ICONIC GESTURES
AB As children's language abilities develop, so may their use of co-speech gesture. We tested this hypothesis by studying oral narratives produced by French children and adults. One hundred and twenty-two participants, divided into three age groups (6 years old, 10 years old and adults), were asked to watch a Tom and Jerry cartoon and then tell the story to the experimenter. All narratives were videotaped, and subsequently transcribed and annotated for language and gesture using the ELAN software. The results showed a strong effect of age on language complexity, discourse construction and gesture. The age effect was only partly related to the length of the narratives, as adults produced shorter narratives than 10-year-olds. The study thus confirms that co-speech gestures develop with age in the context of narrative activity and play a crucial role in discourse cohesion and the framing of verbal utterances. This developmental shift towards more complex narratives through both words and gestures is discussed in terms of its theoretical implications for the study of gesture and discourse development. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Colletta, Jean-Marc] Univ Grenoble 3, Lab Lidilem, EA 609, F-38040 Grenoble 9, France.
[Pellenq, Catherine] IUFM, F-38100 Grenoble, France.
[Pellenq, Catherine] Univ Grenoble 1, Lab Sci Educ, EA 602, F-38100 Grenoble, France.
[Guidetti, Michele] Univ Toulouse, UTM, Unite Rech Interdisciplinare Octogone, Lab Cognit Commun & Dev ECCD,EA 4156, F-31058 Toulouse 9, France.
RP Colletta, JM (reprint author), Univ Grenoble 3, Lab Lidilem, EA 609, 1180 Ave Cent,BP 25, F-38040 Grenoble 9, France.
EM jean-marc.colletta@u-grenoble3.fr; catherine.pellenq@ujf-grenoble.fr;
guidetti@univ-tlse2.fr
FU ANR (French National Research Agency) [0178-01]
FX This research was supported by grant no. 0178-01 from the ANR (French
National Research Agency) project entitled "L'acquisition et les
troubles du langage au regard de la multimodalite de la communication
parlee". We are grateful to Isabelle Rousset from Lidilem, and to all
the children and adult students who took part in this study.
CR Bamberg M., 1987, ACQUISITION NARRATIV
Beattie G, 1999, SEMIOTICA, V123, P1, DOI 10.1515/semi.1999.123.1-2.1
Berman R. A., 1994, RELATING EVENTS NARR
Bouvet Danielle, 2001, DIMENSION CORPORELLE
Butcher C., 2000, LANGUAGE GESTURE, P235, DOI DOI 10.1017/CBO9780511620850.015
Calbris G., 2003, EXPRESSION GESTUELLE
CAPIRCI O, 2002, ESSAYS HONOR WC STOK, P213
Capirci O, 2008, GESTURE, V8, P22, DOI 10.1075/gest.8.1.04cap
Capirci O, 1996, J CHILD LANG, V23, P645
CAPIRCI O, 2007, P 3 INT SOC GEST STU
Coirier P., 1996, PSYCHOLINGUISTIQUE T
COLLETTA JM, 2009, MULTIMODAL CORPORA
Colletta J.-M., 2004, DEV PAROLE CHEZ ENFA
Colletta J.-M., 2009, EXPOSITORY DISCOURSE, P63
Colletta JM, 2009, GESTURE, V9, P61, DOI 10.1075/gest.9.1.03col
De Ruiter J. P., 2000, LANGUAGE GESTURE, P284, DOI DOI 10.1017/CBO9780511620850.018
Diessel H., 2004, CAMBRIDGE STUDIES LI, V105
DUCEYKAUFMANN V, 2007, THESIS U STENDHAL GR
Duncan S., 2000, LANGUAGE GESTURE, P141, DOI 10.1017/CBO9780511620850.010
Fayol M., 1997, IDEES TEXTE PSYCHOL
FAYOL M, 1985, RECIT CONSTRUCTION A
FEYEREISEN P, 1994, CERVEAU COMMUNICATIO
Goldin-Meadow S, 2003, POINTING: WHERE LANGUAGE, CULTURE, AND COGNITION MEET, P85
Gombert JE, 1990, DEV METALINGUISTIQUE
GRAZIANO M, 2009, THESIS U STUDI SUOR
Guidetti M., 2003, PRAGMATIQUE PSYCHOL
Guidetti M., 2002, 1 LANGUAGE, V22, P265
Gullberg M, 2008, GESTURE, V8, P149, DOI 10.1075/gest.8.2.03gul
Gullberg M., 2008, FIRST LANG, V28, P200, DOI DOI 10.1177/0142723707088074
Hadar U, 1997, SEMIOTICA, V115, P147, DOI 10.1515/semi.1997.115.1-2.147
Halliday Michael, 1976, COHESION ENGLISH
Hickmann Maya, 2003, CHILDRENS DISCOURSE
Iverson JM, 1998, NEW DIRECTIONS CHILD
Jisa H, 2000, LINGUISTICS, V38, P591, DOI 10.1515/ling.38.3.591
Jisa H., 2004, LANGUAGE DEV CHILDHO, P135
Kendon A., 1980, RELATIONSHIP VERBAL, P207
Kendon A., 2004, GESTURE VISIBLE ACTI
Kita S, 2007, LANG COGNITIVE PROC, V22, P1212, DOI 10.1080/01690960701461426
Kita S., 2000, LANGUAGE GESTURE, P162, DOI DOI 10.1017/CBO9780511620850.011
Kita S, 2003, J MEM LANG, V48, P16, DOI 10.1016/S0749-596X(02)00505-3
Kita S, 2007, GESTURE STUD, V1, P67
Krauss R. M., 2000, LANGUAGE GESTURE, P261, DOI DOI 10.1017/CBO9780511620850.017
KUNENE R, 2007, P 10 INT PRAGM C GOT
Labov W., 1978, PARLER ORDINAIRE
LAFOREST M, 1996, AUTOUR NARRATION
LEONARD JL, 1993, LANGAGE SOC, V65, P39
LUNDQUIST L., 1980, COHERENCE TEXTUELLE
Marcos H, 1998, COMMUNICATION PRELIN
Mayberry R., 2000, LANGUAGE GESTURE, P199, DOI 10.1017/CBO9780511620850.013
McNeill D., 1992, HAND MIND WHAT GESTU
NICOLADIS E, 2008, P 11 INT C STUD CHIL
OZCALISKAN S, 2006, CONSTRUCTIONS ACQUIS, P31
Ozyürek Asli, 2008, Dev Psychol, V44, P1040, DOI 10.1037/0012-1649.44.4.1040
PIZZUTO E, 2007, GESTURAL COMMUNICATI, P164
Rime B., 1991, FUNDAMENTALS NONVERB, P239
SPERBER D, 1989, COMMUNICATION COGNIT
Streeck Jurgen, 1992, ADV NONVERBAL COMMUN, P3
THOMPSON LA, 1994, J EXP CHILD PSYCHOL, V57, P327, DOI 10.1006/jecp.1994.1016
Tolchinsky L, 2004, LANGUAGE DEV CHILDHO, P233, DOI 10.1075/tilar.3.15tol
VANDERSTRATTEN A, 1991, PREMIERS GESTES PREM
NR 60
TC 11
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 565
EP 576
DI 10.1016/j.specom.2010.02.009
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000010
ER
PT J
AU Aubanel, V
Nguyen, N
AF Aubanel, Vincent
Nguyen, Noel
TI Automatic recognition of regional phonological variation in
conversational interaction
SO SPEECH COMMUNICATION
LA English
DT Article
DE Conversational interaction; Regional phonological and phonetic
variation; Automatic speech processing; French; Sociophonetics
ID SOCIAL DESIRABILITY; SPEECH RECOGNITION; PERCEPTION; LANGUAGE; DIALECT;
FRENCH; CONVERGENCE; ACCENT; CORPUS; SCALE
AB One key aspect of face-to-face communication concerns the differences that may exist between speakers' native regional accents. This paper focuses on the characterization of regional phonological variation in a conversational setting. A new, interactive task was designed in which 12 pairs of participants engaged in a collaborative game leading them to produce a number of purpose-built names. In each game, the participants were native speakers of Southern French and Northern French, respectively. How the names were produced by each of the two participants was automatically determined from the recordings using ASR techniques and a pre-established set of possible regional variants along five phonological dimensions. A naive Bayes classifier was then applied to these phonetic forms, with a view to differentiating the speakers' native regional accents. The results showed that native regional accent was correctly recognized for 79% of the speakers. These results also revealed or confirmed the existence of accent-dependent differences in how segments are phonetically realized, such as the affrication of /d/ in /di/ sequences. Our data allow us to better characterize the phonological and phonetic patterns associated with regional varieties of French on a large scale and in a natural, interactional situation. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Aubanel, Vincent] CNRS, Lab Parole & Langage, F-13100 Aix En Provence, France.
Aix Marseille Univ, F-13100 Aix En Provence, France.
RP Aubanel, V (reprint author), CNRS, Lab Parole & Langage, 5 Ave Pasteur, F-13100 Aix En Provence, France.
EM vincent.aubanel@lpl-aix.fr; noel.nguyen@lpl-aix.fr
FU Region Provence-Alpes-Cote d'Azur; [ANR-08-BLAN-0276-01]
FX This work was supported by a Ph.D. Scholarship awarded to the first
author by the Region Provence-Alpes-Cote d'Azur, and by the project
ANR-08-BLAN-0276-01. We are grateful to Stephane Rauzy for statistical
advice and to the staff and students of the lycee Thiers, Marseille, for
their kind participation. We also thank two anonymous reviewers for
helpful comments.
CR Adda-Decker M., 2008, TRAITEMENT AUTOMATIQ, V49-3, P13
Adda-Decker M, 2007, REV FR LING APPL, V12, P71
Adda-Decker M, 1999, SPEECH COMMUN, V29, P83, DOI 10.1016/S0167-6393(99)00032-1
ANDERSON AH, 1991, LANG SPEECH, V34, P351
Bertrand R., 2008, TRAITEMENT AUTOMATIQ, V49, P105
Binisti N., 2003, CAHIERS FRANCAIS CON, V8, P107
Bradlow A.R., 2007, J ACOUSTICAL SOC A 2, V121, P3072
Brunelliere A, 2009, COGNITION, V111, P390, DOI 10.1016/j.cognition.2009.02.013
Carton F., 1983, ACCENTS FRANCAIS
Clopper CG, 2008, LANG SPEECH, V51, P175, DOI 10.1177/0023830908098539
Conrey B, 2005, BRAIN LANG, V95, P435, DOI 10.1016/j.bandl.2005.06.008
Coveney A., 2001, SOUNDS CONT FRENCH A
CROWNE DP, 1960, J CONSULT PSYCHOL, V24, P349, DOI 10.1037/h0047358
Cutler A, 2005, SPEECH COMMUN, V47, P32, DOI 10.1016/j.specom.2005.02.001
Delvaux V, 2007, PHONETICA, V64, P145, DOI 10.1159/000107914
Dufour S., 2007, J ACOUST SOC AM, V121, P131
Durand J., 2003, CORPUS VARIATION PHO, P11
Durand J., 1988, RECHERCHES LINGUISTI, V17, P29
Durand J., 2004, VARIATION FRANCOPHON, P217
Durand J., 1990, GENERATIVE NONLINEAR
Durand J., 2003, TRIBUNE INT LANGUES, V33, P3
Evans BG, 2004, J ACOUST SOC AM, V115, P352, DOI 10.1121/1.1635413
Eychenne J., 2006, THESIS U TOULOUSE LE
FAGYAL Z, 2002, P 24 JOURN ET PAR NA, P165
Fagyal-Le Mentec Z, 2006, FRENCH: A LINGUISTIC INTRODUCTION, P17, DOI 10.1017/CBO9780511791185.003
Floccia C, 2006, J EXP PSYCHOL HUMAN, V32, P1276, DOI 10.1037/0096-1523.32.5.1276
FONAGY I, 1989, REV ROMANE, V24, P225
Hansen Anita Berit, 2001, LINGUISTIQUE, V37, P33
Hay J, 2006, LINGUIST REV, V23, P351, DOI 10.1515/TLR.2006.014
Kraljic T, 2008, COGNITION, V107, P54, DOI 10.1016/j.cognition.2007.07.013
LENNOX RD, 1984, J PERS SOC PSYCHOL, V46, P1349, DOI 10.1037/0022-3514.46.6.1349
MALECOT A, 1976, PHONETICA, V33, P45
MARLOWE D, 1961, J CONSULT PSYCHOL, V25, P100
MARTINET A, 1958, ROMANCE PHILOL, V11, P345
Martinet A., 1945, PRONONCIATION FRANCA
NATALE M, 1975, J PERS SOC PSYCHOL, V32, P790, DOI 10.1037/0022-3514.32.5.790
New B, 2004, BEHAV RES METH INS C, V36, P516, DOI 10.3758/BF03195598
Pardo JS, 2006, J ACOUST SOC AM, V119, P2382, DOI 10.1121/1.2178720
Racine I., 2008, THESIS U GENEVE
Scharenborg O, 2007, SPEECH COMMUN, V49, P336, DOI 10.1016/j.specom.2007.01.009
SNYDER M, 1974, J PERS SOC PSYCHOL, V30, P526, DOI 10.1037/h0037039
Sumner M, 2009, J MEM LANG, V60, P487, DOI 10.1016/j.jml.2009.01.001
TRIMAILLE C, 2008, AFLS C OXF 3 5 SEPT
van Rijsbergen CJ., 1979, INFORM RETRIEVAL
VANRULLEN T, 2005, P TRAIT AUT LANG NAT
Vieru-Dimulescu B., 2008, TRAITEMENT AUTOMATIQ, V49, P135
WOEHRLING C, 2009, THESIS U PARIS SUD
NR 47
TC 8
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 577
EP 586
DI 10.1016/j.specom.2010.02.008
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000011
ER
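The record above describes classifying a speaker's native regional accent with a naive Bayes classifier applied to categorical phonological-variant features. A minimal sketch of that kind of classifier follows; the dimension names, variants, and counts are hypothetical illustrations, not the five phonological dimensions or data from the paper.

```python
from collections import Counter, defaultdict
from math import log

# Toy training data: each speaker is a dict of phonological dimensions -> observed variant,
# labelled with a regional accent. Dimensions and variants are illustrative only.
train = [
    ({"schwa": "realized", "mid_vowel": "open",   "di": "affricated"}, "south"),
    ({"schwa": "realized", "mid_vowel": "open",   "di": "plain"},      "south"),
    ({"schwa": "dropped",  "mid_vowel": "closed", "di": "affricated"}, "north"),
    ({"schwa": "dropped",  "mid_vowel": "closed", "di": "plain"},      "north"),
]

def fit_naive_bayes(data, alpha=1.0):
    """Estimate class priors and per-dimension variant probabilities with Laplace smoothing."""
    priors = Counter(label for _, label in data)
    counts = defaultdict(Counter)          # (label, dimension) -> Counter of variants
    variants = defaultdict(set)            # dimension -> set of variants seen anywhere
    for feats, label in data:
        for dim, var in feats.items():
            counts[(label, dim)][var] += 1
            variants[dim].add(var)
    return priors, counts, variants, alpha

def classify(feats, model):
    """Return the accent label with the highest posterior log-probability."""
    priors, counts, variants, alpha = model
    total = sum(priors.values())
    best_label, best_score = None, float("-inf")
    for label, n in priors.items():
        score = log(n / total)
        for dim, var in feats.items():
            c = counts[(label, dim)]
            score += log((c[var] + alpha) / (sum(c.values()) + alpha * len(variants[dim])))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

model = fit_naive_bayes(train)
print(classify({"schwa": "dropped", "mid_vowel": "closed", "di": "plain"}, model))  # -> "north"
```

In the study the variant forms themselves were obtained automatically with ASR; here they are simply assumed to be given.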
PT J
AU Kopp, S
AF Kopp, Stefan
TI Social resonance and embodied coordination in face-to-face conversation
with artificial interlocutors
SO SPEECH COMMUNICATION
LA English
DT Article
DE Social Resonance; Coordination; Embodied conversational agents; Gesture
ID NONCONSCIOUS MIMICRY; IMITATION; RAPPORT; SOUND
AB Human natural face-to-face communication is characterized by inter-personal coordination. In this paper, phenomena are analyzed that yield coordination of behaviors, beliefs, and attitudes between interaction partners, which can be tied to a concept of establishing social resonance. It is discussed whether these mechanisms can and should be transferred to conversation with artificial interlocutors like ECAs or humanoid robots. It is argued that one major step in this direction is embodied coordination, mutual adaptations that are mediated by flexible modules for the top-down production and bottom-up perception of expressive conversational behavior that ground in and, crucially, coalesce in the same sensorimotor structures. Work on modeling this for ECAs with a focus on coverbal gestures is presented. (C) 2010 Elsevier B.V. All rights reserved.
C1 Univ Bielefeld, Sociable Agents Grp, D-33501 Bielefeld, Germany.
RP Kopp, S (reprint author), Univ Bielefeld, Sociable Agents Grp, POB 100131, D-33501 Bielefeld, Germany.
EM skopp@techfak.uni-bielefeld.de
RI Kopp, Stefan/K-3456-2013
OI Kopp, Stefan/0000-0002-4047-9277
FU Deutsche Forschungsgemeinschaft (DFG) [SFB 673]; Center of Excellence
"Cognitive Interaction Technology" (CITEC)
FX This research is supported by the Deutsche Forschungsgemeinschaft (DFG)
in SFB 673 "Alignment in Communication" and the Center of Excellence
"Cognitive Interaction Technology" (CITEC).
CR Allwood J., 1992, Journal of Semantics, V9, DOI 10.1093/jos/9.1.1
Amit R., 2002, Proceedings 2nd International Conference on Development and Learning. ICDL 2002, DOI 10.1109/DEVLRN.2002.1011867
Bailenson JN, 2008, COMPUT HUM BEHAV, V24, P66, DOI 10.1016/j.chb.2007.01.015
Bergmann K, 2009, LECT NOTES ARTIF INT, V5773, P76
Bergmann Kirsten, 2009, P 8 INT C AUT AG MUL, P361
Bernieri F. J., 1991, FUNDAMENTALS NONVERB, P401
BERNIERI FJ, 1994, PERS SOC PSYCHOL B, V20, P303, DOI 10.1177/0146167294203008
Bertenthal BI, 2006, J EXP PSYCHOL HUMAN, V32, P210, DOI 10.1037/0096-1523.32.2.210
BICKMORE T, 2006, P CHI, P550
Bickmore T., 2003, THESIS MIT
Billard A, 2002, FROM ANIM ANIMAT, P281
BRANIGAN H, 2010, J PRAGMATIC IN PRESS
Breazeal C, 2005, ARTIF LIFE, V11, P31, DOI 10.1162/1064546053278955
Brennan SE, 1996, J EXP PSYCHOL LEARN, V22, P1482, DOI 10.1037/0278-7393.22.6.1482
BUSCHMEIER H, 2009, 12 EUR WORKSH NAT LA, P82
Cassell J., 2000, P INT NAT LANG GEN C, P171
Cassell J., 2000, EMBODIED CONVERSATIO
CASSELL J, 2007, ACL WORKSH EMB LANG, P40
CASSELL J, 2001, P ACM CHI 2001 C SEA, P396
Chartrand TL, 1999, J PERS SOC PSYCHOL, V76, P893, DOI 10.1037//0022-3514.76.6.893
Clark H. H., 1996, USING LANGUAGE
DUNCAN S, 2007, P GEST 2007 C INT SO
Finkel EJ, 2006, J PERS SOC PSYCHOL, V91, P456, DOI 10.1037/0022-3514.91.3.456
Gallese V, 2004, TRENDS COGN SCI, V8, P396, DOI 10.1016/j.tics.2004.07.002
GARDENFORS P, 1996, CONTRIBUTION SCI TEC
Giles Howard, 1991, LANGUAGE CONTEXTS CO
Gratch J, 2006, LECT NOTES ARTIF INT, V4133, P14
Hall J. A., 2001, INTERPERSONAL SENSIT
HAMILTON A, 2008, ATTENTION PERFORMANC, V22
HSEE CK, 1990, COGNITION EMOTION, V4, P327, DOI 10.1080/02699939008408081
Kang S.-H, 2008, P INT C AUT AG MULT, P120
Kendon Adam, 1973, SOCIAL COMMUNICATION, P29
KIMBARA I, 2005, GESTURE, V6, P39
Kopp S., 2004, P INT C MULT INT ICM, P97, DOI 10.1145/1027933.1027952
Kopp S, 2005, LECT NOTES ARTIF INT, V3661, P329
Kopp S, 2004, COMPUT ANIMAT VIRT W, V15, P39, DOI 10.1002/cav.6
Kramer NC, 2008, LECT NOTES COMPUT SC, V5208, P507
Lakin JL, 2008, PSYCHOL SCI, V19, P816, DOI 10.1111/j.1467-9280.2008.02162.x
Lakin JL, 2003, J NONVERBAL BEHAV, V27, P145, DOI 10.1023/A:1025389814290
McNeill D., 1992, HAND MIND WHAT GESTU
MILES L, 2009, J EXPT SOCIAL PSYCHO, P585
Montgomery KJ, 2007, SOC COGN AFFECT NEUR, V2, P114, DOI 10.1093/scan/nsm004
CONDON WS, 1966, J NERV MENT DIS, V143, P338, DOI 10.1097/00005053-196610000-00005
Pickering MJ, 2004, BEHAV BRAIN SCI, V27, P169
Press C, 2006, EUR J NEUROSCI, V24, P2415, DOI 10.1111/j.1460-9568.2006.05115.x
Reddy M., 1979, METAPHOR THOUGHT, P284
Reeves B., 1996, MEDIA EQUATION PEOPL
Rizzolatti G, 2001, NAT REV NEUROSCI, V2, P661, DOI 10.1038/35090060
SADEGHIPOUR A, 2009, LECT NOTES ARTIF INT, V5773, P80
SCHEFLEN AE, 1964, PSYCHIATR, V27, P316
Scheflen A. E., 1982, INTERACTION RHYTHMS, P13
Shockley K, 2003, J EXP PSYCHOL HUMAN, V29, P326, DOI 10.1037/0096-1523.29.2.326
SHON A, 2007, 2007 IEEE INT C ROB, P2847
Sowa Timo, 2005, P KOGWIS 2005, P183
STRONKS B, 2002, P CHI 02 WORKSH PHIL, P25
Sugathapala De Silva M. W., 1980, ASPECTS LINGUISTIC B, P105
Tickle-Degnen L, 1990, PSYCHOL INQ, V1, P285, DOI DOI 10.1207/S15327965PLI0104_1
TRAUM D, 1992, 2 INT C SPOK LANG PR, P137
ULDALL B, OPTIMAL DISTIN UNPUB
Wallbott Harald G, 1995, MUTUALITIES DIALOGUE, P82
Wilson M, 2005, PSYCHOL BULL, V131, P460, DOI 10.1037/0033-2909.131.3.460
Yngve Victor, 1970, CHICAGO LINGUISTIC S, V6, P567
NR 62
TC 26
Z9 26
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 587
EP 597
DI 10.1016/j.specom.2010.02.007
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000012
ER
PT J
AU Bailly, G
Raidt, S
Elisei, F
AF Bailly, Gerard
Raidt, Stephan
Elisei, Frederic
TI Gaze, conversational agents and face-to-face communication
SO SPEECH COMMUNICATION
LA English
DT Article
DE Conversational agents; Face-to-face communication; Gaze
ID AUDIOVISUAL SPEECH-PERCEPTION; SOCIAL ATTENTION; DIRECTION; SIGNALS;
EYES; MIND
AB In this paper, we describe two series of experiments that examine audiovisual face-to-face interaction between naive human viewers and either a human interlocutor or a virtual conversational agent. The main objective is to analyze the interplay between speech activity and mutual gaze patterns during mediated face-to-face interactions. We first quantify the impact of deictic gaze patterns of our agent. We further aim at refining our experimental knowledge on mutual gaze patterns during human face-to-face interaction by using new technological devices such as non-invasive eye trackers and pinhole cameras, and at quantifying the impact of a selection of cognitive states and communicative functions on recorded gaze patterns. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Bailly, Gerard; Raidt, Stephan; Elisei, Frederic] Univ Grenoble, Speech & Cognit Dept, GIPSA Lab, CNRS,UMR 5216, Grenoble, France.
RP Bailly, G (reprint author), Univ Grenoble, Speech & Cognit Dept, GIPSA Lab, CNRS,UMR 5216, Grenoble, France.
EM gerard.bailly@gipsa-lab.grenoble-inp.fr
FU Rhone-Alpes region
FX We thank our colleague and target speaker Helene Loevenbruck for her
incredible patience and complicity. We also thank all of our subjects -
the ones whose data have been used here and the others whose data have
been corrupted by deficiencies of recording devices. Edouard Gentaz has
helped us in statistical processing. This paper benefited from the
pertinent suggestions of the two anonymous reviewers. We thank Peter F.
Dominey and Marion Dohen for the proofreading. This project has been
financed by the project Presence of the cluster ISLE of the Rhone-Alpes
region.
CR Argyle Michael, 1976, GAZE AND MUTUAL GAZE
BAILLY G, 2005, HUMAN COMPUTER INTER
Bailly G., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025700715107
BARONCOHEN S, 1985, COGNITION, V21, P37, DOI 10.1016/0010-0277(85)90022-8
Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X
Blais C, 2008, PLOS ONE, V3, DOI 10.1371/journal.pone.0003022
BREAZEAL C, 2000, THESIS MIT BOSTON
Buchan JN, 2008, BRAIN RES, V1242, P162, DOI 10.1016/j.brainres.2008.06.083
Buchan JN, 2007, SOC NEUROSCI, V2, P1, DOI 10.1080/17470910601043644
Carpenter M., 2000, COMMUNICATIVE LANGUA, V9, P30
Cassell J., 2000, EMBODIED CONVERSATIO
CASTIELLO U, 1991, BRAIN, V114, P2639, DOI 10.1093/brain/114.6.2639
CHEN M, 2002, SIGCHI, P49
Clair R. N. S., 1979, LANGUAGE SOCIAL PSYC
Driver J, 1999, VIS COGN, V6, P509
DUNCAN S, 1972, J PERS SOC PSYCHOL, V23, P283, DOI 10.1037/h0033031
ELISEI F, 2007, EYEGAZE AWARE ANAL S, P120
EVINGER C, 1994, EXP BRAIN RES, V100, P337
FUJIE S, 2005, BACK CHANNEL FEEDBAC, P889
GEIGER G, 2003, PERCEPTUAL EVALUATIO, P224
GOODWIN C, 1980, SOCIOL INQ, V50, P272, DOI 10.1111/j.1475-682X.1980.tb00023.x
HADDINGTON P, 2002, STUDIA LINGUSITICA L, P107
Itti L, 2003, SPIE 48 ANN INT S OP, P64
Kaur M., 2003, INT C MULT INT VANC, P151
KENDON A, 1967, ACTA PSYCHOL, V26, P22, DOI 10.1016/0001-6918(67)90005-4
Langton SRH, 2000, TRENDS COGN SCI, V4, P50, DOI 10.1016/S1364-6613(99)01436-9
Langton SRH, 1999, VIS COGN, V6, P541
Lee SP, 2002, ACM T GRAPHIC, V21, P637
Leslie A.M., 1994, MAPPING MIND DOMAIN, P119, DOI DOI 10.1017/CBO9780511752902.006
Lewkowicz DJ, 1996, J EXP PSYCHOL HUMAN, V22, P1094
Matsusaka Y, 2003, IEICE T INF SYST, VE86D, P26
Miller LM, 2005, J NEUROSCI, V25, P5884, DOI 10.1523/JNEUROSCI.0896-05.2005
Morgan J. L., 1996, SIGNAL SYNTAX OVERVI
NOVICK DG, 1996, COORDINATING TURN TA
OS ED, 2005, SPEECH COMMUN, V47, P194
Peters C, 2003, ATTENTION DRIVEN EYE
Peters C, 2005, LECT NOTES ARTIF INT, V3661, P229
Picot A, 2007, LECT NOTES ARTIF INT, V4722, P272
POSNER MI, 1990, ANNU REV NEUROSCI, V13, P25, DOI 10.1146/annurev.neuro.13.1.25
POSNER MI, 1980, Q J EXP PSYCHOL, V32, P3, DOI 10.1080/00335558008248231
Pourtois G, 2004, EUR J NEUROSCI, V20, P3507, DOI 10.1111/j.1460-9568.2004.03794.x
Povinelli RJ, 2003, IEEE T KNOWL DATA EN, V15, P339, DOI 10.1109/TKDE.2003.1185838
PREMACK D, 1978, BEHAV BRAIN SCI, V1, P515
RAIDT S, 2006, LANG RESS EV C LREC, P2544
RAIDT S, 2008, THESIS I NATL POLYTE, P175
Reveret L., 2000, INT C SPEECH LANG PR, P755
Riva G, 2003, BEING THERE CONCEPTS
Rochet-Capellan A, 2008, J SPEECH LANG HEAR R, V51, P1507, DOI 10.1044/1092-4388(2008/07-0173)
RUTTER DR, 1987, DEV PSYCHOL, V23, P54, DOI 10.1037//0012-1649.23.1.54
Salvucci D.D., 2000, ETRA 00, P71
Scassellati B. M., 2001, FDN THEORY MIND HUMA
Thorisson KR, 2002, TEXT SPEECH LANG TEC, V19, P173
Vatikiotis-Bateson E, 1998, PERCEPT PSYCHOPHYS, V60, P926, DOI 10.3758/BF03211929
WALLBOTT HG, 1991, J PERS SOC PSYCHOL, V61, P147, DOI 10.1037/0022-3514.61.1.147
Yarbus A. L, 1967, EYE MOVEMENTS VISION, DOI [10.1007/978-1-4899-5379-7, DOI 10.1007/978-1-4899-5379-7]
NR 55
TC 18
Z9 18
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2010
VL 52
IS 6
SI SI
BP 598
EP 612
DI 10.1016/j.specom.2010.02.015
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 604MB
UT WOS:000278282000013
ER
PT J
AU Gunawan, TS
Ambikairajah, E
Epps, J
AF Gunawan, Teddy Surya
Ambikairajah, Eliathamby
Epps, Julien
TI Perceptual speech enhancement exploiting temporal masking properties of
human auditory system
SO SPEECH COMMUNICATION
LA English
DT Article
DE Human auditory system; Speech enhancement; Temporal masking; Subjective
test; Objective test
ID NOISE SUPPRESSION; MODEL; RECOGNITION; ESTIMATOR; ALGORITHM; FREQUENCY;
PHASE
AB The use of simultaneous masking in speech enhancement has shown promise for a range of noise types. In this paper, a new speech enhancement algorithm based on a short-term temporal masking threshold to noise ratio (MNR) is presented. A novel functional model for forward masking based on three parameters is incorporated into a speech enhancement framework based on speech boosting. The performance of the speech enhancement algorithm using the proposed forward masking model was compared with seven other speech enhancement methods over 12 different noise types and four SNRs. Objective evaluation using PESQ revealed that using the proposed forward masking model, the speech enhancement algorithm outperforms the other algorithms by 6-20% depending on the SNR. Moreover, subjective evaluation using 16 listeners confirmed the objective test results. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Gunawan, Teddy Surya] Int Islamic Univ Malaysia, Dept Elect & Comp Engn, Kuala Lumpur 53100, Malaysia.
[Ambikairajah, Eliathamby; Epps, Julien] Univ New S Wales, Sch Elect Engn & Telecommun, Sydney, NSW 2052, Australia.
RP Gunawan, TS (reprint author), Int Islamic Univ Malaysia, Dept Elect & Comp Engn, Kuala Lumpur 53100, Malaysia.
EM tsgunawan@gmail.com
CR AMBIKAIRAJAH E, 1998, INT C SPOK LANG PROC
[Anonymous], 2003, P835 ITUT
BEROUTI M, 1979, INT C AC SPEECH SIGN
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Bouquin RL., 1996, SPEECH COMMUN, V18, P3
Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283
CARNERO B, 1999, IEEE T SIGNAL PROCES, V47
Cohen I, 2004, IEEE SIGNAL PROC LET, V11, P725, DOI [10.1109/LSP.2004.833478, 10.1109/LSP.2004.833278]
EBU, 1988, SOUND QUAL ASS MAT R
EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090
EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
FEDER M, 1989, IEEE T ACOUST SPEECH, V37, P204, DOI 10.1109/29.21683
FLORENTINE M, 1988, J ACOUST SOC AM, V84, P195, DOI 10.1121/1.396964
GAGNON L, 1991, P INT C AC SPEECH SI
GUNAWAN TS, 2006, 10 INT C COMM SYST S
GUNAWAN TS, 2004, 10 INT C SPEECH SCI
GUNAWAN TS, 2006, IEEE INT C AC SPEECH
GUSTAFSSON S, 1998, INT C AC SPEECH SIGN
Hansen J. H. L., 1999, ENCY ELECT ELECT ENG
HIRSCH HG, 2000, AURORA EXPT FRAMEWOR
Hu Y, 2004, IEEE SIGNAL PROC LET, V11, P270, DOI 10.1109/LSP.2003.821714
Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054
Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P457, DOI 10.1109/TSA.2003.815936
IRINO T, 1999, P INT C AC SPEECH SI
*ITU, 1996, P830 ITUT
ITU, 1998, BS1387 ITUR
ITU, 2001, P862 ITUT
JESTEADT W, 1982, J ACOUST SOC AM, V71, P950, DOI 10.1121/1.387576
JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608
KUBIN G, 1999, INT C AC SPEECH SIGN
LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540
LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197
LIM JS, 1983, SPEECH ENHANCEMENT
Lin L, 2003, ELECTRON LETT, V39, P754, DOI 10.1049/el:20030480
Lin L, 2002, ELECTRON LETT, V38, P1486, DOI 10.1049/el:20020965
LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Ma N, 2006, IEEE T AUDIO SPEECH, V14, P19, DOI 10.1109/TSA.2005.858515
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Moore B. C. J. E., 1995, HEARING
Moore BC., 2003, INTRO PSYCHOL HEARIN
MOORER JA, 1986, J AUDIO ENG SOC, V34, P143
OSHAUGHNESSY D, 1989, IEEE COMMUN MAG, V27, P46, DOI 10.1109/35.17653
SCALART P, 1996, INT C AC SPEECH SIGN
SCHROEDER MR, 1979, J ACOUST SOC AM, V66, P1647, DOI 10.1121/1.383662
Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569
Tsoukalas DE, 1997, IEEE T SPEECH AUDI P, V5, P497, DOI 10.1109/89.641296
VARGA AP, 1992, NOISEX 92 STUDY EFFE
VARY P, 1985, SIGNAL PROCESS, V8, P387, DOI 10.1016/0165-1684(85)90002-7
Vaseghi S. V., 2000, ADV DIGITAL SIGNAL P
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920
WESTERLUND N, 2003, APPL SPEECH ENHANCEM
ZWICKER E, 1999, PSYCHOACOUTICS FACTS
NR 55
TC 9
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2010
VL 52
IS 5
BP 381
EP 393
DI 10.1016/j.specom.2009.12.006
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 594NK
UT WOS:000277544200001
ER
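The record above builds its enhancement gain from a short-term temporal (forward) masking threshold compared with the noise level (an MNR). The sketch below illustrates that idea only: the forward-masking trace is a generic linear-in-dB decay, not the three-parameter functional model proposed in the paper, and the MNR-to-gain mapping is an assumed sigmoid.

```python
import numpy as np

def forward_masking_threshold(band_energy_db, decay_db_per_frame=3.0, offset_db=10.0):
    """Toy forward-masking trace per critical band: each frame's masker level decays
    linearly (in dB) over the following frames. A generic stand-in, not the paper's model."""
    n_frames, n_bands = band_energy_db.shape
    thr = np.full_like(band_energy_db, -np.inf)
    for t in range(n_frames):
        masker = band_energy_db[t] - offset_db
        for k in range(t, n_frames):
            level = masker - decay_db_per_frame * (k - t)
            thr[k] = np.maximum(thr[k], level)
    return thr

def mnr_gain(band_energy_db, noise_db, floor=0.1):
    """Gain per time-frequency cell driven by the masking-to-noise ratio (MNR):
    noise lying well below the masking threshold needs little extra attenuation."""
    thr = forward_masking_threshold(band_energy_db)
    mnr = thr - noise_db                        # dB difference ~ log masking-to-noise ratio
    gain = 1.0 / (1.0 + 10.0 ** (-mnr / 10.0))  # ~1 when noise is masked, small otherwise
    return np.maximum(gain, floor)

# Usage on random band energies (frames x critical bands); the noise level is estimated
# from a few assumed speech-free leading frames.
rng = np.random.default_rng(0)
energy = 40 + 20 * rng.random((50, 18))
noise = energy[:5].mean(axis=0)
print(mnr_gain(energy, noise).shape)            # (50, 18)
```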
PT J
AU Barra-Chicote, R
Yamagishi, J
King, S
Montero, JM
Macias-Guarasa, J
AF Barra-Chicote, Roberto
Yamagishi, Junichi
King, Simon
Manuel Montero, Juan
Macias-Guarasa, Javier
TI Analysis of statistical parametric and unit selection speech synthesis
systems applied to emotional speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotional speech synthesis; HMM-based synthesis; Unit selection
AB We have applied two state-of-the-art speech synthesis techniques (unit selection and HMM-based synthesis) to the synthesis of emotional speech. A series of carefully designed perceptual tests to evaluate speech quality, emotion identification rates and emotional strength were used for the six emotions which we recorded - happiness, sadness, anger, surprise, fear, disgust. For the HMM-based method, we evaluated spectral and source components separately and identified which components contribute to which emotion.
Our analysis shows that, although the HMM method produces significantly better neutral speech, the two methods produce emotional speech of similar quality, except for emotions having context-dependent prosodic patterns. Whilst synthetic speech produced using the unit selection method has better emotional strength scores than the HMM-based method, the HMM-based method has the ability to manipulate the emotional strength. For emotions that are characterized by both spectral and prosodic components, synthetic speech using unit selection methods was more accurately identified by listeners. For emotions mainly characterized by prosodic components, HMM-based synthetic speech was more accurately identified. This finding differs from previous results regarding listener judgements of speaker similarity for neutral speech. We conclude that unit selection methods require improvements to prosodic modeling and that HMM-based methods require improvements to spectral modeling for emotional speech. Certain emotions cannot be reproduced well by either method. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Barra-Chicote, Roberto; Manuel Montero, Juan] Univ Politecn Madrid, Grp Tecnol Habla, ETSI Telecomunicac, E-28040 Madrid, Spain.
[Yamagishi, Junichi; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Informat Forum, Edinburgh EH8 9AB, Midlothian, Scotland.
[Macias-Guarasa, Javier] Univ Alcala De Henares, Dept Elect, Madrid 28805, Spain.
RP Barra-Chicote, R (reprint author), Univ Politecn Madrid, Grp Tecnol Habla, ETSI Telecomunicac, Ciudad Univ S-N, E-28040 Madrid, Spain.
EM barra@die.upm.es
RI Macias-Guarasa, Javier/J-4625-2012; Montero, Juan M/K-2381-2014;
Barra-Chicote, Roberto/L-4963-2014
OI Montero, Juan M/0000-0002-7908-5400; Barra-Chicote,
Roberto/0000-0003-0844-7037
FU Spanish Ministry of Education; ROBONAUTA [DPI2007-66846-c02-02]; EPSRC;
EC; SD-TEAM-UPM [TIN2008-06856-C05-03]; SD-TEAM-UAH
[TIN2008-06856-C05-05]; eDIKT
FX RB was visiting CSTR at the time of this work. RB was supported by the
Spanish Ministry of Education and by project ROBONAUTA
(DPI2007-66846-c02-02). JY is supported by EPSRC and the EC FP7 EMIME
project. SK holds an EPSRC Advanced Research Fellowship. JMM and JMG are
supported by projects SD-TEAM-UPM (TIN2008-06856-C05-03) and SD-TEAM-UAH
(TIN2008-06856-C05-05), respectively. This work has made use of the
resources provided by the Edinburgh Compute and Data Facility which is
partially supported by the eDIKT initiative (http://www.edikt.org.uk).
We also thank the two anonymous reviewers for their constructive
feedback and helpful suggestions. The associate editor coordinating the
review of this manuscript for publication was Dr. Marc Swerts.
CR Pitrelli JF, 2006, IEEE T AUDIO SPEECH, V14, P1099, DOI 10.1109/TASL.2006.876123
BARRA R, 2007, P INT 2007, P2233
BARRACHICOTE R, 2008, P 6 INT C LANG RES E
BARRACHICOTE R, 2008, V J TECNOL, P115
Barra R, 2006, INT CONF ACOUST SPEE, P1085
BENNETT C, 2006, P BLIZZ CHALL 2006
Black A., 2003, P EUROSPEECH GEN SWI, P1649
Black A. B., 2005, P INT 2005 LISB PORT, P77
Black A. W., 1995, P EUROSPEECH MADR SP, P581
Bulut M., 2002, P INT C SPOK LANG PR, P1265
Burkhardt F., 2005, P INT, P1517
CHARONNAT L, 2008, P LANG RES EV C, P2376
CLARK R, 2006, P BLIZZ CHALL WORKSH
Clark R., 2007, P BLZ3 2007 P SSW6
Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014
Donovan RE, 1999, COMPUT SPEECH LANG, V13, P223, DOI 10.1006/csla.1999.0123
Eide E, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P127, DOI 10.1109/WSS.2002.1224388
FRASER M, 2007, P BLZ3 2007 P SSW6 A
GALLARDOANTOLIN A, 2007, P INTERSPEECH 2007
GOMES C, 2004, CPAIOR04, P387
HAMZA W, 2004, P ICSLP 2004
Hofer G., 2005, P INTERSPEECH LISB P, P501
Hunt A. J., 1996, P ICASSP 96, P373
Karaiskos V., 2008, P BLIZZ CHALL WORKSH
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Montero J. M., 1998, P 5 INT C SPOK LANG, P923
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
Nose T, 2007, IEICE T INF SYST, VE90D, P1406, DOI 10.1093/ietisy/e90-d.9.1406
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
SCHRODER M, 2004, THESIS SAARLAND U SA
Schroder M., 2001, P EUROSPEECH 2001 SE, P561
STROM V, 2008, P INT 2008, P1873
SYRDAL AK, 2000, P ICSLP 2000 OCT, P411
Tachibana M, 2006, IEICE T INF SYST, VE89D, P1092, DOI 10.1093/ietisy/e89-d.3.1092
Tachibana M, 2005, IEICE T INF SYST, VE88D, P2484, DOI 10.1093/ietisy/e88-d.11.2484
Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816
Tokuda K., 2008, HMM BASED SPEECH SYN
Yamagishi J., 2008, P BLIZZ CHALL 2008
Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P1208, DOI 10.1109/TASL.2009.2016394
Yoshimura T., 1999, P EUROSPEECH 99 SEPT, P2374
YOSHIMURA T, 2000, IEICE T D 2, V83, P2099
Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325
NR 44
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2010
VL 52
IS 5
BP 394
EP 404
DI 10.1016/j.specom.2009.12.007
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 594NK
UT WOS:000277544200002
ER
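The perceptual evaluation in the record above reports emotion identification rates from forced-choice listening tests. A trivial sketch of how such rates are tallied from listener responses is given below; the response pairs are hypothetical and not data from the paper.

```python
from collections import Counter

def identification_rates(responses):
    """responses: iterable of (intended_emotion, perceived_emotion) pairs from a
    forced-choice listening test. Returns per-emotion identification rate."""
    totals, correct = Counter(), Counter()
    for intended, perceived in responses:
        totals[intended] += 1
        if perceived == intended:
            correct[intended] += 1
    return {emo: correct[emo] / totals[emo] for emo in totals}

# Hypothetical listener judgements, for illustration only.
responses = [("anger", "anger"), ("anger", "disgust"), ("sadness", "sadness"),
             ("sadness", "sadness"), ("surprise", "happiness"), ("surprise", "surprise")]
print(identification_rates(responses))   # {'anger': 0.5, 'sadness': 1.0, 'surprise': 0.5}
```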
PT J
AU Lee, YH
Kim, HK
AF Lee, Young Han
Kim, Hong Kook
TI Entropy coding of compressed feature parameters for distributed speech
recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Entropy coding; Distributed speech recognition; Huffman coding;
Mel-frequency cepstral coefficient; Voicing class
ID QUANTIZATION
AB In this paper, we propose several entropy coding methods to further compress quantized mel-frequency cepstral coefficients (MFCCs) used for distributed speech recognition (DSR). As a standard DSR front-end, the European Telecommunications Standards Institute (ETSI) published an extended front-end that includes the split-vector quantization of MFCCs and voicing class information. By exploring entropy variances of compressed MFCCs according to the voicing class of the analysis frame and the amount of the entropy due to MFCC subvector indices, voicing class-dependent and subvector-wise Huffman coding methods are proposed. In addition, differential Huffman coding is then applied to further enhance the coding gain against class-dependent and subvector-wise Huffman codings. Subsequent experiments show that the average bit-rate of the subvector-wise differential Huffman coding is measured at 33.93 bits/frame, which is the smallest among the proposed Huffman coding methods, whereas that of a traditional Huffman coding that does not consider voicing class and encodes with a single Huffman coding tree for all the subvectors is measured at 42.22 bits/frame for the TIMIT database. In addition, we evaluate the performance of the proposed Huffman coding methods applied to speech in noise by using the Aurora 4 database, a standard speech database for DSR. As a result, it is shown that the subvector-wise differential Huffman coding method provides the smallest average bit-rate. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Lee, Young Han; Kim, Hong Kook] Gwangju Inst Sci & Technol, Dept Informat & Commun, Kwangju 500712, South Korea.
RP Kim, HK (reprint author), Gwangju Inst Sci & Technol, Dept Informat & Commun, 1 Oryong Dong, Kwangju 500712, South Korea.
EM cpumaker@gist.ac.kr; hongkook@gist.ac.kr
FU Korea government (MEST) [2009-0057194]; Ministry of Knowledge and
Economy, Korea [NIPA-2009-C1090-0902-0010]; Gwangiu Institute of Science
and Technology
FX This work was supported in part by the Korea Science and Engineering
Foundation (KOSEF) grant funded by the Korea government (MEST) (No.
2009-0057194), by the Ministry of Knowledge and Economy, Korea, under
the ITRC support program supervised by the National IT Industry
Promotion Agency (NIPA)(NIPA-2009-C1090-0902-0010), and by the basic
research project grant provided by the Gwangiu Institute of Science and
Technology in 2009.
CR [Anonymous], 2003, 202211 ETSI ES
Borgstrom B. J., 2007, P INT AUG, P578
Digalakis VV, 1999, IEEE J SEL AREA COMM, V17, P82, DOI 10.1109/49.743698
Gallardo-Antolin A, 2005, IEEE T SPEECH AUDI P, V13, P1186, DOI 10.1109/TSA.2005.853210
Garofolo J., 1988, GETTING STARTED DARP
Hirsch Guenter, 2002, EXPT FRAMEWORK PERFO
HIRSCH HG, 1998, P ICSLP DENV CO, P1877
Huerta JM, 1998, P 5 INT C SPOK LANG, V4, P1463
HUFFMAN DA, 1952, P IRE, V40, P1098, DOI 10.1109/JRPROC.1952.273898
Kim HK, 2001, IEEE T SPEECH AUDI P, V9, P558
KISS I, 1999, P EUROSPEECH, P2183
KISS I, 2000, P ICSLP BEIJ CHIN, V4, P250
So S, 2006, SPEECH COMMUN, V48, P746, DOI 10.1016/j.specom.2005.10.002
RAJ B, 2001, P IEEE WORKSH ASRU T, P127
RAMASWAMY G, 1998, P ICASSP, V2, P977, DOI 10.1109/ICASSP.1998.675430
SORIN A, 2004, P ICASSP MAY, P129
Srinivasamurthy N, 2006, SPEECH COMMUN, V48, P888, DOI 10.1016/j.specom.2005.11.003
Tan ZH, 2005, SPEECH COMMUN, V47, P220, DOI 10.1016/j.specom.2005.05.007
ZHU Q, 2001, P IEEE INT C AC SPEE, V1, P113
NR 19
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2010
VL 52
IS 5
BP 405
EP 412
DI 10.1016/j.specom.2010.01.002
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 594NK
UT WOS:000277544200003
ER
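The record above compresses split-VQ MFCC subvector indices with separate (subvector-wise) Huffman codes and reports average bits per frame. The sketch below shows that core step under assumed toy codebook sizes and random indices; the paper's voicing-class-dependent and differential variants would reuse the same machinery on per-class subsets or on index differences between consecutive frames, which is not shown here.

```python
import heapq
import random
from collections import Counter

def huffman_code_lengths(counts):
    """Return {symbol: code length in bits} for a Huffman code built from symbol counts."""
    if len(counts) == 1:                         # degenerate case: a lone index still costs 1 bit
        return {next(iter(counts)): 1}
    heap = [(n, i, {sym: 0}) for i, (sym, n) in enumerate(counts.items())]
    heapq.heapify(heap)
    tie = len(heap)                              # unique tiebreaker so dicts are never compared
    while len(heap) > 1:
        n1, _, d1 = heapq.heappop(heap)
        n2, _, d2 = heapq.heappop(heap)
        merged = {s: depth + 1 for s, depth in {**d1, **d2}.items()}
        heapq.heappush(heap, (n1 + n2, tie, merged))
        tie += 1
    return heap[0][2]

def average_bits_per_frame(frames):
    """frames: list of tuples of quantizer indices, one index per MFCC subvector.
    A separate Huffman code is built for each subvector position (subvector-wise coding)."""
    n_sub = len(frames[0])
    total_bits = 0.0
    for k in range(n_sub):
        counts = Counter(f[k] for f in frames)
        lengths = huffman_code_lengths(counts)
        total_bits += sum(counts[s] * lengths[s] for s in counts)
    return total_bits / len(frames)

# Toy usage: 3 subvectors per frame, indices drawn from small hypothetical codebooks.
random.seed(0)
frames = [(random.randrange(8), random.randrange(8), random.randrange(4)) for _ in range(1000)]
print(round(average_bits_per_frame(frames), 2))  # near 3 + 3 + 2 bits for roughly uniform indices
```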
PT J
AU Vicsi, K
Szaszak, G
AF Vicsi, Klara
Szaszak, Gyoergy
TI Using prosody to improve automatic speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Prosody; Syntactic unit; Sentence modality; Hidden
Markov models
AB In this paper acoustic processing and modelling of the supra-segmental characteristics of speech is addressed, with the aim of incorporating advanced syntactic and semantic level processing of spoken language for speech recognition/understanding tasks. The proposed modelling approach is very similar to the one used in standard speech recognition, where basic HMM units (most often acoustic phoneme models) are trained and are then connected according to the dictionary and some grammar (language model) to obtain a recognition network, along which recognition can also be interpreted as an alignment process. In this paper the HMM framework is used to model speech prosody, and to perform initial syntactic and/or semantic level processing of the input speech in parallel to standard speech recognition. As acoustic prosodic features, fundamental frequency and energy are used. A method was implemented for syntactic level information extraction from the speech. The method was designed to work for fixed-stress languages, and it yields a segmentation of the input speech for syntactically linked word groups, or even single words corresponding to a syntactic unit (these word groups are sometimes referred to as phonological phrases in psycholinguistics, which can consist of one or more words). These so-called word-stress units are marked by prosody, and have an associated fundamental frequency and/or energy contour which allows their discovery. For this, HMMs for the different types of word-stress unit contours were trained and then used for recognition and alignment of such units from the input speech. This prosodic segmentation of the input speech also allows word-boundary recovery and can be used for N-best lattice rescoring based on prosodic information. The syntactic level input speech segmentation algorithm was evaluated for the Hungarian and for the Finnish languages that have fixed stress on the first syllable. (This means if a word is stressed, stress is realized on the first syllable of the word.) The N-best rescoring based on syntactic level word-stress unit alignment was shown to augment the number of correctly recognized words. For further syntactic and semantic level processing of the input speech in ASR, clause and sentence boundary detection and modality (sentence type) recognition were implemented. Again, the classification was carried out by HMMs, which model the prosodic contour for each clause and/or sentence modality type. Clause (and hence also sentence) boundary detection was based on HMMs' excellent capacity to dynamically align the reference prosodic structure with the utterance coming from the ASR input. This method also allows punctuation to be automatically marked. This semantic level processing of speech was investigated for the Hungarian and for the German languages. The correctness of recognized types of modalities was 69% for Hungarian, and 78% for German. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Vicsi, Klara; Szaszak, Gyoergy] Budapest Univ Technol & Econ, Lab Speech Acoust, TMIT, H-1111 Budapest, Hungary.
RP Vicsi, K (reprint author), Budapest Univ Technol & Econ, Lab Speech Acoust, TMIT, Stoczek U 2, H-1111 Budapest, Hungary.
EM vicsi@tmit.bme.hu; szaszak@tmit.bme.hu
FU Hungarian Research Foundations OTKA T [046487]; IKTA [00056]
FX The authors would like to thank Toomas Altosaar (Helsinki University of
Technology) for his kind help and his contribution to the use of the
Finnish Speech Database.The work has been supported by the Hungarian
Research Foundations OTKA T 046487 ELE and IKTA 00056.
CR AINSWORTH WA, 1976, MECH SPEECH RECOGNIT
BATLINER A, 1994, NATO ASI SERIES F
Becchetti C., 1999, SPEECH RECOGNITION T
*C ALBR U KIEL I P, 1994, KIEL CORP READ SPEEC, V1
Gallwitz F, 2002, SPEECH COMMUN, V36, P81, DOI 10.1016/S0167-6393(01)00027-9
Hirota K., 2001, Physica C
KOMPE R, 1995, P 4 EUR C SPEECH COM, P1333
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
Silverman K., 1992, P INT C SPOK LANG PR, P867
SZASZAK G, 2007, COST ACT 2102 INT WO, P138
VAINIO M, 1999, P ICPHS 1999 SAN FRA, P2347
Veilleux N. M., 1993, P ARPA WORKSH HUM LA, P335, DOI 10.3115/1075671.1075749
VICSI K, 2008, USING PROSODY IMPROV
Vicsi K., 2004, 2 MAG SZAM NYELV K S, P315
VICSI K, 2005, SPEECH TECHNOL, P363
VICSI K, 1998, 1 HUNGARIAN SPEECH D, P163
Wahlster W., 2000, VERBMOBIL FDN SPEECH
Young S., 2005, HTK BOOK HTK VERSION
NR 18
TC 15
Z9 15
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2010
VL 52
IS 5
BP 413
EP 426
DI 10.1016/j.specom.2010.01.003
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 594NK
UT WOS:000277544200004
ER
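The record above classifies prosodic contours (fundamental frequency and energy) with HMMs, for example to recognize sentence modality. The sketch below scores a contour against one small Gaussian-emission HMM per modality with the forward algorithm and picks the best match; the topology and parameter values are illustrative assumptions, not trained models from the paper, and the paper's alignment-based segmentation of word-stress units is not reproduced.

```python
import numpy as np

def log_gauss(x, mean, var):
    """Log density of a diagonal-covariance Gaussian; x, mean, var are (dim,) arrays."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def forward_loglik(obs, log_pi, log_A, means, variances):
    """Log-likelihood of an observation sequence (T x dim) under a Gaussian-emission HMM."""
    n_states = len(log_pi)
    log_alpha = log_pi + np.array([log_gauss(obs[0], means[s], variances[s]) for s in range(n_states)])
    for t in range(1, len(obs)):
        emis = np.array([log_gauss(obs[t], means[s], variances[s]) for s in range(n_states)])
        log_alpha = emis + np.array([
            np.logaddexp.reduce(log_alpha + log_A[:, s]) for s in range(n_states)
        ])
    return np.logaddexp.reduce(log_alpha)

def classify_modality(contour, models):
    """contour: (T, 2) array of [F0, energy] per frame; models: {name: HMM parameters}."""
    return max(models, key=lambda name: forward_loglik(contour, *models[name]))

# Hypothetical 3-state left-to-right HMMs for two sentence modalities.
log_A = np.log(np.array([[0.8, 0.2, 0.0],
                         [0.0, 0.8, 0.2],
                         [0.0, 0.0, 1.0]]) + 1e-12)
log_pi = np.log(np.array([1.0, 1e-12, 1e-12]))
falling = (log_pi, log_A, np.array([[200., 70.], [180., 65.], [150., 55.]]), np.full((3, 2), 400.))
rising  = (log_pi, log_A, np.array([[180., 65.], [190., 68.], [230., 60.]]), np.full((3, 2), 400.))
models = {"statement": falling, "question": rising}

contour = np.column_stack([np.linspace(200, 140, 40), np.linspace(70, 55, 40)])
print(classify_modality(contour, models))   # -> "statement"
```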
PT J
AU Benesty, J
Chen, JD
Huang, YY
AF Benesty, Jacob
Chen, Jingdong
Huang, Yiteng (Arden)
TI On widely linear Wiener and tradeoff filters for noise reduction
SO SPEECH COMMUNICATION
LA English
DT Article
DE Noise reduction; Wiener filter; Widely linear Wiener filter;
Circularity; Noncircularity
ID COMPLEX-VARIABLES; STATISTICS; CIRCULARITY; SIGNALS
AB Noise reduction is often formulated as a linear filtering problem in the frequency domain. With this formulation, the core issue of noise reduction becomes how to design an optimal frequency-domain filter that can significantly suppress noise without introducing perceptually noticeable speech distortion. While higher-order information can be used, most existing approaches use only second-order statistics to design the noise-reduction filter because they are relatively easier to estimate and are more reliable. When we transform non-stationary speech signals into the frequency domain and work with the short-time discrete Fourier transform coefficients, there are two types of second-order statistics, i.e., the variance and the so-called pseudo-variance due to the noncircularity of the signal. So far, only the variance information has been exploited in designing different noise-reduction filters while the pseudo-variance has been neglected. In this paper, we attempt to shed some light on how to use noncircularity in the context of noise reduction. We will discuss the design of optimal and suboptimal noise reduction filters using both the variance and pseudo-variance and answer the basic question whether noncircularity can be used to improve the noise-reduction performance. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Benesty, Jacob] Univ Quebec, INRS EMT, Montreal, PQ H5A 1K6, Canada.
[Chen, Jingdong; Huang, Yiteng (Arden)] WeVoice Inc, Bridgewater, NJ 08807 USA.
RP Chen, JD (reprint author), 9 Iroquois Trail, Branchburg, NJ 08876 USA.
EM benesty@emt.inrs.ca; jingdongchen@ieee.org; ardenhuang@gmail.com
CR Amblard PO, 1996, SIGNAL PROCESS, V53, P15, DOI 10.1016/0165-1684(96)00072-2
Amblard PO, 1996, SIGNAL PROCESS, V53, P1, DOI 10.1016/0165-1684(96)00071-0
[Anonymous], 1990, DARPA TIMIT ACOUSTIC
Benesty J, 2009, SPRINGER TOP SIGN PR, V2, P1, DOI 10.1007/978-3-642-00296-0_1
Benesty J., 2005, SPEECH ENHANCEMENT
CHEN J, 2003, ADAPTIVE SIGNAL PROC, P129
Chen JD, 2006, IEEE T AUDIO SPEECH, V14, P1218, DOI 10.1109/TSA.2005.860851
CHEVALIER P, 2009, P IEEE ICASSP, P3573
Diethorn EJ, 2004, AUDIO SIGNAL PROCESSING: FOR NEXT-GENERATION MULTIMEDIA COMMUNICATION SYSTEMS, P91, DOI 10.1007/1-4020-7769-6_4
ERIKSSON J, 2009, P IEEE ICASSP, P3565
Hirsch H. G., 1995, P IEEE INT C AC SPEE, V1, P153
Huang Y., 2006, ACOUSTIC MIMO SIGNAL
LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Mandic D., 2009, COMPLEX VALUED NONLI
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
NEESER FD, 1993, IEEE T INFORM THEORY, V39, P1293, DOI 10.1109/18.243446
Ollila E, 2008, IEEE SIGNAL PROC LET, V15, P841, DOI 10.1109/LSP.2008.2005050
PICINBONO B, 1994, IEEE T SIGNAL PROCES, V42, P3473, DOI 10.1109/78.340781
PICINBONO B, 1995, IEEE T SIGNAL PROCES, V43, P2030, DOI 10.1109/78.403373
Schreier PJ, 2003, IEEE T SIGNAL PROCES, V51, P714, DOI [10.1109/TSP.2002.808085, 10.1109/TCP.2002.808085]
STAHL V, 2000, P ICASSP, V3, P1875
Vary P, 2006, DIGITAL SPEECH TRANS
NR 23
TC 3
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2010
VL 52
IS 5
BP 427
EP 439
DI 10.1016/j.specom.2010.02.003
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 594NK
UT WOS:000277544200005
ER
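The record above designs noise-reduction filters from both the variance and the pseudo-variance of the STFT coefficients. The sketch below solves the standard widely linear MMSE normal equations per frequency bin under the stated assumptions (Y = X + V with X and V uncorrelated); it illustrates the general widely linear Wiener form rather than the paper's specific tradeoff filters, and the statistics used are made-up numbers.

```python
import numpy as np

def wl_wiener_gains(phi_x, gamma_x, phi_v, gamma_v):
    """Per-bin widely linear Wiener coefficients (h1, h2) estimating X_hat = h1*Y + h2*conj(Y)
    for Y = X + V with X and V uncorrelated. phi_* are variances E|.|^2 (real, >= 0);
    gamma_* are pseudo-variances E[.^2] (complex). With gamma_x = gamma_v = 0 (circular
    signals) this reduces to the usual Wiener gain phi_x / (phi_x + phi_v) and h2 = 0."""
    phi_y = phi_x + phi_v
    gamma_y = gamma_x + gamma_v
    h1 = np.empty_like(gamma_y, dtype=complex)
    h2 = np.empty_like(gamma_y, dtype=complex)
    for k in range(len(phi_y)):
        # Normal equations of the widely linear MMSE estimator: one 2x2 system per bin.
        A = np.array([[phi_y[k], np.conj(gamma_y[k])],
                      [gamma_y[k], phi_y[k]]], dtype=complex)
        b = np.array([phi_x[k], gamma_x[k]], dtype=complex)
        h1[k], h2[k] = np.linalg.solve(A, b)
    return h1, h2

# Toy usage on 4 frequency bins; statistics are illustrative, not estimated from speech.
phi_x = np.array([1.0, 2.0, 0.5, 1.5])
gamma_x = np.array([0.4 + 0.2j, 0.0, 0.3j, -0.5])
phi_v = np.ones(4)
gamma_v = np.zeros(4, dtype=complex)
h1, h2 = wl_wiener_gains(phi_x, gamma_x, phi_v, gamma_v)
print(np.round(h1, 3), np.round(h2, 3))
```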
PT J
AU Vicente-Pena, J
Diaz-de-Maria, F
AF Vicente-Pena, Jesus
Diaz-de-Maria, Fernando
TI Uncertainty decoding on Frequency Filtered parameters for robust ASR
SO SPEECH COMMUNICATION
LA English
DT Article
DE Robust speech recognition; Spectral subtraction; Uncertainty decoding;
Frequency Filtered; Bounded distance HMM; SSBD-HMM
ID SPEECH RECOGNITION; SPECTRAL SUBTRACTION; NOISE; MODEL; TIME
AB The use of feature enhancement techniques to obtain estimates of the clean parameters is a common approach for robust automatic speech recognition (ASR). However, the decoding algorithm typically ignores how accurate these estimates are. Uncertainty decoding methods incorporate this type of information. In this paper, we develop a formulation of the uncertainty decoding paradigm for Frequency Filtered (FF) parameters using spectral subtraction as a feature enhancement method. Additionally, we show that the uncertainty decoding method for FF parameters admits a simple interpretation as a spectral weighting method that assigns more importance to the most reliable spectral components.
Furthermore, we suggest combining this method with SSBD-HMM (Spectral Subtraction and Bounded Distance HMM), one recently proposed technique that is able to compensate for the effects of features that are highly contaminated (outliers). This combination pursues two objectives: to improve the results achieved by uncertainty decoding methods and to determine which part of the improvements is due to compensating for the effects of outliers and which part is due to compensating for other less deteriorated features. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Vicente-Pena, Jesus; Diaz-de-Maria, Fernando] Univ Carlos III Madrid, Dept Signal Proc & Commun, EPS, Madrid 28911, Spain.
RP Vicente-Pena, J (reprint author), Univ Carlos III Madrid, Dept Signal Proc & Commun, EPS, Avda Univ 30, Madrid 28911, Spain.
EM jvicente@tsc.uc3m.es; fdiaz@tsc.uc3m.es
RI Diaz de Maria, Fernando/E-8048-2011
CR Arrowood J. A., 2002, P ICSLP, P1561
BENITEZ C, 2004, P ICSLP JEJ ISL KOR, P137
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
*CMU, 1998, CMU V 0 6 PRON DICT
Deng L., 2002, P ICSLP, P2449
de Veth J, 2001, SPEECH COMMUN, V34, P247, DOI 10.1016/S0167-6393(00)00037-6
de Veth J, 2001, SPEECH COMMUN, V34, P57, DOI 10.1016/S0167-6393(00)00046-7
Droppo J., 2002, P ICASSP 2002, V1, P57
GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J
HIRSCH G, 2002, AU41702 ETSI STQ AUR
KRISTJANSSON T, 2002, P ICASSP, V1, P61
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Moreno PJ, 1996, P IEEE INT C AC SPEE, V2, P733
MORRIS A, 2001, WISP WORKSH INN METH
NADEU C, 1995, P EUR 95, P1381
Nadeu C, 2001, SPEECH COMMUN, V34, P93, DOI 10.1016/S0167-6393(00)00048-0
*NIST, 1992, NIST RES MAN CORP RM
Paliwal KK, 1999, P EUR C SPEECH COMM, P85
Papoulis A., 2002, PROBABILITY RANDOM V
PARIHAR N, 2001, AURORA WORKING GROUP
PAUL DB, 1992, HLT 91, P357
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
Stouten V, 2006, SPEECH COMMUN, V48, P1502, DOI 10.1016/j.specom.2005.12.006
Varga AP, 1992, TECH REP DRA SPEECH
VICENTEPENA J, 2006, P INT C SPOK LANG PR, P1491
Vicente-Pena J, 2010, SPEECH COMMUN, V52, P123, DOI 10.1016/j.specom.2009.09.002
Vicente-Pena J, 2006, SPEECH COMMUN, V48, P1379, DOI 10.1016/j.specom.2006.07.007
Weiss NA, 1993, INTRO STAT, P407
WOODLAND PC, 1994, P IEEE INT C AC SPEE, V2, P125
Yoma NB, 1998, IEEE T SPEECH AUDI P, V6, P579, DOI 10.1109/89.725325
Yoma NB, 2002, IEEE T SPEECH AUDI P, V10, P158, DOI 10.1109/TSA.2002.1001980
Young S., 2002, HTK BOOK HTK VERSION
NR 32
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2010
VL 52
IS 5
BP 440
EP 449
DI 10.1016/j.specom.2010.02.002
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 594NK
UT WOS:000277544200006
ER
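The record above incorporates the reliability of enhanced features into decoding. The sketch below shows the generic uncertainty-decoding likelihood for a diagonal Gaussian state model, where each feature's uncertainty variance is added to the model variance so unreliable components are down-weighted; it is a minimal illustration of the paradigm, not the paper's Frequency Filtered derivation or its SSBD-HMM combination.

```python
import numpy as np

def uncertain_log_likelihood(x_hat, uncertainty_var, mean, var):
    """Log-likelihood of an enhanced feature vector x_hat under a diagonal Gaussian
    state model N(mean, var), scoring each component against (var + uncertainty_var)
    so that components the front-end marks as unreliable contribute less."""
    total_var = var + uncertainty_var
    return -0.5 * np.sum(np.log(2 * np.pi * total_var) + (x_hat - mean) ** 2 / total_var)

# Hypothetical numbers: a 4-dimensional enhanced feature vector whose last component
# was heavily contaminated, reflected in a large uncertainty variance.
x_hat = np.array([0.1, -0.3, 0.2, 2.5])
uncertainty = np.array([0.01, 0.01, 0.02, 4.0])
mean = np.zeros(4)
var = np.ones(4)
print(uncertain_log_likelihood(x_hat, uncertainty, mean, var))
print(uncertain_log_likelihood(x_hat, np.zeros(4), mean, var))  # same vector, uncertainty ignored
```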
PT J
AU Paliwal, K
Wojcicki, K
Schwerin, B
AF Paliwal, Kuldip
Wojcicki, Kamil
Schwerin, Belinda
TI Single-channel speech enhancement using spectral subtraction in the
short-time modulation domain
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Modulation spectral subtraction; Speech enhancement
fusion; Analysis-modification-synthesis (AMS); Musical noise
ID PRIMARY AUDITORY-CORTEX; FOURIER-ANALYSIS; AMPLITUDE-MODULATION;
TRANSMISSION INDEX; QUALITY ESTIMATION; MASKING PROPERTIES; RECOGNITION;
FREQUENCY; NOISE; INTELLIGIBILITY
AB In this paper we investigate the modulation domain as an alternative to the acoustic domain for speech enhancement. More specifically, we wish to determine how competitive the modulation domain is for spectral subtraction as compared to the acoustic domain. For this purpose, we extend the traditional analysis-modification-synthesis framework to include modulation domain processing. We then compensate the noisy modulation spectrum for additive noise distortion by applying the spectral subtraction algorithm in the modulation domain. Using an objective speech quality measure as well as formal subjective listening tests, we show that the proposed method results in improved speech quality. Furthermore, the proposed method achieves better noise suppression than the MMSE method. In this study, the effect of modulation frame duration on speech quality of the proposed enhancement method is also investigated. The results indicate that modulation frame durations of 180-280 ms provide a good compromise between different types of spectral distortions, namely musical noise and temporal slurring. Thus, given a proper selection of modulation frame duration, the proposed modulation spectral subtraction does not suffer from musical noise artifacts typically associated with acoustic spectral subtraction. In order to achieve further improvements in speech quality, we also propose and investigate fusion of modulation spectral subtraction with the MMSE method. The fusion is performed in the short-time spectral domain by combining the magnitude spectra of the above speech enhancement algorithms. Subjective and objective evaluation of the speech enhancement fusion shows consistent speech quality improvements across input SNRs. (C) 2010 Elsevier B.V. All rights reserved.
C1 [Paliwal, Kuldip; Wojcicki, Kamil; Schwerin, Belinda] Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia.
RP Wojcicki, K (reprint author), Griffith Univ, Griffith Sch Engn, Signal Proc Lab, Nathan, Qld 4111, Australia.
EM kamil.wojcicki@ieee.org
CR ALLEN JB, 1977, IEEE T ACOUST SPEECH, V25, P235, DOI 10.1109/TASSP.1977.1162950
ALLEN JB, 1977, P IEEE, V65, P1558, DOI 10.1109/PROC.1977.10770
Arai T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607318
Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013
Atlas L., 2004, P IEEE INT C AC SPEE, V2, P761
ATLAS L, 2003, 155 IEICE U
Atlas LE, 2001, P SOC PHOTO-OPT INS, V4474, P1, DOI 10.1117/12.448636
BACON SP, 1989, J ACOUST SOC AM, V85, P2575, DOI 10.1121/1.397751
Berouti M., 1979, P IEEE INT C AC SPEE, V4, P208
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283
CROCHIERE RE, 1980, IEEE T ACOUST SPEECH, V28, P99, DOI 10.1109/TASSP.1980.1163353
Depireux DA, 2001, J NEUROPHYSIOL, V85, P1220
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Falk T. H., 2007, P ISCA C INT SPEECH, P970
Goldsworthy RL, 2004, J ACOUST SOC AM, V116, P3679, DOI 10.1121/1.1804628
Greenberg S., 2001, P 7 EUR C SPEECH COM, P473
GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317
Hasan MK, 2004, IEEE SIGNAL PROC LET, V11, P450, DOI 10.1109/LSP.2004.824017
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
Hermansky H., 1995, P IEEE INT C AC SPEE, V1, P405
HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224
Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006
Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949
Kamath S., 2002, P IEEE INT C AC SPEE
Kanedera N, 1999, SPEECH COMMUN, V28, P43, DOI 10.1016/S0167-6393(99)00002-3
Kim DS, 2004, IEEE SIGNAL PROC LET, V11, P849, DOI 10.1109/LSP.2004.835466
Kim DS, 2005, IEEE T SPEECH AUDI P, V13, P821, DOI 10.1109/TSA.2005.851924
Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6
Kinnunen T, 2008, P ISCA SPEAK LANG RE
KINNUNEN T, 2006, P IEEE INT C AC SPEE, V1, P665
Kowalski N, 1996, J NEUROPHYSIOL, V76, P3503
LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Lu CT, 2007, PATTERN RECOGN LETT, V28, P1300, DOI 10.1016/j.patrec.2007.03.001
Lu X, 2010, SPEECH COMMUN, V52, P1, DOI 10.1016/j.specom.2009.08.006
Lyons J., 2008, P ISCA C INT SPEECH, P387
Malayath N, 2000, DIGIT SIGNAL PROCESS, V10, P55, DOI 10.1006/dspr.1999.0363
Mesgarani N., 2005, P ICASSP, V1, P1105, DOI 10.1109/ICASSP.2005.1415311
Nadeu C, 1997, SPEECH COMMUN, V22, P315, DOI 10.1016/S0167-6393(97)00030-7
Paliwal K, 2008, IEEE SIGNAL PROC LET, V15, P785, DOI 10.1109/LSP.2008.2005755
Payton K. L., 2002, PAST PRESENT FUTURE, P125
Payton KL, 1999, J ACOUST SOC AM, V106, P3637, DOI 10.1121/1.428216
PORTNOFF MR, 1981, IEEE T ACOUST SPEECH, V29, P364, DOI 10.1109/TASSP.1981.1163580
Quatieri T. F., 2002, DISCRETE TIME SPEECH
Rix AW, 2001, PERCEPTUAL EVALUATIO, P862
SCHREINER CE, 1986, HEARING RES, V21, P227, DOI 10.1016/0378-5955(86)90221-2
Shamma SA, 1996, NETWORK-COMP NEURAL, V7, P439, DOI 10.1088/0954-898X/7/3/001
Shannon B., 2006, P INT C SPOK LANG PR, P1423
SHEFT S, 1990, J ACOUST SOC AM, V88, P796, DOI 10.1121/1.399729
STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464
Thompson JK, 2003, P IEEE INT C AC SPEE, V5, P397
Tyagi V., 2003, P ISCA EUR C SPEECH, P981
VASEGHI SV, 1992, J AUDIO ENG SOC, V40, P791
Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118
Vuuren S. V., 1998, P INT C SPOK LANG PR, P3205
WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920
XIAO X, 2007, P ICASSP 2007, V4, P1021
ZADEH LA, 1950, P IRE, V38, P291, DOI 10.1109/JRPROC.1950.231083
NR 62
TC 25
Z9 28
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2010
VL 52
IS 5
BP 450
EP 475
DI 10.1016/j.specom.2010.02.004
PG 26
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 594NK
UT WOS:000277544200007
ER
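The record above applies spectral subtraction to the modulation spectrum of each acoustic-frequency magnitude trajectory. The toy sketch below illustrates only the core idea: it transforms each bin's whole-utterance magnitude trajectory once (instead of the paper's overlapping 180-280 ms modulation frames), models stationary noise as a constant trajectory whose modulation energy sits at 0 Hz, and omits the MMSE fusion; parameter values are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def modspecsub(noisy, fs, noise_frames=10, nperseg=512, rho=1.0, beta=0.02):
    """Toy modulation-domain spectral subtraction, resynthesising with the noisy phase."""
    f, t, Z = stft(noisy, fs=fs, nperseg=nperseg, noverlap=nperseg // 2)
    mag, phase = np.abs(Z), np.angle(Z)

    enhanced_mag = np.empty_like(mag)
    for k in range(mag.shape[0]):                         # one trajectory per acoustic bin
        traj = mag[k]
        M = np.fft.rfft(traj)                             # modulation spectrum of the trajectory
        # Stationary noise approximated by a constant trajectory at the noise-only mean level,
        # so its modulation magnitude is concentrated at modulation frequency 0.
        noise_mod = np.abs(np.fft.rfft(np.tile(traj[:noise_frames].mean(), len(traj))))
        clean = np.maximum(np.abs(M) - rho * noise_mod, beta * np.abs(M))
        enhanced_mag[k] = np.maximum(
            np.fft.irfft(clean * np.exp(1j * np.angle(M)), n=len(traj)), 0.0)

    _, out = istft(enhanced_mag * np.exp(1j * phase), fs=fs,
                   nperseg=nperseg, noverlap=nperseg // 2)
    return out

# Usage on a synthetic noisy tone, just to exercise the pipeline.
fs = 16000
rng = np.random.default_rng(1)
noisy = np.sin(2 * np.pi * 200 * np.arange(fs) / fs) + 0.3 * rng.standard_normal(fs)
print(modspecsub(noisy, fs).shape)
```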
PT J
AU Denby, B
Schultz, T
Honda, K
AF Denby, Bruce
Schultz, Tanja
Honda, Kiyoshi
TI Special Issue Silent Speech Interfaces
SO SPEECH COMMUNICATION
LA English
DT Editorial Material
EM denby@ieee.org
NR 0
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2010
VL 52
IS 4
SI SI
BP 269
EP 269
DI 10.1016/j.specom.2010.02.001
PG 1
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 574XB
UT WOS:000276026400001
ER
PT J
AU Denby, B
Schultz, T
Honda, K
Hueber, T
Gilbert, JM
Brumberg, JS
AF Denby, B.
Schultz, T.
Honda, K.
Hueber, T.
Gilbert, J. M.
Brumberg, J. S.
TI Silent speech interfaces
SO SPEECH COMMUNICATION
LA English
DT Article
DE Silent speech; Speech pathologies; Cellular telephones; Speech
recognition; Speech synthesis
ID BRAIN-COMPUTER INTERFACES; ULTRASOUND IMAGES; MOTOR CORTEX; RECOGNITION;
SYSTEM; MOVEMENTS; ELECTRODE; SURFACE; COMMUNICATION; TETRAPLEGIA
AB The possibility of speech processing in the absence of an intelligible acoustic signal has given rise to the idea of a 'silent speech' interface, to be used as an aid for the speech-handicapped, or as part of a communications system operating in silence-required or high-background-noise environments. The article first outlines the emergence of the silent speech interface from the fields of speech production, automatic speech processing, speech pathology research, and telecommunications privacy issues, and then follows with a presentation of demonstrator systems based on seven different types of technologies. A concluding section underlining some of the common challenges faced by silent speech interface researchers, and ideas for possible future directions, is also provided. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Denby, B.] Univ Paris 06, F-75005 Paris, France.
[Schultz, T.] Univ Karlsruhe, Cognit Syst Lab, D-76131 Karlsruhe, Germany.
[Honda, K.] ATR Cognit Informat Sci Labs, Seika, Kyoto 6190288, Japan.
[Denby, B.; Hueber, T.] ESPCI ParisTech, Elect Lab, F-75005 Paris, France.
[Gilbert, J. M.] Univ Hull, Dept Engn, Kingston Upon Hull HU6 7RX, N Humberside, England.
[Brumberg, J. S.] Boston Univ, Dept Cognit & Neural Syst, Boston, MA 02215 USA.
RP Denby, B (reprint author), Univ Paris 06, 4 Pl Jussieu, F-75005 Paris, France.
EM denby@ieee.org; tanja@ira.uka.de; honda@atr.jp; hueber@ieee.org;
J.M.Gilbert@hull.ac.uk; brum-berg@cns.bu.edu
FU French Department of Defense (DGA); Centre de Microelectronique de Paris
Ile-de-France (CEMIP); French National Research Agency (ANR)
[ANR-06-BLAN-0166]; ENT Consultants' Fund; Hull and East Yorkshire
Hospitals NHS Trust; National Institute on Deafness and other
Communication Disorders [R01 DC07683, R44 DC007050-02]; National Science
Foundation [SBE-0354378]
FX The authors acknowledge support from the French Department of Defense
(DGA); the "Centre de Microelectronique de Paris Ile-de-France" (CEMIP);
the French National Research Agency (ANR) under the contract number
ANR-06-BLAN-0166; the ENT Consultants' Fund, Hull and East Yorkshire
Hospitals NHS Trust; the National Institute on Deafness and other
Communication Disorders (R01 DC07683; R44 DC007050-02); and the National
Science Foundation (SBE-0354378). They also wish to thank Gerard
Chollet; Maureen Stone; Laurent Benaroya; Gerard Dreyfus; Pierre
Roussel; and Szu-Chen (Stan) Jou for their help in preparing this
article.
CR ARNAL A, 2000, 23 JOURN ET PAR, P425
BAKEN RJ, 1984, J SPEECH HEAR DISORD, V49, P202
Bartels J, 2008, J NEUROSCI METH, V174, P168, DOI 10.1016/j.jneumeth.2008.06.030
Betts BJ, 2006, INTERACT COMPUT, V18, P1242, DOI 10.1016/j.intcom.2006.08.012
Birbaumer N, 2000, IEEE T REHABIL ENG, V8, P190, DOI 10.1109/86.847812
Blankertz B, 2006, IEEE T NEUR SYS REH, V14, P147, DOI 10.1109/TNSRE.2006.875557
BLOM ED, 1979, LARYNGECTOMEE REHABI, P251
BOS JC, 2005, 2005064 DRCD TOR CR
Brown DR, 2005, MEAS SCI TECHNOL, V16, P2381, DOI 10.1088/0957-0233/16/11/033
Brown DR, 2004, MEAS SCI TECHNOL, V15, P1291, DOI 10.1088/0957-0233/15/7/010
BRUMBERG JS, 2007, NEUR M PLANN 2007 SA
Brumberg JS, 2010, SPEECH COMMUN, V52, P367, DOI 10.1016/j.specom.2010.01.001
BRUMBERG JS, 2008, NEUR M PLANN 2007 WA
BURNETT GC, 1997, J ACOUST SOC AM, V102, pA3168, DOI 10.1121/1.420785
Chan A. D. C., 2003, THESIS U NEW BRUNSWI
Chan ADC, 2001, MED BIOL ENG COMPUT, V39, P500, DOI 10.1007/BF02345373
Crevier-Buchman Lise, 2002, Rev Laryngol Otol Rhinol (Bord), V123, P137
DASALLA CS, 2009, P 3 INT CONV REH ENG
DAVIDSON L, 2005, J ACOUST SOC AM, V120, P407
DEKENS T, 2008, P 6 INT LANG RES EV
DENBY B, 2004, P IEEE INT C AC SPEE, V1, P1685
DENBY B, 2006, PROSPECTS SILENT SPE, P1365
Dornhege G., 2007, BRAIN COMPUTER INTER
DRUMMOND S, 1996, MOTOR SKILLS, V83, P801
Dupont S., 2004, P ROB 2004 WORKSH IT
EPSTEIN CM, 1983, INTRO EEG EVOKED POT
EPSTEIN MA, 2005, CLIN LINGUIST PHONET, V16, P567
Fagan MJ, 2008, MED ENG PHYS, V30, P419, DOI 10.1016/j.medengphy.2007.05.003
FITZPATRICK M, 2002, NEW SCI 0403
Furui S., 2001, DIGITAL SPEECH PROCE
GEORGOPOULOS AP, 1982, J NEUROSCI, V2, P1527
GIBBON F, 2005, BIBLIO ELECTROPALATO
Gracco VL, 2005, NEUROIMAGE, V26, P294, DOI 10.1016/j.neuroimage.2005.01.033
GUENTHER FH, 2008, NEUR M PLANN 2008 WA
HASEGAWA T, 1992, P IEEE ICCS ISITA 19, V20, P617
HASEGAWAJOHNSON M, 2008, COMMUNICATION
HERACLEOUS P, 2007, EURASIP J ADV SIG PR, P1
Hirahara T, 2010, SPEECH COMMUN, V52, P301, DOI 10.1016/j.specom.2009.12.001
HOCHBERG LR, 2008, NEUR M PLANN 2008 WA
Hochberg LR, 2006, NATURE, V442, P164, DOI 10.1038/nature04970
Holmes J., 2001, SPEECH SYNTHESIS REC
Hoole P., 1999, COARTICULATION THEOR, P260
HOUSE D, 2002, LECT NOTES COMPUTER, V2443, P65
Hueber T., 2008, INT SEM SPEECH PROD, P365
HUEBER T, 2008, PHONE RECOGNITION UL, P2032
Hueber T, 2010, SPEECH COMMUN, V52, P288, DOI 10.1016/j.specom.2009.11.004
HUEBER T, 2007, INT C PHON SCI SAARB, P2193
HUEBER T, 2007, IEEE INT C AC SPEECH, V7, P1245
HUEBER T, 2007, CONTINUOUS SPEECH PH, P658
Hummel J, 2006, PHYS MED BIOL, V51, pN205, DOI 10.1088/0031-9155/51/10/N01
*IEEE, 2008, IEEE COMPUTER, V41
JORGENSEN C, 2003, P INT JOINT C NEUR N, V4, P3128, DOI 10.1109/IJCNN.2003.1224072
Jorgensen C, 2010, SPEECH COMMUN, V52, P354, DOI 10.1016/j.specom.2009.11.003
JORGENSEN C, 2005, P 38 ANN HAW INT C S
JOU S, 2007, INT C AC SPEECH SIGN
JOU S, 2006, INTERSPEECH 2006 9 I, V2, P573
KENNEDY PR, 1989, J NEUROSCI METH, V29, P181, DOI 10.1016/0165-0270(89)90142-8
KENNEDY PR, 2006, ELECT ENG HDB SERIES, V1
Kennedy PR, 2000, IEEE T REHABIL ENG, V8, P198, DOI 10.1109/86.847815
Kennedy PR, 1998, NEUROREPORT, V9, P1707, DOI 10.1097/00001756-199806010-00007
Kim S., 2007, NEURAL ENG, P486
Levinson SE, 2005, MATHEMATICAL MODELS FOR SPEECH TECHNOLOGY, P1, DOI 10.1002/0470020911
Lotte F, 2007, J NEURAL ENG, V4, pR1, DOI 10.1088/1741-2560/4/R01
MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131
Maier-Hein L., 2005, IEEE WORKSH AUT SPEE, P331
MANABE H, 2004, P 26 ANN INT ENG MED, V2, P4389
Manabe H., 2003, P EXT ABSTR CHI2003, P794
MARCHAL A, 1993, LANG SPEECH, V36, P3
Maynard EM, 1997, ELECTROEN CLIN NEURO, V102, P228, DOI 10.1016/S0013-4694(96)95176-0
Millet L, 2007, NY TIMES BK REV, P20
Morse M. S., 1991, P 13 ANN INT C IEEE, V13, P1877
MORSE MS, 1989, IMAGES 21 CENTURY 2, V11, P1793
MORSE MS, 1986, COMPUT BIOL MED, V16, P399, DOI 10.1016/0010-4825(86)90064-8
MUNHALL KG, 1995, J ACOUST SOC AM, V98, P1222, DOI 10.1121/1.413621
Nakajima Y., 2003, P EUROSPEECH, P2601
Nakajima Y., 2003, P ICASSP, P708
Nakajima Y, 2006, IEICE T INF SYST, VE89D, P1, DOI 10.1093/ietisy/e89-d.1.1
NAKAMURA H, 1988, Patent No. 4769845
Nakada Y., 2005, Scientific Reports of the Kyoto Prefectural University, Human Environment and Agriculture, P7
NessAiver MS, 2006, J MAGN RESON IMAGING, V23, P92, DOI 10.1002/jmri.20463
Neuper C, 2003, CLIN NEUROPHYSIOL, V114, P399, DOI 10.1016/S1388-2457(02)00387-5
Ng L., 2000, P INT C AC SPEECH SI, V1, P229
NGUYEN N, 1996, J PHONETICS, P77
Nijholt A, 2008, IEEE INTELL SYST, V23, P72, DOI 10.1109/MIS.2008.41
Otani Makoto, 2008, Acoustical Science and Technology, V29, DOI 10.1250/ast.29.195
*OUISP, 2006, OR ULTR SYNTH SPEECH
Patil SA, 2010, SPEECH COMMUN, V52, P327, DOI 10.1016/j.specom.2009.11.006
PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204
PETAJAN ED, 1984, IEEE COMM SOC GLOB T
Porbadnigk A, 2009, BIOSIGNALS 2009: PROCEEDINGS OF THE INTERNATIONAL CONFERENCE ON BIO-INSPIRED SYSTEMS AND SIGNAL PROCESSING, P376
PREUSS RD, 2006, 8 INT C SIGN PROC IC, V1, P16
Quatieri TF, 2006, IEEE T AUDIO SPEECH, V14, P533, DOI 10.1109/TSA.2005.855838
ROTHENBERG M, 1992, J VOICE, V6, P36, DOI 10.1016/S0892-1997(05)80007-4
RUBIN P, 1998, P AUDIO VISUAL SPEEC, P233
SAJDA P, 2008, IEEE SIGNAL PROCESS
SCHONLE PW, 1987, BRAIN LANG, V31, P26, DOI 10.1016/0093-934X(87)90058-7
SCHROETER J, 2000, IEEE INT C MULT EXP, P571
Schultz T, 2010, SPEECH COMMUN, V52, P341, DOI 10.1016/j.specom.2009.12.002
Stone M, 2005, CLIN LINGUIST PHONET, V19, P455, DOI 10.1080/02699200500113558
Stone M, 1995, J ACOUST SOC AM, V98, P3107, DOI 10.1121/1.413799
Stone M, 1986, Dysphagia, V1, P78, DOI 10.1007/BF02407118
STONE M, 1983, J PHONETICS, V11, P207
SUGIE N, 1985, IEEE T BIO-MED ENG, V32, P485, DOI 10.1109/TBME.1985.325564
Suppes P, 1997, P NATL ACAD SCI USA, V94, P14965, DOI 10.1073/pnas.94.26.14965
TARDELLI JD, 2003, ESCTR2004084 MIT LIN
TATHAM, 1971, BEHAV TECHNOL, V6
TERC, 2009, TERC
Titze IR, 2000, J ACOUST SOC AM, V107, P581, DOI 10.1121/1.428324
TRAN VA, 2008, P SPEECH PROS CAMP B
Tran VA, 2010, SPEECH COMMUN, V52, P314, DOI 10.1016/j.specom.2009.11.005
Tran VA, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1465
Truccolo W, 2008, J NEUROSCI, V28, P1163, DOI 10.1523/JNEUROSCI.4415-07.2008
WALLICZEK M, 2006, P INT PITTSB US, P1487
WAND M, 2009, COMMUNICATI IN PRESS
WESTER M, 2006, THESIS U KARLSRUHE
Wolpaw JR, 2002, CLIN NEUROPHYSIOL, V113, P767, DOI 10.1016/S1388-2457(02)00057-3
WRENCH A, 2007, ULTRAFEST, V4
WRENCH AA, 2003, 6 INT SEM SPEECH PRO, P314
WRIGHT EJ, 2007, NEUR M PLANN 2007 SA
2008, CARSTENS MEDIZINELEK
NR 120
TC 39
Z9 42
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2010
VL 52
IS 4
SI SI
BP 270
EP 287
DI 10.1016/j.specom.2009.08.002
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 574XB
UT WOS:000276026400002
ER
PT J
AU Hueber, T
Benaroya, EL
Chollet, G
Denby, B
Dreyfus, G
Stone, M
AF Hueber, Thomas
Benaroya, Elie-Laurent
Chollet, Gerard
Denby, Bruce
Dreyfus, Gerard
Stone, Maureen
TI Development of a silent speech interface driven by ultrasound and
optical images of the tongue and lips
SO SPEECH COMMUNICATION
LA English
DT Article
DE Silent speech; Ultrasound; Corpus-based speech synthesis; Visual phone
recognition
ID SYSTEM
AB This article presents a segmental vocoder driven by ultrasound and optical images (standard CCD camera) of the tongue and lips for a "silent speech interface" application, usable either by a laryngectomized patient or for silent communication. The system is built around an audio visual dictionary which associates visual to acoustic observations for each phonetic class. Visual features are extracted from ultrasound images of the tongue and from video images of the lips using a PCA-based image coding technique. Visual observations of each phonetic class are modeled by continuous HMMs. The system then combines a phone recognition stage with corpus-based synthesis. In the recognition stage, the visual HMMs are used to identify phonetic targets in a sequence of visual features. In the synthesis stage, these phonetic targets constrain the dictionary search for the sequence of diphones that maximizes similarity to the input test data in the visual space, subject to a concatenation cost in the acoustic domain. A prosody-template is extracted from the training corpus, and the final speech waveform is generated using "Harmonic plus Noise Model" concatenative synthesis techniques. Experimental results are based on an audiovisual database containing 1 h of continuous speech from each of two speakers. (C) 2009 Elsevier B.V. All rights reserved.
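For a concrete feel of the search stage described in this abstract, the following is a minimal Python sketch of unit selection under a visual target cost and an acoustic concatenation cost, solved by dynamic programming. The data structures, cost definitions, and weight are illustrative assumptions, not the authors' implementation.

from dataclasses import dataclass
import numpy as np

@dataclass
class Unit:                      # one stored dictionary entry for a phonetic class
    visual: np.ndarray           # e.g. PCA-coded tongue/lip features
    acoustic: np.ndarray         # e.g. spectral features of the stored audio

def target_cost(unit, visual_target):
    # how well a stored unit matches the recognized visual target
    return float(np.linalg.norm(unit.visual - visual_target))

def concat_cost(prev_unit, unit):
    # acoustic mismatch at the join between consecutive units
    return float(np.linalg.norm(prev_unit.acoustic - unit.acoustic))

def viterbi_select(phone_targets, dictionary, w_concat=0.5):
    """phone_targets: list of (phone_label, visual_target) pairs from recognition.
    dictionary: {phone_label: [Unit, ...]} built from the audiovisual corpus."""
    candidates = [dictionary[p] for p, _ in phone_targets]
    cost = [np.array([target_cost(u, phone_targets[0][1]) for u in candidates[0]])]
    back = []
    for t in range(1, len(candidates)):
        tc = np.array([target_cost(u, phone_targets[t][1]) for u in candidates[t]])
        cc = np.array([[concat_cost(pu, u) for u in candidates[t]]
                       for pu in candidates[t - 1]])
        total = cost[-1][:, None] + w_concat * cc + tc[None, :]
        back.append(total.argmin(axis=0))
        cost.append(total.min(axis=0))
    path = [int(cost[-1].argmin())]      # backtrack the cheapest unit sequence
    for bp in reversed(back):
        path.append(int(bp[path[-1]]))
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]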
C1 [Hueber, Thomas; Benaroya, Elie-Laurent; Denby, Bruce; Dreyfus, Gerard] Ecole Super Phys & Chim Ind Ville Paris ESPCI Par, Elect Lab, F-75231 Paris 05, France.
[Hueber, Thomas; Chollet, Gerard] Telecom ParisTech, CNRS, LTCI, F-75634 Paris, France.
[Denby, Bruce] Univ Paris 06, F-75252 Paris, France.
[Stone, Maureen] Univ Maryland, Sch Dent, Vocal Tract Visualizat Lab, Baltimore, MD 21201 USA.
RP Hueber, T (reprint author), ESPCI ParisTech, Elect Lab, 10 Rue Vauquelin, F-75005 Paris, France.
EM hueber@ieee.org
FU French Department of Defense (DGA); Centre de Microelectronique de Paris
Ile-de-France (CEMIP); French National Research Agency (ANR)
[ANR-06-BLAN-0166]
FX This work was supported by the French Department of Defense (DGA), the
"Centre de Microelectronique de Paris Ile-de-France" (CEMIP) and the
French National Research Agency (ANR), under the contract number
ANR-06-BLAN-0166. The authors would like to thank the anonymous
reviewers for numerous valuable suggestions and corrections. They also
acknowledge the seven synthesis transcribers for their excellent work,
as well as the contributions of the collaboration members and numerous
visitors who have attended Ouisper Brainstormings over the past 3 years.
CR Akgul YS, 2000, IEEE WORKSHOP ON MATHEMATICAL METHODS IN BIOMEDICAL IMAGE ANALYSIS, PROCEEDINGS, P135
BIRKHOLZ P, 2003, P 15 INT C PHON SCI, P2597
EFRON B, 1981, BIOMETRIKA, V68, P589, DOI 10.1093/biomet/68.3.589
EPSTEIN M, 2001, J ACOUST SOC AM, V115, P2631
Fagan MJ, 2008, MED ENG PHYS, V30, P419, DOI 10.1016/j.medengphy.2007.05.003
FORNEY GD, 1973, P IEEE, V61, P268, DOI 10.1109/PROC.1973.9030
GRAVIER G, 2002, P 2 INT C HUM LANG T
HERACLEOUS P, 2005, TISSUE CONDUCTIVE AC, P93
Hogg R., 1996, PROBABILITY STAT INF
Hueber T., 2008, INT SEM SPEECH PROD, P365
HUEBER T, 2007, EIGENTONGUE FEATURE, V1, P1245
HUEBER T, 2008, PHONE RECOGNITION UL, P2032
HUEBER T, 2009, VISUOPHONETIC DECODI, P640
HUEBER T, 2007, CONTINUOUS SPEECH PH, P658
HUNT A, 1996, UNIT SELECTION CONCA, P373
JORGENSEN C, 2003, P INT JOINT C NEUR N, V4, P3128, DOI 10.1109/IJCNN.2003.1224072
Kominek J., 2004, P 5 ISCA SPEECH SYNT, P223
Li M, 2005, CLIN LINGUIST PHONET, V19, P545, DOI 10.1080/02699200500113616
Lucey P., 2006, P 8 IEEE WORKSH MULT, P24
MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131
Maier-Hein L., 2005, IEEE WORKSH AUT SPEE, P331
PERONA P, 1990, IEEE T PATTERN ANAL, V12, P629, DOI 10.1109/34.56205
SINDER D, 1997, P ASVA97 TOK, P439
Stone M, 1995, J ACOUST SOC AM, V98, P3107, DOI 10.1121/1.413799
STYLIANOU Y, 1997, DIPHONE CONCATENATIO, P613
TOKUDA K, 2000, SPEECH PARAMETER GEN, P1315
Tran VA, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1465
TURK M, 1991, FACE RECOGNITION USI, P586
Young S., 2005, HTK BOOK
Young S., 1989, FINFENGTR38 CUED CAM
Yu YJ, 2002, IEEE T IMAGE PROCESS, V11, P1260, DOI [10.1109/TIP.2002.804276, 10.1109/TIP.2002.804279]
NR 31
TC 11
Z9 12
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2010
VL 52
IS 4
SI SI
BP 288
EP 300
DI 10.1016/j.specom.2009.11.004
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 574XB
UT WOS:000276026400003
ER
PT J
AU Hirahara, T
Otani, M
Shimizu, S
Toda, T
Nakamura, K
Nakajima, Y
Shikano, K
AF Hirahara, Tatsuya
Otani, Makoto
Shimizu, Shota
Toda, Tomoki
Nakamura, Keigo
Nakajima, Yoshitaka
Shikano, Kiyohiro
TI Silent-speech enhancement using body-conducted vocal-tract resonance
signals
SO SPEECH COMMUNICATION
LA English
DT Article
DE Non-audible murmur; Body-conducted sound; Voice conversion; Talking aids
ID VOICE CONVERSION
AB The physical characteristics of weak body-conducted vocal-tract resonance signals called non-audible murmur (NAM) and the acoustic characteristics of three sensors developed for detecting these signals have been investigated. NAM signals attenuate 50 dB at 1 kHz; this attenuation consists of 30-dB full-range attenuation due to air-to-body transmission loss and 10 dB/octave spectral decay due to a sound propagation loss within the body. These characteristics agree with the spectral characteristics of measured NAM signals. The sensors have a sensitivity of between 41 and 58 dB [V/Pa] at 1 kHz, and the mean signal-to-noise ratio of the detected signals was 15 dB. On the basis of these investigations, three types of silent-speech enhancement systems were developed: (1) simple, direct amplification of weak vocal-tract resonance signals using a wired urethane-elastomer NAM microphone, (2) simple, direct amplification using a wireless urethane-elastomer-duplex NAM microphone, and (3) transformation of the weak vocal-tract resonance signals sensed by a soft-silicone NAM microphone into whispered speech using statistical conversion. Field testing of the systems showed that they enable voice impaired people to communicate verbally using body-conducted vocal-tract resonance signals. Listening tests demonstrated that weak body-conducted vocal-tract resonance sounds can be transformed into intelligible whispered speech sounds. Using these systems, people with voice impairments can re-acquire speech communication with less effort. (C) 2009 Elsevier B.V. All rights reserved.
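The attenuation figures quoted above combine as shown in the small back-of-the-envelope sketch below: a 30 dB frequency-independent transmission loss plus a 10 dB/octave decay. The reference frequency at which the decay is anchored is an assumption for illustration only.

import numpy as np

def nam_attenuation_db(freq_hz, flat_loss_db=30.0, decay_db_per_octave=10.0,
                       ref_hz=250.0):
    # octaves above the (assumed) reference frequency where the spectral decay starts
    octaves = np.maximum(np.log2(np.asarray(freq_hz, dtype=float) / ref_hz), 0.0)
    return flat_loss_db + decay_db_per_octave * octaves

# With ref_hz = 250 Hz this model gives 30 + 10*log2(1000/250) = 50 dB at 1 kHz,
# matching the figure quoted in the abstract.
print(nam_attenuation_db([250, 500, 1000, 2000]))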
C1 [Hirahara, Tatsuya; Otani, Makoto; Shimizu, Shota] Toyama Prefectural Univ, Dept Intelligent Syst Design Engn, Toyama 9390398, Japan.
[Toda, Tomoki; Nakamura, Keigo; Nakajima, Yoshitaka; Shikano, Kiyohiro] Nara Inst Sci & Technol, Grad Sch Informat Sci, Nara 6300192, Japan.
RP Hirahara, T (reprint author), Toyama Prefectural Univ, Dept Intelligent Syst Design Engn, 5180 Kurokawa, Toyama 9390398, Japan.
EM hirahara@pu-toyama.ac.jp
FU SCOPE of the Ministry of Internal Affairs and Communications of Japan
FX This work was supported by SCOPE of the Ministry of Internal Affairs and
Communications of Japan.
CR Abe M., 1990, TRI0166 ATR INT TEL
Espy-Wilson CY, 1998, J SPEECH LANG HEAR R, V41, P1253
Fant G., 1970, ACOUSTIC THEORY SPEE
Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148
FUJISAKA Y, 2004, TECHNICAL REPORT I E, V103, P13
Heracleous P., 2003, P ASRU, P73
Higashikawa M, 1999, J SPEECH LANG HEAR R, V42, P583
KIKUCHI Y, 2004, P INT C SPEECH PROS, P761
NAKAGIRI M, 2006, P INTERSPEECH PITTSB, P2270
Nakajima Y., 2005, P INTERSPEECH LISB P, P389
Nakajima Y., 2003, P EUROSPEECH, P2601
NAKAJIMA Y, 2005, TECHNICAL REPORT I E, V105, P7
Nakajima Y., 2003, P ICASSP, P708
Nakajima Y, 2006, IEICE T INF SYST, VE89D, P1, DOI 10.1093/ietisy/e89-d.1.1
NAKAMURA K, 2007, P INTERSPEECH ANTW B, P2517
Nakamura K., 2006, P INTERSPEECH PITTSB, P1395
Nota Y., 2007, Acoustical Science and Technology, V28, DOI 10.1250/ast.28.33
OESTREICHER HL, 1951, J ACOUST SOC AM, V23, P707, DOI 10.1121/1.1906828
Otani M, 2009, APPL ACOUST, V70, P469, DOI 10.1016/j.apacoust.2008.05.003
SAGISAKA Y, 1992, J ACOUST SOC JPN, V48, P878
Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472
TITZE IR, 1989, J ACOUST SOC AM, V85, P1699, DOI 10.1121/1.397959
TODA T, 2005, P ICASSP PHIL US MAR, V1, P9, DOI 10.1109/ICASSP.2005.1415037
Toda T., 2005, P INTERSPEECH LISB P, P1957
Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344
Uemi N., 1994, Proceedings. 3rd IEEE International Workshop on Robot and Human Communication. RO-MAN '94 Nagoya (Cat. No.94TH0679-1), DOI 10.1109/ROMAN.1994.365931
NR 26
TC 8
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2010
VL 52
IS 4
SI SI
BP 301
EP 313
DI 10.1016/j.specom.2009.12.001
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 574XB
UT WOS:000276026400004
ER
PT J
AU Tran, VA
Bailly, G
Loevenbruck, H
Toda, T
AF Tran, Viet-Anh
Bailly, Gerard
Loevenbruck, Helene
Toda, Tomoki
TI Improvement to a NAM-captured whisper-to-speech system
SO SPEECH COMMUNICATION
LA English
DT Article
DE Non-audible murmur; Whispered speech; Audiovisual voice conversion;
Silent speech interface
ID EXTRACTION; PITCH
AB Exploiting a tissue-conductive sensor (a stethoscopic microphone), the system developed at NAIST, which converts non-audible murmur (NAM) to audible speech by GMM-based statistical mapping, is a very promising technique. The quality of the converted speech is, however, still insufficient for computer-mediated communication, notably because of the poor estimation of F-0 from unvoiced speech and because of impoverished phonetic contrasts. This paper presents our investigations to improve the intelligibility and naturalness of the synthesized speech and first objective and subjective evaluations of the resulting system. The first improvement concerns voicing and F-0 estimation. Instead of using a single GMM for both, we estimate a continuous F-0 using a GMM trained on target voiced segments only. The continuous F-0 estimation is filtered by a voicing decision computed by a neural network. The objective and subjective improvement is significant. The second improvement concerns the input time window and its dimensionality reduction: we show that the precision of F-0 estimation is also significantly improved by extending the input time window from 90 to 450 ms and by using a Linear Discriminant Analysis (LDA) instead of the original Principal Component Analysis (PCA). Estimation of the spectral envelope is also slightly improved with LDA but is degraded with larger time windows. A third improvement consists in adding visual parameters both as input and output parameters. The positive contribution of this information is confirmed by a subjective test. Finally, HMM-based conversion is compared with GMM-based conversion. (C) 2009 Elsevier B.V. All rights reserved.
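The two-stage F-0 strategy outlined above (a regression model trained on voiced frames producing a continuous contour, gated by a separate voicing classifier) can be sketched as below. The use of a joint GMM fitted with scikit-learn, the feature dimensions, and the component count are assumptions of this illustration, not the NAIST/GIPSA implementation.

import numpy as np
from sklearn.mixture import GaussianMixture

def fit_f0_mapper(x_voiced, f0_voiced, n_components=8):
    """Fit a joint GMM on [input features ; F0] using voiced frames only."""
    joint = np.hstack([x_voiced, f0_voiced[:, None]])
    return GaussianMixture(n_components=n_components, covariance_type="full",
                           random_state=0).fit(joint)

def predict_f0(gmm, x):
    """Conditional expectation E[F0 | x] under the joint GMM (MMSE mapping)."""
    d = x.shape[1]
    mu_x, mu_y = gmm.means_[:, :d], gmm.means_[:, d]
    S = gmm.covariances_
    Sxx, Sxy = S[:, :d, :d], S[:, :d, d]
    # responsibilities of each component given x alone
    logr = np.zeros((x.shape[0], gmm.n_components))
    for k in range(gmm.n_components):
        diff = x - mu_x[k]
        inv = np.linalg.inv(Sxx[k])
        logdet = np.linalg.slogdet(Sxx[k])[1]
        logr[:, k] = np.log(gmm.weights_[k]) - 0.5 * (
            np.einsum("ni,ij,nj->n", diff, inv, diff) + logdet + d * np.log(2 * np.pi))
    r = np.exp(logr - logr.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # per-component conditional means of F0 given x
    cond = np.zeros((x.shape[0], gmm.n_components))
    for k in range(gmm.n_components):
        cond[:, k] = mu_y[k] + (x - mu_x[k]) @ np.linalg.solve(Sxx[k], Sxy[k])
    return (r * cond).sum(axis=1)

# A separate voicing classifier (e.g. a small neural network) would then zero out
# the contour on frames it judges unvoiced.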
C1 [Tran, Viet-Anh; Bailly, Gerard; Loevenbruck, Helene] Grenoble Univ, CNRS, UMR 5216, GIPSA Lab, Grenoble, France.
[Toda, Tomoki] Nara Inst Sci & Technol, Grad Sch Informat Sci, Nara, Japan.
RP Tran, VA (reprint author), Grenoble Univ, CNRS, UMR 5216, GIPSA Lab, Grenoble, France.
EM viet-anh.tran@gipsa-lab.inpg.fr; gerard.bailly@gipsa-lab.inpg.fr;
helene.loevenbruck@gipsa-lab.inpg.fr; tomoki@is.naist.jp
CR Badin P, 2002, J PHONETICS, V30, P533, DOI 10.1006/jpho.2002.0166
BAILLY G, 2008, SPEAKING SMILE DISGU, P111
Bailly G., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025700715107
BETT BJ, 2005, SMALL VOCABULARY REC, P16
COLEMAN J, 2002, LARYNX MOVEMENTS INT
Cootes TF, 2001, IEEE T PATTERN ANAL, V23, P681, DOI 10.1109/34.927467
CREVIERBUCHMAN L, 2008, ACT 64 C SOC FRAC PH
HERACLEOUS P, 2005, INT C SMART OBJ AMB, P93
Higashikawa M, 1996, J VOICE, V10, P155, DOI 10.1016/S0892-1997(96)80042-7
Hueber T, 2007, INT CONF ACOUST SPEE, P1245
HUEBER T, 2008, PHONE RECOGNITION UL, P2032
HUEBER T, 2008, SEGMENTAL VOCODER DR, P2028
HUEBER T, 2007, INT C PHON SCI SAARB, P2193
HUEBER T, 2007, CONTINUOUS SPEECH PH, P658
INOUYE T, 1970, J NERV MENT DIS, V151, P415, DOI 10.1097/00005053-197012000-00007
Jorgensen C., 2005, P 38 ANN HAW INT C S, p294c, DOI 10.1109/HICSS.2005.683
JOU SC, 2006, CONTINUOUS SPEECH RE, P573
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Nakagiri M, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2270
Nakajima Y., 2003, INT C AC SPEECH SIGN, P708
POTAMIANOS G, 2003, JOINT AUDIOVISUAL SP, P95
Reveret L., 2000, INT C SPEECH LANG PR, P755
Sodoyer D, 2004, SPEECH COMMUN, V44, P113, DOI 10.1016/j.specom.2004.10.002
SUMMERFIELD AQ, 1989, HDB RES FACE PROCESS, P223
SUMMERFIELD Q, 1979, PHONETICA, V36, P314
TODA T, 2005, NAM TO SPEECH CONVER, P1957
Toda T., 2009, P ICASSP TAIP TAIW, P3601
TODA T, 2005, SPEECH PARAMETER GEN, P2801
Tokuda K, 2000, INT CONF ACOUST SPEE, P1315, DOI 10.1109/ICASSP.2000.861820
TRAN VA, 2008, P SPEECH PROS CAMP B
WALLICZEK M, 2006, SUB WORD UNIT BASED, P1487
Young S., 1999, HTK BOOK
Zen H., 2007, SPEECH SYNTH WORKSH, P294
Zeroual C., 2005, P INTERSPEECH LISB, P1069
NR 34
TC 7
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2010
VL 52
IS 4
SI SI
BP 314
EP 326
DI 10.1016/j.specom.2009.11.005
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 574XB
UT WOS:000276026400005
ER
PT J
AU Patil, SA
Hansen, JHL
AF Patil, Sanjay A.
Hansen, John H. L.
TI The physiological microphone (PMIC): A competitive alternative for
speaker assessment in stress detection and speaker verification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Physiological sensor; Stress detection; Speaker verification;
Non-acoustic sensor; PMIC
ID AUTOMATIC SPEECH RECOGNITION; HEART-RATE; CLASSIFICATION; NOISE;
COMPENSATION; FEATURES; SENSOR
AB Interactive speech system scenarios exist which require the user to perform tasks which exert limitations on speech production, thereby causing speaker variability and reduced speech performance. In noisy stressful scenarios, even if noise could be completely eliminated, the production variability brought on by stress, including Lombard effect, has a more pronounced impact on speech system performance. Thus, in this study we focus on the use of a silent speech interface (PMIC), with a corresponding experimental assessment to illustrate its utility in the tasks of stress detection and speaker verification. This study focuses on the suitability of PMIC versus close-talk microphone (CTM), and reports that the PMIC achieves as good performance as CTM or better for a number of test conditions. PMIC reflects both stress-related information and speaker-dependent information to a far greater extent than the CTM. For stress detection performance (which is reported in % accuracy), PMIC performs at least on par or about 2% better than the CTM-based system. For a speaker verification application, the PMIC outperforms CTM for all matched stress conditions. The performance reported in terms of %EER is 0.91% (as compared to 1.69%), 0.45% (as compared to 1.49%), and 1.42% (as compared to 1.80%) for PMIC. This indicates that PMIC reflects speaker-dependent information. Also, another advantage of the PMIC is its ability to record the user physiology traits/state. Our experiments illustrate that PMIC can be an attractive alternative for stress detection as well as speaker verification tasks along with an advantage of its ability to record physiological information, in situations where the use of CTM may hinder operations (deep sea divers, fire-fighters in rescue operations, etc.). (C) 2009 Elsevier B.V. All rights reserved.
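The %EER figures quoted in this abstract are equal error rates, the operating point at which false acceptances and false rejections occur equally often. The following is a small, purely illustrative Python sketch of that metric; it is not tied to the PMIC/CTM experimental pipeline.

import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    g = np.asarray(genuine_scores, float)
    i = np.asarray(impostor_scores, float)
    thresholds = np.sort(np.concatenate([g, i]))
    frr = np.array([(g < t).mean() for t in thresholds])    # false rejection rate
    far = np.array([(i >= t).mean() for t in thresholds])   # false acceptance rate
    k = int(np.argmin(np.abs(far - frr)))                   # point where the two curves cross
    return 100.0 * (far[k] + frr[k]) / 2.0                  # EER in percent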
C1 [Patil, Sanjay A.; Hansen, John H. L.] Univ Texas Dallas, Dept Elect Engn, CRSS, Erik Jonsson Sch Engn & Comp Sci, Richardson, TX 75080 USA.
RP Hansen, JHL (reprint author), Univ Texas Dallas, Dept Elect Engn, CRSS, Erik Jonsson Sch Engn & Comp Sci, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA.
EM john.hansen@utdallas.edu
CR AKARGUN UC, 2007, IEEE SIGNAL PROCESS, P1
Baber C, 1996, SPEECH COMMUN, V20, P37, DOI 10.1016/S0167-6393(96)00043-X
Benzeghiba M, 2007, SPEECH COMMUN, V49, P763, DOI 10.1016/j.specom.2007.02.006
Bimbot F, 2004, EURASIP J APPL SIG P, V2004, P430, DOI 10.1155/S1110865704310024
BOUGHAZALE S, 1996, THESIS DUKE U
BOUGHAZALE SE, 1995, INT CONF ACOUST SPEE, P664, DOI 10.1109/ICASSP.1995.479685
Bou-Ghazale SE, 2000, IEEE T SPEECH AUDI P, V8, P429, DOI 10.1109/89.848224
BRADY K, 2004, IEEE INT C AC SPEECH
BROUHA L, 1961, J APPL PHYSIOL, V16, P133
BROUHA L, 1963, J APPL PHYSIOL, V18, P1095
Brown DR, 2005, MEAS SCI TECHNOL, V16, P2381, DOI 10.1088/0957-0233/16/11/033
BURNETT G, 1999, THESIS U CALIFORNIA
CHAN C, 2003, THESIS U NEW BRUNSWI
CORRIGAN G, 1996, THESIS NW U WA
COSMIDES L, 1983, J EXP PSYCHOL HUMAN, V9, P864, DOI 10.1037/0096-1523.9.6.864
Courteville A, 1998, IEEE T BIO-MED ENG, V45, P145, DOI 10.1109/10.661262
Denes PB, 1993, SPEECH CHAIN PHYS BI
DEPAULA MH, 1992, REV SCI INSTRUM, V63, P3487, DOI 10.1063/1.1143753
Freund Y., 1999, Journal of Japanese Society for Artificial Intelligence, V14
GABLE TJ, 2000, THESIS U CALIFORNIA
Goodie JL, 2000, J PSYCHOPHYSIOL, V14, P159, DOI 10.1027//0269-8803.14.3.159
Graciarena M, 2003, IEEE SIGNAL PROC LET, V10, P72, DOI [10.1109/LSP.2003.808549, 10.1109/LSP.2002.808549]
Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618
Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7
Hansen J. H. L., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319239
HEUBER T, 2007, CONTINUOUS SPEECH PH, P658
HOLZRICHTER JF, 2002, IEEE 10 DIG SIGN PRO, P35
Huang RQ, 2007, IEEE T AUDIO SPEECH, V15, P453, DOI 10.1109/TASL.2006.881695
Ikeno A., 2007, IEEE AER C 2007 BIG, P1
INGALLS R, 1987, J ACOUST SOC AM, V81, P809, DOI 10.1121/1.394659
JOU S, 2007, IEEE INT C AC SPEECH
Junqua JC, 1996, SPEECH COMMUN, V20, P13, DOI 10.1016/S0167-6393(96)00041-6
Klabunde R. E., 2005, CARDIOVASCULAR PHYSL
KNUDSEN S, 1994, 10 SPIE INT C OPT FI, V2360, P396
MAINARDI E, 2007, ANN INT C IEEE ENG M, P3035
MAXFIELD ME, 1963, J APPL PHYSIOL, V18, P1099
MOHAMAD N, 2007, P SOC PHOTO-OPT INS, V6800, P40
MULDER G, 1981, PSYCHOPHYSIOLOGY, V18, P392, DOI 10.1111/j.1469-8986.1981.tb02470.x
Murray IR, 1996, SPEECH COMMUN, V20, P3, DOI 10.1016/S0167-6393(96)00040-4
Noma H, 2005, Ninth IEEE International Symposium on Wearable Computers, Proceedings, P210, DOI 10.1109/ISWC.2005.56
OTANI K, 1995, IEEE J SEL AREA COMM, V13, P42, DOI 10.1109/49.363147
PETERS RD, 1995, COMP MED SY, P204
Quatieri TF, 2006, IEEE T AUDIO SPEECH, V14, P533, DOI 10.1109/TSA.2005.855838
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Roucos S., 1986, ICASSP 86 Proceedings. IEEE-IECEJ-ASJ International Conference on Acoustics, Speech and Signal Processing (Cat. No.86CH2243-4)
SCANLON M, 2002, MULT SPEECH REC WORK, P1
SHAHINA A, 2005, INT C INT SENS INF P, P400
Ten Bosch Louis, 2003, SPEECH COMMUN, V40.1, P213
Titze IR, 2000, J ACOUST SOC AM, V107, P581, DOI 10.1121/1.428324
Tran VA, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1465
VISWANATHAN V, 1984, IEEE INT C AC SPEECH, P57
WAND M, 2007, INTERSPEECH 2007
WANG H, 2003, P IEEE SENSORS, V2, P1096
Womack BD, 1996, SPEECH COMMUN, V20, P131, DOI 10.1016/S0167-6393(96)00049-0
WOMACK BD, 1966, IEEE INT C AC SPEECH, V1, P53
Zhou GJ, 2001, IEEE T SPEECH AUDI P, V9, P201, DOI 10.1109/89.905995
NR 56
TC 9
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2010
VL 52
IS 4
SI SI
BP 327
EP 340
DI 10.1016/j.specom.2009.11.006
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 574XB
UT WOS:000276026400006
ER
PT J
AU Schultz, T
Wand, M
AF Schultz, Tanja
Wand, Michael
TI Modeling coarticulation in EMG-based continuous speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE EMG-based speech recognition; Silent Speech Interfaces; Phonetic
features
ID SIGNALS
AB This paper discusses the use of surface electromyography for automatic speech recognition. Electromyographic signals captured at the facial muscles record the activity of the human articulatory apparatus and thus allow a speech signal to be traced back even if it is spoken silently. Since speech is captured before it gets airborne, the resulting signal is not masked by ambient noise. The resulting Silent Speech Interface has the potential to overcome major limitations of conventional speech-driven interfaces: it is not prone to any environmental noise, allows confidential information to be transmitted silently, and does not disturb bystanders.
We describe our new approach of phonetic feature bundling for modeling coarticulation in EMG-based speech recognition and report results on the EMG-PIT corpus, a multiple speaker large vocabulary database of silent and audible EMG speech recordings, which we recently collected. Our results on speaker-dependent and speaker-independent setups show that modeling the interdependence of phonetic features reduces the word error rate of the baseline system by over 33% relative. Our final system achieves 10% word error rate for the best-recognized speaker on a 101-word vocabulary task, bringing EMG-based speech recognition within a useful range for the application of Silent Speech Interfaces. (C) 2009 Elsevier B.V. All rights reserved.
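The idea of phonetic feature bundling mentioned above can be illustrated with a toy sketch: rather than modeling every phone-in-context separately, each context is described by a bundle of phonetic features, so that contexts sharing features can share statistics. The feature table and bundling function below are hypothetical toy values for illustration, not the EMG-PIT setup.

PHONETIC_FEATURES = {                  # toy feature table
    "p": {"voiced": 0, "place": "bilabial", "manner": "stop"},
    "b": {"voiced": 1, "place": "bilabial", "manner": "stop"},
    "m": {"voiced": 1, "place": "bilabial", "manner": "nasal"},
    "s": {"voiced": 0, "place": "alveolar", "manner": "fricative"},
}

def feature_bundle(left, center, right):
    """Represent a phone in context by the bundled features of the triphone,
    so that similar contexts map to overlapping bundles and share one model."""
    return tuple(sorted(
        (pos, name, value)
        for pos, phone in (("L", left), ("C", center), ("R", right))
        for name, value in PHONETIC_FEATURES[phone].items()
    ))

# Two different triphones with a shared center and right context produce largely
# overlapping bundles and can therefore pool their training data:
print(feature_bundle("p", "m", "s"))
print(feature_bundle("b", "m", "s"))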
C1 [Schultz, Tanja; Wand, Michael] Karlsruhe Inst Technol, Cognit Syst Lab, D-76131 Karlsruhe, Germany.
RP Schultz, T (reprint author), Karlsruhe Inst Technol, Cognit Syst Lab, Adenauerring 4, D-76131 Karlsruhe, Germany.
EM tanja.schultz@kit.edu
FU School of Health and Rehabilitation Sciences, University of Pittsburgh.
FX The authors would like to thank Szu-Chen (Stan) Jou for his in-depth
support with the initial recognition system, his help with the EMG-PIT
collection and the data scripts. We also thank Maria Dietrich for
recruiting all subjects and carrying out major parts of the database
collection. Her study was supported in part through funding received
from the SHRS Research Development Fund, School of Health and
Rehabilitation Sciences, University of Pittsburgh.
CR Bahl L, 1991, P INT C AC SPEECH SI, P185, DOI 10.1109/ICASSP.1991.150308
BEYERLEIN P, 2000, THESIS RWTH AACHEN
Chan ADC, 2001, MED BIOL ENG COMPUT, V39, P500, DOI 10.1007/BF02345373
Denby B, 2010, SPEECH COMMUN, V52, P270, DOI 10.1016/j.specom.2009.08.002
Dietrich M., 2008, THESIS U PITTSBURGH
FRANKEL J, 2004, P INT C SPOK LANG PR, P1202, DOI DOI 10.1016/J.SPECOM.2008.05.004
International Phonetic Association, 1999, HDB INT PHON ASS
JORGENSEN C, 2003, P INT JOINT C NEUR N, V4, P3128, DOI 10.1109/IJCNN.2003.1224072
JOU SC, 2006, P INT PITTSB PA, P573
JOU SCS, 2007, P IEEE INT C AC SPEE, P401
Kirchhoff K., 1999, THESIS U BIELEFELD
Leveau B, 1992, SELECTED TOPICS SURF
Maier-Hein L., 2005, IEEE WORKSH AUT SPEE, P331
METZE F, 2005, THESIS U KARLSRUHE
Metze F., 2002, P INT C SPOK LANG PR, P2133
MORSE MS, 1989, P ANN INT C IEEE ENG, P1793
MORSE MS, 1991, PROCEEDINGS OF THE ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY, VOL 13, PTS 1-5, P1877, DOI 10.1109/IEMBS.1991.684800
MORSE MS, 1986, COMPUT BIOL MED, V16, P399, DOI 10.1016/0010-4825(86)90064-8
SCHUNKE M, 2006, KOPF NEUROANATOMIE, V3
SUGIE N, 1985, IEEE T BIO-MED ENG, V32, P485, DOI 10.1109/TBME.1985.325564
Ueda N, 2000, J VLSI SIG PROC SYST, V26, P133, DOI 10.1023/A:1008155703044
WALLICZEK M, 2006, P INT PITTSB US, P1487
WAND M, 2009, P BIOS PROT PROT, P155
WAND M, COMMUNICATI IN PRESS
WAND M, 2009, P INT BRIGHT UK
YU H, 2000, P INT C SPOK LANG PR, P353
YU H, 2003, P EUR GEN SWITZ, P1869
NR 27
TC 30
Z9 30
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2010
VL 52
IS 4
SI SI
BP 341
EP 353
DI 10.1016/j.specom.2009.12.002
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 574XB
UT WOS:000276026400007
ER
PT J
AU Jorgensen, C
Dusan, S
AF Jorgensen, Charles
Dusan, Sorin
TI Speech interfaces based upon surface electromyography
SO SPEECH COMMUNICATION
LA English
DT Article
DE Electromyography; Speech recognition; Speech synthesis; Bioelectric
control; Articulatory synthesis; EMG
ID TRACT AREA FUNCTION; ARTICULATORY MODEL; RECOGNITION
AB This paper discusses the use of surface electromyography (EMG) to recognize and synthesize speech. The acoustic speech signal can be significantly corrupted by high noise in the environment or impeded by garments or masks. Such situations occur, for example, when firefighters wear pressurized suits with self-contained breathing apparatus (SCBA) or when astronauts perform operations in pressurized gear. In these conditions it is important to capture and transmit clear speech commands in spite of a corrupted or distorted acoustic speech signal. One way to mitigate this problem is to use surface electromyography to capture activity of speech articulators and then, either recognize spoken commands from EMG signals or use these signals to synthesize acoustic speech commands. We describe a set of experiments for both speech recognition and speech synthesis based on surface electromyography and discuss the lessons learned about the characteristics of the EMG signal for these domains. The experiments include speech recognition in high noise based on 15 commands for firefighters wearing self-contained breathing apparatus, a sub-vocal speech robotic platform control experiment based on five words, a speech recognition experiment testing recognition of vowels and consonants, and a speech synthesis experiment based on an articulatory speech synthesizer. Published by Elsevier B.V.
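As a generic illustration of how a raw surface-EMG channel is turned into frame-level features for a small-vocabulary command classifier of the kind summarized above, the following sketch computes windowed RMS energy and zero-crossing rate. These particular features, the sampling rate, and window sizes are common EMG choices assumed here for illustration, not the NASA pipeline.

import numpy as np

def emg_frames(signal, fs=2000, win_ms=100, hop_ms=50):
    x = np.asarray(signal, dtype=float)
    win, hop = int(fs * win_ms / 1000), int(fs * hop_ms / 1000)
    feats = []
    for start in range(0, len(x) - win + 1, hop):
        frame = x[start:start + win]
        rms = np.sqrt(np.mean(frame ** 2))                   # overall muscle activation
        zcr = np.mean(np.abs(np.diff(np.sign(frame))) > 0)   # crude spectral proxy
        feats.append([rms, zcr])
    return np.array(feats)

# A sequence classifier (an HMM, or even nearest-neighbour over averaged frames)
# would then map each utterance's feature sequence to one of the command labels.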
C1 [Jorgensen, Charles] NASA, Ames Res Ctr, Computat Sci Div, Moffett Field, CA 94035 USA.
[Dusan, Sorin] NASA, Ames Res Ctr, MCT Inc, Moffett Field, CA 94035 USA.
RP Jorgensen, C (reprint author), NASA, Ames Res Ctr, Computat Sci Div, M-S 269-1, Moffett Field, CA 94035 USA.
EM Charles.Jorgensen@nasa.gov
FU NASA
FX The authors would like to particularly thank the NASA Aeronautics Basic
Research program's Extension of the Human Senses Initiative for support
over the years during the development of these technologies and the
collegial contributions of Drs. Bradley Betts, Kevin Wheeler, Kim
Binstead, Shinji Maeda, Arturo Galvan, Jianwu Dang, and Mrs. Rebekah
Kochavi and Mrs. Diana Lee.
CR Betts BJ, 2006, INTERACT COMPUT, V18, P1242, DOI 10.1016/j.intcom.2006.08.012
BIRKHOLZ P, 2007, 6 ISCA WORKSH SPEECH
BRADY K, 2004, P 2004 IEEE INT C AC, V1, P477
*CARN MELL U ROB, PERS EXPL ROV
Chan ADC, 2005, IEEE T BIO-MED ENG, V52, P121, DOI 10.1109/TBME.2004.836492
COKER CH, 1966, J ACOUST SOC AM, V40, P1271, DOI 10.1121/1.2143456
DELUCA CJ, 2005, IMAGING BEHAV MOTOR
DUSAN S, 2000, P 5 SEM SPECH PROD M
Faaborg-Anderson K., 1957, ACTA PHYSL SCAN S140, V41, P1
Gerdle B., 1999, MODERN TECHNIQUES NE, P705
Graciarena M, 2003, IEEE SIGNAL PROC LET, V10, P72, DOI [10.1109/LSP.2003.808549, 10.1109/LSP.2002.808549]
JORGENSEN C, 2003, P INT JOINT C NEUR N, V4, P3128, DOI 10.1109/IJCNN.2003.1224072
JORGENSEN C, 2005, P 38 ANN HAW INT C S
JOU SC, 2005, P IEEE INT C AC SPEE, P1009
JUNQUA JC, 1993, J AM SOC AM, V93, P512
Junqua J-C, 1999, P INT C AC SPEECH SI, P2083
Kingsbury N, 2001, APPL COMPUT HARMON A, V10, P234, DOI 10.1006/acha.2000.0343
LABOISSIERE R, 1995, P 13 INT C PHON SCI, V1, P358
MAEDA S, 1979, J ACOUST SOC AM, V65, pS22, DOI 10.1121/1.2017158
MAEDA S, 1982, SPEECH COMM
MAEDA S, 1988, J ACOUST SOC AM, V84, pS146, DOI 10.1121/1.2025845
MCCOWAN I, 2005, 04 IDIAP RES, P73
McGowan RS, 1996, J ACOUST SOC AM, V99, P595, DOI 10.1121/1.415220
MENDES JAG, 2008, IEEE COMP SOC C IM S
MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427
MEYER P, 1986, SIGNAL PROCESS, V3, P377
Ng L., 2000, P INT C AC SPEECH SI, V1, P229
PERRIER P, 1992, J SPEECH HEAR RES, V35, P53
RUBIN P, 1981, J ACOUST SOC AM, V70, P321, DOI 10.1121/1.386780
Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2
Shah SA, 2005, AM J TRANSPLANT, V5, P400
STEINER I, 2008, 8 INT SEM SPEECH PRO
Trejo LJ, 2003, IEEE T NEUR SYS REH, V11, P199, DOI 10.1109/TNSRE.2003.814426
Wheeler KR, 2003, IEEE PERVAS COMPUT, V2, P56, DOI 10.1109/MPRV.2003.1203754
NR 34
TC 11
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2010
VL 52
IS 4
SI SI
BP 354
EP 366
DI 10.1016/j.specom.2009.11.003
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 574XB
UT WOS:000276026400008
ER
PT J
AU Brumberg, JS
Nieto-Castanon, A
Kennedy, PR
Guenther, FH
AF Brumberg, Jonathan S.
Nieto-Castanon, Alfonso
Kennedy, Philip R.
Guenther, Frank H.
TI Brain-computer interfaces for speech communication
SO SPEECH COMMUNICATION
LA English
DT Article
DE Brain-computer interface; Neural prosthesis; Speech restoration
ID THOUGHT-TRANSLATION DEVICE; INTRACORTICAL ELECTRODE ARRAY; VISUAL-EVOKED
POTENTIALS; NEURAL-NETWORK MODEL; CEREBRAL-CORTEX; CONE ELECTRODE;
ELECTROCORTICOGRAPHIC SIGNALS; MICROELECTRODE ARRAYS; MACHINE INTERFACE;
CORTICAL-NEURONS
AB This paper briefly reviews current silent speech methodologies for normal and disabled individuals. Current techniques utilizing electromyographic (EMG) recordings of vocal tract movements are useful for physically healthy individuals but fail for tetraplegic individuals who do not have accurate voluntary control over the speech articulators. Alternative methods utilizing EMG from other body parts (e.g., hand, arm, or facial muscles) or electroencephalography (EEG) can provide capable silent communication to severely paralyzed users, though current interfaces are extremely slow relative to normal conversation rates and require constant attention to a computer screen that provides visual feedback and/or cueing. We present a novel approach to the problem of silent speech via an intracortical microelectrode brain-computer interface (BCI) to predict intended speech information directly from the activity of neurons involved in speech production. The predicted speech is synthesized and acoustically fed back to the user with a delay under 50 ms. We demonstrate that the Neurotrophic Electrode used in the BCI is capable of providing useful neural recordings for over 4 years, a necessary property for BCIs that need to remain viable over the lifespan of the user. Other design considerations include neural decoding techniques based on previous research involving BCIs for computer cursor or robotic arm control via prediction of intended movement kinematics from motor cortical signals in monkeys and humans. Initial results from a study of continuous speech production with instantaneous acoustic feedback show the BCI user was able to improve his control over an artificial speech synthesizer both within and across recording sessions. The success of this initial trial validates the potential of the intracortical microelectrode-based approach for providing a speech prosthesis that can allow much more rapid communication rates. (C) 2010 Elsevier B.V. All rights reserved.
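The decoding approach alluded to above (predicting continuous kinematic or formant-like state from neural firing rates) is commonly cast as Kalman filtering. The following is a hedged sketch of such a decoder; the matrices A, C, W, Q would be fit from training data and are simply assumed inputs here, and this is not the Neural Signals/Boston University implementation.

import numpy as np

def kalman_decode(firing_rates, A, C, W, Q, x0, P0):
    """firing_rates: (T, n_units) observations; returns (T, state_dim) state estimates.
    State model:      x_t = A x_{t-1} + w,  w ~ N(0, W)
    Observation model: z_t = C x_t + q,     q ~ N(0, Q)"""
    x, P = x0.copy(), P0.copy()
    out = []
    for z in firing_rates:
        # predict
        x = A @ x
        P = A @ P @ A.T + W
        # update with the current vector of firing rates
        S = C @ P @ C.T + Q
        K = P @ C.T @ np.linalg.inv(S)
        x = x + K @ (z - C @ x)
        P = (np.eye(len(x)) - K @ C) @ P
        out.append(x.copy())
    return np.array(out)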
C1 [Brumberg, Jonathan S.; Guenther, Frank H.] Boston Univ, Dept Cognit & Neural Syst, Boston, MA 02215 USA.
[Guenther, Frank H.] Boston Univ, Sargent Coll Hlth & Rehabil Sci, Dept Speech Language & Hearing Sci, Boston, MA 02215 USA.
[Guenther, Frank H.] Harvard Univ, MIT, Div Hlth Sci & Technol, Cambridge, MA 02139 USA.
[Guenther, Frank H.] Massachusetts Gen Hosp, Athinoula A Martinos Ctr Biomed Imaging, Charlestown, MA 02129 USA.
[Brumberg, Jonathan S.; Kennedy, Philip R.] Neural Signals Inc, Duluth, GA 30096 USA.
[Nieto-Castanon, Alfonso] StatsANC LLC, Buenos Aires, DF, Argentina.
RP Brumberg, JS (reprint author), Boston Univ, Dept Cognit & Neural Syst, 677 Beacon St, Boston, MA 02215 USA.
EM brumberg@cns.bu.edu
FU National Institute on Deafness and other Communication Disorders [R01
DC007683, R01 DC002852, R44 DC007050-02]; CELEST, an NSF Science of
Learning Center [NSF SBE-0354378]
FX This research was supported by the National Institute on Deafness and
other Communication Disorders (R01 DC007683; R01 DC002852; R44
DC007050-02) and by CELEST, an NSF Science of Learning Center (NSF
SBE-0354378). The authors thank the participant and his family for their
dedication to this research project, Rob Law and Misha Panko for their
assistance with the preparation of this manuscript and Tanja Schultz for
her helpful comments.
CR Allison BZ, 2008, CLIN NEUROPHYSIOL, V119, P399, DOI 10.1016/j.clinph.2007.09.121
Bartels J, 2008, J NEUROSCI METH, V174, P168, DOI 10.1016/j.jneumeth.2008.06.030
Betts BJ, 2006, INTERACT COMPUT, V18, P1242, DOI 10.1016/j.intcom.2006.08.012
Birbaumer N, 2000, IEEE T REHABIL ENG, V8, P190, DOI 10.1109/86.847812
Birbaumer N, 2003, IEEE T NEUR SYS REH, V11, P120, DOI 10.1109/TNSRE.2003.814439
Birbaumer N, 1999, NATURE, V398, P297, DOI 10.1038/18581
BROWN EN, 2004, COMPUTATIONAL NEUROS, V7, P253
Brumberg J. S., 2009, P 10 ANN C INT SPEEC
Carmena JM, 2003, PLOS BIOL, V1, P193, DOI 10.1371/journal.pbio.0000042
Cheng M, 2002, IEEE T BIO-MED ENG, V49, P1181, DOI 10.1109/TBME.2002.803536
DaSalla CS, 2009, NEURAL NETWORKS, V22, P1334, DOI 10.1016/j.neunet.2009.05.008
Donchin E, 2000, IEEE T REHABIL ENG, V8, P174, DOI 10.1109/86.847808
Fagan MJ, 2008, MED ENG PHYS, V30, P419, DOI 10.1016/j.medengphy.2007.05.003
Gelb A., 1974, APPL OPTIMAL ESTIMAT
GEORGOPOULOS AP, 1982, J NEUROSCI, V2, P1527
GUENTHER FH, 1994, BIOL CYBERN, V72, P43, DOI 10.1007/BF00206237
Guenther FH, 2009, PLOS ONE, V4, DOI 10.1371/journal.pone.0008218
Guenther FH, 2006, BRAIN LANG, V96, P280, DOI 10.1016/j.bandl.2005.06.001
GUENTHER FH, 1995, PSYCHOL REV, V102, P594
HAMALAINEN M, 1993, REV MOD PHYS, V65, P413, DOI 10.1103/RevModPhys.65.413
Hinterberger T, 2003, CLIN NEUROPHYSIOL, V114, P416, DOI 10.1016/S1388-2457(02)00411-X
HOCHBERG LR, 2008, NEUR M PLANN 2008 SO
Hochberg LR, 2006, NATURE, V442, P164, DOI 10.1038/nature04970
HOOGERWERF AC, 1994, IEEE T BIO-MED ENG, V41, P1136, DOI 10.1109/10.335862
JONES KE, 1992, ANN BIOMED ENG, V20, P423, DOI 10.1007/BF02368134
JORGENSEN C, 2003, P INT JOINT C NEUR N, V4, P3128, DOI 10.1109/IJCNN.2003.1224072
JOU SC, 2006, INTERSPEECH 2006
JOU SS, 2009, COMMUN COMPUT PHYS, V25, P305
Kalman R.E, 1960, J BASIC ENG, V82, P35, DOI DOI 10.1115/1.3662552
KENNEDY PR, 1989, J NEUROSCI METH, V29, P181, DOI 10.1016/0165-0270(89)90142-8
KENNEDY PR, 1992, NEUROREPORT, V3, P605, DOI 10.1097/00001756-199207000-00015
KENNEDY PR, 2006, ELECT ENG HDB SERIES, V1
KENNEDY PR, 1992, NEUROSCI LETT, V142, P89, DOI 10.1016/0304-3940(92)90627-J
Kennedy PR, 2004, IEEE T NEUR SYS REH, V12, P339, DOI 10.1109/TNSRE.2004.834629
Kennedy PR, 2000, IEEE T REHABIL ENG, V8, P198, DOI 10.1109/86.847815
Kennedy PR, 1998, NEUROREPORT, V9, P1707, DOI 10.1097/00001756-199806010-00007
Kim S, 2007, 3 IEEE EMBS C NEUR E, P486
Kipke DR, 2003, IEEE T NEUR SYS REH, V11, P151, DOI 10.1109/TNSRE.2003.814443
Krusienski DJ, 2008, J NEUROSCI METH, V167, P15, DOI 10.1016/j.jneumeth.2007.07.017
Krusienski DJ, 2006, J NEURAL ENG, V3, P299, DOI 10.1088/1741-2560/3/4/007
Kubler A, 1999, EXP BRAIN RES, V124, P223, DOI 10.1007/s002210050617
Leuthardt Eric C, 2004, J Neural Eng, V1, P63, DOI 10.1088/1741-2560/1/2/001
MACKAY DG, 1968, J ACOUST SOC AM, V43, P811, DOI 10.1121/1.1910900
Maier-Hein L., 2005, IEEE WORKSH AUT SPEE, P331
MATTHEWS BA, 2008, NEUR M PLANN 2008 SO
Maynard EM, 1997, ELECTROEN CLIN NEURO, V102, P228, DOI 10.1016/S0013-4694(96)95176-0
MENDES J, 2008, C IM SIGN PROC CISP, V1, P221
MILLER LE, 2007, NEUR M PLANN 2007 SA
MITZDORF U, 1985, PHYSIOL REV, V65, P37
MOUNTCASTLE VB, 1975, J NEUROPHYSIOL, V38, P871
Nicolelis MAL, 2003, P NATL ACAD SCI USA, V100, P11041, DOI 10.1073/pnas.1934665100
Penfield W, 1959, SPEECH BRAIN MECH
PORBADNIGK A, 2009, INT C BIOINSP SYST S, P376
Rousche PJ, 1998, J NEUROSCI METH, V82, P1, DOI 10.1016/S0165-0270(98)00031-4
Schalk G, 2007, J NEURAL ENG, V4, P264, DOI 10.1088/1741-2560/4/3/012
Schalk G, 2008, J NEURAL ENG, V5, P75, DOI 10.1088/1741-2560/5/1/008
SCHMIDT EM, 1976, EXP NEUROL, V52, P496, DOI 10.1016/0014-4886(76)90220-X
Sellers EW, 2006, BIOL PSYCHOL, V73, P242, DOI 10.1016/j.biopsycho.2006.04.007
SIEBERT SA, 2008, NEUR M PLANN 2008 SO
Suppes P, 1997, P NATL ACAD SCI USA, V94, P14965, DOI 10.1073/pnas.94.26.14965
Taylor DM, 2002, SCIENCE, V296, P1829, DOI 10.1126/science.1070291
Trejo LJ, 2006, IEEE T NEUR SYS REH, V14, P225, DOI 10.1109/TNSRE.2006.875578
Truccolo W, 2008, J NEUROSCI, V28, P1163, DOI 10.1523/JNEUROSCI.4415-07.2008
Truccolo W, 2005, J NEUROPHYSIOL, V93, P1074, DOI 10.1152/jn.00697.2004
Vaughan TM, 2006, IEEE T NEUR SYS REH, V14, P229, DOI 10.1109/TNSRE.2006.875577
Velliste M, 2008, NATURE, V453, P1098, DOI 10.1038/nature06996
WALLICZEK M, 2006, INTERSPEECH 2006, P1596
Wand M., 2009, INT C BIOINSP SYST S
Wessberg J, 2000, NATURE, V408, P361
Williams JC, 1999, BRAIN RES PROTOC, V4, P303, DOI 10.1016/S1385-299X(99)00034-3
Wise KD, 2004, P IEEE, V92, P76, DOI 10.1109/JPROC.2003.820544
WISE KD, 1970, IEEE T BIO-MED ENG, VBM17, P238, DOI 10.1109/TBME.1970.4502738
Wolpaw JR, 2000, IEEE T REHABIL ENG, V8, P222, DOI 10.1109/86.847823
Wolpaw JR, 2004, P NATL ACAD SCI USA, V101, P17849, DOI 10.1073/pnas.0403504101
WRIGHT EJ, 2007, NEUR M PLANN 2007 SA
WRIGHT EJ, 2008, NEUR M PLANN 2008 SO
NR 76
TC 27
Z9 28
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2010
VL 52
IS 4
SI SI
BP 367
EP 379
DI 10.1016/j.specom.2010.01.001
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 574XB
UT WOS:000276026400009
ER
PT J
AU Goldwater, S
Jurafsky, D
Manning, CD
AF Goldwater, Sharon
Jurafsky, Dan
Manning, Christopher D.
TI Which words are hard to recognize? Prosodic, lexical, and disfluency
factors that increase speech recognition error rates
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Conversational; Error analysis; Individual
differences; Mixed-effects model
ID NEIGHBORHOOD ACTIVATION; FREQUENCY; TRANSCRIPTION
AB Despite years of speech recognition research, little is known about which words tend to be misrecognized and why. Previous work has shown that errors increase for infrequent words, short words, and very loud or fast speech, but many other presumed causes of error (e.g., nearby disfluencies, turn-initial words, phonetic neighborhood density) have never been carefully tested. The reasons for the huge differences found in error rates between speakers also remain largely mysterious.
Using a mixed-effects regression model, we investigate these and other factors by analyzing the errors of two state-of-the-art recognizers on conversational speech. Words with higher error rates include those with extreme prosodic characteristics, those occurring turn-initially or as discourse markers, and doubly confusable pairs: acoustically similar words that also have similar language model probabilities. Words preceding disfluent interruption points (first repetition tokens and words before fragments) also have higher error rates. Finally, even after accounting for other factors, speaker differences cause enormous variance in error rates, suggesting that speaker error rate variance is not fully explained by differences in word choice, fluency, or prosodic characteristics. We also propose that doubly confusable pairs, rather than high neighborhood density, may better explain phonetic neighborhood errors in human speech processing. (C) 2009 Elsevier B.V. All rights reserved.
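The shape of such an analysis can be sketched as follows: word-level error regressed on prosodic and lexical predictors with a per-speaker random effect. The original study fit logistic mixed-effects models in R (lme4); here a linear MixedLM from statsmodels stands in purely for illustration, and the column names are hypothetical.

import pandas as pd
import statsmodels.formula.api as smf

def fit_error_model(df: pd.DataFrame):
    # assumed columns: 'error' (0/1 per word), 'log_freq', 'duration',
    # 'turn_initial', 'near_disfluency', and 'speaker' as the grouping factor
    model = smf.mixedlm(
        "error ~ log_freq + duration + turn_initial + near_disfluency",
        data=df, groups=df["speaker"])
    return model.fit()

# The fitted coefficients indicate which factors raise or lower expected error,
# while the speaker random effect absorbs the large between-speaker variance
# the abstract highlights.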
C1 [Goldwater, Sharon] Univ Edinburgh, Sch Informat, Edinburgh EH8 9AB, Midlothian, Scotland.
[Jurafsky, Dan] Stanford Univ, Dept Linguist, Stanford, CA 94305 USA.
[Manning, Christopher D.] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA.
RP Goldwater, S (reprint author), Univ Edinburgh, Sch Informat, 10 Crichton St, Edinburgh EH8 9AB, Midlothian, Scotland.
EM sgwater@inf.ed.ac.uk; jurafsky@stanford.edu; manning@stanford.edu
FU Edinburgh-Stanford LINK a; ONR MURI [N000140510388]
FX This work was supported by the Edinburgh-Stanford LINK and ONR MURI
award N000140510388. We thank Andreas Stolcke for providing the SRI
recognizer output, language model, and forced alignments; Phil Woodland
for providing the Cambridge recognizer output and other evaluation data;
and Katrin Kirchhoff and Raghunandan Kumaran for datasets used in
preliminary work, useful scripts, and additional help.
CR Adda-Decker M., 2005, P INTERSPEECH C INTE, P2205
Baayen RH, 2008, PRACTICAL INTRO STAT
Bates D, 2007, IME4 LINEAR MIXED EF
Bell A, 2003, J ACOUST SOC AM, V113, P1001, DOI 10.1121/1.1534836
Boersma P., 2007, PRAAT DOING PHONETIC
Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5
BULYKO I, 2003, P C HUM LANG TECHN
Dahan D, 2001, COGNITIVE PSYCHOL, V42, P317, DOI 10.1006/cogp.2001.0750
Diehl RL, 1996, J PHONETICS, V24, P187, DOI 10.1006/jpho.1996.0011
DODDINGTON GR, 1981, IEEE SPECTRUM, V18, P26
EVERMANN G, 2004, P FALL 2004 RICH TRA
EVERMANN G, 2005, P ICASSP, P209
Everrnann G., 2004, P ICASSP
FISCUS J, 2004, RT 04F WORKSH
Fosler-Lussier E, 1999, SPEECH COMMUN, V29, P137, DOI 10.1016/S0167-6393(99)00035-7
Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332
GOLDWATER S, 2008, P ACL
Good PI, 2004, PERMUTATION PARAMETR
Hain T, 2005, IEEE T SPEECH AUDI P, V13, P1173, DOI 10.1109/TSA.2005.852999
Harrell FE, 2007, DESIGN PACKAGE R PAC
HEIKE A, 1981, LANG SPEECH, V24, P147
Hirschberg J, 2004, SPEECH COMMUN, V43, P155, DOI 10.1016/j.specom.2004.01.006
HOWES D, 1954, J EXP PSYCHOL, V48, P106, DOI 10.1037/h0059478
Ingle J., 2005, J ACOUST SOC AM, V117, P2459
Keating P., 2003, PAPERS LAB PHONOLOGY, P143
Luce PA, 2000, PERCEPT PSYCHOPHYS, V62, P615, DOI 10.3758/BF03212113
Luce PA, 1998, EAR HEARING, V19, P1, DOI 10.1097/00003446-199802000-00001
Marcus M.P., 1999, LDC99T42
MARSLENWILSON WD, 1987, COGNITION, V25, P71, DOI 10.1016/0010-0277(87)90005-9
Nakamura M, 2008, COMPUT SPEECH LANG, V22, P171, DOI 10.1016/j.csl.2007.07.003
NUSBAUM H, 1995, 2002 APPL SPEECH TEC, pCH4
Nusbaum H. C., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90002-7
PENNOCKSPECK B, 2005, ACT 28 C INT AEDEAN, P407
POVEY D, 2002, P IEEE ICASSP
R Development Core Team, 2007, R LANG ENV STAT COMP
Ratnaparkhi A, 1996, P C EMP METH NAT LAN, P133
SHINOZAKI T, 2001, P ASRU 2001
SHRIBERG E, 1995, P INT C PHON SCI ICP, V4, P384
SIEGLER M, 1995, P ICASSP
Stolcke A, 2006, IEEE T AUDIO SPEECH, V14, P1729, DOI 10.1109/TASL.2006.879807
Vergyri D., 2003, P ICASSP HONG KONG A, pI
Vitevitch MS, 1999, J MEM LANG, V40, P374, DOI 10.1006/jmla.1998.2618
Wang W, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P238
NR 43
TC 18
Z9 18
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2010
VL 52
IS 3
BP 181
EP 200
DI 10.1016/j.specom.2009.10.001
PG 20
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560GA
UT WOS:000274888900001
ER
PT J
AU Torreira, F
Adda-Decker, M
Ernestus, M
AF Torreira, Francisco
Adda-Decker, Martine
Ernestus, Mirjam
TI The Nijmegen Corpus of Casual French
SO SPEECH COMMUNICATION
LA English
DT Article
DE Corpus; Casual speech; French
ID SPONTANEOUS SPEECH; WORDS
AB This article describes the preparation, recording and orthographic transcription of a new speech corpus, the Nijmegen Corpus of Casual French (NCCFr). The corpus contains a total of over 36 h of recordings of 46 French speakers engaged in conversations with friends. Casual speech was elicited during three different parts, which together provided around 90 min of speech from every pair of speakers. While Parts 1 and 2 did not require participants to perform any specific task, in Part 3 participants negotiated a common answer to general questions about society. Comparisons with the ESTER corpus of journalistic speech show that the two corpora contain speech of considerably different registers. A number of indicators of casualness, including swear words, casual words, verlan, disfluencies and word repetitions, are more frequent in the NCCFr than in the ESTER corpus, while the use of double negation, an indicator of formal speech, is less frequent. In general, these estimates of casualness are constant through the three parts of the recording sessions and across speakers. Based on these facts, we conclude that our corpus is a rich resource of highly casual speech, and that it can be effectively exploited by researchers in language science and technology. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Torreira, Francisco; Ernestus, Mirjam] Radboud Univ Nijmegen, CLS, NL-6525 XD Nijmegen, Netherlands.
[Torreira, Francisco; Ernestus, Mirjam] Max Planck Inst Psycholinguist, NL-6525 XD Nijmegen, Netherlands.
[Adda-Decker, Martine] LIMSI CNRS, Spoken Language Proc Grp, F-91403 Orsay, France.
[Adda-Decker, Martine] LIMSI CNRS, Situated Percept Grp, F-91403 Orsay, France.
RP Torreira, F (reprint author), Radboud Univ Nijmegen, CLS, Wundtlaan 1, NL-6525 XD Nijmegen, Netherlands.
EM Francisco.Torreira@mpi.nl
RI Ernestus, Mirjam /E-4344-2010
FU European Young Investigator Award
FX Our thanks to Cecile Fougeron, Coralie Vincent, Christine Meunier, Ton
Wempe, the staff at ILPGA and the participants for their help during the
recording of the corpus in France. We also want to thank Lou Boves and
Christopher Stewart for helpful comments and discussion. This work was
funded by a European Young Investigator Award to the third author. It
was presented at the 6th Journees d'Etudes Linguistiques of Nantes
University in June 2009.
CR Barras C, 2001, SPEECH COMMUN, V33, P5, DOI 10.1016/S0167-6393(00)00067-4
Blanche-Benveniste Claire, 1990, FRANCAIS PARLE ETUDE
Boersma P., 2009, PRAAT DOING PHONETIC
Clark H. H., 1996, USING LANGUAGE
Clark HH, 1998, COGNITIVE PSYCHOL, V37, P201, DOI 10.1006/cogp.1998.0693
Coveney Aidan, 1996, VARIABILITY SPOKEN F
Development Core Team, 2008, R LANG ENV STAT COMP
Eggins Suzanne, 1997, ANAL CASUAL CONVERSA
Ernestus M., 2000, VOICE ASSIMILATION S
FAGYAL Z, 1998, VIVACITE DIVERSITE V, V3, P151
Tree JEF, 1995, J MEM LANG, V34, P709, DOI 10.1006/jmla.1995.1032
GALLIANO S, 2005, P INT 2005, P2453
GAUVAIN JL, 2005, P INT 2005
JOHNSON K., 2004, SPONTANEOUS SPEECH D, P29
JOUSSE V, 2008, CARACTERISATION DETE
Laks B., 2005, LINGUISTIQUE CORPUS, P205
Lamel L, 2000, SPEECH COMMUN, V31, P339, DOI 10.1016/S0167-6393(99)00067-9
Local J., 2007, P 16 INT C PHON SCI, P6
Local J, 2003, J PHONETICS, V31, P321, DOI 10.1016/S0095-4470(03)00045-7
Moore R. K., 2003, P EUROSPEECH, P2581
MOORE RK, 2005, P 10 INT C SPEECH CO
Plug L, 2005, PHONETICA, V62, P131, DOI 10.1159/000090094
Sarkar D, 2008, USE R, P1
Schegloff EA, 2000, LANG SOC, V29, P1
SERPOLLET N, 2007, P CORPUS LINGUISTICS
Shriberg E., 2001, J INT PHON ASSOC, V31, P153
Smith Alan, 2002, J FRENCH LANGUAGE ST, V12, P23
Valdman A, 2000, FR REV, V73, P1179
2007, NP ROBERT 2008 GRAND
NR 29
TC 9
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2010
VL 52
IS 3
BP 201
EP 212
DI 10.1016/j.specom.2009.10.004
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560GA
UT WOS:000274888900002
ER
PT J
AU Valente, F
AF Valente, Fabio
TI Multi-stream speech recognition based on Dempster-Shafer combination
rule
SO SPEECH COMMUNICATION
LA English
DT Article
DE TANDEM features; Multi Layer Perceptron; Multi-stream speech
recognition; Inverse-entropy combination
AB This paper aims at investigating the use of the Dempster-Shafer (DS) combination rule for multi-stream automatic speech recognition. The DS combination is based on a generalization of the conventional Bayesian framework. The main motivation for this work is the similarity between the DS combination and findings of Fletcher on human speech recognition. Experiments are based on the combination of several Multi Layer Perceptron (MLP) classifiers trained on different representations of the speech signal. The TANDEM framework is adopted in order to use the MLP outputs in conventional speech recognition systems. We exhaustively investigate several methods for applying the DS combination to multi-stream ASR. Experiments are run on small and large vocabulary speech recognition tasks and aim at comparing the proposed technique with other frame-based combination rules (e.g. inverse entropy). Results reveal that the proposed method outperforms conventional combination rules in both tasks. Furthermore, we verify that the performance of the combined feature stream is never inferior to the performance of the best individual feature stream. We conclude the paper by discussing other applications of the DS combination and possible extensions. (C) 2009 Elsevier B.V. All rights reserved.
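For readers unfamiliar with the combination rule named above, the sketch below shows Dempster's rule applied to two frame-level posterior vectors treated as basic probability assignments over singleton classes. The mapping from MLP posteriors to mass functions in the actual system is more elaborate; this only illustrates the combination step itself.

import numpy as np

def dempster_combine(m1, m2, eps=1e-12):
    """m1, m2: (n_classes,) mass vectors over singleton hypotheses.
    Returns the combined, renormalized masses; the normalizer removes the mass
    assigned to conflicting (empty-intersection) pairs of hypotheses."""
    m1, m2 = np.asarray(m1, float), np.asarray(m2, float)
    joint = np.outer(m1, m2)
    agreement = np.diag(joint)              # both sources pick the same singleton
    conflict = joint.sum() - agreement.sum()
    return agreement / max(1.0 - conflict, eps)

# Two streams that both favour class 2 reinforce it, while disagreement is
# discounted through the conflict normalizer:
print(dempster_combine([0.1, 0.2, 0.7], [0.2, 0.2, 0.6]))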
C1 IDIAP Res Inst, CH-1920 Martigny, Switzerland.
RP Valente, F (reprint author), IDIAP Res Inst, CH-1920 Martigny, Switzerland.
EM fabio.valente@idiap.ch
FU Defense Advanced Research Projects Agency (DARPA) [HR0011-06-C-0023]
FX This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-06-C-0023.
Any opinions, findings and conclusions or recommendations expressed in
this material are those of the author(s) and do not necessarily reflect
the views of the Defense Advanced Research Projects Agency (DARPA). The
author thank Jithendra Vepa, Thomas Hain and AMI ASR team for their help
with the meeting system. The author also thanks anonymous reviewers for
their comments.
CR Allen J., 2005, ARTICULATION INTELLI
BOURLARD H, 1996, P ICSLP 96
Bourlard Ha, 1994, CONNECTIONIST SPEECH
Fletcher H., 1953, SPEECH HEARING COMMU
Galina L.R., 1994, NEURAL NETWORKS, V7, P777
HAIN T, 2005, NIST RT05 WORKSH ED
Hermansky H., 1996, P ICSLP
Hermansky H., 2005, P INT 2005
HERMANSKY H, 1998, P ICSLP 98 SYDN AUST
Hermansky H., 2000, P ICASSP
HWANG MY, 2007, P IEEE WORKSH AUT SP
Kittler J., 1998, IEEE T PAMI, V20
Kleinschmidt M, 2002, ACTA ACUST UNITED AC, V88, P416
MANDLER EJ, 1988, PATTERN RECOGN, V10, P381
Misra H., 2003, P ICASSP
MORGAN N, 2004, P ICASSP
PLAHL C, 2009, P INT BRISB AUSTR
Shafer G., 1976, MATH THEORY EVIDENCE
Stolcke Andreas, 2006, P ICASSP
THOMAS S, 2008, P INT
Valente F., 2007, P ICASSP
VALENTE F, 2007, P INT
XU L, 1992, IEEE T SYST MAN CYB, V22, P418, DOI 10.1109/21.155943
Zhu Q., 2004, P ICSLP
RICH TRANSCRIPTION E
NR 25
TC 12
Z9 12
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2010
VL 52
IS 3
BP 213
EP 222
DI 10.1016/j.specom.2009.10.002
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560GA
UT WOS:000274888900003
ER
PT J
AU Chien, JT
Chueh, CH
AF Chien, Jen-Tzung
Chueh, Chuang-Hua
TI Joint acoustic and language modeling for speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Hidden Markov model; n-Gram; Conditional random field; Maximum entropy;
Discriminative training; Speech recognition
ID MAXIMUM-ENTROPY APPROACH; RANDOM-FIELDS
AB In a traditional model of speech recognition, acoustic and linguistic information sources are assumed independent of each other. Parameters of the hidden Markov model (HMM) and the n-gram are estimated separately for maximum a posteriori classification. However, the speech features and lexical words are inherently correlated in natural language. The lack of any combination between these models leads to some inefficiencies. This paper reports on joint acoustic and linguistic modeling for speech recognition by using the acoustic evidence in estimation of the linguistic model parameters, and vice versa, according to the maximum entropy (ME) principle. The discriminative ME (DME) models are exploited by using features from competing sentences. Moreover, a mutual ME (MME) model is built for sentence posterior probability, which is maximized to estimate the model parameters by characterizing the dependence between acoustic and linguistic features. The N-best Viterbi approximation is presented in implementing DME and MME models. Additionally, the new models are incorporated with the high-order feature statistics and word regularities. In the experiments, the proposed methods increase the sentence posterior probability or model separation. Recognition errors are significantly reduced in comparison with separate HMM and n-gram model estimations from 32.2% to 27.4% using the MATBN corpus and from 5.4% to 4.8% using the WSJ corpus (5K condition). (C) 2009 Elsevier B.V. All rights reserved.
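The general form behind this kind of joint modeling is a log-linear (maximum-entropy style) combination of acoustic and language-model scores over an N-best list. The sketch below is an assumption-laden illustration of that general form, not the paper's exact DME/MME models: the two-weight parameterization and feature choice are hypothetical.

import numpy as np

def nbest_posteriors(acoustic_loglik, lm_logprob, lam_ac=1.0, lam_lm=1.0):
    """Sentence posteriors over an N-best list under a log-linear model:
    p(s|x) proportional to exp(lam_ac*log p(x|s) + lam_lm*log p(s))."""
    scores = lam_ac * np.asarray(acoustic_loglik, float) \
           + lam_lm * np.asarray(lm_logprob, float)
    scores -= scores.max()                 # shift for numerical stability
    p = np.exp(scores)
    return p / p.sum()

# Discriminative training would then adjust the weights (and richer feature
# weights) to raise the posterior of the reference sentence over its competitors.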
C1 [Chien, Jen-Tzung; Chueh, Chuang-Hua] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan.
RP Chien, JT (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan.
EM jtchien@mail.ncku.edu.tw
FU National Science Council, Taiwan, ROC [NSC97-2221-E-006-230-MY3]
FX The authors acknowledge Dr. Jean-Luc Gauvain and the anonymous reviewers
for their valuable comments which improved the presentation of this
paper. This work has been partially supported by the National Science
Council, Taiwan, ROC, under contract NSC97-2221-E-006-230-MY3.
CR Bahl L., 1986, P INT C AC SPEECH SI, V11, P49, DOI DOI 10.1109/ICASSP.1986.1169179>
Berger AL, 1996, COMPUT LINGUIST, V22, P39
BEYERLEIN P, 1998, P IEEE INT C AC SPEE, V1, P481, DOI 10.1109/ICASSP.1998.674472
Chien JT, 2006, IEEE T AUDIO SPEECH, V14, P1719, DOI 10.1109/TSA.2005.858551
Chien JT, 2006, IEEE T AUDIO SPEECH, V14, P797, DOI 10.1109/TSA.2005.860847
CHIEN JT, 2008, P IEEE WORKSH SPOK L, P201
Chueh C. H., 2005, P INTERSPEECH, P721
CHUEH CH, 2006, P IEEE INT C AC SPEE, V1, P1061
DARROCH JN, 1972, ANN MATH STAT, V43, P1470, DOI 10.1214/aoms/1177692379
DellaPietra S, 1997, IEEE T PATTERN ANAL, V19, P380, DOI 10.1109/34.588021
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
FOSLERLUSSIER E, 2008, P IEEE INT C AC SPEE, P4049
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Gillick L., 1989, P ICASSP, P532
Gunawardana A., 2005, P INTERSPEECH, P1117
HEIGOLD G, 2007, P EUR C SPEECH COMM, P1721
JAYNES ET, 1957, PHYS REV, V106, P620, DOI 10.1103/PhysRev.106.620
JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747
Khudanpur S, 2000, COMPUT SPEECH LANG, V14, P355, DOI 10.1006/csla.2000.0149
Kuo H. K. J., 2002, P ICASSP, V1, P325
KUO HKJ, 2004, P ICSLP JEJ ISL S KO, P681
Lafferty John D., 2001, ICML, P282
Lee LS, 1997, IEEE SIGNAL PROC MAG, V14, P63
Liu Y, 2006, IEEE T AUDIO SPEECH, V14, P1526, DOI 10.1109/TASL.2006.878255
MACHEREY W, 2003, P EUR C SPEECH COMM, V1, P493
MAHAJAN M, 2006, P ICASSP 2006, V1, P273
Malouf R., 2002, P 6 C NAT LANG LEARN, P49
McCallum A., 2000, P 17 INT C MACH LEAR, P591
MORRIS J, 2006, P INT C SPOK LANG PR, P579
Normandin Y, 1994, IEEE T SPEECH AUDI P, V2, P299, DOI 10.1109/89.279279
Quattoni A, 2007, IEEE T PATTERN ANAL, V29, P1848, DOI 10.1109/TPAMI.2007.1124
Riedmiller M., 1993, P IEEE INT C NEUR NE, V1, P586, DOI DOI 10.1109/ICNN.1993.298623
Rosenfeld R, 1996, COMPUT SPEECH LANG, V10, P187, DOI 10.1006/csla.1996.0011
Sha F., 2003, P C N AM CHAPT ASS C, P134
Stolcke A., 2002, P INT C SPOK LANG PR, V2, P901
Wang SJ, 2004, IEEE T NEURAL NETWOR, V15, P903, DOI 10.1109/TNN.2004.828755
NR 36
TC 15
Z9 15
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2010
VL 52
IS 3
BP 223
EP 235
DI 10.1016/j.specom.2009.10.003
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560GA
UT WOS:000274888900004
ER
PT J
AU Kolar, J
Liu, Y
Shriberg, E
AF Kolar, Jachym
Liu, Yang
Shriberg, Elizabeth
TI Speaker adaptation of language and prosodic models for automatic dialog
act segmentation of speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spoken language understanding; Dialog act segmentation; Speaker
adaptation; Prosody modeling; Language modeling
ID TO-TEXT; RECOGNITION
AB Speaker-dependent modeling has a long history in speech recognition, but has received less attention in speech understanding. This study explores speaker-specific modeling for the task of automatic segmentation of speech into dialog acts (DAs), using a linear combination of speaker-dependent and speaker-independent language and prosodic models. Data come from 20 frequent speakers in the ICSI meeting corpus; adaptation data per speaker ranges from 5 k to 115 k words. We compare performance for both reference transcripts and automatic speech recognition output. We find that: (1) speaker adaptation in this domain results both in a significant overall improvement and in improvements for many individual speakers, (2) the magnitude of improvement for individual speakers does not depend on the amount of adaptation data, and (3) language and prosodic models differ both in degree of improvement, and in relative benefit for specific DA classes. These results suggest important future directions for speaker-specific modeling in spoken language understanding tasks. (C) 2009 Elsevier B.V. All rights reserved.
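A minimal sketch of the linear combination of speaker-dependent and speaker-independent models mentioned in the abstract, here applied to n-gram probabilities; the dummy probability functions and the interpolation weight are illustrative assumptions, not values from the study.

def interpolated_prob(word, history, p_sd, p_si, lam=0.3):
    # lam weights the speaker-dependent model; (1 - lam) the speaker-independent one
    return lam * p_sd(word, history) + (1.0 - lam) * p_si(word, history)

# toy constant models standing in for real n-gram estimates
p_sd = lambda w, h: 0.02
p_si = lambda w, h: 0.01
print(interpolated_prob("okay", ("so",), p_sd, p_si))  # 0.013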
C1 [Kolar, Jachym] Univ W Bohemia, Dept Cybernet, Fac Sci Appl, Plzen 30614, Czech Republic.
[Liu, Yang] Univ Texas Dallas, Dept Comp Sci, Richardson, TX 75083 USA.
[Shriberg, Elizabeth] SRI Int, Speech Technol & Res Lab, Menlo Pk, CA 94025 USA.
[Shriberg, Elizabeth] Int Comp Sci Inst, Berkeley, CA 94704 USA.
RP Kolar, J (reprint author), Univ W Bohemia, Dept Cybernet, Fac Sci Appl, Univerzitni 8, Plzen 30614, Czech Republic.
EM jachym@kky.zcu.cz; yangl@hlt.utdallas.edu; ees@speech.sri.com
FU Ministry of Education of the Czech Republic [1M0567, 2C06020]; NSF
[IIS-0544682, IIS-0845484]
FX This work was supported by the Ministry of Education of the Czech
Republic under projects 1M0567 and 2C06020 at UWB Pilsen, and the NSF
grants IIS-0544682 at SRI International and IIS-0845484 at UT Dallas.
The views are those of the authors and do not reflect the views of the
funding agencies.
CR AKITA Y, 2004, P INTERSPEECH 2004 I
AKITA Y, 2006, P INTERSPEECH 2006 I
BESLING S, 1995, P EUROSPEECH MADR SP
CUENDET S, 2006, P IEEE WORKSH SPOK L
Dhillon R, 2004, TR04002 ICSI
FAVRE B, 2008, P IEEE WORKSH SPOK L
Furui S, 2004, IEEE T SPEECH AUDI P, V12, P401, DOI 10.1109/TSA.2004.828699
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
HIRST A, 1998, INTONATION SYST
Huang J., 2002, P ICSLP 2002 DENV CO
JANIN A, 2003, P ICASSP HONG KONG
Jones D., 2003, P EUROSPEECH GEN SWI
KAHN JG, 2004, P HLT NAACL BOST MA
Kim JH, 2003, SPEECH COMMUN, V41, P563, DOI 10.1016/S0167-6393(03)00049-9
KNESER R, 1995, P ICASSP DETR MI US
KOLAR J, 2007, P INTERSPEECH 2007 A
Kolar J, 2006, LECT NOTES ARTIF INT, V4188, P629
KOLAR J, 2006, P INTERSPEECH 2006 I
Liu Y, 2006, COMPUT SPEECH LANG, V20, P468, DOI 10.1016/j.csl.2005.06.002
LIU Y, 2005, P ACL ANN ARB MI US
LIU Y, 2004, P EMNLP BARC SPAIN
MAGIMAIDOSS M, 2007, P ICASSP HON HI
MATUSOV E, 2007, P INTERSPEECH 2007 A
Olshen R., 1984, CLASSIFICATION REGRE, V1st
Ostendorf M., 1994, Computational Linguistics, V20
ROARK B, 2006, P ICASSP TOUL FRANC
Shriberg E, 2000, SPEECH COMMUN, V32, P127, DOI 10.1016/S0167-6393(00)00028-5
SONMEZ K, 1998, P ICSLP SYDN AUSTR
SRIVASTAVA A, 2003, P EUROSPEECH GEN SWI
STOLCKE A, 1998, P ICSLP SYDN AUSTR
Stolcke A, 2006, IEEE T AUDIO SPEECH, V14, P1729, DOI 10.1109/TASL.2006.879807
STOLCKE A, 2002, P ICSLP DENV CO US
TUR G, 2007, P ICASSP HON HI US
WARNKE V, 1997, P EUROSPEECH RHOD GR
ZIMMERMANN M, 2006, P INTERSPEECH 2006 I
ZIMMERMANN M, 2009, P INTERSPEECH 2009 B
NR 37
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2010
VL 52
IS 3
BP 236
EP 245
DI 10.1016/j.specom.2009.10.005
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560GA
UT WOS:000274888900005
ER
PT J
AU Boulenger, V
Hoen, M
Ferragne, E
Pellegrino, F
Meunier, F
AF Boulenger, Veronique
Hoen, Michel
Ferragne, Emmanuel
Pellegrino, Francois
Meunier, Fanny
TI Real-time lexical competitions during speech-in-speech comprehension
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech-in-noise; Informational masking; Lexical competition
ID AUDITORY SCENE ANALYSIS; TOP-DOWN INFLUENCES; INFORMATIONAL MASKING;
SIMULTANEOUS TALKERS; INTERFERING-SPEECH; WORD RECOGNITION; COCKTAIL
PARTY; PERCEPTION; HEARING; IDENTIFICATION
AB This study aimed at characterizing the cognitive processes that come into play during speech-in-speech comprehension by examining lexical competitions between target speech and concurrent multi-talker babble. We investigated the effects of number of simultaneous talkers (2, 4, 6 or 8) and of the token frequency of the words that compose the babble (high or low) on lexical decision to target words. Results revealed a decrease in performance as measured by reaction times to targets with increasing number of concurrent talkers. Crucially, the frequency of words in the babble significantly affected performance: high-frequency babble interfered more strongly (by lengthening reaction times) with word recognition than low-frequency babble. This informational masking was particularly salient when only two talkers were present in the babble due to the availability of identifiable lexical items from the background. Our findings suggest that speech comprehension in multi-talker babble can trigger competitions at the lexical level between target and background. They further highlight the importance of investigating speech-in-speech comprehension situations as they may provide crucial information on interactive and competitive mechanisms that occur in real-time during word recognition. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Boulenger, Veronique; Ferragne, Emmanuel; Pellegrino, Francois; Meunier, Fanny] Univ Lyon, Lab Dynam Langage, UMR CNRS 5596, Inst Sci Homme, F-69363 Lyon 07, France.
[Hoen, Michel] Univ Lyon 1, Stem Cell & Brain Res Inst, INSERM U846, F-69675 Bron, France.
RP Boulenger, V (reprint author), Univ Lyon, Lab Dynam Langage, UMR CNRS 5596, Inst Sci Homme, 14 Ave Berthelot, F-69363 Lyon 07, France.
EM Veronique.Boulenger@ish-lyon.cnrs.fr
RI Hoen, Michel/C-7721-2012
OI Hoen, Michel/0000-0003-2099-8130
FU European Research Council
FX This project was carried out with financial support from the European
Research Council (SpiN project to Fanny Meunier). We would like to thank
Claire Grataloup for allowing us to use the materials from her PhD. We
would also like to thank the anonymous Reviewer and the Editor for their
very helpful comments.
CR Alain C, 2005, J COGNITIVE NEUROSCI, V17, P811, DOI 10.1162/0898929053747621
Alain C, 2001, J EXP PSYCHOL HUMAN, V27, P1072, DOI 10.1037//0096-1523.27.5.1072
Alain C, 2003, J COGNITIVE NEUROSCI, V15, P1063, DOI 10.1162/089892903770007443
ANDREOBRECHT R, 1988, IEEE T ACOUST SPEECH, V36, P29, DOI 10.1109/29.1486
Boersma P., 2009, PRAAT DOING PHONETIC
Bregman AS., 1990, AUDITORY SCENE ANAL
BRONKHORST AW, 1992, J ACOUST SOC AM, V92, P3132, DOI 10.1121/1.404209
Bronkhorst AW, 2000, ACUSTICA, V86, P117
Brungart DS, 2001, J ACOUST SOC AM, V109, P1101, DOI 10.1121/1.1345696
Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946
Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929
CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229
CONNINE CM, 1990, J EXP PSYCHOL LEARN, V16, P1084, DOI 10.1037/0278-7393.16.6.1084
Davis MH, 2007, HEARING RES, V229, P132, DOI 10.1016/j.heares.2007.01.014
Drullman R, 2000, J ACOUST SOC AM, V107, P2224, DOI 10.1121/1.428503
Dupoux E, 2003, J EXP PSYCHOL HUMAN, V29, P172, DOI 10.1037/0096-1523.29.1.172
FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247
FORSTER KI, 1984, J EXP PSYCHOL LEARN, V10, P680, DOI 10.1037/0278-7393.10.4.680
Gaskell MG, 1997, PROCEEDINGS OF THE NINETEENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, P247
Hawley ML, 2004, J ACOUST SOC AM, V115, P833, DOI 10.1121/1.1639908
Hoen M, 2007, SPEECH COMMUN, V49, P905, DOI 10.1016/j.specom.2007.05.008
Kouider S, 2005, PSYCHOL SCI, V16, P617, DOI 10.1111/j.1467-9280.2005.01584.x
Lecumberri MLG, 2006, J ACOUST SOC AM, V119, P2445, DOI 10.1121/1.2180210
MacKay D. G., 1987, ORG PERCEPTION ACTIO
MACKAY DG, 1982, PSYCHOL REV, V89, P483, DOI 10.1037/0033-295X.89.5.483
MarslenWilson W, 1996, J EXP PSYCHOL HUMAN, V22, P1376
MARSLENWILSON W, 1990, ACL MIT NAT, P148
McClelland J., 1986, COGNITIVE PSYCHOL, V8, P1, DOI [10.1016/0010-0285(86)90015-0, DOI 10.1016/0010-0285(86)90015-0]
MCCLELLAND JL, 1981, PSYCHOL REV, V88, P375, DOI 10.1037/0033-295X.88.5.375
Mirman D., 2005, J MEM LANG, V52, P424
Monsell S., 1991, BASIC PROCESSES READ, P148
MOORE TE, 1995, CAN J BEHAV SCI, V27, P9, DOI 10.1037/0008-400X.27.1.9
New B, 2004, BEHAV RES METH INS C, V36, P516, DOI 10.3758/BF03195598
PELLEGRINO F, 2004, INT C SPEECH PROS 20, P517
Pellegrino F, 2000, SIGNAL PROCESS, V80, P1231, DOI 10.1016/S0165-1684(00)00032-3
Plant D. C., 1996, PSYCHOL REV, V103, P56, DOI DOI 10.1037/0033-295
Rhebergen KS, 2005, J ACOUST SOC AM, V118, P1274, DOI 10.1121/1.2000751
RUBIN P, 1976, PERCEPT PSYCHOPHYS, V19, P394, DOI 10.3758/BF03199398
Samuel AG, 1997, COGNITIVE PSYCHOL, V32, P97, DOI 10.1006/cogp.1997.0646
Samuel AG, 2001, PSYCHOL SCI, V12, P348, DOI 10.1111/1467-9280.00364
Simpson SA, 2005, J ACOUST SOC AM, V118, P2775, DOI 10.1121/1.2062650
TAFT M, 1986, COGNITION, V22, P259, DOI 10.1016/0010-0277(86)90017-X
Van Engen KJ, 2007, J ACOUST SOC AM, V121, P519, DOI 10.1121/1.2400666
NR 43
TC 10
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2010
VL 52
IS 3
BP 246
EP 253
DI 10.1016/j.specom.2009.11.002
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560GA
UT WOS:000274888900006
ER
PT J
AU Arias, JP
Yoma, NB
Vivanco, H
AF Pablo Arias, Juan
Becerra Yoma, Nestor
Vivanco, Hiram
TI Automatic intonation assessment for computer aided language learning
SO SPEECH COMMUNICATION
LA English
DT Article
DE Intonation assessment; Computer aided language learning; Word stress
assessment
ID RECOGNITION
AB In this paper the nature and relevance of the information provided by intonation is discussed in the framework of second language learning. As a consequence, an automatic intonation assessment system for second language learning is proposed based on a top-down scheme. A stress assessment system is also presented by combining intonation and energy contour estimation. The utterance pronounced by the student is directly compared with a reference one. The trend similarity of the intonation and energy contours is compared frame-by-frame using DTW alignment. Moreover, the robustness of the alignment provided by the DTW algorithm to microphone, speaker and pronunciation quality mismatch is addressed. The intonation assessment system gives an averaged subjective-objective score correlation as high as 0.88. The stress assessment evaluation system gives an EER equal to 21.5%, which in turn is similar to the error observed in phonetic quality evaluation schemes. These results suggest that the proposed systems could be employed in real applications. Finally, the schemes presented here are text- and language-independent due to the fact that the reference utterance text-transcription and language are not required. (C) 2009 Elsevier B.V. All rights reserved.
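To make the frame-by-frame DTW comparison concrete, here is a generic dynamic-time-warping sketch over two toy F0 contours; the distance measure, the path-length normalisation and the contours themselves are assumptions for illustration, not the authors' system.

import numpy as np

def dtw_cost(x, y):
    n, m = len(x), len(y)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(x[i - 1] - y[j - 1])
            D[i, j] = d + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)  # alignment cost normalised by a path-length bound

student_f0 = np.array([110.0, 115, 130, 150, 140, 120])    # toy contour (Hz)
reference_f0 = np.array([108.0, 118, 135, 148, 138, 118])  # toy reference contour
print(dtw_cost(student_f0, reference_f0))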
C1 [Pablo Arias, Juan; Becerra Yoma, Nestor] Univ Chile, Speech Proc & Transmiss Lab, Dept Elect Engn, Santiago, Chile.
[Vivanco, Hiram] Univ Chile, Dept Linguist, Santiago, Chile.
RP Yoma, NB (reprint author), Univ Chile, Speech Proc & Transmiss Lab, Dept Elect Engn, Av Tupper 2007,POB 412-3, Santiago, Chile.
EM nbecerra@ing.uchile.cl
FU Conicyt-Chile [D051-10243, 1070382]
FX This work was funded by Conicyt-Chile under Grants Fondef No. D051-10243
and Fondecyt No. 1070382.
CR BAETENS H, 1982, BILINGUALISM BASIC P
Bell ND, 2009, J PRAGMATICS, V41, P1825, DOI 10.1016/j.pragma.2008.10.010
BERNAT E, 2006, ASIAN EFL J, V8
Bernstein J., 1990, P INT C SPOK LANG PR, P1185
Boersma P., 2008, PRAAT DOING PHONETIC
Bolinger D., 1986, INTONATION ITS PARTS
Bolinger D., 1989, INTONATION ITS USES
Botinis A, 2001, SPEECH COMMUN, V33, P263, DOI 10.1016/S0167-6393(00)00060-1
Carter R., 2001, CAMBRIDGE GUIDE TEAC
Celce-Murcia M., 2000, DISCOURSE CONTEXT LA
Chun Dorothy, 2002, DISCOURSE INTONATION
Cruttenden A., 2008, GIMSONS PRONUNCIATIO
Dalton C., 1994, PRONUNCIATION
DELMONTE R, 1997, P ESCA EUR 97 RHOD, V2, P669
DONG B, 2008, 6 INT S CHIN SPOK LA
DONG B, 2004, INT S CHIN SPOK LANG, P137
ELIMAM YA, 2005, IEICE T INFORM SYSTE
ESKENAZI M, 1998, P STILL WORKSH SPEEC
Eskenazi M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607892
FACE T, 2006, J LANG LINGUIST, V15, P295
FLETCHER J, 2005, INTONATIONAL VARIATI
Fonagy Ivan, 2001, LANGUAGES LANGUAGE E
FRANCO H, 1997, ICASSP 97, V2, P1471
FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530
GARNNUNN P, 1992, CALVERTS DESCRIPTIVE
Gillick L., 1989, P ICASSP, P532
GRABE E, 2002, INTONATIONAL VARIATI
GU L, 2003, INT S CIRC SYST ISCA, V2, P580
Guy G. R., 1984, AUSTR J LINGUISTICS, V4, P1, DOI 10.1080/07268608408599317
Hiller S., 1994, Computer Assisted Language Learning, V7, DOI 10.1080/0958822940070105
Holmes J., 2001, SPEECH SYNTHESIS REC
Jenkins J., 2000, PHONOLOGY ENGLISH IN
JIA H, 2008, P SPEECH PROS, P547
Jones R. H., 1997, SYSTEM, V25, P103, DOI 10.1016/S0346-251X(96)00064-4
Jurafsky Daniel, 2009, SPEECH LANGUAGE PROC, V2nd
Kachru Yamuna, 1985, RELC J, V16, P1, DOI 10.1177/003368828501600201
KIM H, 2002, ICSLP 2002, P1225
LIANG W, 2005, P INT C COMM CIRC SY, V2, P857
Molina C, 2009, SPEECH COMMUN, V51, P485, DOI 10.1016/j.specom.2009.01.002
MORLEY J, 1991, TESOL QUART, V25, P481, DOI 10.2307/3586981
Moyer Alene, 2004, AGE ACCENT EXPERIENC
NEUMEYER L, 1996, P ICSLP 96
OPPELSTRUP L, 2005, P FONETIK GOT
PEABODY M, 2006, P 5 INT S CHIN SPOK
Pennington M., 1989, RELC J, V20, P20, DOI 10.1177/003368828902000103
PETERS AM, 1977, LANGUAGE, V53, P560, DOI 10.2307/413177
PIERREHUMBERT J, 1990, INTENTIONS COMMUNICA
RABINER LR, 1979, IEEE T ACOUST SPEECH, V27, P583, DOI 10.1109/TASSP.1979.1163323
RABINER LR, 1980, IEEE T ACOUST SPEECH, V28, P377, DOI 10.1109/TASSP.1980.1163422
RABINER LR, 1978, IEEE T ACOUST SPEECH, V26, P34, DOI 10.1109/TASSP.1978.1163037
RAMAN M, 2004, ENGLISH LANGUAGE TEA
Ramirez Verdugo D., 2005, INTERCULT PRAGMAT, V2, P151
Roach P., 2008, ENGLISH PHONETICS PH
Rypa M. E., 1999, CALICO Journal, V16
SAKOE H, 1978, IEEE T ACOUST SPEECH, V26
Saussure F. D., 2006, WRITINGS GEN LINGUIS
SHIMIZU M, 2005, PHON TEACH LEARN C 2
STOUTEN F, 2006, P ICASSP
SU P, 2006, P 5 INT C MACH LEARN
Tao Hongyin, 1996, UNITS MANDARIN CONVE
Teixeira C., 2000, P ICSLP
TEPPERMAN J, 2007, IEEE T AUDIO SPEECH, V16, P8
Tepperman J., 2008, P INTERSPEECH ICSLP
TRAYNOR PL, 2003, INSTRUCT PSYCHOL J, P137
van Santen JPH, 2009, SPEECH COMMUN, V51, P1082, DOI 10.1016/j.specom.2009.04.007
Wells John, 2006, ENGLISH INTONATION
YOU K, 2004, INTERSPEECH 2004, P1857
ZHAO X, 2007, ISSSE 07, P59
NR 68
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2010
VL 52
IS 3
BP 254
EP 267
DI 10.1016/j.specom.2009.11.001
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560GA
UT WOS:000274888900007
ER
PT J
AU Wu, TY
Duchateau, J
Martens, JP
Van Compernolle, D
AF Wu, Tingyao
Duchateau, Jacques
Martens, Jean-Pierre
Van Compernolle, Dirk
TI Feature subset selection for improved native accent identification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Accent identification; Language identification; Feature selection;
Gaussian mixture model; Linear discriminant analysis; Support vector
machine
ID AUTOMATIC LANGUAGE IDENTIFICATION; SUPPORT VECTOR MACHINES; CANCER
CLASSIFICATION; GENE SELECTION; SPEECH; RECOGNITION
AB In this paper, we develop methods to identify accents of native speakers. Accent identification differs from other speaker classification tasks because accents may differ in a limited number of phonemes only and moreover the differences can be quite subtle. In this paper, it is shown that in such cases it is essential to select a small subset of discriminative features that can be reliably estimated and at the same time discard non-discriminative and noisy features. For identification purposes a speaker is modeled by a supervector containing the mean values for the features for all phonemes. Initial accent models are obtained as class means from the speaker supervectors. Then feature subset selection is performed by applying either ANOVA (analysis of variance), LDA (linear discriminant analysis), SVM-RFE (support vector machine-recursive feature elimination), or their hybrids, resulting in a reduced dimensionality of the speaker vector and more importantly a significantly enhanced recognition performance. We also compare the performance of GMM, LDA and SVM as classifiers on a full or a reduced feature subset. The methods are tested on a Flemish read speech database with speakers classified in five regions. The difficulty of the task is confirmed by a human listening experiment. We show that a relative improvement of more than 20% in accent recognition rate can be achieved with feature subset selection irrespective of the choice of classifier. We finally show that the construction of speaker-based supervectors significantly enhances results over a reference GMM system that uses the raw feature vectors directly as input, both in text dependent and independent conditions. (C) 2009 Elsevier B.V. All rights reserved.
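As a hedged illustration of the SVM-RFE step on speaker supervectors, the following scikit-learn sketch recursively eliminates low-ranked dimensions on synthetic data; the corpus, labels, dimensionalities and hyperparameters are all made-up assumptions, not the paper's setup.

import numpy as np
from sklearn.feature_selection import RFE
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 400))    # 100 speakers x 400 supervector dimensions (synthetic)
y = rng.integers(0, 5, size=100)   # 5 accent regions (synthetic labels)

# recursively drop a fraction of the least informative dimensions until 50 remain
selector = RFE(LinearSVC(C=1.0, max_iter=5000), n_features_to_select=50, step=0.1)
X_reduced = selector.fit_transform(X, y)
print(X_reduced.shape)             # (100, 50): reduced supervectors for the final classifier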
C1 [Wu, Tingyao; Duchateau, Jacques; Van Compernolle, Dirk] Katholieke Univ Leuven, PSI, ESAT, B-3001 Heverlee, Belgium.
[Martens, Jean-Pierre] Univ Ghent, ELIS, Ghent, Belgium.
RP Van Compernolle, D (reprint author), Katholieke Univ Leuven, PSI, ESAT, Kasteelpk Arenberg 10,B2441, B-3001 Heverlee, Belgium.
EM Tingyao.Wu@esat.kuleuven.be; Jacques.Duchateau@esat.kuleuven.be;
Jean-Pierre.Martens@elis.ugent.be; Dirk.VanCompernolle@esat.kuleuven.be
FU K.U. Leuven; Scientific Research Flanders [G.008.01, G.0260.07]
FX This research was supported by the Research Fund of the K.U. Leuven, the
Fund for Scientific Research Flanders (Projects G.008.01 and G.0260.07).
CR Adda-Decker M, 1999, SPEECH COMMUN, V29, P83, DOI 10.1016/S0167-6393(99)00032-1
Cremelie N, 1999, SPEECH COMMUN, V29, P115, DOI 10.1016/S0167-6393(99)00034-5
BERKLING K, 1998, P ICSLP 98, V2, P89
Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555
Burget L, 2006, P ICASSP 2006, V1, P209
CAMPBELL WM, 2008, P INT C AC SPEECH SI, P4141
Castaldo F., 2007, P INT, P346
CHEN T, 2001, P ASRU 2001 TREN IT, P343
DEMUYNCK K, 2008, P ICSLP, P495
Duan KB, 2005, IEEE T NANOBIOSCI, V4, P228, DOI 10.1109/TNB.2005.853657
GHESQUIERE P, 2002, P INT C AC SPEECH SI, V1, P749
Guyon I, 2002, MACH LEARN, V46, P389, DOI 10.1023/A:1012487302797
Guyon I., 2003, Journal of Machine Learning Research, V3, DOI 10.1162/153244303322753616
HANSEN JHL, 2004, P INT C SPOK LANG PR, P1569
HUANG RQ, 2006, P INT, P445
KNOPS U, 1984, TAAL TONGVAL, V25, P117
Kohavi R, 1997, ARTIF INTELL, V1, P273
LABOV W, 1996, P INT C SPOK LANG PR
LAMEL LF, 1995, COMPUT SPEECH LANG, V9, P87, DOI 10.1006/csla.1995.0005
LINCOLN M, 1998, P INT C SPOK LANG PR, V2, P109
LIU MK, 2000, P IEEE INT C AC SPEE, V2, P1025
Matejka P., 2006, P OD 2006 SPEAK LANG, P57
Mertens P., 1998, FONILEX MANUAL
Purnell T, 1999, J LANG SOC PSYCHOL, V18, P10, DOI 10.1177/0261927X99018001002
Rakotomamonjy A., 2003, Journal of Machine Learning Research, V3, DOI 10.1162/153244303322753706
SHEN W, 2008, P INT, P763
SHRIBERG E, 2008, P OD SPEAK LANG REC
TENBOSCH L, 2000, P INT C SPOK LANG PR, V3, P1009
Thomas ER, 2000, J PHONETICS, V28, P1, DOI 10.1006/jpho.2000.0103
TORRESCARRASQUI.PA, 2002, P ICASSP, V1, P757
TORRESCARRASQUI.PA, 2004, P OD SPEAK LANG REC, P297
TORRESCARRASQUI.PA, 2008, P INT, P79
Tukey J.W., 1977, EXPLORATORY DATA ANA
van Hout Roeland, 1999, ARTIKELEN DERDE SOCI, P183
WU T, 2009, THESIS KATHOLIEKE U
Zissman MA, 1996, IEEE T SPEECH AUDI P, V4, P31, DOI 10.1109/TSA.1996.481450
Zissman MA, 2001, SPEECH COMMUN, V35, P115, DOI 10.1016/S0167-6393(00)00099-6
NR 37
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2010
VL 52
IS 2
BP 83
EP 98
DI 10.1016/j.specom.2009.08.010
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 532QI
UT WOS:000272764100001
ER
PT J
AU Hanson, EK
Beukelman, DR
Heidemann, JK
Shutts-Johnson, E
AF Hanson, Elizabeth K.
Beukelman, David R.
Heidemann, Jana Kahl
Shutts-Johnson, Erin
TI The impact of alphabet supplementation and word prediction on sentence
intelligibility of electronically distorted speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Alphabet supplementation; Word prediction; Language modeling; Speech
intelligibility; Speech-generating device; SGD; Prototype; Dysarthria;
AAC; Augmentative and alternative communication
ID SEVERELY DYSARTHRIC SPEECH; STIMULUS COHESION; COMPRESSED SPEECH;
LINGUISTIC CUES; ADAPTATION
AB Alphabet supplementation is a low-tech augmentative and alternative communication (AAC) strategy that involves pointing to the first letter of each word spoken. Sentence intelligibility scores increased an average of 25% (Hanson et al., 2004) when speakers with moderate and severe dysarthria (a neurologic speech impairment) used alphabet supplementation strategies. This project investigated the impact of both alphabet supplementation and an electronic word prediction strategy, commonly used in augmentative and alternative communication technology, on the sentence intelligibility of normal natural speech that was electronically distorted to reduce intelligibility to the profound range of <30%. Results demonstrated large sentence intelligibility increases (average 80% increase) when distorted speech was supplemented with alphabet supplementation and word prediction. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Hanson, Elizabeth K.] Univ S Dakota, Dept Commun Disorders, Vermillion, SD 57069 USA.
[Beukelman, David R.] Univ Nebraska, Barkley Mem Ctr 202, Lincoln, NE 68583 USA.
RP Hanson, EK (reprint author), Univ S Dakota, Dept Commun Disorders, 414 E Clark St, Vermillion, SD 57069 USA.
EM ekhanson@usd.edu; dbeukelma-n1@unl.edu
FU National Institute of Disability and Rehabilitation Research [H113
980026]; US Department of Education
FX This publication was produced in part under Grant #H113#980026 from the
National Institute of Disability and Rehabilitation Research, US
Department of Education. The opinions expressed in this publication are
those of the grantee and do not necessarily reflect those of NIDRR or
the Department of Education.
CR Beliveau C., 1995, AUGMENTATIVE ALTERNA, V11, P176, DOI 10.1080/07434619512331277299
Berger K., 1967, J COMMUN DISORD, V1, P201, DOI 10.1016/0021-9924(68)90032-4
BEUKELMAN DR, 1977, J SPEECH HEAR DISORD, V42, P265
Beukelman DR, 2002, J MED SPEECH-LANG PA, V10, P237
Clarke CM, 2004, J ACOUST SOC AM, V116, P3647, DOI 10.1121/1.1815131
CROW E, 1989, RECENT ADV CLIN DYSA
DARLEY FL, 1969, J SPEECH HEAR RES, V12, P462
DePaul R, 2000, AM J SPEECH-LANG PAT, V9, P230
Dowden P. A, 1997, AUGMENTATIVE ALTERNA, V13, P48, DOI DOI 10.1080/07434619712331277838
Dupoux E, 1997, J EXP PSYCHOL HUMAN, V23, P914, DOI 10.1037/0096-1523.23.3.914
Hanson EK, 2004, J MED SPEECH-LANG PA, V12, pIX
HANSON EK, 2008, C MOT SPEECH MONT CA
Hustad K. C, 2001, AUGMENTATIVE ALTERNA, V17, P213, DOI 10.1080/714043385
Hustad KC, 2001, J SPEECH LANG HEAR R, V44, P497, DOI 10.1044/1092-4388(2001/039)
Hustad KC, 2008, J SPEECH LANG HEAR R, V51, P1438, DOI 10.1044/1092-4388(2008/07-0185)
Hustad KC, 2003, J SPEECH LANG HEAR R, V46, P462, DOI 10.1044/1092-4388(2003/038)
Hustad KC, 2002, J SPEECH LANG HEAR R, V45, P545, DOI 10.1044/1092-4388(2002/043)
KING D, 2004, SPEAKING DYNAMICALLY
LINDBLOM B, 1990, AAC (Augmentative and Alternative Communication), V6, P220, DOI 10.1080/07434619012331275504
NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469
Peelle JE, 2005, J EXP PSYCHOL HUMAN, V31, P1315, DOI 10.1037/0096-1523.31.6.1315
PISONI DB, 1985, SPEECH COMMUN, V4, P75, DOI 10.1016/0167-6393(85)90037-8
STUDEBAKER GA, 1985, J SPEECH HEAR RES, V28, P455
NR 23
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2010
VL 52
IS 2
BP 99
EP 105
DI 10.1016/j.specom.2009.08.004
PG 7
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 532QI
UT WOS:000272764100002
ER
PT J
AU Shue, YL
Shattuck-Hufnagel, S
Iseli, M
Jun, SA
Veilleux, N
Alwan, A
AF Shue, Yen-Liang
Shattuck-Hufnagel, Stefanie
Iseli, Markus
Jun, Sun-Ah
Veilleux, Nanette
Alwan, Abeer
TI On the acoustic correlates of high and low nuclear pitch accents in
American English
SO SPEECH COMMUNICATION
LA English
DT Article
DE Pitch-accent correlates; Prosody; Voice quality; Tonal crowding
ID MAXIMUM SPEED; REALIZATION; INTONATION
AB Earlier findings in Shue et al. (2007, 2008) raised questions about the alignment of nuclear pitch accents in American English, which are addressed here by eliciting both high and low pitch accents in two different target words in several different positions in a single-phrase utterance (early, late but not final, and final) from 20 speakers (10 male, 10 female). Results show that the F-0 peak associated with a high nuclear pitch accent is systematically displaced to an earlier point in the target word if that word is final in the phrase and thus bears the boundary-related tones as well. This effect of tonal crowding holds across speakers, genders and target words, but was not observed for low accents, adding to the growing evidence that low targets behave differently from highs. Analysis of energy shows that, across target words and genders, the average energy level of a target word is greatest at the start of an utterance and decreases with increasing proximity to the utterance boundary. Duration measures confirm the findings of existing literature on main-stress-syllable lengthening, final syllable lengthening, and lengthening associated with pitch accents, and reveal that final syllable lengthening is further enhanced if the final word also carries a pitch accent. Individual speaker analyses found that while most speakers conformed to the general trends for pitch movements there were 2/10 male and 1/10 female speakers who did not. These results show the importance of taking into account prosodic contexts and speaker variability when interpreting correlates to prosodic events such as pitch accents. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Shue, Yen-Liang; Iseli, Markus; Alwan, Abeer] Univ Calif Los Angeles, Dept Elect Engn, Los Angeles, CA 90095 USA.
[Shattuck-Hufnagel, Stefanie] MIT, Elect Res Lab, Cambridge, MA 02139 USA.
[Jun, Sun-Ah] Univ Calif Los Angeles, Dept Linguist, Los Angeles, CA 90095 USA.
[Veilleux, Nanette] Simmons Coll, Dept Comp Sci, Boston, MA 02115 USA.
RP Shue, YL (reprint author), Univ Calif Los Angeles, Dept Elect Engn, 56-125B Engn 4 Bldg,Box 951594, Los Angeles, CA 90095 USA.
EM yshue@ee.ucla.edu; stef@speech.mit.edu; iseli@ee.ucla.edu;
jun@humnet.ucla.edu; nanette.veilleux@simmons.edu; alwan@ee.ucla.edu
FU NSF
FX We thank Dr. Patricia Keating for her helpful suggestions and advice
during the preparation of this study. We also thank the speakers who
participated in this experiment. This work was supported in part by the
NSF.
CR Arvaniti A, 2006, SPEECH COMMUN, V48, P667, DOI 10.1016/j.specom.2005.09.012
Arvaniti A., 2007, PAPERS LAB PHONOLOGY, V9, P547
Beckman M. E., 1994, PHONOLOGICAL STRUCTU, V3, P7
Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X
CHAFE WL, 1993, TALKING DATA TRANSCR, P3
Dilley L, 1996, J PHONETICS, V24, P423, DOI 10.1006/jpho.1996.0023
Fougeron C, 1998, J PHONETICS, V26, P45, DOI 10.1006/jpho.1997.0062
Grabe E, 2000, J PHONETICS, V28, P161, DOI 10.1006/jpho.2000.0111
Hirschberg J., 1986, P 24 ANN M ASS COMP, P136, DOI 10.3115/981131.981152
Iseli M, 2007, J ACOUST SOC AM, V121, P2283, DOI 10.1121/1.2697522
Jilka M., 2007, P INT 2007, P2621
KAWAHARA H, 1998, P ICSLP SYD AUSTR
KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986
KLATT DH, 1976, IEEE T ACOUST SPEECH, V35, P445
Kochanski G, 2005, J ACOUST SOC AM, V118, P1038, DOI 10.1121/1.1923349
Ladd D. R., 1996, INTONATIONAL PHONOLO
LEVI SV, INTONATION TUR UNPUB
Mucke D, 2009, J PHONETICS, V37, P321, DOI 10.1016/j.wocn.2009.03.005
Ode C, 2005, SPEECH COMMUN, V47, P71, DOI 10.1016/j.specom.2005.06.004
OHALA JJ, 1973, J ACOUST SOC AM, V53, P345, DOI 10.1121/1.1982441
Pierrehumbert J, 1980, THESIS MIT
PIERREHUMBERT JB, 1991, PAPERS LAB PHONOLOGY, V2, P90
ROSENBERG A, 2006, P INT PITTSB, P301
ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572
SHUE YL, 2008, P INT BRISB AUSTR, P873
SHUE YL, 2007, P INT ANTW BELG, P2625
Silverman Kim E. A., 1990, PAPERS LABORATORY PH, P72
SLIFKA J, 2007, J VOICE, V20, P171
Sluijter A. M. C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607440
Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955
SUNDBERG J, 1979, J PHONETICS, V7, P71
TITZE IR, 1989, J ACOUST SOC AM, V85, P1699, DOI 10.1121/1.397959
Turk AE, 2007, J PHONETICS, V35, P445, DOI 10.1016/j.wocn.2006.12.001
Turk AE, 1999, J PHONETICS, V27, P171, DOI 10.1006/jpho.1999.0093
Xu Y, 2002, J ACOUST SOC AM, V111, P1399, DOI 10.1121/1.1445789
NR 35
TC 0
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2010
VL 52
IS 2
BP 106
EP 122
DI 10.1016/j.specom.2009.08.005
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 532QI
UT WOS:000272764100003
ER
PT J
AU Vicente-Pena, J
Diaz-de-Maria, F
Kleijn, WB
AF Vicente-Pena, Jesus
Diaz-de-Maria, Fernando
Kleijn, W. Bastiaan
TI The synergy between bounded-distance HMM and spectral subtraction for
robust speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Robust speech recognition; Spectral subtraction; Acoustic backing-off;
Bounded-distance HMM; Missing features; Outliers
ID NOISE; FEATURES
AB Additive noise generates important losses in automatic speech recognition systems. In this paper, we show that one of the causes contributing to these losses is the fact that conventional recognisers take into consideration feature values that are outliers. The method that we call bounded-distance HMM is a suitable way to prevent outliers from contributing to the recogniser decision. However, this method just deals with outliers, leaving the remaining features unaltered. In contrast, spectral subtraction is able to correct all the features at the expense of introducing some artifacts that, as shown in the paper, cause a larger number of outliers. As a result, we find that bounded-distance HMM and spectral subtraction complement each other well. A comprehensive experimental evaluation was conducted, considering several well-known ASR tasks (of different complexities) and numerous noise types and SNRs. The achieved results show that the suggested combination generally outperforms both the bounded-distance HMM and spectral subtraction individually. Furthermore, the obtained improvements, especially for low and medium SNRs, are larger than the sum of the improvements individually obtained by bounded-distance HMM and spectral subtraction. (C) 2009 Elsevier B.V. All rights reserved.
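For the spectral subtraction half of the combination, a minimal single-frame sketch is given below; the over-subtraction factor, spectral floor and noise estimate are illustrative assumptions rather than the paper's settings.

import numpy as np

def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, beta=0.01):
    # power-domain subtraction with over-subtraction factor alpha and spectral floor beta
    clean_pow = noisy_mag ** 2 - alpha * noise_mag ** 2
    floor_pow = beta * noise_mag ** 2
    return np.sqrt(np.maximum(clean_pow, floor_pow))

frame = np.random.randn(256)
noisy_mag = np.abs(np.fft.rfft(frame * np.hanning(256)))
noise_mag = 0.5 * np.ones_like(noisy_mag)   # assumed noise magnitude estimate
print(spectral_subtract(noisy_mag, noise_mag)[:5])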
C1 [Vicente-Pena, Jesus; Diaz-de-Maria, Fernando] Univ Carlos III Madrid, EPS, Dept Signal Proc & Commun, Madrid 28911, Spain.
[Kleijn, W. Bastiaan] KTH Royal Inst Technol, Sound & Image Proc Lab, Stockholm, Sweden.
RP Vicente-Pena, J (reprint author), Univ Carlos III Madrid, EPS, Dept Signal Proc & Commun, Avda Univ 30, Madrid 28911, Spain.
EM jvicente@tsc.uc3m.es; fdiaz@tsc.uc3m.es; bastiaan.kleijn@ee.kth.se
RI Diaz de Maria, Fernando/E-8048-2011
FU Spanish Regional [CCG06-UC3M/TIC-0812]
FX This work has been partially supported by Spanish Regional Grant No.
CCG06-UC3M/TIC-0812.
CR BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
*CMU, 1998, CMU V 0 6 PRON DICT
Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0
de la Torre A, 2005, IEEE T SPEECH AUDI P, V13, P355, DOI 10.1109/TSA.2005.845805
DENG L, 2000, P ICSLP, P806
de Veth J, 2001, SPEECH COMMUN, V34, P247, DOI 10.1016/S0167-6393(00)00037-6
de Veth J, 2001, SPEECH COMMUN, V34, P57, DOI 10.1016/S0167-6393(00)00046-7
DEVETH J, 1998, P INT C SPOK LANG PR, P1427
FLORES JAN, 1994, P ICASSP, V1, P409
FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530
Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013
Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929
GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J
HAIN T, 1999, P ICASSP 99, V1, P57
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
HIRSCH G, 2002, AU41702 ETSI STQ DSR
MACHO D, 2000, SPANISH SDC AURORA D
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
MATSUI T, 1992, P INT C AC SPEECH SI, V2, P157
Matsui T, 1991, P IEEE INT C AC SPEE, V1, P377
Moreno PJ, 1996, P IEEE INT C AC SPEE, V2, P733
NADEU C, 1995, P EUR 95, P1381
Nadeu C, 2001, SPEECH COMMUN, V34, P93, DOI 10.1016/S0167-6393(00)00048-0
*NIST, 1992, NIST RES MAN CORP RM
Paliwal KK, 1999, P EUR C SPEECH COMM, P85
PAUL DB, 1992, HLT 91, P357
PUJOL P, 2004, P INT C SPOK LANG PR
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
Raj B., 2000, THESIS CARNEGIE MELL
Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828
Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007
Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006
SHOZAKAI M, 1997, IEEE WORKSH AUT SPEE, P450
VARGA AP, 1992, NOISEX 92 STUDY EFFE
Viikki O, 1998, SPEECH COMMUN, V25, P133, DOI 10.1016/S0167-6393(98)00033-8
Young S., 2002, HTK BOOK HTK VERSION
NR 36
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2010
VL 52
IS 2
BP 123
EP 133
DI 10.1016/j.specom.2009.09.002
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 532QI
UT WOS:000272764100004
ER
PT J
AU Hansen, JHL
Zhang, XX
AF Hansen, John H. L.
Zhang, Xianxian
TI Analysis of CFA-BF: Novel combined fixed/adaptive beamforming for robust
speech recognition in real car environments
SO SPEECH COMMUNICATION
LA English
DT Article
DE Array processing; Robust speech recognition; In-vehicle speech systems;
Beamforming
ID ADAPTIVE BEAMFORMER; MICROPHONE ARRAY; TIME-DELAY; ENHANCEMENT; NOISE;
ALGORITHM; SPECTRUM; FILTERS
AB Among a number of studies which have investigated various speech enhancement and processing schemes for in-vehicle speech systems, the delay-and-sum beamforming (DASB) and adaptive beamforming are two typical methods that both have their advantages and disadvantages. In this paper, we propose a novel combined fixed/adaptive beamforming solution (CFA-BF) based on previous work for speech enhancement and recognition in real moving car environments, which seeks to take advantage of both methods. The working scheme of CFA-BF consists of two steps: source location calibration and target signal enhancement. The first step is to pre-record the transfer functions between the speaker and microphone array from different potential source positions using adaptive beamforming under quiet environments; and the second step is to use this pre-recorded information to enhance the desired speech when the car is running on the road. An evaluation using extensive actual car speech data from the CU-Move Corpus shows that the method can decrease WER for speech recognition by up to 30% over a single channel scenario and improve speech quality via the SEGSNR measure by up to 1 dB on the average. (C) 2009 Elsevier B.V. All rights reserved.
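For orientation only, here is a minimal delay-and-sum beamformer over a multichannel buffer; in the system described above the steering information comes from the source-location calibration step, whereas the integer sample delays used here are a simplifying assumption for illustration.

import numpy as np

def delay_and_sum(channels, delays):
    # channels: (n_mics, n_samples) array; delays: integer sample delays per microphone
    out = np.zeros(channels.shape[1])
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -d)   # advance each channel by its steering delay
    return out / len(delays)

x = np.random.randn(4, 16000)        # synthetic 4-mic, 1-second buffer at 16 kHz
print(delay_and_sum(x, [0, 2, 4, 6]).shape)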
C1 [Hansen, John H. L.; Zhang, Xianxian] Univ Texas Dallas, CRSS, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, Richardson, TX 75083 USA.
RP Hansen, JHL (reprint author), Univ Texas Dallas, CRSS, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, EC33,POB 830688, Richardson, TX 75083 USA.
EM John.Hansen@utdallas.edu
FU DARPA through SPAWAR [N66001-8906]; University of Texas at Dallas
[EM-MITT]
FX This project was supported by Grants from DARPA through SPAWAR under
Grant No. N66001-8906, and by the University of Texas at Dallas under
Project EM-MITT. Any opinions, findings and conclusions expressed in
this material are those of the authors and do not necessarily reflect
the views of DARPA.
CR ABUT H, 2002, IEEE DL LECT JAP HON
Brandstein M., 2001, MICROPHONE ARRAYS
CAPON J, 1969, P IEEE, V57, P1408, DOI 10.1109/PROC.1969.7278
Compernolle D. V., 1990, SPEECH COMMUN, V9, P433
COMPERNOLLE DV, 1990, P IEEE ICASSP 90, V2, P833
DELLER JR, 2000, DISCRETE TIME PROCES, pCH8
FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817
GALANENKO V, 2001, P IEEE ICASSP, V5, P3017
GAZOR S, 1995, IEEE T SPEECH AUDIO, V3, P94
GAZOR S, 1994, P IEEE ICASSP, V4, P557
GIULIANI D, 1996, P IEEE ICSLP 96, V3, P1329, DOI 10.1109/ICSLP.1996.607858
GOULDING MM, 1990, IEEE T VEH TECHNOL, V39, P316, DOI 10.1109/25.61353
Grenier Y., 1992, P IEEE ICASSP, V1, P305
GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739
HAAN JMD, 2003, IEEE T SPEECH AUDIO, V11, P14
Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618
HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901
HANSEN JHL, 1995, J ACOUST SOC AM, V97, P609, DOI 10.1121/1.412283
Hansen JHL, 2000, P IEEE ICSLP 2000, V1, P524
HANSEN JHL, 2001, INTERSPEECH 01 EUROS, V3, P2023
Haykin S., 1985, ARRAY SIGNAL PROCESS
Hoshuyama O, 1999, IEEE T SIGNAL PROCES, V47, P2677, DOI 10.1109/78.790650
Jelinek F., 1980, Pattern Recognition in Practice. Proceedings of an International Workshop
Jensen J, 2001, IEEE T SPEECH AUDI P, V9, P731, DOI 10.1109/89.952491
Johnson D, 1993, ARRAY SIGNAL PROCESS
Kaiser J.F., 1993, P INT C AC SPEECH SI, V3, P149
KNAPP CH, 1976, IEEE T ACOUST SPEECH, V24, P320, DOI 10.1109/TASSP.1976.1162830
KOMAROW S, 2000, USA TODAY 0912
KOROMPIS D, 1995, P IEEE ICASSP 95, V4, P2739
LANG SW, 1980, IEEE T ACOUST SPEECH, V28, P716, DOI 10.1109/TASSP.1980.1163467
Li WF, 2005, IEEE SIGNAL PROC LET, V12, P340, DOI 10.1109/LSP.2005.843761
MEYER J, 1997, P ICASSP 97, V2, P1167
NANDKUMAR S, 1995, IEEE T SPEECH AUDI P, V3, P22, DOI 10.1109/89.365384
Nordholm S, 1999, IEEE T SPEECH AUDI P, V7, P241, DOI 10.1109/89.759030
Oh S., 1992, P IEEE ICASSP 92, V1, P281
OMOLOGO M, 1996, P IEEE INT C AC SPEE, V2, P921
OMOLOGO M, 1994, P IEEE ICASSP 94, V2, P860
Pellom B, 2001, TRCSLR200101 U COL
Pellom BL, 1998, IEEE T SPEECH AUDI P, V6, P573, DOI 10.1109/89.725324
Pillai S., 1989, ARRAY SIGNAL PROCESS
PLUCIENKOWSKI J, 2001, P EUROSPEECH 01, V3, P1573
RABINER L, 1993, FUNDAMENTALS SPEECH, P447
REED FA, 1981, IEEE T ACOUST SPEECH, V29, P561, DOI 10.1109/TASSP.1981.1163614
SENADJI B, 1993, P IEEE ICASSP 93, V1, P321
SHINDE T, 2002, P INT 02 ICSLP 02 DE
SVAIZER P, 1997, P IEEE ICASSP, V1, P231
VISSER E, 2002, P INT 02 ICSLP 02 DE
WAHAB A, 1998, P 5 INT C CONTR AUT
WAHAB A, 1997, P 1 INT C INF COMM S
WALLACE RB, 1992, IEEE T CIRCUITS-II, V39, P239, DOI 10.1109/82.136574
Widrow B, 1985, ADAPTIVE SIGNAL PROC
Yamada T, 2002, IEEE T SPEECH AUDI P, V10, P48, DOI 10.1109/89.985542
YAPANEL U, 2002, P ICSLP 2002, V2, P793
ZHANG XX, 2000, INT C SIGN PROC TECH
ZHANG XX, 2003, P INT 03 EUR 03 GEN
Zhang XX, 2003, IEEE T SPEECH AUDI P, V11, P733, DOI 10.1109/TSA.2003.818034
Zhou GJ, 2001, IEEE T SPEECH AUDI P, V9, P201, DOI 10.1109/89.905995
NR 57
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2010
VL 52
IS 2
BP 134
EP 149
DI 10.1016/j.specom.2009.09.001
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 532QI
UT WOS:000272764100005
ER
PT J
AU Teoh, ABJ
Chong, LY
AF Teoh, Andrew Beng Jin
Chong, Lee-Ying
TI Secure speech template protection in speaker verification system
SO SPEECH COMMUNICATION
LA English
DT Article
DE Cancellable biometrics; Random projection; Speaker verification; 2D
subspace projection methods; Gaussian mixture model
ID DATA PERTURBATION; PRIVACY; MODELS; IDENTIFICATION
AB Due to biometric template characteristics that are susceptible to non-revocability and privacy invasion, cancellable biometrics has been introduced to tackle these issues. In this paper, we present a two-factor cancellable formulation for speech biometrics, which we refer to as probabilistic random projection (PRP). PRP offers strong protection of the speech template by hiding the actual speech feature through the random subspace projection process. Besides, the speech template is replaceable and can be reissued when it is compromised. Our proposed method enables the generation of different speech templates from the same speech feature, which means linkability does not exist between the speech templates. The formulation of the cancellable biometrics retains its performance as for the conventional biometric. Besides that, we also propose 2D subspace projection techniques for speech feature extraction, namely 2D Principal Component Analysis (2DPCA) and 2D CLAss-Featuring Information Compression (2DCLAFIC) to accommodate the requirements of the PRP formulation. (C) 2009 Elsevier B.V. All rights reserved.
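The revocability idea can be illustrated with a generic random-projection template: a user-specific seed generates the projection matrix, so a compromised template can be reissued by changing the seed. The dimensions, Gaussian matrix and feature vector below are assumptions for illustration, not the PRP formulation itself.

import numpy as np

def project_template(feature_vec, seed, out_dim=64):
    # the seed acts as the user-specific token; a new seed yields an unlinkable template
    rng = np.random.default_rng(seed)
    R = rng.normal(size=(out_dim, feature_vec.size)) / np.sqrt(out_dim)
    return R @ feature_vec

speech_feature = np.random.randn(256)                         # placeholder speech feature vector
template_v1 = project_template(speech_feature, seed=12345)
template_v2 = project_template(speech_feature, seed=67890)    # reissued after compromise
print(template_v1.shape, np.allclose(template_v1, template_v2))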
C1 [Teoh, Andrew Beng Jin] Yonsei Univ, Coll Engn, Seoul 120749, South Korea.
[Chong, Lee-Ying] Multimedia Univ, Fac Informat Sci & Technol, Jalan Ayer Keroh Lama 75450, Melaka, Malaysia.
RP Teoh, ABJ (reprint author), Yonsei Univ, Coll Engn, Seoul 120749, South Korea.
EM bjteoh@yonsei.ac.kr; lychong@mmu.edu.my
RI Chong, Lee-Ying/B-3506-2010; Teoh, Andrew Beng Jin/F-4422-2010
FU Korea Science and Engineering Foundation (KOSEF) through the Biometrics
Engineering Research Center (BERC) at Yonsei University
[112002105080020]
FX This work was supported by the Korea Science and Engineering Foundation
(KOSEF) through the Biometrics Engineering Research Center (BERC) at
Yonsei University (Grant No. R 112002105080020 (2009)).
CR Ang R., 2005, P 10 AUSTR C INF SEC, P242
Ariki Y., 1996, Proceedings of the 13th International Conference on Pattern Recognition, DOI 10.1109/ICPR.1996.546989
BOULT T, 2006, 7 INT C AUT FAC GEST, P560, DOI 10.1109/FGR.2006.94
CHONG LY, 2007, 5 IEEE WORKSH AUT ID, P445
CHONG LY, 2007, P INT C ROB VIS INF, P525
Dasgupta S., 2000, P 16 C UNC ART INT, P143
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
Gersho A., 1992, VECTOR QUANTIZATION
HARRAG A, 2005, 2005 ANN IEEE INDICO, P237, DOI 10.1109/INDCON.2005.1590163
Higgins A., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90098-6
Kargupta H, 2005, KNOWL INF SYST, V7, P387, DOI 10.1007/s10115-004-0173-6
Laaksonen J., 1996, P INT C ART NEUR NET, P227
Liu K, 2006, IEEE T KNOWL DATA EN, V18, P92
MATSUMOTO T, 2002, OPT SECUR COUNTERFEI, V4, P4677
Ratha NK, 2007, IEEE T PATTERN ANAL, V29, P561, DOI 10.1109/TPAMI.2007.1004
Ratha NK, 2001, IBM SYST J, V40, P614
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379
ROSCA J, 2003, P ICA 2003 NAR JAP A, P999
SAVVIDES M, 2004, INT C PATTERN RECOGN, V3, P922
Teoh ABJ, 2007, IEEE T SYST MAN CY B, V37, P1096, DOI 10.1109/TSMCB.2007.903538
Teoh B. J .A., 2004, PATTERN RECOGN, V37, P2245
TEOH BJA, 2006, IEEE T PATTERN ANAL, V28, P1892
TULYAKOV S, 2005, ICAPR, P30
Yang J, 2004, IEEE T PATTERN ANAL, V26, P131
ZHANG W, 2003, IEEE INT C SYST MAN, V5, P4147
NR 27
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2010
VL 52
IS 2
BP 150
EP 163
DI 10.1016/j.specom.2009.09.003
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 532QI
UT WOS:000272764100006
ER
PT J
AU Pucher, M
Schabus, D
Yamagishi, J
Neubarth, F
Strom, V
AF Pucher, Michael
Schabus, Dietmar
Yamagishi, Junichi
Neubarth, Friedrich
Strom, Volker
TI Modeling and interpolation of Austrian German and Viennese dialect in
HMM-based speech synthesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech synthesis; Hidden Markov model; Dialect; Sociolect; Austrian
German
ID SYNTHESIS SYSTEM; ALGORITHM
AB An HMM-based speech synthesis framework is applied to both standard Austrian German and a Viennese dialectal variety, and several training strategies for multi-dialect modeling such as dialect clustering and dialect-adaptive training are investigated. For bridging the gap between processing on the level of HMMs and on the linguistic level, we add phonological transformations to the HMM interpolation and apply them to dialect interpolation. The crucial steps are to employ several formalized phonological rules between Austrian German and Viennese dialect as constraints for the HMM interpolation. We verify the effectiveness of this strategy in a number of perceptual evaluations. Since the HMM space used is not articulatory but acoustic space, there are some variations in evaluation results between the phonological rules. However, in general we obtained good evaluation results which show that listeners can perceive both continuous and categorical changes of dialect varieties by using phonological transformations employed as switching rules in the HMM interpolation. (C) 2009 Elsevier B.V. All rights reserved.
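As a toy illustration of interpolating between two model varieties, the snippet below blends Gaussian mean vectors with a single weight; real systems interpolate full HMM output distributions and apply the phonological switching rules described above, and the values shown are invented.

import numpy as np

def interpolate_means(mean_standard, mean_dialect, weight):
    # weight = 0.0 gives the standard variety, 1.0 the dialect
    return (1.0 - weight) * mean_standard + weight * mean_dialect

mu_standard = np.array([1.2, -0.3, 0.8])   # toy mel-cepstral state means, standard variety
mu_viennese = np.array([1.5, -0.1, 0.4])   # toy means, dialect variety
print(interpolate_means(mu_standard, mu_viennese, 0.5))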
C1 [Pucher, Michael; Schabus, Dietmar] Telecommun Res Ctr Vienna Ftw, A-1220 Vienna, Austria.
[Yamagishi, Junichi; Strom, Volker] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9AB, Midlothian, Scotland.
[Neubarth, Friedrich] Austrian Res Inst Artificial Intelligence OFAI, A-1010 Vienna, Austria.
RP Pucher, M (reprint author), Telecommun Res Ctr Vienna Ftw, Donau City Str 1,3rd Floor, A-1220 Vienna, Austria.
EM pucher@ftw.at
FU Austrian Government; City of Vienna within the competence center;
Austrian Federal Ministry for Transport, Innovation, and Technology;
European Community's Seventh Framework [FP7/2007-2013, 213845]
FX The project "Viennese Sociolect and Dialect Synthesis" is funded by the
Vienna Science and Technology Fund (WWTF). The Telecommunications
Research Center Vienna (FTW) is supported by the Austrian Government and
the City of Vienna within the competence center program COMET. OFAI is
supported by the Austrian Federal Ministry for Transport, Innovation,
and Technology and by the Austrian Federal Ministry for Science and
Research. Junichi Yamagishi is funded by the European Community's
Seventh Framework Programme (FP7/2007-2013) under Grant Agreement No.
213845 (the EMIME project). We thank Dr. Simon King and Mr. Oliver Watts
of the University of Edinburgh for their valuable comments and
proofreading. We also thank the reviewers for their valuable
suggestions.
CR Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807
Black A., 2007, P ICASSP 2007, P1229
Cox T., 2001, MULTIDIMENSIONAL SCA
CREER S, 2009, ADV CLIN NEUROSCI RE, V9, P16
FITT S, 1999, P EUR, V2, P823
FRASER M, 2007, P BLIZZ 2007
Fukada T., 1992, P ICASSP, P137, DOI 10.1109/ICASSP.1992.225953
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Garman M., 1990, PSYCHOLINGUISTICS
Karaiskos V., 2008, P BLIZZ CHALL WORKSH
Kawahara H., 2001, 2 MAVEBA
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
LIBERMAN AM, 1970, PERCEPT DISORD, V48
Ling Z.H., 2008, P INT, P573
Ling ZH, 2009, IEEE T AUDIO SPEECH, V17, P1171, DOI 10.1109/TASL.2009.2014796
Moosmuller Sylvia, 1987, SOZIOPHONOLOGISCHE V
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
Muhr Rudolf, 2007, OSTERREICHISCHES AUS
NEUBARTH F, 2008, P 9 ANN C INT SPEECH, P1877
Saussure F, 1916, COURSE GEN LINGUISTI
SCHONLE PW, 1987, BRAIN LANG, V31, P26, DOI 10.1016/0093-934X(87)90058-7
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
Stevens K. N., 1997, HDB PHONETIC SCI, P462
Tachibana M, 2005, IEICE T INF SYST, VE88D, P2484, DOI 10.1093/ietisy/e88-d.11.2484
Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816
TOKUDA K, 1991, IEICE T FUND ELECTR, V74, P1240
Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P66, DOI 10.1109/TASL.2008.2006647
Yamagishi J., 2008, P BLIZZ CHALL 2008
Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P1208, DOI 10.1109/TASL.2009.2016394
Yoshimura T., 1999, P EUROSPEECH 99 SEPT, P2374
Yoshimura T., 2001, P EUROSPEECH, P2263
Yoshimura T., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.199
Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885
Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825
Zen H, 2009, SPEECH COMMUN, V51, P1039, DOI 10.1016/j.specom.2009.04.004
Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325
NR 36
TC 11
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2010
VL 52
IS 2
BP 164
EP 179
DI 10.1016/j.specom.2009.09.004
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 532QI
UT WOS:000272764100007
ER
PT J
AU Lu, X
Matsuda, S
Unoki, M
Nakamura, S
AF Lu, X.
Matsuda, S.
Unoki, M.
Nakamura, S.
TI Temporal contrast normalization and edge-preserved smoothing of temporal
modulation structures of speech for robust speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Robust speech recognition; Temporal modulation; Mean and variance
normalization; Edge-preserved smoothing; Modulation object
ID BILATERAL FILTER; SPECTRUM; FEATURES
AB Traditionally, noise reduction methods for additive noise have been quite different from those for reverberation. In this study, we investigated the effect of additive noise and reverberation on speech on the basis of the concept of temporal modulation transfer. We first analyzed the noise effect on the temporal modulation of speech. Then on the basis of this analysis, we proposed a two-stage processing algorithm that adaptively normalizes the temporal modulation of speech to extract robust speech features for automatic speech recognition. In the first stage of the proposed algorithm, the temporal modulation contrast of the cepstral time series for both clean and noisy speech is normalized. In the second stage, the contrast normalized temporal modulation spectrum is smoothed in order to reduce the artifacts due to noise while preserving the information in the speech modulation events (edges). We tested our algorithm in speech recognition experiments for additive noise condition, reverberant condition, and noisy condition (both additive noise and reverberation) using the AURORA-2J data corpus. Our results showed that as part of a uniform processing framework, the algorithm helped achieve the following: (1) for the additive noise condition, a 55.85% relative word error reduction (RWER) rate when clean conditional training was performed, and a 41.64% RWER rate when multi-conditional training was performed, (2) for the reverberant condition, a 51.28% RWER rate, and (3) for the noisy condition (both additive noise and reverberation), a 95.03% RWER rate. In addition, we evaluated the performance of each stage of the proposed algorithm in AURORA-2J and AURORA4 experiments, and compared the performance of our algorithm with the performances of two similar processing algorithms in the second stage. The evaluation results further confirmed the effectiveness of our proposed algorithm. (C) 2009 Elsevier B.V. All rights reserved.
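The first stage, temporal contrast normalisation of the cepstral time series, can be pictured as per-utterance mean and variance normalisation of each coefficient track; the sketch below shows that idea on a synthetic MFCC sequence and omits the second, edge-preserved smoothing stage. The data and constants are assumptions.

import numpy as np

def contrast_normalize(cepstra, eps=1e-8):
    # cepstra: (n_frames, n_coeffs); normalise each coefficient track over time
    mu = cepstra.mean(axis=0, keepdims=True)
    sigma = cepstra.std(axis=0, keepdims=True)
    return (cepstra - mu) / (sigma + eps)

C = 3.0 * np.random.randn(300, 13) + 1.0   # synthetic MFCC time series
C_norm = contrast_normalize(C)
print(C_norm.mean(axis=0).round(6)[:3], C_norm.std(axis=0).round(6)[:3])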
C1 [Lu, X.; Matsuda, S.; Nakamura, S.] Natl Inst Informat & Commun Technol, Tokyo, Japan.
[Unoki, M.] Japan Adv Inst Sci & Technol, Kanazawa, Ishikawa, Japan.
RP Lu, X (reprint author), Natl Inst Informat & Commun Technol, Tokyo, Japan.
EM xuganglu@gmail.com
FU Knowledge Creating Communication Research Center of NICT
FX This study is supported by the MASTAR Project of the Knowledge Creating
Communication Research Center of NICT. The authors would like to thank
Dr. Yeung of HKUST for sharing the HTK based evaluation system on
AURORA4.
CR [Anonymous], 2002, HTK BOOK VERS 3 2
Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013
Chen CP, 2007, IEEE T AUDIO SPEECH, V15, P257, DOI 10.1109/TASL.2006.876717
Chen JD, 2006, IEEE T AUDIO SPEECH, V14, P1218, DOI 10.1109/TSA.2005.860851
Chen JD, 2003, SPEECH COMMUN, V41, P469, DOI 10.1016/S0167-6393(03)00016-5
DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836
Dudley H, 1939, J ACOUST SOC AM, V11, P169, DOI 10.1121/1.1916020
Elad M, 2002, IEEE T IMAGE PROCESS, V11, P1141, DOI 10.1109/TIP.2002.801126
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
ETSI, 2007, 202050V115 ETSI ES
Greenberg S, 1997, INT CONF ACOUST SPEE, P1647, DOI 10.1109/ICASSP.1997.598826
Hermansky H., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319236
HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224
HUANG BH, 1987, IEEE T ACOUST SPEECH, V35, P947
Hung JW, 2006, IEEE T AUDIO SPEECH, V14, P808, DOI 10.1109/TSA.2005.857801
Joris PX, 2004, PHYSIOL REV, V84, P541, DOI 10.1152/physrev.00029.2003
Kanedera N, 1999, SPEECH COMMUN, V28, P43, DOI 10.1016/S0167-6393(99)00002-3
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
NEUMANN J, 2007, PERSONAL UBIQUITOUS
RABINER L, 1993, FUNDAMENTALS SPEECH
SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303
Shen JL, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P881
Togneri R., 2006, P SST 2006, P94
Tomasi C., 1998, ICCV, P839, DOI DOI 10.1109/ICCV.1998.710815
Torre A., 2005, IEEE T SPEECH AUDIO, V13, P355
Turner RE, 2007, LECT NOTES COMPUT SC, V4666, P544
Vaseghi S. V., 2000, ADV DIGITAL SIGNAL P
Xiao X, 2007, IEEE SIGNAL PROC LET, V14, P500, DOI 10.1109/LSP.2006.891341
Xiao X, 2008, IEEE T AUDIO SPEECH, V16, P1662, DOI 10.1109/TASL.2008.2002082
Zhang M, 2008, INT CONF ACOUST SPEE, P929
Zhu WZ, 2005, INT CONF ACOUST SPEE, P245
NR 31
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2010
VL 52
IS 1
BP 1
EP 11
DI 10.1016/j.specom.2009.08.006
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 522QX
UT WOS:000272014500001
ER
PT J
AU Kinnunen, T
Li, HZ
AF Kinnunen, Tomi
Li, Haizhou
TI An overview of text-independent speaker recognition: From features to
supervectors
SO SPEECH COMMUNICATION
LA English
DT Review
DE Speaker recognition; Text-independence; Feature extraction; Statistical
models; Discriminative models; Supervectors; Intersession variability
compensation
ID GAUSSIAN MIXTURE-MODELS; COMBINING MULTIPLE CLASSIFIERS; SUPPORT VECTOR
MACHINES; LANGUAGE RECOGNITION; SESSION VARIABILITY; LINEAR PREDICTION;
SCORE NORMALIZATION; PATTERN-RECOGNITION; FEATURE-EXTRACTION; PROSODIC
FEATURES
AB This paper gives an overview of automatic speaker recognition technology, with an emphasis on text-independent recognition. Speaker recognition has been studied actively for several decades. We give an overview of both the classical and the state-of-the-art methods. We start with the fundamentals of automatic speaker recognition, concerning feature extraction and speaker modeling. We elaborate advanced computational techniques to address robustness and session variability. The recent progress from vectors towards supervectors opens up a new area of exploration and represents a technology trend. We also provide an overview of this recent development and discuss the evaluation methodology of speaker recognition systems. We conclude the paper with discussion on future directions. (C) 2009 Elsevier B.V. All rights reserved.
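One concrete route from frame-level features to the supervectors surveyed in this review is relevance-MAP adaptation of the means of a universal background model (UBM), stacked into a single long vector. The following is a minimal sketch under the assumption of a diagonal-covariance GMM-UBM; the relevance factor, toy dimensions, and names are illustrative and not taken from the paper.

```python
import numpy as np

def gmm_supervector(ubm_means, ubm_vars, ubm_weights, feats, relevance=16.0):
    """Relevance-MAP adaptation of diagonal-covariance UBM means to one
    speaker's frames, stacked into a mean supervector."""
    C, D = ubm_means.shape
    T = feats.shape[0]
    # log posterior responsibility of each mixture for each frame
    log_post = np.empty((T, C))
    for c in range(C):
        diff = feats - ubm_means[c]
        log_gauss = -0.5 * np.sum(diff ** 2 / ubm_vars[c]
                                  + np.log(2.0 * np.pi * ubm_vars[c]), axis=1)
        log_post[:, c] = np.log(ubm_weights[c] + 1e-12) + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)

    n_c = post.sum(axis=0)                      # zeroth-order (soft count) statistics
    f_c = post.T @ feats                        # first-order statistics, shape (C, D)
    alpha = n_c / (n_c + relevance)             # per-mixture adaptation factors
    new_means = (alpha[:, None] * f_c / np.maximum(n_c, 1e-12)[:, None]
                 + (1.0 - alpha)[:, None] * ubm_means)
    return new_means.reshape(-1)                # (C * D,) supervector

# toy usage with a random 4-mixture, 3-dimensional UBM
rng = np.random.default_rng(0)
means, variances, weights = rng.normal(size=(4, 3)), np.ones((4, 3)), np.full(4, 0.25)
sv = gmm_supervector(means, variances, weights, rng.normal(size=(100, 3)))
print(sv.shape)   # (12,)
```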
C1 [Kinnunen, Tomi] Univ Joensuu, Dept Comp Sci & Stat, Speech & Image Proc Unit, FIN-80101 Joensuu, Finland.
[Li, Haizhou] Inst Infocomm Res, Dept Human Language Technol, Singapore 138632, Singapore.
RP Kinnunen, T (reprint author), Univ Joensuu, Dept Comp Sci & Stat, Speech & Image Proc Unit, POB 111, FIN-80101 Joensuu, Finland.
EM tkinnu@cs.joensuu.fi; hli@i2r.a-star.edu.sg
CR ADAMI A, 2003, P ICASSP, V4, P788
Adami AG, 2007, SPEECH COMMUN, V49, P277, DOI 10.1016/j.specom.2007.02.005
ALEXANDER A, 2004, FORENSIC SCI INT, V146, P95
Alku P, 1999, CLIN NEUROPHYSIOL, V110, P1329, DOI 10.1016/S1388-2457(99)00088-7
Altincay H, 2003, SPEECH COMMUN, V41, P531, DOI 10.1016/S0167-6393(03)00032-3
Ambikairajah E., 2007, P 6 INT IEEE C INF C, P1
ANDREWS W, 2002, P ICASSP, V1, P149
ANDREWS W, 2001, P EUROSPEECH, P2517
[Anonymous], 2002, FORENSIC SPEAKER IDE
ARCIENEGA M, 2001, P 7 EUR C SPEECH COM, P2821
ASHOUR G, 1999, P 6 EUR C SPEECH COM, P1187
ATAL BS, 1972, J ACOUST SOC AM, V52, P1687, DOI 10.1121/1.1913303
ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702
Atlas L, 2003, EURASIP J APPL SIG P, V2003, P668, DOI 10.1155/S1110865703305013
Auckenthaler R., 2001, P SPEAK OD SPEAK REC, P83
Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360
Bartkova K, 2002, P INT C SPOK LANG PR, P1197
Benyassine A, 1997, IEEE COMMUN MAG, V35, P64, DOI 10.1109/35.620527
BENZEGHIBA M, 2003, P 8 EUR C SPEECH COM, P1361
BenZeghiba MF, 2006, SPEECH COMMUN, V48, P1200, DOI 10.1016/j.specom.2005.08.008
Besacier L, 2000, SIGNAL PROCESS, V80, P1245, DOI 10.1016/S0165-1684(00)00033-5
Besacier L, 2000, SPEECH COMMUN, V31, P89, DOI 10.1016/S0167-6393(99)00070-9
BIMBOT F, 1995, SPEECH COMMUN, V17, P177, DOI 10.1016/0167-6393(95)00013-E
Bimbot Frederic, 2004, EURASIP J APPL SIG P, V4, P430, DOI [DOI 10.1155/S1110865704310024, 10.1155/S1110865704310024]
Bishop C. M., 2006, PATTERN RECOGNITION
BOCKLET T, 2009, P INT C AC SPEECH SI, P4525
Boersma P., 2009, PRAAT DOING PHONETIC
Bonastre J.-F., 2007, P INT 2007 ICSLP ANT, P2053
Brummer N, 2007, IEEE T AUDIO SPEECH, V15, P2072, DOI 10.1109/TASL.2007.902870
Brummer N, 2006, COMPUT SPEECH LANG, V20, P230, DOI 10.1016/j.csl.2005.08.001
BURGET L, 2009, ROBUST SPEAKER RECOG
Burget L, 2007, IEEE T AUDIO SPEECH, V15, P1979, DOI 10.1109/TASL.2007.902499
BURTON DK, 1987, IEEE T ACOUST SPEECH, V35, P133, DOI 10.1109/TASSP.1987.1165110
CAMPBELL J, 2004, ADV NEURAL INFORM PR, V16
CAMPBELL J, 2005, P INT C AC SPEECH SI, P637
CAMPBELL J, 2006, IEEE SIGNAL PROCESS, V13, P308
Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714
Campbell WM, 2002, IEEE T SPEECH AUDI P, V10, P205, DOI 10.1109/TSA.2002.1011533
Campbell WM, 2006, COMPUT SPEECH LANG, V20, P210, DOI 10.1016/j.csl.2005.06.003
Carey M. J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607979
Castaldo F, 2007, IEEE T AUDIO SPEECH, V15, P1969, DOI 10.1109/TASL.2007.901823
Chan WN, 2007, IEEE T AUDIO SPEECH, V15, P1884, DOI 10.1109/TASL.2007.900103
CHARBUILLET C, 2006, P C INT IEEE C AC SP, V1, P673
Chaudhari UV, 2003, IEEE T SPEECH AUDI P, V11, P61, DOI 10.1109/TSA.2003.809121
Chen K, 1997, INT J PATTERN RECOGN, V11, P417, DOI 10.1142/S0218001497000196
CHEN ZH, 2004, P INT C SPOK LANG PR, P1421
Chetouani M, 2009, PATTERN RECOGN, V42, P487, DOI 10.1016/j.patcog.2008.08.008
CHEVEIGNE A, 2001, P EUROSP, P2451
Damper RI, 2003, PATTERN RECOGN LETT, V24, P2167, DOI 10.1016/S0167-8655(03)00082-5
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024
Dehak N, 2006, P IEEE OD SPEAK LANG
Dehak N, 2007, IEEE T AUDIO SPEECH, V15, P2095, DOI 10.1109/TASL.2007.902758
Dehak N., 2009, P INT C AC SPEECH SI, P4237
DEHAK N, 2008, SPEAK LANG REC WORKS
Deller J., 2000, DISCRETE TIME PROCES
Doddington G., 2001, P EUR, P2521
DUNN RB, 2001, 35 AS C SIGN SYST CO, V2, P1562
ESPYWILSON CY, 2006, P ICSLP 2006, P1475
Ezzaidi H., 2001, P EUROSPEECH, P2825
Faltlhauser R., 2001, P 7 EUR C SPEECH COM, P751
Farrell KR, 1994, IEEE T SPEECH AUDI P, V2, P194, DOI 10.1109/89.260362
FARRELL K, 1998, P 1998 IEEE INT C AC, V2, P1129, DOI 10.1109/ICASSP.1998.675468
FAUVE B, 2008, SPEAK LANG REC WORKS
Fauve BGB, 2007, IEEE T AUDIO SPEECH, V15, P1960, DOI 10.1109/TASL.2007.902877
Ferrer L., 2007, P ICASSP HON APR, V4, P233
Ferrer L., 2008, P INT C AC SPEECH SI, P4853
FERRER L, 2008, SPEAK LANG REC WORKS
Fredouille C, 2000, DIGIT SIGNAL PROCESS, V10, P172, DOI 10.1006/dspr.1999.0367
FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530
Furui S, 1997, PATTERN RECOGN LETT, V18, P859, DOI 10.1016/S0167-8655(97)00073-1
GARCIAROMERO D, 2004, P SPEAK OD SPEAK REC, V4, P105
Gersho A., 1991, VECTOR QUANTIZATION
GLEMBEK O, 2009, P INT C AC SPEECH SI, P4057
GONG WG, 2008, P IM SIGN PROC CISP, V5, P295
GONZALEZRODRIGU.J, 2003, P EUROSPEECH, P693
Gopalan K, 1999, IEEE T SPEECH AUDI P, V7, P289, DOI 10.1109/89.759036
GUDNASON J, 2008, P IEEE INT C AC SPEE, P4821
Gupta S. K., 1992, Digital Signal Processing, V2, DOI 10.1016/1051-2004(92)90027-V
HANNANI A, 2004, P SPEAK OD SPEAK REC, P111
HANSEN EG, 2004, P OD 04 SPEAK LANG R, P179
Harrington J., 1999, TECHNIQUES SPEECH AC
HARRIS FJ, 1978, P IEEE, V66, P51, DOI 10.1109/PROC.1978.10837
Hatch AO, 2005, 2005 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), P75
HATCH AO, 2006, P ICASSP, P585
HATCH AO, 2006, P INT, P1471
Hautamaki V, 2008, PATTERN RECOGN LETT, V29, P1427, DOI 10.1016/j.patrec.2008.02.021
Hautamaki V., 2007, P 12 INT C SPEECH CO, P645
Hautamaki V, 2008, IEEE SIGNAL PROC LET, V15, P162, DOI 10.1109/LSP.2007.914792
He JL, 1999, IEEE T SPEECH AUDI P, V7, P353
Hebert M., 2003, P 8 EUR C SPEECH COM, P1665
Hebert M., 2008, SPRINGER HDB SPEECH, P743, DOI 10.1007/978-3-540-49127-9_37
HECK L, 2002, P INT C SPOK LANG PR, P1369
HECK LP, 1997, P ICASSP, P1071
Heck LP, 2000, SPEECH COMMUN, V31, P181, DOI 10.1016/S0167-6393(99)00077-1
HEDGE RM, 2004, P IEEE INT C AC SPEE, V1, P517
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
Hess W., 1983, PITCH DETERMINATION
Higgins A., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90098-6
Huang X., 2001, SPOKEN LANGUAGE PROC
Imperl B, 1997, SPEECH COMMUN, V22, P385, DOI 10.1016/S0167-6393(97)00053-8
Jain AK, 2000, IEEE T PATTERN ANAL, V22, P4, DOI 10.1109/34.824819
Jang GJ, 2002, NEUROCOMPUTING, V49, P329, DOI 10.1016/S0925-2312(02)00527-1
JIN Q, 2002, P ICASSP ORL MAY, V1, P145
KAJAREKAR S, 2001, P SPEAK OD SPEAK REC, P201
KARAM ZN, 2007, P INT ANTW AUG, P290
KARPOV E, 2004, P 9 INT C SPEECH COM, P366
Kenny P, 2007, IEEE T AUDIO SPEECH, V15, P1448, DOI 10.1109/TASL.2007.894527
Kenny P, 2008, IEEE T AUDIO SPEECH, V16, P980, DOI 10.1109/TASL.2008.925147
Kenny P, 2006, CRIM060814
KINNUNEN T, 2009, P INT C AC SPEECH SI, P4545
Kinnunen T., 2002, P ICSLP 02, P2325
KINNUNEN T, 2007, P INT C BIOM ICB 200, P58
KINNUNEN T, 2006, P IEEE INT C AC SPEE, V1, P665
Kinnunen T, 2009, PATTERN RECOGN LETT, V30, P341, DOI 10.1016/j.patrec.2008.11.007
Kinnunen T., 2000, Proceedings of the IASTED International Conference. Signal Processing and Communications
KINNUNEN T, 2006, P 5 INT S CHIN SPOK, P547
Kinnunen T., 2005, P 10 INT C SPEECH CO, P567
Kinnunen T., 2004, P 9 INT C SPEECH COM, P361
KINNUNEN T, 2004, THESIS U JOENSUU JOE
KINNUNEN T, 2006, 5 INT S CHIN SPOK LA, P559
Kinnunen T., 2008, SPEAK LANG REC WORKS
Kinnunen T, 2006, IEEE T AUDIO SPEECH, V14, P277, DOI 10.1109/TSA.2005.853206
KITAMURA T, 2008, P INT, P813
Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881
Kolano G., 1999, P EUR C SPEECH COMM, P1203
KRYSZCZUK K, 2007, EURASIP J ADV SIG PR, V1, P86572
Lapidot I, 2002, IEEE T NEURAL NETWOR, V13, P877, DOI 10.1109/TNN.2002.1021888
LASKOWSKI K, 2009, P INT C AC SPEECH SI, P4541
Lee Hae-Lim, 2007, Plant Biology (Rockville), V2007, P294
LEE K, 2008, P 9 INT INT 2008 BRI, P1397
LEEUWEN D, 2006, COMPUT SPEECH LANG, V20, P128
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Lei H., 2007, P INT 2007 ICSLP ANT, P746
Leung KY, 2006, SPEECH COMMUN, V48, P71, DOI 10.1016/j.specom.2005.05.013
Li H., 2009, P INT C AC SPEECH SI, P4201
Li K. P., 1998, P IEEE INT C AC SPEE, V1, P595
LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577
LONGWORTH C, 2007, IEEE T AUDIO SPEECH, V6, P1
Louradour J., 2005, P 13 EUR C SIGN PROC
LOURADOUR J, 2005, P ICASSP PHIL US MAR, P613
LU X, 2007, SPEECH COMMUN, V50, P312
MA B, 2006, P INT C AC SPEECH SI, V1, P1029
Ma B., 2006, P INT 2006 ICSLP PIT, P505
Ma B, 2007, IEEE T AUDIO SPEECH, V15, P2053, DOI 10.1109/TASL.2007.902861
Magrin-Chagnolleau I, 2002, IEEE T SPEECH AUDI P, V10, P371, DOI 10.1109/TSA.2002.800557
Mak M.W., 2006, P INT C AC SPEECH SI, V1, P929
Mak M.W., 2004, EURASIP J APPL SIG P, V4, P452
MAK MW, 2003, P INT C AC SPEECH SI, V2, P745
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
Malayath N, 2000, DIGIT SIGNAL PROCESS, V10, P55, DOI 10.1006/dspr.1999.0363
Mami Y, 2006, SPEECH COMMUN, V48, P127, DOI 10.1016/j.specom.2005.06.014
Mammone RJ, 1996, IEEE SIGNAL PROC MAG, V13, P58, DOI 10.1109/79.536825
Mariethoz J., 2002, P ICSLP, P581
MARKEL JD, 1977, IEEE T ACOUST SPEECH, V25, P330, DOI 10.1109/TASSP.1977.1162961
Martin A. F., 1997, P EUROSPEECH, P1895
MARY L, 2006, P INT 2006 ICSLP PIT, P917
Mary L, 2008, SPEECH COMMUN, V50, P782, DOI 10.1016/j.specom.2008.04.010
MASON M, 2005, P EUR LISB PORT SEP, P3109
McLaughlin J., 1999, P EUR, P1215
Misra H, 2003, SPEECH COMMUN, V39, P301, DOI 10.1016/S0167-6393(02)00046-8
Miyajima C, 2001, SPEECH COMMUN, V35, P203, DOI 10.1016/S0167-6393(00)00079-0
MOONASAR V, 2001, P INT JOINT C NEUR N, P2936
MULLER C, 2007, LECT NOTES COMPUTER, V4441
Muller Christian, 2007, LECT NOTES COMPUTER, V4343
Muller KR, 2001, IEEE T NEURAL NETWOR, V12, P181, DOI 10.1109/72.914517
Murty KR, 2006, IEEE SIGNAL PROC LET, V13, P52, DOI 10.1109/LSP.2005.860538
Naik J. M., 1989, P IEEE INT C AC SPEE, P524
NAKASONE H, 2004, P SPEAK OD SPEAK REC, P251
Ney H, 1997, TEXT SPEECH LANG TEC, V2, P174
NIEMILAITINEN T, 2005, P 2 BALT C HUM LANG, P317
*NIST, 2008, SRE RES PAG
Nolan F, 1983, PHONETIC BASES SPEAK
Oppenheim A. V., 1999, DISCRETE TIME SIGNAL
ORMAN D, 2001, P SPEAK OD SPEAK REC, P219
Paliwal K. K., 2003, P EUR 2003, P2117
Park A., 2002, P INT C SPOK LANG PR, P1337
Pelecanos J., 2001, P SPEAK OD SPEAK REC, P213
PELECANOS J, 2000, P INT C PATT REC ICP, P3298
PELLOM BL, 1999, P ICASSP 99, P837
Pellom BL, 1998, IEEE SIGNAL PROC LET, V5, P281, DOI 10.1109/97.728467
PFISTER B, 2003, P EUROSPEECH 2003, P701
Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109
POH N, 2004, P IEEE INT C AC SPEE, V5, P893
Prasanna SRM, 2006, SPEECH COMMUN, V48, P1243, DOI 10.1016/j.specom.2006.06.002
Przybocki MA, 2007, IEEE T AUDIO SPEECH, V15, P1951, DOI 10.1109/TASL.2007.902489
Rabiner L, 1993, FUNDAMENTALS SPEECH
Ramachandran RP, 2002, PATTERN RECOGN, V35, P2801, DOI 10.1016/S0031-3203(01)00235-7
Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002
Ramos-Castro D, 2007, PATTERN RECOGN LETT, V28, P90, DOI 10.1016/j.patrec.2006.06.008
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Reynolds D.A., 2005, P ICASSP, V1, P177, DOI 10.1109/ICASSP.2005.1415079
REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379
REYNOLDS DA, 2003, P ICASSP, P784
Reynolds D.A., 2003, P IEEE ICASSP, P53
Roch M, 2006, SPEECH COMMUN, V48, P85, DOI 10.1016/j.specom.2005.06.003
Rodriguez-Linares Leandro, 2003, Pattern Recognition, V36, P347
SAASTAMOINEN J, 2005, EURASIP J APPL SIG P, V17, P2816
Saeidi R, 2009, IEEE T AUDIO SPEECH, V17, P344, DOI 10.1109/TASL.2008.2010278
Shriberg E, 2005, SPEECH COMMUN, V46, P455, DOI 10.1016/j.specom.2005.02.018
Sivakumaran P., 2003, P EUROSPEECH INTERSP, P2669
Sivakumaran P, 2003, SPEECH COMMUN, V41, P485, DOI 10.1016/S0167-6393(03)00017-7
SLOMKA S, 1998, P INT C SPOK LANG PR, P225
SLYH RE, 2004, P ISCA TUT RES WORKS, P315
Solewicz YA, 2007, IEEE T AUDIO SPEECH, V15, P2063, DOI 10.1109/TASL.2007.903054
SOLOMONOFF A, 2005, P ICASSP, P629
SONMEZ M, 1998, P INT C SPOK LANG PR, P3189
SONMEZ MK, 1997, P EUR, P1391
SOONG FK, 1987, AT&T TECH J, V66, P14
SOONG FK, 1988, IEEE T ACOUST SPEECH, V36, P871, DOI 10.1109/29.1598
Stolcke A, 2007, IEEE T AUDIO SPEECH, V15, P1987, DOI 10.1109/TASL.2007.902859
STOLCKE A, 2008, P ICASSP, P1577
Sturim D., 2001, P ICASSP, V1, P429
STURIM D, 2005, P ICASSP, P741
Teunen R., 2000, P INT C SPOK LANG PR, V2, P495
THEVENAZ P, 1995, SPEECH COMMUN, V17, P145, DOI 10.1016/0167-6393(95)00010-L
THIAN N, 2004, P 1 INT C BIOM AUTH, P631
THIRUVARAN T, 2008, P INT 2008 INC SST 2, P1497
Thiruvaran T., 2008, ELECT LETT, V44
TONG R, 2006, 5 INT S CHIN SPOK LA, P494
TORRESCARRASQUI.PA, 2002, P ICASSP, V1, P757
Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256
TYDLITAT B, 2007, P INT C AC SPEECH SI, V4, P293
Viikki O, 1998, SPEECH COMMUN, V25, P133, DOI 10.1016/S0167-6393(98)00033-8
Vogt R., 2008, SPEAK LANG REC WORKS
Vogt R., 2005, P INT, P3117
Vogt R, 2008, COMPUT SPEECH LANG, V22, P17, DOI 10.1016/j.csl.2007.05.003
Wan V, 2005, IEEE T SPEECH AUDI P, V13, P203, DOI 10.1109/TSA.2004.841042
WILDERMOTH B, 2008, P 8 AUSTR INT C SPEE, P324
WOLF JJ, 1972, J ACOUST SOC AM, V51, P2044, DOI 10.1121/1.1913065
Xiang B, 2003, IEEE SIGNAL PROC LET, V10, P141, DOI 10.1109/LSP.2003.810913
Xiang B, 2003, IEEE T SPEECH AUDI P, V11, P447, DOI 10.1109/TSA.2003.815822
XIANG B, 2002, P ICASSP, V1, P681
Xiong ZY, 2006, SPEECH COMMUN, V48, P1273, DOI 10.1016/j.specom.2006.06.011
Yegnanarayana B, 2002, NEURAL NETWORKS, V15, P459, DOI 10.1016/S0893-6080(02)00019-9
You CH, 2009, IEEE SIGNAL PROC LET, V16, P49, DOI 10.1109/LSP.2008.2006711
Yuo KH, 1999, SPEECH COMMUN, V28, P227, DOI 10.1016/S0167-6393(99)00017-5
Zheng NH, 2007, IEEE SIGNAL PROC LET, V14, P181, DOI 10.1109/LSP.2006.884031
ZHU D, 2008, P INT 2008 BRISB AUS
ZHU D, 2007, P INT C AC SPEECH SI, V4, P61
Zhu D., 2009, P ICASSP, P4045
Zilca RD, 2002, IEEE T SPEECH AUDI P, V10, P363, DOI 10.1109/TSA.2002.803419
Zilca RD, 2006, IEEE T AUDIO SPEECH, V14, P467, DOI [10.1109/TSA.2005.857809, 10.1109/FSA.2005.857809]
Zissman MA, 1996, IEEE T SPEECH AUDI P, V4, P31, DOI 10.1109/TSA.1996.481450
NR 247
TC 185
Z9 198
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2010
VL 52
IS 1
BP 12
EP 40
DI 10.1016/j.specom.2009.08.009
PG 29
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 522QX
UT WOS:000272014500002
ER
PT J
AU Ishizuka, K
Nakatani, T
Fujimoto, M
Miyazaki, N
AF Ishizuka, Kentaro
Nakatani, Tomohiro
Fujimoto, Masakiyo
Miyazaki, Noboru
TI Noise robust voice activity detection based on periodic to aperiodic
component ratio
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice activity detection; Robustness; Periodicity; Aperiodicity; Noise
robust front-end processing for automatic speech recognition
ID FUNDAMENTAL-FREQUENCY ESTIMATION; SUBBAND-BASED PERIODICITY;
HIGHER-ORDER STATISTICS; SPEECH RECOGNITION; SPECTRUM ESTIMATION; PITCH
DETECTION; ALGORITHM; DECOMPOSITION; SIGNALS; MODEL
AB This paper proposes a noise robust voice activity detection (VAD) technique called PARADE (PAR based Activity DEtection) that employs the periodic component to aperiodic component ratio (PAR). Conventional noise robust features for VAD are still sensitive to non-stationary noise, which yields variations in the signal-to-noise ratio, and sometimes requires a priori noise power estimations, although the characteristics of environmental noise change dynamically in the real world. To overcome this problem, we adopt the PAR, which is insensitive to both stationary and non-stationary noise, as an acoustic feature for VAD. By considering both periodic and aperiodic components simultaneously in the PAR, we can mitigate the effect of the non-stationarity of noise. PARADE first estimates the fundamental frequencies of the dominant periodic components of the observed signals, decomposes the power of the observed signals into the powers of its periodic and aperiodic components by taking account of the power of the aperiodic components at the frequencies where the periodic components exist, and calculates the PAR based on the decomposed powers. Then it detects the presence of target speech signals by estimating the voice activity likelihood defined in relation to the PAR. Comparisons of the VAD performance for noisy speech data confirmed that PARADE outperforms the conventional VAD algorithms even in the presence of non-stationary noise. In addition, PARADE is applied to a front-end processing technique for automatic speech recognition (ASR) that employs a robust feature extraction method called SPADE (Subband based Periodicity and Aperiodicity DEcomposition) as an application of PARADE. Comparisons of the ASR performance for noisy speech show that the SPADE front-end combined with PARADE achieves significantly higher word accuracies than those achieved by MFCC (Mel-frequency Cepstral Coefficient) based feature extraction, which is widely used for conventional ASR systems, the SPADE front-end without PARADE, and other standard noise robust front-end processing techniques (ETSI ES 202 050 and ETSI ES 202 212). This result confirmed that PARADE can improve the performance of front-end processing for ASR. (C) 2009 Elsevier B.V. All rights reserved.
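As a rough illustration of using a periodic-to-aperiodic power ratio for voice activity decisions, the sketch below takes the normalized autocorrelation peak of each frame as a proxy for the fraction of periodic power. PARADE itself estimates the fundamental frequencies of the dominant periodic components and decomposes the power accordingly, so this is only a simplified stand-in; the thresholds and parameter values are invented.

```python
import numpy as np

def frame_par(frame, fs, f0_min=60.0, f0_max=400.0):
    """Crude periodic-to-aperiodic power ratio for one frame, using the peak
    of the normalized autocorrelation in the plausible pitch-lag range as a
    proxy for the periodic power fraction."""
    frame = frame - frame.mean()
    total = np.sum(frame ** 2) + 1e-12
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    ac = ac / (ac[0] + 1e-12)
    lo, hi = int(fs / f0_max), int(fs / f0_min)
    r = max(0.0, float(ac[lo:hi].max()))
    periodic, aperiodic = r * total, (1.0 - r) * total
    return periodic / (aperiodic + 1e-12)

def simple_vad(signal, fs, frame_len=0.025, hop=0.010, threshold=1.0):
    """Mark a frame as speech when its periodic-to-aperiodic ratio exceeds
    an (arbitrary) threshold."""
    n, h = int(frame_len * fs), int(hop * fs)
    return np.array([frame_par(signal[i:i + n], fs) > threshold
                     for i in range(0, len(signal) - n, h)])
```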
C1 [Ishizuka, Kentaro; Nakatani, Tomohiro; Fujimoto, Masakiyo] NTT Corp, NTT Commun Sci Labs, Kyoto 6190237, Japan.
[Miyazaki, Noboru] NTT Corp, NTT Cyber Space Labs, Yokosuka, Kanagawa 2390847, Japan.
RP Ishizuka, K (reprint author), NTT Corp, NTT Commun Sci Labs, Hikaridai 2-4, Kyoto 6190237, Japan.
EM ishizuka@cslab.kecl.ntt.co.jp; nak@cslab.kecl.ntt.co.jp;
masakiyo@cslab.kecl.ntt.co.jp; miyazaki.noboru@lab.ntt.co.jp
CR Adami A, 2002, P ICSLP, P21
Agarwal A., 1999, P ASRU, P67
Ahmadi S, 1999, IEEE T SPEECH AUDI P, V7, P333, DOI 10.1109/89.759042
[Anonymous], 2002, 202050 ETSI ES
[Anonymous], 2003, 202212 ETSI ES
[Anonymous], 1999, 301708 ETSI EN
ATAL BS, 1976, IEEE T ACOUST SPEECH, V24, P201, DOI 10.1109/TASSP.1976.1162800
Basu S., 2003, P ICASSP, V1, pI
Benitez C, 2001, P EUR 2001, P429
Boersma P., 1993, P I PHONETIC SCI, V17, P97
Chang JH, 2006, IEEE T SIGNAL PROCES, V54, P1965, DOI 10.1109/TSP.2006.874403
Chen C.-P., 2002, P ICSLP, P241
Cho YD, 2001, IEEE SIGNAL PROC LET, V8, P276
Cournapeau D., 2007, P INT, P2945
Davis A, 2006, IEEE T AUDIO SPEECH, V14, P412, DOI 10.1109/TSA.2005.855842
DELATORRE A, 2006, P INT, P1954
Deshmukh O, 2005, IEEE T SPEECH AUDI P, V13, P776, DOI 10.1109/TSA.2005.851910
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
*ETSI, 2000, 101707 ETSI TS
Evangelopoulos G, 2006, IEEE T AUDIO SPEECH, V14, P2024, DOI 10.1109/TASL.2006.872625
Fisher E, 2006, IEEE T AUDIO SPEECH, V14, P502, DOI 10.1109/TSA.2005.857806
Fujimoto M., 2008, P ICASSP 08 APR, P4441
Fujimoto M, 2008, IEICE T INF SYST, VE91D, P467, DOI [10.1093/ietisy/e91-d.3.467, 10.1093/ietisy/e9l-d.3.467]
Gorriz JM, 2006, IEEE SIGNAL PROC LET, V13, P636, DOI 10.1109/LSP.2006.876340
GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651
HAMADA M, 1990, P INT C SPEECH LANG, P893
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
Hess W., 1983, PITCH DETERMINATION
HILLENBRAND J, 1987, J SPEECH HEAR RES, V30, P448
Hirsch H. G., 2000, P ISCA ITRW ASR2000, P181
Ishizuka K, 2006, J ACOUST SOC AM, V120, P443, DOI 10.1121/1.2205131
Ishizuka K, 2006, SPEECH COMMUN, V48, P1447, DOI 10.1016/j.specom.2006.06.008
ITAKURA F, 1968, REP INT C AC
JACKSON PJB, 2003, P EUROSPEECH, P2321
Jackson PJB, 2001, IEEE T SPEECH AUDI P, V9, P713, DOI 10.1109/89.952489
Junqua JC, 1994, IEEE T SPEECH AUDI P, V2, P406, DOI 10.1109/89.294354
JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631
Karray L, 2003, SPEECH COMMUN, V40, P261, DOI 10.1016/S0167-6393(02)00066-3
KINGSBURY B, 2002, P ICASSP, V1, P53
Kitaoka N., 2007, P IEEE WORKSH AUT SP, P607
KRISTIANSSON T., 2005, P INTERSPEECH, P369
Krom G., 1993, J SPEECH HEAR RES, V36, P254
LAMEL LF, 1981, IEEE T ACOUST SPEECH, V29, P777, DOI 10.1109/TASSP.1981.1163642
LAROCHE J, 1993, P ICASSP, V1, P550
LEBOUQUINJEANNES R, 1995, SPEECH COMMUN, V16, P245, DOI 10.1016/0167-6393(94)00056-G
Lee A., 2004, P ICSLP, V1, P173
Li K, 2005, IEEE T SPEECH AUDI P, V13, P965, DOI 10.1109/TSA.2005.851955
Li Q., 2001, P 7 EUR C SPEECH COM, P619
Machiraju VR, 2002, J CARDIAC SURG, V17, P20
MAK B, 1992, P IEEE INT C AC SPEE, V1, P269
Marzinzik M, 2002, IEEE T SPEECH AUDI P, V10, P109, DOI 10.1109/89.985548
Mauuary L., 1998, P EUSPICO 98, V1, P359
MOUSSET E, 1996, P INT C SPOK LANG PR, V2, P1273, DOI 10.1109/ICSLP.1996.607842
NAKAMURA A, 1996, P ICSLP, V4, P2199, DOI 10.1109/ICSLP.1996.607241
NAKAMURA S, 2005, IEICE T INF SYST D, V88, P535
NAKAMURA S, 2003, P 8 IEEE WORKSH AUT, P619
Nakatani T, 2004, J ACOUST SOC AM, V116, P3690, DOI 10.1121/1.1787522
Nakatani T, 2008, SPEECH COMMUN, V50, P203, DOI 10.1016/j.specom.2007.09.003
Nemer E, 2001, IEEE T SPEECH AUDI P, V9, P217, DOI 10.1109/89.905996
NOE B, 2001, P EUR 2001 AALB DENM, P433
Pearce D., 2000, P ICSLP, V4, P29
RABINER LR, 1975, AT&T TECH J, V54, P297
RABINER LR, 1977, IEEE T ACOUST SPEECH, V25, P24, DOI 10.1109/TASSP.1977.1162905
Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002
Ramirez J, 2005, IEEE SIGNAL PROC LET, V12, P689, DOI 10.1109/LSP.2005.855551
RAMIREZ J, 2007, P ICASSP, V4, P801
RICHARD G, 1996, PROGR TEXT TO SPEECH, P41
SAVOJI MH, 1989, SPEECH COMMUN, V8, P45, DOI 10.1016/0167-6393(89)90067-8
SERRA X, 1990, COMPUT MUSIC J, V14, P12, DOI 10.2307/3680788
SHEN JL, 1998, P ICSLP
Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1
Solvang HK, 2008, SPEECH COMMUN, V50, P476, DOI 10.1016/j.specom.2008.02.003
Srinivasant K., 1993, P IEEE SPEECH COD WO, P85, DOI 10.1109/SCFT.1993.762351
Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660
Tahmasbi R, 2007, IEEE T AUDIO SPEECH, V15, P1129, DOI 10.1109/TASL.2007.894521
TUCKER R, 1992, IEE PROC-I, V139, P377
Wilpon J. G., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90015-5
Wu BF, 2005, IEEE T SPEECH AUDI P, V13, P762, DOI 10.1109/TSA.2005.851909
YANTORNO RE, 2001, P IEEE INT WORKSH IN
Yegnanarayana B, 1998, IEEE T SPEECH AUDI P, V6, P1, DOI 10.1109/89.650304
NR 80
TC 11
Z9 13
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2010
VL 52
IS 1
BP 41
EP 60
DI 10.1016/j.specom.2009.08.003
PG 20
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 522QX
UT WOS:000272014500003
ER
PT J
AU Misu, T
Kawahara, T
AF Misu, Teruhisa
Kawahara, Tatsuya
TI Bayes risk-based dialogue management for document retrieval system with
speech interface
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spoken dialogue system; Dialogue management; Document retrieval; Bayes
risk
AB We propose an efficient technique of dialogue management for an information navigation system based on a document knowledge base. The system can use ASR N-best hypotheses and contextual information to perform robustly for fragmental speech input and erroneous output of automatic speech recognition (ASR). It also has several choices in generating responses or confirmations. We formulate the optimization of these choices based on a Bayes risk criterion, which is defined based on a reward for correct information presentation and a penalty for redundant turns. The parameters for the dialogue management we propose can be adaptively tuned by online learning. We evaluated this strategy with our spoken dialogue system called "Dialogue Navigator for Kyoto City", which generates responses based on the document retrieval and also has question-answering capability. The effectiveness of the proposed framework was demonstrated by the increased success rate of dialogue and the reduced number of turns for information access through an experiment with a large number of utterances by real users. (C) 2009 Elsevier B.V. All rights reserved.
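The Bayes risk idea, weighing a reward for presenting the correct information against a penalty for extra turns and then picking the cheapest action, can be made concrete with a toy example. The action set, the cost formulas, and all numbers below are invented for illustration and are not the paper's formulation.

```python
def choose_action(nbest, reward=10.0, turn_penalty=2.0):
    """Toy Bayes-risk action selection over an N-best list of
    (interpretation, confidence) pairs.

    Risk(present h)  = -reward * P(h) + reward * (1 - P(h))
    Risk(confirm h)  = turn_penalty - reward * P(h)   # one extra turn, then succeed if h is correct
    Risk(rephrase)   = 2 * turn_penalty               # ask the user to rephrase
    """
    best_hyp, p = max(nbest, key=lambda x: x[1])
    risks = {
        ('present', best_hyp): -reward * p + reward * (1.0 - p),
        ('confirm', best_hyp): turn_penalty - reward * p,
        ('rephrase', None):    2.0 * turn_penalty,
    }
    return min(risks, key=risks.get)

# high confidence -> present directly; low confidence -> confirm first
print(choose_action([('kyoto station access', 0.85), ('kyoto tower access', 0.10)]))
print(choose_action([('kyoto station access', 0.40), ('kyoto tower access', 0.35)]))
```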
C1 [Misu, Teruhisa; Kawahara, Tatsuya] Kyoto Univ, Sch Informat, Sakyo Ku, Kyoto 6068501, Japan.
RP Misu, T (reprint author), Natl Inst Informat & Commun Technol, Kyoto, Japan.
EM misu@ar.media.kyoto-u.ac.jp
CR AKIBA T, 2005, P INT
Boni M. D., 2005, NAT LANG ENG, V11, P343
BRONDSTED T, 2006, P WORKSH SPEECH MOB
Chen B., 2005, P EUR C SPEECH COMM, P109
DOHSAKA K, 2003, P EUR
HORVITZ E, 2006, USER MODEL USER-ADAP, V17, P159
Komatani K, 2005, USER MODEL USER-ADAP, V15, P169, DOI 10.1007/s11257-004-5659-0
KUDO T, 2003, P 42 ANN M ACL
LAMEL L, 2002, SPEECH COMM, V38
LAMEL L, 1999, P ICASSP
Lee A., 2004, P ICASSP QUEB, P793
Levin E, 2000, IEEE T SPEECH AUDI P, V8, P11, DOI 10.1109/89.817450
Levin E., 2006, P SPOK LANG TECHN WO, P198
LITMAN DJ, 2000, P 17 C COMP LING, P502
MATSUDA M, 2006, P 5 NTCIR WORKSH M E, P414
MISU T, 2007, P ICASSP
MISU T, 2006, P INTERSPEECH, P9
Misu T, 2006, SPEECH COMMUN, V48, P1137, DOI 10.1016/j.specom.2006.04.001
MURATA M, 2006, P AS INF RETR S, P601
NIIMI Y, 1996, P ICSLP
NISHIMURA R, 2005, P INT
*NIST, 2003, NIST SPEC PUBL, P500
Pan Y. C., 2007, P AUT SPEECH REC UND, P544
POTAMIANOS A, 2000, P ICSLP
Raux A., 2005, P INT
RAVICHANDRAN D, 2002, P 40 ANN M ACL
RAYMOND C, 2003, P AUT SPEECH REC UND
Reithinger N., 2005, P INT
Roy N., 2000, P 38 ANN M ASS COMP, P93, DOI DOI 10.3115/1075218.1075231
RUDNICKY A, 2000, P ICSLP, V2
SENEFF S, 2000, P ANLP NAACL 2000 SA
Singh S, 2002, J ARTIF INTELL RES, V16, P105
Sturm J., 1999, P ESCA WORKSH INT DI
YOUNG S, 2007, P ICASSP
Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460
NR 35
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2010
VL 52
IS 1
BP 61
EP 71
DI 10.1016/j.specom.2009.08.007
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 522QX
UT WOS:000272014500004
ER
PT J
AU Srinivasan, S
Wang, DL
AF Srinivasan, Soundararajan
Wang, DeLiang
TI Robust speech recognition by integrating speech separation and
hypothesis testing
SO SPEECH COMMUNICATION
LA English
DT Article
DE Robust speech recognition; Missing-data recognizer; Ideal binary mask;
Speech segregation; Top-down processing
ID NOVELTY DETECTION; SOUNDS; NOISE
AB Missing-data methods attempt to improve robust speech recognition by distinguishing between reliable and unreliable data in the time-frequency (T-F) domain. Such methods require a binary mask to label speech-dominant T-F regions of a noisy speech signal as reliable and the rest as unreliable. Current methods for computing the mask are based mainly on bottom-up cues such as harmonicity and produce labeling errors that degrade recognition performance. In this paper, we propose a two-stage recognition system that combines bottom-up and top-down cues in order to simultaneously improve both mask estimation and recognition accuracy. First, an n-best lattice consistent with a speech separation mask is generated. The lattice is then re-scored by expanding the mask using a model-based hypothesis test to determine the reliability of individual T-F units. Systematic evaluations of the proposed system show significant improvement in recognition performance compared to that using speech separation alone. (C) 2009 Elsevier B.V. All rights reserved.
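The binary mask underlying missing-data recognition is commonly defined from the local SNR of each time-frequency unit. The sketch below computes such a mask under the assumption that speech and noise power are available separately (the "ideal" case); estimating the mask from the noisy mixture is exactly the problem the paper's two-stage hypothesis test addresses, so this is background illustration rather than the proposed method.

```python
import numpy as np

def binary_mask(speech_power, noise_power, snr_threshold_db=0.0):
    """Label each time-frequency unit reliable (1) when its local SNR exceeds
    a threshold, unreliable (0) otherwise.

    speech_power, noise_power: arrays of shape (num_channels, num_frames).
    """
    local_snr_db = 10.0 * np.log10((speech_power + 1e-12) / (noise_power + 1e-12))
    return (local_snr_db > snr_threshold_db).astype(np.uint8)

# toy usage on a 4-channel, 5-frame power grid
mask = binary_mask(np.random.rand(4, 5), 0.3 * np.random.rand(4, 5))
print(mask)
```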
C1 [Srinivasan, Soundararajan] Ohio State Univ, Dept Biomed Engn, Columbus, OH 43210 USA.
[Wang, DeLiang] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA.
[Wang, DeLiang] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA.
RP Srinivasan, S (reprint author), Robert Bosch LLC, Res & Technol Ctr N Amer, Pittsburgh, PA 15212 USA.
EM srinivasan.36@osu.edu; dwang@cse.ohio-state.edu
FU AFOSR [FA9550-08-1-0155]; NSF [IIS-0534707]
FX This research was supported in part by an AFOSR grant
(FA9550-08-1-0155), an NSF grant (IIS-0534707). We thank J. Barker for
help with the speech fragment decoder and E. Fosler-Lussier for helpful
discussions. A preliminary version of this work was presented in 2005
ICASSP (Srinivasan and Wang, 2005a).
CR Barker JP, 2005, SPEECH COMMUN, V45, P5, DOI 10.1016/j.specom.2004.05.002
BISHOP CM, 1994, IEE P-VIS IMAGE SIGN, V141, P217, DOI 10.1049/ip-vis:19941330
Boersma P, 2002, PRAAT DOING PHONETIC
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Bregman AS., 1990, AUDITORY SCENE ANAL
BROWN GJ, 2001, P IJCNN 01, P2907
Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0
Darwin CJ, 2008, PHILOS T R SOC B, V363, P1011, DOI 10.1098/rstb.2007.2156
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
DROPPO J, 2002, P INT C SPOK LANG PR, P1569
Drygajlo A., 1998, P ICASSP 98, V1, P121, DOI 10.1109/ICASSP.1998.674382
EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947
Gales Mark, 2007, Foundations and Trends in Signal Processing, V1, DOI 10.1561/2000000004
GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J
Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812
Huang X., 2001, SPOKEN LANGUAGE PROC
Leonard R. G., 1984, P ICASSP 84, P111
Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst
Markou M, 2003, SIGNAL PROCESS, V83, P2481, DOI [10.1016/j.sigpro.2003.07.018, 10.1016/j.sigpro.2003.018]
McCowan IA, 2003, IEEE T SPEECH AUDI P, V11, P709, DOI 10.1109/TSA.2003.818212
McLachlan G.J., 1988, MIXTURE MODELS INFER
Patterson R.D., 1988, 2341 APU
Pearce D., 2000, P ICSLP, V4, P29
Renevey P., 2001, P CONS REL AC CUES S, P71
Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463
SELTZER ML, 2000, P INT C SPOK LANG PR, P538
Srinivasan S., 2005, P IEEE INT C AC SPEE, V1, P89, DOI 10.1109/ICASSP.2005.1415057
Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003
Srinivasan S, 2005, SPEECH COMMUN, V45, P63, DOI 10.1016/j.specom.2004.09.002
Srinivasan S., 2006, THESIS OHIO STATE U
Stark H., 2002, PROBABILITY RANDOM P, V3rd
Tax D, 1998, LECT NOTES COMPUTER, V1451, P593
Van hamme H, 2004, P IEEE ICASSP, V1, P213
Varga A.P., 1990, P ICASSP, P845
VARGA AP, 1992, NOISEX 92 STUDY EFFE
Wang D., 2006, COMPUTATIONAL AUDITO
Wang DLL, 1999, IEEE T NEURAL NETWOR, V10, P684, DOI 10.1109/72.761727
Williams G, 1999, COMPUT SPEECH LANG, V13, P395, DOI 10.1006/csla.1999.0129
Young S., 2000, HTK BOOK HTK VERSION
NR 39
TC 12
Z9 12
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JAN
PY 2010
VL 52
IS 1
BP 72
EP 81
DI 10.1016/j.specom.2009.08.008
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 522QX
UT WOS:000272014500005
ER
PT J
AU Jansen, A
Niyogi, P
AF Jansen, Aren
Niyogi, Partha
TI Point process models for event-based speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Event-based speech recognition; Speech processing; Point process models
ID SPIKE; FRAMEWORK; FEATURES; NEURONS
AB Several strands of research in the fields of linguistics, speech perception, and neuroethology suggest that modelling the temporal dynamics of an acoustic event landmark-based representation is a scientifically plausible approach to the automatic speech recognition (ASR) problem. Adopting a point process representation of the speech signal opens up ASR to a large class of statistical models that have seen wide application in the neuroscience community. In this paper, we formulate several point process models for application to speech recognition, designed to operate on sparse detector-based representations of the speech signal. We find that even with a noisy and extremely sparse phone-based point process representation, obstruent phones can be decoded at accuracy levels comparable to a basic hidden Markov model baseline and with improved robustness. We conclude by outlining various avenues for future development of our methodology. (C) 2009 Elsevier B.V. All rights reserved.
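A minimal version of the point-process view is an inhomogeneous Poisson model over detector firing times, scored by the standard log-likelihood (the sum of log rates at the events minus the integrated rate). The sketch below assumes a piecewise-constant rate per frame; the models formulated in the paper are considerably richer, so treat this only as orientation.

```python
import numpy as np

def poisson_log_likelihood(event_times, rate, frame_dur=0.01):
    """Log-likelihood of detector firing times under an inhomogeneous Poisson
    process with a piecewise-constant rate function (one value per frame).

    event_times: 1-D array of firing times in seconds
    rate: 1-D array of expected firings per second in each frame
    """
    rate = np.asarray(rate, dtype=float)
    integral = np.sum(rate) * frame_dur                  # integral of the rate over the interval
    idx = np.clip((np.asarray(event_times) / frame_dur).astype(int), 0, len(rate) - 1)
    return np.sum(np.log(rate[idx] + 1e-12)) - integral  # sum of log rates at the events minus the integral

# toy usage: compare a 'phone present' rate profile against a background rate
events = np.array([0.012, 0.031, 0.048])
print(poisson_log_likelihood(events, rate=np.full(10, 60.0)))
print(poisson_log_likelihood(events, rate=np.full(10, 5.0)))
```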
C1 [Jansen, Aren; Niyogi, Partha] Univ Chicago, Dept Comp Sci, Chicago, IL 60637 USA.
RP Jansen, A (reprint author), Univ Chicago, Dept Comp Sci, 1100 E 58th St, Chicago, IL 60637 USA.
EM aren@cs.uchicago.edu; niyogi@cs.uchicago.edu
CR Amarasingham A, 2006, J NEUROSCI, V26, P801, DOI 10.1523/JNEUROSCI.2948-05.2006
AMIT Y, 2005, J ACOUST SOC AM, V118
BOURLARD H, 1996, IDIAPRR9607
Brown EN, 2005, METHODS MODELS NEURO, P691
Chi ZY, 2007, J NEUROPHYSIOL, V97, P1221, DOI 10.1152/jn.00448.2006
DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839
Ellis D.P.W., 2005, PLP RASTA MFCC INVER
Esser KH, 1997, P NATL ACAD SCI USA, V94, P14019, DOI 10.1073/pnas.94.25.14019
FRANGOULIS E, 1989, P ICASSP, P9
FUZESSERY ZM, 1983, J COMP PHYSIOL, V150, P333
Geiger D, 1999, INT J COMPUT VISION, V33, P139, DOI 10.1023/A:1008146126392
Glass JR, 2003, COMPUT SPEECH LANG, V17, P137, DOI 10.1016/S0885-2308(03)00006-8
Greenberg S, 2003, J PHONETICS, V31, P465, DOI 10.1016/j.wocn.2003.09.005
Gutig R, 2006, NAT NEUROSCI, V9, P420, DOI 10.1038/nn1643
HASEGAWAJOHNSON M, 2002, ACCUMU J ARTS TECHNO
Jansen A, 2008, J ACOUST SOC AM, V124, P1739, DOI 10.1121/1.2956472
JUANG BH, 1985, P ICASSP
Lee Chin-Hui, 2007, P INT, P1825
LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546
Legenstein R, 2005, NEURAL COMPUT, V17, P2337, DOI 10.1162/0899766054796888
Levinson S. E., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80009-2
Li J., 2005, P INT LISB PORT SEP, P3365
Livescu K., 2004, P HLT NAACL
MA C, 2006, P INT
MAK B, 2000, P ICSLP, P149
MARGOLIASH D, 1992, J NEUROSCI, V12, P4309
Niyogi P, 2002, J ACOUST SOC AM, V111, P1063, DOI 10.1121/1.1427666
NIYOGI P, 1998, P ICSLP
Nock HJ, 2003, COMPUT SPEECH LANG, V17, P233, DOI 10.1016/S0885-2308(03)00009-3
OLHAUSEN BA, 2003, P ICIP, P41
OSTENDORF M, 1992, P DARPA WORKSH CONT
Ostendorf M., 1996, AUTOMATIC SPEECH SPE, P185
Parker Steve, 2002, THESIS U MASSACHUSET
POEPPEL D, 2007, PHILOS T ROYAL SOC B
Pruthi T, 2004, SPEECH COMMUN, V43, P225, DOI 10.1016/j.specom.2004.06.001
RAMESH P, 1992, P ICASSP
RUSSELL MJ, 1987, P ICASSP
Serre T, 2007, IEEE T PATTERN ANAL, V29, P411, DOI 10.1109/TPAMI.2007.56
Sha F., 2007, P ICASSP, P313
STEVENS K, 1992, P ICSLP
Stevens K. N., 1981, PERSPECTIVES STUDY S, P1
Stevens KN, 2002, J ACOUST SOC AM, V111, P1872, DOI 10.1121/1.1458026
Suga N, 2006, LISTENING SPEECH AUD, P159
Sun JP, 2002, J ACOUST SOC AM, V111, P1086, DOI 10.1121/1.1420380
Truccolo W, 2005, J NEUROPHYSIOL, V93, P1074, DOI 10.1152/jn.00697.2004
WILLETT R, 2007, P ICASSP, P1249
XIE Z, 2006, P ICSLP
Zhang Yongshun, 2003, Proceedings. 2003 IEEE International Conference on Robotics, Intelligent Systems and Signal Processing (IEEE Cat. No.03EX707)
NR 48
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2009
VL 51
IS 12
BP 1155
EP 1168
DI 10.1016/j.specom.2009.05.008
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560FZ
UT WOS:000274888800001
ER
PT J
AU Qian, Y
Soong, FK
AF Qian, Yao
Soong, Frank K.
TI A Multi-Space Distribution (MSD) and two-stream tone modeling approach
to Mandarin speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Tone model; Mandarin speech recognition; Multi-Space Distribution (MSD);
Noisy digit recognition; LVCSR
AB Tone plays an important role in recognizing spoken tonal languages like Chinese. However, the discontinuity of F0 between voiced and unvoiced transition has traditionally been a hurdle in creating a succinct statistical tone model for automatic speech recognition and synthesis. Various heuristic approaches have been proposed before to get around the problem but with limited success. The Multi-Space Distribution (MSD) proposed by Tokuda et al. which models the two probability spaces, discrete for unvoiced region and continuous for voiced F0 contour, in a linearly weighted mixture, has been successfully applied to Hidden Markov Model (HMM)-based text-to-speech synthesis. We extend MSD to Chinese Mandarin tone modeling for speech recognition. The tone features and spectral features are further separated into two streams and corresponding stream-dependent models are trained. Finally two separated decision trees are constructed by clustering corresponding stream-dependent HMMs. The MSD and two-stream modeling approach is evaluated on large vocabulary, continuously read and spontaneous speech Mandarin databases and its robustness is further investigated in a noisy, continuous Mandarin digit database with eight types of noises at five different SNRs. Experimental results show that our MSD and two-stream based tone modeling approach can significantly improve the recognition performance over a toneless baseline system. The relative tonal syllable error rate (TSER) reductions are 21.0%, 8.4% and 17.4% for large vocabulary read and spontaneous and noisy digit speech recognition tasks, respectively. Comparing with the conventional system where F0 contours are interpolated in unvoiced segments, our approach improves the recognition performance by 9.8%, 7.4% and 13.3% in relative TSER reductions in the corresponding speech recognition tasks, respectively. (C) 2009 Elsevier B.V. All rights reserved.
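The core of the MSD idea can be written down compactly: an unvoiced frame is scored by the weight of the discrete (zero-dimensional) space, and a voiced frame by the weight of the continuous space times a Gaussian density over F0. The single-Gaussian, single-state sketch below illustrates that output probability; the weights and parameters are invented and the paper's full two-stream HMM machinery is not shown.

```python
import numpy as np

def msd_observation_prob(f0, w_voiced, mean, var, w_unvoiced=None):
    """Output probability of one state of a Multi-Space Distribution F0
    stream: a point mass in the unvoiced (0-dimensional) space and a single
    Gaussian in the voiced (1-dimensional) space.

    f0: observed F0 in Hz, or None for an unvoiced frame.
    """
    if w_unvoiced is None:
        w_unvoiced = 1.0 - w_voiced
    if f0 is None:                        # unvoiced frame: discrete space
        return w_unvoiced
    density = np.exp(-0.5 * (f0 - mean) ** 2 / var) / np.sqrt(2.0 * np.pi * var)
    return w_voiced * density             # voiced frame: continuous space

print(msd_observation_prob(220.0, w_voiced=0.8, mean=210.0, var=400.0))
print(msd_observation_prob(None,  w_voiced=0.8, mean=210.0, var=400.0))
```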
C1 [Qian, Yao; Soong, Frank K.] Microsoft Res Asia, Beijing 100190, Peoples R China.
RP Qian, Y (reprint author), Microsoft Res Asia, Beijing 100190, Peoples R China.
EM yaoqian@microsoft.com; frankkps@microsoft.com
CR Chang E., 2000, P ICSLP 2000, P983
Chen C.J., 1997, P EUROSPEECH, P1543
Fosler-Lussier E, 1999, SPEECH COMMUN, V29, P137, DOI 10.1016/S0167-6393(99)00035-7
Freij G., 1988, P ICASSP 1988, P135
Hirsch H. G., 2000, ISCA ITRW ASR
Hirst D. J., 1993, TRAVAUX I PHONETIQUE, V15, P71
Ho T.-H., 1999, P EUROSPEECH 1999, P883
LEI X, 2006, P INT C SPOK LANG PR, P1237
Lin CH, 1996, SPEECH COMMUN, V18, P175, DOI 10.1016/0167-6393(95)00043-7
Peng G, 2005, SPEECH COMMUN, V45, P49, DOI 10.1016/j.specom.2004.09.004
Qian Y., 2006, P ICASSP, V1, P133
Qian Y, 2007, J ACOUST SOC AM, V121, P2936, DOI 10.1121/1.2717413
QIANG S, 2007, P INTERSPEECH 2007, P1801
Seide F., 2000, P ICSLP 2000, P495
SHINOZAKI T, 2001, P EUROSPEECH AALB, V1, P491
Talkin A. D., 1995, SPEECH CODING SYNTHE, P495
TIAN Y, 2004, P ICASSP, V1, P105
Tokuda K, 2002, IEICE T INF SYST, VE85D, P455
Wang HL, 2006, LECT NOTES COMPUT SC, V4274, P445
WANG HL, 2006, P ICSLP 2006, P1047
Wang HM, 1997, IEEE T SPEECH AUDI P, V5, P195
Xu Y, 2006, ITALIAN J LINGUISTIC, V18, P125
Zhang JS, 2005, SPEECH COMMUN, V46, P440, DOI 10.1016/j.specom.2005.03.010
Zhang L, 2006, LECT NOTES COMPUT SC, V4274, P590
Zhou J., 2004, P ICASSP 2004, P997
NR 25
TC 3
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2009
VL 51
IS 12
BP 1169
EP 1179
DI 10.1016/j.specom.2009.08.001
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560FZ
UT WOS:000274888800002
ER
PT J
AU Chen, JF
Phua, K
Shue, L
Sun, HW
AF Chen, Jianfeng
Phua, Koksoon
Shue, Louis
Sun, Hanwu
TI Performance evaluation of adaptive dual microphone systems
SO SPEECH COMMUNICATION
LA English
DT Article
DE Microphone array; Adaptive beamforming; Noise reduction
ID ARRAY HEARING-AIDS; BINAURAL OUTPUT
AB In this paper, the performance of the adaptive noise cancellation method is evaluated on several possible dual microphone system (DMS) configurations. Two groups of DMS are taken into consideration with one consisting of two omnidirectional microphones and another involving directional microphones. The properties of these methods are theoretically analyzed under incoherent, coherent and diffuse noise respectively. To further investigate their achievable noise reduction performance in real situations, a series of experiments in simulated and real office environments are carried out. Some recommendations are given at the end for designing and choosing the suitable methods in real applications. (C) 2009 Elsevier B.V. All rights reserved.
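The adaptive noise cancellation evaluated across the dual microphone configurations is, at its core, an adaptive filter that predicts the noise in the primary channel from a reference channel and subtracts it. Below is a generic textbook NLMS sketch of that core, not any particular configuration from the paper; the step size and tap count are arbitrary.

```python
import numpy as np

def nlms_noise_canceller(primary, reference, num_taps=64, mu=0.5, eps=1e-6):
    """Normalized-LMS adaptive noise cancellation: the reference channel is
    filtered to match the noise in the primary channel and subtracted."""
    w = np.zeros(num_taps)
    out = np.zeros(len(primary))
    for n in range(num_taps, len(primary)):
        x = reference[n - num_taps:n][::-1]      # reference tap vector
        y = np.dot(w, x)                         # estimated noise component
        e = primary[n] - y                       # error = enhanced output sample
        w += mu * e * x / (np.dot(x, x) + eps)   # NLMS weight update
        out[n] = e
    return out

# toy usage: a tone plus filtered noise on the primary channel, raw noise as reference
fs = 8000
t = np.arange(fs) / fs
noise = np.random.randn(len(t))
primary = np.sin(2 * np.pi * 440 * t) + np.convolve(noise, [0.6, 0.3], 'same')
enhanced = nlms_noise_canceller(primary, noise)
```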
C1 [Chen, Jianfeng; Phua, Koksoon; Shue, Louis; Sun, Hanwu] Inst Infocomm Res, Singapore 138632, Singapore.
RP Phua, K (reprint author), Inst Infocomm Res, 1 Fusionopolis Way,21-01 Connexis S Tower, Singapore 138632, Singapore.
EM jfchen@i2r.a-star.edu.sg; ksphua@i2r.a-star.edu.sg;
lshue@i2r.a-star.edu.sg; hwsun@i2r.a-star.edu.sg
CR ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599
*AM NAT STAND, 2004, ANSIS3
American National Standard, 1997, METH CALC SPEECH INT
Berghe J. V., 1998, Journal of the Acoustical Society of America, V103, DOI 10.1121/1.423066
Bitzer J., 1998, P EUR SIGN PROC C RH, P105
BITZER J, 1999, P IEEE ASSP WORKSH A, V1, P7
BRANDSTEIN M, 2001, MICROPHONE ARRAYS, pCH2
COMPERNOLLE DV, 1990, IEEE T ACOUST SPEECH, V1, P833
COX H, 1986, IEEE T ACOUST SPEECH, V34, P393, DOI 10.1109/TASSP.1986.1164847
Csermak B., 2000, HEARING REV, V7, P56
Desloge JG, 1997, IEEE T SPEECH AUDI P, V5, P529, DOI 10.1109/89.641298
Elko G. W., 1995, P IEEE WORKSH APPL S, P169, DOI 10.1109/ASPAA.1995.482983
Elko G. W., 2000, ACOUSTIC SIGNAL PROC, P181
GREENBERG JE, 1993, J ACOUST SOC AM, V94, P3009, DOI 10.1121/1.407334
GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739
Haykin S., 1996, ADAPTIVE FILTER THEO
Hoshuyama O, 1999, IEEE T SIGNAL PROCES, V47, P2677, DOI 10.1109/78.790650
Luo FL, 2002, IEEE T SIGNAL PROCES, V50, P1583
MAJ JB, 2003, P INT WORKSH ACOUST, V1, P171
Maj J.B, 2004, THESIS KATHOLIEKE U
Phua KS, 2005, SIGNAL PROCESS, V85, P809, DOI 10.1016/j.sigpro.2004.12.004
Ricketts T, 2002, INT J AUDIOL, V41, P100, DOI 10.3109/14992020209090400
SASAKI T, 1995, MICROPHONE APPARATUS
TOMPSON SC, 1999, HEARING REV, V3, P31
Welker DP, 1997, IEEE T SPEECH AUDI P, V5, P543, DOI 10.1109/89.641299
WIDROW B, 1975, P IEEE, V63, P1692, DOI 10.1109/PROC.1975.10036
NR 26
TC 6
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2009
VL 51
IS 12
BP 1180
EP 1193
DI 10.1016/j.specom.2009.06.002
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560FZ
UT WOS:000274888800003
ER
PT J
AU Stouten, V
Van Hamme, H
AF Stouten, Veronique
Van Hamme, Hugo
TI Automatic voice onset time estimation from reassignment spectra
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice Onset Time; Speech attributes; Estimation; Reassignment spectrum;
Lattice rescoring
ID FREQUENCY; PLOSIVES; FEATURES
AB We describe an algorithm to automatically estimate the voice onset time (VOT) of plosives. The VOT is the time delay between the burst onset and the start of periodicity when it is followed by a voiced sound. Since the VOT is affected by factors like place of articulation and voicing it can be used for inference of these factors. The algorithm uses the reassignment spectrum of the speech signal, a high resolution time-frequency representation which simplifies the detection of the acoustic events in a plosive. The performance of our algorithm is evaluated on a subset of the TIMIT database by comparison with manual VOT measurements. On average, the difference is smaller than 10 ms for 76.1% and smaller than 20 ms for 91.4% of the plosive segments. We also provide analysis statistics of the VOT of /b/, /d/, /g/, /p/, /t/ and /k/ and experimentally verify some sources of variability. Finally, to illustrate possible applications, we integrate the automatic VOT estimates as an additional feature in an HMM-based speech recognition system and show a small but statistically significant improvement in phone recognition rate. (C) 2009 Elsevier B.V. All rights reserved.
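For orientation, VOT estimation amounts to locating two events and differencing their times: the burst onset and the onset of periodicity. The sketch below does this with crude energy-jump and autocorrelation heuristics applied directly to the waveform; the paper instead locates both events in the high-resolution reassignment spectrum, so this is only an illustration of the measurement, with all thresholds invented.

```python
import numpy as np

def crude_vot(signal, fs, frame=0.010, burst_jump_db=15.0, periodicity=0.5):
    """Very rough VOT estimate: burst onset = first frame whose energy jumps
    sharply over the previous frame; voicing onset = first later frame whose
    normalized autocorrelation peak (100-400 Hz lag range) exceeds a
    threshold.  Returns the difference in seconds, or None if either event
    is not found."""
    n = int(frame * fs)
    frames = [signal[i:i + n] for i in range(0, len(signal) - n, n)]
    energy_db = [10.0 * np.log10(np.sum(f ** 2) + 1e-12) for f in frames]
    burst = next((i for i in range(1, len(frames))
                  if energy_db[i] - energy_db[i - 1] > burst_jump_db), None)
    if burst is None:
        return None

    def is_voiced(f):
        f = f - f.mean()
        ac = np.correlate(f, f, 'full')[len(f) - 1:]
        return ac[int(fs / 400):int(fs / 100)].max() / (ac[0] + 1e-12) > periodicity

    voicing = next((i for i in range(burst + 1, len(frames)) if is_voiced(frames[i])), None)
    return None if voicing is None else (voicing - burst) * frame
```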
C1 [Stouten, Veronique; Van Hamme, Hugo] Katholieke Univ Leuven, ESAT Dept, B-3001 Louvain, Belgium.
RP Van Hamme, H (reprint author), Katholieke Univ Leuven, ESAT Dept, Kasteelpk Arenberg 10,PO 2441, B-3001 Louvain, Belgium.
EM Hugo.Vanhamme@esat.kuleuven.be
RI Van hamme, Hugo/D-6581-2012
FU IWT - SBO [040102]; European Commission [FP6-034362]
FX This research was funded by the IWT - SBO project 'SPACE' (Project no.
040102) and by the European Commission under Contract FP6-034362
(ACORNS).
CR AUGER F, 1995, IEEE T SIGNAL PROCES, V43, P1068, DOI 10.1109/78.382394
BEYERLEIN P, 1998, P IEEE INT C AC SPEE, V1, P481, DOI 10.1109/ICASSP.1998.674472
Bilmes JA, 2005, IEEE SIGNAL PROC MAG, V22, P89
Borden G. J., 1984, SPEECH SCI PRIMER PH
DEMUYNCK K, 2006, P INT 2006 ICSLP 9 I, P1622
Demuynck K., 2001, THESIS K U LEUVEN
GAROFOLO J, 1990, SPEECH DISC 1 1 1
Hainsworth S. W., 2003, CUEDFINFENGTR459
Jiang JT, 2006, J ACOUST SOC AM, V119, P1092, DOI 10.1121/1.2149841
KAZEMZADEH A, 2006, P ICSLP PITTSB PA US
King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148
Lee Chin-Hui, 2007, P INT, P1825
LEFEBVRE C, 1990, P ICSLP KOB JAP, P1073
McCrea CR, 2005, J SPEECH LANG HEAR R, V48, P1013, DOI 10.1044/1092-4388(2005/069)
Niyogi P, 1998, INT CONF ACOUST SPEE, P13, DOI 10.1109/ICASSP.1998.674355
OBRIEN SM, 1993, INT J MAN MACH STUD, V38, P97, DOI 10.1006/imms.1993.1006
Plante F, 1998, IEEE T SPEECH AUDI P, V6, P282, DOI 10.1109/89.668821
Ramesh P., 1998, P ICSLP SYDN AUSTR
Seppi D., 2007, P INTERSPEECH ANTW B, P1805
SONMEZ K, 2000, P ICSLP BEIJ CHIN
STOUTEN F, 2006, P INT 2006, P357
Whiteside SP, 2004, J ACOUST SOC AM, V116, P1179, DOI 10.1121/1.1768256
WITTEN IH, 1991, IEEE T INFORM THEORY, V37, P1085, DOI 10.1109/18.87000
Xiao J, 2007, IEEE T SIGNAL PROCES, V55, P2851, DOI 10.1109/TSP.2007.893961
NR 24
TC 12
Z9 12
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2009
VL 51
IS 12
BP 1194
EP 1205
DI 10.1016/j.specom.2009.06.003
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560FZ
UT WOS:000274888800004
ER
PT J
AU Pitsikalis, V
Maragos, P
AF Pitsikalis, Vassilis
Maragos, Petros
TI Analysis and classification of speech signals by generalized fractal
dimension features
SO SPEECH COMMUNICATION
LA English
DT Article
DE Feature extraction; Generalized fractal dimensions; Broad class phoneme
classification
ID MULTIFRACTAL NATURE; STRANGE ATTRACTORS; CHAOTIC SYSTEMS; TIME-SERIES;
RECOGNITION; TURBULENCE; DYNAMICS; MODELS
AB We explore nonlinear signal processing methods inspired by dynamical systems and fractal theory in order to analyze and characterize speech sounds. A speech signal is at first embedded in a multidimensional phase-space and further employed for the estimation of measurements related to the fractal dimensions. Our goals are to compute these raw measurements in the practical cases of speech signals, to further utilize them for the extraction of simple descriptive features and to address issues on the efficacy of the proposed features to characterize speech sounds. We observe that distinct feature vector elements obtain values or show statistical trends that on average depend on general characteristics such as the voicing, the manner and the place of articulation of broad phoneme classes. Moreover the way that the statistical parameters of the features are altered as an effect of the variation of phonetic characteristics seems to follow some roughly formed patterns. We also discuss some qualitative aspects concerning the linear phoneme-wise correlation between the fractal features and the commonly employed mel-frequency cepstral coefficients (MFCCs) demonstrating phonetic cases of maximal and minimal correlation. In the same context we also investigate the fractal features' spectral content, in terms of the most and least correlated components with the MFCC. Further the proposed methods are examined under the light of indicative phoneme classification experiments. These quantify the efficacy of the features to characterize broad classes of speech sounds. The results are shown to be comparable for some classification scenarios with the corresponding ones of the MFCC features. (C) 2009 Elsevier B.V. All rights reserved.
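The basic machinery behind fractal-dimension features, delay embedding of the signal into a phase space followed by a correlation-sum estimate whose log-log slope approximates the correlation dimension D_2, can be sketched briefly. This is a textbook Grassberger-Procaccia illustration, not the paper's generalized-dimension feature extraction; the embedding parameters and radii are arbitrary.

```python
import numpy as np

def delay_embed(x, dim=3, tau=20):
    """Embed a 1-D signal in a dim-dimensional phase space using delay tau."""
    n = len(x) - (dim - 1) * tau
    return np.stack([x[i * tau:i * tau + n] for i in range(dim)], axis=1)

def correlation_sum(points, radius):
    """Fraction of distinct point pairs closer than radius (the Grassberger-
    Procaccia correlation sum); the slope of log C(r) against log r in the
    scaling region estimates the correlation dimension D_2."""
    d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    iu = np.triu_indices(len(points), k=1)
    return np.mean(d[iu] < radius)

# toy usage on a noisy sinusoid (a roughly one-dimensional attractor)
x = np.sin(2 * np.pi * 0.01 * np.arange(2000)) + 0.02 * np.random.randn(2000)
pts = delay_embed(x)[::5]
radii = np.array([0.1, 0.2, 0.4, 0.8])
c = np.array([correlation_sum(pts, r) for r in radii])
slope = np.polyfit(np.log(radii), np.log(c + 1e-12), 1)[0]
print("correlation sums:", c.round(3), "fitted slope:", round(slope, 2))
```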
C1 [Pitsikalis, Vassilis; Maragos, Petros] Natl Tech Univ Athens, Sch Elect & Comp Engn, GR-15773 Athens, Greece.
RP Pitsikalis, V (reprint author), Natl Tech Univ Athens, Sch Elect & Comp Engn, Iroon Polytexneiou Str, GR-15773 Athens, Greece.
EM vpitsik@cs.ntua.gr; maragos@cs.ntua.gr
FU FP6 European research programs HIWIRE; Network of Excellence MUSCLE;
'Protagoras' NTUA research program
FX This work was supported in part by the FP6 European research programs
HIWIRE and the Network of Excellence MUSCLE and by the 'Protagoras' NTUA
research program.
CR Abarbanel H. D. I., 1996, ANAL OBSERVED CHAOTI
ADEYEMI O, 1997, IEEE T ACOUST SPEECH, P2377
Anderson TW, 2003, INTRO MULTIVARIATE S
Kumar A, 1996, J ACOUST SOC AM, V100, P615, DOI 10.1121/1.415886
BADII R, 1985, J STAT PHYS, V40, P725
Banbrook M, 1999, IEEE T SPEECH AUDI P, V7, P1, DOI 10.1109/89.736326
BENZI R, 1984, J PHYS A-MATH GEN, V17, P3521, DOI 10.1088/0305-4470/17/18/021
BERNHARD HP, 1991, 12 INT C PHON SCI AI, P19
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Deng L, 1997, SPEECH COMMUN, V22, P93, DOI 10.1016/S0167-6393(97)00018-6
Evertsz C., 1992, CHAOS FRACTALS NEW F
GAROFOLO J, 1993, TIMIT ACOUST PHONETI
GRASSBERGER P, 1983, PHYSICA D, V9, P189, DOI 10.1016/0167-2789(83)90298-1
Greenwood GW, 1997, BIOSYSTEMS, V44, P161, DOI 10.1016/S0303-2647(97)00056-7
HENTSCHEL HGE, 1983, PHYSICA D, V8, P435, DOI 10.1016/0167-2789(83)90235-X
HENTSCHEL HGE, 1983, PHYS REV A, V27, P1266, DOI 10.1103/PhysRevA.27.1266
HERZEL H, 1993, NCVS STATUS PROGR RE, V4, P177
Hirschberg A., 1992, B COMMUNICATION PARL, V2, P7
Howe MS, 2005, P ROY SOC A-MATH PHY, V461, P1005, DOI 10.1098/rspa.2004.1405
HUNT F, 1986, DIMENSIONS ENTROPIES
Johnson MT, 2005, IEEE T SPEECH AUDI P, V13, P458, DOI 10.1109/TSA.2005.848885
Kaiser J. F., 1983, VOCAL FOLD PHYSL BIO, P358
Kantz H., 1997, NONLINEAR TIME SERIE
Kokkinos I, 2005, IEEE T SPEECH AUDI P, V13, P1098, DOI 10.1109/TSA.2005.852982
Kubin G., 1996, P INT C AC SPEECH SI, V1, P267
LIVESCU K, 2003, 8 EUR C SPEECH COMM, P2529
Mandelbrot B. B., 1982, FRACTAL GEOMETRY NAT
MARAGOS P, 1993, IEEE T SIGNAL PROCES, V41, P3024, DOI 10.1109/78.277799
Maragos P, 1999, J ACOUST SOC AM, V105, P1925, DOI 10.1121/1.426738
Maragos P., 1991, P IEEE ICASSP, P417, DOI 10.1109/ICASSP.1991.150365
MENEVEAU C, 1991, J FLUID MECH, V224, P429, DOI 10.1017/S0022112091001830
NARAYANAN SS, 1995, J ACOUST SOC AM, V97, P2511, DOI 10.1121/1.411971
PACKARD NH, 1980, PHYS REV LETT, V45, P712, DOI 10.1103/PhysRevLett.45.712
Papandreou G, 2009, IEEE T AUDIO SPEECH, V17, P423, DOI 10.1109/TASL.2008.2011515
Pitsikalis V., 2003, P EUROSPEECH 2003 GE, P817
Pitsikalis V, 2006, IEEE SIGNAL PROC LET, V13, P711, DOI 10.1109/LSP.2006.879424
PITSIKALIS V, 2002, IEEE T ACOUST SPEECH, P533
Potamianos G, 2004, ISSUES VISUAL AUDIO
QUATIERI TF, 1990, P ICASSP 1990 ALB NM, V3, P1551
SAUER T, 1991, J STAT PHYS, V65, P579, DOI 10.1007/BF01053745
Takens F., 1981, DYNAMICAL SYSTEMS TU, V898, P366, DOI DOI 10.1007/BFB0091924
TEAGER HM, 1989, SPEECH PRODUCTION D, V55
TEMAM R, 1993, APPL MATH SCI, V68
Thomas T. J., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80019-5
Tokuda I, 2001, J ACOUST SOC AM, V110, P3207, DOI 10.1121/1.1413749
TOWNSHEND B, 1991, P INT C AC SPEECH SI, P425, DOI 10.1109/ICASSP.1991.150367
Tritton D. J., 1988, PHYS FLUID DYNAMICS, V1st
Young S., 2002, HTK BOOK, P3
NR 48
TC 9
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2009
VL 51
IS 12
BP 1206
EP 1223
DI 10.1016/j.specom.2009.06.005
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560FZ
UT WOS:000274888800005
ER
PT J
AU Deshmukh, OD
Verma, A
AF Deshmukh, Om D.
Verma, Ashish
TI Nucleus-level clustering for word-independent syllable stress
classification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Syllable stress; Speech analysis; Language learning; Acoustic-phonetics
AB This paper presents a word-independent technique for classifying the syllable stress of spoken English words. The proposed technique improves upon the existing word-independent techniques by utilizing the acoustic differences of various syllabic nuclei. Syllables with acoustically similar nuclei are grouped together and a separate stress classifier is trained for each such group. The performance of the proposed group-specific classifiers is analyzed as the number of groups is increased and is also compared with an alternative data-driven clustering based approach. The proposed technique improves the syllable-level accuracy by 5.2% and the word-level accuracy by 1.1%. The corresponding improvements using the data-driven clustering based approach are 0.12% and 0.02%, respectively. (C) 2009 Elsevier B.V. All rights reserved.
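The grouping idea, clustering syllables by acoustically similar nuclei and training one stress classifier per group, can be shown with a toy dispatcher. The nucleus-to-group table, the nearest-class-mean classifier, and the feature values below are all placeholders; the paper's groups, features, and classifiers differ.

```python
import numpy as np

# Hypothetical nucleus grouping: acoustically similar vowels share a group.
NUCLEUS_GROUP = {'iy': 'high_front', 'ih': 'high_front',
                 'aa': 'low_back',   'ao': 'low_back',
                 'uw': 'high_back',  'uh': 'high_back'}

class GroupwiseStressClassifier:
    """Train a separate (here: nearest-class-mean) stress classifier for each
    nucleus group and dispatch test syllables by their nucleus."""
    def fit(self, nuclei, features, labels):
        self.models = {}
        for g in set(NUCLEUS_GROUP.values()):
            idx = [i for i, nuc in enumerate(nuclei) if NUCLEUS_GROUP[nuc] == g]
            if not idx:
                continue
            X = np.array([features[i] for i in idx])
            y = np.array([labels[i] for i in idx])
            self.models[g] = {c: X[y == c].mean(axis=0) for c in np.unique(y)}
        return self

    def predict(self, nucleus, feat):
        means = self.models[NUCLEUS_GROUP[nucleus]]
        return min(means, key=lambda c: np.linalg.norm(feat - means[c]))

clf = GroupwiseStressClassifier().fit(
    nuclei=['iy', 'aa', 'iy', 'aa'],
    features=[[2.0, 1.0], [1.0, 0.2], [0.5, 0.3], [0.2, 0.1]],
    labels=['stressed', 'stressed', 'unstressed', 'unstressed'])
print(clf.predict('iy', np.array([1.8, 0.9])))   # -> 'stressed'
```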
C1 [Deshmukh, Om D.; Verma, Ashish] IBM India Res Lab, Vasant Kunj Inst Area, New Delhi 110070, India.
RP Deshmukh, OD (reprint author), IBM India Res Lab, Vasant Kunj Inst Area, Block C,Plot 4, New Delhi 110070, India.
EM odeshmuk@in.ibm.com; vashish@in.ibm.com
CR CHANDEL A, 2007, IEEE AUT SPEECH REC, P711
Duda R. O., 2001, PATTERN CLASSIFICATI
FOSLERLUSSIER E, 1999, IEEE AUT SPEECH REC, P16
Garner S. R., 1995, P NZ COMP SCI RES ST, P57
GREENBERG S, 2001, P ISCA WORKSH PROS S, P51
HARRIS KS, 1988, ANN B RES I LOGOPEDI, V22, P53
Imoto K., 2002, P ICSLP, P749
Jenkin K. L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607466
Kahn D., 1976, THESIS MIT CAMBRIDGE
KOTHARI R, 2003, PATTERN RECOGN, V24, P1215
OPPELSTRUP L, 2005, P FONETIK GOT, P51
Ramabhadran B., 2007, IEEE AUT SPEECH REC, P472
SILIPO R, 2000, AUTOMATIC DETECTION
Sluijter AMC, 1997, J ACOUST SOC AM, V101, P503, DOI 10.1121/1.417994
Sluijter AMC, 1996, INT C SPOK LANG PROC, V2, P630
Stevens K.N., 1999, ACOUSTIC PHONETICS
TEPPERMAN J, 2005, P ICASSP PHIL MAR, P733
VERMA A, 2006, P INT C ACOUST SPEEC, pI1237
Ying G. S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607932
You K, 2005, IEEE ICCE, P267
NR 20
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2009
VL 51
IS 12
BP 1224
EP 1233
DI 10.1016/j.specom.2009.06.006
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560FZ
UT WOS:000274888800006
ER
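The nucleus-level grouping idea summarized in the abstract above can be pictured with a short sketch: syllables are routed to a group according to their nucleus, and a separate stress classifier is trained for each group. This is only an illustration on synthetic data with made-up features (duration, energy, F0 range) and is not the authors' implementation.
# Minimal sketch (not the authors' code): per-nucleus-group syllable stress
# classifiers. Syllables are grouped by an assumed nucleus-group label and a
# separate classifier is trained for each group; feature names are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: each syllable has [duration, energy, F0 range] features, a nucleus
# group id (e.g. 0 = high front vowels, 1 = low vowels, ...) and a stress label.
n = 300
features = rng.normal(size=(n, 3))
nucleus_group = rng.integers(0, 3, size=n)
stressed = (features[:, 1] + 0.5 * features[:, 0] + rng.normal(scale=0.5, size=n)) > 0

# Train one stress classifier per nucleus group.
classifiers = {}
for g in np.unique(nucleus_group):
    idx = nucleus_group == g
    classifiers[g] = LogisticRegression().fit(features[idx], stressed[idx])

# Classify a new syllable by routing it to the classifier of its nucleus group.
def classify_stress(feat_vec, group):
    return bool(classifiers[group].predict(feat_vec.reshape(1, -1))[0])

print(classify_stress(np.array([0.4, 1.0, 0.2]), group=1))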
PT J
AU Engelbrecht, KP
Quade, M
Moller, S
AF Engelbrecht, Klaus-Peter
Quade, Michael
Moeller, Sebastian
TI Analysis of a new simulation approach to dialog system evaluation
SO SPEECH COMMUNICATION
LA English
DT Article
DE Evaluation; User simulation; Spoken dialog system; Usability; Prediction
model; Optimization
AB The evaluation of spoken dialog systems still relies on subjective interaction experiments for quantifying interaction behavior and user-perceived quality. In this paper, we present a simulation approach replacing subjective tests in early system design and evaluation phases. The simulation is based on a model of the system and a probabilistic model of user behavior. Probabilities for the next user action vary depending on system features and user characteristics, as defined by rules. This way, simulations can be conducted before data have been acquired. In order to evaluate the simulation approach, characteristics of simulated interactions are compared to interaction corpora obtained in subjective experiments. As was previously proposed in the literature, we compare interaction parameters for both corpora and calculate recall and precision of user utterances. The results are compared to those from a comparison of real user corpora. While the real corpora are not identical, they are more similar to each other than the simulation is to the real data. However, the simulations can predict differences between system versions and user groups quite well on a relative level. In order to derive further requirements for the model, we conclude with a detailed analysis of utterances missing in the simulated corpus and consider the believability of entire dialogs. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Engelbrecht, Klaus-Peter; Moeller, Sebastian] TU Berlin, Deutsch Telekom Labs, Qual & Usabil Lab, D-10587 Berlin, Germany.
[Quade, Michael] TU Berlin, DAI Labor, D-10587 Berlin, Germany.
RP Engelbrecht, KP (reprint author), TU Berlin, Deutsch Telekom Labs, Qual & Usabil Lab, Ernst Reuter Pl 7, D-10587 Berlin, Germany.
EM klaus-peter.engelbrecht@telekom.de; michael.quade@dai-labor.de;
sebastian.moeller@telekom.de
CR AI H, 2008, P 9 SIGDIAL WORKSH D
AI H, 2008, P 46 ANN M ASS COMP
Ai H., 2006, P AAAI WORKSH STAT E
Anderson JR, 2004, PSYCHOL REV, V111, P1036, DOI 10.1037/0033-295x.111.4.1036
[Anonymous], 2003, P851 ITUT
ARAKI M, 1997, ECAI 96, P183
BOHUS D, 2005, P 6 SIGDIAL WORKSH D
Card S. K., 1983, PSYCHOL HUMAN COMPUT
CHUNG G, 2004, P 42 ANN M ASS COMP
Eckert W., 1997, P IEEE WORKSH AUT SP
ENGELBRECHT KP, 2008, P INT 2008 BRISB AUS
Fraser N. M., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90019-M
HERMANN F, 2007, P HCI INT 2007
ITO A, 2006, P INT 2006 PITTSB PA
*ITU T, 2005, PAR DESCR INT SP P S, V24
IVORY MY, 2000, UCBCSD001105 EECS DE
JANARTHANAM S, 2008, P SEMDIAL 2008 LONDI
John BE, 2005, IEEE PERVAS COMPUT, V4, P27, DOI 10.1109/MPRV.2005.80
Kieras D.E., 2003, HUMAN COMPUTER INTER, P1191
Lopez-Cozar R, 2003, SPEECH COMMUN, V40, P387, DOI 10.1016/S0167-6393(02)00126-7
MOLLER S, 2007, P 12 INT C SPEECH CO
MOLLER S, 2006, P INT 2006 PITTSB PA
Moller S, 2007, COMPUT SPEECH LANG, V21, P26, DOI 10.1016/j.csl.2005.11.003
MOLLER S, 2007, P INT 2007 ANTW BELG
Newell A., 1990, UNIFIED THEORIES COG
Nielsen J., 1993, USABILITY ENG
NORMAN DA, 1981, PSYCHOL REV, V88, P1, DOI 10.1037/0033-295X.88.1.1
Norman D. A., 1983, MENTAL MODELS, P7
PIETQUIN O, 2006, P IEEE INT C MULT EX
PIETQUIN O, 2002, P IEEE INT C ACOUST
PIETQUIN O, 2009, MACH LEARN, P167
RIESER V, 2006, P INT 2006 PITTSB PA
SCHATZMANN J, 2007, P HLT NAACL ROCH NY
SCHATZMANN J, 2007, P ASRU KYOT JAP
SCHATZMANN J, 2007, P 8 SIGDIAL WORKSH D
SCHATZMANN J, 2005, P 6 SIGDIAL WORKSH D
SCHEFFLER K, 2001, P NAACL 2001 WORKSH
Seneff S, 2002, COMPUT SPEECH LANG, V16, P283, DOI 10.1016/S0885-2308(02)00011-6
Walker M. A., 2000, NAT LANG ENG, V6, P363, DOI 10.1017/S1351324900002503
WALKER MA, 1997, P ACL EACL 35 ANN M
NR 40
TC 6
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2009
VL 51
IS 12
BP 1234
EP 1252
DI 10.1016/j.specom.2009.06.007
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560FZ
UT WOS:000274888800007
ER
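As a rough illustration of the rule-based probabilistic user simulation described above, the sketch below defines hand-written rules that map a system prompt type and a user-group attribute to a distribution over next user actions and samples a short dialog from it. Prompt types, actions, and probabilities are invented for the example and do not come from the paper.
# Illustrative sketch (assumptions only, not the authors' system): a rule-based
# probabilistic user simulation in which action probabilities depend on the
# system prompt type and a user-group attribute, as defined by hand-written rules.
import random

def action_distribution(system_prompt, user_group):
    """Return a dict of P(next user action) given the system state and user type."""
    if system_prompt == "open_question":
        # Hypothetical rule: experts tend to over-answer; novices give minimal answers.
        if user_group == "expert":
            return {"provide_all_slots": 0.6, "provide_one_slot": 0.3, "silence": 0.1}
        return {"provide_one_slot": 0.7, "ask_for_help": 0.2, "silence": 0.1}
    if system_prompt == "confirmation":
        return {"confirm": 0.8, "negate": 0.15, "silence": 0.05}
    return {"provide_one_slot": 0.9, "silence": 0.1}

def sample_action(dist, rng):
    r, acc = rng.random(), 0.0
    for action, p in dist.items():
        acc += p
        if r <= acc:
            return action
    return action  # numerical safety net for rounding

rng = random.Random(42)
dialog = []
for prompt in ["open_question", "confirmation", "open_question"]:
    act = sample_action(action_distribution(prompt, "novice"), rng)
    dialog.append((prompt, act))
print(dialog)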
PT J
AU Lu, YY
Cooke, M
AF Lu, Youyi
Cooke, Martin
TI The contribution of changes in F0 and spectral tilt to increased
intelligibility of speech produced in noise
SO SPEECH COMMUNICATION
LA English
DT Article
DE Intelligibility; Noise; Speech production; Spectral tilt
ID FUNDAMENTAL-FREQUENCY; SPEAKER INTELLIGIBILITY; NORMAL-HEARING; CLEAR
SPEECH; RECOGNITION; LISTENERS; ENHANCEMENT; PERCEPTION; ENVIRONMENTS;
CHILDREN
AB Talkers modify the way they speak in the presence of noise. As well as increases in voice level and fundamental frequency (F0), a flattening of spectral tilt is observed. The resulting "Lombard speech" is typically more intelligible than speech produced in quiet, even when level differences are removed. What is the cause of the enhanced intelligibility of Lombard speech? The current study explored the relative contributions to intelligibility of changes in mean F0 and spectral tilt. The roles of F0 and spectral tilt were assessed by measuring the intelligibility gain of non-Lombard speech whose mean F0 and spectrum were manipulated, both independently and in concert, to simulate those of natural Lombard speech. In the presence of speech-shaped noise, flattening of spectral tilt contributed greatly to the intelligibility gain of noise-induced speech over speech produced in quiet, while an increase in F0 did not have a significant influence. The perceptual effect of spectrum flattening was attributed to its ability to increase the amount of the speech time-frequency plane "glimpsed" in the presence of noise. However, spectral tilt changes alone could not fully account for the intelligibility of Lombard speech. Other changes observed in Lombard speech such as durational modifications may well contribute to intelligibility. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Lu, Youyi] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England.
[Cooke, Martin] Univ Basque Country, Fac Letras, Language & Speech Lab, Vitoria, Spain.
RP Lu, YY (reprint author), Univ Sheffield, Dept Comp Sci, 211 Portobello St, Sheffield S1 4DP, S Yorkshire, England.
EM y.lu@dcs.shef.ac.uk; acq05yl@shef.ac.uk
FU EU
FX The second author acknowledges support from the EU Marie Curie Network
"Sound to Sense". The authors would like to thank Hideki Kawahara for
providing the Matlab implementation of STRAIGHT v40.
CR Assmann PF, 2005, J ACOUST SOC AM, V117, P886, DOI 10.1121/1.1852549
Assmann P.F., 2002, INT C SPOK LANG PROC, P425
Assmann PF, 2008, J ACOUST SOC AM, V124, P3203, DOI 10.1121/1.2980456
Barker J, 2007, SPEECH COMMUN, V49, P402, DOI 10.1016/j.specom.2006.11.003
Boersma P., 1993, P I PHONETIC SCI, V17, P97
BOND ZS, 1994, SPEECH COMMUN, V14, P325, DOI 10.1016/0167-6393(94)90026-4
Cooke M, 2006, J ACOUST SOC AM, V120, P2421, DOI 10.1121/1.2229005
Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600
COX RM, 1987, J ACOUST SOC AM, V81, P1598, DOI 10.1121/1.394512
DREHER JJ, 1957, J ACOUST SOC AM, V29, P1320, DOI 10.1121/1.1908780
Ferguson SH, 2002, J ACOUST SOC AM, V112, P259, DOI 10.1121/1.1482078
Garnier M, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P2246
GORDONSALANT S, 1986, J ACOUST SOC AM, V80, P1599, DOI 10.1121/1.394324
Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7
Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9
Hazan V, 2004, J ACOUST SOC AM, V116, P3108, DOI 10.1121/1.1806826
Jones C, 2007, COMPUT SPEECH LANG, V21, P641, DOI 10.1016/j.csl.2007.03.001
JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631
Kawahara H, 1997, INT CONF ACOUST SPEE, P1303, DOI 10.1109/ICASSP.1997.596185
KAWAHARA H, 1998, P135 M ACOUST SOC AM, V103, P2776
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
Krause JC, 2004, J ACOUST SOC AM, V115, P362, DOI 10.1121/1.1635842
Laures JS, 2003, J COMMUN DISORD, V36, P449, DOI 10.1016/S0021-9924(03)00032-7
Lee SH, 2007, Proceedings of the Fourth IASTED International Conference on Signal Processing, Pattern Recognition, and Applications, P287
LU Y, 2009, J ACOUST SOC AM, V126
Lu YY, 2008, J ACOUST SOC AM, V124, P3261, DOI 10.1121/1.2990705
MCLOUGHLIN IV, 1997, P 13 INT C DSP, V2, P591, DOI 10.1109/ICDSP.1997.628419
NIEDERJOHN RJ, 1976, IEEE T ACOUST SPEECH, V24, P277, DOI 10.1109/TASSP.1976.1162824
Pisoni D. B., 1985, ICASSP 85. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No. 85CH2118-8)
Pittman AL, 2001, J SPEECH LANG HEAR R, V44, P487, DOI 10.1044/1092-4388(2001/038)
RYALLS JH, 1982, J ACOUST SOC AM, V72, P1631, DOI 10.1121/1.388499
Skowronski MD, 2006, SPEECH COMMUN, V48, P549, DOI 10.1016/j.specom.2005.09.003
Sommers MS, 1997, J ACOUST SOC AM, V101, P2278, DOI 10.1121/1.418208
Steeneken HJM, 1999, INT CONF ACOUST SPEE, P2079
Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660
Tallal P, 1996, SCIENCE, V271, P81, DOI 10.1126/science.271.5245.81
TARTTER VC, 1993, J ACOUST SOC AM, V94, P2437, DOI 10.1121/1.408234
Thomas I.B., 1967, P NUT EL C, V23, P544
THOMAS IB, 1968, J AUDIO ENG SOC, V16, P182
Uchanski RM, 2002, J SPEECH LANG HEAR R, V45, P1027, DOI 10.1044/1092-4388(2002/083)
Watson PJ, 2008, AM J SPEECH-LANG PAT, V17, P348, DOI 10.1044/1058-0360(2008/07-0048)
Womack BD, 1996, SPEECH COMMUN, V20, P131, DOI 10.1016/S0167-6393(96)00049-0
NR 42
TC 25
Z9 27
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2009
VL 51
IS 12
BP 1253
EP 1262
DI 10.1016/j.specom.2009.07.002
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560FZ
UT WOS:000274888800008
ER
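The spectral-tilt manipulation discussed above can be approximated, very loosely, with a first-order pre-emphasis filter whose output is re-normalized to the input RMS so that level differences are removed. This is an assumption-laden stand-in: the study itself resynthesized speech with STRAIGHT, and the filter constant below is arbitrary; mean-F0 manipulation is not reproduced here.
# Minimal sketch (an illustration, not the authors' procedure): flatten spectral
# tilt with a first-order pre-emphasis filter and keep the RMS level unchanged.
import numpy as np
from scipy.signal import lfilter

def flatten_tilt(signal, alpha=0.95):
    """Boost high frequencies (approximate tilt flattening), preserving RMS."""
    emphasized = lfilter([1.0, -alpha], [1.0], signal)
    rms_in = np.sqrt(np.mean(signal ** 2))
    rms_out = np.sqrt(np.mean(emphasized ** 2))
    return emphasized * (rms_in / (rms_out + 1e-12))

# Toy example: a 200 Hz tone plus noise, sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 200 * t) + 0.01 * np.random.randn(fs)
y = flatten_tilt(x)
print(np.sqrt(np.mean(x ** 2)), np.sqrt(np.mean(y ** 2)))  # RMS is preserved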
PT J
AU Rao, KS
Yegnanarayana, B
AF Rao, K. Sreenivasa
Yegnanarayana, B.
TI Duration modification using glottal closure instants and vowel onset
points
SO SPEECH COMMUNICATION
LA English
DT Article
DE Instants of significant excitation; Group delay function; Hilbert
envelope; Linear prediction residual; Vowel onset point; Time scale
modification; Duration modification
ID TIME-SCALE MODIFICATION; SIGNIFICANT EXCITATION; SPEECH
AB This paper proposes a method for duration (time scale) modification using glottal closure instants (GCI, also known as instants of significant excitation) and vowel onset points (VOP). In general, most time scale modification methods attempt to vary the duration of speech segments uniformly over all regions. But it is observed that consonant regions and transition regions between a consonant and the following vowel, and between two consonant regions, do not vary appreciably with speaking rate. The proposed method implements the duration modification without changing the durations of the transition and consonant regions. Vowel onset points are used to identify the transition and consonant regions. A VOP is the instant at which the onset of the vowel takes place, which corresponds to the transition from a consonant to the following vowel in most cases. The VOPs are computed using the Hilbert envelope of the linear prediction (LP) residual. The instants of significant excitation correspond to the instants of glottal closure (epochs) in the case of voiced speech, and to some random excitations, like the onset of burst, in the case of nonvoiced speech. Manipulation of duration is achieved by modifying the duration of the LP residual with the help of the instants of significant excitation as pitch markers. The modified residual is used to excite the time-varying filter whose parameters are derived from the original speech signal. Perceptual quality of the synthesized speech is found to be natural. Performance of the proposed method is compared with a method in which the duration of speech is modified uniformly over all regions. Samples of speech signals for different modification factors are available for listening at http://sit.iitkgp.ernet.in/~ksrao/result.html. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Rao, K. Sreenivasa] Indian Inst Technol, Sch Informat Technol, Kharagpur 721302, W Bengal, India.
[Yegnanarayana, B.] Int Inst Informat Technol, Hyderabad 500032, Andhra Pradesh, India.
RP Rao, KS (reprint author), Indian Inst Technol, Sch Informat Technol, Kharagpur 721302, W Bengal, India.
EM ksrao@iitkgp.ac.in; yegna@iiit.ac.in
CR Deller J. R., 1993, DISCRETE TIME PROCES
DIMARINO J, 2001, P IEEE INT C ACOUST
Donnellan Olivia, 2003, P 3 IEEE INT C ADV L
Flanagan J., 1972, SPEECH ANAL SYNTHESI
Gabor D., 1946, Journal of the Institution of Electrical Engineers. III. Radio and Communication Engineering, V93
Gangashetty SV, 2004, PROCEEDINGS OF INTERNATIONAL CONFERENCE ON INTELLIGENT SENSING AND INFORMATION PROCESSING, P159
Hogg RV, 1987, ENG STAT
Ilk HG, 2006, SIGNAL PROCESS, V86, P127, DOI 10.1016/j.sigpro.2005.05.006
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
MOULINES E, 1995, SPEECH COMMUN, V16, P175, DOI 10.1016/0167-6393(94)00054-E
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE
PORTNOFF MR, 1981, IEEE T ACOUST SPEECH, V29, P374, DOI 10.1109/TASSP.1981.1163581
Prasanna S.R.M., 2004, THESIS INDIAN I TECH
PRASANNA SRM, 2002, P IEEE INT C ACOUST
QUATIERI TF, 1992, IEEE T SIGNAL PROCES, V40, P497, DOI 10.1109/78.120793
Rao KS, 2006, IEEE T AUDIO SPEECH, V14, P972, DOI 10.1109/TSA.2005.858051
SLANEY M, 1996, P IEEE INT C ACOUST
SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662
Stevens K.N., 1999, ACOUSTIC PHONETICS
NR 20
TC 12
Z9 12
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD DEC
PY 2009
VL 51
IS 12
BP 1263
EP 1269
DI 10.1016/j.specom.2009.06.004
PG 7
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 560FZ
UT WOS:000274888800009
ER
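The region-selective idea behind the duration-modification method above, namely leaving consonant and transition regions untouched and time-scaling only vowel regions, is sketched below on toy data with hypothetical region labels. Plain resampling stands in for the paper's epoch-wise manipulation of the LP residual using glottal closure instants.
# Illustrative sketch only (hypothetical segment labels, crude resampling):
# region-selective duration modification that stretches vowel steady-state
# regions while leaving consonant and transition regions unmodified.
import numpy as np

def modify_duration(signal, regions, factor):
    """regions: list of (start, end, kind); only kind == 'vowel' is time-scaled."""
    out = []
    for start, end, kind in regions:
        seg = signal[start:end]
        if kind == "vowel" and len(seg) > 1:
            n_new = int(round(len(seg) * factor))
            xp = np.linspace(0, 1, len(seg))
            xq = np.linspace(0, 1, n_new)
            seg = np.interp(xq, xp, seg)  # stands in for epoch-wise residual editing
        out.append(seg)
    return np.concatenate(out)

x = np.random.randn(1600)
regions = [(0, 400, "consonant"), (400, 600, "transition"), (600, 1600, "vowel")]
y = modify_duration(x, regions, factor=1.5)
print(len(x), len(y))  # only the vowel region is stretched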
PT J
AU Zen, H
Tokuda, K
Black, AW
AF Zen, Heiga
Tokuda, Keiichi
Black, Alan W.
TI Statistical parametric speech synthesis
SO SPEECH COMMUNICATION
LA English
DT Review
DE Speech synthesis; Unit selection; Hidden Markov models
ID HIDDEN MARKOV-MODELS; SPEAKER ADAPTATION; MAXIMUM-LIKELIHOOD; SYNTHESIS
SYSTEM; VOICE CONVERSION; COVARIANCE MATRICES; HMM; RECOGNITION;
GENERATION; ALGORITHM
AB This review gives a general overview of techniques used in statistical parametric speech synthesis. One instance of these techniques, called hidden Markov model (HMM)-based speech synthesis, has recently been demonstrated to be very effective in synthesizing acceptable speech. This review also contrasts these techniques with the more conventional technique of unit-selection synthesis that has dominated speech synthesis over the last decade. The advantages and drawbacks of statistical parametric synthesis are highlighted and we identify where we expect key developments to appear in the immediate future. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Zen, Heiga; Tokuda, Keiichi] Nagoya Inst Technol, Dept Comp Sci & Engn, Showa Ku, Nagoya, Aichi 4668555, Japan.
[Zen, Heiga] Toshiba Res Europe Ltd, Cambridge Res Lab, Cambridge CB4 0GZ, England.
[Black, Alan W.] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA.
RP Zen, H (reprint author), Nagoya Inst Technol, Dept Comp Sci & Engn, Showa Ku, Gokiso Cho, Nagoya, Aichi 4668555, Japan.
EM heiga.zen@crl.toshiba.co.uk; tokuda@nitech.ac.jp; awb@cs.cmu.edu
FU Ministry of Education, Culture, Sports, Science and Technology (MEXT);
Hori information science promotion foundation; JSPS [1880009]; European
Community's Seventh Framework Programme [FP7/2007-2013]; US National
Science Foundation [0415021]
FX The authors would like to thank Drs. Tomoki Toda of the Nara Institute
of Science and Technology, Junichi Yamagishi of the University of
Edinburgh, and Ranniery Maia of the ATR Spoken Language Communication
Research Laboratories for their helpful comments and discussions. We are
also grateful to many researchers who provided us with useful
information that enabled us to write this review. This work was partly
supported by the Ministry of Education, Culture, Sports, Science and
Technology (MEXT) e-Society project, the Hori information science
promotion foundation, a Grant-in-Aid for Scientific Research (No.
1880009) by JSPS, and the European Community's Seventh Framework
Programme (FP7/2007-2013) under grant agreement 213845 (the EMIME
Project). This work was also partly supported by the US National Science
Foundation under Grant No. 0415021 "SPICE: Speech Processing Interactive
Creation and Evaluation Toolkit for new Languages." Any opinions,
findings, conclusions, or recommendations expressed in this material are
those of the authors and do not necessarily reflect the views of the
National Science Foundation.
CR ABDELHAMID O, 2006, P INT, P1332
Acero A., 1999, P EUROSPEECH, P1047
AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705
AKAMINE M, 1998, P ICSLP, P139
ALLAUZEN C, 2004, P 42 M ACL
Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807
Aylett M., 2008, P LANGTECH
BAI Q, 2007, COMMUNICATION
Banos E., 2008, 5 JORN TECN HABL, P145
BARROS MJ, 2005, P INTERSPEECH 05 EUR, P2581
Beal M. J., 2003, THESIS U LONDON
Bennett C. L., 2005, P INT EUR, P105
Bennett C. L., 2006, P BLIZZ CHALL WORKSH
BERRY J, 2008, P AR LINGUISTICSCIRC
Beutnagel B, 1999, P JOINT ASA EAA DAEA, P15
Bilmes JA, 2003, COMPUT SPEECH LANG, V17, P213, DOI 10.1016/S0885-2308(03)00010-X
Black A., 2003, P EUROSPEECH GEN SWI, P1649
Black A., 2006, P ISCA ITRW MULTILIN
BLACK A, 2006, P INT, P1762
Black A., 2000, P ICSLP BEIJ CHIN, P411
BLACK A, 2002, P IEEE SPEECH SYNTH
Black A. W., 1997, P EUR, P601
BONAFONTE A, 2008, P LREC
Breen A, 1998, P ICSLP, P2735
BULYKO I, 2002, P ICASSP, P461
CABRAL J, 2008, P INT, P1829
Cabral J., 2007, P 6 ISCA SPEECH SYNT, P113
CHOMPHAN S, 2007, P INTERSPEECH 2007, P2849
Clark R. A. J., 2007, P BLIZZ CHALL WORKSH
Coorman G, 2000, P INT C SPOK LANG BE, P395
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
DENG L, 1992, SIGNAL PROCESS, V27, P65, DOI 10.1016/0165-1684(92)90112-A
Deng L, 2006, IEEE T AUDIO SPEECH, V14, P1492, DOI 10.1109/TASL.2006.878265
DINES J, 2001, P ICASSP, P833
Donovan R., 1995, P EUR, P573
Donovan RE, 1998, P ICSLP, P1703
DRUGMAN T, 2009, P ICASSP, P3793
DRUGMAN T, 2008, P BEN
EICHNER M, 2001, P IEEE INT C AC SPEE, P829
EICHNER M, 2000, P INT C SPOK LANG PR, P701
EIDE E, 2004, P ISCA SSW5
FARES T, 2008, P CATA, P93
Ferguson J. D., 1980, P S APPL HIDD MARK M, P143
Frankel J, 2007, IEEE T AUDIO SPEECH, V15, P246, DOI 10.1109/TASL.2006.876766
Frankel J, 2007, COMPUT SPEECH LANG, V21, P620, DOI 10.1016/j.csl.2007.03.002
Freij G., 1988, P ICASSP 1988, P135
FUJINAGA K, 2001, P ICASSP, P513
Fukada T., 1992, P ICASSP, P137, DOI 10.1109/ICASSP.1992.225953
Gales M. J. F., 1996, CUEDFINFENGTR263
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034
Gales MJF, 2000, IEEE T SPEECH AUDI P, V8, P417, DOI 10.1109/89.848223
GAO BH, 2008, P INT, P2266
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
GISH H, 1993, P INT C AC SPEECH SI, P447
GONZALVO X, 2007, P NOLISP, P7
Gonzalvo X., 2007, P 6 ISCA SPEECH SYNT, P362
HASHIMOTO K, 2008, P AUT M ASJ, P251
HEMPTINNE C, 2006, THESIS IDIAP RES I
HILL DR, 1995, P AVIOS95 S SAN JOS, P27
HIRAI T, 2004, P ISCA SSW5
Hirose K, 2005, SPEECH COMMUN, V46, P385, DOI 10.1016/j.specom.2005.03.014
Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636
HOMAYOUNPOUR M, 2004, CSI J COMPUT SCI ENG, V2
Hon H, 1998, INT CONF ACOUST SPEE, P293, DOI 10.1109/ICASSP.1998.674425
HUANG X, 1996, P ICSLP 96 PHIL, P2387
Hunt A. J., 1996, P ICASSP 96, P373
Imai S., 1983, ELECTR COMMUN JPN, V66, P10, DOI 10.1002/ecja.4400660203
IRINO T, 2002, P ICSLP, P2545
ISHIMATSU Y, 2001, SP200181 IEICE, V101, P57
ITAKURA F, 1975, IEEE T ACOUST SPEECH, VAS23, P67, DOI 10.1109/TASSP.1975.1162641
IWAHASHI N, 1995, SPEECH COMMUN, V16, P139, DOI 10.1016/0167-6393(94)00051-B
IWANO K, 2002, SP200273 IEICE, P11
JENSEN U, 1994, COMPUT SPEECH LANG, V8, P247, DOI 10.1006/csla.1994.1013
Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257
KARABETSOS S, 2008, P TSD, P349
Karaiskos V., 2008, P BLIZZ CHALL WORKSH
Katagiri S., 1991, P IEEE WORKSH NEUR N, P299
KATAOKA S, 2004, P INT, P1205
Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5
KAWAI H, 2002, P IEEE SPEECH SYNTH
KAWAI H, 2004, P ISCA SSW5
*KDDI R D LAB, 2008, DEV DOWNL SPEECH SYN
Kim SJ, 2006, IEICE T INF SYST, VE89D, P1116, DOI 10.1093/ietisy/e89-d.3.1116
Kim SJ, 2006, IEEE T CONSUM ELECTR, V52, P1384, DOI 10.1109/TCE.2006.273160
Kim SJ, 2007, IEICE T INF SYST, VE90D, P378, DOI 10.1093/ietisy/e90-d.1.378
King S., 2008, P INT, P1869
KISHIMOTO Y, 2003, P SPRING M ASJ, P243
Kishore S. P., 2003, P EUROSPEECH 2003 GE, P1317
Koishida K, 2001, IEICE T INF SYST, VE84D, P1427
KOMINEK J, 2006, P BLIZZ CHALL WORKSH
Kominek J., 2003, CMULTI03177
KRSTULOVIC S, 2008, P SPEECH PROS, P67
Krstulovic S., 2007, P INT 2007 ANTW BELG, P1897
Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308
Latorre J, 2006, SPEECH COMMUN, V48, P1227, DOI 10.1016/j.specom.2006.05.003
LATORRE J, 2007, P ICASSP, P1241
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Levinson S. E., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80009-2
LIANG H, 2008, P ICASSP, P4641
Ling Z.H., 2008, P INT, P573
Ling Z.-H, 2007, P BLIZZ CHALL WORKSH
LING ZH, 2006, P INT 2006 SEP, P2034
LING ZH, 2008, P ISCSLP, P5
LING ZH, 2008, P IEEE INT C AC SPEE, P3949
LING ZH, 2007, P ICASSP, P1245
Ling Z.-H., 2006, P BLIZZ CHALL WORKSH
LU H, 2009, P ICASSP, P4033
Lundgren A., 2005, THESIS ROYAL I TECHN
MAIA R, 2008, P ICASSP, P3965
Maia R., 2007, P ISCA SSW6, P131
MAIA R, 2009, P SPRING M ASJ, P311
Maia R., 2003, P EUROSPEECH, P2465
Martincic-Ipsic S., 2006, Journal of Computing and Information Technology - CIT, V14, DOI 10.2498/cit.2006.04.06
MARUME M, 2006, P AUT M ASJ, P185
MASUKO T, 2003, P AUT M ASJ, P209
Masuko T., 1997, P ICASSP, P1611
MATSUDA S, 2003, IEICE T INFORM SY D2, V86, P741
Miyanaga K., 2004, P INTERSPEECH 2004 I, P1437
Mizutani N., 2002, P AUT M ASJ, P241
Morioka Y., 2004, P AUT M ASJ, P325
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
Nakamura K., 2006, P ICASSP, P93
NAKAMURA K, 2007, THESIS NAGOYA I TECH
Nakatani N., 2006, P SPECOM, P261
NANKAKU Y, 2003, TECH REP IEICE, V103, P19
Nose T., 2007, P INTERSPEECH 2007 A, P2285
Nose T, 2007, IEICE T INF SYST, VE90D, P1406, DOI 10.1093/ietisy/e90-d.9.1406
NOSE T, 2009, IEICE T INFORM SYS D, V92, P489
Odell J.J., 1995, THESIS U CAMBRIDGE
Ogata K., 2006, P INTERSPEECH 2006 I, P1328
OJALA T, 2006, THESIS HELSINKI U TE
Okubo T, 2006, IEICE T INF SYST, VE89D, P2775, DOI 10.1093/ietisy/e89-d.11.2775
Olsen PA, 2004, IEEE T SPEECH AUDI P, V12, P37, DOI 10.1109/TSA.2003.819943
OURA K, 2008, P AUT M ASJ, P421
OURA K, 2008, P ISCSLP, P1
OURA K, 2007, P AUT M ASJ, P367
PENNY W, 1998, H MARKOV MODELS EXTE
PLUMPE M, 1998, P ICSLP, P2751
POLLET V, 2008, P INT, P1825
Povey D., 2003, THESIS U CAMBRIDGE
QIAN Y, 2006, P ISCSLP, P223
QIAN Y, 2008, P ISCSLP, P13
Qian Y., 2008, P INT, P2126
QIAN Y, 2009, P ICASSP, P3781
QIN L, 2006, P INT, P2250
QIN L, 2008, P ICASSP MAR, P3953
RAITIO T, 2008, P INT, P1881
RICHARDS H, 1999, P ICASSP, V1, P357
Rissanen J., 1980, STOCHASTIC COMPLEXIT
Ross K., 1994, P ESCA IEEE WORKSH S, P131
ROSTI A, 2003, CUEDFINFENGTR461 U C
Rosti AVI, 2004, COMPUT SPEECH LANG, V18, P181, DOI 10.1016/j.csl.2003.09.004
Rouibia S., 2005, P INT, P2565
Russell M. J., 1985, P IEEE INT C AC SPEE, P5
Sagisaka Y., 1992, P ICSLP, P483
Sakai S., 2005, P INT 2005 LISB PORT, P81
Sakti S., 2008, P OR COCOSDA, P215
SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136
Segi H., 2004, P 5 ISCA SPEECH SYNT, P115
SHERPA U, 2008, P OR COCOSDA, P150
SHI Y, 2002, P ICSLP, P2369
Shichiri K., 2002, P ICSLP, P1269
Shinoda K, 2001, IEEE T SPEECH AUDI P, V9, P276, DOI 10.1109/89.906001
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
SHINTA Y, 2005, P RES WORKSH HOK AR
SILEN H, 2008, P INT, P1853
SONDHI M, 2002, P IEEE SPEECH SYNTH
STYLIANOU Y, 1999, P ICASSP, P377
Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472
SUN JW, 2009, P ICASSP, P4021
*SVOX AG, 2007, SVOX ANN SVOX PIC RE
*SVOX AG, 2008, SVOX REL PIC HIGH QU
SYKORA T, 2006, THESIS SLOVAK U TECH
Tachibana M, 2006, IEICE T INF SYST, VE89D, P1092, DOI 10.1093/ietisy/e89-d.3.1092
TACHIBANA M, 2008, P ICASSP 2008 APR, P4633
Tachibana M, 2005, IEICE T INF SYST, VE88D, P2484, DOI 10.1093/ietisy/e88-d.11.2484
TACHIWA W, 1999, P SPRING M AC SOC JA, P239
TAKAHASHI JI, 1995, P ICASSP 95, P696
TAKAHASHI T, 2001, P AUT M ASJ OCT, V1, P5
Tamura M., 2001, P ICASSP 2001, P805
TAMURA M, 2005, P ICASSP, P351
TAYLOR P, 2006, P INT, P1758
TAYLOR P, 1999, P EUR, P1531
TIOMKIN S, 2008, P INT, P1841
Toda T., 2004, P 5 ISCA SPEECH SYNT, P31
TODA T, 2009, P ICASSP, P4025
Toda T, 2007, IEEE T AUDIO SPEECH, V15, P2222, DOI 10.1109/TASL.2007.907344
TODA T, 2008, P ICASSP, P3925
Toda T, 2007, IEICE T INF SYST, VE90D, P816, DOI 10.1093/ietisy/e90-d.5.816
Tokuda K., 2008, HMM BASED SPEECH SYN
TOKUDA K, 2005, P INT EUR, P77
TOKUDA K, 1995, P ICASSP, P660
Tokuda K, 2002, IEICE T INF SYST, VE85D, P455
Tokuda K., 2002, P IEEE SPEECH SYNTH
TOKUDA K, 2000, P ICASSP, V3, P1315
Toth B., 2008, INFOCOMMUNICATIONS, VLXIII, P30
VAINIO M, 2005, P 2 BALT C HLT, P201
VESNICER B, 2004, P TSD2004, P513
Wang C., 2008, P ISCSLP, P129
Watanabe S, 2004, IEEE T SPEECH AUDI P, V12, P365, DOI 10.1109/TSA.2004.828640
Watanabe S., 2007, P IEEE S FDN COMP IN, P383
WATANABE T, 2007, P AUT M ASJ, P209
WEISS C, 2005, P ESSP
WOUTERS J, 2000, P ICSLP, P302
Wu Y., 2008, P ICASSP LAS VEG US, P4621
Wu Y. J., 2008, P ISCSLP, P9
WU YJ, 2006, P INT, P2046
Wu Y.-J., 2008, P INT, P577
WU YJ, 2006, P ICASSP, P89
WU YJ, 2009, P ICASSP, P4013
Wu Yijian, 2006, Journal of Chinese Information Processing, V20, P75
YAMAGISHI J, 2008, P ICASSP, P3957
Yamagishi J., 2008, P BLIZZ CHALL WORKSH
Yamagishi J., 2008, P INT 08 BRISB AUSTR, P581
Yamagishi J, 2006, THESIS TOKYO I TECHN
Yamagishi J, 2009, IEEE T AUDIO SPEECH, V17, P66, DOI 10.1109/TASL.2008.2006647
Yamagishi J, 2007, IEICE T INF SYST, VE90D, P533, DOI 10.1093/ietisy/e90-d.2.533
Yang J.-H., 2006, P BLIZZ CHALL WORKSH
Yoshimura T., 1997, P EUR, P2523
Yoshimura T., 2001, P EUROSPEECH, P2263
Yoshimura T, 1998, P ICSLP, P29
Yoshimura T, 1999, P EUR, P2347
YOUNG S, 2006, H MARKOV MODEL TOOLK
YU J, 2007, P ICASSP, P709
YU K, 2009, P ICASSP, P3773
YU ZP, 2008, P ICSP
Zen H., 2007, P INT, P2065
ZEN H, 2003, TRSLT0032 ATRSLT
ZEN H, 2008, P INT, P1068
Zen H, 2007, P 6 ISCA WORKSH SPEE, P294
Zen H., 2006, P BLIZZ CHALL WORKSH
Zen H, 2007, IEICE T INF SYST, VE90D, P825, DOI 10.1093/ietisy/e90-d.5.825
Zen H, 2007, COMPUT SPEECH LANG, V21, P153, DOI 10.1016/j.csl.2006.01.002
Zen H., 2003, P EUR, P3189
Zen H., 2006, P INT, P2274
Zen HG, 2007, IEICE T INF SYST, VE90D, P325, DOI 10.1093/ietisy/e90-d.1.325
ZHANG L, 2009, THESIS U EDINBURGH
NR 238
TC 184
Z9 190
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2009
VL 51
IS 11
BP 1039
EP 1064
DI 10.1016/j.specom.2009.04.004
PG 26
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 498SR
UT WOS:000270164000001
ER
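One core computation in HMM-based statistical parametric synthesis, the speech parameter generation step of the cited Tokuda et al. work, can be written compactly: the static trajectory that maximizes the likelihood of static-plus-delta observations is c = (W' S^-1 W)^-1 W' S^-1 mu. The sketch below solves this for a one-dimensional toy example with invented means and variances; it is not drawn from the review itself.
# Minimal sketch of maximum-likelihood parameter generation for a 1-D static
# trajectory given per-frame means and variances of static and delta features.
# The numbers are toy values, not the output of a trained model.
import numpy as np

T = 5
mu_static = np.array([0.0, 1.0, 2.0, 2.0, 1.0])
mu_delta = np.array([0.5, 0.8, 0.3, -0.3, -0.5])
var_static = np.full(T, 0.2)
var_delta = np.full(T, 0.1)

# Build W: rows 0..T-1 pick the static value, rows T..2T-1 compute
# delta_t = 0.5 * (c_{t+1} - c_{t-1}) with simple edge handling.
W = np.zeros((2 * T, T))
for t in range(T):
    W[t, t] = 1.0
    lo, hi = max(t - 1, 0), min(t + 1, T - 1)
    W[T + t, hi] += 0.5
    W[T + t, lo] -= 0.5

mu = np.concatenate([mu_static, mu_delta])
prec = np.diag(np.concatenate([1.0 / var_static, 1.0 / var_delta]))

# Maximum-likelihood static trajectory: c = (W' S^-1 W)^-1 W' S^-1 mu.
A = W.T @ prec @ W
b = W.T @ prec @ mu
c = np.linalg.solve(A, b)
print(np.round(c, 3))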
PT J
AU Gonon, G
Bimbot, F
Gribonval, R
AF Gonon, Gilles
Bimbot, Frederic
Gribonval, Remi
TI Probabilistic scoring using decision trees for fast and scalable speaker
recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker recognition; Decision trees; Embedded systems; Resource
constraint; Biometric authentication
ID VERIFICATION; ALGORITHM
AB In the context of fast and low cost speaker recognition, this article investigates several techniques based on decision trees. A new approach is introduced where the trees are used to estimate a score function rather than returning a decision among classes. This technique is developed to approximate the GMM log-likelihood ratio (LLR) score function. On top of this approach, different solutions are derived to improve the accuracy of the proposed trees. The first one studies the quantization of the LLR function to create classification trees on the LLR values. The second one makes use of knowledge on the GMM distribution of the acoustic features in order to build oblique trees. A third extension consists in using a low-complexity score function in each of the tree leaves. Series of comparative experiments are performed on the NIST 2005 speaker recognition evaluation data in order to evaluate the impact of the proposed improvements in terms of efficiency, execution time and algorithmic complexity. Considering a baseline system with an Equal Error Rate (EER) of 9.6% on the NIST 2005 evaluation, the best tree-based configuration achieves an EER of 12.9%, with a computational cost adapted to embedded devices and an execution time suitable for real-time speaker identification. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Bimbot, Frederic] IRISA METISS, CNRS, F-35042 Rennes, France.
IRISA METISS, INRIA, F-35042 Rennes, France.
RP Bimbot, F (reprint author), IRISA METISS, CNRS, Campus Univ Beaulieu, F-35042 Rennes, France.
EM Gilles.Gonon@irisa.fr; Frederic.Bimbot@irisa.fr; Remi.Gribonval@irisa.fr
CR AUCKENTHALER R, 2001, ODYSSEY, P83
BARRAS C, 2003, IEEE INT C AC SPEECH, V2, P49
Bengio S., 2004, ODYSSEY 2004 SPEAK L, P237
Bengio S., 2001, P IEEE INT C AC SPEE, V1, P425
BLOUET R, 2002, THESIS U RENNES 1
BLOUET R, 2001, WORKSH SPEAK OD, P223
Bocchieri E., 1993, ICASSP, V2, P692, DOI 10.1109/ICASSP.1993.319405
Campbell WM, 2002, INT CONF ACOUST SPEE, P161
CAMPBELL WM, 2004, ODYSS SPEAK LANG REC, P41
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
Farrell KR, 1994, IEEE T SPEECH AUDI P, V2, P194, DOI 10.1109/89.260362
FRITSCH J, 1996, IEEE INT C AC SPEECH, V2, P837
FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530
Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P152, DOI 10.1109/89.748120
Gauvain J., 1994, IEEE T SPEECH AUDIO, V2
GONON G, 2005, 9 EUR C SPEECH COMM, V4, P2661
Lim TS, 2000, MACH LEARN, V40, P203, DOI 10.1023/A:1007608224229
MCLAUGHLIN J, 1999, EUROSPEECH 99, V3, P1215
MOON YS, 2003, WBMA 03, P53
MURTHY SK, 1994, J ARTIF INTELL RES, V2, P1
MURVEIT H, 1994, HLT 94, P393
*NIST, 2005, NIST YEAR 2005 SPEAK
Olshen R., 1984, CLASSIFICATION REGRE, V1st
Pellom BL, 1998, IEEE SIGNAL PROC LET, V5, P281, DOI 10.1109/97.728467
Reynolds D. A., 2000, DIGITAL SIGNAL PROCE, V10
Schapire RE, 2003, LECT NOTES STAT, V171, P149
WAN V, 2003, INT C AC SPEECH SIGN, V2, P221
Xiang B, 2003, IEEE T SPEECH AUDI P, V11, P447, DOI 10.1109/TSA.2003.815822
XIANG B, 2002, P ICASSP, V1, P681
XIE Y, 2006, WORLD C INT CONTR AU, V2, P9463
XIONG Z, 2005, IEEE ICASSP 05, V1, P625
NR 31
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2009
VL 51
IS 11
BP 1065
EP 1081
DI 10.1016/j.specom.2009.02.007
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 498SR
UT WOS:000270164000002
ER
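The central idea of approximating a GMM log-likelihood-ratio score function with a tree can be illustrated with off-the-shelf components: fit two toy GMMs, compute frame-level LLR targets, and train a regression tree to predict the LLR directly from the feature vector. This uses axis-aligned sklearn trees and random data rather than the oblique trees and leaf score functions proposed in the article.
# Sketch under assumptions (toy GMMs, sklearn regression tree): approximate a GMM
# log-likelihood-ratio score so that scoring needs only a few comparisons per frame
# instead of a full GMM evaluation.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Toy "speaker" and "background" data and their GMMs.
spk = rng.normal(loc=1.0, size=(2000, 4))
ubm = rng.normal(loc=0.0, size=(2000, 4))
gmm_spk = GaussianMixture(n_components=4, random_state=0).fit(spk)
gmm_ubm = GaussianMixture(n_components=4, random_state=0).fit(ubm)

# Training targets: frame-level LLR on a mix of both kinds of frames.
train = np.vstack([spk, ubm])
llr = gmm_spk.score_samples(train) - gmm_ubm.score_samples(train)

# The tree learns to predict the LLR directly from the feature vector.
tree = DecisionTreeRegressor(max_depth=8).fit(train, llr)

test = rng.normal(loc=1.0, size=(500, 4))
true_llr = (gmm_spk.score_samples(test) - gmm_ubm.score_samples(test)).mean()
fast_llr = tree.predict(test).mean()
print(f"utterance LLR  exact={true_llr:.2f}  tree approx={fast_llr:.2f}")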
PT J
AU van Santen, JPH
Prud'hommeaux, ET
Black, LM
AF van Santen, Jan P. H.
Prud'hommeaux, Emily Tucker
Black, Lois M.
TI Automated assessment of prosody production
SO SPEECH COMMUNICATION
LA English
DT Article
DE Prosody; Automated assessment; Acoustic analysis; Speech pathology;
Language pathology
ID DIAGNOSTIC-OBSERVATION-SCHEDULE; HIGH-FUNCTIONING AUTISM; REVISED
ALGORITHMS; NORMAL SPEAKERS; SPEECH; CHILDREN; ENGLISH; COMMUNICATION;
REPLICATION; DURATION
AB Assessment of prosody is important for diagnosis and remediation of speech and language disorders, for diagnosis of neurological conditions, and for foreign language instruction. Current assessment is largely auditory-perceptual, which has obvious drawbacks; however, automation of assessment faces numerous obstacles. We propose methods for automatically assessing production of lexical stress, focus, phrasing, pragmatic style, and vocal affect. Speech was analyzed from children in six tasks designed to elicit specific prosodic contrasts. The methods involve dynamic and global features, using spectral, fundamental frequency, and temporal information. The automatically computed scores were validated against mean scores from judges who, in all but one task, listened to "prosodic minimal pairs" of recordings, each pair containing two utterances from the same child with approximately the same phonemic material but differing on a specific prosodic dimension, such as stress. The judges identified the prosodic categories of the two utterances and rated the strength of their contrast. For almost all tasks, we found that the automated scores correlated with the mean scores approximately as well as the judges' individual scores. Real-time scores assigned during examination - as is fairly typical in speech assessment - correlated with the mean scores substantially less well than the automated scores did. (C) 2009 Elsevier B.V. All rights reserved.
C1 [van Santen, Jan P. H.; Prud'hommeaux, Emily Tucker; Black, Lois M.] Oregon Hlth & Sci Univ, CSLU, Div Biomed Comp Sci BMCS, Sch Med, Beaverton, OR 97006 USA.
RP van Santen, JPH (reprint author), Oregon Hlth & Sci Univ, CSLU, Div Biomed Comp Sci BMCS, Sch Med, 20000 NW Walker Rd, Beaverton, OR 97006 USA.
EM vansanten@cslu.ogi.edu
FU National Institute on Deafness and Other Communicative Disorders [NIDCD
1R01DC007129-01]; National Science Foundation [IIS-0205731];
AutismSpeaks
FX We thank Sue Peppe for making available the pictorial stimuli and for
granting permission to us for creating new versions of several PEPS-C
tasks; Lawrence Shriberg for helpful comments on an earlier draft of the
paper, in particular on the LSR section; Rhea Paul for helpful comments
on an earlier draft of the paper and for suggesting the Lexical Stress
and Pragmatic Style Tasks; the clinical staff at CSLU (Beth Langhorst,
Rachel Coulston, and Robbyn Sanger Hahn) and at Yale University (Nancy
Fredine, Moira Lewis, Allyson Lee) for data collection; senior
programmer Jacques de Villiers for the data collection software and data
management architecture; Meg Mitchell and Justin Holguin for speech
transcription; and the parents and children for participating in the
study. This research was supported by grants from the National Institute
on Deafness and Other Communicative Disorders, NIDCD 1R01DC007129-01 and
from the National Science Foundation, IIS-0205731, both to van Santen;
by a Student Fellowship from AutismSpeaks to Emily Tucker Prud'hommeaux;
and by an Innovative Technology for Autism grant from AutismSpeaks to
Brian Roark. The views herein are those of the authors and reflect the
views neither of the funding agencies nor of any of the individuals
acknowledged.
CR ADAMI AG, 2003, P ICASSP 03
Barlow R., 1972, STAT INFERENCE ORDER
BERK S, 1983, J COMMUN DISORD, V16, P49, DOI 10.1016/0021-9924(83)90026-6
Berument SK, 1999, BRIT J PSYCHIAT, V175, P444, DOI 10.1192/bjp.175.5.444
CARDOZO BL, 1968, IEEE T ACOUST SPEECH, VAU16, P159, DOI 10.1109/TAU.1968.1161978
Cohen J, 1960, EDUC PSYCHOL MEAS, V20, P46
DARLEY FL, 1969, J SPEECH HEAR RES, V12, P462
DARLEY FL, 1969, J SPEECH HEAR RES, V12, P246
Dollaghan C, 1998, J SPEECH LANG HEAR R, V41, P1136
Ekman P., 1976, PICTURES FACIAL AFFE
*EL, 1993, MULT VOIC PROGR MDVP
*ENTR RES LAB INC, 1996, ESPS PROGR A L
GILLBERG C, 1998, ASPERGER SYNDROME HI, P79
Gotham K, 2007, J AUTISM DEV DISORD, V37, P613, DOI 10.1007/s10803-006-0280-1
Gotham K, 2008, J AM ACAD CHILD PSY, V47, P642, DOI 10.1097/CHI.0b013e31816bffb7
Grabe E., 2002, P SPEECH PROS 2002 C, P343
Grabe E, 2000, J PHONETICS, V28, P161, DOI 10.1006/jpho.2000.0111
HIRSCHBERG J, 2000, FESTSCHRIFT HONOR G
Hirschberg J., 1995, P 13 INT C PHON SCI, V2, P36
HOUSE A, 1987, J NEUROL NEUROSUR PS, V50, P910, DOI 10.1136/jnnp.50.7.910
Kent RD, 1996, AM J SPEECH-LANG PAT, V5, P7, DOI DOI 10.1044/1058-0360.0503.07
Klabbers E., 2007, P 6 ISCA WORKSH SPEE, P339
KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986
Le Dorze G, 1998, FOLIA PHONIATR LOGO, V50, P1, DOI 10.1159/000021444
Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686
Lord C, 2000, J AUTISM DEV DISORD, V30, P205, DOI 10.1023/A:1005592401947
Mackey LS, 1997, J SPEECH LANG HEAR R, V40, P349
Marasek K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607920
McCann J, 2003, INT J LANG COMM DIS, V38, P325, DOI 10.1080/1368282031000154204
MIAO Q, 2006, EFFECTS PROSODIC FAC
MILROY L, 1978, SOCIOLINGUISTIC PATT, P19
MURRY T, 1980, J SPEECH HEAR RES, V23, P361
Ofuka E, 2000, SPEECH COMMUN, V32, P199, DOI 10.1016/S0167-6393(00)00009-1
OFUKA E, 1994, ICSLP 1994, P1447
Paul R., 2005, J AUTISM DEV DISORD, V35, P201
Peppe S, 2003, CLIN LINGUIST PHONET, V17, P345, DOI 10.1080/0269920031000079994
Peppe S, 2006, J PRAGMATICS, V38, P1776, DOI 10.1016/j.pragma.2005.07.004
Peppe S, 2007, J SPEECH LANG HEAR R, V50, P1015, DOI 10.1044/1092-4388(2007/071)
PLANT G, 1986, STL QPSR, V27, P65
PRUDHOMMEAUX ET, 2008, INT M AUT RES 2008 L
ROARK R, 2007, P 2 INT C TECHN AG I
*ROYAL I TECHN, 2006, SNACK SOUND TOOLK
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
Shriberg LD, 2001, PEPPER PROGRAMS EXAM
Shriberg LD, 2006, J SPEECH LANG HEAR R, V49, P500, DOI 10.1044/1092-4388(2006/038)
Shriberg LD, 2003, CLIN LINGUIST PHONET, V17, P549, DOI 10.1080/0269920031000138123
Siegel DJ, 1996, J AUTISM DEV DISORD, V26, P389, DOI 10.1007/BF02172825
SILVERMAN K, 1993, P 1993 EUR, V3, P2169
Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955
SONMEZ K, 1998, P ICSLP 1998
Sutton S, 1998, P INT C SPOK LANG PR, P3221
TVERSKY A, 1977, PSYCHOL REV, V84, P327, DOI 10.1037/0033-295X.84.4.327
van Santen J., 2006, ITALIAN J LINGUISTIC, V18, P161
Van Santen J.P.H., 1994, P INT C SPOK LANG PR, P719
VANSANTEN J, 1999, P EUR 99 BUD HUNG
VANSANTEN J, 2002, 4 IEEE WORKSH SPEECH
VANSANTEN J, 2008, INT M AUT RES 2008 L
VANSANTEN J, 2007, INT M AUT RES 2007 S
VANSANTEN J, 2000, INTONATION ANAL MODE, P69
VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5
van Santen JPH, 2000, J ACOUST SOC AM, V107, P1012, DOI 10.1121/1.428281
van Santen J. P. H., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1004
WINGFIELD A, 1984, J SPEECH HEAR RES, V7, P128
Zhang Y, 2008, J VOICE, V22, P1, DOI 10.1016/j.jvoice.2006.08.003
2002, DSM 4 TR DIAGNOSTIC
NR 65
TC 11
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2009
VL 51
IS 11
BP 1082
EP 1097
DI 10.1016/j.specom.2009.04.007
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 498SR
UT WOS:000270164000003
ER
PT J
AU Derakhshan, N
Akbari, A
Ayatollahi, A
AF Derakhshan, Nima
Akbari, Ahmad
Ayatollahi, Ahmad
TI Noise power spectrum estimation using constrained variance spectral
smoothing and minima tracking
SO SPEECH COMMUNICATION
LA English
DT Article
DE Adaptive smoothing; Noise power spectrum estimation; Speech enhancement
ID SPEECH ENHANCEMENT; AMPLITUDE ESTIMATOR; DENSITY-ESTIMATION;
ENVIRONMENTS; STATISTICS
AB In this paper, we propose a new noise estimation algorithm based on tracking the minima of an adaptively smoothed noisy short-time power spectrum (STPS). The heart of the proposed algorithm is a constrained variance smoothing (CVS) filter, which smoothes the noisy STPS independently of the noise level. The proposed smoothing procedure is capable of tracking the non-stationary behavior of the noisy STPS while reducing its variance. The minima of the smoothed STPS are tracked with a low delay and are used to construct voice activity detectors (VAD) in frequency bins. Finally, the noise power spectrum is estimated by averaging the noisy STPS over the noise-only regions. Experiments show that the proposed noise estimation algorithm possesses a very short delay in tracking the non-stationary behavior of the noise. When the proposed algorithm is utilized in a noise reduction system, it exhibits superior performance over other recently proposed noise estimation algorithms. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Derakhshan, Nima; Akbari, Ahmad] Iran Univ Sci & Technol, Res Ctr Informat Technol, Dept Comp Engn, Tehran 16844, Iran.
[Derakhshan, Nima; Ayatollahi, Ahmad] Iran Univ Sci & Technol, Dept Elect Engn, Tehran 16844, Iran.
RP Derakhshan, N (reprint author), Iran Univ Sci & Technol, Res Ctr Informat Technol, Dept Comp Engn, Tehran 16844, Iran.
EM nima_derakhshan@ee.iust.ac.ir
CR [Anonymous], 1993, P56 ITUT
[Anonymous], 2001, P862 ITUT
Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544
Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717
DERAKHSHAN N, 2007, P 12 INT C SPEECH CO, P542
Doblinger G., 1995, P 4 EUR C SPEECH COM, P1513
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Lynch J. F. Jr., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0)
Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915
Martin R., 1994, P 7 EUR SIGN PROC C, P1182
Martin R, 2006, SIGNAL PROCESS, V86, P1215, DOI 10.1016/j.sigpro.2005.07.037
Paajanen E., 2002, P IEEE WORKSH SPEECH, P77
Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005
Rohdenburg T., 2005, P 9 INT WORKSH AC EC, P169
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
NR 17
TC 3
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2009
VL 51
IS 11
BP 1098
EP 1113
DI 10.1016/j.specom.2009.04.008
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 498SR
UT WOS:000270164000004
ER
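A generic minima-tracking noise estimator in the spirit of the abstract above (recursive smoothing per frequency bin followed by a sliding-window minimum) is sketched below. It does not implement the constrained variance smoothing filter or the bin-wise voice activity detectors of the paper; the smoothing constant and window length are arbitrary.
# Rough illustration (a generic minima-tracking scheme, not the authors' CVS filter):
# recursively smooth the noisy short-time power spectrum per frequency bin and take
# a sliding-window minimum as the noise power estimate.
import numpy as np

def estimate_noise_psd(noisy_stps, alpha=0.85, win=40):
    """noisy_stps: array (frames, bins) of short-time power spectra."""
    n_frames, n_bins = noisy_stps.shape
    smoothed = np.zeros_like(noisy_stps)
    noise = np.zeros_like(noisy_stps)
    smoothed[0] = noisy_stps[0]
    for t in range(1, n_frames):
        smoothed[t] = alpha * smoothed[t - 1] + (1 - alpha) * noisy_stps[t]
    for t in range(n_frames):
        lo = max(0, t - win + 1)
        noise[t] = smoothed[lo:t + 1].min(axis=0)  # minimum over the recent past
    return noise

# Toy test: stationary noise floor plus an occasional "speech" burst.
rng = np.random.default_rng(0)
stps = rng.exponential(1.0, size=(200, 8))
stps[50:70] += 20.0  # speech-like energy burst
print(estimate_noise_psd(stps)[100].round(2))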
PT J
AU Taft, DA
Grayden, DB
Burkitt, AN
AF Taft, D. A.
Grayden, D. B.
Burkitt, A. N.
TI Speech coding with traveling wave delays: Desynchronizing cochlear
implant frequency bands with cochlea-like group delays
SO SPEECH COMMUNICATION
LA English
DT Article
DE Cochlear implant; Phase; Envelope; Traveling wave delay
ID NORMAL-HEARING; RECOGNITION; LISTENERS; STRATEGIES; HUMANS; NOISE; EAR;
MAP
AB Traveling wave delays are the frequency-dependent delays for sounds along the cochlear partition. In this study, a set of suitable delays was calibrated to cochlear implant users' pitch perception along the implanted electrode array. These delays were then explicitly coded in a cochlear implant speech processing strategy as frequency specific group delays. The envelopes of low frequency filter bands were delayed relative to high frequencies, with amplitude and fine structure unmodified. Incorporating such delays into subjects' own processing strategies in this way produced a significant improvement in speech perception scores in noise. A subsequent investigation indicated that perceptual sensitivity to changes in delay size was low and so accurate delay calibration may not be necessary. It is proposed that contention between broadband envelope cues is reduced when frequency bands are de-synchronized. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Taft, D. A.; Grayden, D. B.; Burkitt, A. N.] Bion Ear Inst, Melbourne, Vic 3002, Australia.
[Taft, D. A.; Grayden, D. B.; Burkitt, A. N.] Univ Melbourne, Dept Elect & Elect Engn, Melbourne, Vic 3010, Australia.
[Taft, D. A.] Univ Melbourne, Dept Otolaryngol, Melbourne, Vic 3010, Australia.
RP Taft, DA (reprint author), Bion Ear Inst, 384-388 Albert St, Melbourne, Vic 3002, Australia.
EM dtaft@bionicear.org; grayden@unimelb.edu.au; aburkitt@unimelb.edu.au
RI Burkitt, Anthony/N-9077-2013
OI Burkitt, Anthony/0000-0001-5672-2772
FU Harold Mitchell Foundation
FX This research was supported by The Harold Mitchell Foundation.
CR ARAI T, 1998, P 1998 IEEE INT C AC, V2, P933, DOI 10.1109/ICASSP.1998.675419
Baumann U, 2006, HEARING RES, V213, P34, DOI 10.1016/j.heares.2005.12.010
BEKESY GV, 1960, WAVE MOTION COCHLEA
Blamey P.J., 1996, HEARING RES, V99, P139
Boex C, 2006, JARO-J ASSOC RES OTO, V7, P110, DOI 10.1007/s10162-005-0027-2
Cainer KE, 2008, HEARING RES, V238, P155, DOI 10.1016/j.heares.2007.10.001
DALLOS P, 1992, J NEUROSCI, V12, P4575
Dawson PW, 2007, EAR HEARING, V28, P163, DOI 10.1097/AUD.0b013e3180312651
DONALDSON GS, 1993, J ACOUST SOC AM, V93, P940, DOI 10.1121/1.405454
Dorman MF, 2007, JARO-J ASSOC RES OTO, V8, P234, DOI 10.1007/s10162-007-0071-1
Elberling C, 2007, J ACOUST SOC AM, V122, P2772, DOI 10.1121/1.2783985
Friesen LM, 2001, J ACOUST SOC AM, V110, P1150, DOI 10.1121/1.1381538
Fu QJ, 2001, J ACOUST SOC AM, V109, P1166, DOI 10.1121/1.1344158
GREENWOOD DD, 1990, J ACOUST SOC AM, V87, P2592, DOI 10.1121/1.399052
Healy EW, 2002, J SPEECH LANG HEAR R, V45, P1262, DOI 10.1044/1092-4388(2002/101)
Javel E, 2000, HEARING RES, V140, P45, DOI 10.1016/S0378-5955(99)00186-0
Loizou PC, 1998, IEEE SIGNAL PROC MAG, V15, P101, DOI 10.1109/79.708543
MCDERMOTT H, 2009, AUDIOL NEUROTOL S1, V14
MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526
Plant Kerrie L, 2002, Cochlear Implants Int, V3, P104, DOI 10.1002/cii.56
Rubinstein Jay T, 2004, Curr Opin Otolaryngol Head Neck Surg, V12, P444, DOI 10.1097/01.moo.0000134452.24819.c0
Skinner MW, 2002, EAR HEARING, V23, P207, DOI 10.1097/00003446-200206000-00005
Sridhar Divya, 2006, Audiol Neurootol, V11 Suppl 1, P16, DOI 10.1159/000095609
Stakhovskaya O, 2007, JARO-J ASSOC RES OTO, V8, P220, DOI 10.1007/s10162-007-0076-9
Vandali AE, 2005, J ACOUST SOC AM, V117, P3126, DOI 10.1121/1.1874632
Wilson RH, 2007, J SPEECH LANG HEAR R, V50, P844, DOI 10.1044/1092-4388(2007/059)
NR 26
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2009
VL 51
IS 11
BP 1114
EP 1123
DI 10.1016/j.specom.2009.05.002
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 498SR
UT WOS:000270164000005
ER
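The processing change described above amounts to applying frequency-dependent group delays to channel envelopes. The sketch below delays low-frequency (apical) channel envelopes relative to high-frequency ones by hypothetical delay values and leaves envelope amplitude untouched; in the study the delays were calibrated to each listener's pitch perception.
# Simplified sketch (hypothetical delay values): apply frequency-dependent group
# delays to cochlear-implant channel envelopes, delaying low-frequency bands
# relative to high-frequency bands.
import numpy as np

def delay_envelopes(envelopes, delays_ms, frame_rate_hz=1000.0):
    """envelopes: array (channels, frames), channel 0 = lowest frequency."""
    out = np.zeros_like(envelopes)
    for ch, env in enumerate(envelopes):
        shift = int(round(delays_ms[ch] * frame_rate_hz / 1000.0))
        if shift > 0:
            out[ch, shift:] = env[:len(env) - shift]  # delayed copy, zero-padded start
        else:
            out[ch] = env
    return out

n_ch, n_frames = 8, 500
envs = np.abs(np.random.randn(n_ch, n_frames))
# Larger delays for apical (low-frequency) channels, none for the most basal one.
delays_ms = np.linspace(8.0, 0.0, n_ch)
delayed = delay_envelopes(envs, delays_ms)
print(delayed.shape)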
PT J
AU Van Segbroeck, M
Van Hamme, H
AF Van Segbroeck, Maarten
Van Hamme, Hugo
TI Unsupervised learning of time-frequency patches as a noise-robust
representation of speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Acoustic signal analysis; Language acquisition; Matrix factorization;
Automatic speech recognition; Noise robustness
ID NONNEGATIVE MATRIX FACTORIZATION; REASSIGNMENT; SPECTROGRAM; SEPARATION
AB We present a self-learning algorithm using a bottom-up based approach to automatically discover, acquire and recognize the words of a language. First, an unsupervised technique using non-negative matrix factorization (NMF) discovers phone-sized time-frequency patches into which speech can be decomposed. The input matrix for the NMF is constructed for static and dynamic speech features using a spectral representation of both short and long acoustic events. By describing speech in terms of the discovered time-frequency patches, patch activations are obtained which express to what extent each patch is present across time. We then show that speaker-independent patterns appear to recur in these patch activations and how they can be discovered by applying a second NMF-based algorithm on the co-occurrence counts of activation events. By providing information about the word identity to the learning algorithm, the retrieved patterns can be associated with meaningful objects of the language. In the case of a small vocabulary task, the system is able to learn patterns corresponding to words and subsequently detects the presence of these words in speech utterances. We illustrate that, without the prior expert knowledge about speech required by conventional automatic speech recognition, the learning algorithm achieves promising accuracy and noise robustness. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Van Segbroeck, Maarten; Van Hamme, Hugo] Katholieke Univ Leuven, Dept ESAT, B-3001 Louvain, Belgium.
RP Van Segbroeck, M (reprint author), Katholieke Univ Leuven, Dept ESAT, Kasteelpk Arenberg 10, B-3001 Louvain, Belgium.
EM maarten.vansegbroeck@esat.kuleuven.be; hugo.vanhamme@esat.kuleuven.be
RI Van hamme, Hugo/D-6581-2012
FU Institute for the Promotion of Innovation through Science and Technology
in Flanders, Belgium; European Commission [FP6-034362]
FX This research was funded by the Institute for the Promotion of
Innovation through Science and Technology in Flanders, Belgium
(I.W.T.-Vlaanderen) and by the European Commission under contract
FP6-034362 (ACORNS).
CR [Anonymous], 2000, 202050 ETSI ES
[Anonymous], 2000, 201108 ETSI ES
AUGER F, 1995, IEEE T SIGNAL PROCES, V43, P1068, DOI 10.1109/78.382394
Aversano G., 2001, P 44 IEEE MIDW S CIR, V2, P516, DOI 10.1109/MWSCAS.2001.986241
BAKER JM, 2006, MINDS HIST DEV FUTUR
BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W
Chomsky N, 2000, NEW HORIZONS STUDY L
Ding C., 2006, LBNL60428
Ezzat T., 2007, P INT C SPOK LANG PR, P506
Hainsworth S. W., 2003, CUEDFINFENGTR459
Hermansky H., 2005, P INT, P361
Hermansky H., 1998, P INT C SPOK LANG PR, P1003
HERMANSKY H, 1997, P INT C AC SPEECH SI, V1, P289
Hirsch H. G., 2000, P ISCA ITRW ASR2000, P18
Hoyer PO, 2004, J MACH LEARN RES, V5, P1457
Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6
Kleinschmidt M., 2003, P EUR, P2573
KODERA K, 1978, IEEE T ACOUST SPEECH, V26, P64, DOI 10.1109/TASSP.1978.1163047
Lee DD, 2001, ADV NEUR IN, V13, P556
Machiraju VR, 2002, J CARDIAC SURG, V17, P20
Martin A. F., 1997, P EUROSPEECH, P1895
MEYER BT, 2008, P INT, P906
O'Grady PD, 2008, NEUROCOMPUTING, V72, P88, DOI 10.1016/j.neucom.2008.01.033
Park A., 2005, P ASRU SAN JUAN PUER, P53
Plante F, 1998, IEEE T SPEECH AUDI P, V6, P282, DOI 10.1109/89.668821
QIAO Y, 2008, P SPRING M AC SOC JA
Scharenborg O., 2007, P INT C SPOK LANG PR, P1953
Scharenborg O, 2005, COGNITIVE SCI, V29, P867, DOI 10.1207/s15516709cog0000_37
SIIVOLA V, 2003, P 8 EUR C SPEECH COM, P2293
Smaragdis P, 2007, IEEE T AUDIO SPEECH, V15, P1, DOI 10.1109/TASL.2006.876726
Stouten V, 2008, IEEE SIGNAL PROC LET, V15, P131, DOI 10.1109/LSP.2007.911723
Tyagi V., 2003, P ASRU 03, P399
Van hamme H., 2008, P INT C SPOK LANG PR, P2554
VANHAMME H, 2008, ISCA ITRW WORKSH SPE
Virtanen T, 2007, IEEE T AUDIO SPEECH, V15, P1066, DOI 10.1109/TASL.2006.885253
YOUNG S, 1999, HTK BOOK VERSION2 2
NR 36
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2009
VL 51
IS 11
BP 1124
EP 1138
DI 10.1016/j.specom.2009.05.003
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 498SR
UT WOS:000270164000006
ER
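The first stage described above, discovering time-frequency patches with NMF, can be illustrated with a toy factorization: a non-negative data matrix is decomposed as V ~ W H, with the columns of W playing the role of patches and H giving their activations over time. The data below are random; the cited system builds the matrix from stacked static and dynamic spectral features of short and long acoustic events.
# Toy sketch (random data, sklearn NMF): non-negative matrix factorization V ~ W.H,
# where W holds the learned "patches" and H their activations over time.
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)

# Build a non-negative data matrix from a few hidden patches plus noise.
true_patches = rng.random((40, 5))            # 40-dim features, 5 latent patches
activations = rng.random((5, 300))            # activations over 300 frames
V = true_patches @ activations + 0.01 * rng.random((40, 300))

model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(V)   # learned patches (40 x 5)
H = model.components_        # patch activations over time (5 x 300)
print(W.shape, H.shape, round(model.reconstruction_err_, 3))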
PT J
AU Siniscalchi, SM
Lee, CH
AF Siniscalchi, Sabato Marco
Lee, Chin-Hui
TI A study on integrating acoustic-phonetic information into lattice
rescoring for automatic speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Knowledge based system; Lattice rescoring; Continuous phone recognition;
Large vocabulary continuous speech recognition
ID HIDDEN MARKOV-MODELS; VERIFICATION
AB In this paper, a lattice rescoring approach to integrating acoustic-phonetic information into automatic speech recognition (ASR) is described. Additional information over what is used in conventional log-likelihood based decoding is provided by a bank of speech event detectors that score manner and place of articulation events with log-likelihood ratios that are treated as confidence levels. An artificial neural network (ANN) is then used to transform raw log-likelihood ratio scores into manageable terms for easy incorporation. We refer to the union of the event detectors and the ANN as knowledge module. A goal of this study is to design a generic framework which makes it easier to incorporate other sources of information into an existing ASR system. Another aim is to start investigating the possibility of building a generic knowledge module that can be plugged into an ASR system without being trained on specific data for the given task. To this end, the proposed approach is evaluated on three diverse ASR tasks: continuous phone recognition, connected digit recognition, and large vocabulary continuous speech recognition, but the data-driven knowledge module is trained with a single corpus and used in all three evaluation tasks without further training. Experimental results indicate that in all three cases the proposed rescoring framework achieves better results than those obtained without incorporating the confidence scores provided by the knowledge module. It is interesting to note that the rescoring process is especially effective in correcting utterances with errors in large vocabulary continuous speech recognition, where constraints imposed by the lexical and language models sometimes produce recognition results not strictly observing the underlying acoustic-phonetic properties. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Siniscalchi, Sabato Marco] Norwegian Univ Sci & Technol, Dept Elect & Telecommun, NO-7491 Trondheim, Norway.
[Lee, Chin-Hui] Georgia Inst Technol, Sch Elect & Comp Engn, Atlanta, GA 30332 USA.
RP Siniscalchi, SM (reprint author), Norwegian Univ Sci & Technol, Dept Elect & Telecommun, NO-7491 Trondheim, Norway.
EM marco77@iet.ntnu.no; chl@ece.gatech.edu
RI Siniscalchi, Sabato/I-3423-2012
FU NSF [IIS-04-27113]; IBM Faculty Award
FX This study was partially supported by the NSF Grant IIS-04-27113 and an
IBM Faculty Award. The first author would also like to thank Jinyu Li of
Georgia Institute of Technology for stimulating discussions.
CR BAHL LR, 1986, P ICASSP TOK JAP
Bazzi I., 2000, P ICSLP BEIJ CHIN, P401
Bitar N., 1996, P INT C AC SPEECH SI, P29
Bourlard Ha, 1994, CONNECTIONIST SPEECH
CHEN B, 2004, P INT S SPOK LANG PR, P925
CHENG I, 2006, P 8 IEEE INT S MULT, P533
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
EIDE E, 2001, P EUR AALB DENM, P1613
EVERMANN G, 2000, P ICASSP, P1655
Fant G., 1960, ACOUSTIC THEORY SPEE
Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347
Frankel J, 2007, COMPUT SPEECH LANG, V21, P620, DOI 10.1016/j.csl.2007.03.002
FU Q, 2006, P INT, P681
Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC
Goel V, 2000, COMPUT SPEECH LANG, V14, P115, DOI 10.1006/csla.2000.0138
Hacioglu K., 2004, P ICASSP04 MONTR CAN, P925
HASEGAWA M, 2005, P ICASSP PHIL US
Haykin S., 1998, NEURAL NETWORKS COMP, V2nd
Hermansky H., 2005, P INT, P361
International Phonetic Association, 1999, HDB INT PHON ASS GUI
Jaakkola TS, 1999, ADV NEUR IN, V11, P487
Jiang H, 2006, IEEE T AUDIO SPEECH, V14, P1584, DOI 10.1109/TASL.2006.879805
JUANG BH, 1986, IEEE T INFORM THEORY, V32, P307
Kawahara T, 1998, IEEE T SPEECH AUDI P, V6, P558, DOI 10.1109/89.725322
Kay S. M., 1998, FUNDAMENTALS STAT SI, V2
Kirchhoff K., 1999, THESIS U BIELEFELD G
KIRCHHOFF K., 1998, P ICSLP, P891
Koo MW, 2001, IEEE T SPEECH AUDI P, V9, P821
LAUNARY B, 2002, P ICASSP02 ORL, P817
Lee C.-H., 1997, P COST WORKSH SPEECH, P62
Lee C.-H., 2004, P ICSLP, P109
LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546
LEONARD RG, 1984, P ICSLP SAN DIEG US
LEVINSON SE, 1985, P IEEE, V73, P1625, DOI 10.1109/PROC.1985.13344
LI J, 2006, P INT, P2422
Li J, 2005, P ICASSP PHIL US, P837
Li J., 2005, P INT LISB PORT SEP, P3365
Li JY, 2007, IEEE T AUDIO SPEECH, V15, P2393, DOI 10.1109/TASL.2007.906178
Macherey W., 2005, P INT LISB PORT, P2133
MAK B, 2006, P ICASSP, P222
Mangu L., 1999, P EUR C SPEECH COMM, P495
Metze F., 2002, P ICSLP DENV US SEPT, P16
Metze F., 2005, THESIS U KARLSRUHE G
Morris J., 2006, P INT, P597
Niyogi P, 2002, J ACOUST SOC AM, V111, P1063, DOI 10.1121/1.1427666
RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626
Richardson M, 2003, SPEECH COMMUN, V41, P511, DOI 10.1016/S0167-6393(03)00031-1
SAKTI S, 2007, P INTERSPEECH, P2117
SCHWARTZ R, 1996, AUTOMATIC SPEECH SPE, P29
Schwarz P., 2006, P ICASSP, P325
Siniscalchi SM, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P517
Stolcke A., 1997, P EUROSPEECH, P163
Stuker S., 2003, P ICASSP HONG KONG C, P144
Sukkar R.A., 1993, P IEEE ICASSP 93, P451
Tang M., 2003, P EUR GEN SWITZ, P2585
Tsao Y., 2005, P INT 05, P1109
Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002
WOODLAND PC, 1994, P IEEE INT C AC SPEE, V2, P125
YOUNG S, 2007, HTK BOOK HTK VERSION
NR 59
TC 21
Z9 23
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD NOV
PY 2009
VL 51
IS 11
BP 1139
EP 1153
DI 10.1016/j.specom.2009.05.004
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 498SR
UT WOS:000270164000007
ER
PT J
AU Eskenazi, M
AF Eskenazi, Maxine
TI Special Issue: Spoken Language Technology for Education
SO SPEECH COMMUNICATION
LA English
DT Editorial Material
C1 Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA.
RP Eskenazi, M (reprint author), Carnegie Mellon Univ, Language Technol Inst, 4619 Newell Simon Hall,5000 Forbes Ave, Pittsburgh, PA 15213 USA.
EM max@cmu.edu
NR 0
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 831
EP 831
DI 10.1016/j.specom.2009.07.001
PG 1
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500001
ER
PT J
AU Eskenazi, M
AF Eskenazi, Maxine
TI An overview of spoken language technology for education
SO SPEECH COMMUNICATION
LA English
DT Article
DE Language learning; Automatic speech processing; Speech recognition;
Speech synthesis; Spoken dialogue systems; Speech technology for
education
ID 2ND-LANGUAGE LEARNERS FLUENCY; QUANTITATIVE ASSESSMENT; SPEECH
RECOGNITION; JAPANESE; SYSTEM; TUTORS
AB This paper reviews research in spoken language technology for education and, more specifically, for language learning. It traces the history of the domain and then groups the main issues in the interaction with the student. It addresses the modalities of interaction and their implementation issues and algorithms. It then discusses one user population - children - and an application for them. Finally, it discusses overall systems. The paper can be used as an introduction to the field and a source of reference materials. (C) 2009 Elsevier B.V. All rights reserved.
C1 Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA.
RP Eskenazi, M (reprint author), Carnegie Mellon Univ, Language Technol Inst, 4619 Newell Simon Hall,5000 Forbes Ave, Pittsburgh, PA 15213 USA.
EM max@cmu.edu
CR Akahane-Yamada R., 1997, Journal of the Acoustical Society of Japan (E), V18
AKAHANEYAMADA R, 1998, P STILL 98 ESCA MARH, P111
AKAHANEYAMADA R, 2004, 18 INT C AC P ICA 20, V3, P2319
ALWAN A, 2007, P IEEE MULT SIGN PRO
BACHMAN L, 1990, OXFORD APPL LINGUIST, P216
BADIN P, 1998, P ESCA TUT RES WORKS, P167
Beck J. E., 2004, TECHNOLOGY INSTRUCTI, V2, P61
BERNSTEIN J, 1977, P IEEE ICASSP 77, P244
BERNSTEIN J, 1989, J ACOUST SOC AM S1, V86, pS77, DOI 10.1121/1.2027647
BERNSTEIN J, 1984, P IEEE ICASSP 84
BERNSTEIN J, 1986, EIF SPEC SESS SPEECH
BERNSTEIN J, 1996, SPEECH RECOGNITION C
Bernstein J., 1990, P INT C SPOK LANG PR, P1185
Bernstein J., 1999, CALICO Journal, V16
BESKOW J, 2000, P ESCA ETRW INSTIL D, P138
BIANCHI D, 2004, P ISCA ITRW INSTIL04, P203
Black A, 2007, P ISCA ITRW SLATE WO
BONAVENTURA P, 2000, KONVENS 2000, P225
Bradlow AR, 1997, J ACOUST SOC AM, V101, P2299, DOI 10.1121/1.418276
CHAO C, 2007, P ISCA ITRW SLATE FA
COLE R, 1999, P ESCA SOCRATES ETRW
COLE RA, 1998, P ESCA ETRW STILL MA, P163
COSI P, 2004, P INSTD ICALL S VER, P207
Cucchiarini C, 2002, J ACOUST SOC AM, V111, P2862, DOI 10.1121/1.1471894
CUCCHIARINI C, 1998, P 5 INT C SPOK LANG, P2619
Cucchiarini C., 2007, P INT 2007 ANTW BELG, P2181
Cucchiarini C, 2000, J ACOUST SOC AM, V107, P989, DOI 10.1121/1.428279
DARCY SM, 2005, P ISCA INT 2005 LISB
Delcloque P., 2000, HIST CALL
DELMONTE R, 1998, P ESCA ETRW STILL MA, P57
Delmonte R, 2000, SPEECH COMMUN, V30, P145, DOI 10.1016/S0167-6393(99)00043-6
DESTOMBES F, 1993, NATO ASI SERIES COMP, P187
Ehsani F, 2000, SPEECH COMMUN, V30, P167, DOI 10.1016/S0167-6393(99)00042-4
EHSANI F, 1997, P EUR 1997 RHOD GREE, P681
ELLIS NC, 2007, P ISCA ITRW SLATE FA
Ellis R., 1997, 2 LANGUAGE ACQUISITI
ESKENAZI M, 1996, P INT C SPOK LANG PR
FLEGE JE, 1988, LANG LEARN, V38, P365, DOI 10.1111/j.1467-1770.1988.tb00417.x
FORBESRILEY K, 2007, P HUM LANG TECHN ANN
FORBESRILEY K, 2007, P AFF COMP INT INT A
Forbes-Riley K., 2008, P 9 INT C INT TUT SY
Franco H., 2000, P INSTILL 2000 DUND, P123
GEROSA M, 2006, P INT C AC SPEECH SI, V1, P393
GEROSA M, 2004, P NLP SPEECH TECHN A, P9
GRANSTROM B, 2004, P ISCA ITRW INSTIL04
HACKER C, 2007, P IEEE ICASSP HAW
Hagen A, 2007, SPEECH COMMUN, V49, P861, DOI 10.1016/j.specom.2007.05.004
Handley Z, 2005, LANG LEARN TECHNOL, V9, P99
HAZAN V, 2005, SPEECH COMM
Hazan Valerie L., 1998, P ESCA ETRW STILL98, P119
HILLER S, 1993, SPEECH COMMUN, V13, P463, DOI 10.1016/0167-6393(93)90045-M
Hincks R., 2002, TMH QPSR, V44, P153
HINCKS R, 2004, P ISCA ITRW INSTIL V, P63
HOLLINGSED T, 2007, P ISCA ITRW SPEECH L
HOUSE D, 1999, P ESCA EUR, V99, P1843
HUANG X, 2004, SPOKEN LANGUAGE PROC
ISHI CT, 2000, P ESCA ETRW INSTIL20, P106
Ito A., 2005, P EUROSPEECH, P173
Johnson W. L., 2008, P IAAI 2008
JOHNSON WL, 2004, P I ITSEC 2004
KAWAHARA H, 2006, 4 JOINT M ASA ASJ DE
KAWAI G, 1998, P INT C SPOK LANG PR
Kim Y., 1997, P EUR, P645
LANGLAIS P, 1998, P ESCA WORKSH SPEECH, P41
LaRocca S., 2000, P ISCA ITRW INSTIL20, P26
LISCOMBE J, 2006, P ISCA INT 2006 PITT
LIU Y, 2007, P ISCA ITRW SLATE FA
Mak B., 2003, P HLT NAACL, P23
MARTONY J, 1968, AM ANN DEAF, V113, P195
MASSARO, 1998, P SPEECH TECHN LANG, P169
Massaro D. W., 2006, NATURAL INTELLIGENT, P183
Massaro D. W., 2003, P 5 INT C MULT INT V, P172, DOI 10.1145/958432.958466
MASSARO DW, 2006, P 9 INT C SPOK LANG, P825
MCGRAW I, 2007, P ISCA ITRW SLATE FA
MERCIER G, 2000, P INSTIL, P145
MICH O, 2004, P ISCA ITRW NLP SPEE, P169
MINEMATSU N, 2007, P ISCA ITRW SPEECH L
MOSTOW J, 1994, PROCEEDINGS OF THE TWELFTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 AND 2, P785
MOTE N, 2004, P ISCA ITRW INSTIL04, P47
Moustroufas N, 2007, COMPUT SPEECH LANG, V21, P219, DOI 10.1016/j.csl.2006.04.001
NERI A, 2006, P ISCA INT PITTSB PA, P1982
NEUMEYER L, 1996, P INT C SPOK LANG PR
NICKERSON RS, 1972, P C SPEECH COMM PROC
PEABODY M, 2004, P ISCA ITRW INSTIL04, P173
Precoda K., 2000, P ISCA ITRW INSTIL20, P102
Probst K, 2002, SPEECH COMMUN, V37, P161, DOI 10.1016/S0167-6393(01)00009-7
PRUITT J, 1998, P ESCA ETRW STILL98, P107
QUN L, 2001, P ESCA EUROSPEECH 20, P2671
Raux A., 2002, P ICSLP, P737
Raux Antoine, 2004, P INSTIL ICALL 2004, P147
RHEE SC, 2004, P ISCA INTERSPEECH 2, P1681
Russell M, 1996, P INT C SPOK LANG PR
Russell M, 2000, COMPUT SPEECH LANG, V14, P161, DOI 10.1006/csla.2000.0139
Rypa M. E., 1999, CALICO Journal, V16
SENEFF S, 2007, P ISCA ITRW SLATE07, P9
Seneff S., 2007, P NAACL HLT07 ROCH N
SENEFF S, 2004, P INSTIL ICALL 2004, P151
STRIK H, 2007, P INT 2007 ANTW NETH
Sundstrom A., 1998, P ISCA WORKSH SPEECH, P49
SVENSTER B, 1998, P ESCA ETRW STILL MA, P91
TOWNSHEND B, 1998, P ESCA WORKSH SPEECH, P179
TRONG K, 2005, P ISCA INT LISB, P1345
Tsubota Y., 2002, P ICSLP, P1205
TSUBOTA Y, 2004, P ESCA ETRW INSTIL04, P139
VANLEHN K, 2007, SLATE WORKSH SPEECH
WEIGELT LF, 1990, J ACOUST SOC AM, V87, P2729, DOI 10.1121/1.399063
WIK P, 2007, P ISCA ITRW SLATE FA
WISE B, INTERACTIVE LITERACY
WITT S, 1997, 5 EUR C SPEECH COMM, P22
WITT SM, 1999, THESIS U CAMBRIDGE, P445
Yamashita Y, 2005, IEICE T INF SYST, VE88D, P496, DOI 10.1093/ietisy/e88-d.3.496
YI J, 1998, P ICSLP SYDN AUSTR N
YOSHIMURA Y, 2007, P ISCA ITRW SLATE FA
Zechner K., 2007, P ISCA ITRW SPEECH L
NR 114
TC 26
Z9 26
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 832
EP 844
DI 10.1016/j.specom.2009.04.005
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500002
ER
PT J
AU Strik, H
Truong, K
de Wet, F
Cucchiarini, C
AF Strik, Helmer
Truong, Khiet
de Wet, Febe
Cucchiarini, Catia
TI Comparing different approaches for automatic pronunciation error
detection
SO SPEECH COMMUNICATION
LA English
DT Article
DE Pronunciation error detection; Acoustic-phonetic classification;
Computer assisted pronunciation training; Computer assisted language
learning
AB One of the biggest challenges in designing computer assisted language learning (CALL) applications that provide automatic feedback on pronunciation errors consists in reliably detecting the pronunciation errors at such a detailed level that the information provided can be useful to learners. In our research we investigate pronunciation errors frequently made by foreigners learning Dutch as a second language. In the present paper we focus on the velar fricative /x/ and the velar plosive /k/. We compare four types of classifiers that can be used to detect erroneous pronunciations of these phones: two acoustic-phonetic classifiers (one of which employs Linear Discriminant Analysis (LDA)), a classifier based on cepstral coefficients in combination with LDA, and one based on confidence measures (the so-called Goodness Of Pronunciation score). The best results were obtained for the two LDA classifiers which produced accuracy levels of about 85-93%. (C) 2009 Elsevier B.V. All rights reserved.
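A minimal Python sketch of a Goodness Of Pronunciation (GOP) style confidence measure of the kind compared in this study: a duration-normalised log-likelihood ratio of the intended phone against the best competing phone model. The frame count, log-likelihoods and threshold below are hypothetical values, not data from the paper:

def gop_score(target_loglik, competitor_logliks, n_frames):
    # Duration-normalised log-likelihood ratio of the intended phone
    # against the best competing phone model.
    return (target_loglik - max(competitor_logliks)) / n_frames

def is_mispronounced(target_loglik, competitor_logliks, n_frames, threshold=-1.0):
    # Scores well below zero indicate that another phone model fits better.
    return gop_score(target_loglik, competitor_logliks, n_frames) < threshold

# Example: a realisation of /x/ whose frames are better explained by the /k/ model.
print(is_mispronounced(-310.0, [-280.0, -340.0], n_frames=25))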
C1 [Strik, Helmer; Cucchiarini, Catia] Radboud Univ Nijmegen, Ctr Language & Speech Technol, NL-6500 HD Nijmegen, Netherlands.
[Truong, Khiet] Univ Twente, Dept Comp Sci, Human Media Interact Grp, NL-7500 AE Enschede, Netherlands.
[de Wet, Febe] Univ Stellenbosch, Ctr Language & Speech Technol, ZA-7602 Matieland, South Africa.
RP Strik, H (reprint author), Radboud Univ Nijmegen, Ctr Language & Speech Technol, POB 9103, NL-6500 HD Nijmegen, Netherlands.
EM h.strik@let.ru.nl; truongkp@ewi.utwente.nl; fdw@sun.ac.za;
c.cucchiarini@let.ru.nl
CR Boersma P., 2001, GLOT INT, V5, P341
Cucchiarini C, 1996, CLIN LINGUIST PHONET, V10, P131, DOI 10.3109/02699209608985167
Cucchiarini C, 2000, J ACOUST SOC AM, V107, P989, DOI 10.1121/1.428279
DEN OS, 1995, P EUR, P825
Mak B., 2003, P HLT NAACL, P23
NERI A, 2006, APPL LINGUIST LANG T, V44, P357
NERI A, 2004, P INSTIL ICALL S, P13
NERI A, 2006, P ISCA INT PITTSB PA, P1982
Neumeyer L, 2000, SPEECH COMMUN, V30, P83, DOI 10.1016/S0167-6393(99)00046-1
TRUONG KP, 2004, THESIS UTRECHT U
WEIGELT LF, 1990, J ACOUST SOC AM, V87, P2729, DOI 10.1121/1.399063
Witt S. M., 1999, THESIS U CAMBRIDGE
Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8
Young S., 2000, HTK BOOK VERSION 3 0
NR 14
TC 18
Z9 22
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 845
EP 852
DI 10.1016/j.specom.2009.05.007
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500003
ER
PT J
AU Cucchiarini, C
Neri, A
Strik, H
AF Cucchiarini, Catia
Neri, Ambra
Strik, Helmer
TI Oral proficiency training in Dutch L2: The contribution of ASR-based
corrective feedback
SO SPEECH COMMUNICATION
LA English
DT Article
DE Computer assisted pronunciation training (CAPT); Corrective feedback;
Pronunciation error detection; Goodness of pronunciation; Accent
reduction
ID FLUENCY
AB In this paper, we introduce a system for providing automatically generated corrective feedback on pronunciation errors in Dutch, Dutch-CAPT. We describe the architecture of the system, paying particular attention to the rationale behind it, to the performance of the error detection algorithm, and to its relationship to the effectiveness of the corrective feedback provided. It appears that although the system does not achieve 100% accuracy in error detection, learners enjoy using it and the feedback provided is still effective in improving pronunciation errors after only a few hours of use over a period of one month. We discuss which factors may have led to these positive results and argue that it is worthwhile studying how ASR technology could be applied to the training of other speaking skills. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Cucchiarini, Catia; Neri, Ambra; Strik, Helmer] Radboud Univ Nijmegen, Dept Linguist, CLST, NL-6525 HT Nijmegen, Netherlands.
RP Cucchiarini, C (reprint author), Radboud Univ Nijmegen, Dept Linguist, CLST, Erasmuspl 1, NL-6525 HT Nijmegen, Netherlands.
EM C.Cucchiarini@let.ru.nl; ambra.neri@gmail.com; h.strik@let.ru.nl
CR CAREY M, 2002, THESIS MACQUARIE U A
Cucchiarini C, 2000, J ACOUST SOC AM, V107, P989, DOI 10.1121/1.428279
DeKeyser RM, 2005, LANG LEARN, V55, P1, DOI 10.1111/j.0023-8333.2005.00294.x
Egan K. B., 1999, CALICO Journal, V16
Ehsani F., 1998, LANGUAGE LEARNING TE, V2, P45
ELTATAWY M, 2002, WORKING PAPERS TESOL
Giuliani D., 2003, Proceedings 3rd IEEE International Conference on Advanced Technologies, DOI 10.1109/ICALT.2003.1215131
HERRON D, 1999, P EUR 99 SEPT, P855
LENNON P, 1990, LANG LEARN, V40, P387, DOI 10.1111/j.1467-1770.1990.tb00669.x
Mak B., 2003, P HLT NAACL, P23
Neri A., 2006, IRAL-INT REV APPL LI, V44, P357, DOI 10.1515/IRAL.2006.016
Neri A., 2002, Computer Assisted Language Learning, V15, DOI 10.1076/call.15.5.441.13473
Oostdijk N, 2002, LANG COMPUT, P105
REESER TW, 2001, CALICO REV
SCHMIDT RW, 1990, APPL LINGUIST, V11, P129, DOI 10.1093/applin/11.2.129
Van Bael C., 2003, P EUR GEN SWITZ, P1545
Witt S., 1999, THESIS U CAMBRIDGE C
Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8
Young S., 2000, HTK BOOK VERSION 3 0
ZHENG T, 2002, CALICO REV
NR 20
TC 8
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 853
EP 863
DI 10.1016/j.specom.2009.03.003
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500004
ER
PT J
AU de Wet, F
Van der Walt, C
Niesler, TR
AF de Wet, F.
Van der Walt, C.
Niesler, T. R.
TI Automatic assessment of oral language proficiency and listening
comprehension
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic language proficiency assessment; Large-scale assessment; Rate
of speech; Goodness of pronunciation; CALL
ID 2ND-LANGUAGE LEARNERS FLUENCY; QUANTITATIVE ASSESSMENT; PRONUNCIATION
QUALITY; WORKING-MEMORY; SPEECH
AB This paper describes an attempt to automate the large-scale assessment of oral language proficiency and listening comprehension for fairly advanced students of English as a second language. The automatic test is implemented as a spoken dialogue system and consists of a reading as well as a repeating task. Two experiments are described in which different rating criteria were used by human judges. In the first experiment, proficiency was scored globally for each of the two test components. In the second experiment, various aspects of proficiency were evaluated for each section of the test. In both experiments, rate of speech (ROS), goodness of pronunciation (GOP) and repeat accuracy were calculated for the spoken utterances. The correlation between scores assigned by human raters and these three automatically derived measures was determined to assess their suitability as proficiency indicators. Results show that the more specific rating instructions used in the second experiment improved intra-rater agreement, but made little difference to inter-rater agreement. In addition, the more specific rating criteria resulted in a better correlation between the human and the automatic scores for the repeating task, but had almost no impact in the reading task. Overall, the results indicate that, even for the narrow range of proficiency levels observed in the test population, the automatically derived ROS and accuracy scores give a fair indication of oral proficiency. (C) 2009 Elsevier B.V. All rights reserved.
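A minimal Python sketch of two of the automatically derived quantities mentioned in this abstract: rate of speech (ROS) and the Pearson correlation between automatic measures and human ratings. All numbers and helper names are illustrative assumptions, not data from the study:

import statistics

def rate_of_speech(n_phones, duration_s):
    # Phones per second over the spoken response.
    return n_phones / duration_s

def pearson_r(xs, ys):
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denom = (sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)) ** 0.5
    return cov / denom

ros = [rate_of_speech(n, d) for n, d in [(220, 20.5), (180, 21.0), (240, 19.8)]]
human = [4.0, 3.0, 4.5]  # hypothetical proficiency ratings for the same responses
print(round(pearson_r(ros, human), 3))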
C1 [de Wet, F.] Univ Stellenbosch, Ctr Language & Speech Technol, ZA-7600 Stellenbosch, South Africa.
[Van der Walt, C.] Univ Stellenbosch, Dept Curriculum Studies, ZA-7600 Stellenbosch, South Africa.
[Niesler, T. R.] Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa.
RP de Wet, F (reprint author), Univ Stellenbosch, Ctr Language & Speech Technol, ZA-7600 Stellenbosch, South Africa.
EM fdw@sun.ac.za; cvdwalt@sun.ac.za; trn@dsp.sun.ac.za
CR Bejar I. I., 2006, P HUM LANG TECHN C N, P216, DOI 10.3115/1220835.1220863
Bernstein J., 2000, P INSTIL2000 INT SPE, P57
Chalhoub-Deville M., 2001, LANGUAGE LEARNING TE, V5, P95
CINCAREK T, 2004, THESIS FRIEDRICHALEX
Cucchiarini C, 2002, J ACOUST SOC AM, V111, P2862, DOI 10.1121/1.1471894
Cucchiarini C, 2000, SPEECH COMMUN, V30, P109, DOI 10.1016/S0167-6393(99)00040-0
Cucchiarini C, 2000, J ACOUST SOC AM, V107, P989, DOI 10.1121/1.428279
DANEMAN M, 1991, J PSYCHOLINGUIST RES, V20, P445, DOI 10.1007/BF01067637
DEWET F, 2007, P INT ANTW BELG, P218
ELLIS NC, 1996, Q J EXP PSYCHOL, V49, P34250
Franco H., 2000, P INSTILL 2000 DUND, P123
Fulcher G., 1996, LANG TEST, V13, P208, DOI DOI 10.1177/026553229601300205
FULCHER G, 1997, ENCY LANGUAGE ED LAN, V17, P75
KAWAI G, 1998, P ICSLP SYDN AUSTR
Kenyon DM, 2000, MOD LANG J, V84, P85, DOI 10.1111/0026-7902.00054
LISCOMBE JJ, 2007, THESIS COLUMBIA U CO
Moustroufas N, 2007, COMPUT SPEECH LANG, V21, P219, DOI 10.1016/j.csl.2006.04.001
NERI A, 2006, P ISCA INT PITTSB PA, P1982
Neumeyer L, 2000, SPEECH COMMUN, V30, P83, DOI 10.1016/S0167-6393(99)00046-1
Payne JS, 2005, LANG LEARN TECHNOL, V9, P35
ROUX JC, 2004, P 4 INT C LANG RES E, V1, P93
Spolsky B., 1995, MEASURED WORDS
*STATSOFT, 2008, STAT 8 0
SUNDH S, 2003, THESIS UPPSALA U UPP
Upshur J., 1995, ELT J, V49, P3, DOI 10.1093/elt/49.1.3
van der Walt C, 2008, SO AFR LINGUIST APPL, V26, P135, DOI 10.2989/SALALS.2008.26.1.11.426
Wigglesworth G., 1997, LANG TEST, V14, P85, DOI DOI 10.1177/026553229701400105
Witt S., 1999, THESIS U CAMBRIDGE C
Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8
XI X, 2006, ANN M NAT COUNC MEAS
Young Steve, 2002, HTK BOOK VERSION 3 2
NR 31
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 864
EP 874
DI 10.1016/j.specom.2009.03.002
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500005
ER
PT J
AU Ohkawa, Y
Suzuki, M
Ogasawara, H
Ito, A
Makino, S
AF Ohkawa, Yuichi
Suzuki, Motoyuki
Ogasawara, Hirokazu
Ito, Akinori
Makino, Shozo
TI A speaker adaptation method for non-native speech using learners' native
utterances for computer-assisted language learning systems
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker adaptation; Non-native speech; Computer-assisted language
learning; Bilingual speaker
AB In recent years, various CALL systems that can evaluate a learner's pronunciation using speech recognition technology have been proposed. In order to evaluate a learner's utterances and point out problems with higher accuracy, speaker adaptation is a promising technology. However, many learners who use CALL systems have very poor speaking ability in the target language (L2), so conventional speaker adaptation methods are problematic because they require the learners' correctly pronounced L2 utterances for adaptation. In this paper, we propose two new types of speaker adaptation methods for CALL systems. The new methods require only the learners' utterances in their native language (L1) for adapting the acoustic model for L2.
The first method is an algorithm that adapts acoustic models using a bilingual speaker's utterances. The speaker-independent acoustic models of L1 and L2 are first adapted to the bilingual speaker, and then adapted to the learner using the learner's L1 utterances. With this method, we obtained phoneme recognition accuracy about 5 points higher than with the baseline method.
The second method is a training algorithm for a set of acoustic models based on speaker adaptive training. It can robustly train bilingual models using a few L1 and L2 utterances produced by bilingual speakers. With this method, we obtained phoneme recognition accuracy about 10 points higher than with the baseline method. (C) 2009 Elsevier B.V. All rights reserved.
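A minimal Python (NumPy) sketch of the core idea behind the first method: a transform estimated from the learner's L1 speech is reused to shift the bilingual-adapted L2 model parameters. This toy global mean offset stands in for the MLLR and speaker-adaptive-training machinery of the paper, and all data are synthetic assumptions:

import numpy as np

def estimate_offset(learner_l1_frames, bilingual_l1_mean):
    # Global difference between the learner's L1 data and the bilingual L1 model.
    return learner_l1_frames.mean(axis=0) - bilingual_l1_mean

def adapt_l2_means(bilingual_l2_means, offset):
    # Reuse the L1-derived offset for every L2 Gaussian mean.
    return bilingual_l2_means + offset

rng = np.random.default_rng(0)
learner_l1 = rng.normal(loc=1.0, scale=0.5, size=(200, 13))  # learner L1 feature frames
bilingual_l1_mean = np.zeros(13)
bilingual_l2_means = rng.normal(size=(40, 13))                # 40 L2 Gaussian means
adapted = adapt_l2_means(bilingual_l2_means, estimate_offset(learner_l1, bilingual_l1_mean))
print(adapted.shape)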
C1 [Ohkawa, Yuichi] Tohoku Univ, Grad Sch Educ Informat, Aoba Ku, Sendai, Miyagi 9808576, Japan.
[Suzuki, Motoyuki] Univ Tokushima, Fac Engn, Tokushima 7708506, Japan.
[Ogasawara, Hirokazu; Ito, Akinori; Makino, Shozo] Tohoku Univ, Grad Sch Engn, Aoba Ku, Sendai, Miyagi 9808579, Japan.
RP Ohkawa, Y (reprint author), Tohoku Univ, Grad Sch Educ Informat, Aoba Ku, 27-1 Kawauchi, Sendai, Miyagi 9808576, Japan.
EM kuri@ei.tohoku.ac.jp; moto@m.ieice.org;
ogasawara@makino.ecei.tohoku.ac.jp; aito@fw.ipsj.or.jp;
makino@makino.ecei.tohoku.ac.jp
CR Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807
Chun D. M., 1995, Journal of Educational Multimedia and Hypermedia, V4
DUNKEL P, 1991, MOD LANG J, V75, P64, DOI 10.2307/329835
HILLER S, 1993, P EUR 199O BERL GERM, P1343
ITO A, 2006, ED TECHNOLOGY RES, V29, P13
Kawai G, 2000, SPEECH COMMUN, V30, P131, DOI 10.1016/S0167-6393(99)00041-2
KWEON O, 2004, ED TECHNOL RES, V27, P9
Lee C. H., 1993, P ICASSP, VII-558
LEE KK, 1994, T INFORMATION PROCES, V35, P1223
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
MASHIMO M, 2002, P ICSLP, P293
NAKAMURA N, 2002, SP200220 IEICE
OGASAWARA H, 2003, SP2003127 IEICE
OHKURA K, 1992, P ICSLP 92, P369
STEIDL S, 2004, P ICSLP, P314
WITT S, 1998, P ICSLP, P1010
WITT S, 1997, LANGUAGE TEACHING LA, P25
Young Steve, 2002, HTK BOOK VERSION 3 2
NR 18
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 875
EP 882
DI 10.1016/j.specom.2009.05.005
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500006
ER
PT J
AU Zechner, K
Higgins, D
Xi, XM
Williamson, DM
AF Zechner, Klaus
Higgins, Derrick
Xi, Xiaoming
Williamson, David M.
TI Automatic scoring of non-native spontaneous speech in tests of spoken
English
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech scoring; Automatic scoring; Spoken language scoring; Scoring of
spontaneous speech; Speaking assessment
ID 2ND-LANGUAGE LEARNERS FLUENCY; QUANTITATIVE ASSESSMENT; PRONUNCIATION
QUALITY; ALGORITHMS; SCORES
AB This paper presents the first version of the SpeechRater(SM) system for automatically scoring non-native spontaneous high-entropy speech in the context of an online practice test for prospective takers of the Test of English as a Foreign Language (R) internet-based test (TOEFL (R) iBT).
The system consists of a speech recognizer trained on non-native English speech data, a feature computation module that uses the speech recognizer output to compute a set of mostly fluency-based features, and a multiple regression scoring model that predicts a speaking proficiency score for every test item response using a subset of the features generated by the previous component. Experiments with classification and regression trees (CART) complement those performed with multiple regression. We evaluate the system both on TOEFL Practice data [TOEFL Practice Online (TPO)] and on Field Study data collected before the introduction of the TOEFL iBT.
Features are selected by test development experts based on both their empirical correlations with human scores and their coverage of the concept of communicative competence.
We conclude that while the correlation between machine scores and human scores on TPO (0.57) still differs by 0.17 from the inter-human correlation (0.74) on complete sets of six items (Pearson r correlation coefficients), a correlation of 0.57 is high enough to warrant deployment of the system in a low-stakes practice environment, given its coverage of several important aspects of communicative competence such as fluency, vocabulary diversity, grammar, and pronunciation. Deployment in such an environment is further warranted because this system is an initial version of a long-term research and development program: features related to vocabulary, grammar, and content will be added at a later stage, as automatic speech recognition performance improves, without requiring a redesign of the system.
Exact agreement between our system's scores and human scores on single TPO items was 57.8%, essentially on par with the inter-human agreement of 57.2%.
Our system has been in operational use to score TOEFL Practice Online Speaking tests since the Fall of 2006 and has since scored tens of thousands of tests. (C) 2009 Elsevier B.V. All rights reserved.
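A minimal Python sketch of a multiple-regression scoring model over fluency-style features, of the kind described in the architecture above; the feature names, weights, intercept and score range are assumptions for illustration, not SpeechRater's actual model:

WEIGHTS = {"words_per_sec": 0.9, "silence_ratio": -1.2, "types_per_token": 0.6}
INTERCEPT = 2.0

def predict_score(features, lo=1.0, hi=4.0):
    # Weighted sum of fluency-style features, clipped to the reporting scale.
    raw = INTERCEPT + sum(WEIGHTS[k] * features[k] for k in WEIGHTS)
    return max(lo, min(hi, raw))

response = {"words_per_sec": 2.1, "silence_ratio": 0.25, "types_per_token": 0.55}
print(predict_score(response))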
C1 [Zechner, Klaus; Higgins, Derrick; Xi, Xiaoming; Williamson, David M.] Educ Testing Serv, Automated Scoring & NLP, Princeton, NJ 08541 USA.
RP Zechner, K (reprint author), Educ Testing Serv, Automated Scoring & NLP, Rosedale Rd,MS 11-R, Princeton, NJ 08541 USA.
EM kzechner@ets.org; dhiggins@ets.org; xxi@ets.org; dmwilliamson@ets.org
CR ATTALI Y, 2005, RR0445 ETS
ATTALI Y, 2004, ANN M INT ASS ED ASS
Bachman L., 1996, LANGUAGE TESTING PRA
Bachman L. F., 1990, FUNDAMENTAL CONSIDER
BERNSTEIN J, 2000, P INSTILL2000 DUND S
Bernstein J., 1999, PHONEPASS TESTING ST
Brieman L, 1984, CLASSIFICATION REGRE
Burstein J., 1998, RR9815 ETS
CHODOROW M, 2004, RR73 TOEFL ED TEST S
Clauser BE, 1997, J EDUC MEAS, V34, P141, DOI 10.1111/j.1745-3984.1997.tb00511.x
COHEN J, 1968, PSYCHOL BULL, V70, P213, DOI 10.1037/h0026256
Cucchiarini C, 2002, J ACOUST SOC AM, V111, P2862, DOI 10.1121/1.1471894
Cucchiarini C, 2000, SPEECH COMMUN, V30, P109, DOI 10.1016/S0167-6393(99)00040-0
CUCCHIARINI C, 1997, IEEE AUT SPEECH REC
CUCCHIARINI C, 1997, 3 INT S ACQ 2 LANG S
Cucchiarini C, 2000, J ACOUST SOC AM, V107, P989, DOI 10.1121/1.428279
FRANCO H, 2000, P INSTILL 2000 INT S
Franco H, 2000, SPEECH COMMUN, V30, P121, DOI 10.1016/S0167-6393(99)00045-X
Landauer TK, 1997, PSYCHOL REV, V104, P211, DOI 10.1037/0033-295X.104.2.211
Leacock C, 2003, COMPUT HUMANITIES, V37, P389, DOI 10.1023/A:1025779619903
Leacock C., 2004, EXAMENS, V1
North B., 2000, DEV COMMON FRAMEWORK
PAGE EB, 1966, PHI DELTA KAPPAN, V47, P238
Rudner L., 2006, J TECHNOLOGY LEARNIN, V4
Steinberg D., 1995, CART TREE STRUCTURED
Williamson DM, 1999, J EDUC MEAS, V36, P158, DOI 10.1111/j.1745-3984.1999.tb00552.x
Xi X., 2006, TOEFLIBT01
ZECHNER K, 2006, P 2006 C HUM LANG TE
1997, HUB 4 BROADCAST NEWS
NR 29
TC 21
Z9 21
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 883
EP 895
DI 10.1016/j.specom.2009.04.009
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500007
ER
PT J
AU Wei, S
Hu, GP
Hu, Y
Wang, RH
AF Wei, Si
Hu, Guoping
Hu, Yu
Wang, Ren-Hua
TI A new method for mispronunciation detection using Support Vector Machine
based on Pronunciation Space Models
SO SPEECH COMMUNICATION
LA English
DT Article
DE Automatic speech recognition; Mispronunciation detection; Support Vector
Machine; Pronunciation Space Models
ID SPEECH RECOGNITION; CONFIDENCE MEASURES
AB This paper presents two new ideas for text dependent mispronunciation detection. Firstly, mispronunciation detection is formulated as a classification problem to integrate various predictive features. A Support Vector Machine (SVM) is used as the classifier and the log-likelihood ratios between all the acoustic models and the model corresponding to the given text are employed as features for the classifier. Secondly, Pronunciation Space Models (PSMs) are proposed to enhance the discriminative capability of the acoustic models for pronunciation variations. In PSMs, each phone is modeled with several parallel acoustic models to represent pronunciation variations of that phone at different proficiency levels, and an unsupervised method is proposed for the construction of the PSMs. Experiments on a database consisting of more than 500,000 Mandarin syllables collected from 1335 Chinese speakers show that the proposed methods can significantly outperform the traditional posterior probability based method. The overall recall rates for the 13 most frequently mispronounced phones increase from 17.2%, 7.6% and 0% to 58.3%, 44.3% and 29.5% at three precision levels of 60%, 70%, and 80%, respectively. The improvement is also demonstrated by a subjective experiment with 30 subjects, in which 53.3% of the subjects think the proposed method is better than the traditional one and 23.3% of them think that the two methods are comparable. (C) 2009 Elsevier B.V. All rights reserved.
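A minimal Python sketch of the classification view of mispronunciation detection described in this abstract: log-likelihood ratios between all acoustic models and the model of the given text serve as features for an SVM. The toy data and model inventory are assumptions, not the paper's Pronunciation Space Models:

import numpy as np
from sklearn.svm import SVC

def llr_features(logliks, canonical_index):
    # One feature per acoustic model: log p(O|model_i) - log p(O|canonical model).
    return logliks - logliks[canonical_index]

rng = np.random.default_rng(1)
X = np.vstack([llr_features(rng.normal(size=8), 0) for _ in range(100)])
y = rng.integers(0, 2, size=100)  # toy labels: 1 = mispronounced
clf = SVC(kernel="rbf").fit(X, y)
print(clf.predict(X[:5]))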
C1 [Wei, Si; Wang, Ren-Hua] Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China.
[Wei, Si; Hu, Guoping; Hu, Yu] iFLYTEK Res, Hefei, Anhui, Peoples R China.
RP Wei, S (reprint author), Univ Sci & Technol China, 96 Jinzhai Rd, Hefei 230026, Anhui, Peoples R China.
EM siwei@iflytek.com; gphu@iflytek.com; yuhu@iflytek.com; rhw@ustc.edu.cn
CR ATAL BS, 1994, J ACOUST SOC AM, P1304
Cucchiarini C., 1998, P 1998 INT C SPOK LA, P1739
DONG B, 2006, P INT S CHIN SPOK LA, P580
Franco H., 1999, P EUROSPEECH, P851
Franco H, 2000, SPEECH COMMUN, V30, P121, DOI 10.1016/S0167-6393(99)00045-X
Ito A., 2005, P EUROSPEECH, P173
Jiang H, 2005, SPEECH COMMUN, V45, P455, DOI 10.1016/j.specom.2004.12.004
Lee L, 1998, IEEE T SPEECH AUDI P, V6, P49, DOI 10.1109/89.650310
Lee L., 1996, P ICASSP
LIU Y, 2002, THESIS HONG KONG U S
Liu Y, 2003, COMPUT SPEECH LANG, V17, P357, DOI 10.1016/S0885-2308(03)00008-1
Minematsu N., 2004, P ICSLP, P1317
Neumeyer L., 1999, SPEECH COMMUN, V30, P83
ROSE RC, 1995, P INT C AC SPEECH SI, P281
Saraclar M., 2000, THESIS J HOPKINS U
Strik H., 2007, P INT 07, P1837
TRUONG K, 2004, THESIS ULTRECHT U NE
Vapnik V., 1995, NATURE STAT LEARNING
WEGMANN S, 1996, P ICASSP, P339
Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002
Witt S., 1999, THESIS CAMBRIDGE U E
Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8
YOUNG K, 2000, HTK BOOK HTK VERSION
Zhang F., 2008, P INT C AC SPEECH SI, P2077
Zhang R., 2001, P 7 EUR C SPEECH COM, P2105
NR 25
TC 20
Z9 25
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 896
EP 905
DI 10.1016/j.specom.2009.03.004
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500008
ER
PT J
AU Handley, Z
AF Handley, Zoee
TI Is text-to-speech synthesis ready for use in computer-assisted language
learning?
SO SPEECH COMMUNICATION
LA English
DT Article
DE CALL; Speech synthesis; TTS synthesis; Evaluation
AB Text-to-speech (TTS) synthesis, the generation of speech from text input, offers another means of providing spoken language input to learners in Computer-Assisted Language Learning (CALL) environments. Indeed, many potential benefits (ease of creation and editing of speech models, generation of speech models and feedback on demand, etc.) and uses (talking dictionaries, talking texts, dictation, pronunciation training, dialogue partner, etc.) of TTS synthesis in CALL have been put forward. Yet, the use of TTS synthesis in CALL is not widely accepted and only a few applications have found their way onto the market. One potential reason for this is that TTS synthesis has not been adequately evaluated for this purpose. Previous evaluations of TTS synthesis for use in CALL have only addressed the comprehensibility of TTS synthesis. Yet, CALL places demands on the comprehensibility, naturalness, accuracy, register and expressiveness of the output of TTS synthesis. In this paper, the aforementioned aspects of the quality of the output of four state-of-the-art French TTS synthesis systems are evaluated with respect to their use in the three different roles that TTS synthesis systems may assume within CALL applications, namely: (1) reading machine, (2) pronunciation model and (3) conversational partner [Handley, Z., Hamel, M.-J., 2005. Establishing a methodology for benchmarking speech synthesis for computer-assisted language learning (CALL). Language Learning and Technology Journal 9(3), 99-119. Retrieved from: http://llt.msu.edu/vol9num3/handley/default.html.]. The results of this evaluation suggest that the best TTS synthesis systems are ready for use in applications in which they 'add value' to CALL, i.e. exploit the unique capacity of TTS synthesis to generate speech models on demand. An example of such an application is a dialogue partner. In order to fully meet the requirements of CALL, further attention needs to be paid to accuracy and naturalness, in particular at the prosodic level, and expressiveness. (C) 2008 Elsevier B.V. All rights reserved.
C1 [Handley, Zoee] Univ Manchester, Sch Comp Sci, Manchester M13 9EP, Lancs, England.
RP Handley, Z (reprint author), Univ Nottingham, Learning Sci Res Inst, Exchange Bldg,Jubilee Campus,Wollaton Rd, Nottingham NG7 1BB, England.
EM zoe.handley@nottingham.ac.uk
CR *AUR, 2002, TALK ME CONV METH VE
*BAB TECHN, 2003, BRIGHTSPEECH
Bailly G., 2003, P EUR 2003 GEN, P37
Bennett C. L., 2005, P INT EUR, P105
Beutnagel M., 1999, P JOINT M ASA EAA DA
Black A. B., 2005, P INT 2005 LISB PORT, P77
BLACK AW, 2000, P ICSLP BEIJ CHIN
BLACK AW, 1994, P C COMP LING KYOT J, P983
Campbell N., 1997, PROGR SPEECH SYNTHES, P279
CAMPBELL N, 2006, IEEE T AUDIO SPEECH, V14
CHAPELLE C, 2001, RECALL, V23, P3
Chapelle C., 1998, LANGUAGE LEARNING TE, V2, P22
Chapelle C.A., 2001, COMPUTER APPL 2 LANG
COHEN R, 1993, COMPUT EDUC, V21, P25, DOI 10.1016/0360-1315(93)90044-J
CONKIE A, 1999, P JOINT M ASA EAA DA
DEPIJPER JR, 1997, PROGR SPEECH SYNTHES, P575
DESAINTEXUPERY A, 1999, PETIT PRINCE
Dutoit T., 1997, INTRO TEXT TO SPEECH
EDGINGTON M, 1997, P EUR 97 RHOD GREEC, P1
EGAN BK, 2000, P INSTIL 2000, P4
Ehsani B.K., 1998, LANGUAGE LEARNING TE, V2, P45
*ELSE, 1999, D11 ELSE
FRANCIS AL, 1999, HUMAN FACTORS VOICE, P63
Galliers J. R., 1996, EVALUATING NATURAL L
HAMEL MJ, 2003, THESIS UMIST MANCHES
HAMEL MJ, 2003, P MICTE2003, V3, P1661
Hamel M.-J., 1998, RECALL, V10, P79
Handley Z, 2005, LANG LEARN TECHNOL, V9, P99
HANDLEY Z, 2006, THESIS U MANCHESTER
Henton C., 2002, International Journal of Speech Technology, V5, DOI 10.1023/A:1015416013198
Hincks R., 2002, TMH QPSR, V44, P153
Huang X., 2001, SPOKEN LANGUAGE PROC
JOHNSON WL, 2002, P 7 INT C SPOK LANG
KELLER E, 2000, P INSTIL U AB DUND D, P109
MERCIER G, 2000, P INSTIL, P145
*MULT, 2005, ELITE DOC
PISONI DB, 1987, TEXT SPEECH MITALK S, P151
Polkosky M. D., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1022390615396
Raux Antoine, 2004, P INSTIL ICALL 2004, P147
RODMAN RD, 1999, COMPUTER SPEECH TECH
Santagiustina M, 1999, J OPT B-QUANTUM S O, V1, P191, DOI 10.1088/1464-4266/1/1/033
SCHMIDTNIELSEN A, 1995, APPL SPEECH TECHNOLO, P195
Schroeter J, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P211, DOI 10.1109/WSS.2002.1224411
SENEFF S, 2004, P INSTIL ICALL 2004, P151
SHERWOOD B, 1981, STUDIES LANGUAGE LER, V3, P175
SOBKOWIAK W, 1998, MULTIMEDIA CALL THEO
STEVENS V, 1989, TEACHING LANGUAGES C, P31
STRATIL M, 1987, PROGRAM LEARN EDUC T, V24, P309
Stratil M., 1987, Literary & Linguistic Computing, V2, DOI 10.1093/llc/2.2.116
*TMA ASS, 2003, NUANC US ENGL
van Bezooijen Renee, 1997, HDB STANDARDS RESOUR, P481
NR 51
TC 9
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 906
EP 919
DI 10.1016/j.specom.2008.12.004
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500009
ER
PT J
AU Felps, D
Bortfeld, H
Gutierrez-Osuna, R
AF Felps, Daniel
Bortfeld, Heather
Gutierrez-Osuna, Ricardo
TI Foreign accent conversion in computer assisted pronunciation training
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice conversion; Foreign accent; Speaker identity; Computer assisted
pronunciation training; Implicit feedback
ID PROCESSING TECHNIQUES; VOICE CONVERSION; SPEECH; ENGLISH; SPEAKER;
IDENTIFICATION; RECOGNITION; FRAMEWORK; FEATURES; VOWELS
AB Learners of a second language practice their pronunciation by listening to and imitating utterances from native speakers. Recent research has shown that choosing a well-matched native speaker to imitate can have a positive impact on pronunciation training. Here we propose a voice-transformation technique that can be used to generate the (arguably) ideal voice to imitate: the learner's own voice with a native accent. Our work extends previous research, which suggests that providing learners with prosodically corrected versions of their utterances can be a suitable form of feedback in computer assisted pronunciation training. Our technique provides a conversion of both prosodic and segmental characteristics by means of a pitch-synchronous decomposition of speech into glottal excitation and spectral envelope. We apply the technique to a corpus containing parallel recordings of foreign-accented and native-accented utterances, and validate the resulting accent conversions through a series of perceptual experiments. Our results indicate that the technique can reduce foreign accentedness without significantly altering the voice quality properties of the foreign speaker. Finally, we propose a pedagogical strategy for integrating accent conversion as a form of behavioral shaping in computer assisted pronunciation training. (C) 2008 Elsevier B.V. All rights reserved.
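A minimal Python sketch of the source-filter intuition behind segmental accent conversion: keep the learner's excitation (residual) but impose the native speaker's spectral envelope. Frame-wise LPC here is a crude stand-in for the paper's pitch-synchronous decomposition, the frames are assumed to be time-aligned, and the signals are synthetic:

import numpy as np
import librosa
from scipy.signal import lfilter

def convert_frame(learner_frame, native_frame, order=16):
    a_learner = librosa.lpc(learner_frame, order=order)   # learner spectral envelope
    a_native = librosa.lpc(native_frame, order=order)     # native spectral envelope
    residual = lfilter(a_learner, [1.0], learner_frame)   # learner excitation
    return lfilter([1.0], a_native, residual)              # excitation through native envelope

rng = np.random.default_rng(2)
frame_learner = rng.normal(size=400)
frame_native = rng.normal(size=400)
print(convert_frame(frame_learner, frame_native).shape)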
C1 [Felps, Daniel; Gutierrez-Osuna, Ricardo] Texas A&M Univ, Dept Comp Sci, College Stn, TX 77843 USA.
[Bortfeld, Heather] Texas A&M Univ, Dept Psychol, College Stn, TX 77843 USA.
RP Gutierrez-Osuna, R (reprint author), Texas A&M Univ, Dept Comp Sci, 3112 TAMU, College Stn, TX 77843 USA.
EM dlfelps@cs.tamu.edu; bortfeld@psyc.tamu.edu; rgutier@cs.tamu.edu
CR Abe M., 1988, P ICASSP, P655
ANISFELD M, 1962, J ABNORM PSYCHOL, V65, P223, DOI 10.1037/h0045060
Arslan LM, 1997, J ACOUST SOC AM, V102, P28, DOI 10.1121/1.419608
ARSLAN LM, 1997, VOICE CONVERSION COD, P1347
ARTHUR B, 1974, LANG SPEECH, V17, P255
*AUR, 2002, TALK ME
Bissiri M. P., 2006, P 11 AUSTR INT C SPE, P24
Boersma P., 2007, PRAAT DOING PHONETIC
Bongaerts T, 1999, SEC LANG ACQ RES, P133
RYAN EB, 1975, J PERS SOC PSYCHOL, V31, P855, DOI 10.1037/h0076704
CELCEMURCIA M, 1996, TEACHING PRONUNCIATI, V12
CHILDERS DG, 1989, SPEECH COMMUN, V8, P147, DOI 10.1016/0167-6393(89)90041-1
Chun D., 1998, LANGUAGE LEARNING TE, V2, P61
COMPTON AJ, 1963, J ACOUST SOC AM, V35, P1748, DOI 10.1121/1.1918810
Derwing TM, 1998, LANG LEARN, V48, P393, DOI 10.1111/0023-8333.00047
Dijkstra E. W., 1959, NUMER MATH, V1, P269, DOI DOI 10.1007/BF01386390
ESKENAZI M, 1998, P STILL WORKSH SPEEC
Eskenazi M., 1999, LANGUAGE LEARNING TE, V2, P62
Fant G., 1960, ACOUSTIC THEORY SPEE
GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317
Hansen T. K., 2006, 4 INT C MULT INF COM, P342
Hincks R., 2003, ReCALL, V15, DOI 10.1017/S0958344003000211
Huckvale M., 2007, P ISCA SPEECH SYNTH, P64
Jilka M., 1998, P ESCA WORKSH SPEECH, P115
Kain A, 1998, INT CONF ACOUST SPEE, P285, DOI 10.1109/ICASSP.1998.674423
KENNY OP, 1998, P 1998 IEEE INT C AC, V1, P573, DOI 10.1109/ICASSP.1998.674495
KEWLEYPORT D, 1994, APPL SPEECH TECHNOLO, P565
Kominek J., 2003, CMU ARCTIC DATABASES
KOUNOUDES A, 2002, P INT C AC SPEECH SI, P349
KREIMAN J, 1991, SPEECH COMMUN, V10, P265, DOI 10.1016/0167-6393(91)90016-M
LENNEBERG EH, 1967, BIOL FDN LANGUAGE, V16
LEVY M, 1997, COMPUTER ASSISTED LA, V15
Lippi-Green Rosina, 1997, ENGLISH ACCENT LANGU
Lyster R, 2001, LANG LEARN, V51, P265, DOI 10.1111/j.1467-1770.2001.tb00019.x
MAJOR RC, 2001, FOREIGN ONTOGENY PHY, V9
Makhoul J., 1979, P IEEE INT C AC SPEE, P428
MARKHAM D, 1997, TRAVAUX I LINGUISTIQ
MARTIN P, 2004, WINPITCH LTL 2 MULTI
MATSUMOT.H, 1973, IEEE T ACOUST SPEECH, VAU21, P428, DOI 10.1109/TAU.1973.1162507
MCALLISTER R, 1998, P SPEECH TECHN LANG, P155
MENZEL W, 2000, AUTOMATIC DETECTION, P49
MOULINES E, 1995, SPEECH COMMUN, V16, P175, DOI 10.1016/0167-6393(94)00054-E
MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z
Munro M. J., 1995, STUDIES 2 LANGUAGE A, V17, P17, DOI 10.1017/S0272263100013735
Munro M. J., 1994, LANG TEST, V11, P253, DOI 10.1177/026553229401100302
MUNRO MJ, 1995, LANG LEARN, V45, P73, DOI 10.1111/j.1467-1770.1995.tb00963.x
Murray G., 1999, SYSTEM, V27, P295, DOI 10.1016/S0346-251X(99)00026-3
Nagano K, 1990, 1 INT C SPOK LANG PR, P1169
Neri A., 2003, P 15 INT C PHON SCI, P1157
Neri A., 2002, Computer Assisted Language Learning, V15, DOI 10.1076/call.15.5.441.13473
PAUL DB, 1981, IEEE T ACOUST SPEECH, V29, P786, DOI 10.1109/TASSP.1981.1163643
Peabody M, 2006, LECT NOTES COMPUT SC, V4274, P602
Pelham B. W., 2007, CONDUCTING RES PSYCH
Penfield W, 1959, SPEECH BRAIN MECH
Pennington M. C., 1999, Computer Assisted Language Learning, V12, DOI 10.1076/call.12.5.427.5693
Probst K, 2002, SPEECH COMMUN, V37, P161, DOI 10.1016/S0167-6393(01)00009-7
REPP BH, 1987, SPEECH COMMUN, V6, P1, DOI 10.1016/0167-6393(87)90065-3
Rogers C.L., 1996, J ACOUST SOC AM, V100, P2725, DOI 10.1121/1.416179
SAMBUR MR, 1975, IEEE T ACOUST SPEECH, VAS23, P176, DOI 10.1109/TASSP.1975.1162664
SCHAIRER KE, 1992, MOD LANG J, V76, P309, DOI 10.2307/330161
SCOVEL T, 1988, ISSUES 2 LANGUAGE RE, P206
Sheffert SM, 2002, J EXP PSYCHOL HUMAN, V28, P1447, DOI 10.1037//0096-1523.28.6.1447
*SPEEDLINGU, 2007, GENEVALOGIC
*SPHINX, 2001, SPHINXTRAIN BUILD AC
Sundermann D., 2003, P IEEE WORKSH AUT SP, P676
Sundstrom A., 1998, P ISCA WORKSH SPEECH, P49
TANG M, 2001, VOICE TRANSFORMATION
Tenenbaum JB, 2000, SCIENCE, V290, P2319, DOI 10.1126/science.290.5500.2319
TRAUNMULLER H, 1994, PHONETICA, V51, P170
TURK O, 2005, DONOR SELECTION VOIC
Turk O, 2006, COMPUT SPEECH LANG, V20, P441, DOI 10.1016/j.csl.2005.06.001
VANLANCKER D, 1985, J PHONETICS, V13, P19
VIERUDIMULESCU B, 2005, P ISCA WORKSH PLAST, P66
Wachowicz K. A., 1999, CALICO Journal, V16
WATSON CS, 1989, VOLTA REV, V91, P29
YAN Q, 2004, P IEEE INT C AC SPEE, P637
YOUNG SJ, 1993, HTK HIDDEN MARKOV MO, P153
NR 77
TC 14
Z9 14
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 920
EP 932
DI 10.1016/j.specom.2008.11.004
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500010
ER
PT J
AU Bissiri, MP
Pfitzinger, HR
AF Bissiri, Maria Paola
Pfitzinger, Hartmut R.
TI Italian speakers learn lexical stress of German morphologically complex
words
SO SPEECH COMMUNICATION
LA English
DT Article
DE CALL; Computer Assisted Language Learning; Prosody modification;
Intonation; Local speech rate; Intensity; Performance assessment;
Lexical stress
ID FEEDBACK; RECASTS
AB Italian speakers tend to stress the second component of German morphologically complex words such as compounds and prefix verbs even if the first component is lexically stressed. To improve their prosodic phrasing, an automatic pronunciation teaching method was developed based on auditory feedback of prosodically corrected utterances in the learners' own voices. Basically, the method copies contours of F0, local speech rate, and intensity from reference utterances of a German native speaker to the learners' speech signals. It also adds emphasis to the stress position in order to help the learners better recognise the correct pronunciation and identify their errors. A perception test with German native speakers revealed that manipulated utterances reflect lexical stress significantly better than the corresponding original utterances. Thus, two groups of Italian learners of German were provided with different feedback during a training session, one group with manipulated utterances in their individual voices and the other with correctly pronounced original utterances in the teacher's voice. Afterwards, both groups produced the same sentences again and German native speakers judged the resulting utterances. Resynthesised stimuli, especially with emphasised stress, were found to be more effective feedback than natural stimuli for learning the correct stress position. Since resynthesis was obtained without previous segmentation of the learners' speech signals, this technology could be effectively included in Computer Assisted Language Learning software. (C) 2009 Elsevier B.V. All rights reserved.
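A minimal Python (NumPy) sketch of the contour-transfer idea: warp the teacher's F0 contour onto the learner's utterance length and add extra emphasis around the lexically stressed region. Actual resynthesis of the learner's signal (the paper's F0, rate and intensity manipulation) is assumed to happen elsewhere and is not shown; all values are illustrative:

import numpy as np

def transfer_f0(ref_f0, n_learner_frames, stress_span=None, boost=1.15):
    # Linearly warp the reference contour to the learner's frame count.
    x_ref = np.linspace(0.0, 1.0, len(ref_f0))
    x_learner = np.linspace(0.0, 1.0, n_learner_frames)
    f0 = np.interp(x_learner, x_ref, ref_f0)
    if stress_span is not None:
        start, end = stress_span       # frame indices of the stressed syllable
        f0[start:end] *= boost         # emphasise the correct stress position
    return f0

ref = np.concatenate([np.linspace(180, 220, 50), np.linspace(220, 150, 70)])
print(transfer_f0(ref, 100, stress_span=(20, 40)).round(1)[:10])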
C1 [Bissiri, Maria Paola] Univ Munich, Inst Phonet & Speech Proc IPS, D-80799 Munich, Germany.
[Bissiri, Maria Paola] Univ Sassari, Dipartimento Sci Linguaggi, I-07100 Sassari, Italy.
[Pfitzinger, Hartmut R.] Univ Kiel, Inst Phonet & Digital Speech Proc IPDS, D-24118 Kiel, Germany.
RP Bissiri, MP (reprint author), Univ Munich, Inst Phonet & Speech Proc IPS, Schellingstr 3, D-80799 Munich, Germany.
EM mariapa@phonetik.uni-muenchen.de; hpt@ipds.uni-kiel.de
CR Anderson-Hsieh J., 1994, CALICO Journal, V11
Anderson-Hseih J., 1992, SYSTEM, V20, P51, DOI 10.1016/0346-251X(92)90007-P
Bertinetto Pier Marco, 1981, STRUTTURE PROSODICHE
BERTINETTO PM, 1980, J PHONETICS, V8, P385
Bissiri M. P., 2006, P 11 AUSTR INT C SPE, P24
BISSIRI MP, 2008, THESIS LUDWIGMAXIMIL
BISSIRI MP, 2008, P 4 C SPEECH PROS CO, P639
BISSIRI MP, 2007, PERSPEKTIVEN, V2, P353
Chapelle C., 1998, LANGUAGE LEARNING TE, V2, P22
Chun D., 1998, LANGUAGE LEARNING TE, V2, P61
Chun D. M., 1989, CALICO Journal, V7
CRANEN B, 1984, SYSTEM, V12, P25, DOI 10.1016/0346-251X(84)90044-7
DEBOT K, 1983, LANG SPEECH, V26, P331
Delmonte R, 2000, SPEECH COMMUN, V30, P145, DOI 10.1016/S0167-6393(99)00043-6
DELMONTE R, 2003, ATT 13 GIORN STUD GR, P169
DELMONTE R, 1981, STUDI GRAMMATICA ITA, P69
DELMONTE R, 1997, P ESCA EUR 97 RHOD, V2, P669
DIMPERIO M, 2000, OHIO STATE U WORKING, V54, P59
Dupoux E, 2001, J ACOUST SOC AM, V110, P1606, DOI 10.1121/1.1380437
Eskenazi M., 2000, P INSTIL 2000 INT SP, P73
Eskenazi M., 1999, CALICO Journal, V16
Eskenazi M, 1998, P SPEECH TECHN LANG, P77
Gass S., 1994, 2 LANGUAGE ACQUISITI
Gass Susan M., 1997, INPUT INTERACTION 2
GERMAINRUTHERFO.A, 2000, COMM ALSIC, V3, P61
Hardison DM, 2004, LANG LEARN TECHNOL, V8, P34
HIROSE K, 2003, P EUROSPEECH GEN, V4, P3149
Hirose K, 2004, P INT S TON ASP LANG, P77
JAMES E, 1976, LANG TEACHING, V14, P227
Jessen M., 1995, P 13 INT C PHON SCI, V4, P428
Kohler Klaus, 1995, EINFUHRUNG PHONETIK, VSecond
KOMMISSARCHIK J, 2000, P INSTILL WORKSH SPE, P86
Ladd D. R., 1996, INTONATIONAL PHONOLO
LANE H, 1969, LANG TEACHING, P159
Lehiste I., 1970, SUPRASEGMENTALS
Lyster R, 1998, LANG LEARN, V48, P183, DOI 10.1111/1467-9922.00039
MARTIN P, 2004, P INSTIL ICALL2004 S, P177
Mennen I., 2007, NONNATIVE PROSODY PH, P53
NAGANO K, 1990, P ICSLP KOB, V2, P1169
Neri A., 2002, Computer Assisted Language Learning, V15, DOI 10.1076/call.15.5.441.13473
Nicholas H, 2001, LANG LEARN, V51, P719, DOI 10.1111/0023-8333.00172
PFITZINGER HR, 2009, AIPUK, V38, P21
Probst K, 2002, SPEECH COMMUN, V37, P161, DOI 10.1016/S0167-6393(01)00009-7
Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955
TILLMANN HG, 2004, P INSTIL ICALL2004 S, P17
van der Hulst H., 1999, WORD PROSODIC SYSTEM, P273
VARDANIAN RM, 1964, LANG LEARN, P109
Wang Q., 2008, P 4 C SPEECH PROS CA, P635
NR 48
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 933
EP 947
DI 10.1016/j.specom.2009.03.001
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500011
ER
PT J
AU Saz, O
Yin, SC
Lleida, E
Rose, R
Vaquero, C
Rodriguez, WR
AF Saz, Oscar
Yin, Shou-Chun
Lleida, Eduardo
Rose, Richard
Vaquero, Carlos
Rodriguez, William R.
TI Tools and Technologies for Computer-Aided Speech and Language Therapy
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spoken language learning; Speech disorders; Speech corpora; Automatic
speech recognition; Pronunciation verification
ID RECOGNITION
AB This paper addresses the problem of Computer-Aided Speech and Language Therapy (CASLT). The goal of the work described in the paper is to develop and evaluate a semi-automated system for providing interactive speech therapy to the increasing population of impaired individuals and to help professional speech therapists. A discussion of the development and evaluation of a set of interactive therapy tools, along with the underlying speech technologies that support these tools, is provided. The interactive tools are designed to facilitate the acquisition of language skills in the areas of basic phonatory skills, phonetic articulation and language understanding primarily for children with neuromuscular disorders like dysarthria. Human-machine interaction for all of these areas requires the existence of speech analysis, speech recognition, and speech verification algorithms that are robust with respect to the sources of speech variability that are characteristic of this population of speakers. The paper presents an experimental study that demonstrates the effectiveness of an interactive system for eliciting speech from a population of impaired children and young speakers ranging in age from 11 to 21 years. The performance of automatic speech recognition (ASR) systems and subword-based pronunciation verification (PV) on this domain is also presented. The results indicate that ASR and PV systems configured from speech utterances taken from the impaired speech domain can provide adequate performance, similar to the experts' agreement rate, for supporting the presented CASLT applications. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Saz, Oscar; Lleida, Eduardo; Vaquero, Carlos; Rodriguez, William R.] Univ Zaragoza, GTC, Aragon Inst Engn Res 13A, Zaragoza, Spain.
[Yin, Shou-Chun; Rose, Richard] McGill Univ, Dept Elect & Comp Engn, Montreal, PQ H3A 2A7, Canada.
RP Saz, O (reprint author), Univ Zaragoza, GTC, Aragon Inst Engn Res 13A, Maria de Luna 1, Zaragoza, Spain.
EM oskarsaz@unizar.es; shou-chun.yin@mail.mcgill.ca; lleida@unizar.es;
rose@ece.mcgill.ca; cvaquero@unizar.es; wricardo@unizar.es
RI Lleida, Eduardo/K-8974-2014; Saz Torralba, Oscar/L-7329-2014
OI Lleida, Eduardo/0000-0001-9137-4013;
CR ACEROVILLAN P, 2005, TRATAMIENTO VOZ MANU
Aguinaga G., 2004, PRUEBA LENGUAJE ORAL
Alarcos E., 1950, FONOLOGIA ESPANOLA
ALBOR JC, 1991, ELA EXAMEN LOGOPEDIC
Bengio S., 2004, P OD SPEAK LANG REC, P237
Coorman G, 2000, P INT C SPOK LANG BE, P395
Cucchiarini C., 2007, P INT 2007 ANTW BELG, P2181
DELLER JR, 1991, COMPUT METH PROG BIO, V35, P125, DOI 10.1016/0169-2607(91)90071-Z
DELONG ER, 1988, J BIOMETR, V3, P837
DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1
DUCHATEAU J, 2007, P 10 EUR C SPEECH CO, P1210
ESCARTIN A, 2008, COMUNICA FRAMEWORK
GARCIAGOMEZ R, 1999, P 6 EUR C SPEECH COM, P1067
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
GEROSA M, 2008, P 2008 INT C AC SPEE, P5057
Goronzy S., 1999, P SON RES FOR 99, V1, P9
GRANSTROM B, 2005, P INT LISB, P449
HATZIS A, 1999, THESIS U SHEFFIELD S
HAWLEY M, 2003, P 7 C ASS ADV ASS TE
ITO A, 2008, P 11 INT C SPEECH CO, P2819
JUSTO R, 2008, P 1 INT C AMB MED SY
KIM H, 2008, P INT C SPOK LANG PR, P1741
Koehn P., 2005, P 10 MACH TRANSL SUM
Kornilov A.-U., 2004, P 9 INT C SPEECH COM
LEFEVRE JP, 1996, 1060 TIDE
Legetter C., 1995, COMPUTER SPEECH LANG, V9, P171
Lleida E, 2000, IEEE T SPEECH AUDI P, V8, P126, DOI 10.1109/89.824697
Mangu L., 1999, P EUR C SPEECH COMM, P495
MARTINEZ B, 2007, P 3 C NAC U DISC ZAR
Menendez-Pidal X., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608020
Monfort M., 2001, MENTE SOPORTE GRAFIC
Monfort M., 1989, REGISTRO FONOLOGICO
Moreno A., 1993, P EUR SEPT, P653
MORENO A, 2000, P LREC ATH GREEC, P895
NAVARROMESA JL, 2005, ORAL CORPUS PROJECT
OESTER AM, 2002, P 15 SWED PHON C FON, P45
Patel R, 2002, ALTERNATIVE AUGMENTA, V18, P2, DOI 10.1080/714043392
PRATT SR, 1993, J SPEECH HEAR RES, V36, P1063
Rabiner L., 1978, SIGNAL PROCESSING SE
RODRIGUEZ V, 2008, THESIS U ANTONIO NEB
RODRIGUEZ WR, 2008, P 4 KUAL LUMP INT C
Sanders E., 2002, P 7 INT C SPOK LANG, P661
SAZ O, 2008, P 1 WORKSH CHILD COM
VAQUERO C, 2008, P IEEE INT C AC SPEE, P4509
VICSI K, 1999, P 6 EUR C SPEECH COM, P859
YIN SC, 2009, P 2009 INT C AC SPEE
ZHANG F, 2008, P ICASSP, P5077
NR 47
TC 5
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 948
EP 967
DI 10.1016/j.specom.2009.04.006
PG 20
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500012
ER
PT J
AU Price, P
Tepperman, J
Iseli, M
Duong, T
Black, M
Wang, S
Boscardin, CK
Heritage, M
Pearson, PD
Narayanan, S
Alwan, A
AF Price, Patti
Tepperman, Joseph
Iseli, Markus
Duong, Thao
Black, Matthew
Wang, Shizhen
Boscardin, Christy Kim
Heritage, Margaret
Pearson, P. David
Narayanan, Shrikanth
Alwan, Abeer
TI Assessment of emerging reading skills in young native speakers and
language learners
SO SPEECH COMMUNICATION
LA English
DT Article
DE Children's speech recognition; Reading assessment; Language learning;
Accented English; Speaker adaptation
ID CHILDRENS SPEECH; RECOGNITION; VARIABILITY; INFORMATION; INTERFACE;
LITERACY; READERS
AB To automate assessments of beginning readers, especially those still learning English, we have investigated the types of knowledge sources that teachers use and have tried to incorporate them into an automated system. We describe a set of speech recognition and verification experiments and compare teacher scores with automatic scores in order to decide when a novel pronunciation is best viewed as a reading error or as dialect variation. Since no one classroom teacher is expected to be familiar with as many dialect systems as might occur in an urban classroom, making progress in automated assessments in this area can improve the consistency and fairness of reading assessment. We found that automatic methods performed best when the acoustic models were trained on both native and non-native speech, and argue that this training condition is necessary for automatic reading assessment since a child's reading ability is not directly observable in one utterance. We also found assessment of emerging reading skills in young children to be an area ripe for more research! (C) 2009 Elsevier B.V. All rights reserved.
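A minimal Python sketch of the decision this abstract motivates: flag a word as a reading error only when the recognised phone string matches neither the canonical pronunciation nor a listed dialect variant. The tiny lexicon and phone strings are made-up illustrations, not the study's data:

LEXICON = {
    "ask": {"canonical": "ae s k", "variants": {"ae k s"}},  # hypothetical dialect variant
}

def classify(word, recognised_phones):
    entry = LEXICON[word]
    if recognised_phones == entry["canonical"]:
        return "correct"
    if recognised_phones in entry["variants"]:
        return "dialect variant"
    return "reading error"

print(classify("ask", "ae k s"))   # dialect variant, not a reading error
print(classify("ask", "ae s t"))   # reading error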
C1 [Price, Patti] PPrice Speech & Language Technol, Menlo Pk, CA 94025 USA.
[Tepperman, Joseph; Black, Matthew; Narayanan, Shrikanth] Univ So Calif, Dept Elect Engn, Los Angeles, CA 90089 USA.
[Iseli, Markus] Univ Calif Los Angeles, Dept Elect Engn, Henry Samueli Sch Engn & Appl Sci Engr 63 134 4, Los Angeles, CA 90095 USA.
[Duong, Thao; Pearson, P. David] Univ Calif Berkeley, Grad Sch Educ, Berkeley, CA 94720 USA.
[Boscardin, Christy Kim] Univ Calif San Francisco, Sch Med, Off Med Educ, San Francisco, CA 94143 USA.
[Heritage, Margaret] Univ Calif Los Angeles, CRESST, Los Angeles, CA 90095 USA.
[Alwan, Abeer] Univ Calif Los Angeles, Dept Elect Engn, Henry Samueli Sch Engn & Appl Sci Engr 66 147G 4, Los Angeles, CA 90095 USA.
RP Price, P (reprint author), PPrice Speech & Language Technol, 420 Shirley Way, Menlo Pk, CA 94025 USA.
EM pjp@pprice.com; tepperma@usc.edu; iseli@ee.ucla.edu; thaod@berkeley.edu;
mattthepb@usc.edu; szwang@ee.ucla.edu; BoscardinCK@medsch.ucsf.edu;
mheritag@ucla.edu; ppearson@berkeley.edu; shri@sipi.usc.edu;
alwan@ee.ucla.edu
RI Narayanan, Shrikanth/D-5676-2012
CR Alwan A., 2007, P IEEE INT WORKSH MU, P26
August D., 2006, DEV LITERACY 2 LANGU
BARKER TA, 1995, J EDUC COMPUT RES, V13, P89
BARRON RW, 1986, COGNITION, V24, P93, DOI 10.1016/0010-0277(86)90006-5
BLACK M, 2008, P INTERSPEECH ICSLP, P2783
BUNTSCHUH B, 1998, P ICSLP, P2863
Cassell J., 2001, PERSONAL TECHNOLOGIE, V5, P203
COHEN PR, 1998, P INT C SPOK LANG PR, V2, P249
Coulston R., 2002, P 7 INT C SPOK LANG, V4, P2689
Dalby J., 1999, CALICO Journal, V16
DARVES C, 2002, P 7 INT C SPOK LANG, P16
DIFABRIZZIO G, 1999, P INT DIAL MULT SYST, P9
Eguchi S, 1969, Acta Otolaryngol Suppl, V257, P1
Eskenazi M., 1999, CALICO Journal, V16
FARMER ME, 1992, REM SPEC EDUC, V13, P50
Gerosa M, 2007, SPEECH COMMUN, V49, P847, DOI 10.1016/j.specom.2007.01.002
Goldstein UG., 1980, THESIS MIT CAMBRIDGE
Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X
Hagen A, 2007, SPEECH COMMUN, V49, P861, DOI 10.1016/j.specom.2007.05.004
Harris A., 1982, BASIC READING VOCABU
JONES B, 2007, AM ED RES ASS AERA C
Kazemzadeh A., 2005, P EUR, P1581
KENT RD, 1976, J SPEECH HEAR RES, V19, P421
Lamel LF, 1997, SPEECH COMMUN, V23, P67, DOI 10.1016/S0167-6393(97)00037-X
Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686
LOVETT MW, 1994, BRAIN LANG, V47, P117, DOI 10.1006/brln.1994.1045
MA J, 2002, INT C SPOK LANG PROC, V1, P197
MCCULLOUGH CS, 1995, SCHOOL PSYCHOL REV, V24, P426
Mostow J., 1995, P ACM S US INT SOFTW, P77, DOI 10.1145/215585.215665
Narayanan S, 2002, IEEE T SPEECH AUDI P, V10, P65, DOI 10.1109/89.985544
National Reading Panel, 2000, NIH PUBL, V00-4769
Potamianos A, 2003, IEEE T SPEECH AUDI P, V11, P603, DOI 10.1109/TSA.2003.818026
PRICE P, 2007, WORKSH MULT SIGN PRO
Russell M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607069
Sharma R, 1998, P IEEE, V86, P853, DOI 10.1109/5.664275
Khalili A., 1994, Journal of Research on Computing in Education, V27
Shefelbine J., 1996, BPST BEGINNING PHONI
SMITH BL, 1992, J ACOUST SOC AM, V91, P2165, DOI 10.1121/1.403675
Oviatt S, 2000, EMBODIED CONVERSATIONAL AGENTS, P319
TAKEZAWA T, 1998, P ICSLP SYDN AUSTR
TEPPERMAN J, 2007, P INTERSPEECH ANTW B, P2185
VANDUSEN L, 1993, COMPUTER BASED INTEG, P35
WANG S, 2007, P SLATE FARM PENNS, P120
Whitehurst GJ, 1998, CHILD DEV, V69, P848, DOI 10.1111/j.1467-8624.1998.00848.x
WIBURG K, 1995, COMPUTING TEACHER, V22, P7
Williams SM, 2000, PROCEEDINGS OF ICLS 2000 INTERNATIONAL CONFERENCE OF THE LEARNING SCIENCES, P115
XIAO B, 2002, P 7 INT C SPOK LANG, V1, P629
You H., 2005, P EUR LISB PORT, P749
Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460
WATCH ME READ
PROJECT LISTEN
NR 51
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 968
EP 984
DI 10.1016/j.specom.2009.05.001
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500013
ER
PT J
AU Duchateau, J
Kong, YO
Cleuren, L
Latacz, L
Roelens, J
Samir, A
Demuynck, K
Ghesquiere, P
Verhelst, W
Van Hamme, H
AF Duchateau, Jacques
Kong, Yuk On
Cleuren, Leen
Latacz, Lukas
Roelens, Jan
Samir, Abdurrahman
Demuynck, Kris
Ghesquiere, Pol
Verhelst, Werner
Van Hamme, Hugo
TI Developing a reading tutor: Design and evaluation of dedicated speech
recognition and synthesis modules
SO SPEECH COMMUNICATION
LA English
DT Article
DE Reading tutor; Computer-assisted language learning (CALL); Speech
technology for education
ID LEARNING-DISABILITIES; CORRECTIVE FEEDBACK; CHILDREN; SYSTEM
AB When a child learns to read, the learning process can be enhanced by significant reading practice with individual support from a tutor. But in reality, the availability of teachers or clinicians is limited, so the additional use of a fully automated reading tutor would be beneficial for the child. This paper discusses our efforts to develop an automated reading tutor for Dutch. First, the dedicated speech recognition and synthesis modules in the reading tutor are described. Then, three diagnostic and remedial reading tutor tools are evaluated in practice and improved based on these evaluations: (1) automatic assessment of a child's reading level, (2) oral feedback to a child at the phoneme, syllable or word level, and (3) tracking where a child is reading, for automated screen advancement or for direct feedback to the child. In general, the presented tools work in a satisfactory way, including for children with known reading disabilities. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Duchateau, Jacques; Roelens, Jan; Samir, Abdurrahman; Demuynck, Kris; Van Hamme, Hugo] Katholieke Univ Leuven, ESAT Dept, B-3001 Louvain, Belgium.
[Kong, Yuk On; Latacz, Lukas; Verhelst, Werner] Vrije Univ Brussel, ETRO Dept, B-1050 Brussels, Belgium.
[Cleuren, Leen; Ghesquiere, Pol] Katholieke Univ Leuven, Ctr Parenting Child Welfare & Disabil, B-3000 Louvain, Belgium.
RP Duchateau, J (reprint author), Katholieke Univ Leuven, ESAT Dept, Kasteelpk Arenberg 10,POB 2441, B-3001 Louvain, Belgium.
EM Jacques.Duchateau@esat.kuleuven.be
RI Ghesquiere, Pol/B-9226-2009; Van hamme, Hugo/D-6581-2012
OI Ghesquiere, Pol/0000-0001-9056-7550;
CR ABDOU SM, 2006, P INT 2006 ICSLP 9 I, P849
Adams M. J., 2006, INT HDB LITERACY TEC, V2, P109
Banerjee S., 2003, P 8 EUR C SPEECH COM, P3165
BLACK AW, 2001, P 4 ISCA SPEECH S WO, P63
BLACK M, 2007, P INTERSPEECH ICSLP, P206
CHU M, 2003, P INT C AC SPEECH SI, V1, P264
Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014
Cleuren L., 2008, P 6 INT C LANG RES E
Coltheart M., 1978, STRATEGIES INFORMATI, P151
Daelemans Walter, 2005, MEMORY BASED LANGUAG
DEMUYNCK K, 2006, P INT 2006 ICSLP 9 I, P1622
DEMUYNCK K, 2004, P 4 INT C LANG RES E, V1, P61
DMELLO SK, 2007, P SLATE WORKSH SPEEC, P49
DUCHATEA J, 2007, EUR 10 EUR C SPEECH, P1210
DUCHATEAU J, 2002, P IEEE INT C AC SPEE, V1, P221
Duchateau J., 2006, P ITRW SPEECH REC IN, P59
ESKENAZI M, 2002, PMLA, P48
Hagen A, 2007, SPEECH COMMUN, V49, P861, DOI 10.1016/j.specom.2007.05.004
Heiner C., 2004, P INSTIL ICALL S NLP, P195
Hunt A. J., 1996, P ICASSP 96, P373
KERKHOFF J, 2002, P 13 M COMP LING NET
LATACZ L, 2007, P 6 ISCA WORKSH SPEE, P270
LATACZ L, 2006, P PRORISC IEEE BEN W
MacArthur CA, 2001, ELEM SCHOOL J, V101, P273, DOI 10.1086/499669
MCCOY KM, 1986, READ TEACH, V39, P548
NERI A, 2006, P ISCA INT PITTSB PA, P1982
PANY D, 1988, J LEARN DISABIL, V21, P546
PERKINS VL, 1988, J LEARN DISABIL, V21, P244
Probst K, 2002, SPEECH COMMUN, V37, P161, DOI 10.1016/S0167-6393(01)00009-7
Russell M, 2000, COMPUT SPEECH LANG, V14, P161, DOI 10.1006/csla.2000.0139
SPAAI GWG, 1991, J EDUC RES, V84, P204
Wise BW, 1998, READING AND SPELLING, P473
NR 32
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 985
EP 994
DI 10.1016/j.specom.2009.04.010
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500014
ER
PT J
AU Wang, HC
Waple, CJ
Kawahara, T
AF Wang, Hongcui
Waple, Christopher J.
Kawahara, Tatsuya
TI Computer Assisted Language Learning system based on dynamic question
generation and error prediction for automatic speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Computer Assisted Language Learning (CALL); Second language learning;
Automatic speech recognition; Error prediction
AB We have developed a new Computer Assisted Language Learning (CALL) system to aid students learning Japanese as a second language. The system offers students the chance to practice elementary Japanese by creating their own sentences based on visual prompts, before receiving feedback on their mistakes. It is designed to detect lexical and grammatical errors in the input sentence as well as pronunciation errors in the speech input. Questions are dynamically generated along with sentence patterns of the lesson point, to realize variety and flexibility of the lesson. Students can give their answers with either text input or speech input. To enhance speech recognition performance, a decision tree-based method is incorporated to predict possible errors made by non-native speakers for each generated sentence on the fly. Trials of the system were conducted by foreign university students, and positive feedback was reported. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Wang, Hongcui; Waple, Christopher J.; Kawahara, Tatsuya] Kyoto Univ, Sch Informat, Sakyo Ku, Kyoto 6068501, Japan.
RP Wang, HC (reprint author), Kyoto Univ, Sch Informat, Sakyo Ku, Kyoto 6068501, Japan.
EM wang@ar.media.kyoto-u.ac.jp
CR ABDOU SM, 2006, P ICSLP
Bernstein J., 1999, CALICO Journal, V16
ESKENAZI M, 1998, STILL
FRANCO H, 2000, P INSTIL INT SPEECH
Kawai G, 2000, SPEECH COMMUN, V30, P131, DOI 10.1016/S0167-6393(99)00041-2
LEVIE WH, 1982, ECTJ-EDUC COMMUN TEC, V30, P195
NAGATA N, 2002, COMPUTER ASSISTED SY
NELSON DL, 1976, HUMAN LEARNING MEMOR, V2, P523
NEUMEYER L, 1998, STILL
SMITH MC, 1980, J EXP PSYCHOL GEN, V109, P373, DOI 10.1037/0096-3445.109.4.373
Tsubota Y., 2002, P ICSLP, P1205
Tsubota Y., 2004, P ICSLP, P1689
WANG H, 2008, P ICASSP
Witt S.M., 1999, THESIS
ZINOVJEVA N, 2005, SPEECH TECHNOLOGY
NR 15
TC 4
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 995
EP 1005
DI 10.1016/j.specom.2009.03.006
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500015
ER
PT J
AU McGraw, I
Yoshimoto, B
Seneff, S
AF McGraw, Ian
Yoshimoto, Brandon
Seneff, Stephanie
TI Speech-enabled card games for incidental vocabulary acquisition in a
foreign language
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Intelligent computer assisted language learning;
Computer aided vocabulary acquisition
ID WORD MEANINGS
AB In this paper, we present a novel application for speech technology to aid students with vocabulary acquisition in a foreign language through interactive card games. We describe a generic platform for card game development and then introduce a particular prototype card game called Word War, designed for learning Mandarin Chinese. We assess the feasibility of deploying Word War via the Internet by conducting our first user study remotely and evaluating the performance of the speech recognition component. It was found that the three central concepts in our system were recognized with an error rate of 16.02%. We then turn to assessing the effects of the Word War game on vocabulary retention in a controlled environment. To this end, we performed a user study using two variants of the Word War game: a speaking mode, in which the user issues spoken commands to manipulate the game cards, and a listening mode, in which the computer gives spoken directions that the students must follow by manipulating the cards manually with the mouse. These two modes of learning were compared against a more traditional computer assisted vocabulary learning system: an on-line flash cards program. To assess long-term learning gains as a function of time-on-task, we had the students interact with each system twice over a period of three weeks. We found that all three systems were competitive in terms of the vocabulary words learned as measured by pre-tests and post-tests, with less than a 5% difference among the systems' average overall learning gains. We also conducted surveys, which indicated that the students enjoyed the speaking mode of Word War more than the other two systems. (C) 2009 Elsevier B.V. All rights reserved.
C1 [McGraw, Ian; Yoshimoto, Brandon; Seneff, Stephanie] MIT Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA.
RP McGraw, I (reprint author), MIT Comp Sci & Artificial Intelligence Lab, 32 Vassar St, Cambridge, MA 02139 USA.
EM imcgraw@csail.mit.edu; yoshimoto@alum.mit.edu; seneff@csail.mit.edu
CR Anderson R. C., 1992, AM EDUC, V16, P44
ATWELL E, 1999, RECOGNITION LEARNER
BERNSTEIN J, 1999, COMPUTER ASSISTED LA, V16
BRINDLY G, 1988, STUDIES 2 LANGUAGE A, V10, P217
BROWN TS, 1991, TESOL QUART, V25, P655, DOI 10.2307/3587081
Carey Susan, 1978, LINGUISTIC THEORY PS, P264
Cooley R. E, 2001, AIED 2001 WORKSH PAP, P17
Dalby J., 1999, COMPUTER ASSISTED LA, V16
EHSANI F, 1998, LANGUAGE LEARNING TE
ELLIS R, 1994, LANG LEARN, V44, P449, DOI 10.1111/j.1467-1770.1994.tb01114.x
Ellis R, 1995, APPL LINGUIST, V16, P409, DOI 10.1093/applin/16.4.409
Ellis R., 1999, STUDIES 2 LANGUAGE A, V21, P285, DOI DOI 10.1017/S0272263199002077
Eskenazi M., 2007, SLATE WORKSH SPEECH, P124
Eskenazi M., 1999, LANGUAGE LEARNING TE, V2, P62
GAMPER J, 2002, P 6 WORLD MULT SYST
GAMPER J, 2002, COMPUTER ASSISTED LA
GLASS J, 2003, COMPUTER SPEECH LANG
GRUENSTEIN A, 2008, IMCI 08 P 10 INT C M, P141
GRUNEBERG MM, 1991, LANGUAGE LEARNING J, V4, P60, DOI 10.1080/09571739185200511
Harless WG, 2003, IEEE COMPUT GRAPH, V23, P46, DOI 10.1109/MCG.2003.1231177
HERMAN PA, 1987, NATURE VOCABULARY AC, P19
Hincks R., 2003, ReCALL, V15, DOI 10.1017/S0958344003000211
HOLLAND MM, 1999, COMPUTER ASSISTED LA, V16
Johnson WL, 2004, LECT NOTES COMPUT SC, V3220, P336
KRASHEN S, 1982, INPUT HYPOTHESIS ISS
Krashen S., 1994, IMPLICIT EXPLICIT LE, P45
Krashen S., 2004, LANGUAGE TEACHER, V28/7, P3
Krashen S., 1982, PRINCIPLES PRACTICE
LONG MH, 1981, INPUT INTERACTION 2, P259
MCGRAW I, P AAAI
MCGRAW I, SLATE WORKSH SPEECH
Menzel W., 2001, ReCALL, V13
NATION ISP, 2001, LEARN VOC AN LANG
NERBONNE J, 1998, COMPUTER ASSISTED LA, P543
NERI A, 2006, INTERSPEECH
Oxford R., 1990, TESL CANADA J, V7, P9
Pavlik PI, 2005, COGNITIVE SCI, V29, P559, DOI 10.1207/s15516709cog0000_14
Peabody M, 2006, LECT NOTES COMPUT SC, V4274, P602
PIMSLEUR P, 1980, LEARN FOREIGN LANGUA
Swain M., 1985, INPUT 2 LANGUAGE ACQ, P235
TAO JH, 2008, P BLIZZ CHALL
*TAYL FRANC, CHIN ESS GRAMM ESS G
von Ahn L, 2006, COMPUTER, V39, P92, DOI 10.1109/MC.2006.196
YI J, 2000, ICSLP
TALK ME
NR 45
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 1006
EP 1023
DI 10.1016/j.specom.2009.04.011
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500016
ER
PT J
AU Wik, P
Hjalmarsson, A
AF Wik, Preben
Hjalmarsson, Anna
TI Embodied conversational agents in computer assisted language learning
SO SPEECH COMMUNICATION
LA English
DT Article
DE Second language learning; Dialogue systems; Embodied conversational
agents; Pronunciation training; CALL; CAPT
AB This paper describes two systems using embodied conversational agents (ECAs) for language learning. The first system, called Ville, is a virtual language teacher for vocabulary and pronunciation training. The second system, a dialogue system called DEAL, is a role-playing game for practicing conversational skills. Whereas DEAL acts as a conversational partner with the objective of creating and keeping an interesting dialogue, Ville takes the role of a teacher who guides, encourages and gives feedback to the students. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Wik, Preben; Hjalmarsson, Anna] KTH, Ctr Speech Technol, SE-10044 Stockholm, Sweden.
RP Wik, P (reprint author), KTH, Ctr Speech Technol, Lindstedtsvagen 24, SE-10044 Stockholm, Sweden.
EM preben@speech.kth.se; annah@speech.kth.se
CR AIST G, 2006, P INT PITTSB PA US, P1922
Engwall Olov, 2007, Computer Assisted Language Learning, V20, DOI 10.1080/09588220701489507
BANNERT R, 2004, VAG MOT SVENSKT UTTA
Beskow J., 2003, THESIS KTH STOCKHOLM
Bosseler A, 2003, J AUTISM DEV DISORD, V33, P653, DOI 10.1023/B:JADD.0000006002.82367.4f
BRENNAN SE, 2000, P 38 ANN M ASS COMP
Brusk J., 2007, P ACM FUT PLAY TOR C, P137, DOI 10.1145/1328202.1328227
BURNHAM D, 1999, AVSP, P80
CARLSON R, 2002, P FON STOCKH SWED MA, P65
Ellis Rod, 1994, STUDY 2 LANGUAGE ACQ
Engwall O., 2004, P ICSLP 2004 JEJ ISL, P1693
ENGWALL O, 2008, P INT 2008 BRISB AUS, P2631
Eskenazi M., 1999, LANGUAGE LEARNING TE, V2, P62
FLEGE JE, 1998, STILL SPEECH TECHNOL, P1
Gee JP, 2003, WHAT VIDEO GAMES HAVE TO TEACH US ABOUT LEARNING AND LITERACY, P1
Granstrom B., 1999, P INT C PHON SCI ICP, P655
GUSTAFSON J, 2004, P SIGDIAL
Hjalmarsson A., 2007, P SIGDIAL ANTW BELG, P132
Hjalmarsson A., 2008, P SIGDIAL 2008 COL O
Iuppa Nicholas, 2007, STORY SIMULATIONS SE
JOHNSON WL, 2004, P I ITSEC
KODA T, 1996, P HCI 96 LOND UK, P98
Lester J. C., 1997, Proceedings of the First International Conference on Autonomous Agents, DOI 10.1145/267658.269943
Massaro DW, 2004, J SPEECH LANG HEAR R, V47, P304, DOI 10.1044/1092-4388(2004/025)
Mateas M., 2003, GAM DEV C GAM DES TR
MCALLISTER R, 1997, 2 LANGUAGE SPEECH
Meng H., 2007, AUTOMATIC SPEECH REC, P437
NERI A, 2002, ICSLP, P1209
Neri A., 2002, Computer Assisted Language Learning, V15, DOI 10.1076/call.15.5.441.13473
Prensky M., 2002, HORIZON, V10, P5, DOI DOI 10.1108/10748120210431349
Prensky M., 2001, DIGITAL GAME BASED L
Sjolander K., 2003, P FON 2003 UM U DEP, V9, P93
SKANTZE G, 2005, P SIGDIAL LISB PORT, P178
SLUIJTER AMC, 1995, THESIS
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
van Mulken S, 1998, PEOPLE AND COMPUTER XIII, PROCEEDINGS, P53
von Ahn L, 2006, COMPUTER, V39, P92, DOI 10.1109/MC.2006.196
Walker J. H., 1994, P SIGCHI C HUM FACT, P85, DOI DOI 10.1145/191666.191708
WIK P, 2007, P SLATE 2007
WIK P, 2004, P 17 SWED PHON C FON, P136
NR 40
TC 18
Z9 18
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD OCT
PY 2009
VL 51
IS 10
SI SI
BP 1024
EP 1037
DI 10.1016/j.specom.2009.05.006
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 485AK
UT WOS:000269092500017
ER
PT J
AU Chetouani, M
Faundez-Zanuy, M
Hussain, A
Gas, B
Zarader, JL
Paliwal, K
AF Chetouani, Mohamed
Faundez-Zanuy, Marcos
Hussain, Amir
Gas, Bruno
Zarader, Jean-Luc
Paliwal, Kuldip
TI Special issue on non-linear and non-conventional speech processing
SO SPEECH COMMUNICATION
LA English
DT Editorial Material
C1 [Chetouani, Mohamed; Gas, Bruno; Zarader, Jean-Luc] Univ Paris 06, CNRS, UMR 7222, ISIR, F-75252 Paris, France.
[Faundez-Zanuy, Marcos] Escola Univ Politecn Mataro, Dept Telecommun, Barcelona 08303, Spain.
[Hussain, Amir] Univ Stirling, Dept Math & Comp Sci, Stirling FK9 4LA, Scotland.
[Paliwal, Kuldip] Griffith Univ, Sch Microelect Engn, Brisbane, Qld 4111, Australia.
RP Chetouani, M (reprint author), Univ Paris 06, CNRS, UMR 7222, ISIR, 4 Pl Jussieu, F-75252 Paris, France.
EM mohamed.chetouani@upmc.fr; faundez@eupmt.es; ahu@cs.stir.ac.uk;
bruno.gas@upmc.fr; jean-luc.zarader@upmc.fr; k.paliwal@griffith.edu.au
RI CHETOUANI, Mohamed/F-5854-2010; Faundez-Zanuy, Marcos/F-6503-2012
OI Faundez-Zanuy, Marcos/0000-0003-0605-1282
NR 0
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2009
VL 51
IS 9
SI SI
BP 713
EP 713
DI 10.1016/j.specom.2009.06.001
PG 1
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 475OZ
UT WOS:000268370700001
ER
PT J
AU Huang, HY
Lin, FH
AF Huang, Heyun
Lin, Fuhuei
TI A speech feature extraction method using complexity measure for voice
activity detection in WGN
SO SPEECH COMMUNICATION
LA English
DT Article
DE Kolmogorov complexity; Feature extraction; Speech model; Voice activity
detection; Complexity analysis
ID HIGHER-ORDER STATISTICS; MODEL
AB A novel speech feature extraction algorithm is proposed in this paper for Voice Activity Detection (VAD). Signal complexity analysis based on the definition of Kolmogorov complexity is adopted, which exploits model characteristics of speech production to differentiate speech from white Gaussian noise (WGN). From the viewpoint of speech signal processing, properties of the speech source and vocal tract are explored by complexity analysis. Also, some interesting properties of signal complexity are presented through an experimental study, including complexity analysis of general noise-corrupted signals. Moreover, some complexity-enhanced features and a feature incorporation method are presented. These features incorporate some unique characteristics of speech, like pitch information, vocal organ information, and so on. With a large database of speech signals and synthetic/real Gaussian noise, distributions of the novel features and receiver operating characteristic (ROC) curves are shown, demonstrating that they are promising features for voice activity detection. (C) 2009 Elsevier B.V. All rights reserved.
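As a rough illustration of the kind of complexity feature this abstract describes, the sketch below computes a Lempel-Ziv estimate of Kolmogorov complexity for a median-binarized frame; the binarization, the Kaspar-Schuster-style normalization and the toy comparison are assumptions for illustration, not the paper's exact procedure.

```python
# Hypothetical sketch (not the authors' code): a Lempel-Ziv / Kolmogorov-style
# complexity feature for frame-level voice activity detection, assuming frames
# are binarized around their median before complexity counting.
import numpy as np

def lempel_ziv_complexity(bits):
    """Count the phrases of a simple Lempel-Ziv (1976) parsing of a bit string."""
    s = "".join("1" if b else "0" for b in bits)
    i, c, n = 0, 0, len(s)
    while i < n:
        k = 1
        # grow the current phrase until it no longer occurs in the earlier history
        while i + k <= n and s[i:i + k] in s[:i + k - 1]:
            k += 1
        c += 1
        i += k
    return c

def complexity_feature(frame):
    """Normalized LZ complexity of a speech frame (median-binarized)."""
    bits = frame > np.median(frame)
    n = len(bits)
    return lempel_ziv_complexity(bits) * np.log2(n) / n

# Toy check: white Gaussian noise tends toward higher normalized complexity
# than a quasi-periodic (voiced-speech-like) frame.
rng = np.random.default_rng(0)
noise = rng.standard_normal(512)
voiced = np.sin(2 * np.pi * 120 * np.arange(512) / 8000)
print(complexity_feature(noise), complexity_feature(voiced))
```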
C1 [Huang, Heyun; Lin, Fuhuei] Spreadtrum Commun Inc, Shanghai, Peoples R China.
RP Huang, HY (reprint author), Spreadtrum Commun Inc, Zuchongzhi Rd 2288, Shanghai, Peoples R China.
EM heyun.huang@spreadtrum.com; fuhuei.lin@spreadtrum.com
CR CHI Z, 1998, P INT C SIGN PROC, P1185
Gazor S, 2003, IEEE T SPEECH AUDI P, V11, P498, DOI 10.1109/TSA.2003.815518
KASPAR F, 1987, PHYS REV A, V36, P842, DOI 10.1103/PhysRevA.36.842
LEMPEL A, 1976, IEEE T INFORM THEORY, V22, P75, DOI 10.1109/TIT.1976.1055501
Li K, 2005, IEEE T SPEECH AUDI P, V13, P965, DOI 10.1109/TSA.2005.851955
Nemer E, 2001, IEEE T SPEECH AUDI P, V9, P217, DOI 10.1109/89.905996
Ramirez J, 2005, IEEE SIGNAL PROC LET, V12, P689, DOI 10.1109/LSP.2005.855551
Ramirez J, 2007, IEEE T AUDIO SPEECH, V15, P2177, DOI 10.1109/TASL.2007.903937
Shin JW, 2008, IEEE SIGNAL PROC LET, V15, P257, DOI 10.1109/LSP.2008.917027
Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1
Tahmasbi R, 2007, IEEE T AUDIO SPEECH, V15, P1129, DOI 10.1109/TASL.2007.894521
Tong QY, 1996, CHAOS SOLITON FRACT, V7, P371, DOI 10.1016/0960-0779(95)00070-4
TUCKER R, 1992, IEE PROC-I, V139, P377
NR 13
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2009
VL 51
IS 9
SI SI
BP 714
EP 723
DI 10.1016/j.specom.2009.02.004
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 475OZ
UT WOS:000268370700002
ER
PT J
AU Charbuillet, C
Gas, B
Chetouani, M
Zarader, JL
AF Charbuillet, C.
Gas, B.
Chetouani, M.
Zarader, J. L.
TI Optimizing feature complementarity by evolution strategy: Application to
automatic speaker verification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Feature extraction; Evolution strategy; Speaker verification
ID FEATURE-EXTRACTION; RECOGNITION
AB Conventional automatic speaker verification systems are based on cepstral features like Mel-scale frequency cepstrum coefficients (MFCC) or linear predictive cepstrum coefficients (LPCC). Recently published works showed that the use of complementary features can significantly improve system performance. In this paper, we propose to use an evolution strategy to optimize the complementarity of two filter bank based feature extractors. Experiments made with a state-of-the-art speaker verification system show that significant improvement can be obtained. Compared to the standard MFCC, an equal error rate (EER) improvement of 11.48% and 21.56% was obtained on the 2005 NIST SRE and NTIMIT databases, respectively. Furthermore, the obtained filter banks point out the importance of some specific spectral information for automatic speaker verification. (C) 2009 Elsevier B.V. All rights reserved.
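The following is a loose sketch of how an evolution strategy might search filter-bank parameters for complementarity. The (1+lambda) scheme, the parameterization by normalized centre frequencies, and the placeholder fitness are all assumptions standing in for the paper's actual setup, where the fitness would be the EER of the fused verification system.

```python
# Loose sketch under assumptions: a (1+lambda) evolution strategy over the centre
# frequencies of a second filter bank. The fitness below is a placeholder; in a
# real experiment it would be the EER of the system fusing MFCC with features
# derived from the candidate filter bank.
import numpy as np

def evolve_filterbank(fitness, n_filters=24, pop=8, sigma=0.05, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    parent = np.sort(rng.uniform(0.0, 0.5, n_filters))      # normalized centre freqs
    best = fitness(parent)
    for _ in range(iters):
        children = parent + sigma * rng.standard_normal((pop, n_filters))
        children = np.sort(np.clip(children, 0.0, 0.5), axis=1)
        scores = np.array([fitness(c) for c in children])
        if scores.min() < best:                              # lower EER is better
            best, parent = scores.min(), children[scores.argmin()]
    return parent, best

# Placeholder fitness: pretend the "ideal" complementary bank is linearly spaced.
ideal = np.linspace(0.0, 0.5, 24)
print(evolve_filterbank(lambda c: np.abs(c - ideal).mean())[1])
```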
C1 [Charbuillet, C.; Gas, B.; Chetouani, M.; Zarader, J. L.] Univ Paris 06, CNRS, UMR 7222, ISIR, F-94200 Ivry, France.
RP Charbuillet, C (reprint author), Univ Paris 06, CNRS, UMR 7222, ISIR, F-94200 Ivry, France.
EM christophe.charbuillet@lis.jussieu.fr; bruno.gas@upmc.fr;
mohamed.chetouani@upmc.fr; jean-luc.zarader@upmc.fr
RI CHETOUANI, Mohamed/F-5854-2010
CR BEYER HG, 2002, NAT COMPUT, V1, P2
CAMPBELL WM, 2004, SPEAK LANG REC WORKS, P41
CAMPBELL WM, 2007, IEEE INT C AC SPEECH, V4, P217
Chetouani M, 2005, LECT NOTES ARTIF INT, V3445, P344
Chin-Teng Lin, 2000, IEEE Transactions on Speech and Audio Processing, V8, DOI 10.1109/89.876300
CIERI C, 2006, LREC 2006
FARRELL K, 1998, P 1998 IEEE INT C AC, V2, P1129, DOI 10.1109/ICASSP.1998.675468
Katagiri S, 1998, P IEEE, V86, P2345, DOI 10.1109/5.726793
Mitchell T.M., 1997, MACHINE LEARNING
Miyajima C, 2001, SPEECH COMMUN, V35, P203, DOI 10.1016/S0167-6393(00)00079-0
PARIS G, 2004, LECT NOTES COMPUTER, P267
PELECANOS J, 2001, SPEAK LANG REC WORKH
PRZYBOCKI M, 2006, SPEAK LANG REC WORKS, P1
Reynolds D., 2003, P ICASSP 03, VIV, P784
Reynolds D.A., 2002, P IEEE INT C AC SPEE, V4, P4072
ROSS B, 2000, GECCO, P443
Thian NPH, 2004, LECT NOTES COMPUT SC, V3072, P631
Torkkola K., 2003, Journal of Machine Learning Research, V3, DOI 10.1162/153244303322753742
Vair C., 2006, SPEAK LANG REC WORKS
YI L, 2004, P 8 IEEE INT S HIGH
YIN SC, 2006, SPEAK LANG REC WORKS
ZAMALLOA M, 2006, SPEAK LANG REC WORKS, V1, P1
ZHIYOU M, 2003, IEEE INT C SYST MAN, V5, P4153
NR 23
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2009
VL 51
IS 9
SI SI
BP 724
EP 731
DI 10.1016/j.specom.2009.01.005
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 475OZ
UT WOS:000268370700003
ER
PT J
AU Zouari, L
Chollet, G
AF Zouari, Leila
Chollet, Gerard
TI Efficient codebooks for fast and accurate low resource ASR systems
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Gaussian selection; Codebook
ID SPEECH RECOGNITION; HMMS
AB Today, speech interfaces have become widely employed in mobile devices; thus, recognition speed and resource consumption are becoming new metrics of Automatic Speech Recognition (ASR) performance.
For ASR systems using continuous Hidden Markov Models (HMMs), the computation of the state likelihood is one of the most time-consuming parts. In this paper, we propose novel multi-level Gaussian selection techniques to reduce the cost of state likelihood computation. These methods are based on original and efficient codebooks. The proposed algorithms are evaluated within the framework of a large vocabulary continuous speech recognition task. (C) 2009 Elsevier B.V. All rights reserved.
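A minimal sketch of single-level, codebook-based Gaussian selection in the spirit described above; the VQ codebook over Gaussian means, the shortlist-per-codeword layout, and the likelihood floor are generic textbook choices, not the paper's specific multi-level codebooks.

```python
# Minimal sketch (assumptions, not the paper's algorithm): Gaussian selection with
# a VQ codebook over the Gaussian means. Only Gaussians attached to the codeword
# nearest the observation are evaluated exactly; the rest receive a floor value.
import numpy as np
from scipy.cluster.vq import kmeans2

class GaussianSelector:
    def __init__(self, means, variances, weights, n_codewords=16, seed=0):
        self.means, self.var, self.w = means, variances, weights
        self.codebook, labels = kmeans2(means, n_codewords, minit="++", seed=seed)
        # shortlist of Gaussian indices attached to each codeword
        self.shortlist = [np.where(labels == c)[0] for c in range(n_codewords)]

    def log_likelihood(self, x, floor=-30.0):
        # pick the nearest codeword and evaluate only its shortlist (diag. Gaussians)
        c = np.argmin(((self.codebook - x) ** 2).sum(axis=1))
        idx = self.shortlist[c]
        if idx.size == 0:
            return floor
        d = ((x - self.means[idx]) ** 2 / self.var[idx]).sum(axis=1)
        log_g = -0.5 * (d + np.log(2 * np.pi * self.var[idx]).sum(axis=1))
        log_mix = np.logaddexp.reduce(np.log(self.w[idx]) + log_g)
        return max(log_mix, floor)   # floor stands in for the unevaluated Gaussians
```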
C1 [Zouari, Leila; Chollet, Gerard] GET ENST CNRS LTCI, Dept Traitement Signal & Images, F-75634 Paris, France.
RP Zouari, L (reprint author), GET ENST CNRS LTCI, Dept Traitement Signal & Images, 46 Rue Barrault, F-75634 Paris, France.
EM zouari@enst.fr
CR Aiyer A, 2000, INT CONF ACOUST SPEE, P1519, DOI 10.1109/ICASSP.2000.861939
Bocchieri E., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319405
CHAN A, 2004, INT C SPOK LANG PROC
CHAN A, 2005, EUR C SPEECH COMM TE, P565
Digalakis V, 2000, COMPUT SPEECH LANG, V14, P33, DOI 10.1006/csla.1999.0134
Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P281, DOI 10.1109/89.506931
FILALI K, 2002, INT C SPOK LANG PROC
GALES MJF, 1999, IEEE T SPEECH AUDIO, P470
GALES MJF, 1996, INT C SPOK LANG PROC, P470
Galliano S, 2005, EUR C SPEECH COMM TE
Herman SM, 1998, INT CONF ACOUST SPEE, P485, DOI 10.1109/ICASSP.1998.674473
JURGEN F, 1996, INT C AC SPEECH SIGN, P837
JURGEN F, 1996, EUR C SPEECH COMM TE, P1091
LEE A, 2001, INT C AC SPEECH SIGN, P1269
LEE A, 1997, IEEE INT C AC SPEECH
LEPPANEN J, 2006, INT C AC SPEECH SIGN
LI X, 2006, IEEE T AUD SPEECH LA
MAK B, 2001, IEEE T SPEECH AUDIO, P264
Mokbel C, 2001, IEEE T SPEECH AUDI P, V9, P342, DOI 10.1109/89.917680
MOSUR RM, 1997, EUR C SPEECH COMM TE
OLSEN J, 2000, IEEE NORD PROC S
ORTMANNS S, 1997, EUR C SPEECH COMM TE, P139
PADMANABLAN M, 1997, IEEE WORKSH AUT SPEE, P325
PADMANABLAN M, 1999, IEEE T SPEECH AUDIO, P282
Pellom BL, 2001, IEEE SIGNAL PROC LET, V8, P221, DOI 10.1109/97.935736
SAGAYAMA S, 1995, SPEECH SIGNAL PROCES, V2, P213
SANKAR A, 1999, EUR C SPEECH COMM TE
Sankar A, 2002, SPEECH COMMUN, V37, P133, DOI 10.1016/S0167-6393(01)00063-2
SUONTAUSTS J, 1999, WORKSH AUT SPEECH RE
TAKAHASHI S, 1995, INT C AC SPEECH SIGN, V1, P520
Tsakalidis S, 1999, INT CONF ACOUST SPEE, P569
WOSZCZYNA M, 1998, THESIS KARLSRUHE U
NR 32
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2009
VL 51
IS 9
SI SI
BP 732
EP 743
DI 10.1016/j.specom.2009.01.010
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 475OZ
UT WOS:000268370700004
ER
PT J
AU Iriondo, I
Planet, S
Socoro, JC
Martinez, E
Alias, F
Monzo, C
AF Iriondo, Ignasi
Planet, Santiago
Socoro, Joan-Claudi
Martinez, Elisa
Alias, Francesc
Monzo, Carlos
TI Automatic refinement of an expressive speech corpus assembling
subjective perception and automatic classification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Expressive speech databases; Expression of emotion; Speech technology;
Expressive speech synthesis
ID EMOTIONAL SPEECH; RECOGNITION; DATABASES; FEATURES
AB This paper presents an automatic system able to enhance expressiveness in speech corpora recorded from acted or stimulated speech. The system is trained with the results of a subjective evaluation carried out on a reduced set of the original corpus. Once the system has been trained, it is able to check the complete corpus and perform an automatic pruning of the unclear utterances, i.e. those with expressive styles different from the intended one. The content which most closely matches the subjective classification remains in the resulting corpus. An expressive speech corpus in Spanish, designed and recorded for speech synthesis purposes, has been used to test the presented proposal. The automatic refinement has been applied to the whole corpus and the result has been validated with a second subjective test. (C) 2008 Elsevier B.V. All rights reserved.
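A very small sketch of the pruning idea under stated assumptions: a classifier trained on the subjectively evaluated subset is used to keep only utterances whose predicted style matches the intended one. The feature set and the random-forest classifier are placeholders, not the paper's attribute selection or learner.

```python
# Rough sketch, assumptions made explicit: per-utterance acoustic features are
# already extracted; labels_eval come from the listening test on the reduced set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def refine_corpus(feats_eval, labels_eval, feats_all, intended_all):
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(feats_eval, labels_eval)                 # train on the evaluated subset
    predicted = clf.predict(feats_all)               # classify the full corpus
    keep = predicted == np.asarray(intended_all)     # prune style mismatches
    return np.where(keep)[0]                         # indices of retained utterances
```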
C1 [Iriondo, Ignasi; Planet, Santiago; Socoro, Joan-Claudi; Martinez, Elisa; Alias, Francesc; Monzo, Carlos] Univ Ramon Llull, GPMM Grp Recerca Processament Multimodal Enginyer, Barcelona 08022, Spain.
RP Iriondo, I (reprint author), Univ Ramon Llull, GPMM Grp Recerca Processament Multimodal Enginyer, C Quatre Camins 2, Barcelona 08022, Spain.
EM iriondo@salle.url.edu
RI Planet, Santiago/N-8400-2013; Alias, Francesc/L-1088-2014; Iriondo,
Ignasi/L-1664-2014
OI Planet, Santiago/0000-0003-4573-3462; Alias,
Francesc/0000-0002-1921-2375; Iriondo, Ignasi/0000-0003-2467-4192
FU European Commission [FP6 IST-4-027122-1P]; Spanish Government
[TEC2006-08043/TCM]
FX This work has been partially Supported by the European Commission,
Project SALERO (FP6 IST-4-027122-1P) and the Spanish Government, Project
SAVE (TEC2006-08043/TCM).
CR Alias F, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1698
Pitrelli JF, 2006, IEEE T AUDIO SPEECH, V14, P1099, DOI 10.1109/TASL.2006.876123
BOZKURT B, 2003, 8 EUR C SPEECH COMM, P277
Campbell N, 2005, IEICE T INF SYST, VE88D, P376, DOI 10.1093/ietisy/e88-d.3.376
Campbell N., 2000, P ISCA WORKSH SPEECH, P34
CAMPBELL NW, 2002, P 3 INT C LANG RES E
CAMPBELL WN, 1991, J PHONETICS, V19, P37
Cowie R., 2001, IEEE SIGNAL PROCESSI, V18, P33
Cowie R, 2005, NEURAL NETWORKS, V18, P371, DOI 10.1016/j.neunet.2005.03.002
Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007
Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5
DRIOLI C, 2003, VOQUAL 03, P127
Duda R. O., 2001, PATTERN CLASSIFICATI
FRANCOIS H, 2002, P 3 INT C LANG RES E
Goldberg D. E, 1989, GENETIC ALGORITHMS S
HOZJAN V, 2002, P 3 INT C LANG RES E
Iriondo I, 2007, LECT NOTES ARTIF INT, V4885, P86
IRIONDO I, 2007, 32 IEEE I C AC SPEEC, V4, P821
Iriondo T, 2007, LECT NOTES COMPUT SC, V4507, P646
Krstulovic S., 2007, P INT 2007 ANTW BELG, P1897
Michaelis D, 1997, ACUSTICA, V83, P700
MONTERO JM, 1998, 5 INT C SPOK LANG PR, P923
Montoya N., 1998, ZER REV ESTUDIOS SPA, P161
Monzo C., 2007, P 16 INT C PHON SCI, P2081
Morrison D, 2007, SPEECH COMMUN, V49, P98, DOI 10.1016/j.specom.2006.11.004
MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558
Navas E, 2006, IEEE T AUDIO SPEECH, V14, P1117, DOI 10.1109/TASL.2006.876121
Nogueiras A., 2001, P EUROSPEECH, P2679
Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-581(02)00141-6
PEREZ EH, 2003, FRECUENCIA FONEMAS
Planet S., 2008, P 2 INT WORKSH EMOTI
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
Schroder M, 2004, THESIS SAARLAND U
SCHWEITZER A, 2003, 15 INT C PHON SCI BA, P1301
Shami M, 2007, SPEECH COMMUN, V49, P201, DOI 10.1016/j.specom.2007.01.006
Theune M, 2006, IEEE T AUDIO SPEECH, V14, P1137, DOI 10.1109/TASL.2006.876129
Ververidis D, 2006, SPEECH COMMUN, V48, P1162, DOI 10.1016/j.specom.2006.04.003
WELLS J, 1993, SAMPA COMPUTER READA
Witten I.H., 2005, DATA MINING PRACTICA
NR 39
TC 12
Z9 12
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2009
VL 51
IS 9
SI SI
BP 744
EP 758
DI 10.1016/j.specom.2008.12.001
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 475OZ
UT WOS:000268370700005
ER
PT J
AU Gomez-Vilda, P
Fernandez-Baillo, R
Rodellar-Biarge, V
Lluis, VN
Alvarez-Marquina, A
Mazaira-Fernandez, LM
Martinez-Olalla, R
Godino-Llorente, JI
AF Gomez-Vilda, Pedro
Fernandez-Baillo, Roberto
Rodellar-Biarge, Victoria
Nieto Lluis, Victor
Alvarez-Marquina, Agustin
Miguel Mazaira-Fernandez, Luis
Martinez-Olalla, Rafael
Ignacio Godino-Llorente, Juan
TI Glottal Source biometrical signature for voice pathology detection
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice biometry; Speaker's identification; Speaker biometrical
characterization; Voice pathology detection; Glottal Source
ID ACOUSTIC ANALYSIS; VOCAL FOLDS; PARAMETERS
AB The Glottal Source is an important component of voice as it can be considered as the excitation signal to the voice apparatus. The use of the Glottal Source for pathology detection or for the biometric characterization of the speaker is an important objective in the acoustic study of the voice nowadays. Through the present work a biometric signature based on the power spectral density of the speaker's Glottal Source is presented. It may be shown that this spectral density is related to the vocal fold cover biomechanics, and from the literature it is well known that certain speaker features such as gender, age or pathologic condition leave changes in it. The paper describes the methodology to estimate the biometric signature from the power spectral density of the mucosal wave correlate, which after normalization can be used in pathology detection experiments. Linear Discriminant Analysis is used to confront the detection capability of the parameters defined on this glottal signature among themselves and against classical perturbation parameters. A database of 100 normal and 100 pathologic subjects equally balanced in gender and age is used to derive the best parameter cocktails for pathology detection and quantification purposes and to validate this methodology in voice evaluation tests. In a case study presented to illustrate the detection capability of the exposed methodology, a control subset of 24 + 24 subjects is used to determine a subject's voice condition in a pre- and post-surgical evaluation. Possible applications of the study can be found in pathology detection and grading and in rehabilitation assessment after treatment. (C) 2008 Elsevier B.V. All rights reserved.
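An illustrative sketch of the classification stage only: Linear Discriminant Analysis over per-subject glottal-signature parameters with cross-validated accuracy. The synthetic data and the eight-parameter layout are placeholders, not the paper's parameter cocktails or database.

```python
# Illustration only: LDA over per-subject glottal-signature parameters for
# normal vs. pathological detection, scored with cross-validation. The feature
# matrix here is synthetic stand-in data.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 8))        # 100 normal + 100 pathological, 8 parameters
y = np.repeat([0, 1], 100)
X[y == 1] += 0.8                         # synthetic class separation for the demo

lda = LinearDiscriminantAnalysis()
print(cross_val_score(lda, X, y, cv=5).mean())   # estimated detection accuracy
```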
C1 [Gomez-Vilda, Pedro; Fernandez-Baillo, Roberto; Rodellar-Biarge, Victoria; Nieto Lluis, Victor; Alvarez-Marquina, Agustin; Miguel Mazaira-Fernandez, Luis; Martinez-Olalla, Rafael] Univ Politecn Madrid, Fac Informat, E-28660 Madrid, Spain.
[Ignacio Godino-Llorente, Juan] Univ Politecn Madrid, Escuela Univ Ingn Tecn Telecomunicac, Madrid 28031, Spain.
RP Gomez-Vilda, P (reprint author), Univ Politecn Madrid, Fac Informat, Campus Montegancedo S-N, E-28660 Madrid, Spain.
EM pedro@pino.datsi.fi.upm.es
FU Plan Nacional de I + D + i [TIC2003-08756, TEC2006-12887-CO2-01/02];
Ministry of Education and Science; CAM/UPM [CCG06-UPM/TIC-0028]; Project
HESPERIA; Programme CENIT; Centro para el Desarrollo Tecnologico
Industrial; Ministry of Industry, Spain
FX This work is being funded by Grants TIC2003-08756 and
TEC2006-12887-CO2-01/02 from Plan Nacional de I + D + i, Ministry of
Education and Science, by Grant CCG06-UPM/TIC-0028 from CAM/UPM, and by
Project HESPERIA (http://www.proyecto-hesperia.org) from the Programme
CENIT, Centro para el Desarrollo Tecnologico Industrial, Ministry of
Industry, Spain. The authors want to express their most thanks to the
anonymous reviewers helping to produce a better conceptualized and
understandable manuscript.
CR AKANDE OO, 2005, SPEECH COMMUN, V46, P1
Alku P., 1992, P IEEE INT C AC SPEE, V2, P29
Alku P., 2003, P VOQUAL 03, P81
ARROABARREN I, 2003, P ISCA TUTORIAL RES, P29
Berry DA, 2001, J PHONETICS, V29, P431, DOI 10.1006/jpho.2001.0148
Bimbot Frederic, 2004, EURASIP J APPL SIG P, V4, P430, DOI [DOI 10.1155/S1110865704310024, 10.1155/S1110865704310024]
Boyanov B, 1997, IEEE ENG MED BIOL, V16, P74, DOI 10.1109/51.603651
Deller J. R., 1993, DISCRETE TIME PROCES
Doval B, 2003, P VOQUAL 03, P16
Fant G, 1960, THEORY SPEECH PRODUC
FANT G, 2004, STLQSPR, V4, P1
FERNANDEZBAILLO R, 2007, 7 PAN EUR VOIC C GRO, P94
FERNANDEZBAILLO R, 2007, MAVEBA 07, P65
GODINO JI, 2001, P IEEE ENG MED BIOL, P4253
GODINO JI, 2004, IEEE T BIOMED ENG, V5, P1380
Godino-Llorente JI, 2006, IEEE T BIO-MED ENG, V53, P1943, DOI 10.1109/TBME.2006.871883
GOMEZ P, 2006, LECT NOTES COMPUTER, V3817, P242
GOMEZ P, 2005, P EUROSPEECH 05, P645
GOMEZ P, 2007, P MABEVA 07, P183
GOMEZ P, 2004, P ICSLP 04, P842
Gomez-Vilda P, 2007, J VOICE, V21, P450, DOI 10.1016/j.jvoice.2006.01.008
Hadjitodorov S, 2000, IEEE T INF TECHNOL B, V4, P68, DOI 10.1109/4233.826861
HIRANO M, 1988, ACTA OTO-LARYNGOL, V105, P432, DOI 10.3109/00016488809119497
HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829
JACKSON LB, 1989, IEEE T ACOUST SPEECH, V10, P1606
Johnson RA, 2002, APPL MULTIVARIATE ST, V5th
KUO J, 1999, P ICASSP 99 15 19 MA, V1, P77
Nickel R. M., 2006, IEEE Circuits and Systems Magazine, V6
ORR R, 2003, P ISCA WORKSH VOIC Q, P35
Parsa V, 2000, J SPEECH LANG HEAR R, V43, P469
PRICE PJ, 1989, SPEECH COMMUN, V8, P261, DOI 10.1016/0167-6393(89)90005-8
Ritchings RT, 2002, MED ENG PHYS, V24, P561, DOI 10.1016/S1350-4533(02)00064-4
Rodellar V., 1993, Simulation Practice and Theory, V1, DOI 10.1016/0928-4869(93)90008-E
Rosa MD, 2000, IEEE T BIO-MED ENG, V47, P96
Ruiz MT, 1997, J EPIDEMIOL COMMUN H, V51, P106, DOI 10.1136/jech.51.2.106
SHALVI O, 1990, IEEE T INFORM THEORY, V36, P312, DOI 10.1109/18.52478
STORY BH, 1995, J ACOUST SOC AM, V97, P1249, DOI 10.1121/1.412234
Svec JG, 2000, J ACOUST SOC AM, V108, P1397, DOI 10.1121/1.1289205
TITZE IR, 1994, WORKSH AC VOIC AN NA
Whiteside SP, 2001, J ACOUST SOC AM, V110, P464, DOI 10.1121/1.1379087
NR 40
TC 14
Z9 14
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2009
VL 51
IS 9
SI SI
BP 759
EP 781
DI 10.1016/j.specom.2008.09.005
PG 23
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 475OZ
UT WOS:000268370700006
ER
PT J
AU Bouzid, A
Ellouze, N
AF Bouzid, A.
Ellouze, N.
TI Voice source parameter measurement based on multi-scale analysis of
electroglottographic signal
SO SPEECH COMMUNICATION
LA English
DT Article
DE Electroglottographic signal; Multi-scale product; Voicing decision;
Voice source parameters; Fundamental frequency; Open quotient
ID EDGE-DETECTION; SCALE MULTIPLICATION; SPEECH; CONTACT; DOMAIN; MODEL;
FLOW
AB This paper deals with glottal parameter measurement from the electroglottographic (EGG) signal. The proposed approach is based on glottal closure instants (GCIs) and glottal opening instants (GOIs) determined by multi-scale analysis of the EGG signal. The wavelet transform of the EGG signal is computed with a quadratic spline function. Wavelet coefficients calculated on different dyadic scales show modulus maxima at localized discontinuities of the EGG signal. The detected maxima and minima correspond to the GOIs and GCIs. To improve the GCI and GOI localization precision, the product of wavelet transform coefficients at three successive dyadic scales, called the multi-scale product (MP), is computed. This process enhances edges and reduces noise and spurious peaks. Applying a cubic root amplitude to the multi-scale product improves the detection of weak GOI maxima and avoids GCI misses. Applied to the Keele University database, the method yields good detection of GCIs and GOIs. Based on the GCIs and GOIs, voicing classification, pitch frequency and open quotient measurements are carried out. The proposed voicing classification approach is evaluated with additive noise. For clean signals the performance is 96.4%, and at an SNR of 5 dB it is 93%. For the fundamental frequency and open quotient measurements, comparison of the MP with the DEGG, Howard (3/7), threshold (35% and 50%), and DECOM methods shows that the proposed approach is comparable to the major methods, with the added advantage of the lowest deviation. (C) 2008 Elsevier B.V. All rights reserved.
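A hedged sketch of the multi-scale product step: stationary wavelet detail coefficients at three dyadic scales are multiplied, a cubic root is applied, and peaks are picked. PyWavelets' 'bior1.3' wavelet stands in for the quadratic spline wavelet of the paper, and the peak-picking heuristics are assumptions.

```python
# Sketch with stated assumptions: pywt's stationary wavelet transform with a
# 'bior1.3' wavelet substitutes for the quadratic spline wavelet; peak spacing
# is a crude heuristic, not the paper's post-processing.
import numpy as np
import pywt
from scipy.signal import find_peaks

def multiscale_product(egg, wavelet="bior1.3", levels=3):
    n = len(egg) - len(egg) % (2 ** levels)         # swt needs a multiple of 2**levels
    coeffs = pywt.swt(egg[:n], wavelet, level=levels)
    details = np.array([cD for _, cD in coeffs])    # detail coefficients, 3 dyadic scales
    return np.cbrt(np.prod(details, axis=0))        # cubic root enhances weak GOI peaks

def detect_gci_goi(egg, fs, min_f0=60.0):
    mp = multiscale_product(egg)
    dist = int(fs / (2 * min_f0))                   # minimum spacing between peaks
    gci, _ = find_peaks(-mp, distance=dist)         # minima taken as closure instants
    goi, _ = find_peaks(mp, distance=dist)          # maxima taken as opening instants
    return gci, goi
```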
C1 [Bouzid, A.] ISECS, Sfax 3018, Tunisia.
[Ellouze, N.] ENIT, Tunis 1002, Tunisia.
RP Bouzid, A (reprint author), ISECS, PB 868, Sfax 3018, Tunisia.
EM bouzidacha@yahoo.fr
CR ANASTAPLO S, 1988, J ACOUST SOC AM, V83, P1883, DOI 10.1121/1.396472
Bao P, 2005, IEEE T PATTERN ANAL, V27, P1485, DOI 10.1109/TPAMI.2005.173
BOUZID A, 2003, P EUR 2003 GEN, P2837
Bouzid A., 2004, P EUR SIGN PROC C EU, P729
BROOKES M, 2008, SPEECH PROCESSING TO
CHILDERS DG, 1985, CRIT REV BIOMED ENG, V12, P131
CHILDERS DG, 1984, FOLIA PHONIATR, V36, P105
CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044
Fisher E, 2006, IEEE T AUDIO SPEECH, V14, P502, DOI 10.1109/TSA.2005.857806
Henrich N, 2004, J ACOUST SOC AM, V115, P1321, DOI 10.1121/1.1646401
HERBST C, 2004, THESIS DEP SPEECH MU
HOWARD DM, 1995, J VOICE, V9, P1212
HOWARD DM, 1990, J VOICE, V4, P205, DOI 10.1016/S0892-1997(05)80015-3
Kadambe S., 1991, P ICASSP, P449, DOI 10.1109/ICASSP.1991.150373
KRISHNAMURTHY AK, 1986, IEEE T ACOUST SPEECH, V34, P730, DOI 10.1109/TASSP.1986.1164909
Mallat S., 1999, WAVELET TOUR SIGNAL
MALLAT S, 1992, IEEE T INFORM THEORY, V38, P617, DOI 10.1109/18.119727
MCKENNA J, 1999, P EUR 99 BUD, P2793
Naylor PA, 2007, IEEE T AUDIO SPEECH, V15, P34, DOI 10.1109/TASL.2006.876878
NOGOC TV, 1999, P EUR 1999, P2805
PEREZ J, 2005, P ICSLP 2005, P1065
Plante F., 1995, P EUR, P837
Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109
ROSENFEL.A, 1970, PR INST ELECTR ELECT, V58, P814, DOI 10.1109/PROC.1970.7756
ROTHENBERG M, 1988, J SPEECH HEAR RES, V31, P338
Sadler BM, 1999, IEEE T INFORM THEORY, V45, P1043, DOI 10.1109/18.761341
Sadler BM, 1998, J ACOUST SOC AM, V104, P955, DOI 10.1121/1.423312
Sapienza CM, 1998, J VOICE, V12, P31, DOI 10.1016/S0892-1997(98)80073-8
Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1
VEENEMAN DE, 1985, IEEE T ACOUST SPEECH, V33, P369, DOI 10.1109/TASSP.1985.1164544
XU YS, 1994, IEEE T IMAGE PROCESS, V3, P747
Zhang L, 2002, PATTERN RECOGN LETT, V23, P1771, DOI 10.1016/S0167-8655(02)00151-4
NR 32
TC 5
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2009
VL 51
IS 9
SI SI
BP 782
EP 792
DI 10.1016/j.specom.2008.08.004
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 475OZ
UT WOS:000268370700007
ER
PT J
AU Kroger, BJ
Kannampuzha, J
Neuschaefer-Rube, C
AF Kroeger, Bernd J.
Kannampuzha, Jim
Neuschaefer-Rube, Christiane
TI Towards a neurocomputational model of speech production and perception
SO SPEECH COMMUNICATION
LA English
DT Review
DE Speech; Speech production; Speech perception; Neurocomputational model;
Artificial neural networks; Self-organizing networks
ID NEURAL-NETWORK MODEL; TEMPORAL-LOBE; CATEGORICAL PERCEPTION; ACTION
REPRESENTATION; SENSORIMOTOR CONTROL; LANGUAGE PRODUCTION; SPANISH
VOWELS; LEXICAL ACCESS; BRAIN; FMRI
AB The limitation in performance of current speech synthesis and speech recognition systems may result from the fact that these systems are not designed with respect to the human neural processes of speech production and perception. A neurocomputational model of speech production and perception is introduced which is organized with respect to human neural processes of speech production and perception. The production-perception model comprises an artificial computer-implemented vocal tract as a front-end module, which is capable of generating articulatory speech movements and acoustic speech signals. The structure of the production-perception model comprises motor and sensory processing pathways. Speech knowledge is collected during training stages which imitate early stages of speech acquisition. This knowledge is stored in artificial self-organizing maps. The current neurocomputational model is capable of producing and perceiving vowels, VC-, and CV-syllables (V = vowels and C = voiced plosives). Basic features of natural speech production and perception are predicted from this model in a straightforward way: production of speech items is feedforward and feedback controlled, and phoneme realizations vary within perceptually defined regions. Perception is less categorical in the case of vowels in comparison to consonants. Due to its human-like production-perception processing, the model should be discussed as a basic module for more technically relevant approaches to high-quality speech synthesis and high-performance speech recognition. (C) 2008 Elsevier B.V. All rights reserved.
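To make the self-organizing-map component concrete, here is a generic SOM training loop over feature vectors with a shrinking Gaussian neighbourhood. This is an assumption-level illustration of the map type named in the abstract, not the authors' implementation or training schedule.

```python
# Generic SOM sketch (illustration only): a 2-D map trained on auditory/motor
# feature vectors, with learning rate and neighbourhood width shrinking over time.
import numpy as np

def train_som(data, grid=(10, 10), epochs=20, lr0=0.5, sigma0=3.0, seed=0):
    rng = np.random.default_rng(seed)
    h, w = grid
    weights = rng.standard_normal((h, w, data.shape[1]))
    coords = np.stack(np.meshgrid(np.arange(h), np.arange(w), indexing="ij"), -1)
    n_steps, step = epochs * len(data), 0
    for _ in range(epochs):
        for x in rng.permutation(data):
            frac = step / n_steps
            lr, sigma = lr0 * (1 - frac), sigma0 * (1 - frac) + 0.5
            bmu = np.unravel_index(np.argmin(((weights - x) ** 2).sum(-1)), grid)
            dist2 = ((coords - np.array(bmu)) ** 2).sum(-1)
            nb = np.exp(-dist2 / (2 * sigma ** 2))[..., None]   # neighbourhood kernel
            weights += lr * nb * (x - weights)                  # pull units toward input
            step += 1
    return weights
```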
C1 [Kroeger, Bernd J.] Univ Hosp Aachen, Dept Phoniatr Pedaudiol & Commun Disorders, Aachen, Germany.
Univ Aachen, D-5100 Aachen, Germany.
RP Kroger, BJ (reprint author), Univ Hosp Aachen, Dept Phoniatr Pedaudiol & Commun Disorders, Aachen, Germany.
EM bkroeger@ukaachen.de; jkannampuzha@ukaachen.de; cneuschaefer@ukaachen.de
RI Kroger, Bernd/A-4435-2009
OI Kroger, Bernd/0000-0002-4727-2957
FU German Research Council [KR 1439/13-1]
FX This work was supported in part by the German Research Council Grant No.
KR 1439/13-1.
CR Ackermann H, 2004, BRAIN LANG, V89, P320, DOI 10.1016/S0093-934X(03)00347-X
Arbib MA, 2005, BEHAV BRAIN SCI, V28, P105, DOI 10.1017/S0140525X05000038
Bailly G, 1997, SPEECH COMMUN, V22, P251, DOI 10.1016/S0167-6393(97)00025-3
Batchelder EO, 2002, COGNITION, V83, P167, DOI 10.1016/S0010-0277(02)00002-1
Benson RR, 2001, BRAIN LANG, V78, P364, DOI 10.1006/brln.2001.2484
Benzeghiba M, 2007, SPEECH COMMUN, V49, P763, DOI 10.1016/j.specom.2007.02.006
Binder JR, 2000, CEREB CORTEX, V10, P512, DOI 10.1093/cercor/10.5.512
Birkholz P., 2006, P 7 INT SEM SPEECH P, P493
BIRKHOLZ P, 2007, P 16 INT C PHON SCI, P377
Birkholz P., 2004, P INT 2004 ICSLP JEJ, P1125
BIRKHOLZ P, 2006, P INT C AC SPEECH SI, P873
Birkhoz P, 2007, IEEE T AUDIO SPEECH, V15, P1218, DOI 10.1109/TASL.2006.889731
Blank SC, 2002, BRAIN, V125, P1829, DOI 10.1093/brain/awf191
Boatman D, 2004, COGNITION, V92, P47, DOI 10.1016/j.cognition.2003.09.010
Bookheimer SY, 2000, NEUROLOGY, V55, P1151
BRADLOW AR, 1995, J ACOUST SOC AM, V97, P1916, DOI 10.1121/1.412064
Brent MR, 1999, TRENDS COGN SCI, V3, P294, DOI 10.1016/S1364-6613(99)01350-9
BROWMAN CP, 1992, PHONETICA, V49, P155
Browman CP, 1989, PHONOLOGY, V6, P201, DOI 10.1017/S0952675700001019
BULLOCK D, 1993, J COGNITIVE NEUROSCI, V5, P408, DOI 10.1162/jocn.1993.5.4.408
Callan DE, 2006, NEUROIMAGE, V31, P1327, DOI 10.1016/j.neuroimage.2006.01.036
Cervera T, 2001, J SPEECH LANG HEAR R, V44, P988, DOI 10.1044/1092-4388(2001/077)
Clark RAJ, 2007, SPEECH COMMUN, V49, P317, DOI 10.1016/j.specom.2007.01.014
Damper RI, 2000, PERCEPT PSYCHOPHYS, V62, P843, DOI 10.3758/BF03206927
Dell GS, 1999, COGNITIVE SCI, V23, P517
EIMAS PD, 1963, LANG SPEECH, V6, P206
Fadiga L, 2004, J CLIN NEUROPHYSIOL, V21, P157, DOI 10.1097/00004691-200405000-00004
Fadiga L, 2002, EUR J NEUROSCI, V15, P399, DOI 10.1046/j.0953-816x.2001.01874.x
FOWLER CA, 1986, J PHONETICS, V14, P3
FRY DB, 1962, LANG SPEECH, V5, P171
Gaskell MG, 1997, LANG COGNITIVE PROC, V12, P613
Goldstein L, 2006, ACTION TO LANGUAGE VIA THE MIRROR NEURON SYSTEM, P215, DOI 10.1017/CBO9780511541599.008
Goldstein L, 2007, COGNITION, V103, P386, DOI 10.1016/j.cognition.2006.05.010
Grossberg S, 2003, J PHONETICS, V31, P423, DOI 10.1016/S0095-4470(03)00051-2
GUENTHER FH, 1994, BIOL CYBERN, V72, P43, DOI 10.1007/BF00206237
Guenther FH, 2006, J COMMUN DISORD, V39, P350, DOI 10.1016/j.jcomdis.2006.06.013
Guenther FH, 2006, BRAIN LANG, V96, P280, DOI 10.1016/j.bandl.2005.06.001
GUENTHER FH, 1995, PSYCHOL REV, V102, P594
Hartsuiker RJ, 2001, COGNITIVE PSYCHOL, V42, P113, DOI 10.1006/cogp.2000.0744
Heim S, 2003, COGNITIVE BRAIN RES, V16, P285, DOI 10.1016/S0926-6410(02)00284-7
Hickok G, 2004, COGNITION, V92, P67, DOI 10.1016/j.cognition.2003.10.011
Hickok G, 2000, TRENDS COGN SCI, V4, P131, DOI 10.1016/S1364-6613(00)01463-7
Hillis AE, 2004, BRAIN, V127, P1479, DOI 10.1093/brain/awh172
Huang J, 2001, HUM BRAIN MAPP, V15, P39
Iacoboni M, 2005, CURR OPIN NEUROBIOL, V15, P632, DOI 10.1016/j.conb.2005.10.010
Indefrey P, 2004, COGNITION, V92, P101, DOI 10.1016/j.cognition.2002.06.001
Ito T, 2004, BIOL CYBERN, V91, P275, DOI 10.1007/s00422-004-0510-6
Jardri R, 2007, NEUROIMAGE, V35, P1645, DOI 10.1016/j.neuroimage.2007.02.002
Jusczyk PW, 1999, TRENDS COGN SCI, V3, P323, DOI 10.1016/S1364-6613(99)01363-7
Kandel E. R., 2000, PRINCIPLES NEURAL SC
Kemeny S, 2005, HUM BRAIN MAPP, V24, P173, DOI 10.1002/hbm.20078
Kohler E, 2002, SCIENCE, V297, P846, DOI 10.1126/science.1070311
Kohonen T., 2001, SELF ORG MAPS
Kroger BJ, 2007, LECT NOTES COMPUT SC, V4775, P174
KROGER BJ, 1993, PHONETICA, V50, P213
KROGER BJ, 2006, P 9 INT C SPOK LANG, P565
KROGER BJ, 2006, DAGA P ANN M GERM AC, P561
Kroger BJ, 2008, LECT NOTES ARTIF INT, V5042, P121
KROGER BJ, 2006, P 7 INT SEM SPEECH P, P67
Kuriki S, 1999, NEUROREPORT, V10, P765, DOI 10.1097/00001756-199903170-00019
Latorre J, 2006, SPEECH COMMUN, V48, P1227, DOI 10.1016/j.specom.2006.05.003
LEVELT WJM, 1994, COGNITION, V50, P239, DOI 10.1016/0010-0277(94)90030-2
Levelt WJM, 1999, BEHAV BRAIN SCI, V22, P1
Li P, 2004, NEURAL NETWORKS, V17, P1345, DOI 10.1016/j.neunet.2004.07.004
LIBERMAN AM, 1957, J EXP PSYCHOL, V54, P358, DOI 10.1037/h0044417
LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6
LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279
Liebenthal E, 2005, CEREB CORTEX, V15, P1621, DOI 10.1093/cercor/bhi040
Luce PA, 2000, PERCEPT PSYCHOPHYS, V62, P615, DOI 10.3758/BF03212113
Maass W, 1999, INFORM COMPUT, V153, P26, DOI 10.1006/inco.1999.2806
MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0
Murphy K, 1997, J APPL PHYSIOL, V83, P1438
Nasir SM, 2006, CURR BIOL, V16, P1918, DOI 10.1016/j.cub.2006.07.069
Norris D, 2006, COGNITIVE PSYCHOL, V53, P146, DOI 10.1016/j.cogpsych.2006.03.001
Obleser J, 2006, HUM BRAIN MAPP, V27, P562, DOI 10.1002/hbm.20201
Obleser J, 2007, J NEUROSCI, V27, P2283, DOI 10.1523/JNEUROSCI.4663-06.2007
Okada K, 2006, BRAIN LANG, V98, P112, DOI 10.1016/j.bandl.2006.04.006
Oller DK, 1999, J COMMUN DISORD, V32, P223, DOI 10.1016/S0021-9924(99)00013-1
PERKELL JS, 1993, J ACOUST SOC AM, V93, P2948, DOI 10.1121/1.405814
Poeppel D, 2004, NEUROPSYCHOLOGIA, V42, P183, DOI 10.1016/j.neuropsychologia.2003.07.010
Postma A, 2000, COGNITION, V77, P97, DOI 10.1016/S0010-0277(00)00090-1
Raphael L.J., 2007, SPEECH SCI PRIMER PH
Riecker A, 2006, NEUROIMAGE, V29, P46, DOI 10.1016/j.neuroimage.2005.03.046
Rimol LM, 2005, NEUROIMAGE, V26, P1059, DOI 10.1016/j.neuroimage.2005.03.028
RITTER H, 1989, BIOL CYBERN, V61, P241, DOI 10.1007/BF00203171
Rizzolatti G, 2004, ANNU REV NEUROSCI, V27, P169, DOI 10.1146/annurev.neuro.27.070203.144230
Rizzolatti G, 1998, TRENDS NEUROSCI, V21, P188, DOI 10.1016/S0166-2236(98)01260-0
Rosen HJ, 2000, BRAIN COGNITION, V42, P201, DOI 10.1006/brcg.1999.1100
Saltzman E, 2000, HUM MOVEMENT SCI, V19, P499, DOI 10.1016/S0167-9457(00)00030-0
SALTZMAN E, 1979, J MATH PSYCHOL, V20, P91, DOI 10.1016/0022-2496(79)90020-8
Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2
Sanguineti V, 1997, BIOL CYBERN, V77, P11, DOI 10.1007/s004220050362
Scharenborg O, 2007, SPEECH COMMUN, V49, P336, DOI 10.1016/j.specom.2007.01.009
Scott SK, 2000, BRAIN, V123, P2400, DOI 10.1093/brain/123.12.2400
SHADMEHR R, 1994, J NEUROSCI, V14, P3208
Shuster LI, 2005, BRAIN LANG, V93, P20, DOI 10.1016/j.bandl.2004.07.007
Sober SJ, 2003, J NEUROSCI, V23, P6982
Soros P, 2006, NEUROIMAGE, V32, P376, DOI 10.1016/j.neuroimage.2006.02.046
Studdert-Kennedy M, 2002, MIRROR NEURONS EVOLU, P207
Tani J, 2004, NEURAL NETWORKS, V17, P1273, DOI 10.1016/j.neunet.2004.05.007
Todorov E, 2004, NAT NEUROSCI, V7, P907, DOI 10.1038/nn1309
Tremblay S, 2003, NATURE, V423, P866, DOI 10.1038/nature01710
Ullman MT, 2001, NAT REV NEUROSCI, V2, P717, DOI 10.1038/35094573
Uppenkamp S, 2006, NEUROIMAGE, V31, P1284, DOI 10.1016/j.neuroimage.2006.01.004
Vanlancker-Sidtis D, 2003, BRAIN LANG, V85, P245, DOI 10.1016/S0093-934X(02)00596-5
Varley R, 2001, APHASIOLOGY, V15, P39
Werker JF, 2005, TRENDS COGN SCI, V9, P519, DOI 10.1016/j.tics.2005.09.003
Westermann G, 2004, BRAIN LANG, V89, P393, DOI 10.1016/S0093-934X(03)00345-6
Wilson M, 2005, PSYCHOL BULL, V131, P460, DOI 10.1037/0033-2909.131.3.460
Wilson SM, 2004, NAT NEUROSCI, V7, P701, DOI 10.1038/nn1263
Wise RJS, 1999, LANCET, V353, P1057, DOI 10.1016/S0140-6736(98)07491-1
Zekveld AA, 2006, NEUROIMAGE, V32, P1826, DOI 10.1016/j.neuroimage.2006.04.199
Zell A., 2003, SIMULATION NEURONALE
NR 113
TC 27
Z9 27
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2009
VL 51
IS 9
SI SI
BP 793
EP 809
DI 10.1016/j.specom.2008.08.002
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 475OZ
UT WOS:000268370700008
ER
PT J
AU Hsieh, CH
Feng, TY
Huang, PC
AF Hsieh, Cheng-Hsiung
Feng, Ting-Yu
Huang, Po-Chin
TI Energy-based VAD with grey magnitude spectral subtraction
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice activity detection; Grey system; Magnitude spectral subtraction;
G.729; GSM AMR
ID VOICE ACTIVITY DETECTION; ROBUST SPEECH RECOGNITION; MODEL; ENHANCEMENT;
NOISE
AB In this paper, we propose a novel voice activity detection (VAD) scheme for low SNR conditions with additive white noise. The proposed approach consists of two parts. First, a grey magnitude spectral subtraction (GMSS) is applied to remove additive noise from the given noisy speech. By doing this, an estimate of the clean speech is obtained. Second, the speech enhanced by the GMSS is segmented and fed into an energy-based VAD to determine whether each segment is speech or non-speech. The approach presented in this paper is called the GMSS/EVAD. Simulation results indicate that the proposed GMSS/EVAD outperforms the VAD in G.729 and GSM AMR for the given low SNR examples. To investigate the performance of the GMSS/EVAD for real-life background noises, the babble and volvo noises in the NOISEX-92 database are considered. The simulation results for the given examples indicate that the GMSS/EVAD handles babble noise appropriately at SNRs above 10 dB and volvo noise at SNRs of 15 dB and up. (C) 2008 Elsevier B.V. All rights reserved.
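A minimal two-stage sketch in the spirit of the abstract, with a substitution made explicit: plain magnitude spectral subtraction (noise estimated from leading frames) stands in for the grey-model-based GMSS, followed by a simple frame-energy VAD.

```python
# Minimal sketch under stated assumptions: ordinary magnitude spectral subtraction
# replaces the paper's grey-model GMSS; the second stage is a frame-energy
# thresholding VAD over the enhanced frames.
import numpy as np

def spectral_subtract(x, frame=256, hop=128, noise_frames=10, beta=0.02):
    win = np.hanning(frame)
    frames = [x[i:i + frame] * win for i in range(0, len(x) - frame, hop)]
    spec = np.array([np.fft.rfft(f) for f in frames])
    noise_mag = np.abs(spec[:noise_frames]).mean(axis=0)      # leading frames = noise
    mag = np.maximum(np.abs(spec) - noise_mag, beta * noise_mag)
    clean = mag * np.exp(1j * np.angle(spec))                 # keep the noisy phase
    return np.array([np.fft.irfft(s) for s in clean])         # enhanced frames

def energy_vad(enhanced_frames, ratio=2.0):
    e = (enhanced_frames ** 2).sum(axis=1)
    thr = ratio * np.median(e)              # simple adaptive energy threshold
    return e > thr                          # True = speech frame
```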
C1 [Hsieh, Cheng-Hsiung; Feng, Ting-Yu; Huang, Po-Chin] Chaoyang Univ Technol, Dept Comp Sci & Informat Engn, Wufong 413, Taiwan.
RP Hsieh, CH (reprint author), Chaoyang Univ Technol, Dept Comp Sci & Informat Engn, Wufong 413, Taiwan.
EM chhsieh@cyut.edu.tw
FU National Science Council of Republic of China [NSC 95-2221-E-324040, NSC
96-2221-E-324-044]
FX This work was supported by National Science Council of Republic of China
under Grants NSC 95-2221-E-324040 and NSC 96-2221-E-324-044.
CR Chang JH, 2001, IEICE T INF SYST, VE84D, P1231
Chang JH, 2006, IEEE T SIGNAL PROCES, V54, P1965, DOI 10.1109/TSP.2006.874403
Chang JH, 2003, ELECTRON LETT, V39, P632, DOI 10.1049/el:20030392
Childers D. G., 1999, SPEECH PROCESSING SY
Davis A, 2006, IEEE T AUDIO SPEECH, V14, P412, DOI 10.1109/TSA.2005.855842
Deng Julong, 1989, Journal of Grey Systems, V1
Deng JL, 1982, SYSTEMS CONTROL LETT, V5, P288
ESTEVEZ PA, 2005, ELECT LETT, V41
*ETSI, 1999, 301 708 ETSI EN
Goorriz JM, 2006, SPEECH COMMUN, V48, P1638, DOI 10.1016/j.specom.2006.07.006
Gorriz JM, 2006, J ACOUST SOC AM, V120, P470, DOI 10.1121/1.2208450
Hsieh CH, 2003, IEICE T INF SYST, VE86D, P522
ITU-T, 1996, G729 ITUT
Kim DK, 2007, IEEE SIGNAL PROC LET, V14, P891, DOI 10.1109/LSP.2007.900225
Kim HI, 2004, ELECTRON LETT, V40, P1454, DOI 10.1049/el:20046064
LONGBOTHAM HG, 1989, IEEE T ACOUST SPEECH, V37, P275, DOI 10.1109/29.21690
Ramirez J, 2004, IEEE SIGNAL PROC LET, V11, P266, DOI 10.1109/LSP.2003.821762
Ramirez J, 2005, IEEE T SPEECH AUDI P, V13, P1119, DOI 10.1109/TSA.2005.853212
Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002
Ramirez J, 2007, IEEE T AUDIO SPEECH, V15, P2177, DOI 10.1109/TASL.2007.903937
Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1
Tahmasbi R, 2007, IEEE T AUDIO SPEECH, V15, P1129, DOI 10.1109/TASL.2007.894521
NR 22
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2009
VL 51
IS 9
SI SI
BP 810
EP 819
DI 10.1016/j.specom.2008.08.005
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 475OZ
UT WOS:000268370700009
ER
PT J
AU Monte-Moreno, E
Chetouani, M
Faundez-Zanuy, M
Sole-Casals, J
AF Monte-Moreno, Enric
Chetouani, Mohamed
Faundez-Zanuy, Marcos
Sole-Casals, Jordi
TI Maximum likelihood linear programming data fusion for speaker
recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker recognition; Data fusion; GMM; Maximum likelihood; Linear
programming; Simplex
ID WIENER SYSTEMS; IDENTIFICATION; SPEECH; INFORMATION; PREDICTION;
INVERSION; MIXTURES
AB Biometric system performance can be improved by means of data fusion. Several kinds of information can be fused in order to obtain a more accurate classification (identification or verification) of an input sample. In this paper we present a method for computing the weights in a weighted-sum fusion of score combinations, by means of a likelihood model. The maximum likelihood estimation is posed as a linear programming problem. The scores are derived from a GMM classifier working on different feature extraction techniques. Our experimental results assessed the robustness of the system to changes over time (different sessions) and to changes of microphone. The improvements obtained were significant (error bars of two standard deviations) relative to a uniformly weighted sum, a uniformly weighted product, and the best single classifier. The computational cost of the proposed method scales with the number of scores to be fused in the same way as the simplex method for linear programming. (C) 2008 Elsevier B.V. All rights reserved.
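The sketch below illustrates weighted-sum score fusion with weights obtained from a linear program. It is not the authors' maximum-likelihood formulation: a simpler maximin-margin objective stands in for the likelihood model, and the score-array layout is an assumption, so this should be read as an illustration of the general approach rather than a reimplementation.

```python
# Hedged sketch: weighted-sum fusion of classifier scores, with the weights
# found by a maximin-margin linear program (a stand-in for the paper's
# maximum-likelihood estimation, used only to keep the example self-contained).
import numpy as np
from scipy.optimize import linprog

def fit_fusion_weights(scores, labels):
    """scores: array (trials, classes, classifiers); labels: true class per trial."""
    n_trials, n_classes, n_clf = scores.shape
    A_ub, b_ub = [], []
    for i in range(n_trials):
        true = labels[i]
        for j in range(n_classes):
            if j == true:
                continue
            # Constraint  w.(s_ij - s_i,true) + t <= 0  for every impostor class j.
            A_ub.append(np.append(scores[i, j] - scores[i, true], 1.0))
            b_ub.append(0.0)
    c = np.zeros(n_clf + 1); c[-1] = -1.0           # maximize the margin t
    A_eq = [np.append(np.ones(n_clf), 0.0)]         # weights sum to one
    bounds = [(0, 1)] * n_clf + [(None, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  A_eq=np.array(A_eq), b_eq=[1.0], bounds=bounds)
    return res.x[:n_clf]

def fuse_and_decide(scores, weights):
    # Weighted sum across classifiers, then pick the highest-scoring class.
    return np.argmax(scores @ weights, axis=-1)
```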
C1 [Monte-Moreno, Enric] UPC Barcelona, TALP Res Ctr, Barcelona, Spain.
[Chetouani, Mohamed] Univ Paris 06, F-75252 Paris 05, France.
[Faundez-Zanuy, Marcos] UPC Barcelona, Escola Univ Politecn Mataro, Barcelona, Spain.
[Sole-Casals, Jordi] Univ Vic, Barcelona, Spain.
RP Monte-Moreno, E (reprint author), UPC Barcelona, TALP Res Ctr, Barcelona, Spain.
EM enric@gps.tsc.upc.edu
RI CHETOUANI, Mohamed/F-5854-2010; Faundez-Zanuy, Marcos/F-6503-2012; Monte
Moreno, Enrique/F-8218-2013; Sole-Casals, Jordi/B-7754-2008
OI Faundez-Zanuy, Marcos/0000-0003-0605-1282; Monte Moreno,
Enrique/0000-0002-4907-0494; Sole-Casals, Jordi/0000-0002-6534-1979
FU FEDER; MEC [TEC2006-13141-C03-02/TCM, TIN2005-08852, TEC2007-61535/TCM]
FX This work has been partially supported by FEDER and MEC under Grants
TEC2006-13141-C03-02/TCM, TIN2005-08852, and TEC2007-61535/TCM.
CR ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679
BELLINGS SA, 1978, P IEEE, V66, P691
Bertsimas D., 1997, INTRO LINEAR OPTIMIZ
Bishop C. M., 1995, NEURAL NETWORKS PATT
BOER ED, 1976, P IEEE, V64, P1443
Boyd S., 2004, CONVEX OPTIMIZATION
COVER T. M., 1991, WILEY SERIES TELECOM
FAUNDEZ M, 1998, INT C SPOK LANG PROC, V2, P121
Faundez-Zanuy M, 2004, IEEE AERO EL SYS MAG, V19, P3, DOI 10.1109/MAES.2004.1308819
Faundez-Zanuy M, 2005, IEEE AERO EL SYS MAG, V20, P7, DOI 10.1109/MAES.2005.1432568
Faundez-Zanuy M, 2005, IEEE AERO EL SYS MAG, V20, P25, DOI 10.1109/MAES.2005.1499300
Faundez-Zanuy M, 2005, IEEE AERO EL SYS MAG, V20, P34, DOI 10.1109/MAES.2005.1396793
Faundez-Zanuy M, 2006, IEEE AERO EL SYS MAG, V21, P15, DOI 10.1109/MAES.2006.1662038
Faundez-Zanuy M., 2002, Control and Intelligent Systems, V30
Hayakawa S, 1997, LECT NOTES COMPUT SC, V1206, P253
HE J, 1996, P IEEE ICASSP 96, V1, P5
JACOVITTI G, 1987, IEEE T ACOUST SPEECH, V35, P1126, DOI 10.1109/TASSP.1987.1165253
Kubin G., 1995, SPEECH CODING SYNTHE, P557
Kuncheva L.I., 2004, COMBINING PATTERN CL
MARY L, 2004, P ISCA TUT RES WORKS, P323
Nikias C., 1993, HIGHER ORDER SPECTRA
NIKIAS CL, 1987, P IEEE, V75, P869, DOI 10.1109/PROC.1987.13824
Ortega-Garcia J, 2000, SPEECH COMMUN, V31, P255, DOI 10.1016/S0167-6393(99)00081-3
PRAKRIYA S, 1985, BIOL CYBERN, V55, P135
Prasanna SRM, 2006, SPEECH COMMUN, V48, P1243, DOI 10.1016/j.specom.2006.06.002
REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379
SATUE A, 1999, EUROSPEECH, V3, P1231
Sole-Casals J, 2006, NEUROCOMPUTING, V69, P1467, DOI 10.1016/j.neucom.2005.12.023
Sole J, 2002, NEUROCOMPUTING, V48, P339, DOI 10.1016/S0925-2312(01)00651-8
Sole-Casals J, 2005, SIGNAL PROCESS, V85, P1780, DOI 10.1016/j.sigpro.2004.11.030
Taleb A, 1999, IEEE T SIGNAL PROCES, V47, P2807, DOI 10.1109/78.790661
Taleb A, 2001, IEEE T SIGNAL PROCES, V49, P917, DOI 10.1109/78.917796
THEVENAZ P, 1995, SPEECH COMMUN, V17, P145, DOI 10.1016/0167-6393(95)00010-L
YEGNANARAYANA B, 2001, P IEEE INT C AC SPEE, P409
ZHENG N, 2006, IEEE SIGNAL PROCESS, V14, P181
NR 35
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD SEP
PY 2009
VL 51
IS 9
SI SI
BP 820
EP 830
DI 10.1016/j.specom.2008.05.009
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 475OZ
UT WOS:000268370700010
ER
PT J
AU Shao, Y
Wang, DL
AF Shao, Yang
Wang, DeLiang
TI Sequential organization of speech in computational auditory scene
analysis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Sequential organization; Computational auditory scene analysis; Speaker
quantization; Binary time-frequency mask
ID MODELS; RECOGNITION; INTERFERENCE; SEGREGATION; TRACKING
AB A human listener has the ability to follow a speaker's voice over time in the presence of other talkers and non-speech interference. This paper proposes a general system for sequential organization of speech based on speaker models. By training a general background model, the proposed system is shown to function well with both interfering talkers and non-speech intrusions. To deal with situations where prior information about specific speakers is not available, a speaker quantization method is employed to extract representative models from a large speaker space, and the resulting generic models are used to perform sequential grouping. Our systematic evaluations show that grouping performance using generic models is only moderately lower than the performance level achieved with known speaker models. (C) 2009 Elsevier B.V. All rights reserved.
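A minimal sketch of model-based sequential grouping along these lines is given below: segments of frame-level features are assigned to whichever pre-trained model (target speakers plus a generic background model) scores them highest. Speaker quantization and the binary time-frequency masks used in the paper are omitted, and the feature shapes and model settings are assumptions for illustration.

```python
# Illustrative sequential grouping with pre-trained speaker models.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_models(feature_sets, n_components=16):
    # feature_sets: dict name -> array (frames, dims) of training features.
    return {name: GaussianMixture(n_components, covariance_type='diag').fit(feats)
            for name, feats in feature_sets.items()}

def group_segments(segments, models):
    # segments: list of arrays (frames, dims); returns the best model per segment.
    labels = []
    for seg in segments:
        avg_ll = {name: gmm.score(seg) for name, gmm in models.items()}
        labels.append(max(avg_ll, key=avg_ll.get))
    return labels
```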
C1 [Shao, Yang; Wang, DeLiang] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA.
[Wang, DeLiang] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA.
RP Shao, Y (reprint author), Ohio State Univ, Dept Comp Sci & Engn, 2015 Neil Ave, Columbus, OH 43210 USA.
EM shao.19@osu.edu; dwang@cse.ohio-state.edu
FU AFOSR [FA9550-04-1-0117]; NSF [IIS-0081058]; AFRL [FA8750-04-1-0093]
FX This research was supported in part by an AFOSR Grant
(FA9550-04-1-0117), an NSF Grant (IIS-0081058), and an AFRL Grant
(FA8750-04-1-0093).
CR Barker JP, 2005, SPEECH COMMUN, V45, P5, DOI 10.1016/j.specom.2004.05.002
BEN M, 2002, P INT C AC SPEECH SI, V1, P689
Bimbot Frederic, 2004, EURASIP J APPL SIG P, V4, P430, DOI [DOI 10.1155/S1110865704310024, 10.1155/S1110865704310024]
Bregman AS., 1990, AUDITORY SCENE ANAL
Brungart DS, 2001, J ACOUST SOC AM, V109, P1101, DOI 10.1121/1.1345696
CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229
Cooke M., 2006, SPEECH SEPARATION RE
Deng L, 2005, IEEE T SPEECH AUDI P, V13, P412, DOI 10.1109/TSA.2005.845814
Duda R. O., 2001, PATTERN CLASSIFICATI
Dunn RB, 2000, DIGIT SIGNAL PROCESS, V10, P93, DOI 10.1006/dspr.1999.0359
Ellis D.P.W., 2006, COMPUTATIONAL AUDITO, P115
Furui S., 2001, DIGITAL SPEECH PROCE
HELMHOLTZ H, 1863, SENSATION TONE
Hershey J. R., 2007, P ICASSP, V4, P317
Hu G., 2006, THESIS OHIO STATE U
Hu G., 2006, TOPICS ACOUSTIC ECHO, P485
Hu GN, 2008, J ACOUST SOC AM, V124, P1306, DOI 10.1121/1.2939132
Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812
Huang X., 2001, SPOKEN LANGUAGE PROC
Kullback S., 1968, INFORM THEORY STAT
Kwon S., 2004, P ICSLP, P1517
Kwon S, 2005, IEEE T SPEECH AUDI P, V13, P1004, DOI 10.1109/TSA.2005.851981
Moore BC., 2003, INTRO PSYCHOL HEARIN
Patterson R. D., 1988, 2341 APU MRC APPL PS
PRZYBOCKI MA, 2004, P OD 2004
QUATIERI TF, 1990, IEEE T ACOUST SPEECH, V38, P56, DOI 10.1109/29.45618
Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007
REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Rice J, 1995, MATH STAT DATA ANAL
Russell S., 2003, ARTIFICIAL INTELLIGE, V2nd
SHAO Y, 2007, P ICASSP, V4, P277
Shao Y., 2007, THESIS OHIO STATE U
Shao Y, 2006, IEEE T AUDIO SPEECH, V14, P289, DOI 10.1109/TSA.2005.854106
Silva J, 2006, IEEE T AUDIO SPEECH, V14, P890, DOI 10.1109/TSA.2005.858059
Srinivasan S, 2007, IEEE T AUDIO SPEECH, V15, P2130, DOI 10.1109/TASL.2007.901836
VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3
Vasconcelos N, 2004, IEEE T INFORM THEORY, V50, P1482, DOI 10.1109/TIT.2004.830760
Wang D., 2006, COMPUTATIONAL AUDITO
Wang DL, 2005, SPEECH SEPARATION BY HUMANS AND MACHINES, P181, DOI 10.1007/0-387-22794-6_12
Wang D. L., 2006, COMPUTATIONAL AUDITO, P81
NR 41
TC 9
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD AUG
PY 2009
VL 51
IS 8
BP 657
EP 667
DI 10.1016/j.specom.2009.02.003
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HH
UT WOS:000267572900001
ER
PT J
AU Messing, DP
Delhorne, L
Bruckert, E
Braida, LD
Ghitza, O
AF Messing, David P.
Delhorne, Lorraine
Bruckert, Ed
Braida, Louis D.
Ghitza, Oded
TI A non-linear efferent-inspired model of the auditory system; matching
human confusions in stationary noise
SO SPEECH COMMUNICATION
LA English
DT Article
DE MOC efferent; Speech recognition; Noise robustness; Human confusions;
MBPNL model
ID CROSSED-OLIVOCOCHLEAR-BUNDLE; MODULATION TRANSFER-FUNCTION; RATE
RESPONSES; SPEECH; STIMULATION; RECOGNITION; NEURONS; TONES;
INTELLIGIBILITY; DISCRIMINATION
AB Current predictors of speech intelligibility are inadequate for understanding and predicting speech confusions caused by acoustic interference. We develop a model of auditory speech processing that includes a phenomenological representation of the action of the Medial Olivocochlear efferent pathway and that is capable of predicting consonant confusions made by normal-hearing listeners in speech-shaped Gaussian noise. We then use this model to predict human error patterns of initial consonants in consonant-vowel-consonant words in the context of a Dynamic Rhyme Test. In the process we demonstrate its potential for speech discrimination in noise. The model's performance was robust to varying levels of stationary additive speech-shaped noise and mimicked human performance in discrimination of synthetic speech, as measured by the Chi-squared test. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Messing, David P.; Delhorne, Lorraine; Braida, Louis D.] MIT, Dept Elect Engn & Comp Sci, Cambridge, MA 02139 USA.
[Bruckert, Ed; Ghitza, Oded] Sensimetrics Corp, Malden, MA USA.
RP Messing, DP (reprint author), 11208 Vista Sorrento,Pkwy J306, San Diego, CA 92103 USA.
EM dpmessing@gmail.com; delhome@MIT.EDU; Ebruckert@fonix.com;
ldbraida@mit.edu; oded.ghitza@gmail.com
FU US Air Force Office of Scientific Research [F49620-03-C-0051,
FA9550-05-C-0032]; NIH [R01-DC7152]
FX This work was sponsored by the US Air Force Office of Scientific
Research (contracts F49620-03-C-0051 and FA9550-05-C-0032) and by NIH
Grant R01-DC7152.
CR AINSWORTH W, 2001, EFFECTS NOISE ADAPTA, P371
AINSWORTH WA, 1994, J ACOUST SOC AM, V96, P687, DOI 10.1121/1.410306
*ANSI, 1997, S361997 ANSI
ANSI, 1969, S351969 ANSI
Cervera T, 2007, ACTA ACUST UNITED AC, V93, P1036
DEWSON JH, 1968, J NEUROPHYSIOL, V31, P122
DOLAN DF, 1988, J ACOUST SOC AM, V83, P1081, DOI 10.1121/1.396052
Dunn HK, 1940, J ACOUST SOC AM, V11, P278, DOI 10.1121/1.1916034
Ferry RT, 2007, J ACOUST SOC AM, V122, P3519, DOI 10.1121/1.2799914
FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407
GHITZA O, 2007, HEARING SENSORY PROC
GHITZA O, 2004, J ACOUST SOC AM 2, V115
GIFFORD ML, 1983, J ACOUST SOC AM, V74, P115, DOI 10.1121/1.389728
Giraud AL, 1997, NEUROREPORT, V8, P1779
GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T
GOLDSTEIN JL, 1990, HEARING RES, V49, P39, DOI 10.1016/0378-5955(90)90094-6
Guinan Jr J.J., 1996, COCHLEA, P435
GUMMER M, 1988, HEARING RES, V36, P41, DOI 10.1016/0378-5955(88)90136-0
Hant JJ, 2003, SPEECH COMMUN, V40, P291, DOI 10.1016/S0167-6393(02)00068-7
HOUTGAST T, 1980, ACUSTICA, V46, P60
JOHNSON DH, 1980, J ACOUST SOC AM, V68, P1115, DOI 10.1121/1.384982
KAWASE T, 1993, J NEUROPHYSIOL, V70, P2519
KIANG NYS, 1987, INT COCHL IMPL S DUR
LIBERMAN MC, 1986, HEARING RES, V24, P17, DOI 10.1016/0378-5955(86)90003-1
LIBERMAN MC, 1988, J NEUROPHYSIOL, V60, P1779
Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6
MAY BJ, 1992, J NEUROPHYSIOL, V68, P1589
PATTERSON RD, 1995, J ACOUST SOC AM, V98, P1890, DOI 10.1121/1.414456
Scharf B, 1997, HEARING RES, V103, P101, DOI 10.1016/S0378-5955(96)00168-2
Slaney M., 1993, 35 APPL COMP
Sroka JJ, 2005, SPEECH COMMUN, V45, P401, DOI 10.1016/j.specom.2004.11.009
Voiers W. D., 1983, Speech Technology, V1
Warr WB, 1978, EVOKED ELECTRICAL AC, P43
WINSLOW RL, 1988, HEARING RES, V35, P165, DOI 10.1016/0378-5955(88)90116-5
Zar J. H., 1999, BIOSTATISTICAL ANAL, V4th
Zeng FG, 2000, HEARING RES, V142, P102, DOI 10.1016/S0378-5955(00)00011-3
NR 36
TC 15
Z9 15
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD AUG
PY 2009
VL 51
IS 8
BP 668
EP 683
DI 10.1016/j.specom.2009.02.002
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HH
UT WOS:000267572900002
ER
PT J
AU Lee, S
Iverson, GK
AF Lee, Soyoung
Iverson, Gregory K.
TI Vowel development in English and Korean: Similarities and differences in
linguistic and non-linguistic factors
SO SPEECH COMMUNICATION
LA English
DT Article
DE Vowel development; Fundamental and formant frequency; Vocal tract
length; Korean; English; Cross-linguistic study
ID VOCAL-TRACT; VOICES; FREQUENCIES; PATTERNS; SPEAKERS; SPEECH
AB This study presents an acoustic examination of vowels produced by English- and Korean-speaking male and female children in two age groups, 5 and 10 years. In response to picture cards, a total of 80 children (10 each in eight groups categorized according to age, sex and language) produced tokens of seven vowels which are typically transcribed using the same IPA symbols in both languages. Fundamental as well as first and second formant frequencies were measured. In addition, vocal tract length was estimated on the basis of the F3 values of /ʌ/. As expected, these properties differed between the two age groups. However, the two non-linguistic elements (fundamental frequency and vocal tract length) were similar between sexes and between languages, whereas the linguistic factors (frequencies of the first and second formants) varied as a function of sex and language. Moreover, the F2 differences between English- and Korean-speaking children were similar to those between English- and Korean-speaking adults as reported by Yang [Yang, B., 1996. A comparative study of American English and Korean vowels produced by male and female speakers. J. Phonet. 24, 245-261], with children as young as 5 years thus reflecting the formant frequencies of their ambient languages. The results of this study suggest that formant frequency differences between sexes during the pre- and peri-puberty periods can be attributed chiefly to behavioral rather than to anatomical differences. (C) 2009 Elsevier B.V. All rights reserved.
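One common way to estimate vocal tract length from F3 is the uniform-tube (quarter-wave resonator) approximation Fn = (2n - 1)c / (4L), which gives L = 5c / (4 F3) for the third formant. Whether the authors used exactly this approximation is not stated in the abstract; the short snippet below is included only to make the F3-based estimate concrete.

```python
# Uniform-tube ("quarter-wave resonator") approximation: Fn = (2n - 1)c / (4L),
# so solving for length from the third formant gives L = 5c / (4 * F3).
# Shown purely as an illustration; not taken from the paper itself.
def vocal_tract_length_from_f3(f3_hz, c=35000.0):
    """Return estimated vocal tract length in cm (c = speed of sound in cm/s)."""
    return 5.0 * c / (4.0 * f3_hz)

# Example: an adult-male-like F3 near 2500 Hz gives roughly 17.5 cm.
print(vocal_tract_length_from_f3(2500.0))
```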
C1 [Lee, Soyoung] Univ Wisconsin Milwaukee, Dept Commun Sci & Disorders, Milwaukee, WI 53201 USA.
[Iverson, Gregory K.] Univ Wisconsin Milwaukee, Univ Maryland Ctr Adv Study Language, Milwaukee, WI 53201 USA.
RP Lee, S (reprint author), Univ Wisconsin Milwaukee, Dept Commun Sci & Disorders, POB 413, Milwaukee, WI 53201 USA.
EM lees59@uwm.edu
CR Assmann PF, 2000, J ACOUST SOC AM, V108, P1856, DOI 10.1121/1.1289363
Baker W, 2005, LANG SPEECH, V48, P1
BENNETT S, 1981, J ACOUST SOC AM, V69, P231, DOI 10.1121/1.385343
BENNETT S, 1983, J SPEECH HEAR RES, V26, P137
BUSBY PA, 1995, J ACOUST SOC AM, V97, P2603, DOI 10.1121/1.412975
Diehl RL, 1996, J PHONETICS, V24, P187, DOI 10.1006/jpho.1996.0011
Eguchi S, 1969, Acta Otolaryngol Suppl, V257, P1
Fant G, 1975, STL QPSR, V2-3, P1
Fant G., 1970, ACOUSTIC THEORY SPEE
Fant G., 1973, SPEECH SOUNDS FEATUR
Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148
Goldstein UG, 1980, ARTICULATORY MODEL V
HA H, 2005, THESIS YONSEI U KORE
HASEK CS, 1980, J ACOUST SOC AM, V68, P1262, DOI 10.1121/1.385118
Hillenbrand JM, 2001, J ACOUST SOC AM, V109, P748, DOI 10.1121/1.1337959
Kent R., 1997, SPEECH SCI
Lee S, 2008, CLIN LINGUIST PHONET, V22, P523, DOI 10.1080/02699200801945120
Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686
LINDBLOM B, 1989, J PHONETICS, V17, P107
LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403
MATTINGLY IG, 1966, 71 M AC SOC AM BOST
Nordstrom P.E., 1977, J PHONETICS, V4, P81
NORDSTROM PE, 1975, INT C PHON SCI LEEDS
Perry TL, 2001, J ACOUST SOC AM, V109, P2988, DOI 10.1121/1.1370525
PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875
Rvachew S, 2006, J ACOUST SOC AM, V120, P2250, DOI 10.1121/1.2266460
Sachs J., 1973, LANGUAGE ATTITUDES C, P74
Sagart L., 1989, J CHILD LANG, V16, P1
Sohn H. -M. -M., 1999, KOREAN LANGUAGE
TITZE IR, 1989, J ACOUST SOC AM, V85, P1699, DOI 10.1121/1.397959
TRAUNMULLER H, 1988, PHONETICA, V45, P1
Vorperian HK, 2005, J ACOUST SOC AM, V117, P338, DOI 10.1121/1.1835958
WANG S, 1996, J KOREAN OTOLARYNGOL, V39, P12
Whiteside SP, 2001, J ACOUST SOC AM, V110, P464, DOI 10.1121/1.1379087
Whittam LR, 2000, CLIN EXP DERMATOL, V25, P122
Xue SA, 2006, CLIN LINGUIST PHONET, V20, P691, DOI 10.1080/02699200500297716
Yang BG, 1996, J PHONETICS, V24, P245, DOI 10.1006/jpho.1996.0013
YOON S, 1998, J SPEECH HEAR DISORD, V7, P67
NR 38
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD AUG
PY 2009
VL 51
IS 8
BP 684
EP 694
DI 10.1016/j.specom.2009.03.005
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HH
UT WOS:000267572900003
ER
PT J
AU Jackson, PJB
Singampalli, VD
AF Jackson, Philip J. B.
Singampalli, Veena D.
TI Statistical identification of articulation constraints in the production
of speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE Critical articulator; Speech production model; Articulatory gesture;
Coarticulation
ID MODEL; COARTICULATION; RECOGNITION; MOVEMENTS; FEATURES; TONGUE; MRI;
HMM
AB We present a statistical technique for identifying critical, dependent and redundant roles played by the articulators during production of English phonemes using articulatory (EMA) data. It identifies a list of critical articulators for each phone based on changes in the distribution of articulator positions. The effect of critical articulation on dependent articulators is derived from inter-articulator correlation. Articulators unaffected by or not correlated with the critical articulators are regarded as redundant. The technique was implemented on 1D and 2D distributions of midsagittal articulator coordinates, and the results of this data-driven approach are analyzed in comparison with the phonetic descriptions from the IPA chart. The results using the proposed method gave a closer fit to measured data than those estimated from IPA information alone and highlighted significant factors in the phoneme-to-phone transformation. The proposed algorithm was evaluated against an exhaustive search of critical articulators, and found to be as effective as the exhaustive search in modeling phone distributions, with the added advantage of faster execution times. The efficiency of the approach in generating a parsimonious yet accurate representation of the observed articulatory constraints is described, and its potential for applications in speech science and technology is discussed. (C) 2009 Elsevier B.V. All rights reserved.
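A toy version of the identification step might look like the sketch below: an articulator is flagged critical for a phone when its position distribution during that phone departs strongly from its overall (grand) distribution. The Gaussian assumption, the symmetrised KL divergence, and the threshold are illustrative choices rather than the paper's exact statistics, and the correlation-based treatment of dependent and redundant articulators is left out.

```python
# Illustrative "critical articulator" detection: compare each articulator's
# phone-specific 1D position distribution with its grand distribution using a
# symmetrised KL divergence between fitted Gaussians, then threshold.
import numpy as np

def kl_gauss(mu_p, var_p, mu_q, var_q):
    # KL( N(mu_p, var_p) || N(mu_q, var_q) ) for 1D Gaussians.
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def critical_articulators(phone_data, grand_data, threshold=1.0):
    # phone_data, grand_data: arrays (frames, articulators) of EMA coordinates.
    crit = []
    for a in range(grand_data.shape[1]):
        mu_p, var_p = phone_data[:, a].mean(), phone_data[:, a].var() + 1e-9
        mu_g, var_g = grand_data[:, a].mean(), grand_data[:, a].var() + 1e-9
        d = 0.5 * (kl_gauss(mu_p, var_p, mu_g, var_g) + kl_gauss(mu_g, var_g, mu_p, var_p))
        if d > threshold:
            crit.append(a)
    return crit
```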
C1 [Jackson, Philip J. B.; Singampalli, Veena D.] Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, Surrey, England.
RP Jackson, PJB (reprint author), Univ Surrey, Ctr Vis Speech & Signal Proc, Guildford GU2 7XH, Surrey, England.
EM p.jackson@surrey.ac.uk
RI Jackson, Philip/E-8422-2013
FU EPSRC [GR/S85511/01]
FX This research formed part of the DANSA project funded in the UK by EPSRC
(GR/S85511/01),
http://personal.ce.surrey.ac.uk/Personal/P.Jackson/Dansa/.
CR Anderson TW, 1984, INTRO MULTIVARIATE S
Badin P, 2002, J PHONETICS, V30, P533, DOI 10.1006/jpho.2002.0166
Badin P., 2006, P 7 INT SEM SPEECH P, P395
BAKIS R, 1991, P IEEE WORKSH AUT SP, P20
BLACKBURN S, 2000, J ACOUST SOC AM, V103, P1659
BLADON RA, 1976, J PHONETICS, V4, P135
Browman Catherine, 1986, PHONOLOGY YB, V3, P219
Chomsky N., 1968, SOUND PATTERN ENGLIS
Cohen M. M., 1993, Models and Techniques in Computer Animation
COKER CH, 1976, P IEEE, V64, P452, DOI 10.1109/PROC.1976.10154
Coppin B., 2004, ARTIFICIAL INTELLIGE
DANG J, 2004, ACOUSTICAL SCI TECHN, V25, P318, DOI 10.1250/ast.25.318
Dang J.W., 2005, P INT LISB, P1025
Dang JW, 2004, J ACOUST SOC AM, V115, P853, DOI 10.1121/1.1639325
Daniloff R. G., 1973, J PHONETICS, V1, P239
DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839
EIDE E, 2001, P EUR AALB DENM, P1613
Erler K, 1996, J ACOUST SOC AM, V100, P2500, DOI 10.1121/1.417358
FANT G, 1969, DISTINCTIVE FEATURES
FRANKEL J, 2001, P EUR AALB DENM, P599, DOI DOI 10.1109/TSA.2005.851910
Frankel J., 2003, THESIS U EDINBURGH
Frankel J., 2000, P ICSLP, V4, P254
FRANKEL J, 2004, P INT C SPOK LANG P, P1477
HENKE WL, 1965, THESIS MIT CAMBRIDGE
Hershey J. R., 2007, P ICASSP, V4, P317
HOOLE P, 2008, P INT SEM SPEECH PRO, P157
Jackson P., 2008, J ACOUST SOC AM 2, V123, P3321, DOI 10.1121/1.2933798
JACKSON PJB, 2004, GRS8551101 CVSSP EPS
JACKSON PJB, 2008, P INT SEM SPEECH PRO, P377
Johnson R. A., 1998, APPL MULTIVARIATE ST
Keating P. A., 1988, UCLA WORKING PAPERS, V69, P3
King S, 2007, J ACOUST SOC AM, V121, P723, DOI 10.1121/1.2404622
Kirchhoff K., 1999, THESIS U BIELEFELD
KOREMAN J, 1998, P INT C SPOK LANG P, V3, P1035
Kullback S., 1968, INFORM THEORY STAT
LIBERMAN AM, 1970, COGNITIVE PSYCHOL, V1, P301, DOI 10.1016/0010-0285(70)90018-6
LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816
LOFQVIST A, 1990, NATO ADV SCI I D-BEH, V55, P289
MACNEILA.PF, 1970, PSYCHOL REV, V77, P182, DOI 10.1037/h0029070
MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131
Meister IG, 2007, CURR BIOL, V17, P1692, DOI 10.1016/j.cub.2007.08.064
MERMELSTEIN P, 1973, J ACOUST SOC AM, V53, P1072
Metze F., 2002, P INT C SPOK LANG PR, P2133
MOLL KL, 1971, J ACOUST SOC AM, V50, P678, DOI 10.1121/1.1912683
OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310
OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151
Ostry DJ, 1996, J NEUROSCI, V16, P1570
PAPCUN G, 1992, J ACOUST SOC AM, V92, P688, DOI 10.1121/1.403994
Parthasarathy V, 2007, J ACOUST SOC AM, V121, P491, DOI 10.1121/1.2363926
QIN C, 2008, P INT BRISB AUSTR, P2306
Recasens D, 1999, J PHONETICS, V27, P143, DOI 10.1006/jpho.1999.0092
Recasens D, 1997, J ACOUST SOC AM, V102, P544, DOI 10.1121/1.419727
RICHARDS H, 1999, INT C ACOUST SPEECH, V1, P357
RICHARDSON M, 2000, P INT C SPOK LANG PR, V3, P131
Richmond K, 2007, P INT ANTW BELG, P2465
RICHMOND K, 2006, P INT PITTSB PA
RUSSELL MJ, 2002, P ICSLP DENV CO, P1253
Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2
Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356
Shadle C., 1985, THESIS MIT CAMBRIDGE
Singampalli V. D., 2007, P INT ANTW, P70
SOQUET A, 1999, P ICPHS 99 SAN FRANC, P1645
Westbury J.R., 1994, XRAY MICROBEAM SPEEC
Wilson SM, 2004, NAT NEUROSCI, V7, P701, DOI 10.1038/nn1263
WRENCH AA, 2001, P I ACOUST, V23, P207
Zen H, 2007, COMPUT SPEECH LANG, V21, P153, DOI 10.1016/j.csl.2006.01.002
NR 66
TC 7
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD AUG
PY 2009
VL 51
IS 8
BP 695
EP 710
DI 10.1016/j.specom.2009.03.007
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HH
UT WOS:000267572900004
ER
PT J
AU Wutiwiwatchai, C
Furui, S
AF Wutiwiwatchai, Chai
Furui, Sadaoki
TI Thai speech processing technology: A review (vol 49, pg 8, 2007)
SO SPEECH COMMUNICATION
LA English
DT Correction
C1 [Wutiwiwatchai, Chai] Natl Elect & Comp Technol Ctr, Informat R&D Div, Speech Technol Res Sect, Klongluang 12120, Pathumthani, Thailand.
[Furui, Sadaoki] Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan.
RP Wutiwiwatchai, C (reprint author), Natl Elect & Comp Technol Ctr, Informat R&D Div, Speech Technol Res Sect, 112 Thailand Sci Pk,Pahonyothin Rd, Klongluang 12120, Pathumthani, Thailand.
EM chai@nectec.or.th
RI Wutiwiwatchai, Chai/G-5010-2012
CR Wutiwiwatchai C, 2007, SPEECH COMMUN, V49, P8, DOI 10.1016/j.specom.2006.10.004
NR 1
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD AUG
PY 2009
VL 51
IS 8
BP 711
EP 711
DI 10.1016/j.specom.2009.04.003
PG 1
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HH
UT WOS:000267572900005
ER
PT J
AU Blomberg, M
Elenius, K
House, D
Karlsson, I
AF Blomberg, Mats
Elenius, Kjell
House, David
Karlsson, Inger
TI Research Challenges in Speech Technology: A Special Issue in Honour of
Rolf Carlson and Bjorn Granstrom
SO SPEECH COMMUNICATION
LA English
DT Editorial Material
C1 [Blomberg, Mats; Elenius, Kjell; House, David; Karlsson, Inger] KTH, Stockholm, Sweden.
RP Blomberg, M (reprint author), KTH, Stockholm, Sweden.
NR 0
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2009
VL 51
IS 7
SI SI
BP 563
EP 563
DI 10.1016/j.specom.2009.04.002
PG 1
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HG
UT WOS:000267572800001
ER
PT J
AU Fant, G
AF Fant, Gunnar
TI A personal note from Gunnar Fant
SO SPEECH COMMUNICATION
LA English
DT Editorial Material
ID SYNTHESIZER
C1 KTH, Dept Speech Mus & Hearing, S-10040 Stockholm, Sweden.
RP Fant, G (reprint author), KTH, Dept Speech Mus & Hearing, S-10040 Stockholm, Sweden.
EM gunnar@speech.kth.se
CR BESKOW J, 2003, THESIS DEP SPEECH MU
BLOMBERG M, 1993, P EUR C SPEECH COMM, P1867
Carlson R., 1975, AUDITORY ANAL PERCEP, P55
CARLSON R, 1981, 41981 KTH SPEECH TRA, P29
Carlson R., 1970, 231970 STLQPSR, V2-3, P19
Carlson R, 1989, ICASSP, V1, P223
CARLSON R, 1973, 231973 KTH SPEECH TR, P31
CARLSON R, 1982, REPRESENTATION SPEEC, P109
CARLSON R, 1975, 1 SPEECH TRANSM LAB, P17
CARLSON R, 1990, ADV SPEECH HEARING L
FANT G, 1978, RIV ITALIANA ACUSTIC, V2, P69
FANT G, 2004, SPEECH ACOUSTICS PHO
FANT G, 1974, P SPEECH COMM SEM ST, V3, P117
FANT G, 1953, IVA ROYAL SWEDISH AC, V2, P331
FANT G, 1962, 21962 STLQPSR, P18
FANT G, 1986, INVARIANCE VARIABILI, P480
FANT G, 1990, SPEECH COMMUN, V9, P171, DOI 10.1016/0167-6393(90)90054-D
HOLMES J, 1961, 11961 KTH SPEECH TRA, P10
HUNNICUTT S, 1986, BLISSTALK J AM VOICE, V3
LILJENCRANTS J, 1969, 41969 KTH SPEECH TRA, P43
LILJENCR.JC, 1968, IEEE T ACOUST SPEECH, VAU16, P137, DOI 10.1109/TAU.1968.1161961
STEVENS KN, 1991, J PHONETICS, V19, P161
NR 22
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2009
VL 51
IS 7
SI SI
BP 564
EP 568
DI 10.1016/j.specom.2009.04.001
PG 5
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HG
UT WOS:000267572800002
ER
PT J
AU Mariani, J
AF Mariani, Joseph
TI Research infrastructures for Human Language Technologies: A vision from
France
SO SPEECH COMMUNICATION
LA English
DT Article; Proceedings Paper
CT Seminar on Research Challenges in Speech Technology
CY OCT, 2005
CL Stockholm, SWEDEN
SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol
DE Research funding; International cooperation; Language technologies;
Language resources; Language technology evaluation
AB The size of the effort in language science and technology has increased over the years, reflecting the complexity of this research area. The role of Language Resources and Language Technology Evaluation is now recognized as being crucial for the development of written and spoken language processing systems. Several initiatives have been taken in this framework, with a pioneering activity in the US starting in the mid-1980s. In France, several programs may be reported, such as the Aupelf-Uref Francil network and the Techno-Langue national program. Despite the size of the effort and the success of those actions, they have not yet produced a permanent infrastructure. Proper coverage of the development of Language Technologies would require a better sharing of effort between the European Commission and the Member States to address the major challenge of multilingualism in Europe, and worldwide. (C) 2008 Elsevier B.V. All rights reserved.
C1 [Mariani, Joseph] LIMSI, CNRS, F-91403 Orsay, France.
[Mariani, Joseph] French Minist Res, F-75005 Paris, France.
RP Mariani, J (reprint author), LIMSI, CNRS, BP 133, F-91403 Orsay, France.
EM joseph.mariani@limsi.fr
CR ADDA G, 1998, 1 INT C LANG RES EV
ANTOINE JY, 1998, 1 INT C LANG RES EV
AYACHE C, 2006, 5 INT C LANG RES EV
BONHOMME P, 1997, JST FRANC 1997 P AUP
BONNEAUMAYNARD H, 2006, 5 INT C LANG RES EV
BUEHLER D, 2006, 5 INT C LANG RES EV
*BUILD LANG RES EV, 2004, JOINT COCOSDA WRITE
CAMARA E, 1997, JST FRANC 1997 P AUP
CARRE R, 1984, IEEE ICASSP C SAN DI
CHAUDIRON S, 2006, 5 INT C LANG RES EV
CHIAO Y, 2006, 5 INT C LANG RES EV
CIERI C, 2006, 5 INT C LANG RES EV
COLE R, 1997, SURVEY STATE ART HUM, pCH12
DAILLE B, 1997, JST FRANC 1997 P AUP
DEBILI F, 1997, JST FRANC 1997 P AUP
DEMAREUIL PB, 1998, 1 INT C LANG RES EV
DEMAREUIL PB, 2006, 5 INT C LANG RES EV
DENIS A, 2006, 5 INT C LANG RES EV
DEVILLERS L, 2004, P 4 INT C LANG RES E
DOLMAZON JM, 1997, JST FRANC 1997 P AUP
ELHADI WM, 2006, 5 INT C LANG RES EV
ELHADI WM, 2004, P 4 INT C LANG RES E
ELHADI WM, 2004, COLING 29004
GARCIA M, 2006, 5 INT C LANG RES EV
GILLARD L, 2006, 5 INT C LANG RES EV
GRAU B, 2006, 5 INT C LANG RES EV
GRAVIER G, 2004, P 4 INT C LANG RES E
HAMON O, 2006, 5 INT C LANG RES EV
HIRSCHMAN L, 1997, SURVEY STATE OF THE
HOVY E, 1999, MULTILINGUAL INFORM
JARDINO M, 1998, 1 INT C LANG RES EV
LABED L, 1997, JST FRANC 1997 P AUP
LANDI B, 1998, 1 INT C LANG RES EV
LANGLAIS P, 1998, 1 INT C LANG RES EV
LAZZARI G, 2006, HUMAN LANGUAGE TECHN
MAPELLI V, 2004, P 4 INT C LANG RES E
MARIANI J, 2002, ARE WE LOOSING GROUN
MARIANI J, 1997, SURVEY STATE ART HUM, pCH13
MARIANI J, 1995, COC WORKSH SEPT 22 1
MAUCLAIR J, 2006, 5 INT C LANG RES EV
MCTAIT K, 2004, TALN JEP RECITAL C 2
MOSTEFA D, 2006, 5 INT C LANG RES EV
MUSTAFA W, 1998, 1 INT C LANG RES EV
NAVA M, 2004, TALN JEP RECITAL 200
PAPINENI K, 2002, P 3 INT C LANG RES E
PAROUBEK P, 2006, 5 INT C LANG RES EV
PAROUBEK P, 2000, 2 INT C LANG RES EV
PIERRE J, 2007, LANGUE COEUR NUMERIQ
ROSSET S, 1997, JST FRANC 1997 P AUP
SABATIER P, 1997, JST FRANC 1997 P AUP
VANRULLEN T, 2006, 5 INT C LANG RES EV
VILNAT A, 2004, P 4 INT C LANG RES E
WHITE J, 1999, MULTILINGUAL INFORM
WITTENBURG P, 2006, 5 INT C LANG RES EV, V6
ZAMPOLLI A, 1999, MULTILINGUAL INFORM
NR 55
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2009
VL 51
IS 7
SI SI
BP 569
EP 584
DI 10.1016/j.specom.2007.12.001
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HG
UT WOS:000267572800003
ER
PT J
AU Greenberg, Y
Shibuya, N
Tsuzaki, M
Kato, H
Sagisaka, Y
AF Greenberg, Yoko
Shibuya, Nagisa
Tsuzaki, Minoru
Kato, Hiroaki
Sagisaka, Yoshinori
TI Analysis on paralinguistic prosody control in perceptual impression
space using multiple dimensional scaling
SO SPEECH COMMUNICATION
LA English
DT Article; Proceedings Paper
CT Seminar on Research Challenges in Speech Technology
CY OCT, 2005
CL Stockholm, SWEDEN
SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol
DE Paralinguistic prosody; Nonverbal information; Fundamental frequency
control; Communicative speech synthesis
AB A multi-dimensional perceptual space for communicative speech prosodies was derived using a psychometric method from multidimensional expressions of impressions to characterize paralinguistic information conveyed by prosody in communication. Single-word utterances of "n" were employed to allow freedom from lexical effects and to cover communicative prosodic variations as much as possible. The analysis of daily conversations showed that conversational speech impressions were manifested in the global F0 control of "n" as differences in average height (high-low) and dynamic patterns (rise, fall, gradual fall, and rise & fall). Using controlled single utterances of "n", multiple dimensional scaling analysis was applied to a mutual distance matrix obtained from 26-dimensional vectors expressing perceptual impressions. The results revealed a three-dimensional perceptual impression space, with each dimension corresponding to different F0 control characteristics. The positive-negative impression can be controlled by average F0 height, while confident-doubtful or allowable-unacceptable impressions can be controlled by F0 dynamic patterns.
Unlike conventional categorical classification of prosodic patterns frequently observed in studies of emotional prosody, this control characterization enables us to flexibly and quantitatively describe prosodic impressions. These experimental results allow the possibility of input specifications for communicative prosody generation using impression vectors and control through average F0 height and F0 dynamic patterns. Instead of the generation of speech with categorical prototypical prosody, more adequate communicative speech synthesis can be approached through input specification and its correspondence with control characteristics. (C) 2007 Elsevier B.V. All rights reserved.
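To make the analysis pipeline concrete, the sketch below embeds pairwise distances between impression vectors with metric multidimensional scaling (SMACOF, as implemented in scikit-learn) rather than the classical scaling of Torgerson (1952) cited by the authors; the Euclidean distance and the variable names are likewise assumptions made only for illustration.

```python
# Hedged sketch of the analysis: pairwise distances between 26-dimensional
# impression-rating vectors are embedded into three dimensions with MDS.
import numpy as np
from sklearn.manifold import MDS
from scipy.spatial.distance import pdist, squareform

def impression_space(impression_vectors, n_dims=3, seed=0):
    # impression_vectors: array (stimuli, 26) of averaged impression ratings.
    dist = squareform(pdist(impression_vectors, metric='euclidean'))
    mds = MDS(n_components=n_dims, dissimilarity='precomputed', random_state=seed)
    return mds.fit_transform(dist)   # (stimuli, n_dims) coordinates
```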
C1 [Greenberg, Yoko; Shibuya, Nagisa; Sagisaka, Yoshinori] Waseda Univ, GITI, Shinjuku Ku, Tokyo 1690051, Japan.
[Tsuzaki, Minoru] Kyoto City Univ Arts, Nishikyo Ku, Kyoto 6101197, Japan.
[Kato, Hiroaki] Natl Inst Informat & Commun Technol, ATR Cognit Informat Sci Lab, Seika, Kyoto 6190288, Japan.
RP Greenberg, Y (reprint author), Waseda Univ, GITI, Shinjuku Ku, 1-3-10 Nishi Waseda, Tokyo 1690051, Japan.
EM Yoko.Kokenawa@toki.waseda.jp
CR AUDIBERT N, 2005, P 9 EUR C SPEECH COM, P525
Campbell N., 2004, J PHONET SOC JPN, V8, P9
Campbell Nick, 2004, P 2 INT C SPEECH PRO, P217
*CTT, 2005, WAVESURFER
Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5
GREENBERG Y, 2006, P SPEECH PROS 2006
ISHI TC, 2005, P SPEECH PROS 2005
Maekawa K., 2004, P SPEECH PROS 2004, V2004, P367
TORGERSON WS, 1952, PSYCHOMETRIKA, V17, P401
NR 9
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2009
VL 51
IS 7
SI SI
BP 585
EP 593
DI 10.1016/j.specom.2007.10.006
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HG
UT WOS:000267572800004
ER
PT J
AU Gronnum, N
AF Gronnum, Nina
TI A Danish phonetically annotated spontaneous speech corpus (DanPASS)
SO SPEECH COMMUNICATION
LA English
DT Article; Proceedings Paper
CT Seminar on Research Challenges in Speech Technology
CY OCT, 2005
CL Stockholm, SWEDEN
SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol
DE Monologue; Dialogue; Spontaneous speech; Corpus; Phonetic notation;
Prosodic labeling
ID MAP TASK
AB A corpus is described consisting of non-scripted monologues and dialogues, recorded by 27 speakers, comprising a total of 73,227 running words, corresponding to 9 h and 46 min of speech. The monologues were recorded as one-way communication with an unseen partner where the speaker performed three different tasks: (s)he described a network consisting of various geometrical shapes in various colours, (s)he guided the listener through four different routes in a virtual city map, and (s)he instructed the listener how to build a house from its individual pieces. The dialogues are replicas of the HCRC map tasks. Annotation is performed in Praat. The sound files are segmented into prosodic phrases, words, and syllables. The files are supplied, in separate interval tiers, with an orthographical representation, detailed part-of-speech tags, simplified part-of-speech tags, a phonemic notation, a semi-narrow phonetic notation, a symbolic representation of the pitch relation between each stressed and post-tonic syllable, and a symbolic representation of the phrasal intonation. (C) 2008 Elsevier B.V. All rights reserved.
C1 Univ Copenhagen, Dept Scandinavian Studies & Linguist, Linguist Lab, DK-2300 Copenhagen, Denmark.
RP Gronnum, N (reprint author), Univ Copenhagen, Dept Scandinavian Studies & Linguist, Linguist Lab, 120 Njalsgade, DK-2300 Copenhagen, Denmark.
EM ninag@hum.ku.dk
CR ANDERSON AH, 1991, LANG SPEECH, V34, P351
Boersma P., 2001, GLOT INT, V5, P341
Boersma P., 2006, PRAAT DOING PHONETIC
BROWN G, 1984, TEACHING TALK
DAU T, 2007, CARLSBERGFONDET ARSS, P44
DYRBY M, 2005, M PROGRAM CIRC NOV 1
Fletcher J, 2002, LANG SPEECH, V45, P229
Horiuchi Y., 1999, Journal of Japanese Society for Artificial Intelligence, V14
GRONNUM N, 1995, STOCKHOLM 1995, V2, P124
GRONNUM N, 2006, P 5 INT C LANG RES E
GRONNUM N, 2007, P 16 INT C PHON SCI, P1229
GRONNUM N, 2005, FONETIK FONOLOGI
GRONNUM N, 1985, J ACOUST SOC AM, V77, P1205
GRONNUM N, 1990, PHONETICA, V47, P182
GRONNUM N, 1986, J ACOUST SOC AM, V80, P1040
HELGASON P, 2006, WORKING PAPERS, V52, P57
HENRICHSEN P, 2002, LAMBDA I DATALINGVIS, V27
Jensen C., 2005, P INT 2005 LISB, P2385
Kohler K., 2007, EXPT APPROACHES PHON, P41
KOHLER KJ, 2006, METHODS EMPIRICAL PR, V3, P123
Kohler KJ, 2005, PHONETICA, V62, P88, DOI 10.1159/000090091
PAGGIO P, 2006, P 5 INT C LANG RES E
Silverman K., 1992, P INT C SPOK LANG PR, P867
Sole, 2007, EXPT APPROACHES PHON, P192
SWERTS M, 1994, PROSODIC FEATURES DI
SWERTS M, 1992, SPEECH COMMUN, V121, P463
TERKEN JMB, 1984, LANG SPEECH, V27, P269
TONDERING J, 2008, THESIS U COPENHAGEN
NR 28
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2009
VL 51
IS 7
SI SI
BP 594
EP 603
DI 10.1016/j.specom.2008.11.002
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HG
UT WOS:000267572800005
ER
PT J
AU Massaro, DW
Jesse, A
AF Massaro, Dominic W.
Jesse, Alexandra
TI Read my lips: speech distortions in musical lyrics can be overcome
(slightly) by facial information
SO SPEECH COMMUNICATION
LA English
DT Article; Proceedings Paper
CT Seminar on Research Challenges in Speech Technology
CY OCT, 2005
CL Stockholm, SWEDEN
SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol
DE Perception of speech; Perception of music; Audiovisual speech
perception; Singing
ID VOICE ONSET TIME; SUNG VOWELS; VISUAL-PERCEPTION; FUNDAMENTAL-FREQUENCY;
PHRASE STRUCTURE; PERFORMANCE; PROSODY; PITCH; ENGLISH; SINGERS
AB Understanding the lyrics of many contemporary songs is difficult, and an earlier study [Hidalgo-Barnes, M., Massaro, D.W., 2007. Read my lips: an animated face helps communicate musical lyrics. Psychomusicology 19, 3-12] showed a benefit for lyrics recognition when seeing a computer-animated talking head (Baldi (R)) mouthing the lyrics along with hearing the singer. However, the contribution of visual information was relatively small compared to what is usually found for speech. In the current experiments, our goal was to determine why the face appears to contribute less when aligned with sung lyrics than when aligned with normal speech presented in noise. The first experiment compared the contribution of the talking head when aligned with the originally sung lyrics versus when aligned with Festival text-to-speech synthesis (TtS) spoken at the original durations of the song's lyrics. A small and similar influence of the face was found in both conditions. In the three experiments, we compared the presence of the face when the durations of the TtS were equated with the durations of the original musical lyrics against the case when the lyrics were read with typical TtS durations and this speech was embedded in noise. The results indicated that the unusual, temporally distorted durations of musical lyrics decrease the contribution of the visible speech from the face. (C) 2008 Elsevier B.V. All rights reserved.
C1 [Massaro, Dominic W.; Jesse, Alexandra] Univ Calif Santa Cruz, Dept Psychol, Santa Cruz, CA 95064 USA.
RP Jesse, A (reprint author), Max Planck Inst Psycholinguist, POB 310, NL-6500 AH Nijmegen, Netherlands.
EM massaro@ucsc.edu; Alexandra.Jesse@mpi.nl
CR AUER ET, 2004, J ACOUST SOC AM, V116, P2644
Austin SF, 2007, J VOICE, V21, P72, DOI 10.1016/j.jvoice.2005.08.013
BENOLKEN MS, 1990, J ACOUST SOC AM, V87, P1781, DOI 10.1121/1.399426
BERNSTEIN LE, 1989, J ACOUST SOC AM, V85, P397, DOI 10.1121/1.397690
Burnham D., 2001, P INT C AUD VIS SPEE, P155
Calvert GA, 1997, SCIENCE, V276, P593, DOI 10.1126/science.276.5312.593
Cave C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607235
Chiappe P, 1997, PSYCHON B REV, V4, P254, DOI 10.3758/BF03209402
Clarke E., 1987, PSYCHOL MUSIC, V15, P58, DOI 10.1177/0305735687151005
CLARKE EF, 1993, MUSIC PERCEPT, V10, P317
Clarke EF, 1985, MUSICAL STRUCTURE CO, P209
Stone RE, 2003, J VOICE, V17, P283, DOI 10.1067/S0892-1997(03)00074-2
Cleveland TE, 2001, J VOICE, V15, P54, DOI 10.1016/S0892-1997(01)00006-6
CLEVELAND TF, 1994, J VOICE, V8, P18, DOI 10.1016/S0892-1997(05)80315-7
CUDDY LL, 1981, J EXP PSYCHOL HUMAN, V7, P869, DOI 10.1037//0096-1523.7.4.869
Dahl S, 2007, MUSIC PERCEPT, V24, P433, DOI 10.1525/MP.2007.24.5.433
de Gelder B, 2000, COGNITION EMOTION, V14, P289
DICARLO NS, 2004, SEMIOTICA, V149, P47
DICARLO NS, 1985, PHONETICA, V42, P188
Dohen M, 2005, P AVSP, P115
Dohen M, 2004, SPEECH COMMUN, V44, P155, DOI 10.1016/j.specom.2004.10.009
Ellison JW, 1997, J EXP PSYCHOL HUMAN, V23, P213, DOI 10.1037/0096-1523.23.1.213
FISHER CG, 1969, J SPEECH HEAR RES, V12, P379
Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332
FROMKIN VA, 1971, LANGUAGE, V47, P27, DOI 10.2307/412187
FRY DB, 1955, J ACOUST SOC AM, V27, P765, DOI 10.1121/1.1908022
Granstrom B., 1999, P INT C PHON SCI ICP, P655
Granstrom B, 2005, SPEECH COMMUN, V46, P473, DOI 10.1016/j.specom.2005.02.017
Gregg JW, 2006, J VOICE, V20, P198, DOI 10.1016/j.jvoice.2005.01.007
GREGORY AH, 1978, PERCEPT PSYCHOPHYS, V24, P171, DOI 10.3758/BF03199545
Hasegawa T, 2004, COGNITIVE BRAIN RES, V20, P510, DOI 10.1016/j.cogbrainres.2004.04.005
HIDALGO-BARNES M., 2007, PSYCHOMUSICOLOGY, V19, P3, DOI [10.1037/h0094037, DOI 10.1037/H0094037]
HNATHCHISOLM T, 1988, EAR HEARING, V9, P329, DOI 10.1097/00003446-198812000-00009
Hollien H, 2000, J VOICE, V14, P287, DOI 10.1016/S0892-1997(00)80038-7
House D., 2001, P EUR 2001, P387
HOUSE D, 2002, TMH QPSR FONETIK, V44, P41
Huron D, 2003, MUSIC PERCEPT, V21, P267, DOI 10.1525/mp.2003.21.2.267
Jackendoff R, 2006, COGNITION, V100, P33, DOI 10.1016/j.cognition.2005.11.005
Jesse A., 2000, INTERPRETING, V5, P95, DOI 10.1075/intp.5.2.04jes
JUSCZYK PW, 1993, J EXP PSYCHOL HUMAN, V19, P627, DOI 10.1037/0096-1523.19.3.627
Juslin PN, 2003, PSYCHOL BULL, V129, P770, DOI 10.1037/0033-2909.129.5.770
Keating P., 2003, P 16 INT C PHON SCI, P2071
Krumhansl C. L., 1997, MUSIC SCI, V1, P63
KRUMHANSL CL, 1990, PSYCHOL SCI, V1, P70, DOI 10.1111/j.1467-9280.1990.tb00070.x
Lansing CR, 1999, J SPEECH LANG HEAR R, V42, P526
LARGE EW, 1995, COGNITIVE SCI, V19, P53, DOI 10.1207/s15516709cog1901_2
Lerdahl F., 1983, GENERATIVE THEORY TO
LISKER L, 1986, LANG SPEECH, V29, P3
Lundy DS, 2000, J VOICE, V14, P490, DOI 10.1016/S0892-1997(00)80006-5
Massaro D. W., 1998, PERCEIVING TALKING F
Massaro D. W., 1987, SPEECH PERCEPTION EA
Massaro DW, 2002, TEXT SPEECH LANG TEC, V19, P45
Massaro DW, 1998, AM SCI, V86, P236, DOI 10.1511/1998.25.861
Massaro DW, 1996, PSYCHON B REV, V3, P215, DOI 10.3758/BF03212421
Massaro DW, 1999, J SPEECH LANG HEAR R, V42, P21
McCrea CR, 2005, J SPEECH LANG HEAR R, V48, P1013, DOI 10.1044/1092-4388(2005/069)
McCrea CR, 2007, J VOICE, V21, P54, DOI 10.1016/j.jvoice.2005.05.002
McCrea CR, 2005, J VOICE, V19, P420, DOI 10.1016/j.jvoice.2004.08.002
MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526
Mixdorff H., 2005, P AUD VIS SPEECH PRO, P3
Munhall KG, 2004, PSYCHOL SCI, V15, P133, DOI 10.1111/j.0963-7214.2004.01502010.x
Neuhaus C, 2006, J COGNITIVE NEUROSCI, V18, P472, DOI 10.1162/089892906775990642
Nicholson KG, 2003, BRAIN COGNITION, V52, P382, DOI 10.1016/S0278-2626(03)00182-9
Omori K, 1996, J VOICE, V10, P228, DOI 10.1016/S0892-1997(96)80003-8
Ouni S, 2007, EURASIP J AUDIO SPEE, DOI 10.1155/2007/47891
PALMER C, 1995, J EXP PSYCHOL HUMAN, V21, P947, DOI 10.1037//0096-1523.21.5.947
PALMER C, 1987, J EXP PSYCHOL HUMAN, V13, P116, DOI 10.1037//0096-1523.13.1.116
PALMER C, 1992, J MEM LANG, V31, P525, DOI 10.1016/0749-596X(92)90027-U
PALMER C, 1990, J EXP PSYCHOL HUMAN, V16, P728, DOI 10.1037//0096-1523.16.4.728
PALMER C, 1992, COGNITIVE BASES OF MUSICAL COMMUNICATION, P249, DOI 10.1037/10104-014
Palmer C, 2006, PSYCHOL LEARN MOTIV, V46, P245, DOI 10.1016/S0079-7421(06)46007-2
PALMER C, 1989, J EXP PSYCHOL HUMAN, V15, P331, DOI 10.1037/0096-1523.15.2.331
Patel AD, 2003, COGNITION, V87, pB35, DOI 10.1016/S0010-0277(02)00187-7
Patel AD, 2003, MUSIC PERCEPT, V21, P273, DOI 10.1525/mp.2003.21.2.273
Patel AD, 2006, J ACOUST SOC AM, V119, P3034, DOI 10.1121/1.2179657
Penel A, 2004, PERCEPT PSYCHOPHYS, V66, P545, DOI 10.3758/BF03194900
*PRIM, 1993, PRESSM ALB PORK SOD
REPP BH, 1995, PERCEPT PSYCHOPHYS, V57, P1217, DOI 10.3758/BF03208378
Repp BH, 1998, J EXP PSYCHOL HUMAN, V24, P791, DOI 10.1037//0096-1523.24.3.791
REPP BH, 1992, COGNITION, V44, P241, DOI 10.1016/0010-0277(92)90003-Z
Risberg A., 1978, STL QPSR, V4, P1
ROSSING TD, 1986, J ACOUST SOC AM, V79, P1975, DOI 10.1121/1.393205
SALDANA HM, 1993, PERCEPT PSYCHOPHYS, V54, P406, DOI 10.3758/BF03205276
SCHMUCKLER MA, 1989, MUSIC PERCEPT, V7, P109
SLOBODA JA, 1980, CAN J PSYCHOL, V34, P274, DOI 10.1037/h0081052
SLOBODA JA, 1983, Q J EXP PSYCHOL-A, V35, P377
SMITH GP, 2003, ELT J, V57, P113, DOI 10.1093/elt/57.2.113
SMITH LA, 1980, J ACOUST SOC AM, V67, P1795, DOI 10.1121/1.384308
Srinivasan RJ, 2003, LANG SPEECH, V46, P1
SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309
Summerfield Q, 1987, HEARING EYE PSYCHOL, P3
Sundberg J, 2003, TMH QPSR SPEECH MUSI, V45, P11
Sundberg J, 1997, J VOICE, V11, P301, DOI 10.1016/S0892-1997(97)80008-2
SUNDBERG J, 1974, J ACOUST SOC AM, V55, P838, DOI 10.1121/1.1914609
SUNDBERG J, 1982, PSYCHOL MUSIC, P59
Swerts M, 2005, J MEM LANG, V53, P81, DOI 10.1016/j.jml.2005.02.003
SWERTS M, 2004, P SPEECH PROS
TAN N, 1981, MEM COGNITION, V9, P533, DOI 10.3758/BF03202347
Thompson DM, 1934, J GEN PSYCHOL, V11, P160
TITZE IR, 1992, J ACOUST SOC AM, V91, P2936, DOI 10.1121/1.402929
TODD NPM, 1995, J ACOUST SOC AM, V97, P1940, DOI 10.1121/1.412067
Trainor LJ, 2000, PERCEPT PSYCHOPHYS, V62, P333, DOI 10.3758/BF03205553
Vatakis A, 2006, NEUROSCI LETT, V393, P40, DOI 10.1016/j.neulet.2005.09.032
Vines BW, 2006, COGNITION, V101, P80, DOI 10.1016/j.cognition.2005.09.003
Yehia HC, 2002, J PHONETICS, V30, P555, DOI 10.1006/jpho.2002.0165
NR 105
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2009
VL 51
IS 7
SI SI
BP 604
EP 621
DI 10.1016/j.specom.2008.05.013
PG 18
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HG
UT WOS:000267572800006
ER
PT J
AU Lindblom, B
Diehl, R
Creeger, C
AF Lindblom, Bjorn
Diehl, Randy
Creeger, Carl
TI Do 'Dominant Frequencies' explain the listener's response to formant and
spectrum shape variations?
SO SPEECH COMMUNICATION
LA English
DT Article; Proceedings Paper
CT Seminar on Research Challenges in Speech Technology
CY OCT, 2005
CL Stockholm, SWEDEN
SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol
DE Vowel quality perception; Auditory representation; Dominant Frequency
AB Psychoacoustic experimentation shows that formant frequency shifts can give rise to more significant changes in phonetic vowel timbre than differences in overall level, bandwidth, spectral tilt, and formant amplitudes. Carlson and Granstrom's perceptual and computational findings suggest that, in addition to spectral representations, the human ear uses temporal information on formant periodicities ('Dominant Frequencies') in building vowel timbre percepts. The availability of such temporal coding in the cat's auditory nerve fibers has been demonstrated in numerous physiological investigations undertaken during recent decades. In this paper we explore, and provide further support for, the Dominant Frequency hypothesis using KONVERT, a computational auditory model. KONVERT provides auditory excitation patterns for vowels by performing a critical-band analysis. It simulates phase locking in auditory neurons and outputs DF histograms. The modeling supports the assumption that listeners judge phonetic distance among vowels on the basis of formant frequency differences, as determined primarily by a time-based analysis. However, when instructed to judge psychophysical distance among vowels, they can also use spectral differences such as formant bandwidth, formant amplitudes and spectral tilt. Although there has been considerable debate among psychoacousticians about the functional role of phase locking in monaural hearing, the present research suggests that detailed temporal information may nonetheless play a significant role in speech perception. (C) 2008 Elsevier B.V. All rights reserved.
C1 [Lindblom, Bjorn] Stockholm Univ, Dept Linguist, S-10691 Stockholm, Sweden.
[Diehl, Randy; Creeger, Carl] Univ Texas Austin, Dept Psychol, Austin, TX 78712 USA.
RP Lindblom, B (reprint author), Stockholm Univ, Dept Linguist, S-10691 Stockholm, Sweden.
EM lindblom@ling.su.se; diehl@psy.utexas.edu; creeger@psy.utexas.edu
CR ASSMANN PF, 1989, J ACOUST SOC AM, V85, P327, DOI 10.1121/1.397684
BLADON RAW, 1981, J ACOUST SOC AM, V69, P1414, DOI 10.1121/1.385824
BLOMBERG M, 1984, ACOUSTICS SPEECH SIG, P33
Carlson R., 1970, STL QPSR, V2, P19
CARLSON R, 1975, AUDITORY ANAL PERCEP
CARLSON R, 1979, STL QPSR, V3, P3
CARLSON R, 1979, STL QPSR, V20, P84
CARLSON R, 1982, REPRESENTATION SPEEC, P109
CARLSON R, 1976, STL QPSR, V17, P1
DELGUTTE B, 1984, J ACOUST SOC AM, V75, P866, DOI 10.1121/1.390596
Diehl RL, 1996, J PHONETICS, V24, P187, DOI 10.1006/jpho.1996.0011
Fant G., 1960, ACOUSTIC THEORY SPEE
Fant G, 1972, STL QPSR, V2, P28
FANT G, 1963, J ACOUST SOC AM, V35, P1753, DOI 10.1121/1.1918812
Fletcher H, 1933, J ACOUST SOC AM, V5, P82, DOI 10.1121/1.1915637
GREENBERG S, 1998, J PHONETICS, V16, P3
Klatt D. H., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing
Klatt D. H, 1982, REPRESENTATION SPEEC, P181
LILJENCR.J, 1972, LANGUAGE, V48, P839, DOI 10.2307/411991
Lindblom B., 1986, EXPT PHONOLOGY, P13
MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861
PLOMP R, 1970, FREQUENCY ANAL PERIO
ROSE JE, 1967, J NEUROPHYSIOL, V32, P402
SACHS MB, 1982, REPRESENTATION SPEEC, P115
Schroeder M. R., 1979, FRONTIERS SPEECH COM, P217
VIEMEISTER NF, 2002, GENETICS FUNCTION AU, P273
Zwicker E., 1967, OHR ALS NACHRICHTENE
NR 27
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2009
VL 51
IS 7
SI SI
BP 622
EP 629
DI 10.1016/j.specom.2008.12.003
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HG
UT WOS:000267572800007
ER
PT J
AU Pelachaud, C
AF Pelachaud, Catherine
TI Studies on gesture expressivity for a virtual agent
SO SPEECH COMMUNICATION
LA English
DT Article; Proceedings Paper
CT Seminar on Research Challenges in Speech Technology
CY OCT, 2005
CL Stockholm, SWEDEN
SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol
DE Nonverbal behavior; Embodied conversational agents; Expressivity
ID EMOTION; FACE
AB Our aim is to create an affective embodied conversational agent (ECA); that is, an ECA able to display communicative and emotional signals. Nonverbal communication is carried out through certain facial expressions, gesture shapes, gaze direction, etc. But it can also carry a qualitative aspect through behavior expressivity: how a facial expression or a gesture is executed. In this paper we describe some of the work we have conducted on behavior expressivity, more particularly on gesture expressivity. We have developed a model of behavior expressivity using a set of six parameters that act as modulations of behavior animation. Expressivity may act at different levels of the behavior: on a particular phase of the behavior, on the whole behavior, and on a sequence of behaviors. When applied at these different levels, expressivity may convey different functions. (C) 2008 Published by Elsevier B.V.
C1 Univ Paris 08, INRIA Paris Rocquencourt, IUT Montreuil, Paris, France.
RP Pelachaud, C (reprint author), Univ Paris 08, INRIA Paris Rocquencourt, IUT Montreuil, Paris, France.
EM pelachaud@iut-univ.paris8.fr
CR Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
BANZIGER T, 2006, WORKSH CORP RES EM A
BASSILI JN, 1979, J PERS SOC PSYCHOL, V37, P2049, DOI 10.1037//0022-3514.37.11.2049
Bavelas JB, 2000, J LANG SOC PSYCHOL, V19, P163, DOI 10.1177/0261927X00019002001
BECKER C, 2006, INT WORKSH EM COMP C, P31
Bui T, 2004, COMPUTER GRAPHICS INTERNATIONAL, PROCEEDINGS, P284
BUISINE S, 2006, 6 INT C INT VIRT AG
BUISINE S, 2004, BROWS TRUST EVALUATI
CARIDAKIS G, 2006, WORKSH MULT CORP MUL
Cassell J, 2007, CONVERSATIONAL INFOR, P133
CASSELL J, 2001, ANN C SERIES
Cassell J., 1999, CHI 99, P520
CHAFAI NE, 2007, INT J LANGUAGE RESOU
CHAFAI NE, 2006, LREC 2006 WORKSH MUL
CHAFAI NE, 2006, 6 INT C INT VIRT AG
Chi D, 2000, COMP GRAPH, P173
DECAROLIS B, 2004, LIFE LIKE CHARACTERS, P65
DEVILLERS L, 2005, 1 INT C AFF COMP INT
Ekman P, 2003, EMOTIONS REVEALED
Ekman P, 1975, UNMASKING FACE GUIDE
FEYEREISEN P, 1997, GESTE COGNITION COMM, P20
GALLAHER PE, 1992, J PERS SOC PSYCHOL, V63, P133, DOI 10.1037/0022-3514.63.1.133
HARTMANN B, 2005, 3 INT JOINT C AUT AG
HARTMANN B, 2006, LNCS LNAI
KAISER S, 2006, P 19 INT C COMP AN S
Kendon A., 2004, GESTURE VISIBLE ACTI
Kendon A., 1990, CONDUCTING INTERACTI
Kopp S, 2004, COMPUT ANIMAT VIRT W, V15, P39, DOI 10.1002/cav.6
KRAHMER E, 2004, BROWS TRUST EVALUATI
LIU K, 2005, ANN C SERIES
LUNDEBERG M, 1999, P ESCA WORKSH AUD VI
MANCINI M, 2007, GEST WORKSH LISB MAY
MARTIN JC, 2005, 1 INT C AFF COMP INT
MARTIN JC, 2006, INT J HUM ROBOT, V20, P477
McNeill D., 1992, HAND MIND WHAT GESTU
Menardais S., 2004, P ACM SIGGRAPH EUR S, P325, DOI 10.1145/1028523.1028567
NEFF M, 2004, P 2004 ACM SIGGRAPH, P49, DOI 10.1145/1028523.1028531
NIEWIADOMSKI R, 2007, 2 INT C AFF COMP INT
Pandzic I.S., 2002, MPEG 4 FACIAL ANIMAT
PELACHAUD C, 2005, ACM MULTIMEDIA BRAVE
Peters C, 2006, INT J HUM ROBOT, V3, P321, DOI 10.1142/S0219843606000783
POGGI I, 1998, P 6 INT PRAGM C REIM
Poggi I, 2002, GESTURE, V2, P71, DOI 10.1075/gest.2.1.05pog
POGGI I, 2003, GESTURES MEANING USE
POSNER R, 2003, GESTURES MEANING USE
PRILLWITZ S, 1989, INT STUDIES SIGN LAN, V5
Raouzaiou A, 2002, EURASIP J APPL SIG P, V2002, P1021, DOI 10.1155/S1110865702206149
RAPANTZIKOS K, 2005, INT WORKSH CONT BAS
Ruttkay Z, 2003, COMPUT GRAPH FORUM, V22, P49, DOI 10.1111/1467-8659.t01-1-00645
RUTTKAY Z, 2003, P AAMAS03 WS EMB CON
Sidner C. L., 2004, IUI 04, P78, DOI DOI 10.1145/964442.964458
STONE M, 2004, ANN C SERIES
STONE M, 2003, COMPUTER ANIMATION S
TERZOPOULOS D, 1993, IEEE T PATTERN ANAL, V15, P569, DOI 10.1109/34.216726
Thomas F, 1981, DISNEY ANIMATION ILL
Wallbott HG, 1998, EUR J SOC PSYCHOL, V28, P879, DOI 10.1002/(SICI)1099-0992(1998110)28:6<879::AID-EJSP901>3.0.CO;2-W
WALLBOTT HG, 1986, J PERSONALITY SOCIAL, V24
WALLBOTT HG, 1985, J CLIN PSYCHOL, V41
NR 58
TC 17
Z9 17
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2009
VL 51
IS 7
SI SI
BP 630
EP 639
DI 10.1016/j.specom.2008.04.009
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HG
UT WOS:000267572800008
ER
PT J
AU Rosenberg, A
Hirschberg, J
AF Rosenberg, Andrew
Hirschberg, Julia
TI Charisma perception from text and speech
SO SPEECH COMMUNICATION
LA English
DT Article; Proceedings Paper
CT Seminar on Research Challenges in Speech Technology
CY OCT, 2005
CL Stockholm, SWEDEN
SP KTH, Dept Speech, Music & Hearing, Ctr Speech Technol
DE Charismatic speech; Paralinguistic analysis; Political speech
AB Charisma, the ability to attract and retain followers without benefit of formal authority, is more difficult to define than to identify. While we each seem able to identify charismatic individuals - and non-charismatic individuals - it is not clear what it is about an individual that influences our judgment. This paper describes the results of experiments designed to discover potential correlates of such judgments, in what speakers say and the way that they say it. We present results of two parallel experiments in which subjective judgments of charisma in spoken and in transcribed American political speech were analyzed with respect to the acoustic and prosodic (where applicable) and lexico-syntactic characteristics of the speech being assessed. While we find that there is considerable disagreement among subjects on how the speakers of each token are ranked, we also find that subjects appear to share a functional definition of charisma, in terms of other personal characteristics we asked them to rank speakers by. We also find certain acoustic, prosodic, and lexico-syntactic characteristics that correlate significantly with perceptions of charisma. Finally, by comparing the responses to spoken vs. transcribed stimuli, we attempt to distinguish between the contributions of "what is said" and "how it is said" with respect to charisma judgments. (C) 2008 Elsevier B.V. All rights reserved.
C1 [Rosenberg, Andrew; Hirschberg, Julia] Columbia Univ, New York, NY 10027 USA.
RP Rosenberg, A (reprint author), Columbia Univ, 2960 Broadway, New York, NY 10027 USA.
EM amaxwell@cs.columbia.edu; julia@cs.columbia.edu
CR BARD EG, 1996, P ICSLP 96, P1876
BIRD FB, 1993, HDB CULTS SECTS AM, VB, P75
Boersma P., 2001, GLOT INT, V5, P341
BOSS P, 1976, S SPEECH COMM J, V41, P300
Brill E, 1995, COMPUT LINGUIST, V21, P543
COHEN J, 1968, PSYCHOL BULL, V70, P213, DOI 10.1037/h0026256
Cohen R., 1987, Computational Linguistics, V13
Dahan D, 2002, J MEM LANG, V47, P292, DOI 10.1016/S0749-596X(02)00001-3
DAVIES J, 1954, AM POLIT SCI REV, V48
DOWIS R, 2000, LOST ART GREAT SPEEC
Hamilton M. A., 1993, COMMUNICATION Q, V41, P231
Ladd D. R., 1996, INTONATIONAL PHONOLO
LARSON K, 2003, SCI WORD RECOGNITION
MARCUS J, 1967, W POLIT Q, V14, P237
PIERREHUMBERT J, 1990, INTENTIONS COMMUNICA
Silverman K., 1992, P INT C SPOK LANG PR, V2, P867
Touati P., 1993, ESCA WORKSH PROS, P168
TUPPEN CJS, 1974, SPEECH MONOGR, V41, P253
Weber M., 1947, THEORY SOCIAL EC ORG
NR 19
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUL
PY 2009
VL 51
IS 7
SI SI
BP 640
EP 655
DI 10.1016/j.specom.2008.11.001
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 465HG
UT WOS:000267572800009
ER
PT J
AU Molina, C
Yoma, NB
Wuth, J
Vivanco, H
AF Molina, Carlos
Becerra Yoma, Nestor
Wuth, Jorge
Vivanco, Hiram
TI ASR based pronunciation evaluation with automatically generated
competing vocabulary and classifier fusion
SO SPEECH COMMUNICATION
LA English
DT Article
DE Computer aided pronunciation training; Speech recognition; Automatic
generation of competitive vocabulary; Multi-classifier fusion; Computer
aided language learning
ID RECOGNITION; SYSTEMS
AB In this paper, the application of automatic speech recognition (ASR) technology in computer aided pronunciation training (CAPT) is addressed. A method to automatically generate the competitive lexicon, required by an ASR engine to compare the pronunciation of a target word with its correct and wrong phonetic realizations, is proposed. In order to enable the efficient deployment of CAPT applications, the generation of this competitive lexicon does not require any human assistance or a priori information of mother language dependent error rules. Moreover, a Bayes based multi-classifier fusion approach to map ASR objective confidence scores to subjective evaluations in pronunciation assessment is presented. The method proposed here to generate a competitive lexicon given a target word leads to averaged subjective-objective score correlation equal to 0.67 and 0.82 with five and two levels of pronunciation quality, respectively. Finally, multi-classifier systems (MCS) provide a promising formal framework to combine poorly correlated scores in CAPT. When applied to ASR confidence metrics, MCS can lead to an increase of 2.4% and a reduction of 10.2% in subjective-objective score correlation and classification error, respectively, with two pronunciation quality levels. (c) 2009 Elsevier B.V. All rights reserved.
C1 [Molina, Carlos; Becerra Yoma, Nestor; Wuth, Jorge] Univ Chile, Speech Proc & Transmiss Lab, Dept Elect Engn, Santiago, Chile.
[Vivanco, Hiram] Univ Chile, Dept Linguist, Santiago, Chile.
RP Yoma, NB (reprint author), Univ Chile, Speech Proc & Transmiss Lab, Dept Elect Engn, Santiago, Chile.
EM nbecerra@ing.uchile.cl
FU Conicyt-Chile [1070382]; Fondef [D051-10243, AT 24080121]
FX This work was funded by Conicyt-Chile under grants Fondecyt No. 1070382,
Fondef No. D051-10243, and AT 24080121.
CR ABDOU S, 2006, P INTERSPEECH ICSLP
BONAVENTURA P, 2000, P KONV 2000 C NAT LA, P225
CUCCHIARINI C, 1997, IEEE WORKSH ASRU, P622
Duda R. O., 1973, PATTERN CLASSIFICATI
Fassinut-Mombot B., 2004, Information Fusion, V5, DOI 10.1016/j.inffus.2003.06.001
FRANCO H, 1997, ICASSP 97, V2, P1471
Fumera G, 2005, IEEE T PATTERN ANAL, V27, P942, DOI 10.1109/TPAMI.2005.109
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Garofalo J., 1993, CONTINUOUS SPEECH RE
HAMID S, 2004, P NEMLAR C AR LANG R
Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881
Kittler J, 2003, IEEE T PATTERN ANAL, V25, P110, DOI 10.1109/TPAMI.2003.1159950
Kuncheva LI, 2002, IEEE T PATTERN ANAL, V24, P281, DOI 10.1109/34.982906
Kuncheva LI, 2001, PATTERN RECOGN, V34, P299, DOI 10.1016/S0031-3203(99)00223-X
Kwan K.Y., 2002, P ICSLP, P69
*LDC, 1995, LAT 40 DAT PROV LING
Moustroufas N, 2007, COMPUT SPEECH LANG, V21, P219, DOI 10.1016/j.csl.2006.04.001
NAKAGAWA S, 2007, P INTERSPEECH ICSLP
Neri A., 2003, P 15 INT C PHON SCI, P1157
NEUMEYER L, 1996, P ICSLP 96
Sooful J., 2002, P 7 INT C SPOK LANG, P521
TAX DMJ, 1997, P 1 IAPR WORKSH STAT, P165
TEPPERMAN J, 2007, P INTERSPEECH ICSLP
XU L, 1992, IEEE T SYST MAN CYB, V22, P418, DOI 10.1109/21.155943
NR 24
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2009
VL 51
IS 6
BP 485
EP 498
DI 10.1016/j.specom.2009.01.002
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 444NX
UT WOS:000265988800001
ER
PT J
AU Gerosa, M
Giuliani, D
Brugnara, F
AF Gerosa, Matteo
Giuliani, Diego
Brugnara, Fabio
TI Towards age-independent acoustic modeling
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker adaptive acoustic modeling; Speaker normalization; Vocal tract
length normalization; Children's speech recognition
ID AUTOMATIC SPEECH RECOGNITION; CHILDRENS SPEECH; VOCAL-TRACT;
NORMALIZATION
AB In automatic speech recognition applications, due to significant differences in voice characteristics, adults and children are usually treated as two population groups, for which different acoustic models are trained. In this paper, age-independent acoustic modeling is investigated in the context of large vocabulary speech recognition. Exploiting a small amount (9 h) of children's speech and a more significant amount (57 h) of adult speech, age-independent acoustic models are trained using several methods for speaker adaptive acoustic modeling. Recognition results achieved using these models are compared with those achieved using age-dependent acoustic models for children and adults, respectively. Recognition experiments are performed on four Italian speech corpora, two consisting of children's speech and two of adult speech, using 64k word and 11k word trigram language models. Methods for speaker adaptive acoustic modeling prove to be effective for training age-independent acoustic models ensuring recognition results at least as good as those achieved with age-dependent acoustic models for adults and children. (c) 2009 Elsevier B.V. All rights reserved.
C1 [Gerosa, Matteo; Giuliani, Diego; Brugnara, Fabio] FBK, I-38100 Trento, Italy.
RP Gerosa, M (reprint author), FBK, I-38100 Trento, Italy.
EM gerosa@fbk.eu; giuliani@fbk.eu; brugnara@fbk.eu
FU European Union [IST-2001-37599]; TC-STAR [FP6-506738]
FX This work was partially financed by the European Union under the
Projects PF-STAR (Grant IST-2001-37599, http://pfstar.itc.it) and
TC-STAR (Grant FP6-506738, http://www.tc-star.org).
CR ACKERMANN U, 1997, P EUROSPEECH, P1807
Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807
ANGELINI B, 1994, P ICSLP YOK, P1391
Batliner A., 2005, P INT, P2761
BERTOLDI N, 2001, P ICASSP SALT LAK CI, V1, P37
BRUGNARA F, 2002, P ICSLP DENV CO, P1441
BURNETT DC, 1996, P ICSLP PHIL PA, V2, P1145, DOI 10.1109/ICSLP.1996.607809
Claes T, 1998, IEEE T SPEECH AUDI P, V6, P549, DOI 10.1109/89.725321
DARCY S, 2005, P INT 2005 LISB PORT, P2197
Das S., 1998, P ICASSP SEATTL US M, V1, P433, DOI 10.1109/ICASSP.1998.674460
EIDE E, 1996, P ICASSP, P346
Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
GEROSA M, 2006, P ICASSP TOUL FRANC, P393
Gerosa M, 2007, SPEECH COMMUN, V49, P847, DOI 10.1016/j.specom.2007.01.002
GEROSA M, 2006, P INTERSPEECH PITTSB
Gillick L., 1989, P ICASSP, P532
GIULIANI D, 2004, P ICSLP, V4, P2893
GIULIANI D, 2003, P ICASSP, V2, P137
Giuliani D, 2006, COMPUT SPEECH LANG, V20, P107, DOI 10.1016/j.csl.2005.05.002
Goldstein UG., 1980, THESIS MIT CAMBRIDGE
HAGEN A, 2003, P IEEE AUT SPEECH RE
Huber JE, 1999, J ACOUST SOC AM, V106, P1532, DOI 10.1121/1.427150
Lee L., 1996, P ICASSP
Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
LI Q, 2002, P ICSLP DENV CO, P2337
MARTINEZ F, 1998, P ICASSP, V2, P725, DOI 10.1109/ICASSP.1998.675367
MCGOWAN RS, 1988, J ACOUST SOC AM, V83, P229, DOI 10.1121/1.396425
MIRGHAFORI N, 1996, P IEEE INT C AC SPEE, P335
Narayanan S, 2002, IEEE T SPEECH AUDI P, V10, P65, DOI 10.1109/89.985544
Nisimura R., 2004, P ICASSP, V1, P433
NITTROUER S, 1989, J ACOUST SOC AM, V86, P1266, DOI 10.1121/1.398741
PALLETT DS, 1992, P 1992 ARPAS CONT SP
POTAMIANOS A, 1997, P EUR C SPEECH COMM, P2371
Potamianos A, 2003, IEEE T SPEECH AUDI P, V11, P603, DOI 10.1109/TSA.2003.818026
Steidl S, 2003, LECT NOTES COMPUT SC, V2781, P600
Stemmer G., 2003, P EUR C SPEECH COMM, P1313
Stemmer G., 2005, P IEEE INT C AC SPEE, V1, P997, DOI 10.1109/ICASSP.2005.1415284
WEGMANN S, 1996, P ICASSP, P339
Welling L., 1999, P IEEE INT C AC SPEE, P761
WILPON JG, 1996, P ICASSP, P349
NR 42
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2009
VL 51
IS 6
BP 499
EP 509
DI 10.1016/j.specom.2009.01.006
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 444NX
UT WOS:000265988800002
ER
PT J
AU Amano, S
Kondo, T
Kato, K
Nakatani, T
AF Amano, Shigeaki
Kondo, Tadahisa
Kato, Kazumi
Nakatani, Tomohiro
TI Development of Japanese infant speech database from longitudinal
recordings
SO SPEECH COMMUNICATION
LA English
DT Article
DE Infant utterance; Speech database; Speech development
ID VOCAL FUNDAMENTAL-FREQUENCY; UTTERANCES; FEATURES; CHILDREN; FATHER;
PITCH
AB Developmental research on speech production requires both a cross-sectional and a longitudinal speech database. Previous longitudinal speech databases are limited in terms of recording period or number of utterances. An infant speech database was developed from 5 years of recordings containing a large number of daily life utterances of five Japanese infants and their parents. The resulting database contains 269,467 utterances with various types of information including a transcription, an F0 value, and a phoneme label. This database can be used in future research on the development of speech production. (c) 2009 Elsevier B.V. All rights reserved.
C1 [Amano, Shigeaki; Kondo, Tadahisa; Kato, Kazumi; Nakatani, Tomohiro] NTT Corp, NTT Commun Sci Labs, Seika, Kyoto 6190237, Japan.
RP Amano, S (reprint author), NTT Corp, NTT Commun Sci Labs, 2-4 Hikari Dai, Seika, Kyoto 6190237, Japan.
EM amano@cslab.kecl.ntt.co.jp
CR Amano S, 2006, J ACOUST SOC AM, V119, P1636, DOI 10.1121/1.2161443
BENNETT S, 1983, J SPEECH HEAR RES, V26, P137
Eguchi S, 1969, Acta Otolaryngol Suppl, V257, P1
Fairbanks G, 1942, CHILD DEV, V13, P227
HOLLIEN H, 1994, J ACOUST SOC AM, V96, P2646, DOI 10.1121/1.411275
INUI T, 2003, COGNITIVE STUDIES, V10, P304
Ishizuka K, 2007, J ACOUST SOC AM, V121, P2272, DOI 10.1121/1.2535806
Kajikawa S, 2004, J CHILD LANG, V31, P215, DOI 10.1017/S0305000903005968
KAJIKAWA S, 2004, PAFOUMANSU KYOUIKU, V3, P61
KEATING P, 1978, J ACOUST SOC AM, V63, P567, DOI 10.1121/1.381755
KENT RD, 1976, J SPEECH HEAR RES, V19, P421
KENT RD, 1982, J ACOUST SOC AM, V72, P353, DOI 10.1121/1.388089
MacWhinney B., 2000, CHILDES PROJECT TOOL
McRoberts GW, 1997, J CHILD LANG, V24, P719, DOI 10.1017/S030500099700322X
MUGITANI R, 2006, J PHONET SOC JPN, V10, P96
Nakatani T, 2004, J ACOUST SOC AM, V116, P3690, DOI 10.1121/1.1787522
Nakatani T., 2003, P EUROSPEECH, P2313
NAKATANI T, 2002, P ICSLP 2002, V3, P1733
Reissland N, 1998, INFANT BEHAV DEV, V21, P793, DOI 10.1016/S0163-6383(98)90046-7
ROBB MP, 1989, J ACOUST SOC AM, V85, P1708, DOI 10.1121/1.397960
ROBB MP, 1985, J SPEECH HEAR RES, V28, P421
SHEPPARD WC, 1968, J SPEECH HEAR RES, V11, P94
NR 22
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2009
VL 51
IS 6
BP 510
EP 520
DI 10.1016/j.specom.2009.01.009
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 444NX
UT WOS:000265988800003
ER
PT J
AU Kim, K
Baran, RH
Ko, H
AF Kim, Kihyeon
Baran, Robert H.
Ko, Hanseok
TI Extension of two-channel transfer function based generalized sidelobe
canceller for dealing with both background and point-source noise
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Adaptive signal processing; Non-stationary noise;
Transfer function ratio; Generalized sidelobe canceller
ID SPEECH ENHANCEMENT; MICROPHONE ARRAYS; ENVIRONMENTS; RECOGNITION;
REDUCTION; ESTIMATOR; AID
AB This paper describes an algorithm to suppress non-stationary noise as well as stationary noise in a speech enhancement system that employs a two-channel generalized sidelobe canceller (GSC). Our approach builds on recent advances in GSC design involving a transfer function ratio (TFR). The proposed system has four stages. The first stage estimates a new TFR along the acoustic paths from the non-stationary noise source to the microphones and the power of the stationary noise components. Second, the estimated power of the stationary noise components is used to execute spectral subtraction (SS) with respect to the input signals. Thirdly, the optimal gain is estimated for speech enhancement on the primary channel. In the final stage, an adaptive filter reduces the residual correlated noise components of the signal. These algorithmic improvements consistently give a better performance than a transfer function based GSC (TF-GSC) alone or a GSC with SS post-filtering under various noise conditions while slightly increasing the computational complexity. (c) 2009 Elsevier B.V. All rights reserved.
C1 [Kim, Kihyeon; Baran, Robert H.; Ko, Hanseok] Korea Univ, ISPL, Dept Elect & Comp Engn, Seoul, South Korea.
RP Ko, H (reprint author), Korea Univ, ISPL, Dept Elect & Comp Engn, 5-1 Anam Dong, Seoul, South Korea.
EM khkim@ispl.korea.ac.kr; rhbaran@yahoo.com; hsko@korea.ac.kr
FU MIC (Ministry of Information and Communication), Korea
FX This research was supported by the MIC (Ministry of Information and
Communication), Korea, under the ITFSIP (IT Foreign Specialist Inviting
Program) supervised by the IITA (Institute of Information Technology
Advancement).
CR Bitzer J., 1998, P EUR SIGN PROC C RH, P105
Bitzer J, 2001, SPEECH COMMUN, V34, P3, DOI 10.1016/S0167-6393(00)00042-X
BITZER J, 1999, P IEEE INT C AC SPEE, V5, P2965
Cohen I, 2001, SIGNAL PROCESS, V81, P2403, DOI 10.1016/S0165-1684(01)00128-1
Cohen I, 2004, IEEE T SIGNAL PROCES, V52, P1149, DOI 10.1109/TSP.2004.826166
Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544
EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550
FISCHER S, 1995, P 4 INT WORKSH AC EC, P44
Fischer S, 1996, SPEECH COMMUN, V20, P215, DOI 10.1016/S0167-6393(96)00054-4
FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817
Gannot S, 2001, IEEE T SIGNAL PROCES, V49, P1614, DOI 10.1109/78.934132
GANNOT S, 2002, EE PUB, V1319
Gannot S, 2004, IEEE T SPEECH AUDI P, V12, P561, DOI 10.1109/TSA.2004.834599
GRENIER Y, 1992, P ICASSP 92, P305, DOI 10.1109/ICASSP.1992.225911
GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739
HOSHUYAMA O, 2001, MICROPHONE ARRAYS SI, P87
Jeong S, 2008, ELECTRON LETT, V44, P253, DOI 10.1049/el:20083327
Kim G, 2007, ELECTRON LETT, V43, P783, DOI 10.1049/el:20070780
LeBouquinJeannes R, 1997, IEEE T SPEECH AUDI P, V5, P484
McCowan IA, 2003, IEEE T SPEECH AUDI P, V11, P709, DOI 10.1109/TSA.2003.818212
MEYER J, 1997, P ICASSP 97, V2, P1167
NORDHOLM S, 1993, IEEE T VEH TECHNOL, V42, P514, DOI 10.1109/25.260760
Shalvi O, 1996, IEEE T SIGNAL PROCES, V44, P2055, DOI 10.1109/78.533725
Spriet A, 2005, IEEE T SPEECH AUDI P, V13, P487, DOI 10.1109/TSA.2005.845821
Van Veen B. D., 1988, IEEE ASSP Magazine, V5, DOI 10.1109/53.665
VANCOMPERNOLLE D, 1995, P COST, V229, P107
VANCOMPERNOLLE D, 1990, SPEECH COMMUN, V9, P433, DOI 10.1016/0167-6393(90)90019-6
Vaseghi S. V., 2002, ADV DIGITAL SIGNAL P
NR 28
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2009
VL 51
IS 6
BP 521
EP 533
DI 10.1016/j.specom.2009.02.001
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 444NX
UT WOS:000265988800004
ER
PT J
AU Zhang, SX
Mak, MW
AF Zhang, Shi-Xiong
Mak, Man-Wai
TI A new adaptation approach to high-level speaker-model creation in
speaker verification
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speaker verification; High-level features; Model adaptation;
Maximum-a-posterior (MAP) adaptation
ID TRANSFORMATION
AB Research has shown that speaker verification based on high-level speaker features requires long enrollment utterances to guarantee a low error rate during verification. However, in practical speaker verification, it is common to model speakers based on a limited amount of enrollment data, which will make the speaker models unreliable. This paper proposes four new adaptation methods for creating high-level speaker models to alleviate this undesirable effect. Unlike conventional methods in which only the phoneme-dependent background model is adapted, the proposed adaptation methods also adapt the phoneme-independent speaker model to fully utilize all the information available in the training data. A proportional factor, which is derived from the ratio between the phoneme-dependent background model and the phoneme-independent background model, is used to adjust the phoneme-independent speaker models during adaptation. The proposed method was evaluated under the NIST 2000 and NIST 2002 SRE frameworks. Experimental results show that the proposed adaptation method can alleviate the data-sparseness problem effectively and achieves a better performance when compared with traditional MAP adaptation. (c) 2009 Elsevier B.V. All rights reserved.
C1 [Zhang, Shi-Xiong; Mak, Man-Wai] Hong Kong Polytech Univ, Ctr Multimedia Signal Proc, Elect & Informat Engn Dept, Kowloon, Hong Kong, Peoples R China.
RP Mak, MW (reprint author), Hong Kong Polytech Univ, Ctr Multimedia Signal Proc, Elect & Informat Engn Dept, Kowloon, Hong Kong, Peoples R China.
EM zhang.sx@alumni.polyu.edu.hk; enmwmak@polyu.edu.hk
CR ADAMI A, 2003, P ICASSP, V4, P788
Andrews W., 2002, P ICASSP
BAKER B, 2004, P IEEE WORKSH SPEAK, P91
BLAAUW E, 1994, SPEECH COMMUN, V14, P359, DOI 10.1016/0167-6393(94)90028-0
CAMPBELL JP, 1999, P INT C AC SPEECH SI, V2, P829
Campbell J.P., 2003, P EUR, P2665
Campbell WM, 2006, IEEE SIGNAL PROC LET, V13, P308, DOI 10.1109/LSP.2006.870086
CHAPPELL D, 1998, P ICASSP, V1, P885
CHEN KT, 2000, P ICSLP, V3, P742
Dahan D, 1996, LANG SPEECH, V39, P341
Doddington G., 2001, P EUR, P2521
FEDERICO M, 1996, ICSLP, P279
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Gillick L., 1989, P ICASSP, P532
HANSEN EG, 2004, P OD 04 SPEAK LANG R, P179
JIN Q, 2003, P ICASSP
KAJAREKAR SS, 2005, P IEEE INT C AC SPEE, V1, P173, DOI 10.1109/ICASSP.2005.1415078
KIMBALL O, 1997, EUROSPEECH 97, P967
Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881
KLUSACEK D, 2003, P ICASSP, V4, P804
KOSAKA T, 1996, J COMPUT SPEECH LANG, V10, P54
Kuehn David P., 1976, J PHONETICS, V4, P303
Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Leung KY, 2006, SPEECH COMMUN, V48, P71, DOI 10.1016/j.specom.2005.05.013
Mak B, 2005, IEEE T SPEECH AUDI P, V13, P984, DOI 10.1109/TSA.2005.851971
Mak BKW, 2006, IEEE T AUDIO SPEECH, V14, P1267, DOI 10.1109/TSA.2005.860836
Mak MW, 2007, NEUROCOMPUTING, V71, P137, DOI 10.1016/j.neucom.2007.08.003
Mak MW, 2006, INT CONF ACOUST SPEE, P929
Marithoz J., 2002, INT C SPOK LANG PROC, P581
MATSUI T, 1993, P ICASSP 93, V2, P391
NAVRATIL J, 2003, P ICASSP 2003, V4, P796
Pelecanos J., 2001, P SPEAK OD SPEAK REC, P213
PESKIN B, 2003, P ICASSP, V4, P792
Reynolds D., 2003, P ICASSP 03, VIV, P784
Reynolds D. A., 1997, P EUR, P963
REYNOLDS DA, 1997, P IEEE INT C AC SPEE, V2, P1535
Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361
Shlens J., 2005, TUTORIAL PRINCIPAL C
SHRIBERG E, 2007, SPEAKER CLASSIFICATI, V1, P241
Siohan O, 2001, IEEE T SPEECH AUDI P, V9, P417, DOI 10.1109/89.917687
SONMEZ K, 1998, ICSLP, V4, P3189
Sussman HM, 1998, PHONETICA, V55, P204, DOI 10.1159/000028433
Thyes O., 2000, P INT C SPOK LANG PR, V2, P242
WEBER F, 2002, P IEEE INT C AC SPEE, V1, P141
XIANG B, 2002, P ICASSP, V1, P681
Zhang SX, 2007, IEEE T COMPUT, V56, P1189, DOI 10.1109/TC.2007.1081
Zhang SX, 2007, LECT NOTES COMPUT SC, V4810, P325
NR 48
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2009
VL 51
IS 6
BP 534
EP 550
DI 10.1016/j.specom.2009.02.005
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 444NX
UT WOS:000265988800005
ER
PT J
AU Wolfel, M
AF Woelfel, Matthias
TI Signal adaptive spectral envelope estimation for robust speech
recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Adaptive feature extraction; Spectral estimation; Minimum variance
distortionless response; Automatic speech recognition; Bilinear
transformation; Time vs. frequency domain
ID HIDDEN MARKOV-MODELS; LINEAR PREDICTION; SCALE
AB This paper describes a novel spectral envelope estimation technique which adapts to the characteristics of the observed signal. This is possible via the introduction of a second bilinear transformation into warped minimum variance distortionless response (MVDR) spectral envelope estimation. As opposed to the first bilinear transformation, however, which is applied in the time domain, the second bilinear transformation must be applied in the frequency domain. This extension enables the resolution of the spectral envelope estimate to be steered to lower or higher frequencies, while keeping the overall resolution of the estimate and the frequency axis fixed. When embedded in the feature extraction process of an automatic speech recognition system, it provides for the emphasis of the characteristics of speech features that are relevant for robust classification, while simultaneously suppressing characteristics that are irrelevant for classification. The change in resolution may be steered, for each observation window, by the normalized first autocorrelation coefficient.
To evaluate the proposed adaptive spectral envelope technique, dubbed warped-twice MVDR, we use two objective functions: class separability and word error rate. Our test set consists of development and evaluation data as provided by NIST for the Rich Transcription 2005 Spring Meeting Recognition Evaluation. For both measures, we observed consistent improvements for several speaker-to-microphone distances. On average, over all distances, the proposed front-end reduces the word error rate by 4% relative compared to the widely used mel-frequency cepstral coefficients as well as perceptual linear prediction. (c) 2009 Elsevier B.V. All rights reserved.
C1 Univ Karlsruhe TH, Inst Theoret Informat, D-76131 Karlsruhe, Germany.
RP Wolfel, M (reprint author), Univ Karlsruhe TH, Inst Theoret Informat, Fasanengarten 5, D-76131 Karlsruhe, Germany.
EM wolfel@ira.uka.de
CR Acero A., 1990, THESIS CARNEGIE MELL
BRACCINI C, 1974, IEEE T ACOUST SPEECH, VAS22, P236, DOI 10.1109/TASSP.1974.1162582
CAPON J, 1969, P IEEE, V57, P1408, DOI 10.1109/PROC.1969.7278
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
DHARANIPRAGADA S, 2001, P ICASSP, V1, P309
Dharanipragada S, 2007, IEEE T AUDIO SPEECH, V15, P224, DOI 10.1109/TASL.2006.876776
Driaunys K., 2005, Information Technology and Control, V34
Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043
Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034
HAEBUMBACH R, 1999, P ICASSP, P397
Harma A, 2001, IEEE T SPEECH AUDI P, V9, P579, DOI 10.1109/89.928922
Haykin S., 1991, ADAPTIVE FILTER THEO
HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423
*LDC, TRANS ENGL DAT
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
MATSUMOTO H, 2001, P ICASSP, P117
MATSUMOTO M, 1998, P ICSLP, P1051
MESGARANI N, 2007, P ICASSP, P765
MURTHI M, 1997, P IEEE INT C AC SPEE, P1687
Murthi MN, 2000, IEEE T SPEECH AUDI P, V8, P221, DOI 10.1109/89.841206
MUSICUS BR, 1985, IEEE T ACOUST SPEECH, V33, P1333, DOI 10.1109/TASSP.1985.1164696
NAKATOH Y, 2004, P ICSLP
*NIST, 2005, RICH TRANSC 2005 SPR
NOCERINO N, 1985, P ICASSP, P25
Olive J. P., 1993, ACOUSTICS AM ENGLISH
OPPENHEIM A, 1971, IEEE P LETT, V59, P229
Oppenheim A. V., 1989, DISCRETE TIME SIGNAL
Smith JO, 1999, IEEE T SPEECH AUDI P, V7, P697, DOI 10.1109/89.799695
Stevens SS, 1937, J ACOUST SOC AM, V8, P185, DOI 10.1121/1.1915893
STRUBE HW, 1980, J ACOUST SOC AM, V68, P1071, DOI 10.1121/1.384992
Welling L, 2002, IEEE T SPEECH AUDI P, V10, P415, DOI 10.1109/TSA.2002.803435
WOLFEL M, 2003, P ESSV, P22
WOLFEL M, 2006, P EUSIPCO
Wolfel M, 2005, IEEE SIGNAL PROC MAG, V22, P117, DOI 10.1109/MSP.2005.1511829
WOLFEL M, 2003, P EUR, P1021
Yule GU, 1927, PHILOS T R SOC LOND, V226, P267, DOI 10.1098/rsta.1927.0007
COMPUTERS HUMAN INTE
NR 38
TC 1
Z9 1
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD JUN
PY 2009
VL 51
IS 6
BP 551
EP 561
DI 10.1016/j.specom.2009.02.006
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 444NX
UT WOS:000265988800006
ER
PT J
AU Magi, C
Pohjalainen, J
Backstrom, T
Alku, P
AF Magi, Carlo
Pohjalainen, Jouni
Backstrom, Tom
Alku, Paavo
TI Stabilised weighted linear prediction
SO SPEECH COMMUNICATION
LA English
DT Article
DE Linear prediction; All-pole modelling; Spectral estimation
ID SPEECH; RECOGNITION; EXTRACTION; SPECTRUM
AB Weighted linear prediction (WLP) is a method to compute all-pole models of speech by applying temporal weighting of the square of the residual signal. By using short-time energy (STE) as a weighting function, this algorithm was originally proposed as an improved linear predictive (LP) method based on emphasising those samples that fit the underlying speech production model well. The original formulation of WLP, however, did not guarantee stability of all-pole models. Therefore, the current work revisits the concept of WLP by introducing a modified short-time energy function always leading to stable all-pole models. This new method, stabilised weighted linear prediction (SWLP), is shown to yield all-pole models whose general performance can be adjusted by properly choosing the length of the STE window, a parameter denoted by M.
The study compares the performances of SWLP, minimum variance distortionless response (MVDR), and conventional LP in spectral modelling of speech corrupted by additive noise. The comparisons were performed by computing, for each method, the logarithmic spectral differences between the all-pole spectra extracted from clean and noisy speech in different segmental signal-to-noise ratio (SNR) categories. The results showed that the proposed SWLP algorithm was the most robust method against zero-mean Gaussian noise and the robustness was largest for SWLP with a small M-value. These findings were corroborated by a small listening test in which the majority of the listeners assessed the quality of impulse-train-excited SWLP filters, extracted from noisy speech, to be perceptually closer to original clean speech than the corresponding all-pole responses computed by MVDR. Finally, SWLP was compared to other short-time spectral estimation methods (FFT, LP, MVDR) in isolated word recognition experiments. Recognition accuracy obtained by SWLP, in comparison to other short-time spectral estimation methods, improved already at moderate segmental SNR values for sounds corrupted by zero-mean Gaussian noise. For realistic factory noise of low-pass characteristics, the SWLP method improved the recognition results at segmental SNR levels below 0 dB. (C) 2009 Published by Elsevier B.V.
C1 [Magi, Carlo; Pohjalainen, Jouni; Backstrom, Tom; Alku, Paavo] Aalto Univ, Lab Acoust & Audio Signal Proc, FI-02015 Helsinki, Finland.
RP Alku, P (reprint author), Aalto Univ, Lab Acoust & Audio Signal Proc, POB 3000, FI-02015 Helsinki, Finland.
EM jpohjala@acoustics.hut.fi; tom.backstrom@tkk.fi; paavo.alku@tkk.fi
RI Backstrom, Tom/E-2121-2011; Alku, Paavo/E-2400-2012
FU Academy of Finland [107494]
FX Supported by Academy of Finland (Project No. 107494).
CR ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R
BACKSTROM T, 2004, THESIS HELSINKI U TE
Bazaraa MS, 1993, NONLINEAR PROGRAMMIN
CHILDERS DG, 1994, IEEE T BIO-MED ENG, V41, P663, DOI 10.1109/10.301733
DELSARTE P, 1982, IEEE T INFORM THEORY, V33, P412
DEWET F, 2001, P EUR 2001 AALB DENM
Dharanipragada S, 2007, IEEE T AUDIO SPEECH, V15, P224, DOI 10.1109/TASL.2006.876776
ELJAROUDI A, 1991, IEEE T SIGNAL PROCES, V39, P411, DOI 10.1109/78.80824
Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC
GRAY AH, 1976, IEEE T ACOUST SPEECH, V24, P380, DOI 10.1109/TASSP.1976.1162849
Huiqun Deng, 2006, IEEE Transactions on Audio, Speech and Language Processing, V14, DOI 10.1109/TSA.2005.857811
Karaev MT, 2004, P AM MATH SOC, V132, P2321, DOI 10.1090/S0002-9939-04-07391-5
Kleijn W. B., 1995, SPEECH CODING SYNTHE
KRISHNAMURTHY AK, 1986, IEEE T ACOUST SPEECH, V34, P730, DOI 10.1109/TASSP.1986.1164909
LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197
MA CX, 1993, SPEECH COMMUN, V12, P69, DOI 10.1016/0167-6393(93)90019-H
MAGI C, 2006, CD P 7 NORD SIGN PRO
MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792
Markel JD, 1976, LINEAR PREDICTION SP
Murthi MN, 2000, IEEE T SPEECH AUDI P, V8, P221, DOI 10.1109/89.841206
O'Shaughnessy D, 2000, SPEECH COMMUNICATION, V2nd
Rabiner L, 1993, FUNDAMENTALS SPEECH
SAMBUR MR, 1976, IEEE T ACOUST SPEECH, V24, P488, DOI 10.1109/TASSP.1976.1162870
SHIMAMURA T, 2004, CD P 6 NORD SIGN PRO
Theodoridis S., 2003, PATTERN RECOGNITION
VARGA A, 1992, NOISEX 92 DATABASE
Wolfel M, 2005, IEEE SIGNAL PROC MAG, V22, P117, DOI 10.1109/MSP.2005.1511829
WOLFEL M, 2003, P IEEE AUT SPEECH RE, V387, P387
WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260
WU JX, 1993, IEEE T PATTERN ANAL, V15, P1174
YAPANEL U, 2003, EUROSPEECH 2003
Yegnanarayana B, 1998, IEEE T SPEECH AUDI P, V6, P313, DOI 10.1109/89.701359
ZHAO Q, 1997, COMMUN COMPUT PHYS, V2, P585
NR 33
TC 18
Z9 18
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2009
VL 51
IS 5
BP 401
EP 411
DI 10.1016/j.specom.2008.12.005
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 429TI
UT WOS:000264942600001
ER
PT J
AU Jeong, M
Lee, GG
AF Jeong, Minwoo
Lee, Gary Geunbae
TI Multi-domain spoken language understanding with transfer learning
SO SPEECH COMMUNICATION
LA English
DT Article
DE Spoken language understanding; Multi-domain dialog system; Transfer
learning; Triangular-chain structure model
ID SEMANTIC ROLES; DIALOGUE; SYSTEM
AB This paper addresses the problem of multi-domain spoken language understanding (SLU) where domain detection and domain-dependent semantic tagging problems are combined. We present a transfer learning approach to the multi-domain SLU problem in which multiple domain-specific data sources can be incorporated. To implement multi-domain SLU with transfer learning, we introduce a triangular-chain structured model. This model effectively learns multiple domains in parallel, and allows use of domain-independent patterns among domains to create a better model for the target domain. We demonstrate that the proposed method outperforms baseline models on dialog data for multi-domain SLU problems. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Jeong, Minwoo; Lee, Gary Geunbae] Pohang Univ Sci & Technol, Dept Comp Sci & Engn, Pohang 790784, South Korea.
RP Jeong, M (reprint author), Pohang Univ Sci & Technol, Dept Comp Sci & Engn, San 31, Pohang 790784, South Korea.
EM stardust@postech.ac.kr; gblee@postech.ac.kr
FU Korea Science and Engineering Foundation (KOSEF), Korea Government (MEST)
[R01-2008-000-20651-0]
FX We thank the anonymous reviewers for their valuable comments. We would
also like to thank Donghyun Lee for his preparation of speech
recognition results, and Derek Lactin for his proof-reading of the
paper. This work was supported by the Korea Science and Engineering
Foundation (KOSEF) grant funded by the Korea Government (MEST) (No.
R01-2008-000-20651-0).
CR Ammicht E, 1999, P EUR C SPEECH COMM, P1375
Caruana R, 1997, MACH LEARN, V28, P41, DOI 10.1023/A:1007379606734
Chung G., 1999, P EUR C SPEECH COMM, P2655
COHN T, 2006, P EUR C MACH LEARN E, P606
Collins M, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P1
Crammer K, 2006, J MACH LEARN RES, V7, P551
Daume H, 2006, J ARTIF INTELL RES, V26, P101
Daume H., 2007, P 45 ANN M ASS COMP, P256
Daume III H., 2005, P 22 INT C MACH LEAR, P169, DOI 10.1145/1102351.1102373
De Mori R., 2008, SIGNAL PROCESS MAGAZ, V25, P50
Dredze M., 2008, P 25 INT C MACH LEAR, P264, DOI 10.1145/1390156.1390190
Gildea D, 2002, COMPUT LINGUIST, V28, P245, DOI 10.1162/089120102760275983
Gillick L., 1989, P ICASSP, P532
Gupta N, 2006, IEEE T AUDIO SPEECH, V14, P213, DOI 10.1109/TSA.2005.854085
Hardy H, 2006, SPEECH COMMUN, V48, P354, DOI 10.1016/j.specom.2005.07.006
Jeong M, 2008, IEEE T AUDIO SPEECH, V16, P1287, DOI 10.1109/TASL.2008.925143
Jeong M, 2006, P JOINT INT C COMP L, P412, DOI 10.3115/1273073.1273127
Komatani K, 2008, SPEECH COMMUN, V50, P863, DOI 10.1016/j.specom.2008.05.010
Lafferty John D., 2001, ICML, P282
LEE CJ, 2006, P IEEE SPOK LANG TEC, P194
Moschitti A, 2007, P IEEE WORKSH AUT SP
Nocedal J., 1999, NUMERICAL OPTIMIZATI
Palmer M, 2005, COMPUT LINGUIST, V31, P71, DOI 10.1162/0891201053630264
Peckham J., 1991, P WORKSH SPEECH NAT, P14, DOI 10.3115/112405.112408
Price P., 1990, P DARPA SPEECH NAT L, P91, DOI 10.3115/116580.116612
Ramshaw L. A., 1995, P 3 WORKSH VER LARG, P82
RAYMOND C, 2007, P INT ANTW BEG
Sha F., 2003, P C N AM CHAPT ASS C, P134
Sutton C., 2007, P 24 INT C MACH LEAR, P863, DOI 10.1145/1273496.1273605
Taskar B., 2003, P ADV NEUR INF PROC
Tsochantaridis I, 2005, J MACH LEARN RES, V6, P1453
TUR G, 2006, P IEEE INT C AC SPEE
Walker M. A., 2002, P INT C SPOK LANG PR, P269
Wang YY, 2005, IEEE SIGNAL PROC MAG, V22, P16
NR 34
TC 2
Z9 2
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2009
VL 51
IS 5
BP 412
EP 424
DI 10.1016/j.specom.2009.01.001
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 429TI
UT WOS:000264942600002
ER
PT J
AU Maier, A
Haderlein, T
Eysholdt, U
Rosanowski, F
Batliner, A
Schuster, M
Noth, E
AF Maier, A.
Haderlein, T.
Eysholdt, U.
Rosanowski, F.
Batliner, A.
Schuster, M.
Noeth, E.
TI PEAKS - A system for the automatic evaluation of voice and speech
disorders
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility; Speech and voice disorders; Automatic evaluation
of speech and voice pathologies
ID FOREARM FREE-FLAP; OROPHARYNGEAL CANCER; PARTIAL GLOSSECTOMY; PHARYNGEAL
FLAP; CLEFT-PALATE; ORAL CAVITY; RECONSTRUCTION; INTELLIGIBILITY;
REHABILITATION; ESOPHAGEAL
AB We present a novel system for the automatic evaluation of speech and voice disorders. The system can be accessed platform-independently via the internet. The patient reads a text or names pictures. His or her speech is then analyzed by automatic speech recognition and prosodic analysis. For patients who had their larynx removed due to cancer and for children with cleft lip and palate, we show that we can achieve significant correlations between the automatic analysis and the judgment of human experts in a leave-one-out experiment (p < .001). Correlations of .90 for the evaluation of the laryngectomees and .87 for the evaluation of the children's data were obtained. This is comparable to human inter-rater correlations. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Maier, A.; Haderlein, T.; Eysholdt, U.; Rosanowski, F.; Schuster, M.] Univ Erlangen Nurnberg, Abt Phoniatrie & Padaudiol, D-91054 Erlangen, Germany.
[Maier, A.; Haderlein, T.; Batliner, A.; Noeth, E.] Univ Erlangen Nurnberg, Lehrstuhl Mustererkennung, D-91508 Erlangen, Germany.
RP Maier, A (reprint author), Univ Erlangen Nurnberg, Abt Phoniatrie & Padaudiol, Bohlenpl 21, D-91054 Erlangen, Germany.
EM Andreas.Maier@informatik.uni-erlangen.de
FU Deutsche Krebshilfe [106266]; Deutsche Forschungsgemeinschaft
[SCHU2320/1-1]
FX This work was funded by the German Cancer Aid (Deutsche Krebshilfe)
under Grant 106266 and the German Research Foundation (Deutsche
Forschungsgemeinschaft) under Grant SCHU2320/1-1. The responsibility for
the content of this paper lies with the authors. The authors would like
to thank both anonymous reviewers of this document for their beneficial
suggestions and comments.
CR ADELHARDT J, 2003, LECT NOTES COMPUTER, P591
Bagshaw P., 1993, P EUR C SPEECH COMM, P1003
Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1
BATLINER A, 1995, NATO ASI SERIES F, P325
BATLINER A, 2001, P EUR C SPEECH COMM, V4, P2781
BATLINER A, 2000, VERBMOBIL FDN SPEECH, P106
BATLINER A, 2003, P EUR C SPEECH COMM, V1, P733
Batliner A., 1999, P 14 INT C PHON SCI, V3, P2315
Bellandese MH, 2001, J SPEECH LANG HEAR R, V44, P1315, DOI 10.1044/1092-4388(2001/102)
BODIN IKH, 1994, CLIN OTOLARYNGOL, V19, P28, DOI 10.1111/j.1365-2273.1994.tb01143.x
Bressmann T, 2004, J ORAL MAXIL SURG, V62, P298, DOI 10.1016/j.joms.2003.04.017
Brown DH, 2003, WORLD J SURG, V27, P824, DOI 10.1007/s00268-003-7107-4
Brown JS, 1997, HEAD NECK-J SCI SPEC, V19, P524, DOI 10.1002/(SICI)1097-0347(199709)19:6<524::AID-HED10>3.0.CO;2-5
Cohen J., 1983, APPL MULTIPLE REGRES, V2nd
Courrieu P., 2005, NEURAL INFORM PROCES, V8, P25
ENDERBY PM, 2004, FRENCHAY DYSARTHRIE
Flannery B. P., 1992, NUMERICAL RECIPES C
FOX AV, 2002, PLAKSS PSYCHOLINGUIS
Furia CLB, 2001, ARCH OTOLARYNGOL, V127, P877
Gales M. J. F., 1996, P ICSLP 96 PHIL US, V3, P1832, DOI 10.1109/ICSLP.1996.607987
GALLWITZ F, 2002, STUDIEN MUSTERERKENN, V6
Hacker C, 2006, LECT NOTES ARTIF INT, V4188, P581
Hall M.A., 1998, THESIS U WAIKATO HAM
Harding A, 1998, INT J LANG COMM DIS, V33, P329
Haughey BH, 2002, ARCH OTOLARYNGOL, V128, P1388
Henningsson G, 2008, CLEFT PALATE-CRAN J, V45, P1, DOI 10.1597/06-086.1
HUBER R, 2002, STUDIEN MUSTERERKENN, V8
Keuning KHD, 1999, CLEFT PALATE-CRAN J, V36, P328, DOI 10.1597/1545-1569(1999)036<0328:TIROTP>2.3.CO;2
KIESSLING A, 1997, EXTRAKTION KLASSIFKA
Knuuttila H, 1999, ACTA OTO-LARYNGOL, V119, P621
Kuttner C, 2003, HNO, V51, P151, DOI 10.1007/s00106-002-0708-7
Mady K, 2003, CLIN LINGUIST PHONET, V17, P411, DOI 10.1080/0269920031000079921
MAHANNA GK, 1998, PROSTHET DENT, V79, P310
Maier A, 2008, INTERSPEECH 2008: 9TH ANNUAL CONFERENCE OF THE INTERNATIONAL SPEECH COMMUNICATION ASSOCIATION 2008, VOLS 1-5, P1757
Maier A, 2006, LECT NOTES ARTIF INT, V4188, P431
Markkanen-Leppanen M, 2006, ORAL ONCOL, V42, P646, DOI 10.1016/j.oraloncology.2005.11.004
Millard T, 2001, CLEFT PALATE-CRAN J, V38, P68, DOI 10.1597/1545-1569(2001)038<0068:DCCFAA>2.0.CO;2
Moore E. H, 1920, B AM MATH SOC, V26, P394
Paal Sonja, 2005, J Orofac Orthop, V66, P270, DOI 10.1007/s00056-005-0427-2
Panchal J, 1996, BRIT J PLAST SURG, V49, P363, DOI 10.1016/S0007-1226(96)90004-1
Pauloski BR, 1998, OTOLARYNG HEAD NECK, V118, P616, DOI 10.1177/019459989811800509
Pauloski BR, 1998, LARYNGOSCOPE, V108, P908, DOI 10.1097/00005537-199806000-00022
Penrose R., 1955, P CAMBRIDGE PHILOS S, P406, DOI DOI 10.1017/S0305004100030401
Riedhammer K., 2007, P AUT SPEECH REC UND, P717
ROBBINS J, 1984, J SPEECH HEAR DISORD, V49, P202
ROBBINS KT, 1987, ARCH OTOLARYNGOL, V113, P1214
Rosanowski Frank, 2002, Facial Plast Surg, V18, P197, DOI 10.1055/s-2002-33066
Ruben RJ, 2000, LARYNGOSCOPE, V110, P241, DOI 10.1097/00005537-200002010-00010
Scholkopf B., 1997, THESIS TU BERLIN
SCHONWEILER R, 1994, HNO, V42, P691
Schonweiler R, 1999, INT J PEDIATR OTORHI, V50, P205, DOI 10.1016/S0165-5876(99)00243-8
SCHUKATTALAMAZZ.E, 1993, P EUR C SPEECH COMM, V1, P129
Schutte HK, 2002, FOLIA PHONIATR LOGO, V54, P8, DOI 10.1159/000048592
Seikaly H, 2003, LARYNGOSCOPE, V113, P897, DOI 10.1097/00005537-200305000-00023
Smola A., 1998, NC2TR1998030 ROYAL H
Stemmer G., 2003, P EUR C SPEECH COMM, P1313
Stemmer G., 2005, STUDIEN MUSTERERKENN, V19
Su WF, 2003, OTOLARYNG HEAD NECK, V128, P412, DOI 10.1067/mhn.2003.38
Terai H, 2004, BRIT J ORAL MAX SURG, V42, P190, DOI 10.1016/j.bjoms.2004.02.007
V. Clark SAS Institute, 2004, SAS STAT 9 1 USERS G
Wahlster W., 2000, VERBMOBIL FDN SPEECH
Wantia Nina, 2002, Facial Plast Surg, V18, P147, DOI 10.1055/s-2002-33061
Witten I.H., 2005, DATA MINING PRACTICA
ZECEVIC A, 2002, THESIS U MANNHEIM GE
NR 64
TC 34
Z9 34
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2009
VL 51
IS 5
BP 425
EP 437
DI 10.1016/j.specom.2009.01.004
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 429TI
UT WOS:000264942600003
ER
PT J
AU Jancovic, P
Kokuer, M
AF Jancovic, Peter
Koekueer, Muenevver
TI Incorporating the voicing information into HMM-based automatic speech
recognition in noisy environments
SO SPEECH COMMUNICATION
LA English
DT Article
DE Source-filter model; Voicing estimation; HMM; Automatic speech
recognition; Voicing modelling; Noise robustness; Missing-feature;
Aurora 2; Phoneme recognition
AB In this paper, we propose a model for the incorporation of voicing information into a speech recognition system in noisy environments. The employed voicing information is estimated by a novel method that can provide this information for each filter-bank channel and does not require information about the fundamental frequency. The voicing information is modelled by employing the Bernoulli distribution. The voicing model is obtained for each HMM state and mixture by a Viterbi-style training procedure. The proposed voicing incorporation is evaluated both within a standard model and two other models that had compensated for the noise effect, the missing-feature and the multi-conditional training model. Experiments are first performed on noisy speech data from the Aurora 2 database. Significant performance improvements are achieved when the voicing information is incorporated within the standard model as well as the noise-compensated models. The employment of voicing information is also demonstrated on a phoneme recognition task on the noise-corrupted TIMIT database and considerable improvements are observed. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Jancovic, Peter; Koekueer, Muenevver] Univ Birmingham, Sch Elect Engn & Comp Engn, Birmingham B15 2TT, W Midlands, England.
RP Jancovic, P (reprint author), Univ Birmingham, Sch Elect Engn & Comp Engn, Pritchatts Rd, Birmingham B15 2TT, W Midlands, England.
EM p.jancovic@bham.ac.uk; m.kokuer@bham.ac.uk
FU UK EPSRC [EP/D033659/1, EP/F036132/1]
FX This work was supported by UK EPSRC Grants EP/D033659/1 and
EP/F036132/1.
CR BEAUFAYS F, 2003, USING SPEECH NONSPEE, P424
Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0
DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420
Fant G., 1960, ACOUSTIC THEORY SPEE
Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC
GRACIARENA M, 2004, P ICASSP MONTR, V1, P921
HIRSCH HG, 2000, AURORA EXPT FRAMEWOR
HUANG HCH, 2000, PITCH TRACKING TONE, P1523
Ishizuka K, 2006, J ACOUST SOC AM, V120, P443, DOI 10.1121/1.2205131
Russell MJ, 2005, COMPUT SPEECH LANG, V19, P205, DOI 10.1016/j.csl.2004.08.001
JACKSON PJB, 2003, COVARIATION WEIGHTIN, P2321
Jancovic P, 2007, IEEE SIGNAL PROC LET, V14, P66, DOI 10.1109/LSP.2006.881517
JANCOVIC P, 2002, COMBINING UNION MODE, P69
JANCOVIC P, 2007, IEEE WORKSH AUT SPEE, P42
KITAOKA N, 2002, SPEAKER INDEPENDENT, P2125
LARSON M, 2001, P WORKSH LANG MOD IN
LJOLJE A, 2002, SPEECH RECOGNITION U, P2137
Nadeu C, 2001, SPEECH COMMUN, V34, P93, DOI 10.1016/S0167-6393(00)00048-0
Niyogi P, 2003, SPEECH COMMUN, V41, P349, DOI 10.1016/S0167-6393(02)00151-6
OSHAUGHNESSY D, 1999, ROBUST FAST CONTINUO, P413
RABINER LR, 1976, IEEE T ACOUST SPEECH, V24, P170, DOI 10.1109/TASSP.1976.1162794
Thomson DL, 2002, SPEECH COMMUN, V37, P197, DOI 10.1016/S0167-6393(01)00011-5
Young S., 1999, HTK BOOK V2 2
Zissman MA, 2001, SPEECH COMMUN, V35, P115, DOI 10.1016/S0167-6393(00)00099-6
ZOLNAY A, 2003, EUROSPEECH GENEVA SW, P497
NR 25
TC 5
Z9 5
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2009
VL 51
IS 5
BP 438
EP 451
DI 10.1016/j.specom.2009.01.003
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 429TI
UT WOS:000264942600004
ER
PT J
AU Diaz, FC
van Santen, J
Banga, ER
AF Campillo Diaz, Francisco
van Santen, Jan
Rodriguez Banga, Eduardo
TI Integrating phrasing and intonation modelling using syntactic and
morphosyntactic information
SO SPEECH COMMUNICATION
LA English
DT Article
DE Intonation modelling; Unit selection; Corpus-based; Syntax; POS;
Phrasing
ID SPEECH
AB This paper focuses on the relationship between intonation and syntactic and morphosyntactic information. Although intonation and syntax are both related to dependencies between the different parts of a sentence, and are therefore related to meaning, the precise influence of grammar on intonation is not clear. We describe a novel method that uses syntactic and part-of-speech features in the framework of corpus-based intonation modelling, and which integrates part of the phrasing algorithm in the unit selection stage. Subjective tests confirm an improvement in the quality of the resulting synthetic intonation: 75% of the sentences synthesised with the new intonation model were considered to be better or much better than the sentences synthesised using the old model, while only 7.5% of sentences were rated as worse or much worse. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Campillo Diaz, Francisco; Rodriguez Banga, Eduardo] Univ Vigo, Dpto Teoria Serial & Comunicac, ETSI Telecomunicac, Vigo 36200, Pontevedra, Spain.
[van Santen, Jan] Oregon Hlth & Sci Univ, Ctr Spoken Language Understanding, OGI Sch Sci & Engn, Beaverton, OR 97006 USA.
RP Diaz, FC (reprint author), Univ Vigo, Dpto Teoria Serial & Comunicac, ETSI Telecomunicac, Campus Univ, Vigo 36200, Pontevedra, Spain.
EM campillo@gts.tsc.uvigo; vansanten@ogi.cslu.edu; erbanga@gts.tsc.uvigo.es
RI Rodriguez Banga, Eduardo/C-4296-2011
FU NSF [0205731]; MEC [TEC200613694-C03-03]
FX The work reported here was carried out while the first author was a
visiting post-doctoral researcher at the Center for Spoken Language
Understanding, with funding from the Xunta de Galicia "Isidro Parga
Pondal" research programme and PGIDIT05TIC32202-PR programme, and with
support from NSF Grant 0205731, "ITR: Prosody Generation for Child
Oriented Speech Synthesis" (PI Jan van Santen), and MEC under the
Project TEC200613694-C03-03. Thanks also to Paul Hosom and Raychel
Moldover, for their comments and suggestions on the paper.
CR ABNEY S, 1992, P SPEECH NAT LANG WO, P425, DOI 10.3115/1075527.1075629
BLACK A, 1995, P EUR MADR SPAIN, V1, P581
Black A. W., 1999, FESTIVAL SPEECH SYNT
Campillo F, 2008, ELECTRON LETT, V44, P501, DOI 10.1049/el:20083276
CAMPILLO F, 2006, JORNADAS TECNOLOGIAS, P167
CAMPILLO F, 2006, P ICSLP PITTSB, P2362
Campillo F. D., 2006, SPEECH COMMUN, V48, P941, DOI 10.1016/j.specom.2005.12.004
DIMPERIO M, 2003, PROSODIES
ESCUDERO D, 2002, THESIS U VALLADOLID
GARRIDO JM, 1996, THESIS U BARCELONA E
Grosz B. J., 1986, Computational Linguistics, V12
HERNAEZ I, 2001, P 4 ISCA TUT RES WOR, P151
Hirschberg J, 1996, SPEECH COMMUN, V18, P281, DOI 10.1016/0167-6393(96)00017-9
Hunt A., 1996, P INT C AC SPEECH SI, V1, P373
Koehn P., 2000, P IEEE INT C AC SPEE, V3, P1289
Ladd D. R., 1986, PHONOLOGY YB, V3, P311, DOI 10.1017/S0952675700000671
Ladd R., 1996, INTONATIONAL PHONOLO
Ladd R.D., 1988, J ACOUST SOC AM, V84, P530
MENDEZ F, 2003, PROCESAMIENTO LENGUA, V31, P159
MOEBIUS B, 1999, COMPUT SPEECH LANG, V13, P319
Navarro T, 1977, MANUAL PRONUNCIACION
NAVAS E, 2003, THESIS U PAIS VASCO
Ostendorf M., 1994, Computational Linguistics, V20
PIERREHUMBERT J, 1990, SYS DEV FDN, P271
Pierrehumbert J. B., 1986, PHONOLOGY YB, V3, P15
PREVOST S, 1993, P 6 C EUR CHAPT ASS, P332, DOI 10.3115/976744.976783
RAUX A, 2003, ASRU
STEEDMAN M, 1990, M ASS COMP LING, P9
TAYLOR P, 2000, CONCEPT TO SPEECH SY
Taylor P, 1998, COMPUT SPEECH LANG, V12, P99, DOI 10.1006/csla.1998.0041
van Santen J., 1999, INTONATION ANAL MODE, P269
NR 31
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2009
VL 51
IS 5
BP 452
EP 465
DI 10.1016/j.specom.2009.01.007
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 429TI
UT WOS:000264942600005
ER
PT J
AU Lee, C
Jung, S
Kim, S
Lee, GG
AF Lee, Cheongjae
Jung, Sangkeun
Kim, Seokhwan
Lee, Gary Geunbae
TI Example-based dialog modeling for practical multi-domain dialog system
SO SPEECH COMMUNICATION
LA English
DT Article
DE Example-based dialog modeling; Generic dialog modeling; Multi-domain
dialog system; Domain identification
ID STRATEGIES; SIMULATION
AB This paper proposes a generic dialog modeling framework for a multi-domain dialog system to simultaneously manage goal-oriented and chat dialogs for both information access and entertainment. We developed a dialog modeling technique using an example-based approach to implement multiple applications such as car navigation, weather information, TV program guidance, and a chatbot. Example-based dialog modeling (EBDM) is a simple and effective method for prototyping and deploying various dialog systems. This paper also introduces the system architecture of multi-domain dialog systems using the EBDM framework and the domain spotting technique. In our experiments, we evaluate our system using both simulated and real users. We expect that our approach can support flexible management of multi-domain dialogs on the same framework. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Lee, Cheongjae; Jung, Sangkeun; Kim, Seokhwan; Lee, Gary Geunbae] Pohang Univ Sci & Technol, Dept Comp Sci & Engn, Pohang 790784, South Korea.
RP Lee, C (reprint author), Pohang Univ Sci & Technol, Dept Comp Sci & Engn, San 31, Pohang 790784, South Korea.
EM lcj80@postech.ac.kr
FU Ministry of Knowledge Economy (MKE) [RTI04-02-06]
FX This work was supported by Grant No. RTI04-02-06 from the Regional
Technology Innovation Program and by the Intelligent Robotics
Development Program, one of the 21st Century Frontier R&D Programs
funded by the Ministry of Knowledge Economy (MKE).
CR Allen J., 2000, NAT LANG ENG, V6, P1
Berger AL, 1996, COMPUT LINGUIST, V22, P39
BOHUS B, 2003, P EUR C SPEECH COMM, P597
BUI TH, 2004, P TSD 2004 BRNO CZEC, P579
CHELBA C, 2003, P IEEE INT C AC SPEE, P69
Dietterich TG, 1998, NEURAL COMPUT, V10, P1895, DOI 10.1162/089976698300017197
EUN J, 2005, P EUROSPEECH, P3441
Georgila K., 2005, P 9 EUR C SPEECH COM, P893
Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X
HENDERSON J, 2005, P WORKSH KNOWL REAS
Hurtado L. F., 2005, P IEEE WORKSH AUT SP, P226
Inui M., 2001, P IEEE INT C SYST MA, P193
JENKINS MC, 2007, P INT C HUM COMP INT, P76
JUNG S, 2008, P WORKSH SPEECH PROC, P9
KOMATANI K, 2006, P 7 SIGDIAL WORKSH D, P9, DOI 10.3115/1654595.1654598
LAMEL L, 1999, P IEEE INT C AC SPEE, P501
Larsson S., 2000, NAT LANG ENG, V6, P323, DOI [DOI 10.1017/S1351324900002539, 10.1017 S1351324900002539]
LARSSON S, 2002, DEMO ABSTRACT ASS CO, P104
LEE CJ, 2006, P IEEE INT C AC SPEE, P69
Lemon O., 2002, P 3 SIGDIAL WORKSH D, P113
Lemon O., 2006, P 11 C EUR CHAPT ASS, P119, DOI 10.3115/1608974.1608986
LESH N, 2001, P 9 INT C US MOD, P63
Levin E, 2000, IEEE T SPEECH AUDI P, V8, P11, DOI 10.1109/89.817450
Lopez-Cozar R, 2003, SPEECH COMMUN, V40, P387, DOI [10.1016/S0167-6393(02)00126-7, 10.1016/S0167-6393902)00126-7]
McTear M., 1998, P INT C SPOK LANG PR, V4, P1223
Minker W, 2004, SPEECH COMMUN, V43, P89, DOI 10.1016/j.specom.2004.01.005
Murao H., 2003, P SIGDIAL WORKSH DIS, P140
Nagao M., 1984, P INT NATO S ART HUM, P173
O'Neill I, 2005, SCI COMPUT PROGRAM, V54, P99, DOI 10.1016/j.scico.2004.05.006
PAEK T, 2006, P WORKSH DIAL INT C
PAKUCS B, 2003, P 8 EUR C SPEECH COM, P741
Papineni K., 2002, P 40 ANN M ASS COMP, P311
Peckham J., 1993, P 3 EUR C SPEECH COM, P33
Polifroni J., 2001, P EUR C SPEECH COMM, P1371
RICH C, 1998, J USER MODEL USER AD, V8, P315
SALTON G, 1973, J DOC, V29, P351, DOI 10.1108/eb026562
Schatzmann J, 2005, P 6 SIGDIAL WORKSH D, P45
SHIN J, 2002, P INT C SPOK LANG PR, P2069
THOMSON B, 2008, P IEEE INT C AC SPEE, P4937
Walker Marilyn A, 1997, P 35 ANN M ASS COMP, P271
Watanabe T, 1998, IEICE T INF SYST, VE81D, P1025
WENG F, 2006, P INT C SPOK LANG PR, P1061
WILLIAMS JD, 2005, P IEEE WORKSH AUT SP, P250
Williams JD, 2007, COMPUT SPEECH LANG, V21, P393, DOI 10.1016/j.csl.2006.06.008
YOUNG S, 2007, P IEEE INT C AC SPEE, P149
Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460
NR 46
TC 14
Z9 14
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAY
PY 2009
VL 51
IS 5
BP 466
EP 484
DI 10.1016/j.specom.2009.01.008
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 429TI
UT WOS:000264942600006
ER
PT J
AU Ozimek, E
Kutzner, D
Sek, A
Wicher, A
AF Ozimek, Edward
Kutzner, Dariusz
Sek, Aleksander
Wicher, Andrzej
TI Development and evaluation of Polish digit triplet test for auditory
screening
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech intelligibility; Intelligibility function; Digit triplet;
Speech-reception-threshold; Auditory screening
ID SPEECH RECEPTION THRESHOLDS; SENTENCE MATERIALS; NOISE; HEARING;
INTELLIGIBILITY; RECOGNITION
AB The objective of this study was to develop and evaluate the Polish digit triplet test for speech intelligibility screening. The first part of the paper deals with the preparation of the speech material, the recording procedure and a listening experiment. In this part, triplet-specific intelligibility functions for 160 different digit complexes were determined and 100 'optimal' triplets were selected. Subsequently, four statistically balanced lists, each containing 25 different digit triplets, were developed. The speech material was phonemically equalized across the lists. The mean SRT and mean list-specific slope S-50 for the Polish test are -9.4 dB and 19.4%/dB, respectively, and are very similar to the data characterizing the German digit triplet test. The second part describes the results of the verification experiments in which reliability of the developed test was analyzed. The retest measurements were carried out by means of the standard constant stimuli paradigm and the adaptive procedure. It was found that the mean SRTs obtained in the retest study were within the limits of standard deviation and in agreement with those obtained in the basic experiment. (C) 2008 Elsevier B.V. All rights reserved.
C1 [Ozimek, Edward; Kutzner, Dariusz; Sek, Aleksander; Wicher, Andrzej] Adam Mickiewicz Univ, Inst Acoust, PL-61614 Poznan, Poland.
RP Ozimek, E (reprint author), Adam Mickiewicz Univ, Inst Acoust, Ul Umultowska 85, PL-61614 Poznan, Poland.
EM ozimaku@amu.edu.pl
FU European Union FP6 [004171]; State Ministry of Education and Science
FX This work was supported by the grant from the European Union FP6,
Project 004171 HearCom and the State Ministry of Education and Science.
CR Bellis TJ, 1996, ASSESSMENT MANAGEMEN
BRACHMANSKI S., 1999, SPEECH LANGUAGE TECH, V3, P71
BROADBENT DE, 1954, J EXP PSYCHOL, V47, P191, DOI 10.1037/h0054182
ELBERLING C, 1989, SCAND AUDIOL, V18, P175
Fletcher H., 1929, SPEECH HEARING
GROCHOLEWSKI S, 2001, STATYSTYCZNE PODSTAW
Hall S, 2006, THESIS U SOUTHAMPTON
Kaernbach C, 2001, PERCEPT PSYCHOPHYS, V63, P1389, DOI 10.3758/BF03194550
KALIKOW DN, 1977, J ACOUST SOC AM, V61, P1337, DOI 10.1121/1.381436
Klein SA, 2001, PERCEPT PSYCHOPHYS, V63, P1421, DOI 10.3758/BF03194552
Kollmeier B., 1997, J ACOUST SOC AM, V102, P1085
Kollmeier B., 1990, MESSMETODIK MODELLIE
MILLER GA, 1951, J EXP PSYCHOL, V41, P329, DOI 10.1037/h0062491
NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469
Ozimek E., 2006, ARCH ACOUST, V31, P431
PLOMP R, 1979, AUDIOLOGY, V18, P43
Pruszewicz A, 1994, Otolaryngol Pol, V48, P50
Pruszewicz A, 1994, Otolaryngol Pol, V48, P56
Ramkissoon Ishara, 2002, Am J Audiol, V11, P23, DOI 10.1044/1059-0889(2002/005)
RUDMIN F, 1987, Journal of Auditory Research, V27, P15
SCHMIDTNIELSEN A, 1989, J ACOUST SOC AM, V86, pS76, DOI 10.1121/1.2027645
Smits C, 2006, EAR HEARING, V27, P538, DOI 10.1097/01.aud.0000233917.72551.cf
Smits C, 2004, INT J AUDIOL, V43, P15, DOI 10.1080/14992020400050004
Smits C, 2007, INT J AUDIOL, V46, P134, DOI 10.1080/14992020601102170
STROUSE A, 2000, J REHABIL RES DEV, V37, P599
Versfeld NJ, 2000, J ACOUST SOC AM, V107, P1671, DOI 10.1121/1.428451
Wagener K, 1999, Z AUDIOL, V38, P44
Wagener K, 2003, INT J AUDIOL, V42, P10, DOI 10.3109/14992020309056080
Wagener K, 1999, Z AUDIOL, V38, P4
Wagener K., 1999, Z AUDIOL, V38, P86
WAGENER K, 2005, ZIFFER TRIPEL TEST S
Wilson Richard H., 2004, Seminars in Hearing, V25, P93
Wilson RH, 2005, J REHABIL RES DEV, V42, P499, DOI 10.1682/JRRD.2004.10.0134
NR 33
TC 10
Z9 10
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2009
VL 51
IS 4
BP 307
EP 316
DI 10.1016/j.specom.2008.09.007
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 421KB
UT WOS:000264356700001
ER
PT J
AU Keshet, J
Grangier, D
Bengio, S
AF Keshet, Joseph
Grangier, David
Bengio, Samy
TI Discriminative keyword spotting
SO SPEECH COMMUNICATION
LA English
DT Article
DE Keyword spotting; Spoken term detection; Speech recognition; Large
margin and kernel methods; Support vector machines; Discriminative
models
AB This paper proposes a new approach for keyword spotting, which is based on large margin and kernel methods rather than on HMMs. Unlike previous approaches, the proposed method employs a discriminative learning procedure, in which the learning phase aims at achieving a high area under the ROC curve, as this quantity is the most common measure to evaluate keyword spotters. The keyword spotter we devise is based on mapping the input acoustic representation of the speech utterance along with the target keyword into a vector-space. Building on techniques used for large margin and kernel methods for predicting whole sequences, our keyword spotter distills to a classifier in this vector-space, which separates speech utterances in which the keyword is uttered from speech utterances in which the keyword is not uttered. We describe a simple iterative algorithm for training the keyword spotter and discuss its formal properties, showing theoretically that it attains high area under the ROC curve. Experiments on read speech with the TIMIT corpus show that the resulting discriminative system outperforms the conventional context-independent HMM-based system. Further experiments using the TIMIT trained model, but tested on both read (HTIMIT, WSJ) and spontaneous speech (OGI Stories), show that without further training or adaptation to the new corpus our discriminative system outperforms the conventional context-independent HMM-based system. (C) 2008 Elsevier B.V. All rights reserved.
C1 [Keshet, Joseph] IDIAP Res Inst, CH-1920 Martigny, Switzerland.
[Grangier, David] NEC Labs Amer, Princeton, NJ 08540 USA.
[Bengio, Samy] Google Inc, Mountain View, CA 94043 USA.
RP Keshet, J (reprint author), IDIAP Res Inst, Rue Marconi 19, CH-1920 Martigny, Switzerland.
EM jkeshet@idiap.ch; dgrangier@nec-labs.com; bengio@google.com
CR Bahl L., 1986, P INT C AC SPEECH SI, V11, P49, DOI DOI 10.1109/ICASSP.1986.1169179>
Benayed Y., 2004, P INT C AUD SPEECH S, P588
BENGIO S, 2005, P 22 INT C MACH LEAR
BOURLARD H, 1994, P IEEE INT C AC SPEE, P373
Cardillo P. S., 2002, International Journal of Speech Technology, V5, DOI 10.1023/A:1013670312989
Cesa-Bianchi N, 2004, IEEE T INFORM THEORY, V50, P2050, DOI 10.1109/TIT.2004.833339
COLLOBERT R, 2002, 46 IDIAPRR
CORTES C, 1995, MACH LEARN, V20, P273, DOI 10.1023/A:1022627411411
Cortes C., 2004, ADV NEURAL INFORM PR, V17, P2004
CRAMMER K, 2006, J MACHINE LEARN RES, V7
Cristianini N., 2000, INTRO SUPPORT VECTOR
Dekel O, 2004, WORKSH MULT INT REL, P146
Fu Q., 2007, IEEE WORKSH AUT SPEE, P278
Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257
Junkawitsch J., 1997, P EUR C SPEECH COMM, P259
KESHET J, 2006, P INT
Keshet J, 2007, IEEE T AUDIO SPEECH, V15, P2373, DOI 10.1109/TASL.2007.903928
KESHET J, 2001, P 7 EUR C SPEECH COM, P1637
KETABDAR H, 2006, P INT
MANOS AS, 1997, P ICASSP 97, P899
Paul D. B., 1992, P INT C SPOK LANG PR
Platt J., 1998, ADV KERNEL METHODS S
Rabiner L, 1993, FUNDAMENTALS SPEECH
RAHIM M, 1997, IEEE T SPEECH AUDIO, P266
REYNOLDS DA, 1997, P INT C AC SPEECH SI, P1535
Rohlicek J., 1989, P IEEE INT C AC SPEE, P627
Rohlicek J.R., 1993, P 1993 IEEE INT C AC, pII459
Rosevear R. D., 1990, Power Technology International
Salomon J., 2002, P 7 INT C SPOK LANG, P2645
SHALEVSHWARTZ S, 2004, P 5 INT C MUS INF RE
Silaghi MC, 1999, P IEEE AUT SPEECH RE, P213
Szoke I., 2005, P JOINT WORKSH MULT
TASKAR B, 2003, ADV NEURAL INFORM PR, V17
Vapnik V, 1998, STAT LEARNING THEORY
Weintraub M., 1995, P INT C AUD SPEECH S, P129
NR 35
TC 28
Z9 28
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2009
VL 51
IS 4
BP 317
EP 329
DI 10.1016/j.specom.2008.10.002
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 421KB
UT WOS:000264356700002
ER
PT J
AU Chomphan, S
Kobayashi, T
AF Chomphan, Suphattharachal
Kobayashi, Takao
TI Tone correctness improvement in speaker-independent average-voice-based
Thai speech synthesis
SO SPEECH COMMUNICATION
LA English
DT Article
DE Tone correctness; Tone neutralization; Average-voice; Hidden Markov
models; Speech synthesis
ID MANDARIN; MODEL
AB A novel approach to the context-clustering process in a speaker-independent HMM-based Thai speech synthesis is addressed in this paper. Improvements to the tone correctness (i.e., tone intelligibility) of the average-voice and also the speaker-adapted voice were our main objectives. To treat the problem of tone neutralization, we incorporated a number of tonal features called tone-geometrical and phrase-intonation features into the context-clustering process of the HMM training stage. We carried out subjective and objective evaluations of both the average voice and adapted voice in terms of the intelligibility of tone and the logarithmic fundamental frequency (F0) error in our experiments. The effects on the decision trees of the extracted features were also evaluated. Several speech-model scenarios including male/female and gender-dependent/gender-independent were implemented to confirm the effectiveness of the proposed approach. The results of subjective tests revealed that the proposed tonal features could improve the intelligibility of tones for all speech-model scenarios. The objective tests also yielded results corresponding to those of the subjective tests. The experimental results from both the subjective and objective evaluations confirmed that the proposed tonal features could alleviate the problem of tone neutralization; as a result, the tone correctness of synthesized speech was significantly improved. Crown Copyright (C) 2008 Published by Elsevier B.V. All rights reserved.
C1 [Chomphan, Suphattharachal] Kasetsart Univ, Fac Engn Si Racha, Elect Engn Div, Si Racha 20230, Chonburi, Thailand.
[Chomphan, Suphattharachal; Kobayashi, Takao] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Midori Ku, Yokohama, Kanagawa 2268502, Japan.
RP Chomphan, S (reprint author), Kasetsart Univ, Fac Engn Si Racha, Elect Engn Div, Si Racha 20230, Chonburi, Thailand.
EM suphattharachai@ip.titech.ac.jp; takao.kobayashi@ip.titech.ac.jp
CR ABRAMSON AS, 1979, INT C PHON SCI, P380
Anastasakos T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1137
Chen YQ, 2002, J INTELL INF SYST, V19, P95, DOI 10.1023/A:1015568521453
CHOMPHAN S, 2007, P INTERSPEECH 2007, P2849
Chomphan S, 2008, SPEECH COMMUN, V50, P392, DOI 10.1016/j.specom.2007.12.002
CHOMPHAN S, 2007, 6 ISCA WORKSH SPEECH, P160
Fujisaki H., 1971, J ACOUST SOC JPN, V57, P445
Fujisaki H., 1984, J ACOUST SOC JPN ASJ, V5, P133
Fujisaki H, 1998, ICSP '98: 1998 FOURTH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, PROCEEDINGS, VOLS I AND II, P714, DOI 10.1109/ICOSP.1998.770311
FUJISAKI H, 1990, INT C SPOK LANG PROC, P841
GANDOUR J, 1994, J PHONETICS, V22, P477
HANSAKUNBUNTHEU.C, 2005, INT S NAT LANG PROC, P127
Iwasaki Shoichi, 2005, REFERENCE GRAMMAR TH
KASURIYA S, 2003, JOINT INT C SNLP OR, P54
LI Y, 2004, SPEECH PROSODY, P467
LUKSANEEYANAWIN S, 1993, INT S NAT LANG PROC, P276
LUKSANEEYANAWIN S, 1992, INT S LANG LING, P75
Masuko T, 1996, INT CONF ACOUST SPEE, P389, DOI 10.1109/ICASSP.1996.541114
Mixdorff H., 2002, INT C SPOK LANG PROC, P753
Moren B, 2006, NAT LANG LINGUIST TH, V24, P113, DOI 10.1007/s11049-004-5454-y
Ni JF, 2006, SPEECH COMMUN, V48, P989, DOI 10.1016/j.specom.2006.01.002
PALMER A, 1969, LANG LEARN, V19, P287, DOI 10.1111/j.1467-1770.1969.tb00469.x
Riley M., 1992, TALKING MACHINES THE, P265
Russell M. J., 1985, ICASSP 85. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No. 85CH2118-8)
SAITO T, 2002, INT C SPOK LANG PROC, P165
Seresangtakul P, 2003, IEICE T INF SYST, VE86D, P2223
Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79
SORNLERTLAMVANI.V, 1998, INT C SPEECH DAT ASS, P131
TAO J, 2006, TC STAR WORKSH SPEEC, P171
THATHONG U, 2000, INT C SPOK LANG PROC, P47
TRAN DD, 2006, INT S TON ASP LANG
Yamagishi J, 2003, IEICE T FUND ELECTR, VE86A, P1956
Yamagishi J, 2003, IEEE INT C AC SPEECH, P716
YAMAGISHI J, 2002, INT C SPOK LANG PROC, P133
YAMAGISHI J, 2004, INT C SPOK LANG PROC, P1213
Yoshida T, 1998, INTERNATIONAL ELECTRON DEVICES MEETING 1998 - TECHNICAL DIGEST, P29, DOI 10.1109/IEDM.1998.746239
Zen H, 2004, INT C SPOK LANG PROC, P1393
NR 37
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2009
VL 51
IS 4
BP 330
EP 343
DI 10.1016/j.specom.2008.10.003
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 421KB
UT WOS:000264356700003
ER
PT J
AU Sciamarella, D
Artana, G
AF Sciamarella, D.
Artana, G.
TI A water hammer analysis of pressure and flow in the voice production
system
SO SPEECH COMMUNICATION
LA English
DT Article
DE Voice production; Water hammer; Speech synthesis
ID VOCAL-FOLD MODEL; 2-MASS MODEL; KOROTKOFF SOUNDS; SEPARATION; DYNAMICS
AB The sudden pressure rise produced by glottal closure in the subglottal tract during vocal fold oscillation causes a flow transient which can be computed as a water hammer effect in engineering. In this article, we present a basic water hammer analysis for the trachea and the supralaryngeal tract under conditions which are analogous to those operating during voice production. This approach allows prediction of both the intra-oral and intra-tracheal pressure fluctuations induced by vocal fold motion and the airflow evolution throughout the phonatory system. (C) 2008 Elsevier B.V. All rights reserved.
C1 [Sciamarella, D.] LIMSI, CNRS, F-91403 Orsay, France.
[Artana, G.] Univ Buenos Aires, CONICET, Fac Ingn, LFD, RA-1053 Buenos Aires, DF, Argentina.
RP Sciamarella, D (reprint author), LIMSI, CNRS, BP 133, F-91403 Orsay, France.
EM sciamarella@limsi.fr
CR Alipour F, 2004, J ACOUST SOC AM, V116, P1710, DOI 10.1121/1.1779274
Allen J, 2004, PHYSIOL MEAS, V25, P107, DOI 10.1088/0967-3334/25/1/010
BLADE EJ, 1962, D1216 NASA
BURNETT GC, 2002, Patent No. 20020099541
Chang HS, 2003, J NEUROL NEUROSUR PS, V74, P344, DOI 10.1136/jnnp.74.3.344
CHUNGCHAROEN D, 1964, AM J PHYSIOL, V207, P190
Dang JW, 2004, J ACOUST SOC AM, V115, P853, DOI 10.1121/1.1639325
D'Souza A.F., 1964, ASME, V86, P589
FLETCHER NH, 1993, J ACOUST SOC AM, V93, P2172, DOI 10.1121/1.406857
Ghidaoui M. S., 2005, Applied Mechanics Review, V58, DOI 10.1115/1.1828050
HERMAWAN V, 2004, INT MECH ENG C EXP B
HIRSCHBERG A, 2001, LECT SERIES VONKARMA, V2
ISHIZAKA K, 1976, J ACOUST SOC AM, V60, P190, DOI 10.1121/1.381064
ISHIZAKA K, 1972, AT&T TECH J, V51, P1233
JOUKOWSKY N, 1900, MEMOIRES ACAD IMPERI, V8, P5
Lous NJC, 1998, ACUSTICA, V84, P1135
Lucero JC, 1999, J ACOUST SOC AM, V105, P423, DOI 10.1121/1.424572
Miller TL, 2007, J BIOMECH, V40, P1615, DOI 10.1016/j.jbiomech.2006.07.022
Mongeau L, 1997, J ACOUST SOC AM, V102, P1121, DOI 10.1121/1.419864
PELORSON X, 1994, J ACOUST SOC AM, V96, P3416, DOI 10.1121/1.411449
Sciamarella D, 2004, ACTA ACUST UNITED AC, V90, P746
SCIAMARELLA D, 2007, EUR J MECH B-FLUID, V27, P42
Skalak R., 1956, T ASME, V78, P105
SPURK JH, 1997, FLUID MECH, P275
Streeter V. L., 1967, HYDRAULIC TRANSIENTS
STREETER VL, 1974, ANNU REV FLUID MECH, V6, P57, DOI 10.1146/annurev.fl.06.010174.000421
Tijsseling AS, 1996, J FLUID STRUCT, V10, P109, DOI 10.1006/jfls.1996.0009
TITZE IR, 1993, PRINCIPLES VOICE PRO
WOO P, 1996, LARYNGOSCOPE, V106
WOOD DJ, 1968, ASME J BASIC ENG, V90, P532
Wylie EB, 1993, FLUID TRANSIENTS SYS
Zhao M, 2003, J HYDRAUL ENG-ASCE, V129, P1007, DOI 10.1061/(ASCE)0733-9429(2003)129:12(1007)
Zhao M, 2007, J FLUID MECH, V570, P129, DOI [10.1017/S0022112006003193, 10.1017/s0022112006003193]
NR 33
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2009
VL 51
IS 4
BP 344
EP 351
DI 10.1016/j.specom.2008.10.004
PG 8
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 421KB
UT WOS:000264356700004
ER
PT J
AU Hosom, JP
AF Hosom, John-Paul
TI Speaker-independent phoneme alignment using transition-dependent states
SO SPEECH COMMUNICATION
LA English
DT Article
DE Forced alignment; Phoneme alignment; Automatic phoneme alignment; Hidden
Markov models
ID HIDDEN MARKOV-MODELS; CHILDHOOD APRAXIA; DIAGNOSTIC MARKER; SPEECH;
RATIO
AB Determining the location of phonemes is important to a number of speech applications, including training of automatic speech recognition systems, building text-to-speech systems, and research on human speech processing. Agreement of humans on the location of phonemes is, on average, 93.78% within 20 ms on a variety of corpora, and 93.49% within 20 ms on the TIMIT corpus. We describe a baseline forced-alignment system and a proposed system with several modifications to this baseline. Modifications include the addition of energy-based features to the standard cepstral feature set, the use of probabilities of a state transition given an observation, and the computation of probabilities of distinctive phonetic features instead of phoneme-level probabilities. Performance of the baseline system on the test partition of the TIMIT corpus is 91.48% within 20 ms, and performance of the proposed system on this corpus is 93.36% within 20 ms. The results of the proposed system are a 22% relative reduction in error over the baseline system, and a 14% reduction in error over results from a non-HMM alignment system. This result of 93.36% agreement is the best known reported result on the TIMIT corpus. (C) 2008 Elsevier B.V. All rights reserved.
C1 Oregon Hlth & Sci Univ, Sch Sci & Engn, Ctr Spoken Language Understanding, Beaverton, OR 97006 USA.
RP Hosom, JP (reprint author), Oregon Hlth & Sci Univ, Sch Sci & Engn, Ctr Spoken Language Understanding, 20000 NW Walker Rd, Beaverton, OR 97006 USA.
EM hosom@cslu.ogi.edu
FU National Institutes of Health NIDCD [R21-DC06722]; National Institutes
of Health NIA [AG08017, AG024978]; National Science Foundation
[GER-9354959, IRI-9614217]
FX This work was supported in part by the National Institutes of Health
NIDCD Grant R21-DC06722, the National Institutes of Health NIA Grants
AG08017 and AG024978, and the National Science Foundation Grants
GER-9354959 and IRI-9614217. The views expressed here do not necessarily
represent the views of the NIH or NSF.
CR BOURLARD H, 1992, SPEECH COMMUN, V11, P237, DOI 10.1016/0167-6393(92)90018-3
BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W
Campbell N., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607292
Cole R., 1995, P EUR C SPEECH COMM, P821
COLE R, 1994, P INT C SPOK LANG PR, P2131
Cosi P., 1991, P EUROSPEECH 91, P693
COSI P, 2000, P ICSLP 2000 BEIJ CH, V2, P527
COX S, 1998, P INT C SPOK LANG PR, V5, P1947
Duc DN, 2003, LECT NOTES ARTIF INT, V2718, P481
FURUI S, 1986, J ACOUST SOC AM, V80, P1016, DOI 10.1121/1.393842
Garofolo J., 1990, DARPA TIMIT ACOUSTIC
GILLICK L, 1993, P ICASSP 89 GLASG SC, P532
GONG Y, 1993, P EUR 93 BERL GERM, P1759
Gordon-Salant S, 2006, J ACOUST SOC AM, V119, P2455, DOI 10.1121/1.2171527
Greenberg Steven, 1996, P ESCA WORKSH AUD BA, P1
Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616
Hieronymus J. L., 1994, ASCII PHONETIC SYMBO
Hosom J.-P., 1998, Australian Journal of Intelligent Information Processing Systems, V5
HOSOM JP, 2000, P ICSLP 2000 BEIJ, V4, P564
Huang X., 2001, SPOKEN LANGUAGE PROC
HUOPANIEMI J, 1997, P 102 AUD ENG SOC AE
Kain AB, 2007, SPEECH COMMUN, V49, P743, DOI 10.1016/j.specom.2007.05.001
Keshet J., 2005, P INT 05, P2961
KVALE K, 1994, P ICSLP 94 YOK JAP, V3, P1667
Ladefoged Peter, 1993, COURSE PHONETICS
LEUNG HC, 1984, P ICASSP 84 SAN DIEG
Levinson S. E., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80009-2
LJOLJE A, 1997, PROGR SPEECH SYNTHES
Ljolje A., 1991, P INT C AC SPEECH SI, P473, DOI 10.1109/ICASSP.1991.150379
MALFRRE F, 1998, P ICSLP, P1571
Moore BCJ, 1997, INTRO PSYCHOL HEARIN
PELLOM BL, 1998, THESIS DUKE U DURHAM
Rabiner L, 1993, FUNDAMENTALS SPEECH
Rapp S., 1995, P ELSNET GOES E IMAC
Richard M. D., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.4.461
SHOBAKI K, P ICSLP 2000 BEIJ CH, V4, P258
Shriberg LD, 2003, CLIN LINGUIST PHONET, V17, P575, DOI 10.1080/0269920031000138141
Shriberg LD, 2003, CLIN LINGUIST PHONET, V17, P549, DOI 10.1080/0269920031000138123
Singh S, 2001, APHASIOLOGY, V15, P571, DOI 10.1080/02687040143000041
Svendsen T., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0)
SVENDSEN T, 1990, P ICSLP 90 KOB JAP, P997
TORKKOLA K, 1988, P IEEE INT C AC SPEE, P611
WAGNER M, 1981, P ICASSP81 ATLANTA, P1156
WEI W, 1998, P INT C AC SPEECH SI, V1, P497, DOI 10.1109/ICASSP.1998.674476
Wesenick M.-B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607054
Wheatley B., 1992, P ICASSP 92 SAN FRAN, V1, P533
WIGHTMAN CW, 1997, PROGR SPEECH SYNTHES
WOODLAND PC, 1995, P ICASSP, V1, P73
NR 48
TC 19
Z9 19
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2009
VL 51
IS 4
BP 352
EP 368
DI 10.1016/j.specom.2008.11.003
PG 17
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 421KB
UT WOS:000264356700005
ER
PT J
AU Garcia-Sierra, A
Diehl, RL
Champlin, C
AF Garcia-Sierra, Adrian
Diehl, Randy L.
Champlin, Craig
TI Testing the double phonemic boundary in bilinguals
SO SPEECH COMMUNICATION
LA English
DT Article
DE Double phonemic boundary; Spanish-English bilinguals; Perceptual
switching; Language contexts
ID SPANISH-ENGLISH BILINGUALS; CROSS-LANGUAGE; PERCEPTION; CONTRAST;
INFANTS; DISCRIMINATION; SPEAKERS; ADULTS; STOPS; RANGE
AB It is widely known that language influences the way speech sounds are categorized. However, categorization of speech sounds by bilinguals is not well understood. There is evidence that bilinguals have different category boundaries than monolinguals, and there is evidence suggesting that bilinguals' phonemic boundaries can shift with language context. This phenomenon has been referred to as the double phonemic boundary. In this investigation, the double phonemic boundary is tested in Spanish-English bilinguals (N = 18) and English monolinguals (N = 16). Participants were asked to categorize speech stimuli from a continuum ranging from /ga/ to /ka/ in two language contexts. The results showed phonemic boundary shifts in bilinguals and monolinguals which did not differ across language contexts. However, the magnitude of the phoneme boundary shift was significantly correlated with the level of confidence in using English and Spanish (reading, writing, speaking, and comprehension) for bilinguals, but not for monolinguals. The challenges of testing the double phonemic boundary are discussed, along with the limitations of the methodology used in this study. (C) 2008 Elsevier B.V. All rights reserved.
C1 [Garcia-Sierra, Adrian; Diehl, Randy L.; Champlin, Craig] Univ Texas Austin, Dept Commun Sci & Disorders, Coll Commun, Austin, TX 78712 USA.
RP Garcia-Sierra, A (reprint author), Univ Washington, Inst Learning & Brain Sci, Fisheries Ctr Bldg,Box 357988, Seattle, WA 98195 USA.
EM gasa@u.washington.edu; diehl@psy.utexas.edu; champlin@austin.utexas.edu
FU The Department of Communications Sciences and Disorders at The
University of Texas Austin
FX The present work was supported by The Department of Communications
Sciences and Disorders at The University of Texas Austin. I thank the
valuable help I received from Nairan Ramirez-Esparza, Denise Padden,
Marco A. Jurado and Hayley Austin.
CR Abramson A. S., 1970, P 6 INT C PHON SCI P, P569
Abramson A. S., 1972, J PHON, V1, P1
BEST CT, 1988, J EXP PSYCHOL HUMAN, V14, P345, DOI 10.1037/0096-1523.14.3.345
BRADY SA, 1978, J ACOUST SOC AM, V63, P1556, DOI 10.1121/1.381849
CARAMAZZ.A, 1973, J ACOUST SOC AM, V54, P421, DOI 10.1121/1.1913594
DIEHL RL, 1978, J EXP PSYCHOL HUMAN, V4, P599, DOI 10.1037//0096-1523.4.4.599
EIMAS PD, 1973, COGNITIVE PSYCHOL, V4, P99, DOI 10.1016/0010-0285(73)90006-6
ELMAN JL, 1977, J ACOUST SOC AM, V62, P971, DOI 10.1121/1.381591
Finney D. J., 1971, PROBIT ANAL, V3rd
BOHN OS, 1993, J PHONETICS, V21, P267
FLEGE JE, 1987, J PHONETICS, V15, P67
Flege JE, 2002, APPL PSYCHOLINGUIST, V23, P567, DOI 10.1017/S0142716402004046
FLEGE JE, 1987, SPEECH COMMUN, V6, P185, DOI 10.1016/0167-6393(87)90025-2
Grosjean F, 1982, LIFE 2 LANGUAGES
HAZAN VL, 1993, LANG SPEECH, V36, P17
Holt LL, 2005, PSYCHOL SCI, V16, P305, DOI 10.1111/j.0956-7976.2005.01532.x
Holt LL, 2002, HEARING RES, V167, P156, DOI 10.1016/S0378-5955(02)00383-0
KEATING PA, 1981, J ACOUST SOC AM, V70, P1261, DOI 10.1121/1.387139
KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940
Kohnert KJ, 1999, J SPEECH LANG HEAR R, V42, P1400
KUHL PK, 1992, SCIENCE, V255, P606, DOI 10.1126/science.1736364
Lisker L., 1970, P 6 INT C PHON SCI P, P563
Naatanen R., 1992, ATTENTION BRAIN FUNC
Polka L, 2001, J ACOUST SOC AM, V109, P2190, DOI 10.1121/1.1362689
Sundara M, 2008, COGNITION, V108, P232, DOI 10.1016/j.cognition.2007.12.013
Sundara M, 2008, COGNITION, V106, P234, DOI 10.1016/j.cognition.2007.01.011
WILLIAMS L, 1977, PERCEPT PSYCHOPHYS, V21, P289, DOI 10.3758/BF03199477
NR 27
TC 9
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2009
VL 51
IS 4
BP 369
EP 378
DI 10.1016/j.specom.2008.11.005
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 421KB
UT WOS:000264356700006
ER
PT J
AU Jongtaveesataporn, M
Thienlikit, I
Wutiwiwatchai, C
Furui, S
AF Jongtaveesataporn, Markpong
Thienlikit, Issara
Wutiwiwatchai, Chai
Furui, Sadaoki
TI Lexical units for Thai LVCSR
SO SPEECH COMMUNICATION
LA English
DT Article
DE Thai LVCSR; Thai language model; Lexical unit; Pseudo-morpheme; Compound
pseudo-morpheme; Word segmentation
AB Traditional language models rely on lexical units that are defined as entities separated from each other by word boundary markers. Since there are no such boundaries in Thai, alternative definitions of lexical units have to be pursued. The problem is to find the optimal set of lexical units that constitutes the vocabulary of the language model and yields the best final result. The word is a traditional lexical unit recognized by Thai people and is used by most natural language processing systems, including automatic speech recognition systems. This paper discusses problems with using words as a lexical unit and investigates other lexical units for the Thai large vocabulary continuous speech recognition (LVCSR) system. The pseudo-morpheme is introduced in the paper and shown to be unsuitable for use as a lexical unit directly. A technique using pseudo-morphemes to improve the system based on the traditional word model is introduced and some improvements can be gained by this technique. Then, a new lexical unit for Thai, the compound pseudo-morpheme, and an algorithm to build compound pseudo-morphemes are presented. The experimental results show that the system using compound pseudo-morphemes outperforms other systems. Thus, the compound pseudo-morpheme is the most suitable lexical unit for a Thai LVCSR system. (C) 2008 Elsevier B.V. All rights reserved.
C1 [Jongtaveesataporn, Markpong; Thienlikit, Issara; Furui, Sadaoki] Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan.
[Wutiwiwatchai, Chai] Natl Elect & Comp Technol Ctr, Pathum Thani 12120, Thailand.
RP Jongtaveesataporn, M (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1 Ookayma, Tokyo 1528552, Japan.
EM marky@furui.cs.titech.ac.jp
RI Wutiwiwatchai, Chai/G-5010-2012
FU METI Project "Development of Fundamental Speech Recognition Technology"
FX The speech corpus used for training the acoustic model was funded by the
METI Project "Development of Fundamental Speech Recognition Technology".
The authors would like to thank Dr. Wirote Aroonmanakun for allowing us
to use his PM segmentation tool and NECTEC for providing useful Thai
speech resources. We also would like to thank the anonymous reviewers
for their insightful remarks on a previous version of this manuscript
and Si-Phya Publishing Co., Ltd. which publishes Daily News newspaper
for supplying us useful newspaper text.
CR AROONMANAKUN W, 2002, P 5 S NAT LANG PROC, P68
Hacioglu K., 2003, P 8 EUR C SPEECH COM, P1165
JONGTAVEESATAPO.M, 2008, P INT C LARG RES EV
Kasuriya S., 2003, Proceedings of the Oriental COCOSDA 2003. International Coordinating Committee on Speech Databases and Speech I/O System Assessment
Lee A., 2001, P EUR C SPEECH COMM, P1691
*NECTEC, SWATH SMART WORD AN
ROSENFELD R, 1995, P ARPA SPOK LANG TEC
Saon G, 2001, IEEE T SPEECH AUDI P, V9, P327, DOI 10.1109/89.917678
Tarsaku P., 2001, P EUR C SPEECH COMM, P1057
Wutiwiwatchai C, 2007, SPEECH COMMUN, V49, P8, DOI 10.1016/j.specom.2006.10.004
Young Steve, 2002, HTK BOOK VERSION 3 2
NR 11
TC 0
Z9 0
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2009
VL 51
IS 4
BP 379
EP 389
DI 10.1016/j.specom.2008.11.006
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 421KB
UT WOS:000264356700007
ER
PT J
AU Gomez, AM
Peinado, AM
Sanchez, V
Carmona, JL
AF Gomez, Angel M.
Peinado, Antonio M.
Sanchez, Victoria
Carmona, Jose L.
TI A robust scheme for distributed speech recognition over loss-prone
packet channels
SO SPEECH COMMUNICATION
LA English
DT Article
DE Distributed speech recognition; Media-specific FEC; Interleaving;
Weighted Viterbi decoding
ID IP NETWORKS
AB In this paper, we propose a whole recovery scheme designed to improve robustness against packet losses in distributed speech recognition systems. This scheme integrates two sender-driven techniques, namely, media-specific forward error correction (FEC) and frame interleaving, along with a receiver-based error concealment (EC) technique, the weighted Viterbi algorithm (WVA). Although these techniques have already been tested separately, providing a significant increase of performance in clean acoustic environments, in this paper they are jointly applied and their performance in adverse acoustic conditions is evaluated. In particular, a noisy speech database and the ETSI Advanced Front-end are used, while the dynamic features, which play an important role in adverse acoustic environments, and their confidences for the WVA algorithm are examined. In order to solve the issue of mixing two sender-driven techniques (both causing a delay) whose direct composition causes an increase of the global latency, we propose a double stream scheme which limits the latency to the maximum delay of both techniques. As a result, with very few overhead bits and a very limited delay, the integrated scheme achieves a significant improvement in the performance of a DSR system over a degraded transmission channel, both in clean and noisy acoustic conditions. (C) 2009 Elsevier B.V. All rights reserved.
C1 [Gomez, Angel M.; Peinado, Antonio M.; Sanchez, Victoria; Carmona, Jose L.] Univ Granada, Fac Ciencias, Dept Teoria Senal Telemat & Comunicac, E-18071 Granada, Spain.
RP Gomez, AM (reprint author), Univ Granada, Fac Ciencias, Dept Teoria Senal Telemat & Comunicac, Campus Fuentenueva S-N, E-18071 Granada, Spain.
EM amgg@ugr.es; amp@ugr.es; victoria@ugr.es; maqueda@ugr.es
RI Sanchez , Victoria /C-2411-2012; Peinado, Antonio/C-2401-2012; Gomez
Garcia, Angel Manuel/C-6856-2012
OI Gomez Garcia, Angel Manuel/0000-0002-9995-3068
FU Spanish MEC [TEC 2007-66600]
FX This paper has been supported by the Spanish MEC, project TEC
2007-66600.
CR Andrews K., 1997, THEORY INTERLEAVERS
Bernard A, 2002, IEEE T SPEECH AUDI P, V10, P570, DOI 10.1109/TSA.2002.808141
Bolot J. C., 1993, ACM SIGCOMM, P289
Cardenal-Lopez A, 2006, SPEECH COMMUN, V48, P1422, DOI 10.1016/j.specom.2006.01.006
CARDENALLOPEZ A, 2004, P IEEE INT C AC SPEE, P49
ENDO T, 2003, P EUR 03
*ETSI, 2005, ETSIES202212
*ETSI, 2003, ETSIES202211
*ETSI, 2000, ETSIES201108
*ETSI, 2002, ETSIES202050
FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788
GOMEZ A, 2003, P EUROSPEECH 03
GOMEZ A, 2007, IEEE INT C AC SPEECH, V4
GOMEZ A, 2004, STAT BASED RECONSTRU
Gomez AM, 2007, IEEE T AUDIO SPEECH, V15, P1496, DOI 10.1109/TASL.2006.889800
Gomez AM, 2006, IEEE T MULTIMEDIA, V8, P1228, DOI 10.1109/TMM.2006.884611
Ion V, 2006, SPEECH COMMUN, V48, P1435, DOI 10.1016/j.specom.2006.03.007
JAMES A, 2004, P INT C SPOK LANG PR
JAMES A, 2004, P IEEE INT C AC SPEE
JAMES A, 2005, P IEEE INT C AC SPEE, P345
KOODLI R, 1999, 3357 RFC
MACHO D, 2000, SPANISH SDC AURORA D
Milner B, 2006, IEEE T AUDIO SPEECH, V14, P223, DOI 10.1109/TSA.2005.852997
MILNER B, 2004, PACKET LOSS MODELLIN
MILNER B, 2000, P ICASSP, V3, P1791
MILNER B, 2003, P EUR 03
PEINADO A, 2006, P INTERSPEECH 06
PEINADO A, 2005, P IEEE INT C AC SPEE, P329
PEINADO A, 2006, ROBUSTNESS STANDARDS
PERKINS C, 1998, IEEE NETWORK MAGAZIN
Postel J., 1980, 768 RFC
POTAMIANOS A, 2001, P IEEE INT C AC SPEE
RAMSEY JL, 1970, IEEE T INFORM THEORY, V16, P338, DOI 10.1109/TIT.1970.1054443
Schulzrinne H., 1996, 1889 RFC
XIE Q, 2002, 3357 RFC
YOMA N, 1998, P IEEE INT C AC SPEE
NR 36
TC 3
Z9 3
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD APR
PY 2009
VL 51
IS 4
BP 390
EP 400
DI 10.1016/j.specom.2008.12.002
PG 11
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 421KB
UT WOS:000264356700008
ER
PT J
AU Kjellstrom, H
Engwall, O
AF Kjellstrom, Hedvig
Engwall, Olov
TI Audiovisual-to-articulatory inversion
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech inversion; Articulatory inversion; Computer vision
ID SPEECH RECOGNITION
AB It has been shown that acoustic-to-articulatory inversion, i.e. estimation of the articulatory configuration from the corresponding acoustic signal, can be greatly improved by adding visual features extracted from the speaker's face. In order to make the inversion method usable in a realistic application, it should be possible to obtain these features from a monocular frontal face video in which the speaker is not required to wear any special markers. In this study, we investigate the importance of visual cues for inversion. Experiments with motion capture data of the face show that important articulatory information can be extracted using only a few face measures that mimic the information that could be gained from a video-based method. We also show that the depth cue for these measures is not critical, which means that the relevant information can be extracted from a frontal video. A real video-based face feature extraction method is further presented, leading to similar improvements in inversion quality. Rather than tracking points on the face, it represents the appearance of the mouth area using independent component images. These findings are important for applications that need a simple audiovisual-to-articulatory inversion technique, e.g. articulatory phonetics training for second language learners or hearing-impaired persons. (c) 2008 Elsevier B.V. All rights reserved.
C1 [Kjellstrom, Hedvig] KTH Royal Inst Technol, Sch Comp Sci & Commun, Comp Vis & Act Percept Lab, SE-10044 Stockholm, Sweden.
[Engwall, Olov] KTH Royal Inst Technol, Sch Comp Sci & Commun, Ctr Speech Technol CTT, SE-10044 Stockholm, Sweden.
RP Kjellstrom, H (reprint author), KTH Royal Inst Technol, Sch Comp Sci & Commun, Comp Vis & Act Percept Lab, SE-10044 Stockholm, Sweden.
EM hedvig@kth.se; engwall@kth.se
FU Swedish Research Council; ASPI; Audiovisual SPeech Inversion; Future and
Emerging Technologies (FET) programme within the Sixth Framework
Programme for Research of the European Commission [021324]
FX This research is part of the two projects ARTUR, funded by the Swedish
Research Council, and ASPI, Audiovisual SPeech Inversion. The authors
acknowledge the financial support of the Future and Emerging
Technologies (FET) programme within the Sixth Framework Programme for
Research of the European Commission, under FET-Open Contract No. 021324
CR Ahlberg J., 2001, LITHISYR2326 LINKP U
Bailly G., 2002, INT C SPEECH LANG PR, P1913
BESKOW J, 2003, RESYNTHESIS FACIAL I, P431
BRANDERUD P, 1985, P FRENCH SWED S SPEE, P113
BREGLER C, 1994, INT CONF ACOUST SPEE, P669
Cootes TF, 2001, IEEE T PATTERN ANAL, V23, P681, DOI 10.1109/34.927467
Cristianini N., 2000, INTRO SUPPORT VECTOR
Doucet A., 2001, SEQUENTIAL MONTE CAR
Dupont S, 2000, IEEE T MULTIMEDIA, V2, P141, DOI 10.1109/6046.865479
ENGWALL O, 2006, INT SEM SPEECH PROD, P469
Engwall O, 2006, BEHAV INFORM TECHNOL, V25, P353, DOI 10.1080/01449290600636702
ENGWALL O, 2005, INTERSPEECH, P3205
ENGWALL O, 2003, RESYNTHESIS 3D TONGU, P2261
Engwall O, 2003, SPEECH COMMUN, V41, P303, DOI [10.1016/S0167-6393(02)00132-2, 10.1016/S0167-6393(03)00132-2]
Hyvarinen A, 2001, INDEPENDENT COMPONEN
Hyvarinen A., 2005, FASTICA PACKAGE MATL
Isard M, 1998, INT J COMPUT VISION, V29, P5, DOI 10.1023/A:1008078328650
Jiang JT, 2002, EURASIP J APPL SIG P, V2002, P1174, DOI 10.1155/S1110865702206046
KATSAMANIS A, 2008, IEEE INT C AC SPEECH
KAUCIC R, 1998, IEEE INT C COMP VIS, P370
KJELLSTROM H, 2006, RECONSTRUCTING TONGU, P2238
MACLEOD A, 1990, British Journal of Audiology, V24, P29, DOI 10.3109/03005369009077840
MAEDA S, 1994, SPEECH MAPS WP2 SPEE, V3
Matthews I, 2002, IEEE T PATTERN ANAL, V24, P198, DOI 10.1109/34.982900
OUNI S, 2002, INT C SPOK LANG PROC, P2301
PETERSEN M, 1997, THESIS TU DENMARK DE
Saenko K, 2005, IEEE I CONF COMP VIS, P1424, DOI 10.1109/ICCV.2005.251
SEYMOUR R, 2005, NEW POSTERIOR BASED, P1229
SHDAIFAT I, 2005, SYSTEM AUDIO VISUAL, P1221
SUGAMURA N, 1986, SPEECH COMMUN, V5, P199, DOI 10.1016/0167-6393(86)90008-7
Tipping ME, 2001, J MACH LEARN RES, V1, P211, DOI 10.1162/15324430152748236
TORRESANI L, 2004, ECCV, P299
Yang MH, 2002, IEEE T PATTERN ANAL, V24, P34
Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X
NR 34
TC 9
Z9 9
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2009
VL 51
IS 3
BP 195
EP 209
DI 10.1016/j.specom.2008.07.005
PG 15
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 405DW
UT WOS:000263203900001
ER
PT J
AU Knoll, MA
Uther, M
Costall, A
AF Knoll, Monja A.
Uther, Maria
Costall, Alan
TI Effects of low-pass filtering on the judgment of vocal affect in speech
directed to infants, adults and foreigners
SO SPEECH COMMUNICATION
LA English
DT Article
DE Low-pass filtering; Infant-directed speech; Foreign-directed speech;
Vocal affect
ID MOTHERS SPEECH; EMOTIONS; COMMUNICATION; MASKING; CUES; RECOGNITION;
FREQUENCIES; INTONATION; LANGUAGE; PROSODY
AB Low-pass filtering has been used in emotional research to remove the semantic content from speech on the assumption that the relevant acoustic cues for vocal affect remain intact. This method has also been adapted by recent investigations into the function of infant-directed speech (IDS). Similar to other emotion-related studies that have utilised various levels of low-pass filtering, these IDS investigations have used different frequency cut-offs. However, the effects of applying these different low-pass filters to speech samples on perceptual ratings of vocal affect are not well understood. Samples of natural IDS, foreigner- (FDS) and British adult-directed (ADS) speech were low-pass filtered at four different cut-offs (1200, 1000, 700, and 400 Hz), and affective ratings of these were compared to those of the original samples. The samples were also analyzed for mean fundamental frequency (F-0) and F-0 range. Whilst IDS received consistently higher affective ratings for all filters, the results of the adult conditions were more complex. ADS received significantly higher ratings of positive vocal affect than FDS with the lower cut-offs (1000-400 Hz), whereas no significant difference between the adult conditions was found in the original and 1200 Hz conditions. No difference between the adult conditions was found for encouragement of attention. These findings show that low-pass filtering leaves sufficient vocal affect for detection by raters between IDS and the adult conditions, but that residual semantic information in filters above 1000 Hz may have a confounding effect on raters' perception. (c) 2008 Elsevier B.V. All rights reserved.
C1 [Knoll, Monja A.; Costall, Alan] Univ Portsmouth, Dept Psychol, Portsmouth PO1 2DY, Hants, England.
[Uther, Maria] Brunel Univ, Sch Social Sci, Ctr Cognit & Neuroimaging, Uxbridge UB8 3PH, Middx, England.
RP Knoll, MA (reprint author), Univ Portsmouth, Dept Psychol, King Henry Bldg,King Henry 1 St, Portsmouth PO1 2DY, Hants, England.
EM Monja.Knoll@port.ac.uk; Maria.Uther@brunel.ac.uk;
Alan.Costall@port.ac.uk
FU ESRC
FX This research was supported by a grant from the ESRC to Monja Knoll. We
thank Pushpendra Singh (Newcastle University, UK), Paul Marshman
(Portsmouth University, UK), James Uther (Sydney University, Australia),
Axelle Philippon (Portsmouth University, UK) and Stig Walsh (NHM, UK)
for technical support and discussion. Two anonymous reviewers are
thanked for greatly improving this manuscript.
CR Baer T, 2002, J ACOUST SOC AM, V112, P1133, DOI 10.1121/1.1498853
Biersack S., 2005, P 9 EUR C SPEECH COM, P2401
Boersma P., 2005, PRAAT DOING PHONETIC
Burnham D., 2002, SCIENCE, V296, P1095
CAPORAEL LR, 1981, J PERS SOC PSYCHOL, V40, P876, DOI 10.1037//0022-3514.40.5.876
COHEN A, 1961, AM J PSYCHOL, V74, P90, DOI 10.2307/1419829
DAVITZ JR, 1961, J COMMUN, V81, P81
Dubno JR, 2005, J ACOUST SOC AM, V118, P923, DOI 10.1121/1.1953127
FERNALD A, 1991, DEV PSYCHOL, V27, P209, DOI 10.1037/0012-1649.27.2.209
FERNALD A, 1984, DEV PSYCHOL, V20, P104, DOI 10.1037//0012-1649.20.1.104
FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407
FRIEND M, 1994, J ACOUST SOC AM, V96, P1283, DOI 10.1121/1.410276
Hogan CA, 1998, J ACOUST SOC AM, V104, P432, DOI 10.1121/1.423247
Kitamura C, 2003, INFANCY, V4, P85, DOI 10.1207/S15327078IN0401_5
KNOLL MA, 2008, BPS ANN C 2008 DUBL, P232
Knoll MA, 2007, SYST ASSOC SPEC VOL, V74, P299
KNOLL MA, 2004, J ACOUST SOC AM, V116, P2522
Knower FH, 1941, J SOC PSYCHOL, V14, P369
KRAMER E, 1964, J ABNORM SOC PSYCH, V68, P390, DOI 10.1037/h0042473
KRAMER E, 1963, PSYCHOL BULL, V60, P408, DOI 10.1037/h0044890
MILMOE S, 1967, J ABNORM PSYCHOL, V72, P78, DOI 10.1037/h0024219
Morton JB, 2001, CHILD DEV, V72, P834, DOI 10.1111/1467-8624.00318
PAPOUSEK M, 1991, APPL PSYCHOLINGUIST, V12, P481, DOI 10.1017/S0142716400005889
POLLACK I, 1948, J ACOUST SOC AM, V20, P259, DOI 10.1121/1.1906369
ROGERS PL, 1971, BEHAV RES METH INSTR, V3, P16, DOI 10.3758/BF03208115
ROSS M, 1973, AM ANN DEAF, V118, P37
SCHERER KR, 1971, J EXP RES PERS, V5, P155
SCHERER KR, 1972, J PSYCHOLINGUIST RES, V1, P269, DOI 10.1007/BF01074443
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
Soskin W.F., 1961, J COMMUN, V11, P73, DOI 10.1111/j.1460-2466.1961.tb00331.x
STARKWEATHER JA, 1956, AM J PSYCHOL, V69, P121, DOI 10.2307/1418129
Starkweather J.A., 1967, RES VERBAL BEHAV SOM, P253
Trainor LJ, 2000, PSYCHOL SCI, V11, P188, DOI 10.1111/1467-9280.00240
Uther M, 2007, SPEECH COMMUN, V49, P2, DOI 10.1016/j.specom.2006.10.003
VANBEZOOIJEN R, 1986, J PSYCHOLINGUIST RES, V15, P103
WHITE GM, 1975, J ACOUST SOC AM, V58, P106
NR 36
TC 6
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2009
VL 51
IS 3
BP 210
EP 216
DI 10.1016/j.specom.2008.08.001
PG 7
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 405DW
UT WOS:000263203900002
ER
PT J
AU Caballero, M
Moreno, A
Nogueiras, A
AF Caballero, Monica
Moreno, Asuncion
Nogueiras, Albino
TI Multidialectal Spanish acoustic modeling for speech recognition
SO SPEECH COMMUNICATION
LA English
DT Article
DE Multidialectal ASR system; Dialect-independent ASR system; Spanish
acoustic modeling; Spanish dialects
AB In recent years, language resources for speech recognition have been collected for many languages and, specifically, for global languages. One of the characteristics of global languages is their wide geographical dispersion, and consequently, their wide phonetic, lexical, and semantic dialectal variability. Even if the collected data is huge, it is difficult to represent dialectal variants accurately.
This paper deals with multidialectal acoustic modeling for Spanish. The goal is to create a set of multidialectal acoustic models that represents the sounds of the Spanish language as spoken in Latin America and Spain. A comparative study of different methods for combining data between dialects is presented. The developed approaches are based on decision tree clustering algorithms. They differ on whether a multidialectal phone set is defined, and in the decision tree structure applied.
Besides, a common overall phonetic transcription for all dialects is proposed. This transcription can be used in combination with all the proposed acoustic modeling approaches. The overall transcription combined with approaches based on defining a multidialectal phone set leads to a fully dialect-independent recognizer, capable of recognizing any dialect even in the total absence of training data from that dialect.
Multidialectal systems are evaluated over data collected in five different countries: Spain, Colombia, Venezuela, Argentina and Mexico. The best results given by multidialectal systems show a relative improvement of 13% over the results obtained with monodialectal systems. Experiments with dialect-independent systems have been conducted to recognize speech from Chile, a dialect not seen in the training process. The recognition results obtained for this dialect are similar to the ones obtained for other dialects. (c) 2008 Elsevier B.V. All rights reserved.
C1 [Caballero, Monica; Moreno, Asuncion; Nogueiras, Albino] Univ Politecn Cataluna, Talp Res Ctr, E-08028 Barcelona, Spain.
RP Caballero, M (reprint author), Univ Politecn Cataluna, Talp Res Ctr, E-08028 Barcelona, Spain.
EM monica@gps.tsc.upc.edu; asuncion@gps.tsc.upc.edu; albino@gps.tsc.upc.edu
RI Nogueiras Rodriguez, Albino/G-1418-2013
OI Nogueiras Rodriguez, Albino/0000-0002-3159-1718
FU Spanish Government [TEC2006-13694-C03]
FX This work was granted by Spanish Government TEC2006-13694-C03.
CR AALBURG S, 2003, P EUROSPEECH 2003, P1489
BAUM M, 2001, ISCA WORKSH AD METH, P135
BERINGER N, 1998, P ICSLP SYDN AUSTR, V2, P85
Billa J., 1997, P EUR, P363
Brousseau J., 1992, P INT C SPOK LANG PR, P1003
BYRNE W, 2000, P ICASSP, V2, P1029
CABALLERO M, 2004, P ICSLP JEJ ISL KOR, P837
Chengalvarayan R, 2001, P EUR AALB DENM, P2733
DELATORRE C, 1996, P ICSLP PHIL, V4, P2032, DOI 10.1109/ICSLP.1996.607198
DIAKOLUKAS D, 1997, P ICASSP MUN GERM, P1455
DUCHATEAU J, 1997, P EUR 97, V3, P1183
Ferreiros J, 1999, SPEECH COMMUN, V29, P65, DOI 10.1016/S0167-6393(99)00013-8
FISCHER V, 1998, P ICSLP SYDN AUSTR
FOLDVIK AK, 1998, P ICSLP SYDN AUSTR
Gibbon D, 1997, HDB STANDARDS RESOUR
Heeringa W, 2003, COMPUT HUMANITIES, V37, P293, DOI 10.1023/A:1025087115665
HUERTA JM, 1998, DARPA BN TRANSCR UND
Imperl B, 2003, SPEECH COMMUN, V39, P353, DOI 10.1016/S0167-6393(02)00048-1
Kirchhoff K, 2005, SPEECH COMMUN, V46, P37, DOI 10.1016/j.specom.2005.01.004
Kohler J, 2001, SPEECH COMMUN, V35, P21, DOI 10.1016/S0167-6393(00)00093-5
Kudo I., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607195
Lipski J. M., 1994, LATIN AM SPANISH
LLISTERRI J, 1993, SAMAUPC001VI
MARINO JB, 2000, P INT WORKSH VER LAR, P57
MARINO JB, 1998, P ICSLP SYDN AUSTR, V1, P477
MORENO A, 1998, P ICSLP SIDN AUSTR
MORENO A., 1998, P INT C LANG RES EV, VI, P367
NOGUEIRAS A, 2002, P ICASSP ORL US
SALVI G, 2003, P 15 INT C PHON SCI
Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7
YU H, 2003, P EUR GEN SWITZ, P1869
ZISSMANM MA, 1996, P ICASSP ATL GA US, P777
NR 32
TC 6
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2009
VL 51
IS 3
BP 217
EP 229
DI 10.1016/j.specom.2008.08.003
PG 13
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 405DW
UT WOS:000263203900003
ER
PT J
AU Li, YP
Wang, DL
AF Li, Yipeng
Wang, DeLiang
TI On the optimality of ideal binary time-frequency masks
SO SPEECH COMMUNICATION
LA English
DT Article
DE Ideal binary mask; Ideal ratio mask; Optimality; Sound separation;
Wiener filter
ID AUDITORY SCENE ANALYSIS; SPEECH RECOGNITION; SOURCE SEPARATION; MONAURAL
SPEECH; SEGREGATION; ENHANCEMENT
AB The concept of ideal binary time-frequency masks has received attention recently in monaural and binaural sound separation. Although often assumed, the optimality of ideal binary masks in terms of signal-to-noise ratio has not been rigorously addressed. In this paper we give a formal treatment on this issue and clarify the conditions for ideal binary masks to be optimal. We also experimentally compare the performance of ideal binary masks to that of ideal ratio masks on a speech mixture database and a music database. The results show that ideal binary masks are close in performance to ideal ratio masks which are closely related to the Wiener filter, the theoretically optimal linear filter. (c) 2008 Elsevier B.V. All rights reserved.
C1 [Li, Yipeng; Wang, DeLiang] Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA.
[Wang, DeLiang] Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA.
RP Li, YP (reprint author), Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA.
EM li.434@osu.edu
FU AFOSR [F49620-04-1-0027]; AFRL [FA8750-04-1-0093]
FX We wish to thank the three anonymous reviewers for their constructive
suggestions/criticisms. This research was supported in part by an AFOSR
Grant (F49620-04-1-0027) and an AFRL Grant (FA8750-04-1-0093).
CR Bregman AS., 1990, AUDITORY SCENE ANAL
BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016
Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929
Cooke M.P., 1993, MODELING AUDITORY PR
Deshmukh OD, 2007, J ACOUST SOC AM, V121, P3886, DOI 10.1121/1.2714913
Ellis D.P.W., 2006, COMPUTATIONAL AUDITO, P115
Goto M., 2003, INT C MUS INF RETR
Harding S, 2006, IEEE T AUDIO SPEECH, V14, P58, DOI 10.1109/TSA.2005.860354
Hu G. N., 2001, IEEE WORKSH APPL SIG
Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812
Hubbard TL, 2001, AM J PSYCHOL, V114, P569, DOI 10.2307/1423611
KIM YI, 2006, NEURAL INFORM PROCES, V10, P125
Li N, 2008, J ACOUST SOC AM, V123, P1673, DOI 10.1121/1.2832617
Li P, 2006, IEEE T AUDIO SPEECH, V14, P2014, DOI 10.1109/TASL.2006.883258
LI Y, 2007, IEEE INT C AC SPEECH, P481
LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540
Oppenheim A. V., 1999, DISCRETE TIME SIGNAL
PRINCEN JP, 1986, IEEE T ACOUST SPEECH, V34, P1153, DOI 10.1109/TASSP.1986.1164954
Radfar MH, 2007, EURASIP J AUDIO SPEE, DOI 10.1155/2007/84186
REDDY AM, 2007, IEEE T AUDIO SPEECH, V25, P1766
Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463
Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003
Strang G., 1996, WAVELETS FILTER BANK
Van Trees H., 1968, DETECTION ESTIMATION, V1st
Vincent E, 2006, IEEE T AUDIO SPEECH, V14, P1462, DOI 10.1109/TSA.2005.858005
Vincent E, 2007, SIGNAL PROCESS, V87, P1933, DOI 10.1016/j.sigpro.2007.01.016
Wang D., 2006, COMPUTATIONAL AUDITO
Wang DL, 2005, SPEECH SEPARATION BY HUMANS AND MACHINES, P181, DOI 10.1007/0-387-22794-6_12
Wang DLL, 1999, IEEE T NEURAL NETWOR, V10, P684, DOI 10.1109/72.761727
Weintraub M., 1985, THESIS STANFORD U
Wiener N., 1949, EXTRAPOLATION INTERP
NR 31
TC 42
Z9 44
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2009
VL 51
IS 3
BP 230
EP 239
DI 10.1016/j.specom.2008.09.001
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 405DW
UT WOS:000263203900004
ER
PT J
AU Recasens, D
Espinosa, A
AF Recasens, Daniel
Espinosa, Aina
TI Dispersion and variability in Catalan five and six peripheral vowel
systems
SO SPEECH COMMUNICATION
LA English
DT Article
DE Vowel space; Catalan; Contextual variability; Near-mergers; Mid vowel
neutralization; Acoustic analysis
ID MODERN GREEK; ENGLISH; LANGUAGE; PATTERNS; COARTICULATION; CONTRAST;
CONTEXT; SPEECH
AB This study compares F1 and F2 for the vowels of the five and six peripheral vowel systems of four minor dialects of Catalan (Felanitxer, Gironi, Sitgeta, Rossellones), with those of the seven peripheral vowel systems of the major dialects those minor dialects belong to (Majorcan, Eastern). Results indicate that most mid vowel pairs subjected to neutralization may be characterized as near-mergers. Merging appears to have proceeded through two stages: in the first place, one of the two mid vowel pairs undergoes neutralization yielding a relatively close mid vowel in the resulting six vowel system; then, the members of the second vowel pair approach each other until they cease to be contrastive, and the front and back mid vowels of the resulting five vowel system tend to occupy a fairly equidistant position with respect to the mid high and mid low cognates. Moreover, in six vowel systems with a single mid vowel pair, the contrasting members of this pair approach each other if belonging to the back series but not if belonging to the front series. These findings are in support of two hypotheses: vowel systems tend to be symmetrical; reparation of six vowel systems is most prone to occur if the system is unoptimal. Predictions of the adaptive dispersion theory were not supported by the data. Thus, smaller vowel systems turned out not to be less disperse than larger ones, and mid vowels were not clearly more variable in five or six vowel systems than in seven vowel systems. It appears that for these predictions to come into play, the systems being compared need to differ considerably in number of vowels. (c) 2008 Elsevier B.V. All rights reserved.
C1 [Recasens, Daniel] Univ Autonoma Barcelona, Dept Catalan Philol, E-08193 Barcelona, Spain.
Inst Estudis Catalans, Phonet Lab, Barcelona 08001, Spain.
RP Recasens, D (reprint author), Univ Autonoma Barcelona, Dept Catalan Philol, E-08193 Barcelona, Spain.
EM daniel.recasens@uab.es
FU Spanish Ministry of Education and Science [HUM2006-03743]; FEDER
[2005SGR864]
FX This research was funded by project HUM2006-03743 of the Spanish
Ministry of Education and Science and FEDER, and by project 2005SGR864
of the Generalitat de Catalunya. We thank two anonymous reviewers for
comments on a previous manuscript version.
CR ADANK P, 2003, VOWEL NORMALIZATION
Adank P, 2004, J ACOUST SOC AM, V116, P3099, DOI 10.1121/1.1795335
ALTAMIMI JE, 2005, DOES VOWEL SYSTEM SI, P2465
Beddor PS, 2002, J PHONETICS, V30, P591, DOI 10.1006/jpho.2002.0177
BLADON A, 1983, SPEECH COMMUN, V2, P305, DOI 10.1016/0167-6393(83)90047-X
Boersma P, 1998, FUNCTIONAL PHONOLOGY
BOHN O, 2004, J INT PHON ASSOC, V34, P161, DOI 10.1017/S002510030400180X
Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5
BRADLOW AR, 1995, J ACOUST SOC AM, V97, P1916, DOI 10.1121/1.412064
CALAMAI S, 2002, SCUOLA NORMALE SUPER, V3, P40
CHAPTER J, 2004, CAMBRIDGE OCCASIONAL, V2, P100
de Boer B., 2001, ORIGINS VOWEL SYSTEM
de Boer B, 2000, J PHONETICS, V28, P441, DOI 10.1006/jpho.2000.0125
DEGRAAF T, 1984, P I PHONETIC SCI, V8, P41
Disner S. F., 1983, UCLA WORKING PAPERS
Fant G., 1973, SPEECH SOUNDS FEATUR
Ferrero F., 1978, J ITALIAN LINGUISTIC, V3, P87
FLEGE JE, 1989, LANG SPEECH, V32, P123
Flemming E., 2004, PHONETICALLY BASED P, P232, DOI 10.1017/CBO9780511486401.008
Flemming Edward S., 2002, AUDITORY REPRESENTAT
Fourakis M, 1999, PHONETICA, V56, P28, DOI 10.1159/000028439
Goldstein L., 1983, 10TH INT C PHON SCI, P267
Guion SG, 2003, PHONETICA, V60, P98, DOI 10.1159/000071449
Hawks JW, 1995, LANG SPEECH, V38, P237
Hillenbrand JM, 2001, J ACOUST SOC AM, V109, P748, DOI 10.1121/1.1337959
JONGMAN A, 1989, LANG SPEECH, V32, P221
KEATING PA, 1994, J PHONETICS, V22, P407
KEATING PA, 1984, PHONETICA, V41, P191
KOOPMANSVANBEIN.FJ, 1973, J PHONETICS, V1, P249
Labov W., 1994, PRINCIPLES LINGUISTI
LADEFOGED P, 1967, NATURE VOWEL QUALITY, P50
Lass Roger, 1992, CAMBRIDGE HIST ENGLI, V2, p[1066, 23]
Lass Roger, 1999, CAMBRIDGE HIST ENGLI, VIII, P56
Lass Roger, 1994, OLD ENGLISH HIST LIN
LILJENCR.J, 1972, LANGUAGE, V48, P839, DOI 10.2307/411991
Lindblom B., 1986, EXPT PHONOLOGY, P13
LIVJN P, 2000, PERILUS, V23, P93
MANUEL SY, 1990, J ACOUST SOC AM, V88, P1286, DOI 10.1121/1.399705
Martinet Andre, 1970, EC CHANGEMENTS PHONE
MARTINS MRD, 1964, B FILOLOGIA, V22, P303
Max L, 1999, J SPEECH LANG HEAR R, V42, P261
MEUNIER C, 2003, P 15 INT C PHON SCI, V1, P723
Mok P, 2006, THESIS U CAMBRIDGE
MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492
Most T, 2000, LANG SPEECH, V43, P295
Munson B, 2004, J SPEECH LANG HEAR R, V47, P1048, DOI 10.1044/1092-4388(2004/078)
Nearey Terrance Michael, 1978, PHONETIC FEATURE SYS
NOBRE MA, 1987, HONOR ILSE LEHISTE, P195
PAPCUN G, 1976, UCLA WORKING PAPERS, V31, P38
PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875
Quilis A., 1981, FONETICA ACUSTICA LE
Recasens D., 1996, FONETICA DESCRIPTIVA
Recasens D, 2006, SPEECH COMMUN, V48, P645, DOI 10.1016/j.specom.2005.09.011
RECASENS D, 1985, LANG SPEECH, V28, P97
Schwartz JL, 1997, J PHONETICS, V25, P255, DOI 10.1006/jpho.1997.0043
Schwartz JL, 1997, J PHONETICS, V25, P233, DOI 10.1006/jpho.1997.0044
Sole M. J., 2007, EXPT APPROACHES PHON, P104
Stevens K.N., 1998, ACOUSTIC PHONETICS
STEVENS KN, 1963, J SPEECH HEAR RES, V6, P111
VENY J, 1983, ELS PARLARS CATALANS
NR 60
TC 8
Z9 8
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2009
VL 51
IS 3
BP 240
EP 258
DI 10.1016/j.specom.2008.09.002
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 405DW
UT WOS:000263203900005
ER
PT J
AU Ding, HJ
Soon, IY
Koh, SN
Yeo, CK
AF Ding, Huijun
Soon, Ing Yann
Koh, Soo Nee
Yeo, Chai Kiat
TI A spectral filtering method based on hybrid wiener filters for speech
enhancement
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech enhancement; Wiener filter; Spectrogram filtering; Noise
reduction
ID NOISE SUPPRESSION FILTER; SUBTRACTION; ESTIMATOR
AB It is well known that speech enhancement using spectral filtering will result in residual noise. Residual noise which is musical in nature is very annoying to human listeners. Many speech enhancement approaches assume that the transform coefficients are independent of one another and can thus be attenuated separately, thereby ignoring the correlations that exist between different time frames and within each frame. This paper proposes a single-channel speech enhancement system which exploits such correlations between the different time frames to further reduce residual noise. Unlike other 2D speech enhancement techniques which apply a post-processor after some classical algorithms such as spectral subtraction, the proposed approach uses a hybrid Wiener spectrogram filter (HWSF) for effective noise reduction, followed by a multi-blade post-processor which exploits the 2D features of the spectrogram to preserve the speech quality and to further reduce the residual noise. This results in pleasant sounding speech for human listeners. Spectrogram comparisons show that in the proposed scheme, musical noise is significantly reduced. The effectiveness of the proposed algorithm is further confirmed through objective assessments and informal subjective listening tests. (c) 2008 Elsevier B.V. All rights reserved.
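For orientation, the following minimal sketch shows the classical per-bin Wiener gain that single-channel spectral filtering builds on; it is not the authors' hybrid Wiener spectrogram filter or multi-blade post-processor, and the gain floor and toy PSD values are assumptions.

import numpy as np

def wiener_gain(noisy_psd, noise_psd, gain_floor=1e-3):
    """Per-frequency Wiener gain from noisy-speech and noise PSD estimates."""
    snr = np.maximum(noisy_psd - noise_psd, 0.0) / np.maximum(noise_psd, 1e-12)
    return np.maximum(snr / (1.0 + snr), gain_floor)  # flooring limits musical noise

# Toy per-frame usage with made-up PSD estimates
noisy = np.array([2.0, 5.0, 1.2, 0.9])
noise = np.ones(4)
enhanced_magnitude = np.sqrt(wiener_gain(noisy, noise) * noisy)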
C1 [Ding, Huijun; Soon, Ing Yann; Koh, Soo Nee] Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore.
[Yeo, Chai Kiat] Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore.
RP Ding, HJ (reprint author), Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore.
EM ding0032@ntu.edu.sg; eiysoon@ntu.edu.sg; esnkoh@ntu.edu.sg;
asckyeo@ntu.edu.sg
RI KOH, Soo Ngee/A-5081-2011; Soon, Ing Yann/A-5173-2011
CR BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283
CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
Evans N. W. D., 2002, Proceedings of the Fourth IASTED International Conference Signal and Image Processing
Goh Z, 1998, IEEE T SPEECH AUDI P, V6, P287
Hu Y., 2006, P INTERSPEECH 2006 P
Jensen J, 2001, IEEE T SPEECH AUDI P, V9, P731, DOI 10.1109/89.952491
LIN Z, 2003, P 2 IEEE INT WORKSH, P61
MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394
Plapous C, 2006, IEEE T AUDIO SPEECH, V14, P2098, DOI 10.1109/TASL.2006.872621
Quatieri T. F., 2001, DISCRETE TIME SPEECH
Soon IY, 2003, IEEE T SPEECH AUDI P, V11, P717, DOI 10.1109/TSA.2003.816063
Soon IY, 1999, SIGNAL PROCESS, V75, P151, DOI 10.1016/S0165-1684(98)00230-8
WANG SH, 1992, IEEE J SEL AREA COMM, V10, P819, DOI 10.1109/49.138987
WHIPPLE G, 1994, P ICASSP, V1
NR 16
TC 13
Z9 13
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2009
VL 51
IS 3
BP 259
EP 267
DI 10.1016/j.specom.2008.09.003
PG 9
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 405DW
UT WOS:000263203900006
ER
PT J
AU Inanoglu, Z
Young, S
AF Inanoglu, Zeynep
Young, Steve
TI Data-driven emotion conversion in spoken English
SO SPEECH COMMUNICATION
LA English
DT Article
DE Emotion conversion; Expressive speech synthesis; Prosody modeling
ID VOICE CONVERSION; SPEECH SYNTHESIS
AB This paper describes an emotion conversion system that combines independent parameter transformation techniques to endow a neutral utterance with a desired target emotion. A set of prosody conversion methods have been developed which utilise a small amount of expressive training data (approximately 15 min) and which have been evaluated for three target emotions: anger, surprise and sadness. The system performs F0 conversion at the syllable level while duration conversion takes place at the phone level using a set of linguistic regression trees. Two alternative methods are presented as a means to predict F0 contours for unseen utterances. Firstly, an HMM-based approach uses syllables as linguistic building blocks to model and generate F0 contours. Secondly, an F0 segment selection approach expresses F0 conversion as a search problem, where syllable-based F0 contour segments from a target speech corpus are spliced together under contextual constraints. To complement the prosody modules, a GMM-based spectral conversion function is used to transform the voice quality. Each independent module and the combined emotion conversion framework were evaluated through a perceptual study. Preference tests demonstrated that each module contributes a measurable improvement in the perception of the target emotion. Furthermore, an emotion classification test showed that converted utterances with either F0 generation technique were able to convey the desired emotion above chance level. However, F0 segment selection outperforms the HMM-based F0 generation method both in terms of emotion recognition rates as well as intonation quality scores, particularly in the case of anger and surprise. Using segment selection, the emotion recognition rates for the converted neutral utterances were comparable to the same utterances spoken directly in the target emotion. (c) 2008 Elsevier B.V. All rights reserved.
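A much simpler baseline than the syllable-level HMM and segment-selection methods described above is global Gaussian normalization of log-F0; the sketch below, with assumed corpus statistics and F0 values, only illustrates the general idea of mapping neutral F0 toward a target emotion.

import numpy as np

def convert_f0(src_f0_hz, src_stats, tgt_stats):
    """Map voiced F0 values (Hz) by matching log-F0 mean and standard deviation."""
    mu_s, sd_s = src_stats
    mu_t, sd_t = tgt_stats
    return np.exp((np.log(src_f0_hz) - mu_s) / sd_s * sd_t + mu_t)

neutral_f0 = np.array([110.0, 118.0, 125.0, 102.0])
neutral_stats = (np.log(115.0), 0.10)   # assumed neutral-corpus statistics
angry_stats = (np.log(160.0), 0.18)     # assumed target-emotion statistics
print(np.round(convert_f0(neutral_f0, neutral_stats, angry_stats), 1))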
C1 [Inanoglu, Zeynep; Young, Steve] Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England.
RP Young, S (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England.
EM zeynep@gatesscholar.org; sjy@eng.cam.ac.uk
CR Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016
BARRA R, 2007, P INT
Boersma P., 2005, PRAAT DOING PHONETIC
BULUT M, 2007, P ICASSP
FALLSIDE F, 1987, SPEECH LANG, V2, P27
GILLETT B, 2003, P EUR
GOUBANOVA O, 2003, P INT C PHON SCI, V3, P2349
HELANDER E, 2007, P ICASSP, V4, P509
Iida A, 2003, SPEECH COMMUN, V40, P161, DOI 10.1016/S0167-6393(02)00081-X
INANOGLU Z, 2005, P AFF COMP INT INT
Inanoglu Z., 2003, THESIS U CAMBRIDGE
INANOGLU Z, 2008, THESIS U CAMBRIDGE
JENSEN U, 1994, COMPUT SPEECH LANG, V8, P227
JONES D, 1996, CAMBRIDGE ENGLISH PR
Kain A., 1998, P ICASSP, V1, P285, DOI 10.1109/ICASSP.1998.674423
Kawanami H., 1999, IEEE T SPEECH AUDIO, V7, P697
ROSS K, 1994, P ESCA IEEE WORKSH S
SCHERER K, 2004, P SPEECH PROS
SCHRODER M, 1999, P EUROSPEECH, V1, P561
SILVERMAN K, 1992, P ICSLP, V40, P862
Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472
Tao JH, 2006, IEEE T AUDIO SPEECH, V14, P1145, DOI 10.1109/TASL.2006.876113
TIAN J, 2007, P INT
TODA T, 2005, P INT
Tokuda K., 2002, IEEE SPEECH SYNTH WO
Tokuda K, 2002, IEICE T INF SYST, VE85D, P455
TOKUDA K, 2000, P ICASSP, V3, P1315
TSUZUKI H, 2004, P ICSLP, V2, P1185
Vroomen J., 1993, P EUR, V1, P577
Wu CH, 2006, IEEE T AUDIO SPEECH, V14, P1109, DOI 10.1109/TASL.2006.876112
YAMAGISHI J, 2003, P EUROSPEECH, V3, P2461
YE H, 2005, THESIS CAMBRIDGE U
YILDIRIM S, 2004, P ICSLP
YOSHIMURA T, 1998, P ICSLP
Young S. J., 2006, HTK BOOK VERSION 3 4
NR 35
TC 11
Z9 11
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2009
VL 51
IS 3
BP 268
EP 283
DI 10.1016/j.specom.2008.09.006
PG 16
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 405DW
UT WOS:000263203900007
ER
PT J
AU Breslin, C
Gales, MJF
AF Breslin, C.
Gales, M. J. F.
TI Directed decision trees for generating complementary systems
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Complementary systems; System combination
ID NEWS TRANSCRIPTION SYSTEM; MODELS
AB Many large vocabulary continuous speech recognition systems use a combination of multiple systems to obtain the final hypothesis. These complementary systems are typically found in an ad-hoc manner, by testing combinations of diverse systems and selecting the best. This paper presents a new algorithm for generating complementary systems by altering the decision tree generation, and a divergence measure for comparing decision trees. In this paper, the decision tree is biased against clustering states which have previously led to confusions. This leads to a system which concentrates states in contexts that were previously confusable. Thus these systems tend to make different errors. Results are presented on two broadcast news tasks - Mandarin and Arabic. The results show that combining multiple systems built from directed decision trees gives gains in performance when confusion network combination is used as the method of combination. The results also show that the gains achieved using the directed tree algorithm are additive to the gains achieved using other techniques that have been empirically shown to be complementary. (c) 2008 Elsevier B.V. All rights reserved.
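As one concrete way to compare how two decision trees partition the same set of phonetic contexts, the sketch below computes the Rand index over hypothetical leaf assignments; the paper cites Rand (1971), but its own tree-divergence measure differs, so this is an editorial illustration only.

from itertools import combinations

def rand_index(labels_a, labels_b):
    """Fraction of context pairs on which two partitions agree (same/different cluster)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agree = sum((labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
                for i, j in pairs)
    return agree / len(pairs)

# Hypothetical leaf assignments of five triphone contexts under two trees
tree1 = [0, 0, 1, 1, 2]
tree2 = [0, 1, 1, 2, 2]
print(rand_index(tree1, tree2))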
C1 [Breslin, C.; Gales, M. J. F.] Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England.
RP Breslin, C (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England.
EM catherine.breslin@crl.toshiba.co.uk; mjfg@eng.cam.ac.uk
CR Arslan LM, 1999, IEEE T SPEECH AUDI P, V7, P46, DOI 10.1109/89.736330
Breiman L, 1996, MACH LEARN, V24, P123, DOI 10.1023/A:1018054314350
Breiman L, 2001, MACH LEARN, V45, P5, DOI 10.1023/A:1010933404324
BRESLIN C, 2006, P ICSLP
BRESLIN C, 2007, P ICASSP
BRESLIN C, 2007, P INT
BUCKWATER T, 2004, LDC2004L02
Cincarek T, 2006, IEICE T INF SYST, VE89D, P962, DOI 10.1093/ietisy/e89-d.3.962
DIETTERICH T, 1999, MACH LEARN, V12, P1
Dietterich TG, 2000, LECT NOTES COMPUT SC, V1857, P1
DIMITRAKAKIS C, 2004, P ICASSP
Evermann G., 2000, P SPEECH TRANSCR WOR
FISCUS JG, 1997, P IEEE ASRU WORKSH
Freund Y., 1996, P 13 INT C MACH LEAR
Gales M.J.F, 2007, P ASRU
Gales MJF, 2006, IEEE T AUDIO SPEECH, V14, P1513, DOI 10.1109/TASL.2006.878264
Gauvain JL, 2002, SPEECH COMMUN, V37, P89, DOI 10.1016/S0167-6393(01)00061-9
GILLICK L, 1989, SOME STAT ISSUES COM
HAIN T, 2007, AMI SYSTEM TRANSCRIP
Hoffmeister B., 2006, P ICSLP
HU R, 2007, P ICASSP
HUANG J, 2007, P INT
HWANG M, 2007, ADV MANDARIN BROADCA
Jiang H, 2005, SPEECH COMMUN, V45, P455, DOI 10.1016/j.specrom.2004.12.004
KAMM T, 2003, P IEEE WORKSH AUT SP
KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694
LAMEL L, 2002, SPEECH LANG
Mangu L., 1999, P EUR
MEYER C, 2002, P ICASSP
NGUYEN L, 2002, SPEECH LANG
Nock H.J., 1997, P EUR
Odell J.J., 1995, THESIS U CAMBRIDGE
POVEY D, 2005, THESIS U CAMBRIDGE
RAMABHADRAN B, 2006, IBM 2006 SPEECH TRAN
RAND WM, 1971, J AM STAT ASSOC, V66, P846, DOI 10.2307/2284239
SCHWENK H, 1999, P ICASSP
Sinha R., 2006, P ICASSP
Siohan O., 2005, P ICASSP
STUKER S, 2006, P ICSLP
XUE J, 2007, P ICASSP
Zhang R., 2004, P ICSLP
ZHANG R, 2003, P ICASSP
ZWEIG G, 2000, P ICASSP
NR 43
TC 4
Z9 4
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2009
VL 51
IS 3
BP 284
EP 295
DI 10.1016/j.specom.2008.09.004
PG 12
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 405DW
UT WOS:000263203900008
ER
PT J
AU Knoll, M
Scharrer, L
Costall, A
AF Knoll, Monja
Scharrer, Lisa
Costall, Alan
TI Are actresses better simulators than female students? The effects of
simulation on prosodic modifications of infant- and foreigner-directed
speech
SO SPEECH COMMUNICATION
LA English
DT Article
DE IDS; Simulated speech; Actresses; Hyperarticulation
ID MOTHERS SPEECH; VOCAL EXPRESSION; MATERNAL SPEECH; EMOTION; LANGUAGE;
COMMUNICATION; INTONATION; PREFERENCE; CLEAR; CUES
AB Previous research has used simulated interactions to investigate emotional and linguistic speech phenomena. Here, we evaluate the use of these simulated interactions by comparing speech addressed to imaginary speech partners produced by psychology students and actresses, to an existing study of natural speech addressed to genuine interaction partners. Simulated infant-(IDS), foreigner-(FDS) and adult-directed speech (ADS) was obtained from 10 female students and 10 female actresses. These samples were acoustically analysed and rated for positive vocal affect. Our results for affect for actresses and student speakers are consistent with previous findings using natural interactions, with IDS rated higher in positive affect than ADS/FDS, and ADS rated higher than FDS. In contrast to natural speech, acoustic analyses of IDS produced by student speakers revealed a smaller vowel space than ADS/FDS, with no significant difference between those adult conditions. In contrast to natural speech (IDS > ADS/FDS), the mean F0 of IDS was significantly higher than ADS, but not than FDS. Acoustic analyses of actress speech were more similar to natural speech, with IDS vowel space significantly larger than ADS vowel space, and with the mean FDS vowel space positioned between these two conditions. IDS mean F0 of the actress speakers was significantly higher than both ADS and FDS. These results indicate that training plays an important role in eliciting natural-like speech in simulated interactions, and that participants without training are less successful in reproducing such speech. Speech obtained from simulated interactions should therefore be used with caution, and level of experience and training of the speakers should be taken into account. (c) 2008 Elsevier B.V. All rights reserved.
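The vowel-space comparisons reported above are commonly operationalized as the area of the /i/-/a/-/u/ triangle in the F1-F2 plane; the sketch below uses the shoelace formula with placeholder formant values, not data from this study.

def triangle_area(p1, p2, p3):
    """Area of the corner-vowel triangle from three (F1, F2) points."""
    (x1, y1), (x2, y2), (x3, y3) = p1, p2, p3
    return abs(x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2)) / 2.0

# Placeholder corner vowels for two speaking conditions (Hz)
ids_corners = {"i": (350, 2600), "a": (850, 1500), "u": (380, 800)}
ads_corners = {"i": (300, 2300), "a": (750, 1350), "u": (330, 850)}
print(triangle_area(*ids_corners.values()) > triangle_area(*ads_corners.values()))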
C1 [Knoll, Monja; Scharrer, Lisa; Costall, Alan] Univ Portsmouth, Dept Psychol, Portsmouth PO1 2DY, Hants, England.
[Scharrer, Lisa] Heidelberg Univ, Inst Psychol, D-69117 Heidelberg, Germany.
RP Knoll, M (reprint author), Univ Portsmouth, Dept Psychol, King Henry Bldg,King Henry 1 St, Portsmouth PO1 2DY, Hants, England.
EM monja.knoll@port.ac.uk; lisa.scharrer@port.ac.uk;
alan.costall@port.ac.uk
FU Economic and Social Research Council (ESRC)
FX We thank Dr Maria Uther (Brunel University, UK), Dr Stig Walsh (NHM
London) and Dr Darren Van Laar (University of Portsmouth) for useful
comments on the project. Our particular thanks go to David Bauckham and
the 'The Bridge Theatre Training Company' for their invaluable help in
recruiting the actresses used in this study. This research was supported
by a grant from the Economic and Social Research Council (ESRC) to Monja
Knoll.
CR Andruski J. E., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607913
Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614
Biersack S., 2005, P 9 EUR C SPEECH COM, P2401
Boersma P., 2006, PRAAT DOING PHONETIC
Burnham D, 2002, SCIENCE, V296, P1435, DOI 10.1126/science.1069587
Englund KT, 2005, J PSYCHOLINGUIST RES, V34, P259, DOI 10.1007/s10936-005-3640-7
FERNALD A, 1984, DEV PSYCHOL, V20, P104, DOI 10.1037//0012-1649.20.1.104
FERNALD A, 1987, INFANT BEHAV DEV, V10, P279, DOI 10.1016/0163-6383(87)90017-8
GRIESER DL, 1988, DEV PSYCHOL, V24, P14, DOI 10.1037/0012-1649.24.1.14
Kitamura C, 2003, INFANCY, V4, P85, DOI 10.1207/S15327078IN0401_5
Kitamura C., 1998, ADV INFANCY RES, V12, P221
Knoll MA, 2007, SYST ASSOC SPEC VOL, V74, P299
KRAMER E, 1964, J ABNORM SOC PSYCH, V68, P390, DOI 10.1037/h0042473
Kuhl Patricia K., 1999, J ACOUST SOC AM, V105.2, P1095, DOI 10.1121/1.425135
Kuhl PK, 2004, NAT REV NEUROSCI, V5, P831, DOI 10.1038/nrn1533
Kuhl PK, 1997, SCIENCE, V277, P684, DOI 10.1126/science.277.5326.684
Liu HM, 2003, DEVELOPMENTAL SCI, V6, pF1, DOI 10.1111/1467-7687.00275
PAPOUSEK M, 1991, INFANT BEHAV DEV, V14, P415, DOI 10.1016/0163-6383(91)90031-M
PAPOUSEK M, 1991, APPL PSYCHOLINGUIST, V12, P481, DOI 10.1017/S0142716400005889
PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96
Schaeffler F., 2006, 3 SPEECH PROS C DRES
Scherer KR, 2007, EMOTION, V7, P158, DOI 10.1037/1528-3542.7.1.158
Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009
Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5
Smiljanic R, 2005, J ACOUST SOC AM, V118, P1677, DOI 10.1121/1.2000788
STERN DN, 1983, J CHILD LANG, V10, P1
Stewart M. A., 1982, J LANG SOC PSYCHOL, V1, P91, DOI 10.1177/0261927X8200100201
Street R. L., 1983, J LANG SOC PSYCHOL, V2, P37, DOI 10.1177/0261927X8300200103
Trainor LJ, 2000, PSYCHOL SCI, V11, P188, DOI 10.1111/1467-9280.00240
Uther M, 2007, SPEECH COMMUN, V49, P2, DOI 10.1016/j.specom.2006.10.003
VISCOVICH N, 2003, MOTOR SKILLS, V96, P759
WALLBOTT HG, 1986, J PERS SOC PSYCHOL, V51, P690, DOI 10.1037//0022-3514.51.4.690
WERKER JF, 1994, INFANT BEHAV DEV, V17, P323, DOI 10.1016/0163-6383(94)90012-4
NR 33
TC 7
Z9 7
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD MAR
PY 2009
VL 51
IS 3
BP 296
EP 305
DI 10.1016/j.specom.2008.10.001
PG 10
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 405DW
UT WOS:000263203900009
ER
PT J
AU Kim, W
Hansen, JHL
AF Kim, Wooil
Hansen, John H. L.
TI Feature compensation in the cepstral domain employing model combination
SO SPEECH COMMUNICATION
LA English
DT Article
DE Speech recognition; Feature compensation; Model combination; Multiple
models; Mixture sharing
ID HIDDEN MARKOV-MODELS; SPEECH RECOGNITION; SPEAKER ADAPTATION; NOISE;
ENHANCEMENT; CLASSIFICATION; STRESS
AB In this paper, we present an effective cepstral feature compensation scheme which leverages knowledge of the speech model in order to achieve robust speech recognition. In the proposed scheme, the requirement for a prior noisy speech database in off-line training is eliminated by employing parallel model combination for the noise-corrupted speech model. Gaussian mixture models of clean speech and noise are used for the model combination. The adaptation of the noisy speech model is possible only by updating the noise model. This method has the advantage of reduced computational expenses and improved accuracy for model estimation since it is applied in the cepstral domain. In order to cope with time-varying background noise, a novel interpolation method of multiple models is employed. By sequentially calculating the posterior probability of each environmental model, the compensation procedure can be applied on a frame-by-frame basis. In order to reduce the computational expense due to the multiple-model method, a technique of sharing similar Gaussian components is proposed. Acoustically similar components across an inventory of environmental models are selected by the proposed sub-optimal algorithm which employs the Kullback-Leibler similarity distance. The combined hybrid model, which consists of the selected Gaussian components, is used for noisy speech model sharing. The performance is examined using Aurora2 and speech data for an in-vehicle environment. The proposed feature compensation algorithm is compared with standard methods in the field (e.g., CMN, spectral subtraction, RATZ). The experimental results demonstrate that the proposed feature compensation schemes are very effective in realizing robust speech recognition in adverse noisy environments. The proposed model combination-based feature compensation method is superior to existing model-based feature compensation methods. Of particular interest is that the proposed method shows up to an 11.59% relative WER reduction compared to the ETSI AFE front-end method. The multi-model approach is effective at coping with changing noise conditions for input speech, producing comparable performance to the matched model condition. Applying the mixture sharing method brings a significant reduction in computational overhead, while maintaining recognition performance at a reasonable level with near real-time operation. (C) 2008 Elsevier B.V. All rights reserved.
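To make the mixture-sharing step concrete, the sketch below computes a symmetric Kullback-Leibler distance between diagonal-covariance Gaussians of the kind used to decide which components are acoustically similar; the threshold and toy components are assumptions, not the paper's settings.

import numpy as np

def kl_diag(mu_p, var_p, mu_q, var_q):
    """KL(p || q) for Gaussians with diagonal covariances (closed form)."""
    return 0.5 * np.sum(np.log(var_q / var_p)
                        + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def symmetric_kl(mu_p, var_p, mu_q, var_q):
    return kl_diag(mu_p, var_p, mu_q, var_q) + kl_diag(mu_q, var_q, mu_p, var_p)

# Toy cepstral-domain components; a pair below the threshold would be shared
comp_a = (np.array([0.10, -0.30]), np.array([0.50, 0.40]))
comp_b = (np.array([0.12, -0.28]), np.array([0.55, 0.38]))
print(symmetric_kl(*comp_a, *comp_b) < 0.05)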
C1 [Hansen, John H. L.] Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, CRSS, Richardson, TX 75080 USA.
RP Hansen, JHL (reprint author), Univ Texas Dallas, Dept Elect Engn, Erik Jonsson Sch Engn & Comp Sci, CRSS, 2601 N Floyd Rd,EC33, Richardson, TX 75080 USA.
EM john.hansen@utdallas.edu
CR Acero A., 1993, ACOUSTIC ENV ROBUSTN
Akbacak M, 2007, IEEE T AUDIO SPEECH, V15, P465, DOI 10.1109/TASL.2006.881694
Angkititrakul P., 2007, IEEE INT VEH S, P566
BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209
Droppo J., 2001, EUROSPEECH2001, P217
EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453
*ETSI, 2000, 201108V112200004 ETS
*ETSI, 2002, 202050V111200210 ETS
Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929
Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278
Hansen JH, 2005, IEEE T SPEECH AUDI P, V13, P712, DOI 10.1109/TSA.2005.852088
Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618
HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901
Hansen JHL, 1996, IEEE T SPEECH AUDI P, V4, P307, DOI 10.1109/89.506935
Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7
HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P169, DOI 10.1109/89.388143
Hansen JHL, 2004, DSP IN VEHICLE MOBIL
HIRSCH HG, 2000, AURORA EXPT FRAMEWOR
KAWAGUCHI N, 2004, DSP IN VEHICLE MOBIL, pCH1
Kim NS, 2002, SPEECH COMMUN, V37, P231, DOI 10.1016/S0167-6393(01)00013-9
KIM W, 2003, EUROSPEECH 2003, P677
KIM W, 2004, ICASSP2004, P989
LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902
Lee K.-F., 1989, AUTOMATIC SPEECH REC
LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010
Martin R., 1994, EUSIPCO 94, P1182
Morales N, 2006, INT CONF ACOUST SPEE, P533
Moreno P.J., 1996, THESIS CARNEGIE MELL
Moreno PJ, 1998, SPEECH COMMUN, V24, P267, DOI 10.1016/S0167-6393(98)00025-9
RAJ B, 2005, IEEE SIGNAL PROCESS, V22
SASOU A, 2003, EUROSPEECH2003, P29
Sasou A., 2004, ICSLP2004, P121
Schulte-Fortkamp B., 2007, ACOUST TODAY, V3, P7, DOI 10.1121/1.2961148
SEGURA JC, 2001, EUROSPEECH2001, P221
SINGH R, 2002, CHAPTER CRC HDB NOIS
Stouten V., 2004, ICASSP2004, P949
VARGA AP, 1990, INT CONF ACOUST SPEE, P845, DOI 10.1109/ICASSP.1990.115970
Westphal M, 2001, INT CONF ACOUST SPEE, P221, DOI 10.1109/ICASSP.2001.940807
WOMACK BD, 1996, IEEE P INT C AC SPEE, V1, P53
Womack BD, 1999, IEEE T SPEECH AUDI P, V7, P668, DOI 10.1109/89.799692
NR 40
TC 22
Z9 26
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2009
VL 51
IS 2
BP 83
EP 96
DI 10.1016/j.specom.2008.06.004
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 391AC
UT WOS:000262203800001
ER
PT J
AU Jallon, JF
Berthommier, F
AF Jallon, Julie Fontecave
Berthommier, Frederic
TI A semi-automatic method for extracting vocal tract movements from X-ray
films
SO SPEECH COMMUNICATION
LA English
DT Article
DE Cineradiography; Contour extraction; Low-frequency DCT components; Vocal
tract movements
ID SPEECH PRODUCTION
AB Despite the development of new imaging techniques, existing X-ray data remain an appropriate tool to study speech production phenomena. However, to exploit these images, the shapes of the vocal tract articulators must first be extracted. This task, usually performed manually, is long and laborious. This paper describes a semi-automatic technique for facilitating the extraction of vocal tract contours from complete sequences of large existing cineradiographic databases in the context of continuous speech production. The proposed method efficiently combines the human expertise required for marking a small number of key images and an automatic indexing of the video data to infer dynamic 2D data. Manually acquired geometrical data are associated with each image of the sequence via a similarity measure based on the low-frequency Discrete Cosine Transform (DCT) components of the images. Moreover, to reduce the reconstruction error and improve the geometrical contour estimation, we perform post-processing treatments, such as neighborhood averaging and temporal filtering. The method is applied independently for each articulator (tongue, velum, lips, and mandible). Then the acquired contours are combined to reconstruct the movements of the entire vocal tract. We carry out evaluations, including comparisons with manual markings and with another semi-automatic method. (C) 2008 Elsevier B.V. All rights reserved.
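The indexing idea can be illustrated as follows: describe each frame by its low-frequency 2D DCT block and match frames to manually marked key images by Euclidean distance on those descriptors. The block size and the random images standing in for radiographic frames are editorial placeholders, not the paper's parameters.

import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    m = np.cos(np.pi * (2 * np.arange(n) + 1) * k / (2 * n))
    m[0] *= np.sqrt(1.0 / n)
    m[1:] *= np.sqrt(2.0 / n)
    return m

def low_freq_descriptor(frame, keep=8):
    """2D DCT of a grayscale frame, keeping the top-left keep x keep coefficients."""
    h, w = frame.shape
    coeffs = dct_matrix(h) @ frame @ dct_matrix(w).T
    return coeffs[:keep, :keep].ravel()

def nearest_key_image(frame, key_frames, keep=8):
    d = low_freq_descriptor(frame, keep)
    dists = [np.linalg.norm(d - low_freq_descriptor(k, keep)) for k in key_frames]
    return int(np.argmin(dists))

# Toy usage: a slightly perturbed copy of one key image should match that key
rng = np.random.default_rng(0)
keys = [rng.random((64, 64)) for _ in range(3)]
print(nearest_key_image(keys[1] + 0.01 * rng.random((64, 64)), keys))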
C1 [Jallon, Julie Fontecave; Berthommier, Frederic] Domaine Univ, Dept Speech & Cognit, GIPSA Lab, F-38402 St Martin Dheres, France.
RP Berthommier, F (reprint author), Domaine Univ, Dept Speech & Cognit, GIPSA Lab, BP 46, F-38402 St Martin Dheres, France.
EM Julie.Fontecave@gipsa-lab.inpg.fr;
Frederic.Berthommier@gipsa-lab.inpg.fr
RI Fontecave-Jallon, Julie/M-5807-2014
CR Akgul YS, 1999, IEEE T MED IMAGING, V18, P1035, DOI 10.1109/42.811315
ARNAL A, 2000, P 23 JOURN ET PAR AU
BADIN P, 1995, P INT C AC TRONDH NO
BADIN P, 1998, P INT C SPOK LANG PR
BERTHOMMIER F, 2004, P INT C AC SPEECH SI
Bothorel A., 1986, CINERADIOGRAPHIE VOY
DEPAULA H, 2006, SPEECH PRODUCTION MO
Fant G., 1960, ACOUSTIC THEORY SPEE
FONTECAVE J, 2005, P EUR C SPEECH COMM
GUIARDMARIGNY T, 1996, PROGR SPEECH SYNTHES
HARDCAST.WJ, 1972, PHONETICA, V25, P197
HECKMANN M, 2000, P INT C SPOK LANG PR
HECKMANN M, 2003, P AUD VIS SPEECH PRO
HEINZ JM, 1964, J ACOUST SOC AM, V36
KASS M, 1987, INT J COMPUT VISION, V4, P321
Laprie Y., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607097
MAEDA S, 1979, 20 JOURN ET PAR GREN, P152
MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427
MUNHALL KG, 1995, J ACOUST SOC AM, V98, P1222, DOI 10.1121/1.413621
Narayanan S, 2004, J ACOUST SOC AM, V115, P1771, DOI 10.1121/1.1652588
PERKELL J, 1972, J ACOUST SOC AM, V92, P3078
Perkell JS, 1969, PHYSL SPEECH PRODUCT, V53
Potamianos G, 1998, P IEEE INT C IM PROC, V3, P173, DOI 10.1109/ICIP.1998.999008
Rao K. R., 1990, DISCRETE COSINE TRAN
ROY JP, 2003, INTRIC INTERFACE TRA, P163
RUBIN P, 1996, P 4 INT SEM SPEECH P
STEVENS KN, 1963, STL QPSR, V4, P11
Thimm G., 1999, P EUR C SPEECH COMM, P157
THIMM G, 1998, ILLUMINATION ROBUST
TIEDE MK, 1994, P INT C SPOK LANG PR
WOOD S, 1979, J PHONETICS, V7, P25
NR 31
TC 5
Z9 6
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2009
VL 51
IS 2
BP 97
EP 115
DI 10.1016/j.specom.2008.06.005
PG 19
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 391AC
UT WOS:000262203800002
ER
PT J
AU den Ouden, H
Noordman, L
Terken, J
AF den Ouden, Hanny
Noordman, Leo
Terken, Jacques
TI Prosodic realizations of global and local structure and rhetorical
relations in read aloud news reports
SO SPEECH COMMUNICATION
LA English
DT Article
DE Discourse; Text structure; Prosody; Global structure; Local structure;
Rhetorical relations
ID COHERENCE RELATIONS; DISCOURSE
AB The aim of this research is to study the effects of global and local structure of texts and of rhetorical relations between sentences on the prosodic realization of sentences in read aloud text. Twenty texts were analyzed using Rhetorical Structure Theory. Based on these analyses, the global structure in terms of hierarchical level, the local structure in terms of the relative importance of text segments and the rhetorical relations between text segments were identified. The texts were read aloud. Pause durations preceding segments, F0-maxima and articulation rates of the segments were measured. It was found that speakers give prosodic indications about hierarchical level by means of variations in pause duration and pitch range: the higher the segments are connected in the text structure, the longer the preceding pauses and the higher the F0-maxima are realized. Also, it was found that speakers articulate important segments more slowly than unimportant segments, and that they read aloud causally related segments with shorter in-between pauses and at a faster rate than non-causally related segments. We conclude that variation in pause duration and F0-maximum is a robust means for speakers to express the global structure of texts, although this does not apply to all speakers. Speakers also vary pause duration and articulation rate to indicate importance of sentences and meaning relations between sentences. (C) 2008 Elsevier B.V. All rights reserved.
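A minimal sketch of the three measures, assuming a hypothetical segment annotation format (speech onset, offset, syllable count, F0 track) rather than the study's actual materials:

# Each tuple is an assumed annotation: (speech_onset_s, speech_offset_s, n_syllables, f0_track_hz)
segments = [
    (0.80, 3.10, 11, [190, 210, 245, 230]),
    (3.65, 6.40, 12, [185, 205, 220, 200]),
]

prev_offset = 0.0
for onset, offset, n_syll, f0 in segments:
    pause = onset - prev_offset              # pause duration preceding the segment
    art_rate = n_syll / (offset - onset)     # articulation rate over speaking time (syll/s)
    f0_max = max(f0)                         # F0 maximum within the segment
    print(f"pause={pause:.2f}s  rate={art_rate:.1f} syll/s  F0max={f0_max} Hz")
    prev_offset = offset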
C1 [den Ouden, Hanny] Univ Utrecht, Fac Humanities, NL-3512 JK Utrecht, Netherlands.
[Noordman, Leo] Tilburg Univ, Fac Arts, NL-5000 LE Tilburg, Netherlands.
[Terken, Jacques] Eindhoven Univ Technol, Dept Ind Design, NL-5600 MB Eindhoven, Netherlands.
RP den Ouden, H (reprint author), Univ Utrecht, Fac Humanities, Trans 10, NL-3512 JK Utrecht, Netherlands.
EM Hanny.denOuden@let.uu.nl; noordman@uvt.nl; j.m.b.terken@tue.nl
FU SOBU (Cooperation of Brabant Universities)
FX The research reported in this paper was funded by SOBU (Cooperation of
Brabant Universities). Our thanks go to Leo Vogten and Jan Roelof de
Pijper for making available their programming software.
CR Bateman JA, 1997, DISCOURSE PROCESS, V24, P3
BRUBAKER RS, 1972, J PSYCHOLINGUIST RES, V1, P141, DOI 10.1007/BF01068103
Cooper W. E., 1980, SYNTAX SPEECH
den Ouden H., 2004, PROSODIC REALIZATION
DENOUDEN H, 2001, P 7 EUR C SPEECH COM, P91
DENOUDEN H, 1998, IPO ANN PROGR REPORT, V33, P129
HERMES DJ, 1988, J ACOUST SOC AM, V83, P257, DOI 10.1121/1.396427
Hirschberg J., 1992, P SPEECH NAT LANG WO, P441, DOI 10.3115/1075527.1075632
Hirschberg JG, 1996, NATO ADV SCI I A-LIF, V286, P293
Lehiste I., 1975, STRUCTURE PROCESS SP, P195
MANNS MP, 1988, CLIN LAB MED, V8, P281
Noordman L, 1999, AMST STUD THEORY HIS, V176, P133
POTTER A, 2007, THESIS NOVA SE U
Prince Ellen, 1981, RADICAL PRAGMATICS, P223
SANDERS TJM, 1992, DISCOURSE PROCESS, V15, P1
Sanders TJM, 2000, DISCOURSE PROCESS, V29, P37, DOI 10.1207/S15326950dp2901_3
SCHILPEROORD J, 1996, THESIS UTRECHT U
SILVERMAN T, 1987, THESIS CAMBRIDGE U C
Swerts M, 1997, J ACOUST SOC AM, V101, P514, DOI 10.1121/1.418114
THORNDYKE PW, 1977, COGNITIVE PSYCHOL, V9, P77, DOI 10.1016/0010-0285(77)90005-6
THORSEN N, 1985, J ACOUST SOC AM, V80, P1205
Van Donzel M., 1999, THESIS U AMSTERDAM
Wennerstrom A, 2001, MUSIC EVERYDAY SPEEC
Wichmann A., 2000, INTONATION TEXT DISC
YULE G, 1980, LINGUA, V52, P33, DOI 10.1016/0024-3841(80)90016-9
NR 25
TC 14
Z9 14
PU ELSEVIER SCIENCE BV
PI AMSTERDAM
PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS
SN 0167-6393
EI 1872-7182
J9 SPEECH COMMUN
JI Speech Commun.
PD FEB
PY 2009
VL 51
IS 2
BP 116
EP 129
DI 10.1016/j.specom.2008.06.003
PG 14
WC Acoustics; Computer Science, Interdisciplinary Applications
SC Acoustics; Computer Science
GA 391AC
UT WOS:000262203800003
ER
EF