FN Thomson Reuters Web of Science™ VR 1.0 PT J AU Hamalainen, A ten Bosch, L Boves, L AF Hamalainen, Annika ten Bosch, Louis Boves, Lou TI Modelling pronunciation variation with single-path and multi-path syllable models: Issues to consider SO SPEECH COMMUNICATION LA English DT Article DE ASR; HMM; Topology; Syllable; Pronunciation variation ID AUTOMATIC SPEECH RECOGNITION AB In this paper, we construct context-independent single-path and multi-path syllable models aimed at improved pronunciation variation modelling. We use phonetic transcriptions to define the topologies of the syllable models and to initialise the model parameters, and the Baum-Welch algorithm for the re-estimation of the model parameters. We hypothesise that the richer topology of multi-path syllable models would be better at accounting for pronunciation variation than context-dependent phone models that can only account for the effects of the left and right neighbours, or single-path syllable models whose power of modelling segmental variation would seem to be limited. However, both context-dependent phone models and single-path syllable models outperform multi-path syllable models on a large-vocabulary continuous speech recognition task. Careful analyses of the errors made by the recognisers with single-path and multi-path syllable models show that the most important factors affecting the speech recognition performance are syllable context and lexical confusability. In addition, the speech recognition results suggest that the benefits of the greater acoustic modelling accuracy of the multi-path syllable models can only be reaped if the information about the syllable-level pronunciation variation can be linked with the word-level information in the language model. (C) 2008 Elsevier B.V. All rights reserved. C1 [Hamalainen, Annika; ten Bosch, Louis; Boves, Lou] Radboud Univ Nijmegen, Fac Arts, CLST, NL-6500 HD Nijmegen, Netherlands. RP Hamalainen, A (reprint author), Radboud Univ Nijmegen, Fac Arts, CLST, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM A.Hamalainen@let.ru.nl; L.ten-Bosch@let.ru.nl; L.Boves@let.ru.nl RI Hamalainen, Annika/M-5155-2013 OI Hamalainen, Annika/0000-0002-6845-4987 FU Netherlands Organisation for Scientific Research (NWO); European FET project ACORNS [nr FP6-034362] FX Annika Hamalainen's research was carried out within the Interactive Multimodal Information eXtraction (IMIX) program, which is sponsored by the Netherlands Organisation for Scientific Research (NWO). Louis ten Bosch participates in the European FET project ACORNS (nr FP6-034362). The authors would like to thank two anonymous reviewers for their valuable ideas and suggestions. CR Anguita J, 2005, IEEE SIGNAL PROC LET, V12, P585, DOI 10.1109/LSP.2005.851256 ARADILLA G, 2006, P IEEE INT C AC SPEE Baayen R. 
H., 1995, CELEX LEXICAL DATABA Bell A, 2003, J ACOUST SOC AM, V113, P1001, DOI 10.1121/1.1534836 BOOIJ G, 1999, PHONOLOGY DUTCH Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9 Deng L, 2006, IEEE T AUDIO SPEECH, V14, P1492, DOI 10.1109/TASL.2006.878265 DEWACHTER M, 2007, THESIS U LEUVEN LEUV ELFFERS B, 2005, ADAPT ALGORITHM DYNA Ganapathiraju A, 2001, IEEE T SPEECH AUDI P, V9, P358, DOI 10.1109/89.917681 Greenberg S., 2000, P ISCA WORKSH AUT SP, P195 Greenberg S, 1999, SPEECH COMMUN, V29, P159, DOI 10.1016/S0167-6393(99)00050-3 Hain T, 2005, SPEECH COMMUN, V46, P171, DOI 10.1016/j.specom.2005.03.008 HAMALAINEN A, 2006, P 9 INT C SPOK LANG HAMALAINEN A, 2006, PUBLICATIONS U HELSI, P57 HAMALAINEN A, 2007, P 10 EUR C SPEECH CO HAMALAINEN A, 2007, EURASIP J AUDIO SPEE, V11 HAMALAINEN A, 2007, P IEEE INT C AC SPEE Han Y, 2007, IEEE T AUDIO SPEECH, V15, P1425, DOI 10.1109/TASL.2007.894529 JOHNSON K., 2004, SPONTANEOUS SPEECH D, P29 Jones R. J., 1997, P EUR 97, P1171 JOUVET D, 2004, P 8 INT C SPOK LANG, P645 Jurafsky D., 2001, P IEEE INT C AC SPEE, V1, P577 Jurafsky Daniel, 2001, FREQUENCY EMERGENCE, P229, DOI 10.1075/tsl.45.13jur KESSENS J, 2002, P ISCA TUT RES WORKS, P18 Kessens JM, 2003, SPEECH COMMUN, V40, P517, DOI 10.1016/S0167-6393(02)00150-4 Lee K.-F., 1989, AUTOMATIC SPEECH REC LIN SS, 2007, P 10 EUR C SPEECH CO MARKOV K, 2007, P 10 EUR C SPEECH CO MCALLASTER D, 1999, P 6 EUR C SPEECH COM, P1787 Oostdijk Nelleke, 2002, P LREC 2002, V1, P340 Ostendorf M., 1999, P IEEE WORKSH AUT SP PLANNERER G, 1992, P IEEE INT C AC SPEE, V1, P581 Pluymaekers M, 2005, PHONETICA, V62, P146, DOI 10.1159/000090095 Pols L.C.W., 2003, P 8 EUR C SPEECH COM, P769 ROE DB, 1994, P ICSLP, P227 SARACLAR M, 2000, P IEEE INT C AC SPEE, V3, P1679 Saraclar M, 2000, COMPUT SPEECH LANG, V14, P137, DOI 10.1006/csla.2000.0140 SETHY A, 2003, P IEEE WORKSH AUT SP SETHY A, 2003, P ICASSP 2003, V1, P772 Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 Wells J. C., 2000, LONGMAN PRONUNCIATIO WESTER M, 2002, THESIS U NIJMEGEN NI Young S., 2002, HTK BOOK HTK VERSION Zen H, 2007, COMPUT SPEECH LANG, V21, P153, DOI 10.1016/j.csl.2006.01.002 NR 45 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2009 VL 51 IS 2 BP 130 EP 150 DI 10.1016/j.specom.2008.07.001 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391AC UT WOS:000262203800004 ER PT J AU Pernkopf, F Van Pham, T Bilmes, JA AF Pernkopf, Franz Van Pham, Tuan Bilmes, Jeff A. TI Broad phonetic classification using discriminative Bayesian networks SO SPEECH COMMUNICATION LA English DT Article DE Broad phonetic class recognition; Wavelet transform; Time-scale features; Bayesian networks; Discriminative learning ID SPEECH RECOGNITION; LOGISTIC-REGRESSION; PATTERN-RECOGNITION; CLASSIFIERS; INFORMATION; TUTORIAL AB We present an approach to broad phonetic classification, defined as mapping acoustic speech frames into broad (or clustered) phonetic categories. Our categories consist of silence, general voiced, general unvoiced, mixed sounds, voiced closure, and plosive release, and are sufficiently rich to allow accurate time-scaling of speech signals to improve their intelligibility in, e.g. voice-mail applications. There are three main aspects to this work. 
First, in addition to commonly used speech features, we employ acoustic time-scale features based on the intra-scale relationships of the energy from different wavelet subbands. Secondly, we use and compare against discriminatively learned Bayesian networks. By this, we mean Bayesian networks whose structure and/or parameters have been optimized using a discriminative objective function. We utilize a simple order-based greedy heuristic for learning discriminative structure based on mutual information. Given an ordering, we can find the discriminative classifier structure with O(N^q) score evaluations (where q is the maximum number of parents per node). Third, we provide a large assortment of empirical results, including gender dependent/independent experiments on the TIMIT corpus. We evaluate both discriminative and generative parameter learning on both discriminatively and generatively structured Bayesian networks and compare against generatively trained Gaussian mixture models (GMMs), and discriminatively trained neural networks (NNs) and support vector machines (SVMs). Results show that: (i) the combination of time-scale features and mel-frequency cepstral coefficients (MFCCs) provides the best performance; (ii) discriminative learning of Bayesian network classifiers is superior to the generative approaches; (iii) discriminative classifiers (NNs and SVMs) perform better than both discriminatively and generatively trained and structured Bayesian networks; and (iv) the advantages of generative yet discriminatively structured Bayesian network classifiers still hold in the case of missing features while the discriminatively trained NNs and SVMs are unable to deal with such a case. This last result is significant since it suggests that discriminative Bayesian networks are the most appropriate approach when missing features are common. (C) 2008 Elsevier B.V. All rights reserved. C1 [Pernkopf, Franz; Van Pham, Tuan] Graz Univ Technol, Signal Proc & Speech Commun Lab, A-8010 Graz, Austria. [Bilmes, Jeff A.] Univ Washington, Dept Elect Engn, Seattle, WA 98195 USA. [Van Pham, Tuan] Danang Univ Technol, Fac Elect & Telecommun, Danang, Vietnam. RP Pernkopf, F (reprint author), Graz Univ Technol, Signal Proc & Speech Commun Lab, Inffeldgasse 12, A-8010 Graz, Austria. EM pernkopf@tugraz.at; v.t.pham@tugraz.at; bilmes@ee.washington.edu FU Austrian Science Fund [P19737-N15]; ONR MURI [N000140510388] FX We would like to acknowledge support for this project from the Austrian Science Fund (Project number P19737-N15). This work was also supported by an ONR MURI Grant, No. N000140510388. This research was carried out in the context of COAST (http://www.coast.at). We gratefully acknowledge funding by the Austrian KNet Program, ZID Zentrum fuer Innovation und Technology, Vienna, the Steirische WirtschaftsfoerderungsGmbH, and the Land Steiermark. CR Acid S, 2005, MACH LEARN, V59, P213 ATAL BS, 1976, IEEE T ACOUST SPEECH, V24, P201, DOI 10.1109/TASSP.1976.1162800 BAHL L, 1986, 11 IEEE INT C AC SPE, P49 BARTELS C, 2007, IEEE AUTOMATIC SPEEC, P335 Bartlett PL, 2006, J AM STAT ASSOC, V101, P138, DOI 10.1198/016214505000000907 Bilmes J., 2001, DISCR STRUCT GRAPH M BISHOP C, 2007, BAYESIAN STAT, V3, P3 Bishop C.
M., 1995, NEURAL NETWORKS PATT BORYS S, 2005, 9 EUR C SPEECH COMM, P697 Bourlard Ha, 1994, CONNECTIONIST SPEECH Buntine W, 1991, 7TH P C UNC ART INT, P52 Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555 CAMPBELL WN, 1991, J PHONETICS, V19, P37 CHILDERS DG, 1989, IEEE T ACOUST SPEECH, V37, P1771, DOI 10.1109/29.46561 CHOW CK, 1968, IEEE T INFORM THEORY, V14, P462, DOI 10.1109/TIT.1968.1054142 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 COOPER GF, 1992, MACH LEARN, V9, P309, DOI 10.1007/BF00994110 Cover T., 1991, ELEMENTS INFORM THEO Cowell R., 1999, PROBABILISTIC NETWOR de Campos LM, 2006, J MACH LEARN RES, V7, P2149 Donnellan O, 2003, 3RD IEEE INTERNATIONAL CONFERENCE ON ADVANCED LEARNING TECHNOLOGIES, PROCEEDINGS, P165 Dougherty J., 1995, 12 INT C MACH LEARN, P194 Duda R. O., 2001, PATTERN CLASSIFICATI EPHRAIM Y, 1990, IEEE T INFORM THEORY, V36, P372, DOI 10.1109/18.52483 EPHRAIM Y, 1989, IEEE T INFORM THEORY, V35, P1001, DOI 10.1109/18.42209 Fayyad U., 1993, 13 INT JOINT C ART I, P1022 FEI S, 2006, 31 IEEE INT C AC SPE, P265 FRIEDMAN N, 1999, 15 INT C UNC ART INT, P196 Friedman N, 1997, MACH LEARN, V29, P131, DOI 10.1023/A:1007465528199 Greiner R, 2005, MACH LEARN, V59, P297, DOI 10.1007/s10994-005-0469-0 GREINER R, 2002, 18 C AAAI, P167 Grossman D., 2004, 21 INT C MACH LEARN, P361 HALBERSTADT A, 1997, 5 EUR C SPEECH COMM, P401 Heckerman D., 1995, MSRTR9506 Jebara Tony, 2001, THESIS MIT Jordan M. I., 1999, LEARNING GRAPHICAL M Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 KEDEM B, 1986, P IEEE, V74, P1477, DOI 10.1109/PROC.1986.13663 Keogh E.J., 1999, 7 INT WORKSH ART INT, P225 Kirchhoff K, 2002, SPEECH COMMUN, V37, P303, DOI 10.1016/S0167-6393(01)00020-6 Kruskal J., 1956, P AM MATH SOC, V7, P48, DOI DOI 10.1090/S0002-9939-1956-0078686-7 KUBIN G, 1994, P INT C AC SPEECH SI, V1, P453 KUBIN G, 1993, IEEE WORKSH SPEECH C, P35 KUWABARA H, 2000, CH ACOUSTIC PERCEPTU, P163 Lamel L., 1986, P DARPA SPEECH REC W LEUNG H, 1993, 18 IEEE INT C AC SPE, P657 LEVINSON S, 1989, HUM LANG TECHN C, P75 LIN H, 2007, IEEE AUTOMATIC SPEEC, P478 MALKIN J, 2008, 33 IEEE INT C AC SPE, P4113 MINGHU J, 1996, 3 INT C SIGN PROC, V2, P1469 Mitchell T.M., 1997, MACHINE LEARNING Murphy K.P., 2002, THESIS U CALIFORNIA OLIVE JP, 1993, ACOUSTIC AM ENGLISH Parveen S, 2004, 2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL I, PROCEEDINGS, P733 PAZZANI M, 1996, LEARNING DATA ARTIFI, V5, P248 Pearl J., 1988, PROBABILISTIC REASON PERNKOPF F, 2008, 10 INT S ART INT MAC PERNKOPF F, 2005, 22 INT C MACH LEARN, P657 Pernkopf F, 2005, PATTERN RECOGN, V38, P1, DOI 10.1016/j.patcog.2004.05.012 PHAM TV, 2005, 30 IEEE INT C AC SPE, P401 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828 Roos T, 2005, MACH LEARN, V59, P267 SALOMON J, 2002, ICSLP, P2645 Sanneck H, 1998, IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA COMPUTING AND SYSTEMS, PROCEEDINGS, P140 Scholkopf B., 2001, LEARNING KERNELS SUP Smith ND, 2002, INT CONF ACOUST SPEE, P77 SUBRAMANYA A, 2005, 9 EUR C SPEECH COMM, P393 Teyssier M., 2005, 21 C UNC AI UAI, P584 Vetterli M., 1995, WAVELETS SUBBAND COD WETTIG H, 2003, INT JOINT C ART INT, P491 ZHANG L, 1997, 22 IEEE INT C AC SPE, V2, P735 NR 73 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech
Commun. PD FEB PY 2009 VL 51 IS 2 BP 151 EP 166 DI 10.1016/j.specom.2008.07.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391AC UT WOS:000262203800005 ER PT J AU Heeringa, W Johnson, K Gooskens, C AF Heeringa, Wilbert Johnson, Keith Gooskens, Charlotte TI Measuring Norwegian dialect distances using acoustic features SO SPEECH COMMUNICATION LA English DT Article DE Acoustics; Acoustic features; Dialectology; Dialectometry; Dialect; Phonetics; Phonology AB Levenshtein distance has become a popular tool for measuring linguistic dialect distances, and has been applied to Irish Gaelic, Dutch, German and other dialect groups. The method, in the current state of the art, depends upon phonetic transcriptions; even when acoustic differences are used, the number of segments in the transcriptions is used for speech rate normalization. The goal of this paper is to find a fully acoustic measure which approximates the quality of semi-acoustic measures that rely on tagged speech. We use a set of 15 Norwegian dialect recordings and test the hypothesis that the use of the acoustic signal only, without transcriptions, is sufficient for obtaining results which largely agree with both traditional Norwegian dialectology and the perception of the speakers themselves. We use formant trajectories and consider both the Hertz and the Bark scale. We experiment with an approach in which Z-scores per frame are used instead of the original frequency values. Besides formant tracks, we also consider zero crossing rates: the number of times per interval that the amplitude waveform crosses the zero line. The zero crossing rate is sensitive to the difference between voiced and unvoiced speech sections. When using the fully acoustic measure on the basis of the combined representation with normalized frequency values, we obtained results comparable with the results obtained with the semi-acoustic measure. We applied cluster analysis and multidimensional scaling to distances obtained with this method and found results which largely agree with both the results of traditional Norwegian dialectology and with the perception of the speakers. When scaling to three dimensions, we found the first dimension responsible for gender differences. However, when leaving out this dimension, dialect specific information is lost as well. (C) 2008 Elsevier B.V. All rights reserved. C1 [Heeringa, Wilbert] Variationist Linguist, Meertens Inst, NL-1090 GG Amsterdam, Netherlands. [Johnson, Keith] Univ Calif Berkeley, Dept Linguist, Berkeley, CA 94720 USA. [Gooskens, Charlotte] Univ Groningen, Scandinavian Dept, NL-9700 AS Groningen, Netherlands. RP Heeringa, W (reprint author), Variationist Linguist, Meertens Inst, POB 94264, NL-1090 GG Amsterdam, Netherlands. EM wilbert.heeringa@meertens.knaw.nl; keithjohnson@berkeley.edu; c.s.gooskens@rug.nl CR Adank P, 2004, J ACOUST SOC AM, V116, P3099, DOI 10.1121/1.1795335 Bonnet E, 2002, J STAT SOFTW, V7, P1 Chambers Jack K., 1998, DIALECTOLOGY, V2nd CHRISTIANSEN H, 1954, FRA NORSK MALFOREGRA, P39 FINTOFT K, 1980, MAAL MINNE, P66 FRANKEL J, 2000, P 6 INT C SPOK LANG GOEBL H, 1982, PHILOS HIST D, V157 Goebl Hans, 1993, P INT C DIAL, V1, P37 Gooskens C., 2004, LANG VAR CHANGE, V16, P189 Heeringa W., 2004, THESIS RIJKSUNIVERSI Heeringa W, 2003, COMPUT HUMANITIES, V37, P293, DOI 10.1023/A:1025087115665 HEERINGA W, 2005, US WURK TYDSKRIFT FR, V54, P125 Jain A.
K., 1988, ALGORITHMS CLUSTERIN Kessler Brett, 1995, P 7 C EUR CHAPT ASS, P60, DOI 10.3115/976973.976983 KRISTOFFERSEN G, 2000, PHONOLOGY WORLDS LAN Kruskal J, 1999, TIME WARPS STRING ED, P1 MANTEL N, 1967, CANCER RES, V27, P209 NERBONNE J, 1996, CLIN, V6, P185 POTTER RK, 1947, BELL TELEPHONE LAB S Rietveld A. C. M., 1997, ALGEMENE FONETIEK SEGUY JEAN, 1973, REV LINGUIST ROMAN, V37, P1 SKJEKKELAND M, 1997, NORSKE DIALEKTANE TR Sokal Robert R., 1973, SERIES BOOKS BIOL TRAUNMULLER H, 1990, J ACOUST SOC AM, V88, P97 NR 24 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2009 VL 51 IS 2 BP 167 EP 183 DI 10.1016/j.specom.2008.07.006 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391AC UT WOS:000262203800006 ER PT J AU Andersen, TS Tiippana, K Laarni, J Kojo, I Sams, M AF Andersen, Tobias S. Tiippana, Kaisa Laarni, Jari Kojo, Ilpo Sams, Mikko TI The role of visual spatial attention in audiovisual speech perception SO SPEECH COMMUNICATION LA English DT Article DE Speech perception; Multisensory; Attention; McGurk effect ID SELECTIVE ATTENTION; AUDITORY-CORTEX; MULTISENSORY INTEGRATION; RHESUS-MONKEY; SEEING SPEECH; HUMAN BRAIN; INFORMATION; DIRECTION; VOICES; DEPEND AB Auditory and visual information is integrated when perceiving speech, as evidenced by the McGurk effect in which viewing an incongruent talking face categorically alters auditory speech perception. Audiovisual integration in speech perception has long been considered automatic and pre-attentive but recent reports have challenged this view. Here we study the effect of visual spatial attention on the McGurk effect. By presenting a movie of two faces symmetrically displaced to each side of a central fixation point and dubbed with a single auditory speech track, we were able to discern the influences from each of the faces and from the voice on the auditory speech percept. We found that directing visual spatial attention towards a face increased the influence of that face on auditory perception. However, the influence of the voice on auditory perception did not change, suggesting that audiovisual integration did not change. Visual spatial attention was also able to select between the faces when lip reading. This suggests that visual spatial attention acts at the level of visual speech perception prior to audiovisual integration and that the effect propagates through audiovisual integration to influence auditory perception. (C) 2008 Elsevier B.V. All rights reserved. C1 [Andersen, Tobias S.] Univ Copenhagen, Ctr Computat Cognit Modeling, Dept Psychol, DK-1361 Copenhagen K, Denmark. [Andersen, Tobias S.] Tech Univ Denmark, Informat & Math Modeling, DK-2800 Lyngby, Denmark. [Tiippana, Kaisa; Sams, Mikko] Aalto Univ, Dept Biomed Engn & Computat Sci, Helsinki 02015, Finland. [Laarni, Jari] VTT Tech Res Ctr, Espoo 02044, Finland. [Kojo, Ilpo] Helsinki Sch Econ, Ctr Knowledge & Innovat Res, Helsinki 00101, Finland. RP Andersen, TS (reprint author), Univ Copenhagen, Ctr Computat Cognit Modeling, Dept Psychol, Linnesgade 22, DK-1361 Copenhagen K, Denmark. EM ta@imm.dtu.dk RI Sams, Mikko/G-7060-2012; Andersen, Tobias/I-5317-2013 OI Andersen, Tobias/0000-0002-0263-1354 CR Alsius A, 2005, CURR BIOL, V15, P839, DOI 10.1016/j.cub.2005.03.046 Andersen T.
S., 2001, P 4 INT ESCA ETRW C, P172 Andersen TS, 2004, COGNITIVE BRAIN RES, V21, P301, DOI 10.1016/j.cogbrainres.2004.06.004 ANDERSEN TS, 2002, P NF2002 Bertelson P, 2000, PERCEPT PSYCHOPHYS, V62, P321, DOI 10.3758/BF03205552 Bolognini N, 2005, EXP BRAIN RES, V160, P273, DOI 10.1007/s00221-004-2005-z Calvert GA, 2000, CURR BIOL, V10, P649, DOI 10.1016/S0960-9822(00)00513-3 Calvert GA, 2001, CEREB CORTEX, V11, P1110, DOI 10.1093/cercor/11.12.1110 CAMPBELL R, 1992, PHILOS T ROY SOC B, V335, P39, DOI 10.1098/rstb.1992.0005 CAMPBELL R, 1996, SPEECHREADING HUMANS, P115 CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 Colin C, 2002, CLIN NEUROPHYSIOL, V113, P495, DOI 10.1016/S1388-2457(02)00024-X CROWTHER CS, 1995, PSYCHOL REV, V102, P396, DOI 10.1037//0033-295X.102.2.396 CUTTING JE, 1992, J EXP PSYCHOL GEN, V121, P364 Frassinetti F, 2002, EXP BRAIN RES, V147, P332, DOI 10.1007/s00221-002-1262-y Ghazanfar AA, 2005, J NEUROSCI, V25, P5004, DOI 10.1523/JNEUROSCI.0799-05.2005 Grant KW, 1998, J ACOUST SOC AM, V104, P2438, DOI 10.1121/1.423751 Lavie N, 2005, TRENDS COGN SCI, V9, P75, DOI 10.1016/j.tics.2004.12.004 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 LIBERMAN AM, 1989, SCIENCE, V243, P489, DOI 10.1126/science.2643163 Luck SJ, 1997, J NEUROPHYSIOL, V77, P24 Massaro D. W., 1998, PERCEIVING TALKING F Massaro D. W., 1987, SPEECH PERCEPTION EA MASSARO DW, 1984, CHILD DEV, V55, P1777, DOI 10.1111/j.1467-8624.1984.tb00420.x Massaro DW, 1996, J ACOUST SOC AM, V100, P1777, DOI 10.1121/1.417342 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 Möttönen Riikka, 2002, Brain Res Cogn Brain Res, V13, P417 Munhall KG, 1996, PERCEPT PSYCHOPHYS, V58, P351, DOI 10.3758/BF03206811 Musacchia G, 2006, EXP BRAIN RES, V168, P1, DOI 10.1007/s00221-005-0071-5 Naatanen R., 1992, ATTENTION BRAIN FUNC Nishitani N, 2002, NEURON, V36, P1211, DOI 10.1016/S0896-6273(02)01089-9 Pekkola J, 2005, NEUROREPORT, V16, P125, DOI 10.1097/00001756-200502080-00010 PITT MA, 1995, J EXP PSYCHOL LEARN, V21, P1065, DOI 10.1037//0278-7393.21.4.1065 POSNER MI, 1980, J EXP PSYCHOL GEN, V109, P160, DOI 10.1037//0096-3445.109.2.160 POSNER MI, 1980, Q J EXP PSYCHOL, V32, P3, DOI 10.1080/00335558008248231 REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191 SAMS M, 1991, NEUROSCI LETT, V127, P141, DOI 10.1016/0304-3940(91)90914-F Schroeder CE, 2005, CURR OPIN NEUROBIOL, V15, P454, DOI 10.1016/j.conb.2005.06.008 Schroeder CE, 2002, COGNITIVE BRAIN RES, V14, P187, DOI 10.1016/S0926-6410(02)00073-3 Schwartz JL, 2006, J ACOUST SOC AM, V120, P1795, DOI 10.1121/1.2258814 SELTZER B, 1994, J COMP NEUROL, V343, P445, DOI 10.1002/cne.903430308 Soto-Faraco S, 2004, COGNITION, V92, pB13, DOI 10.1016/j.cognition.2003.10.005 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Talsma D, 2005, J COGNITIVE NEUROSCI, V17, P1098, DOI 10.1162/0898929054475172 Tiippana K, 2004, EUR J COGN PSYCHOL, V16, P457, DOI 10.1080/09541440340000268 TREISMAN AM, 1980, COGNITIVE PSYCHOL, V12, P97, DOI 10.1016/0010-0285(80)90005-5 Tuomainen J, 2005, COGNITION, V96, pB13, DOI 10.1016/j.cognition.2004.10.004 Vroomen J, 2001, COGN AFFECT BEHAV NE, V1, P382, DOI 10.3758/CABN.1.4.382 Vroomen J, 2000, TRENDS COGN SCI, V4, P37, DOI 10.1016/S1364-6613(99)01426-6 Vroomen J, 2001, PERCEPT PSYCHOPHYS, V63, P651, DOI 10.3758/BF03194427 WARREN DH, 1979, PERCEPTION, V8, P323, DOI 10.1068/p080323 NR 51 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 
1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2009 VL 51 IS 2 BP 184 EP 193 DI 10.1016/j.specom.2008.07.004 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391AC UT WOS:000262203800007 ER PT J AU Andrianakis, I White, PR AF Andrianakis, I. White, P. R. TI Speech spectral amplitude estimators using optimally shaped Gamma and Chi priors SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; MMSE; MAP; Gamma; Chi ID ENHANCEMENT AB In this paper, four STFT based speech enhancement algorithms are proposed. The algorithms enhance speech by estimating its short time spectral amplitude and are combinations of two estimators (MMSE and MAP) with two speech spectral amplitude priors (Gamma and Chi). The proposed priors have a shape parameter a, whose effect on the quality of speech is a focal point of our investigation. Rather than using a priori estimated values of a, we seek those values that maximise the quality of the enhanced speech, in an a posteriori fashion. The performance of the algorithms is first evaluated as a function of the shape parameter a and optimal values are then sought by means of a formal subjective listening test. Finally, the parallel examination of four speech enhancement algorithms offers an insight into the relative importance of the employed priors and estimators, as the proposed algorithms are only different with respect to these two elements. (C) 2008 Elsevier B.V. All rights reserved. C1 [Andrianakis, I.; White, P. R.] Univ Southampton, Inst Sound & Vibrat Res, Southampton SO17 1BJ, Hants, England. RP Andrianakis, I (reprint author), Univ Southampton, Inst Sound & Vibrat Res, Southampton SO17 1BJ, Hants, England. EM ia@isvr.soton.ac.uk; prw@isvr.soton.ac.uk FU EPSRC [GR/S30238/01] FX The authors would like to thank Prof. Saeed Vaseghi, Esfandiar Zavarehei, Qin Yan, Dr. Ben Milner, Jonathan Darch and EPSRC Grant No. GR/S30238/01. The subjects that participated in the subjective listening test are also gratefully acknowledged. CR Abramowitz M., 1965, HDB MATH FUNCTIONS Andrianakis I, 2006, P IEEE INT C AC SPEE, V3, P1068 ANDRIANAKIS I, 2007, THESIS U SOUTHAMPTON BARROWES B, 2007, MATLAB ROUTINES COMP Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 Cohen I, 2005, IEEE T SPEECH AUDI P, V13, P870, DOI 10.1109/TSA.2005.851940 DAT TH, 2005, P 30 IEEE INT C AC S, V4, P181 Deller J. R., 1993, DISCRETE TIME PROCES EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Flannery B. P., 1992, NUMERICAL RECIPES C Gander W, 2000, BIT, V40, P84, DOI 10.1023/A:1022318402393 Gradshteyn I. S., 1965, TABLES INTEGRALS SER Johnson N.L., 1994, CONTINUOUS UNIVARIAT, VI Kraskov A, 2004, PHYS REV E, V69, DOI [10.1103/PhysRevD.69.043502, 10.1103/PhysRevD.69.023502] Kullback S., 1959, INFORM THEORY STAT LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929 Lotter T., 2004, P EUSIPCO 04 VIENN A, P1457 Martin R, 2005, IEEE T SPEECH AUDI P, V13, P845, DOI 10.1109/TSA.2005.851927 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 Moore D. 
S., 1989, INTRO PRACTICE STAT Papoulis A, 1965, PROBABILITY RANDOM V Rix A., 2001, P IEEE INT C AC SPEE, P749 Van Trees H., 1968, DETECTION ESTIMATION, V1st WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 WOLFE PJ, 2003, EURASIP J APPL SIG P, V10, P1043 NR 27 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2009 VL 51 IS 1 BP 1 EP 14 DI 10.1016/j.specom.2008.05.018 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 381GQ UT WOS:000261524100001 ER PT J AU Park, HM Stern, RM AF Park, Hyung-Min Stern, Richard M. TI Spatial separation of speech signals using amplitude estimation based on interaural comparisons of zero-crossings SO SPEECH COMMUNICATION LA English DT Article DE Automatic speech recognition; Noise robustness; Speech enhancement; Sound source segregation; Interaural time differences; Zero-crossings ID SOUND LOCALIZATION; RECOGNITION; MODEL AB This paper describes an algorithm called zero-crossing-based amplitude estimation (ZCAE) that enhances speech by reconstructing the desired signal from a mixture of two signals using continuously-variable weighting factors, based on pre-processing that is motivated by the well-known ability of the human auditory system to resolve spatially-separated signals. Although most conventional methods of signal separation have been based on interaural time differences (ITDs) derived from cross-correlation information, the ZCAE approach provides sound segregation based on estimates of ITD from comparisons of zero-crossings [Kim, Y.-I., An, S.J., Kil, R.M., Park, H.-M., 2005. Sound segregation based on binaural zero-crossings. In: Proc. European Conf. on Speech Communication and Technology (INTERSPEECH-2005), Lisbon, Portugal, pp. 2325-2328]. These ITD estimates are used to determine the relative contribution of the desired source in a mixture and subsequently to reconstruct a closer approximation to the desired signal. The estimation of relative target intensity in a given time-frequency segment is accomplished by analytically deriving a monotonic function that maps the estimated ITD in each time-frequency segment to the putative relative intensity of each source. The ZCAE method is evaluated by comparing the sample standard deviation of ITD estimates derived using cross-correlation and using zero-crossing information, by comparing the speech recognition accuracy that is obtained by applying the proposed methods to speech in the presence of interfering speech sources, and by comparing recognition accuracy obtained using a continuous weighting versus a binary weighting of the target and masker. It is found that better results are obtained when ITDs are estimated using zero-crossing information rather than cross-correlation information, and when continuous weighting functions are used in place of binary weighting of the target and masker in each time-frequency segment. (C) 2008 Elsevier B.V. All rights reserved. C1 [Park, Hyung-Min] Sogang Univ, Dept Elect Engn, Seoul 121742, South Korea. [Park, Hyung-Min; Stern, Richard M.] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA. [Park, Hyung-Min; Stern, Richard M.] Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA. RP Park, HM (reprint author), Sogang Univ, Dept Elect Engn, Seoul 121742, South Korea.
EM hpark@sogang.ac.kr FU Institute of Information Technology Assessment, Korea; Sogang University Foundation; National Science Foundation [IIS-0420866] FX This work was supported by the Information and Telecommunication National Scholarship Program sponsored by the Institute of Information Technology Assessment, Korea, by the Sogang University Foundation Research Grants, and by the National Science Foundation (Grant IIS-0420866). CR Barker J., 2000, P ICSLP BEIJ CHIN, P373 BODDEN M, 1995, P EUR C SPEECH COMM, P127 Bodden M., 1993, Acta Acustica, V1 Braasch J, 2005, COMMUNICATION ACOUSTICS, P75, DOI 10.1007/3-540-27437-5_4 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 COLBURN HS, 2005, SPRINGER HDB AUDITOR, V6, P272 HENNING GB, 1974, J ACOUST SOC AM, V77, P1129 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 JEFFRESS LA, 1948, J COMP PHYSIOL PSYCH, V41, P35, DOI 10.1037/h0061495 Juang B. H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E Kim YJ, 2004, PROCEEDINGS OF THE XIX EUROPEAN CONGRESS OF PERINATAL MEDICINE, P477 Kim YI, 2007, IEEE T AUDIO SPEECH, V15, P734, DOI 10.1109/TASL.2006.881669 KIM YI, 2005, P INT C SPOK LANG P, P2325 LINDEMANN W, 1986, J ACOUST SOC AM, V80, P1608, DOI 10.1121/1.394325 Lyon R. F., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing MEDDIS R, 1991, J ACOUST SOC AM, V89, P2866, DOI 10.1121/1.400725 Morris A. C., 2001, P WISP 01 STRATF UP, P153 NUETZEL JM, 1981, J ACOUST SOC AM, V69, P1112, DOI 10.1121/1.385690 OMARD LP, 2000, DEV SYSTEM AUDITORY Palomaki KJ, 2004, SPEECH COMMUN, V43, P361, DOI 10.1016/j.specom.2004.03.005 Price P., 1988, P IEEE INT C AC SPEE, P651 Raj B, 2005, IEEE SIGNAL PROC MAG, V22, P101, DOI 10.1109/MSP.2005.1511828 Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 Raj B., 1997, P ICASSP 97, P851 Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463 SINGH R, 2002, CRC HDB NOISE REDUCT, P245 Srinivasan S, 2006, SPEECH COMMUN, V48, P1486, DOI 10.1016/j.specom.2006.09.003 Srinivasan S., 2004, P ICSLP, P2541 STERN RM, 1996, BINAURAL SPATIAL HEA, P499 STERN RM, 1978, J ACOUST SOC AM, V64, P127, DOI 10.1121/1.381978 Strutt J. W, 1907, PHILOS MAG, V13, P214, DOI 10.1080/14786440709463595 Summerfield Q., 2004, SPRINGER HDB AUDITOR, V18, P231, DOI 10.1007/0-387-21575-1_5 TESSIER E, 1999, P INT C SPEECH P 199, P97 Wang D., 2006, COMPUTATIONAL AUDITO WEINTRAUB M, 1986, P IEEE INT C AC SPEE, V11, P81 NR 36 TC 16 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
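As an editorial illustration of the zero-crossing ITD idea summarised in the Park and Stern abstract above, a minimal sketch follows. It is not the authors' code: the function names, the sub-sample interpolation and the nearest-crossing pairing are assumptions of this sketch, and a real ZCAE front end operates per frequency band after auditory filtering.

```python
# Hedged sketch: interaural time difference (ITD) from zero-crossing instants.
import numpy as np

def zero_crossing_times(x, fs):
    """Upward zero-crossing instants (s), refined by linear interpolation."""
    idx = np.where((x[:-1] < 0) & (x[1:] >= 0))[0]
    frac = -x[idx] / (x[idx + 1] - x[idx])   # position between the two samples
    return (idx + frac) / fs

def estimate_itd(left, right, fs):
    """Median offset between paired zero-crossings of the two ear signals."""
    tl = zero_crossing_times(left, fs)
    tr = zero_crossing_times(right, fs)
    nearest = np.searchsorted(tr, tl).clip(0, len(tr) - 1)
    return float(np.median(tr[nearest] - tl))

# Toy check: a 500 Hz tone arriving 0.5 ms later at the right ear.
fs = 16000
t = np.arange(0, 0.1, 1.0 / fs)
print(estimate_itd(np.sin(2 * np.pi * 500 * t),
                   np.sin(2 * np.pi * 500 * (t - 0.0005)), fs))  # ~0.0005 s
```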
PD JAN PY 2009 VL 51 IS 1 BP 15 EP 25 DI 10.1016/j.specom.2008.05.012 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 381GQ UT WOS:000261524100002 ER PT J AU Carre, R AF Carre, Rene TI Dynamic properties of an acoustic tube: Prediction of vowel systems SO SPEECH COMMUNICATION LA English DT Article DE Speech production; Acoustic tube; Dynamic properties; Vowel systems ID SPEECH-PERCEPTION; FORMANT TRANSITIONS; COARTICULATED VOWELS; 1ST FORMANT; RECOGNITION; REPRESENTATION; SPECIFICATION; TRAJECTORIES; REDUCTION; FREQUENCY AB Approaches to characterizing and explaining the diverse phonologies of the world's languages usually begin with data from the analysis of speech signals or from the results of speech production and perception experiments. In the present paper, the dynamic acoustic properties that arise from changing the shape of a simple acoustic tube of 18 cm length (without any articulatory machinery) are explored to develop a simple and efficient acoustic communication system. By efficient we mean that minimum deformations of the tube lead to maximum acoustic variations. Intrinsic characteristics of the tube are derived from these specific 'gestural' deformations associated with formant trajectories in the acoustic plane. This deductive approach (without reference to data on speech production or speech signals) leads to the definition of an acoustic communication system characterized by its acoustic space and by several specific formant trajectories. The acoustic space fits well with the vowel triangle, and 18 oral vowels can be placed on the trajectories. From these deductive results a tentative explanation of vowel systems is proposed. The good match between deductive prediction and observation results encourages us to make further predictions, formulating hypotheses about a unified view of vowel and consonant production, and reconsidering the relation between phonetics and phonology. (C) 2008 Elsevier B.V. All rights reserved. C1 Univ Lyon 2, CNRS, Umr 5596, Lab Dynam Langage, F-69363 Lyon 07, France. RP Carre, R (reprint author), Univ Lyon 2, CNRS, Umr 5596, Lab Dynam Langage, 14 Ave Marcelin Berthelot, F-69363 Lyon 07, France. EM recarre@wanadoo.fr FU French Ministere de la recherche FX This research was supported by the French Ministere de la recherche: ACI "Systemes complexes en SHS, 2003".
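A quick back-of-envelope check on the 18 cm tube in the Carre abstract above; this is an editorial sketch, not the author's model. For a uniform tube closed at the glottis end and open at the lips, the resonances sit at the odd quarter-wave frequencies F_n = (2n - 1)c/4L; the speed of sound used here is an assumed value.

```python
# Quarter-wavelength resonances of a uniform closed-open tube (sketch).
c = 350.0   # speed of sound, m/s (assumption)
L = 0.18    # tube length, m (from the abstract above)
for n in (1, 2, 3):
    print(f"F{n} = {(2 * n - 1) * c / (4 * L):.0f} Hz")
# Prints 486, 1458 and 2431 Hz: close to the 500/1500/2500 Hz neutral-vowel
# pattern usually quoted for a 17.5 cm vocal tract.
```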
The author thanks Michael Studdert-Kennedy for his very helpful comments on an early draft and Pierre Divenyi, CR Badin P., 1984, KTH SPEECH TRANSMISS, V2-3, P53 BOE LJ, 1989, P EUR PAR FRANC, V2, P281 BROWMAN CP, 1992, PHONETICA, V49, P155 Carre R, 2001, PHONETICA, V58, P163, DOI 10.1159/000056197 Carre R., 1992, Journal d'Acoustique, V5 CARRE R, 2007, P ICPHS SAARBR, P569 CARRE R, 1995, J PHONETICS, V23, P231, DOI 10.1016/S0095-4470(95)80045-X CARRE R, 2002, P INT C SPEECH LANG, P1681 CARRE R, 1990, NATO ASI SERIES CARRE R, 1994, J ACOUST SOC AM, V95, pS2924 Carre Rene, 1995, P73 CARRE R, 1991, J PHONETICS, V19, P433 Carre R, 2000, PHONETICA, V57, P152, DOI 10.1159/000028469 Carre R, 1998, BEHAV BRAIN SCI, V21, P261, DOI 10.1017/S0140525X98241173 Carre R, 2004, SPEECH COMMUN, V42, P227, DOI 10.1016/j.specom.2003.12.001 CARRE R, 1995, CR ACAD SCI II B, V30, P471 CARRE R, 1997, P 3 M ACL SPEC INT G, P26 Catford JC, 1988, PRACTICAL INTRO PHON Chennoukh S, 1997, J ACOUST SOC AM, V102, P2380, DOI 10.1121/1.419622 Chomsky N., 1968, SOUND PATTERN ENGLIS CLEMENTS GN, 2003, 15 ICPHS, P371 Crothers J., 1978, UNIVERSALS HUMAN LAN, V2, P93 DEBOER B, 1999, SELF ORG VOWEL SYSTE DEBOER B, 1997, P ECAL 97 DELATTRE PC, 1955, J ACOUST SOC AM, V27, P769, DOI 10.1121/1.1908024 DIBENEDETTO MG, 1989, J ACOUST SOC AM, V86, P67, DOI 10.1121/1.398221 DIBENEDETTO MG, 1989, J ACOUST SOC AM, V86, P55, DOI 10.1121/1.398220 Diehl R., 2003, P 15 ICPHS BARC, P1381 DORMAN MF, 1977, PERCEPT PSYCHOPHYS, V22, P109, DOI 10.3758/BF03198744 DUEZ D, 1995, J PHONETICS, V23, P407, DOI 10.1006/jpho.1995.0031 Fant G., 1960, ACOUSTIC THEORY SPEE Fant G., 1974, P SPEECH COMM SEM, P121 Fowler C. A., 1980, LANGUAGE PRODUCTION, P373 FOWLER CA, 1986, J PHONETICS, V14, P3 FURUI S, 1986, J ACOUST SOC AM, V80, P1016, DOI 10.1121/1.393842 GAY T, 1978, J ACOUST SOC AM, V63, P223, DOI 10.1121/1.381717 Gunnilstam O, 1974, J PHONETICS, V2, P91 Johnson K, 1997, TALKER VARIABILITY S, P145 KENT RD, 1969, J ACOUST SOC AM, V46, P1549, DOI 10.1121/1.1911902 KEWLEYPORT D, 1982, J ACOUST SOC AM, V72, P379, DOI 10.1121/1.388081 Labov W., 1972, SOCIOLINGUISTICS PAT LADEFOGED P, 1990, J PHONETICS, V18, P335 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 LILJENCR.J, 1972, LANGUAGE, V48, P839, DOI 10.2307/411991 LINDBLOM B, 1990, PHONETIC CONTENT PHO, P101 Lindblom B., 1990, NATO ASI SER, P403 LINDBLOM B, 1990, J PHONETICS, V18, P135 Lindblom B., 1963, 29 ROYAL I TECHN SPE Lindblom B, 1996, J ACOUST SOC AM, V99, P1683, DOI 10.1121/1.414691 Lindblom B., 1986, EXPT PHONOLOGY, P13 LINDBLOM BE, 1967, J ACOUST SOC AM, V42, P830, DOI 10.1121/1.1910655 Maddieson I., 1984, PATTERNS SOUNDS Maeda S., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90017-6 MILLER JD, 1989, J ACOUST SOC AM, V85, P2114, DOI 10.1121/1.397862 MRAYATI M, 1990, SPEECH COMMUN, V9, P231, DOI 10.1016/0167-6393(90)90059-I MRAYATI M, 1988, SPEECH COMMUN, V7, P257, DOI 10.1016/0167-6393(88)90073-8 NEAREY TM, 1986, J ACOUST SOC AM, V80, P1297, DOI 10.1121/1.394433 NEAREY TM, 1989, J ACOUST SOC AM, V85, P2088, DOI 10.1121/1.397861 Nearey TM, 1997, J ACOUST SOC AM, V101, P3241, DOI 10.1121/1.418290 Ohala J. 
J., 1979, P INT C PHON SCI, V3, P181 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 OShaughnessy D, 1996, J ACOUST SOC AM, V99, P1726, DOI 10.1121/1.414697 Oudeyer PY, 2005, J THEOR BIOL, V233, P435, DOI 10.1016/j.jtbi.2004.10.025 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 POLS LCW, 1993, SPEECH COMMUN, V13, P135, DOI 10.1016/0167-6393(93)90065-S Schwartz JL, 1997, J PHONETICS, V25, P255, DOI 10.1006/jpho.1997.0043 Schwartz JL, 1997, J PHONETICS, V25, P233, DOI 10.1006/jpho.1997.0044 STEVENS JL, 1989, COENZYMES COFACTOR B, V3, P45 Stevens KN, 1972, HUMAN COMMUNICATION, P51 STRANGE W, 1989, J ACOUST SOC AM, V85, P2135, DOI 10.1121/1.397863 STRANGE W, 1989, J ACOUST SOC AM, V85, P2081, DOI 10.1121/1.397860 STRANGE W, 1983, J ACOUST SOC AM, V74, P695, DOI 10.1121/1.389855 STRANGE W, 1976, J ACOUST SOC AM, V60, P213, DOI 10.1121/1.381066 Studdert-Kennedy M., 1987, LANGUAGE PERCEPTION, P67 SYRDAL AK, 1986, J ACOUST SOC AM, V79, P1086, DOI 10.1121/1.393381 TENBOSCH L, 1991, THESIS U AMSTERDAM A TENBOSCH LFM, 1986, P 11 INT C PHON SCI, P235 VERBRUGGE RR, 1980, SR62 HASK LAB, P205 NR 78 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2009 VL 51 IS 1 BP 26 EP 41 DI 10.1016/j.specom.2008.05.015 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 381GQ UT WOS:000261524100003 ER PT J AU Gomez, R Toda, T Saruwatari, H Shikano, K AF Gomez, Randy Toda, Tomoki Saruwatari, Hiroshi Shikano, Kiyohiro TI Techniques in rapid unsupervised speaker adaptation based on HMM-Sufficient Statistics SO SPEECH COMMUNICATION LA English DT Article DE HMM-Sufficient Statistics; Unsupervised adaptation; Speaker adaptation; Rapid adaptation; Single-utterance adaptation; Adaptation based on speaker selection AB In realizing a speech recognition system robust to variation of speakers, a reliable adaptation algorithm is needed. Most adaptation techniques require a large amount of adaptation data from the target speaker to carry out the adaptation task. The time needed to gather and transcribe adaptation utterances, together with the time to execute adaptation, limits the applicability of such techniques to speech recognition. We propose a rapid approach to speaker adaptation. We employ HMM-Sufficient Statistics in storing speaker-dependent subspaces. N-Closest speaker selection is employed in resolving the combinatorics of the speaker-dependent subspaces during recognition. This approach allows the adapted model to have a direct correspondence with the target speaker by using the target speakers' utterance for the N-Closest speaker selection. The proposed method employs a series of adaptation processes. First, the general model is trained, then adapted to broad gender/age classes, which are further adapted to speaker-specific data. Since HMM-Sufficient Statistics are pre-computed offline, little computation is needed in carrying out the adaptation task online. Moreover, the method requires only a single arbitrary utterance from the target speaker for adaptation. In this paper, we discuss the modification, expansion, and the improvement of rapid adaptation based on HMM-Sufficient Statistics in the framework of Baum-Welch and maximum likelihood linear regression (MLLR). Experimental results using the conventional MLLR, speaker-adaptive training, and CMLLR are evaluated and compared.
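A hedged sketch of the N-Closest selection-and-pooling step just described (the names and the simplified single-Gaussian statistics are assumptions of this illustration, not the authors' implementation): the N closest speakers are ranked by the likelihood their models assign to one target utterance, and their pre-computed sufficient statistics are pooled into adapted means, so no pass over raw adaptation speech is needed online.

```python
# Illustrative only: speaker_models maps speaker id -> a callable returning
# the average log-likelihood of the target features; speaker_stats maps
# speaker id -> (gamma, first) with per-state occupancies (n_states,) and
# occupancy-weighted feature sums (n_states, dim), pre-computed offline.
import numpy as np

def select_and_adapt(target_feats, speaker_models, speaker_stats, n_closest=5):
    scores = {sid: model(target_feats) for sid, model in speaker_models.items()}
    chosen = sorted(scores, key=scores.get, reverse=True)[:n_closest]
    gamma = sum(speaker_stats[sid][0] for sid in chosen)   # pooled occupancies
    first = sum(speaker_stats[sid][1] for sid in chosen)   # pooled sums
    return first / gamma[:, None]                          # adapted state means

# Toy run with fabricated statistics for ten speakers.
rng = np.random.default_rng(0)
models = {s: (lambda feats, s=s: -abs(s - 4.2)) for s in range(10)}
stats = {s: (np.full(3, 10.0), rng.normal(size=(3, 2))) for s in range(10)}
print(select_and_adapt(None, models, stats, n_closest=3))
```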
We also tested for robustness in office, car, crowd and booth environments in several SNR conditions. (C) 2008 Elsevier B.V. All rights reserved. C1 [Gomez, Randy; Toda, Tomoki; Saruwatari, Hiroshi; Shikano, Kiyohiro] Nara Inst Sci & Technol, Nara, Japan. RP Gomez, R (reprint author), Nara Inst Sci & Technol, Nara, Japan. EM randy-g@is.naist.jp FU Japanese MEXT FX This work was supported by the Japanese MEXT e-Society project while the first author was in Nara Institute of Science and Technology. Currently, he is affiliated with the Academic Center for Computing and Media Studies, Kyoto University, Kyoto, Japan. CR ANASTAKOS T, 1996, P ICSLP OCT BABA A, 2001, P EUROSPEECH, P1657 *CAMBR U ENG DEP, HTK HIDD MARK MOD TO Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 GIULIANI D, 2003, P ICASSP, V2, P137 GOMEZ R, 2007, IEICE T INFORM SYS D, V90 GOMEZ R, 2006, IEICE SPECIAL ISSU D, V89 GOMEZ R, 2005, P EUROSPEECH, P296 GOMEZ R, P ICASSP IN PRESS GOMEZ R, 2005, P AC SOC JAP MARCH 2 HUANG C, 2004, P ICSLP Huang C., 2001, P EUR 2001, V2, P1377 LEE A, 2000, P ICASSP, P1269 LEE A, JULIUS FREE CONTINUO LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MATSOUKAS S, 1997, P ARPA WORKSH SPEECH MATSUI T, 1997, P ICASSP, P1015 Minka T.P., 1998, EXPECTATION MAXIMIZA NEUMEYER LR, 1995, P EUROSPEECH, V2, P1127 PYE D, 1997, P ICASSP, V2, P1047 VATBHAVA G, 2001, P ICASSP XIANG B, 2005, P ICASSP, V1, P677 YAMADA M, 2001, P EUROSPEECH, P869 YAMADE S, 2000, P ICSLP, P1045 YOSHIZAWA S, 2001, P ICASSP ZHAN P, 1997, P EUR, P2087 NR 27 TC 2 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2009 VL 51 IS 1 BP 42 EP 57 DI 10.1016/j.specom.2008.05.014 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 381GQ UT WOS:000261524100004 ER PT J AU Haque, S Togneri, R Zaknich, A AF Haque, Serajul Togneri, Roberto Zaknich, Anthony TI Perceptual features for automatic speech recognition in noisy environments SO SPEECH COMMUNICATION LA English DT Article DE Auditory system; Automatic speech recognition; Perceptual features; Synaptic adaptation; Two-tone suppression; Hidden Markov model ID SHORT-TERM ADAPTATION; WORD RECOGNITION; MODEL; RESPONSES; EXCITATION; COCHLEA AB The performances of two perceptual properties of the peripheral auditory system, synaptic adaptation and two-tone suppression, are compared for automatic speech recognition (ASR) in an additive noise environment. A simple method of synaptic adaptation as determined by psychoacoustic observations was implemented with temporal processing of speech utilizing a zero-crossing auditory model as a pre-processing front end. The concept is similar to RASTA processing, but instead of bandpass filters, a high-pass infinite impulse response (IIR) filter is used. It is shown that rapid synaptic adaptation may be implemented by temporal processing using the zero-crossing algorithm, not otherwise implementable in a spectral domain implementation. The two-tone suppression was implemented in the zero-crossing auditory model using a companding strategy. Recognition performances with the two perceptual features were evaluated on the isolated digits (TIDIGITS) corpus using a continuous density HMM recognizer in white, factory, babble and Volvo noise.
It is observed that synaptic adaptation performs better in stationary white Gaussian noise. In the presence of non-stationary non-Gaussian noise, however, no improvement or even a degradation is observed. Moreover, a reciprocal effect is observed with two-tone suppression, with better performance in non-Gaussian real-world noise and degradation in stationary white Gaussian noise. (C) 2008 Elsevier B.V. All rights reserved. C1 [Haque, Serajul; Togneri, Roberto; Zaknich, Anthony] Univ Western Australia, Sch Elect Engn & Comp Engn, Crawley, WA 6009, Australia. RP Haque, S (reprint author), Univ Western Australia, Sch Elect Engn & Comp Engn, 35 Stirling Hwy, Crawley, WA 6009, Australia. EM serajul@ee.uwa.edu.au RI Togneri, Roberto/C-2466-2013; Haque, Serajul/D-2046-2013 CR ABDELATTY AM, 1999, P IEEE INT C AC SPEE BLOOMBERG M, 1984, AUDITORY MODELS ISOL, P1 COHEN JR, 1989, J ACOUST SOC AM, V85, P2623, DOI 10.1121/1.397756 DOLBY R, 1967, J AUDIO ENG SOC, V15 Frey D, 2001, IEEE T CIRCUITS-II, V48, P329, DOI 10.1109/82.933791 GAJIC B, 2003, P IEEE INT C AC SPEE Ghitza O, 1994, IEEE T SPEECH AUDI P, V2, P115, DOI 10.1109/89.260357 GHULAM M, 2005, P IEEE INT C AC SPEE GILLICK L, 1989, P IEEE INT C AC SPEE Grayden D. B., 2004, IEEE ISSNIP, P491 HAQUE S, 2007, P IEEE INT C AC SPEE HAYES MH, 1996, STAT DIGITAL SIGNAL, P72 HERMANSKY H, 1994, P INT WORKSH HUM INT Hermansky H., 1994, IEEE T SPEECH AUDIO, V2, P587 Holmberg M., 2006, IEEE T AUDIO SPEECH, V14, P44 JANKOWSKI CR, 1995, IEEE T SPEECH AUDI P, V3, P286, DOI 10.1109/89.397093 KATES JM, 1995, IEEE T SPEECH AUDI P, V3, P396, DOI 10.1109/89.466656 KEDEM B, 1986, P IEEE, V74, P1477, DOI 10.1109/PROC.1986.13663 Kim DS, 1999, IEEE T SPEECH AUDI P, V7, P55 Koopmans L. H., 1974, SPECTRAL ANAL TIME S LOUGHLIN P, 1997, P IEEE INT C AC SPEE LYON RF, 1988, IEEE T ACOUST SPEECH, V36, P1119, DOI 10.1109/29.1639 MEDDIS R, 1986, J ACOUST SOC AM, V79, P702, DOI 10.1121/1.393460 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 Oxenham AJ, 2001, J ACOUST SOC AM, V109, P732, DOI 10.1121/1.1336501 PATTERSON RD, 1999, J ACOUST SOC AM, V105, P1305, DOI 10.1121/1.424756 Pitton JW, 1996, P IEEE, V84, P1199, DOI 10.1109/5.535241 RHODE WS, 1978, J ACOUST SOC AM, V64, P158, DOI 10.1121/1.381981 Ruggero MA, 2000, P NATL ACAD SCI USA, V97, P11744, DOI 10.1073/pnas.97.22.11744 SACHS MB, 1983, J NEUROPHYSIOL, V50 SENEFF S, 1988, J PHONETICS, V16, P55 SMITH RL, 1975, BIOL CYBERN, V17, P169, DOI 10.1007/BF00364166 Spoor A., 1976, ELECTROCOCHLEOGRAPHY, P183 Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569 TCHORZ J, 1999, J ACOUST SOC AM, V106 Theodoridis S., 1999, PATTERN RECOGNITION Turicchia L, 2005, IEEE T SPEECH AUDI P, V13, P243, DOI 10.1109/TSA.2004.841044 WESTERMAN LA, 1984, HEARING RES, V15, P249, DOI 10.1016/0378-5955(84)90032-7 NR 38 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
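A minimal illustration of the temporal processing named in the Haque et al. abstract above: a first-order high-pass IIR filter applied along a feature trajectory, RASTA-like. The recursion and the pole value 0.97 are assumptions of this sketch, not parameters taken from the paper.

```python
# Hedged sketch: first-order high-pass IIR along a per-channel trajectory.
import numpy as np

def highpass_trajectory(x, a=0.97):
    """y[t] = x[t] - x[t-1] + a * y[t-1]: removes slowly varying components."""
    y = np.zeros(len(x))
    for t in range(1, len(x)):
        y[t] = x[t] - x[t - 1] + a * y[t - 1]
    return y

frames = np.array([1.0, 1.0, 1.0, 4.0, 4.0, 4.0])  # a step in log-energy
print(highpass_trajectory(frames))  # the step edge is kept; the offset decays
```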
PD JAN PY 2009 VL 51 IS 1 BP 58 EP 75 DI 10.1016/j.specom.2008.06.002 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 381GQ UT WOS:000261524100005 ER PT J AU Amano, S Sakamoto, S Kondo, T Suzuki, Y AF Amano, Shigeaki Sakamoto, Shuichi Kondo, Tadahisa Suzuki, Yoiti TI Development of familiarity-controlled word lists 2003 (FW03) to assess spoken-word intelligibility in Japanese SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 9th Western Pacific Acoustic Conference CY JUN 26-28, 2006 CL Seoul, SOUTH KOREA DE Word intelligibility; Word familiarity; Word list ID SPEECH; NOISE; RECOGNITION; SENTENCES AB A new set of "Familiarity-controlled word lists 2003" (FW03) has been developed for a spoken-word intelligibility test in Japanese. FW03 consists of 20 lists of 50 words in four word-familiarity ranks (i.e., 4000 words in total). The entropy of (a) initial moras and (b) sequences consisting of a vowel and a following consonant was maximized in the word lists within each word-familiarity rank. FW03 is now published with speech files of the 4000 words spoken by two male and two female Japanese speakers. The word intelligibility of FW03 was measured with the speech files at various signal-to-noise ratios. In addition to the signal-to-noise ratio effects, strong word-familiarity effects were observed in terms of word intelligibility, indicating that word familiarity is well controlled in FW03. FW03 enables us to measure word intelligibility in several word-familiarity ranks that correspond to the degree of lexical information. (C) 2008 Elsevier B.V. All rights reserved. C1 [Amano, Shigeaki; Kondo, Tadahisa] NTT Corp, NTT Commun Sci Labs, Seika, Kyoto 6190237, Japan. [Sakamoto, Shuichi; Suzuki, Yoiti] Tohoku Univ, Res Inst Elect Commun, Grad Sch Informat Sci, Aoba Ku, Sendai, Miyagi 9808577, Japan. RP Amano, S (reprint author), NTT Corp, NTT Commun Sci Labs, 2-4 Hikari Dai, Seika, Kyoto 6190237, Japan. EM amano@cslab.kecl.ntt.co.jp CR Amano S., 2000, LEXICAL PROPERTIES J, V7 AMANO S, 1999, P 14 INT C PHON SCI, V2, P873 Amano S, 2007, BEHAV RES METHODS, V39, P1008, DOI 10.3758/BF03192997 AMANO S, 1998, P INT C SPOK LANG PR, V5, P2119 Amano S., 1993, Journal of the Acoustical Society of Japan (E), V14 Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X BILGER RC, 1984, J SPEECH HEAR RES, V27, P32 BOOTHROYD A, 1988, EAR HEARING, V9, P306 CONNINE CM, 1987, J MEM LANG, V26, P527, DOI 10.1016/0749-596X(87)90138-0 Cox R.M., 1987, EAR HEARING, V8, P119 EGAN JP, 1948, LARYNGOSCOPE, V58, P955, DOI 10.1288/00005537-194809000-00002 HOUSE AS, 1965, J ACOUST SOC AM, V37, P158, DOI 10.1121/1.1909295 *ITU T, G227 ITUT *JAP AUD SOC, 1983, 57 S SYLL LIST *JAP AUD SOC, 1987, 67 S SYLL LIST KALIKOW DN, 1977, J ACOUST SOC AM, V61, P1337, DOI 10.1121/1.381436 Kondo T., 1999, LEXICAL PROPERTIES J, V1 KONDO T, 1998, P 62 ANN M JAP PSYCH, V711 MAKINO S, 1979, IEICE T INF SYST, V62, P507 MARSLENWILSON WD, 1978, COGNITIVE PSYCHOL, V10, P29, DOI 10.1016/0010-0285(78)90018-X NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 Otake T., 1996, PHONOLOGICAL STRUCTU SAKAMOTO S, 2004, J ACOUST SOC JPN, V60, P351 SHIKANO K, 1984, P SPRING M AC SOC JA, P211 Vance T. J., 1987, INTRO JAPANESE PHONO Voiers W.
D., 1983, Speech Technology, V1 YONEMOTO K, 1995, JHONS, V11, P1395 YONEMOTO K, 1989, AUDIOL JPN, V32, P429 NR 28 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2009 VL 51 IS 1 BP 76 EP 82 DI 10.1016/j.specom.2008.07.002 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 381GQ UT WOS:000261524100006 ER PT J AU Trancoso, I Becerra-Yoma, N Barbosa, P San-Segundo, R Paliwal, K AF Trancoso, Isabel Becerra-Yoma, Nestor Barbosa, Plinio San-Segundo, Ruben Paliwal, Kuldip TI Special Issue on Iberian Languages SO SPEECH COMMUNICATION LA English DT Editorial Material C1 [Becerra-Yoma, Nestor] Univ Chile, Santiago, Chile. [Barbosa, Plinio] Univ Estadual Campinas, Campinas, SP, Brazil. [San-Segundo, Ruben] Univ Politecn Madrid, E-28040 Madrid, Spain. [Paliwal, Kuldip] Griffith Univ, Nathan, Qld 4111, Australia. RI Trancoso, Isabel/C-5965-2008 OI Trancoso, Isabel/0000-0001-5874-6313 NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 872 EP 873 DI 10.1016/j.specom.2008.06.001 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100001 ER PT J AU Gonzalez, MG Banga, ER Diaz, FC Pazo, FM Linares, LR Iglesias, GI AF Gonzalez, Manuel Gonzalez Banga, Eduardo Rodriguez Diaz, Francisco Campillo Pazo, Francisco Mendez Linares, Leandro Rodriguez Iglesias, Gonzalo Iglesias TI Specific features of the Galician language and implications for speech technology development SO SPEECH COMMUNICATION LA English DT Article DE Galician; Iberian languages; Speech technology; Text-to-speech; Linguistic analysis AB In this article, we present the main linguistic and phonetic features of Galician which need to be considered in the development of speech technology applications for this language. We also describe the solutions adopted in our text-to-speech system, also useful for speech recognition and speech-to-speech translation. On the phonetic plane in particular, the handling of vocal contact and the determination of mid-vowel openness are discussed. On the linguistic plane we place special emphasis on the handling of clitics and verbs. It should be noted that in Galician there is a high interrelation between phonetics and grammatical information. Therefore, the task of morphosyntactic disambiguation is also addressed. Moreover, this task is fundamental for a higher level linguistic analysis. (C) 2008 Elsevier B.V. All rights reserved. C1 [Banga, Eduardo Rodriguez; Diaz, Francisco Campillo; Pazo, Francisco Mendez; Iglesias, Gonzalo Iglesias] Univ Vigo, Dpto Teoria Senal Comunicac, ETSI Telecomunicac, Vigo 36310, Spain. [Gonzalez, Manuel Gonzalez] Univ Santiago, Dpto Filoloxia Galega, Santiago De Compostela, Spain. [Linares, Leandro Rodriguez] Univ Vigo, Dpto Informat, Orense, Spain. RP Banga, ER (reprint author), Univ Vigo, Dpto Teoria Senal Comunicac, ETSI Telecomunicac, Vigo 36310, Spain.
EM erbanga@gts.tsc.uvigo.es RI Rodriguez Banga, Eduardo/C-4296-2011 FU Spanish government; ERDF; Xunta de Galicia [TEC2006-13694-C03-03, HUM2005-08282-C02-01, PGI0IT05TIC32202PR] FX This work has been partially supported by the Spanish government, ERDF funds and the Xunta de Galicia under the projects TEC2006-13694-C03-03, HUM2005-08282-C02-01 and PGI0IT05TIC32202PR. Cotovia was developed in collaboration with the Centro Ramon Pineiro para a Investigacion en Humanidades. The authors wish to thank the anonymous reviewers for their valuable suggestions for improving the initial version of this paper. CR Brill E, 1995, COMPUT LINGUIST, V21, P543 CAMPILLO F, 2005, THESIS U VIGO SPAIN Campillo F. D., 2006, SPEECH COMMUN, V48, P941, DOI 10.1016/j.specom.2005.12.004 Castro Obdulia, 1998, APROXIMACION FONOLOG GARRIDO JM, 1996, THESIS U BARCELONA S Gonzalez Manuel, 1994, ACT 19 C INT LING FI, VVI, P141 GONZALEZ MG, 2002, DICCIONARIO VERBOS G LOPEZ E, 1993, THESIS U POLITECNICA Merialdo B., 1994, Computational Linguistics, V20 MORENO A, 2006, 4 JORN TECN HABL, P77 Olive J. P., 1993, ACOUSTICS AM ENGLISH Pierrehumbert Janet, 1980, THESIS MIT CAMBRIDGE SEIJO L, 2004, P LREC LISB, V5, P1759 VEIGA A, 1976, FONOLOGIA GALLEGA VOUTILAINEN A, 1992, PUBLICATION U HELSIN, V21 NR 15 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 874 EP 887 DI 10.1016/j.specom.2008.02.004 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100002 ER PT J AU Navas, E Hernaez, I Sainz, I AF Navas, Eva Hernaez, Inmaculada Sainz, Inaki TI Evaluation of automatic break insertion for an agglutinative and inflected language SO SPEECH COMMUNICATION LA English DT Article DE Break prediction; Prosodic phrasing; Speech synthesis ID TEXT-TO-SPEECH; PHRASE BREAKS AB This paper presents the evaluation of automatic break insertion for standard Basque. Basque is an agglutinative and inflected language and POS features, widely used for other languages, are not enough to accurately predict the insertion of breaks in the text. Other morpho-syntactic features, like grammatical case and information about syntagms, have also been taken into account. With a textual corpus specially gathered for this study where the sentence internal punctuation marks have been removed, CARTs have been used to predict break locations. After applying parameter selection to the whole morpho-syntactic feature set, the best features were employed to build two CARTs, one that gives the same importance to deletion and insertion errors, T1, and another one, T2, that tries to minimize insertion errors. The objective evaluation of the break insertion algorithms gives a K statistic of 0.518 and an F of 0.757 for the T1 tree. The algorithms have also been subjectively evaluated and although T1 had better objective measures, the number of serious errors made by this tree is larger than the number of serious errors made by T2. (C) 2008 Elsevier B.V. All rights reserved. C1 [Navas, Eva; Hernaez, Inmaculada; Sainz, Inaki] Univ Basque Country, Dept Elect & Telecomunicac, Bilbao 48013, Spain. RP Navas, E (reprint author), Univ Basque Country, Dept Elect & Telecomunicac, Alda Urquijo S-N, Bilbao 48013, Spain.
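For readers unfamiliar with the K statistic and F quoted in the Navas et al. abstract above, a small sketch of how such juncture-level scores can be computed follows; the toy labels, the helper name and the exact counting convention are assumptions of this illustration, not taken from the paper.

```python
# Hedged sketch: F-measure and Cohen's kappa for break-after-word labels.
def break_scores(ref, pred):
    n = len(ref)
    tp = sum(1 for r, p in zip(ref, pred) if r and p)
    fp = sum(1 for r, p in zip(ref, pred) if not r and p)
    fn = sum(1 for r, p in zip(ref, pred) if r and not p)
    tn = n - tp - fp - fn
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    f = 2 * prec * rec / (prec + rec)
    po = (tp + tn) / n                                             # observed
    pe = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2  # by chance
    return f, (po - pe) / (1 - pe)

ref  = [1, 0, 0, 1, 0, 1, 0, 0, 0, 1]   # 1 = reference break after this word
pred = [1, 0, 0, 1, 0, 0, 1, 0, 0, 1]
print(break_scores(ref, pred))          # (0.75, 0.583...)
```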
EM eva.navas@ehu.es; inma.hernaez@ehu.es; inaki.sainz@ehu.es RI Hernaez Rioja, Inmaculada/K-8303-2012; Navas, Eva/H-4317-2013 OI Hernaez Rioja, Inmaculada/0000-0003-4447-7575; Navas, Eva/0000-0003-3804-4984 FU Spanish Ministry of Education and Sciences [TEC2006-13694-C03-02]; Basque Government [IE06-185] FX We are grateful to Nerea Ezeiza and the IXA research group for the automatic morphological and syntactic labelling of the corpus and to Inaki Gaminde for the manual labelling of the breaks. The authors would also like to thank all the people that took part in the subjective evaluation. This work has been partially funded by the Spanish Ministry of Education and Sciences under Grant TEC2006-13694-C03-02 (AVIVAVOZ project, www.avivavoz.es/) and by the Basque Government under grant IE06-185 (AN-HITZ project, www.zinhitz.com/). CR Aduriz I, 2004, LECT NOTES COMPUT SC, V2945, P124 Agresti A, 1996, INTRO CATEGORICAL DA Allen J., 1987, TEXT SPEECH MITALK S APEL J, 2004, KONVENS, P5 Bachenko J., 1990, Computational Linguistics, V16 BELL P, 2006, P SPEECH PROS Black A. W., 1997, P EUR, P995 Blum AL, 1997, ARTIF INTELL, V97, P245, DOI 10.1016/S0004-3702(97)00063-5 BONAFONTE A, 2004, AST 2004 Busser G., 2001, P 4 ISCA TUT RES WOR, P29 Carletta J, 1996, COMPUT LINGUIST, V22, P249 CASTEJON F, 1994, CONVERSOR TEXTO VOZ Chen C.J., 1999, P EUROSPEECH, P447 Christensen H., 2001, P ISCA WORKSH PROS S, P35 COX S, 2005, P INTERSPEECH 2005, P3229 DEILARRAZA AD, 2003, P WORKSH NLP MIN LAN DEMAREUIL PB, 1998, P 3 ESCA COCOSDA INT, P127 EZEIZA N, 1998, P 36 ANN M ASS COMP, P379 FORDYCE C, 1998, P ICSLP, V3, P843 Frazier L, 2004, LINGUA, V114, P3, DOI 10.1016/S0024-3841(03)00044-5 Gotoh Y., 2000, P ISCA WORKSH AUT SP, P228 HAKKANITUR D, 1999, P EUR, P1991 Hirschberg J, 1996, SPEECH COMMUN, V18, P281, DOI 10.1016/0167-6393(96)00017-9 HIRSCHBERG J, 1994, P 2 ESCA IEEE WORKSH, P159 HU GP, 2003, P NAT LANG PROC KNOW, P407 Hualde Jose Ignacio, 2003, GRAMMAR BASQUE Ingulfsen T., 2005, P INTERSPEECH 2005, P1817 Kim J., 2001, P EUR C SPEECH COMM, P2757 KIM S, 2006, P INTERSPEECH 2006 Koehn P., 2000, P IEEE INT C AC SPEE, V3, P1289 Langley P., 1994, P AAAI FALL S REL, P140 Li Jianfeng, 2004, P INTERSPEECH 2004, P729 LIBERMAN M, 1991, ADV SPEECH SIGNAL PR, P791 Maragoudakis M, 2003, LECT NOTES ARTIF INT, V2807, P189 Marsi E., 2003, P 41 ANN M ASS COMP, P489 NAVAS E, 2002, P 1 INT C SPEECH PRO, P527 Olshen R., 1984, CLASSIFICATION REGRE, V1st Oparin I, 2005, LECT NOTES ARTIF INT, V3658, P356 Ostendorf M., 1994, Computational Linguistics, V20 PFITZINGER HR, 2006, P SPEECH PROS Read I, 2007, COMPUT SPEECH LANG, V21, P519, DOI 10.1016/j.csl.2006.09.004 READ I, 2004, P INTERSPEECH 2004, P741 SALTON G, 1972, J AM SOC INFORM SCI, V78, P11 SANDERS E, 1995, THESIS U NIJMEGEN SANGHO L, 1999, SPEECH COMMUN, V28, P283 Schmid H., 2004, P 20 INT C COMP LING Siegel S., 1988, NONPARAMETRIC STAT B Silverman K., 1992, P INT C SPOK LANG PR, V2, P867 STONE M, 1974, J R STAT SOC B, V36, P111 Sun Y, 2001, IEEE J SEL TOP QUANT, V7, P1 Taylor P, 1998, COMPUT SPEECH LANG, V12, P99, DOI 10.1006/csla.1998.0041 TESPRASIT V, 2003, P EUR C SPEECH COMM, P325 Torres H, 2004, P SPEECH PROS 2004, P553 van Rijsbergen CJ., 1979, INFORM RETRIEVAL VEILLEUX N, 1990, P INT C ACOUST SPEEC, V2, P777 Wang M.
Q., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90025-Y YI L, 2006, P INTERSPEECH 2006, P1308 Yoon K, 2006, COMPUT SPEECH LANG, V20, P69, DOI 10.1016/j.csl.2005.01.001 Zellner B., 1994, FUNDAMENTALS SPEECH, P41 ZERVAS P, 2003, P EUROSP 2003, P113 Zervas P, 2005, LECT NOTES ARTIF INT, V3658, P334 ZHAO S, 2002, P ICSLP 2002 DENV US, P2417 ZHENG Y, 2004, P INTERSPEECH 2004, P737 NR 63 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 888 EP 899 DI 10.1016/j.specom.2008.03.003 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100003 ER PT J AU Martinez-Castilla, P Peppe, S AF Martinez-Castilla, Pastora Peppe, Sue TI Developing a test of prosodic ability for speakers of Iberian Spanish SO SPEECH COMMUNICATION LA English DT Article DE Prosody; Intonation; Test; Contrastive accent; Iberian Spanish ID HIGH-FUNCTIONING AUTISM; PERCEPTION; CHILDREN; RECOGNITION; INTONATION; DUTCH AB In the absence of a Spanish prosody assessment procedure, an English one (Profiling Elements of Prosodic Systems-Children: PEPS-C) has been adapted for use with Iberian Spanish speakers. The paper describes the scope, principles and methods of the test and the modifications other than lexical translation that were required to produce a Spanish procedure. Findings from the first studies of data collected using the Spanish test are briefly considered: these suggest crosslinguistic parallels and English/Spanish differences in adult prosodic ability. Lengthier consideration is given to prosodic data from Spanish children and the use of prefinal contrastive accent in the two languages. We conclude that the test is a feasible and valid instrument for assessing Spanish prosodic ability and indicate possible directions for further research. (C) 2008 Elsevier B.V. All rights reserved. C1 [Martinez-Castilla, Pastora] Univ Autonoma Madrid, Fac Psicol, E-28049 Madrid, Spain. [Peppe, Sue] Queen Margaret Univ Speech & Hearing Sci, Edinburgh EH12 8TS, Midlothian, Scotland. RP Martinez-Castilla, P (reprint author), Univ Autonoma Madrid, Fac Psicol, Ciudad Univ Cantoblanco, E-28049 Madrid, Spain. EM p.martinez@uam.es; sue.peppe@googlemail.com RI Martinez-Castilla, Pastora/N-3197-2014 OI Martinez-Castilla, Pastora/0000-0003-2369-0641 FU Ministry of Education and Science of the Spanish Government [AP2003-5098] FX This research was funded by a grant from the Ministry of Education and Science of the Spanish Government (AP2003-5098). CR Asperger H, 1944, ARCH PSYCHIAT NERVEN, V117, P76, DOI 10.1007/BF01837709 BAAUW S, 2004, LOT OCCASIONAL SERIE, P103 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Beckman Mary, 2002, PROBUS, V14, P9, DOI 10.1515/prbs.2002.008 Beckman ME, 1996, LANG COGNITIVE PROC, V11, P17, DOI 10.1080/016909696387213 Butcher A., 1981, ARBEITSBERICHTE U KI CONTRERAS H, 1980, MELODY LANGUAGE, P45 Couper-Kuhlen Elizabeth, 2004, SOUND PATTERNS INTER Cruttenden Alan, 1997, INTONATION, V2nd Crystal D., 1982, PROFILING LINGUISTIC Crystal D., 1969, PROSODIC SYSTEMS INT CUTLER A, 1987, J CHILD LANG, V14, P145 D'Imperio M, 2005, PHONOL PHONET, V9, P59, DOI 10.1515/9783110197587.1.59 Enderby P. 
M., 1983, FRENCHAY DYSARTHRIA Face T., 2001, PROBUS, V13, P223, DOI 10.1515/prbs.2001.004 Face T., 2002, PROBUS, V14, P71, DOI 10.1515/prbs.2002.006 Face T., 2005, ITALIAN J LINGUISTIC, V17, P271 Face T. L., 2006, J LANG LINGUIST, V5, P295 Face Timothy, 2007, J PORTUGUESE LINGUIS, V6, P117 FORD C, 1996, CAMBRIDGE STUDIES IN, V13, P134 FRY DB, 1958, LANG SPEECH, V1, P126 Gil J., 1991, SONIDOS LENGUAJE Halle M., 1968, SOUND PATTERNS ENGLI HUALDE JI, 2007, SPECIAL ISSUE PROSOD, V6, P59 JUSCZYK PW, 1992, COGNITIVE PSYCHOL, V24, P252, DOI 10.1016/0010-0285(92)90009-Q Kanner L, 1943, NERV CHILD, V2, P217 Labastia LO, 2006, J PRAGMATICS, V38, P1677, DOI 10.1016/j.pragma.2005.03.019 LLISTERRI J, 2003, P 15 INT C PHON SCI, P2023 MARTINEZCASTILL.P, 2006, CLIN LINGUI IN PRESS Martinez-Celdran E., 1984, FONETICA CON ESPECIA Mccall L, 2007, PRIMARY CARE COMMUN, V12, P1, DOI 10.1080/17468840701560441 McCann J, 2003, INT J LANG COMM DIS, V38, P325, DOI 10.1080/1368282031000154204 MOZZICONACCI SJL, 1998, THESIS EINDHOVEN Navarro Tomas Tomas, 1944, MANUAL ENTONACION ES PARKER A, 1982, PETAL PHONOLOGICAL E PASCUAL S, 2005, REV PSIQUIATRIA FAC, V32, P179 Peppe S, 2003, CLIN LINGUIST PHONET, V17, P345, DOI 10.1080/0269920031000079994 Peppe S, 2006, J PRAGMATICS, V38, P1776, DOI 10.1016/j.pragma.2005.07.004 Peppe S, 2007, J SPEECH LANG HEAR R, V50, P1015, DOI 10.1044/1092-4388(2007/071) Quilis A., 1981, FONETICA ACUSTICA LE Roach P., 2000, ENGLISH PHONETICS PH Sanz-Martin A, 2006, REV NEUROLOGIA, V42, P391 Samuelsson Christina, 2003, Logoped Phoniatr Vocol, V28, P156, DOI 10.1080/14015430310018324 SCOTT DR, 1982, J ACOUST SOC AM, V71, P996, DOI 10.1121/1.387581 Shriberg L. D., 1990, PROSODY VOICE SCREEN Sosa J., 1999, ENTONACION ESPANOL Swerts M, 2007, J PHONETICS, V35, P380, DOI 10.1016/j.wocn.2006.07.001 Swerts M, 2002, J PHONETICS, V30, P629, DOI 10.1006/jpho.2002.0178 TOLEDO G, 2006, ESTUDIOS FONETICA EX, V15, P99 VERDUGO MR, 2005, ESTUDIOS FONETICA EX, V14, P309 Wells B, 2004, J CHILD LANG, V31, P749, DOI 10.1017/S030500090400652X WING L, 1991, AUTISM ASPERGER SYND Zubizarreta M. L., 1999, GRAMATICA DESCRIPTIV, V3, P4215 Zubizarreta M.-L., 1998, PROSODY FOCUS WORLD NR 54 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 900 EP 915 DI 10.1016/j.specom.2008.03.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100004 ER PT J AU Meireles, AR Barbosa, PA AF Meireles, A. R. Barbosa, P. A. TI Lexical reorganization in Brazilian Portuguese: An articulatory study SO SPEECH COMMUNICATION LA English DT Article DE Speech rate; Dynamical systems; Articulatory Phonology; Linguistic change; Lexical variation AB This work, which is couched in the theoretical framework of Articulatory Phonology, deals with the influence of speech rate on the change/variation from antepenultimate stress words into penultimate stress words in Brazilian Portuguese. Both acoustic and articulatory (EMMA) studies were conducted. On the acoustic side, results show different patterns of post-stressed vowel reduction according to the word type. Some words reduced their medial post-stressed vowels more than their final post-stressed vowels, and others reduced their final post-stressed vowels more than their medial post-stressed vowels.
On the articulatory side, results show that the coarticulation degree of the post-stressed consonants increases with speech rate. Also, using a measure called the proportional consonantal interval (PCI), measurements of articulation showed that this measure is influenced by the word type. Three different groups of words were found according to their PCI. These results show how dynamical aspects influenced by speech rate increase are related to the lexical process of change/variation from antepenultimate stress words into penultimate ones. (C) 2008 Elsevier B.V. All rights reserved. C1 [Meireles, A. R.] Univ Fed Minas Gerais, Speech Prosody Studies Grp, Fac Letras, BR-31270901 Belo Horizonte, MG, Brazil. [Barbosa, P. A.] Univ Estadual Campinas, Speech Prosody Studies Grp, Dept Linguist, BR-13083970 Campinas, SP, Brazil. RP Meireles, AR (reprint author), Univ Fed Minas Gerais, Speech Prosody Studies Grp, Fac Letras, Av Antonio Carlos 6627, BR-31270901 Belo Horizonte, MG, Brazil. EM meirelesatex@gmail.com; sa.unicampbr@gmail.com FU CAPES; CNPq [200199/2004-8]; Fapesp [03/09199-2]; NIH [DC03172] FX This work was supported by the following grants: CAPES, CNPq (200199/2004-8), Fapesp (03/09199-2), and NIH (DC03172). The authors thank Marco Antonio de Oliveira, Tania Alckmim, Maria Bernadete Marques Abaurre, and Louis Goldstein for their insightful comments on this research. Many thanks are due to USC Phonetics Laboratory, Dani Byrd and Sungbok Lee, for help with the articulatory design and their support, and James Mah, for help with the articulatory recordings. CR Abaurre-Gnerre M. B., 1981, CADERNO ESTUDOS LING, V2, P23 ABAURREGNERRE MB, THESIS STATE U NEW Y Albano Eleonora C., 2001, GESTO SUAS BORDAS ES AMARAL MP, 2002, FONOLOGIA VARIACAO R Barbosa P. A., 2006, INCURSOES TORNO RITM Bradley T. G., 2004, LAB APPROACHES SPANI, P195 Browman Catherine, 1986, PHONOLOGY YB, V3, P219 BROWMAN CP, 1992, PHONETICA, V49, P155 Browman C. P., 1990, PAPERS LABORATORY PH, P341 Browman CP, 1989, PHONOLOGY, V6, P201, DOI 10.1017/S0952675700001019 BYRD D, 2003, 15 ICPHS BYRD D, 1994, SPEECH COMMUN, V15, P39, DOI 10.1016/0167-6393(94)90039-6 CAMARA JM, 1988, DICIONARIO LINGUISTI DEVASCONCELOS CM, 1956, LISBOA NOVA EDICAO A Elman J. L., 1995, MIND MOTION EXPLORAT, P195 HAUY AB, 1994, HIST LINGUA PORTUGUE Kelso J. A. S., 1995, DYNAMIC PATTERNS SEL KELSO JAS, 1986, J PHONETICS, V14, P29 McMahon April M. S., 2000, LEXICAL PHONOLOGY HI MEIRELES A, 2001, THESIS U FEDERAL MIN MEIRELES AR, PAPEL TAXA ELO UNPUB MEIRELES A, 2007, THESIS U ESTADUAL CA NUNES JJ, 1973, CANTIGAS AMIGO TROVA NUNES JJ, 1969, COMPENDIO GRAMATICA PERKELL J, 1992, J ACOUST SOC AM, P3078 QUEDNAU LR, 2002, FONOLOGIA VARIACAO R SILVANETO S, 1946, FONTES LATIM VULGAR TIEDE MK, 1999, MAGNETOMETER DATA AC TULLER B, 1984, J ACOUST SOC AM, V76, P1030, DOI 10.1121/1.391421 van Gelder Timothy, 1995, MIND MOTION EXPLORAT WILLIAMS E, 1961, LATIM PORTUGUES FONO NR 31 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 916 EP 924 DI 10.1016/j.specom.2008.05.005 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100005 ER PT J AU Martins, P Carbone, I Pinto, A Silva, A Teixeira, A AF Martins, Paula Carbone, Ines Pinto, Alda Silva, Augusto Teixeira, Antonio TI European Portuguese MRI based speech production studies SO SPEECH COMMUNICATION LA English DT Article DE Speech production; European Portuguese; Magnetic resonance imaging; Nasals; Coarticulation ID ARTICULATORY-ACOUSTIC MODELS; FRICATIVE CONSONANTS; EPG DATA; ENGLISH; VOWELS; COARTICULATION; IMAGES; NASAL AB Knowledge of the speech production mechanism is essential for the development of speech production models and theories. Magnetic resonance imaging delivers high quality images of soft tissues, has multiplanar capacity and allows for the visualization of the entire vocal tract. To our knowledge, there are no complete and systematic magnetic resonance imaging studies of European Portuguese production. In this study, a recently acquired magnetic resonance imaging database including almost all classes of European Portuguese sounds, excluding taps and trills, is presented and analyzed. Our work contemplated not only image acquisition but also the utilization of image processing techniques to allow the exploration of the entire database in a reasonable time. Contours extracted from 2D images, articulatory measures (2D) and area functions are explored and represent valuable information for articulatory synthesis and articulatory phonetics descriptions. Some European Portuguese distinctive characteristics, such as nasality, are addressed in more detail. Results relative to oral vowels, nasal vowels and a comparison between both classes are presented. The more detailed information on tract configuration supports results obtained with other techniques, such as EMMA, and allows the comparison of European Portuguese and French nasal vowel articulation, with differences detected at pharyngeal cavity level and velum port opening quotient. A detailed characterization of the central vowels, particularly the [ɨ], is presented and compared with classical descriptions. Results for consonants point to the existence of a single positional dark allophone for [l], a more palato-alveolar place of articulation for [s], a more anterior place of articulation for [ʎ] relative to [ɲ], and the use, by our speaker, of a palatal place of articulation for [k]. Some preliminary results concerning coarticulation are also reported. European Portuguese stops proved less resistant to coarticulatory effects than fricatives. Among all the sounds studied, [ʃ] and [ʒ] present the highest resistance to coarticulation. These results follow the main key features found in other studies performed for different languages. (C) 2008 Elsevier B.V. All rights reserved. C1 [Carbone, Ines; Silva, Augusto; Teixeira, Antonio] Univ Aveiro, Dep Elect Telec Informat IEETA, P-3810 Aveiro, Portugal. [Martins, Paula] Univ Aveiro, Escola Super Saude, P-3810 Aveiro, Portugal. [Pinto, Alda] Hosp Univ Coimbra, Dep Radiol, Coimbra, Portugal. RP Teixeira, A (reprint author), Univ Aveiro, Dep Elect Telec Informat IEETA, P-3810 Aveiro, Portugal.
EM ajst@ua.pt RI Silva, Augusto/A-3950-2012; Teixeira, Antonio/A-4958-2012; Martins, Paula/L-4836-2014 OI Martins, Paula/0000-0001-5218-388X CR ADAMS R, 1994, IEEE T PATTERN ANAL, V16, P641, DOI 10.1109/34.295913 Alwan A, 1997, J ACOUST SOC AM, V101, P1078, DOI 10.1121/1.417972 ANDRADE A, 1999, INT C PHON ICPHS SAN BADIN P, 1998, 5 INT C SPOK LANG PR, P417 BAER T, 1991, J ACOUST SOC AM, V90, P799, DOI 10.1121/1.401949 Bangayan P, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P793 BARBOSA AM, 1994, INTRO ESTUDO FONOLOG Bryman A, 2001, QUANTITATIVE DATA AN CHODOROWSKI A, 2005, SPIE Cruz-Ferreira M., 1999, HDB INT PHONETIC ASS, P126 DANG J, 1994, J ACOUST SOC AM JASA, V94, P1765 DANG J, 1996, ICSLP DANG JW, 1994, J ACOUST SOC AM, V96, P2088, DOI 10.1121/1.410150 DELVAUX V, 2002, 7 INT C SPOK LANG PR, V1, P53 DEMOLIN D, 1998, 5 INT C SPOK LANG PR Demolin D, 2003, J VOICE, V17, P454, DOI 10.1067/S0892-1997(03)00036-5 DEMOLIN D, 1996, 4 INT C SPOK LANG PR, V1, P272 DEMOLIN D, 1996, 20 JOURN ET PAR AV F, P83 Engwall O, 2000, 5 SEM SPEECH PROD KL, P297 ENGWALL O, 2003, 6 SEM SPEECH PROD, P43 ENGWALL O, 1999, 6 EUR C SPEECH COMM, P113 ENGWALL O, 1999, COLLECTING ANAL 2 3, P11 Engwall O., 2006, 7 INT SEM SPEECH PRO Engwall O., 2002, THESIS ROYAL I TECHN ERICSDOTTER C, 2005, THESIS STOCKHOLM U Espinosa A., 2005, J INT PHON ASSOC, V35, P1, DOI DOI 10.1017/S0025100305001878 Fametani E., 1999, HDB PHONETIC SCI, P371 Gick B, 2002, J PHONETICS, V30, P357, DOI 10.1006/jpho.2001.0161 GREENWOOD AR, 1992, IEE PROC-I, V139, P553 GREGIO FN, 2006, THESIS PONTIFICIA U Hardcastle W., 1999, COARTICULATION THEOR Hardcastle W., 1976, PHYSL SPEECH PRODUCT HOOLE P, 2000, 5 SEM SPEECH PROD KL, P157 Hoole P., 1999, COARTICULATION THEOR, P260 Hoole P, 1993, FORSCHUNGSBERICHTE I, V31, P43 International Phonetic Association, 1999, HDB INT PHON ASS GUI JACKSON PJB, 2000, THESIS U SOUTHAMPTON Jesus LMT, 2002, J PHONETICS, V30, P437, DOI 10.1006/yjpho.2002.0169 Kim H, 2004, PHONETICA, V61, P234, DOI 10.1159/000084160 KIRITANI S, 1986, SPEECH COMMUN, V5, P119, DOI 10.1016/0167-6393(86)90003-8 Kroger BJ, 2000, 5 SEM SPEECH PROD KL, P333 KUHNERT B, 1999, COARTICULATION THEOR Ladefoged P., 1996, SOUNDS WORLDS LANGUA Magen HS, 1997, J PHONETICS, V25, P187, DOI 10.1006/jpho.1996.0041 Manuel S. 
Y, 1999, COARTICULATION THEOR, P179 Mathiak K, 2000, INT J LANG COMM DIS, V35, P419 MOHAMMAD M, 1997, 5 EUR C SPEECH COMM, V4, P2027 Narayanan S, 2004, J ACOUST SOC AM, V115, P1771, DOI 10.1121/1.1652588 Narayanan S, 2000, IEEE T SPEECH AUDI P, V8, P328, DOI 10.1109/89.841215 NARAYANAN SS, 1995, J ACOUST SOC AM, V98, P1325, DOI 10.1121/1.413469 Narayanan SS, 1997, J ACOUST SOC AM, V101, P1064, DOI 10.1121/1.418030 NARAYANAN SS, 1995, J ACOUST SOC AM, V97, P2511, DOI 10.1121/1.411971 NOGUEIRA RS, 1938, ELEMENTOS TRATADO FO Perkell JS, 1969, PHYSL SPEECH PRODUCT, V53 RECASENS D, 1999, COARTICULATION Recasens D, 2006, J PHONETICS, V34, P295, DOI 10.1016/j.wocn.2005.06.003 Recasens D, 1997, J ACOUST SOC AM, V102, P544, DOI 10.1121/1.419727 ROSSATO S, 2006, 26 JOURN DET PAR DIN RUA S, 2006, COMPIMAGE COMPUTATIO Sachs L., 1984, APPL STAT HDB TECHNI Santos BS, 2004, ACAD RADIOL, V11, P868, DOI 10.1016/j.acra.2004.05.004 SERRURIER A, 2005, ZAS PAPERS LINGUISTI, V40, P195 SERRURIER A, 2005, INTERSPEECH Shadle C, 1999, 14 INT C PHON SCI IC, P623 SHADLE CH, 1996, MRI STUDY EFFECTS VO, V18, P187 STONE M, 1999, HDB PHONETIC SCI, P11 STONE M, 1997, 5 EUR C SPEECH COMM, P1 Stone M, 2001, J SPEECH LANG HEAR R, V44, P1026, DOI 10.1044/1092-4388(2001/081) Story BH, 1996, J ACOUST SOC AM, V100, P537, DOI 10.1121/1.415960 STREVENS PD, 1954, REV LABORATORIO FONE, V2, P5 Takemoto H., 2004, Acoustical Science and Technology, V25, DOI 10.1250/ast.25.468 TEIXEIRA A, 2003, INT C PHON SCI ICPHS, P3033 TEIXEIRA A, 1999, ICPHS, P2557 TEIXEIRA A, 2001, 7 EUR C SPEECH COMM, V2, P1483 Teixeira AJS, 2005, EURASIP J APPL SIG P, V2005, P1435, DOI 10.1155/ASP.2005.1435 TIEDE M, 2000, 5 SEM SPEECH PROD KL, P25 Tiede MK, 1996, J PHONETICS, V24, P399, DOI 10.1006/jpho.1996.0022 TULLER E, 1981, J PHONETICS, V9, P175 VIANA MC, 1996, INTRO LINGUISTICA GE, P113 WEST P, 2000, SPEECH PROD SEM SEEO, P105 YANG B, 1999, 14 INT C PHON SCI IC, P2005 NR 81 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 925 EP 952 DI 10.1016/j.specom.2008.05.019 PG 28 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100006 ER PT J AU Yoma, NB Garreton, C Molina, C Huenupan, F AF Yoma, Nestor Becerra Garreton, Claudio Molina, Carlos Huenupan, Fernando TI Unsupervised intra-speaker variability compensation based on Gestalt and model adaptation in speaker verification with telephone speech SO SPEECH COMMUNICATION LA English DT Article DE Text-dependent speaker verification; Feature compensation; Intra-speaker variability; Unsupervised model adaptation; Gestalt; Telephone speech; Limited enrolling data; Noise robustness; Speaker verification database in Spanish ID LIKELIHOOD LINEAR-REGRESSION; RECOGNITION AB In this paper, an unsupervised intra-speaker variability compensation (ISVC) method based on Gestalt is proposed to address the problem of limited enrolling data and noise robustness in text-dependent speaker verification (SV). Experiments with two databases show that ISVC can lead to reductions in EER as high as 20% or 40%, and ISVC provides reductions in the integral below the ROC curve between 30% and 60%. Also, the observed improvements are independent of the number of enrolling utterances. In contrast to model adaptation methods, ISVC is memoryless with respect to previous verification attempts.
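The EER figures in this abstract refer to the operating point where the false acceptance and false rejection rates coincide; purely as a hedged illustration (not the authors' implementation), it can be estimated from client and impostor score lists as follows:

```python
# Illustrative sketch: the equal error rate (EER) is the point on the ROC
# curve where the false acceptance rate (FAR) over impostor scores equals
# the false rejection rate (FRR) over client scores.

def equal_error_rate(client_scores, impostor_scores):
    """Sweep a decision threshold over all observed scores and return the
    rate at the threshold where |FAR - FRR| is smallest."""
    best_gap, eer = float("inf"), 1.0
    for t in sorted(set(client_scores) | set(impostor_scores)):
        frr = sum(s < t for s in client_scores) / len(client_scores)
        far = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example with mostly separated score distributions.
print(equal_error_rate(client_scores=[0.9, 0.8, 0.7, 0.4],
                       impostor_scores=[0.5, 0.3, 0.2, 0.1]))  # -> 0.25
```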
As shown here, unsupervised model adaptation can lead to substantial improvements in EER but is highly dependent on the sequence of client/impostor verification events. In adverse scenarios, such as massive impostor attacks and verification from an alternate telephone line, unsupervised model adaptation might even provide reductions in verification accuracy when compared with the baseline system. In those cases, ISVC can even outperform adaptation schemes. It is worth emphasizing that ISVC and unsupervised model adaptation are compatible and the combination of both methods always improves the performance of model adaptation. The combination of both schemes can lead to improvements in EER as high as 34%. Due to the restrictions of commercially available databases for text-dependent SV research, the results presented here are based on local databases in Spanish. By doing so, the visibility of research in Iberian Languages is highlighted. (C) 2007 Elsevier B.V. All rights reserved. C1 [Yoma, Nestor Becerra; Garreton, Claudio; Molina, Carlos; Huenupan, Fernando] Univ Chile, Speech Proc & Transmiss Lab, Dept Elect Engn, Santiago, Chile. RP Yoma, NB (reprint author), Univ Chile, Speech Proc & Transmiss Lab, Dept Elect Engn, Santiago, Chile. EM nbecerra@ing.uchile.cl FU Conicyt - Chile [D051-10243]; Fondecyt [1070382, 1030956] FX The authors would like to thank Dr. Simon King, CSTR/University of Edinburgh (UK) for having proofread this manuscript. This work was funded by Conicyt - Chile under grants Fondef No. D051-10243 and Fondecyt No. 1070382/1030956. CR Afify M, 1998, IEEE T SPEECH AUDI P, V6, P524, DOI 10.1109/89.725319 AHN S, 2000, IEE ELECT LETT, V36, P371 ASAMI T, 2005, P INT 2005 LISB PORT, P2185 BARRAS C, 2004, P OD TOL SPAIN Cui XD, 2005, IEEE T SPEECH AUDI P, V13, P1161, DOI 10.1109/TSA.2005.853002 FREDOUILLE C, 2000, P ICASSP 2000, P1197 Furui S, 1997, PATTERN RECOGN LETT, V18, P859, DOI 10.1016/S0167-8655(97)00073-1 Gao QG, 2005, COMPUT VIS IMAGE UND, V100, P442, DOI 10.1016/j.cviu.2005.06.002 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 HARDT D, 1997, P ICASSP, P867 Jiang H, 2006, PATTERN RECOGN, V39, P988, DOI 10.1016/j.patcog.2005.08.012 Kim S, 2006, PATTERN RECOGN LETT, V27, P811, DOI 10.1016/j.patrec.2005.11.008 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 LIT PW, 2001, P ICASSP SALT LAK CI, P457 MYRVOLL T, 2000, P ICSLP BEIJ CHIN, P540 Ortega-Garcia J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607754 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Rao C., 1965, LINEAR STAT INFERENC Sano D., 1995, DESIGNING VISUAL INT UEBEL LF, 2001, SPEAKER ADAPTATION U WERTHEIMER M, 1950, LAWS ORG PERCEPTUAL Yegnanarayana B, 2005, IEEE T SPEECH AUDI P, V13, P575, DOI 10.1109/TSA.2005.848892 Yiu KK, 2007, COMPUT SPEECH LANG, V21, P231, DOI 10.1016/j.csl.2006.05.001 YU K, 1996, P ICSLP 96, P1752 NR 24 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 953 EP 964 DI 10.1016/j.specom.2007.11.005 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100007 ER PT J AU Rouas, JL Trancoso, I Viana, C Abreu, M AF Rouas, Jean-Luc Trancoso, Isabel Viana, Ceu Abreu, Monica TI Language and variety verification on broadcast news for Portuguese SO SPEECH COMMUNICATION LA English DT Article DE Language verification; Portuguese varieties ID IDENTIFICATION AB This paper describes a language/accent verification system for Portuguese that explores different types of properties: acoustic, phonotactic and prosodic. The two-stage system is designed to be used as a pre-processing module for the Portuguese Automatic Speech Recognition (ASR) system developed at INESC-ID. As the ASR system is applied every day to transcribe the evening news from a Portuguese public TV channel, the presence of other languages (mainly English) and other varieties of Portuguese is very likely. In the first stage, for each automatically detected speaker, the system verifies if the spoken language is Portuguese, as opposed to nine other languages: English, Belgian Dutch, Croatian, Czech, Galician, Greek, Hungarian, Slovene and Slovak. The identified Portuguese speakers are then fed to the second stage, which aims at identifying the Portuguese variety: European, Brazilian or African Portuguese from five countries. The identification results are then used either to mark the speech data as untranscribable or to forward it to the European Portuguese ASR system, or a system tuned for other languages or varieties. The language verification system achieved an equal error rate for European Portuguese of 2.5%. In terms of variety identification, the overall rate of correct identification was 83.9%, when considering only the three broad varieties, and the best results were obtained for Brazilian Portuguese, also the variety that proved easiest to identify in perceptual experiments. The identification rate between African varieties themselves is relatively low, a fact that was also observed in the perceptual experiments. (C) 2008 Elsevier B.V. All rights reserved. C1 [Rouas, Jean-Luc; Trancoso, Isabel; Abreu, Monica] INESC ID, Spoken Language Syst Lab L2F, P-1000029 Lisbon, Portugal. [Rouas, Jean-Luc] INRETS Elect Waves & Signal Proc Res Lab Transpor, F-59650 Villeneuve Dascq, France. [Viana, Ceu] Univ Lisbon, Ctr Linguist, P-1649003 Lisbon, Portugal. RP Trancoso, I (reprint author), INESC ID, Spoken Language Syst Lab L2F, R Alves Redol 9, P-1000029 Lisbon, Portugal. EM rouas@inrets.fr; Isabel.Trancoso@inesc-id.pt; mcv@clul.ul.pt; monica7abreu@gmail.com RI Trancoso, Isabel/C-5965-2008 OI Trancoso, Isabel/0000-0001-5874-6313 FU FCT [SFRH/BPD/22032/2005]; European project Vidi-Video; PRIME National Project TECNOVOZ [03/165] FX The authors would like to thank our colleagues Hugo Meinedo and Ernesto de Andrade for helpful comments. This work was partially funded by FCT under the postdoc scholarship SFRH/BPD/22032/2005, and also by the European project Vidi-Video, and by PRIME National Project TECNOVOZ No. 03/165. CR ABAURRE MB, 1996, GRAMATICA PORTUGUES, V6, P495 ANDREOBRECHT R, 1988, IEEE T ACOUST SPEECH, V36, P29, DOI 10.1109/29.1486 Barbosa P.
A., 2004, J INT PHON ASSOC, V34, P227, DOI 10.1017/S0025100304001756 BERKLING K, 1998, ICSLP 98 CALLOU D, 1990, RIO JANEIRO CAMPBELL WM, 2006, P OD 2006 SPEAK LANG CAMPIONC E, 1998, MULTILINGUAL PROSODI CHEN T, 2001, IEEE WORKSH AUT SPEE DAUER RM, 1983, J PHONETICS, V11, P51 FERNADES F, THESIS U ESTADUAL CA Frota S., 2001, PROBUS, V13, P247, DOI 10.1515/prbs.2001.005 FROTA S, 2002, SPEECH PROSODY FUJISAKI H, 2003, ISCA WORKSH SPOK LAN FUNG P, 1999, ICASSP 1999 GAUVAIN JL, 2004, INTERSPEECH 2004 HUANG R, 2006, INTERSPEECH 2006 IKENO A, 2006, INTERSPEECH 2006 KITAZAWA S, 2002, SPEECH PROSODY KOMATSU M, 2004, SPEECH PROSODY LACERDA AND, 1958, REV LAB FONETICA EXP, V4, P5 LEITE Y, 1996, ACT C INT PORT LISB, V3 LI J, 2006, P OD 06 SPEAK LANG R LINCOLN M, 1998, ICSLP 98 MATEJKA P, 2006, P OD 2006 SPEAK LANG Mateus M. H., 2000, PHONOLOGY PORTUGUESE MEINEDO H, 2003, ICASSP 2003 MEINEDO H, 2003, PROPOR 2003 MEINEDO H, 2005, INTERSPEECH 2005 Parkinson Stephen, 1988, ROMANCE LANGUAGES, P131 PELLEGRINO F, 1997, IEEE DIGITAL SIGNAL ROUAS JL, 2008, 2008 IEEE INT C AC S ROUAS JL, 2005, THESIS U TOULOUSE 3 ROUAS JL, 2005, INTERSPEECH 2005 ROUAS JL, 2006, JOURN ET PAR Rouas JL, 2007, IEEE T AUDIO SPEECH, V15, P1904, DOI 10.1109/TASL.2007.900094 SCHULTZ T, 2002, HLT 2002 SJOLANDER K, 2000, SNACK SOUND TOOLKIT STOLCKE A, 2002, INTERSPEECH 2002 TORRESCARRASQUI.PA, 2004, ODYSSEY SPEAKER LANG Tsai WH, 2002, SPEECH COMMUN, V36, P317, DOI 10.1016/S0167-6393(00)00090-X VANDECATSEYE A, 2004, LREC 2004 VIERUDIMULESCU B, 2006, INTERSPEECH 2006 WU T, 2006, INTERSPEECH 2005 YIN B, 2006, INT C PATT REC ZHENG Y, 2006, INTERSPEECH 2005 ZISSMAN MA, 1993, IEEE 18 INT C AC SPE Zissman MA, 2001, SPEECH COMMUN, V35, P115, DOI 10.1016/S0167-6393(00)00099-6 NR 47 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 965 EP 979 DI 10.1016/j.specom.2008.05.006 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100008 ER PT J AU Tejedor, J Wang, D Frankel, J King, S Colas, J AF Tejedor, Javier Wang, Dong Frankel, Joe King, Simon Colas, Jose TI A comparison of grapheme and phoneme-based units for Spanish spoken term detection SO SPEECH COMMUNICATION LA English DT Article DE Spoken term detection; Keyword spotting; Graphemes; Spanish ID SPEECH; RETRIEVAL AB The ever-increasing volume of audio data available online through the world wide web means that automatic methods for indexing and search are becoming essential. Hidden Markov model (HMM) keyword spotting and lattice search techniques are the two most common approaches used by such systems. In keyword spotting, models or templates are defined for each search term prior to accessing the speech and used to find matches. Lattice search (referred to as spoken term detection) uses a pre-indexing of speech data in terms of word or sub-word units, which can then quickly be searched for arbitrary terms without referring to the original audio. In both cases, the search term can be modelled in terms of sub-word units, typically phonemes. For in-vocabulary words (i.e. words that appear in the pronunciation dictionary), letter-to-sound conversion systems are accepted to work well. However, for out-of-vocabulary (OOV) search terms, letter-to-sound conversion must be used to generate a pronunciation for the search term. This is usually a hard decision (i.e.
not probabilistic and with no possibility of backtracking), and errors introduced at this step are difficult to recover from. We therefore propose the direct use of graphemes (i.e., letter-based sub-word units) for acoustic modelling. This is expected to work particularly well in languages such as Spanish, where despite the letter-to-sound mapping being very regular, the correspondence is not one-to-one, and there will be benefits from avoiding hard decisions at early stages of processing. In this article, we compare three approaches for Spanish keyword spotting or spoken term detection, and within each of these we compare acoustic modelling based on phone and grapheme units. Experiments were performed using the Spanish geographical-domain ALBAYZIN corpus. Results achieved in the two approaches proposed for spoken term detection show that trigrapheme units for acoustic modelling match or exceed the performance of phone-based acoustic models. In the method proposed for keyword spotting, the results achieved with each acoustic model are very similar. (C) 2008 Elsevier B.V. All rights reserved. C1 [Tejedor, Javier; Colas, Jose] Escuela Politecn Super UAM, Human Comp Technol Lab, Madrid 28049, Spain. [Tejedor, Javier; Wang, Dong; Frankel, Joe; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9LW, Midlothian, Scotland. RP Tejedor, J (reprint author), Escuela Politecn Super UAM, Human Comp Technol Lab, Ave Francisco Tomas & Valiente 11, Madrid 28049, Spain. EM javier.tejedor@uam.es; dwang2@inf.ed.ac.uk; joe@cstr.ed.ac.uk; Simon.King@ed.ac.uk; jose.colas@uam.es FU Spanish Ministry of Science and Education [TIN 200506885] FX This work was mostly performed whilst JT was a visiting researcher at the Centre for Speech Technology Research, University of Edinburgh, and was partly funded by the Spanish Ministry of Science and Education (TIN 200506885). DW is a Fellow on the Edinburgh Speech Science and Technology (EdSST) interdisciplinary Marie Curie training programme. JF is funded by Scottish Enterprise under the Edinburgh Stanford Link. SK is an EPSRC Advanced Research Fellow. Many thanks to Igor Szoke and colleagues in the Speech Processing Group of the Faculty of Information Technology at Brno University of Technology for providing the lattice search tools. CR ALARCOS E, 1995, GRAMATICA LENGUA ESP CHEN B, 2004, P RICH TRANSCR FALL Cole R. A., 1994, P ISCLP YOK JAP, P1815 CUAYAHUITL H., 2002, P MICAI, P156 DINES J, 2007, P MLMI BRNO CZECH RE FISSORE L, 1989, IEEE T ACOUST SPEECH, V37, P1197, DOI 10.1109/29.31268 Hansen JH, 2005, IEEE T SPEECH AUDI P, V13, P712, DOI 10.1109/TSA.2005.852088 HAUPTMANN AG, 1997, P IEEE INT C AC SPEE, V1, P195 JAMES D, 1994, P IEEE INT C AC SPEE, V1, P465 Killer M., 2003, P EUR KIM J., 2004, P SPECOM, P156 Lleida E., 1993, P EUROSPEECH, P1265 Logan B., 2000, P ICSLP 00 BEIJ CHIN, V2, P676 MAGIAMIDOSS M, 2004, P ICASSP MONT CAN, P177 MAGIMAIDOSS M, 2003, P WORKSH AUT SPEECH, P94 Makhoul J, 2000, P IEEE, V88, P1338, DOI 10.1109/5.880087 Moreno A., 1993, P EUR SEPT, V1, P653 NIST, 2006, SPOK TERM DET STD 20 PRICE P, 1998, P ICASSP, V1, P651 Quilis A., 1998, COMENTARIO FONOLOGIC ROHLICEK J, 1995, MODERN METHODS SPEEC Rohlicek J.
R., 1989, P INT C AC SPEECH SI, V1, P627 SCOTT J, 2007, P IEEE INT C AC SPEE STEIGNER J, 2003, P INT Szoke I., 2005, P INTERSPEECH, P633 TANAKA K, 2001, P IEEE WORKSHOP APR, P323 TEJEDOR J, 2006, P 4 JORN TECN HABL, P255 Thambiratnam K, 2007, IEEE T AUDIO SPEECH, V15, P346, DOI 10.1109/TASL.2006.872615 Young S., 2006, HTK BOOK HTK VERSION YOUNG SJ, 1997, P IEEE INT C AC SPEE, V1, P199 Yu P, 2005, IEEE T SPEECH AUDI P, V13, P635, DOI 10.1109/TSA.2005.851881 YU P, 2004, P INT C SPEECH LANG, P635 NR 32 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 980 EP 991 DI 10.1016/j.specom.2008.03.005 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100009 ER PT J AU Martinez-Hinarejos, CD Benedi, JM Granell, R AF Martinez-Hinarejos, Carlos-D. Benedi, Jose-Miguel Granell, Ramon TI Statistical framework for a Spanish spoken dialogue corpus SO SPEECH COMMUNICATION LA English DT Article DE Spoken dialogue systems; Statistical models; Dialogue annotation; Dialogue models ID SYSTEMS; INFERENCE; LANGUAGES AB Dialogue systems are one of the most interesting applications of speech and language technologies. There have recently been some attempts to build dialogue systems in Spanish, and some corpora have been acquired and annotated. Using these corpora, statistical machine learning methods can be applied to try to solve problems in spoken dialogue systems. In this paper, two statistical models based on the maximum likelihood assumption are presented, and two main applications of these models on a Spanish dialogue corpus are shown: labelling and decoding. The labelling application is useful for annotating new dialogue corpora. The decoding application is useful for implementing dialogue strategies in dialogue systems. Both applications centre on unsegmented dialogue turns. The obtained results show that, although limited, the proposed statistical models are appropriate for these applications. (C) 2008 Elsevier B.V. All rights reserved. C1 [Martinez-Hinarejos, Carlos-D.; Benedi, Jose-Miguel] Univ Politecn Valencia, Inst Informat Technol, Valencia 46022, Spain. [Granell, Ramon] Univ Oxford, Comp Lab, Oxford OX1 3QD, England. RP Martinez-Hinarejos, CD (reprint author), Univ Politecn Valencia, Inst Informat Technol, Camino Vera S-N, Valencia 46022, Spain. EM cmartine@dsic.upv.es RI Benedi, Juana/K-9740-2014 OI Benedi, Juana/0000-0002-3796-639X FU Spanish research programme Consolider Ingenio [2010]; MIPRCV [CSD2007-00018]; EU [FP6-IST 034434]; EC (FEDER); Spanish MEC [TIN2006-15694-C02-01]; VIDI-UPV [20070315] FX Work partially supported by the Spanish research programme Consolider Ingenio 2010: MIPRCV (CSD2007-00018), by the EU under Project FP6-IST 034434, by the EC (FEDER) and the Spanish MEC Under Grant TIN2006-15694-C02-01, and by VIDI-UPV under project 20070315. CR Alcacer N., 2005, P 10 INT C SPEECH CO, P583 ALEXANDERSSON J, 1998, 226 DFKI GMBH Ang J., 2005, P ICASSP PHIL US, V1, P1061 AUST H, 1995, SPEECH COMMUN, V17, P249, DOI 10.1016/0167-6393(95)00028-M Austin J., 1962, DO THINGS WORDS Bened J. M., 2006, 5 INT C LANG RES EV, P1636 Benedi J. M., 2006, P COLING ACL 2006 MA, P563 Bos J., 2003, 4 SIGDIAL WORKSH DIS Brown P. 
F., 1993, Computational Linguistics, V19 Carletta J., 1996, HCRC DIALOGUE STRUCT Casacuberta F, 2005, PATTERN RECOGN, V38, P1431, DOI 10.1016/j.patcog.2004.03.025 Cieri C., 2004, MIXER CORPUS MULTILI Core M., 1997, WORK NOT AAAI FALL S, P28 FINKE M, 1998, APPLYING MACHINE LEA Fraser N. M., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90019-M FUKADA T, 1998, P INT C SPOK LANG PR, P2771 GARCIA P, 1990, IEEE T PATTERN ANAL, V12, P920, DOI 10.1109/34.57687 Hardy Hilda, 2002, P ISLE WORKSH DIAL T HEEMAN P, 1994, 942 TRAINS TN U ROCH HO TK, 1994, IEEE T PATTERN ANAL, V16, P66 Jurafsky D., 1997, 9701 U COL I COGN SC KUPPEVELT JV, 2003, CURRENT NEW DIRECTIO, V22 Lavie A., 1997, P WORKSH SPOK LANG T LEVIN L, 1998, DISCOURSE CODING SCH LEVIN L, 1999, P WORKSH STAND TOOLS, P42 LOPEZCOZAR R, 1998, P 1 LANG RES EV C, P55 MARK B, 2004, CALO COGNITIVE AGENT MARTINEZHINAREJ.CD, 2000, P 5 IB S PATT REC LI, P669 MCTEAR M, 2000, P 6 INT C SPOK LANG, V1, P110 Meng HM, 2003, IEEE T SPEECH AUDI P, V11, P757, DOI 10.1109/TSA.2003.814380 Mengel A., 2000, MATE DIALOGUE ANNOTA Martinez-Hinarejos CD, 2006, LECT NOTES ARTIF INT, V4188, P653 RANGARAJAN V, 2007, EXPLOITING PROSODIC REAVES B, 1998, P AC SOC JAP SPRING, P53 Schiffrin Deborah, 1994, APPROACHES DISCOURSE Seneff S., 2000, ANLP NAACL 2000 SAT, P1 Shriberg E., 2004, P 5 SIGDIAL WORKSH D, P97 Stolcke A., 2000, COMPUTATIONAL LINGUI, V26, P1 TAPIAS D, 1994, P 3 INT C SPOK LANG, P1181 Trahanias P., 2007, INDIGO INTERACTION P VIDAL E, 1994, LNAI, V862 WALKER M, EUROSPEECH 2001 WALKER M, 2001, P HLT, P1, DOI 10.3115/1072133.1072148 Warnke V., 1997, P 5 EUR C SPEECH COM, V1, P207 Webb N., 2005, P AAAI WORKSH SPOK L Wilks Y., 2006, COMPANIONS INTELLIGE Williams JD, 2007, COMPUT SPEECH LANG, V21, P393, DOI 10.1016/j.csl.2006.06.008 Young SJ, 2000, PHILOS T ROY SOC A, V358, P1389, DOI 10.1098/rsta.2000.0593 NR 48 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 992 EP 1008 DI 10.1016/j.specom.2008.05.011 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100010 ER PT J AU San-Segundo, R Barra, R Cordoba, R D'Haro, LF Fernandez, F Ferreiros, J Lucas, JM Macias-Guarasa, J Montero, JM Pardo, JM AF San-Segundo, R. Barra, R. Cordoba, R. D'Haro, L. F. Fernandez, F. Ferreiros, J. Lucas, J. M. Macias-Guarasa, J. Montero, J. M. Pardo, J. M. TI Speech to sign language translation system for Spanish SO SPEECH COMMUNICATION LA English DT Article DE Spanish Sign Language (LSE); Spoken language translation; Sign animation ID MACHINE TRANSLATION AB This paper describes the development of and the first experiments in a Spanish to sign language translation system in a real domain. The developed system focuses on the sentences spoken by an official when assisting people applying for, or renewing, their identity card. The system translates official explanations into Spanish Sign Language (LSE: Lengua de Signos Espanola) for Deaf people. The translation system is made up of a speech recognizer (for decoding the spoken utterance into a word sequence), a natural language translator (for converting a word sequence into a sequence of signs belonging to the sign language), and a 3D avatar animation module (for playing back the hand movements).
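The three-module architecture just described can be pictured as a simple pipeline; the sketch below is a hypothetical toy stand-in (all function names and the rule table are assumptions, not the authors' API), with the rule-based variant letting each sign inherit the confidence of the word that produced it:

```python
# Illustrative toy sketch of the pipeline described above:
# speech recognizer -> natural language translator -> 3D avatar animation.

def recognize(audio):
    # Stand-in for the ASR module: returns (word, confidence) pairs.
    return [("firme", 0.92), ("aqui", 0.85)]

RULES = {"firme": "SIGN-FIRMAR", "aqui": "SIGN-AQUI"}  # hypothetical rules

def translate(words):
    # Rule-based variant: one simple scheme is to derive each sign's
    # confidence directly from the word confidence it came from.
    return [(RULES.get(w, "SIGN-UNKNOWN"), c) for w, c in words]

def animate(signs, threshold=0.5):
    # Stand-in for the avatar module: play back only signs whose
    # confidence exceeds a threshold.
    for sign, conf in signs:
        if conf >= threshold:
            print(f"avatar plays {sign} (confidence {conf:.2f})")

animate(translate(recognize(audio=None)))
```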
Two proposals for natural language translation have been evaluated: a rule-based translation module (that computes sign confidence measures from the word confidence measures obtained in the speech recognition module) and a statistical translation module (in this case, parallel corpora were used for training the statistical model). The best configuration reported 31.6% SER (Sign Error Rate) and 0.5780 BLEU (BiLingual Evaluation Understudy). The paper also describes the eSIGN 3D avatar animation module (considering the sign confidence), and the limitations found when implementing a strategy for reducing the delay between the spoken utterance and the sign sequence animation. (C) 2008 Elsevier B.V. All rights reserved. C1 [San-Segundo, R.; Barra, R.; Cordoba, R.; D'Haro, L. F.; Fernandez, F.; Ferreiros, J.; Lucas, J. M.; Montero, J. M.; Pardo, J. M.] Univ Politecn Madrid, Grp Tecnol Habla, Dept Ingn Elect, ETSI Telecomunicac, E-28040 Madrid, Spain. [Macias-Guarasa, J.] Univ Alcala De Henares, Dept Elect, Alcala De Henares, Spain. RP San-Segundo, R (reprint author), Univ Politecn Madrid, Grp Tecnol Habla, Dept Ingn Elect, ETSI Telecomunicac, Ciudad Univ S-N, E-28040 Madrid, Spain. EM lapiz@die.upm.es RI Macias-Guarasa, Javier/J-4625-2012; Cordoba, Ricardo/B-5861-2008; Pardo, Jose/H-3745-2013; Montero, Juan M/K-2381-2014; Barra-Chicote, Roberto/L-4963-2014; Fernandez-Martinez, Fernando/M-2935-2014 OI Cordoba, Ricardo/0000-0002-7136-9636; Montero, Juan M/0000-0002-7908-5400; Barra-Chicote, Roberto/0000-0003-0844-7037; FU ATI-NA (UPM-DGUI-CAM) [CCG06-UPM/COM-516]; ROBINT (MEC) [DPI2004-07908-C02]; EDEC-AN (MEC) [TIN2005-08660-C04] FX The authors would like to thank the eSIGN (Essential Sign Language Information on Government Networks) consortium for giving us permission to use the eSIGN Editor and the 3D avatar in this research work. This work has been supported by the following projects: ATI-NA (UPM-DGUI-CAM. Ref: CCG06-UPM/COM-516), ROBINT (MEC Ref: DPI2004-07908-C02) and EDEC-AN (MEC Ref. TIN2005-08660-C04). The work presented here was carried out while Javier Macias-Guarasa was a member of the Speech Technology Group (Department of Electronic Engineering, ETSIT de Telecomunicacion, Universidad Politecnica de Madrid). The authors also want to thank Mark Hallett for the English revision.
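The Sign Error Rate (SER) reported above is, like word error rate, a normalized edit distance between hypothesis and reference sign sequences; a minimal sketch of that metric (illustrative only, with made-up sign glosses):

```python
# Illustrative sketch: Sign Error Rate (SER) as the Levenshtein distance
# between hypothesis and reference sign sequences, normalized by the
# reference length -- the sign-level analogue of word error rate.

def sign_error_rate(reference, hypothesis):
    m, n = len(reference), len(hypothesis)
    # d[i][j] = edits needed to turn the first j hypothesis signs into the
    # first i reference signs (substitutions, insertions, deletions).
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = d[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[m][n] / m

ref = ["DNI", "RENOVAR", "USTED", "PAGAR"]  # made-up sign glosses
hyp = ["DNI", "RENOVAR", "PAGAR"]           # one deletion -> SER = 0.25
print(sign_error_rate(ref, hyp))
```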
CR Abdel-Fattah MA, 2005, J DEAF STUD DEAF EDU, V10, P212, DOI 10.1093/deafed/eni007 ATHERTON M, 1999, DEAF WORLDS, V15, P11 BERTENSTAM J, 1995, P SPOK DIAL SYST VIG BUNGEROTH J, 2006, 5 INT C LANG RES EV Casacuberta F, 2004, COMPUT LINGUIST, V30, P205, DOI 10.1162/089120104323093294 CASSELL J, 2002, P IM INT AUT AG MONT CHRISTOPOULOU C, 1985, J COMMUN DISORD, V18, P1, DOI 10.1016/0021-9924(85)90010-3 Cole R, 2003, P IEEE, V91, P1391, DOI 10.1109/JPROC.2003.817143 Cole R., 1999, P ESCA SOCRATES WORK, P45 FERREIROS J, 2005, NEW WORLD LEVEL SENT, P3377 GALLARDO B, 2002, ESTUDIOS LINGUISTICO GRANSTROM B, 2002, SPEECH GESTURES TALK, P209 GUSTAFSON J, 2002, THESIS ROYAL I TECHN GUSTAFSON J, 2003, J NATURAL LANGUAGE E, P273 HERREROBLANCO A, 2005, FUNCTIONAL GRAMMAR S, V27, P281 Engberg-Pedersen E, 2003, POINTING: WHERE LANGUAGE, CULTURE, AND COGNITION MEET, P269 Koehn P., 2004, PHARAOH BEAM SEARCH Koehn P, 2003, HLT-NAACL 2003: HUMAN LANGUAGE TECHNOLOGY CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE MAIN CONFERENCE, P127 LUNDBERG M, 1999, P AUD VIS SPEECH PRO Masataka N, 2006, J DEAF STUD DEAF EDU, V11, P144, DOI 10.1093/deafed/enj017 MEURANT L, 2004, TISLR, V8, P113 NYST V, 2004, TISLR, V8, P127 Och F. J., 2003, Computational Linguistics, V29, DOI 10.1162/089120103321337421 Och J., 2000, P 38 ANN M ASS COMP, P440 Och J., 2002, ANN M ASS COMP LING, P295 Papineni K, 2002, 40TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE CONFERENCE, P311 PRILLWITZ S, 1989, INT STUDIES SIGN LAN, V5 Pyers JE, 2006, LANG SCI, V28, P280, DOI 10.1016/j.langsci.2005.11.010 REYES I, 2005, COMUNICAR TRAVES SIL, P310 Rodriguez M. A., 1991, THESIS CONFEDERACION Stokoe W., 1960, STUDIES LINGUISTICS Sumita E., 2003, C EUR CHAPT ASS COMP, P171 Sutton S, 1998, P INT C SPOK LANG PR, P3221 Sylvie O., 2005, IEEE T PATTERN ANAL, V27 Zens R, 2002, LECT NOTES ARTIF INT, V2479, P18 NR 35 TC 19 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 1009 EP 1020 DI 10.1016/j.specom.2008.02.001 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100011 ER PT J AU Perez, A Torres, MI Casacuberta, F AF Perez, Alicia Torres, M. Ines Casacuberta, Francisco TI Joining linguistic and statistical methods for Spanish-to-Basque speech translation SO SPEECH COMMUNICATION LA English DT Article DE Spoken language translation; Stochastic transducer; Phrase-based translation model; Morpho-syntactic knowledge modeling; Bilingual resources ID FINITE-STATE TRANSDUCERS; MACHINE TRANSLATION; LANGUAGE MODELS; RECOGNITION; INFERENCE AB The goal of this work is to develop a text and speech translation system from Spanish to Basque. This pair of languages differs extraordinarily in both morphology and syntax, which poses attractive challenges for machine translation. Nevertheless, since both languages share official status in the Basque Country, the underlying motivation is not only academic but also practical. Finite-state transducers were adopted as basic translation models. The main contribution of this work involves the study of several techniques to improve probabilistic finite-state transducers by means of additional linguistic knowledge. Two methods to cope with both linguistics and statistics were proposed.
The first one performed a morphological analysis in an attempt to benefit from atomic meaningful units when rendering the meaning from one language to the other. The second approach aimed at clustering words according to their syntactic role and used such phrases as translation units. From the latter approach, phrase-based finite-state transducers arose as a natural extension of classical ones. The models were assessed on a restricted-domain task, very repetitive and with a small vocabulary. Experimental results showed that both the morphological and syntactical approaches outperformed the baseline under different test sets and architectures for speech translation. (C) 2008 Elsevier B.V. All rights reserved. C1 [Perez, Alicia; Torres, M. Ines] Univ Basque Country, Dept Elect & Elect, Fac Sci & Technol, Leioa 48940, Spain. [Casacuberta, Francisco] Univ Politecn Valencia, Dept Informat Syst & Computat, Fac Comp Sci, Valencia 46071, Spain. RP Perez, A (reprint author), Univ Basque Country, Dept Elect & Elect, Fac Sci & Technol, Leioa 48940, Spain. EM alicia.perez@ehu.es; manes@we.lc.ehu.es; fcn@dsic.upv.es RI Torres, Maria Ines/M-5490-2013 OI Torres, Maria Ines/0000-0002-1773-3214 FU University of the Basque Country [9/UPV 00224.310-15900/2004]; Spanish CICYT [TIN2005-08660-C04-03]; Consolider Ingenio [2010 MIPRCV (CSD2007-00018)] FX We would like to thank the anonymous reviewers for their criticisms and suggestions. We would also like to thank the Ametzagina group (http://www.ametza.com), and Josu Landa, in particular, for providing us with the morpho-syntactic parse which made this work possible. This work has been partially supported by the University of the Basque Country under Grant 9/UPV 00224.310-15900/2004, by the Spanish CICYT under grant TIN2005-08660-C04-03 and by the research programme Consolider Ingenio-2010 MIPRCV (CSD2007-00018). CR AGIRRE E, 2006, ACT 22 C SOC ESP PRO, P257 Alegria I, 2007, LECT NOTES COMPUT SC, V4394, P374 Bangalore S., 2002, Machine Translation, V17, DOI 10.1023/B:COAT.0000010804.12581.96 BANGALORE S, 2001, P 2 M N AM CHAPT ASS, P40 BERTOLDI N, 2007, P ICASSP HON HA Brown P.
F., 1993, Computational Linguistics, V19 Carreras X., 2004, P 4 INT C LANG RES E Casacuberta F, 2004, COMPUT LINGUIST, V30, P205, DOI 10.1162/089120104323093294 Casacuberta F, 2007, MACH LEARN, V66, P69, DOI 10.1007/s10994-006-9612-9 Casacuberta F, 1999, PATTERN RECOGN LETT, V20, P813, DOI 10.1016/S0167-8655(99)00045-8 Casacuberta F, 2004, COMPUT SPEECH LANG, V18, P25, DOI 10.1016/S0885-2308(03)00028-7 Casacuberta F, 2000, LECT NOTES ARTIF INT, V1891, P1 Caseiro D, 2006, IEEE T AUDIO SPEECH, V14, P1281, DOI 10.1109/TSA.2005.860838 Caseiro D., 2001, P ASRU, P393 Collins M., 2005, P 43 ANN M ASS COMP, P531, DOI 10.3115/1219840.1219906 CORBIBELLOT AM, 2005, P 10 C EUR ASS MACH, P79 DEGISPERT A, 2006, NATURAL LANGUAGE ENG, V12, P91 Doddington G, 2002, P 2 INT C HUM LANG T, P138, DOI 10.3115/1289189.1289273 GARCIA P, 1990, IEEE T PATTERN ANAL, V12, P920, DOI 10.1109/34.57687 Goldwater S., 2005, HLT 05, P676 GONZALEZ J, 2007, P 6 INT WORKSH FIN S Hirsimaki T, 2006, COMPUT SPEECH LANG, V20, P515, DOI 10.1016/j.csl.2005.07.002 Jelinek F., 1997, STAT METHODS SPEECH Knight K, 1998, LECT NOTES ARTIF INT, V1529, P421 KOEHN P, 2006, OPEN SOURCE TOOLKIT KOEHN P, 2003, P 2003 C N AM CHAPT, P48 Kumar Shankar, 2006, NAT LANG ENG, V12, P35 LABAKA G, 2007, COMPARING RULE BASED MATUSOV E, 2005, P EAMT 2005 10 ANN C MOHRI FCN, 2003, AT T FSM LIB TM FINI MORENO A, 1993, P EUR C SPEECH COMM NEVADO F, 2004, ACT 3 JORN TECN HABL NIESSEN S, 2001, P MT SUMMIT, V8, P247 OCH FJ, 2003, JHU 2003 SUMM WORKSH Och F. J., 2003, Computational Linguistics, V29, DOI 10.1162/089120103321337421 Papineni K., 2002, P 40 ANN M ASS COMP, P311 Pereira FCN, 1997, LANG SPEECH & COMMUN, P431 PEREZ A, 2007, P IEEE 32 INT C AC S, V4, P113 Perez A, 2006, LECT NOTES ARTIF INT, V4139, P716 PEREZ A, 2006, P 5 WORKSH SPEECH LA Quan V. H., 2005, P INT LISB PORT SEP, P3181 Saleem S., 2004, P INT C SPOK LANG PR, P41 SARIKAYA R, 2007, P 32 INT C AC SPEECH Torres I, 2001, COMPUT SPEECH LANG, V15, P127, DOI 10.1006/csla.2001.0162 Vidal E, 2005, IEEE T PATTERN ANAL, V27, P1013, DOI 10.1109/TPAMI.2005.147 Vidal E., 2005, IEEE T PATTERN ANAL, V27, P1025 Vidal E., 1997, P INT C AC SPEECH SI, P111 ZHOU B, 2005, P IEEE INT C AC SPEE, V1, P1017 NR 48 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 1021 EP 1033 DI 10.1016/j.specom.2008.05.016 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100012 ER PT J AU de Gispert, A Marino, JB AF de Gispert, A. Marino, J. B. TI On the impact of morphology in English to Spanish statistical MT SO SPEECH COMMUNICATION LA English DT Article DE Morphology generation; N-gram based translation; Statistical machine translation; Machine learning ID MACHINE TRANSLATION AB This paper presents a thorough study of the impact of morphology derivation on N-gram-based Statistical Machine Translation (SMT) models from English into a morphology-rich language such as Spanish. For this purpose, we define a framework under the assumption that a certain degree of morphology-related information is not only being ignored by current statistical translation models, but also has a negative impact on their estimation due to the data sparseness it causes. 
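The framework sketched in this abstract separates morphology-related information from the bilingual translation units; as a hedged illustration with hypothetical data (not the authors' system), the simplification step for Spanish verb forms might look like this:

```python
# Illustrative sketch: replace a rich Spanish verb surface form by its
# lemma plus a separate morphology tag, so the bilingual N-gram model is
# estimated over fewer, less sparse units, and a later feature-based
# classifier restores the full inflected form.

ANALYSES = {
    # hypothetical analyses: surface form -> (lemma, morphology features)
    "cantaremos": ("cantar", "1pl+future+indicative"),
    "canto":      ("cantar", "1sg+present+indicative"),
}

def simplify(tokens):
    """Split each known verb form into (lemma, features); leave other
    tokens untouched."""
    out = []
    for tok in tokens:
        lemma, feats = ANALYSES.get(tok, (tok, None))
        out.append((lemma, feats))
    return out

print(simplify(["manana", "cantaremos", "juntos"]))
```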
Moreover, we describe how this information can be decoupled from the standard bilingual N-gram models and introduced separately by means of a well-defined and better informed feature-based classification task. Results are presented for the European Parliament Plenary Sessions (EPPS) English -> Spanish task, showing oracle scores based on the extent to which SMT models can benefit from simplifying Spanish morphological surface forms for each Part-Of-Speech category. We show that verb form morphological richness greatly weakens the standard statistical models, and we carry out a posterior morphology classification by defining a simple set of features and applying machine learning techniques. In addition to that, we propose a simple technique to deal with Spanish enclitic pronouns. Both techniques are empirically evaluated and final translation results show improvements over the baseline by just dealing with Spanish morphology. In principle, the study is also valid for translation from English into any other Romance language (Portuguese, Catalan, French, Galician, Italian, etc.). The proposed method can be applied to both monotonic and non-monotonic decoding scenarios, thus revealing the interaction between word-order decoding and the proposed morphology simplification techniques. Overall results achieve statistically significant improvement over baseline performance in this demanding task. (C) 2008 Elsevier B.V. All rights reserved. C1 [de Gispert, A.; Marino, J. B.] UPC, TALP Res Ctr, Barcelona 08034, Spain. RP de Gispert, A (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England. EM ad465@cam.ac.uk; canton@gps.tsc.upc.edu RI Marino, Jose /N-1626-2014 FU European Union [IST-2002-FP6-5067-38]; Spanish Government [TEC2006-13694-C03] FX The authors would especially like to thank Lluis Marquez (UPC, Barcelona) for providing the machine learning toolkit implementing Adaboost that has been used for morphology generation experiments, and Josep Maria Crego (UPC, Barcelona) and Graeme Blackwood (University of Cambridge) for their most helpful contribution. We would also like to give credit to the anonymous reviewers of this paper, who helped improve the quality of the presentation with their insightful suggestions. This work has been funded by the European Union under the integrated project TC-STAR - (IST-2002-FP6-5067-38) and by the Spanish Government under the project AVIVAVOZ - (TEC2006-13694-C03). CR Al-Onaizan Y., 1999, STAT MACHINE TRANSLA Brants T., 2000, P 6 APPL NAT LANG PR Carreras X., 2003, P CONLL 2003 EDM CAN, P152 Carreras X., 2004, 4 INT C LANG RES EV Casacuberta F, 2004, COMPUT LINGUIST, V30, P205, DOI 10.1162/089120104323093294 COLLINS M, 2005, 43 ANN M ASS COMP LI, P531 CORSTONOLIVER S, 2004, P AMTA, P48 Costa-jussa Marta R., 2006, P WORKSH STAT MACH T, P162, DOI 10.3115/1654650.1654677 CREGO J, 2006, 1 IEEE ACL WORKSH SP CREGO J, 2005, P 9 EUR C SPEECH COM, P3193 CREGO J, 2007, MACHINE TRANSLATION, V21 de Gispert Adria, 2006, P WORKSH STAT MACH T, P1, DOI 10.3115/1654650.1654652 DEGISPERT A, 2005, P 9 EUR C SPEECH COM, P107 Goldwater S., 2005, P HUM LANG TECHN C C, P676, DOI 10.3115/1220575.1220660 GUPTA D, 2006, P 11 ANN C EUR ASS M Habash N, 2006, P HUM LANG TECHN C N, P49, DOI 10.3115/1614049.1614062 Hutchins W.
John, 1992, INTRO MACHINE TRANSL Kneser R, 1995, P IEEE INT C AC SPEE, V1, P181 Koehn P., 2007, P JOINT C EMP METH N, P868 Lee Y., 2004, HLT NAACL 2004 SHORT, P57 MARINO J, 2006, TC STAR WORKSH SPEEC, P43 Marino JB, 2006, COMPUT LINGUIST, V32, P527, DOI 10.1162/coli.2006.32.4.527 NIESSEN S, 2000, P COLING 2000 18 INT, P1081 POPOVIC M, 2006, TC STAR WORKSH SPEEC, P99 POPOVIC M, 2004, 4 INT C LANG RES EV, P1583 Schapire RE, 1999, MACH LEARN, V37, P297, DOI 10.1023/A:1007614523901 TALBOT D, 2006, P 21 INT C COMP LING, P969, DOI 10.3115/1220175.1220297 UEFFING N, 2003, 10 C EUR CHAPT ASS C, P347 VILAR D, 2006, 5 INT C LANG RES EV, P697 ZOLLMANN A, 2006, P HUM LANG TECHN C N, P201, DOI 10.3115/1614049.1614100 NR 30 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV-DEC PY 2008 VL 50 IS 11-12 BP 1034 EP 1046 DI 10.1016/j.specom.2008.05.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 377WY UT WOS:000261284100013 ER PT J AU Au-Yeung, JSK Siu, MH AF Au-Yeung, Jeff Siu-Kei Siu, Manhung TI Evaluation of the robustness of the polynomial segment models to noisy environments with unsupervised adaptation SO SPEECH COMMUNICATION LA English DT Article DE Polynomial segment models; Robustness; Adaptation; Aurora 4 ID CONTINUOUS SPEECH RECOGNITION; HIDDEN MARKOV-MODELS; RAPID SPEAKER ADAPTATION; MAXIMUM-LIKELIHOOD; LINEAR-REGRESSION; TRAJECTORY MODEL; FEATURE HMM; ALGORITHM; MIXTURE; TIME AB Recently, the polynomial segment models (PSMs) have been shown to be a competitive alternative to the HMM in large-vocabulary continuous recognition tasks [Li, C., Siu, M., Au-yeung, S., 2006. Recursive likelihood evaluation and fast search algorithm for polynomial segment model with application to speech recognition. IEEE Trans. on Audio, Speech and Language Processing 14, 1704-1708]. Its more constrained nature raises the issue of robustness under environmental mismatches. In this paper, we examine the robustness properties of PSMs using the Aurora 4 corpus under both clean training and multi-condition training. In addition, we generalize two unsupervised model adaptation schemes, namely, the maximum likelihood linear regression (MLLR) and reference speaker weighting (RSW), to be applicable to PSMs and explore their effectiveness in PSM environmental adaptation. Our experiments showed that although the word error rate differences between PSMs and HMMs became smaller under noisy test environments than under the clean test environment, PSMs were still competitive under mismatch conditions. After model adaptation, especially with the RSW adaptation, the word error rates were reduced for both HMMs and PSMs. The best word error rate was obtained with RSW-adapted PSMs by rescoring lattices generated with the adapted HMMs. Overall, with model adaptation, the recognition word error rate can be reduced by more than 20%. (c) 2008 Elsevier B.V. All rights reserved. C1 [Siu, Manhung] BBN Syst & Technol Corp, Cambridge, MA 02138 USA. [Au-Yeung, Jeff Siu-Kei] Hong Kong Univ Sci & Technol, Dept ECE, Hong Kong, Hong Kong, Peoples R China. RP Siu, MH (reprint author), BBN Syst & Technol Corp, 10 Moulton St, Cambridge, MA 02138 USA. EM jeffay@ust.hk; msiu@ieee.org FU Hong Kong Research Grant Council [HKUST619505, CA02/03.EG05] FX We would like to thank Dr. Herbert Gish for his constructive comments, and the anonymous reviewers for their careful review and suggestions.
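The PSM in the record above replaces the HMM's piecewise-constant state means with a polynomial mean trajectory over each whole segment. A minimal sketch of fitting such a trajectory by least squares, on invented toy data rather than the authors' implementation, might look like this:

```python
# Illustrative core of a polynomial segment model (PSM): within a segment,
# the feature trajectory is modelled as a low-order polynomial in
# normalised time plus Gaussian residual noise.
import numpy as np

def fit_psm_trajectory(frames: np.ndarray, order: int = 2):
    """Least-squares fit of a per-dimension polynomial mean trajectory.

    frames: (T, D) array of feature vectors for one segment.
    Returns (coeffs, residual_var): coeffs has shape (order+1, D).
    """
    T, _ = frames.shape
    t = np.linspace(0.0, 1.0, T)                        # normalised time
    design = np.vander(t, order + 1, increasing=True)   # (T, order+1)
    coeffs, *_ = np.linalg.lstsq(design, frames, rcond=None)
    residual = frames - design @ coeffs
    return coeffs, residual.var(axis=0)

# Toy segment: 20 frames of 3-dimensional features.
rng = np.random.default_rng(0)
segment = np.cumsum(rng.normal(size=(20, 3)), axis=0)
coeffs, var = fit_psm_trajectory(segment)
print(coeffs.shape, var.shape)   # (3, 3) (3,)
```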
This work is partially supported by the Hong Kong Research Grant Council, CERG project numbers HKUST619505 and CA02/03.EG05. The views in this article are those of the authors and do not reflect the view of the sponsor. CR [Anonymous], 2003, 202050 ETSI ES ATAL B, 1979, J ACOUST SOC AM, V55, P1304 AUYEUNG S, 2006, P IEEE INT C AC SPEE, V1, P233 AUYEUNG S, 2005, P ICASSP, P193 AUYEUNG S, 2006, IEEE SIGNAL PROC LET, P644 AuYeung S.-K., 2004, P INT C SPOK LANG PR, P161 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Chengalvarayan R, 2001, IEEE T SPEECH AUDI P, V9, P549, DOI 10.1109/89.928919 Chengalvarayan R, 1998, IEEE SIGNAL PROC LET, V5, P63, DOI 10.1109/97.661562 Chesta C., 1999, P EUR C SPEECH COMM, P211 CUI X, 2004, P IEEE INT C AC SPEE, P969 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DENG L, 1992, SIGNAL PROCESS, V27, P65, DOI 10.1016/0165-1684(92)90112-A Deng L, 1994, IEEE T SPEECH AUDI P, V2, P507 DENG L, 1991, IEEE WORKSH AUT SPEE, P24 DIGALAKIS VV, 1992, IEEE T SIGNAL PROCES, V40, P2885, DOI 10.1109/78.175733 Droppo J, 2002, P ICSLP, P29 ETSI, 2003, 201108 ETSI ES FUKADA T, 1997, P ICASSP, V2, P1403 Gales M., 1993, P EUR C SPEECH COMM, P1579 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Gish H., 1993, P ICASSP, V2, P447 Glass J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607261 Glass JR, 2003, COMPUT SPEECH LANG, V17, P137, DOI 10.1016/S0885-2308(03)00006-8 Goldberger J, 1999, IEEE T SPEECH AUDI P, V7, P262, DOI 10.1109/89.759032 Hazen T. J., 1997, P EUR, P2047 Hazen TJ, 2000, SPEECH COMMUN, V31, P15, DOI 10.1016/S0167-6393(99)00059-X Holmes WJ, 1999, COMPUT SPEECH LANG, V13, P3, DOI 10.1006/csla.1998.0048 HON HW, 2000, P IEEE INT C AC SPEE, P1017 Huang X., 2001, SPOKEN LANGUAGE PROC Huo Q, 1998, IEEE T SPEECH AUDI P, V6, P386 Illina I, 1998, SPEECH COMMUN, V26, P245, DOI 10.1016/S0167-6393(98)00060-0 KANNAN A, 1997, INT C AC SPEECH SIGN, V2, P1411 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 LAI YP, 2003, P EUR C SPEECH COMM, P13 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 LI C, 2003, P IEEE INT C AC SPEE, P756 Li CF, 2006, IEEE T AUDIO SPEECH, V14, P1704, DOI 10.1109/TSA.2005.858553 LI CF, 2004, P ICASSP 2004, P841 MAK B, 2006, P IEEE INT C AC SPEE, P229 Mak B, 2005, IEEE T SPEECH AUDI P, V13, P984, DOI 10.1109/TSA.2005.851971 Ming J., 2001, Computer Speech and Language, V15, DOI 10.1006/csla.2000.0166 Ostendorf M, 1996, IEEE T SPEECH AUDI P, V4, P360, DOI 10.1109/89.536930 OSTENDORF M, 1989, IEEE T ACOUST SPEECH, V37, P1857, DOI 10.1109/29.45533 Parihar N., 2003, P EUR, P337 PARIHAR N, 2002, AU38402 AUR WORK GRO PARIKH V, 1997, P DARP SPEECH REC WO, P119 Paul D., 1992, P ICSLP, P899 Pearce D., 2000, P ICSLP, V4, P29 RAGHAVAN P, 1998, TR227 CAIP RICHARDS H, 1999, P ICASSP, V1, P357 RUSSELL M, 2006, COMPUT SPEECH LANG, V19, P205 Russell MJ, 1997, IEEE SIGNAL PROC LET, V4, P72, DOI 10.1109/97.558642 SEGBROECK MV, 2007, P EUR C SPEECH COMM, P910 SIOHAN O, 1996, P IEEE INT C ACOUSTI, P471 Siu M, 2006, IEEE T AUDIO SPEECH, V14, P2122, DOI 10.1109/TASL.2006.872592 SIU M, 2006, P IEEE INT C AC SPEE, V1, P449 SIU M, 1999, P IEEE INT C AC SPEE, P105 Tonomura M, 1996, COMPUT SPEECH LANG, V10, P117, DOI 10.1006/csla.1996.0008 YOUNG S, 2001, HTK BOOK HTK 3 1 Yun YS, 2000, IEEE SIGNAL PROC LET, V7, P135 Yun YS, 2002,
SPEECH COMMUN, V38, P115, DOI 10.1016/S0167-6393(01)00047-4 ZEN H, 2007, COMPUT SPEECH LANG, V21, P157 ZHAO B, 2002, P IEEE INT C AC SPEE, V4, P4165 ZHOU J, 2003, P IEEE INT C AC SPEE, V1, P744 NR 66 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2008 VL 50 IS 10 BP 769 EP 781 DI 10.1016/j.specom.2008.04.007 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 369NW UT WOS:000260702200001 ER PT J AU Mary, L Yegnanarayana, B AF Mary, Leena Yegnanarayana, B. TI Extraction and representation of prosodic features for language and speaker recognition SO SPEECH COMMUNICATION LA English DT Article DE Prosody; Vowel onset point; Intonation; Stress; Rhythm; Language recognition; Speaker recognition; Multilayer feedforward neural network; Autoassociative neural network ID LINEAR PREDICTION; SPEECH; IDENTIFICATION; VERIFICATION; ALIGNMENT; TONE AB In this paper, we propose a new approach for extracting and representing prosodic features directly from the speech signal. We hypothesize that prosody is linked to linguistic units such as syllables, and it is manifested in terms of changes in measurable parameters such as fundamental frequency (F-0), duration and energy. In this work, a syllable-like unit is chosen as the basic unit for representing the prosodic characteristics. Approximate segmentation of continuous speech into syllable-like units is obtained by locating the vowel onset points (VOP) automatically. The knowledge of the VOPs serves as a reference for extracting prosodic features from the speech signal. Quantitative parameters are used to represent the F-0 and energy contour in each region between two consecutive VOPs. Prosodic features extracted using this approach may be useful in applications such as recognition of language or speaker, where explicit phoneme/syllable boundaries are not easily available. The effectiveness of the derived prosodic features for language and speaker recognition is evaluated on the NIST language recognition evaluation 2003 task and the extended data task of the NIST speaker recognition evaluation 2003, respectively. (c) 2008 Elsevier B.V. All rights reserved. C1 [Mary, Leena] Indian Inst Technol, Dept Comp Sci & Engn, Speech & Vis Lab, Madras 600036, Tamil Nadu, India. [Yegnanarayana, B.] Int Inst Informat Technol, Dept Comp Sci & Engn, Hyderabad 500032, Andhra Pradesh, India. RP Mary, L (reprint author), Indian Inst Technol, Dept Comp Sci & Engn, Speech & Vis Lab, Madras 600036, Tamil Nadu, India.
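The representation in the record above cuts the F-0 and energy contours at consecutive VOPs and summarises each syllable-like stretch with a few quantitative parameters. A hedged sketch of that step (VOP detection itself is not shown; the vops indices, parameter set and toy contour are illustrative assumptions, not the authors' code):

```python
# Per-segment prosodic parameters between consecutive vowel onset points.
import numpy as np

def segment_params(contour: np.ndarray, vops: list[int]) -> list[dict]:
    """Summarise each region between consecutive VOPs: mean, range, slope."""
    params = []
    for start, end in zip(vops[:-1], vops[1:]):
        seg = contour[start:end]
        t = np.arange(len(seg))
        slope = np.polyfit(t, seg, 1)[0] if len(seg) > 1 else 0.0
        params.append({"mean": float(seg.mean()),
                       "range": float(np.ptp(seg)),
                       "slope": float(slope),
                       "duration_frames": len(seg)})
    return params

f0 = 120 + 20 * np.sin(np.linspace(0, 6, 300))   # toy F0 contour (Hz)
print(segment_params(f0, vops=[0, 90, 200, 300])[0])
```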
EM leenmary@rediffmail.com; yegna@cs.iiit.ac.in CR Abercrombie D, 1967, ELEMENTS GEN PHONETI ADAMI A, 2003, P ICASSP, V4, P788 Adami A.G., 2003, P EUROSPEECH, P841 ANANTHAPADMANABHA TV, 1979, IEEE T ACOUST SPEECH, V27, P309, DOI 10.1109/TASSP.1979.1163267 Atterer M, 2004, J PHONETICS, V32, P177, DOI 10.1016/S0095-4470(03)00039-1 Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 CHAITANYA M, 2005, THESIS INDIAN I TECH CUMMINS F, 1999, IDSIA0799 Cutler Anne, 1983, PROSODY MODELS MEASU Doddington G., 2001, P EUR, P2521 Fox A, 2000, PROSODIC FEATURES PR FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Grabe E., 2002, PAPERS LAB PHONOLOGY, V7, P515 Gussenhoven C, 1997, J ACOUST SOC AM, V102, P3009, DOI 10.1121/1.420355 Haykin S., 1999, NEURAL NETWORKS COMP, V2nd HECK LP, 2002, J HOPK U WORKSH SUPE Hirst D., 1998, INTONATION SYSTEMS S HYMAN LM, 2005, WORD PROSODIC TYPOLO, P164 Kometsu M., 2001, P EUROSPEECH SCAND, V1, P149 Krakow RA, 1999, J PHONETICS, V27, P23, DOI 10.1006/jpho.1999.0089 MacNeilage PF, 1998, BEHAV BRAIN SCI, V21, P499 Maidment J., 2005, INTRO PHONETIC SCI MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MORI K, 1999, P EUROSPEECH BUD HUN, P391 Nagarajan T, 2006, SPEECH COMMUN, V48, P913, DOI 10.1016/j.specom.2005.12.003 Navratil J, 2001, IEEE T SPEECH AUDI P, V9, P678, DOI 10.1109/89.943345 PESKIN B, 2003, P ICASSP, V4, P792 PRASANNA SRM, 2001, P INT C SIGN PROC BA, V1, P81 Ramus F, 1999, J ACOUST SOC AM, V105, P512, DOI 10.1121/1.424522 Ramus F, 1999, COGNITION, V73, P265, DOI 10.1016/S0010-0277(99)00058-X REYNOLDS D, 1996, P ICASSP 1996 ATL, V1, P113 Reynolds D., 2003, P ICASSP 03, VIV, P784 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 Roach Peter, 1983, ENGLISH PHONETICS PH Rouas JL, 2005, SPEECH COMMUN, V47, P436, DOI 10.1016/j.specom.2005.04.012 Shriberg E, 2005, SPEECH COMMUN, V46, P455, DOI 10.1016/j.specom.2005.02.018 Shriberg E, 2000, SPEECH COMMUN, V32, P127, DOI 10.1016/S0167-6393(00)00028-5 Sonmez K., 1998, P INT C SPOK LANG PR, P3189 Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 THYMEGOBBEL AE, 1996, P ICSLP 96 PHIL US, V3, P1768, DOI 10.1109/ICSLP.1996.607971 Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086 Xu Y, 1998, PHONETICA, V55, P179, DOI 10.1159/000028432 Xu Y, 2002, J ACOUST SOC AM, V111, P1399, DOI 10.1121/1.1445789 Yegnanarayana B, 2002, NEURAL NETWORKS, V15, P459, DOI 10.1016/S0893-6080(02)00019-9 Yegnanarayana B., 1999, ARTIFICIAL NEURAL NE Zissman MA, 1996, IEEE T SPEECH AUDI P, V4, P31, DOI 10.1109/TSA.1996.481450 NR 46 TC 31 Z9 31 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2008 VL 50 IS 10 BP 782 EP 796 DI 10.1016/j.specom.2008.04.010 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 369NW UT WOS:000260702200002 ER PT J AU Flynn, R Jones, E AF Flynn, Ronan Jones, Edward TI Combined speech enhancement and auditory modelling for robust distributed speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Speech enhancement; Auditory front-end; Robust speech recognition ID FRONT-END AB The performance of automatic speech recognition (ASR) systems in the presence of noise is an area that has attracted a lot of research interest. 
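The record below combines speech-enhancement pre-processing with an auditory front-end. Purely as an illustration of one classic enhancement technique of this family (magnitude spectral subtraction; the record itself evaluates several enhancement algorithms, which are not reproduced here), a minimal sketch with invented parameters:

```python
# Minimal magnitude spectral subtraction over non-overlapping frames.
# Assumes the first few frames are noise-only; all settings illustrative.
import numpy as np

def spectral_subtraction(noisy: np.ndarray, noise_frames: int = 5,
                         frame: int = 256, floor: float = 0.01) -> np.ndarray:
    """Subtract an average noise magnitude spectrum, frame by frame."""
    n_frames = len(noisy) // frame
    frames = noisy[:n_frames * frame].reshape(n_frames, frame)
    spectra = np.fft.rfft(frames * np.hanning(frame), axis=1)
    noise_mag = np.abs(spectra[:noise_frames]).mean(axis=0)  # noise estimate
    mag = np.maximum(np.abs(spectra) - noise_mag, floor * noise_mag)
    cleaned = np.fft.irfft(mag * np.exp(1j * np.angle(spectra)), axis=1)
    return cleaned.reshape(-1)

rng = np.random.default_rng(1)
print(spectral_subtraction(rng.normal(size=16000)).shape)
```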
Additive noise from interfering noise sources and convolutional noise arising from transmission channel characteristics both contribute to a degradation of performance in ASR systems. This paper addresses the problem of robustness of speech recognition systems in the first of these conditions, namely additive noise. In particular, the paper examines the use of the auditory model of Li et al. [Li, Q., Soong, F.K., Siohan, O., 2000. A high-performance auditory feature for robust speech recognition. In: Proc. 6th Internat. Conf. on Spoken Language Processing (ICSLP), Vol. III. pp. 51-54] as a front-end for an HMM-based speech recognition system. The choice of this particular auditory model is motivated by the results of a previous study by Flynn and Jones [Flynn, R., Jones, E., 2006. A comparative study of auditory-based front-ends for robust speech recognition using the Aurora 2 database. In: Proc. IET Irish Signals and Systems Conf., Dublin, Ireland. pp. 111-116] in which this auditory model was found to exhibit superior performance for the task of robust speech recognition using the Aurora 2 database [Hirsch, H.G., Pearce, D., 2000. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In: Proc. ISCA ITRW ASR2000, Paris, France. pp. 181-188]. In the speech recognition system described here, the input speech is pre-processed using an algorithm for speech enhancement. A number of different methods for the enhancement of speech, combined with the auditory front-end of Li et al., are evaluated for the purpose of robust connected digit recognition. The ETSI basic [ETSI ES 201 108 Ver. 1.1.3, 2003. Speech processing, transmission and quality aspects (STQ); distributed speech recognition; front-end feature extraction algorithm; compression algorithms] and advanced [ETSI ES 202 050 Ver. 1.1.5, 2007. Speech processing, transmission and quality aspects (STQ); distributed speech recognition; advanced front-end feature extraction algorithm; compression algorithms] front-ends proposed for DSR are used as a baseline for comparison. In addition to their effects on speech recognition performance, the speech enhancement algorithms are also assessed using perceptual speech quality tests, in order to examine if a correlation exists between perceived speech quality and recognition performance. Results indicate that the combination of speech enhancement pre-processing and the auditory model front-end provides an improvement in recognition performance in noisy conditions over the ETSI front-ends. (c) 2008 Elsevier B.V. All rights reserved. C1 [Flynn, Ronan] Athlone Inst Technol, Dept Elect Engn, Athlone, Ireland. [Jones, Edward] Natl Univ Ireland, Dept Elect Engn, Galway, Ireland. RP Flynn, R (reprint author), Athlone Inst Technol, Dept Elect Engn, Athlone, Ireland. EM rflynn@ait.ie; edwardjones@nuigalway.ie CR Agarwal A., 1999, P ASRU, P67 [Anonymous], 2007, 202050 ETSI ES [Anonymous], 2003, 201108 ETSI ES [Anonymous], HTK SPEECH REC TOOLK Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Ephraim Y., 2006, ELECT ENG HDB EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FLYNN R, 2006, P IET IR SIGN SYST C, P116 GHITZA O, 1988, J PHONETICS, V16, P109 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hirsch H.
G., 2000, P ISCA ITRW ASR2000, P181 Hu Y, 2007, J ACOUST SOC AM, V122, P1777, DOI 10.1121/1.2766778 ITU-T, 2001, P862 ITU T Kleinschmidt M, 2001, SPEECH COMMUN, V34, P75, DOI 10.1016/S0167-6393(00)00047-9 Li J., 2004, P ICASSP 2004 MONTR, V1, P61 Li Q., 2000, P 6 INT C SPOK LANG, V3, P51 LI Q, 2001, P EUR, V1, P619 Machiraju VR, 2002, J CARDIAC SURG, V17, P20 MACHO D, 2001, P INT C AC SPEECH SI, P305 Mak BKW, 2004, IEEE T SPEECH AUDI P, V12, P27, DOI 10.1109/TSA.2003.819951 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Martin R., 1994, P 7 EUR SIGN PROC C, P1182 Mauuary L., 1998, P EUSPICO 98, V1, P359 Milner B., 2002, P IEEE INT C AC SPEE, V1, P797 Moore BC., 2003, INTRO PSYCHOL HEARIN Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005 SENEFF S, 1988, J PHONETICS, V16, P55 Tchorz J, 1999, J ACOUST SOC AM, V106, P2040, DOI 10.1121/1.427950 TSONTZOS G, 2007, P IEEE INT C AC SPEE, V4, P453 Westerlund N, 2005, SIGNAL PROCESS, V85, P1089, DOI 10.1016/j.sigpro.2005.01.004 NR 31 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2008 VL 50 IS 10 BP 797 EP 809 DI 10.1016/j.specom.2008.05.004 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 369NW UT WOS:000260702200003 ER PT J AU Huang, CF Akagi, M AF Huang, Chun-Fang Akagi, Masato TI A three-layered model for expressive speech perception SO SPEECH COMMUNICATION LA English DT Article DE Expressive speech; Perception; Multi-layer model; Fuzzy inference system; Acoustic analysis; Rule-based ID EMOTIONAL SPEECH; SYNTHETIC SPEECH; VOCAL CUES; COMMUNICATION; RULE AB This paper proposes a multi-layer approach to modeling perception of expressive speech. Many earlier studies of expressive speech focused on statistical correlations between expressive speech and acoustic features without taking into account the fact that human perception is vague rather than precise. This paper introduces a three-layer model: five categories of expressive speech constitute the top layer, semantic primitives constitute the middle layer, and acoustic features, the bottom layer. Three experiments followed by multidimensional scaling analysis revealed suitable semantic primitives. Then, fuzzy inference systems were built to map the vagueness of the relationship between expressive speech and the semantic primitives. Acoustic features in terms of F-0 contour, time duration, power envelope, and spectrum were analyzed. Regression analysis revealed correlation between the semantic primitives and the acoustic features. Parameterized rules based on the analysis results were created to morph neutral utterances to those perceived as having different semantic primitives and expressive speech categories. Experiments to verify the relationships of the model showed significant relationships between expressive speech, semantic primitives, and acoustic features. (c) 2008 Elsevier B.V. All rights reserved. C1 [Huang, Chun-Fang; Akagi, Masato] JAIST, Sch Informat Sci, Nomi, Ishikawa 9231211, Japan. RP Huang, CF (reprint author), JAIST, Sch Informat Sci, 1-1 Asahidai, Nomi, Ishikawa 9231211, Japan. 
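The middle layer of the model in the record above maps acoustic features to semantic primitives with fuzzy inference. A loose sketch of that kind of mapping, with triangular membership functions; every breakpoint, rule and the "bright" primitive itself are invented for illustration and are not the authors' trained system:

```python
# Fuzzify one acoustic feature (mean F0) and apply a tiny rule base to
# obtain a degree for a hypothetical semantic primitive "bright".
def tri(x: float, a: float, b: float, c: float) -> float:
    """Triangular membership function peaking at b, zero outside (a, c)."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x < b else (c - x) / (c - b)

def brightness(mean_f0: float) -> float:
    low = tri(mean_f0, 60, 100, 140)      # membership in "low F0"
    high = tri(mean_f0, 120, 200, 280)    # membership in "high F0"
    # Illustrative rules: "high F0 -> bright", "low F0 -> not bright".
    return high / (low + high) if (low + high) > 0 else 0.5

print(brightness(180))   # degree of "bright" for a 180 Hz utterance
```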
EM chuang@jaist.ac.jp FU Ministry of Education, Culture, Sports, Science and Technology; SCOPE of Ministry of Internal Affairs and Communications (MIC), Japan [071705001] FX This research is conducted as a program for the "21st Century COE Program" by Ministry of Education, Culture, Sports, Science and Technology. We sincerely thank Fujitsu Laboratory for permission to use the voice database. This study was also supported by SCOPE (071705001) of Ministry of Internal Affairs and Communications (MIC), Japan. We also sincerely thank Donna Erickson for her valuable comments. CR Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016 Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 CAHN JE, 1990, THESIS MIT Chiu S., 1994, J INTELL FUZZY SYST, V2 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 COWIE R, 1996, P ICSLP96 PHIL DARKE G, 2005, P CIM05 MONTR Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007 DEVILLERS L, 2003, P INT C MULT EXP Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5 Erickson D., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.317 Fauconnier, 1997, MAPPINGS THOUGHT LAN Friberg A, 2006, ADV COGNITIVE PSYCHO, V2, P145, DOI DOI 10.2478/V10053-008-0052-X Friberg A., 2004, P MUS MUS SCI STOCKH FUJISAKI H, IEICE 1994 0 GOBL C, 1997, HDB PHONETIC SCI, P427 GRIMM M, 2007, SPEECH COMMUN Huang CF, 2005, LECT NOTES COMPUT SC, V3784, P366 Jang J. S. R., 1996, NEUROFUZZY SOFT COMP Juslin P. N., 2001, MUSIC EMOTION THEORY, P309 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Keating PA, 2006, 11 AUSTR INT C SPEEC Kecman V., 2001, LEARNING SOFT COMPUT KENDON A, 1981, NONVERBAL COMMUNICAT, V41 Kienast M., 2000, ISCA WORKSH SPEECH E MAEKAWA K, 2002, CONGNITIVE STUDIES, V9, P46 Maekawa K., 2004, P SPEECH PROS 2004, V2004, P367 Manning P, 1989, SYMBOLIC COMMUNICATI Mehrabian A., 1972, NONVERBAL COMMUNICAT MENEZES C, 2006, P SPEECH PROS 2006 D MURRAY IR, 1995, SPEECH COMMUN, V16, P369, DOI 10.1016/0167-6393(95)00005-9 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 Nguyen PC, 2003, IEICE T INF SYST, VE86D, P397 Robbins S.P., 2001, ORG BEHAV CONCEPTS C SCHERER KR, 1984, J ACOUST SOC AM, V76, P1346, DOI 10.1121/1.391450 SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SCHRODER M, 2001, P EUR 2001 AALB, V1, P87 SHIKLER TS, 2004, P ICSLP2004 KOR Sugeno M., 1985, IND APPL FUZZY CONTR TOLKMITT FJ, 1986, J EXP PSYCHOL HUMAN, V12, P302, DOI 10.1037//0096-1523.12.3.302 TRAUBE C, 2003, P 2003 C NEW INT MUS UEDA K, 1988, ACOUST SCI TECHNOL, V44, P102 UEDA K, 1990, J ACOUST SOC AM, V87, P814, DOI 10.1121/1.398893 UEDA K, 1996, J ACOUST SOC AM van Bezooijen R., 1984, CHARACTERISTICS RECO Venditti Jennifer J., 2005, PROSODIC TYPOLOGY PH, P172 VICKHOFF B, 2004, PHILOS COMMUNICATION, V34 WILLIAMS CE, 1969, AEROSPACE MED, V40, P1369 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 Wolkenhauer O., 2001, DATA ENG FUZZY MATH NR 52 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD OCT PY 2008 VL 50 IS 10 BP 810 EP 828 DI 10.1016/j.specom.2008.05.017 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 369NW UT WOS:000260702200004 ER PT J AU Williams, JD AF Williams, Jason D. TI Evaluating user simulations with the Cramer-von Mises divergence SO SPEECH COMMUNICATION LA English DT Article DE User modeling; User simulation; Dialog simulation; Dialog management ID MANAGEMENT AB User simulations are increasingly employed in the development and evaluation of spoken dialog systems. However, there is no accepted method for evaluating user simulations, which is problematic because the performance of new dialog management techniques is often evaluated on user simulations alone, not on real people. In this paper, we propose a novel method of evaluating user simulations. We view a user simulation as a predictor of the performance of a dialog system, where per-dialog performance is measured with a domain-specific scoring function. The divergence between the distribution of dialog scores in the real and simulated corpora provides a measure of the quality of the user simulation, and we argue that the Cramer-von Mises divergence is well-suited to this task. To demonstrate this technique, we study a corpus of callers with real information needs and show that Cramer-von Mises divergence conforms to expectations. Finally, we present simple tools which enable practitioners to interpret the statistical significance of comparisons between user simulations. (c) 2008 Elsevier B.V. All rights reserved. C1 AT&T Labs Res, Florham Pk, NJ 07932 USA. RP Williams, JD (reprint author), AT&T Labs Res, 180 Pk Ave, Florham Pk, NJ 07932 USA. EM jdw@research.att.com CR AI H, 2007, P EUR ANTW BELG ANDERSON TW, 1962, ANN MATH STAT, V33, P1148, DOI 10.1214/aoms/1177704477 Bui T. H., 2007, P WORKSH KNOWL REAS, P34 Cramer H., 1928, SKAND AKTUARIETIDSK, V11, P171 Cuayahuitl H., 2005, P IEEE WORKSH AUT SP, P290 Denecke M., 2004, P INTERSPEECH 2004, P325 Eadie W.T., 1971, STAT METHODS EXPT PH FILISKO E, 2005, P SIGDIAL WORKSH DIS FRAMPTON M, 2006, P 44 ANN M ASS COMP, P185, DOI 10.3115/1220175.1220199 Georgila K., 2005, P 9 EUR C SPEECH COM, P893 GEORGILA K, 2006, P INT C SPOK LANG PR GODDEAU D, 2000, P INT C AC SPEECH SI, P1233 Heeman P., 2007, P 8 ANN C N AM CHAPT, P268 Henderson J., 2005, P WORKSH KNOWL REAS, P68 Kolmogorov A., 1933, GIORNALE I ITALIANO, V4, P1 KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694 Lemon O., 2006, P IEEE ACL WORKSH SP, P178 Levin E., 1997, P EUR, P1883 Levin E, 2000, IEEE T SPEECH AUDI P, V8, P11, DOI 10.1109/89.817450 Levin E., 2006, P SPOK LANG TECHN WO, P198 Pietquin O., 2002, P IEEE ICASSP, P46 Pietquin O., 2004, THESIS FACULTY ENG B Roy N., 2000, P 38 ANN M ASS COMP, P93, DOI DOI 10.3115/1075218.1075231 Schatzmann J, 2007, P 8 SIGDIAL WORKSH D, P273 Schatzmann J., 2007, P IEEE AUT SPEECH RE, P526 SCHATZMANN J, 2005, P SIGDIAL WORKSH DIS, P178 Scheffler K, 2001, P HUM LANG TECHN HLT, P12 SCHEFFLER K, 2002, THESIS CAMBRIDGE U Scheffler Konrad, 2002, P 2 INT C HUM LANG T, P12, DOI 10.3115/1289189.1289246 Singh S, 2002, J ARTIF INTELL RES, V16, P105 STEPHENS MA, 1992, BREAK THROUGHS STAT, V2, P93 STUTTLE M, 2004, P INT C SPEECH LANG, P241 Thomson B., 2007, P NAACL HLT DIAL 07, P9, DOI 10.3115/1556328.1556330 von Mises R., 1931, WAHRSCHEINLICHKEITSR Walker M., 1998, P 36 ANN M ASS COMP, P1345 Walker M. 
A., 2002, P INT C SPOK LANG PR, P269 Walker M.A., 2000, NATURAL LANGUAGE ENG WILLIAMS J, 2007, NAACL HLT WORKSH BRI, P1 Williams JD, 2007, IEEE T AUDIO SPEECH, V15, P2116, DOI 10.1109/TASL.2007.902050 WILLIAMS JD, 2006, THESIS CAMBRIDGE U Williams JD, 2007, COMPUT SPEECH LANG, V21, P393, DOI 10.1016/j.csl.2006.06.008 YOUNG S, 2007, P IEEE INT C AC SPEE, P149 Zhang B., 2001, P EUR AALB DENM, P2169 NR 43 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2008 VL 50 IS 10 BP 829 EP 846 DI 10.1016/j.specom.2008.05.007 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 369NW UT WOS:000260702200005 ER PT J AU Batista, F Caseiro, D Mamede, N Trancoso, I AF Batista, F. Caseiro, D. Mamede, N. Trancoso, I. TI Recovering capitalization and punctuation marks for automatic speech recognition: Case study for Portuguese broadcast news SO SPEECH COMMUNICATION LA English DT Article DE Rich transcription; Punctuation recovery; Sentence boundary detection; Capitalization; Truecasing; Maximum entropy; Language modeling; Weighted finite state transducers ID LANGUAGE; SYSTEM AB This paper presents a study on recovering punctuation marks and capitalization information from European Portuguese broadcast news speech transcriptions. Different approaches were tested for capitalization, both generative and discriminative, using finite state transducers automatically built from language models, and maximum entropy models. Several resources were used, including lexica, written newspaper corpora and speech transcriptions. Finite state transducers produced the best results for written newspaper corpora, but the maximum entropy approach also proved to be a good choice, suitable for the capitalization of speech transcriptions, and allowing straightforward on-the-fly capitalization. Evaluation results are presented both for written newspaper corpora and for broadcast news speech transcriptions. The frequency of each punctuation mark in BN speech transcriptions was analyzed for three different languages: English, Spanish and Portuguese. The punctuation task was performed using a maximum entropy modeling approach, which combines different types of information, both lexical and acoustic. The contribution of each feature was analyzed individually and separate results for each focus condition are given, making it possible to analyze the performance differences between planned and spontaneous speech. All results were evaluated on speech transcriptions of a Portuguese broadcast news corpus. The benefits of enriching speech recognition with punctuation and capitalization are shown in an example, illustrating the effects of the described experiments on spoken texts. (c) 2008 Elsevier B.V. All rights reserved. C1 [Batista, F.; Caseiro, D.; Mamede, N.; Trancoso, I.] INESC ID Lisboa R Alves Redol, Spoken Language Syst Lab, L2F, P-1000029 Lisbon, Portugal. [Caseiro, D.; Mamede, N.; Trancoso, I.] Univ Tecn Lisboa, Inst Super Tecn, IST, P-1100 Lisbon, Portugal. RP Batista, F (reprint author), INESC ID Lisboa R Alves Redol, Spoken Language Syst Lab, L2F, P-1000029 Lisbon, Portugal.
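A rough sketch of a maximum-entropy truecasing step in the spirit of the record above, using scikit-learn's logistic regression as the maximum entropy model; the feature set and the toy training sentence are invented for illustration and are far smaller than the corpora the record describes:

```python
# Maximum-entropy (logistic regression) capitalization with simple
# lexical context features: current, previous and next word.
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def features(words, i):
    return {"w": words[i].lower(),
            "prev": words[i - 1].lower() if i else "<s>",
            "next": words[i + 1].lower() if i + 1 < len(words) else "</s>"}

train = [("mr. silva visited lisbon in may".split(),
          [1, 1, 0, 1, 0, 1])]          # 1 = capitalise this token
X, y = [], []
for words, labels in train:
    X += [features(words, i) for i in range(len(words))]
    y += labels

vec = DictVectorizer()
clf = LogisticRegression(max_iter=1000).fit(vec.fit_transform(X), y)

test = "she met silva in lisbon".split()
preds = clf.predict(vec.transform([features(test, i)
                                   for i in range(len(test))]))
print([w.capitalize() if c else w for w, c in zip(test, preds)])
```

Because the classifier only needs the local context, it can run token by token, which is what makes the on-the-fly capitalization mentioned in the record straightforward.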
EM Fernando.Batista@inesc-id.pt; Diamantino.Caseiro@inesc-id.pt; Nuno.Mamede@inesc-id.pt; Isabel.Trancoso@inesc-id.pt RI Batista, Fernando/C-8355-2009; Trancoso, Isabel/C-5965-2008; Mamede, Nuno/C-5555-2008 OI Batista, Fernando/0000-0002-1075-0177; Trancoso, Isabel/0000-0001-5874-6313; Mamede, Nuno/0000-0001-6033-158X FU FCT [POSC/PLP/58697/2004]; DIGA [POSI/PLP/41319/2001]; PRIME National Project TECNOVOZ [03/165]; Quadro Comunitario de Apoio III FX This paper has been partially funded by the FCT Projects LECTRA (POSC/PLP/58697/2004) and DIGA (POSI/PLP/41319/2001), and by the PRIME National Project TECNOVOZ number 03/165. INESC-ID Lisboa had support from the POSI program of the "Quadro Comunitario de Apoio III". CR AMARAL R, 2007, EURASIP J ADV SI MAY BATISTA F, 2007, INT 2007 AUG ANTW BE, P2153 BATISTA F, 2007, P RANLP 2007 SEPT BO Beeferman D, 1998, INT CONF ACOUST SPEE, P689, DOI 10.1109/ICASSP.1998.675358 Berger AL, 1996, COMPUT LINGUIST, V22, P39 Chelba C., 2004, EMNLP 04 Christensen H., 2001, P ISCA WORKSH PROS S, P35 Collins Michael, 1999, P JOINT SIGDAT C EMN Daume III H, 2004, NOTES CG LM BFGS OPT Gotoh Y., 2000, P ISCA WORKSH AUT SP, P228 HARPER MP, 2005, 2005 J HOPK SUMM WOR Huang J., 2002, P ICSLP, P917 Jurafsky Daniel, 2000, SPEECH LANGUAGE PROC Kim J., 2001, P EUR C SPEECH COMM, P2757 Kim JH, 2004, COMPUT SPEECH LANG, V18, P67, DOI 10.1016/S0885-2308(03)00032-9 Koehn Philipp, 2005, MT SUMMIT 2005 Lita L.V., 2003, P 41 ANN M ASS COMP, P152 LIU Y, 2007, P IEEE ICASSP HON H LIU Y, 2006, IEEE T AUDIO SPEECH, V14, P9 MAKHOUL J, 1999, P DARPA BROADC NEW W MARTINS C, 2007, ASRU MEDEIROS JC, 1995, THESIS IST UTL PORTU Meinedo H, 2003, LECT NOTES ARTIF INT, V2721, P9 MIKHEEV A, 1999, P 37 C ASS COMP LING, P159, DOI 10.3115/1034678.1034710 Mikheev A, 2002, COMPUT LINGUIST, V28, P289, DOI 10.1162/089120102760275992 MOTA C, 2008, THESIS U TECNICA LIS MROZINSK J, 2006, P ICASSP Neto J, 2008, INT CONF ACOUST SPEE, P1561, DOI 10.1109/ICASSP.2008.4517921 RIBEIRO R, 2004, LANGUAGE TECHNOLOGY, P31 Shriberg E, 2000, SPEECH COMMUN, V32, P127, DOI 10.1016/S0167-6393(00)00028-5 Stolcke A., 1996, P INT C SPOK LANG PR, P1005, DOI 10.1109/ICSLP.1996.607773 Stolcke A., 2002, P INT C SPOK LANG PR, V2, P901 Strassel S., 2004, SIMPLE METADATA ANNO TRANCOSO I, 2006, P ISCA C INT 2006 Wang W., 2006, HLT NAACL, P1 Yarowsky D., 1994, P 32 ANN M ASS COMP, P88, DOI 10.3115/981732.981745 NR 36 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2008 VL 50 IS 10 BP 847 EP 862 DI 10.1016/j.specom.2008.05.008 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 369NW UT WOS:000260702200006 ER PT J AU Komatani, K Ikeda, S Ogata, T Okuno, HG AF Komatani, Kazunori Ikeda, Satoshi Ogata, Tetsuya Okuno, Hiroshi G. TI Managing out-of-grammar utterances by topic estimation with domain extensibility in multi-domain spoken dialogue systems SO SPEECH COMMUNICATION LA English DT Article DE Multi-domain spoken dialogue system; Topic estimation; Out-of-grammar utterance; Domain extensibility AB Spoken dialogue systems must inevitably deal with out-of-grammar utterances. We address this problem in multi-domain spoken dialogue systems, which deal with more tasks than a single-domain system.
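The record goes on to describe recovering such utterances by estimating the most likely domain with latent semantic mapping over Web-collected documents. A compact LSA-style sketch of that idea, using a truncated SVD for noise-robust matching; the toy documents and the utterance are invented and stand in for the Web-collected training data:

```python
# Estimate the domain (topic) of an utterance by cosine similarity in a
# latent semantic subspace built from per-domain pseudo-documents.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

domain_docs = {                 # one Web-derived pseudo-document per domain
    "restaurant": "menu lunch dinner reserve table cuisine price",
    "hotel": "room reservation night stay rate breakfast",
    "bus": "route timetable stop fare departure arrival line",
}
vec = CountVectorizer()
X = vec.fit_transform(domain_docs.values())
svd = TruncatedSVD(n_components=2, random_state=0)  # removes noise effects
Z = svd.fit_transform(X)

utterance = vec.transform(["is there a cheap table for dinner"])
scores = cosine_similarity(svd.transform(utterance), Z)[0]
print(dict(zip(domain_docs, scores)))   # highest score = estimated domain
```

Adding a new domain then only requires collecting documents for it and rebuilding the subspace, which is how extensibility is preserved.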
We defined a topic by augmenting a domain about which users want to find more information, and we developed a method of recovering out-of-grammar utterances based on topic estimation, i.e., by providing a help message in the estimated domain. Moreover, domain extensibility, that is, the ability to add new domains to the system, should be inherently retained in multi-domain systems. To estimate domains without sacrificing extensibility, we collected documents from the Web as training data. Since the data contained a certain amount of noise, we used latent semantic mapping (LSM), which enables robust topic estimation by removing the effects of noise from the data. Experimental results showed that our method improved topic estimation accuracy by 23.2 points for data including out-of-grammar utterances. (c) 2008 Elsevier B.V. All rights reserved. C1 [Komatani, Kazunori; Ikeda, Satoshi; Ogata, Tetsuya; Okuno, Hiroshi G.] Kyoto Univ, Grad Sch Informat, Sakyo Ku, Kyoto 6068501, Japan. RP Komatani, K (reprint author), Kyoto Univ, Grad Sch Informat, Sakyo Ku, Kyoto 6068501, Japan. EM komatani@kuis.kyoto-u.ac.jp FU Ministry of Education, Culture, Sports, Science and Technology of Japan; Support Center for Advanced Telecommunications Technology Research (SCAT) FX We are grateful to Mr. Teruhisa Misu and Prof. Tatsuya Kawahara of Kyoto University for allowing us to use the document-collecting program they developed (Misu and Kawahara, 2006). This work was supported in part by a grant-in-aid for scientific research from the Ministry of Education, Culture, Sports, Science and Technology of Japan and from Support Center for Advanced Telecommunications Technology Research (SCAT). CR Bellegarda JR, 2005, IEEE SIGNAL PROC MAG, V22, P70, DOI 10.1109/MSP.2005.1511825 BOHUS D, 2005, P 6 SIGDIAL WORKSH D, P128 CARPENTER B, 2002, P INT C SPOK LANG PR, P2705 CHUCARROLL J, 1998, P 36 ANN M ASS COMP, P256 CHUNG G, 2005, P 6 SIGDIAL WORKSH D, P55 FUKUBAYASHI Y, 2006, P INT C SPOK LANG PR, P1946 GORRELL G, 2002, P INT C SPOK LANG PR, P2065 HOCKEY BA, 2003, P 10 C EUR CHAPT ACL, P147 Komatani K, 2005, USER MODEL USER-ADAP, V15, P169, DOI 10.1007/s11257-004-5659-0 KOMATANI K, 2006, P 7 SIGDIAL WORKSH D, P9, DOI 10.3115/1654595.1654598 Komatani K., 2007, P 8 SIGDIAL WORKSH D, P202 LANE IR, 2004, P ICSLP, P2197 Lin BS, 2001, IEICE T INF SYST, VE84D, P1217 MISU T, 2006, P INTERSPEECH, P9 Nakano M, 2005, 2005 IEEE/RSJ International Conference on Intelligent Robots and Systems, Vols 1-4, P1542 ONEILL I, 2004, P 8 INT C SPOK LANG, P205 PAKUCS B, 2003, P 8 EUR C SPEECH COM, P741 RAUX A, 2004, P HLT NAACL, P217 Steinberger J., 2004, P 7 INT C INF SYST I, P93 TSAI A, 2001, P EUR C SPEECH COMM, P2213 Williams J., 2005, P 4 IJCAI WORKSH KNO, P76 NR 21 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2008 VL 50 IS 10 BP 863 EP 870 DI 10.1016/j.specom.2008.05.010 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 369NW UT WOS:000260702200007 ER PT J AU McTear, M Jokinen, K Larson, J AF McTear, Michael Jokinen, Kristiina Larson, James TI Special Issue on "Evaluating new methods and models for advanced speech-based interactive systems" SO SPEECH COMMUNICATION LA English DT Editorial Material C1 [McTear, Michael] Univ Ulster, Sch Comp & Math, Newtownabbey BT37 0QB, Antrim, North Ireland. [Jokinen, Kristiina] Univ Helsinki, FIN-00014 Helsinki, Finland. 
[Larson, James] Portland State Univ, Portland, OR 97207 USA. RP McTear, M (reprint author), Univ Ulster, Sch Comp & Math, Shore Rd, Newtownabbey BT37 0QB, Antrim, North Ireland. EM mf.mctear@ulster.ac.uk RI Jokinen, Kristiina/D-1824-2013 OI Jokinen, Kristiina/0000-0003-1229-239X CR Jokinen K, 2007, AI MAG, V28, P133 NR 1 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG-SEP PY 2008 VL 50 IS 8-9 BP 627 EP 629 DI 10.1016/j.specom.2008.05.001 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 342WJ UT WOS:000258814000001 ER PT J AU Edlund, J Gustafson, J Heldner, M Hjalmarsson, A AF Edlund, Jens Gustafson, Joakim Heldner, Mattias Hjalmarsson, Anna TI Towards human-like spoken dialogue systems SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 9th International Conference on Spoken Language Processing/INTERSPEECH 2006 CY 2006 CL Pittsburgh, PA SP Int Speech Commun Assoc DE human-like interaction; conversational interfaces; human gold standard ID HUMAN-COMPUTER INTERACTION; SPEECH; INFORMATION; TELEPHONE; INTERFACE AB This paper presents an overview of methods that can be used to collect and analyse data on user responses to spoken dialogue system components intended to increase human-likeness, and to evaluate how well the components succeed in reaching that goal. Wizard-of-Oz variations, human-human data manipulation, and micro-domains are discussed in this context, as is the use of third-party reviewers to get a measure of the degree of human-likeness. We also present the two-way mimicry target, a model for measuring how well a human-computer dialogue mimics or replicates some aspect of human-human dialogue, including human flaws and inconsistencies. Although we have added a measure of innovation, none of the techniques is new in its entirety. Taken together and described from a human-likeness perspective, however, they form a set of tools that may widen the path towards human-like spoken dialogue systems. (c) 2008 Elsevier B.V. All rights reserved. C1 [Edlund, Jens; Gustafson, Joakim; Heldner, Mattias; Hjalmarsson, Anna] KTH Speech Mus & Hearing, SE-10044 Stockholm, Sweden. RP Edlund, J (reprint author), KTH Speech Mus & Hearing, SE-10044 Stockholm, Sweden. EM edlund@speech.kth.se RI Heldner, Mattias/B-4262-2009 OI Heldner, Mattias/0000-0002-0034-0924 CR Agelfors E, 2006, LECT NOTES COMPUT SC, V4061, P579 Allen JF, 2001, AI MAG, V22, P27 ALLEN JF, 1995, J EXP THEOR ARTIF IN, V7, P7, DOI 10.1080/09528139508953799 ALLWOOD J, 1992, COMMUNICATIVE ACTIVI AMALBERTI R, 1993, INT J MAN MACH STUD, V38, P547, DOI 10.1006/imms.1993.1026 AUST H, 1995, SPEECH COMMUN, V17, P249, DOI 10.1016/0167-6393(95)00028-M Balentine B., 2001, BUILD SPEECH RECOGNI BECHET F, 2004 C EMP METH NAT, P134 Bernsen N.O., 1998, DESIGNING INTERACTIV BERRY GA, 1998, AAAI WORKSH INT ENV BERTENSTAM J, 1995, P ESCA WORKSH SPOK D, P281 BLOMBERG M, 1993, P EUR C SPEECH COMM, P1867 BOHUS D, 2002, P ISCA WORKSH MULT M Bolt R.
A., 1980, Computer Graphics, V14 BOYCE S, 1999, HUMAN FACTORS VOICE, P37 Boyce SJ, 2000, COMMUN ACM, V43, P29, DOI 10.1145/348941.348974 BOYE J, 2007, P WORKSH BRIDG GAP A, P68, DOI 10.3115/1556328.1556338 BOYE J, 1999, P IJCAI 99 WORKSH KN BRADY PT, 1968, AT&T TECH J, V47, P73 Brennan S, 1996, P INT S SPOK DIAL, P41 BRENNAN SE, 1995, KNOWL-BASED SYST, V8, P143, DOI 10.1016/0950-7051(95)98376-H BROCKETT C, 2005, AAAI 2005 SPRING S K Cassell J., 2002, USER MODELING ADAPTI, V12, P1 CASSELL J, 2000, P CHI2000, P259 Cassell J., 2007, GENESIS REDUX ESSAYS, P346 Cassell J., 2002, P IM 02 MONT CARL Cassell J., 1999, CHI 99, P520 CHAPANIS A, 1981, MAN COMPUTER INTERAC CLARK HH, 1994, SPEECH COMMUN, V15, P243, DOI 10.1016/0167-6393(94)90075-2 Dahlback N., 1993, P 1 INT C INT US INT, P193, DOI 10.1145/169891.169968 DAMPER R, 1984, J MAN MACHINE STUD, V21, P541 Dautenhahn K., 2005, P INT C INT ROB SYST, P1192 DYBKJAER H, 1993, P 3 EUR C SPEECH COM Dybkjaer L, 2004, SPEECH COMMUN, V43, P33, DOI 10.1016/j.specom.2004.02.001 EDLUND J, 2006, INT 2006 ICSLP SAT W EDLUND J, 2005, P ISCA TUT RES WORKS EDLUND J, 2007, P INT 2007 ANTW BELG EKLUND R, 2004, THESIS LINKOPING U L Fischer K., 2006, WHAT COMPUTER TALK I Fischer K, 2006, P WORKSH PEOPL TALK, P112 Fraser N. M., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90019-M GEORGILA K, 2006, 9 INT C SPOK LANG PR GLASS J, 2000, ICSLP 2000, P931 Goodwin C., 1981, CONVERSATIONAL ORG I Granstrom B., 2002, P SPEECH PROS 2002 C, P347 GRATCH J, 2006, P 6 INT C INT VIRT A Gustafson J., 2002, P INT C SPOK LANG PR, P297 Gustafson J., 2000, P INT C SPOK LANG PR, P134 GUSTAFSON J, 2008, P 4 IEEE WORKSH PERC GUSTAFSON J, 2005, P INT VIRT AG IVA 05 GUSTAFSON J, 2000, NATURAL LANGUAGE ENG, V6 HAYESROTH B, 2004, LIFE LIKE CHARACTERS, P447 HEALEY PGT, 2006, BRANDIAL 06 HEMPHILL C, 1990, P 3 DARPA WORKSH SPE, P102 Hjalmarsson A., 2007, P SIGDIAL ANTW BELG, P132 HJALMARSSON A, 2008, P 4 IEEE WORKSH PERC Isbister K., 2006, BETTER GAME CHARACTE Johnson WL, 2000, INT J ARTIFICIAL INT, V11, P47 Jokinen K, 2003, P WORKSH ONT MULT US, P730 Jonsson A., 2000, P 6 APPL NAT LANG PR, P44, DOI 10.3115/974147.974154 Julia LE, 1998, OZCHI 98 - 1998 AUSTRALASIAN COMPUTER HUMAN INTERACTION CONFERENCE, PROCEEDINGS, P32 Julia L, 1998, VSMM98: FUTUREFUSION - APPLICATION REALITIES FOR THE VIRTUAL AGE, VOLS 1 AND 2, P466 KAWAMOTO S, 2004, LIFE LIKE CHARACTERS, P187 KITAWAKI N, 1991, IEEE J SEL AREA COMM, V9, P586, DOI 10.1109/49.81952 LARSSON S, 2005, P 9 WORKSH SEM PRAGM LATHROP B, 2004, P 12 INT C INT TRANS Laurel B., 1990, ART HUMAN COMPUTER I, P355 LEIJTEN M, 2001, P HUM COMP INT INTER Lester J. C., 1997, Proceedings of the First International Conference on Autonomous Agents, DOI 10.1145/267658.269943 Lester JC, 1999, APPL ARTIF INTELL, V13, P383, DOI 10.1080/088395199117324 LIFE A, 1996, P ICSLP 96 3 6 OCT P MARTIN TB, 1976, P IEEE, V64, P487, DOI 10.1109/PROC.1976.10157 Martinovsky B., 2003, P ISCA TUT RES WORKS Mobius B., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1021052023237 Moller S, 2007, COMPUT SPEECH LANG, V21, P26, DOI 10.1016/j.csl.2005.11.003 MOODY T, 1998, THESIS N CAROLINA ST Mori M., 1970, ENERGY, V7, P33, DOI DOI 10.1162/PRES.16.4.337 Mutlu B., 2006, P 2006 IEEE RAS INT NAKATANI CH, 1994, J ACOUST SOC AM, V95, P1603, DOI 10.1121/1.408547 NISBETT RE, 1977, PSYCHOL REV, V84, P231, DOI 10.1037/0033-295X.84.3.231 NORMAN D. A., 1998, DESIGN EVERYDAY THIN Norman D. A., 1983, MENTAL MODELS, P7 Nye J. 
M., 1982, Speech Technology, V1 OVIATT S, 2000, P INT C SPOK LANG PR, P877 PAEK T, 2007, P INT 2007 ANTW BELG, P1322 PAEK T, 2001, ACL 2001 WORKSH EV M Peckham J., 1991, P WORKSH SPEECH NAT, P14, DOI 10.3115/112405.112408 PEISSNER M, 2001, P HCI INT LAWR ERB M, P233 Piesk J., 1997, Proceedings. International Conference on Virtual Systems and MultiMedia, VSMM '97 (Cat. No.97100182), DOI 10.1109/VSMM.1997.622336 PORZEL R, 2006, P WORKSH PEOPL TALK PURVER M, 2003, 4 SIGDIAL WORKSH DIS QVARFORDT P, 2004, THESIS LINKOPING U RAYNER M, 2001, P 2 SIGDIAL WORKSH D Reeves B, 1996, MEDIA EQUATION Reeves LM, 2004, COMMUN ACM, V47, P57, DOI 10.1145/962081.962106 Riccardi G, 2000, IEEE T SPEECH AUDI P, V8, P3, DOI 10.1109/89.817449 Rich C, 2001, AI MAG, V22, P15 RUDNICKY A, 1996, SPEECH INTERFA UNPUB RUTTKAY Z, 2004, BROWS TILL TRUST EVA SAINI P, 2005, P INT 2005 COMM NAT Salber D., 1993, P EWHCI, P219 SCHALNGEN D, 2007, P INT 2007 ANTW BELG SCHATZMANN J, 2005, 6 SIGDIAL WORKSH DIS Seneff S., 1998, P ICSLP, P931 Shneiderman B., 1997, INTERACTIONS, V4, P42, DOI 10.1145/267505.267514 Skantze G., 2003, P ISCA TUT RES WORKS, P71 Skantze G, 2006, LECT NOTES ARTIF INT, V4021, P193 SKANTZE G, 2006, P INT 2006 ICSLP PIT, P2002 SMITH R, 1992, P 3 C APPL NAT LANG, P9, DOI 10.3115/974499.974502 STUTTLE M, 2004, P ICSLP JEJ S KOR TENBRINK T., 2007, P 8 SIGDIAL WORKSH D, P103 Traum D., 2001, P IJCAI WORKSH KNOWL, P766 Turing A.M., 1950, MIND, V59, P433, DOI DOI 10.1093/MIND/LIX.236.433 VANE ET, 1994, PROGRAMMING TV RADIO Von Ahn L., 2004, ACM C HUM FACT COMP, P319 von Alm Luis, 2006, P SIGCHI C HUM FACT, P79, DOI 10.1145/1124772.1124785 WAHLSTER W, 2001, P MTI STAT C BERL GE, P26 Wahlster W., 2000, VERBMOBIL FDN SPEECH Walker M. A., 2000, NAT LANG ENG, V6, P363, DOI 10.1017/S1351324900002503 Ward N, 2000, J PRAGMATICS, V32, P1177, DOI 10.1016/S0378-2166(99)00109-5 WEIZENBA.J, 1966, COMMUN ACM, V9, P36, DOI 10.1145/365153.365168 WHITTAKER S, 2006, SPOKEN MULTIMODAL HU WIREN M, 2007, BRIDGING GAP ACAD IN Wooffitt R., 1997, HUMANS COMPUTERS WIZ ZUE V, 2007, P INT ANTW BELG, P1 Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 126 TC 24 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG-SEP PY 2008 VL 50 IS 8-9 BP 630 EP 645 DI 10.1016/j.specom.2008.04.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 342WJ UT WOS:000258814000002 ER PT J AU Callejas, Z Lopez-Cozar, R AF Callejas, Zoraida Lopez-Cozar, Ramon TI Relations between de-facto criteria in the evaluation of a spoken dialogue system SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 9th International Conference on Spoken Language Processing/INTERSPEECH 2006 CY 2006 CL Pittsburgh, PA SP Int Speech Commun Assoc DE spoken dialogue systems; evaluation; field test ID INFORMATION-TECHNOLOGY; USABILITY; FRAMEWORK AB Evaluation of spoken dialogue systems has been traditionally carried out in terms of instrumentally or expert-derived measures (usually called "objective" evaluation) and quality judgments of users who have previously interacted with the system (also called "subjective" evaluation). Different research efforts have been made to extract relationships between these evaluation criteria. In this paper we report empirical results obtained from statistical studies, which were carried out on interactions of real users with our spoken dialogue system. 
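A hedged sketch of the kind of statistical study this record reports: rank correlation between an objective interaction measure and users' quality judgments. The numbers below are invented; the record itself draws on its own field-test corpus and de-facto evaluation criteria.

```python
# Spearman rank correlation between an objective measure (dialog duration)
# and a subjective measure (user satisfaction on a 1-5 scale); toy data.
from scipy.stats import spearmanr

dialog_duration_s  = [62, 95, 48, 130, 77, 150, 55, 101]          # objective
user_satisfaction  = [4.5, 3.8, 4.7, 2.9, 4.1, 2.5, 4.6, 3.5]     # subjective

rho, p = spearmanr(dialog_duration_s, user_satisfaction)
print(f"Spearman rho = {rho:.2f} (p = {p:.3f})")  # longer dialogs, lower ratings
```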
These studies have rarely been exploited in the literature. Our results show that they can indicate important relationships between criteria, which can be used as guidelines for refinement of the systems under evaluation, as well as contributing to the state-of-the-art knowledge about how quantitative aspects of the systems affect the user's perceptions about them. (c) 2008 Elsevier B.V. All rights reserved. C1 [Callejas, Zoraida; Lopez-Cozar, Ramon] Univ Granada, Fac Comp Sci & Telecommun, Dept Languages & Comp Syst, E-18071 Granada, Spain. RP Callejas, Z (reprint author), Univ Granada, Fac Comp Sci & Telecommun, Dept Languages & Comp Syst, C Pdta Daniel Saucedo Aranda S-N, E-18071 Granada, Spain. EM zoraida@ugr.es; rlopezc@ugr.es RI Prieto, Ignacio/B-5361-2013; Lopez-Cozar, Ramon/A-7686-2012; Callejas Carrion, Zoraida/C-6851-2012 OI Lopez-Cozar, Ramon/0000-0003-2078-495X; Callejas Carrion, Zoraida/0000-0001-8891-5237 CR BAGGIA P, 2000, SPEECH COMMUN, V11, P355 BECKER T, 2006, P 4 EUR C PREST APPL, P612 Beringer N, 2002, P LREC WORKSH MULT R, P77 BERNSEN NO, 2000, P 2 INT C LANG RES E, P183 Bickmore T, 2006, J BIOMED INFORM, V39, P556, DOI 10.1016/j.jbi.2005.12.004 Callejas Z., 2005, P APPL SPOK LANG INT CALLEJAS Z, 2007, P 8 INT WORKSH EL CO Degerstedt L., 2006, P 9 INT C SPOK LANG, P489 DENOS E, 1999, P EUR C SPEECH TECHN, P1527 DEVILLERS L., 2004, P 4 INT C LANG RES E, P2131 *DISC, 1999, DISC FIN REP COV PER Dybkjaer L, 2004, SPEECH COMMUN, V43, P33, DOI 10.1016/j.specom.2004.02.001 Dybkjaer L., 2000, NATURAL LANGUAGE ENG, V6, P243, DOI 10.1017/S1351324900002461 *EAGLES, 1996, EAGEWGPR2 CTR SPROG Farzanfar R, 2005, J BIOMED INFORM, V38, P220, DOI 10.1016/j.jbi.2004.11.011 GEUTNER P, P 3 INT C LANG RES E Hartikainen M., 2004, P ICSLP 2004, P2273 HURTIG T, 2004, P 8 C HUM COMP INT M, P251 Lamel L, 2002, SPEECH COMMUN, V38, P131, DOI 10.1016/S0167-6393(01)00048-6 Larsen L., 2003, P IEEE WORKSH AUT SP, P209 Legris P, 2003, INFORM MANAGE-AMSTER, V40, P191, DOI 10.1016/S0378-7206(01)00143-4 Litman DJ, 2002, USER MODEL USER-ADAP, V12, P111, DOI 10.1023/A:1015036910358 Lopez-Cozar R., 2005, SPOKEN MULTILINGUAL McTear M., 2004, SPOKEN DIALOGUE TECH Minker W, 2004, SPEECH COMMUN, V43, P89, DOI 10.1016/j.specom.2004.01.005 MOLLER S, 2002, P 3 WORKSH DISC DIAL, P143 Moller S, 2007, COMPUT SPEECH LANG, V21, P26, DOI 10.1016/j.csl.2005.11.003 Moller S., 2005, QUALITY TELEPHONE BA PAEK T, 2001, P WORKSH EV LANG DIA, V9, P1, DOI 10.3115/1118053.1118054 Park WY, 2007, LECT NOTES COMPUT SC, V4559, P398 Rajman M, 2004, ACTA ACUST UNITED AC, V90, P1096 RAUX A, 2006, P INT C SPOK LANG PR, P65 Raux A., 2003, P INT, P753 ROBINSON SM, 2006, P 25 ARM SCI C ORL U Sanderman A., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727700 SCHIEL F, 2006, SMARTKOM FDN MULTIMO, P617, DOI 10.1007/3-540-36678-4_38 Sturm J, 2005, TEXT SPEECH LANG TEC, V28, P329 TURUNEN M, 2006, P INT 2006, P1057 TURUNEN M, 2001, P 4 INT C TEXT SPEEC, P357 Wahlster W., 2006, SMARTKOM FDN MULTIMO WALKER M, 2002, P ICSLP, V1, P269 Walker M., 2000, P N AM M ASS COMP LI, P210 WALKER M, 2002, P 7 INT C SPOK LANG, V1, P273 Walker M. A., 2000, NAT LANG ENG, V6, P363, DOI 10.1017/S1351324900002503 Walker MA, 1998, COMPUT SPEECH LANG, V12, P317, DOI 10.1006/csla.1998.0110 NR 45 TC 20 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD AUG-SEP PY 2008 VL 50 IS 8-9 BP 646 EP 665 DI 10.1016/j.specom.2008.04.004 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 342WJ UT WOS:000258814000003 ER PT J AU Griol, D Hurtado, LF Segarra, E Sanchis, E AF Griol, David Hurtado, Lluis F. Segarra, Encarna Sanchis, Emilio TI A statistical approach to spoken dialog systems design and evaluation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 9th International Conference on Spoken Language Processing/INTERSPEECH 2006 CY 2006 CL Pittsburgh, PA SP Int Speech Commun Assoc DE spoken dialog systems; statistical models; dialog management; user simulation; system evaluation ID STRATEGIES; SIMULATION AB In this paper, we present a statistical approach for the development of a dialog manager and for learning optimal dialog strategies. This methodology is based on a classification procedure that considers all of the previous history of the dialog to select the next system answer. To evaluate the performance of the dialog system, the statistical approach for dialog management has been extended to model the user behavior. The statistical user simulator has been used for the evaluation and improvement of the dialog strategy. Both the user model and the system model are automatically learned from a training corpus that is labeled in terms of dialog acts. New measures have been defined to evaluate the performance of the dialog system. Using these measures, we evaluate both the quality of the simulated dialogs and the improvement of the new dialog strategy that is obtained with the interaction of the two modules. This methodology has been applied to develop a dialog manager within the framework of the DIHANA project, whose goal is the design and development of a dialog system to access a railway information system using spontaneous speech in Spanish. We propose the use of corpus-based methodologies to develop the main modules in the dialog system. (c) 2008 Elsevier B.V. All rights reserved. C1 [Griol, David; Hurtado, Lluis F.; Segarra, Encarna; Sanchis, Emilio] Univ Politecn Valencia, Dept Sistemes Informat & Comp, E-46022 Valencia, Spain. RP Griol, D (reprint author), Univ Politecn Valencia, Dept Sistemes Informat & Comp, E-46022 Valencia, Spain. EM dgriol@dsic.upv.es; lhurtado@dsic.upv.es; esegarra@dsic.upv.es; esanchis@dsic.upv.es RI Segarra, Encarna/K-5883-2014 OI Segarra, Encarna/0000-0002-5890-8957 CR Benedi J.M., 2006, P 5 INT C LANG RES E, P1636 Bishop C. M., 1995, NEURAL NETWORKS PATT BONAFONTE A, 2000, P PRIM JORN TECN HAB Bos J., 2003, P 4 SIGDIAL WORKSH D, P115 CASTRO MJ, 2003, LNCS, V2527, P664 Cuayahuitl H., 2006, P INTERSP 06 ICSLP P, P469 Cuayahuitl H., 2005, P IEEE WORKSH AUT SP, P290 DORAN C, 2001, P 2 SIGDIAL WORKSH D, P1 Dybkjaer L, 2004, SPEECH COMMUN, V43, P33, DOI 10.1016/j.specom.2004.02.001 Dybkjaer L., 2000, NATURAL LANGUAGE ENG, V6, P243, DOI 10.1017/S1351324900002461 *EAGLES, 1996, EAGEWGPR2 CTR SPROGT Eckert W., 1997, P IEEE WORKSH AUT SP, P80 Eckert W., 1998, TR9891 ATT LABS RES ESTEVE Y, 2003, P EUROSPEECH, V1, P617 FAILENSCHMID K, 1999, DISC DELIVERABLE D3 FIKES R, 1985, COMMUN ACM, V28, P904, DOI 10.1145/4284.4285 Fraser N.
M., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90019-M Garcia F, 2003, LECT NOTES ARTIF INT, V2807, P165 Georgila K., 2005, P 9 EUR C SPEECH COM, P893 GEORGILA K, 2006, P 9 INT C SPOK LANG, P1065 Griol D., 2006, P INT C SPEECH COMP, P131 Griol D., 2006, P AAAI WORKSH STAT E, P25 He Y., 2003, P IEEE AUT SPEECH RE, P583 Hurtado L. F., 2005, P IEEE WORKSH AUT SP, P226 HURTADO LF, 2006, P INT 06 ICSLP PITTS, P49 Jonsson A., 1988, P SCAND C ART INT SC, P53 Lane I., 2004, P WORKSH MAN MACH SY, P2837 LEMON O, 2007, P 8 SIGDIAL WORKSH D, P55 Lemon O., 2006, P IEEE ACL WORKSH SP, P178 Levin E., 1997, P EUR, P1883 Levin E, 2000, IEEE T SPEECH AUDI P, V8, P11, DOI 10.1109/89.817450 LEVIN E, 1995, P EUR C SPEECH COMM, P555 Meng HM, 2003, IEEE T SPEECH AUDI P, V11, P757, DOI 10.1109/TSA.2003.814380 Minker W., 1999, STOCHASTICALLY BASED Minsky M., 1975, FRAMEWORK REPRESENTI, P211 Paek T., 2000, P 16 C UNC ART INT U, P455 Pietquin O, 2006, IEEE T AUDIO SPEECH, V14, P589, DOI 10.1109/TSA.2005.855836 Pietquin O., 2005, P 9 EUR C SPEECH COM, P861 RIESER V, 2006, P 9 INT C SPOK LANG, P1766 Roy N., 2000, P 38 ANN M ASS COMP, P93, DOI DOI 10.3115/1075218.1075231 RUMELHART DE, 1986, LEARNING INTERNAL RE, P319 Schatzmann J., 2005, P IEEE AUT SPEECH RE, P220 Schatzmann J., 2007, P HUM LANG TECHN 200, P149 Schatzmann J, 2005, P 6 SIGDIAL WORKSH D, P45 Schatzmann J, 2006, KNOWL ENG REV, V21, P97, DOI 10.1017/S0269888906000944 Schatzmann J, 2007, P 8 SIGDIAL WORKSH D, P273 Scheffler K., 2000, P INT C ACOUSTICS SP, P1217 Scheffler K., 2001, P NAACL WORKSH AD DI, P64 Scheffler K., 1999, 355 CUEDFINFENGTR Scheffler K, 2001, P HUM LANG TECHN HLT, P12 Segarra E, 2002, INT J PATTERN RECOGN, V16, P301, DOI 10.1142/S021800140200168X Singh S., 1999, P NEUR INF PROC SYST, P956 Torres F., 2003, P EUROSPEECH, P605 Walker MA, 1998, COMPUT SPEECH LANG, V12, P317, DOI 10.1006/csla.1998.0110 Williams JD, 2007, COMPUT SPEECH LANG, V21, P393, DOI 10.1016/j.csl.2006.06.008 Young S., 2002, CUEDFINFENGTR433 YOUNG S, 2007, P 32 IEEE INT C AC S, V4, P149 Young S., 2005, HIDDEN INFORM STATE NR 58 TC 42 Z9 42 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG-SEP PY 2008 VL 50 IS 8-9 BP 666 EP 682 DI 10.1016/j.specom.2008.04.001 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 342WJ UT WOS:000258814000004 ER PT J AU Tetreault, JR Litman, DJ AF Tetreault, Joel R. Litman, Diane J. TI A Reinforcement Learning approach to evaluating state representations in spoken dialogue systems SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 9th International Conference on Spoken Language Processing/INTERSPEECH 2006 CY 2006 CL Pittsburgh, PA SP Int Speech Commun Assoc DE spoken dialogue systems; evaluation; adaptive systems; Reinforcement Learning; Markov decision processes; feature selection; machine learning; affect; tutoring systems AB Although dialogue systems have been an area of research for decades, finding accurate ways of evaluating different systems is still a very active subfield since many leading methods, such as task completion rate or user satisfaction, capture different aspects of the end-to-end human-computer dialogue interaction. In this work, we step back from the complete evaluation of a dialogue system and instead present metrics for evaluating one internal component of a dialogue system: its dialogue manager.
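The record goes on to frame the dialogue manager as a Reinforcement Learning problem over alternative state representations. As a minimal, purely illustrative sketch of the underlying machinery, value iteration on a toy dialogue MDP; all states, actions, transitions and rewards below are invented, not taken from the tutoring corpus the record describes:

```python
# Value iteration on a toy dialogue MDP: a tutoring system chooses
# between giving a hint or restating, trying to reach "done".
states = ["uncertain", "certain", "done"]
actions = ["hint", "restate"]
# P[s][a] = list of (next_state, probability); "done" is terminal.
P = {"uncertain": {"hint":    [("certain", 0.7), ("uncertain", 0.3)],
                   "restate": [("certain", 0.4), ("uncertain", 0.6)]},
     "certain":   {"hint":    [("done", 0.9), ("uncertain", 0.1)],
                   "restate": [("done", 0.6), ("uncertain", 0.4)]}}
R = {"done": 100.0}          # terminal reward only
gamma = 0.95

V = {s: 0.0 for s in states}
for _ in range(200):         # iterate to convergence
    newV = {}
    for s in states:
        if s not in P:       # terminal state keeps its reward
            newV[s] = R.get(s, 0.0)
        else:
            newV[s] = gamma * max(sum(p * V[nxt] for nxt, p in P[s][a])
                                  for a in actions)
    V = newV

policy = {s: max(actions,
                 key=lambda a, s=s: sum(p * V[nxt] for nxt, p in P[s][a]))
          for s in P}
print(V, policy)
```

Changing the state set (for example, splitting "uncertain" by an affect feature) changes the learned values and policy, which is exactly the effect the record's metrics are designed to quantify.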
Specifically, we investigate how to create and evaluate the best state space representations for a Reinforcement Learning model to learn an optimal dialogue control strategy. We present three metrics for evaluating the impact of different state models and demonstrate their use on the domain of a spoken dialogue tutoring system by comparing the relative utility of adding three features to a model of user, or student, state. The motivation for this work is that if one knows which features are best to use, one can construct a better dialogue manager, and thus better performing dialogue systems. (c) 2008 Elsevier B.V. All rights reserved. C1 [Tetreault, Joel R.] Educ Testing Serv, Princeton, NJ 08541 USA. [Litman, Diane J.] Univ Pittsburgh, Dept Comp Sci, LRDC, Pittsburgh, PA 15260 USA. RP Tetreault, JR (reprint author), Educ Testing Serv, Princeton, NJ 08541 USA. EM JTetreault@ets.org; litman@cs.pitt.edu CR AI H, 2006, USING SYSTEM USER PE AI H, 2007, KNOWLEDGE CONSISTENT AMMICHT E, 2001, AMBIGUITY REPRESENTA ANG J, 2002, PROSODY BASED AUTOMA BHATT K, 2004, COGNIT SCI Byron DK, 2002, 40TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE CONFERENCE, P80 CORE M, 1996, 612 U ROCH DANIELI M, 1995, AAAI SPRING S EMP ME, P34 Dumais S., 1998, CIKM 98, P148 FORBESRILEY K, 2005, ARTIF INTELL ED FORBESRILEY K, 2005, USING BIGRAMS IDENTI FRAMPTON M, 2005, IJCAI WORKSH K R PRA, P62 FRAMPTON M, 2006, LEARNING MORE EFFECT, P185 GEORGILA K, 2005, LEARNING USER SIMULA, P893 HENDERSON J, 2005, IJCAI WORKSH K R PRA, P68 HIRSCHMAN L, 1993, 3 EUR C SPEECH COMM, P1419 HIRSCHMAN L, 1990, SPEECH NAT LANG WORK, P109 JAULMES R, 2005, EUR C MACH LEARN Jelinek F., 1998, STAT METHODS SPEECH Levin E, 2000, IEEE T SPEECH AUDI P, V8, P11, DOI 10.1109/89.817450 LEVIN E, 1997, STOCHASTIC MODEL COM LISCOMBE J, 2005, DETECTING CERTAINNES LITMAN D, 2004, ITSPOKE INTELLIGENT LITMAN D, 2000, P 18 INT C COMP LING NICHOLAS G, 2006, IEEE ACL WORKSH SPOK PAEK T, 2005, 6 SIGDIAL WORKSH DIS POLIFRONI J, 1992, DARPA SPEECH NAT LAN, P28 RIESER V, 2006, IEEE ACL WORKSH SPOK Rivals I, 2002, NEURAL NETWORKS, V15, P143, DOI 10.1016/S0893-6080(01)00093-4 ROY N, 2002, SPOKEN DIALOGUE MANA Schapire R. E., 2002, MSRI WORKSH NONL EST SCHEFFLER K, 2002, AUTOMATIC LEARNING D Singh S, 2002, J ARTIF INTELL RES, V16, P105 SINGH S, 1999, REINFORCEMENT LEARNI Sutton R. S., 1998, REINFORCEMENT LEARNI TETREAULT J, 2006, USING REINFORCEMENT TETREAULT J, 2007, ESTIMATING RELIABILI TETREAULT J, 2006, COMPARING UTILITY ST VANLEHN K, 2002, INTELL TUTORING SYST WALKER M, 1998, LEARNING OPTIMAL DIA WALKER M, 2000, APPL REINFORCEMENT L WALKER MA, 2000, DEV GEN MODELS USABI WARNESTAL P, 2005, IJCAI WORKSH K R PRA, P62 WILLIAMS J, 2005, IJCAI WORKSH K R PRA, P76 WILLIAMS J, 2006, COMPUT SPEECH LANG, P24 WILLIAMS J, 2005, PARTIALLY OBSERVABLE NR 46 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
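The state-representation question studied in the Tetreault and Litman record above can be made concrete with a small tabular Q-learning sketch: the same logged episodes are summarized under two candidate state models, with and without an extra student-state feature, and each yields its own learned policy. The states, actions and rewards here are invented; the paper works with a real tutoring corpus and MDP-based evaluation metrics.

from collections import defaultdict

ACTIONS = ["hint", "restate", "move_on"]   # hypothetical tutor actions

def q_learning(episodes, state_of, alpha=0.1, gamma=0.9):
    # Tabular Q-learning over logged episodes of (observation, action, reward)
    # steps; state_of maps a raw observation to the chosen state representation.
    Q = defaultdict(float)
    for episode in episodes:
        for t, (obs, action, reward) in enumerate(episode):
            s = state_of(obs)
            if t + 1 < len(episode):
                nxt = state_of(episode[t + 1][0])
                best_next = max(Q.get((nxt, a), 0.0) for a in ACTIONS)
            else:
                best_next = 0.0
            Q[(s, action)] += alpha * (reward + gamma * best_next - Q[(s, action)])
    return Q

# Invented observations (correctness, certainty); reward only where shown.
episodes = [
    [(("incorrect", "uncertain"), "hint", 0.0), (("correct", "certain"), "move_on", 1.0)],
    [(("incorrect", "uncertain"), "move_on", 0.0), (("incorrect", "uncertain"), "hint", -1.0)],
]

# Two candidate state models: with and without the certainty feature.
Q_full = q_learning(episodes, state_of=lambda o: o)
Q_bare = q_learning(episodes, state_of=lambda o: o[0])
print("full:", max(ACTIONS, key=lambda a: Q_full[(("incorrect", "uncertain"), a)]))
print("bare:", max(ACTIONS, key=lambda a: Q_bare[("incorrect", a)]))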
PD AUG-SEP PY 2008 VL 50 IS 8-9 BP 683 EP 696 DI 10.1016/j.specom.2008.05.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 342WJ UT WOS:000258814000005 ER PT J AU Jung, S Lee, C Kim, S Lee, GG AF Jung, Sangkeun Lee, Cheongjae Kim, Seokhwan Lee, Gary Geunbae TI DialogStudio: A workbench for data-driven spoken dialog system development and management SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 9th International Conference on Spoken Language Processing/INTERSPEECH 2006 CY 2006 CL Pittsburgh, PA SP Int Speech Commun Assoc DE spoken dialog systems; workbench tool; data driven; spoken language understanding; dialog modeling; automatic speech recognition ID MODEL AB Recently, data-driven speech technologies have been widely used to build speech user interfaces. However, developing and managing data-driven spoken dialog systems are laborious and time-consuming tasks. Spoken dialog systems have many components and their development and management involve numerous tasks such as preparing the corpus, training, testing and integrating each component for system development and management. In addition, data annotation for natural language understanding and speech recognition is quite burdensome. This paper describes the development of a tool, DialogStudio, to support the development and management of data-driven spoken dialog systems. Desirable aspects of the data-driven spoken dialog system workbench tool are identified, and architectures and concepts are proposed that make DialogStudio efficient in data annotation and system development in a domain and methodology neutral manner. The usability of DialogStudio was validated by developing dialog systems in three different domains with two different dialog management methods. Objective evaluations of each domain show that DialogStudio is a feasible solution as a workbench for data-driven spoken dialog systems. (c) 2008 Elsevier B.V. All rights reserved. C1 [Jung, Sangkeun; Lee, Cheongjae; Kim, Seokhwan; Lee, Gary Geunbae] Pohang Univ Comp Sci & Technol POSTECH, Dept Comp Sci & Engn, Pohang 790784, South Korea. RP Jung, S (reprint author), Pohang Univ Comp Sci & Technol POSTECH, Dept Comp Sci & Engn, San 31, Pohang 790784, South Korea. EM hugman@postech.ac.kr CR Chin J. P., 1988, DEV INSTRUMENT MEASU COLE R, 1999, P INT C PHON SC SAN, P1277 DYBJAER H, 2005, P 6 SIGDIAL WORKSH D, P227 Dybkjaer H, 2006, LANG RESOUR EVAL, V40, P87, DOI 10.1007/s10579-006-9010-8 Hardy H, 2006, SPEECH COMMUN, V48, P354, DOI 10.1016/j.specom.2005.07.006 HE Y, 2003, 2003 IEEE WORKSH AUT, P583 JEONG M, 2006, P IEEE ACL 2006 WORK KAISER E, 1999, P IEEE INT C AC SPEE, P2 Klemmer S.R., 2000, P 13 ANN ACM S US IN, P1, DOI 10.1145/354401.354406 Larsson S., 2000, NAT LANG ENG, V6, P323, DOI [DOI 10.1017/S1351324900002539, 10.1017 S1351324900002539] LEE C, 2005, P IEEE AUT SPEECH RE LEE J, 2006, P INT 2006 ICSLP Levin E, 2000, IEEE T SPEECH AUDI P, V8, P11, DOI 10.1109/89.817450 Manning C.
D., 1999, FDN STAT NATURAL LAN MCTEAR M, 1999, P EUR 99 BUD HUNG Singh S, 2002, J ARTIF INTELL RES, V16, P105 SINHA AK, 2001, CHI 01 CHI 01 EXT AB, P203 Stolcke A, 2000, COMPUT LINGUIST, V26, P339, DOI 10.1162/089120100561737 Sutton S, 1998, P INT C SPOK LANG PR, P3221 WILLIAMS JD, 2006, THESIS CAMBRIDGE U Williams JD, 2007, COMPUT SPEECH LANG, V21, P393, DOI 10.1016/j.csl.2006.06.008 Young S., 2005, HTK BOOK HTK VERSION NR 22 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG-SEP PY 2008 VL 50 IS 8-9 BP 697 EP 715 DI 10.1016/j.specom.2008.04.003 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 342WJ UT WOS:000258814000006 ER PT J AU Paek, T Pieraccini, R AF Paek, Tim Pieraccini, Roberto TI Automating spoken dialogue management design using machine learning: An industry perspective SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 9th International Conference on Spoken Language Processing/INTERSPEECH 2006 CY 2006 CL Pittsburgh, PA SP Int Speech Commun Assoc DE dialogue management; machine learning; reinforcement learning; industry ID SYSTEM; STRATEGIES; SIMULATION AB In designing a spoken dialogue system, developers need to specify the actions a system should take in response to user speech input and the state of the environment based on observed or inferred events, states, and beliefs. This is the fundamental task of dialogue management. Researchers have recently pursued methods for automating the design of spoken dialogue management using machine learning techniques such as reinforcement learning. In this paper, we discuss how dialogue management is handled in industry and critically evaluate to what extent current state-of-the-art machine learning methods can be of practical benefit to application developers who are deploying commercial production systems. In examining the strengths and weaknesses of these methods, we highlight what academic researchers need to know about commercial deployment if they are to influence the way industry designs and practices dialogue management. (c) 2008 Elsevier B.V. All rights reserved. C1 [Paek, Tim] Microsoft Res, Redmond, WA 98052 USA. [Pieraccini, Roberto] SpeechCycle, New York, NY 10004 USA. RP Paek, T (reprint author), Microsoft Res, 1 Microsoft Way, Redmond, WA 98052 USA. 
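The Paek and Pieraccini record above frames dialogue management as choosing actions from observed or inferred events, states, and beliefs. As one concrete piece of that picture, a discrete Bayesian belief update over user goals given a noisy speech-recognition observation might look as follows; the goals and confusion probabilities are invented for illustration and are not from the paper.

def belief_update(belief, observation, obs_model):
    # One step of discrete Bayesian belief tracking: b'(g) is proportional
    # to P(o|g) * b(g). obs_model[g][o] is the assumed probability of
    # hearing observation o when the user goal is g.
    unnorm = {g: obs_model[g].get(observation, 1e-6) * p for g, p in belief.items()}
    z = sum(unnorm.values())
    return {g: p / z for g, p in unnorm.items()}

# Hypothetical goals and confusion probabilities (not from the paper).
obs_model = {
    "check_balance": {"balance": 0.7, "transfer": 0.2, "noise": 0.1},
    "transfer_funds": {"transfer": 0.6, "balance": 0.3, "noise": 0.1},
}
belief = {"check_balance": 0.5, "transfer_funds": 0.5}
belief = belief_update(belief, "transfer", obs_model)
print(belief)  # probability mass shifts toward 'transfer_funds'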
EM timpaek@microsoft.com CR ACOMB K, 2007, P HLT WORKSH BRIDG G ALLEN J, 1998, NAT LANG ENG, V6, P213 Balentine B., 2001, BUILD SPEECH RECOGNI BARTO AG, 1995, ARTIF INTELL, V72, P81, DOI 10.1016/0004-3702(94)00011-O BOHUS D, P IEEE ACL WORKSH SP BOHUS D, 2005, P 6 SIGDIAL WORKSH D, P128 BOHUS D, 2005, P HUM LANG TECHN C C, P225, DOI 10.3115/1220575.1220604 Boland J., 2001, P 39 ANN M ASS COMP, P515, DOI 10.3115/1073012.1073078 Chickering DM, 2007, USER MODEL USER-ADAP, V17, P71, DOI 10.1007/s11257-006-9020-7 Eckert W., 1997, P IEEE WORKSH AUT SP, P80 ECKERT W, 1998, P TWLT13 FORM SEM PR, P99 Esselink Bert, 2000, PRACTICAL GUIDE SOFT HANSEN EA, 1998, ADV NEURAL INFORM PR, V10 HECKERMAN D, 1995, COMMUN ACM, V38, P49, DOI 10.1145/203330.203341 Kaelbling LP, 1996, J ARTIF INTELL RES, V4, P237 Kaelbling LP, 1998, ARTIF INTELL, V101, P99, DOI 10.1016/S0004-3702(98)00023-X LEVIN E, 1998, P IEEE T SPEECH AUD, V8, P11 LEVIN E, 2006, P IEEE ACL WORKSH SP LEWIS C, 2006, P INT Litman DJ, 2002, USER MODEL USER-ADAP, V12, P111, DOI 10.1023/A:1015036910358 McConnell S, 2004, CODE COMPLETE Nass C, 2005, WIRED SPEECH VOICE A Ng A. Y., 2000, P 17 INT C MACH LEAR, P663 PAEK T, 2005, P 6 SIGDIAL WORKSH D, P35 Paek T., 2004, P HLT NAACL, P41 Papineni K., 1999, P 6 EUR C SPEECH COM, P1411 PIERACCINI R, 2005, P IEA AIE, P6 Pierce Richard, 2005, P 6 SIGDIAL WORKSH D, P1 Pietquin O, 2006, IEEE T AUDIO SPEECH, V14, P589, DOI 10.1109/TSA.2005.855836 Polifroni J., 2000, P 2 INT C LANG RES E, P725 Rosenfeld R, 2000, P IEEE, V88, P1270, DOI 10.1109/5.880083 Roy N., 2000, P ACL Schatzmann J, 2006, KNOWL ENG REV, V21, P97, DOI 10.1017/S0269888906000944 SCHATZMANN J, 2005, P IEEE ASRU WORKSH, P412 Singh S, 2002, J ARTIF INTELL RES, V16, P105 Sutton R. S., 1998, REINFORCEMENT LEARNI Tetreault J., 2006, P HLT NAACL, P272, DOI 10.3115/1220835.1220870 Walker M. A., 2000, NAT LANG ENG, V6, P363, DOI 10.1017/S1351324900002503 Walker MA, 2000, J ARTIF INTELL RES, V12, P387 Walker MA, 2002, J ARTIF INTELL RES, V16, P293 WATKINS CJC, 1992, Q LEARNING MACH LEAR, V8, P229 Wei X., 2000, P ANLP NAACL, P42 Williams J., 2005, P 4 IJCAI WORKSH KNO, P76 Williams J., 2005, P IEEE ASRU WORKSH Williams JD, 2007, COMPUT SPEECH LANG, V21, P393, DOI 10.1016/j.csl.2006.06.008 Williams Jeff, 2007, Proceedings of the 2007 Integrated Communications, Navigation and Surveillance Conference Young S., 2002, CUEDFINFENGTR433 Zhang B., 2001, P 16 C UNC ART INT, P572 NR 48 TC 23 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG-SEP PY 2008 VL 50 IS 8-9 BP 716 EP 729 DI 10.1016/j.specom.2008.03.010 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 342WJ UT WOS:000258814000007 ER PT J AU Moller, S Engelbrecht, KP Schleicher, R AF Moeller, Sebastian Engelbrecht, Klaus-Peter Schleicher, Robert TI Predicting the quality and usability of spoken dialogue services SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 9th International Conference on Spoken Language Processing/INTERSPEECH 2006 CY 2006 CL Pittsburgh, PA SP Int Speech Commun Assoc DE spoken dialogue system; quality; usability; prediction model; optimization AB In this paper, we compare different approaches for predicting the quality and usability of spoken dialogue systems. The respective models provide estimations of user judgments on perceived quality, based on parameters which can be extracted from interaction logs.
Different types of input parameters and different modeling algorithms have been compared using three spoken dialogue databases obtained with two different systems. The results show that both linear regression models and classification trees are able to cover around 50% of the variance in the training data, and neural networks even more. When applied to independent test data, in particular to data obtained with different systems and/or user groups, the prediction accuracy decreases significantly. The underlying reasons for the limited predictive power are discussed. It is shown that - although an accurate prediction of individual ratings is not yet possible with such models - they may still be used for taking decisions on component optimization, and are thus helpful tools for the system developer. (c) 2008 Elsevier B.V. All rights reserved. C1 [Moeller, Sebastian; Engelbrecht, Klaus-Peter; Schleicher, Robert] Tech Univ Berlin, Deutsch Telekom Labs, Qual & Usabil Lab, D-10587 Berlin, Germany. RP Moller, S (reprint author), Tech Univ Berlin, Deutsch Telekom Labs, Qual & Usabil Lab, Ernst Reuter Platz 7, D-10587 Berlin, Germany. EM sebastian.moeller@telekom.de; klaus-peter.engelbrecht@telekom.de; robert.schleicher@telekom.de CR [Anonymous], 1998, 924111 ISO Bernsen N.O., 1998, DESIGNING INTERACTIV COMPAGNONI B, 2006, THESIS TU BERLIN CONSTANTINIDES P, 1999, P 6 EUR C, V1, P243 ENGELBRECHT KP, 2006, THESIS TU BERLIN Fraser N., 1997, HDB STANDARDS RESOUR, P564 HASTIE HW, 2002, P 3 INT C LANG RES E, V2, P641 Hone K.S., 2000, NAT LANG ENG, V6, P287, DOI 10.1017/S1351324900002497 HONE KS, 2001, P 7 EUR C SPEECH COM, V3, P2083 *ITU T, 2005, PAR DESCR INT SPOK S, V24 *ITU T REC, 2003, P851 ITUT REC *ITU T REC, 2001, P862 ITUT REC Jekosch U., 2005, VOICE SPEECH QUALITY KAMM CA, 1998, P 5 INT C SPOK LANG, V4, P1211 Larsen L., 2003, P IEEE WORKSH AUT SP, P209 MACKAY DJC, 1992, NEURAL COMPUT, V4, P415, DOI 10.1162/neco.1992.4.3.415 Moller S, 2007, COMPUT SPEECH LANG, V21, P26, DOI 10.1016/j.csl.2005.11.003 MOLLER S, 2005, P 4 EUR C AC FOR AC, P2681 MOLLER S, 2007, ITU T SG12 M 16 25 J Moller S., 2005, QUALITY TELEPHONE BA Moller S., 2005, P INT, P2489 Moller S., 2006, P 2 ISCA DEGA TUT RE, P56 MOLLER S, 2006, P 9 INT C SPOK LANG, P1786 Moller S, 2005, SPEECH COMMUN, V48, P1 Oulasvirta A., 2006, P 2 ISCA DEGA TUT RE, P61 RAJMAN M, 2003, P ISCA TUT RES WORKS, P126 Rix AW, 2006, IEEE T AUDIO SPEECH, V14, P1890, DOI 10.1109/TASL.2006.883260 SIMPSON A, 1993, P 3 EUR C SPEECH COM, V2, P1423 Sutton S., 1998, P INT C SPOK LANG PR, V7, P3221 TRUTNEV A, 2004, P 4 LREC, P611 WALKER M, 2000, P 2 INT C LANG RES E, V1, P189 Walker M. A., 2000, NAT LANG ENG, V6, P363, DOI 10.1017/S1351324900002503 Walker MA, 1998, COMPUT SPEECH LANG, V12, P317, DOI 10.1006/csla.1998.0110 Walker Marilyn A, 1997, P 35 ANN M ASS COMP, P271 NR 34 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
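Among the model families compared in the Moeller et al. record above, linear regression from interaction parameters to user judgments is the simplest. A minimal sketch with invented parameters and ratings, assuming numpy is available:

import numpy as np

# Hypothetical per-dialog interaction parameters: [number of turns, ASR word error rate, task success]
X = np.array([[12, 0.10, 1], [25, 0.35, 0], [15, 0.20, 1], [30, 0.40, 0], [10, 0.05, 1]], dtype=float)
y = np.array([4.5, 2.0, 4.0, 1.5, 5.0])  # invented user judgments on a 1-5 scale

# Ordinary least squares with an intercept column.
A = np.hstack([np.ones((len(X), 1)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
print("R^2 on training data:", 1 - ss_res / ss_tot)

As the record notes, the variance covered on training data says little about generalization: the honest test is prediction on dialogs from a different system or user group.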
PD AUG-SEP PY 2008 VL 50 IS 8-9 BP 730 EP 744 DI 10.1016/j.specom.2008.03.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 342WJ UT WOS:000258814000008 ER PT J AU Lopez-Cozar, R Callejas, Z AF Lopez-Cozar, Ramon Callejas, Zoraida TI ASR post-correction for spoken dialogue systems based on semantic, syntactic, lexical and contextual information SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 9th International Conference on Spoken Language Processing/INTERSPEECH 2006 CY 2006 CL Pittsburgh, PA SP Int Speech Commun Assoc DE spoken dialogue systems; speech recognition; speech understanding; natural language processing; speech-based human-computer interaction ID SPEECH AB This paper proposes a technique to correct speech recognition errors in spoken dialogue systems that presents two main novel contributions. On the one hand, it considers several contexts where a speech recognition result can be corrected. A threshold learnt during training is used to decide whether the correction must be carried out in the context associated with the current prompt type of a dialogue system, or in another context. On the other hand, the technique deals with the confidence scores of the words employed in the corrections. The correction is carried out at two levels: statistical and linguistic. At the first level the technique employs syntactic-semantic and lexical models, both contextual, to decide whether a recognition result is correct. According to this decision the recognition result may be changed. At the second level the technique employs basic linguistic knowledge to decide about the grammatical correctness of the outcome of the first level. According to this decision the outcome may be changed as well. Experimental results indicate that the technique enhances a dialogue system's word accuracy, speech understanding, implicit recovery and task completion rates by 8.5%, 16.54%, 4% and 44.17%, respectively. (c) 2008 Elsevier B.V. All rights reserved. C1 [Lopez-Cozar, Ramon; Callejas, Zoraida] Univ Granada, Fac Comp Sci, Dept Languages & Comp Syst, E-18071 Granada, Spain. RP Lopez-Cozar, R (reprint author), Univ Granada, Fac Comp Sci, Dept Languages & Comp Syst, E-18071 Granada, Spain. EM rlopezc@ugr.es; zoraida@ugr.es RI Prieto, Ignacio/B-5361-2013; Lopez-Cozar, Ramon/A-7686-2012; Callejas Carrion, Zoraida/C-6851-2012 OI Lopez-Cozar, Ramon/0000-0003-2078-495X; Callejas Carrion, Zoraida/0000-0001-8891-5237 CR Allen J, 1995, NATURAL LANGUAGE UND Billi R, 1997, SPEECH COMMUN, V23, P83, DOI 10.1016/S0167-6393(97)00041-1 Crestani F., 2000, P 4 INT C FLEX QUER, P267 DANIELI M, 1995, AAAI SPRING S EMP ME, P34 DENDA Y, 2007, P INTERSPEECH, P222 Fiscus J.
G., 1993, P INT C AC SPEECH SI, P59 Hazen TJ, 2002, COMPUT SPEECH LANG, V16, P49, DOI 10.1006/csla.2001.0183 HUANG X, 2001, ALGORITHM SYSTEM DEV, P1 JEONG M, 1996, P ICSLP, P897 JEONG M, 2004, P INTERSPEECH, P2137 Kaki S., 1998, P COLING ACL, P653 KAKUTANI N, 2002, P ICSLP, P833 Kellner A, 1997, SPEECH COMMUN, V23, P95, DOI 10.1016/S0167-6393(97)00036-8 KRAISS KF, 2006, ADV MAN MACHINE INTE Lee CH, 2000, SPEECH COMMUN, V31, P309, DOI 10.1016/S0167-6393(99)00064-3 Levow GA, 2002, SPEECH COMMUN, V36, P147, DOI 10.1016/S0167-6393(01)00031-0 LEVOW GA, 1998, P 36 ANN M ASS COMP, P736 LO WK, 2005, P IEEE INT C AC SPEE, P85 LOPEZCOZAR R, 2000, P 1 SPAN WORKSH SPEE Lopez-Cozar R., 1997, P EUROSPEECH Lopez-Cozar R, 2003, SPEECH COMMUN, V40, P387, DOI [10.1016/S0167-6393(02)00126-7, 10.1016/S0167-6393(02)00126-7] LOPEZCOZAR R, 2005, DEV ASSESSMENT LOPEZCOZAR R, 2005, COMPUTER SPEECH LANG, V20, P420, DOI DOI 10.1016/J.CSL.2005.05.003 LOPEZCOZAR R, 1998, P 1 LANG RES EV C, P55 Lopez-Cozar R, 2006, ARTIF INTELL REV, V26, P291, DOI 10.1007/s10462-007-9059-9 MANGU L, 2001, P ICASSP, P29 McTear M., 2004, SPOKEN DIALOGUE TECH MORALES N, 2007, P ICSLP, P930 Nakano N., 2001, P EUR, P1331 Ogata J., 2005, P EUR 2005, P133 Rabiner L, 1993, FUNDAMENTALS SPEECH Ringger E. K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607746 Seneff S., 2000, P ANLP NAACL 2000 SA, P1 SETO S, 1994, SPEECH COMMUN, V15, P341, DOI 10.1016/0167-6393(94)90084-1 SHI Y, 2006, P INTERSPEECH, P1089 Skantze G, 2005, SPEECH COMMUN, V45, P325, DOI 10.1016/j.specom.2004.11.005 Suhm B., 2001, ACM Transactions on Computer-Human Interaction, V8, DOI 10.1145/371127.371166 SWERTS M, 2000, P 6 INT C SPOK LANG, P615 Wahlster W., 2006, SMARTKOM FDN MULTIMO WARD W, 1996, P ICASSP, P416 ZHOU Z, 2006, P ICSLP, P1646 ZHOU Zhengyu, 2004, P ICSLP, P449 Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 43 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG-SEP PY 2008 VL 50 IS 8-9 BP 745 EP 766 DI 10.1016/j.specom.2008.03.008 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 342WJ UT WOS:000258814000009 ER PT J AU Taniguchi, T Tohyama, M Shirai, K AF Taniguchi, Toru Tohyama, Mikio Shirai, Katsuhiko TI Detection of speech and music based on spectral tracking SO SPEECH COMMUNICATION LA English DT Article DE speech detection; speech and music discrimination; sinusoidal model; spectral tracking; sinusoidal trajectory ID REPRESENTATION; ENVELOPE AB How to deal with sounds that include spectrally and temporally complex signals such as speech and music remains a problem in real-world audio information processing. We have devised (1) a classification method based on sinusoidal trajectories for speech and music and (2) a detection method based on (1) for speech with background music. Sinusoidal trajectories represent the temporal characteristics of each category of sounds such as speech, singing voice and musical instrument. From the trajectories, 20 temporal features are extracted and used to classify sound segments into the categories by using statistical classifiers. The average F-measure of the classification of non-mixed sounds was 0.939, which might be sufficiently high to apply to subsequent detection of sound categories in a mixed sound.
To handle the temporal overlapping of sounds, we also developed an optimal spectral tracking algorithm with low computational complexity; it is based on dynamic programming (DP) with iterative improvement for the sinusoidal decomposition of signals. The classification and detection of a temporal mixture of speech and music are performed by a statistical integration of the temporal features of their trajectories and the optimization of the combination of their categories. The detection method was experimentally evaluated using 400 samples of mixed sounds, and the average of the narrow-band correlation coefficients and improvement in the segmental signal-to-noise ratio (SNR) were 0.55 and +5.67 dB, respectively, which show the effectiveness of the proposed detection method. (c) 2008 Elsevier B.V. All rights reserved. C1 [Taniguchi, Toru; Shirai, Katsuhiko] Waseda Univ, Dept Comp Sci & Engn, Shinjuku Ku, Tokyo 1698555, Japan. [Tohyama, Mikio] Waseda Univ, Global Informat & Telecommun Inst, Honjo, Saitama 3670035, Japan. RP Taniguchi, T (reprint author), Waseda Univ, Dept Comp Sci & Engn, Shinjuku Ku, 3-4-1 Okubo, Tokyo 1698555, Japan. EM ttani@ieee.org; m_tohyama@waseda.jp; shirai@shirai.cs.waseda.ac.jp CR Abe T, 2006, IEEE T AUDIO SPEECH, V14, P1292, DOI 10.1109/TSA.2005.858545 ABE T, 1996, P ICSLP, V2, P1277, DOI 10.1109/ICSLP.1996.607843 Bregman AS., 1990, AUDITORY SCENE ANAL CHOU W, 2001, P ICASSP, V2, P865 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Depalle P., 1993, IEEE ICASSP APR, V1, P225, DOI 10.1109/ICASSP.1993.319096 DRULLMAN R, 1995, J ACOUST SOC AM, V97, P585, DOI 10.1121/1.413112 Goto M., 2003, P 4 INT C MUS INF RE, P229 Goto M., 2002, P 3 INT C MUS INF RE, P287 Hogg RV, 1987, ENG STAT Itou K., 1999, Journal of the Acoustical Society of Japan (E), V20 Jackson J.E, 1991, USERS GUIDE PRINCIPA, P592 Kazama M, 2003, J AUDIO ENG SOC, V51, P123 KIM H, 2004, 25 INT AUD ENG SOC C Maekawa K, 2000, P LREC2000, V2, P947 Marks S. K., 2005, Proceedings of the IASTED International Conference on Internet and Multimedia Systems and Applications MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 MELIH K, 1999, P 5 INT S SIGN PROC, V1, P51 MELIH K, 2000, IEEE INT C MULT EXP, V2, P811 MOORE BCJ, 2004, INTRO PSYCHOL HEARIN, P269 NAWAB SH, 1998, COMPUTATIONAL AUDITO, P177 Plante F., 1995, EUROSPEECH, P837 RABINER L, 1978, DIGITAL PROCESSING SP, P274 SAKAKIBARA KI, 1998, TECHNICAL REPORT IEI, P1 SAUNDERS J, 1996, P INT C AC SPEECH SI, V2, P993 Scheirer E, 1997, P ICASSP 97, V2, P1331, DOI 10.1109/ICASSP.1997.596192 TAKEUCHI S, 2001, CONS REL AC CUES SOU TANIGUCHI T, 2006, IEEE INT S SIGN PROC, P300 TANIGUCHI T, 2005, P INTERSPEECH2005, P589 TORKKOLA K, 1999, P INT WORKSH IND COM Virtanen T., 2003, P INT COMP MUS C, P231 VIRTANEN T, 2000, ICASSP2000, V2, P765 Xiong Ziyou, 2003, P ICME, P397 NR 33 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
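A reduced sketch of the dynamic-programming idea behind the spectral tracking in the Taniguchi et al. record above: link one spectral peak per frame so that the cumulative frequency jump is minimal. The paper tracks many simultaneous sinusoidal trajectories with iterative improvement; this single-trajectory version only illustrates the DP recursion.

import numpy as np

def track_one_trajectory(peaks, jump_cost=1.0):
    # Link one spectral peak per frame into a trajectory by dynamic programming.
    # peaks[t] is an array of candidate peak frequencies (Hz) in frame t;
    # the transition cost is the absolute frequency jump between frames.
    T = len(peaks)
    cost = [np.zeros(len(peaks[0]))]
    back = []
    for t in range(1, T):
        prev, cur = cost[-1], np.empty(len(peaks[t]))
        choice = np.empty(len(peaks[t]), dtype=int)
        for j, f in enumerate(peaks[t]):
            trans = prev + jump_cost * np.abs(np.asarray(peaks[t - 1]) - f)
            choice[j] = int(np.argmin(trans))
            cur[j] = trans[choice[j]]
        cost.append(cur)
        back.append(choice)
    # Trace back the cheapest path from the last frame.
    j = int(np.argmin(cost[-1]))
    path = [j]
    for choice in reversed(back):
        j = int(choice[j])
        path.append(j)
    path.reverse()
    return [peaks[t][path[t]] for t in range(T)]

frames = [np.array([220.0, 880.0]), np.array([225.0, 870.0]), np.array([900.0, 230.0])]
print(track_one_trajectory(frames))  # follows the ~220 Hz peak: [220.0, 225.0, 230.0]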
PD JUL PY 2008 VL 50 IS 7 BP 547 EP 563 DI 10.1016/j.specom.2008.03.007 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333TN UT WOS:000258175400001 ER PT J AU Ghorshi, S Vaseghi, S Yan, Q AF Ghorshi, Seyed Vaseghi, Saeed Yan, Qin TI Cross-entropic comparison of formants of British, Australian and American English accents SO SPEECH COMMUNICATION LA English DT Article DE accents; cross-entropy; formant features; cepstrum features; formant space ID FOREIGN ACCENT; PREDICTION; SPEECH AB This paper highlights the differences in spectral features between British, Australian and American English accents and applies the cross-entropy information measure for comparative quantification of the impacts of the variations of accents, speaker groups and recordings on the probability models of spectral features of phonetic units of speech. Comparison of the cross-entropies of formants and cepstrum features indicates that formants are a better indicator of accents. In particular, it appears that the measurements of differences in formants across accents are less sensitive to different recordings or databases compared to cepstrum features. It is found that the cross-entropies of the same phonemes across speaker groups with different accents (inter-accent distances) are significantly greater than the cross-entropies of the same phonemes across speaker groups of the same accent (intra-accent distances). Comparative evaluations presented on cross-gender speech recognition show that accent differences have an impact comparable to gender differences. The cross-entropy measure is also used to construct cross-accent phonetic-trees, which serve to show the structural similarities and differences of the phonetic systems across accents. (c) 2008 Elsevier B.V. All rights reserved. C1 [Ghorshi, Seyed; Vaseghi, Saeed; Yan, Qin] Brunel Univ, Sch Engn & Design, London UB8 3PH, England. RP Ghorshi, S (reprint author), Brunel Univ, Sch Engn & Design, London UB8 3PH, England. EM Seyed.Ghorshi@brunel.ac.uk; aghorshi@gmail.com; Saeed.Vaseghi@brunel.ac.uk CR Arslan LM, 1997, J ACOUST SOC AM, V102, P28, DOI 10.1121/1.419608 Boyce S, 1997, J ACOUST SOC AM, V101, P3741, DOI 10.1121/1.418333 CHILDERS DG, 1991, J ACOUST SOC AM, V90, P1841, DOI 10.1121/1.401664 Cruttenden Alan, 1997, INTONATION, V2nd Crystal D., 2003, DICT LINGUISTICS PHO Darch J, 2006, SPEECH COMMUN, V48, P1556, DOI 10.1016/j.specom.2006.06.001 Darch J., 2007, INTERSPEECH, P542 Deller J. R., 1993, DISCRETE TIME PROCES de Mareuil PB, 2006, PHONETICA, V63, P247, DOI 10.1159/000097308 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 FLETCHER J, 2004, PROSODIC TYPOLOGY TR Grabe Esther, 2001, P PROS 2000, P51 HANSEN JHL, 2004, INTERSPEECH, P1569 Harrington J., 1997, AUSTR J LINGUISTICS, V17, P155, DOI DOI 10.1080/07268609708599550 HO CH, 2001, THESIS BRUNEL U HUCKVALE M, 2004, ACCDIST METRIC COMPA HUMPHRIES J, 1997, THESIS CAMBRIDGE U E Ikeno A, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P437 JAYNES ET, 1982, P IEEE, V70, P939, DOI 10.1109/PROC.1982.12425 KIM C, 2001, P INT C SPEECH PROC, P447 Kohler J, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2195 LABOV W, 1994, INTERNAL FEATURES, V1 Labov W., 2006, ATLAS N AM ENGLISH Ladd D.
R., 1996, INTONATIONAL PHONOLO MILLER CA, 1998, THESIS U PENNSYLVANI Mitchell A, 1965, SPEECH AUSTR ADOLESC NAGY N, 2007, HDB VARIETIES ENGLIS, V2 NOLAN F, 1997, ESCA TUT RES WORKSH, P259 PRZEWOZNY A, 2004, VARIATION AUSTR ENGL, P74 Rabiner L.R., 1978, DIGITAL PROCESSING S SHORE JE, 1981, IEEE T INFORM THEORY, V27, P472, DOI 10.1109/TIT.1981.1056373 Snell RC, 1993, IEEE T SPEECH AUDI P, V1, P129, DOI 10.1109/89.222882 TENBOSCH L, 2000, ICSLP, P1009 TRUBETZKOY NS, 1931, TRAV CERCL LING PRAG, P228 Van Bezooijen R, 1999, J LANG SOC PSYCHOL, V18, P31, DOI 10.1177/0261927X99018001003 Vergin R., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607793 WATSON C, 1996, AUSTR J LINGUIST WEBER K, 2001, P EUR, P607 Wells John, 1982, ACCENTS ENGLISH Woehrling C, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1511 YAN Q, 2005, THESIS BRUNEL U Yan Q, 2007, COMPUT SPEECH LANG, V21, P543, DOI 10.1016/j.csl.2006.11.001 Yan Q, 2002, INT CONF ACOUST SPEE, P413 Yan Q, 2003, ASRU'03: 2003 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING ASRU '03, P345 Young Steve, 2002, HTK BOOK VERSION 3 2 ZWICKER E, 1957, J ACOUST SOC AM, V29, P548, DOI 10.1121/1.1908963 MACQUARIE DICT CMU DICT BEEP DICT NR 49 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2008 VL 50 IS 7 BP 564 EP 579 DI 10.1016/j.specom.2008.03.013 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333TN UT WOS:000258175400002 ER PT J AU Sarikaya, R AF Sarikaya, Ruhi TI Rapid bootstrapping of statistical spoken dialogue systems SO SPEECH COMMUNICATION LA English DT Article DE spoken dialog systems; Semantic annotation; Web-based language modeling; Rapid deployment AB Rapid deployment of statistical spoken dialogue systems poses portability challenges for building new applications. We discuss the challenges that arise and focus on two main problems: (i) fast semantic annotation for statistical speech understanding and (ii) reliable and efficient statistical language modeling using limited in-domain resources. We address the first problem by presenting a new boot-strapping framework that uses a majority-voting based combination of three methods for the semantic annotation of a "mini-corpus" that is usually manually annotated. The three methods are a statistical decision tree based parser, a similarity measure and a support vector machine classifier. The bootstrapping framework results in an overall cost reduction of about a factor of two in the annotation effort compared to the baseline method. We address the second problem by devising a method to efficiently build reliable statistical language models for new spoken dialog systems, given limited in-domain data. This method exploits external text resources that are collected for other speech recognition tasks as well as dynamic text resources acquired from the World Wide Web. The proposed method is applied to a spoken dialog system in a financial transaction domain and a natural language call-routing task in a package shipment domain. The experiments demonstrate that language models built using external resources, when used jointly with the limited in-domain language model, result in relative word error rate reductions of 9-18%. 
Alternatively, the proposed method can be used to produce a 3-to-10 fold reduction for the in-domain data requirement to achieve a given performance level. (c) 2008 Elsevier B.V. All rights reserved. C1 IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA. RP Sarikaya, R (reprint author), IBM Corp, Thomas J Watson Res Ctr, 1101 Kitchawan Rd Route 134, Yorktown Hts, NY 10598 USA. EM sarikaya@us.ibm.com CR Abney S., 1996, CORPUS BASED METHODS BECHET F, 2004, EMPIRICAL METHODS NA BERGER A, 2001, P INT C AC SPEECH SI, P705 BERTOLDI N, 2001, P INT C AC SPEECH SI, P37 BULYKO I, 2003, GETTING MORE MILEAGE CHEN S, 1996, EMPIRICAL STUDY SMOO Chen S., 2001, IEEE T SAP, V8, P37 CHUNG G, 2005, SPECIAL INTEREST GRO Cristianini N., 2000, INTRO SUPPORT VECTOR DAVIES K, 1999, P EUR C SPEECH TECHN FABBRIZIO GD, 2004, P SIGDIAL CAMBR MA Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347 FOSLERLUSSIER E, 2001, P ICASSP SALT LAK CI GAO Y, 2006, ICASSP GOEL V, 2005, P ICASSP PHIL PA GULLI A, 2005, P WWW 2005 CHIB JAP HACIOGLU K, 2003, TARGET WORD DETECTIO HARDY H, 2004, DATA DRIVEN STRATEGI JURAFSKY D, 1994, P INT C SPOK LANG PR, P2139 KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 KINGSBURY B, EUROSPEECH KNESER R, 1995, INT CONF ACOUST SPEE, P181, DOI 10.1109/ICASSP.1995.479394 Kudoh T., 2000, P CONLL 2000 LLL 200, P142 LAPATA M, 2004, WEB BASELINE EVALUAT, P121 Lefevre F, 2005, COMPUT SPEECH LANG, V19, P345, DOI 10.1016/j.csl.2004.11.001 Levin E., 2000, IEEE T SPEECH AUDIO, V8 Magerman David M., 1994, THESIS STANFORD U MCCANDLESS M, 1993, EUROSPEECH Meng HM, 2002, IEEE T KNOWL DATA EN, V14, P172, DOI 10.1109/69.979980 Papineni K., 2002, BLEU METHOD AUTOMATI Pietra S. D., 1997, IEEE T PATTERN ANAL, V19, P380 ROSENFELD R, 2001, P IEEE, V88 RUDNICKY A, 1995, P ARPA SPOK LANG TEC, P66 SARIKAYA R, 2005, INTERSPEECH 2005 SARIKAYA R, 2005, ICASSP SARIKAYA R, 2005, IEEE ASRU WORKSH SAN SARIKAYA R, 2004, ICSLP SENEFF S, 1992, INT C SPOK LANG PROC, P317 TUR G, 2005, SPEECH COMMUN, V45, P175 Vapnik V., 1995, NATURE STAT LEARNING Zhu XJ, 2001, INT CONF ACOUST SPEE, P533 NR 41 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2008 VL 50 IS 7 BP 580 EP 593 DI 10.1016/j.specom.2008.03.011 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333TN UT WOS:000258175400003 ER PT J AU Akdemir, E Ciloglu, T AF Akdemir, Eren Ciloglu, Tolga TI The use of articulator motion information in automatic speech segmentation SO SPEECH COMMUNICATION LA English DT Article DE automatic speech segmentation; articulator motion; lip motion; text-to-speech ID VOWEL RECOGNITION; MODELS; INTEGRATION AB The use of articulator motion information in automatic speech segmentation is investigated. Automatic speech segmentation is an essential task in speech processing applications like speech synthesis where accuracy and consistency of segmentation are firmly connected to the quality of synthetic speech. The motions of upper and lower lips are incorporated into a hidden Markov model based segmentation process. The MOCHA-TIMIT database, which involves simultaneous articulatograph and microphone recordings, was used to develop and test the models. Different feature vector compositions are proposed for incorporation of articulator motion parameters to the automatic segmentation system. 
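The language-modeling side of the Sarikaya record above rests on interpolating a limited in-domain model with models built from external and web text. A toy unigram version, with the interpolation weight tuned by held-out perplexity (all texts and counts invented for illustration):

import math
from collections import Counter

def unigram(counts, vocab_size):
    # Add-one-smoothed unigram probability from a Counter of word counts.
    total = sum(counts.values())
    return lambda w: (counts[w] + 1) / (total + vocab_size)

def perplexity(prob, words):
    return math.exp(-sum(math.log(prob(w)) for w in words) / len(words))

in_domain = Counter("check my account balance please".split())
web = Counter("the bank account was closed after the transfer".split())
vocab = set(in_domain) | set(web)

p_in = unigram(in_domain, len(vocab))
p_web = unigram(web, len(vocab))
held_out = "please check the transfer".split()

# Tune the linear interpolation weight on held-out perplexity:
# P(w) = lambda * P_in(w) + (1 - lambda) * P_web(w)
best = min((perplexity(lambda w: lam * p_in(w) + (1 - lam) * p_web(w), held_out), lam)
           for lam in [i / 10 for i in range(11)])
print("best lambda:", best[1], "perplexity:", round(best[0], 2))

In practice the interpolation is done over n-gram models rather than unigrams, but the weight-tuning loop is the same idea.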
The average absolute boundary error of the system with respect to manual segmentation is decreased by 10.1%. The results are examined in a boundary class dependent manner using both acoustic and visual phone classes, and the performance of the system in different boundary types is discussed. After analyzing the boundary class dependent performance, the error reduction is increased to 18.0% by using the appropriate feature vectors in selected boundaries. (c) 2008 Elsevier B.V. All rights reserved. C1 [Akdemir, Eren; Ciloglu, Tolga] Middle E Tech Univ, Dept Elect & Elect Engn, TR-06531 Ankara, Turkey. RP Akdemir, E (reprint author), Middle E Tech Univ, Dept Elect & Elect Engn, TR-06531 Ankara, Turkey. EM akdemir@metu.edu.tr CR ADJOUDANI A, 1996, NATO ASI SER, P461 BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W Chen T, 1998, P IEEE, V86, P837 Duchnowski P., 1994, P INT C SPOK LANG PR, P547 Fagel S, 2004, SPEECH COMMUN, V44, P141, DOI 10.1016/j.specom.2004.10.006 FISCUS JG, 1997, P 1997 IEEE WORKSH A KAWAI H, 2004, AC P ICASSP 04 IEEE Kaynak MN, 2004, IEEE T SYST MAN CY A, V34, P564, DOI 10.1109/TSMCA.2004.826274 MAK MW, 1994, SPEECH COMMUN, V14, P279, DOI 10.1016/0167-6393(94)90067-1 MAKASHAY MJ, 2000, P INT C SPOK LANG PR, P431 Malfrere F, 2003, SPEECH COMMUN, V40, P503, DOI 10.1016/S0167-6393(02)00131-0 MATOUSEK J, 2003, EUROSPEECH 2003 MUNHALL KG, 1995, J ACOUST SOC AM, V98, P1222, DOI 10.1121/1.413621 Neti C., 2000, FIN WORKSH 2000 REP Petajan E. D., 1984, THESIS U ILLINOIS Rabiner L, 1993, FUNDAMENTALS SPEECH Schwartz J. L., 1998, HEARING EYE, P85 SETHY A, 2002, P ICSLP, P145 STORK DG, 1992, P IJCNN 92, V2, P285 Summerfield Q, 1987, HEARING EYE PSYCHOL, P3 Teissier P, 1999, IEEE T SPEECH AUDI P, V7, P629, DOI 10.1109/89.799688 TOLEDANO DT, 2003, IEEE T SPEECH AUDIO, V11, P1 YEONJUN K, 2002, ICSLP 2002, P145 Young S., 2002, HTK BOOK HTK VERSION YUHAS BP, 1990, P IEEE, V78, P1658, DOI 10.1109/5.58349 NR 25 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2008 VL 50 IS 7 BP 594 EP 604 DI 10.1016/j.specom.2008.04.005 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333TN UT WOS:000258175400004 ER PT J AU Liu, LQ Zheng, TF Wu, WH AF Liu, Linquan Zheng, Thomas Fang Wu, Wenhu TI State-dependent phoneme-based model merging for dialectal Chinese speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; dialectal Chinese; state-dependent phoneme-based model merging; acoustic modeling; pronunciation modeling; acoustic model distance measure; a small amount of data ID NONNATIVE SPEECH; ADAPTATION AB This paper discusses and evaluates a novel but simple and effective acoustic modeling method called "state-dependent phoneme-based model merging (SDPBMM)", used to build a dialectal Chinese speech recognizer from a small amount of dialectal Chinese speech. In SDPBMM, state-level pronunciation modeling is done by merging a tied-state of standard triphones with a state of dialectal monophone(s). In state-level pronunciation modeling, which acts as the merging criterion for SDPBMM, sparseness arises due to the limited data set. To overcome this problem, a distance-based pronunciation modeling approach is also proposed.
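Two ingredients of the Akdemir and Ciloglu record above can be sketched directly: composing per-frame feature vectors from acoustic and lip-motion parameters, and scoring segmentation by the average absolute difference between automatic and manual boundary times. The dimensions and normalization below are assumptions for illustration, not the paper's exact feature design.

import numpy as np

def compose_features(mfcc, lips):
    # Concatenate per-frame acoustic and articulatory features after
    # z-normalization. mfcc: (T, 13) acoustic frames; lips: (T, 4)
    # upper/lower-lip motion parameters (invented sizes).
    z = lambda m: (m - m.mean(axis=0)) / (m.std(axis=0) + 1e-8)
    return np.hstack([z(mfcc), z(lips)])

def mean_abs_boundary_error(auto_ms, manual_ms):
    # Average absolute difference between automatic and manual boundary times.
    return float(np.mean(np.abs(np.asarray(auto_ms) - np.asarray(manual_ms))))

feats = compose_features(np.random.randn(100, 13), np.random.randn(100, 4))
print(feats.shape)  # (100, 17)
print(mean_abs_boundary_error([120, 340, 610], [110, 355, 600]))  # about 11.7 ms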
With 40 min of Shanghai-dialectal Chinese speech data, SDPBMM achieves a significant absolute syllable error rate (SER) reduction of approximately 7.1% (and a relative SER reduction of 14.3%) for Shanghai-dialectal Chinese, without performance degradation for standard Chinese. It is experimentally shown that SDPBMM outperforms Maximum Likelihood Linear Regression (MLLR) adaptation and the Pooled Retraining methods by 1.4% and 5.3%, respectively, in terms of SER reduction. Also, when combined with MLLR adaptation, an absolute SER reduction of 1.4% can further be achieved by SDPBMM. (c) 2008 Elsevier B.V. All rights reserved. C1 [Liu, Linquan; Zheng, Thomas Fang; Wu, Wenhu] Tsinghua Univ, Ctr Speech & Language Technol, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China. RP Zheng, TF (reprint author), Tsinghua Univ, Ctr Speech & Language Technol, Tsinghua Natl Lab Informat Sci & Technol, Beijing 100084, Peoples R China. EM liulq@cslt.riit.tsinghua.edu.cn; fzheng@tsinghua.edu.cn; wuwh@tsinghua.edu.cn CR Angkititrakul P, 2006, IEEE T AUDIO SPEECH, V14, P634, DOI 10.1109/TSA.2005.851980 CHEN T, 2001, P ASRU DIAKOLOUKAS V, 1997, P ICASSP Fung P, 2005, J ACOUST SOC AM, V118, P3279, DOI 10.1121/1.2035588 Gao J., 2002, ACM T ASIAN LANGUAGE, V1, P3, DOI 10.1145/595576.595578 Goronzy S, 2004, SPEECH COMMUN, V42, P109, DOI 10.1016/j.specom.2003.09.003 GRUHN R, 2004, P ICSLP He XD, 2003, IEEE T SPEECH AUDI P, V11, P298, DOI 10.1109/TSA.2003.814379 Huang C., 2004, INT J SPEECH TECHNOL, V7, P141, DOI 10.1023/B:IJST.0000017014.52972.1d HUANG R, 2005, P INT C AC SPEECH SI, P585 Huang X., 2001, SPOKEN LANGUAGE PROC Hwang MY, 1996, IEEE T SPEECH AUDI P, V4, P412 LI AJ, 2003, P EUROSPEECH LI J, 2003, P OR COCOSDA SENT SI, P62 Li J, 2006, J COMPUT SCI TECH-CH, V21, P106, DOI 10.1007/s11390-006-0106-9 Liu Y., 2004, INT J SPEECH TECHNOL, V7, P155, DOI 10.1023/B:IJST.0000017015.63206.9e Livescu K., 1999, THESIS MIT LUSSIER EF, 2003, LECT NOTES COMPUT SC, V2705, P38 MYRVOLL TA, 2003, TELEKTRONIKK, V2, P59 Oh YR, 2007, SPEECH COMMUN, V49, P59, DOI 10.1016/j.specom.2006.10.006 Saraclar M, 2000, COMPUT SPEECH LANG, V14, P137, DOI 10.1006/csla.2000.0140 Sproat R, 2004, DIALECTAL CHINESE SP Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 TJALVE M, 2005, P EUROSPEECH TOMOKIYO L, 2001, THESIS CARNEGIE MELL Wang Z., 2003, P IEEE INT C AC SPEE, P540 XU XT, 2004, IEEE T SPEECH AUDIO, V12, P168 Young S., 2002, HTK BOOK HTK VERSION Zheng F, 2002, J COMPUT SCI TECHNOL, V17, P249, DOI 10.1007/BF02947304 ZHENG YL, 2005, P EUROSPEECH NR 30 TC 2 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2008 VL 50 IS 7 BP 605 EP 615 DI 10.1016/j.specom.2008.04.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333TN UT WOS:000258175400005 ER PT J AU Oba, T Hori, T Nakamura, A AF Oba, Takanobu Hori, Takaaki Nakamura, Atsushi TI Sequential dependency analysis for online spontaneous speech processing SO SPEECH COMMUNICATION LA English DT Article DE sequential dependency analysis; sentence boundary detection; spontaneous speech; incomplete sentence AB A dependency structure interprets modification relationships between words and is often recognized as an important element in semantic information analysis.
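The state-merging step of SDPBMM in the Liu et al. record above pairs a tied state of standard triphones with a state of dialectal monophones. One plausible reading, sketched here with invented weights, is to pool the two states' Gaussian mixture components under a global interpolation weight; the paper's actual merging criterion is driven by state-level pronunciation modeling rather than a fixed weight.

def merge_states(gmm_a, gmm_b, weight_a=0.5):
    # Merge two GMM states by pooling their mixture components. Each state is
    # a list of (mixture_weight, mean, variance) tuples; pooled weights are
    # rescaled so the merged state's weights still sum to one.
    merged = [(w * weight_a, m, v) for (w, m, v) in gmm_a]
    merged += [(w * (1.0 - weight_a), m, v) for (w, m, v) in gmm_b]
    return merged

standard_state = [(0.6, [1.0, 2.0], [0.5, 0.5]), (0.4, [1.5, 2.5], [0.4, 0.6])]
dialect_state = [(1.0, [0.8, 2.2], [0.7, 0.5])]
print(merge_states(standard_state, dialect_state, weight_a=0.7))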
With conventional approaches for extracting this dependency structure, it is assumed that the complete sentence is known before the analysis starts. For spontaneous speech data, however, this assumption is not necessarily correct since sentence boundaries are not marked in the data and it is not easy to detect them correctly. Although sentence boundaries can be detected before dependency analysis, this cascaded implementation is not suitable for online processing since it delays the responses of the application. In this paper, we propose a sequential dependency analysis method for online spontaneous speech processing. The proposed method enables us to analyze incomplete sentences sequentially and detect sentence boundaries simultaneously. The analyzer can be trained using parsed data based on the maximum entropy principle. Experimental results using spontaneous lecture speech from the Corpus of Spontaneous Japanese show that our proposed method achieves online processing with an accuracy equivalent to that of offline processing in which boundary detection and dependency analysis are cascaded. (c) 2008 Elsevier B.V. All rights reserved. C1 [Oba, Takanobu; Hori, Takaaki; Nakamura, Atsushi] NTT Corp, NTT Commun Sci Labs, Seika, Kyoto, Japan. RP Oba, T (reprint author), NTT Corp, NTT Commun Sci Labs, 2-4 Hikaridai, Seika, Kyoto, Japan. EM oba@cslab.kecl.ntt.co.jp; hori@cslab.kecl.ntt.co.jp; ats@cslab.kecl.ntt.co.jp CR CMEJREK M, 2003, P EUR CHAP M ACL, V1, P83, DOI 10.3115/1067807.1067820 Collins M., 1996, P 34 ANN M ASS COMP, P184, DOI 10.3115/981863.981888 Crammer Koby, 2005, P 43 ANN M ASS COMP, P91, DOI 10.3115/1219840.1219852 HALL J, 2006, P COLING ACL 2006 MA, P316, DOI 10.3115/1273073.1273114 HORI C, 2002, P IEEE ICASSP, V1, P9 Hori T, 2007, IEEE T AUDIO SPEECH, V15, P1352, DOI 10.1109/TASL.2006.889790 KATO Y, 2005, SYSTEMS COMPUT JAPAN, V36, P84 KUDO T, 2004, 2004NL162 IPSJ SIG, P205 Kudo T., 2002, P 6 C NAT LANG LEARN, P63 LIU DC, 1989, MATH PROGRAM, V45, P503, DOI 10.1007/BF01589116 Maekawa K., 2000, P 2 INT C LANG RES E, V2, P947 McDonald Ryan, 2006, P 11 C EUR CHAPT ASS, P81 MORI S, 2000, P INT C COMP LING AC, V1, P558, DOI 10.3115/990820.990901 Nivre J., 2005, P 43 ANN M ASS COMP, P99, DOI 10.3115/1219840.1219853 Nivre J., 2006, P 10 C COMP NAT LANG, P221, DOI 10.3115/1596276.1596318 NIVRE J, 2002, 02118 MSI OHNO T, 2005, P 9 EUROSPEECH, P3449 SCHIEHLEN M, 2007, P EMNLP CONLL, P1156 SEKINE S, 2000, P INT C COMP LING AC, V2, P754, DOI 10.3115/992730.992755 SHITAOKA K, 2004, P INTERSPEECH 2004, P1353 Yamada H., 2003, P 8 INT WORKSH PARS, P195 YUAN JL, 2005, INT WORLD WID C, P926 NR 22 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2008 VL 50 IS 7 BP 616 EP 625 DI 10.1016/j.specom.2008.04.008 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333TN UT WOS:000258175400006 ER PT J AU Lu, Y Loizou, PC AF Lu, Yang Loizou, Philipos C. TI A geometric approach to spectral subtraction SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; spectral subtraction; musical noise ID SPEECH ENHANCEMENT; NOISE; RECOGNITION; SUPPRESSION AB The traditional power spectral subtraction algorithm is computationally simple to implement but suffers from musical noise distortion. In addition, the subtractive rules are based on incorrect assumptions about the cross terms being zero.
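The online behaviour described in the Oba et al. record above, analyzing incomplete sentences while detecting boundaries, can be caricatured in a few lines: each incoming word is attached immediately and a boundary decision is taken at every step, so no sentence-final context is required. The naive preceding-word attachment and the boundary test below merely stand in for the paper's maximum-entropy models.

def process_stream(words, is_boundary):
    # Online sequential analysis: words of an unfinished sentence are attached
    # as they arrive (here, naively to the preceding word), and a boundary
    # decision is taken at every step. is_boundary stands in for the paper's
    # maximum-entropy boundary model.
    sentence, deps, out = [], [], []
    for word in words:
        if sentence:
            deps.append((sentence[-1], word))   # naive head: preceding word
        sentence.append(word)
        if is_boundary(sentence):               # boundary detected online
            out.append({"words": sentence, "deps": deps})
            sentence, deps = [], []
    if sentence:                                # flush an incomplete sentence
        out.append({"words": sentence, "deps": deps})
    return out

for s in process_stream("well i think so yes exactly".split(),
                        is_boundary=lambda s: s[-1] in {"so", "exactly"}):
    print(s)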
A new geometric approach to spectral subtraction is proposed in the present paper that addresses these shortcomings of the spectral subtraction algorithm. A method for estimating the cross terms involving the phase differences between the noisy (and clean) signals and noise is proposed. Analysis of the gain function of the proposed algorithm indicated that it possesses similar properties to the traditional MMSE algorithm. Objective evaluation of the proposed algorithm showed that it performed significantly better than the traditional spectral subtractive algorithm. Informal listening tests revealed that the proposed algorithm had no audible musical noise. (C) 2008 Elsevier B.V. All rights reserved. C1 [Lu, Yang; Loizou, Philipos C.] Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA. RP Loizou, PC (reprint author), Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA. EM loizou@utdallas.edu CR Berouti M., 1979, P IEEE INT C AC SPEE, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cappe O., 1994, IEEE T SPEECH AUDIO, V2, P346 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 EVANS N, 2006, P IEEE INT C AC SPEE, V1, P145 HIRSCH H, 2000, P ISCA ITRW ASR200 Hu Y, 2007, SPEECH COMMUN, V49, P588, DOI 10.1016/j.specom.2006.12.006 Hu Y, 2008, IEEE T AUDIO SPEECH, V16, P229, DOI 10.1109/TASL.2007.911054 HU Y, 2006, P INT, P1447 Kamath S., 2002, P IEEE INT C AC SPEE KITAOKA N, 2002, P INT C SPOK LANG PR, P477 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Papoulis A., 2002, PROBABILITY RANDOM V VARY P, 1985, SIGNAL PROCESS, V8, P387, DOI 10.1016/0165-1684(85)90002-7 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 WEISS M, 1974, NSCFR4023 Wolfe P. J., 2001, Proceedings of the 11th IEEE Signal Processing Workshop on Statistical Signal Processing (Cat. No.01TH8563), DOI 10.1109/SSP.2001.955331 Yoma NB, 1998, IEEE T SPEECH AUDI P, V6, P579, DOI 10.1109/89.725325 NR 21 TC 34 Z9 37 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2008 VL 50 IS 6 BP 453 EP 466 DI 10.1016/j.specom.2008.01.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 321HY UT WOS:000257296600001 ER PT J AU Hsieh, CH Wu, CH AF Hsieh, Chia-Hsin Wu, Chung-Hsien TI Stochastic vector mapping-based feature enhancement using prior-models and model adaptation for noisy speech recognition SO SPEECH COMMUNICATION LA English DT Article DE noisy speech recognition; cepstral feature enhancement; environment model adaptation ID LEIBLER INFORMATION MEASURE; PARAMETER-ESTIMATION AB This paper presents an approach to feature enhancement for noisy speech recognition. Three prior-models are introduced to characterize clean speech, noise and noisy speech, respectively. Sequential noise estimation is employed for prior-model construction based on noise-normalized stochastic vector mapping. Therefore, feature enhancement can work without stereo training data and manual tagging of background noise type based on the auto-clustering on the estimated noise data. Environment model adaptation is also adopted to reduce the mismatch between training data and test data.
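For contrast with the geometric approach in the Lu and Loizou record above, here is the conventional power spectral subtraction baseline it improves on. The max-with-floor step is the usual guard against negative power estimates, and the |X|^2 = |S|^2 + |N|^2 assumption is exactly the zero-cross-term assumption the paper revisits; the oracle noise estimate is for demonstration only.

import numpy as np

def power_spectral_subtraction(noisy, noise_power, floor=0.002):
    # Conventional power spectral subtraction on magnitude-squared spectra,
    # assuming |X|^2 is approximately |S|^2 + |N|^2 (cross terms ignored).
    frame = np.fft.rfft(noisy)
    clean_power = np.abs(frame) ** 2 - noise_power
    clean_power = np.maximum(clean_power, floor * noise_power)  # spectral floor
    enhanced = np.sqrt(clean_power) * np.exp(1j * np.angle(frame))  # keep noisy phase
    return np.fft.irfft(enhanced, n=len(noisy))

rng = np.random.default_rng(0)
t = np.arange(512) / 8000.0
clean = np.sin(2 * np.pi * 440 * t)
noise = 0.3 * rng.standard_normal(512)
noise_power = np.abs(np.fft.rfft(noise)) ** 2   # oracle noise estimate for the demo
print(power_spectral_subtraction(clean + noise, noise_power).shape)  # (512,)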
For the evaluation on the AURORA2 database, the experimental results indicate that a 9.6% relative reduction in digit error rate for multi-condition training and a 3.5% relative reduction in digit error rate for clean speech training were achieved without stereo training data compared to the SPLICE-based approach. For MATBN Mandarin broadcast news database with multi-condition training, a 13% relative reduction in syllable error rate for anchor speech, a 12% relative reduction in syllable error rate for field reporter speech and a 7% relative reduction in syllable error rate for interviewee speech were obtained compared to the MCE-based approach. (C) 2008 Elsevier B.V. All rights reserved. C1 [Hsieh, Chia-Hsin; Wu, Chung-Hsien] Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. RP Wu, CH (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. EM ngsnail@csie.ncku.edu.tw; chwu@csie.ncku.edu.tw RI Wu, Chung-Hsien/E-7970-2013 CR Benveniste A., 1990, ADAPTIVE ALGORITHMS, V22 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Chen HZ, 2002, ELECTRON LETT, V38, P485, DOI 10.1049/el:20020324 Chen YJ, 2002, SPEECH COMMUN, V38, P349, DOI 10.1016/S0167-6393(01)00076-0 DENG L, 2000, P ICSLP, P806 Deng L, 2003, IEEE T SPEECH AUDI P, V11, P568, DOI 10.1109/TSA.2003.818076 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hirsch H. G., 2000, P ISCA ITRW ASR2000, P181 KRISHNAMURTHY V, 1993, IEEE T SIGNAL PROCES, V41, P2557, DOI 10.1109/78.229888 Machiraju VR, 2002, J CARDIAC SURG, V17, P20 Stern R. M., 1996, P ICASSP, V1, P733 Wang J, 2005, CONSTRAINTS, V10, P219, DOI 10.1007/s10601-005-2238-x WEINSTEIN E, 1990, IEEE T ACOUST SPEECH, V38, P1652, DOI 10.1109/29.60089 WU CH, 2004, J VLSI SIGNAL PROC, V36, P87 Wu J., 2002, P ICSLP, P453 Yao KS, 2004, SPEECH COMMUN, V42, P5, DOI 10.1016/j.specom.2003.09.002 NR 18 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2008 VL 50 IS 6 BP 467 EP 475 DI 10.1016/j.specom.2008.02.002 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 321HY UT WOS:000257296600002 ER PT J AU Solvang, HK Ishizuka, K Fujimoto, M AF Solvang, Hiroko Kato Ishizuka, Kentaro Fujimoto, Masakiyo TI Voice activity detection based on adjustable linear prediction and GARCH models SO SPEECH COMMUNICATION LA English DT Article DE voice activity detection; AR-GARCH model; state-space representation; Kalman filter; linear prediction ID HIGHER-ORDER STATISTICS; SPEECH RECOGNITION; NOISE; ALGORITHM; VARIANCE AB We propose a method for voice activity detection (VAD) that employs a class of the Autoregressive-Generalized Autoregressive Conditional Heteroskedasticity (AR-GARCH) model. As regards correlated speech signals, we represent the AR part of the AR-GARCH model with a state-space to obtain the appropriate linear prediction error series. By applying the GARCH model to the residual, we estimate the conditional variance sequences corresponding to the voice activity parts. To detect voice activity, we establish an appropriate threshold for the conditional variance sequences. 
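The Hsieh and Wu record above builds on stochastic vector mapping of the SPLICE family, in which a noisy cepstral vector is corrected by bias vectors weighted by the component posteriors of a GMM over the noisy feature space. A minimal sketch with invented 2-D parameters follows; the paper additionally uses noise normalization, prior-models and environment model adaptation.

import numpy as np

def svm_enhance(y, means, variances, priors, biases):
    # SPLICE-style stochastic vector mapping: x_hat = y + sum_k p(k|y) * r_k,
    # with p(k|y) computed from a diagonal GMM over the noisy feature space.
    # All parameters here are invented 2-D values for illustration.
    ll = -0.5 * np.sum((y - means) ** 2 / variances + np.log(2 * np.pi * variances), axis=1)
    post = priors * np.exp(ll - ll.max())
    post /= post.sum()
    return y + post @ biases

means = np.array([[0.0, 0.0], [3.0, 3.0]])
variances = np.ones((2, 2))
priors = np.array([0.5, 0.5])
biases = np.array([[0.2, -0.1], [-0.5, 0.4]])   # per-component correction vectors
print(svm_enhance(np.array([2.8, 3.1]), means, variances, priors, biases))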
To confirm the performance of our proposed VAD method, we conduct experiments using speech signals with real background noise (signal-to-noise ratios (SNRs) of 10, 5 and 0 dB) of an airport and a street. Furthermore, using receiver operating characteristics curves and equal error rates, we compare our results with those of previous standardized VAD algorithms (ITU-T G.729B, ETSI ES 202 050, and ETSI EN 301 708) as well as recently developed methods (VAD with long-term spectral divergence, likelihood ratio tests, and higher-order statistics for VAD). In terms of the signals with background noise at an SNR of 0 dB, the experimental results show a significant performance improvement compared with standardized VAD algorithms and more than 10% improvement compared with recently developed VAD methods. (C) 2008 Elsevier B.V. All rights reserved. C1 [Solvang, Hiroko Kato] Rikshosp Univ Hosp, Norwegian Radium Hosp, Inst Canc Res, Dept Genet, N-0310 Oslo, Norway. [Solvang, Hiroko Kato] Univ Oslo, Inst Basic Med Sci, Dept Biostat, N-0317 Oslo, Norway. [Ishizuka, Kentaro; Fujimoto, Masakiyo] NTT Corp, NTT Commun Sci Labs, Seika, Kyoto 6190237, Japan. RP Solvang, HK (reprint author), Rikshosp Univ Hosp, Norwegian Radium Hosp, Inst Canc Res, Dept Genet, N-0310 Oslo, Norway. EM hsolvang@rr-research.no RI Ramli, Roziana/E-7157-2010 CR ABDOLAHI M, 2005, P ICASSP, V1, P957, DOI 10.1109/ICASSP.2005.1415274 ABRAMSON A, 2006, P INT WORKSH AC ECH, P1 Akaike H., 1980, Journal of Time Series Analysis, V1, DOI 10.1111/j.1467-9892.1980.tb00296.x AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705 Anderson B. D. O., 1979, OPTIMAL FILTERING [Anonymous], 2005, 202050 ETSI ES [Anonymous], 1999, 301708 ETSI EN Basu S., 2003, P ICASSP, V1, pI BOLLERSLEV T, 1986, J ECONOMETRICS, V31, P307, DOI 10.1016/0304-4076(86)90063-1 COHEN I, 2005, P INT LESB PORT SEP, P2053 ENGLE RF, 1982, ECONOMETRICA, V50, P987, DOI 10.2307/1912773 Hirsch H. 
G., 2000, P ISCA ITRW ASR2000, P181 ISHIGURO M, 1999, COMPUTER SCI MONOGRA, V30 ISHIZUKA K, 2006, P ICASSP 06 TOUL FRA, V1, P789 Junqua JC, 1994, IEEE T SPEECH AUDI P, V2, P406, DOI 10.1109/89.294354 Karray L, 2003, SPEECH COMMUN, V40, P261, DOI 10.1016/S0167-6393(02)00066-3 KITAGAWA G, 1984, J AM STAT ASSOC, V79, P378, DOI 10.2307/2288279 KRISTIANSSON T., 2005, P INTERSPEECH, P369 LEBOUQUINJEANNES R, 1995, SPEECH COMMUN, V16, P245, DOI 10.1016/0167-6393(94)00056-G Lee S, 2005, STAT SINICA, V15, P215 Li K, 2005, IEEE T SPEECH AUDI P, V13, P965, DOI 10.1109/TSA.2005.851955 Ling SQ, 2003, J AM STAT ASSOC, V98, P955, DOI 10.1198/016214503000000918 Ling SQ, 2003, ANN STAT, V31, P642 LJUNG GM, 1978, BIOMETRIKA, V68, P189 Marzinzik M, 2002, IEEE T SPEECH AUDI P, V10, P109, DOI 10.1109/89.985548 NAKAMURA A, 1998, P ICSLP Nakamura S, 2005, IEICE T INF SYST, VE88D, P535, DOI 10.1093/ietisy/e88-d.3.535 Nemer E, 2001, IEEE T SPEECH AUDI P, V9, P217, DOI 10.1109/89.905996 RABINER LR, 1975, AT&T TECH J, V54, P297 Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002 Ramirez J, 2005, IEEE SIGNAL PROC LET, V12, P689, DOI 10.1109/LSP.2005.855551 SHEN JL, 1998, P ICSLP Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 Srinivasant K., 1993, P IEEE SPEECH COD WO, P85, DOI 10.1109/SCFT.1993.762351 Tahmasbi R, 2007, IEEE T AUDIO SPEECH, V15, P1129, DOI 10.1109/TASL.2007.894521 TUCKER R, 1992, IEE PROC-I, V139, P377 West M., 1989, SPRINGER SERIES STAT NR 37 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2008 VL 50 IS 6 BP 476 EP 486 DI 10.1016/j.specom.2008.02.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 321HY UT WOS:000257296600003 ER PT J AU Clavel, C Vasilescu, I Devillers, L Richard, G Ehrette, T AF Clavel, C. Vasilescu, I. Devillers, L. Richard, G. Ehrette, T. TI Fear-type emotion recognition for future audio-based surveillance systems SO SPEECH COMMUNICATION LA English DT Article DE fear-type emotion recognition; fiction corpus; annotation scheme; acoustic features of emotions; machine learning; threatening situations; civil safety ID SPEECH; COMMUNICATION; EXPRESSION; ALGORITHM AB This paper addresses the issue of automatic emotion recognition in speech. We focus on a type of emotional manifestation which has been rarely studied in speech processing: fear-type emotions occurring during abnormal situations (here, unplanned events where human life is threatened). This study is dedicated to a new application in emotion recognition - public safety. The starting point of this work is the definition and the collection of data illustrating extreme emotional manifestations in threatening situations. For this purpose we develop the SAFE corpus (situation analysis in a fictional and emotional corpus) based on fiction movies. It consists of 7 h of recordings organized into 400 audiovisual sequences. The corpus contains recordings of both normal and abnormal situations and provides a large scope of contexts and therefore a large scope of emotional manifestations. In this way, it not only addresses the issue of the lack of corpora illustrating strong emotions, but also provides a useful basis for studying a wide variety of emotional manifestations. We define a task-dependent annotation strategy which has the particularity of describing simultaneously the emotion and the situation evolution in context.
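The core recursion behind the VAD in the Solvang et al. record above is the GARCH(1,1) conditional variance of the linear-prediction residual, thresholded to flag voice activity. The parameters below are fixed, invented values; the paper estimates them from the data and works on residuals obtained via a Kalman-filtered state-space AR model.

import numpy as np

def garch_variance(residuals, omega=1e-4, alpha=0.2, beta=0.7):
    # GARCH(1,1) conditional variance recursion:
    # sigma2[t] = omega + alpha * e[t-1]^2 + beta * sigma2[t-1].
    # omega/alpha/beta are invented; the paper estimates them from data.
    sigma2 = np.empty(len(residuals))
    sigma2[0] = np.var(residuals)
    for t in range(1, len(residuals)):
        sigma2[t] = omega + alpha * residuals[t - 1] ** 2 + beta * sigma2[t - 1]
    return sigma2

rng = np.random.default_rng(1)
e = rng.standard_normal(300) * 0.01
e[100:200] *= 20                          # a burst standing in for voice activity
sigma2 = garch_variance(e)
voiced = sigma2 > 3 * np.median(sigma2)   # simple threshold on conditional variance
print(voiced[:100].sum(), voiced[100:200].sum())  # detections concentrate in the burst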
The emotion recognition system is based on these data and must handle a wide range of unknown speakers and situations in noisy sound environments. It consists of a fear vs. neutral classification. The novelty of our approach lies in the use of dissociated acoustic models of the voiced and unvoiced content of speech. The two are then merged at the decision step of the classification system. The results are quite promising given the complexity and the diversity of the data: the error rate is about 30%. (C) 2008 Elsevier B.V. All rights reserved. C1 [Clavel, C.; Ehrette, T.] Thales Res & Technol France, F-91767 Palaiseau, France. [Vasilescu, I.; Devillers, L.] LIMSI CNRS, F-91403 Orsay, France. [Richard, G.] Telecom ParisTech, F-75014 Paris, France. RP Clavel, C (reprint author), Thales Res & Technol France, RD 128, F-91767 Palaiseau, France. EM chloe.clavel@thalesgroup.com CR Abelin A., 2000, P ISCA WORKSH SPEECH, P110 AMIR N, 2007, P AFF COMP INT INT L, P148 Auberge V, 2004, P 4 INT C LANG RES E, P179 Bakeman R, 1997, OBSERVING INTERACTIO Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Banziger T, 2006, P LREC WORKSH CORP R, P15 Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Batliner A., 2004, P 4 INT C LANG RES E, P171 Batliner A, 2006, P IS LTC 2006 LJUBL, P240 Bengio S., 2004, P OD 2004 SPEAK LANG Boersma P., 2005, PRAAT DOING PHONETIC Breazeal C, 2002, AUTON ROBOT, V12, P83, DOI 10.1023/A:1013215010749 Campbell N., 2003, P 15 INT C PHON SCI, P2417 Carletta J, 1996, COMPUT LINGUIST, V22, P249 Clavel C., 2004, P INT C SPOK LANG PR, P2277 CLAVEL C, 2006, P SPEECH PROS PS6 10 CLAVEL C, 2006, P LREC GEN, P1099 Clavel C., 2005, P IEEE INT C MULT EX CLAVEL C, 2007, P ICASSP HON, P21 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 CRONBACH LJ, 1951, PSYCHOMETRIKA, V16, P297 Damasio A., 1994, DESCARTES ERROR EMOT Darwin C, 1872, EXPRESSION EMOTIONS Dellaert F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608022 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007 DEVILLERS L, 2005, P 1 INT C AFF COMP I, P519 Devillers L., 2007, SPEAKER CHARACTERIZA DEVILLERS L, 2006, THESIS U PARIS 11 OR Devillers L., 2003, P EUR GEN, P189 Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5 Duda R. O., 1973, PATTERN CLASSIFICATI Ekman P., 1999, BASIC EMOTIONS HDB C Ekman P, 1975, UNMASKING FACE GUIDE Enos F., 2006, P LREC WORKSH CORP R, P6 Fernandez R, 2003, SPEECH COMMUN, V40, P145, DOI 10.1016/S0167-6393(02)00080-8 FRANCE D, 2003, IEEE T BIOMEDICAL EN, V47, P829 Harrigan J. A., 2005, NEW HDB METHODS NONV Juslin PN, 2003, PSYCHOL BULL, V129, P770, DOI 10.1037/0033-2909.129.5.770 Kienast M., 2000, P ISCA ITRW SPEECH E, p[5, 92] Kipp M., 2001, P 7 EUR C SPEECH COM, P1367 KLEIBER G., 1990, SEMANTIQUE PROTOTYPE KWON O.W., 2003, P 8 EUR C SPEECH COM, P125 Landis J. R., 1977, BIOMETRICS, V33, P174 Lee C., 2002, P INT C MULT EXP LAU, P737 LEE CM, 1997, INFORM COMM SIGNAL P, V1, P347 MCGILLOWAY S, 1997, THESIS QUEENS U BELF MOZZICONACCI S, 1998, THESIS TU EIDHOVEN Nunnaly J., 1978, PSYCHOMETRIC THEORY ORTHONY A, 1990, PSYCHOL REV, V97, P315 Osgood C.E., 1975, CROSS CULTURAL UNIVE Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-581(02)00141-6 Pelachaud C., 2005, P 13 ANN ACM INT C M, P683, DOI 10.1145/1101149.1101301 Picard R.
W., 1997, AFFECTIVE COMPUTING PLUTCHIK R, 1984, GEN PSYCHOEVOLUTIONA RUSSELL JA, 1997, SHALL EMOTION BE CAL Scherer K. R., 2001, APPRAISAL PROCESSES Scherer K. R., 1984, NATURE FUNCTION EMOT Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SCHERER U, 1980, INTERNAL PUSH EXTERN SCHROEDER M, 2007, P ACIL LISB, P440 SCHULLER B, 2004, P ICASSP MONTR, P80 Schuller B, 2003, P ICASSP HONG KONG, P1 Shafran I., 2003, P IEEE AUT SPEECH RE, P31 Vacher M, 2004, Proceedings of the Second IASTED International Conference on Biomedical Engineering, P395 van Bezooijen R., 1984, CHARACTERISTICS RECO VARADARAJAN V, 2006, P LREC WORKSH CORP R, P72 Vidrascu L, 2005, P INT LISB PORT SEPT, P1841 WAGNER J, 2007, P ACII LISB, P114 WHISSEL C, 1989, DICT AFFECT LANGUAGE Yacoub S., 2003, P EUROSPEECH, P729 Yegnanarayana B, 1998, IEEE T SPEECH AUDI P, V6, P1, DOI 10.1109/89.650304 NR 72 TC 30 Z9 34 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2008 VL 50 IS 6 BP 487 EP 503 DI 10.1016/j.specom.2008.03.012 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 321HY UT WOS:000257296600004 ER PT J AU Senapati, S Chakroborty, S Saha, G AF Senapati, Suman Chakroborty, Sandipan Saha, Goutam TI Speech enhancement by joint statistical characterization in the Log Gabor Wavelet domain SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; speech recognition; Log Gabor Wavelet; Bayesian bivariate estimator; circularly symmetric probability density function; Spherically Invariant Random Processes ID SPECTRAL AMPLITUDE ESTIMATOR AB In speech enhancement, Bayesian Marginal models cannot explain the inter-scale statistical dependencies of different wavelet scales. Simple non-linear estimators for wavelet-based denoising assume that the wavelet coefficients in different scales are independent in nature. However, wavelet coefficients have significant inter-scale dependencies. This paper introduces a new method that models the inter-scale dependency between the coefficients and their parents with a Circularly Symmetric Probability Density Function (CS-PDF) related to the family of Spherically Invariant Random Processes (SIRPs) in the Log Gabor Wavelet (LGW) domain, and the corresponding joint shrinkage estimators are derived by Maximum a Posteriori (MAP) estimation theory. The proposed work presents two different joint shrinkage estimators. In the first, the inter-scale variance of LGW coefficients is kept constant, which gives a closed-form solution. In the second, a relatively more complex approach is presented in which the variance is not constrained to be constant. It is also shown that the proposed methods perform better when speech uncertainty is taken into consideration. The robustness of the proposed frameworks is tested on 50 speakers of the POLYCOST and YOHO speech corpora in four different noisy environments against four established speech enhancement algorithms. Experimental results show that the proposed estimators yield a higher improvement in Segmental SNR (S-SNR) and a lower Log Spectral Distortion (LSD) compared to other estimators. In the second evaluation, the proposed speech enhancement techniques are found to give more robust Digit Recognition in noisy conditions on the AURORA 2.0 speech corpus compared to competing methods. (C) 2008 Elsevier B.V. All rights reserved.
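The Segmental SNR (S-SNR) and Log Spectral Distortion (LSD) figures used above are standard objective enhancement metrics. The following minimal Python sketch shows one common way to compute them; it assumes equal-length clean and enhanced signals as NumPy float arrays, and the frame length, overlap and SNR clamping range are conventional illustrative choices rather than the authors' exact configuration.

import numpy as np

def segmental_snr(clean, enhanced, frame_len=256, hop=128, clamp_db=(-10.0, 35.0)):
    # Frame-wise SNR in dB, clamped to a conventional range, then averaged.
    snrs = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        c = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        noise_energy = np.sum((c - e) ** 2) + 1e-12
        snr = 10.0 * np.log10(np.sum(c ** 2) / noise_energy + 1e-12)
        snrs.append(float(np.clip(snr, *clamp_db)))
    return float(np.mean(snrs))

def log_spectral_distortion(clean, enhanced, frame_len=256, hop=128):
    # RMS distance between log power spectra, averaged over frames.
    win = np.hanning(frame_len)
    dists = []
    for start in range(0, len(clean) - frame_len + 1, hop):
        C = np.abs(np.fft.rfft(clean[start:start + frame_len] * win)) ** 2
        E = np.abs(np.fft.rfft(enhanced[start:start + frame_len] * win)) ** 2
        d = 10.0 * np.log10((C + 1e-12) / (E + 1e-12))
        dists.append(float(np.sqrt(np.mean(d ** 2))))
    return float(np.mean(dists))

Higher S-SNR and lower LSD indicate better enhancement, which is the direction of improvement reported in the abstract.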
C1 [Senapati, Suman; Chakroborty, Sandipan; Saha, Goutam] Indian Inst Technol, Dept Elect & Elect Commun Engn, Kharagpur 721302, W Bengal, India. RP Senapati, S (reprint author), Indian Inst Technol, Dept Elect & Elect Commun Engn, Kharagpur 721302, W Bengal, India. EM suman@ece.iitkgp.ernet.in; sandipan@ece.iitkgp.ernet.in; gsaha@ece.iitkgp.ernet.in CR Acero Alex, 2000, INTERSPEECH, P869 Berouti M, 1979, IEEE INT C AC SPEECH, V4, P208, DOI 10.1109/ICASSP.1979.1170788 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 BREHM H, 1974, AEU-INT J ELECTRON C, V28, P445 BREHM H, 1982, LECT NOTES MATH, V969, P39 BREITHAUPT C, 2003, IEEE P INT C AC SPEE, V1, P896 CAMPBELL JP, 1999, P IEEE INT C AC SPEE Cohen I, 2004, IEEE SIGNAL PROC LET, V11, P725, DOI [10.1109/LSP.2004.833478, 10.1109/LSP.2004.833278] Coifman R. R., 1995, LECT NOTES STAT, V103, P125 DAT TH, 2005, P ICASSP 2005 DAVENPORT WB, 1952, J ACOUST SOC AM, V24, P390, DOI 10.1121/1.1906909 DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009 DONOHO DL, 1994, BIOMETRIKA, V81, P425, DOI 10.1093/biomet/81.3.425 DROPPO DJ, 2004, IEEE T SPEECH AUDIO, V12, P218 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 *ETSI STQ, 2000, 201108V111 ETSI ES FIELD DJ, 1987, J OPT SOC AM A, V4, P2379, DOI 10.1364/JOSAA.4.002379 Gabor D., 1946, Journal of the Institution of Electrical Engineers. III. Radio and Communication Engineering, V93 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 HIRSCH HG, 2000, ISCA ITRW ASR 2000 A Kamath S., 2002, P INT C AC SPEECH SI Kovesi P., 1999, J COMPUTER VISION RE, V1, P1 LEGGETTER C, 1995, MAXIMUM LIKELIHOOD L, P171 Lotter T., 2005, EURASIP J APPL SIG P, V7, P1110 MALAH D, 1999, P IEEE INT C AC SPEE, P789 Martin R., 2003, P INT WORKSH AC ECH, P87 MARTIN R, 2002, IEEE ICASSP 02 ORL F MORENO PJ, 1996, P ICASSP, P733 MORLET J, 1982, GEOPHYSICS, V47, P22 WOLFE PJ, 2003, EURASIP J APPL SIG P, V10, P1043 YANG J, 1993, P 18 IEEE INT C AC S, P363 YOMA NB, 1998, IEEE T SPEECH AUDIO, V6 ZAVAREHEI E, 2005, AUT SPEECH REC UND A, P219 NR 34 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2008 VL 50 IS 6 BP 504 EP 518 DI 10.1016/j.specom.2008.03.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 321HY UT WOS:000257296600005 ER PT J AU Pell, MD Skorup, V AF Pell, Marc D. Skorup, Vera TI Implicit processing of emotional prosody in a foreign versus native language SO SPEECH COMMUNICATION LA English DT Article DE speech processing; vocal expression; cross-linguistic; cultural factors; semantic priming ID NEGATIVE FACIAL EXPRESSIONS; AFFECT DECISION TASK; VOCAL EXPRESSION; SENTENCE COMPREHENSION; CULTURAL-DIFFERENCES; NONVERBAL EMOTION; PERCEPTION; SPEECH; FACE; RECOGNITION AB To test ideas about the universality and time course of vocal emotion processing, 50 English listeners performed an emotional priming task to determine whether they implicitly recognize emotional meanings of prosody when exposed to a foreign language. Arabic pseudoutterances produced in a happy, sad, or neutral prosody acted as primes for a happy, sad, or 'false' (i.e., non-emotional) face target and participants judged whether the facial expression represents an emotion. 
The prosody-face relationship (congruent, incongruent) and the prosody duration (600 or 1000 ms) were independently manipulated in the same experiment. Results indicated that English listeners automatically detect the emotional significance of prosody when expressed in a foreign language, although activation of emotional meanings in a foreign language may require greater exposure to prosodic information than when listening to the native language. (C) 2008 Elsevier B.V. All rights reserved. C1 [Pell, Marc D.; Skorup, Vera] McGill Univ, Sch Commun Sci & Disorders, Montreal, PQ H3G 1A8, Canada. RP Pell, MD (reprint author), McGill Univ, Sch Commun Sci & Disorders, 1266 Ave Pins Ouest, Montreal, PQ H3G 1A8, Canada. EM marc.pell@mcgill.ca CR ALBAS DC, 1976, J CROSS CULT PSYCHOL, V7, P481, DOI 10.1177/002202217674009 Bachorowski JA, 1999, CURR DIR PSYCHOL SCI, V8, P53, DOI 10.1111/1467-8721.00013 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 BEIER EG, 1972, J CONSULT CLIN PSYCH, V39, P166, DOI 10.1037/h0033170 BOWER GH, 1981, AM PSYCHOL, V36, P129, DOI 10.1037//0003-066X.36.2.129 Breitenstein C, 2001, COGNITION EMOTION, V15, P57, DOI 10.1080/0269993004200114 de Gelder B, 2000, COGNITION EMOTION, V14, P289 Dimberg U, 1996, MOTIV EMOTION, V20, P149, DOI 10.1007/BF02253869 EKMAN P, 1987, J PERS SOC PSYCHOL, V53, P712, DOI 10.1037/0022-3514.53.4.712 EKMAN P, 1971, J PERS SOC PSYCHOL, V17, P124, DOI 10.1037/h0030377 EKMAN P, 1994, PSYCHOL BULL, V115, P268, DOI 10.1037//0033-2909.115.2.268 EKMAN P, 1969, SCIENCE, V164, P86, DOI 10.1126/science.164.3875.86 Elfenbein HA, 2002, PSYCHOL BULL, V128, P203, DOI 10.1037//0033-2909.128.2.203 FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412 Innes-Ker A, 2002, J PERS SOC PSYCHOL, V83, P804, DOI 10.1037//0022-3514.83.4.804 IZARD CE, 1994, PSYCHOL BULL, V115, P288, DOI 10.1037/0033-2909.115.2.288 Izard C. E., 1977, HUMAN EMOTIONS Juslin PN, 2003, PSYCHOL BULL, V129, P770, DOI 10.1037/0033-2909.129.5.770 Juth P, 2005, EMOTION, V5, P379, DOI 10.1037/1528-3542.5.4.379 KRAMER E, 1964, J ABNORM SOC PSYCH, V68, P390, DOI 10.1037/h0042473 Laukka P, 2005, EMOTION, V5, P277, DOI 10.1037/1528-3542.5.3.277 Leppanen JM, 2003, J PSYCHOPHYSIOL, V17, P113, DOI 10.1027//0269-8803.17.3.113 Leppanen JM, 2004, PSYCHOL RES-PSYCH FO, V69, P22, DOI 10.1007/s00426-003-0157-2 Massaro DW, 1996, PSYCHON B REV, V3, P215, DOI 10.3758/BF03212421 Mathews A, 1998, COGNITIVE THER RES, V22, P539, DOI 10.1023/A:1018738019346 MCCLUSKEY KW, 1975, DEV PSYCHOL, V11, P551, DOI 10.1037/0012-1649.11.5.551 MESQUITA B, 1992, PSYCHOL BULL, V112, P179, DOI 10.1037//0033-2909.112.2.179 Mogg K, 1998, BEHAV RES THER, V36, P809, DOI 10.1016/S0005-7967(98)00063-1 Niedenthal P.
M., 1994, HEARTS EYE EMOTIONAL, P87 ONIFER W, 1981, MEM COGNITION, V9, P225, DOI 10.3758/BF03196957 PELL MD, 2005, PSYCH SOC 46 ANN M, V10, P98 PELL MD, FACTORS RECOGN UNPUB Pell MD, 2005, J NONVERBAL BEHAV, V29, P193, DOI 10.1007/s10919-005-7720-z Pell MD, 2002, BRAIN COGNITION, V48, P499, DOI 10.1006/brxg.2001.1406 PELL MD, RECOGNIZING EM UNPUB Pell MD, 2001, J ACOUST SOC AM, V109, P1668, DOI 10.1121/1.1352088 Pell MD, 2005, J NONVERBAL BEHAV, V29, P45, DOI 10.1007/s10919-004-0889-8 R Rosenthal, 1991, APPL SOC RES METHODS, V6, P19 Rossell SL, 2004, EMOTION, V4, P354, DOI 10.1037/1528-3542.4.4.354 RUSSELL JA, 1994, PSYCHOL BULL, V115, P102, DOI 10.1037/0033-2909.115.1.102 SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009 SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 Schrober M, 2003, SPEECH COMMUN, V40, P99, DOI 10.1016/S0167-6393(02)00078-X SWINNEY DA, 1979, J VERB LEARN VERB BE, V18, P645, DOI 10.1016/S0022-5371(79)90355-4 Thompson WF, 2006, SEMIOTICA, V158, P407, DOI 10.1515/SEM.2006.017 VANBEZOOIJEN R, 1983, J CROSS CULT PSYCHOL, V14, P387, DOI 10.1177/0022002183014004001 Vroomen J, 2001, COGN AFFECT BEHAV NE, V1, P382, DOI 10.3758/CABN.1.4.382 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 Williams CE, 1981, SPEECH EVALUATION PS, P221 Wilson D, 2006, J PRAGMATICS, V38, P1559, DOI 10.1016/j.pragma.2005.04.012 NR 51 TC 19 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2008 VL 50 IS 6 BP 519 EP 530 DI 10.1016/j.specom.2008.03.006 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 321HY UT WOS:000257296600006 ER PT J AU Ishi, CT Ishiguro, H Hagita, N AF Ishi, Carlos Toshinori Ishiguro, Hiroshi Hagita, Norihiro TI Automatic extraction of paralinguistic information using prosodic features related to F0, duration and voice quality SO SPEECH COMMUNICATION LA English DT Article DE prosody; voice quality; paralinguistic information; speech act; emotion; automatic detection ID EMOTION AB The use of acoustic-prosodic features related to F0, duration and voice quality is proposed and evaluated for automatic extraction of paralinguistic information (intentions, attitudes, and emotions) in dialogue speech. Perceptual experiments and acoustic analyses were conducted for monosyllabic interjections spoken in several speaking styles, conveying a variety of paralinguistic information. Experimental results indicated that the classical prosodic features, i.e., F0 and duration, were effective for discriminating groups of paralinguistic information expressing intentions, such as affirm, deny, filler, and ask for repetition, and accounted for 57% of the global detection rate in a task of discriminating seven groups of paralinguistic information. On the other hand, voice quality features were effective for identifying part of the paralinguistic information expressing emotions or attitudes, such as surprised, disgusted and admired, leading to a 12% improvement in the global detection rate. (C) 2008 Elsevier B.V. All rights reserved. C1 [Ishi, Carlos Toshinori; Ishiguro, Hiroshi; Hagita, Norihiro] ATR Intelligent Robot & Commun Labs, Kyoto 6190288, Japan. RP Ishi, CT (reprint author), ATR Intelligent Robot & Commun Labs, 2-2 Hikaridai Keihanna Sci City, Kyoto 6190288, Japan.
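The classical prosodic features named in this abstract, F0 and duration, can be approximated with very little machinery. The Python sketch below estimates a per-frame F0 contour by autocorrelation and derives simple F0 and duration statistics; it is a generic approximation for illustration, not the authors' feature set, and the frame sizes, search band and voicing threshold are assumptions.

import numpy as np

def f0_contour(x, sr, frame_s=0.04, hop_s=0.01, fmin=70.0, fmax=400.0):
    # Autocorrelation-based F0 per frame, in Hz; 0.0 marks unvoiced-looking frames.
    n, h = int(frame_s * sr), int(hop_s * sr)
    lo, hi = int(sr / fmax), int(sr / fmin)
    f0 = []
    for start in range(0, len(x) - n + 1, h):
        frame = x[start:start + n] - np.mean(x[start:start + n])
        ac = np.correlate(frame, frame, mode='full')[n - 1:]
        if ac[0] <= 0.0:
            f0.append(0.0)
            continue
        lag = lo + int(np.argmax(ac[lo:hi]))
        # Crude voicing decision from the normalised autocorrelation peak height.
        f0.append(sr / lag if ac[lag] / ac[0] > 0.3 else 0.0)
    return np.array(f0)

def prosodic_stats(f0, hop_s=0.01):
    # Utterance-level statistics of the kind fed to a paralinguistic classifier.
    voiced = f0[f0 > 0]
    return {'mean_f0_hz': float(np.mean(voiced)) if voiced.size else 0.0,
            'f0_range_hz': float(np.ptp(voiced)) if voiced.size else 0.0,
            'voiced_dur_s': float(voiced.size * hop_s)}

The voice quality measures discussed in the article require further analysis of the glottal source and are not approximated here.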
EM carlos@atr.jp; ishiguro@ams.eng.osaka-u.ac.jp; hagita@atr.jp CR Campbell N., 2003, P 15 INT C PHON SCI, P2417 Campbell N., 2004, J PHONET SOC JPN, V8, P9 DANG J, 1966, J ACOUST SOC AM, V101, P456 Erickson D., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.317 Fernandez R., 2005, P INT 2005 LISB PORT, P473 Fujie S., 2003, P IEEE WORKSH AUT SP, P231, DOI 10.1109/ASRU.2003.1318446 FUJIMOTO M, 2003, P 15 INT C PHON SCI, P2401 Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1 Gordon M, 2001, J PHONETICS, V29, P383, DOI 10.1006/jpho.2001.0147 Hanson HM, 1997, J ACOUST SOC AM, V101, P466, DOI 10.1121/1.417991 Hayashi Y., 1999, P 14 INT C PHON SCI, P2355 HESS W, 1983, SPRINGER SERIES INFO, V3 Ishi C. T., 2005, P EUR 2005, P481 Ishi Carlos Toshinori, 2004, P INT 2004 ICSLP JEJ, V2004, P941 Ishi CT, 2005, IEICE T INF SYST, VE88D, P481, DOI 10.1093/ietisy/e88-d.3.481 Ito Mika, 2004, P 2 INT C SPEECH PRO, V2004, P213 *JST CREST, ESP PROJ HOM Kasuya H., 2000, P INT C SPOK LANG PR, P345 Kitamura T., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.16 Klasmeyer G, 2000, VOICE QUALITY MEASUR, P339 Kreiman J, 2000, VOICE QUALITY MEASUR, P73 Laver J., 1980, PHONETIC DESCRIPTION, P93 Imagawa H, 2003, P STOCKH MUS AC C SM, P471 Maekawa K., 2004, P SPEECH PROS 2004, V2004, P367 Neiberg D, 2006, P INT C SPOK LANG PR, P809 Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S0167-6393(03)00099-2 Sadanobu Toshiyuki, 2004, J PHONETIC SOC JAPAN, V8, P29 SCHROEDER MR, 1999, HILBERT ENVELOPE INS, P174 SCHULLER B., 2005, P INT LISB PORT, P805 STEVENS K, 2000, TURBULENCE NOISE GLO, P445 NR 30 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2008 VL 50 IS 6 BP 531 EP 543 DI 10.1016/j.specom.2008.03.009 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 321HY UT WOS:000257296600007 ER PT J AU Fergani, B Davy, M Houacine, A AF Fergani, Belkacem Davy, Manuel Houacine, Amrane TI Speaker diarization using one-class support vector machines SO SPEECH COMMUNICATION LA English DT Article DE speaker indexing; kernel change detection; one-class support vector machine; speaker diarization ID BROADCAST NEWS AB This paper addresses speaker diarization, which consists of two steps: speaker turn detection and speaker clustering. These two steps require a metric to be defined in order to compare speech segments. Here, we employ a novel metric, based on one-class support vector machines, and recently introduced by one of the authors. This paper presents our primary speaker diarization system based on one-class SVMs, which is easy to build and configure. We show through several experiments, using NIST RT'03S and ESTER data sets, that our approach competes with most standard approaches based on, e.g., Generalized Likelihood Ratios or Gaussian Mixture Models, and may be complementary to them. Moreover, our technique permits the use of heterogeneous acoustic feature vectors of any dimension, while keeping the computational cost reasonable. (c) 2008 Published by Elsevier B.V. C1 [Fergani, Belkacem; Houacine, Amrane] USTHB, LCPTS, Dept Elect Engn, Algiers, Algeria. [Fergani, Belkacem; Davy, Manuel] CNRS, UMR 8146, LAGIS, INRIA FUTURS sequeL, Villeneuve d'Ascq, France. RP Fergani, B (reprint author), USTHB, LCPTS, Dept Elect Engn, BP 32,El Alia Bab Ezzouar, Algiers, Algeria.
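The role of the one-class SVM in such a system can be sketched as a novelty score between two adjacent feature windows. The following Python fragment uses scikit-learn and is a generic stand-in for the kernel change-detection metric of the cited work; the features and hyper-parameters are placeholders.

import numpy as np
from sklearn.svm import OneClassSVM

def dissimilarity(left, right, nu=0.2, gamma='scale'):
    # Train a one-class SVM on the left window of acoustic feature frames
    # ((n_frames, n_features) arrays) and score how poorly the right window
    # fits its support region; a high score suggests a speaker change.
    ocsvm = OneClassSVM(nu=nu, gamma=gamma).fit(left)
    # decision_function is positive inside the learned region, negative outside,
    # so its negated mean grows with dissimilarity.
    return float(-np.mean(ocsvm.decision_function(right)))

# Hypothetical sliding-window use over a feature matrix feats:
#   scores = [dissimilarity(feats[t - 100:t], feats[t:t + 100])
#             for t in range(100, len(feats) - 100, 50)]
# Local maxima of scores are candidate speaker turns, which a subsequent
# clustering stage would group by speaker.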
EM bfergani2001@yahoo.fr CR AJMERA J, 2002, P ICSLP02 DENV US AM ARONSZAJN N, 1950, T AM MATH SOC, V68, P337 Barras C, 2006, IEEE T AUDIO SPEECH, V14, P1505, DOI 10.1109/TASL.2006.878261 BEN M, 2004, P INT C SPOK LANG PR, P1125 CANU S, 2005, P ESANN 05 BRUGG BEL Chang C.-C., 2001, LIBSVM LIB SUPPORT V CHRISTENSEN H, 2002, THESIS AALBORG U DEN Davy M, 2006, SIGNAL PROCESS, V86, P2009, DOI 10.1016/j.sigpro.2005.09.027 Delacourt P, 2000, SPEECH COMMUN, V32, P111, DOI 10.1016/S0167-6393(00)00027-3 Desobry F, 2005, IEEE T SIGNAL PROCES, V53, P2961, DOI 10.1109/TSP.2005.851098 DESOBRY F, 2004, P IEEE ICASSP 04 MON Duda R. O., 2001, PATTERN CLASSIFICATI Dunn RB, 2000, DIGIT SIGNAL PROCESS, V10, P93, DOI 10.1006/dspr.1999.0359 GRAVIER G., 2004, P LANG EV RES C LREC, P885 HEGDE RM, 2007, EURASIP J AUDIO SPEE, P1 Janin A, 1999, P 6 EUR C SPEECH COM KARTIK V, 2005, P NOLISP 05 BARC SPA Kwon S, 2005, IEEE T SPEECH AUDI P, V13, P1004, DOI 10.1109/TSA.2005.851981 LIU D, 2004, P IEEE ICASSP 04 MON MEIGNIER S, 2002, P ICSLP 2002 DENV CO, V1, P573 Meignier S, 2006, COMPUT SPEECH LANG, V20, P303, DOI 10.1016/j.csl.2005.08.002 MEIGNIER S, 2002, THESIS U AVIGN PAYS MEIGNIER S, 2000, P INT C AC SPEECH SI, P1177 MORARU D, 2004, P IEEE ICASSP 01 MON NGUYEN P, 2003, WORKSH NIST RT03 S P *NIST, 2003, TR03S NIST SCHMIDT M, 1996, P IEEE ICASSP 96 ATL Scholkopf B., 2002, LEARNING KERNELS SECK M, 2001, P IEEE INT C AUD SPE SOLONONOFF A, 1998, P IEEE ICASSP 98 Tranter SE, 2006, IEEE T AUDIO SPEECH, V14, P1557, DOI 10.1109/TASL.2006.878256 Wan V, 2005, IEEE T SPEECH AUDI P, V13, P203, DOI 10.1109/TSA.2004.841042 Wooters C., 2004, P FALL 2004 RICH TRA NR 33 TC 8 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2008 VL 50 IS 5 BP 355 EP 365 DI 10.1016/j.specom.2007.11.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 298PJ UT WOS:000255699300001 ER PT J AU Cheang, HS Pell, MD AF Cheang, Henry S. Pell, Marc D. TI The sound of sarcasm SO SPEECH COMMUNICATION LA English DT Article DE verbal irony; sarcasm; prosody; acoustic cues ID TO-NOISE RATIO; VERBAL IRONY; VOICE QUALITY; SPECTRAL CHARACTERISTICS; PERCEIVED HYPERNASALITY; PARKINSONS-DISEASE; PRETENSE THEORY; SPEAKER AFFECT; VOCAL CUES; SPEECH AB The present study was conducted to identify possible acoustic cues of sarcasm. Native English speakers produced a variety of simple utterances to convey four different attitudes: sarcasm, humour, sincerity, and neutrality. Following validation by a separate naive group of native English speakers, the recorded speech was subjected to acoustic analyses for the following features: mean fundamental frequency (F0), F0 standard deviation, F0 range, mean amplitude, amplitude range, speech rate, harmonics-to-noise ratio (HNR, to probe for voice quality changes), and one-third octave spectral values (to probe resonance changes). The results of analyses indicated that sarcasm was reliably characterized by a number of prosodic cues, although one acoustic feature appeared particularly robust in sarcastic utterances: overall reductions in mean F0 relative to all other target attitudes. Sarcasm was also reliably distinguished from sincerity by overall reductions in HNR and in F0 standard deviation. In certain linguistic contexts, sarcasm could be differentiated from sincerity and humour through changes in resonance and reductions in both speech rate and F0 range. 
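Among the features listed here, the harmonics-to-noise ratio is the least standard to compute. A rough Python sketch follows, using the textbook approximation HNR = 10*log10(r/(1-r)), where r is the normalised autocorrelation peak at the pitch lag; this is an illustration under assumed parameters, not the analysis pipeline actually used in the study.

import numpy as np

def frame_hnr(frame, sr, fmin=70.0, fmax=400.0):
    # HNR in dB for one voiced frame (e.g. 40 ms of speech at sample rate sr).
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    if ac[0] <= 0.0:
        return float('-inf')  # silent frame
    lo, hi = int(sr / fmax), int(sr / fmin)
    r = ac[lo + int(np.argmax(ac[lo:hi]))] / ac[0]
    r = min(max(r, 1e-6), 1.0 - 1e-6)  # keep the log argument finite
    return 10.0 * np.log10(r / (1.0 - r))

Lower values of this ratio correspond to the reduced HNR reported for sarcastic relative to sincere utterances.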
Results also suggested a role of language used by speakers in conveying sarcasm and sincerity. It was concluded that sarcasm in speech can be characterized by a specific pattern of prosodic cues in addition to textual cues, and that these acoustic characteristics can be influenced by language used by the speaker. (c) 2007 Elsevier B.V. All rights reserved. C1 [Cheang, Henry S.; Pell, Marc D.] McGill Univ, Sch Commun Sci & Disorders, Montreal, PQ H3G 1A8, Canada. RP Cheang, HS (reprint author), McGill Univ, Sch Commun Sci & Disorders, 1226 Pine Ave W, Montreal, PQ H3G 1A8, Canada. EM henry.cheang@mail.mcgill.ca CR ACKERMAN BP, 1983, J EXP CHILD PSYCHOL, V35, P487, DOI 10.1016/0022-0965(83)90023-1 ACKERMAN BP, 1986, CHILD DEV, V57, P485, DOI 10.1111/j.1467-8624.1986.tb00047.x Anolli L, 2002, INT J PSYCHOL, V37, P266, DOI 10.1080/00207590244000106 Attardo S, 2003, HUMOR, V16, P243, DOI 10.1515/humr.2003.012 Baken RJ, 2000, CLIN MEASUREMENT SPE Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016 Beddor P, 1993, NASALS NASALIZATION, P171 Berlyne Daniel E., 1972, PSYCHOL HUMOR, P43 Boersma P., 2007, PRAAT DOING PHONETIC BOLINGER D, 1989, INTONATION ITS USER BROWNELL HH, 1983, BRAIN LANG, V18, P20, DOI 10.1016/0093-934X(83)90002-0 BRYANT GA, IN PRESS PROSODIC CO Bryant GA, 2002, METAPHOR SYMBOL, V17, P99, DOI 10.1207/S15327868MS1702_2 Bryant GA, 2005, LANG SPEECH, V48, P257 CAPELLI CA, 1990, CHILD DEV, V61, P1824, DOI 10.1111/j.1467-8624.1990.tb03568.x CHEANG HS, UNPUB ACOUSTICS SARC Cheang HS, 2007, J NEUROLINGUIST, V20, P221, DOI 10.1016/j.jneuroling.2006.07.001 Colston H. L., 1997, METAPHOR SYMBOL, V12, P43, DOI 10.1207/s15327868ms1201_4 Colston HL, 2000, J PRAGMATICS, V32, P1557, DOI 10.1016/S0378-2166(99)00110-1 Colston HL, 2000, DISCOURSE PROCESS, V30, P179, DOI 10.1207/S15326950DP3002_05 CUTLER A, 1976, N HOLLAND LINGUISTIC, V30, P133 DARA C, IN PRESS NEUROPSYCHO DEKROM G, 1995, J SPEECH HEAR RES, V38, P794 DEWS S, 1995, DISCOURSE PROCESS, V19, P347 ESKENAZI L, 1990, J SPEECH HEAR RES, V33, P298 Ferrand CT, 2002, J VOICE, V16, P480, DOI 10.1016/S0892-1997(02)00123-6 FONAGY I, 1971, PHONETICA, V23, P42 FONAGY I, 1976, J PSYCHOL NORMALE PA, V73, P273 FONAGY I, 1976, J PSYCHOL NORMALE PA, V73, P304 CLARK HH, 1984, J EXP PSYCHOL GEN, V113, P121, DOI 10.1037/0096-3445.113.1.121 Gerrig R.J., 2000, METAPHOR SYMBOL, V15, P197, DOI 10.1207/S15327868MS1504_1 Gibbs Jr R. W., 2000, METAPHOR SYMBOL, V15, P5, DOI [DOI 10.1080/10926488.2000.9678862, DOI 10.1207/S15327868MS151&] Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1 Grice H. 
P., 1975, SYNTAX SEMANTICS, P41, DOI DOI 10.1017/S0022226700005296 HAIMAN J, 1998, TALK CHEAP SARCASM A HAVERKATE H, 1990, J PRAGMATICS, V14, P77, DOI 10.1016/0378-2166(90)90065-L HEILMAN KM, 1984, NEUROLOGY, V34, P917 Jorgensen J, 1996, J PRAGMATICS, V26, P613, DOI 10.1016/0378-2166(95)00067-4 Kataoka R, 2001, FOLIA PHONIATR LOGO, V53, P198, DOI 10.1159/000052675 Kataoka R, 2001, J ACOUST SOC AM, V109, P2181, DOI 10.1121/1.1360717 Kataoka R, 1996, CLEFT PALATE-CRAN J, V33, P43, DOI 10.1597/1545-1569(1996)033<0043:SPAQEO>2.3.CO;2 KREUZ RJ, 1989, J EXP PSYCHOL GEN, V118, P374 KREUZ RJ, 1991, METAPHOR SYMB ACT, V6, P149, DOI 10.1207/s15327868ms0603_1 KREUZ RJ, 1993, METAPHOR SYMB ACT, V8, P97, DOI 10.1207/s15327868ms0802_2 KREUZ RJ, 1995, METAPHOR SYMB ACT, V10, P21, DOI 10.1207/s15327868ms1001_3 KUMONNAKAMURA S, 1995, J EXP PSYCHOL GEN, V124, P3, DOI 10.1037//0096-3445.124.1.3 LADD DR, 1985, J ACOUST SOC AM, V78, P435, DOI 10.1121/1.392466 Laval V, 2005, J SPEECH LANG HEAR R, V48, P610, DOI 10.1044/1092-4388(2005/042) LAVAL V, 2004, PSYCHOL FRANCAISE, V49, P177, DOI 10.1016/j.psfr.2004.04.004 Lee ASY, 2003, CLIN LINGUIST PHONET, V17, P259, DOI 10.1080/0269920031000080091 Lee ASY, 2004, J MED SPEECH-LANG PA, V12, P173 Maeda S., 1993, NASALS NASALIZATION, P147 MONETTA L, IN PRESS J NEUROPSYC MUEKE DC, 1969, COMPASS IRONY MUEKE DC, 1978, POETICS, V7, P363 Mullennix JW, 2005, PSYCHOLOGY OF MOODS, P123 Pell MD, 2006, BRAIN LANG, V96, P221, DOI 10.1016/j.bandl.2005.04.007 PELL MD, IN PRESS FACTORS REC Pell MD, 2003, COGN AFFECT BEHAV NE, V3, P275, DOI 10.3758/CABN.3.4.275 Pell MD, 2006, BRAIN LANG, V97, P123, DOI 10.1016/j.bandl.2005.08.010 PEREIRA JG, 2002, J VOICE, V16, P28 Pexman PM, 2002, J LANG SOC PSYCHOL, V21, P245, DOI 10.1177/0261927X02021003003 Rockwell P, 2005, PSYCHOLOGY OF MOODS, P109 Rockwell P, 2000, PERCEPT MOTOR SKILL, V91, P665, DOI 10.2466/PMS.91.6.665-668 Rockwell P, 2000, J PSYCHOLINGUIST RES, V29, P483, DOI 10.1023/A:1005120109296 Rockwell P, 2001, PERCEPT MOTOR SKILL, V93, P47, DOI 10.2466/PMS.93.4.47-50 Sabbagh MA, 1999, BRAIN LANG, V70, P29, DOI 10.1006/brln.1999.2139 Schaffer R., 1982, PAPERS PARASESSION L, P204 SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 SCHERER KR, 1984, J ACOUST SOC AM, V76, P1346, DOI 10.1121/1.391450 SOBIN C, 1999, J PSYCHOLINGUISTIC R, V28 SPERBER D, 1984, J EXP PSYCHOL GEN, V113, P130, DOI 10.1037//0096-3445.113.1.130 Suls J. M., 1972, PSYCHOL HUMOR, P81 SULS JM, 1983, HDB HUMOR RES, V1, P39 TARTTER VC, 1994, J ACOUST SOC AM, V96, P2101, DOI 10.1121/1.410151 TOMPKINS CA, 1991, J SPEECH HEAR RES, V34, P820 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 Wilson D, 2006, J PRAGMATICS, V38, P1559, DOI 10.1016/j.pragma.2005.04.012 WINNER E, 1991, BRIT J DEV PSYCHOL, V9, P257 Yoshida H, 2000, J ORAL REHABIL, V27, P723, DOI 10.1046/j.1365-2842.2000.00537.x YUMOTO E, 1982, J ACOUST SOC AM, V71, P1544, DOI 10.1121/1.387808 NR 82 TC 24 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAY PY 2008 VL 50 IS 5 BP 366 EP 381 DI 10.1016/j.specom.2007.11.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 298PJ UT WOS:000255699300002 ER PT J AU Kim, G Cho, NI AF Kim, Gibak Cho, Nam Ik TI Frequency domain multi-channel noise reduction based on the spatial subspace decomposition and noise eigenvalue modification SO SPEECH COMMUNICATION LA English DT Article DE multi-channel filtering; noise reduction; subspace decomposition ID SPEECH ENHANCEMENT AB In this paper, frequency domain multi-channel noise reduction algorithms are proposed, based on the subspace decomposition of narrow-band spatial covariance matrices. In speech-present periods, the multi-channel input signals are decomposed into speech and noise spatial subspaces. The noise eigenvalues are modified in order to update the noise statistics not only in the noise-only period but also in the speech-present period. Three approaches are introduced for the noise eigenvalue modification, which are based on the rank-1 property of the speech narrow-band spatial covariance matrix for the single speech source. The proposed algorithms are tested with simulated data and real data, and the results show that the proposed methods yield better performance compared to the conventional multi-channel Wiener filtering and the time domain subspace approaches. (c) 2007 Elsevier B.V. All rights reserved. C1 [Kim, Gibak] Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA. [Cho, Nam Ik] Seoul Natl Univ, Sch Elect Engn, Seoul 151744, South Korea. RP Kim, G (reprint author), Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA. EM imkgb27@gmail.com CR Asano F, 2000, IEEE T SPEECH AUDI P, V8, P497, DOI 10.1109/89.861364 Bitzer J, 2001, SPEECH COMMUN, V34, P3, DOI 10.1016/S0167-6393(00)00042-X Chen Jingdong, 2006, IEEE T AUDIO SPEECH, V14 Cohen I, 2004, IEEE T SIGNAL PROCES, V52, P1149, DOI 10.1109/TSP.2004.826166 DOCLO S, 2001, INT WORKSH AC ECH NO, P31 Doclo S, 2002, IEEE T SIGNAL PROCES, V50, P2230, DOI 10.1109/TSP.2002.801937 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 FLORENCIO D, 2001, P IEEE INT C AC SPEE, P197 Golub G.H., 1996, MATRIX COMPUTATIONS GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 Herbordt W., 2005, SOUND CAPTURE HUMAN Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458 JABLOUN F, 2001, P IEEE INT C AC SPEE, P205 JENSEN SH, 1995, IEEE T SPEECH AUDI P, V3, P439, DOI 10.1109/89.482211 MARRO C, 1998, IEEE T SPEECH AUDIO, V6 Rombouts G, 2003, SIGNAL PROCESS, V83, P1889, DOI 10.1016/S0165-1684(03)00107-5 *RWCP SOUND SCEN D, 2001, REAL WORLD COMP PART SPRIET A, 2005, IEEE T SPEECH AUDIO, V13 Zelinski R., 1988, P IEEE INT C AC SPEE, V5, P2578 NR 19 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2008 VL 50 IS 5 BP 382 EP 391 DI 10.1016/j.specom.2007.11.004 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 298PJ UT WOS:000255699300003 ER PT J AU Chomphan, S Kobayashi, T AF Chomphan, Suphattharachai Kobayashi, Takao TI Tone correctness improvement in speaker dependent HMM-based Thai speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE Thai speech; HMM; speech synthesis; decision tree; tone ID DYNAMIC FEATURES AB In this paper, we describe a novel approach to the realization of Thai speech synthesis.
Spectrum, fundamental frequency (F0), and phone duration are modeled simultaneously in a unified framework of HMM, and their parameter distributions are clustered independently by using a decision-tree based context clustering technique. A group of contextual factors that affect spectrum, F0, and state duration, i.e., tone type and part of speech, is taken into account. Since Thai is a tonal language, not only intelligibility and naturalness but also correctness of synthesized tone is taken into account. To improve the correctness of tone of the synthesized speech, tone groups and tone types are used to design four different structures of decision tree in the tree-based context clustering process, including a single binary tree structure, a simple tone-separated tree structure, a constancy-based-tone-separated tree structure, and a trend-based-tone-separated tree structure. A subjective evaluation of tone correctness is conducted by using tone perception of eight Thai listeners. The simple tone-separated tree structure gives the highest level of tone correctness, while the single binary tree structure gives the lowest level of tone correctness. In addition to the tree structure, the additional contextual tone information which is applied to all structures of the decision tree achieves a significant improvement of tone correctness. Moreover, the evaluation of syllable duration distortion among the four structures shows that the constancy-based-tone-separated and the trend-based-tone-separated tree structures can alleviate the distortions that appear when using the simple tone-separated tree structure. Finally, MOS and CCR tests show that the implemented system gives better reproduction of prosody (or naturalness, in some sense) than the unit-selection-based system with the same speech database. (c) 2007 Elsevier B.V. All rights reserved. C1 [Chomphan, Suphattharachai; Kobayashi, Takao] Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Midori Ku, Yokohama, Kanagawa 2268502, Japan. RP Chomphan, S (reprint author), Tokyo Inst Technol, Interdisciplinary Grad Sch Sci & Engn, Midori Ku, 4259-G2-4 Nagatsuta Cho, Yokohama, Kanagawa 2268502, Japan. EM suphattharachai@ip.titech.ac.jp CR ABRAMSON AS, 1979, INT C PHON SCI, P380 Black AW, 2007, INT CONF ACOUST SPEE, P1229 CHOMPUN S, 2004, IEEE AS PAC C CIRC S, P197 GANDOUR J, 1994, J PHONETICS, V22, P477 GONZALVO X, 2007, 6 ISCA WORKSH SPEECH, P362 HANSAKUNBUNTHEU.C, 2005, INT S NAT LANG PROC, P127 LUKSANEEYANAWIN S, 1993, INT S NAT LANG PROC, P276 LUKSANEEYANAWIN S, 1989, REG WORKSH COMP PROC, P305 LUKSANEEYANAWIN S, 1992, INT S LANG LING, P75 Masuko T, 1996, INT CONF ACOUST SPEE, P389, DOI 10.1109/ICASSP.1996.541114 Mittrapiyanuruk P., 2000, NECTEC ANN C BANGK, P483 PALMER A, 1969, LANG LEARN, V19, P287, DOI 10.1111/j.1467-1770.1969.tb00469.x PONYANUM P, 2003, INT S RES DEV INN DE Riley M., 1992, TALKING MACHINES THE, P265 Russell M. J., 1985, ICASSP 85. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.
85CH2118-8) Saravari C., 1983, Journal of the Acoustical Society of Japan (E), V4 Seresangtakul P, 2003, IEICE T INF SYST, VE86D, P2223 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 SORNLERTLAMVANI.V, 1998, INT C SPEECH DAT ASS, P131 THATHONG U, 2000, INT C SPOK LANG PROC, P47 THOMPSON NGI, 1998, INT C SPOK LANG PROC, P53 THUBTHONG N, 2001, INT C INT TECHN INTE, P356 Tokuda K, 1999, INT CONF ACOUST SPEE, P229 TOKUDA K, 1995, INT CONF ACOUST SPEE, P660, DOI 10.1109/ICASSP.1995.479684 Wutiwiwatchsi C, 2007, SPEECH COMMUN, V49, P8, DOI 10.1016/j.specom.2006.10.004 Yamagishi J, 2003, IEEE INT C AC SPEECH, P716 Yoshimura T., 1999, EUR C SPEECH COMM TE, P2347 Yoshida T, 1998, INTERNATIONAL ELECTRON DEVICES MEETING 1998 - TECHNICAL DIGEST, P29, DOI 10.1109/IEDM.1998.746239 Young S.J., 1994, ARPA WORKSH HUM LANG, P307 Zen H, 2004, INT C SPOK LANG PROC, P1393 NR 30 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2008 VL 50 IS 5 BP 392 EP 404 DI 10.1016/j.specom.2007.12.002 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 298PJ UT WOS:000255699300004 ER PT J AU Yamagishi, J Kawai, H Kobayashi, T AF Yamagishi, Junichi Kawai, Hisashi Kobayashi, Takao TI Phone duration modeling using gradient tree boosting SO SPEECH COMMUNICATION LA English DT Article DE text-to-speech synthesis; phone duration modeling; gradient tree boosting ID SPEECH AB In text-to-speech synthesis systems, phone duration influences the quality and naturalness of synthetic speech. In this study, we incorporate an ensemble learning technique called gradient tree boosting into phone duration modeling as an alternative to the conventional approach using regression trees, and objectively evaluate the prediction accuracy of Japanese, Mandarin, and English phone duration. The gradient tree boosting algorithm is a meta-algorithm over regression trees: it iteratively fits a regression tree to the residuals and outputs a weighted sum of the regression trees. Our evaluation results show that, compared to regression trees or related techniques, the gradient tree boosting algorithm can substantially and robustly improve the predictive accuracy of phone duration regardless of language, speaker, or domain. (c) 2008 Elsevier B.V. All rights reserved. C1 [Yamagishi, Junichi; Kawai, Hisashi] Adv Telecommun Res Inst Int, Spoken Language Commun Res Labs, Kyoto 6190288, Japan. [Yamagishi, Junichi; Kobayashi, Takao] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9LW, Midlothian, Scotland. [Kawai, Hisashi] KDDI, R&D Labs, Saitama 3568502, Japan. RP Yamagishi, J (reprint author), Univ Edinburgh, Ctr Speech Technol Res, 2 Buccleuch Pl, Edinburgh EH8 9LW, Midlothian, Scotland. EM jyamagis@inf.ed.ac.uk CR Breiman L, 1996, MACH LEARN, V24, P123, DOI 10.1023/A:1018054314350 CAMPBELL WN, 1990, SPEECH COMMUN, V9, P57, DOI 10.1016/0167-6393(90)90046-C Chen SH, 2003, IEEE T SPEECH AUDI P, V11, P308, DOI 10.1109/TSA.2003.814377 Friedman JH, 2002, COMPUT STAT DATA AN, V38, P367, DOI 10.1016/S0167-9473(01)00065-2 Friedman JH, 2001, ANN STAT, V29, P1189, DOI 10.1214/aos/1013203451 Hastie T., 2001, SPRINGER SERIES STAT, V2nd Iwahashi N, 2000, IEICE T INF SYST, VE83D, P1550 Kawai H, 2004, P 5 ISCA SPEECH SYNT, P179 KAWAI H, 2006, IEICE T, P2688 Lee S., 1999, P OR COCOSDA 99, P109 Quinlan J.
R., 1992, Proceedings of the 5th Australian Joint Conference on Artificial Intelligence. AI '92 Riedi M, 1995, P EUROSPEECH, P599 Riley M., 1992, TALKING MACHINES THE, P265 Snyman J. A., 2005, PRACTICAL MATH OPTIM TAKEDA K, 1989, J ACOUST SOC AM, V86, P2081, DOI 10.1121/1.398467 Takezawa T., 2002, P LREC, P147 Van Santen J. P. H., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90016-Y Witten I.H., 2005, DATA MINING PRACTICA NR 18 TC 8 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2008 VL 50 IS 5 BP 405 EP 415 DI 10.1016/j.specom.2007.12.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 298PJ UT WOS:000255699300005 ER PT J AU Callejas, Z Lopez-Cozar, R AF Callejas, Zoraida Lopez-Cozar, Ramon TI Influence of contextual information in emotion annotation for spoken dialogue systems SO SPEECH COMMUNICATION LA English DT Article DE emotion annotation; emotion recognition; emotional speech; spoken dialogue system; dialogue context; acoustic context; affective computing ID SPEECH RECOGNITION; WEIGHTED KAPPA; HIGH AGREEMENT; CALL-CENTERS; 2 PARADOXES; RELIABILITY AB In this paper, we study the impact of considering context information for the annotation of emotions. Concretely, we propose the inclusion of the history of user-system interaction and the neutral speaking style of users. A new method to automatically include both sources of information has been developed making use of novel techniques for acoustic normalization and dialogue context annotation. We have carried out experiments with a corpus extracted from real human interactions with a spoken dialogue system. Results show that the performance of non-expert human annotators and machine-learned classifications are both affected by contextual information. The proposed method allows the annotation of more non-neutral emotions and yields values closer to maximum agreement rates for nonexpert human annotation. Moreover, automatic classification accuracy improves by 29.57% compared to the classical approach based only on acoustic features. (c) 2008 Elsevier B.V. All rights reserved. C1 [Callejas, Zoraida; Lopez-Cozar, Ramon] Univ Granada, Dept Languages & Comp Syst, Fac Comp Sci & Telecommun, E-18071 Granada, Spain. RP Callejas, Z (reprint author), Univ Granada, Dept Languages & Comp Syst, Fac Comp Sci & Telecommun, C Periodista Daniel Saucedo Aranda S-N, E-18071 Granada, Spain. EM zoraida@ugr.es; rlopezc@ugr.es RI Prieto, Ignacio/B-5361-2013; Lopez-Cozar, Ramon/A-7686-2012; Callejas Carrion, Zoraida/C-6851-2012 OI Lopez-Cozar, Ramon/0000-0003-2078-495X; Callejas Carrion, Zoraida/0000-0001-8891-5237 CR Adell J., 2005, PROCESAMIENTO LENGUA, V35, P277 AI H, 2006, P INT PITTSB PA, P797 Ang J, 2002, P INT C SPOK LANG PR, P2037 ARTSTEIN R, 2005, KAPPA3 ALPHA BETA TE BICKMORE T, 2004, P AAAI FALL S DIAL S, P275 Bishop C. 
M., 2006, PATTERN RECOGNITION Boehner K, 2007, INT J HUM-COMPUT ST, V65, P275, DOI 10.1016/j.ijhcs.2006.11.016 Boersma P., 1993, P I PHONETIC SCI, V17, P97 Burkhardt F., 2005, P EL SPEECH SIGN PRO, P123 Callejas Z., 2005, P APPL SPOK LANG INT Camurri A., 2004, Cognition, Technology & Work, V6, DOI 10.1007/s10111-003-0138-7 CICCHETTI DV, 1990, J CLIN EPIDEMIOL, V43, P551, DOI 10.1016/0895-4356(90)90159-M COHEN J, 1968, PSYCHOL BULL, V70, P213, DOI 10.1037/h0026256 CORRADINI A, 2005, P 10 INT C INT US IN, P183, DOI 10.1145/1040830.1040872 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Cowie R., 2000, P ISCA WORKSH SPEECH, P19 Cowie R, 2005, LECT NOTES COMPUT SC, V3361, P305 CRAGGS R, 2003, P 4 SIGDIAL WORKSH D, P218 Critchley HD, 2005, NEUROIMAGE, V24, P751, DOI 10.1016/j.neuroimage.2004.10.013 DAVIES M, 1982, BIOMETRICS, V38, P1047, DOI 10.2307/2529886 Devillers L, 2005, NEURAL NETWORKS, V18, P407, DOI 10.1016/j.neunet.2005.03.007 Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5 Dunn G., 1989, DESIGN ANAL RELIABIL FEINSTEIN AR, 1990, J CLIN EPIDEMIOL, V43, P543, DOI 10.1016/0895-4356(90)90158-L FLEISS JL, 1971, PSYCHOL BULL, V76, P378, DOI 10.1037/h0031619 FLEISS JL, 1973, EDUC PSYCHOL MEAS, V33, P613, DOI 10.1177/001316447303300309 Forbes-Riley K., 2004, P HUM LANG TECHN C N, P201 GEBHARD P, 2004, P TUT RES WORKSH AFF, P128 Gerfen C., 2002, PROBUS, V14, P247, DOI 10.1515/prbs.2002.010 GONZALEZ GM, 1999, 39 U MICH Gut Ulrike, 2004, P SPEECH PROS NAR JA, P565 Guyon I., 2003, Journal of Machine Learning Research, V3, DOI 10.1162/153244303322753616 HALL L, 2005, P AFF COMP INT INT A, P731 Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7 HOZJAN V, 2002, P 3 INT C LANG RES E, P2019 Iriondo I, 2000, P ISCA WORKSH SPEECH, P161 JOHNSTONE T, 1996, P 4 INT C SPOK LANG, V3, P1985, DOI 10.1109/ICSLP.1996.608026 Krippendorff K., 2003, CONTENT ANAL INTRO M Landis J. 
R., 1977, BIOMETRICS, V33, P174 Lantz CA, 1996, J CLIN EPIDEMIOL, V49, P431, DOI 10.1016/0895-4356(95)00571-4 Lee C, 2005, P ANN INT IEEE EMBS, P5523, DOI 10.1109/IEMBS.2005.1615734 Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Liscombe J., 2005, P INT LISB PORT, P1845 Litman DJ, 2006, SPEECH COMMUN, V48, P559, DOI 10.1016/j.specom.2005.09.008 MAHLKE S, 2006, P 2006 C HUM FACT CO Montero JM, 1999, P 14 INT C PHON SCI, P957 Morrison D, 2007, SPEECH COMMUN, V49, P98, DOI 10.1016/j.specom.2006.11.004 ONEILL P, 2005, LANGUAGE DESIGN, V7, P151 PICARD RW, 2005, P 2005 C HUM FACT CO PITTERMAN J, 2006, P 2 IEEE INT C INT E, P197 Plutchik R., 1980, EMOTION PSYCHOEVOLUT RICCARDI G, 2005, LECT NOTES COMPUTER, P144 ROTARU M, 2006, P 9 INT C SPOK LANG, P53 Rumelhart D., 1986, LEARNING INTERNAL RE RUSSELL JA, 1980, J PERS SOC PSYCHOL, V39, P1161, DOI 10.1037/h0077714 SCHERER KR, 2005, SOC SCI INFORM, V44, P694 Scott WA, 1955, PUBLIC OPIN QUART, V19, P321, DOI 10.1086/266577 Shafran I., 2003, P IEEE AUT SPEECH RE, P31 SHAFRAN I, 2005, P INT C AC SPEECH SI, P341 Stibbard R., 2000, P ISCA WORKSH SPEECH, P60 Streit M, 2006, SMARTKOM FDN MULTIMO, P317, DOI 10.1007/3-540-36678-4_21 Ververidis D, 2006, SPEECH COMMUN, V48, P1162, DOI 10.1016/j.specom.2006.04.003 Vidrascu L, 2005, LECT NOTES COMPUT SC, V3784, P739 Vogt T., 2005, P MULT EXP AMST, P474 Wilks Y., 2006, 13 OXF INT I Wilting Janneke, 2006, P INT 2006, P805 Witten I.H., 2005, DATA MINING PRACTICA ZENG Z, 2006, P 8 INT C MULT INT B, P828 NR 68 TC 26 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2008 VL 50 IS 5 BP 416 EP 433 DI 10.1016/j.specom.2008.01.001 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 298PJ UT WOS:000255699300006 ER PT J AU Bisani, M Ney, H AF Bisani, Maximilian Ney, Hermann TI Joint-sequence models for grapheme-to-phoneme conversion SO SPEECH COMMUNICATION LA English DT Article DE grapheme-to-phoneme; letter-to-sound; phonemic transcription; joint-sequence model; pronunciation modeling ID SMOOTHING TECHNIQUES; LANGUAGE; ANALOGY; PRONUNCIATION AB Grapheme-to-phoneme conversion is the task of finding the pronunciation of a word given its written form. It has important applications in text-to-speech and speech recognition. Joint-sequence models are a simple and theoretically stringent probabilistic framework that is applicable to this problem. This article provides a self-contained and detailed description of this method. We present a novel estimation algorithm and demonstrate high accuracy on a variety of databases. Moreover, we study the impact of the maximum approximation in training and transcription, the interaction of model size parameters, n-best list generation, confidence measures, and phoneme-to-grapheme conversion. Our software implementation of the method proposed in this work is available under an Open Source license. (c) 2008 Elsevier B.V. All rights reserved. C1 [Bisani, Maximilian; Ney, Hermann] Rhein Westfal TH Aachen Univ, Lehrstuhl Informat 6, D-52056 Aachen, Germany. RP Bisani, M (reprint author), Rhein Westfal TH Aachen Univ, Lehrstuhl Informat 6, Ahornstr 55, D-52056 Aachen, Germany. EM bisani@informatik.rwth-aachen.de; ney@informatik.rwth-aachen.de CR Allen J. F., 2002, P INT C SPOK LANG PR, V1, P109 Andersen O., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607954 Bagshaw PC, 1998, COMPUT SPEECH LANG, V12, P119, DOI 10.1006/csla.1998.0042 Bellegarda JR, 2005, SPEECH COMMUN, V46, P140, DOI 10.1016/j.specom.2005.03.002 BESLING S, 1994, K VER NAT SPRACH KON, P24 Bisani M., 2001, P EUR C SPEECH COMM, V2, P1429 Bisani M., 2003, P EUR C SPEECH COMM, P933 Bisani M., 2004, P IEEE INT C AC SPEE, V1, P409 Bisani M., 2002, P INT C SPOK LANG PR, V1, P105 Bisani M., 2005, P INT, P725 BISANI M, 2005, S0245 EUR LANG RES A CASEIRO D, 2002, P IEEE WORKSH SPEECH *CEL, 1995, CEL LEX DAT Chen SF, 1999, COMPUT SPEECH LANG, V13, P359, DOI 10.1006/csla.1999.0128 Chen SF, 2000, IEEE T SPEECH AUDI P, V8, P37, DOI 10.1109/89.817452 Chen Stanley F, 2003, P EUR, P2033 CONTENT A, 1990, ANN PSYCHOL, V90, P551 DAELEMANS WMP, 1996, PROGR SPEECH SYNTHES, P77 Dedina M. J., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90017-K Deligne S., 1995, P ICASSP, V1, P169 Deligne S., 1995, P EUR C SPEECH COMM, P2243 Deligne S, 1997, SPEECH COMMUN, V23, P223, DOI 10.1016/S0167-6393(97)00048-4 Flannery B. P., 1992, NUMERICAL RECIPES C Galescu L., 2003, P EUR C SPEECH COMM, P249 Galescu L., 2001, P 4 ISCA TUT RES WOR GOLLAN C, 2005, P IEEE INT C AC SPEE, V1, P825, DOI 10.1109/ICASSP.2005.1415241 Hakkinen J, 2003, SPEECH COMMUN, V41, P455, DOI 10.1016/S0167-6393(03)00015-3 Jensen K., 2000, P INT C SPOK LANG PR, V3, P318 JIANG L, 1997, P EUR C SPEECH COMM, V2, P605 Kaplan R. M., 1994, Computational Linguistics, V20 KINGSBURY P, 1997, LDC97L20 LDC Kneser R, 1995, P IEEE INT C AC SPEE, V1, P181 LEVENSHT.VI, 1965, DOKL AKAD NAUK SSSR+, V163, P845 LOOF J, 2006, P INT C SPOK LANG PR, P105 LUCASSEN JM, 1984, P IEEE INT C AC SPEE, V9, P304 LUNGEN H, 1998, BIELEFELDER LEXIKON Marchand Y, 2000, COMPUT LINGUIST, V26, P195, DOI 10.1162/089120100561674 McCulloch N., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90013-1 MENG HM, 1994, HLT 94, P289 MITTON R, 1992, COMPUTER USABLE DICT Ney H, 1997, TEXT SPEECH LANG TEC, V2, P174 NEY H, 1995, IEEE T PATTERN ANAL, V17, P1202, DOI 10.1109/34.476512 Och F. J., 2003, Computational Linguistics, V29, DOI 10.1162/089120103321337421 Pagel V., 1998, P INT C SPOK LANG PR, V5, P2015 ROBINSON T, 1997, BEEP BRIT ENGLISH SEJNOWKSI TJ, 1993, NETTALK CORPUS Sejnowski T. J., 1987, Complex Systems, V1 SUONTAUSTA J, 2000, P INT C SPOK LANG PR Torkkola K., 1993, P INT C AC SPEECH SI, V2, P199 VANDENBOSCH A, 2006, PASCAL LETTERTOPHONE VOZILA P, 2003, P EUR C SPEECH COMM, P2469 WEIDE RL, 1998, CARNEGIE MELLON PRON Wells J.C., 1997, SAMPA COMPUTER READA Wells JC, 1997, HDB STANDARDS RESOUR Yvon F., 1996, P C NEW METH NAT LAN, P218 ZIEGENHAIN U, 2005, CREATION LEXICA SPEE NR 56 TC 91 Z9 95 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2008 VL 50 IS 5 BP 434 EP 451 DI 10.1016/j.specom.2008.01.002 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 298PJ UT WOS:000255699300007 ER PT J AU Liao, H Gales, MJF AF Liao, H. Gales, M. J. F. TI Issues with uncertainty decoding for noise robust automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; uncertainty decoding; noise robustness; AURORA2 AB Interest continues in a class of robustness algorithms for speech recognition that exploit the notion of uncertainty introduced by environmental noise. 
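As the abstract goes on to explain, the simplest way to exploit this uncertainty is to add the variance of the enhancement error to the acoustic model variances during decoding. A minimal diagonal-covariance sketch in Python, with all names and shapes illustrative:

import numpy as np

def log_likelihood(obs, mean, var, uncertainty_var):
    # Diagonal-Gaussian log-likelihood with the per-dimension enhancement-error
    # variance added to the model variance (the observation-uncertainty form).
    v = var + uncertainty_var  # the inflation grows with the noise level
    return float(-0.5 * np.sum(np.log(2.0 * np.pi * v) + (obs - mean) ** 2 / v))

# As the SNR drops, uncertainty_var dominates v and the likelihoods of all
# states flatten towards each other -- the low-SNR behaviour the article
# analyses for front-end schemes that pass a single uncertainty to the decoder.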
These techniques share the property that the uncertainty varies with the noise level and is propagated to the decoding stage, resulting in increased model variances. In observation uncertainty forms, the uncertainty variance is simply the variance of the error in enhancement that is added to the model variances. Another form, called uncertainty decoding, refers to a factorisation which results in a linear feature transform and a model variance bias that increases with noise; using appropriate approximations, efficient implementations may be obtained, with the goal of achieving near model-based performance without the associated computational cost. Unfortunately, uncertainty decoding forms that compute the uncertainty in the front-end and pass this to the decoder may suffer from a theoretical problem in low signal-to-noise ratio conditions. This report discusses how this fundamental issue arises, and demonstrates it through two schemes: SPLICE with uncertainty and front-end joint uncertainty decoding (FE-Joint). A method to mitigate this for FE-Joint compensation is presented, as well as how SPLICE implicitly addresses it. However, it is shown that a model-based joint uncertainty decoding approach does not suffer from this limitation, unlike these front-end forms, and is more computationally attractive. The issues described and the performance of the various schemes are examined on two artificially corrupted corpora: the AURORA 2.0 digit string recognition and 1000-word Resource Management tasks. (c) 2007 Elsevier B.V. All rights reserved. C1 [Liao, H.; Gales, M. J. F.] Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Gales, MJF (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England. EM hl251@eng.cam.ac.uk; mjfg@eng.cam CR Acero A., 2000, P ICSLP BEIJ CHIN ARROWOOD JA, 2002, P ICSLP DENV CO BENITEZ C, 2004, P ICSLP JEJ ISL KOR Borga M., 2001, CANONICAL CORRELATIO DENG L, 2005, IEEE T SPEECH AUDIO, V12 DENG L, 2000, P ICSLP, P806 DENG L, 2002, P ICSLP DROPPO J, 2002, P ICASSP ORL FL Droppo J., 2001, P EUR, P217 Gales M. J. F., 1998, COMPUTER SPEECH LANG, V12 GALES MJF, 1998, SPEECH COMM, V25 Gales M.J.F, 1995, THESIS CAMBRIDGE U Hirsch H. G., 2000, P ISCA ITRW ASR2000, P181 HOLMES JN, 1997, P EUR RHOD GREEC Kim DY, 1998, SPEECH COMMUN, V24, P39, DOI 10.1016/S0167-6393(97)00061-7 KIM DY, 2003, P ASRU KRISTJANSSON TT, 2002, P ICASSP ORL FL LIAO H, 2006, P INT LIAO H, 2005, P INT LIAO H, 2007, P ICASSP LIAO H, 2004, CUEDFINFENGTR499 U C PRICE P, 1988, P ICASSP SEATTL WA U SHMODA K, 1995, P EURSP MADR SPAIN STOUTEN V, 2004, P IVSLP JEJ ISL KOR, V1, P105 VARGA AP, 1992, NOISEX 92 STUDY EFFE WOLFEL M, 2007, P ICASSP XU H, 2006, P INT Young S., 2004, HTK BOOK NR 28 TC 32 Z9 33 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD APR PY 2008 VL 50 IS 4 BP 265 EP 277 DI 10.1016/j.specom.2007.10.004 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 289GJ UT WOS:000255042700001 ER PT J AU Drahota, A Costall, A Reddy, V AF Drahota, Amy Costall, Alan Reddy, Vasudevi TI The vocal communication of different kinds of smile SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 10th Conference on Facial Expression CY SEP, 2003 CL Rimini, ITALY DE smiles; non-verbal communication; emotion; speech characteristics; auditory perception ID EMOTION; EXPRESSION; SPEECH; EMBARRASSMENT; APPEASEMENT; INTONATION; EXPERIENCE; FREQUENCY; CORRELATE; ACCURACY AB The present study investigated the vocal communication of naturally occurring smiles. Verbal variation was controlled in the speech of 8 speakers by asking them to repeat the same sentence in response to a set sequence of 17 questions, intended to provoke reactions such as amusement, mild embarrassment, or just a neutral response. After coding for facial expressions, a sample of 64 utterances was chosen to represent Duchenne smiles, non-Duchenne smiles, suppressed smiles and non-smiles. These audio clips were used to test the discrimination skills of 11 listeners, who had to rely on vocal indicators to identify different types of smiles in speech. The study established that listeners can discriminate different smile types and further indicated that listeners utilize prototypical ideals to discern whether a person is smiling. Some acoustical cues appear to be taken by listeners as strong indicators of a smile, regardless of whether the speaker is actually smiling. Further investigations into listeners' prototypical ideals of vocal expressivity could prove worthwhile for voice synthesizing technology endeavoring to make computer simulations more naturalistic. (c) 2007 Elsevier B.V. All rights reserved. C1 [Drahota, Amy; Costall, Alan; Reddy, Vasudevi] Univ Portsmouth, Dept Psychol, Portsmouth PO1 2DY, Hants, England. RP Drahota, A (reprint author), Univ Portsmouth, Sch Hlth Sci & Social Work, James Watson Hall,2 King Richard 1st Rd, Portsmouth PO1 2FR, Hants, England. EM amy.drahota@port.ac.uk CR Auberge V, 2003, SPEECH COMMUN, V40, P87, DOI 10.1016/S0167-6393(02)00077-8 Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016 Boersma P., 1993, P I PHONETIC SCI, V17, P97 BURDICK A, 2003, DISCOVER, V24, P1 Cassell J., 2000, EMBODIED CONVERSATIO Ceschi G, 2003, COGNITION EMOTION, V17, P385, DOI 10.1080/02699930143000725 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 de Gelder B, 2000, COGNITION EMOTION, V14, P289 Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P1, DOI 10.1016/S0167-6393(02)00072-9 Eibl-Eibesfeldt I., 1970, ETHOLOGY BIOL BEHAV Ekman P., 1969, SEMIOTICA, V1, P49 EKMAN P, 1974, J PERS SOC PSYCHOL, V29, P288, DOI 10.1037/h0036006 Ekman P., 2001, TELLING LIES CLUES D EKMAN P, 2002, FACS EKMAN P, 1980, J PERS SOC PSYCHOL, V39, P1125, DOI 10.1037/h0077722 EKMAN P, 1988, J PERS SOC PSYCHOL, V54, P414, DOI 10.1037//0022-3514.54.3.414 Fernandez-Dols J.-M., 1997, PSYCHOL FACIAL EXPRE, P255, DOI 10.1017/CBO9780511659911.013 Fitch WT, 1997, J ACOUST SOC AM, V102, P1213, DOI 10.1121/1.421048 Flannery B. P., 1992, NUMERICAL RECIPES C FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412 Fridlund A.
J., 1994, HUMAN FACIAL EXPRESS FRIDLUND AJ, 1991, J PERS SOC PSYCHOL, V60, P229, DOI 10.1037/0022-3514.60.2.229 Gross JJ, 1998, J PERS SOC PSYCHOL, V74, P224, DOI 10.1037/0022-3514.74.1.224 HAYMAN CAG, 1989, J EXP PSYCHOL LEARN, V15, P228, DOI 10.1037/0278-7393.15.2.228 Johnstone T., 2000, HDB EMOTIONS, V2nd, P220 JONCKHEERE AR, 1970, FORMULATION ASSESSME, P190 Juslin PN, 2003, PSYCHOL BULL, V129, P770, DOI 10.1037/0033-2909.129.5.770 Juslin PN, 2001, EMOTION, V1, P381, DOI 10.1037//1528-3542.1.4.381 Keltner D., 2000, HDB EMOTIONS, P236 Keltner D, 1997, PSYCHOL BULL, V122, P250, DOI 10.1037//0033-2909.122.3.250 KELTNER D, 1995, J PERS SOC PSYCHOL, V68, P441, DOI 10.1037/0022-3514.68.3.441 Kendall M.G., 1970, RANK CORRELATION MET Kirouac G., 1999, SOCIAL CONTEXT NONVE, P182 LADD DR, 1985, J ACOUST SOC AM, V78, P435, DOI 10.1121/1.392466 Ladefoged P., 1975, COURSE PHONETICS LaFrance M, 1999, SOCIAL CONTEXT NONVE, P45 Levenson RW, 1994, NATURE EMOTION FUNDA, P273 LEVENSON RW, 2002, NEW YORK AC SCI C EM, P12 LEVITIN DJ, 1994, PERCEPT PSYCHOPHYS, V56, P414, DOI 10.3758/BF03206733 LIEBERMAN P, 1962, J ACOUST SOC AM, V34, P922, DOI 10.1121/1.1918222 Messinger D., 1997, PSYCHOL FACIAL EXPRE, P205, DOI 10.1017/CBO9780511659911.011 MORRIS WN, 1987, MOTIV EMOTION, V11, P215, DOI 10.1007/BF01001412 OHALA JJ, 1984, PHONETICA, V41, P1 SCHERER KR, 1994, J PERS SOC PSYCHOL, V66, P310, DOI 10.1037/0022-3514.66.2.310 Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009 Scherer KR, 1986, EXPERIENCING EMOTION SCHERER KR, 1985, ADV STUD BEHAV, V15, P189, DOI 10.1016/S0065-3454(08)60490-8 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SWETS JA, 1986, PSYCHOL BULL, V99, P100, DOI 10.1037/0033-2909.99.1.100 TARTTER VC, 1980, PERCEPT PSYCHOPHYS, V27, P24, DOI 10.3758/BF03199901 TARTTER VC, 1994, J ACOUST SOC AM, V96, P2101, DOI 10.1121/1.410151 Trevarthen C., 2000, NORD J MUSIC THER, V9, P3, DOI 10.1080/08098130009477996 van Puijenbroek EP, 2002, PHARMACOEPIDEM DR S, V11, P3, DOI 10.1002/pds.668 XU Y, 2007, 16 INT C PHON SCI SA, P2105 Yoder PJ, 2000, BEHAVIORAL OBSERVATION, P317 Zaalberg R, 2004, COGNITION EMOTION, V18, P183, DOI 10.1080/02699930341000040 NR 56 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2008 VL 50 IS 4 BP 278 EP 287 DI 10.1016/j.specom.2007.10.001 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 289GJ UT WOS:000255042700002 ER PT J AU Cnockaert, L Schoentgen, J Auzou, P Ozsancak, C Defebvre, L Grenez, F AF Cnockaert, L. Schoentgen, J. Auzou, P. Ozsancak, C. Defebvre, L. Grenez, F. TI Low-frequency vocal modulations in vowels produced by Parkinsonian subjects SO SPEECH COMMUNICATION LA English DT Article DE speech analysis; vocal modulations; Parkinsonian speech ID INSTANTANEOUS FREQUENCY; SPEECH SIGNALS; VOICE QUALITY; DISEASE; TREMOR; DYSARTHRIA; DISORDERS AB Low-frequency vocal modulations here designate slow disturbances of the phonatory frequency F-0. They are present in all voiced speech sounds, but their properties may be affected by neurological disease. An analysis method, based on continuous wavelet transforms, is proposed to extract the phonatory frequency trace and low-frequency vocal modulation in sustained speech sounds. 
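A minimal sketch of the kind of wavelet-based F0-trace extraction just described (illustrative only, not the paper's method; the filter bank, its parameters, and the test signal are all assumptions): a bank of complex Morlet-like filters is applied and the ridge of maximum magnitude is read off as the phonatory frequency trace.

    import numpy as np

    def f0_trace(x, fs, f_grid):
        # Bank of complex Morlet-like wavelets (about 5 cycles each); the
        # F0 trace is the frequency of maximum magnitude at each instant.
        t = np.arange(-0.1, 0.1, 1.0 / fs)
        mag = np.empty((len(f_grid), len(x)))
        for i, f in enumerate(f_grid):
            sigma = 5.0 / (2.0 * np.pi * f)
            w = np.exp(2j * np.pi * f * t - t ** 2 / (2.0 * sigma ** 2))
            mag[i] = np.abs(np.convolve(x, w, mode="same"))
        return f_grid[np.argmax(mag, axis=0)]

    fs = 4000.0
    tt = np.arange(0, 0.5, 1.0 / fs)
    x = np.sin(2.0 * np.pi * 120.0 * tt + 1.5 * np.sin(2.0 * np.pi * 5.0 * tt))
    # Instantaneous frequency is 120 + 7.5*cos(2*pi*5*t) Hz: a 5 Hz modulation.
    # Frames near the signal edges are unreliable.
    print(f0_trace(x, fs, np.arange(80.0, 180.0, 2.0))[::500])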
The method is used to analyze a corpus of vowels uttered by male and female speakers, some of whom are healthy and some of whom suffer from Parkinson's disease. The latter present general speech problems but their voice is not perceived as tremulous. The objective is to discover differences between speaker groups in F-0 low-frequency modulations. Results show that Parkinson's disease has different effects on the voice of male and female speakers. The average phonatory frequency is significantly higher for male Parkinsonian speakers. The modulation amplitude is significantly higher for female Parkinsonian speakers. The modulation frequency is significantly higher and the ratio between the modulation energies in the frequency bands [3 Hz, 7 Hz] and [7 Hz, 15 Hz] is significantly lower for Parkinsonian speakers of both genders. (c) 2007 Elsevier B.V. All rights reserved. C1 [Cnockaert, L.; Schoentgen, J.; Grenez, F.] Univ Libre Bruxelles, Lab Images Signaux & Dispositifs Telecommun, Fac Sci Appl, B-1050 Brussels, Belgium. [Auzou, P.; Ozsancak, C.; Defebvre, L.] CHRU Lille, Serv Neurol & Pathol Mouvement A, Fac Med H Warenbourg, EA 6283,IFR 114, Lille, France. [Auzou, P.] Etab Helio Marin Grp Hopale, Serv Explorat Fonct Neurol, F-62600 Berck Sur Mer, France. [Ozsancak, C.] CH Boulogne sur Mer, Serv Neurol, F-62200 Boulogne Sur Mer, France. RP Cnockaert, L (reprint author), Univ Libre Bruxelles, Lab Images Signaux & Dispositifs Telecommun, Fac Sci Appl, CP 165-51,Av FD Roosevelt 50, B-1050 Brussels, Belgium. EM lcnockae@ulb.ac.be; jschoent@ulb.ac.be; pauzou@yahoo.fr; c_ozsancak@yahoo.fr; fgrenez@ulb.ac.be RI LICEND, CEMND/F-1296-2015 CR Addison P.S., 2002, ILLUSTRATED WAVELET Auzou P, 1998, REV NEUROL, V154, P523 BOASHASH B, 1992, P IEEE, V80, P520, DOI 10.1109/5.135376 Boersma P., 2004, PRAAT DOING PHONETIC Carmona RA, 1997, IEEE T SIGNAL PROCES, V45, P2586, DOI 10.1109/78.640725 CNOCKAERT L, 2005, P ICASSP PHIL US, P393 DEFEBVRE L, 2005, TROUBLES PAROLE DEGL, P9 Fant Gunnar, 1985, STL QPSR, V4, P1 FREUND HJ, 1987, TEMPORAL DISORDER HU, P79 FUCCI D, 1984, ADV BASIC RES PRACTI, V11, P249 Gresty M A, 1984, Adv Neurol, V40, P361 HANSON DG, 1984, LARYNGOSCOPE, V94, P348 HARTELIUS L, 1994, FOLIA PHONIATR LOGO, V46, P9 Hess W., 1983, PITCH DETERMINATION Hirose H, 1995, VOCAL FOLD, P235 Holmes RJ, 2000, INT J LANG COMM DIS, V35, P407 Jiménez-Jiménez F J, 1997, Parkinsonism Relat Disord, V3, P111, DOI 10.1016/S1353-8020(97)00007-2 KADAMBE S, 1992, IEEE T INFORM THEORY, V38, P917, DOI 10.1109/18.119752 Kawahara H., 1999, P EUR 99, P2781 KENT RD, 1994, J MED SPEECH-LANG PA, V2, P157 King J, 1994, J MED SPEECH-LANG PA, V2, P29 Leech NL, 2005, SPSS INTERMEDIATE ST LETIEN T, 1997, P IEEE TENC, P31 LOGEMANN JA, 1978, J SPEECH HEAR DISORD, V43, P47 Mallat S., 1999, WAVELET TOUR SIGNAL MEDAN Y, 1991, IEEE T SIGNAL PROCES, V39, P40, DOI 10.1109/78.80763 Mitev P, 2003, INFORM SCIENCES, V156, P3, DOI 10.1016/S0020-0255(03)00161-0 Nakatani T, 2004, J ACOUST SOC AM, V116, P3690, DOI 10.1121/1.1787522 ORLIKOFF RF, 1989, J ACOUST SOC AM, V85, P888, DOI 10.1121/1.397560 Percival D.B., 2000, WAVELET METHODS TIME Perez KS, 1996, J VOICE, V10, P354, DOI 10.1016/S0892-1997(96)80027-0 QIU LJ, 1995, SIGNAL PROCESS, V44, P233, DOI 10.1016/0165-1684(95)00027-B Rabiner L.R., 1978, DIGITAL PROCESSING S ROBERT D, 2005, TROUBLES PAROLE DEGL, P131 Schoentgen J, 2002, J ACOUST SOC AM, V112, P690, DOI 10.1121/1.1492820 *SQLAB, 2005, EV 2 WORKST VOIC SPE Titze IR, 1995, VOCAL FOLD, P335 TITZE IR, 1994, WORKSH AC VOIC AN NA WINHOLTZ
WS, 1992, J SPEECH HEAR RES, V35, P562 YAIR E, 1988, P IEEE, V76, P1166, DOI 10.1109/5.9662 Zar JH, 1996, BIOSTATISTICAL ANAL ZIEGLER W, 1999, VOICE QUALITY MEASUR, P397 ZWIRNER P, 1991, J COMMUN DISORD, V24, P287, DOI 10.1016/0021-9924(91)90004-3 ZWIRNER P, 1992, J SPEECH HEAR RES, V35, P761 NR 44 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2008 VL 50 IS 4 BP 288 EP 300 DI 10.1016/j.specom.2007.10.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 289GJ UT WOS:000255042700003 ER PT J AU Goubanova, O King, S AF Goubanova, Olga King, Simon TI Bayesian networks for phone duration prediction SO SPEECH COMMUNICATION LA English DT Article DE text-to-speech; Bayesian networks; duration modelling; sums of products; classification and regression trees ID CONNECTED-SPEECH SIGNALS; SEGMENTAL DURATIONS; VOWEL DURATION; ENGLISH; TEXT; BOUNDARIES; UTTERANCE; PATTERNS; POSITION; STRESS AB In a text-to-speech system, the duration of each phone may be predicted by a duration model. This model is usually trained using a database of phones with known durations; each phone (and the context it appears in) is characterised by a feature vector that is composed of a set of linguistic factor values. We describe the use of a graphical model - a Bayesian network - for predicting the duration of a phone, given the values for these factors. The network has one discrete variable for each of the linguistic factors and a single continuous variable for the phone's duration. Dependencies between variables (or the lack of them) are represented in the BN structure by arcs (or missing arcs) between pairs of nodes. During training, both the topology of the network and its parameters are learned from labelled data. We compare the results of the BN model with results for sums of products and CART models on the same data. In terms of the root mean square error, the BN model performs much better than both CART and SoP models. In terms of correlation coefficient, the BN model performs better than the SoP model, and as well as the CART model. A BN model has certain advantages over CART and SoP models. Training SoP models requires a high degree of expertise. CART models do not deal with interactions between factors in any explicit way. As we demonstrate, a BN model can also make accurate predictions of a phone's duration, even when the values for some of the linguistic factors are unknown. (c) 2007 Elsevier B.V. All rights reserved. C1 [Goubanova, Olga; King, Simon] Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9LW, Midlothian, Scotland. RP King, S (reprint author), Univ Edinburgh, Ctr Speech Technol Res, 2 Buccleuch Pl, Edinburgh EH8 9LW, Midlothian, Scotland. EM ogoubanova@netscape.net; Simon.King@ed.ac.uk CR Allen J., 1987, TEXT SPEECH MITALK S BARBOSA P, 1994, SPEECH COMMUN, V15, P127, DOI 10.1016/0167-6393(94)90047-7 Bell A, 2003, J ACOUST SOC AM, V113, P1001, DOI 10.1121/1.1534836 Bishop CM, 1998, NEURAL NETWORKS PATT BLACK A, 2003, 120 U ED CTR SPEECH BOUTILIER C, 1996, P 12 C UNC ART INT U CAMPBELL N, 1992, P 2 INT C SPOK LANG CAMPBELL WN, 1991, J PHONETICS, V19, P37 Clark R. A.
J., 2004, P 5 ISCA WORKSH SPEE COKER CH, 1973, IEEE T ACOUST SPEECH, VAU21, P293, DOI 10.1109/TAU.1973.1162458 Coombs C.H., 1964, A THEORY OF DATA COOPER A, 1912, P 12 INT C PHON SCI, V2, P50 COOPER GF, 1992, MACH LEARN, V9, P309, DOI 10.1007/BF00994110 CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1553, DOI 10.1121/1.395911 CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1574, DOI 10.1121/1.395912 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DUSTERHOFF KE, 1999, CD ROM P EUR 99 BUD Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332 Friedman N., 1996, P 12 C UNC ART INT U GOUBANOVA O, 2005, P INT 2005 LISB PORT, V4, P1941 GOUBANOVA O, 2005, THESIS U EDINBURGH GREGORY M, 2001, J ACOUST SOC AM, V110, P2738 Haggard M., 1973, J PHONETICS, V1, P9 Heckerman D., 1995, MSRTR9506 HILLER SEJ, 1990, 11 U ED CTR SPEECH T KAIKI N, 1990, P INT C SPOK LANG PR, P17 KLATT D, 1974, J SPEECH HEAR RES, V17, P51 KLATT D, 1975, J PHONETICS, V59, P129 Klatt D. H., 1976, J ACOUST SOC AM, V59, P1209 KLATT DH, 1973, J ACOUST SOC AM, V54, P1102, DOI 10.1121/1.1914322 KRANTZ D, 1964, FDN MEASUREMENT, V1 KRISHNA N, 2004, CD ROM P INT C SPOK Lam W., 1994, COMPUT INTELL, V10, P269, DOI 10.1111/j.1467-8640.1994.tb00166.x Lee P. M., 1997, BAYESIAN STAT LEHISTE I, 1973, J ACOUST SOC AM, V54, P1228, DOI 10.1121/1.1914379 LEHISTE I, 1972, J ACOUST SOC AM, V51, P2018, DOI 10.1121/1.1913062 Lindblom B. E. F., 1973, PAPERS LINGUISTICS U, V21, P1 MAYO C, 2005, P INT 2005 LISB PORT, V4, P1725 Nooteboom S., 1972, THESIS U UTRECHT OLLER DK, 1973, J ACOUST SOC AM, V54, P1235, DOI 10.1121/1.1914393 Olshen R., 1984, CLASSIFICATION REGRE, V1st Pearl J., 1988, PROBABILISTIC REASON PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183 PORT RF, 1981, J ACOUST SOC AM, V69, P262, DOI 10.1121/1.385347 Riley M., 1992, TALKING MACHINES THE, P265 SHILL C, 2000, J ACOUST SOC AM, V107, P1012 SLUIJTER AMC, 1995, PHONETICA, V52, P71 STROM V, 2006, P INT 2006 PITTS US TOKUDA K, 2002, P 2002 IEEE SPEECH S Turk AE, 2000, J PHONETICS, V28, P397, DOI 10.1006/jpho.2000.0123 Turk AE, 1999, J PHONETICS, V27, P171, DOI 10.1006/jpho.1999.0093 UMEDA N, 1975, J ACOUST SOC AM, V58, P62 UMEDA N, 1977, J ACOUST SOC AM, V61, P847 Umeda N, 1975, J ACOUST SOC AM, V58, P435 VANSANTEN JPH, 1994, COMPUT SPEECH LANG, V8, P95, DOI 10.1006/csla.1994.1005 VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5 VANSON RJJ, 1997, P EUR 97 RHOD, P319 WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 NR 58 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2008 VL 50 IS 4 BP 301 EP 311 DI 10.1016/j.specom.2007.10.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 289GJ UT WOS:000255042700004 ER PT J AU Lu, XG Dang, JW AF Lu, Xugang Dang, Jianwu TI An investigation of dependencies between frequency components and speaker characteristics for text-independent speaker identification SO SPEECH COMMUNICATION LA English DT Article DE speaker identification; physiological features; speech production; Fisher's F-ratio; mutual information; frequency warping ID ACOUSTIC CHARACTERISTICS; RECOGNITION; MODELS AB The features used for speech recognition are expected to emphasize linguistic information while suppressing individual differences.
For speaker recognition, in contrast, features should preserve individual information and attenuate the linguistic information at the same time. In most studies, however, identical acoustic features are used for the different missions of speaker and speech recognition. In this paper, we first investigated the relationships between the frequency components and the vocal tract based on speech production. We found that the individual information is encoded non-uniformly in different frequency bands of speech sound. Then we adopted Fisher's F-ratio and information-theoretic mutual information measures to quantify the dependencies between frequency components and individual characteristics, based on a speaker recognition database (NTT-VR). From the analysis, we not only confirmed the finding of non-uniform distribution of individual information in different frequency bands from the speech production point of view, but also quantified their dependencies. Based on the quantification results, we proposed a new physiological feature which emphasizes individual information for text-independent speaker identification by using a non-uniform subband processing strategy that highlights the physiological information involved in speech production. The new feature was combined with GMM speaker models and applied to the NTT-VR speaker recognition database. Speaker identification using the proposed feature reduced the identification error rate by 20.1% compared with the MFCC feature. The experimental results confirmed that emphasizing the features from highly individual-dependent frequency bands is valid for improving speaker recognition performance. (c) 2007 Elsevier B.V. All rights reserved. C1 [Lu, Xugang; Dang, Jianwu] Japan Adv Inst Sci & Technol, Nomi, Ishikawa 9231292, Japan. RP Lu, XG (reprint author), Japan Adv Inst Sci & Technol, 1-1 Asahidai, Nomi, Ishikawa 9231292, Japan. EM xugang@jaist.ac.jp; jdang@jaist.ac.jp CR ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155 Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714 Cover T M, 1991, ELEMENTS INFORM THEO Dang J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607763 DANG JW, 1994, J ACOUST SOC AM, V96, P2088, DOI 10.1121/1.410150 Dang JW, 1996, J ACOUST SOC AM, V100, P3374, DOI 10.1121/1.416978 Dang JW, 1997, J ACOUST SOC AM, V101, P456, DOI 10.1121/1.417990 HAYAKAWA S, 1995, P ICASSP1994, P140 HE J, 1995, P EUROSPEECH 95 SEPT, V1, P313 MATSUI T, 1992, P INT C AC SPEECH SI, V2, P157 MIYAJIMA C, 1999, P EUROSPEECH1999, P782 Rabiner L, 1993, FUNDAMENTALS SPEECH REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Stevens K.N., 1998, ACOUSTIC PHONETICS Suzuki H., 1990, P ICSLP90, P437 Takemoto H, 2006, J ACOUST SOC AM, V120, P2228, DOI 10.1121/1.2261270 Webb A. R., 2002, STAT PATTERN RECOGNI, V2nd WOLF JJ, 1972, J ACOUST SOC AM, V51, P2044, DOI 10.1121/1.1913065 YOUNG S, 1992, HTK TUTORIAL BOOK NR 19 TC 18 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2008 VL 50 IS 4 BP 312 EP 322 DI 10.1016/j.specom.2007.10.005 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 289GJ UT WOS:000255042700005 ER PT J AU Harnsberger, JD Wright, R Pisoni, DB AF Harnsberger, James D. Wright, Richard Pisoni, David B.
TI A new method for eliciting three speaking styles in the laboratory SO SPEECH COMMUNICATION LA English DT Article DE speaking styles; speech perception; speaking rate; vowel dispersion ID COMPUTER ERROR RESOLUTION; HARD-OF-HEARING; CONVERSATIONAL SPEECH; CLEAR SPEECH; CROSS-LANGUAGE; SPOKEN WORDS; INTELLIGIBILITY; MEMORY; RECOGNITION; PERCEPTION AB In this study, a method was developed to elicit three different speaking styles, reduced, citation, and hyperarticulated, using controlled sentence materials in a laboratory setting. In the first set of experiments, the reduced style was elicited by having 12 talkers read a sentence while carrying out a distractor task that involved recalling from short-term memory an individually-calibrated number of digits. The citation style corresponded to read speech in the laboratory. The hyperarticulated style was elicited by prompting talkers (twice) to reread the sentences more carefully. The results of perceptual tests with naive listeners and an acoustic analysis showed that 6 of the 12 talkers produced a reduced style of speech for the test sentences in the distractor task relative to the same sentences in the citation style condition. In addition, all talkers consistently produced sentences in the citation and hyperarticulated styles. In the second set of experiments, the reduced style was elicited by increasing the number of digits in the distractor task by one (a heavier cognitive load). The procedures for eliciting citation and hyperarticulated sentences remained unchanged. Ten talkers were recorded in the second experiment. The results showed that 6 out of 10 talkers differentiated all three styles as predicted. In addition, all talkers consistently produced sentences in the citation and hyperarticulated styles. Overall, the results demonstrate that it is possible to elicit controlled sentence stimulus materials varying in speaking style in a laboratory setting, although the method requires further refinement to elicit these styles more consistently from individual participants. Published by Elsevier B.V. C1 [Harnsberger, James D.; Wright, Richard; Pisoni, David B.] Indiana Univ, Speech Res Lab, Dept Psychol & Brain Sci, Bloomington, IN 47405 USA. RP Harnsberger, JD (reprint author), Univ Florida, Gainesville, FL 32610 USA. 
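One simple way to quantify the style differences reported above is a vowel dispersion measure; the sketch below uses hypothetical formant values and one common definition of dispersion (mean Euclidean distance of F1/F2 tokens from the talker's vowel-space centroid), which is not necessarily the paper's exact analysis, and the vowel space is expected to shrink in the reduced style.

    import numpy as np

    def vowel_dispersion(formants_hz):
        # Mean Euclidean distance of (F1, F2) tokens from the talker's
        # vowel-space centroid; larger values = a more expanded space.
        f = np.asarray(formants_hz, dtype=float)
        return float(np.mean(np.linalg.norm(f - f.mean(axis=0), axis=1)))

    # Hypothetical formant measurements (Hz) for the same four vowels:
    citation = [(300.0, 2300.0), (700.0, 1200.0), (600.0, 1900.0), (350.0, 900.0)]
    reduced = [(400.0, 2000.0), (620.0, 1350.0), (560.0, 1800.0), (430.0, 1150.0)]
    print(vowel_dispersion(citation) > vowel_dispersion(reduced))  # True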
EM jharns@ufl.edu CR Aylett M, 2004, LANG SPEECH, V47, P31 BADDELEY AD, 1975, J VERB LEARN VERB BE, V14, P575, DOI 10.1016/S0022-5371(75)80045-4 Bates RA, 2007, SPEECH COMMUN, V49, P83, DOI 10.1016/j.specom.2006.10.007 Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5 Bradlow AR, 1999, PERCEPT PSYCHOPHYS, V61, P206, DOI 10.3758/BF03206883 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 BRINK J, 1998, RES SPOKEN LANGUAGE, V22, P396 BYRD D, 1994, SPEECH COMMUN, V15, P39, DOI 10.1016/0167-6393(94)90039-6 CAVANAGH JP, 1972, PSYCHOL REV, V79, P525, DOI 10.1037/h0033482 DUEZ D, 1992, SPEECH COMMUN, V11, P417, DOI 10.1016/0167-6393(92)90047-B FERNALD A, 1989, J CHILD LANG, V16, P477 Hirschberg J, 2004, SPEECH COMMUN, V43, P155, DOI 10.1016/j.specom.2004.01.006 Hirschberg JG, 1996, NATO ADV SCI I A-LIF, V286, P293 JOHNSON K, 1993, LANGUAGE, V69, P505, DOI 10.2307/416697 Jurafsky Daniel, 2001, FREQUENCY EMERGENCE, P229, DOI 10.1075/tsl.45.13jur KALIKOW DN, 1977, J ACOUST SOC AM, V61, P1337, DOI 10.1121/1.381436 KOSTER S, 2001, ICASSP IEEE INT C AC, V2, P873 KRULL D, 1989, CONSONANT VOWEL COAR, P101 Kuhl PK, 1997, SCIENCE, V277, P684, DOI 10.1126/science.277.5326.684 Labov William, 1972, SOCIOLINGUISTIC PATT Labov William, 1984, LANGUAGE USE READING, P28 LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375 Lindblom B, 1996, J ACOUST SOC AM, V99, P1683, DOI 10.1121/1.414691 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 Liu Y, 2006, IEEE T AUDIO SPEECH, V14, P1526, DOI 10.1109/TASL.2006.878255 Milroy L., 1987, OBSERVING ANAL NATUR MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 MULLENNIX JW, 1990, PERCEPT PSYCHOPHYS, V61, P206 NYGAARD LC, 1995, PERCEPT PSYCHOPHYS, V57, P989, DOI 10.3758/BF03205458 Oberauer K, 2006, J MEM LANG, V55, P601, DOI 10.1016/j.jml.2006.08.009 OSTENDORF M, 1996, 1996 CLSP JHU WORKSH Oviatt S, 1998, SPEECH COMMUN, V24, P87, DOI 10.1016/S0167-6393(98)00005-3 Oviatt S, 1998, J ACOUST SOC AM, V104, P3080, DOI 10.1121/1.423888 PAYTON KL, 1994, J ACOUST SOC AM, V95, P1581, DOI 10.1121/1.408545 PICHENY MA, 1989, J SPEECH HEAR RES, V32, P600 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434 SCHRIBERG E, 2001, J INT PHON ASSOC, V31, P153 Smiljanic R, 2005, J ACOUST SOC AM, V118, P1677, DOI 10.1121/1.2000788 SPEER SR, 1999, J ACOUST SOC AM, V106, P2275, DOI 10.1121/1.427776 Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660 SWERTS M, 1992, SPEECH COMMUN, V11, P463, DOI 10.1016/0167-6393(92)90052-9 Uchanski RM, 1996, J SPEECH HEAR RES, V39, P494 Uther M, 2007, SPEECH COMMUN, V49, P2, DOI 10.1016/j.specom.2006.10.003 WASSINK AB, 2007, J PHONETICS, V35, P353 Wright R., 2003, PAPERS LAB PHONOLOGY, P75 NR 46 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD APR PY 2008 VL 50 IS 4 BP 323 EP 336 DI 10.1016/j.specom.2007.11.001 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 289GJ UT WOS:000255042700006 ER PT J AU Shao, X Barker, J AF Shao, Xu Barker, Jon TI Stream weight estimation for multistream audio-visual speech recognition in a multispeaker environment SO SPEECH COMMUNICATION LA English DT Article DE audio-visual speech recognition; multistream; multispeaker; likelihood; artificial neural networks ID INFORMATIONAL MASKING; PERCEPTION; INTEGRATION; NOISE AB The paper considers the problem of audio-visual speech recognition in a simultaneous (target/masker) speaker environment. The paper follows a conventional multistream approach and examines the specific problem of estimating reliable time-varying audio and visual stream weights. The task is challenging because, in the two speaker condition, signal-to-noise ratio (SNR) - and hence audio stream weight - cannot always be reliably inferred from the acoustics alone. Similarity between the target and masker sound sources can cause the foreground and background to be confused. The paper presents a novel solution that combines both audio and visual information to estimate acoustic SNR. The method employs artificial neural networks to estimate the SNR from hidden Markov model (HMM) state-likelihoods calculated using separate audio and visual streams. SNR estimates are then mapped to either constant utterance-level (global) stream weights or time-varying frame-based (local) stream weights. The system has been evaluated using either gender dependent models that are specific to the target speaker, or gender independent models that discriminate poorly between target and masker. When using known SNR, the time-varying stream weight system outperforms the constant stream weight systems at all SNRs tested. It is thought that the time-varying weight allows the automatic speech recognition system to take advantage of regions where local SNRs are temporarily high despite the global SNR being low. When using estimated SNR, the time-varying system outperformed the constant stream weight system at SNRs of 0 dB and above. Systems using stream weights estimated from both audio and video information performed better than those using stream weights estimated from the audio stream alone, particularly in the gender independent case. However, when mixtures are at a global SNR below 0 dB, stream weights are not sufficiently well estimated to produce good performance. Methods for improving the SNR estimation are discussed. The paper also relates the use of visual information in the current system to its role in recent simultaneous speaker intelligibility studies, where, as well as providing phonetic content, it triggers 'informational masking release', helping the listener to attend selectively to the target speech stream. (c) 2007 Elsevier B.V. All rights reserved. C1 [Shao, Xu; Barker, Jon] Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Shao, X (reprint author), Univ Sheffield, Dept Comp Sci, 211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. EM x.shao@dcs.shef.ac.uk; j.barker@shef.ac.uk CR Adjoudani A., 1996, SPEECHREADING HUMANS, P461 Barker J., 2000, P ICSLP BEIJ CHIN, P373 Barker JP, 2005, SPEECH COMMUN, V45, P5, DOI 10.1016/j.specom.2004.05.002 Bishop C.
M., 1995, NEURAL NETWORKS PATT BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 BOURLARD H, 1996, P ICSLP 96 PHIL PA BRADSKI GR, 1998, INTEL TECHNOL J Q, V2, P43 Brungart DS, 2001, J ACOUST SOC AM, V109, P1101, DOI 10.1121/1.1345696 Chibelushi CC, 2002, IEEE T MULTIMEDIA, V4, P23, DOI 10.1109/6046.985551 Cooke M, 2006, J ACOUST SOC AM, V120, P2421, DOI 10.1121/1.2229005 COX S, 1997, P AUD SPEECH PROC AV DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Dupont S, 2000, IEEE T MULTIMEDIA, V2, P141, DOI 10.1109/6046.865479 Fu SL, 2005, IEEE T MULTIMEDIA, V7, P243, DOI 10.1109/TMM.2005.843341 GARG A, 2003, P ICASSP, P24 GLOTIN H, 2001, P INT C AC SPEECH SI, P173 GOLDSCHEN AJ, 1993, THESIS GEORGE WASHIN GRAVIER G, 2002, P ICASSP 2002 ORL FL GURBUZ S, 2002, P ICASSP 2002 ORL FL HALL JW, 1984, J ACOUST SOC AM, V76, P50, DOI 10.1121/1.391005 Helfer KS, 2005, J ACOUST SOC AM, V117, P842, DOI [10.1121/1.1836832, 10.1121/1.183682] HERMANSKY H, 1998, P ICSLP 1998 SYDN AU Hirsch H. G., 1993, TR93012 INT COMP SCI LOCKWOOD P, 1991, P EUROSPEECH 91, V1, P79 Lucey S, 2005, IEEE T MULTIMEDIA, V7, P495, DOI 10.1109/TMM.2005.846777 LUETTIN J, 2001, P ICASSP 2001 SALT L Massaro D. W., 1998, PERCEIVING TALKING F Massaro DW, 1998, AM SCI, V86, P236, DOI 10.1511/1998.25.861 Matthews I., 1998, THESIS U E ANGLIA NO MEIER U, 1996, P ICASSP 1996 ATL GA OKAWA S, 1999, P EUR C SPEECH COMM, P603 PATTERSON E, 2002, EURASIP J APPL SIG P, V11, P1189 PATTERSON EK, 2001, P AUD VIS SPEECH PRO POTAMIANOS G, 2000, P ICSLP 2000 BEIJ CH POTAMIANOS G, 1998, P IEEE INT C IM PROC Potamianos G, 2003, P IEEE, V91, P1306, DOI 10.1109/JPROC.2003.817150 Rabiner LR, 1989, P IEEE, V77, P267 Rosenblum LD, 2005, BLACKW HBK LINGUIST, P51, DOI 10.1002/9780470757024.ch3 Rudmann DS, 2003, HUM FACTORS, V45, P329, DOI 10.1518/hfes.45.2.329.27237 Schwartz JL, 2004, COGNITION, V93, pB69, DOI 10.1016/j.cognition.2004.01.006 Shewchuk Jonathan Richard, 1994, CMUCS94125 SCH COMP SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 SUMMERFIELD Q, 1979, PHONETICA, V36, P314 TAMURA S, 2005, P ICASSP 2005 PHIL P Wightman F, 2006, J ACOUST SOC AM, V119, P3940, DOI 10.1121/1.2195121 Young S., 1995, HTK BOOK NR 46 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2008 VL 50 IS 4 BP 337 EP 353 DI 10.1016/j.specom.2007.11.002 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 289GJ UT WOS:000255042700007 ER PT J AU Stent, AJ Huffman, MK Brennan, SE AF Stent, Amanda J. Huffman, Marie K. Brennan, Susan E. TI Adapting speaking after evidence of misrecognition: Local and global hyperarticulation SO SPEECH COMMUNICATION LA English DT Article DE hyperarticulation; clear speech; speaking rate; adaptation in speaking; speech recognition; spoken dialog ID CLEAR SPEECH; CONVERSATIONAL SPEECH; HEARING; SYSTEM; HARD; CUES AB In this paper we examine the two-way relationship between hyperarticulation and evidence of misrecognition of computer-directed speech. We report the results of an experiment in which speakers spoke to a simulated speech recognizer and received text feedback about what had been "recognized". At pre-determined points in the dialog, recognition errors were staged, and speakers made repairs. 
Each repair utterance was paired with the utterance preceding the staged recognition error and coded for adaptations associated with hyperarticulate speech: speaking rate and phonetically clear speech. Our results demonstrate that hyperarticulation is a targeted and flexible adaptation rather than a generalized and stable mode of speaking. Hyperarticulation increases after evidence of misrecognition and then decays gradually over several turns in the absence of further misrecognitions. When repairing misrecognized speech, speakers are more likely to clearly articulate constituents that were apparently misrecognized than those either before or after the troublesome constituents, and more likely to clearly articulate content words than function words. Finally, we found no negative impact of hyperarticulation on speech recognition performance. Published by Elsevier B.V. C1 [Stent, Amanda J.; Brennan, Susan E.] SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA. [Huffman, Marie K.] SUNY Stony Brook, Dept Linguist, Stony Brook, NY 11794 USA. [Stent, Amanda J.; Brennan, Susan E.] SUNY Stony Brook, Dept Psychol, Stony Brook, NY 11794 USA. RP Stent, AJ (reprint author), SUNY Stony Brook, Dept Comp Sci, Stony Brook, NY 11794 USA. EM amanda.stent@stonybrook.edu; marie.huffman@stonybrook.edu; susan.brennan@stonybrook.edu CR Allen JF, 2001, AI MAG, V22, P27 Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Bell A, 2003, J ACOUST SOC AM, V113, P1001, DOI 10.1121/1.1534836 BOHUS D, 2005, P 6 SIGDIAL WORKSH D, P128 Bradlow A. R, 2002, PAPERS LAB PHONOLOGY, V7, P241 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 Brennan S. E., 1991, User Modeling and User-Adapted Interaction, V1, DOI 10.1007/BF00158952 Brennan S, 1996, P INT S SPOK DIAL, P41 Brennan SE, 1996, J EXP PSYCHOL LEARN, V22, P1482, DOI 10.1037/0278-7393.22.6.1482 Bulyko I, 2005, SPEECH COMMUN, V45, P271, DOI 10.1016/j.specom.2004.09.009 CHOULARTON S, 2004, P 10 AUSTR INT C SPE, P457 Cohen J., 1988, STAT POWER ANAL BEHA, V2nd Core M. G., 1999, Psychological Models of Communication in Collaborative Systems.
Papers from the 1999 AAAI Fall Symposium (TR FS-99-03) CUTLER A, 1990, SPEECH COMMUN, V9, P485, DOI 10.1016/0167-6393(90)90024-4 FERGUSON CA, 1975, ANTHROPOL LINGUIST, V17, P1 FERNALD A, 1984, DEV PSYCHOL, V20, P104, DOI 10.1037//0012-1649.20.1.104 Ferreira F, 2002, CURR DIR PSYCHOL SCI, V11, P11, DOI 10.1111/1467-8721.00158 GIESELMANN P, 2006, P 24 C NAT LANG PROC, P24 Gorin AL, 2002, COMPUTER, V35, P51, DOI 10.1109/MC.2002.993771 HARNSBERGER, 2000, 24 RES SPOK LANG PRO Hirschberg J, 2004, SPEECH COMMUN, V43, P155, DOI 10.1016/j.specom.2004.01.006 Hirschberg J., 1999, P AUT SPEECH REC UND, P349 HIRSCHBERG J, 2000, P INT C SPOK LANG PR, V1, P254 HOCKEY BA, 2003, P 10 C EUR CHAPT ACL, P147 Huang X., 2001, SPOKEN LANGUAGE PROC JOHNSON K, 1993, LANGUAGE, V69, P505, DOI 10.2307/416697 KIRCHHOFF K, 2001, P NAACL WORKSH AD DI KNIGHT S, 2001, P EUR 2001 INT SPEEC, P1779 Kraljic T, 2005, COGNITIVE PSYCHOL, V50, P194, DOI 10.1016/j.cogpsych.2004.08.002 Krause JC, 2004, J ACOUST SOC AM, V115, P362, DOI 10.1121/1.1635842 LEVOW GA, 1999, P ESCA WORKSH DIAL P LEVOW GA, 1998, P 36 ANN M ASS COMP, P736 Litman D, 2006, COMPUT LINGUIST, V32, P417, DOI 10.1162/coli.2006.32.3.417 MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 Oviatt S, 1998, J ACOUST SOC AM, V104, P3080, DOI 10.1121/1.423888 OVIATT SL, 1998, SPEECH COMMUN, V24, P1 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434 Rudnicky A.I, 2002, P INT C SPOK LANG PR, P341 Schmandt C., 1982, P HUMAN FACTORS COMP, P363, DOI 10.1145/800049.801812 SCHMANDT C, 1984, IEEE T CONSUM ELECTR, V30, pR21, DOI 10.1109/TCE.1984.354042 Shriberg E., 1992, P DARPA SPEECH NAT L, P49, DOI 10.3115/1075527.1075538 SIKVELAND RO, 2006, P FON 2006 CTR LANG, P109 SOLTAU H, 2000, P INT C SPOK LANG PR, V4, P105 SOLTAU H, 1998, P ICSLP98, P229 SOLTAU H, 2000, P IEEE INT C APPL SP, V3, P1779 Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660 WADE E, 1992, P INT C SPOK LANG PR, V2, P995 Whalen DH, 2004, LANG SPEECH, V47, P155 NR 49 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2008 VL 50 IS 3 BP 163 EP 178 DI 10.1016/j.specom.2007.07.005 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 277UJ UT WOS:000254239900001 ER PT J AU Mokhtari, P Takemoto, H Kitamura, T AF Mokhtari, Parham Takemoto, Hironori Kitamura, Tatsuya TI Single-matrix formulation of a time domain acoustic model of the vocal tract with side branches SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th Joint Meeting of the Acoustical-Society-of-America/Acoustical-Society-of-Japan CY NOV 28-DEC 02, 2006 CL Honolulu, HI SP Acoust Soc Amer, Acoust Soc Japan DE vocal tract; time domain simulation; piriform fossa; side branch; articulatory synthesis ID SPEECH; MRI; CORDS AB Although it has been found that the piriform fossae play an important role in speech production and acoustics, the popular time domain articulatory synthesizer of [Maeda, S., 1982. A digital simulation method of the vocal-tract system. Speech Comm. 1 (3-4), 199-229] currently cannot include any more than one side branch to the acoustic tube that represents the main vocal tract. To overcome this limitation, in this paper we extended Maeda's (1982) simulation method by mathematical reformulation in terms of a single-matrix equation having a system matrix that is both sparse and symmetric.
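The computational payoff of such a formulation is that all unknowns of a time step come from one sparse symmetric solve. The Python sketch below (requires SciPy) is only illustrative of that pattern; the tridiagonal stand-in matrix and source term are assumptions, not Maeda's or the authors' actual system.

    import numpy as np
    from scipy.sparse import diags
    from scipy.sparse.linalg import spsolve

    n = 8                                  # number of tube sections (stand-in)
    main = np.full(n, 2.0)                 # stand-in diagonal terms
    off = np.full(n - 1, -1.0)             # coupling of adjacent sections
    A = diags([off, main, off], [-1, 0, 1], format="csc")  # sparse and symmetric
    b = np.zeros(n)
    b[0] = 1.0                             # stand-in source term at one end
    print(spsolve(A, b))                   # all unknowns from one sparse solve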
Using vocal tract area functions measured by MRI, the simulations showed that the piriform fossae suppress the energy in the higher frequencies by introducing spectral zeros around 4-5 kHz, and also tend to lower the second formant of vowels. These spectral changes agree with results produced using a well-tested frequency domain transmission-line method, thus validating our new formulation of the time domain synthesizer. The reformulation can be easily extended to accommodate any number of vocal tract side branches, thus enabling more realistic, physiologically correct acoustic simulation of speech production. (c) 2007 Elsevier B.V. All rights reserved. C1 [Mokhtari, Parham; Takemoto, Hironori; Kitamura, Tatsuya] ATR, Cognitive Informat Sci Labs, Kyoto 6190288, Japan. RP Mokhtari, P (reprint author), ATR, Cognitive Informat Sci Labs, 2-2-2 Hikaridai, Kyoto 6190288, Japan. EM parham@atr.jp CR Adachi S, 1999, J ACOUST SOC AM, V105, P2920, DOI 10.1121/1.426905 Badin P., 1984, NOTES VOCAL TRACT CO, P53 Birkholz P., 2004, P INT 2004 ICSLP JEJ, P1125 BIRKHOLZ P, 2006, P INT C AC SPEECH SI, V1, P873 DANG JW, 1994, J ACOUST SOC AM, V96, P2088, DOI 10.1121/1.410150 Dang JW, 1997, J ACOUST SOC AM, V101, P456, DOI 10.1121/1.417990 Engwall O, 2003, SPEECH COMMUN, V41, P303, DOI [10.1016/S0167-6393(02)00132-2, 10.1016/S0167-6393(03)00132-2] Fant G., 1960, ACOUSTIC THEORY SPEE Flanagan J., 1972, SPEECH ANAL SYNTHESI FLANAGAN JL, 1975, AT&T TECH J, V54, P485 FLANAGAN JL, 1970, IEEE SPECTRUM, V7, P22 Honda K, 2004, IEICE T INF SYST, VE87D, P1050 ISHIZAKA K, 1972, AT&T TECH J, V51, P1233 Jackson MTT, 2001, J ACOUST SOC AM, V109, P2983, DOI 10.1121/1.1370526 Kitamura T., 2005, Acoustical Science and Technology, V26, DOI 10.1250/ast.26.16 KITAMURA T, 2006, J ACOUST SOC AM, V120, P3037 Maeda S., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90017-6 Mokhtari P, 2007, J PHONETICS, V35, P20, DOI 10.1016/j.wocn.2006.01.001 SHADLE CH, 2001, P 4 ISCA TUT RES WOR, P121 SONDHI MM, 1987, IEEE T ACOUST SPEECH, V35, P955 Takemoto H, 2006, J ACOUST SOC AM, V120, P2228, DOI 10.1121/1.2261270 Takemoto H, 2006, J ACOUST SOC AM, V119, P1037, DOI 10.1121/1.2151823 NR 22 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2008 VL 50 IS 3 BP 179 EP 190 DI 10.1016/j.specom.2007.08.001 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 277UJ UT WOS:000254239900002 ER PT J AU Sinha, R Umesh, S AF Sinha, Rohit Umesh, S. TI A shift-based approach to speaker normalization using non-linear frequency-scaling model SO SPEECH COMMUNICATION LA English DT Article DE vocal-tract length normalization; frequency-warping; linear transformation of cepstra ID VOCAL-TRACT NORMALIZATION; TRANSFORMATION; RECOGNITION; LENGTH; SPEECH AB In this work, we present a speaker-normalization method based on the idea that the speaker-dependent scale-factor can be separated out as a fixed translation factor in an alternate domain. We also introduce a non-linear frequency-scaling model motivated by the analysis of speech data. The proposed shift-based normalization approach is implemented using a maximum-likelihood (ML) search for the translation factor in the alternate domain.
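The ML search for a translation factor can be sketched as a simple grid search (illustrative only; the stand-in Gaussian "model", integer-bin shifts, and feature dimensions are assumptions): a speaker-dependent scaling of the frequency axis appears as a shift on a warped axis, and the shift that maximises the model likelihood is selected.

    import numpy as np

    def best_shift(obs, log_like, shifts):
        # ML grid search over candidate translations of the (warped)
        # feature axis; integer-bin shifts stand in for the real search.
        return max(shifts, key=lambda s: log_like(np.roll(obs, s)))

    rng = np.random.default_rng(0)
    template = rng.normal(size=64)              # stand-in model mean
    obs = np.roll(template, 3) + 0.05 * rng.normal(size=64)  # shifted speaker
    log_like = lambda f: -np.sum((f - template) ** 2)  # Gaussian log-lik (up to const)
    print(best_shift(obs, log_like, range(-8, 9)))     # recovers -3, undoing the shift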
The advantage of our approach is that we are able to show the relationship between conventional frequency-warping based vocal-tract length normalization (VTLN) methods and the methods based on shifts in a psycho-acoustic scale, thus providing a unifying framework for speaker normalization. Additionally, in our approach it is simple to show that the shifting required for normalization can be expressed as a linear transformation in the cepstral domain. This is important for computational efficiency since we do not have to recompute the features by re-doing the signal processing for each scale/translation factor as is usually done in conventional normalization. We present recognition results using our proposed approach on a digit recognition task and show that the non-linear scaling model provides a relative improvement of 4% for adults and 7.5% for children when compared to the linear-scaling model. (c) 2007 Elsevier B.V. All rights reserved. C1 [Sinha, Rohit; Umesh, S.] Indian Inst Technol, Dept Elect Engn, Kanpur 208016, Uttar Pradesh, India. RP Sinha, R (reprint author), Indian Inst Technol, Dept Elect Engn, Kanpur 208016, Uttar Pradesh, India. EM rsinha@iitg.ernet.in; sumesh@iitk.ac.in CR ANDREOU A, 1994, P CAIP WORKSHOP FRON, V2 BLADON RAW, 1984, LANG COMMUN, V4, P59, DOI 10.1016/0271-5309(84)90019-3 BURNETT DC, 1996, P INT C AC SPEECH SI Claes T, 1998, IEEE T SPEECH AUDI P, V6, P549, DOI 10.1109/89.725321 EIDE E, 1996, P ICASSP, P346 FANT G, 1975, STL QPSR, P1 FUJIMURA O, 1995, J ACOUST SOC AM, V49, P3099 HIRSCH HG, 2000, ISCA ITRW ASRU 2000 ISELI M, 2004, P IEEE ICASSP 04 BOS Lee L, 1998, IEEE T SPEECH AUDI P, V6, P49, DOI 10.1109/89.650310 MANNELL RH, 1998, P IEEE ICSLP 98 SYDN McDonough J., 1998, P INT C SPOK LANG PR, V16, P2307 NUTTALL AH, 1982, P IEEE, V70, P1115, DOI 10.1109/PROC.1982.12435 ONO Y, 1993, P EUROSPEECH, P21 OPPENHEI.AV, 1972, PR INST ELECTR ELECT, V60, P681, DOI 10.1109/PROC.1972.8727 Pitz M, 2005, IEEE T SPEECH AUDI P, V13, P930, DOI 10.1109/TSA.2005.848881 PITZ M, 2001, P EUROSPEECH 01 SEPT, V4, P2653 Potamianos A, 2003, IEEE T SPEECH AUDI P, V11, P603, DOI 10.1109/TSA.2003.818026 STEVENS SS, 1940, AM J PSYCHOL, V53 Umesh S, 2002, IEEE SIGNAL PROC LET, V9, P104, DOI 10.1109/97.995829 UMESH S, 2002, P IEEE ICASSP 02 MAY, V1, P517 UMESH S, IN PRESS IEEE T AUDI UMESH S, 2005, P INT C SPOK LANG PR UMESH S, 2002, J ACOUST SOC AM, V3, P83 UMESH S, 2004, P IEEE INT C AC SPEE, P345 VANBEKESY G, 1951, HDB EXPT PSYCHOL WAKITA H, 1977, IEEE T ACOUST SPEECH, V25, P183, DOI 10.1109/TASSP.1977.1162929 WEGMANN S, 1996, P ICASSP, V1, P339 Welling L, 2002, IEEE T SPEECH AUDI P, V10, P415, DOI 10.1109/TSA.2002.803435 ZHAN P, 1997, P ICASSP, P1039 ZHAN P, 1997, CMUCS1971148 CARN ME NR 31 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD MAR PY 2008 VL 50 IS 3 BP 191 EP 202 DI 10.1016/j.specom.2007.08.002 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 277UJ UT WOS:000254239900003 ER PT J AU Nakatani, T Amano, S Irino, T Ishizuka, K Kondo, T AF Nakatani, Tomohiro Amano, Shigeaki Irino, Toshio Ishizuka, Kentaro Kondo, Tadahisa TI A method for fundamental frequency estimation and voicing decision: Application to infant utterances recorded in real acoustical environments SO SPEECH COMMUNICATION LA English DT Article DE fundamental frequency estimation; voicing decision; infant speech; instantaneous frequency; dominance spectrum ID PITCH; SPEECH; EXTRACTION AB This paper proposes a method for fundamental frequency (F0) estimation and voicing decision that can handle wide-ranging speech signals including adult and infant utterances recorded in real noisy environments. In particular, infant utterances have unique characteristics that are different from those of adults, such as a wide F0 range, abrupt F0 transitions, and unique energy distribution patterns over frequencies. Therefore, conventional methods that were developed mainly for adult utterances do not necessarily work well for infant utterances, especially when the signals are contaminated by background noise. Several techniques are introduced into the proposed method to cope with this problem. We show that the ripple-enhanced power spectrum based method (REPS) can estimate the F0s robustly, and that the use of instantaneous frequency (IF) enables us to refine the accuracy of the F0 estimates. In addition, the degree of dominance, defined based on the IF, is introduced as a robust voicing decision measure. The effectiveness of the proposed method is confirmed in terms of gross pitch errors and voicing decision errors in comparison with the recently proposed methods, Praat and YIN, using both longitudinal recordings of Japanese infant utterances and adult utterances. (c) 2007 Elsevier B.V. All rights reserved. C1 [Nakatani, Tomohiro; Amano, Shigeaki; Ishizuka, Kentaro; Kondo, Tadahisa] NTT Corp, NTT Commun Sci Labs, Kyoto 6190237, Japan. [Irino, Toshio] Wakayama Univ, Fac Engn Sci, Wakayama, Japan. RP Nakatani, T (reprint author), NTT Corp, NTT Commun Sci Labs, 2-4 Hikaridai, Kyoto 6190237, Japan.
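The two evaluation measures named in the abstract above, gross pitch errors and voicing decision errors, are standard; a minimal sketch under common definitions (a 20% deviation threshold on frames that both the reference and the estimate call voiced for GPE, and the fraction of wrongly labelled frames for VDE; the threshold and test values are illustrative):

    def gpe_vde(ref_f0, est_f0, tol=0.2):
        # ref_f0, est_f0: per-frame F0 in Hz, with 0 meaning unvoiced.
        # GPE: share of commonly-voiced frames deviating by more than tol;
        # VDE: share of all frames with a wrong voiced/unvoiced label.
        both = [(r, e) for r, e in zip(ref_f0, est_f0) if r > 0 and e > 0]
        gpe = sum(abs(e - r) / r > tol for r, e in both) / max(len(both), 1)
        vde = sum((r > 0) != (e > 0) for r, e in zip(ref_f0, est_f0)) / len(ref_f0)
        return gpe, vde

    print(gpe_vde([0, 110, 220, 230, 0], [0, 112, 180, 0, 100]))  # (0.0, 0.4)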
EM nak@cslab.kecl.ntt.co.jp CR ABE T, 1997, P ASVA 97, P423 Ahmadi S, 1999, IEEE T SPEECH AUDI P, V7, P333, DOI 10.1109/89.759042 Amano S, 2006, J ACOUST SOC AM, V119, P1636, DOI 10.1121/1.2161443 Atake Y., 2000, P ICSLP 2000, V2, P907 Boersma P., 2005, PRAAT DOING PHONETIC Boersma P., 1993, P I PHONETIC SCI, V17, P97 CHARPENTIER FJ, 1986, P ICASSP 86 TOK Crystal D., 1986, LANG ACQUIS, P174 de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024 *ETSI, 2003, 202 050 V113 ETSI ES FLANAGAN JL, 1966, AT&T TECH J, V45, P1493 Hess W., 1983, PITCH DETERMINATION Ishizuka K., 2006, P SAPA 06 SEPT, P65 Ishizuka K, 2007, J ACOUST SOC AM, V121, P2272, DOI 10.1121/1.2535806 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Kuhl PK, 1996, J ACOUST SOC AM, V100, P2425, DOI 10.1121/1.417951 LIU D, 2001, IEEE T SPEECH AUDIO, V9 Martin P., 1982, P ICASSP 82, V7, P180 Nakatani T, 2004, J ACOUST SOC AM, V116, P3690, DOI 10.1121/1.1787522 NOLL AM, 1967, J ACOUST SOC AM, V41, P293, DOI 10.1121/1.1910339 RABINER LR, 1976, IEEE T ACOUST SPEECH, V24, P399, DOI 10.1109/TASSP.1976.1162846 SCHROEDE.MR, 1968, J ACOUST SOC AM, V43, P829, DOI 10.1121/1.1910902 Shimamura T., 2001, IEEE T SPEECH AUDIO, V9 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 SONDHI MM, 1968, IEEE T ACOUST SPEECH, VAU16, P262, DOI 10.1109/TAU.1968.1161986 WISE JD, 1976, IEEE T ACOUST SPEECH, V24, P418, DOI 10.1109/TASSP.1976.1162852 NR 26 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2008 VL 50 IS 3 BP 203 EP 214 DI 10.1016/j.specom.2007.09.003 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 277UJ UT WOS:000254239900004 ER PT J AU Toda, T Black, AW Tokuda, K AF Toda, Tomoki Black, Alan W. Tokuda, Keiichi TI Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model SO SPEECH COMMUNICATION LA English DT Article DE articulatory-to-acoustic mapping; acoustic-to-articulatory inversion mapping; GMM; MMSE; dynamic features ID SPEECH PRODUCTION-MODEL; VOCAL-TRACT; HMM; INVERSION AB In this paper, we describe a statistical approach to both an articulatory-to-acoustic mapping and an acoustic-to-articulatory inversion mapping without using phonetic information. The joint probability density of an articulatory parameter and an acoustic parameter is modeled using a Gaussian mixture model (GMM) based on a parallel acoustic-articulatory speech database. We apply the GMM-based mapping using the minimum mean-square error (MMSE) criterion, which has been proposed for voice conversion, to the two mappings. Moreover, to improve the mapping performance, we apply maximum likelihood estimation (MLE) to the GMM-based mapping method. A target parameter trajectory with appropriate static and dynamic properties is determined by imposing an explicit relationship between static and dynamic features in the MLE-based mapping. Experimental results demonstrate that the MLE-based mapping with dynamic features can significantly improve the mapping performance compared with the MMSE-based mapping in both the articulatory-to-acoustic mapping and the inversion mapping. (c) 2007 Elsevier B.V. All rights reserved. C1 [Toda, Tomoki] Nara Inst Sci & Technol, Grad Sch Informat Sci, Ikoma, Nara 6030192, Japan. [Black, Alan W.] Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA.
[Tokuda, Keiichi] Nagoya Inst Technol, Grad Sch Engn, Showa Ku, Nagoya, Aichi 4668555, Japan. RP Toda, T (reprint author), Nara Inst Sci & Technol, Grad Sch Informat Sci, 8916-5 Takayama, Ikoma, Nara 6030192, Japan. EM tomoki@is.naist.jp; awb@cs.cmu.edu; tokuda@ics.nitech.ac.jp CR ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 CHU M, 2001, P ICASSP SALT LAK CI, P785 Frankel J., 2000, P ICSLP, V4, P254 Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636 Hiroya S, 2004, IEICE T INF SYST, VE87D, P1071 Hogden J, 1996, J ACOUST SOC AM, V100, P1819, DOI 10.1121/1.416001 Hunt A. J., 1996, P ICASSP 96, P373 Kaburagi T., 1998, P INT C SPOK LANG PR, P433 Kain A, 1998, INT CONF ACOUST SPEE, P285, DOI 10.1109/ICASSP.1998.674423 Kain A., 2004, P 5 ISCA SPEECH SYNT, P25 Kawahara H., 1999, P EUR 99, P2781 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Kawai H, 2004, P 5 ISCA SPEECH SYNT, P179 Kello CT, 2004, J ACOUST SOC AM, V116, P2354, DOI 10.1121/1.1715112 MINAMI Y, 2004, P INT C SPOK LANG PR, P549 Nakamura K., 2006, P ICASSP, P93 PARK KY, 2000, P ICASSP, P1847 Richmond K, 2003, COMPUT SPEECH LANG, V17, P153, DOI 10.1016/S0885-2308(03)00005-6 Richmond K., 2001, THESIS U EDINBURGH RICHMOND K, 2006, P INTERSPEECH PITTSB, P577 Sagisaka Y., 1988, P INT C AC SPEECH SI, P679 SCHROEDE.MR, 1967, J ACOUST SOC AM, V41, P1002, DOI 10.1121/1.1910429 Schroeter J., 1992, ADV SPEECH SIGNAL PR, P231 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 Shiga Y., 2004, P 5 ISCA SPEECH SYNT, P19 SONDHI MM, 2002, IEEE 2002 WORKSH SPE Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472 Suzuki S., 1998, P INT C SPOK LANG PR, P2251 Syrdal A. K., 2000, P ICSLP BEIJ CHIN, P410 Toda T., 2004, P 5 ISCA SPEECH SYNT, P31 Toda T., 2004, P ICSLP JEJ ISL KOR, P1129 TOKUDA K, 2000, P ICASSP, V3, P1315 Wrench A., 1999, MOCHA TIMIT ARTICULA Wrench A. A., 2000, P ICSLP BEIJ CHIN, P145 Zen H, 2007, COMPUT SPEECH LANG, V21, P153, DOI 10.1016/j.csl.2006.01.002 Zheng Y., 2003, P ASRU, P249 NR 36 TC 64 Z9 67 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2008 VL 50 IS 3 BP 215 EP 227 DI 10.1016/j.specom.2007.09.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 277UJ UT WOS:000254239900005 ER PT J AU Subramanya, A Zhang, Z Liu, Z Acero, A AF Subramanya, Amarnag Zhang, Zhengyou Liu, Zicheng Acero, Alex TI Multisensory processing for speech enhancement and magnitude-normalized spectra for speech modeling SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; speech modeling; multi-sensory processing; dynamic Bayesian AB In this paper, we tackle the problem of speech enhancement from two fronts: speech modeling and multisensory input. We present a new speech model based on statistics of magnitude-normalized complex spectra of speech signals. By performing magnitude normalization, we are able to get rid of huge intra- and inter-speaker variation in speech energy and to build a better speech model with a smaller number of Gaussian components. To deal with real-world problems with multiple noise sources, we propose to use multiple heterogeneous sensors, and in particular, we have developed microphone headsets that combine a conventional air microphone and a bone sensor.
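The magnitude normalization just described can be sketched as scaling each frame's complex spectrum to unit total magnitude before model fitting (illustrative only; the framing parameters and the white-noise test signal are assumptions, not the paper's actual front end):

    import numpy as np

    def mag_normalized_spectra(x, frame=256, hop=128):
        # Complex STFT frames scaled to unit total magnitude, removing
        # intra- and inter-speaker energy variation before model fitting.
        win = np.hanning(frame)
        frames = []
        for s in range(0, len(x) - frame + 1, hop):
            spec = np.fft.rfft(x[s:s + frame] * win)
            frames.append(spec / (np.sum(np.abs(spec)) + 1e-12))
        return np.array(frames)

    rng = np.random.default_rng(1)
    loud = rng.normal(size=2048)
    quiet = 0.01 * loud                    # same signal, 40 dB lower energy
    print(np.allclose(mag_normalized_spectra(loud),
                      mag_normalized_spectra(quiet)))  # True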
The bone sensor makes direct contact with the speaker's temple (area behind the ear), and captures the vibrations of the bones and skin during the process of vocalization. The signals captured by the bone microphone, though distorted, contain useful audio information, especially in the low frequency range, and more importantly, they are very robust to external noise sources (stationary or not). By fusing the bone channel signals with the air microphone signals, much improved speech signals have been obtained. (c) 2007 Elsevier B.V. All rights reserved. C1 [Subramanya, Amarnag] Univ Washington, Signal Speech & Language Interpret Lab, Seattle, WA 98195 USA. [Zhang, Zhengyou; Liu, Zicheng; Acero, Alex] Microsoft Res, Redmond, WA 98052 USA. RP Subramanya, A (reprint author), Univ Washington, Signal Speech & Language Interpret Lab, Seattle, WA 98195 USA. EM asubram@ee.washington.edu CR Beal MJ, 2003, IEEE T PATTERN ANAL, V25, P828, DOI 10.1109/TPAMI.2003.1206512 BILMES J, UWEETR20010005, P2001 BILMES J, 2000, DYNAMIC BAYESIAN MUL CHEN T, 2002, P INT C SPOK LANG PR Deller J., 1999, DISCRETE TIME PROCES EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947 EPHRAIM Y, 2005, CRC ELECT ENG HDB Ephraim Y., 1984, IEEE T ACOUST SPEECH, V32, P109 Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367 Graciarena M, 2003, IEEE SIGNAL PROC LET, V10, P72, DOI [10.1109/LSP.2003.808549, 10.1109/LSP.2002.808549] HERACLEOUS P, 2003, P IEEE AUT SPEECH RE JEANNES WLB, 2001, IEEE T SPEECH AUDIO, V9 JORDAN MI, 2002, GRAPHICAL MODELS PRO KLEIN D, 2002, P WORKSH BEMP METH N LEE BG, 1995, SIGNAL PROCESS, V46, P1, DOI 10.1016/0165-1684(95)00068-O LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Liu Z., 2004, P IEEE 6 WORKSH MULT, P363 LIU Z, 2005, P IEEE INT C AC SPEE, V1, P1093 LOTTER T, 2003, P INT C AC SPEECH SI Meyer J., 1997, P INT C AC SPEECH SI NANDKUMAR S, 1995, IEEE T SPEECH AUDI P, V3, P22, DOI 10.1109/89.365384 PALIWAL KK, 1987, P INT C AC SPEECH SI Quatieri T. F., 2002, DISCRETE TIME SPEECH STRAND OM, 2003, P IEEE AUT SPEECH RE SUBRAMANYA A, 2005, P EUR EUR C SPEECH C WU J, 2003, P IEEE AUT SPEECH RE YOSHIZAWA S, 2004, P INT C AC SPEECH SI Zhang Z., 2004, P IEEE INT C AC SPEE, VIII, P781 ZHAO DY, 2005, P EUR EUR C SPEECH C Zheng Y., 2003, P ASRU, P249 ZWEIG G, 2002, P INT C AC SPEECH SI NR 31 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2008 VL 50 IS 3 BP 228 EP 243 DI 10.1016/j.specom.2007.09.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 277UJ UT WOS:000254239900006 ER PT J AU Hirsch, HG Finster, H AF Hirsch, Hans-Guenter Finster, Harald TI A new approach for the adaptation of HMMs to reverberation and background noise SO SPEECH COMMUNICATION LA English DT Article DE robust speech recognition; HMM adaptation; hands-free speech input; reverberation ID FREE SPEECH RECOGNITION AB In practical application scenarios of speech recognition systems, several distortion effects exist that have a major influence on the speech signal and can considerably deteriorate the recognition performance. So far, mainly the influence of stationary background noise and of unknown frequency characteristics has been studied. A further distortion effect is the hands-free speech input in a reverberant room environment.
A new approach is presented to adapt the energy and spectral parameters of HMMs as well as their time derivatives to the modifications caused by speech input in a reverberant environment. The only parameter needed for the adaptation is an estimate of the reverberation time. The usability of this adaptation technique is shown by presenting the improvements for a series of recognition experiments on reverberant speech data. The approach for adapting the time derivatives of the acoustic parameters can be applied in general for all different types of distortions and is not restricted to the case of a hands-free input. The use of a hands-free speech input comes along with the recording of any background noise that is present in the room. Thus there is a need to combine the adaptation to reverberant conditions with the adaptation to background noise and unknown frequency characteristics. A combined adaptation scheme for all mentioned effects is presented in this paper. The adaptation is based on an estimation of the noise characteristics before the beginning of speech is detected. The estimation of the distortion parameters is based on signal processing techniques. The applicability is demonstrated by showing the improvements on artificially distorted data as well as on real recordings in rooms. (c) 2007 Elsevier B.V. All rights reserved. C1 [Hirsch, Hans-Guenter; Finster, Harald] Niederrhein Univ Appl Sci, Dept Elect Engn & Comp Sci, D-47805 Krefeld, Germany. RP Hirsch, HG (reprint author), Niederrhein Univ Appl Sci, Dept Elect Engn & Comp Sci, Reinarzstr 49, D-47805 Krefeld, Germany. EM hans-guenter.hirsch@hs-niederrhein.de CR Avendano C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607744 Bitzer J., 1999, P WORKSH ROB METH SP, P171 CAMPOSNETO S, 1999, INT J SPEECH TECHNOL, P259 COUVREUR L, 2001, P INT WORKSH AD METH *ETSI, 2003, 202 050 V113 ETSI ES FINSTER H, 2005, WEB INTERFACE EXPERI GADRUDADRI H, 2002, P ICSLP, P21 GALES MJF, 1997, P ESCA NATO WORKSH R, P55 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 GALES MJF, 1995, THESIS U CAMBRIDGE G Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Gelbart D., 2002, P ICSLP, P2185 Hirsch H., 1995, P ICASSP, P153 Hirsch H.
G., 2001, P EUR C SPEECH COMM, P1837 HIRSCH HG, 2000, P ISCA WORKSH ASR200 Hirsch HG, 2005, P INT, P2697 HIRSCH HG, 1999, P EUR C SPEECH COMM, V1, P9 Hirsch HG, 2001, SPEECH COMMUN, V34, P127, DOI 10.1016/S0167-6393(00)00050-9 HOUTGAST T, 1980, ACUSTICA, V46, P60 Janin Adam, 2003, P ICASSP KINGSBURY B, 1998, THESIS UC BERKELEY U KINSHITA K, 2005, P INT C LISB PORT, P3145 Kuttruff H., 2000, ROOM ACOUSTICS *LDC, 1993, WALL STREET J LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 LEONARD RG, 1984, P ICASSP, V3, P42 LIU J, 2001, P ICASSP, V5, P3037 Machiraju VR, 2002, J CARDIAC SURG, V17, P20 MINAMI Y, 1996, P IEEE INT C AC SPEE, P327 Omologo M, 1998, SPEECH COMMUN, V25, P75, DOI 10.1016/S0167-6393(98)00030-2 PALOMAKI KJ, 2002, P ICASSP ORL 13 17 M, P65 PICONE J, 2004, P EUR SIGN PROC C VI RANT CK, 2005, P INT C LISB PORT, P277 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 Seltzer ML, 2004, IEEE T SPEECH AUDI P, V12, P489, DOI 10.1109/TSA.2004.832988 TASHEV I, 2005, P WORKSH HANDS FREE WOODLAND PC, 2001, P INT WORKSH AD METH WU M, 2005, P ICASSP2005, V1, P1085 Yegnanarayana B, 2000, IEEE T SPEECH AUDI P, V8, P267, DOI 10.1109/89.841209 YEUNG SKA, 2004, P ICSLP Young S., 2005, HTK BOOK NR 41 TC 31 Z9 31 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2008 VL 50 IS 3 BP 244 EP 263 DI 10.1016/j.specom.2007.09.004 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 277UJ UT WOS:000254239900007 ER PT J AU Watanabe, M Hirose, K Den, Y Minematsu, N AF Watanabe, Michiko Hirose, Keikichi Den, Yasuharu Minematsu, Nobuaki TI Filled pauses as cues to the complexity of upcoming phrases for native and non-native listeners SO SPEECH COMMUNICATION LA English DT Article DE filled pauses; listeners; prediction; complexity; Japanese ID SPONTANEOUS SPEECH; HESITATION PHENOMENA; DISFLUENCIES; UM; COMPREHENSION; SPEAKERS; SPEAKING; WORDS; UH AB We examined whether filled pauses (FPs) affect listeners' predictions about the complexity of upcoming phrases in Japanese. Studies of spontaneous speech corpora show that constituents tend to be longer or more complex when they are immediately preceded by FPs than when they are not. From this finding, we hypothesized that FPs cause listeners to expect that the speaker is going to refer to something that is likely to be expressed by a relatively long or complex constituent. In the experiments, participants listened to sentences describing both simple and compound shapes on a computer screen. Their task was to press a button as soon as they had identified the shape corresponding to the description. Phrases describing shapes were immediately preceded by a FP, a silent pause of the same duration, or no pause. We predicted that listeners' response times to compound shapes would be shorter when there is a FP before phrases describing the shape than when there is no FP, because FPs are good cues to complex phrases, whereas response times to simple shapes would not be shorter with a preceding FP than without. The results of native Japanese and proficient non-native Chinese listeners agreed with the prediction and provided evidence to support the hypothesis. Response times of the least proficient non-native listeners were not affected by the existence of FPs, suggesting that the effects of FPs on non-native listeners depend on their language proficiency. (C) 2007 Elsevier B.V. 
All rights reserved. C1 [Watanabe, Michiko; Minematsu, Nobuaki] Univ Tokyo, Bunkyo Ku, Grad Sch Frontier Sci, Tokyo 1130033, Japan. [Hirose, Keikichi] Univ Tokyo, Bunkyo Ku, Grad Sch Informat Sci & Technol, Tokyo 1130033, Japan. [Den, Yasuharu] Chiba Univ, Fac Letters, Inage Ku, Chiba 2638522, Japan. RP Watanabe, M (reprint author), Univ Tokyo, Bunkyo Ku, Grad Sch Frontier Sci, Engn Bldg 2,Room 103C2,7-3-1 Hongo, Tokyo 1130033, Japan. EM watanabe@gavo.t.u-tokyo.ac.jp; hirose@gavo.t.u-tokyo.ac.jp; den@cogsci.l.chiba-u.ac.jp; mine@gavo.t.u-tokyo.ac.jp CR Arnold JE, 2003, J PSYCHOLINGUIST RES, V32, P25, DOI 10.1023/A:1021980931292 Arnold JE, 2000, LANGUAGE, V76, P28, DOI 10.2307/417392 Bailey KGD, 2003, J MEM LANG, V49, P183, DOI 10.1016/S0749-596X(03)00027-5 BLAU EK, 1991, ANN M PUERT RIC TEAC BRENNAN SE, 1995, J MEM LANG, V34, P383, DOI 10.1006/jmla.1995.1017 Brennan SE, 2001, J MEM LANG, V44, P274, DOI 10.1006/jmla.2000.2753 Buck G, 2001, ASSESSING LISTENING CHIANG CS, 1992, TESOL QUART, V26, P345, DOI 10.2307/3587009 Clark HH, 1998, COGNITIVE PSYCHOL, V37, P201, DOI 10.1006/cogp.1998.0693 Clark HH, 2002, COGNITION, V84, P73, DOI 10.1016/S0010-0277(02)00017-3 Clark HH, 2002, SPEECH COMMUN, V36, P5, DOI 10.1016/S0167-6393(01)00022-X COOK M, 1974, LANG SPEECH, V17, P11 Finegan Edward, 1994, LANGUAGE ITS STRUCTU FOX BA, 1996, INTERACTION GRAMMAR, V185 Tree JEF, 1995, J MEM LANG, V34, P709, DOI 10.1006/jmla.1995.1032 FUKAO Y, 1991, AUT M SOC TEACH JAP GOLDMANEISLER F, 1968, PSYCHOLINGUISTICS GRIFFITHS R, 1991, APPL LINGUIST, V12, P345, DOI 10.1093/applin/12.4.345 Holmes V. M, 1988, LANG COGNITIVE PROC, V3, P323, DOI 10.1080/01690968808402093 Levelt W. J. M., 1989, SPEAKING MACLAY H, 1959, WORD, V15, P19 *NAT I JAP LANG, 2005, CORP SPONT JAP HOM Rose R., 1998, THESIS U BIRMINGHAM RUBIN J, 1994, MOD LANG J, V78, P199, DOI 10.2307/329010 SADANOBU T, 1995, GENGO KENKYU, V108, P74 Shriberg E, 2005, P EUR C SPEECH COMM Shriberg E. E., 1994, THESIS U CALIFORNIA STENSTROEM A, 1994, INTRO SPOKEN INTERAC SUGITO M, 1990, P 1 INT C SPOK LANG, P513 Tree JEF, 2001, MEM COGNITION, V29, P320 Tree JEF, 2002, DISCOURSE PROCESS, V34, P37, DOI 10.1207/S15326950DP3401_2 VOSS B, 1979, LANG SPEECH, V22, P129 Wasow Thomas, 2002, POSTVERBAL BEHAV WATANABE M, 2000, P 6 INT C SPOK LANG, P167 WATANABE M, 2004, P 18 ANN CONV PHON S, P65 WATANABE M, 2004, P 8 INT C SPOK LANG, P905 WATANABE M, 2003, P 15 INT C PHON SCI, P2473 NR 37 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2008 VL 50 IS 2 BP 81 EP 94 DI 10.1016/j.specom.2007.06.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 264ZD UT WOS:000253328000001 ER PT J AU Avanzini, F AF Avanzini, Federico TI Simulation of vocal fold oscillation with a pseudo-one-mass physical model SO SPEECH COMMUNICATION LA English DT Article DE voice source; low-dimensional models; synthesis ID VELOCITY WAVE-FORM; 2-MASS MODEL; FLOW-THROUGH; GLOTTAL FLOW; VOICE SOURCE; TRACT; SPEECH; PARAMETERS; PHONATION; COLLISION AB This paper presents a novel "pseudo-one-mass model" of the vocal folds, which is derived from a previously proposed two-mass model. Two-mass models account for effects of vertical phase differences in fold motion by means of a pair of coupled oscillators that describe the lower and upper fold portions.
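As background to the Avanzini record that begins above: the two-mass formulation it derives from is conventionally written as a pair of coupled mass-spring-damper equations. A generic sketch in standard notation (not the paper's own symbols or parameter values):

\[
m_1 \ddot{x}_1 + r_1 \dot{x}_1 + k_1 x_1 + k_c (x_1 - x_2) = F_1(t), \qquad
m_2 \ddot{x}_2 + r_2 \dot{x}_2 + k_2 x_2 + k_c (x_2 - x_1) = F_2(t),
\]

where \(m_i\), \(r_i\), \(k_i\) are the mass, damping and stiffness of the lower (\(i = 1\)) and upper (\(i = 2\)) fold portions, \(k_c\) is the coupling stiffness, and \(F_i(t)\) are the aerodynamic driving forces. The pseudo-one-mass model described in the abstract keeps only the lower-mass equation and replaces the explicit \(x_2\) dynamics with a phenomenological description of the upper glottal area.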
Instead, the proposed model employs a single mass-spring oscillator to describe only the oscillation of the lower fold portion, while phase difference effects are simulated through an approximate phenomenological description of the upper glottal area. This approximate description is derived under the hypothesis that 1:1 modal entrainment occurs between the two masses in the large-amplitude oscillation regime, and is then exploited to derive the equations of the pseudo-one-mass model. Numerical simulations of a reference two-mass model are analyzed to show that the proposed approximation remains valid when values of the physical parameters are varied in a large region of the control space. The effects on the shape of the glottal flow pulse are also analyzed. Comparison of simulations with the reference two-mass model and the pseudo-one-mass model shows that the dynamic behavior of the former is accurately approximated by the latter. The similarity of flow signals synthesized with the two models is assessed in terms of four acoustic parameters: fundamental frequency, maximum amplitude, open quotient, and speed quotient. The results confirm that the pseudo-one-mass model fits the behavior of the reference two-mass model with good accuracy, while requiring significantly lower computational resources and roughly half of the mechanical parameters. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Padua, Dept Informat Engn, I-35131 Padua, Italy. RP Avanzini, F (reprint author), Univ Padua, Dept Informat Engn, Via Gradenigo 61A, I-35131 Padua, Italy. EM avanzini@dei.unipd.it RI Avanzini, Federico/I-3917-2012 OI Avanzini, Federico/0000-0002-1257-5878 CR Alku P, 1996, SPEECH COMMUN, V18, P131, DOI 10.1016/0167-6393(95)00040-2 Avanzini F, 2002, J ACOUST SOC AM, V111, P2293, DOI 10.1121/1.1467674 Avanzini F., 2001, P EUR C, P51 Avanzini F, 2006, ACTA ACUST UNITED AC, V92, P731 Berry DA, 2001, J ACOUST SOC AM, V110, P2539, DOI 10.1121/1.1408947 Berry DA, 1996, J ACOUST SOC AM, V100, P3345, DOI 10.1121/1.416975 CHILDERS DG, 1995, J ACOUST SOC AM, V97, P505, DOI 10.1121/1.412276 CHILDERS DG, 1994, IEEE T BIO-MED ENG, V41, P663, DOI 10.1109/10.301733 Deverge M, 2003, J ACOUST SOC AM, V114, P3354, DOI 10.1121/1.1625933 de Vries MP, 1999, J ACOUST SOC AM, V106, P3620, DOI 10.1121/1.428214 de Vries MP, 2002, J ACOUST SOC AM, V111, P1847, DOI 10.1121/1.1323716 Drioli C, 2005, J ACOUST SOC AM, V117, P3184, DOI 10.1121/1.1861234 FANT G, 1982, STL QPSR, V23, P1 Fant G., 1985, STL QPSR, V26, P1 FLANAGAN JL, 1968, IEEE T ACOUST SPEECH, VAU16, P57, DOI 10.1109/TAU.1968.1161949 FLANAGAN JL, 1980, J ACOUST SOC AM, V68, P780, DOI 10.1121/1.384817 FLETCHER NH, 1978, J ACOUST SOC AM, V64, P1566, DOI 10.1121/1.382139 Gunter HE, 2003, J ACOUST SOC AM, V113, P994, DOI 10.1121/1.1534100 ISHIZAKA K, 1972, AT&T TECH J, V51, P1233 Liljencrants J., 1991, STL QPSR, V32, P1 Lous NJC, 1998, ACUSTICA, V84, P1135 Lucero JC, 1996, J ACOUST SOC AM, V100, P3355, DOI 10.1121/1.416976 PELORSON X, 1994, J ACOUST SOC AM, V96, P3416, DOI 10.1121/1.411449 RIEGELSBERGER EL, 1993, P IEEE INT C AC SPEE, P542 Schroeter J., 1992, ADV SPEECH SIGNAL PR, P231 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 Sciamarella D, 2004, ACTA ACUST UNITED AC, V90, P746 SONDHI MM, 1987, IEEE T ACOUST SPEECH, V35, P955 Story B.
H., 2002, Acoustical Science and Technology, V23, DOI 10.1250/ast.23.195 STORY BH, 1995, J ACOUST SOC AM, V97, P1249, DOI 10.1121/1.412234 Strik H, 1998, J ACOUST SOC AM, V103, P2659, DOI 10.1121/1.422786 Titze IR, 2006, J SPEECH LANG HEAR R, V49, P439, DOI 10.1044/1092-4388(2006/034) Titze IR, 2002, J ACOUST SOC AM, V112, P1064, DOI 10.1121/1.1496080 TITZE IR, 1988, J ACOUST SOC AM, V83, P1536, DOI 10.1121/1.395910 Titze IR, 1997, J ACOUST SOC AM, V101, P2234, DOI 10.1121/1.418246 Vilain CE, 2004, J SOUND VIB, V276, P475, DOI 10.1016/j.jsv.2003.07.035 NR 36 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2008 VL 50 IS 2 BP 95 EP 108 DI 10.1016/j.specom.2007.07.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 264ZD UT WOS:000253328000002 ER PT J AU Goudbeek, M Cutler, A Smits, R AF Goudbeek, Martijn Cutler, Anne Smits, Roel TI Supervised and unsupervised learning of multidimensionally varying non-native speech categories SO SPEECH COMMUNICATION LA English DT Article DE category learning; non-native categorization; statistical learning; vowels; duration; frequency ID ENGLISH VOWELS; PERCEPTION; ACQUISITION; INFORMATION; SPEAKERS; IDENTIFICATION; ATTENTION; PHONEMES; INFANTS; SPANISH AB The acquisition of novel phonetic categories is hypothesized to be affected by the distributional properties of the input, the relation of the new categories to the native phonology, and the availability of supervision (feedback). These factors were examined in four experiments in which listeners were presented with novel categories based on vowels of Dutch. Distribution was varied such that the categorization depended on the single dimension duration, the single dimension frequency, or both dimensions at once. Listeners were clearly sensitive to the distributional information, but unidimensional contrasts proved easier to learn than multidimensional. The native phonology was varied by comparing Spanish versus American English listeners. Spanish listeners found categorization by frequency easier than categorization by duration, but this was not true of American listeners, whose native vowel system makes more use of duration-based distinctions. Finally, feedback was either available or not; this comparison showed supervised learning to be significantly superior to unsupervised learning. (C) 2007 Elsevier B.V. All rights reserved. C1 [Goudbeek, Martijn; Cutler, Anne; Smits, Roel] Max Planck Inst Psycholinguist, Nijmegen, Netherlands. [Cutler, Anne] Univ Western Sydney, MARCS Auditory Labs, Sydney, NSW, Australia. RP Goudbeek, M (reprint author), Univ Geneva, Dept Psychol, 40 Blvd Pt Arve, CH-1205 Geneva, Switzerland. EM goudbeek@pse.unige.ch RI Cutler, Anne/C-9467-2012 CR Agresti A., 1990, CATEGORICAL DATA ANA Aoyama K, 2004, J PHONETICS, V32, P233, DOI 10.1016/S0095-4470(03)00036-6 Ashby FG, 1998, PSYCHOL REV, V105, P442, DOI 10.1037/0033-295X.105.3.442 ASHBY FG, 1993, J MATH PSYCHOL, V37, P372, DOI 10.1006/jmps.1993.1023 Ashby FG, 1999, PERCEPT PSYCHOPHYS, V61, P1178, DOI 10.3758/BF03207622 Best C. T., 1995, SPEECH PERCEPTION LI, P171 Best C. 
T., 1994, DEV SPEECH PERCEPTIO, P167 BEST CT, 2007, SECOND LANGUAGE SPEE, P113 BEST CT, 1988, J EXP PSYCHOL HUMAN, V14, P345, DOI 10.1037/0096-1523.14.3.345 BOERSMA P, 2003, PRAAT 4 1 COMPUTER S Booij Geert, 1995, PHONOLOGY DUTCH BRADLOW AR, 1995, J ACOUST SOC AM, V97, P1916, DOI 10.1121/1.412064 BROERSMA M, 2006, J ACOUST SOC AM, V119, P3270 BURNHAM DK, 1991, J CHILD LANG, V18, P231 Creelman C. D., 1991, DETECTION THEORY USE Cutler A, 2005, FIGURE OF SPEECH: A FESTSCHRIFT FOR JOHN LAVER, P63 Diehl R. L., 1987, CATEGORICAL PERCEPTI, P226 Eisner F, 2005, PERCEPT PSYCHOPHYS, V67, P224, DOI 10.3758/BF03206487 Feldman J, 2000, NATURE, V407, P630, DOI 10.1038/35036586 FLEGE JE, 1989, LANG SPEECH, V32, P123 FLEGE JE, 1986, J ACOUST SOC AM, V79, P508, DOI 10.1121/1.393538 Flege J. E., 1995, SPEECH PERCEPTION LI, P233 Flege James E, 1992, INTELLIGIBILITY SPEE, P157 Flege JE, 1997, J PHONETICS, V25, P437, DOI 10.1006/jpho.1997.0052 Francis AL, 2000, PERCEPT PSYCHOPHYS, V62, P1668, DOI 10.3758/BF03212164 Francis AL, 2002, J EXP PSYCHOL HUMAN, V28, P349, DOI 10.1037//0096-1523.28.2.349 GOUDBEEK M, UNPUB SUPERVISED UNS Gussenhoven C., 1999, HDB INT PHONETIC ASS, P74 GUSSENHOVEN C, IN PRESS NATURE WORD Hammond Robert M., 2001, SOUNDS SPANISH ANAL HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 HOMA D, 1984, J EXP PSYCHOL LEARN, V10, P83, DOI 10.1037/0278-7393.10.1.83 KAWAHARA S, 2006, 151 M ACOUST SOC AM, V119, P3243 Ladefoged P., 1999, HDB INT PHONETIC ASS, p[41, 41] Lisker L., 1978, HASKINS LAB STATUS R, V54, P127 LOGAN JS, 1991, J ACOUST SOC AM, V89, P874, DOI 10.1121/1.1894649 Love BC, 2002, PSYCHON B REV, V9, P829, DOI 10.3758/BF03196342 Maye J, 2000, PROC ANN BUCLD, P522 Maye J, 2001, PROC ANN BUCLD, P480 McAllister R, 2002, J PHONETICS, V30, P229, DOI 10.1006/jpho.2002.0174 MERMELSTEIN P, 1978, PERCEPT PSYCHOPHYS, V23, P331, DOI 10.3758/BF03199717 NAVARRO T, 1968, STUDIES SPANISH PHON NEAREY TM, 1989, J ACOUST SOC AM, V85, P2088, DOI 10.1121/1.397861 Nearey TM, 1997, J ACOUST SOC AM, V101, P3241, DOI 10.1121/1.418290 Norris D, 2006, Q J EXP PSYCHOL, V59, P1505, DOI 10.1080/17470210600739494 Norris D, 2003, COGNITIVE PSYCHOL, V47, P204, DOI 10.1016/S0010-0285(03)00006-9 NOSOFSKY RM, 1990, MULTIDIMENSIONAL MET, P363 Pierrehumbert JB, 2003, LANG SPEECH, V46, P115 SHEPARD RN, 1961, PSYCHOL MONOGR, V75, P1 Smiljanic R, 2005, J ACOUST SOC AM, V118, P1677, DOI 10.1121/1.2000788 Smits R, 2003, J ACOUST SOC AM, V113, P563, DOI 10.1121/1.1525287 Strange W., 1995, SPEECH PERCEPTION LI WHALEN DH, 1989, PERCEPT PSYCHOPHYS, V46, P284, DOI 10.3758/BF03208093 NR 53 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2008 VL 50 IS 2 BP 109 EP 125 DI 10.1016/j.specom.2007.07.003 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 264ZD UT WOS:000253328000003 ER PT J AU Jande, PA AF Jande, Per-Anders TI Spoken language annotation and data-driven modelling of phone-level pronunciation in discourse context SO SPEECH COMMUNICATION LA English DT Article DE spoken language annotation; pronunciation variation; pronunciation modelling; decision trees ID SPEECH; FREQUENCY AB A detailed description of the discourse context of a word can be used for predicting word pronunciation in discourse context and also enables studies of the interplay between various types of information on e.g. phone-level pronunciation. 
The work presented in this paper is aimed at modelling systematic variation in the phone-level realisation of words inherent to a language variety. A data-driven approach based on access to detailed discourse context descriptions is used. The discourse context descriptions are constructed through annotation of spoken language with a large variety of linguistic and related variables in multiple layers. Decision tree pronunciation models are induced from the annotation. The effects of using different types and different amounts of information for model induction are explored. Models generated in a tenfold cross-validation experiment produce on average 8.2% errors on the phone level when they are trained on all available information. Models trained on phoneme level information only have an average phone error rate of 14.2%. This means that including information above the phoneme level in the context description can improve model performance by 42.2%. (C) 2007 Elsevier B.V. All rights reserved. C1 Sch Comp Sci & Commun, Dept Speech Music & Hearing, SE-10044 Stockholm, Sweden. RP Jande, PA (reprint author), Sch Comp Sci & Commun, Dept Speech Music & Hearing, KTH Lindstedtsvagen 24, SE-10044 Stockholm, Sweden. EM per.andersjande@gmail.com CR ALLWOOD J, 2000, QUALITATIVE SOCIAL R, V1 AYCOCK J, 1998, P 7 INT PYTH C HOUST BANNERT R, 1999, VARIATIONS CONSONANT BENNETT C, 2005, P INT C ACOUST SPEEC, P297 Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464 BORGELT C, 2004, DTREE Brants T., 2000, P 6 C APPL NAT LANG, P224, DOI 10.3115/974147.974178 BRUCE G, 1986, 9 SCAND C LING STOCK, P86 CARLSON R, 1982, P ICASSP 82, V3, P1604 CARLSON R, 2002, P FON STOCKH SWED MA, P65 Carlson R., 1976, P IEEE ICASSP76, P686 DUEZ D, 1998, P ESCA SOUND PATT SP, P63 EKLUND R, 1999, P ICPHS DISFL SPONT, P3 ELERT CC, 1970, SOUNDS WORDS SWEDISH FANT G, 2001, P EUR AALB DENM SEP, P657 Finke M., 1997, P EUR C SPEECH COMM, P2379 Fosler-Lussier E, 1999, SPEECH COMMUN, V29, P137, DOI 10.1016/S0167-6393(99)00035-7 GARDING E, 1974, SVENSKANS BESKRIVNIN, V8, P97 Heldner M, 2003, J PHONETICS, V31, P39, DOI 10.1016/S0095-4470(02)00071-2 HERMES DJ, 1991, J ACOUST SOC AM, V90, P97, DOI 10.1121/1.402397 JANDE PA, 2006, P LREC WORKSH MERG L, P1 JANDE PA, 2006, THESIS KTH SWEDEN JANDE PA, 2005, P INT LISB PORT, P1945 JANDE PA, 2003, P ICPHS, P2557 JURAFSKY D, 2001, P INT C ACOUST SPEEC, V2, P2118 DEMANTARAS RL, 1991, MACH LEARN, V6, P81, DOI 10.1023/A:1022694001379 Megyesi B, 2002, J MACH LEARN RES, V2, P639, DOI 10.1162/153244302320884579 Megyesi Beata, 2002, THESIS KTH STOCKHOLM MILLER C, 1998, P ESCA COCOSDA 3 INT, P133 MORE B, 1983, J ACOUST SOC AM, V74, P750 Nolan Francis, 2003, P 15 INT C PHON SCI, P771 OSTENDORF M, 1996, P INT C SPOK LANG PR, P1039 RIDINGS D, 2002, SWEDISH RESOURCES LA Sigurd Bengt, 1965, PHONOTACTIC STRUCTUR Sjolander K., 2003, P FON 2003, P93 SJOLANDER K, 2004, P FON STOCKH SWED MA, P116 Sjolander Kare, 2004, SNACK SOUND TOOLKIT Stevens S. S., 1940, AM J PSYCHOL, V53, P329, DOI 10.2307/1417526 Strassel S., 2004, SIMPLE METADATA ANNO Traunmuller H., 1995, FREQUENCY RANGE VOIC TRAUNMULLER H, 1995, J ACOUST SOC AM, V97, P1905 VANBAEL C, 2004, P INT C SPOK LANG PR, P586 WERNER S, 2004, P INT C ACOUST SPEEC, V1, P673 Werner S, 2004, IEEE T SPEECH AUDI P, V12, P436, DOI 10.1109/TSA.2004.828635 ZHENG J, 2000, P NIST SPEECH TRANSC, P57 NR 45 TC 2 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD FEB PY 2008 VL 50 IS 2 BP 126 EP 141 DI 10.1016/j.specom.2007.07.004 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 264ZD UT WOS:000253328000004 ER PT J AU Yapanel, UH Hansen, JHL AF Yapanel, Umit H. Hansen, John H. L. TI A new perceptually motivated MVDR-based acoustic front-end (PMVDR) for robust automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE acoustic feature extraction; robust speech recognition; noise-robustness analysis ID STRESS AB Acoustic feature extraction from speech constitutes a fundamental component of automatic speech recognition (ASR) systems. In this paper, we propose a novel feature extraction algorithm, perceptual-MVDR (PMVDR), which computes cepstral coefficients from the speech signal. This new feature representation is shown to better model the speech spectrum compared to traditional feature extraction approaches. Experimental results for small (40-word digits) to medium (5k-word dictation) size vocabulary tasks show varying degrees of consistent improvement across different experiments; however, the new front-end is most effective in noisy car environments. The PMVDR front-end uses the minimum variance distortionless response (MVDR) spectral estimator to represent the upper envelope of the speech signal. Unlike Mel frequency cepstral coefficients (MFCCs), the proposed front-end does not utilize a filterbank. The effectiveness of the PMVDR approach is demonstrated by comparing speech recognition accuracies with the traditional MFCC front-end and the recently proposed PMCC front-end in both noise-free and real adverse environments. For speech recognition in noisy car environments, a 40-word vocabulary task, the PMVDR front-end provides a 36% relative decrease in word error rate (WER) over the MFCC front-end. Under simulated speaker stress conditions, a 35-word vocabulary task, the PMVDR front-end yields a 27% relative decrease in the WER. For a noise-free dictation task, a 5k-word vocabulary task, a relative 8% reduction in the WER is again reported. Finally, a novel analysis technique is proposed to quantify noise robustness of an acoustic front-end. This analysis is conducted for the acoustic front-ends analyzed in the paper and results are presented. (C) 2007 Elsevier B.V. All rights reserved. C1 [Yapanel, Umit H.; Hansen, John H. L.] Univ Texas Dallas, Dept Elect Engn, Ctr Robust Speech Syst, Richardson, TX 75083 USA. RP Hansen, JHL (reprint author), Univ Texas Dallas, Dept Elect Engn, Ctr Robust Speech Syst, E33,POB 830688, Richardson, TX 75083 USA.
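The minimum variance distortionless response (Capon) estimator named in the Yapanel and Hansen abstract above admits a compact implementation. The sketch below illustrates only the generic MVDR definition; the function name and parameters are ours, and the perceptual frequency warping that distinguishes PMVDR is omitted:

import numpy as np
from scipy.linalg import toeplitz

def mvdr_spectrum(frame, order=12, nfft=512):
    # MVDR (Capon) spectral envelope of a single windowed speech frame.
    n = len(frame)
    # Biased autocorrelation estimates r[0], ..., r[order]
    r = np.array([frame[: n - k] @ frame[k:] for k in range(order + 1)]) / n
    R_inv = np.linalg.inv(toeplitz(r))  # inverse of the Toeplitz autocorrelation matrix
    spectrum = np.empty(nfft // 2 + 1)
    for i in range(nfft // 2 + 1):
        w = 2 * np.pi * i / nfft
        v = np.exp(-1j * w * np.arange(order + 1))  # Fourier (steering) vector
        spectrum[i] = 1.0 / np.real(v.conj() @ R_inv @ v)  # S(w) = 1 / (v^H R^-1 v)
    return spectrum

Cepstral coefficients would then be taken from the log of this envelope (e.g. via a DCT), in place of the filterbank step of MFCC extraction.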
EM John.Hansen@utdallas.edu CR Arslan LM, 1996, SPEECH COMMUN, V18, P353, DOI 10.1016/0167-6393(96)00024-6 Bou-Ghazale SE, 2000, IEEE T SPEECH AUDI P, V8, P429, DOI 10.1109/89.848224 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DHARANIPRAGADA S, 2001, INT C ACOUST SPEECH, P3009 ELJAROUDI A, 1991, IEEE T SIGNAL PROCES, V39, P411, DOI 10.1109/78.80824 GU L, 2001, ISCA INT 01 EUROSPEE, P583 Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7 HANSEN JHL, 2001, INTERSPEECH 01 EUROS, V2, P905 HANSEN JHL, 1997, ISCA EUROSPEECH, V95, P1743 HANSEN JHL, 2001, INTERSPEECH 01 EUROS, V3, P2023 Haykin S., 1991, ADAPTIVE FILTER THEO HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Huang X., 2001, SPOKEN LANGUAGE PROC HUNT MJ, 1999, SPECTRAL SIGNAL PROC, V1, P17 JELINEK M, 1999, INT C ACOUST SPEECH, P1818 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MCDONOUGH J, 1998, INT C SPOK LANG P SY Murthi MN, 2000, IEEE T SPEECH AUDI P, V8, P221, DOI 10.1109/89.841206 MUSICUS BR, 1985, IEEE T ACOUSTICS SPE, V33, P133 *NISH, 2004, NIST SPHERE SOFTW PA Oppenheim A. V., 1989, DISCRETE TIME SIGNAL Pellom B, 2001, TRCSLR200101 U COL PELLOM B, 2003, INT C ACOUST SPEECH, P4 Smith JO, 1999, IEEE T SPEECH AUDI P, V7, P697, DOI 10.1109/89.799695 Tokuda K., 1994, ICSLP 94. 1994 International Conference on Spoken Language Processing WOLFEL M, 2003, MINIMUM VARIANCE DIS, P1021 YAPANEL U, 2002, HIGH PERFORMANCE DIG, P793 Yapanel U. H., 2003, INT C ACOUST SPEECH, P644 YAPANEL UH, 2005, INT C ACOUST SPEECH YAPANEL UH, 2003, PERCEPTUAL MVDR BASE, P1829 YAPANEL UH, 2003, NEW PERSPECTIVE FEAT, P1281 YAPANEL UH, 2005, THESIS U COLORADO BO NR 32 TC 19 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2008 VL 50 IS 2 BP 142 EP 152 DI 10.1016/j.specom.2007.07.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 264ZD UT WOS:000253328000005 ER PT J AU Dhananjaya, N Yegnanarayana, B AF Dhananjaya, N. Yegnanarayana, B. TI Speaker change detection in casual conversations using excitation source features SO SPEECH COMMUNICATION LA English DT Article DE speaker change detection; multispeaker conversation; autoassociative neural network (AANN) models; excitation source features; linear prediction (LP) residual ID SPEECH AB In this paper we propose a method for speaker change detection using features of the excitation source of the speech production mechanism. The method uses neural network models to capture the speaker-specific information from a signal that represents predominantly the excitation source. The focus in this paper is on speaker change detection in casual telephone conversations, in which short (<5 s) speaker turns are common. Excitation source features are a better choice than vocal tract system features for modeling a speaker when only a limited amount of speech data is available. The linear prediction residual is used as an estimate of the excitation source signal. Autoassociative neural network models are proposed to capture the higher order relations among the samples of the residual signal. Speaker models are generated for every one second of voiced speech from the first few seconds of the conversation. These models are used to detect the speaker change points.
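The linear prediction residual that the Dhananjaya and Yegnanarayana abstract above adopts as its excitation estimate is obtained by inverse filtering; the evaluation statement that completes the record follows below. A minimal sketch (librosa's LPC routine is one convenient choice, not the authors' implementation):

import numpy as np
import librosa
from scipy.signal import lfilter

def lp_residual(y, sr, order=None):
    # Estimate the excitation source by inverse filtering with LP coefficients.
    if order is None:
        order = int(sr / 1000) + 2  # common rule of thumb for the LP order
    a = librosa.lpc(y.astype(float), order=order)  # prediction polynomial A(z), a[0] == 1
    return lfilter(a, [1.0], y)  # e[n] = y[n] + a[1] y[n-1] + ... + a[p] y[n-p]

One-second stretches of such residuals would then train the autoassociative speaker models the abstract describes.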
Performance of the proposed method for speaker change detection is evaluated on a database containing several two-speaker conversations. (C) 2007 Elsevier B.V. All rights reserved. C1 [Dhananjaya, N.] Indian Inst Technol, Madras 600036, Tamil Nadu, India. [Yegnanarayana, B.] Int Inst Informat Technol, Hyderabad, Andhra Pradesh, India. RP Dhananjaya, N (reprint author), Indian Inst Technol, Madras 600036, Tamil Nadu, India. EM dhanu@cs.iitm.ernet.in; yegna@iiit.ac.in CR CHAN W, 2006, P IEEE INT C AC SPEE, P657 Chen S. S., 1998, P DARPA BROADC NEWS, P127 Delacourt P, 2000, SPEECH COMMUN, V32, P111, DOI 10.1016/S0167-6393(00)00027-3 Gish H, 1991, P INT C AC SPEECH SI, V2, P873 GRAFF D, 2002, SWITCHBOARD 2 PHASE JOHNSON S, 1997, THESIS CAMBRIDGE U E Lu L., 2002, P 10 ACM INT C MULT, P602 Makhoul J, 2000, P IEEE, V88, P1338, DOI 10.1109/5.880087 MARTIN A, 2002, NIST SPEAKER RECOGNI Prasanna SRM, 2006, SPEECH COMMUN, V48, P1243, DOI 10.1016/j.specom.2006.06.002 SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662 YEGNANARAYANA B, 2001, P INT C AC SPEECH SI, V1, P409 NR 12 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2008 VL 50 IS 2 BP 153 EP 161 DI 10.1016/j.specom.2007.08.003 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 264ZD UT WOS:000253328000006 ER PT J AU Rosset, S Tribout, D Lamel, L AF Rosset, S. Tribout, D. Lamel, L. TI Multi-level information and automatic dialog act detection in human-human spoken dialogs SO SPEECH COMMUNICATION LA English DT Article DE automatic dialog act detection; human-human dialogs; memory based learning ID SPEECH; CLASSIFICATION AB This paper reports studies on annotating and automatically detecting dialog acts in human-human spoken dialogs. The work rests on three hypotheses: first, the succession of dialog acts is strongly constrained; second, the initial word and the semantic class of words are more important for identifying dialog acts than the complete exact word sequence of an utterance; third, most of the important information is encoded in specific entities. A memory based learning approach is used to detect dialog acts. For each utterance unit, eight dialog acts are systematically annotated. Experiments have been conducted using different levels of information, with and without the use of dialog history information. In order to assess the generality of the method, the specific entity tag based model trained on a French corpus was tested on an English corpus for a similar task and on a French corpus from a different domain. A correct dialog act detection rate of about 86% is obtained for the same domain/language condition and 77% for the cross-language or cross-domain conditions. (c) 2007 Elsevier B.V. All rights reserved. C1 LIMSI, CNRS, Spoken Language Proc Grp, F-91403 Orsay, France. RP Rosset, S (reprint author), LIMSI, CNRS, Spoken Language Proc Grp, BP 133, F-91403 Orsay, France.
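Memory-based learning of the kind the Rosset et al. record above relies on (TiMBL, cited in its reference list) is in essence k-nearest-neighbour classification over symbolic features. A stand-in sketch with scikit-learn; the features and labels are purely illustrative, not the paper's feature set or dialog act inventory:

from sklearn.feature_extraction import DictVectorizer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# One training instance per utterance unit: previous dialog act,
# initial word, and the specific-entity class found in the utterance.
train_X = [
    {"prev_da": "question", "first_word": "yes", "entity": "none"},
    {"prev_da": "statement", "first_word": "when", "entity": "date"},
]
train_y = ["answer", "question"]

model = make_pipeline(DictVectorizer(sparse=False), KNeighborsClassifier(n_neighbors=1))
model.fit(train_X, train_y)
print(model.predict([{"prev_da": "question", "first_word": "yes", "entity": "none"}]))  # ['answer']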
EM rosset@limsi.fr; tribout@limsi.fr; lamel@limsi.fr CR ALLEN J, 1997, DRAFT DAMSLL DIALOG *AMITES CONS, 2005, AMITIES FIN REP ANDERSON AH, 1991, LANG SPEECH, V34, P351 Ang J, 2005, INT CONF ACOUST SPEE, P1061 Barras C, 2001, SPEECH COMMUN, V33, P5, DOI 10.1016/S0167-6393(00)00067-4 BONNEAUMAYNARD H, 2003, ISCEA EUR 03 GEN SEP, V1, P253 Brill E, 1995, COMPUT LINGUIST, V21, P543 Carletta J, 1996, COMPUT LINGUIST, V22, P249 CATTONI R, 2001, BUILDING CORPUS ANNO Chu-Carroll J., 1998, Applying Machine Learning to Discourse Processing. Papers from the 1998 AAAI Symposium DAELEMANS W, 2003, TIMBL TILBURG MEMORY Daelemans W, 1999, MACH LEARN, V34, P11, DOI 10.1023/A:1007585615670 DIEUGENIO B, 1998, ACLCOLING98 EHARA T, 1990, ICSLP 90, P1093 HARDY H, 2003, ISCA EUR 03 GEN SEPT, P201 Hirschberg J., 1993, Computational Linguistics, V19 ISARD A, 1995, AAAI 1995 SPR S SER, P60 JANIN A, 2003, ICASSP 03 HONG KONG JEKAT S, 1995, 65 U HAMBURG JI G, 2005, ICASSP 2005 PHIL APR JURAFSKY D, 9702 U COL NAGATA M, 1992, ICSLP 92 BANFF CAN O REITHINGER N, 1997, DIALOGUE CLASSIFICAT, P2235 RIETHINGER N, 1996, ICSLP 1996 PHIL OCT ROSSET S, 2004, INT 2004 ICSLP JEJ I, P540 SAMUEL K, 1998, COLING ACL, P1150 Samuel Ken, 1999, P 4 C PAC ASS COMP L *SRI TRANS, 1992, TRANSCR DER AUD CONV Stolcke A, 2000, COMPUT LINGUIST, V26, P339, DOI 10.1162/089120100561737 SURENDRAN D, 2006, INT 06 PITTS PA SEPT Traum D. R., 2000, J SEMANT, V17, P7, DOI DOI 10.1093/JOS/17.1.7 VANDENBOSCH A, 2001, ACL 00 NEW BRUNSW, P499 Webb N., 2005, P AAAI WORKSH SPOK L NR 33 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2008 VL 50 IS 1 BP 1 EP 13 DI 10.1016/j.specom.2007.05.007 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 234TB UT WOS:000251185800001 ER PT J AU Menard, L Schwartz, JL Aubin, J AF Menard, Lucie Schwartz, Jean-Luc Aubin, Jerome TI Invariance and variability in the production of the height feature in French vowels SO SPEECH COMMUNICATION LA English DT Article DE phonetics; speech production; speech perception; vowel systems; French vowels ID DISPERSION-FOCALIZATION-THEORY; VOCAL-TRACT; ARTICULATORY MODEL; SPEECH-PERCEPTION; ACQUISITION; ASYMMETRIES; ADULTHOOD; INVERSION; MOVEMENTS; GROWTH AB This paper investigates the organization of the vowel space in French speakers. Speakers aged from 4 years to adulthood were recorded in order to generate significant between-speaker variability. Each speaker produced repetitions of the ten French oral vowels /i y u e ø o ɛ œ ɔ a/. Acoustic analyses show that, despite considerable between-speaker variability in the relative positions of the vowels within the vowel space, speakers tend to produce vowels along a given height degree with a stable F1 value, depending on the speaker, but independently of place of articulation and roundedness. Simulations with the Variable Linear Articulatory Model (VLAM) show that a stable F1 value is basically related to stable tongue heights. The results are discussed in the framework of the Perception-for-Action Control theory (PACT), in which speech units are considered as gestures shaped by perceptual processes. (c) 2007 Elsevier B.V. All rights reserved. C1 Univ Quebec, Dept Linguist & Didact Langues, Montreal, PQ H3C 3P8, Canada. Univ Grenoble 3, INPG, Inst Commun Parlee, Grenoble, France.
RP Menard, L (reprint author), Univ Quebec, Dept Linguist & Didact Langues, Montreal, PQ H3C 3P8, Canada. EM menard.lucie@uqam.ca CR ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 Bailly G, 1997, SPEECH COMMUN, V22, P251, DOI 10.1016/S0167-6393(97)00025-3 BAILLY G, 1995, P 4 EUR C SPEECH COM, V3, P1913 BOE LJ, 1992, J PHONETICS, V20, P27 Boe L.-J., 1999, P 14 INT C PHON SCI, V3, P2501 CHEVRIEMULLER C, 2001, NOUVELLES ETUDES POU GAY T, 1981, J ACOUST SOC AM, V69, P802, DOI 10.1121/1.385591 Goldstein UG., 1980, THESIS MIT CAMBRIDGE Guenther FH, 1998, PSYCHOL REV, V105, P611 GUENTHER FH, 1995, PSYCHOL REV, V102, P594 HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 JORDAN MI, 1992, COGNITIVE SCI, V16, P316 Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 LINDBLOM B, 1979, J PHONETICS, V7, P147 Lindblom B, 1996, J ACOUST SOC AM, V99, P1683, DOI 10.1121/1.414691 Lindblom B., 1988, LANGUAGE SPEECH MIND, P62 Lindblom B., 1986, EXPT PHONOLOGY, P13 Lindblom Bjorn, 1998, APPROACHES EVOLUTION, p[Chris, 242] MACNEILAGE PF, 1990, ATTENTION PERFORMANC, P453 MAEDA S, 1979, J ACOUST SOC AM, V65, pS22, DOI 10.1121/1.2017158 MENARD L, 2002, PRODUCTION PERCEPTIO Menard L, 2004, J SPEECH LANG HEAR R, V47, P1059, DOI 10.1044/1092-4388(2004/079) Menard L, 2002, J ACOUST SOC AM, V111, P1892, DOI 10.1121/1.1459467 Menard L, 2007, J PHONETICS, V35, P1, DOI 10.1016/j.wocn.2006.01.003 Neagu A., 1997, THESIS INPG GRENOBLE Nearey TM, 1997, J ACOUST SOC AM, V101, P3241, DOI 10.1121/1.418290 Ohala J. J., 1979, P INT C PHON SCI, V3, P181 OSTRY DJ, 2006, P 5 INT C SPEECH MOT, P14 Perkell J, 1997, SPEECH COMMUN, V22, P227, DOI 10.1016/S0167-6393(97)00026-5 Perkell JS, 2004, J ACOUST SOC AM, V116, P2338, DOI 10.1121/1.1787524 PERRIER P, 2005, SPECIAL ISSUE SPEECH, V40, P190 Polka L, 2003, SPEECH COMMUN, V41, P221, DOI 10.1016/S0167-6393(02)00105-X Schroeder M. R., 1979, FRONTIERS SPEECH COM, P217 Schwartz J.-L., 2002, PHONETICS PHONOLOGY, P255 Schwartz JL, 2005, SPEECH COMMUN, V45, P425, DOI 10.1016/j.specom.2004.12.001 Schwartz JL, 1997, J PHONETICS, V25, P255, DOI 10.1006/jpho.1997.0043 Sole M. J., 2007, EXPT APPROACHES PHON, P104 STEVENS KN, 1989, J PHONETICS, V17, P3 Tremblay S, 2003, NATURE, V423, P866, DOI 10.1038/nature01710 VALLE N, 1994, THESIS U STENDHAL VALLEE N, 2003, P INT C PHON SCI BAR, P817 NR 42 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2008 VL 50 IS 1 BP 14 EP 28 DI 10.1016/j.specom.2007.06.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 234TB UT WOS:000251185800002 ER PT J AU Kocinski, J AF Kocinski, Jedrzej TI Speech intelligibility improvement using convolutive blind source separation assisted by denoising algorithms SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ForumAcusticum CY 2005 CL Budapest, HUNGARY DE blind source separation; speech enhancement; denoising; speech intelligibility ID NONSTATIONARY SIGNALS; FREQUENCY-DOMAIN; NATURAL GRADIENT; MIXTURES; NOISE AB The present study is concerned with the blind source separation (BSS) of speech and speech-shaped noise sources. All recordings were carried out in an anechoic chamber using a dummy head (two microphones, one in each ear). 
The program which implements the algorithm for BSS of convolutive mixtures introduced by Parra and Spence [Parra, L., Spence, C., 2000a. Convolutive blind source separation of non-stationary sources. IEEE Trans. Speech Audio Process. 8(3), 320-327 (US Patent US6167417)] was used to separate out the signals. In the postprocessing phase two different denoising algorithms were used. The first was based on a minimum mean-square error log-spectral amplitude estimator [Ephraim, Y., Malah, D., 1985. Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Speech Audio Process. ASSP-33(2), 443-445], while the second one was based on a Wiener filter in which the concept of an a priori signal-to-noise estimation presented by Ephraim (as mentioned above) was applied [Scalart, P., Filho, J.V., 1996. Speech enhancement based on a priori signal to noise estimation. IEEE Internat. Conf. Acoust. Speech Signal Process. 1, 629-632]. Non-sense word tests were used as target speech in both cases while one or two disturbing sources were used as interferences. The speech intelligibility before and after the BSS was measured for three subjects with audiologically normal hearing. Next the speech signal after BSS was denoised and presented to the same listeners. The results revealed some ambiguities caused by the insufficient number of microphones compared to the number of sound sources. For one disturbance only, the intelligibility improvement was significant. However, when there were two disturbances and the target speech, the separation was much poorer. The additional denoising, as could be expected, raises the intelligibility slightly. Although the BSS method requires more research on optimization, the results of the investigation imply that it may be applied to hearing aids in the future. (c) 2007 Elsevier B.V. All rights reserved. C1 Adam Mickiewicz Univ, Fac Phys, Inst Acoust, PL-61614 Poznan, Poland. RP Kocinski, J (reprint author), Adam Mickiewicz Univ, Fac Phys, Inst Acoust, 85 Umultowska Str, PL-61614 Poznan, Poland.
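Both denoising variants cited in the Kocinski abstract above hinge on the a priori SNR, which in practice is computed with Ephraim and Malah's decision-directed estimator. In the usual textbook notation (a generic sketch, not the paper's implementation):

\[
\hat{\xi}_k(m) = \alpha \, \frac{|\hat{S}_k(m-1)|^2}{\lambda_d(k)} + (1 - \alpha) \max\big(\gamma_k(m) - 1,\, 0\big), \qquad \alpha \approx 0.98,
\]

where \(\gamma_k(m)\) is the a posteriori SNR of frequency bin \(k\) in frame \(m\), \(\lambda_d(k)\) is the noise power estimate, and \(\hat{S}_k(m-1)\) is the previous enhanced frame; the Scalart-style Wiener gain is then \(G_k = \hat{\xi}_k / (1 + \hat{\xi}_k)\).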
EM jen@amu.edu.pl CR AICHNER R, 2003, 4 INT S IND COMP AN AMARI S, 1997, 1 IEEE WORKSH SIGN P ANEMUELLER J, 2000, AMPLITUDE MODULATION ASANO F, 2001, REALTIME SOUND SOURC BELOUCHRANI A, 1996, P SPIE BRACHMANSKI S., 1999, SPEECH LANGUAGE TECH, V3, P71 Buchner H, 2004, AUDIO SIGNAL PROCESSING: FOR NEXT-GENERATION MULTIMEDIA COMMUNICATION SYSTEMS, P255, DOI 10.1007/1-4020-7769-6_10 CARDOSO JF, 1989, P ICASSP, P89 Choi S, 2003, IEICE T FUND ELECTR, VE86A, P198 Choi S, 2005, NEURAL INFORM PROCES, V6, P1 Choi SJ, 2002, J VLSI SIG PROC SYST, V32, P93, DOI 10.1023/A:1016319502849 CICHOCKI A, 2001, 3 INT C IND COMP AN Cichocki A., 2003, ADAPTIVE BLIND SIGNA COMON P, 1991, SIGNAL PROCESS, V24, P11, DOI 10.1016/0165-1684(91)90080-3 COMON P, 1994, SIGNAL PROCESS, V36, P287, DOI 10.1016/0165-1684(94)90029-9 Culling JF, 2000, J ACOUST SOC AM, V107, P517, DOI 10.1121/1.428320 Douglas SC, 2003, SPEECH COMMUN, V39, P65, DOI 10.1016/S0167-6393(02)00059-6 DUQUESNOY AJ, 1983, J ACOUST SOC AM, V73, P2166, DOI 10.1121/1.389540 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 HARMELLING S, 2001, CONVBSS BERLIN Hyvarinen A, 2001, INDEPENDENT COMPONEN JUTTEN C, 1991, SIGNAL PROCESS, V24, P1, DOI 10.1016/0165-1684(91)90079-X Kawamoto M, 1998, NEUROCOMPUTING, V22, P157, DOI 10.1016/S0925-2312(98)00055-1 KOCINSKI J, 2005, ARCH ACOUST Makino S, 2005, IEICE T FUND ELECTR, VE88A, P1640, DOI 10.1093/ietfec/e88-a.7.1640 MATSUOKA K, 1995, NEURAL NETWORKS, V8, P411, DOI 10.1016/0893-6080(94)00083-X MOLGEDEY L, 1994, PHYS REV LETT, V72, P3634, DOI 10.1103/PhysRevLett.72.3634 Moore BCJ, 1997, INTRO PSYCHOL HEARIN Parra L, 2000, J VLSI SIG PROC SYST, V26, P39, DOI 10.1023/A:1008187132177 PARRA L, 2002, NOISE REDUCTION SPEE Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214 PHAM DT, 2003, 2003 BLIND SEPARATIO Saruwatari H, 2003, EURASIP J APPL SIG P, V2003, P1135, DOI 10.1155/S1110865703305104 Sawada H., 2005, SPEECH ENHANCEMENT Scalart P., 1996, IEEE INT C AC SPEECH, V1, P629 Schobben L, 2002, IEEE T SIGNAL PROCES, V50, P1855 SMARAGDIS P, 1997, EEE ASSP WORKSH APPL Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 10.1016/S0925-2312(98)00047-2 ZAVAREHEI E, 2005, WIENERSCALART96 M ZAVAREHEI E, 2005, MMSESTSA85 M Zhou Y, 2003, SIGNAL PROCESS, V83, P2037, DOI 10.1016/S0165-1684(03)00134-8 Ziehe A, 2000, IEEE T BIO-MED ENG, V47, P75, DOI 10.1109/10.817622 NR 42 TC 11 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2008 VL 50 IS 1 BP 29 EP 37 DI 10.1016/j.specom.2007.06.003 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 234TB UT WOS:000251185800003 ER PT J AU Almpanidis, G Kotropoulos, C AF Almpanidis, George Kotropoulos, Constantine TI Phonemic segmentation using the generalised Gamma distribution and small sample Bayesian information criterion SO SPEECH COMMUNICATION LA English DT Article DE phonemic segmentation; Bayesian information criterion; generalised gamma distribution; small sample ID VOICE ACTIVITY DETECTION; MODEL SELECTION; PARAMETERS; LIKELIHOOD; INFERENCE; CRITIQUE; FIT AB In this work, we present a text-independent automatic phone segmentation algorithm based on the Bayesian Information Criterion. 
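For reference before the Almpanidis and Kotropoulos abstract continues below: the standard BIC change-point statistic that such segmentation algorithms start from compares, for a window of \(N\) feature frames of dimension \(d\), one Gaussian model against two Gaussians split at a candidate boundary (the Chen and Gopalakrishnan style formulation, cited in the record's reference list; the record's own contribution is to replace the Gaussian by a generalised Gamma model and to correct the criterion for small samples):

\[
\Delta\mathrm{BIC} = \frac{N}{2}\log|\Sigma| - \frac{N_1}{2}\log|\Sigma_1| - \frac{N_2}{2}\log|\Sigma_2| - \lambda\,\frac{1}{2}\Big(d + \frac{d(d+1)}{2}\Big)\log N,
\]

where \(\Sigma\), \(\Sigma_1\), \(\Sigma_2\) are the covariances of the whole window and of its two parts (\(N = N_1 + N_2\)) and \(\lambda\) is the penalty weight; a boundary is hypothesised wherever \(\Delta\mathrm{BIC} > 0\).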
Speech segmentation at a phone level imposes high resolution requirements in the short-time analysis of the audio signal; otherwise the limited information available in such a small scale would be too restrictive for an efficient characterisation of the signal. In order to alleviate this problem and detect the phone boundaries accurately, we employ an information criterion corrected for small samples while modelling speech samples with the generalised Gamma distribution, which offers a more efficient parametric characterisation of speech in the frequency domain than the Gaussian distribution. Using a computationally inexpensive maximum likelihood approach for parameter estimation, we evaluate the efficiency of the proposed algorithm in M2VTS and NTIMIT data sets and we demonstrate that the proposed adjustments yield significant performance improvement in noisy environments. (c) 2007 Elsevier B.V. All rights reserved. C1 Aristotle Univ Thessaloniki, Dept Informat, Thessaloniki 54124, Greece. RP Almpanidis, G (reprint author), Aristotle Univ Thessaloniki, Dept Informat, Box 451, Thessaloniki 54124, Greece. EM galba@aiia.csd.auth.gr; costas@aiia.csd.auth.gr RI Kotropoulos, Constantine/B-7928-2010 OI Kotropoulos, Constantine/0000-0001-9939-7930 CR Adell J., 2004, P 5 ISCA SPEECH SYNT, P139 AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705 ANDERSON D, 1999, BIRD STUDY, V46, P514 ANDERSON DR, 1994, ECOLOGY, V75, P1780, DOI 10.2307/1939637 Aversano G., 2001, P 44 IEEE MIDW S CIR, V2, P516, DOI 10.1109/MWSCAS.2001.986241 Bakis R., 1997, P SPEECH REC WORKSH, P67 Bernardo J. M., 1976, Applied Statistics, V25, DOI 10.2307/2347257 Bogdan M, 2004, GENETICS, V167, P989, DOI 10.1534/genetics.103.021683 BOLLEN K, 2005, SAMSI LVSS WORKSH TR BOZDOGAN H, 1987, PSYCHOMETRIKA, V52, P345, DOI 10.1007/BF02294361 BOZDOGAN H, 1988, P CLASS REL METH DAT, P599 Brugnara F., 1992, P INT C SPOK LANG PR, V1, P627 Burnham KP, 2004, SOCIOL METHOD RES, V33, P261, DOI 10.1177/0049124104268644 CETTOLO M, 2000, P ISCA ITRW ASR2000, P221 Chang J.-H., 2003, P EUR GEN SWITZ AUG, P1065 Chen S.S., 1998, DARPA SPEECH REC WOR Chen SS, 2002, SPEECH COMMUN, V37, P69, DOI 10.1016/S0167-6393(01)00060-7 Chickering DM, 1997, MACH LEARN, V29, P181 COHEN AC, 1986, J QUAL TECHNOL, V17, P147 DAT TH, 2006, P 2006 IEEE INT C AC, V4, P1149 DAYTON C, 2003, MODERN APPL STAT MET, V2, P281 Delacourt P, 2000, SPEECH COMMUN, V32, P111, DOI 10.1016/S0167-6393(00)00027-3 ESPOSITO A, 2004, P SUMM SCH NEUR NETW, P261 Garofolo J., 1990, DARPA TIMIT ACOUSTIC Gazor S, 2003, IEEE SIGNAL PROC LET, V10, P204, DOI 10.1109/LSP.2003.813679 Gelman A, 1995, BAYESIAN DATA ANAL GHEISSARI N, 2003, P DIG IMAG COMP TECH, V1, P185 Glass JR, 2003, COMPUT SPEECH LANG, V17, P137, DOI 10.1016/S0885-2308(03)00006-8 HAGER HH, 1987, J AMER STAT ASS, V82, P528 HANNAN EJ, 1979, J ROY STAT SOC B MET, V41, P190 HARTER HL, 1967, TECHNOMETRICS, V9, P159, DOI 10.2307/1266326 HAUGHTON DMA, 1988, ANN STAT, V16, P342, DOI 10.1214/aos/1176350709 Hwang TY, 2002, ANN I STAT MATH, V54, P840, DOI 10.1023/A:1022471620446 Jankowski C., 1990, P IEEE INT C AC SPEE, P109 KADRI H, 2006, P EUR C SIGN PROC KASHYAP RL, 1977, IEEE T AUTOMAT CONTR, V22, P715, DOI 10.1109/TAC.1977.1101594 KASS RE, 1995, J AM STAT ASSOC, V90, P928, DOI 10.2307/2291327 KOKKINAKIS K, 2006, P IEEE INT C AC SPEE, V1, P1217 Kominek J., 2003, P 8 EUR C SPEECH COM, P313 Kuha J, 2004, SOCIOL METHOD RES, V33, P188, DOI 10.1177/0049124103262065 LEE M, 2002, MATH PSYCHOL, V45, P131 LWALESS J, 1980, 
TECHNOMETRICS, V33, P409 MARTIN R, 2002, P IEEE INT C AC SPEE, V1, P253 Mitchell C., 1995, P ICASSP DETR MI, V1, P229 NAKAMURA A, 2000, IEICE T INF SYST, P2118 Nemer E, 2001, IEEE T SPEECH AUDI P, V9, P217, DOI 10.1109/89.905996 OLSTHOORN TN, 1995, GROUND WATER, V33, P42, DOI 10.1111/j.1745-6584.1995.tb00261.x PARR VB, 1965, TECHNOMETRICS, V7, P1, DOI 10.2307/1266123 Pellom BL, 1998, SPEECH COMMUN, V25, P97, DOI 10.1016/S0167-6393(98)00031-4 Pickett J. M., 1999, ACOUSTICS SPEECH COM Pigeon S, 1997, LECT NOTES COMPUT SC, V1206, P403 RAFTERY A, 1996, MODEL SELECTION GENE, P321 Raftery AE, 1999, SOCIOL METHOD RES, V27, P411, DOI 10.1177/0049124199027003005 RISSANEN J, 1978, AUTOMATICA, V14, P465, DOI 10.1016/0005-1098(78)90005-5 Schneider B. E., 1978, Applied Statistics, V27, DOI 10.2307/2346249 SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136 Schwarz P., 2006, P ICASSP, VI, P325 SHEWCHUK JR, 1994, CMUCS94125 CARN MELL Shin JW, 2005, IEEE SIGNAL PROC LET, V12, P258, DOI 10.1109/LSP.2004.840869 Shin JW, 2005, P IEEE INT C ACOUSTI, V1, p[781, 17] Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 Spiegelhalter DJ, 2002, J ROY STAT SOC B, V64, P583, DOI 10.1111/1467-9868.00353 STACY EW, 1962, ANN MATH STAT, V33, P1187, DOI 10.1214/aoms/1177704481 SUGIURA N, 1978, COMMUN STAT A-THEOR, V7, P13, DOI 10.1080/03610927808827599 SUGIYAMA M, 2001, P ART NEUR NETS GEN, P418 Takeuchi K., 1976, SURI KAGAKU, V153, P12 Toledano DT, 2003, IEEE T SPEECH AUDI P, V11, P617, DOI 10.1109/TSA.2003.813579 Tremblay M, 2004, AGRONOMIE, V24, P351, DOI 10.1051/agro:2004033 Tritschler A, 1999, P EUROSPEECH, P679 Tsionas EG, 2001, COMMUN STAT-THEOR M, V30, P747, DOI 10.1081/STA-100002149 VANALLEN T, 2000, THESIS DEP COMPUTER Varga A., 1992, NOISEX 92 STUDY AFFE VILLIERS E, 2001, P 12 ANN S PATT REC, P120 VISSER I, 2005, SAMSI LVSSS WORKSH T Volinsky CT, 2000, BIOMETRICS, V56, P256, DOI 10.1111/j.0006-341X.2000.00256.x WANG H, 2004, P INT C SPOK LANG PR, P1617 WANG L, 2006, IEICE T INF SYST, P1082 Weakliem DL, 1999, SOCIOL METHOD RES, V27, P359, DOI 10.1177/0049124199027003002 Woodland P., 1997, P SPEECH REC WORKSH, P73 ZHANG S, 2006, P 18 INT C PATT REC, P298 ZHAO Y, 2005, P INT, P2557 NR 81 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2008 VL 50 IS 1 BP 38 EP 55 DI 10.1016/j.specom.2007.06.005 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 234TB UT WOS:000251185800004 ER PT J AU Moore, E Torres, J AF Moore, E. Torres, J. TI A performance assessment of objective measures for evaluating the quality of glottal waveform estimates SO SPEECH COMMUNICATION LA English DT Article DE glottal waveform; voice source; inverse filtering ID VOCAL FOLD VIBRATION; AIR-FLOW; SPEECH; MODEL AB Automatic glottal waveform, estimation remains a challenging problem in speech analysis. Developing criteria for the objective assessment of the quality of glottal waveform estimates would facilitate the design of more robust estimation algorithms. The aim of this paper is to investigate the performance of potential glottal waveform quality measures (GQM's) and to determine whether a combination of these GQM's is able to consistently provide an accurate assessment of glottal waveform estimate quality across several speakers and phonemes. 
We develop an experimental setup that produces disjoint sets of high and low-quality glottal waveform estimates from real speech and use this data to objectively assess the performance of 12 glottal waveform quality measures on a sustained vowel speech dataset spanning 16 male speakers and 3 phonemes. In addition, we present a rank-based method (RB-GQA) that allows arbitrary GQM subsets to be effectively combined. Using this method, we perform an exhaustive search on the GQM subset space to determine the best-performing GQM combinations for different groups of speakers and phonemes. While it was found that the optimal GQM combinations are speaker and phoneme dependent, optimization across all utterances (speaker-phoneme pairs) resulted in a combination of 4 GQM's (ratio of first harmonic to maximum harmonic over 0-3.7 kHz, group delay variance, phase-plane cycles/period, and phase-plane mean sub-cycle length) that performed very well on almost every utterance in the dataset and nearly matched the performance of the GQM subsets obtained via phoneme-dependent optimization. (c) 2007 Elsevier B.V. All rights reserved. C1 Georgia Inst Technol, Sch Elect & Comp Engn, Savannah, GA 31407 USA. RP Moore, E (reprint author), Georgia Inst Technol, Sch Elect & Comp Engn, 210 Technol Circle, Savannah, GA 31407 USA. EM emoore@gtsav.gatech.edu CR Airas M, 2006, PHONETICA, V63, P26, DOI 10.1159/000091405 Airas M., 2005, INTERSPEECH, P2145 Akande OO, 2005, SPEECH COMMUN, V46, P15, DOI 10.1016/j.specom.2005.01.007 ALKU P, 2005, INTERSPEECH, P1053 ALKU P, 2004, INTERSPEECH, P497 Ananthapadmanabha T. V., 1982, SPEECH COMMUN, V1, P167, DOI 10.1016/0167-6393(82)90015-2 BACKSTROM T, 2005, IEEE INT C AC SPCH S, V1, P897 Berg J. V. D., 1957, J ACOUST SOC AM, V29, P626 BROOKES DM, 1999, IEEE INT C AC SPCH S, V1, P213 Brookes M, 2006, IEEE T AUDIO SPEECH, V14, P456, DOI 10.1109/TSA.2005.857810 Childers D.G., 2000, SPEECH PROCESSING SY CUMMINGS K, 1990, IEEE INT C AC SPEECH, V1, P369 DELLER JR, 1981, IEEE T ACOUST SPEECH, V29, P917, DOI 10.1109/TASSP.1981.1163651 Deng HQ, 2006, IEEE T AUDIO SPEECH, V14, P445, DOI 10.1109/TSA.2005.857811 Frohlich M, 2001, J ACOUST SOC AM, V110, P479, DOI 10.1121/1.1379076 Fu Q, 2006, IEEE T AUDIO SPEECH, V14, P492, DOI 10.1109/TSA.2005.857807 Granqvist S, 2003, J VOICE, V17, P319, DOI 10.1067/S0892-1997(03)00070-5 Guerin B, 1976, IEEE INT C AC SPEECH, V1, P47, DOI 10.1109/ICASSP.1976.1170152 KOUNOUDES A, 2002, IEEE INT C AC SPEECH, V1, P349 Moore E., 2003, P 25 ANN C ENG MED B, V3, P2849 Moore E, 2006, INTERSPEECH 2006 AND 9TH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, VOLS 1-5, P1694 MOORE E, 2004, IEEE INT C AC SPCH S, V1, P101 Ozdas A, 2004, IEEE T BIO-MED ENG, V51, P1530, DOI 10.1109/TBME.2004.827544 Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109 Rentzos D., 2003, 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (IEEE Cat. No.03EX721), DOI 10.1109/ASRU.2003.1318526 Svec JG, 1996, J VOICE, V10, P201, DOI 10.1016/S0892-1997(96)80047-6 TEAGER H, 1989, NATO ASI SPEECH PROD, P1 TEAGER HM, 1980, IEEE T ACOUST SPEECH, V28, P599, DOI 10.1109/TASSP.1980.1163453 Theodoridis S., 1999, PATTERN RECOGNITION WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260 NR 30 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JAN PY 2008 VL 50 IS 1 BP 56 EP 66 DI 10.1016/j.specom.2007.06.006 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 234TB UT WOS:000251185800005 ER PT J AU Safaa, JA Dominique, PA Rosec, O AF Jarifi, Safaa Pastor, Dominique Rosec, Olivier TI A fusion approach for automatic speech segmentation of large corpora with application to speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE automatic speech segmentation; speech synthesis; HMM; Brandt's GLR algorithm; boundary model; soft supervision; hard supervision AB This paper deals with the automatic segmentation of large speech corpora in the case when the phonetic sequence corresponding to the speech signal is known. A direct and typical application is corpus-based Text-To-Speech (TTS) synthesis. We start by proposing a general approach for combining several segmentations produced by different algorithms. Then, we describe and analyse three automatic segmentation algorithms that will be used to evaluate our fusion approach. The first algorithm is segmentation by Hidden Markov Models (HMM). The second one, called refinement by boundary model, aims at improving the segmentation performed by HMM via a Gaussian Mixture Model (GMM) of each boundary. The third one is a slightly modified version of Brandt's Generalized Likelihood Ratio (GLR) method; its goal is to detect signal discontinuities in the vicinity of the HMM boundaries. Objective performance measurements show that refinement by boundary model is the most accurate of the three algorithms in the sense that the estimated segmentation marks are the closest to the manual ones. When applied to the different output segmentations obtained by the three algorithms mentioned above, any of the fusion methods proposed in this paper is more accurate than refinement by boundary model. With respect to the corpora considered in this paper, the most accurate fusion method, called optimal fusion by soft supervision, reduces by 25.5%, 60% and 75% the number of segmentation errors made by refinement by boundary model, standard HMM segmentation and Brandt's GLR method, respectively. Subjective listening tests are carried out in the context of corpus-based speech synthesis. They show that the quality of the synthetic speech obtained when the speech corpus is segmented by optimal fusion by soft supervision approaches that obtained when the same corpus is manually segmented. (c) 2007 Elsevier B.V. All rights reserved. C1 Ecole Natl Super Telecommun Bretagne, Dept Signal & Commun, F-29238 Brest 3, France. TECH SSTP VMI, France Telecom, R&D Div, F-22307 Lannion, France. RP Safaa, JA (reprint author), Ecole Natl Super Telecommun Bretagne, Dept Signal & Commun, Technopole Brest Iroise,CS 83818, F-29238 Brest 3, France.
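To make the combination idea of the fusion record above (Jarifi et al.) concrete, here is a deliberately naive baseline: collect the boundary times proposed by the different segmenters for the same phone boundary and return a consensus value. This only illustrates the interface such a combiner fills; the record's optimal fusion by soft supervision is a learned, supervised scheme, not a median:

import numpy as np

def fuse_boundaries(candidates):
    # candidates: one list per phone boundary, holding the time in seconds
    # proposed by each segmenter (HMM, boundary-model refinement, Brandt's GLR).
    # The median is a robust unsupervised consensus.
    return [float(np.median(times)) for times in candidates]

# Three algorithms proposing times for two successive boundaries:
print(fuse_boundaries([[0.512, 0.505, 0.530], [0.810, 0.802, 0.951]]))  # [0.512, 0.81]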
EM safaa.jarifi@enst-bretagne.fr; dominique.pastor@enst-bretagne.fr; olivier.rosec@orange-ftgroup.com CR Adell J., 2004, P 5 ISCA SPEECH SYNT, P139 ANDREOBRECHT R, 1988, IEEE T ACOUST SPEECH, V36, P29, DOI 10.1109/29.1486 BRANDT AV, 1983, IEEE INT C AC SPEECH, P107 BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W JARIFI S, 2005, 13 EUR SIGN PROC C E JARIFI S, 2006, 9 INT C SPOK LANG PR JARIFI S, 2007, THESIS ECOLE NATL SU Kim I, 2002, ELEC SOC S, V2002, P145 Matousek J., 2003, 8 EUR C SPEECH COMM, P301 NEFTI S, 2004, THESIS U RENNES, P1 Odell J.J., 1995, THESIS U CAMBRIDGE Park SS, 2006, IEEE SIGNAL PROC LET, V13, P640, DOI 10.1109/LSP.2006.875347 PARK SS, 2006, 9 INT C SPOK LANG PR Toledano DT, 2003, IEEE T SPEECH AUDI P, V11, P617, DOI 10.1109/TSA.2003.813579 TOLEDANO DT, 1998, 3 ESCA COSCOSDA INT, P26 Wang L., 2004, ICCASP, VI, P641 YOUNG S, 2002, HTK BOOK HTK V 3 2 1 NR 17 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2008 VL 50 IS 1 BP 67 EP 80 DI 10.1016/j.specom.2007.07.001 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 234TB UT WOS:000251185800006 ER PT J AU Hagen, A Pellom, B Cole, R AF Hagen, Andreas Pellom, Bryan Cole, Ronald TI Highly accurate children's speech recognition for interactive reading tutors using subword units SO SPEECH COMMUNICATION LA English DT Article DE literacy tutors; subword unit based speech recognition; language modeling; reading tracking AB Speech technology offers great promise in the field of automated literacy and reading tutors for children. In such applications, speech recognition can be used to track the reading position of the child, detect oral reading miscues, assess comprehension of the text being read by estimating whether the prosodic structure of the speech is appropriate to the discourse structure of the story, or engage the child in interactive dialogs to assess and train comprehension. Despite such promises, speech recognition systems exhibit higher error rates for children due to variabilities in vocal tract length, formant frequency, pronunciation, and grammar. In the context of recognizing speech while children are reading out loud, these problems are compounded by speech production behaviors affected by difficulties in recognizing printed words that cause pauses, repeated syllables and other phenomena. To overcome these challenges, we present advances in speech recognition that improve accuracy and modeling capability in the context of an interactive literacy tutor for children. Specifically, this paper focuses on a novel set of speech recognition techniques which can be applied to improve oral reading recognition. First, we demonstrate that speech recognition error rates for interactive read aloud can be reduced by more than 50% through a combination of advances in both statistical language and acoustic modeling. Next, we propose extending our baseline system by introducing a novel token-passing search architecture targeting subword unit based speech recognition. The proposed subword unit based speech recognition framework is shown to provide equivalent accuracy to a whole-word based speech recognizer while enabling detection of oral reading events and finer grained speech analysis during recognition.
The efficacy of the approach is demonstrated using data collected from children in grades 3-5: 34.6% of partial words with reasonable evidence in the speech signal are detected at a low false alarm rate of 0.5%. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Colorado, Ctr Spoken Language Res, Boulder, CO 80301 USA. RP Pellom, B (reprint author), Univ Colorado, Ctr Spoken Language Res, 1777 Exposit Dr,Suite 171, Boulder, CO 80301 USA. EM andreash@cslr.colorado.edu; pellom@cslr.colorado.edu; cole@cslr.colorado.edu CR AIST G, 1998, P ICSLP 98 SYDN AUST ARCY S, 2004, P ICSLP 2004 JEJ ISL Banerjee S., 2003, P 2 INT C APPL ART I BANERJEE S, 2003, P EUR 2003 GEN SWITZ Bazzi I., 2002, THESIS MIT Cole R., 2006, TRCSLR200602 U COL COLE R, 2006, EDUC TECHNOL, V47, P14 Cole R., 2006, TRCSLR200603 U COL *COLIT, 2004, COL LIB TUT PROJ COSI P, 2005, P EUR 2005 LISB PORT Creutz M., 2002, P WORKSH MORPH PHON, P21 DAS S, 1998, P ICASSP 98 SEATTL W ESKENAZI M, 1996, J ACOUST SOC AM 2, V100 FOGARTY J, 2001, P 10 INT C ART INT E Gales M.J.F., 1997, CUEDFINFENGTR291 CAM GIULIANI D, 2003, P ICASSP 2003 HONG K GUSTAFSON J, 2002, P ICSLP 2002 DENV CO HACIOGLU K, 2003, P EUR 2003 GEN SWITZ HAGEN A, 2005, INTERSPEECH 2005 HAGEN A, 2005, 2 LANG TECH C POZN P HAGEN A, 2003, IEEE AUT SPEECH REC HAGEN A, 2004, ADV CHILDRENS SPEECH LEE K, 2004, ANAL DETECTION READI LEE S, 1997, P EUROSPEECH 97 RHOD Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686 LI Q, 2002, P ICSLP 02 DENV COL MCCANDLESS M, 2002, THESIS MIT MOSTOW J, 2002, ICSLP 2002 MOSTOW J, 1994, PROCEEDINGS OF THE TWELFTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 AND 2, P785 PELLOM B, 2003, P ICASSP 2003 HONG K Pellom B, 2001, TRCSLR200101 U COL POTAMIANOS A, 1997, P EUROSPEECH 97 RHOD Potamianos A, 2003, IEEE T SPEECH AUDI P, V11, P603, DOI 10.1109/TSA.2003.818026 Shobaki K., 2000, P ICSLP 2000 BEIJ CH Siohan O, 2002, COMPUT SPEECH LANG, V16, P5, DOI 10.1006/csla.2001.0181 SPACHE G, 1981, DIAGNOSTIC READING S TAM UC, 2003, P EUR 2003 GEN SWITZ VANVUUREN S, 2006, TRCSLR200601 U COL WELLING L, 1999, P ICASSP 99 PHOEN AR WISE B, 2005, INTERACTIVE LIT ED F Young S., 1989, CUEDFINFENGTR38 NR 41 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2007 VL 49 IS 12 BP 861 EP 873 DI 10.1016/j.specom.2007.05.004 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 227EZ UT WOS:000250641300001 ER PT J AU Ma, N Green, P Barker, J Coy, A AF Ma, Ning Green, Phil Barker, Jon Coy, Andre TI Exploiting correlogram structure for robust speech recognition with multiple speech sources SO SPEECH COMMUNICATION LA English DT Article DE speech separation; robust speech recognition; multiple pitch tracking; computational auditory scene analysis; correlogram; speech fragment decoding ID AUDITORY SCENE ANALYSIS; DIFFERENT FUNDAMENTAL FREQUENCIES; CONCURRENT VOWELS; SEPARATION; PERCEPTION; IDENTIFICATION; SOUNDS; MODEL AB This paper addresses the problem of separating and recognising speech in a monaural acoustic mixture with the presence of competing speech sources. The proposed system treats sound source separation and speech recognition as tightly coupled processes. In the first stage sound source separation is performed in the correlogram domain.
For periodic sounds, the correlogram exhibits symmetric tree-like structures whose stems are located on the delay that corresponds to multiple pitch periods. These pitch-related structures are exploited in the study to group spectral components at each time frame. Local pitch estimates are then computed for each spectral group and are used to form simultaneous pitch tracks for temporal integration. These processes segregate a spectral representation of the acoustic mixture into several time-frequency regions such that the energy in each region is likely to have originated from a single periodic sound source. The identified time-frequency regions, together with the spectral representation, are employed by a 'speech fragment decoder' which applies 'missing data' techniques with clean speech models to simultaneously search for the acoustic evidence that best matches model sequences. The paper presents evaluations based on artificially mixed simultaneous speech utterances. A coherence-measuring experiment is first reported which quantifies the consistency of the identified fragments with a single source. The system is then evaluated in a speech recognition task and compared to a conventional fragment generation approach. Results show that the proposed system produces more coherent fragments over different conditions, which results in significantly better recognition accuracy. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Ma, N (reprint author), Univ Sheffield, Dept Comp Sci, Regent Court,211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. EM n.ma@dcs.shef.ac.uk; p.green@dcs.shef.ac.uk; j.barker@dcs.shef.ac.uk; a.coy@dcs.shef.ac.uk CR ASSMANN PF, 1990, J ACOUST SOC AM, V88, P680, DOI 10.1121/1.399772 BARKER J, 2006, P INT 2006 PITTSB, P85 Barker J., 2000, P ICSLP BEIJ CHIN, P373 Barker JP, 2005, SPEECH COMMUN, V45, P5, DOI 10.1016/j.specom.2004.05.002 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 Brown GJ, 2005, SIG COM TEC, P371, DOI 10.1007/3-540-27489-8_16 COOKE M, 1991, THESIS U SHEFFIELD Cooke M, 2006, J ACOUST SOC AM, V120, P2421, DOI 10.1121/1.2229005 COOKE M, UNPUB J ACOUST SOC A COOKE M, 1997, P ICASSP 1997, V1, P25 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Cooke M, 2001, SPEECH COMMUN, V35, P141, DOI 10.1016/S0167-6393(00)00078-9 Coy A, 2007, SPEECH COMMUN, V49, P384, DOI 10.1016/j.specom.2006.11.002 COY A, 2006, P INTERSPEECH 2006, P1678 COY A, 2005, P INT 2005 LISB, P2641 DECHEVEIGNE A, 1993, J ACOUST SOC AM, V93, P3271 Ellis DPW, 1999, SPEECH COMMUN, V27, P281, DOI 10.1016/S0167-6393(98)00083-1 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Gonzalez RC, 2004, DIGITAL IMAGE PROCES Hirsch H., 2000, P ICSLP, V4, P29 Hu G., 2006, THESIS OHIO STATE U LICKLIDER JCR, 1951, EXPERIENTIA, V7, P128, DOI 10.1007/BF02156143 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 MA N, 2006, P INT 2006 PITTSB, P669 MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 MEDDIS R, 1992, J ACOUST SOC AM, V91, P233, DOI 10.1121/1.402767 MEDDIS R, 1991, J ACOUST SOC AM, V89, P2866, DOI 10.1121/1.400725 Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214 SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1613 Slaney M., 1990, P IEEE INT C AC SPEE, P357 SUMMERFIELD Q, 1990, P I ACOUSTICS, V12, P507 Wang DLL, 1999, IEEE T NEURAL NETWOR, V10, P684, DOI
10.1109/72.761727 NR 33 TC 26 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2007 VL 49 IS 12 BP 874 EP 891 DI 10.1016/j.specom.2007.05.003 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 227EZ UT WOS:000250641300002 ER PT J AU Yang, ZG Chen, J Huang, Q Wu, XH Wu, YH Schneider, BA Li, L AF Yang, Zhigang Chen, Jing Huang, Qiang Wu, Xihong Wu, Yanhong Schneider, Bruce A. Li, Liang TI The effect of voice cuing on releasing Chinese speech from informational masking SO SPEECH COMMUNICATION LA English DT Article DE speech; informational masking; energetic masking; cuing effect; voice ID PERCEIVED SPATIAL SEPARATION; NORMAL-HEARING LISTENERS; 2 SIMULTANEOUS TALKERS; ENERGETIC MASKING; FLUCTUATING NOISE; RECEPTION THRESHOLD; FUNDAMENTAL-FREQUENCY; INTELLIGIBILITY INDEX; PERCEPTION; RECOGNITION AB In a cocktail-party environment, human listeners are able to use perceptual-level and cognitive-level cues to segregate the attended target speech from other background conversations. At the cognitive level, priming the listener with part of the target speech in quiet can markedly improve the recognition of the remaining parts when the target speech and competing speech are presented at the same time. Hence, knowledge of content (content cuing) improves speech recognition when other people are talking. In addition, familiarity or knowledge of the voice characteristics of the target talker could also help the listener attend to the target talker when other talkers are present. The present study investigated the extent to which a cognitive-level cue (content cuing) and a perceptual-level cue (voice cuing) can improve word identification for speech masked by noise or by other speech in Chinese listeners. Specifically, listeners were primed with part of a sentence in quiet before a sentence was repeated in the presence of either noise or speech. The priming sentence was always in the same voice as the target sentence. Two kinds of primes were investigated: same-sentence primes, and different-sentence primes. Under speech-masking conditions, each of the two prime types significantly improved recognition of the last key word in the full-length target sentence. Under noise-masking conditions, same-sentence primes had a weak but significant releasing effect, but different-sentence primes had only a negligible releasing effect. These results suggest that in addition to content cues, voice cues can be used by Chinese listeners to release speech from masking by other talkers. (C) 2007 Elsevier B.V. All rights reserved. C1 Peking Univ, Natl Key Lab Machine Percept, Dept Psychol, Speech & Hearing Res Ctr, Beijing 100871, Peoples R China. Univ Toronto, Ctr Res Biol Commun Syst, Dept Psychol, Mississauga, ON L5L 1C6, Canada. RP Li, L (reprint author), Peking Univ, Natl Key Lab Machine Percept, Dept Psychol, Speech & Hearing Res Ctr, Beijing 100871, Peoples R China. 
EM liangli@pku.edu.cn CR Arbogast TL, 2002, J ACOUST SOC AM, V112, P2086, DOI 10.1121/1.1510141 ASSMANN PF, 1989, J ACOUST SOC AM, V85, P327, DOI 10.1121/1.397684 Brungart DS, 2001, J ACOUST SOC AM, V109, P1101, DOI 10.1121/1.1345696 Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946 Brungart DS, 2002, J ACOUST SOC AM, V112, P664, DOI 10.1121/1.1490592 Darwin CJ, 2003, J ACOUST SOC AM, V114, P2913, DOI 10.1121/1.1616924 Darwin CJ, 2000, J ACOUST SOC AM, V107, P970, DOI 10.1121/1.428278 Durlach NI, 2003, J ACOUST SOC AM, V114, P368, DOI 10.1121/1.1577562 FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247 Freyman RL, 1999, J ACOUST SOC AM, V106, P3578, DOI 10.1121/1.428211 Freyman RL, 2001, J ACOUST SOC AM, V109, P2112, DOI 10.1121/1.1354984 Freyman RL, 2004, J ACOUST SOC AM, V115, P2246, DOI 10.1121/1.689343 GUSTAFSSON HA, 1994, J ACOUST SOC AM, V95, P518, DOI 10.1121/1.408346 Helfer KS, 1997, J SPEECH LANG HEAR R, V40, P432 HOWARDJONES PA, 1993, ACUSTICA, V78, P258 Kang J, 1998, J ACOUST SOC AM, V103, P1213, DOI 10.1121/1.421253 KIDD G, 1994, J ACOUST SOC AM, V95, P3475, DOI 10.1121/1.410023 Kidd G, 2005, J ACOUST SOC AM, V118, P982, DOI 10.1121/1.1953167 Kidd G, 1998, J ACOUST SOC AM, V104, P422, DOI 10.1121/1.423246 Kidd G, 2005, J ACOUST SOC AM, V118, P3804, DOI 10.1121/1.2109187 Krishnan A, 2005, COGNITIVE BRAIN RES, V25, P161, DOI 10.1016/j.cogbrainres.2005.05.004 Li L, 2004, J EXP PSYCHOL HUMAN, V30, P1077, DOI 10.1037/0096-1523.30.6.1077 LUTFI RA, 1990, J ACOUST SOC AM, V88, P2607, DOI 10.1121/1.399980 Nelson PB, 2003, J ACOUST SOC AM, V113, P961, DOI 10.1121/1.1531983 Oxenham AJ, 2003, J ACOUST SOC AM, V114, P1543, DOI 10.1121/1.1598197 Rhebergen KS, 2006, J ACOUST SOC AM, V120, P3988, DOI 10.1121/1.2358008 Rhebergen KS, 2005, J ACOUST SOC AM, V117, P2181, DOI 10.1121/1.1861713 SCHNEIDER BA, IN PRESS J AM ACAD A Shinn-Cunningham BG, 2005, ACTA ACUST UNITED AC, V91, P967 Summers V, 2004, J SPEECH LANG HEAR R, V47, P245, DOI 10.1044/1092-4388(2004/020) Wu XH, 2005, HEARING RES, V199, P1, DOI 10.1016/j.heares.2004.03.010 NR 31 TC 24 Z9 25 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2007 VL 49 IS 12 BP 892 EP 904 DI 10.1016/j.specom.2007.05.005 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 227EZ UT WOS:000250641300003 ER PT J AU Hoen, M Meunier, F Grataloup, CL Pellegrino, F Grimault, N Perrin, F Perrot, X Collet, L AF Hoen, Michel Meunier, Fanny Grataloup, Claire-Leonie Pellegrino, Francois Grimault, Nicolas Perrin, Fabien Perrot, Xavier Collet, Lionel TI Phonetic and lexical interferences in informational masking during speech-in-speech comprehension SO SPEECH COMMUNICATION LA English DT Article DE cocktail party; speech-in-speech; energetic masking; informational masking; lexical competition ID SPOKEN-WORD-RECOGNITION; COCKTAIL PARTY PHENOMENON; SIMULTANEOUS TALKERS; INTELLIGIBILITY; NOISE; COMPETITION; TIME; SEGMENTATION; PERCEPTION; ACTIVATION AB This study investigates masking effects occurring during speech comprehension in the presence of concurrent speech signals. We examined the differential effects of acoustic-phonetic and lexical content of 4- to 8-talker babble (natural speech) or babble-like noise (reversed speech) on word identification. 
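For readers unfamiliar with such stimuli, N-talker babble of the kind used in this study is typically built by RMS-equalizing and summing N talkers, with time reversal giving the babble-like control that preserves the spectrum while destroying lexical content. The sketch below is a hypothetical illustration on synthetic signals, not the study's recordings.

```python
# Sketch of building N-talker babble (and its time-reversed control) by
# summing RMS-equalized talkers: an illustration, not the actual stimuli.
import numpy as np

rng = np.random.default_rng(1)
fs = 16000
talkers = [rng.standard_normal(fs) for _ in range(8)]  # placeholder signals

def rms_normalize(x, target=0.1):
    return x * (target / (np.sqrt(np.mean(x ** 2)) + 1e-12))

def make_babble(signals, n_talkers, reverse=False):
    chosen = [rms_normalize(s) for s in signals[:n_talkers]]
    if reverse:  # time reversal destroys lexical content, keeps the spectrum
        chosen = [s[::-1] for s in chosen]
    return rms_normalize(np.sum(chosen, axis=0))

babble4 = make_babble(talkers, 4)         # natural 4-talker babble
control4 = make_babble(talkers, 4, True)  # babble-like (reversed) noise
print(babble4.shape, control4.shape)
```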
Behavioral results show a monotonic decrease in speech comprehension rates with an increasing number of simultaneous talkers in the reversed condition. Similar results are obtained with natural speech except for the 4-talker babble situations. An original signal analysis is then proposed to evaluate the spectro-temporal saturation of composite multitalker babble. Results from this analysis show a monotonic increase in spectro-temporal saturation with an increasing number of simultaneous talkers, for both natural and reversed speech. This suggests that informational masking consists of at least acoustic-phonetic masking, which is fairly similar in the reversed and natural conditions, and lexical masking, which is present only with natural babble. Both effects depend on the number of talkers in the background babble. In particular, results confirm that lexical masking occurs only when some words in the babble are detectable, i.e. for a low number of talkers, such as 4, and diminishes with more talkers. These results suggest that different levels of linguistic information can be extracted from background babble and cause different types of linguistic competition for target-word identification. The use of this paradigm by psycholinguists could be of primary interest in detailing the various information types competing during lexical access. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Lyon, Inst Sci Homme, UMR 5596, CNRS,Lab Dynam Du Langage DDL, F-69363 Lyon 07, France. Univ Lyon, CNRS, UMR 5020, Lab Neurosci & Syst Sensoriels, F-69366 Lyon 07, France. RP Hoen, M (reprint author), Univ Lyon, Inst Sci Homme, UMR 5596, CNRS,Lab Dynam Du Langage DDL, 14 Ave Berthelot, F-69363 Lyon 07, France. EM michel.hoen@phonak.com RI Hoen, Michel/C-7721-2012 OI Hoen, Michel/0000-0003-2099-8130 CR ANDREOBRECHT R, 1988, IEEE T ACOUST SPEECH, V36, P29, DOI 10.1109/29.1486 BACON SP, 1985, J ACOUST SOC AM, V78, P1231, DOI 10.1121/1.392891 Bolia RS, 2000, J ACOUST SOC AM, V107, P1065, DOI 10.1121/1.428288 Bregman A., 1994, AUDITORY SCENE ANAL BRONKHORST AW, 1992, J ACOUST SOC AM, V92, P3132, DOI 10.1121/1.404209 Bronkhorst AW, 2000, ACUSTICA, V86, P117 Brungart DS, 2001, J ACOUST SOC AM, V109, P2276, DOI 10.1121/1.1357812 Brungart DS, 2001, J ACOUST SOC AM, V109, P1101, DOI 10.1121/1.1345696 Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946 Brungart DS, 2006, J ACOUST SOC AM, V120, P4007, DOI 10.1121/1.2363929 CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 Chi TS, 1999, J ACOUST SOC AM, V106, P2719, DOI 10.1121/1.428100 DANHAUER JL, 1979, J SPEECH HEAR DISORD, V44, P354 DIRKS DD, 1969, J SPEECH HEAR RES, V12, P229 Divenyi P.
L, 2004, SPEECH SEGREGATION H Divenyi Pierre L., 2004, Seminars in Hearing, V25, P229, DOI 10.1055/s-2004-832857 DIVENYI PL, 2003, P 15 INT C PHON SCI, P2777 Drullman R, 2000, J ACOUST SOC AM, V107, P2224, DOI 10.1121/1.428503 EGAN JP, 1954, J ACOUST SOC AM, V26, P774, DOI 10.1121/1.1907416 FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247 DUQUESNOY AJ, 1983, J ACOUST SOC AM, V74, P739, DOI 10.1121/1.389859 Gaskell MG, 1997, PROCEEDINGS OF THE NINETEENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, P247 Greenberg S., 2001, P 7 EUR C SPEECH COM, P473 GREENBERG S, 1996, P ICSLP 96 PHIL USA Greenberg S., 1995, P 13 INT C PHON SCI, V3, P34 Hawley ML, 1999, J ACOUST SOC AM, V105, P3436, DOI 10.1121/1.424670 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 KARNEBACK S, 2001, P EUR C SPEECH COMM, P1891 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 LUCE PA, 1990, ACL MIT NAT, P122 Luce PA, 1998, EAR HEARING, V19, P1, DOI 10.1097/00003446-199802000-00001 MarslenWilson W, 1996, J EXP PSYCHOL HUMAN, V22, P1376 MARSLENWILSON W, 1990, ACL MIT NAT, P148 MARSLENWILSON WD, 1987, COGNITION, V25, P71, DOI 10.1016/0010-0277(87)90005-9 McClelland J., 1986, COGNITIVE PSYCHOL, V8, P1, DOI [10.1016/0010-0285(86)90015-0, DOI 10.1016/0010-0285(86)90015-0] MCQUEEN JM, 1994, J EXP PSYCHOL LEARN, V20, P621, DOI 10.1037/0278-7393.20.3.621 MILLER GA, 1947, PSYCHOL BULL, V44, P105, DOI 10.1037/h0055960 Monsell S, 1998, J EXP PSYCHOL LEARN, V24, P1495 Moss HE, 1997, LANG COGNITIVE PROC, V12, P695 New B, 2004, BEHAV RES METH INS C, V36, P516, DOI 10.3758/BF03195598 NORRIS D, 1995, J EXP PSYCHOL LEARN, V21, P1209 NORRIS D, 1994, COGNITION, V52, P189, DOI 10.1016/0010-0277(94)90043-4 Peissig J, 1997, J ACOUST SOC AM, V101, P1660, DOI 10.1121/1.418150 PINQUIER J, 2003, ICASSP 2003 Saberi K, 1999, NATURE, V398, P760, DOI 10.1038/19652 Scheirer E, 1997, INT CONF ACOUST SPEE, P1331, DOI 10.1109/ICASSP.1997.596192 Simpson SA, 2005, J ACOUST SOC AM, V118, P2775, DOI 10.1121/1.2062650 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 WOOD N, 1995, J EXP PSYCHOL LEARN, V21, P255, DOI 10.1037/0278-7393.21.1.255 WREDE B, 2002, THESIS U BIELEFELD ZWITSERLOOD P, 1995, LANG COGNITIVE PROC, V10, P121, DOI 10.1080/01690969508407090 NR 51 TC 20 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2007 VL 49 IS 12 BP 905 EP 916 DI 10.1016/j.specom.2007.05.008 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 227EZ UT WOS:000250641300004 ER PT J AU Holmberg, M Gelbart, D Hemmert, W AF Holmberg, Marcus Gelbart, David Hemmert, Werner TI Speech encoding in a model of peripheral auditory processing: Quantitative assessment by means of automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE auditory model; speech encoding; rate-place coding; automatic speech recognition; auditory nerve ID STEADY-STATE VOWELS; VENTRAL COCHLEAR NUCLEUS; NERVE-FIBERS; BACKGROUND-NOISE; DYNAMIC-RANGE; DISCHARGE PATTERNS; WORD RECOGNITION; STOP CONSONANTS; REPRESENTATION; RESPONSES AB Our notion of how speech is processed is still very much dominated by von Helmholtz's theory of hearing. He deduced that the human inner ear decomposes the spectrum of sound signals. 
However, physiological recordings of auditory nerve fibers (ANF) showed that the rate-place code, which is thought to transmit spectral information to the brain, is at least complemented by a temporal code. In our paper we challenge the rate-place code using a complex but realistic scenario: speech in noise. We used a detailed model of human auditory processing that closely replicates key aspects of auditory nerve spike trains. We performed quantitative evaluations of coding strategies using standard automatic speech recognition (ASR) tools. Our test data was spoken letters of the whole English alphabet from a variety of speakers, with and without background noise. We evaluated a purely rate-place-based encoding strategy, a temporal strategy based on interspike intervals, and a combination thereof. The results suggest that as few as 4% of the total number of ANFs would be sufficient to code speech information in a rate-place fashion. Rate-place coding performed its best for speech in clean conditions at normal sound level, but broke down at higher-than-normal levels, and failed dramatically in noise at high levels. Low-spontaneous rate fibers improved the rate-place code, mainly for vowels and at higher-than-normal levels. At high speech levels, and in particular in the presence of background noise, combining rate-place coding with the temporal coding strategy greatly improved recognition accuracy. We therefore conclude that the human auditory system does not rely on a rate-place code alone but requires the abundance of fibers for precise temporal coding. (C) 2007 Elsevier B.V. All rights reserved. C1 Infineon Technol AG, D-81726 Munich, Germany. Int Comp Sci Inst, Berkeley, CA 94704 USA. RP Hemmert, W (reprint author), Infineon Technol AG, D-81726 Munich, Germany. 
EM werner_hemmert@alum.mit.edu CR Ali AMA, 2002, IEEE T SPEECH AUDI P, V10, P279, DOI 10.1109/TSA.2002.800556 ANSI, 1997, S351997 ANSI BAKER RJ, 1998, PSYCHOPHYSICAL PHYSL Bandyopadhyay S, 2004, J NEUROSCI, V24, P531, DOI 10.1523/JNEUROSCI.4234-03.2004 BEATTIE RC, 1985, J SPEECH HEAR DISORD, V50, P166 Bregman AS., 1990, AUDITORY SCENE ANAL CARNEY LH, 1994, HEARING RES, V76, P31, DOI 10.1016/0378-5955(94)90084-1 COLE R, 1990, CSE90004 OR GRAD I Conley RA, 1995, J ACOUST SOC AM, V98, P3223, DOI 10.1121/1.413812 Cooke M, 2001, SPEECH COMMUN, V35, P141, DOI 10.1016/S0167-6393(00)00078-9 DALY N, 1987, THESIS MIT Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959 Dau T, 1996, J ACOUST SOC AM, V99, P3623, DOI 10.1121/1.414960 DELGUTTE B, 1984, J ACOUST SOC AM, V75, P908, DOI 10.1121/1.390537 DELGUTTE B, 1984, J ACOUST SOC AM, V75, P866, DOI 10.1121/1.390596 DELGUTTE B, 1984, J ACOUST SOC AM, V75, P887, DOI 10.1121/1.390598 DELGUTTE B, 1990, HEARING RES, V49, P225, DOI 10.1016/0378-5955(90)90106-Y ELLIS D, 2000, 00007 INT COMP SCI I ELLIS D, 2002, SPRACHCORE SOFTWARE FLETCHER H, 1950, J ACOUST SOC AM, V22, P89, DOI 10.1121/1.1906605 GEISLER CD, 1989, J ACOUST SOC AM, V85, P1639, DOI 10.1121/1.397952 GELBART D, 2005, ISOLET NOISE Ghitza O, 1994, IEEE T SPEECH AUDI P, V2, P115, DOI 10.1109/89.260357 HASHIMOTO T, 1975, JPN J PHYSIOL, V25, P633 Hienz RD, 1998, HEARING RES, V116, P10, DOI 10.1016/S0378-5955(97)00197-4 HOLMBERG M, 2007, WEB PABE ACCOMPANYIN HOLMBERG M, 2004, P JOINT C CFA DAGA 0, P773 HOLMBERG M, 2007, THESIS TU DARMSTADT JANKOWSKI CR, 1995, IEEE T SPEECH AUDI P, V3, P286, DOI 10.1109/89.397093 Lai Y C, 1994, J Comput Neurosci, V1, P167, DOI 10.1007/BF00961733 Lazzaro J, 1997, ANALOG INTEGR CIRC S, V13, P37, DOI 10.1023/A:1008259307326 LIBERMAN MC, 1978, J ACOUST SOC AM, V63, P442, DOI 10.1121/1.381736 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Lopez-Poveda EA, 2003, J ACOUST SOC AM, V113, P951, DOI 10.1121/1.1534838 Lorenzi C, 2006, P NATL ACAD SCI USA, V103, P18866, DOI 10.1073/pnas.0607364103 May BJ, 1996, AUDIT NEUROSCI, V3, P135 May BJ, 2003, SPEECH COMMUN, V41, P49, DOI 10.1016/S0167-6393(02)00092-4 May BJ, 1998, J NEUROPHYSIOL, V79, P1755 MILLER MI, 1983, J ACOUST SOC AM, V74, P502, DOI 10.1121/1.389816 Musch H, 2001, J ACOUST SOC AM, V109, P2910, DOI 10.1121/1.1371972 Ohm G. 
S., 1843, ANN PHYS CHEM, V59, P513 POLLACK I, 1958, J ACOUST SOC AM, V30, P127, DOI 10.1121/1.1909503 Recio A, 2002, J ACOUST SOC AM, V111, P2213, DOI 10.1121/1.1468878 Rhode WS, 1998, HEARING RES, V117, P39, DOI 10.1016/S0378-5955(98)00002-1 RUTHERFORD E, 1886, J ANAT PHYSL, V21, P166 SACHS MB, 1979, J ACOUST SOC AM, V66, P470, DOI 10.1121/1.383098 SACHS MB, 1980, J ACOUST SOC AM, V68, P858, DOI 10.1121/1.384825 SACHS MB, 1983, J NEUROPHYSIOL, V50, P27 SANDHU S, 1995, INT CONF ACOUST SPEE, P409, DOI 10.1109/ICASSP.1995.479608 SECKERWALKER HE, 1990, J ACOUST SOC AM, V88, P1427, DOI 10.1121/1.399719 SENEFF S, 1988, J PHONETICS, V16, P55 SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1612, DOI 10.1121/1.392799 SHAMMA SA, 1985, J ACOUST SOC AM, V78, P1622, DOI 10.1121/1.392800 SHARMA S, 2000, P ICASSP IST JUN, P1117 Sheikhzadeh H, 1999, COMPUT SPEECH LANG, V13, P39, DOI 10.1006/csla.1998.0049 Sheikhzadeh H, 1998, IEEE T SPEECH AUDI P, V6, P90, DOI 10.1109/89.650316 Shera CA, 2002, P NATL ACAD SCI USA, V99, P3318, DOI 10.1073/pnas.032675099 SILKES SM, 1991, J ACOUST SOC AM, V90, P3122, DOI 10.1121/1.401421 SINEX DG, 1983, J ACOUST SOC AM, V73, P602, DOI 10.1121/1.389007 SPOENDLIN H, 1989, HEARING RES, V43, P25, DOI 10.1016/0378-5955(89)90056-7 STEENEKEN H, 1968, DESCRIPTION RSG 10 N STERN RM, 1992, P DARPA SPEECH 5 NAT, P274, DOI 10.3115/1075527.1075592 STRUBE HW, 1985, ACUSTICA, V58, P207 Studebaker GA, 1999, J ACOUST SOC AM, V105, P2431, DOI 10.1121/1.426848 Sumner CJ, 2002, J ACOUST SOC AM, V111, P2178, DOI 10.1121/1.1453451 VIEMEISTER NF, 1988, HEARING RES, V34, P267, DOI 10.1016/0378-5955(88)90007-X von Helmholtz H, 1863, LEHRE TONEMPFINDUNGE WANG H, 2006, P IEEE ICASSP 2006 T, P129 WINSLOW RL, 1988, HEARING RES, V35, P165, DOI 10.1016/0378-5955(88)90116-5 YATES GK, 1990, HEARING RES, V45, P203, DOI 10.1016/0378-5955(90)90121-5 YOUNG ED, 1979, J ACOUST SOC AM, V66, P1381, DOI 10.1121/1.383532 NR 71 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2007 VL 49 IS 12 BP 917 EP 932 DI 10.1016/j.specom.2007.05.009 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 227EZ UT WOS:000250641300005 ER PT J AU Peng, JX AF Peng Jianxin TI Relationship between Chinese speech intelligibility and speech transmission index using diotic listening SO SPEECH COMMUNICATION LA English DT Article DE Chinese speech intelligibility; speech transmission index; room impulse response ID OCTAVE-BAND WEIGHTS; SOUND REPRODUCTION AB The speech intelligibility in rooms is evaluated using the room impulse responses obtained from the room acoustical simulation software ODEON. The simulated room impulse responses are first convolved with the speech intelligibility test signals recorded in an anechoic chamber, and then reproduced through earphones. The subjective Chinese speech intelligibility scores are obtained, and the relationship between Chinese speech intelligibility scores and speech transmission index (STI) is established and validated. The results show a high correlation between Chinese speech intelligibility scores and STI. The STI method can predict and evaluate the speech intelligibility for Mandarin Chinese without changes to the algorithm's weighting values for diotic listening in rooms. (C) 2007 Elsevier B.V. All rights reserved. C1 S China Univ Technol, Sch Phys, Guangzhou 510640, Peoples R China.
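The auralization step described in this abstract, convolving anechoic test speech with simulated room impulse responses before diotic presentation, is straightforward to sketch. The impulse response below is a synthetic exponential-decay stand-in for the ODEON output, and the speech is placeholder noise.

```python
# Sketch of the auralization step: convolve anechoic test speech with a
# (here synthetic) room impulse response before presentation to listeners.
import numpy as np

rng = np.random.default_rng(3)
fs = 16000
speech = rng.standard_normal(fs)  # placeholder for an anechoic recording

# Toy impulse response: exponentially decaying noise tail (RT60 ~ 0.5 s);
# in the paper the impulse responses come from the ODEON simulation.
rt60 = 0.5
t = np.arange(int(rt60 * fs)) / fs
rir = rng.standard_normal(t.size) * np.exp(-6.9 * t / rt60)
rir /= np.abs(rir).max()

reverberant = np.convolve(speech, rir)[: speech.size]
print(reverberant.shape)  # signal to be played diotically over earphones
```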
RP Peng, JX (reprint author), S China Univ Technol, Sch Phys, Guangzhou 510640, Peoples R China. EM phjxpeng@163.com CR ANDERSON BW, 1985, J ACOUST SOC AM, V81, P1982 [Anonymous], 1985, 495985 GB Bork I, 2000, ACUSTICA, V86, P943 BRADLEY JS, 1986, J ACOUST SOC AM, V80, P837, DOI 10.1121/1.393907 CHRISTENSEN C.L., 2003, ODEON ROOM ACOUSTICS HOUTGAST T, 1984, ACUSTICA, V54, P185 *IEC, 1988, 6026816 IEC 16 *IEC, 2003, 6026816 IEC 16 Kirkeby O, 1999, J AUDIO ENG SOC, V47, P583 KNUDSEN VO, 1950, ACOUSTICAL DESIGNING, P51 LATHAM HG, 1979, APPL ACOUST, V18, P252 LOCHNER JPA, 1964, J SOUND VIB, V1, P426, DOI 10.1016/0022-460X(64)90057-4 Peng JX, 2005, APPL ACOUST, V66, P591, DOI 10.1016/j.apacoust.2004.08.006 Peng JX, 2006, ACTA ACUST UNITED AC, V92, P79 Shen G. L., 1997, MOTION PICTURE VIDEO, V5, P3 Shen H., 1993, AUDIO ENG, V1, P2 SHEN H, 1995, MOD ACOUST RES, P295 *STAND PR CHIN, GBT1447693 STAND CHI Steeneken HJM, 1999, SPEECH COMMUN, V28, P109, DOI 10.1016/S0167-6393(99)00007-2 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Steeneken HJM, 2002, SPEECH COMMUN, V38, P413, DOI 10.1016/S0167-6393(02)00010-9 Steeneken HJM, 2002, SPEECH COMMUN, V38, P399, DOI 10.1016/S0167-6393(02)00011-0 WANG J, 1986, P 12 INT C AC, VE, P10 ZHANG J, 1981, ACTA ACUST, V6, P237 NR 24 TC 5 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2007 VL 49 IS 12 BP 933 EP 936 DI 10.1016/j.specom.2007.06.001 PG 4 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 227EZ UT WOS:000250641300006 ER PT J AU De Mori, R Deroo, O Dupont, S Jouvet, D Fissore, L Laface, P Mertins, A Wellekens, CJ AF De Mori, R. Deroo, O. Dupont, S. Jouvet, D. Fissore, L. Laface, P. Mertins, A. Wellekens, C. J. TI Introduction to the special issue on intrinsic speech variations SO SPEECH COMMUNICATION LA English DT Editorial Material NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT-NOV PY 2007 VL 49 IS 10-11 BP 761 EP 762 DI 10.1016/j.specom.2007.05.006 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 214HS UT WOS:000249728800001 ER PT J AU Benzeghiba, M De Mori, R Deroo, O Dupont, S Erbes, T Jouvet, D Fissore, L Laface, P Mertins, A Ris, C Rose, R Tyagi, V Wellekens, C AF Benzeghiba, M. De Mori, R. Deroo, O. Dupont, S. Erbes, T. Jouvet, D. Fissore, L. Laface, P. Mertins, A. Ris, C. Rose, R. Tyagi, V. Wellekens, C. TI Automatic speech recognition and speech variability: A review SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 31st IEEE International Conference on Acoustics, Speech and Signal Processing CY MAY 14-19, 2006 CL Toulouse, FRANCE SP IEEE Signal Proc Soc DE speech recognition; speech analysis; speech modeling; speech intrinsic variations ID HIDDEN MARKOV-MODELS; SPEAKER ADAPTATION; PRONUNCIATION QUALITY; FEATURE-EXTRACTION; CHILDRENS SPEECH; LANGUAGE; NORMALIZATION; CLASSIFICATION; IDENTIFICATION; INFORMATION AB Major progress is regularly being made in both the technology and the exploitation of automatic speech recognition (ASR) and spoken language systems. However, there are still technological barriers to flexible solutions and user satisfaction under some circumstances.
This is related to several factors, such as the sensitivity to the environment (background noise), or the weak representation of grammatical and semantic knowledge. Current research is also emphasizing deficiencies in dealing with variation naturally present in speech. For instance, the lack of robustness to foreign accents precludes use by specific populations. Also, some applications, like directory assistance, particularly stress the core recognition technology due to the very high active vocabulary (application perplexity). In fact, many factors affect the realization of speech: regional and sociolinguistic factors, as well as factors related to the environment or to the speaker herself. These create a wide range of variations that may not be modeled correctly (speaker, gender, speaking rate, vocal effort, regional accent, speaking style, non-stationarity, etc.), especially when resources for system training are scarce. This paper outlines current advances related to these topics. (C) 2007 Elsevier B.V. All rights reserved. C1 Multitel, Parc Initialis, B-7000 Mons, Belgium. RP Dupont, S (reprint author), Multitel, Parc Initialis, Ave Copernic, B-7000 Mons, Belgium. EM dupont@multitel.be CR AALBURG S, 2004, P ICSLP JEJ ISL KOR, P1465 ABDELHALEEM YH, 2004, P ICASSP MONTR CAN, P637 ABRASH V, 1996, P IEEE INT C AC SPEE, P729 ACHAN K, 2004, 2004001 UTML Adda-Decker M, 1999, SPEECH COMMUN, V29, P83, DOI 10.1016/S0167-6393(99)00032-1 Adda-Decker M, 2005, SPEECH COMMUN, V46, P119, DOI 10.1016/j.specom.2005.03.006 ANDREOBRECHT R, 1988, IEEE T ACOUST SPEECH, V36, P29, DOI 10.1109/29.1486 Ang J, 2002, P INT C SPOK LANG PR, P2037 Arslan LM, 1996, SPEECH COMMUN, V18, P353, DOI 10.1016/0167-6393(96)00024-6 Atal B. S., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing ATAL BS, 1976, IEEE T ACOUST SPEECH, V24, P201, DOI 10.1109/TASSP.1976.1162800 Athineos M., 2003, P IEEE WORKSH AUT SP, P261 BARD EG, 2001, LANG COGNITIVE PROC, V15, P731 BARRAULT L, 2005, P INT LISB PROT, P221 BARTKOVA K, 1999, P ICPHS SAN FRANC US, P1725 BARTKOVA K, 2004, P SPECOM SAINT PET R BARTKOVA K, 2003, P ICPHS BARC SPAIN Beattie V, 1995, P EUR 95, P1123 BEAUFORD JQ, 1999, COMPENSATING VARIATI Bell A, 2003, J ACOUST SOC AM, V113, P1001, DOI 10.1121/1.1534836 Benitez C, 2001, P EUR 2001, P429 BLOMBERG M, 1991, SPEECH COMMUN, V10, P453, DOI 10.1016/0167-6393(91)90048-X Bonaventura P., 1998, P ESCA WORKSH MOD PR, P17 BONAVENTURA P, 1997, P EUR RHOD GREEC, P355 BOUGHAZALE SE, 1995, ECSA NATO P SPEECH S, P45 BOUGHAZALE SE, 1994, P IEEE INT C AC SPEE, P413 Bourlard H., 1997, P ICASSP, P1251 BOZKURT B, 2005, P EUS ANT TURK Bozkurt B, 2005, IEEE SIGNAL PROC LET, V12, P344, DOI 10.1109/LSP.2005.843770 BRUGNARA F, 1992, P ICASSP, V1, P377 Byrne W, 2004, IEEE T SPEECH AUDI P, V12, P420, DOI 10.1109/TSA.2004.828702 Carey M. J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607979 CARLSON BA, 1992, P IEEE 1992 INT C AC, P237, DOI 10.1109/ICASSP.1992.225928 Chase Lin, 1997, THESIS CARNEGIE MELL Chen C. J., 2001, P ICASSP, V1, P61 Chen C.J., 1997, P EUROSPEECH, P1543 Chen Y., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat.
No.87CH2396-0) CHESTA C, 1999, P ICASSP, V2, P557 Chesta C., 1999, P EUR C SPEECH COMM, P211 CHOLLET G, 1981, P IEEE C AC PSEECH S, P758 CINCAREK T, 2004, P ICSLP JEJ ISL KOR, P1509 COIFMAN RR, 1992, IEEE T INFORM THEORY, V38, P713, DOI 10.1109/18.119732 Colibro D., 2005, P ICASSP PHIL PA, P1001 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 Cucchiarini C, 2000, SPEECH COMMUN, V30, P109, DOI 10.1016/S0167-6393(99)00040-0 DALSGAARD P, 1998, P ICSLP SYDN AUSTR, P482 DARCY SM, 2004, P ICSLP JEJ ISL KOR Das S., 1998, P ICASSP SEATTL US M, V1, P433, DOI 10.1109/ICASSP.1998.674460 DAS S, 1999, P EUROSPEECH, P1959 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Demeechai T, 2001, SPEECH COMMUN, V33, P241, DOI 10.1016/S0167-6393(00)00017-0 DEMUYNCK K, 2004, P ICSLP 04 JEJ ISL K DENG Y, 2003, P EUR GEN SWITZ SEPT, P929 de Wet F, 2004, J ACOUST SOC AM, V116, P1781, DOI 10.1121/1.1781620 DIBENEDETTO MG, 1992, P ICSLP, P579 DODDINGTON G, 2003, P ASRU US VIRB ISL, P630 DRAXLER C, 1997, P EUR 97, P747 Duda R. O., 1973, PATTERN CLASSIFICATI DUPONT S, 2005, P INT 05 LISB, P1353 Dupont S., 2005, P ASRU, P29 EIDE E, 2001, P EUR AALB DENM, P1613 EIDE E, 1995, P IEEE INT C AC SPEE, P221 EIDE E, 1996, P ICASSP, P346 Eklund R, 2001, SPEECH COMMUN, V35, P81, DOI 10.1016/S0167-6393(00)00097-2 Elenius D., 2004, P FONETIK 2004, P156 ELLIS DPW, 2001, P ICASSP, P517 *ESCA, 1998, ESCA WORKSH MOD PRON Eskenazi M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607892 ESKENAZI M, 2002, P PMLA WORKSH COL US ESKENAZI M, 1996, J ACOUST SOC AM, P2759 FALTHAUSER R, 2000, P ICASSP IST TURK, P1355 Fant G., 1960, ACOUSTIC THEORY SPEE Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347 FITT S, 1995, P EUR MADR SPAIN, P2227 Flanagan J., 1972, SPEECH ANAL SYNTHESI Flege JE, 2003, SPEECH COMMUN, V40, P467, DOI 10.1016/S0167-6393(02)00128-0 Fosler-Lussier E, 1999, SPEECH COMMUN, V29, P137, DOI 10.1016/S0167-6393(99)00035-7 Fosler-Lussier E, 2005, SPEECH COMMUN, V46, P153, DOI 10.1016/j.specom.2005.03.003 Franco H, 2000, SPEECH COMMUN, V30, P121, DOI 10.1016/S0167-6393(99)00045-X Fujinaga K, 2001, P 2001 IEEE INT C AC, V1, P513 Fukunaga K., 1972, INTRO STAT PATTERN R Fung P., 1999, P ICASSP, P221 Furui S, 2004, IEEE T SPEECH AUDI P, V12, P349, DOI 10.1109/TSA.2004.828628 Gales M. J. 
F., 1998, P ICSLP, P1783 GALES MJF, 2001, P ASRU MAD CAMP ITAL GALES MJF, 2001, P ICASSP 2001 MAY, P361 Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 Garner PN, 1998, INT CONF ACOUST SPEE, P1, DOI 10.1109/ICASSP.1998.674352 GARVIN PL, 1963, PHONETICA, V9, P193 Gemello R, 2006, COMPUT SPEECH LANG, V20, P2, DOI 10.1016/j.csl.2004.06.001 GIRARDI A, 1998, P ICSLP SYDN AUSTR, P687 GIULIANI D, 2003, P ICASSP, V2, P137 Goel V, 2004, IEEE T SPEECH AUDI P, V12, P234, DOI 10.1109/TSA.2004.825678 Gopinath RA, 1998, INT CONF ACOUST SPEE, P661, DOI 10.1109/ICASSP.1998.675351 Goronzy S, 2004, SPEECH COMMUN, V42, P109, DOI 10.1016/j.specom.2003.09.003 GRACIARENA M, 2004, P ICASSP MONTR, V1, P921 GREENBERG S, 2000, P ISCA WORKSH AUT SP GREENBERG STEVEN, 2000, P CREST WORKSH MOD S GUPTA SK, 1996, P IEEE INT C AC SPEE, V1, P57 HAEBUMBACH R, 1992, P ICASSP, P13, DOI 10.1109/ICASSP.1992.225984 Hagen Andreas, 2003, P IEEE WORKSH AUT SP, P186 Hain T., 1999, P EUR C SPEECH COMM, P1327 Hain T, 2005, SPEECH COMMUN, V46, P171, DOI 10.1016/j.specom.2005.03.008 Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7 Hansen JHL, 1989, IEEE P 15 NE BIOENG, P31 HANSEN JHL, 1995, J ACOUSTICAL SOC AM HANSEN JHL, 1993, P INT C AC SPEECH SI, P95 Hanson B. A., 1990, P ICASSP, P857 Hariharan R, 2001, IEEE T SPEECH AUDI P, V9, P856, DOI 10.1109/89.966088 Haykin S., 1993, ADAPTIVE FILTER THEO Haykin S., 1995, COMMUNICATION SYSTEM Hazen TJ, 2005, SPEECH COMMUN, V46, P189, DOI 10.1016/j.specom.2005.03.004 He XD, 2003, IEEE T SPEECH AUDI P, V11, P298, DOI 10.1109/TSA.2003.814379 HEDGE RM, 2005, P IEEE INT C AC SPEE, V1, P541 HEDGE RM, 2004, P INTERSPEECH 2004, P905 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H., 1998, P INT C SPOK LANG PR, P1003 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HETHERINGTON L, 1995, P EUR C SPEECH COMM, P1645 Hirschberg J, 2004, SPEECH COMMUN, V43, P155, DOI 10.1016/j.specom.2004.01.006 HOLMES JN, 1997, P EUROSPEECH 97 RHOD, P2083 HON HW, 1999, P ASRU KEYST COL Huang C., 2001, P EUR 2001, V2, P1377 Huang H., 2000, P ICASSP 2000, P1523 HUANG XD, 1991, P ICASSP, P877, DOI 10.1109/ICASSP.1991.150478 HUMPHRIES JJ, 1996, P ICSLP RHOD GREEC, P2367 Hunt M., 1989, P ICASSP, P262 Hunt M.J., 1999, P ASRU KEYST COL HUNT MJ, 2004, P ICSLP JEJ ISL KOR IIVONEN A, 2003, P ICPHS BARC SPAIN, P695 *ISCA, 2002, ISCA TUT RES WORKSH Janse E, 2004, SPEECH COMMUN, V42, P155, DOI 10.1016/j.specom.2003.07.001 JIANG K, 1999, ACOUSTIC FEATURE SEL JUANG BH, 1987, IEEE T ACOUST SPEECH, V35, P947 Jurafsky D., 2001, P 2001 IEEE INT C AC, P577 KAJAREKAR S, 1999, P EUROSPEECH BUD HUN, P343 KAJAREKAR S, 1999, P ASRU KEYST COL Kenny P, 2005, IEEE T SPEECH AUDI P, V13, P345, DOI 10.1109/TSA.2004.840940 Kim DK, 2004, SPEECH COMMUN, V42, P467, DOI 10.1016/j.specom.2004.01.002 KINGSBURY B, 2002, P ICASSP, V1, P53 Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 KIRCHHOFF K., 1998, P ICSLP, P891 KITAOKA N, 2002, P INT C SPOK LANG PR, P2125 Kleinschmidt M., 2002, P INT C SPOK LANG PR, P25 Kohler J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607240 KONIG Y, 1992, P INT JOINT C NEUR N, V2, P332, DOI 10.1109/IJCNN.1992.226966 KORKMAZSKIY F, 1997, P 1997 IEEE INT C AC, V2, P1443 KORKMAZSKY F, 2004, P ICSLP JEJ ISL KOR KUBALA F, 1994, P ICASSP 1994, P561 Kumar N, 1998, SPEECH COMMUN, V26, P283, DOI 10.1016/S0167-6393(98)00061-2 Kumaresan R, 1999, J ACOUST SOC AM, V105, P1912, DOI 10.1121/1.426727 Kumaresan R, 1998, IEEE SIGNAL PROC LET, V5, P256, DOI 10.1109/97.720558 Kumpf K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607964 KUWABARA H, 1997, P EUROSPEECH, P1003 LADEFOGED P, 1957, J ACOUST SOC AM, V29, P98, DOI 10.1121/1.1908694 LAMEL L, 2005, P ICASSP PHIL PENNS, P1005 Laver John, 1994, PRINCIPLES PHONETICS Lawson A., 2003, P EUR GEN SWITZ, P1505 LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902 Lee C.-H., 1993, P IEEE INT C AC SPEE, V2, P558 Lee L., 1996, P ICASSP, V1, P353 Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Leonard R., 1984, P IEEE INT C AC SPEE, P328 LIN X, 2004, P 38 AS C SIGN SYST, V2, P1801 LINCOLN M, 1997, P EUROSPEECH, P2095 Lindblom B., 1990, SPEECH PRODUCTION SP Lippmann R. P., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Liu L, 1997, SPEECH COMMUN, V22, P403, DOI 10.1016/S0167-6393(97)00054-X LIU S, 1998, P ICSLP98, V6, P2647 LIU WK, 2000, P ICSLP, V3, P738 LIVESCU K, 2000, P ICASSP, V3, P1683 LJOLJE A, 2001, P EUR AALB DENM LJOLJE A, 2002, P ICSLP DENV US, P2137 LOMBARD E, 1911, ANN MAALDIES OREILLE, P37 Loog M, 2004, IEEE T PATTERN ANAL, V26, P732, DOI 10.1109/TPAMI.2004.13 MAGIMAIDOSS M, 2001, P ICSLP JEJ ISL KOR Maison B., 2003, P ASRU VIRG ISL US, P429 MAK B, 2004, P ICSLP JEJ ISL KOR MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Mangu L, 2000, COMPUT SPEECH LANG, V14, P373, DOI 10.1006/csla.2000.0152 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P1659, DOI 10.1109/29.46548 MARKEL JD, 1977, IEEE T ACOUST SPEECH, V25, P330, DOI 10.1109/TASSP.1977.1162961 MARKOV K, 2003, P ICASSP, V1, P840 MARTIN A, 2003, P EUR GEN SWITZ, P3069 Martinez F, 1998, INT CONF ACOUST SPEE, P725, DOI 10.1109/ICASSP.1998.675367 Martinez F., 1997, P EUR RHOD GREEC, P469 MATSUDA S, 2004, P ICSLP JEJ ISL KOR Mertins A., 2005, P 2005 IEEE AUT SPEE, P308 MESSINA R, 2004, P ICSLP JEJ ISL KOR Milner B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607093 MIRGHAFORI N, 1996, P IEEE INT C AC SPEE, P335 Mirghafori N., 1995, P EUROSPEECH 95, P491 Mokbel C, 1997, SPEECH COMMUN, V23, P141, DOI 10.1016/S0167-6393(97)00042-3 MOKHTARI P, 1998, THESIS U NEW S WALES Morgan N, 1998, INT CONF ACOUST SPEE, P729, DOI 10.1109/ICASSP.1998.675368 MORGAN N, 2004, P ICASSP MONTR, V1, P536 Morgan N., 1997, P 5 EUR C SPEECH COM, V4, P2079 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 NAITO YSM, 1998, P ICASSP SEATTL WA, P1889 NANJO H, 2002, P ICASSP, V1, P725 Nanjo H, 2004, IEEE T SPEECH AUDI P, V12, P391, DOI [10.1109/TSA.2004.828641, 10.1106/TSA.2004.828641] *NAT I STAND TECHN, 2001, SCLITE SCOR SOFTW Nearey Terrance Michael, 1978, PHONETIC FEATURE SYS NETI C, 1997, P 1997 IEEE WORKSH A, P192 Neumeyer L., 1996, Proceedings ICSLP 96. 
Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607890 Neumeyer L, 2000, SPEECH COMMUN, V30, P83, DOI 10.1016/S0167-6393(99)00046-1 NGUYEN P, 2000, ANN TELECOMMUNICATIO, V55 NGUYEN P, 2003, P EUR GEN SWITZ, P1837 Nolan F, 1983, PHONETIC BASES SPEAK O'Saughnessy D., 1987, SPEECH COMMUNICATION ODELL JJ, 1994, P ICASSP, V2, P125 OMAR MK, 2002, P INT C SPOK LANG PR, P2129 OMAR MK, 2002, P ICASSP, V1, P81 Ono Y., 1993, P EUR, P355 OSHAUGHNESSY D, 1999, P ICASSP, V1, P413 Padmanabhan M, 2005, IEEE T SPEECH AUDI P, V13, P512, DOI 10.1109/TSA.2005.848876 Padmanabhan M, 2004, IEEE T SPEECH AUDI P, V12, P572, DOI 10.1109/TSA.2003.822629 PADMANABHAN M, 1996, P ICASSP ATL, P701 Paliwal K., 2003, P EUROSPEECH, P65 Paliwal K. K., 2003, P EUR 2003, P2117 Paul D. B., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) PAUL DB, 1997, P ICASSP 97, V2, P1487 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 PFAU T, 1998, P ICSLP SYDN AUSTR Pitz M., 2005, THESIS RWTH AACHEN U Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109 POLS LCW, 1969, J ACOUST SOC AM, V46, P458, DOI 10.1121/1.1911711 Potamianos G., 1997, P EUR RHOD GREEC, P473 POTAMIANOS G, 1997, P EUR RHOD GREEC, P2371 Potamianos A, 2003, IEEE T SPEECH AUDI P, V11, P603, DOI 10.1109/TSA.2003.818026 POTTER RK, 1950, J ACOUST SOC AM, V22, P807, DOI 10.1121/1.1906694 Printz H, 2002, COMPUT SPEECH LANG, V16, P131, DOI 10.1006/csla.2001.0188 Pujol P, 2005, IEEE T SPEECH AUDI P, V13, P14, DOI 10.1109/TSA.2004.834466 Rabiner L.R., 1989, P IEEE INT C AC SPEE, V1, P405 RABINER LR, 1993, FUNDAMENTALS SPEECH, P20 RAUX A, 2004, P ICSLP JEJ ISL KOR SAITO S, 1983, SPEECH COMMUN, V2, P149, DOI 10.1016/0167-6393(83)90014-6 SAKAUCHI S, 2004, P INT JEJ ISL KOR, P2053 Saon G., 2000, P INT C AC SPEECH SI, V2, P1129 SCHAAF T, 1997, P IEEE INT C AC SPEE, P875 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SCHIMMEL S, 2005, P ICASSP PHIL US MAY, V1, P221, DOI 10.1109/ICASSP.2005.1415090 SCHOTZ S, 2001, 49 LUND U DEP LING, P136 SCHROEDER MR, 1986, J ACOUST SOC AM, V79, P1580, DOI 10.1121/1.393292 SCHULTZ T, 1998, P ICSLP SYDN AUSTR, V5, P1819 SCHWARTZ R, 1989, P SPEECH NAT LANG WO, P21 SELOUANI SA, 2002, C SIGN PROC PATT REC, P530 Seneff S, 2005, SPEECH COMMUN, V46, P204, DOI 10.1016/j.specom.2005.03.005 SHI YY, 2002, P INT C SIGN PROC, V1, P528 Shinozaki T., 2003, P IEEE WORKSH AUT SP, P417 Shinozaki T., 2004, P INT ICSLP JEJ KOR, P1705 SHOBAKI K, 2000, P ICSLP BEIJ CHIN, P564 Siegler M. A., 1995, THESIS CARNEGIE MELL SIEGLER MA, 1995, P 1995 IEEE INT C AC, P612 Singer H., 1992, P ICASSP, V1, P273 SLIFKA J, 1995, P IEEE INT C AC SPEE, P644 SONG MG, 1998, P ICSLP SOTILLO C, 1998, P SPOSS, P109 Sroka JJ, 2005, SPEECH COMMUN, V45, P401, DOI 10.1016/j.specom.2004.11.009 STEENKEN HJM, 1989, P ICASSP GLASG, V1, P540 Stephenson TA, 2004, IEEE T SPEECH AUDI P, V12, P189, DOI 10.1109/TSA.2003.822631 STEPHENSON TA, 2000, P 6 INT C SPOK LANG, V2, P951 STOLCKE A, 2006, P ICASSP, V1, P321 Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 SUN DX, 1995, P INT C AC SPEECH SI, P201 SUZUKI H, 2003, P ICASSP, V1, P740 Svendsen T., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. 
No.87CH2396-0) Svendsen T., 1989, P INT C AC SPEECH SI, P108 TEIXEIRA C, 1996, P ICSLP 96, V3, P1784, DOI 10.1109/ICSLP.1996.607975 Thomson DL, 2002, SPEECH COMMUN, V37, P197, DOI 10.1016/S0167-6393(01)00011-5 THOMSON DL, 1998, P IEEE INT C AC SPEE, V1, P21, DOI 10.1109/ICASSP.1998.674357 Tibrewala S., 1997, P ICASSP, P1255 TOLBA H, 2002, P ICASSP ORL FL, P837 TOLBA H, 2003, P EUR GEN SWITZ, P3085 TOMLINSON MJ, 1997, P IEEE INT C AC SPEE, P1247 TOWNSHEND B, 1998, P ESCA WORKSH SPEECH, P179 TRAUNMULLER H, 1998, PERCEPTION SPEAKER S Tsakalidis S, 2005, IEEE T SPEECH AUDI P, V13, P367, DOI 10.1109/TSA.2005.845806 Tsao Y, 2005, IEEE T SPEECH AUDI P, V13, P399, DOI 10.1109/TSA.2005.845819 TUERK A, 1999, P EUROSPEECH, V1, P419 Tuerk C., 1993, P EUR, P351 Tur G, 2005, SPEECH COMMUN, V45, P171, DOI 10.1016/j.specom.2004.08.002 TYAGI V, 2003, P ASRU ST THOM US VI, P381 TYAGI V, 2005, P ASRU CANC MEX Tyagi V., 2005, P INT LISB PORT, P209 Uebler U, 2001, SPEECH COMMUN, V35, P53, DOI 10.1016/S0167-6393(00)00095-9 UEBLER U, 1999, P EUR BUD, V2, P911 Umesh S, 1999, IEEE T SPEECH AUDI P, V7, P40, DOI 10.1109/89.736329 UTSURO T, 2002, P ICSLP 2002, P701 Van Compernolle D., 1991, P EUR 91, P723 Van Compernolle D, 2001, SPEECH COMMUN, V35, P71, DOI 10.1016/S0167-6393(00)00096-0 VASEGHI SV, 1997, P INT C AC SPEECH SI, P1263 VENKATARAMAN A, 2004, P ICSLP JEJ ISL KOR WAKITA H, 1977, IEEE T ACOUST SPEECH, V25, P183, DOI 10.1109/TASSP.1977.1162929 WATROUS RL, 1993, IEEE T NEURAL NETWOR, V4, P21, DOI 10.1109/72.182692 WEINTRAUB M, 1996, P ADD ICSLP PHIL PA Welling L, 2002, IEEE T SPEECH AUDI P, V10, P415, DOI 10.1109/TSA.2002.803435 WENG F, 1997, P EUR C SPEECH COMM, V1, P359 Wesker T., 2005, P INT, P1273 Westphal M., 1997, P EUR C SPEECH COMM, V3, P1143 WILLIAMS DAG, 1999, THESIS U SHEFFIELD WILPON JG, 1996, P ICASSP ATL GEORG M, V1, P349 Witt S., 1999, P EUR, V3, P1367 Witt SM, 2000, SPEECH COMMUN, V30, P95, DOI 10.1016/S0167-6393(99)00044-8 WONG PF, 2004, P ICASSP, V1, P905 WREDE B, 2001, P EUR AALB DENM Wu XT, 2004, IEEE T SPEECH AUDI P, V12, P168, DOI 10.1109/TSA.2003.818029 YANG WJ, 1988, IEEE T ACOUST SPEECH, V36, P988, DOI 10.1109/29.1620 ZAVALIAKOS G, 1996, P ICASSP, P725 ZHAN P, 1997, P ICASSP, V2, P1039 ZHAN P, 2000, P ISCA TUT RES WORKS, P145 ZHAN P, 1994, P IEE VIS IM SIGN PR, V141, P197 Zhan P, 1997, CMUCS97148 Zhang B., 2005, P ICASSP, V1, P925 ZHENG J, 2004, P ICSLP JEJ ISL KOR, P401 Zhou BW, 2005, IEEE T SPEECH AUDI P, V13, P554, DOI 10.1109/TSA.2005.845808 ZHOU G, 2002, P ICASSP, V4, P3816 Zhu D., 2004, P INT C AC SPEECH SI, P125 ZHU Q, 2004, P ICSLP JEJ ISL KOR ZHU Q, 2000, P ICSLP, V1, P341 ZOLNAY A, 2005, P IEEE INT C AC SPEE, V1, P457, DOI 10.1109/ICASSP.2005.1415149 Zolnay A., 2002, P INT C SPOK LANG PR, P1065 NR 324 TC 75 Z9 78 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD OCT-NOV PY 2007 VL 49 IS 10-11 BP 763 EP 786 DI 10.1016/j.specom.2007.02.006 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 214HS UT WOS:000249728800002 ER PT J AU Grimm, M Kroschel, K Mower, E Narayanan, S AF Grimm, Michael Kroschel, Kristian Mower, Emily Narayanan, Shrikanth TI Primitives-based evaluation and estimation of emotions in speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 31st IEEE International Conference on Acoustics, Speech and Signal Processing CY MAY 14-19, 2006 CL Toulouse, FRANCE SP IEEE Signal Proc Soc DE emotion estimation; emotion expression variations; emotion recognition; emotion space concept; fuzzy logic; man-machine interaction; natural speech understanding; speech analysis ID HUMAN-COMPUTER INTERACTION; VOCAL EMOTION; RECOGNITION AB Emotion primitive descriptions are an important alternative to classical emotion categories for describing a human's affective expressions. We build a multi-dimensional emotion space composed of the emotion primitives of valence, activation, and dominance. In this study, an image-based, text-free evaluation system is presented that provides intuitive assessment of these emotion primitives, and yields high inter-evaluator agreement. An automatic system for estimating the emotion primitives is introduced. We use a fuzzy logic estimator and a rule base derived from acoustic features in speech such as pitch, energy, speaking rate and spectral characteristics. The approach is tested on two databases. The first database consists of 680 sentences of 3 speakers containing acted emotions in the categories happy, angry, neutral, and sad. The second database contains more than 1000 utterances of 47 speakers with authentic emotion expressions recorded from a television talk show. The estimation results are compared to the human evaluation as a reference, and are moderately to highly correlated (0.42 < r < 0.85). Different scenarios are tested: acted vs. authentic emotions, speaker-dependent vs. speaker-independent emotion estimation, and gender-dependent vs. gender-independent emotion estimation. Finally, continuous-valued estimates of the emotion primitives are mapped into the given emotion categories using a k-nearest neighbor classifier. An overall recognition rate of up to 83.5% is accomplished. The errors of the direct emotion estimation are compared to the confusion matrices of the classification from primitives. As a conclusion to this continuous-valued emotion primitives framework, speaker-dependent modeling of emotion expression is proposed since the emotion primitives are particularly suited for capturing dynamics and intrinsic variations in emotion expression. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Karlsruhe, Inst Nachrichtentechnik, D-76128 Karlsruhe, Germany. Univ So Calif, Speech Anal Interpret Lab, Los Angeles, CA 90089 USA. RP Grimm, M (reprint author), Univ Karlsruhe, Inst Nachrichtentechnik, Kaiserstr 12, D-76128 Karlsruhe, Germany.
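The final step of the abstract above, mapping continuous (valence, activation, dominance) estimates into emotion categories with a k-nearest-neighbor classifier, can be sketched as follows. The cluster centres and data are invented stand-ins for the fuzzy-logic estimator's outputs, not the paper's values.

```python
# Sketch of the final classification step: map continuous estimates of
# (valence, activation, dominance) to emotion categories with k-NN.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(4)
categories = ["happy", "angry", "neutral", "sad"]
# Hypothetical cluster centres in the 3-D primitive space, scaled to [-1, 1].
centres = {"happy": (0.8, 0.6, 0.4), "angry": (-0.7, 0.8, 0.6),
           "neutral": (0.0, 0.0, 0.0), "sad": (-0.6, -0.6, -0.4)}

X = np.vstack([rng.normal(centres[c], 0.2, size=(50, 3)) for c in categories])
y = np.repeat(categories, 50)

knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict([[0.7, 0.5, 0.3]]))  # -> likely 'happy'
```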
EM grimm@int.uni-karlsruhe.de RI Narayanan, Shrikanth/D-5676-2012 CR Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Batliner A, 2000, P ISCA WORKSH SPEECH, P195 BULUT M, 2002, P ICSLP DENV CO Carletta J, 1996, COMPUT LINGUIST, V22, P249 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Cowie R., 2000, P ISCA WORKSH SPEECH, P19 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 DELLAERT F, 1996, P 4 INT C SPOK LANG, V3, P1970, DOI 10.1109/ICSLP.1996.608022 Douglas-Cowie E., 2003, P 15 INT C PHON SCI, P2877 FISCHER L, 2002, MESSUNG EMOTIONEN AN Fragopanagos N, 2005, NEURAL NETWORKS, V18, P389, DOI 10.1016/j.neunet.2005.03.006 GRIMM M, 2006, P 14 EUR SIGN PROC C Grimm M., 2005, P IEEE AUT SPEECH RE, P381 GRIMM M, 2006, P ISCA 3 INT C SPEEC, P9 GRIMM M, 2005, P 31 DTSCH JAHR AK D, P731 Grimm M., 2005, P 3 INT C TEL MULT C HAMMAL Z, 2005, P EUS ANT TURK HERNANDEZ C, 2005, EINSATZ FUZZY LOGIC HUANG CF, 2005, P EUR LISB PORT, P417 Kehrein R., 2002, P SPEECH PROS C, P423 KROSCHEL K, 2004, STAT INFORM SIGNAL M Lang P. J., 1980, TECHNOLOGY MENTAL HL, P119 LEE C, 2005, P EUR, P497 Lee C, 2003, P EUR, P157 Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Lee C-M, 2001, P IEEE WORKSH AUT SP, P240 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 NAGEL A, 2005, ROBUSTE PITCH EXTRAK Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S0167-6393(03)00099-2 Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI DOI 10.1016/S1071-581(02)00141-6 RUSSELL JA, 1977, J RES PERS, V11, P273, DOI 10.1016/0092-6566(77)90037-X Scherer KR, 2005, SOC SCI INFORM, V44, P695, DOI 10.1177/0539018405058216 Scholkopf B., 2002, LEARNING KERNELS SCHRODER M, 2001, P EUR 2001 AALB, V1, P87 SCHULLER B., 2005, P INT LISB PORT, P805 Schuller B., 2006, P 32 DTSCH JAHR AK D, P57 Ververidis D., 2004, P ICASSP2004 VIDRASCU L, 2005, P INT C AFF COMP INT, P739 Vidrascu L, 2005, P INT LISB PORT SEPT, P1841 Wundt W., 1896, GRUNDRISS PSYCHOL YU C, 2004, P 8 INT C SPOK LANG, V2, P1329 YU Y, 2002, P INT INT SCI ENG FA NR 42 TC 79 Z9 79 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT-NOV PY 2007 VL 49 IS 10-11 BP 787 EP 800 DI 10.1016/j.specom.2007.01.010 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 214HS UT WOS:000249728800003 ER PT J AU Casale, S Russo, A Serrano, S AF Casale, Salvatore Russo, Alessandra Serrano, Salvatore TI Multistyle classification of speech under stress using feature subset selection based on genetic algorithms SO SPEECH COMMUNICATION LA English DT Article DE genetic algorithms; speech under stress; stress classification; nonlinear features; Hidden Markov models ID RECOGNITION AB The determination of an emotional state through speech increases the amount of information associated with a speaker. It is therefore important to be able to detect and identify a speaker's emotional state or state of stress. Various techniques are used in the literature to classify emotional/stressed states on the basis of speech, often using different speech feature vectors at the same time. This study proposes a new feature vector that will allow better classification of emotional/stressed states. The components of the feature vector are obtained from a feature subset selection procedure based on genetic algorithms.
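A genetic-algorithm feature subset selection of the kind just described can be sketched compactly: chromosomes are binary masks over the candidate features, and fitness is the cross-validated accuracy of a classifier restricted to the masked subset. The data, classifier, and GA settings below are illustrative assumptions, not the paper's configuration.

```python
# Minimal GA sketch for feature subset selection: chromosomes are binary
# masks over features, fitness is cross-validated accuracy on the subset.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(5)
X = rng.standard_normal((120, 12))
y = rng.integers(0, 2, 120)
X[y == 1, :3] += 1.5  # only the first 3 features are informative

def fitness(mask):
    if not mask.any():
        return 0.0
    clf = KNeighborsClassifier(n_neighbors=3)
    return cross_val_score(clf, X[:, mask], y, cv=3).mean()

pop = rng.integers(0, 2, (20, 12)).astype(bool)  # initial random masks
for generation in range(15):
    scores = np.array([fitness(m) for m in pop])
    parents = pop[np.argsort(scores)[-10:]]       # truncation selection
    children = []
    for _ in range(10):
        a, b = parents[rng.integers(10)], parents[rng.integers(10)]
        cut = rng.integers(1, 12)                 # one-point crossover
        child = np.concatenate([a[:cut], b[cut:]])
        flip = rng.random(12) < 0.05              # bit-flip mutation
        children.append(child ^ flip)
    pop = np.vstack([parents, children])

best = pop[np.argmax([fitness(m) for m in pop])]
print("selected features:", np.flatnonzero(best))
```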
A good discrimination between neutral, angry, loud and Lombard states for the simulated domain of the Speech Under Simulated and Actual Stress (SUSAS) database and between neutral and stressed states for the actual domain of the SUSAS database is obtained. (C) 2007 Elsevier B.V. All rights reserved. C1 Catania Univ, Dipartimento Ingn Informat Telecommun, I-95125 Catania, Italy. RP Russo, A (reprint author), Catania Univ, Dipartimento Ingn Informat Telecommun, Viale A Doria 6, I-95125 Catania, Italy. EM arusso@diit.unict.it RI Serrano, Salvatore/A-7774-2008 CR BERITELLI F, 2005, P 39 ANN AS C SIGN S, P550 Bou-Ghazale SE, 2000, IEEE T SPEECH AUDI P, V8, P429, DOI 10.1109/89.848224 Chorin A.J., 1990, MATH INTRO FLUID MEC Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Fukunaga K, 1990, INTRO STAT PATTERN R, P446 Hansen J., 1997, P EUROSPEECH 97 RHOD, V4, P1743 Hansen JHL, 1996, IEEE T SPEECH AUDI P, V4, P307, DOI 10.1109/89.506935 Nicholson J, 1999, P 6 INT C NEUR INF P, V2, P495 NWE TL, 2001, P IEEE REG 10 INT C, V1, P297 Park C.-H., 2003, P INT JOINT C JUL, V4, P2594 SCHULLER B, 2003, P 2003 INT C MULT EX, V1, P401 TEAGER HM, 1980, IEEE T ACOUST SPEECH, V28, P599, DOI 10.1109/TASSP.1980.1163453 Thomas T. J., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80019-5 Vafaie H., 1993, Proceedings. Fifth International Conference on Tools with Artificial Intelligence TAI '93 (Cat. No.93CH3325-8), DOI 10.1109/TAI.1993.633981 Womack BD, 1996, SPEECH COMMUN, V20, P131, DOI 10.1016/S0167-6393(96)00049-0 Young S., 2005, HTK BOOK HTK VERSION Zhou GJ, 2001, IEEE T SPEECH AUDI P, V9, P201, DOI 10.1109/89.905995 NR 17 TC 10 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT-NOV PY 2007 VL 49 IS 10-11 BP 801 EP 810 DI 10.1016/j.specom.2007.04.012 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 214HS UT WOS:000249728800004 ER PT J AU Scharenborg, O Wan, V Moore, RK AF Scharenborg, Odette Wan, Vincent Moore, Roger K. TI Towards capturing fine phonetic variation in speech using articulatory features SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 31st IEEE International Conference on Acoustics, Speech and Signal Processing CY MAY 14-19, 2006 CL Toulouse, FRANCE SP IEEE Signal Proc Soc DE human speech recognition; automatic speech recognition; articulatory feature classification; fine phonetic variation ID CONSONANTS AB The ultimate goal of our research is to develop a computational model of human speech recognition that is able to capture the effects of fine-grained acoustic variation on speech recognition behaviour. As part of this work we are investigating automatic feature classifiers that are able to create reliable and accurate transcriptions of the articulatory behaviour encoded in the acoustic speech signal. In the experiments reported here, we analysed the classification results from support vector machines (SVMs) and multilayer perceptrons (MLPs). MLPs have been widely and successfully used for the task of multi-value articulatory feature classification, while (to the best of our knowledge) SVMs have not. This paper compares the performance of the two classifiers and analyses the results in order to better understand the articulatory representations. 
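The classifier comparison at the heart of this abstract can be mimicked on synthetic data. The sketch below uses scikit-learn stand-ins (an RBF-kernel SVC and a small MLPClassifier) on an invented multi-class task standing in for one articulatory feature class; it is not the paper's features, data, or training regime.

```python
# Side-by-side sketch of the two classifier families compared in the paper,
# on a synthetic multi-class task (e.g. a 'manner' feature with 5 values).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=600, n_features=20, n_informative=8,
                           n_classes=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

svm = SVC(kernel="rbf", C=1.0).fit(X_tr, y_tr)   # one-vs-one multi-class
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=1000,
                    random_state=0).fit(X_tr, y_tr)

print("SVM accuracy:", round(svm.score(X_te, y_te), 3))
print("MLP accuracy:", round(mlp.score(X_te, y_te), 3))
```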
It was found that the SVMs outperformed the MLPs for five out of the seven articulatory feature classes we investigated while using only 8.8-44.2% of the training material used for training the MLPs. The structure in the misclassifications of the SVMs and MLPs suggested that there might be a mismatch between the characteristics of the classification systems and the characteristics of the description of the AF values themselves. The analyses showed that some of the misclassified features are inherently confusable given the acoustic space. We concluded that in order to come to a feature set that can be used for a reliable and accurate automatic description of the speech signal, it could be beneficial to move away from quantised representations. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Sheffield, Dept Comp Sci, Speech & Hearing Res Grp, Sheffield S1 4DP, S Yorkshire, England. RP Scharenborg, O (reprint author), Univ Sheffield, Dept Comp Sci, Speech & Hearing Res Grp, Regent Court,211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. EM O.Scharenborg@dcs.shef.ac.uk; V.Wan@dcs.shef.ac.uk; R.K.Moore@dcs.shef.ac.uk RI Scharenborg, Odette/E-2056-2012 CR Burges C.J.C., 1998, DATA MIN KNOWL DISC, V2, P1, DOI DOI 10.1023/A:1009715923555 Chang C.-C., 2001, LIBSVM LIB SUPPORT V Cooke M., 1993, MODELLING AUDITORY P Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 Davis MH, 2002, J EXP PSYCHOL HUMAN, V28, P218, DOI 10.1037//0096-1523.28.1.218 FRANKEL J, 2003, THESIS EDINBURGH U FRANKEL J, 2004, P INT JEJ ISL KOR Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd Garofolo J., 1988, GETTING STARTED DARP HARBORG E, 1990, THESIS NTH TRONDHEIM Hawkins S, 2003, J PHONETICS, V31, P373, DOI 10.1016/j.wocn.2003.09.006 JUNEJA A, 2004, THESIS U MARYLAND Kemps RJJK, 2005, MEM COGNITION, V33, P430, DOI 10.3758/BF03193061 King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148 Kirchhoff K., 1999, THESIS U BIELEFELD Ladefoged P., 1982, COURSE PHONETICS, V2nd Livescu K., 2003, P EUR GEN SWITZ, P2529 Meyer B., 2006, P WORKSH SPEECH REC MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Niyogi P, 2002, J ACOUST SOC AM, V111, P1063, DOI 10.1121/1.1427666 Ostendorf M., 1999, P IEEE AUT SPEECH RE, P79 RIETVELD ACM, 1997, ALGEMENE FONETIEK CO SAENKO K, 2005, P ICCV BEIJ CHIN Salverda AP, 2003, COGNITION, V90, P51, DOI 10.1016/S0010-0277(03)00139-2 Scharenborg O, 2005, COGNITIVE SCI, V29, P867, DOI 10.1207/s15516709cog0000_37 Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 STROM N, 1997, FREE SPEECH, V5 Wester M., 2001, P EUR AALB DENM, P1729 WESTER M, 2004, P IEICI HMM WORKSH K Wester Mirjam, 2003, P EUR 2003 GEN SWITZ, P233 Wu TF, 2004, J MACH LEARN RES, V5, P975 NR 31 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD OCT-NOV PY 2007 VL 49 IS 10-11 BP 811 EP 826 DI 10.1016/j.specom.2007.01.005 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 214HS UT WOS:000249728800005 ER PT J AU Gemello, R Mana, F Scanzio, S Laface, P De Mori, R AF Gemello, Roberto Mana, Franco Scanzio, Stefano Laface, Pietro De Mori, Renato TI Linear hidden transformations for adaptation of hybrid ANN/HMM models SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 31st IEEE International Conference on Acoustics, Speech and Signal Processing CY MAY 14-19, 2006 CL Toulouse, FRANCE SP IEEE Signal Proc Soc DE automatic speech recognition; speaker adaptation; neural network adaptation; catastrophic forgetting ID SPEECH RECOGNITION AB This paper focuses on the adaptation of Automatic Speech Recognition systems using Hybrid models combining Artificial Neural Networks (ANN) with Hidden Markov Models (HMM). Most adaptation techniques for ANNs reported in the literature consist in adding a linear transformation network connected to the input of the ANN. This paper describes the application of linear transformations not only to the input features, but also to the outputs of the internal layers. The motivation is that the outputs of an internal layer represent discriminative features of the input pattern suitable for the classification performed at the output of the ANN. In order to reduce the effect due to the lack of adaptation samples for some phonetic units we propose a new solution, called Conservative Training. Supervised adaptation experiments with different corpora and for different types of adaptation are described. The results show that the proposed approach always outperforms the use of transformations in the feature space and yields even better results when combined with linear input transformations. (C) 2006 Elsevier B.V. All rights reserved. C1 Politecn Torino, I-10129 Turin, Italy. LOQUENDO, I-10149 Turin, Italy. Univ Avignon, LIA, F-84911 Avignon, France. RP Laface, P (reprint author), Politecn Torino, Corso Duca Degli Abruzzi, 24, I-10129 Turin, Italy. EM Roberto.Gemello@loquendo.com; Franco.Mana@loquendo.com; Stefano.Scanzio@polito.it; Pietro.Laface@polito.it; Renato.Demori@lia.univ-avignon.fr CR Abrash V., 1995, P EUROSPEECH 1995, P2183 Albesano D., 1997, P INT C NEUR INF PRO, P1112 BENZEGHIBA MF, 2003, P ICASSP, P225 BERNSTEIN J, 1994, MACROPHONE LDC CATAL DUPONT S, 2005, P INT 05 LISB, P1353 DUPONT S, 2000, P ICASSP 2000, P1795 FRENCH M, 1994, TRENDS COGNIT, V3, P128 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 HERMANSKY H, 2000, P ICASSP, P1635 Hsiao R., 2004, P ICASSP 04 MONTR, P897 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 Lee CH, 2000, P IEEE, V88, P1241 LI X, 2006, P ICASSP 06, P237 LIU X, 2004, P ICASSP 04 MONTR, P797 Neto J., 1995, P EUROSPEECH 1995, P2171 Pallett D.S., 1994, P HUM LANG TECHN WOR, P49, DOI 10.3115/1075812.1075824 Robins A., 1995, Connection Science, V7, DOI 10.1080/09540099550039318 SAGAYAMA S, 2001, ISCA ITR WORKSH FRAN, P67 STADERMANN J, 2005, P ICASSP 05 PHIL, P997 NR 20 TC 23 Z9 25 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD OCT-NOV PY 2007 VL 49 IS 10-11 BP 827 EP 835 DI 10.1016/j.specom.2006.11.005 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 214HS UT WOS:000249728800006 ER PT J AU Bartkova, K Jouvet, D AF Bartkova, Katarina Jouvet, Denis TI On using units trained on foreign data for improved multiple accent speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 31st IEEE International Conference on Acoustics, Speech and Signal Processing CY MAY 14-19, 2006 CL Toulouse, FRANCE SP IEEE Signal Proc Soc DE non-native speech; speech recognition; adaptation; multilingual units ID ADAPTATION AB Foreign accented speech recognition systems have to deal with the acoustic realization of sounds produced by non-native speakers that does not always match native speech models. As the standard native speech modeling alone is generally not adequate, it is usually extended with models of phonemes estimated from speech data of foreign languages, and often complemented with extra pronunciation variants. In this paper, the focus is on the speech recognition of multiple non-native accents. The speech corpus used was recorded from speakers originating from 24 different countries. The introduction of models of phonemes of the target language adapted on foreign speech data is presented and detailed. For the recognition of non-native speech comprising multiple foreign accents, this approach provides better performance than the introduction of standard foreign units. The selection of the most frequent acoustic variants for each phoneme is also discussed as this method makes recognition results more homogeneous across speaker language groups. Furthermore, the adaptation of the acoustic models on non-native speech data is studied. Results show that detailed models, which include the modeling of extra pronunciation variants through acoustic units estimated on foreign data, benefit more from the task and accent adaptation process than baseline standard models used for native speech recognition. In addition, experiments show that an adaptation of the acoustic models on a limited set of foreign accents provides speech recognition performance improvements even on foreign accents absent from the adaptation data. (C) 2007 Elsevier B.V. All rights reserved. C1 France Telecom, Div R&D, TECH, SSTP, F-22300 Lannion, France. RP Jouvet, D (reprint author), France Telecom, Div R&D, TECH, SSTP, 2,Ave Pierre Marzin, F-22300 Lannion, France. EM denisjouvet@francetelecom.com CR AALBURG S, 2004, P ICSLP JEJ ISL KOR, P1465 ADDADECKER M, 1998, P ESCA WORKSH MOD PR, P1 Arslan LM, 1996, SPEECH COMMUN, V18, P353, DOI 10.1016/0167-6393(96)00024-6 BARTKOVA K, 1999, P ICPHS SAN FRANC US, P1725 BARTKOVA K, 2004, P MIDL 2004 WORKSH L BARTKOVA K, 2004, P SPECOM 04 INT C SP, P22 BARTKOVA K, 2006, P ICASSP 2006 IEEE C Beattie V, 1995, P EUR 95, P1123 Bonaventura P., 1998, P ESCA WORKSH MOD PR, P17 CINCAREK T, 2004, P ICSLP JEJ ISL KOR, P1509 DELATTRE P, ORIGINES CELTIQUES P, P215 DRAXLER C, 1997, P EUR 97, P747 Flege JE, 2003, SPEECH COMMUN, V40, P467, DOI 10.1016/S0167-6393(02)00128-0 Fung P., 1999, P ICASSP, P221 Goronzy S, 2004, SPEECH COMMUN, V42, P109, DOI 10.1016/j.specom.2003.09.003 He XD, 2003, IEEE T SPEECH AUDI P, V11, P298, DOI 10.1109/TSA.2003.814379 HUANG C, 2001, P EUR 01 EUR C SPEEC, P1337 Humphries J. J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat.
No.96TH8206), DOI 10.1109/ICSLP.1996.607273 JOUVET D, 1991, P EUROSPEECH 91, P923 Jurafsky D., 2001, P 2001 IEEE INT C AC, P577 KUBALA F, 1994, P ICASSP 1994, P561 Kumpf K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607964 Lawson A., 2003, P EUR GEN SWITZ, P1505 LIN X, 2004, INT C REC 38 AS C SI, P1801 LIU WK, 2000, P ICSLP 00 INT C SPO, P738 LIVESCU K, 2000, P INT C AC SPEECH SI, P1683 MOKBEL C, 1999, P ICASSP, P453 Mokbel C, 1996, SPEECH COMMUN, V19, P185, DOI 10.1016/0167-6393(96)00032-5 Raux A., 2004, P ICSLP 04 INT C SPO, P613 Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 Teixeira C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607975 UEBLER U, 1999, P EUR BUD, V2, P911 VALEE N, 1990, P JEP 90 JOURN ET PA, P32 Van Compernolle D., 1991, P EUR 91, P723 VANCOMPERNOLLE D, 2001, RECOGNIZING SPEECH G, V35, P71 WITT S, 1999, P EUR 99, P1367 NR 36 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT-NOV PY 2007 VL 49 IS 10-11 BP 836 EP 846 DI 10.1016/j.specom.2006.12.009 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 214HS UT WOS:000249728800007 ER PT J AU Gerosa, M Giuliani, D Brugnara, F AF Gerosa, Matteo Giuliani, Diego Brugnara, Fabio TI Acoustic variability and automatic recognition of children's speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 31st IEEE International Conference on Acoustics, Speech and Signal Processing CY MAY 14-19, 2006 CL Toulouse, FRANCE SP IEEE Signal Proc Soc DE children's speech analysis; automatic speech recognition for children; speaker normalization; speaker adaptive acoustic modeling ID VOCAL-TRACT LENGTH; NORMALIZATION AB This paper presents several acoustic analyses carried out on read speech collected from Italian children aged from 7 to 13 years and North American children aged from 5 to 17 years. These analyses aimed at achieving a better understanding of spectral and temporal changes in speech produced by children of various ages in view of the development of automatic speech recognition applications. The results of these analyses confirm and complement the results reported in the literature, showing that characteristics of children's speech change with age and that spectral and temporal variability decrease as age increases. In fact, younger children show a substantially higher intra- and inter-speaker variability with respect to older children and adults. We investigated the use of several methods for speaker adaptive acoustic modeling to cope with inter-speaker spectral variability and to improve recognition performance for children. These methods proved to be effective in recognition of read speech with a vocabulary of about 11k words. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Trent, Ctr Ric Sci & Technol, Ist Ric Sci & Tecnol, I-38050 Trento, Italy. RP Gerosa, M (reprint author), Univ Trent, Ctr Ric Sci & Technol, Ist Ric Sci & Tecnol, I-38050 Trento, Italy. EM gerosa@itc.it; giuliani@itc.it; brugnara@itc.it RI Narayanan, Shrikanth/D-5676-2012 CR ACKERMANN U, 1997, P EUROSPEECH, P1807 Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607807 ANGELINI B, 1994, P ICSLP YOK, P1391 Arunachalam S., 2001, P EUR AALB DENM, P2675 BANERJEE S, 2003, P EUROSPEECH GEN SWI BERTOLDI N, 2001, P ICASSP SALT LAK CI, V1, P37 Boersma P., 2001, GLOT INT, V5, P341 BRUGNARA F, 2002, P ICSLP DENV CO, P1441 Burnett D. C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607809 Claes T, 1998, IEEE T SPEECH AUDI P, V6, P549, DOI 10.1109/89.725321 Clarke GM, 1998, BASIC COURSE STAT, P520 DAS S, 1998, P ICASSP SEATTL WA DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 EIDE E, 1996, P ICASSP, P346 ESKENAZI M, 2002, PMLA, P48 Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148 Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 GEROSA M, 2005, P INTERSPEECH EUROSP, P2193 Gillick L., 1989, P ICASSP, P532 GIULIANI D, 2003, P ICASSP, V2, P137 Giuliani D, 2006, COMPUT SPEECH LANG, V20, P107, DOI 10.1016/j.csl.2005.05.002 GUSTAFSON J, 2000, P ICSLP BEIJ CHIN, P297 HAGEN A, 2003, P ASRU WORKSH ST THO HAGEN A, 2004, P HLT NAACL BOST MA Huber JE, 1999, J ACOUST SOC AM, V106, P1532, DOI 10.1121/1.427150 Kumar S.C., 2005, P INTERSPEECH EUROSP, P3357 Lee L., 1996, P ICASSP Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Mak B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607191 MICH O, 2004, P INSTIL ICALL VEN I, P269 MILLER JD, 1996, P ICASSP ATL GA, P849 MIRGHAFORI M, 1996, P EUROSPEECH LISB PO, P335 Narayanan S, 2002, IEEE T SPEECH AUDI P, V10, P65, DOI 10.1109/89.985544 Nisimura R., 2004, P ICASSP, V1, P433 POTAMIANOS A, 1997, P EUR C SPEECH COMM, P2371 Potamianos A, 2003, IEEE T SPEECH AUDI P, V11, P603, DOI 10.1109/TSA.2003.818026 Russell M, 2000, COMPUT SPEECH LANG, V14, P161, DOI 10.1006/csla.2000.0139 SALVI G, 2003, P 15 ICPHS INT C PHO WAKITA H, 1977, IEEE T ACOUST SPEECH, V25, P183, DOI 10.1109/TASSP.1977.1162929 WEGMANN S, 1996, P ICASSP, P339 Welling L., 1999, P IEEE INT C AC SPEE, P761 Whiteside SP, 2000, SPEECH COMMUN, V32, P267, DOI 10.1016/S0167-6393(00)00013-3 WILPON JG, 1996, P ICASSP, P349 Young S. J., 1994, HLT, P307 ZHENG J, 2000, P INT C AC SPEECH SI, V3, P1775 NR 47 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT-NOV PY 2007 VL 49 IS 10-11 BP 847 EP 860 DI 10.1016/j.specom.2007.01.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 214HS UT WOS:000249728800008 ER PT J AU Craig, M Lieshout, P Wong, W AF Craig, Matthew van Lieshout, Pascal Wong, Willy TI Suitability of a UV-based video recording system for the analysis of small facial motions during speech SO SPEECH COMMUNICATION LA English DT Article DE UV video recording; system resolution; visual speech; kinematics ID MOTOR CONTROL; TASK; LIP; ACOUSTICS; MOVEMENT; APRAXIA; ADULTS AB The motion of the face carries great importance in research about speech production and perception. The suitability of a novel UV-based video recording system to track small facial motions during speech is examined. 
Tests are performed to determine the calibration and system errors, as well as the spatial and temporal resolutions of the setup. Results of the tests are further evaluated using kinematic data of the upper lip recorded during speech, as this articulator typically shows the smallest movements and therefore provides the strongest test for any movement recording equipment. The results indicate that the current system, with a resolution slightly better than 1 mm, is capable of resolving the relatively small upper-lip motions during the production of normal speech. The system therefore provides an effective, easy-to-use and cost-effective alternative to more expensive commercial systems. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Toronto, ODL, Dept Speech Language Pathol, Toronto, ON M5G 1V7, Canada. Univ Toronto, IBBME, Toronto, ON M5G 1V7, Canada. TRI, Toronto, ON, Canada. Univ Toronto, Dept Psychol, HCL, Mississauga, ON L5L 1C6, Canada. Univ Toronto, Dept Elect & Comp Engn, Toronto, ON M5G 1V7, Canada. RP Craig, M (reprint author), Univ Toronto, ODL, Dept Speech Language Pathol, Rehabil Sci Bldg,500 Univ Ave,Room 60, Toronto, ON M5G 1V7, Canada. EM matt.simon.craig@gmail.com; p.vanlieshout@utoronto.ca; willy@eecg.utoronto.ca CR BARLOW SM, 1983, J SPEECH HEAR RES, V26, P283 Clark HM, 2001, J SPEECH LANG HEAR R, V44, P1015, DOI 10.1044/1092-4388(2001/080) Dromey C, 2003, J SPEECH LANG HEAR R, V46, P1234, DOI 10.1044/1092-4388(2003/096) Green JR, 2000, J SPEECH LANG HEAR R, V43, P239 Harris CM, 2004, MATH BIOSCI, V188, P99, DOI 10.1016/j.mbs.2003.08.011 Hasegawa-Johnson M, 1998, J ACOUST SOC AM, V104, P2529, DOI 10.1121/1.423775 HERTRICH I, 1997, FORSCHUNGSBERICHTE I, V35, P165 Jiang JT, 2002, EURASIP J APPL SIG P, V2002, P1174, DOI 10.1155/S1110865702206046 Katz WF, 2006, J SPEECH LANG HEAR R, V49, P645, DOI 10.1044/1092-4388(2006/047) LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 Recasens D, 2002, J ACOUST SOC AM, V111, P2828, DOI 10.1121/1.1479146 Riley R., 2003, J APPL PHYSIOL, V94, P2119 Shaiman S, 2002, J SPEECH LANG HEAR R, V45, DOI 10.1044/1092-4388(2002/053) Simons G, 2004, J INT NEUROPSYCH SOC, V10, P521, DOI 10.1017/S135561770410413X Tasko SM, 2004, J SPEECH LANG HEAR R, V47, P85, DOI 10.1044/1092-4388(2004/008) van Lieshout PHHM, 2002, J SPEECH LANG HEAR R, V45, P5, DOI 10.1044/1092-4388(2002/001) Yehia HC, 2002, J PHONETICS, V30, P555, DOI 10.1006/jpho.2002.0165 Ziegler W, 2002, BRAIN LANG, V80, P556, DOI 10.1006/brln.2001.2614 Zierdt A., 2000, P 5 SEM SPEECH PROD, P313 NR 19 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2007 VL 49 IS 9 BP 679 EP 686 DI 10.1016/j.specom.2007.04.011 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 204XX UT WOS:000249078400001 ER PT J AU Kondo, K Nakagawa, K AF Kondo, Kazuhiro Nakagawa, Kiyoshi TI Speech emission control using active cancellation SO SPEECH COMMUNICATION LA English DT Article DE cellular speech; active control; linear prediction AB We investigated the possibility of an active cancellation system for unnecessary speech radiation control. Some examples of the intended application of this system are cellular speech cancellation and speech input for recognition-based dictation systems.
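Relating to the resolution tests in the Craig, van Lieshout and Wong record above, the following Python sketch shows one simple way such a spatial-resolution figure can be characterised: the RMS deviation of a nominally static tracked marker, converted to millimetres. The test protocol, pixel scale and noise level are assumed for illustration and are not taken from the paper.

    import numpy as np

    def static_marker_resolution(xy, mm_per_pixel):
        # RMS deviation of a nominally static marker, converted to mm;
        # one simple way to characterise tracking noise and resolution.
        dev = xy - xy.mean(axis=0)
        rms_px = np.sqrt((dev ** 2).sum(axis=1).mean())
        return rms_px * mm_per_pixel

    # Example: 500 frames of jittered pixel positions at 0.2 mm/pixel.
    rng = np.random.default_rng(1)
    track = 100.0 + rng.normal(scale=2.0, size=(500, 2))
    print(f"approximate resolution: {static_marker_resolution(track, 0.2):.2f} mm")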
Neither of these applications requires speech to be radiated into the surrounding space, only into the input microphone, and both would benefit if global radiation were controlled. We first show that speech cancellation is possible with a secondary source placed in proximity to the mouth that generates linear-predicted, phase-inverted speech. However, the prediction must also cover the long delay associated with the acoustic to/from electric conversion, as well as A/D, D/A conversions, and all associated processing, which we found can be as long as 3 ms. By recursively using LPC-predicted samples to predict further samples, we found that prediction with an SNR of about 6 dB is possible, even with this long delay. The prediction coefficient update is suppressed during this recursion. Lowering the sampling frequency reduces the number of samples that must be predicted and, at the cost of reduced bandwidth, further enhances prediction accuracy. At a sampling frequency of 8 kHz, speech emission control of about 7 dB for female speech and 4 dB for male speech was found to be possible. Finally, we experimentally evaluated the proposed active speech control method. Predicted samples of recorded speech were first prepared offline. We then played both the original and the predicted samples simultaneously from two loudspeakers. It was found that (1) speech cancellation of up to about 10 dB is possible, but is highly speaker dependent, and (2) the secondary loudspeaker should be oriented in the same direction as the primary source, i.e., the mouth. We plan to investigate prediction coefficient extrapolation to further improve prediction accuracy. A prototype system implementation using DSPs is also planned. (C) 2007 Elsevier B.V. All rights reserved. C1 Yamagata Univ, Fac Engn, Dept Elect Engn, Yonezawa, Yamagata 9928510, Japan. RP Kondo, K (reprint author), Yamagata Univ, Fac Engn, Dept Elect Engn, 4-3-16 Jonan, Yonezawa, Yamagata 9928510, Japan. EM kkondo@yz.yamagata-u.ac.jp CR BLACK RD, 1957, J ACOUST SOC AM, V29, P260, DOI 10.1121/1.1908850 Elliot SJ, 2001, SIGNAL PROCESSING AC FLANAGAN JL, 1960, J ACOUST SOC AM, V32, P1613, DOI 10.1121/1.1907972 Haykin S., 1996, ADAPTIVE FILTER THEO *JAP INF PROC DEV, 1991, ASJ CONT SPEECH CORP KONDO K, 2003, P INT 2003 SEOGW KOR KONDO K, 2002, P INT C SPOK LANG PR KONDO K, 2005, P INT C AC SPEECH SI Kuo S. M., 1996, ACTIVE NOISE CONTROL Nelson P A, 1992, ACTIVE CONTROL SOUND ONO H, 1977, J ACOUST SOC AM, V62, P1613 Sano H, 2001, IEEE T SPEECH AUDI P, V9, P755, DOI 10.1109/89.952494 NR 12 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2007 VL 49 IS 9 BP 687 EP 696 DI 10.1016/j.specom.2007.04.010 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 204XX UT WOS:000249078400002 ER PT J AU Romsdorfer, H Pfister, B AF Romsdorfer, Harald Pfister, Beat TI Text analysis and language identification for polyglot text-to-speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE polyglot speech synthesis; mixed-lingual text analysis; language identification; morphological and syntactic analysis; word and sentence boundary identification AB In multilingual countries, text-to-speech synthesis systems often have to deal with texts containing inclusions of multiple other languages in the form of phrases, words, or even parts of words.
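The recursive prediction scheme described in the Kondo and Nakagawa abstract above can be sketched in a few lines of Python. This is a minimal illustration assuming a least-squares coefficient estimate and a 12th-order predictor, neither of which is specified by the paper; a pure tone plus noise stands in for real speech, so the printed SNR will not match the reported figures.

    import numpy as np

    def lpc_coeffs(x, p):
        # Least-squares AR coefficients: x[n] ~ sum_k a[k-1] * x[n-k].
        N = len(x)
        A = np.array([x[n - p:n][::-1] for n in range(p, N)])
        b = x[p:]
        a, *_ = np.linalg.lstsq(A, b, rcond=None)
        return a

    def predict_ahead(history, a, m):
        # Recursively predict m future samples; the coefficients stay
        # fixed during the recursion, as described above.
        buf = list(history[-len(a):])
        out = []
        for _ in range(m):
            nxt = np.dot(a, buf[::-1])
            out.append(nxt)
            buf = buf[1:] + [nxt]
        return np.array(out)

    fs = 8000
    delay = int(0.003 * fs)              # 3 ms processing delay = 24 samples
    rng = np.random.default_rng(2)
    t = np.arange(400)
    speech = np.sin(2 * np.pi * 200 * t / fs) + 0.05 * rng.normal(size=400)
    a = lpc_coeffs(speech[:300], p=12)
    pred = predict_ahead(speech[:300], a, delay)
    anti = -pred                         # phase-inverted secondary-source signal
    err = speech[300:300 + delay] - pred
    print("prediction SNR (dB):",
          10 * np.log10(np.var(speech[300:300 + delay]) / np.var(err)))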
In such multilingual cultural settings, listeners expect a high-quality text-to-speech synthesis system to read such texts in a way that the origin of the inclusions is heard, i.e., with correct language-specific pronunciation and prosody. The challenge for a text analysis component of a text-to-speech synthesis system is to derive from mixed-lingual sentences the correct polyglot phone sequence and all information necessary to generate natural sounding polyglot prosody. This article presents a new approach to analyze mixed-lingual sentences. This approach centers around a modular, mixed-lingual morphological and syntactic analyzer, which additionally provides accurate language identification on morpheme level and word and sentence boundary identification in mixed-lingual texts. This approach can also be applied to word identification in languages without a designated word boundary symbol like Chinese or Japanese. To date, this mixed-lingual text analysis supports any mixture of English, French, German, Italian, and Spanish. Because of its modular design it is easily extensible to additional languages. (C) 2007 Elsevier B.V. All rights reserved. C1 ETH, Speech Proc Grp, Comp Engn & Networks Lab, Zurich, Switzerland. RP Romsdorfer, H (reprint author), ETH, Speech Proc Grp, Comp Engn & Networks Lab, Zurich, Switzerland. EM romsdorf@tik.ee.ethz.ch CR Cavnar W. B., 1994, 3 ANN S DOC AN INF R, P161 Coker C., 1990, P 1 ESCA WORKSH SPEE, P83 DEMAREUIL PB, 2001, P EUR 2001 AALB DENM, P1923 Dutoit T., 1993, THESIS FACULTE POLYT GIGUET E, 1995, 4 INT WORKSH PARS TE GREFENSTETTE G, 1995, P 3 INT C STAT AN TE, P1 Hakkinen J., 2001, IEEE WORKSH AUT SPEE, P335, DOI 10.1109/ASRU.2001.1034655 LIBERMAN M, 1991, ADV SPEECH SIGNAL PR, P791 MCALLISTER M, 1989, P EUR 89 PAR, V1, P538 PEREIRA FCN, 1980, ARTIF INTELL, V13, P231, DOI 10.1016/0004-3702(80)90003-X Pfister Beat, 2003, P EUR, P2037 Riedi M., 1997, P EUR 97, P2627 Riley M.D., 1989, P DARPA SPEECH NAT L, P339, DOI 10.3115/1075434.1075492 Romsdorfer H, 2005, LECT NOTES COMPUT SC, V3361, P263 ROMSDORFER H, 2004, P INT 2004 ICSLP JEJ, P737 ROMSDORFER H, 2006, ISCA TUT RES WORKSH Romsdorfer H., 2005, P INT 2005 LISB PORT, P3281 SCHMITT JC, 1991, Patent No. 5062143 Sproat R, 1996, COMPUT LINGUIST, V22, P377 TIAN J, 2002, P ICSLP 2002 DEN COL TIAN J, 2004, P ICASSP 2004 MONT C TRABER C, 1995, THESIS COMPUTER ENG Traber C., 1999, P EUR, P835 NR 23 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2007 VL 49 IS 9 BP 697 EP 724 DI 10.1016/j.specom.2007.04.006 PG 28 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 204XX UT WOS:000249078400003 ER PT J AU Barbosa, PA AF Barbosa, Plinio A. TI From syntax to acoustic duration: A dynamical model of speech rhythm production SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd International Conference on Speech Prosody CY 2006 CL Dresden, GERMANY DE speech rhythm; dynamic systems modelling; prosody-syntax interface; prominence; constituency ID PERFORMANCE STRUCTURES; BOUNDARY; ENGLISH; PHRASE AB This paper presents a speech rhythm production model able to generate segmental acoustic duration from several levels of dynamical coupling between linguistic and production-related subsystems. 
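To make the language identification task in the Romsdorfer and Pfister abstract above concrete, here is a deliberately simple Python sketch that scores tokens with per-language character-trigram statistics. The actual system identifies language at the morpheme level inside a grammar-based morphological analyzer; this statistical stand-in, with its tiny placeholder corpora, only illustrates the task.

    from collections import Counter
    import math

    def trigram_model(text):
        # Relative character-trigram frequencies of a training text.
        grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
        total = sum(grams.values())
        return {g: c / total for g, c in grams.items()}, total

    def score(token, model):
        probs, total = model
        padded = f" {token} "
        floor = 1.0 / (total + 1)        # crude unseen-trigram penalty
        return sum(math.log(probs.get(padded[i:i + 3], floor))
                   for i in range(len(padded) - 2))

    # Tiny corpora stand in for real per-language training text.
    models = {
        "de": trigram_model("der die das und sprache sprachsynthese wort satz"),
        "en": trigram_model("the and of speech synthesis word sentence language"),
    }

    for tok in ["sprachsynthese", "speech", "wort", "word"]:
        print(tok, "->", max(models, key=lambda lang: score(tok, models[lang])))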
A probabilistic algorithm for phrase stress assignment accounts for both prominence and constituency prosodic relations by considering the coupling between a dependency-grammar system of markers and constituent-size constraints. This algorithm copes with intra- and inter-speaker prosodic variability. Having as input the position and magnitude of underlying phrase stress, and a set of dynamical control parameters, the model acts at three nested temporal domains to assign segmental duration in Brazilian Portuguese. The modelled V-to-V duration patterns reproduce the patterns found at the surface under several conditions of perturbation. The nature and advantages of the dynamical model of speech rhythm production for simulating natural data are thoroughly discussed. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Estadual Campinas, Speech Prosody Studies Grp, Inst Estudos Linguagem, BR-13083970 Campinas, SP, Brazil. Univ Estadual Campinas, Dept Linguist, Inst Estudos Linguagem, BR-13083970 Campinas, SP, Brazil. RP Barbosa, PA (reprint author), Univ Estadual Campinas, Speech Prosody Studies Grp, Inst Estudos Linguagem, POB 6045, BR-13083970 Campinas, SP, Brazil. EM plinio@iel.unicamp.br CR ARANTES P, 2006, P SPEECH PROS 2006 D, P73 ATTERER M, 2005, THESIS U STUTTGART Bachenko J., 1990, Computational Linguistics, V16 BAILLY G, 1986, ACT 15 JOURN ET PAR, P75 Barbosa P., 2005, P 9 EUR C SPEECH COM, V2005, P1441 Barbosa P. A., 2006, INCURSOES TORNO RITM Barbosa P. A., 2002, P SPEECH PROS 2002 C, P163 BARBOSA PA, 2004, P 2 INT C SPEECH PRO, P49 BARBOSA PA, 1999, P 6 EUR C SPEECH COM, V5, P2059 BARBOSA PA, 2002, CADERN ESTUD LING, V43, P71 BARBOSA PA, 1994, THESIS ICP I NATL PO BARBOSA PA, 1996, P 1 ESCA TUT RES WOR, P85 Beckman M. E., 1992, SPEECH PERCEPTION PR, P457 Berthoz A., 1997, SENS MOUVEMENT Boersma P., 2005, PRAAT DOING PHONETIC Browman CP, 1989, PHONOLOGY, V6, P201, DOI 10.1017/S0952675700001019 Byrd D, 2003, J PHONETICS, V31, P149, DOI 10.1016/S0095-4470(02)00085-2 CAMPBELL WN, 1991, J PHONETICS, V19, P37 Classe A, 1939, RHYTHM ENGLISH PROSE CUMMINS F, 2006, METHODS EMPIRICAL PR, P211 Cummins F, 1998, J PHONETICS, V26, P145, DOI 10.1006/jpho.1998.0070 CUMMINS F, 2002, CADERNOS ESTUDOS LIN, V43, P55 DAUER RM, 1983, J PHONETICS, V11, P51 DOGIL G, 1988, PIVOT MODEL SPEECH P Erickson D, 1998, PHONETICA, V55, P147, DOI 10.1159/000028429 ERIKSSON A, 1991, MONOGRAPH LINGUISTIC, V9 Fraisse P., 1963, PSYCHOL TIME GAY T, 1981, PHONETICA, V38, P148 GEE JP, 1983, COGNITIVE PSYCHOL, V15, P411, DOI 10.1016/0010-0285(83)90014-2 Hibi S., 1983, Journal of the Acoustical Society of Japan (E), V4 SHATTUCKHUFNAGEL S, 1979, J VERB LEARN VERB BE, V18, P41, DOI 10.1016/S0022-5371(79)90554-1 KOHNO M, 1992, SPEECH PERCEPTION PR, P287 KOHNO M, 1995, P 13 INT C PHON SCI, V1, P94 LEHISTE I, 1970, SUPROSEGMENTALS Levelt W. J., 1989, SPEAKING INTENTION A MacNeilage PF, 1998, BEHAV BRAIN SCI, V21, P499 MADDISON Ian, 1997, HDB PHONETIC SCI, P619 MARCUS SM, 1981, PERCEPT PSYCHOPHYS, V30, P247, DOI 10.3758/BF03214280 MARTIN P, 1987, PROSODIC RHYTHMIC ST, V25, P925 MASSINI G, 1991, DURACAO ESTUDO ACENT Mauk MD, 2004, ANNU REV NEUROSCI, V27, P307, DOI 10.1146/annurev.neuro.27.070203.144247 McAuley J., 1995, THESIS INDIANA U MONNIN P, 1993, ANN PSYCHOL, V93, P9 NOOTEBOOM S, 1995, P 13 INT C PHON SCI, V4, P578 Pasdeloup V., 1992, TALKING MACHINES THE, P335 Pompino-Marschall B., 1991, FORSCHUNGSBERICHEE I, V29, P66 Price P.
J., 1989, P 2 DARPA WORKSH SPE, P5, DOI 10.3115/1075434.1075437 RHARDISSE N, 1995, P 13 INT C PHON SCI, V3, P556 Riley M.D., 1989, P DARPA SPEECH NAT L, P339, DOI 10.3115/1075434.1075492 Roach P., 1982, LINGUISTIC CONTROVER, P73 Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 SCHMID H, 2004, P 20 INT C COMP LING, V1, P659 SCHWEITZER A, 2002, P 1 INT C SPEECH PRO, P639 SHIH C, 1998, P 5 INT C SPOK LANG, P177 Silverman K., 1992, P INT C SPOK LANG PR, V2, P867 STETSON RH, 1988, STETSONS MOTOR PHONE Tabain M, 2003, J ACOUST SOC AM, V113, P2834, DOI 10.1121/1.1564013 TESNIERE L, 1967, ELEMENTS SYNTAXE STR VANSANTEN JPH, 1994, COMPUT SPEECH LANG, V8, P95, DOI 10.1006/csla.1994.1005 van Santen JPH, 2000, J ACOUST SOC AM, V107, P1012, DOI 10.1121/1.428281 Watson D, 2004, LANG COGNITIVE PROC, V19, P713, DOI 10.1080/01690960444000070 WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 Wong SW, 2003, SPEECH COMMUN, V41, P93, DOI 10.1016/S0167-6393(02)00096-1 NR 63 TC 10 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2007 VL 49 IS 9 BP 725 EP 742 DI 10.1016/j.specom.2007.04.013 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 204XX UT WOS:000249078400004 ER PT J AU Kain, AB Hosom, JP Niu, XC Santen, JPH Fried-Oken, M Staehely, J AF Kain, Alexander B. Hosom, John-Paul Niu, Xiaochuan van Santen, Jan P. H. Fried-Oken, Melanie Staehely, Janice TI Improving the intelligibility of dysarthric speech SO SPEECH COMMUNICATION LA English DT Article DE speech processing; speech transformation; speech modification; intelligibility; dysarthria ID AMYOTROPHIC-LATERAL-SCLEROSIS; VERTICAL-BAR; DEAF SPEECH; VOWEL; IDENTIFICATION; FREQUENCY AB Dysarthria is a speech motor disorder usually resulting in a substantive decrease in speech intelligibility by the general population. In this study, we have significantly improved the intelligibility of dysarthric vowels of one speaker from 48% to 54%, as evaluated by a vowel identification task using 64 CVC stimuli judged by 24 listeners. Improvement was obtained by transforming the vowels of a speaker with dysarthria to more closely match the vowel space of a non-dysarthric (target) speaker. The optimal mapping feature set, from a list of 21 candidate feature sets, proved to be one utilizing vowel duration and F1-F3 stable points, which were calculated using shape-constrained isotonic regression. The choice of speaker-specific or speaker-independent vowel formant targets appeared to be insignificant. Comparisons with "oracle" conditions were performed in order to evaluate the analysis/re-synthesis system independently of the transformation function. (C) 2007 Elsevier B.V. All rights reserved. C1 Oregon Hlth & Sci Univ, Ctr Spoken Language Understanding, OGI Sch Sci & Engn, Portland, OR 97201 USA. Oregon Hlth & Sci Univ, Dept Neurol, Oregon Inst Disabil & Dev, Portland, OR USA. Oregon Hlth & Sci Univ, Dept Otolaryngol, Oregon Inst Disabil & Dev, Portland, OR USA. RP Kain, AB (reprint author), Oregon Hlth & Sci Univ, Ctr Spoken Language Understanding, OGI Sch Sci & Engn, Portland, OR 97201 USA. 
EM kain@cslu.ogi.edu CR BEUKELMAN DR, 2005, ARGMENTATIVE ALTERNA Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5 BRALOW RE, 1980, STAT INTERFERENCE OR *CARN MELL U, 2004, CMU PRON DICT V 06 Cohen J., 1988, STAT POWER ANAL BEHA, V2nd DARLEY FL, 1969, J SPEECH HEAR RES, V12, P246 DIBENEDETTO MG, 1989, J ACOUST SOC AM, V86, P55, DOI 10.1121/1.398220 Drager K. D. R., 2004, AUGMENTATIVE ALTERNA, V20, P103, DOI 10.1080/07434610410001699681 Duffy J.R, 2005, MOTOR SPEECH DISORDE *EL SPEECH ENH INC, SPEECH ENH Ferrier L., 1995, AUGMENTATIVE ALTERNA, V11, P165, DOI 10.1080/07434619512331277289 Hartelius L, 2000, FOLIA PHONIATR LOGO, V52, P160, DOI 10.1159/000021531 HIERONYMUS JL, ASCII PHONETIC SYMBO Hillenbrand JM, 1999, J ACOUST SOC AM, V105, P3509, DOI 10.1121/1.424676 HOSOM JP, 2003, P ICASSP, P878 KAIN AB, 2004, IEEE WORKSH SPEECH S, P25 KENT RD, 1990, J SPEECH HEAR DISORD, V55, P721 KENT RD, 1989, J SPEECH HEAR DISORD, V54, P482 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 MAASSEN B, 1984, J ACOUST SOC AM, V76, P1673, DOI 10.1121/1.391614 MAASSEN B, 1985, J ACOUST SOC AM, V78, P877, DOI 10.1121/1.392918 MENEDEZPIDAL X, 1996, P ICSLP PHIL PA, V3, P1962, DOI 10.1109/ICSLP.1996.608020 NIU X, 2006, P ICSLP, P957 NIU X, 2003, MODELS ANAL VOCAL EM, P233 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 *PRENTK ROM CO, PATHF COMM DEV QI YY, 1995, J ACOUST SOC AM, V98, P2461, DOI 10.1121/1.413279 *SEM COMP SYST, MINS LANG REPR TECHN Shuster LI, 1996, J SPEECH HEAR RES, V39, P827 STEVENS KN, 1963, J SPEECH HEAR RES, V6, P111 STRANGE W, 1976, J ACOUST SOC AM, V60, P213, DOI 10.1121/1.381066 Titze IR, 2000, PRINCIPLES VOICE PRO TURNER GS, 1995, J SPEECH HEAR RES, V38, P1001 VANSANTEN JPH, 2000, QUANTITATIVE MODEL F, P269 VISSER J, PVOICE DYNAMIC SCREE Weismer G, 2003, J SPEECH LANG HEAR R, V46, P1247, DOI 10.1044/1092-4388(2003/097) Yorkston K. M., 1999, MANAGEMENT MOTOR SPE Yorkston K. M., 1988, CLIN MANAGEMENT DYSA ZIEGLER W, 1988, CLIN LINGUIST PHONET, V2, P291, DOI 10.3109/02699208808985261 NR 40 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2007 VL 49 IS 9 BP 743 EP 759 DI 10.1016/j.specom.2007.05.001 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 204XX UT WOS:000249078400005 ER PT J AU Loizou, PC Cohen, I Gannot, S Paliwal, K AF Loizou, Philipos C. Cohen, Israel Gannot, Sharon Paliwal, Kuldip TI Special issue on speech enhancement SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Univ Texas, Dallas, TX 75230 USA. Technion Israel Inst Technol, IL-32000 Haifa, Israel. Bar Ilan Univ, IL-52100 Ramat Gan, Israel. Griffith Univ, Nathan, Qld 4111, Australia. RP Loizou, PC (reprint author), Univ Texas, Dallas, TX 75230 USA. EM loizou@utdallas.edu; icohen@ee.technion.ac.il; gannot@eng.biu.ac.il; K.Paliwal@griffith.edu.au NR 0 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL-AUG PY 2007 VL 49 IS 7-8 BP 527 EP 529 DI 10.1016/j.specom.2007.05.002 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 200OJ UT WOS:000248772400001 ER PT J AU Erkelens, J Jensen, J Heusdens, R AF Erkelens, Jan Jensen, Jesper Heusdens, Richard TI A data-driven approach to optimizing spectral speech enhancement methods for various error criteria SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; spectral distortion measures; speech model ID AMPLITUDE ESTIMATOR AB Gain functions for spectral noise suppression have been derived in the literature for some error criteria and statistical models. These gain functions are only optimal when the statistical model is correct and the speech and noise spectral variances are known. Unfortunately, the speech distributions are unknown and can at best be determined conditionally on the estimated spectral variance. We show that the "decision-directed" approach for speech spectral variance estimation can have an important bias at low SNRs, which generally leads to too much speech suppression. To correct for such estimation inaccuracies and adapt to the unknown speech statistics, we propose a general optimization procedure, with two gain functions applied in parallel. A conventional algorithm is run in the background and is used for a priori SNR estimation only. For the final reconstruction a different gain function is used, optimized for a wide range of signal-to-noise ratios. This gain function is trained on a speech database by minimizing a relevant error criterion. The procedure is illustrated for several error criteria. The method compares favorably to current state-of-the-art methods, and needs less smoothing in the decision-directed spectral variance estimator. (c) 2006 Elsevier B.V. All rights reserved. C1 Delft Univ Technol, Theory Grp, Dept Med Informat & Commun, NL-2628 CD Delft, Netherlands. RP Erkelens, J (reprint author), Delft Univ Technol, Theory Grp, Dept Med Informat & Commun, Mekelweg 4, NL-2628 CD Delft, Netherlands. EM j.s.erkelens@tudelft.nl CR Abramowitz M., 1965, HDB MATH FUNCTIONS Benesty J., 2005, SPEECH ENHANCEMENT Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544 COHEN I, 2005, P INT LESB PORT SEP, P2053 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 Ephraim Y., 2006, ELECT ENG HDB EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 ERKELENS J, 2006, P EUR SIGNAL P C FOR GAROFOLO J, 1990, PB91505065 HU Y, 2006, P IEEE INT C AC SPEE, P153 Jensen J., 2005, P IEEE 1 BENELUX DSP, P155 Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929 Lotter T., 2005, EURASIP J APPL SIG P, V7, P1110 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Martin R, 2005, IEEE T SPEECH AUDI P, V13, P845, DOI 10.1109/TSA.2005.851927 Martin R, 2005, SIG COM TEC, P43, DOI 10.1007/3-540-27489-8_3 PORTER JE, 1984, P INT C AC SPEECH SI Ross SM, 1972, INTRO PROBABILITY MO VARGA AP, 1992, NOISEX 92 STUDY EFFE Wolfe P. J., 2001, Proceedings of the 11th IEEE Signal Processing Workshop on Statistical Signal Processing (Cat.
No.01TH8563), DOI 10.1109/SSP.2001.955331 You CH, 2005, IEEE T SPEECH AUDI P, V13, P475, DOI 10.1109/TSA.2005.848883 NR 22 TC 28 Z9 28 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL-AUG PY 2007 VL 49 IS 7-8 BP 530 EP 541 DI 10.1016/j.specom.2006.06.012 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 200OJ UT WOS:000248772400002 ER PT J AU Lin, Z Goubran, RA Dansereau, RM AF Lin, Zhong Goubran, Rafik A. Dansereau, Richard M. TI Noise estimation using speech/non-speech frame decision and subband spectral tracking SO SPEECH COMMUNICATION LA English DT Article DE non-stationary noise estimation; voice activity detection; speech enhancement; pitch estimation ID VOICE ACTIVITY DETECTION; AMPLITUDE ESTIMATOR; NONSTATIONARY NOISE; ACTIVITY DETECTOR; ENHANCEMENT; ENVIRONMENTS; STATISTICS; MODEL AB As a fundamental part of single microphone speech quality enhancement, noise power spectrum estimation is particularly challenging in adverse environments with low signal-to-noise ratio (SNR) and highly non-stationary background noise. In this paper, we propose a novel scheme that incorporates human speech properties, such as pitch properties of voiced speech and statistical properties of durations of unvoiced speech, into subband spectral tracking to estimate the power spectrum of non-stationary noise. We show that our proposed method is able to estimate the power spectrum more accurately and faster when the noise is highly non-stationary, and that it tracks bursts of noise 4-6 times faster than competitive methods. We also show that the mean square error of the noise spectrum estimated by the proposed method is on average 15% lower than that of competitive methods. The proposed algorithm is then combined with conventional MMSE-STSA and its overall performance is tested in a speech enhancement application. Simulation results confirm that the segmental SNR improvement of the proposed system is on average 0.9 dB higher than that of the competitive system, and the mean opinion score (MOS) improvement is on average 0.17 higher than that of the competitive system. (c) 2006 Elsevier B.V. All rights reserved. C1 Carleton Univ, Dept Syst & Comp Engn, Ottawa, ON K1S 5B6, Canada. RP Lin, Z (reprint author), Carleton Univ, Dept Syst & Comp Engn, 1125 Colonel Dr, Ottawa, ON K1S 5B6, Canada. EM linzhong@sce.carleton.ca CR ALLORGE L, 2004, LUNE ROUGE SOUNDS DA Beritelli F, 1998, IEEE J SEL AREA COMM, V16, P1818, DOI 10.1109/49.737650 Cohen I, 2001, SIGNAL PROCESS, V81, P2403, DOI 10.1016/S0165-1684(01)00128-1 Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544 Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717 Doblinger G., 1995, P EUR, V2, P1513 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Garafolo J.
S., 1988, GETTING STARTED DARP Gazor S, 2003, IEEE T SPEECH AUDI P, V11, P498, DOI 10.1109/TSA.2003.815518 HERMES DJ, 1988, J ACOUST SOC AM, V83, P257, DOI 10.1121/1.396427 HIRSCH HG, 1995, P IEEE INT C AUD SPE, V12, P59 *ITU, 1996, ITU T P800 P LIN Z, 2005, P INT C AC SPEECH SI, V1, P161 LIN Z, 2003, P 2 IEEE INT WORKSH, P61 LYNCH JF, 1987, SPEECH SIGNAL PROCES, V12, P1348 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Martin R., 1994, P 7 EUR SIGN PROC C, P1182 Nemer E, 2001, IEEE T SPEECH AUDI P, V9, P217, DOI 10.1109/89.905996 O'Shaughnessy D, 2000, SPEECH COMMUNICATION, V2nd Quackenbush S. R., 1988, OBJECTIVE MEASURES S RANGACHARI S, 2004, SPEECH SIGNAL PROCES, V1, P305 Rangachari S, 2006, SPEECH COMMUN, V48, P220, DOI 10.1016/j.specom.2005.08.005 SENEFF S, 1978, ASSP, V26, P358 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 STAHL V, 2000, P ICASSP, V3, P1875 Tanyer SG, 2000, IEEE T SPEECH AUDI P, V8, P478, DOI 10.1109/89.848229 VARGA A, 1992, NOISEX 92 CD ROMS Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 NR 29 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL-AUG PY 2007 VL 49 IS 7-8 BP 542 EP 557 DI 10.1016/j.specom.2006.10.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 200OJ UT WOS:000248772400003 ER PT J AU You, CH Koh, SN Rahardja, S AF You, Chang Huai Koh, Soo Ngee Rahardja, Susanto TI Subband Kalman filtering incorporating masking properties for noisy speech signal SO SPEECH COMMUNICATION LA English DT Article DE subband decomposition; Kalman filtering; masking properties ID ENHANCEMENT; TRANSFORM; SERIES AB This paper considers a subband Kalman filtering scheme that incorporates the auditory masking properties for single channel speech enhancement. It attempts to achieve high quality enhanced speech by optimizing the trade-off between speech distortion and noise reduction. The use of Kalman filtering in the subband instead of the full-band domain leads to considerable complexity reduction and performance improvement. We propose a novel approach to incorporate the masking threshold with subband Kalman filtering, whereby the estimate of the noise variance that is used in the Kalman filtering process in each subband is modified according to the masking threshold. We adopt an iterative scheme for the estimation of autoregressive (AR) parameters. We investigate, through simulations, the proposed approach by studying the contributions from different functions including Kalman filtering, subband decomposition and perceptual effect based on masking threshold. At the same time, we examine the optimal configuration for the proposed scheme. The proposed approach leads to better enhancement results as compared to the full-band and the conventional subband Kalman filtering methods. Through intensive simulations, we show that our proposed enhancement scheme outperforms various existing well known schemes in terms of objective as well as subjective measures. (c) 2007 Elsevier B.V. All rights reserved. C1 Inst Infocomm Res, Singapore 119613, Singapore. Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. RP You, CH (reprint author), Inst Infocomm Res, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore. 
EM echyou@i2r.a-star.edu.sg; esnkoh@ntu.edu.sg; rsusanto@i2r.a-star.edu.sg RI KOH, Soo Ngee/A-5081-2011 CR [Anonymous], 1993, 111723 ISO IEC BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Durbin J., 1960, REV INT STATIST I, V28, P233, DOI DOI 10.2307/1401322 EPHRAIM Y, 1989, IEEE T ACOUST SPEECH, V37, P1846, DOI 10.1109/29.45532 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367 Garofolo J., 1988, GETTING STARTED DARP GIBSON JD, 1991, IEEE T SIGNAL PROCES, V39, P1732, DOI 10.1109/78.91144 HANSEN JHL, 1995, J ACOUST SOC AM, V97, P3833, DOI 10.1121/1.413108 Haykin S., 2002, ADAPTIVE FILTER THEO, V4th JABLOUN F, 2001, P IWAENC, P199 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 Levinson N., 1947, J MATH PHYS, V25, P261 Lin Y.-P., 1995, IEEE T SIGNAL PROCES, V42, P2525 MA N, 2003, P INT S SIGNAL PROCE MA N, 2004, ICASSP 2004, V1, P414 NGUYEN TQ, 1994, IEEE T SIGNAL PROCES, V42, P65, DOI 10.1109/78.258122 PALIWAL KK, 1987, ICASSP 87, P631 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Rabiner L.R., 1978, DIGITAL PROCESSING S SINHA DP, 1993, IEEE T SIGNAL PROCES, V41, P3463, DOI 10.1109/78.258086 Soon IY, 2003, IEEE T SPEECH AUDI P, V11, P717, DOI 10.1109/TSA.2003.816063 Soon IY, 1998, SPEECH COMMUN, V24, P249, DOI 10.1016/S0167-6393(98)00019-3 Vaidyanathan P. P., 1993, MULTIRATE SYSTEMS FI VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 Walker G, 1931, P R SOC LOND A-CONTA, V131, P518, DOI 10.1098/rspa.1931.0069 WU WR, 1998, ANALOG DIGITAL SIGNA, V45, P1072 You CH, 2005, IEEE T SPEECH AUDI P, V13, P475, DOI 10.1109/TSA.2005.848883 YOU CH, 2004, ICME 2004 You CH, 2006, SPEECH COMMUN, V48, P57, DOI 10.1016/j.specom.2005.05.012 Yule GU, 1927, PHILOS T R SOC LOND, V226, P267, DOI 10.1098/rsta.1927.0007 NR 33 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL-AUG PY 2007 VL 49 IS 7-8 BP 558 EP 573 DI 10.1016/j.specom.2007.02.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 200OJ UT WOS:000248772400004 ER PT J AU Lollmann, HW Vary, P AF Loellmann, Heinrich W. Vary, Peter TI Uniform and warped low delay filter-banks for speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE filter-bank equalizer; frequency warping; time-varying filters; low delay; speech enhancement ID POLYPHASE; DESIGN AB A versatile filter-bank concept for adaptive subband filtering is proposed, which achieves a significantly lower algorithmic signal delay than commonly used analysis-synthesis filter-banks. It is derived as an efficient implementation of the filter-bank summation method and performs time-domain filtering with coefficients adapted in the uniform or non-uniform frequency-domain. The frequency warped version of the proposed filter-bank has a lower computational complexity than the usual warped analysis-synthesis filter-bank for most parameter configurations. The application to speech enhancement shows that the same quality of the enhanced speech can be achieved but with lower signal delay. 
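The filter-bank summation view behind such a design can be sketched as follows: applying one gain per uniform DFT band and summing the bands is equivalent to filtering with a single time-domain FIR whose coefficients are the prototype filter weighted by the inverse DFT of the gain vector, so the signal delay reduces to the prototype delay. The Python sketch below illustrates only this uniform, non-warped case; the prototype design, band count and gain values are arbitrary choices rather than the paper's.

    import numpy as np
    from scipy.signal import firwin, lfilter

    K = 32                               # number of uniform subbands
    p = firwin(K + 1, 1.0 / K)           # prototype lowpass filter

    def equalizer_fir(gains, prototype):
        # Collapse per-band gains into one FIR: h[m] = p[m] * c[m mod K],
        # with c the (scaled) inverse DFT of the gain vector. The gains
        # must be symmetric (G[k] == G[K-k]) for a real-valued filter.
        K = len(gains)
        c = K * np.real(np.fft.ifft(gains))
        m = np.arange(len(prototype))
        return prototype * c[m % K]

    G = np.ones(K)
    G[8:25] = 0.1                        # attenuate mid/high bands by 20 dB
    h = equalizer_fir(G, p)

    rng = np.random.default_rng(3)
    x = rng.normal(size=2000)            # stand-in noisy input
    y = lfilter(h, 1.0, x)               # delay is only the prototype delay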
For systems with tight signal delay requirements, modifications of the new filter-bank design are discussed to further decrease its signal delay by approximating the original time-domain filter by an FIR or IIR filter of lower degree. This approach can achieve a very low signal delay and reduced computational complexity with almost no loss for the perceived speech quality. (c) 2007 Elsevier B.V. All rights reserved. C1 Univ Aachen, RWTH, Inst Commun Syst & Data Proc, D-52056 Aachen, Germany. RP Lollmann, HW (reprint author), Univ Aachen, RWTH, Inst Commun Syst & Data Proc, D-52056 Aachen, Germany. EM loellmann@ind.rwth-aachen.de; vary@ind.rwth-aachen.de CR [Anonymous], 2001, ITUT REC, P862 Beauchamp K. G., 1975, WALSH FUNCTIONS THEI BELLANGER MG, 1976, IEEE T ACOUST SPEECH, V24, P109, DOI 10.1109/TASSP.1976.1162788 Benesty J., 2005, SPEECH ENHANCEMENT BRACCINI C, 1974, IEEE T ACOUST SPEECH, VAS22, P236, DOI 10.1109/TASSP.1974.1162582 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 COHEN I, 2001, P EUR C SPEECH COMMU CROCHIERE RE, 1980, IEEE T ACOUST SPEECH, V28, P99, DOI 10.1109/TASSP.1980.1163353 Crochiere R. E., 1983, MULTIRATE DIGITAL SI Doblinger G., 1991, P INT S CIRC SYST IS, V1, P646 ENGELSBER A, 1998, THESIS CHRISTIAN ALB Ephraim Y., 1984, ACOUSTICS SPEECH SIG, V32, P1109 Flannery B. P., 1992, NUMERICAL RECIPES C GALIJASEVIC E, 2002, P INT C AC SPEECH SI, V2, P1181 Gulzow T, 1998, SIGNAL PROCESS, V64, P5, DOI 10.1016/S0165-1684(97)00172-2 Gustafsson H, 2001, IEEE T SPEECH AUDI P, V9, P799, DOI 10.1109/89.966083 HU Y, 2006, P INTERSP PHIL US HU Y, 2006, P IEEE INT C AC SPEE, P153 KARP T, 1997, P INT C DIG SIGN PRO, V1, P443, DOI 10.1109/ICDSP.1997.628142 Kates J. M., 2005, EURASIP J APPL SIG P, V18, P3003 LEOU TY, 1984, P IEEE, V72, P980 Lollmann H. W., 2005, P EUR SIGN PROC C EU Lollmann H. W., 2006, P EUR SIGN PROC C EU Lotter T., 2005, EURASIP J APPL SIG P, V7, P1110 MALAH D, 1999, P IEEE INT C AC SPEE, P789 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 MORGAN DR, 1995, IEEE T SIGNAL PROCES, V43, P1819, DOI 10.1109/78.403341 OPPENHEI.A, 1971, PR INST ELECTR ELECT, V59, P299, DOI 10.1109/PROC.1971.8146 Oppenheim A. V., 1999, DISCRETE TIME SIGNAL PETROVSKY A, 2004, CONVENTION PAPER AUD Proakis J. G., 1996, DIGITAL SIGNAL PROCE RENFORS M, 1987, IEEE T CIRCUITS SYST, V34, P24, DOI 10.1109/TCS.1987.1086034 Schuller GDT, 2000, IEEE T SIGNAL PROCES, V48, P737, DOI 10.1109/78.824669 Smith JO, 1999, IEEE T SPEECH AUDI P, V7, P697, DOI 10.1109/89.799695 STEIGLITZ K, 1980, ASSP, V28, P111 VAIDYSNATHAN PP, 1993, MULTIRATE SYSTEMS VARY P, 1979, AEU-INT J ELECTRON C, V33, P293 Vary P., 1980, P EUR SIGN P C, P41 Vary P, 2006, SIGNAL PROCESS, V86, P1206, DOI 10.1016/j.sigpro.2005.06.020 Zwicker E, 1999, PSYCHOACOUSTICS FACT NR 40 TC 8 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL-AUG PY 2007 VL 49 IS 7-8 BP 574 EP 587 DI 10.1016/j.specom.2007.04.009 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 200OJ UT WOS:000248772400005 ER PT J AU Hu, Y Loizou, PC AF Hu, Yi Loizou, Philipos C. 
TI Subjective comparison and evaluation of speech enhancement algorithms SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; noise reduction; subjective evaluation; ITU-T P.835 ID SPECTRAL AMPLITUDE ESTIMATOR; SUBSPACE APPROACH; COLORED NOISE AB Making meaningful comparisons between the performance of the various speech enhancement algorithms proposed over the years has been elusive due to lack of a common speech database, differences in the types of noise used and differences in the testing methodology. To facilitate such comparisons, we report on the development of a noisy speech corpus suitable for evaluation of speech enhancement algorithms. This corpus is subsequently used for the subjective evaluation of 13 speech enhancement methods encompassing four classes of algorithms: spectral subtractive, subspace, statistical-model based and Wiener-type algorithms. The subjective evaluation was performed by Dynastat, Inc., using the ITU-T P.835 methodology designed to evaluate the speech quality along three dimensions: signal distortion, noise distortion and overall quality. This paper reports the results of the subjective tests. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Texas, Dept Elect Engn, Richardson, TX 75083 USA. RP Loizou, PC (reprint author), Univ Texas, Dept Elect Engn, Richardson, TX 75083 USA. EM loizou@utdallas.edu CR Berouti M., 1979, P IEEE INT C AC SPEE, P208 Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544 Cohen I., 2002, IEEE Signal Processing Letters, V9, DOI 10.1109/97.1001645 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Gustafsson H, 2001, IEEE T SPEECH AUDI P, V9, P799, DOI 10.1109/89.966083 Hirsch H.G., 2000, ISCA ITRW ASR 2000 Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949 Hu YQ, 2003, ENG MED BIOL SOC ANN, P334 IEEE Subcommittee, 1969, IEEE T AUDIO ELECTRO, V3, P225, DOI DOI 10.1109/TAU.1969.1162058 *ITUT, 1993, OBJ MEAS ACT SPEECH, P56 *ITUT, 2003, SUBJ TEST METH EV SP, P835 *ITUT, 2000, PREC EV SPEECH QUAL, P862 Jabloun F, 2003, IEEE T SPEECH AUDI P, V11, P700, DOI 10.1109/TSA.2003.818031 Kamath S., 2002, P IEEE INT C AC SPEE Kamath S. D., 2001, THESIS U TEXAS DALLA Loizou PC, 2005, IEEE T SPEECH AUDI P, V13, P857, DOI 10.1109/TSA.2005.851929 Loizou P.C., 2007, SPEECH ENHANCEMENT T, Vfirst Mittal U, 2000, IEEE T SPEECH AUDI P, V8, P159, DOI 10.1109/89.824700 RANGACHARI S, 2006, SPEECH COMMUN, P220 SCALART P, 1996, P IEEE INT C AC SPEE, P629 SOHN J, 1999, IEEE SIGNAL PROC JAN, P1 Tsoukalas D., 1997, IEEE T SPEECH AUDIO, V5, P479 NR 23 TC 120 Z9 134 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL-AUG PY 2007 VL 49 IS 7-8 BP 588 EP 601 DI 10.1016/j.specom.2006.12.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 200OJ UT WOS:000248772400006 ER PT J AU Reuven, G Gannot, S Cohen, I AF Reuven, Gal Gannot, Sharon Cohen, Israel TI Performance analysis of dual source transfer-function generalized sidelobe canceller SO SPEECH COMMUNICATION LA English DT Article ID SPEECH ENHANCEMENT; NOISE-REDUCTION AB In this work, we evaluate the performance of a recently proposed adaptive beamformer, namely Dual source Transfer-Function Generalized Sidelobe Canceller (DTF-GSC). 
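For orientation, the generalized sidelobe canceller structure that such beamformers build on can be sketched in a few lines of Python: a fixed beamformer, a blocking branch that removes the desired signal, and an adaptive filter that subtracts the remaining noise from the beamformer output. This is the generic textbook GSC with a two-microphone sum-and-difference front end and an NLMS canceller, not the transfer-function or dual-source variants analysed in the papers here; all signals and parameters are toy choices.

    import numpy as np

    def gsc_two_mic(x1, x2, L=32, mu=0.1, eps=1e-6):
        d = 0.5 * (x1 + x2)              # fixed beamformer (broadside target)
        u = x1 - x2                      # blocking branch: target removed
        w = np.zeros(L)
        y = np.zeros(len(d))
        for n in range(L, len(d)):
            ublk = u[n - L + 1:n + 1][::-1]
            e = d[n] - w @ ublk          # beamformer minus noise estimate
            w += mu * e * ublk / (ublk @ ublk + eps)   # NLMS update
            y[n] = e
        return y

    rng = np.random.default_rng(4)
    N = 4000
    s = np.sin(2 * np.pi * 0.01 * np.arange(N))        # broadside target
    v = rng.normal(size=N)                             # off-axis interferer
    x1, x2 = s + v, s + np.roll(v, 2)                  # 2-sample arrival difference
    out = gsc_two_mic(x1, x2)
    print("noise power, fixed beamformer:", np.var(0.5 * (x1 + x2) - s))
    print("noise power, GSC output:     ", np.var(out[200:] - s[200:]))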
The DTF-GSC is useful for enhancing a speech signal received by an array of microphones in a noisy and reverberant environment. We demonstrate the applicability of the DTF-GSC in some representative reverberant and non-reverberant environments under various noise field conditions. The performance is evaluated based on the power spectral density (PSD) deviation imposed on the desired signal at the beamformer output, the achievable noise reduction, and the interference reduction. We show that the resulting expressions for the PSD deviation and noise reduction depend on the actual acoustical environment, the noise field, and the estimation accuracy of the relative transfer functions (RTFs), defined as the ratio between each acoustical transfer function (ATF) and a reference ATF. The achievable interference reduction is generally independent of the noise field. Experimental results demonstrate the sensitivity of the system's performance to array misalignments. (c) 2007 Elsevier B.V. All rights reserved. C1 Technion Israel Inst Technol, Dept Elect Engn, IL-32000 Haifa, Israel. Bar Ilan Univ, Sch Engn, IL-52900 Ramat Gan, Israel. RP Reuven, G (reprint author), Technion Israel Inst Technol, Dept Elect Engn, IL-32000 Haifa, Israel. EM galrv@techunix.technion.ac.il; gannot@eng.biu.ac.il; icohen@ee.technion.ac.il CR ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599 BITZER J, 1998, EUSIPCO, V1, P105 BITZER J, 1999, P IEEE INT C AC SPEE, V5, P2965 BITZER J, 1999, IEEE WORKSHOP APPL S DALDEGAN N, 1988, SIGNAL PROCESS, V15, P43, DOI 10.1016/0165-1684(88)90027-8 Doclo S, 2002, IEEE T SIGNAL PROCES, V50, P2230, DOI 10.1109/TSP.2002.801937 Gannot S, 2001, IEEE T SIGNAL PROCES, V49, P1614, DOI 10.1109/78.934132 Gannot S, 2004, IEEE T SIGNAL PROCES, V52, P1115, DOI 10.1109/TSP.2004.823487 GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 HUARNG KC, 1990, IEEE T ACOUST SPEECH, V38, P209 NORDHOLM S, 1992, IEEE T SIGNAL PROCES, V40, P474, DOI 10.1109/78.124966 NORDHOLM S, 1993, IEEE T VEH TECHNOL, V42, P514, DOI 10.1109/25.260760 Nordholm S, 1999, IEEE T SPEECH AUDI P, V7, P241, DOI 10.1109/89.759030 Nordholm SE, 2000, J ACOUST SOC AM, V107, P1057, DOI 10.1121/1.428570 REUVEN G, 2005, INT WORKSH AC ECH NO, P27 REUVEN G, IEEE T SPEECH AUDIO Spriet A, 2004, SIGNAL PROCESS, V84, P2367, DOI 10.1016/j.sigpro.2004.07.028 Spriet A, 2005, IEEE T SPEECH AUDI P, V13, P487, DOI 10.1109/TSA.2005.845821 NR 18 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL-AUG PY 2007 VL 49 IS 7-8 BP 602 EP 622 DI 10.1016/j.specom.2006.12.007 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 200OJ UT WOS:000248772400007 ER PT J AU Reuven, G Gannot, S Cohen, I AF Reuven, Gal Gannot, Sharon Cohen, Israel TI Joint noise reduction and acoustic echo cancellation using the transfer-function generalized sidelobe canceller SO SPEECH COMMUNICATION LA English DT Article ID MICROPHONE ARRAY; SPEECH; ALGORITHM; SYSTEMS AB Man machine interaction requires an acoustic interface for providing full duplex hands-free communication. The transfer-function generalized sidelobe canceller (TF-GSC) is an adaptive beamformer suitable for enhancing a speech signal received by an array of microphones in a noisy and reverberant environment. 
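Since the analysis is phrased in terms of RTFs, a minimal sketch of RTF estimation may help: over a segment where the desired source dominates, the RTF between a microphone and the reference can be approximated by the ratio of cross- to auto-spectral densities. The TF-GSC literature uses more robust estimators that exploit speech non-stationarity in noise; the Python sketch below, with toy impulse responses and parameters, shows only the basic cross-spectral idea.

    import numpy as np
    from scipy.signal import csd, welch

    def estimate_rtf(x_ref, x_i, fs, nperseg=256):
        # Naive RTF estimate: ratio of cross- to auto-spectral density,
        # valid when the desired source dominates the segment.
        f, S_ri = csd(x_ref, x_i, fs=fs, nperseg=nperseg)
        _, S_rr = welch(x_ref, fs=fs, nperseg=nperseg)
        return f, S_ri / S_rr

    rng = np.random.default_rng(5)
    s = rng.normal(size=16000)           # stand-in source signal
    h1 = np.array([1.0, 0.4, 0.1])       # toy ATF to the reference mic
    h2 = np.array([0.0, 0.9, 0.3, 0.2])  # toy ATF to the second mic
    x1 = np.convolve(s, h1)[:16000]
    x2 = np.convolve(s, h2)[:16000]
    f, rtf = estimate_rtf(x1, x2, fs=8000)
    H1, H2 = np.fft.rfft(h1, 256), np.fft.rfft(h2, 256)
    print("max RTF estimation error:", np.max(np.abs(rtf - H2 / H1)))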
When an echo signal is also present in the microphone output signals, cascade schemes of acoustic echo cancellation and TF-GSC can be employed for suppressing both interferences. However, the performance obtainable with cascade schemes is generally insufficient. An acoustic echo canceller (AEC) that precedes the adaptive beamformer suffers from the noise component at its input. Acoustic echo cancellation following the adaptive beamformer lacks robustness due to time variations in the echo path affecting beamformer adaptation. In this paper, we introduce an echo transfer-function generalized sidelobe canceller (ETF-GSC), which combines the TF-GSC with an acoustic echo canceller. The proposed scheme consists of a primary TF-GSC for dealing with the noise interferences, and a secondary modified TF-GSC for dealing with the echo cancellation. The secondary TF-GSC includes an echo canceller embedded within a replica of the primary TF-GSC components. We show that using this structure, the problems encountered in the cascade schemes can be appropriately avoided. Experimental results demonstrate improved performance of the ETF-GSC compared to cascade schemes in noisy and reverberant environments. (c) 2007 Elsevier B.V. All rights reserved. C1 Technion Israel Inst Technol, Dept Elect Engn, IL-32000 Haifa, Israel. Bar Ilan Univ, Sch Engn, IL-52900 Ramat Gan, Israel. RP Cohen, I (reprint author), Technion Israel Inst Technol, Dept Elect Engn, IL-32000 Haifa, Israel. EM galrv@techunix.technion.ac.il; gannot@eng.biu.ac.il; icohen@ee.technion.ac.il CR Affes S, 1997, IEEE T SPEECH AUDI P, V5, P425, DOI 10.1109/89.622565 AFFES S, 1997, 22 IEEE INT C AC SPE, P269 ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599 AVARGEL Y, IN PRESS IEEE SIGNAL BITZER J, 1999, P IEEE INT C AC SPEE, V5, P2965 Dahl M, 1999, IEEE T VEH TECHNOL, V48, P1518, DOI 10.1109/25.790527 DOCLO S, 2000, P 25 IEEE INT C AC S, P1061 FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817 Gannot S, 2001, IEEE T SIGNAL PROCES, V49, P1614, DOI 10.1109/78.934132 Garofolo J., 1988, GETTING STARTED DARP GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 Herbordt W., 2004, P EUR SIGN PROC C EU, P2003 HERBORDT W, 2005, PRACTICAL ASPECTS MI, V315 HERBORDT W, 2005, P ICASSP2005, V3, P77, DOI 10.1109/ICASSP.2005.1415650 Herbordt W., 1843, P EURASIP EUR SIGN P, V3 Hoshuyama O, 1999, IEEE T SIGNAL PROCES, V47, P2677, DOI 10.1109/78.790650 Jeannes RL, 2001, IEEE T SPEECH AUDI P, V9, P808 KAMMEYER KD, 2005, P IEEE INT C ACOUSTI, V3, P137, DOI 10.1109/ICASSP.2005.1415665 KELLERMANN W, 2001, MICROPHONE ARRAYS SI, P281 KELLY W, 1997, P 38 GRASSL SOC VICT, P81 KellyPowell ML, 1997, RES NURS HEALTH, V20, P219, DOI 10.1002/(SICI)1098-240X(199706)20:3<219::AID-NUR5>3.0.CO;2-L LOW SY, 2005, P 2005 IEEE INT C AC, V3, P69 NORDHOLM S, 1992, IEEE T SIGNAL PROCES, V40, P474, DOI 10.1109/78.124966 REUVEN G, 2007, IN PRESS IEEE INT C REUVEN G, 2004, P 23 IEEE CONV EL EL, P412 Rombouts G, 2005, SIGNAL PROCESS, V85, P849, DOI 10.1016/j.sigpro.2004.11.017 Shynk JJ, 1992, IEEE SIGNAL PROC MAG, V9, P14, DOI 10.1109/79.109205 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 NR 28 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
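The cascade problem described above involves an acoustic echo canceller as one building block. For orientation only, a minimal stand-alone normalised-LMS echo canceller is sketched below; the paper's contribution is precisely to embed such a canceller inside a replica of the TF-GSC components rather than to run it in isolation.

```python
import numpy as np

def nlms_aec(far_end, mic, filt_len=128, mu=0.5, eps=1e-8):
    """Minimal time-domain NLMS acoustic echo canceller.

    far_end: loudspeaker (reference) signal; mic: microphone signal
    containing the echo. Returns the error (echo-suppressed) signal.
    """
    w = np.zeros(filt_len)
    e = np.zeros(len(mic))
    for n in range(filt_len, len(mic)):
        x = far_end[n - filt_len:n][::-1]      # most recent samples first
        y = w @ x                              # echo estimate
        e[n] = mic[n] - y
        w += mu * e[n] * x / (x @ x + eps)     # normalised LMS update
    return e

# Toy usage: a synthetic decaying echo path plus a little sensor noise.
rng = np.random.default_rng(1)
far = rng.normal(size=8000)
echo_path = rng.normal(size=64) * np.exp(-np.arange(64) / 8.0)
mic = np.convolve(far, echo_path)[:8000] + 0.01 * rng.normal(size=8000)
out = nlms_aec(far, mic)
print("echo power before/after:", np.var(mic[1000:]), np.var(out[1000:]))
```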
PD JUL-AUG PY 2007 VL 49 IS 7-8 BP 623 EP 635 DI 10.1016/j.specom.2006.12.008 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 200OJ UT WOS:000248772400008 ER PT J AU Doclo, S Spriet, A Wouters, J Moonen, M AF Doclo, Simon Spriet, Ann Wouters, Jan Moonen, Marc TI Frequency-domain criterion for the speech distortion weighted multichannel Wiener filter for robust noise reduction SO SPEECH COMMUNICATION LA English DT Article DE multi-microphone noise reduction; adaptive frequency-domain algorithms; multichannel Wiener filter; generalized sidelobe canceler; hearing aids ID GENERALIZED SIDELOBE CANCELLATION; MICROPHONE ARRAY; HEARING-AIDS; BLOCKING MATRIX; ENHANCEMENT; PERFORMANCE; CANCELER; ERRORS; GAIN AB Recently, a generalized multi-microphone noise reduction scheme, referred to as the spatially pre-processed speech distortion weighted multichannel Wiener filter (SP-SDW-MWF), has been presented. This scheme consists of a fixed spatial pre-processor and a multichannel adaptive noise canceler (ANC) optimizing the SDW-MWF cost function. By taking speech distortion explicitly into account in the design criterion of the multichannel ANC, the SP-SDW-MWF adds robustness to the standard generalized sidelobe canceler (GSC). In this paper, we present a multichannel frequency-domain criterion for the SDW-MWF, from which several - existing and novel - adaptive frequency-domain algorithms can be derived. The main difference between these adaptive algorithms consists in the calculation of the step size matrix (constrained vs. unconstrained, block-structured vs. diagonal) used in the update formula for the multichannel adaptive filter. We investigate the noise reduction performance, the robustness and the tracking performance of these adaptive algorithms, using a perfect voice activity detection (VAD) mechanism and using an energy-based VAD. Using experimental results with a small-sized microphone array in a hearing aid, it is shown that the SP-SDW-MWF is more robust against signal model errors than the GSC, and that the block-structured step size matrix gives rise to a faster convergence and a better tracking performance than the diagonal step size matrix, only at a slightly higher computational cost. (c) 2007 Elsevier B.V. All rights reserved. C1 Katholieke Univ Leuven, Dept Elect Engn, ESAT, SCD, B-3001 Heverlee, Belgium. RP Doclo, S (reprint author), Katholieke Univ Leuven, Dept Elect Engn, ESAT, SCD, Kasteelpk Arenberg 10 Bus 2446, B-3001 Heverlee, Belgium. 
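The SDW-MWF criterion discussed in this record has a compact per-frequency-bin closed form. The sketch below assumes the speech and noise covariance matrices are given, whereas the paper's point is the adaptive frequency-domain algorithms (with differently structured step size matrices) that approach this solution recursively.

```python
import numpy as np

def sdw_mwf_weights(R_speech, R_noise, mu=1.0, ref=0):
    """Speech distortion weighted multichannel Wiener filter for one bin:
    w = (R_s + mu * R_n)^{-1} R_s e_ref.

    mu > 1 trades extra speech distortion for extra noise reduction;
    mu = 1 gives the standard multichannel Wiener filter.
    """
    e = np.zeros(R_speech.shape[0]); e[ref] = 1.0
    return np.linalg.solve(R_speech + mu * R_noise, R_speech @ e)

# Toy 4-microphone example at a single frequency bin.
rng = np.random.default_rng(2)
d = rng.normal(size=4) + 1j * rng.normal(size=4)   # steering vector
R_s = np.outer(d, d.conj())                        # rank-1 speech covariance
R_n = np.eye(4) * 0.1                              # white noise covariance
w = sdw_mwf_weights(R_s, R_n, mu=2.0)
print(np.abs(w.conj() @ d))   # response towards the speech source
```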
EM simon.doclo@esat.kuleuven.be RI Doclo, Simon/A-5472-2008; Wouters, Jan/D-1800-2015 CR [Anonymous], 1997, S351997 ANSI BENESTY J, 2001, ADV NETWORK ACOUSTIC, P157 Buchel C, 2005, PHOTOSYNTH RES, V85, P3, DOI 10.1007/s11120-004-3195-8 CLAESSON I, 1992, IEEE T ANTENN PROPAG, V40, P1093, DOI 10.1109/8.166535 COX H, 1987, IEEE T ACOUST SPEECH, V35, P1365, DOI 10.1109/TASSP.1987.1165054 Doclo S, 2003, IEEE T SIGNAL PROCES, V51, P2511, DOI 10.1109/TSP.2003.816885 Doclo S, 2002, IEEE T SIGNAL PROCES, V50, P2230, DOI 10.1109/TSP.2002.801937 DOCLO S, 2001, MICROPHONE ARRAYS SI, P111 DOCLO S, P EUR SIGN PROC C EU, P2007 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817 Gannot S, 2001, IEEE T SIGNAL PROCES, V49, P1614, DOI 10.1109/78.934132 GREENBERG JE, 1993, J ACOUST SOC AM, V94, P3009, DOI 10.1121/1.407334 GREENBERG JE, 1992, J ACOUST SOC AM, V91, P1662, DOI 10.1121/1.402446 GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 HERBORDT W, 2003, P INT WORKSH AC ECH, P247 Herbordt W, 2003, EURASIP J APPL SIG P, V2003, P21, DOI 10.1155/S1110865703211094 HERBORDT W, 2003, ADAPTIVE SIGNAL PROC, P155 HOFFMAN MW, 1995, IEEE T SPEECH AUDI P, V3, P193, DOI 10.1109/89.388145 Hoshuyama O, 1999, IEEE T SIGNAL PROCES, V47, P2677, DOI 10.1109/78.790650 Hoshuyama O, 2001, IEICE T FUND ELECTR, VE84A, P406 JABLON NK, 1986, IEEE T ANTENN PROPAG, V34, P996, DOI 10.1109/TAP.1986.1143936 JENSEN LB, 2004, Patent No. 6741714 LINK MJ, 1993, J ACOUST SOC AM, V93, P2139, DOI 10.1121/1.406676 NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 NORDEBO S, 1994, IEEE J OCEANIC ENG, V19, P583, DOI 10.1109/48.338394 NORDHOLM S, 1993, IEEE T VEH TECHNOL, V42, P514, DOI 10.1109/25.260760 Rombouts G, 2003, SIGNAL PROCESS, V83, P1889, DOI 10.1016/S0165-1684(03)00107-5 Shynk JJ, 1992, IEEE SIGNAL PROC MAG, V9, P14, DOI 10.1109/79.109205 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 Spriet A, 2004, SIGNAL PROCESS, V84, P2367, DOI 10.1016/j.sigpro.2004.07.028 Spriet A, 2005, SIGNAL PROCESS, V85, P1073, DOI 10.1016/j.sigpro.2005.01.005 Spriet A, 2005, IEEE T SPEECH AUDI P, V13, P487, DOI 10.1109/TSA.2005.845821 Spriet A, 2005, IEEE T SIGNAL PROCES, V53, P911, DOI 10.1109/TSP.2004.842182 Van Compernolle D., 1990, P IEEE INT C AC SPEE, V2, P833 Van Gerven S., 1997, P EUROSPEECH, V3, P1095 Van Veen B. D., 1988, IEEE ASSP Magazine, V5, DOI 10.1109/53.665 Vanden Berghe J, 1998, J ACOUST SOC AM, V103, P3621, DOI 10.1121/1.423066 NR 38 TC 48 Z9 49 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL-AUG PY 2007 VL 49 IS 7-8 BP 636 EP 656 DI 10.1016/j.specom.2007.02.001 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 200OJ UT WOS:000248772400009 ER PT J AU Lefkimmiatis, S Maragos, P AF Lefkimmiatis, Stamatios Maragos, Petros TI A generalized estimation approach for linear and nonlinear microphone array post-filters SO SPEECH COMMUNICATION LA English DT Article DE nonlinear; noise reduction; speech enhancement; microphone array; post-filter; complex coherence ID SPECTRAL AMPLITUDE ESTIMATOR; ENVIRONMENTS; FIELD AB This paper presents a robust and general method for estimating the transfer functions of microphone array post-filters, derived under various speech enhancement criteria. 
For the case of the mean square error (MSE) criterion, the proposed method is an improvement of the existing McCowan post-filter, which, under the assumption of a known noise field coherence function, uses the auto- and cross-spectral densities of the microphone array noisy inputs to estimate the Wiener post-filter transfer function. In contrast to the McCowan post-filter, the proposed method takes into account the noise reduction performed by the minimum variance distortionless response (MVDR) beamformer and obtains a more accurate estimation of the noise spectral density. Furthermore, the proposed estimation approach is general and can be used for the derivation of both linear and nonlinear microphone array post-filters, according to the utilized enhancement criterion. In experiments with real noise multichannel recordings, the proposed technique has been shown to obtain a significant gain over the other studied methods in terms of five different objective speech quality measures. (c) 2007 Elsevier B.V. All rights reserved. C1 Natl Tech Univ Athens, Sch Elect & Comp Engn, GR-15773 Athens, Greece. RP Lefkimmiatis, S (reprint author), Natl Tech Univ Athens, Sch Elect & Comp Engn, GR-15773 Athens, Greece. EM sleukim@cs.ntua.gr; maragos@cs.ntua.gr CR ALLEN JB, 1977, J ACOUST SOC AM, V62, P912, DOI 10.1121/1.381621 Balan R., 2002, P SENS ARR MULT SIGN, P209 BITZER J, 2001, MICROPHONE ARRAYS SI, P19 Cohen I, 2004, IEEE T SIGNAL PROCES, V52, P1149, DOI 10.1109/TSP.2004.826166 COHEN I, 2002, INT C ACOUSTICS SPE, V1, P901 COX H, 1987, IEEE T ACOUST SPEECH, V35, P1365, DOI 10.1109/TASSP.1987.1165054 COX H, 1986, IEEE T ACOUST SPEECH, V34, P393, DOI 10.1109/TASSP.1986.1164847 Doclo S, 2003, SIGNAL PROCESS, V83, P2641, DOI 10.1016/j.sigpro.2003.07.005 ELKO GW, 2001, MICROPHONE ARRAYS SI, P61 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FISCHER S, 1997, INT C ACOUSTICS SPEC, V1, P359 Fischer S, 1996, SPEECH COMMUN, V20, P215, DOI 10.1016/S0167-6393(96)00054-4 Hansen J. H., 1998, INT C SPOK LANG PROC, P2819 Johnson D, 1993, ARRAY SIGNAL PROCESS Kay S. M., 1993, FUNDAMENTALS STAT SI LEUKIMMIATIS S, 2006, P INT EUR, P2142 Marro C, 1998, IEEE T SPEECH AUDI P, V6, P240, DOI 10.1109/89.668818 McCowan IA, 2003, IEEE T SPEECH AUDI P, V11, P709, DOI 10.1109/TSA.2003.818212 MEYER J, 1997, P ICASSP 97, V2, P1167 Poor HV, 1998, INTRO SIGNAL DETECTI Rabiner L., 1978, DIGITAL SIGNAL PROCE SIMMER KU, 2001, MICROPHONE ARRAYS SI, P39 Sullivan T., 1996, CMU MICROPHONE ARRAY VARY P, 1985, SIGNAL PROCESS, V8, P387, DOI 10.1016/0165-1684(85)90002-7 Veen B. V., 1988, IEEE ASSP MAG APR, V5, P4 Zelinski R., 1988, IEEE INT C AC SPEECH, V5, P2578 NR 27 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
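For reference, the McCowan-style estimator that this record improves upon can be written down compactly: given a known noise coherence function, the speech PSD is estimated from pairwise auto- and cross-spectra and turned into a Wiener post-filter. The sketch below follows the published estimator in simplified form and does not include the paper's correction for the MVDR beamformer's own noise reduction.

```python
import numpy as np

def mccowan_postfilter(phi_xx, phi_xixj, gamma, eps=1e-12):
    """Wiener post-filter transfer function from pairwise spectra.

    phi_xx:   (num_mics, num_bins) auto-PSDs of the microphone signals.
    phi_xixj: dict {(i, j): cross-PSD array} for microphone pairs i < j.
    gamma:    dict {(i, j): complex noise coherence per bin}.
    Returns H(f) = estimated speech PSD / mean input PSD.
    """
    est = []
    for (i, j), cross in phi_xixj.items():
        g = np.real(gamma[(i, j)])
        num = np.real(cross) - 0.5 * g * (phi_xx[i] + phi_xx[j])
        est.append(num / np.maximum(1.0 - g, eps))
    phi_ss = np.maximum(np.mean(est, axis=0), 0.0)   # averaged speech PSD
    return phi_ss / np.maximum(phi_xx.mean(axis=0), eps)

# Toy usage: two microphones, spatially white (zero-coherence) noise field.
bins = 129
phi_xx = np.ones((2, bins))
phi_xixj = {(0, 1): np.full(bins, 0.3 + 0j)}
gamma = {(0, 1): np.zeros(bins)}
print(mccowan_postfilter(phi_xx, phi_xixj, gamma)[:3])
```

As the real part of the coherence approaches 1 (e.g., a diffuse noise field at low frequencies), the denominator vanishes, which is one known weakness such estimators must guard against.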
PD JUL-AUG PY 2007 VL 49 IS 7-8 BP 657 EP 666 DI 10.1016/j.specom.2007.02.004 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 200OJ UT WOS:000248772400010 ER PT J AU Rivet, B Girin, L Jutten, C AF Rivet, Bertrand Girin, Laurent Jutten, Christian TI Visual voice activity detection as a help for speech source separation from convolutive mixtures SO SPEECH COMMUNICATION LA English DT Article DE speech source separation; convolutive mixtures; voice activity detector; visual speech processing; speech enhancement; highly non-stationary environments ID BLIND SIGNAL SEPARATION AB Audio-visual speech source separation consists in mixing visual speech processing techniques (e.g., lip parameters tracking) with source separation methods to improve the extraction of a speech source of interest from a mixture of acoustic signals. In this paper, we present a new approach that combines visual information with separation methods based on the sparseness of speech: visual information is used as a voice activity detector (VAD) which is combined with a new geometric method of separation. The proposed audiovisual method is shown to be efficient to extract a real spontaneous speech utterance in the difficult case of convolutive mixtures even if the competing sources are highly non-stationary. Typical gains of 18-20 dB in signal to interference ratios are obtained for a wide range of (2 x 2) and (3 x 3) mixtures. Moreover, the overall process is computationally quite simpler than previously proposed audio-visual separation schemes. (c) 2007 Elsevier B.V. All rights reserved. C1 Univ Grenoble 3, INPG, CNRS UMR 5009, Inst Commun Parlee ICP, Grenoble, France. Univ Grenoble 1, INPG, CNRS UMR 5083, LIS, Grenoble, France. RP Rivet, B (reprint author), Univ Grenoble 3, INPG, CNRS UMR 5009, Inst Commun Parlee ICP, Grenoble, France. EM rivet@icp.inpg.fr CR Abrard F, 2005, SIGNAL PROCESS, V85, P1389, DOI 10.1016/j.sigpro.2005.02.010 BABAIEZADEH M, 2004, P ICA 2004 GRAN SPAI, P798 Bernstein L. E., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607895 CAPDEVIELLE V, 1995, P IEEE INT C AC SPEE, P2080 Cardoso JF, 1998, P IEEE, V86, P2009, DOI 10.1109/5.720250 DANSEREAU R, 2004, P IEEE INT C AC SPEE DAPENA A, 2001, P INT C IND COMP AN, P315 Elisei F., 2001, P AUD VIS SPEECH PRO, P90 Girin L, 2001, J ACOUST SOC AM, V109, P3007, DOI 10.1121/1.1358887 LALLOUACHE T, 1990, P JOURN ET PAR JEP F LEGOFF B, 1995, P EUR C SPEECH COMM, P291 Liu P., 2004, P IEEE INT C AC SPEE, P609 Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214 PHAM DT, 2003, P INT C IND COMP AN RAJARAM S, 2004, P IEEE INT C AC SPEE Rivet B, 2007, IEEE T AUDIO SPEECH, V15, P96, DOI 10.1109/TASL.2006.872619 SODOYER D, 2006, P IEEE INT C AC SPEE, P601 Sodoyer D, 2004, SPEECH COMMUN, V44, P113, DOI 10.1016/j.specom.2004.10.002 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 WANG W, 2005, P IEEE INT C AC SIGN NR 20 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
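To make the role of the visual voice activity detector concrete, here is a deliberately crude instantaneous-mixture analogue of the geometric idea: frames in which the (visually detected) target is silent reveal the competing source's mixing direction, which can then be cancelled. The paper itself works per frequency bin on convolutive mixtures; everything below (signals, mixing matrix, VAD mask) is made up.

```python
import numpy as np

def extract_with_visual_vad(mixtures, target_active):
    """Extract a target from a 2 x 2 instantaneous mixture using a
    (visually derived) activity flag for the target speaker.

    mixtures: (2, num_samples); target_active: boolean mask per sample.
    While the target is silent the observations contain only the competing
    source, so its direction is the principal eigenvector of the covariance.
    """
    x_off = mixtures[:, ~target_active]          # interferer-only segments
    cov = x_off @ x_off.T / x_off.shape[1]
    _, vecs = np.linalg.eigh(cov)
    b = vecs[:, -1]                              # interferer direction
    null = np.array([b[1], -b[0]])               # orthogonal -> cancels it
    return null @ mixtures

rng = np.random.default_rng(3)
s1, s2 = rng.normal(size=(2, 4000))
vad = np.arange(4000) % 1000 < 500               # pretend visual VAD output
s1 = s1 * vad                                    # target silent half the time
A = np.array([[1.0, 0.6], [0.4, 1.0]])
x = A @ np.vstack([s1, s2])
y = extract_with_visual_vad(x, vad)
print(np.corrcoef(y, s1)[0, 1])                  # close to +/- 1
```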
PD JUL-AUG PY 2007 VL 49 IS 7-8 BP 667 EP 677 DI 10.1016/j.specom.2007.04.008 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 200OJ UT WOS:000248772400011 ER PT J AU Rotovnik, T Maucec, MS Kacic, Z AF Rotovnik, Tomaz Maucec, Mirjam Sepesy Kacic, Zdravko TI Large vocabulary continuous speech recognition of an inflected language using stems and endings SO SPEECH COMMUNICATION LA English DT Article DE large vocabulary continuous speech recognition; sub-word modeling; search algorithm; stem; ending ID SEARCH; MODELS AB In this article, we focus on creating a large vocabulary speech recognition system for the Slovenian language. Currently, state-of-the-art recognition systems are able to use vocabularies with sizes of 20,000 to 100,000 words. These systems have mostly been developed for English, which belongs to a group of uninflectional languages. Slovenian, as a Slavic language, belongs to a group of inflectional languages. Its rich morphology presents a major problem in large vocabulary speech recognition. Compared to English, the Slovenian language requires a vocabulary approximately 10 times greater for the same degree of text coverage. Consequently, the difference in vocabulary size causes a high rate of OOV (out-of-vocabulary) words. Therefore, OOV words have a direct impact on recognizer efficiency. The characteristics of inflectional languages have been considered when developing a new search algorithm with a method for restricting the correct order of sub-word units and with separate language models based on sub-words. This search algorithm combines the properties of sub-word-based models (reduced OOV) and word-based models (the length of context). The algorithm also enables better search-space limitation for sub-word models. Using sub-word models, we increase recognizer accuracy and achieve a comparable search space to that of a standard word-based recognizer. Our methods were evaluated in experiments on a SNABI speech database. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Maribor, Fac Elect Engn & Comp Sci, SLO-2000 Maribor, Slovenia. RP Rotovnik, T (reprint author), Univ Maribor, Fac Elect Engn & Comp Sci, Smetanova 17, SLO-2000 Maribor, Slovenia. EM tomaz.rotovnik@uni-mb.si
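A toy illustration of the vocabulary argument made in this abstract: because endings are shared across stems in an inflected language, a stem + ending inventory grows much more slowly than a full-form word list. The splitting rule and word list below are invented; the paper uses linguistically motivated decompositions and enforces the legal stem-to-ending order inside the decoder.

```python
# Longest-match-first toy endings, Slovenian-like but hypothetical.
ENDINGS = ("ami", "om", "a", "e", "i", "o")

def split(word):
    """Split a word into (stem, '+ending') or (word, None)."""
    for suf in ENDINGS:
        if word.endswith(suf) and len(word) > len(suf) + 2:
            return word[: -len(suf)], "+" + suf   # mark endings as units
    return word, None

# Three stems, each seen with six endings: 18 full forms.
corpus = [stem + end
          for stem in ("lip", "mest", "rok")
          for end in ("a", "e", "i", "o", "ami", "om")]
subword_vocab = set()
for w in corpus:
    stem, end = split(w)
    subword_vocab.update([stem] + ([end] if end else []))
print(len(set(corpus)), "word units vs", len(subword_vocab), "sub-word units")
```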
CR Bellman R. E., 1957, DYNAMIC PROGRAMMING Beyerlein P, 2002, SPEECH COMMUN, V37, P109, DOI 10.1016/S0167-6393(01)00062-0 BYRNE W, 2001, P 7 EUR C SPEECH COM, P487 Byrne W, 2000, LECT NOTES ARTIF INT, V1902, P211 CARKI K, 2000, P IEEE INT C AC SPEE, V3, P1563 CHOI I, 2004, P INT C SPOK LANG PR CILINGIR O, 2003, P EUR C SPEECH COMM, P1185 COMRIE B, 2001, SLAVONIC LANGUAGES Deshmukh N, 1999, IEEE SIGNAL PROC MAG, V16, P84, DOI 10.1109/79.790985 DIMEC J, 1999, WWW SEARCH ENGINE SL ERDOGAN H, 2005, P AUT SPEECH REC UND Evermann G., 2003, P ASRU ST THOM US VI, P7 GEUNTNER P, 1995, P INT C AC SPEECH SI, P445 GEUNTNER P, 1998, P INT C AC SPEECH SI, P925 GEUNTNER P, 1998, DARPA BROADC NEWS TR IRCING P, 2002, P INT C SPEECH COMP, P23 KACIC Z, 2000, P INT C LANG RES EV Kanthak S., 2002, P INT C SPOK LANG PR, P1309 Kwon OW, 2003, SPEECH COMMUN, V39, P287, DOI 10.1016/S0167-6393(02)00031-6 Mohri M, 2002, COMPUT SPEECH LANG, V16, P69, DOI 10.1006/csla.2001.0184 OHTSUKI K, 1999, SPEECH COMMUN, V28, P83 POPOVIC M, 1992, J AM SOC INFORM SCI, V43, P384, DOI 10.1002/(SICI)1097-4571(199206)43:5<384::AID-ASI6>3.0.CO;2-L RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 ROTOVNIK T, 2003, IEEE AUT SPEECH REC, P83 ROTOVNIK T, 2002, P INT C TEXT SPEECH, P329 SEPESY MM, 2002, THESIS U MARIBOR SEPESY MM, 2003, INT J SPEECH TECHNOL, P245 Sixtus A, 2002, COMPUT SPEECH LANG, V16, P245, DOI 10.1006/csla.2002.192 Stolcke A., 2002, P INT C SPOK LANG PR, P901 Szarvas M., 2003, P ICASSP HONG KONG C, P368 WOODLAND PC, 1994, P IEEE INT C AC SPEE, V2, P125 NR 31 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2007 VL 49 IS 6 BP 437 EP 452 DI 10.1016/j.specom.2007.02.010 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 194CU UT WOS:000248322400001 ER PT J AU Niesler, T AF Niesler, Thomas TI Language-dependent state clustering for multilingual acoustic modelling SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ISCA Workshop on Multilingual Speech and Language Processing CY APR 09-11, 2006 CL Stellenbosch, SOUTH AFRICA SP ISCA HO Stellenbosch Univ DE multilinguality; multilingual acoustic models; multilingual speech recognition ID SPEECH RECOGNITION AB The need to compile annotated speech databases remains an impediment to the development of automatic speech recognition (ASR) systems in under-resourced multilingual environments. We investigate whether it is possible to combine speech data from different languages spoken within the same multilingual population to improve the overall performance of a speech recognition system. For our investigation, we use recently collected Afrikaans, South African English, Xhosa and Zulu speech databases. Each consists of between 6 and 7 h of speech that has been annotated at the phonetic and the orthographic level using a common IPA-based phone set. We compare the performance of separate language-specific systems with that of multilingual systems based on straightforward pooling of training data as well as on a data-driven alternative. For the latter, we extend the decision-tree clustering process normally used to construct tied-state hidden Markov models to allow the inclusion of language-specific questions, and compare the performance of systems that allow sharing between languages with those that do not.
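The extension described above amounts to letting language identity compete with phonetic context as a splitting question during tied-state clustering. A minimal sketch of that question-selection step, using a single-Gaussian likelihood gain as the split criterion (the data, tags and questions below are invented):

```python
import numpy as np

def gauss_loglik(x):
    """Log-likelihood of data under one diagonal Gaussian, the usual
    sufficient-statistics criterion in tied-state clustering."""
    var = x.var(axis=0) + 1e-6
    n, d = x.shape
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def best_question(states, questions):
    """Pick the question with the largest likelihood gain. Each state is
    (tags, features); language identity is just another tag, so a
    'language' question competes directly with phonetic-context ones."""
    pool = np.vstack([f for _, f in states])
    base = gauss_loglik(pool)
    gains = {}
    for name, pred in questions:
        yes = [f for t, f in states if pred(t)]
        no = [f for t, f in states if not pred(t)]
        if yes and no:
            gains[name] = (gauss_loglik(np.vstack(yes))
                           + gauss_loglik(np.vstack(no)) - base)
    return max(gains, key=gains.get)

rng = np.random.default_rng(4)
states = [({"lang": "zulu", "left": "n"}, rng.normal(0.5, 1, (50, 2))),
          ({"lang": "xhosa", "left": "n"}, rng.normal(0.4, 1, (50, 2))),
          ({"lang": "zulu", "left": "a"}, rng.normal(-2.0, 1, (50, 2))),
          ({"lang": "xhosa", "left": "a"}, rng.normal(-2.1, 1, (50, 2)))]
questions = [("left-is-nasal", lambda t: t["left"] == "n"),
             ("language-is-zulu", lambda t: t["lang"] == "zulu")]
print(best_question(states, questions))   # here the phonetic split wins
```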
We find that multilingual acoustic models obtained in this way show a small but consistent improvement over separate-language systems as well as systems based on IPA-based data pooling. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa. RP Niesler, T (reprint author), Univ Stellenbosch, Dept Elect & Elect Engn, ZA-7600 Stellenbosch, South Africa. EM trn@dsp.sun.ac.za CR Bisani M., 2004, P ICASSP *INT PHON ASS, 2001, HDB INT PHON ASS KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 Kohler J, 2001, SPEECH COMMUN, V35, P21, DOI 10.1016/S0167-6393(00)00093-5 LOUW P, 2005, S AFR J AFR LANG, V25, P71 LOUW P, 2001, P EUROSPEECH NEY H, 1994, COMPUT SPEECH LANG, V8, P1, DOI 10.1006/csla.1994.1001 NIESLER T, 2005, LINGUIST APPL LANG S, V23, P459 Roux J.C., 2004, P LREC Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7 SCHULTZ T, 1998, P ICSLP Schultz T., 2000, P ICASSP Statistics South Africa, 2004, CENS 2001 PRIM TABL UEBLER U, 1998, P ICSLP Uebler U, 2001, SPEECH COMMUN, V35, P53, DOI 10.1016/S0167-6393(00)00095-9 Waibel A, 2000, P IEEE, V88, P1297, DOI 10.1109/5.880085 WARD T, 1998, P ICSLP WEBB V, 2002, WORLD C LANG POL BAR WISSING D, 2004, P LREC Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 Young Steve, 2002, HTK BOOK VERSION 3 2 NR 21 TC 7 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2007 VL 49 IS 6 BP 453 EP 463 DI 10.1016/j.specom.2007.04.001 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 194CU UT WOS:000248322400002 ER PT J AU Radfar, MH Dansereau, RM Sayadiyan, A AF Radfar, Mohammad H. Dansereau, Richard M. Sayadiyan, Abolghasem TI Monaural speech segregation based on fusion of source-driven with model-driven techniques SO SPEECH COMMUNICATION LA English DT Article DE speech processing; monaural speech segregation; CASA; speech coding; harmonic modelling; vector quantization; envelope extraction; multi-pitch tracking; MIXMAX estimator ID OVERCOMPLETE REPRESENTATIONS; BLIND SEPARATION; NOISY SPEECH; ENHANCEMENT; SUPPRESSION; TRACKING AB In this paper, by exploiting the prevalent methods in speech coding and synthesis, a new single-channel speech segregation technique is presented. The technique integrates a model-driven method with a source-driven method to take advantage of both individual approaches and reduce their pitfalls significantly. We apply harmonic modelling in which the pitch and spectrum envelope are the main components for the analysis and synthesis stages. Pitch values of two speakers are obtained by using a source-driven method. The spectrum envelope is obtained by using a new model-driven technique consisting of four components: a trained codebook of the vector quantized envelopes (VQ-based separation), a mixture-maximum approximation (MIXMAX), a minimum mean square error (MMSE) estimator, and a harmonic synthesizer.
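Of the four components just listed, the VQ/MIXMAX stage is easy to sketch: in the log-spectral domain a mixture is approximately the element-wise maximum of its components, so one searches for the codeword pair whose maximum best matches the observed envelope. Codebooks and data below are random stand-ins; the paper refines these estimates with an MMSE estimator and resynthesises harmonically.

```python
import numpy as np

def mixmax_vq_separate(mix_logspec, codebook_a, codebook_b):
    """Pick the codeword pair whose element-wise maximum best explains
    the mixture log-spectrum (the MIXMAX approximation).

    mix_logspec: (num_bins,); codebooks: (num_codewords, num_bins).
    Returns the selected envelope estimates for the two speakers.
    """
    best, best_err = None, np.inf
    for ca in codebook_a:
        err = ((np.maximum(ca, codebook_b) - mix_logspec) ** 2).sum(axis=1)
        k = err.argmin()
        if err[k] < best_err:
            best, best_err = (ca, codebook_b[k]), err[k]
    return best

# Toy usage: the true pair should be recovered from a noisy mixture.
rng = np.random.default_rng(5)
cb_a = rng.normal(size=(32, 64)); cb_b = rng.normal(size=(32, 64))
mix = np.maximum(cb_a[3], cb_b[17]) + 0.01 * rng.normal(size=64)
env_a, env_b = mixmax_vq_separate(mix, cb_a, cb_b)
print(np.allclose(env_a, cb_a[3]), np.allclose(env_b, cb_b[17]))
```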
The results show that although for the speaker-dependent case, model-based separation delivers the best quality, for a speaker independent scenario the integrated model outperforms the individual approaches. This result supports the idea that the human auditory system takes on both grouping cues (e.g., pitch tracking) and a priori knowledge (e.g., trained quantized envelopes) to segregate speech signals. (C) 2007 Elsevier B.V. All rights reserved. C1 Carleton Univ, Dept Syst & Comp Engn, Ottawa, ON K1S 5B6, Canada. Amirkabir Univ Technol, Dept Elect Engn, Tehran 15875 4413, Iran. RP Radfar, MH (reprint author), Carleton Univ, Dept Syst & Comp Engn, 1125 Colonel By Dr, Ottawa, ON K1S 5B6, Canada. EM radfar@sce.carleton.ca CR Barker JP, 2005, SPEECH COMMUN, V45, P5, DOI 10.1016/j.specom.2004.05.002 Beierholm T., 2004, P ICASSP 04, V5, P529 BELL AJ, 1995, NEURAL COMPUT, V7, P1129, DOI 10.1162/neco.1995.7.6.1129 BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 Burslem R, 2002, MATER WORLD, V10, P6 CHAZAN D, 1993, P ICASSP 93, P728 CHU WC, 2004, EURASIP J APPL SIG P, V17, P2601 Cooke M, 2006, J ACOUST SOC AM, V120, P2421, DOI 10.1121/1.2229005 De Boor C., 1978, PRACTICAL GUIDE SPLI EPHRAIM Y, 1989, IEEE T ACOUST SPEECH, V37, P1846, DOI 10.1109/29.45532 ERIKSSON T, 1998, P ICASSP 98, V1, P37, DOI 10.1109/ICASSP.1998.674361 Fant G., 1973, SPEECH SOUNDS FEATUR FEVOTTE C, 2005, IEEE T SPEECH AUDIO, V4, P1 Gersho A., 1992, VECTOR QUANTIZATION Girolami M, 2001, NEURAL COMPUT, V13, P2517, DOI 10.1162/089976601753196003 HANSON BA, 1984, P ICASSP 84, V9, P65 Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812 Jang G.J., 2003, P ADV NEUR INF PROC, P1173 KAMEOKA H, 2004, INTERSPEECH 2004, V1, P2433 KRISTJANSSON T, 2004, P ICASSP 04 MAY, P817 KWON YH, 2000, P IEEE INT S CIRC SY, V3, P722 Lee TW, 1999, IEEE SIGNAL PROC LET, V6, P87 Martin P., 1982, P ICASSP 82, V7, P180 MCAULAY RJ, 1995, SPEECH CODING SYNTHE Moore BCJ, 1997, INTRO PSYCHOL HEARIN Morgan DP, 1997, IEEE T SPEECH AUDI P, V5, P407, DOI 10.1109/89.622561 NADAS A, 1989, IEEE T ACOUST SPEECH, V37, P1495, DOI 10.1109/29.35387 NAYLOR JA, 1987, P ICASSP 87, V1, P205 Paliwal KK, 2005, SPEECH COMMUN, V45, P153, DOI 10.1016/j.specom.2004.08.001 PARSONS TW, 1976, J ACOUST SOC AM, V60, P911, DOI 10.1121/1.381172 PAUL DB, 1981, IEEE T ACOUST SPEECH, V29, P786, DOI 10.1109/TASSP.1981.1163643 Pobloth H, 2003, J ACOUST SOC AM, V114, P1081, DOI 10.1121/1.1594190 QUATIERI TF, 1990, IEEE T ACOUST SPEECH, V38, P56, DOI 10.1109/29.45618 Rabiner L.R., 1978, DIGITAL PROCESSING S Radfar M., 2006, ELECTRON LETT, V42, P75 Reddy A.M, 2004, INTERSPEECH 2004 OCT, P2445 Reyes-Gomez M. 
J., 2004, P ICASSP 04, V5, P641 Roweis S.T., 2000, P NEUR INF PROC SYST, P793 ROWIES ST, 2003, EUROSPEECH 03, V7, P1009 Sameti H, 1998, IEEE T SPEECH AUDI P, V6, P445, DOI 10.1109/89.709670 SECREST BG, 1983, P IEEE INT C AC SPEE, V8, P1352 SPIEGEL MR, 1998, SCHAUMS MATH HDB FOR Talkin D., 1995, SPEECH CODING SYNTHE Tolonen T, 2000, IEEE T SPEECH AUDI P, V8, P708, DOI 10.1109/89.876309 van der Kouwe AJW, 2001, IEEE T SPEECH AUDI P, V9, P189, DOI 10.1109/89.905993 VIRTANEN T, 2000, P IEEE INT C AC SPEE, P765 Wang DLL, 1999, IEEE T NEURAL NETWOR, V10, P684, DOI 10.1109/72.761727 Wang H, 2005, EXP HEAT TRANSFER, V18, P1, DOI 10.1080/08916150490502253 Wu MY, 2003, IEEE T SPEECH AUDI P, V11, P229, DOI 10.1109/TSA.2003.811539 NR 49 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2007 VL 49 IS 6 BP 464 EP 476 DI 10.1016/j.specom.2007.04.007 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 194CU UT WOS:000248322400003 ER PT J AU Ding, LJ Lin, Z Radwan, A El-Hennawey, MS Goubran, RA AF Ding, Lijing Lin, Zhong Radwan, Ayman El-Hennawey, Mohamed Samy Goubran, Rafik A. TI Non-intrusive single-ended speech quality assessment in VoIP SO SPEECH COMMUNICATION LA English DT Article DE speech quality; quality prediction model; voice over internet protocol; non-intrusive model; objective method ID ENHANCEMENT; MODELS AB Evaluating speech quality in voice over Internet protocol (VoIP) in a non-intrusive manner is challenging, because it relies on a degraded speech signal only. In this paper, a parametric, non-intrusive VoIP speech quality assessment algorithm is proposed, which adopts a three-step strategy: impairment detection, individual effect modeling and an overall model. Mainly based on voice payload analysis, the algorithm also combines an Internet protocol analysis approach with the ITU-T E-model. It quantifies the individual contributions to speech quality from several major VoIP impairments, including packet loss, temporal clipping and noise. Also, an overall assessment model is developed. The performance is evaluated through intensive simulations, and the results show that the algorithm is effective and accurate. For the overall model, the correlation between prediction and measurement is 0.90; the root mean square error (RMSE) is 0.27 mean opinion score (MOS). The algorithm is intended to be implemented at the receive-end media gateway or IP terminal, for identifying the root causes of speech quality degradation as well as for quality assessment in VoIP. (C) 2007 Elsevier B.V. All rights reserved. C1 Carleton Univ, Ottawa, ON K1S 5B6, Canada. Queens Univ, Kingston, ON K7L 3N6, Canada. Nortel, Enterprise Multimedia Syst, Belleville, ON K8N 5B7, Canada. RP Ding, LJ (reprint author), Carleton Univ, Ottawa, ON K1S 5B6, Canada.
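Since the algorithm above combines payload analysis with the ITU-T E-model, a minimal E-model-style mapping from packet loss to MOS may help fix ideas. The constants are the public G.107/G.113 defaults for G.711 with packet-loss concealment and stand in for the paper's own fitted impairment models.

```python
def emodel_mos(packet_loss_pct, ie=0.0, bpl=25.1, r0=93.2):
    """Map a random packet-loss rate (in percent) to a MOS estimate.

    Effective equipment impairment per ITU-T G.107:
        Ie_eff = Ie + (95 - Ie) * Ppl / (Ppl + Bpl)
    and the standard R-to-MOS conversion for 0 <= R <= 100:
        MOS = 1 + 0.035 R + R (R - 60)(100 - R) * 7e-6
    """
    ie_eff = ie + (95.0 - ie) * packet_loss_pct / (packet_loss_pct + bpl)
    r = max(0.0, min(100.0, r0 - ie_eff))
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6

for loss in (0.0, 1.0, 5.0, 10.0):
    print(f"{loss:4.1f}% loss -> MOS {emodel_mos(loss):.2f}")
```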
EM lding@sce.carleton.ca; linzhong@sce.carleton.ca; ayman.radwan@ece.queensu.ca; hennawey@nortel.com; goubran@sce.carleton.ca CR Beritelli F, 2002, SPEECH COMMUN, V38, P365, DOI 10.1016/S0167-6393(01)00077-2 Billingsley P., 1961, STAT INFERENCE MARKO Bolot J., 1993, COMPUT COMMUN REV, V23, P289, DOI 10.1145/167954.166265 Borella MS, 1998, PROCEEDINGS OF THE 1998 ICPP WORKSHOPS ON ARCHITECTURAL AND OS SUPPORT FOR MULTIMEDIA APPLICATIONS - FLEXIBLE COMMUNICATION SYSTEMS - WIRELESS NETWORKS AND MOBILE COMPUTING, P3, DOI 10.1109/ICPPW.1998.721868 BROOM S, 2003, HIGH LEVEL DESCRIPTI Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717 De Martin J.C, 2001, P IEEE INT C AC SPEE, P753 Ding L., 2005, P IEEE INSTR MEAS TE, V2, P1135 DING L, 2003, P IEEE GLOB C DEC, V7, P3974 Ding LJ, 2006, IEEE T INSTRUM MEAS, V55, P2062, DOI 10.1109/TIM.2006.884138 Duysburgh B., 2001, Proceedings Tenth International Conference on Computer Communications and Networks (Cat. No.01EX495), DOI 10.1109/ICCCN.2001.956284 ELHENNAWEY MS, 2004, GSPX INT EMBEDDED SO ELHENNAWEY MS, 2006, Patent No. 2006035269 Falk TH, 2006, IEEE SIGNAL PROC LET, V13, P108, DOI 10.1109/LSP.2005.861598 Garofolo J., 1990, DARPA TIMIT ACOUSTIC Gierlich HW, 2006, SIGNAL PROCESS, V86, P1327, DOI 10.1016/j.sigpro.2005.06.024 GILBERT EN, 1960, AT&T TECH J, V39, P1253 Gray P, 2000, IEE P-VIS IMAGE SIGN, V147, P493, DOI 10.1049/ip-vis:20000539 Ilk HG, 2006, SIGNAL PROCESS, V86, P127, DOI 10.1016/j.sigpro.2005.05.006 James A.B., 2004, P IEEE INT C AC SPEE, V1, P853 Kim DS, 2005, IEEE T SPEECH AUDI P, V13, P821, DOI 10.1109/TSA.2005.851924 LIN Z, 2003, P 2 IEEE INT WORKSH, P61 *MALD EL LTD, 2003, DIG SPEECH LEV AN US Moller S, 2002, SPEECH COMMUN, V38, P47, DOI 10.1016/S0167-6393(01)00043-7 Nemer E, 2002, SPEECH COMMUN, V36, P219, DOI 10.1016/S0167-6393(00)00081-9 Paxson V, 1999, IEEE ACM T NETWORK, V7, P277, DOI 10.1109/90.779192 RABINER LR, 1977, IEEE T ACOUST SPEECH, V25, P338, DOI 10.1109/TASSP.1977.1162964 RADWAN A, 2003, THESIS CARLETON U OT Rix A., 2000, P IEEE INT C AC SPEE, V3, P1515 Westerlund N, 2005, SIGNAL PROCESS, V85, P1089, DOI 10.1016/j.sigpro.2005.01.004 Yajnik M., 1999, P IEEE INFOCOM 99 NE, V1, P345, DOI DOI 10.1109/INFC.1999.749301 2003, 3551 IETF RFC 2003, 3611 IETF RFC 2003, 3550 IETF RFC 1996, 250 ETSI ETR NR 35 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2007 VL 49 IS 6 BP 477 EP 489 DI 10.1016/j.specom.2007.04.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 194CU UT WOS:000248322400004 ER PT J AU Ehara, H Morii, T Yoshida, K AF Ehara, Hiroyuki Morii, Toshiyuki Yoshida, Koji TI Predictive vector quantization of wideband LSF using narrowband LSF for bandwidth scalable coders SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT INTERSPEECH 2005 CY SEP, 2005 CL Lisbon, PORTUGAL DE predictive vector quantization; LSF; LSP; bandwidth scalability; codebook mapping ID SPEECH CODER; DESIGN AB For implementing a bandwidth-scalable coder, a wideband line spectral frequency (LSF) quantizer was developed. It works in combination with a narrowband LSF quantizer. A new predictive vector quantization was introduced to the wideband LSF quantizer. The predictive vector quantizer is based on the use of several predictive contributions, which include first-order auto regressive (AR) prediction and vector quantization (VQ) codebook mapping. 
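A sketch of how those two predictive contributions can be combined: the previous frame's quantized wideband LSF vector feeds a first-order AR predictor, while the current quantized narrowband LSF vector is mapped through paired codebooks to a wideband estimate. The blend weight and codebooks below are illustrative, not the trained ones from the paper; in the coder the prediction residual then goes to the switched multi-stage VQ.

```python
import numpy as np

def predict_wideband_lsf(prev_wb_q, nb_lsf_q, cb_nb, cb_wb, ar_coeff=0.5):
    """Predict the current wideband LSF vector from (i) the previous
    frame's quantized wideband LSFs via first-order AR prediction and
    (ii) the quantized narrowband LSFs via codebook mapping: the nearest
    narrowband codeword indexes a paired wideband codeword.
    """
    k = np.argmin(((cb_nb - nb_lsf_q) ** 2).sum(axis=1))
    mapped = cb_wb[k]
    return ar_coeff * prev_wb_q + (1.0 - ar_coeff) * mapped

rng = np.random.default_rng(6)
cb_nb = np.sort(rng.uniform(0, np.pi, (64, 10)), axis=1)   # narrowband side
cb_wb = np.sort(rng.uniform(0, np.pi, (64, 16)), axis=1)   # paired wideband side
pred = predict_wideband_lsf(np.sort(rng.uniform(0, np.pi, 16)),
                            np.sort(rng.uniform(0, np.pi, 10)), cb_nb, cb_wb)
print(pred.shape)   # (16,); the residual would be quantized with 16 bits
```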
One feature of the new predictive vector quantizer is exploitation of the correlation between wideband and narrowband LSFs quantized in the previous frame for estimating wideband LSF in the current frame. A 16-bit switched predictive three-stage vector quantizer was used to encode estimation residues. Results showed that introduction of the predictor brought about a performance improvement of 0.3 dB in spectral distortion. This paper describes procedures of designing the predictor and the three-stage codebook, as well as simulation results. (C) 2007 Elsevier B.V. All rights reserved. C1 Matsushita Elect Ind Co Ltd, Panasonic, Next Generat Mobile Commun Dev Ctr, Yokosuka, Kanagawa 2390847, Japan. RP Ehara, H (reprint author), Matsushita Elect Ind Co Ltd, Panasonic, Next Generat Mobile Commun Dev Ctr, Yokosuka, Kanagawa 2390847, Japan. EM ehara.hiroyuki@jp.panasonic.com CR AGIOMYRGIANNAKI.Y, 2004, P IEEE ICASSP 2004, pI469 EHARA H, 2005, P ISCA INTERSPEECH 2, P1493 EHARA H, 2005, P IEEE ICASSP 2005, pI137 Eriksson T, 1999, IEEE T SPEECH AUDI P, V7, P495, DOI 10.1109/89.784102 GERSHO A, 1992, VECTOR QUANTIZATION, P506 GERSHO A, 1992, VECTOR QUANTIZATION, P423 Hiwasaki Y, 2004, IEICE T INF SYST, VE87D, P1496 *ITUT, 2005, ITUT SOFTW TOOL LIB, P161 KOISHIDA K, 2000, P IEEE WORKSH SPEECH, P90 LeBlanc WP, 1993, IEEE T SPEECH AUDI P, V1, P373, DOI 10.1109/89.242483 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 Nomura T, 1998, INT CONF ACOUST SPEE, P341, DOI 10.1109/ICASSP.1998.674437 Salami R, 1998, IEEE T SPEECH AUDI P, V6, P116, DOI 10.1109/89.661471 So S, 2007, DIGIT SIGNAL PROCESS, V17, P138, DOI 10.1016/j.dsp.2005.08.005 THYSSEN J, 2001, P IEEE INT C AC SPEE, P681 NR 15 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2007 VL 49 IS 6 BP 490 EP 500 DI 10.1016/j.specom.2007.04.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 194CU UT WOS:000248322400005 ER PT J AU Wang, LB Kitaoka, N Nakagawa, S AF Wang, Longbiao Kitaoka, Norihide Nakagawa, Selichi TI Robust distant speaker recognition based on position-dependent CMN by combining speaker-specific GMM with speaker-adapted HMM SO SPEECH COMMUNICATION LA English DT Article DE distant speaker recognition; GMM; HMM; position-dependent CMN; sound source estimation ID SPEECH RECOGNITION; TIME-DELAY; VERIFICATION; LOCALIZATION; NORMALIZATION; SPECTRUM; LOCATION; MODELS AB In this paper, we propose a robust speaker recognition method based on position-dependent Cepstral Mean Normalization (CMN) to compensate for the channel distortion depending on the speaker position. In the training stage, the system measures the transmission characteristics according to the speaker positions from some grid points to the microphone in the room and estimates the compensation parameters a priori. In the recognition stage, the system estimates the speaker position and adopts the estimated compensation parameters corresponding to the estimated position, and then the system applies the CMN to the speech and performs speaker recognition. In our past study, we proposed a new text-independent speaker recognition method by combining speaker-specific Gaussian mixture models (GMMs) with syllable-based HMMs adapted to the speakers by MAP [Nakagawa, S., Zhang, W., Takahashi, M., 2004. 
Text-independent speaker recognition by combining speaker-specific GMM with speaker-adapted syllable-based HMM. Proc. ICASSP-2004 1, 81-84]. The robustness of this speaker recognition method to changes in speaking style in a close-talking environment was evaluated in (Nakagawa et al., 2004). In this paper, we extend this combination method to distant speaker recognition and integrate this method with the proposed position-dependent CMN. Our experiments showed that the proposed method improved the speaker recognition performance remarkably in a distant environment. (C) 2007 Elsevier B.V. All rights reserved. C1 Toyohashi Univ Technol, Dept Informat & Comp Sci, Toyohashi, Aichi 4418580, Japan. RP Wang, LB (reprint author), Toyohashi Univ Technol, Dept Informat & Comp Sci, 1-1,Hibarigaoka,Tempaku Cho, Toyohashi, Aichi 4418580, Japan. EM wang@slp.ics.tut.ac.jp RI Wang, Longbiao/J-1544-2014 CR Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 Bard Y., 1974, NONLINEAR PARAMETER Barras C., 2003, P IEEE INT C AC SPEE, V2, P49 Brandstein M., 1995, THESIS BROWN U PROVI DIBIASE JH, 2001, MICROPHONE ARRAYS SI, P157 Doclo S, 2003, EURASIP J APPL SIG P, V2003, P1110, DOI 10.1155/S111086570330602X FOY WH, 1976, IEEE T AERO ELEC SYS, V12, P187, DOI 10.1109/TAES.1976.308294 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 FURUI S, 1972, ELECTRON COMMUN JPN, V55, P54 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Huang YT, 2001, IEEE T SPEECH AUDI P, V9, P943 Hughes TB, 1999, IEEE T SPEECH AUDI P, V7, P346, DOI 10.1109/89.759045 Juang B. H., 2001, P INT WORKSH HANDS F, P5 Kitaoka N., 2001, P INT WORKSH HANDSFR, P159 KNAPP CH, 1976, IEEE T ACOUST SPEECH, V24, P320, DOI 10.1109/TASSP.1976.1162830 LIU CS, 1995, P ICASSP 95, V1, P345 Liu F., 1993, P ARPA SPEECH NAT LA, P69, DOI 10.3115/1075671.1075688 MARKEL JD, 1977, IEEE T ACOUST SPEECH, V25, P330, DOI 10.1109/TASSP.1977.1162961 MATUSI T, 1995, P ICASSP 93, V2, P391 Nakagawa S., 2004, P IEEE INT C AC SPEE, P81 Nakagawa S., 1999, P INT WORKSH AUT SPE, P393 Nakagawa S, 2006, IEICE T INF SYST, VE89D, P1058, DOI 10.1093/ietisy/e89-d.3.1058 Nilsson N., 1966, LEARNING MACHINES Omologo M, 1997, IEEE T SPEECH AUDI P, V5, P288, DOI 10.1109/89.568735 OMOLOGO M, 1996, P ICASSP96, P921 Pelecanos J., 2001, P SPEAK OD SPEAK REC, P213 PUJOL P, 2006, P ICASSP 2006, P773 Raykar VC, 2005, IEEE T SPEECH AUDI P, V13, P751, DOI 10.1109/TSA.2005.851907 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 Savic M., 1990, P IEEE INT C AC SPEE, P281 Seltzer ML, 2004, IEEE T SPEECH AUDI P, V12, P489, DOI 10.1109/TSA.2004.832988 Tseng B., 1992, P ICASSP 92, VII, P161 Tsurumi Y., 1994, P ICSLP, P431 Viikki O, 1998, SPEECH COMMUN, V25, P133, DOI 10.1016/S0167-6393(98)00033-8 WANG L, 2005, P EUROSPEECH 2005, P1977 WANG L, 2004, P ICSLP 2004, P2049 Wang L., 2005, P EUROSPEECH 2005, P2661 Wang LB, 2006, EURASIP J APPL SIG P, DOI 10.1155/ASP/2006/95491 XIANG B, 2002, P ICASSP, V1, P681 Young S, 2000, HTK BOOK NR 41 TC 18 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
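The core of position-dependent CMN is easy to state: use the sound-source localizer's position estimate to select the a priori measured compensation mean of the nearest grid point. A sketch with made-up grid data follows; nearest-neighbour selection is one simple choice.

```python
import numpy as np

def position_dependent_cmn(cepstra, est_position, grid_positions, grid_means):
    """Subtract the channel cepstral mean measured in advance at the grid
    point closest to the estimated speaker position.

    cepstra: (num_frames, num_ceps); est_position: (2,) or (3,) estimate
    from the localizer; grid_means[i] was measured at grid_positions[i].
    """
    d = ((grid_positions - est_position) ** 2).sum(axis=1)
    return cepstra - grid_means[d.argmin()]

rng = np.random.default_rng(7)
grid_pos = np.array([[1.0, 1.0], [1.0, 3.0], [3.0, 1.0], [3.0, 3.0]])
grid_means = rng.normal(size=(4, 13))                  # measured per position
frames = rng.normal(size=(200, 13)) + grid_means[2]    # speaker near point 2
out = position_dependent_cmn(frames, np.array([2.8, 1.2]), grid_pos, grid_means)
print(np.abs(out.mean(axis=0)).max())   # residual mean close to zero
```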
PD JUN PY 2007 VL 49 IS 6 BP 501 EP 513 DI 10.1016/j.specom.2007.04.004 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 194CU UT WOS:000248322400006 ER PT J AU Zolnay, A Kocharov, D Schluter, R Ney, H AF Zolnay, Andras Kocharov, Daniil Schlueter, Ralf Ney, Hermann TI Using multiple acoustic feature sets for speech recognition SO SPEECH COMMUNICATION LA English DT Article DE acoustic feature extraction; auditory features; articulatory features; voicing; spectrum derivative feature; linear discriminant analysis; discriminative model combination ID NORMALIZATION AB In this paper, the use of multiple acoustic feature sets for speech recognition is investigated. The combination of both auditory and articulatory motivated features is considered. In addition to a voicing feature, we introduce a recently developed articulatory motivated feature, the spectrum derivative feature. Features are combined both directly, using linear discriminant analysis (LDA), and indirectly on the model level, using discriminative model combination (DMC). Experimental results are presented for both small- and large-vocabulary tasks. The results show that the accuracy of automatic speech recognition systems can be significantly improved by the combination of auditory and articulatory motivated features. The word error rate is reduced from 1.8% to 1.5% on the SieTill task for German digit string recognition. Consistent improvements in word error rate have been obtained on two large-vocabulary corpora. The word error rate is reduced from 19.1% to 18.4% on the VerbMobil II corpus, a German large-vocabulary conversational speech task, and from 14.1% to 13.5% on the British English part of the European parliament plenary sessions (EPPS) task from the 2005 TC-STAR ASR evaluation campaign. (C) 2007 Elsevier B.V. All rights reserved. C1 Univ Aachen, Rhein Westfal TH Aachen, Lehrstuhl Informat 6, Dept Comp Sci, D-52056 Aachen, Germany. St Petersburg State Univ, Dept Phonet, St Petersburg 199034, Russia. RP Zolnay, A (reprint author), Univ Aachen, Rhein Westfal TH Aachen, Lehrstuhl Informat 6, Dept Comp Sci, D-52056 Aachen, Germany.
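The direct, feature-level combination named in this abstract concatenates the streams and applies linear discriminant analysis. A compact toy re-implementation of that step is given below; the actual system applies LDA to context-stacked features inside a full LVCSR front-end.

```python
import numpy as np

def lda_combine(features_a, features_b, labels, out_dim):
    """Concatenate two feature streams and project with LDA: solve the
    generalized eigenproblem Sb v = lambda Sw v and keep the leading
    discriminant directions."""
    x = np.hstack([features_a, features_b])
    mu = x.mean(axis=0)
    sw = np.zeros((x.shape[1], x.shape[1]))
    sb = np.zeros_like(sw)
    for c in np.unique(labels):
        xc = x[labels == c]
        mc = xc.mean(axis=0)
        sw += (xc - mc).T @ (xc - mc)                   # within-class scatter
        sb += len(xc) * np.outer(mc - mu, mc - mu)      # between-class scatter
    evals, evecs = np.linalg.eig(np.linalg.solve(sw + 1e-6 * np.eye(len(sw)), sb))
    order = np.argsort(evals.real)[::-1][:out_dim]
    return x @ evecs[:, order].real

# Toy usage: MFCC-like vectors plus a one-dimensional voicing feature.
rng = np.random.default_rng(8)
mfcc = rng.normal(size=(300, 12)); voicing = rng.normal(size=(300, 1))
labels = rng.integers(0, 3, size=300)
print(lda_combine(mfcc, voicing, labels, out_dim=2).shape)   # (300, 2)
```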
EM zolnay@informatik.rwth-aachen.de; kocharov@phonetics.pu.ru; schlueter@informatik.rwth-aachen.de; ney@informatik.rwth-aachen.de RI Kocharov, Daniil/J-4909-2013 OI Kocharov, Daniil/0000-0002-7858-5331 CR ATAL BS, 1976, IEEE T ACOUST SPEECH, V24, P201, DOI 10.1109/TASSP.1976.1162800 Beyerlein P., 1997, P IEEE AUT SPEECH RE, P238 BEYERLEIN P, 2000, THESIS RWTH AACHEN U BEYERLEIN P, 1998, P IEEE INT C AC SPEE, V1, P481, DOI 10.1109/ICASSP.1998.674472 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 GRACIARENA M, 2004, P ICASSP MONTR, V1, P921 GRAY AH, 1974, IEEE T ACOUST SPEECH, VAS22, P207, DOI 10.1109/TASSP.1974.1162572 GU L, 2001, P IEEE ICASSP, P125 HABUMBACH R, 1999, P EUR C SPEECH COMM, V3, P1323 Hab-Umbach R., 1992, P IEEE INT C AC SPEE, V1, P13 HEDGE RM, 2005, P IEEE INT C AC SPEE, V1, P541 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HOLMES JN, 1997, P EUR C SPEECH COMM, V4, P2083 ISHIZUKA K, 2004, P ICASSP, V1, P141 KOCHAROV D, 2005, P EUR C SPEECH COMM, V2, P1101 Lee L., 1996, P ICASSP, V1, P353 Nadeu C, 2001, SPEECH COMMUN, V34, P93, DOI 10.1016/S0167-6393(00)00048-0 Paliwal KK, 1999, P EUR C SPEECH COMM, P85 Paliwal KK, 2005, SPEECH COMMUN, V45, P153, DOI 10.1016/j.specom.2004.08.001 RABINER LR, 1979, PRENTICE HALL SIGNAL Schluter R., 2001, P IEEE INT C AC SPEE, P133 THOMSON DL, 1998, P IEEE INT C AC SPEE, V1, P21, DOI 10.1109/ICASSP.1998.674357 TOLBA H, 2002, P IEEE INT C AC SPEE, V1, P837 WAKITA H, 1977, IEEE T ACOUST SPEECH, V25, P183, DOI 10.1109/TASSP.1977.1162929 Welling L, 2002, IEEE T SPEECH AUDI P, V10, P415, DOI 10.1109/TSA.2002.803435 WELLING L, 1996, P 1996 IEEE INT C AC, V2, P797 WOODLAND P, 1997, P IEEE INT C AC SPEE, V2, P719 ZOLNAY A, 2003, P EUR C SPEECH COMM, V1, P497 Zolnay A., 2002, P INT C SPOK LANG PR, P1065 NR 29 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2007 VL 49 IS 6 BP 514 EP 525 DI 10.1016/j.specom.2007.04.005 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 194CU UT WOS:000248322400007 ER PT J AU ten Bosch, L Kirchhoff, K AF ten Bosch, Louis Kirchhoff, Katrin TI Bridging the gap between human and automatic speech recognition SO SPEECH COMMUNICATION LA English DT Editorial Material ID MODEL; PERCEPTION; FEATURES; WORDS C1 Univ Nijmegen, Ctr Language & Speech Technol, NL-6500 HD Nijmegen, Netherlands. Univ Washington, Dept Elect Engn, Seattle, WA 98195 USA. RP ten Bosch, L (reprint author), Univ Nijmegen, Ctr Language & Speech Technol, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM l.tenbosch@let.ru.nl CR Barker J, 2007, SPEECH COMMUN, V49, P402, DOI 10.1016/j.specom.2006.11.003 Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9 Bregman AS., 1990, AUDITORY SCENE ANAL Carey M. 
J., 2005, P INTERSPEECH, P1257 CARPENTER B, 1999, P IEEE WORKSH AUT SP, P225 Coy A, 2007, SPEECH COMMUN, V49, P384, DOI 10.1016/j.specom.2006.11.002 DeWachter M., 2003, P EUROSPEECH, P1133 Dusan S., 2005, P INTERSPEECH, P1233 Fikkert P, 2005, TWENTY-FIRST CENTURY PSYCHOLINGUISTICS: FOUR CORNERSTONES, P43 HAMALAINEN A, 2007, P ICASSP HAN Y, 2007, IN PRESS IEEE T AUDI HERMANSKY H, 2001, P WORKSH SPEECH REC, P61 Hogden J, 2007, SPEECH COMMUN, V49, P361, DOI 10.1016/j.specom.2007.02.008 King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148 Kirchhoff K, 2002, SPEECH COMMUN, V37, P303, DOI 10.1016/S0167-6393(01)00020-6 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Livescu K., 2003, P EUR GEN SWITZ, P2529 Luce PA, 1998, EAR HEARING, V19, P1, DOI 10.1097/00003446-199802000-00001 Marr D., 1982, VISION COMPUTATIONAL MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 McDermott E, 2006, IEICE T INF SYST, VE89D, P1006, DOI 10.1093/ietisy/e89-d.3.1006 Metze F, 2007, SPEECH COMMUN, V49, P348, DOI 10.1016/j.specom.2007.02.009 Moore R. K., 2003, P EUROSPEECH, P2581 Moore RK, 2007, SPEECH COMMUN, V49, P418, DOI 10.1016/j.specom.2007.01.011 MOORE RK, 2001, P WORKSH SPEECH REC, P145 NEAREY RM, 2001, P WORKSH SPEECH REC, P133 NORRIS D, 1994, COGNITION, V52, P189, DOI 10.1016/0010-0277(94)90043-4 Roy DK, 2002, COGNITIVE SCI, V26, P113, DOI 10.1207/s15516709cog2601_4 SCHALDACH M, 2000, PROG BIOMED RES, V5, P336 Scharenborg O, 2005, COGNITIVE SCI, V29, P867, DOI 10.1207/s15516709cog0000_37 Sroka JJ, 2005, SPEECH COMMUN, V45, P401, DOI 10.1016/j.specom.2004.11.009 Weizenbaum J., 1976, COMPUTER POWER HUMAN WRIGHT R, 2006, P WORKSH SPEECH REC, P39 Yu D, 2006, SPEECH COMMUN, V48, P1214, DOI 10.1016/j.specom.2006.05.002 NR 35 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2007 VL 49 IS 5 BP 331 EP 335 DI 10.1016/j.specom.2007.03.001 PG 5 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 179NH UT WOS:000247296100001 ER PT J AU Scharenborg, O AF Scharenborg, Odette TI Reaching over the gap: A review of efforts to link human and automatic speech recognition research SO SPEECH COMMUNICATION LA English DT Article DE automatic speech recognition; human speech recognition ID SPOKEN WORD-RECOGNITION; CONSONANT RECOGNITION; COMPUTATIONAL MODELS; DIRECTED SPEECH; LEXICAL ACCESS; PERCEPTION; REPRESENTATIONS; INTELLIGIBILITY; SHORTLIST; MACHINES AB The fields of human speech recognition (HSR) and automatic speech recognition (ASR) both investigate parts of the speech recognition process and have word recognition as their central issue. Although the research fields appear closely related, their aims and research methods are quite different. Despite these differences there is, however, lately a growing interest in possible cross-fertilisation. Researchers from both ASR and HSR are realising the potential benefit of looking at the research field on the other side of the 'gap'. In this paper, we provide an overview of past and present efforts to link human and automatic speech recognition research and present an overview of the literature describing the performance difference between machines and human listeners. 
The focus of the paper is on the mutual benefits to be derived from establishing closer collaborations and knowledge interchange between ASR and HSR. The paper ends with an argument for more and closer collaborations between researchers of ASR and HSR to further improve research in both fields. (c) 2007 Elsevier B.V. All rights reserved. C1 Univ Sheffield, Speech & Hearing Res Grp, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Scharenborg, O (reprint author), Univ Sheffield, Speech & Hearing Res Grp, Dept Comp Sci, Regent Court,211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. EM O.Scharenborg@dcs.shef.ac.uk RI Scharenborg, Odette/E-2056-2012 CR Allen JB, 2005, J ACOUST SOC AM, V117, P2212, DOI 10.1121/1.1856231 Axelrod S., 2004, P INT C AC SPEECH SI, P173 Barker JP, 2005, SPEECH COMMUN, V45, P5, DOI 10.1016/j.specom.2004.05.002 Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9 Bregman AS., 1990, AUDITORY SCENE ANAL Carey M. J., 2005, P INTERSPEECH, P1257 CARPENTER B, 1999, P IEEE WORKSH AUT SP, P225 Chomsky N., 1968, SOUND PATTERN ENGLIS COLE R, 1990, P INT JOINT C NEUR N, V2, P45 Cooke M., 1994, P 3 INT C SPOK LANG, P1555 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Cooke M, 2001, SPEECH COMMUN, V35, P141, DOI 10.1016/S0167-6393(00)00078-9 Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 Cutler A., 1992, P ICSLP, P189 Davis MH, 2002, J EXP PSYCHOL HUMAN, V28, P218, DOI 10.1037//0096-1523.28.1.218 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 de Boer B, 2003, ACOUST RES LETT ONL, V4, P129, DOI 10.1121/1.1613311 DeWachter M., 2003, P EUROSPEECH, P1133 Dusan S., 2005, P INTERSPEECH, P1233 FURUKAWA S, 2001, P PCOS 01, P55 Gaskell MG, 1997, LANG COGNITIVE PROC, V12, P613 Goldinger SD, 1998, PSYCHOL REV, V105, P251, DOI 10.1037/0033-295X.105.2.251 Harley T, 2001, PSYCHOL LANGUAGE DAT HAWKINS J, 2004, INTELLIGENCE TIME BO Hawkins S, 2003, J PHONETICS, V31, P373, DOI 10.1016/j.wocn.2003.09.006 HERMANSKY H, 2001, P WORKSH SPEECH REC, P61 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HINTZMAN DL, 1984, BEHAV RES METH INS C, V16, P96 HOLMES J, 2002, SPEECH RECOGNITION S HUCKVALE M, 1998, P I AC C SPEECH HEAR, P9 Kemps RJJK, 2005, MEM COGNITION, V33, P430, DOI 10.3758/BF03193061 King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148 Kirchhoff K, 2005, J ACOUST SOC AM, V117, P2238, DOI 10.1121/1.1869172 Kirchhoff K., 1999, THESIS U BIELEFIELD KLATT DH, 1979, J PHONETICS, V7, P279 Krause JC, 2002, J ACOUST SOC AM, V112, P2165, DOI 10.1121/1.1509432 *LDC, 1995, LDC94S7 U PENNS LEONARD RG, 1984, P ICASSP, V3, P42 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Livescu K., 2003, P EUR GEN SWITZ, P2529 Luce PA, 2000, PERCEPT PSYCHOPHYS, V62, P615, DOI 10.3758/BF03212113 Maier V., 2005, P INTERSPEECH, P1245 Marslen-Wilson W. D., 1989, LEXICAL REPRESENTATI, P169 MARSLENWILSON WD, 1987, COGNITION, V25, P71, DOI 10.1016/0010-0277(87)90005-9 MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 MCQUEEN JM, 2004, HDB COGNITION, P255 Meyer B., 2006, P WORKSH SPEECH REC MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Moore R. 
K., 2003, P EUROSPEECH, P2581 MOORE RK, 2001, P AC WORKSH INN SPEE, V23, P19 MOORE RK, 1995, P 13 INT C PHON SCI MOORE RK, 2001, P WORKSH SPEECH REC, P145 MOORE RK, 2005, P ISCA WORKSH PLAST, P109 NEAREY TM, 2001, P SPRAAC WORKSH NIJM, P133 Norris D, 2005, TWENTY-FIRST CENTURY PSYCHOLINGUISTICS: FOUR CORNERSTONES, P331 NORRIS D, 1994, COGNITION, V52, P189, DOI 10.1016/0010-0277(94)90043-4 Ostendorf M., 1999, P IEEE AUT SPEECH RE, P79 Alsteris LD, 2006, SPEECH COMMUN, V48, P727, DOI 10.1016/j.specom.2005.10.005 PAUL DB, 1992, P WORKSH SPEECH NAT, P357, DOI 10.3115/1075527.1075614 POLS L, 1999, P 14 INT C PHON SCI, P9 Qin MK, 2003, J ACOUST SOC AM, V114, P446, DOI 10.1121/1.1579009 Rabiner L, 1993, FUNDAMENTALS SPEECH Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 Roy DK, 2002, COGNITIVE SCI, V26, P113, DOI 10.1207/s15516709cog2601_4 Salverda AP, 2003, COGNITION, V90, P51, DOI 10.1016/S0010-0277(03)00139-2 SCHARENBORG O, 2005, THESIS RADBOUD U NIJ Scharenborg O, 2003, J ACOUST SOC AM, V114, P3032, DOI 10.1121/1.1624065 Scharenborg O, 2005, COGNITIVE SCI, V29, P867, DOI 10.1207/s15516709cog0000_37 SCHARENBORG O, 2006, P WORKSH SPEECH REC, P77 Scharenborg O, 2007, COMPUT SPEECH LANG, V21, P54, DOI 10.1016/j.csl.2005.12.001 SCHARENBORG O, 2005, P INT LISB PORT, P1237 Seltzer ML, 2004, SPEECH COMMUN, V43, P379, DOI 10.1016/j.specom.2004.03.006 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 Srinivasan S., 2005, P INTERSPEECH, P1265 Sroka JJ, 2005, SPEECH COMMUN, V45, P401, DOI 10.1016/j.specom.2004.11.009 STRIK H, 2003, P 15 ICPHS, P227 STRIK H, 2006, P SPEECH REC INTR VA, P33 Tanaka H, 2000, RES LANG SOC INTERAC, V33, P1, DOI 10.1207/S15327973RLSI3301_1 TENBOSCH L, 2001, P WORKSH SPEECH REC, P49 Tuller B, 2003, J PHONETICS, V31, P503, DOI 10.1016/S0095-4470(03)00018-4 Van Leeuwen D.A., 1995, P EUROSPEECH, P1461 Voskuhl A, 2004, SOC STUD SCI, V34, P393, DOI 10.1177/0306312704 WADE T, 2002, P ICSLP, P1653 Wang D., 2006, COMPUTATIONAL AUDITO Wesker T., 2005, P INT, P1273 Wester M., 2001, P EUR AALB DENM, P1729 Wester Mirjam, 2003, P EUR 2003 GEN SWITZ, P233 WRIGHT R, 2006, P WORKSH SPEECH REC, P39 Yu C, 2005, COGNITIVE SCI, V29, P961, DOI 10.1207/s15516709cog0000_40 NR 89 TC 24 Z9 25 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2007 VL 49 IS 5 BP 336 EP 347 DI 10.1016/j.specom.2007.01.009 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 179NH UT WOS:000247296100002 ER PT J AU Metze, F AF Metze, Florian TI Discriminative speaker adaptation using articulatory features SO SPEECH COMMUNICATION LA English DT Article DE LVCSR; acoustic modeling; multi-stream systems; articulatory features; discriminative training ID RECOGNITION AB This paper presents an automatic speech recognition system using acoustic models based on both sub-phonetic units and broad, phonological features such as VOICED and ROUND as output densities in a hidden Markov model framework. The aim of this work is to improve speech recognition performance, particularly on conversational speech, by using units other than phones as a basis for discrimination between words. We explore the idea that phones are more of a short-hand notation for a bundle of phonological features, which can also be used directly to distinguish competing word hypotheses.
Acoustic models for different features are integrated with phone models using a multi-stream approach and log-linear interpolation. This paper presents a new lattice based discriminative training algorithm using the maximum mutual information criterion to train stream weights. This algorithm allows us to automatically learn stream weights from training or adaptation data and can also be applied to other tasks. Decoding experiments conducted in comparison to a non-feature baseline system on the large vocabulary English Spontaneous Scheduling Task show reductions in word error rate of about 20% for discriminative model adaptation based on articulatory features, slightly outperforming other adaptation algorithms. (c) 2007 Elsevier B.V. All rights reserved. C1 Univ Karlsruhe, Inter ACT Ctr, Karlsruhe, Germany. RP Metze, F (reprint author), Tech Univ Berlin, Deut Telekom Labs, Berlin, Germany. EM florian.metze@telekom.de RI Metze, Florian/N-4661-2014 OI Metze, Florian/0000-0002-6663-8600 CR BEYERLEIN P, 1998, P ICASSP IEEE BOURLARD H, 1996, 9607 IDIAPRR Brown P., 1987, THESIS CARNEGIE MELL Chomsky N., 1968, SOUND PATTERN ENGLIS DENG L, 2005, P ICASSP EIDE E, 2001, P EUROSPEECH 2001 ESKENAZI M, 1993, P EUROSPEECH ISCA ESPYWILSON CY, 1994, J ACOUST SOC AM, V96, P65, DOI 10.1121/1.410375 Finke M., 1997, P ICASSP FRANKEL J, 2001, P EUROSPEECH 2001 FRANKEL J, 2004, P INTERSPEECH ICSLP GALES MJF, 1997, 291 CUED FINFENG TR Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 GRAVIER G, 2002, P ICASSP GUNAWARDANA A, 2001, P EUROSPEECH 2001 SC HALLE M, 1992, INT ENCY LINGUISTICS, V3 HASEGAWAJOHNSON M, 2005, P ICASSP IEEE *IPSK, 2000, BAY ARCH SPRACHS JAKOBSON R, 1952, 13 MIT AC LAB CAMBR KEMP T, 1997, P EUROSPEECH King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148 Kirchhoff K, 2002, SPEECH COMMUN, V37, P303, DOI 10.1016/S0167-6393(01)00020-6 LEGGETTER CJ, 1994, SPEAKER ADAPTATION H LI J, 2005, P ICASP IEEE LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 LIVESCU K, 2003, P EUROSPEECH ISCA MACHEREY W, 1998, THESIS LEHRSTUHL INF METZE F, 2003, P ASRU 2003 METZE F, 2005, THESIS FAKULTAT INFO METZE F, 2002, P ICSLP ISCA MIYAJIMA C, 2000, P ICSLP ISCA OSTENDORF M, 1999, P ASRU IEEE POTAMIANOS G, 1998, P ICASSP IEE POVE D, 2003, P EUROSPEECH ISCA POVE D, 2005, THESIS CU ENG DEP SARACLARE M, 2000, P 2000 SPEECH TRANSC SCHLUTER R, 2000, THESIS FAKULTAT MATH SCHMIDBAUER O, 1989, P INT C AC SPEECH SI, V1, P616 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 SOLTAU H, 2005, THESIS U KARLSRUHE T SOLTAU H, 2002, P ICASSP IEEE SOLTAU H, 2002, P ICSLP ISCA STEMMER G, 2003, P ICASSP, V1, P736 STEVENS KN, 2002, JASA, V111 STUKER S, 2003, P EUROSPEECH ISCA TAM YC, 2000, P ICSLP ISCA TAMURA S, 2004, P ICASSP IEEE WAIBEL A, 2000, VERBMOBIL FDN SPEECH WEINTRAUB M, 1996, P ICSLP ISCA WRENCH A, 2000, P ICSLP ISCA ZHAN P, 1997, P ICASSP IEEE NR 51 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
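The multi-stream combination described above is a log-linear interpolation of per-stream acoustic scores. The toy numbers below show how an articulatory-feature stream can overturn the decision the phone stream would make on its own; learning the stream weights from lattices with an MMI-style criterion is the paper's contribution and is not reproduced here.

```python
def log_linear_score(stream_logliks, weights):
    """Log-linear interpolation of per-stream acoustic log-likelihoods:
    log p(x|s) is proportional to sum_i w_i * log p_i(x|s), combining a
    phone-model stream with articulatory-feature streams."""
    return sum(w * ll for w, ll in zip(weights, stream_logliks))

# Scores for two competing HMM states from three streams: a phone model,
# a VOICED detector and a ROUND detector (made-up numbers).
weights = [1.0, 0.3, 0.1]
state_a = log_linear_score([-4.1, -0.2, -1.0], weights)
state_b = log_linear_score([-3.9, -2.5, -1.1], weights)
# The phone stream alone prefers state b; the VOICED stream flips it.
print("winner:", "a" if state_a > state_b else "b")
```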
C1 Univ Karlsruhe, Inter ACT Ctr, Karlsruhe, Germany. RP Metze, F (reprint author), Tech Univ Berlin, Deut Telekom Labs, Berlin, Germany. EM florian.metze@telekom.de RI Metze, Florian/N-4661-2014 OI Metze, Florian/0000-0002-6663-8600 CR BEYERLEIN P, 1998, P ICASSP IEEE BOURLARD H, 1996, 9607 IDIAPRR Brown P., 1987, THESIS CARNEGIE MELL Chomsky N., 1968, SOUND PATTERN ENGLIS DENG L, 2005, P ICASSP EIDE E, 2001, P EUROSPEECH 2001 ESKENAZI M, 1993, P EUROSPEECH ISCA ESPYWILSON CY, 1994, J ACOUST SOC AM, V96, P65, DOI 10.1121/1.410375 Finke M., 1997, P ICASSP FRANKEL J, 2001, P EUROSPEECH 2001 FRANKEL J, 2004, P INTERSPEECH ICSLP GALES MJF, 1997, 291 CUED FINFENG TR Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 GRAVIER G, 2002, P ICASSP GUNAWARDANA A, 2001, P EUROSPEECH 2001 SC HALLE M, 1992, INT ENCY LINGUISTICS, V3 HASEGAWAJOHNSON M, 2005, P ICASSP IEEE *IPSK, 2000, BAY ARCH SPRACHS JAKOBSON R, 1952, 13 MIT AC LAB CAMBR KEMP T, 1997, P EUROSPEECH King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148 Kirchhoff K, 2002, SPEECH COMMUN, V37, P303, DOI 10.1016/S0167-6393(01)00020-6 LEGGETTER CJ, 1994, SPEAKER ADAPTATION H LI J, 2005, P ICASP IEEE LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 LIVESCU K, 2003, P EUROSPEECH ISCA MACHEREY W, 1998, THESIS LEHRSTUHL INF METZE F, 2003, P ASRU 2003 METZE F, 2005, THESIS FAKULTAT INFO METZE F, 2002, P ICSLP ISCA MIYAJIMA C, 2000, P ICSLP ISCA OSTENDORF M, 1999, P ASRU IEEE POTAMIANOS G, 1998, P ICASSP IEE POVE D, 2003, P EUROSPEECH ISCA POVE D, 2005, THESIS CU ENG DEP SARACLARE M, 2000, P 2000 SPEECH TRANSC SCHLUTER R, 2000, THESIS FAKULTAT MATH SCHMIDBAUER O, 1989, P INT C AC SPEECH SI, V1, P616 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 SOLTAU H, 2005, THESIS U KARLSRUHE T SOLTAU H, 2002, P ICASSP IEEE SOLTAU H, 2002, P ICSLP ISCA STEMMER G, 2003, P ICASSP, V1, P736 STEVENS KN, 2002, JASA, V111 STUKER S, 2003, P EUROSPEECH ISCA TAM YC, 2000, P ICSLP ISCA TAMURA S, 2004, P ICASSP IEEE WAIBEL A, 2000, VERBMOBIL FDN SPEECH WEINTRAUB M, 1996, P ICSLP ISCA WRENCH A, 2000, P ICSLP ISCA ZHAN P, 1997, P ICASSP IEEE NR 51 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2007 VL 49 IS 5 BP 348 EP 360 DI 10.1016/j.specom.2007.02.009 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 179NH UT WOS:000247296100003 ER PT J AU Hogden, J Rubin, P McDermott, E Katagiri, S Goldstein, L AF Hogden, John Rubin, Philip McDermott, Erik Katagiri, Shigeru Goldstein, Louis TI Inverting mappings from smooth paths through R-n to paths through R-m: A technique applied to recovering articulation from acoustics SO SPEECH COMMUNICATION LA English DT Review DE speech inverse problem; dimensionality reduction; channel normalization ID NONLINEAR DIMENSIONALITY REDUCTION; AUTOMATIC SPEECH RECOGNITION; VOCAL-TRACT; PRODUCTION MODELS; INVERSE PROBLEM; PERCEPTION; MOVEMENTS; GEOMETRY; TRANSFORMATION; ALGORITHMS AB Motor theories, which postulate that speech perception is related to linguistically significant movements of the vocal tract, have guided speech perception research for nearly four decades but have had little impact on automatic speech recognition. In this paper, we describe a signal processing technique named MIMICRI that may help link motor theory with automatic speech recognition by providing a practical approach to recovering articulator positions from acoustics. MIMICRI's name reflects three important operations it can perform on time-series data: it can reduce the dimensionality of a data set (manifold inference); it can blindly invert nonlinear functions applied to the data (mapping inversion); and it can use temporal context to estimate intermediate data (contextual recovery of information). In order for MIMICRI to work, the signals to be analyzed must be functions of unobservable signals that lie on a linear subspace of the set of all unobservable signals. For example, MIMICRI will typically work if the unobservable signals are band-pass and we know the pass-band, as is the case for articulator motions. We discuss the abilities of MIMICRI as they relate to speech processing applications, particularly as they relate to inverting the mapping from speech articulator positions to acoustics. We then present a mathematical proof that explains why MIMICRI can invert nonlinear functions, which it can do even in some cases in which the mapping from the unobservable variables to the observable variables is many-to-one. Finally, we show that MIMICRI is able to infer accurately the positions of the speech articulators from speech acoustics for vowels. Five parameters estimated by MIMICRI were more linearly related to articulator positions than 128 spectral energies. (c) 2007 Elsevier B.V. All rights reserved.
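A cartoon of the assumption that makes the record above work: observations are nonlinear functions of unobservable signals confined to a known low-frequency linear subspace. The basis, the nonlinear map, and the recovery-by-projection step below are all invented for the demo; this is not the MIMICRI algorithm itself.

```python
import numpy as np

rng = np.random.default_rng(1)
T = 400
t = np.arange(T)

# Unobservable "articulator" path: confined to a known band-limited
# linear subspace (here spanned by three low-order DCT basis vectors).
basis = np.cos(np.pi * np.outer(t + 0.5, np.arange(1, 4)) / T)  # (T, 3)
latent = basis @ rng.normal(size=3)

# Observations: smooth nonlinear maps of the latent path to 8 channels.
obs = np.stack([np.tanh(a * latent + b)
                for a, b in rng.normal(size=(8, 2))], axis=1)    # (T, 8)

# Recovery sketch: keep only the band-limited part of each channel,
# then take the dominant direction across channels.
coeff = np.linalg.lstsq(basis, obs - obs.mean(0), rcond=None)[0]
smooth = basis @ coeff
u, s, _ = np.linalg.svd(smooth, full_matrices=False)
recovered = u[:, 0] * s[0]

# Up to sign and scale, the recovered path tracks the latent one.
print(abs(np.corrcoef(recovered, latent)[0, 1]))
```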
C1 Los Alamos Natl Lab, Los Alamos, NM 87545 USA. Haskins Labs Inc, New Haven, CT 06511 USA. NTT Corp, NTT Commun Sci Labs, Kyoto, Japan. Doshisha Univ, Fac Engn, Dept Informat Syst Design, Kyoto 6100394, Japan. RP Hogden, J (reprint author), Los Alamos Natl Lab, MS B265, Los Alamos, NM 87545 USA. EM hogden@lanl.gov; rubin@haskins.yale.edu; mcd@csiab.kecl.ntt.co.jp; skatagir@mail.doshisha.ac.jp; goldstein@haskins.yale.edu CR ABUELMA'ATTI MT, 1990, APPL ACOUST, V31, P233, DOI 10.1016/0003-682X(90)90031-O ACZEL J, 1989, ENCY MATH ITS APPL S, V31 AHALT SC, 1990, NEURAL NETWORKS, V3, P277, DOI 10.1016/0893-6080(90)90071-R ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 BADIN P, 1995, J PHONETICS, V23, P221, DOI 10.1016/S0095-4470(95)80044-1 BALCHANDRAN R, 1998, P ICASSP, V2, P749, DOI 10.1109/ICASSP.1998.675373 Beautemps D, 2001, J ACOUST SOC AM, V109, P2165, DOI 10.1121/1.1361090 Bendat J. S., 1998, NONLINEAR SYSTEMS TE BENEVISTE A, 1984, IEEE T COMMUN, V32, P871 Blackburn C. S., 2001, Computer Speech and Language, V15, DOI 10.1006/csla.2001.0165 BLESSER B, 1972, J SPEECH HEAR RES, V15, P5 BOE LJ, 1992, J PHONETICS, V20, P27 CARREIRAPERINAN MA, 2001, THESIS U SHEFFIELD S COKER CH, 1976, P IEEE, V64, P452, DOI 10.1109/PROC.1976.10154 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Deng L, 1997, SPEECH COMMUN, V22, P93, DOI 10.1016/S0167-6393(97)00018-6 Deng L, 1998, SPEECH COMMUN, V24, P299, DOI 10.1016/S0167-6393(98)00023-5 Dillon W. R., 1984, MULTIVARIATE ANAL ME DUSAN S, 2000, P 5 SEM SPEECH PROD, P237 Edelman A, 1998, SIAM J MATRIX ANAL A, V20, P303, DOI 10.1137/S0895479895290954 Fant G., 1970, ACOUSTIC THEORY SPEE Flanagan J., 1972, SPEECH ANAL SYNTHESI FOWLER CA, 1980, PHONETICA, V37, P306 FRANKEL J, 2001, P EUR AALB DENM, P599, DOI DOI 10.1109/TSA.2005.851910 FRANKEL J, 2001, P WORKSH INN SPEECH Gray R.M., 1984, IEEE ASSP MAG APR, P4 Guenther FH, 1998, PSYCHOL REV, V105, P611 GUPTA SK, 1993, J ACOUST SOC AM, V94, P2517, DOI 10.1121/1.407364 HIRAYAMA M, 1992, NEURAL INFORM PROCES Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636 HOGDEN J, 1993, WORLD C NEUR NETW PO HOGDEN J, 1991, THESIS STANFORD U ST HOGDEN J, 1998, 9 HUB 5 CONV SPEECH HOGDEN J, 1995, NEUR INF PROC SYST 9 HOGDEN J, 2000, P 5 SEM SPEECH PROD HOGDEN J, 1996, B COMMUNICATION PARL, V3, P101 HOGDEN J, 1996, J ACOUST SOC AM, V100 HOGDEN J, 1992, J ACOUST SOC AM, V91, P2443, DOI 10.1121/1.403129 HOGDEN J, 2003, P EUROSPEECH, P1409 HOGEN J, 2000, Patent No. 6052662 Kaburagi T, 2001, J ACOUST SOC AM, V110, P441, DOI 10.1121/1.1373707 KAMBHATLA N, 1997, NEURAL COMPUT, V9, P1 KIMBER D, 1994, THESIS STANFORD U ST KIRCHHOFF K, 1998, TR98037 INT COMP SCI KUX E, 1985, P INT C AC SPEECH SI Levin DN, 2002, J ACOUST SOC AM, V111, P2257, DOI 10.1121/1.1470164 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279 LICKLIDER JCR, 1948, J ACOUST SOC AM, V20, P42, DOI 10.1121/1.1906346 Lindblom B, 1996, J ACOUST SOC AM, V99, P1683, DOI 10.1121/1.414691 Lindblom B., 1979, J PHONETICS, V7, P146 Lyons R.
G., 2004, UNDERSTANDING DIGITA, VSecond MAEDA S, 1979, J ACOUST SOC AM, V65, pS22, DOI 10.1121/1.2017158 Markel JD, 1976, LINEAR PREDICTION SP Maxson CJ, 2001, AM MATH MON, V108, P531, DOI 10.2307/2695707 McDermott E, 2006, IEICE T INF SYST, VE89D, P1006, DOI 10.1093/ietisy/e89-d.3.1006 MCGOWAN R, 1987, SR8990 HASK LAB STAT McGowan RS, 1996, J ACOUST SOC AM, V99, P595, DOI 10.1121/1.415220 McGowan RS, 1996, J ACOUST SOC AM, V99, P1680, DOI 10.1121/1.414690 MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427 MOODY J, 1999, THESIS U CALIFORNIA MORRIS R, 2001, P INT C AC SPEECH SI, P289 MULLER E, 1982, J ACOUST SOC AM S1, V78, pS38 NELSON WL, 1977, ARTICULATORY FEATURE NIX D, 1998, THESIS U COLORADO BO OPPENHEI.AV, 1969, J ACOUST SOC AM, V45, P458, DOI 10.1121/1.1911395 Ouni S, 2005, J ACOUST SOC AM, V118, P444, DOI 10.1121/1.1921448 Paliwal KK, 2005, SPEECH COMMUN, V45, P153, DOI 10.1016/j.specom.2004.08.001 PAPCUN G, 1992, J ACOUST SOC AM, V92, P688, DOI 10.1121/1.403994 PERKELL JS, 1993, J ACOUST SOC AM, V93, P2948, DOI 10.1121/1.405814 PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204 PERRIER P, 1997, SPEECH COMMUN, V22, P82 Qiu W, 1997, DIGIT SIGNAL PROCESS, V7, P199, DOI 10.1006/dspr.1997.0293 Quatieri TF, 2000, IEEE T SPEECH AUDI P, V8, P567, DOI 10.1109/89.861376 REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191 REYNOLDS D, 1996, P IEEE INT C AC SPEE, P113 RICHARDS H, 1999, P ICASSP, V1, P357 RICHARDS H, 1997, P EUROSPEECH 97 Rose RC, 1996, J ACOUST SOC AM, V99, P1699, DOI 10.1121/1.414679 Rosenblum LD, 2005, BLACKW HBK LINGUIST, P51, DOI 10.1002/9780470757024.ch3 ROWEIS S, 1999, THESIS CALIFORNIA I ROWEIS S, 1997, P EUROSPEECH, V3, P1227 Roweis ST, 2000, SCIENCE, V290, P2323, DOI 10.1126/science.290.5500.2323 RUBIN P, 1981, J ACOUST SOC AM, V70, P321, DOI 10.1121/1.386780 Saberi K, 1999, NATURE, V398, P760, DOI 10.1038/19652 Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 SCHROEDE.MR, 1967, J ACOUST SOC AM, V41, P1002, DOI 10.1121/1.1910429 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 SCULLY C, 1979, FRONTIERS SPEECH COM, P35 SHIRAI K, 1986, SPEECH COMMUN, V5, P159, DOI 10.1016/0167-6393(86)90005-1 SONDHI MM, 1983, J ACOUST SOC AM, V73, P985, DOI 10.1121/1.389024 Sorokin VN, 1996, SPEECH COMMUN, V19, P105, DOI 10.1016/0167-6393(96)00028-3 Strang G., 1980, LINEAR ALGEBRA ITS A STRANGE W, 1983, J ACOUST SOC AM, V74, P695, DOI 10.1121/1.389855 Suzuki S., 1998, P INT C SPOK LANG PR, P2251 Tenenbaum JB, 2000, SCIENCE, V290, P2319, DOI 10.1126/science.290.5500.2319 TSIMBINOS J, 1995, THESIS U S AUSTR LEV WAKITA H, 1973, IEEE T ACOUST SPEECH, VAU21, P417, DOI 10.1109/TAU.1973.1162506 WHALEN DH, 1990, BEHAV RES METH INSTR, V22, P550, DOI 10.3758/BF03204440 Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X Zlokarnik I., 1995, J ACOUST SOC AM, V97, P3246, DOI 10.1121/1.411699 NR 101 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAY PY 2007 VL 49 IS 5 BP 361 EP 383 DI 10.1016/j.specom.2007.02.008 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 179NH UT WOS:000247296100004 ER PT J AU Coy, A Barker, J AF Coy, Andre Barker, Jon TI An automatic speech recognition system based on the scene analysis account of auditory perception SO SPEECH COMMUNICATION LA English DT Article DE robust speech recognition; perceptually motivated ASR; speech fragment decoder; source segmentation; multiple pitch tracker ID NOISY SPEECH; ALGORITHM; TRACKING AB Despite many years of concentrated research, the performance gap between automatic speech recognition (ASR) and human speech recognition (HSR) remains large. The difference between ASR and HSR is particularly evident when considering the response to additive noise. Whereas human performance is remarkably robust, ASR systems are brittle and only operate well within the narrow range of noise conditions for which they were designed. This paper considers how humans may achieve noise robustness. We take the view that robustness is achieved because the human perceptual system treats the problems of speech recognition and sound source separation as being tightly coupled. Taking inspiration from Bregman's Auditory Scene Analysis account of auditory organisation, we present a speech recognition system which couples these processes by using a combination of primitive and schema-driven processes: first, a set of coherent spectro-temporal fragments is generated by primitive segmentation techniques; then, a decoder based on statistical ASR techniques performs a simultaneous search for the correct background/foreground segmentation and word sequence hypothesis. Mutually supporting solutions to both the source segmentation and speech recognition problems arise as a result. The decoder is tested on a challenging corpus of connected digit strings mixed monaurally at 0 dB and recognition performance is compared with that achieved by listeners using identical data. The results, although preliminary, are encouraging and suggest that techniques which interface ASA and statistical ASR have great potential. The paper concludes with a discussion of future research directions that may further develop this class of perceptually motivated ASR solutions. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Coy, A (reprint author), Univ Sheffield, Dept Comp Sci, 211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. 
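A toy rendering of the coupled search described in the abstract above: jointly pick a word hypothesis and a foreground/background labelling of spectro-temporal fragments. The fragments, scores, and hypotheses are made up, and a real speech fragment decoder searches lattices efficiently rather than enumerating labellings exhaustively.

```python
import itertools
import numpy as np

def fragment_decode(fragments, speech_score, background_score, hypotheses):
    """Exhaustive joint search over hypotheses and fragment labellings.

    speech_score(frag, hyp) scores a fragment as foreground speech under a
    hypothesis; background_score(frag) scores it as background noise.
    """
    best_score, best_hyp, best_labels = -np.inf, None, None
    for hyp in hypotheses:
        for labels in itertools.product([0, 1], repeat=len(fragments)):
            score = sum(speech_score(f, hyp) if lab else background_score(f)
                        for f, lab in zip(fragments, labels))
            if score > best_score:
                best_score, best_hyp, best_labels = score, hyp, labels
    return best_score, best_hyp, best_labels

# Three fragments with invented log-score evidence for two hypotheses.
frags = ["frag_a", "frag_b", "frag_c"]
ev = {("frag_a", "one two"): -1.0, ("frag_b", "one two"): -1.5,
      ("frag_c", "one two"): -9.0, ("frag_a", "three"): -4.0,
      ("frag_b", "three"): -6.0, ("frag_c", "three"): -2.0}
_, hyp, labels = fragment_decode(
    frags, lambda f, h: ev[(f, h)], lambda f: -3.0, ["one two", "three"])
print(hyp, labels)  # "one two" wins, with frag_c explained as background
```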
EM a.coy@dcs.shef.ac.uk; j.barker@dcs.shef.ac.uk CR Barker J., 2001, P EUR 2001 ESCA, P213 Barker J., 2000, P ICSLP 2000, V1, P373 BARKER J, 2006, P INT 2006 PITTSB, P85 Barker JP, 2005, SPEECH COMMUN, V45, P5, DOI 10.1016/j.specom.2004.05.002 Bregman AS., 1990, AUDITORY SCENE ANAL Brungart DS, 2001, J ACOUST SOC AM, V109, P1101, DOI 10.1121/1.1345696 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Cooke M.P., 1993, MODELING AUDITORY PR COY A, 2005, P INTERSPEECH 05, V1, P2641 COY A, 2006, P INTERSPEECH 2006, P1678 COY A, 2005, P ICASSP 05, V1, P425, DOI 10.1109/ICASSP.2005.1415141 Gonzales RC, 2004, DIGITAL IMAGE PROCES GU YH, 1991, P IEEE ICASSP, P949, DOI 10.1109/ICASSP.1991.150497 Hirsch H., 2000, P ICSLP, V4, P29 HU G, 2004, ISCA TUT RES WORKSH Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812 Liu DJ, 2001, IEEE T SPEECH AUDI P, V9, P609 Meyer F., 1990, Journal of Visual Communication and Image Representation, V1, DOI 10.1016/1047-3203(90)90014-M RABINER LR, 1975, AT&T TECH J, V54, P297 Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 Rouat J, 1997, SPEECH COMMUN, V21, P191, DOI 10.1016/S0167-6393(97)00002-2 WEINTRAUB M, 1985, THESIS Wu MY, 2003, IEEE T SPEECH AUDI P, V11, P229, DOI 10.1109/TSA.2003.811539 NR 23 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2007 VL 49 IS 5 BP 384 EP 401 DI 10.1016/j.specom.2006.11.002 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 179NH UT WOS:000247296100005 ER PT J AU Barker, J Cooke, M AF Barker, Jon Cooke, Martin TI Modelling speaker intelligibility in noise SO SPEECH COMMUNICATION LA English DT Article DE intelligibility; automatic speech recognition; speech perception; glimpsing; energetic masking; computational modelling ID AUTOMATIC SPEECH RECOGNITION; CONVERSATIONAL SPEECH; SIMULTANEOUS TALKERS; CLEAR SPEECH; PERCEPTION; HEARING; LISTENERS; HARD AB This study compared listeners' performance on a multispeaker speech-in-noise task with that of a model inspired by automatic speech recognition techniques. Listeners identified three keywords in simple 6-word sentences presented in speech-shaped noise at a range of signal-to-noise ratios. Sentence material was provided by 18 male or 16 female speakers. An across-speaker analysis of a number of acoustic parameters (vocal tract length, mean fundamental frequency and speaking rate) found none to be consistently good predictors of relative intelligibility. A simple measure of degree of energetic masking was a good predictor of female speech intelligibility, especially in high noise conditions, but failed to account for interspeaker differences for the male group. A glimpsing model, which combined a simulation of energetic masking with speaker-dependent statistical models, produced recognition scores which were fitted to the behavioural data pooled across all speakers. Using a single set of speaker-independent, noise-level-independent parameters, the model was able to predict not only the intelligibility of individual speakers to a remarkable degree, but could also account for most of the token-wise intelligibilities of the letter keywords. The fit was particularly good in high noise conditions. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. 
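A sketch of the energetic-masking measure implied by the glimpsing account above: count the time-frequency cells where the speech exceeds the masker by a local-SNR threshold. The 3 dB threshold and the filterbank shapes are illustrative choices, not necessarily the paper's exact configuration.

```python
import numpy as np

def glimpse_proportion(speech_pow, noise_pow, threshold_db=3.0):
    """Fraction of time-frequency cells where the local SNR exceeds the
    threshold; speech_pow and noise_pow are (n_frames, n_bands) arrays of
    linear power, e.g. from a gammatone or mel filterbank."""
    local_snr_db = 10.0 * np.log10(speech_pow / noise_pow)
    return float(np.mean(local_snr_db > threshold_db))

rng = np.random.default_rng(2)
S = rng.gamma(2.0, 1.0, size=(200, 32))        # stand-in speech power
N = rng.gamma(2.0, 1.0, size=(200, 32)) * 2.0  # stand-in masker power
print(glimpse_proportion(S, N))  # more glimpses ~ higher intelligibility
```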
RP Barker, J (reprint author), Univ Sheffield, Dept Comp Sci, Regent Court,211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. EM j.barker@dcs.shef.ac.uk; m.cooke@dcs.shef.ac.uk CR AINSWORTH WA, 1994, J ACOUST SOC AM, V96, P687, DOI 10.1121/1.410306 Boersma P., 1993, IFA P, V17, P97 Boersma P., 2005, PRAAT DOING PHONETIC BOND ZS, 1994, SPEECH COMMUN, V14, P325, DOI 10.1016/0167-6393(94)90026-4 Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5 Brungart DS, 2001, J ACOUST SOC AM, V109, P1101, DOI 10.1121/1.1345696 Brungart DS, 2001, J ACOUST SOC AM, V110, P2527, DOI 10.1121/1.1408946 Burg J. P., 1975, THESIS STANFORD U Cooke M, 2003, J PHONETICS, V31, P579, DOI 10.1016/S0095-4470(03)00013-5 Cooke M, 2006, J ACOUST SOC AM, V120, P2421, DOI 10.1121/1.2229005 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Cooke M, 2006, J ACOUST SOC AM, V119, P1562, DOI 10.1121/1.2166600 Durlach NI, 2003, J ACOUST SOC AM, V113, P2984, DOI 10.1121/1.1570435 FLETCHER H, 1950, J ACOUST SOC AM, V22, P89, DOI 10.1121/1.1906605 GHITZA O, 1993, J ACOUST SOC AM, V93, P2160, DOI 10.1121/1.406679 Hawkins S, 2003, J PHONETICS, V31, P373, DOI 10.1016/j.wocn.2003.09.006 Hazan V, 2004, J ACOUST SOC AM, V116, P3108, DOI 10.1121/1.1806826 Holube I, 1996, J ACOUST SOC AM, V100, P1703, DOI 10.1121/1.417354 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 Krause JC, 2004, J ACOUST SOC AM, V115, P362, DOI 10.1121/1.1635842 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Müsch H, 2001, J Acoust Soc Am, V109, P2896, DOI 10.1121/1.1371971 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434 Shannon RV, 1999, J ACOUST SOC AM, V106, P71 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Van Compernolle D, 2001, SPEECH COMMUN, V35, P71, DOI 10.1016/S0167-6393(00)00096-0 VANSUMMERS W, 1988, J ACOUST SOC AM, V84, P917 Woodland P. C., 2001, P ISCA WORKSH AD MET, P11 NR 29 TC 30 Z9 31 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2007 VL 49 IS 5 BP 402 EP 417 DI 10.1016/j.specom.2006.11.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 179NH UT WOS:000247296100006 ER PT J AU Moore, RK AF Moore, Roger K. TI Spoken language processing: Piecing together the puzzle SO SPEECH COMMUNICATION LA English DT Review DE spoken language processing; speech technology; communicative behaviour; sensorimotor control ID PERCEPTUAL LOOP THEORY; AUTOMATIC SPEECH RECOGNITION; LEXICAL ACCESS; AUDITORY-FEEDBACK; WORD PRODUCTION; MOTOR CONTROL; BRAIN; MEMORY; MODEL; INTELLIGENCE AB Attempting to understand the fundamental mechanisms underlying spoken language processing, whether it is viewed as behaviour exhibited by human beings or as a faculty simulated by machines, is one of the greatest scientific challenges of our age. Despite tremendous achievements over the past 50 or so years, there is still a long way to go before we reach a comprehensive explanation of human spoken language behaviour and can create a technology with performance approaching or exceeding that of a human being. It is argued that progress is hampered by the fragmentation of the field across many different disciplines, coupled with a failure to create an integrated view of the fundamental mechanisms that underpin one organism's ability to communicate with another. 
This paper weaves together accounts from a wide variety of different disciplines concerned with the behaviour of living systems - many of them outside the normal realms of spoken language - and compiles them into a new model: PRESENCE (PREdictive SENsorimotor Control and Emulation). It is hoped that the results of this research will provide a sufficient glimpse into the future to give breath to a new generation of research into spoken language processing by mind or machine. (c) 2007 Elsevier B.V. All rights reserved. C1 Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Moore, RK (reprint author), Univ Sheffield, Dept Comp Sci, Regent Court,211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. EM r.k.moore@dcs.shef.ac.uk CR ABLER WL, 1989, J SOC BIOL STRUCT, V12, P1, DOI 10.1016/0140-1750(89)90015-8 Aboitiz F, 2005, NEURAL NETWORKS, V18, P1357, DOI 10.1016/j.neunet.2005.04.009 Alexandrov YI, 2005, COGNITIVE BRAIN RES, V25, P387, DOI 10.1016/j.cogbrainres.2005.08.006 ALTMANN G, 1997, ASCENT BABEL Anderson ML, 2003, ARTIF INTELL, V149, P91, DOI 10.1016/S0004-3702(03)00054-7 Anderson RH, 2005, CARDIOL YOUNG, V15, P1, DOI 10.1017/S1047951105000016 Arnold K, 2006, NATURE, V441, P303, DOI 10.1038/441303a AXELROD S, 2004, P IEEE ICASSP BADDELEY AD, 1974, RECENT ADV LEARNING, V8, P7 Bailly G, 1997, SPEECH COMMUN, V22, P251, DOI 10.1016/S0167-6393(97)00025-3 Bara B.G., 2005, COGNITIVE PRAGMATICS BARONCOHEN S, 1985, COGNITION, V21, P37, DOI 10.1016/0010-0277(85)90022-8 Baron-Cohen Simon, 1997, MINDBLINDNESS ESSAY Barto A. G., 1995, MODELS INFORM PROCES, P215 Becchio C, 2006, CONSCIOUS COGN, V15, P64, DOI 10.1016/j.concog.2005.03.006 Becker J, 2006, COGNITIVE DEV, V21, P194, DOI 10.1016/j.cogdev.2005.11.002 BELAVKIN RV, 2004, P AISB04 S EM COGN A, P1 BLOMBERG M, 1987, EUR C SPEECH TECHN E, P369 Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9 Brainard MS, 2002, NATURE, V417, P351, DOI 10.1038/417351a Bregman AS., 1990, AUDITORY SCENE ANAL BRIDLE JS, 1985, COMPUTER SPEECH PROC Brunswik E., 1952, INT ENCY UNIFIED SCI, V1 BRYANT CM, 2004, P AISB S EM COGN AFF, P9 Burke J, 1995, CONNECTIONS Chartrand TL, 1999, J PERS SOC PSYCHOL, V76, P893, DOI 10.1037//0022-3514.76.6.893 Chella A, 2006, ROBOT AUTON SYST, V54, P403, DOI 10.1016/j.robot.2006.01.008 Cherry C., 1978, HUMAN COMMUNICATION Clark HH, 2002, SPEECH COMMUN, V36, P5, DOI 10.1016/S0167-6393(01)00022-X Cooke M, 2003, J PHONETICS, V31, P579, DOI 10.1016/S0095-4470(03)00013-5 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Cowley SJ, 2004, LANG SCI, V26, P273, DOI 10.1016/j.langsci.2003.08.005 Cox MT, 2005, ARTIF INTELL, V169, P104, DOI 10.1016/j.artint.2005.10.009 Darwin C, 1872, EXPRESSION EMOTIONS Dawkins Richard, 1991, BLIND WATCHMAKER de Graaf-Peters VB, 2006, EARLY HUM DEV, V82, P257, DOI 10.1016/j.earlhumdev.2005.10.013 Denes P. 
B., 1973, SPEECH CHAIN PHYS BI DEUTSCH JA, 1963, PSYCHOL REV, V70, P80, DOI 10.1037/h0039515 DEWACHTER M, 2003, P EUROSPEECH de Zubicaray GI, 2006, BRAIN COGNITION, V60, P272, DOI 10.1016/j.bandc.2005.11.008 Dijksterhuis A, 2006, CONSCIOUS COGN, V15, P135, DOI 10.1016/j.concog.2005.04.007 Donald Merlin, 1998, APPROACHES EVOLUTION, P44 Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P1, DOI 10.1016/S0167-6393(02)00072-9 DOYLE L, 2006, TALKING YOUR MOUTH F Dutoit T., 1997, INTRO TEXT SPEECH SY Ekman P., 1999, HDB COGNITION EMOTIO, P301 EMMOREY K, 2002, P AAAS ANN M, P1 Engelhardt PE, 2006, J MEM LANG, V54, P554, DOI 10.1016/j.jml.2005.12.009 Erlhagen W, 2006, ROBOT AUTON SYST, V54, P353, DOI 10.1016/j.robot.2006.01.004 EVERMANN G, 2005, P ICASSP, P209 Fadiga L, 2002, EUR J NEUROSCI, V15, P399, DOI 10.1046/j.0953-816x.2001.01874.x Fairbanks G, 1955, J SPEECH HEAR DISORD, V20, P333 FALLSIDE GF, 1990, P ESCA WORKSH SPEECH, P237 Feldman JA, 2005, ARTIF INTELL, V169, P181, DOI 10.1016/j.artint.2005.10.010 Fenn KM, 2003, NATURE, V425, P614, DOI 10.1038/nature01951 Figueredo AJ, 2006, INTELLIGENCE, V34, P211, DOI 10.1016/j.intell.2005.03.006 Fitch WT, 2000, TRENDS COGN SCI, V4, P258, DOI 10.1016/S1364-6613(00)01494-7 Fitch WT, 2004, SCIENCE, V303, P377, DOI 10.1126/science.1089401 FODOR J, 2001, MIND DOESNT WORKS WA FOWLER CA, 1986, J PHONETICS, V14, P3 Frith C, 2002, CONSCIOUS COGN, V11, P481, DOI 10.1016/S1053-8100(02)00022-3 Fry Dennis, 1977, HOMO LOQUENS MAN TAL FUJISAKI H, 2005, P INT S COMM SKILLS GEERS AE, 1992, J SPEECH HEAR RES, V35, P1384 GERDES VGJ, 1994, BIOL CYBERN, V70, P513, DOI 10.1007/BF00198804 Gerken L. A., 2005, LANG LEARN DEV, V1, P5, DOI 10.1207/s15473341lld0101_3 Goldinger SD, 1996, J EXP PSYCHOL LEARN, V22, P1166, DOI 10.1037/0278-7393.22.5.1166 Goldinger SD, 1998, PSYCHOL REV, V105, P251, DOI 10.1037/0033-295X.105.2.251 Goldsmith H.H., 2003, HDB AFFECTIVE SCI Gopnik A., 2001, SCI CRIB GRAND S, 2003, GROWING LUCY Greenberg Steven, 1996, P ESCA WORKSH AUD BA, P1 Grush R, 2004, BEHAV BRAIN SCI, V27, P377 GRUSH R, 1998, CONSCIOUSNESS READER Hartsuiker RJ, 2001, COGNITIVE PSYCHOL, V42, P113, DOI 10.1006/cogp.2000.0744 Hauser MD, 2002, SCIENCE, V298, P1569, DOI 10.1126/science.298.5598.1569 Hawkins J., 2004, INTELLIGENCE Hawkins J., 2006, HIERARCHICAL TEMPORA Hawkins J, 2005, ARTIF INTELL, V169, P196, DOI 10.1016/j.artint.2005.10.014 Hawkins S, 2003, J PHONETICS, V31, P373, DOI 10.1016/j.wocn.2003.09.006 HAWKINS S, 2004, PUZZLES PATTERNS 50 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HINTZMAN DL, 1986, PSYCHOL REV, V93, P411, DOI 10.1037//0033-295X.93.4.411 Hoare T, 2005, COMPUT J, V48, P49, DOI 10.1093/comjnl/bxh065 Holden C, 2004, SCIENCE, V303, P1316, DOI 10.1126/science.303.5662.1316 HOLMES J, 2002, SPEECH RECOGNITION S Howard I. 
S., 2005, P SPECOM, P159 HOWELL P, 2002, PATHOLOGY THERAPY SP HOWELL P, 2001, SPEECH MOTOR CONTROL, P91 Huang X., 2001, SPOKEN LANGUAGE PROC Hunter MD, 2006, P NATL ACAD SCI USA, V103, P189, DOI 10.1073/pnas.0506268103 Hunter MD, 2004, AM J PSYCHIAT, V161, P923, DOI 10.1176/appi.ajp.161.5.923 Ikuta N, 2006, BRAIN LANG, V97, P154, DOI 10.1016/j.bandl.2005.10.006 Jarvis ED, 2004, ANN NY ACAD SCI, V1016, P749, DOI 10.1196/annals.1298.038 Jelinek F, 1996, SPEECH COMMUN, V18, P242, DOI 10.1016/0167-6393(96)00009-X Jelinek F., 1998, STAT METHODS SPEECH John ER, 2002, BRAIN RES REV, V39, P1, DOI 10.1016/S0165-0173(02)00142-X Junqua J.C., 1996, ROBUSTNESS AUTOMATIC Junqua JC, 1996, SPEECH COMMUN, V20, P13, DOI 10.1016/S0167-6393(96)00041-6 Jusczyk PW, 1999, TRENDS COGN SCI, V3, P323, DOI 10.1016/S1364-6613(99)01363-7 KELLER E, 2001, IMPROVEMENTS SPEECH KELLER E, IMPROVEMENTS SPEECH Kuhl PK, 2004, NAT REV NEUROSCI, V5, P831, DOI 10.1038/nrn1533 Kurzweil R., 1990, AGE INTELLIGENT MACH Kurzweil R., 1999, AGE SPIRITUAL MACHIN LANE H, 1971, J SPEECH HEAR RES, V14, P677 LEE CH, 2004, P ICSLP KOR Leggetter C.J., 1994, P INT C SPOK LANG PR, P451 Lengagne T, 1999, P ROY SOC B-BIOL SCI, V266, P1623, DOI 10.1098/rspb.1999.0824 LEVELT WJM, 1992, CONSCIOUS COGN, V1, P226, DOI 10.1016/1053-8100(92)90062-F Levelt W. J., 1989, SPEAKING INTENTION A Levelt WJM, 2001, P NATL ACAD SCI USA, V98, P13464, DOI 10.1073/pnas.231459498 Levelt WJM, 1999, BEHAV BRAIN SCI, V22, P1 LEVELT WJM, 1983, COGNITION, V14, P41, DOI 10.1016/0010-0277(83)90026-4 Lewicki MS, 2002, NAT NEUROSCI, V5, P356, DOI 10.1038/nn831 LEWIS RL, 2000, COMPUTATIONAL PSYCHO Liberman AM, 2000, TRENDS COGN SCI, V4, P187, DOI 10.1016/S1364-6613(00)01471-6 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Lombard E., 1911, ANN MALADIES OREILLE, V37, P101 Maier V., 2005, P INTERSPEECH, P1245 MAIRESSE F, 2005, P EUROSPEECH, P1881 MAKHOUL J, 1984, INVARIANCE VARIABILI Marr D., 1982, VISION COMPUTATIONAL MARTINLOECHES M, 2005, J HUM EVOL, V50, P226 Maslow AH, 1943, PSYCHOL REV, V50, P370, DOI 10.1037/h0054346 Meguerditchian A, 2006, BEHAV BRAIN RES, V171, P170, DOI 10.1016/j.bbr.2006.03.018 Meltzoff AN, 1997, EARLY DEV PARENTING, V6, P179, DOI 10.1002/(SICI)1099-0917(199709/12)6:3/4<179::AID-EDP157>3.0.CO;2-R MESSUM P, 2005, UNPUB LEARNING TALK Moore RK, 1996, J ACOUST SOC AM, V99, P1710, DOI 10.1121/1.414694 Moore R. K., 2005, Proceedings of the Fourth IEEE International Conference on Cognitive Informatics MOORE RK, 2005, KEYN TALK SPECOM 10 MOORE RK, 2005, KEYN TALK COST278 IS MOORE RK, 1993, P EUR BERL MOORE RK, 2001, P WORKSH SPEECH REC, P145 Morgan RE, 2005, INT MARKET REV, V22, P5 Mountcastle V., 1978, MINDFUL BRAIN Nicolelis MAL, 2001, NATURE, V409, P403, DOI 10.1038/35053191 NORRIS DG, 1994, COGNITION, V52, P163 Pacherie E, 2006, COGN SYST RES, V7, P101, DOI 10.1016/j.cogsys.2005.11.012 Paul ES, 2005, NEUROSCI BIOBEHAV R, V29, P469, DOI 10.1016/j.neubiorev.2005.01.002 Perkell J, 1997, SPEECH COMMUN, V22, P227, DOI 10.1016/S0167-6393(97)00026-5 Perlis D, 2005, ARTIF INTELL, V169, P184, DOI 10.1016/j.artint.2005.10.012 Philipson L, 2002, J THEOR BIOL, V215, P109, DOI 10.1006/jtbi.2001.2501 Pinker S., 1994, LANGUAGE INSTINCT Pinker Steven, 1997, MIND WORKS Powers W. 
T., 1973, BEHAV CONTROL PERCEP POWERS WT, 2005, BRIEF INTRO PERCEPTU Pulvermuller F, 2005, NAT REV NEUROSCI, V6, P576, DOI 10.1038/nrn1706 Rabiner L, 1993, FUNDAMENTALS SPEECH Rakoczy H, 2006, COGN SYST RES, V7, P113, DOI 10.1016/j.cogsys.2005.11.008 Rizzolatti G, 1996, COGNITIVE BRAIN RES, V3, P131, DOI 10.1016/0926-6410(95)00038-0 Rizzolatti G, 2004, ANNU REV NEUROSCI, V27, P169, DOI 10.1146/annurev.neuro.27.070203.144230 Rizzolatti G, 1998, TRENDS NEUROSCI, V21, P188, DOI 10.1016/S0166-2236(98)01260-0 ROY D, 1998, P INT C SPEECH LANG, P1279 Roy DK, 2002, COGNITIVE SCI, V26, P113, DOI 10.1207/s15516709cog2601_4 Scharenborg O, 2005, COGNITIVE SCI, V29, P867, DOI 10.1207/s15516709cog0000_37 SCHARENBORG O, 2003, P EUROSPEECH, P2097 SCHARENBORG O, 2003, J ACOUST SOC AM, V114, P3023 Scherer K.R., 2001, APPRAISAL PROCESS EM Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Schweizer K, 2005, INTELLIGENCE, V33, P589, DOI 10.1016/j.intell.2005.07.001 Searle J. R., 1983, INTENTIONALITY ESSAY Shannon CE, 1949, MATH THEORY COMMUNIC Sinha P, 2002, NAT NEUROSCI, V5, P1093, DOI 10.1038/nn949 SLANEY M, 1997, COMPUTATIONAL AUDITO, P27 Slevc LR, 2006, J MEM LANG, V54, P515, DOI 10.1016/j.jml.2005.11.002 Sokhi DS, 2005, NEUROIMAGE, V27, P572, DOI 10.1016/j.neuroimage.2005.04.023 STERNBERG S, 1988, PHONETICA, V45, P175 STEVENS KN, 1989, J PHONETICS, V17, P3 Studdert-Kennedy M, 2002, MIRROR NEURONS EVOLU, P207 SUNDSTROM P, 2005, EXPLORING AFFECTIVE Taylor JG, 2005, ARTIF INTELL, V169, P192, DOI 10.1016/j.artint.2005.10.011 Taylor JG, 2005, NEURAL NETWORKS, V18, P353, DOI 10.1016/j.neunet.2005.03.005 Taylor MM, 1999, INT J HUM-COMPUT ST, V50, P521, DOI 10.1006/ijhc.1998.0258 Taylor MM, 1999, INT J HUM-COMPUT ST, V50, P433, DOI 10.1006/ijhc.1998.0262 TAYLOR MM, 1992, RECENT ADV NATO AS F, V75 Tirassa M, 2006, COGN SYST RES, V7, P128, DOI 10.1016/j.cogsys.2006.01.002 Tirassa M, 2006, CONSCIOUS COGN, V15, P197, DOI 10.1016/j.concog.2005.06.005 Toates F, 2006, CONSCIOUS COGN, V15, P75, DOI 10.1016/j.concog.2005.04.008 Tremblay S, 2003, NATURE, V423, P866, DOI 10.1038/nature01710 Tulving E, 2002, ANNU REV PSYCHOL, V53, P1, DOI 10.1146/annurev.psych.53.100901.135114 TUMMOLINI L, 2006, COGN SYSTEMS RES, V7, P140 VARGA AP, 1994, P EUR GEN SEPT, P1175 Varga A.P., 1990, P ICASSP, P845 Walker MA, 2004, COGNITIVE SCI, V28, P811, DOI [10.1016/j.cogsci.2004.06,002, 10.1016/j.cogsci.2004.06.002] Wang Y., 2003, BRAIN MIND TRANSDISC, V4, P151, DOI 10.1023/A:1025401527570 Wang Y, 2006, IEEE T SYST MAN CY C, V36, P124, DOI 10.1109/TSMCC.2006.871126 Warren JE, 2005, TRENDS NEUROSCI, V28, P636, DOI 10.1016/j.tins.2005.09.010 Wilson M, 2005, PSYCHOL BULL, V131, P460, DOI 10.1037/0033-2909.131.3.460 Wilson SM, 2004, NAT NEUROSCI, V7, P701, DOI 10.1038/nn1263 Worgotter F, 2005, NEURAL COMPUT, V17, P245, DOI 10.1162/0899766053011555 Wundt W., 1874, GRUNDZUGE PHYSL PSYC Yarbus A. L., 1967, EYE MOVEMENTS VISION Yu AC, 1996, SCIENCE, V273, P1871, DOI 10.1126/science.273.5283.1871 Zipf G.K., 1949, HUMAN BEHAV PRINCIPA NR 197 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2007 VL 49 IS 5 BP 418 EP 435 DI 10.1016/j.specom.2007.01.011 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 179NH UT WOS:000247296100007 ER PT J AU Solera-Urena, R Martin-Iglesias, D Gallardo-Antolin, A Pelaez-Moreno, C Diaz-de-Maria, F AF Solera-Urena, R. 
Martin-Iglesias, D. Gallardo-Antolin, A. Pelaez-Moreno, C. Diaz-de-Maria, F. TI Robust ASR using support vector machines SO SPEECH COMMUNICATION LA English DT Article DE robust ASR; additive noise; machine learning; Support Vector Machines; kernel methods; HMM; ANN; hybrid ASR; Dynamic Time Alignment ID SPEECH RECOGNITION; CLASSIFICATION; MODELS AB The improved theoretical properties of Support Vector Machines with respect to other machine learning alternatives due to their max-margin training paradigm have led us to suggest them as a good technique for robust speech recognition. However, important shortcomings have had to be circumvented, the most important being the normalisation of the time duration of different realisations of the acoustic speech units. In this paper, we have compared two approaches in noisy environments: first, a hybrid HMM-SVM solution where a fixed number of frames is selected by means of an HMM segmentation and second, a normalisation kernel called Dynamic Time Alignment Kernel (DTAK) first introduced in Shimodaira et al. [Shimodaira, H., Noma, K., Nakai, M., Sagayama, S., 2001. Support vector machine with dynamic time-alignment kernel for speech recognition. In: Proc. Eurospeech, Aalborg, Denmark, pp. 1841-1844] and based on DTW (Dynamic Time Warping). Special attention has been paid to the adaptation of both alternatives to noisy environments, comparing two types of parameterisations and performing suitable feature normalisation operations. The results show that the DTA Kernel provides important advantages over the baseline HMM system in medium to bad noise conditions, also outperforming the results of the hybrid system. (c) 2007 Elsevier B.V. All rights reserved.
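A compact sketch of a DTW-style alignment kernel in the spirit of DTAK: accumulate frame-level kernel values along the best alignment path and normalise by the path length. The recursion and normalisation shown here are a simplified variant, not necessarily identical to Shimodaira et al.'s formulation; the resulting value can be fed to a standard SVM as a precomputed Gram matrix.

```python
import numpy as np

def dtak(X, Y):
    """Dynamic time alignment kernel between sequences X (n, d), Y (m, d)."""
    n, m = len(X), len(Y)
    K = X @ Y.T                               # frame-level inner products
    G = np.full((n + 1, m + 1), -np.inf)      # accumulated kernel scores
    L = np.zeros((n + 1, m + 1))              # alignment path lengths
    G[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            g, l = max((G[i-1, j], L[i-1, j]),
                       (G[i, j-1], L[i, j-1]),
                       (G[i-1, j-1], L[i-1, j-1]))
            G[i, j] = g + K[i-1, j-1]
            L[i, j] = l + 1.0
    return G[n, m] / L[n, m]                  # length-normalised score

rng = np.random.default_rng(3)
a = rng.normal(size=(12, 13))                 # 13-dim MFCC-like frames
b = np.repeat(a, 2, axis=0)                   # time-warped copy of a
print(dtak(a, a), dtak(a, b))                 # warping changes score little
```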
C1 Univ Carlos III Madrid, Signal Theory & Commun Dept, EPS, Leganes 28911, Spain. RP Pelaez-Moreno, C (reprint author), Univ Carlos III Madrid, Signal Theory & Commun Dept, EPS, Leganes, Leganes 28911, Spain. EM rsolera@tsc.uc3m.es; dmartin@rtpa.es; gallardo@tsc.uc3m.es; carmen@tsc.uc3m.es; fdiaz@tsc.uc3m.es RI Diaz de Maria, Fernando/E-8048-2011; Gallardo-Antolin, Ascension/L-4152-2014; Pelaez-Moreno, Carmen/B-7373-2008 OI Pelaez-Moreno, Carmen/0000-0003-1425-6763 CR Allwein E. L., 2000, J MACHINE LEARNING R, V1, P113, DOI DOI 10.1162/15324430152733133 Bengio Y., 1995, NEURAL NETWORKS SPEE Bourlard Ha, 1994, CONNECTIONIST SPEECH Burges C. J., 1996, P 13 INT C MACH LEAR, P71 CHIHCHUNG C, 2004, LIBSVM LIB SUPPORT V CLARKSON P, 1999, IEEE INT C AC SPEECH, V2, P585 COLLOBERT R, SVMTORCH SUPPORT VEC Crammer K., 2001, J MACHINE LEARNING R, V2, P265 Ech-Cherif A., 2002, P 9 INT C NEUR INF P, V5, P2507 FINE S, 2001, P INT C AC SPEECH SI, V1, P417 Furnkranz J, 2002, J MACH LEARN RES, V2, P721, DOI 10.1162/153244302320884605 Ganapathiraju A., 2000, P INT C SPOK LANG PR, V4, P504 Ganapathiraju A, 2004, IEEE T SIGNAL PROCES, V52, P2348, DOI 10.1109/TSP.2004.831018 Ganapathiraju A., 2002, THESIS MISSISSIPI ST Gangashetty S. V., 2005, Proceedings of 2005 International Conference on Intelligent Sensing and Information Processing (IEEE Cat. No. 05EX979), DOI 10.1109/ICISIP.2005.1529482 GARCIACABELLOS JM, 2004, P EUSIPCO 2004, P2067 Glass JR, 2003, COMPUT SPEECH LANG, V17, P137, DOI 10.1016/S0885-2308(03)00006-8 HAMAKER J, 2003, UNPUB ADV SPEECH REC HAMAKER J, 2002, P INT C SPOK LANG PR, V2, P1001 Hsu CW, 2002, IEEE T NEURAL NETWOR, V13, P415, DOI 10.1109/72.991427 HT Lin, 2003, NOTE PLATTS PROBABIL ISO K, 1990, P INT C AC SPEECH SI, P441 JAAKKOLA T, 1998, EXPLOITING GENERATIV Jiang H, 2006, IEEE T AUDIO SPEECH, V14, P1584, DOI 10.1109/TASL.2006.879805 Joachims T., 1999, MAKING LARGE SCALE S, P169 LE Q, 2003, INT C ART NEUR NETW, P443 Ma C., 2001, P IEEE INT C AC SPEE, V1, P381 Martin-Iglesias D., 2005, LECT NOTES COMPUTER, V3817, P256 MORENO A, 1998, SPEECHDAT DOCUMENTAT Navia-Vazquez A, 2001, IEEE T NEURAL NETWOR, V12, P1047, DOI 10.1109/72.950134 Osuna E., 1997, IEEE WORKSH NEUR NET, P276 Platt J., 1999, ADV LARGE MARGIN CLA, P61 RABINER LR, 1978, IEEE T ACOUST SPEECH, V26, P575, DOI 10.1109/TASSP.1978.1163164 REICHL W, 1995, P INT C AC SPEECH SI, P3335 ROBINSON T, 1995, AUTOMATIC SPEECH SPE, P159 SAKOE H, 1989, P INT C AC SPEECH SI, P439 SCHOLKOPF B, 2002, LEARNING KERNALS SEKHAR C, 2001, WORKSH SPOK LANG PRO Shimodaira H, 2002, ADV NEUR IN, V14, P921 Shimodaira H., 2001, P EUR C SPEECH COMM, P1841 Smith N, 2002, ADV NEUR IN, V14, P1197 STADERMANN J, 2004, P INT C SPOK LANG PR, P661 TEBELSKIS J, 1991, P IEEE INT C AC SPEE, P61, DOI 10.1109/ICASSP.1991.150278 Thubthong N, 2001, INT J UNCERTAIN FUZZ, V9, P803, DOI 10.1142/S0218488501001253 Trentin E, 2001, NEUROCOMPUTING, V37, P91, DOI 10.1016/S0925-2312(00)00308-8 Vapnik V., 1995, NATURE STAT LEARNING Vapnik V, 1998, STAT LEARNING THEORY VARGA AP, 1992, NOISEX 92 STUDY EFFE Vicente-Pena J, 2006, SPEECH COMMUN, V48, P1379, DOI 10.1016/j.specom.2006.07.007 WAN V, 2003, INT C AC SPEECH SIGN, V2, P221 Weiss N. A., 1993, INTRO STAT Weston J., 1999, P EUR S ART NEUR NET Wu TF, 2004, J MACH LEARN RES, V5, P975 Young S. J., 1995, HTK HIDDEN MARKOV MO NR 54 TC 14 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2007 VL 49 IS 4 BP 253 EP 267 DI 10.1016/j.specom.2007.01.013 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 165OB UT WOS:000246313700001 ER PT J AU Rozman, R Kodek, DM AF Rozman, Robert Kodek, Dusan M. TI Using asymmetric windows in automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE asymmetric windows; windowing; robustness; automatic speech recognition; Short Time Fourier Transform ID FILTERS; DESIGN AB This paper considers the windowing problem of the short-time frequency analysis that is used in speech recognition systems (SRS). Since human hearing is relatively insensitive to short-time phase distortion of the speech signal, there is no apparent reason for the use of symmetric windows, which give a linear phase response. Furthermore, phase information is usually completely disregarded in SRS. This should be contrasted with the well-known fact that relaxation of the linearity constraint on window phase results in a better magnitude response and shorter time delay. These observations form a strong argument in favor of the research presented in this paper. First, a general overview of the role that windows play in the frequency analysis stage of SRS is presented. Important properties for speech recognition are highlighted and potential advantages of asymmetric windows are presented. Among them, the shorter time delay and the better magnitude response are most important. Two possible design methods for asymmetric windows are discussed. Since little is known about window influence on SRS performance, the design methods are first considered from a frequency analysis point of view. This is followed by practical evaluations on real SRS. Expectations were confirmed by the results. The proposed asymmetric windows increased the robustness of elementary, isolated and connected speech recognition on a variety of adverse test conditions. This is particularly true for the case of a combination of additive and low-pass convolutional distortions. Further research on asymmetric windows and on the parameterization process as a whole is suggested. (c) 2007 Elsevier B.V. All rights reserved.
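One simple way to realise such a window: splice the rising half of a long Hann window onto the falling half of a short one, so that most of the weight sits late in the frame. This particular construction is our own illustration, not one of the two design methods the paper evaluates.

```python
import numpy as np

def asymmetric_hann(rise, fall):
    """Rising half of a length-2*rise Hann window followed by the falling
    half of a length-2*fall Hann window (total length rise + fall)."""
    up = np.hanning(2 * rise)[:rise]
    down = np.hanning(2 * fall)[fall:]
    return np.concatenate([up, down])

w = asymmetric_hann(rise=180, fall=76)        # 256-point asymmetric window
# Centre of gravity as a crude effective-delay measure, in samples:
delay = np.sum(np.arange(len(w)) * w) / np.sum(w)
print(len(w), delay)                          # peak (and mass) sit late
```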
C1 Univ Ljubljana, Fac Comp & Informat Sci, Lab Architecture & Signal Proc, Ljubljana 1001, Slovenia. RP Rozman, R (reprint author), Univ Ljubljana, Fac Comp & Informat Sci, Lab Architecture & Signal Proc, Trzaska 25, Ljubljana 1001, Slovenia. EM rozman@fri.uni-lj.si; duke@fri.uni-lj.si CR BURNSIDE D, 1995, IEEE T SIGNAL PROCES, V43, P605, DOI 10.1109/78.370616 Fletcher H., 1953, SPEECH HEARING COMMU Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1997, P ESCA TUT RES WORKS, P1 Milner B, 2006, SPEECH COMMUN, V48, P697, DOI 10.1016/j.specom.2005.10.004 PARKS TW, 1972, IEEE T ACOUST SPEECH, VAU20, P195, DOI 10.1109/TAU.1972.1162381 Rabinovitch A, 1989, Reg Immunol, V2, P77 ROZMAN R, 2003, P EUR INT C COMP TOO, V2, P171 ROZMAN R, 2000, P C LANG TECHN LJUBL, P75 VARGA AP, 1992, NOISEX 92 STUDY EFFE NR 10 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2007 VL 49 IS 4 BP 268 EP 276 DI 10.1016/j.specom.2007.01.012 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 165OB UT WOS:000246313700002 ER PT J AU Adami, AG AF Adami, Andre Gustavo TI Modeling prosodic differences for speaker recognition SO SPEECH COMMUNICATION LA English DT Article DE automatic speaker recognition; prosody; speaker verification ID VERIFICATION; SYSTEMS AB Prosody plays an important role in discriminating speakers. Due to the complexity of estimating relevant prosodic information, most recognition systems rely on the notion that the statistics of the fundamental frequency (as a proxy for pitch) and speech energy (as a proxy for loudness/stress) distributions can be used to capture prosodic differences between speakers. However, this simplistic notion disregards the temporal aspects and the relationship between prosodic features that determine certain phenomena, such as intonation and stress. We propose an alternative approach that exploits the dynamics between the fundamental frequency and speech energy to capture prosodic differences. The aim is to characterize different intonation, stress, or rhythm patterns produced by the variation in the fundamental frequency and speech energy contours. In our approach, the continuous speech signal is converted into a sequence of discrete units that describe the signal in terms of dynamics of the fundamental frequency and energy contours. Using simple statistical models, we show that the statistical dependency between such discrete units can capture speaker-specific information. On the extended-data speaker detection task of the 2001 and 2003 NIST Speaker Recognition Evaluation, such an approach achieves a relative improvement of at least 17% over a system based on the distribution statistics of fundamental frequency, speech energy and their deltas. We also show that these prosodic features are more robust to communication channel effects than the state-of-the-art speaker recognition system. Since conventional speaker recognition systems do not fully incorporate different levels of information, we show that the prosodic features provide complementary information to conventional systems by fusing the prosodic systems with the state-of-the-art system. The relative performance improvement over the state-of-the-art system is about 42% and 12% for the extended-data task of the 2001 and 2003 NIST Speaker Recognition Evaluation, respectively. (c) 2007 Elsevier B.V. All rights reserved.
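A sketch of the discrete-unit idea in the record above: quantise joint F0/energy slope directions into a small symbol alphabet and score symbol sequences with smoothed bigram statistics. The four-symbol inventory and add-alpha smoothing are demo choices, not the paper's exact system.

```python
import numpy as np
from collections import Counter

def prosody_symbols(f0, energy):
    """Map joint F0/energy slope signs to symbols 0..3 per frame pair
    (0: both falling, 1: energy rising, 2: F0 rising, 3: both rising)."""
    return ((np.diff(f0) > 0).astype(int) * 2
            + (np.diff(energy) > 0).astype(int))

def bigram_logprob(symbols, counts, vocab=4, alpha=1.0):
    """Score a symbol sequence under add-alpha smoothed bigram counts."""
    lp = 0.0
    for a, b in zip(symbols[:-1], symbols[1:]):
        ctx = sum(v for (x, _), v in counts.items() if x == a)
        lp += np.log((counts[(a, b)] + alpha) / (ctx + alpha * vocab))
    return lp

rng = np.random.default_rng(4)
f0 = 120 + np.cumsum(rng.normal(size=500))    # fake F0 contour (Hz)
en = 60 + np.cumsum(rng.normal(size=500))     # fake energy contour (dB)
sym = prosody_symbols(f0, en)
train, test = sym[:400], sym[400:]
counts = Counter(zip(train[:-1], train[1:]))  # speaker-specific bigrams
print(bigram_logprob(test, counts))           # higher = better match
```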
C1 Univ Caxias Do Sul, Dept Informat, BR-95070560 Caxias do Sul, Brazil. RP Adami, AG (reprint author), Univ Caxias Do Sul, Dept Informat, Rua Francisco Getulio Vargas,1130 Caxias Do Sul, BR-95070560 Caxias do Sul, Brazil. EM agadami@ucs.br RI Adami, Andre/E-9714-2012 OI Adami, Andre/0000-0001-8105-7698 CR ADAMI A, 2003, MODELING PROSODIC DY, P788 ANDREWS WD, 2002, GENDER DEPENDENT PHO, P149 ANDREWS WD, 2001, PHONETIC IDIOLECTAL, P55 ATAL BS, 1972, J ACOUST SOC AM, V52, P1687, DOI 10.1121/1.1913303 ATKINSON JE, 1978, J ACOUST SOC AM, V63, P211, DOI 10.1121/1.381716 BOVES L, 1988, J ACOUST SOC AM S1, V84, pS82, DOI 10.1121/1.2026505 CAMPBELL JP, 2003, FUSING HIGH LOW LEVE, P2665 CAREY MJ, 1996, ROBUST PROSODIC FEAT, P1800 COLLIER R, 1975, J ACOUST SOC AM, V58, P249, DOI 10.1121/1.380654 Cover T. M., 1991, ELEMENTS INFORMATION DODDINGTON G, 1971, J ACOUST SOC AM, V49, pA139 Doddington G., 2001, EUROSPEECH 2001, P2521 Doddington GR, 2000, SPEECH COMMUN, V31, P225, DOI 10.1016/S0167-6393(99)00080-1 FANT G, 1991, SPEECH COMMUN, V10, P521, DOI 10.1016/0167-6393(91)90055-X FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P342, DOI 10.1109/TASSP.1981.1163605 Furui S, 1997, PATTERN RECOGN LETT, V18, P859, DOI 10.1016/S0167-8655(97)00073-1 GILLICK L, 1989, SOME STAT ISSUES COM, P532 HERMANSKY H, 1999, ROBUST METHODS SPEEC Jelinek F., 1997, STAT METHODS SPEECH KAJAREKAR S, 2003, SPEAKER RECOGNITION, P19 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 LADEFOGE.P, 1968, ANN NY ACAD SCI, V155, P141, DOI 10.1111/j.1749-6632.1968.tb56758.x Lehiste I., 1970, SUPRASEGMENTALS LUMMIS RC, 1973, IEEE T ACOUST SPEECH, VAU21, P80, DOI 10.1109/TAU.1973.1162443 MARKEL JD, 1977, IEEE T ACOUST SPEECH, V25, P330, DOI 10.1109/TASSP.1977.1162961 MARTIN A, 2001, NIST 2001 SPEAKER RE MARTIN A, 2003, NIST 2003 SPEAKER RE Martin A., 1997, EUROSPEECH, P1895 Martins CMC, 1999, PHYTOCHEM ANALYSIS, V10, P1, DOI 10.1002/(SICI)1099-1565(199901/02)10:1<1::AID-PCA420>3.3.CO;2-A NAVRATIL J, 2003, PHONETIC SPEAKER REC, P796 PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183 REYNOLDS D, 2003, SUPERSID PROJECT EXP, P784 REYNOLDS DA, 1997, HTIMIT LLHDB SPEECH, P1535 REYNOLDS DA, 1992, INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING APPLICATIONS AND TECHNOLOGY, VOLS 1 AND 2, P967 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 REYNOLDS DA, 2000, LINCOLN SPEAKER RECO, P470 SONMEZ K, 1997, EUROSPEECH, P1391 SONMEZ K, 1998, MODELING DYNAMIC PRO, P3189 SOONG FK, 1988, IEEE T ACOUST SPEECH, V36, P871, DOI 10.1109/29.1598 Talkin D., 1995, SPEECH CODING SYNTHE VANDENBROEK LAGM, 1987, J MED CHEM, V30, P325, DOI
10.1021/jm00385a014 WEBER F, 2002, USING PROSODIC LEXIC, P141 Werner S., 1994, FUNDAMENTALS SPEECH, P23 Xiang B, 2003, IEEE SIGNAL PROC LET, V10, P141, DOI 10.1109/LSP.2003.810913 NR 44 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2007 VL 49 IS 4 BP 277 EP 291 DI 10.1016/j.specom.2007.02.005 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 165OB UT WOS:000246313700003 ER PT J AU Nguyen, PC Akagi, M Nguyen, BP AF Nguyen, Phu Chien Akagi, Masato Nguyen, Binh Phu TI Limited error based event localizing temporal decomposition and its application to variable-rate speech coding SO SPEECH COMMUNICATION LA English DT Article DE temporal decomposition; event vector; event function; STRAIGHT; speech coding; line spectral frequency ID QUANTIZATION; PARAMETERS AB This paper proposes a novel algorithm for temporal decomposition (TD) of speech, called 'limited error based event localizing temporal decomposition' (LEBEL-TD), and its application to variable-rate speech coding. In previous work with TD, TD analysis was usually performed on each speech segment of about 200-300 ms or more, making it impractical for online applications. In this present work, the event localization is determined based on a limited error criterion and a local optimization strategy, which results in an average algorithmic delay of 65 ms. Simulation results show that an average log spectral distortion of about 1.5 dB can be achieved at an event rate of 20 events/s. Also, LEBEL-TD uses neither the computationally costly singular value decomposition routine nor the event refinement process, thus reducing significantly the computational cost of TD. Further, a method for variable-rate speech coding at an average rate of around 1.8 kbps based on STRAIGHT (Speech Transformation and Representation using Adaptive Interpolation of weiGHTed spectrum), which is a high-quality speech analysis-synthesis framework, using LEBEL-TD is also realized. Subjective test results indicate that the performance of the proposed speech coding method is comparable to that of the 4.8 kbps FS-1016 CELP coder. (c) 2007 Elsevier B.V. All rights reserved.
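A sketch of the limited-error idea named in the title above: march along the parameter track and open a new event whenever straight-line interpolation from the previous event can no longer stay within an error budget. The tolerance, the interpolation form, and the greedy scan are illustrative; LEBEL-TD's actual criterion and local optimization are as described in the paper.

```python
import numpy as np

def limited_error_events(params, tol=0.5):
    """Greedy event placement for a temporal-decomposition-style coder.

    params: (n_frames, dim) spectral parameter track (e.g. LSFs).
    An event is emitted at t-1 as soon as linear interpolation between the
    last event frame and frame t exceeds `tol` MSE on any in-between frame.
    """
    events, last = [0], 0
    for t in range(2, len(params)):
        alphas = np.linspace(0.0, 1.0, t - last + 1)[1:-1, None]
        interp = (1 - alphas) * params[last] + alphas * params[t]
        err = np.mean((interp - params[last + 1:t]) ** 2, axis=1)
        if err.size and err.max() > tol:
            events.append(t - 1)
            last = t - 1
    events.append(len(params) - 1)
    return events

rng = np.random.default_rng(5)
track = np.cumsum(rng.normal(scale=0.3, size=(300, 10)), axis=0)
ev = limited_error_events(track, tol=0.5)
print(len(ev), "events for", len(track), "frames")
```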
C1 Japan Adv Inst Sci & Technol, Grad Sch Informat Sci, Nomi, Ishikawa 9231292, Japan. RP Nguyen, PC (reprint author), Osaka Univ, Inst Sci & Ind Res, Osaka, Japan. EM chien@ar.sanken.osaka-u.ac.jp; akagi@jaist.ac.jp; npbinh@jaist.ac.jp CR Atal B. S., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing ATHAUDAGE CN, 1999, P ISSPA 99, P471 CAMPBELL JP, 1986, P IEEE INT C AC SPEE, P473 Campos M, 2000, MANAG INFORMAT SYST, V1, P145 Dix PJ, 1994, IEEE T SPEECH AUDI P, V2, P9, DOI 10.1109/89.260329 Fallside F., 1985, COMPUTER SPEECH PROC Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC Itakura F., 1975, J ACOUST SOC AM, V57, P35 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Kim SJ, 1999, ELECTRON LETT, V35, P962, DOI 10.1049/el:19990670 Nandasena ACR, 2001, COMPUT SPEECH LANG, V15, P381, DOI 10.1006/csla.2001.0173 NGUYEN PC, 2002, P EUSIPCO, P239 NGUYEN PC, 2002, P ICASSP 02, P265 NIRANJAN M, 1989, P ICASSP 89, P655 Paliwal K. K., 1995, P EUR C SPEECH COMM, P1029 Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363 SCHEFFE H, 1952, J AM STAT ASSOC, V47, P381, DOI 10.2307/2281310 SHIRAKI Y, 1991, P 1991 AUT M AC SOC, P233 VANDIJKKAPPERS AML, 1989, SPEECH COMMUN, V8, P125, DOI 10.1016/0167-6393(89)90039-3 NR 19 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2007 VL 49 IS 4 BP 292 EP 304 DI 10.1016/j.specom.2007.02.007 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 165OB UT WOS:000246313700004 ER PT J AU Chen, JD Benesty, J Huang, YT AF Chen, Jingdong Benesty, Jacob Huang, Yiteng (Arden) TI On the optimal linear filtering techniques for noise reduction SO SPEECH COMMUNICATION LA English DT Article ID SPECTRAL AMPLITUDE ESTIMATOR; SPEECH ENHANCEMENT; COLORED NOISE; SUBSPACE APPROACH; SUPPRESSION AB Noise reduction, which aims at extracting the clean speech from noisy observations, has many applications. It has attracted a considerable amount of research attention over the past several decades. Although many methods have been developed, the most widely used one, by far, is the optimal linear filtering technique, which achieves the clean speech estimate by passing the noisy observation through an optimal linear filter/transformation. The representative algorithms of this include Wiener filtering, spectral restoration, the subspace method, etc. Many experiments have been carried out, from various points of view, to show that the optimal filtering technique can reduce the level of noise that is present in the speech signal and improve the corresponding signal-to-noise ratio (SNR). However, there is not much theoretical justification so far for the noise reduction and SNR improvement. This paper attempts to provide a theoretical analysis on the performance (including noise reduction, speech distortion, and SNR improvement) of the optimal filtering noise-reduction techniques including the time-domain causal Wiener filter, the subspace method, and the frequency-domain subband Wiener filter. We show that the optimal linear filter, regardless of how we delineate it, can indeed reduce the level of noise (but at a price of attenuating the desired speech signal). Most importantly, we prove that the a posteriori SNR (defined after the optimal filtering) is always greater than, or at least equal to, the a priori SNR, which reveals that the optimal linear filtering technique is indeed able to make noisy speech signals cleaner. We will also discuss the bounds for noise reduction, speech distortion, and SNR improvement. (c) 2007 Elsevier B.V. All rights reserved. C1 Bell Labs, Lucent Technol, Murray Hill, NJ 07974 USA. Univ Quebec, INRS, EMT, Montreal, PQ H5A 1K6, Canada. RP Chen, JD (reprint author), Bell Labs, Lucent Technol, 600 Mt Ave,Room 2D-534, Murray Hill, NJ 07974 USA.
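A worked two-subband illustration of the paper's central claim, with invented numbers: per-band Wiener gains leave each subband's own SNR unchanged (the gain cancels), yet the fullband a posteriori SNR exceeds the a priori SNR because low-SNR bands are attenuated more.

```latex
% Subband speech/noise powers (invented): band 1 strong, band 2 weak.
\[
\sigma_{x,1}^2 = 10,\; \sigma_{v,1}^2 = 1, \qquad
\sigma_{x,2}^2 = 1,\; \sigma_{v,2}^2 = 10, \qquad
\mathrm{SNR}_{\mathrm{pri}} = \frac{10 + 1}{1 + 10} = 1 \quad (0~\mathrm{dB}).
\]
% Wiener gain per band, and the fullband SNR after filtering:
\[
H_k = \frac{\sigma_{x,k}^2}{\sigma_{x,k}^2 + \sigma_{v,k}^2}
\;\Rightarrow\; H_1 = \tfrac{10}{11},\; H_2 = \tfrac{1}{11};
\qquad
\mathrm{SNR}_{\mathrm{post}} =
\frac{H_1^2 \cdot 10 + H_2^2 \cdot 1}{H_1^2 \cdot 1 + H_2^2 \cdot 10}
= \frac{1001}{110} \approx 9.1 \quad (\approx 9.6~\mathrm{dB}).
\]
```

The fullband improvement thus comes purely from re-weighting the bands, consistent with the abstract's bound that the a posteriori SNR is never below the a priori SNR.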
EM jingdong@research.bell-labs.com; benesty@emt.inrs.ca; arden@research.bell-labs.com CR Benesty J., 2005, SPEECH ENHANCEMENT BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 CHANG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943 CHEN J, 2003, ADAPTIVE SIGNAL PROC, P129 Chen JD, 2006, IEEE T AUDIO SPEECH, V14, P1218, DOI 10.1109/TSA.2005.860851 DENDRINOS M, 1991, SPEECH COMMUN, V10, P45, DOI 10.1016/0167-6393(91)90027-Q Diethorn EJ, 2004, AUDIO SIGNAL PROCESSING: FOR NEXT-GENERATION MULTIMEDIA COMMUNICATION SYSTEMS, P91, DOI 10.1007/1-4020-7769-6_4 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 ETTER W, 1994, J AUDIO ENG SOC, V42, P341 Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367 GIBSON JD, 1991, IEEE T SIGNAL PROCES, V39, P1732, DOI 10.1109/78.91144 Hansen P. S. K., 1997, THESIS TECHN U DENMA Hirsch H. G., 1995, P IEEE INT C AC SPEE, V1, P153 Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P334, DOI 10.1109/TSA.2003.814458 Lev-Ari H, 2003, IEEE SIGNAL PROC LET, V10, P104, DOI 10.1109/LSP.2003.808544 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 LIM JS, 1983, SPEECH ENHANCEMENT Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 Mittal U, 2000, IEEE T SPEECH AUDI P, V8, P159, DOI 10.1109/89.824700 Paliwal K. K., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Rezayee A, 2001, IEEE T SPEECH AUDI P, V9, P87, DOI 10.1109/89.902276 STAHL V, 2000, P ICASSP, V3, P1875 VARY P, 1985, SIGNAL PROCESS, V8, P387, DOI 10.1016/0165-1684(85)90002-7 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 Widrow B, 1985, ADAPTIVE SIGNAL PROC Wiener N., 1949, EXTRAPOLATION INTERP NR 31 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2007 VL 49 IS 4 BP 305 EP 316 DI 10.1016/j.specom.2007.02.002 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 165OB UT WOS:000246313700005 ER PT J AU Clark, RAJ Richmond, K King, S AF Clark, Robert A. J. Richmond, Korin King, Simon TI Multisyn: Open-domain unit selection for the Festival speech synthesis system SO SPEECH COMMUNICATION LA English DT Article DE speech synthesis; unit selection AB We present the implementation and evaluation of an open-domain unit selection speech synthesis engine designed to be flexible enough to encourage further unit selection research and allow rapid voice development by users with minimal speech synthesis knowledge and experience. We address the issues of automatically processing speech data into a usable voice using automatic segmentation techniques and how the knowledge obtained at labelling time can be exploited at synthesis time. We describe target cost and join cost implementation for such a system and describe the outcome of building voices with a number of different sized datasets. We show that, in a competitive evaluation, voices built using this technology compare favourably to other systems. (c) 2007 Elsevier B.V. All rights reserved.
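A minimal dynamic-programming unit-selection sketch matching the target-cost/join-cost description above; the candidate lists and the two cost functions are placeholders, not Festival/Multisyn internals.

```python
import numpy as np

def select_units(candidates, target_cost, join_cost):
    """Viterbi search over per-target candidate unit lists, minimising the
    summed target costs plus join costs between consecutive units."""
    costs = [np.array([target_cost(0, u) for u in candidates[0]])]
    back = []
    for t in range(1, len(candidates)):
        step = np.empty(len(candidates[t]))
        ptr = np.empty(len(candidates[t]), dtype=int)
        for j, u in enumerate(candidates[t]):
            trans = costs[-1] + [join_cost(p, u) for p in candidates[t-1]]
            ptr[j] = int(np.argmin(trans))
            step[j] = trans[ptr[j]] + target_cost(t, u)
        costs.append(step)
        back.append(ptr)
    path = [int(np.argmin(costs[-1]))]
    for ptr in reversed(back):
        path.append(int(ptr[path[-1]]))
    path.reverse()
    return [candidates[t][i] for t, i in enumerate(path)]

# Toy candidates; pretend units sharing a first letter join badly.
cands = [["a1", "a2"], ["b1", "b2"], ["c1"]]
tcost = lambda t, u: 0.0 if u.endswith("1") else 0.5
jcost = lambda p, u: 1.0 if p[0] == u[0] else 0.0
print(select_units(cands, tcost, jcost))  # -> ['a1', 'b1', 'c1']
```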
C1 Univ Edinburgh, CSTR, Edinburgh EH8 9LW, Midlothian, Scotland. RP Clark, RAJ (reprint author), Univ Edinburgh, CSTR, 2 Buccleuch Pl, Edinburgh EH8 9LW, Midlothian, Scotland. EM robert@cstr.ed.ac.uk; korin@cstr.ed.ac.uk; Simon.King@ed.ac.uk CR Bennett C. L., 2005, P INT EUR, P105 BENNETT CL, 2005, ICASSP 2005 PHIL US, V1, P297 BEUTNAGEL M, 1999, EUR C SPEECH COMM TE, V3, P1063 BLACK A, 2004, P ICASSP 2004 MONTR BLACK A, 2000, P ICSLP2000 BEIJ CHI Black A. B., 2005, P INT 2005 LISB PORT, P77 Black A. W., 1997, EUROSPEECH, V2, P601 Black A. W., 1995, P EUROSPEECH MADR SP, P581 BLACK AW, 2001, 4 ISCA WORKSH SPEECH, P63 Bozkurt Baris, 2003, P EUR 03 GEN SWITZ, P277 BULYKO I, 2001, JONIT PROSODY PREDIC CONKIE A, 1999, ROBUST UNIT SELECTIO CONKIE A, 1996, PROGR SPEECH SYNTHES FITT S, 1999, P EUR, V2, P823 FOSTER ME, 2005, P ACL 2005 DEM SESS Garofolo J., 1988, GETTING STARTED DARP HAMZA W, 2004, ICSLP 2004 JEJ ISL S HOFER G, 2005, P INT Hunt A., 1996, P INT C AC SPEECH SI, V1, P373 Kominek J., 2004, 5 ISCA SPEECH SYNTH, P223 KURTIC E, 2004, THESIS U EDINBURGH MAKASHAY M, 2000, P ICSLP 2000 BEIJ CH MOBIUS B, 2001, 4 ISCA WORKSH SPEECH, P41 Syrdal A. K., 2000, P ICSLP BEIJ CHIN, P410 TAYLOR P, 2000, PHILOS T ROYAL SOC A Taylor P. A., 1998, P 3 ESCA WORKSH SPEE, P147 van Santen J.P.H., 1997, P EUROSPEECH 97 RHOD, V5, P2511 VANSANTEN J, 1997, EUROSPEECH97, V2, P553 Vepa J., 2004, SPEECH SYNTHESIS Wells John, 1982, ACCENTS ENGLISH WRENCH AA, 2001, P WORKSH INN SPEECH Young S., 2002, HTK BOOK HTK VERSION NR 32 TC 40 Z9 40 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2007 VL 49 IS 4 BP 317 EP 330 DI 10.1016/j.specom.2007.01.014 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 165OB UT WOS:000246313700006 ER PT J AU Bozkurt, B Couvreur, L Dutoit, T AF Bozkurt, Baris Couvreur, Laurent Dutoit, Thierry TI Chirp group delay analysis of speech signals SO SPEECH COMMUNICATION LA English DT Article DE group delay processing; phase processing; windowing; spectral analysis; automatic speech recognition ID FORMANT EXTRACTION; PHASE; PERCEPTION; ALGORITHM AB This study proposes new group delay estimation techniques that can be used for analyzing resonance patterns of short-term discrete-time signals and more specifically speech signals. Phase processing, or equivalently group delay processing, of speech signals is known to be difficult due to large spikes in the phase/group delay functions that mask the formant structure. In this study, we first analyze in detail the z-transform zero patterns of short-term speech signals in the z-plane and discuss the sources of spikes on group delay functions, namely the zeros closely located to the unit circle. We show that windowing largely influences these patterns, therefore short-term phase processing. Through a systematic study, we then show that reliable phase/group delay estimation for speech signals can be achieved by appropriate windowing and group delay functions can reveal formant information as well as some of the characteristics of the glottal flow component in speech signals. However, such phase estimation is highly sensitive to noise and robust extraction of group delay based parameters remains difficult in real acoustic conditions even with appropriate windowing. As an alternative, we propose processing of chirp group delay functions, i.e. group delay functions computed on a circle other than the unit circle in z-plane, which can be guaranteed to be spike-free. We finally present one application in feature extraction for automatic speech recognition (ASR). We show that chirp group delay representations are potentially useful for improving ASR performance. (c) 2007 Elsevier B.V. All rights reserved.
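A sketch of computing a chirp group delay function: evaluate the transform on a circle |z| = rho by weighting the frame with rho**-n, then apply the standard identity relating group delay to the DFTs of x[n] and n*x[n]. The radius value below is an illustrative choice; the paper studies how to place the analysis circle relative to the signal's zeros.

```python
import numpy as np

def chirp_group_delay(x, rho=1.05, nfft=1024):
    """Group delay of x evaluated on |z| = rho instead of the unit circle.

    Uses tau(w) = Re{X_n(w) / X(w)}, where X is the DFT of x[n]*rho**-n and
    X_n the DFT of n*x[n]*rho**-n; moving the evaluation circle away from
    zeros near |z| = 1 removes the spikes that plague ordinary group delay.
    """
    n = np.arange(len(x), dtype=float)
    xw = x * rho ** (-n)
    X = np.fft.rfft(xw, nfft)
    Xn = np.fft.rfft(n * xw, nfft)
    return np.real(Xn / X)

rng = np.random.default_rng(6)
frame = rng.normal(size=400) * np.hanning(400)  # stand-in windowed frame
print(chirp_group_delay(frame).shape)           # one value per rfft bin
```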
group delay functions computed on a circle other than the unit circle in the z-plane, which can be guaranteed to be spike-free. We finally present one application in feature extraction for automatic speech recognition (ASR). We show that chirp group delay representations are potentially useful for improving ASR performance. (c) 2007 Elsevier B.V. All rights reserved. C1 Fac Polytech Mons, TCTS Lab, B-7000 Mons, Belgium. RP Bozkurt, B (reprint author), Izmir Inst Technol, Dept Elect Engn, Gulbahce Koyu, Izmir, Turkey. EM barisbozkurt@iyte.edu.tr; Laurent.Couvreur@fpms.ac.be; Thierry.Dutoit@fpms.ac.be CR Abel N., 1826, J REINE ANGEW MATH, V1, P65 Alsteris L. D., 2004, P INT C AC SPEECH SI, P573 ANDERSEN TH, 2001, P INT COMP MUS C ICM BANNO H, 2001, P INT C AC SPEECH SI, P3297 BOITE JM, SPEECH TRAINING RECO Bourlard Ha, 1994, CONNECTIONIST SPEECH BOZKURT B, 2005, THESIS FACULTE POLYT Bozkurt B., 2003, P ISCA ITRW VOQUAL03, P21 Bozkurt B, 2005, IEEE SIGNAL PROC LET, V12, P344, DOI 10.1109/LSP.2005.843770 BOZKURT B, 2004, P INT C SPOK LANG PR BOZKURT B, P EUR SIGN PROC C EU Chavez S, 2002, IEEE T MED IMAGING, V21, P966, DOI 10.1109/TMI.2002.803106 Chen CW, 2002, IEEE T GEOSCI REMOTE, V40, P1709, DOI 10.1109/TGRS.2002.802453 Costantini M, 1999, IEEE T GEOSCI REMOTE, V37, P452, DOI 10.1109/36.739085 Doval B., 2003, P ISCA ITRW VOQUAL G, P15 EDELMAN A, 1995, MATH COMPUT, V64, P763, DOI 10.2307/2153450 FANT G, 1985, SPEECH T LAB Q REP R, V2, P121 Fant G., 1960, ACOUSTIC THEORY SPEE FROLOVA GV, 1996, P ULTR S, V2, P1371 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Harris JFredric, 1978, P IEEE, V66 HEDERLIN P, 1988, P INT C AC SPEECH SI, V1, P339 HEGDE RM, 2004, P INT C SPOK LANG PR HIRSCH HG, 2000, P ISCA TUR RES WORKS HUANG X, 2001, SPOKEN LANGUAGE PROC JUNQUA JC, 2000, ROBUST SPEECH PROCES KAWAHARA H, 2001, P INT WORKSH MOD AN KAWAHARA H, 2000, P INT C SPOK LANG PR Li D., 2002, P INT C ROB AUT ICRA, V1, P19 Liu L, 1997, SPEECH COMMUN, V22, P403, DOI 10.1016/S0167-6393(97)00054-X MARQUES JS, 1989, THESIS TECHNICAL U L MARQUES JS, 1990, P IEEE INT C AC SPEE, V1, P17 MCAULAY RJ, 1991, SPEECH CODING SYNTHE, P165 MURTHY HA, 1991, SPEECH COMMUN, V10, P209, DOI 10.1016/0167-6393(91)90011-H MURTHY HA, 1989, ELECTRON LETT, V25, P1609, DOI 10.1049/el:19891080 MURTHY HA, 1991, SIGNAL PROCESS, V22, P259, DOI 10.1016/0165-1684(91)90014-A Oppenheim A. V., 1999, DISCRETE TIME SIGNAL OPPENHEI.AV, 1969, J ACOUST SOC AM, V45, P458, DOI 10.1121/1.1911395 Paliwal K. K., 2003, P EUR 2003, P2117 PATTERSON RD, 1987, J ACOUST SOC AM, V82, P1560, DOI 10.1121/1.395146 POBLOTH H, 1999, P INT C AC SPEECH SI, V1, P29 QUATIERI TF, 1979, IEEE T ACOUST SPEECH, V27, P328, DOI 10.1109/TASSP.1979.1163252 RABINER LR, 1969, AT&T TECH J, V48, P1249 SCHROEDER MR, 1986, J ACOUST SOC AM, V79, P1580, DOI 10.1121/1.393292 SCHROEDER MR, 1959, J ACOUST SOC AM, V31, P1597 Sitton GA, 2003, IEEE SIGNAL PROC MAG, V20, P27, DOI 10.1109/MSP.2003.1253552 Stylianou Y., 1996, THESIS ECOLE NATL SU SUN X, 1997, P INT C AC SPEECH SI, V3, P1691 von Helmholtz Hermann, 1912, SENSATIONS TONE VYACHESLAV V, 2003, OPT LETT, V28, P2156 YEGNANARAYANA B, 1984, IEEE T ACOUST SPEECH, V32, P610, DOI 10.1109/TASSP.1984.1164365 YEGNANARAYANA B, 1988, P EUR SIGN PROC C EU, V1, P447 Zhu D., 2004, P INT C AC SPEECH SI, P125 ZOLFAGHARI P, 2003, P EUR C SPEECH COMM, P2441 NR 54 TC 25 Z9 25 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD MAR PY 2007 VL 49 IS 3 BP 159 EP 176 DI 10.1016/j.specom.2006.12.004 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 160TT UT WOS:000245965900001 ER PT J AU Afghari, A AF Afghari, Akbar TI A sociopragmatic study of apology speech act realization patterns in Persian SO SPEECH COMMUNICATION LA English DT Article DE apology speech act; sociopragmatic; CCSARP; status; dominance ID ENGLISH AB This research study aimed at extracting and categorizing the range of strategies used in performing the speech act of apologizing in Persian. The first objective was to see if Persian apologies were formulaic in pragmatic structure as in English apologies are said to be [Holmes, J., 1990. Apologies in New Zealand English. Lang. Soc. 19, 155-200; Wolfson, N., Judd, E. (Eds.), 1983. Sociolinguistics and Language Acquisition. Rowley, Mass, Newbury House]. The other issue explored in this study was the investigation of the effect of the values assigned to the two context-external variables of social distance and social dominance on the frequency of the apology intensifiers. To this end, Persian apologetic utterances were collected via a Discourse Completion Test (DCT). The research findings indicated that Persian apologies are as formulaic in pragmatic structures. Also, the values assigned to the two context-external variables were found to have significant effect on the frequency of the intensifiers in different situations. (c) 2007 Elsevier B.V. All rights reserved. C1 Sheikh Bahaaee Univ, Dept English, Esfahan, Iran. RP Afghari, A (reprint author), Sheikh Bahaaee Univ, Dept English, Esfahan, Iran. EM afghary@yahoo.com CR Austin J. L., 1962, HOW DO THINGS WORDS Beebe L. M., 1996, SPEECH ACTS CULTURES, P65 Billmyer K, 2000, APPL LINGUIST, V21, P517, DOI 10.1093/applin/21.4.517 BLUMKULKA S, 1989, CROSSCULTURAL PRAGMA BLUMKULKA S, 1984, APPL LINGUIST, V5, P196, DOI 10.1093/applin/5.3.196 BLUMKULKA S, 1982, APPL LINGUIST, V3, P29, DOI 10.1093/applin/3.1.29 BLUMKULKA S, SOCIOLINGUISTICS LAN, P83 Blum-Kulka Shoshana, 1989, CROSS CULTURAL PRAGM Boxer Diana, 2002, APPL SOCIOLINGUISTIC BROWN P, 1989, POLITENESS SOME UNIV Duranti Alessandro, 1997, LINGUISTIC ANTHROPOL ERVINTRIPP S, 1976, LANG SOC, V5, P25 GOLATO A, 2000, AAAL ANN C CROSS BOU Goody E., 1978, QUESTIONS POLITENESS GREEN G, 1975, SPEECH ACTS, V3 Hinkel E, 1997, APPL LINGUIST, V18, P1, DOI 10.1093/applin/18.1.1 HOLMES J, 1990, LANG SOC, V19, P155 HYMES D, 1967, J SOC ISSUES, V33, P8 Kasper G., 2000, CULTURALLY SPEAKING, P316 MANES J, 1981, CONVERSATIONAL ROUTI MARKEE N, 2002, ILTA AAAL PLEN PAN A OLSHTAIN E, 1989, CROSS CULTURAL PRAGM, P55 Olshtain E., 1983, SOCIOLINGUISTICS LAN RINTELL E, 1989, CROSSCULTURAL PRAGMA SEARLE J, 1975, SPEECH ACTS, V3 Searle John R., 1969, SPEECH ACTS Stockwell P., 2002, SOCIOLINGUISTICS TROSBORG A, 1987, J PRAGMATICS, V11, P147, DOI 10.1016/0378-2166(87)90193-7 van Ek J. A., 1976, THRESHOLD LEVEL MODE WIERZBICKA A, 1985, J PRAGMATICS, V9, P145, DOI 10.1016/0378-2166(85)90023-2 WOLFSON N, 1989, CROSSCULTURAL PRAGMA Wolfson N., 1983, SOCIOLINGUISTICS LAN Yuan Y, 2001, J PRAGMATICS, V33, P271, DOI 10.1016/S0378-2166(00)00031-X NR 33 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAR PY 2007 VL 49 IS 3 BP 177 EP 185 DI 10.1016/j.specom.2007.01.003 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 160TT UT WOS:000245965900002 ER PT J AU Drepper, FR AF Drepper, Friedhelm R. TI Voiced speech as response of a self-consistent fundamental drive SO SPEECH COMMUNICATION LA English DT Article DE signal analysis; non-stationary acoustic objects; part-tone phases; fundamental drive; cascaded response; generalized synchronization; voiced continuants ID SIGNAL REPRESENTATION; SYNCHRONIZATION; PITCH AB Voiced segments of speech are assumed to be composed of non-stationary acoustic objects which can be described as stationary response of a non-stationary fundamental drive (FD) process and which are furthermore suited to reconstruct the hidden FD by using a voice adapted (self-consistent) part-tone decomposition of the speech signal. The universality and robustness of human pitch perception encourage the reconstruction of a band-limited FD in the frequency range of the pitch. The self-consistent decomposition of voiced continuants generates several part-tones which can piecewise be confirmed to be topologically equivalent to corresponding acoustic modes of the excitation on the transmitter side. As topologically equivalent image of a glottal master oscillator, the self-consistent FD is suited to serve as low frequency part of the basic time-scale separation of auditive perception and to describe the broadband voiced excitation as entrained (synchronized) and/or modulated primary response. Being guided by the acoustic correlates of pitch and loudness perception, the time-scale separation avoids the conventional assumption of stationary excitation and represents the basic decoding step of an advanced precision transmission protocol of self-consistent (voiced) acoustic objects. The present study is focussed on the adaptation of the trajectories (contours) of the centre filter frequency of the part-tones to the chirp of the glottal master oscillator. (c) 2007 Elsevier B.V. All rights reserved. C1 Forschungszentrum Julich GmbH, D-52425 Julich, Germany. RP Drepper, FR (reprint author), Forschungszentrum Julich GmbH, D-52425 Julich, Germany. EM f.drepper@fz-juelich.de CR AFRAIMOVICH VS, 1986, RADIOPHYS QUANT EL, V29, pFF795 DREPPER FR, 2004, MAVEBA 2003 DREPPER FR, 2005, INTERSPEECH 2005 LIS DREPPER FR, 2005, FORTSCHRITTE AKUSTIK Drepper FR, 2000, PHYS REV E, V62, P6376, DOI 10.1103/PhysRevE.62.6376 DREPPER FR, 2006, FORTSCHRITTE AKUSTIK DREPPER FR, 2005, LNAI, V3817, P125 GABOR D, 1947, NATURE, V159, P591, DOI 10.1038/159591a0 Gold B., 2000, SPEECH AUDIO SIGNAL Grice M., 2006, ENCY LANGUAGE LINGUI, V5, P778 HANQUINET J, 2005, INTERSPEECH 2005 LIS HEINBACH W, 1988, ACUSTICA, V67, P113 HERZEL H, 1994, J SPEECH HEAR RES, V37, P1008 Hohmann V, 2002, ACTA ACUST UNITED AC, V88, P433 Jackson PJB, 2001, IEEE T SPEECH AUDI P, V9, P713, DOI 10.1109/89.952489 Kantz H., 1997, NONLINEAR TIME SERIE Kubin G., 1995, SPEECH CODING SYNTHE, P557 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 MCADAMS S, 1989, J ACOUST SOC AM, V86, P2148, DOI 10.1121/1.398475 MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 Moore B. C.
J., 1989, INTRO PSYCHOL HEARIN PALIWAL KK, 2003, EUROSPEECH 2003 GENF PATTERSON RD, 2000, J ACOUST SOC JPN, V21 RAMEAU JO, 1737, COMPLETE THEORETICAL, V3 RULKOV NF, 1995, PHYS REV E, V51, P980, DOI 10.1103/PhysRevE.51.980 SCHOENTGEN J, 1990, SPEECH COMMUN, V9, P189, DOI 10.1016/0167-6393(90)90056-F SCHOMBURG I, 2000, GENE FUNCT DIS, V1, P109, DOI 10.1002/1438-826X(200010)1:3/4<109::AID-GNFD109>3.0.CO;2-O Schroeder M. R., 1999, COMPUTER SPEECH Seebeck August, 1844, POGGENDORFS ANN PHYS, V63, P353 TEAGER HM, 1990, P NATO ASI SPEECH PR, P241 TERHARDT E, 1998, AKUSTISCHE KOMMUNIKA TERHARDT E, 1982, J ACOUST SOC AM, V71, P679, DOI 10.1121/1.387544 Vary P., 1998, DIGITALE SPRACHSIGNA NR 33 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2007 VL 49 IS 3 BP 186 EP 200 DI 10.1016/j.specom.2007.01.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 160TT UT WOS:000245965900003 ER PT J AU Shami, M Verhelst, W AF Shami, Mohammad Verhelst, Werner TI An evaluation of the robustness of existing supervised machine learning approaches to the classification of emotions in speech SO SPEECH COMMUNICATION LA English DT Article DE emotion recognition; analysis of intent; vocal expressiveness; speech processing ID DIRECTED SPEECH; RECOGNITION AB In this study, the robustness of approaches to the automatic classification of emotions in speech is addressed. Among the many types of emotions that exist, two groups of emotions are considered, adult-to-adult acted vocal expressions of common types of emotions like happiness, sadness, and anger and adult-to-infant vocal expressions of affective intents also known as "motherese". Specifically, we estimate the generalization capability of two feature extraction approaches, the approach developed for Sony's robotic dog AIBO (AIBO) and the segment-based approach (SBA) of [Shami, M., Karnel, M., 2005. Segment-based approach to the recognition of emotions in speech. In: IEEE Conf. on Multimedia and Expo (ICME05), Amsterdam, The Netherlands]. Three machine learning approaches are considered, K-nearest neighbors (KNN), Support vector machines (SVM) and Ada-boosted decision trees and four emotional speech databases are employed, Kismet, BabyEars, Danish, and Berlin databases. Single corpus experiments show that the considered feature extraction approaches AIBO and SBA are competitive on the four databases considered and that their performance is comparable with previously published results on the same databases. The best choice of machine learning algorithm seems to depend on the feature extraction approach considered. Multi-corpus experiments are performed with the Kismet-BabyEars and the Danish-Berlin database pairs that contain parallel emotional classes. Automatic clustering of the emotional classes in the database pairs shows that the patterns behind the emotions in the Kismet-BabyEars pair are less database dependent than the patterns in the Danish-Berlin pair. In off-corpus testing the classifier is trained on one database of a pair and tested on the other. This provides little improvement over baseline classification. 
In integrated corpus testing, however, the classifier is machine learned on the merged databases and this gives promisingly robust classification results, which suggest that emotional corpora with parallel emotion classes recorded under different conditions can be used to construct a single classifier capable of distinguishing the emotions in the merged corpora. Such a classifier is more robust than a classifier learned on a single corpus as it can recognize more varied expressions of the same emotional classes. These findings suggest that the existing approaches for the classification of emotions in speech are efficient enough to handle larger amounts of training data without any reduction in classification accuracy. (c) 2007 Elsevier B.V. All rights reserved. C1 Vrije Univ Brussel VIB, Dept ETRO DSSP, Lab Digital Speech & Audio Proc, Interdisciplinary Inst Broadband Technol, B-1050 Brussels, Belgium. RP Verhelst, W (reprint author), Vrije Univ Brussel VIB, Dept ETRO DSSP, Lab Digital Speech & Audio Proc, Interdisciplinary Inst Broadband Technol, Pleinlaan 2, B-1050 Brussels, Belgium. EM wverhels@etro.vub.ac.be CR Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Batliner A., 2005, INTERSPEECH 2005, P489 BOERSMA P, 1996, PRAAT SYSTEM DOING P, V132 Boersma P., 1993, P I PHONETIC SCI, V17, P97 Breazeal C, 2002, AUTON ROBOT, V12, P83, DOI 10.1023/A:1013215010749 Cichosz J, 2005, INTERSPEECH 2005 LIS, P477 DUSAN S, 2005, INTERSPEECH 2005 LIS Engberg I. S., 1996, DOCUMENTATION DANISH Fernald A., 1992, ADAPTED MIND EVOLUTI FERNANDEZ R, 2005, INTERSPEECH, P473 Frank E, 2003, APPL PROPOSITIONAL L HAMMAL Z, 2005, P EUSIPCO 2005 ANT T Katz GS, 1996, CHILD DEV, V67, P205, DOI 10.1111/j.1467-8624.1996.tb01729.x Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2 Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI 10.1016/S1071-581(02)00141-6 Paeschke A., 2000, P ISCA WORKSH SPEECH, P75 Picard R. W., 1997, AFFECTIVE COMPUTING ROTARU M, 2005, INTERSPEECH 2005 SCHULLER B, 2003, IEEE C MULT EXP ICME, V1, P401 Schuller B., 2005, IEEE INT C MULT EXP, P864 Shami M., 2006, P 2 ANN IEEE BENELUX SHAMI M, 2005, IEEE C MULT EXP ICME SHRIBERG E, 2005, EUROSPEECH 2005 LISB Slaney M, 2003, SPEECH COMMUN, V39, P367, DOI 10.1016/S0167-6393(02)00049-3 ten Bosch L, 2003, SPEECH COMMUN, V40, P213, DOI 10.1016/S0167-6393(02)00083-3 Ververidis D, 2004, P 12 EUR SIGN PROC C, P341 Witten I. H., 2000, DATA MINING PRACTICA NR 27 TC 52 Z9 53 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2007 VL 49 IS 3 BP 201 EP 212 DI 10.1016/j.specom.2007.01.006 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 160TT UT WOS:000245965900004 ER PT J AU Escudero-Mancebo, D Cardenoso-Payo, V AF Escudero-Mancebo, David Cardenoso-Payo, Valentin TI Applying data mining techniques to corpus based prosodic modeling SO SPEECH COMMUNICATION LA English DT Article DE prosody; intonation modeling; data mining; text-to-speech; F0 contours ID SPEECH SYNTHESIS; INTONATION AB This article presents MEMOInt, a methodology to automatically extract the intonation patterns which characterize a given corpus, with applications in text-to-speech systems. Easy-to-understand information about the form of the characteristic patterns found in the corpus can be obtained from MEMOInt in a way which allows easy comparison with other proposals.
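As an illustration of the dictionary-of-patterns representation described in this abstract, here is a minimal Python sketch; the prosodic-feature key (stress, phrase position, syllable count), the contour parameterization and the back-off order are assumptions made for the example, not MEMOInt's actual design.

from statistics import mean

# Dictionary of parameterized F0 pattern classes, keyed by a sequence of
# prosodic features (hypothetical key: stress, phrase position, syllables).
# Each value holds the parameter vectors of the contour classes observed.
f0_patterns = {
    ("stressed", "phrase-final", 3): [[110.0, 145.0, 98.0]],
    ("stressed", "phrase-medial", 3): [[120.0, 135.0, 118.0],
                                       [118.0, 130.0, 117.0]],
}

def lookup_contour(key):
    # Back off to coarser prosodic keys when the exact context is unseen,
    # one simple way to cope with the data sparseness discussed below.
    for k in (key, key[:2], key[:1]):
        hits = [c for kk, cs in f0_patterns.items()
                if kk[:len(k)] == k for c in cs]
        if hits:
            return [mean(vals) for vals in zip(*hits)]
    return None

print(lookup_contour(("stressed", "onset", 2)))  # backs off to ("stressed",)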
A visual representation of the relationship between the set of prosodic features which could have been selected to label the corpus and the intonation contour patterns is also easy to obtain. The particular function-form correspondence associated with the given corpus is represented by means of a list of dictionaries of classes of parameterized F0 patterns, where the access key is given by a sequence of prosodic features. MEMOInt can also be used to obtain valuable information about the relative impact of the use of different parameterization techniques of F0 contours or of different types of intonation units and information about the relevance of different prosodic features. The methodology has been specifically designed to provide a successful strategy to solve the data sparseness problem which usually affects corpora as a consequence of the inherent high variability of the intonation phenomenon. (c) 2007 Elsevier B.V. All rights reserved. C1 Univ Valladolid, Dept Informat, E-47011 Valladolid, Spain. RP Escudero-Mancebo, D (reprint author), Univ Valladolid, Dept Informat, Campus Miguel Delibes S-N, E-47011 Valladolid, Spain. EM descuder@infor.uva.es RI Escudero, David/K-7905-2014; Cardenoso-Payo, Valentin/L-8755-2014 OI Escudero, David/0000-0003-0849-8803; Cardenoso-Payo, Valentin/0000-0003-1460-158X CR AARON A, 2005, CONVERSATIONAL COMPU, P64 Alarcos E., 2000, GRAMATICA LENGUA ESP Allen J., 1987, TEXT SPEECH MITALK S Bartels R, 1986, INTRO SPLINES USE CO Beckman M.E., 2000, INTONATION SPANISH T Botinis A, 2001, SPEECH COMMUN, V33, P263, DOI 10.1016/S0167-6393(00)00060-1 BULYKO I, 1999, P ICPHS 99, P81 Campbell N., 2004, J PHONET SOC JPN, V8, P9 CARDENOSO V, 2004, P ICASSP, V1, P665 DALESSANDRO C, 1995, COMPUT SPEECH LANG, V9, P257, DOI 10.1006/csla.1995.0013 Diaz FC, 2006, SPEECH COMMUN, V48, P941, DOI 10.1016/j.specom.2005.12.004 Eide E., 2003, P ICASSP HONG KONG A, V1, P708 EMERARD F, 1992, TALKING MACHINES THE, P2265 ESCUDERO D, 2005, INTERSPEECH 2005, P3261 ESCUDERO D, 2000, P ICSLP 2002, P1162 Escudero D., 2002, P ICASSP 2002, V1, P481 ESCUDERO D, 2003, P EUROSPEECH 2003, P2309 ESCUDERO D, 2004, INTERSPEECH 2004, P745 ESCUDERO D, 2002, THESIS U VALLADOLID Face Timothy, 2001, THESIS OHIO STATE U Farin G., 1996, CURVES SURFACES CAGD FERRER A, 2001, THESIS U POLITECNICA Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 GARRIDO JM, 1996, THESIS U BARCELONA E GUTIERREZ JM, 2001, P ICASSP, V2, P821 Hart J't, 1990, PERCEPTUAL STUDY INT HERMES DJ, 1994, J SPEECH LANG HEAR R, V41, P73 HOLM B, 2003, THESIS I NATL POLYTE Jain A., 1999, ACM COMPUT SURV, V31 Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X Lee S, 2001, COMPUT SPEECH LANG, V15, P75, DOI 10.1006/csla.2000.0158 LOBANOV BM, 1987, P 11 INT C PHON SCI, P61 Lopez-Gonzalo E., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat.
No.96TH8206), DOI 10.1109/ICSLP.1996.607870 NAVARROTOMAS T, 1944, MANUAL ENTONACION ES Pena D, 1999, ESTADISTICA MODELOS Pierrehumbert J, 1980, PHONOLOGY PHONETICS Plass M., 1983, Computer Graphics, V17 Quilis Antonio, 1993, TRATADO FONOLOGIA FO SAKAI S, 2005, P ICASSP 2005, V1, P277, DOI 10.1109/ICASSP.2005.1415104 Sakurai A, 2003, SPEECH COMMUN, V40, P535, DOI 10.1016/S0167-6393(02)00177-2 SANTEN JPH, 2000, QUALITATIVE MODEL F0, P269 Silverman K., 1992, P INT C SPOK LANG PR, P867 SOSA JM, 1999, ENTONACION ESPANOLA SPROAT R, 1995, APPROACH TEXT SPEECH, P611 Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 Taylor P.A., 1995, SPEECH COMMUN, V15, P169 TOKUDA K, 2000, P ICASSP, V3, P1315 Traber C, 1992, TALKING MACHINES THE, P287 VALLEJO JA, 1998, THESIS U POLITECNICA Veronis J, 1998, SPEECH COMMUN, V26, P233, DOI 10.1016/S0167-6393(98)00063-6 Webb A. R., 2002, STAT PATTERN RECOGNI, V2nd NR 51 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2007 VL 49 IS 3 BP 213 EP 229 DI 10.1016/j.specom.2007.01.008 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 160TT UT WOS:000245965900005 ER PT J AU Rojc, M Kacic, Z AF Rojc, Matej Kacic, Zdravko TI Time and space-efficient architecture for a corpus-based text-to-speech synthesis system SO SPEECH COMMUNICATION LA English DT Article DE corpus-based text-to-speech system; finite-state machines (FSM); heterogeneous-relation graphs (HRG); queuing mechanism AB This paper proposes a time and space-efficient architecture for a text-to-speech synthesis system (TTS). The proposed architecture can be efficiently used in those applications with unlimited domain, requiring multilingual or polyglot functionality. The integration of a queuing mechanism, heterogeneous graphs and finite-state machines gives a powerful, reliable and easily maintainable architecture for the TTS system. Flexible and language-independent framework efficiently integrates all those algorithms used within the scope of the TTS system. Heterogeneous relation graphs are used for linguistic information representation and feature construction. Finite-state machines are used for time and space-efficient representation of language resources, for time and space-efficient lookup processes, and the separation of language-dependent resources from a language-independent TTS engine. Its queuing mechanism consists of several dequeue data structures and is responsible for the activation of all those TTS engine modules having to process the input text. In the proposed architecture, all modules use the same data structure for gathering linguistic information about input text. All input and output formats are compatible, the structure is modular and interchangeable, it is easily maintainable and object oriented. The proposed architecture was successfully used when implementing the Slovenian PLATTOS corpus-based TTS system, as presented in this paper. (c) 2007 Elsevier B.V. All rights reserved. C1 Univ Maribor, Fac Elect Engn & Comp Sci, SLO-2000 Maribor, Slovenia. RP Rojc, M (reprint author), Univ Maribor, Fac Elect Engn & Comp Sci, Smetanova Ulica 17, SLO-2000 Maribor, Slovenia. EM matej.rojc@uni-mb.si; kacic@uni-mb.si CR Aho A. Y., 1974, DESIGN ANAL COMPUTER Arnaud A., 2000, STUDY IMPLEMENTATION Black A. 
W., 1997, P EUROSPEECH 97, V2, P601 BLACK AW, 1995, EUROSPEECH95 MADR SP, V1, P581 BRILL E, 1993, THESIS BULYKO I, 2001, P EUROSPEECH CAMPBELL N, 1996, SP9607 Clark R. A. J., 2004, P 5 ISCA WORKSH SPEE DACIUK J, 1998, THESIS TECHNICAL U G EMMANUEL R, 1997, FINITE STATE LANGUAG EMMANUEL R, 1995, DETERMINISTIC PART S HOLZAPFEL M, 2000, THESIS Horowitz E., 1996, COMPUTER ALGORITHMS KACIC Z, 1995, ONOMASTICA SLOVENIAN Kaplan R. M., 1994, COMPUTATIONAL LINGUI, V20 KARTTUNEN L, 1992, P 15 INT C COMP LING KIRAZ GA, 1998, P 3 INT WORKSH SPPEC MEHRYAR M, 1997, COMPUTATIONAL LIGUIS, V23, P269 MOHRI M, 1996, 34 M ASS COMP LING A MOHRI M, 1995, NATURAL LANGUAGE ENG, V1 MOHRI M, 1996, ECAI 96 WORKSH BUD H Olshen R., 1984, CLASSIFICATION REGRE, V1st Ostendorf M., 2001, P ICASSP Pagel V., 1998, P ICSLP ROJC M, 2003, THESIS ROJC M, 2000, THESIS ROJC M, 2000, P 2 LANG RES EV C LR Silberztein M., 1993, DICT ELECT ANAL AUTO SPROAT R, 1998, MULTILINGUAL TEXT SP Sproat R., 1996, P 34 ANN M ASS COMP, P215, DOI 10.3115/981863.981892 STROM V, 1998, THESIS BONN Syrdal A. K., 2000, P ICSLP BEIJ CHIN, P410 Taylor P, 2001, SPEECH COMMUN, V33, P153, DOI 10.1016/S0167-6393(00)00074-1 TAYLOR P, 2000, J ACOUSTICAL SOC AM Taylor P. A., 1998, P 3 ESCA WORKSH SPEE, P147 Watson B. W., 1995, THESIS EINDHOVEN U T NR 36 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2007 VL 49 IS 3 BP 230 EP 249 DI 10.1016/j.specom.2007.01.007 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 160TT UT WOS:000245965900006 ER PT J AU Bates, RA Ostendorf, M Wright, RA AF Bates, Rebecca A. Ostendorf, Mari Wright, Richard A. TI Symbolic phonetic features for modeling of pronunciation variation SO SPEECH COMMUNICATION LA English DT Article DE linguistic features; pronunciation modeling; conversational speech ID SPEECH RECOGNITION; LANDMARKS AB A significant source of variation in spontaneous speech is due to intra-speaker pronunciation changes, often realized as small feature changes. e.g., nasalized vowels or affricated stops, rather than full phone transformations. Previous computational modeling of pronunciation variation has typically involved transformations from one phone to another, in part because most speech processing systems use phone-based units. Here, a phonetic-feature-based prediction model is presented where phones are represented by a vector of symbolic features that can be on, off, unspecified or unused. Feature interaction is examined using different groupings of possibly dependent features, and a hierarchical grouping with conditional dependencies led to the best results. Feature-based models are shown to be more efficient than phone-based models, in the sense of requiring fewer parameters to predict variation while giving smaller distance and perplexity values when comparing predictions to the hand-labeled reference. A parsimonious model is better suited to incorporating new conditioning factors, and this work investigates high-level information sources, including both text (syntax, discourse) and prosody cues. Experiments show that feature-based models benefit from prosody cues, but not text, and that phone-based models do not benefit from any of the high-level cues explored here. (C) 2006 Elsevier B.V. All rights reserved. C1 Univ Washington, Dept Elect Engn, Seattle, WA 98195 USA. Univ Washington, Dept Linguist, Seattle, WA 98195 USA. 
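The four-valued symbolic feature vectors at the heart of this model can be made concrete with a short sketch; the feature names and the values assigned below are invented for illustration and do not reproduce the paper's feature inventory.

from enum import Enum

class FV(Enum):          # possible values of a symbolic phonetic feature
    ON = "+"
    OFF = "-"
    UNSPECIFIED = "?"    # value free to vary for this phone
    UNUSED = "x"         # feature not applicable to this phone

# Toy feature vectors for a plain vowel and its nasalized variant.
a_vowel = {"vocalic": FV.ON, "nasal": FV.OFF, "voice": FV.ON}
a_nasalized = {"vocalic": FV.ON, "nasal": FV.ON, "voice": FV.ON}

def feature_distance(ref, hyp):
    # Count mismatches, skipping UNSPECIFIED/UNUSED slots: a nasalized
    # vowel is one feature change away, not a whole-phone substitution.
    skip = {FV.UNSPECIFIED, FV.UNUSED}
    return sum(1 for f, rv in ref.items()
               if rv not in skip
               and (hv := hyp.get(f, FV.UNUSED)) not in skip
               and rv != hv)

print(feature_distance(a_vowel, a_nasalized))  # -> 1 (only 'nasal' differs)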
RP Bates, RA (reprint author), Minnesota State Univ, Dept Comp & Informat Sci, Mankato, MN 56001 USA. EM bates@mnsu.edu CR ALLEN J, 1997, AAAI FALL S COMM ACT BATES R, 1998, THESIS U WASHINGTON BATES R, 2001, P ISCA WORKSH PROS S, P17 Bates R. A., 2002, P ISCA TUT RES WORKS, P42 BITAR N, 1997, THESIS BOSTON U MA BITAR NN, 1997, P EUR C SPEECH COMM, P1239 Chomsky N., 1968, SOUND PATTERN ENGLIS Clements G.N., 1995, HDB PHONOLOGICAL THE, P245 COHEN MH, 1989, THESIS U CALIFORNIA DENG L, 1992, J ACOUST SOC AM, V92, P3058, DOI 10.1121/1.404202 Deng L, 1998, SPEECH COMMUN, V24, P299, DOI 10.1016/S0167-6393(98)00023-5 Deng L., 1994, P ICASSP 94, pI DESHMUKH O, 2002, P ICASSP, pI593 Dilley L, 1996, J PHONETICS, V24, P423, DOI 10.1006/jpho.1996.0023 EIDE E, 2001, P EUR AALB DENM, P1613 ESPYWILSON CY, 1994, J ACOUST SOC AM, V96, P65, DOI 10.1121/1.410375 Finke M., 1997, P EUR C SPEECH COMM, P2379 Fosler-Lussier E., 1998, P ESCA WORKSH MOD PR, P35 FOSLERLUSSIER E, 1999, P EUR BUD HUNG SEPT, P463 Fosler-Lussier J. E., 1999, THESIS U CALIFORNIA Fukuda T., 2003, P ICASSP 03, V2, P25 Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858 GREENBERG S, 1996, SWITCHBOARD TRANSCRI Halle M, 2000, LINGUIST INQ, V31, P387, DOI 10.1162/002438900554398 Halle Morris, 1992, INT ENCY LINGUISTICS, V3, P207 HASEGAWAJOHNSON M, 2005, P ICASSP, V1, P213, DOI 10.1109/ICASSP.2005.1415088 Jakobson Roman, 1952, PRELIMINARIES SPEECH JUNEJA A, 2004, SOUND SENSE 50 YEARS Juneja A., 2004, THESIS U MARYLAND CO JURAFSKY D, 1998, P ICSLP JURAFSKY D, 1997, SWBD DISCOURSE LANGU JURAFSKY D, 1996, 9702 U COL I COGN SC Jurafsky D., 1997, P IEEE WORKSH SPEECH, P88 KING S, 1998, P ICSLP 98, P1031 KIRCHHOFF K, 1998, P ICSLP, P873 Kirchhoff K., 1999, THESIS U BIELEFELD G Kirchhoff K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607260 LADEFOGED PETER, 2001, COURSE PHONETICS LAHIRI A, 1999, P INT C PHON SCI, P715 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 Liu SA, 1996, J ACOUST SOC AM, V100, P3417, DOI 10.1121/1.416983 Livescu K., 2004, P ICSLP Livescu K., 2003, P EUR GEN SWITZ, V4, P2529 Lobacheva Y., 2000, THESIS BOSTON U MANUEL S, 1991, PERILUS, V14, P115 MARKOV K, 2004, P ICSLP Metze F., 2002, P INT C SPOK LANG PR, P2133 Mohri M, 2002, COMPUT SPEECH LANG, V16, P69, DOI 10.1006/csla.2001.0184 Morgan N., 1998, P ICASSP NOCK H, 2001, THESIS CAMBRIDGE U C OSTENDORF M, 1997, ECE970002 BOST U Pruthi T, 2004, SPEECH COMMUN, V43, P225, DOI 10.1016/j.specom.2004.06.001 REETZ H, 1999, P 14 INT C PHON SCI, V3, P1733 RILEY M, 1996, AUTOMATIC SPEECH SPE, P1 Riley M, 1999, SPEECH COMMUN, V29, P209, DOI 10.1016/S0167-6393(99)00037-0 Ross K, 1996, COMPUT SPEECH LANG, V10, P155, DOI 10.1006/csla.1996.0010 SAGEY E, 1991, REPRESENTATION FEATU Salomon A, 2004, J ACOUST SOC AM, V115, P1296, DOI 10.1121/1.1646400 Saraclar M, 2000, COMPUT SPEECH LANG, V14, P137, DOI 10.1006/csla.2000.0140 SHATTUCKHUFNAGE.S, 2002, COMMUNICATION Soltau H., 2002, P ICSLP, P841 Sonmez K., 1998, P INT C SPOK LANG PR, P3189 STEVENS K, 1992, P ICSLP Stevens K.N., 1998, ACOUSTIC PHONETICS Stevens KN, 2002, J ACOUST SOC AM, V111, P1872, DOI 10.1121/1.1458026 Stuker S., 2003, P 8 EUR C SPEECH COM, P1033 TAJCHMAN G, 1995, P EUR C SPEECH COMM, P2247 Talkin D., 1995, SPEECH CODING SYNTHE Wakita Y, 1999, COMPUT SPEECH LANG, V13, P143, DOI 10.1006/csla.1998.0116 WEINTRAUB M, 1996, J HOPK U CTR LANG SP Wightman CW, 1994, IEEE T SPEECH AUDI P, V2, P469, DOI 10.1109/89.326607 ZWEIG G, 1998, THESIS UC BERKELEY C NR 72 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2007 VL 49 IS 2 BP 83 EP 97 DI 10.1016/j.specom.2006.10.007 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 149OB UT WOS:000245155200001 ER PT J AU Morrison, D Wang, RL De Silva, LC AF Morrison, Donn Wang, Ruili De Silva, Liyanage C. TI Ensemble methods for spoken emotion recognition in call-centres SO SPEECH COMMUNICATION LA English DT Article DE affect recognition; emotion recognition; ensemble methods; speech processing; speech databases ID EXPRESSION; SPEECH; COMMUNICATION; FEATURES; VOICE; PITCH AB Machine-based emotional intelligence is a requirement for more natural interaction between humans and computer interfaces and a basic level of accurate emotion perception is needed for computer systems to respond adequately to human emotion. Humans convey emotional information both intentionally and unintentionally via speech patterns. These vocal patterns are perceived and understood by listeners during conversation. This research aims to improve the automatic perception of vocal emotion in two ways. First, we compare two emotional speech data sources: natural, spontaneous emotional speech and acted or portrayed emotional speech. This comparison demonstrates the advantages and disadvantages of both acquisition methods and how these methods affect the end application of vocal emotion recognition. Second, we look at two classification methods which have not been applied in this field: stacked generalisation and unweighted vote. We show how these techniques can yield an improvement over traditional classification methods. (C) 2006 Elsevier B.V. All rights reserved. 
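For readers unfamiliar with the two ensemble schemes investigated here, a compact sketch follows. It assumes scikit-learn-style base learners and synthetic data; the in-sample meta-learner score at the end is only a sanity check, not the paper's evaluation protocol.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=20, random_state=0)
bases = [KNeighborsClassifier(5), SVC(probability=True, random_state=0)]

# Unweighted vote: each base classifier contributes one equal vote.
votes = np.stack([cross_val_predict(c, X, y, cv=5) for c in bases])
vote_pred = np.apply_along_axis(lambda col: np.bincount(col).argmax(),
                                0, votes)

# Stacked generalisation: a meta-learner is trained on the cross-validated
# class-probability outputs of the base classifiers.
meta_X = np.hstack([cross_val_predict(c, X, y, cv=5, method="predict_proba")
                    for c in bases])
meta = LogisticRegression().fit(meta_X, y)
print("vote acc:", (vote_pred == y).mean(), "stack acc:", meta.score(meta_X, y))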
C1 Massey Univ Turutea, Inst Informat Sci & Technol, Palmerston North, New Zealand. RP Wang, RL (reprint author), Massey Univ Turutea, Inst Informat Sci & Technol, Private Bag 11222, Palmerston North, New Zealand. EM d.morrison@massey.ac.nz; r.wang@massey.ac.nz CR AHA DW, 1991, MACH LEARN, V6, P37, DOI 10.1007/BF00153759 Ang J., 2002, P INT C SPOK LANG PR Anton H., 2000, ELEMENTARY LINEAR AL Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Blum AL, 1997, ARTIF INTELL, V97, P245, DOI 10.1016/S0004-3702(97)00063-5 Breazeal C, 2002, AUTON ROBOT, V12, P83, DOI 10.1023/A:1013215010749 Breiman L, 2001, MACH LEARN, V45, P5, DOI 10.1023/A:1010933404324 Cleary J.G., 1995, ICML, P108 COVER TM, 1967, IEEE T INFORM THEORY, V13, P21, DOI 10.1109/TIT.1967.1053964 DAVITZ JR, 1994, COMMUNICATION EMOTIO Dellaert F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608022 Devillers L., 2002, P ISLE WORKSH DIAL T DIETERLE F, 2003, THESIS Dietterich T.G., 2002, HDB BRAIN THEORY NEU, P405 Elfenbein HA, 2002, PSYCHOL BULL, V128, P203, DOI 10.1037//0033-2909.128.2.203 EMMANOUILIDIS C, 1999, P INT JOINT C NEUR N, P4387 Fairbanks G, 1941, SPEECH MONOGR, V8, P85 Fairbanks G, 1939, SPEECH MONOGR, V6, P87 Fonagy I., 1981, RES ASPECTS SINGING, V33, P51 Fonagy I., 1963, Z PHONETIK SPRACHWIS, V16, P293 FONAGY I, 1978, LANG SPEECH, V21, P34 FRICK RW, 1986, AGGRESSIVE BEHAV, V12, P121, DOI 10.1002/1098-2337(1986)12:2<121::AID-AB2480120206>3.0.CO;2-F Goldberg D. E, 1989, GENETIC ALGORITHMS S Haykin S., 1999, NEURAL NETWORKS COMP, V2nd HUBER R, 1998, P WORKSH TEXT SPEECH, P223 HUBER R, 2000, P INT C SPOK LANG PR, V1, P665 JOHNSON WF, 1986, ARCH GEN PSYCHIAT, V43, P280 Lee C. M., 2004, P INT C SPOK LANG PR Liscombe J., 2005, INTERSPEECH, P1845 McGilloway S., 2000, P ISCA WORKSH SPEECH, P200 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 Nakatsu R., 1999, P INT C MULT COMP SY NWE TL, 2003, THESIS NATL U SINGAP Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2 O'Shaughnessy D, 2000, SPEECH COMMUNICATION, V2nd OSTER A, 1986, Q PROG STAT REP, V4, P79 Petrushin V.A., 2000, P 6 INT C SPOK LANG Platt J., 1998, ADV KERNEL METHODS S POLZIN T, 2000, P ISCA WORKSH SPEECH Rabiner L.R., 1978, DIGITAL PROCESSING S Salovey P., 2004, FEELINGS EMOTIONS, P321, DOI 10.1017/CBO9780511806582.019 SCHERER KR, 1996, P INT C SPOK LANG PR Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SEEWALD A, 2002, P 19 INT C MACH LEAR Shipp C. A., 2002, Information Fusion, V3, DOI 10.1016/S1566-2535(02)00051-9 Skinner ER, 1935, SPEECH MONOGR, V2, P81 Talkin D., 1995, SPEECH CODING SYNTHE, P495 VAFAIE H, 1992, P 4 INT C TOOLS ART Vapnik V., 1995, NATURE STAT LEARNING WILLIAMS CE, 1972, EMOTIONS SPEECH SOME WOLPERT DH, 1992, NEURAL NETWORKS, V5, P241, DOI 10.1016/S0893-6080(05)80023-1 YACOUB S, 2003, P EUR 2003 8 EUR C S NR 52 TC 73 Z9 80 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2007 VL 49 IS 2 BP 98 EP 112 DI 10.1016/j.specom.2006.11.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 149OB UT WOS:000245155200002 ER PT J AU Zakis, JA McDermott, HJ Vandali, AE AF Zakis, Justin A. McDermott, Hugh J. Vandali, Andrew E. 
TI A fundamental frequency estimator for the real-time processing of musical sounds for cochlear implants SO SPEECH COMMUNICATION LA English DT Article DE cochlear implant; fundamental frequency; pitch; sung vowels ID PITCH PERCEPTION; TEMPORAL CUES; SPEECH; RECOGNITION; RECIPIENTS; TRANSFORM; CHILDREN; HEARING; ABILITY; TONES AB A real-time fundamental frequency (F-0) estimator that operates in the frequency domain was developed for the processing of musical sounds in cochlear-implant (CI) sound processors. Its performance was evaluated with male and female sung-vowel stimuli in quiet, and in white noise and babble noise. The error rates of the developed F-0 estimator were much lower than those of a temporal F-0 estimator that was previously used in CI sound processors, and were comparable to the published error rates of F-0 estimators that were designed for other applications and evaluated with speech or musical instrument stimuli. It is envisaged that the experimental F-0 estimator will be used in advanced CI coding strategies to improve the perception of pitch by CI users, which may result in improved perception of musical sounds, as well as improved speech perception for tonal languages. (C) 2007 Elsevier B.V. All rights reserved. C1 Cooperat Res Ctr Cochlear Implant & Hearing Aid I, Melbourne, Vic 3002, Australia. Bion Ear Inst, Melbourne, Vic 3002, Australia. Univ Melbourne, Dept Otolaryngol, Melbourne, Vic 3002, Australia. RP Zakis, JA (reprint author), Dynam Hearing Pty Ltd, 2 Chapel St, Richmond, Vic 3121, Australia. EM jzakis@dynamichearing.com.au; hughm@unimelb.edu.au; avandali@bionicear.org CR Barry JG, 2002, CLIN LINGUIST PHONET, V16, P79, DOI 10.1080/02699200110109802 Bosman AJ, 1997, ACUSTICA, V83, P567 BROWN JC, 1991, J ACOUST SOC AM, V89, P425, DOI 10.1121/1.400476 BROWN JC, 1992, J ACOUST SOC AM, V92, P1394, DOI 10.1121/1.403933 BROWN JC, 1993, J ACOUST SOC AM, V94, P662, DOI 10.1121/1.406883 Ciocca V, 2002, J ACOUST SOC AM, V111, P2250, DOI 10.1121/1.1471897 Crochiere R. 
E., 1983, MULTIRATE DIGITAL SI de Cheveigne A, 2002, J ACOUST SOC AM, V111, P1917, DOI 10.1121/1.1458024 DUIFHUIS H, 1982, J ACOUST SOC AM, V71, P1568, DOI 10.1121/1.387811 Geurts L, 2001, J ACOUST SOC AM, V109, P713, DOI 10.1121/1.1340650 Gfeller Kate, 2002, Cochlear Implants Int, V3, P29, DOI 10.1002/cii.50 Gfeller K, 2000, J Am Acad Audiol, V11, P390 Gfeller Kate, 2002, Annals of Otology Rhinology and Laryngology, V111, P349 Gfeller K, 1998, J Am Acad Audiol, V9, P1 GFELLER K, 1991, J SPEECH HEAR RES, V34, P916 GOLDSTEI.JL, 1973, J ACOUST SOC AM, V54, P1496, DOI 10.1121/1.1914448 Green T, 2004, J ACOUST SOC AM, V116, P2298, DOI 10.1121/1.1785611 GRUENZ OO, 1949, J ACOUST SOC AM, V21, P487, DOI 10.1121/1.1906538 HERMES DJ, 1988, J ACOUST SOC AM, V83, P257, DOI 10.1121/1.396427 Hess W., 1983, PITCH DETERMINATION Hess W., 1992, ADV SPEECH SIGNAL PR, P3 Kong YY, 2004, EAR HEARING, V25, P173, DOI 10.1097/01.AUD.0000120365.97792.2F Leal MC, 2003, ACTA OTO-LARYNGOL, V123, P826, DOI 10.1080/00016480310000386 Lee KYS, 2002, INT J PEDIATR OTORHI, V63, P137, DOI 10.1016/S0165-5876(02)00005-8 Liu DJ, 2001, IEEE T SPEECH AUDI P, V9, P609 McDermott Hugh J, 2004, Trends Amplif, V8, P49, DOI 10.1177/108471380400800203 McDermott HJ, 1997, J ACOUST SOC AM, V101, P1622, DOI 10.1121/1.418177 Moore BC, 2005, SPRINGER HDB AUDITOR MOORE FR, 1990, ELEMENTS COMPUTER MU, P246 Nakatani T, 2004, J ACOUST SOC AM, V116, P3690, DOI 10.1121/1.1787522 PIJL S, 1995, J ACOUST SOC AM, V98, P886, DOI 10.1121/1.413514 SCHEFFERS MTM, 1983, J ACOUST SOC AM, V74, P1716, DOI 10.1121/1.390280 SKINNER MW, 1991, EAR HEARING, V12, P3, DOI 10.1097/00003446-199102000-00002 Vandali AE, 2005, J ACOUST SOC AM, V117, P3126, DOI 10.1121/1.1874632 ZAKIS JA, 2004, IRAN J AUDIOL, V3, P75 NR 35 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2007 VL 49 IS 2 BP 113 EP 122 DI 10.1016/j.specom.2006.12.001 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 149OB UT WOS:000245155200003 ER PT J AU Johnson, MT Yuan, XL Ren, Y AF Johnson, Michael T. Yuan, Xiaolong Ren, Yao TI Speech signal enhancement through adaptive wavelet thresholding SO SPEECH COMMUNICATION LA English DT Article DE adaptive wavelets; bionic wavelet transform; speech enhancement; denoising ID SPECTRAL AMPLITUDE ESTIMATOR; TEAGER ENERGY OPERATOR; AUDITORY MODEL; TRANSFORM; DECOMPOSITION; NOISE AB This paper demonstrates the application of the Bionic Wavelet Transform (BWT), an adaptive wavelet transform derived from a non-linear auditory model of the cochlea, to the task of speech signal enhancement. Results, measured objectively by Signal-to-Noise ratio (SNR) and Segmental SNR (SSNR) and subjectively by Mean Opinion Score (MOS), are given for additive white Gaussian noise as well as four different types of realistic noise environments. Enhancement is accomplished through the use of thresholding on the adapted BWT coefficients, and the results are compared to a variety of speech enhancement techniques, including Ephraim Malah filtering, iterative Wiener filtering, and spectral subtraction, as well as to wavelet denoising based on a perceptually scaled wavelet packet transform decomposition. 
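The coefficient-thresholding step shared by the wavelet methods compared in this paper can be sketched as follows; the sketch uses PyWavelets with a plain discrete wavelet transform and Donoho's universal threshold, so it stands in for the perceptually scaled and bionic variants rather than reproducing either.

import numpy as np
import pywt  # PyWavelets

def wavelet_denoise(x, wavelet="db8", level=4):
    # Soft-threshold the detail coefficients with the universal threshold;
    # the BWT method applies the same shrinkage idea to coefficients of an
    # adaptively scaled, cochlea-inspired transform instead of this DWT.
    coeffs = pywt.wavedec(x, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745   # MAD noise estimate
    thr = sigma * np.sqrt(2.0 * np.log(len(x)))      # universal threshold
    coeffs[1:] = [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)

def snr_db(ref, sig):
    return 10.0 * np.log10(np.mean(ref ** 2) / np.mean((sig - ref) ** 2))

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 4096)
clean = np.sin(2 * np.pi * 220 * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)
denoised = wavelet_denoise(noisy)[: clean.size]
print("input SNR:", snr_db(clean, noisy), "output SNR:", snr_db(clean, denoised))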
Overall results indicate that SNR and SSNR improvements for the proposed approach are comparable to those of the Ephraim Malah filter, with BWT enhancement giving the best results of all methods for the noisiest (-10 dB and -5 dB input SNR) conditions. Subjective measurements using MOS surveys across a variety of 0 dB SNR noise conditions indicate enhancement quality competitive with but still lower than results for Ephraim Malah filtering and iterative Wiener filtering, but higher than the perceptually scaled wavelet method. (C) 2007 Elsevier B.V. All rights reserved. C1 Marquette Univ, Dept Elect & Comp Engn, Milwaukee, WI 53233 USA. Motorola Elect Ltd, Beijing 100022, Peoples R China. RP Johnson, MT (reprint author), Marquette Univ, Dept Elect & Comp Engn, 1515 W Wisconsin Ave, Milwaukee, WI 53233 USA. EM mike.johnson@mu.edu; xlyuan0514@yahoo.com; yao.ren@mu.edu CR Bahoura M, 2001, IEEE SIGNAL PROC LET, V8, P10, DOI 10.1109/97.889636 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Chen SH, 2004, J VLSI SIG PROC SYST, V36, P125, DOI 10.1023/B:VLSI.0000015092.19005.62 Cohen I, 2001, EUR 2001 DENM Daubechies I., 1992, 10 LECT WAVELETS Debnath L., 2002, WAVELET TRANSFORMS T Deller J., 2000, DISCRETE TIME PROCES DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009 Donoho DL, 1995, J AM STAT ASSOC, V90, P1200, DOI 10.2307/2291512 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Fu Q., 2003, EUR GEN Gabor D., 1946, Journal of the Institution of Electrical Engineers. III. Radio and Communication Engineering, V93 Garofolo JS, 1993, TIMIT ACOUSTIC PHONE GIGUERE C, 1993, SPEECH PROCESSING US GIGUERE C, 1994, J ACOUST SOC AM, V95, P331 GUO D, 2000, INT C SIGN PROC BEIJ Haykin S., 1996, ADAPTIVE FILTER THEO Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949 Jaffard S., 2001, WAVELETS TOOLS SCI T Johnstone IM, 1997, J R STAT SOC B, V59, P319, DOI 10.1111/1467-9868.00071 Lu CT, 2003, SPEECH COMMUN, V41, P409, DOI 10.1016/S0167-6393(03)00011-6 *MATHWORKS INC, 2003, MATL Oppenheim G., 1995, WAVELETS STAT Walnut D. F., 2002, INTRO WAVELET ANAL YAO J, 2001, ACTIVE MODEL OTOACOU Yao J, 2001, IEEE T BIO-MED ENG, V48, P856 Yao J, 2002, IEEE T BIO-MED ENG, V49, P1299, DOI 10.1109/TMBE.2002.804590 Zheng L, 1999, IEEE T BIO-MED ENG, V46, P1098 NR 29 TC 22 Z9 30 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2007 VL 49 IS 2 BP 123 EP 133 DI 10.1016/j.specom.2006.12.002 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 149OB UT WOS:000245155200004 ER PT J AU Chen, B Loizou, PC AF Chen, Bin Loizou, Philipos C. TI A Laplacian-based MMSE estimator for speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE MMSE estimator; speech enhancement; Laplacian speech modeling; speech absence probability ID SPECTRAL AMPLITUDE ESTIMATOR AB This paper focuses on optimal estimators of the magnitude spectrum for speech enhancement. We present an analytical solution for estimating in the MMSE sense the magnitude spectrum when the clean speech DFT coefficients are modeled by a Laplacian distribution and the noise DFT coefficients are modeled by a Gaussian distribution. Furthermore, we derive the MMSE estimator under speech presence uncertainty and a Laplacian statistical model.
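Schematically, the quantity sought in this abstract is the posterior mean of the clean magnitude X_k given the noisy spectral observation Y_k; the notation below is generic rather than taken from the paper, with p(x) the prior induced by the Laplacian model of the speech DFT coefficients and p(Y_k | x) the Gaussian noise likelihood:

\hat{X}_k = E[X_k \mid Y_k]
          = \frac{\int_0^\infty x\, p(Y_k \mid X_k = x)\, p(x)\, dx}
                 {\int_0^\infty p(Y_k \mid X_k = x)\, p(x)\, dx},
\qquad
\hat{X}_k^{\mathrm{SPU}} = P(H_1 \mid Y_k)\, \hat{X}_k

Here H_1 denotes the speech-presence hypothesis, so the second expression weights the estimate by the a posteriori probability of speech presence, which is the speech-presence-uncertainty (SPU) variant referred to above.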
Results indicated that the Laplacian-based MMSE estimator yielded less residual noise in the enhanced speech than the traditional Gaussian-based MMSE estimator. Overall, the present study demonstrates that the assumed distribution of the DFT coefficients can have a significant effect on the quality of the enhanced speech. (C) 2006 Elsevier B.V. All rights reserved. C1 Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA. RP Loizou, PC (reprint author), Univ Texas Dallas, Dept Elect Engn, Richardson, TX 75083 USA. EM loizou@utdallas.edu CR Breithaupt C., 2003, P INT C AC SPEECH SI, P896 CHEN B, 2005, P IEEE ICASSP, V1, P1097 CHEN B, 2005, THESIS U TEXAS DALLA Cohen I, 2001, SIGNAL PROCESS, V81, P2403, DOI 10.1016/S0165-1684(01)00128-1 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Gradshteyn I. S., 2000, TABLE INTEGRALS SERI Hansen J. H. L., 1998, P INT C SPOK LANG PR, V7, P2819 Hu Y., 2006, P INTERSPEECH PHIL P Kwon Y.W., 2000, FINITE ELEMENT METHO LOTTER T, 2003, INT WORKSH AC ECH NO Martin R, 2005, IEEE T SPEECH AUDI P, V13, P845, DOI 10.1109/TSA.2005.851927 Martin R., 2002, P IEEE ICASSP, P504 Martin R., 2003, P 8 INT WORKSH AC EC, V8-11, P87 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 Papoulis A., 2001, PROBABILITY RANDOM V, V4th PORTER J, 1984, P IEEE ICASSP NR 17 TC 29 Z9 30 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2007 VL 49 IS 2 BP 134 EP 143 DI 10.1016/j.specom.2006.12.005 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 149OB UT WOS:000245155200005 ER PT J AU Truong, KP van Leeuwen, DA AF Truong, Khiet P. van Leeuwen, David A. TI Automatic discrimination between laughter and speech SO SPEECH COMMUNICATION LA English DT Article DE automatic detection laughter; automatic detection emotion ID EMOTIONS; SYSTEMS; MODELS AB Emotions can be recognized by audible paralinguistic cues in speech. By detecting these paralinguistic cues that can consist of laughter, a trembling voice, coughs, changes in the intonation contour etc., information about the speaker's state and emotion can be revealed. This paper describes the development of a gender-independent laugh detector with the aim to enable automatic emotion recognition. Different types of features (spectral, prosodic) for laughter detection were investigated using different classification techniques (Gaussian Mixture Models, Support Vector Machines, Multi Layer Perceptron) often used in language and speaker recognition. Classification experiments were carried out with short pre-segmented speech and laughter segments extracted from the ICSI Meeting Recorder Corpus (with a mean duration of approximately 2 s). Equal error rates of around 3% were obtained when tested on speaker-independent speech data. We found that a fusion between classifiers based on Gaussian Mixture Models and classifiers based on Support Vector Machines increases discriminative power. We also found that a fusion between classifiers that use spectral features and classifiers that use prosodic information usually increases the performance for discrimination between laughter and speech. 
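A linear score-level fusion of the kind reported to help here can be written in a few lines; the z-normalisation and the equal weighting below are illustrative assumptions, not the combination actually tuned in the paper.

import numpy as np

def fuse_scores(gmm_llr, svm_dist, w=0.5):
    # Fuse per-segment scores from a GMM (log-likelihood ratio of the
    # laughter model vs the speech model) and an SVM (signed distance to
    # the separating hyperplane); z-normalising both lets one weight do.
    z = lambda s: (s - s.mean()) / s.std()
    return w * z(gmm_llr) + (1.0 - w) * z(svm_dist)

gmm = np.array([2.1, -0.7, 1.5, -1.9, 0.3])   # toy scores for 5 segments
svm = np.array([1.4, -0.2, 0.9, -1.1, -0.4])
print(fuse_scores(gmm, svm) > 0.0)            # fused laughter decisions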
Our acoustic measurements showed differences between laughter and speech in mean pitch and in the ratio of the durations of unvoiced to voiced portions, which indicate that these prosodic features are indeed useful for discrimination between laughter and speech. (C) 2007 Published by Elsevier B.V. C1 TNO HUman Factors, Dept Human Interfaces, NL-3769 ZG Soesterberg, Netherlands. RP Truong, KP (reprint author), TNO HUman Factors, Dept Human Interfaces, POB 23, NL-3769 ZG Soesterberg, Netherlands. EM khiet.truong@tno.nl CR Adami A.G., 2003, P EUROSPEECH, P841 Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 Bachorowski JA, 2001, J ACOUST SOC AM, V110, P1581, DOI 10.1121/1.1391244 BETT M, 2000, P RIAO 2000 PAR FRAN Bickley C.A., 1992, P ICSLP 1992 BANFF C, P927 Boersma P., 2005, PRAAT DOING PHONETIC CAI R, 2003, P IEEE INT C MULT EX, V3, P37 Campbell N., 2005, P INT LISB PORT, P465 Campbell W.M., 2004, P OD SPEAK LANG REC, P41 CAMPBELL WM, 2002, P ICASSP, P161 CAREY MJ, 1999, P IEEE ICASSP 99 PHO, P1432 Collobert R, 2001, J MACH LEARN RES, V1, P143, DOI 10.1162/15324430152733142 Doddington GR, 2000, SPEECH COMMUN, V31, P225, DOI 10.1016/S0167-6393(99)00080-1 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 ELHANNANI A, 2005, P NON LIN SPEECH PRO, P19 GILLICK L, 1989, ICASSP 1989 GLASG SC HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 JANIN A, 2004, NIST ICASSP 2004 M R Kennedy L., 2004, NIST ICASSP 2004 M R Lippmann R. P., 1993, Lincoln Laboratory Journal, V6 Lockerd A., 2002, P CHI HUM FACT COMP, P574, DOI DOI 10.1145/506443.506490 Martin A. F., 1997, P EUROSPEECH, P1895 MOWRER DE, 1987, J NONVERBAL BEHAV, V11, P191, DOI 10.1007/BF00990237 Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2 NWOKAH EE, 1993, J ACOUST SOC AM, V94, P3076, DOI 10.1121/1.407242 OHARA R, 2004, THESIS NARA I SCI TE Oostdijk N., 2000, P 2 INT C LANG RES E, P887 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Rothganger H, 1998, NATURWISSENSCHAFTEN, V85, P394, DOI 10.1007/s001140050522 SCHERER KR, 1982, HDB METHODS NONVERBA, P36 ten Bosch L, 2003, SPEECH COMMUN, V40, P213, DOI 10.1016/S0167-6393(02)00083-3 Trouvain J., 2003, P 15 INT C PHON SCI, P2793 Truong K.P., 2005, P EUR C SPEECH COMM, P485 Vapnik V., 1995, NATURE STAT LEARNING Vapnik V, 1998, STAT LEARNING THEORY WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 Witten I.H., 2005, DATA MINING PRACTICA Yacoub S., 2003, P EUROSPEECH, P729 NR 38 TC 49 Z9 52 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2007 VL 49 IS 2 BP 144 EP 158 DI 10.1016/j.specom.2007.01.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 149OB UT WOS:000245155200006 ER PT J AU Uther, M Knoll, MA Burnham, D AF Uther, M. Knoll, M. A. Burnham, D. TI Do you speak E-NG-L-I-SH? 
A comparison of foreigner- and infant-directed speech SO SPEECH COMMUNICATION LA English DT Article DE infant-directed speech; prosody; foreigner-directed speech; vowel hyperarticulation ID CROSS-LANGUAGE; MOTHERS SPEECH; MATERNAL SPEECH; INTONATION; PREFERENCE; PITCH; TONE; AGE AB Infant-directed speech has three main roles - it attracts attention, conveys emotional affect, and conveys language-specific phonological information, and each of these roles are reflected in certain components of the speech signal - pitch, rated affect, and vowel hyperarticulation. We sought to investigate the independence of these components by comparing British English speech directed to first language English learners (infants), and second language English learners (adult foreigners), populations with similar linguistic but dissimilar affective needs. It was found that, compared with British adult-directed speech, vowels were equivalently hyperarticulated in infant- and foreigner-directed speech. On the other hand, pitch was higher in speech to infants than to foreigners or adult British controls; and positive affect was highest in infant-directed and lowest in foreigner-directed speech. These results suggest that linguistic modifications found in both infant- and foreigner-directed speech are didactically oriented, and that linguistic modifications are independent of vocal pitch and affective valence. (C) 2006 Elsevier B.V. All rights reserved. C1 Univ Portsmouth, Dept Psychol, Portsmouth PO1 2DY, Hants, England. Brunel Univ, Sch Social Sci, Ctr Cognit & NeuroImaging, Uxbridge UB8 3PH, Middx, England. Univ Western Sydney, MARCS Auditory Labs, Sydney, NSW 1797, Australia. RP Knoll, MA (reprint author), Univ Portsmouth, Dept Psychol, King Henry Bldg,King Henry I St, Portsmouth PO1 2DY, Hants, England. 
EM Maria.Uther@brunel.ac.uk; Monja.Knoll@port.ac.uk; D.Burnham@uws.edu.au CR BARD EG, 1994, J CHILD LANG, V21, P623 BIERSACK S, 2005, INTERSPEECH 2005, P2401 Burnham D, 2002, SCIENCE, V296, P1435, DOI 10.1126/science.1069587 Davis BL, 2001, EMERGING COGNITIVE ABILITIES IN EARLY INFANCY, P135 FERNALD A, 1993, CHILD DEV, V64, P657, DOI 10.1111/j.1467-8624.1993.tb02934.x FERNALD A, 1989, J CHILD LANG, V16, P477 FERNALD A, 1984, DEV PSYCHOL, V20, P104, DOI 10.1037//0012-1649.20.1.104 FERNALD A, 1987, INFANT BEHAV DEV, V10, P279, DOI 10.1016/0163-6383(87)90017-8 Foulkes P, 2005, LANGUAGE, V81, P177, DOI 10.1353/lan.2005.0018 GRIESER DL, 1988, DEV PSYCHOL, V24, P14, DOI 10.1037/0012-1649.24.1.14 Kemler Nelson D G, 1989, J Child Lang, V16, P55 Kitamura C, 2003, INFANCY, V4, P85, DOI 10.1207/S15327078IN0401_5 Kitamura C., 1998, ADV INFANCY RES, V12, P221 Knoll M.A., 2006, P SPEECH PROS 3 INT, P165 Kuhl Patricia K., 1999, J ACOUST SOC AM, V105.2, P1095, DOI 10.1121/1.425135 Kuhl P.K., 2000, NEW COGNITIVE NEUROS, P99 Kuhl PK, 2004, NAT REV NEUROSCI, V5, P831, DOI 10.1038/nrn1533 Kuhl PK, 1997, SCIENCE, V277, P684, DOI 10.1126/science.277.5326.684 Liu HM, 2003, DEVELOPMENTAL SCI, V6, pF1, DOI 10.1111/1467-7687.00275 Oviatt S, 1998, J ACOUST SOC AM, V104, P3080, DOI 10.1121/1.423888 PAPOUSEK M, 1991, INFANT BEHAV DEV, V14, P415, DOI 10.1016/0163-6383(91)90031-M PAPOUSEK M, 1991, APPL PSYCHOLINGUIST, V12, P481, DOI 10.1017/S0142716400005889 Smiljanic R, 2005, J ACOUST SOC AM, V118, P1677, DOI 10.1121/1.2000788 STERN DN, 1983, J CHILD LANG, V10, P1 Trainor LJ, 2002, PSYCHON B REV, V9, P335, DOI 10.3758/BF03196290 WERKER JF, 1994, INFANT BEHAV DEV, V17, P323, DOI 10.1016/0163-6383(94)90012-4 NR 26 TC 41 Z9 41 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2007 VL 49 IS 1 BP 2 EP 7 DI 10.1016/j.specom.2006.10.003 PG 6 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 133WL UT WOS:000244044500001 ER PT J AU Wutiwiwatchsi, C Furui, S AF Wutiwiwatchsi, Chai Furui, Sadaoki TI Thai speech processing technology: A review SO SPEECH COMMUNICATION LA English DT Review DE speech analysis; speech processing; speech synthesis; speech recognition; Thai speech ID RECOGNITION; LANGUAGE; TONES; MODEL AB This paper presents a comprehensive review of Thai speech technology, from its impetus in the early 1960s to 2005. Thai is the official language of Thailand, and is spoken by over 60 million people worldwide. As with Chinese, it is a tonal language. It has a spelling system using a Thai alphabet, but has no explicit word boundaries, similar to several Asian languages, such as Japanese and Chinese. It does have explicit marks for tones, as in the languages of the neighboring countries, Laos and Vietnam. Therefore, with these unique characteristics, research and development of language and speech processing specifically for Thai is necessary and quite challenging. This paper reviews the progress of Thai speech technology in five areas of research: fundamental analyses and tools, text-to-speech synthesis (TTS), automatic speech recognition (ASR), speech applications, and language resources. At the end of the paper, the progress and focus of Thai speech research, as measured by the number of publications in each research area, is reviewed and possible directions for future research are suggested. (C) 2006 Elsevier B.V. All rights reserved. 
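Because Thai text has no explicit word boundaries, many of the systems surveyed in this review start from dictionary-based word segmentation; the following toy longest-matching sketch, with an invented romanised lexicon rather than Thai script, illustrates the idea behind the early dictionary-based Thai segmenters cited below.

def longest_match_segment(text, lexicon):
    # Greedy longest-matching segmentation, a classic baseline for
    # writing systems without spaces between words.
    words, i = [], 0
    while i < len(text):
        for j in range(len(text), i, -1):      # try the longest chunk first
            if text[i:j] in lexicon or j == i + 1:
                words.append(text[i:j])        # fall back to one character
                i = j
                break
    return words

lexicon = {"sawat", "dee", "khrap", "phasa", "thai"}
print(longest_match_segment("sawatdeekhrap", lexicon))
# -> ['sawat', 'dee', 'khrap']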
C1 Natl Elect & Comp Technol Ctr, Informat R&D Div, Speech Technol Res Sect, Klongluang 12120, Pathumthani, Thailand. Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. RP Wutiwiwatchsi, C (reprint author), Natl Elect & Comp Technol Ctr, Informat R&D Div, Speech Technol Res Sect, 112 Pahonyothin Rd, Klongluang 12120, Pathumthani, Thailand. EM chai@nectec.or.th CR Abramson A. S, 1962, VOWELS TONES STANDAR, V20 ABRAMSON AS, 1979, STUDIES TAI MONKHMER ABRAMSON AS, 1982, INT C LING AHKUPUTRA V, 2000, DIRECT CLASSIFICATIO AHKUPUTRA V, 2001, INT J PHONETICS Ahkuputra V, 1997, IEEE PACIF, P593 ANIVAN S, 1988, INT S LANG LING AROONMANAKUN W, 2002, JOINT INT C SNLP OR, P68 AROONMANAKUN W, 2004, SE AS LING SOC C 14 AROONMANAKUN W, 2005, PAC AS C LANG INF CO, P205 BLACK A, 2000, INT C SPOK LANG PROC Black A. W., 1997, EUR C SPEECH COMM TE, P995 BOONPIAM V, 2005, EL ENG C EECON THAIL, P1053 BRADLEY CB, 1911, J AM ORIENTAL SOC Burnham D, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2514 CHANLEKHA H, 2002, JOINT INT C SNLP OR, P326 CHANYAPORNPONG S, 1983, THAI SYLLABLE SEPARA CHARNVIVAT P, 2003, EUR C SPEECH COMM TE, P137 CHAROENPORN T, 1997, NAT LANG PROC PAC RI CHAROENPORNSAWA.P, 1998, IEEE AS PAC C CIRC S, P647 CHAROENPORNSAWA.P, 2001, INT C COMP PROC OR L, P231 CHOTIMONGKOL A, 2000, INT C SPOK LANG PROC, P533 DEEMAGARN A, 2004, INT C SPEECH COMP SP, P731 Demeechai T, 2001, SPEECH COMMUN, V33, P241, DOI 10.1016/S0167-6393(00)00017-0 DEMEECHAI T, 2000, IEEE NORD SIGN PROC, P303 FUJISAKI H, 2003, INT C PHON SCI FUJISAKI H, 2004, INT C SPEECH PROS 20, P1 GANDOUR J, 1994, INT J PHONETICS Gandour J, 1999, PHONETICA, V56, P123, DOI 10.1159/000028447 GANDOUR J, 1991, SPEECH COMMUN, V10, P355, DOI 10.1016/0167-6393(91)90003-C GANDOUR J, 1979, SE ASIAN STUDIES HAAS MR, 1980, THAI SYSTEM WRITING HANSAKUNBUNTHEU.C, 2005, INT S NAT LANG PROC, P127 HANSAKUNBUNTHEU.C, 2005, EUR C SPEECH COMM TE, P1969 HANSAKUNBUNTHEU.C, 2003, INT C SPEECH DAT ASS HANSAKUNBUNTHEU.C, 2003, EUR C SPEECH COMM TE, P93 Higbie James, 2002, THAI REFERENCE GRAMM Hirst D. J., 1998, INTONATION SYSTEMS S, P1 Inrut J., 2001, INT S COMM INF TECHN, P37 Jitapunkul S., 1998, IEEE. APCCAS 1998. 1998 IEEE Asia-Pacific Conference on Circuits and Systems. Microelectronics and Integrating Systems. Proceedings (Cat. No.98EX242), DOI 10.1109/APCCAS.1998.743704 JITTIWARANGKUL N, 2000, INT S NAT LANG PROC, P243 KANOKPHARA S, 2004, AUSTR INT C SPEECH S, P322 KANOKPHARA S, 2003, IEEE INT C AC SPEECH, P764 KANOKPHARA S, 2003, EUR C SPEECH COMM TE, P797 KARNJANADECHA M, 2002, INT C SPOK LANG PROC, P2141 KARNJANADECHA M, 2001, INT S COMM INF TECHN, P271 KASEMSIRI W, 2000, S NAT LANG PROC SNLP, P252 KASURIYA S, 2001, IASTED INT C MOD ID, P190 KASURIYA S, 2002, JOINT INT C SNLP OR, P211 KASURIYA S, 2003, INT C SPEECH DAT ASS KAWTRAKUL A, 1995, INT S NAT LANG PROC KAWTRAKUL A, 1997, NAT LANG PROC PAC RI, P341 KAWTRAKUL A, 2002, COLING 2002 POSTC WO KHAORAPAPONG T, 2004, INT C SPOK LANG PROC, P1909 KHRUAHONG S, 2003, INT S COMM INF TECHN KIATARPAKUL R, 1995, INT S NAT LANG PROC, P354 KIATARPAKUL R, 1995, INT S NAT LANG PROC, P361 KIATARPAKUL R, 1996, SPEECH RECOGNITION S KITTIPIYAKUL S, 2004, ECTI INT C ECTI CON KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986 Kongkachandra R., 1998, IEEE. APCCAS 1998. 1998 IEEE Asia-Pacific Conference on Circuits and Systems. Microelectronics and Integrating Systems. Proceedings (Cat. 
No.98EX242), DOI 10.1109/APCCAS.1998.743702 LUANGTHONGKUM T, 1977, RHYTHM STANDARD THAI LUKSANEEYANAWIN S, 1985, INT C SIN TIB LANG L LUKSANEEYANAWIN S, 1993, INT S NAT LANG PROC, P276 LUKSANEEYANAWIN S, 1989, REG WORKSH COMP PROC, P305 LUKSANEEYANAWIN S, 1992, INT S LANG LING, P75 Luksaneeyanawin S., 1983, INTONATION THAI MAMORU S, 2001, J INFORM PROCESSING, V15 MANEENOI E, 1998, THAI VOWEL PHENOME R MANEENOI E, 2000, J ACOUST SOC AM, V108, P2575 MANEENOI E, 2002, AUSTR INT C SPEECH S, P462 MANEENOI EJ, 1997, NAT LANG PROC PAC RI, P343 MEKNAVIN S, 1997, NAT LANG PROC PAC RI, P41 MEKNAVIN S, 1997, INTERDISCIPLINARY AP MITTRAPIYANURAK P, 2000, INT S NAT LANG PROC, P23 MITTRAPIYANURAK P, 2000, INT C SPOK LANG PROC, P334 Mittrapiyanuruk P., 2000, NECTEC ANN C BANGK, P483 MIXDORFF H, 2004, INT S TON ASP LANG T, P143 MIXDORFF H, 2003, INT C PHON SCI Mixdorff H., 2002, INT C SPOK LANG PROC, P753 MIXDORFF H, 2005, AUDITORY VISUAL SPEE MUENPINIJ B, 2004, AUTOMATIC THAI SPEEC NARUPIYAKUL L, 1999, INT S INT SIGN PROC NAVAS E, 2002, INT C SPEECH PROS, P527 PALMER A, 1969, LANG LEARN, V19, P287, DOI 10.1111/j.1467-1770.1969.tb00469.x PANSOMBAT N, 2002, EL ENG C EECON, pEL54 PATHUMTHAN T, 1987, THAI SPEECH RECOGNIT PENGPHON N, 2002, JOINT INT SNLP OR CO, P277 PENSIRI R, 1995, EL ENG C EECON PATT, P977 PISARN C, 2004, INT C COMP LING COLI, P529 Pisarn C, 2004, LECT NOTES COMPUT SC, V3283, P100 PISARN C, 2005, IASTED C ART INT APP, P453 PONYANUN P, 2003, INT S RES AND DEV IN POOWARAWAN Y, 1986, EL ENG C EECON PORNSUKJANTRA W, 1997, NLPRS 97 INC POR SNL, P585 Potisuk S, 1996, PHONETICA, V53, P200 POTISUK S, 1995, INT CONF ACOUST SPEE, P632, DOI 10.1109/ICASSP.1995.479677 POTISUK S, 1996, INT S LANG LING PAN, P1177 POTISUK S, 1999, IEEE T SPEECH ADIO P, V7, P91 RARUNROM S, 1991, DICT BASED THAI WORD RATSAMEEWICHAI S, 2002, INT TECHN C CIRC SYS, P110 Sagisaka Y., 1992, INT C SPOK LANG PROC, P483 SAIYOT S, 2005, NAT COMP SCI ENG C N, P521 Saravari C., 1983, Journal of the Acoustical Society of Japan (E), V4 SATTAYAPANICH T, 2003, NAT COMP SCI ENG C N SAWAMIPAKDEE D, 1990, DEV THAI ANAL STRUCT SCHULTZ T, 2002, INT C SPOK LANG PROC, P345 SCHULTZ T, 2004, HUMAN LANGUAGE TECHN SCHULTZ T, 1997, EUR C SPEECH COMM TE, P371 SHUICHI I, 2000, INT C SPEECH DAT ASS, P8 SOJKA P, 2003, P EACL WORKSH COMP L SORNLERTLAMVANI.V, 1993, MACHINE TRANSLATION, P50 SRICHAROENCHAI A, 2002, JOINT INT C SNLP OR, P334 SUCHATO A, 2005, INT S NAT LANG PROC, P247 Suebvisai S, 2005, INT CONF ACOUST SPEE, P857 SURINPAIBOON S, 1985, STRESSED UNSTRESSED TAISETWATKUL S, 1996, THAI SPEECH SYNTHESI TANPRASERT C, 1999, TEXT DEPENDENT SPEAK TARSAKU P, 2002, JOINT INT C SNLP OR, P217 TARSAKU P, 2001, EUR C SPEECH COMM TE, P1057 TESPRASIT V, 2003, EUR C SPEECH COMM TE, P325 TESPRASIT V, 2003, HUM LANG TECHN C N A, P103 THAIRATANANOND Y, 1981, DESIGN THAI TEXT SYL THATPHITHAKKUL N, 2004, EL ENG C EECON THAIL THEERAMUNKONG T, 2000, 1 INT C HUM LANG TEC, P1 THIENLIKIT I, 2004, AC SOC JAP ASJ M THONGPRASERT R, 2002, PROGR REPORT CORPUS THUBTHONG N, 2000, INT C INT TECHN INTE, P206 THUBTHONG N, 1999, EL ENG C EECON BANGK, P163 Thubthong N, 2001, INT J UNCERTAIN FUZZ, V9, P815 THUBTHONG N, 2002, INT C SPOK LANG PROC, P1169 THUBTHONG N, 2001, COMPUTATIONAL INTELL, V18, P313 THUBTHONG N, 1999, IEEE INT S INT SIGN, P785 THUBTHONG N, 2004, 11 BIENN INT C INT S THUBTHONG N, 2001, INT C INT TECHN INTE, P356 THUBTHONG N, 2004, REH ENG ASS TECHN SO THUBTHONG N, 2000, NAT COMP SCI ENG C N, P63 TINGSABADH K, 1999, HDB INT PHONETIC ASS TUNGTHANGTHUM A, 1998, IEEE AS 
PAC C CIRC S Wutiwiwatchai C, 2006, SPEECH COMMUN, V48, P305, DOI 10.1016/j.specom.2005.02.005 WUTIWIWATCHAI C, 2001, EUR C SPEECH COMM TE, P775 WUTIWIWATCHAI C, 1998, INT C SPOK LANG PROC, P763 WUTIWIWATCHAI C, 1999, IEEE REG 10 C DIG SI, P674 WUTIWIWATCHAI C, 2004, HUM LANG TECHN C N A, P2 WUTIWIWATCHAI C, 2003, IEEE WORKSH AUT SPEE, P566 WUTIWIWATCHAI C, 2003, AC SOC JAP ASJ M WUTIWIWATCHAI C, 2002, INT C LANG RES EV LR, P869 WUTIWIWATCHAI C, 2004, INT SPOK LANG PROC I, P2129 WUTIWIWATCHAI C, 1998, INT C SPOK LANG PROC, P767 NR 149 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2007 VL 49 IS 1 BP 8 EP 27 DI 10.1016/j.specom.2006.10.004 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 133WL UT WOS:000244044500002 ER PT J AU Welby, P AF Welby, Pauline TI The role of early fundamental frequency rises and elbows in French word segmentation SO SPEECH COMMUNICATION LA English DT Article DE word segmentation; speech segmentation; intonation; prosody; French ID SPEECH SEGMENTATION; MOTHERS SPEECH; INFANTS; LANGUAGE; ACQUISITION; INTONATION; PERCEPTION; BOUNDARIES; RECOGNITION; TRACKING AB The study examined intonational cues to word beginnings in French. French has an optional "early rise" in fundamental frequency (F-0) starting at the beginning of a content word. The role of this rise in segmentation by human listeners had previously been suggested, but not empirically tested. In Experiment 1, participants listened to noise-masked items like le ballon de mimentos / le ballon de mes manteaux, differing in segmentation and presence of an early rise. They interpreted early rises as markers of content word beginnings. In Experiment 2, the alignment of the early rise was manipulated in nonword sequences like [me.la.mo.din]. Listeners were more likely to perceive two words (mes lamondines) when the early rise started at the second syllable and a single content (non)word (melamondine) when it started at the first. Experiment 3 showed that a simple F-0 elbow at a function word-content word boundary also cued a content word beginning. This pattern and its potential use in word segmentation had not been previously reported in the literature. These intonational cues, like other cues to word segmentation, influenced rather than determined segmentation decisions. The influence of other cues (e.g., duration, word frequency) is also discussed. These results are the first evidence that French listeners use intonational information as cues to content word beginnings. These cues are particularly important because the beginning of a word is a privileged position in word recognition and because, unlike many other cues (e.g., stress in English), they identify actual rather than potential word boundaries. The results provide support for an autosegmental-metrical account of the intonational phonology of French in which the early rise is a bitonal (LH) phrase accent that serves as a cue to content word beginnings. The cue is strongest when both tones (LH) are realized, which leads to an early rise, but can still be used if only the L is realized, which leads to a simple elbow. These results illustrate the importance of expanding studies of the range of cues to word segmentation to include intonational cues. (C) 2006 Elsevier B.V. All rights reserved. 
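The early rise and elbow that this abstract describes are landmarks in the F-0 contour. A minimal illustration of how such an elbow can be located, using a two-piece linear fit whose knot minimizes squared error; this sketch and its synthetic contour are hypothetical and are not the paper's measurement procedure:

```python
import numpy as np

def find_f0_elbow(times, f0):
    """Return the knot of the best two-piece linear fit to an F0 contour,
    a simple stand-in for the 'elbow' between a plateau and a rise."""
    best_err, best_knot = np.inf, None
    for k in range(2, len(times) - 2):          # leave >= 3 points per side
        err = 0.0
        for part in (slice(0, k + 1), slice(k, len(times))):
            t, f = times[part], f0[part]
            coef = np.polyfit(t, f, 1)          # least-squares line
            err += float(np.sum((np.polyval(coef, t) - f) ** 2))
        if err < best_err:
            best_err, best_knot = err, times[k]
    return best_knot

# Hypothetical contour: flat 120 Hz, then a rise starting near 0.25 s.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 0.5, 26)
f0 = np.where(t < 0.25, 120.0, 120.0 + 300.0 * (t - 0.25)) + rng.normal(0, 1, 26)
print(f"estimated elbow near {find_f0_elbow(t, f0):.2f} s")
```

With only the L tone realized (a simple elbow rather than a full LH rise), the same fit applies: the slope change at the knot is smaller, which matches the paper's finding that the elbow is a weaker but still usable cue.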
C1 Univ Grenoble 3, Inst Natl Polytech Grenoble, CNRS,UMR 5009, Inst Commun Parlee, F-38031 Grenoble, France. RP Welby, P (reprint author), Univ Grenoble 3, Inst Natl Polytech Grenoble, CNRS,UMR 5009, Inst Commun Parlee, 46 Ave Felix Viallet, F-38031 Grenoble, France. EM welby@icp.inpg.fr CR Allopenna PD, 1998, J MEM LANG, V38, P419, DOI 10.1006/jmla.1997.2558 ASTEANO C, 2003, P 15 ICPHS U AUT BAR, P503 Astesano C., 2002, P SPEECH PROS 2002 C, P139 Bagou O., 2002, PROSODY 2002, P159 BAGOU O, 2006, ACT 26 JOURN ET PAR, P571 BANEL MH, 1994, SPEECH COMMUN, V15, P115, DOI 10.1016/0167-6393(94)90046-9 BANEL MH, 1998, ACT 22 JOURN ET PAR, P29 Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X Bloom L., 1993, TRANSITION INFANCY L Boersma P., 2001, GLOT INT, V5, P341 Boersma P, 2002, PRAAT DOING PHONETIC Campbell N., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607292 Cutler A., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90004-0 CHRISTOPHE A, 1994, J ACOUST SOC AM, V95, P1570, DOI 10.1121/1.408544 CHRISTOPHE A, 1993, THESS ECOLE HAUTES E CHRISTOPHE A, 2005, C ARCH MECH LANG PRO Christophe A, 1997, LANG COGNITIVE PROC, V12, P585, DOI 10.1080/016909697386637 Content A, 2001, LANG COGNITIVE PROC, V16, P609 CUTLER A, 1992, COGNITIVE PSYCHOL, V24, P381, DOI 10.1016/0010-0285(92)90012-Q CUTLER A, 1995, SPEECH LANGUAGE COMM, P97 CUTLER A, 1986, J MEM LANG, V25, P385, DOI 10.1016/0749-596X(86)90033-1 Dahan D, 2002, J MEM LANG, V47, P292, DOI 10.1016/S0749-596X(02)00001-3 D'Imperio M., 2000, THESIS OHIO STATE U DELABATIE BD, 1993, THESIS MONASH U Delais-Roussarie E., 1999, CAHIERS GRAMMAIRE, V24, P17 DELAISROUSSARIE E, 1995, THESIS U TOULOUSE LE Di Cristo A., 1998, INTONATION SYSTEMS S, P195 Di Cristo A., 2000, FRENCH LANGUAGE STUD, V10, P27 Di Cristo A., 1999, FRENCH LANGUAGE STUD, V9, P143 Durand J., 1990, GENERATIVE NONLINEAR FERNALD A, 1989, CHILD DEV, V60, P1497, DOI 10.1111/j.1467-8624.1989.tb04020.x FERNALD A, 1984, DEV PSYCHOL, V20, P104, DOI 10.1037//0012-1649.20.1.104 Fonagy I., 1980, STUDIA PHONETICA, V15, P123 Fonagy Ivan, 1976, FRANCAIS MODERN, V44, P193 FOUGERON C, 2002, ACT 24 JOURN ET PAR, P125 Fougeron C, 2001, J PHONETICS, V29, P109, DOI 10.1006/jpho.2000.0114 Gaskell MG, 2002, MEM COGNITION, V30, P798, DOI 10.3758/BF03196435 GERKEN L, 1993, DEV PSYCHOL, V29, P448, DOI 10.1037/0012-1649.29.3.448 GERKEN L, 1994, COGNITION, V51, P237, DOI 10.1016/0010-0277(94)90055-8 GUISSON L, 1975, ACT 6 JOURN ET PAR T, P117 Harrington J., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90004-1 Herman R., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607059 Kemler Nelson D G, 1989, J Child Lang, V16, P55 HIRST D, 1996, ACT 21 JOURN ET PAR, P223 ITO K, 2006, P SPEECH PROS 2006 D, P261 JACOBSON JL, 1983, CHILD DEV, V54, P436, DOI 10.2307/1129704 Jun S. 
A., 2002, PROBUS, V14, P147, DOI 10.1515/prbs.2002.002 Jun SA, 2000, TEXT SPEECH LANG TEC, V15, P209 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 KIM S, 2004, P 8 INT C SPOK LANG, P3005 KNIGHT RA, 2004, THESIS CAMBRIDGE U LAEUFER C, 1987, OHIO STATE U WORKING, V36, P75 MAKASHAY M, 2003, THESIS OHIO STATE U Mattys SL, 2005, J EXP PSYCHOL GEN, V134, P477, DOI 10.1037/0096-3445.134.4.477 McQueen JM, 1998, J MEM LANG, V39, P21, DOI 10.1006/jmla.1998.2568 MEHLER J, 1981, J VERB LEARN VERB BE, V20, P298, DOI 10.1016/S0022-5371(81)90450-3 MEHLER J, 1988, COGNITION, V29, P143, DOI 10.1016/0010-0277(88)90035-2 Mertens P., 1993, TRAVAUX LINGUIST, V26, P21 MERTENS P, 2004, FRANCAIS MODERN, V71, P39 MERTENS P, 2002, P SPEECH PROS 2002 C, P499 Mertens Piet, 2001, TRAITEMENT AUTOMATIQ, V42, P145 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z NAKATANI LH, 1978, J ACOUST SOC AM, V63, P234, DOI 10.1121/1.381719 Nazzi T, 2005, LANG SPEECH, V48, P279 New B, 2001, ANN PSYCHOL, V101, P447 Norris D, 1997, COGNITIVE PSYCHOL, V34, P191, DOI 10.1006/cogp.1997.0671 Pasdeloup V., 1990, THESIS U PROVENCE Pierrehumbert J., 1988, JAPANESE TONE STRUCT Pierrehumbert J, 1980, THESIS MIT Post B., 2000, TONAL PHRASAL STRUCT Post Brechtje, 2002, P SPEECH PROS 2002 C, P11 Ramus F., 2002, ANN REV LANGUAGE ACQ, V2, P85, DOI DOI 10.1075/ARLA.2.05RAM REINHOLTPETERSO.N, 1986, PHONETICA, V43, P31 RIETVELD ACM, 1980, LANG SPEECH, V23, P289 ROSSI M, 1985, PHONETICA, V42, P135 Rossi Mario, 1999, INTONATION SYSTEME F SHAFER V, 1992, BOST U C LANG DEV BO SMITH MR, 1989, J SPEECH HEAR RES, V32, P912 Spinelli E, 2003, J MEM LANG, V48, P233, DOI 10.1016/S0749-596X(02)00513-2 Spinelli E., 2002, REV FR LING APPL, VVII, P83 VAISSIERE J, 1976, RECHERCHES ACOUSTIQU, V3, P345 VAISSIERE J, 1975, ACTES 6EMES JOURNEES, P39 VAISSIERE J, 1976, ACT 7 JOURN ET PAR N, P103 Vaissiere J., 1997, TRAITEMENT AUTOMATIQ, V38, P53 VAISSIERE J, 1992, WENNER GREN INT S SE, V59, P108 Vaissiere Jacqueline, 1983, PROSODY MODELS MEASU, P53 Vihman M. M., 1996, PHONOLOGICAL DEV ORI VIVES R, 1977, ACT 8 JOURN ET PAR, P353 WAUQUIERGRAVELI.S, 1996, THESIS U PARIS 7 D D Weber A, 2006, LANG SPEECH, V49, P367 Welby P, 2006, J PHONETICS, V34, P343, DOI 10.1016/j.wocn.2005.09.001 Welby P., 2003, THESIS OHIO STATE U WELBY P, 2003, P EUR 8 ANN C SPEECH, P2125 WELBY P, 2002, P SPEECH PROS 2002 C, P695 WELBY P, 2006, IN PRESS ITALIAN J L, V18 YersinBesson C, 1996, ANN PSYCHOL, V96, P9 NR 96 TC 20 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2007 VL 49 IS 1 BP 28 EP 48 DI 10.1016/j.specom.2006.10.005 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 133WL UT WOS:000244044500003 ER PT J AU Takano, S Honda, K AF Takano, Sayoko Honda, Kiyoshi TI An MRI analysis of the extrinsic tongue muscles during vowel production SO SPEECH COMMUNICATION LA English DT Article DE tongue muscles; muscle geometry; vowel production; MRI AB Functions of the extrinsic tongue muscles in vowel production were examined by measurements of muscle length and tongue tissue deformation using MRI (magnetic resonance imaging). Results from the analysis of Japanese vowel data suggested: (1) Contraction and relaxation of the three subdivisions of the genioglossus (GG) play a dominant role in forming tongue shapes for vowels. 
(2) The extralingual part of the styloglossus (SG), which was previously thought to cause a high-back tongue position by pulling its insertion point in the tongue, was found to be nearly constant across all vowels both in length and orientation. (3) The tongue shape for back vowels is mainly achieved by internal deformation of the tongue tissue; the medial tissue of the tongue showed lateral expansion in front vowels and medial compression in back vowels. (C) 2006 Published by Elsevier B.V. C1 ATR Human Informat Sci Labs, Kyoto 6190288, Japan. RP Takano, S (reprint author), RWTH Univ Hosp Aachen, Dept Diagnost Radiol, Pauwelsstr 30, D-52074 Aachen, Germany. EM takano@rad.rwth-aachen.de; honda@atr.jp CR Abd-El-Malek S, 1939, J ANAT, V73, P201 Baer T., 1988, ANN B RES I LOGOPEDI, V22, P7 Engwall O, 2003, SPEECH COMMUN, V41, P303, DOI [10.1016/S0167-6393(02)00132-2, 10.1016/S0167-6393(03)00132-2] Gray H, 1989, GRAYS ANATOMY, V37 HIROSE H, 1971, SR2626 HASK LAB, P73 Honda K, 1996, J PHONETICS, V24, P39, DOI 10.1006/jpho.1996.0004 KAKITA Y, 1985, PHONETIC LINGUISTICS, P133 MAEDA S, 1994, PHONETICA, V51, P17 Masaki S., 1999, Journal of the Acoustical Society of Japan (E), V20 Miyawaki K., 1974, ANN B RES I LOGOPEDI, V8, P23 Perkell JS, 1996, J PHONETICS, V24, P3, DOI 10.1006/jpho.1996.0002 Stone M, 1996, J ACOUST SOC AM, V99, P3728, DOI 10.1121/1.414969 STONE M, 2001, J ACOUST SOC AM, V109, P2947 Takemoto H, 2001, J SPEECH LANG HEAR R, V44, P95, DOI 10.1044/1092-4388(2001/009) WILHELMSTRICARICO R, 1995, J ACOUST SOC AM, V97, P3085, DOI 10.1121/1.411871 NR 15 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2007 VL 49 IS 1 BP 49 EP 58 DI 10.1016/j.specom.2006.09.004 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 133WL UT WOS:000244044500004 ER PT J AU Oh, YR Yoon, JS Kim, HK AF Oh, Yoo Rhee Yoon, Jae Sam Kim, Hong Kook TI Acoustic model adaptation based on pronunciation variability analysis for non-native speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; non-native speech; knowledge-based pronunciation variability; data-driven pronunciation variability; state-tying; state-clustering; decision tree; acoustic model adaptation AB In this paper, pronunciation variability between native and non-native speakers is investigated, and a novel acoustic model adaptation method is proposed based on pronunciation variability analysis in order to improve the performance of a speech recognition system for non-native speakers. The proposed acoustic model adaptation method is performed in two steps: analysis of the pronunciation variability of non-native speech, and acoustic model adaptation based on the pronunciation variability analysis. In order to obtain informative variant phonetic units, we analyze the pronunciation variability of non-native speech in two ways: a knowledge-based approach, and a data-driven approach. Next, for each approach, the acoustic model corresponding to each informative variant phonetic unit is adapted such that the state-tying of the acoustic model for non-native speech reflects a phonetic variability. For further improvement, a conventional acoustic model adaptation method such as MLLR and/or MAP is combined with the proposed acoustic model adaptation method. 
It is shown from the continuous Korean-English speech recognition experiments that the proposed method achieves an average word error rate reduction of 16.76% and 12.80% for the knowledge-based approach and the data-driven approach, respectively, when compared with the baseline speech recognition system trained by native speech. Moreover, a reduction of 53.45% and 57.14% in the average word error rate is obtained by combining MLLR and MAP adaptations with the acoustic models adapted by the proposed method for the knowledge-based approach and the data-driven approach, respectively. (C) 2006 Elsevier B.V. All rights reserved. C1 Gwangju Inst Sci & Technol, Dept Informat & Commun, Kwangju 500712, South Korea. RP Kim, HK (reprint author), Gwangju Inst Sci & Technol, Dept Informat & Commun, 1 Oryong Dong, Kwangju 500712, South Korea. EM yroh@gist.ac.kr; jsyoon@gist.ac.kr; hongkook@gist.ac.kr CR BINDER N, 2002, P SPRING M AC SOC JA, P203 Compernolle D. V., 2001, SPEECH COMMUN, V35, P71 GRUHN R, 2004, P ICSLP JEJ ISL KOR, P1497 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MATSUNAGA S, 2003, P ICASSP HONG KONG C, P340 MORGAN J, 2004, P INSTIL ICALL S COM, P213 PAUL DB, 1992, P WORKSH SPEECH NAT, P357, DOI 10.3115/1075527.1075614 Rhee S.-C., 2004, P ICSLP JEJ ISL KOR, P2769 RYU SY, 1994, JUNGANG J ENGLISH LI, V35, P145 STIEDL S, 2004, P ICSLP JEJ ISL KOR, P2901 Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 Tomokiyo L. M., 2000, P ICSLP, P346 WANG Z, 2003, P EUR 03 GEN SWITZ, P1449 WEIDE H, 1998, CMU PRONUNCIATION DI YOUE HM, 2001, MALSORI, V42, P47 Young S., 2002, HTK BOOK HTK VERSION Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 YUN HS, 2005, J ENGLISH LANG LIT, V47, P307 ZAVALIAKOS G, 1996, P ICASSP, P725 NR 19 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2007 VL 49 IS 1 BP 59 EP 70 DI 10.1016/j.specom.2006.10.006 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 133WL UT WOS:000244044500005 ER PT J AU Dusan, S AF Dusan, Sorin TI On the relevance of some spectral and temporal patterns for vowel classification SO SPEECH COMMUNICATION LA English DT Article DE acoustic patterns; pattern recognition; vowel classification; automatic speech recognition ID DYNAMIC SPECIFICATION; COARTICULATED VOWELS; WORD RECOGNITION; IDENTIFICATION; TRANSITIONS; FEATURES; SPOKEN AB Many previous studies suggested that the information necessary for the identification of vowels from continuous speech is distributed both within and outside vowel boundaries. This information appears to be embedded in the speech signal in the form of various acoustic cues or patterns: spectral, energy, static, dynamic, and temporal. In a recent paper we identified seven types of acoustic patterns that might be exploited by listeners in the identification of coarticulated vowels. The current paper extends the previous study and quantifies the relevance for vowel classification of eight types of acoustic patterns, including static spectral patterns, dynamical spectral patterns, and temporal-durational patterns. Four of these eight patterns are not directly exploited by current automatic speech recognition techniques in computing the likelihood of each phonetic model. These four new patterns proved to contain significant vowel information. 
Two of these four new patterns represent static spectral patterns lying outside of the currently accepted boundaries of vowels, whereas one is a double-slope dynamical pattern and another one is a simple durational pattern. The findings of this paper may be important for both automatic speech recognition models and models of vowel/phoneme perception by humans. (C) 2006 Elsevier B.V. All rights reserved. C1 Rutgers State Univ, Ctr Adv Informat Proc, Piscataway, NJ 08854 USA. RP Dusan, S (reprint author), Rutgers State Univ, Ctr Adv Informat Proc, Piscataway, NJ 08854 USA. EM sdusan@caip.rutgers.edu CR DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DUSAN S, 2004, P ICSLP 04 JEJ KOR DUSAN S, 2006, P ISCA TUT RES WORKS DUSAN S, 2005, P INTERSPEECH EUROSP DUSAN S, 2004, J ACOUST SOC AM 2, V116, pA2479 FISHER WM, 1987, J ACOUST SOC AM, V81, pS92, DOI 10.1121/1.2034854 FOLEY DH, 1972, IEEE T INFORM THEORY, V18, P618, DOI 10.1109/TIT.1972.1054863 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 FURUI S, 1986, J ACOUST SOC AM, V80, P1016, DOI 10.1121/1.393842 Garofolo J. S., 1993, NISTIR PUBLICATION, V4930 KIRCHHOFF K, 1999, P ICPHS 1999 SAN FRA, P1729 LEHISTE I, 1961, J ACOUST SOC AM, V33, P268, DOI 10.1121/1.1908638 LINDBLOM BE, 1967, J ACOUST SOC AM, V42, P830, DOI 10.1121/1.1910655 Morris A., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1006 NEAREY TM, 1986, J ACOUST SOC AM, V80, P1297, DOI 10.1121/1.394433 PARKER EM, 1984, PERCEPT PSYCHOPHYS, V36, P369, DOI 10.3758/BF03202791 PETERSON GE, 1952, J ACOUST SOC AM, V24, P75 Rabiner L, 1993, FUNDAMENTALS SPEECH SCANLON P, 2003, P INTERSPEECH EUROSP STRANGE W, 1989, J ACOUST SOC AM, V85, P2135, DOI 10.1121/1.397863 STRANGE W, 1983, J ACOUST SOC AM, V74, P695, DOI 10.1121/1.389855 STRANGE W, 1976, J ACOUST SOC AM, V60, P213, DOI 10.1121/1.381066 SUN DX, 1995, P INT C AC SPEECH SI, P201 van Son RJJH, 1999, SPEECH COMMUN, V29, P1, DOI 10.1016/S0167-6393(99)00024-2 Yang HH, 2000, SPEECH COMMUN, V31, P35, DOI 10.1016/S0167-6393(00)00007-8 NR 25 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2007 VL 49 IS 1 BP 71 EP 82 DI 10.1016/j.specom.2006.11.001 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 133WL UT WOS:000244044500006 ER PT J AU Faundez-Zanuy, M Janer-Garcia, L Alcobe, JR Bimbot, F de Mori, R AF Faundez-Zanuy, Marcos Janer-Garcia, Leonard Alcobe, Josep Roure Bimbot, Frederic de Mori, Renato TI Special issue: NOLISP 2005 SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Escola Univ Politecn Mataro, Dept Telecommun, Barcelona 08303, Spain. Escola Univ Politecn Mataro, Comp Networks Dept, Barcelona 08303, Spain. Escola Univ Politecn Mataro, Dept Comp Sci, Barcelona 08303, Spain. Inst Rech Informat & Syst Aleatoires, CNRS, F-35042 Rennes, France. Inst Rech Informat & Syst Aleatoires, INRIA, F-35042 Rennes, France. Univ Avignon, Lab Informat, F-84911 Avignon, France. RP Faundez-Zanuy, M (reprint author), Escola Univ Politecn Mataro, Dept Telecommun, Avda Puig & Cadafalch 101-111, Barcelona 08303, Spain. 
EM faundez@eupmt.es; leonard@eupmt.es; roure@eupmt.es; bimbot@irisa.fr; renato.demori@lia.univ-avignon.fr RI Faundez-Zanuy, Marcos/F-6503-2012 OI Faundez-Zanuy, Marcos/0000-0003-0605-1282 NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2006 VL 48 IS 12 BP 1607 EP 1607 DI 10.1016/j.specom.2006.10.001 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122SI UT WOS:000243246600001 ER PT J AU Faundez-Zanuy, M Hagmuller, M Kubin, G AF Faundez-Zanuy, Marcos Hagmueller, Martin Kubin, Gernot TI Speaker verification security improvement by means of speech watermarking SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT International Conference on Non-Linear Speech Processing CY APR 19-22, 2005 CL Barcelona, SPAIN DE biometric; speech watermarking; speaker verification AB This paper presents a security-enhanced speaker verification system based on speech signal watermarking. Our proposed system can detect several situations in which playback speech, synthetically generated speech, a manipulated speech signal, or a hacker imitating the speech is used to fool the biometric system. In addition, we have generated a database of watermarked speech signals from which we have obtained relevant conclusions about the influence of this technique on speaker verification rates. In particular, we have verified that biometrics and watermarking can coexist simultaneously while minimizing their mutual effects. Experimental results show that the proposed speech watermarking system can withstand A-law coding with a message error rate lower than 2 x 10(-4) for SWR higher than 20 dB at a message rate of 48 bits/s. (C) 2006 Elsevier B.V. All rights reserved. C1 Escola Univ Politecn Mataro, Barcelona 08303, Spain. Graz Univ Technol, Signal Proc & Speech Commun Lab, A-8010 Graz, Austria. RP Faundez-Zanuy, M (reprint author), Escola Univ Politecn Mataro, Avda Puig & Cadafalch 101-111, Barcelona 08303, Spain. EM faundez@eupmt.es; hagmueller@tugraz.at; g.kubin@ieee.org RI Faundez-Zanuy, Marcos/F-6503-2012 OI Faundez-Zanuy, Marcos/0000-0003-0605-1282 CR Bender W., 1996, IBM SYST J, V35 Bimbot F., 1993, P EUROSPEECH, P169 BROOKES M, 2000, VOICEBOX CHENG Q, 2001, P IEEE INT C AC SPEE, V3, P1337 Deller J. R., 1993, DISCRETE TIME PROCES Faundez-Zanuy M, 2004, IEEE AERO EL SYS MAG, V19, P3, DOI 10.1109/MAES.2004.1308819 FAUNDEZZANUY M, 2002, INT C SPEECH LANG PR, P2317 FAUNDEZZANUY M, 2002, EUSIPCO 2002, V3, P125 Faundez-Zanuy M, 2005, IEEE AERO EL SYS MAG, V20, P34, DOI 10.1109/MAES.2005.1396793 HAGMULLER M, 2003, TUGSPSC200302 HERING H, 2003, P 22 DIG AV SYST C D JOHNSON NF, 1998, IEEE COMPUT, V31, P26 Martin A. F., 1997, P EUROSPEECH, P1895 Ortega-Garcia J, 2000, SPEECH COMMUN, V31, P255, DOI 10.1016/S0167-6393(99)00081-3 NR 14 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD DEC PY 2006 VL 48 IS 12 BP 1608 EP 1619 DI 10.1016/j.specom.2006.06.010 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122SI UT WOS:000243246600002 ER PT J AU Bahoura, M Rouat, J AF Bahoura, Mohammed Rouat, Jean TI Wavelet speech enhancement based on time-scale adaptation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT International Conference on Non-Linear Speech Processing CY APR 19-22, 2005 CL Barcelona, SPAIN DE speech enhancement; wavelet transform; Teager energy operator; speech recognition; adaptive thresholds ID SPECTRAL AMPLITUDE ESTIMATOR; TEAGER ENERGY OPERATOR; NOISE; TRANSFORM AB We propose a new speech enhancement method based on time and scale adaptation of wavelet thresholds. The time dependency is introduced by approximating the Teager energy of the wavelet coefficients, while the scale dependency is introduced by extending the principle of level dependent threshold to wavelet packet thresholding. This technique does not require an explicit estimation of the noise level or of the a priori knowledge of the SNR, as is usually needed in most of the popular enhancement methods. Performance of the proposed method is evaluated on speech recorded in real conditions (plane, sawmill, tank, subway, babble, car, exhibition hall, restaurant, street, airport, and train station) and artificially added noise. MEL-scale decomposition based on wavelet packets is also compared to the common wavelet packet scale. Comparison in terms of signal-to-noise ratio (SNR) is reported for time adaptation and time-scale adaptation of the wavelet coefficients thresholds. Visual inspection of spectrograms and listening experiments are also used to support the results. Hidden Markov Models speech recognition experiments are conducted on the AURORA-2 database and show that the proposed method improves the speech recognition rates for low SNRs. (C) 2006 Elsevier B.V. All rights reserved. C1 Univ Sherbrooke, Dept Genie Elect & Genie Informat, Sherbrooke, PQ J1K 2R1, Canada. Univ Quebec, Dept Math Informat & Genie, Rimouski, PQ G5L 3A1, Canada. RP Rouat, J (reprint author), Univ Sherbrooke, Dept Genie Elect & Genie Informat, 2500 Blvd Univ, Sherbrooke, PQ J1K 2R1, Canada. EM Jean.Rouat@usherbrooke.ca CR Bahoura M, 2001, EUROSPEECH, P1937 Bahoura M, 2001, IEEE SIGNAL PROC LET, V8, P10, DOI 10.1109/97.889636 Chen SH, 2004, J VLSI SIG PROC SYST, V36, P125, DOI 10.1023/B:VLSI.0000015092.19005.62 COHEN I, 2001, EUROSPEECH 2001 AALB, P1933 DELLER JR, 1993, DISCRETE TIME PROCES DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009 DONOHO DL, 1994, BIOMETRIKA, V81, P425, DOI 10.1093/biomet/81.3.425 Donoho DL, 1995, J AM STAT ASSOC, V90, P1200, DOI 10.2307/2291512 Donoho DL, 1993, P S APPL MATH, V47, P173 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC Gulzow T, 1998, SIGNAL PROCESS, V64, P5, DOI 10.1016/S0165-1684(97)00172-2 Hirsch H. 
G., 2000, P ISCA ITRW ASR2000, P181 Jabloun F, 1999, IEEE SIGNAL PROC LET, V6, P259, DOI 10.1109/97.789604 Johnstone IM, 1997, J R STAT SOC B, V59, P319, DOI 10.1111/1467-9868.00071 Mahmoudi D, 1998, INT CONF ACOUST SPEE, P385, DOI 10.1109/ICASSP.1998.674448 Mahmoudi D., 1997, P EUR 97 RHOD GREEC, P339 MALLAT S, 1992, IEEE T INFORM THEORY, V38, P617, DOI 10.1109/18.119727 Pan Q, 1999, IEEE T SIGNAL PROCES, V47, P3401 Sarikaya R., 1998, NORSIG'98. 3rd IEEE Nordic Signal Processing Symposium SEOK J, 1997, ICASSP 97 MUN GERM, P1223 Sika J., 1997, EUROSPEECH 97 RHOD G, P2595 Vidakovic B, 1998, IEEE T SIGNAL PROCES, V46, P2549, DOI 10.1109/78.709544 XU YS, 1994, IEEE T IMAGE PROCESS, V3, P747 YOUNG S, 2000, HTK HIDDEN MARKOV MO, P2 ZHANG XP, 1998, SIGNAL PROCESS LETT, V5, P1070 ZHANG XP, 1998, P ICASSP98, V3, P1589 NR 28 TC 21 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2006 VL 48 IS 12 BP 1620 EP 1637 DI 10.1016/j.specom.2006.06.004 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122SI UT WOS:000243246600003 ER PT J AU Gorriz, JM Ramirez, J Lang, EW Puntonet, CG AF Gorriz, J. M. Ramirez, J. Lang, E. W. Puntonet, C. G. TI Hard C-means clustering for voice activity detection SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT International Conference on Non-Linear Speech Processing CY APR 19-22, 2005 CL Barcelona, SPAIN DE voice activity detection; speech recognition; clustering; C-means; prototypes; subband energy ID GAUSSIAN MODEL; SPEECH; NOISE; END AB An effective voice activity detection (VAD) algorithm is proposed for improving speech recognition performance in noisy environments. The proposed speech/pause discrimination method is based on a hard-decision clustering approach built on a set of subband log-energies and noise prototypes that define a cluster. Detecting the presence of speech (a new cluster) is achieved using a basic sequential algorithm scheme (BSAS) according to a given "distance" (in this case, geometrical distance) and a suitable threshold. The accuracy of the Cluster VAD (CIVAD) algorithm lies in the use of a decision function defined over a multiple-observation (MO) window of averaged subband log-energies and a suitable noise subspace model defined in terms of prototypes. In addition, the reduced computational cost of the clustering approach makes it adequate for real-time applications, e.g. speech recognition. An exhaustive analysis is conducted on the Spanish SpeechDat-Car databases in order to assess the performance of the proposed method and to compare it to existing standard VAD methods. The results show improvements in detection accuracy over standard VADs such as ITU-T G.729, ETSI GSM AMR and ETSI AFE and a representative set of recently reported VAD algorithms for noise robust speech processing. (C) 2006 Elsevier B.V. All rights reserved. C1 Univ Granada, Dept Signal Theory Networking & Commun, E-18071 Granada, Spain. Univ Regensburg, Inst Biophys, D-93040 Regensburg, Germany. Univ Granada, Dept Comp Architecture & Technol, E-18071 Granada, Spain. RP Gorriz, JM (reprint author), Univ Granada, Dept Signal Theory Networking & Commun, E-18071 Granada, Spain. 
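To make the hard-decision clustering described in this abstract concrete: a minimal sketch in which a frame of subband log-energies is labelled speech when its distance to the nearest noise prototype, averaged over a multiple-observation window, exceeds a threshold (in the spirit of BSAS, where a far-away observation starts a new, non-noise cluster). The window length, adaptation rate, and threshold are hypothetical choices; the published algorithm differs in its details:

```python
import numpy as np

def clustering_vad(frames, noise_protos, threshold, mo_len=5, alpha=0.98):
    """frames: iterable of subband log-energy vectors; noise_protos: initial
    noise prototypes, e.g. estimated from leading silence. Returns a list of
    per-frame speech/pause decisions."""
    protos = [p.astype(float).copy() for p in noise_protos]
    window, decisions = [], []
    for x in frames:
        dists = [np.linalg.norm(x - p) for p in protos]
        window.append(min(dists))                # distance to noise subspace
        if len(window) > mo_len:
            window.pop(0)
        is_speech = sum(window) / len(window) > threshold
        decisions.append(is_speech)
        if not is_speech:                        # adapt noise model in pauses
            j = int(np.argmin(dists))
            protos[j] = alpha * protos[j] + (1.0 - alpha) * x
    return decisions
```

The MO window is what gives the decision its robustness: a single noisy frame far from the prototypes is not enough to trigger a speech decision, only a sustained departure from the noise cluster is.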
EM gorriz@ugr.es RI Puntonet, Carlos/B-1837-2012; Prieto, Ignacio/B-5361-2013; Gorriz, Juan/C-2385-2012; Ramirez, Javier/B-1836-2012 OI Ramirez, Javier/0000-0002-6229-2921 CR Anderberg MR, 1973, CLUSTER ANAL APPL [Anonymous], 1999, 301708 ETSI EN [Anonymous], 2000, 201108 ETSI ES Armani L., 2003, P EUROSPEECH 2003 GE, P501 Basbug F, 2003, IEEE T SPEECH AUDI P, V11, P1, DOI 10.1109/TSA.2002.807350 Bouquin-Jeannes R. L., 1995, SPEECH COMMUN, V16, P245 Chengalvarayan R, 1999, P EUROSPEECH 1999 BU, P61 Cho YD, 2001, IEEE SIGNAL PROC LET, V8, P276 *ETSI, 2002, 201108 ETSI ES Fisher D. H., 1987, Machine Learning, V2, DOI 10.1023/A:1022852608280 Gazor S, 2003, IEEE T SPEECH AUDI P, V11, P498, DOI 10.1109/TSA.2003.815518 GORRIZ JM, 2006, P IEEE INT C AC SPEE Gorriz JM, 2005, ELECTRON LETT, V41, P877, DOI 10.1049/el:20051761 Hastie T., 2001, SPRINGER SERIES STAT, V2nd Jain A. K., 1988, PRENTICE HALL ADV RE Jain A.K., 1996, ADV IMAGE UNDERSTAND, P65 KARRAY L, 2003, SPEECH COMMUN, P261 KOHONEN T., 1989, SELF ORG ASS MEMORY Li Q, 2002, IEEE T SPEECH AUDI P, V10, P146 MacQueen J., 1967, P 5 BERK S MATH STAT, P281 Marzinzik M., 2002, IEEE T SPEECH AUDIO, V10, P341 MORENO A, 2000, P 2 LREC C Ramirez J, 2004, SPEECH COMMUN, V42, P271, DOI 10.1016/j.specom.2003.10.002 RAMIREZ J, IN PRESS IEEE T SPEE RAMIREZ J, 2003, P EUROSPEECH 2003 GE, P3041 Rasmussen E., 1992, INFORMATION RETRIEVA, P419 SALTON G, 1991, SCIENCE, V253, P974, DOI 10.1126/science.253.5023.974 Sangwan A., 2002, IEEE INT C HIGH SPEE, P46 SOHN J, 1999, IEEE SIGN P LETT, V7, P1 Tanyer SG, 2000, IEEE T SPEECH AUDI P, V8, P478, DOI 10.1109/89.848229 TUCKER R, 1992, IEE PROC-I, V139, P377 Woo KH, 2000, ELECTRON LETT, V36, P180, DOI 10.1049/el:20000192 YOUNG S, 1997, HTK BOOK NR 33 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2006 VL 48 IS 12 BP 1638 EP 1649 DI 10.1016/j.specom.2006.07.006 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122SI UT WOS:000243246600004 ER PT J AU Hagmuller, M Kubin, G AF Hagmueller, Martin Kubin, Gernot TI Poincare pitch marks SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT International Conference on Non-Linear Speech Processing CY APR 19-22, 2005 CL Barcelona, SPAIN DE Poincare section; pitch marks; state space; nonlinear speech processing; diplophonic voice; disordered voice; jitter ID HUMAN SPEECH SIGNALS; LYAPUNOV EXPONENTS; TIME-SERIES; SPACES; VOICE AB A novel approach for pitch mark determination based on dynamical systems theory is presented. Pitch marks are used for speech analysis and modification, such as jitter measurement or time scale modification. The algorithm works in a pseudo-state space and calculates the Poincare section at a chosen point in the state space. Pitch marks are then found at the crossing of the trajectories with the Poincare plane of the initial point. The procedure is performed frame-wise to account for the changing dynamics of the speech production system. The system is intended for real-time use, so higher-level processing extending over more than one frame is not used. The processing delay is, therefore, limited to one frame. The algorithm is evaluated by calculating an average pitch value for 10 ms frames and using a small database with pitch measurements from a laryngograph signal. The results are compared to a reference correlation-based pitch mark algorithm. 
The performance of the proposed algorithm is comparable to the reference algorithm, but in contrast correctly follows the pitch marks of diplophonic voices. (C) 2006 Elsevier B.V. All rights reserved. C1 Graz Univ Technol, Signal Proc & Speech Commun Lab, A-8010 Graz, Austria. RP Hagmuller, M (reprint author), Graz Univ Technol, Signal Proc & Speech Commun Lab, Inffeldgasse 12, A-8010 Graz, Austria. EM hagmueller@tugraz.at CR Kumar A, 1996, J ACOUST SOC AM, V100, P615, DOI 10.1121/1.415886 BAGSHAW P, 1994, EVALUATING PITCH DET Banbrook M, 1996, CHAOS SOLITON FRACT, V7, P973, DOI 10.1016/0960-0779(95)00105-0 BIMBOT F, 2003, ISCA TUT RES WORKSH BOERSMA P, 2005, PRAAT SOFTW SPEECH A Broomhead D. S., 1986, Nonlinear Phenomena and Chaos CHOLLET G, 2005, LECT NOTES COMPUTER, V3445 FAUNDEZZANUY M, 2006, LECT NOTES COMPUTER, V3817 Giovanni A, 1999, J VOICE, V13, P465, DOI 10.1016/S0892-1997(99)80002-2 Giovanni A, 1999, J VOICE, V13, P341, DOI 10.1016/S0892-1997(99)80040-X HAGMULLER M, 2004, P INT C SPOK LANG PR, P541 HAGMULLER M, 2005, ISCA TOT RES WORKSH, P107 HAGMULLER M, 2003, P 3 INT WORKSH MOD A, P281 Hegger R, 2001, IEEE T CIRCUITS-I, V48, P1454, DOI 10.1109/TCSI.2001.972852 Hegger R, 2000, PHYS REV LETT, V84, P3197, DOI 10.1103/PhysRevLett.84.3197 HERZEL H, 1995, CHAOS, V5, P30, DOI 10.1063/1.166078 Indrebo KM, 2006, SPEECH COMMUN, V48, P760, DOI 10.1016/j.specom.2004.12.002 Jiang JJ, 2006, J VOICE, V20, P2, DOI 10.1016/j.jvoice.2005.01.001 JOHNSON MT, 2003, P IEEE INT C AC SPEE, V1, P920 Kantz H., 2004, NONLINEAR TIME SERIE, V2nd KLEIJN WB, 2002, P IEEE WORKSH SPEECH Kokkinos I, 2005, IEEE T SPEECH AUDI P, V13, P1098, DOI 10.1109/TSA.2005.852982 Kubin G., 1997, P IEEE WORKSH SPEECH, P7 Kubin G., 1995, SPEECH CODING SYNTHE, P557 Little MA, 2006, J ACOUST SOC AM, V119, P549, DOI 10.1121/1.2141266 MANN I, 1998, P 9 EUR SIGN PROC C, V2, P701 MANN IM, 1999, THESIS U EDINBURGH Matassini L, 2002, COMPUT METH PROG BIO, V68, P135, DOI 10.1016/S0169-2607(01)00161-4 MOULINES E, 1995, SPEECH COMMUN, V16, P175, DOI 10.1016/0167-6393(94)00054-E Noth E, 2002, SPEECH COMMUN, V36, P45, DOI 10.1016/S0167-6393(01)00025-5 SAUER T, 1991, J STAT PHYS, V65, P579, DOI 10.1007/BF01053745 Schoentgen J, 2003, J VOICE, V17, P114, DOI 10.1016/S0892-1997(03)0014-6 SCHREIBER T, 1995, INT J BIFURCAT CHAOS, V5, P349, DOI 10.1142/S0218127495000296 Stylianou Y., 1995, P EUROSPEECH, P451 Takens F., 1981, LECT NOTES MATH, V898, P366, DOI DOI 10.1007/BFB0091924 TEREZ D, 2002, P IEEE INT C AC SPEE, V1, P345 TISHBY N, 1990, P IEEE INT C AC SPEE, V4, P365 TITZE IR, 1994, P WORKSH AC VOIC AN NR 38 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2006 VL 48 IS 12 BP 1650 EP 1665 DI 10.1016/j.specom.2006.07.008 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122SI UT WOS:000243246600005 ER PT J AU Salvi, G AF Salvi, Giampiero TI Segment boundary detection via class entropy measurements in connectionist phoneme recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT International Conference on Non-Linear Speech Processing CY APR 19-22, 2005 CL Barcelona, SPAIN DE boundary detection; entropy; connectionist phoneme recognition ID SPEECH RECOGNITION AB This article investigates the possibility to use the class entropy of the output of a connectionist phoneme recogniser to predict time boundaries between phonetic classes. 
The rationale is that the value of the entropy should increase in proximity of a transition between two segments that are well modelled (known) by the recognition network since it is a measure of uncertainty. The advantage of this measure is its simplicity as the posterior probabilities of each class are available in connectionist phoneme recognition. The entropy and a number of measures based on differentiation of the entropy are used in isolation and in combination. The decision methods for predicting the boundaries range from simple thresholds to a neural-network-based procedure. The different methods are compared with respect to their precision, measured in terms of the ratio between the number C of predicted boundaries within 10 or 20 ms of the reference and the total number of predicted boundaries, and recall, measured as the ratio between C and the total number of reference boundaries. (C) 2006 Elsevier B.V. All rights reserved. C1 Royal Inst Technol, KTH, Sch Comp Sci & Commun Speech Mus & Hearing, S-10044 Stockholm, Sweden. RP Salvi, G (reprint author), Royal Inst Technol, KTH, Sch Comp Sci & Commun Speech Mus & Hearing, Lindstedtsv 24, S-10044 Stockholm, Sweden. EM giampi@kth.se CR Andreev VP, 2003, ANAL CHEM, V75, P6314, DOI 10.1021/ac0301806 Beskow J., 2004, J SPEECH TECHNOLOGY, V4, P335 Elenius K., 2000, International Journal of Speech Technology, V3, DOI 10.1023/A:1009641213324 Gibbon D., 1997, HDB STANDARDS RE 4 B Glass JR, 2003, COMPUT SPEECH LANG, V17, P137, DOI 10.1016/S0885-2308(03)00006-8 HOSOM JP, 2002, INT C SPOK LANG PROC, V1, P357 KARLSSON L, 2003, P EUR, P1297 LI CW, 1995, IEEE T BIO-MED ENG, V42, P21 LINDBERG B, 2000, 6 INT C SPOK LANG PR, V3, P370 Liu SA, 1996, J ACOUST SOC AM, V100, P3417, DOI 10.1121/1.416983 Ostendorf M., 1994, Computational Linguistics, V20 SALVI G, 2006, LECT NOTES ARTIF INT, V3817, P267 Salvi G, 2006, SPEECH COMMUN, V48, P802, DOI 10.1016/j.specom.2005.05.005 SALVI G, 2003, ISCA TUT RES WORKSH Williams G, 1999, COMPUT SPEECH LANG, V13, P395, DOI 10.1006/csla.1999.0129 NR 15 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2006 VL 48 IS 12 BP 1666 EP 1676 DI 10.1016/j.specom.2006.07.009 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122SI UT WOS:000243246600006 ER PT J AU Hiroya, S Mochida, T AF Hiroya, Sadao Mochida, Takemi TI Multi-speaker articulatory trajectory formation based on speaker-independent articulatory HMMs SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT International Conference on Non-Linear Speech Processing CY APR 19-22, 2005 CL Barcelona, SPAIN DE inter-speaker variability; articulatory trajectory formation; articulatory HMMs; speaker-adaptive training ID SPEECH PRODUCTION-MODEL; VOCAL-TRACT; MOVEMENTS; ADAPTATION; INVERSION AB Inter-speaker variability in the speech spectrum domain has been modeled using speaker-adaptive training (SAT), in which speaker-independent phoneme-specific hidden Markov models (HMMs) were used along with a speaker-adaptive matrix. In this paper, multi-speaker articulatory trajectory formation based on this method is presented. Both speaker-independent and speaker-specific features are statistically separated from a multi-speaker articulatory database, which consists of the mid-sagittal motion data of the lips, incisor, and tongue measured with an electro-magnetic articulographic (EMA) system. 
We evaluated the proposed method in terms of the RMS error between the measured and estimated articulatory parameters. When multi-speaker models of articulatory parameters with two speaker-adaptive matrices for each speaker were used, the average RMS error of articulatory parameters was 1.29 mm and showed no statistically significant difference from that for speaker-dependent models (1.22 mm). For comparison, multi-speaker models of the conventional speech spectrum were also constructed using a multi-speaker spectrum database, which consists of speech data simultaneously recorded during the articulatory measurements. The average spectral distance between the vocal-tract and estimated spectrum from two-matrix models was 4.19 dB and showed a statistically significant difference from that for speaker-dependent models (3.97 dB). These results indicate that modeling of inter-speaker variability in the articulatory parameter domain with a small number of matrices for each speaker almost perfectly approximates the speaker dependency of articulation and is better than that in the speech spectrum domain. (C) 2006 Elsevier B.V. All rights reserved. C1 NTT Corp, NTT Commun Sci Labs, Atsugi, Kanagawa 2430198, Japan. RP Hiroya, S (reprint author), NTT Corp, NTT Commun Sci Labs, 3-1 Morinosato Wakamiya, Atsugi, Kanagawa 2430198, Japan. EM hiroya@idea.brl.ntt.co.jp CR Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807 ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 BAUM LE, 1970, ANN MATH STAT, V41, P164, DOI 10.1214/aoms/1177697196 EIDE E, 1996, P ICASSP, P346 Fukada T., 1992, P ICASSP, P137, DOI 10.1109/ICASSP.1992.225953 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 GALVAN A, 2003, P ICPHS, P1325 Hashi M, 1998, J ACOUST SOC AM, V104, P2426, DOI 10.1121/1.423750 Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636 Hiroya S, 2004, IEICE T INF SYST, VE87D, P1071 Honda K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607480 HONDA M, 1999, P IEEE INT C SYST MA, P463 KABURAGI T, 1994, J ACOUST SOC AM, V96, P1356, DOI 10.1121/1.410280 Kaburagi T, 1996, J ACOUST SOC AM, V99, P3154, DOI 10.1121/1.414800 Kaburagi T., 1998, P INT C SPOK LANG PR, P433 Kaburagi T, 2001, J ACOUST SOC AM, V110, P441, DOI 10.1121/1.1373707 Lee L., 1996, P ICASSP Nakajima T., 1978, Journal of the Acoustical Society of Japan, V34 Okadome T, 2001, J ACOUST SOC AM, V110, P453, DOI 10.1121/1.1377633 PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204 PYE D, 1997, P ICASSP, P1047 SALTZMAN E, 1998, ECOL PSYCHOL, V1, P333 Simpson AP, 2001, J ACOUST SOC AM, V109, P2153, DOI 10.1121/1.1356020 SUGAMURA N, 1981, IECE T A, V64, P323 Tokuda K., 1995, P EUROSPEECH, P757 Yamagishi J, 2003, IEICE T FUND ELECTR, VE86A, P1956 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 NR 29 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD DEC PY 2006 VL 48 IS 12 BP 1677 EP 1690 DI 10.1016/j.specom.2006.08.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122SI UT WOS:000243246600007 ER PT J AU Pribilova, A Pribil, J AF Pribilova, Anna Pribil, Jiri TI Non-linear frequency scale mapping for voice conversion in text-to-speech system with cepstral description SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT International Conference on Non-Linear Speech Processing CY APR 19-22, 2005 CL Barcelona, SPAIN DE speech synthesis; source-filter speech model; harmonic speech model; text-to-speech system; voice conversion ID TRANSFORMATION; NETWORKS AB Voice conversion, i.e. modification of a speech signal to sound as if spoken by a different speaker, finds its use in speech synthesis with a new voice without the necessity of a new database. This paper introduces two new simple non-linear methods of frequency scale mapping for transformation of voice characteristics between male and female or childish voices. The frequency scale mapping methods were developed primarily for use in the Czech and Slovak text-to-speech (TTS) system designed for the blind and based on the Pocket PC device platform. It uses cepstral description of the diphone speech inventory of the male speaker using the source-filter speech model or the harmonic speech model. Three new diphone speech inventories corresponding to female, childish and young male voices are created from the original male speech inventory. Listening tests are used for evaluation of voice transformation and quality of synthetic speech. (C) 2006 Elsevier B.V. All rights reserved. C1 Slovak Univ Technol Bratislava, Dept Radio Elect, Bratislava 81219, Slovakia. Acad Sci Czech Republic, Inst Radio Engn & Elect, CR-18251 Prague 8, Czech Republic. RP Pribilova, A (reprint author), Slovak Univ Technol Bratislava, Dept Radio Elect, Ilkovicova 3, Bratislava 81219, Slovakia. 
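A minimal numeric illustration of the non-linear frequency scale mapping idea in this abstract: resample a spectral envelope along a power-law frequency map, one simple stand-in for the paper's mapping functions (the exponent values and the toy envelope are hypothetical). In a cepstral TTS system the warped envelope would then be turned back into cepstral coefficients of the new voice:

```python
import numpy as np

def warp_envelope(env, gamma):
    """Resample a spectral envelope along f' = f ** gamma (normalized
    frequency in [0, 1]). gamma < 1 moves formants upward, roughly in the
    male -> female/child direction; gamma > 1 moves them downward."""
    f = np.linspace(0.0, 1.0, len(env))
    source = f ** (1.0 / gamma)          # where each output bin reads from
    return np.interp(source, f, env)

# Hypothetical envelope with a single 'formant' peak near bin 40 of 128.
env = np.exp(-0.5 * ((np.arange(128) - 40.0) / 6.0) ** 2)
print(int(np.argmax(warp_envelope(env, 0.8))))   # peak moves above bin 40
```

A non-linear map of this kind, unlike a plain linear scaling, can shift formants strongly at low frequencies while leaving the top of the band nearly fixed, which is why the abstract stresses non-linear rather than linear frequency scale mapping.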
EM pribilova@kre.elf.stuba.sk CR Arslan LM, 1999, SPEECH COMMUN, V28, P211, DOI 10.1016/S0167-6393(99)00015-1 DUBEDA T, 2005, ELECT SPEECH SIGNAL, P364 DUXANS H, 2003, P EUR C SPEECH COMM, P861 FANT G., 1997, ENCY ACOUSTICS, p1589 to 1598 GUTIERREZARRIOL.JM, 2001, P EUR C SPEECH COMM, P357 Imai S., 1978, Proceedings of the 1978 IEEE International Conference on Acoustics, Speech and Signal Processing IWAHASHI N, 1995, SPEECH COMMUN, V16, P139, DOI 10.1016/0167-6393(94)00051-B JANOTA P, 1994, P 31 C AC CZECH AC S, P139 JANOTA P, 1967, PERSONAL CHARACTERIS JANOTA P, 1994, ACTA U CAROLINAE PHI, P33 Kain A, 1998, INT CONF ACOUST SPEE, P285, DOI 10.1109/ICASSP.1998.674423 KUWABARA H, 1995, SPEECH COMMUN, V16, P165, DOI 10.1016/0167-6393(94)00053-D LEE KS, 2002, IEICE T INF SYST, P1297 LEUTELT L, 2004, P INT EURASIP C AN B, P30 Madlova A., 2002, Journal of Electrical Engineering, V53 McAulay R., 1995, SPEECH CODING SYNTHE, P121 MIZUNO H, 1995, SPEECH COMMUN, V16, P153, DOI 10.1016/0167-6393(94)00052-C MOUCHTARIS A, 2004, P IEEE INT C AC SPEE, pI1 NARENDRANATH M, 1995, SPEECH COMMUN, V16, P207, DOI 10.1016/0167-6393(94)00058-I PRIBIL J, 2005, ELECT SPEECH SIGNAL, P402 PRIBILOVA A, 2004, P 14 INT CZECH SLOV, P100 RENTZOS D, 2003, P IEEE AUT SPEECH RE, P706 SLIFKA J, 1995, P IEEE INT C AC SPEE, P644 STEVENS KN, 1997, ENCY ACOUSTICS, P1565 Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472 SUNDERMANN D, 2004, SPAN SOC NAT LANG PR, P41 TODA T, 2001, P ICASSP, P841 TURK O, 2002, P INT C SPOK LANG PR, P289 Unser M, 1999, IEEE SIGNAL PROC MAG, V16, P22, DOI 10.1109/79.799930 Vich R., 2000, P 15 BIENN EURASIP C, P77 VICH R, 1996, P 6 NAT SCI C INT PA, P1 VONDRA M, 2005, ELECT SPEECH SIGNAL, P423 Vondra M, 2005, LECT NOTES ARTIF INT, V3445, P421 Ye H., 2003, P EUR, P2409 NR 34 TC 10 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2006 VL 48 IS 12 BP 1691 EP 1703 DI 10.1016/j.specom.2006.08.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122SI UT WOS:000243246600008 ER PT J AU Murphy, PJ AF Murphy, Peter J. TI Periodicity estimation in synthesized phonation signals using cepstral rahmonic peaks SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT International Conference on Non-Linear Speech Processing CY APR 19-22, 2005 CL Barcelona, SPAIN DE voice signal aperiodicity; rahmonic analysis; harmonics-to-noise ratio ID TO-NOISE RATIO; PATHOLOGICAL VOICES; SPEECH SIGNALS; HOARSENESS; QUALITY; PERTURBATION; INDEX AB Aperiodicity in sustained phonation can result from temporal, amplitude and waveshape perturbations, turbulent noise, nonlinear phenomena and non-stationarity of the vocal tract. General measures of the periodicity of the voice signal are of interest in, for example, quantifying voice quality and in the assessment of pathological voice. High and low quefrency cepstral techniques are employed to supply an index of the degree of voice signal periodicity. In the high quefrency region, the first rahmonic is used to provide an indication of the periodicity of the signal. A new measure, SRA (sum of rahmonic amplitudes) - utilising all rahmonics in the cepstrum, is tested against synthesis data (six levels of random jitter, cyclic jitter, shimmer and random noise). 
In addition, an existing popular technique using the first rahmonic (cepstral peak prominence, CPP) is assessed with synthesis data for the first time. Both measures decrease with increasing aperiodicity levels of the glottal source, decreasing more noticeably for noise and random jitter than for shimmer and cyclic jitter. CPP is shown to be relatively f(o)-independent; however, the index appears to be less sensitive when compared against SRA. (C) 2006 Elsevier B.V. All rights reserved. C1 Univ Limerick, Dept Elect & Comp Engn, Limerick, Ireland. RP Murphy, PJ (reprint author), Univ Limerick, Dept Elect & Comp Engn, Limerick, Ireland. EM Peter.Murphy@ul.ie CR Awan SN, 2006, CLIN LINGUIST PHONET, V20, P35, DOI 10.1080/02699200400008353 DEJONCKERE PH, 1994, CLIN LINGUIST PHONET, V8, P161, DOI 10.3109/02699209408985304 DEKROM G, 1993, J SPEECH HEAR RES, V36, P254 Herman-Ackah Y., 2002, J VOICE, V16, P20 HILLENBRAND J, 1994, J SPEECH HEAR RES, V37, P769 HIRAOKA N, 1984, J ACOUST SOC AM, V76, P1648, DOI 10.1121/1.391611 Imaizumi S., 1986, ICASSP 86 Proceedings. IEEE-IECEJ-ASJ International Conference on Acoustics, Speech and Signal Processing (Cat. No.86CH2243-4) KASUYA H, 1991, VOCAL FOLD PHYSL ACO, P251 KASUYA H, 1986, J ACOUST SOC AM, V80, P1329, DOI 10.1121/1.394384 Kasuya H, 1995, VOCAL FOLD, P305 KASUYA Y, 1986, IEEE INT C AC SPEECH, P669 KITAJIMA K, 1981, FOLIA PHONIATR, V3, P145 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 KLINGHOLZ F, 1985, J SPEECH HEAR RES, V28, P169 KOIKE Y, 1986, J PHONETICS, V14, P501 KOIKE Y, 1991, VOCAL FOLD PHYSL ACO, P259 KOJIMA H, 1980, ACTA OTO-LARYNGOL, V89, P547, DOI 10.3109/00016488009127173 Ladefoged P., 1985, UCLA WORKING PAPERS, V61, P79 MANFREDI C, 2003, 3 INT WORKSH FIR IT Michaelis D, 1997, ACUSTICA, V83, P700 MURPHY PJ, 2000, P IR SIGN SYST C DUB, P266 Murphy PJ, 1999, J ACOUST SOC AM, V105, P2866, DOI 10.1121/1.426901 MURPHY PJ, 2005, LECT NOTES ARTIF INT, V3445, P119 MURPHY PJ, IN PRESS J ACOUST SO, V120 MURPHY PJ, 2000, P INT C SPOK LANG PR, P672 MUTA H, 1988, J ACOUST SOC AM, V84, P1292, DOI 10.1121/1.396628 NOLL AM, 1967, J ACOUST SOC AM, V41, P293, DOI 10.1121/1.1910339 QI Y, 1992, J ACOUST SOC AM, V92, P1569 Qi YY, 1997, J ACOUST SOC AM, V102, P537, DOI 10.1121/1.419726 QI YY, 1995, J ACOUST SOC AM, V97, P2525, DOI 10.1121/1.411972 RABINER LR, 1978, DIGITAL PROCESS SPEE ROSENBER.AE, 1971, J ACOUST SOC AM, V49, P583, DOI 10.1121/1.1912389 Yegnanarayana B, 1998, IEEE T SPEECH AUDI P, V6, P1, DOI 10.1109/89.650304 YUMOTO E, 1982, J ACOUST SOC AM, V71, P1544, DOI 10.1121/1.387808 NR 34 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2006 VL 48 IS 12 BP 1704 EP 1713 DI 10.1016/j.specom.2006.09.001 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 122SI UT WOS:000243246600009 ER PT J AU Milner, B Wellekens, C Lindberg, B AF Milner, Ben Wellekens, Christian Lindberg, Borge TI Special Issue on Robustness Issues for Conversational Interaction SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. Aalborg Univ, Dept Elect Syst, Aalborg, Denmark. RP Milner, B (reprint author), Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. 
EM b.milner@uea.ac.uk; Christian.wellekens@eurecom.fr; bli@kom.auc.dk NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2006 VL 48 IS 11 BP 1399 EP 1401 DI 10.1016/j.specom.2006.09.002 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300001 ER PT J AU James, A Milner, B AF James, Alastair Milner, Ben TI Towards improving the robustness of distributed speech recognition in packet loss SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE distributed speech recognition; packet loss; interleaving; MAP reconstruction; weighted-Viterbi decoding AB This work addresses the problem of achieving robust distributed speech recognition (DSR) performance in the presence of packet loss. The nature of packet loss is analysed by examining packet loss data gathered from a GSM mobile data channel. This analysis is then used to examine the effect of realistic packet loss conditions on DSR systems, and shows that the accuracy of DSR is more sensitive to burst-like packet loss rather than the actual number of lost packets. This leads to the design of a three-stage packet loss compensation scheme. First, interleaving is applied to the transmitted feature vectors to disperse bursts of packet loss. Second, lost feature vectors are reconstructed prior to recognition using a variety of reconstruction techniques. Third, a weighted-Viterbi decoding method is applied to the recogniser itself, which modifies the contribution of the reconstructed feature vectors according to the accuracy of their reconstruction. Experimental results on both a connected digits task and a large-vocabulary task show that simple methods, such as repetition, are not as effective as interpolation methods. Best performance is given by a novel maximum a posteriori (MAP) estimation, which utilizes temporal statistics of the feature vector stream. This reconstruction method is then combined with weighted-Viterbi decoding, using a novel method to calculate the confidences of reconstructed static and temporal components separately. Using interleaving, results improve significantly, and it is shown that a limited level of interleaving can be applied without increasing the delay to the end-user. Using a combination of these techniques for the connected digits task, word accuracy is increased from 49.5% to 95.3% even with a packet loss rate of 50% and average burst length of 20 feature vectors. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. RP James, A (reprint author), Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. 
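The third stage described in this abstract, weighted-Viterbi decoding, admits a compact statement: scale each frame's acoustic log-likelihood by a confidence in [0, 1], so received frames count fully and reconstructed frames count less. A minimal sketch of that weighting (the exponential-weighting form shown is one common variant, not necessarily the paper's exact scheme, and all array shapes are assumptions):

```python
import numpy as np

def weighted_viterbi(log_b, log_A, log_pi, gamma):
    """log_b: (T, N) frame log-likelihoods; log_A: (N, N) log transitions;
    log_pi: (N,) log initial probabilities; gamma: (T,) per-frame confidence
    in [0, 1] (1 = received frame, < 1 = reconstructed after packet loss)."""
    T, N = log_b.shape
    delta = log_pi + gamma[0] * log_b[0]         # weighted initial scores
    psi = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # scores[i, j]: i -> j
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(N)] + gamma[t] * log_b[t]
    state = int(np.argmax(delta))                # backtrack the best path
    path = [state]
    for t in range(T - 1, 0, -1):
        state = int(psi[t, state])
        path.append(state)
    return path[::-1]
```

With gamma set to all ones this reduces to standard Viterbi decoding; driving gamma toward zero for frames reconstructed deep inside a burst makes the transition model dominate exactly where the acoustic evidence is least trustworthy.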
EM a.james@uea.ac.uk; b.milner@uea.ac.uk CR ANDREWS K, 1997, 971634 CORN U COMP S [Anonymous], 2003, 202050 ETSI ES ARIZMENDI I, 2004, P ICASSP MONTR CAN BASAGNI S, 2004, MOBILE AD HOC NETWOR BERNARD A, 2002, P ICSLP 2002 BERNARD A, 2002, IEEE T SPEECH AUDIO, V10 BOLOT JC, 1995, P NOSSDAV BOULIS C, 2002, IEEE T SPEECH AUDIO, V10 CARDENALLOPEZ A, 2004, P ICASSP 2004 MONTR CHESTERFIELD J, 2004, P WORKSH BROADB WIR Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 CUNY R, 2003, VOIP 3G NETWORKS END ENDO T, 2003, P EUROSPEECH 2003 *ERICS, 2000, AU26600 ER *ETSI, 2002, STQ DSR ADV FRONT EN *ETSI, 2000, STQ DSR FRONT END FE FURUI S, 1986, IEEE T ASSP, V34 GOMEZ AM, 2004, P ROBUST 2004 NORW U Halsall F., 1995, DATA COMMUNICATIONS Hirsch H., 2000, P ISCA ITRW ASR2000 JAMES AB, 2004, P EUSIPCO 2004 VIENN JAMES AB, 2004, P ICASSP 2004 MONTR JAMES AB, 2005, P ICASSP 2005 PHIL U Ji P, 2004, PERFORM EVALUATION, V55, P165, DOI 10.1016/S0166-5316(03)00104-4 MILNER BP, 2004, P ICSLP 2004 JEJ ISL MILNER BP, 2003, P EUR MUTTER A, 2004, P EUNICE 2004 NOURELDIN AH, 2004, P ICASSP 2004 MONTR PEARCE D, 2000, P AVIOS 2000 PEARCE D, 2004, P ROBUST 2004 NORW U PEINADO AM, 2003, SPEECH COMMUN, V41, P549 Raj B., 2000, THESIS CARNEGIE MELL Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 RAMSEY RL, 1970, IEEE T INFORM THEORY, V16, P772 ROBINSON T, 1995, P ICASSP 1995 Schulzrinne H., 2003, 3550 IETF RFC TAN Z, 2004, P ICASSP 2004 MONTR Vaseghi S. V., 2000, ADV DIGITAL SIGNAL P Wesolowski K., 2002, MOBILE COMMUNICATION XIE Q, 2003, 3557 RFC IETF XIE Q, 4060 RFC IETF YAJNIK M, 1999, P INFOCOMM 1999 NR 42 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2006 VL 48 IS 11 BP 1402 EP 1421 DI 10.1016/j.specom.2006.07.005 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300002 ER PT J AU Cardenal-Lopez, A Garcia-Mateo, C Docio-Fernandez, L AF Cardenal-Lopez, Antonio Garcia-Mateo, Carmen Docio-Fernandez, Laura TI Weighted Viterbi decoding strategies for distributed speech recognition over IP networks SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE distributed speech recognition; weighted Viterbi decoding; missing data ID FRONT-END; CONCEALMENT AB The presence of burst-like packet losses may produce a serious degradation of the performance of Distributed Speech Recognition systems over IP networks. Several strategies that exploit high speech-signal correlation have already been proposed to overcome this problem. One of the most common mechanisms is to fill in the gap using the correctly received frames, as in nearest frame repetition (NFR), the mechanism proposed in the ETSI Aurora standard. Other strategies involve the reconstruction of the lost segment using interpolation algorithms or even more complex mechanisms. However, the effectiveness of these strategies may be seriously compromised if losses appear in long bursts. Interpolation may simply become unfeasible if the lost segment is a lengthy one. In repetition algorithms, on the other hand, the unreliability of the substituted frame produces a high insertion rate that can increase the word error rate to the point where the mechanism becomes counterproductive.
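The ETSI baseline named in this abstract, nearest frame repetition (NFR), can be sketched in a few lines. Assumptions are mine (features held as a NumPy array, losses marked with a boolean mask; the standard's exact error-concealment rules differ in detail): each lost frame is copied from whichever correctly received frame is nearest in time, so the first half of a gap repeats the preceding frame and the second half the following one.

import numpy as np

def nearest_frame_repetition(frames, lost):
    """frames: (T, D) features; lost: (T,) bool mask of lost frames.
    Each lost frame is copied from the nearest received neighbour."""
    received = np.where(~lost)[0]
    out = frames.copy()
    for t in np.where(lost)[0]:
        nearest = received[np.argmin(np.abs(received - t))]
        out[t] = frames[nearest]
    return out

x = np.arange(8, dtype=float)[:, None]            # toy 1-D "features"
lost = np.array([0, 0, 1, 1, 1, 1, 0, 0], dtype=bool)
print(nearest_frame_repetition(x, lost).ravel())  # [0 1 1 1 6 6 6 7]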
A feasible strategy for palliating this effect is to use a soft-decoding algorithm. In this kind of strategy, the reliability of the frame is taken into account at the decoding stage, aiming to mitigate the effect of the incorrect vectors in the recognition stage. In this paper we present a comprehensive study of the performance of some soft-decoding weighted Viterbi algorithms in a burst-like network environment. Tests will be conducted using the Aurora 3 framework working over three simulated network conditions, with the NFR algorithm as the baseline approach. Three techniques will be thoroughly explored. The first one uses a fixed weighting coefficient along the burst. This improves the results obtained by the basic NFR algorithm, but with the drawback, however, of showing a strong dependence on mean burst length. The application of a varying weighting coefficient is proposed next as a way to solve this problem. Two strategies-one heuristic and one data-driven-are proposed to obtain the variation in the weighting coefficient. Finally, a third method that exploits the different degree of correlation of individual components of the feature vector is presented. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Vigo, Dpto Teor Senal & Commun, Vigo, Spain. RP Cardenal-Lopez, A (reprint author), Univ Vigo, Dpto Teor Senal & Commun, Campus Univ,3631, Vigo, Spain. EM cardenal@gts.tsc.uvigo.es; carmen@gts.tsc.uvigo.es; Idocio@gts.tsc.uvigo.es CR BERNARD A, 2001, P EUROSPEECH, V4, P2703 Bernard A, 2002, IEEE T SPEECH AUDI P, V10, P570, DOI 10.1109/TSA.2002.808141 BOLOT J, 2003, P ACM SIGCOMM, P289 Boulis C, 2002, IEEE T SPEECH AUDI P, V10, P580, DOI 10.1109/TSA.2002.804532 CARDENALLOPEZ A, 2002, P ICASSP MAY, P705 CARDENALLOPEZ A, 2004, P IEEE INT C AC SPEE, P49 COOKE M, 2001, SPEECH COMMUN, V34, P261 DOCIOFERNANDEZ L, 2002, P INT C SPEECH LANG, P461 *ETSI, 2000, ETSIES201108V112 *ETSI, 2002, ETSIES202050V111 EULER S, 1994, P INT C AC SPEECH SI, V1, P621 Fingscheidt T, 2001, IEEE T SPEECH AUDI P, V9, P240, DOI 10.1109/89.905998 Hagenauer J, 1989, P IEEE GLOB TEL C GL, V3, P1680 Hirsch H. G., 2002, P INT C SPOK LANG PR, P1877 Hirsch H.G., 2000, ISCA ITRW ASR2000 JAMES A, 2004, P 2 COST 278 ISCA WO James A.B., 2004, P IEEE INT C AC SPEE, V1, P853 Jiang W., 2000, P 10 INT WORKSH NETW Kim HK, 2001, IEEE T SPEECH AUDI P, V9, P558 LILY BT, 1996, P INT C SPEECH LANG, P2344 MILNER B, 2000, P IEEE INT C AC SPEE, P1791 MILNER B, 2001, P IEEE INT C AC SPEE, P261 PEARCE D, 2000, APPL VOIC INP OUTP S PEARCE D, 2004, P 2 COST 278 ISCA WO Peinado AM, 2003, SPEECH COMMUN, V41, P549, DOI 10.1016/S0167-6393(03)00048-7 Pelaez-Moreno C, 2001, IEEE T MULTIMEDIA, V3, P209, DOI 10.1109/6046.923820 POTAMIANOS A, 2001, P IEEE INT C AC SPEE, P269 Quercia D., 2002, P IEEE INT C AC SPEE, V4, P3820 SANNECK H, 2000, P SPIE ACM SIGMM MUL, P177 Tan ZH, 2003, ELECTRON LETT, V39, P1619, DOI 10.1049/el:20031026 TAN ZH, 2004, P 2 COST 278 ISCA WO NR 31 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
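A minimal sketch of the soft-decoding idea this abstract explores: scaling each frame's acoustic log-likelihood by a reliability weight inside the Viterbi recursion. The formulation below is a generic log-domain Viterbi of my own devising (the paper's burst-dependent weighting schedules and Gaussian-mixture likelihoods are not reproduced); a weight of 1 trusts a frame fully, 0 leaves only the transition model.

import numpy as np

def weighted_viterbi(log_a, log_b, log_pi, weights):
    """log_a: (S, S) log transitions; log_b: (T, S) per-frame log
    observation likelihoods; weights: (T,) frame reliabilities in [0, 1]
    that scale the acoustic score."""
    T, S = log_b.shape
    delta = log_pi + weights[0] * log_b[0]
    psi = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_a                # (S_prev, S_next)
        psi[t] = np.argmax(scores, axis=0)
        delta = scores[psi[t], np.arange(S)] + weights[t] * log_b[t]
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# Toy 2-state example: frame 1 is marked unreliable, so its (contradictory)
# likelihood is ignored and the sticky transitions keep the path in state 0
log_a = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_b = np.log(np.array([[0.9, 0.1], [0.1, 0.9], [0.9, 0.1]]))
log_pi = np.log(np.array([0.5, 0.5]))
print(weighted_viterbi(log_a, log_b, log_pi, np.array([1.0, 0.0, 1.0])))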
PD NOV PY 2006 VL 48 IS 11 BP 1422 EP 1434 DI 10.1016/j.specom.2006.01.006 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300003 ER PT J AU Ion, V Haeb-Umbach, R AF Ion, Valentin Haeb-Umbach, Reinhold TI Uncertainty decoding for distributed speech recognition over error-prone networks SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE distributed speech recognition; channel error robustness; soft features; uncertainty decoding AB In this paper, we propose an enhanced error concealment strategy at the server side of a distributed speech recognition (DSR) system, which is fully compatible with the existing DSR standard. It is based on a Bayesian approach, where the a posteriori probability density of the error-free feature vector is computed, given all received feature vectors which are possibly corrupted by transmission errors. Rather than computing a point estimate, such as the MMSE estimate, and plugging it into the Bayesian decision rule, we employ uncertainty decoding, which results in an integration over the uncertainty in the feature domain. In a typical scenario the communication between the thin client, often a mobile device, and the recognition server spreads across heterogeneous networks. Both bit errors on circuit-switched links and lost data packets on IP connections are mitigated by our approach in a unified manner. The experiments reveal improved robustness both for small- and large-vocabulary recognition tasks. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Paderborn, Dept Commun Engn, D-33098 Paderborn, Germany. RP Ion, V (reprint author), Univ Paderborn, Dept Commun Engn, Pohlweg 47-49, D-33098 Paderborn, Germany. EM ion@nt.uni-paderborn.de; haeb@nt.uni-paderborn.de CR [Anonymous], 1989, 207 COST BAHL LR, 1974, IEEE T INFORM THEORY, V20, P284, DOI 10.1109/TIT.1974.1055186 Bernard A, 2002, IEEE T SPEECH AUDI P, V10, P570, DOI 10.1109/TSA.2002.808141 BOLOT JC, 1999, P IEEE INFOCOM 99 Boulis C, 2002, IEEE T SPEECH AUDI P, V10, P580, DOI 10.1109/TSA.2002.804532 Deng L, 2005, IEEE T SPEECH AUDI P, V13, P412, DOI 10.1109/TSA.2005.845814 DENG L, 2002, P ICSLP DENV US DOCIOFERNANDEZ L, 2002, P ICSLP DENV US ENDO T, 2003, P EUROSPEECH GEN SWI *ETSI, 2002, ES 202 050 V1 1 1 SP *ETSI, 2003, ES 202 121 V1 1 1 *ETSI, 2003, ETSI TS 100 909 V8 7 FINGSCHEIDT T, 2002, P ICSLP DENV FINGSCHEIDT T, 2001, IEEE T SPEECH AUDIO, V9, P1 GILBERT EN, 1960, CAPACITY BURST NOISE GOMEZ A, 2003, P EUROSPEECH HAEBUMBACH R, 2004, SOFT FEATURES IMPROV HAGENAUER J, 1989, P IEEE GLOBAL COMMUN HIRSCH H, 2000, AURORA EXPT FRAM PER Huo Q, 2000, IEEE T SPEECH AUDI P, V8, P200 ION V, 2002, P ICASSP PHIL ION V, 2005, P INT LISB KELLEHER H, 2002, P ICSLP DENV MILNER B, 2000, P ICASSP IST Peinado AM, 2003, SPEECH COMMUN, V41, P549, DOI 10.1016/S0167-6393(03)00048-7 PUAL D, 1992, DESIGN WALL STRET J TAN ZH, 2005, SPEECH COMMUN TAN ZH, 2004, P ICASSP MONT Weerackody V, 2002, IEEE T WIREL COMMUN, V1, P282, DOI 10.1109/7693.994822 WOODLAND P, 1993, P EUR LISB, V2, P2207 XIE Q, 2005, RTP PAYLOAD FORMATS NR 31 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
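The marginalization at the heart of the uncertainty decoding described in the Ion and Haeb-Umbach record above admits a closed form for diagonal Gaussians: integrating the backend Gaussian against a Gaussian feature posterior N(x; x_hat, sigma_hat^2) yields N(x_hat; mu, sigma^2 + sigma_hat^2), i.e. the feature uncertainty is added to the model variance. The sketch below shows only that step; estimating the posterior from bit errors and lost packets, the substance of the paper, is not reproduced.

import numpy as np

def log_gauss_diag(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def uncertainty_score(x_hat, x_var, mu, var):
    """Expected Gaussian likelihood under a Gaussian feature posterior
    N(x; x_hat, x_var): equivalent to inflating the model variance."""
    return log_gauss_diag(x_hat, mu, var + x_var)

mu, var = np.zeros(2), np.ones(2)
x_hat = np.array([0.5, -0.2])
print(uncertainty_score(x_hat, np.zeros(2), mu, var))     # = plain likelihood
print(uncertainty_score(x_hat, 4 * np.ones(2), mu, var))  # flatter, less decisive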
PD NOV PY 2006 VL 48 IS 11 BP 1435 EP 1446 DI 10.1016/j.specom.2006.03.007 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300004 ER PT J AU Ishizuka, K Nakatani, T AF Ishizuka, Kentaro Nakatani, Tomohiro TI A feature extraction method using subband based periodicity and aperiodicity decomposition with noise robust frontend processing for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE speech feature; noise robust frontend; subband; periodicity; aperiodicity ID IDENTIFICATION; ENVIRONMENTS; PATTERNS AB This paper proposes a frontend processing technique that employs a speech feature extraction method called Subband based Periodicity and Aperiodicity DEcomposition (SPADE), and examines its validity for automatic speech recognition in noisy environments. SPADE divides speech signals into subband signals, which are then decomposed into their periodic and aperiodic features, and uses both features as speech feature parameters. SPADE employs independent periodicity estimation within each subband and periodicity-aperiodicity decomposition design based on a parallel distributed processing technique motivated by the human speech perception process. Unlike other speech features, this decomposition of speech into two characteristics provides information about periodicities and aperiodicities, and thus allows the utilization of the robustness exhibited by periodic features without losing certain essential information included in aperiodic features. This paper first introduces an implementation of SPADE that operates in the frequency domain, and then examines the validity of combining SPADE with speech enhancement methods. For this examination, we combine SPADE with noise compensation methods that operate in the frequency domain and cepstral normalization methods. In addition, we employ an energy parameter calculation method based on the SPADE framework. An evaluation with the AURORA-2J noisy continuous digit speech recognition database (Japanese AURORA-2) shows that SPADE combined with adaptive Wiener filtering, cepstral normalization, and the energy parameter achieves average word accuracy rates of 82.58% with clean training and 92.55% with multicondition training. These rates are higher than those achieved with ETSI W1008 advanced DSR frontend processing (77.98% and 91.01%, respectively) whose speech feature parameter is based on conventional Mel-frequency cepstral coefficients. By comparison with ETSI W1008 advanced DSR frontend, the proposed method reduces word error rates by 20.9% with clean training and 17.2% with multicondition training. These results confirmed that SPADE combined with noise reduction methods can increase robustness in the presence of noise. (c) 2006 Elsevier B.V. All rights reserved. C1 Nippon Telegraph & Tel Corp, NTT Commun Sci Labs, Kyoto 6190237, Japan. RP Ishizuka, K (reprint author), Nippon Telegraph & Tel Corp, NTT Commun Sci Labs, Hikaridai 2-4, Kyoto 6190237, Japan. 
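A loose illustration of the periodicity/aperiodicity decomposition described in the Ishizuka and Nakatani record above. This is not the SPADE algorithm (the paper's subband design, per-subband periodicity estimation and frontend integration are far more elaborate, and the band edges and filter order here are my own choices): the normalized autocorrelation peak within a pitch range simply splits each subband's power into periodic and aperiodic shares.

import numpy as np
from scipy.signal import butter, lfilter

def subband_periodicity(x, fs, bands=((100, 800), (800, 2500), (2500, 3800)),
                        f0_range=(60.0, 400.0)):
    """Per subband, split the power into periodic/aperiodic shares using
    the normalized autocorrelation peak within the pitch lag range."""
    lo_lag, hi_lag = int(fs / f0_range[1]), int(fs / f0_range[0])
    feats = []
    for lo, hi in bands:
        b, a = butter(4, [lo / (fs / 2), hi / (fs / 2)], btype="band")
        s = lfilter(b, a, x)
        r = np.correlate(s, s, mode="full")[len(s) - 1:]
        rho = max(0.0, float(np.max(r[lo_lag:hi_lag]) / (r[0] + 1e-12)))
        power = r[0] / len(s)
        feats.append((rho * power, (1.0 - rho) * power))  # (periodic, aperiodic)
    return np.array(feats)

fs = 8000
t = np.arange(fs // 4) / fs
x = np.sin(2 * np.pi * 150 * t) + 0.1 * np.random.randn(len(t))
print(subband_periodicity(x, fs))   # first band: mostly periodic power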
EM ishizuka@cslab.kecl.ntt.co.jp; nak@cslab.kecl.ntt.co.jp CR Aikawa K, 1996, J ACOUST SOC AM, V100, P603, DOI 10.1121/1.415961 Ali AMA, 2002, IEEE T SPEECH AUDI P, V10, P279, DOI 10.1109/TSA.2002.800556 ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 BERNARD A, 2004, P 29 INT C AC SPEECH, V1, P1025 Berouti M., 1979, P IEEE INT C AC SPEE, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Chen C.-P., 2002, P ICSLP, P241 DAMI A, 2002, P ICSLP, P21 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 deCheveigne A, 1997, J ACOUST SOC AM, V101, P2848, DOI 10.1121/1.419476 *ETSI, 2003, ETSI ES 202 050 V 1 Gales M.J.F., 1993, P EUROSPEECH, P837 GAO Y, 1992, P 2 INT C SPOK LANG, P73 GHITZA O, 1988, J PHONETICS, V16, P109 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Greenberg S., 2004, SPEECH PROCESSING AU HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hess W., 1983, PITCH DETERMINATION Hirsch H. G., 2000, P ISCA ITRW ASR2000, P181 ISHIZUKA K, 2004, P ICASSP, V1, P141 ISHIZUKA K, 2004, P ICSLP, V2, P937 ITAKURA F, 1975, IEEE T ACOUST SPEECH, VAS23, P67, DOI 10.1109/TASSP.1975.1162641 JACKSON PJB, 2003, P EUROSPEECH, P2321 Kajita S, 1995, P ICASSP, P421 Kim DS, 1999, IEEE T SPEECH AUDI P, V7, P55 Li Q., 2001, P 7 EUR C SPEECH COM, P619 LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z Macho D., 2002, P ICSLP, P17 Mauuary L., 1998, P EUSPICO 98, V1, P359 MINAMI Y, 1995, P ICASSP, P129 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 NAKAMURA S, 2003, P 8 IEEE WORKSH AUT, P619 Nakamura S, 2005, IEICE T INF SYST, VE88D, P535, DOI 10.1093/ietisy/e88-d.3.535 PATTERSON RD, 1986, FREQUENCY SELECTIVIT, P23 PATTERSON RD, 1976, J ACOUST SOC AM, V59, P640, DOI 10.1121/1.380914 Pearce D., 2000, P ICSLP, V4, P29 RABINER LR, 1977, IEEE T ACOUST SPEECH, V25, P24, DOI 10.1109/TASSP.1977.1162905 SENEFF S, 1988, J PHONETICS, V16, P55 NR 39 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2006 VL 48 IS 11 BP 1447 EP 1457 DI 10.1016/j.specom.2006.06.008 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300005 ER PT J AU Shannon, BJ Paliwal, KK AF Shannon, Benjamin J. Paliwal, Kuldip K. TI Feature extraction from higher-lag autocorrelation coefficients for robust speech recognition. SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE speech recognition; feature extraction; robustness to noise; MFCC ID SPECTRAL ESTIMATION; LINEAR PREDICTION; NOISE AB In this paper, a feature extraction method that is robust to additive background noise is proposed for automatic speech recognition. Since the background noise corrupts the autocorrelation coefficients of the speech signal mostly at the lower-time lags, while the higher-lag autocorrelation coefficients are least affected, this method discards the lower-lag autocorrelation coefficients and uses only the higher-lag autocorrelation coefficients for spectral estimation. 
The magnitude spectrum of the windowed higher-lag autocorrelation sequence is used here as an estimate of the power spectrum of the speech signal. This power spectral estimate is processed further (like the well-known Mel frequency cepstral coefficient (MFCC) procedure) by the Mel filter bank, log operation and the discrete cosine transform to get the cepstral coefficients. These cepstral coefficients are referred to as the autocorrelation Mel frequency cepstral coefficients (AMFCCs). We evaluate the speech recognition performance of the AMFCC features on the Aurora and the resource management databases and show that they perform as well as the MFCC features for clean speech and their recognition performance is better than the MFCC features for noisy speech. Finally, we show that the AMFCC features perform better than the features derived from the robust linear prediction-based methods for noisy speech. (c) 2006 Elsevier B.V. All rights reserved. C1 Griffith Univ, Sch Microelect Engn, Brisbane, Qld 4111, Australia. RP Paliwal, KK (reprint author), Griffith Univ, Sch Microelect Engn, Nathan Campus, Brisbane, Qld 4111, Australia. EM K.Paliwal@griffith.edu.au CR Anstey N. A., 1966, CAN J EXPLOR GEOPHYS, V2, P55 Bellegarda JR, 1997, P EUROSPEECH Bourlard H., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607145 CADZOW JA, 1982, P IEEE, V70, P907, DOI 10.1109/PROC.1982.12424 CHAN YT, 1982, IEEE T ACOUST SPEECH, V30, P689, DOI 10.1109/TASSP.1982.1163946 COOKE M, 1997, P ICASSP, P863 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 GERSCH W, 1970, IEEE T AC, V5, P583 Ghitza O., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80018-3 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Harris JFredric, 1978, P IEEE, V66 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMUS K, 2004, P IEEE INT C AC SPEE, V1, P945 Hernando J, 1997, IEEE T SPEECH AUDI P, V5, P80, DOI 10.1109/89.554273 *HTK, HIDD MARK MOD TOOL K HUAGN X, 2001, SPOKEN LANGUAGE PROC Juang B. H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E Kay SM, 1988, MODERN SPECTRAL ANAL KAY SM, 1979, IEEE T ACOUST SPEECH, V27, P478, DOI 10.1109/TASSP.1979.1163275 KIM HG, 2003, P EUROSPEECH, P545 Lee CH, 1998, SPEECH COMMUN, V25, P29, DOI 10.1016/S0167-6393(98)00028-4 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P795, DOI 10.1109/ASSP.1989.28053 MCGINN DP, 1989, IEEE T ACOUST SPEECH, V37, P433, DOI 10.1109/29.21712 PALIWAL KK, 1986, P EUSIPCO, P593 PALIWAL KK, 1986, P ICASSP APR, P1369 Paliwal KK, 1997, P EUROSPEECH, P279 PALIWAL KK, 1986, P EUSIPCO, P295 PALIWAL KK, 1991, P INT C AC SPEECH SI, P429, DOI 10.1109/ICASSP.1991.150368 Pearce D., 2000, P ICSLP, V4, P29 Price P., 1988, P IEEE INT C AC SPEE, P651 Rabiner L, 1993, FUNDAMENTALS SPEECH Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 SHANNON BJ, 2005, P ICASSP, V2 SHANNON BJ, 2004, P ICSLP Stern R., 1996, AUTOMATIC SPEECH REC, P357 Tibrewala S., 1997, P ICASSP, P1255 NR 39 TC 12 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
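The AMFCC feature of the Shannon and Paliwal record above follows the MFCC recipe with one substitution, which makes it easy to sketch. The 2 ms lag cutoff, the lag window and the stand-in filterbank below are my illustrative choices: the power spectrum estimate is the magnitude spectrum of the autocorrelation sequence with its lower-lag coefficients discarded, after which mel filtering, log and DCT proceed as usual.

import numpy as np
from scipy.fftpack import dct

def amfcc(frame, fs, mel_fb, n_ceps=13, min_lag_s=0.002):
    """AMFCC-style cepstra: suppress the noise-dominated low-lag
    autocorrelation coefficients, window the higher lags, and use the
    magnitude spectrum of that sequence as the power spectrum estimate."""
    n = len(frame)
    r = np.correlate(frame, frame, mode="full")[n - 1:]   # lags 0..n-1
    r[: int(min_lag_s * fs)] = 0.0            # zero out the low lags
    r *= np.hamming(2 * n)[n:]                # decaying lag window
    pspec = np.abs(np.fft.rfft(r, 2 * n))     # spectral estimate
    mel = np.log(mel_fb @ pspec + 1e-10)
    return dct(mel, norm="ortho")[:n_ceps]

# Toy run with a random placeholder mel filterbank (23 bands x 257 bins)
fs, n = 8000, 256
mel_fb = np.abs(np.random.randn(23, n + 1))
print(amfcc(np.random.randn(n), fs, mel_fb).shape)        # (13,)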
PD NOV PY 2006 VL 48 IS 11 BP 1458 EP 1485 DI 10.1016/j.specom.2006.08.003 PG 28 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300006 ER PT J AU Srinivasan, S Roman, N Wang, DL AF Srinivasan, Soundararajan Roman, Nicoleta Wang, DeLiang TI Binary and ratio time-frequency masks for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE ideal binary mask; ratio mask; robust speech recognition; missing-data recognizer; binaural processing; speech segregation ID SEPARATION; NOISE; SUPPRESSION AB A time-varying Wiener filter specifies the ratio of a target signal and a noisy mixture in a local time-frequency unit. We estimate this ratio using a binaural processor and derive a ratio time-frequency mask. This mask is used to extract the speech signal, which is then fed to a conventional speech recognizer operating in the cepstral domain. We compare the performance of this system with a missing-data recognizer that operates in the spectral domain using the time-frequency units that are dominated by speech. To apply the missing-data recognizer, the same binaural processor is used to estimate an ideal binary time-frequency mask, which selects a local time-frequency unit if the speech signal within the unit is stronger than the interference. We find that the performance of the missing data recognizer is better on a small vocabulary recognition task but the performance of the conventional recognizer is substantially better when the vocabulary size is increased. (c) 2006 Elsevier B.V. All rights reserved. C1 Ohio State Univ, Dept Biomed Engn, Columbus, OH 43210 USA. Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA. Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA. RP Srinivasan, S (reprint author), Ohio State Univ, Dept Biomed Engn, 395 Dreese Labs,2015 Neil Ave, Columbus, OH 43210 USA. EM srinivasan.36@osu.edu; roman.45@osu.edu; dwang@cse.ohio-state.edu CR Barker J., 2000, P ICSLP BEIJ CHIN, P373 Blauert J., 1997, SPATIAL HEARING PSYC BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Bradstein M., 2001, MICROPHONE ARRAYS SI Browns GJ, 2005, SIG COM TEC, P371, DOI 10.1007/3-540-27489-8_16 Cardoso JF, 1998, P IEEE, V86, P2009, DOI 10.1109/5.720250 Chen HH, 2005, WIREL COMMUN MOB COM, V5, P1, DOI 10.1002/wcm.287 Cole R., 1995, P EUR C SPEECH COMM, P821 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 CUNNINGHAM S, 1999, P INT C PHON SCI, P215 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DEVETH J, 1999, P WORKSH ROB METH SP, P231 DROPPO J, 2002, P INT C SPOK LANG PR, P1569 Ehlers F, 1997, IEEE T SIGNAL PROCES, V45, P2608, DOI 10.1109/78.640731 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 Gardner W.G., 1994, 280 MIT MED LAB PERC GAROFOLO JS, 1993, 4930 NISTIR NAT I ST GLOTIN H, 1999, P EUROSPEECH, P2351 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Hughes TB, 1999, IEEE T SPEECH AUDI P, V7, P346, DOI 10.1109/89.759045 Leonard R. 
G., 1984, P ICASSP 84, P111 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Little R. J., 1987, STAT ANAL MISSING DA MACHO D, 2002, P INT C SPOK LANG PR, P175 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 OPPENHEIM AV, 1999, DISCRETE TIME SIGNAL Palomaki KJ, 2004, SPEECH COMMUN, V43, P361, DOI 10.1016/j.specom.2004.03.005 Price P., 1988, P IEEE INT C AC SPEE, P651 Rabiner L, 1993, FUNDAMENTALS SPEECH Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 Roman N, 2003, J ACOUST SOC AM, V114, P2236, DOI 10.1121/1.1610463 Rosenthal D. F., 1998, COMPUTATIONAL AUDITO SHIRE ML, 2000, THESIS U CALIFORNIA Srinivasan S., 2004, P ICSLP, P2541 *STQ AURORA, 2005, ETSI ES 202 050 V1 1 TESSIER E, 1999, P INT C SPEECH P 199, P97 Van Trees H., 1968, DETECTION ESTIMATION, V1st VANHAMME H, 2003, P EUR C SPEECH COMM, P3089 Varga A.P., 1990, P ICASSP, P845 VARGA AP, 1992, NOISEX 92 STUDY EFFE Young S., 2000, HTK BOOK HTK VERSION NR 43 TC 47 Z9 49 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2006 VL 48 IS 11 BP 1486 EP 1501 DI 10.1016/j.specom.2006.09.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300007 ER PT J AU Stouten, V Van Hamme, H Wambacq, P AF Stouten, Veronique Van hamme, Hugo Wambacq, Patrick TI Model-based feature enhancement with uncertainty decoding for noise robust ASR SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE noise robust speech recognition; model-based feature enhancement; additive noise; convolutional noise; uncertainty decoding ID SPEECH RECOGNITION AB In this paper, several techniques are proposed to incorporate the uncertainty of the clean speech estimate in the decoding process of the backend recogniser in the context of model-based feature enhancement (MBFE) for noise robust speech recognition. Usually, the Gaussians in the acoustic space are sampled in a single point estimate, which means that the backend recogniser considers its input as a noise-free utterance. However, in this way the variance of the estimator is neglected. To solve this problem, it has already been argued that the acoustic space should be evaluated in a probability density function, e.g. a Gaussian observation pdf. We illustrate that this Gaussian observation pdf can be replaced by a computationally more tractable discrete pdf, consisting of a weighted sum of delta functions. We also show how improved posterior state probabilities can be obtained by calculating their maximum likelihood estimates or by using the pdf of clean speech conditioned on both the noisy speech and the backend Gaussian. Another simple and efficient technique is to replace these posterior probabilities by M Kronecker deltas, which results in M front-end feature vector candidates, and to take the maximum over their backend scores. Experimental results are given for the Aurora2 and Aurora4 databases to compare the proposed techniques. A significant decrease of the word error rate of the resulting speech recognition system is obtained. (c) 2006 Elsevier B.V. All rights reserved. C1 Katholieke Univ Leuven, Dept ESAT, B-3001 Heverlee, Belgium.
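Of the variants proposed in the Stouten, Van hamme and Wambacq record above, the last one, replacing the posterior by M Kronecker deltas and taking the maximum backend score over M front-end candidates, reduces to a few lines once the candidates exist. The sketch below assumes the front end has already produced the M clean-feature estimates and their log posterior weights; the MBFE machinery that yields them is not reproduced.

import numpy as np

def log_gauss_diag(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

def candidate_max_score(candidates, log_weights, mu, var):
    """candidates: (M, D) front-end clean-feature estimates with log
    posterior weights (M,). Backend emission score for one Gaussian:
    maximum over candidates of weight * likelihood, in the log domain."""
    return np.max(log_weights + log_gauss_diag(candidates, mu, var))

M, D = 3, 4
cands = np.random.randn(M, D)
logw = np.log(np.ones(M) / M)
print(candidate_max_score(cands, logw, np.zeros(D), np.ones(D)))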
RP Stouten, V (reprint author), Katholieke Univ Leuven, Dept ESAT, Kasteelpk Arenberg 10, B-3001 Heverlee, Belgium. EM veronique.stouten@esat.kuleuven.be; hugo.vanhamme@esat.kuleuven.be; patrick.wambacq@esat.kuleuven.be RI Van hamme, Hugo/D-6581-2012 CR [Anonymous], 2002, ETSI ES 202 050 V1 1 Arrowood J. A., 2002, P ICSLP, P1561 Attias H., 2001, P EUROSPEECH, P1903 Benitez M. C., 2004, P ICSLP JEJ ISL KOR, P137 Bernard A., 2004, P ICASSP, P1025 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Demuynck K, 2000, SPEECH COMMUN, V30, P37, DOI 10.1016/S0167-6393(99)00030-8 Deng L., 2002, P ICSLP, P2449 Droppo J., 2001, P EUR, P217 Duchateau J, 1998, SPEECH COMMUN, V24, P5, DOI 10.1016/S0167-6393(98)00002-8 Duchateau J., 2001, P 7 EUR C SPEECH COM, VIII, P1621 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 EPHRAIM Y, 1990, P IEEE INT C AC SPEE, V2, P829 Gales M. J., 1995, THESIS U CAMBRIDGE GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J HOLMES J, 1997, P EUROSPEECH RHOD GR KRISTJANSSON T, 2001, P ASRU MAD DI CAMP I KRISTJANSSON TT, 2002, P ICASSP, P61 MACHO D, 1917, 9I Macho D., 2002, P ICSLP, P17 Sameti H, 1998, IEEE T SPEECH AUDI P, V6, P445, DOI 10.1109/89.709670 Stouten V., 2003, P 2003 EUR GEN SWITZ, P17 STOUTEN V, 2004, P IVSLP JEJ ISL KOR, V1, P105 STOUTEN V, 2004, P ICASSP, V1, P949 Van Compernolle D., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Varga A.P., 1990, P ICASSP, P845 YAMAGUCHI Y, 1997, P EUROSPEECH RHOD GR, P2051 NR 28 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2006 VL 48 IS 11 BP 1502 EP 1514 DI 10.1016/j.specom.2005.12.006 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300008 ER PT J AU Dat, TH Takeda, K Itakura, F AF Dat, Tran Huy Takeda, Kazuya Itakura, Fumitada TI On-line Gaussian mixture modeling in the log-power domain for signal-to-noise ratio estimation and speech enhancement SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE Gaussian mixture modeling; segmental SNR; log-normal distributions; cumulative distribution function equalization; speech enhancement AB We present on-line Gaussian mixture modeling (GMM) in the log-power domain of actual noisy speech and its applications to segmental signal-to-noise ratio (SNR) estimation and speech enhancement. The basic idea in this method is the use of conventional two-component GMM modeling in the log-power domain to estimate the distributions of noise and noisy speech subspaces in each speech segment of a length of 0.5-2 s. Given the subspace distributions, the statistical estimation method is adopted in the applications. For the segmental SNR estimation, the average speech level is estimated from noisy speech using a nonlinear moment of modeled distributions. This method is suitable under real conditions, when neither reference signals nor speech activity is available, and is shown to be more robust and accurate than conventional methods, particularly under low-SNR conditions. 
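A bare-bones sketch of the segment-wise modelling just described: a conventional two-component GMM fitted to log-power samples by EM, with the lower-mean component read as the noise subspace and the higher-mean one as noisy speech. Initialization, the nonlinear-moment level estimate and the multiband extension are not reproduced; the SNR readout below is a deliberately crude difference of component means.

import numpy as np

def two_component_gmm(logp, iters=50):
    """Fit a 2-component 1-D GMM to log-power samples with EM."""
    mu = np.percentile(logp, [25, 75]).astype(float)
    var = np.array([logp.var(), logp.var()]) + 1e-6
    w = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibilities
        ll = -0.5 * (np.log(2 * np.pi * var) + (logp[:, None] - mu) ** 2 / var)
        r = w * np.exp(ll)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weights, means, variances
        n = r.sum(axis=0)
        w = n / len(logp)
        mu = (r * logp[:, None]).sum(axis=0) / n
        var = (r * (logp[:, None] - mu) ** 2).sum(axis=0) / n + 1e-6
    return w, mu, var

# Toy segment: noise near -20 dB, noisy speech near 0 dB (natural-log power)
logp = np.concatenate([np.random.normal(-4.6, 0.3, 300),
                       np.random.normal(0.0, 0.8, 200)])
w, mu, var = two_component_gmm(logp)
snr_rough = (mu.max() - mu.min()) * 10 / np.log(10)   # nats -> dB, crude
print(w, mu, snr_rough)                               # snr_rough near 20 dB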
The proposed GMM model is extended to the multiband log-power domains for noise estimation. We use long-term information, which is obtained by GMM modeling in each segment of 0.5 s, to update the local distributions of noise and noisy speech power at each actual time-frequency index. The cumulative distribution function equalization (CDFE) is then used to estimate the noise and subtract it from the noisy speech power. The advantage of the CDFE method for noise estimation is that the estimation is given in the logarithmic domain without any approximation. The proposed speech enhancement is tested using the AURORA-2J database. We also compare the proposed method to the conventional minimum statistic and quantile-based noise estimation. The proposed method is found to be superior to the conventional in the speech recognition rate over most noise environments and shown to provide very good compromise between speech enhancement and speech recognition performance. (c) 2006 Elsevier B.V. All rights reserved. C1 Inst Infocomm Res, Singapore 119613, Singapore. Nagoya Univ, Grad Sch Informat Sci, Chikusa Ku, Nagoya, Aichi 4648603, Japan. Meijo Univ, Grad Sch Informat Engn, Tempaku Ku, Nagoya, Aichi 4688502, Japan. RP Dat, TH (reprint author), Inst Infocomm Res, Heng Mui Keng Terrace, Singapore 119613, Singapore. EM hdtran@i2r.a-star.edu.sg CR Acero A, 1993, ACOUSTICAL ENV ROBUS AULAY M, 1980, IEEE T ASSP, V28, P137 Burslem R, 2002, MATER WORLD, V10, P6 COHEN I, 2002, IEEE T, V20 DASGUPTA S, 1995, P IEEE ICASSP DEMPSTER A, 1977, P ROY STAT SOC B, V39 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 *ETSI, 2000, ETSI ES201 108 V1 1 Hastie T., 2001, ELEMENTS STAT LEARNI HIRSCH H, 2000, P ISCA ITWR ASR HIRSCH H, 1999, P IEEE RMSRAC *HTK, 1995, HTK BOOK Itou K., 1999, Journal of the Acoustical Society of Japan (E), V20 KORTHAUER A, 1999, P IEEE RMSRAC Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 NAKAMURA S, 2003, P 8 IEEE WORKSH AUT, P619 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 STAHL V, 2000, P ICSLP VANCOMPERNOLLE D, 1989, COMPUT SPEECH LANG, V13, P151 ZOLFAGHARAI P, 1996, P IEEE ICSLP NR 20 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2006 VL 48 IS 11 BP 1515 EP 1527 DI 10.1016/j.specom.2006.06.009 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300009 ER PT J AU Tyagi, V Wellekens, C Slock, DTM AF Tyagi, Vivek Wellekens, Christian Slock, Dirk T. M. TI Least squares filtering of speech signals for robust ASR SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE least squares; adaptive filtering; speech enhancement; robust speech recognition ID ADAPTIVE EQUALIZATION; LATTICE ALGORITHMS; NOISE; IDENTIFICATION; ARRAY AB The behavior of the least squares filter (LeSF) is analyzed for a class of non-stationary signals that are either (a) composed of multiple sinusoids (voiced speech) whose frequencies, phases and the amplitudes may vary from block to block or (b) are output of an all-pole filter excited by white noise input (unvoiced speech segments) and which are embedded in white noise. 
In this work, analytic expressions for the weights and the output of the LeSF are derived as a function of the block length and the signal SNR computed over the corresponding block. We have used LeSF filter estimated on each block to enhance the speech signals embedded in white noise as well as other realistic noises such as factory noise and an aircraft cockpit noise. Automatic speech recognition (ASR) experiments on a connected numbers task, OGI Numbers95 [Varga, A., Steeneken, H., Tomlinson, M., Jones, D., 1992. The NOISEX-92 study on the effect of additive noise on automatic speech recognition. Technical Report, DRA Speech Research Unit, Malvern, England] show that the proposed LeSF based features provide a significant improvement in speech recognition accuracies in various non-stationary noise conditions when compared directly to the un-enhanced speech, spectral subtraction and noise robust CJ-RASTA-PLP features. (c) 2006 Elsevier B.V. All rights reserved. C1 Inst Eurecom, F-06094 Sophia Antipolis, France. Swiss Fed Inst Technol, CH-1015 Lausanne, Switzerland. RP Tyagi, V (reprint author), Inst Eurecom, 2229,Route Cretes,POB 193, F-06094 Sophia Antipolis, France. EM tyagi@eurecom.fr; welleken@eurecom.fr; slock@eurecom.fr CR ANDERSSON CM, 1983, IEEE T ASSP ASSP, V31 BERSHAD NJ, 1980, IEEE T ACOUST SPEECH, V28, P504, DOI 10.1109/TASSP.1980.1163438 COLE RA, 1994, P ICSLP YOK JAP COMPTON RT, 1980, IEEE T AERO ELEC SYS, V16, P280, DOI 10.1109/TAES.1980.308897 EPHRAIM Y, 1985, IEEE T ASSP ASSP, V33 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817 GERSHO A, 1969, AT&T TECH J, V48, P55 GISON C, 1980, IEEE T ASSP ASSP, V28, P681 GRIFFITH.LJ, 1969, P IEEE, V57, P1696, DOI 10.1109/PROC.1969.7385 Haykin S., 1993, ADAPTIVE FILTER THEO HERMANSKY H, 1994, IEEE T SAP, V2 KIM HK, 2003, IEEE T SAP, V11 LATHOUD G, 2005, P EUR LISB PORT MARPLE SL, 1981, IEEE T ACOUST SPEECH, V29, P62, DOI 10.1109/TASSP.1981.1163507 MARTIN R, 2001, IEEE T, V9 MCAULEY RJ, 1980, IEEE T ASSP, V28 MCAULEY RJ, 1986, IEEE T ASSP, V34 RABINER LR, 1978, IEEE T ACOUST SPEECH, V26, P319, DOI 10.1109/TASSP.1978.1163113 SAMBUR MR, 1978, IEEE T ASSP, V26 SATORIUS E, 1978, 331 NAV OC SYST CTR SATORIUS EH, 1979, IEEE T COMMUN, V27, P899, DOI 10.1109/TCOM.1979.1094477 SATORIUS EH, 1981, IEEE T COMMUN, V29, P136, DOI 10.1109/TCOM.1981.1094968 SONDHI MM, 1980, P IEEE, V68, P948, DOI 10.1109/PROC.1980.11774 VARGA A, 1992, NOSIEX 92 STUDY EFFE WIDROW B, 1975, P IEEE, V63, P1692, DOI 10.1109/PROC.1975.10036 Young S., 1995, HTK BOOK ZEIDLER JR, 1978, IEEE T ASSP ASSP, V26 NR 28 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
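A generic block least-squares one-step predictor in the spirit of the LeSF analysis above (the paper's analytic weight expressions and its ASR frontend are not reproduced; filter order, block length and the prediction structure are my illustrative choices). A sinusoid in white noise is largely predictable from its past while the noise is not, so the predictor output emphasizes the periodic component.

import numpy as np

def block_lesf(x, order=16):
    """Solve the normal equations for a one-step linear predictor on this
    block and return the predicted (enhanced) signal and the weights."""
    N = len(x)
    X = np.column_stack([x[order - 1 - k: N - 1 - k] for k in range(order)])
    y = x[order:]
    w, *_ = np.linalg.lstsq(X, y, rcond=None)
    xhat = np.concatenate([x[:order], X @ w])
    return xhat, w

fs = 8000
t = np.arange(2048) / fs
clean = np.sin(2 * np.pi * 440 * t)
noisy = clean + 0.5 * np.random.randn(len(t))
enh, w = block_lesf(noisy)
err_in = np.mean((noisy - clean) ** 2)
err_out = np.mean((enh - clean) ** 2)
print(f"MSE in: {err_in:.3f}  MSE out: {err_out:.3f}")  # out typically lower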
PD NOV PY 2006 VL 48 IS 11 BP 1528 EP 1544 DI 10.1016/j.specom.2006.07.010 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300010 ER PT J AU Zavarehei, E Vaseghi, S Yan, Q AF Zavarehei, Esfandiar Vaseghi, Saeed Yan, Qin TI Inter-frame modeling of DFT trajectories of speech and noise for speech enhancement using Kalman filters SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE speech enhancement; Kalman filter; AR modeling of DFT; DFT distributions ID SPECTRAL AMPLITUDE ESTIMATOR AB In this paper a time-frequency estimator for enhancement of noisy speech signals in the DFT domain is introduced. This estimator is based on modeling the time-varying correlation of the temporal trajectories of the short-time (ST) DFT components of the noisy speech signal using autoregressive (AR) models. The time-varying trajectory of the DFT components of speech in each channel is modeled by a low-order AR process incorporated in the state equation of Kalman filters. The parameters of the Kalman filters are estimated recursively from the estimates of the signal and noise in DFT channels. The issue of convergence of the Kalman filters' statistics during the noise-only periods is addressed. A method is incorporated for restarting of Kalman filters, after long periods of noise-dominated activity in a DFT channel, to mitigate distortions of the onsets of speech activity. The performance of the proposed method with and without AR modeling of the DFT trajectories of noise for the enhancement of noisy speech is evaluated and compared with the MMSE log-amplitude speech estimator,. parametric spectral subtraction and Wiener filter. Evaluation results show that the incorporation of spectral-temporal information through Kalman filters results in reduced residual noise and improved perceived quality of speech. (c) 2006 Elsevier B.V. All rights reserved. C1 Brunel Univ, Dept Elect & Comp Engn, Uxbridge UB8 3PH, Middx, England. RP Zavarehei, E (reprint author), Brunel Univ, Dept Elect & Comp Engn, Uxbridge UB8 3PH, Middx, England. EM esfandiar.zavarehei@brunel.ac.uk; eepgeez@brunel.ac.uk; qin.yan@brunel.ac.uk CR Brillinger D. R, 1981, TIME SERIES DATA ANA CEHN B, 2005, P IEEE ITN C AC SPEE, P1097 Cohen I, 2004, IEEE SIGNAL PROC LET, V11, P725, DOI [10.1109/LSP.2004.833478, 10.1109/LSP.2004.833278] COHEN I, 2004, P 29 IEEE INT C AC S, P293 Cohen I, 2005, IEEE T SPEECH AUDI P, V13, P870, DOI 10.1109/TSA.2005.851940 COHEN J, 2001, AAMC REPORTER, V11, P5 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367 GIBSON JD, 1991, IEEE T SIGNAL PROCES, V39, P1732, DOI 10.1109/78.91144 HANSEN J, 1998, P ICSLP 1998 SYDN KARA F, 2004, SIGN PROC COMM APPL, P556 KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694 LAROCHE J, 1993, IEEE ASSP WORKSH APP LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 Ma N, 2006, IEEE T AUDIO SPEECH, V14, P19, DOI 10.1109/TSA.2005.858515 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Martin R., 2003, P INT WORKSH AC ECH, P87 MARTIN R, 2002, IEEE ICASSP 02 ORL F Martin R, 2005, IEEE T SPEECH AUDI P, V13, P845, DOI 10.1109/TSA.2005.851927 Paliwal K. 
K., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) SCALART P, 1996, P IEEE INT C AC SPEE, P629 Sim BL, 1998, IEEE T SPEECH AUDI P, V6, P328 WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 Wolfe P. J., 2001, Proceedings of the 11th IEEE Signal Processing Workshop on Statistical Signal Processing (Cat. No.01TH8563), DOI 10.1109/SSP.2001.955331 NR 25 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2006 VL 48 IS 11 BP 1545 EP 1555 DI 10.1016/j.specom.2006.03.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300011 ER PT J AU Darch, J Milner, B Vaseghi, S AF Darch, Jonathan Milner, Ben Vaseghi, Saeed TI MAP prediction of formant frequencies and voicing class from MFCC vectors in noise SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE formant prediction; formant estimation; MAP prediction; GMM; HMM; DSR ID SPEECH RECOGNITION; LINEAR PREDICTION; RECONSTRUCTION AB Novel methods are presented for predicting formant frequencies and voicing class from mel-frequency cepstral coefficients (MFCCs). It is shown how Gaussian mixture models (GMMs) can be used to model the relationship between formant frequencies and MFCCs. Using such models and an input MFCC vector, a maximum a posteriori (MAP) prediction of formant frequencies can be made. The specific relationship each speech sound has between MFCCs and formant frequencies is exploited by using state-specific GMMs within a framework of a set of hidden Markov models (HMMs). Formant prediction accuracy and voicing prediction of speaker-independent male speech are evaluated on both a constrained vocabulary connected digits database and a large vocabulary database. Experimental results show that for HMM-GMM prediction on the connected digits database, voicing class prediction error is less than 3.5%. Less than 1.8% of frames have formant frequency percentage errors greater than 20% and the mean percentage error of the remaining frames is less than 3.7%. Further experiments show prediction accuracy under noisy conditions. For example, at a signal-to-noise ratio (SNR) of 0 dB, voicing class prediction error increases to 9.4%, less than 4.3% of frames have formant frequency percentage errors over 20% and the formant frequency percentage error for the remaining frames is less than 5.7%. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. Brunel Univ, Dept Elect & Comp Engn, Uxbridge UB8 3PH, Middx, England. RP Milner, B (reprint author), Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. EM b.milner@uea.ac.uk CR ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 BRUCE IC, 2002, ICASSP ORL FL, V1, P281 CHEN B, 2004, IEEE INT C ACOUST SP, V1, P581 DARCH J, 2005, PREDICTING FORMANT F, V1, P941, DOI 10.1109/ICASSP.2005.1415270 DARCH J, 2005, EUR LISB PORT SEPT, P1129 FRANSEN J, 1994, CUEDFINFENGTRU92 CAM Kent R. 
D., 2002, ACOUSTIC ANAL SPEECH KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 MCCANDLE.SS, 1974, IEEE T ACOUST SPEECH, VSP22, P135, DOI 10.1109/TASSP.1974.1162559 MILNER B, 2005, EUR LISB PORT SEPT, P321 NIEDERJOHN RJ, 1992, IEEE INT EL CONTR IS, V3, P1336 Pearce D., 2000, ICSLP, V4, P29 Rabiner L.R., 1978, DIGITAL PROCESSING S Raj B, 2004, SPEECH COMMUN, V43, P275, DOI 10.1016/j.specom.2004.03.007 SCHAFER RW, 1970, J ACOUST SOC AM, V47, P634, DOI 10.1121/1.1911939 Shao X, 2005, J ACOUST SOC AM, V118, P1134, DOI 10.1121/1.1953269 Snell RC, 1993, IEEE T SPEECH AUDI P, V1, P129, DOI 10.1109/89.222882 SORIN A, 2003, ES202212 ETSI Webb A. R., 2002, STAT PATTERN RECOGNI, V2nd Welling L, 1998, IEEE T SPEECH AUDI P, V6, P36, DOI 10.1109/89.650308 WELLING L, 1996, ICASSP ATL GA MAY, V2, P797 WILKINSON N, 2002, ICSLP DENV CO SEPT, P2121 YAN Q, 2000, EUR LISB PORT SEPT, P2081 Young Steve, 2002, HTK BOOK VERSION 3 2 NR 24 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2006 VL 48 IS 11 BP 1556 EP 1572 DI 10.1016/j.specom.2006.06.001 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300012 ER PT J AU Rose, RC Arizmendi, I AF Rose, Richard C. Arizmendi, Iker TI Efficient client-server based implementations of mobile speech recognition services SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE automatic speech recognition; distributed speech recognition; robustness; client-server implementations; adaptation AB The purpose of this paper is to demonstrate the efficiencies that can be achieved when automatic speech recognition (ASR) applications are provided to large user populations using client-server implementations of interactive voice services. It is shown that, through proper design of a client-server framework, excellent overall system performance can be obtained with minimal demands on the computing resources that are allocated to ASR. System performance is considered in the paper in terms of both ASR speed and accuracy in multi-user scenarios. An ASR resource allocation strategy is presented that maintains sub-second average speech recognition response latencies observed by users even as the number of concurrent users exceeds the available number of ASR servers by more than an order of magnitude. An architecture for unsupervised estimation of user-specific feature space adaptation and normalization algorithms is also described and evaluated. Significant reductions in ASR word error rate were obtained by applying these techniques to utterances collected from users of hand-held mobile devices. These results are important because, while there is a large body of work addressing the speed and accuracy of individual ASR decoders, there has been very little effort applied to dealing with the same issues when a large number of ASR decoders are used in multi-user scenarios. (c) 2006 Elsevier B.V. All rights reserved. C1 McGill Univ, Dept Elect & Comp Engn, Montreal, PQ H3A 2A7, Canada. AT&T Labs Res, Florham Pk, NJ 07932 USA. RP Rose, RC (reprint author), McGill Univ, Dept Elect & Comp Engn, McConnell Engn Bldg,Room 755,3480 Univ St, Montreal, PQ H3A 2A7, Canada. 
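Referring back to the Darch, Milner and Vaseghi record above (prediction of formant frequencies from MFCC vectors with GMMs): under a joint Gaussian mixture over (MFCC, formant) vectors, the standard closed-form predictor is a posterior-weighted sum of per-component linear regressions. The sketch below shows that formula only, with random stand-in parameters; the paper's MAP formulation, HMM state gating and voicing classification are not reproduced.

import numpy as np

def gmm_regress(x, weights, mu_x, mu_y, S_xx, S_yx):
    """Predict y from x under a joint GMM. weights: (K,), mu_x: (K, Dx),
    mu_y: (K, Dy), S_xx: (K, Dx, Dx), S_yx: (K, Dy, Dx)."""
    K = len(weights)
    log_post = np.empty(K)
    cond = np.empty((K, mu_y.shape[1]))
    for k in range(K):
        d = x - mu_x[k]
        sign, logdet = np.linalg.slogdet(S_xx[k])
        log_post[k] = (np.log(weights[k]) - 0.5 * logdet
                       - 0.5 * d @ np.linalg.solve(S_xx[k], d))
        cond[k] = mu_y[k] + S_yx[k] @ np.linalg.solve(S_xx[k], d)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    return post @ cond            # (Dy,) predicted formant vector

K, Dx, Dy = 4, 13, 3
rng = np.random.default_rng(0)
A = rng.standard_normal((K, Dx, Dx))
print(gmm_regress(rng.standard_normal(Dx), np.ones(K) / K,
                  rng.standard_normal((K, Dx)), rng.standard_normal((K, Dy)),
                  A @ A.transpose(0, 2, 1) + 3 * np.eye(Dx),
                  rng.standard_normal((K, Dy, Dx))))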
EM rose@ece.mcgill.ca; iker@research.att.com CR BANGA G, 1999, P USENIX 1999 ANN TE BERNARD A, 2001, P EUR C SPEECH COMM Bocchieri E, 2001, IEEE T SPEECH AUDI P, V9, P264, DOI 10.1109/89.906000 CARDENALLOPEZ A, 2004, P IEEE INT C AC SPEE, P49 CHANDRA A, 2000, HPL2000174 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 *ETSI, 2001, 126094 ETSI TS Fingscheidt T., 2002, P ICSLP, P2209 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 GUNAWARDANA A, 2000, P INT C AC SPEECH SI Kim HK, 2002, IEEE T SPEECH AUDI P, V10, P591, DOI 10.1109/TSA.2002.804302 KISS I, 2003, P ASRU, P613 Lee L, 1998, IEEE T SPEECH AUDI P, V6, P49, DOI 10.1109/89.650310 McDowell L., 2003, P 9 ACM SIGPLAN S PR MOHRI M, 1998, SPEECH COMMUN, V25 ORTMANNS S, 1997, P EUR C SPEECH COMM Pai V.S., 1999, P USENIX 1999 ANN TE PITZ M, 2001, P EUR C SPEECH COMM POTAMIANOS A, 2001, P IEEE INT C AC SPEE, P269 ROSE RC, 2001, P INT C AC SPEECH SI ROSE RC, 2003, P INT C AC SPEECH SI SUKKAR RA, 2002, P INT C AC SPEECH SI, V2, P293 TAN ZH, 2004, P INT C SPOK LANG PR VIIKKI O, 2001, P IEEE ASRU WORKSH D Welsh M., 2001, S OP SYST PRINC, P230 WENDT S, 2002, P INT C SPOK LANG PR NR 26 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2006 VL 48 IS 11 BP 1573 EP 1589 DI 10.1016/j.specom.2006.07.004 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300013 ER PT J AU Stouten, F Duchateau, J Martens, JP Wambacq, P AF Stouten, Frederik Duchateau, Jacques Martens, Jean-Pierre Wambacq, Patrick TI Coping with disfluencies in spontaneous speech recognition: Acoustic detection and linguistic context manipulation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robustness Issues for Conversational Interaction CY AUG, 2004 CL Norwich, ENGLAND SP ISCA Tutorial & Res Workshop, COST 278 HO Univ E Anglia DE disfluency handling; spontaneous speech recognition; disfluency detection AB Nowadays read speech recognition already works pretty well, but the recognition of spontaneous speech is much more problematic. There are plenty of reasons for this, and we hypothesize that one of them is the regular occurrence of disfluencies in spontaneous speech. Disfluencies disrupt the normal course of the sentence and when for instance word interruptions are concerned, they also give rise to word-like speech elements which have no representation in the lexicon of the recognizer. In this paper we propose novel methods that aim at coping with the problems induced by three types of disfluencies, namely filled pauses, repeated words and sentence restarts. Our experiments show that especially the proposed methods for filled pause handling offer a moderate but statistically significant improvement over the more traditional techniques previously presented in the literature. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Ghent, ELIS, B-9000 Ghent, Belgium. Katholieke Univ Leuven, ESAT, B-3001 Heverlee, Belgium. RP Stouten, F (reprint author), Univ Ghent, ELIS, St Pietersnieuwstr 41, B-9000 Ghent, Belgium. 
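A toy filled-pause detector in the spirit of the Stouten, Duchateau, Martens and Wambacq record above. This is only a caricature of the acoustic cue that filled pauses are long, loud, spectrally near-stationary stretches; the features and thresholds are invented for illustration, and the paper's detectors and linguistic-context manipulation are far richer.

import numpy as np

def filled_pause_candidates(frames_spec, energy, min_frames=30,
                            flux_thresh=0.05, energy_thresh=0.5):
    """frames_spec: (T, F) magnitude spectra; energy: (T,).
    Returns (start, end) frame spans that are loud but spectrally static."""
    spec = frames_spec / (frames_spec.sum(axis=1, keepdims=True) + 1e-12)
    flux = np.zeros(len(spec))
    flux[1:] = np.abs(np.diff(spec, axis=0)).sum(axis=1)   # spectral change
    active = (energy > energy_thresh * energy.mean()) & (flux < flux_thresh)
    spans, start = [], None
    for t, a in enumerate(active):
        if a and start is None:
            start = t
        elif not a and start is not None:
            if t - start >= min_frames:
                spans.append((start, t))
            start = None
    if start is not None and len(active) - start >= min_frames:
        spans.append((start, len(active)))
    return spans

T, F = 200, 64
spec = np.random.rand(T, F)
spec[80:140] = np.linspace(1, 0.1, F)          # a static "uh"-like stretch
energy = spec.sum(axis=1)
print(filled_pause_candidates(spec, energy))   # roughly [(81, 140)]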
EM fstouten@elis.ugent.be; Jacques.Duchateau@esat.kuleuven.be; martens@elis.ugent.be; Patrick.Wambacq@esat.kuleuven.be CR ADDA G, 1999, P EUR C SPEECH COMM, V4, P1759 Adda-Decker M., 2003, P ITRW DISFL SPONT S, P67 BATLINER A, 1995, P INT C PHON SCI STO Beyerlein P., 1999, P EUR C SPEECH COMM, V2, P647 Demuynck K, 2000, SPEECH COMMUN, V30, P37, DOI 10.1016/S0167-6393(99)00030-8 Duchateau J, 1998, SPEECH COMMUN, V24, P5, DOI 10.1016/S0167-6393(98)00002-8 Duchateau J, 2003, LANG COMPUT, P39 Gabrea M., 2000, P ICSLP BEIJ, V3, P678 GAUVAIN J, 1999, P EUR C SPEECH COMM, V2, P655 Godfrey J. J., 1992, P ICASSP, V1, P517 GOEDERTIER W, 2000, INT C LANG RES EV AT, P909 GOTO M, 1999, P EUR BUD, V1, P227 Ma KW, 2000, SPEECH COMMUN, V31, P51, DOI 10.1016/S0167-6393(99)00060-6 MARTENS JP, 2002, INT C LANG REST EV, V5, P1432 OSHAUGHNESSY D, 1993, P EUR BERL, V3, P2187 PAKHOMOV S, 1999, P ASS COMP LING ACL, P619, DOI 10.3115/1034678.1034692 PETERS J, 2003, P HUM LANG TECHN C H, P82 QUIMBO FM, 1998, P ICSLP, P3313 SCHRAMM H, 2003, P ISCA IEEE WORKSH S SHRIBERG E., 1996, P INT C SPOK LANG PR, P11 SHRIBERG E, 1996, P INT C SPOK LANG PR, V3, P1868, DOI 10.1109/ICSLP.1996.607996 SIU M, 1996, P INT C SPOK LANG PR, V1, P386 STOLCKE A, 1996, P INT C AC SPEECH SI, V1, P405 Stouten F., 2003, P IEEE AUT SPEECH RE, P309 STOUTEN V, 2003, P EUROSPEECH GEN SWI, P349 Vorstermans A, 1996, SPEECH COMMUN, V19, P271, DOI 10.1016/S0167-6393(96)00037-4 YU H, 2000, P INT C SPOK LANG PR, V4, P310 ZECHNER K, 1998, P 17 C COMP LING COL, P1453 NR 28 TC 14 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2006 VL 48 IS 11 BP 1590 EP 1606 DI 10.1016/j.specom.2006.04.004 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 109VE UT WOS:000242336300014 ER PT J AU Latorre, J Iwano, K Furui, S AF Latorre, Javier Iwano, Koji Furui, Sadaoki TI New approach to the polyglot speech generation by means of an HMM-based speaker adaptable synthesizer SO SPEECH COMMUNICATION LA English DT Article DE multilingual; polyglot synthesis; voice adaptation; cross-language synthesis; phone mapping AB In this paper we present a new method for synthesizing multiple languages, with the same voice, using HMM-based speech synthesis. Our approach, which we call HMM-based polyglot synthesis, consists of mixing speech data from several speakers in different languages, to create a speaker- and language-independent (SI) acoustic model. We then adapt the resulting SI model to a specific speaker in order to create a speaker dependent (SD) acoustic model. Using the SD model it is possible to synthesize any of the languages used to train the SI model, With the voice of the speaker, regardless of the speaker's language. We show that the performance obtained with our method is better than that of methods based on phone mapping for both adaptation and synthesis. Furthermore, for languages not included during training the performance of our approach also equals or surpasses the performance of any monolingual synthesizers based on the languages used to train the multilingual one. This means that our method can be used to create synthesizers for languages where no speech resources are available. (c) 2006 Elsevier B.V. All rights reserved. C1 Tokyo Inst Technol, Grad Sch Informat Sci & Engn, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. 
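The adaptation step at the heart of the Latorre, Iwano and Furui record above (moving an SI model towards a target speaker) is commonly realized with MLLR-style affine transforms of Gaussian means. The sketch below is a simplified, identity-covariance version of such a global mean update, estimated by weighted least squares from per-state statistics; it is my illustration of the general technique, not the paper's adaptation recipe.

import numpy as np

def global_mllr_mean(means, frame_means, gammas):
    """Least-squares global transform W = [A; b] minimizing
    sum_s gamma_s * ||xbar_s - A mu_s - b||^2 (an identity-covariance
    simplification of the MLLR mean update).
    means: (S, D) SI Gaussian means; frame_means: (S, D) per-state
    averages of adaptation frames; gammas: (S,) state occupancies."""
    S, D = means.shape
    ext = np.hstack([means, np.ones((S, 1))])            # (S, D+1)
    sw = np.sqrt(gammas)[:, None]
    W, *_ = np.linalg.lstsq(sw * ext, sw * frame_means, rcond=None)
    return ext @ W                                       # adapted means (S, D)

S, D = 50, 6
rng = np.random.default_rng(1)
mu = rng.standard_normal((S, D))
A_true, b_true = 0.8 * np.eye(D), 0.3 * np.ones(D)
x = mu @ A_true.T + b_true + 0.01 * rng.standard_normal((S, D))
mu_adapt = global_mllr_mean(mu, x, np.ones(S))
print(np.abs(mu_adapt - x).max())   # small: the global transform is recovered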
RP Latorre, J (reprint author), Tokyo Inst Technol, Grad Sch Informat Sci & Engn, Dept Comp Sci, Meguro Ku, 2-12-1 8E-602, Tokyo 1528552, Japan. EM latorre@furui.cs.titech.ac.jp CR BADINO L, 2004, P ICSLP JEJ ISL KOR, P849 BLACK A, 2004, P ICASSP MONTR CAN, P761 BONAVENTURA P, 1997, P EUR RHOD GREEC, P355 CAMPBELL N, 2001, P EUR AALB DENM, P337 CAMPBELL N, 1998, P ESCA COCOSDA WORKS DIJKSTRA J, 2004, P 5 ISCA SPEECH SYNT, P97 Graddol D, 2004, SCIENCE, V303, P1329, DOI 10.1126/science.1096546 Imai S., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 LIU C, 2005, P 9 EUR C SPEECH COM, P1365 Mak B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607191 Mashimo Mikiko, 2001, P EUR AALB DENM, P361 Masuko T., 1996, P ICASSP, P389 MOBERG M, 2004, P ICSLP JEJ ISL KOR, P1029 SCHULTZ T., 2002, P ICSLP, P345 Schultz T., 2001, P EUR AALB DENM, P2721 SHIN H, 2003, US CENSUS 2000 Shinoda K., 2000, Journal of the Acoustical Society of Japan (E), V21, DOI 10.1250/ast.21.79 Tamura M., 1998, P 3 ESCA COCOSDA WOR, P273 Tamura M., 2001, P EUROSPEECH 2001 SE, P345 Tokuda K., 1995, P EUROSPEECH, P757 TOKUDA K, 1995, P ICASSP, P660 Traber C., 1999, P EUR, P835 YU H, 2003, P EUR GEN SWITZ, P1869 NR 24 TC 16 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2006 VL 48 IS 10 BP 1227 EP 1242 DI 10.1016/j.specom.2006.05.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 099II UT WOS:000241586500001 ER PT J AU Prasanna, SRM Gupta, CS Yegnanarayana, B AF Prasanna, S. R. Mahadeva Gupta, Cheedella S. Yegnanarayana, B. TI Extraction of speaker-specific excitation information from linear prediction residual of speech SO SPEECH COMMUNICATION LA English DT Article DE speaker recognition; excitation information; LP residual; AANN model; vocal tract information ID VERIFICATION; RECOGNITION; IDENTIFICATION AB In this paper, through different experimental studies we demonstrate that the excitation component of speech can be exploited for speaker recognition studies. Linear prediction (LP) residual is used as a representation of excitation information in speech. The speaker-specific information in the excitation of voiced speech is captured using the AutoAssociative Neural Network (AANN) models. The decrease in the error during training and recognizing correct speakers during testing demonstrates that the excitation component of speech contains speaker-specific information and is indeed being captured by the AANN models. The study on the effect of different LP orders demonstrates that for a speech signal sampled at 8 kHz, the LP residual extracted using LP order in the range 8-20 best represents the speaker-specific excitation information. It is also demonstrated that the proposed speaker recognition system using excitation information and AANN models requires significantly less amount of data both during training as well as testing, compared to the speaker recognition system using vocal tract information. 
Finally, speaker recognition studies on the NIST 2002 database demonstrate that even though the recognition performance from the excitation information alone is poor, there is a significant improvement in performance when it is combined with evidence from vocal tract information. This result demonstrates the complementary nature of the excitation component of speech. (c) 2006 Elsevier B.V. All rights reserved. C1 Indian Inst Technol, Dept Elect & Commun Engn, Gauhati 781039, Assam, India. Indian Inst Technol, Dept Comp Sci & Engn, Madras 600036, Tamil Nadu, India. RP Prasanna, SRM (reprint author), Indian Inst Technol, Dept Elect & Commun Engn, Gauhati 781039, Assam, India. EM prasanna@iitg.ernet.in; yegna@cs.iitm.ernet.in CR ANANTHAPADMANABHA TV, 1979, IEEE T ACOUST SPEECH, V27, P309, DOI 10.1109/TASSP.1979.1163267 ANJANI AVN, 2000, THESIS INDIAN I TECH ATAL BS, 1972, J ACOUST SOC AM, V52, P1687, DOI 10.1121/1.1913303 ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155 CAMPBELL JP, 1997, P IEEE, V85, P1436 Deller J., 2000, DISCRETE TIME PROCES Diamantaras K I., 1996, PRINCIPAL COMPONENT Doddington G., 2001, P EUR, P2521 FAUNDEZ M, 1998, P INT C SPOK LANG PR Feustel T. C., 1989, SPEECH TECH, P169 FURUI S, 1996, AUTOMATIC SPEECH SPE, pCH2 Furui S, 1997, PATTERN RECOGN LETT, V18, P859, DOI 10.1016/S0167-8655(97)00073-1 Gupta C. S., 2003, THESIS INDIAN I TECH Haykin S., 1999, NEURAL NETWORKS COMP, V2nd IKBAL MS, 1999, P INT JOINT C NEUR N, P854 KAJAREKAR SS, 2003, P EUR C SPEECH PROC KAJAREKAR SS, 2002, OGI SUBMISSION NIST KISHORE SP, 2001, THESIS INDIAN I TECH KISHORE SP, 2001, P INT JOINT C NEUR N LIU JHL, 1997, P EUR C SPEECH PROC, P313 MADHUKUMAR AS, 1993, THESIS INDIAN I TECH MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MARTIN A, 1999, P NIST SPEAK REC WOR MARTIN A, 2002, P NIST SPEAK REC WOR MARTIN A, 2001, P NIST SPEAK REC WOR MURTHY KSR, 2004, INT C SIGN PROC COMM, P516 O'Shaughnessy D., 1987, SPEECH COMMUNICATION O'Shaughnessy D., 1986, IEEE ASSP Magazine, V3, DOI 10.1109/MASSP.1986.1165388 Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109 Prasanna SRM, 2004, P IEEE INT C AC SPEE Rabiner L, 1993, FUNDAMENTALS SPEECH REDDY KS, 2004, THESIS INDIAN I TECH Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 ROSENBER.AE, 1971, J ACOUST SOC AM, V49, P583, DOI 10.1121/1.1912389 ROSENBERG AE, 1976, P IEEE, V64, P475, DOI 10.1109/PROC.1976.10156 SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662 THEVENAZ P, 1995, SPEECH COMMUN, V17, P145, DOI 10.1016/0167-6393(95)00010-L WAKITA H, 1976, IEEE T ACOUST SPEECH, V24, P270, DOI 10.1109/TASSP.1976.1162797 WEBER F, 2002, P IEEE INT C AC SPEE, V1, P141 Yegnanarayana B, 2005, IEEE T SPEECH AUDI P, V13, P575, DOI 10.1109/TSA.2005.848892 YEGNANARAYANA B, 2001, P IEEE INT C AC SPEE, P409 Yegnanarayana B., 1999, ARTIFICIAL NEURAL NE YEGNANARAYANA B, 1992, P ESCA WORKSH SPEECH YEGNANARAYANA B, 2002, P NIST SPEAK REC WOR NR 44 TC 29 Z9 29 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
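The Prasanna et al. record above rests on extracting the linear prediction residual as the representation of excitation. A minimal sketch of autocorrelation-method LP analysis plus inverse filtering, under the abstract's stated conditions (8 kHz speech, LP order in the 8-20 range); the AANN modelling itself is not reproduced.

```python
import numpy as np
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lp_residual(frame, order=12):
    """LP residual e[n] = s[n] - sum_k a_k s[n-k] of one windowed frame,
    via the autocorrelation (Yule-Walker) method and inverse filtering."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])  # predictor coeffs
    return lfilter(np.concatenate(([1.0], -a)), [1.0], frame)   # inverse filter A(z)

if __name__ == "__main__":
    fs, n = 8000, 400
    t = np.arange(n) / fs
    frame = np.sin(2 * np.pi * 150 * t) * np.hamming(n)  # toy voiced-like frame
    e = lp_residual(frame)
    print(frame.var(), e.var())  # residual variance is far smaller than the signal's
```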
PD OCT PY 2006 VL 48 IS 10 BP 1243 EP 1261 DI 10.1016/j.specom.2006.06.002 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 099II UT WOS:000241586500002 ER PT J AU Salor, O Demirekler, M AF Salor, Ozgul Demirekler, Mubeccel TI Dynamic programming approach to voice transformation SO SPEECH COMMUNICATION LA English DT Article DE voice transformation; speaker transformation; codebook; line spectral frequencies; dynamic programming ID CONVERSION AB This paper presents a voice transformation algorithm which modifies the speech of a source speaker such that it is perceived as if spoken by a target speaker. A novel method based on a dynamic programming approach is proposed. The designed system obtains speaker-specific codebooks of line spectral frequencies (LSFs) for both source and target speakers. Those codebooks are used to train a mapping histogram matrix, which is used for LSF transformation from one speaker to the other. The baseline system uses the maxima of the histogram matrix for LSF transformation. The shortcomings of this system, which are the limitations of the target LSF space and the spectral discontinuities due to independent mapping of subsequent frames, have been overcome by applying the dynamic programming approach. The dynamic programming approach tries to model the long-term behaviour of the target speaker's LSFs while preserving the relationship between subsequent frames of the source LSFs during transformation. Both objective and subjective evaluations have been conducted and it has been shown that the dynamic programming approach improves the performance of the system in terms of both the speech quality and speaker similarity. (c) 2006 Elsevier B.V. All rights reserved. C1 Middle E Tech Univ, Dept Elect & Elect Engn, TR-06531 Ankara, Turkey. RP Salor, O (reprint author), TUBITAK, Inst Space Technol Res, METU Campus, Ankara, Turkey. EM ozgul.salor@bilten.metu.edu.tr CR Abe M., 1988, P ICASSP, P655 Arslan LM, 1999, SPEECH COMMUN, V28, P211, DOI 10.1016/S0167-6393(99)00015-1 CHILDERS DG, 1995, SPEECH COMMUN, V16, P127, DOI 10.1016/0167-6393(94)00050-K COHN RP, 1997, P IEEE ICASSP Huang X., 2001, SPOKEN LANGUAGE PROC Kain A., 2001, THESIS OREGON HLTH S LEE SK, 1995, P 4 EUR C SPEECH COM Markel JD, 1976, LINEAR PREDICTION SP *MELP, 1997, SPEC AN DIG CONV VOI MIZUNO H, 1995, SPEECH COMMUN, V16, P153, DOI 10.1016/0167-6393(94)00052-C SALOR O, 2002, P 7 INT C SPOK LANG SALOR O, 2004, P INT S INT MULT VID SALOR O, 2005, THESIS MIDDLE E TU T SALOR O, 2003, P 8 EUR C SPEECH COM STYLIANOU Y, 1999, P IEEE INT C AC SPEE STYLIANOU Y, 1998, IEEE T SPEECH AUDIO, V6, P451 TODA T, 2003, THESIS NARA I TECHNO VALBRET H, 1992, SPEECH COMMUN, V11, P175, DOI 10.1016/0167-6393(92)90012-V NR 18 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
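The Salor and Demirekler record above maps source LSF codewords to target codewords through a histogram matrix and smooths the mapping with dynamic programming. A toy sketch of that idea, assuming a per-frame mapping cost from the histogram plus a spectral-jump cost between consecutive target codewords; the paper's actual cost functions differ.

```python
import numpy as np

def dp_codebook_map(src_idx, hist, target_cb, trans_w=1.0):
    """Viterbi-style DP over target codewords: emission cost = -log histogram
    entry, transition cost = mean absolute LSF jump between target codewords."""
    T, K = len(src_idx), hist.shape[1]
    emit = -np.log(hist[src_idx] + 1e-9)                       # (T, K)
    jump = trans_w * np.abs(target_cb[:, None, :] -
                            target_cb[None, :, :]).mean(-1)    # (K, K)
    cost, back = emit[0].copy(), np.zeros((T, K), dtype=int)
    for t in range(1, T):
        tot = cost[:, None] + jump                             # predecessor x current
        back[t] = tot.argmin(0)
        cost = tot.min(0) + emit[t]
    path = [int(cost.argmin())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return target_cb[path[::-1]]                               # (T, D) target LSFs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    hist = rng.random((8, 8))            # toy source->target mapping histogram
    cb = np.sort(rng.random((8, 10)))    # 8 target codewords of 10 ascending 'LSFs'
    print(dp_codebook_map([0, 3, 3, 5], hist, cb).shape)  # (4, 10)
```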
PD OCT PY 2006 VL 48 IS 10 BP 1262 EP 1272 DI 10.1016/j.specom.2006.06.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 099II UT WOS:000241586500003 ER PT J AU Xiong, ZY Zheng, TF Song, ZJ Soong, F Wu, WH AF Xiong, Zhenyu Zheng, Thomas Fang Song, Zhanjiang Soong, Frank Wu, Wenhu TI A tree-based kernel selection approach to efficient Gaussian mixture model-universal background model based speaker identification SO SPEECH COMMUNICATION LA English DT Article DE speaker recognition; speaker identification; tree-based kernel selection; GMM-UBM ID VERIFICATION; RECOGNITION; ALGORITHM AB We propose a tree-based kernel selection (TBKS) algorithm as a computationally efficient approach to Gaussian mixture model-universal background model (GMM-UBM) based speaker identification. All Gaussian components in the universal background model are first clustered hierarchically into a tree and the corresponding acoustic space is mapped into structurally partitioned regions. When identifying a speaker, each test input feature vector is scored against a small subset of all Gaussian components. As a result of this TBKS process, computational complexity can be significantly reduced. We improve the efficiency of the proposed system further by applying a previously proposed observation reordering based pruning (ORBP) to screen out unlikely candidate speakers. The approach is evaluated on a speech database of 1031 speakers, in both clean and noisy conditions. The experimental results show that by integrating TBKS and ORBP together we can speed up the computation by a factor of 15.8 with only a very slight degradation of identification performance, i.e., an increase of 1% in relative error rate, compared with a baseline GMM-UBM system. The improved search efficiency is also robust to additive noise. (c) 2006 Published by Elsevier B.V. C1 Tsinghua Univ, Ctr Speech Technol, Dept Comp Sci & Technol, State Key Lab Intelligent Technol & Syst, Beijing 100084, Peoples R China. Beijing dEar Technol Co Ltd, Beijing, Peoples R China. Microsoft Res Asia, Beijing, Peoples R China. RP Zheng, TF (reprint author), Tsinghua Univ, Ctr Speech Technol, Dept Comp Sci & Technol, State Key Lab Intelligent Technol & Syst, Beijing 100084, Peoples R China. EM xiongzhy@cst.cs.tsinghua.edu.cn; fzheng@tsinghua.edu.cn; zjsong@d-Ear.com; frankkps@microsoft.com; wuwh@tsinghua.edu.cn CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 AUCKENTHALER R, 2001, P SPEAK OD SPEAK REC Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Huang X., 2001, SPOKEN LANGUAGE PROC McLaughlin J., 1999, P EUR, P1215 Ney H, 1999, IEEE SIGNAL PROC MAG, V16, P64, DOI 10.1109/79.790984 Pellom BL, 1998, IEEE SIGNAL PROC LET, V5, P281, DOI 10.1109/97.728467 Ramachandran RP, 2002, PATTERN RECOGN, V35, P2801, DOI 10.1016/S0031-3203(01)00235-7 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds D.
A., 1997, P EUR, P963 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Reynolds DA, 1994, IEEE T SPEECH AUDI P, V2, P639, DOI 10.1109/89.326623 Reynolds D.A., 2002, P IEEE INT C AC SPEE, P472 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 Shinoda K, 2001, IEEE T SPEECH AUDI P, V9, P276, DOI 10.1109/89.906001 SOONG FK, 1988, IEEE T ACOUST SPEECH, V36, P871, DOI 10.1109/29.1598 Soong F.K., 1985, P INT C AC SPEECH SI, P387 VARGA AP, 1992, NOISEX 92 STUDY EFFE Watanabe T., 1994, P INT C SPEECH LANG, P223 Xiang B, 2003, IEEE T SPEECH AUDI P, V11, P447, DOI 10.1109/TSA.2003.815822 NR 22 TC 8 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2006 VL 48 IS 10 BP 1273 EP 1282 DI 10.1016/j.specom.2006.06.011 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 099II UT WOS:000241586500004 ER PT J AU Ning, GX Leung, SH Chu, KK Wei, G AF Ning, Geng-xin Leung, Shu-hung Chu, Kam-keung Wei, Gang TI A dynamic parameter compensation method for noisy speech recognition SO SPEECH COMMUNICATION LA English DT Article DE noisy speech recognition; model compensation; dynamic parameter combination ID PARALLEL MODEL COMBINATION; ENVIRONMENTS AB Model-based compensation techniques have been successfully used for speech recognition in noisy environments. Popular model-based compensation methods such as the Log-Normal PMC and Log-Add PMC generally use approximate compensation for dynamic parameters. Hence their recognition accuracy is degraded at low and very low signal-to-noise ratios. In this paper we use time derivatives of static features to derive a dynamic parameter compensation method (DPCM). In this method, we assume that the static features are independent of the dynamic features of speech and noise. This assumption helps simplify the procedures of the compensation of delta and delta-delta parameters. The new compensated dynamic model together with any known compensated static model form a new corrupted speech recognition model. Experimental results show that the recognition model using this DPCM scheme gives recognition accuracy better than the original model compensation method for different additive noises at the expense of a slight increase in computational complexity. (c) 2006 Elsevier B.V. All rights reserved. C1 City Univ Hong Kong, Dept Elect Engn, Kowloon, Hong Kong, Peoples R China. S China Univ Technol, Sch Elect & Informat Engn, Guangzhou 510640, Peoples R China. RP Leung, SH (reprint author), City Univ Hong Kong, Dept Elect Engn, Kowloon, Hong Kong, Peoples R China.
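The Ning et al. record above extends Log-Add PMC to delta and delta-delta parameters. A minimal sketch of the standard static-mean Log-Add combination it starts from: move cepstral means to the log-filterbank domain, add the speech and noise powers there, and return. The DPCM derivation itself is not reproduced, and the DCT dimensions below are arbitrary.

```python
import numpy as np

def dct_matrix(n_ceps, n_chan):
    """Type-II DCT matrix mapping log filterbank energies to cepstra."""
    k = np.arange(n_ceps)[:, None]
    j = np.arange(n_chan)[None, :]
    return np.sqrt(2.0 / n_chan) * np.cos(np.pi * k * (j + 0.5) / n_chan)

def log_add_pmc(mu_speech, mu_noise, C, C_inv):
    """Static Log-Add PMC: mu_y = C log(exp(C^-1 mu_x) + exp(C^-1 mu_n))."""
    return C @ np.logaddexp(C_inv @ mu_speech, C_inv @ mu_noise)

if __name__ == "__main__":
    C = dct_matrix(13, 24)
    C_inv = np.linalg.pinv(C)            # approximate inverse transform
    rng = np.random.default_rng(0)
    mu_x, mu_n = rng.normal(size=13), rng.normal(size=13)
    print(log_add_pmc(mu_x, mu_n, C, C_inv)[:3])  # compensated static cepstral mean
```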
EM eeeugshl@cityu.edu.hk CR Abramowitz M., 1972, HDB MATH FUNCTIONS F Acero A., 2000, P ICSLP, P869 Cerisara C, 2004, SPEECH COMMUN, V42, P25, DOI 10.1016/j.specom.2003.08.003 CERISARA C, 2002, P ICASSP, P201 Gales M.J.F, 1995, THESIS CAMBRIDGE U Gales MJF, 1998, SPEECH COMMUN, V25, P49, DOI 10.1016/S0167-6393(98)00029-6 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 GALES MJF, 1993, SPEECH COMMUN, V12, P231, DOI 10.1016/0167-6393(93)90093-Z GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 HIRSCH HG, 2000, AURORA EXPT FRAMEWOR Hung JW, 2001, IEEE T SPEECH AUDI P, V9, P842 Kim DY, 1998, SPEECH COMMUN, V24, P39, DOI 10.1016/S0167-6393(97)00061-7 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z MANSOUR D, 1989, IEEE T ACOUSTICS SPE, V37, P759 Moreno PJ, 1998, SPEECH COMMUN, V24, P267, DOI 10.1016/S0167-6393(98)00025-9 Moreno PJ, 1996, P IEEE INT C AC SPEE, V2, P733 Papoulis A, 1991, PROBABILITY RANDOM V, V3rd PARSSINEN K, 2002, P ICASSP, P193 Varga A.P., 1990, P ICASSP, P845 Yao KS, 2004, SPEECH COMMUN, V42, P5, DOI 10.1016/j.specom.2003.09.002 NR 21 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2006 VL 48 IS 10 BP 1283 EP 1293 DI 10.1016/j.specom.2006.06.005 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 099II UT WOS:000241586500005 ER PT J AU Tufekci, Z Gowdy, JN Gurbuz, S Patterson, E AF Tufekci, Zekeriya Gowdy, John N. Gurbuz, Sabri Patterson, Eric TI Applied mel-frequency discrete wavelet coefficients and parallel model compensation for noise-robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE noise robust ASR; wavelet; local feature; feature weighting ID COMBINATION AB Interfering noise severely degrades the performance of a speech recognition system. The Parallel Model Compensation (PMC) technique is one of the most efficient techniques for dealing with such noise. Another approach is to use features local in the frequency domain, such as Mel-Frequency Discrete Wavelet Coefficients (MFDWCs). In this paper, we investigate the use of PMC and MFDWC features to take advantage of both noise compensation and local features (MFDWCs) to decrease the effect of noise on recognition performance. We also introduce a practical weighting technique based on the noise level of each coefficient. We evaluate the performance of several wavelet schemes using the NOISEX-92 database for various noise types and noise levels. Finally, we compare the performance of these versus Mel-Frequency Cepstral Coefficients (MFCCs), both using PMC. Experimental results show significant performance improvements for MFDWCs versus MFCCs, particularly after compensating the HMMs using the PMC technique. The best feature vector among the six MFDWCs we tried gave 13.72 and 5.29 points of performance improvement, on average, over MFCCs for -6 and 0 dB SNR, respectively. This corresponds to 39.9% and 62.8% error reductions, respectively. Weighting the partial score of each coefficient based on the noise level further improves the performance. The average error rates for the best MFDWCs dropped from 19.57% to 16.71% and from 3.14% to 2.14% for -6 dB and 0 dB noise levels, respectively, using the weighting scheme.
These improvements correspond to 14.6% and 31.8% error reductions for -6 dB and 0 dB noise levels, respectively. (c) 2006 Elsevier B.V. All rights reserved. C1 Clemson Univ, ECE Dept, Clemson, SC 29634 USA. RP Tufekci, Z (reprint author), Izmir Yuksek Teknol Enstitusu, Elektr Elektron Muhendisligi Bolumu, TR-35430 Urla Izmir, Turkey. EM zekeriyatufekci@iyte.edu.tr; john.gowdy@ces.clemson.edu; sabrig@his.atr.jp; pattersone@uncw.edu CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 BEATTIE VL, 1992, P ICSLP, P519 BEATTIE VL, 1991, P ICASSP, P917, DOI 10.1109/ICASSP.1991.150489 Berstein A., 1991, P ICASSP, P913, DOI 10.1109/ICASSP.1991.150488 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Bourlard H., 1996, P INT C SPOK LANG PR, P422 Cerisara C, 1998, INT CONF ACOUST SPEE, P717, DOI 10.1109/ICASSP.1998.675365 CHENGALVARAYAN R, 1999, P ICASSP, P409 COHEN A, 1992, COMMUN PUR APPL MATH, V45, P485, DOI 10.1002/cpa.3160450502 Cung H. M., 1992, P ESCA WORKSH SPEECH, P171 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Fletcher H., 1953, SPEECH HEARING COMMU GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014 GALES MJF, 1992, P ICASSP, P233, DOI 10.1109/ICASSP.1992.225929 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 GALES MJF, 1993, SPEECH COMMUN, V12, P231, DOI 10.1016/0167-6393(93)90093-Z Gales M.J.F., 1993, P EUROSPEECH, P837 GALES MJF, 1995, P ICASSP, P133 Ghitza O., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80018-3 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J GOWDY JN, 2000, P IEEE INT C AC SPEE, V3, P1351 HERMANSKY H, 1996, P INT C SPOK LANG PR, P462 KLATT DH, 1979, P ICASSP, P573 LOCKWOOD P, 1991, P EUR, P79 Mallat S., 1998, WAVELET TOUR SIGNAL MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P795, DOI 10.1109/ASSP.1989.28053 MELLOR BA, 1992, P IOA, V14, P503 Mirghafori N, 1998, INT CONF ACOUST SPEE, P713, DOI 10.1109/ICASSP.1998.675364 TOMLINSON MJ, 1997, P IEEE INT C AC SPEE, P1247 Tufekci Z., 2000, Proceedings of the IEEE SoutheastCon 2000. `Preparing for The New Millennium' (Cat. No.00CH37105), DOI 10.1109/SECON.2000.845444 TUFEKCI Z, 2001, P ICASSP, V1, P149 Varga A.P., 1990, P ICASSP, P845 VARGA AP, 1992, NOISEX 92 STUDY EFFE VASEGHI SV, 1997, P INT C AC SPEECH SI, P1263 Vetterli M., 1995, WAVELETS SUBBAND COD YANG R, 1995, P ICASSP, P433 YANG R, 1996, P ICASSP, P49 YOUNG S, 1997, HTK BOOK ENTROPIC CA NR 38 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2006 VL 48 IS 10 BP 1294 EP 1307 DI 10.1016/j.specom.2006.06.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 099II UT WOS:000241586500006 ER PT J AU Patel, R Grigos, MI AF Patel, Rupal Grigos, Maria I. TI Acoustic characterization of the question-statement contrast in 4, 7 and 11-year-old children SO SPEECH COMMUNICATION LA English DT Article DE prosody; children; acoustics; speech; development; acquisition; questions; statements; intonation ID VOCAL FUNDAMENTAL-FREQUENCY; INTONATION CONTOURS; SEVERE DYSARTHRIA; CEREBRAL-PALSY; SPEECH; PROSODY; STRESS; ADULTS; CRIES; PERCEPTION AB Prosodic features of the speech signal include fundamental frequency (F0), intensity and duration.
In order to study the development of prosody independently of segmental aspects of speech, we considered the question-statement contrast. In English, adults mark the contrast using changes in fundamental frequency, duration and intensity, with F0 being the most prominent cue. Declarative questions are marked by rising intonation whereas statements are marked by falling intonation. While previous studies have noted that young children can signal this contrast in imitative paradigms, little is known about the acoustic cues children use at different stages in development. The present study sought to provide an acoustic characterization of prosodic cues used by 12 children from three age groups, 4-year-olds, 7-year-olds and 11-year-olds, for elicited productions of declarative statements and questions. Results indicated that 4-year-olds were unable to reliably signal questions using a rising fundamental frequency contour. Instead, they used increased final syllable duration to mark questions. Children in the 7-year-old group used all three cues, fundamental frequency, intensity and syllable duration, to distinguish questions from statements. The oldest group relied primarily on changes in fundamental frequency and less so on intensity and duration cues. An age-related pattern is evident in that children employ different combinations of acoustic cues to mark the question-statement contrast across development. The impact of motor and cognitive-linguistic complexity on the development of prosodic control is discussed. (c) 2006 Elsevier B.V. All rights reserved. C1 Northeastern Univ, Dept Speech Language Pathol & Audiol, Boston, MA 02115 USA. NYU, Steinhardt Sch Educ, Dept Speech Language Pathol & Audiol, New York, NY 10003 USA. RP Patel, R (reprint author), Northeastern Univ, Dept Speech Language Pathol & Audiol, 360 Huntington Ave,Room 102 FR, Boston, MA 02115 USA. EM r.patel@neu.edu CR Allen G., 1980, CHILD PHONOLOGY, P227 Allen GD, 2000, J SPEECH LANG HEAR R, V43, P441 Bloom L., 1973, ONE WORD TIME Boersma P., 2004, PRAAT SYSTEM DOING P Bolinger D., 1989, INTONATION ITS USES BONVILLIAN JD, 1979, J CHILD LANG, V6, P459 CRUTTENDEN A, 1981, J LINGUIST, V17, P77, DOI 10.1017/S0022226700006782 Cruttenden A., 1986, INTONATION CRUTTENDEN A, 1985, J CHILD LANG, V12, P643 Crystal D, 1978, COMMUN COGNITION, P257 Crystal D, 1986, LANGUAGE ACQUISITION EADY SJ, 1986, J ACOUST SOC AM, V80, P402, DOI 10.1121/1.394091 Eguchi S, 1969, Acta Otolaryngol Suppl, V257, P1 GELUYKENS R, 1988, J PRAGMATICS, V12, P467, DOI 10.1016/0378-2166(88)90006-9 Gilbert HR, 1996, INT J PEDIATR OTORHI, V34, P237, DOI 10.1016/0165-5876(95)01273-7 HADDINGKOCH K, 1964, PHONETICA, V11, P175 Hirst D. J., 1998, INTONATION SYSTEMS S, P1 House David, 2002, P ICSLP 2002, P1957 HOWELL P, 1993, J ACOUST SOC AM, V94, P2063, DOI 10.1121/1.407479 Katz WF, 1996, J ACOUST SOC AM, V99, P3179, DOI 10.1121/1.414802 KENT RD, 1980, J PHONETICS, V8, P157 KENT RD, 1976, J SPEECH HEAR RES, V19, P421 Koenig LL, 2000, J SPEECH LANG HEAR R, V43, P1211 Ladd D., 1996, INTONATION PHONOLOGY Lam S. L-M., 2001, SPEECH MOTOR CONTROL, P228 Lehiste I., 1976, CONT ISSUES EXPT PHO, P225 LIEBERMAN P, 1960, J ACOUST SOC AM, V32, P451, DOI 10.1121/1.1908095 Lieberman Philip, 1967, INTONATION PERCEPTIO Lind K, 2002, INT J PEDIATR OTORHI, V64, P97, DOI 10.1016/S0165-5876(02)00024-1 Local J., 1980, SOCIOLINGUISTIC VARI LOEB DF, 1993, J SPEECH HEAR RES, V36, P4 Macneilage P.
F., 1990, ATTENTION PERFORM, P453 MAJEWSKI W, 1969, J ACOUST SOC AM, V45, P450, DOI 10.1121/1.1911394 MENYUK P, 1969, MIT Q PROGR REP, V93, P216 MORTON J, 1965, LANG SPEECH, V8, P159 Netsell R., 1973, NORMAL ASPECTS SPEEC, P211 OSHAUGHNESSY D, 1979, J PHONETICS, V7, P119 OLLER DK, 1977, J ACOUST SOC AM, V62, P994, DOI 10.1121/1.381594 Patel R, 2002, J SPEECH LANG HEAR R, V45, P858, DOI 10.1044/1092-4388(2002/069) Patel R, 2003, J SPEECH LANG HEAR R, V46, P1401, DOI 10.1044/1092-4388(2003/109) Patel R, 2004, J MED SPEECH-LANG PA, V12, P189 Protopapas A, 1997, J ACOUST SOC AM, V102, P3723, DOI 10.1121/1.420403 ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572 Smith BL, 1998, J PHONETICS, V26, P95, DOI 10.1006/jpho.1997.0061 Snow D, 1998, J SPEECH LANG HEAR R, V41, P576 SNOW D, 1994, J SPEECH HEAR RES, V37, P831 Srinivasan RJ, 2003, LANG SPEECH, V46, P1 Stathopoulos ET, 1997, J SPEECH LANG HEAR R, V40, P595 TINGLEY BM, 1975, CHILD DEV, V46, P186 VANCE JE, 1994, EUR J DISORDER COMM, V29, P61 Wells B, 2004, J CHILD LANG, V31, P749, DOI 10.1017/S030500090400652X Wermke K, 2002, MED ENG PHYS, V24, P501, DOI 10.1016/S1350-4533(02)00061-9 Xu Y, 2002, J ACOUST SOC AM, V111, P1399, DOI 10.1121/1.1445789 NR 53 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2006 VL 48 IS 10 BP 1308 EP 1318 DI 10.1016/j.specom.2006.06.007 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 099II UT WOS:000241586500007 ER PT J AU Krstulovic, S Bimbot, F Boeffard, O Charlet, D Fohr, D Mella, O AF Krstulovic, Sacha Bimbot, Frederic Boeffard, Olivier Charlet, Delphine Fohr, Dominique Mella, Odile TI Optimizing the coverage of a speech database through a selection of representative speaker recordings SO SPEECH COMMUNICATION LA English DT Article DE speech database; cost minimization; speaker selection; speaker clustering; optimal coverage; multi-models; speech and speaker recognition; speech synthesis ID RECOGNITION; ADAPTATION AB In the context of the N-EOLOGOS French speech database creation project, a general methodology was defined for the selection of representative speaker recordings. The selection aims at providing good coverage in terms of speaker variability while limiting the number of recorded speakers. This is intended to make the resulting database both more adapted to the development of recently proposed multi-model methods and less expensive to collect. The presented methodology proposes a selection process based on the optimization of a quality criterion defined in a variety of speaker similarity modeling frameworks. The selection can be achieved with respect to a unique similarity criterion, using classical clustering methods such as hierarchical or K-medians clustering, or it can combine several speaker similarity criteria, thanks to a newly developed clustering method called focal speakers selection. In this framework, four different speaker similarity criteria are tested, and three different speaker clustering algorithms are compared. Results pertaining to the collection of the N-EOLOGOS database are also discussed. (c) 2006 Elsevier B.V. All rights reserved. C1 IRISA, METISS, F-35042 Rennes, France. IRISA, CORDIAL, F-22305 Lannion, France. France Telecom, R&D, F-22307 Lannion, France. LORIA, F-54506 Vandoeuvre Les Nancy, France. RP Krstulovic, S (reprint author), DFKI, Saarbrucken, Germany.
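The Krstulovic et al. record above selects a representative speaker subset by optimizing a coverage criterion over speaker similarities. A toy greedy medoid selection over an arbitrary speaker-to-speaker distance matrix, one simple instance of the K-medians-style clustering the abstract mentions (not their focal speakers selection; the distance matrix and k are assumed given):

```python
import numpy as np

def select_representatives(dist, k):
    """Greedily pick k speakers minimising the total distance of every speaker
    to its nearest representative; dist is a symmetric distance matrix."""
    chosen = [int(dist.sum(axis=1).argmin())]       # best single medoid
    while len(chosen) < k:
        nearest = dist[:, chosen].min(axis=1)       # current cost per speaker
        gain = np.maximum(nearest[None, :] - dist, 0).sum(axis=1)
        gain[chosen] = -1.0                         # never re-pick a medoid
        chosen.append(int(gain.argmax()))
    return chosen

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.random((100, 5))                        # toy per-speaker embeddings
    dist = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    print(select_representatives(dist, 10))         # 10 representative speakers
```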
EM sacha@dfki.de; bimbot@irisa.fr; olivier.boeffard@univ-rennes1.fr; delphine.charlet@francetelecom.com; dominique.fohr@loria.fr; odile.mella@loria.fr CR BEN M, 2002, P ICASSP 2002 BEN M, 2004, THESIS U RENNES 1 Catford John C., 1977, FUNDAMENTAL PROBLEMS COLLET M, 2005, INTERSPEECH EUROSPEE Duda R. O., 2001, PATTERN CLASSIFICATI FALTHAUSER R, 2001, P ASRU 01 FOHR D, 2000, P ICSLP 2000 FRANCOIS H, 2001, P EUR 01 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 HUANG C, 2002, P ICLSP, V1, P609 Iskra D., 2002, P LREC, P329 Johnson S., 1998, P 5 INT C SPOK LANG, P1775 KOSAKA T, 1994, P ICSLP, P1375 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Mami Yassine, 2002, P ICSLP 2002, V2, P1333 MCLAUGHLIN J, 1999, P IEEE INT C AC SPEE, V2, P817 MELLA O, 1998, 1 INT C LANG RES EV NAGORSKI A, 2002, P ICSLP, V4, P2473 Naito M, 2002, SPEECH COMMUN, V36, P305, DOI 10.1016/S0167-6393(00)00089-3 NAKAMURA A, 1996, P ICSLP, V4, P2199, DOI 10.1109/ICSLP.1996.607241 Padmanabhan M, 1998, IEEE T SPEECH AUDI P, V6, P71, DOI 10.1109/89.650313 PUSATERI E, 2002, P ICSLP 02, P61 Rabiner L, 1993, FUNDAMENTALS SPEECH Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 SOLOMONOFF A, 1998, P ICASSP 1998, P557 STURIM D, 2001, P ICASSP 2001 WU J, 2001, P EUR 2001, V2, P1261 YOSHIZAWA S, 2001, P ICASSP2001, V1, P341 NR 30 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2006 VL 48 IS 10 BP 1319 EP 1348 DI 10.1016/j.specom.2006.07.002 PG 30 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 099II UT WOS:000241586500008 ER PT J AU Jin, W Scordilis, MS AF Jin, Wen Scordilis, Michael S. TI Speech enhancement by residual domain constrained optimization SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; linear prediction; constrained optimization ID COLORED NOISE AB A new algorithm for the enhancement of speech corrupted by additive noise is proposed. This algorithm estimates the linear prediction residuals of the clean speech using a constrained optimization criterion. The signal distortion is minimized in the residual domain subject to a constraint on the average power of the noise residuals. Enhanced speech is obtained by exciting the time-varying all-pole synthesis filter with the estimated residuals of the clean speech. The proposed method was tested with speech corrupted by both white Gaussian and colored noise. The enhancement performance was evaluated in terms of segmental signal-to-noise ratio (SNR) and ITU-PESQ scores. Experimental results indicate that our method yields better enhancement results than an earlier residual-weighting scheme [Yegnanarayana, B., Avendano, C., Hermansky, H., Murthy P.S., 1999. Speech enhancement using linear prediction residual. Speech Commun. 28, 25-42]. The proposed method also achieves better noise reduction than the time-domain subspace method [Ephraim, Y., Van Trees, H.L., 1995. A signal subspace approach for speech enhancement. IEEE Trans. Speech Audio Process. 3, 251-266] on real-world colored noise. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Miami, Dept Elect & Comp Engn, Coral Gables, FL 33146 USA.
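The Jin and Scordilis record above minimizes signal distortion in the LP residual domain subject to a cap on residual noise power. A toy per-frame sketch with a per-sample Wiener-style shrinkage standing in for the paper's full constrained optimization; the LPC coefficients and noise variance are assumed given.

```python
import numpy as np
from scipy.signal import lfilter

def enhance_frame(noisy, lpc_a, noise_var, mu=1.0):
    """Inverse-filter to the residual, shrink it, re-excite the all-pole filter.
    lpc_a holds a_1..a_p of A(z) = 1 + sum_k a_k z^-k; mu plays the role of a
    Lagrange multiplier trading distortion against residual noise power."""
    a = np.concatenate(([1.0], np.asarray(lpc_a, float)))
    resid = lfilter(a, [1.0], noisy)                 # noisy prediction residual
    power = resid ** 2                               # per-sample residual power
    gain = power / (power + mu * noise_var)          # attenuate noise-dominated samples
    return lfilter([1.0], a, gain * resid)           # synthesis through 1/A(z)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = np.sin(2 * np.pi * 0.05 * np.arange(256))
    noisy = clean + 0.3 * rng.normal(size=256)
    out = enhance_frame(noisy, [-0.9], noise_var=0.09)
    print(np.mean((noisy - clean) ** 2), np.mean((out - clean) ** 2))  # toy comparison
```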
RP Scordilis, MS (reprint author), Univ Miami, Dept Elect & Comp Engn, 1251 Mem Dr,Room EB406, Coral Gables, FL 33146 USA. EM w.jin@umiami.edu; m.scordilis@miami.edu CR BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 GIBSON JD, 1991, IEEE T SIGNAL PROCES, V39, P1732, DOI 10.1109/78.91144 Hu Y, 2003, IEEE T SPEECH AUDI P, V11, P457, DOI 10.1109/TSA.2003.815936 JENSEN SH, 1995, IEEE T SPEECH AUDI P, V3, P439, DOI 10.1109/89.482211 Lev-Ari H, 2003, IEEE SIGNAL PROC LET, V10, P104, DOI 10.1109/LSP.2003.808544 Mittal U, 2000, IEEE T SPEECH AUDI P, V8, P159, DOI 10.1109/89.824700 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Rezayee A, 2001, IEEE T SPEECH AUDI P, V9, P87, DOI 10.1109/89.902276 Yegnanarayana B, 1999, SPEECH COMMUN, V28, P25, DOI 10.1016/S0167-6393(98)00070-3 NR 10 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2006 VL 48 IS 10 BP 1349 EP 1364 DI 10.1016/j.specom.2006.07.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 099II UT WOS:000241586500009 ER PT J AU Kacha, A Grenez, F Schoentgen, J AF Kacha, A. Grenez, F. Schoentgen, J. TI Estimation of dysperiodicities in disordered speech SO SPEECH COMMUNICATION LA English DT Article DE disordered speech analysis; long-term prediction; generalized variogram; segmental signal-to-dysperiodicity ratio ID PREDICTION; VOICES AB This paper presents two methods for tracking vocal dysperiodicities in connected speech. The first is based on a long-term linear predictor with one coefficient and the second on a generalized variogram. Both analysis methods guarantee that a slight increase or decrease of irregularities in the speech signal produces a slight increase or decrease of the estimated vocal dysperiodicity trace. No spurious noise boosting occurs owing to erroneous insertions or omissions of speech cycles, or the comparison of speech cycles across phonetic boundaries. The two techniques differ with regard to how slow changes of speech cycle amplitudes are compensated for. They are compared on two speech corpora. One comprises stationary fragments of vowel [a] produced by 89 male and female normophonic and dysphonic speakers. Another comprises four French sentences as well as vowel [a] produced by 22 male and female normophonic and dysphonic speakers. Vocal dysperiodicities are summarized by means of global and segmental signal-to-dysperiodicity ratios. They are correlated with hoarseness scores obtained by means of perceptual ratings of the speech tokens. The two techniques obtain signal-to-dysperiodicity ratios that are statistically significantly correlated with the hoarseness scores. For connected speech, the segmental signal-to-dysperiodicity ratio correlates more strongly with perceptual scores of hoarseness than the global signal-to-dysperiodicity ratio. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Libre Bruxelles, Fac Engn, Dept Signals & Waves, B-1050 Brussels, Belgium. Natl Fund Sci Res, Brussels, Belgium. RP Kacha, A (reprint author), Univ Libre Bruxelles, Fac Engn, Dept Signals & Waves, Av FD Roosevelt 50,CP 165-51, B-1050 Brussels, Belgium. EM akacha@ulb.ac.be CR Bettens F, 2005, J ACOUST SOC AM, V117, P328, DOI 10.1121/1.1835511 Haslett J, 1997, STATISTICIAN, V46, P475 Haykin S., 1991, ADAPTIVE FILTER THEO Jayant N.
S., 1984, DIGITAL CODING WAVEF Kacha A., 2005, P INT C SPOK LANG PR, P1733 KACHA A, 2005, P ICSLP 05, V1, P917, DOI 10.1109/ICASSP.2005.1415264 Kay S. M., 1988, MODERN SPECTRAL ESTI Kreiman J, 1998, J ACOUST SOC AM, V104, P1598, DOI 10.1121/1.424372 MAKHOUL J, 1977, IEEE T ACOUST SPEECH, V25, P423, DOI 10.1109/TASSP.1977.1162979 Moore D. S., 1999, INTRO PRACTICE STAT, V3rd Qi YY, 1999, J ACOUST SOC AM, V105, P2532, DOI 10.1121/1.426860 Quackenbush S. R., 1988, OBJECTIVE MEASURES S RAMACHANDRAN RP, 1989, IEEE T ACOUST SPEECH, V37, P467, DOI 10.1109/29.17527 Schoentgen J, 2000, J SPEECH LANG HEAR R, V43, P1493 Tribolet J. M., 1978, Proceedings of the 1978 IEEE International Conference on Acoustics, Speech and Signal Processing YANAGIHA.N, 1967, J SPEECH HEAR RES, V10, P531 NR 16 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2006 VL 48 IS 10 BP 1365 EP 1378 DI 10.1016/j.specom.2006.07.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 099II UT WOS:000241586500010 ER PT J AU Vicente-Pena, J Gallardo-Antolin, A Pelaez-Moreno, C Diaz-de-Maria, F AF Vicente-Pena, J. Gallardo-Antolin, A. Pelaez-Moreno, C. Diaz-de-Maria, F. TI Band-pass filtering of the time sequences of spectral parameters for robust wireless speech recognition SO SPEECH COMMUNICATION LA English DT Article DE robust speech recognition; wireless speech recognition; transmission errors; modulation spectrum; RASTA-PLP ID WORLD-WIDE-WEB; FRONT-END AB In this paper we address the problem of automatic speech recognition when wireless speech communication systems are involved. In this context, three main sources of distortion should be considered: acoustic environment, speech coding and transmission errors. Whilst the first one has already received a lot of attention, the last two deserve further investigation in our opinion. We have found out that band-pass filtering of the recognition features improves ASR performance when distortions due to these particular communication systems are present. Furthermore, we have evaluated two alternative configurations at different bit error rates (BER) typical of these channels: band-pass filtering the LP-MFCC parameters or a modification of the RASTA-PLP using a sharper low-pass section perform consistently better than LP-MFCC and RASTA-PLP, respectively. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Carlos III Madrid, EPS, Dpto Teoria Senal & Comunicac, Madrid 28911, Spain. RP Vicente-Pena, J (reprint author), Univ Carlos III Madrid, EPS, Dpto Teoria Senal & Comunicac, Avda Univ 30, Madrid 28911, Spain. 
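The Vicente-Pena et al. record above band-pass filters the time sequences of spectral parameters. A minimal sketch using the classical RASTA transfer function as the band-pass prototype, applied along the frame axis; the paper's modified variant with a sharper low-pass section is not reproduced here.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(feats, pole=0.98):
    """Filter each coefficient trajectory with the classical RASTA band-pass
    H(z) ~ 0.1*(2 + z^-1 - z^-3 - 2 z^-4) / (1 - pole*z^-1).
    feats: array of shape (n_frames, n_coeffs)."""
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -pole])
    return lfilter(b, a, feats, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    mfcc = rng.normal(size=(200, 13)).cumsum(axis=0)  # trajectories with slow drift
    print(rasta_filter(mfcc).std(axis=0)[:3])          # slow drift largely removed
```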
EM jvicente@tsc.uc3m.es; gallardo@tsc.uc3m.es; carmen@tsc.uc3m.es; fdiaz@tsc.uc3m.es RI Diaz de Maria, Fernando/E-8048-2011; Gallardo-Antolin, Ascension/L-4152-2014; Pelaez-Moreno, Carmen/B-7373-2008 OI Pelaez-Moreno, Carmen/0000-0003-1425-6763 CR [Anonymous], 2003, 201108 ETSI ES CHEN B, 2004, P INT S SPOK LANG PR, P925 Digalakis VV, 1999, IEEE J SEL AREA COMM, V17, P82, DOI 10.1109/49.743698 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 *ETSI, 1999, 620 GSM ETSI *ETSI ES, 2004, 202050 ETSI ES *ETSI ETS, 1999, 300578 ETSI ETS *ETSI TS, 2004, 128062 ETSI TS EULER S, 1994, P INT C AC SPEECH SI, V1, P621 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 Gallardo-Antolin A, 2005, IEEE T SPEECH AUDI P, V13, P1186, DOI 10.1109/TSA.2005.853210 Greenberg Steven, 1996, P ESCA WORKSH AUD BA, P1 HANSON BA, 1993, 1993 P IEEE INT C AC, V2, P79 Hermansky H., 1994, IEEE T SPEECH AUDIO, V2, P587 Hermansky H., 1992, P IEEE INT C AC SPEE, V1, P121 Hirsch H. G., 2002, P INT C SPOK LANG PR, P1877 Junqua J., 2000, ROBUST SPEECH RECOGN KANEDERA N, 1998, P IEEE INT C AC SPEE, V2, P613, DOI 10.1109/ICASSP.1998.675339 Kim HK, 2002, IEEE T SPEECH AUDI P, V10, P591, DOI 10.1109/TSA.2002.804302 KISS I, 2003, P ASRU, P613 LILLY B, 1996, P ICSLP, V4, P2344, DOI 10.1109/ICSLP.1996.607278 Nadeu C, 2001, SPEECH COMMUN, V34, P93, DOI 10.1016/S0167-6393(00)00048-0 Nadeu C, 1997, SPEECH COMMUN, V22, P315, DOI 10.1016/S0167-6393(97)00030-7 *NIST, 1992, RES MAN CORP RMI PELAEZMORENO C, 2002, P INT C SPOK LANG PR, P2217 Pelaez-Moreno C, 2001, IEEE T MULTIMEDIA, V3, P209, DOI 10.1109/6046.923820 SMOLDERS J, 1993, P INT C AC SPEECH SI, V2, P684 Weiss N. A., 1993, INTRO STAT Young S. J., 1995, HTK HIDDEN MARKOV MO NR 29 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2006 VL 48 IS 10 BP 1379 EP 1398 DI 10.1016/j.specom.2006.07.007 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 099II UT WOS:000241586500011 ER PT J AU Xydas, G Kouroupetroglou, G AF Xydas, Gerasimos Kouroupetroglou, Georgios TI Tone-Group F-0 selection for modeling focus prominence in small-footprint speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE Text-to-Speech synthesis; Tone-Group unit-selection; intonation and emphasis in speech synthesis ID VOWELS AB This work aims to improve the naturalness of synthetic intonational contours in Text-to-Speech synthesis through the provision of prominence, which is a major expression of human speech. Focusing on the tonal dimension of emphasis, we present a robust unit-selection methodology for generating realistic F-0 curves in cases where focus prominence is required. The proposed approach is based on selecting Tone-Group units from commonly used prosodic corpora that are automatically transcribed as patterns of syllables. In contrast to related approaches, patterns represent only the most perceivable sections of the sampled curves and are encoded to serve morphologically different sequences of syllables. This results in a minimization of the required number of units so as to achieve sufficient coverage within the database. Nevertheless, this optimization enables the application of high-quality F0 generation to small-footprint text-to-speech synthesis. For generic F0 selection we query the database based on sequences of ToBI labels, though other intonational frameworks can be used as well.
To realize focus prominence on specific Tone-Groups, the selection also incorporates a level indicator of emphasis. We set up a series of listening tests by exploiting a database built from a 482-utterance corpus, which featured partially purpose-uttered emphasis. The results showed a clear subjective preference for the proposed model over a linear regression one in 75% of the cases when used in generic synthesis. Furthermore, this model provided an ambiguous percept of emphasis in an experiment featuring major and minor degrees of prominence. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Athens, Dept Informat & Telecommun, Div Commun & Signal Proc, GR-15784 Athens, Greece. RP Kouroupetroglou, G (reprint author), Univ Athens, Efkalypton 39, GR-15342 Athens, Greece. EM gxydas@di.uoa.gr; koupe@di.uoa.gr CR Arvaniti A, 1998, J PHONETICS, V26, P3, DOI 10.1006/jpho.1997.0063 Arvaniti A., 2005, PROSODIC TYPOLOGY PH, P84 AULANKO R, 1985, FONETIIKAN PAIVAT TU, P33 BAILLY G, 2005, SPEECH COMMUN, V46, P364 Beutnagel M., 1999, P 137 M AC SOC AM, P18 Black A., 2000, BUILDING VOICES FEST Black A., 2003, P EUROSPEECH GEN SWI, P1649 BLACK A, 2000, P ICSLP, V2, P411 BLACK A, 1996, P ICSLP 96 PHIL US, V3, P1385, DOI 10.1109/ICSLP.1996.607872 Black A. W., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025704800086 BLACK AW, 2001, P 4 ISCA WORKSH SPEE, P204 Black A.W., 1998, FESTIVAL SPEECH SYNT BULYKO I, 2001, P ICASSP, V2, P781 CALDER J, 2005, MULTIMODAL INTELLIGE, V27, P177, DOI 10.1007/1-4020-3051-7_9 Campbell N, 2005, IEICE T INF SYST, VE88D, P376, DOI 10.1093/ietisy/e88-d.3.376 CAMPBELL N, 1994, P ESCA WORKSH SPEECH, P61 CLARK R, 2003, THESIS U EDINBURGH Collier R., 1990, PERCEPTUAL STUDY INT CONKIE A, 1994, P SSW2 2 ESCA IEEE W, P119 DALESSANDRO C, 1995, COMPUT SPEECH LANG, V9, P257, DOI 10.1006/csla.1995.0013 DONOVAN RE, 1995, P EUROSPEECH 95, V1, P573 DUSTERHOFF K, 1997, INTONATION THEORY MO, P107 Dutoit T., 1997, INTRO TEXT SPEECH SY Dutoit T., 1996, P ICSLP 96 PHIL, V3, P1393, DOI 10.1109/ICSLP.1996.607874 EIDE E, 2003, P SSW5 5 ISCA ITRW S, P79 Fourakis M, 1999, PHONETICA, V56, P28, DOI 10.1159/000028439 Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1 HITZEMAN J, 1999, P EUR WORKSH NAT LAN, P59 Huang X., 1996, P ICSLP 96 PHIL, P659 Hunt A., 1996, P INT C AC SPEECH SI, V1, P373 Keller E., 2003, LINGUISTIK ONLINE, V17, P57 Kishore S. P., 2003, P EUROSPEECH 2003 GE, P1317 Ladd D.
R., 1986, PHONOLOGY YB, V3, P311, DOI 10.1017/S0952675700000671 Lieberman Philip, 1967, INTONATION PERCEPTIO MALFRERE F, 1998, SSW3 3 ESCA COCOSDA, P323 MERON J, 2001, P SSW4 4 ISCA ITRW S, P113 MONAGHAN AIC, 1992, INT C SPEECH LANG PR, P1159 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Mozziconacci S., 1999, P 14 INT C PHON SCI, P2001 MOZZICONACCI S, 2000, P ISCA ITRW NEWC N I, P45 Nespor M., 1986, PROSODIC PHONOLOGY Pierrehumbert J, 1980, THESIS MIT PITRELLI JF, 2003, P IEEE ASRU AUT SPEE, P694 QUAZZA S, 2001, P SSW4 4 ISCA ITRW S Raux A., 2003, P WORKSH AUT SPEECH, P700 SchrOder M., 2001, P 7 EUR C SPEECH COM, V1, P561 Schweitzer A., 2003, P EUROPEAN C SPEECH, P1321 Selkirk E., 1986, PHONOLOGY YB, V3, P371 Selkirk Elizabeth, 1995, U MASSACHUSETTS OCCA, V18, P439 Selkirk Elizabeth, 1978, NORDIC PROSODY Silverman K., 1992, P INT C SPOK LANG PR, P867 SPROAT R, 1998, MULTILINGUAL TEXT SP Taylor P, 2001, SPEECH COMMUN, V33, P153, DOI 10.1016/S0167-6393(00)00074-1 Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 VAINIO M, 2001, THESIS U HELSINKI WHALEN DH, 1995, J PHONETICS, V23, P349, DOI 10.1016/S0095-4470(95)80165-0 Wightman C. W., 2000, P ICSLP, V2, P71 XUB Y, 2002, J ACOUST SOC AM, V111, P1388 Xydas G, 2005, IEICE T INF SYST, VE88D, P510, DOI 10.1093/ietisy/e88-d.3.510 XYDAS G, 1910, P SSW4 4 ISCA ITRW S, P167 XYDAS G, 2004, P ICSLP 2004, V1, P801 ZERVAS P, 2005, P SPECOM 2005 10 INT, V2, P603 NR 62 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1057 EP 1078 DI 10.1016/j.specom.2006.02.002 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100001 ER PT J AU Roberts, F Francis, AL Morgan, M AF Roberts, Felicia Francis, Alexander L. Morgan, Melanie TI The interaction of inter-turn silence with prosodic cues in listener perceptions of "trouble" in conversation SO SPEECH COMMUNICATION LA English DT Article DE silence; prosody; pausing; human conversation; word duration ID COMMUNICATION; FEATURES; DIALOGUE AB The forms, functions, and organization of sounds and utterances are generally the focus of speech communication research; little is known, however, about how the silence between speaker turns shades the meaning of the surrounding talk. We use an experimental protocol to test whether listeners' perception of trouble in interaction (e.g., disagreement or unwillingness) varies when prosodic cues are manipulated in the context of 2 speech acts (requests and assessments). The prosodic cues investigated were inter-turn silence and the duration, absolute pitch, and pitch contour of affirmative response tokens ("yeah" and "sure") that followed the inter-turn silence. Study participants evaluated spoken dialogues simulating telephone calls between friends in which the length of silence following a request/assessment (i.e., the inter-turn silence) was manipulated in Praat, as were prosodic features of the responses. Results indicate that with each incremental increase in pause duration (0-600-1200 ms) listeners perceived increasingly less willingness to comply with requests and increasingly weaker agreement with assessments. Inter-turn silence and duration of response token proved to be stronger cues to unwillingness and disagreement than did the response token's pitch characteristics.
However, listeners tend to perceive response token duration as a cue to "trouble" when inter-turn silence cues were apparently ambiguous (less than 1 s). (c) 2006 Elsevier B.V. All rights reserved. C1 Purdue Univ, Dept Commun, W Lafayette, IN 47907 USA. Dept Speech Language & Hearing Sci, W Lafayette, IN 47907 USA. RP Roberts, F (reprint author), Purdue Univ, Dept Commun, Beering Hall,Room 2114, W Lafayette, IN 47907 USA. EM froberts@purdue.edu; francisa@purdue.edu; melanie.morgan@cla.purdue.edu CR ANDERSON PA, 1999, NONVERBAL COMMUNICAT Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 Boersma P., 2001, PRAAT SYSTEM DOING P Boersma P., 1993, P I PHONETIC SCI, V17, P97 BRADY PT, 1969, AT&T TECH J, V48, P2445 BRENNAN SE, 1995, J MEM LANG, V34, P383, DOI 10.1006/jmla.1995.1017 BURGOON JK, 1995, J LANG SOC PSYCHOL, V14, P289, DOI 10.1177/0261927X95143003 Davidson J., 1984, STRUCTURES SOCIAL AC, P102 GLUCKSBERG S, 1981, J EXP PSYCHOL-HUM L, V7, P311, DOI 10.1037/0278-7393.7.5.311 GREEN SB, 1997, USING SPSS WINDOWS A HART JT, 1965, J EDUC PSYCHOL, V56, P208, DOI 10.1037/h0022263 Hirschberg J, 2002, SPEECH COMMUN, V36, P31, DOI 10.1016/S0167-6393(01)00024-3 HIRSCHBERG J, 1992, J PHONETICS, V20, P241 Holmes Janet, 2003, HDB LANGUAGE GENDER Hopper R., 1992, TELEPHONE CONVERSATI Jackson S. A., 1992, MESSAGE EFFECTS RES Jefferson G., 1989, CONVERSATION INTERDI, P166 Krahmer E, 2002, SPEECH COMMUN, V36, P133, DOI 10.1016/S0167-6393(01)00030-9 Levinson Stephen C., 1983, PRAGMATICS Nelson T., 1993, METACOGNITION CORE R O'Shaughnessy D, 2003, P IEEE, V91, P1272, DOI 10.1109/JPROC.2003.817117 Pomerantz Anita, 1984, STRUCTURES SOCIAL AC, P57 REPP BH, 1982, PSYCHOL BULL, V92, P81, DOI 10.1037//0033-2909.92.1.81 Roberts F, 2004, HUM COMMUN RES, V30, P376, DOI 10.1093/hcr/30.3.376 SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243 SCHEUERMANN EA, 1979, ERNAHRUNGSWIRTSCHAFT, P23 SELTING M, 1996, PROSODY CONVERSATION Shimojima A, 2002, SPEECH COMMUN, V36, P113, DOI 10.1016/S0167-6393(01)00029-2 SMITH VL, 1993, J MEM LANG, V32, P25, DOI 10.1006/jmla.1993.1002 Swerts M, 2005, J MEM LANG, V53, P81, DOI 10.1016/j.jml.2005.02.003 WILSON TP, 1986, DISCOURSE PROCESS, V9, P375 NR 31 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1079 EP 1093 DI 10.1016/j.specom.2006.02.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100002 ER PT J AU Ogut, F Kilic, MA Engin, EZ Midilli, R AF Ogut, Fatih Kilic, Mehmet Akif Engin, Erkan Zeki Midilli, Rasit TI Voice onset times for Turkish stop consonants SO SPEECH COMMUNICATION LA English DT Article DE articulation; consonant; acoustics; speech; stop consonants; voice onset time ID INITIAL STOPS; PRODUCTIONS; DYSARTHRIA; SPEAKERS; ENGLISH; APRAXIA; SPEECH; FRENCH AB In this study, we aimed to determine the average VOT (voice onset time) values of the Turkish stop consonants by using 30 volunteers (15 female and 15 male). To this end, we measured the VOT values of the six Turkish stops (i.e., /p/, /b/, /t/, /d/, /k/ and /g/), each of which was uttered three times by the 30 subjects, on wideband spectrograms. As a result, the average VOT values of /p/, /b/, /t/, /d/, /k/ and /g/ were found to be 41, -66, 50, -53, 69, and -10 ms, respectively. (c) 2006 Published by Elsevier B.V. C1 Ege Univ, Fac Med, ENT Dept, TR-35100 Izmir, Turkey.
Sutcu Imam Univ, Fac Med, ENT Dept, Kahramanmaras, Turkey. Ege Univ, Fac Engn, Elect & Elect Dept, TR-35100 Izmir, Turkey. RP Ogut, F (reprint author), Ege Univ, Fac Med, ENT Dept, TR-35100 Izmir, Turkey. EM fatih.ogut@ege.edu.tr CR ABRAMSON AS, 1977, PHONETICA, V34, P295 Auzou P, 2000, CLIN LINGUIST PHONET, V14, P131 BORTOLINI U, 1995, INT J PEDIATR OTORHI, V31, P191, DOI 10.1016/0165-5876(94)01091-B CARUSO AJ, 1987, J SPEECH HEAR RES, V30, P80 DAVIS K, 1995, J CHILD LANG, V22, P275 Demircan Omer, 1996, TURKCENIN SESDIZIMI HARDCASTLE WJ, 1985, BRIT J DISORD COMMUN, V20, P249 ITOH M, 1982, BRAIN LANG, V17, P193, DOI 10.1016/0093-934X(82)90016-5 Jaklin Kornfilt, 1997, TURKISH Kessinger RH, 1997, J PHONETICS, V25, P143, DOI 10.1006/jpho.1996.0039 KLATT DH, 1975, J SPEECH HEAR RES, V18, P686 Lisker L., 1970, P 6 INT C PHON SCI P, P563 LISKER L, 1971, LANGUAGE, V47, P767, DOI 10.2307/412155 LISKER L, 1964, WORD, V20, P384 MacKay I.R., 1987, PHONETICS SCI SPEECH MacLeod A. A., 2005, J MULTILINGUAL COMMU, V3, P118, DOI DOI 10.1080/14769670500066313 Ozsancak C, 2001, FOLIA PHONIATR LOGO, V53, P48, DOI 10.1159/000052653 PETROSINO L, 1993, PERCEPT MOTOR SKILL, V76, P83 Rosner BS, 2000, J PHONETICS, V28, P217, DOI 10.1006/jpho.2000.0113 RYALLS J, 1995, J COMMUN DISORD, V28, P205, DOI 10.1016/0021-9924(94)00009-O SWEETING PM, 1982, J SPEECH HEAR RES, V25, P129 NR 21 TC 6 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1094 EP 1099 DI 10.1016/j.specom.2006.02.003 PG 6 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100003 ER PT J AU Sasou, A Asano, F Nakamura, S Tanaka, K AF Sasou, Akira Asano, Futoshi Nakamura, Satoshi Tanaka, Kazuyo TI HMM-based noise-robust feature compensation SO SPEECH COMMUNICATION LA English DT Article DE noise robust; hidden Markov model; AURORA2 AB In this paper, we describe a hidden Markov model (HMM)-based feature-compensation method. The proposed method compensates for noise-corrupted speech features in the mel-frequency cepstral coefficient (MFCC) domain using the output probability density functions (pdfs) of the HMM. In compensating the features, the output pdfs are adaptively weighted according to forward path probabilities. Because of this, the proposed method can minimize degradation of feature-compensation accuracy due to a temporally changing noise environment. We evaluated the proposed method based on the AURORA2 database. All the experiments were conducted under clean conditions. The experimental results indicate that the proposed method, combined with cepstral mean subtraction, can achieve a word accuracy of 87.64%. We also show that the proposed method is useful in a transient pulse noise environment. (c) 2006 Elsevier B.V. All rights reserved. C1 Natl Inst Adv Ind Sci & Technol, Informat Technol Res Inst, Ibaraki 3058568, Japan. Adv Telecommun Res Inst Int ATR, Spoken Language Translat Res Labs, Seika, Japan. Univ Tsukuba, Inst Lib & Informat Sci, Tsukuba, Ibaraki 305, Japan. RP Sasou, A (reprint author), Natl Inst Adv Ind Sci & Technol, Informat Technol Res Inst, 1-1-1 Tsukuba, Ibaraki 3058568, Japan.
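The Sasou et al. record above weights HMM output pdfs by forward path probabilities to reconstruct compensated features. A toy sketch of forward-probability-weighted reconstruction from clean-model Gaussian means; the noise adaptation of the pdfs themselves, central to the paper, is omitted here.

```python
import numpy as np

def hmm_compensate(feats, pi, A, means, covs):
    """Replace each frame by the forward-probability-weighted average of the
    HMM's diagonal-Gaussian output means. Shapes: feats (T, D), pi (S,),
    A (S, S), means (S, D), covs (S, D)."""
    T, _ = feats.shape
    diff = feats[:, None, :] - means[None, :, :]
    ll = -0.5 * ((diff ** 2 / covs).sum(-1) + np.log(2 * np.pi * covs).sum(-1))
    b = np.exp(ll - ll.max(axis=1, keepdims=True))   # scaled likelihoods, (T, S)
    out = np.empty_like(feats)
    alpha = pi * b[0]
    for t in range(T):
        if t:
            alpha = (alpha @ A) * b[t]               # forward recursion
        alpha = alpha / alpha.sum()                  # per-frame scaling
        out[t] = alpha @ means                       # weighted reconstruction
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, D, T = 4, 2, 30
    means, covs = rng.normal(size=(S, D)), np.ones((S, D))
    A = np.full((S, S), 0.1) + 0.6 * np.eye(S)       # rows sum to 1
    pi = np.full(S, 1.0 / S)
    feats = means[rng.integers(0, S, T)] + 0.8 * rng.normal(size=(T, D))
    print(hmm_compensate(feats, pi, A, means, covs)[:3])
```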
EM a-sasou@aist.go.jp; f.asano@aist.go.jp; satoshi.nakamura@atr.jp; ktanaka@ulis.ac.jp CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Deng L., 2000, P ICSLP, VIII, P806 Doucet A, 2000, STAT COMPUT, V10, P197, DOI 10.1023/A:1008935410038 FUJIMOTO M, 2005, P ICASSP2005, V1, P257, DOI 10.1109/ICASSP.2005.1415099 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 NAKAMURA S, 2000, 2 INT C LANG RES EV PEARCE D, 2000, ISCA ITRW ASR 2000 Segura J. C., 2001, P EUROSPEECH2001, P221 Varga A.P., 1990, P ICASSP, P845 YAO K, 2000, P ICSLP2000, P760 Yao K., 2001, P NIPS 01, P1205 Yao KS, 2004, SPEECH COMMUN, V42, P5, DOI 10.1016/j.specom.2003.09.002 NR 14 TC 9 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1100 EP 1111 DI 10.1016/j.specom.2006.03.002 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100004 ER PT J AU Bassi, A Yoma, NB Loncomilla, P AF Bassi, Alejandro Yoma, Nestor Becerra Loncomilla, Patricio TI Estimating tonal prosodic discontinuities in Spanish using HMM SO SPEECH COMMUNICATION LA English DT Article ID SPEECH AB The tonal prosodic discontinuity estimation in Spanish is exhaustively modelled using HMM. Due to the high morphological complexity in Spanish, a relatively coarse grammatical categorization is tested in two sorts of texts (sentences from newspapers and a theatre play). The estimation of the type of discontinuity (falling or rising tones) at the boundary of intonation groups is assessed. The HMM approach is tested with: (a) modelling the observation probability with monograms, bigrams and full-window probability; (b) state duration modelling; (c) discriminative analysis of intermediate and final observation vectors and (d) penalization scheme in Viterbi decoding. The optimal configurations led to reductions of 3% or 5% in detection errors. The estimation of the observation probability with monograms and bigrams leads to worse results than the ordinary full-window probability, although they provide better generalization. Nevertheless, the performance of the monograms and bigrams approximation can be enhanced if applied in combination with state duration constraints. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Chile, Dept Elect Engn, Santiago, Chile. Univ Chile, Dept Comp Sci, Santiago, Chile. RP Yoma, NB (reprint author), Univ Chile, Dept Elect Engn, Av Tupper 2007,POB 412-3, Santiago, Chile.
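The Bassi et al. record above evaluates a penalization scheme in Viterbi decoding. One simple reading of such a scheme, sketched below: a fixed penalty added to every state switch during decoding. The paper's actual penalty and its duration models may differ; this only illustrates the mechanics.

```python
import numpy as np

def viterbi_penalized(log_b, log_A, log_pi, switch_pen=0.0):
    """Viterbi decoding with an extra cost on off-diagonal (state-switching)
    transitions. log_b: (T, S) observation log-probs."""
    T, S = log_b.shape
    pen = switch_pen * (1.0 - np.eye(S))         # penalise state switches only
    delta = log_pi + log_b[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A - pen
        back[t] = scores.argmax(0)
        delta = scores.max(0) + log_b[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    log_b = np.log(rng.random((20, 3)))
    log_A = np.log(np.full((3, 3), 1.0 / 3.0))
    log_pi = np.log(np.full(3, 1.0 / 3.0))
    print(viterbi_penalized(log_b, log_A, log_pi, switch_pen=2.0))  # smoother path
```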
EM nbecerra@ing.uchile.cl CR AGUERO PD, 2000, 19 C SOC ESP PROC LE ANDERSON M, 1984, P ICASSP 1 SAN DIEG, P281 ATTERER M, 2002, P 1 INT C SPEECH PRO Beckman Mary, 2002, PROBUS, V14, P9, DOI 10.1515/prbs.2002.008 BLACK AW, 1997, EURO 97 RHOD GREEC, V2, P995 Fant Lars, 1984, ESTRUCTURA INFORM ES GARRIDO JM, 1995, P 13 INT C PHON SCI, V2, P370 Gussenhoven C., 2002, GLOT INT, V6, P271 Hirschberg J, 1996, SPEECH COMMUN, V18, P281, DOI 10.1016/0167-6393(96)00017-9 Huang X.D., 1990, HIDDEN MARKOV MODELS HUCKVALE M, 2001, SPEECH FILING SYSTEM Jelinek F., 1998, STAT METHODS SPEECH Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X Navarro T., 1944, MANUAL ENTONACION ES Olshen R., 1984, CLASSIFICATION REGRE, V1st Prieto P, 1998, J PHONETICS, V26, P261, DOI 10.1006/jpho.1998.0074 Quilis Antonio, 1993, TRATADO FONOLOGIA FO Silverman K, 1992, ICSLP, P867 Sosa J., 1999, ENTONACION ESPANOL Yoma NB, 2001, IEEE T SPEECH AUDI P, V9, P179, DOI 10.1109/89.902285 NR 20 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1112 EP 1125 DI 10.1016/j.specom.2006.03.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100005 ER PT J AU Sethy, A Narayanan, S Parthasarthy, S AF Sethy, Abhinav Narayanan, Shrikanth Parthasarthy, S. TI A split lexicon approach for improved recognition of spoken names SO SPEECH COMMUNICATION LA English DT Article DE syllable; spoken name recognition; reverse lookup; split lexicon AB Recognition of spoken names is a challenging task for automatic speech recognition systems because the list of names for applications such as directory assistance tends to be on the order of several hundred thousand. This makes spoken name recognition a very high perplexity task. In this paper we propose the use of syllables as the acoustic unit for spoken name recognition based on reverse lookup schemes and show how syllables can be used to improve recognition performance and reduce the system perplexity. We present system design methodologies to address the problem of acoustic-training data sparsity encountered when using longer-length units such as syllables. We illustrate our ideas first on a TIMIT based continuous speech recognition problem and then focus on the application of these ideas to spoken name recognition. Our results on the OGI spoken name corpus indicate that using syllables in place of phoneme models can help boost system accuracy significantly while helping to reduce the system complexity. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ So Calif, Integrated Media Syst Ctr, Dept Elect Engn Syst, Los Angeles, CA 90007 USA. AT&T Labs Res, Florham Pk, NJ 07932 USA. RP Sethy, A (reprint author), Univ So Calif, Integrated Media Syst Ctr, Dept Elect Engn Syst, Los Angeles, CA 90007 USA.
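The Sethy et al. record above builds its lexicon from syllable units, which presupposes syllabifying pronunciations. A toy greedy maximum-onset syllabifier; the phone symbols, vowel set, and onset inventory below are illustrative assumptions, not the paper's actual syllabification rules.

```python
def syllabify(phones, vowels, legal_onsets):
    """Greedy maximum-onset syllabification: each syllable boundary is pushed
    as far left as the legal consonant onsets allow."""
    nuclei = [i for i, p in enumerate(phones) if p in vowels]
    bounds = [0]
    for prev, cur in zip(nuclei, nuclei[1:]):
        cut = cur                                  # start with an empty onset
        while cut - 1 > prev and tuple(phones[cut - 1:cur]) in legal_onsets:
            cut -= 1                               # grow the onset while legal
        bounds.append(cut)
    bounds.append(len(phones))
    return [phones[b:e] for b, e in zip(bounds, bounds[1:])]

if __name__ == "__main__":
    onsets = {("s",), ("t",), ("r",), ("t", "r")}  # hypothetical onset inventory
    print(syllabify(list("astro"), set("aeiou"), onsets))
    # -> [['a', 's'], ['t', 'r', 'o']] since ('s','t','r') is not a legal onset here
```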
EM sethy@sipi.usc.edu; shri@sipi.usc.edu; sps@research.att.com CR ABELLA A, 1998, P ICSLP BEAUFAYS F, 2003, P EUR BILLI R, 1988, IEEE WORKSH INT VOIC DESHMUKH N, 1997, P ICASSP FISHER M, 2000, SYLLABIFICATION SOFT GALESCU L, 2002, P ICSLP GANAPATHIRAJU A, 2001, IEEE T SPEECH AUDIO GEUTNER P, 1988, P ICSLP GISH H, 1996, P ICSLP GREENBERG S, 1997, P ESCA WORKSH ROB SP HAIN T, 2002, P PMLA Kahn Daniel, 1976, THESIS INDIANA U LIN KIRCHHOFF K, 1996, P ICSLP KORKMAZSKIY F, 1997, P ICASSP LIPPMANN R, 1996, P WORKSH AUD BAS SPE MASSARO DW, 1972, PSYCHOL REV, V79, P124, DOI 10.1037/h0032264 Odell J., 1995, HTK BOOK HTK V2 0 RAMABHADRAN B, 1999, P EUR Riley M, 1999, SPEECH COMMUN, V29, P209, DOI 10.1016/S0167-6393(99)00037-0 SETHY A, 2003, P IEEE ASRU SETHY A, 2002, P ISCA PRON MOD WORK Shafran Z, 2003, COMPUT SPEECH LANG, V17, P311, DOI 10.1016/S0885-2308(02)00049-9 THONG JV, 2000, SPEECHBOT SPEECH REC WU SL, 1988, P ICASSP NR 24 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1126 EP 1136 DI 10.1016/j.specom.2006.03.005 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100006 ER PT J AU Misu, T Kawahara, T AF Misu, Teruhisa Kawahara, Tatsuya TI Dialogue strategy to clarify user's queries for document retrieval system with speech interface SO SPEECH COMMUNICATION LA English DT Article DE spoken dialogue system; information retrieval; document retrieval; dialogue strategy ID INFORMATION AB This paper proposes a dialogue strategy for clarifying and constraining queries to document retrieval systems with speech input interfaces. It is indispensable for spoken dialogue systems to interpret user's intention robustly in the presence of speech recognition errors and extraneous expressions characteristic of spontaneous speech. In speech input, moreover, users' queries tend to be vague, and they may need to be clarified through dialogue in order to extract sufficient information to get meaningful retrieval results. In conventional database query tasks, it is easy to cope with these problems by extracting and confirming keywords based on semantic slots. However, it is not straightforward to apply such a methodology to general document retrieval tasks. In this paper, we first introduce two statistical measures for identifying critical portions to be confirmed. The relevance score (RS) represents the matching degree with the document set. The significance score (SS) detects portions that affect retrieval results. With these measures, the system can generate confirmations to handle speech recognition errors, prior to and after the retrieval, respectively. Then, we propose a dialogue strategy for generating clarifications to narrow down the retrieved items, especially when many documents are matched because of a vague input query. The optimal clarification question is dynamically selected based on information gain (IG) - the reduction in the number of matched items. A set of possible clarification questions is prepared using various knowledge sources. As a bottom-up knowledge source, we extract a list of words that can take a number of objects and potentially causes ambiguity, using a dependency structure analysis of the document texts. This is complemented by top-down knowledge sources of metadata and hand-crafted questions. 
Our dialogue strategy is implemented and evaluated against a software support knowledge base of 40 K entries. We demonstrate that our strategy significantly improves the success rate of retrieval. (c) 2006 Elsevier B.V. All rights reserved. C1 Kyoto Univ, Sch Informat, Kyoto 6068501, Japan. RP Misu, T (reprint author), Kyoto Univ, Sch Informat, Kyoto 6068501, Japan. EM misu@ar.media.kyoto-u.ac.jp CR BARNETT J, 1997, P EUR BENNACEF S, 1996, P ICSLP BOUWMAN G, 1999, P IEEE ICASSP Chang E, 2002, IEEE T SPEECH AUDI P, V10, P531, DOI 10.1109/TSA.2002.804301 DENECKE M, 1997, P EUR FUJII A, 2003, P EUR HARABAGIU S, 2002, P COLING, P502 HAZEN TJ, 2000, P ICSLP HORI C, 2003, P IEEE ICASSP KIYOTA Y, 2002, P 19 INT C COMP LING, P460 KOMATANI K, 2002, P COLING, P481 Komatani K., 2000, P INT C COMP LING CO, P467 LEE A, 2001, P EUR Levin E., 2000, P ICSLP LEWIS C, 2005, P IEEE ICASSP Miller B. W., 1996, P 34 ANN M ASS COMP, P62, DOI 10.3115/981863.981872 *NIST DARPA, 2003, NIST SPEC PUBL POTAMIANOS A, 2000, P ICSLP RAYNER M, 2003, P 42 ANN M ACL RUDNICKY A, 2000, P ICSLP, V2 SANSEGUNDO R, 2000, P IEEE ICASSP SCHOFIELD E, 2003, P 41 ANN M ASS COMP, P177 SENEFF S, 2000, P ANLP NAACL 2000 SA STENT A, 1999, P 37 ANN M ACL Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 25 TC 3 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1137 EP 1150 DI 10.1016/j.specom.2006.04.001 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100007 ER PT J AU Hirohata, M Shinnaka, Y Iwano, K Furui, S AF Hirohata, Makoto Shinnaka, Yosuke Iwano, Koji Furui, Sadaoki TI Sentence-extractive automatic speech summarization and evaluation techniques SO SPEECH COMMUNICATION LA English DT Article DE automatic speech summarization; sentence extraction; evaluation metrics; spontaneous presentations ID TEXT AB This paper presents sentence extraction-based automatic speech summarization techniques for making abstracts from spontaneous presentations. We propose a summarization technique using dimension reduction based on singular value decomposition which effectively focuses on the most salient topics of each presentation. With this technique, sentence location information, which is used for text summarization, is combined to extract important sentences from the introduction and conclusion segments of each presentation. We also investigate the combination of confidence measure and linguistic likelihood to effectively extract sentences with less recognition error. Experimental results show that the dimension-reduction-based method incorporating sentence location information, the confidence measure, and linguistic likelihood achieves the best automatic speech summarization performance in the condition of 10% summarization ratio. This paper also presents objective methods for evaluating automatic speech summarization methods. The correlation analysis between subjective and objective evaluation scores confirms that summarization accuracy, sentence F-measure, and 2 and 3-gram recall are the most effective among the objective evaluation metrics investigated in this paper. (c) 2006 Published by Elsevier B.V. C1 Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. RP Hirohata, M (reprint author), Toshiba Co Ltd, Ctr Corp Res & Dev, Multimedia Lab, Saiwai Ku, 1 Komukai Toshiba Cho, Kawasaki, Kanagawa 2128582, Japan. 
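As a concrete companion to the dimension-reduction method the Hirohata et al. abstract above describes, here is a minimal latent-topic sentence-scoring sketch: build a term-by-sentence matrix, take its SVD, and keep the sentences most aligned with the leading singular directions. The four-sentence "transcript", the two-topic truncation, and the 50% ratio are invented for the demo (the paper evaluates at a 10% summarization ratio).

```python
import numpy as np

sentences = [
    "speech summarization extracts important sentences",
    "singular value decomposition finds salient topics",
    "the weather was pleasant during the conference",
    "sentence extraction uses topic salience scores",
]
vocab = sorted({w for s in sentences for w in s.split()})
# term-by-sentence count matrix
A = np.array([[s.split().count(w) for s in sentences] for w in vocab], float)

U, S, Vt = np.linalg.svd(A, full_matrices=False)

k = 2                                    # latent topics kept after reduction
# score each sentence by its singular-value-weighted projection on the topics
scores = np.sqrt((S[:k, None] ** 2 * Vt[:k, :] ** 2).sum(axis=0))
n_keep = max(1, len(sentences) // 2)     # toy 50% ratio for illustration
for i in sorted(np.argsort(scores)[::-1][:n_keep]):
    print(sentences[i])
```

Location information, confidence measures, and linguistic likelihood, as the abstract explains, would then be combined with this topic score before extraction.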
EM makoto.hirohata@toshiba.co.jp; yosuke.shinnaka@fujixerox.co.jp; iwano@furui.cs.titech.ac.jp; furui@furui.cs.titech.ac.jp CR CHRISTENSEN H, 2003, P IEEE WORKSH AUT SP, P489 Furui S, 2004, IEEE T SPEECH AUDI P, V12, P401, DOI 10.1109/TSA.2004.828699 Gong Y., 2001, P 24 ANN INT ACM SIG, P19, DOI DOI 10.1145/383952.383955 Hearst MA, 1997, COMPUT LINGUIST, V23, P33 HIROHATA M, 2003, P 2003 AUT M AC SOC, V1, P93 HORI C, 2004, P ACL, P82 KITADE T, 2004, P 3 SPONT SPEECH SCI, P111 KOLLURU B, 2003, P IEEE WORKSH AUT SP, P495 LIN CY, 2004, P WORK NOT NTCIR 4, V2, P1 Mani I., 1999, ADV AUTOMATIC TEXT S Murray G., 2005, P ACL 2005 WORKSH IN, P33 Murray G., 2005, P INT 2005 LISB PORT, P593 OHTSUKI K, 1999, P IEEE INT C AC SPEE, V2, P625 Shinozaki T., 2004, P INT ICSLP JEJ KOR, P1705 Steinberger J, 2004, LECT NOTES COMPUT SC, V3261, P245 VALENZA R, 1999, P ESCA WORKSH ACC IN, P111 NR 16 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1151 EP 1161 DI 10.1016/j.specom.2006.04.005 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100008 ER PT J AU Ververidis, D Kotropoulos, C AF Ververidis, Dimitrios Kotropoulos, Constantine TI Emotional speech recognition: Resources, features, and methods SO SPEECH COMMUNICATION LA English DT Review DE emotions; emotional speech data collections; emotional speech classification; stress; interfaces; acoustic features ID HUMAN-COMPUTER INTERACTION; HIDDEN MARKOV-MODELS; STRESS; CLASSIFICATION; SYSTEM; WORD; CONNOTATIONS; EXPRESSION AB In this paper we overview emotional speech recognition having in mind three goals. The first goal is to provide an up-to-date record of the available emotional speech data collections. The number of emotional states, the language, the number of speakers, and the kind of speech are briefly addressed. The second goal is to present the most frequent acoustic features used for emotional speech recognition and to assess how the emotion affects them. Typical features are the pitch, the formants, the vocal tract cross-section areas, the mel-frequency cepstral coefficients, the Teager energy operator-based features, the intensity of the speech signal, and the speech rate. The third goal is to review appropriate techniques in order to classify speech into emotional states. We examine separately classification techniques that exploit timing information from those that ignore it. Classification techniques based on hidden Markov models, artificial neural networks, linear discriminant analysis, k-nearest neighbors, and support vector machines are reviewed. (c) 2006 Elsevier B.V. All rights reserved. C1 Aristotle Univ Thessaloniki, Artificial Intelligence & Informat Anal Lab, Dept Informat, Thessaloniki 54124, Greece. RP Kotropoulos, C (reprint author), Aristotle Univ Thessaloniki, Artificial Intelligence & Informat Anal Lab, Dept Informat, Univ Campus,Box 451, Thessaloniki 54124, Greece.
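As a drastically simplified companion to the feature-and-classifier survey in the Ververidis and Kotropoulos review above, this sketch extracts two of the listed cues, log energy and a coarse autocorrelation pitch estimate, and classifies synthetic "calm"/"angry" tokens with k-nearest neighbours. Every signal, constant, and class label is fabricated for the demo; real systems use far richer feature sets and real corpora.

```python
import numpy as np

SR = 16000

def features(signal, frame=400):
    """Two toy cues from the survey: mean log frame energy and a crude F0."""
    frames = signal[: len(signal) // frame * frame].reshape(-1, frame)
    log_energy = np.log((frames ** 2).mean(axis=1) + 1e-9).mean()
    x = signal - signal.mean()
    ac = np.correlate(x, x, "full")[len(x) - 1:]
    lo, hi = SR // 400, SR // 60           # pitch search range 60-400 Hz
    f0 = SR / (lo + ac[lo:hi].argmax())
    return np.array([log_energy, f0])

def make(amp, f0, n=3, seed=0):
    """Synthetic stand-ins for emotional speech tokens."""
    rng = np.random.default_rng(seed)
    t = np.arange(SR) / SR
    return [amp * np.sin(2 * np.pi * f0 * t) + rng.normal(0, 0.02, SR)
            for _ in range(n)]

X = np.array([features(s) for s in make(0.1, 120) + make(0.3, 250, seed=1)])
y = ["calm"] * 3 + ["angry"] * 3
sd = X.std(axis=0)                         # scale the two features comparably

def knn_predict(feat, k=3):
    d = np.linalg.norm((X - feat) / sd, axis=1)
    labels = [y[i] for i in np.argsort(d)[:k]]
    return max(set(labels), key=labels.count)

print(knn_predict(features(make(0.28, 240, n=1, seed=2)[0])))   # -> angry
```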
EM costas@aiia.csd.auth.gr RI Kotropoulos, Constantine/B-7928-2010 OI Kotropoulos, Constantine/0000-0001-9939-7930 CR ABELIN A, 2000, P ISCA WORKSH SPEECH, V1, P110 AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705 Alpert M, 2001, J AFFECT DISORDERS, V66, P59, DOI 10.1016/S0165-0327(00)00335-9 Alter K., 2000, P ISCA WORKSH SPEECH, V1, P138 Ambrus DC, 2000, COLLECTING RECORDING AMIR N, 2000, P ISCA WORKSH SPEECH, V1, P29 Ang J., 2002, P INT C SPOK LANG PR, V3, P2037 Anscombe E., 1970, DESCARTES PHILOS WRI Atal B. S., 1967, P INT C SPEECH COMM, P360 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Banziger T, 2005, SPEECH COMMUN, V46, P252, DOI 10.1016/j.specom.2005.02.016 BATLINER A, 2004, P LANG RES EV LREC 0 Bou-Ghazale SE, 1998, IEEE T SPEECH AUDI P, V6, P201, DOI 10.1109/89.668815 Buck R, 1999, PSYCHOL REV, V106, P301, DOI 10.1037/0033-295X.106.2.301 BULUT M, 2002, P INT C SPOK LANG PR, V2, P1265 BURKHARDT F, 2000, P ISCA WORKSH SPEECH, V1, P151 CAIRNS DA, 1994, J ACOUST SOC AM, V96, P3392, DOI 10.1121/1.410601 Caldognetto EM, 2004, SPEECH COMMUN, V44, P173, DOI 10.1016/j.specom.2004.10.012 CHUANG ZJ, 2002, P INT C SPOK LANG PR, V3, P2033 Clavel C., 2004, P INT C SPOK LANG PR, P2277 COLE R, 2005, CU KIDS SPEECH CORPU COWIE R, 1996, P 4 INT C SPOK LANG, V3, P1989, DOI 10.1109/ICSLP.1996.608027 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DELLAERT F, 1996, P 4 INT C SPOK LANG, V3, P1970, DOI 10.1109/ICSLP.1996.608022 Deller J., 2000, DISCRETE TIME PROCES DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P33, DOI 10.1016/S0167-6393(02)00070-5 ECKEMAN P, 1992, COGNITION EMOTION, V6, P169 EDGINGTON M, 1997, P EUR C SPEECH COMM, V1, P593 Efron B., 1993, INTRO BOOTSTRAP Engberg I. S., 1996, DOCUMENTATION DANISH Fernandez R, 2003, SPEECH COMMUN, V40, P145, DOI 10.1016/S0167-6393(02)00080-8 FISCHER K, 1999, 236 U HAMB Flanagan J., 1972, SPEECH ANAL SYNTHESI France DJ, 2000, IEEE T BIO-MED ENG, V47, P829, DOI 10.1109/10.846676 Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd GONZALEZ GM, 1999, 39 U MICH HANSEN JHL, 1995, SPEECH COMMUN, V16, P391, DOI 10.1016/0167-6393(95)00007-B HANSEN JHL, 1996, IST03 NATO Hanson HM, 1994, IEEE T SPEECH AUDI P, V2, P436, DOI 10.1109/89.294358 Haykin S., 1998, NEURAL NETWORKS COMP, V2nd Heijden F. v. 
d., 2004, CLASSIFICATION PARAM Hess W.J., 1992, ADV SPEECH SIGNAL PR HEUFT B, 1996, P ICSLP 96, V3, P1974, DOI 10.1109/ICSLP.1996.608023 Iida A, 2003, SPEECH COMMUN, V40, P161, DOI 10.1016/S0167-6393(02)00081-X IIDA A, 2000, P ISCA WORKSH SPEECH, V1, P167 IRIONDO I, 2000, P ISCA WORKSH SPEECH, V1, P161 JIANG DN, 2004, P INT C MULT EXP ICM KADAMBE S, 1992, IEEE T INFORM THEORY, V38, P917, DOI 10.1109/18.119752 KAWANAMI H, 2003, P EUR C SPEECH COMM, V4, P2401 KWON OW, 2003, P EUR C SPEECH COMM, V1, P125 Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Leinonen L, 1997, J ACOUST SOC AM, V102, P1853, DOI 10.1121/1.420109 LIBERMAN M, 2005, LING DAT CONS Linnankoski I, 2005, SPEECH COMMUN, V45, P27, DOI 10.1016/j.specom.2004.09.007 Lloyd AJ, 1999, CORTEX, V35, P389, DOI 10.1016/S0010-9452(08)70807-4 MAKAROVA V, 2002, P INT C SPOK LANG PR, V1, P2041 MALLAT SG, 1989, RRT483RR219 COUR I M Markel JD, 1976, LINEAR PREDICTION SP MARTINS C, 1998, P 10 PORT C PATT REC McGilloway S., 2000, P ISCA WORKSH SPEECH, V1, P207 MCMAHON E, 2003, TALES DISAPPEARING C MERMELSTEIN P, 1975, J ACOUST SOC AM, V58, P880, DOI 10.1121/1.380738 MONTANARI S, 2004, P INT C SPOK LANG PR, V1, P1841 Montero J. M., 1999, P 14 INT C PHON SCI, V2, P957 MORGANTI R, 1995, PUBL ASTRON SOC AUST, V12, P3 Mozziconacci S. J. L., 2000, P INT C SPOK LANG PR, V2, P373 MOZZICONACCI SJL, 1997, 32 IPO MRAYATI M, 1988, SPEECH COMMUN, V7, P257, DOI 10.1016/0167-6393(88)90073-8 MURRAY I, 1996, P 4 INT C SPOK LANG, V3, P1816, DOI 10.1109/ICSLP.1996.607983 Nakatsu R., 1999, P INT C MULT COMP SY, V2, P804, DOI 10.1109/MMCS.1999.778589 NIIMI Y, 2001, P ISCA TUT WORKSH RE NOGUEIRAS A, 2001, P EUR C SPEECH COMM Nordstrand M, 2004, SPEECH COMMUN, V44, P187, DOI 10.1016/j.specom.2004.09.003 Nwe TL, 2003, SPEECH COMMUN, V41, P603, DOI 10.1016/S01167-6393(03)00099-2 Pantic M, 2003, P IEEE, V91, P1370, DOI 10.1109/JPROC.2003.817122 PELLOM BL, 1996, P INT C AC SPEECH SI, V2, P645 PEREIRA C, 2000, P ISCA WORKSH SPEECH, V1, P25 Petrushin V.A., 1999, P ART NEUR NETW ENG, V1, P7 Picard RW, 2001, IEEE T PATTERN ANAL, V23, P1175, DOI 10.1109/34.954607 POLLERMAN BZ, 2002, IMPROVEMENTS SPEECH POLZIN T, 2000, P ISCA WORKSH SPEECH, V1, P201 POLZIN TS, 1998, P COOP MULT COMM CMC Quatieri T. 
F., 2002, DISCRETE TIME SPEECH Rabiner L., 1993, FUNDAMENTALS SPEECH Rahurkar M., 2002, P INT C SPOK LANG PR, V3, P2021 SCHERER KR, 2000, P ISCA WORKSH SPEECH SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SCHERER KR, 2002, P INT C SPOK LANG PR, V3, P2017 SCHERER KR, 2000, P INT C SPOK LANG PR, V1, P379 SCHIEL F, 2002, P LANG RES EV LREC 0 SCHRODER M, 2000, P ISCA WORKSH SPEECH, V1, P132 SCHRODER M, 2003, P INT C PHON SCI ICP SCHRODER M, 2005, HUMAINE CONSORTIUM R SCHULLER B, 2004, P INT C AC SPEECH SI, V1, P557 Shawe-Taylor J., 2004, KERNEL METHODS PATTE SHI RP, 2003, P ISCA TUT RES WORKS, V1, P151 Silverman K., 1992, P INT C SPOK LANG PR, V2, P867 Slaney M, 2003, SPEECH COMMUN, V39, P367, DOI 10.1016/S0167-6393(02)00049-3 SONDHI MM, 1968, IEEE T ACOUST SPEECH, VAU16, P262, DOI 10.1109/TAU.1968.1161986 Steeneken H.J.M., 1999, P ICASSP, V4, P2079 STIBBARD R, 2000, P ISCA WORKSH SPEECH, V1, P60 Tato R., 2002, P INT C SPOK LANG PR, V3, P2029 TEAGER HM, 1990, EVIDENCE NONLINEAR S, V15 TOLKMITT FJ, 1986, J EXP PSYCHOL HUMAN, V12, P302, DOI 10.1037//0096-1523.12.3.302 van Bezooijen R., 1984, CHARACTERISTICS RECO Ververidis D., 2004, P EUR SIGN PROC C, V1, P341 VERVERIDIS D, 2005, P INT C MULT EXP ICM Ververidis D., 2004, P IEEE INT C AC SPEE, V1, P593 WAGNER J, 2005, P INT C MULT EXP ICM Wendt B., 2002, P INT C SPEECH PROS, P699 Womack BD, 1999, IEEE T SPEECH AUDI P, V7, P668, DOI 10.1109/89.799692 Womack BD, 1996, SPEECH COMMUN, V20, P131, DOI 10.1016/S0167-6393(96)00049-0 YILDIRIM S, 2004, P INT C SPOK LANG PR, V1, P2193 Yu F., 2001, P IEEE PAC RIM C MUL, V1, P550 YUAN J, 2002, P INT C SPOK LANG PR, V3, P2025 Zhou GJ, 2001, IEEE T SPEECH AUDI P, V9, P201, DOI 10.1109/89.905995 NR 120 TC 165 Z9 170 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1162 EP 1181 DI 10.1016/j.specom.2006.04.003 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100009 ER PT J AU Tyagi, V Bourlard, H Wellekens, C AF Tyagi, Vivek Bourlard, Herve Wellekens, Christian TI On variable-scale piecewise stationary spectral analysis of speech signals for ASR SO SPEECH COMMUNICATION LA English DT Article DE variable-scale quasi-stationary analysis; speech spectral analysis AB It is often acknowledged that speech signals contain short-term and long-term temporal properties [Rabiner, L., Juang, B. H., 1993. Fundamentals of Speech Recognition, Prentice-Hall, NJ, USA] that are difficult to capture and model by using the usual fixed scale (typically 20 ms) short-time spectral analysis used in hidden Markov models (HMMs), based on piece-wise stationarity and state conditional independence assumptions of acoustic vectors. For example, vowels are typically quasi-stationary over 40-80 ms segments, while plosives typically require analysis below 20 ms segments. Thus, fixed scale analysis is clearly sub-optimal for "optimal" time-frequency resolution and modeling of different stationary phones found in the speech signal. In the present paper, we investigate the potential advantages of using variable size analysis windows towards improving state-of-the-art speech recognition systems.
Based on the usual assumption that the speech signal can be modeled by a time-varying autoregressive (AR) Gaussian process, we estimate the largest piecewise quasi-stationary speech segments, based on the likelihood that a segment was generated by the same AR process. This likelihood is estimated from the linear prediction (LP) residual error. Each of these quasi-stationary segments is then used as an analysis window from which spectral features are extracted. Such an approach thus results in a variable-scale time spectral analysis, adaptively estimating the largest possible analysis window size such that the signal remains quasi-stationary, thus achieving the best temporal/frequency resolution tradeoff. The speech recognition experiments on the OGI Numbers95 database [Cole, R.A., Fanty, M., Lander, T., 1994. Telephone speech corpus at CSLU. In: Proc. of ICSLP, Yokohama, Japan.] show that the proposed variable-scale piecewise stationary spectral analysis based features indeed yield improved recognition accuracy in clean conditions, compared to features based on minimum cross entropy spectrum [Loughlin, P., Pitton, J., Hannaford, B., 1994. Approximating time-frequency density functions via optimal combinations of spectrograms, IEEE Signal Process. Lett. 1 (12)] as well as those based on fixed scale spectral analysis. (c) 2006 Elsevier B.V. All rights reserved. C1 Inst Eurecom, Dept Multimedia Commun, F-06904 Sophia Antipolis, France. IDIAP Res Inst, Martigny, Switzerland. Swiss Fed Inst Technol, CH-1015 Lausanne, Switzerland. RP Tyagi, V (reprint author), Inst Eurecom, Dept Multimedia Commun, 2229,Route Cretes,POB 193, F-06904 Sophia Antipolis, France. EM tyagi@eurecom.fr; bourlard@idiap.ch; welleken@eurecom.fr CR ACHAN K, 2004, 2004001 UTML U TOR D AJMERA J, 2004, IEEE SIGNAL PROCESS, V11 ATAL BS, 1983, P IEEE ICASSP BOST U Brandt A. V., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing Coifman R. R., 1992, IEEE T INFORM THEORY, V38 COLE RA, 1994, P ICSLP YOK JAP DAVIS SB, 1980, IEEE T ASSP, V28 Haykin S., 1993, ADAPTIVE FILTER THEO Hermansky H., 1990, J ACOUST SOC AM, V87 ITAKURA F, 1975, IEEE T ASSP, V23 Kay S. M., 1998, FUNDAMENTALS STAT SI LOUGLIN P, 1994, IEEE SIGNAL PROCESS, V1 Makhoul J., 1975, P IEEE, V63 OBRECHT RA, 1988, IEEE T ASSP, V36 Rabiner L, 1993, FUNDAMENTALS SPEECH SRINIVASAN S, 2004, P ICSLP 2004 JEJ S K SVENDSEN T, 1989, P IEEE ICASSP TYAGI V, 2003, P IEEE ASRU THOM VIR Young S., 1995, HTK BOOK NR 19 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1182 EP 1191 DI 10.1016/j.specom.2006.04.002 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100010 ER PT J AU Frankel, J King, S AF Frankel, Joe King, Simon TI Observation process adaptation for linear dynamic models SO SPEECH COMMUNICATION LA English DT Article DE linear dynamic model; acoustic model adaptation; ASR; MLLR; GEM; HMM ID SPEECH RECOGNITION; COVARIANCE MATRICES; LIKELIHOOD AB This work introduces two methods for adapting the observation process parameters of linear dynamic models (LDM) or other linear-Gaussian models.
The first method uses the expectation-maximization (EM) algorithm to estimate transforms for location and covariance parameters, and the second uses a generalized EM (GEM) approach which reduces computation in making updates from O(p) to O(p), where p is the feature dimension. We present the results of speaker adaptation on TIMIT phone classification and recognition experiments with relative error reductions of up to 6%. Importantly, we find minimal differences in the results from EM and GEM. We therefore propose that the GEM approach be applied to adaptation of hidden Markov models which use non-diagonal covariances. We provide the necessary update equations. (c) 2006 Elsevier B.V. All rights reserved. C1 Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH8 9LW, Midlothian, Scotland. RP Frankel, J (reprint author), Univ Edinburgh, Ctr Speech Technol Res, 2 Buccleuch Pl, Edinburgh EH8 9LW, Midlothian, Scotland. EM joe@cstr.ed.ac.uk CR BILMES J, 2000, P ICASSP Bilmes J., 1997, ICSITR97021 U BERK DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DIGALAKIS V, 1992, THESIS BOSTON U GRAD FRANKEL J, 2003, THESIS EDINBURGH U FRANKEL J, IN PRESS IEEE T SPEE GALES MJF, 1997, CUEDFINFENGTR298 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 Gunawardana A, 2001, COMPUT SPEECH LANG, V15, P15, DOI 10.1006/csla.2000.0151 Lamel L. F., 1986, P DARPA SPEECH REC W, P100 Ma J, 2004, COMPUT SPEECH LANG, V18, P49, DOI 10.1016/S0885-2308(03)00031-7 Ma JZ, 2004, IEEE T SPEECH AUDI P, V12, P47, DOI 10.1109/TSA.2003.818074 Olsen PA, 2004, IEEE T SPEECH AUDI P, V12, P37, DOI 10.1109/TSA.2003.819943 RAUCH HE, 1963, IEEE T AUTOMAT CONTR, VAC 8, P371, DOI 10.1109/TAC.1963.1105600 Rosti A. I., 2001, CUEDFINFENGTR420 ROSTI AV, 2004, THESIS U CAMBRIDGE Roweis S., 1999, NEURAL COMPUTATION, V11 Sun JP, 2002, J ACOUST SOC AM, V111, P1086, DOI 10.1121/1.1420380 NR 19 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1192 EP 1199 DI 10.1016/j.specom.2006.05.001 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100011 ER PT J AU BenZeghiba, MF Bourlard, H AF BenZeghiba, Mohamed Faouzi Bourlard, Herve TI User-customized password speaker verification using multiple reference and background models SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT IEEE International Conference on Acoustics, Speech, and Signal Processing CY MAY 17-21, 2004 CL Montreal, CANADA SP IEEE Signal Proc Soc, IEEE DE speaker verification; User-Customized Password Speaker Verification; hybrid HMM/MLP; HMM inference; speaker adaptation; multiple reference models; background model; equal error rate AB This paper discusses and optimizes an HMM/GMM based User-Customized Password Speaker Verification (UCP-SV) system. Unlike text-dependent speaker verification, in UCP-SV systems, customers can choose their own passwords with no lexical constraints. The password has to be pronounced a few times during the enrollment step to create a customer dependent model. Although potentially more "user-friendly", such systems are less understood and actually exhibit several practical issues, including automatic HMM inference, speaker adaptation, and efficient likelihood normalization.
In our case, HMM inference (HMM topology) is performed using hybrid HMM/MLP systems, while the parameters of the inferred model, as well as their adaptation, use GMMs. However, the evaluation of a UCP-SV baseline system shows that the background model used for likelihood normalization is the main difficulty. Therefore, to circumvent this problem, the main contribution of the paper is to investigate the use of multiple reference models for customer acoustic modeling and multiple background models for likelihood normalization. In this framework, several scoring techniques are investigated, such as Dynamic Model Selection (DMS) and fusion techniques. Results on two different experimental protocols show that an appropriate selection criterion for customer and background models can significantly improve the UCP-SV performance, making the UCP-SV system quite competitive with a text-dependent SV system. Finally, as customers' passwords are short, a comparative experiment using the conventional GMM-UBM text-independent approach is also conducted. (c) 2006 Elsevier B.V. All rights reserved. C1 Eurecom Inst, Dept Multimedia Commun, F-06904 Sophia Antipolis, France. IDIAP Res Inst, CH-1920 Martigny, Switzerland. Ecole Polytech Fed Lausanne, Swiss Fed Inst Technol, CH-1015 Lausanne, Switzerland. RP BenZeghiba, MF (reprint author), Eurecom Inst, Dept Multimedia Commun, 2229 Route Cretes,BP 193, F-06904 Sophia Antipolis, France. EM mohamed.benzeghiba@eurecom.fr CR BENZEGHIBA M, 2001, 13 IDIAPRR BENZEGHIBA MF, 2004, INT C SPEECH SIGN PR, V1, P389 BOURLARD H, 1985, SPEECH SPEAKER RECOG, V12, P115 Bourlard Ha, 1994, CONNECTIONIST SPEECH CHOLLET G, 1996, 01 IDIAPRR COLLOBERT R, 2002, 46 IDIAPRR DEVETH J, 1995, SPEECH COMMUN, V17, P81, DOI 10.1016/0167-6393(95)00015-G Furui S., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 HEBERT M, 2001, EUR C SPEECH COMM TE, P2557 Li Q, 2000, IEEE T SPEECH AUDI P, V8, P585 Neto Joao, 1995, EUROSPEECH 95, P2171 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Rodriguez-Linares Leandro, 2003, Pattern Recognition, V36, P347 Rosenberg AE, 1996, INT CONF ACOUST SPEE, P81, DOI 10.1109/ICASSP.1996.540295 ROSENBERG AE, 1997, EUR C SPEECH COMM TE, P1371 SIOHAN O, 1999, INT C SPEECH SIGN PR, V1, P825 VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 NR 19 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1200 EP 1213 DI 10.1016/j.specom.2005.08.008 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100012 ER PT J AU Yu, D Deng, L Acero, A AF Yu, Dong Deng, Li Acero, Alex TI A lattice search technique for a long-contextual-span hidden trajectory model of speech SO SPEECH COMMUNICATION LA English DT Article DE A* search over recognition lattices; decoder; phonetic recognition; vocal tract resonances; speech dynamics; hidden trajectories; contextual assimilation; filtering of targets; TIMIT; long-span context dependence; lattice rescoring; pruning; speech recognition ID RECOGNITION; HMMS AB We have recently developed a long-contextual-span hidden trajectory model (HTM) which captures underlying dynamic structure of speech coarticulation and reduction.
Due to the long-span nature of the HTM and the complexity of its likelihood score computation, N-best list rescoring was the principal paradigm for evaluating the HTM for phonetic recognition in our earlier work. In this paper, we describe improved likelihood score computation in the HTM and a novel A*-based time-asynchronous lattice-constrained decoding algorithm for the HTM evaluation. We focus on several special considerations in the decoder design, which are necessitated by the dependency of the HTM score at each given frame on the model parameters associated with a variable number of adjacent past and future phones. We present details on how the nodes and links in the lattices are expanded via a look-ahead mechanism, on how the A* heuristics are estimated, and on how pruning strategies are applied to speed up the search process. The experiments on the standard TIMIT phonetic recognition task show improvement of recognition accuracy by the new search algorithm on recognition lattices over the traditional N-best rescoring paradigm. (c) 2006 Elsevier B.V. All rights reserved. C1 Microsoft Res, Redmond, WA 98052 USA. RP Yu, D (reprint author), Microsoft Res, 1 Microsoft Way, Redmond, WA 98052 USA. EM dongyu@microsoft.com; deng@microsoft.com; alexac@microsoft.com CR Aubert XL, 2002, COMPUT SPEECH LANG, V16, P89, DOI 10.1006/csla.2001.0185 BAKIS R, 1991, P IEEE WORKSH AUT SP, P20 BILMES J, 2004, MATH FDN SPEECH LANG, P135 Bridle J., 1998, 1998 WORKSH LANG ENG, P1 Deng L, 2006, IEEE T AUDIO SPEECH, V14, P256, DOI 10.1109/TSA.2005.854107 Deng L, 2006, IEEE T AUDIO SPEECH, V14, P425, DOI 10.1109/TSA.2005.855841 DENG L, 2004, ICSLP 2004 JEJ KOR DENG L, 2005, P ICASSP, P337 DENG L, 1994, J ACOUST SOC AM, V96, P2008, DOI 10.1121/1.410144 Deng L, 1998, SPEECH COMMUN, V24, P299, DOI 10.1016/S0167-6393(98)00023-5 Deng L., 2005, P IEEE WORKSH ASRU Deng L, 2003, SPEECH PROCESSING DY Deng L, 2004, IMA V MATH, V138, P115 GAO Y, 2000, P ICSLP, V1, P25 Glass JR, 2003, COMPUT SPEECH LANG, V17, P137, DOI 10.1016/S0885-2308(03)00006-8 GOEL V, 1999, P EUROSPEECH 99 Holmes WJ, 1999, COMPUT SPEECH LANG, V13, P3, DOI 10.1006/csla.1998.0048 Ma JZ, 2003, IEEE T SPEECH AUDI P, V11, P590, DOI 10.1109/TSA.2003.818075 OPPENHEI.AV, 1972, PR INST ELECTR ELECT, V60, P681, DOI 10.1109/PROC.1972.8727 Oppenheim A. V., 1999, DISCRETE TIME SIGNAL Ortmanns S, 2000, COMPUT SPEECH LANG, V14, P15, DOI 10.1006/csla.1999.0131 Ostendorf M, 1996, IEEE T SPEECH AUDI P, V4, P360, DOI 10.1109/89.536930 RICHARDS H, 1999, P ICASSP, V1, P357 Rose RC, 1996, J ACOUST SOC AM, V99, P1699, DOI 10.1121/1.414679 Russell S, 1995, ARTIFICIAL INTELLIGE Sun JP, 2002, J ACOUST SOC AM, V111, P1086, DOI 10.1121/1.1420380 Yu D., 2005, P INT, P553 ZHOU J, 2003, IEEE P ICASSP APR 20, V1, P744 NR 28 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2006 VL 48 IS 9 BP 1214 EP 1226 DI 10.1016/j.specom.2006.05.002 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 079NZ UT WOS:000240185100013 ER PT J AU D'Haro, LF de Cordoba, R Ferreiros, J Hamerich, SW Schless, V Kladis, B Schubert, V Kocsis, O Igel, S Pardo, JM AF D'Haro, Luis Fernando de Cordoba, Ricardo Ferreiros, Javier Hamerich, Stefan W. Schless, Volker Kladis, Basilis Schubert, Volker Kocsis, Otilia Igel, Stefan Pardo, Jose M. 
TI An advanced platform to speed up the design of multilingual dialog applications for multiple modalities SO SPEECH COMMUNICATION LA English DT Article DE automatic dialog systems generation; dialog management tools; multiple modalities; multilinguality; XML; VoiceXML ID USER-INTERFACE; INFORMATION; SYSTEMS AB In this paper, we present a complete platform for the semiautomatic and simultaneous generation of human-machine dialog applications in two different and separate modalities (Voice and Web) and several languages to provide services oriented to obtaining or modifying the information from a database (data-centered). Given that one of the main objectives of the platform is to unify the application design process regardless of its modality or language and then to complete it with the specific details of each one, the design process begins with a general description of the application, the data model, the database access functions, and a generic finite state diagram consisting of the application flow. With this information, the actions to be carried out in each state of the dialog are defined. Then, the specific characteristics of each modality and language (grammars, prompts, presentation aspects, user levels, etc.) are specified in later assistants. Finally, the scripts that execute the application in the real-time system are automatically generated. We describe each assistant in detail, emphasizing the methodologies followed to ease the design process, especially in its critical aspects. We also describe different strategies and characteristics that we have applied to provide portability, robustness, adaptability and high performance to the platform. We also address important issues in dialog applications such as mixed initiative and over-answering, confirmation handling or providing long lists of information to the user. Finally, the results obtained in a subjective evaluation with different designers and in the creation of two full applications that confirm the usability, flexibility and standardization of the platform, and provide new research directions. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Politecn Madrid, Grp Tecnol Habla, E-28040 Madrid, Spain. Harman Becker Automot Syst, Ulm, Germany. Knowledge SA, LogicDIS Grp, Patras, Greece. Forschungsinst Anwendungsorientierte, Wissensverarbeitung, Ulm, Germany. RP de Cordoba, R (reprint author), Univ Politecn Madrid, Grp Tecnol Habla, Ciudad Univ S-N, E-28040 Madrid, Spain. EM lfdharo@die.upm.es; cordoba@die.upm.es; jfl@dic.upm.es; shamerich@harmanbecker.com; vschless@harmanbecker.com; bkladis@knowledge-speech.gr; vschubert@harmanbecker.com; okocsis@knowledge-speech.gr; sigel@faw.uni-ulm.de; pardo@die.upm.es RI Pardo, Jose/H-3745-2013; Cordoba, Ricardo/B-5861-2008 OI Cordoba, Ricardo/0000-0002-7136-9636 CR Allen J. E., 1999, IEEE Intelligent Systems, V14, DOI 10.1109/5254.796083 Allen JF, 2001, AI MAG, V22, P27 ALMEIDA L, 2002, P INT CLASS WORKSH N, P1 BENNETT C, 2002, P INT C SPOK LANG PR, P2245 Bohus D., 2003, P 8 EUR C SPEECH COM, P597 Burnett D. C., 2002, SPEECH SYNTHESIS MAR COLE R, 1999, P INT C PHON SC SAN, P1277 CORDOBA R, 2001, P 7 EUR C SPEECH COM, V2, P1279 CORDOBA R, 2004, P INT C SPOK LANG PR, pI257 D'Haro L. 
F., 2004, P INT C SPOK LANG PR, pIV DENECKE M, 2002, P 19 INT C COMP LING EBERMAN B, 2002, P 11 INT C WORLD WID, P713 EHRLICH U, 1997, P EUROSPEECH 97, P1819 Flippo F, 2003, P 5 INT C MULT INT, P109 GLASS J, 2001, P 7 EUR C SPEECH COM, P1335 Gustafson J., 2000, P INT C SPOK LANG PR, P134 GUSTAFSON J, 1998, P INT C SPOK LANG PR, P33 Hamerich S. W., 2004, P INT C SPOK LANG PR, pIV Hamerich S. W., 2004, P WORKSH DISC DIAL S, P31 HAMERICH SW, 2003, P BERL XML TAG, P404 HUNT A, 2000, JSPEECH GRAMMAR FORM Johnston M., 2002, P 40 ANN M ASS COMP, P376 KATSURADA K, 2002, P ICSLP 02 DENV US S, P2549 KLEMMER SR, 2000, CHI LETT, V2, P1 KOMATANI K, 2003, P EUR C SPEECH COMM, P745 Lamel L, 2000, SPEECH COMMUN, V31, P339, DOI 10.1016/S0167-6393(99)00067-9 LEHTINEN G, 2000, P COST249 ISCA WORKS, P51 Levine A, 2000, ISRAEL MED ASSOC J, V2, P122 McGlashan S., 2004, VOICE EXTENSIBLE MAR McTear M., 1999, P EUR, P339 McTear M., 1998, P INT C SPOK LANG PR, P1223 McTear MF, 2002, ACM COMPUT SURV, V34, P90, DOI 10.1145/505282.505285 Meng HM, 2002, INTERACT COMPUT, V14, P327, DOI 10.1016/S0953-5438(02)00006-1 NIGAY L, 1993, P INTERCHI 93, P172, DOI 10.1145/169059.169143 Oviatt S, 2000, HUM-COMPUT INTERACT, V15, P263, DOI 10.1207/S15327051HCI1504_1 Pargellis AN, 2004, SPEECH COMMUN, V42, P329, DOI 10.1016/j.specom.2003.10.003 Polifroni J., 2003, P EUR 03 GEN, P193 Polifroni J., 2000, P 2 INT C LANG RES E, P725 Rudnicky A.:, 1999, IEEE AUT SPEECH REC, P337 SANSEGUNDO R, 2001, P EUR C SPEECH COMM, P2165 SCHUBERT V, 2005, P EUR C SPEECH COMM, P789 Seneff S., 2000, P ANLP NAACL 2000 SA, P1 Strik H., 1997, International Journal of Speech Technology, V2, DOI 10.1007/BF02208824 TOTH A, 2002, P 7 INT C SPOK LANG, P1497 TURUNEN M, 2004, P WORKSH ROB AD INF Uebler U, 2001, SPEECH COMMUN, V35, P53, DOI 10.1016/S0167-6393(00)00095-9 WAHLSTER W, 2001, P EUR C SPEECH COMM, P1547 Wang K., 2002, P ICSLP 02 DENV US, P2241 WANG K, 2000, P INT C SPOK LANG PR, V2, P138 WANG YH, 2003, P ISCA WORKSH ERR HA, P139 Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 51 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2006 VL 48 IS 8 BP 863 EP 887 DI 10.1016/j.specom.2005.11.001 PG 25 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300001 ER PT J AU Srinivasamurthy, N Ortega, A Narayanan, S AF Srinivasamurthy, Naveen Ortega, Antonio Narayanan, Shrikanth TI Efficient scalable encoding for distributed speech recognition SO SPEECH COMMUNICATION LA English DT Article DE distributed speech recognition; scalable encoding; multi-pass recognition; joint coding-classification ID QUANTIZATION; SYSTEM AB The problem of encoding speech features in the context of a distributed speech recognition system is addressed. Specifically, speech features are compressed using scalable encoding techniques to provide a multi-resolution bitstream. The use of this scalable encoding procedure is investigated in conjunction with a multi-pass distributed speech recognition (DSR) system. The multi-pass DSR system aims at progressive refinement in terms of recognition performance, (i.e., as additional bits are transmitted the recognition can be refined to improve the performance) and is shown to provide both bandwidth and complexity (latency) reductions. 
The proposed encoding schemes are well suited for implementation on light-weight mobile devices where varying ambient conditions and limited computational capabilities pose a severe constraint in achieving good recognition performance. The multi-pass DSR system is capable of adapting to varying network and system constraints by operating at an appropriate trade-off point between transmission rate, recognition performance and complexity to provide desired quality of service (QoS) to the user. The system was tested using two case studies. In the first, a distributed two-stage names recognition task, the scalable encoder operating at a bitrate of 4.6 kb/s achieved the same performance as that achieved using uncompressed features. In the second study, a two-stage multi-pass continuous speech recognition task using HUB-4 data, the scalable encoder at a bitrate of 5.7 kb/s achieved the same performance as that achieved with uncompressed features. Reducing the bitrate to 4800 b/s resulted in a 1% relative increase in WER. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ So Calif, Dept Elect Engn Syst, Inst Signal & Image Proc, Integrated Media Syst Ctr, Los Angeles, CA 90089 USA. Qualcomm Inc, Stand Engn, San Diego, CA 92121 USA. RP Narayanan, S (reprint author), Univ So Calif, Dept Elect Engn Syst, Inst Signal & Image Proc, Integrated Media Syst Ctr, Los Angeles, CA 90089 USA. EM naveens@qualcomm.com; ortega@sipi.usc.edu; shri@sipi.usc.edu RI Ortega, Antonio/B-6252-2009; Narayanan, Shrikanth/D-5676-2012 OI Ortega, Antonio/0000-0001-5403-0940; CR ABE Y, 1999, EUR 99 BUD SEPT BEHET F, 2001, IEEE INT C AUT SPEEC, P222 BERNARD A, 2001, ICASSP 2001, V4 CHAZAN D, 2000, IEEE ICASSP 2000 CHRYSAFIS C, 1997, DCC DATA COMPR C SNO COLETTI P, 1999, EUR 99 BUD SEPT Digalakis VV, 1999, IEEE J SEL AREA COMM, V17, P82, DOI 10.1109/49.743698 EQUITZ WHR, 1991, IEEE T INFORM THEORY, V37, P269, DOI 10.1109/18.75242 GALES M, 1996, IEEE T ACOUST SPEECH, P352 GAO Y, 2001, ICASSP 2001, V1, P53 Gray RM, 1998, IEEE T INFORM THEORY, V44, P2325, DOI 10.1109/18.720541 Huerta J. M., 2000, THESIS CARNEGIE MELL Junqua JC, 1997, IEEE T SPEECH AUDI P, V5, P173, DOI 10.1109/89.554779 KIM HK, 2000, ICASSP 2000, V3, P1607 KISS I, 2000, INT C SPOK LANG PROC KISS I, 1999, EUR 1999 Lilly BT, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2344 MAYORGA P, 2002, EUS 2002 TOUL FRANC MILNER B, 2000, ICASSP 2000 IST TURK Murveit H., 1993, ICASSP 93 MINN MINN, V2, P319, DOI 10.1109/ICASSP.1993.319301 RAMABADRAN T, 2001, EUR 2001 AALB DENM S Ramaswamy GN, 1998, INT CONF ACOUST SPEE, P977, DOI 10.1109/ICASSP.1998.675430 RISKIN E, 2001, EUR 2001 AALB DENM S Rose K, 2001, IEEE T IMAGE PROCESS, V10, P965, DOI 10.1109/83.931091 SETHY A, 2002, ISCA PRON MOD LEX AD SINGH R, 1999, IEEE WORKSH MULT SIG SRINIVASAMURTHY N, 2001, ISCA ITR WORKSH AD M SRINIVASAMURTHY N, 2004, P ICASSP 2004 MONTR SRINIVASAMURTHY N, 2003, P EUR 2003 GEN SWITZ SRINIVASAMURTHY N, 2000, ICME 2000 NEW YORK N SRINIVASAMURTHY N, 2001, EUR 2001 AALB DENM S *STQ, 2000, 201108 ES ETSI STQ TURUNEN J, 2001, EUR 2001 AALB DENM S Weerackody V, 2002, IEEE T WIREL COMMUN, V1, P282, DOI 10.1109/7693.994822 Weinberger G. S. M., 2000, IEEE T IMAGE PROCESS, V9, P1309 WOODLAND P, 2001, LVCSR HUB5 WORKSH 20 Young S., 2000, HTK BOOK HTK VERSION ZHU Q, 2001, ICASSP 2001, V1 NR 38 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun.
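The multi-resolution bitstream at the heart of the Srinivasamurthy et al. record above can be caricatured as a coarse base layer plus an enhancement layer that a server decodes progressively. The uniform quantiser and step sizes below are invented; the paper's actual coder and bit allocations differ.

```python
import numpy as np

def encode(features, base_step=0.5, enh_step=0.125):
    """Split features into a coarse base layer and a refinement layer."""
    base = np.round(features / base_step).astype(int)       # base layer
    residual = features - base * base_step
    enh = np.round(residual / enh_step).astype(int)         # enhancement
    return base, enh

def decode(base, enh=None, base_step=0.5, enh_step=0.125):
    """Decode at base resolution, refining if the enhancement layer arrived."""
    x = base * base_step
    if enh is not None:
        x = x + enh * enh_step
    return x

rng = np.random.default_rng(0)
feats = rng.normal(0, 1, (4, 13))       # stand-in cepstral feature frames
base, enh = encode(feats)
err_base = np.abs(feats - decode(base)).max()
err_full = np.abs(feats - decode(base, enh)).max()
print(f"base-layer max error {err_base:.3f}, refined max error {err_full:.3f}")
```

A multi-pass recogniser would run a cheap first pass on the base layer and request the enhancement bits only when the first-pass result is uncertain, which is where the bandwidth and latency savings come from.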
PD AUG PY 2006 VL 48 IS 8 BP 888 EP 902 DI 10.1016/j.specom.2005.11.003 PG 15 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300002 ER PT J AU Salmani-Nodoushan, MA AF Salmani-Nodoushan, Mohammad Ali TI A comparative sociopragmatic study of ostensible invitations in English and Farsi SO SPEECH COMMUNICATION LA English DT Article DE ostensible invitations; politeness; speech act theory; pragmatics; face threatening acts AB In their study in 1990, Clark and Isaacs identified five properties and seven defining features that distinguished between English ostensible and genuine invitations. To see if Persian ostensible and genuine invitations could be distinguished by the same features and properties, the present study was carried out. Forty-five field workers observed and reported 566 ostensible and 607 genuine invitations. In addition, 34 undergraduate students were interviewed and 68 ostensible and 68 genuine invitations were gathered. Forty-one pairs of friends were also interviewed and afforded 41 ostensible invitations. The results of the data analysis revealed that Persian ostensible invitations can also be distinguished from Persian genuine invitations by the features and properties identified by Clark and Isaacs. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Zanjan, Dept English, Zanjan, Iran. RP Salmani-Nodoushan, MA (reprint author), Univ Zanjan, Dept English, Univ Blvd, Zanjan, Iran. EM nodushan@ut.ac.ir CR BROWN R, 1978, QUESTIONS POLITENESS, P56 EDELSKY C, 1981, LANG SOC, V10, P383 HYMES D, 1967, J SOC ISSUES, V23, P8 ISAACS EA, 1990, LANG SOC, V19, P493 KEENAN EO, 1976, LANG SOC, V5, P67 Labov William, 1972, SOCIOLINGUISTIC PATT LEECH G, 1983, PRINIPLES PRAGMATICS Levinson Stephen C., 1983, PRAGMATICS Miller C., 1976, WORDS WOMEN Savignon S. J., 1983, COMMUNICATIVE COMPET Wolfson N., 1989, PERSPECTIVES SOCIOLI NR 11 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2006 VL 48 IS 8 BP 903 EP 912 DI 10.1016/j.specom.2005.12.001 PG 10 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300003 ER PT J AU Nagarajan, T Murthy, HA AF Nagarajan, T. Murthy, H. A. TI Language identification using acoustic log-likelihoods of syllable-like units SO SPEECH COMMUNICATION LA English DT Article DE language identification; syllable; incremental training ID AUTOMATIC SEGMENTATION; SPEECH AB Automatic spoken language identification (LID) is the task of identifying the language from a short utterance of the speech signal uttered by an unknown speaker. The most successful approach to LID uses phone recognizers of several languages in parallel [Zissman, M.A., 1996. Comparison of four approaches to automatic language identification of telephone speech. IEEE Trans. Speech Audio Process. 4 (1), 31-44]. The basic requirement to build a parallel phone recognition (PPR) system is segmented and labeled speech corpora. In this paper, a novel approach is proposed for the LID task which uses parallel syllable-like unit recognizers, in a framework similar to the PPR approach in the literature.
The difference is that the sub-word unit models for each of the languages to be recognized are generated in an unsupervised manner without the use of segmented and labeled speech corpora. The training data of each of the languages is first segmented into syllable-like units and a language-dependent syllable-like unit inventory is created. These syllable-like units are then clustered using an incremental approach. This results in a set of syllable-like unit models for each language. Using these language-dependent syllable-like unit models, language identification is performed based on accumulated acoustic log-likelihoods. Our initial results on the Oregon Graduate Institute Multi-language Telephone Speech Corpus [Muthusamy, Y.K., Cole, R.A., Oshika, B.T., 1992. The OGI multi-language telephone speech corpus. In: Proceedings of Internat. Conf. Spoken Language Process., October 1992, pp. 895-898] show that the performance is 72.3%. We further show that if only a subset of syllable-like unit models that are unique (in some sense) is considered, the performance improves to 75.9%. (C) 2006 Elsevier B.V. All rights reserved. C1 Indian Inst Technol, Dept Comp Sci & Engn, Madras 600036, Tamil Nadu, India. RP Nagarajan, T (reprint author), Univ Quebec, INRS, EMT, 800 Gauchetiere Ouest,Bur 6900, Montreal, PQ H5A 1K6, Canada. EM raju@emt.inrs.ca CR BERKLING KM, 1994, P ICASSP, P289 FUJIMURA O, 1975, IEEE T ACOUST SPEECH, VAS23, P82, DOI 10.1109/TASSP.1975.1162631 Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858 Greenberg S, 1999, SPEECH COMMUN, V29, P159, DOI 10.1016/S0167-6393(99)00050-3 Hazen T. J., 1994, P ICSLP 94, P1883 JAYARAM AKV, 2003, P IEEE INT C AC SPEE, V1, P32 KADAMBE S, 1995, P IEEE INT C AC SPEE, P3507 LAMEL LF, 1994, P IEEE INT C AC SPEE, V1, P293 LI KP, 1994, P IEEE INT C AC SPEE, P297 MERMELSTEIN P, 1975, J ACOUST SOC AM, V58, P880, DOI 10.1121/1.380738 MUTHUSAMY UK, 1994, IEEE SIGNAL PROC MAG, P33 Muthusamy Y. K., 1992, P INT C SPOK LANG PR, P895 Nagarajan T., 2003, P EUROSPEECH, P2893 Nagarajan T., 2004, P IEEE INT C AC SPEE, VI, P401 NAKAGAWA S, 1988, P IEEE INT C AC SPEE, P960 NAVRATIL J, 1997, P EUR C SPEECH COMM, V1, P71 Noetzel A., 1991, Electro International. Conference Record, DOI 10.1109/ELECTR.1991.718279 Prasad VK, 2004, SPEECH COMMUN, V42, P429, DOI 10.1016/j.specom.2003.12.002 PRASAD VK, 2003, THESIS INDIAN I TECH Rabiner L.R., 1978, DIGITAL PROCESSING S RAMASUBRAMANIAN V, 2003, P EUR GEN SWITZ SEP, P1357 RAMASUBRAMANIAN V, 2003, WSLP TIFR MUMBAI JAN, P109 Schmidbauer O., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Shastri L., 1999, ICPHS SAN FRANC, P1721 Singer E., 2003, P EUROSPEECH, P1345 Wu SL, 1998, INT CONF ACOUST SPEE, P721 YAN Y, 1995, P IEEE INT C AC SPEE, P3511 Zissman MA, 1996, IEEE T SPEECH AUDI P, V4, P31, DOI 10.1109/TSA.1996.481450 NR 28 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun.
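The decision rule in the Nagarajan and Murthy record above, accumulate each language's acoustic log-likelihood and pick the maximum, is easy to show in miniature. The per-language "unit models" below are single Gaussians over a scalar feature, invented stand-ins for trained syllable-like unit models.

```python
import math

MODELS = {                        # hypothetical per-language unit inventories
    "english": [(0.0, 1.0), (2.0, 1.0)],      # (mean, std) per "unit"
    "hindi":   [(1.0, 1.0), (3.0, 1.0)],
}

def loglik(x, mu, sd):
    """Log density of a scalar Gaussian observation."""
    return -0.5 * math.log(2 * math.pi * sd * sd) - (x - mu) ** 2 / (2 * sd * sd)

def identify(frames):
    """Accumulate, per language, the best-unit log-likelihood of each frame."""
    scores = {}
    for lang, units in MODELS.items():
        scores[lang] = sum(max(loglik(x, mu, sd) for mu, sd in units)
                           for x in frames)
    return max(scores, key=scores.get), scores

print(identify([0.1, 1.9, 0.2, 2.2]))     # -> ('english', ...)
```

Restricting the sum to units that are unique to a language, as the abstract suggests, sharpens the score differences between competing languages.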
PD AUG PY 2006 VL 48 IS 8 BP 913 EP 926 DI 10.1016/j.specom.2005.12.003 PG 14 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300004 ER PT J AU Ghanbari, Y Karami-Mollaei, MR AF Ghanbari, Yasser Karami-Mollaei, Mohammad Reza TI A new approach for speech enhancement based on the adaptive thresholding of the wavelet packets SO SPEECH COMMUNICATION LA English DT Article DE speech processing; speech enhancement; wavelet thresholding; noisy speech recognition ID NOISE; ADAPTATION AB In this paper, we propose a new speech enhancement system using the wavelet thresholding algorithm. The basic wavelet thresholding algorithm has some defects including the assumption of white Gaussian noise (WGN), malfunction in unvoiced segments, bad auditory quality, etc. In the proposed system, we introduce a new algorithm which does not require any voiced/unvoiced detection system. Also, in this method, adaptive wavelet thresholding and modified thresholding functions are introduced to improve the speech enhancement performance as well as the automatic speech recognition (ASR) accuracy. A new voice activity detector (VAD) was designed to update noise statistics in the proposed speech enhancement system when facing colored and non-stationary noise. The proposed method was evaluated on several speakers and under various noise conditions including white Gaussian noise, pink noise, and multi-talker babble noise. The SNR and ASR results show that the new method greatly improves the performance of the speech enhancement algorithm based on wavelet thresholding. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Mazandaran, Fac Elect Engn, Dept Elect Engn, Babol Sar 484, Iran. RP Karami-Mollaei, MR (reprint author), Univ Mazandaran, Fac Elect Engn, Dept Elect Engn, Babol Sar 484, Iran. EM y.ghanbari@nit.ac.ir; mkarami@nit.ac.ir CR Berouti M., 1979, P IEEE INT C AC SPEE, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 CHANG S, 2002, P IEEE INT C AC SPEE, P561 Cho YD, 2001, ELECTRON LETT, V37, P540, DOI 10.1049/el:20010368 Deller J., 2000, DISCRETE TIME PROCES DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009 DONOHO DL, 1994, BIOMETRIKA, V81, P425, DOI 10.1093/biomet/81.3.425 Freeman DK, 1989, P INT C AC SPEECH SI, P369 Ghanbari Y, 2004, Proceedings of the Sixth IASTED International Conference on Signal and Image Processing, P225 GHANBARI Y, 2004, INT J SOFTWARE INF T, V1, P26 ITOH K, 1997, P IEEE INT C AC SPEE, P21 Johnstone IM, 1997, J R STAT SOC B, V59, P319, DOI 10.1111/1467-9868.00071 Kamath S., 2002, P IEEE INT C AC SPEE KLEIN M, 2002, P IEEE INT C AC SPEE, P537 Marzinzik M, 2002, IEEE T SPEECH AUDI P, V10, P109, DOI 10.1109/89.985548 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Sameti H, 1998, IEEE T SPEECH AUDI P, V6, P445, DOI 10.1109/89.709670 Sangwan A., 2002, 5th IEEE International Conference on High Speed Networks and Multimedia Communication (Cat. No.02EX612), DOI 10.1109/HSNMC.2002.1032545 SEOK JW, 1997, P ICASSP 97, P1323 SHEIKHZADEH H, 2001, P 7 EUR C SPEECH COM Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 Soon IY, 1997, TENCON IEEE REGION, P479 NR 22 TC 30 Z9 39 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun.
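For orientation, here is the baseline that the Ghanbari and Karami-Mollaei record above improves upon: wavelet soft-thresholding denoising. For brevity this sketch uses a plain DWT with the classic universal threshold rather than the paper's wavelet packets and adaptive, modified thresholds; the test signal is synthetic. It assumes the PyWavelets package is installed.

```python
import numpy as np
import pywt

def denoise(noisy, wavelet="db4", level=4):
    coeffs = pywt.wavedec(noisy, wavelet, level=level)
    # noise std estimated from the finest detail band (median rule)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745
    thr = sigma * np.sqrt(2 * np.log(len(noisy)))     # universal threshold
    # soft-threshold every detail band, keep the approximation untouched
    coeffs = [coeffs[0]] + [np.sign(c) * np.maximum(np.abs(c) - thr, 0.0)
                            for c in coeffs[1:]]
    return pywt.waverec(coeffs, wavelet)[: len(noisy)]

t = np.arange(8000) / 8000.0
clean = np.sin(2 * np.pi * 200 * t) * np.exp(-3 * t)  # synthetic "speech"
noisy = clean + np.random.default_rng(0).normal(0, 0.3, t.size)
out = denoise(noisy)
snr = lambda ref, x: 10 * np.log10((ref ** 2).sum() / ((ref - x) ** 2).sum())
print(f"SNR in: {snr(clean, noisy):.1f} dB, out: {snr(clean, out):.1f} dB")
```

The paper's contribution sits exactly where this sketch is weakest: the fixed universal threshold assumes white Gaussian noise, which is why an adaptive threshold driven by a VAD is needed for colored and non-stationary noise.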
PD AUG PY 2006 VL 48 IS 8 BP 927 EP 940 DI 10.1016/j.specom.2005.12.002 PG 14 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300005 ER PT J AU Diaz, FC Banga, ER AF Diaz, Francisco Campillo Rodriguez Banga, Eduardo TI A method for combining intonation modelling and speech unit selection in corpus-based speech synthesis systems SO SPEECH COMMUNICATION LA English DT Article DE speech synthesis; unit selection; corpus-based; intonation AB In this paper, we focus on improving the quality of corpus-based synthesis systems by considering several candidate intonation contours. These candidates are generated by a unit selection procedure for which cost functions are defined. The consideration of several possible pitch contours adds an additional degree of freedom to the search for appropriate speech units. Objective and subjective tests confirm an improvement in the quality of the resulting synthetic speech. (C) 2006 Elsevier B.V. All rights reserved. C1 Univ Vigo, ETSI Telecomunicac, Dpto Teoria Senal & Comunicac, Vigo 36200, Spain. RP Banga, ER (reprint author), Univ Vigo, ETSI Telecomunicac, Dpto Teoria Senal & Comunicac, Campus Univ, Vigo 36200, Spain. EM campillo@gts.tsc.uvigo.es; erbanga@gts.tsc.uvigo.es RI Rodriguez Banga, Eduardo/C-4296-2011 CR BLACK A, 1995, P EUR MADR SPAIN, V1, P581 Botinis A, 2001, SPEECH COMMUN, V33, P263, DOI 10.1016/S0167-6393(00)00060-1 BULYKO I, 2001, P ICASSP, V2, P781 CAMPILLO F, 2002, P ICSLP DENV, V1, P141 CAMPILLO F, 2003, P EUR, V1, P289 Efron B., 1993, INTRO BOOTSTRAP ESCUDERO D, 2003, P EUR GINE, V3, P2309 ESCUDERO D, 2002, THESIS U VALLADOLID GARRIDO JM, 1996, THESIS U BARCELONA S Hunt A., 1996, P INT C AC SPEECH SI, V1, P373 LEE M, 2001, 4 ISCA WORKSH SPEECH, P75 LOPEZ E, 1993, THESIS U POLITECNICA MALFRERE F, 1998, P 3 ESCA WORKSH SPEE, P323 Navarro T, 1977, MANUAL PRONUNCIACION OSHAUGHNESSY D, 2000, SPEECH COMMUN, P134 NR 15 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2006 VL 48 IS 8 BP 941 EP 956 DI 10.1016/j.specom.2005.12.004 PG 16 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300006 ER PT J AU Maj, JB Royackers, L Wouters, J Moonen, M AF Maj, Jean-Baptiste Royackers, Liesbeth Wouters, Jan Moonen, Marc TI Comparison of adaptive noise reduction algorithms in dual microphone hearing aids SO SPEECH COMMUNICATION LA English DT Article DE adaptive beamformer; adaptive directional microphone; calibration; noise reduction algorithms; hearing aids ID SPEECH RECEPTION THRESHOLD; PERFORMANCE; RATIO AB In this paper, a physical and perceptual evaluation of two adaptive noise reduction algorithms for dual-microphone hearing aids is described. This is the first comparison between a fixed directional microphone on the one hand, and an adaptive directional microphone and an adaptive beamformer on the other hand, all implemented in the same digital hearing aid. The adaptive directional microphone is state-of-the-art in most modern commercial hearing aids. The physical evaluation shows the importance of an individual calibration procedure for the performance of the noise reduction algorithms with two-microphone hearing aids.
The directivity index calculated in anechoic conditions and intelligibility-weighted polar diagrams measured in reverberant conditions show that all the noise reduction strategies yield an improved signal-to-noise ratio (SNR), but that the adaptive beamformer generally performs best. From the perceptual evaluation, it is demonstrated that the adaptive beamformer always performs best in single noise source scenarios. In a more complex noise scenario, there is still an SNR improvement with all the techniques; however, the effect is the same for all the strategies. (C) 2006 Elsevier B.V. All rights reserved. C1 Lab Exp ORL, B-3000 Louvain, Belgium. SCD, B-3001 Louvain, Belgium. RP Wouters, J (reprint author), Lab Exp ORL, Kapucijnenvoer 33, B-3000 Louvain, Belgium. EM Jean-Baptiste.Maj@uz.kuleuven.ac.be; Liesbeth.Royackers@uz.kuleuven.ac.be; Jan.Wouters@uz.kuleuven.ac.be; Marc.Moonen@esat.kuleuven.ac.be RI Wouters, Jan/D-1800-2015 CR *ANSI, 1997, S331997 ANSI ANSI, 1969, S351969 ANSI BACHLER H, 1995, PHONAK FOCUS, P18 Beranek L., 1954, MCGRAW HILL ELECT EL CEZANNE J, 1995, Patent No. 5303307 COX H, 1987, IEEE T ACOUST SPEECH, V35, P1365, DOI 10.1109/TASSP.1987.1165054 Desloge JG, 1997, IEEE T SPEECH AUDI P, V5, P529, DOI 10.1109/89.641298 GREENBERG JE, 1993, J ACOUST SOC AM, V94, P3009, DOI 10.1121/1.407334 GREENBERG JE, 1992, J ACOUST SOC AM, V91, P1662, DOI 10.1121/1.402446 GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 HAWKINS DB, 1984, J SPEECH HEAR DISORD, V49, P278 Haykin S., 1996, ADAPTIVE FILTER THEO Hinkle DE, 1998, APPL STAT BEHAV SCI *ICRA, 1997, INT COLL REAB AUD NO LEEUW AR, 1991, AUDIOLOGY, V30, P330 Luo FL, 2002, IEEE T SIGNAL PROCES, V50, P1583 Maj JB, 2004, EAR HEARING, V25, P215, DOI 10.1097/01.AUD.0000130794.28068.96 Maj J.B, 2004, THESIS KATHOLIEKE U PAVLOVIC CV, 1987, J ACOUST SOC AM, V82, P413, DOI 10.1121/1.395442 PETERSON PM, 1989, THESIS MIT CAMBRIDGE PLOMP R, 1979, AUDIOLOGY, V18, P43 PLOMP R, 1994, EAR HEARING, V15, P2 Ricketts T, 2000, EAR HEARING, V21, P45, DOI 10.1097/00003446-200002000-00008 Ricketts T, 2000, EAR HEARING, V21, P318, DOI 10.1097/00003446-200008000-00007 Ricketts T, 2002, INT J AUDIOL, V41, P100, DOI 10.3109/14992020209090400 THOMPSON SC, 1999, HIGH PERFORMANCE HEA, V3, P31 Van Gerven S., 1997, EUR SEPT 22 25 RHOD, P1095 Vanden Berghe J, 1998, J ACOUST SOC AM, V103, P3621, DOI 10.1121/1.423066 Versfeld NJ, 2000, J ACOUST SOC AM, V107, P1671, DOI 10.1121/1.428451 Wouters J, 2002, INT J AUDIOL, V41, P401, DOI 10.3109/14992020209090417 Wouters J, 2001, EAR HEARING, V22, P420, DOI 10.1097/00003446-200110000-00006 NR 31 TC 10 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun.
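The adaptive stage common to the beamformers compared in the Maj et al. record above can be sketched as a normalised LMS (NLMS) noise canceller: a noise-dominated reference channel is filtered to predict the noise leaking into the primary channel, and the prediction error is the enhanced output. The two-channel signals, filter length, and step size below are invented; a real hearing-aid beamformer adds the individual calibration and control logic the paper stresses.

```python
import numpy as np

def nlms(primary, reference, taps=16, mu=0.5, eps=1e-6):
    """Adaptive noise canceller: predict the noise in `primary` from
    `reference` and subtract it; the error signal is the enhanced output."""
    w = np.zeros(taps)
    out = np.zeros_like(primary)
    for n in range(taps, len(primary)):
        x = reference[n - taps:n][::-1]      # recent reference samples
        e = primary[n] - w @ x               # enhanced sample
        w += mu * e * x / (x @ x + eps)      # normalised LMS update
        out[n] = e
    return out

rng = np.random.default_rng(1)
n = 8000
speech = np.sin(2 * np.pi * 300 * np.arange(n) / 8000)
noise = rng.normal(0, 1.0, n)
front = speech + 0.8 * np.roll(noise, 2)     # primary mic: speech + noise
rear = noise                                 # noise-dominated reference
enhanced = nlms(front, rear)
err_in = np.var(front[1000:] - speech[1000:])
err_out = np.var(enhanced[1000:] - speech[1000:])
print(f"noise power before {err_in:.3f}, after {err_out:.3f}")
```

Because the speech is uncorrelated with the noise reference, the filter cancels only the noise component, which is the property a generalised sidelobe canceller relies on.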
PD AUG PY 2006 VL 48 IS 8 BP 957 EP 970 DI 10.1016/j.specom.2005.12.005 PG 14 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300007 ER PT J AU Togneri, R Deng, L AF Togneri, Roberto Deng, Li TI A state-space model with neural-network prediction for recovering vocal tract resonances in fluent speech from Mel-cepstral coefficients SO SPEECH COMMUNICATION LA English DT Article DE vocal tract resonance; tracking; cepstra; neural network; multi-layer perceptron; EM algorithm; hidden dynamics; state-space model ID FORMANT SYNTHESIZER; CONSONANTS AB In this paper, we present a state-space formulation of a neural-network-based hidden dynamic model of speech whose parameters are trained using an approximate EM algorithm. This efficient and effective training makes use of the output of an off-the-shelf formant tracker (for the vowel segments of the speech signal), in addition to the Mel-cepstral observations, to simplify the complex sufficient statistics that would be required in the exact EM algorithm. The trained model, consisting of the state equation for the target-directed vocal tract resonance (VTR) dynamics on all classes of speech sounds (including consonant closure and constriction) and the observation equation for mapping from the VTR to Mel-cepstral acoustic measurement, is then used to recover the unobserved VTR based on the extended Kalman filter. The results demonstrate accurate estimation of the VTR, especially during rapid consonant-vowel or vowel-consonant transitions and during consonant closure when the acoustic measurement alone provides weak or no information to infer the VTR values. The practical significance of correctly identifying the VTRs during consonantal closure or constriction is that they provide target frequency values for the VTR or formant transitions from adjacent sounds. Without such target values, the VTR transitions from vowel to consonant or from consonant to vowel are often very difficult to extract accurately by the previous formant tracking techniques. With the use of the new technique reported in this paper, the consonantal VTRs and the related transitions become more reliably identified from the speech signal. (C) 2006 Elsevier B.V. All rights reserved. C1 Univ Western Australia, Sch EE&C Engn, Crawley, WA 6009, Australia. Microsoft Corp, Res, Redmond, WA 98052 USA. RP Togneri, R (reprint author), Univ Western Australia, Sch EE&C Engn, Crawley, WA 6009, Australia. 
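The VTR recovery described in the Togneri and Deng record above runs an extended Kalman filter over a target-directed state equation with a neural-network observation mapping. A minimal sketch of one EKF predict/update cycle, where a numerical Jacobian stands in for the trained MLP's derivatives; the dimensions, targets, and observation function h below are illustrative assumptions, not the paper's trained model:

```python
import numpy as np

def ekf_step(x, P, y, Phi, target, Q, R, h, eps=1e-5):
    """One EKF cycle for x' = Phi x + (I - Phi) target (target-directed
    hidden dynamics) with a nonlinear observation y = h(x) + v."""
    n = x.size
    x_pred = Phi @ x + (np.eye(n) - Phi) @ target      # predict state
    P_pred = Phi @ P @ Phi.T + Q                       # predict covariance
    y0 = h(x_pred)
    H = np.zeros((y0.size, n))                         # numerical Jacobian of h
    for i in range(n):
        dx = np.zeros(n); dx[i] = eps
        H[:, i] = (h(x_pred + dx) - y0) / eps
    S = H @ P_pred @ H.T + R                           # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)                # Kalman gain
    return x_pred + K @ (y - y0), (np.eye(n) - K @ H) @ P_pred

# toy usage: 3 resonances, 12 "cepstral" observations, random stand-in h
h = lambda x: np.tanh(np.outer(np.arange(1, 13), x).sum(axis=1))
x, P = np.zeros(3), np.eye(3)
x, P = ekf_step(x, P, h(np.array([0.5, 1.5, 2.5])), 0.9 * np.eye(3),
                np.array([0.5, 1.5, 2.5]), 0.01 * np.eye(3), 0.1 * np.eye(12), h)
```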
EM roberto@ee.uwa.edu.au; deng@microsoft.com RI Togneri, Roberto/C-2466-2013 CR Allen J., 1987, TEXT SPEECH MITALK S DELATTRE PC, 1955, J ACOUST SOC AM, V27, P769, DOI 10.1121/1.1908024 DENG L, 2003, SPEECH PROCESSING DY, P73 Deng L, 2000, J ACOUST SOC AM, V108, P3036, DOI 10.1121/1.1315288 DENG L, 2003, P EUROSPEECH, P73 Deng L, 2003, SPEECH PROCESSING DY Foresee F.-D., 1997, P 1997 INT JOINT C N, P1930, DOI DOI 10.1109/ICNN.1997.614194 Haykin S., 1999, NEURAL NETWORKS COMP, V2nd KEWLEYPORT D, 1983, J ACOUST SOC AM, V73, P322, DOI 10.1121/1.388813 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 KOPEC GE, 1986, IEEE T ACOUST SPEECH, V34, P709, DOI 10.1109/TASSP.1986.1164908 MCCANDLE.SS, 1974, IEEE T ACOUST SPEECH, VSP22, P135, DOI 10.1109/TASSP.1974.1162559 SEIDE F, 2003, P ICASSP, P748 SJOLANDER K, 2002, RECENT DEV REGARDING STEVENS K, 1993, SPEECH SYNTHESIS FOR STEVENS KN, 1991, J PHONETICS, V19, P161 Togneri R, 2003, IEEE T SIGNAL PROCES, V51, P3061, DOI 10.1109/TSP.2003.819013 ZUE V, 1991, COURSE NOTES SPEECH NR 18 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2006 VL 48 IS 8 BP 971 EP 988 DI 10.1016/j.specom.2006.01.001 PG 18 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300008 ER PT J AU Ni, JF Hirose, K AF Ni, Jinfu Hirose, Keikichi TI Quantitative and structural modeling of voice fundamental frequency contours of speech in Mandarin SO SPEECH COMMUNICATION LA English DT Article DE prosody modeling; F-0 contours; tone; intonation; tone modulation; resonance principle; analysis-by-synthesis; tonal languages ID INTONATION AB This paper presents an approach to structural modeling of voice fundamental frequency contours (F-0 contours) of Mandarin utterances as a sequence of modulated tones. A proposed functional model mathematically implements the tone modulation with both local and global controls. The local control consists of placing a series of normalized F-0 targets along the time axis, which are specified by transition time and amplitudes and are always reached; and the transitions between targets are approximated by connecting truncated second-order transition functions. The global control in terms of sentence modality simply compresses or expands the heights and ranges of the prototypical patterns of syllabic tones generated by the local control. Both local and global controls are integrated in a unified framework, and this paper explains the underlying scientific and linguistic principles. Analysis of 1044 utterances of various sentences read by eight native speakers revealed that the model could closely approximate the observed F-0 contours with a small number of parameters. These parameters are localized and suited to a data-driven fitting process. As will be demonstrated, the model also is promising for measuring intonation variations from observed F-0 contours. (C) 2006 Elsevier B.V. All rights reserved. C1 Univ Tokyo, Sch Engn, Dept Informat & Commun Engn, Bunkyo Ku, Tokyo 1138656, Japan. RP Ni, JF (reprint author), ATR Spoken Language Commun Res Labs, 2-2-2 Hikaridai,Keihanna Sci City, Seika, Kyoto 6190288, Japan. 
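The local control in the Ni and Hirose record above places normalised F0 targets on the time axis and joins them with truncated second-order transition functions that always reach the next target. A small sketch under the assumption that the transition is the critically damped second-order step response 1 - (1 + beta t)e^(-beta t); the paper's exact functional form and parameterisation may differ:

```python
import numpy as np

def transition(t, beta=40.0):
    """Critically damped second-order step response (assumed form)."""
    return 1.0 - (1.0 + beta * t) * np.exp(-beta * t)

def f0_contour(times, targets, fs=200.0):
    """Connect normalised F0 targets placed at 'times' (seconds); each
    segment moves from one target to the next, which is effectively
    reached because the response saturates within the segment."""
    t_axis = np.arange(0.0, times[-1], 1.0 / fs)
    f0 = np.full(t_axis.shape, float(targets[0]))
    for k in range(len(times) - 1):
        seg = (t_axis >= times[k]) & (t_axis < times[k + 1])
        s = transition(t_axis[seg] - times[k])
        f0[seg] = targets[k] + (targets[k + 1] - targets[k]) * s
    return t_axis, f0

t, f0 = f0_contour(times=[0.0, 0.25, 0.6], targets=[5.0, 5.6, 5.2])
```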
EM jinfu.ni@atr.jp; hirose@gavo.t.u-tokyo.ac.jp CR ABE I, 1980, MELODY LANGUAGE, P1 Chao Yuen Ren, 1968, GRAMMAR SPOKEN CHINE CHEN SH, 1992, J ACOUST SOC AM, V92, P114, DOI 10.1121/1.404276 Collier R., 1990, PERCEPTUAL STUDY INT FOURCIN AJ, 1979, FRONTIERS SPEECH COM, P167 Fry Dennis B., 1979, PHYS SPEECH Fujimura O., 1981, VOCAL FOLD PHYSL, P271 Fujisaki H., 1983, PRODUCTION SPEECH, P39 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 FUJISAKI H, 2000, P INT C SPOK LANG PR, V1, P9 Fujisaki H., 1988, VOCAL PHYSL VOICE PR, P347 GARDING E, 1987, PHONETICA, V44, P12 HIRANO M, 1974, FOLIA PHONIATR, V26, P89 Hirose K., 1994, Journal of the Acoustical Society of Japan, V50 Hirst D. J., 1998, INTONATION SYSTEMS S, P1 Hirst D. J., 2000, PROSODY THEORY EXPT, V14, P51 Honda K., 1995, PRODUCING SPEECH CON, P215 Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X Kochanski G, 2003, SPEECH COMMUN, V41, P625, DOI 10.1016/S0167-6393(03)00100-6 Kratochvil P., 1998, INTONATION SYSTEMS S, P417 Ladd D., 1996, INTONATION PHONOLOGY Lee LS, 1993, IEEE T SPEECH AUDI P, V1, P287, DOI 10.1109/89.232612 Ni J., 2004, P INT C SPEECH PROS, P95 NI J, 2003, P INT C AC SPEECH SI, P72 NI J, 1997, CHIN J ACOUST, V16, P339 NI J, 2005, P INT 2005 EUR, P1397 Ohman S., 1967, WORD SENTENCE INTONA, P20 PIERREHUMBERT J, 1981, J ACOUST SOC AM, V70, P985, DOI 10.1121/1.387033 Sagart L., 1986, CAHIERS LINGUISTIQUE, V15, P205, DOI 10.3406/clao.1986.1204 SHEN J, 1985, EXPT PEKINESE PHONET, P27 Silverman K., 1992, P INT C SPOK LANG PR, V2, P867 Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 THORSEN NG, 1980, J ACOUST SOC AM, V67, P1014, DOI 10.1121/1.384069 TITZE IR, 1997, SPEECH PROD LANGUAGE, P33 van Santen J, 1998, MULTILINGUAL TEXT-TO-SPEECH SYNTHESIS: THE BELL LABS APPROACH, P141 WU Z, 1996, ANAL PERCEPTION PROC, P255 Xiaonan Shen, 1990, PROSODY MANDARIN CHI Xu Y, 1998, PHONETICA, V55, P179, DOI 10.1159/000028432 Xu Y, 2001, SPEECH COMMUN, V33, P319, DOI 10.1016/S0167-6393(00)00063-7 NR 39 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2006 VL 48 IS 8 BP 989 EP 1008 DI 10.1016/j.specom.2006.01.002 PG 20 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300009 ER PT J AU Patwardhan, P Rao, P AF Patwardhan, Pushkar Rao, Preeti TI Effect of voice quality on frequency-warped modeling of vowel spectra SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT International Conference on Spoken Language Technology CY 2004 CL New Delhi, INDIA DE voice quality; spectral envelope modeling; frequency warping; all-pole modeling; partial loudness ID THRESHOLDS AB The perceptual accuracy of an all-pole representation of the spectral envelope of voiced sounds may be enhanced by the use of frequency-scale warping prior to LP modeling. For the representation of harmonic amplitudes in the sinusoidal coding of voiced sounds, the effectiveness of frequency warping was shown to depend on the underlying signal spectral shape as determined by phoneme quality. 
In this paper, the previous work is extended to the other important dimension of spectral shape variation, namely voice quality. The influence of voice quality attributes on the perceived modeling error in frequency-warped LP modeling of the spectral envelope is investigated through subjective and objective measures applied to synthetic and natural steady sounds. Experimental results are presented that demonstrate the feasibility and advantage of adapting the warping function to the signal spectral envelope in the context of a sinusoidal speech coding scheme. (C) 2006 Elsevier B.V. All rights reserved. C1 Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India. RP Patwardhan, P (reprint author), Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India. EM pushkar@ee.iitb.ac.in CR Burred JJ, 2004, J AUDIO ENG SOC, V52, P724 CHAMPION T, 1994, P IEEE INT C AC SPEE, P529 Childers D.G., 2000, SPEECH PROCESSING SY CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 DOVAL B, 1997, P ICASSP 97, P1295 Feng G, 1996, J ACOUST SOC AM, V99, P3694, DOI 10.1121/1.414967 GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651 Hanson HM, 1997, J ACOUST SOC AM, V101, P466, DOI 10.1121/1.417991 HERMANSKY H, 1985, DIGITAL PROCESS SIGN, P55 MACAULAY R, 1995, SINUSOIDAL CODING SP Markel JD, 1976, LINEAR PREDICTION SP MILLER J, 1989, CORRELATION STAT ADV MOLYNEUX D, 1998, P INT C SPOK LANG PR, P946 Moore BCJ, 1997, J AUDIO ENG SOC, V45, P224 Rao P, 2001, J ACOUST SOC AM, V109, P2085, DOI 10.1121/1.1354986 Rao P, 2005, SPEECH COMMUN, V47, P322, DOI 10.1016/j.specom.2005.02.009 WONG R, 1980, IEEE T ACOUST SPEECH, V28, P263 NR 17 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2006 VL 48 IS 8 BP 1009 EP 1023 DI 10.1016/j.specom.2006.01.003 PG 15 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300010 ER PT J AU Borowicz, A Parfieniuk, M Petrovsky, AA AF Borowicz, A. Parfieniuk, M. Petrovsky, A. A. TI An application of the warped discrete Fourier transform in the perceptual speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; warped discrete Fourier transform; perceptual processing ID AUDIO AB An application of the warped discrete Fourier transform (WDFT) to perceptual speech enhancement is of interest. The WDFT allows nonuniform sampling of the z-transform of a finite-length sequence. We focus on the perceptual warping, which allocates frequency samples in good accordance with the Bark scale. The WDFT can replace the conventional DFT-based analysis/synthesis block of the spectral weighting method. In the case of the perceptual warping, there is a problem with signal reconstruction because the WDFT matrix is ill-conditioned. This paper addresses the problem of signal distortions generated in the WDFT-based synthesis block. Spectral characteristics of the reconstructed signal are analyzed and discussed in the context of perceptual processing. A new extension of the WDFT, intended to cancel the synthesis error, is presented. The new method is also validated in a practical speech enhancement system. The results show that the new algorithm outperforms the pure WDFT-based system. (C) 2006 Elsevier B.V. All rights reserved. C1 Bialystok Tech Univ, Dept Real Time Syst, PL-15351 Bialystok, Poland.
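The perceptual warping in the Borowicz et al. record above allocates frequency samples according to the Bark scale. The sketch below builds a Bark-uniform sampling grid and evaluates the z-transform of a finite-length sequence at those frequencies; this direct nonuniform DFT mimics only the WDFT's sampling pattern, not its allpass-based implementation, and the Bark formula is one common approximation:

```python
import numpy as np

def hz_to_bark(f):
    """A common Bark-scale approximation (Zwicker-style)."""
    return 13.0 * np.arctan(0.00076 * f) + 3.5 * np.arctan((f / 7500.0) ** 2)

def bark_spaced_frequencies(n_bins, fs):
    """Frequencies (Hz) uniformly spaced on the Bark scale up to fs/2,
    obtained by numerically inverting hz_to_bark."""
    f_dense = np.linspace(0.0, fs / 2.0, 4096)
    z_dense = hz_to_bark(f_dense)
    return np.interp(np.linspace(0.0, z_dense[-1], n_bins), z_dense, f_dense)

def nonuniform_dft(x, freqs_hz, fs):
    """Sample the z-transform of x on the unit circle at arbitrary
    frequencies, i.e. the kind of nonuniform allocation a WDFT uses."""
    w = 2.0 * np.pi * np.asarray(freqs_hz) / fs
    return np.exp(-1j * np.outer(w, np.arange(len(x)))) @ x

fs = 16000
X = nonuniform_dft(np.random.randn(256), bark_spaced_frequencies(64, fs), fs)
```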
RP Borowicz, A (reprint author), Bialystok Tech Univ, Dept Real Time Syst, Wiejska 45A, PL-15351 Bialystok, Poland. EM borowicz@ii.pb.bialystok.pl CR [Anonymous], 1992, IS111723 ISOIEC Bagchi S, 1996, IEEE T CIRCUITS-II, V43, P422, DOI 10.1109/82.502315 BOROWICZ A, 2004, ELECT TELECOMM Q PAN, V50, P395 BOROWICZ A, 2004, ELECT TELECOMM Q PAN, V50, P379 Cho NI, 2000, IEEE T CIRC SYST VID, V10, P1364 Cohen I, 2002, IEEE SIGNAL PROC LET, V9, P12, DOI 10.1109/97.988717 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 Franz S, 2003, SIGNAL PROCESS, V83, P1661, DOI 10.1016/S0165-1684(03)00079-3 GUSTAFSSON S, 1998, P IEEE INT C AC SPEE, V1, P397, DOI 10.1109/ICASSP.1998.674451 GUSTAFSSON S, 1999, P ICASSP, V2, P873 Gustafsson S, 2002, IEEE T SPEECH AUDI P, V10, P245, DOI 10.1109/TSA.2002.800553 Hansen P. C., 1994, REGULARIZATION TOOLS, V7.3 HANSEN PC, 1987, BIT, V27, P534, DOI 10.1007/BF01937276 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 Makur A, 2001, IEEE T CIRCUITS-I, V48, P1086, DOI 10.1109/81.948436 OPPENHEI.A, 1971, PR INST ELECTR ELECT, V59, P299, DOI 10.1109/PROC.1971.8146 Painter T, 2000, P IEEE, V88, P451, DOI 10.1109/5.842996 Parfieniuk M, 2004, 2004 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH, AND SIGNAL PROCESSING, VOL IV, PROCEEDINGS, P185, DOI 10.1109/ICASSP.2004.1326794 PARFIENIUK M, 2004, P 6 INT C EXH DIG SI, P190 Penrose R., 1955, P CAMBRIDGE PHILOS S, P406, DOI DOI 10.1017/S0305004100030401 Petrovsky A., 2004, AES CONV 116 BERL GE PETROVSKY AA, 2002, P 11 EUR SIGN PROC C, V1, P487 Rao C, 1971, GEN INVERSE MATRICES Smith JO, 1999, IEEE T SPEECH AUDI P, V7, P697, DOI 10.1109/89.799695 Tsoukalas DE, 1997, IEEE T SPEECH AUDI P, V5, P497, DOI 10.1109/89.641296 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 WANG EB, 1992, J RARE EARTH, V10, P5 Yang WH, 1998, INT CONF ACOUST SPEE, P541 NR 28 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2006 VL 48 IS 8 BP 1024 EP 1036 DI 10.1016/j.specom.2006.01.004 PG 13 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300011 ER PT J AU Stadermann, J Rigoll, G AF Stadermann, Jan Rigoll, Gerhard TI Hybrid NN/HMM acoustic modeling techniques for distributed speech recognition SO SPEECH COMMUNICATION LA English DT Article DE distributed speech recognition; tied-posteriors; hybrid speech recognition AB Distributed speech recognition (DSR), where the recognizer is split up into two parts and connected via a transmission channel, offers new perspectives for improving the speech recognition performance in mobile environments. In this work, we present the integration of hybrid acoustic models using tied-posteriors in a distributed environment. A comparison with standard Gaussian models is performed on the AURORA2 task and the WSJ0 task. Word-based HMMs and phoneme-based HMMs are trained for distributed and non-distributed recognition using either MFCC or RASTA-PLP features. The results show that hybrid modeling techniques can outperform standard continuous systems on this task. In particular, the tied-posteriors approach is shown to be usable for DSR in a very flexible way since the client can be modified without a change at the server side and vice versa. (C) 2006 Elsevier B.V. All rights reserved.
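The tied-posteriors idea in the Stadermann and Rigoll record above lets every HMM state share one set of network posteriors, with only the mixture weights being state specific, which is what allows client and server to change independently. A minimal scoring sketch with toy sizes; the scaled-likelihood form b_j(x) proportional to sum_i c[j,i] p(i|x)/p(i) follows the tied-posteriors literature, while the random inputs are stand-ins:

```python
import numpy as np

def tied_posterior_scores(nn_posteriors, class_priors, state_weights):
    """Emission scores for all HMM states from one shared NN output:
    b_j(x) ~ sum_i c[j, i] * p(i | x) / p(i)."""
    scaled_likelihoods = nn_posteriors / class_priors   # shape (I,)
    return state_weights @ scaled_likelihoods           # shape (J,)

rng = np.random.default_rng(0)
I, J = 40, 120                            # NN classes, HMM states (toy sizes)
posteriors = rng.dirichlet(np.ones(I))    # stand-in for one MLP output frame
priors = np.full(I, 1.0 / I)              # class priors from training data
c = rng.dirichlet(np.ones(I), size=J)     # state-specific weights, rows sum to 1
scores = tied_posterior_scores(posteriors, priors, c)
```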
C1 Tech Univ Munich, Inst Human Machine Commun, D-8000 Munich, Germany. RP Stadermann, J (reprint author), CreaLog Software Entwicklung & Beratung GmbH, Munich, Germany. EM stadermann@mmk.ei.tum.de; rigoll@mmk.ei.tum.de CR BARRAS C, 2001, IEEE INT C AC SPEECH BOURLARD H, 1990, IEEE T PATTERN ANAL, V12, P1167, DOI 10.1109/34.62605 Bourlard Ha, 1994, CONNECTIONIST SPEECH Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hirsch H.G., 2000, ISCA ITRW ASR2000 JAMES A, 2005, IEEE INT C AC SPEECH PAUL DB, 1992, INT C SPOK LANG PROC, P899 ROTTLAND J, 2000, IEEE INT C AC SPEECH SANTINI S, 1995, NEURAL NETWORKS, V8, P25, DOI 10.1016/0893-6080(94)00059-U SCHULZ H, 2001, 6 INT WORKSH APPL NA STADERMANN J, 2001, IEEE WORKSH AUT SPEE STADERMANN J, 2001, EUR C SPEECH COMM TE STADERMANN J, 2005, 3 1 DTSCH JAHR AK DA STADERMANN J, 2003, IEEE INT C AC SPEECH STADERMANN J, 2003, IEEE WORKSH AUT SPEE *STQ, 2003, 201108 ETSI ES STQ YUK D, 1999, IEEE INT C AC SPEECH NR 17 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2006 VL 48 IS 8 BP 1037 EP 1046 DI 10.1016/j.specom.2006.01.007 PG 10 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300012 ER PT J AU Shahin, I AF Shahin, Ismail TI Enhancing speaker identification performance under the shouted talking condition using second-order circular hidden Markov models SO SPEECH COMMUNICATION LA English DT Article DE first-order left-to-right hidden Markov models; neutral talking condition; second-order circular hidden Markov models; shouted talking condition ID STRESS COMPENSATION TECHNIQUE; SPEECH RECOGNITION; CLASSIFICATION AB It is known that the performance of speaker identification systems is high under the neutral talking condition; however, the performance deteriorates under the shouted talking condition. In this paper, second-order circular hidden Markov models (CHMM2s) have been proposed and implemented to enhance the performance of isolated-word text-dependent speaker identification systems under the shouted talking condition. Our results show that CHMM2s significantly improve speaker identification performance under such a condition compared to the first-order left-to-right hidden Markov models (LTRHMM1s), second-order left-to-right hidden Markov models (LTRHMM2s), and the first-order circular hidden Markov models (CHMM1s). Under the shouted talking condition, our results show that the average speaker identification performance is 23% based on LTRHMM1s, 59% based on LTRHMM2s, and 60% based on CHMM1s. On the other hand, the average speaker identification performance under the same talking condition based on CHMM2s is 72%. (C) 2006 Elsevier B.V. All rights reserved. C1 Univ Sharjah, Dept Elect & Comp Engn, Sharjah, U Arab Emirates. RP Shahin, I (reprint author), Univ Sharjah, Dept Elect & Comp Engn, POB 27272, Sharjah, U Arab Emirates.
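The circular topology in the Shahin record above closes a left-to-right chain back on itself, and the second-order variant conditions each transition on the two previous states; one standard trick for the latter is to run a first-order model over state pairs. A sketch of both ideas with arbitrary probabilities (the 0.6/0.4 split and the pair-expansion values are illustrative, not the paper's trained parameters):

```python
import numpy as np

def circular_transitions(n_states, p_stay=0.6):
    """Left-to-right ring: each state loops on itself or advances, and
    the last state feeds back into the first (the circular topology)."""
    A = np.zeros((n_states, n_states))
    for j in range(n_states):
        A[j, j] = p_stay
        A[j, (j + 1) % n_states] = 1.0 - p_stay  # wrap-around closes the ring
    return A

A1 = circular_transitions(5)

# Second-order transitions a(i,j,k) = P(s_t = k | s_{t-2} = i, s_{t-1} = j)
# can be folded into a first-order chain over state *pairs* (i, j):
n = 5
pair_index = {(i, j): i * n + j for i in range(n) for j in range(n)}
A2 = np.zeros((n * n, n * n))
for (i, j), row in pair_index.items():
    for k in (j, (j + 1) % n):                 # stay or advance on the ring
        A2[row, pair_index[(j, k)]] = A1[j, k]  # toy choice: reuse A1 values
```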
EM ismail@sharjah.ac.ae CR Bou-Ghazale SE, 2000, IEEE T SPEECH AUDI P, V8, P429, DOI 10.1109/89.848224 CAIRNS DA, 1994, J ACOUST SOC AM, V96, P3392, DOI 10.1121/1.410601 CHEN YN, 1988, IEEE T ACOUST SPEECH, V36, P433, DOI 10.1109/29.1547 CUMMINGS KE, 1995, J ACOUST SOC AM, V98, P88, DOI 10.1121/1.413664 DAI JN, 1995, IEEE T SPEECH AUDI P, V3, P458 HANSEN JHL, 2000, RTOTR10 NATO RES TEC Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7 Huang X.D., 1990, HIDDEN MARKOV MODELS JUANG BH, 1991, TECHNOMETRICS, V33, P251 JUANG BH, 1985, IEEE T ACOUST SPEECH, V33, P1404 LEVINSON SE, 1983, AT&T TECH J, V62, P1035 MARI JF, 1996, P IEEE INT C AC SPEE, V1, P435 Mari JF, 1997, IEEE T SPEECH AUDI P, V5, P22, DOI 10.1109/89.554265 Rabiner L., 1983, FUNDAMENTALS SPEECH RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 SHAHIN I, 2005, EURASIP J APPL SIG P, V5, P482 SHAHIN I, 2004, 1 INT C INF COMM TEC Shahin I, 1998, IEEE SOUTH RECORD, P65, DOI 10.1109/SECON.1998.673293 Shahin I, 1998, IEEE SOUTH RECORD, P61, DOI 10.1109/SECON.1998.673292 STROH M, BALTIMORE SUNDA 1129, pA1 ZHENG C, 1988, P IEEE INT C AC SPEE, P580 Zhou GJ, 2001, IEEE T SPEECH AUDI P, V9, P201, DOI 10.1109/89.905995 NR 22 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2006 VL 48 IS 8 BP 1047 EP 1055 DI 10.1016/j.specom.2006.01.005 PG 9 WC Acoustics; Communication; Computer Science, Interdisciplinary Applications; Language & Linguistics SC Acoustics; Communication; Computer Science; Linguistics GA 069KX UT WOS:000239446300013 ER PT J AU Bimbot, F Faundez-Zanuy, M de Mori, R AF Bimbot, Frederic Faundez-Zanuy, Marcos de Mori, Renato TI Special issue: NOLISP '03 SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Inst Rech Informat & Syst Aleatoires, CNRS, F-35042 Rennes, France. Inst Natl Rech Informat & Automat, F-35042 Rennes, France. Escola Univ Politecn Mataro, Dept Telecommun, Barcelona 08303, Spain. Univ Avignon, Lab Informat, F-84911 Avignon 9, France. RP Bimbot, F (reprint author), Inst Rech Informat & Syst Aleatoires, CNRS, Campus Univ Beaulieu, F-35042 Rennes, France. EM bimbot@irisa.fr; faundez@eupmt.es; renato.demori@lia.univ-avignon.fr RI Faundez-Zanuy, Marcos/F-6503-2012 OI Faundez-Zanuy, Marcos/0000-0003-0605-1282 NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2006 VL 48 IS 7 BP 759 EP 759 DI 10.1016/j.specom.2006.03.001 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 065SC UT WOS:000239178600001 ER PT J AU Indrebo, KM Povinelli, RJ Johnson, MT AF Indrebo, Kevin M. Povinelli, Richard J. Johnson, Michael T. TI Sub-banded reconstructed phase spaces for speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Research Workshop on Non-Linear Speech Processing (NOLISP) CY MAY 20-23, 2003 CL Le Croisic, FRANCE SP IRISA, IRCCyN, ISCA DE speech recognition; dynamical systems; nonlinear signal processing; sub-bands ID TIME-SERIES; MODELS AB A novel method combining filter banks and reconstructed phase spaces is proposed for the modeling and classification of speech. Reconstructed phase spaces, which are based on dynamical systems theory, have advantages over spectral-based analysis methods in that they can capture nonlinear or higher-order statistics.
Recent work has shown that the natural measure of a reconstructed phase space can be used for modeling and classification of phonemes. In this work, sub-banding of speech, which has been examined for recognition of noise-corrupted speech, is studied in combination with phase space reconstruction. This sub-banding, which is motivated by empirical psychoacoustical studies, is shown to dramatically improve the phoneme classification accuracy of reconstructed phase space-based approaches. Experiments that examine the performance of fused sub-banded reconstructed phase spaces for phoneme classification are presented. Comparisons against a cepstral-based classifier show that the proposed approach is competitive with state-of-the-art methods for modeling and classification of phonemes. Combination of cepstral-based features and the sub-band RPS features shows improvement over a cepstral-only baseline. (C) 2005 Elsevier B.V. All rights reserved. C1 Marquette Univ, Dept Elect & Comp Engn, Milwaukee, WI 53233 USA. RP Indrebo, KM (reprint author), Marquette Univ, Dept Elect & Comp Engn, 1515 W Wisconsin Ave, Milwaukee, WI 53233 USA. EM kevin.indrebo@marquette.edu; richard.povinelli@marquette.edu; mike.johnson@marquette.edu CR Abarbanel H., ANAL OBSERVED CHAOTI Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 BADII R, 1988, PHYS REV LETT, V60, P979, DOI 10.1103/PhysRevLett.60.979 Banbrook M, 1994, IEE C EXPL CHAOS SIG Banbrook M, 1999, IEEE T SPEECH AUDI P, V7, P1, DOI 10.1109/89.736326 Bourlard H, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P426 BOURLARD H, 1997, INT C AC SPEECH SIGN, P21 CHENNAOUI A, 1990, PHYS REV A, V41, P4151, DOI 10.1103/PhysRevA.41.4151 DIMITRIADIS D, 2002, INT C AC SPEECH SIGN, pI377 Duda R. O., 2001, PATTERN CLASSIFICATI Fletcher H., 1953, SPEECH HEARING COMMU Garofolo JS, 1993, TIMIT ACOUSTIC PHONE GIBSON JF, 1992, PHYSICA D, V57, P1, DOI 10.1016/0167-2789(92)90085-2 Gold B., 2000, SPEECH AUDIO SIGNAL Hagen A, 2001, INT CONF ACOUST SPEE, P257, DOI 10.1109/ICASSP.2001.940816 HERMANSKY H, 1996, 4 INT C SPOK LANG IC, V461, P462 INDREBO KM, 2003, ISCA TUT RES WORKSH, P107 Isabelle S. H., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. 
No.92CH3103-9), DOI 10.1109/ICASSP.1992.226468 JOHNSON MT, IEEE T SPEECH AUDIO Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881 KUBIN G, 1995, NONLINEAR SPEECH PRO LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 LINDGREN AC, 2003, THESIS MARQUETTE U M Lindgren A.C., 2003, INT C AC SPEECH SIGN, P61 LINDGREN AC, 2004, INT C AC SPEECH SIGN, pI533 MCCOURT P, 1998, P 1998 IEEE INT C AC, V551, P557 Misra H, 2003, INT CONF ACOUST SPEE, P741 Moreno A, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1281 NELDER JA, 1965, COMPUT J, V7, P308 PITSIKALIS V, 2002, INT C ACOUSTICS SPEE, V531, pI533 Povinelli RJ, 2002, IEEE T ENERGY CONVER, V17, P39, DOI 10.1109/60.986435 Povinelli RJ, 2004, IEEE T KNOWL DATA EN, V16, P779, DOI 10.1109/TKDE.2004.17 Roberts F, 2001, PRINCIPLES PRACTICE, P411 SAUER T, 1991, J STAT PHYS, V65, P579, DOI 10.1007/BF01053745 Schafer G., 1976, MATH THEORY EVIDENCE Takens F, 1981, DYNAMICAL SYSTEMS TU, P366 TEAGER HM, 1990, NATO ADV SCI I D-BEH, V55, P241 TIBREWALA S, 1997, IEEE INT C ACOUSTICS, V1252, P1255 WEI G, 2002, S INSTR SCI TECHN, P637 YE J, 2003, ISCA TUT RES WORKSH, P11 YE J, 2002, IEEE SIGN PROC SOC 1 NR 41 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2006 VL 48 IS 7 BP 760 EP 774 DI 10.1016/j.specom.2004.12.002 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 065SC UT WOS:000239178600002 ER PT J AU Rank, E Kubin, G AF Rank, Erhard Kubin, Gernot TI An oscillator-plus-noise model for speech synthesis SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Research Workshop on Non-Linear Speech Processing (NOLISP) CY MAY 20-23, 2003 CL Le Croisic, FRANCE SP IRISA, IRCCyN, ISCA DE non-linear time-series; oscillator model; speech production; noise modulation ID VECTOR MACHINE; WAVE-FORM; ALGORITHM; NETWORKS; VOWELS; JITTER AB The autonomous oscillator model for speech synthesis is augmented by a non-linear predictor to re-generate the modulated noise-like signal component of speech signals. The resulting 'oscillator-plus-noise' model in combination with vocal tract modeling by linear prediction is able to re-generate the spectral content of stationary wide-band vowel signals with high fidelity. For adequate modeling of mixed-excitation speech signals (such as voiced fricatives), the model is extended by a second linear prediction path for the independent spectral shaping of the noise-like component. With one and the same model, not only sustained voiced and mixed-excitation phonemes, but also stationary unvoiced sounds can be re-generated faithfully. (C) 2005 Elsevier B.V. All rights reserved. C1 Graz Univ Technol, Signal Proc & Speech Commun Lab, A-8010 Graz, Austria. Vienna Univ Technol, Inst Commun & Radio Frequency Engn, Vienna, Austria. RP Rank, E (reprint author), Graz Univ Technol, Signal Proc & Speech Commun Lab, Inffeldgasse 12, A-8010 Graz, Austria. 
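The oscillator model in the Rank and Kubin record above, like the reconstructed phase spaces in the Indrebo et al. record, builds on time-delay (Takens) embedding of the speech signal; run autonomously, a nonlinear predictor in that space re-generates the signal. A minimal sketch in which a nearest-neighbour predictor stands in for the paper's trained predictor, with arbitrary embedding dimension and lag:

```python
import numpy as np

def embed(x, dim, tau):
    """Takens-style delay embedding: row n is [x[n-(dim-1)tau], ..., x[n]]."""
    start = (dim - 1) * tau
    cols = [x[start - k * tau: len(x) - k * tau] for k in range(dim - 1, -1, -1)]
    return np.stack(cols, axis=1)

def free_run(x_train, dim=3, tau=8, n_out=400):
    """Autonomous oscillator: predict the next sample from the nearest
    phase-space neighbour, then feed the prediction back."""
    start = (dim - 1) * tau
    X = embed(x_train, dim, tau)[:-1]          # states ...
    y = x_train[start + 1:]                    # ... and their next samples
    buf = list(x_train[-(start + 1):])         # seed with the signal tail
    for _ in range(n_out):
        state = np.array([buf[-1 - k * tau] for k in range(dim - 1, -1, -1)])
        buf.append(y[np.argmin(np.sum((X - state) ** 2, axis=1))])
    return np.array(buf[start + 1:])

n = np.arange(2000)  # toy "voiced" training signal: a noisy sine
synth = free_run(np.sin(2 * np.pi * n / 64) + 0.01 * np.random.randn(n.size))
```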
EM erank@tugraz.at; g.kubin@ieee.org CR ABARBANEL HDI, 1993, REV MOD PHYS, V65, P1331, DOI 10.1103/RevModPhys.65.1331 BAILLY G, 2002, COST, V258, P39 BERNHARD HP, 1991, P 13 GRETSI S SIGN I, P1301 BIRGMEIER M, 1995, P IEEE INT C NEUR NE, P259 BLACK AW, 2003, FESTIVAL SPEECH SYNT CHILDERS DG, 1995, J ACOUST SOC AM, V97, P505, DOI 10.1121/1.412276 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 FANT G, 1985, 4 ROYAL I TECHN SPEE Haykin S, 1996, IEEE SIGNAL PROC MAG, V13, P24, DOI 10.1109/79.487040 Hegger R, 1999, CHAOS, V9, P413, DOI 10.1063/1.166424 HERMES DJ, 1991, SPEECH COMMUN, V10, P497, DOI 10.1016/0167-6393(91)90053-V Jackson P., 2000, P 5 SPEECH PROD SEM, P185 Jackson PJB, 2000, J ACOUST SOC AM, V108, P1421, DOI 10.1121/1.1289207 KUBIN G, 1993, P IEEE WORKSH SPEECH, P1 Kubin G., 1995, SPEECH CODING SYNTHE, P557 KUBIN G, IN PRESS LECT NOTES Kubin G., 1996, P INT C AC SPEECH SI, V1, P267 Laroche J., 1993, P INT C AC SPEECH SI, V2, P550 LEITH D, 2000, 5 IMA INT C MATH SIG Li JM, 2003, PROC INT C TOOLS ART, P259 Lu H.-L., 2000, P INT COMP MUS C BER, P90 MACKAY DJC, 1992, NEURAL COMPUT, V4, P415, DOI 10.1162/neco.1992.4.3.415 MAKHOUL J, 1978, P INT C AC SPEECH SI, V3, P163 MANN I, 1999, P EUR C SPEECH COMM, V5, P2315 Mann I, 2001, SIGNAL PROCESS, V81, P1743, DOI 10.1016/S0165-1684(01)00087-1 MANN IM, 1999, THESIS U EDINBURGH NARASIMHAN K, 1999, P INT C AC SPEECH SI, P389 Narayanan S, 2000, IEEE T SPEECH AUDI P, V8, P328, DOI 10.1109/89.841215 PINTO NB, 1989, IEEE T ACOUST SPEECH, V37, P1870, DOI 10.1109/29.45534 Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109 Poggio T., 1989, 1140 AI MIT PRINCIPE J, 1997, P 1 EUR C SIGN AN PR RANK E, 2001, LECT NOTES COMPUTER, V2085, P746 RANK E, 2002, COST 277 MCM GRAZ AU Rank E, 2003, SIGNAL PROCESS, V83, P1393, DOI 10.1016/S0165-1684(03)00088-4 SAUER T, 1991, J STAT PHYS, V65, P579, DOI 10.1007/BF01053745 Schoentgen J, 1997, SPEECH COMMUN, V21, P255, DOI 10.1016/S0167-6393(97)00008-3 SCHOENTGEN J, 1991, SPEECH COMMUN, V10, P533, DOI 10.1016/0167-6393(91)90056-Y SKOGLUND J, 1998, P INT C SPOK LANG PR, V5, P1791 Stylianou Y, 2001, IEEE T SPEECH AUDI P, V9, P21, DOI 10.1109/89.890068 Stylianou Y., 1995, P EUROSPEECH, P451 TAKENS F, 1981, WARWICK 1980 LECT NO, V898, P366 Tikhonov A. N., 1977, SOLUTION ILL POSED P Tipping ME, 2001, J MACH LEARN RES, V1, P211, DOI 10.1162/15324430152748236 NR 44 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2006 VL 48 IS 7 BP 775 EP 801 DI 10.1016/j.specom.2005.02.004 PG 27 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 065SC UT WOS:000239178600003 ER PT J AU Salvi, G AF Salvi, Giampiero TI Dynamic behaviour of connectionist speech recognition with strong latency constraints SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Research Workshop on Non-Linear Speech Processing (NOLISP) CY MAY 20-23, 2003 CL Le Croisic, FRANCE SP IRISA, IRCCyN, ISCA DE speech recognition; neural network; low latency; non-linear dynamics AB This paper describes the use of connectionist techniques in phonetic speech recognition with strong latency constraints. The constraints are imposed by the task of deriving the lip movements of a synthetic face in real time from the speech signal, by feeding the phonetic string into an articulatory synthesiser. 
Particular attention has been paid to analysing the interaction between the time evolution model learnt by the multi-layer perceptrons and the transition model imposed by the Viterbi decoder, in different latency conditions. Two experiments were conducted in which the time dependencies in the language model (LM) were controlled by a parameter. The results show a strong interaction between the three factors involved, namely the neural network topology, the length of time dependencies in the LM and the decoder latency. (C) 2005 Elsevier B.V. All rights reserved. C1 Royal Inst Technol, Dept Speech Mus & Hearing, S-10044 Stockholm, Sweden. RP Salvi, G (reprint author), Royal Inst Technol, Dept Speech Mus & Hearing, Lindstedtsv 24, S-10044 Stockholm, Sweden. EM giampi@kth.se CR Beskow J., 2004, J SPEECH TECHNOLOGY, V4, P335 BOURLARD H, 1993, IEEE T NEURAL NETWOR, V4, P893, DOI 10.1109/72.286885 Elenius K., 2000, International Journal of Speech Technology, V3, DOI 10.1023/A:1009641213324 Imai T., 2000, ICASSP, P1937 Karlsson I., 2003, P EUR, P1297 KWAN D, 1998, IEEE T COMMUN, V46, P565 LINDBERG B, 2000, 6 INT C SPOK LANG PR, V3, P370 LJOLJE A, 2000, SPEECH TRANSCR WORKS R Development Core Team, 2003, R LANG ENV STAT COMP Robinson AJ, 2002, SPEECH COMMUN, V37, P27, DOI 10.1016/S0167-6393(01)00058-9 ROBINSON AJ, 1994, IEEE T NEURAL NETWOR, V5, P298, DOI 10.1109/72.279192 SALVI G, 2003, ISCA TUT RES WORKSH STROM N, 1996, NICO TOOLKIT ARTIFIC STROM N, 1992, TMH QPSR, V26, P1 VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 WEATHERS AD, 1999, P IEEE INT C COMM, V3, P1951 WERBOS PJ, 1990, P IEEE, V78, P1550, DOI 10.1109/5.58337 Williams G, 1999, COMPUT SPEECH LANG, V13, P395, DOI 10.1006/csla.1999.0129 Young Steve, 2002, HTK BOOK VERSION 3 2 NR 19 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2006 VL 48 IS 7 BP 802 EP 818 DI 10.1016/j.specom.2005.05.005 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 065SC UT WOS:000239178600004 ER PT J AU Dimitriadis, D Maragos, P AF Dimitriadis, Dimitrios Maragos, Petros TI Continuous energy demodulation methods and application to speech analysis SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Research Workshop on Non-Linear Speech Processing (NOLISP) CY MAY 20-23, 2003 CL Le Croisic, FRANCE SP IRISA, IRCCyN, ISCA DE nonstationary speech analysis; energy operators; AM-FM modulations; demodulation; Gabor filterbanks; feature distributions; ASR; robust features; nonlinear speech analysis ID SIGNAL; FREQUENCY AB Speech resonance signals appear to contain significant amplitude and frequency modulations. An efficient demodulation approach is based on energy operators. In this paper, we develop two new robust methods for energy-based speech demodulation and compare their performance on both test and actual speech signals. The first method uses smoothing splines for discrete-to-continuous signal approximation. The second (and best) method uses time-derivatives of Gabor filters. Further, we apply the best demodulation method to explore the statistical distribution of speech modulation features and study their properties regarding applications of speech classification and recognition. Finally, we present some preliminary recognition results and underline their improvements when compared to the corresponding MFCC results. (C) 2005 Elsevier B.V. 
All rights reserved. C1 Natl Tech Univ Athens, Sch Elect & Comp Engn, GR-15773 Athens, Greece. RP Dimitriadis, D (reprint author), Natl Tech Univ Athens, Sch Elect & Comp Engn, Iroon Polytexneiou Str, GR-15773 Athens, Greece. EM ddim@cs.ntua.gr; maragos@cs.ntua.gr CR ALDROUBI A, 1992, SIGNAL PROCESS, V28, P127, DOI 10.1016/0165-1684(92)90030-Z BOVIK AC, 1993, IEEE T SIGNAL PROCES, V41 DIMITRIADIS D, 2003, P EUR 03 GEN SEPT DIMITRIADIS D, 2002, P ICASSP 02 ORL FL M DIMITRIADIS D, 2001, P ICASSP 01 SALT LAK Duda R. O., 2001, PATTERN CLASSIFICATI Fertig LB, 1996, IEEE SIGNAL PROC LET, V3, P54, DOI 10.1109/97.484216 Kaiser J. F., 1983, VOCAL FOLD PHYSL BIO, P358 Kaiser J. F., 1990, P IEEE INT C AC SPEE, P381 MARAGOS P, 1993, IEEE T SIGNAL PROCES, V41, P3024, DOI 10.1109/78.277799 Potamianos A, 1996, J ACOUST SOC AM, V99, P3795, DOI 10.1121/1.414997 MARAGOS P, 1995, IEEE SIGNAL PROC LET, V2, P152, DOI 10.1109/97.404130 Papoulis A., 1962, FOURIER INTEGRAL ITS Potamianos A, 1999, SPEECH COMMUN, V28, P195, DOI 10.1016/S0167-6393(99)00012-6 POTAMIANOS A, 1994, SIGNAL PROCESS, V37, P95, DOI 10.1016/0165-1684(94)90169-4 Rabiner L.R., 1978, DIGITAL PROCESSING S Ramalingam CS, 1996, IEEE SIGNAL PROC LET, V3, P141, DOI 10.1109/97.491655 UNSER M, 1991, IEEE T PATTERN ANAL, V13, P277, DOI 10.1109/34.75515 UNSER M, 1993, IEEE T SIGNAL PROCES, V41, P821, DOI 10.1109/78.193220 Young S., 2002, HTK BOOK HTK VERSION NR 20 TC 14 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2006 VL 48 IS 7 BP 819 EP 837 DI 10.1016/j.specom.2005.08.007 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 065SC UT WOS:000239178600005 ER PT J AU Faundez-Zanuy, M AF Faundez-Zanuy, Marcos TI Speech coding through adaptive combined nonlinear prediction SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Research Workshop on Non-Linear Speech Processing (NOLISP) CY MAY 20-23, 2003 CL Le Croisic, FRANCE SP IRISA, IRCCyN, ISCA DE speech coding; nonlinear prediction; neural networks; data fusion AB In this paper we propose a nonlinear predictive speech encoder based on an adaptive combiner with a neural net that weights the predictions of several nonlinear predictors. Thus, we exploit the advantages of data fusion in a nonlinear prediction scheme, where it appears in a more natural way than for linear predictors. Experimental results reveal that this scheme outperforms the fixed combination (with mean, median, etc. operators) by up to 1.5 dB in SEGSNR. (C) 2005 Elsevier B.V. All rights reserved. C1 Escola Univ Politecn Mataro, Barcelona 08303, Spain. RP Faundez-Zanuy, M (reprint author), Escola Univ Politecn Mataro, Avda Puig & Cadafalch 101-111, Barcelona 08303, Spain. EM faundez@eupmt.es RI Faundez-Zanuy, Marcos/F-6503-2012 OI Faundez-Zanuy, Marcos/0000-0003-0605-1282 CR Banbrook M, 1999, IEEE T SPEECH AUDI P, V7, P1, DOI 10.1109/89.736326 Birgmeier M, 1996, P EUSIPCO, V1, P459 CAMPBELL KP, 1999, BIOMETRICS PERSONAL, pCH8 FAUNDEZZANUY M, 2000, EUSIPCO 2000 TAMP, V2, P813 Faundez-Zanuy M, 2005, IEEE AERO EL SYS MAG, V20, P34, DOI 10.1109/MAES.2005.1396793 Faundez-Zanuy M., 2002, Control and Intelligent Systems, V30 Foresee F.-D., 1997, P 1997 INT JOINT C N, P1930, DOI DOI 10.1109/ICNN.1997.614194 GERSHO A, 1990, IEEE T COMMUN, V38, P1285, DOI 10.1109/26.61363 HAYKIN S, 1999, COMMITTEE MACHINES N Jain A.
K., 1996, IEEE COMPUTER MAR, P31 JAYANT NS, 1984, DIGITAL COMPRESSION KUBIN G, 1996, ICASSP MACKAY DJC, 1992, NEURAL COMPUT, V4, P415, DOI 10.1162/neco.1992.4.3.415 Mann I, 2001, SIGNAL PROCESS, V81, P1743, DOI 10.1016/S0165-1684(01)00087-1 MUMOLO E, 1993, WORKSH NONL DIG SIGN NARAYANAN SS, 1995, J ACOUST SOC AM, V97, P2511, DOI 10.1121/1.411971 Perrone MP, 1993, NEURAL NETWORKS SPEE PORTER J., 1993, P ICASSP, P375 Reynolds D.A., 1995, IEEE T SPEECH AUDIO, V3 SOONG FK, 1988, IEEE T ACOUST SPEECH, V36, P871, DOI 10.1109/29.1598 THYSSEN J, 1994, P IEEE INT C AC SPEE, V1, P185 Webb A. R., 2002, STAT PATTERN RECOGNI, V2nd NR 22 TC 0 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2006 VL 48 IS 7 BP 838 EP 847 DI 10.1016/j.specom.2005.09.007 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 065SC UT WOS:000239178600006 ER PT J AU Benaroya, L Bimbot, F Gravier, G Gribonval, R AF Benaroya, Laurent Bimbot, Frederic Gravier, Guillaume Gribonval, Remi TI Experiments in audio source separation with one sensor for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Research Workshop on Non-Linear Speech Processing (NOLISP) CY MAY 20-23, 2003 CL Le Croisic, FRANCE SP IRISA, IRCCyN, ISCA DE noise suppression; source separation; speech enhancement; speech recognition AB This paper focuses on the problem of noise compensation in speech signals for robust speech recognition. We investigate a novel paradigm based on source separation techniques to remove music from speech, a common situation in broadcast news transcription tasks. The two methods proposed, namely adaptive Wiener filtering and adaptive shrinkage, rely on the use of a dictionary of spectral shapes to deal with the non-stationarity of the signals. Unlike most classical noise suppression methods, we assume a prior knowledge of the sources that are mixed. The proposed algorithms are compared to simple standard approaches on the source separation task and assessed in terms of average distortion. Their effect on the entire transcription system is eventually compared in terms of word error rate. Results indicate that source separation techniques show some effectiveness for robust transcription at signal-to-noise ratios lower than 15 dB. We also observe that the improvement of the word error rate is correlated to the spectral distortion rather than to specific source separation performance measures such as the signal to interference ratio. (C) 2005 Elsevier B.V. All rights reserved. C1 Inst Rech Informat & Syst Aleatoires, Equipe METISS, F-35042 Rennes, France. RP Bimbot, F (reprint author), Inst Rech Informat & Syst Aleatoires, Equipe METISS, Campus Univ Beaulieu, F-35042 Rennes, France. EM laurent.benaroya@irisa.fr; frederic.bimbot@irisa.fr; guillaume.gravier@irisa.fr; remi.gribonval@irisa.fr CR BENAROYA L, 2003, THESIS U RENNES 1 BENAROYA L, IEEE T SPEECH AUDIO BOLL S, 1979, IEEE T ACOUST SPEECH, V28 DOLMAZON JM, 1997, ACT PREM JST FRANC 1, P13 DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FLORES JAN, 1994, P ICASSPP, P409 Gales M. J., 1995, THESIS U CAMBRIDGE Gauvain J., 1994, IEEE T SPEECH AUDIO, V2 Gribonval R., 2003, P 4 INT S IND COMP A, P763 Lamel L. F., 1991, P EUR C SPEECH COMM, P505 Molau S., 2003, P IEEE INT C SPEECH, P656 Wolfe P.
J., 2001, Proceedings of the 11th IEEE Signal Processing Workshop on Statistical Signal Processing (Cat. No.01TH8563), DOI 10.1109/SSP.2001.955331 NR 13 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2006 VL 48 IS 7 BP 848 EP 854 DI 10.1016/j.specom.2005.11.002 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 065SC UT WOS:000239178600007 ER PT J AU Kim, S Frisina, RD Frisina, DR AF Kim, SungHee Frisina, Robert D. Frisina, D. Robert TI Effects of age on speech understanding in normal hearing listeners: Relationship between the auditory efferent system and speech intelligibility in noise SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Research Workshop on Non-Linear Speech Processing (NOLISP) CY MAY 20-23, 2003 CL Le Croisic, FRANCE SP IRISA, IRCCyN, ISCA DE aging; presbycusis; medial efferent system; release from masking; cocktail party effect ID OLIVOCOCHLEAR BUNDLE STIMULATION; MASKING-LEVEL DIFFERENCES; OTOACOUSTIC EMISSIONS; CONTRALATERAL SUPPRESSION; ACOUSTIC STIMULATION; COCHLEAR MECHANICS; DISTORTION-PRODUCT; HUMANS; TONES AB Human listeners are able to listen to one voice in the midst of other conversations and background noise. Although the neural mechanisms for this process are not well understood, there is growing evidence that the medial olivocochlear (MOC) auditory efferent system located in the brainstem is involved in the detection of signals in noise, such as speech sounds, by modulating cochlear (inner ear) active physiological mechanisms. The present study examined MOC efferent (feedback) effects as revealed in distortion product otoacoustic emissions (DPOAEs) and effects of spatial separation for speech perception in background noise. Both spatial separation of speech in noise and contralateral suppression (CS) of DPOAE invoke neural mechanisms central to the inner ear. We sought to determine whether these tasks might be related and thereby represent involvement in the ability to listen to one voice in the midst of concurrent conversation and background noise, a situation in which elderly listeners have great difficulty. The Hearing in Noise Test (HINT) [Nilsson, M., Soli, S.D., Sullivan, J.A., 1994. Development of the hearing in noise test for the measurement of speech reception thresholds in quiet and in noise. J. Acoust. Soc. Amer. 95, 1085-1099] was used to obtain release from masking (RFM) estimates of speech in noise. DPOAEs were used to measure effects of broadband noise introduced to the ear contralateral to the target ear (CS). Significant age effects were found in both domains. Significant correlations resulted between speech perception in noise (RFM) and the degree of CS of DPOAEs. RFM was significantly related only to age and CS at the DPOAE narrow-band frequency of 1-2 kHz, a frequency band critical for successful speech perception. These findings suggest that the MOC efferent system and neural mechanisms underlying RFM are related and contribute to sound source determination commonly referred to as the "cocktail party effect". (C) 2006 Elsevier B.V. All rights reserved. C1 Univ Rochester, Sch Med & Dent, Dept Otolaryngol, Rochester, NY USA. Univ Rochester, Sch Med & Dent, Dept Neurobiol & Anat, Rochester, NY 14642 USA. Univ Rochester, Sch Med & Dent, Dept Biomed Engn, Rochester, NY USA.
Rochester Inst Technol, Natl Tech Inst Deaf, Int Ctr Hearing & Speech Res, Rochester, NY 14623 USA. RP Kim, S (reprint author), Daegu Fatima Hosp, Dept Otolaryngol, 576-31 Shinam 4 Dong, Taegu 701600, South Korea. EM sungheekim@fatima.or.kr; rdf@q.ent.rochester.edu; rxf1389@rit.edu CR Abdala C, 1999, J ACOUST SOC AM, V105, P2392, DOI 10.1121/1.426844 CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 CODY AR, 1982, HEARING RES, V6, P199, DOI 10.1016/0378-5955(82)90054-5 COMIS S D, 1973, Journal of Laryngology and Otology, V87, P529, DOI 10.1017/S0022215100077252 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 DEWSON JH, 1968, J NEUROPHYSIOL, V31, P122 Frisina D. R., 2001, FUNCTIONAL NEUROBIOL, P565 GIGUERE C, 1994, J ACOUST SOC AM, V95, P331 Gilkey R., 1997, BINAURAL SPATIAL HEA, P329 Giraud AL, 1997, NEUROREPORT, V8, P1779 James AL, 2002, CLIN OTOLARYNGOL, V27, P106, DOI 10.1046/j.1365-2273.2002.00541.x KAWASE T, 1993, J NEUROPHYSIOL, V70, P2533 KEMP DT, 1978, J ACOUST SOC AM, V64, P1386, DOI 10.1121/1.382104 Kim SH, 2002, AUDIOL NEURO-OTOL, V7, P348, DOI 10.1159/000066159 Levitt H., 1971, J ACOUST SOC AM, V16, P331 LIBERMAN MC, 1995, HEARING RES, V90, P158, DOI 10.1016/0378-5955(95)00160-2 LIBERMAN MC, 1988, J NEUROPHYSIOL, V60, P1779 LITTMAN TA, 1992, J ACOUST SOC AM, V92, P1945, DOI 10.1121/1.405242 LYNN GE, 1981, ARCH OTOLARYNGOL, V107, P357 Micheyl C, 1996, J ACOUST SOC AM, V99, P1604, DOI 10.1121/1.414734 MINSLEY GE, 1988, J ACOUST SOC AM, V83, P820, DOI 10.1121/1.396127 MOTT JB, 1989, HEARING RES, V38, P229, DOI 10.1016/0378-5955(89)90068-3 MOUNTAIN DC, 1980, SCIENCE, V210, P71, DOI 10.1126/science.7414321 NIEDER P, 1970, EXP NEUROL, V28, P179, DOI 10.1016/0014-4886(70)90172-X NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 OLSEN WO, 1976, ANN OTO RHINOL LARYN, V85, P820 RAJAN R, 1988, J NEUROPHYSIOL, V60, P569 ROUAT J, 2002, EUSIPCO 2002 SIEGEL JH, 1982, HEARING RES, V6, P171, DOI 10.1016/0378-5955(82)90052-1 Walsh EJ, 1998, J NEUROSCI, V18, P3859 WILLIAMS DM, 1995, J ACOUST SOC AM, V97, P1130, DOI 10.1121/1.412226 WILLIAMS EA, 1994, ACTA OTO-LARYNGOL, V114, P121, DOI 10.3109/00016489409126029 NR 32 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2006 VL 48 IS 7 BP 855 EP 862 DI 10.1016/j.specom.2006.03.004 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 065SC UT WOS:000239178600008 ER PT J AU Kim, SH Frisina, RD Mapes, FM Hickman, ED Frisina, DR AF Kim, SungHee Frisina, Robert D. Mapes, Frances M. Hickman, Elizabeth D. Frisina, D. Robert TI Effect of age on binaural speech intelligibility in normal hearing adults SO SPEECH COMMUNICATION LA English DT Article DE age; presbycusis; HINT; speech intelligibility in noise ID RECEPTION THRESHOLD; ELDERLY LISTENERS; NOISE; RECOGNITION; QUIET; PRESBYCUSIS; YOUNG AB Sentence perception performance, in quiet and in background noise, was measured in three groups of adult subjects categorized as young, middle-aged, and elderly. Pure tone audiometric thresholds, measures of inner ear function, obtained in all subjects were within the clinically normal hearing range.
The primary purpose of this study was to determine the effect of age on speech perception; a secondary purpose was to determine if the speech recognition problem commonly reported in elderly subjects might be due to alterations at sites central to the peripheral nervous system/inner ear. Standardized sentence lists were presented in free field conditions in order to invoke binaural hearing that occurs at the brainstem level, and to simulate everyday speech-in-noise listening conditions. The results indicated: (1) an age effect on speech perception performance in quiet and in noise backgrounds, (2) absolute pure tone thresholds conventionally obtained monaurally do not accurately predict suprathreshold speech perception performance in elderly subjects, and (3) by implication the listening problems of the elderly may be influenced by auditory processing changes upstream of the inner ear. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Rochester, Sch Med & Dent, Dept Otolaryngol, Rochester, NY 14642 USA. Rochester Inst Technol, Natl Tech Inst Deaf, Int Ctr Hearing & Speech Res, Rochester, NY 14623 USA. RP Kim, S (reprint author), Daegu Fatima Hosp, Dept Otolaryngol, 576-31 Shinam 4 Dong, Taegu 701600, South Korea. EM sungheekim@fatima.or.kr CR Beattie RC, 1997, BRIT J AUDIOL, V31, P153, DOI 10.3109/03005364000000018 BRANT LJ, 1990, J ACOUST SOC AM, V88, P813, DOI 10.1121/1.399731 BRONKHORST AW, 1990, AUDIOLOGY, V29, P275 Committee on Hearing Bioacoustics and Biomechanics (CHABA), 1988, J ACOUST SOC AM, V83, P859 DIRKS DD, 1969, J SPEECH HEAR RES, V12, P644 Folstein MF, 1975, J PSYCHIATR RES, V12, P198 Frisina D. R., 2001, FUNCTIONAL NEUROBIOL, P565 Frisina DR, 1997, HEARING RES, V106, P95, DOI 10.1016/S0378-5955(97)00006-3 GORDONSALANT S, 1987, EAR HEARING, V8, P270, DOI 10.1097/00003446-198710000-00003 GORDONSALANT S, 1993, J SPEECH HEAR RES, V36, P1276 HAGERMAN B, 1984, SCAND AUDIOL, V13, P57, DOI 10.3109/01050398409076258 Halling DC, 2000, J SPEECH LANG HEAR R, V43, P414 HIRSH IJ, 1950, J ACOUST SOC AM, V22, P196, DOI 10.1121/1.1906588 HUMES LE, 1991, J SPEECH HEAR RES, V34, P686 Humes L E, 1996, J Am Acad Audiol, V7, P161 Jerger James, 2001, Seminars in Hearing, V22, P255, DOI 10.1055/s-2001-15630 JERGER JF, 1976, ARCH OTOLARYNGOL, V102, P614 KALIKOW DN, 1977, J ACOUST SOC AM, V61, P623 Levitt H. L., 1971, J ACOUST SOC AM, V49, P476 Mazelova J, 2003, EXP GERONTOL, V38, P87, DOI 10.1016/S0531-5565(02)00155-9 MOORE BCJ, 1992, BRIT J AUDIOL, V26, P369, DOI 10.3109/03005369209076661 Morrell CH, 1996, J ACOUST SOC AM, V100, P1949, DOI 10.1121/1.417906 NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 Noordhoek IM, 2001, J ACOUST SOC AM, V109, P1197, DOI 10.1121/1.1349429 OSTRI B, 1991, British Journal of Audiology, V25, P41, DOI 10.3109/03005369109077863 PEARSON JD, 1995, J ACOUST SOC AM, V97, P1196, DOI 10.1121/1.412231 PICHORAFULLER MK, 1995, J ACOUST SOC AM, V97, P593, DOI 10.1121/1.412282 Pittman AL, 2001, J SPEECH LANG HEAR R, V44, P487, DOI 10.1044/1092-4388(2001/038) PLOMP R, 1979, AUDIOLOGY, V18, P43 PLOMP R, 1986, J SPEECH HEAR RES, V29, P146 SCHOW RL, 1990, EAR HEARING, V11, pS6 Studebaker G A, 1997, J Am Acad Audiol, V8, P150 Wingfield A, 1996, J Am Acad Audiol, V7, P175 NR 33 TC 20 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
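Both Kim et al. records above estimate speech reception thresholds with the HINT's adaptive procedure, in which the sentence SNR moves down after a correct repetition and up after an error. A toy one-up/one-down staircase; the step size, trial count, and scoring below are simplified assumptions rather than the exact HINT rules:

```python
import random

def adaptive_srt(respond, n_trials=20, start_snr=0.0, step=2.0):
    """Track the SNR at which sentences are repeated correctly about
    half the time; the SRT estimate is the mean of the later levels."""
    snr, track = start_snr, []
    for _ in range(n_trials):
        track.append(snr)
        snr += -step if respond(snr) else step  # down on correct, up on error
    return sum(track[4:]) / len(track[4:])      # discard the warm-up trials

# toy listener whose true threshold is -2 dB SNR, with response noise
random.seed(0)
srt = adaptive_srt(lambda snr: snr + random.gauss(0, 1) > -2.0)
print(round(srt, 1))
```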
PD JUN PY 2006 VL 48 IS 6 BP 591 EP 597 DI 10.1016/j.specom.2005.09.004 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 052DR UT WOS:000238211900001 ER PT J AU Kakumanu, P Esposito, A Garcia, ON Gutierrez-Osuna, R AF Kakumanu, Praveen Esposito, Anna Garcia, Oscar N. Gutierrez-Osuna, Ricardo TI A comparison of acoustic coding models for speech-driven facial animation SO SPEECH COMMUNICATION LA English DT Article DE speech-driven facial animation; audio-visual mapping; linear discriminants analysis ID LIP MOVEMENT SYNTHESIS; FACE; RECOGNITION; EXPRESSIONS; PARAMETERS; CONVERSION; ALGORITHM AB This article presents a thorough experimental comparison of several acoustic modeling techniques by their ability to capture information related to orofacial motion. These models include (1) Linear Predictive Coding and Linear Spectral Frequencies, which model the dynamics of the speech production system, (2) Mel Frequency Cepstral Coefficients and Perceptual Critical Feature Bands, which encode perceptual cues of speech, (3) spectral energy and fundamental frequency, which capture prosodic aspects, and (4) two hybrid methods that combine information from the previous models. We also consider a novel supervised procedure based on Fisher's Linear Discriminants to project acoustic information onto a low-dimensional subspace that best discriminates different orofacial configurations. Prediction of orofacial motion from speech acoustics is performed using a non-parametric k-nearest-neighbors procedure. The sensitivity of this audio-visual mapping to coarticulation effects and spatial locality is thoroughly investigated. Our results indicate that the hybrid use of articulatory, perceptual and prosodic features of speech, combined with a supervised dimensionality-reduction procedure, is able to outperform any individual acoustic model for speech-driven facial animation. These results are validated on the 450 sentences of the TIMIT compact dataset. (C) 2005 Elsevier B.V. All rights reserved. C1 Texas A&M Univ, Dept Comp Sci, College Stn, TX 77843 USA. Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA. Univ Naples 2, Dept Psychol, Naples, Italy. Univ N Texas, Coll Engn, Denton, TX 76203 USA. RP Gutierrez-Osuna, R (reprint author), Texas A&M Univ, Dept Comp Sci, College Stn, TX 77843 USA. EM kpraveen@cs.wright.edu; iiass.anna@tin.it; ogarcia@unt.edu; rgutier@cs.tamu.edu CR ABRY C, 1989, J PHONETICS, V17, P47 Arslan LM, 1999, SPEECH COMMUN, V27, P81, DOI 10.1016/S0167-6393(98)00068-5 Arun K S, 1987, IEEE Trans Pattern Anal Mach Intell, V9, P698 Aversano G, 2001, PROCEEDINGS OF THE 44TH IEEE 2001 MIDWEST SYMPOSIUM ON CIRCUITS AND SYSTEMS, VOLS 1 AND 2, P516, DOI 10.1109/MWSCAS.2001.986241 BALAN N, 2003, THESIS WRIGHT STATE Benoit C, 1998, SPEECH COMMUN, V26, P117, DOI 10.1016/S0167-6393(98)00045-4 BERNSTEIN L, 1996, P INT C SPOK LANG PR, V3, P1477, DOI 10.1109/ICSLP.1996.607895 BESKOW J, 1995, P EUR 95 MADR, V1, P299 BOLL SF, 1979, IEEE T ACOUSTICS SPE, V27, P112 Brand M, 1999, P SIGGRAPH 99, P21, DOI 10.1145/311535.311537 BREGLER C, 1995, ADV NEURAL INFORMATI, V7, P401 BRYLL R, 1999, CAMERA CALIBRATION U CALDOGNETTO EM, 1989, P EUR 89 C EUR SPEEC, V2, P453 Cohen M. M., 1993, MODELS TECHNIQUES CO, P141 COIANIZ T, 1995, SPEECHREADING MAN MA, P391 Duda R.
O., 2001, PATTERN CLASSIFICATI DUTTWEILER DL, 1976, IEEE T COMMUN, V24, P864, DOI 10.1109/TCOM.1976.1093389 ESSA I, 1995, THESIS MIT CAMBRIDGE Ezzat T, 2000, INT J COMPUT VISION, V38, P45, DOI 10.1023/A:1008166717597 FINN K, 1986, THESIS GEORGETOWN U FU S, 2002, THESIS WRIGHT STATE FU S, 2005, IEEE T MULTIMEDIA, V7 GAROFOLO J, 1988, DARPA TIMIT CDROM GOLDSCHEN AJ, 1993, THESIS G WASHINGTON Greenberg S, 2003, J PHONETICS, V31, P465, DOI 10.1016/j.wocn.2003.09.005 GUTIERREZOSUNA R, 2002, CSWSU0203 Gutierrez-Osuna R, 2005, IEEE T MULTIMEDIA, V7, P33, DOI 10.1109/TMM.2004.840611 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hong PY, 2002, IEEE T NEURAL NETWOR, V13, P916, DOI 10.1109/TNN.2002.1021892 Itakura F., 1975, J ACOUST SOC AM, V57, P535 Jourlin P, 1997, PATTERN RECOGN LETT, V18, P853, DOI 10.1016/S0167-8655(97)00070-6 Kass M., 1988, INT J COMPUT VISION, V1, P321, DOI DOI 10.1007/BF00133570 Kim HK, 1999, IEEE T SPEECH AUDI P, V7, P87 Klatt D. H., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing LAVAGETTO F, 1995, IEEE T REHABIL ENG, V3, P1 Lee Y. C., 1995, P SIGGRAPH 95, P55, DOI DOI 10.1145/218380.218407 Lepsoy S, 1998, SIGNAL PROCESS-IMAGE, V13, P209, DOI 10.1016/S0923-5965(98)00006-X LUTTIN J, 1996, SPEECH READING MAN M, V150, P383 Markel JD, 1976, LINEAR PREDICTION SP Massaro D., 1999, P INT C AUD VIS SPEE, P133 Massaro D. W., 1997, PERCEIVING TALKING F McAllister DF, 1998, COMPUT NETWORKS ISDN, V30, P1975, DOI 10.1016/S0169-7552(98)00216-5 MONTGOMERY AA, 1983, J ACOUST SOC AM, V73, P2134, DOI 10.1121/1.389537 MORISHIMA S, 1991, IEEE J SEL AREA COMM, V9, P594, DOI 10.1109/49.81953 Nakamura S, 2001, J VLSI SIG PROC SYST, V27, P119, DOI 10.1023/A:1008179732362 PARKS DA, 1982, GASTROENTEROLOGY, V2, P9 PARSONS TW, 1986, VOICE SPEECH PROCESS, pCH3 Pelachaud C, 1996, COGNITIVE SCI, V20, P1 PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532 Rabiner L.R., 1978, DIGITAL PROCESSING S Rogozan A, 1998, SPEECH COMMUN, V26, P149, DOI 10.1016/S0167-6393(98)00056-9 SENEFF S, 1988, J PHONETICS, V16, P55 SHARMA S, 1998, P SPEAK REC ITS COMM, P115 Soong FK, 1993, IEEE T SPEECH AUDI P, V1, P15, DOI 10.1109/89.221364 SUMMERFIELD Q, 1979, PHONETICA, V36, P314 Tekalp AM, 2000, SIGNAL PROCESS-IMAGE, V15, P387, DOI 10.1016/S0923-5965(99)00055-7 Tibrewala S., 1997, P ICASSP, P1255 Tsai R. Y., 1987, IEEE J ROBOTIC AUTOM, V3, P323, DOI DOI 10.1109/JRA.1987.1087109 WATERS K, 1993, 934 RLE CRL Waters K., 1995, Proceedings Graphics Interface '95 Wilson DR, 2000, MACH LEARN, V38, P257, DOI 10.1023/A:1007626913721 Yamamoto E, 1998, SPEECH COMMUN, V26, P105, DOI 10.1016/S0167-6393(98)00054-5 NR 62 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2006 VL 48 IS 6 BP 598 EP 615 DI 10.1016/j.specom.2005.09.005 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 052DR UT WOS:000238211900002 ER PT J AU Zhang, T Hasegawa-Johnson, M Levinson, SE AF Zhang, Tong Hasegawa-Johnson, Mark Levinson, Stephen E. 
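The audio-visual mapping in the Kakumanu et al. record above projects acoustic features with Fisher's Linear Discriminants and predicts orofacial motion with a k-nearest-neighbours procedure. A compact sketch on random stand-in data; the feature sizes, class count, and k are illustrative assumptions, not the paper's settings:

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 13))           # acoustic frames (e.g. cepstra)
configs = rng.integers(0, 8, size=500)   # quantised orofacial configurations
Y = rng.normal(size=(500, 6))            # orofacial motion parameters

# supervised dimensionality reduction: keep directions that best
# discriminate the orofacial configurations
lda = LinearDiscriminantAnalysis(n_components=5).fit(X, configs)

# non-parametric audio-to-visual mapping in the projected subspace
knn = KNeighborsRegressor(n_neighbors=10).fit(lda.transform(X), Y)
Y_pred = knn.predict(lda.transform(X[:20]))
```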
TI Cognitive state classification in a spoken tutorial dialogue system SO SPEECH COMMUNICATION LA English DT Article DE intelligent tutoring system; user affect recognition; spoken language processing ID CHILDRENS SPEECH; RECOGNITION; COMMUNICATION; NORMALIZATION; FREQUENCY; LANGUAGE AB This paper addresses the manual and automatic labelling, from spontaneous speech, of a particular type of user affect that we call the cognitive state in a tutorial dialogue system with students of primary and early middle school ages. Our definition of the cognitive state is based on analysis of children's spontaneous speech, which is acquired during Wizard-of-Oz simulations of an intelligent math and physics tutor. The cognitive states of children are categorized into three classes: confidence, puzzlement, and hesitation. The manual labelling of cognitive states had an inter-transcriber agreement of kappa score 0.93. The automatic cognitive state labels are generated by classifying prosodic features, text features, and spectral features. Text features are generated from an automatic speech recognition (ASR) system; features include indicator functions of keyword classes and part-of-speech sequences. Spectral features are created based on acoustic likelihood scores of a cognitive state-dependent ASR system, in which phoneme models are adapted to utterances labelled for a particular cognitive state. The effectiveness of the proposed method has been tested on both manually and automatically transcribed speech, and the test yielded very high correctness: 96.6% for manually transcribed speech and 95.7% for automatically recognized speech. Our study shows that the proposed spectral features greatly outperformed the other types of features in the cognitive state classification experiments. Our study also shows that the spectral and prosodic features derived directly from speech signals were very robust to speech recognition errors, much more than the lexical and part-of-speech based features. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Illinois, Dept Elect & Comp Engn, Urbana, IL 61801 USA. RP Zhang, T (reprint author), Univ Illinois, Dept Elect & Comp Engn, 405 N Mathews Ave, Urbana, IL 61801 USA. EM tzhangl@ifp.uiuc.edu; hasegawa@ifp.uiuc.edu; sel@ifp.uiuc.edu CR ALPERT SR, 1999, INT WORKSH INSTR US Ang J., 2002, P INT C SPOK LANG PR Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 BATLINER A, 2000, VERBMOBIL FDN SPEECH BECK J, 2003, P 9 INT C US MOD JOH CLARK B, 2001, P 2001 INT WORKSH IN Cole R., 2003, P IEEE SPEC ISS MULT Corbett A. T., 1992, COMPUTER ASSISTED IN Fernandez R, 2003, SPEECH COMMUN, V40, P145, DOI 10.1016/S0167-6393(02)00080-8 Flammia G., 1998, THESIS MIT Forbes-Riley K., 2004, P HUM LANG TECHN C N Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1 GODFREY JJ, 1991, P IEEE INT C AC SPEE Graesser AC, 2001, AI MAG, V22, P39 Juang BH, 2000, P IEEE, V88, P1142 JURAFSKY D, 1997, LVCSR WORKSH BALT MD Kafai Y.
B., 1996, CONSTRUCTIONISM PRAC KANG BS, 2000, P INT C SPOK LANG PR KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 KITAZAWA S, 1997, P EUR C SPEECH COMM Lee CM, 2005, IEEE T SPEECH AUDI P, V13, P293, DOI 10.1109/TSA.2004.838534 Lee L, 1998, IEEE T SPEECH AUDI P, V6, P49, DOI 10.1109/89.650310 Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686 LESGOLD A, 1990, COMPUTER ASSISTED IN Li Q, 2002, IEEE T SPEECH AUDI P, V10, P146 LIEBERMAN P, 1962, J ACOUST SOC AM, V34, P922, DOI 10.1121/1.1918222 Litman D., 2004, P HUM LANG TECHN C N Martinovsky B., 2003, P ISCA TUT RES WORKS Medvedeva O., 2003, P 11 INT C ART INT E MORGAN N, 1998, P IEEE INT C AC SPEE MOSTOW J, 2002, P INT C SPOK LANG PR Mozziconacci S., 1998, STUDY INTONATION PAT MUNOZ M, 1999, P C EMP METH NAT LAN Narayanan S, 2002, IEEE T SPEECH AUDI P, V10, P65, DOI 10.1109/89.985544 PELLOM B, 2000, P INT C SPOK LANG PR PETRUSHIN V, 1999, INTELLIGENT ENG SYST, V9, P1085 PETRUSHIN VA, 2000, P INT C SPOK LANG PR POLZIN TS, 2000, ISCA WORKSH SPEECH E PONBERRY H, 2004, P ITS 2004 WORKSH DI Potamianos A, 2003, IEEE T SPEECH AUDI P, V11, P603, DOI 10.1109/TSA.2003.818026 REYES RL, 2000, P PHIL COMP SCI C MA *RUL RES, 2004, DAT MIN TOOLS SCHULTZ K, 2003, AIED 2003 SUPPL P SY Silliman S., 2004, P 7 INT C INT TUT SY Steele M. M., 1999, Journal of Computers in Mathematics and Science Teaching, V18 WARD W, 1999, IEEE WORKSH AUT SPEE Wilensky U., 1991, CONSTRUCTIONISM Young S, 2000, HTK BOOK ZHAN P, 1997, P IEEE INT C AC SPEE Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 50 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2006 VL 48 IS 6 BP 616 EP 632 DI 10.1016/j.specom.2005.09.006 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 052DR UT WOS:000238211900003 ER PT J AU Clopper, CG Pisoni, DB AF Clopper, Cynthia G. Pisoni, David B. TI The Nationwide Speech Project: A new corpus of American English dialects SO SPEECH COMMUNICATION LA English DT Article DE speech corpus; dialect variation; American English ID PERCEPTUAL MEASURES; CATEGORIZATION; REDUCTION; VOICE AB Perceptual and acoustic research on dialect variation in the United States requires an appropriate corpus of spoken language materials. Existing speech corpora that include dialect variation are limited by poor recording quality, small numbers of talkers, and/or small samples of speech from each talker. The Nationwide Speech Project corpus was designed to contain a large amount of speech produced by male and female talkers representing the primary regional varieties of American English. Five male and five female talkers from each of six dialect regions in the United States were recorded reading words, sentences, passages, and in interviews with an experimenter, using high quality digital recording equipment in a sound-attenuated booth. The resulting corpus contains nearly an hour of speech from each of the 60 talkers that can be used in future research on the perception and production of dialect variation. (C) 2005 Elsevier B.V. All rights reserved. C1 Northwestern Univ, Dept Linguist, Evanston, IL 60208 USA. Indiana Univ, Dept Psychol & Brain Sci, Bloomington, IN 47405 USA. RP Clopper, CG (reprint author), Northwestern Univ, Dept Linguist, 2016 Sheridan Rd, Evanston, IL 60208 USA. 
EM c-clopper@northwestern.edu; pisoni@indiana.edu CR ASH S, SAMPLING STRATEGY TE Baker KK, 1997, J SPEECH LANG HEAR R, V40, P615 Boberg C, 2001, AM SPEECH, V76, P3, DOI 10.1215/00031283-76-1-3 BYRD D, 1994, SPEECH COMMUN, V15, P39, DOI 10.1016/0167-6393(94)90039-6 CANAVAN A, 1996, LDC96S46 CANAVAN A, 1996, LDC96S47 Carter AK, 2002, LANG SPEECH, V45, P321 Carver C. M., 1987, AM REGIONAL DIALECTS CASSIDY FG, 1985, DICT AM REGIONAL ENG, V2 CASSIDY FG, 1985, DICT AM REGIONAL ENG, V4 Cassidy FG, 1985, DICT AM REGIONAL ENG, V1 CASSIDY FG, 1985, DICT AM REGIONAL ENG, V3 Clopper C. G., 2004, THESIS INDIANA U BLO CLOPPER CG, 2001, 25 IND U SPEECH RES, P367 Clopper CG, 2005, J ACOUST SOC AM, V118, P1661, DOI 10.1121/1.2000774 CLOPPER CG, 2005, METH DIAL MONCT NB, V12 Clopper CG, 2004, J PHONETICS, V32, P111, DOI 10.1016/S0095-4470(03)00009-3 Clopper CG, 2005, J LANG SOC PSYCHOL, V24, P182, DOI 10.1177/0261927X05275741 CLOPPER CG, IN PRESS J ACOUST SO Darley F.L, 1975, MOTOR SPEECH DISORDE DUBOIS JW, 2000, LDC2000S85 Fairbanks G., 1940, VOICE ARTICULATION D Feagin C., 2002, HDB LANGUAGE VARIATI, P20 Fisher W. M., 1986, P DARPA WORKSH SPEEC, P93 Gelfer MP, 2000, J VOICE, V14, P22, DOI 10.1016/S0892-1997(00)80092-2 HALL JH, 2004, DICT AM REGIONAL ENG Hillenbrand J, 1996, J SPEECH HEAR RES, V39, P311 KALIKOW DN, 1977, J ACOUST SOC AM, V61, P1337, DOI 10.1121/1.381436 Krapp George P., 1925, ENGLISH LANGUAGE AM, V2 Labov W., 1998, HDB DIALECTS LANGUAG, P39 LABOV W, IN PRESS ATLAS N AM Labov William, 1972, LANG SOC, V1, P97, DOI DOI 10.1017/S0047404500006576 LABOV W, 1972, SOCIOLINGUISTIC PATT MCDAVID RI, 1958, STRUCTURE AM ENGLISH, P480 McHenry MA, 1999, BRAIN INJURY, V13, P281, DOI 10.1080/026990599121656 Nusbaum H. C., 1984, 10 IND U SPEECH RES, P357 Pisoni David B., 2004, LANG VAR CHANGE, V16, P31 PLICHTA B, 2001, NEW WAYS AN VAR RAL, V30 Powesland P., 1997, SOCIOLINGUISTICS REA, P232 RISCHEL J, 1992, SPEECH COMMUN, V11, P379, DOI 10.1016/0167-6393(92)90043-7 ROJAS DM, 2002, THESIS U EDINBURGH Sapienza CM, 1999, J SPEECH LANG HEAR R, V42, P127 Stockwell P, 2002, SOCIOLINGUISTICS RES STRASSEL S, 2003, LDC2003T15 Thomas E., 2001, ACOUSTIC ANAL VOWEL Trudgill P., 1998, HDB DIALECT LANGUAGE, P307 WOLFRAM W, 1997, SOCIOLINGUISTICS REA, P89 ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 NR 48 TC 15 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2006 VL 48 IS 6 BP 633 EP 644 DI 10.1016/j.specom.2005.09.010 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 052DR UT WOS:000238211900004 ER PT J AU Recasens, D Espinosa, A AF Recasens, Daniel Espinosa, Aina TI Dispersion and variability of Catalan vowels SO SPEECH COMMUNICATION LA English DT Article DE vowels; Catalan; schwa; vowel spaces; contextual and non-contextual variability for vowels; acoustic analysis; electropalatography ID ENGLISH VOWELS; SPEECH; ARTICULATIONS; REDUCTION; LANGUAGES; PATTERNS; CONTRAST; CONTEXT; SCHWA; DUTCH AB Formant frequency data for Catalan vowels reveal essentially the same degree of expansion for three dialect systems with seven vowels (Valencian, Eastern Catalan, Western Catalan).
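The degree of vowel-space expansion compared in this abstract can be quantified in several ways; a common one is the mean Euclidean distance of the vowel means from the centroid of the system. A minimal Python sketch with hypothetical F1/F2 values in Hz, for illustration only, not the paper's measurements or its exact choice of metric:

import numpy as np

# Hypothetical mean (F1, F2) values in Hz for a seven-vowel system;
# illustration only, not the paper's data.
vowels = {
    "i": (290, 2250), "e": (430, 2050), "E": (580, 1850),
    "a": (750, 1450), "O": (570, 950), "o": (430, 850), "u": (300, 750),
}

def dispersion(formants):
    """Mean Euclidean distance of each vowel from the system centroid."""
    pts = np.array(list(formants.values()), dtype=float)
    centroid = pts.mean(axis=0)
    return float(np.linalg.norm(pts - centroid, axis=1).mean())

print(f"vowel-space dispersion: {dispersion(vowels):.1f} Hz")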
A slightly larger vowel space dispersion for a fourth system with those same vowels and stressed /ə/ (Majorcan) is not clearly associated with a larger vowel system size but rather with a local effect of schwa in repelling neighbouring vowels or with specific requirements on the production of some peripheral vowels. Schwa appears to be targetless or specified for a widely defined mid central target. Intervocalic distances were found to vary according to dialect and to vowel pair, and to compensate with each other such that the maximal formant frequency range between point vowels is kept constant across dialects. These findings are partially in support of the Adaptive Dispersion Theory, i.e., they are in agreement with the claim that vowel system expansion should be proportional to vowel system size but not with the notion that adjacent vowels should be evenly spaced in identical vowel systems. Patterns of vowel variability differ depending on the contextual or non-contextual factors involved, i.e., F1 shows more contextual and token-to-token variation for open vs. close vowels, while F2 exhibits little contextual variation and much token-dependent variation for /i/ and the opposite trend for /u/ and /(sic)/. These patterns are accounted for assuming that random variability for vowels is ruled by the precision involved in achieving a specific articulatory target, and that contextual variability is determined by the vowel articulatory requirements and by the relative compatibility between the articulatory gestures for adjacent vowels and consonants. (C) 2005 Elsevier B.V. All rights reserved. C1 Inst Estudis Catalans, Phonet Lab, Barcelona 08001, Spain. Univ Autonoma Barcelona, Dept Catalan Philol, E-08193 Barcelona, Spain. RP Recasens, D (reprint author), Inst Estudis Catalans, Phonet Lab, C Carme 47, Barcelona 08001, Spain. EM daniel.recasens@uab.es CR ADANK P, 2003, VOWEL NORMALIZATION BATES SA, 1995, A STROMBERGS GRAFISK, V3, P230 Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5 BRADLOW AR, 1995, J ACOUST SOC AM, V97, P1916, DOI 10.1121/1.412064 Browman C. P., 1992, PAPERS LABORATORY PH, P26 CHRISTOV P, 1987, P 11 INT C PHON SCI, V3, P121 DELATTRE P, 1969, IRAL-INT REV APPL LI, V7, P295, DOI 10.1515/iral.1969.7.4.295 Disner S.
F., 1983, UCLA WORKING PAPERS Espinosa A., 2005, J INT PHON ASSOC, V35, P1, DOI DOI 10.1017/S0025100305001878 Fant G., 1960, ACOUSTIC THEORY SPEE Flemming Edward S., 2002, AUDITORY REPRESENTAT Gick B, 2002, J PHONETICS, V30, P357, DOI 10.1006/jpho.2001.0161 Gick Bryan, 1999, PHONOLOGY, V16, P29, DOI 10.1017/S0952675799003693 GODINEZ M, 1978, UCLA WORKING PAPERS, P3 HARDCASTLE W, 1989, CLIN LINGUIST PHONET, V3, P1, DOI 10.3109/02699208908985268 Herrick D., 2003, THESIS U CALIFORNIA HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 Hillenbrand JM, 2001, J ACOUST SOC AM, V109, P748, DOI 10.1121/1.1337959 Hoole P., 1990, FORSCHUNGSBERICHTE I, V28, P107 JONGMAN A, 1989, LANG SPEECH, V32, P221 KEATING PA, 1994, J PHONETICS, V22, P407 KEATING PA, 1984, PHONETICA, V41, P191 KONDO Y, 1994, P INT C SPOK LANG PR, V94, P311 KOOPMANSVANBEINUM FJ, 1994, PHONETICA, V51, P68 KOOPMANSVANBEIN.FJ, 1973, J PHONETICS, V1, P249 LILJENCR.J, 1972, LANGUAGE, V48, P839, DOI 10.2307/411991 Lindblom Bjorn, 1986, EXPT PHONOLOGY, P13 LIVJN P, 2000, PERILUS, V23 MAGEN H, 1989, J PHONETICS, V25, P187 MANUEL SY, 1990, J ACOUST SOC AM, V88, P1286, DOI 10.1121/1.399705 MCDOUGALL K, 2003, P 6 INT SEM SPEECH P, P161 MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 Munson B, 2004, J SPEECH LANG HEAR R, V47, P1048, DOI 10.1044/1092-4388(2004/078) Nearey Terrance Michael, 1978, PHONETIC FEATURE SYS PAPCUN G, 1976, UCLA WORKING PAPERS, V31, P38 PERKELL JS, 1990, NATO ADV SCI I D-BEH, V55, P263 PERKELL JS, 1985, J ACOUST SOC AM, V77, P1889, DOI 10.1121/1.391940 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 PISONI DB, 1980, PHONETICA, V37, P285 POLS LCW, 1973, J ACOUST SOC AM, V53, P1093, DOI 10.1121/1.1913429 Recasens D., 1999, COARTICULATION THEOR, P80 RECASENS D, 1985, LANG SPEECH, V28, P97 RECASENS D, 1986, PUBLICACIONS ABADIA, P523 Schwartz JL, 1997, J PHONETICS, V25, P233, DOI 10.1006/jpho.1997.0044 STEVENS KN, 1989, J PHONETICS, V17, P3 STEVENS KN, 1963, J SPEECH HEAR RES, V6, P111 VANBERGEM DR, 1994, SPEECH COMMUN, V14, P143, DOI 10.1016/0167-6393(94)90005-1 WOOD SAJ, 1988, FOLIA LINGUIST, V22, P239, DOI 10.1515/flin.1988.22.3-4.239 NR 48 TC 23 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2006 VL 48 IS 6 BP 645 EP 666 DI 10.1016/j.specom.2005.09.011 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 052DR UT WOS:000238211900005 ER PT J AU Arvaniti, A Ladd, DR Mennen, I AF Arvaniti, Amalia Ladd, D. Robert Mennen, Ineke TI Phonetic effects of focus and "tonal crowding" in intonation: Evidence from Greek polar questions SO SPEECH COMMUNICATION LA English DT Article DE intonation; focus; tonal alignment; phrase accent; tonal crowding ID PITCH ACCENT REALIZATION; FALL-RISE INTONATION; FUNDAMENTAL-FREQUENCY; ENGLISH; CONTEXTS; MANDARIN; SPEECH; STABILITY; PHONOLOGY; ALIGNMENT AB This paper deals with the intonation of polar (yes/no) questions in Greek. An experiment was devised which systematically manipulated the position of the focused word in the question (and therefore of the intonation nucleus) and the position of the last stressed syllable. 
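Peak alignment of the kind the Arvaniti et al. experiment manipulates is commonly quantified by locating the F0 maximum inside the rise-fall and expressing its time relative to a segmental landmark. A minimal sketch under assumed inputs (an F0 track with frame times plus hand-labelled vowel boundaries); this is not the authors' measurement procedure:

import numpy as np

def peak_alignment(times, f0, vowel_on, vowel_off):
    """F0-peak time as a fraction of the vowel's duration
    (<0: peak before vowel onset; >1: peak after vowel offset)."""
    i = int(np.nanargmax(np.asarray(f0, dtype=float)))
    return (times[i] - vowel_on) / (vowel_off - vowel_on)

times = np.linspace(0.0, 0.5, 51)                        # 10-ms frames
f0 = 180 + 60 * np.exp(-((times - 0.32) ** 2) / 0.004)   # synthetic rise-fall
print(f"alignment = {peak_alignment(times, f0, 0.25, 0.35):.2f}")  # ~0.70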
Our results showed that all questions had a low level stretch associated with the focused word and a final rise-fall movement, the peak of which aligned in two different ways depending on the position of the nucleus: when the nucleus was on the final word, the peak of the rise-fall co-occurred with the utterance-final vowel, irrespective of whether this vowel was stressed or not; when the nucleus was on an earlier word, the peak co-occurred with the stressed vowel of the last word. In addition, our results showed finely-tuned adjustments of tonal alignment and scaling that depended on the extent to which tones were "crowded" by surrounding tones in the various conditions we set up. These results can best be explained within a model of intonational phonology in which a tune consists of a string of sparse tones and their association to specific elements of the segmental string. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Edinburgh, Sch Philosophy Psychol & Language Sci, Edinburgh EH8 9LN, Midlothian, Scotland. RP Arvaniti, A (reprint author), Univ Calif San Diego, Dept Linguist, 9500 Gilman Dr 0108, La Jolla, CA 92093 USA. EM amalia@ling.ucsd.edu; bob.ladd@ed.ac.uk; imennen@qmuc.ac.uk CR ARVANITI A, 1998, P 4 INT C SPOK LANG, P2883 ARVANITI A, IN PRESS LANGUAGE SP Arvaniti A, 1998, J PHONETICS, V26, P3, DOI 10.1006/jpho.1997.0063 ARVANITI A, 2000, P 2 INT C LANG RES E, V2, P555 ARVANITI A, IN PRESS LAB PHONOLO, V9 Arvaniti A., 2005, PROSODIC TYPOLOGY PH, P84 ARVANITI A, UNPUB SCALING ALIGNM Arvaniti Amalia, 2000, PAPERS LAB PHONOLOGY, P119 Atterer M, 2004, J PHONETICS, V32, P177, DOI 10.1016/S0095-4470(03)00039-1 BALTAZANI M, 2002, THESIS UCLA Baltazani M., 1999, P 14 INT C PHON SCI, P1305 BECKMAN ME, 2004, LAB PHON 24 26 JUN U, V9 Bruce Gosta, 1977, SWEDISH WORD ACCENTS Connell B., 1990, PHONOLOGY, V7, P1, DOI 10.1017/S095267570000110X COOPER WE, 1985, J ACOUST SOC AM, V77, P2142, DOI 10.1121/1.392372 Crystal D., 1969, PROSODIC SYSTEMS INT D'Imperio M, 2001, SPEECH COMMUN, V33, P339, DOI 10.1016/S0167-6393(00)00064-9 EADY SJ, 1986, J ACOUST SOC AM, V80, P402, DOI 10.1121/1.394091 Elenius Kjell, 1995, P 13 INT C PHON SCI, P220 FROTA S, 2002, LAB PHONOLOGY, V7, P387 GANDOUR J, 1994, J PHONETICS, V22, P477 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Grabe E, 1998, J PHONETICS, V26, P129, DOI 10.1006/jpho.1997.0072 Grabe E, 1998, LANG SPEECH, V41, P63 Grabe E, 2000, J PHONETICS, V28, P161, DOI 10.1006/jpho.2000.0111 GRABE E, 2005, P SASRTLM 2003 SPEEC Grice M., 1995, INTONATION INTERROGA Grice Martine, 2000, PHONOLOGY, V17, P143, DOI 10.1017/S0952675700003924 Gronnum N., 1983, PROSODY MODELS MEASU, P27 Halliday M. A. K., 1970, COURSE SPOKEN ENGLIS HIRSCHBERG J, 1992, J PHONETICS, V20, P241 Jun Sun-Ah, 2005, PROSODIC TYPOLOGY PH Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X Kochanski G, 2003, SPEECH COMMUN, V41, P625, DOI 10.1016/S0167-6393(03)00100-6 Kochanski G., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1021095805490 Ladd D.
R., 1996, INTONATIONAL PHONOLO LADD DR, 1995, P 13 INT C PHON SCI, V2, P386 Liberman Mark, 1984, LANGUAGE SOUND STRUC, P157 Mackridge P., 1990, MODERN GREEK LANGUAG Mackridge Peter, 1997, GREEK COMPREHENSIVE Max L, 1999, J SPEECH LANG HEAR R, V42, P261 MENN L, 1982, LANG SPEECH, V25, P341 Nibert Holly, 2000, THESIS U ILLINOIS UR O'Connor John D., 1973, INTONATION COLLOQUIA PAN H, IN PRESS TOPIC FOCUS Peng SH, 1997, J PHONETICS, V25, P371, DOI 10.1006/jpho.1997.0047 PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183 Pierrehumbert J., 1988, JAPANESE TONE STRUCT PIERREHUMBERT J, 1990, SYS DEV FDN, P271 Pierrehumbert J, 1980, THESIS MIT Prieto P, 2005, J PHONETICS, V33, P215, DOI 10.1016/j.wocn.2004.07.001 PRIETO P, 1995, J PHONETICS, V23, P429, DOI 10.1006/jpho.1995.0032 PRIETO P, IN PRESS LANGUAGE SP SCHEPMAN A, IN PRESS J PHONETICS Silverman Kim E. A., 1990, PAPERS LABORATORY PH, P72 Steedman M, 2000, LINGUIST INQ, V31, P649, DOI 10.1162/002438900554505 THORSEN N, 1979, P 9 INT C C PHON SCI, P417 WANG S, 2002, LANG LINGUIST-TAIWAN, V3, P839 WARD G, 1985, LANGUAGE, V61, P747, DOI 10.2307/414489 WARING H, 1976, THESIS U LONDON Xu Y, 2001, PHONETICA, V58, P26, DOI 10.1159/000028487 Xu Y, 2005, SPEECH COMMUN, V46, P220, DOI 10.1016/j.specom.2005.02.014 Xu Y, 2005, J PHONETICS, V33, P159, DOI 10.1016/j.wocn.2004.11.001 Xu Y, 2001, SPEECH COMMUN, V33, P319, DOI 10.1016/S0167-6393(00)00063-7 Xydas G, 2005, IEICE T INF SYST, VE88D, P510, DOI 10.1093/ietisy/e88-d.3.510 Xydas G, 2004, LECT NOTES COMPUT SC, V3206, P521 ZERVAS P, 2004, P 8 INT C SPOK LANG, P761 NR 67 TC 21 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2006 VL 48 IS 6 BP 667 EP 696 DI 10.1016/j.specom.2005.09.012 PG 30 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 052DR UT WOS:000238211900006 ER PT J AU Milner, B Shao, X AF Milner, Ben Shao, Xu TI Clean speech reconstruction from MFCC vectors and fundamental frequency using an integrated front-end SO SPEECH COMMUNICATION LA English DT Article DE distributed speech recognition; speech reconstruction; sinusoidal model; source-filter model; fundamental frequency estimation; auditory model ID RECOGNITION AB The aim of this work is to enable a noise-free time-domain speech signal to be reconstructed from a stream of MFCC vectors and fundamental frequency and voicing estimates, such as may be received in a distributed speech recognition system. To facilitate reconstruction, both a sinusoidal model and a source-filter model of speech are compared by listening tests and spectrogram analysis, with the result that the former provides higher quality speech reconstruction. Analysis of the sinusoidal model shows that for clean speech reconstruction, both a noise-free spectral envelope and a robust estimate of the fundamental frequency and voicing are necessary. Investigation into fundamental frequency estimation reveals that an auditory model based approach gives superior performance over other methods of estimation. This leads to the proposal of an integrated front-end which uses the auditory model for both fundamental frequency and voicing estimation, and as the filterbank stage in MFCC extraction, and thereby reduces computation. Applying spectral subtraction to the auditory model parameters improves the spectral envelope estimates needed for clean speech reconstruction. 
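The reconstruction chain the Milner and Shao abstract describes, MFCCs to a smooth spectral envelope and then sinusoids at harmonics of the estimated F0, can be sketched roughly as below. This is a heavily simplified stand-in, not the authors' system: the inverse DCT ignores the mel-to-linear frequency mapping (a linear grid is used instead) and all sinusoid phases are set to zero.

import numpy as np

def mfcc_to_logmel(mfcc, n_mel=23):
    """Invert a truncated DCT-II: MFCCs -> smooth log mel envelope."""
    n_cep = len(mfcc)
    k = np.arange(n_mel)
    basis = np.cos(np.pi * np.outer(np.arange(n_cep), k + 0.5) / n_mel)
    return mfcc @ basis       # smooth because the high cepstra were truncated

def synthesize(mfcc, f0, fs=8000, dur=0.032, fmax=4000.0):
    """One sum-of-sinusoids frame: sample the envelope at harmonics of f0."""
    env = np.exp(mfcc_to_logmel(mfcc))          # linear-amplitude envelope
    grid = np.linspace(0.0, fmax, len(env))     # crude stand-in for mel centres
    t = np.arange(int(fs * dur)) / fs
    frame = np.zeros_like(t)
    for h in range(1, int(fmax // f0) + 1):     # zero-phase harmonics
        amp = np.interp(h * f0, grid, env)
        frame += amp * np.sin(2 * np.pi * h * f0 * t)
    return frame

frame = synthesize(np.array([4.0, -1.0, 0.5, 0.2]), f0=120.0)
print(frame.shape)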
Experiments on the Aurora connected digits database show that the auditory model-based MFCCs give comparable performance to that attained with conventional MFCCs. Speech reconstruction tests reveal that the combination of robust fundamental frequency and voicing estimation with spectral subtraction in the integrated front-end leads to intelligible and relatively noise-free speech. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. RP Milner, B (reprint author), Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. EM b.milner@uea.ac.uk; x.shao@uea.ac.uk CR [Anonymous], 1996, P800 ITU T CHAZAN D, 2001, P EUR CHAZAN D, 2000, P ICASSP *ETSI, 2003, 202212STQ ETSI ES DS *ETSI, 2000, 201108STQ ETSI ES DS GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Huerta J. M., 1998, P ICSLP, P1463 Kaiser J.F., 1993, P INT C AC SPEECH SI, V3, P149 Kim HK, 2001, IEEE T SPEECH AUDI P, V9, P558 Kleijn W. B., 1995, SPEECH CODING SYNTHE LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 MILNER BP, 2002, P ICSLP PATTERSON RD, 1992, COMPLEX SOUNDS AUDIT Patterson R.D., 1988, 2341 APU RABINER LR, 1976, IEEE T ACOUST SPEECH, V24, P399, DOI 10.1109/TASSP.1976.1162846 Rabiner L.R., 1978, DIGITAL PROCESSING S RAJ B, 2001, P ASRU Rouat J, 1997, SPEECH COMMUN, V21, P191, DOI 10.1016/S0167-6393(97)00002-2 Slaney M., 1993, 35 APPL COMP INC PER TUCKER R, 1999, P EUR VANIMMERSEEL L, 1992, JASA, V91, P3311 Vaseghi SV, 1997, IEEE T SPEECH AUDI P, V5, P11, DOI 10.1109/89.554264 WU M, 2002, P ICASSP NR 24 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2006 VL 48 IS 6 BP 697 EP 715 DI 10.1016/j.specom.2005.10.004 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 052DR UT WOS:000238211900007 ER PT J AU Chu, M Zhao, Y Chang, E AF Chu, Min Zhao, Yong Chang, Eric TI Modeling stylized invariance and local variability of prosody in text-to-speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE prosody; stylized invariance; local variability; soft prediction; unit selection; text-to-speech AB This paper investigates the stylized invariance and local variability of prosody patterns by using a speech database containing two repetitions of 1000 sentences. The two repetitions (separated by a time span of 6 months) were recorded by a single professional speaker, who was instructed to read these sentences in the same reading style. It was observed statistically that the two repetitions have fairly wide variations in prosodic features and the variations can be up to 50% of the full dynamic range of the speaker. This shows the inadequacy of traditional prosody models that focus on capturing the universal invariance of prosody as precisely as possible. In this paper, we propose to model prosody by capturing its stylized invariance and retaining local variability with a soft prediction strategy, which predicts an acceptable region rather than a single fixed point in the multi-dimensional prosody space. A prosodic-constrained unit selection algorithm is devised under the soft prediction strategy. (C) 2005 Elsevier B.V. All rights reserved. C1 Microsoft Res Asia, Beijing Sigma Ctr, Beijing 100080, Peoples R China.
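The soft prediction strategy described in the abstract above can be sketched as a unit-selection prosody cost that is zero anywhere inside the predicted acceptable region and penalises only excursions outside it, in contrast to a squared distance from a single point target. A hypothetical cost function, not the paper's actual algorithm:

def soft_prosody_cost(value, lo, hi, weight=1.0):
    """Zero cost inside the predicted acceptable region [lo, hi];
    quadratic penalty only outside it."""
    if lo <= value <= hi:
        return 0.0
    d = (lo - value) if value < lo else (value - hi)
    return weight * d * d

# Candidate unit with F0 = 205 Hz against a predicted region of 180-230 Hz:
print(soft_prosody_cost(205.0, 180.0, 230.0))  # 0.0 -> acceptable as-is
print(soft_prosody_cost(250.0, 180.0, 230.0))  # penalised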
RP Chu, M (reprint author), Microsoft Res Asia, Beijing Sigma Ctr, 5F, No 49 Zhichun Rd, Beijing 100080, Peoples R China. EM minchu@microsoft.com; yzhao@microsoft.com; echang@microsoft.com CR Beckman M, 1997, GUIDELINES TOBI LABE CARLSON R, 1975, STRUCTURE PROCESS SP, P90 Chen SH, 1998, IEEE T SPEECH AUDI P, V6, P226 CHU M, 2001, P ICASSP 01 SALT LAK CHU M, 2003, P ICASSP 03 HONG KON CHU M, 2001, P 4 ISCA WORKSH SPEE Chu M., 2001, COMPUTATIONAL LINGUI, V6, P61 DOGIL G, 2001, P EUR 2001 COP DONOVAN RE, 1998, P ICSLP 98 SYDN FANT G, 1996, P ICSLP 96 PHIL Fujisaki H, 1986, P IEEE INT C AC SPEE, P2039 GUENTHER FH, 1995, P 13 INT C PHON SCI, V2, P92 Hirschberg J, 1996, SPEECH COMMUN, V18, P281, DOI 10.1016/0167-6393(96)00017-9 HUANG XD, 1996, P ICSLP 96 PHIL HUGGINS AWF, 1972, J ACOUST SOC AM, V51, P1270, DOI 10.1121/1.1912971 KATO H, 1998, P ICSLP 98 SYDN Klatt D. H., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X MARKEL J, 1976, LINEAR PREDICTION SP, V12 Mayer J., 2002, P SPEECH PROS 2002 A, P487 MOBIUS B, 2002, P SPEECH PROS 2002 A MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Ostendorf M., 1994, Computational Linguistics, V20 Perkell JS, 2000, J PHONETICS, V28, P233, DOI 10.1006/jpho.2000.0116 Ross KN, 1999, IEEE T SPEECH AUDI P, V7, P295, DOI 10.1109/89.759037 SHADLE CH, 2001, P 4 ISCA TUT RES WOR STEVENS KN, 1989, J PHONETICS, V17, P3 STYLIANOU Y, 1997, P EUROSPEECH 97, P613 Taylor P, 1998, COMPUT SPEECH LANG, V12, P99, DOI 10.1006/csla.1998.0041 Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 Wang Michelle Q., 1991, P ASS COMP LING 29 A, P285, DOI 10.3115/981344.981381 Wightman CW, 1994, IEEE T SPEECH AUDI P, V2, P469, DOI 10.1109/89.326607 Young S., 2000, HTK BOOK HTK VERSION NR 33 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2006 VL 48 IS 6 BP 716 EP 726 DI 10.1016/j.specom.2005.10.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 052DR UT WOS:000238211900008 ER PT J AU Alsteris, LD Paliwal, KK AF Alsteris, Leigh D. Paliwal, Kuldip K. TI Further intelligibility results from human listening tests using the short-time phase spectrum SO SPEECH COMMUNICATION LA English DT Article DE short-time Fourier transform; phase spectrum; magnitude spectrum; speech perception; overlap-add procedure; automatic speech recognition; feature extraction; group delay function; instantaneous frequency distribution ID GROUP DELAY FUNCTIONS; FOURIER-ANALYSIS; SPEECH; FREQUENCY AB State-of-the-art automatic speech recognition systems (ASRs) use only the short-time magnitude spectrum for feature extraction; the short-time phase spectrum is generally ignored in these systems. Results from our recent human listening tests indicate that the short-time phase spectrum can significantly contribute to speech intelligibility over small window durations (i.e., 20-40 ms). This is an interesting result, indicating the possible usefulness of the short-time phase spectrum for ASR, which commonly employs small window durations of 20-40 ms for spectral analysis. In this paper, we continue our investigation of the short-time phase spectrum.
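Phase-only stimuli of the kind used in the listening tests referred to above can be approximated by keeping each frame's STFT phase, forcing the magnitude spectrum to unity, and resynthesising by overlap-add. A minimal sketch assuming a Hann-windowed STFT at 50% overlap; the authors' exact analysis-modification-synthesis settings are not reproduced here:

import numpy as np

def phase_only_ola(x, n=256, hop=128):
    """Unit-magnitude / original-phase reconstruction by overlap-add."""
    win = np.hanning(n)
    y = np.zeros(len(x) + n)
    for start in range(0, len(x) - n, hop):
        spec = np.fft.rfft(x[start:start + n] * win)
        phase_only = np.exp(1j * np.angle(spec))     # set |X| := 1, keep phase
        y[start:start + n] += np.fft.irfft(phase_only, n) * win
    return y[:len(x)]

x = np.random.randn(4000)        # stand-in for a speech signal
print(phase_only_ola(x).shape)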
We explore the use of partial short-time phase spectrum information, in the absence of all the short-time magnitude spectrum information, for intelligible signal reconstruction. We create two types of stimuli: one in which its frequency-derivative (i.e., group delay function, GDF) is preserved and another in which its time-derivative (i.e., instantaneous frequency distribution, IFD) is preserved. We do this to determine the contribution that each of these derivatives provides toward intelligibility. Reconstructing stimuli from knowledge of only the GDF or only the IFD results in poor intelligibility. However, when we create stimuli using knowledge of both the GDF and the IFD, reasonable intelligibility is obtained. In light of these results, we conclude that both the GDF and IFD components of the short-time phase spectrum are needed to reconstruct an intelligible signal. In addition, we also perform some experiments to quantify the intelligibility of stimuli reconstructed from the short-time phase and magnitude spectra of noisy speech. The intelligibility of stimuli constructed from either the short-time magnitude spectrum or the short-time phase spectrum degrades at a similar rate under increasing noise levels. The intelligibility of the original signals under noisy conditions also degrades with increased noise, but in all cases the intelligibility is superior to that provided by the stimuli constructed from the separate short-time components. Therefore, we argue that knowledge of both short-time magnitude and phase spectrum information results in superior human speech recognition performance. (C) 2005 Elsevier B.V. All rights reserved. C1 Griffith Univ, Sch Microelect Engn, Brisbane, Qld 4111, Australia. RP Alsteris, LD (reprint author), Griffith Univ, Sch Microelect Engn, Nathan Campus, Brisbane, Qld 4111, Australia. EM L.Alsteris@griffith.edu.au; K.Paliwal@griffith.edu.au CR Abe T., 1995, P IEEE ICASSP, P756 ALLEN JB, 1977, P IEEE, V65, P1558, DOI 10.1109/PROC.1977.10770 ALSTERIS LD, 2004, P IEEE INT C AC SPEE, pI573 [Anonymous], 2005, P INT S SIGN PROC IT, P715 BOZKURT B, 2004, EUSPICO Charpentier F. J., 1986, P INT C AC SPEECH SI, P113 CROCHIERE RE, 1980, IEEE T ACOUST SPEECH, V28, P99, DOI 10.1109/TASSP.1980.1163353 Delgutte B., 1996, AUDITORY COMPUTATION DIMITRIADIS D, 2003, P EUR GEN SWITZ SEP, P2853 DUNCAN G, 1989, P ICASSP MAY, P572 FRIEDMAN DH, 1985, P INT C AC SPEECH SI, P1121 GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317 HEGDE RM, 2004, P INT C SPEECH LANG HEGDE RM, 2004, P IEEE INT C AC SPEE, P2517 Liu L, 1997, SPEECH COMMUN, V22, P403, DOI 10.1016/S0167-6393(97)00054-X Potamianos A, 1996, J ACOUST SOC AM, V99, P3795, DOI 10.1121/1.414997 MURTHY HA, 1989, P IEEE INT C AC SPEE, P484 MURTHY HA, 2003, P IEEE INT C AC SPEE, pI68 Nakatani T., 2003, P EUROSPEECH, P2313 NUTTALL AH, 1981, IEEE T ACOUST SPEECH, V29, P84, DOI 10.1109/TASSP.1981.1163506 Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE OPPENHEIM AV, 1981, P IEEE, V69, P529, DOI 10.1109/PROC.1981.12022 Paliwal K., 2003, P EUROSPEECH, P65 Paliwal K. K., 2003, P EUR 2003, P2117 Paliwal KK, 2005, SPEECH COMMUN, V45, P153, DOI 10.1016/j.specom.2004.08.001 PORTNOFF MR, 1981, IEEE T ACOUST SPEECH, V29, P364, DOI 10.1109/TASSP.1981.1163580 Potamianos A, 2001, IEEE T SPEECH AUDI P, V9, P196, DOI 10.1109/89.905994 Prasad VK, 2004, SPEECH COMMUN, V42, P429, DOI 10.1016/j.specom.2003.12.002 Quatieri T.
F., 2002, DISCRETE TIME SPEECH REDDY NS, 1985, IEEE T CIRC SYST CAS, V32 Satyanarayana Murthy P., 1999, IEEE Transactions on Speech and Audio Processing, V7, DOI 10.1109/89.799686 SCHROEDER MR, 1975, P IEEE, V63, P1332, DOI 10.1109/PROC.1975.9941 SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662 Wang Y., 2003, P 22 INT S REL DISTR, P25 YEGNANARAYANA B, 1984, IEEE T ACOUST SPEECH, V32, P610, DOI 10.1109/TASSP.1984.1164365 YEGNANARAYANA B, 1992, IEEE T SIGNAL PROCES, V40, P2281, DOI 10.1109/78.157227 NR 36 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2006 VL 48 IS 6 BP 727 EP 736 DI 10.1016/j.specom.2005.10.005 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 052DR UT WOS:000238211900009 ER PT J AU Park, J Ko, H AF Park, Junho Ko, Hanseok TI Achieving a reliable compact acoustic model for embedded speech recognition system with high confusion frequency model handling SO SPEECH COMMUNICATION LA English DT Article DE tied-mixture HMM; compact acoustic modeling; embedded speech recognition system ID HIDDEN MARKOV-MODELS; HMMS AB An acoustic model for an embedded speech recognition system must exhibit two desirable features: the ability to minimize the performance degradation in recognition, while solving the memory problem under the constraint of limited system resources. Moreover, for general speech recognition tasks, context dependent models such as state-clustered tri-phones are used to guarantee the high recognition performance of the embedded system. To cope with these challenges, we introduce the state-clustered tied-mixture (SCTM) HMM as a method of optimizing an acoustic model. The proposed SCTM modeling system offers a significant improvement in recognition performance, as well as providing a solution to sparse training data problems. Moreover, the state weight quantizing method achieves a drastic reduction in the size of the model. However, using models constructed only in this way is insufficient to improve the recognition rate in some tasks where a large mutual similarity exists, such as in the case of the Korean-digit recognition task. Hence, we also construct new dedicated HMMs for all or part of the Korean digits that have exclusive states using the same Gaussian pool of previous tri-phone models. In this paper, we describe the acoustic model optimization procedure for embedded speech recognition systems and the corresponding performance evaluation results. (C) 2005 Elsevier B.V. All rights reserved. C1 Korea Univ, ISPL, Dept Elect & Comp Engn, Seoul 136701, South Korea. RP Ko, H (reprint author), Korea Univ, ISPL, Dept Elect & Comp Engn, 5-1 Anan Dong, Seoul 136701, South Korea.
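The state-weight quantisation that shrinks the SCTM model described above can be sketched as scalar quantisation of each state's mixture weights down to a small index codebook. Illustration only, assuming uniform quantisation in the log domain; the paper's actual quantiser design may differ:

import numpy as np

def quantize_state_weights(w, bits=4, floor=1e-6):
    """Quantize one state's mixture weights to 2**bits levels on a uniform
    grid in the log domain, then renormalize so they still sum to one."""
    w = np.maximum(np.asarray(w, dtype=float), floor)
    logw = np.log(w)
    lo, hi = logw.min(), logw.max() + 1e-12      # guard against a flat range
    levels = (1 << bits) - 1
    idx = np.round((logw - lo) / (hi - lo) * levels)   # stored indices
    q = np.exp(lo + idx / levels * (hi - lo))
    return q / q.sum(), idx.astype(np.uint8)

w = np.random.dirichlet(np.ones(16))       # one state's 16 mixture weights
qw, codes = quantize_state_weights(w)
print(codes.nbytes, "bytes of indices instead of", w.nbytes, "bytes of floats")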
EM jhpark@ispl.korea.ac.kr; hsko@korea.ac.kr CR Bocchieri E, 2001, IEEE T SPEECH AUDI P, V9, P264, DOI 10.1109/89.906000 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 DORTA P, 1987, IEEE INT C AC SPEECH, P81 Duchateau J, 1998, SPEECH COMMUN, V24, P5, DOI 10.1016/S0167-6393(98)00002-8 Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P152, DOI 10.1109/89.748120 HUANG X, 1998, IEEE WORKSH SPEECH R HUANG X, 1996, ICASSP 96, V2, P885 Hwang MY, 1993, IEEE T SPEECH AUDI P, V1, P414 Jelinek F., 1997, STAT METHODS SPEECH JUANG BH, 1985, AT&T TECH J, V64, P391 Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 KIM K, 2004, 8 IEEE INT S CONS EL, P595 Kim NS, 1997, IEEE T SPEECH AUDI P, V5, P292 LEE A, 2000, ICASSP 2000, V3, P1269 Moreno PJ, 1998, SPEECH COMMUN, V24, P267, DOI 10.1016/S0167-6393(98)00025-9 PARK J, 2004, ICSLP 2004, V1, P693 Park J, 2005, SPEECH COMMUN, V46, P1, DOI 10.1016/j.specom.2004.12.003 YOUNG DP, 1992, J COMPUT PHYS, V92, P1 NR 18 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2006 VL 48 IS 6 BP 737 EP 745 DI 10.1016/j.specom.2005.10.001 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 052DR UT WOS:000238211900010 ER PT J AU So, S Paliwal, KK AF So, Stephen Paliwal, Kuldip K. TI Scalable distributed speech recognition using Gaussian mixture model-based block quantisation SO SPEECH COMMUNICATION LA English DT Article DE distributed speech recognition; Gaussian mixture models; block quantisation; Aurora-2 ID VECTOR QUANTIZATION; PARAMETERS; ALGORITHM AB In this paper, we investigate the use of block quantisers based on Gaussian mixture models (GMMs) for the coding of Mel frequency-warped cepstral coefficient (MFCC) features in distributed speech recognition (DSR) applications. Specifically, we consider the multi-frame scheme, where temporal correlation across MFCC frames is exploited by the Karhunen-Loeve transform of the block quantiser. Compared with vector quantisers, the GMM-based block quantiser has relatively low computational and memory requirements which are independent of bitrate. More importantly, it is bitrate scalable, which means that the bitrate can be adjusted without the need for re-training. Static parameters such as the GMM and transform matrices are stored at the encoder and decoder and bit allocations are calculated 'on-the-fly' without intensive processing. We have evaluated the quantisation scheme on the Aurora-2 database in a DSR framework. We show that jointly quantising more frames and using more mixture components in the GMM leads to higher recognition performance. The multi-frame GMM-based block quantiser achieves a word error rate (WER) of 2.5% at 800 bps, which is less than 1% degradation from the baseline (unquantised) word recognition accuracy, and graceful degradation down to a WER of 7% at 300 bps. (C) 2005 Elsevier B.V. All rights reserved. C1 Griffith Univ, Sch Microelect Engn, Brisbane, Qld 4111, Australia. RP So, S (reprint author), Griffith Univ, Sch Engn, PMB 50, Gold Coast, Qld 9726, Australia. 
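Per mixture component, the So and Paliwal scheme amounts to a Karhunen-Loeve rotation followed by scalar quantisation, with bits allocated on the fly from the stored GMM rather than retrained per bitrate. A toy single-component sketch; the high-rate bit-allocation formula, component selection and entropy coding are simplified away:

import numpy as np

def klt_quantize(x, mean, cov, bits_per_dim=3):
    """Encode/decode one feature vector under a single Gaussian: rotate
    into the eigenvector (KLT) basis, uniformly quantize each decorrelated
    coefficient over +/- 4 sigma, and reconstruct."""
    evals, evecs = np.linalg.eigh(cov)
    z = evecs.T @ (x - mean)                       # decorrelated coefficients
    sigma = np.sqrt(np.maximum(evals, 1e-12))
    levels = 1 << bits_per_dim
    step = 8.0 * sigma / levels                    # cover +/- 4 sigma
    idx = np.clip(np.round(z / step), -(levels // 2), levels // 2 - 1)
    return evecs @ (idx * step) + mean             # decoded vector

rng = np.random.default_rng(0)
cov = np.diag([4.0, 1.0])
x = rng.multivariate_normal([0.0, 0.0], cov)
print(x, "->", klt_quantize(x, np.zeros(2), cov))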
EM s.so@griffith.edu.au; k.paliwal@griffith.edu.au CR ADAMI A, 2002, P INT C SPOK LANG PR Archer C, 2004, IEEE T SIGNAL PROCES, V52, P255, DOI 10.1109/TSP.2003.819980 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Digalakis VV, 1999, IEEE J SEL AREA COMM, V17, P82, DOI 10.1109/49.743698 *ETSI, 2000, 201108 ES ETSI GALLARDOANTOLIN A, 1998, P ICSLP, P1443 GARDNER WR, 1995, IEEE T SPEECH AUDI P, V3, P367, DOI 10.1109/89.466658 Gersho A., 1992, VECTOR QUANTIZATION GOYAL VK, 2001, IEEE SIGNAL PROCESS, V18 Hedelin P, 2000, IEEE T SPEECH AUDI P, V8, P385, DOI 10.1109/89.848220 Hirsch H.-G., 2000, ISCA ITRW ASR2000 PA HIRSCH HG, 1998, P ICSLP DENV US SEPT Huang J.J.Y., 1963, IEEE Transactions on Communication Systems, VCS-11, DOI 10.1109/TCOM.1963.1088759 Huerta JM, 1998, P 5 INT C SPOK LANG, V4, P1463 JARVINEN K, 1997, P ICASSP, V2, P771 JUANG BH, 1987, IEEE T ACOUST SPEECH, V35, P947 Kim HK, 2001, IEEE T SPEECH AUDI P, V9, P558 KISS I, 1999, P EUROSPEECH, P2183 Kiss I., 2000, P INT C SPOK LANG PR KRAMER KP, 1971, IEEE T INFORM THEORY, V17, P751 LILLY B, 1996, P ICSLP, V4, P2344, DOI 10.1109/ICSLP.1996.607278 LINDE Y, 1980, IEEE T COMMUN, V28, P1 ORTEGA A, 1996, IEEE T IM P PALIWAL KK, 2004, P INT C SPOK LANG PR PALIWAL KK, 2004, P IEEE INT C AC SPEE, V1, P125 RAJ B, 2001, P ASRU TRENT IT DEC Ramaswamy GN, 1998, INT CONF ACOUST SPEE, P977, DOI 10.1109/ICASSP.1998.675430 Samuelsson J, 2001, IEEE T SPEECH AUDI P, V9, P492, DOI 10.1109/89.928914 SRINIVASAMURTHY N, 2003, UNPUB IEEE T SPEECH Su JK, 1996, INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, PROCEEDINGS - VOL I, P217 Subramaniam AD, 2003, IEEE T SPEECH AUDI P, V11, P130, DOI 10.1109/TSA.2003.809192 Turunen J., 2001, P EUR, P2363 ZHU Q, 2001, P IEEE INT C AC SPEE, V1, P113 NR 33 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2006 VL 48 IS 6 BP 746 EP 758 DI 10.1016/j.specom.2005.10.002 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 052DR UT WOS:000238211900011 ER PT J AU Dromey, C Nissen, S Nohr, P Fletcher, SG AF Dromey, C Nissen, S Nohr, P Fletcher, SG TI Measuring tongue movements during speech: Adaptation of a magnetic jaw-tracking system SO SPEECH COMMUNICATION LA English DT Article DE tongue; movement; measurement; magnetic; kinematic ID ELECTROMAGNETIC ARTICULOGRAPHY; ARTICULATORY MOVEMENTS; PALATOMETRY; CONSONANTS; ULTRASOUND; SPEAKERS AB The purpose of the present investigation was to determine whether measurements of tongue movement during speech could be obtained with an electronic device originally designed to record jaw movements via a magnetized pellet. The findings indicated that the system allowed basic quantification of tongue movements in a straightforward manner. The primary advantages of this system are that no distracting wires are attached to the pellet, and it is much less costly than other systems used for this purpose. Its main disadvantages are that it is unable to track multiple tongue fleshpoints simultaneously, lacks an anatomically based coordinate system, and the head must remain still during recordings. (C) 2005 Elsevier B.V. All rights reserved. C1 Brigham Young Univ, Dept Audiol & Speech Language Pathol, Provo, UT 84602 USA. RP Dromey, C (reprint author), Brigham Young Univ, Dept Audiol & Speech Language Pathol, 133 Taylor Bldg, Provo, UT 84602 USA.
EM dromey@byu.edu CR ADAMS SG, 1993, J SPEECH HEAR RES, V36, P41 BARLOW SM, 1983, J SPEECH HEAR RES, V26, P283 FLETCHER SG, 1991, J SPEECH HEAR RES, V34, P929 FLETCHER SG, 1975, J SPEECH HEAR RES, V18, P812 FLETCHER SG, 1989, J SPEECH HEAR RES, V32, P736 FORREST K, 1988, J ACOUST SOC AM, V84, P115, DOI 10.1121/1.396977 Jongman A, 2000, J ACOUST SOC AM, V108, P1252, DOI 10.1121/1.1288413 Kaburagi T, 1997, J ACOUST SOC AM, V101, P2391, DOI 10.1121/1.418255 Katz WF, 1999, J SPEECH LANG HEAR R, V42, P1355 Lundberg AJ, 1999, J ACOUST SOC AM, V106, P2858, DOI 10.1121/1.428110 Napadow VJ, 1999, J BIOMECH, V32, P1, DOI 10.1016/S0021-9290(98)00109-2 Nissen S. L., 2003, THESIS OHIO STATE U OSTRY DJ, 1985, J ACOUST SOC AM, V77, P640, DOI 10.1121/1.391882 PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204 SCHONLE PW, 1987, BRAIN LANG, V31, P26, DOI 10.1016/0093-934X(87)90058-7 SMITH A, 1995, EXP BRAIN RES, V104, P493 Stone M, 1996, J ACOUST SOC AM, V99, P3728, DOI 10.1121/1.414969 STONE M, 1990, J ACOUST SOC AM, V87, P2207, DOI 10.1121/1.399188 Tasko SM, 2002, J SPEECH LANG HEAR R, V45, P127, DOI 10.1044/1092-4388(2002/010) TULLER B, 1990, J ACOUST SOC AM, V88, P674, DOI 10.1121/1.399771 van Lieshout PHHM, 2002, J SPEECH LANG HEAR R, V45, P5, DOI 10.1044/1092-4388(2002/001) Weismer G, 1999, J ACOUST SOC AM, V105, P2882, DOI 10.1121/1.426902 WESTBURY JR, 1991, J ACOUST SOC AM, V89, P1782, DOI 10.1121/1.401012 WESTBURY JR, 1994, J ACOUST SOC AM, V95, P2271, DOI 10.1121/1.408638 NR 24 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2006 VL 48 IS 5 BP 463 EP 473 DI 10.1016/j.specom.2005.05.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 037QB UT WOS:000237160700001 ER PT J AU Kepesi, M Weruaga, L AF Kepesi, M Weruaga, L TI Adaptive chirp-based time-frequency analysis of speech signals SO SPEECH COMMUNICATION LA English DT Article DE time-frequency analysis; harmonically related chirps; fan-chirp transform ID WIGNER DISTRIBUTION; TRANSFORM; TOOL AB In this paper, a new method for time-frequency analysis of speech signals is proposed. Given that the fundamental frequency of voiced speech often undergoes rapid fluctuation, and that in these cases the classical spectrogram suffers from blurring and artifacts, we consider an adaptive analysis basis composed of quadratic chirps. The analysis basis of the proposed Short-Time Fan-Chirp Transform (FChT) is defined univocally by the analysis window length and by the frequency variation rate, this parameter being predicted from the last computed spectral segments. The prediction algorithm is based on time tracking the joint trajectory of the harmonic contours; this process also provides a voiced/unvoiced detection parameter. Comparative results between the proposed Short-Time FChT and popular time-frequency techniques reveal an improvement in spectral and time-frequency representation. Since the signal can be synthesized from its FChT, the proposed method is suitable for filtering purposes. (C) 2005 Elsevier B.V. All rights reserved. C1 Austrian Acad Sci, Commiss Sci Visualisat, A-1220 Vienna, Austria. Graz Univ Technol, Signal Proc & Speech Commun Lab, A-8010 Graz, Austria. RP Weruaga, L (reprint author), Austrian Acad Sci, Commiss Sci Visualisat, Donau City Str 1, A-1220 Vienna, Austria.
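The chirp-based analysis described above can be sketched as inner products with a basis whose instantaneous frequency varies at a common relative rate alpha across all analysis frequencies. A direct (non-fast) toy version using a linear first-order frequency variation; the paper's quadratic chirps, fast implementation and rate-prediction step are not reproduced:

import numpy as np

def fan_chirp(x, fs, freqs, alpha):
    """Fan-chirp-style analysis of one frame: correlate x with
    exp(-j*2*pi*f*phi(t)), phi(t) = (1 + 0.5*alpha*t)*t, so every
    analysis frequency f shares the relative chirp rate alpha."""
    t = np.arange(len(x)) / fs
    phi = (1.0 + 0.5 * alpha * t) * t            # warped time axis
    w = np.sqrt(np.abs(1.0 + alpha * t))         # amplitude normalisation
    return np.exp(-2j * np.pi * np.outer(freqs, phi)) @ (x * w)

fs = 8000
t = np.arange(1024) / fs
alpha = 2.0                                      # relative F0 change per second
x = np.sin(2 * np.pi * 120.0 * (1.0 + 0.5 * alpha * t) * t)   # chirping partial
freqs = np.arange(50.0, 400.0, 5.0)
X = fan_chirp(x, fs, freqs, alpha)
print(freqs[np.argmax(np.abs(X))], "Hz peak")    # near 120 Hz when alpha matches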
EM weruaga@ieee.org CR Arons B., 1992, J AM VOICE I O SOC, V12, P35 Auger F., 1995, TIME FREQUENCY TOOLB Baraniuk R. G., 1993, P IEEE ICASSP MINN M, P320 Bendat J. S., 2000, RANDOM DATA ANAL MEA BERGER J, 1994, J AUDIO ENG SOC, V42, P808 BROWN JC, 1991, J ACOUST SOC AM, V89, P425, DOI 10.1121/1.400476 CASEY M, 2000, P ICMC Chowning J., 1973, J AUDIO ENG SOC, V21 CLAASEN TACM, 1980, PHILIPS J RES, V35, P372 CLAASEN TACM, 1980, PHILIPS J RES, V35, P276 Cohen L., 1995, TIME FREQUENCY ANAL Cook C. E., 1993, RADAR SIGNALS INTRO Flandrin P., 1998, TIME FREQUENCY TIME Gabor D., 1946, Journal of the Institution of Electrical Engineers. III. Radio and Communication Engineering, V93 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HINICH MJ, 1982, IEEE T ACOUST SPEECH, V30, P747, DOI 10.1109/TASSP.1982.1163952 MANN S, 1995, IEEE T SIGNAL PROCES, V43, P2745, DOI 10.1109/78.482123 Mecklenbrauker W., 1997, WIGNER DISTRIBUTION Mercado E, 2000, NEUROCOMPUTING, V32, P913, DOI 10.1016/S0925-2312(00)00260-5 MIHOVILOVIC D, 1992, J GEOPHYS RES, V97, P17199, DOI 10.1029/92JA01140 MUNSON DC, 1999, P IEEE ICASSP PHOEN, P2099 O'Neill JC, 1998, PROCEEDINGS OF THE IEEE-SP INTERNATIONAL SYMPOSIUM ON TIME-FREQUENCY AND TIME-SCALE ANALYSIS, P425, DOI 10.1109/TFSA.1998.721452 Ozaktas H. M., 2001, FRACTIONAL FOURIER T Papoulis A., 1977, SIGNAL ANAL Pinter I, 1996, COMPUT SPEECH LANG, V10, P1, DOI 10.1006/csla.1996.0001 Pitton JW, 1994, IEEE T SPEECH AUDI P, V2, P554, DOI 10.1109/89.326614 Qian S, 1998, INT CONF ACOUST SPEE, P1781 Quatieri T. F., 2002, DISCRETE TIME SPEECH RAMALHO MA, 1993, P 36 MIDW SCAS DETR, P16 SIEBERT W, 1956, IEEE T INFORMATION T, V2 SLUIJTER RJ, 1999, P IEEE SPEECH COD WO, P150 Twaroch T, 1998, PROCEEDINGS OF THE IEEE-SP INTERNATIONAL SYMPOSIUM ON TIME-FREQUENCY AND TIME-SCALE ANALYSIS, P9, DOI 10.1109/TFSA.1998.721547 Vapnik V., 1995, NATURE STAT LEARNING Vetterli M., 1995, WAVELETS SUBBAND COD WARREN DW, 1986, CLEFT PALATE J, V23, P251 WERUAGA L, 2004, P EUSIPCO VIENN AT, P1011 WERUAGA L, 2003, P EUROSPEECH, P53 WERUAGA L, UNPUB IEEE T SIGNAL WOKUREK W, 1987, P INT C DSP, P294 YAN FZ, 2000, P IEEE ISCAS GEN CH, P28 NR 40 TC 19 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2006 VL 48 IS 5 BP 474 EP 492 DI 10.1016/j.specom.2005.08.004 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 037QB UT WOS:000237160700002 ER PT J AU Schramm, H Aubert, X Bakker, B Meyer, C Ney, H AF Schramm, H Aubert, X Bakker, B Meyer, C Ney, H TI Modeling spontaneous speech variability in professional dictation SO SPEECH COMMUNICATION LA English DT Article DE automatic speech recognition; spontaneous speech modeling; pronunciation modeling; rate of speech modeling; filled pause modeling; model combination ID HIDDEN MARKOV-MODELS; PRONUNCIATION VARIATION; RECOGNITION AB In this work, we present a model combination approach at the word level that aims to improve the modeling of spontaneous speech variabilities on a highly spontaneous, real-life medical transcription task. The technique (1) separates speech variabilities into pre-defined classes, (2) generates speech variability specific acoustic and pronunciation models and (3) properly combines these models later in the search procedure on a word level basis.
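The word-level combination in step (3) can be sketched as a weighted sum of variability-specific acoustic likelihoods over an extended set of pronunciation variants. A schematic stand-in with made-up scores; in the paper this sum is integrated into the decoder's search rather than applied post hoc:

import math

def combined_word_score(variant_logliks, priors):
    """log p(X | word) ~ log sum_v p(v | word) * p(X | v): a word-level
    weighted sum over variability-specific pronunciation variants v."""
    return math.log(sum(priors[v] * math.exp(ll)
                        for v, ll in variant_logliks.items()))

# Hypothetical per-variant acoustic log-likelihoods for one word hypothesis:
logliks = {"baseline": -42.0, "fast_rate": -40.5, "filled_pause": -47.0}
priors = {"baseline": 0.6, "fast_rate": 0.3, "filled_pause": 0.1}
print(combined_word_score(logliks, priors))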
For efficient integration of the specific acoustic and pronunciation models into the search procedure, a theoretical framework is provided. Our algorithm is a general approach that can be applied to model various speech variabilities. In our experiments, we focused on the variabilities related to filled pauses, rate of speech and speaker accent. Our best system combines six variability specific acoustic and pronunciation models on a word level and achieves a word error rate reduction of 13% relative compared to the baseline. In a number of contrast experiments we evaluated the importance of different components in our system and explored ways to reduce the system complexity. (C) 2005 Elsevier B.V. All rights reserved. C1 Philips Res Labs, D-52066 Aachen, Germany. Univ Technol, Rhein Westfal TH Aachen, Dept Comp Sci, Lehrstuhl Informat 6, D-52056 Aachen, Germany. RP Schramm, H (reprint author), Philips Res Labs, Weisshausstr 2, D-52066 Aachen, Germany. EM hauke.schramm@philips.com CR ADDADECKER M, 1998, P ESCA WORKSH MOD PR, P1 Amdal I., 2002, THESIS NORWEGIAN U S AUBERT X, 1995, P EUROSPEECH, P767 AUBERT X, 1995, P ICASSP, V1, P49 AUBERT X, 1999, P EUR C SPEECH COMM, P1559 AUST H, 1995, SPEECH COMMUN, V17, P249, DOI 10.1016/0167-6393(95)00028-M Bahl L.R., 1991, P INT C AC SPEECH SI, P173, DOI 10.1109/ICASSP.1991.150305 Bates R. A., 2002, P ISCA TUT RES WORKS, P42 BELLEGARDA JR, 1990, IEEE T ACOUST SPEECH, V38, P2033, DOI 10.1109/29.61531 BERNSTEIN J, 1986, P DARPA SPEECH REG W, P41 Beyerlein P, 2002, SPEECH COMMUN, V37, P109, DOI 10.1016/S0167-6393(01)00062-0 BEYERLEIN P, 2001, P EUROSPEECH, P499 Byrne W, 1998, INT CONF ACOUST SPEE, P313, DOI 10.1109/ICASSP.1998.674430 CHEN K, 2004, ISCA INT C SPEECH PR, P583 CHEN K, 2003, P IEEE WORKSH SPEECH, P435 COHEN MH, 1989, THESIS U CALIFORNIA Evermann G., 2000, P NIST SPEECH TRANSC FINKE M, 1997, P EUR C SPEECH COMM, P239 Fosler-Lussier E., 1998, P ESCA WORKSH MOD PR, P35 FOSLERLUSSIER JE, 1999, P EUR C SPEECH COMM, P463 FUKADA T, 1998, P ESCA WORKSH MOD PR, P41 Furui S., 2003, P ISCA IEEE WORKSH S, P1 GAUVAIN JL, 1997, P ARPA SPEECH REC WO, P56 GORONZY S, 2001, P ISCA ITWR AD METH, P143 Greenberg S., 1996, P INT C SPOK LANG PR, P24 Hain T., 2002, P ISCA TUT RES WORKS, P129 He XD, 2003, IEEE T SPEECH AUDI P, V11, P298, DOI 10.1109/TSA.2003.814379 HUANG C, 2003, P INT C SPOK LANG PR, V3, P818 Humphries J. J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607273 HUMPHRIES JJ, 1997, P EUR C SPEECH COMM, V5, P2367 Hwang MY, 1993, IEEE T SPEECH AUDI P, V1, P414 Jurafsky D., 2001, P IEEE INT C AC SPEE, V1, P577 KESSENS J, 2002, THESIS U NIJMEGEN NE Kessens JM, 1999, SPEECH COMMUN, V29, P193, DOI 10.1016/S0167-6393(99)00048-5 Lamel L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.606916 LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 LEE KT, 2003, P IEEE INT C AC SPEE, V1, P296 LIU D, 1998, P DARPA BROADC NEWS, P123 LIVESCU K, 2003, P IEEE INT C AC SPEE, P1683 Martinez F, 1998, INT CONF ACOUST SPEE, P725, DOI 10.1109/ICASSP.1998.675367 McAllister D., 1998, P 5 INT C SPOK LANG, P1847 MIRGHAFORI N, 1995, P EUR 95 MADR SEPT, V1, P491 Nanjo H., 2001, P EUROSPEECH, P2531 OSTENDORF M, 2003, S PROS SPEECH PROC T, P147 Ostendorf M., 1996, P INT C SPOK LANG PR Pallett D.S., 1994, P HUM LANG TECHN WOR, P49, DOI 10.3115/1075812.1075824 PESKIN B, 1997, P EUROSPEECH 97 RHOD, P22 PETERS J, 2003, P HUM LANG TECHN C H, P82 PFAU T, 1998, P INT C SPOK LANG PR, P205 Richardson M., 1999, P EUR C SPEECH COMM, V1, P411 RIGOLL G, 2003, P IEEE WORKSH SPONT, P131 Riley M, 1999, SPEECH COMMUN, V29, P209, DOI 10.1016/S0167-6393(99)00037-0 Riley M. D., 1991, P INT C AC SPEECH SI, P737, DOI 10.1109/ICASSP.1991.150446 ROSE RC, 1993, P IEEE INT C AC SPEE, V1, P341 Saraclar M, 2000, COMPUT SPEECH LANG, V14, P137, DOI 10.1006/csla.2000.0140 Schluter R, 2001, SPEECH COMMUN, V34, P287, DOI 10.1016/S0167-6393(00)00035-2 SCHRAMM H, 2000, P IEEE INT C AC SPEE, P1659 SCHRAMM H, 2003, P SSPR, P143 SCHWARTZ R, 2004, P ICASSP, V3, P17 Shriberg E. E., 1994, THESIS U CALIFORNIA Slobada T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607274 SPROAT R, 2004, JHU CSLP WORKSH Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 TOMOKIYO LM, 2001, P ULT SPOK LANG PROC Wang Z., 2003, P IEEE INT C AC SPEE, P540 Wessel F, 1998, INT CONF ACOUST SPEE, P225, DOI 10.1109/ICASSP.1998.674408 WOODLAND P, 2000, P NIST SPEECH TRANSC WREDE B, 2002, THESIS U BIELEFELD GE YOUNG SJ, 1994, COMPUT SPEECH LANG, V8, P369, DOI 10.1006/csla.1994.1019 YU H, 2003, P EUR GEN SWITZ, P1869 ZHENG J, 2000, P IEEE ICASSP, P1775 NR 71 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2006 VL 48 IS 5 BP 493 EP 515 DI 10.1016/j.specom.2005.08.003 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 037QB UT WOS:000237160700003 ER PT J AU Fujii, A Itou, K Ishikawa, T AF Fujii, A Itou, K Ishikawa, T TI LODEM: A system for on-demand video lectures SO SPEECH COMMUNICATION LA English DT Article DE cross-media retrieval; speech recognition; spoken document retrieval; adaptation; lecture video AB We propose a cross-media lecture-on-demand system, called LODEM, which searches a lecture video for specific segments in response to a text query. We utilize the benefits of text, audio, and video data corresponding to a single lecture. LODEM extracts the audio track from a target lecture video, generates a transcription by large-vocabulary continuous speech recognition, and produces a text index. A user can formulate text queries using the textbook related to the target lecture and can selectively view specific video segments by submitting those queries. Experimental results showed that by adapting speech recognition to the lecturer and the topic of the target lecture, the recognition accuracy was increased and consequently the retrieval accuracy was comparable with that obtained by human transcription. LODEM is implemented as a client-server system on the Web to facilitate e-learning. (C) 2005 Elsevier B.V. All rights reserved.
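The retrieval step LODEM performs can be sketched as scoring a text query against time-stamped transcript passages and returning the best-matching segment's start time. A toy tf-idf stand-in with hypothetical transcript text; a real system would add stemming, a weighting scheme such as Okapi BM25 (the cited Robertson et al. reference), and handling of recognition errors:

import math
from collections import Counter

segments = [  # (start_sec, text): hypothetical ASR transcript passages
    (0,   "introduction to hidden markov models for speech"),
    (95,  "the viterbi algorithm finds the best state sequence"),
    (210, "language models assign probabilities to word sequences"),
]

def tfidf_search(query, segments):
    docs = [Counter(text.split()) for _, text in segments]
    n = len(docs)
    def idf(w):
        df = sum(1 for d in docs if w in d)
        return math.log((n + 1) / (df + 1)) + 1.0
    scores = [sum(d[w] * idf(w) for w in query.split()) for d in docs]
    best = max(range(n), key=scores.__getitem__)
    return segments[best][0], scores[best]

print(tfidf_search("viterbi state sequence", segments))  # -> (95, ...)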
C1 Univ Tsukuba, Grad Sch Lib Informat & Media Studies, Tsukuba, Ibaraki 3058550, Japan. Nagoya Univ, Grad Sch Informat Sci, Chikusa Ku, Nagoya, Aichi 4648603, Japan. RP Fujii, A (reprint author), Univ Tsukuba, Grad Sch Lib Informat & Media Studies, 1-2 Kasuga, Tsukuba, Ibaraki 3058550, Japan. EM fujii@slis.tsukuba.ac.jp CR Allan J., 2002, TOPIC DETECTION TRAC AUZANNE C, 2000, P RIAO 2000 C CONT B BAHL LR, 1983, IEEE T PATTERN ANAL, V5, P179 Berger A, 1998, INT CONF ACOUST SPEE, P705, DOI 10.1109/ICASSP.1998.675362 CHEN L, 2001, P ISCA WORKSH AD MET Clarkson P., 1997, P EUR 97 RHOD GREEC, P2707 Eguchi K., 2002, Proceedings of SIGIR 2002. Twenty-Fifth Annual International ACM SIGIR Conference on Research and Development in Information Retrieval Fujii A, 2003, P 8 EUR C SPEECH COM, P1149 FUJII A, 2003, P EUROSPEECH2003, P1153 FUJII A, 2004, P 8 INT C SPOK LANG, P2957 Garofolo J., 2000, P 8 TEXT RETR C, P107 GAROFOLO J, 1998, P 6 TEXT RETR C TREC, P83 HAMADA R, 2000, P ACM MULT 2000 WORK, P237, DOI 10.1145/357744.357953 Hauptmann A. G., 1997, INTELLIGENT MULTIMED, P215 Hearst MA, 1997, COMPUT LINGUIST, V23, P33 Itou K., 1998, P ICSLP, P3261 Iwayama M., 2003, P 26 ANN INT ACM SIG, P251 JOHNSON SE, 1999, P IEEE INT C AC SPEE, P49 Jones CR, 2002, EXTREMOPHILES, V6, P291, DOI 10.1007/s00792-001-0256-1 Jones G.J.F., 1996, P 19 ANN INT ACM SIG, P30, DOI 10.1145/243199.243208 Jourlin P, 2000, SPEECH COMMUN, V32, P21, DOI 10.1016/S0167-6393(00)00021-2 Kawahara T., 2000, P 6 INT C SPOK LANG, P476 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 NOMURA K, 1999, SP9913 IEICE, P31 OGATA J, 2002, P 7 ICSLP, P1429 Robertson S. E., 1994, P 17 ANN INT ACM SIG, P232 Seymore K, 1997, P EUROSPEECH, P1987 Sheridan P, 1997, PROCEEDINGS OF THE 20TH ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, P99, DOI 10.1145/258525.258544 Takao S., 2000, Proceedings ACM Multimedia 2000, DOI 10.1145/354384.376354 ZHU X, 2001, P 2001 IEEE INT C AC NR 30 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2006 VL 48 IS 5 BP 516 EP 531 DI 10.1016/j.specom.2005.08-006 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 037QB UT WOS:000237160700004 ER PT J AU Meyer, C Schramm, H AF Meyer, C Schramm, H TI Boosting HMM acoustic models in large vocabulary speech recognition SO SPEECH COMMUNICATION LA English DT Article DE boosting; AdaBoost; machine learning; acoustic model training; spontaneous speech; automatic speech recognition AB Boosting algorithms have been successfully used to improve performance in a variety of classification tasks. Here, we suggest an approach to apply a popular boosting algorithm (called "AdaBoost.M2") to Hidden Markov Model based speech recognizers, at the level of utterances. In a variety of recognition tasks we show that boosting significantly improves the best test error rates obtained with standard maximum likelihood training. In addition, results in several isolated word decoding experiments show that boosting may also provide further performance gains over discriminative training, when both training techniques are combined. In our experiments this also holds when comparing final classifiers with a similar number of parameters and when evaluating in decoding conditions with lexical and acoustic mismatch to the training conditions. 
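The utterance-level loop in the Meyer and Schramm abstract above can be sketched as: train an acoustic model on weighted data, increase the weights of misrecognised utterances, retrain, and finally combine the models. A schematic AdaBoost-style sketch around a black-box trainer; AdaBoost.M2's pseudo-loss over label pairs is simplified here to plain weighted error:

import math

def boost(train, classify, utterances, labels, rounds=3):
    """Schematic utterance-level boosting: train(utts, labels, w) returns a
    model, classify(model, utt) a label."""
    n = len(utterances)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        model = train(utterances, labels, w)
        miss = [classify(model, u) != y for u, y in zip(utterances, labels)]
        err = sum(wi for wi, m in zip(w, miss) if m)   # weighted error
        if err <= 0.0 or err >= 0.5:
            break
        beta = err / (1.0 - err)
        w = [wi * (1.0 if m else beta) for wi, m in zip(w, miss)]
        total = sum(w)
        w = [wi / total for wi in w]                   # renormalize
        ensemble.append((math.log(1.0 / beta), model)) # model weight
    return ensemble  # combined at recognition time, e.g. by weighted voting

# Tiny demo: the "trainer" just memorises the label of the heaviest utterance.
utts, labs = [0, 1, 2, 3], ["a", "a", "a", "b"]
train = lambda U, Y, W: Y[max(range(len(W)), key=W.__getitem__)]
classify = lambda model, u: model
print(boost(train, classify, utts, labs))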
Moreover, we present an extension of our algorithm to large vocabulary continuous speech recognition, allowing online recognition without further processing of N-best lists or word lattices. This is achieved by using a lexical approach for combining different acoustic models in decoding. In particular, we introduce a weighted summation over an extended set of alternative pronunciation models representing both the boosted models and the baseline model. In this way, arbitrarily long utterances can be recognized by the boosted ensemble in a single pass decoding framework. Evaluation results are presented on two tasks: a real-life spontaneous speech dictation task with a 60k word vocabulary and Switchboard. (C) 2005 Elsevier B.V. All rights reserved. C1 Philips Res Labs, D-52066 Aachen, Germany. RP Meyer, C (reprint author), Philips Res Labs, Weisshausstr 2, D-52066 Aachen, Germany. EM Carsten.Meyer@philips.com CR AUBERT X, 2000, P INT C SPOK LANG PR, V3, P802 AUBERT X, 1999, P EUR C SPEECH COMM, P1559 Bahl L. R., 1986, P IEEE INT C AC SPEE, P49 Beyerlein P, 2002, SPEECH COMMUN, V37, P109, DOI 10.1016/S0167-6393(01)00062-0 BEYERLEIN P, 2001, P EUROSPEECH, P499 Collins M., 2002, P ACL 2002 COLLINS M, 2000, P 17 INT C MACH LEAR Cook G., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607852 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DIMITRAKAKIS D, 2004, P INT C AC SPEECH SI Freund Y, 1997, J COMPUT SYST SCI, V55, P119, DOI 10.1006/jcss.1997.1504 Godfrey J., 1992, P INT C AC SPEECH SI JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 KOLTCHINSKII V, 2002, ANN STAT, V30 MEYER C, 2000, P INT C SPOK LANG PR, P632 MEYER C, 2002, P ICASSP 02 ORL, P109 MEYER C, 2002, P 19 INT C MACH LEAR, P419 MEYER C, 2004, P 6 IASTED INT C SIG PETERS J, 2003, P HUM LANG TECHN C H, P82 RAETSCH G, 2003, P EUROSPEECH 03 GEN, P997 Ruber B., 1997, P ESCA EUR 97 RHOD G, P739 Schapire R. E., 2002, P MSRI WORKSH NONL E, P149 SCHAPIRE RE, 1990, MACH LEARN, V5, P197, DOI 10.1007/BF00116037 Schapire RE, 1999, MACH LEARN, V37, P297, DOI 10.1023/A:1007614523901 Schapire RE, 1998, ANN STAT, V26, P1651 SCHLUTER R, 1999, P IEEE ASRU WORKSH K, P119 SCHRAMM H, 2000, P INT C AC SPEECH SI, V3, P1659 SCHRAMM H, 2003, P SSPR, P143 Schwenk H., 1999, P INT C AC SPEECH SI, P1009 VALIANT LG, 1984, COMMUN ACM, V27, P1134, DOI 10.1145/1968.1972 WOODLAND PC, 2000, P ISCA ITRW ASR2000, P7 ZHANG R, 2003, P EUROSPEECH 03 GEN, V3, P1885 ZHENG J, 2000, P INT C AC SPEECH SI, V3, P1775 ZWEIG G, 2000, P INT C AC SPEECH SI, P1527 NR 34 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2006 VL 48 IS 5 BP 532 EP 548 DI 10.1016/j.specom.2005.09.009 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 037QB UT WOS:000237160700005 ER PT J AU Skowronski, MD Harris, JG AF Skowronski, MD Harris, JG TI Applied principles of clear and Lombard speech for automated intelligibility enhancement in noisy environments SO SPEECH COMMUNICATION LA English DT Article DE clear speech; speech enhancement; energy redistribution ID HEARING-IMPAIRED LISTENERS; CONVERSATIONAL SPEECH; RECOGNITION; HARD AB Previous studies have documented phenomena involving the modification of human speech in special communication circumstances.
Whether speaking to a hearing-impaired person (clear speech) or in a noisy environment (Lombard speech), speakers tend to make similar modifications to their normal, conversational speaking style in order to increase the understanding of their message by the listener. One strategy characteristic of the above speech types is to increase consonant power relative to the signal power of adjacent vowels and is referred to as consonant-vowel (CV) ratio boosting. An automated method of speech enhancement using CV ratio boosting is called energy redistribution voiced/unvoiced (ERVU). To characterize the performance of ERVU, 25 listeners responded to 500 words in a two-word, forced-choice experiment in the presence of energetic masking noise. The test material was a vocabulary of confusable monosyllabic words spoken by 8 male and 8 female speakers, and the conditions tested were a control (unmodified speech), ERVU, and a high-pass filter (HPF). Both ERVU and the HPF significantly increased recognition accuracy compared to the control. Nine of the 16 speakers were significantly more intelligible when ERVU or the HPF was used, compared to the control, while no speaker was less intelligible. The results show that ERVU successfully increased intelligibility of speech using a simple automated segmentation algorithm, applicable to a wide variety of communication systems such as cell phones and public address systems. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Florida, Dept Elect & Comp Engn, Computat Neuroengn Lab, Gainesville, FL 32611 USA. RP Skowronski, MD (reprint author), Univ Florida, Dept Elect & Comp Engn, Computat Neuroengn Lab, Gainesville, FL 32611 USA. EM markskow@cnel.ufl.edu CR BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Bradlow AR, 2002, J ACOUST SOC AM, V112, P272, DOI 10.1121/1.1487837 DODDINGTON GR, 1981, IEEE SPECTRUM SEP, P26 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 Ferguson SH, 2002, J ACOUST SOC AM, V112, P259, DOI 10.1121/1.1482078 Fletcher H., 1953, SPEECH HEARING COMMU FURUI S, 1986, J ACOUST SOC AM, V80, P1016, DOI 10.1121/1.393842 GORDONSALANT S, 1986, J ACOUST SOC AM, V80, P1599, DOI 10.1121/1.394324 GRAY AH, 1974, IEEE T ACOUST SPEECH, VAS22, P207, DOI 10.1109/TASSP.1974.1162572 HARRIS JG, 2002, J ACOUST SOC AM, V112, P2305 Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9 Hopkins W. G., 2000, NEW VIEW STAT JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 Kennedy E, 1998, J ACOUST SOC AM, V103, P1098, DOI 10.1121/1.423108 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 NIEDERJOHN RJ, 1976, IEEE T ACOUST SPEECH, V24, P277, DOI 10.1109/TASSP.1976.1162824 PAYTON KL, 1994, J ACOUST SOC AM, V95, P1581, DOI 10.1121/1.408545 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434 REINKE TL, 2001, THESIS U FLORIDA GAI STRANGE W, 1983, J ACOUST SOC AM, V74, P695, DOI 10.1121/1.389855 Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660 NR 22 TC 38 Z9 38 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
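The Skowronski and Harris abstract above credits the intelligibility gain to consonant-vowel (CV) ratio boosting: raising consonant power relative to adjacent vowels. A minimal numpy sketch of that idea follows, assuming the waveform and hypothetical voiced/unvoiced segment labels are already given (the real ERVU system obtains the labels from an automatic segmentation algorithm, which is not reproduced here).

```python
import numpy as np

def boost_unvoiced(signal, segments, gain_db=6.0):
    """Raise unvoiced (consonant) segment energy, then renormalise so the
    total signal power is unchanged: energy is redistributed, not added.
    `segments` is a list of (start, end, label) with label 'v' or 'u'."""
    out = signal.astype(float).copy()
    gain = 10.0 ** (gain_db / 20.0)
    for start, end, label in segments:
        if label == "u":
            out[start:end] *= gain
    p_in = np.mean(signal.astype(float) ** 2)
    p_out = np.mean(out ** 2)
    if p_out > 0:
        out *= np.sqrt(p_in / p_out)  # keep overall power constant
    return out

# Toy signal: a 1 kHz 'vowel' followed by a weak noise-like 'consonant'.
fs = 8000
t = np.arange(4000) / fs
sig = np.concatenate([0.5 * np.sin(2 * np.pi * 1000 * t),
                      0.05 * np.random.randn(4000)])
enhanced = boost_unvoiced(sig, [(0, 4000, "v"), (4000, 8000, "u")])
```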
PD MAY PY 2006 VL 48 IS 5 BP 549 EP 558 DI 10.1016/j.specom.2005.09.003 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 037QB UT WOS:000237160700006 ER PT J AU Litman, DJ Forbes-Riley, K AF Litman, DJ Forbes-Riley, K TI Recognizing student emotions and attitudes on the basis of utterances in spoken tutoring dialogues with both human and computer tutors SO SPEECH COMMUNICATION LA English DT Article DE emotional speech; predicting user state via machine learning; prosody; empirical study relevant to adaptive spoken dialogue systems; tutorial dialogue systems ID AGREEMENT; COMMUNICATION; SPEECH; STATES AB While human tutors respond to both what a student says and to how the student says it, most tutorial dialogue systems cannot detect the student emotions and attitudes underlying an utterance. We present an empirical study investigating the feasibility of recognizing student state in two corpora of spoken tutoring dialogues, one with a human tutor, and one with a computer tutor. We first annotate student turns for negative, neutral and positive student states in both corpora. We then automatically extract acoustic-prosodic features from the student speech, and lexical items from the transcribed or recognized speech. We compare the results of machine learning experiments using these features alone, in combination, and with student and task dependent features, to predict student states. We also compare our results across human-human and human-computer spoken tutoring dialogues. Our results show significant improvements in prediction accuracy over relevant baselines, and provide a first step towards enhancing our intelligent tutoring spoken dialogue system to automatically recognize and adapt to student states. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Pittsburgh, Dept Comp Sci, Pittsburgh, PA 15260 USA. Univ Pittsburgh, Ctr Learning Res & Dev, Pittsburgh, PA 15260 USA. RP Litman, DJ (reprint author), Univ Pittsburgh, Dept Comp Sci, Pittsburgh, PA 15260 USA. 
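The Litman and Forbes-Riley record above predicts student states from acoustic-prosodic and lexical features with machine learning. The scikit-learn sketch below shows the shape of such a turn-level experiment; the feature set (pitch, energy, duration statistics) and the tiny labelled sample are invented stand-ins, not the paper's corpus or feature inventory.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# One row per student turn: invented acoustic-prosodic features
# [mean_f0_hz, f0_range_hz, mean_energy, duration_s, n_words].
X = np.array([
    [220.0, 80.0, 0.60, 2.1, 7],
    [180.0, 20.0, 0.30, 0.8, 2],
    [240.0, 95.0, 0.70, 3.0, 9],
    [175.0, 15.0, 0.25, 0.6, 1],
    [230.0, 85.0, 0.65, 2.5, 8],
    [185.0, 25.0, 0.35, 0.9, 3],
])
y = np.array([1, 0, 1, 0, 1, 0])  # 1 = positive state, 0 = negative/neutral

# Cross-validated accuracy of a simple linear classifier on the toy turns.
clf = LogisticRegression()
print("mean accuracy:", cross_val_score(clf, X, y, cv=3).mean())
```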
EM litman@cs.pitt.edu; forbesk@pitt.edu CR AIST G, 2002, P INT TUT SYST ITS, P992 Mostow J, 2001, SMART MACHINES IN EDUCATION, P169 ALEVEN V, 2003, P AIED 2003 WORKSH T ALEVEN V, 2001, P AI ED 2001, P246 ANG J, 2002, P INT C SPOK LANG PR, P203 Batliner A, 2003, SPEECH COMMUN, V40, P117, DOI 10.1016/S0167-6393(02)00079-1 BATLINER A, 2000, ISCA WORKSH SPEECH E, P195 BHATT K, 2004, P COGN SCI BIDDLE ES, 2003, P 3 US MOD WORKSH AS, P65 Black A., 1997, FESTIVAL SPEECH SYNT CARLETTA J, 1996, COMPUT LINGUISTICS, V22 CAVALLUZZI A, 2003, P US MOD C JOHNST PA, P86 CHI MTH, 1994, COGNITIVE SCI, V18, P439, DOI 10.1207/s15516709cog1803_3 COHEN J, 1968, PSYCHOL BULL, V70, P213, DOI 10.1037/h0026256 COHEN J, 1960, EDUC PSYCHOL MEAS, V20, P37, DOI 10.1177/001316446002000104 COLES G, 1999, LIT EMOTIONS BRAIN CONATI C, 2003, P 3 US MOD WORKSH AS CONATI C, 2003, P 3 US MOD WORKSH AS, P16 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 CRAIG SD, 2003, ADV TECHNOLOGY BASED, V3, P1903 DEROSIS F, 2001, P US MOD WORKSH ATT DEROSIS F, 2001, SPECIAL ISSUES USER, V11 DEROSIS F, 1999, P US MOD WORKSH ATT DEROSIS F, 2002, SPECIAL ISSUES USER, V12 DEVILLERS L, 2003, P IEEE INT C MULT EX DIEUGENIO B, 1997, P 35 ANN M ASS COMP DIEUGENIO B, 2004, COMPUT LINGUISTICS, V30 Evens M., 2001, P 12 MIDW AI COGN SC, P16 FAN C, 2003, P 2 INT C COMP INT R FISCHER K, 1999, 236 VERBM FORBESRILEY K, 2004, 4 M N AM CHAP ASS CO, P201 FORBESRILEY K, 2005, P INT C ART INT ED FOX BA, 1993, HUMAN TUTORIAL DIAGL Freund Y., 1996, P 13 INT C MACH LEAR, P148 Graesser A. C., 2001, INT J ARTIFICIAL INT, V12, P257 Graesser AC, 2001, AI MAG, V22, P39 HAUSMANN R, 2002, J COGNITIVE TECHNOL, V7, P4 Huang X., 1993, COMPUT SPEECH LANG, V2, P137 Izard C. E., 1984, EMOTIONS COGNITION B, P17 JORDAN P, 2002, P 3 SIGDIAL WORKSH D, P74 JORDAN P, 2003, P ART INT ED, P73 JORDAN PW, 2004, P INT TUT SYST C ITS, P346 Kort B, 2001, IEEE INTERNATIONAL CONFERENCE ON ADVANCED LEARNING TECHNOLOGIES, PROCEEDINGS, P43 KRIPPENDORF K, 1980, CONTENT ANAL INTRO IT LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310 LEE C, 2002, P INT C SPOK LANG PR LEE C, 2001, P IEEE AUT SPEECH RE LISCOMBE J, 2003, P EUROSPEECH Litman D., 2003, P ASRU VIRG ISL, P25 LITMAN D, 2004, P SIGDIAL WORKSH DIS, P144 LITMAN D, 2001, P 39 ANN M 10 C EUR, P362 LITMAN D, 2004, P INT C INT TUT SYST, P368 Litman D. J., 2004, P 42 ANN M ASS COMP, P352 Litman DJ, 1996, J ARTIF INTELL RES, V5, P53 Litman D.J., 2004, P 4 HLT NAACL C, P233 Maeireizo B., 2004, COMP P ASS COMP LING, P203 MASTERS JC, 1979, J PERS SOC PSYCHOL, V37, P380, DOI 10.1037//0022-3514.37.3.380 Moreno R, 2001, COGNITION INSTRUCT, V19, P177, DOI 10.1207/S1532690XCI1902_02 Mozziconacci SJL, 2001, USER MODEL USER-ADAP, V11, P297, DOI 10.1023/A:1011800417621 NARAYANAN S, 2002, P ISLE WORKSH DIAL T NASBY W, 1982, J PERS SOC PSYCHOL, V43, P1244, DOI 10.1037/0022-3514.43.6.1244 Oudeyer PY, 2003, INT J HUMAN COMPUTER, V59, P157, DOI 10.1016/S1071-581(02)00141-6 COWIE R, 2003, SPEECH COMMUN, V40, P5 Pantic M, 2003, P IEEE, V91, P1370, DOI 10.1109/JPROC.2003.817122 Picard R., 2001, IEEE T PATTERN ANAL, V23 POLZIN TS, 1998, P COOP MULT COMM POTTS R, 1986, MOTIV EMOTION, V10, P39, DOI 10.1007/BF00992149 Rickel J, 2000, EMBODIED CONVERSATIONAL AGENTS, P95 ROSE C, 2005, COMMUNICATION Rose C.
P., 2001, P ART INT ED, P256 ROSE CP, 2000, P 1 M N AM CHAPT ASS, P1129 ROSE CP, 2002, P ITS 2002 WORKSH EM ROSE CP, 2000, AAAI WORK NOT FALL S RUSSELL JA, 2003, ANNU REV PSYCHOL, V54, P29 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SCHULTZ K, 2003, AIED S P, P367 Seipp B., 1991, ANXIETY RES, V4, P27, DOI 10.1080/08917779108248762 Shafran I., 2003, P IEEE AUT SPEECH RE, P31 SHAH F, 2002, DISCOURSE PROCESS, V33 Siegel S., 1988, NONPARAMETRIC STAT B Siegle G. J., 1994, BALANCED AFFECTIVE W ten Bosch L, 2003, SPEECH COMMUN, V40, P213, DOI 10.1016/S0167-6393(02)00083-3 VanLehn K., 2002, P 6 INT C INT TUT SY, P158 Witten I. H., 1999, DATA MINING PRACTICA ZINN, 2002, P INT TUT SYST C ITS, P574 NR 85 TC 35 Z9 35 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2006 VL 48 IS 5 BP 559 EP 590 DI 10.1016/j.specom.2005.09.008 PG 32 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 037QB UT WOS:000237160700007 ER PT J AU Bangalore, S Hakkani-Tur, D Tur, G AF Bangalore, S Hakkani-Tur, D Tur, G TI Introduction to the special issue on spoken language understanding in conversational systems SO SPEECH COMMUNICATION LA English DT Editorial Material C1 AT&T Labs Res, Florham Pk, NJ 07932 USA. RP Bangalore, S (reprint author), AT&T Labs Res, 180 Pk Ave, Florham Pk, NJ 07932 USA. EM srini@research.att.com; dtur@research.att.com; gtur@research.att.com CR ALLEN JF, 1995, J EXP THEOR ARTIF IN, V7, P7, DOI 10.1080/09528139508953799 Chu-Carroll J, 1999, COMPUT LINGUIST, V25, P361 DENOS E, 1999, P EUR C SPEECH TECHN, P1527 DOWDING J, 1993, P ARPA WORKSH HUM LA *FASIL, 2002, FASIL FLEX AD SPOK L Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X GUPTA N, IN PRESS IEEE T SPEE HAKKANITUR D, IN PRESS COMPUT SPEE JACKENDOFF R, 2002, FDN LANGUAGE, pCH9 JOHNSTON M, 2002, P ANN M ASS COMP LIN LAMEL L, 1998, P INT C SPOK LANG PR Miller S., 1994, P ANN M ASS COMP LIN *MUC 7, 1998, P 7 MESS UND C MUC 7 Natarajan P., 2002, P INT C SPOK LANG PR *PAL, 2003, DARPA PERS ASS LEARN PECKHAM J, 1991, P DARPA SPEECH NAT L PIERACCINI R, 1992, P INT C AC SPEECH SI Price P. J., 1990, P DARPA WORKSH SPEEC Seneff S., 1992, Computational Linguistics, V18 van den Heuvel Henk, 1997, INT J SPEECH TECHNOL, V2, P119 WALKER MA, 1997, P ANN M ASS COMP LIN Walker M.A., 2002, P INT C SPOK LANG PR Ward W., 1994, P ARPA HUM LANG TECH, P213, DOI 10.3115/1075812.1075857 WILPON JG, 1990, IEEE T ACOUST SPEECH, V38, P1870, DOI 10.1109/29.103088 NR 24 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR-APR PY 2006 VL 48 IS 3-4 BP 233 EP 238 DI 10.1016/j.specom.2005.09.001 PG 6 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500001 ER PT J AU Haffner, P AF Haffner, P TI Scaling large margin classifiers for spoken language understanding SO SPEECH COMMUNICATION LA English DT Article DE spoken language understanding; machine learning; SVM; AdaBoost; maximum entropy ID SUPPORT VECTOR MACHINES; ALGORITHMS; REGRESSION AB Large margin classifiers, such as SVMs and AdaBoost, have achieved state-of-the-art performance for semantic classification problems that occur in spoken language understanding or textual data mining applications.
However, these computationally expensive learning algorithms cannot always handle the very large number of examples, features, and classes that are present in the available training corpora. This paper provides an original and unified presentation of these algorithms within the framework of regularized and large margin linear classifiers, reviews some available optimization techniques, and offers practical solutions to scaling issues. Systematic experiments compare the algorithms according to a number of criteria: performance, robustness, computational and memory requirements, and ease of parallelization. Furthermore, they confirm that the 1-vs-other multiclass scheme is a simple, generic and easy-to-implement baseline that has excellent scaling properties. Finally, this paper identifies the limitations of the classifiers and the multiclass schemes that are implemented. (c) 2005 Elsevier B.V. All rights reserved. C1 AT&T Labs Res, Middletown, NJ 07748 USA. RP Haffner, P (reprint author), AT&T Labs Res, 200 Laurel Ave S, Middletown, NJ 07748 USA. EM haffner@research.att.com CR ABNEY S, 1999, P JOINT SIGDAT C EMP AIOLLI F, 2005, ADV NEURAL INFORM PR, V17 BARTLETT PL, 2005, ADV NEURAL INFORM PR, V17, P113 BEGEJA L, 2005, P ICASSP 05 PHIL BORDES A, 2005, WORKING DOCUMENT FAS Breiman L, 1999, NEURAL COMPUT, V11, P1493, DOI 10.1162/089976699300016106 Chang C.-C., 2001, LIBSVM LIB SUPPORT V CHELBA C, 2003, P INTERSPEECH 03 EUR Collins M., 2000, P 13 ANN C COMP LEAR, P158 COLLOBERT R, 2002, NEURAL COMPUT, V14 Collobert R, 2001, J MACH LEARN RES, V1, P143, DOI 10.1162/15324430152733142 Cortes C, 2004, J MACH LEARN RES, V5, P1035 Crammer K, 2002, J MACH LEARN RES, V2, P265, DOI 10.1162/15324430260185628 Cristianini N., 2000, INTRO SUPPORT VECTOR DIFABBRIZIO G, 2004, P 5 SIGDIAL WORKSH D DUDIK M, 2004, P COLT 04 BANFF CAN Freund Y., 1996, P 13 INT C MACH LEAR, P148 Friedman J., 1998, ADDITIVE LOGISTIC RE Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X GRAF HP, 2005, ADV NEURAL INFORM PR, V17 GUPTA N, 2002, P EMNLP 02 HAFFNER P, 2005, WORKSH MIN NETW DAT HAFFNER P, 2003, P ICASSP 03 HONG KON HAFFNER P, 2005, EFFICIENT MULTICLASS JOACHIMS T, 1998, P ECML 98 JOACHIMS T, 1998, ADV KERNEL METHODS S KEERTHI SS, 2002, P INT C MACH LEARN, P299 LEBANON G, 2002, ADV NEURAL INFORM PR, V14 LEVIT M, 2004, P INTERSPEECH 04 Mangasarian OL, 2001, J MACH LEARN RES, V1, P161, DOI 10.1162/15324430152748218 Ng AY, 2002, ADV NEUR IN, V14, P841 Nigam K, 1999, IJCAI 99 WORKSH MACH, P61 PHILLIPS S, 2004, P ICML 04 BANFF CAN PHILLIPS S, 2005, UNPUB ACCELERATING S Platt J., 1998, ADV KERNEL METHODS S Ratsch G., 2002, P 15 ANN C COMP LEAR, P334 Rifkin R, 2004, J MACH LEARN RES, V5, P101 Schapire R E, 1997, P 14 INT C MACH LEAR, P322 Schapire RE, 1999, MACH LEARN, V37, P297, DOI 10.1023/A:1007614523901 Schapire RE, 2000, MACH LEARN, V39, P135, DOI 10.1023/A:1007649029923 Scholkopf B., 2002, LEARNING KERNELS STEINWART I, 2004, ADV NEURAL INFORM PR, V16 TANG M, 2003, P ASRU 03 Tipping ME, 2000, ADV NEURAL INFORM PR, V12 Tsang IW, 2005, J MACH LEARN RES, V6, P363 TUR G, 2003, P INTERSPEECH 03 EUR TUR G, 2002, P ICSLP 02 TUR G, 2004, P ICASSP 04 Vapnik V, 1998, STAT LEARNING THEORY VURAL V, 2004, P ICML 04 BANFF CAN WESTON J., 1998, CSDTR9804 U LOND DEP ZHANG J, 2003, P ICML 03 WASH DC NR 52 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
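Haffner's abstract above singles out the 1-vs-other multiclass scheme as a simple, scalable baseline for large margin linear classifiers. The sketch below trains one hinge-loss (SVM-style) linear classifier per class with stochastic gradient descent and routes each example to the highest-scoring class; the data, dimensions, and hyperparameters are toy choices, not the paper's setup.

```python
import numpy as np

def train_one_vs_other(X, y, n_classes, epochs=50, lr=0.1, lam=0.01):
    """One hinge-loss + L2 (large margin) linear classifier per class."""
    W = np.zeros((n_classes, X.shape[1]))
    for c in range(n_classes):
        t = np.where(y == c, 1.0, -1.0)        # class c vs. all others
        for _ in range(epochs):
            for i in range(len(X)):
                margin = t[i] * (W[c] @ X[i])
                grad = lam * W[c]              # L2 regulariser
                if margin < 1.0:               # inside the margin: push out
                    grad = grad - t[i] * X[i]
                W[c] -= lr * grad
    return W

def predict(W, X):
    return np.argmax(X @ W.T, axis=1)          # highest class score wins

rng = np.random.default_rng(0)
X = rng.normal(size=(90, 5)) + np.repeat(np.eye(3, 5) * 3.0, 30, axis=0)
y = np.repeat(np.arange(3), 30)                # three separable toy classes
W = train_one_vs_other(X, y, 3)
print("training accuracy:", np.mean(predict(W, X) == y))
```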
PD MAR-APR PY 2006 VL 48 IS 3-4 BP 239 EP 261 DI 10.1016/j.specom.2005.06.008 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500002 ER PT J AU He, YL Young, S AF He, YL Young, S TI Spoken language understanding using the hidden vector state model SO SPEECH COMMUNICATION LA English DT Article DE spoken language understanding; spoken dialogue systems; statistical semantic parsing; hidden vector state model AB The Hidden Vector State (HVS) Model is an extension of the basic discrete Markov model in which context is encoded as a stack-oriented state vector. State transitions are factored into a stack shift operation similar to that of a push-down automaton followed by the push of a new preterminal category label. When used as a semantic parser, the model can capture hierarchical structure without the use of treebank data for training and it can be trained automatically using expectation-maximization (EM) from only lightly annotated training data. When deployed in a system, the model can be continually refined as more data becomes available. In this paper, the practical application of the model in a spoken language understanding system (SLU) is described. Through a sequence of experiments, the issues of robustness to noise and portability to similar and extended domains are investigated. The end-to-end performance obtained from experiments in the ATIS domain shows that the system is comparable to existing SLU systems which rely on either hand-crafted semantic grammar rules or statistical models trained on fully annotated training corpora. Experiments using data which have been artificially corrupted with varying levels of additive noise show that the HVS-based parser is relatively robust, and experiments using data sets from other domains indicate that the overall framework allows adaptation to related domains, and scaling to cover enlarged domains. In summary, it is argued that constrained statistical parsers such as the HVS model allow robust spoken dialogue systems to be built at relatively low cost, and to be automatically adapted as new data is acquired, both to improve performance and to extend coverage. (c) 2005 Elsevier B.V. All rights reserved. C1 Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore. Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP He, YL (reprint author), Nanyang Technol Univ, Sch Comp Engn, Nanyang Ave, Singapore 639798, Singapore.
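The He and Young record above encodes context as a stack-oriented state vector, with each transition factored into a stack shift (pop) followed by the push of one preterminal label. Purely to make that state space concrete, here is a toy enumeration of the vector states reachable in one HVS transition; the label set, the example state, and the depth cap are invented.

```python
def hvs_successors(state, labels, max_depth=4):
    """Vector states reachable in one HVS transition: pop 0..len-1 stack
    elements (the sentence-level root is never popped), then push exactly
    one new preterminal label."""
    successors = []
    for n_pop in range(len(state)):
        kept = state[:len(state) - n_pop]
        if len(kept) < max_depth:              # respect the stack depth cap
            for label in labels:
                successors.append(kept + (label,))
    return successors

labels = ["CITY", "DATE", "TOLOC", "FROMLOC"]  # toy semantic labels
state = ("SS", "FLIGHT", "TOLOC")              # stack vector, root leftmost
for nxt in hvs_successors(state, labels):
    print(nxt)
```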
EM asylhe@ntu.edu.sg; sjy@eng.cam.ac.uk CR BACCHIANI M, 2003, P IEEE INT C AC SPEE *CUDATA, 2004, DARPA COMM TRAV DAT DAHL D, 1994, ARPA HUM LANG TECHN Dowding J., 1994, P 32 ANN M ASS COMP, P110, DOI 10.3115/981732.981748 Fine S, 1998, MACH LEARN, V32, P41, DOI 10.1023/A:1007469218079 Friedman N, 1997, MACH LEARN, V29, P131, DOI 10.1023/A:1007465528199 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 GOEL V, 1999, ESCA ETRW WORKSH ACC, P49 HE Y, 2004, HLT NAACL WORKSH SPO HE Y, 2003, P IEEE INT C AC SPEE HE Y, 2003, IEEE AC SPEECH REC U He Y, 2005, COMPUT SPEECH LANG, V19, P85, DOI 10.1016/j.csl.2004.03.001 KLAKOW D, 1998, P INT C SPOK LANG PR Kneser R., 1993, P EUR C SPEECH COMM, P973 Kumar N, 1997, THESIS J HOPKINS U B Lari K., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90022-X Lari K., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90009-F Levin E., 1995, P DARPA SPEECH NAT L, P269 LUO X, 2000, P IEEE INT C AC SPEE LUO X, 1999, IEEE AUT SPEECH REC MARTIN S, 2000, P INT C SPOK LANG PR MENG H, 2000, P INT C SPOK LANG PR Miller S., 1995, P DARPA SPEECH NAT L, P276 ROARK B, 2003, P JOINT M N AM CHAPT SCHABES Y, 1991, 2 WORKSH MATH LANG T SCHWARTZ R, 1997, P IEEE INT C AC SPEE, P1479 SENEFF S, 1992, P IEEE INT C AC SPEE STUTTLE MN, 2004, P INT C SPOK LANG PR VARGA AP, 1992, NOISEX 92 STUDY EFFE WARD W, 1996, P ARPA HUM LANG TECH, P213 Williams J., 2004, P INT C SPOK LANG PR YOUNG S, 2004, HTK BOOK HTK VERSION Young S., 2002, P INT C SPOK LANG PR NR 34 TC 17 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR-APR PY 2006 VL 48 IS 3-4 BP 262 EP 275 DI 10.1016/j.specom.2005.06.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500003 ER PT J AU Saraclar, M Roark, B AF Saraclar, M Roark, B TI Utterance classification with discriminative language modeling SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; language modeling; discriminative training; utterance classification ID ALGORITHMS AB This paper investigates discriminative language modeling in a scenario with two kinds of observed errors: errors in ASR transcription and errors in utterance classification. We train joint language and class models either independently or simultaneously, under various parameter update conditions. On a large vocabulary customer service call-classification application, we show that simultaneous optimization of class, n-gram, and class/n-gram feature weights results in a significant WER reduction over a model using just n-gram features, while additionally significantly outperforming a deployed baseline in classification error rate. A range of parameter estimation approaches, based on either the perceptron algorithm or conditional log-linear models, for various feature sets are presented and evaluated. The resulting models are encoded as weighted finite-state automata, and are used by intersecting the model with word lattices. (c) 2005 Elsevier B.V. All rights reserved. C1 Bogazici Univ, Dept Elect & Elect Engn, TR-34342 Istanbul, Turkey. Oregon Hlth & Sci Univ, OGI Sch Sci & Engn, Ctr Spoken Language Understanding, Beaverton, OR 97006 USA. RP Saraclar, M (reprint author), Bogazici Univ, Dept Elect & Elect Engn, TR-34342 Istanbul, Turkey. 
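The Saraclar and Roark record above trains joint class/n-gram feature weights, one option being the perceptron algorithm. A compressed sketch of that style of update on (word string, class) pairs follows; the unigram feature templates and the toy call-classification data simplify the paper's n-gram and lattice setting.

```python
from collections import defaultdict

def features(words, cls):
    feats = defaultdict(float)
    feats[("class", cls)] += 1.0
    for w in words:
        feats[("uni", w)] += 1.0               # plain word (n-gram) feature
        feats[("cls-uni", cls, w)] += 1.0      # joint class/word feature
    return feats

def train_perceptron(train, classes, epochs=10):
    w = defaultdict(float)
    def score(f):
        return sum(w[k] * v for k, v in f.items())
    for _ in range(epochs):
        for words, gold in train:
            pred = max(classes, key=lambda c: score(features(words, c)))
            if pred != gold:                   # classic perceptron update
                for k, v in features(words, gold).items():
                    w[k] += v
                for k, v in features(words, pred).items():
                    w[k] -= v
    return w

train = [("i lost my card".split(), "card_services"),
         ("check my balance please".split(), "balance"),
         ("my card was stolen".split(), "card_services"),
         ("what is my balance".split(), "balance")]
classes = ["card_services", "balance"]
w = train_perceptron(train, classes)
test = "is my card lost".split()               # unseen word order
print(max(classes, key=lambda c: sum(w[k] * v
                                     for k, v in features(test, c).items())))
```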
EM murat.saraclar@boun.edu.tr; roark@cslu.ogi.edu RI Saraclar, Murat/E-8640-2010 OI Saraclar, Murat/0000-0002-7435-8510 CR ALLAUZEN C, 2003, GRM LIB ALLAUZEN C, 2004, P 9 INT C IMPL APPL Allauzen Cyril, 2003, P 41 ANN M ASS COMP, P40 Balay S., 2002, PETSC USERS MANUAL BENSON SJ, 2002, ANLACSP9090901 BENSON SJ, 2002, TAO USERS MANUAL Chelba C., 2003, P INT C AC SPEECH SI Chen Stanley F., 1999, CMUCS99108 COLLINS M, 2004, NEW DEV PARSING Collins M, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P1 Cortes C, 2004, J MACH LEARN RES, V5, P1035 Crammer K, 2003, J MACH LEARN RES, V3, P1025, DOI 10.1162/153244303322533188 GILBERT M, IN PRESS IEEE SPEECH Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X GUPTA N, IN PRESS IEEE T SPEE Haffner P., 2003, P INT C AC SPEECH SI Johnson Mark, 1999, P 37 ANN M ASS COMP, P535, DOI 10.3115/1034678.1034758 Kuo H-K. J., 2002, P INT C AC SPEECH SI Kuo HKJ, 2003, IEEE T SPEECH AUDI P, V11, P24, DOI 10.1109/TSA.2002.807352 Lafferty John D., 2001, ICML, P282 Malouf R., 2002, P 6 C NAT LANG LEARN, P49 McCallum A., 2003, 7 C NAT LANG LEARN C *NIST, 2000, SPEECH REC SCOR TOOL PINTO D, 2003, P ACM SIGIR Ratnaparkhi A., 1994, P INT C SPOK LANG PR, P803 RICCARDI G, 1998, P INT C SPOK LANG PR ROARK B, 2004, P 42 ANN M ASS COMP Roark Brian, 2004, P INT C AC SPEECH SI, P749 SARACLAR M, 2005, P INT C AC SPEECH SI Schapire RE, 2000, MACH LEARN, V39, P135, DOI 10.1023/A:1007649029923 SHA F, 2003, P HLT NAACL EDM CAN Stephen Cox, 2003, P INT C AC SPEECH SI Stolcke A., 2000, P NIST SPEECH TRANSC TUR G, 2004, P INT C AC SPEECH SI, P437 Wallach H., 2002, THESIS U EDINBURGH WU J, 2002, P INT C AC SPEECH SI NR 36 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR-APR PY 2006 VL 48 IS 3-4 BP 276 EP 287 DI 10.1016/j.specom.2005.06.010 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500004 ER PT J AU Raymond, C Bechet, F De Mori, R Damnati, G AF Raymond, C Bechet, F De Mori, R Damnati, G TI On the use of finite state transducers for semantic interpretation SO SPEECH COMMUNICATION LA English DT Article DE automatic speech recognition; spoken language understanding; dialogue; finite state transducers ID SPEECH RECOGNITION; SYSTEMS AB A spoken language understanding (SLU) system is described. It generates hypotheses of conceptual constituents with a translation process. This process is performed by finite state transducers (FST) which accept word patterns from a lattice of word hypotheses generated by an Automatic Speech Recognition (ASR) system. FSTs operate in parallel and may share word hypotheses at their input. Semantic hypotheses are obtained by composition of compatible translations under the control of composition rules. Interpretation hypotheses are scored by the sum of the posterior probabilities of paths in the lattice of word hypotheses supporting the interpretation. A compact structured n-best list of interpretation is obtained and used by the SLU interpretation strategy. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Avignon, CNRS, LIA, F-84911 Avignon 09, France. France Telecom R&D Tech, SSTP, F-22307 Lannion 07, France. RP Bechet, F (reprint author), Univ Avignon, CNRS, LIA, BP1228, F-84911 Avignon 09, France. 
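In the Raymond et al. record above, an interpretation is scored by summing the posterior probabilities of the lattice paths that support it. Assuming, for illustration only, that the lattice has already been expanded into a handful of posterior-weighted path hypotheses, that scoring step reduces to the sketch below; the substring matcher stands in for the paper's concept FSTs.

```python
def interpretation_score(paths, concept_words):
    """Sum the posteriors of the word-hypothesis paths that contain the
    concept's word pattern as a contiguous subsequence."""
    k = len(concept_words)
    return sum(posterior for words, posterior in paths
               if any(words[i:i + k] == concept_words
                      for i in range(len(words) - k + 1)))

# Toy posterior-weighted paths extracted from a word lattice.
paths = [
    ("i want to fly to boston".split(), 0.50),
    ("i want to fly to austin".split(), 0.30),
    ("i went to fly to boston".split(), 0.20),
]
print(interpretation_score(paths, "to boston".split()))  # 0.50 + 0.20
```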
EM christian.raymond@univ-avignon.fr; frederic.bechet@univ-avignon.fr; renato.demori@univ-avignon.fr; geraldine.damnati@rd.francetelecom.com CR BANGALORE S, 2004, P HLT NAACL C BOST M, P33 BECHET F, 2002, P ICSLP 02 DENV COL BECHET F, 2000, 38 ANN M ASS COMP LI, P77 BRACHMAN RJ, 1985, COGNITIVE SCI, V9, P171, DOI 10.1207/s15516709cog0902_1 CHAPPELIER J, 1999, P 6 C TRAIT AUT LANG Esteve Y, 2003, IEEE T SPEECH AUDI P, V11, P746, DOI 10.1109/TSA.2003.818318 HACIOGLU K, 2004, P HLT NAACL C BOST M, P145, DOI 10.3115/1613984.1614021 HACIOGLU K, 2001, P EUR 2001 DENM HAFFNER P, 2003, IEEE INT C AC SPEECH HE Y, 2003, AUT SPEECH REC UND W HE Y, 2004, P SPOK LANG UND CONV, P39 Jackendoff Ray, 1990, SEMANTIC STRUCTURES KAISER E, 1999, P IEEE INT C AC SPEE, V5 KUHN R, 1995, IEEE T PATTERN ANAL, V17, P449, DOI 10.1109/34.391397 LEVESQUE HJ, 1985, READINGS KNOWLEDGE R, P42 LEVIN E, 1995, P EUR C SPEECH COMM, P555 Mohri M, 2002, COMPUT SPEECH LANG, V16, P69, DOI 10.1006/csla.2001.0184 MOHRI M, 1997, AT T FSM LIB FINITE POTAMIANOS A, 2000, P ICSLP 2000 BEIJ CH, V3 PRADHAN S, 2004, P HLT NAACL C BOST M, P33 Rahim M, 2001, SPEECH COMMUN, V34, P195, DOI 10.1016/S0167-6393(00)00054-6 RAYMOND C, 2003, AUT SPEECH REC UND W ROARK B, 2002, P 40 ACL M PHIL SADEK D, 1996, ICSLP 96 SARIKAYA R, 2004, P HLT NAACL C BOST U, P65, DOI 10.3115/1613984.1614001 Seneff S., 1992, Computational Linguistics, V18 VIDAL E, 1993, P EUR 93 BERL GERM Wang Y., 2002, P INT C SPOK LANG PR YOUNG SR, 1989, COMMUN ACM, V32, P183, DOI 10.1145/63342.63344 NR 29 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR-APR PY 2006 VL 48 IS 3-4 BP 288 EP 304 DI 10.1016/j.specom.2005.06.012 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500005 ER PT J AU Wutiwiwatchai, C Furui, S AF Wutiwiwatchai, C Furui, S TI A multi-stage approach for Thai spoken language understanding SO SPEECH COMMUNICATION LA English DT Article DE multi-stage spoken language understanding; logical N-gram modeling; Thai spoken dialogue system AB This article investigates a novel multi-stage approach for spoken language understanding (SLU), with an application to a pioneering Thai spoken dialogue system in a hotel reservation domain. Given an input word string, the system determines a goal and concept-values by three-stage processing; concept extraction, goal identification, and concept-value recognition. The concept extraction utilizes weighted finite state transducers (WFST) to extract concepts from the word string. Given the extracted concepts, a goal of the utterance is identified using a pattern classifier. Within a particular goal, the necessary concept-values are recognized from the WFST outputs produced in the concept extraction stage. A new logical N-gram model, which strategically combines the conventional N-gram parser with a regular grammar, is evaluated for concept extraction and concept-value recognition. Several classifiers are optimized and compared for goal identification. An advantage of the proposed SLU model is that it can be trained by a partially annotated corpus, where only the relevant keywords and the goal of each training utterance are required. Although the proposed model is evaluated only on the Thai hotel reservation system, the SLU itself is general and it is expected to be applicable for other languages once training data is available. (c) 2005 Elsevier B.V. 
All rights reserved. C1 Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. RP Wutiwiwatchai, C (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1 Ookayama, Tokyo 1528552, Japan. EM chai@furui.cs.titech.ac.jp RI Wutiwiwatchai, Chai/G-5010-2012 CR Bechet F., 2002, P INT C SPOK LANG PR, P597 Esteve Y., 2003, P EUR C SPEECH COMM, P617 GARNER PN, 1997, P ICASSP 97 MUN GERM, P1823 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X Hacioglu K., 2001, P IEEE INT C AC SPEE, P537 HE Y, 2003, P IEEE WORKSH AC SPE Joachims T, 1999, ADVANCES IN KERNEL METHODS, P169 Kasuriya S., 2003, Proceedings of the Oriental COCOSDA 2003. International Coordinating Committee on Speech Databases and Speech I/O System Assessment Lin Chih-Jen, 2001, COMP METHODS MULTICL Luksaneeyanawin S., 1993, P S NAT LANG PROC TH, P276 MEKNAVIN S, 1997, P NAT LANG PROC PAC, P41 Miller Scott, 1994, P 32 ANN M ASS COMP, P25, DOI 10.3115/981732.981736 Minker W., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607775 MITTRAPIYANURAK P, 2000, P NECTEC ANN C BANGK, P483 MOHRI M, 1997, GEN PURPOSE FINITE S Palmer D., 1999, P DARPA BROADC NEWS, P41 Platt JC, 2000, ADV NEUR IN, V12, P547 Potamianos A., 2000, P INT C SPOK LANG PR, P510 ROSSET S, 1999, P EUR, P1535 Seneff S., 1992, Computational Linguistics, V18 Szarvas M., 2003, P ICASSP HONG KONG C, P368 WANG YY, 2000, P IEEE INT C AC SPEE, P1639 Wang Y.-Y., 2002, P 7 INT C SPOK LANG, P609 WUTIWIWATCHAI C, 2003, P IEEE WORKSH AC SPE WUTIWIWATCHAI C, 2003, P EUR 2003, P2761 WUTIWIWATCHAI C, 2003, SPRING M AC SOC JPN, P87 WUTIWIWATCHAI C, 2004, P INT C SPOK LANG PR Zell A., 1994, SNNS STUTTGART NEURA Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 29 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR-APR PY 2006 VL 48 IS 3-4 BP 305 EP 320 DI 10.1016/j.specom.2005.02.005 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500006 ER PT J AU Zhang, RQ Kikui, G AF Zhang, RQ Kikui, G TI Integration of speech recognition and machine translation: Speech recognition word lattice translation SO SPEECH COMMUNICATION LA English DT Article AB An important issue in speech translation is to minimize the negative effect of speech recognition errors on machine translation. We propose a novel statistical machine translation decoding algorithm for speech translation to improve speech translation quality. The algorithm can translate the speech recognition word lattice, where more hypotheses are utilized to bypass the misrecognized single-best hypothesis. The decoding involves converting the recognition word lattice to a translation word graph by a graph-based search, followed by a fine rescoring by an A* search. We show that a speech recognition confidence measure implemented by posterior probability is effective to improve speech translation. The proposed techniques were tested in a Japanese-to-English speech translation task, in which we measured the translation results in terms of a number of automatic evaluation metrics. The experimental results demonstrate a consistent and significant improvement in speech translation achieved by the proposed techniques. (c) 2005 Elsevier B.V. All rights reserved. C1 ATR Spoken Language Translat Res Labs, Kyoto 6190288, Japan. 
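The Zhang and Kikui record above uses posterior probabilities on the recognition word lattice as a confidence measure during translation. As a self-contained illustration of how edge posteriors on a lattice can be computed, the sketch below runs forward-backward sums over a tiny acyclic lattice; the integer node ids are assumed topologically ordered, and the edge scores are made up.

```python
from collections import defaultdict

def edge_posteriors(edges, start, final):
    """Edge posterior = forward(src) * score * backward(dst) / total on an
    acyclic lattice; edges are (src, dst, word, score) tuples whose node
    ids are topologically ordered integers."""
    fwd = defaultdict(float)
    fwd[start] = 1.0
    for src, dst, word, score in sorted(edges):                # forward pass
        fwd[dst] += fwd[src] * score
    bwd = defaultdict(float)
    bwd[final] = 1.0
    for src, dst, word, score in sorted(edges, reverse=True):  # backward pass
        bwd[src] += score * bwd[dst]
    total = fwd[final]
    return {(s, d, w): fwd[s] * sc * bwd[d] / total
            for s, d, w, sc in edges}

# Tiny lattice: "i {want|wont} to go", with unnormalised edge scores.
edges = [(0, 1, "i", 1.0), (1, 2, "want", 0.6), (1, 2, "wont", 0.4),
         (2, 3, "to", 1.0), (3, 4, "go", 1.0)]
for edge, p in sorted(edge_posteriors(edges, 0, 4).items()):
    print(edge, round(p, 2))
```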
RP Zhang, RQ (reprint author), ATR Spoken Language Translat Res Labs, 2-2 Hikaridai, Kyoto 6190288, Japan. EM ruiqiang.zhang@atr.jp CR AKIBA Y, 2004, P IWSLT04 ATR KYOT J BERGER A, 1994, P ARPA HLT BOITET C, 1994, P COL 1994 Brown P. F., 1993, Computational Linguistics, V19 Casacuberta F., 2002, P WORKSH SPEECH TO S, P39 DODDINGTON G, 2002, P ARPA WORKSH HUM LA Gao Y., 2003, P EUR 2003 GEN, P365 Kikui G., 2003, P EUR, P381 KOEHN P, 2004, P AMTA 2004 WASH DC NEY H, 1999, P IEEE ICASSP PHOEN, V1, P517 Niessen S., 2000, P 2 INT C LANG RES E, P39 OCH FJ, 2004, P HLT NAACL BOST US Och F.J, 2003, P 41 ANN M ASS COMP, P160 Och F. J., 2003, Computational Linguistics, V29, DOI 10.1162/089120103321337421 PAPINENI KA, 2002, P ACL 2002 PHIL PA, P311 Press WH, 2000, NUMERICAL RECIPES C SALEEM S, 2004, P ICSLP 2004 JEJ KOR TAKEZAWA T, 2000, P LREC 2002 LAS PALM, P147 TILLMANN C, 1997, P ACL EACL 1997 MADR, P313 TURIAN J., 2003, P MT SUMM 9 NEW ORL, P386 Ueffing N, 2002, PROCEEDINGS OF THE 2002 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, P156 Vogel S., 1996, P 16 INT C COMP LING, P836 Wang Y., 1997, P 35 ANN C ASS COMP, P366 ZHANG R, 2004, P COL 2004 GEN ZHANG R, 2000, P ICASSP 2000 IST, P1595 NR 25 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR-APR PY 2006 VL 48 IS 3-4 BP 321 EP 334 DI 10.1016/j.specom.2005.06.007 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500007 ER PT J AU Boye, J Gustafson, J Wiren, M AF Boye, J Gustafson, J Wiren, M TI Robust spoken language understanding in a computer game SO SPEECH COMMUNICATION LA English DT Article DE spoken language understanding; robust parsing; robustness; dialogue systems; conversational systems; computer games; animated characters AB We present and evaluate a robust method for the interpretation of spoken input to a conversational computer game. The scenario of the game is that of a player interacting with embodied fairy-tale characters in a 3D world via spoken dialogue (supplemented by graphical pointing actions) to solve various problems. The player himself cannot directly perform actions in the world, but interacts with the fairy-tale characters to have them perform various tasks, and to get information about the world and the problems to solve. Hence the role of spoken dialogue as the primary means of control is obvious and natural to the player. Naturally, this means that robust spoken language understanding becomes a critical component. To this end, the paper describes a semantic representation formalism and an accompanying parsing algorithm which works off the output of the speech recogniser's statistical language model. The evaluation shows that the parser is robust in the sense of considerably improving on the noisy output of the speech recogniser. (c) 2005 Elsevier B.V. All rights reserved. C1 TeliaSonera R&D, S-13680 Haninge, Sweden. RP Boye, J (reprint author), TeliaSonera R&D, Rudsjoterrassen 2, S-13680 Haninge, Sweden. EM johan.boye@teliasonera.com CR AUST H, 1995, SPEECH COMMUN, V17, P249, DOI 10.1016/0167-6393(95)00028-M BELL L, 2005, P INT 05 LISB PORT Boros M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat.
No.96TH8206), DOI 10.1109/ICSLP.1996.607774 BOYE J, 2003, P EUR GEN SWITZ BOYE J, 2003, P DIABR 7 WORKSH SEM BOYE J, 1999, P IJCAI WORKSH KNOWL CHARNIAK E, 2000, P NAACL N AM CHAPT A Collins M., 1999, THESIS U PENNSYLVANI DALRYMPLE M, 1991, LINGUIST PHILOS, V14, P399, DOI 10.1007/BF00630923 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X HINDLEY R, 1986, INTRO COMBINATORS LAM JACKSON E, 1991, P DARPA SPEECH NAT L JURAFSKY A, 2000, SPEECH LANGUAGE PROC KASPER W, 1999, P ACL Larsson S., 2002, THESIS GOTEBORG U MILWARD D, 2001, P WISP NIVRE J, 2004, P COLING 2004 GEN SW Sterling L, 1994, ART PROLOG VANNOORD G, 1999, J NATURAL LANGUAGE E, V5, P45 WARD W, 1989, P DARPA SPEECH NAT L, P137, DOI 10.3115/100964.100975 NR 20 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR-APR PY 2006 VL 48 IS 3-4 BP 335 EP 353 DI 10.1016/j.specom.2005.06.015 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500008 ER PT J AU Hardy, H Biermann, A Inouye, RB McKenzie, A Strzalkowski, T Ursu, C Webb, N Wu, M AF Hardy, H Biermann, A Inouye, RB McKenzie, A Strzalkowski, T Ursu, C Webb, N Wu, M TI The AMITIES system: Data-driven techniques for automated dialogue SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 42nd Annual Meeting of the Association-for-Computational-Linguistics CY 2004 CL Barcelona, SPAIN SP Assoc Computat Linguist DE human-computer dialogue; spoken dialogue systems; language understanding; language generation ID RECOGNITION; SPEECH AB We present a natural-language customer service application for a telephone banking call center, developed as part of the Amities dialogue project (Automated Multilingual Interaction with Information and Services). Our dialogue system, based on empirical data gathered from real call-center conversations, features data-driven techniques that allow for spoken language understanding despite speech recognition errors, as well as mixed system/customer initiative and spontaneous conversation. These techniques include robust named-entity extraction, slot-filling Frame Agents, vector-based task identification and dialogue act classification, a Bayesian database record selection algorithm, and a natural language generator designed with templates created from real agents' expressions. Preliminary evaluation results indicate efficient dialogues and high user satisfaction, with performance comparable to or better than that of current conversational information systems. (c) 2005 Elsevier B.V. All rights reserved. C1 SUNY Albany, ILS Inst, Albany, NY 12222 USA. Duke Univ, Levine Sci Res Ctr, Dept Comp Sci, Durham, NC 27708 USA. Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Hardy, H (reprint author), SUNY Albany, ILS Inst, 1400 Washington Ave,SS262, Albany, NY 12222 USA. EM hhardy@cs.albany.edu; awb@cs.duke.edu; rbi@cs.duke.edu; armckenz@cs.duke.edu; tomek@cs.albany.edu; c.ursu@dcs.shef.ac.uk; n.webb@dcs.shef.ac.uk; minwu@cs.albany.edu CR ALLEN J, 1997, DRAFT DAMSL DIALOG A Allen J., 2001, AI MAGAZINE Allen J.F., 1996, P 34 ANN M ASS COMP ALLEN JF, 1995, J EXP THEOR ARTIF IN, V7, P7, DOI 10.1080/09528139508953799 Austin J.
L., 1962, HOW DO THINGS WORDS BIERMANN A, UNPUB CALLER IDENTIF Brill E., 1992, P 3 C APPL NAT LANG Chu-Carroll J, 1999, COMPUT LINGUIST, V25, P361 COLE R, 1991, P IEEE INT C AC SPEE CORTES C, 2003, P ICASSP 03 HONG KON Cunningham H., 2002, P 40 ANN M ASS COMP Cunningham Hamish, 2000, CS0010 U SHEFF DEP C DAHLBACK N, 1992, P 14 ANN C COGN SCI DIFABBRIZIO G, 2002, P 7 INT C SPOK LANG FERNANDEZ R, 2004, P 20 INT C COMP LING Grosz B. J., 1986, Computational Linguistics, V12 HARDY H, 2003, EUROSPEECH 2003 HARDY H, 2004, P 42 ANN M ASS COMP HARDY H, 2003, RES DIRECTIONS DIALO Hardy Hilda, 2002, P ISLE WORKSH DIAL T HILD H, 1995, P EUR, V2, P1977 JOHNSTON M, 2002, P 40 ANN M ASS COMP Jurafsky Daniel, 1998, 30 J HOPK U CTR LANG KARIS D, 1991, IEEE J SEL AREA COMM, V9, P574, DOI 10.1109/49.81951 LAMEL L, 1999, P IEEE INT C AC SPEE, P501 Lamel L, 2002, SPEECH COMMUN, V38, P131, DOI 10.1016/S0167-6393(01)00048-6 LEVIN E, ICSLP 2000 MAYNARD D, 2003, EXPERT UPDATE MAYNARD S, 2003, RECENT ADV NATURAL L MEYER M, 1997, P EUR, P1579 Peckham J., 1993, P 3 EUR C SPEECH COM, P33 Reithinger N., 1997, P 5 EUR C SPEECH COM, P2235 ROBERTSON SE, 1995, NATL I STANDARDS TEC, P219 *SAIC, 1998, P 7 MESS UND C MUC 7 Schapire RE, 2000, MACH LEARN, V39, P135, DOI 10.1023/A:1007649029923 SENEFF S, 2000, SAT DIAL WORKSH ANLP SENEFF S, 1998, ICSLP 1998 Singhal A., 1996, SIGIR Forum Stolcke A, 2000, COMPUT LINGUIST, V26, P339, DOI 10.1162/089120100561737 Vijay-Shanker K., 1998, P 36 ANN M ASS COMP WALKER M, 2002, ICSLP 2002 WALKER M, 2001, EUROSPEECH 2001 Walker MA, 2000, J ARTIF INTELL RES, V12, P387 WARD W, 1999, IEEE ASRU, P341 XU W, 2000, ANLP NAACL WORKSH CO, P42 Young S., 2002, INT C SPOK LANG PROC Zue V., 2000, IEEE T SPEECH AUDIO, V8 ZUE V, 1994, SPEECH COMMUN, V15, P331, DOI 10.1016/0167-6393(94)90083-3 NR 48 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR-APR PY 2006 VL 48 IS 3-4 BP 354 EP 373 DI 10.1016/j.specom.2005.07.006 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500009 ER PT J AU Huang, Q Cox, S AF Huang, Q Cox, S TI Task-independent call-routing SO SPEECH COMMUNICATION LA English DT Article DE call routing; interactive voice systems; voice response systems; speech dialogue systems; speech recognition ID RECOGNITION AB Call-routing is the technology of automatically classifying the type of a telephone call from a customer to a business or an institution in order to transmit the call onward to the correct "destination". Making transcriptions of calls to provide training data for automatic routing in a particular application requires considerable human effort, and it would be highly advantageous for the system to be able to learn how to route calls from training utterances that were not transcribed. This paper introduces several techniques that can be used to build call routers from an untranscribed training set, and also without any prior knowledge of the application vocabulary or grammar. The techniques concentrate on identifying sequences of decoded phones that are salient for routing, and introduces two methods for doing this using language models that are specifically tailored for the routing task. Despite the fact that the phone recognition error-rate on the calls is over 70%, the best system described here achieves a routing error of 13.5% on an 18-route task. (c) 2005 Elsevier B.V. All rights reserved.
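The call-routing record above learns salient decoded-phone sequences per route without transcriptions. One simple way to quantify such salience, sketched here on invented phone strings, is the route posterior P(route | n-gram) estimated from co-occurrence counts; this illustrates the notion of salience only, not the paper's specially tailored language models.

```python
from collections import Counter, defaultdict

def salient_ngrams(calls, n=3, min_count=2):
    """Rank phone n-grams by P(route | ngram), estimated from counts over
    (decoded_phone_sequence, route) training calls."""
    ngram_routes = defaultdict(Counter)
    for phones, route in calls:
        for i in range(len(phones) - n + 1):
            ngram_routes[tuple(phones[i:i + n])][route] += 1
    ranked = []
    for ngram, routes in ngram_routes.items():
        total = sum(routes.values())
        if total >= min_count:                 # ignore rare n-grams
            route, count = routes.most_common(1)[0]
            ranked.append((count / total, total, ngram, route))
    return sorted(ranked, reverse=True)

# Invented (and errorful, as in the paper's setting) decoded-phone calls.
calls = [("b ae l ax n s".split(), "balance"),
         ("m ay b ae l ax n s".split(), "balance"),
         ("l ao s t k aa d".split(), "lost_card"),
         ("k aa d w ao z l ao s t".split(), "lost_card")]
for purity, total, ngram, route in salient_ngrams(calls)[:5]:
    print(route, ngram, purity, total)
```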
C1 Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. RP Cox, S (reprint author), Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. EM sjc@cmp.uea.ac.uk CR Alshawi Hiyan, 2003, P HLT NAACL EDM, P1 Bellegarda JR, 1998, IEEE T SPEECH AUDI P, V6, P456, DOI 10.1109/89.709671 Chu-Carroll J, 1999, COMPUT LINGUIST, V25, P361 COX S, 2001, P I AC WORKSH INN SP COX S, 2003, P 8 EUR C SPEECH COM Gillick L., 1989, P ICASSP, P532 GIULIANI D, 2001, P ISCI ITRW WORKSH A GORIN A, 1999, P ASRU WORKSH KEYST GORIN AL, 1994, P INT C SPOKEN LANGU, P1483 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X HUANG Q, 2004, P HLT NAACL BOST US Kawahara T., 1998, IEEE T SPEECH AUDIO, V6 KUO H, 2000, P INT C SPOK LANG PR LAVRENKO V, 2002, P 24 EUR C IR RES, P193 LEVITT M, 2001, P EUR Miller D. R. H., 1999, Proceedings of SIGIR '99. 22nd International Conference on Research and Development in Information Retrieval, DOI 10.1145/312624.312680 Peng F., 2003, 25 EUR C INF RETR RE, P335 ROHLICEK JR, 1993, P IEEE INT C AC SPEE Tur G, 2003, P IEEE INT C AC SPEE TUR G, 2003, P EUR GEN Webb A. R., 2002, STAT PATTERN RECOGNI, V2nd WILPON JG, 1985, IEEE T ACOUST SPEECH, V33, P587, DOI 10.1109/TASSP.1985.1164581 WRIGHT JH, 1997, P 5 EUR C SPEECH COM, P1419 YOKOYAMA T, 2003, P IEEE INT C AC SPEE ZHAI C, 2001, RES DEV INFORM RETRI, P334 NR 25 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR-APR PY 2006 VL 48 IS 3-4 BP 374 EP 389 DI 10.1016/j.specom.2005.06.001 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500010 ER PT J AU Wang, YY Acero, A AF Wang, YY Acero, A TI Rapid development of spoken language understanding grammars SO SPEECH COMMUNICATION LA English DT Article DE automatic grammar generation; context free grammars (CFGs); example-based grammar learning; grammar controls; hidden Markov models (HMMs); n-Gram model; automatic speech recognition (ASR); spoken language understanding (SLU); statistical modeling; W3C speech recognition grammar specification (SRGS) AB To facilitate the development of spoken dialog systems and speech enabled applications, we introduce SGStudio (Semantic Grammar Studio), a grammar authoring tool that enables regular software developers with little speech/linguistic background to rapidly create quality semantic grammars for automatic speech recognition (ASR) and spoken language understanding (SLU). We focus on the underlying technology of SGStudio, including knowledge assisted example-based grammar learning, grammar controls and configurable grammar structures. While the focus of SGStudio is to increase productivity, experimental results show that it also improves the quality of the grammars being developed. (c) 2005 Elsevier B.V. All rights reserved. C1 Microsoft Res, Speech Technol Grp, Redmond, WA 98052 USA. RP Wang, YY (reprint author), Microsoft Res, Speech Technol Grp, 1 Microsoft Way, Redmond, WA 98052 USA. 
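The SGStudio record above learns semantic grammars from examples annotated only with the relevant slots. As a toy stand-in for that example-based learning, the sketch below induces per-slot fillers and slot-normalised templates from inline-annotated utterances; the bracket annotation format and the rule syntax are inventions for this illustration (SGStudio itself targets W3C SRGS grammars).

```python
import re
from collections import defaultdict

def learn_rules(examples):
    """Collect slot fillers and slot-normalised templates from utterances
    annotated inline as '[SlotName filler words]'."""
    fillers = defaultdict(set)
    templates = set()
    for utt in examples:
        for slot, text in re.findall(r"\[(\w+) ([^\]]+)\]", utt):
            fillers[slot].add(text)
        templates.add(re.sub(r"\[(\w+) [^\]]+\]", r"<\1>", utt))
    return fillers, templates

examples = [
    "fly from [City seattle] to [City boston] on [Date monday]",
    "book a flight to [City denver] on [Date friday]",
]
fillers, templates = learn_rules(examples)
for slot in sorted(fillers):                   # CFG-like filler rules
    print(f"<{slot}> ::= " + " | ".join(sorted(fillers[slot])))
for t in sorted(templates):                    # top-level templates
    print("template:", t)
```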
EM yeyiwang@microsoft.com; alexac@microsoft.com CR ALLEN JF, 1996, 34 ANN M ACL SANT CR, P62 BANGALORE S, 2004, HUMAN LANGUAGE TECHN CARPENTER B, 1998, INT C SPEECH LANG PR CHELBA C, 2003, IEEE INT C AC SPEECH Della Pietra S, 1997, 35TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 8TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE CONFERENCE, P168 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DOLFING H, 2004, INT C SPEECH LANG PR DOWDING J, 1993, 31ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE CONFERENCE, P54 Duda R. O., 2001, PATTERN CLASSIFICATI ESTEVE Y, 2003, EUROSPEECH 2003 FU KS, 1975, IEEE T SYST MAN CYB, VSMC5, P409, DOI 10.1109/TSMC.1975.5408432 FU KS, 1975, IEEE T SYST MAN CYB, V5, P85 GORIN A, 1995, J ACOUST SOC AM, V97, P3441, DOI 10.1121/1.412431 HAKKANITUR D, 2004, IEEE INT C AC SPEECH He Y, 2005, COMPUT SPEECH LANG, V19, P85, DOI 10.1016/j.csl.2004.03.001 HUNT A, 2002, SPEECH RECOGNITION G JELINEK F, 1990, 16374 RC IBM TJ WATS KUO HKJ, 2002, INT C SPEECH LANG PR MACHEREY K, 2001, EUROSPEECH 2001 MILLER S, 1994, 31 ANN M ASS COMP LI PARGELLIS A, 2001, EUROSPEECH 2001 PIERACCINI R, 1993, 1993 NATO ASI SUMM S PIERACCINI R, 2004, HLT NAACL WORKSH SPO Price P., 1990, DARPA SPEECH NAT LAN Riccardi G, 1996, COMPUT SPEECH LANG, V10, P265, DOI 10.1006/csla.1996.0014 RICCARDI G, 1998, INT C SPEECH LANG PR RINGGER E, 2000, THESIS U ROCHESTER Schapire RE, 2005, IEEE T SPEECH AUDI P, V13, P174, DOI 10.1109/TSA.2004.840937 Seneff S., 1992, Computational Linguistics, V18 STOLCKE A, 1994, TR94003 INT COMP SCI VIDAL E, 1993, II4193 DSIC U POL VA WANG NJ, 2004, INT C SPEECH LANG PR Wang P, 2005, CHINESE PHYS LETT, V22, P5, DOI 10.1088/0256-307X/22/1/002 WANG YY, 2000, IEEE INT C AC SPEECH WANG YY, 2004, INT C SPEECH LANG PR WANG YY, 1998, 36 ANN M ASS COMP LI WANG YY, 1999, EUROSPEECH 1999, V5, P2055 WARD W, 1994, HUM LANG TECHN WORKS WONG CC, 2001, IEEE AUT SPEECH REC WOODS WA, 1983, COMPUTER SPEECH PROC YOUNG S, 1993, TR153 CAMBR U DEP EN NR 41 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR-APR PY 2006 VL 48 IS 3-4 BP 390 EP 416 DI 10.1016/j.specom.2005.07.001 PG 27 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500011 ER PT J AU Higashinaka, R Sudoh, K Nakano, M AF Higashinaka, R Sudoh, K Nakano, M TI Incorporating discourse features into confidence scoring of intention recognition results in spoken dialogue systems SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 30th IEEE International Conference on Acoustics, Speech, and Signal Processing CY MAR 19-23, 2005 CL Philadelphia, PA SP IEEE DE confidence scoring; speech understanding; discourse understanding; spoken dialogue systems ID SPEECH RECOGNITION AB This paper proposes a method for the confidence scoring of intention recognition results in spoken dialogue systems. To achieve tasks, a spoken dialogue system has to recognize user intentions. However, because of speech recognition errors and ambiguity in user utterances, it sometimes has difficulty recognizing them correctly. Confidence scoring allows errors to be detected in intention recognition results and has proved useful for dialogue management.
Conventional methods use the features obtained from the speech recognition/understanding results for single utterances for confidence scoring. However, this may be insufficient since the intention recognition result is a result of discourse processing. We propose incorporating discourse features for a more accurate confidence scoring of intention recognition results. Experimental results show that incorporating discourse features significantly improves the confidence scoring. (c) 2005 Elsevier B.V. All rights reserved. C1 Nippon Telegraph & Tel Corp, NTT, Commun Sci Labs, Kyoto 6190237, Japan. RP Higashinaka, R (reprint author), Nippon Telegraph & Tel Corp, NTT, Commun Sci Labs, 2-4 Hikaridai, Kyoto 6190237, Japan. EM rh@cslab.kecl.ntt.co.jp; sudoh@cslab.kecl.ntt.co.jp; nakano@jp.honda-ri.com CR ABDOU S, 2001, P EUR, P1783 Allen J, 2001, P INT US INT 2001 IU, P1, DOI 10.1145/359784.359822 Ammicht E, 2001, P EUR, P2217 BAGGIA P, 1993, P ICASSP 93 MINN, V2, P123 Bechet F, 2004, SPEECH COMMUN, V42, P207, DOI 10.1016/j.specom.2003.07.003 BOBROW DG, 1977, ARTIF INTELL, V8, P155, DOI 10.1016/0004-3702(77)90018-2 Chu-Carroll J., 2000, P 6 ACL C APPL NAT L, P97, DOI 10.3115/974147.974161 CORAZZA A, 1991, IEEE T PATTERN ANAL, V13, P936, DOI 10.1109/34.93811 DOHSAKA K, 2003, P EUR, P657 ENDO T, 2002, P ICSLP, P1469 FILISKO EA, 2002, THESIS MIT Foote JT, 1997, COMPUT SPEECH LANG, V11, P207, DOI 10.1006/csla.1997.0027 Gillick L., 1989, P ICASSP, V1, P532 Goddeau D., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607458 Grice H. P., 1975, SYNTAX SEMANTICS, P41, DOI 10.1017/S0022226700005296 HACIOGLU K, 2002, P ICASSP, V1, P225 Hazen TJ, 2002, COMPUT SPEECH LANG, V16, P49, DOI 10.1006/csla.2001.0183 HIGASHINAKA R, 2005, P ICASSP2005, V1, P25, DOI 10.1109/ICASSP.2005.1415041 Higashinaka R, 2003, P 41 ACL, P240 HIGASHINAKA R, 2003, P EUR, P1941 Higashinaka R., 2004, ACM T SPEECH LANGUAG, V1, P1, DOI 10.1145/1035112.1035113 Hirschberg J, 2004, SPEECH COMMUN, V43, P155, DOI 10.1016/j.specom.2004.01.006 Huang J., 2001, P 39 ANN M ASS COMP, P290 KOMATANI K, 2000, P 18 COLING, V1, P467, DOI 10.3115/990820.990888 KUHN R, 1995, IEEE T PATTERN ANAL, V17, P449, DOI 10.1109/34.391397 Lee A., 2001, P EUR C SPEECH COMM, P1691 LIN YC, 2001, P EUR 2001, P1049 MACHEREY K, 2001, P 7 EUR 01, P2205 MIYAZAKI N, IN PRESS SYSTEMS COM PELLOM B, 2000, P ICSLP, V2, P723 POTAMIANOS A, 2000, P ICSLP, V3, P510 POWELL MJD, 1964, COMPUT J, V7, P155, DOI 10.1093/comjnl/7.2.155 PRADHAN SS, 2002, P ICASSP, V1, P233 Rahim MG, 1997, IEEE T SPEECH AUDI P, V5, P266, DOI 10.1109/89.568733 Rich C, 2001, AI MAG, V22, P15 SENEFF S, 1992, P ICASSP, V1, P23 Seneff S, 2002, COMPUT SPEECH LANG, V16, P283, DOI 10.1016/S0885-2308(02)00011-6 Singh S, 2002, J ARTIF INTELL RES, V16, P105 Takano S, 2001, IEEE T SPEECH AUDI P, V9, P3, DOI 10.1109/89.890065 Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002 NR 40 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
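The Higashinaka et al. record above combines utterance-level features with discourse features to score confidence in an intention recognition result. A minimal logistic combiner over such features is sketched below; the feature names and weights are invented placeholders, whereas in the paper the weights are learned from labelled dialogues.

```python
import math

# Hypothetical learned weights: two utterance-level features followed by
# two discourse-level features, plus a bias term.
WEIGHTS = {"asr_confidence": 2.5, "slot_update_count": 0.8,
           "consistent_with_history": 1.5, "times_reasked": -1.2,
           "bias": -2.0}

def intention_confidence(feats):
    """Sigmoid of the weighted feature sum: the probability that the
    recognised intention is correct."""
    z = WEIGHTS["bias"] + sum(WEIGHTS[k] * v for k, v in feats.items())
    return 1.0 / (1.0 + math.exp(-z))

# Same utterance-level evidence, different discourse history.
print(intention_confidence({"asr_confidence": 0.9, "slot_update_count": 1,
                            "consistent_with_history": 1, "times_reasked": 0}))
print(intention_confidence({"asr_confidence": 0.9, "slot_update_count": 1,
                            "consistent_with_history": 0, "times_reasked": 2}))
```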
PD MAR-APR PY 2006 VL 48 IS 3-4 BP 417 EP 436 DI 10.1016/j.specom.2005.06.011 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500012 ER PT J AU Zhang, T Hasegawa-Johnson, M Levinson, SE AF Zhang, T Hasegawa-Johnson, M Levinson, SE TI Extraction of pragmatic and semantic salience from spontaneous spoken English SO SPEECH COMMUNICATION LA English DT Article DE spoken language understanding; spoken dialogue systems; computational linguistics; information extraction ID STRESS; ACCENT AB This paper computationalizes two linguistic concepts, contrast and focus, for the extraction of pragmatic and semantic salience from spontaneous speech. Contrast and focus have been widely investigated in modern linguistics, as categories that link intonation and information/discourse structure. This paper demonstrates the automatic tagging of contrast and focus for the purpose of robust spontaneous speech understanding in a tutorial dialogue system. In particular, we propose two new transcription tasks, and demonstrate automatic replication of human labels in both tasks. First, we define focus kernel to represent those words that contain novel information neither presupposed by the interlocutor nor contained in the precedent words of the utterance. We propose detecting the focus kernel based on a word dissimilarity measure, part-of-speech tagging, and prosodic measurements including duration, pitch, energy, and our proposed spectral balance cepstral coefficients. In order to measure the word dissimilarity, we test a linear combination of ontological and statistical dissimilarity measures previously published in the computational linguistics literature. Second, we propose identifying symmetric contrast, which consists of a set of words that are parallel or symmetric in linguistic structure but distinct or contrastive in meaning. The symmetric contrast identification is performed in a way similar to the focus kernel detection. The effectiveness of the proposed extraction of symmetric contrast and focus kernel has been tested on a Wizard-of-Oz corpus collected in the tutoring dialogue scenario. The corpus consists of 630 non-single word/phrase utterances, containing approximately 5700 words and 48 minutes of speech. The tests used speech waveforms together with manual orthographic transcriptions, and yielded an accuracy of 83.8% for focus kernel detection and 92.8% for symmetric contrast detection. Our tests also demonstrated that the spectral balance cepstral coefficients, the semantic dissimilarity measure, and part-of-speech played important roles in the symmetric contrast and focus kernel detections. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Illinois, Beckman Inst, Urbana, IL 61801 USA. RP Zhang, T (reprint author), Univ Illinois, Beckman Inst, 405 N Mathews Ave, Urbana, IL 61801 USA. EM tzhang1@ifp.uiuc.edu; hasegawa@ifp.uiuc.edu; sel@ifp.uiuc.edu CR AHL D, 1969, TOPIC FOCUS STUDY RU Beckman M. 
E., 1994, GUIDELINES TOBI LABE BOLINGER DL, 1961, LANGUAGE, V37, P83, DOI 10.2307/411252 Bolinger Dwight, 1965, FORMS ENGLISH Bosch Peter, 1999, FOCUS LINGUISTIC COG Chomsky N., 1971, SEMANTICS INTERDISCI Chu-Carroll J, 1999, COMPUT LINGUIST, V25, P361 DAGAN I, 1995, COMPUT SPEECH LANG, V9, P123, DOI 10.1006/csla.1995.0008 DAUBECHIES I, 1990, IEEE T INFORM THEORY, V36, P961, DOI 10.1109/18.57199 Edmonds P, 2002, COMPUT LINGUIST, V28, P105, DOI 10.1162/089120102760173625 Firbas Jan, 1964, TRAVAUX LINGUIST, P267 Firbas Jan, 1966, TRAVAUX LINGUIST, V2, P229 Flammia G., 1998, THESIS MIT Gorin AL, 2002, COMPUTER, V35, P51, DOI 10.1109/MC.2002.993771 GUNDEL J, 2001, HDB PRAGMATIC THEORY GUSSENHOVEN C, 2002, SPEECH PROSODY 2002 Halliday M. A. K., 1967, J LINGUIST, V3, P199, DOI DOI 10.1017/S0022226700016613 HEDBERG N, 2001, LSA TOP FOC WORKSH HELDNER M, 1999, INT C PHON SCI HIGGINS D, 2004, INT C LING EV Jackendoff Ray S., 1972, SEMANTIC INTERPRETAT Jiang JJ, 1997, P INT C RES COMP LIN Kadmon N., 2001, FORMAL PRAGMATICS KAY M, 1975, THEORETICAL ISSUES N KIM SS, 2003, IEEE SIGNAL PROCESS, V11, P645 Kim SS, 1998, NEUROCOMPUTING, V20, P253, DOI 10.1016/S0925-2312(98)00018-6 KRIFKA M, 1999, P SALT 8 Kruijuff-Korbayova I., 2003, J LOGIC LANGUAGE INF, V12, P249, DOI 10.1023/A:1024160025821 LEE C, 1999, CRISPI, V1 Lee C., 2003, JAPANESE KOREAN LING, V12 LEE JH, 1993, J DOC, V49, P188, DOI 10.1108/eb026913 LENCI A, 2001, ONTOLOGY BASED INTER Lin D., 1998, P COLING ACL MONTR C MILER G, 2002, WORDNET MUNOZ M, 1999, EMNLP WVLC 99 PANTEL P, 2002, ACM SIGKDD C KNOWL D PIERREHUMBERT J, 1990, SYS DEV FDN, P271 RADA R, 1989, IEEE T SYST MAN CYB, V19, P17, DOI 10.1109/21.24528 REN Y, 2004, INT C SPEECH PROS Resniks P., 1995, P 14 INT JOINT C ART, P448 Rooth Mats, 1992, NAT LANG SEMANT, V1, P75, DOI DOI 10.1007/BF02342617 *RUL RES, 2004, DAT MIN TOOLS Santorini B, 1990, PART OF SPEECH TAGGI Sluijter AMC, 1997, J ACOUST SOC AM, V101, P503, DOI 10.1121/1.417994 Steedman M, 2000, LINGUIST INQ, V31, P649, DOI 10.1162/002438900554505 SUSSMANN RS, 1993, IND DIAMOND REV, V2, P1 TERRA E, 2003, P HLT NAACL THELEN M, 2002, P EMNLP Umbach C., 2004, J SEMANT, V21, P155, DOI 10.1093/jos/21.2.155 VALLDUVI E, 1998, SYNTAX SEMANTICS LIM, V29 Welby P, 2003, LANG SPEECH, V46, P53 XU Y, 2004, P ISCA INT C SPEECH YOON T, 2004, P ICSLP ZECHNER K, 2000, P COLING ZHANG T, 2004, THESIS U ILLINOIS UR ZHANG T, 2004, P ICSLP Zubizarreta M.L., 1998, PROSODY FOCUS WORD O NR 57 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR-APR PY 2006 VL 48 IS 3-4 BP 437 EP 462 DI 10.1016/j.specom.2005.07.007 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 016ZS UT WOS:000235664500013 ER PT J AU Li, JF Akagi, M AF Li, JF Akagi, M TI A noise reduction system based on hybrid noise estimation technique and post-filtering in arbitrary noise environments SO SPEECH COMMUNICATION LA English DT Article DE hybrid noise estimation; post-filtering; coherence function; speech presence uncertainty ID SPECTRAL AMPLITUDE ESTIMATOR; SPEECH ENHANCEMENT AB In this paper, we propose a novel noise reduction system, using a hybrid noise estimation technique and post-filtering, to suppress both localized noises and non-localized noise simultaneously in arbitrary noise environments. 
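The Li and Akagi record above removes the estimated localized-noise spectra by spectral subtraction before multi-channel post-filtering. A minimal sketch of the magnitude-domain spectral subtraction operation, with an over-subtraction factor and spectral floor in the style of Berouti et al. (1979), which the record cites; the constants and toy inputs are illustrative assumptions.

import numpy as np

def spectral_subtraction(noisy_mag, noise_mag, alpha=2.0, floor=0.02):
    """Subtract an estimated noise magnitude spectrum from the noisy one.

    noisy_mag, noise_mag: (frames, bins) STFT magnitude arrays.
    alpha: over-subtraction factor; floor: spectral floor re. noise level.
    """
    cleaned = noisy_mag - alpha * noise_mag
    # Clamp residual bins to a small fraction of the noise estimate to
    # limit musical noise, as in Berouti-style spectral subtraction.
    return np.maximum(cleaned, floor * noise_mag)

rng = np.random.default_rng(0)
noisy = np.abs(rng.normal(size=(100, 257))) + 1.0   # toy magnitude spectra
noise = np.full((100, 257), 1.0)                    # toy noise estimate
print(spectral_subtraction(noisy, noise).mean())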
To estimate localized noises, we present a hybrid noise estimation technique which combines a multi-channel estimation approach we previously proposed and a soft-decision single-channel estimation approach. Final estimation accuracy for localized noises is significantly improved by incorporating a robust and accurate speech absence probability (RASAP) estimator, which considers the strong correlation of SAPs between adjacent frequency bins and consecutive frames and makes full use of the high estimation accuracy of the multi-channel approach. The estimated spectra of localized noises are subtracted from those of the noisy observations by spectral subtraction. Non-localized noise is further reduced by a multi-channel post-filter which is based on the optimally modified log-spectral amplitude (OM-LSA) estimator. With the assumption of a diffuse noise field, we propose an estimator for the a priori SAP based on the coherence characteristic of the noise field at the spectral subtraction output (high coherence at low frequencies and low coherence at high frequencies), improving the spectral enhancement of the desired speech signal. Experimental results demonstrate the effectiveness and superiority of the proposed noise estimation/reduction methods in terms of objective and subjective measures in various noise conditions. (c) 2005 Elsevier B.V. All rights reserved. C1 Japan Adv Inst Sci & Technol, Sch Informat Sci, Nomigun, Ishikawa 9231292, Japan. RP Li, JF (reprint author), Japan Adv Inst Sci & Technol, Sch Informat Sci, 1-1 Asahidai, Nomigun, Ishikawa 9231292, Japan. EM junfeng@jaist.ac.jp; akagi@jaist.ac.jp CR AKAGI M, 2002, P ICASSP 02 ORL, P909 AKAGI M, 1997, P EUROSPEECH97 ROD, P335 Berouti M., 1979, P IEEE INT C AC SPEE, P208 BITZER J, 2001, SUPERDIRECTIVE MICRO, P19 BITZER J, 1999, INT WORKSH AC ECH NO, P27 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cohen I, 2001, SIGNAL PROCESS, V81, P2403, DOI 10.1016/S0165-1684(01)00128-1 Elko GW, 1996, SPEECH COMMUN, V20, P229, DOI 10.1016/S0167-6393(96)00057-X EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817 Gannot S, 2001, IEEE T SIGNAL PROCES, V49, P1614, DOI 10.1109/78.934132 GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 LI J, 2004, P INT C SPOK LANG PR, P2705 McCowan IA, 2003, IEEE T SPEECH AUDI P, V11, P709, DOI 10.1109/TSA.2003.818212 Meyer J., 1997, P IEEE INT C AC SPEE, P21 MIZUMACHI M, 1999, P WORKSH ROB METH SP, P179 QUACKENBUSH SR, 1988, OBJECTIVES MEASURES RABINER L, 1993, SPEECH RECOGNITION S Simmer K. U., 1992, P 2 COST 229 WORKSH, P185 SIMMER KU, 2001, POST FILTERING TECHN, P39 Zelinski R., 1988, P IEEE INT C AC SPEE, V5, P2578 NR 22 TC 17 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD FEB PY 2006 VL 48 IS 2 BP 111 EP 126 DI 10.1016/j.specom.2005.06.013 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 004SM UT WOS:000234772500001 ER PT J AU Mami, Y Charlet, D AF Mami, Y Charlet, D TI Speaker recognition by location in the space of reference speakers SO SPEECH COMMUNICATION LA English DT Article DE speaker recognition; Gaussian mixture models; relative representation; anchor models; reference speakers ID VERIFICATION AB Speaker representation by location in a reference space is a new technique of speaker recognition and adaptation. It consists in representing a speaker relatively rather than absolutely, by comparing him to a set of well-trained speakers. The main motivation is to obtain a compact modeling of every speaker, which gives performance similar to that of the state-of-the-art GMM-UBM. Thus, instead of estimating numerous parameters of an absolute model of the speaker, only a few parameters of a model relative to other speaker models, called reference speakers, are estimated. In this study, several points are addressed that are related to the concept of relative location in speaker recognition. Firstly, the reference speaker space is built. Then the appropriate metrics in this space are investigated in order to perform speaker recognition using a geometrical approach. Finally, a statistical approach for speaker location is used to eliminate the weaknesses of the geometrical approach. In-depth evaluations on a telephone database show that the concept of relative location is a promising technique for speaker verification. Therefore, it can be concluded that the most important motivation for using anchor models is their computational efficiency for indexing tasks. (c) 2005 Elsevier B.V. All rights reserved. C1 France Telecom, F-22307 Lannion, France. RP Mami, Y (reprint author), France Telecom, 2 Ave Pierre Marzin, F-22307 Lannion, France. EM yassine.mami@francetelecom.com; delphine.charlet@francetelecom.com CR Charlet D, 1997, PATTERN RECOGN LETT, V18, P873, DOI 10.1016/S0167-8655(97)00064-0 Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 MAMI Y, 2002, INT C SPOK LANG PROC, V2, P1333 MAMI Y, 2003, INT C AC SPEECH SIGN, V1, P180 MERLIN T, 1999, COST 254 INT WORKSH REYNOLDS D, 1998, INT C AC SPEECH SIGN Reynolds D. A., 1997, EUROSPEECH, P963 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136 Solomonoff A, 1998, INT CONF ACOUST SPEE, P757, DOI 10.1109/ICASSP.1998.675375 Sturim DE, 2001, INT CONF ACOUST SPEE, P429, DOI 10.1109/ICASSP.2001.940859 NR 13 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
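The Mami and Charlet record above locates a speaker relative to a set of reference speakers (anchor models). A minimal sketch of that idea: each utterance is mapped to a vector of average log-likelihoods against the reference models and compared geometrically; single Gaussians stand in for the full GMMs, and all data are synthetic.

import numpy as np
from scipy.stats import multivariate_normal as mvn

# Reference speakers modelled (for illustration) by single Gaussians.
rng = np.random.default_rng(1)
refs = [mvn(mean=rng.normal(size=4), cov=np.eye(4)) for _ in range(8)]

def location_vector(frames):
    """Average log-likelihood against each reference speaker model."""
    return np.array([m.logpdf(frames).mean() for m in refs])

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

enrol = rng.normal(loc=0.5, size=(200, 4))   # enrolment frames of a speaker
test = rng.normal(loc=0.5, size=(50, 4))     # test utterance frames
print(cosine(location_vector(enrol), location_vector(test)))

The compact representation is the location vector itself, which is why the record highlights the computational appeal of anchor models for indexing.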
PD FEB PY 2006 VL 48 IS 2 BP 127 EP 141 DI 10.1016/j.specom.2005.06.014 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 004SM UT WOS:000234772500002 ER PT J AU Doumpiotis, V Byrne, W AF Doumpiotis, V Byrne, W TI Lattice segmentation and minimum Bayes risk discriminative training for large vocabulary continuous speech recognition SO SPEECH COMMUNICATION LA English DT Article DE discriminative training; maximum mutual information (MMI) estimation; acoustic modeling; minimum Bayes risk decoding; risk minimization; large vocabulary speech recognition; lattice segmentation AB Lattice segmentation techniques developed for Minimum Bayes Risk decoding in large vocabulary speech recognition tasks are used to compute the statistics needed for discriminative training algorithms that estimate HMM parameters so as to reduce the overall risk over the training data. New estimation procedures are developed and evaluated for both small and large vocabulary recognition tasks, and additive performance improvements are shown relative to maximum mutual information estimation. These relative gains are explained through a detailed analysis of individual word recognition errors. (c) 2005 Elsevier B.V. All rights reserved. C1 Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA. Johns Hopkins Univ, Dept Elect & Comp Engn, Baltimore, MD 21218 USA. RP Byrne, W (reprint author), Univ Cambridge, Engn Dept, Machine Intelligence Lab, Trumpington St, Cambridge CB2 1PZ, England. EM vlasios@jhu.edu; wjb31@cam.ac.uk CR Byrne W., 2004, IEEE T SPEECH AUDIO BYRNE W, 2001, P NIST LVCSR WORSKH Goel V, 2000, COMPUT SPEECH LANG, V14, P115, DOI 10.1006/csla.2000.0138 GOEL V, 2004, IEEE T SPEECH AUDIO GOEL V, 2001, EUR C SPEECH COMM TE, V4, P2569 GOPALAKRISHNAN PS, 1991, IEEE T INFORM THEORY, V37, P107, DOI 10.1109/18.61108 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 Kaiser J., 2000, ICSLP Kaiser J, 2002, SPEECH COMMUN, V38, P383, DOI 10.1016/S0167-6393(02)00009-2 KUMAR S, 2002, ICSLP 2002, P373 MARTIN A, 1998, HUB 5 WORKSH NIST LI MARTIN A, 2000, P SPEECH TRANSCR WOR MOHRI M, 2001, ATT GEN PURPOSE FINI MOHRI M, 1999, EUR C SPEECH COMM TE NOEL M, 1997, ALPHADIGITS TECH REP NORMANDIN Y, 1996, AUTOMATIC SPEECH SPE, P57 PALLETT D, 1990, P ICASSP ALB NM, V1, P97 Sankoff D, 1983, TIME WARPS STRING ED STOLCKE A, 2000, P SPEECH TRANSCR WOR Stolcke A, 1997, EUROSPEECH WOODLAND PC, 2000, P TUT RES WORKSH AUT Young S., 2000, HTK BOOK VERSION 3 0 NR 23 TC 2 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2006 VL 48 IS 2 BP 142 EP 160 DI 10.1016/j.specom.2005.07.002 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 004SM UT WOS:000234772500003 ER PT J AU Markov, K Dang, JW Nakamura, S AF Markov, K Dang, JW Nakamura, S TI Integration of articulatory and spectrum features based on the hybrid HMM/BN modeling framework SO SPEECH COMMUNICATION LA English DT Article DE HMM/BN; multiple feature integration; articulatory modeling ID SPEECH RECOGNITION; NEURAL-NETWORK; MOVEMENTS; ACOUSTICS AB Most of the current state-of-the-art speech recognition systems are based on speech signal parametrizations that crudely model the behavior of the human auditory system. 
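The Doumpiotis and Byrne record above builds on Minimum Bayes Risk decoding. A minimal sketch of the MBR decision rule over an N-best list rather than a full lattice: choose the hypothesis with the lowest expected word-edit risk under the posterior distribution; the hypotheses and posteriors are invented for illustration.

import numpy as np

def edit_distance(a, b):
    """Word-level Levenshtein distance via dynamic programming."""
    d = np.arange(len(b) + 1, dtype=float)
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1,
                                   prev + (wa != wb))
    return d[-1]

def mbr_decode(hyps, posteriors):
    """Return the hypothesis minimising expected edit-distance risk."""
    risk = [sum(p * edit_distance(h, h2)
                for h2, p in zip(hyps, posteriors)) for h in hyps]
    return hyps[int(np.argmin(risk))]

hyps = ["the cat sat".split(), "a cat sat".split(), "the cats at".split()]
print(mbr_decode(hyps, [0.5, 0.3, 0.2]))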
However, little or no use is usually made of the knowledge on the human speech production system. A data-driven statistical approach to incorporate this knowledge into ASR would require a substantial amount of data, which are not widely available since their acquisition is difficult and expensive. Furthermore, during recognition, it is nearly impossible to obtain observations of articulator movements. Thus, research on speech production mechanisms in ASR has largely focused on modeling the hidden articulatory trajectories and using prior phonetic and phonological knowledge. Nevertheless, it has been shown that combining the acoustic and articulatory information can lead to improved speech recognition performance. The approach taken in this study is to integrate features extracted from actual articulatory data with acoustic MFCC features in a way that allows recognition using MFCC only. Rather than trying to map articulatory features to the corresponding acoustic features, we use the probabilistic dependency between them. Bayesian Networks (BN) are ideally suited for this purpose. They can model complex joint probability distributions with many discrete and continuous variables and have great flexibility in representing their dependencies. Our speech recognition system is based on the hybrid HMM/BN acoustic model where the BN is used to describe the HMM states' probability distributions. HMM transitions, on the other hand, model the temporal speech characteristics. Articulatory and acoustic features are represented by different variables of the BN. Dependencies are learned from the observable articulatory and acoustic training data. During recognition, when only the acoustic observations are available, articulatory variables are assumed hidden. We have evaluated our ASR system by using a small database consisting of articulatory and acoustic data recorded from three speakers. The articulatory data are actual measurements of articulator positions at several points. In all experiments involving both speaker-dependent and multi-speaker acoustic models, the HMM/BN system outperformed the baseline HMM system trained on acoustic data only. In experimenting with different BN topologies, we found that integrating the velocity and acceleration coefficients calculated as first and second derivatives of the articulatory position data can further improve recognition performance. (c) 2005 Elsevier B.V. All rights reserved. C1 ATR Spoken Language Translat Res Labs, Kyoto 6190288, Japan. Japan Adv Inst Sci & Technol, Nomi, Ishikawa 9231292, Japan. RP Markov, K (reprint author), ATR Spoken Language Translat Res Labs, Hikaridai 2-2-2, Kyoto 6190288, Japan. EM konstantin.markov@atr.jp; dang@jaist.ac.jp; satoshi.nakamura@atr.jp CR Bourlard H., 1994, CONNECTIONIST SPEECH Chen S. S., 1998, P IEEE INT C AC SPEE, V2, P645 Cowell R, 1998, NATO ADV SCI I D-BEH, V89, P9 DAOUDI K, 2001, P ASRU Dean T., 1988, AAAI 88.
Seventh National Conference on Artificial Intelligence DENG L, 1992, SIGNAL PROCESS, V27, P65, DOI 10.1016/0165-1684(92)90112-A Deng L, 1998, SPEECH COMMUN, V24, P299, DOI 10.1016/S0167-6393(98)00023-5 DENG L, 1996, ESCA TUT RES WORKSH, P69 ERLER K, 1995, P IEEE C COMM COMP S, P562 GAO Y, 2000, P ICSLP, V1, P25 Heckerman D, 1998, NATO ADV SCI I D-BEH, V89, P301 Hiroya S, 2004, IEEE T SPEECH AUDI P, V12, P175, DOI 10.1109/TSA.2003.822636 HODGEN J, 2000, P 5 SEM SPEECH PROD HOGDEN J, 2001, P IEEE IMTC, V2, P1105 JENSEN F, 1988, INTRO BAYESIAN NETWO KIRCHHOFF K, 2000, P ICASSP, V3, P1435 KIRCHHOFF K, 1998, TR98037 INT COMP SCI LIU S, 1996, J ACOUST SOC AM, P3417 MARKOV K, 2003, P EUROSPEECH GEN SWI, P965 Markov K, 2003, IEICE T INF SYST, VE86D, P438 MARKOV K, 2003, P ICASSP, V1, P888 Okadome T, 2001, J ACOUST SOC AM, V110, P453, DOI 10.1121/1.1377633 PAPCUN G, 1992, J ACOUST SOC AM, V92, P688, DOI 10.1121/1.403994 STEPHENSON T, 2001, P EUR, P2765 YOUNG S, 1999, HTK BOOK ZACKS J, 1994, COMPUT SPEECH LANG, V8, P189, DOI 10.1006/csla.1994.1009 NR 26 TC 26 Z9 28 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2006 VL 48 IS 2 BP 161 EP 175 DI 10.1016/j.specom.2005.07.003 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 004SM UT WOS:000234772500004 ER PT J AU Facco, A Falavigna, D Gretter, R Vigano, M AF Facco, A Falavigna, D Gretter, R Vigano, M TI Design and evaluation of acoustic and language models for large scale telephone services SO SPEECH COMMUNICATION LA English DT Article DE automatic telephone services; acoustic model training; supervised/unsupervised/partly supervised training; language model refinement ID BROADCAST NEWS AB This paper describes the specification, design and development phases of two widely used telephone services based on automatic speech recognition. The effort spent on evaluating and tuning these services will be discussed in detail. In developing the first service, mainly based on the recognition of "alphanumeric" sequences, a significant part of the work consisted of refining the acoustic models. To increase recognition accuracy, we adopted algorithms and methods previously consolidated on broadcast news transcription tasks. A significant result shows that the use of task-specific context dependent phone models reduces the word error rate by about 40% relative to using context independent phone models. Note that the latter result was achieved over a small vocabulary task, significantly different from those generally used in broadcast news transcription. We also investigated both unsupervised and supervised training procedures. Moreover, we studied a novel partly supervised technique that allows us to select in some "optimal" way the speech material to manually transcribe and use for acoustic model training. A significant result shows that the proposed procedure gives performance close to that obtained with a completely supervised training method. In the second service, mainly based on phrase spotting, a considerable effort was devoted to language model refinement. In particular, several types of rejection networks were studied to detect out-of-vocabulary words for the given task; a major result demonstrates that using rejection networks based on a class trigram language model reduces the word error rate from 36.7% to 11.1% with respect to using a phone loop network.
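The Facco et al. record above selects speech material "optimally" for manual transcription in a partly supervised training procedure. A minimal sketch of one standard confidence-based variant of such selection (the paper's exact criterion may differ): send the least confident automatic decodes to a transcriber and keep the rest automatic; the utterance IDs, confidences and budget are placeholders.

# Rank automatically decoded utterances by confidence and send the least
# confident ones to a human transcriber; the rest keep their automatic
# transcriptions. A sketch of confidence-based selection, not the paper's
# exact criterion.
utterances = [("utt01", 0.93), ("utt02", 0.41), ("utt03", 0.77),
              ("utt04", 0.28), ("utt05", 0.85)]  # (id, decoder confidence)

budget = 2  # how many manual transcriptions we can afford
by_confidence = sorted(utterances, key=lambda u: u[1])
to_transcribe = [u for u, _ in by_confidence[:budget]]
keep_automatic = [u for u, _ in by_confidence[budget:]]
print("manual:", to_transcribe, "| automatic:", keep_automatic)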
For the latter service, the benefits and related costs brought by regular grammars, stochastic language models and mixed language models will also be reported and discussed. Finally, notice that most of the experiments described in this paper were carried out on field databases collected through the developed services. (c) 2005 Elsevier B.V. All rights reserved. C1 ITC Irst, SSI Div, I-38050 Trento, Italy. Reitek Spa, I-20126 Milan, Italy. RP Falavigna, D (reprint author), ITC Irst, SSI Div, Via Sommarive 18, I-38050 Trento, Italy. EM andrea.facco@crf.it; falavi@itc.it; gretter@itc.it; m.vigano@reitek.com CR Brown P. F., 1992, Computational Linguistics, V18 BRUGNARA F, 1997, P 5 EUR C SPEECH COM, P2751 De Mori R., 1998, SPOKEN DIALOGUES COM FACCO A, 2004, P ICSLP JEJ KOR, P2625 FALAVIGNA D, 2000, P ICSLP BEIJ CHIN, P585 FALAVIGNA D, 1997, P EUROSPEECH, P1827 Falavigna D., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727688 Federico M, 2004, COMPUT SPEECH LANG, V18, P417, DOI [10.1016/j.csl.2003.10.001, 10.1016/j.cai.2003.10.001] Federico M, 2000, SPEECH COMMUN, V32, P37, DOI 10.1016/S0167-6393(00)00022-4 Freund Y, 1997, J COMPUT SYST SCI, V55, P119, DOI 10.1006/jcss.1997.1504 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P245 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X Graff D, 2002, SPEECH COMMUN, V37, P15, DOI 10.1016/S0167-6393(01)00057-7 GRETTER R, 2001, P INT C AC SPEECH SI, P557 HARALD A, 1998, P DARPA BROADC NEWS He Y., 2003, P IEEE AUT SPEECH RE, P583 KAMM T, 2004, P ICSLP JEJ ISL KOR, P1973 KAMM T, 2001, P ASRU MAD CAMP IT Lamel L., 2000, P ISCA ITRW ASR2000, P150 LAMEL L, 2001, P ICASSP SALT LAK CI Lamel L, 2002, COMPUT SPEECH LANG, V16, P115, DOI 10.1006/csla.2001.0186 Levin E, 2000, IEEE T SPEECH AUDI P, V8, P11, DOI 10.1109/89.817450 LEVIN E, 1995, P ARPA SPOK LANG SYS ODELL J, 1995, THESIS CAMBRIDGE U U ORLANDI M, 2003, P ISCA WORKSH ERR HA, P47 PALLET D, 1999, P NIST DARPA BROADC, P1 Puterman M., 1994, MARKOV DECISION PROC Rabiner LR, 1989, P IEEE, V77, P267 WARD W, 1995, P APRA SPOK LANG SYS WOODLAND P, 2002, SPEECH COMMUN, V80, P2295 Young S., 2002, P INT C SPOK LANG PR, P9 YOUNG S, 1998, COMPUT SPEECH LANG, V6, P263 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 34 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2006 VL 48 IS 2 BP 176 EP 190 DI 10.1016/j.specom.2005.07.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 004SM UT WOS:000234772500005 ER PT J AU Martin, A Mauuary, L AF Martin, A Mauuary, L TI Robust speech/non-speech detection based on LDA-derived parameter and voicing parameter for speech recognition in noisy environments SO SPEECH COMMUNICATION LA English DT Article DE speech/non-speech detection; speech recognition; noise reduction; LDA; voicing parameter ID ALGORITHM AB Every speech recognition system contains a speech/non-speech detection stage. Only the detected speech sequences are subsequently passed to the speech recognition stage. In a very noisy environment, the noise detection stage is generally responsible for most of the recognition errors. Indeed, many detected noise periods can be misrecognized as vocabulary words.
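The Martin and Mauuary record continues below with an LDA-derived parameter and a voicing parameter for robust speech/non-speech detection. A minimal sketch of a frame-level detector in that spirit, combining log-energy with a normalised autocorrelation voicing measure so that short energetic but unvoiced bursts are rejected; all thresholds and test signals are illustrative assumptions.

import numpy as np

def frame_features(frame):
    energy = 10 * np.log10(np.mean(frame ** 2) + 1e-12)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    voicing = ac[20:200].max() / (ac[0] + 1e-12)  # peak in pitch-lag range
    return energy, voicing

def is_speech(frame, e_thresh=-30.0, v_thresh=0.3):
    energy, voicing = frame_features(frame)
    # High energy alone is not enough: a short energetic burst with no
    # periodicity (e.g. a door slam) is rejected as noise.
    return energy > e_thresh and voicing > v_thresh

fs = 8000
t = np.arange(400) / fs
voiced = 0.5 * np.sin(2 * np.pi * 120 * t)             # 120 Hz "vowel"
burst = np.random.default_rng(2).normal(0, 0.5, 400)   # impulsive noise
print(is_speech(voiced), is_speech(burst))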
This manuscript provides solutions to improve the performance of a speech/non-speech detection system in very noisy environments (for both stationary and short-time energetic noise), with an application to the France Telecom system. The improvements we propose are threefold. First, noise reduction is considered in order to reduce stationary noise effects on the speech detection system. Then, in order to decrease detections of noise characterized by brief duration and high energy, two new versions of the speech/non-speech detection stage are proposed. On the one hand, a linear discriminant analysis algorithm applied to the Mel frequency cepstrum coefficients is incorporated in the speech/non-speech detection algorithm. On the other hand, the use of a voicing parameter is introduced in the speech/non-speech detection in order to reduce the probability of false noise detections. (c) 2005 Elsevier B.V. All rights reserved. C1 ENSIET AE 1, EA3876, F-29806 Brest 9, France. France Telecom R&D, F-22307 Lannion, France. RP Martin, A (reprint author), ENSIET AE 1, EA3876, 2 Rue F Verny, F-29806 Brest 9, France. EM arnaud.martin@ensieta.fr; laurent.mauuary@francetelecom.com CR EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Ganapathiraju A., 1996, IEEE SOUTHEASTCON '96. Bringing Together Education, Science and Technology (Cat. No.96CH35880), DOI 10.1109/SECON.1996.510121 GUPTA P, 1997, Patent No. 5649055 Huang LS, 2000, INT CONF ACOUST SPEE, P1751 IWANO K, 1999, ICASSP 99, V1, P133 Junqua JC, 1994, IEEE T SPEECH AUDI P, V2, P406, DOI 10.1109/89.294354 JUNQUA JC, 1991, EUR C SPEECH COMM TE, V3, P1371 KARRAY L, 1998, INT C SPOK LANG PROC, P1471 Karray L, 2003, SPEECH COMMUN, V40, P261, DOI 10.1016/S0167-6393(02)00066-3 KOBATAKE H, 1989, INT C AC SPEECH SIGN, V1, P365 MARTIN A, 2001, THESIS U RENNES MARTIN A, 2000, EUSIPCO, P469 MARTIN A, 2003, EUR C SPEECH COMM TE MARTIN A, 2001, EUR C SPEECH COMM TE, V2, P885 MARTIN P, 1982, ICASSP 82, P180 Mauuary L., 1993, EUR C SPEECH COMM TE, P1097 MAUUARY L, 1994, THESIS U RENNES Mokbel C, 1997, SPEECH COMMUN, V23, P141, DOI 10.1016/S0167-6393(97)00042-3 NOE B, 2001, EUR C SPEECH COMM TE, V1, P433 RABINER LR, 1977, AT&T TECH J, V56, P455 RAO GVR, 1996, INT C SPOK LANG PROC, V2, P813 SAVOJI MH, 1989, SPEECH COMMUN, V8, P45, DOI 10.1016/0167-6393(89)90067-8 SHIN WH, 2000, ICASSP 00, V3, P1399 WU D, 1999, P ICASSP, V4, P2407 NR 24 TC 7 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2006 VL 48 IS 2 BP 191 EP 206 DI 10.1016/j.specom.2005.07.005 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 004SM UT WOS:000234772500006 ER PT J AU Lee, CL Chang, WW Chiang, YC AF Lee, CL Chang, WW Chiang, YC TI Spectral and prosodic transformations of hearing-impaired Mandarin speech SO SPEECH COMMUNICATION LA English DT Article DE voice conversion; prosodic modification; spectral conversion; hearing-impaired speaker; sinusoidal model ID CONVERSION; ALGORITHM AB This paper studies the combined use of spectral and prosodic conversions to enhance hearing-impaired Mandarin speech. The analysis-synthesis system is based on a sinusoidal representation of the speech production mechanism. By taking advantage of the tone structure in Mandarin speech, pitch contours are orthogonally transformed and applied within the sinusoidal framework to perform pitch modification.
Also proposed is a time-scale modification algorithm that finds accurate alignments between hearing-impaired and normal utterances. Using the alignments, spectral conversion is performed on subsyllabic acoustic units by a continuous probabilistic transform based on a Gaussian mixture model. Results of perceptual evaluation indicate that the proposed system greatly improves the intelligibility and the naturalness of hearing-impaired Mandarin speech. (c) 2005 Elsevier B.V. All rights reserved. C1 Natl Chiao Tung Univ, Dept Commun Engn, Hsinchu 300, Taiwan. Natl Hsinchu Teachers Coll, Dept Special Educ, Hsinchu, Taiwan. RP Chang, WW (reprint author), Natl Chiao Tung Univ, Dept Commun Engn, Hsinchu 300, Taiwan. EM wwchang@cc.nctu.edu.tw CR Abe M., 1988, P ICASSP, P655 Bi N, 1997, IEEE T SPEECH AUDI P, V5, P97 CHANG BL, 2000, B SPECIAL ED, V18, P573 CHEN SH, 1990, IEEE T COMMUN, V38, P1317 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Garr Mc, 1983, SPEECH HEARING IMPAI HOCHBERG I, 1983, SPEECH HEARING IMPAI Johnson R.A., 1996, STAT PRINCIPLES METH Kain A, 1998, INT CONF ACOUST SPEE, P285, DOI 10.1109/ICASSP.1998.674423 Lee LS, 1997, IEEE SIGNAL PROC MAG, V14, P63 LEE PC, 1999, B SPECIAL ED REHABIL, V7, P79 LIN BG, 1997, B SPECIAL ED, V15, P109 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 MCAULAY RJ, 1995, SINUSOIDAL CODING SP MONSEN RB, 1978, J SPEECH HEAR RES, V21, P197 Ohde RN, 1992, PHONETIC ANAL NORMAL OPPENHEIM AV, 1989, DISCRETE TIME SIGNAL OSBERGER MJ, 1979, J ACOUST SOC AM, V66, P1316, DOI 10.1121/1.383552 QUATIERI TF, 1992, IEEE T SIGNAL PROCES, V40, P497, DOI 10.1109/78.120793 Rabiner L, 1993, FUNDAMENTALS SPEECH SHEN XNS, 1991, LANG SPEECH, V34, P145 Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472 NR 22 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2006 VL 48 IS 2 BP 207 EP 219 DI 10.1016/j.specom.2005.08.001 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 004SM UT WOS:000234772500007 ER PT J AU Rangachari, S Loizou, PC AF Rangachari, S Loizou, PC TI A noise-estimation algorithm for highly non-stationary environments SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; noise estimation; non-stationary noise ID SPEECH; COMPENSATION AB A noise-estimation algorithm is proposed for highly non-stationary noise environments. The noise estimate is updated by averaging the noisy speech power spectrum using time- and frequency-dependent smoothing factors, which are adjusted based on signal-presence probability in individual frequency bins. Signal presence is determined by computing the ratio of the noisy speech power spectrum to its local minimum, which is updated continuously by averaging past values of the noisy speech power spectra with a look-ahead factor. The local minimum estimation algorithm adapts very quickly to highly non-stationary noise environments. This was confirmed with formal listening tests, which indicated that the proposed noise-estimation algorithm, when integrated in speech enhancement, was preferred over other noise-estimation algorithms. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Texas, Dept Elect Engn, Richardson, TX 75083 USA. RP Loizou, PC (reprint author), Univ Texas, Dept Elect Engn, POB 830688,EC 33, Richardson, TX 75083 USA.
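The Rangachari and Loizou record above updates the noise estimate with time-frequency dependent smoothing controlled by a speech-presence probability derived from the ratio of the noisy power to its tracked local minimum. A simplified, minimal sketch of one frame of such minima-controlled recursive averaging; the constants are typical values, not those tuned in the paper.

import numpy as np

def update_noise(noise, p_min, noisy_pow, beta=0.8, ratio_thresh=5.0,
                 alpha_s=0.95):
    """One frame of a minima-controlled recursive-averaging noise tracker."""
    # Track the local minimum of the noisy power.
    p_min = np.where(noisy_pow < p_min,
                     noisy_pow,
                     beta * p_min + (1 - beta) * noisy_pow)
    # Speech presence: noisy power much larger than its local minimum.
    p_speech = np.where(noisy_pow / (p_min + 1e-12) > ratio_thresh, 1.0, 0.0)
    # Time-frequency dependent smoothing: freeze the noise estimate where
    # speech is likely, adapt quickly where it is not.
    alpha = alpha_s + (1 - alpha_s) * p_speech
    noise = alpha * noise + (1 - alpha) * noisy_pow
    return noise, p_min

rng = np.random.default_rng(3)
noise = np.ones(257)
p_min = np.ones(257)
for _ in range(100):                      # feed frames of noisy power
    frame_pow = np.abs(rng.normal(size=257)) ** 2 + 1.0
    noise, p_min = update_noise(noise, p_min, frame_pow)
print(noise.mean())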
EM loizou@utdallas.edu CR AFIFY M, 2001, P ICASSP, V1, P229 Cohen GD, 2001, AM J GERIAT PSYCHIAT, V9, P1 COHEN J, 2001, AAMC REPORTER, V11, P5 DENG L, 2003, P IEEE INT C AC SPEE, V1, P672 Deng L, 2003, IEEE T SPEECH AUDI P, V11, P568, DOI 10.1109/TSA.2003.818076 Doblinger G., 1995, P EUR, V2, P1513 EPHRAIM Y, 1993, P IEEE INT C AC SPEE, V2, P355 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Hirsch H., 1995, P ICASSP, P153 Hu Y, 2004, IEEE T SPEECH AUDI P, V12, P59, DOI 10.1109/TSA.2003.819949 Kim NS, 1998, IEEE SIGNAL PROC LET, V5, P57 LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 LIN L, 2003, P IEEE INT C AC SPEE, V1, P80 MALAH D, 1999, P IEEE INT C AC SPEE, P789 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 Martin R., 1994, P 7 EUR SIGN PROC C, P1182 NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Rangachari S, 2004, P IEEE INT C AC SPEE, V1, P305 Ris C, 2001, SPEECH COMMUN, V34, P141, DOI 10.1016/S0167-6393(00)00051-0 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 STAHL V, 2000, P IEEE INT C AC SPEE, P1873 Yao K, 2002, ADV NEUR IN, V14, P1213 NR 23 TC 93 Z9 100 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2006 VL 48 IS 2 BP 220 EP 231 DI 10.1016/j.specom.2005.08.005 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 004SM UT WOS:000234772500008 ER PT J AU Moller, S Krebber, J Smeele, P AF Moller, S Krebber, J Smeele, P TI Evaluating the speech output component of a smart-home system SO SPEECH COMMUNICATION LA English DT Article DE speech quality; spoken dialogue system; evaluation; smart-home system ID QUALITY EVALUATION; INTELLIGIBILITY AB This paper describes four experiments which have been carried out to evaluate the speech output component of the INSPIRE spoken dialogue system, providing speech control for different devices located in a "smart" home environment. The aim is to quantify the impact of different factors on the quality of the system, when addressed either in the home or from a remote location (office, car). Factors analyzed in the experiments include the characteristics of the machine agent during the interaction (voice, personality), the physical characteristics of the usage environment (acoustic user interface, background noise, electrical transmission path), as well as task-related characteristics (listening-only vs. interaction situation, parallel tasks). The results show a significant impact of agent and environmental factors, but not of task factors. Potential reasons for this finding are discussed. They serve as a basis for design decisions which have been taken for the final system. (c) 2005 Elsevier B.V. All rights reserved. C1 Ruhr Univ Bochum, Inst Kommunikat, D-44780 Bochum, Germany. TNO Human Factors, NL-3769 ZG Soesterberg, Netherlands. RP Moller, S (reprint author), Ruhr Univ Bochum, Inst Kommunikat, D-44780 Bochum, Germany.
EM sebastian.moeller@ruhr-uni-bochum.de CR BALESTRI M, 1992, P INT C SPOK LANG PR, V1, P559 Bappert V., 1994, Acta Acustica, V2 BENOIT C, 1991, P 2 INT C SPEECH COM, V2, P875 BODDEN M, 1996, ENTWICKLUNG DURCHFUH DELOGU C, 1995, ACTA ACUST, V3, P89 Delogu C, 1998, SPEECH COMMUN, V24, P153, DOI 10.1016/S0167-6393(98)00009-0 DELOGU C, 1991, P 2 EUR C SPEECH COM, V1, P353 *ETSI, 1997, 300 726 ETSI ETS Fraser N., 1997, HDB STANDARDS RESOUR, P564 Gong L., 2003, INT J SPEECH TECHNOL, V6, P123, DOI 10.1023/A:1022382413579 HONE KS, 2000, NATL LANGUAGE ENG, V6, P303 HOWARDJONES P, 1992, 2589 ESPRIT *ITUT, 2000, P340 ITUT *ITUT, 2003, G107 ITUT *ITUT, 1993, P56 ITUT *ITUT, 1994, P85 ITUT *ITUT, 1990, G726 ITUT *ITUT, 1996, P830 ITUT *ITUT, 1994, 041 ITUT *ITUT, 1996, G729 ITUT *ITUT, 2003, P851 ITUT *ITUT, 1998, G711 ITUT *ITUT, 1988, P48 ITUT JEKOSCH U, 2000, THESIS U ESSEN Klaus H, 1997, ACUSTICA, V83, P124 KRAFT V, 1995, ACTA ACUST, V3, P351 MCINNES FR, 1999, P 6 EUR C SPEECH COM, V2, P831 MOLLER S, 2005, QUALITY TELEPHONE BA Moller S., 2002, P 3 SIGDIAL WORKSH D, P142 MOLLER S, 2003, P 8 EUR C SPEECH COM, V3, P1953 Moller S, 2004, ACTA ACUST UNITED AC, V90, P121 Moller S., 2000, ASSESSMENT PREDICTIO Pavlovic C. V., 1990, Journal d'Acoustique, V3 Rajman M, 2004, ACTA ACUST UNITED AC, V90, P1096 REHMANN S, 2002, P 3 EUR C AC FOR AC, V33 Salza PL, 1996, ACUSTICA, V82, P650 SILVERMAN K, 1990, P 1990 INT C SPOK LA, V2, P981 SPIEGEL MF, 1990, SPEECH COMMUN, V9, P279, DOI 10.1016/0167-6393(90)90004-S SPROAT R, 1997, MULTILINGUAL TEXT SP Sutton S., 1998, P INT C SPOK LANG PR, V7, P3221 van Bezooijen Renee, 1997, HDB STANDARDS RESOUR, P481 VANBEZOOIJEN R, 1990, SPEECH COMMUN, V9, P263, DOI 10.1016/0167-6393(90)90002-Q van Santen J. P. H., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1004 NR 43 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2006 VL 48 IS 1 BP 1 EP 27 DI 10.1016/j.specom.2005.05.004 PG 27 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 996OY UT WOS:000234184800001 ER PT J AU Lee, H Ko, H AF Lee, H Ko, H TI Competing models-based text-prompted speaker independent verification algorithm SO SPEECH COMMUNICATION LA English DT Article DE utterance verification; LRT; voice code verification ID CONFIDENCE MEASURES; SPEECH RECOGNITION AB In this paper, we propose a competing-models-based, text-prompted, speaker-independent verification algorithm for an intelligent surveillance guard robot, wherein the robot prompts a code (i.e. a word or phrase) for user entrance authentication. The proposed scenario requires text-prompted speaker-independent verification. In addition, because of memory limitations, the system does not use a speaker-dependent model or an extra trained model as the alternative hypothesis for the log-likelihood ratio test. This is due to the given application scenario, in which an administrator changes the voice code every day for security reasons and the target domain is unlimited. To resolve these issues, we propose to exploit sub-word-based anti-models for log-likelihood normalization, reusing the acoustic model and competing with the voice code model. Anti-models are set up at initialization using automatic production rules, based on the statistical distance of phonemes against the voice code. The proposed system uses a two-pass strategy comprising an SCHMM-based recognition step and a verification step.
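The Lee and Ko record above thresholds a log-likelihood ratio between the voice-code model and a competing anti-model. A minimal sketch of that duration-normalised test, with single Gaussians standing in for the SCHMM-based models; the means, covariances and threshold are illustrative assumptions.

import numpy as np
from scipy.stats import multivariate_normal as mvn

code_model = mvn(mean=np.zeros(4), cov=np.eye(4))       # voice-code model
anti_model = mvn(mean=2.0 * np.ones(4), cov=np.eye(4))  # competing anti-model

def accept(frames, threshold=0.0):
    """Duration-normalised log-likelihood ratio test."""
    llr = (code_model.logpdf(frames).sum()
           - anti_model.logpdf(frames).sum()) / len(frames)
    return llr > threshold

rng = np.random.default_rng(4)
genuine = rng.normal(0.0, 1.0, size=(120, 4))   # matches the code model
impostor = rng.normal(2.0, 1.0, size=(120, 4))  # matches the anti-model
print(accept(genuine), accept(impostor))

Dividing by the frame count keeps the decision threshold independent of utterance length, which is why such ratios are usually duration-normalised.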
In addition, a harmonics-based spectral subtraction algorithm is applied for noise robustness in outdoor environments. The performance evaluation is done using a common Korean database, PBW452DB, which consists of 63,280 utterances of 452 isolated words recorded in a silent environment. (c) 2005 Elsevier B.V. All rights reserved. C1 Korea Univ, Dept Elect & Comp Engn, Seoul 136701, South Korea. Korea Univ, Dept Visual Informat Proc, Seoul 136701, South Korea. RP Lee, H (reprint author), Korea Univ, Dept Elect & Comp Engn, 5ka 1 Anamdong, Seoul 136701, South Korea. EM hklee@ispl.korea.ac.kr; hsko@korea.ac.kr CR AHN SJ, 2002, INT C SPOK LANG PROC, V2, P1361 BEH JH, 2003, ICME 2003, P633 Benitez MC, 2000, SPEECH COMMUN, V32, P79, DOI 10.1016/S0167-6393(00)00025-X Huang X., 2001, SPOKEN LANGUAGE PROC JIANG H, 2003, IEEE T SPEECH AUDIO, V11 KIM TY, 2003, EUROSPEECH 2003, P889 KIM W, 2003, EUROSPEECH 2003, P677 LEUNG LK, 1999, P ICASSP 99 IEEE INT, V2 LLEIDA E, 2000, IEEE T SPEECH AUDIO, V8 Park H., 2002, THESIS U TEXAS AUSTI RAHIM MG, 1995, ICASSP 95 INT C AC S, V1 RAHIM MG, 1997, IEEE T SPEECH AUDIO, V5 SANKAR A, 2003, IEEE INT C AC SPEECH, V1 SUKKAR RA, 1996, IEEE T SPEECH AUDIO, V4 TOMOKO M, 1995, SPEECH COMMUN, V17, P109 Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002 WESSEL F, 2000, P IEEE INT C AC SPEE, P1587 WILLIAM J, 1997, HDB PHONETIC SCI WU CH, 2000, IEE P VISION IMAGE S, V147 Xiang B, 2003, IEEE T SPEECH AUDI P, V11, P447, DOI 10.1109/TSA.2003.815822 XIN L, 2001, 2001 INT C INF INF, V3 NR 21 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2006 VL 48 IS 1 BP 28 EP 44 DI 10.1016/j.specom.2005.05.014 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 996OY UT WOS:000234184800002 ER PT J AU Toda, T Kawai, H Tsuzaki, M Shikano, K AF Toda, T Kawai, H Tsuzaki, M Shikano, K TI An evaluation of cost functions sensitively capturing local degradation of naturalness for segment selection in concatenative speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE segment selection; cost function; perceptual evaluation; RMS cost AB In this paper, we evaluate various cost functions for selecting a segment sequence in terms of the correspondence between the cost and perceptual scores for the naturalness of synthetic speech. The results demonstrate that the conventional average cost, which shows the degradation of naturalness over the entire synthetic utterance, has better correspondence to the perceptual scores than the maximum cost, which shows the worst local degradation of naturalness. Furthermore, it is shown that the root mean square (RMS) cost, which takes into account both the average cost and the maximum cost, has the best correspondence. We also show that the naturalness of synthetic speech can be improved by using the RMS cost for segment selection. Then, we investigate the effects of applying the RMS cost to segment selection in comparison to those of applying the average cost. Experimental results show that in segment selection based on the RMS cost, a larger number of concatenations causing slight local degradation are performed so that concatenations causing greater local degradation are avoided. (c) 2005 Elsevier B.V. All rights reserved. C1 Nara Inst Sci & Technol, Grad Sch Informat Sci, Ikoma, Nara 6300192, Japan.
ATR Spoken Language Commun Res Labs, Keihanna Sci City, Kyoto 6190288, Japan. KDDI R&D Labs, Kamifukuoka, Saitama 3568502, Japan. Kyoto City Univ Arts, Kyoto 6101197, Japan. RP Toda, T (reprint author), Nara Inst Sci & Technol, Grad Sch Informat Sci, 8916-5 Takayama, Ikoma, Nara 6300192, Japan. EM tomoki@is.naist.jp; Hisashi.Kawai@kddilabs.jp; minoru.tsuzaki@kcua.ac.jp; shikano@is.naist.jp CR CAMPBELL WN, 1997, PROGR SPEECH SYNTHES, P279 CHU M, 2001, P ICASSP SALT LAK CI, P785 Chu M, 2001, P 7 EUR C SPEECH COM, P2087 CONKIE A, 2000, P ICSLP, V3, P279 Conkie A.D., 1999, JOINT M ASA EAA DAGA DING W, 1998, P 3 ESCA COCOSDA INT, P191 HUNT AJ, 1996, JP ICASSP ATL US MAY, P373 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 Kawai H., 2002, P ICSLP DENV US SEP, P2621 Klabbers E, 2001, IEEE T SPEECH AUDI P, V9, P39, DOI 10.1109/89.890070 LEE M, 2001, P EUROSPEECH AALB DE, P2227 PENG H, 2002, P ICSLP DENV US SEPT, P2613 Sagisaka Y., 1988, P INT C AC SPEECH SI, P679 STYLIANOU Y, 2001, P ICASSP, P837 Syrdal A. K., 2000, P ICSLP BEIJ CHIN, P410 TODA T, 2002, P ICASSP ORL US MAY, P465 WOUTERS J, 1998, P ICSLP, V6, P2747, DOI DOI 10.1109/ICASSP.2001.941045 NR 17 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2006 VL 48 IS 1 BP 45 EP 56 DI 10.1016/j.specom.2005.05.011 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 996OY UT WOS:000234184800003 ER PT J AU You, CH Koh, SN Rahardja, S AF You, CH Koh, SN Rahardja, S TI Masking-based beta-order MMSE speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; minimum mean-square error (MMSE); masking properties ID SPECTRAL AMPLITUDE ESTIMATOR; NOISE SUPPRESSION; FILTER AB This paper considers an effective approach for attenuating acoustic noise and mitigating its effect in a speech signal. In this approach, the human perceptual auditory masking effect is incorporated into an adaptive beta-order minimum mean-square error (MMSE) speech enhancement algorithm. The relationship between the value of beta and the noise-masking threshold is introduced and analyzed. The algorithm is based on a criterion by which the inaudible noise may be masked rather than suppressed. It thereby reduces the chance of distortion introduced to speech due to the enhancement process. In order to obtain an optimal estimation of the masking threshold, a modified way to measure the relative threshold offset is described. The performance of the proposed masking-based beta-order MMSE method has been evaluated through objective speech distortion measurement, spectrogram inspection and subjective listening tests. It is shown that the proposed method can achieve a more significant noise reduction and a better spectral estimation than the conventional adaptive beta-order MMSE method and the conventional over-subtraction noise-masking method. (c) 2005 Elsevier B.V. All rights reserved. C1 Inst Infocomm Res, Div Media, Singapore 119613, Singapore. Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. RP You, CH (reprint author), Inst Infocomm Res, Div Media, 21 Heng Mui Keng Terrace, Singapore 119613, Singapore.
EM echyou@i2r.a-star.edu.sg; esnkoh@ntu.edu.sg; rsusanto@i2r.a-star.edu.sg CR Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1989, IEEE T ACOUST SPEECH, V37, P1846, DOI 10.1109/29.45532 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367 Garofolo J., 1988, GETTING STARTED DARP Hansen JHL, 1997, SPEECH COMMUN, V21, P169, DOI 10.1016/S0167-6393(97)00003-4 HANSEN JHL, 1995, J ACOUST SOC AM, V97, P3833, DOI 10.1121/1.413108 HELLMAN RP, 1972, PERCEPT PSYCHOPHYS, V11, P241, DOI 10.3758/BF03206257 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P471 LIM JS, 1982, IEEE T ASSP, V30, P679 Martin R., 1994, P 7 EUR SIGN PROC C, P1182 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 Quackenbush S. R., 1988, OBJECTIVE MEASURES S SCHROEDER MR, 1979, J ACOUST SOC AM, V66, P1647, DOI 10.1121/1.383662 TSOUKALAS D, 1993, P IEEE INT C AC SPEE, V2, P359 Tsoukalas DE, 1997, IEEE T SPEECH AUDI P, V5, P497, DOI 10.1109/89.641296 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 VARY P, 1985, SIGNAL PROCESS, V8, P387, DOI 10.1016/0165-1684(85)90002-7 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 You CH, 2005, IEEE T SPEECH AUDI P, V13, P475, DOI 10.1109/TSA.2005.848883 YOU CH, 2004, P IEEE INT C AC SPEE, P725 YOU CH, 2003, P IEEE INT C AC SPEE, V1, P852 YOU CH, 2004, P IEEE INT C MULT EX NR 27 TC 11 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2006 VL 48 IS 1 BP 57 EP 70 DI 10.1016/j.specom.2005.05.012 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 996OY UT WOS:000234184800004 ER PT J AU Leung, KY Mak, MW Siu, MH Kung, SY AF Leung, KY Mak, MW Siu, MH Kung, SY TI Adaptive articulatory feature-based conditional pronunciation modeling for speaker verification SO SPEECH COMMUNICATION LA English DT Article AB Because of differences in educational background, accents, and so on, different persons have different ways of pronouncing words. Therefore, the pronunciation patterns of individuals can be used as features for discriminating speakers. This paper exploits the pronunciation characteristics of speakers and proposes a new conditional pronunciation modeling (CPM) technique for speaker verification. The proposed technique establishes a link between articulatory properties (e.g., manners and places of articulation) and phoneme sequences produced by a speaker. This is achieved by aligning two articulatory feature (AF) streams with a phoneme sequence determined by a phoneme recognizer, which is followed by formulating the probabilities of articulatory classes conditioned on the phonemes as speaker-dependent discrete probabilistic models. The scores obtained from the AF-based pronunciation models are then fused with those obtained from spectral-based acoustic models. A frame-weighted fusion approach is introduced to weight the frame-based fused scores based on the confidence of observing the articulatory classes.
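The Leung et al. record above formulates speaker-dependent discrete probabilities of articulatory classes conditioned on phonemes. A minimal sketch of training and scoring such a conditional pronunciation model from aligned (phoneme, articulatory-class) pairs, with add-one smoothing; the label sets and counts are invented for illustration.

import numpy as np
from collections import Counter

PHONES = ["aa", "iy", "s", "t"]
AF_CLASSES = ["vowel", "fricative", "stop"]

def train_cpm(pairs, smooth=1.0):
    """P(articulatory class | phoneme) from aligned (phone, AF) pairs."""
    counts = Counter(pairs)
    cpm = {}
    for ph in PHONES:
        row = np.array([counts[(ph, af)] for af in AF_CLASSES], float) + smooth
        cpm[ph] = row / row.sum()
    return cpm

def score(pairs, cpm):
    """Average log-probability of a test alignment under a speaker's CPM."""
    return np.mean([np.log(cpm[ph][AF_CLASSES.index(af)])
                    for ph, af in pairs])

train = [("aa", "vowel")] * 50 + [("s", "fricative")] * 40 + [("t", "stop")] * 30
cpm = train_cpm(train)
print(score([("aa", "vowel"), ("s", "fricative")], cpm))   # well matched
print(score([("aa", "stop"), ("s", "vowel")], cpm))        # poorly matched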
The effectiveness of AF-based CPM and the frame-weighted approach is demonstrated in a speaker verification task. (c) 2005 Elsevier B.V. All rights reserved. C1 Hong Kong Polytech Univ, Dept Elect & Informat Engn, Hong Kong, Peoples R China. Hong Kong Univ Sci & Technol, Dept Elect & Elect Engn, Hong Kong, Hong Kong, Peoples R China. Princeton Univ, Dept Elect Engn, Princeton, NJ 08544 USA. RP Mak, MW (reprint author), Hong Kong Polytech Univ, Dept Elect & Informat Engn, Hong Kong, Peoples R China. EM enmwmak@polyu.edu.hk CR ADAMI A, 2003, P ICASSP, V4, P788 CAMPBELL JP, 1999, P INT C AC SPEECH SI, V2, P829 Campbell J.P., 2003, P EUR, P2665 DODDINGTON GR, 1995, P IEEE, P1651 Erler K., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1014 FARBER P, 1997, TR97047 ICSI Kirchhoff K., 1999, THESIS U BIELEFIELD KLUSACEK D, 2003, P ICASSP, V4, P804 Leung K. Y., 2004, P ICASSP04 MONTR MAY, V1, P85 LEUNG KY, 2004, P ICSLP 2004, P516 Martin A. F., 1997, P EUROSPEECH, P1895 NAVRATIL J, 2003, P ICASSP 2003, V4, P796 PARANDEKAR S, 2003, P ICASSP 03, V1, P28 PESKIN B, 2003, P ICASSP, V4, P792 Reynolds D., 2003, P ICASSP 03, VIV, P784 REYNOLDS DA, 1997, P IEEE INT C AC SPEE, V2, P1535 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 YOUNG S, 2000, HTK BOOK HTK 3 0 TEC NR 18 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2006 VL 48 IS 1 BP 71 EP 84 DI 10.1016/j.specom.2005.05.013 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 996OY UT WOS:000234184800005 ER PT J AU Roch, M AF Roch, M TI Gaussian-selection-based non-optimal search for speaker identification SO SPEECH COMMUNICATION LA English DT Article DE speaker recognition; text-independent speaker identification; talker recognition; Gaussian selection; non-optimal search ID MIXTURE-MODELS; ALGORITHM; VERIFICATION; SPEECH AB Most speaker identification systems train individual models for each speaker. This is done as individual models often yield better performance and they permit easier adaptation and enrollment. When classifying a speech token, the token is scored against each model and the maximum a priori decision rule is used to decide the classification label. Consequently, the cost of classification grows linearly for each token as the population size grows. When considering that the number of tokens to classify is also likely to grow linearly with the population, the total workload increases quadratically. This paper presents a preclassifier which generates an N-best hypothesis using a novel application of Gaussian selection, and a transformation of the traditional tail test statistic which lets the implementer specify the tail region in terms of probability. The system is trained using parameters of individual speaker models and does not require the original feature vectors, even when enrolling new speakers or adapting existing ones. As the correct class label need only be in the N-best hypothesis set, it is possible to prune more Gaussians than in a traditional Gaussian selection application. The N-best hypothesis set is then evaluated using individual speaker models, resulting in an overall reduction of workload. (c) 2005 Elsevier B.V. All rights reserved. C1 San Diego State Univ, San Diego, CA 92182 USA. RP Roch, M (reprint author), San Diego State Univ, 5500 Campanile Dr, San Diego, CA 92182 USA.
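The Roch record above prunes the speaker set with a cheap preclassifier and rescores only an N-best short list. A minimal sketch of that two-stage search; for brevity each "GMM" is a handful of equally weighted Gaussians, and the cheap pass simply evaluates a thinned subset of components, which loosely stands in for Gaussian selection.

import numpy as np
from scipy.stats import multivariate_normal as mvn

rng = np.random.default_rng(5)
# Each "speaker model" is a list of Gaussian components (toy stand-in for a GMM).
speakers = {f"spk{i}": [mvn(mean=rng.normal(i, 1.0, size=2), cov=np.eye(2))
                        for _ in range(8)] for i in range(20)}

def gmm_loglik(frames, comps):
    dens = np.mean([c.pdf(frames) for c in comps], axis=0)
    return np.log(dens + 1e-300).sum()

def identify(frames, n_best=3):
    # Cheap pass: only the first two Gaussians of each model.
    cheap = {s: gmm_loglik(frames, c[:2]) for s, c in speakers.items()}
    shortlist = sorted(cheap, key=cheap.get, reverse=True)[:n_best]
    # Full pass restricted to the N-best short list.
    return max(shortlist, key=lambda s: gmm_loglik(frames, speakers[s]))

test = rng.normal(7.0, 1.0, size=(100, 2))  # frames near the "spk7" region
print(identify(test))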
EM marie.roch@ieee.org CR AUCKENTHALER R, 2001, 2001 SPEAKER ODYSSEY, P83 Bocchieri E., 1993, P ICASSP, V2, P692 BEI CD, 1985, IEEE T COMMUN, V33, P1132 Davenport J., 1999, P DARPA BROADC NEWS Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P152, DOI 10.1109/89.748120 Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC Hogg R. V., 1978, INTRO MATH STAT Huang X., 2001, SPOKEN LANGUAGE PROC KINNUNEN T, IN PRESS IEEE T SPEE LIN Q, 1996, INT C SPOK LANG PROC, V4, P2415 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 Martin A, 2000, DIGIT SIGNAL PROCESS, V10, P1, DOI 10.1006/dspr.1999.0355 MARTIN A, 1994, LDC94S15 CDROM Padmanabhan M, 1999, IEEE T SPEECH AUDI P, V7, P282, DOI 10.1109/89.759035 PAN Z, 2000, P NORSIG200 KOLM SWE, P33 Pellom BL, 1998, IEEE SIGNAL PROC LET, V5, P281, DOI 10.1109/97.728467 PICHENY MA, 1999, 1999 INT WORKSH AUT Rey R. F., 1983, ENG OPERATIONS BELL REYNOLDS DA, 1995, IEEE SIGNAL PROC LET, V2, P46, DOI 10.1109/97.372913 Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 Shinoda K, 2001, IEEE T SPEECH AUDI P, V9, P276, DOI 10.1109/89.906001 Xiang B, 2003, IEEE T SPEECH AUDI P, V11, P447, DOI 10.1109/TSA.2003.815822 Young Steve, 2002, HTK BOOK VERSION 3 2 NR 23 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2006 VL 48 IS 1 BP 85 EP 95 DI 10.1016/j.specom.2005.06.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 996OY UT WOS:000234184800006 ER PT J AU Manohar, K Rao, P AF Manohar, K Rao, P TI Speech enhancement in nonstationary noise environments using noise properties SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; nonstationary noise; spectral subtraction AB Traditional short-time spectral attenuation (STSA) speech enhancement algorithms are ineffective in the presence of highly nonstationary noise due to difficulties in the accurate estimation of the local noise spectrum. With a view to improving the speech quality in the presence of random noise bursts, characteristic of many environmental sounds, a simple postprocessing scheme is proposed that can be applied to the output of an STSA speech enhancement algorithm. The postprocessing algorithm is based on using spectral properties of the noise in order to detect noisy time-frequency regions which are then attenuated using an SNR-based rule. A suitable suppression rule is developed that is applied to the detected noisy regions so as to achieve significant reduction of noise with minimal speech distortion. The postprocessing method is evaluated in the context of two well-known STSA speech enhancement algorithms, and experimental results demonstrating improved speech quality are presented for a data set of real noise samples. (c) 2005 Elsevier B.V. All rights reserved. C1 Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India. RP Rao, P (reprint author), Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India.
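The Manohar and Rao record above attenuates detected noisy time-frequency regions with an SNR-based rule. A minimal sketch of that postprocessing step given a detection mask; the detector itself is abstracted away (the paper derives it from spectral properties of the noise), and the floored Wiener-style gain is an assumed stand-in for the paper's suppression rule.

import numpy as np

def snr_gain(local_snr, min_gain=0.1):
    """Wiener-style gain, floored to limit speech distortion."""
    return np.maximum(min_gain, local_snr / (1.0 + local_snr))

def postprocess(enhanced_pow, noise_pow, detected_noisy):
    """Attenuate only the time-frequency regions flagged as residual noise.

    detected_noisy: boolean (frames, bins) mask from a noise-burst detector;
    any detector producing such a mask can be plugged in here.
    """
    local_snr = enhanced_pow / (noise_pow + 1e-12)
    gain = np.where(detected_noisy, snr_gain(local_snr), 1.0)
    return gain * enhanced_pow

rng = np.random.default_rng(6)
enhanced = np.abs(rng.normal(size=(50, 129))) ** 2
noise = np.full((50, 129), 0.5)
mask = rng.random((50, 129)) < 0.1       # pretend 10% of bins were flagged
print(postprocess(enhanced, noise, mask).mean())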
EM manohar@ee.iitb.ac.in; prao@ee.iitb.ac.in CR ACCARDI AJ, 1999, P ICASSP 99 Attias H, 2001, ADV NEUR IN, V13, P758 Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X Berouti M., 1979, P IEEE INT C AC SPEE, P208 *BMG CRESC IND LTD, 1999, ESS IND SOUND EFF COHEN J, 2001, AAMC REPORTER, V11, P5 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Fisher W.M., 1986, P DARPA SPEECH REC W HANSEN JHL, 1998, P ICSLP 1998 Hirsch H., 1995, P ICASSP, P153 ITOH K, 1997, P ICASSP 1997 *ITUT REC, 1993, OBJ MEAS ACT SPEECH JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 MALAH D, 1999, P IEEE INT C AC SPEE, P789 Martin R, 2001, IEEE T SPEECH AUDI P, V9, P504, DOI 10.1109/89.928915 MARZINZIK, 2000, THESIS U OLDENBURG Marzinzik M, 2002, IEEE T SPEECH AUDI P, V10, P109, DOI 10.1109/89.985548 *PERC EV SPEECH QU, 2001, ITUT REC P862 QUACKENBUSH SR, 1988, OBJ MEAS SPEECH QUAL Renevey P., 2001, P 7 EUR C SPEECH COM, P1887 Ris C, 2001, SPEECH COMMUN, V34, P141, DOI 10.1016/S0167-6393(00)00051-0 SELTZER ML, 2000, P ICSLP 2000 SRINIVASAN S, 2003, P EUROSPEECH 2003 STAHL V, 2000, P ICASSP, V3, P1875 YAO K, 2004, P ICASSP 2004, P693 NR 27 TC 15 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2006 VL 48 IS 1 BP 96 EP 109 DI 10.1016/j.specom.2005.08.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 996OY UT WOS:000234184800007 ER PT J AU Wu, YJ Kawai, H Ni, JF Wang, RH AF Wu, YJ Kawai, H Ni, JF Wang, RH TI Discriminative training and explicit duration modeling for HMM-based automatic segmentation SO SPEECH COMMUNICATION LA English DT Article DE automatic segmentation; discriminative training; minimum segmentation error; explicit duration modeling; speech synthesis ID HIDDEN MARKOV-MODELS; SPEECH RECOGNITION AB HMM-based automatic segmentation has been popularly used for corpus construction for concatenative speech synthesis. Since the most important reasons for the inaccuracy of HMM-based automatic segmentation are the HMM training criterion and duration control, we study these particular issues. For HMM training, we apply a discriminative training method and introduce a new criterion, named Minimum SeGmentation Error (MSGE). In this method, a loss function directly related to the segmentation error is defined, and parameter optimization is performed by the Generalized Probabilistic Descent (GPD) algorithm. For the duration control problem, we apply explicit duration models and propose a two-step segmentation method to solve the problem of computational cost, where the duration model is incorporated in a postprocessing procedure. From the experimental results, these two techniques significantly improve segmentation accuracy with different focuses: the MSGE-based discriminative training focuses on improving the accuracy of sensitive boundaries, i.e., boundaries where an error in segmentation is likely to cause a noticeable degradation in speech synthesis quality, and the explicit duration modeling focuses on eliminating large errors. After combining these two techniques, the average error was reduced from 6.86 ms to 5.79 ms on Japanese data, and from 8.67 ms to 6.61 ms on Chinese data.
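The Wu et al. record above smooths a segmentation-error loss so that Generalized Probabilistic Descent can minimise it. A toy, minimal sketch of that idea: a sigmoid of the boundary error is minimised by gradient steps on a single offset parameter; the one-parameter "model", tolerance and step sizes are purely illustrative, and a numeric gradient stands in for the analytic one.

import numpy as np

def sigmoid(x, gamma=1.0):
    return 1.0 / (1.0 + np.exp(-gamma * x))

# Toy setup: a single parameter theta shifts every predicted boundary;
# the loss is a sigmoid of the absolute boundary error minus a tolerance.
ref_bounds = np.array([0.10, 0.35, 0.62, 0.90])      # seconds
raw_bounds = ref_bounds + 0.02                        # systematic offset

def loss(theta, tol=0.005):
    err = np.abs(raw_bounds + theta - ref_bounds) - tol
    return sigmoid(err, gamma=200.0).mean()

theta, eps = 0.0, 1e-4
for step in range(500):                               # gradient descent on the smoothed loss
    grad = (loss(theta + 1e-5) - loss(theta - 1e-5)) / 2e-5
    theta -= eps * grad
print(theta, loss(theta))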
Simultaneously, the number of errors larger than 30 ms was reduced by 25% and 51% on Chinese and Japanese data, respectively. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Sci & Technol China, Ifly Speech Lab, Hefei 230026, Anhui, Peoples R China. ATR, Spoken Language Translat Res Labs, Kyoto 6190288, Japan. RP Wu, YJ (reprint author), Univ Sci & Technol China, Ifly Speech Lab, West Campus,8-642, Hefei 230026, Anhui, Peoples R China. EM jasonwu@mail.ustc.edu.cn; hisashi.kawai@atr.jp; jinfu.ni@atr.jp; rhw@ustc.edu.cn CR BLUM JR, 1954, ANN MATH STAT, V25, P737, DOI 10.1214/aoms/1177728659 Burshtein D, 1996, IEEE T SPEECH AUDI P, V4, P240, DOI 10.1109/89.496221 CARVALHO P, 1998, P RECPAD 98 10 PORT, P221 Chou W, 2000, P IEEE, V88, P1201 Chu M., 2001, P ICASSP 2001 SALT L, V2, P785 HUANG X, 1996, P ICSLP 96, V4, P2387 Hunt A., 1996, P INT C AC SPEECH SI, V1, P373 Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 Kim I, 2002, ELEC SOC S, V2002, P145 Levinson S. E., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80009-2 Ljolje A., 1997, PROGR SPEECH SYNTHES, P305 Malfrere F., 1997, P EUR C SPEECH COMM, P2631 MCDERMOTT E, 1997, THESIS RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Russell M. J., 1985, P IEEE INT C AC SPEE, P5 SETHY A, 2002, ICSLP 2002, P149 VANSANTEN JPH, 1999, P EUROSPEECH 1999 BU, P2809 Yoma NB, 2001, IEEE T SPEECH AUDI P, V9, P179, DOI 10.1109/89.902285 Young S., 1999, HTK BOOK NR 19 TC 2 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2005 VL 47 IS 4 BP 397 EP 410 DI 10.1016/j.specom.2005.03.016 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 985SK UT WOS:000233398500001 ER PT J AU Palaz, H Bicil, Y Kanak, A Dogan, MU AF Palaz, H Bicil, Y Kanak, A Dogan, MU TI New Turkish intelligibility test for assessing speech communication systems SO SPEECH COMMUNICATION LA English DT Article DE Turkish intelligibility test; diagnostic rhyme test (DRT); assessment of speech communication systems; quality of service (QoS) AB This article describes a Turkish Intelligibility Test (TIT) designed to evaluate the Quality of Service (QoS) of speech communication systems (SCSs) on the basis of Turkish phonetic properties. Since widely used speech communication systems are generally developed with one language in mind, broadening the linguistic coverage of next-generation SCSs requires assessment methods for many widely used languages, including Turkish. In this article, the selection of TIT material in view of Turkish phonetic characteristics is discussed, and the conduct of the TIT, including the recording, preparation and presentation of test material, is detailed. Experimental results, which present subjective intelligibility assessment outcomes for three well-known speech coders, are given in comparison with the results of the Diagnostic Rhyme Test (DRT) in North American English. Intelligibility assessment via the TIT is expected to be a leading and critical concept in improving the intelligibility performance of Turkish processed by SCSs under varying acoustic environments. (c) 2005 Elsevier B.V. All rights reserved. C1 UEKAE, TUBITAK, Sci & Tech Res Council Turkey, Natl Res Inst Elect & Cryptol, TR-41470 Kocaeli, Turkey. RP Palaz, H (reprint author), UEKAE, TUBITAK, Sci & Tech Res Council Turkey, Natl Res Inst Elect & Cryptol, POB 74, TR-41470 Kocaeli, Turkey.
EM akustiklab@uekae.tubitak.gov.tr CR Afifi A. A., 1972, STAT ANAL COMPUTER O ANSI, 1989, S32 ANSI BAGUOGLU T, 2004, TURKCENIN GRAMERI TU ERGENC I, 1995, KONUSMA DILI TURKCEN ERGIN M, 2002, TURK DIL BILGISI TUR HOUSE AS, 1965, J ACOUST SOC AM, V37, P158, DOI 10.1121/1.1909295 *I TWTUV GMBH, 2001, 0UUUI8G01IAC TWTUV IPA I. P. A., 1999, HDB INT PHON ASS GUI SANDER JW, 2002, PAST PRESENT FUTURE STEENEKEN HJM, 1992, MEASURING PREDICTING TARDELLI JD, 2002, IEEE SPEECH COD WORK VOIERS WD, 1983, SPEECH TECHNOL, P30 VOIERS WD, 1977, SPEECH INTELLIGIBILI NR 13 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2005 VL 47 IS 4 BP 411 EP 423 DI 10.1016/j.specom.2005.04.005 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 985SK UT WOS:000233398500002 ER PT J AU Trout, JD AF Trout, JD TI Lexical boosting of noise-band speech in open- and closed-set formats SO SPEECH COMMUNICATION LA English DT Article DE spoken word recognition; lexical boosting; noise-band speech; monosyllables; open- and closed-set task ID WORD RECOGNITION; COCHLEAR IMPLANTS; CONSONANT RECOGNITION; PHONEMIC RESTORATION; NORMAL-HEARING; TEMPORAL CUES; ENVELOPE CUES; SPOKEN WORDS; CHANNELS; NUMBER AB Recognition of spoken words in noise and in quiet is more accurate for Lexically Easy words (high frequency words with few similar-sounding neighbors) than for Lexically Hard words (low frequency words with many similar-sounding neighbors). Using monosyllables, the present set of two experiments extends this finding to a perceptually interesting class of stimuli and test formats. In both open- and closed-set formats, listeners attempted to identify amplitude-modulated and bandpass-filtered words [Shannon, R., Zeng, F., Kamath, V., Wygonski, J., Ekelid, M., 1995. Speech recognition with primarily temporal cues. Science 270, 303-304], so-called noise-band speech, shown to simulate the performance of cochlear implant (CI) patients using the same number of frequency channels. The words were synthesized from a database that controls for Lexical Difficulty, Talker Identity and Talker Gender. Word recognition was significantly more accurate for Easy words in both the open- and the closed-set experiments. These results indicate that, even when spoken word recognition is challenged by noise-band speech, the Easy-Hard effect survives the perceptually uncertain conditions of word variability. Consequences for models of spoken word recognition are explored. (c) 2005 Elsevier B.V. All rights reserved. C1 Loyola Univ, Parmly Hearing Inst, Chicago, IL 60626 USA. RP Trout, JD (reprint author), Loyola Univ, Parmly Hearing Inst, 6525 N Sheridan Rd, Chicago, IL 60626 USA. EM jtrout@luc.edu CR Bashford JA, 1996, PERCEPT PSYCHOPHYS, V58, P342, DOI 10.3758/BF03206810 Bradlow AR, 1999, J ACOUST SOC AM, V106, P2074, DOI 10.1121/1.427952 Cohen J., 1988, STAT POWER ANAL BEHA, V2nd Dirks DD, 2001, EAR HEARING, V22, P1, DOI 10.1097/00003446-200102000-00001 Dorman M F, 2000, Ann Otol Rhinol Laryngol Suppl, V185, P64 Dorman MF, 1997, J ACOUST SOC AM, V102, P2403, DOI 10.1121/1.419603 Dorman MF, 1998, EAR HEARING, V19, P162, DOI 10.1097/00003446-199804000-00008 Dorman MF, 2000, EAR HEARING, V21, P590, DOI 10.1097/00003446-200012000-00006 Fishman KE, 1997, J SPEECH LANG HEAR R, V40, P1201 Forster K., 1979, SENTENCE PROCESSING Forster K.
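The noise-band ("Shannon") processing described in the Trout abstract above is straightforward to approximate: each band's amplitude envelope modulates band-limited noise. The sketch below assumes scipy is available; the band edges, filter orders and envelope cutoff are illustrative placeholders, not the stimulus parameters of the cited studies.

import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

def noise_vocode(x, fs, edges=(100, 800, 1500, 2500, 4000), env_cut=160.0):
    # Per-band envelopes of the input modulate band-limited noise carriers.
    rng = np.random.default_rng(0)
    env_sos = butter(2, env_cut, btype="low", fs=fs, output="sos")
    out = np.zeros(len(x))
    for lo, hi in zip(edges[:-1], edges[1:]):
        band_sos = butter(4, (lo, hi), btype="bandpass", fs=fs, output="sos")
        band = sosfilt(band_sos, x)
        env = np.maximum(sosfiltfilt(env_sos, np.abs(band)), 0.0)  # rectify + smooth
        carrier = sosfilt(band_sos, rng.standard_normal(len(x)))
        out += env * carrier
    return out / (np.max(np.abs(out)) + 1e-12)                    # normalise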
I., 1976, NEW APPROACHES LANGU Friesen LM, 2001, J ACOUST SOC AM, V110, P1150, DOI 10.1121/1.1381538 Kucera F, 1967, COMPUTATIONAL ANAL P Loizou PC, 1999, J ACOUST SOC AM, V106, P2097, DOI 10.1121/1.427954 Luce PA, 1998, EAR HEARING, V19, P1, DOI 10.1097/00003446-199802000-00001 LUCE PA, 1986, PERCEPT PSYCHOPHYS, V39, P155, DOI 10.3758/BF03212485 MARSLENWILSON WD, 1978, COGNITIVE PSYCHOL, V10, P29, DOI 10.1016/0010-0285(78)90018-X Nusbaum H. C., 1984, 10 IND U SPEECH RES, P357 PAAP KR, 1982, PSYCHOL REV, V89, P573, DOI 10.1037/0033-295X.89.5.573 Pisoni D.B., 2000, VOLTA REV, V101, P111 Pisoni DB, 2005, BLACKW HBK LINGUIST, P494, DOI 10.1002/9780470757024.ch20 REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191 SAMUEL AG, 1987, J MEM LANG, V26, P36, DOI 10.1016/0749-596X(87)90061-1 SCHROEDE.MR, 1968, J ACOUST SOC AM, V44, P1735, DOI 10.1121/1.1911323 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 Sommers MS, 1997, EAR HEARING, V18, P89, DOI 10.1097/00003446-199704000-00001 TORRETTA GM, 1995, 20 IND U SPEECH RES, P321 TROUT JD, 1990, LANG SPEECH, V33, P121 VANTASSELL DJ, 1992, J ACOUST SOC AM, V92, P1247, DOI 10.1121/1.403920 van der Horst R, 1999, J ACOUST SOC AM, V105, P1801, DOI 10.1121/1.426718 VANTASELL DJ, 1987, J ACOUST SOC AM, V82, P1152, DOI 10.1121/1.395251 NR 31 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2005 VL 47 IS 4 BP 424 EP 435 DI 10.1016/j.specom.2005.04.011 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 985SK UT WOS:000233398500003 ER PT J AU Rouas, JL Farinas, J Pellegrino, F Andre-Obrecht, R AF Rouas, JL Farinas, J Pellegrino, F Andre-Obrecht, R TI Rhythmic unit extraction and modelling for automatic language identification SO SPEECH COMMUNICATION LA English DT Article DE rhythm modelling; language identification; rhythm typology; Asian languages; European languages ID SPEECH; INFANTS AB This paper deals with an approach to automatic language identification based on rhythmic modelling. Besides phonetics and phonotactics, rhythm is actually one of the most promising features to be considered for language identification, even if its extraction and modelling are not a straightforward issue. Actually, one of the main problems to address is what to model. In this paper, an algorithm of rhythm extraction is described: using a vowel detection algorithm, rhythmic units related to syllables are segmented. Several parameters are extracted (consonantal and vowel duration, cluster complexity) and modelled with a Gaussian Mixture. Experiments are performed on read speech for seven languages (English, French, German, Italian, Japanese, Mandarin and Spanish) and results reach up to 86 +/- 6% of correct discrimination between stress-timed, mora-timed and syllable-timed classes of languages, and to 67 +/- 8% of correct language identification on average for the seven languages with utterances of 21 s. These results are discussed and compared with those obtained with a standard acoustic Gaussian mixture modelling approach (89 +/- 5% of correct identification for the seven-language identification task). (c) 2005 Published by Elsevier B.V. C1 Univ Toulouse 3, Inst Rech Informat Toulouse, CNRS, UMR 5505, F-31062 Toulouse, France. RP Pellegrino, F (reprint author), Univ Lyon 2, Lab Dynam Langage, CNRS, UMR 5596, F-69363 Lyon, France.
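A minimal sketch of the modelling half of the approach above (one Gaussian mixture per language over rhythmic-unit features, maximum-likelihood decision), assuming the vowel detection and unit segmentation have already produced one feature row per rhythmic unit. The function names and the component count are illustrative choices, not the paper's configuration.

import numpy as np
from sklearn.mixture import GaussianMixture

def train_rhythm_models(units_by_language, n_components=8):
    # units_by_language: {language: array of shape (n_units, n_features)},
    # e.g. columns (consonant_cluster_duration, vowel_duration, complexity).
    return {lang: GaussianMixture(n_components, covariance_type="diag",
                                  random_state=0).fit(X)
            for lang, X in units_by_language.items()}

def identify_language(models, utterance_units):
    # Pick the language whose GMM gives the highest average log-likelihood
    # over the rhythmic units of the test utterance.
    return max(models, key=lambda lang: models[lang].score(utterance_units))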
EM rouas@irit.fr; jerome.farinas@irit.fr; francois.pellegrino@univ-lyon2.fr; obrecht@irit.fr CR Abercrombie D, 1967, ELEMENTS GEN PHONETI Adami A.G., 2003, P EUROSPEECH, P841 ANDREOBRECHT R, 1988, IEEE T ACOUST SPEECH, V36 ANTOINE F, 2004, P JOURN ET PAR FES M BARKATDEFRADAS M, 2003, REV PAROLE, V25, P1 BERG T, 1992, PSYCHOL RES-PSYCH FO, V54, P114, DOI 10.1007/BF00937140 Besson M., 2001, ANN NEW YORK ACAD SC, V930 Bond ZS, 2002, LANG SCI, V24, P175, DOI 10.1016/S0388-0001(01)00013-4 BOYSSONBARDIES B, 1992, PHONOLOGICAL DEV MOD CAMPIONE E, 1998, P ICSLP 98 SYDN AUST CONTENT A, 2000, P WORKSH SPOK WORD A CONTENT A, 2001, J MEMORY LANG, V45 CRYSTAL D, 1990, DICT LINGUISTICS PHO CUMMINS F, 1999, P EUROSPEECH 99 CUTLER A, 1988, J EXP PSYCHOL HUMAN, V14 CUTLER A, 1996, SIGNAL SYNTAX BOOTST DAUER RM, 1983, J PHONET, V11 DELATTRE P, 1969, LINGUA, V22, P160, DOI 10.1016/0024-3841(69)90051-5 DOMINEY PF, 2000, LANG COGNITIVE PROCE, V15 DRULLMAN R, 1994, JASA, V95 FAKOTAKIS N, 1997, 5 EUR C SPEECH COMM, V5, P2247 FERRAGNE E, 2004, P INTERSPEECH ICSLP Fromkin V., 1973, SPEECH ERRORS LINGUI FUJIMURA O, 1975, IEEE T ACOUST SPEECH, VAS23, P82, DOI 10.1109/TASSP.1975.1162631 GALVES A, 2002, P SPEECH PROS 2002 C GANAPATHIRAJU A, 1999, WEBPASGE SYLLABLE BA GAUVAIN JL, 2004, P INT SPOK LANG PROC GRABE E, 2002, PAPERS LAB PHONOLOGY, P7 Greenberg S., 1998, P ESCA WORKSH MOD PR GREENBERG S, 2002, P 2001 ISCA WORKSH P, P53 Greenberg S., 1996, P ESCA TUT ADV RES W GREENBERG S, 1997, P ESCA WORKSH ROB SP HAMDI R, 2004, P INTERSPEECH ICSLP HOWEITT AW, 2000, 6 INT C SPOK LANG PR JESTEAD W, 1982, JASA, V74 KELLER BZ, 2002, P SPEECH PROS 2002 C, P727 KELLER E, 1997, P MIDDIM 96 12 14 AU, P300 KELLNER BZ, 2001, IMPROVEMENTS SPEECH KERN S, IN PRESS EUROPEAN SC KITAZAWA S, 2002, SPEECH PROSODY KOMATSU M, 2004, P SPEECH PROS NAR JA, P725 KOPECEK I, 1999, LECT NOTES ARTIFICIA, V1692 LADEFOGED P, 1975, COURSE PHONETICS, P296 LEVELT W, 1994, COGNITION, V50 LI KP, 1994, P IEEE ICASSP 94 AD LIBERMAN AM, 1985, COGNITION, V21 LINDE Y, 1980, IEEE T COMM, V28 MacNeilage P. F., 2000, EVOLUTIONARY EMERGEN, P146, DOI 10.1017/CBO9780511606441.010 MacNeilage PF, 1998, BEHAV BRAIN SCI, V21, P499 MacNeilage PF, 2000, CHILD DEV, V71, P153, DOI 10.1111/1467-8624.00129 Martin A., 2003, P EUROSPEECH, P1341 MASSARO DW, 1972, PSYCHOL REV, V79 MEHLER J, 1981, J VERBAL LEARNING VE, V20 MEHLER J, 1996, SIGNAL SYNTAX BOOTST MIRGHAFORI N, 1995, P EUROSPEECH 95 MADR Mobius B., 1998, P 3 ESCA WORKSH SPEE, P59 Murthy H.A., 2004, P ICASSP, pI MUTHUSAMY YK, 1994, P IEEE ICASSP 94 AD Nazzi T, 2003, SPEECH COMMUN, V41, P233, DOI 10.1016/S0167-6393(02)00106-1 Nowlan S., 1991, THESIS CARNEGIE MELL O'Shaughnessy D., 1987, SPEECH COMMUNICATION OHALA JJ, 1979, PROBLEMES PROSODIE, V2 Pellegrino F, 2000, SIGNAL PROCESS, V80, P1231, DOI 10.1016/S0165-1684(00)00032-3 PELLEGRINO F, 2004, P SPEECH PROS 2004 N PFAU T, 1998, P IEEE ICASSP 98 SEA PFITZINGER H, 1996, 4 INT C SPOK LANG PR, V2, P1261 RAMUS F, 1999, J ACOUST SOC AM, V105 RAMUS F, 2002, P SPEECH PROS 2002 A Ramus F., 1999, COGNITION, V73 Ramus F., 2002, ANN REV LANGUAGE ACQ, V2, P85, DOI DOI 10.1075/ARLA.2.05RAM REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Rouas J. 
L., 2003, P ICASSP 2003 HONG K, V1, P40 ROUAS JL, 2004, ACT 25 JEP FES MAR A SHASTRI L, 1999, P ICPHS 99 SAN FRANC Singer E., 2003, P EUROSPEECH, P1345 STOCKMAL, 1996, P ICSLP PHIL, P1748 Stockmal V, 2000, APPL PSYCHOLINGUIST, V21, P383, DOI 10.1017/S0142716400003052 TAYLOR PA, 1997, P EUR 97 RHOD GREEC THYMEGOBBEL A, 1999, P ICPHS 99 SAN FRANC TODD NP, 1994, P ICSLP 94 YOK JAP VALLEE N, 2000, P JEP 2000 AUSS FRAN VASILESCU I, 2000, P ICSLP 2000 BEIJ VERHASSELT JP, 1996, P ISCLP 96 PHIL PA U WEISSENBORN J, 2001, APPROACHES BOOTSTRAP, V1, P299 WU SL, 1998, TR98014 INT COMP SCI Zissman MA, 2001, SPEECH COMMUN, V35, P115, DOI 10.1016/S0167-6393(00)00099-6 NR 86 TC 18 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2005 VL 47 IS 4 BP 436 EP 456 DI 10.1016/j.specom.2005.04.012 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 985SK UT WOS:000233398500004 ER PT J AU Paulikas, S Navakauskas, D AF Paulikas, S Navakauskas, D TI Restoration of voiced speech signals preserving prosodic features SO SPEECH COMMUNICATION LA English DT Article DE accent; prosody; speech; restoration; algorithm AB The paper deals with the restoration of voiced speech signals that contain clicks, noise, and gaps of sufficient length so that entire phonemes are lost. A particular focus is on restoring the proper prosodic elements: accent and stress. The importance of the problem is underscored by the fact that the meaning of some words (known as homographs) is solely dependent on the prosody. A new restoration method is proposed, exploiting a simple polynomial accent model whose parameters are averaged speech signal characteristics of intensity and fundamental frequency period. Feasibility of the method is confirmed by experimental investigations of the restoration of both one period and multiple periods of a voiced speech signal, examination of an instantaneous error, the total mean-square-error, and the influence of sampling frequency on the restoration quality. (c) 2005 Elsevier B.V. All rights reserved. C1 Vilnius Gediminas Tech Univ, Dept Elect Syst, LT-03227 Vilnius, Lithuania. Vilnius Gediminas Tech Univ, Dept Telecommun Engn, LT-03227 Vilnius, Lithuania. RP Navakauskas, D (reprint author), Vilnius Gediminas Tech Univ, Dept Elect Syst, Naugarduko 41-422, LT-03227 Vilnius, Lithuania.
EM dalius.navakauskas@el.vtu.lt RI Navakauskas, Dalius/F-4516-2010 OI Navakauskas, Dalius/0000-0001-8897-7366 CR BARAUSKAITE J, 1995, LITHUANIAN LANGUAGE, V1 Botinis A, 2001, SPEECH COMMUN, V33, P263, DOI 10.1016/S0167-6393(00)00060-1 Czyzewski A, 1997, J AUDIO ENG SOC, V45, P815 DUBNOWSKI JJ, 1976, IEEE T ACOUST SPEECH, V24, P2, DOI 10.1109/TASSP.1976.1162765 Etter W, 1996, IEEE T SIGNAL PROCES, V44, P1124, DOI 10.1109/78.502326 GODSILL SJ, 1995, IEEE T SPEECH AUDI P, V3, P267, DOI 10.1109/89.397091 GOLOVINAS B, 1982, INTRO LINGUISTICS Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X Kreiman J, 1998, J ACOUST SOC AM, V104, P1598, DOI 10.1121/1.424372 Navakauskas D., 1999, THESIS VILNIUS GEDIM PAKERYS A, 1986, PHONETICS COMMON LIT PAULIKAS S, 1999, THESIS VILNIUS GEDIM PAULIKAS S, 1998, P 1 INT C DIG SIGN P, V1, P130 SCHULLER D, 1987, 43 FIAF C ARCH AUD V, P85 SCHULLER D, 1991, J AUDIO ENG SOC, V39, P1014 Vaseghi S, 2000, ADV SIGNAL PROCESSIN VASEGHI SV, 1993, EUROSPEECH, V2, P1023 NR 17 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 2005 VL 47 IS 4 BP 457 EP 468 DI 10.1016/j.specom.2005.05.002 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 985SK UT WOS:000233398500005 ER PT J AU Zhu, DL Nakamura, S Paliwal, KK Wang, RH AF Zhu, DL Nakamura, S Paliwal, KK Wang, RH TI Maximum likelihood sub-band adaptation for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE robust speech recognition; sub-band; adaptation AB Noise-robust speech recognition has become an important area of research in recent years. In current speech recognition systems, the Mel-frequency cepstrum coefficients (MFCCs) are used as recognition features. When the speech signal is corrupted by narrow-band noise, the entire MFCC feature vector gets corrupted and it is not possible to exploit the frequency-selective property of the noise signal to make the recognition system robust. Recently, a number of sub-band speech recognition approaches have been proposed in the literature, where the full-band power spectrum is divided into several sub-bands and then the sub-bands are combined depending on their reliability. In conventional sub-band approaches the reliability can only be set experimentally or estimated during training procedures, which may not match the observed data and often causes degradation of performance. We propose a novel sub-band approach, where frequency sub-bands are multiplied with weighting factors and then combined and converted to cepstra, which have proven to be more robust than both full-band and conventional sub-band cepstra in our experiments. Furthermore, the weighting factors can be estimated by using maximum likelihood adaptation approaches in order to minimize the mismatch between trained models and observed features. We evaluated our methods on AURORA2 and Resource Management tasks and obtained consistent performance improvement on both tasks. (c) 2005 Elsevier B.V. All rights reserved. C1 Griffith Univ, Sch Microelect Engn, Nathan, Qld 4111, Australia. RP Zhu, DL (reprint author), Univ Hong Kong, Dept Comp Sci, Pokfulam Rd, Hong Kong, Hong Kong, Peoples R China. 
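A minimal sketch of the sub-band weighting idea in the Zhu et al. abstract above, applied for simplicity in the log filterbank-energy domain before the cepstral DCT. In the paper the weighting factors multiply frequency sub-bands of the power spectrum and are estimated by maximum-likelihood adaptation; here the weights are simply given, and the function name and dimensions are assumptions.

import numpy as np
from scipy.fft import dct

def weighted_subband_cepstra(log_energies, weights, n_cep=13):
    # log_energies: (frames, n_bands) log filterbank outputs.
    # weights: (n_bands,) sub-band reliability factors (ML-adapted in the
    # paper; fixed inputs in this sketch).
    weighted = log_energies * weights[np.newaxis, :]   # scale each sub-band
    return dct(weighted, type=2, norm="ortho", axis=1)[:, :n_cep]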
EM dlzhu@cs.hku.hk; satoshi.nakamura@atr.co.jp; k.paliwal@me.gu.edu.au; rhw@ustc.edu.cn CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Bourlard H., 1996, P ICSLP Bourlard H., 1997, P ICASSP, P1251 Cerisara C, 1998, INT CONF ACOUST SPEE, P717, DOI 10.1109/ICASSP.1998.675365 CERISARA C, 2000, P ICASSP, P1121 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Fletcher H., 1953, SPEECH HEARING COMMU Gauvain J.-L., 1991, P DARPA SPEECH NAT L, P272, DOI 10.3115/112405.112457 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1996, P INT C SPOK LANG PR, P462 Hirsch H. G., 2000, ISCA ITRW ASR2000 AU KRYTER KD, 1960, J ACOUST SOC AM, V32, P547, DOI 10.1121/1.1908140 Legetter C., 1995, COMPUTER SPEECH LANG, V9, P171 Leonard R., 1984, P ICASSP Lippmann RP, 1996, IEEE T SPEECH AUDI P, V4, P66, DOI 10.1109/TSA.1996.481454 Macho D., 2002, P ICSLP MAK B, 2000, P ICSLP, V4, P149 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 MORRIS A, 1999, P EUR BUD HUNG, P599 Okawa S, 1998, INT CONF ACOUST SPEE, P641, DOI 10.1109/ICASSP.1998.675346 PALIWAL KK, 2000, 7 W PAC REG AC C 200, P61 Price P., 1988, P IEEE INT C AC SPEE, P651 RIENER K, 1992, J ACOUST SOC AM, V91, pS2339, DOI 10.1121/1.403495 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 STEVE Y, 2001, HTK BOOK TAM YC, 2001, P EUR, V1, P575 Tibrewala S., 1997, P ICASSP, P1255 TOMLINSON MJ, 1997, P ICASSP VARGA AP, 1992, NOISEX 92 STUDY EFFE WARREN RM, 1995, PERCEPT PSYCHOPHYS, V57, P175, DOI 10.3758/BF03206503 Weber K, 2003, COMPUT SPEECH LANG, V17, P195, DOI 10.1016/S0885-2308(03)00012-3 NR 33 TC 7 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2005 VL 47 IS 3 BP 243 EP 264 DI 10.1016/j.specom.2005.02.006 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 979AW UT WOS:000232915500001 ER PT J AU So, S Paliwal, KK AF So, S Paliwal, KK TI Multi-frame GMM-based block quantisation of line spectral frequencies SO SPEECH COMMUNICATION LA English DT Article DE speech coding; LSF coding; transform coding; block quantisation; Gaussian mixture models ID VECTOR QUANTIZATION; LPC PARAMETERS; ALGORITHM; DESIGN AB In this paper, we investigate the use of the Gaussian mixture model-based block quantiser for coding line spectral frequencies that uses multiple frames and mean squared error as the quantiser selection criterion. As a viable alternative to vector quantisers, the GMM-based block quantiser offers low computational and memory requirements as well as bitrate scalability. Jointly quantising multiple frames allows the exploitation of correlation across successive frames, which leads to more efficient block quantisation. The efficiency gained from joint quantisation permits the use of the mean squared error distortion criterion for cluster quantiser selection, rather than the computationally expensive spectral distortion. The distortion performance gains come at the cost of an increase in computational complexity and memory. Experiments on narrowband speech from the TIMIT database demonstrate that the multi-frame GMM-based block quantiser can achieve a spectral distortion of 1 dB at 22 bits/frame, or 21 bits/frame with some added complexity.
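A toy sketch of the MSE-based cluster-quantiser selection just described: multi-frame vectors are built by concatenating consecutive LSF frames, each GMM component supplies a KLT (eigenvector) transform, the transformed vector is scalar-quantised, and the component giving the least mean squared error is kept. Per-dimension bit allocation is omitted and the fixed step size is purely illustrative; x is one stacked multi-frame vector and gmm an assumed pre-fitted sklearn GaussianMixture with covariance_type="full".

import numpy as np

def gmm_block_quantise(x, gmm, step=0.02):
    best_x, best_err = None, np.inf
    for mu, cov in zip(gmm.means_, gmm.covariances_):
        _, V = np.linalg.eigh(cov)                     # per-cluster KLT basis
        y = (x - mu) @ V                               # decorrelate
        x_hat = (step * np.round(y / step)) @ V.T + mu # uniform quantise, invert
        err = float(np.sum((x - x_hat) ** 2))
        if err < best_err:                             # MSE-based selection
            best_x, best_err = x_hat, err
    return best_x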
(c) 2005 Elsevier B.V. All rights reserved. C1 Griffith Univ, Sch Microelect Engn, Brisbane, Qld 4111, Australia. RP Paliwal, KK (reprint author), Griffith Univ, Sch Microelect Engn, Nathan Campus, Brisbane, Qld 4111, Australia. EM k.paliwal@griffith.edu.au CR ATAL BS, 1979, IEEE T ACOUST SPEECH, V27, P247, DOI 10.1109/TASSP.1979.1163237 CAMPBELL JP, 1989, P IEEE INT C AC SPEE, P735 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 GARDNER WR, 1995, IEEE T SPEECH AUDI P, V3, P367, DOI 10.1109/89.466658 Gersho A., 1992, VECTOR QUANTIZATION GRAY AH, 1976, IEEE T ACOUST SPEECH, V24, P459, DOI 10.1109/TASSP.1976.1162857 Hedelin P, 2000, IEEE T SPEECH AUDI P, V8, P385, DOI 10.1109/89.848220 Huang J.J.Y., 1963, IEEE Transactions on Communication Systems, VCS-11, DOI 10.1109/TCOM.1963.1088759 Itakura F., 1969, P JSA, P199 Itakura F., 1975, J ACOUST SOC AM, V57, P35 Kroon P., 1995, SPEECH CODING SYNTHE, P79 LeBlanc WP, 1993, IEEE T SPEECH AUDI P, V1, P373, DOI 10.1109/89.242483 LINDE Y, 1980, IEEE T COMMUN, V28, P1 NURMINEN J, 2003, P EUROSPEECH 03, P1073 Paliwal K. K., 1995, SPEECH CODING SYNTHE, P443 PALIWAL KK, 2004, P IEEE INT C AC SPEE, P149 Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363 Proakis J. G., 1996, DIGITAL SIGNAL PROCE SHABESTARY TZ, 2002, P INT C AC SPEECH SI, V1, P641 SINERVO U, 2003, P EUROSPEECH 03, P1073 Soong F, 1984, P IEEE INT C AC SPEE, P37 Subramaniam AD, 2003, IEEE T SPEECH AUDI P, V11, P130, DOI 10.1109/TSA.2003.809192 SUBRAMANIAM AD, 2001, P IEEE INT C AC SPEE, V2, P705 SUBRAMANIAM AD, 2000, 34 AS C SIGN SYST CO SUGAMURA N, 1986, SPEECH COMMUN, V5, P199, DOI 10.1016/0167-6393(86)90008-7 TSAO C, 1985, IEEE T ACOUST SPEECH, V33, P537 VISWANATHAN R, 1975, IEEE T ACOUST SPEECH, VAS23, P309, DOI 10.1109/TASSP.1975.1162675 Xydeas CS, 1999, IEEE T SPEECH AUDI P, V7, P113, DOI 10.1109/89.748117 NR 28 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2005 VL 47 IS 3 BP 265 EP 276 DI 10.1016/j.specom.2005.02.007 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 979AW UT WOS:000232915500002 ER PT J AU Muto, M Kato, H Tsuzaki, M Sagisaka, Y AF Muto, M Kato, H Tsuzaki, M Sagisaka, Y TI Effect of speaking rate on the acceptability of change in segment duration SO SPEECH COMMUNICATION LA English DT Article DE speaking rate; speech perception; temporal perception; naturalness; acceptability ID TEMPORAL MODIFICATION; ISOLATED WORDS; SPEECH; DISCRIMINATION; SENSITIVITY; PERCEPTION; SEQUENCES; INTERVALS AB The acceptability of changes in segment duration at different speaking rates is studied to find useful perceptual characteristics for designing an objective naturalness measure in speech synthesis. Based on a series of previous studies on the intra-phrase positional dependency of perceptual acceptability, we investigate three factors: (1) speaking rate, (2) position within a phrase, and (3) presence/absence of a carrier sentence using three-mora (three-syllable) phrases at three rates (fast, normal and slow) with or without a carrier sentence (Experiment 1). Seven listeners evaluate the acceptability of resynthesized speech stimuli in which one of the vowel segments was either lengthened or shortened by up to 50 ms. 
Moreover, to understand the observed results within a psychophysical or auditory-based framework instead of language-dependent features, we simplify and replicate the temporal structures of the speech stimuli used and investigate the corresponding three factors (Experiment 2). Ten listeners rate the difference between standard and comparison stimuli in which one of the durations was either lengthened or shortened by up to 40 ms. The speech experiment shows that the acceptability for the same amount of absolute change decreased with an increase in speaking rate, i.e., the listeners more sensitively responded to the same absolute duration change when the speaking rate was fast than when it was slow. Similarly, the non-speech experiment shows that the detectability for the same amount of absolute change increased with an increase in tempo. In addition, the speech experiment shows the differences in acceptability declinations due to intra-phrase positions at three speaking rates. Similarly, the non-speech experiment shows the differences in the detectability due to temporal positions at three tempi. These agreements between the speech and non-speech experiments suggest that the two experiments share a common perceptual mechanism in processing temporal differences. On the other hand, the speech experiment shows no consistent tendency of the acceptability declinations due to the presence/absence of a carrier sentence, while the non-speech experiment shows, in several cases, that the presence of a carrier context could lower the detectability. (c) 2005 Elsevier B.V. All rights reserved. C1 Waseda Univ, Global Informat & Telecommun Studies, Shinjuku Ku, Tokyo 1690072, Japan. ATR Human Informat Sci Labs, Kyoto 6190288, Japan. Kyoto City Univ Arts, Nishikyo Ku, Kyoto 6101197, Japan. ATR Spoken Language Translat Res Labs, Kyoto 6190288, Japan. RP Muto, M (reprint author), Waseda Univ, Global Informat & Telecommun Studies, Shinjuku Ku, 3-14-9 Okuba, Tokyo 1690072, Japan. EM makiko.muto@ruri.waseda.jp; kato@atr.jp; minoru.tsuzaki@kcua.ac.jp; sagisaka@giti.waseda.ac.jp CR ABEL SM, 1972, J ACOUST SOC AM, V52, P519, DOI 10.1121/1.1913139 ABEL SM, 1972, J ACOUST SOC AM, V51, P1219, DOI 10.1121/1.1912963 BOCHNER JH, 1988, J ACOUST SOC AM, V84, P493, DOI 10.1121/1.396827 CARLSON R, 1979, FRONTIERS SPEECH COM, P233 CARLSON R, 1986, PHONETICA, V43, P140 DRAKE C, 1993, PERCEPT PSYCHOPHYS, V54, P277, DOI 10.3758/BF03205262 GREEN DM, 1966, SIGNAL DETECTION THE, P40 Hibi S., 1983, Journal of the Acoustical Society of Japan (E), V4 Higuchi N., 1993, Journal of the Acoustical Society of Japan (E), V14 HOSHINO M, 1983, S8275 AC SOC JAP T T, P539 KAKI N, 1992, SPEECH PERCEPTION PR, P391 Kato H, 2002, J ACOUST SOC AM, V111, P387, DOI 10.1121/1.1428543 Kato H, 1998, J ACOUST SOC AM, V104, P540, DOI 10.1121/1.423301 Kato H, 1997, J ACOUST SOC AM, V101, P2311, DOI 10.1121/1.418210 KATO H, 1992, P INT C SPOK LANG PR, P507 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986 Klatt D.
H., 1979, FRONTIERS SPEECH COM, P287 Klatt D.H., 1975, STRUCTURE PROCESS SP, P69 LEHISTE I, 1979, J PHONETICS, V7, P313 MACMILLAN NA, 1991, DETECTION THEORY USE, P58 MICHON JA, 1964, ACTA PSYCHOL, V22, P441, DOI 10.1016/0001-6918(64)90032-0 Muto M, 2005, SPEECH COMMUN, V45, P361, DOI 10.1016/j.specom.2004.11.004 Sagisaka Y., 1984, Transactions of the Institute of Electronics and Communication Engineers of Japan, Part A, VJ67A TAKEDA K, 1989, J ACOUST SOC AM, V86, P2081, DOI 10.1121/1.398467 Tanaka M., 1994, Journal of the Acoustical Society of Japan (E), V15 Tsujimura N., 1996, INTRO JAPANESE LINGU VANSANTEN JPH, 1994, COMPUT SPEECH LANG, V8, P95, DOI 10.1006/csla.1994.1005 NR 28 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2005 VL 47 IS 3 BP 277 EP 289 DI 10.1016/j.specom.2005.02.012 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 979AW UT WOS:000232915500003 ER PT J AU Chang, SY Wester, M Greenberg, S AF Chang, SY Wester, M Greenberg, S TI An elitist approach to automatic articulatory-acoustic feature classification for phonetic characterization of spoken language SO SPEECH COMMUNICATION LA English DT Article DE articulatory features; automatic phonetic classification; multi-lingual phonetic classification; speech analysis ID SPEECH RECOGNITION; PRONUNCIATION VARIATION; PRODUCTION MODELS AB A novel framework for automatic articulatory-acoustic feature extraction has been developed for enhancing the accuracy of place- and manner-of-articulation classification in spoken language. The "elitist" approach provides a principled means of selecting frames for which multi-layer perceptron neural-network classifiers are highly confident. Using this method it is possible to achieve a frame-level accuracy of 93% on "elitist" frames for manner classification on a corpus of American English sentences passed through a telephone network (NTIMIT). Place-of-articulation information is extracted for each manner class independently, resulting in an appreciable gain in place-feature classification relative to performance for a manner-independent system. A comparable enhancement in classification performance for the elitist approach is observed when the method is applied to a Dutch corpus of quasi-spontaneous telephone interactions (VIOS). The elitist framework provides a potential means of automatically annotating a corpus at the phonetic level without recourse to a word-level transcript and could thus be of utility for developing training materials for automatic speech recognition and speech synthesis applications, as well as aid the empirical study of spoken language. (c) 2005 Elsevier B.V. All rights reserved. C1 Int Comp Sci Inst, Berkeley, CA 94704 USA. RP Chang, SY (reprint author), 46 Oxford Dr, San Rafael, CA 94903 USA.
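The frame-selection idea above reduces to thresholding classifier confidence. A minimal sketch follows; the 0.9 threshold is an arbitrary illustrative operating point, not the paper's, and the function name is an assumption.

import numpy as np

def elitist_frames(posteriors, threshold=0.9):
    # posteriors: (frames, classes) per-frame MLP posterior estimates,
    # e.g. over manner-of-articulation classes.
    confidence = posteriors.max(axis=1)
    keep = confidence >= threshold          # retain only high-confidence frames
    return posteriors[keep].argmax(axis=1), keep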
EM shawn@tellme.com; mwester@inf.ed.ac.uk; steveng@savant-garde.net CR BERINGER N, 2000, P INT C SPOK LANG PR, V4, P728 Bourlard H.A., 1993, CONNECTIONIST SPEECH CHANG S, 2002, THESIS U CALIFORNIA CHANG S., 2001, P EUR, P1725 Chang S., 2000, P INT C SPOK LANG PR, VIV, P330 Chen M.Y., 2000, P 6 INT C SPOK LANG, V4, P636 Chomsky N., 1968, SOUND PATTERN ENGLIS Deng L, 1997, SPEECH COMMUN, V22, P93, DOI 10.1016/S0167-6393(97)00018-6 DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839 DENG L, 1998, P INT S CHIN SPOK LA, P22 ESPYWILSON CY, 1994, J ACOUST SOC AM, V96, P65, DOI 10.1121/1.410375 GREENBERG S, 2000, P NIST SPEECH TRANSC Greenberg S., 2000, P ISCA WORKSH AUT SP, P195 GREENBERG S, 2002, P 2 INT C HUM LANG T, P36, DOI 10.3115/1289189.1289251 GREENBERG S, 2000, P CREST WORKSH MOD S, P129 GREENBERG S, 2001, P ISCA WORKSH PROS S, P51 Greenberg S, 1999, SPEECH COMMUN, V29, P159, DOI 10.1016/S0167-6393(99)00050-3 GREENBERG S, 2003, P 8 EUR C SPEECH COM, P45 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HITCHCOCK L, 2001, P 7 EUR C SPEECH COM, P79 Howitt A.W., 2000, P INT C SPOK LANG PR, P628 Jakobson Roman, 1952, PRELIMINARIES SPEECH Jankowski C., 1990, P IEEE INT C AC SPEE, P109 Juneya A., 2002, P ICONIP, P726 Kessens JM, 1999, SPEECH COMMUN, V29, P193, DOI 10.1016/S0167-6393(99)00048-5 King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148 Kirchhoff K, 2002, SPEECH COMMUN, V37, P303, DOI 10.1016/S0167-6393(01)00020-6 Kirchhoff K., 1999, THESIS U BIELEFELD G Ladefoged Peter, 1993, COURSE PHONETICS LAMEL LF, 1990, TIMIT DARPA ACOUSTIC LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 Lindau M., 1985, PHONETIC LINGUISTICS, P157 McAllister D., 1998, P 5 INT C SPOK LANG, P1847 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 NIYOGI P, 1999, P INT C AC SPEECH SI, P425 OMAR MK, 2002, P IEEE INT C AC SPEE, P1916 OSTENDORF M, 2000, P IEEE AUT SPEECH RE, P72 Rose RC, 1996, J ACOUST SOC AM, V99, P1699, DOI 10.1121/1.414679 Schiel F, 1999, P 14 INT C PHON SCI, P607 STEINBISS V, 1993, P ESCA 3 EUR C SPEEC, P2125 STEVENS KN, 2000, P 6 INT C SPOK LANG, V1, pA1 Stevens K.N., 1998, ACOUSTIC PHONETICS Sun JP, 2002, J ACOUST SOC AM, V111, P1086, DOI 10.1121/1.1420380 van den Heuvel Henk, 1997, INT J SPEECH TECHNOL, V2, P119 VIEREGGE WH, 1993, P EUR 93 BERL, P267 Wester M., 2001, P EUR AALB DENM, P1729 WILLIAMS G, 1998, P INT C SPOK LANG PR, P88 NR 47 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2005 VL 47 IS 3 BP 290 EP 311 DI 10.1016/j.specom.2005.01.006 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 979AW UT WOS:000232915500004 ER PT J AU Manickam, K Moore, C Willard, T Slevin, N AF Manickam, K Moore, C Willard, T Slevin, N TI Quantifying aberrant phonation using approximate entropy in electrolaryngography SO SPEECH COMMUNICATION LA English DT Article DE larynx cancer; voicing; electroglottogram; approximate entropy; speech and language therapy AB Vocal fold vibration during vowel phonation can be used to characterise voice quality. This vibration can be measured using a laryngograph, which produces a waveform of highly correlated trans-larynx impedance variations, collectively termed the electroglottogram (EGG). 
Using approximate entropy (ApEn) in the EGG spectral domain, earlier work has been able to explain the meaning of "voice normality" and also to begin quantifying the impact that radiotherapy treatment has on the voicing of larynx cancer patients. In this paper, ApEn is used to quantify pathological voicing in radiotherapy patients using the EGG in the time domain. Since ApEn is a viable single figure of merit, it has the potential to make assessment of aberrant voicing both more concise and objective than the subjective analysis adopted by speech and language therapists (SALTs). (c) 2005 Elsevier B.V. All rights reserved. C1 HQ Christie Hosp NHS Trust, N Western Med Phys, Manchester M20 4BX, Lancs, England. Univ S Manchester Hosp, Manchester M20 2LR, Lancs, England. Christie Hosp NHS Trust, Clin Dept Radiat Oncol, Manchester M20 4BX, Lancs, England. RP Manickam, K (reprint author), HQ Christie Hosp NHS Trust, N Western Med Phys, Manchester M20 4BX, Lancs, England. EM kathiresan.manickam@physics.cr.man CR CHEVEIGNE AD, 2003, J PHONETICS, V31, P547 FOURCIN A, 2003, C ADV QUANT LAR VOIC FOURCIN A, 1986, J PHONETICS, V14, P435 John A, 2000, INT J LANG COMM DIS, V35, P287 JOHN A, 2002, CRAN SOC GREAT BRIT Moore C, 2004, MED ENG PHYS, V26, P291, DOI 10.1016/j.medengphy.2004.01.005 PINCUS S, 1995, CHAOS, V5, P110, DOI 10.1063/1.166092 PINCUS SM, 2001, FRONTIERS POPULATION PINCUS SM, 1988, P NATL ACAD SCI USA, P2297 PINCUS SM, 1999, HORMONE PULSATILITY, pE948 Titze IR, 1994, PRINCIPLES VOICE PRO NR 11 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2005 VL 47 IS 3 BP 312 EP 321 DI 10.1016/j.specom.2005.02.008 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 979AW UT WOS:000232915500005 ER PT J AU Rao, P Patwardhan, P AF Rao, P Patwardhan, P TI Frequency warped modeling of vowel spectra: Dependence on vowel quality SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT International Conference on Spoken Language Processing (ICSLP 2002) CY 2002 CL Denver, CO DE spectral envelope modeling; frequency warping; all-pole modeling; partial loudness ID LINEAR PREDICTION; THRESHOLDS AB The compact representation of harmonic amplitudes in the sinusoidal coding of speech is an important problem in low bit rate speech compression. A widely used method to achieve this is all-pole modeling of the spectral envelope. Often a perceptually warped frequency scale is applied in the all-pole modeling to improve perceived accuracy at low model orders. In this work, an attempt is made to obtain a suitable frequency scale warping function by an experimental study on synthetic and natural steady vowels. Subjective listening experiments indicate that the change in perceived quality brought about by frequency warping depends on the underlying signal spectrum or vowel quality. Objective distortion measures are computed to obtain insights into the subjective results. It is observed that an auditory distance measure based on partial loudness shows high correlation with the subjective test scores, indicating that frequency masking plays an important role in spectrum envelope modeling. (c) 2005 Elsevier B.V. All rights reserved. C1 Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India. RP Rao, P (reprint author), Indian Inst Technol, Dept Elect Engn, Bombay 400076, Maharashtra, India.
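For readers unfamiliar with the approximate entropy statistic used in the Manickam et al. record above, a minimal sketch of the standard ApEn(m, r) computation (after the Pincus references cited there) follows. Applying it to a scalar series derived from the EGG, and setting r as a fraction of the series' standard deviation, are common conventions rather than that paper's exact settings.

import numpy as np

def approximate_entropy(x, m=2, r_fraction=0.2):
    # x: 1-D series, e.g. a cycle-by-cycle EGG measure.
    x = np.asarray(x, dtype=float)
    r = r_fraction * x.std()

    def phi(m):
        n = len(x) - m + 1
        emb = np.array([x[i:i + m] for i in range(n)])   # m-length templates
        # Chebyshev distance between all template pairs
        d = np.max(np.abs(emb[:, None, :] - emb[None, :, :]), axis=2)
        c = (d <= r).mean(axis=1)                        # match fractions (self included)
        return np.mean(np.log(c))

    return phi(m) - phi(m + 1)                           # ApEn(m, r)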
EM prao@ee.iitb.ac.in; pushkar@ee.iitb.ac.in CR CHAMPION T, 1994, P IEEE INT C AC SPEE, P529 Childers D.G., 2000, SPEECH PROCESSING SY DAS A, 1995, P IEEE INT C AC SPEE, P492 GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651 Harma A, 2001, IEEE T SPEECH AUDI P, V9, P579, DOI 10.1109/89.928922 HERMANSKY H, 1985, DIGITAL PROCESS SIGN, P55 JAROUDI E, 1991, IEEE T SIGNAL PROCES, V39, P411 JELINEK M, 1999, P ICASSP99, P253 KOLJONEN J, 1984, P IEEE INT C AC SPEE, P191 KONDOZ A, 1991, DIGITAL SPEECH CODIN, pCH8 LUKASIAK J, 2000, P IEEE INT C AC SPEE, P1471 MACAULAY R, 1995, SPEECH CODING SYNTHE MAKHOUL J, 1975, IEEE T ACOUST SPEECH, VAS23, P283, DOI 10.1109/TASSP.1975.1162685 Miller J, 1989, STAT ADV LEVEL MOLYNEUX D, 1998, P INT C SPOK LANG PR, P946 MOLYNEUX DJ, 2000, P IEEE INT C AC SPEE, P1455 Moore BCJ, 1997, J AUDIO ENG SOC, V45, P224 MURTHI M, 1997, P IEEE INT C AC SPEE, P1687 *NAT I STAND, 1989, TIMIT CDROM OPPENHEI.A, 1971, PR INST ELECTR ELECT, V59, P299, DOI 10.1109/PROC.1971.8146 QUACKENBUSH SR, 1988, OBJECTIVE MEASURE SP Rao P, 2001, J ACOUST SOC AM, V109, P2085, DOI 10.1121/1.1354986 RIX W, 2001, P IEEE INT C AC SPEE, P749 Schroeder M.R., 1979, FRONTIERS SPEECH COM Smith JO, 1999, IEEE T SPEECH AUDI P, V7, P697, DOI 10.1109/89.799695 STEVENS K, 2004, ACOUSTIC PHONETICS STRUBE HW, 1980, J ACOUST SOC AM, V68, P1071, DOI 10.1121/1.384992 Varho S, 1998, SPEECH COMMUN, V24, P111, DOI 10.1016/S0167-6393(98)00003-X WEI B, 2000, P IEEE DIG SIGN PROC Zwicker E., 1974, FACTS MODELS HEARING NR 30 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2005 VL 47 IS 3 BP 322 EP 335 DI 10.1016/j.specom.2005.02.009 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 979AW UT WOS:000232915500006 ER PT J AU Cohen, I AF Cohen, I TI Speech enhancement using super-Gaussian speech models and noncausal a priori SNR estimation SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; noncausal estimation; spectral enhancement; super-Gaussian speech modeling ID SPECTRAL AMPLITUDE ESTIMATOR; NOISE; SYSTEMS AB A priori signal-to-noise ratio (SNR) estimation is of major consequence in speech enhancement applications. Recently, we introduced a noncausal recursive estimator for the a priori SNR based on a Gaussian speech model, and showed its advantage compared to using the decision-directed estimator. In particular, noncausal estimation facilitates a distinction between speech onsets and noise irregularities. In this paper, we extend our noncausal estimation approach to Gamma and Laplacian speech models. We show that the performance of noncausal estimation, when applied to the problem of speech enhancement, is better under a Laplacian model than under Gaussian or Gamma models. Furthermore, the choice of the specific speech model has a smaller effect on the enhanced speech signal when using the noncausal a priori SNR estimator than when using the decision-directed method. (c) 2005 Elsevier B.V. All rights reserved. C1 Technion Israel Inst Technol, Dept Elect Engn, IL-32000 Haifa, Israel. RP Cohen, I (reprint author), Technion Israel Inst Technol, Dept Elect Engn, IL-32000 Haifa, Israel. 
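For reference, the decision-directed a priori SNR estimator that the Cohen record above takes as its point of comparison has the well-known form (Ephraim and Malah, 1984, cited in that record's reference list), where \hat{\xi} is the a priori SNR, \gamma the a posteriori SNR, \hat{A} the previous spectral amplitude estimate, \sigma_d^2 the noise variance, and \alpha a smoothing weight close to 1:

\hat{\xi}(k,l) = \alpha\,\frac{\hat{A}^2(k,l-1)}{\sigma_d^2(k,l-1)}
+ (1-\alpha)\,\max\{\gamma(k,l)-1,\;0\}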
EM icohen@ee.technion.ac.il CR ACCARDI AJ, 1999, P 24 IEEE INT C AC S, P201 BREITHAUPT C, 2003, P 28 IEEE INT C AC S, P896 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 Cohen I, 2001, SIGNAL PROCESS, V81, P2403, DOI 10.1016/S0165-1684(01)00128-1 Cohen I, 2004, IEEE SIGNAL PROC LET, V11, P725, DOI [10.1109/LSP.2004.833478, 10.1109/LSP.2004.833278] COHEN I, 2004, P 29 IEEE INT C AC S, P293 Cohen I, 2003, IEEE T SPEECH AUDI P, V11, P466, DOI 10.1109/TSA.2003.811544 COHEN I, IN PRESS IEEE T SPEE DAVENPORT JWB, 1970, PROBABILITY RANDOM P Deller J., 2000, DISCRETE TIME PROCES EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Garofolo J., 1988, GETTING STARTED DARP Gradshteyn I. S., 1980, TABLE INTEGRALS SERI LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 LOTTER T, 2003, P 28 IEEE INT C AC S, P832 Lotter T., 2003, P INT WORKSH AC ECH, P83 MALAH D, 1999, P IEEE INT C AC SPEE, P789 MARTIN R, 2002, P IEEE INT C AC SPEE, V1, P253 Martin R., 2003, P INT WORKSH AC ECH, P87 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 Papamichalis P.E., 1987, PRACTICAL APPROACHES PORTER J, 1984, P IEEE INT C AC SPEE, P4 Quackenbush S. R., 1988, OBJECTIVE MEASURES S SCALART P, 1996, P IEEE INT C AC SPEE, P629 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Wolfe P. J., 2003, EURASIP J APPL SIG P, P1043 NR 30 TC 13 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2005 VL 47 IS 3 BP 336 EP 350 DI 10.1016/j.specom.2005.02.011 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 979AW UT WOS:000232915500007 ER PT J AU Dromey, C Silveira, J Sandor, P AF Dromey, C Silveira, J Sandor, P TI Recognition of affective prosody by speakers of English as a first or foreign language SO SPEECH COMMUNICATION LA English DT Article DE prosody; emotion; second language; perception ID EMOTIONAL PROSODY; VOCAL COMMUNICATION; SEX-DIFFERENCES; PERCEPTION; SPEECH; EXPRESSION; BRAIN; COMPREHENSION; ABILITY; ADULTS AB Adults who were fluent in English, and who grew up speaking English or one of 21 other languages, listened to words spoken with angry or neutral intonation. We measured the accuracy with which the listeners identified the intended emotion. English mother tongue (EMT) polyglots scored higher than other mother tongue (OMT) listeners, whereas EMT monoglots did not. Women were significantly more accurate than men across the three listener groups. There was a modest inverse correlation between accuracy and age. The learning of a second language may have helped the EMT polyglots develop additional perceptual skills in decoding speech emotion in their native language. (c) 2005 Elsevier B.V. All rights reserved. C1 Brigham Young Univ, Dept Audiol & Speech Language Pathol, Provo, UT 84602 USA. Toronto Western Hosp, Dept Psychiat, Toronto, ON M5T 2S8, Canada. Univ Toronto, Dept Psychiat, Toronto, ON M5T 1R8, Canada. RP Dromey, C (reprint author), Brigham Young Univ, Dept Audiol & Speech Language Pathol, 133 Taylor Bldg, Provo, UT 84602 USA.
EM dromey@byu.edu CR ALBAS DC, 1976, J CROSS CULT PSYCHOL, V7, P481, DOI 10.1177/002202217674009 Bachorowski JA, 1999, CURR DIR PSYCHOL SCI, V8, P53, DOI 10.1111/1467-8721.00013 Barrett AM, 1999, NEUROPSY NEUROPSY BE, V12, P117 BEIER EG, 1972, J CONSULT CLIN PSYCH, V39, P166, DOI 10.1037/h0033170 Bonebright TL, 1996, SEX ROLES, V34, P429, DOI 10.1007/BF01547811 Bostanov V, 2004, PSYCHOPHYSIOLOGY, V41, P259, DOI 10.1111/j.1469-8986.2003.00142.x BROSGOLE L, 1995, INT J NEUROSCI, V82, P169 Buchanan TW, 2000, COGNITIVE BRAIN RES, V9, P227, DOI 10.1016/S0926-6410(99)00060-9 Crucian GP, 1998, BRAIN COGNITION, V36, P377, DOI 10.1006/brcg.1998.0999 CUMMINGS KE, 1995, J ACOUST SOC AM, V98, P88, DOI 10.1121/1.413664 DEUTSCH D, 1991, MUSIC PERCEPT, V8, P335 Emerson CS, 1999, NEUROPSY NEUROPSY BE, V12, P102 FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412 GITTER AG, 1972, J SOC PSYCHOL, V88, P213 Greasley P, 2000, LANG SPEECH, V43, P355 Grunwald I S, 1999, Appl Neuropsychol, V6, P226, DOI 10.1207/s15324826an0604_5 Heilman KM, 1998, J CLIN NEUROPHYSIOL, V15, P409, DOI 10.1097/00004691-199809000-00005 HERRERO JV, 1990, PERCEPT MOTOR SKILL, V71, P479 JOHNSON WF, 1986, ARCH GEN PSYCHIAT, V43, P280 KIROUAC G, 1985, J NONVERBAL BEHAV, V9, P3, DOI 10.1007/BF00987555 MCCLUSKEY KW, 1981, INT J PSYCHOL, V16, P119, DOI 10.1080/00207598108247409 McNeely HE, 2001, BRAIN LANG, V79, P473, DOI 10.1006/brln.2001.2502 MEHLER J, 1994, PHILOS T ROY SOC B, V346, P13, DOI 10.1098/rstb.1994.0123 MEHLER J, 1994, CURR OPIN NEUROBIOL, V4, P171, DOI 10.1016/0959-4388(94)90068-X MUENTE TF, 2001, NATURE, V409, P580 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 Pantev C, 2001, ANN NY ACAD SCI, V930, P300 Pell MD, 1997, BRAIN LANG, V57, P80, DOI 10.1006/brln.1997.1638 Pell MD, 2001, J ACOUST SOC AM, V109, P1668, DOI 10.1121/1.1352088 Ross ED, 1997, BRAIN LANG, V56, P27, DOI 10.1006/brln.1997.1731 ROTTER NG, 1988, J NONVERBAL BEHAV, V12, P139, DOI 10.1007/BF00986931 SCHERER KR, 1995, J VOICE, V9, P235, DOI 10.1016/S0892-1997(05)80231-0 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 STARKSTEIN SE, 1994, NEUROLOGY, V44, P515 VANBEZOOIJEN R, 1983, J CROSS CULT PSYCHOL, V14, P387, DOI 10.1177/0022002183014004001 NR 35 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2005 VL 47 IS 3 BP 351 EP 359 DI 10.1016/j.specom.2004.09.010 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 979AW UT WOS:000232915500008 ER PT J AU Hazan, V Sennema, A Iba, M Faulkner, A AF Hazan, V Sennema, A Iba, M Faulkner, A TI Effect of audiovisual perceptual training on the perception and production of consonants by Japanese learners of English SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 8th International Conference for Spoken Language Processing CY OCT 05-09, 2004 CL Cheju Isl, SOUTH KOREA DE audiovisual perception; second language acquisition; perceptual training ID R-VERTICAL-BAR; SPEECH-PERCEPTION; 2ND-LANGUAGE SPEECH; NONSENSE SYLLABLES; AMERICAN LISTENERS; TALKER VARIABILITY; VISUAL CUES; ACQUISITION; HEARING; CONTEXT AB This study investigates whether L2 learners can be trained to make better use of phonetic information from visual cues in their perception of a novel phonemic contrast.
It also evaluates the impact of audiovisual perceptual training on the learners' pronunciation of a novel contrast. The use of visual cues for speech perception was evaluated for two English phonemic contrasts: the /v/-/b/-/p/ labial/labiodental contrast and the /l/-/r/ contrast. In the first study, 39 Japanese learners of English were tested on their perception of the /v/-/b/-/p/ distinction in audio, visual and audiovisual modalities, and then undertook ten sessions of either auditory ('A training') or audiovisual ('AV training') perceptual training before being tested again. AV training was more effective than A training in improving the perception of the labial/labiodental contrast. In a second study, 62 Japanese learners of English were tested on their perception of the /l/-/r/ contrast in audio, visual and audiovisual modalities, and then undertook ten sessions of perceptual training with either auditory stimuli ('A training'), natural audiovisual stimuli ('AV Natural training') or audiovisual stimuli with a synthetic face synchronized to natural speech ('AV Synthetic training'). Perception of the /l/-/r/ contrast improved in all groups, but learners trained audiovisually did not improve more than those trained auditorily. Auditory perception improved most for 'A training' learners and performance in the lipreading alone condition improved most for 'natural AV training' learners. The learners' pronunciation of /l/-/r/ improved significantly following perceptual training, and a greater improvement was obtained for the 'AV Natural training' group. This study shows that sensitivity to visual cues for non-native phonemic contrasts can be enhanced via audiovisual perceptual training. AV training is more effective than A training when the visual cues to the phonemic contrast are sufficiently salient. Seeing the facial gestures of the talker also leads to a greater improvement in pronunciation, even for contrasts with relatively low visual salience. (c) 2005 Elsevier B.V. All rights reserved. C1 UCL, Dept Phonet & Linguist, London WC1E 6BT, England. Konan Univ, Inst Language & Culture, Higashinada Ku, Kobe, Hyogo 658, Japan. RP Hazan, V (reprint author), UCL, Dept Phonet & Linguist, Gower St, London WC1E 6BT, England. EM val@phon.ucl.ac.uk; sennema@rz.uni-potsdam.de; midori@center.konan-u.ac.jp; a.Faulkner@phon.ucl.ac.uk RI Faulkner, Andrew/A-8212-2008; Hazan, Valerie/C-9722-2009 OI Faulkner, Andrew/0000-0002-2969-5630; Hazan, Valerie/0000-0001-6572-6679 CR Aoyama K, 2004, J PHONETICS, V32, P233, DOI 10.1016/S0095-4470(03)00036-6 Best C. T., 1995, SPEECH PERCEPTION LI, P171 Best CT, 2001, J ACOUST SOC AM, V109, P775, DOI 10.1121/1.1332378 Bradlow AR, 1997, J ACOUST SOC AM, V101, P2299, DOI 10.1121/1.418276 Cole R., 1999, P ESCA SOCRATES WORK, P45 Demorest ME, 1996, J SPEECH HEAR RES, V39, P697 Flege J.
E., 1995, SPEECH PERCEPTION LI, P229 Flege JE, 1999, SEC LANG ACQ RES, P101 Flege J.E., 1998, P STILL ESCA WORKSH, P1 FOWLER CA, 1986, J PHONETICS, V14, P3 Gagne JP, 2002, SPEECH COMMUN, V37, P213, DOI 10.1016/S0167-6393(01)00012-7 Grant KW, 1998, J ACOUST SOC AM, V104, P2438, DOI 10.1121/1.423751 Guion SG, 2000, J ACOUST SOC AM, V107, P2711, DOI 10.1121/1.428657 Hardison DM, 1999, LANG LEARN, V49, P213, DOI 10.1111/0023-8333.49.s1.7 Hardison DM, 2003, APPL PSYCHOLINGUIST, V24, P495, DOI 10.1017/S0142716403000250 Hazan V, 2000, J PHONETICS, V28, P377, DOI 10.1006/jpho.2000.0121 Hazan V, 2000, LANG SPEECH, V43, P273 Ingram JCL, 1998, J ACOUST SOC AM, V103, P1161, DOI 10.1121/1.421225 Iverson P, 2003, COGNITION, V87, pB47, DOI 10.1016/S0010-0277(02)00198-1 KUHL PK, 1982, SCIENCE, V218, P1138, DOI 10.1126/science.7146899 KUHL PK, 1993, J PHONETICS, V21, P125 LAMBACHER S, 2002, P 7 INT C SPOK LANG, P245 Lenneberg E., 1967, BIOL FDN LANGUAGE LIVELY SE, 1993, J ACOUST SOC AM, V94, P1242, DOI 10.1121/1.408177 LIVELY SE, 1994, J ACOUST SOC AM, V96, P2076, DOI 10.1121/1.410149 LOGAN JS, 1991, J ACOUST SOC AM, V89, P874, DOI 10.1121/1.1894649 Lotto A. J., 2004, SOUND SENSE 50 YEARS, pC381 MARASSA LK, 1995, J SPEECH HEAR RES, V38, P1387 Massaro D. W., 2003, P EUR C SPEECH COMM, P2249 Massaro D. W., 1998, PERCEIVING TALKING F Massaro DW, 2004, J SPEECH LANG HEAR R, V47, P304, DOI 10.1044/1092-4388(2004/025) MASSARO DW, 2000, P INSTIL DUND SCOTL, P153 MASSARO DW, 1986, J EXP CHILD PSYCHOL, V41, P93, DOI 10.1016/0022-0965(86)90053-6 Mayo C, 2004, J ACOUST SOC AM, V115, P3184, DOI 10.1121/1.1738838 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 Nittrouer S, 1997, J ACOUST SOC AM, V101, P2253, DOI 10.1121/1.418207 Ortega-Llebaria M., 2001, P INT C AUD VIS SPEE, P149 OUNI S, 2003, P ICPHS BARC SPAIN A, P2569 SEKIYAMA K, 1993, J PHONETICS, V21, P427 Sekiyama K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607896 Sekiyama K, 1997, PERCEPT PSYCHOPHYS, V59, P73, DOI 10.3758/BF03206849 SEKIYAMA K, 2003, P 2003 AUD VIS SPEEC, P43 SENNEMA A, 2003, P 15 ICPHS BARC SPAI, P135 SICILIANO C, 2003, P AUD VIS SPEECH PRO, P205 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Summerfield Q., 1983, HEARING SCI HEARING, P131 TAKATA Y, 1990, J ACOUST SOC AM, V88, P663, DOI 10.1121/1.399769 Tyler MD, 2001, LANG LEARN, V51, P257, DOI 10.1111/1467-9922.00155 Wang Y, 1999, J ACOUST SOC AM, V106, P3649, DOI 10.1121/1.428217 Wang Y, 2003, J ACOUST SOC AM, V113, P1033, DOI 10.1121/1.1531176 Werker J., 1984, INFANT BEHAV DEV, V7, P47 Yamada R. A., 1995, SPEECH PERCEPTION LI, P305 NR 53 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV PY 2005 VL 47 IS 3 BP 360 EP 378 DI 10.1016/j.specom.2005.04.007 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 979AW UT WOS:000232915500009 ER PT J AU Zgank, A Horvat, B Kacic, Z AF Zgank, A Horvat, B Kacic, Z TI Data-driven generation of phonetic broad classes, based on phoneme confusion matrix similarity SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; acoustic modeling; data-driven phonetic broad classes; phoneme confusion matrix; decision tree clustering AB This paper addresses the topic of defining the phonetic broad classes needed during acoustic modeling for speech recognition within the procedure of decision-tree-based clustering. The usual approach is to use phonetic broad classes which are defined by an expert. This method has some disadvantages, especially in the case of multilingual speech recognition. A new data-driven method is proposed for the generation of phonetic broad classes based on a phoneme confusion matrix. The similarity measure is defined using the number of confusions between the master phoneme and all other phonemes included in the set. This proposed method is compared to the standard approach based on expert knowledge and to the randomly generated broad classes approach. The proposed data-driven method is implicitly evaluated within a speech recognition experiment. The intention of the first evaluation stage is to test the generated acoustic models in a monolingual environment (Slovenian), to show that the proposed method is free of multilingual influence. In the second evaluation stage, the generated acoustic models are tested in a multilingual environment (Slovenian, German and Spanish). All experiments were based on SpeechDat(II) speech databases. The proposed data-driven method for the generation of phonetic broad classes, based on the phoneme confusion matrix, improved speech recognition results when compared to the method based on expert knowledge. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Maribor, Fac Elect Engn & Comp Sci, SI-2000 Maribor, Slovenia. RP Zgank, A (reprint author), Univ Maribor, Fac Elect Engn & Comp Sci, Smetanova 17, SI-2000 Maribor, Slovenia.
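One way to realise the confusion-based grouping described above is to row-normalise the confusion matrix, turn symmetrised confusabilities into distances, and cluster. The symmetrisation and the agglomerative clustering below are simplifications of the paper's master-phoneme similarity measure, and the class count is illustrative.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def broad_classes_from_confusions(confusion, n_classes=6):
    # confusion: (n_phones, n_phones) counts; row = reference phoneme.
    p = confusion / confusion.sum(axis=1, keepdims=True)
    sim = 0.5 * (p + p.T)                        # symmetrised confusability
    dist = 1.0 - sim / sim.max()                 # more confusable = closer
    np.fill_diagonal(dist, 0.0)
    z = linkage(squareform(dist, checks=False), method="average")
    return fcluster(z, t=n_classes, criterion="maxclust")  # class label per phoneme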
EM andrej.zgank@uni-mb.si RI Zgank, Andrej/A-5711-2008 CR Baggia P, 2000, SPEECH COMMUN, V31, P355, DOI 10.1016/S0167-6393(99)00068-0 BEULEN K, 1999, THESIS RWTH AACHEN Beulen K, 1998, INT CONF ACOUST SPEE, P805, DOI 10.1109/ICASSP.1998.675387 BEYERLEIN P, 1999, P ASRU 1999 KEYST CHELBA C, 2002, P ICSLP 2002 DENV CONSTANTINESCU A, 1997, P ASRU 1997 FISCHER V, 2001, P ASRU 2001 MAD DI C HOGE H, 1997, P ICASSP 97 MUN Imperl B, 2003, SPEECH COMMUN, V39, P353, DOI 10.1016/S0167-6393(02)00048-1 IMPERL B, 2000, P ICASSP 2000 IST Jelinek F., 1997, STAT METHODS SPEECH Johnston D, 1997, SPEECH COMMUN, V23, P5, DOI 10.1016/S0167-6393(97)00050-2 KAISER J, 1998, P SPEECH DAT DEV CEN KOHLER J, 1996, P ICLSP 1996 Ladefoged Peter, 1993, COURSE PHONETICS LINDBERG B, 2000, P ICSLP 2000 BEIJ Van Den Heuvel H., 2001, International Journal of Speech Technology, V4, DOI 10.1023/A:1011375311203 SCHULTZ T, 2000, THESIS KARLSRUHE SINGH R, 1999, P INT C SPOK LANG PR, V1, P117 WOODLAND PC, 1994, P ICASSP 1994 Young S., 1994, P ARPA HUM LANG TECH ZGANK A, 2001, P MULT SIGN LANG PRO Zgank A., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1023414119770 ZGANK A, 2004, P LREC 2004 LISB POR Zgank A, 2001, P EUR 2001 AALB DENM, P2725 NR 25 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2005 VL 47 IS 3 BP 379 EP 393 DI 10.1016/j.specom.2005.03.011 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 979AW UT WOS:000232915500010 ER PT J AU Kochanski, G Shih, C AF Kochanski, G Shih, C TI "Quantitative measurement of prosodic strength in Mandarin (vol 41, pg 625, 2003)" SO SPEECH COMMUNICATION LA English DT Correction C1 Univ Oxford, Phonet Lab, Oxford OX1 2JF, England. RP Kochanski, G (reprint author), Univ Oxford, Phonet Lab, 41 Wellington Sq, Oxford OX1 2JF, England. EM gpk@kochanski.org CR Kochanski G, 2003, SPEECH COMMUN, V41, P625, DOI 10.1016/S0167-6393(03)00100-6 NR 1 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2005 VL 47 IS 3 BP 394 EP 394 DI 10.1016/j.specom.2005.04.004 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 979AW UT WOS:000232915500011 ER PT J AU Ode, C van Son, R AF Ode, C van Son, R TI Note from the guest editors SO SPEECH COMMUNICATION LA English DT Editorial Material NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 1 EP 2 DI 10.1016/j.specom.2005.05.008 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500001 ER PT J AU Mariani, J Moore, RK Furui, S AF Mariani, J Moore, RK Furui, S TI Speech communication: Louis Pols special issue - Preface SO SPEECH COMMUNICATION LA English DT Editorial Material C1 CNRS, LIMSI, F-91405 Orsay, France. French Minist Res, Informat & Commun Technol Dept, Paris, France. RP Mariani, J (reprint author), CNRS, LIMSI, F-91405 Orsay, France. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 3 EP 6 DI 10.1016/j.specom.2005.04.009 PG 4 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500002 ER PT J AU Bondarko, LV AF Bondarko, LV TI Phonetic and phonological aspects of the opposition of 'soft' and 'hard' consonants in the modern Russian language SO SPEECH COMMUNICATION LA English DT Article DE primary and secondary articulation; phonetic features of softness in Russian; relations between phonetic and phonological characteristics AB The present article deals with issues arising during the articulatory, acoustic and perceptual description of the opposition of 'soft' and 'hard' consonants in modern Russian. Its phonological interpretation is also considered, as well as the main tendencies in the development of the pronunciation standard. (C) 2005 Elsevier B.V. All rights reserved. C1 St Petersburg State Univ, Dept Phonet, St Petersburg 199034, Russia. RP Bondarko, LV (reprint author), St Petersburg State Univ, Dept Phonet, Univ Skaya Nab 11, St Petersburg 199034, Russia. EM lvbon@lb1082.spb.edu CR AVANESOV RI, 1972, RUSSKOYE LIT PROIZNO BONDARKO LV, 1977, SLUKH RECH NORME PAT, P3 DIEHM EE, 1998, GESTURES LINGUISTIC KASATKIN LL, 1993, PROBLEMY FONETIKI, P161 LAVER J, 1994, PRINCIPLES PHONETICS, P333 REFORMATSKIY AA, 1970, IZ ISTORII RUSSKOY F, P494 SKALOZUB LG, 1981, TEORIYA YAZYKA METOD, P240 NR 7 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 7 EP 14 DI 10.1016/j.specom.2005.03.012 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500003 ER PT J AU van Bezooijen, R AF van Bezooijen, R TI Approximant /r/ in Dutch: Routes and feelings SO SPEECH COMMUNICATION LA English DT Article DE phoneme /r/; Dutch; sociophonetics; language variation; language attitude AB At present, three variants of /r/ co-occur in northern Standard Dutch, i.e. the variant of Standard Dutch as spoken in the Netherlands. In addition to the older alveolar and uvular consonantal types of /r/, there is now an approximant type of /r/, which is restricted to the syllable coda. This approximant /r/ has been around at least since the beginning of the 20th century, but it seems that it recently started to expand. In this article, two sociophonetic studies of this ongoing change are described. In the first study, three hypotheses based on statements in the literature are tested: approximant /r/ is spreading from the west to other parts of the Netherlands, it is used more often by women than by men, and it is used more often by children than by adults. All three hypotheses were confirmed. In the second study, we studied how approximant /r/ is received: Do people find it attractive? Is it associated with particular personality characteristics? Where do people think it is spoken? The matched-guise approach was used, in which one speaker read the same text with different /r/-variants. It appeared that listeners from the west find approximant /r/ more attractive than listeners from other regions. Its use, at least when it is not perceptually salient, does not affect listeners' impressions of how likeable the speaker is; it is associated, however, with a high social position and with people living in the western part of the country. (C) 2005 Elsevier B.V. All rights reserved.
C1 Univ Nijmegen, Inst Linguist, NL-6500 HD Nijmegen, Netherlands. RP van Bezooijen, R (reprint author), Oostersingel 98C, NL-8921 GB Leeuwarden, Netherlands. EM r.v.bezooijen@let.ru.nl CR BLOOMFIELD LEONARD, 1933, LANGUAGE Booij Geert, 1995, PHONOLOGY DUTCH Chambers Jack K., 1998, DIALECTOLOGY, V2nd DESCHUTTER G, 1994, TAAL TONGVAL, P73 FOULKES P, 2001, ETUDES TRAVAUX U LIB, V4, P27 Giles H., 1975, SPEECH STYLE SOCIAL GUSSENHOVEN C., 1976, PRONUNCIATION ENGLIS KLOEKE GG, 1927, HOLLANDSE EXPANSIE Z KLOEKE GG, 1938, TIJDSCHRIFT NEDERLAN, V57, P15 Labov W., 1994, PRINCIPLES LINGUISTI Labov William, 1972, SOCIOLINGUISTIC PATT Labov William, 2001, PRINCIPLES LINGUISTI, VII Ladefoged P., 1996, SOUNDS WORLDS LANGUA LAMBERT WE, 1960, J ABNORM SOC PSYCH, V60, P44, DOI 10.1037/h0044430 Lindau M., 1985, PHONETIC LINGUISTICS, P157 LLAMAS C, 2001, ETUDES TRAVAUX U LIB, V4, P123 MEERTENS PJ, 1938, LOGOPAEDIE PHONIATRI, V10, P53 Mees I, 1982, J INT PHON ASSOC, V12, P2 Plug L, 2003, PHONETICA, V60, P159, DOI 10.1159/000073501 Sankoff G., 2001, ETUDES TRAVAUX, V4, P141 SEBREGTS K, 2003, SOCIOGEOGRAFISCHE LI, P375 Stroop Jan, 1998, POLDERNEDERLANDS WAA TAELDEMAN J, 1985, KLANKSTRUCTUUR GENTS Torp Arne, 2001, ETUDES TRAVAUX, P75 Van Bezooijen Renee, 2003, WAAR GAAT NEDERLANDS, P204 VANBEZOOIJEN R, 2004, TAAL TONGVAL, V17, P86 VANBEZOOIJEN R, 2003, ART VIERD SOC C EB D, P80 VANBEZOOIJEN R, 2002, AVT PUBLICATIONS, V19, P1 VANDENTOORN MC, 1992, TWEEDE WERELDOORLOG VANDEVELDE H, 1996, THESIS NIJMEGEN U VANREENEN P, 1994, TAAL TONGVAL, P54 Verstraeten B., 2001, ETUDES TRAVAUX, V4, P45 VIEREGGE WH, 1993, P EUR 93 BERL, P267 WORTEL D, 2002, LEIDE DIALECT LEIDEN NR 34 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 15 EP 31 DI 10.1016/j.specom.2005.04.010 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500004 ER PT J AU Cutler, A Smits, R Cooper, N AF Cutler, A Smits, R Cooper, N TI Vowel perception: Effects of non-native language vs. non-native dialect SO SPEECH COMMUNICATION LA English DT Article DE speech perception; vowels; perceptual confusion; non-native language; dialect ID ENGLISH; SPEECH; CONFUSIONS; CONSONANTS AB Three groups of listeners identified the vowel in CV and VC syllables produced by an American English talker. The listeners were (a) native speakers of American English, (b) native speakers of Australian English (different dialect), and (c) native speakers of Dutch (different language). The syllables were embedded in multispeaker babble at three signal-to-noise ratios (0 dB, 8 dB, and 16 dB). The identification performance of native listeners was significantly better than that of listeners with another language but did not significantly differ from the performance of listeners with another dialect. Dialect differences did however affect the type of perceptual confusions which listeners made; in particular, the Australian listeners' judgements of vowel tenseness were more variable than the American listeners' judgements, which may be ascribed to cross-dialectal differences in this vocalic feature. 
Although listening difficulty can result when speech input mismatches the native dialect in terms of the precise cues for and boundaries of phonetic categories, the difficulty is very much less than that which arises when speech input mismatches the native language in terms of the repertoire of phonemic categories available. (C) 2005 Elsevier B.V. All rights reserved. C1 Max Planck Inst Psycholinguist, NL-6500 AH Nijmegen, Netherlands. RP Cutler, A (reprint author), Max Planck Inst Psycholinguist, POB 310, NL-6500 AH Nijmegen, Netherlands. EM anne.cutler@mpi.nl RI Cutler, Anne/C-9467-2012 CR Cutler A, 2004, J ACOUST SOC AM, V116, P3668, DOI 10.1121/1.1810292 Cutler A, 1996, PERCEPT PSYCHOPHYS, V58, P807, DOI 10.3758/BF03205485 Cutler A, 2000, MEM COGNITION, V28, P746, DOI 10.3758/BF03198409 FLETCHER J, 1994, P 5 AUSTR INT C SPEE, V2, P656 Gussenhoven C., 1999, HDB INT PHONETIC ASS, P74 HARRINGTON J, 1994, LANG SPEECH, V37, P357 HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 KEATING PA, 1984, PHONETICA, V41, P191 Labov W., 1991, LANG VAR CHANGE, V3, P33, DOI 10.1017/S0954394500000442 Ladefoged P., 1999, HDB INT PHONETIC ASS, p[41, 41] Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Norris D, 2003, COGNITIVE PSYCHOL, V47, P204, DOI 10.1016/S0010-0285(03)00006-9 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 SCHOUTEN MEH, 1979, J PHONETICS, V7, P1 Strange W., 1995, SPEECH PERCEPTION LI VANBEINUM FJK, 1980, THESIS U AMSTERDAM van Son RJJH, 1999, SPEECH COMMUN, V29, P1, DOI 10.1016/S0167-6393(99)00024-2 WARREN P, 2003, P 15 INT C PHON SCI Wells John, 1982, ACCENTS ENGLISH NR 20 TC 23 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 32 EP 42 DI 10.1016/j.specom.2005.02.001 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500005 ER PT J AU Nooteboom, SG AF Nooteboom, SG TI Lexical bias revisited: Detecting, rejecting and repairing speech errors in inner speech SO SPEECH COMMUNICATION LA English DT Article DE production; speech errors; self-monitoring; lexical bias ID SENTENCE PRODUCTION; ELICITED SLIPS; TONGUE AB This paper confirms and exploits the observation that early overt self-interruptions and repairs of phonological speech errors very likely are reactions to inner speech, not to overt speech. In an experiment eliciting word-word and nonword-nonword phonological spoonerisms, it is found that self-interruptions and repairs come in two classes: one of reactions to inner speech, the other of reactions to overt speech. It is also found that in inner speech nonword-nonword spoonerisms are more often rejected than word-word spoonerisms. This is mirrored in the set of completed spoonerisms, where word-word spoonerisms are more frequent than nonword-nonword ones. This finding supports a classical but controversial explanation of the well-known lexical bias effect from nonwords being rejected more frequently than real words in inner speech. This explanation is further supported by an increasing number of overt rejections of nonword-nonword spoonerisms with phonetic distance between error and target, and increasing lexical bias with phonetic distance.
It is concluded that the most likely cause of lexical bias in phonological speech errors is that nonword errors are more often detected, rejected, and repaired than real-word errors in self-monitoring of inner speech. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Utrecht, UiL OTS, NL-3512 JK Utrecht, Netherlands. RP Nooteboom, SG (reprint author), Univ Utrecht, UiL OTS, Trans 10, NL-3512 JK Utrecht, Netherlands. EM sieb.nooteboom@let.uu.nl CR BAARS BJ, 1974, CATALOG SELECTED DOC BAARS BJ, 1975, J VERB LEARN VERB BE, V14, P382, DOI 10.1016/S0022-5371(75)80017-X Baars B.J., 1980, ERRORS LINGUISTIC PE, P307 BLACKMER ER, 1991, COGNITION, V39, P173, DOI 10.1016/0010-0277(91)90052-6 DELL GS, 1981, J VERB LEARN VERB BE, V20, P611, DOI 10.1016/S0022-5371(81)90202-4 DELL GS, 1990, LANG COGNITIVE PROC, V5, P313, DOI 10.1080/01690969008407066 DELL GS, 1986, PSYCHOL REV, V93, P283, DOI 10.1037//0033-295X.93.3.283 Dell G. S., 1980, ERRORS LINGUISTIC PE, P273 Fromkin V. A., 1973, SPEECH ERRORS LINGUI, P215 DELVISO S, 1991, J PSYCHOLINGUIST RES, V20, P161 Garrett M. F., 1976, NEW APPROACHES LANGU, P231 Hartsuiker R. J., 2005, PHONOLOGICAL ENCODIN, P187 Hartsuiker RJ, 2005, J MEM LANG, V52, P58, DOI 10.1016/j.jml.2004.07.006 Humphreys K. R., 2002, THESIS U ILLINOIS UR KOLK H, 1995, BRAIN LANG, V50, P282, DOI 10.1006/brln.1995.1049 Levelt W. J., 1989, SPEAKING INTENTION A Levelt WJM, 1999, BEHAV BRAIN SCI, V22, P1 LEVELT WJM, 1983, COGNITION, V14, P41, DOI 10.1016/0010-0277(83)90026-4 Liss JM, 1998, BRAIN LANG, V62, P342, DOI 10.1006/brln.1997.1907 MacKay D. G., 1970, SPEECH ERRORS LINGUI, P164 MACKAY DG, 1992, AUDITORY IMAGERY, P274 MOTLEY MT, 1980, ERRORS LINGUISTIC PE, P133 MOTLEY MT, 1979, J SPEECH HEAR RES, V22, P421 MOTLEY MT, 1982, J VERB LEARN VERB BE, V21, P578, DOI 10.1016/S0022-5371(82)90791-5 NICKELS L, 1995, CORTEX, V31, P209 Nooteboom S. G., 2005, PHONOLOGICAL ENCODIN, P167 Nooteboom S. G., 1980, ERRORS LINGUISTIC PE, P87 NOOTEBOOM SG, 1969, NOMEN LEYDEN STUDIES NOOTEBOOM SG, 2003, GOTHENBURG PAPERS TH, V89, P25 Oomen C. E., 2005, PHONOLOGICAL ENCODIN, P157 POLS LCW, 2004, LOT NETHERLANDS GRAD, P141 POSTMA A, 1992, J SPEECH HEAR RES, V35, P1024 Postma A, 2000, COGNITION, V77, P97, DOI 10.1016/S0010-0277(00)00090-1 Roelofs A., 2005, PHONOLOGICAL ENCODIN, P42 Schade U., 1999, KONNEKTIONISTISCHE S SCHADE U, 1990, 6 OST ART INT TAG KO, P18 Stemberger J. P., 1985, PROGR PSYCHOL LANGUA, V1, P143 NR 37 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 43 EP 58 DI 10.1016/j.specom.2005.02.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500006 ER PT J AU Fujisaki, H Wang, CF Ohno, S Gu, WT AF Fujisaki, H Wang, CF Ohno, S Gu, WT TI Analysis and synthesis of fundamental frequency contours of Standard Chinese using the command-response model SO SPEECH COMMUNICATION LA English DT Article DE standard Chinese; tone; fundamental frequency contour; command-response model; constraint; perceptual test ID SPEECH AB While the tonal characteristics of Chinese syllables have been qualitatively described in traditional phonetics, quantitative analysis requires a mathematical model.
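For orientation, the command-response model invoked in this abstract has a standard closed form in Fujisaki's earlier publications (compare the Fujisaki 1984 entry in the reference list below); the following block reproduces that conventional formulation, with symbols following common usage rather than anything stated in this record:

```latex
\ln F_0(t) = \ln F_b + \sum_{i=1}^{I} A_{pi}\, G_p(t - T_{0i})
           + \sum_{j=1}^{J} A_{tj} \left[ G_a(t - T_{1j}) - G_a(t - T_{2j}) \right]

G_p(t) = \begin{cases} \alpha^{2} t\, e^{-\alpha t} & t \ge 0 \\ 0 & t < 0 \end{cases}
\qquad
G_a(t) = \begin{cases} \min\!\left[ 1 - (1 + \beta t)\, e^{-\beta t},\; \gamma \right] & t \ge 0 \\ 0 & t < 0 \end{cases}
```

Here F_b is the speaker's baseline frequency, the A_{pi} are phrase commands (impulses at times T_{0i}), and the A_{tj} are tone commands active from onset T_{1j} to offset T_{2j}; for a tone language such as Standard Chinese the tone commands may be negative as well as positive, which is the principal extension over the accent-command version used for non-tone languages. The time constants alpha and beta and the ceiling gamma are model parameters whose specific values would have to be taken from the paper itself.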
This paper presents such a model for the fundamental frequency contours of Standard Chinese, based on an extension of a model that has already been proved to be applicable to non-tone languages including Japanese, English, and others. The model allows one to interpret a given fundamental frequency contour in terms of tone commands and phrase commands, and to analyze various tonal phenomena in quantitative terms. The paper then describes the results of analysis of fundamental frequency contours of a number of utterances, revealing systematic relationships between the timing of the tone commands and the final of each syllable. The results are used to derive constraints for tone and phrase command generation in speech synthesis. The validity of introducing these constraints in speech synthesis of Standard Chinese is confirmed by perceptual tests on the naturalness of prosody as well as on the intelligibility of tones, using speech synthesized with and without the constraints. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Tokyo, Bunkyo Ku, Tokyo 1138656, Japan. Univ Sci & Technol China, Hefei 230026, Anhui, Peoples R China. Tokyo Univ Technol, Tokyo 1920982, Japan. Shanghai Jiao Tong Univ, Shanghai 200030, Peoples R China. RP Gu, WT (reprint author), Univ Tokyo, Bunkyo Ku, 7-3-1 Hongo, Tokyo 1138656, Japan. EM fujisaki@alum.mit.edu; ohno@cc.teu.ac.jp; wtgu@gavo.t.u-tokyo.ac.jp CR CHEN GP, 2004, P INT S TON ASP LANG, P25 CHEN SH, 1992, J ACOUST SOC AM, V92, P114, DOI 10.1121/1.404276 Chen SH, 1998, IEEE T SPEECH AUDI P, V6, P226 Chou F.-c., 1996, P ICSLP 1996 PHIL US, P1624 Fujisaki H., 1984, Journal of the Acoustical Society of Japan (E), V5 FUJISAKI H, 1969, MODEL SYNTHESIS PITC, V28, P53 Fujisaki H., 1990, P ICSLP 90, P841 FUJISAKI H, 1987, AC SOC JPN AUT M, P197 Fujisaki H., 2004, P INT S TON ASP LANG FUJISAKI H, 1992, P ICSLP 92, P433 FUJIWARA H, 2000, PEPTIDE SCI 1999, P9 Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X KODAMA T, 1999, P 1999 JAP CHIN S AD, P31 Liberman Mark, 1984, LANGUAGE SOUND STRUC, P157 NI JF, 2005, P 2005 SPRING M ASJ, P287 Pierrehumbert Janet, 1980, THESIS MIT CAMBRIDGE Shih C., 1996, COMPUTATIONAL LINGUI, V1, P37 SHIH C, 2000, P INT C SPOK LANG PR, V2, P67 TOKUDA K, 1999, P ICASSP, P229 van Santen J, 1998, MULTILINGUAL TEXT-TO-SPEECH SYNTHESIS: THE BELL LABS APPROACH, P141 WANG CF, 1999, P EUROSPEECH, P1655 XU CX, 1999, P 14 INT C PHON SCI, P2359 XU Y, 2004, INT S TON ASP LANG E, P215 YU MS, 2002, P ISCSLP 2002 TAIP NR 24 TC 25 Z9 27 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 59 EP 70 DI 10.1016/j.specom.2005.06.009 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500007 ER PT J AU Ode, C AF Ode, C TI Neutralization or truncation?
The perception of two Russian pitch accents on utterance-final syllables SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT International Conference on Tone and Intonation in Europe CY SEP 09-11, 2004 CL Santorini, GREECE DE Russian pitch accents; truncation; speech perception AB This paper presents the results of a perception experiment that was carried out to verify the hypothesis that in Russian the contrast between pitch accents LH*L and LH* on utterance-final syllables is neutralized. Recordings for the experiment were 10 sets of three short utterances with word stress in the ultimate, penultimate and antepenultimate syllable of the utterance-final word. These utterances were read aloud by four female and four male native speakers. They were asked to realize accents LH*L and LH* in the utterance-final word. After instructions and rehearsal, recordings were made separately for each of the two types. In the perception experiment, 30 native subjects listened to short utterances selected from the recordings and presented in 180 pairs: 120 pairs with ultimate stress and, in order to test whether listeners can hear the difference at all, 60 pairs with penultimate and antepenultimate word stress in utterance-final position. The 180 stimulus pairs consisted of short utterances with realizations of LH*L and LH* on the final word, each pair containing realizations of either the same or two different types of pitch accent. The task was to compare the two stimuli in a pair and to indicate on a score form whether the two realizations in a stimulus pair count as passable imitations of each other and thus belong to the same type of pitch accent. The same/different judgments indicate that listeners successfully distinguished between the two pitch accents in the antepenultimate and penultimate conditions, but much less so in the ultimate condition. This suggests that the two accents are truncated in final position, but not neutralized. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Amsterdam, Chair Phonet Sci ACLC, NL-1016 CG Amsterdam, Netherlands. RP Ode, C (reprint author), Univ Amsterdam, Chair Phonet Sci ACLC, Herengracht 338, NL-1016 CG Amsterdam, Netherlands. EM c.ode@uva.nl CR Boersma P., 2005, PRAAT DOING PHONETIC Bryzgunova E. A., 1977, ZVUKI INTONATSIIA RU Bryzgunova E. A., 1980, RUSSKAYA GRAMMATIKA, V1, P96 BRYZGUNOVA EA, 1984, EMOTSIONALNO STILIST Fougeron I., 1989, PROSODIE ORG MESSAGE Gussenhoven C., 2003, TODI TRANSCRIPTION D GUSSENOVEN C, 2005, TRANSCRIPTION DUTCH, P118 IGARASHI Y, 2002, B JAPANESE ASS RUSSI, V34, P15 Igarashi Y., 2004, B JPN ASS STUDY RUSS, V36, P85 IGARASHI Y, 2004, P INT C SPEECH PROS, P25 KEIJSPER CE, 1992, STUDIES SLAVIC GEN L, V17, P151 KODZASOV SV, 1996, PROSODICHESKII STROI, P70 KODZASOV SV, 1999, PROBLEMY FONETIKI, V3, P197 Ladd D. R., 1996, INTONATIONAL PHONOLO NAKATANI LH, 1978, J ACOUST SOC AM, V63, P234, DOI 10.1121/1.381719 NIKOLAEVA TM, 2000, OT ZVUKA K TEKSTU IA ODE C, 2003, STUDIES SLAVIC GEN L, V30, P279 ODE C, 1992, STUDIES SLAVIC GEN L, V17, P227 Ode C., 1989, RUSSIAN INTONATION P SVETOZAROVA ND, 1982, INTONATSIONNAIA SIST Yokoyama OT, 2001, WELT SLAVEN, V46, P1 NR 21 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 71 EP 79 DI 10.1016/j.specom.2005.06.004 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500008 ER PT J AU ten Bosch, L Oostdijk, N Boves, L AF ten Bosch, L Oostdijk, N Boves, L TI On temporal aspects of turn taking in conversational dialogues SO SPEECH COMMUNICATION LA English DT Article DE temporal structure; spontaneous speech; dialogues; turn taking phenomena ID TALK AB In this short communication we show how shallow annotations in large speech corpora can be used to derive data about the temporal aspects of turn taking. Within the limitations of such a speech corpus, we show that the average durations of between-turn pauses made by speakers in a dyad are statistically related, and our data suggest the existence of gender effects in the temporal aspects of turn taking. Also, clear differences in turn taking behaviour between face-to-face and telephone dialogues can be detected using shallow analyses. We discuss the most important limitations imposed by the shallowness of the annotations in large corpora, and the possibility for enriching those annotations in a semi-automatic iterative manner. (C) 2005 Elsevier B.V. All rights reserved. C1 Radboud Univ Nijmegen, Dept Linguist, NL-6500 HD Nijmegen, Netherlands. RP ten Bosch, L (reprint author), Radboud Univ Nijmegen, Dept Linguist, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM l.tenbosch@let.ru.nl; boves@let.ru.nl CR ALLPORT DA, 1972, Q J EXP PSYCHOL, V24, P225, DOI 10.1080/00335557243000102 Campione E., 2002, ESCA WORKSH SPEECH P, P199 CASPERS J, 2001, P EUR C, P1395 Caspers J, 2003, J PHONETICS, V31, P251, DOI 10.1016/S0095-4470(03)00007-X CASSELL J, 1999, MACHINE CONVERSATION Clark H. H., 1996, USING LANGUAGE Clark HH, 2002, COGNITION, V84, P73, DOI 10.1016/S0010-0277(02)00017-3 Day David, 1997, P 5 C APPL NAT LANG, P348, DOI DOI 10.3115/974557.974608 Duncan Jr S, 1977, FACE TO FACE INTERAC FERNANDEZ R, 2005, P IWCS 6, P115 Ford C. E., 1996, INTERACTION GRAMMAR, P134, DOI 10.1017/CBO9780511620874.003 Garrido I, 2004, COMPUTAT GEOSCI, V8, P1, DOI 10.1023/B:COMG.0000024426.15902.d8 Giles H., 1992, CONTEXTS ACCOMMODATI, P1 Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858 Jefferson G, 1989, CONVERSATION INTERDI, V3, P166 OOSTDIJK N, 2002, COLLECTION PAPERS CO Oviatt S., 2004, ACM T COMPUTER HUMAN, V11 Roger D, 1988, J LANG SOC PSYCHOL, V7, P27 SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243 Schegloff Emanuel A., 1982, ANAL DISCOURSE TEXT, P71 SELLEN AJ, 1995, HUM-COMPUT INTERACT, V10, P401, DOI 10.1207/s15327051hci1004_2 Selting M, 2000, LANG SOC, V29, P477, DOI 10.1017/S0047404500004012 Selting M., 1996, PRAGMATICS, V6, P357 Tannen D., 1989, TALKING VOICES REPET WAHLSTER W, 1997, 0193 VERBM WARD N, 1999, P ESCA WORKSH DIAL P, P83 Ward N., 2000, J PRAGMATICS, V23, P1177 WEILHAMMER K, 2003, P INT C PHON SCI BAR WEILHAMMER K, 2000, P LREC Zellner B., 1994, FUNDAMENTALS SPEECH, P41 NR 30 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
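The between-turn pause measurements described in the abstract above reduce to a simple computation once turns are available as time-stamped, speaker-labelled intervals. A minimal sketch follows; the (speaker, start, end) tuple layout is an assumed format for illustration, not the corpus's actual annotation scheme:

```python
# Minimal sketch of deriving between-turn pause statistics from shallow turn
# annotations (assumed data layout, not the corpus format).
from statistics import mean

# Turns sorted by start time: (speaker, start_s, end_s).
turns = [("A", 0.00, 2.10), ("B", 2.45, 4.00),
         ("A", 4.60, 6.20), ("B", 6.05, 8.00)]

def between_turn_pauses(turns):
    """Gap at each change of speaker; negative values are overlaps."""
    pauses = {}
    for prev, nxt in zip(turns, turns[1:]):
        if prev[0] != nxt[0]:                 # only count speaker changes
            pauses.setdefault(nxt[0], []).append(nxt[1] - prev[2])
    return pauses

for speaker, gaps in between_turn_pauses(turns).items():
    print("%s: mean between-turn pause %.2f s over %d turn changes"
          % (speaker, mean(gaps), len(gaps)))
```

Comparing such per-speaker means within a dyad is the kind of shallow analysis the paper uses to show that the pause behaviour of the two speakers is statistically related.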
PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 80 EP 86 DI 10.1016/j.specom.2005.05.009 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500009 ER PT J AU van Heuven, VJ van Zanten, E AF van Heuven, VJ van Zanten, E TI Speech rate as a secondary prosodic characteristic of polarity questions in three languages SO SPEECH COMMUNICATION LA English DT Article DE intonation; speech rate; prosodic boundary; clause type; statement; question ID SYLLABLES; STRESS; DUTCH AB Questions (almost) universally differ from statements in that the former have some element of high pitch that is absent in the latter. Therefore, the difference in speech melody (intonation) is considered to be the primary prosodic correlate of the contrast. We now pursue the possibility that another, secondary prosodic correlate may exist that signals the difference between statement and question. We noted in Manado Malay (an Austronesian language) that questions were spoken at a faster rate than the corresponding statements. We then examined speech rate in questions and statements in two Germanic languages, viz. Orkney English and Dutch. In all three languages we find a faster speaking rate in questions than in statements, but with a different distribution of the phenomenon over the sentence. In Manado Malay, the difference seems restricted to the boundaries of prosodic domains, in Orkney it is evenly spread over the sentence, and in Dutch it is only found in the middle portion of the sentence. Some speculation on possible causes of the rate difference between statements and questions is offered in conclusion. (C) 2005 Elsevier B.V. All rights reserved. C1 Leiden Univ, Ctr Linguist, Phonet Lab, NL-2300 RA Leiden, Netherlands. RP van Heuven, VJ (reprint author), Leiden Univ, Ctr Linguist, Phonet Lab, POB 9515,Cleveringaplaats 1, NL-2300 RA Leiden, Netherlands. EM v.j.j.p.van.heuven@let.leidenuniv.nl CR Bolinger D., 1989, INTONATION ITS USES CHISHOLM WS, 1982, INTERROGATIVITY C GR EEFTING W, 1991, J ACOUST SOC AM, V89, P412, DOI 10.1121/1.400475 GOSY M, 1994, J PHONETICS, V22, P269 GRICE M, 2005, PROSODIC TYPOLOGY PH, P261 GROSJEAN F, 1983, LINGUISTICS, V21, P501, DOI 10.1515/ling.1983.21.3.501 Gussenhoven Carlos, 2004, PHONOLOGY TONE INTON HAAN J, 2003, LINGUISTICS NETHERLA, P59 Haan Judith, 2001, LOT DISSERTATION SER, V52 Hermann E., 1942, PROBLEME FRAGE KRETSCHMER P, 1938, SCRITTI ONORE A TROM, P27 LINDBLOM B, 1981, DURATIONAL PATTERNS Lindsey Geoffrey, 1985, THESIS U CALIFORNIA Lunt H. G., 1964, P 9 INT C LING CAMBR, P833 NOOTEBOOM SG, 1985, TIME MIND BEHAV, P242 OHALA JJ, 1984, PHONETICA, V41, P1 RIALLAND A, 2004, C TON INT EUR TIE SA RIETVELD ACM, 1987, J PHONETICS, V15, P273 SLUIJTER AMC, 1995, PHONETICA, V52, P71 Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955 STOEL RB, 2005, FOCUS MANADO MALAY THORSEN N, 1978, J PHONETICS, V6, P151 van Heuven V. J., 2003, P 15 INT C PHON SCI, P805 van Heuven Vincent, 2002, LAB PHONOLOGY, V7, P61 van Leyden Klaske, 2004, LOT DISSERTATION SER, V92 vanHeuven V. J., 1999, P 14 INT C PHON SCI, P1581 van Heuven VJ, 2000, TEXT SPEECH LANG TEC, V15, P119 VANZANTEN E, 1991, 41 SPINASSP VANZANTEN E, 1993, ANAL SYNTHESIS SPEEC, P207 Weenink D., 1996, 132 U AMST I PHON SC NR 30 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 87 EP 99 DI 10.1016/j.specom.2005.05.010 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500010 ER PT J AU van Son, RJJH van Santen, JPH AF van Son, RJJH van Santen, JPH TI Duration and spectral balance of intervocalic consonants: A case for efficient communication SO SPEECH COMMUNICATION LA English DT Article DE prosodic structure; acoustic reduction; redundancy ID AMERICAN ENGLISH; LINGUISTIC STRESS; GLOTTAL CHARACTERISTICS; VOWEL REDUCTION; DUTCH VOWELS; PROMINENCE; PERCEPTION; BOUNDARIES; SPEAKERS; SPEECH AB The prosodic structure of speech and the redundancy of words can significantly strengthen or weaken segmental articulation. This paper investigates the acoustic effects of lexical stress, intra-word location, and predictability on sentence internal intervocalic consonants from accented words, using meaningful reading materials from 4157 sentences read by two American English speakers. Consonant duration and spectral balance in such reading materials show reduction in unstressed consonants and in consonants occurring later in the word (Initial vs. Medial vs. Final). Coronal consonants behaved distinctly, which was interpreted as a shift from full to flap or tap articulation in a subset of the phoneme realizations. This shift in articulation, and part of the consonant specific acoustic variation, could be linked to the frequency distribution of consonant classes over the investigated conditions. A higher frequency of occurrence of a consonant class in our corpus and a CELEX word-list was associated with shorter durations and differences in spectral balance that would increase the communicative efficiency of speech. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Amsterdam, Chair Phonet Sci ACLC, NL-1016 CG Amsterdam, Netherlands. Oregon Hlth & Sci Univ, OGI Sch Sci & Engn, Ctr Spoken Language Understanding, Beaverton, OR 97006 USA. RP van Son, RJJH (reprint author), Univ Amsterdam, Chair Phonet Sci ACLC, Herengracht 338, NL-1016 CG Amsterdam, Netherlands. EM r.j.j.h.vanson@uva.nl; vansanten@ece.ogi.edu CR AYLETT M, 1999, THESIS U EDINBURGH AYLETT M, 1999, P ICPHS 99 SAN FRANC, P289 Aylett M, 2004, LANG SPEECH, V47, P31 BOERSMA PPG, 1998, THESIS U AMSTERDAM, P493 Borsky S, 1998, J ACOUST SOC AM, V103, P2670, DOI 10.1121/1.422787 Byrd D., 1993, UCLA WORKING PAPERS, V83, P97 Byrd D, 1998, J PHONETICS, V26, P173, DOI 10.1006/jpho.1998.0071 Cutler A., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90004-0 Chennoukh S, 1997, J ACOUST SOC AM, V102, P2380, DOI 10.1121/1.419622 Clark J., 1990, INTRO PHONETICS PHON Cooper A. 
M., 1991, P ICPHS 91 AIX EN PR, P50 Cutler A, 1997, LANG SPEECH, V40, P141 DEJONG KJ, 1995, J ACOUST SOC AM, V97, P491, DOI 10.1121/1.412275 DEJONG K, 1993, LANG SPEECH, V36, P197 DODGE Y, 1981, ANAL EXPT MISSING DA FARNETANI E, 1995, P EUR 95 MAD, P2255 Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332 FOURAKIS M, 1991, J ACOUST SOC AM, V90, P1816, DOI 10.1121/1.401662 Hanson HM, 1997, J ACOUST SOC AM, V101, P466, DOI 10.1121/1.417991 Hanson HM, 1999, J ACOUST SOC AM, V106, P1064, DOI 10.1121/1.427116 Jongman A, 2000, J ACOUST SOC AM, V108, P1252, DOI 10.1121/1.1288413 KOOPMANSVANBEIN.FJ, 1980, THESIS U AMSTERDAM, P163 LIEBERMAN P, 1963, LANG SPEECH, V6, P172 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 NESPOR M, 1986, STUDIES GENERATIVE G, P327 Nord L., 1987, SWEDISH PHONETICS C, P16 OSHAUGHNESSY D, 1987, ADDISONWESLEY SERIES, P568 Pols L.C.W., 2003, P 8 EUR C SPEECH COM, P769 POUPLIER M, 2003, P 15 INT C PHON SCI, P2245 RIETVELD ACM, 1987, SPEECH COMMUN, V6, P217, DOI 10.1016/0167-6393(87)90027-6 SLUIJTER AMC, 1995, THESIS U LEIDEN, P188 Sluijter AMC, 1997, J ACOUST SOC AM, V101, P503, DOI 10.1121/1.417994 Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955 SLUIJTER AMC, 1995, P EUR 95 MADR, P941 SPROAT R, 1998, MULTILINGUAL TEXT TO, P300 Tabain M, 2003, J ACOUST SOC AM, V113, P516, DOI 10.1121/1.1523390 Turk A., 1992, WORKING PAPERS CORNE, V7, P103 Turk AE, 1997, J PHONETICS, V25, P25, DOI 10.1006/jpho.1996.0032 Turk AE, 2000, J PHONETICS, V28, P397, DOI 10.1006/jpho.2000.0123 UMEDA N, 1975, J ACOUST SOC AM, V58, P434, DOI 10.1121/1.380688 Van Santen J. P. H., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90016-Y van Son R.J.J.H., 2004, P INTERSPEECH 2004 J, P1277 van Son R.J.J.H., 2003, P ICPHS BARC 2003, P2141 VANBERGEM D, 1995, STUDIES LANGUAGE LAN, P195 VANSANTEN JPH, 1993, 93080510TM BELL LABS VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5 VANSANTEN JPH, 1993, J MATH PSYCHOL, V37, P327, DOI 10.1006/jmps.1993.1022 VANSON RJJ, 1997, P EUR 97 RHOD, P319 VANSON RJJ, 1999, P EUR 99 BUD, P439 VANSON RJJ, 1998, P ICSLP 98 SIDN AUST, P2395 van Son R. J. J. H., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607908 VANSON RJJ, 2002, P ICSLP2002 DENV US, V1, P37 VANSON RJJ, 1997, P EUR 97 RHOD, P2135 van Son RJJH, 1999, SPEECH COMMUN, V28, P125, DOI 10.1016/S0167-6393(99)00009-6 VANSON RJJH, 1992, J ACOUST SOC AM, V92, P121, DOI 10.1121/1.404277 VANSON RJJH, 1990, J ACOUST SOC AM, V88, P1683, DOI 10.1121/1.400243 VITEVITCH MS, 1997, LANG SPEECH, V50, P147 WANG X, 1997, STUDIES LANGUAGE LAN, V29, P190 WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 ZUE VW, 1979, J ACOUST SOC AM, V66, P1039, DOI 10.1121/1.383323 NR 60 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
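One of the acoustic measures in the record above, spectral balance, is commonly computed as the energy in a small number of fixed frequency bands. The sketch below uses the four bands of the Sluijter and van Heuven (1996) work cited in the reference list (0-0.5, 0.5-1, 1-2 and 2-4 kHz); whether this paper uses exactly those bands is an assumption made here for illustration:

```python
# Illustrative sketch of spectral balance as per-band energy (dB). The four
# bands follow Sluijter & van Heuven (1996), cited above; whether this paper
# uses exactly these bands is an assumption for illustration.
import numpy as np

def band_energies_db(x, fs, bands=((0, 500), (500, 1000), (1000, 2000), (2000, 4000))):
    """Energy (dB) of the windowed frame x in each frequency band."""
    spec = np.abs(np.fft.rfft(x * np.hanning(len(x)))) ** 2
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    return [10 * np.log10(spec[(freqs >= lo) & (freqs < hi)].sum() + 1e-12)
            for lo, hi in bands]

fs = 16000
t = np.arange(0, 0.05, 1.0 / fs)                     # one 50 ms frame
x = np.sin(2 * np.pi * 300 * t) + 0.1 * np.sin(2 * np.pi * 3000 * t)
print(["%.1f" % e for e in band_energies_db(x, fs)])
# A flatter balance (relatively more energy in the higher bands) is the kind
# of difference reported between unreduced and reduced realisations.
```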
PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 100 EP 123 DI 10.1016/j.specom.2005.06.005 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500011 ER PT J AU van Beinum, FJ Schwippert, CE Been, PH van Leeuwen, TH Kuijpers, CTL AF van Beinum, FJ Schwippert, CE Been, PH van Leeuwen, TH Kuijpers, CTL TI Development and application of a /bAk/-/dAk/ continuum for testing auditory perception within the Dutch longitudinal dyslexia study SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT EURESCO Conference 2002 CY 2002 CL Acquafredda di Maratea, ITALY DE developmental dyslexia; auditory processing; event-related potentials; mismatch negativity ID EVENT-RELATED POTENTIALS; SPEECH-PERCEPTION; TEMPORAL PERCEPTION; MISMATCH NEGATIVITY; CHILDREN; DISCRIMINATION; SOUNDS; BRAIN; DEFICITS; TIME AB A national longitudinal research program on developmental dyslexia was started in The Netherlands, including auditory perception and processing as an important research component. New test materials had to be developed, to be used for measuring the auditory sensitivity of the subjects to speech-like stimuli from birth until the age of 10 years. This paper describes the subsequent steps and experiments in developing the auditory test material. Several experiments showed that dyslexic adults, as compared to a control group, were less accurate and slower in discriminating phoneme contrasts with subtle acoustic differences. The continuum developed so far was tested in an experiment using a mismatch negativity paradigm applied in an adult control group. Results of this ERP study indicated that reliable mismatch negativity could be obtained, which suggests that the paradigm and the stimuli are appropriate for the currently running Dutch longitudinal dyslexia study. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Amsterdam, Inst Phonet Sci ACLC, NL-1016 CG Amsterdam, Netherlands. Univ Groningen, BCN Neuroimaging Ctr, NL-9700 AB Groningen, Netherlands. Univ Groningen, Dept Dutch, NL-9700 AB Groningen, Netherlands. Univ Amsterdam, Fac Behav & Social Sci, NL-1012 WX Amsterdam, Netherlands. Municipal Hosp Slotervaart, Dept Clin Neurophysiol, Amsterdam, Netherlands. Radboud Univ Nijmegen, Dept Special Educ, Nijmegen, Netherlands. RP van Beinum, FJ (reprint author), Univ Amsterdam, Inst Phonet Sci ACLC, Herengracht 338, NL-1016 CG Amsterdam, Netherlands.
EM f.j.vanbeinum@uva.nl; caroline.schwippert@hetnet.nl; p.h.been@let.rug.nl; t.h.vanleeuwen@uva.nl; c.kuijpers@pwo.ru.nl CR Adlard A, 1998, Q J EXP PSYCHOL-A, V51, P153 Blomert L, 2004, BRAIN LANG, V89, P21, DOI 10.1016/S0093-934X(03)00305-5 Boersma P., 2000, PRAAT SYSTEM DOING P BOERSMA P, 1996, 132 I PHON SCI BRADLEY L, 1983, NATURE, V301, P419, DOI 10.1038/301419a0 GERRITS PAM, 2001, THESIS U UTRECHT GRATTON G, 1983, ELECTROEN CLIN NEURO, V55, P468, DOI 10.1016/0013-4694(83)90135-9 Hazan V, 2000, J PHONETICS, V28, P377, DOI 10.1006/jpho.2000.0121 IRAUSQUIN RS, 1997, THESIS CATHOLIC U BR Kuijpers CTL, 1996, J PHONETICS, V24, P367, DOI 10.1006/jpho.1996.0020 Kujala T, 2000, PSYCHOPHYSIOLOGY, V37, P262 Leppanen PHT, 1997, AUDIOL NEURO-OTOL, V2, P308 LIBERMAN AM, 1957, J EXP PSYCHOL, V54, P358, DOI 10.1037/h0044417 Lyytinen H, 1997, DYSLEXIA: BIOLOGY, COGNITION AND INTERVENTION, P97 Maurer U, 2003, CLIN NEUROPHYSIOL, V114, P808, DOI 10.1016/S1388-2457(03)00032-4 Mody M, 1997, J EXP CHILD PSYCHOL, V64, P199, DOI 10.1006/jecp.1996.2343 Molfese DL, 1997, DEV NEUROPSYCHOL, V13, P135 Naatanen R, 2001, PSYCHOPHYSIOLOGY, V38, P1, DOI 10.1017/S0048577201000208 Naatanen R, 1997, NATURE, V385, P432, DOI 10.1038/385432a0 Nicholson R. I., 1994, Q J EXPT PSYCHOL A, V47, P29 *NWO NETH ORG SCI, 1996, ID COR FEAT DEV DYSL REED MA, 1989, J EXP CHILD PSYCHOL, V48, P270, DOI 10.1016/0022-0965(89)90006-4 REPP BH, 1981, B PSYCHONOMIC SOC, V18, P12 RICHARDSON U, 1998, THESIS U JYVASKYLA F van Hessen AJ, 1999, PHONETICA, V56, P56, DOI 10.1159/000028441 SCHWIPPERT CE, 1998, 135 U AMST I PHON SC Serniclaes W, 2001, J SPEECH LANG HEAR R, V44, P384, DOI 10.1044/1092-4388(2001/032) Serniclaes WI, 2004, J EXP CHILD PSYCHOL, V87, P336, DOI 10.1016/j.jecp.2004.02.001 Snowling M.J., 2000, DYSLEXIA Sprenger-Charolles L, 2000, CAN J EXP PSYCHOL, V54, P87, DOI 10.1037/h0087332 STANOVICH KE, 1988, ANN DYSLEXIA, V38, P154, DOI 10.1007/BF02648254 STEVENSON DC, 1979, THESIS U ALBERTA US StuddertKennedy M, 1995, PSYCHON B REV, V2, P508, DOI 10.3758/BF03210986 TALLAL P, 1980, BRAIN LANG, V9, P182, DOI 10.1016/0093-934X(80)90139-X TOONEN G, 1998, THESIS U NIJMEGEN NE VANHESSEN AJ, 1992, THESIS U UTRECHT NET WERKER JF, 1993, J PHONETICS, V21, P83 WERKER JF, 1987, CAN J PSYCHOL, V41, P48, DOI 10.1037/h0084150 WOOD CC, 1976, J ACOUST SOC AM, V60, P1381, DOI 10.1121/1.381231 NR 39 TC 19 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 124 EP 142 DI 10.1016/j.specom.2005.04.003 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500012 ER PT J AU van der Stelt, JM Zajdo, K Wempe, TG AF van der Stelt, JM Zajdo, K Wempe, TG TI Exploring the acoustic vowel space in two-year-old children: Results for Dutch and Hungarian SO SPEECH COMMUNICATION LA English DT Article DE automated band filtering; vowel acquisition; principal component analysis; Dutch; Hungarian ID SPEECH; PERCEPTION; FREQUENCY; INFANTS AB In the last decade, there has been an increasing interest in exploring patterns of vowel acquisition in young children. Traditionally, researchers attempt to estimate formant values of vowel realizations via acoustic measurements. However, these techniques have yielded questionable results, due primarily to a low sampling rate of the spectrum caused by a high fundamental frequency in young children's speech. 
Additionally, the researcher's knowledge about the intended vowel quality affects the decision pertaining to vowel formants. A frequency domain band filtering analysis method that minimizes the dependence of the results on F0 is developed to measure the spectral envelopes in children's utterances automatically, and is applied to existing utterance data sets of Dutch and Hungarian. One further advantage of the current method is that it selects a maximum of 10 measurement points along the length of the utterance. Data reduction of all filter outputs is achieved via Principal Component Analysis (PCA). By using the first two eigenvectors, a reference plane is created. The first two eigenvectors account for 54.2% and 58.6% of the variance in the Dutch and Hungarian data sets, respectively. Next, a common reference plane for Dutch and Hungarian two-year-olds is constructed by balancing the number of utterances that are analyzed per language. Corner vowels of Dutch- and Hungarian-speaking two-year-old boys that were perceptually judged as correctly pronounced were mapped onto this common Dutch-Hungarian reference plane. The band filtering method has been shown to be robust with regard to signal-to-noise ratios and to differences in the number of measurements. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Amsterdam, Inst Phonet Sci ACLC, NL-1016 CG Amsterdam, Netherlands. Univ Wyoming, Dept 3311, Div Commun Disorders, Laramie, WY 82071 USA. RP van der Stelt, JM (reprint author), Univ Amsterdam, Inst Phonet Sci ACLC, Herengracht 338, NL-1016 CG Amsterdam, Netherlands. EM jeannet.van.der.stelt@uva.nl CR Beinum F. J. Koopmans-van, 1986, PRECURSORS EARLY SPE, P37 Boe L.-J., 1997, JOURN ET LING VOYELL, P98 CLEMENT CJ, 2004, THESIS U AMSTERDAM A DAVIS BL, 1990, J SPEECH HEAR RES, V33, P16 DEBOYSSONBARDIES B, 1989, J CHILD LANG, V16, P1 KENT RD, 1993, J PHONETICS, V21, P117 KOOPMANSVANBEIN.FJ, 2003, P 15 INT C PHON SCI, V1, P1033 KUHL PK, 1982, SCIENCE, V218, P1138, DOI 10.1126/science.7146899 Kuhl PK, 1996, J ACOUST SOC AM, V100, P2425, DOI 10.1121/1.417951 KUHL PK, 1993, J PHONETICS, V21, P125 Lee SW, 1997, PROCEEDINGS OF THE SEVENTH INTERNATIONAL CONFERENCE ON COMPUTING IN CIVIL AND BUILDING ENGINEERING, VOLS 1-4, P473 MacNeilage PF, 1998, BEHAV BRAIN SCI, V21, P499 MENARD L, 2002, J AC SOC AM, V111, P1895 Palethorpe S, 1996, J ACOUST SOC AM, V100, P3843, DOI 10.1121/1.417240 PLOMP R, 1967, J ACOUST SOC AM, V41, P707, DOI 10.1121/1.1910398 Pols LCW, 1977, THESIS FREE U AMSTER POLS LCW, 1969, J ACOUST SOC AM, V46, P458, DOI 10.1121/1.1911711 POLS LCW, 1973, J ACOUST SOC AM, V53, P1093, DOI 10.1121/1.1913429 Robb MP, 1997, FOLIA PHONIATR LOGO, V49, P88 SERKHANE J, 2002, P INT C SPOK LANG PR, P45 STOELGAMMON C, 1983, J CHILD LANG, V10, P455 Vallabha GK, 2002, SPEECH COMMUN, V38, P141, DOI 10.1016/S0167-6393(01)00049-8 Van der Stelt J., 2003, P I PHON SCI U AMST, V25, P197 VANDERSTELT JM, 2003, P 15 INT C PHON SCI, V3, P2225 Weenink D., 1996, 132 U AMST I PHON SC WEMPE AG, 2003, P 15 INT C PHON SCI, V1, P343 WEMPE AG, 2001, P I PHON SCI U AMST, V24, P167 Zajdo K., 2003, P 15 INT C PHON SCI, V3, P2229 ZAJDO K, 2002, THESIS U WASHINGTON ZAJDO K, 2002, P 28 ANN M BERK LING, P363 NR 30 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
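The PCA step described in the abstract above, projecting band-filter outputs onto the plane spanned by the first two eigenvectors, can be illustrated in a few lines. This is a sketch on synthetic data, not the authors' implementation; the number of filter bands and all variable names are assumptions:

```python
# Illustrative sketch: reduce band-filter spectra with PCA and project onto
# the plane spanned by the first two eigenvectors (synthetic data; the 18
# bands and all names are assumptions, not the authors' setup).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 18))          # 200 measurement points x 18 band levels

Xc = X - X.mean(axis=0)                 # centre the filter outputs
eigvals, eigvecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(eigvals)[::-1]       # eigh returns ascending order
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

explained = eigvals[:2].sum() / eigvals.sum()
print("first two eigenvectors: %.1f%% of the variance" % (100 * explained))

plane = Xc @ eigvecs[:, :2]             # the two-dimensional reference plane
print(plane.shape)                      # (200, 2): one point per measurement
```

On the real filter outputs, rather than this random data, the first two components would carry the roughly 54-59% of the variance reported in the abstract, and the per-language vowel realisations can then be compared as point clouds in the shared plane.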
PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 143 EP 159 DI 10.1016/j.specom.2005.06.006 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500013 ER PT J AU van Gogh, CDL Festen, JM Verdonck-de Leeuw, IM Parker, AJ Traissac, L Cheesman, AD Mahieu, HF AF van Gogh, CDL Festen, JM Verdonck-de Leeuw, IM Parker, AJ Traissac, L Cheesman, AD Mahieu, HF TI Acoustical analysis of tracheoesophageal voice SO SPEECH COMMUNICATION LA English DT Article DE laryngectomy; tracheoesophageal voice quality; acoustical voice analysis; speech rehabilitation ID FUNDAMENTAL-FREQUENCY; PERCEPTUAL EVALUATION; ESOPHAGEAL SPEECH; QUALITY; LARYNGEAL AB Acoustical voice analysis of laryngectomees is a complicated matter because of the often weak periodicity of the voice and the high noise component. This study consists of a feasibility study and validation of an acoustical tracheoesophageal (TE) voice analysis on a sustained vowel based upon recordings of 66 laryngectomees from four clinics in three European countries. Based on reliability analysis of the acoustical data, TE voices can be objectively divided into three categories: (I) good voices with low-frequency harmonics and noise taking over at the higher frequencies; (II) moderate voices consisting of repetitive bursts of sound energy with low repetition rate and a weak periodicity due to high levels of noise, even at the low frequencies; (III) poor voices with no detectable or very weak fundamental frequency or envelope periodicity. The voice samples from categories I and II correlate well with perceptually analyzed voice quality parameters, which supports the robustness and validity of this acoustical analysis method for analyzing TE voices. (C) 2005 Elsevier B.V. All rights reserved. C1 Vrije Univ Amsterdam Med Ctr, Dept Otorhinolaryngol Head & Neck Surg, NL-1007 MB Amsterdam, Netherlands. Royal Hallamshire Hosp, Dept Otorhinolaryngol, Sheffield S10 2JF, S Yorkshire, England. Univ Bordeaux 2, Dept Otorhinolaryngol Head & Neck Surg, F-33076 Bordeaux, France. Charing Cross Hosp, Dept Otolaryngol, London, England. RP Festen, JM (reprint author), Vrije Univ Amsterdam Med Ctr, Dept Otorhinolaryngol Head & Neck Surg, POB 7057, NL-1007 MB Amsterdam, Netherlands.
EM jm.festen@vumc.nl CR Arias MR, 2000, OTOLARYNG HEAD NECK, V122, P743, DOI 10.1016/S0194-5998(00)70208-7 BAGGS TW, 1983, J COMMUN DISORD, V16, P299, DOI 10.1016/0021-9924(83)90014-X Bertino G, 1996, FOLIA PHONIATR LOGO, V48, P255 BLOOD GW, 1984, J COMMUN DISORD, V17, P319, DOI 10.1016/0021-9924(84)90034-0 Carding PN, 2004, CLIN OTOLARYNGOL, V29, P538, DOI 10.1111/j.1365-2273.2004.00846.x Crevier-Buchman L, 1996, Ann Otolaryngol Chir Cervicofac, V113, P61 DEBRUYNE F, 1994, J LARYNGOL OTOL, V108, P325 Dejonckere PH, 2001, EUR ARCH OTO-RHINO-L, V258, P77, DOI 10.1007/s004050000299 DEKROM G, 1993, J SPEECH HEAR RES, V36, P254 Festen JM, 1996, INT CONGR SER, V1112, P171 Globlek D, 2004, Logoped Phoniatr Vocol, V29, P87 Lawson G, 2001, HEAD NECK-J SCI SPEC, V23, P871, DOI 10.1002/hed.1126 MAHIEU HF, 1986, SPEECH RESTORATION V, P139 MAHIEU HF, 1988, THESIS U GRONINGEN MAHIEU HF, 2000, 941611 EC BMT Mérol J C, 1999, Rev Laryngol Otol Rhinol (Bord), V120, P249 Moerman M, 2004, EUR ARCH OTO-RHINO-L, V261, P541, DOI 10.1007/s00405-003-0681-0 NIEBOER GLJ, 1988, J PHONETICS, V16, P417 Olszański Witold, 2004, Otolaryngol Pol, V58, P473 PINDZOLA RH, 1989, ANN OTO RHINOL LARYN, V98, P960 ROBBINS J, 1984, J SPEECH HEAR RES, V27, P577 ROBBINS J, 1984, ARCH OTOLARYNGOL, V110, P670 ROBBINS J, 1984, J SPEECH HEAR DISORD, V49, P202 SCHROEDE.MR, 1968, J ACOUST SOC AM, V43, P829, DOI 10.1121/1.1910902 SEDORY SE, 1989, J SPEECH HEAR DISORD, V54, P209 Singer MI, 2004, OTOLARYNG CLIN N AM, V37, P507, DOI 10.1016/j.otc.2004.01.001 TITZE IR, 1994, WORKSH AC VOIC AN, P4 TOSI O, 1987, FOLIA PHONIATR, V39, P290 van As CJ, 2003, J SPEECH LANG HEAR R, V46, P947, DOI 10.1044/1092-4388(2003/3074) VANAS CJ, 2001, THESIS U AMSTERDAM B van As CJ, 1998, J VOICE, V12, P239, DOI 10.1016/S0892-1997(98)80044-1 van Rossum MA, 2002, J SPEECH LANG HEAR R, V45, P1106, DOI 10.1044/1092-4388(2002/089) NR 32 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 160 EP 168 DI 10.1016/j.specom.2005.03.007 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500014 ER PT J AU van Wieringen, A Wouters, J AF van Wieringen, A Wouters, J TI Normalization and feasibility of speech understanding tests for Dutch speaking toddlers SO SPEECH COMMUNICATION LA English DT Article DE normative data; speech tests; toddlers ID COCHLEAR IMPLANTS; CONSONANT RECOGNITION; HEARING-AIDS; CHILDREN; LANGUAGE; PERCEPTION AB As a result of newborn hearing screening programs, hearing impairment is often identified early in life and proper intervention (hearing aids, cochlear implant) will enable children to develop expressive and receptive skills as young as possible. The goals of this study are to obtain normative data on speech tests that are suitable for evaluation of young hearing-impaired Dutch speaking children in Flanders and the Netherlands and to determine the youngest age at which these tests are feasible. Normative data are obtained with 143 normal-hearing children with normal cognitive and language development for the Gottinger I (3-4 years) and Gottinger II (5-6 years). It is shown that one performance intensity curve can describe both tests. Moreover, a subsequent study with 35 normal-hearing children showed that the Gottinger I can also be administered to children with normal cognitive and language development younger than three years of age. 
In addition, the feasibility of two analytical tests was examined. These tests were designed to obtain information on the transmission of spectral and temporal speech cues by the hearing aid or cochlear implant in children as young as 2 1/2 years of age. (C) 2005 Elsevier B.V. All rights reserved. C1 Katholieke Univ Leuven, Lab Expt Otorhinolaryngol, B-3000 Louvain, Belgium. RP van Wieringen, A (reprint author), Katholieke Univ Leuven, Lab Expt Otorhinolaryngol, Kapucijnenvoer 33, B-3000 Louvain, Belgium. EM astrid.vanwieringen@uz.kuleuven.ac.be RI Wouters, Jan/D-1800-2015 CR BEIJNON AJ, 1992, UNPUB ANTWERPEN NIJM BOETS B, UNPUB AUDITORY TEMPO BOSMAN AJ, 1995, LOGOPEDIE FONIATRIE, V9, P218 CHANET T, 1998, THESIS KATHOLIEKE U CRUL AM, 1984, LOGOPEDIE FONIATRIE, V56, P31 DORMAN MF, 1990, J ACOUST SOC AM, V88, P2074, DOI 10.1121/1.400104 FINITZOHIEBER T, 1980, EAR HEARING, V1, P271, DOI 10.1097/00003446-198009000-00007 GOMMERS K, 1998, THESIS KATHOLIEKE U GOOSSENS K, 2004, THESIS KATHOLIEKE U HUYSMANS I, 1997, THESIS KATHOLIEKE U Kirk K, 1997, AUDIOLOGIC EVALUATIO, P101 LAMBRECHTS M, 1979, THESIS HOGESCHOOL BR Laneau J, 2005, J NEUROSCI METH, V142, P131, DOI 10.1016/j.jneumeth.2004.08.015 LANEAU J, 2004, VLAAMSE OPNAME WOORD LEMKENS, 2000, THESIS KATHOLIEKE U Moog J. S., 1983, GRAMMATICAL ANAL ELI OSBERGER MJ, 1991, AM J OTOL, V12, P105 PLASMANS A, 1999, THESIS KATHOLIEKE U Robbins AM, 1994, MR POTATO HEAD TASK Robbins Amy M., 1996, Seminars in Hearing, V17, P353, DOI 10.1055/s-0028-1083065 Spencer LJ, 2003, EAR HEARING, V24, P236, DOI 10.1097/01.AUD.0000069231.72244.94 Spencer LJ, 1998, EAR HEARING, V19, P310, DOI 10.1097/00003446-199808000-00006 *SPSS INC, 2004, SPSS 12 0 Svirsky MA, 2004, AUDIOL NEURO-OTOL, V9, P224, DOI 10.1159/000078392 Tyler R. S., 1993, COCHLEAR IMPLANTS AU, P191 Tyler R. S., 1991, AUDIOVISUAL FEATURE TYLER RS, 1991, EAR HEARING, V12, pS177, DOI 10.1097/00003446-199112001-00011 TYLER RS, 1992, J ACOUST SOC AM, V92, P3068, DOI 10.1121/1.404203 VANGOMPEL J, 1979, TIJDSCHRIFT LOGOPEDI, V9, P1 VANHAL E, 2000, THESIS KATHOLIEKE U VANKERSCHAVER E, 2002, ALGO GEHOORSCREENING VANRIE L, 2004, THESIS KATHOLIEKE U VANWIERINGEN, 1998, UNPUB HANDLEIDING PA VANWIERINGEN A, 2000, COCHLEAR IMPLANTS, P355 van Wieringen A, 1999, EAR HEARING, V20, P89, DOI 10.1097/00003446-199904000-00001 Wouters J, 1994, LOGOPEDIE, V7, P28 Yoshinaga-Itano C, 1998, PEDIATRICS, V102, P1161, DOI 10.1542/peds.102.5.1161 NR 37 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 169 EP 181 DI 10.1016/j.specom.2005.03.013 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500015 ER PT J AU Carlson, R Granstrom, B AF Carlson, R Granstrom, B TI Data-driven multimodal synthesis SO SPEECH COMMUNICATION LA English DT Article DE speech synthesis; multimodal synthesis; data-driven synthesis ID SPEECH SYNTHESIS AB This paper is a report on current efforts at the Department of Speech, Music and Hearing, KTH, on data-driven multimodal synthesis including both visual speech synthesis and acoustic modeling. In this research, we try to combine corpus-based methods with knowledge-based models and to explore the best of the two approaches. In the paper, an attempt to build formant-synthesis systems based on both rule-generated and database-driven methods is presented.
A pilot experiment is also reported showing that this approach can be a very interesting path to explore further. Two studies on visual speech synthesis are reported, one on data acquisition using a combination of motion capture techniques and one concerned with coarticulation, comparing different models. (C) 2005 Elsevier B.V. All rights reserved. C1 KTH, CTT, Dept Speech Mus & Hearing, SE-10044 Stockholm, Sweden. RP Carlson, R (reprint author), KTH, CTT, Dept Speech Mus & Hearing, Lindstedsvagen 24,5th Floor, SE-10044 Stockholm, Sweden. EM rolf@speech.kth.se CR Acero A., 1999, P EUROSPEECH, P1047 Allen J., 1987, TEXT SPEECH MITALK S BAILLY G, 2002, P ICSLP2002, P1913 BESKOW J, 2003, P ICPHS 2003 BARC SP Beskow J., 1995, P 4 EUR C SPEECH COM, P299 Beskow J., 2003, THESIS KTH STOCKHOLM Beskow J., 1997, P ESCA WORKSH AUD VI, P149 Beskow J., 2004, J SPEECH TECHNOLOGY, V4, P335 Beskow J, 2004, LECT NOTES COMPUT SC, V3118, P1178 BRANDERUD P, 1985, P FRENCH SWED S SPEE, P113 Bregler C., 1997, P ACM SIGGRAPH, P353, DOI 10.1145/258734.258880 Brooke N., 1998, P AUD VIS SPEECH PRO, P213 CARLSON R, 1982, P ICASSP 82, V3, P1604 CARLSON R, 1976, P ICASSP 76 CARLSON R, 1991, SPEECH COMMUN, V10, P481, DOI 10.1016/0167-6393(91)90051-T CARLSON R, 1992, INT C SPOK LANG PROC, P671 CARLSON R, 2002, FONETIK 2002 CHARPENTIER F, 1990, SPEECH COMMUN, V9, P435 CHARPENTIER F, 1986, P ICASSP 86, V3, P2015 Cohen M. M., 1993, Models and Techniques in Computer Animation Dixon N.R., 1968, IEEE Transactions on Audio and Electroacoustics, VAU-16, DOI 10.1109/TAU.1968.1161948 ENGWALL O, 2002, P ICSLP 2002 ENGWALL O, 2004, P ICSLP 2004 ENGWALL O, 2002, THESIS KTH SWEDEN Ezzat T., 2002, P ACM SIGGRAPH 2002, P388, DOI 10.1145/566570.566594 Hallgren A., 1998, P AUD VIS SPEECH PRO, P181 HERTZ S, 2002, P IEEE 2002 WORKSH S HOGBERG J, 1997, P EUR 97 JIANG J, 2000, P ICSLP2000, V1, P42 Klatt D. H., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 Le Goff B., 1997, P 5 EUR C RHOD, P1667 LEE M, 1999, P EUR C SPEECH COMM, V6, P2789 LOFQVIST A, 1990, NATO ADV SCI I D-BEH, V55, P289 MACLEOD A, 1990, British Journal of Audiology, V24, P29, DOI 10.3109/03005369009077840 MANNELL RH, 1998, P ICSLP 98 MASSARO DW, 2005, AUDIOVISUAL SPEECH P MORI H, 2002, ICSLP 2002, P2365 Ogden R, 2000, COMPUT SPEECH LANG, V14, P177, DOI 10.1006/csla.2000.0141 OHLIN D, 2004, THESIS TMH STOCKHOLM OHLIN D, 2004, P FON, P1603 OHMAN T, 1998, KTH TMH QPSR, V1, P61 Olive J. P., 1977, P INT C ACOUST SPEEC, P568 PARKS DA, 1982, GASTROENTEROLOGY, V2, P9 Pelachaud C., 2002, MPEG 4 FACIAL ANIMAT, P125, DOI 10.1002/0470854626.ch8 Pelachaud C, 1996, COGNITIVE SCI, V20, P1 PETERSON G, 1958, J ACOUST SOC AM, V32, P639 Reveret L., 2000, P 6 INT C SPOK LANG, P755 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 SICILIANO C, 2003, IN PRESS P 15 INT C SIGVARDSON T, 2002, THESIS TMH STOCKHOLM SJOLANDER A, 2001, THESIS TMH STOCKHOLM Sjolander K., 2003, P FON 2003, P93 STEVENS KN, 1991, J PHONETICS, V19, P161 Talkin D., 1989, Speech Technology, V4 VINET R, 2004, THESIS TMH STOCKHOLM Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X NR 57 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 182 EP 193 DI 10.1016/j.specom.2005.02.015 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500016 ER PT J AU den Os, E Boves, L Rossignol, S ten Bosch, L Vuurpijl, L AF den Os, E Boves, L Rossignol, S ten Bosch, L Vuurpijl, L TI Conversational agent or direct manipulation in human-system interaction SO SPEECH COMMUNICATION LA English DT Article DE multimodal interaction; usability; conversational agent; direct manipulation AB In this paper we investigate the usability of speech-centric multimodal interaction by comparing two systems that support the same unfamiliar task, viz. bathroom design. One version implements a conversational agent (CA) metaphor, while the alternative one is based on direct manipulation (DM). Twenty subjects, 10 males and 10 females, none of whom had recent experience with bathroom (re-)design, completed the same task with both systems. After each task we collected objective measures (task completion time, task completion rate, number of actions performed, speech and pen recognition errors) and subjective measures in the form of Likert Scale ratings. We found that the task completion rate for the CA system is higher than for the DM system. Nevertheless, subjects did not agree on their preference for one of the systems: those subjects who were able to use the DM system effectively preferred that system, mainly because it was faster for them, and they felt more in control. We conclude that for multimodal CA systems to become widely accepted, substantial improvements in system architecture and in the performance of almost all individual modules are needed. (C) 2005 Elsevier B.V. All rights reserved. C1 Ctr Language & Speech Technol, NL-6500 HD Nijmegen, Netherlands. Max Planck Inst Psycholinguist, NL-6500 AH Nijmegen, Netherlands. Nijmegen Inst Cognit & Informat, NL-6500 HE Nijmegen, Netherlands. RP Boves, L (reprint author), Ctr Language & Speech Technol, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM l.boves@let.ru.nl RI den Os, Els/J-2769-2014 CR Argyle Michael, 1976, GAZE AND MUTUAL GAZE BOVES L, 1999, P IEEE WORKSH AUT SP BUISINE S, 2004, BROWS TRUST EVALUATI DENOS E, 2005, P HCI INT 2005 DENOS E, 2003, P CHALL 2003 HERZOG G, 2003, P HLT NAACL 2003 WOR KVALE K, 2003, P INT S HUM FACT TEL LARSEN BL, 2003, P EUR C SPEECH COMM LOVE S, 1994, P INT C SPOK LANG PR, P1307 MCGLASHAN S, 1995, P 2 INT WORKSH MIL A OVIATT S, 2003, HUM FAC ER, P286 Shneiderman B., 1997, INTERACTIONS, V4, P42, DOI 10.1145/267505.267514 STURM J, IN PRESS HUMAN COMPU STURM J, UNPUB INT J SPEECH T Thorisson KR, 2002, TEXT SPEECH LANG TEC, V19, P173 Wahlster W., 2003, P HUM COMP INT STAT, P47 WALKER MA, 2000, DEV GEN MODELS USABI Xiao B., 2003, P ICMI, P265 NR 18 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 194 EP 207 DI 10.1016/j.specom.2005.04.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500017 ER PT J AU Furui, S Nakamura, M Ichiba, T Iwano, K AF Furui, S Nakamura, M Ichiba, T Iwano, K TI Analysis and recognition of spontaneous speech using Corpus of Spontaneous Japanese SO SPEECH COMMUNICATION LA English DT Article DE spontaneous speech; Corpus of Spontaneous Japanese; automatic speech recognition; cepstrum; speaking rate AB Although speech is spontaneous in almost any situation, recognition of spontaneous speech is an area which has only recently emerged in the field of automatic speech recognition. Broadening the application of speech recognition depends crucially on raising recognition performance for spontaneous speech. For this purpose, it is necessary to analyze and model spontaneous speech using spontaneous speech databases, since spontaneous speech and read speech are significantly different. This paper reports analysis and recognition of spontaneous speech using a large-scale spontaneous speech database "Corpus of Spontaneous Japanese (CSJ)". Recognition results in this experiment show that recognition accuracy significantly increases as a function of the size of acoustic as well as language model training data, and that the improvement levels off at approximately 7M words of training data. This means that acoustic and linguistic variation of spontaneous speech is so large that we need a very large corpus in order to encompass the variations. Spectral analysis using various styles of utterances in the CSJ shows that the spectral distribution/difference of phonemes is significantly reduced in spontaneous speech compared to read speech. It has also been observed that speaking rates of both vowels and consonants in spontaneous speech are significantly faster than those in read speech. (C) 2005 Elsevier B.V. All rights reserved. C1 Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. RP Furui, S (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1 Ookayama, Tokyo 1528552, Japan.
EM furui@furui.cs.titech.ac.jp; masa@furui.cs.titech.ac.jp; tichiba@furui.cs.titech.ac.jp; iwano@furui.cs.titech.ac.jp CR DUEZ D, 1995, J PHONETICS, V23, P407, DOI 10.1006/jpho.1995.0031 Evermann G., 2004, P IEEE ICASSP MONTR, P249 Furui S, 2003, PATTERN RECOGNITION IN SPEECH AND LANGUAGE PROCESSING, P191 Furui S., 2003, P ISCA IEEE WORKSH S, P1 FURUI S, 2004, P INT S LARG SCAL KN, P1 Gauvain JL, 2003, PATTERN RECOGNITION IN SPEECH AND LANGUAGE PROCESSING, P149 ICHIBA T, 2004, P AC SOC JAP FALL M Kawahara T, 2003, P ISCA IEEE WORKSH S, P135 KAWAHARA T, 2004, P SPEC WORKSH MAUI S KAWAHARA T, 2001, P IEEE WORKSH AUT SP LUSSIER L, 2004, P 3 SPONT SPEECH SCI, P73 Maekawa K., 2003, P ISCA IEEE WORKSH S, P7 MAEKAWA K, 2004, P INT S LARG SCAL KN, P19 Maekawa K., 2002, P 7 INT C SPOK LANG, P1545 NAKAMURA M, 2004, P AC SOC JAP FALL M NANJO H, 2003, P IEEE WORKSH SPONT, P75 Sankar A, 2002, SPEECH COMMUN, V37, P133, DOI 10.1016/S0167-6393(01)00063-2 SCHWARTZ R, 2004, P IEEE INT C AC SPEE, V3, P753 Shinozaki T., 2003, P IEEE WORKSH AUT SP, P417 SHINOZAKI T, 2002, P IEEE INT C AC SPEE, P729 Shinozaki T., 2004, P INT ICSLP JEJ KOR, P1705 SHINOZAKI T, 2001, P EUROSPEECH AALB, V1, P491 SON RJJ, 1999, SPEECH COMMUN, V28, P125 UCHIMOTO K, 2003, P ISCA IEEE WORKSH S, P159 UEBERLA J, 1994, COMPUT SPEECH LANG, V8, P153, DOI 10.1006/csla.1994.1007 Venditti J., 1997, OSU WORKING PAPERS L, V50, P127 NR 26 TC 11 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 208 EP 219 DI 10.1016/j.specom.2005.02.010 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500018 ER PT J AU Tan, ZH Dalsgaard, P Lindberg, B AF Tan, ZH Dalsgaard, P Lindberg, B TI Automatic speech recognition over error-prone wireless networks SO SPEECH COMMUNICATION LA English DT Article DE distributed speech recognition; channel error robustness; out-of-vocabulary detection ID UTTERANCE VERIFICATION; RECOVERY TECHNIQUES; CHANNEL; COMMUNICATION; CONCEALMENT; MITIGATION; SYSTEM AB The past decade has witnessed a growing interest in deploying automatic speech recognition (ASR) in communication networks. Networks such as wireless networks present a number of challenges due to, e.g., bandwidth constraints and transmission errors. The introduction of distributed speech recognition (DSR) largely eliminates the bandwidth limitations, and the presence of transmission errors becomes the key robustness issue. This paper reviews the techniques that have been developed for ASR robustness against transmission errors. In the paper, a model of network degradations and robustness techniques is presented. These techniques are classified into three categories: error detection, error recovery and error concealment (EC). A one-frame error detection scheme is described and compared with a frame-pair scheme. As opposed to vector-level techniques, a technique for error detection and EC at the sub-vector level is presented. A number of error recovery techniques such as forward error correction and interleaving are discussed, in addition to a review of both feature-reconstruction and ASR-decoder based EC techniques. To enable the comparison of some of these techniques, evaluation has been conducted on the basis of the same speech database and channel. Special attention is given to the unique characteristics of DSR as compared to streaming audio, e.g. voice-over-IP.
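The burst-error wireless channels considered in this review are often simulated with the two-state Gilbert model (Gilbert, 1960, listed in the reference list below). The following Python sketch is illustrative only: the parameter values and the nearest-frame repetition used as a concealment baseline are assumptions for demonstration, not the specific schemes evaluated in the paper.

    import random

    def gilbert_channel(n_frames, p_g2b=0.05, p_b2g=0.3, p_loss=0.7, seed=0):
        # Two-state Gilbert model: a 'good' state with error-free reception
        # and a 'bad' state in which each frame is lost with probability
        # p_loss; the transition probabilities p_g2b and p_b2g control
        # how bursty the losses are.
        rng = random.Random(seed)
        bad = False
        received = []
        for _ in range(n_frames):
            bad = (rng.random() >= p_b2g) if bad else (rng.random() < p_g2b)
            received.append(not (bad and rng.random() < p_loss))
        return received

    # Simple error concealment: replace each lost feature frame by the
    # most recent correctly received one (nearest-frame repetition).
    frames = [[float(i)] for i in range(100)]   # stand-ins for feature vectors
    ok = gilbert_channel(len(frames))
    for i in range(1, len(frames)):
        if not ok[i]:
            frames[i] = frames[i - 1]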
Additionally, a technique for adapting ASR to the varying quality of networks is presented. The frame-error-rate is here used to adjust the discrimination threshold with the goal of optimising out-of-vocabulary detection. This paper concludes with a discussion of applicability of different techniques based on the channel characteristics and the system requirements. (C) 2005 Elsevier B.V. All rights reserved. C1 Aalborg Univ, CTIF, SMC, DK-9220 Aalborg, Denmark. RP Tan, ZH (reprint author), Aalborg Univ, CTIF, SMC, Niels Jernes Vej 12, DK-9220 Aalborg, Denmark. EM zt@kom.aau.dk; pd@kom.aau.dk; bli@kom.aau.dk RI Tan, Zheng-Hua/B-6889-2015 OI Tan, Zheng-Hua/0000-0001-6856-8928 CR Acero A, 1993, ACOUSTICAL ENV ROBUS [Anonymous], 2003, 202212 ETSI ES [Anonymous], 2000, 201108 ETSI ES BERNARD A, 2002, THESIS U CALIFORNIA BERNARD A, 2001, P ICASSP01 US MAY 20 Bernard A, 2002, IEEE T SPEECH AUDI P, V10, P570, DOI 10.1109/TSA.2002.808141 Besacier L., 2001, P IEEE MULT SIGN PRO BOSSERT M, 2000, CHANNEL CODING TELEC Boulis C, 2002, IEEE T SPEECH AUDI P, V10, P580, DOI 10.1109/TSA.2002.804532 CARDENALLOPEZ A, 2004, P ICASSP04 Carle G, 1997, IEEE NETWORK, V11, P24, DOI 10.1109/65.642357 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Cox RV, 2000, P IEEE, V88, P1314, DOI 10.1109/5.880086 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Digalakis VV, 1999, IEEE J SEL AREA COMM, V17, P82, DOI 10.1109/49.743698 ENDO T, 2003, P EUROSPEECH03 GEN S ETSI, 2002, 202050 ETSI ES *ETSI, 2003, 202211 ETSI ES EULER S, 1994, P ICASSP94 FINGSCHEIDT T, 2002, P ICSLP02 FINGSCHEIDT T, 2001, IEEE T SPEECH AUDIO, V9, P1 GILBERT EN, 1960, AT&T TECH J, V39, P1253 GOMEZ AM, 2004, P ROBUST2004 GOMEZ AM, 2003, P EUROSPEECH03 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Goyal VK, 2001, IEEE SIGNAL PROC MAG, V18, P74, DOI 10.1109/79.952806 GUNAWAN W, 2001, P ICASSP HAAVISTO P, 1999, P ROB METH SPEECH RE HAAVISTO P, 1998, P EUR SIGN PROC C IS HAEBUMBACH R, 2004, P ICSLP04 HO YC, 1999, IEEE CONTROL SYSTEMS, P8 HSU WH, 2004, P ICASSP04 Huang X., 2001, SPOKEN LANGUAGE PROC HUERTA J, 1998, P ICSLP98 HUERTA JM, 2000, THESIS CMU JAMES AB, 2004, P ICSLP04 JAMES AB, 2004, P ICASSP04 MONTR QUE KANAL LN, 1978, P IEEE, V66, P724, DOI 10.1109/PROC.1978.11013 KELLEHER H, 2002, P ICSLP02 DENV US KIM HK, 2000, P ICASSP00 TURK Kim HK, 2001, IEEE T SPEECH AUDI P, V9, P558 KIM MY, 2004, P ICSLP04 KISS I, 2000, P ICSLP00 BEIJ CHIN Lee LS, 2001, P IEEE, V89, P41 LEE M, 2004, P ICSLP04 LILLY BT, 1996, P ICSLP96 LINDBERG B, 2000, P ICSLP00 Lleida E, 2000, IEEE T SPEECH AUDI P, V8, P126, DOI 10.1109/89.824697 MAYORGA P, 2003, P ASRU03 VIRG ISL US MILNER B, 2000, P ICASSP00 TURK MILNER B, 2001, P ICASSP01 US MILNER BP, 2004, P ICSLP04 MILNER BP, 2003, P ICASSP03 PALIWAL KK, 2004, P ICSLP04 PEARCE D, 2004, P ROB 2004 NORW UK PEARCE D, 2000, P ICSLP00 BEIJ CHIN PEARCE D, 2000, P AVIOS00 SPEECH APP PEINADO AM, 2001, P EUROSPEECH 01 Peinado AM, 2003, SPEECH COMMUN, V41, P549, DOI 10.1016/S0167-6393(03)00048-7 Peinado AM, 2005, IEEE T WIREL COMMUN, V4, P14, DOI 10.1109/TWC.2004.840198 PELAEZMORENO C, 2001, IEEE T MULTIMEDI JUN Perkins C, 1998, IEEE NETWORK, V12, P40, DOI 10.1109/65.730750 POTAMIANOS A, 2001, P ICASSP01 US Rahim MG, 1997, IEEE T SPEECH AUDI P, V5, P266, DOI 10.1109/89.568733 RAMABADRAN T, 2004, P ICASSP04 MONTR QUE RAMAKRISHANA BR, 2000, THESIS CARNEGIE MELL RAMSEY JL, 1970, IEEE T INFORM THEORY, V16, P338, DOI 10.1109/TIT.1970.1054443 RISKIN EV, 2001, P 
EUROSPEECH01 ROSE RC, 2003, P ICASSP03 HONG KONG ROSE RC, 2004, P ROBUST2004 SCHULZRINNE H, 2003, RDC3550 Shannon E., 1948, BELL SYST TECH J, V27, P623 SKLAR B, 2004, IEEE SIGNAL PROC JUL, P14 SORIN A, 2004, P ICASSP04 SRINIVASAMURTHY N, 2001, P EUROSPEECH01 AALBO SRINIVASAMURTHY N, 2004, P ICASSP04 SUKKAR RA, 2002, P ICASSP02 TAN ZH, 2003, P ICASSP03 HONG KONG TAN ZH, 2002, P ICSLP02 DENV US Tan ZH, 2003, ELECTRON LETT, V39, P1619, DOI 10.1049/el:20031026 TAN ZH, 2004, P ICSLP04 JEJ ISL KO TAN ZH, 2004, P ROBUST2004 NORW UK TAN ZH, 2004, P ICASSP04 MONTR QUE VIIKKI O, 2001, P ASRU01 MAD DI CAMP WANG Y, 2002, P IEEE ICIP 2002 Wang Y, 1998, P IEEE, V86, P974 Weerackody V, 2002, IEEE T WIREL COMMUN, V1, P282, DOI 10.1109/7693.994822 WEERACKODY V, 2001, P IEEE INT C COMM 20 XIE Q, 2004, RTP PAYLOAD FORMATS YOMA NB, 1998, P ICASSP98 ZHONG X, 2002, P IEEE DIG SIGN PROC ZHU Q, 2001, P ICASSP01 2004, PACKET SWITCHED CONV 2004, RECOGNITION PERFORMA NR 94 TC 24 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP-OCT PY 2005 VL 47 IS 1-2 BP 220 EP 242 DI 10.1016/j.specom.2005.05.007 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 963FG UT WOS:000231788500019 ER PT J AU Hirose, K Hirst, D Sagisaka, Y AF Hirose, K Hirst, D Sagisaka, Y TI Quantitative prosody modelling for natural speech description and generation SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Univ Tokyo, Grad Sch Informat Sci & Technol, Dept Informat & Commun Engn, Bunkyo Ku, Tokyo 1130033, Japan. Univ Aix Marseille 1, CNRS, Lab Parole & Langage, F-13621 Aix En Provence, France. Waseda Univ, Grad Sch Global Informat & Telecommun Studies, Shinjuku Ku, Waseda 1690051, Japan. RP Hirose, K (reprint author), Univ Tokyo, Grad Sch Informat Sci & Technol, Dept Informat & Commun Engn, Bunkyo Ku, 7-3-1 Hongo, Tokyo 1130033, Japan. EM hirose@gavo.t.u-tokyo.ac.jp; daniel.hirst@lpl.univ-aix.fr; yoshinori.sagisaka@atr.jp NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 217 EP 219 DI 10.1016/j.specom.2005.05.006 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200001 ER PT J AU Xu, Y AF Xu, Y TI Speech melody as articulatorily implemented communicative functions SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE intonation; tone; pitch accents; pitch target; intonation model ID FUNDAMENTAL-FREQUENCY; MAXIMUM SPEED; PITCH ACCENTS; MANDARIN; TONE; INTONATION; PERCEPTION; LANGUAGE; PROSODY; FOCUS AB The understanding of speech melody, i.e., pitch variations related to tone and intonation, can be improved by simultaneously taking into consideration two basic facts: that speech conveys communicative meanings, and that it is produced by human articulators. Communicative meanings, as I will argue, are conveyed through a set of separate functions that are realized by an articulatory system with various biophysical properties. These properties make it unlikely that the melodic functions are encoded directly in terms of invariant surface acoustic forms. 
Rather, the encoding is likely done through the manipulation of a limited number of articulatorily operable parameters that may be considered as the phonetic primitives. Four such primitives can be recognized for speech melody: local pitch targets, pitch range, articulatory strength and duration. The values of the melodic primitives are specified by a set of encoding schemes, each associated with a particular communicative function. The encoding schemes are distinct from each other in the manner of controlling the melodic primitives, which allows multiple communicative functions to be conveyed in parallel. The communicative functions are ultimately converted to continuous, detailed surface acoustic patterns through an articulatory process of syllable-synchronized sequential target approximation, which takes the melodic primitives specified by the encoding schemes as the control parameters. This view of speech melody is summarized into a comprehensive model of tone and intonation, namely, the parallel encoding and target approximation (PENTA) model. (c) 2005 Elsevier B.V. All rights reserved. C1 UCL, Dept Phonet & Linguist, London NW1 2HE, England. Haskins Labs Inc, New Haven, CT USA. RP Xu, Y (reprint author), UCL, Dept Phonet & Linguist, Wolfon House,4 Stephenson Way, London NW1 2HE, England. EM yi@phon.ucl.ac.uk RI Xu, Yi/C-4013-2008 OI Xu, Yi/0000-0002-8541-2658 CR Alku P, 2002, SPEECH COMMUN, V38, P321, DOI 10.1016/S0167-6393(01)00072-3 ARVANITI A, 1998, J PHONETICS, V36, P3 Atterer M, 2004, J PHONETICS, V32, P177, DOI 10.1016/S0095-4470(03)00039-1 BLACK A, 1996, P INT C SPOK LANG PR BLEVINS J, 1993, LANGUAGE, V69, P237, DOI 10.2307/416534 BOLINGER D, 1972, LANGUAGE, V48, P633, DOI 10.2307/412039 Bolinger D., 1989, INTONATION ITS USES BOLINGER DL, 1964, HARVARD EDUC REV, V34, P282 Botinis A, 2000, TEXT SPEECH LANG TEC, V15, P97 Bruce Gosta, 1977, SWEDISH WORD ACCENTS Brungart D. S., 2002, P ICSLP 2002 DENV CO, P1641 CARTON F, 1976, ACCENT INSISTANCE EM CHAO YR, 1932, PRELIMINARY STUDY EN, P105 Chao Yuen Ren, 1968, GRAMMAR SPOKEN CHINE Chen Mathew Y., 2000, TONE SANDHI PATTERNS CHEN Y, IN PRESS PRODUCTION COHEN A, 1965, P 5 INT C AC LIEG A, P16 COHEN A, 1982, PHONETICA, V39, P254 Collier R., 1990, PERCEPTUAL STUDY INT Cooper W. 
E., 1981, FUNDAMENTAL FREQUENC COOPER WE, 1985, J ACOUST SOC AM, V77, P2142, DOI 10.1121/1.392372 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 Cruttenden Alan, 1997, INTONATION, V2nd D'Imperio Mariapaola, 2002, PROBUS, V14, P37, DOI 10.1515/prbs.2002.005 D'Imperio M, 2001, SPEECH COMMUN, V33, P339, DOI 10.1016/S0167-6393(00)00064-9 Duanmu San, 1994, PHONOLOGY, V11, P1, DOI 10.1017/S0952675700001822 EADY SJ, 1986, J ACOUST SOC AM, V80, P402, DOI 10.1121/1.394091 FAIRBANKS G, 1959, VOICE ARTICULATION D FRY DB, 1958, LANG SPEECH, V1, P126 Fujimura O, 2000, PHONETICA, V57, P128, DOI 10.1159/000028467 Fujisaki H, 2003, P WORKSH SPOK LANG P, P5 Fujisaki H., 1983, PRODUCTION SPEECH, P39 GANDOUR J, 1994, J PHONETICS, V22, P477 GARDING E, 1982, PHONETICA, V39, P288 Gordon M., 1999, THESIS UCLA Grice Martine, 2000, PHONOLOGY, V17, P143, DOI 10.1017/S0952675700003924 Gussenhoven C, 2002, P 1 INT C SPEECH PRO, P47 GUSSENHOVEN C, IN PRESS TOPIC FOCUS HASEGAWA Y, 1992, LANG SPEECH, V35, P87 Hirst Daniel, 1993, TRAVAUX I PHONETIQUE, V15, P75 HOLLIEN H, 1960, J SPEECH HEAR RES, V3, P157 HOLLIEN H, 1960, J SPEECH HEAR RES, V3, P150 HONOROF DN, J ACOUSTICAL SOC AM JIAO W, 2001, P 5 NATL C MOD PHOT, P328 Jin S., 1996, THESIS OHIO STATE U Kelso J.A.S., 1984, AM J PHYSIOL-REG I, V246, pR1000 Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X Kochanski G, 2003, SPEECH COMMUN, V41, P625, DOI 10.1016/S0167-6393(03)00100-6 Krahmer E, 2001, SPEECH COMMUN, V34, P391, DOI 10.1016/S0167-6393(00)00058-3 Krakow RA, 1999, J PHONETICS, V27, P23, DOI 10.1006/jpho.1999.0089 Ladd D. R., 1996, INTONATIONAL PHONOLO Ladd D. R., 1984, PHONOLOGY YB, V1, P53, DOI DOI 10.1017/S0952675700000294 Ladd DR, 2003, J PHONETICS, V31, P81, DOI 10.1016/S0095-4470(02)00073-6 Ladd DR, 2000, J ACOUST SOC AM, V107, P2685, DOI 10.1121/1.428654 Ladd DR, 1999, J ACOUST SOC AM, V106, P1543, DOI 10.1121/1.427151 Laniran YO, 2003, J PHONETICS, V31, P203, DOI 10.1016/S0095-4470(02)00098-0 Lehiste I., 1975, STRUCTURE PROCESS SP, P195 Li Yong, 2002, Pacific Rim Workshop on Transducers and Micro/Nano Technologies Liberman Mark, 1984, LANGUAGE SOUND STRUC, P157 LIN M, 1991, P 12 INT C PHON SCI, P242 LIN M, 1980, ZHONGGUO YUWEN, P74 LIN M, 1980, DIALECT, P166 Lin T., 1985, WORKING PAPERS EXPT, P1 LIU F, 2004, J ACOUST SOC AM, V115, P2397 MacNeilage PF, 1998, BEHAV BRAIN SCI, V21, P499 MAEDA S, 1976, CHARACTERIZATION AM MAN VCH, 2002, P 1 INT C SPEECH PRO, P467 Mixdorff H., 2004, P INT S TON ASP LANG, P137 Mozziconacci S, 2002, P 1 INT C SPEECH PRO, P1 MYERS S, 1999, P 14 INT C PHON SCI, P1981 MYERS S, 2004, P INT S TON ASP LANG, P147 MYERS S, 1999, PHONOLOGY, V15, P367 NAKAJIMA S, 1993, PHONETICA, V50, P197 NELSON WL, 1983, BIOL CYBERN, V46, P135, DOI 10.1007/BF00339982 OHALA JJ, 1983, PHONETICA, V40, P1 OHALA JJ, 1992, DIACHRONY SYNCHRONY, P308 OHALA JJ, 1973, J ACOUST SOC AM, V53, P345, DOI 10.1121/1.1982441 OHALA JJ, 1984, PHONETICA, V41, P1 OHALA JJ, 2002, P 7 INT C SPOK LANG, P2285 Ohala John J, 1981, PAPERS PARASESSION L, P178 Ohman S., 1967, WORD SENTENCE INTONA, P20 Peng S.-H., 2000, PAPERS LAB PHONOLOGY, VV, P152 PIEREHUMBERT J, 1988, JAPANESE TONE STRUCT PIERREHUMBERT J, 1981, J ACOUST SOC AM, V70, P985, DOI 10.1121/1.387033 PIERREHUMBERT J, 1990, SYS DEV FDN, P271 Pierrehumbert J., 2000, PROSODY THEORY EXPT, P11 Pierrehumbert Janet, 1980, THESIS MIT CAMBRIDGE Pike K. 
L., 1945, INTONATION AM ENGLIS Rose P.J., 1988, PROSODIC ANAL ASIAN, P55 Rump HH, 1996, LANG SPEECH, V39, P1 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SCHMIDT RC, 1990, J EXP PSYCHOL HUMAN, V16, P227, DOI 10.1037//0096-1523.16.2.227 SELKIRK E, 1990, PHONOLOGY-SYNTAX CONNECTION, P313 SELKIRK ELISABETH, 2002, P 1 INT C SPEECH PRO, P643 ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572 SHEN XS, 1992, ACTA LINGUIST HAF, V24, P131 Shih C., 1988, WORKING PAPERS CORNE, V3, P83 SHIH C, 1993, P 5 N AM C CHIN LING, P36 Shih C., 1992, P IRCS WORKSH PROS N, P193 Shih Chilin, 1986, THESIS U CALIFORNIA Shih CL, 2000, TEXT SPEECH LANG TEC, V15, P243 SHIHARA S, 2002, P TCP 2002 TOK, P165 SILVERMAN KEA, 1990, LAB PHON BETW GRAMM, P72 SPEER SR, 1989, LANG SPEECH, V32, P337 STEELE SA, 1986, J ACOUST SOC AM, V80, pS51, DOI 10.1121/1.2023842 Sun X., 2002, THESIS NW U SUNDBERG J, 1979, J PHONETICS, V7, P71 Swerts M, 1997, J ACOUST SOC AM, V101, P514, DOI 10.1121/1.418114 Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 THORSEN NG, 1980, J ACOUST SOC AM, V67, P1014, DOI 10.1121/1.384069 TITZE IR, 1979, J ACOUST SOC AM, V66, P60, DOI 10.1121/1.382973 UMEDA N, 1982, J PHONETICS, V10, P279 Van Heuven V. J., 1994, PHONOLOGICAL STRUCTU, P76 WANG WSY, 1967, J SPEECH HEAR RES, V10, P629 WICHMANN A, 2000, P ISCA WORKSH SPEECH WICHMANN A, 2002, P 1 INT C SPEECH PRO WU Z, 1982, ZHONGGUO YUWEN, P439 Wu Zong Ji, 1984, ZHONGGUO YUYAN XUEBA, V2, P70 Xiaonan Shen, 1990, PROSODY MANDARIN CHI Xu C. X., 2003, J INT PHON ASSOC, V33, P165, DOI 10.1017/S0025100303001270 XU CX, 1999, P 14 INT C PHON SCI, P2359 Xu Y, 2004, J ACOUST SOC AM, V116, P1168, DOI 10.1121/1.1763952 Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086 XU Y, 2002, P 1 INT C SPEECH PRO, P91 XU Y, 2003, P 15 INT C PHON SCI, P257 Xu Y, 1997, J PHONETICS, V25, P61, DOI 10.1006/jpho.1996.0034 XU Y, 1994, J ACOUST SOC AM, V95, P2240, DOI 10.1121/1.408684 Xu Y, 2001, PHONETICA, V58, P26, DOI 10.1159/000028487 Xu Y, 2004, P INT C SPEECH PROS, P81 Xu Y, 1993, THESIS U CONNECTICUT XU Y, IN PRESS PHONETIC RE Xu Y, 2004, PROCEEDINGS OF THE 1ST INTERNATIONAL CONFERENCE ON NEW FORMING TECHNOLOGY, P215 Xu Y., 2004, LANGUAGE LINGUISTICS, V5, P757 XU Y, 2005, J ACOUST SOC AM, V117, P2573 Xu Y., 2004, J ACOUSTICAL SOC A 2, V115, P2397 Xu Y, 1998, PHONETICA, V55, P179, DOI 10.1159/000028432 Xu Y, 2001, SPEECH COMMUN, V33, P319, DOI 10.1016/S0167-6393(00)00063-7 Xu Y, 2002, J ACOUST SOC AM, V111, P1399, DOI 10.1121/1.1445789 Yip M., 2002, TONE Yuan J., 2002, P 1 INT C SPEECH PRO, P711 Zee E., 1980, UCLA WORKING PAPERS, V49, P98 Zemlin WR., 1988, SPEECH HEARING SCI A ZHANG J, 2001, THESIS UCLA NR 143 TC 59 Z9 65 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
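One way to make the target approximation process described in the Xu abstract above concrete is a deliberately simplified first-order sketch (the PENTA model itself is specified in the paper and uses richer dynamics): within each syllable, surface F0 decays exponentially toward a linear underlying pitch target,

    f_0(t) = T(t) + (f_0(0) - T(0)) e^{-\lambda t},  with  T(t) = a t + b,  0 <= t < D.

Here T(t) is the local pitch target (static for a = 0, dynamic otherwise), \lambda grows with articulatory strength so that stronger articulation approaches the target faster, and D is the syllable duration. The value reached at t = D becomes the initial F0 of the next syllable, which gives the syllable-synchronized sequential behaviour; pitch range scales a and b.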
PD JUL PY 2005 VL 46 IS 3-4 BP 220 EP 251 DI 10.1016/j.specom.2005.02.014 PG 32 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200002 ER PT J AU Banziger, T Scherer, KR AF Banziger, T Scherer, KR TI The role of intonation in emotional expressions SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE emotion; expression; intonation; pitch-contour; expressive-speech ID VOCAL CUES; RIGHT-HEMISPHERE; SPEAKER AFFECT; COMMUNICATION; LANGUAGES; PROSODY AB The influence of emotions on intonation patterns (more specifically F0/pitch contours) is addressed in this article. A number of authors have claimed that specific intonation patterns reflect specific emotions, whereas others have found little evidence supporting this claim and argued that F0/pitch and other vocal aspects are continuously, rather than categorically, affected by emotions and/or emotional arousal. In this contribution, a new coding system for the assessment of F0 contours in emotion portrayals is presented. Results obtained for actor portrayed emotional expressions show that mean level and range of F0 in the contours vary strongly as a function of the degree of activation of the portrayed emotions. In contrast, there was comparatively little evidence for qualitatively different contour shapes for different emotions. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Geneva, FAPSE, Dept Psychol, CH-1205 Geneva, Switzerland. RP Banziger, T (reprint author), Univ Geneva, FAPSE, Dept Psychol, 40 Bv Pont Arve, CH-1205 Geneva, Switzerland. EM tanja.banziger@pse.unige.ch; klaus.scherer@pse.unige.ch CR Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 BANZIGER T, 2004, THESIS Boersma P., 1996, PRAAT SYSTEM DOING P Collier R., 1990, PERCEPTUAL STUDY INT Cruttenden A., 1986, INTONATION FERNALD A, 1993, CHILD DEV, V64, P657, DOI 10.1111/j.1467-8624.1993.tb02934.x Fernald A., 1991, ANN CHILD DEV, V8, P43 Fernald Anne, 1992, NONVERBAL VOCAL COMM, P262 Fonagy I., 1963, Z PHONETIK SPRACHWIS, V16, P293 FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412 Fujisaki H., 1988, VOCAL PHYSL VOICE PR, P347 Halliday M. A. 
K., 1970, COURSE SPOKEN ENGLIS HEILMAN KM, 1984, NEUROLOGY, V34, P917 Juslin PN, 2003, PSYCHOL BULL, V129, P770, DOI 10.1037/0033-2909.129.5.770 LADD DR, 1985, J ACOUST SOC AM, V78, P435, DOI 10.1121/1.392466 LEON PR, 1970, PROLEGOMENES LETUDE LIEBERMAN P, 1962, J ACOUST SOC AM, V34, P922, DOI 10.1121/1.1918222 McGory Julia, 2000, INT C SPOK LANG PROC MOZZICONACCI SJL, 1998, THESIS TU EINDHOVEN O'Connor John D., 1973, INTONATION COLLOQUIA PAKOSZ M, 1983, J PSYCHOLINGUIST RES, V12, P311 PAPOUSEK M, 1991, INFANT BEHAV DEV, V14, P415, DOI 10.1016/0163-6383(91)90031-M PATTERSON D, 1999, 14 INT C PHON SCI IC Pell MD, 1998, NEUROPSYCHOLOGIA, V36, P701, DOI 10.1016/S0028-3932(98)00008-6 Pierrehumbert J, 1980, PHONOLOGY PHONETICS ROSS ED, 1981, ARCH NEUROL-CHICAGO, V38, P561 SCHERER KR, 1984, J ACOUST SOC AM, V76, P1346, DOI 10.1121/1.391450 SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 SCHERER KR, 1985, J PSYCHOLINGUIST RES, V14, P409, DOI 10.1007/BF01067884 Schrober M, 2003, SPEECH COMMUN, V40, P99, DOI 10.1016/S0167-6393(02)00078-X Silverman K., 1992, INT C SPOK LANG PROC Snow D, 2000, DEV NEUROPSYCHOL, V17, P1, DOI 10.1207/S15326942DN1701_01 Uldall E. T., 1964, HONOUR D JONES, P271 VANLANCKER D, 1992, J SPEECH HEAR RES, V35, P963 WIGHTMAN CW, 2002, INT C SPEECH PROS 20 NR 36 TC 73 Z9 76 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 252 EP 267 DI 10.1016/j.specom.2005.02.016 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200003 ER PT J AU House, D AF House, D TI Phrase-final rises as a prosodic feature in wh-questions in Swedish human-machine dialogue SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE intonation; spontaneous speech; human-machine dialogue; question intonation; prosodic features ID INTONATION AB This paper examines the extent to which optional final rises occur in a set of 200 wh-questions extracted from a large corpus of computer-directed spontaneous speech in Swedish and discusses the function these rises may have in signalling dialogue acts and speaker attitude over and beyond an information question. Final rises occurred in 22% of the utterances, primarily in conjunction with final focal accent. Children exhibited the largest percentage of final rises (32%), with women second (27%) and men lowest (17%). The distribution of the rises in the material is examined and evidence relating to the final rise as a signal of a social interaction oriented dialogue act is gathered from the distribution. Two separate perception tests were carried out to test the hypothesis that high and late focal accent peaks in a wh-question are perceived as friendlier and more socially interested than low and early peaks. Generally, the results were consistent with these hypotheses when the late peaks were in phrase-final position. Finally, the results of this study are discussed in terms of pragmatic and attitudinal meanings and biological codes. (c) 2005 Elsevier B.V. All rights reserved. C1 KTH, Ctr Speech Technol, Dept Speech Mus & Hearing, S-10044 Stockholm, Sweden. RP House, D (reprint author), KTH, Ctr Speech Technol, Dept Speech Mus & Hearing, Lindstedtsvagen 24, S-10044 Stockholm, Sweden. 
EM davidh@speech.kth.se CR BELL L, 1999, P IDS 99 KLOST IRS G, P81 Bell L., 1999, P EUR 99 BUD, P1143 Bell L., 2003, THESIS KTH STOCKHOLM Beskow J., 2000, P ICSLP 2000 BEIJ CH, V4, P464 BHAGAT S, 2003, P 15 ICPHS BARC, P2961 BOLINGER D, 1989, INTONATION ITS USES BREDVADJENSEN AC, 1984, NORDIC PROSODY, V3, P31 BRUCE G, 1987, NORDIC PROSODY, V4, P41 BRUCE G, 1992, SPEECH COMMUN, V11, P453, DOI 10.1016/0167-6393(92)90050-H CARLSON R, 2002, P FON 2002 TMH QPSR, V1, P65 CASPERS J, 2003, P 15 ICPHS BARC, P1771 CERRATO L, 2002, P FON 2002 TMH QPSR, V1, P101 CHEN A, 2001, P EUR 2001 AALB DENM, P1403 Cruttenden A., 1986, INTONATION D'IMPERIO MARIAPAOLA, 1997, P EUR 97 RHOD GREEC, P251 Ferrer L., 2002, P INT C SPOK LANG PR, P2061 GARDING E, 1979, PHONETICA, V36, P207 Garding Eva, 1998, INTONATION SYSTEMS, P112 Gussenhoven C, 2002, P 1 INT C SPEECH PRO, P47 Gustafson J., 2002, THESIS KTH STOCKHOLM Gustafson J., 1999, P EUR 99 BUD, P1151 HADDINGKOCH K, 1964, PHONETICA, V11, P175 Hann J., 1999, P ESCA INT WORKSH DI, P35 Hirst D. J., 1998, INTONATION SYSTEMS S, P1 HORNE M, 1999, P ESCA INT WORKSH DI, P71 House D., 2003, P 15 INT C PHON SCI, P755 House David, 2002, P ICSLP 2002, P1957 Ishi C. T., 2003, P EUR 2003, P405 JILKA M, 2003, P 15 ICPHS BARC, P2549 Kohler K. J., 2004, TRADITIONAL PHONOLOG, P205 Ladd D., 1996, INTONATION PHONOLOGY OHALA JJ, 1983, PHONETICA, V40, P1 OHALA JJ, 1984, PHONETICA, V41, P1 SYRDAL AK, 2004, J ACOUST SOC AM 2, V115, pA2543 NR 34 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 268 EP 283 DI 10.1016/j.specom.2005.03.009 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200004 ER PT J AU Tseng, CY Pin, SH Lee, Y Wang, HM Chen, YC AF Tseng, CY Pin, SH Lee, Y Wang, HM Chen, YC TI Fluent speech prosody: Framework and modeling SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE prosodic phrase grouping; top-down; PG; prosodic hierarchy; multi-phrase; cross-phrase; constraints; templates; speech planning; look-ahead; global F-0 templates; temporal allocations; syllable duration patterns; intensity distribution; boundary breaks AB The prosody of fluent connected speech is much more complicated than concatenating individual sentence intonations into strings. We analyzed speech corpora of read Mandarin Chinese discourses from a top-down perspective on perceived units and boundaries, and consistently identified speech paragraphs of multiple phrases that reflected discourse rather than sentence effects in fluent speech. Subsequent cross-speaker and cross-speaking-rate acoustic analyses of identified speech paragraphs revealed systematic cross-phrase prosodic patterns in every acoustic parameter, namely, F-0 contours, duration adjustment, intensity patterns, and in addition, boundary breaks. We therefore argue for a higher prosodic node that governs, constrains, and groups phrases to derive speech paragraphs. A hierarchical multi-phrase framework is constructed to account for the governing effect, with complementary production and perceptual evidence.
We show how cross-phrase F-0 and syllable duration pattern templates are derived to account for the tune and rhythm characteristic of fluent speech prosody, and argue for a prosody framework that specifies phrasal intonations as subjacent sister constituents subject to higher terms. Output fluent speech prosody is thus the cumulative result of contributions from every prosodic layer. To test our framework, we further construct a modular prosody model of multiple-phrase grouping with four corresponding acoustic modules and begin testing the model with speech synthesis. To conclude, we argue that any prosody framework of fluent speech should include prosodic contributions above individual sentences in production, with consideration of its perceptual effects on on-line processing; and development of unlimited TTS could benefit most appreciably by capturing and including cross-phrase relationships in prosody modeling. (c) 2005 Published by Elsevier B.V. C1 Acad Sinica, Phonet Lab, Inst Linguist, Taipei, Taiwan. Acad Sinica, Inst Sci Informat, Taipei, Taiwan. RP Tseng, CY (reprint author), Acad Sinica, Phonet Lab, Inst Linguist, Taipei, Taiwan. EM cytling@sinica.edu.tw; whm@iis.sinica.edu.tw CR CHANG LP, 1995, P ICCPOL, P172 CHANG Y, 1998, CAHIERS LINGUISTIQUE, P51 Chao Yuen Ren, 1968, GRAMMAR SPOKEN CHINE Charpentier M. J., 1986, P ICASSP 86, P2015 CHEN K, 2004, P INT S CHIN SPOK LA, P173 FUJISAKI H, 2002, P SNLP O COCOSDA 200 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 GUSSENHOVE C, 2004, TOPIC FOCUS INTONATI HO AT, 1976, J CHINESE LINGUISTIC, V4, P1 KELLER E, 1996, YORK PAPERS LINGUIST, V17, P53 LIN MC, 2002, HANYU YUNLYU JIEGOU, P7 MIXDORFF H, 2000, P IEEE INT C AC SPEE, V3, P1281 Mixdorff H., 2004, P INT S TON ASP LANG, P137 Mixdorff H., 2003, P EUR 2003, P873 Selkirk E., 1986, PHONOLOGY YB, V3, P371 ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572 SHEN Jiong, 1985, BEIJING YUYIN SHIYAN, P73 Shih C., 1988, WORKING PAPERS CORNE, V3, P83 SHIH C, 2004, P INT S TON ASP LANG, P163 TSENG C, 2003, P OR COCOSDA 2003 Tseng C., 2004, TRADITIONAL PHONOLOG, P417 Tseng C., 2002, P 1 INT C SPEECH PRO, P667 Tseng C., 1999, P ICPHS 99, P2379 TSENG C, 2003, P ICPHS2003 Tseng C., 2004, P INT C SPEECH PROS, P251 XU Y, 2002, P 1 INT C SPEECH PRO, P91 YUAN JH, 2004, P INT S CHIN SPOK LA, P45 NR 27 TC 44 Z9 48 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 284 EP 309 DI 10.1016/j.specom.2005.03.015 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200005 ER PT J AU Mixdorff, H Pfitzinger, HR AF Mixdorff, H Pfitzinger, HR TI Analysing fundamental frequency contours and local speech rate in map task dialogs SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE Fujisaki model; perceptual local speech rate; F0 contours; map task AB The current paper reports first results from the analysis of task-oriented dialogs using a Fujisaki model-based parameterization of F0 contours, as well as a model of the perceptual local speech rate. Two versions of map task style dialogs were examined: (1) the recordings made during the map task proper, (2) readings from scripts of the original dialogs by the same subjects.
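For reference, the Fujisaki model parameterization used in this analysis superposes phrase and accent components on a speaker-specific baseline in the log-F0 domain (Fujisaki, 1984, cited in the reference list below). The standard formulation is

    ln F_0(t) = ln F_b + \sum_i A_{p,i} G_p(t - T_{0,i}) + \sum_j A_{a,j} [G_a(t - T_{1,j}) - G_a(t - T_{2,j})]

    G_p(t) = \alpha^2 t e^{-\alpha t}                      for t >= 0, and 0 otherwise
    G_a(t) = min[1 - (1 + \beta t) e^{-\beta t}, \gamma]   for t >= 0, and 0 otherwise

where F_b is the base frequency; A_p and T_0 are the magnitude and onset time of each phrase command; A_a, T_1 and T_2 are the amplitude, onset and offset of each accent command; \alpha and \beta are time constants of the phrase and accent control mechanisms; and \gamma is a ceiling on the accent component (typically 0.9). Fitting these commands to an observed contour yields the parameters compared across speaking styles in this paper.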
The first part of this paper presents an analysis of phrase boundaries with respect to form and function. A second issue is the problem of processing fillers, hesitations and repairs within the framework of the Fujisaki model-based analysis. The second part of the paper describes the comparative analysis of spontaneous and read versions of the same dialog fragments with respect to Fujisaki model parameters, contours of the perceptual local speech rate, and other features. In a perception test we asked listeners to identify the speaking style of dialog fragments. Apparently this was possible only for part of the data. Analysis of accent commands and perceptual local speech rate contours still suggested differences between the two speaking styles. The number of accented syllables, the associated accent commands' amplitudes, and the perceptual local speech rate were generally higher in the read than in the spontaneous utterances. These results were almost significant despite the fact that the read version had been well re-enacted by the subjects and therefore did not exactly exhibit typical reading style characteristics. Despite this drawback, the methodology presented here has strong potential for further comparative prosodic studies of speaking styles. (c) 2005 Elsevier B.V. All rights reserved. C1 TFH Berlin Univ Appl Sci, Dept Comp Sci & Media, D-13353 Berlin, Germany. Univ Munich, Dept Phonet & Speech Commun, D-80799 Munich, Germany. RP Mixdorff, H (reprint author), TFH Berlin Univ Appl Sci, Dept Comp Sci & Media, Luxemburger Str 10, D-13353 Berlin, Germany. EM mixdorff@tfh-berlin.de; hpt@phonetik.uni-muenchen.de CR ANDERSON AH, 1991, LANG SPEECH, V34, P351 Beckman M., 1997, COMPUTING PROSODY CO, P7 BLAAUW E, 1995, THESIS UTRECHT U UTR BROWN G, 1984, TEACHING TALK CAMPBELL WN, 2000, PROSODY THEORY EXPT, P281 CLASSEN K, 2000, MAP TASK VERSION DTS, P65 Eskenazi M., 1993, P EUROSPEECH 93, V1, P501 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 Horne M., 2000, PROSODY THEORY EXPT, P335 Isacenko A. V., 1964, UNTERSUCHUNGEN DTSCH Mixdorff H, 2001, P EUR C SPEECH COMM, V2, P947 MIXDORFF H, 2000, P IEEE INT C AC SPEE, V3, P1281 MIXDORFF H, 1995, P 13 ICPHS STOCKH SW, V2, P410 Mixdorff H., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1021099922328 Pfitzinger H. R., 1998, P ICSLP 98 SYDN, V3, P1087 PFITZINGER HARTMUT R., 1999, P 14 INT C PHON SCI, P893 Zacharias C., 1982, DTSCH SATZINTONATION NR 17 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 310 EP 325 DI 10.1016/j.specom.2005.02.019 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200006 ER PT J AU Carlson, R Hirschberg, J Swerts, M AF Carlson, R Hirschberg, J Swerts, M TI Cues to upcoming Swedish prosodic boundaries: Subjective judgment studies and acoustic correlates SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE prosodic boundaries; prosody perception AB Studies of perceptually based predictions of upcoming prosodic boundaries in spontaneous Swedish speech, both by native speakers of Swedish and by native speakers of standard American English, reveal marked similarity in judgments.
We examined whether Swedish and American listeners were able to predict the occurrence and strength of upcoming boundaries in a series of web-based perceptive experiments. Utterance fragments (in both long and short versions) were selected from a corpus of spontaneous Swedish speech, which was first labeled for boundary presence and strength by expert labelers. These fragments were then presented to listeners, who were instructed to guess whether or not they were followed by a prosodic break, and if so, what the strength of the break was. Results revealed that both Swedish and American listening groups were indeed able to predict whether or not a boundary (of a particular strength) followed the fragment. This suggests that acoustic and prosodic, rather than lexico-grammatical and semantic information was being used by listeners as a primary cue. Acoustic and prosodic correlates of these judgments were then examined, with significant correlations found between judgments and the presence/absence of final creak and phrase-final f0 level and slope. (c) 2005 Elsevier B.V. All rights reserved. C1 KTH, Royal Inst Technol, Dept Speech Mus & Hearing, SE-10044 Stockholm, Sweden. Columbia Univ, Dept Comp Sci, New York, NY 10027 USA. Tilburg Univ, Fac Arts, NL-5000 LE Tilburg, Netherlands. Univ Antwerp, Dept Linguist, B-2610 Antwerp, Belgium. RP Carlson, R (reprint author), KTH, Royal Inst Technol, Dept Speech Mus & Hearing, Lindstedsvagen 24,5th Floor, SE-10044 Stockholm, Sweden. EM rolf@speech.kth.se RI Swerts, Marc/C-8855-2013 CR AUBERGE V, 1997, P EUR C SPEECH COMM, P871 BARON D, 2002, ICSLP 2002, P949 BRUCE G, 1995, P ICPHS 95 BRUCE G, 1993, P ESCA WORKSH PROS CARLSON R, 2003, P ICPHS 03 CARLSON R, 2002, P FON 2002 TMH QPSR, P44 FANT G, 2000, INTONATION ANAL MODE FERRER L, 2002, ICSLP 2002 GEE JP, 1983, COGNITIVE PSYCHOL, V15, P411, DOI 10.1016/0010-0285(83)90014-2 GROSJEAN F, 1983, LINGUISTICS, V21, P501, DOI 10.1515/ling.1983.21.3.501 Hansson P., 2003, TRAVAUX I LINGUISTIQ HELDNER M, 2003, P ICPHS 03 KLATT DK, 1979, FRONTIERS SPEECH COM LEROY L, 1984, ANTWERP PAPERS LINGU, V40 LICKLEY RJ, 1999, P ICPHS SAT M DISFL, P23 PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770 SANDERMAN A, 1996, THESIS EINDHOVEN U T STRANGERT E, 2004, P SPEECH PROS 2004 N, P305 STRANGERT E, 1995, PHONEUM, V3, P85 SWERTS M, 1994, SPEECH COMMUN, V15, P79, DOI 10.1016/0167-6393(94)90043-4 VANHEUVEN VJ, 1997, ESCA WORKSH INT THEO, P317 WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 NR 22 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 326 EP 333 DI 10.1016/j.specom.2005.02.013 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200007 ER PT J AU Hirst, DJ AF Hirst, DJ TI Form and function in the representation of speech prosody SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE speech prosody; annotation; form; function AB The way in which prosody contributes to meaning is still, today, a poorly understood process corresponding to a mapping between two levels of representation, for neither of which there is any general consensus. 
It is argued that annotation of prosody generally consists in describing both prosodic function and prosodic form, but that it would be preferable to clearly distinguish the two levels. One elementary annotation system for prosodic function, IF-annotation, is, it has been argued, sufficient to capture at least those aspects of prosodic function which influence syntactic interpretation. The annotation of prosodic form can be carried out automatically by means of an F0 modelling algorithm, MOMEL, and an automatic coding scheme, INTSINT. The resulting annotation is under-determined by the IF-annotation, but defining mapping rules between representations of function and representation of form could provide an interesting means of establishing an enriched functional annotation system through analysis by synthesis. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Aix Marseille 1, CNRS, UMR 6057, F-13621 Aix En Provence, France. RP Hirst, DJ (reprint author), Univ Aix Marseille 1, CNRS, UMR 6057, 29 Ave Schuman, F-13621 Aix En Provence, France. EM daniel.hirst@lpl.univ-aix.fr CR Auran C., 2004, P 2 INT C SPEECH PRO, P561 AURAN C, 2003, MOMEL INTSINT PACKAG BOERSMA P, 1995, PRAAT SYSTEM DOING P BOUZON C, 2004, THESIS U PROVENCE BOUZON C, 2004, P 2 INT C SPEECH PRO, P223 CAMPIONE E, 2001, THESIS U PROVENCE AI Chan D., 1995, P EUR C SPEECH COMM, P867 Couper-Kuhlen E., 1986, INTRO ENGLISH PROSOD Cruttenden A., 1986, INTONATION Crystal D., 1969, PROSODIC SYSTEMS INT DICRISTO A, 1986, PHONETICA, V43, P11 DICRISTO A, 1997, INTONATION THEORY MO, P83 ESPESSER R, 1996, P 21 JOURN DET PAR A, P447 Grabe Esther, 2001, P PROS 2000, P51 Gussenhoven C., 2002, GLOT INT, V6, P271 Halliday M. A. K., 1967, INTONATION GRAMMAR B HAWKINS S, 1994, P 3 INT C SPOK LANG, V1, P57 HEID S, 1999, P 14 INT C PHON SCI, V1, P511 Hirst D., 1984, NEUEREN SPRACHEN, V83, P554 Hirst D. J., 1998, INTONATION SYSTEMS S, P1 Hirst D. J., 1993, TRAVAUX I PHONETIQUE, V15, P71 Hirst D. J., 2000, PROSODY THEORY EXPT, V14, P51 HIRST DJ, 1977, JANUA LINGUARUM SERI, V139 HIRST DJ, 2001, IMPROVEMENTS SPEECH, P320, DOI 10.1002/0470845945.ch32 HIRST DJ, 1988, AUTOSEGMENTAL STUDIE, P151 HIRST DJ, IN PRESS ANAL SYNTHE HIRST DJ, 1999, P ICSLP 99 HUCKVALE M, 2000, SPEECH FILING SYSTEM Jassem Wiktor, 1952, INTONATION CONVERSAT Jun Sun-Ah, 2005, PROSODIC TYPOLOGY PH LADD DR, 1996, CAMBRIDGE STUDIES LI, V79 MAGHBOULEH A, 1998, P ICSLP 98 MIXDORFF H, 1999, ICASSP 1999 O'Connor John D., 1973, INTONATION COLLOQUIA Ostendorf M., 2000, PROSODY THEORY EXPT, P263 Pierrehumbert J., 2000, PROSODY THEORY EXPT, P11 Scarna A, 2002, LANG COGNITIVE PROC, V17, P185, DOI 10.1080/0169096014300038 SILVERMAN K, 1992, P ICSLP BANFF CAN, V92, P867 WIGHTMAN C, 2002, P 1 INT C SPEECH PRO Wightman C. W., 1995, IEEE T SPEECH AUDIO WIGHTMAN CW, 2000, P ICSLP, V2, P7174 NR 41 TC 25 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
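To illustrate the annotation of prosodic form in the Hirst record above: INTSINT codes each MOMEL target point with one of eight symbols, the absolute tones T(op), M(id) and B(ottom), or the relative tones H(igher), S(ame), L(ower), U(pstepped) and D(ownstepped) with respect to the preceding target. A minimal Python sketch of symbol-to-F0 decoding follows; the scaling constants are illustrative assumptions, not Hirst's optimized values.

    def decode_intsint(symbols, key=150.0, span=1.0):
        # 'key' is the speaker's mid F0 in Hz; 'span' scales the pitch
        # range in octaves. Absolute symbols (T, M, B) jump to fixed
        # levels; relative symbols move with respect to the previous
        # target (H/L as geometric means with the range extremes,
        # U/D as small steps).
        top = key * 2 ** (0.5 * span)
        bot = key * 2 ** (-0.5 * span)
        f0, targets = key, []
        for s in symbols:
            if s == 'T':
                f0 = top
            elif s == 'M':
                f0 = key
            elif s == 'B':
                f0 = bot
            elif s == 'H':
                f0 = (f0 * top) ** 0.5
            elif s == 'L':
                f0 = (f0 * bot) ** 0.5
            elif s == 'U':
                f0 = f0 * 2 ** (0.1 * span)
            elif s == 'D':
                f0 = f0 * 2 ** (-0.1 * span)
            # 'S' leaves f0 unchanged
            targets.append(f0)
        return targets

    print(decode_intsint(['M', 'H', 'S', 'D', 'B']))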
PD JUL PY 2005 VL 46 IS 3-4 BP 334 EP 347 DI 10.1016/j.specom.2005.02.020 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200008 ER PT J AU Bailly, G Holm, B AF Bailly, G Holm, B TI SFC: A trainable prosodic model SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE intonation; prosodic modelling; automatic generation of prosody ID SPEECH SYNTHESIS; FRENCH AB This paper introduces a new model-constrained and data-driven system to generate prosody from metalinguistic information. This system considers the prosodic continuum as the superposition of multiple elementary overlapping multiparametric contours. These contours encode specific metalinguistic functions associated with various discourse units. We describe the phonological model underlying the system and the specific implementation made of that model by the trainable prosodic model described here. The way prosody is analyzed, decomposed and modelled is illustrated by experimental work. In particular, we describe the original training procedure that enables the system to identify the elementary contours and to separate out their contributions to the prosodic contours of the training data. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Grenoble 3, INPG, CNRS, UMR 5009,Inst Commun Parlee, F-38031 Grenoble, France. RP Bailly, G (reprint author), Univ Grenoble 3, INPG, CNRS, UMR 5009,Inst Commun Parlee, 46 Av Felix Viallet, F-38031 Grenoble, France. EM bailly@icp.inpg.fr; holm@icp.inpg.fr CR AGUERO PD, 2004, INT C SPOK LANG PROC AUBERGE V, 1993, PROSODY MODELING DYN, V41, P62 AUBERGE V, 1992, TALKING MACHINES THE, P307 Bachenko J., 1990, Computational Linguistics, V16 BAILLY G, 1997, COMPUTING PROSODY CO, P157 BAILLY G, 1989, SPEECH COMMUN, V8, P137, DOI 10.1016/0167-6393(89)90040-X Bailly G., 2002, CADERNOS ESTUDOS LIN, V43, P37 BALFOURIER JM, 2002, COLING, P36 BARBOSA P, 1994, SPEECH COMMUN, V15, P127, DOI 10.1016/0167-6393(94)90047-7 Barbosa P. A., 1997, PROGR SPEECH SYNTHES, P365 BARTKOVA K, 1987, SPEECH COMMUN, V6, P245, DOI 10.1016/0167-6393(87)90029-X Black A. W., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607872 BOERSMA P, 1996, 132 I PHON SCI U AMS Bolinger D., 1989, INTONATION ITS USES BRICHET C, 2004, JOURN DET PAR NANC F BUHMANN J, 2000, INT C SPOK LANG PROC, P179 CAMPBELL N, 1992, THESIS U SUSSEX BRIG CHEN GP, 2004, INT C CHIN SPOK LANG, P177 CUTLER A, 1991, P INT C PHON SCI AIX, P264 Di Cristo A, 2000, TEXT SPEECH LANG TEC, V15, P321 DUSTERHOFF KE, 1999, EUROSPEECH, P1627 Fant G., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607202 FONAGY L, 1984, FOLIA LINGUIST, V17, P153 Fujisaki H., 1971, Annual Report of the Engineering Research Institute, Faculty of Engineering, University of Tokyo, V30 FUJISAWA K, 1998, ESCA COCOSDA INT WOR GEE JP, 1983, COGNITIVE PSYCHOL, V15, P411, DOI 10.1016/0010-0285(83)90014-2 Gussenhoven C, 1999, LANG SPEECH, V42, P283 HIRST D, 1991, P 12 INT C PHON SCI, P234 Hirst D. J., 2000, PROSODY THEORY EXPT, V14, P51 HIRST DJ, 2003, INT C SPEECH PROS NA, P163 HOLM B, 1999, INT C PHON SCI SAN F, P1297 HOLM B, 2000, P INT C SPEECH LANG, P203 HOLM B, 2002, SPEECH PROSODY, P399 HOLM B, 2003, THESIS I NATL POLYTE Klatt D. 
H., 1979, FRONTIERS SPEECH COM, P287 LJOLJE A, 1986, IEEE T ACOUST SPEECH, V34, P1074, DOI 10.1109/TASSP.1986.1164948 Marchi L, 2001, STUD MUSIC, V30, P3 MARSI E, 1997, PROGR SPEECH SYNTHES, P477 MIXDORFF H, 2001, EUR C SPEECH COMM TE, P947 MONAGHAN AIC, 1992, INT C SPEECH LANG PR, P1159 Morlec Y, 2001, SPEECH COMMUN, V33, P357, DOI 10.1016/S0167-6393(00)00065-0 MORLEC Y, 1998, 1 INT C LANG RES EV NARUSAWA S, 2002, INT C AC SPEECH SIGN, P1281 Nespor M., 1986, PROSODIC PHONOLOGY OSHAUGHNESSY D, 1981, J PHONETICS, V9, P385 Pynte J, 1996, LANG COGNITIVE PROC, V11, P165, DOI 10.1080/016909696387259 RAIDT S, 2004, INT C SPEECH PROS NA, P417 Riley M., 1992, TALKING MACHINES THE, P265 Ross KN, 1999, IEEE T SPEECH AUDI P, V7, P295, DOI 10.1109/89.759037 SAGISAKA Y, 1990, IEEE INT C ACOUST SP, V1, P325 SCHREUDER M, 2004, INT C SPEECH PROS NA, P341 Scordilis M. S., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266404 SELKIRK E., 1984, PHONOLOGY SYNTAX SILVERMAN Kim E. A., 1992, ICSLP 1992, V2, P867 STROM V, 2002, INT C SPOK LANG PROC, P2081 Syrdal A., 2000, INT C SPOK LANG PROC, P235 TAYLOR P, 1999, EUROSPEECH, P1531 TESSER F, 2004, WORKSH SPEECH SYNTH, P185 TOURNEMIER S, 1997, P EUROSPEECH, P191 Traber C, 1992, TALKING MACHINES THE, P287 TROUVAIN J, 1998, ETRW WORKSH SPEECH S, P47 VANSANTEN JPH, 2002, INT C SPEECH PROS AI, P107 VANSANTEN JPH, 1992, TALKING MACHINES THE, P275 WIGHTMAN CW, 2000, INT C SPOK LANG PROC, P71 NR 64 TC 11 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 348 EP 364 DI 10.1016/j.specom.2005.04.008 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200009 ER PT J AU van Santen, J Kain, A Klabbers, E Mishra, T AF van Santen, J Kain, A Klabbers, E Mishra, T TI Synthesis of prosody using multi-level unit sequences SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN ID SPECTRAL DYNAMICS; DURATION AB Generating meaningful and natural sounding prosody is a central challenge in text-to-speech synthesis (TTS). In traditional synthesis, the challenge consists of how to generate natural target prosodic contours and how to impose these contours on recorded speech without causing audible distortions. In unit selection synthesis, the challenge is the sheer size of the speech corpus that is needed to cover all combinations of phone sequences and prosodic contexts that can occur in a given language. This paper describes new methods that are being explored, based on the principle of super-positional prosody transplant. Both methods are based on the following procedure. In a recorded, prosodically and phonemically labeled corpus, the log pitch contours are additively decomposed into component curves according to a prosodic hierarchy, typically phrase curves (corresponding to phrases), accent curves (corresponding to feet), and segmental perturbation (or residuals) curves. 
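The additive decomposition just described can be written compactly: in the log domain the observed contour is modelled as

    log F_0(t) = P(t) + \sum_i A_i(t) + R(t),

where P(t) is the phrase curve, A_i(t) is the accent curve associated with the i-th foot, and R(t) is the segmental perturbation (residual) curve. Transplant then recombines a phrase curve and accent curves extracted from different unit sequences and imposes their sum on the phonemically matching sequence; this restates the procedure in the abstract in equation form.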
During synthesis, the corpus is searched for multiple unit sequences: A unit sequence that covers the target phoneme string, and one or more unit sequences that cover the prosodic labels at a given phonological level (e.g., the foot or phrase) and are constrained by being matched to the phone match sequence in terms of the phonetic classes of the phonemes (or in terms of higher level entities, such as the number of feet and their sizes measured in syllables). The methods differ in terms of the level of detail of these constraints. A superpositional prosody transplant procedure generates a target pitch contour by extracting and recombining component curves from these sequences, and imposing this contour on the sequence that matches the phone string using standard speech modification methods. This process minimizes prosodic modification artifacts, optimizes the naturalness of the target pitch contour, yet avoids the combinatorial explosion of standard unit selection synthesis. (c) 2005 Elsevier B.V. All rights reserved. C1 Oregon Hlth & Sci Univ, OGI Sch Sci & Engn, Ctr Spoken Language Understanding, Portland, OR 97201 USA. RP van Santen, J (reprint author), Oregon Hlth & Sci Univ, OGI Sch Sci & Engn, Ctr Spoken Language Understanding, Portland, OR 97201 USA. EM vansanten@bme.ogi.edu CR BAAIJEN DR, 2000, WORD FREQUENCY DISTR BOUZON C, 2004, P SPEECH PROS 2004 N CHARPENTIER F, 1989, P EUROSPEECH 89, V2, P13 DODGE Y, 1981, ANAL EXPT MISSING DA Dutoit T., 1997, INTRO TEXT SPEECH SY Fujisaki H., 1983, PRODUCTION SPEECH, P39 Fujisaki H., 1988, VOCAL PHYSL VOICE PR, P347 KLABBERS E, 2004, P 5 ISCA SPEECH SYNT KLABBERS E, 2003, P EUR 2003 GEN SWITZ KLABBERS E, 2002, WORKSH SPEECH SYNTH MACON MW, 1996, THESIS GEORGIA TECH MOEBIUS B, 2001, 4 ISCA TUT RES WORKS MORLEC Y, 1996, P ICSLP 96 PHIL PA S OHMAN SEG, 1996, Q PROGR STATUS REPOR, V4, P1 Olive J.P., 1985, J ACOUST SOC AM S1, V78, pS6, DOI 10.1121/1.2022951 RAIDT S, 2004, P SPEECH PROS 2004 N RAUX A, 2003, P ASRU 2003 ST THOM SAKAI S, 2004, P 5 ISCA SPEECH SYNT THORSEN NG, 1980, J ACOUST SOC AM, V67, P1014, DOI 10.1121/1.384069 van Santen J., 1999, INTONATION ANAL MODE van Santen J. P. H., 2004, P 5 ISCA SPEECH SYNT VANSANTEN JPH, 1994, COMPUT SPEECH LANG, V8, P95, DOI 10.1006/csla.1994.1005 VANSANTEN J, 2003, P EUR 2003 GEN SWITZ VANSANTEN J, 1996, COMPUTING PROSODY VANSANTEN J, 2002, IEEE WORKSH SPEECH S VANSANTEN J, 1997, P EUR 1997 RHOD GREE VANSANTEN J, 1999, P EUR 1999 BUD HUNG VANSANTEN JPH, 1992, J ACOUST SOC AM, V92, P2444, DOI 10.1121/1.404554 VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5 VANSANTEN JPH, 1993, J MATH PSYCHOL, V37, P327, DOI 10.1006/jmps.1993.1022 Wouters J, 2002, J ACOUST SOC AM, V111, P428, DOI 10.1121/1.1428263 Wouters J, 2002, J ACOUST SOC AM, V111, P417, DOI 10.1121/1.1428262 NR 32 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL PY 2005 VL 46 IS 3-4 BP 365 EP 375 DI 10.1016/j.specom.2005.01.008 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200010 ER PT J AU Sagisaka, Y Yamashita, T Kokenawa, Y AF Sagisaka, Y Yamashita, T Kokenawa, Y TI Generation and perception of F-0 markedness for communicative speech synthesis SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE corpus-based speech synthesis; computational prosody modeling; conversational speech prosody; perceptual markedness; fundamental frequency control AB Aiming at natural F-0 control for conversational speech synthesis using attributes of constituent output words, F-0 characteristics are analyzed from both generation and perception viewpoints. We recorded commonly used two-phrase utterances consisting of Japanese adjective and adverb phrases expressing different degrees of markedness under designed conversational situations, and compared their F-0 characteristics. The comparison showed consistent F-0 control dependencies not only on adverbs themselves but also on the attributes of the following adjective phrases. Strong positive or negative correlation is observed between the markedness of adverbs and F-0 height when an adjective phrase showing positiveness or negativeness follows the current adverb phrase. These consistencies have been perceptually confirmed by naturalness evaluation tests using the same two-phrase samples with different F-0 heights. Finally, a computational model of conversational F-0 control is proposed using lexical information of adjectives showing positiveness or negativeness and adverbs expressing markedness. F-0 estimation experiments quantitatively showed the possibility of F-0 control for natural conversational speech synthesis using the attributes of constituent output words. (c) 2005 Elsevier B.V. All rights reserved. C1 Waseda Univ, Global Informat & Telecommun Inst, Shinjuku Ku, Tokyo 1690051, Japan. RP Sagisaka, Y (reprint author), Waseda Univ, Global Informat & Telecommun Inst, Shinjuku Ku, Nishi Waseda 1-3-10, Tokyo 1690051, Japan. EM sagisaka@giti.waseda.ac.jp CR Douglas-Cowie E, 2003, SPEECH COMMUN, V40, P1, DOI 10.1016/S0167-6393(02)00072-9 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 HIGUCHI N, 1997, PROGR SPEECH SYNTHES, P417 KOKENAWA Y, 2004, IPSJ SIG TECHNICAL R, P87 Riley M., 1992, TALKING MACHINES THE, P265 SAGISAKA Y, 1991, P ICPHS INT C PHON S, V3, P506 SAGISAKA Y, 1990, P ICASSP, P325 Tokuda K., 1999, P IEEE ICASSP, VI, P229 TRABER C, 1995, SVOX IMPLEMENTATION NR 9 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD JUL PY 2005 VL 46 IS 3-4 BP 376 EP 384 DI 10.1016/j.specom.2005.03.017 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200011 ER PT J AU Hirose, K Sato, K Asano, Y Minematsu, N AF Hirose, K Sato, K Asano, Y Minematsu, N TI Synthesis of F-0 contours using generation process model parameters predicted from unlabeled corpora: application to emotional speech synthesis SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE F-0 model; corpus-based generation; automatic extraction of model commands; emotional speech synthesis; HMM-based speech synthesis ID SYSTEM AB A corpus-based method of generating fundamental frequency (F-0) contours from text was developed for Japanese. Instead of directly predicting F-0 values, the method predicts command values of the F-0 contour generation process model using binary decision trees. Since the model controls the F-0 movement in units of a word or longer, sudden undulations, unlikely in natural utterances, can be avoided even in the case of erroneous prediction. The method includes a scheme of extracting the model commands from given F-0 contours, which makes it possible to prepare the corpora for training the binary decision trees automatically. Since the accuracy of the extracted model commands in the training corpora is crucial for the method, constraints are applied to the location of commands. Although the method can generate any speaking style if corpora of the styles are available, this paper is aimed at realizing three types of emotional speech (anger, joy, and sadness) besides calm speech. The mismatches between the predicted and target contours for angry speech were similar to those for calm speech. Synthesis of emotional speech was then conducted. Phoneme durations were predicted with a similar corpus-based method, and segmental features were generated using an HMM-based speech synthesizer. A perceptual experiment was conducted for the synthesized speech, and the result indicated that anger could be conveyed well by the developed method. The result was less satisfactory for joy and sadness. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Tokyo, Grad Sch Informat Sci & Technol, Dept Informat & Commun Engn, Bunkyo Ku, Tokyo 1130033, Japan. Univ Tokyo, Grad Sch Frontier Sci, Dept Frontier Informat, Bunkyo Ku, Tokyo 1130033, Japan. RP Hirose, K (reprint author), Univ Tokyo, Grad Sch Informat Sci & Technol, Dept Informat & Commun Engn, Bunkyo Ku, 7-3-1 Hongo, Tokyo 1130033, Japan. EM hirose@gavo.t.u-tokyo.ac.jp CR Black A. W., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat.
No.96TH8206), DOI 10.1109/ICSLP.1996.607872 Burkhardt F., 2000, P ISCA WORKSH SPEECH, P151 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 FUKUDA T, 1994, P INT C SPOK LANG PR, P723 HIRAI T, 1996, PROGR SPEECH SYNTHES, P333 HIROSE K, 2002, P INT C SPOK LANG PR, P2085 HIROSE K, 2003, P 15 INT C PHON SCI, V3, P2945 HIROSE K, 2004, P INT C SPOK LANG PR, P1349 Hirose K., 2004, P INT C SPEECH PROS, P417 HIROSE K, 2003, P EUR C SPEECH COMM, P333 HIROSE K, 1996, P INT C SPOK LANG PR, V1, P378, DOI 10.1109/ICSLP.1996.607133 HIROSE K, 2002, P INT C SPEECH PROS, P391 HIROSE K, 2001, P EUR C SPEECH COMM, P2255 HIROSE K, 1993, IEICE T FUND ELECTR, VE76A, P1971 Jokisch O., 2000, P ICSLP 2000 BEIJ, P645 KITAHARA Y, 1992, IEICE T FUND ELECTR, VE75A, P155 Kurohashi S., 1994, J COMPUT LINGUIST, V20, P507 Lee A., 2001, P EUR C SPEECH COMM, P1691 LJOLJE A, 1986, IEEE T ACOUST SPEECH, V34, P1074, DOI 10.1109/TASSP.1986.1164948 MATSUMOTO Y, 2000, ISPJ MAG, V41, P1208 Minematsu N, 2003, IEICE T INF SYST, VE86D, P550 Mixdorff H., 2001, P EUR 2001 AALB DENM, P947 MURRAY IR, 1995, SPEECH COMMUN, V16, P369, DOI 10.1016/0167-6393(95)00005-9 NARUSAWA N, 2002, P INT C AC SPEECH SI, P509 PIERREHUMBERT J, 1981, J ACOUST SOC AM, V70, P985, DOI 10.1121/1.387033 Ross K., 1994, P ESCA IEEE WORKSH S, P131 SAGISAKA Y, 1990, P ICASSP, P325 Sakurai A, 2003, SPEECH COMMUN, V40, P535, DOI 10.1016/S0167-6393(02)00177-2 Schroder M., 2001, P EUROSPEECH 2001 SE, P561 Silverman K., 1992, P INT C SPOK LANG PR, P867 TAYLOR P, 1998, P ICSLP, V4, P1383 TOKUDA K, 1999, P ICASSP, P229 Tsuzuki R, 2004, P INTERSPEECH 2004 I, P1185 Yamagishi J., 2003, P INTERSPEECH 2003 E, P2461 YOSHIMURA T, 1999, P EUR C SPEECH COMM, V5, P1691 ZOVATO E, 2004, P INT C SPOK LANG PR, P1897 NR 36 TC 19 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 385 EP 404 DI 10.1016/j.specom.2005.03.014 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200012 ER PT J AU Saitou, T Unoki, M Akagi, M AF Saitou, T Unoki, M Akagi, M TI Development of an F0 control model based on F0 dynamic characteristics for singing-voice synthesis SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE F0 fluctuation; singing-voice perception; F0 control model; singing-voice synthesis ID SOUNDS AB A fundamental frequency (F0) control model, which can cope with F0 dynamic characteristics related to singing-voice perception, is required to construct natural singing-voice synthesis systems. This paper discusses the importance of F0 dynamic characteristics in singing voices and demonstrates how strongly they influence singing-voice perception through psychoacoustic experiments. The paper then proposes an F0 control model that can generate F0 contours of singing voices based on these considerations, and a singing-voice synthesis system. The results show that several types of F0 fluctuation (overshoot, vibrato, preparation, and fine fluctuation) affect the perception and quality of a singing voice, and that overshoot has the greatest effect. Moreover, the results show that the proposed F0 control model can control F0 fluctuations, generate F0 contours of singing voices, and can be applied to natural singing-voice synthesis. (c) 2005 Elsevier B.V. All rights reserved.
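The fluctuation types named in the abstract above are commonly approximated by treating note-to-note F0 transitions as the response of a damped second-order system (yielding overshoot and preparation), with sinusoidal vibrato and random fine fluctuation added on top. The sketch below follows that common approximation; it is illustrative only, and none of its parameter values come from the paper.

```python
# Toy singing-F0 generator: second-order dynamics give overshoot at note
# changes; vibrato and fine fluctuation are added in the log-F0 domain.
import numpy as np

def note_f0(step_log_f0, fs=200, zeta=0.6, wn=35.0,
            vib_hz=5.5, vib_depth=0.02, noise=0.003, seed=0):
    rng = np.random.default_rng(seed)
    y, dy, dt, out = step_log_f0[0], 0.0, 1.0 / fs, []
    for target in step_log_f0:                 # step-wise note targets (log Hz)
        ddy = wn**2 * (target - y) - 2 * zeta * wn * dy  # damped 2nd order
        dy += ddy * dt
        y += dy * dt
        out.append(y)
    t = np.arange(len(out)) / fs
    contour = np.asarray(out) + vib_depth * np.sin(2 * np.pi * vib_hz * t)
    contour += noise * rng.standard_normal(len(out))     # fine fluctuation
    return np.exp(contour)

notes = np.repeat(np.log([262.0, 330.0, 392.0]), 200)    # C4 -> E4 -> G4
f0 = note_f0(notes)
```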
C1 JAIST, Sch Informat Sci, Nomi, Ishikawa 9231292, Japan. RP Saitou, T (reprint author), JAIST, Sch Informat Sci, 1-1 Asahidai, Nomi, Ishikawa 9231292, Japan. EM t-saitou@jaist.ac.jp CR Akagi M, 2000, P ICSLP2000, V3, P458 Akagi M, 1998, P ICSLP98 SYDN, V4, P1519 DECHEVEIGNE A, 2001, P EUROSPEECH2001, P2451 DEKROM G, 1995, P ICPHS 95, V1, P206 FUJISAKI H, 2000, P 5 SEM SPEECH PROD, P145 FUJISAKI H, 1981, VOCAL FIELD PHYSL, P347 Hakes J, 1987, J VOICE, V1, P326 HORII Y, 1989, J VOICE, V3, P151 ISHIZAKA K, 1972, AT&T TECH J, V51, P1233 Kawahara H., 1999, P EUR 99, P2781 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 MORI H, 2004, P ICA2004, P499 MORIYAMA T, 1996, P ASA ASJ 3 JOINT M, P1171 Myers D., 1987, J VOICE, V1, P157, DOI 10.1016/S0892-1997(87)80039-5 NAKAYAMA I, 2004, P ICA200J, P1295 NAKAYAMA I, 1996, J ACOUST SOC JPN, V52, P383 Press WH, 1988, NUMERICAL RECIPES C SEASHORE C, 1938, STUDIES PSYCHOL MUSI, V1 SUNDBERG J, 1987, SCI SINGING VOICES R, P163 SUNDBERG J, 1979, J PHONETICS, V7, P71 NR 21 TC 22 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 405 EP 417 DI 10.1016/j.specom.2005.01.010 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200013 ER PT J AU Hasegawa-Johnson, M Chen, K Cole, J Borys, S Kim, SS Cohen, A Zhang, T Choi, JY Kim, H Yoon, T Chavarria, S AF Hasegawa-Johnson, M Chen, K Cole, J Borys, S Kim, SS Cohen, A Zhang, T Choi, JY Kim, H Yoon, T Chavarria, S TI Simultaneous recognition of words and prosody in the Boston University Radio Speech Corpus SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE automatic speech recognition; prosody ID SEGMENTAL DURATIONS; NEURAL-NETWORK; STRESS AB This paper describes automatic speech recognition systems that satisfy two technological objectives. First, we seek to improve the automatic labeling of prosody, in order to aid future research in automatic speech understanding. Second, we seek to apply statistical speech recognition models of prosody for the purpose of reducing the word error rate of an automatic speech recognizer. The systems described in this paper are variants of a core dynamic Bayesian network model, in which the key hidden variables are the word, the prosodic tag sequence, and the prosody-dependent allophones. Statistical models of the interaction among words and prosodic tags are trained using the Boston University Radio Speech Corpus, a database annotated using the tones and break indices (ToBI) prosodic annotation system. This paper presents both theoretical and empirical results in support of the conclusion that a prosody-dependent speech recognizer (a recognizer that simultaneously computes the most-probable word labels and prosodic tags) can provide lower word recognition error rates than a standard prosody-independent speech recognizer in a multi-speaker speaker-dependent speech recognition task on radio speech. (c) 2005 Published by Elsevier B.V. C1 Univ Illinois, Beckman Inst, Urbana, IL 61801 USA. RP Hasegawa-Johnson, M (reprint author), Univ Illinois, Beckman Inst, Urbana, IL 61801 USA.
EM jhasegaw@uiuc.edu RI Chen, Ken/A-1074-2009; Cole, Jennifer/A-9961-2009 CR BATLINER A, 1997, P ESCA WORKSH INT DE, P39 Beckermann B, 1996, NUMER ALGORITHMS, V11, P1, DOI 10.1007/BF02142485 Beckman M. E., 1994, GUIDELINES TOBI LABE BORYS S, 2003, THESIS U ILLINOIS UR Charniak E., 1994, STAT LANGUAGE LEARNI CHAVARRIA S, P SPEECH PROS NAR JA CHEN K, IN PRESS IEEE T SPEE CHEN K, 2004, P ICASSP CHEN K, 2004, P SPEECHPR NAR JAP CHEN K, 2003, INT C SYST CYB INT S CHEN K, 2003, P EURO SPEECH GEN, P393 CHEN K, 2003, IEEE WORKSH AUT SPEE CHEN K, 2004, P SPEECH PROS NAR JA CHOI H, 2003, P TEX LING C U TEX A Cohen A., 2004, THESIS U ILLINOIS UR COLE J, 2003, INT C PHON SCI CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1553, DOI 10.1121/1.395911 DEJONG K, 1995, J ACOUST SOC AM, V89, P369 Dilley L, 1996, J PHONETICS, V24, P423, DOI 10.1006/jpho.1996.0023 Ferguson J. D., 1980, P S APPL HIDD MARK M, P143 Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858 GREENBERG S, 2001, NIST LARG VOC CONT S Hahn L. D., 1999, THESIS U ILLINOIS UR HASEGAWAJOHNSON M, 2004, HLT NAACL WORKSH LIN HIRAI T, 1997, PROGR SPEECH SYNTHES, P333 HIRSCHBERG J, 1998, P INT C SPOK LANG PR HOMBERT Jean-Marie, 1978, TONE LINGUISTIC SURV, P77 Katagiri S, 1998, P IEEE, V86, P2345, DOI 10.1109/5.726793 KENT RD, 1971, PHONETICA, V24, P23 KIM H, 2004, P SPEECHPR NAR JAP Kim SS, 2004, IEEE SIGNAL PROC LET, V11, P645, DOI 10.1109/LSP.2004.830114 Kim SS, 1998, NEUROCOMPUTING, V20, P253, DOI 10.1016/S0925-2312(98)00018-6 KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986 Kompe R, 1997, PROSODY SPEECH UNDER LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 LIU Y, 2003, P EUROSPEECH Omar MK, 2003, IEEE T SPEECH AUDI P, V11, P660, DOI 10.1109/TSA.2003.814457 OSTENDORF M, 2002, P ISCA TUT RES WORKS OSTENDORF M, 1997, COMPUTING PROSODY CO Ostendorf M., 1995, BOSTON U RADIO NEWS PITRELLI JF, 1994, P INT C SPOK LANG PR PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770 REN Y, 2004, P SPEECHPR NAR JAP SHRIBERG E, 2004, P SPEECHPR SILVERMAN K, 1992, P INT C SPOK LANG PR Sluijter AMC, 1997, J ACOUST SOC AM, V101, P503, DOI 10.1121/1.417994 Sonmez K., 1998, P INT C SPOK LANG PR, P3189 Stolcke A., 2003, P ICASSP, P608, DOI 10.1109/ICASSP.2003.1198854 Stolcke A., 1999, P EUROSPEECH, P307 Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 van Kuijk D, 1999, SPEECH COMMUN, V27, P95, DOI 10.1016/S0167-6393(98)00069-7 Vergyri D., 2003, P ICASSP WAIBEL A, 1989, IEEE T ACOUST SPEECH, V37, P328, DOI 10.1109/29.21701 Wightman CW, 1994, IEEE T SPEECH AUDI P, V2, P469, DOI 10.1109/89.326607 WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 YOON T, 2004, P INT C SPOK LANG PR YOUNG S, 2002, NTK BOOK ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 NR 60 TC 19 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
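A prosody-dependent recognizer of the kind described in the abstract above scores joint (word, prosodic tag) hypotheses rather than words alone. The minimal sketch below makes that concrete with invented toy probabilities and an exhaustive search; the actual systems use dynamic Bayesian networks and Viterbi decoding, which are not shown.

```python
# Toy joint decoding over (word, prosodic tag) pairs; all scores are invented.
import itertools

words = ["radio", "news"]
tags = ["accented", "unaccented"]
lm = {(("radio", "accented"), ("news", "unaccented")): 0.5}   # joint LM score
acoustic = {("radio", "accented"): 0.9, ("radio", "unaccented"): 0.4,
            ("news", "accented"): 0.3, ("news", "unaccented"): 0.8}

def score(tag_seq):
    joint = tuple(zip(words, tag_seq))
    p = lm.get(joint, 0.1)            # crude back-off for unseen pairings
    for pair in joint:
        p *= acoustic[pair]           # prosody-dependent acoustic score
    return p

best = max(itertools.product(tags, repeat=len(words)), key=score)
print(list(zip(words, best)))         # most probable (word, tag) pairing
```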
PD JUL PY 2005 VL 46 IS 3-4 BP 418 EP 439 DI 10.1016/j.specom.2005.01.009 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200014 ER PT J AU Zhang, JS Nakamura, S Hirose, K AF Zhang, JS Nakamura, S Hirose, K TI Tone nucleus-based multi-level robust acoustic tonal modeling of sentential F0 variations for Chinese continuous speech tone recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE Chinese lexical tones; tone recognition; anchoring-based discrimination; tone nucleus model; hypo and hyper-coarticulation ID CONTINUOUS MANDARINE SPEECH; INFORMATION; SYSTEM AB The complex F0 variations make it rather difficult to perform tone recognition of Chinese continuous speech. In this paper, we propose building robust tonal acoustic models by modeling F0 variations at different levels, ranging from segmental factors to tone co-articulation to the interplay effects among tonality, tone co-articulation and high-level prosodic events. First, we extract the tone nucleus of each tonal F0 contour in the continuous speech, and only use the features of the tone nucleus to estimate the tonal HMMs. This protects the tonal modeling from the influence of F0 transition loci at sub-syllable levels. Second, two techniques are adopted to model local tone co-articulation variations. Left and right context-dependent tri-tone HMMs estimated using tone nucleus features can model tone co-articulation effects, and anchoring-based, left- and right-directional normalized tonal F0 contours prove to be efficient tone-discriminating features. Third, we model the interplay effects of tones and high-level prosodic events by building so-called hypo- and hyper-co-articulation-based tonal HMMs. The whole approach achieved a significantly higher performance than the conventional method when applied to a speaker-dependent task. (c) 2005 Elsevier B.V. All rights reserved. C1 ATR Spoken Language Translat Res Labs, Kyoto 6190288, Japan. Univ Tokyo, Dept Frontier Informat, Bunkyo Ku, Tokyo 1130083, Japan. RP Zhang, JS (reprint author), ATR Spoken Language Translat Res Labs, 2-2-2 Keihanna, Kyoto 6190288, Japan. EM jinsong.zhang@atr.jp; satoshi.nakamura@atr.jp; hirose@gavo.t.u-tokyo.ac.jp CR CAO Y, 2000, P ICASSP IST TURK, P1610 CHANG PC, 1990, P 1990 IEEE C AC SPE, P517 Chao Yuen Ren, 1968, GRAMMAR SPOKEN CHINE CHEN SH, 1995, IEEE T SPEECH AUDI P, V3, P146 Fujisaki H., 1997, COMPUTING PROSODY CO, P27 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 GRANSTROM B, 1997, P ESCA WORKSH INT TH, P21 HOWIE JM, 1974, PHONETICA, V30, P129 LIN WY, 2004, P ICASSP MONTR CAN, V1, P933 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 LIU J, 1999, P EUR BUD HUNG, P891 PIERREHUMBERT JB, THESIS MIT Rabiner L, 1993, FUNDAMENTALS SPEECH SANTEN J, 1997, INTONATION MULTILING, P141 Secrest B. G., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing SHEN XS, 1990, J PHONETICS, V18, P281 SHIGENO S, 1979, JPN PSYCHOL RES, V21, P165 SHIH CL, 1988, TONE INTONATION MAND, P83 WANG CF, 1990, P ICSLP KOB JAP, V1, P221 Wang HM, 1997, IEEE T SPEECH AUDI P, V5, P195 WANG NL, 1988, J IEICE D, V71, P257 WANG YR, 1994, J ACOUST SOC AM, V96, P2637, DOI 10.1121/1.411274 WHALEN DH, 1992, PHONETICA, V49, P25 WU YD, 1991, J IEICE, V74, P1631 Xu C.
X., 2003, J INT PHON ASSOC, V33, P165, DOI 10.1017/S0025100303001270 XU SL, 1992, J CHIN ORIENT LANG I, V2 Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086 Xu Y, 1997, J PHONETICS, V25, P61, DOI 10.1006/jpho.1996.0034 XU Y, 1994, J ACOUST SOC AM, V95, P2240, DOI 10.1121/1.408684 Xu Y, 2002, J ACOUST SOC AM, V111, P1399, DOI 10.1121/1.1445789 YANG WJ, 1988, IEEE T ACOUST SPEECH, V36, P988, DOI 10.1109/29.1620 YU L, 1990, J IEICE, V73, P122 ZHANG JS, 2004, P ICSLP 2004 JEJ KOR, P809 ZHANG JS, 2004, P ICASSP 2003 HONG K, P776 Zhang JS, 2004, SPEECH COMMUN, V42, P447, DOI 10.1016/j.specom.2004.01.001 ZHANG JS, 2004, P INT C SPEECH PROS, P525 Zhang YQ, 2000, REFR SCI T, V2000, P111 Zhou L, 1996, IEICE T INF SYST, VE79D, P1570 ZU YQ, 1996, HKU96 PUTONGHUA CORP NR 39 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 440 EP 454 DI 10.1016/j.specom.2005.03.010 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200015 ER PT J AU Shriberg, E Ferrer, L Kajarekar, S Venkataraman, A Stolcke, A AF Shriberg, E Ferrer, L Kajarekar, S Venkataraman, A Stolcke, A TI Modeling prosodic feature sequences for speaker recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE prosody; automatic speaker recognition; speaker verification; support vector machines ID STYLE; READ AB We describe a novel approach to modeling idiosyncratic prosodic behavior for automatic speaker recognition. The approach computes various duration, pitch, and energy features for each estimated syllable in speech recognition output, quantizes the features, forms N-grams of the quantized values, and models normalized counts for each feature N-gram using support vector machines (SVMs). We refer to these features as "SNERF-grams" (N-grams of Syllable-based Nonuniform Extraction Region Features). Evaluation of SNERF-gram performance is conducted on two-party spontaneous English conversational telephone data from the Fisher corpus, using one conversation side in both training and testing. Results show that SNERF-grams provide significant performance gains when combined with a state-of-the-art baseline system, as well as with two highly successful long-range feature systems that capture word usage and lexically constrained duration patterns. Further experiments examine the relative contributions of features by quantization resolution, N-gram length, and feature type. Results show that the optimal number of bins depends on both feature type and N-gram length, but is roughly in the range of 5-10 bins. We find that longer N-grams are better than shorter ones, and that pitch features are most useful, followed by duration and energy features. The most important pitch features are those capturing pitch level, whereas the most important energy features reflect patterns of rising and falling. For duration features, nucleus duration is more important for speaker recognition than are durations from the onset or coda of a syllable. Overall, we find that SVM modeling of prosodic feature sequences yields valuable information for automatic speaker recognition. It also offers rich new opportunities for exploring how speakers differ from each other in voluntary but habitual ways. (c) 2005 Elsevier B.V. All rights reserved. 
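The SNERF-gram recipe summarized in the abstract above (per-syllable features, quantization into bins, N-grams of the quantized values, normalized counts fed to an SVM) can be sketched directly. The code below is a simplified illustration with invented data: uniform binning and a single feature stream stand in for the paper's richer duration, pitch, and energy feature set, and the combination with a baseline system is omitted.

```python
# Sketch of the quantize-and-count idea behind SNERF-grams.
import numpy as np
from collections import Counter
from sklearn.svm import LinearSVC

def snerf_counts(values, n_bins=8, max_n=3):
    """Quantize one per-syllable feature stream and count N-grams of bins."""
    edges = np.linspace(min(values), max(values), n_bins + 1)[1:-1]
    bins = np.digitize(values, edges)           # one symbol per syllable
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(bins) - n + 1):
            counts[tuple(bins[i:i + n])] += 1
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}   # normalized counts

# Toy usage: one F0 stream per "conversation side", then a speaker SVM.
rng = np.random.default_rng(1)
convs = [rng.normal(m, 10, 300) for m in (110, 160)]   # hypothetical F0 means
vocab = sorted({g for c in convs for g in snerf_counts(c)})
X = np.array([[snerf_counts(c).get(g, 0.0) for g in vocab] for c in convs])
clf = LinearSVC().fit(X, [0, 1])                       # target vs. impostor
```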
C1 Int Comp Sci Inst, Berkeley, CA 94704 USA. SRI Int, Menlo Pk, CA 94025 USA. Stanford Univ, EE Dept, Stanford, CA 94305 USA. RP Shriberg, E (reprint author), Int Comp Sci Inst, 1947 Ctr St, Berkeley, CA 94704 USA. EM ees@speech.sri.com CR Adami A.G., 2003, P IEEE INT C AC SPEE Adda-Decker M, 1999, SPEECH COMMUN, V29, P83, DOI 10.1016/S0167-6393(99)00032-1 Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 Barlow M. G., 1988, P AUSTR INT C SPEECH, P80 BLAAUW E, 1994, SPEECH COMMUN, V14, P359, DOI 10.1016/0167-6393(94)90028-0 Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555 CAMPBELL W, 2004, ADV NEURAL INFORMATI, V16 Dahan D, 1996, LANG SPEECH, V39, P341 Doddington G., 2001, P EUR, P2521 *ENTR RES LAB, 1993, ENTROPIC ESPS VERSIO Ferrer L., 2003, P EUR GEN, P2017 FISHER W, 1995, TSYLB2 SOURCE CODE A GADDEM VRR, 2000, P ICSLP2000, V1, P601 HAWKINS SR, 1997, THESIS AUSTR NATL U Joachims T., 1998, P EUR C MACH LEARN Johnson K., 1997, TALKER VARIABILITY S KAJAREKAR S, 2004, P ODYSS 04 SPEAK LAN, P51 KAJAREKAR S, 2003, P IEEE AUT SPEECH RE, P19 KAJAREKAR S, 2005, P IEEE ICASSP PHIL, P173 Laan GPM, 1997, SPEECH COMMUN, V22, P43, DOI 10.1016/S0167-6393(97)00012-5 LICKLEY RJ, 1994, THESIS U EDINBURGH MILLAR J, 1980, J ACOUSTICAL SOC S1, V67, P94 PERKELL J, 1997, P 134 M AC SOC AM SA Reynolds D., 2003, P ICASSP 03, VIV, P784 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds D.A., 2003, P IEEE ICASSP, P53 SHRIBERG E, 2004, P INT 2004 INT C SPO Sonmez K., 1998, P INT C SPOK LANG PR, P3189 Stolcke A., 2000, P NIST SPEECH TRANSC STRANGERT E, 1993, PHONUM, V2, P121 Sussman HM, 1998, PHONETICA, V55, P204, DOI 10.1159/000028433 TAJIMA K, 1998, P 6 C LAB PHON YORK VANDONZEL M, 1997, P 5 EUR C SPEECH COM Vapnik V., 1995, NATURE STAT LEARNING WEBER F, 2002, P IEEE INT C AC SPEE, V1, P141 WEINTRAUB M, 1996, P INT C SPOK LANG PR, P16 Yang Yiming, 1999, P 22 ANN INT ACM SIG, P42, DOI 10.1145/312624.312647 NR 37 TC 65 Z9 67 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 455 EP 472 DI 10.1016/j.specom.2005.02.018 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200016 ER PT J AU Granstrom, B House, D AF Granstrom, B House, D TI Audiovisual representation of prosody in expressive speech communication SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd International Conference on Speech Prosody CY MAR, 2004 CL Nara, JAPAN DE audiovisual prosody; multimodal communication; expressive speech; talking heads; animation ID DIALOGUE AB Prosody in a single speaking style (often read speech) has been studied extensively in acoustic speech. During the past few years we have expanded our interest in two directions: (1) prosody in expressive speech communication and (2) prosody as an audiovisual expression. Understanding the interactions between visual expressions (primarily in the face) and the acoustics of the corresponding speech presents a substantial challenge. Some of the visual articulation is for obvious reasons tightly connected to the acoustics (e.g. lip and jaw movements), but there are other articulatory movements that do not show up on the outside of the face.
Furthermore, many facial gestures used for communicative purposes do not affect the acoustics directly, but might nevertheless be connected on a higher communicative level in which the timing of the gestures could play an important role. In this presentation we will give some examples of recent work, primarily at KTH, addressing these questions. We will report on methods for the acquisition and modeling of visual and acoustic data, and some evaluation experiments in which audiovisual prosody is tested. The context of much of our work in this area is to create an animated talking agent capable of displaying realistic communicative behavior and suitable for use in conversational spoken language systems, e.g. a virtual language teacher. (c) 2005 Elsevier B.V. All rights reserved. C1 KTH, Ctr Speech Technol, Dept Speech Mus & Hearing, S-10044 Stockholm, Sweden. RP Granstrom, B (reprint author), KTH, Ctr Speech Technol, Dept Speech Mus & Hearing, Lindstedtsvagen 24, S-10044 Stockholm, Sweden. EM bjorn@speech.kth.se; davidh@speech.kth.se CR AGELFORS E, 1999, P AVSP 99 SANT CRUZ, P123 Bell L., 1999, P EUR 99 BUD, P1143 BESKOW J, 2003, THESIS TMH KTH BESKOW J, 2003, P ICPHS 2003 BARC SP Beskow J., 1997, P ESCA WORKSH AUD VI, P149 BESKOW J, 2000, P INSTIL 2000 Brennan S. E., 1990, THESIS STANFORD U ST CARLSON R, 1997, HDB PHONETIC SCI, P768 Cave C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607235 CLARK HH, 1989, COGNITIVE SCI, V13, P259, DOI 10.1207/s15516709cog1302_7 EDLUND J, 2002, P ISCA WORKSH MULT M FONAGY I, 1976, PHONETICA, V33, P31 GILL SP, 1999, P 3 INT COGN TECHN C, P345 Granstrom B., 1999, P INT C PHON SCI ICP, P655 Granstrom B., 2002, P SPEECH PROS 2002 C, P347 Gustafson J., 2002, THESIS KTH STOCKHOLM HIRSCHBERG J, 2001, P NAACL 2001 PITTSB House D., 2001, P EUR 2001, P387 HOUSE D, 2001, NORDIC PROSODY, V8, P127 House David, 2002, P ICSLP 2002, P1957 Krahmer E, 2002, SPEECH COMMUN, V36, P133, DOI 10.1016/S0167-6393(01)00030-9 Massaro D. W., 1998, PERCEIVING TALKING F Massaro DW, 1996, J ACOUST SOC AM, V100, P1777, DOI 10.1121/1.417342 NORDENBERG M, 2003, HESIS TMH KTH NORDSTRAND M, 2003, P AVSP 03 S JORIOZ F, P233 Nordstrand M, 2004, SPEECH COMMUN, V44, P187, DOI 10.1016/j.specom.2004.09.003 PARKS DA, 1982, GASTROENTEROLOGY, V2, P9 Pelachaud C, 1996, COGNITIVE SCI, V20, P1 Shimojima A, 2002, SPEECH COMMUN, V36, P113, DOI 10.1016/S0167-6393(01)00029-2 SICILIANO C, 2003, P 15 INT C PHON SCI Srinivasan RJ, 2003, LANG SPEECH, V46, P1 TRAUM DR, 1994, THESIS ROCHESTER NR 32 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2005 VL 46 IS 3-4 BP 473 EP 484 DI 10.1016/j.specom.2005.02.017 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 949QW UT WOS:000230804200017 ER PT J AU Fosler-Lussier, E Byrne, W Jurafsky, D AF Fosler-Lussier, E Byrne, W Jurafsky, D TI Special issue on pronunciation modeling and lexicon adaptation SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA. Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. Stanford Univ, Dept Linguist, Stanford, CA 94305 USA. RP Fosler-Lussier, E (reprint author), Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA. 
EM fosler@cse.ohio-state.edu NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2005 VL 46 IS 2 BP 117 EP 118 DI 10.1016/j.specom.2005.04.002 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 938XW UT WOS:000230037700001 ER PT J AU Adda-Decker, M de Mareuil, PB Adda, G Lamel, L AF Adda-Decker, M de Mareuil, PB Adda, G Lamel, L TI Investigating syllabic structures and their variation in spontaneous French SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Pronunciation Modeling and Lexicon Adaptation CY 2002 CL Estes Pk, CO DE pronunciation dictionaries; pronunciation variation; spontaneous speech recognition; reduction phenomena; syllabic restructuring; syllable deletion ID LANGUAGE; ENGLISH; SYSTEM AB The paper presents a study of syllabic structures and their variation in a large corpus of French radio interview speech. A further aim is to show how automatic speech recognition (ASR) systems can serve as a linguistic tool to consistently explore virtually unlimited speech corpora. Automatically selected subsets can be manually checked to accumulate knowledge on pronunciation variants. Our belief is that better formalised knowledge of pronunciation variant mechanisms can be obtained by analysing very large amounts of data, and will ultimately contribute to improving pronunciation modelling and ASR systems. This study is meant to be a step in this direction. The linguistic phenomena that we are particularly interested in are sequential variants (i.e. variants with different numbers of phonemes) which may or may not entail syllabic restructuring. These variants, frequent in spontaneous speech, are known to be particularly difficult for speech recognizers. To focus on sequential variants, a methodology has been set up using descriptions at the phonemic, syllabic and lexical levels. This study reports on a radio corpus composed of 30 one-hour shows of interviews. Spontaneous speech is found to have a larger proportion of closed syllables than is found in the canonical syllables derived from orthographic transcriptions. As can be expected, the optional schwa contributes to a large amount of variation in syllabic structure. Less well described phenomena are also observed, such as other vowels (/u/, /epsilon/, /i/ and /a/) being deleted in a non-final (unstressed) position. Unstressed CV syllables, when preceded by an open syllable, are likely to undergo syllabic restructuring: vowel deletion together with backward onset-coda transfer. Complex syllables tend to be simplified: liquid consonants have a tendency to be deleted, more often in coda than onset position. The most deletion-prone consonant is /v/, in both onset and coda positions. Finally, a substantial percentage of word-final schwa syllables may completely disappear and short function words are deletion-prone whatever the vowel identity. (c) 2005 Elsevier B.V. All rights reserved. C1 CNRS, Spoken Language Proc Grp, F-91403 Orsay, France. CNRS, LIMSI, Situated Percept Grp, F-91403 Orsay, France. RP Adda-Decker, M (reprint author), CNRS, Spoken Language Proc Grp, BP 133, F-91403 Orsay, France.
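Much of the analysis described in the abstract above reduces to mapping phone strings onto CV patterns and tallying open versus closed syllables in canonical versus realized pronunciations. The sketch below illustrates that bookkeeping with a naive toy syllabification and a placeholder phone alphabet; it is not the LIMSI tooling, and the real system works from a full French phone inventory and ASR alignments.

```python
# Toy syllable-structure tally: CV patterns plus open/closed counts.
from collections import Counter

VOWELS = set("aeiouyE@")   # placeholder phone alphabet, not the real inventory

def cv_pattern(syllable):
    return "".join("V" if p in VOWELS else "C" for p in syllable)

def tally(syllables):
    stats = Counter()
    for syl in syllables:
        pat = cv_pattern(syl)
        stats[pat] += 1
        stats["closed" if pat.endswith("C") else "open"] += 1
    return stats

print(tally(["s@", "men"]))   # canonical: CV + CVC -> one open, one closed
print(tally(["smen"]))        # schwa deleted -> restructured into one CCVC
```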
EM madda@limsi.fr; mareuil@limsi.fr; gadda@limsi.fr; lamel@limsi.fr CR Adda-Decker M, 1999, SPEECH COMMUN, V29, P83, DOI 10.1016/S0167-6393(99)00032-1 Adda-Decker M., 2003, P ICPHS 2003 BARC, P1329 CORBIN O, 2003, P ICPHS 2003 BARC, P2813 *CTR LANG SPEECH P, 1997, J HOPK U SUMM WORKSH CUTLER A, 1986, J MEM LANG, V25, P385, DOI 10.1016/0749-596X(86)90033-1 Dauses A., 1973, ETUDES INSTABLE FRAN Delattre P., 1965, COMPARING PHONETIC F Delattre P., 1966, STUDIES FRENCH COMP Dell F., 1973, REGLES SONS DEMAREUIL PB, 1997, THESIS U PARIS 11 OR DEMAREUIL PB, 1998, P 1 INT C LANG RES E, P641 Duez D., 2003, P IEEE ISCA WORKSH S Durand J, 2000, LANGUE FRANCAISE, P29 EGGS E, 1990, PHONETIQUE PHONOLOGI ENCREVE P, 1988, LIASON ENCHAINEMENT *ESCA, 1998, ESCA WORKSH MOD PRON Fougeron C., 2001, P EUR 2001 AALB, P639 FOURGERON C, 2002, P JEP NANC 2002, P125 FOWLER CA, 1993, J MEM LANG, V32, P115, DOI 10.1006/jmla.1993.1007 Gauvain JL, 2002, SPEECH COMMUN, V37, P89, DOI 10.1016/S0167-6393(01)00061-9 GOSLIN J, 1999, P 2 JOURN LING NANT, P75 Greenberg S., 2000, P ISCA WORKSH AUT SP, P195 Greenberg S., 2002, P HUM LANG TECHN C H *ISCA, 2002, PRON MOD LEX AD SPOK Kahn D., 1976, THESIS MIT LACHERETDUJOUR A, 1994, P ICSLP YOK, P1763 Ladefoged P., 1975, COURSE PHONETICS Leon P., 1993, PRECIS PHONOSTYLISTI Lucci V., 1983, ETUDE PHONETIQUE FRA MALMBERG B, 1975, PHONETIQUE Martinet Andre, 1971, PRONONCIATION FRANCA PALLIER C, 1994, THESIS EHESS PARIS PERENNOU G, 1987, BDLEX BASE DONNEES L SHRIBERG E, THESIS U CALIFORNIA Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 SU TT, 1998, P SPOSS BAUME AIX, P55 van Son R.J.J.H., 2003, P ICPHS BARC 2003, P2141 Verney Pleasants J., 1956, ETUDES MUET TIMBRE D VOGEL I, 1982, FENOMENI LINGUISTICI, V2 Walter Henriette, 1988, FRANCAIS TOUS SENS Walter Henriette, 1976, DYNAMIQUE PHONEMES L Wioland F., 1985, STRUCTURES SYLLABIQU NR 42 TC 23 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2005 VL 46 IS 2 BP 119 EP 139 DI 10.1016/j.specom.2005.03.006 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 938XW UT WOS:000230037700002 ER PT J AU Bellegarda, JR AF Bellegarda, JR TI Unsupervised, language-independent grapheme-to-phoneme conversion by latent analogy SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Pronunciation Modeling and Lexicon Adaptation CY 2002 CL Estes Pk, CO DE letter-to-sound mapping; pronunciation modeling; name transcription ID PRONUNCIATION; ENGLISH AB Automatic, data-driven grapheme-to-phoneme conversion is a challenging but often necessary task. The top-down strategy implicitly followed by traditional inductive learning techniques tends to dismiss relevant contexts when they have been seen too infrequently in the training data. The bottom-up philosophy inherent in pronunciation by analogy allows for a markedly better handling of unusual patterns, but also relies heavily on individual, language-dependent alignments between letters and phonemes. To avoid such supervision, this paper proposes an alternative solution, dubbed pronunciation by latent analogy, which adopts a more global definition of analogous events. For each out-of-vocabulary word, a neighborhood of globally relevant pronunciations is constructed through an appropriate data-driven mapping of its graphemic form. 
Phoneme transcription then proceeds via locally optimal sequence alignment and maximum likelihood position scoring. This method was successfully applied to the speech synthesis of proper names with a large diversity of origin. (c) 2005 Elsevier B.V. All rights reserved. C1 Apple Comp Inc, Speech & Language Technol, Cupertino, CA 95014 USA. RP Bellegarda, JR (reprint author), Apple Comp Inc, Speech & Language Technol, Cupertino, CA 95014 USA. EM jerome@apple.com CR Andersen O., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607954 Bagshaw PC, 1998, COMPUT SPEECH LANG, V12, P119, DOI 10.1006/csla.1998.0042 Bellegarda JR, 2000, IEEE T SPEECH AUDI P, V8, P76, DOI 10.1109/89.817455 BELLEGARDA JR, 2003, IEEE T SPEECH AUDI P, P11 Bellegarda JR, 2000, P IEEE, V88, P1279, DOI 10.1109/5.880084 BELLEGARDA JR, 1996, 1996 INT C AC SPEECH, P1172 Black A., 1998, P 3 ESCA WORKSH SPEE, P77 BYRNE W, 1998, 1998 P INT C AC SPEE, P313 CALLUM JK, 1985, LANCZOS ALGORITHMS L, V1, pCH5 DALLI A, 2002, P HUM LANG TECHN WOR, P341 Damper RI, 1999, COMPUT SPEECH LANG, V13, P155, DOI 10.1006/csla.1998.0117 Damper Robert I., 2001, P 4 INT WORKSH SPEEC, P97 DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9 GALESCU L, 2002, P ICSLP, P109 Galescu L., 2001, P 4 ISCA TUT RES WOR GOTOH Y, 1997, P 5 EUR C SPEECH COM, P1443 KIENAPPEL AK, 2001, P EUROSPEECH, P11911 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 Llitjos Ariadna Font, 2001, P EUROSPEECH, P1919 LUK RWP, 2001, DATA DRIVEN TECHNIQU, P91 MA CX, 2001, P EUROSPEECH, P1453 Marchand Y, 2000, COMPUT LINGUIST, V26, P195, DOI 10.1162/089120100561674 NGAN J, 1998, P ICSLP SYDN AUSTR N, P3285 Pagel V., 1998, P ICSLP, P2015 Ramabhadran B, 1998, INT CONF ACOUST SPEE, P309, DOI 10.1109/ICASSP.1998.674429 *SAMPA, 1987, STAND MACH READ ENC SUONTAUSTA J, 2000, P ICSLP, P831 Vingron M, 1996, CURR OPIN STRUC BIOL, V6, P346, DOI 10.1016/S0959-440X(96)80054-6 Yvon F, 1997, 35TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 8TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, PROCEEDINGS OF THE CONFERENCE, P428 NR 29 TC 13 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2005 VL 46 IS 2 BP 140 EP 152 DI 10.1016/j.specom.2005.03.002 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 938XW UT WOS:000230037700003 ER PT J AU Fosler-Lussier, E Amdal, I Kuo, HKJ AF Fosler-Lussier, E Amdal, I Kuo, HKJ TI A framework for predicting speech recognition errors SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Pronunciation Modeling and Lexicon Adaptation CY 2002 CL Estes Pk, CO DE automatic speech recognition; error prediction; pronunciation modeling; lexicon optimization; lexical adaptation ID PRONUNCIATION VARIATION; WORD AB Pronunciation modeling in automatic speech recognition systems has had mixed results in the past; one likely reason for poor performance is the increased confusability in the lexicon from adding new pronunciation variants. In this work, we propose a new framework for determining lexically confusable words based on inverted finite state transducers (FSTs); we also present experiments designed to test some of the implementation details of this framework. 
The method is evaluated by examining how well the algorithm predicts the errors in an ASR system. The model is able to generalize confusions learned from a training set to predict errors made by the speech recognizer on an unseen test set. (c) 2005 Elsevier B.V. All rights reserved. C1 Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA. Norwegian Univ Sci & Technol, N-7034 Trondheim, Norway. IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA. RP Fosler-Lussier, E (reprint author), Ohio State Univ, Dept Comp Sci & Engn, Columbus, OH 43210 USA. EM fosler@cse.ohio-state.edu CR Bilmes J, 2002, INT CONF ACOUST SPEE, P3916 CHASE L, 1997, THESIS CARNEGIE MELO CHEN Z, 2000, P INT C SPOK LANG PR, P493 Chomsky N., 1968, SOUND PATTERN ENGLIS Cucchiarini C, 1996, CLIN LINGUIST PHONET, V10, P131, DOI 10.3109/02699209608985167 DENG Y, 2003, P EUR GEN SWITZ SEPT, P929 FOSLERLUSSIER E, 1999, DARPA BROADC NEWS WO Fosler-Lussier E., 2002, ISCA TUT RES WORKSH Fosler-Lussier J. E., 1999, THESIS U CALIFORNIA Goel V, 2004, IEEE T SPEECH AUDI P, V12, P234, DOI 10.1109/TSA.2004.825678 GOLDBERGER J, 2003, ICCV, V1, P487 GREENBERG S, 1996, P 4 INT C SPOK LANG, pS24 GREENBERG S, 2000, P NIST SPEECH TRANSC HETHERINGTON L, 1995, P EUR C SPEECH COMM, P1645 HIRSCHBERG J, 1999, P AUT SPEECH REC UND Holter T, 1999, SPEECH COMMUN, V29, P177, DOI 10.1016/S0167-6393(99)00036-9 Jakobson R., 1952, 13 MIT AC LAB Kessens JM, 1999, SPEECH COMMUN, V29, P193, DOI 10.1016/S0167-6393(99)00048-5 Kuo H. J., 2002, P INT C AC SPEECH SI, P325 LIVESCU K, 2000, P INT C AC SPEECH SI, P1683 Mangu L, 2000, COMPUT SPEECH LANG, V14, P373, DOI 10.1006/csla.2000.0152 McAllister D., 1998, P 5 INT C SPOK LANG, P1847 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 MOHRI M, 1998, P ICASSP SEATTL WA, V2, P665, DOI 10.1109/ICASSP.1998.675352 MOU X, 2000, P EUR ALB DENM, P451 *NAT I STAND TECHN, 2001, SCLITE SCOR SOFTW Pitrelli J.F., 1995, PHONEBOOK NYNEX ISOL POTAMIANOS A, 2000, P INT C SPOK LANG PR, P603 Press W. H., 1999, NUMERICAL RECIPES C Printz H, 2002, COMPUT SPEECH LANG, V16, P131, DOI 10.1006/csla.2001.0188 RILEY M, 1998, ESCA TUT RES WORKSH, P109 Riley M. D., 1991, P INT C AC SPEECH SI, P737, DOI 10.1109/ICASSP.1991.150446 SCHAAF T, 1997, P IEEE INT C AC SPEE, P875 SCHRAMM H, 2002, ISCA TUT RES WORKSH Slobada T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607274 SPROAT R, 1998, MULTILINGUAL TEXT SP Voiers W. D., 1983, Speech Technology, V1 WESTER M, 2000, P INT C SPOK LANG PR, P270 WILLIAMS DAG, 1999, THESIS U SHEFFIELD S ZHOU Q, 1997, P EUR RHOD GREEC, P621 NR 40 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
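The framework described in the abstract above determines lexically confusable words by inverting finite state transducers. As a deliberately simplified stand-in for that machinery, the sketch below ranks confusability with a plain phone-level edit distance over a toy lexicon; all pronunciations are hypothetical, and the real system's FST inversion and learned confusions are not reproduced here.

```python
# Simplified confusability proxy: phone-level edit distance between entries.
def edit_distance(a, b):
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)]
         for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[len(a)][len(b)]

lexicon = {"wreck": "r eh k".split(),
           "recognize": "r eh k ax g n ay z".split(),
           "red": "r eh d".split()}

def confusable(word, threshold=1):
    pron = lexicon[word]
    return [w for w, p in lexicon.items()
            if w != word and edit_distance(pron, p) <= threshold]

print(confusable("wreck"))   # -> ['red'] with these toy pronunciations
```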
PD JUN PY 2005 VL 46 IS 2 BP 153 EP 170 DI 10.1016/j.specom.2005.03.003 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 938XW UT WOS:000230037700004 ER PT J AU Hain, T AF Hain, T TI Implicit modelling of pronunciation variation in automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Pronunciation Modeling and Lexicon Adaptation CY 2002 CL Estes Pk, CO DE automatic speech recognition; pronunciation modelling; acoustic modelling; hidden markov models; pronunciation dictionaries; single pronunciations; parameter tying; phonetic decision trees; state clustering; conversational speech recognition; Hidden Model Sequence Models AB Modelling of pronunciation variability is an important task for the acoustic model of an automatic speech recognition system. Good pronunciation models contribute to the robustness and generic applicability of a speech recogniser. Usually pronunciation modelling is associated with a lexicon that allows explicit control over the selection of appropriate HMMs for a particular word. However, the use of data-driven clustering techniques or specific parameter tying techniques has considerable impact on this form of model selection and the construction of a task-optimal dictionary. Most large vocabulary speech recognition systems make use of a dictionary with multiple possible pronunciation variants per word. By the manual addition of pronunciation variants, explicit human knowledge is used in the recognition process. For reasons of complexity, the optimisation of manual entries for performance is often not feasible. In this paper, a method for the stepwise reduction of the number of pronunciation variants per word to one is described. By doing so in a way consistent with the classification procedure, pronunciation variation is modelled implicitly. It is shown that the use of single pronunciation dictionaries provides similar or better word error rate performance on both Wall Street Journal and Switchboard data. The use of single pronunciation dictionaries in conjunction with Hidden Model Sequence Models as an example of an implicit pronunciation modelling technique shows further improvements. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Hain, T (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England. EM th223@eng.cam.ac.uk CR BAHL LR, 1991, P ICASSP 91, V1, P177 Bates R. A., 2002, P ISCA TUT RES WORKS, P42 BELLEGARDA JR, 1990, IEEE T ACOUST SPEECH, V38, P2033, DOI 10.1109/29.61531 BYRNE W, 1998, P ICASSP 98, V1, P313, DOI 10.1109/ICASSP.1998.674430 CREMELIE N, 1997, P EUROSPEECH 97, P2459 FINKE M, 1997, P EUR, V5, P2379 FOSLER E, 1996, P ICSLP 9L GAUVAIN JL, 1994, P ARPA SPOK LANG TEC, P125 GREENBERG S, 1996, 1996 LVCSR SUMM WORK Greenberg S., 1998, P ESCA WORKSH MOD PR, P47 HAIN T, 2001, P ICASSP 01, P57 HAIN T, 1999, P ICASSP, P57 HAIN T, 1999, P EUROSPEECH 99, V3, P1327 HAIN T, 2001, THESIS CAMBRIDGE U HAIN T, 2000, P 2000 NIST SPEECH T Huang X.
D., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90020-X HUMPHRIES JJ, 1997, THESIS CAMBRIDGE U HWANG MY, 1993, P IEEE ICASSP 93 MIN, V2, P311 LUO X, 1999, P INT C AC SPEECH SI, P2044 MA K, 1998, 9 CONV SPEECH REC WO NOCK HJ, 1998, P ESCA WORKSH MOD PR, P85 Ostendorf M., 1999, P IEEE AUT SPEECH RE, V1, P79 PALLETT D, 1994, P ARPA WORKSH SPOK L, P3 PRINTZ H, 2000, P ISCA ITRW ASR 2000 Riley M, 1999, SPEECH COMMUN, V29, P209, DOI 10.1016/S0167-6393(99)00037-0 Saraclar M, 2000, COMPUT SPEECH LANG, V14, P137, DOI 10.1006/csla.2000.0140 STOLCKE A, 2000, P 2000 NIST SPEECH T Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 WEINTRAUB M, 1996, P INT C SPOK LANG PR, pS16 WOODLAND PC, 2002, CUHTK APRIL 2002 SWI WOODLAND PC, 1995, P ARPA WORKSH SPOK L, P104 WOOTERS C, 1994, P INT C SPOKEN LANGU, V3, P1363 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 YOUNG SJ, 1994, COMPUT SPEECH LANG, V8, P369, DOI 10.1006/csla.1994.1019 NR 34 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2005 VL 46 IS 2 BP 171 EP 188 DI 10.1016/j.specom.2005.03.008 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 938XW UT WOS:000230037700005 ER PT J AU Hazen, TJ Hetherington, IL Shu, H Livescu, K AF Hazen, TJ Hetherington, IL Shu, H Livescu, K TI Pronunciation modeling using a finite-state transducer representation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Pronunciation Modeling and Lexicon Adaptation CY 2002 CL Estes Pk, CO ID SPEECH RECOGNITION AB The MIT SUMMIT speech recognition system models pronunciation using a phonemic baseform dictionary along with rewrite rules for modeling phonological variation and multi-word reductions. Each pronunciation component is encoded within a finite-state transducer (FST) representation whose transition weights can be trained using an EM algorithm for finite-state networks. This paper explains the modeling approach we use and the details of its realization. We demonstrate the benefits and weaknesses of the approach both conceptually and empirically using the recognizer for our JUPITER weather information system. Our experiments demonstrate that the use of phonological rewrite rules within our system achieves word error rate reductions between 4% and 9% over different test sets when compared against a system using no phonological rewrite rules. (c) 2005 Elsevier B.V. All rights reserved. C1 MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA. RP Hazen, TJ (reprint author), MIT, Comp Sci & Artificial Intelligence Lab, 32 Vassar St, Cambridge, MA 02139 USA. EM hazen@csail.mit.edu CR CHURCH KW, 1983, THESIS MIT CAMBRIDGE DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Eisner Jason, 2002, P 40 ANN M ASS COMP, P1 Gillick L., 1989, P ICASSP, P532 GLASS J, 1999, P ICASSP PHOEN AZ MA, P61 Glass JR, 2003, COMPUT SPEECH LANG, V17, P137, DOI 10.1016/S0885-2308(03)00006-8 Greenberg S, 1999, SPEECH COMMUN, V29, P159, DOI 10.1016/S0167-6393(99)00050-3 Hain T., 2002, P ISCA TUT RES WORKS, P129 HAZEN TJ, 2002, P ISCA TUT RES WORKS, P99 HETHERINGTON L, 2001, P EUR AALB DENM SEPT, P1599 Jurafsky D., 2001, P 2001 IEEE INT C AC, P577 Lamel L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.606916 LIVESCU K, 2000, P IEEE INT C AC SPEE, P1842 McAllister D., 1998, P 5 INT C SPOK LANG, P1847 Pereira FCN, 1997, LANG SPEECH & COMMUN, P431 Riley M, 1999, SPEECH COMMUN, V29, P209, DOI 10.1016/S0167-6393(99)00037-0 Seneff S, 2004, SPEECH COMMUN, V42, P373, DOI 10.1016/j.specom.2003.11.001 SENEFF S, 2002, P ISCA WORKSH PRON M, P71 SHAFRAN I, 2001, THESIS U WASHINGTON Shu H., 2002, P 7 INT C SPOK LANG, P1293 Strom N., 1999, P IEEE AUT SPEECH RE, P139 TAJCHMAN C, 1995, P 4 EUR C SPEECH COM, P2247 YI J, 2000, P INT C SPOK LANG PR, P322 YI J, 2002, P ICSLP 2002 DENV SE, P2617 Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 25 TC 10 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2005 VL 46 IS 2 BP 189 EP 203 DI 10.1016/j.specom.2005.03.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 938XW UT WOS:000230037700006 ER PT J AU Seneff, S Wang, C AF Seneff, S Wang, C TI Statistical modeling of phonological rules through linguistic hierarchies SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Pronunciation Modeling and Lexicon Adaptation CY 2002 CL Estes Pk, CO DE pronunciation modeling; phonological modeling; speech recognition; dialogue systems ID SPEECH RECOGNITION; PRONUNCIATION VARIATION AB This paper describes our research aimed at acquiring a generalized probability model for alternative phonetic realizations in conversational speech. For all of our experiments, we utilize the SUMMIT landmark-based speech recognition framework. The approach begins with a set of formal context-dependent phonological rules, applied to the baseforms in the recognizer's lexicon. A large speech corpus is phonetically aligned using a forced recognition procedure. The probability model is acquired by observing specific realizations expressed in these alignments. A set of context-free rules is used to parse words into substructure, in order to generalize context-dependent probabilities to other words that share the same sub-word context. The model maps phones to sub-word units probabilistically in a finite state transducer framework, capturing phonetic predictions based on local phonemic, morphologic, and syllabic contexts. We experimented within two domains: the MERCURY flight reservation domain and the JUPITER weather domain. The baseline system used the same set of phonological rules for lexical expansion, but with no probabilities for the alternates. We achieved 14.4% relative reduction in concept error rate for JUPITER and 16.5% for MERCURY. (c) 2005 Elsevier B.V. All rights reserved. C1 MIT, Comp Sci Lab, Spoken Language Syst Grp, Stata Ctr, Cambridge, MA 02139 USA. RP Seneff, S (reprint author), MIT, Comp Sci Lab, Spoken Language Syst Grp, Stata Ctr, 32 Vassar St, Cambridge, MA 02139 USA. 
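The probability model described in the abstract above attaches likelihoods to context-dependent phonological rewrites. The sketch below shows the general shape of such rules (a phoneme plus left/right context mapping to weighted surface variants); the rules and probabilities are invented for illustration, and the real system composes weighted FSTs and conditions on morphologic and syllabic structure rather than sampling variants.

```python
# Toy probabilistic phonological rewrite: phoneme + context -> surface phone.
import random

RULES = {
    ("t", "V", "V"): [("dx", 0.7), ("t", 0.3)],   # intervocalic flapping
}

def context(phonemes, i, is_vowel):
    if i < 0 or i >= len(phonemes):
        return "#"                                # word boundary symbol
    return "V" if is_vowel(phonemes[i]) else phonemes[i]

def realize(phonemes, is_vowel):
    out = []
    for i, p in enumerate(phonemes):
        left = context(phonemes, i - 1, is_vowel)
        right = context(phonemes, i + 1, is_vowel)
        alts = RULES.get((p, left, right), [(p, 1.0)])   # default: unchanged
        phones, probs = zip(*alts)
        out.append(random.choices(phones, weights=probs)[0])
    return out

print(realize("b ah t er".split(), lambda p: p in {"ah", "er"}))  # 'butter'
```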
EM seneff@csail.mit.edu CR Cremelie N, 1999, SPEECH COMMUN, V29, P115, DOI 10.1016/S0167-6393(99)00034-5 CHANG J, 1997, P EUR RHOD GREEC OCT, P1199 Chung G., 2003, P HLT NAACL EDM CAN, P32 CHUNG G, 2002, P INT C SPOK LANG PR, P2061 CHUNG G, 2000, P ICSLP 2000 BEIJ CH, P266 COHEN MH, 1989, THESIS U CALIFORNIA GAUVAIN JL, 1993, P EUR, P125 Glass JR, 2003, COMPUT SPEECH LANG, V17, P137, DOI 10.1016/S0885-2308(03)00006-8 GLASS JR, 1998, P 5 INT C SPOK LANG, P1327 Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858 Greenberg S, 1999, SPEECH COMMUN, V29, P159, DOI 10.1016/S0167-6393(99)00050-3 Halberstadt A., 1998, P ICSLP, P995 HAZEN TJ, 2002, P ISCA WORKSH PRON M, P99 Hetherington I. L., 2001, P EUR 2001 ALLB DENM, P1599 KAHN D, 1980, SYLLABLE BASED GENER LIVESCU L, 2001, P 7 EUR C SPEECH COM, P1437 McAllister D., 1998, P 5 INT C SPOK LANG, P1847 SENEFF S, 2000, P 6 INT C SPOK LANG, P142 SENEFF S, 2002, P ISCA WORKSH PRON M, P71 SENEFF S, 2003, P 8 EUR C SPEECH COM, P749 Seneff S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607049 SENEFF S, 1998, P 5 INT C SPOK LANG, P3321 Seneff S., 2000, P ANLP NAACL 2000 SA, P1 Shu H., 2002, P 7 INT C SPOK LANG, P1293 Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 TAJCHMAN G, 1995, P EUR C SPEECH COMM, P2247 WEINTRAUB M, 1989, P IEEE INT C ASSP GL, P699 ZUE VW, 1983, SPEECH COMMUN, V2, P181, DOI 10.1016/0167-6393(83)90023-7 NR 28 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2005 VL 46 IS 2 BP 204 EP 216 DI 10.1016/j.specom.2005.03.005 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 938XW UT WOS:000230037700007 ER PT J AU Park, J Ko, H AF Park, J Ko, H TI Effective acoustic model clustering via decision-tree with supervised learning SO SPEECH COMMUNICATION LA English DT Article DE acoustic modeling; decision-tree; large vocabulary continuous speech recognition ID CONTINUOUS SPEECH RECOGNITION; HIDDEN MARKOV-MODELS AB In large vocabulary speech recognition, context-dependent modeling is essential for improving both accuracy and speed. To cope with the sparse data problem that arises from the proliferation of context-dependent models, two kinds of clustering methods, data-driven and rule-based, have been vigorously investigated. The inherent difficulty of applying data-driven approaches to unknown contexts has motivated the development of better rule-based clustering methods. This paper develops a hybrid approach that essentially constructs a supervised decision rule which operates on pre-clustered triphones. This scheme employs the C4.5 decision-tree learning algorithm to extract the attributes that best support clustering of training data. In particular, the data-driven method is used as a clustering algorithm, while its result is used as the learning target of the C4.5 algorithm. The proposed scheme provides an effective solution to the clustering error problem arising from unsupervised decision-tree learning and also renders successful clustering of the multiple mixture Gaussian state distributions. In speaker-independent, task-independent continuous speech recognition, the proposed method achieved a relative WER reduction of 3.93%. (c) 2005 Elsevier B.V. All rights reserved. C1 Korea Univ, Dept Elect & Comp Engn, ISPL, Seoul 136701, South Korea.
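The hybrid scheme described in the abstract above can be sketched in a few lines: a data-driven clustering supplies cluster labels for seen triphones, and a decision tree is then trained, supervised, on phonetic-context attributes to reproduce those labels, so unseen contexts can still be classified. In the toy below, scikit-learn's CART tree stands in for C4.5, and the attributes, training triphones, and cluster labels are all invented.

```python
# Supervised decision-tree mapping from context attributes to cluster labels.
from sklearn.tree import DecisionTreeClassifier

def attrs(left, right):
    # Binary phonetic-class questions about the left/right context phones.
    nasal, vowel = {"m", "n"}, {"a", "i", "u"}
    return [left in nasal, left in vowel, right in nasal, right in vowel]

# (left context, right context, cluster id from a prior data-driven clustering)
train = [("m", "n", 0), ("n", "m", 0), ("a", "i", 1), ("i", "u", 1), ("m", "a", 2)]
X = [attrs(l, r) for l, r, _ in train]
y = [c for _, _, c in train]
tree = DecisionTreeClassifier().fit(X, y)

# An unseen triphone context falls through the learned rules to a cluster.
print(tree.predict([attrs("n", "u")]))
```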
RP Ko, H (reprint author), Korea Univ, Dept Elect & Comp Engn, ISPL, 5ka 1 Anamdong, Seoul 136701, South Korea. EM jhpark@ispl.korea.ac.kr; hsko@korea.ac.kr CR Aubert X., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.606918 BAHL LR, 1994, P ICASSP 94, V1 Boulianne G., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607126 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P281, DOI 10.1109/89.506931 HAMMING RW, 1986, CODING INFORMATION T, P50601 HWANG MY, 1992, P DARPA WORKSH SPEEC, P174, DOI 10.3115/1075527.1075566 Hwang MY, 1993, IEEE T SPEECH AUDI P, V1, P414 Kwon OW, 2003, SPEECH COMMUN, V39, P287, DOI 10.1016/S0167-6393(02)00031-6 LEE KF, 1990, IEEE T ACOUST SPEECH, V38, P599, DOI 10.1109/29.52701 Ming J, 1999, COMPUT SPEECH LANG, V13, P195, DOI 10.1006/csla.1999.0120 Mitchell T.M., 1997, MACHINE LEARNING NOCK HJ, 1997, P EUROSPEECH 97 Odell J.J., 1995, THESIS U CAMBRIDGE QUINLAN JR, 1990, IEEE T SYST MAN CYB, V20, P339, DOI 10.1109/21.52545 Reichl W, 2000, IEEE T SPEECH AUDI P, V8, P555, DOI 10.1109/89.861375 Reichl W, 1998, INT CONF ACOUST SPEE, P801, DOI 10.1109/ICASSP.1998.675386 WOODLAND PC, 1992, P DARPA CONT SPEECH, P71 WOODLAND PC, 1994, P ARPA SLT WORKSH YOUNG SJ, 1994, P ARPA WORKSH HUM LA YOUNG SJ, 1992, P ICASSP 92 NR 20 TC 6 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2005 VL 46 IS 1 BP 1 EP 13 DI 10.1016/j.specom.2004.12.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 930NY UT WOS:000229424100001 ER PT J AU Akande, OO Murphy, PJ AF Akande, OO Murphy, PJ TI Estimation of the vocal tract transfer function with application to glottal wave analysis SO SPEECH COMMUNICATION LA English DT Article DE glottal volume velocity; formant estimation; inverse filtering ID SPEECH ANALYSIS; FORM AB A new method for determining the vocal tract transfer function for voiced speech is proposed. The method exploits the frequency domain characteristics of voiced speech and the concepts of minimum-phase systems. The short-time spectrum of voiced speech contains information about the glottal source and vocal tract filter components. In voiced speech there is usually a small frequency gap between the glottal source peak frequency response and the first formant of the vocal tract. The inherent inability of linear prediction parametric modelling to discriminate between closely spaced frequency components can make accurate modelling/estimation of the first formant (in the presence of a close and elevated peak due to the glottal flow) in voiced speech very difficult. A fixed pre-emphasis (single-pole, high-pass filter), commonly used in existing inverse filtering methods to reduce the influence of the glottal source, is not guaranteed to give the desired result across a range of voice types, e.g. breathy, pressed and modal. The proposed method overcomes this problem by suppressing the glottal wave contribution using a dynamic, multi-pole, zero-phase lag high-pass filter prior to analysis. In addition to minimising the influence of the glottal source, an expanded analysis region is provided in the form of a pseudo-closed phase.
The technique also takes into cognizance the time varying nature of the vocal tract filter by determining, adaptively, an optimum vocal tract filter function using the properties of minimum phase systems. The performance of the new method is evaluated using synthesized and real speech. The results show that, under certain conditions, the method estimates the vocal tract formants and bandwidths more accurately than an existing closed phase inverse filtering technique. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Limerick, Dept Elect & Comp Engn, Limerick, Ireland. RP Murphy, PJ (reprint author), Univ Limerick, Dept Elect & Comp Engn, Limerick, Ireland. EM olatunji.akande@ul.ie; peter.murphy@ul.ie CR ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R Ananthapadmanabha T. V., 1982, SPEECH COMMUN, V1, P167, DOI 10.1016/0167-6393(82)90015-2 ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 Backstrom T, 2002, IEEE T SPEECH AUDI P, V10, P186, DOI 10.1109/TSA.2002.1001983 BOZKURT B, 2003, P ISCA VOIC QUAL C G Childers D. G., 1999, SPEECH PROCESSING SY FANT G, 1995, SPEECH T LAB Q REP R, V2, P121 FANT G, 1986, STL QPSR, V4, P1 GOBL C, 1988, STL QPSR, V2, P23 HEDELIN P, 1988, IEEE INT C AC SPEECH, V1, P339 HIKI S, 1976, P IEEE INT C AC SPEE, V1, P613 Hu H. T., 2000, P NAT SCI COUNC ROC, V24, P134 JIANG Y, 2002, P 7 INT C SPOK LANG, V20, P2073 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 KRISHNAMURTHY AK, 1986, IEEE T ACOUST SPEECH, V34, P730, DOI 10.1109/TASSP.1986.1164909 LAURI E, 1998, P 24 C INT ASS LOG P, V1, P53 Plumpe MD, 1999, IEEE T SPEECH AUDI P, V7, P569, DOI 10.1109/89.784109 QUATIERI T, 2002, DISCRETE TIME SIGNAL WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260 NR 19 TC 20 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2005 VL 46 IS 1 BP 15 EP 36 DI 10.1016/j.specom.2005.01.007 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 930NY UT WOS:000229424100002 ER PT J AU Kirchhoff, K Vergyri, D AF Kirchhoff, K Vergyri, D TI Cross-dialectal data sharing for acoustic modeling in Arabic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; acoustic modeling; Arabic; dialectal variation AB Many of the world's languages have a multitude of dialects which differ considerably from each other in their linguistic properties. Dialects are often spoken rather than written varieties; the development of automatic speech recognition systems for dialects therefore requires the collection and transcription of large amounts of dialectal speech. In those cases where sufficient training data is not available, acoustic and/or language models may benefit from additional data from different though related dialects. In this study we investigate the feasibility of cross-dialectal data sharing for acoustic modeling using two different varieties of Arabic, Modern Standard Arabic and Egyptian Colloquial Arabic. An obstacle to this type of data sharing is the Arabic writing system, which lacks short vowels and other phonetic information. We address this problem by developing automatic procedures to restore the missing information based on morphological, contextual and acoustic knowledge. 
These procedures are evaluated with respect to the relative contributions of different knowledge sources and with respect to their effect on the overall recognition system. We demonstrate that cross-dialectal data sharing leads to significant reductions in word error rate. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Washington, Dept Elect Engn, Seattle, WA 98195 USA. SRI Int, Menlo Pk, CA 94720 USA. RP Kirchhoff, K (reprint author), Univ Washington, Dept Elect Engn, Box 352500, Seattle, WA 98195 USA. EM katrin@ee.washington.edu; dverg@speech.sri.com CR BEYERLEIN P, 1998, P IEEE INT C AC SPEE, V1, P481, DOI 10.1109/ICASSP.1998.674472 Bilmes J., 2002, P ICASSP, V4, P3916 Brants T., 2000, P 6 C APPL NAT LANG, P224, DOI 10.3115/974147.974178 BYRNE W, 1999, P IEEE WORKSH AUT SP BYRNE W, 2000, P INT C AC SPEECH SI, P1029 DEBILI F, 2002, ETIQUETAGE GRAMMATIC Dempster A. P., 1977, J ROYAL STAT SOC B, V39 DIGALAKIS V, 1994, P INT C AC SPEECH SI DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659 El-Imam YA, 2004, COMPUT SPEECH LANG, V18, P339, DOI 10.1016/S0885-2308(03)00035-4 GAL Y, 2002, P WORKSH COMP APPR S, P27 GLOTIN H, 2001, P INT C AC SPEECH SI, P173 Kirchhoff K., 2002, NOVEL APPROACHES ARA Kohler J, 1998, INT CONF ACOUST SPEE, P417, DOI 10.1109/ICASSP.1998.674456 MAAMOURI M, 2004, P NEMLAR INT C AR LA MESSAOUDI A, 2004, P 2004 DARPA RICH TR NELDER JA, 1965, COMPUT J, V7, P308 Ostendorf M., 1991, P DARPA WORKSH SPEEC, P83, DOI 10.3115/112405.112416 Schultz T, 2001, SPEECH COMMUN, V35, P31, DOI 10.1016/S0167-6393(00)00094-7 Schutze H., 1993, P 31 ANN M ASS COMP, P251, DOI 10.3115/981574.981608 SIEMUND R, 2002, P LREC WORKSH AR LAN Stolcke A., 2000, P NIST SPEECH TRANSC VERGYRI D, 2000, THESIS JOHNS HOPKINS Versteegh Kees, 2001, ARABIC LANGUAGE ZITOUNI I, 2002, ORIENTEL SPEECH BASE, P325 NR 25 TC 18 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2005 VL 46 IS 1 BP 37 EP 51 DI 10.1016/j.specom.2005.01.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 930NY UT WOS:000229424100003 ER PT J AU Warner, N Smits, R McQueen, JM Cutler, A AF Warner, N Smits, R McQueen, JM Cutler, A TI Phonological and statistical effects on timing of speech perception: Insights from a database of Dutch diphone perception SO SPEECH COMMUNICATION LA English DT Article DE speech perception; diphone; timing; Dutch; feature ID SEGMENTAL DURATIONS; CONSONANTS; SIGNALS; IDENTIFICATION; DISTINCTION; RECOGNITION; INFORMATION AB We report detailed analyses of a very large database on timing of speech perception collected by Smits et al. (Smits, R., Warner, N., McQueen, J.M., Cutler, A., 2003. Unfolding of phonetic information over time: A database of Dutch diphone perception. J. Acoust. Soc. Am. 113, 563-574). Eighteen listeners heard all possible diphones of Dutch, gated in portions of varying size and presented without background noise. The present report analyzes listeners' responses across gates in terms of phonological features (voicing, place, and manner for consonants; height, backness, and length for vowels). The resulting patterns for feature perception differ from patterns reported when speech is presented in noise. The data are also analyzed for effects of stress and of phonological context (neighboring vowel vs. consonant); effects of these factors are observed to be surprisingly limited. 
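A small numpy sketch of the feature-level analysis style this record applies: a phoneme confusion matrix is collapsed onto one phonological feature (here voicing, with an illustrative six-consonant inventory; the actual study works over the full Dutch diphone set and several features).

    import numpy as np

    VOICING = {"p": 0, "t": 0, "k": 0, "b": 1, "d": 1, "g": 1}  # 0 = voiceless
    PHONES = sorted(VOICING)

    def collapse_to_feature(conf, phones, feature):
        # conf[i, j]: count of stimulus phones[i] reported as phones[j].
        m = np.zeros((2, 2))
        for i, stim in enumerate(phones):
            for j, resp in enumerate(phones):
                m[feature[stim], feature[resp]] += conf[i, j]
        return m  # m[a, b]: stimulus feature value a heard as value b

    # Feature transmission accuracy, e.g. at one gate:
    # fc = collapse_to_feature(conf, PHONES, VOICING)
    # voicing_correct = np.trace(fc) / fc.sum()

Repeating this per gate gives the kind of across-time feature perception curves the analyses above are built on.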
Finally, statistical effects, such as overall phoneme frequency and transitional probabilities, along with response biases, are examined; these too exercise only limited effects on response patterns. The results suggest highly accurate speech perception on the basis of acoustic information alone. (c) 2005 Elsevier B.V. All rights reserved. C1 Max Planck Inst Psycholinguist, NL-6500 AH Nijmegen, Netherlands. RP Warner, N (reprint author), Univ Arizona, Dept Linguist, POB 210028, Tucson, AZ 85721 USA. EM nwarner@u.arizona.edu; heersmits@hotmail.com; james.mcqueen@mpi.nl; anne.cutler@mpi.nl RI McQueen, James/B-2212-2010; Cutler, Anne/C-9467-2012 CR Baayen Harald, 1993, CELEX LEXICAL DATABA Benki JR, 2003, PHONETICA, V60, P129, DOI 10.1159/000071450 Booij Geert, 1995, PHONOLOGY DUTCH Bradlow A. R., 2003, LAB PHONOLOGY, V7, P241 CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1553, DOI 10.1121/1.395911 CRYSTAL TH, 1982, J ACOUST SOC AM, V72, P705, DOI 10.1121/1.388251 CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1574, DOI 10.1121/1.395912 Gussenhoven C., 1992, J INT PHON ASSOC, V22, P45 HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 Kenstowicz Michael, 1994, PHONOLOGY GENERATIVE MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Norris D, 2000, BEHAV BRAIN SCI, V23, P299, DOI 10.1017/S0140525X00003241 Nygaard L. C., 1995, SPEECH LANGUAGE COMM, P63 Ohala J. J., 1975, S NASALS NASALIZATIO, P289 Ohala J. J., 1995, PHONOLOGY PHONETIC E, V4, P41, DOI [10.1017/CBO9780511554315.004, DOI 10.1017/CB09780511554315.004] PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 Pitt MA, 1998, J MEM LANG, V39, P347, DOI 10.1006/jmla.1998.2571 POLS LCW, 1977, SPECTRAL ANAL IDENTI POLS LCW, 1979, ANNIVERSARIES PHONET, P270 POLS LCW, 1978, J ACOUST SOC AM, V64, P1333, DOI 10.1121/1.382100 RECASENS D, 1983, J ACOUST SOC AM, V73, P1346, DOI 10.1121/1.389238 REPP BH, 1986, J ACOUST SOC AM, V79, P1987, DOI 10.1121/1.393207 Smits R., 2000, J PHONETICS, V27, P111, DOI [10.1006/jpho.2000.0107, DOI 10.1006/JPHO.2000.0107] Smits R, 2003, J ACOUST SOC AM, V113, P563, DOI 10.1121/1.1525287 Stevens K.N., 1998, ACOUSTIC PHONETICS van Alphen PM, 2004, J PHONETICS, V32, P455, DOI 10.1016/j.wocn.2004.05.001 van Son RJJH, 1999, SPEECH COMMUN, V29, P1, DOI 10.1016/S0167-6393(99)00024-2 WARNER N, 2003, FORMAL APPROACHES FU, P245 WARNER NL, 1998, THESIS U CALIFORNIA NR 29 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2005 VL 46 IS 1 BP 53 EP 72 DI 10.1016/j.specom.2005.01.003 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 930NY UT WOS:000229424100004 ER PT J AU Hardison, DM AF Hardison, DM TI Variability in bimodal spoken language processing by native and nonnative speakers of English: A closer look at effects of speech style SO SPEECH COMMUNICATION LA English DT Article DE auditory-visual; spoken language processing; speech style; gating; language learning ID WORD-RECOGNITION; CONVERSATIONAL SPEECH; TALKER VARIABILITY; NEIGHBORHOOD DENSITY; GATING PARADIGM; SPEAKING RATE; PERCEPTION; INTELLIGIBILITY; HEARING; CONTEXT AB Experiments using the gating paradigm investigated the influence of speech style (unscripted vs. scripted), visual cues from a talker's face (auditory-visual vs. auditory-only presentations), word length (one vs. 
two syllables) and initial consonant (IC) visual category in spoken word identification by native- (NSs) and nonnative speakers (NNSs) of English. Two talkers were videotaped in separate conversations with the author on various topics (unscripted speech) with the camera focused on the talker's face. They later recorded selected sentences as scripted speech. In Experiment 1, target words were excised from sentences, gated, and presented to NSs and NNSs. Results for both groups showed significantly earlier identification with visual cues and for bisyllabic vs. monosyllabic words, and significant interaction of speech style and IC category across talkers. For Talker A, identification of unscripted monosyllabic words in some IC categories was earlier than the scripted versions. For this talker, words beginning with /_L/ were identified later than others; for Talker B, they were the earliest to be identified. For NNSs, the AV advantage was accentuated for words beginning with /j, w.. G/ in unscripted speech by Talker A, and for /(sic)/- and /(sic)/-initial words in both speech styles by Talker B. Experiment 2 presented the preceding sentence context with the gated word to NSs. Results revealed earlier identification in AV presentation and for unscripted vs. scripted words by Talker A. With context, word length was not significant. Findings highlight the priming role of visual cues, and the talker- and context-dependent nature of bimodal spoken language processing, but do not support a strict conversational-clear speech dichotomy. (c) 2005 Elsevier B.V. All rights reserved. C1 Michigan State Univ, Dept Linguist & Languages, E Lansing, MI 48824 USA. RP Hardison, DM (reprint author), Michigan State Univ, Dept Linguist & Languages, A-714 Wells Hall, E Lansing, MI 48824 USA. EM hardiso2@msu.edu CR BARD EG, 1988, PERCEPT PSYCHOPHYS, V44, P395, DOI 10.3758/BF03210424 BERGER KW, 1972, SPEECHREADING PRINCI Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5 COTTON S, 1984, PERCEPT PSYCHOPHYS, V35, P41, DOI 10.3758/BF03205923 CRAIG CH, 1990, J SPEECH HEAR RES, V33, P808 Dalby J. M., 1986, PHONETIC STRUCTURE F Demorest ME, 1996, J SPEECH HEAR RES, V39, P697 FISHER CG, 1968, J SPEECH HEAR RES, V11, P796 Gagne J. P., 1994, J ACAD REHABIL AUDIO, V27, P135 Garlock VM, 2001, J MEM LANG, V45, P468, DOI 10.1006/jmla.2000.2784 GOLDINGER SD, 1989, J MEM LANG, V28, P501, DOI 10.1016/0749-596X(89)90009-0 GROSJEAN F, 1980, PERCEPT PSYCHOPHYS, V28, P267, DOI 10.3758/BF03204386 GROSJEAN F, 1985, PERCEPT PSYCHOPHYS, V38, P299, DOI 10.3758/BF03207159 Grosjean F, 1996, LANG COGNITIVE PROC, V11, P597, DOI 10.1080/016909696386999 Hardison DM, 1999, LANG LEARN, V49, P213, DOI 10.1111/0023-8333.49.s1.7 Hardison DM, 2003, APPL PSYCHOLINGUIST, V24, P495, DOI 10.1017/S0142716403000250 HARDISON DM, 2004, UNPUB ROLE CONTEXT S HARDISON DM, IN PRESS APPL PSYCHO Helfer KS, 1997, J SPEECH LANG HEAR R, V40, P432 Johnson K., 1997, TALKER VARIABILITY S KRICOS PB, 1982, VOLTA REV, V84, P219 LIVELY SE, 1993, J ACOUST SOC AM, V94, P1242, DOI 10.1121/1.408177 Luce P. A., 1986, 6 IND U MARSLENWILSON WD, 1978, COGNITIVE PSYCHOL, V10, P29, DOI 10.1016/0010-0285(78)90018-X MCALLISTER J, 1991, LANG SPEECH, V34, P1 MCALLISTER JM, 1988, PERCEPT PSYCHOPHYS, V44, P94, DOI 10.3758/BF03207482 Metsala JL, 1997, MEM COGNITION, V25, P47, DOI 10.3758/BF03197284 Munhall KG, 1998, J ACOUST SOC AM, V104, P530, DOI 10.1121/1.423300 Nusbaum H.
C., 1984, SIZING HOOSIER MENTA NYGAARD LC, 1995, PERCEPT PSYCHOPHYS, V57, P989, DOI 10.3758/BF03205458 PICHENY MA, 1989, J SPEECH HEAR RES, V32, P600 PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434 PISONI DB, 1993, SPEECH COMMUN, V13, P109, DOI 10.1016/0167-6393(93)90063-Q SALASOO A, 1985, J MEM LANG, V24, P210, DOI 10.1016/0749-596X(85)90025-7 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 SUMMERFIELD Q, 1979, PHONETICA, V36, P314 TYLER LK, 1984, PERCEPT PSYCHOPHYS, V36, P417, DOI 10.3758/BF03207496 WALDEN BE, 1977, J SPEECH HEAR RES, V20, P130 NR 38 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2005 VL 46 IS 1 BP 73 EP 93 DI 10.1016/j.specom.2005.02.002 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 930NY UT WOS:000229424100005 ER PT J AU van Dinther, R Veldhuis, R Kohlrausch, A AF van Dinther, R Veldhuis, R Kohlrausch, A TI Perceptual aspects of glottal-pulse parameter variations SO SPEECH COMMUNICATION LA English DT Article DE speech synthesis; glottal pulse; LF parameters; perceptual distance measure ID NOTICEABLE DIFFERENCES; VOICE; THRESHOLDS; SOUNDS AB The relation between speech production and perception parameters is investigated for synthetic stationary vowels. A perceptual distance measure, based on the distance between excitation patterns, is used to quantify and model the perceptual relevance of variations to glottal-pulse parameters. The parameters that are studied are the R parameters of the well-known Liljencrants-Fant (LF) model. An approximation of the perceptual distance measure was used to quantify the perceptual effects of the R parameters. The data used in this paper consist of 33 R-parameter vectors which all have been measured from real voices and have been taken from the literature. The 33 points in the R-parameter space are found to lie near a trajectory when ordered as a function of a perceptual parameter, which was derived from the perceptual distance measure. This ordering seems to be largely independent of other speech parameters, such as F-0 values, level, and parameters of the vocal tract filters. Finally, it is demonstrated that the perceptual parameter has a close relation to the production parameter R-d, described in [Fant, G., 1995. The LF-model revisited. Transformations and frequency domain analysis. STL-QPSR 2-3/95, 119-156]. (c) 2005 Elsevier B.V. All rights reserved. C1 Philips Res Labs, NL-5656 AA Eindhoven, Netherlands. Univ Cambridge, Ctr Neural Basis Hearing, Dept Physiol, Cambridge CB2 3EG, England. Univ Twente, Fac Engn, Chair Signals & Syst, NL-7500 AE Enschede, Netherlands. Tech Univ Eindhoven, Dept Technol Management, NL-5600 MB Eindhoven, Netherlands. RP Kohlrausch, A (reprint author), Philips Res Labs, Prof Holstlaan 4, WO 02, NL-5656 AA Eindhoven, Netherlands.
EM armin.kohlrausch@philips.com CR CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 Fant G., 1995, STL QPSR, V36, P119 FANT G, 1993, SPEECH COMMUN, V13, P7, DOI 10.1016/0167-6393(93)90055-P Fant Gunnar, 1985, STL QPSR, V4, P1 Henrich N, 2003, J VOICE, V17, P481, DOI 10.1067/S0892-1997(03)00005-5 KARLSSON I, 1996, STL QPSR, V2, P143 Kawahara H., 2003, P IEEE INT C AC SPEE, V1, P256 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 Laver J, 1980, PHONETIC DESCRIPTION MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 MOORE BCJ, 1987, J ACOUST SOC AM, V81, P1633, DOI 10.1121/1.394518 Moore BC., 2003, INTRO PSYCHOL HEARIN Moore BCJ, 1997, J AUDIO ENG SOC, V45, P224 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Pestman W.R., 1998, MATH STAT Rao P, 2001, J ACOUST SOC AM, V109, P2085, DOI 10.1121/1.1354986 Scherer RC, 1998, J VOICE, V12, P21, DOI 10.1016/S0892-1997(98)80072-6 van Dinther R, 2004, SPEECH COMMUN, V42, P175, DOI 10.1016/j.specom.2003.07.002 VELDHUIS R, 1998, P ICASSP 98, V2, P873, DOI 10.1109/ICASSP.1998.675404 NR 20 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2005 VL 46 IS 1 BP 95 EP 112 DI 10.1016/j.specom.2005.01.005 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 930NY UT WOS:000229424100006 ER PT J AU Muto, M Kato, H Tsuzaki, M Sagisaka, Y AF Muto, M Kato, H Tsuzaki, M Sagisaka, Y TI Effect of intra-phrase position on acceptability of change in segment duration in sentence speech SO SPEECH COMMUNICATION LA English DT Article ID TEMPORAL MODIFICATION; ISOLATED WORDS AB For use as a naturalness criterion for duration rules in speech synthesis, human acceptability of change in segment duration is investigated with regard to the temporal position within a phrase. Three perceptual experiments are carried out to introduce variations in the attribute and context of a phrase in sentence speech: (1) the length of a phrase and the type of a phrase accent (2 lengths x 3 types), (2) variation in carrier sentence (3 carriers + 1 without carrier), and (3) the position of a phrase in a breath group (two positions). In total, 22 listeners evaluate the acceptability of resynthesized speech stimuli in which one of the vowel segments was either lengthened or shortened by up to 50 ms. Overall results show that a duration change in the phrase-initial segment is generally the least acceptable and that in the phrase-final segment the most acceptable, with that in a phrase at intermediate positions in between. This position-dependent tendency is observed regardless of the variations in phrase length, accent type, carrier sentence, presence of carrier sentence, and position in a breath group. These results suggest that the error criteria of duration modeling should be reconsidered by taking into account such perceptual characteristics in order to improve temporal naturalness in synthesized speech. (c) 2004 Elsevier B.V. All rights reserved. C1 Waseda Univ, Shinjuku Ku, Tokyo 1690072, Japan. ATR Human Informat Sci Labs, Kyoto 6190288, Japan. ATR Spoken Language Translat Res Labs, Kyoto 6190288, Japan. RP Muto, M (reprint author), Waseda Univ, Shinjuku Ku, 3-14-9 Okubo, Tokyo 1690072, Japan. 
EM makiko.muto@ruri.waseda.jp; kato@atr.jp; minoru.tsuzaki@kcua.ac.jp; sagisaka@giti.waseda.ac.jp CR [Anonymous], 1975, 5321975E ISO BOCHNER JH, 1988, J ACOUST SOC AM, V84, P493, DOI 10.1121/1.396827 CARLSON R, 1979, FRONTIERS SPEECH COM, P233 CARLSON R, 1986, PHONETICA, V43, P140 Higuchi N., 1993, Journal of the Acoustical Society of Japan (E), V14 HOSHINO M, 1983, S8275 AC SOC JAP T T, P539 Kaiki N., 1992, SPEECH PERCEPTION PR, P391 Kato H, 2002, J ACOUST SOC AM, V111, P387, DOI 10.1121/1.1428543 Kato H, 1998, J ACOUST SOC AM, V104, P540, DOI 10.1121/1.423301 Kato H, 1997, J ACOUST SOC AM, V101, P2311, DOI 10.1121/1.418210 Kato Hiroaki, 2003, P 15 INT C PHON SCI, P2043 Kawahara H, 1999, SPEECH COMMUN, V27, P187, DOI 10.1016/S0167-6393(98)00085-5 KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986 Klatt D. H., 1979, FRONTIERS SPEECH COM, P287 Klatt D.H., 1975, J PHONETICS, V3, P129 Klatt D.H., 1975, STRUCTURE PROCESS SP, P69 LINDBLOM B, 1973, PUBL U STOCKHOLM, V21 MARTIN JG, 1970, J VERB LEARN VERB BE, V9, P75, DOI 10.1016/S0022-5371(70)80010-X MIYATAKE M, 1988, TR10056 ATR INT TEL Sagisaka Y., 1984, Transactions of the Institute of Electronics and Communication Engineers of Japan, Part A, VJ67A TAKEDA K, 1989, J ACOUST SOC AM, V86, P2081, DOI 10.1121/1.398467 Tanaka M., 1994, Journal of the Acoustical Society of Japan (E), V15 Tsujimura N., 1996, INTRO JAPANESE LINGU VANSANTEN JPH, 1994, COMPUT SPEECH LANG, V8, P95, DOI 10.1006/csla.1994.1005 Zwicker E., 1991, Journal of the Acoustical Society of Japan (E), V12 NR 25 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2005 VL 45 IS 4 BP 361 EP 372 DI 10.1016/j.specom.2004.11.004 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 919KR UT WOS:000228617400001 ER PT J AU Heracleous, P Shimizu, T AF Heracleous, P Shimizu, T TI A novel approach for modeling non-keyword intervals in a keyword spotter exploiting acoustic similarities of languages SO SPEECH COMMUNICATION LA English DT Article DE keyword spotting; two-pass; telephone speech; garbage hidden Markov models AB In this paper, we present a new keyword spotting technique. A critical issue in keyword spotting is the explicit modeling of the non-keyword portions. To date, most keyword spotters use a set of Hidden Markov Models (HMM) to represent the non-keyword portions. A widely used approach is to split the training data into keyword and non-keyword data. The keywords are represented by HMMs trained using the keyword speech, and the garbage models are trained using the non-keyword speech. The main disadvantage of this method is the task dependence. Another approach is to use a common set of acoustic models for both keywords and garbage models. However, this method faces a major problem. In a keyword spotter, the garbage models are usually connected to allow any sequence. Therefore, the keywords are also included in these sequences. When the same training data are used for keyword and garbage models, the garbage models also cover the keywords. In order to overcome these problems, we propose a new method for modeling the non-keyword intervals. In our method, the garbage models are phonemic HMMs trained using a speech corpus of a language other than, but acoustically similar to, the target language.
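This record evaluates spotting with an equal error rate (the operating point where the miss rate equals the false-alarm rate). A minimal numpy sketch of computing EER from per-detection scores; the scores and labels here are hypothetical stand-ins for what keyword-vs-garbage decoding would produce.

    import numpy as np

    def equal_error_rate(scores, labels):
        # scores: detector outputs, higher = more keyword-like.
        # labels: 1 for genuine keyword occurrences, 0 for false-alarm candidates.
        order = np.argsort(scores)[::-1]
        y = np.asarray(labels, dtype=float)[order]
        fa = np.cumsum(1.0 - y) / max((1.0 - y).sum(), 1.0)  # false-alarm rate
        miss = 1.0 - np.cumsum(y) / max(y.sum(), 1.0)        # miss rate
        i = int(np.argmin(np.abs(fa - miss)))
        return 0.5 * (fa[i] + miss[i])

Sweeping the decision threshold is implicit in the descending sort: each prefix of the sorted list is one operating point, and the EER is read off where the two error curves cross.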
In our work, the target language is Japanese and, due to the high similarity, English was chosen as the 'garbage language' for training the garbage models. Using English garbage models, instead of Japanese, our method achieves higher performance compared with when Japanese garbage models are used. Moreover, parameter tuning (e.g., word insertion penalty) does not have a serious effect on the performance when English garbage models are used. Using clean telephone speech test data and a vocabulary of 100 keywords, we achieved a 7.9% equal error rate, which is a very promising result. In this paper we also introduce results obtained using several vocabulary sizes, and we investigate the selection of the most appropriate garbage model set. In addition to the Japanese keyword spotting system, we also introduce results of an English keyword spotter. By using Japanese garbage models, instead of English, we achieved a significant improvement. Using telephone speech test data and a vocabulary of 25 keywords, the achieved Figure of Merit (FOM) was 74.7%, compared to 68.9% when English garbage models were used. (c) 2005 Elsevier B.V. All rights reserved. C1 KDDI R&D Labs Inc, Kamifukuoka, Saitama 3568502, Japan. Nara Inst Sci & Technol, Grad Sch Informat Sci, Speech & Acoust Proc Lab, Nara 6300101, Japan. RP Heracleous, P (reprint author), KDDI R&D Labs Inc, 2-1-15 Ohara, Kamifukuoka, Saitama 3568502, Japan. EM panikos@is.naist.jp; shimizu@kddilabs.jp CR BERNSTEIN J, 1994, P INT C AC SPEECH SI, P81 BOURLARD H, 1994, P IEEE INT C AC SPEE, P373 Bridle JS., 1973, BRIT AC SOC M, P1 Higgins A., 1985, P ICASSP, P1233 KNILL KM, 1996, P ICASSP, P522 MANOS AS, 1997, P ICASSP 97, P899 Martin A. F., 1997, P EUROSPEECH, P1895 ROSE RC, 1995, P INT C AC SPEECH SI, P281 Rosevear R. D., 1990, Power Technology International SCHULTZ T, 1998, P DARPA BROADCAST NE 1999, HDB INT PHONETIC ASS, P41 NR 11 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2005 VL 45 IS 4 BP 373 EP 386 DI 10.1016/j.specom.2004.10.016 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 919KR UT WOS:000228617400002 ER PT J AU Sudoh, K Nakano, M AF Sudoh, K Nakano, M TI Post-dialogue confidence scoring for unsupervised statistical language model training SO SPEECH COMMUNICATION LA English DT Article DE confidence scoring; speech recognition; statistical language models; spoken language dialogue systems ID SPEECH RECOGNITION AB This paper presents a new recognition confidence scoring method for unsupervised training of statistical language models in spoken language dialogue systems. Based on the proposed confidence scoring, the speech recognition results for untranscribed user utterances are selected for training the statistical language models of speech recognizers. The method uses features that can only be obtained after the dialogue session, in addition to other features, such as the acoustic scores of recognition results. Experimental results show that the proposed confidence scoring improves correct/incorrect classification of recognition results and that using the language models obtained through our approach results in better recognition accuracy than that achieved by conventional methods. (c) 2005 Elsevier B.V. All rights reserved. C1 NTT Corp, NTT, Commun Sci Labs, Kyoto 6190237, Japan.
RP Sudoh, K (reprint author), NTT Corp, NTT, Commun Sci Labs, 2-1 Hikaridai,Seika Cho, Kyoto 6190237, Japan. EM sudoh@cslab.keci.ntt.co.jp; nakano@jp.honda-ri.com CR CLARKSON P, 1997, P EUR C SPEECH COMM, V5, P2707 GLASS J, 2001, P 7 EUR C SPEECH COM, P1335 GRETTER R, 2001, P IEEE INT C AC SPEE, V1 HAKKANITUR D, 2002, P IEEE INT C AC SPEE, V4, P3904 Hazen TJ, 2002, COMPUT SPEECH LANG, V16, P49, DOI 10.1006/csla.2001.0183 KAMPPARI SO, 2000, P IEEE INT C AC SPEE, P1799 Kemp T., 1999, P EUROSPEECH 99, P2725 Lamel L, 2002, COMPUT SPEECH LANG, V16, P115, DOI 10.1006/csla.2001.0186 Lee A., 2001, P EUR C SPEECH COMM, P1691 Mangu L, 2000, COMPUT SPEECH LANG, V14, P373, DOI 10.1006/csla.2000.0152 NAKANO M, 2000, 1 SIGD WORKSH DISC D, P150 Nakano M., 2003, P 8 EUR C SPEECH COM, P417 PRADHAN SS, 2002, P ICASSP, V1, P233 Riccardi G., 2003, P 8 EUR C SPEECH COM, P1825 Rosenfeld R, 2000, P IEEE, V88, P1270, DOI 10.1109/5.880083 SCHAAF T, 1997, P IEEE INT C AC SPEE, P875 STOLCKE A, 2001, P 2001 NIST LARG VOC WEINTRAUB M, 1997, P ICASSP, V2, P887 WENDEMUTH A, 1999, P INT C AC SPEECH SI, P705 Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002 ZAVALIAGKOS G, 1998, P DARPA BROADC NEWS, P301 Zhang R., 2001, P 7 EUR C SPEECH COM, P2105 NR 22 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2005 VL 45 IS 4 BP 387 EP 400 DI 10.1016/j.specom.2004.10.017 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 919KR UT WOS:000228617400003 ER PT J AU Sroka, JJ Braida, LD AF Sroka, JJ Braida, LD TI Human and machine consonant recognition SO SPEECH COMMUNICATION LA English DT Article DE automatic speech recognition; consonant identification; filtering; noise; speech recognition ID AMPLITUDE COMPRESSION; SPEECH RECOGNITION; AUDITORY MODELS; PERCEPTION; CLUSTERS; PREDICT AB Three traditional ASR parameterizations matched with Hidden Markov Models (HMMs) are compared to humans for speaker-dependent consonant recognition using nonsense syllables degraded by highpass filtering, lowpass filtering, or additive noise. Confusion matrices were determined by recognizing the syllables using different ASR front ends, including Mel-Filter Bank (MFB) energies, Mel-Filtered Cepstral Coefficients (MFCCs), and the Ensemble Interval Histogram (EIH). In general the MFB recognition accuracy was slightly higher than the MFCC, which was higher than the EIH. For syllables degraded by lowpass and highpass filtering, automated systems trained on the degraded condition recognized the consonants as well as humans. For syllables degraded by additive speech-shaped noise, none of the automated systems recognized consonants as well as humans. The greatest advantage displayed by humans was in determining the correct voiced/unvoiced classification of consonants in noise. (c) 2005 Elsevier B.V. All rights reserved. C1 MIT, Elect Res Lab, Cambridge, MA 02139 USA. MIT, Harvard Mit Div Hlth Sci & Technol, Cambridge, MA 02139 USA. RP Braida, LD (reprint author), 36-747 MIT, Cambridge, MA 02139 USA. 
EM jjsroka@email.msn.com; ldbraida@mit.edu CR ANSI, 1997, S351997 ANSI Bratakos MS, 2001, EAR HEARING, V22, P225, DOI 10.1097/00003446-200106000-00006 Bustamante D K, 1987, J Rehabil Res Dev, V24, P149 BUSTAMANTE DK, 1987, J ACOUST SOC AM, V82, P1227, DOI 10.1121/1.395259 De Gennaro S, 1986, J Rehabil Res Dev, V23, P17 DIX AK, 2002, UNPUB EFFECT NOISE R EBEL WJ, 1995, P SPOK LANG SYST TEC, P53 Ghitza O, 1994, IEEE T SPEECH AUDI P, V2, P115, DOI 10.1109/89.260357 GHITZA O, 1993, J ACOUST SOC AM, V93, P2160, DOI 10.1121/1.406679 Hant JJ, 2003, SPEECH COMMUN, V40, P291, DOI 10.1016/S0167-6393(02)00068-7 JANKOWSKI CR, 1995, IEEE T SPEECH AUDI P, V3, P286, DOI 10.1109/89.397093 King S, 2000, COMPUT SPEECH LANG, V14, P333, DOI 10.1006/csla.2000.0148 KIRCHHOFF K, 1999, P 1999 IEEE INT C AC KIRCHOFF K, 1998, P 5 INT C SPOK LANG KLATT DH, 1975, J SPEECH HEAR RES, V18, P686 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 LIPPMANN RP, 1981, J ACOUST SOC AM, V69, P524, DOI 10.1121/1.385375 *MATHW INC, 1984, MATLAB VERS 4 8 5 2 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Nearey TM, 1997, J ACOUST SOC AM, V101, P3241, DOI 10.1121/1.418290 Rabiner L, 1993, FUNDAMENTALS SPEECH Ronan D, 2004, J ACOUST SOC AM, V116, P1749, DOI 10.1121/1.1777858 Slaney M., 1998, 199810 INT RES CORP SROKA J, 1998, THESIS MIT CAMBRIDGE STEVENS KN, 1992, J ACOUST SOC AM, V91, P2979, DOI 10.1121/1.402933 STEVENS KN, 1980, J ACOUST SOC AM, V68, P836, DOI 10.1121/1.384823 *U CAMBR, 2004, HIDD MARK MOD TOOLK Winer B.J., 1971, STAT PRINCIPLES EXPT NR 28 TC 39 Z9 39 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2005 VL 45 IS 4 BP 401 EP 423 DI 10.1016/j.specom.2004.11.009 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 919KR UT WOS:000228617400004 ER PT J AU Schwartz, JL Abry, C Boe, LJ Menard, L Vallee, N AF Schwartz, JL Abry, C Boe, LJ Menard, L Vallee, N TI Asymmetries in vowel perception, in the context of the Dispersion-Focalisation Theory SO SPEECH COMMUNICATION LA English DT Article DE vowel discrimination; peripheral vowels; focalisation; salience; stability; anchor ID SPEECH-PERCEPTION; INFANTS; SYSTEMS; IDENTIFICATION; REPRESENTATION; CATEGORIES; CONTRAST; STIMULI; NOISE AB In a recent paper in this journal, Polka and Bohn [Polka, L., Bohn, O.-S., 2003. Asymmetries in vowel perception. Speech Communication 41, 221-231] display a robust asymmetry effect in vowel discrimination, present in infants as well as adults. They interpret this effect as a preference for peripheral vowels, providing an anchor for comparison. We discuss their data in the framework of the Dispersion-Focalisation Theory of vowel systems. We show that focalisation, that is, the convergence between two consecutive formants in a vowel spectrum, is likely to provide the ground for anchor vowels, by increasing their perceptual salience. This makes it possible to explain why [y] is an anchor vowel, as well as [i], [a] or [u]. Furthermore, we relate the asymmetry data to an old experiment we had done on the discrimination of focal vs. non-focal vowels. Altogether, it appears that focal vowels, more salient in perception, provide both a stable percept and a reference for comparison and categorisation. (c) 2005 Published by Elsevier B.V. C1 Univ Grenoble 3, CNRS, INPG, Inst Commun Parlee, F-38031 Grenoble, France.
Univ Quebec, Dept Linguist & Didact Langues, Montreal, PQ H3C 3P8, Canada. RP Schwartz, JL (reprint author), Univ Grenoble 3, CNRS, INPG, Inst Commun Parlee, 46 Av Felix Viallet, F-38031 Grenoble, France. EM schwartz@icp.inpg.fr CR ABRY C, 1989, J PHONETICS, V17, P47 BADIN P, 1991, J ACOUST SOC AM, V87, P1290 Best C. T., 2000, INT C INF STUD BRIGH BOE LJ, 1986, ACT 15 JOURN ET PAR, P303 Boe L.-J., 1989, P EUROSPEECH, V89, P281 Chistovich L. A., 1979, FRONTIERS SPEECH COM, P143 CHISTOVICH LA, 1979, HEARING RES, V1, P185, DOI 10.1016/0378-5955(79)90012-1 DIEHL R, 2003, P 15 ICPHS POST EIMAS PD, 1971, SCIENCE, V171, P303, DOI 10.1126/science.171.3968.303 ESCUDIER P, 1985, FRANC SEM SOC FRANC, P143 FANT G, 1983, STL QPSR, V2, P1 KUHL PK, 1991, PERCEPT PSYCHOPHYS, V50, P93, DOI 10.3758/BF03212211 KUHL PK, 1992, SCIENCE, V255, P606, DOI 10.1126/science.1736364 KUHL PK, 1978, J ACOUST SOC AM, V63, P905, DOI 10.1121/1.381770 LILJENCR.J, 1972, LANGUAGE, V48, P839, DOI 10.2307/411991 LINDBLOM B, 1990, J PHONETICS, V18, P135 LINDBLOM B, 2003, P 15 ICPHS, P39 Lindblom B., 1986, EXPT PHONOLOGY, P13 MENARD L, UNPUB J PHONETICS MILLER JD, 1976, J ACOUST SOC AM, V60, P410, DOI 10.1121/1.381097 POLKA L, 1996, J ACOUST SOC AM, V95, P1286 POLKA L, 1994, J EXP PSYCHOL HUMAN, V20, P421, DOI 10.1037/0096-1523.20.2.421 Polka L, 2003, SPEECH COMMUN, V41, P221, DOI 10.1016/S0167-6393(02)00105-X Repp B. H., 1984, SPEECH LANGUAGE ADV, V10, P243 REPP BH, 1979, J EXP PSYCHOL HUMAN, V5, P129, DOI 10.1037//0096-1523.5.1.129 Robert-Ribes J, 1998, J ACOUST SOC AM, V103, P3677, DOI 10.1121/1.423069 Rosch Eleanor H., 1972, J EXP PSYCHOL, V93, P10, DOI DOI 10.1037/H0032606 SCHWARTZ JL, 1987, NATO ASI SER, P284 SCHWARTZ JL, 1993, J PHONETICS, V21, P411 SCHWARTZ JL, 1989, SPEECH COMMUN, V8, P235, DOI 10.1016/0167-6393(89)90004-6 Schwartz JL, 1997, J PHONETICS, V25, P255, DOI 10.1006/jpho.1997.0043 Schwartz JL, 1997, J PHONETICS, V25, P233, DOI 10.1006/jpho.1997.0044 Stevens KN, 1972, HUMAN COMMUNICATION, P51 STEVENS KN, 1989, J PHONETICS, V17, P3 SYRDAL AK, 1985, SPEECH COMMUN, V4, P121, DOI 10.1016/0167-6393(85)90040-8 Vallee N., 1999, P 14 INT C PHON SCI, V1, P333 NR 36 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2005 VL 45 IS 4 BP 425 EP 434 DI 10.1016/j.specom.2004.12.001 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 919KR UT WOS:000228617400005 ER PT J AU Yao, KS Paliwal, KK Lee, TW AF Yao, KS Paliwal, KK Lee, TW TI Generative factor analyzed HMM for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE hidden Markov models; factor analysis; mixture of Gaussian; speech recognition; expectation maximization algorithm ID BLIND SOURCE SEPARATION; MAXIMUM-LIKELIHOOD; MODELS AB We present a generative factor analyzed hidden Markov model (GFA-HMM) for automatic speech recognition. In a standard HMM, observation vectors are represented by a mixture of Gaussians (MoG) that is dependent on a discrete-valued hidden state sequence. The GFA-HMM introduces a hierarchy of continuous-valued latent representations of observation vectors, where latent vectors in one level are acoustic-unit dependent and latent vectors in a higher level are acoustic-unit independent. An expectation maximization (EM) algorithm is derived for maximum likelihood estimation of the model.
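For orientation, a compact numpy sketch of EM for a single Gaussian factor analyzer, the latent-variable building block underlying such a model. This is a generic textbook estimator, not the paper's full GFA-HMM (which ties such analyzers to HMM states and stacks two latent levels); dimensions, iteration count, and initialization are illustrative.

    import numpy as np

    def factor_analysis_em(X, k, n_iter=50, seed=0):
        # X: (N, d) observations; k: latent dimension.
        rng = np.random.default_rng(seed)
        N, d = X.shape
        mu = X.mean(axis=0)
        Xc = X - mu
        L = 0.1 * rng.standard_normal((d, k))   # loading matrix Lambda
        psi = Xc.var(axis=0) + 1e-6              # diagonal noise variances
        for _ in range(n_iter):
            # E-step: posterior moments of the latent z for every observation.
            G = np.linalg.inv(np.eye(k) + (L.T / psi) @ L)
            Ez = Xc @ (L / psi[:, None]) @ G     # E[z | x] per row
            Ezz = N * G + Ez.T @ Ez              # sum over n of E[z z^T | x]
            # M-step: re-estimate loadings and diagonal noise.
            L = np.linalg.solve(Ezz, Ez.T @ Xc).T
            psi = np.maximum(
                (Xc ** 2).mean(axis=0) - ((Xc.T @ Ez) * L).sum(axis=1) / N,
                1e-6,
            )
        return mu, L, psi

Varying k here corresponds to the latent-dimension knob the experiments below trade off against the number of mixture components.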
We verify the potential of the GFA-HMM as an alternative acoustic modeling technique through a set of experiments. In one experiment, by varying the latent dimension and the number of mixture components in the latent spaces, the GFA-HMM attained a more compact representation than the standard HMM. In other experiments with various noise types and speaking styles, the GFA-HMM was able to achieve (statistically significant) improvements with respect to the standard HMM. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Calif San Diego, Inst Neural Computat, La Jolla, CA 92093 USA. Griffith Univ, Sch Microelect Engn, Brisbane, Qld 4111, Australia. RP Yao, KS (reprint author), Univ Calif San Diego, Inst Neural Computat, 9500 Gilman Dr, La Jolla, CA 92093 USA. EM kyao@ti.com; k.paliwal@me.gu.edu.au; tewon@ucsd.edu CR AMARI SI, 2000, BLIND SIGNAL SEPARAT, P63 Attias H, 1999, NEURAL COMPUT, V11, P803, DOI 10.1162/089976699300016458 Bell A. J., 1995, Advances in Neural Information Processing Systems 7 BELLEGARDA JR, 1990, IEEE T ACOUST SPEECH, V38, P2033, DOI 10.1109/29.61531 Cardoso JF, 1997, IEEE SIGNAL PROC LET, V4, P112, DOI 10.1109/97.566704 COMON P, 1994, SIGNAL PROCESS, V36, P287, DOI 10.1016/0165-1684(94)90029-9 Dempster A., 1977, J ROYAL STAT SOC B, V3, P1 DING P, 2002, ICSLP, P1341 Everitt BS, 1984, INTRO LATENT VARIABL FREY B, 1999, TR992 UWCS Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 GOPINATH R, 1998, ICASSP, P12 HANSEN JHL, 1998, RSPL9810 Hyvarinen A, 2001, INDEPENDENT COMPONEN Kumar N, 1997, THESIS J HOPKINS U OLSEN P, 2002, ICASSP, P13 PEARCE D, 2000, ISCA ITRW ASR2000 Pearlmutter BA, 1997, ADV NEUR IN, V9, P613 Rabiner L, 1993, FUNDAMENTALS SPEECH Rosti AVI, 2002, INT CONF ACOUST SPEE, P949 ROSTI AVI, 2001, 420 CUEDFINFENGTR Roweis S, 1999, NEURAL COMPUT, V11, P305, DOI 10.1162/089976699300016674 RUBIN DB, 1982, PSYCHOMETRIKA, V47, P69, DOI 10.1007/BF02293851 Saul LK, 2000, IEEE T SPEECH AUDI P, V8, P115, DOI 10.1109/89.824696 Tipping M.E., 1997, NCRG97003 AST U VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 Woodland PC, 1996, INT CONF ACOUST SPEE, P65, DOI 10.1109/ICASSP.1996.540291 YAMAMOTO H, 2004, ICASSP, P29 YOUNG S, 1997, HTK BOOK VERSION 2 1 NR 29 TC 0 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2005 VL 45 IS 4 BP 435 EP 454 DI 10.1016/j.specom.2005.01.002 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 919KR UT WOS:000228617400006 ER PT J AU Jiang, H AF Jiang, H TI Confidence measures for speech recognition: A survey SO SPEECH COMMUNICATION LA English DT Article DE automatic speech recognition (ASR); confidence measures (CM); word posterior probability; utterance verification; likelihood ratio testing (LRT); Bayes factors ID DISCRIMINATIVE UTTERANCE VERIFICATION; CONNECTED DIGITS RECOGNITION; HIDDEN MARKOV-MODELS; INFORMATION AB In speech recognition, confidence measures (CM) are used to evaluate reliability of recognition results. A good confidence measure can largely benefit speech recognition systems in many practical applications. In this survey, I summarize most research work on confidence measures that has been done during the past 10-12 years. I will present all these approaches as three major categories, namely CM as a combination of predictor features, CM as a posterior probability, and CM as utterance verification.
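A minimal Python sketch of the second family, "CM as a posterior probability": a word's confidence is the summed, normalised weight of the N-best hypotheses containing it. Real systems align hypotheses by time and work on lattices; the flat word-membership test and the score-flattening scale used here are simplifying assumptions for illustration.

    import math

    def nbest_word_confidence(nbest, scale=0.05):
        # nbest: list of (total_log_score, list_of_words) hypotheses.
        best = max(score for score, _ in nbest)
        weights = [math.exp(scale * (score - best)) for score, _ in nbest]
        z = sum(weights)
        conf = {}
        for w_hyp, (_, words) in zip(weights, nbest):
            for word in set(words):
                conf[word] = conf.get(word, 0.0) + w_hyp / z
        return conf  # confidence in [0, 1] per word

The scale factor plays the role of the acoustic-score flattening commonly tuned so that the resulting posteriors are neither dominated by the top hypothesis nor uniformly flat.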
Then, I also introduce some recent advances in the area. Moreover, I will discuss capabilities and limitations of the current CM techniques and generally comment on today's CM approaches. Based on the discussion, I will conclude the paper with some directions for future work. (c) 2005 Elsevier B.V. All rights reserved. C1 York Univ, Dept Comp Sci, Toronto, ON M3J 1P3, Canada. RP Jiang, H (reprint author), York Univ, Dept Comp Sci, 4700 Keele St, Toronto, ON M3J 1P3, Canada. EM hj@cs.yorku.ca CR AFIFY M, 2000, IN PRESS IEEE T SPEE Arslan LM, 1999, IEEE T SPEECH AUDI P, V7, P46, DOI 10.1109/89.736330 Asadi A., 1990, P ICASSP 90, P125 Benitez MC, 2000, SPEECH COMMUN, V32, P79, DOI 10.1016/S0167-6393(00)00025-X BOURLARD H, 1994, P IEEE INT C AC SPEE, P373 CHARLET D, 2001, P EUR C SPEECH COMM Chase L., 1997, P EUR C SPEECH COMM, P815 CHIGIER B, 1992, P ICASSP92, P93, DOI 10.1109/ICASSP.1992.226112 Cox S, 2002, IEEE T SPEECH AUDI P, V10, P460, DOI 10.1109/TSA.2002.804304 COX S, 1996, P INT C AC SPEECH SI, P511 DOLFING JGA, 1998, P INT C SPOK LANG PR EIDE E, 1995, P IEEE INT C AC SPEE, P221 GARCIA MC, 1999, P AUT SPEECH REC UND Gillick L., 1997, P IEEE INT C AC SPEE, P879 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J GORONZY S, 2002, ROBUST ADAPTATION NO, P57 GUNAWARDANA A, 1998, P ICSLP 98 SYDN AUST, P791 GUO G, 2004, P 4 INT S CHIN LANG GUPTA S, 1998, P ICSLP 98 SYDN AUST, P795 HACIOGLU K, 2002, P INT C AC SPEECH SI, P225 HAZEN TJ, 2000, P ISCA ITRW WORKSH HERNANDEZ AG, 2000, P INT C AC SPEECH SI, P1803 Jiang H, 2001, IEEE T SPEECH AUDI P, V9, P874 JIANG H, 2001, P EUR C SPEECH COMM, P2573 JIANG H, 2002, P INT C SPOK LANG PR Jiang H, 1999, IEEE T SPEECH AUDI P, V7, P426 Jiang H, 2003, IEEE T SPEECH AUDI P, V11, P425, DOI 10.1109/TSA.2003.815821 JIANG L, 1998, P INT C SPOK LANG PR Jitsuhiro T, 1998, INT CONF ACOUST SPEE, P217, DOI 10.1109/ICASSP.1998.674406 Juang B.
H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E KAMPPARI SO, 2000, P IEEE INT C AC SPEE, P1799 Kemp T., 1997, P EUR C SPEECH COMM, P827 Koo MW, 2001, IEEE T SPEECH AUDI P, V9, P821 Lee CH, 1998, SPEECH COMMUN, V25, P29, DOI 10.1016/S0167-6393(98)00028-4 LEE CH, 2001, P ICSP AUG Lee C-H, 1998, P NORSIG 98 VIGS DEN, P9 Li Q, 2000, IEEE T SPEECH AUDI P, V8, P585 LI Q, 2002, P INT C SPOK LANG PR LIN Q, 1998, P INT C SPOK LANG PR LIN Q, 1999, P EUR C SPEECH COMM LIU F, 2001, P EUR C SPEECH COMM, P851 LLEIDA E, 2000, IEEE T SPEECH AUDIO, V6, P558 MAISON B, 2001, P INT C AC SPEECH SI MATHAN L., 1991, P INT C AC SPEECH SI, P93, DOI 10.1109/ICASSP.1991.150286 MATSUI T, 2001, P AUT SPEECH REC UND MENGUSOGLU E, 2001, P EUR C SPEECH COMM Modi P, 1997, P EUR C SPEECH COMM, P103 MOREAU N, 1999, P EUR C SPEECH COMM MORENO PJ, 2001, P EUR C SPEECH COMM NETI CV, 1997, P INT C AC SPEECH SI, P883 PALMER DD, 2001, P EUR C SPEECH COMM PAO C, 1998, P INT C SPOK LANG PR Rahim MG, 1997, IEEE T SPEECH AUDI P, V5, P266, DOI 10.1109/89.568733 Rahim MG, 1997, COMPUT SPEECH LANG, V11, P147, DOI 10.1006/csla.1997.0026 Rahim MG, 1997, J ACOUST SOC AM, V101, P2892, DOI 10.1121/1.418519 Rohlicek J.R., 1993, P 1993 IEEE INT C AC, pII459 Rose R.C., 1992, P IEEE INT C AC SPEE, P105, DOI 10.1109/ICASSP.1992.226109 ROSE RC, 1999, P EUR C SPEECH COMM Rose RC, 2001, SPEECH COMMUN, V34, P321, DOI 10.1016/S0167-6393(00)00040-6 ROSE RC, 1995, P INT C AC SPEECH SI, P281 ROSE RC, 1995, COMPUT SPEECH LANG, V9, P309, DOI 10.1006/csla.1995.0015 RUEBER B, 1997, P EUR C SPEECH COMM SANKAR A, 2003, P INT C AC SPEECH SI, P584 San-Segundo R., 2001, P INT C AC SPEECH SI SCHAAF T, 1997, P IEEE INT C AC SPEE, P875 Siu MH, 1999, COMPUT SPEECH LANG, V13, P299, DOI 10.1006/csla.1999.0126 Sukkar RA, 1996, IEEE T SPEECH AUDI P, V4, P420, DOI 10.1109/89.544527 Sukkar RA, 1997, SPEECH COMMUN, V22, P333, DOI 10.1016/S0167-6393(97)00031-9 SUKKAR RA, 1994, P INT C AC SPEECH SI, P393 Sukkar R.A., 1993, P IEEE ICASSP 93, P451 TAN BT, 2001, P EUR C SPEECH COMM TAN BT, 2000, P INT C SPOK LANG PR UHRIK C, 1997, P EUR C SPEECH COMM VERGYRI D, 2000, P IEEE ICASSP 2000, P1823 WALLHOFF F, 2000, P ICASSP 2000 IST, P1835 WEINTRAUB M, 1997, P IEEE INT C AC SPEE, P887 Wessel F., 1999, P EUR C SPEECH COMM, P315 Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002 Wessel F, 1998, INT CONF ACOUST SPEE, P225, DOI 10.1109/ICASSP.1998.674408 WESSEL F, 2000, P IEEE INT C AC SPEE, P1587 WILLETT D, 1998, P INT C SPOK LANG PR Williams G, 1999, COMPUT SPEECH LANG, V13, P395, DOI 10.1006/csla.1999.0129 WILPON JG, 1990, IEEE T ACOUST SPEECH, V38, P1870, DOI 10.1109/29.103088 YOUNG S, 1994, P IEEE INT C AC SPEE, P21 YOUNG SR, 1993, P INT C AC SPEECH SI, P590 ZHANG R, 2001, P EUR C SPEECH COMM NR 86 TC 109 Z9 113 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2005 VL 45 IS 4 BP 455 EP 470 DI 10.1016/j.specom.2004.12.004 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 919KR UT WOS:000228617400007 ER PT J AU Carlson, R Hirschberg, J Swerts, M AF Carlson, R Hirschberg, J Swerts, M TI Error handling in spoken dialogue systems SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Columbia Univ, Dept Comp Sci, New York, NY 10027 USA. RP Hirschberg, J (reprint author), Columbia Univ, Dept Comp Sci, 1214 Amsterdam Ave,M-C 0401, New York, NY 10027 USA.
EM julia@cs.columbia.edu RI Swerts, Marc/C-8855-2013 NR 0 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2005 VL 45 IS 3 BP 207 EP 209 DI 10.1016/j.specom.2004.11.003 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 908HQ UT WOS:000227778500001 ER PT J AU Torres, F Hurtado, LF Garcia, F Sanchis, E Segarra, E AF Torres, F Hurtado, LF Garcia, F Sanchis, E Segarra, E TI Error handling in a stochastic dialog system through confidence measures SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Error Handling in Spoken Dialogue Systems CY AUG, 2003 CL Chateau d Oex, SWITZERLAND SP SIGdial DE stochastic dialog system; stochastic language understanding; speech acts; confidence measures ID INFORMATION AB In this work, we present an approach to take advantage of confidence measures obtained during the recognition and understanding processes of a dialog system, in order to guide the behavior of the dialog manager. Our approach allows the system to ask the user for confirmation about the data which have low confidence values associated to them, after the recognition or understanding processes. This technique could help to protect the system from recognition or understanding errors. Although the number of confirmation turns could increase, it would be less probable for the system to consider data with a low confidence value as correct. The understanding module and the dialog manager that we have used are modelled by stochastic automata, and some confidence measures are proposed for the understanding module. An evaluation of the behavior of the dialog system is also presented. (c) 2004 Elsevier B.V. All rights reserved. C1 Univ Politecn Valencia, Dept Sist Informat & Comp, DSIC, Valencia 46022, Spain. RP Torres, F (reprint author), Univ Politecn Valencia, Dept Sist Informat & Comp, DSIC, Camino Vera S-N, Valencia 46022, Spain. EM ftgoterr@dsic.upv.es; lhurtado@dsic.upv.es; fgarcia@dsic.upv.es; esanchis@dsic.upv.es; esegarra@dsic.upv.es RI Garcia, Fernando/K-5073-2014; Segarra, Encarna/K-5883-2014 OI Garcia, Fernando/0000-0003-2213-4213; Segarra, Encarna/0000-0002-5890-8957 CR AUST H, 1994, 2ND P IVTTA 94 WORKS, P141 BONAFONTE A, 2000, DESARROLLO SISTEMA D BOUWMAN AG, 1999, P INT C AC SPEECH SI, V1, P493 *CMU, 1999, CMU COMM FORNEY GD, 1973, P IEEE, V61, P3 Garcia F, 2003, LECT NOTES ARTIF INT, V2807, P165 GLASS J, 2001, P 7 EUR C SPEECH COM, P1335 HACIOGLU K, 2002, P ICASSP Hazen TJ, 2002, COMPUT SPEECH LANG, V16, P49, DOI 10.1006/csla.2001.0183 Kellner A, 1997, SPEECH COMMUN, V23, P95, DOI 10.1016/S0167-6393(97)00036-8 Lamel L, 2000, SPEECH COMMUN, V31, P339, DOI 10.1016/S0167-6393(99)00067-9 LEVIN E, 1995, P EUR C SPEECH COMM, P555 LOPEZCOZAR R, 2000, SISTEMA TELEFONICO I MARTINEZ C, 2002, P 3 INT C LANG RES E, P1577 Minker W., 1999, Grammars, V2, DOI 10.1023/A:1009943728288 Pieraccini R., 1997, P EUR RHOD GREEC, P1875 RAYMOND C, 2003, IEEE ASRU AUTOMATIC, P150 SANSEGUNDO R, 2001, P 2 SIGDIAL WORKSH D, P140 SANSEGUNDO R, 2001, P ICASSP Schwartz R., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607771 Segarra E, 2002, INT J PATTERN RECOGN, V16, P301, DOI 10.1142/S021800140200168X STURM J, 1999, P EUROSPEECH, P1419 TORRES F, 2003, P EUROSPEECH, V1, P605 Wessel F, 1998, INT CONF ACOUST SPEE, P225, DOI 10.1109/ICASSP.1998.674408 Zhang R., 2001, P 7 EUR C SPEECH COM, P2105 NR 25 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2005 VL 45 IS 3 BP 211 EP 229 DI 10.1016/j.specom.2004.10.014 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 908HQ UT WOS:000227778500002 ER PT J AU Prodanov, P Drygajlo, A AF Prodanov, P Drygajlo, A TI Bayesian networks based multi-modality fusion for error handling in human-robot dialogues under noisy conditions SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Error Handling in Spoken Dialogue Systems CY AUG, 2003 CL Chateau d Oex, SWITZERLAND SP SIGdial DE human-robot dialogue; error handling; multi-modality fusion; Bayesian networks AB In this paper, we introduce a probabilistic model-based architecture for error handling in human-robot spoken dialogue systems under adverse audio conditions. In this architecture, a Bayesian network framework is used for interpretation of multi-modal signals in the spoken dialogue between a tour-guide robot and visitors in mass exhibition conditions. In particular, we report on experiments interpreting speech and laser scanner signals in the dialogue management system of the autonomous tour-guide robot RoboX, successfully deployed at the Swiss National Exhibition (Expo.02). A correct interpretation of a user's (visitor's) goal or intention at each dialogue state is a key issue for successful voice-enabled communication between tour-guide robots and visitors. To infer the visitors' goal under the uncertainty intrinsic to these two modalities, we introduce Bayesian networks for combining noisy speech recognition with data from a laser scanner, which are independent of acoustic noise. Experiments with real-world data, collected during the operation of RoboX at Expo.02, demonstrate the effectiveness of the approach in adverse environments. The proposed architecture makes it possible to model error-handling processes in spoken dialogue systems, which include complex combinations of different multi-modal information sources in cases where such information is available. (c) 2005 Elsevier B.V. All rights reserved. C1 Ecole Polytech Fed Lausanne, Swiss Fed Inst Technol, Autonomous Syst Lab, EPFL ST1 12S LSA1, CH-1015 Lausanne, Switzerland. Ecole Polytech Fed Lausanne, Swiss Fed Inst Technol, Signal Proc Inst, EPFL ST1 12S LSA1, CH-1015 Lausanne, Switzerland. RP Prodanov, P (reprint author), Ecole Polytech Fed Lausanne, Swiss Fed Inst Technol, Autonomous Syst Lab, EPFL ST1 12S LSA1, CH-1015 Lausanne, Switzerland. EM plamen.prodanov@epfl.ch; andrzej.drygajlo@epfl.ch CR BURGARD W, 1999, ARTIF INTELL, V114, P1 CHURCHER G, 1997, 976 U LEEDS SCH COMP Drygajlo A, 2003, ADV ROBOTICS, V17, P599, DOI 10.1163/156855303769156974 GARCIAMATEO C, 1999, INT WORKSH AUT SPEEC Horvitz E., 1999, P 7 INT C US MOD, P201 Huang X., 2001, SPOKEN LANGUAGE PROC JENSEN B, 2002, WORKSH ROB EXH LAUS JENSEN B, 2002, IROS 2002 IEEE RSJ I, P1221 Jensen F.
V., 1996, INTRO BAYESIAN NETWO Jensen F.V., 1990, COMPUTATIONAL STATIS, V4, P269 Kam M, 1997, P IEEE, V85, P108, DOI 10.1109/JPROC.1997.554212 KEIZER S, 2002, P 3 SIGD WORKSH DISC Krahmer E., 2001, International Journal of Speech Technology, V4, DOI 10.1023/A:1009648614566 Murphy K., 2002, THESIS UC BERKELEY Nefian A.V., 2002, EURASIP J APPL SIG P, V11, P1 Pavlovic V.I., 1999, THESIS U ILLINOIS UR PRODANOV P, 2002, INT C INT ROB SYST I, P1332 Prodanov P., 2003, P 8 EUR C SPEECH COM, P1057 Roy N., 2000, P 38 ANN M ASS COMP Russel S., 2003, ARTIFICIAL INTELLIGE, V2nd Shachter R., 1998, UAI SINGHAL A, 1997, SPIE C SEN FUS DEC C SKANTZE G, 2003, ITR WORKSH ERR HANDL, P71 Smith M., 2003, THESIS FLORIDA STATE STURM J, 2001, P 2 ACL SIGD WORKSH, P162 THORPE J, 2002, DAT FUS ALG COLL ROB, P1 THRUN S, 1999, IEEE INT C ROB AUT D Turunen M., 2001, P EUR 2001 AALB DENM, P2189 WILLEKE T, 2001, HIST MOBOT MUSEUM RO NR 29 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2005 VL 45 IS 3 BP 231 EP 248 DI 10.1016/j.specom.2004.10.015 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 908HQ UT WOS:000227778500003 ER PT J AU McTear, M O'Neill, I Hanna, P Liu, X AF McTear, M O'Neill, I Hanna, P Liu, X TI Handling errors and determining confirmation strategies - An object-based approach SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Error Handling in Spoken Dialogue Systems CY AUG, 2003 CL Chateau d Oex, SWITZERLAND SP SIGdial DE object-based dialogue management; grounding; information state; error handling AB A number of different approaches have been applied to the treatment of errors in spoken dialogue systems, including careful design to prevent potential errors, methods for on-line error detection, and error recovery when errors have occurred and have been detected. The approach to error handling presented here is premised on the theory of grounding, in which it is assumed that errors cannot be avoided in spoken dialogue and that it is more useful to focus on methods for determining what information needs to be grounded within a dialogue and how this grounding should be achieved. An object-based architecture is presented that incorporates generic confirmation strategies in combination with domain-specific heuristics that together contribute to determining the system's confirmation strategies when attempting to complete a transaction. The system makes use of a representation of the system's information state as it conducts a transaction along with discourse pegs that are used to determine whether values have been sufficiently confirmed for a transaction to be concluded. An empirical evaluation of the system is presented along with a discussion of the advantages of the object-based approach for error handling. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Ulster, Sch Comp & Math, Newtownabbey BT37 0QB, North Ireland. Queens Univ Belfast, Sch Comp Sci, Belfast, Antrim, North Ireland. RP McTear, M (reprint author), Univ Ulster, Sch Comp & Math, Shore Rd, Newtownabbey BT37 0QB, North Ireland. EM mf.mctear@ulster.ac.uk; i.oneill@qub.ac.uk; p.hanna@qub.ac.uk; xingkun.liu@qub.ac.uk CR BOHLIN P, 1999, LE48314 TRINDI BOUWMAN AG, 1999, P INT C AC SPEECH SI, V1, P493 Clark H. H., 1996, USING LANGUAGE EVERMANN G, 2000, P ICASSP, P1655 HAZEN TJ, 2000, P 6 INT C SPOK LANG Heisterkamp P., 1996, Proceedings ICSLP 96. 
Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607076 Komatani K., 2000, P INT C COMP LING CO, P467 Krahmer E., 2001, International Journal of Speech Technology, V4, DOI 10.1023/A:1009648614566 Larsson S., 2000, NAT LANG ENG, V6, P323, DOI [DOI 10.1017/S1351324900002539, 10.1017 S1351324900002539] Litman D. J., 2000, P ANLP NAACL, P218 Litman D.J., 1999, P 37 ANN M ASS COMP, P309, DOI 10.3115/1034678.1034729 McRoy S, 1998, INT J HUM-COMPUT ST, V48, P547 O'Neill I, 2005, SCI COMPUT PROGRAM, V54, P99, DOI 10.1016/j.scico.2004.05.006 ONEILL IM, 2000, NAT LANG ENG, V6, P341, DOI 10.1017/S1351324900002527 ONEILL IM, 2002, P ICSLP 2002 DENV SE, V3, P2045 ONEILL IM, 2003, P EUROSPEECH2003 PAEK T, 2000, P 16 C UNC ART INT A, P445 PELLOM B, 2000, P 6 INT C SPOK LANG RUDNICKY A, 1999, P IEEE AUT SPEECH RE Skantze G., 2003, P ISCA TUT RES WORKS, P71 STURM J, 1999, P EUROSPEECH, P1419 SWERTS M, 2000, P 6 INT C SPOK LANG Traum D., 1999, AAAI FALL S PSYCH MO, P124 Turunen M., 2001, P EUR 2001 AALB DENM, P2189 WALKER M, 2000, P 17 INT C MACH LEAR WALKER MA, 2000, N AM M ASS COMP LING WARD W, 1994, P ICSLP 94 Wessel F, 1998, INT CONF ACOUST SPEE, P225, DOI 10.1109/ICASSP.1998.674408 NR 28 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2005 VL 45 IS 3 BP 249 EP 269 DI 10.1016/j.specom.2004.11.006 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 908HQ UT WOS:000227778500004 ER PT J AU Bulyko, I Kirchhoff, K Ostendorf, M Goldberg, J AF Bulyko, I Kirchhoff, K Ostendorf, M Goldberg, J TI Error-correction detection and response generation in a spoken dialogue system SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Error Handling in Spoken Dialogue Systems CY AUG, 2003 CL Chateau d Oex, SWITZERLAND SP SIGdial DE spoken dialogue systems; error correction; dialogue management; response generation AB Speech understanding errors in spoken dialogue systems can be frustrating for users and difficult to recover from in a mixed-initiative spoken dialogue system. Handling such errors requires both detecting error conditions and adjusting the response generation strategy accordingly. In this paper, we show that different response wording choices tend to be associated with different user behaviors that can impact word recognition performance in a telephone-based dialogue system. We leverage these findings in a system that integrates an error correction detection module with a modified dialogue strategy in order to drive the response generation module. In a user study, we find slight preferences for a dialogue system using this error handling strategy over a simple reprompting strategy. (c) 2004 Elsevier B.V. All rights reserved. C1 Univ Washington, Dept Elect Engn, Seattle, WA 98115 USA. Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA. RP Ostendorf, M (reprint author), Univ Washington, Dept Elect Engn, Box 352500, Seattle, WA 98115 USA. 
EM mo@ssli.ee.washington.edu CR ABERDEEN J, 2003, P ISCA WORKSH ERR HA, P17 Ang J., 2002, P INT C SPOK LANG PR, V3, P2037 Bulyko I, 2002, COMPUT SPEECH LANG, V16, P533, DOI 10.1016/S0885-2308(02)00023-2 CHUCARROL J, 1999, P EUROSPEECH, V4, P1519 GOLDBERG J, 2003, P ISCA WORKSH ERR HA, P101 HIRASAWA, 2000, P INT C SPOK LANG PR, V2, P739 Hirschberg J, 2004, SPEECH COMMUN, V43, P155, DOI 10.1016/j.specom.2004.01.006 HIRSCHBERG J, 2001, P 2 N AM ANN M ASS C, P208 KAZEMZADEH A, 2003, P IEEE WORKSH AUT SP, P215 KIRCHHOFF K, 2001, P NAACL WORKSH AD DI, P33 KITAOKA N, 2003, P EUROSPEECH, V1, P625 Klein J, 2002, INTERACT COMPUT, V14, P119 KRAHMER E, 2001, J SPEECH TECHNOL, P19 LENDVAI P, 2002, P ESSLLI 2002 WORKSH LEVOW GA, 1998, P 36 ANN M ASS COMP, P736 MACHEREY K, 2003, P ISCA WORKSH ERR HA, P123 MARTINOVSKI B, 2003, P ISCA TUT RES WORKS, P11 Miller B. W., 1996, P 34 ANN M ASS COMP, P62, DOI 10.3115/981863.981872 ORLANDI M, 2003, P ISCA WORKSH ERR HA, P47 OVIATT S, 1996, P INT C SPOK LANG PR, V2, P801, DOI 10.1109/ICSLP.1996.607722 Oviatt S, 1998, J ACOUST SOC AM, V104, P3080, DOI 10.1121/1.423888 PAEK T, 2003, ISCA WORKSH ERR HAND, P95 PELLOM B, 2003, P INT C SPOK LANG PR, V2, P723 Quinlan JR, 1996, PROCEEDINGS OF THE THIRTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE, VOLS 1 AND 2, P725 Quinlan J. R., 1993, C4 5 PROGRAMS MACH L SANSEGUNDO R, 2001, P 2 SIGD WORKSH DISC Seneff S., 1998, P ICSLP 98 SYDN AUST, V3, P931 SHIN J, 2002, P INT C SPOK LANG PR, V3, P2069 SWERTS M, 2000, P INT C SPOK LANG PR, V2, P615 TOKUDA K, 2002, P ISCA TTS WORKSH WALKER M, 2000, P EUROSPEECH, V2, P1371 WANG YH, 2003, P ISCA WORKSH ERR HA, P139 WEISS G, 2001, 43 ML TR RUTG U DEP NR 33 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2005 VL 45 IS 3 BP 271 EP 288 DI 10.1016/j.specom.2004.09.009 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 908HQ UT WOS:000227778500005 ER PT J AU Sturm, J Boves, L AF Sturm, J Boves, L TI Effective error recovery strategies for multimodal form-filling applications SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Error Handling in Spoken Dialogue Systems CY AUG, 2003 CL Chateau d Oex, SWITZERLAND SP SIGdial DE multimodal interfaces; error correction; speech recognition AB The goal of the research described in this article is to determine in what way speech recognition errors can be handled best in a multimodal form-filling interface. Besides two well-known error correction mechanisms (re-speaking the value and choosing the correct value from a list of alternatives), the interface offers a novel correction mechanism in which the user selects the first letter of the target word from a soft-keyboard, after which the utterance is recognized once again, with a limited language model and lexicon. The multimodal interface that was used is a web-based form-filling GUI, extended with a speech overlay, which allows for pen and speech input. The effectiveness and efficiency of the error correction mechanisms, the error correction strategies that are applied by the users and the effects on user satisfaction were studied in an evaluation in which the interface was tested in two conditions: in one condition (LIST), the interface provides only re-speaking and the alternatives list as error correction facilities. 
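A Python sketch of the novel LETTER correction step this record describes: after the user taps the first letter on the soft-keyboard, the stored utterance is decoded again against a lexicon (and hence language model) restricted to words starting with that letter. The recognizer callback below is a hypothetical stand-in for the system's second recognition pass, not an API from the paper.

    def letter_constrained_correction(audio, lexicon, first_letter, recognize):
        # recognize(audio, vocab) -> best word hypothesis; assumed interface.
        vocab = [w for w in lexicon
                 if w.lower().startswith(first_letter.lower())]
        if not vocab:
            return None  # no lexicon entry matches the selected letter
        return recognize(audio, vocab)

The effectiveness the study reports comes from this drastic search-space reduction: the second pass cannot repeat an error outside the letter-constrained vocabulary.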
In the other condition (LETTER), the interface provides the soft-keyboard technique as an additional error correction facility. The study shows that error correction was more effective in the LETTER condition than in the LIST condition. The Keyboard correction facility enables the users to resolve errors that could not be resolved using the Re-speak method or by choosing from a list of alternatives. In spite of its low effectiveness, subjects initially attempted to use Re-speaking for error correction in both interfaces. However, we also found that subjects rapidly learned to choose the most effective option (Keyboard) as they gained experience. User satisfaction turned out to be higher for the LETTER interface than for the LIST interface: subjects considered the LETTER interface to be more useful and less frustrating and they felt more in control. As a result, most subjects clearly preferred the LETTER interface. (c) 2005 Elsevier B.V. All rights reserved. C1 Univ Nijmegen, Dept Language & Speech, NL-6500 HD Nijmegen, Netherlands. RP Sturm, J (reprint author), Univ Nijmegen, Dept Language & Speech, Postbus 9103, NL-6500 HD Nijmegen, Netherlands. EM j.sturm@tue.nl; l.boves@let.ru.nl CR AINSWORTH WA, 1992, INT J MAN MACH STUD, V36, P833, DOI 10.1016/0020-7373(92)90075-V COHEN PR, 1997, QUICKSET MULTIMODAL, P31 Halverson Christine A., 1999, P HUM COMP INT INTER, P133 HUANG XD, P IEEE INT C AC SPEE JOHNSTON M, 2002, P 40 ANN M ASS COMP Karat CM, 1999, P SIGCHI C HUM FACT, P568, DOI 10.1145/302979.303160 KARAT J, 2000, CHI 00 EXT ABSTR HUM, P141 LEVOW GA, 1998, P 36 ANN M ASS COMP, P736 LOVE S, 1997, P INT C SPOK LANG PR, P1307 MANKOFF J, 1999, GITGVU9918 GVU Larson K., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1022342732234 NIKLFELD G, 2002, P 2 SIGD WORKSH DISC Oviatt S., 1999, P C HUM FACT COMP SY, P576, DOI 10.1145/302979.303163 Oviatt S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607077 Oviatt S, 2000, COMMUN ACM, V43, P45, DOI 10.1145/330534.330538 Oviatt S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607722 STURM J, 2002, P 2 INT C LANG RES E STURM J, 2002, P ISCA WORKSH MULT I STURM J, 2001, P 2 SIGD WORKSH DISC Suhm B., 2001, ACM Transactions on Computer-Human Interaction, V8, DOI 10.1145/371127.371166 WALKER MA, 2001, P HUM LANG TECHN C ZAJICEK M, 1990, P INTERACT 90, P755 NR 22 TC 11 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2005 VL 45 IS 3 BP 289 EP 303 DI 10.1016/j.specom.2004.11.007 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 908HQ UT WOS:000227778500006 ER PT J AU Karsenty, L Botherel, V AF Karsenty, L Botherel, V TI Transparency strategies to help users handle system errors SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Error Handling in Spoken Dialogue Systems CY AUG, 2003 CL Chateau d Oex, SWITZERLAND SP SIGdial DE spoken dialogue system; dialogue strategies; misunderstanding; correction; empirical test ID SPOKEN DIALOGUE; MACHINE AB This study analyzes the impact of transparency strategies on the quality of user responses following system rejections and misunderstandings.
The aim of transparency is to make the system "visible" according to user needs, i.e. when the user's system knowledge is not sufficiently comprehensive and/or correct to successfully achieve the current goal. Since system knowledge depends on users' expertise, system expectations and system error rank, transparency is treated in this study as an adaptable and adaptive feature. Adaptable and adaptive transparency strategies were applied to TRAVELS, a spoken dialogue system which enables users to obtain plane and train schedules over the phone. Based on a partial Wizard-of-Oz simulation, an empirical assessment indicates the extent to which these transparency strategies can help users to respond appropriately to system errors. (c) 2005 Published by Elsevier B.V. C1 IntuiLab, F-31672 Labege, France. France Telecom R&D, F-22307 Lannion, France. RP Karsenty, L (reprint author), IntuiLab, Prologue 1,La Pyreneenne, F-31672 Labege, France. EM karsenty@intuilab.com CR AZZINI I, 2001, P EUROSPEECH 2001 AA Bernsen NO, 1996, DISCOURSE PROCESS, V21, P213 BOYCE S, 1996, P INT S SPOK DIAL IS, P65 BRENNAN SE, 1995, KNOWL-BASED SYST, V8, P143, DOI 10.1016/0950-7051(95)98376-H Chu-Carroll J., 2000, P 6 ACL C APPL NAT L, P97, DOI 10.3115/974147.974161 Fraser N. M., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90019-M Grice H. P., 1975, SYNTAX SEMANTICS, P41, DOI DOI 10.1017/S0022226700005296 Hone KS, 2001, INT J HUM-COMPUT ST, V54, P637, DOI 10.1006/ijhc.2000.0456 KAMM C, 1994, VOICE COMMUNICATION BETWEEN HUMANS AND MACHINES, P422 Kamm C.A., 1998, P 5 INT C SPOK LANG, P1211 Karsenty L, 2001, APPL ERGON, V32, P15, DOI 10.1016/S0003-6870(00)00058-2 KARSENTY L, 1999, 991B079 IRIT Karsenty L., 2002, International Journal of Speech Technology, V5, DOI 10.1023/A:1015472130944 KRAHMER E, 1999, EUROSPEECH 99 LANGLEY P, 1999, P 3 INT WORKSH COOP, P347 LAVELLE CA, 1999, P EUROSPEECH 99, V3, P1399 LITMAN D, 2000, P AAAI 2000 AUST TX LUZZATI D, 1989, P EUR C SPEECH COMM, P601 MAASS S, 1983, PSYCHOL COMPUTER USE, P19 ROUILLARD J, 2001, REV INTERACTION HOMM, V2, P55 Sadek D., 1997, P 15 INT JOINT C ART, P1030 SHIN J, 2002, ICSLP 02 Shriberg E., 1992, P DARPA SPEECH NAT L, P49, DOI 10.3115/1075527.1075538 SPITZ J, 1991, P 4 DARP WORKSH SPEE, P164, DOI 10.3115/112405.112430 SWERTS M, 2000, ICSLP 2000, V2, P615 Thomson D., 1999, P ESCA WORKSH INT DI VERONIS J, 1991, INT J MAN MACH STUD, V35, P187, DOI 10.1016/S0020-7373(05)80148-8 Walker MA, 1998, COMPUT SPEECH LANG, V12, P317, DOI 10.1006/csla.1998.0110 YANKELOVICH N, 1996, ACM INTERACTIONS, V3, P32 NR 29 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2005 VL 45 IS 3 BP 305 EP 324 DI 10.1016/j.specom.2004.10.018 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 908HQ UT WOS:000227778500007 ER PT J AU Skantze, G AF Skantze, G TI Exploring human error recovery strategies: Implications for spoken dialogue systems SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Error Handling in Spoken Dialogue Systems CY AUG, 2003 CL Chateau d Oex, SWITZERLAND SP SIGdial DE error handling; miscommunication; spoken dialogue systems; wizard-of-oz AB In this study, an explorative experiment was conducted in which subjects were asked to give route directions to each other in a simulated campus (similar to Map Task). 
In order to elicit error handling strategies, a speech recogniser was used to corrupt the speech in one direction. This way, data could be collected on how the subjects might recover from speech recognition errors. This method for studying error handling has the advantages that the level of understanding is transparent to the analyser and that the errors that occur are similar to errors in spoken dialogue systems. The results show that when subjects face speech recognition problems, a common strategy is to ask task-related questions that confirm their hypothesis about the situation instead of signalling non-understanding. Compared to other strategies, such as asking for a repetition, this strategy leads to better understanding of subsequent utterances, whereas signalling non-understanding leads to decreased experience of task success. (c) 2005 Elsevier B.V. All rights reserved. C1 KTH, Dept Speech Mus & Hearing, SE-10044 Stockholm, Sweden. RP Skantze, G (reprint author), KTH, Dept Speech Mus & Hearing, Lindstedtsvagen 24, SE-10044 Stockholm, Sweden. EM gabriel@speech.kth.se CR AINSWORTH WA, 1992, INT J MAN MACH STUD, V36, P833, DOI 10.1016/0020-7373(92)90075-V AMALBERTI R, 1993, INT J MAN MACH STUD, V38, P547, DOI 10.1006/imms.1993.1026 ANDERSON AH, 1991, LANG SPEECH, V34, P351 Balentine B., 2001, BUILD SPEECH RECOGNI BELL L, 1999, P ICPHS99 SAN FRANS, P1221 Brennan S, 1996, P INT S SPOK DIAL, P41 BROWN G, 1995, SPEAKER LISTENERS CO Carletta J, 1996, J PRAGMATICS, V26, P71, DOI 10.1016/0378-2166(95)00046-1 Carlson R, 1997, ELECT J DIFFERENTIAL, V23, P1 CLARK HH, 1994, SPEECH COMMUN, V15, P243, DOI 10.1016/0167-6393(94)90075-2 Clark H. H., 1996, USING LANGUAGE DOAHLBACK N, 1993, P 1993 INT WORKSH IN, P193 EDLUND J, 2004, P ICFSLP FLYCHTERIKSSON A, 2001, THESIS LINKOPING U Fraser N. M., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90019-M Gustafson J., 2000, P INT C SPOK LANG PR, V2, P134 Hinkle D, 1994, APPL STAT BEHAV SCI HIRST G, 1994, SPEECH COMMUN, V15, P213, DOI 10.1016/0167-6393(94)90073-6 LARSEN, 2003, THESIS AALBORG U LEVOW GA, 1998, P COLING ACL 98 McRoy S, 1998, INT J HUM-COMPUT ST, V48, P547 Oviatt S., 1996, P INT C SPOK LANG PR, V1, P204, DOI 10.1109/ICSLP.1996.607077 PAEK T, 2001, ACLI 2001 WORKSH EV SCHEGLOFF EA, 1992, AM J SOCIOL, V97, P1295, DOI 10.1086/229903 SHIN J, 2002, P ICSLP Skantze G., 2004, ISCA TUT RES WORKSH SKANTZE G, 2004, ISCA TUTORIAL RES WO Walker M. A., 2000, NAT LANG ENG, V6, P363, DOI 10.1017/S1351324900002503 Weigand E, 1999, J PRAGMATICS, V31, P763, DOI 10.1016/S0378-2166(98)00068-X NR 29 TC 19 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2005 VL 45 IS 3 BP 325 EP 341 DI 10.1016/j.specom.2004.11.005 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 908HQ UT WOS:000227778500008 ER PT J AU Barkhuysen, P Krahmer, E Swerts, M AF Barkhuysen, P Krahmer, E Swerts, M TI Problem detection in human-machine interactions based on facial expressions of users SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Error Handling in Spoken Dialogue Systems CY AUG, 2003 CL Chateau d Oex, SWITZERLAND SP SIGdial ID SPEECH RECOGNITION; DIALOGUE; EMOTION; CUES AB This paper describes research into audiovisual cues to communication problems in interactions between users and a spoken dialogue system. The study consists of two parts.
First, we describe a series of three perception experiments in which subjects are offered film fragments (without any dialogue context) of speakers interacting with a spoken dialogue system. In half of these fragments, the speaker is or becomes aware of a communication problem. Subjects have to determine by forced choice which are the problematic fragments. In all three tests, subjects are capable of performing this task to some extent, but with varying levels of correct classifications. Second, we report results of an observational analysis in which we first attempt to relate the perceptual results to visual features of the stimuli presented to subjects, and second to find out which visual features actually are potential cues for error detection. Our major finding is that more problematic contexts lead to more dynamic facial expressions, in line with earlier claims that communication errors lead to marked speaker behaviour. We conclude that visual information from a user's face is potentially beneficial for problem detection. (c) 2004 Elsevier B.V. All rights reserved. C1 Tilburg Univ, NL-5000 LE Tilburg, Netherlands. RP Barkhuysen, P (reprint author), Tilburg Univ, POB 90153, NL-5000 LE Tilburg, Netherlands. EM p.n.barkhuysen@uvt.nl; ej.krahmer@uvt.nl; m.g.j.swerts@uvt.nl RI Swerts, Marc/C-8855-2013 CR AGELFORS E, 1998, P INT C SPOK LANG PR AHRENBERG L, 1993, 14 SCAND C LING 8 C BENOIT C, 2000, HDB STANDARDS RESOUR BOUWMAN AG, 1999, P INT C AC SPEECH SI, V1, P493 BRENNAN SE, 1995, J MEM LANG, V34, P383, DOI 10.1006/jmla.1995.1017 CARPENTER P, 2001, P EUROSPEECH 2001, P2121 DANIELI M, 1996, AAAI WORKSH DET REP DOHEN M, 2003, P ISCA TUT RES WORKS, V47, P245 Doherty-Sneddon G, 2001, MEM COGNITION, V29, P909, DOI 10.3758/BF03195753 Ekman P., 1978, FACIAL ACTION CODING Ekman P, 1975, UNMASKING FACE GUIDE Erickson D, 1998, LANG SPEECH, V41, P399 FRIDLUND AJ, 1993, HUMAN FACIAL EXPRESS Gagne JP, 2002, SPEECH COMMUN, V37, P213, DOI 10.1016/S0167-6393(01)00012-7 Goldberg J, 2003, ISCA TUT RES WORKSH, P101 Granstrom B., 2002, P SPEECH PROS 2002 C, P347 HART JT, 1965, J EDUC PSYCHOL, V56, P208, DOI 10.1037/h0022263 Hirschberg J, 2004, SPEECH COMMUN, V43, P155, DOI 10.1016/j.specom.2004.01.006 HIRSCHBERG J, 2001, P NAACL 01 Jordan TR, 2000, LANG SPEECH, V43, P107 KENDON A, 2001, SEMIOTICA, V35, P191 Krahmer E, 2002, SPEECH COMMUN, V36, P133, DOI 10.1016/S0167-6393(01)00030-9 KRAUT RE, 1979, J PERS SOC PSYCHOL, V37, P1539, DOI 10.1037//0022-3514.37.9.1539 LENDVAI P, 2002, MACHINE LEARNING APP, P1 Levow GA, 2002, SPEECH COMMUN, V36, P147, DOI 10.1016/S0167-6393(01)00031-0 LITMAN DJ, 2001, NACCL 01 Nakano M., 2003, P 8 EUR C SPEECH COM, P417 OVIATT SL, 1998, SPEECH COMMUN, V24, P1 Petajan E. D., 1985, Proceedings CVPR '85: IEEE Computer Society Conference on Computer Vision and Pattern Recognition (Cat. No. 85CH2145-1) Picard RW, 2002, INTERACT COMPUT, V14, P141, DOI 10.1016/S0953-5438(01)00055-8 SENGER P, 1999, 16 INT JOINT C ART I SMITH VL, 1993, J MEM LANG, V32, P25, DOI 10.1006/jmla.1993.1002 SWERTS M, 2003, P ISCA WORKSH ERR HA WADE E, 1992, P 2 INT C SPOK LANG, P995 Walker MA, 1998, COMPUT SPEECH LANG, V12, P317, DOI 10.1006/csla.1998.0110 NR 35 TC 17 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAR PY 2005 VL 45 IS 3 BP 343 EP 359 DI 10.1016/j.specom.2004.10.004 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 908HQ UT WOS:000227778500009 ER PT J AU Kusumoto, A Arai, T Kinoshita, K Hodoshima, N Vaughan, N AF Kusumoto, A Arai, T Kinoshita, K Hodoshima, N Vaughan, N TI Modulation enhancement of speech by a pre-processing algorithm for improving intelligibility in reverberant environments SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; modulation spectrum; modulation transfer function; reverberation; word intelligibility ID ROOM ACOUSTICS; PERCEPTION; LISTENERS; HEARING; IDENTIFICATION; RECEPTION AB Most listeners have difficulty understanding speech in reverberant conditions. The purpose of this study is to investigate whether it is possible to reduce the degree of degradation of speech intelligibility in reverberation through the development of an algorithm. The modulation spectrum is the spectral representation of the temporal envelope of the speech signal. That of clean speech is dominated by components between 1 and 16 Hz, centered at 4 Hz, which is the most important range for human perception of speech. In reverberant conditions, the modulation spectrum of speech is shifted toward the lower end of the modulation frequency range. In this study, we proposed to enhance the important modulation spectral components prior to distortion of speech by reverberation. Word intelligibility in a carrier sentence was tested with the newly developed algorithm, including two different filter designs, in three reverberant conditions. The reverberant speech was simulated by convolving clean speech with impulse responses measured in actual halls. The experimental results show that modulation filtering incorporated into a pre-processing algorithm improves intelligibility for normal-hearing listeners when (1) the modulation filters are optimal for a specific reverberant condition (i.e., T-60 = 1.1 s), and (2) consonants are preceded by highly powered segments. Under the shorter (0.7 s) and longer (1.6 s) reverberation times, the two modulation filters used in the current experiments, an Empirically-Designed (E-D) filter and a Data-Derived (D-D) filter, respectively caused a slight performance decrement. The results of this study suggest that further gains in intelligibility may be accomplished by re-design of the modulation filters suitable for other reverberant conditions. (C) 2004 Elsevier B.V. All rights reserved. C1 Portland VA Med Ctr, Natl Ctr Rehabilitat Auditory Res, VA RR&D, Portland, OR 97239 USA. Sophia Univ, Dept Elect & Elect Engn, Chiyoda Ku, Tokyo 1028554, Japan. RP Kusumoto, A (reprint author), Portland VA Med Ctr, Natl Ctr Rehabilitat Auditory Res, VA RR&D, NCRAR,3710 SW US Vet Hosp Rd, Portland, OR 97239 USA. EM akiko.kusumoto@med.va.gov; arai@sophia.ac.jp; kinoshita@cslab.kecl.ntt.co.jp; n-hodosh@sophia.ac.jp; vaughann@ohsu.edu CR Arai T., 2002, Acoustical Science and Technology, V23, DOI 10.1250/ast.23.229 Arai T, 1998, INT CONF ACOUST SPEE, P933, DOI 10.1109/ICASSP.1998.675419 Arai T, 1999, J ACOUST SOC AM, V105, P2783, DOI 10.1121/1.426895 Arai T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607318 Arai T., 1997, P EUR RHOD GREEC, P1011 Avendano C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat.
No.96TH8206), DOI 10.1109/ICSLP.1996.607744 BOLT RH, 1949, J ACOUST SOC AM, V21, P577, DOI 10.1121/1.1906551 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 Fant G., 1960, ACOUSTIC THEORY SPEE FLANAGAN JL, 1985, J ACOUST SOC AM, V78, P1508, DOI 10.1121/1.392786 FLETCHER H, 1950, J ACOUST SOC AM, V22, P89, DOI 10.1121/1.1906605 FURUI S, 1986, J ACOUST SOC AM, V80, P1016, DOI 10.1121/1.393842 GONZALEZRODRIGU.J, 2000, P ICASSP, P953 GREENBERG S, 2001, P EUR C SPEECH COMM, V1, P473 Greenberg S., 1997, P ESCA WORKSH ROB SP, P23 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HODOSHIMA N, 2003, P EUR C SPEECH COMM, P1365 HODOSHIMA N, 2002, P CHIN JAP JOINT C A, P199 HOUSE AS, 1965, J ACOUST SOC AM, V37, P158, DOI 10.1121/1.1909295 HOUTGAST T, 1980, ACUSTICA, V46, P60 HOUTGAST T, 1973, ACUSTICA, V28, P66 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Jordan V., 1980, ACOUSTICAL DESIGN CO Knudsen VO, 1929, J ACOUST SOC AM, V1, P56, DOI 10.1121/1.1901470 KRUEL EJ, 1968, J SPEECH HEAR RES, V11, P536 Langhans T., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing NABELEK AK, 1982, J ACOUST SOC AM, V71, P1242 NABELEK AK, 1974, J SPEECH HEAR RES, V17, P724 NABELEK AK, 1984, J ACOUST SOC AM, V75, P632 NABELEK AK, 1989, J ACOUST SOC AM, V86, P1259 NABELEK AK, 1978, J ACOUST SOC AM, V63, P187 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Wang H., 1991, P ICASSP, P953, DOI 10.1109/ICASSP.1991.150498 NR 35 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2005 VL 45 IS 2 BP 101 EP 113 DI 10.1016/j.specom.2004.06.003 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 903WZ UT WOS:000227457600001 ER PT J AU Ouni, S Cohen, MM Massaro, DW AF Ouni, S Cohen, MM Massaro, DW TI Training Baldi to be multilingual: A case study for an Arabic Badr SO SPEECH COMMUNICATION LA English DT Article DE talking head; avatar; visible and visual speech synthesis; text-to-speech; auditory; Arabic; multilingual ID HEARING-LOSS; CHILDREN; SPEECH; VOCABULARY AB In this paper, we describe research to extend the capability of an existing talking head, Baldi, to be multilingual. We use a parsimonious client/server architecture to impose autonomy in the functioning of an auditory speech module and a visual speech synthesis module. This scheme enables the implementation and the joint application of text-to-speech synthesis and facial animation in many languages simultaneously. Additional languages can be added to the system by defining a unique phoneme set and unique phoneme definitions for the visible speech for each language. The accuracy of these definitions is tested in perceptual experiments in which human observers identify auditory speech in noise presented alone or paired with the synthetic versus a comparable natural face. We illustrate the development of an Arabic talking head, Badr, and demonstrate how the empirical evaluation enabled the improvement of the visible speech synthesis from one version to another. (C) 2005 Elsevier B.V. All rights reserved. C1 Univ Calif Santa Cruz, Perceptual Sci Lab, Santa Cruz, CA 95064 USA.
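The Ouni et al. record above states that a language is added by defining a phoneme set together with per-phoneme visible-speech definitions. A minimal sketch of that registration idea follows; the control parameters (jaw, lip_round, lip_open) and every value are invented placeholders, not Baldi's actual articulation controls, and the tiny Arabic inventory is purely illustrative.

    from dataclasses import dataclass

    @dataclass
    class VisemeTarget:
        jaw: float        # jaw opening, 0..1 (hypothetical control)
        lip_round: float  # lip rounding, 0..1 (hypothetical control)
        lip_open: float   # lip aperture, 0..1 (hypothetical control)

    # Each language registers its own phoneme inventory with visible-speech
    # targets; the auditory TTS module is handled separately in this scheme.
    LANG_PHONEMES = {
        "arabic": {
            "b":  VisemeTarget(0.2, 0.1, 0.0),   # bilabial closure
            "a:": VisemeTarget(0.8, 0.1, 0.9),   # long open vowel
            "q":  VisemeTarget(0.5, 0.2, 0.4),   # uvular stop (placeholder values)
        },
    }

    def viseme_track(lang, phonemes):
        """Look up the targets the animation module would interpolate between."""
        table = LANG_PHONEMES[lang]
        return [table[p] for p in phonemes]

    print(viseme_track("arabic", ["b", "a:"]))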
RP Ouni, S (reprint author), LORIA, Speech Grp, 615 Rue Jardin Bot, F-54600 Villers Les Nancy, France. EM slim@fuzzy.ucsc.edu CR Al-Ani Salman H, 1970, ARABIC PHONOLOGY ACO Ali L.H., 1972, STUD LINGUISTICA, V26, P81, DOI 10.1111/j.1467-9582.1972.tb00589.x BAGEIN M, 2000, PRORISC 2000 VELDH Barker L., 2003, J DEAF STUD DEAF EDU, V8, P187, DOI 10.1093/deafed/eng002 BERNSTEIN LE, 1986, J HOPKINS LIP READIN Beutnagel M., 1999, JOINT M ASA EAA DAGA BLANZ V, 2003, P EUROGRAPHICS 2003 Bosseler A, 2003, J AUTISM DEV DISORD, V33, P653, DOI 10.1023/B:JADD.0000006002.82367.4f Bregler C, 1997, P ACM SIGGRAPH 97 BRESLAW PI, 1981, J CHILD PSYCHOL PSYC, V22, P269, DOI 10.1111/j.1469-7610.1981.tb00552.x Chuang E. S., 2002, Proceedings 10th Pacific Conference on Computer Graphics and Applications, DOI 10.1109/PCCGA.2002.1167840 COHEN MM, 2002, P ICMI 02 IEEE 4 INT COHEN MM, 1998, ETRW AUD VIS SPEECH, P201 COHEN MM, 1993, COMPUTER ANIMATION, P141 COHEN MM, 1996, SPEECHREADING HUMANS, P53 COSI P, 2002, 7 INT C SPOK LANG PR Davis H., 1978, HEARING AND DEAFNESS Elgendy A. M., 2001, ASPECTS PHARYNGEAL C Ezzat T, 2002, ACM T GRAPHIC, V21, P388 EZZAT T, 1998, P COMP AN C PHIL PA Gairdner W. H. T., 1925, PHONETICS ARABIC PHO GHAZALI S, 1977, BLACK CONSONANTS BAC GUENTER B, 1998, MAKING FACES, P55 HOLT JA, 1997, INTERPRETING SCORES HUANG XD, 1998, J ACOUST SOC AM, V103, P2815, DOI 10.1121/1.421583 JAKOBSON R, 1962, MOFAXXAMA EMPHATIC P, V1, P510 Jesse A., 2000, INTERPRETING, V5, P95, DOI 10.1075/intp.5.2.04jes Kahler K, 2001, P GRAPH INT 2001, P37 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 LINDAU M, 1983, INVARIANCE VARIABILI, P464 Massaro D. W., 1998, PERCEIVING TALKING F MASSARO DW, IN PRESS AUDIOVISUAL Massaro DW, 2004, J SPEECH LANG HEAR R, V47, P304, DOI 10.1044/1092-4388(2004/025) MASSARO DW, 2000, EMBODIED CONVERSATIO, P286 MASSARO DW, 2003, 15 INT C PHON SCI IC MASSARO DW, 2004, P 37 ANN HAW INT C S MASSARO DW, 2003, 8 EUR C SPEECH COMM Massaro DW, 2004, VOLTA REV, V104, P141 PARKE FI, 1975, COMPUTERS GRAPHICS J, V1, P1 SPROAT R, 1997, MULTILINGUAL TEXT TO SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 TAYLOR P, 1997, FESTIVAL SPEECH SYNT NR 42 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2005 VL 45 IS 2 BP 115 EP 137 DI 10.1016/j.specom.2004.11.008 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 903WZ UT WOS:000227457600002 ER PT J AU Ito, T Takeda, K Itakura, F AF Ito, T Takeda, K Itakura, F TI Analysis and recognition of whispered speech SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; whispered speech; telephone handset; noise robustness AB In this study, we have examined the acoustic characteristics of whispered speech and addressed some of the issues involved in recognition of whispered speech used for communication over a mobile phone in a noisy environment. The acoustic analysis shows that there is an upward shift of formant frequencies of vowels as observed in the whispered speech data compared to the normal speech data. Voiced consonants in the whispered speech have lower energy at low frequencies up to 1.5 kHz and their spectral flatness is greater compared to the normal speech. 
In experiments on whispered speech recognition, our studies on adaptation of the whispered speech models have shown that a small amount of whispered speech data from a target speaker can be used effectively for recognition of whispered speech. In a noisy environment, the recognition accuracy decreases significantly more for whispered speech than for normally phonated speech. Covering the mouth with a hand to increase the SNR has been shown to give higher recognition accuracy for whispered speech, which is frequently used for private communication in noisy environments. (C) 2004 Elsevier B.V. All rights reserved. C1 Nagoya Univ, Grad Sch Engn, Nagoya, Aichi 4648603, Japan. RP Takeda, K (reprint author), Nagoya Univ, Grad Sch Engn, Nagoya, Aichi 4648603, Japan. EM itou.taisuke@jp.fujitsu.com; takeda@nuee.nagoya-u.ac.jp; itakuraf@ccmfs.meijo-u.ac.jp CR Eklund I, 1996, PHONETICA, V54, P1 FUJIMURA O, 1971, J ACOUST SOC AM, V49, P541, DOI 10.1121/1.1912385 HOLMES JN, 1983, J ACOUST SOC AM, V73, pS87, DOI 10.1121/1.2020610 ITOU K, 1998, P OR COCOSDA MAY 199 KALLAIL KJ, 1984, J SPEECH HEAR RES, V27, P245 KAWAHARA T, 1999, P IEEE AUT SPEECH RE, P393 Konno H., 1996, SP95140 IEICE, P39 KUREMATSU A, 1990, SPEECH COMMUN, V9, P357, DOI 10.1016/0167-6393(90)90011-W Leggetter C., 1995, P ARPA SPOK LANG TEC MEYEREPPLER W, 1957, J ACOUST SOC AM, V29, P104, DOI 10.1121/1.1908631 MORRIS R, 2002, P INT C AC SPEECH SI, P4159 SUGITO M, 1991, SP911 IEICE, P1 THOMAS IB, 1969, J ACOUST SOC AM, V46, P468, DOI 10.1121/1.1911712 WENNDT SJ, 2002, P INT C SPOK LANG PR, P649 NR 14 TC 41 Z9 46 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2005 VL 45 IS 2 BP 139 EP 152 DI 10.1016/j.specom.2003.10.005 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 903WZ UT WOS:000227457600003 ER PT J AU Paliwal, KK Alsteris, LD AF Paliwal, KK Alsteris, LD TI On the usefulness of STFT phase spectrum in human listening tests SO SPEECH COMMUNICATION LA English DT Article DE short-time Fourier transform; phase spectrum; magnitude spectrum; speech perception; overlap-add procedure; automatic speech recognition ID TIME FOURIER-ANALYSIS; SIGNAL RECONSTRUCTION; TRANSFORM MAGNITUDE; SPEECH; PERCEPTION; ENHANCEMENT; VOCODER AB The short-time Fourier transform (STFT) of a speech signal has two components: the magnitude spectrum and the phase spectrum. In this paper, the relative importance of short-time magnitude and phase spectra for speech perception is investigated. Human perception experiments are conducted to measure intelligibility of speech stimuli synthesized either from magnitude spectra or phase spectra. It is traditionally believed that the magnitude spectrum plays a dominant role for small window durations (20-40 ms), while the phase spectrum is more important for large window durations (>1 s). It is shown in this paper that even for small window durations, the phase spectrum can contribute to speech intelligibility as much as the magnitude spectrum if the analysis-modification-synthesis parameters are properly selected. (C) 2004 Elsevier B.V. All rights reserved. C1 Griffith Univ, Sch Microelect Engn, Nathan, Qld 4111, Australia. RP Paliwal, KK (reprint author), Griffith Univ, Sch Microelect Engn, Nathan Campus, Nathan, Qld 4111, Australia.
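The analysis-modification-synthesis procedure studied in the Paliwal and Alsteris record above can be sketched in a few lines: take the STFT, keep either only the magnitude (phase zeroed) or only the phase (unit magnitude), and resynthesize by overlap-add. This is a minimal sketch assuming NumPy and SciPy; a synthetic tone stands in for a speech stimulus, and the 32 ms window is an arbitrary choice, not a parameter taken from the paper.

    import numpy as np
    from scipy.signal import stft, istft

    fs = 16000
    t = np.arange(0, 0.5, 1.0 / fs)
    x = np.sin(2 * np.pi * 440 * t)            # stand-in for a speech signal

    nperseg = int(0.032 * fs)                  # 32 ms analysis window
    _, _, Z = stft(x, fs=fs, nperseg=nperseg)  # complex STFT

    mag_only = np.abs(Z)                       # keep magnitude, zero phase
    phase_only = np.exp(1j * np.angle(Z))      # keep phase, unit magnitude

    _, x_mag = istft(mag_only, fs=fs, nperseg=nperseg)      # magnitude-only stimulus
    _, x_phase = istft(phase_only, fs=fs, nperseg=nperseg)  # phase-only stimulus
    print(x_mag.shape, x_phase.shape)

With the default Hann window and 50% overlap, the overlap-add constraint is satisfied, so changing only the spectral component isolates its perceptual contribution.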
EM k.paliwal@griffith.edu.au; l.alsteris@griffith.edu.au CR ALLEN JB, 1977, IEEE T ACOUST SPEECH, V25, P235, DOI 10.1109/TASSP.1977.1162950 ALLEN JB, 1977, P IEEE, V65, P1558, DOI 10.1109/PROC.1977.10770 Alsteris L., 2004, P IEEE INT C AC SPEE, V1, P573 COX RC, 1980, P IEEE INT C AC SPEE, P150 CROCHIERE RE, 1980, IEEE T ACOUST SPEECH, V28, P99, DOI 10.1109/TASSP.1980.1163353 ESPY CY, 1983, IEEE T ACOUST SPEECH, V31, P894, DOI 10.1109/TASSP.1983.1164151 FLANAGAN JL, 1966, AT&T TECH J, V45, P1493 GOLDSTEI.JL, 1967, J ACOUST SOC AM, V41, P458, DOI 10.1121/1.1910357 GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317 HAYES MH, 1980, IEEE T ACOUST SPEECH, V28, P672, DOI 10.1109/TASSP.1980.1163463 IZRAELEVITZ D, 1985, IEEE T ACOUST SPEECH, V33, P1611, DOI 10.1109/TASSP.1985.1164746 KIM DS, 2000, P IEEE INT C AC SPEE, P1383 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Liu L, 1997, SPEECH COMMUN, V22, P403, DOI 10.1016/S0167-6393(97)00054-X MATHES RC, 1947, J ACOUST SOC AM, V19, P780, DOI 10.1121/1.1916623 MERCHANT GA, 1983, IEEE T ACOUST SPEECH, V31, P1135, DOI 10.1109/TASSP.1983.1164199 NAWAB SH, 1983, IEEE T ACOUST SPEECH, V31, P986, DOI 10.1109/TASSP.1983.1164162 Ohm G. S., 1843, ANN PHYS CHEM, V59, P513 Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE OPPENHEIM AV, 1981, P IEEE, V69, P529, DOI 10.1109/PROC.1981.12022 Paliwal K., 2003, P EUROSPEECH, P65 Paliwal K. K., 2003, P EUR 2003, P2117 Paliwal K.K., 2003, P IPSJ SPOK LANG PRO, P1 PATTERSON RD, 1987, J ACOUST SOC AM, V82, P1560, DOI 10.1121/1.395146 PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532 PLOMP R, 1969, J ACOUST SOC AM, V46, P409, DOI 10.1121/1.1911705 POBLOTH H, 1999, P INT C AC SPEECH SI, V1, P29 PORTNOFF MR, 1976, IEEE T ACOUST SPEECH, V24, P243, DOI 10.1109/TASSP.1976.1162810 PORTNOFF MR, 1981, IEEE T ACOUST SPEECH, V29, P364, DOI 10.1109/TASSP.1981.1163580 PORTNOFF MR, 1979, P IEEE INT C AC SPEE, P186 PORTNOFF MR, 1981, IEEE T ACOUST SPEECH, V29, P374, DOI 10.1109/TASSP.1981.1163581 PORTNOFF MR, 1980, IEEE T ACOUST SPEECH, V28, P55, DOI 10.1109/TASSP.1980.1163359 Quatieri T. F., 2002, DISCRETE TIME SPEECH QUATIERI TF, 1981, IEEE T ACOUST SPEECH, V29, P1187, DOI 10.1109/TASSP.1981.1163714 RABINER LR, 1978, DISCRETE TIME SPEECH REDDY NS, 1985, IEEE T CIRCUITS SYST, V32 SCHAFER RW, 1973, IEEE T ACOUST SPEECH, VAU21, P165, DOI 10.1109/TAU.1973.1162474 SCHROEDER MR, 1959, J ACOUST SOC AM, V31, P1579, DOI 10.1121/1.1930316 SCHROEDER MR, 1975, P IEEE, V63, P1332, DOI 10.1109/PROC.1975.9941 SCHROEDER MR, 1986, J ACOUST SOC AM, V79, P1580, DOI 10.1121/1.393292 THOMAS DM, 1984, P IEEE INT C AC SPEE VANHOVE PL, 1983, IEEE T ACOUST SPEECH, V31, P1286, DOI 10.1109/TASSP.1983.1164178 von Helmholtz Hermann, 1912, SENSATIONS TONE WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 YEGNANARAYANA B, 1984, IEEE T ACOUST SPEECH, V32, P610, DOI 10.1109/TASSP.1984.1164365 Yegnanarayana B., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) NR 46 TC 32 Z9 34 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD FEB PY 2005 VL 45 IS 2 BP 153 EP 170 DI 10.1016/j.specom.2004.08.001 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 903WZ UT WOS:000227457600004 ER PT J AU Tur, G Hakkani-Tur, D Schapire, RE AF Tur, G Hakkani-Tur, D Schapire, RE TI Combining active and semi-supervised learning for spoken language understanding SO SPEECH COMMUNICATION LA English DT Article DE active learning; semi-supervised learning; spoken language understanding; call classification AB In this paper, we describe active and semi-supervised learning methods for reducing the labeling effort for spoken language understanding. In a goal-oriented call routing system, understanding the intent of the user can be framed as a classification problem. State of the art statistical classification systems are trained using a large number of human-labeled utterances, preparation of which is labor intensive and time consuming. Active learning aims to minimize the number of labeled utterances by automatically selecting the utterances that are likely to be most informative for labeling. The method for active learning we propose, inspired by certainty-based active learning, selects the examples that the classifier is the least confident about. The examples that are classified with higher confidence scores (hence not selected by active learning) are exploited using two semi-supervised learning methods. The first method augments the training data by using the machine-labeled classes for the unlabeled utterances. The second method instead augments the classification model trained using the human-labeled utterances with the machine-labeled ones in a weighted manner. We then combine active and semi-supervised learning using selectively sampled and automatically labeled data. This enables us to exploit all collected data and alleviates the data imbalance problem caused by employing only active or semi-supervised learning. We have evaluated these active and semi-supervised learning methods with a call classification system used for AT&T customer care. Our results indicate that it is possible to reduce human labeling effort significantly. (C) 2004 Elsevier B.V. All rights reserved. C1 AT&T Labs Res, Florham Pk, NJ 07932 USA. Princeton Univ, Dept Comp Sci, Princeton, NJ 08544 USA. RP Tur, G (reprint author), AT&T Labs Res, 180 Pk Ave, Florham Pk, NJ 07932 USA. 
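The combination described in the Tur et al. record above condenses to: send to human labelers what the classifier is least certain about, and self-train on what it is most certain about. The sketch below is a hedged illustration assuming scikit-learn with toy data; the 0.6 and 0.9 thresholds are invented, and logistic regression merely stands in for the classifier the system actually uses.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X_lab = rng.normal(size=(40, 5))
    y_lab = (X_lab[:, 0] > 0).astype(int)        # toy human labels
    X_unl = rng.normal(size=(200, 5))            # unlabeled utterances

    clf = LogisticRegression().fit(X_lab, y_lab)
    conf = clf.predict_proba(X_unl).max(axis=1)  # classifier certainty

    to_label = conf < 0.6                        # least certain: human labeling
    trusted = conf > 0.9                         # most certain: machine labels

    X_aug = np.vstack([X_lab, X_unl[trusted]])
    y_aug = np.concatenate([y_lab, clf.predict(X_unl[trusted])])
    clf = LogisticRegression().fit(X_aug, y_aug)  # retrain on augmented data
    print(int(to_label.sum()), "utterances sent for manual labeling")

Using both pools at once is what alleviates the data imbalance that arises when only the actively selected (hence hardest) examples are added to training.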
EM gtur@research.att.com; dtur@research.att.com; schapire@cs.princeton.edu CR ABE N, 1998, P INT C MACH LEARN I Angluin D., 1988, Machine Learning, V2, DOI 10.1007/BF00116828 Argamon-Engelson S, 1999, J ARTIF INTELL RES, V11, P335 Blum A, 1998, P WORKSH COMP LEARN COHN D, 1994, MACH LEARN, V15, P201, DOI 10.1023/A:1022673506211 Freund Y, 1997, J COMPUT SYST SCI, V55, P119, DOI 10.1006/jcss.1997.1504 Freund Y, 1997, MACH LEARN, V28, P133, DOI 10.1023/A:1007330508534 Friedman J, 2000, ANN STAT, V28, P337, DOI 10.1214/aos/1016218223 GHANI R, 2002, P INT C MACH LEARN I Gorin AL, 2002, COMPUTER, V35, P51, DOI 10.1109/MC.2002.993771 Haffner P., 2003, P INT C AC SPEECH SI Hakkani-Tuh D., 2002, P INT C AC SPEECH SI IYER R, 2002, P INT C AC SPEECH SI KUO J, 2002, P INT C SPOK LANG PR Lewis D., 1994, P INT C MACH LEARN I LIERE R, 1997, P C AM ASS ART INT A MCCALLUM AK, 1998, P INT C MACH LEARN I MUSLEA I, 2002, P INT C MACH LEARN I Natarajan P., 2002, P INT C SPOK LANG PR Nigam K, 2000, MACH LEARN, V39, P103, DOI 10.1023/A:1007692713085 Nigam K., 2000, P INT C INF KNOWL MA RICCARDI G, 2003, P EUR C SPEECH COMM SASSANO M, 2002, P ANN M ASS COMP LIN Schapire R., 2002, P INT C MACH LEARN I Schapire RE, 2001, P MSRI WORKSH NONL E Schapire RE, 1999, MACH LEARN, V37, P297, DOI 10.1023/A:1007614523901 Schapire RE, 2000, MACH LEARN, V39, P135, DOI 10.1023/A:1007649029923 SCHOHN G, 2000, P INT C MACH LEARN I SEUNG HS, 1992, P WORKSH COMP LEARN Singer Y., 2002, MACHINE LEARNING, V48 TANG M, 2002, P ANN M ASS COMP LIN Thompson C., 1999, P INT C MACH LEARN I Tong S, 2000, J MACHINE LEARNING R, V2, P45, DOI DOI 10.1162/153244302760185243 Tur G., 2002, P INT C SPOK LANG PR TUR G, 2003, P EUR C SPEECH COMM TUR G, 2003, P INT C AC SPEECH SI NR 36 TC 43 Z9 45 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2005 VL 45 IS 2 BP 171 EP 186 DI 10.1016/j.specom.2004.08.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 903WZ UT WOS:000227457600005 ER PT J AU Baca, J Picone, J AF Baca, J Picone, J TI Effects of displayless navigational interfaces on user prosodics SO SPEECH COMMUNICATION LA English DT Article DE prosodics; displayless; multimodal ID MAPS AB Displayless interface technology provides speech-based access to computer applications for which visual access is not possible. These applications are increasingly prevalent, especially in situations requiring mobility, such as navigational applications. To ensure the successful deployment of this technology however, many human factors issues must be addressed. In particular, its nonvisual nature requires verbal presentation of spatial data. Prosodics, or nonverbal aspects, of human speech have been established as an indicator of cognitive stress. In this paper, we examine the assumption that the cognitive burden placed on the user by displayless access to spatial data would significantly alter the prosodics of the user's speech. Results were gathered through experiments in which user interactions with a prototype speech-based navigational system were recorded, post-processed, and analyzed for prosodic content. Subjects participated in two sessions, one using a speech-based, displayless interface, and a second using a multimodal interface that included a visual-tactile map display. 
Results showed strong evidence of significant changes in subjects' prosodic features when using a displayless versus a multimodal navigational interface, for all categories of subjects. Insights gained from this work can be used to improve the design of the user interface for such applications. Also, results of this work can be used to refine the selection of acoustic cues used as predictors in prosodic pattern detection algorithms for these types of applications. (C) 2004 Elsevier B.V. All rights reserved. C1 USA, Corps Engineers, Waterways Expt Stn, Vicksburg, MS 39180 USA. Inst Signal & Informat Proc, Dept Elect & Comp Engn, Mississippi State, MS 39762 USA. RP Baca, J (reprint author), Mississippi State Univ, Ctr Adv Vehicular Syst, 200 Res Blvd, Starkville, MS 39759 USA. EM baca@cse.msstate.edu; picone@isip.msstate.edu CR BACA J, 2003, P EUR GEN SWITZ BACA J, 1998, ITL983 WES MISS STAT BARTH J, 1983, TACTILE GRAPHICS GUI BAYER S, 1995, P SPOK LANG SYST TEC, P243 BUHLER D, 2002, P ICSLP 02 DENV CO U CAMPBELL WN, 1992, P ICSLP 92 BANFF CAN, P663 CHEN F, 1992, ICASSP, V1, P229 Dahl D.A., 1994, P ARPA WORKSH HUM LA, P43, DOI 10.3115/1075812.1075823 DALY N, 1990, P INT C SPOK LANG PR, P497 *DARPA, 2003, DARPA EARS C BOST MA GODFREY CHJ, 1990, P SPEECH NAT LANG WO, P96 HUBER D, 1989, P INT C AC SPEECH SI, P600 JACOBSON WH, 1993, ART SCI TEACHING ORI, P105 KAMM C, 1994, VOICE COMMUNICATION KOZLOWSKI LT, 1977, J EXP PSYCHOL HUMAN, V3, P590, DOI 10.1037//0096-1523.3.4.590 Loomis J. M., 1994, P 1 ANN INT ACM SIGC, P85, DOI 10.1145/191028.191051 NAKAI M, 1994, ELECTRON COMM JPN 3, V77, P80, DOI 10.1002/ecjc.4430770608 Noth E, 2000, IEEE T SPEECH AUDI P, V8, P519, DOI 10.1109/89.861370 OKAWA S, 1993, IEICE T INF SYST, VE76D, P44 Paul D., 1992, P ICSLP, P899 PELLOM B, 2001, P 2001 HUM LANG TECH PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770 ROSENFELD R, 1996, P ICSLP 96 PHIL PA S SCHERER KR, 2002, P ICSLP 02 DENV CO U Scherer K.R., 1981, SPEECH EVALUATION PS, P189 Silverman K., 1992, P INT C SPOK LANG PR, P867 THORNDYKE PW, 1980, COGNITIVE PSYCHOL, V12, P137, DOI 10.1016/0010-0285(80)90006-7 Waibel A., 1988, PROSODY SPEECH RECOG Wightman CW, 1994, IEEE T SPEECH AUDI P, V2, P469, DOI 10.1109/89.326607 WOODLAND PC, 1994, P IEEE INT C AC SPEE, V2, P125 NR 30 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2005 VL 45 IS 2 BP 187 EP 202 DI 10.1016/j.specom.2004.09.006 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 903WZ UT WOS:000227457600006 ER PT J AU Zera, J AF Zera, J TI Speech intelligibility measured by adaptive maximum-likelihood procedure (vol 42, pg 313, 2004) SO SPEECH COMMUNICATION LA English DT Correction C1 Chopin Acad Mus, Dept Sound Engn, Mus Acoust Lab, PL-00368 Warsaw, Poland. RP Zera, J (reprint author), Natl Res Inst, Cent Inst Labour Protect, Dept Acoust & Electromagnet Hazards, Czerniakowska 16, PL-00701 Warsaw, Poland. EM jazer@ciop.waw.pl CR Zera J, 2004, SPEECH COMMUN, V42, P313, DOI 10.1016/j.specom.2003.08.007 NR 1 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD JAN PY 2005 VL 45 IS 1 BP 1 EP 1 DI 10.1016/j.specom.2004.10.005 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 888WB UT WOS:000226403800001 ER PT J AU Barker, JP Cooke, MP Ellis, DPW AF Barker, JP Cooke, MP Ellis, DPW TI Decoding speech in the presence of other sources SO SPEECH COMMUNICATION LA English DT Article DE robust speech recognition; signal separation; missing data recognition; computational auditory scene analysis; acoustic mixtures ID SEPARATION AB The statistical theory of speech recognition introduced several decades ago has brought about low word error rates for clean speech. However, it has been less successful in noisy conditions. Since extraneous acoustic sources are present in virtually all everyday speech communication conditions, the failure of the speech recognition model to take noise into account is perhaps the most serious obstacle to the application of ASR technology. Approaches to noise-robust speech recognition have traditionally taken one of two forms. One set of techniques attempts to estimate the noise and remove its effects from the target speech. While noise estimation can work in low-to-moderate levels of slowly varying noise, it fails completely in louder or more variable conditions. A second approach utilises noise models and attempts to decode speech taking into account their presence. Again, model-based techniques can work for simple noises, but they are computationally complex under realistic conditions and require models for all sources present in the signal. In this paper, we propose a statistical theory of speech recognition in the presence of other acoustic sources. Unlike earlier model-based approaches, our framework makes no assumptions about the noise background, although it can exploit such information if it is available. It does not require models for background sources, or an estimate of their number. The new approach extends statistical ASR by introducing a segregation model in addition to the conventional acoustic and language models. While the conventional statistical ASR problem is to find the most likely sequence of speech models which generated a given observation sequence, the new approach additionally determines the most likely set of signal fragments which make up the speech signal. Although the framework is completely general, we provide one interpretation of the segregation model based on missing-data theory. We derive an efficient HMM decoder, which searches both across subword state and across alternative segregations of the signal between target and interference. We call this modified system the speech fragment decoder. The value of the speech fragment decoder approach has been verified through experiments on small-vocabulary tasks in high-noise conditions. For instance, in a noise-corrupted connected digit task, the new approach decreases the word error rate in the condition of factory noise at 5dB SNR from over 59% for a standard ASR system to less than 22%. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. Columbia Univ, Dept Elect Engn, New York, NY 10027 USA. RP Barker, JP (reprint author), Univ Sheffield, Dept Comp Sci, Regent Court,211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. 
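A small illustration of the missing-data interpretation mentioned in the Barker et al. record above: for a diagonal-Gaussian state model, the observation likelihood can be marginalised over the spectral channels that a candidate segregation labels as masked, here by simply dropping them from the sum. The function and the toy numbers below are illustrative assumptions, not the speech fragment decoder's actual implementation.

    import numpy as np

    def marginal_loglik(obs, reliable, mean, var):
        """Log-likelihood of obs under a diagonal Gaussian, reliable channels only.

        obs, mean, var : (D,) arrays; reliable : (D,) boolean mask.
        """
        d = obs[reliable] - mean[reliable]
        v = var[reliable]
        return -0.5 * np.sum(np.log(2.0 * np.pi * v) + d * d / v)

    obs = np.array([1.0, 5.0, 0.9, 1.1, 6.0, 1.0, 0.8, 1.2])  # toy spectral frame
    reliable = obs < 2.0       # toy segregation: high-energy channels = intruder
    mean = np.ones_like(obs)   # toy state mean
    var = 0.25 * np.ones_like(obs)
    print(marginal_loglik(obs, reliable, mean, var))

The decoder then scores each competing segregation hypothesis with likelihoods of this kind while also searching across subword states.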
EM j.barker@dcs.shef.ac.uk CR BAILEY PJ, 1977, J ACOUST SOC AM, V61 Barker J, 1999, SPEECH COMMUN, V27, P159, DOI 10.1016/S0167-6393(98)00081-8 Bell A J, 1995, NEURAL COMPUT, V7, P1004 Bourlard H., 1997, P ICASSP, P1251 BREGMAN AS, 1990, AUDOTORY SCENE ANAL BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 Cooke M., 1994, P 3 INT C SPOK LANG, P1555 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 Cooke M, 2001, SPEECH COMMUN, V35, P141, DOI 10.1016/S0167-6393(00)00078-9 Cooke M., 1991, THESIS U SHEFFIELD CULLING JF, 1993, J ACOUST SOC AM, V93, P3454, DOI 10.1121/1.405675 CUNNINGHAM S, 1999, ICPHS 99, P215 DENBIGH PN, 1992, SPEECH COMMUN, V11, P119, DOI 10.1016/0167-6393(92)90006-S Ellis D. P. W., 1996, THESIS MIT Gales M. J. F., 1993, EUROSPEECH 93, V2, P837 HARTMANN WM, 1991, MUSIC PERCEPT, V9, P155 Hyvarinen A, 2000, NEURAL NETWORKS, V13, P411, DOI 10.1016/S0893-6080(00)00026-5 Leonard R. G., 1984, P ICASSP 84, P111 PARSONS TW, 1976, J ACOUST SOC AM, V60, P911, DOI 10.1121/1.381172 Pearce D., 2000, P ICSLP, V4, P29 REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191 Scheffers M. T. M., 1983, THESIS U GRONINGEN N SLANEY M, 1995, P COMP AUD SCEN AN W, P13 van Noorden L. P. A. S., 1975, THESIS EINDHOVEN U T VARGA AP, 1990, INT CONF ACOUST SPEE, P845, DOI 10.1109/ICASSP.1990.115970 VARGA AP, 1992, NOISEX 92 STUDY EFFE Wang DLL, 1999, IEEE T NEURAL NETWOR, V10, P684, DOI 10.1109/72.761727 NR 27 TC 72 Z9 73 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2005 VL 45 IS 1 BP 5 EP 25 DI 10.1016/j.specom.2004.05.002 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 888WB UT WOS:000226403800002 ER PT J AU Linnankoski, I Leinonen, L Vihla, M Laakso, ML Carlson, S AF Linnankoski, I Leinonen, L Vihla, M Laakso, ML Carlson, S TI Conveyance of emotional connotations by a single word in English SO SPEECH COMMUNICATION LA English DT Article DE conveyance of emotion in speech; vocal expression; spectral features ID VOICE QUALITY; VOCAL COMMUNICATION; AIR-PRESSURE; EXPRESSION; SPEECH; FREQUENCY; JUDGMENTS; RESPONSES; REGISTER; CUES AB Native British English speakers uttered the name Sarah to simulate 10 emotional connotations: "naming", "sad", "pleading", "admiring", "content", "commanding", "astonished", "scornful", "angry", and "frightened". In an identification task, British English listeners categorized the samples. Of the connotations, "angry", "frightened" and "astonished" were conveyed best, and "content" poorest. Regarding auditory differentiation among the connotations, the results suggest that recognition of "naming", "sad", "admiring" "commanding", "angry", and "frightened" is based on differences in the signal wave form and its short-term alterations, whereas recognition of "pleading", "astonished", and "scornful" also relies on temporal patterning of short-term cues. In general, the present results together with an earlier comparable study on the conveyance of emotional connotations by a single word in Finnish indicate that English and Finnish have shared features in the vocal expression of admiration, positive surprise, scorn, plea, command, fear, and emotional neutrality. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ Helsinki, Helsinki Brain Res Ctr, Inst Biomed Physiol, Neurosci Unit, FIN-00014 Helsinki, Finland. 
Univ Helsinki, Dept Basic Vet Sci Physiol, FIN-00014 Helsinki, Finland. Aalto Univ, Neural Networks Res Ctr, FIN-02015 Helsinki, Finland. Sleep Res Ctr, Rinnekoti Fdn, FIN-02980 Espoo, Finland. RP Carlson, S (reprint author), Univ Helsinki, Helsinki Brain Res Ctr, Inst Biomed Physiol, Neurosci Unit, POB 63, FIN-00014 Helsinki, Finland. EM syncarls@cc.helsinki.fi RI Carlson, Synnove/A-6337-2013; Carlson, Synnove/G-2210-2013 CR Auberge V, 2003, SPEECH COMMUN, V40, P87, DOI 10.1016/S0167-6393(02)00077-8 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Callan DE, 1999, J SPEECH LANG HEAR R, V42, P355 CAVALLO SA, 1984, J COMMUN DISORD, V17, P231, DOI 10.1016/0021-9924(84)90028-5 Cowie R, 2003, SPEECH COMMUN, V40, P5, DOI 10.1016/S0167-6393(02)00071-7 EKMAN P, 1987, J PERS SOC PSYCHOL, V53, P712, DOI 10.1037/0022-3514.53.4.712 EKMAN P, 1988, MOTIV EMOTION, V12, P303, DOI 10.1007/BF00993116 Fairbanks G, 1939, SPEECH MONOGR, V6, P87 FONAGY I, 1972, PHONETICA, V26, P157 FONAGY I, 1962, PHONETICA, V8, P209 Fonagy I., 1963, Z PHONETIK SPRACHWIS, V16, P293 Gobl C, 2003, SPEECH COMMUN, V40, P189, DOI 10.1016/S0167-6393(02)00082-1 GOBL C, 1992, SPEECH COMMUN, V11, P481, DOI 10.1016/0167-6393(92)90055-C HAVILAND JM, 1987, DEV PSYCHOL, V23, P97, DOI 10.1037/0012-1649.23.1.97 HILLENBRAND J, 1994, J SPEECH HEAR RES, V37, P769 HOFFE WL, 1960, PHONETICA, V5, P129 HOLLIEN H, 1968, J SPEECH HEAR RES, V11, P600 HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829 Hozjan V., 2003, EUROSPEECH, P133 HOZJAN V, 2002, LREC 2002 LAS PALM JURGENS U, 1995, CURRENT TOPICS IN PRIMATE VOCAL COMMUNICATION, P199 JURGENS U, 1976, ARCH PSYCHIAT NERVEN, V222, P117, DOI 10.1007/BF02206613 Kohonen T., 1995, SELF ORG MAPS, P77 KOHONEN T, 1999, SOM PAK SELF ORG MAP KRAMER E, 1964, J ABNORM SOC PSYCH, V68, P390, DOI 10.1037/h0042473 LABARBERA JD, 1976, CHILD DEV, V47, P535, DOI 10.2307/1128816 Leinonen L, 1997, FOLIA PHONIATR LOGO, V49, P9 Leinonen L, 1999, KOHONEN MAPS, P329, DOI 10.1016/B978-044450270-4/50026-7 Leinonen Lea, 2003, Logoped Phoniatr Vocol, V28, P53, DOI 10.1080/14015430310011754 LEINONEN L, 1991, LANG COMMUN, V11, P241, DOI 10.1016/0271-5309(91)90031-P Leinonen L, 1997, J ACOUST SOC AM, V102, P1853, DOI 10.1121/1.420109 Liscombe J., 2003, EUROSPEECH, P725 MONSEN RB, 1977, J ACOUST SOC AM, V62, P981, DOI 10.1121/1.381593 MONSEN RB, 1978, J ACOUST SOC AM, V64, P65, DOI 10.1121/1.381957 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 OUDEYER P, 2002, SPEECH PROSODY Rama P, 2001, NEUROIMAGE, V13, P1090, DOI 10.1006/nimg.2001.0777 RIHKANEN H, 1994, J VOICE, V8, P320, DOI 10.1016/S0892-1997(05)80280-2 Scherer K. R., 2001, APPRAISAL PROCESSES SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009 SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Stevens K. N., 1998, ACOUSTIC PHONETICS, P73 TARTTER VC, 1994, J ACOUST SOC AM, V96, P2101, DOI 10.1121/1.410151 VANBEZOOYEN R, 1984, CHARACTERISTICS RECO, P1 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 Zwicker E., 1999, PSYCHOACOUSTICS FACT, P203 NR 48 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JAN PY 2005 VL 45 IS 1 BP 27 EP 39 DI 10.1016/j.specom.2004.09.007 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 888WB UT WOS:000226403800003 ER PT J AU Jung, HK Kim, NS Kim, TJ AF Jung, HK Kim, NS Kim, TJ TI A new double-talk detector using echo path estimation SO SPEECH COMMUNICATION LA English DT Article DE double-talk detection; echo cancellation; echo path estimation AB This paper presents a new double-talk detector (DTD) based on echo path estimation. The proposed algorithm consists of two decision stages. In the first stage, single-talk periods are distinguished from double-talk or echo path changes, based on the gain of the echo path estimate. An accurate distinction between double-talk and echo path changes is made in the second stage based on the derivative of the gain of the estimated echo path. Experiments show that the proposed approach is effective in detecting double-talk periods. Moreover, the required decision delay is shorter than that of conventional methods. (C) 2004 Elsevier B.V. All rights reserved. C1 Seoul Natl Univ, Sch Elect Engn, Seoul 151742, South Korea. RP Jung, HK (reprint author), Seoul Natl Univ, Sch Elect Engn, San 56-1 Sillim Dong, Seoul 151742, South Korea. EM shizuka@infolab.snu.ac.kr; nkim@snu.ac.kr; tkim@snu.ac.kr CR Benesty J, 2000, IEEE T SPEECH AUDI P, V8, P168 Chao J., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), DOI 10.1109/ICASSP.1988.196921 Cho JH, 1999, IEEE T SPEECH AUDI P, V7, P718, DOI 10.1109/89.799697 Gansler T, 1996, IEEE T COMMUN, V44, P1421, DOI 10.1109/26.544458 Haykin S., 1996, ADAPTIVE FILTER THEO Park SJ, 2002, IEEE T CIRCUITS-II, V49, P188 Widrow B, 1985, ADAPTIVE SIGNAL PROC YE H, 1991, IEEE T COMMUN, V39, P1542 NR 8 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2005 VL 45 IS 1 BP 41 EP 48 DI 10.1016/j.specom.2004.09.005 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 888WB UT WOS:000226403800004 ER PT J AU Peng, G Wang, WSY AF Peng, G Wang, WSY TI Tone recognition of continuous Cantonese speech based on support vector machines SO SPEECH COMMUNICATION LA English DT Article DE tone language; F-0 normalization; support vector machines; tone recognition; automatic speech recognition ID MANDARIN; LANGUAGE; CHINESE AB Tone is an essential component for word formation in all tone languages. It plays a very important role in the transmission of information in speech communication. In this paper, we look at using support vector machines (SVMs) for automatic tone recognition in continuously spoken Cantonese, which is well known for its complex tone system. An adaptive log-scale 5-level F-0 normalization method is proposed to reduce the tone-irrelevant variation of F-0 values. Furthermore, an extended version of the above normalization method that considers intonation is also presented. A tone recognition accuracy of 71.50% has been obtained in a speaker-independent task. This result compares favorably with the results reported earlier for the same task. Considerable improvement has been achieved by adopting this tone recognition scheme in a speaker-independent Cantonese large vocabulary continuous speech recognition (LVCSR) task. (C) 2004 Elsevier B.V. All rights reserved.
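The "5-level" normalization named in the Peng and Wang record above follows Chao's five tone letters: log F0 is rescaled into the range 1-5 relative to the speaker's F0 span. The sketch below is a simplified, non-adaptive version under assumed range values; the paper's adaptive estimation of the speaker range is not reproduced here.

    import math

    def five_level(f0_hz, f0_min=80.0, f0_max=300.0):
        """Map an F0 value (Hz) onto the 1..5 tone-letter scale on a log axis."""
        lo, hi = math.log(f0_min), math.log(f0_max)
        x = (math.log(f0_hz) - lo) / (hi - lo)      # position within speaker range
        return 1.0 + 4.0 * min(max(x, 0.0), 1.0)    # clip to range, scale to 1..5

    # A high level tone sits near 5; a low level tone near 1:
    print(round(five_level(280.0), 2), round(five_level(90.0), 2))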
C1 City Univ Hong Kong, Dept Elect Engn, Kowloon, Hong Kong, Peoples R China. Chinese Univ Hong Kong, Dept Elect Engn, Shatin, Hong Kong, Peoples R China. RP Peng, G (reprint author), City Univ Hong Kong, Dept Elect Engn, 83 Tat Chee Ave, Kowloon, Hong Kong, Peoples R China. EM gpeng@ee.cityu.edu.hk; wang@ee.cuhk.edu.hk CR Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555 Cao Y, 2000, P INT C AC SPEECH SI, V3, P1759 Chao Yuan-Ren, 1930, MAITRE PHONETIQUE, V45, P24 CHEN SH, 1995, IEEE T SPEECH AUDI P, V3, P146 Chen X.-X., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) EMONTS M, 2003, P 8 EUR C SPEECH COM, P2305 Fok Y.Y., 1974, PERCEPTUAL STUDY TON HASTIE T, 1998, ADV NEURAL INFORMATI Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X Lee LS, 1997, IEEE SIGNAL PROC MAG, V14, P63 Lee T, 2002, SPEECH COMMUN, V36, P327, DOI 10.1016/S0167-6393(00)00101-1 Lee T., 2002, ACM T ASIAN LANGUAGE, V1, P83, DOI 10.1145/595576.595581 LEE T, 1995, IEEE T SPEECH AUDI P, V3, P204 Li Yong, 2002, Pacific Rim Workshop on Transducers and Micro/Nano Technologies *LSHK, 2002, HONG KONG JYUT PING MAN CHV, 1992, THESIS U VICTORIA BR OHALA Johni, 1978, TONE LINGUISTIC SURV, P5 PENG G, 2004, P INT C SPOK LANG PR Potisuk S, 1999, IEEE T SPEECH AUDI P, V7, P95, DOI 10.1109/89.736336 Qian Y., 2003, P 8 EUR C SPEECH COM, P1845 Rabiner L, 1993, FUNDAMENTALS SPEECH RIFKIN R, 2001, SVMFU DOCUMENTATION SHEN XS, 1990, J PHONETICS, V18, P281 Vapnik V., 1995, NATURE STAT LEARNING Wang C., 2001, THESIS MIT WANG CF, 1990, P INT C SPOK LANG PR, V6, P221 WANG WSY, 1967, INT J AM LINGUIST, V33, P93, DOI 10.1086/464946 WANG WSY, 1987, HONOR ILSE LEHISTE WANG WSY, 1973, SCI AM, V228, P50 WANG WSY, 1969, LANGUAGE, V45, P695 Weenink D., 2001, PRAAT DOING PHONETIC Xu Y, 1997, J PHONETICS, V25, P61, DOI 10.1006/jpho.1996.0034 YANG WJ, 1988, IEEE T ACOUST SPEECH, V36, P988, DOI 10.1109/29.1620 ZHANG JS, 2000, P INT C AC SPEECH SI, P1419 2001, WISENEWS NR 35 TC 12 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2005 VL 45 IS 1 BP 49 EP 62 DI 10.1016/j.specom.2004.09.004 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 888WB UT WOS:000226403800005 ER PT J AU Srinivasan, S Wang, DL AF Srinivasan, S Wang, DL TI A schema-based model for phonemic restoration SO SPEECH COMMUNICATION LA English DT Article DE phonemic restoration; top-down model; speech schemas; computational auditory scene analysis; prediction; missing data ASR; dynamic time warping ID AUDITORY SCENE ANALYSIS; PERCEPTUAL RESTORATION; SPEECH RECOGNITION; SOUND; INTELLIGIBILITY; SEGREGATION; CONTINUITY; INDUCTION; ILLUSION AB Phonemic restoration is the perceptual synthesis of phonemes when masked by appropriate replacement sounds by utilizing linguistic context. Current models attempting to accomplish acoustic restoration of phonemes, however, use only temporal continuity and produce poor restoration of unvoiced phonemes, and are also limited in their ability to restore voiced phonemes. We present a schema-based model for phonemic restoration. The model employs a missing data speech recognition system to decode speech based on intact portions and activates word templates corresponding to the words containing the masked phonemes. 
An activated template is dynamically time warped to the noisy word and is then used to restore the speech frames corresponding to the masked phoneme, thereby synthesizing it. The model is able to restore both voiced and unvoiced phonemes with a high degree of naturalness. Systematic testing shows that this model outperforms a Kalman-filter based model. (C) 2004 Elsevier B.V. All rights reserved. C1 Ohio State Univ, Ctr Biomed Engn, Dreese Labs 395, Columbus, OH 43210 USA. Ohio State Univ, Dreese Labs 395, Dept Comp Sci & Engn, Columbus, OH 43210 USA. Ohio State Univ, Dreese Labs 395, Ctr Cognit Sci, Columbus, OH 43210 USA. RP Srinivasan, S (reprint author), Ohio State Univ, Ctr Biomed Engn, Dreese Labs 395, 2015 Neil Ave, Columbus, OH 43210 USA. EM srinivasan.36@osu.edu; dwang@cse.ohio-state.edu CR Anderson B. D. O., 1979, OPTIMAL FILTERING BASHFORD JA, 1992, PERCEPT PSYCHOPHYS, V51, P211, DOI 10.3758/BF03212247 Boersma P, 2002, PRAAT DOING PHONETIC BREGMAN AS, 1981, PERCEPTUAL ORG, P99 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 COOKE MP, 1993, SPEECH COMMUN, V13, P391, DOI 10.1016/0167-6393(93)90037-L Drygajlo A., 1998, P ICASSP 98, V1, P121, DOI 10.1109/ICASSP.1998.674382 Ellis DPW, 1999, SPEECH COMMUN, V27, P281, DOI 10.1016/S0167-6393(98)00083-1 Goldinger SD, 2003, J PHONETICS, V31, P305, DOI 10.1016/S0095-4470(03)00030-5 Goldinger SD, 1996, J EXP PSYCHOL LEARN, V22, P1166, DOI 10.1037/0278-7393.22.5.1166 GRAY AH, 1976, IEEE T ACOUST SPEECH, V24, P380, DOI 10.1109/TASSP.1976.1162849 Hassan M, 2000, IEEE COMMUN MAG, V38, P96, DOI 10.1109/35.833564 Herre J, 2001, PROCEEDINGS OF THE 2001 IEEE WORKSHOP ON THE APPLICATIONS OF SIGNAL PROCESSING TO AUDIO AND ACOUSTICS, P127, DOI 10.1109/ASPAA.2001.969559 Hu GN, 2004, IEEE T NEURAL NETWOR, V15, P1135, DOI 10.1109/TNN.2004.832812 HUANG CS, 2003, P EUR 03, P457 ISHIGURO M, 1999, COMPUTER SCI MONOGRA, V25 Jayant N. S., 1984, DIGITAL CODING WAVEF Kato H, 1998, SPEECH COMMUN, V24, P325, DOI 10.1016/S0167-6393(98)00020-X Leonard R. G., 1984, P ICASSP 84, P111 Masuda-Katsuse I, 1999, SPEECH COMMUN, V27, P235, DOI 10.1016/S0167-6393(98)00084-3 MILLER GA, 1950, J ACOUST SOC AM, V22, P167, DOI 10.1121/1.1906584 MOULINES E, 1988, P FASE INT C EDINBUR, P47 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Nakatani T, 1999, SPEECH COMMUN, V27, P209, DOI 10.1016/S0167-6393(98)00079-X Nakayama K, 1995, INVITATION COGNITIVE, P1 Nygaard L. C., 1998, PERCEPT PSYCHOPHYS, V60, P335 Oppenheim A. V., 1999, DISCRETE TIME SIGNAL Perkins C, 1998, IEEE NETWORK, V12, P40, DOI 10.1109/65.730750 Principe JC, 2000, NEURAL ADAPTIVE SYST Rabiner L, 1993, FUNDAMENTALS SPEECH RAJ B, 2000, P INT C SPOK LANG PR, P1491 Renevey P., 2001, P CONS REL AC CUES S, P71 REPP BH, 1992, PERCEPT PSYCHOPHYS, V51, P14, DOI 10.3758/BF03205070 Samuel AG, 1997, COGNITIVE PSYCHOL, V32, P97, DOI 10.1006/cogp.1997.0646 SAMUEL AG, 1981, J EXP PSYCHOL HUMAN, V7, P1124, DOI 10.1037/0096-1523.7.5.1124 SELTZER ML, 2003, P EUR 03, P1277 SELTZER ML, 2000, P INT C SPOK LANG PR, P538 SRINIVASAN S, 2003, P EUR 03, P2053 Stevens K.N., 1998, ACOUSTIC PHONETICS Stoica P., 1997, INTRO SPECTRAL ANAL VERSCHUURE J, 1983, PERCEPT PSYCHOPHYS, V33, P232, DOI 10.3758/BF03202859 Wang DLL, 1999, IEEE T NEURAL NETWOR, V10, P684, DOI 10.1109/72.761727 Warren R.
M., 1999, AUDITORY PERCEPTION WARREN RM, 1994, PERCEPT PSYCHOPHYS, V55, P313, DOI 10.3758/BF03207602 WARREN RM, 1974, PERCEPT PSYCHOPHYS, V16, P150, DOI 10.3758/BF03203268 WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 WARREN RM, 1971, PERCEPT PSYCHOPHYS, V9, P358, DOI 10.3758/BF03212667 WEI B, 2000, P IEEE DIG SIGN PROC YANTORNO RE, 2001, P IEEE INT WORKSH IN, P193 Young S., 2000, HTK BOOK HTK VERSION Young S, 1996, IEEE SIGNAL PROC MAG, V13, P45, DOI 10.1109/79.536824 NR 53 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2005 VL 45 IS 1 BP 63 EP 87 DI 10.1016/j.specom.2004.09.002 PG 25 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 888WB UT WOS:000226403800006 ER PT J AU Pitt, MA Johnson, K Hume, E Kiesling, S Raymond, W AF Pitt, MA Johnson, K Hume, E Kiesling, S Raymond, W TI The Buckeye corpus of conversational speech: labeling conventions and a test of transcriber reliability SO SPEECH COMMUNICATION LA English DT Article DE spontaneous speech corpus; transcription; labeling; American English ID WORD RECOGNITION; JUDGMENTS; AGREEMENT; INFERENCE; DATABASE AB This paper describes the Buckeye corpus of spontaneous American English speech, a 307,000-word corpus containing the speech of 40 talkers from central Ohio, USA. The method used to elicit and record the speech is described, followed by a description of the protocol that was developed to phonemically label what talkers said. The results of a test of labeling consistency are then presented. The corpus will be made available to the scientific community when labeling is completed. (C) 2004 Elsevier B.V. All rights reserved. C1 Ohio State Univ, Dept Psychol, Columbus, OH 43210 USA. Ohio State Univ, Dept Linguist, Columbus, OH 43210 USA. Univ Pittsburgh, Dept Linguist, Pittsburgh, PA 15260 USA. RP Pitt, MA (reprint author), Ohio State Univ, Dept Psychol, 1885 Neil Ave, Columbus, OH 43210 USA. EM pitt.2@osu.edu; kjohnson@julius.ling.ohio-state.edu; ehume@julius.ling.ohio-state.edu; kiesling@pitt.edu; raymond@ling.ohio-stat.edu RI Hume, Elizabeth/J-9227-2013 CR AMOROSA H, 1985, BRIT J DISORD COMMUN, V20, P281 BURKOWSKY MR, 1967, PHONETICA, V17, P38 BYRD D, 1992, J ACOUST SOC AM, V92, P593, DOI 10.1121/1.404271 Byrd D., 1993, UCLA WORKING PAPERS, V83, P97 BYRD D, 1994, SPEECH COMMUN, V15, P39, DOI 10.1016/0167-6393(94)90039-6 COHEN J, 1960, EDUC PSYCHOL MEAS, V20, P37, DOI 10.1177/001316446002000104 Cucchiarini C, 1996, CLIN LINGUIST PHONET, V10, P131, DOI 10.3109/02699209608985167 Dalby Jonathan Marler, 1986, THESIS INDIANA U Deelman T, 2001, J EXP PSYCHOL HUMAN, V27, P656, DOI 10.1037//0096-1523.27.3.656 DONSELAAR W, 1999, J MEM LANG, V41, P59 EISEN B, 1991, EUROSPEECH, P673 Fasold Ralph, 1990, SOCIOLINGUISTICS LAN FISHER WM, 1987, J ACOUST SOC AM, V81, pS92, DOI 10.1121/1.2034854 Garofolo J. S., 1993, NISTIR PUBLICATION, V4930 Gaskell MG, 1998, J EXP PSYCHOL HUMAN, V24, P380, DOI 10.1037/0096-1523.24.2.380 Gaskell MG, 1996, J EXP PSYCHOL HUMAN, V22, P144, DOI 10.1037//0096-1523.22.1.144 Gimson A. C., 1989, INTRO PRONUNCIATION GREENBERG S, 1997, LARGE VOCABULARY CON Guy G. R., 1980, LOCATING LANGUAGE TI IRWIN RB, 1970, J SPEECH HEAR RES, V13, P548 Jones M. 
H., 1974, INFORMAL SPEECH ALPH JURAFSKY D, 1998, P INT C SPOK LANG PR Jurafsky Daniel, 2001, FREQUENCY EMERGENCE, P229, DOI 10.1075/tsl.45.13jur KEATING PA, 1994, SPEECH COMMUN, V14, P131, DOI 10.1016/0167-6393(94)90004-3 Labov W., 1994, PRINCIPLES LINGUISTI Manuel S. Y., 1992, P INT C SPOK LANG PR, P943 NEU H, 1980, LOCATING LANGUAGE TI PERREAULT WD, 1989, J MARKETING RES, V26, P135, DOI 10.2307/3172601 Philips B J, 1969, Cleft Palate J, V6, P24 ROWN G, 1990, LISTENING SPOKEN ENG SENEFF ZV, 1988, TIMIT CDROM DOCUMENT SHOCKEY L, 1973, THESIS OHIO STATE U SHRIBERG LD, 1991, CLIN LINGUIST PHONET, V5, P225, DOI 10.3109/02699209108986113 Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 SVARTVICK J, 1980, CORPUS ENGLISH CONVE Utman JA, 2000, PERCEPT PSYCHOPHYS, V62, P1297, DOI 10.3758/BF03212131 Weber A, 2001, LANG SPEECH, V44, P95 WESENICK MB, 1996, P ICSLP PHIL US, P12 NR 38 TC 28 Z9 28 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2005 VL 45 IS 1 BP 89 EP 95 DI 10.1016/j.specom.2004.09.001 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 888WB UT WOS:000226403800007 ER PT J AU Schwartz, JL Berthommier, F Cathiard, MA de Mori, R AF Schwartz, JL Berthommier, F Cathiard, MA de Mori, R TI Special Issue on audio visual speech processing SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Univ Grenoble 3, INPG, Inst Commun Parlee, UMR CNRS 5009, F-38031 Grenoble 1, France. Univ Avignon Pays Vaucluse, LIA, CERI, F-84911 Avignon 9, France. RP Schwartz, JL (reprint author), Univ Grenoble 3, INPG, Inst Commun Parlee, UMR CNRS 5009, 46 Av Felix Viallet, F-38031 Grenoble 1, France. EM schwartz@icp.inpg.fr NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 1 EP 3 DI 10.1016/j.specom.2004.11.002 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500001 ER PT J AU Bernstein, LE Auer, ET Takayanagi, S AF Bernstein, LE Auer, ET Takayanagi, S TI Auditory speech detection in noise enhanced by lipreading SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE audiovisual speech processing; speech detection in noise; speech in noise; audiovisual speech perception; speech processing; lipreading; speechreading ID MISMATCH NEGATIVITY MMN; MULTISENSORY INTEGRATION; PERCEPTION; BRAIN; ELECTROPHYSIOLOGY; HEARING; BINDING; VISION; HUMANS; CORTEX AB Audiovisual speech stimuli have been shown to produce a variety of perceptual phenomena. Enhanced detectability of acoustic speech in noise, when the talker can also be seen, is one of those phenomena. This study investigated whether this enhancement effect is specific to visual speech stimuli or can rely on more generic non-speech visual stimulus properties. Speech detection thresholds for an auditory /ba/ stimulus were obtained in a white noise masker. The auditory /ba/ was presented adaptively to obtain its 79.4% detection threshold under five conditions. In Experiment 1, the syllable was presented (1) auditory-only (AO) and (2) as audiovisual speech (AVS), using the original video recording. 
Three types of synthetic visual stimuli were also paired synchronously with the audio token: (3) a dynamic Lissajous (AVL) figure whose vertical extent was correlated with the acoustic speech envelope; (4) a dynamic rectangle (AVR) whose horizontal extent was correlated with the speech envelope; and (5) a static rectangle (AVSR) whose onset and offset were synchronous with the acoustic speech onset and offset. Ten adults with normal hearing and vision participated. The results, in terms of dB signal-to-noise ratio (SNR), were AVS < (AVL approximate to AVR approximate to AVSR) < AO. That is, AVS was significantly easiest to detect, there was no difference among the synthesized visual stimuli, and all audiovisual conditions resulted in significantly lower thresholds than AO. To determine the advantage of the AVS stimulus, in Experiment 2, a preliminary mouth gesture was edited from the video speech token. This manipulation defeated the advantage for both the original and the edited AVS stimulus, while the audiovisual detection enhancement persisted. Overall, the results showed enhanced auditory speech detection with visual stimuli but no advantage for a fine-grained correlation between acoustic and optical speech signals. (C) 2004 Elsevier B.V. All rights reserved. C1 House Ear Res Inst, Dept Commun Neurosci, Los Angeles, CA 90057 USA. Natl Sci Fdn, Arlington, VA 22230 USA. RP Bernstein, LE (reprint author), House Ear Res Inst, Dept Commun Neurosci, 2100 W 3rd St, Los Angeles, CA 90057 USA. EM lbernstein@hei.org; auer@ku.edu; stakayanagi@hei.org CR American National Standards Institute, 1989, S361989 ANSI Arnold P, 2001, BRIT J PSYCHOL, V92, P339, DOI 10.1348/000712601162220 Bernstein L., 2004, HDB MULTISENSORY PRO Bernstein LE, 2000, PERCEPT PSYCHOPHYS, V62, P233, DOI 10.3758/BF03205546 Colin C, 2002, CLIN NEUROPHYSIOL, V113, P507, DOI 10.1016/S1388-2457(02)00028-7 De Gelder B, 2003, TRENDS COGN SCI, V7, P460, DOI 10.1016/j.tics.2003.08.014 Foxe JJ, 2002, EXP BRAIN RES, V142, P139, DOI 10.1007/s00221-001-0906-7 Grant KW, 2001, J ACOUST SOC AM, V109, P2272, DOI 10.1121/1.1362687 Grant KW, 2000, J ACOUST SOC AM, V108, P1197, DOI 10.1121/1.1288668 Krolak-Salmon P., 2001, Society for Neuroscience Abstracts, V27, P913 LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 Meredith MA, 2002, COGNITIVE BRAIN RES, V14, P31, DOI 10.1016/S0926-6410(02)00059-9 MEREDITH MA, 1987, J NEUROSCI, V7, P3215 Mesulam MM, 1998, BRAIN, V121, P1013, DOI 10.1093/brain/121.6.1013 Moutoussis K, 1997, P ROY SOC B-BIOL SCI, V264, P393 Naatanen R, 2001, PSYCHOPHYSIOLOGY, V38, P1, DOI 10.1017/S0048577201000208 NILSSON M, 1994, J ACOUST SOC AM, V95 Puce A, 2003, PHILOS T R SOC B, V358, P435, DOI 10.1098/rstb.2002.1221 RATCLIFF R, 1993, PSYCHOL BULL, V114, P510, DOI 10.1037/0033-2909.114.3.510 Reisberg D., 1987, HEARING EYE PSYCHOL, P97 Schroeder CE, 2002, COGNITIVE BRAIN RES, V14, P187, DOI 10.1016/S0926-6410(02)00073-3 SCHWARTZ JL, 2003, P 7 AVSP AUD SPEECH SCHWARTZ JL, 2002, P 7 INT C SPOK LANG Stein B.
E., 1993, MERGING SENSES Steinschneider M, 1999, J NEUROPHYSIOL, V82, P2346 Stevens K.N., 1998, ACOUSTIC PHONETICS SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Treisman A, 1996, CURR OPIN NEUROBIOL, V6, P171, DOI 10.1016/S0959-4388(96)80070-5 VONDERMALSBURG C, 1995, CURR OPIN NEUROBIOL, V5, P520 Yvert B, 2001, CEREB CORTEX, V11, P411, DOI 10.1093/cercor/11.5.411 Zeki S, 1998, NEUROSCIENTIST, V4, P365, DOI 10.1177/107385849800400518 NR 32 TC 47 Z9 48 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 5 EP 18 DI 10.1016/j.specom.2004.10.011 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500002 ER PT J AU Kim, J Davis, C AF Kim, J Davis, C TI Investigating the audio-visual speech detection advantage SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE speech detection; visual speech; bimodal speech; audio-visual integration ID PATTERNS; VOICES AB Seeing the moving face of the talker permits better detection of speech in noise compared to auditory-only presentation, an Audio-Visual (AV) facilitation effect. Three experiments that used a masked speech detection task are reported. The experiments were designed to contrast two accounts of the AV facilitation effect (AV peak listening and AV grouping). In each experiment a different manipulation of the relationship between the auditory and visual signals was employed. The first experiment manipulated the sequencing of the visual and auditory information by presenting the displays time normal or time reversed. The results showed that AV facilitation only occurred for the time normal presentations where there was a high correlation between the AV signals. Experiment 2 examined the impact on AV facilitation of shifting the auditory signal earlier in time than its normal position (again with time normal and time reversed presentation). It was found that shifting the auditory component abolished the AV effect for both the time normal and reversed conditions. The final experiment examined the AV detection advantage using another situation in which the relationship between the AV signals differed. Two versions of AV speech produced by a virtual talker were investigated. In one version, based on text-to-speech synthesis, the video and auditory signals were more rapid than the human talker of Experiment 1. In the other version, the signals were lengthened to match the durations of the human talker. A small but reliable AV facilitation effect was only found for the second version. The results are consistent with a cross-modal peak listening account and are discussed in terms of constraints on the integration of auditory and visual speech. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ Melbourne, Dept Psychol, Parkville, Vic 3010, Australia. RP Davis, C (reprint author), Univ Melbourne, Dept Psychol, Parkville, Vic 3010, Australia.
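A minimal sketch of how the cross-modal correlation underlying the peak-listening account above can be quantified; the inputs (an acoustic amplitude envelope and a lip-aperture track at a common frame rate) and the function name are illustrative assumptions, not the authors' measures.

import numpy as np

def lagged_av_correlation(envelope, lip_aperture, max_lag):
    # Pearson correlation between an acoustic envelope and a lip-aperture
    # signal (equal length assumed) over a range of frame offsets;
    # positive lag = audio delayed relative to video. Under a
    # peak-listening account, time reversal or an unnatural audio lead
    # should flatten the correlation peak.
    a = np.asarray(envelope, dtype=float)
    v = np.asarray(lip_aperture, dtype=float)
    a = (a - a.mean()) / a.std()
    v = (v - v.mean()) / v.std()
    lags = list(range(-max_lag, max_lag + 1))
    corrs = []
    for lag in lags:
        if lag >= 0:
            x, y = a[lag:], v[:len(v) - lag]
        else:
            x, y = a[:lag], v[-lag:]
        corrs.append(float(np.mean(x * y)))
    return lags, corrs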
EM jeesun@unimelb.edu.au; cwd@unimelb.edu.au CR Benoit C, 1998, SPEECH COMMUN, V26, P117, DOI 10.1016/S0167-6393(98)00045-4 BERNSTEIN LE, 2003, P AVSP 2003 ST JOR F, P13 Davis C, 2001, ARTIF INTELL REV, V16, P37, DOI 10.1023/A:1011086120667 Forster KI, 2003, BEHAV RES METH INS C, V35, P116, DOI 10.3758/BF03195503 Gordon PC, 1997, J ACOUST SOC AM, V102, P2276, DOI 10.1121/1.419600 Grant KW, 2003, P 2003 AUD VIS SPEEC, P31 Grant KW, 2001, J ACOUST SOC AM, V109, P2272, DOI 10.1121/1.1362687 Grant KW, 2000, J ACOUST SOC AM, V108, P1197, DOI 10.1121/1.1288668 KAWAHARA H, 1998, SPEECH TRANSFORMATIO Kim J, 2003, PERCEPTION, V32, P111, DOI 10.1068/p3466 MASSARO DW, 1999, P ESCA SOCRATES WORK, P45 Ramus F, 2000, SCIENCE, V288, P349, DOI 10.1126/science.288.5464.349 REPP BH, 1992, Q J EXP PSYCHOL-A, V45, P1 Schwartz JL, 2004, COGNITION, V93, pB69, DOI 10.1016/j.cognition.2004.01.006 SCHWARTZ JL, 2004, P 8 INT ICSLP JEJ KO Sheffert SM, 2002, J EXP PSYCHOL HUMAN, V28, P1447, DOI 10.1037//0096-1523.28.6.1447 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 VANLANCKER D, 1985, J PHONETICS, V13, P19 NR 18 TC 26 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 19 EP 30 DI 10.1016/j.specom.2004.09.008 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500003 ER PT J AU Berthommier, F AF Berthommier, F TI A phonetically neutral model of the low-level audio-visual interaction SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE audio-visual processing; multimodal interaction; temporal envelope processing; subband decomposition; modeling ID SPEECH; CUES AB The improvement of detectability of visible speech cues found by Grant and Seitz [2000. The use of visible speech cues for improving auditory detection of spoken sentences. JASA 108, 1197-1208] has been related to the degree of correlation between acoustic envelopes and visible movements. This suggests that audio and visual signals could interact early during the audio-visual perceptual process on the basis of audio envelope cues. On the other hand, acoustic-visual correlations were previously reported by Yehia et al. [1998. Quantitative association of vocal tract and facial behavior. Speech Commun. 26 (1), 23-43]. Taking into account these two main facts, the problem of extraction of the redundant audio-visual components is revisited: the video parametrization of natural images and three types of audio parameters are tested together, leading to new and realistic applications in video synthesis and audio-visual speech enhancement. Consistent with Grant and Seitz's prediction, the 4-subband envelope energy features are found to be optimal for encoding the redundant components available for the enhancement task. The proposed computational model of audio-visual interaction is based on the product, in the audio pathway, between the time-aligned audio envelopes and video-predicted envelopes. This interaction scheme is shown to be phonetically neutral, so that it will not bias phonetic identification. The low-level stage which is described is compatible with a late integration process, which may be used as a potential front-end for speech recognition applications. (C) 2004 Elsevier B.V. All rights reserved.
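A minimal sketch of the interaction scheme the abstract above describes: subband envelope extraction followed by a product with video-predicted envelopes. The band edges, filter order and sampling-rate assumption are illustrative only (the paper's exact 4-subband layout is not given here), and the video-prediction stage is not shown.

import numpy as np
from scipy.signal import butter, sosfilt, hilbert

# Illustrative band edges in Hz, assuming roughly 16 kHz audio.
BANDS = [(100, 800), (800, 2000), (2000, 4000), (4000, 7500)]

def subband_envelopes(x, fs):
    # Band-pass the signal into four subbands and take the magnitude of
    # the analytic signal as each band's temporal envelope.
    envs = []
    for lo, hi in BANDS:
        sos = butter(4, [lo, hi], btype="bandpass", fs=fs, output="sos")
        envs.append(np.abs(hilbert(sosfilt(sos, x))))
    return np.vstack(envs)

def av_envelope_product(audio_envs, video_pred_envs):
    # The model's interaction stage reduced to its core: audio subband
    # envelopes multiplied by time-aligned envelopes predicted from the
    # video parameters (same array shape assumed).
    return audio_envs * video_pred_envs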
C1 INPG, Inst Commun Parlee, UPRESA, CNRS 5009, F-38041 Grenoble 1, France. RP Berthommier, F (reprint author), INPG, Inst Commun Parlee, UPRESA, CNRS 5009, 46 Ave Felix Viallet, F-38041 Grenoble 1, France. EM bertho@icp.inpg.fr CR BARKER J, 1998, P AVSP 98 TERR AUSTR, P103 Barker J. P., 1999, P AVSP 99 SANT CRUZ, P112 Bernstein L., 2004, HDB MULTISENSORY PRO Berthommier F., 2001, P AVSP 01 AALB, P183 BERTOMMIER F, 2003, P EUR 03 GEN BERTOMMIER F, 2003, P SOC 03 GREN Bregman AS., 1990, AUDITORY SCENE ANAL ERBER NP, 1972, J ACOUST SOC AM, V51, P1224, DOI 10.1121/1.1912964 Girin L, 2001, J ACOUST SOC AM, V109, P3007, DOI 10.1121/1.1358887 GOECKE R, 2002, P ICASSP 02 ORL Grant KW, 2000, J ACOUST SOC AM, V108, P1197, DOI 10.1121/1.1288668 Heckmann M., 2002, P INT C SPOK LANG PR, P1925 Jiang JT, 2002, EURASIP J APPL SIG P, V2002, P1174, DOI 10.1155/S1110865702206046 KIM J, 2001, P AVSP 01 AALB, P127 Massaro D. W., 1998, PERCEIVING TALKING F SCHWARTZ JL, 2002, P ICSLP 2002, P1937 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 SUMMERFIELD Q, 1987, HEARING EYE PSYCHOL Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X NR 19 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 31 EP 41 DI 10.1016/j.specom.2004.10.003 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500004 ER PT J AU Grant, KW van Wassenhove, V Poeppel, D AF Grant, KW van Wassenhove, V Poeppel, D TI Detection of auditory (cross-spectral) and auditory-visual (cross-modal) synchrony SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE spectro-temporal asynchrony; cross-modal asynchrony; auditory visual speech processing ID CORRELATIONAL METHOD; SPEECH RECOGNITION; PERCEPTION; ARTICULATION; INTEGRATION; SENTENCES; CUES AB Detection thresholds for temporal synchrony in auditory and auditory-visual sentence materials were obtained on normal-hearing subjects. For auditory conditions, thresholds were determined using an adaptive-tracking procedure to control the degree of temporal asynchrony of a narrow audio band of speech, both positive and negative in separate tracks, relative to three other narrow audio bands of speech. For auditory-visual conditions, thresholds were determined in a similar manner for each of four narrow audio bands of speech as well as a broadband speech condition, relative to a video image of a female speaker. Four different auditory filter conditions, as well as a broadband auditory-visual speech condition, were evaluated in order to determine whether detection thresholds were dependent on the spectral content of the acoustic speech signal. Consistent with previous studies of auditory-visual speech recognition which showed a broad, asymmetrical range of temporal synchrony for which intelligibility was basically unaffected (audio delays roughly between -40ms and +240ms), auditory-visual synchrony detection thresholds also showed a broad, asymmetrical pattern of similar magnitude (audio delays roughly between -45 ms and +200 ms). No differences in synchrony thresholds were observed for the different filtered bands of speech, or for broadband speech. In contrast, detection thresholds for audio-alone conditions were much smaller (between -17ms and +23ms) and symmetrical. 
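The adaptive-tracking procedure just described, like the adaptive presentation in the Bernstein et al. record above, belongs to the transformed up-down family; the 79.4% detection point used there corresponds to a three-down/one-up rule (Levitt, 1971, cited in that record). A minimal sketch, assuming a caller-supplied trial function; real protocols also shrink the step size and interleave tracks.

def staircase_threshold(run_trial, start_level, step, n_reversals=8):
    # Three-down/one-up staircase: three consecutive correct responses
    # make the trial harder, one error makes it easier, converging near
    # 79.4% correct. run_trial(level) must return True for a detection;
    # levels are in the experimenter's units (e.g. dB SNR).
    level, reversals, streak, going_down = start_level, [], 0, True
    while len(reversals) < n_reversals:
        if run_trial(level):
            streak += 1
            if streak == 3:
                streak = 0
                if not going_down:
                    reversals.append(level)  # direction change: up -> down
                going_down = True
                level -= step
        else:
            streak = 0
            if going_down:
                reversals.append(level)      # direction change: down -> up
            going_down = False
            level += step
    return sum(reversals) / len(reversals)   # mean reversal level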
These results suggest a fairly tight coupling between a subject's ability to detect cross-spectral (auditory) and cross-modal (auditory-visual) asynchrony and the intelligibility of auditory and auditory-visual speech materials. Published by Elsevier B.V. C1 Walter Reed Army Med Ctr, Army Audiol & Speech Ctr, Washington, DC 20307 USA. Univ Maryland, Cognit Neurosci Language Lab, Neurosci & Cognit Sci Program, College Pk, MD 20742 USA. RP Grant, KW (reprint author), Walter Reed Army Med Ctr, Army Audiol & Speech Ctr, Bldg 2, Room 6A53C, 6900 Georgia Ave, Washington, DC 20307 USA. EM grant@tidalwave.net RI Van Wassenhove, Virginie/F-4129-2010 CR Abry C, 1996, NATO ASI SERIES F, V150, P247 American National Standards Institute (ANSI), 1989, S361989 ANSI ANSI, 1969, S351969 ANSI DIXON NF, 1980, PERCEPTION, V9, P719, DOI 10.1068/p090719 Doherty KA, 1996, J ACOUST SOC AM, V100, P3769, DOI 10.1121/1.417336 Erdohegyi K., 1999, P 6 EUR C SPEECH COM, P2687 Grant KW, 2001, J ACOUST SOC AM, V109, P2272, DOI 10.1121/1.1362687 GRANT KW, 1991, J ACOUST SOC AM, V89, P2952, DOI 10.1121/1.400733 Grant KW, 1996, J SPEECH HEAR RES, V39, P228 Grant KW, 1998, J ACOUST SOC AM, V103, P2677, DOI 10.1121/1.422788 Grant KW, 2000, J ACOUST SOC AM, V108, P1197, DOI 10.1121/1.1288668 Grant KW, 1998, J ACOUST SOC AM, V104, P2438, DOI 10.1121/1.423751 GRANT KW, 2001, P AUD VIS SPEECH PRO Greenberg S., 2001, P 7 EUR C SPEECH COM, P473 GREENBERG S, 2003, P 15 INT C PHON SCI, P219 IEEE, 1969, IEEE REC PRACT SPEEC Massaro DW, 1996, J ACOUST SOC AM, V100, P1777, DOI 10.1121/1.417342 MCGRATH M, 1985, J ACOUST SOC AM, V77, P678, DOI 10.1121/1.392336 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 Munhall KG, 1998, J ACOUST SOC AM, V104, P530, DOI 10.1121/1.423300 Munhall KG, 1996, PERCEPT PSYCHOPHYS, V58, P351, DOI 10.3758/BF03206811 PANDEY P C, 1986, Journal of Auditory Research, V26, P27 SEITZ PF, 1999, AVSP 99 P AUG 7 9 19 STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102 Stone MA, 2003, EAR HEARING, V24, P175, DOI 10.1097/01.AUD.0000058106.68049.9C SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Turner CW, 1998, J ACOUST SOC AM, V104, P1580, DOI 10.1121/1.424370 VANWASSENHOVE V, 2001, SOC NEUR ANN M SAN D, P488 ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 NR 29 TC 39 Z9 39 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 43 EP 53 DI 10.1016/j.specom.2004.06.004 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500005 ER PT J AU Vroomen, J van Linden, S Keetels, M de Gelder, B Bertelson, P AF Vroomen, J van Linden, S Keetels, M de Gelder, B Bertelson, P TI Selective adaptation and recalibration of auditory speech by lipread information: dissipation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE audio-visual speech; aftereffect; recalibration; selective adaptation; perceptual learning ID PERCEPTION; VENTRILOQUISM; DETECTORS AB Recently, we have shown that lipread speech can recalibrate auditory speech identification when there is a conflict between the auditory and visual information (Bertelson, P., Vroomen, J., De Gelder, B., 2003. Visual recalibration of auditory speech identification: a McGurk aftereffect. Psychol. Sci. 14 (2003) 592-597).
When an ambiguous sound intermediate between /aba/ and /ada/ was dubbed onto a face articulating /aba/ (or /ada/), the proportion of responses consistent with the visual stimulus increased in subsequent unimodal auditory sound identification trials, revealing recalibration. In contrast, when an unambiguous /aba/ or /ada/ sound was dubbed onto the face (with no conflict between vision and audition), the proportion of responses decreased, revealing selective adaptation. In the present study we show that recalibration and selective adaptation not only differ in the direction of their aftereffects, but also that they dissipate at different rates, confirming that the effects are caused by different mechanisms. (C) 2004 Elsevier B.V. All rights reserved. C1 Tilburg Univ, Dept Psychol, NL-5000 LE Tilburg, Netherlands. Free Univ Brussels, Expt Psychol Lab, Brussels, Belgium. RP Vroomen, J (reprint author), Tilburg Univ, Dept Psychol, Warandelaan 2, NL-5000 LE Tilburg, Netherlands. EM j.vroomen@uvt.nl RI Vroomen, Jean/K-1033-2013 CR Bertelson P., 1999, COGNITIVE CONTRIBUTI, P347 Bertelson P, 2003, PSYCHOL SCI, V14, P592, DOI 10.1046/j.0956-7976.2003.psci_1470.x Boersma P., 1999, PRAAT SYSTEM DOING P De Gelder B, 2003, TRENDS COGN SCI, V7, P460, DOI 10.1016/j.tics.2003.08.014 EIMAS PD, 1973, COGNITIVE PSYCHOL, V4, P99, DOI 10.1016/0010-0285(73)90006-6 Frissen I, 2003, ACTA PSYCHOL, V113, P315, DOI 10.1016/S0001-6918(03)00043-X MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 Norris D, 2003, COGNITIVE PSYCHOL, V47, P204, DOI 10.1016/S0010-0285(03)00006-9 RADEAU M, 1974, Q J EXP PSYCHOL, V26, P63, DOI 10.1080/14640747408400388 ROBERTS M, 1981, PERCEPT PSYCHOPHYS, V30, P309, DOI 10.3758/BF03206144 SALDANA HM, 1994, J ACOUST SOC AM, V95, P3658 Samuel AG, 2001, PSYCHOL SCI, V12, P348, DOI 10.1111/1467-9280.00364 SAMUEL AG, 1986, COGNITIVE PSYCHOL, V18, P452, DOI 10.1016/0010-0285(86)90007-1 Vroomen J., 2004, HDB MULTISENSORY PRO, P141 NR 14 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 55 EP 61 DI 10.1016/j.specom.2004.03.009 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500006 ER PT J AU Odisio, M Bailly, G Elisei, F AF Odisio, M Bailly, G Elisei, F TI Tracking talking faces with shape and appearance models SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE ID BIOLOGICAL MOTION; SPEECH; VIDEO; PERCEPTION; IMAGES; NOISE AB This paper presents a system that can recover and track the 3D speech movements of a speaker's face for each image of a monocular sequence. To handle both the individual specificities of the speaker's articulation and the complexity of the facial deformations during speech, speaker-specific articulated models of the face geometry and appearance are first built from real data. These face models are used for tracking: articulatory parameters are extracted for each image by an analysis-by-synthesis loop. The geometric model is linearly controlled by only seven articulatory parameters. Appearance is seen either as a classical texture map or through local appearance of a relevant subset of 3D points. We compare several appearance models: they are either constant or depend linearly on the articulatory parameters. 
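A minimal sketch of a geometry linearly controlled by seven articulatory parameters, as the abstract above describes, together with a least-squares parameter recovery step. The published analysis-by-synthesis loop minimises an image-appearance error; reducing it to 3D point correspondences here is an assumption that keeps the sketch self-contained.

import numpy as np

def synthesize_shape(mean_shape, basis, params):
    # Linear articulatory control of the geometry: a flattened point set
    # mean_shape (3N,) deformed by 7 articulatory modes basis (3N, 7)
    # weighted by params (7,).
    return mean_shape + basis @ params

def fit_params(mean_shape, basis, observed_shape):
    # One analysis-by-synthesis update reduced to its geometric core:
    # least-squares estimate of the parameters that best explain an
    # observed point set.
    params, *_ = np.linalg.lstsq(basis, observed_shape - mean_shape,
                                 rcond=None)
    return params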
We compare tracking results using these different appearance models with ground truth data not only in terms of recovery errors of the 3D geometry but also in terms of intelligibility enhancement provided by the movements. (C) 2004 Elsevier B.V. All rights reserved. C1 Inst Commun Parlee, F-38031 Grenoble 1, France. RP Odisio, M (reprint author), Inst Commun Parlee, 46 Av Felix Viallet, F-38031 Grenoble 1, France. EM matthias.odisio@icp.inpg.fr CR Badin P, 2002, J PHONETICS, V30, P533, DOI 10.1006/jpho.2002.0166 Bailly G., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025700715107 Bailly G, 2002, PROCEEDINGS OF THE 2002 IEEE WORKSHOP ON SPEECH SYNTHESIS, P27, DOI 10.1109/WSS.2002.1224365 Basu S, 1998, SPEECH COMMUN, V26, P131, DOI 10.1016/S0167-6393(98)00055-7 Benoit C., 1992, TALKING MACHINES THE, P485 BERGESON TR, 2003, P AUD VIS SPEECH PRO, P55 BLANZ V, 1999, P 26 ANN C COMP GRAP, P187, DOI DOI 10.1145/311535.311556 COHEN MM, 1996, SPEECHREADING HUMANS, V150, P153 DeCarlo D, 2000, INT J COMPUT VISION, V38, P99, DOI 10.1023/A:1008122917811 DORNAIKA F, 2003, P IEEE INT WORKSH AN, P173 Eisert P, 1998, IEEE COMPUT GRAPH, V18, P70, DOI 10.1109/38.708562 Elisei F., 2001, P AUD VIS SPEECH PRO, P90 Eveno N, 2004, IEEE T CIRC SYST VID, V14, P706, DOI 10.1109/TCSVT.2004.826754 FILLBRANDT H, 2003, P IEEE INT WORKSH AN, P181 Flannery B. P., 1992, NUMERICAL RECIPES C GEIGER G, 2003, 2003003 AI MIT CAMBR GROSS R, 2002, DAGM LECT NOTES COMP, V2449, P481 Guenin BM, 1998, P IEEE SEMICOND THER, P55, DOI 10.1145/280814.280822 GUIARDMARIGNY T, 1995, P 13 INT C PHON SCI, V3, P222 HALL D, 2000, P EUR C COMP VIS DUB, P164 HONG P, 2002, MPEG 4 FACIAL ANIMAT, P115, DOI 10.1002/0470854626.ch7 JOHANSSO.G, 1973, PERCEPT PSYCHOPHYS, V14, P201, DOI 10.3758/BF03212378 Jurie F, 2002, PATTERN RECOGN, V35, P317, DOI 10.1016/S0031-3203(01)00031-0 La Cascia M, 2000, IEEE T PATTERN ANAL, V22, P322, DOI 10.1109/34.845375 LI HB, 1993, IEEE T PATTERN ANAL, V15, P545, DOI 10.1109/34.216724 Lindeberg T., 1998, INT J COMPUT VISION, V30, P77 Lucero JC, 1999, J ACOUST SOC AM, V106, P2834, DOI 10.1121/1.428108 Luettin J., 1996, P INT C SPOK LANG PR, V1, P62, DOI 10.1109/ICSLP.1996.607030 ODISIO M, 2004, P INT C SPOK LANG PR, P2029 Pandzic IS, 1999, VISUAL COMPUT, V15, P330, DOI 10.1007/s003710050182 PARKE FI, 1982, IEEE COMPUT GRAPH, V2, P61 Pighin F, 2002, INT J COMPUT VISION, V50, P143, DOI 10.1023/A:1020393915769 Reveret L., 1998, P INT C AUD VIS SPEE, P207 Robert-Ribes J, 1998, J ACOUST SOC AM, V103, P3677, DOI 10.1121/1.423069 ROMDHANI S, 2002, P EUR C COMP VIS LEC, V2353, P3 Rosenblum LD, 1996, J EXP PSYCHOL HUMAN, V22, P318 Rosenblum LD, 1996, J SPEECH HEAR RES, V39, P1159 RYDFALK M, 1987, LITHISYI866 CANDIDE Santi A, 2003, J COGNITIVE NEUROSCI, V15, P800, DOI 10.1162/089892903322370726 SICILIANO C, 2003, P AUD VIS SPEECH PRO, P205 STROM J, 1999, P INT C COMP VIS COR SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Tao H, 2002, INT J COMPUT VISION, V50, P111, DOI 10.1023/A:1020389714861 TERZOPOULOS D, 1993, IEEE T PATTERN ANAL, V15, P569, DOI 10.1109/34.216726 THEOBALD B, 2001, P AUD VIS SPEECH PRO, P78 THEOBALD BJ, 2003, P AUD VIS SPEECH PRO, P187 VANSON RJJ, 1995, P EUR 95, P2277 WALDEN BE, 1977, J SPEECH HEAR RES, V20, P130 Walker KN, 2002, IMAGE VISION COMPUT, V20, P435, DOI 10.1016/S0262-8856(02)00014-8 NR 49 TC 8 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD OCT PY 2004 VL 44 IS 1-4 BP 63 EP 82 DI 10.1016/j.specom.2004.10.008 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500007 ER PT J AU Huang, J Potamianos, G Connell, J Neti, C AF Huang, J Potamianos, G Connell, J Neti, C TI Audio-visual speech recognition using an infrared headset SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE visual speech; audio-visual speech recognition; infrared headset; feature fusion; multi-stream HMM; decision fusion ID INTEGRATION; NOISE AB It is well known that frontal video of the speaker's mouth region contains significant speech information that, when combined with the acoustic signal, can improve accuracy and noise robustness of automatic speech recognition (ASR) systems. However, extraction of such visual speech information from full-face videos is computationally expensive, as it requires tracking faces and facial features. In addition, robust face detection remains challenging in practical human-computer interaction (HCI), where the subject's posture and environment (lighting, background) are hard to control, and thus successfully compensate for. In this paper, in order to bypass these hindrances to practical bimodal ASR, we consider the use of a specially designed, wearable audio-visual headset, a feasible solution in certain HCI scenarios. Such a headset can consistently focus on the speaker's mouth region, thus eliminating altogether the need for face tracking. In addition, it employs infrared illumination to provide robustness against severe lighting variations. We study the appropriateness of this novel device for audio-visual ASR by conducting both small- and large-vocabulary recognition experiments on data recorded using it under various lighting conditions. We benchmark the resulting ASR performance against bimodal data containing frontal, full-face videos collected at an ideal, studio-like environment, under uniform lighting. The experiments demonstrate that the infrared headset video contains comparable speech information to the studio, full-face video data, thus being a viable sensory device for audio-visual ASR. (C) 2004 Elsevier B.V. All rights reserved. C1 IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA. RP Huang, J (reprint author), IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA. EM jghg@us.ibm.com; gpotam@us.ibm.com CR ADJOUDANI A, 1997, P EUR C SPEECH COMM, P1671 CHEN LS, 2001, J SAFETY ENV, V1, P21 Chen T, 1998, P IEEE, V86, P837 Chibelushi CC, 2002, IEEE T MULTIMEDIA, V4, P23, DOI 10.1109/6046.985551 CONNELL JH, 2003, P INT C MULT EXP, P469 Deligne S., 2002, P INT C SPOK LANG PR, P1449 Duchnowski P., 1994, P INT C SPOK LANG PR, P547 Dupont S, 2000, IEEE T MULTIMEDIA, V2, P141, DOI 10.1109/6046.865479 GAGNE JP, 2001, EVALUATION AUDIOVISU Girin L, 2001, J ACOUST SOC AM, V109, P3007, DOI 10.1121/1.1358887 Graciarena M, 2003, IEEE SIGNAL PROC LET, V10, P72, DOI [10.1109/LSP.2003.808549, 10.1109/LSP.2002.808549] GURBUZ S, 2001, P INT C AC SPEECH SI, P177 Heckmann M, 2002, EURASIP J APPL SIG P, V2002, P1260, DOI 10.1155/S1110865702206150 Hennecke M. 
E., 1996, SPEECHREADING HUMANS, P331 HUANG J, 2003, P WORK AUD VIS SPEEC, P175 Jain AK, 2000, IEEE T PATTERN ANAL, V22, P4, DOI 10.1109/34.824819 Jiang JT, 2002, EURASIP J APPL SIG P, V2002, P1174, DOI 10.1155/S1110865702206046 Massaro DW, 1998, AM SCI, V86, P236, DOI 10.1511/1998.25.861 Matthews I., 2001, P INT C MULT EXP Nefian AV, 2002, EURASIP J APPL SIG P, V2002, P1274, DOI 10.1155/S1110865702206083 Potamianos G., 2003, P EUR C SPEECH COMM, P1293 Potamianos G, 2003, P IEEE, V91, P1306, DOI 10.1109/JPROC.2003.817150 Senior A. W., 1999, P INT C AUD VID BAS, P154 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Teissier P, 1999, IEEE T SPEECH AUDI P, V7, P629, DOI 10.1109/89.799688 Zhang Z., 2004, P IEEE INT C AC SPEE, VIII, P781 NR 26 TC 10 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 83 EP 96 DI 10.1016/j.specom.2004.10.007 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500008 ER PT J AU Nakadai, K Matsuura, D Okuno, HG Tsujino, H AF Nakadai, K Matsuura, D Okuno, HG Tsujino, H TI Improvement of recognition of simultaneous speech signals using AV integration and scattering theory for humanoid robots SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE audio-visual integration; robot audition; scattering theory; sound source localization; sound source separation; speech recognition; active audition AB This paper presents a method to improve recognition of three simultaneous speech signals by a humanoid robot equipped with a pair of microphones. In such situations, sound separation and automatic speech recognition (ASR) of the separated speech signal are difficult, because the signal-to-noise ratio is quite low (around -3 dB) and noise is not stable due to interfering voices. To improve recognition of three simultaneous speech signals, two key ideas are introduced. One is two-layered audio-visual integration of both name (ID) and location, that is, speech and face recognition, and speech and face localization. The other is acoustical modeling of the humanoid head by scattering theory. Sound sources are separated in real-time by an active direction-pass filter (ADPF), which extracts sounds from a specified direction by using the interaural phase/intensity difference estimated by scattering theory. Since features of separated sounds vary according to the sound direction, multiple direction- and speaker-dependent acoustic models are used. The system integrates ASR results by using the sound direction and speaker information provided by face recognition as well as confidence measure of ASR results to select the best one. The resulting system shows an improvement of about 10% on average against recognition of three simultaneous speech signals, where three speakers were located around the humanoid on a 1 m radius half circle, one of them being in front of him (angle 0 degrees) and the other two being at symmetrical positions (+/-theta) varying in 10-degree steps from 0 degrees to 90 degrees. (C) 2004 Elsevier B.V. All rights reserved. C1 Honda Res Inst Japan Co Ltd, Wako, Saitama 3510114, Japan. Tokyo Inst Technol, Grad Sch Sci & Engn, Meguro Ku, Tokyo 1528550, Japan. Kyoto Univ, Grad Sch Informat, Sakyo Ku, Kyoto 6068501, Japan.
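A minimal sketch of direction-pass filtering by interaural phase difference, in the spirit of the ADPF described above. The free-field delay model, microphone spacing, tolerance and frame length are simplifying assumptions standing in for the paper's scattering-theory estimates of IPD/IID.

import numpy as np
from scipy.signal import stft, istft

SPEED_OF_SOUND = 343.0  # m/s

def direction_pass_filter(left, right, fs, target_deg,
                          mic_distance=0.18, tol=0.35):
    # Keep the spectro-temporal bins whose interaural phase difference
    # (IPD) matches the IPD expected from the target direction; here the
    # expectation is the free-field delay d*sin(theta)/c rather than a
    # head model. Phase comparison is only unambiguous below the
    # spatial-aliasing limit (~c / (2 * mic_distance), about 950 Hz for
    # 0.18 m spacing), so this is a crude low-frequency illustration.
    f, _, L = stft(left, fs=fs, nperseg=512)
    _, _, R = stft(right, fs=fs, nperseg=512)
    delay = mic_distance * np.sin(np.radians(target_deg)) / SPEED_OF_SOUND
    expected_ipd = 2.0 * np.pi * f[:, None] * delay
    ipd = np.angle(L * np.conj(R))
    mask = np.abs(np.angle(np.exp(1j * (ipd - expected_ipd)))) < tol
    _, out = istft(L * mask, fs=fs, nperseg=512)
    return out  # reconstructed left-channel signal from the kept bins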
RP Nakadai, K (reprint author), Honda Res Inst Japan Co Ltd, Honcho 8-1, Wako, Saitama 3510114, Japan. EM nakadai@jp.honda-ri.com; matsuurd@mep.titech.ac.jp; okuno@i.kyoto-u.ac.jp; tsujino@jp.honda-ri.com RI Tsujino, Hiroshi/A-1198-2009 OI Tsujino, Hiroshi/0000-0001-8042-2796 CR Aoki M., 2001, J ACOUST SOC JAPAN, V22, P149 ARAKI S, 2003, P 2003 AUT M AC SOC, P585 Asano F., 2001, P INT C SPEECH PROC, P1013 Barker J., 2001, P EUR 2001 ESCA, P213 Bowman J.J., 1987, ELECTROMAGNETIC ACOU Breazeal C., 1999, P 16 INT JOINT C ART, P1146 Bregman AS., 1990, AUDITORY SCENE ANAL BROOKS RA, 1999, COMPUTATION METAPHOR, V1562, P52, DOI 10.1007/3-540-48834-0_5 Faugeras O., 1993, 3 DIMENSIONAL COMPUT Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347 Grant KW, 2003, P 2003 AUD VIS SPEEC, P31 Hershey J, 2000, ADV NEUR IN, V12, P813 HIDAI K, 2000, P IEEE RAS INT C INT, P1384 Hiraoka K., 2000, Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks. IJCNN 2000. Neural Computing: New Challenges and Perspectives for the New Millennium, DOI 10.1109/IJCNN.2000.861335 IRIE RE, 1997, P 2 IJCAI WORKSH COM, P54 Ivanov Y., 2004, Proceedings. Sixth IEEE International Conference on Automatic Face and Gesture Recognition JEFFRESS LA, 1948, J COMP PHYSIOL PSYCH, V41, P35, DOI 10.1037/h0061495 Klemm O., 1909, PSYCHOL STUDIEN WUND, V5, P73 Lax P., 1989, SCATTERING THEORY, V1st Lee A., 2001, P EUR C SPEECH COMM, P1691 LUETTIN J, 1998, LECT NOTES COMPUTER, V1407, P657 Matsusaka Y., 1999, P 6 EUR C SPEECH COM, P1723 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 MERSHON DH, 1981, PERCEPTION, V10, P531, DOI 10.1068/p100531 Moore B. C. J., 1989, INTRO PSYCHOL HEARIN MURATA N, 1998, P 1998 INT S NONL TH, V3, P923 NAKADAI K, 2002, P IEEE RSJ INT C INT, P1314 NAKADAI K, 2001, P 1M INT C AT INT IJ, P1424 Nakagawa Y., 1999, Proceedings Sixteenth National Conference on Artificial Intelligence (AAI-99). Eleventh Innovative Applications of Artificial Intelligence Conference (IAAI-99) POTAMIANOS G, 2000, P INT C SPOK LANG PR, V3, P746 RENEVEY P, 2001, P 7 EUR C SPEECH COM, V2, P1107 Rosenthal D. F., 1998, COMPUTATIONAL AUDITO Saruwatari H, 1999, IEICE T FUND ELECTR, VE82A, P1501 SCHWARTZ JL, 2002, P ICSLP 2002, P1937 Shafer S. A., 1986, Proceedings 1986 IEEE International Conference on Robotics and Automation (Cat. No.86CH2282-2) Silsbee PL, 1996, IEEE T SPEECH AUDI P, V4, P337, DOI 10.1109/89.536928 Sugita Y, 2003, NATURE, V421, P911, DOI 10.1038/421911a Takanishi A., 1995, Bulletin of the Centre for Informatics, Waseda University, V20 Tibrewala S., 1997, P ICASSP, P1255 Verma A., 1999, P WORKSH AUT SPEECH, P71 NR 40 TC 30 Z9 31 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD OCT PY 2004 VL 44 IS 1-4 BP 97 EP 112 DI 10.1016/j.specom.2004.10.010 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500009 ER PT J AU Sodoyer, D Girin, L Jutten, C Schwartz, JL AF Sodoyer, D Girin, L Jutten, C Schwartz, JL TI Developing an audio-visual speech source separation algorithm SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE blind source separation; audio-visual coherence; speech enhancement; audio-visual joint probability; spectral information AB Looking at the speaker's face is useful to hear better a speech signal and extract it from competing sources before identification. This might result in elaborating new speech enhancement or extraction techniques exploiting the audiovisual coherence of speech stimuli. In this paper, a novel algorithm plugging audio-visual coherence estimated by statistical tools on classical blind source separation algorithms is presented, and its assessment is described. We show, in the case of additive mixtures, that this algorithm performs better than classical blind tools both when there are as many sensors as sources, and when there are less sensors than sources. Audio-visual coherence enables a focus on the speech source to extract. It may also be used at the output of a classical source separation algorithm, to select the "best" sensor with reference to a target source. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ Grenoble 3, INPG, ICP, CNRS UMR 5009, F-38031 Grenoble 1, France. Univ Grenoble 1, INPG, LIS, CNRS UMR 5083, F-38041 Grenoble, France. RP Sodoyer, D (reprint author), Univ Grenoble 3, INPG, ICP, CNRS UMR 5009, 46 Av Felix Viallet, F-38031 Grenoble 1, France. EM sodoyer@icp.inpg.fr CR Amari S, 1998, NEURAL COMPUT, V10, P251, DOI 10.1162/089976698300017746 Benoit C., 1992, TALKING MACHINES THE, P485 BERNSTEIN LE, 2004, ENHANCED AUDITORY DE Bernstein L. E., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607895 Berthommier F, 2003, P EUR 03 GEN, P1045 BERTHOMMIER F, 2004, PHONETICALLY NEUTRAL Cardoso JF, 1996, IEEE T SIGNAL PROCES, V44, P3017, DOI 10.1109/78.553476 CARDOSO JF, 1993, IEE PROC-F, V140, P362 Deligne S., 2002, P INT C SPOK LANG PR, P1449 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 GIRIN L, 1997, P 1 ESCA ETRW AUD SP, P37 Girin L, 2001, J ACOUST SOC AM, V109, P3007, DOI 10.1121/1.1358887 GOECKE R, 2002, P INT C AC SPEECH SI, P2025 GRANT KW, 2004, DETECTION AUDITORY A Grant KW, 2000, J ACOUST SOC AM, V108, P1197, DOI 10.1121/1.1288668 Hyvarinen A, 1999, IEEE T NEURAL NETWOR, V10, P626, DOI 10.1109/72.761722 JUTTEN C, 1991, SIGNAL PROCESS, V24, P1, DOI 10.1016/0165-1684(91)90079-X KIM J, 2001, P AVSP 01 AALB, P127 KIM J, 2004, TESTING CUING HYPOTH Lallouache M. T., 1990, P 18 JOURN ET PAR MO, P282 NAKADAI K, 2004, IMPROVEMENT 3 SIMULT OKUNO HG, 2001, P INT C SPEECH PROC, P2643 Petajan E. 
D., 1984, THESIS U ILLINOIS Schwartz JL, 2004, COGNITION, V93, pB69, DOI 10.1016/j.cognition.2004.01.006 SCHWARTZ JL, 2002, P ICSLP 2002, P1937 SODOYER D, 2002, EUR JASP 2002, P1164 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Taleb A, 1999, IEEE T SIGNAL PROCES, V47, P2807, DOI 10.1109/78.790661 Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X NR 29 TC 22 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 113 EP 125 DI 10.1016/j.specom.2004.10.002 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500010 ER PT J AU Theobald, BJ Bangham, JA Matthews, IA Cawley, GC AF Theobald, BJ Bangham, JA Matthews, IA Cawley, GC TI Near-videorealistic synthetic talking faces: implementation and evaluation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE talking faces; shape and appearance models; avatars; dynamic textures AB The application of two-dimensional (2D) shape and appearance models to the problem of creating realistic synthetic talking faces is presented. A sample-based approach is adopted, where the face of a talker articulating a series of phonetically balanced training sentences is mapped to a trajectory in a low-dimensional model-space that has been learnt from the training data. Segments extracted from this trajectory corresponding to the synthesis units (e.g. triphones) are temporally normalised, blended, concatenated and smoothed to form a new trajectory, which is mapped back to the image domain to provide a natural, realistic sequence corresponding to the desired (arbitrary) utterance. The system has undergone early subjective evaluation to determine the naturalness of this synthesis approach. Described are tests to determine the suitability of the parameter smoothing method used to remove discontinuities introduced during synthesis at the concatenation boundaries, and tests used to determine how well long term coarticulation effects are reproduced during synthesis using the adopted unit selection scheme. The system has been extended to animate the face of a 3D virtual character (avatar) and this is also described. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. Carnegie Mellon, Inst Robot, Pittsburgh, PA 15123 USA. RP Theobald, BJ (reprint author), Univ E Anglia, Sch Comp Sci, Earlham Rd, Norwich NR4 7TJ, Norfolk, England. EM bjt@cmp.uea.ac.uk; ab@cmp.uea.ac.uk; iainm@cs.cmu.edu; gcc@cmp.uea.ac.uk CR Arslan L., 1998, P AUD VIS SPEECH PRO, P175 Bailly G., 2003, International Journal of Speech Technology, V6, DOI 10.1023/A:1025700715107 BAILLY G, 1992, TALKING MACHINES THE Baker S., 2001, P IEEE C COMP VIS PA, V1, P1090, DOI 10.1109/CVPR.2001.990652 Bartels R. H., 1987, INTRO SPLINES USE CO BENOIT C, 1992, TALKING MACHINES THE, P435 BLACK A, 1997, HCRCR83 U ED Brand M, 1999, P SIGGRAPH 99, P21, DOI 10.1145/311535.311537 Bregler C., 1997, Computer Graphics Proceedings, SIGGRAPH 97 Cootes T. F., 1998, P EUR C COMP VIS, V2, P484 Cosatto E., 1998, Proceedings Computer Animation '98 (Cat. No.98EX169), DOI 10.1109/CA.1998.681914 Ezzat T., 2002, P ACM SIGGRAPH 2002, P388, DOI 10.1145/566570.566594 Huang F., 2002, P IEEE INT C AC SPEE, V2, P2037 KRICOS PB, 1982, VOLTA REV, V84, P219 Massaro D. 
W., 1998, PERCEIVING TALKING F MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 OWENS E, 1985, J SPEECH HEAR RES, V28, P381 PARKE F, 1974, THESIS U UTAH SALTLA PELACHAUD C, 1991, THESIS U PENNSYLANIA Pighing F., 1998, P SIGGRAPH 98, P75, DOI 10.1145/280814.280825 STORK DG, 1996, HUMAN MACHINES MODEL, V150 THEOBALD B, 2003, P IEEE INT C AC SPEE, P800 Theobald B.-J., 2003, THESIS U E ANGLIA NO UNION IT, 2000, METHODOLOGY SUBJECTI Wakerly D., 2002, MATH STAT APPL Waters K., 1987, COMPUT GRAPH, V22, P17 Young S., 1999, HTK BOOK NR 27 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 127 EP 140 DI 10.1016/j.specom.2004.07.002 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500011 ER PT J AU Fagel, S Clemens, C AF Fagel, S Clemens, C TI An articulation model for audiovisual speech synthesis - Determination, adjustment, evaluation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE audiovisual speech synthesis; auditory-visual speech perception; articulation model; talking head ID INTEGRATION AB The authors present a visual articulation model for speech synthesis and a method to obtain it from measured data. This visual articulation model is integrated into MASSY, the Modular Audiovisual Speech SYnthesizer, and used to control visible articulator movements described by six motion parameters: one for the up-down movement of the lower jaw, three for the lips and two for the tongue. The visual articulation model implements the dominance principle as suggested by Lofqvist [Lofqvist, A., 1990. Speech as audible gestures. In: Hardcastle, W.J., Marchal, A. (Eds.), Speech Production and Speech Modeling. Kluwer Academic Publishers, Dordrecht, pp. 289-322]. The parameter values for the model derive from measured articulator positions. To obtain these data, the articulation movements of a female speaker were measured with the 2D-articulograph AG100 and simultaneously filmed. The visual articulation model is adjusted and evaluated by testing word recognition in noise. (C) 2004 Elsevier B.V. All rights reserved. C1 Tech Univ Berlin, Dept Commun Sci, D-10587 Berlin, Germany. RP Fagel, S (reprint author), Tech Univ Berlin, Dept Commun Sci, Ernst Reuter Pl 7, D-10587 Berlin, Germany. EM sascha.fagel@tu-berlin.de; carocl@web.de CR BENOIT C, 1996, MULTIMEDIA VIDEO COD BESKOW J, 2004, INT J SPEECH TECHNOL BOERSMA P, 2004, PRAAT DOING PHONE BRAIDA LD, 1991, Q J EXP PSYCHOL-A, V43, P647 Bui T. D., 2004, Proceedings. Computer Graphics International Cohen M, 1994, P WORKSH SPEECH SYNT, P53 Cohen M. M., 1993, Models and Techniques in Computer Animation FAGEL S, 2002, TAG 13 K EL SPRACHS, P372 FISHER CG, 1968, J SPEECH HEAR RES, V11, P769 Grant KW, 1998, J ACOUST SOC AM, V104, P2438, DOI 10.1121/1.423751 Le Goff B., 1997, P 5 EUR C RHOD, P1667 LOFQVIST A, 1990, NATO ADV SCI I D-BEH, V55, P289 Massaro D.
W., 1987, SPEECH PERCEPTION EA MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 MUNHALL KG, 2003, XRAY DATABASE SPEECH Munhall KG, 1998, BEHAV BRAIN SCI, V21, P524, DOI 10.1017/S0140525X98391268 Pandzic IS, 1999, VISUAL COMPUT, V15, P330, DOI 10.1007/s003710050182 SENDLMEIER WF, 1986, VERFAHREN MESSUNG FE, V10, P164 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 *WEB 3D CONS, 2001, HUM AN SPEC *WEB 3D CONS, 1997, VRML VIRT REAL MOD L WIRTH G, 1994, SPRECH SPRACHSTORUNG *ZAS, 2003, ZENTR ALLG SPRACHW NR 23 TC 20 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 141 EP 154 DI 10.1016/j.specom.2004.10.006 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500012 ER PT J AU Dohen, M Loevenbruck, H Cathiard, MA Schwartz, JL AF Dohen, M Loevenbruck, H Cathiard, MA Schwartz, JL TI Visual perception of contrastive focus in reiterant French speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE audiovisual speech; perception; prosody; contrastive focus; French ID RIGHT-HEMISPHERE; PROSODY; DISTURBANCES; LANGUAGE AB The aim of this paper is to study how contrastive focus is conveyed by prosody both articulatorily and acoustically and how viewers extract focus structure from visual prosodic realizations. Is the visual modality useful for the perception of prosody? An audiovisual corpus was recorded from a male native speaker of French. The sentences had a subject-verb-object (SVO) structure. Four contrastive focus conditions were studied: focus on each phrase (S, V or O) and broad focus. Normal and reiterant modes were recorded, only the latter was studied. An acoustic validation (fundamental frequency, duration and intensity) showed that the speaker had pronounced the utterances with a typical focused intonation on the focused phrase. Then, lip height and jaw opening were extracted from the video data. An articulatory analysis suggested a set of possible visual cues to focus for reiterant /ma/ speech: (a) prefocal lengthening, (b) large jaw opening and high opening velocities on all the focused syllables; (c) long lip closure for the first focused syllable and (d) hypo-articulation (reduced jaw opening and duration) of the following phrases. A visual perception test was developed. It showed that (a) contrastive focus was well perceived visually for reiterant speech; (b) no training was necessary and (c) subject focus was slightly easier to identify than the other focus conditions. We also found that if the visual cues identified in our articulatory analysis were present and marked, perception was enhanced. This enables us to assume that the visual cues extracted from the corpus are probably the ones which are indeed perceptually salient. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ Grenoble 3, INPG, Inst Commun Parlee, UMR CNRS 5009, F-38031 Grenoble, France. RP Dohen, M (reprint author), Univ Grenoble 3, INPG, Inst Commun Parlee, UMR CNRS 5009, 46 Ave Felix Viallet, F-38031 Grenoble, France.
EM dohen@icp.inpg.fr; loeven@icp.inpg.fr CR AUDOUY M, 2000, TRAITEMENT IMAGES VI BARTELS C, 1994, FOCUS NATURAL LANGUA, P94 BAUM SR, 1982, BRAIN LANG, V17, P261, DOI 10.1016/0093-934X(82)90020-7 BECKMAN ME, IN PRESS PROSODIC TY, pCH2 BECKMAN ME, 1992, LAB PHONOLOGY, P68 BRADVIK B, 1991, ACTA NEUROL SCAND, V84, P114 BRYAN KL, 1989, APHASIOLOGY, V3, P285, DOI 10.1080/02687038908249000 Burnham D., 2001, P INT C AUD VIS SPEE, P155 Cave C, 1996, P ICSLP 96, V4, P2175, DOI 10.1109/ICSLP.1996.607235 Clech-Darbon Anne, 1999, GRAMMAR FOCUS, P83 Dahan D, 1996, LANG SPEECH, V39, P341 DEJONG KJ, 1995, J ACOUST SOC AM, V97, P491, DOI 10.1121/1.412275 Delais-Roussarie E., 2002, P SPEECH PROSODY 200 Di Cristo A., 1998, INTONATION SYSTEMS S, P195 Di Cristo A., 2000, J FRENCH LANGUAGE ST, V10, P27, DOI 10.1017/S0959269500000120 DICRISTO A, 1993, TRAVAUX I PHONETIQUE, V23, P9 DIMPERIO M, 2003, P ICPHS 2003 BARC SP D'Imperio M, 2001, SPEECH COMMUN, V33, P339, DOI 10.1016/S0167-6393(00)00064-9 Erickson D, 1998, PHONETICA, V55, P147, DOI 10.1159/000028429 Fraisse P., 1957, PSYCHOL TEMPS GRANSTROM B, 1999, P ICPHS 1999 SAN FRA, V1, P655 GUSSENHOVEN C, 1983, LANG SPEECH, V26, P61 HARRINGTON J, 1995, J PHONETICS, V23, P305, DOI 10.1016/S0095-4470(95)80163-4 Jankowski L., 1999, P 14 INT C PHON SCI, P1565 Jun S. A., 2002, PROBUS, V14, P147, DOI 10.1515/prbs.2002.002 Jun SA, 2000, TEXT SPEECH LANG TEC, V15, P209 Keating P., 2003, P 16 INT C PHON SCI, P2071 KELSO JAS, 1985, J ACOUST SOC AM, V77, P266, DOI 10.1121/1.392268 KRAHMER E, IN PRESS PERCEIVING Ladd R., 1996, INTONATIONAL PHONOLO Lallouache M., 1991, THESIS I NATL POLYTE LARKEY LS, 1983, J ACOUST SOC AM, V73, P1337, DOI 10.1121/1.389237 Liberman Mark, 1984, LANGUAGE SOUND STRUC, P157 LIBERMAN MY, 1978, J ACOUST SOC AM, V63, P231, DOI 10.1121/1.381718 LOEVENBRUCK H, 1999, P 14 INT C PHON SCI, V1, P667 MERTENS P, 1993, P ESCA WORKSH PROS L, V41, P155 PIERREHUMBERT J, 1990, INTENTIONS COMMUNICA POST B, 2000, THESIS NETHERLANDS G ROSSI M, 1999, OPHRYS, P116 Selkirk E. O., 1984, PHONOLOGY SYNTAX REL, P197 Summers W V, 1987, J Acoust Soc Am, V82, P847 TOUATI P, 1987, 21 LUND U VAISSIERE J, 1997, ATALA, V38, P53 Weenink D., 1996, 132 U AMST I PHON SC WEINTRAUB S, 1981, ARCH NEUROL-CHICAGO, V38, P742 YEHIA H, 2000, 5 SEM SPEECH PROD MO, P265 NR 46 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 155 EP 172 DI 10.1016/j.specom.2004.10.009 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500013 ER PT J AU Caldognetto, EM Cosi, P Drioli, C Tisato, G Cavicchio, F AF Caldognetto, EM Cosi, P Drioli, C Tisato, G Cavicchio, F TI Modifications of phonetic labial targets in emotive speech: effects of the co-production of speech and emotions SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE co-production of speech and emotion; emotive articulatory lip parameters; phonetic-phonological modifications induced by emotions ID EXPRESSION AB This paper describes how the visual characteristics of some Italian phones (/'a/, /b/, /v/) are modified in emotive speech by the expression of the "big six" emotions: joy, surprise, sadness, disgust, anger, and fear.
In this research we specifically analyze the interaction between the articulatory lip targets of the Italian vowel /'a/ and consonants /b/ and /v/, defined by phonetic-phonological rules, and the labial configurations, peculiar to each emotion. This interaction was quantified on the basis of the variations of the following parameters: lip opening, upper and lower lip vertical displacements, lip rounding, anterior/posterior movements (protrusion) of upper lip and lower lip, left and right lip corner horizontal displacements, left and right corner vertical displacements, and two asymmetry parameters, calculated as the difference between right and left corner position along the horizontal and the vertical axes. The first aim of this research is to quantify the modifications of the lip articulatory parameters due to the emotions; the second aim is to analyze the parameters which are subject to phonetic-phonological constraints and are consequently less influenced by emotions. The results are useful for defining emotive speech production models and are presently employed in research on audiovisual speech synthesis (Talking Heads). (C) 2004 Elsevier B.V. All rights reserved. C1 CNR, ISTC Inst Cognit Sci & Technol, Lab Phonet & Dialectol, I-35121 Padua, Italy. RP Caldognetto, EM (reprint author), CNR, ISTC Inst Cognit Sci & Technol, Lab Phonet & Dialectol, Via Anghinoni 10, I-35121 Padua, Italy. EM magno@csrf.pd.cnr.it RI Drioli, Carlo/B-3303-2014 CR Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 CALDAS E, 1997, PROGRAMA ESTADUAL CO, P5 CALDOGNETTO EM, 1998, ATT 26 C NAZL AC TOR, P263 CALDOGNETTO EM, 2003, P AVSP 2003 AUD VIS, P209 CALDOGNETTO EM, 1998, P AVSP 98 TERR AUS, P135 CALDOGNETTO EM, 2004, P TUT RES WORKSH AFF, P233 CALDOGNETTO EM, 1996, ATTI 6 GIORNATE STUD, P95 Ekman P., 1978, FACIAL ACTION CODING EKMAN P, 2002, NEW VERSION FACIAL A HESS U, 1998, FACETS EMOTION RECEN, P161 MASSARO D, 1998, PSYCHOL BULL, V3, P1021 Massaro D. W., 1987, HEARING EYE PSYCHOL, P53 NORDSTRAND M, 2003, P AVSP 03 S JORIOZ F, P233 Scherer K. R., 2003, HDB AFFECTIVE SCI, P433 SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 Scherer KR, 2003, SPEECH COMMUN, V40, P227, DOI 10.1016/S0167-6393(02)00084-5 Summerfield Q, 1987, HEARING EYE PSYCHOL, P3 ZMARICH C, 2003, P 15 INT C PHON SCI, P3121 NR 18 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 173 EP 185 DI 10.1016/j.specom.2004.10.012 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500014 ER PT J AU Nordstrand, M Svanfeldt, G Granstrom, B House, D AF Nordstrand, M Svanfeldt, G Granstrom, B House, D TI Measurements of articulatory variation in expressive speech for a set of Swedish vowels SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE talking heads; expressive speech; facial gestures; articulation AB Facial gestures are used to convey, e.g., emotions, dialogue states and conversational signals, which support us in the interpretation of other people's feelings and intentions. Synthesising this behaviour with an animated talking head would widen the possibilities of this intuitive interface. The dynamic characteristics of these facial gestures during speech affect articulation.
Previously, articulation for neutral speech has been studied and implemented in animation rules. The results obtained in this study show how some articulatory parameters are affected by the influence of expressiveness in speech for a selection of Swedish vowels. Our focus has primarily been on attitudes and emotions conveying information that is intended to make an animated agent more "human-like". A multimodal corpus of acted expressive speech has been collected for this purpose. (C) 2004 Elsevier B.V. All rights reserved. C1 KTH, Dept Speech Mus & Hearing, Ctr Speech Technol, S-10044 Stockholm, Sweden. RP Nordstrand, M (reprint author), KTH, Dept Speech Mus & Hearing, Ctr Speech Technol, Drottning Kristinas Vag 31, S-10044 Stockholm, Sweden. EM magnusn@speech.kth.se; gunillas@speech.kth.se; bjorn@speech.kth.se; davidh@speech.kth.se CR Argyle Michael, 1976, GAZE AND MUTUAL GAZE Benoit C., 1996, SPEECHREADING HUMANS, P315 Beskow J., 2003, THESIS KTH STOCKHOLM Beskow J., 1997, P ESCA WORKSH AUD VI, P149 Cohen M. M., 1993, Models and Techniques in Computer Animation Davitz J. R, 1964, COMMUNICATION EMOTIO, P101 DUNCAN S, 1972, J PERS SOC PSYCHOL, V23, P283, DOI 10.1037/h0033031 Duncan S., 1974, LANG SOC, V3, P161, DOI DOI 10.1017/S0047404500004322 Ekman P., 1979, HUMAN ETHOLOGY, P169 ERBER NP, 1969, J SPEECH HEAR RES, V12, P423 FONAGY I, 1976, PHONETICA, V33, P31 Granstrom B., 1999, P INT C PHON SCI ICP, P655 House D., 2001, P EUR 2001, P387 IIDA A, 2002, THESIS KEIO U JAPAN MACLEOD A, 1990, British Journal of Audiology, V24, P29, DOI 10.3109/03005369009077840 OHMAN T, 1998, KTH TMH QPSR, V1, P61 Osgood Charles E., 1957, MEASUREMENT MEANING Pelachaud C, 1996, COGNITIVE SCI, V20, P1 Reveret L., 2000, P 6 INT C SPOK LANG, P755 Reveret L., 1998, P INT C AUD VIS SPEE, P207 Schrober M, 2003, SPEECH COMMUN, V40, P99, DOI 10.1016/S0167-6393(02)00078-X Sjolander K., 2003, P FON 2003, P93 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 SVANFELDT G, 2003, P FONETIK 2003 LOVAN, P53 NR 24 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2004 VL 44 IS 1-4 BP 187 EP 196 DI 10.1016/j.specom.2004.09.003 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500015 ER PT J AU Attina, V Beautemps, D Cathiard, MA Odisio, M AF Attina, V Beautemps, D Cathiard, MA Odisio, M TI A pilot study of temporal organization in Cued Speech production of French syllables: rules for a Cued Speech synthesizer SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Audio-Visual Speech Processing CY 2003 CL St Jorioz, FRANCE DE cued speech; coarticulation; speech production; temporal organization; audiovisual synthesis ID DEAF-CHILDREN; HEARING; COARTICULATION; RECOGNITION; PERCEPTION; MODEL AB This study investigated the temporal coordination of the articulators involved in French Cued Speech. Cued Speech is a manual complement to lipreading. It uses handshapes and hand placements to disambiguate series of CV syllables. Hand movements, lip gestures and acoustic data were collected from a speaker certified in manual Cued Speech uttering and coding CV sequences. Experiment I studied hand placement in relation to lip gestures and the corresponding sound. The results show that the hand movement begins up to 239 ms before the acoustic onset of the CV syllable. 
The target position is reached during the consonant, well before the vowel lip target. Experiment II used a data glove to collect finger gestures. It was designed to investigate handshape formation relative to lip gestures and the corresponding acoustic signal. The results show that the handshape formation gesture takes a large part of the hand transition. Both experiments therefore reveal the anticipatory gesture of the hand motion over the lips. The types of control for vocalic and consonantal information transmitted by the hand are discussed in reference to speech coarticulation. Finally, the temporal coordination observed between Cued Speech articulators and the corresponding sound was used to derive rules for controlling an audiovisual system delivering Cued Speech for French CV syllables. (C) 2004 Published by Elsevier B.V. C1 Univ Grenoble 3, INPG, Inst Commun Parlee, CNRS UMR 5009, F-38031 Grenoble 01, France. RP Beautemps, D (reprint author), Univ Grenoble 3, INPG, Inst Commun Parlee, CNRS UMR 5009, 46 Ave Felix Viallet, F-38031 Grenoble 01, France. EM denis.beautemps@icp.inpg.fr CR ABRY C, 1996, SPEECHREADING HUMANS, P247 Abry C., 2002, PHONETICS PHONOLOGY, P226 Alegria J, 1999, EUR J COGN PSYCHOL, V11, P451 Badin P, 2002, J PHONETICS, V30, P533, DOI 10.1006/jpho.2002.0166 Bailly G., 1992, Traitement du Signal, V9 Beautemps D, 2001, J ACOUST SOC AM, V109, P2165, DOI 10.1121/1.1361090 Bernstein LE, 2000, PERCEPT PSYCHOPHYS, V62, P233, DOI 10.3758/BF03205546 BRATAKOS MS, 1998, CUED SPEECH J, V6, P1 BROWN KT, 1998, AFRICAN AM RES PERSP, V4, P55 CATHIARD MA, IN PRESS PAROLE CATHIARD MA, 2004, P 25 J ETUDES P 0419, P113 Cornett R O, 1988, Acta Otorhinolaryngol Belg, V42, P375 CORNETT RO, 1967, AM ANN DEAF, V112, P3 CORNETT RO, 1982, AIDES MANUELLES LECT, P5 DUCHNOWSKI P, 1998, P 5 INT C SPOK LANG, V7, P3289 Duchnowski P, 2000, IEEE T BIO-MED ENG, V47, P487, DOI 10.1109/10.828148 Elisei F., 2001, P AUD VIS SPEECH PRO, P90 GIBERT G, 2004, LREC, P2123 Lallouache M., 1991, THESIS I NATL POLYTE LEYBAERT J, 1996, REV FR LING APPL, V1, P81 Leybaert J, 2000, J EXP CHILD PSYCHOL, V75, P291, DOI 10.1006/jecp.1999.2539 LEYBAERT J, 1998, HEARING EYE, V2, P283 Leybaert J, 2001, J SPEECH LANG HEAR R, V44, P949, DOI 10.1044/1092-4388(2001/074) LUETTIN J, 1998, P 5 EUR C COMP VIS, P657 MacNeilage PF, 1998, BEHAV BRAIN SCI, V21, P499 Massaro D. W., 1998, PERCEIVING TALKING F Morlec Y, 2001, SPEECH COMMUN, V33, P357, DOI 10.1016/S0167-6393(00)00065-0 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z *NAT CUED SPEECH A, 1994, CUED SPEECH J, V5, P73 NICHOLLS GH, 1982, J SPEECH HEAR RES, V25, P262 OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310 OWENS E, 1985, J SPEECH HEAR RES, V28, P381 PERKELL JS, 1990, NATO ADV SCI I D-BEH, V55, P263 PERKELL JS, 1992, J ACOUST SOC AM, V91, P2911, DOI 10.1121/1.403778 Radeau M., 1992, ANAL APPROACHES HUMA, P107 Reisberg D., 1987, HEARING EYE PSYCHOL, P97 SATO M, 2002, P 7 INT C SPEECH LAN, P669 Schmidt R, 1988, MOTOR CONTROL LEARNI Schwartz J. L., 1998, HEARING EYE, P85 SCHWARTZ JL, 2002, TRAITEMENT AUTOMATIQ, P141 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Summerfield A. Q., 1987, HEARING EYE PSYCHOL, P3 UCHANSKI RM, 1994, J REHABIL RES DEV, V31, P20 VILAIN A, 2000, P 5 SEM SPEECH PROD, P81 WOODWARD MF, 1960, J SPEECH HEAR RES, V3, P212 NR 45 TC 14 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
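The timing results above (hand transition beginning up to 239 ms before the acoustic onset, handshape largely formed during the transition) translate naturally into control rules for a Cued Speech synthesizer. The sketch below is an illustrative scheduler only; the 200 ms anticipation constant, the shape-formation fraction and all names are assumptions, not the rule set derived in the paper.

    # Toy scheduler turning the reported hand-lip timing into synthesizer cues.
    from dataclasses import dataclass

    HAND_ANTICIPATION = 0.20   # hand starts ~200 ms before the acoustic onset (assumed)
    SHAPE_FRACTION = 0.7       # handshape mostly formed during the transition (assumed)

    @dataclass
    class CueEvent:
        syllable: str
        hand_move_at: float    # start of hand transition (s)
        shape_ready_at: float  # handshape fully formed (s)
        acoustic_onset: float  # CV acoustic onset (s)

    def schedule_cues(syllables):
        """syllables: list of (label, acoustic_onset_seconds) pairs."""
        events = []
        for label, onset in syllables:
            move_at = max(0.0, onset - HAND_ANTICIPATION)
            shape_at = move_at + SHAPE_FRACTION * (onset - move_at)
            events.append(CueEvent(label, move_at, shape_at, onset))
        return events

    for ev in schedule_cues([("ma", 0.30), ("pu", 0.75), ("te", 1.20)]):
        print(ev)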
PD OCT PY 2004 VL 44 IS 1-4 BP 197 EP 214 DI 10.1016/j.specom.2004.10.013 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 884HD UT WOS:000226074500016 ER PT J AU Cooke, MP Ellis, DPW AF Cooke, MP Ellis, DPW TI Introduction to the special issue on the recognition and organization of real-world sound SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. Columbia Univ, Dept Elect Engn, New York, NY 10027 USA. RP Cooke, MP (reprint author), Univ Sheffield, Dept Comp Sci, 211 Portobello St,Regent Court, Sheffield S1 4DP, S Yorkshire, England. EM m.cooke@dcs.shef.ac.uk NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2004 VL 43 IS 4 BP 273 EP 274 DI 10.1016/j.specom.2004.05.001 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 862FZ UT WOS:000224477300001 ER PT J AU Raj, B Seltzer, ML Stern, RM AF Raj, B Seltzer, ML Stern, RM TI Reconstruction of missing features for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article ID ALGORITHM; SENTENCES AB Speech recognition systems perform poorly in the presence of corrupting noise. Missing feature methods attempt to compensate for the noise by removing noise corrupted components of spectrographic representations of noisy speech and performing recognition with the remaining reliable components. Conventional classifier-compensation methods modify the recognition system to work with the incomplete representations so obtained. This constrains them to perform recognition using spectrographic features which are known to be less optimal than cepstra. In this paper we present two missing-feature algorithms that reconstruct complete spectrograms from incomplete noisy ones. Cepstral vectors can now be derived from the reconstructed spectrograms for recognition. The first algorithm uses MAP procedures to estimate corrupt components from their correlations with reliable components. The second algorithm clusters spectral vectors of clean speech. Corrupt components of noisy speech are estimated from the distribution of the cluster that the analysis frame is identified with. Experiments show that, although conventional classifier-compensation methods are superior when recognition is performed with spectrographic features, cepstra derived from the reconstructed spectrograms result in better recognition performance overall. The proposed methods are also less expensive computationally and do not require modification of the recognizer. (C) 2004 Elsevier B.V. All rights reserved. C1 Mitsubishi Electr Corp, Res Labs, Cambridge, MA 02139 USA. Carnegie Mellon Univ, Pittsburgh, PA 15213 USA. RP Raj, B (reprint author), Mitsubishi Electr Corp, Res Labs, 201 Broadway,8th Floor, Cambridge, MA 02139 USA. 
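Of the two reconstruction algorithms described in this abstract, the cluster-based one is the simplest to illustrate. Below is a minimal sketch, not the authors' implementation: the cluster means stand in for a codebook trained on clean speech, each noisy frame is matched to a cluster using only its reliable components, and the corrupt components are filled in from the winning cluster's mean.

    import numpy as np

    def reconstruct_frame(frame, reliable, cluster_means):
        """Fill corrupt spectrogram components from the best-matching cluster.

        frame:         (D,) log-spectral vector of noisy speech
        reliable:      (D,) boolean mask, True where a component is reliable
        cluster_means: (K, D) cluster means trained on clean speech
        """
        # Pick the cluster that best explains the reliable components only.
        dists = np.sum((cluster_means[:, reliable] - frame[reliable]) ** 2, axis=1)
        best = cluster_means[np.argmin(dists)]
        out = frame.copy()
        out[~reliable] = best[~reliable]   # corrupt bins come from the cluster mean
        return out

    rng = np.random.default_rng(0)
    means = rng.normal(size=(8, 20))              # stand-in for k-means on clean data
    clean = means[3] + 0.1 * rng.normal(size=20)
    mask = rng.random(20) > 0.4                   # ~60% of bins deemed reliable
    noisy = clean.copy(); noisy[~mask] += 5.0     # corrupt the unreliable bins
    print(np.abs(reconstruct_frame(noisy, mask, means) - clean).max())  # small error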
EM bhiksha@merl.com CR Acero A., 1993, ACOUSTIC ENV ROBUSTN BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cooke M., 1994, P 3 INT C SPOK LANG, P1555 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 COOKE M, 1994, TR940501 U SHEFF DEP COOKE MP, 1997, P IEEE C AC SPEECH S DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Drygajlo A, 1998, INT CONF ACOUST SPEE, P121, DOI 10.1109/ICASSP.1998.674382 DUPONT S, 1998, P ICSLP98 SYDN AUSTR Fletcher H., 1953, SPEECH HEARING COMMU Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 JOSIFOVSKI L, 1999, P EUROSPEECH 99 BUD LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 Lippmann R., 1997, P EUR 97 RHOD GREEC, P37 McQueen J., 1967, 5TH BERK S MATH STAT, V1, P281 MILLER GA, 1950, J ACOUST SOC AM, V22, P167, DOI 10.1121/1.1906584 Moreno P.J., 1996, THESIS CARNEGIE MELL MORRIS AC, 2001, WISP 2001 O'Shaughnessy D., 1987, SPEECH COMMUNICATION Papoulis A, 1991, PROBABILITY RANDOM V, V3rd Price P., 1988, P IEEE INT C AC SPEE, P651 RAJ B, 1998, P ICSLP98 SYDN AUSTR Raj B., 2000, THESIS CARNEGIE MELL RAJ B, 1997, P IEEE C AC SPEECH S RENEVEY P, 2000, P EUSIPCO 2000 Renevey P., 1999, P EUROSPEECH BUD HUN, P2627 RENEVEY P, 2001, THESIS SWISS FEDERAL Varga A.P., 1990, P ICASSP, P845 VIZHINHO A, 1999, P EUROSPEECH 99 BUD HU, P2407 WARREN RM, 1995, PERCEPT PSYCHOPHYS, V57, P175, DOI 10.3758/BF03206503 NR 31 TC 126 Z9 128 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2004 VL 43 IS 4 BP 275 EP 296 DI 10.1016/j.specom.2004.03.007 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 862FZ UT WOS:000224477300002 ER PT J AU Collier, GL AF Collier, GL TI A comparison of novices and experts in the identification of sonar signals SO SPEECH COMMUNICATION LA English DT Article ID SOUNDS; PERCEPTION; RECOGNITION; WHALE AB This research contrasted trained sonar operators with novices in their abilities to categorize sonar signals as being of biological or man-made origin. Although the sonar operators performed significantly better than the naive subjects, the performance of the naive subjects (d' = 1.39) was closer to that of the sonar operators (d' = 1.81) than to chance. Additionally, there were qualitative similarities in the two groups' performances. In order to understand this, a new group of people listened to a subset of 76 of the sounds, and chose verbal descriptors from a detailed list of 146 words such as "pht", "chuck-chuck", "rainfall", etc. to describe each sound. Analyses of the descriptor choices demonstrated that biological sound sources were identifiable as such simply because they sounded like animal noises. Man-made signals were identified as more percussive, a result confirmed by an acoustical analysis of the signals. Two scales were developed for the signals, indicating how "animal-like" and how percussive they were, respectively. Either of the two scales could predict with about 75% accuracy the category of the signals, a level of accuracy equal to that of the novices, and not bested by either a CART model or a neural net model. A simple acoustical analysis of the percussiveness of the waveforms could predict people's judgments, but not predict the objective classifications of the signals. (C) 2004 Elsevier B.V. All rights reserved.
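The d' values quoted for the novices and the sonar operators come from standard signal detection theory: d' = z(hit rate) - z(false-alarm rate). A worked computation follows; the example rates are invented for illustration, only the formula is standard.

    # d' from hit and false-alarm rates, using the inverse normal CDF.
    from statistics import NormalDist

    def d_prime(hit_rate, false_alarm_rate):
        z = NormalDist().inv_cdf
        return z(hit_rate) - z(false_alarm_rate)

    # Illustrative rates only: 76% hits with 25% false alarms gives d' ~ 1.4,
    # in the range reported above for the novices.
    print(round(d_prime(0.76, 0.25), 2))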
C1 S Carolina State Univ, Dept Psychol, Orangeburg, SC 29117 USA. RP Collier, GL (reprint author), 343 Hunters Blind Dr, Columbia, SC 29212 USA. EM brainstorm@sc.rr.com CR ANNETT J, 1971, 67C01051 NAVTR BALLAS JA, 1993, J EXP PSYCHOL HUMAN, V19, P250, DOI 10.1037//0096-1523.19.2.250 Barshan B, 2002, NEURAL NETWORKS, V15, P131, DOI 10.1016/S0893-6080(01)00120-4 Bishop C. M., 1995, NEURAL NETWORKS PATT BRUSKE J, 1995, NEURAL COMPUT, V7, P845, DOI 10.1162/neco.1995.7.4.845 CHABOT D, 1988, ETHOLOGY, V77, P89 CLARK CW, 1982, ANIM BEHAV, V30, P1060, DOI 10.1016/S0003-3472(82)80196-6 DROR IE, 1995, NEURAL NETWORKS, V8, P149, DOI 10.1016/0893-6080(94)00057-S GAVER WW, 1993, ECOL PSYCHOL, V5, P1, DOI 10.1207/s15326969eco0501_1 Gibson J. J., 1966, SENSES CONSIDERED PE Gibson J. J., 1979, ECOLOGICAL APPROACH Handel S, 1989, LISTENING INTRO PERC HOWARD JH, 1977, J ACOUST SOC AM, V62, P1490 MACKAY RS, 1981, SCIENCE, V212, P676, DOI 10.1126/science.212.4495.676 MCADAMS S, 1993, THINKIN SOUND COGNIT Mellinger DK, 1997, MAR FRESHW BEHAV PHY, V29, P163 Michaels C. F., 1981, DIRECT PERCEPTION Olshen R., 1984, CLASSIFICATION REGRE, V1st Ripley B., 1996, PATTERN RECOGNITION Shepherd A.J., 1997, 2 ORDER METHODS NEUR SOLOMON LN, 1958, J ACOUST SOC AM, V30, P421, DOI 10.1121/1.1909632 Vanderveer N. J., 1979, DISS ABSTR INT, V40, p4543B WARREN WH, 1984, J EXP PSYCHOL HUMAN, V10, P704, DOI 10.1037/0096-1523.10.5.704 NR 23 TC 8 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2004 VL 43 IS 4 BP 297 EP 310 DI 10.1016/j.specom.2004.03.003 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 862FZ UT WOS:000224477300003 ER PT J AU Goto, M AF Goto, M TI A real-time music-scene-description system: predominant-F0 estimation for detecting melody and bass lines in real-world audio signals SO SPEECH COMMUNICATION LA English DT Article DE F0 estimation; MAP estimation; EM algorithm; music understanding; computational auditory scene analysis; music information retrieval ID COMPUTATIONAL MODEL; BEAT TRACKING; PITCH; FREQUENCY; SOUNDS; PERCEPTION; SEPARATION AB In this paper, we describe the concept of music scene description and address the problem of detecting melody and bass lines in real-world audio signals containing the sounds of various instruments. Most previous pitch-estimation methods have had difficulty dealing with such complex music signals because these methods were designed to deal with mixtures of only a few sounds. To enable estimation of the fundamental frequency (F0) of the melody and bass lines, we propose a predominant-F0 estimation method called PreFEst that does not rely on the unreliable fundamental component and obtains the most predominant F0 supported by harmonics within an intentionally limited frequency range. This method estimates the relative dominance of every possible F0 (represented as a probability density function of the F0) by using MAP (maximum a posteriori probability) estimation and considers the F0's temporal continuity by using a multiple-agent architecture. Experimental results with a set of ten music excerpts from compact-disc recordings showed that a real-time system implementing this method was able to detect melody and bass lines about 80% of the time these existed. (C) 2004 Elsevier B.V. All rights reserved. C1 Natl Inst Adv Ind Sci & Technol, Tsukuba, Ibaraki 3058568, Japan.
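The core intuition of PreFEst, scoring every candidate F0 by the support its harmonics receive within an intentionally limited frequency range, can be illustrated with a much-simplified harmonic-summation sketch. The published method estimates a full probability density over F0 with MAP/EM estimation and tracks it with a multiple-agent architecture; none of that is reproduced here, and all names and constants below are illustrative.

    import numpy as np

    def predominant_f0(power_spectrum, fs, n_fft, f0_grid, n_harm=8,
                       f_lo=60.0, f_hi=2000.0):
        """Crude harmonic-summation stand-in for predominant-F0 estimation."""
        freqs = np.arange(len(power_spectrum)) * fs / n_fft
        band = (freqs >= f_lo) & (freqs <= f_hi)   # intentionally limited range
        spec = np.where(band, power_spectrum, 0.0)
        scores = []
        for f0 in f0_grid:
            bins = (np.arange(1, n_harm + 1) * f0 / (fs / n_fft)).astype(int)
            bins = bins[bins < len(spec)]
            scores.append(spec[bins].sum())        # harmonic support for this F0
        scores = np.asarray(scores)
        return f0_grid[np.argmax(scores)], scores / (scores.sum() + 1e-12)

    fs, n_fft = 16000, 2048
    freqs = np.arange(n_fft // 2) * fs / n_fft
    spec = sum(np.exp(-0.5 * ((freqs - k * 220.0) / 8.0) ** 2) for k in range(1, 6))
    f0, dist = predominant_f0(spec, fs, n_fft, np.arange(80.0, 500.0, 1.0))
    print(f0)   # approximately 220 Hz, the most strongly supported candidate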
RP Goto, M (reprint author), Natl Inst Adv Ind Sci & Technol, 1-1-1 Umezono, Tsukuba, Ibaraki 3058568, Japan. EM m.goto@aist.go.jp CR ABE T, 1997, P ASVA 97, P423 ABE T, 1996, P ICSLP, V2, P1277, DOI 10.1109/ICSLP.1996.607843 BOASHASH B, 1992, P IEEE, V80, P520, DOI 10.1109/5.135376 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, J NEW MUSIC RES, V23, P107, DOI 10.1080/09298219408570651 BROWN GJ, 1992, THESIS U SHEFFIELD Chafe C., 1986, P IEEE INT C AC SPEE, P1289 Charpentier F. J., 1986, P INT C AC SPEECH SI, P113 COOKE MP, 1993, SPEECH COMMUN, V13, P391, DOI 10.1016/0167-6393(93)90037-L de Cheveigne A, 1999, SPEECH COMMUN, V27, P175, DOI 10.1016/S0167-6393(98)00074-0 DECHEVEIGNE A, 1993, J ACOUST SOC AM, V93, P3271 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 FLANAGAN JL, 1966, AT&T TECH J, V45, P1493 Goto M., 1996, P 2 INT C MULT SYST, P103 GOTO M, 1998, THESIS WASEDA U GOTO M, 1998, COMPUTATIONAL AUDITO, P157 Goto M, 2001, J NEW MUSIC RES, V30, P159, DOI 10.1076/jnmr.30.2.159.7114 Goto M, 1999, SPEECH COMMUN, V27, P311, DOI 10.1016/S0167-6393(98)00076-4 GOTO M, 1997, P 1997 INT COMP MUS, P446 Goto M., 1994, Proceedings ACM Multimedia '94, DOI 10.1145/192593.192700 KASHINO K, 1994, ELECTRON COMM JPN 3, V77, P35, DOI 10.1002/ecjc.4430770704 KASHINO K, 1997, P INT JOINT C ART IN, P1126 KASHINO K, 1994, THESIS U TOKYO Kashino K., 1998, COMPUTATIONAL AUDITO, P115 KATAYOSE H, 1989, COMPUT MUSIC J, V13, P72, DOI 10.2307/3679555 Kawahara H., 1999, P EUR 99, P2781 Kitano H., 1993, P 13 INT JOINT C ART, P813 KLAPURI A, 2001, P IEEE INT C AC SPEE MASUDAKATSUSE I, 2001, P WORKSH CONS REL AC MASUDAKATSUSE I, 2001, P EUR 01, P1119 NAKATANI T, 1995, P IJCAI, P165 NEHORAI A, 1986, IEEE T ACOUST SPEECH, V34, P1124, DOI 10.1109/TASSP.1986.1164952 NOLL AM, 1967, J ACOUST SOC AM, V41, P293, DOI 10.1121/1.1910339 OHMURA H, 1994, P IEEE INT C AC SPEE, P189 OKUNO HG, 1997, IJCAI 97 WORKSH COMP PARSONS TW, 1976, J ACOUST SOC AM, V60, P911, DOI 10.1121/1.381172 PLOMP R, 1967, J ACOUST SOC AM, V41, P1526, DOI 10.1121/1.1910515 RABINER LR, 1976, IEEE T ACOUST SPEECH, V24, P399, DOI 10.1109/TASSP.1976.1162846 RITSMA RJ, 1967, J ACOUST SOC AM, V42, P191, DOI 10.1121/1.1910550 ROSENTHAL D, 1995, IJCAI 95 WORKSH COMP Rosenthal D. F., 1998, COMPUTATIONAL AUDITO SCHROEDE.MR, 1968, J ACOUST SOC AM, V43, P829, DOI 10.1121/1.1910902 Tolonen T, 2000, IEEE T SPEECH AUDI P, V8, P708, DOI 10.1109/89.876309 NR 43 TC 94 Z9 95 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2004 VL 43 IS 4 BP 311 EP 329 DI 10.1016/j.specom.2004.07.001 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 862FZ UT WOS:000224477300004 ER PT J AU Houben, MMJ Kohlrausch, A Hermes, DJ AF Houben, MMJ Kohlrausch, A Hermes, DJ TI Perception of the size and speed of rolling balls by sound SO SPEECH COMMUNICATION LA English DT Article DE rolling sounds; auditory event perception; auditory cues; object size discrimination; object speed discrimination ID ROUGHNESS; EVENTS; MODEL AB In everyday life, we listen to the properties of sources that generate sound, not to properties of the sound itself. But what properties of the sound source can we identify and what is it in the sound that informs us about these properties? This paper reports three experiments investigating the auditory perception of the size and the speed of wooden balls rolling over a wooden plate on the basis of recorded sounds. 
Experiment I showed that listeners are able to choose the larger ball from paired sounds. Experiment II showed that listeners are able to discriminate between the sounds of balls rolling with different speeds. However, some listeners reversed the labeling of the speed. In experiment III, the interaction between size and speed was tested. Results indicated that if the size and the speed of a rolling ball are varied simultaneously, listeners generally are still able to identify the larger ball, but the judgment of speed is influenced by the variation in size. An analysis of the spectral and temporal properties of the recorded sounds listeners may use in their decisions suggested a conflict in available cues when varying both size and speed, which is in line with the observed interaction effect. (C) 2004 Elsevier B.V. All rights reserved. C1 Philips Res Labs, Digital Signal Proc, NL-5656 AA Eindhoven, Netherlands. Eindhoven Univ Technol, Human Technol Interact Grp, NL-5600 MB Eindhoven, Netherlands. RP Kohlrausch, A (reprint author), Philips Res Labs, Digital Signal Proc, WO 02,Prof Holstlaan 4, NL-5656 AA Eindhoven, Netherlands. EM armin.kohlrausch@philips.com CR Cabe PA, 2000, J EXP PSYCHOL HUMAN, V26, P313, DOI 10.1037//0096-1523.26.1.313 Carello C, 1998, PSYCHOL SCI, V9, P211, DOI 10.1111/1467-9280.00040 Daniel P, 1997, ACUSTICA, V83, P113 FREED DJ, 1990, J ACOUST SOC AM, V87, P311, DOI 10.1121/1.399298 Gaver W., 1988, THESIS U CALIFORNIA GAVER WW, 1993, ECOL PSYCHOL, V5, P1, DOI 10.1207/s15326969eco0501_1 GREY JM, 1978, J ACOUST SOC AM, V63, P1493, DOI 10.1121/1.381843 KENDALL RA, 1996, HDB PERCEPTION COGNI, P87 Kunkler-Peck AJ, 2000, J EXP PSYCHOL HUMAN, V26, P279, DOI 10.1037//0096-1523.26.1.279 Lakatos S, 1997, PERCEPT PSYCHOPHYS, V59, P1180, DOI 10.3758/BF03214206 LI XF, 1991, J ACOUST SOC AM, V90, P3036, DOI 10.1121/1.401778 McAdams S, 1999, J ACOUST SOC AM, V105, P882, DOI 10.1121/1.426277 Moore BCJ, 1997, J AUDIO ENG SOC, V45, P224 REPP BH, 1987, J ACOUST SOC AM, V81, P1100, DOI 10.1121/1.394630 SCHORER E, 1989, ACUSTICA, V68, P183 TERHARDT E, 1974, ACUSTICA, V30, P201 WARREN WH, 1984, J EXP PSYCHOL HUMAN, V10, P704, DOI 10.1037/0096-1523.10.5.704 Zwicker E, 1999, PSYCHOACOUSTICS FACT NR 18 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2004 VL 43 IS 4 BP 331 EP 345 DI 10.1016/j.specom.2004.03.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 862FZ UT WOS:000224477300005 ER PT J AU Okuno, HG Nakadai, K Kitano, H AF Okuno, HG Nakadai, K Kitano, H TI Effects of increasing modalities in recognizing three simultaneous speeches SO SPEECH COMMUNICATION LA English DT Article AB One of the essential problems of auditory processing in noisy real-world environments is that the number of sound sources is greater than that of microphones. To model this situation, we try to separate three simultaneous speeches by two microphones. This problem is difficult because well-known techniques with microphone arrays such as the null-forming and beamforming techniques and independent component analysis (ICA) require in practice three or more microphones. This paper reports the effects of increasing modalities in recognizing three simultaneous speeches with two microphones. 
We investigate four cases: monaural (one microphone), binaural (a pair of microphones embedded in a dummy head), binaural with ICA, and binaural with vision (two dummy head microphones and two cameras). The fourth method is called "Direction-Pass Filter" (DPF), which separates sound sources originating from a specific direction given by auditory and/or visual processing. The direction of each auditory frequency component is determined by using the Head-Related Transfer Function (HRTF) of the dummy head, and thus the DPF is independent of the number of sound sources, i.e., it does not assume a particular number of sound sources. With 200 benchmarks of three simultaneous utterances of Japanese words, the quality of each separated speech is evaluated by an automatic speech recognition system. The performance of word recognition of three simultaneous speeches is improved by adding more modalities, that is, from monaural, binaural, binaural with ICA, to binaural with vision. The average 1-best and 10-best recognition rates of separated speeches attained by the Direction-Pass Filter are 60% and 81%, respectively. (C) 2004 Elsevier B.V. All rights reserved. C1 Kyoto Univ, Grad Sch Informat, Kyoto 6068501, Japan. Honda Res Inst Japan Co Ltd, Wako, Saitama 3510114, Japan. Japan Sci & Technol Agcy, Kitano Symbiot Syst Project, Tokyo 1500001, Japan. RP Okuno, HG (reprint author), Kyoto Univ, Grad Sch Informat, Kyoto 6068501, Japan. EM okuno@i.kyoto-u.ac.jp CR Bodden M., 1993, Acta Acustica, V1 Boll S, 1979, P 1979 INT C AC SPEE, P200 Bregman AS., 1990, AUDITORY SCENE ANAL CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 COOKE M, 1993, ENDEAVOUR, V17, P186, DOI 10.1016/0160-9327(93)90061-7 De Lathauwer L, 1999, PROCEEDINGS OF THE IEEE SIGNAL PROCESSING WORKSHOP ON HIGHER-ORDER STATISTICS, P116, DOI 10.1109/HOST.1999.778706 Kita K., 1990, Transactions of the Information Processing Society of Japan, V31 LOURENS T, 2000, P 1 IEEE RAS INT C H Madisetti V. K., 1997, DIGITAL SIGNAL PROCE Makeig S, 2000, IEEE T REHABIL ENG, V8, P208, DOI 10.1109/86.847818 MURATA N, 1998, P 1998 INT S NONL TH, V3, P923 NAKADAI K, 2001, P INT C SPEECH PROC, P1193 Nakadai Kazuhiro, 2001, P 17 INT JOINT C ART, V2, P1425 Nakamura S, 2002, IEEE T NEURAL NETWOR, V13, P854, DOI 10.1109/TNN.2002.1021886 NAKAMURA S, 1999, J JPN SOC MICROGRAVI, V16, P99 NAKATANI T, 1995, P 1995 INT C AC SPEE, V4, P2671 Nakatani T, 1999, SPEECH COMMUN, V27, P209, DOI 10.1016/S0167-6393(98)00079-X NAKATANI T, 1996, P 1996 IEEE INT C AC, V2, P653 NAKATANI T, 1994, PROCEEDINGS OF THE TWELFTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 AND 2, P100 OKUNO HG, 1999, SPEECH COMMUN, V27, P281 OKUNO HG, 1999, P IJCAI 99 WORKSH CO, P92 Okuno HG, 1996, PROCEEDINGS OF THE THIRTEENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE AND THE EIGHTH INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE, VOLS 1 AND 2, P1082 ROGOZAN A, 1998, SPEECH COMMUN, V26, P1 Rosenthal D. F., 1998, COMPUTATIONAL AUDITO NR 24 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
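A drastically simplified, free-field illustration of the direction-pass idea follows: each frequency bin's interaural phase difference implies an arrival-time difference, and only bins consistent with the target direction are kept. The published system maps phase through dummy-head HRTFs instead, so this sketch (with invented names and tolerances) conveys only the selection principle.

    import numpy as np

    def direction_pass_mask(left_fft, right_fft, freqs, target_itd, tol=1e-4):
        """Pass bins whose interaural phase difference implies the target ITD.

        Free-field sketch; phase wrapping at high frequencies is ignored.
        """
        ipd = np.angle(left_fft * np.conj(right_fft))
        with np.errstate(divide="ignore", invalid="ignore"):
            itd = ipd / (2.0 * np.pi * freqs)      # implied arrival-time difference
        itd[0] = 0.0                               # avoid the 0 Hz division
        return np.abs(itd - target_itd) < tol      # boolean pass/reject per bin

    fs, n = 16000, 1024
    t = np.arange(n) / fs
    src = np.sin(2 * np.pi * 500.0 * t)
    left, right = src, np.roll(src, 2)             # 2-sample interaural delay
    freqs = np.fft.rfftfreq(n, 1.0 / fs)
    mask = direction_pass_mask(np.fft.rfft(left), np.fft.rfft(right), freqs,
                               target_itd=2 / fs)
    print(mask[np.argmin(np.abs(freqs - 500.0))])  # True: the 500 Hz bin passes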
PD SEP PY 2004 VL 43 IS 4 BP 347 EP 359 DI 10.1016/j.specom.2004.03.008 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 862FZ UT WOS:000224477300006 ER PT J AU Palomaki, KJ Brown, GJ Wang, DL AF Palomaki, KJ Brown, GJ Wang, DL TI A binaural processor for missing data speech recognition in the presence of noise and small-room reverberation SO SPEECH COMMUNICATION LA English DT Article DE binaural model; speech recognition; precedence effect; missing data ID INTERAURAL TIME DIFFERENCES; LOCALIZATION; SEPARATION; MODEL; SEGREGATION; DISTORTION; ATTENTION AB In this study we describe a binaural auditory model for recognition of speech in the presence of spatially separated noise intrusions under small-room reverberation conditions. The principle underlying the model is to identify time-frequency regions which constitute reliable evidence of the speech signal. This is achieved both by determining the spatial location of the speech source, and by grouping the reliable regions according to common azimuth. Reliable time-frequency regions are passed to a 'missing data' speech recogniser, which performs decoding based on this partial description of the speech signal. In order to obtain robust estimates of spatial location in reverberant conditions, we incorporate some aspects of precedence effect processing into the auditory model. We show that the binaural auditory model improves speech recognition performance in small-room reverberation conditions in the presence of spatially separated noise, particularly for conditions in which the spatial separation is 20 degrees or larger. We also demonstrate that the binaural system outperforms a single-channel approach, notably in cases where the target speech and noise intrusion have substantial spectral overlap. (C) 2004 Elsevier B.V. All rights reserved. C1 Aalto Univ, Lab Acoust & Audio Signal Prod, FIN-02015 Espoo, Finland. Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. Univ Helsinki, Dept Psychol, Appercept & Cortical Dynam, FIN-00014 Helsinki, Finland. Ohio State Univ, Dept Comp & Informat Sci, Columbus, OH 43210 USA. Ohio State Univ, Ctr Cognit Sci, Columbus, OH 43210 USA. RP Palomaki, KJ (reprint author), Aalto Univ, Lab Acoust & Audio Signal Prod, POB 3000, FIN-02015 Espoo, Finland.
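The step of grouping evidence by azimuth before missing-data decoding can be caricatured with a frame-level interaural cross-correlation: a region is marked reliable when its best interaural lag matches the target source. The model in the paper works per frequency channel and adds precedence-effect processing; the sketch below, with invented parameters, keeps only the core test.

    import numpy as np

    def itd_reliability_mask(left, right, frame, target_lag, max_lag=16):
        """Frame-level 'reliable' flags for a missing-data recogniser."""
        flags = []
        for i in range(0, min(len(left), len(right)) - frame + 1, frame):
            l, r = left[i:i + frame], right[i:i + frame]
            lags = range(-max_lag, max_lag + 1)
            # Cross-correlate the two ears over a small range of lags.
            xcorr = [np.dot(l[max(0, -k):frame - max(0, k)],
                            r[max(0, k):frame - max(0, -k)]) for k in lags]
            flags.append(lags[int(np.argmax(xcorr))] == target_lag)
        return np.array(flags)

    rng = np.random.default_rng(1)
    src = rng.normal(size=4000)
    left = src
    right = np.concatenate([np.zeros(2), src[:-2]])   # 2-sample delay at one ear
    print(itd_reliability_mask(left, right, frame=400, target_lag=2).all())  # True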
EM kalle.palomaki@hut.fi; g.brown@dcs.shef.ac.uk; dwang@cis.ohio-state.edu CR ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599 Barker J., 2000, P ICSLP 2000, V1, P373 BARKER J, 2000, P ICSLP 00, V4, P270 Blauert J., 1997, SPATIAL HEARING PSYC Bodden M., 1993, Acta Acustica, V1 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 BROWN GJ, 2001, P IJCNN, V4, P2907 Cohen L., 1994, TIME FREQUENCY ANAL Cooke M., 1993, MODELLING AUDITORY P Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 CULLING JF, 1995, J ACOUST SOC AM, V98, P785, DOI 10.1121/1.413571 Darwin CJ, 2000, J ACOUST SOC AM, V108, P335, DOI 10.1121/1.429468 Darwin CJ, 1999, J EXP PSYCHOL HUMAN, V25, P617, DOI 10.1037/0096-1523.25.3.617 DENBIGH PN, 1992, SPEECH COMMUN, V11, P119, DOI 10.1016/0167-6393(92)90006-S Gardner B., 1994, 280 MIT MED LAB Gilkey R., 1997, BINAURAL SPATIAL HEA, P329 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T GLOTIN H, 1999, P EUROSPEECH, P2351 Hall D, 1991, MUSICAL ACOUSTICS Hawley ML, 1999, J ACOUST SOC AM, V105, P3436, DOI 10.1121/1.424670 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HUKIN RW, 1995, J ACOUST SOC AM, V98, P1380, DOI 10.1121/1.414348 Huopaniemi J., 1997, P IEEE WORKSH APPL S Kingsbury B., 1998, THESIS U CALIFORNIA Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 KOUTRAS A, 2001, P EUR C SPEECH COMM, V2, P1009 LEONARD RG, 1984, P ICASSP, V3, P111 LINDEMANN W, 1986, J ACOUST SOC AM, V80, P1623, DOI 10.1121/1.394326 Litovsky RY, 1999, J ACOUST SOC AM, V106, P1633, DOI 10.1121/1.427914 LOKKI T, 2002, THESIS HELSINKI U TE MACPHERSON EA, 1991, J AUDIO ENG SOC, V39, P604 MARTIN K, 1997, P IEEE WORKSH APPL S MOLLER H, 1992, APPL ACOUST, V36, P171, DOI 10.1016/0003-682X(92)90046-U Moore B. C. J., 1995, HDB PERCEPTION COGNI, V6, P387 Moore BCJ, 1997, INTRO PSYCHOL HEARIN NABELEK AK, 1982, J ACOUST SOC AM, V71, P1242 Okuno HG, 1999, SPEECH COMMUN, V27, P299, DOI 10.1016/S0167-6393(98)00080-6 PALOMAKI KJ, 2001, P WORKSH CONS REL AC Palomaki KJ, 2004, SPEECH COMMUN, V43, P123, DOI 10.1016/j.specom.2004.02.005 PALOMAKI KJ, 2002, P ICASSP ORL 13 17 M, P65 Patterson R.D., 1988, 2341 APU ROMAN N, 2003, P ICASSP, P149 ROMAN N, 2002, P ICASSP OR 13 17 MA, P1013 Rosenthal D. F., 1998, COMPUTATIONAL AUDITO SELTZER ML, 2001, P EUROSPEECH, V2, P1005 SHACKLETON TM, 1992, J ACOUST SOC AM, V91, P2276, DOI 10.1121/1.403663 Shamsoddini A, 2001, SPEECH COMMUN, V33, P179, DOI 10.1016/S0167-6393(00)00015-7 SPIETH W, 1954, J ACOUST SOC AM, V26, P391, DOI 10.1121/1.1907347 van der Kouwe AJW, 2001, IEEE T SPEECH AUDI P, V9, P189, DOI 10.1109/89.905993 WALLACH H, 1949, AM J PSYCHOL, V62, P315, DOI 10.2307/1418275 WATKINS AJ, 1991, J ACOUST SOC AM, V90, P2942, DOI 10.1121/1.401769 Zurek P. M., 1987, DIRECTIONAL HEARING, P85, DOI 10.1007/978-1-4612-4738-8_4 NR 54 TC 58 Z9 58 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD SEP PY 2004 VL 43 IS 4 BP 361 EP 378 DI 10.1016/j.specom.2004.03.005 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 862FZ UT WOS:000224477300007 ER PT J AU Seltzer, ML Raj, B Stern, RM AF Seltzer, ML Raj, B Stern, RM TI A Bayesian classifier for spectrographic mask estimation for missing feature speech recognition SO SPEECH COMMUNICATION LA English DT Article ID SUPPRESSION AB Missing feature methods of noise compensation for speech recognition operate by first identifying components of a spectrographic representation of speech that are considered to be corrupt. Recognition is then performed either using only the remaining reliable components, or the corrupt components are reconstructed prior to recognition. These methods require a spectrographic mask which accurately labels the reliable and corrupt regions of the spectrogram. Depending on the missing feature method applied, these masks must contain either binary or probabilistic values. Current mask estimation techniques rely on explicit estimation of the characteristics of the corrupting noise. The estimation process usually assumes that the noise is pseudo-stationary or varies slowly with time. This is a significant drawback since the missing feature methods themselves have no such restrictions. We present a new mask estimation technique that uses a Bayesian classifier to determine the reliability of spectrographic elements. The features used for classification were designed to make no assumptions about the corrupting noise signal, but rather exploit characteristics of the speech signal itself. Experiments were performed on speech corrupted by a variety of noises, using missing feature compensation methods which require binary masks and probabilistic masks. In all cases, the proposed Bayesian mask estimation method resulted in significantly better recognition accuracy than conventional mask estimation approaches. (C) 2004 Elsevier B.V. All rights reserved. C1 Microsoft Res, Redmond, WA 98052 USA. Mitsubishi Electr Corp, Res Labs, Cambridge, MA 02139 USA. Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA. Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA. RP Seltzer, ML (reprint author), Microsoft Res, 1 Microsoft Way, Redmond, WA 98052 USA. EM mseltzer@microsoft.com; bhiksha@merl.com; rms@cs.cmu.edu CR ACERO A, 1993, P EUROSPEECH 93 RAJ B, 1998, P ICSLP 98 BARKER J, 2001, P EUROSPEECH 01 BARKER J, 2000, P ICSLP 00 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Donoho D.
L., 1981, APPL TIME SERIES ANA GILLESPIE B, 2001, P ICASSP 01 Hirsch H., 1995, P ICASSP, P153 LEBLANC JP, 1998, P ICASSP 98 LEGGETTER CJ, 1994, 181 CUEDFINFENGTR Moreno P.J., 1996, THESIS CARNEGIE MELL Morgan DP, 1997, IEEE T SPEECH AUDI P, V5, P407, DOI 10.1109/89.622561 Placeway P., 1997, P DARPA SPEECH REC W Price P., 1988, P IEEE INT C AC SPEE, P651 RAJ B, 2001, P CRAC 01 RAJ B, 2000, P ICSLP 00 Raj B., 2000, THESIS CARNEGIE MELL RENEVEY P, 2001, P CRAC 01 Renevey P., 2000, THESIS ECOLE POLYTEC SELTZER ML, 2000, P ICSLP 00 Seltzer M.L., 2000, THESIS CARNEGIE MELL SINGH R, 2001, P ICASSP 01 Talkin D., 1995, SPEECH CODING SYNTHE, P495 VIZINHO A, 1999, P EUROSPEECH 99 NR 26 TC 76 Z9 77 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2004 VL 43 IS 4 BP 379 EP 393 DI 10.1016/j.specom.2004.03.006 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 862FZ UT WOS:000224477300008 ER PT J AU Pargellis, A Fosler-Lussier, E Lee, CH Potamianos, A Tsai, A AF Pargellis, A Fosler-Lussier, E Lee, CH Potamianos, A Tsai, A TI Auto-induced semantic classes SO SPEECH COMMUNICATION LA English DT Article DE computer; dialogue agent; natural language understanding; unsupervised clustering; lexical bigram context; domain independence; spoken dialogue AB Advanced computer dialogue agents contain a natural language understanding component that requires knowledge of semantic classes and concepts. These are frequently generated manually for new tasks. We avoid this time-consuming procedure by using a two-step unsupervised clustering process. First, a semantic generalizer automatically induces semantic classes using training data from well-studied applications (domains) for which large transcribed corpora of human-human dialogues are available. Candidate word pairs are grouped into similar semantic groups according to the similarity of their lexical bigram contexts. We show that the proposed algorithms for automatically inducing semantic classes perform very well for typical spoken dialogue applications. We exceed 90% precision for the first 100 cluster assignments for narrowly defined tasks such as a movie information task. For a heterogeneous task such as the text-based WSJ, a precision of only 24% is increased to 73% by including context thresholding and part-of-speech tagging. Second, we determine the degree of domain independence for each class by using concept comparison and projection metrics to rank-order semantic classes by degree of domain independence. (C) 2004 Elsevier B.V. All rights reserved. C1 Bell Labs, Dialogue Syst Res Dept, Lucent Technol, Murray Hill, NJ 07974 USA. RP Pargellis, A (reprint author), Agilix Corp, 2 Church St S,Suite 401, New Haven, CT 06519 USA. EM apargellis@aol.com; fosler@cis.ohio-state.edu; chl@ece.gatech.edu; potam@kronos.telecom.tuc.gr; augustine.tsai@verizon.com CR ARAI K, 1998, P 5 INT C SPOK LANG, V5, P2051 AUST H, 1998, P INT S SPOK DIAL SY, P27 Bellegarda J.-R., 1997, P 5 EUR C SPEECH COM, P1451 BRILL E, 1992, P WORKSH SPEECH NAT, P112, DOI 10.3115/1075527.1075553 Brown P. F., 1992, Computational Linguistics, V18 CHUCARROLL J, 1998, P 36 ANN M ASS COMP, P256 CHUCARROLL J, 1999, 6 EUR C SPEECH COMM Clarkson P., 1997, P 5 EUR C SPEECH COM Cohen J, 1960, EDUC PSYCHOL MEAS, V20, P307 DAGAN I, 1997, P 35 ANN M ACL DEVILLERS L, 1998, ICSLP 98 SYDN AUSTR Duda R.
O., 2001, PATTERN CLASSIFICATI FOSLERLUSSIER E, 2001, P IEEE INT C AC SPEE Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X ISSAR S, 1997, 5 EUR C SPEECH COMM Jelinek F, 1985, IMPACT PROCESSING TE JURAFSKY D, 1997, P IEEE WORKSH SPEECH Jurafsky Daniel, 2000, SPEECH LANGUAGE PROC LAMEL L, 1999, P IEEE INT C AC SPEE Manning C. D., 2000, FDN STAT NATURAL LAN MCCANDLESS MK, 1993, P 3 EUR C SPEECH COM, P981 NAKAGAWA S, 1998, P 1998 INT S SPOK DI, P1 Narayanan S, 2002, IEEE T SPEECH AUDI P, V10, P65, DOI 10.1109/89.985544 PAPINENI KA, 1999, 6 EUR C SPEECH COMM PARGELLIS AN, 2001, P 7 EUR C SPEECH COM PARGELLIS AN, 2000, P 6 INT C SPOK LANG, V3, P502 PARGELLIS AN, 2001, P AUT SPEECH REC UND POTAMIANOS A, 1999, DESIGN PRINCIPLES TO SENEFF S, 1998, GALAXY, V2 SIU KC, 1999, P 6 EUR C SPEECH COM, V5, P2039 NR 30 TC 13 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2004 VL 43 IS 3 BP 183 EP 203 DI 10.1016/j.specom.2004.03.002 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 854CL UT WOS:000223877800001 ER PT J AU McInnes, F Attwater, D AF McInnes, F Attwater, D TI Turn-taking and grounding in spoken telephone number transfers SO SPEECH COMMUNICATION LA English DT Article DE spoken dialogue; telephone numbers; grounding; turn-taking; prosody ID CONVERSATION; ORGANIZATION; INTONATION; PREFERENCE; DIALOGUE; PROSODY AB Two studies of spoken telephone number transfers, in which one participant communicates a number to the other, are reported. The data comprised transcripts of telephone conversations between callers and operators; for the second corpus, audio recordings were also available. In most cases the caller was giving the number to the operator, and in these dialogues a chunked echo protocol was found to be very common, with the operator repeating each chunk of one or more digits to the caller for confirmation before the next chunk was given. Errors in speaking and in recognition were corrected efficiently within this protocol. The observations support a model of dialogue in which a single utterance unit can perform multiple dialogue acts and in which discourse units can have hierarchical structure. Examination of the audio recordings showed that there was usually very little silence, and sometimes a slight overlap, between conversational turns. Various prosodic phenomena were noted as contributing to the turn-taking and grounding processes. Implications for automated dialogue systems are discussed. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ Edinburgh, Sch Engn & Elect, Ctr Commun Interface Res, Edinburgh EH9 3JL, Midlothian, Scotland. BPexact Technol, Ipswich IP5 3RE, Suffolk, England. RP McInnes, F (reprint author), Univ Edinburgh, Sch Engn & Elect, Ctr Commun Interface Res, Mayfield Rd, Edinburgh EH9 3JL, Midlothian, Scotland.
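The chunked echo protocol observed in these transfers is easy to render as a toy exchange: the operator echoes each chunk of digits and the caller confirms or corrects it before giving the next chunk. The chunk size, turn phrasing and the sample number below are invented for illustration, not taken from the corpora.

    # Toy rendering of the chunked echo protocol with within-chunk repair.
    def transfer_number(digits, chunk_size=3, mishear=None):
        """mishear: optional dict mapping a chunk to what the operator hears."""
        mishear = mishear or {}
        received = []
        for i in range(0, len(digits), chunk_size):
            chunk = digits[i:i + chunk_size]
            while True:
                heard = mishear.pop(chunk, chunk)        # first attempt may be wrong
                print(f"Caller:   {' '.join(chunk)}")
                print(f"Operator: {' '.join(heard)}?")   # echo for grounding
                if heard == chunk:
                    print("Caller:   yes")               # confirmation closes the chunk
                    received.append(heard)
                    break
                print("Caller:   no,", " ".join(chunk))  # repair within the protocol
        return "".join(received)

    print(transfer_number("01316505432", mishear={"165": "175"}))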
EM fergus.mcinnes@ed.ac.uk; david@eiginc.com CR Baumann S., 2001, P EUR 2001, P557 BULL M, 1998, P INT C SPOK LANG PR, P1179 Clark H., 1987, LANG COGNITIVE PROC, V2, P19, DOI 10.1080/01690968708406350 CLARK HH, 1989, COGNITIVE SCI, V13, P259, DOI 10.1207/s15516709cog1302_7 COHEN PR, 1994, SPEECH COMMUN, V15, P265, DOI 10.1016/0167-6393(94)90077-9 Ferrer L., 2002, P INT C SPOK LANG PR, P2061 FRANKISH C, 1995, APPL COGNITIVE PSYCH, V9, pS5, DOI 10.1002/acp.2350090703 Hopper R., 1992, TELEPHONE CONVERSATI KOMPE R, 1994, SPEECH COMMUN, V15, P155, DOI 10.1016/0167-6393(94)90049-3 KOWTKO J, 1992, HCRCRP31 U ED Lickley R. J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607315 MIXDORFF H, 2002, P SPEECH PROS 2002 A Noth E, 2002, SPEECH COMMUN, V36, P45, DOI 10.1016/S0167-6393(01)00025-5 POESIO M, 1998, P 20 WORKSH FORM SEM, P207 Rahim M, 2001, SPEECH COMMUN, V34, P195, DOI 10.1016/S0167-6393(00)00054-6 Rutter D. R., 1987, COMMUNICATING TELEPH RUTTER DR, 1989, CONVERSATION INTERDI SACKS H, 1974, LANGUAGE, V50, P696, DOI 10.2307/412243 SCHEGLOFF EA, 1977, LANGUAGE, V53, P361, DOI 10.2307/413107 THOMPSON HS, 1996, P INT S SPOK DIAL PH, P49 Traum D., 1992, COMPUT INTELL, V8, P575, DOI 10.1111/J.1467-8640.1992.TB00380.X TRAUM David, 1992, P 2 INT C SPOK LANG, P137 WATERWORTH JA, 1983, APPL ERGON, V14, P39, DOI 10.1016/0003-6870(83)90219-3 WIGHTMAN CW, 2002, P SPEECH PROS 2002 A NR 24 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2004 VL 43 IS 3 BP 205 EP 223 DI 10.1016/j.specom.2004.04.001 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 854CL UT WOS:000223877800002 ER PT J AU Pruthi, T Espy-Wilson, CY AF Pruthi, T Espy-Wilson, CY TI Acoustic parameters for automatic detection of nasal manner SO SPEECH COMMUNICATION LA English DT Article DE nasal; nasal manner; acoustic parameters; acoustic correlates; automatic detection; automatic speech recognition ID CONTINUOUS-SPEECH; RECOGNITION; CONSONANTS; SYSTEM; ENGLISH; VOWELS AB Of all the sounds in any language, nasals are the only class of sounds with dominant speech output from the nasal cavity as opposed to the oral cavity. This gives nasals some special properties including presence of zeros in the spectrum, concentration of energy at lower frequencies, higher formant density, higher losses, and stability. In this paper we propose acoustic correlates for the linguistic feature nasal. In particular, we focus on the development of Acoustic Parameters (APs) which can be extracted automatically and reliably in a speaker-independent way. These APs were tested in a classification experiment between nasals and semivowels, the two classes of sounds which together form the class of sonorant consonants. Using the proposed APs with a support vector machine-based classifier we were able to obtain classification accuracies of 89.53%, 95.80% and 87.82% for prevocalic, postvocalic and intervocalic sonorant consonants, respectively, on the TIMIT database. As additional proof of the strength of these parameters, we compared the performance of a Hidden Markov Model (HMM) based system that included the APs for nasals as part of the front-end, with an HMM system that did not. In this digit recognition experiment, we were able to obtain a 60% reduction in error rate on the TI46 database. (C) 2004 Elsevier B.V. All rights reserved.
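The classification experiment pairs the proposed acoustic parameters with a support vector machine. The sketch below shows the shape of such an experiment using scikit-learn; the feature values are random stand-ins, not the APs of the paper, and the printed accuracy has no relation to the reported figures.

    import numpy as np
    from sklearn.svm import SVC

    rng = np.random.default_rng(0)
    n, d = 400, 6                                   # 6 stand-in acoustic parameters
    nasals = rng.normal(0.0, 1.0, size=(n, d))
    semivowels = rng.normal(1.0, 1.0, size=(n, d))  # shifted class, for separability
    X = np.vstack([nasals, semivowels])
    y = np.array([0] * n + [1] * n)                 # 0 = nasal, 1 = semivowel

    split = rng.permutation(2 * n)
    train, test = split[:600], split[600:]
    clf = SVC(kernel="rbf").fit(X[train], y[train])
    print("accuracy:", (clf.predict(X[test]) == y[test]).mean())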
C1 Univ Maryland, Dept Elect & Comp Engn, College Pk, MD 20742 USA. RP Pruthi, T (reprint author), Univ Maryland, Dept Elect & Comp Engn, College Pk, MD 20742 USA. EM tpruthi@glue.umd.edu; espy@glue.umd.edu CR BITAR N, 1997, THESIS BOSTON U BITAR NN, 1997, P EUR C SPEECH COMM, P1239 Burges CJC, 1998, DATA MIN KNOWL DISC, V2, P121, DOI 10.1023/A:1009715923555 CHEN MY, 1995, J ACOUST SOC AM, V98, P2443, DOI 10.1121/1.414399 Chen MY, 1997, J ACOUST SOC AM, V102, P2360, DOI 10.1121/1.419620 Chen M.Y., 2000, P 6 INT C SPOK LANG, V4, P636 DESHMUKH O, 2002, P ICASSP 2002, P593 DICKSON DR, 1962, J SPEECH HEAR RES, V5, P103 DIXON NR, 1976, IEEE T ACOUST SPEECH, V24, P137, DOI 10.1109/TASSP.1976.1162793 ESPYWILSON CY, 1992, J ACOUST SOC AM, V92, P736, DOI 10.1121/1.403998 Fant G., 1960, ACOUSTIC THEORY SPEE Flanagan J. L., 1965, SPEECH ANAL SYNTHESI FUJIMURA O, 1962, J ACOUST SOC AM, V34, P1865, DOI 10.1121/1.1909142 Fujishima A., 2000, J PHOTOCH PHOTOBIO C, V1, P1, DOI 10.1016/S1389-5567(00)00002-2 GLASS J, 1984, THESIS MIT CAMBRIDGE Glass J. R., 1985, P ICASSP, P1569 HESS WJ, 1976, IEEE T ACOUST SPEECH, V24, P14, DOI 10.1109/TASSP.1976.1162771 HOUSE AS, 1957, J SPEECH HEAR DISORD, V22, P190 Joachims T., 1999, ADV KERNEL METHODS S Juneja A., 2002, P 9 INT C NEUR INF P, V2, P726, DOI 10.1109/ICONIP.2002.1198153 JUNEJA A, 2003, P INT JOINT C NEUR N LIBERMAN M, 1993, T146 WORD Liu SA, 1996, J ACOUST SOC AM, V100, P3417, DOI 10.1121/1.416983 MERMELSTEIN P, 1977, J ACOUST SOC AM, V61, P581, DOI 10.1121/1.381301 NAKATA K, 1959, J ACOUST SOC AM, V31, P661, DOI 10.1121/1.1907770 *NTIS, 1990, TIMIT AC PHON CONT S PRUTHI T, 2003, P 15 INT C PHON SCI Salomon A, 2004, J ACOUST SOC AM, V115, P1296, DOI 10.1121/1.1646400 Stevens K.N., 1998, ACOUSTIC PHONETICS Vapnik V., 1995, NATURE STAT LEARNING WEINSTEIN CJ, 1975, IEEE T ACOUST SPEECH, VAS23, P54, DOI 10.1109/TASSP.1975.1162651 Young S., 1995, HTK BOOK NR 32 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2004 VL 43 IS 3 BP 225 EP 239 DI 10.1016/j.specom.2004.06.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 854CL UT WOS:000223877800003 ER PT J AU Mahe, G Gilloire, A Gros, L AF Mahe, G Gilloire, A Gros, L TI Correction of the voice timbre distortions in telephone networks: method and evaluation SO SPEECH COMMUNICATION LA English DT Article DE spectral equalization; speaker classification; timbre and noise perception ID SPEECH AB In a telephone link, the voice timbre is impaired by spectral distortions generated by the analog parts of the link. Our purpose is to restore a timbre as close as possible to that of the original voice of the speaker, using a blind equalizer centralized in the network, which compensates for the spectral distortions. We propose a spectral equalization algorithm, which consists in matching the long-term spectrum of the processed signal to a reference spectrum within a limited frequency bandwidth (200-3150 Hz). Subjective evaluations show a satisfactory restoration of the timbre of the speakers, within the limits of the chosen equalization band. The A-law quantization of the output samples of the equalizer induces, however, a disturbing noise at the reception end.
A subjective evaluation shows that speakers' voices with corrected timbre, even with quantization noise, are preferred to the same voices at the output of a link without timbre correction (and without noise). In order to make the reference spectrum more appropriate to the various speakers' voices, we classify them according to their long-term spectra and use a specific reference spectrum for each class. This leads to a decrease of the spectral distortion induced by the equalizer, which, as a subjective test shows, is perceived as a significant improvement of the timbre correction. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ Paris 05, V, CRIP5, InfoCom, F-75270 Paris 06, France. France Telecom, R&D, DIH, IPS, F-22307 Lannion, France. France Telecom, R&D, DIH, EQS, F-22307 Lannion, France. RP Mahe, G (reprint author), Univ Paris 05, V, CRIP5, InfoCom, 45 Rue St Peres, F-75270 Paris 06, France. EM mahe@math-info.univ-paris5.fr; andre.gilloire@rd.francetelecom.com; laetitia.gros@rd.francetelecom.com CR BOITE R, 2000, TRAITEMENT PAROLE PR, P99 BONNET C, 1986, MANUEL PRATIQUE PSYC, P136 BOWKER DO, 1993, Patent No. 5333195 DEJACO AP, 1997, Patent No. 5915235 FAUCON G, 1993, P GRETSI 93 JUAN PIN, P587 HO HS, 1993, Patent No. 5471527 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 LEBART L, 2000, STAT EXPLORATOIRE MU, P251 LEBART L, 2000, STAT EXPLORATOIRE MU, P145 MAHE G, 2002, Patent No. 0215618 MAHE G, 2001, Patent No. 0104194 MAHE G, 2002, P IEEE WORKSH SPEECH, P56 MAHE G, 2003, P EUR 2003 GEN SUISS, P1381 MAHE G, 2001, P EUR, P1867 MAKHOUL J, 1979, IEEE T ACOUST SPEECH, V27, P63, DOI 10.1109/TASSP.1979.1163199 MOKBEL C, 1993, P EUROSPEECH, P1247 Mokbel C, 1996, SPEECH COMMUN, V19, P185, DOI 10.1016/0167-6393(96)00032-5 Tukey J., 1953, PROBLEM MULTIPLE COM 1994, TP3054 NR 19 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2004 VL 43 IS 3 BP 241 EP 266 DI 10.1016/j.specom.2004.06.002 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 854CL UT WOS:000223877800004 ER PT J AU Kaynak, MN Zhi, Q Cheok, AD Sengupta, K Jian, Z Chung, KC AF Kaynak, MN Zhi, Q Cheok, AD Sengupta, K Jian, Z Chung, KC TI Lip geometric features for human-computer interaction using bimodal speech recognition: comparison and analysis SO SPEECH COMMUNICATION LA English DT Article ID VOWEL RECOGNITION; MODELS; FUSION AB Bimodal speech recognition is a novel extension of acoustic speech recognition for which both acoustic and visual speech information are used to improve the recognition accuracy in noisy environments. Although various bimodal speech systems have been developed, a rigorous and detailed comparison of the possible geometric visual features from speakers' faces has not yet been given in previous work. Thus, in this paper, the geometric visual features are compared and analyzed rigorously for their importance in bimodal speech recognition. The relevant information of each possible single visual feature is used to determine the best combination of geometric visual features for both visual-only and bimodal speech recognition.
From the geometric visual features analyzed, lip vertical aperture is the most relevant, and the set formed by the vertical and horizontal lip apertures and the first-order derivative of the lip corner angle gives the best results among the reduced sets of geometric features that were analyzed. Also, in this paper, the effect of the modelling parameters of hidden Markov models (HMM) on each single geometric lip feature's recognition accuracy is analyzed. Finally, the accuracies of acoustic-only, visual-only, and bimodal speech recognition methods are experimentally determined and compared using the optimized HMMs and geometric visual features. Compared to acoustic and visual-only speech recognition, the bimodal speech recognition scheme has a much improved recognition accuracy using the geometric visual features, especially in the presence of noise. The results obtained showed that a set of as few as three labial geometric features are sufficient to improve the recognition rate by as much as 20% (from 62%, with acoustic-only information, to 82%, with audio-visual information at a signal-to-noise ratio (SNR) of 0 dB). (C) 2004 Elsevier B.V. All rights reserved. C1 Arizona State Univ, Dept Elect Engn, Tempe, AZ 85287 USA. Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 117576, Singapore. RP Kaynak, MN (reprint author), Arizona State Univ, Dept Elect Engn, Tempe, AZ 85287 USA. EM mustafa.kaynak@asu.edu; adriancheok@nus.edu.sg CR ADJOUDANI A, 1996, NATO ASI SER, P461 BASU S, 1999, P IEEE 3 WORKSH MULT, P475 Becchetti C., 1999, SPEECH RECOGNITION T Chan M., 1998, P IEEE 2 WORKSH MULT, P65 Cosi P., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727708 Dupont S, 2000, IEEE T MULTIMEDIA, V2, P141, DOI 10.1109/6046.865479 GOLDSCHEN AJ, 1993, THESIS G WASHINGTON GURBUZ S, 2001, P INT C AC SPEECH SI, V1, P177 KAYNAK MN, 2000, 2 JSPS NUS SEM INT E, P220 LINCOLN M, 2000, IEE C VIS BIOM, V5, P1 Luettin J., 1996, P 4 INT C SPOK LANG, V1, P58, DOI 10.1109/ICSLP.1996.607024 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 MOVELLAN J, 1995, P NIPS94, P851 Petajan E., 1988, P HUM FACT COMP SYST, P19, DOI 10.1145/57167.57170 PETAJAN E, 1984, P IEEE GLOB TEL C, V1, P205 POTAMIANOS G, 1997, P EUR TUT WORKSH AUD Potamianos G, 1998, INT CONF ACOUST SPEE, P3733, DOI 10.1109/ICASSP.1998.679695 Rabiner L, 1993, FUNDAMENTALS SPEECH RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Rogozan A, 1998, SPEECH COMMUN, V26, P149, DOI 10.1016/S0167-6393(98)00056-9 SCHALKWYK J, 1995, CSLU TOOLKIT AUTOMAT SILCBEE P, 1993, BIOMEDICAL SCI INSTR, V20, P415 Stork D.G., 1992, P INT JOINT C NEUR N, V2, P289, DOI 10.1109/IJCNN.1992.226994 Teissier P, 1999, IEEE T SPEECH AUDI P, V7, P629, DOI 10.1109/89.799688 Vatikiotis-Bateson E, 1998, PERCEPT PSYCHOPHYS, V60, P926, DOI 10.3758/BF03211929 Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X Yu KR, 1999, SIGNAL PROCESS, V77, P195, DOI 10.1016/S0165-1684(99)00032-8 YUHAS BP, 1990, P IEEE, V78, P1658, DOI 10.1109/5.58349 ZHANG J, 2001, P IEEE INT C FUZZ SY, V3, P1359 NR 29 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
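The winning feature set above (vertical and horizontal lip apertures plus the first-order derivative of the lip corner angle) can be computed from tracked lip landmarks in a few lines. The landmark layout, the angle definition and the frame rate below are illustrative assumptions, not the paper's extraction procedure.

    import numpy as np

    def lip_features(landmarks, fps):
        """Per-frame geometric lip features from four tracked points.

        landmarks: (T, 4, 2) array of [top, bottom, left corner, right corner]
        (x, y) positions; this point layout is an assumption for illustration.
        """
        top, bottom, left, right = (landmarks[:, i, :] for i in range(4))
        v_aperture = np.linalg.norm(top - bottom, axis=1)     # vertical aperture
        h_aperture = np.linalg.norm(right - left, axis=1)     # horizontal aperture
        corner_vec = right - left
        angle = np.arctan2(corner_vec[:, 1], corner_vec[:, 0])
        d_angle = np.gradient(angle) * fps                    # first-order derivative, rad/s
        return np.column_stack([v_aperture, h_aperture, d_angle])

    T = 5
    pts = np.tile(np.array([[0.0, 1.0], [0.0, -1.0], [-2.0, 0.0], [2.0, 0.0]]),
                  (T, 1, 1))
    print(lip_features(pts, fps=25.0))   # constant mouth: apertures 2 and 4, zero derivative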
PD JUN PY 2004 VL 43 IS 1-2 BP 1 EP 16 DI 10.1016/j.specom.2004.01.003 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 832LL UT WOS:000222268900001 ER PT J AU Yoon, SW Kang, HG Park, YC Youn, DH AF Yoon, SW Kang, HG Park, YC Youn, DH TI An efficient transcoding algorithm for G.723.1 and G.729A speech coders: interoperability between mobile and IP network SO SPEECH COMMUNICATION LA English DT Article DE speech coding; transcoding; tandem; G.723.1; G.729; G.729A; CELP; LSP conversion; pitch conversion; fast codebook search AB In this paper, an efficient transcoding algorithm for G.723.1 and G.729A speech coders is proposed. Transcoding is performed in four processing steps: LSP conversion, pitch interval conversion, fast adaptive-codebook search, and fast fixed-codebook search. To keep distortion to a minimum, quality-sensitive parameters such as the adaptive- and fixed-codebook parameters are re-estimated from synthesized target signals. To reduce overall complexity, other parameters are converted directly at the parameter level without running through the decoding process. Objective and subjective preference tests verify that the proposed transcoding algorithm has comparable quality to the classical encoder-decoder tandem approach. To compare the complexity of the algorithms, we implement them on the TI TMS320C6201 DSP chip. As a result, the proposed algorithm achieves a 26-38% reduction in overall complexity with a shorter processing delay. (C) 2004 Elsevier B.V. All rights reserved. C1 Yonsei Univ, Dept Elect & Elect Engn, MCSP LAB, Seoul 120749, South Korea. RP Yoon, SW (reprint author), Yonsei Univ, Dept Elect & Elect Engn, MCSP LAB, 134 Shinchon Dong, Seoul 120749, South Korea. EM yocello@mcsp.yonsei.ac.kr RI Kang, Hong-Goo/G-8545-2012 CR Epperson J.F., 2002, INTRO NUMERICAL METH Hersent O., 2000, IP TELEPHONY PACKET Jung Sung-Kyo, 2001, P EUR SEPT, P2017 Kang H. G., 2000, P IEEE WORKSH SPEECH, P78 KITAWAKI N, 1998, IEEE J SEL AREA COMM, V7, P242 NETO AFC, 1999, IEEE P INT C AC SPEE, P177 Rabiner L.R., 1978, DIGITAL PROCESSING S SALAMI R, 1997, IEEE COMMUN MAG, P53 SALAMI R, 1997, IEEE P INT C AC SPEE, V2, P775 *TEX INSTR, 1996, TMS320C62X67X CPU IN *TEX INSTR, 1998, TMS320C6X C SOURC DE *TIA EIA, 1996, PN3467 TIAEIA Yoon S. W., 2001, P EUR SEPT, P2499 NR 13 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2004 VL 43 IS 1-2 BP 17 EP 31 DI 10.1016/j.specom.2004.01.004 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 832LL UT WOS:000222268900002 ER PT J AU Dybkjaer, L Bernsen, NO Minker, W AF Dybkjaer, L Bernsen, NO Minker, W TI Evaluation and usability of multimodal spoken language dialogue systems SO SPEECH COMMUNICATION LA English DT Article DE evaluation projects; state-of-the-art; generalisations; guidelines; theory ID RECOGNITION; EXPRESSION; PROJECT AB With the technical advances and market growth in the field, the issues of evaluation and usability of spoken language dialogue systems, unimodal as well as multimodal, are as crucial as ever. This paper discusses those issues by reviewing a series of European and US projects which have produced major results on evaluation and usability.
Whereas significant progress has been made on the evaluation and usability of unimodal spoken language dialogue systems, the emergence of multimodal, mobile, and domain-oriented systems, among others, continues to pose entirely new challenges to research in evaluation and usability. (C) 2004 Elsevier B.V. All rights reserved. C1 Nat Interact Syst Lab, DK-5230 Odense M, Denmark. Univ Ulm, Dept Informat Technol, D-89081 Ulm, Germany. RP Minker, W (reprint author), Nat Interact Syst Lab, Sci Pk 10, DK-5230 Odense M, Denmark. EM laila@nis.sdu.dk; nob@nis.sdu.dk; wolfgang.minker@e-technik.uni-ulm.de CR ALLEN JF, 1995, J EXP THEOR ARTIF IN, V7, P7, DOI 10.1080/09528139508953799 ALMEIDA L, 2002, P INT CLASS WORKSH N, P1 AMSHAW L, 1990, 7 SLS BBN SYST TECHN BAEKGAARD A, 1995, P ESCA WORKSH SPOK D, P89 Batliner A, 2000, P ISCA WORKSH SPEECH, P195 Beringer N, 2002, P LREC WORKSH MULT R, P77 BERNSEN NO, 2003, LECT NOTES ARTIFICAL, P378 BERNSEN NO, 1994, INTERACT COMPUT, V6, P347, DOI 10.1016/0953-5438(94)90008-6 Bernsen NO, 1997, SPEECH COMMUN, V23, P181, DOI 10.1016/S0167-6393(97)00046-0 BERNSEN NO, 1999, P ESCA WORKSH INT DI, P105 Bernsen N.O., 1998, DESIGNING INTERACTIV BERNSEN NO, 2000, P 2 INT C LANG RES E, P183 Bernsen NO, 2002, TEXT SPEECH LANG TEC, V19, P93 BICKMORE T, 2002, P INT CLASS WORKSH N, P15 BOROS M, 1996, P ICSLP, V2, P1009, DOI 10.1109/ICSLP.1996.607774 Bossemeyer R. W. Jr., 1991, Speech Technology, V5 BRUSILOVSKY P, 2003, SPRINGER LECT NOTES, V2702 Buhler D., 2002, P ECAI WORKSH ART IN, P66 BUISINE S, 2003, P 9 IFIP TC13 INT C Cassell J., 2000, EMBODIED CONVERSATIO Cohen I, 2003, COMPUT VIS IMAGE UND, V91, P160, DOI 10.1016/S1077-3142(03)00081-X Cohen P.R., 1997, 5 ACM INT C MULT, P31 DYBKJAER L, 1998, P 1 INT C LANG RES E, P185 DYBKJAER L, 2003, D71 NICE U SO DENM Dybkjaer L, 1998, INT J HUM-COMPUT ST, V48, P605, DOI 10.1006/ijhc.1997.0183 Dybkjaer L., 2000, NATURAL LANGUAGE ENG, V6, P243, DOI 10.1017/S1351324900002461 Ekman P, 1975, UNMASKING FACE GUIDE Ferguson G., 1998, P 15 NAT C ART INT A, P567 FRASER N, 1997, P EUR C SPEECH COMM, P1907 GARTNER U, 2001, P INT DRIV S HUM FAC Gibbon D, 1997, HDB STANDARDS RESOUR GILBERT N, 1999, GUIDELINES ADV SPOKE GLASBY CJ, 2000, FAUNA AUSTR A, V4, P1 Grice H. P., 1975, SYNTAX SEMANTICS, P41, DOI DOI 10.1017/S0022226700005296 Gustafson J., 1999, P EUR 99 BUD, P1151 Gustafson J., 2000, P INT C SPOK LANG PR, V2, P134 HANDRIEDER G, 1998, P INT C SPOK LANG PR, P503 HIRSCHBERG J, 2001, P 2 SIGDIAL WORKSH D, P72 HJALMARSON A, 2002, THESIS KTH STOCKHOLM KARLSSON I, 1999, D23 DISC King M., 1996, EAGEWGPR2 KOMATANI K, 2003, P EUR C SPEECH COMM, P745 LARSEN LB, 2003, P EUR C SPEECH COMM, P1945 LEAVITT N, 2003, IEEE COMPUT, V36, P13 MARIANI J, 1999, P DARPA BROADC NEWS, P237 MINKER W, 2002, P ISCA WORKSH MULT M OVIATT S, 2001, MULTIMODAL INTERFACE, P203 OVIATT S, 1997, MULTIMODAL INTERACTI, P93 PALLETT D, 1994, P ARPA WORKSH SPOK L, P5 Peckham J., 1993, P 3 EUR C SPEECH COM, P33 Polifroni J., 2000, P 2 INT C LANG RES E, P725 Roth SF, 1997, HUM-COMPUT INTERACT, V12, P131, DOI 10.1207/s15327051hci1201&2_5 SANDERS G, 2001, P INT C SPOK LANG PR, P277 Seneff S., 1998, P ICSLP, P931 Simpson A., 1993, P 3 EUR C SPEECH COM, P1423 STURM J, 2002, P ISCA WORKSH MULT M STURM J, 1999, P ESCA WORKSH INT DI, P1 TEMEM JN, 1999, WORLD C RAILW RES TO WAHLSTER W, 2001, P EUR C SPEECH COMM, P1547 WAHLSTER W, 1993, MACHINE TRANSLATION, V4 WALKER M, 2000, P 2 INT C LANG RES E, P735 Walker M.
A., 2002, P INT C SPOK LANG PR, P269 Walker Marilyn, 2000, NATURAL LANGUAGE ENG, V6 Walker Marilyn A, 1997, P 35 ANN M ASS COMP, P271 Young SJ, 1997, COMPUT SPEECH LANG, V11, P73, DOI 10.1006/csla.1996.0023 NR 65 TC 42 Z9 42 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2004 VL 43 IS 1-2 BP 33 EP 54 DI 10.1016/j.specom.2004.02.001 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 832LL UT WOS:000222268900003 ER PT J AU Davidson, N McInnes, F Jack, MA AF Davidson, N McInnes, F Jack, MA TI Usability of dialogue design strategies for automated surname capture SO SPEECH COMMUNICATION LA English DT Article DE usability; name recognition; spelling; dialogue design ID RECOGNITION; SYSTEMS; ISSUES AB Surname capture via automatic speech recognition over the telephone has many commercial applications, including automated directory assistance and travel reservation services. This paper presents a usability evaluation of three different dialogue designs for automated surname capture, within the context of a flight reservation service. The three designs explored were: a Speak Only strategy, in which callers simply say the surname; a One Stage Speak and Spell strategy in which callers speak and spell the surname in a single utterance; and a Two Stage Speak and Spell strategy in which callers speak and spell the surname in two separate dialogue stages. The methodology employed in the research provides both quantitative user attitude data and performance results for each of the strategies, based on an empirical study with a cohort of 95 participants. The results show a clear distinction between strategies. User attitude towards the dialogues that involve both speaking and spelling the name is high. User attitude towards the Speak Only strategy is significantly less positive. Task completion rates are also significantly higher in the two strategies that involve spelling the name, at around 80% compared to just over 50% in the Speak Only strategy. The data underline the importance of user testing, demonstrating the value of the evaluation methodology used, and provide encouraging results for the strategies that involve both speaking and spelling the name. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ Edinburgh, Ctr Commun Interface Res, Edinburgh EH9 3JL, Midlothian, Scotland. RP Davidson, N (reprint author), Univ Edinburgh, Ctr Commun Interface Res, Kings Bldg, Edinburgh EH9 3JL, Midlothian, Scotland. 
EM Nancie.Davidson@ccir.ed.ac.uk; Fergus.Mcinnes@ccir.ed.ac.uk; Mervyn.Jack@ccir.ed.ac.uk CR Attwater DJ, 1996, BT TECHNOL J, V14, P177 BAUER JG, 1999, P EUROSPEECH, P263 BECHET F, 2001, P IEEE ASRU WORKSH A CHUNG G, 2002, P ICSLP, P2053 Chung G., 2003, P HLT NAACL EDM CAN, P32 CORDOBA R, 2001, P EUROSPEECH, P1279 DUTTON RT, 1993, P EUROSPEECH 93, P1335 Eskenazi M., 1993, P EUR 93 BERL, P501 GALESCU L, 2002, P ICSLP, P109 GAO YQ, 2000, P ICASSP, P333 HILD H, 1996, P ICSLP, V1, P346, DOI 10.1109/ICSLP.1996.607125 JOUVET D, 1993, P EUROSPEECH, P2081 Jouvet D., 1999, P EUROSPEECH 99 BUD, P283 KAMM CA, 1995, SPEECH COMMUN, V17, P303, DOI 10.1016/0167-6393(95)00023-H KASPAR B, 1995, P EUROSPEECH, P1161 Laan GPM, 1997, SPEECH COMMUN, V22, P43, DOI 10.1016/S0167-6393(97)00012-5 Lamel L, 2000, SPEECH COMMUN, V31, P339, DOI 10.1016/S0167-6393(99)00067-9 LEHTINEN G, 2000, P VOTS 2000 WORKSH B LENNIG M, 1995, SPEECH COMMUN, V31, P227 Likert R., 1932, ARCH PSYCHOL, V140 Llitjos Ariadna Font, 2001, P EUROSPEECH, P1919 MCINNES FR, 1999, P EUROSPEECH99, P831 Meyer M., 1997, P EUROSPEECH, V3, P1579 MITCHELL CD, 1999, P IEEE INT C AC SPEE, V2, P597 NEUBERT F, 1998, P IEEE IVTTA WORKSH San-Segundo R, 2002, SPEECH COMMUN, V38, P287, DOI 10.1016/S0167-6393(01)00069-3 Saraclar M, 2000, COMPUT SPEECH LANG, V14, P137, DOI 10.1006/csla.2000.0140 Schmidt M. S., 1994, Proceedings of Language Engineering Convention Schramm H, 2000, SPEECH COMMUN, V31, P329, DOI 10.1016/S0167-6393(99)00066-7 SEIDE F, 1997, P EUROSPEECH, V3, P1327 SETHY A, 2002, P ISCA PRON MOD LEX WEINTRAUB M, 1996, P INT C SPOK LANG PR, P16 NR 32 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2004 VL 43 IS 1-2 BP 55 EP 70 DI 10.1016/j.specom.2004.02.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 832LL UT WOS:000222268900004 ER PT J AU Wu, CH Chen, YJ AF Wu, CH Chen, YJ TI Recovery from false rejection using statistical partial pattern trees for sentence verification SO SPEECH COMMUNICATION LA English DT Article DE false rejection error; error recovery; partial pattern tree; sentence verification ID UTTERANCE VERIFICATION; SPEECH RECOGNITION; INFORMATION; LANGUAGE; MODELS AB In conversational speech recognition, recognizers are generally equipped with a keyword spotting capability to accommodate a variety of speaking styles. In addition, language model incorporation generally improves the recognition performance. In conversational speech keyword spotting, there are two types of errors, false alarm and false rejection. These two types of errors are not modeled in language models and therefore offset the contribution of the language models. This paper describes a partial pattern tree (PPT) to model the partial grammatical rules of sentences resulting from recognition errors and ungrammatical sentences. Using the PPT and a proposed sentence-scoring algorithm, the false rejection errors can be recovered first. A sentence verification approach is then employed to re-rank and verify the recovered sentence hypotheses to give the results. A PPT merging algorithm is also proposed to reduce the number of partial patterns with similar syntactic structure and thus reduce the PPT tree size. An automatic call manager and an airline query system are implemented to assess the performance. 
With the proposed approach, the keyword error rates for these two systems were 10.40% and 14.67%, respectively. The proposed method was compared with conventional approaches and showed superior performance. (C) 2004 Elsevier B.V. All rights reserved. C1 Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. Ind Technol Res Inst, Comp & Commun Labs, Ctr Adv Technol, Hsinchu, Taiwan. RP Wu, CH (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, 1 Ta Shueh Rd, Tainan 70101, Taiwan. EM chwu@csie.ncku.edu.tw RI Wu, Chung-Hsien/E-7970-2013 CR BELLEGARDA JR, 1999, P INT C AC SPEECH SI, V2, P717 Benitez MC, 2000, SPEECH COMMUN, V32, P79, DOI 10.1016/S0167-6393(00)00025-X BILLA J, 1999, P ICASSP 99, V1, P41 Chen SF, 2000, IEEE T SPEECH AUDI P, V8, P37, DOI 10.1109/89.817452 DellaPietra S, 1997, IEEE T PATTERN ANAL, V19, P380, DOI 10.1109/34.588021 Fukunaga K., 1972, INTRO STAT PATTERN R GORIN AL, 1996, P ICSLP 96, V2, P1001, DOI 10.1109/ICSLP.1996.607772 Guyon I., 1995, Proceedings of the Third International Conference on Document Analysis and Recognition, DOI 10.1109/ICDAR.1995.599034 Hamaker J. S., 1999, Proceedings 1999 International Conference on Information Intelligence and Systems (Cat. No.PR00446), DOI 10.1109/ICIIS.1999.810351 HUANG TL, 1994, IEEE PARALL DISTRIB, V2, P3 IYER I, 1997, P EUR C SPEECH COMM, V4, P1975 Iyer RM, 1999, IEEE T SPEECH AUDI P, V7, P30, DOI 10.1109/89.736328 Jelinek F., 1998, STAT METHODS SPEECH Kamppari S O, 2000, P ICASSP, V3, P1799 KELLNER A, 1998, P 1998 INT C AC SPEE, V1, P185, DOI 10.1109/ICASSP.1998.674398 KHUDANPUR S, 1999, P IEEE INT C AC SPEE, V1, P553 KREMER SC, 1997, P INT C NEUR NETW, V3, P1424, DOI 10.1109/ICNN.1997.614003 LEEUWEN GFV, 1999, IEEE AFRICON, V1, P195 Lleida E, 2000, IEEE T SPEECH AUDI P, V8, P126, DOI 10.1109/89.824697 MA KW, 1998, P INT C AC SPEECH SI, V2, P693, DOI 10.1109/ICASSP.1998.675359 MOREAU N, 2000, P INT C AC SPEECH SI, V3, P1807 NIESLER T, 1999, COMPUT SPEECH LANG, V21, P1 OBOYLE R, 1996, P INT C AC SPEECH SI, V1, P168 Rahim MG, 1997, IEEE T SPEECH AUDI P, V5, P266, DOI 10.1109/89.568733 Ron D., 1994, ADV NEURAL INFORMATI, V6, P176 ROSE RC, 1998, P 1998 ICASSP SEATTL, V1, P237, DOI 10.1109/ICASSP.1998.674411 Sarukkai RR, 1997, IEEE T SPEECH AUDI P, V5, P438, DOI 10.1109/89.622567 SHU CQ, 1998, P INT C SIGN PROC, V1, P646 Siu MH, 2000, IEEE T SPEECH AUDI P, V8, P63 Stolcke Andreas, 1998, P DARPA BROADC NEWS, P270 Wu CH, 2000, IEE P-VIS IMAGE SIGN, V147, P55, DOI 10.1049/ip-vis:20000099 Wu CH, 2001, SPEECH COMMUN, V33, P197, DOI 10.1016/S0167-6393(00)00016-9 YAMAMOTO H, 1999, P ICASSP 99, V1, P533 Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 34 TC 12 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD JUN PY 2004 VL 43 IS 1-2 BP 71 EP 88 DI 10.1016/j.specom.2004.02.003 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 832LL UT WOS:000222268900005 ER PT J AU Minker, W Haiber, U Heisterkamp, P Scheible, S AF Minker, W Haiber, U Heisterkamp, P Scheible, S TI The SENECA spoken language dialogue system SO SPEECH COMMUNICATION LA English DT Article DE clarification dialogue; command & control; confidence; dialogue design; evaluation; grammar; noise reduction; robustness; speech recognition; text enrolment; voice enrolment AB This article describes a speech-based user interface to a wide range of entertainment, navigation and communication applications in mobile environments by means of human-machine dialogues. The system has been developed in the framework of the EU-project SENECA. It uses noise reduction, speech recognition, and dialogue processing techniques. One interesting aspect lies in the fact that low speech recognition confidence and word-level ambiguities are compensated for by engaging in flexible clarification dialogues with the user. The SENECA system demonstrator has been evaluated by means of user tests. With speech input, road safety, especially for complex tasks, is significantly improved. Compared to manual input, the feeling of being distracted from driving is weaker with speech input. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ Ulm, Dept Informat Technol, D-89081 Ulm, Germany. DaimlerChrysler, Res & Technol, D-89013 Ulm, Germany. Temic SDS GmbH, D-89077 Ulm, Germany. RP Minker, W (reprint author), Univ Ulm, Dept Informat Technol, Albert Einstein Allee 43, D-89081 Ulm, Germany. EM wolfgang.minker@e-technik.uni-ulm.de CR GARTNER U, 2001, SENECA PROJECT SPEEC Gartner U., 2001, INT DRIV S HUM FACT GREEN P, 2000, CONV 2000 C P SOC AU HANSEN JHL, 2000, P INT C SPEECH LANG HEISTERKAMP P, 2001, P C HUM LANG TECHN H LINHARD K, 1998, P INT C SPEECH LANG MAIER E, 1997, P EUR C SPEECH COMM MINKER W, 2003, P INT C INT US INT MORENO A, 2000, P INT C LANG RES EV MUTSCHLER H, 2001, FINAL REPORT EVALUAT SUHM B, 1999, P ACM SIGGHI C HUM F TEMEM JN, 1999, WORLD C RAILW RES TO WAHLSTER W, 2001, P EUR C SPEECH COMM NR 13 TC 10 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2004 VL 43 IS 1-2 BP 89 EP 102 DI 10.1016/j.specom.2004.01.005 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 832LL UT WOS:000222268900006 ER PT J AU Quene, H van den Bergh, H AF Quene, H van den Bergh, H TI On multi-level modeling of data from repeated measures designs: a tutorial SO SPEECH COMMUNICATION LA English DT Article DE multi-level modeling; repeated measures; experimental design; analysis of variance; mixed effects; variance components ID LANGUAGE; SPEECH; PRIMER AB Data from repeated measures experiments are usually analyzed with conventional ANOVA. Three well-known problems with ANOVA are the sphericity assumption, the design effect (sampling hierarchy), and the requirement for complete designs and data sets. This tutorial explains and demonstrates multi-level modeling (MLM) as an alternative analysis tool for repeated measures data. MLM allows us to estimate variance and covariance components explicitly. MLM does not require sphericity, it takes the sampling hierarchy into account, and it is capable of analyzing incomplete data.
A fictitious data set is analyzed with MLM and ANOVA, and analysis results are compared. Moreover, existing data from a repeated measures design are re-analyzed with MLM, to demonstrate its advantages. Monte Carlo simulations suggest that MLM yields higher power than ANOVA, in particular under realistic circumstances. Although technically complex, MLM is recommended as a useful tool for analyzing repeated measures data from speech research. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ Utrecht, Utrecht Inst Linguist OTS, NL-3512 JK Utrecht, Netherlands. RP Quene, H (reprint author), Univ Utrecht, Utrecht Inst Linguist OTS, Trans 10, NL-3512 JK Utrecht, Netherlands. EM hugo.quene@let.uu.nl CR Agrawal AF, 2001, AM NAT, V158, P308, DOI 10.1086/321324 Beacon HJ, 1996, STAT MED, V15, P2717 Broekkamp H, 2002, J EDUC PSYCHOL, V94, P260, DOI 10.1037//0022-0663.94.2.260 BRYK A, 2001, HLM HIERARCHICAL LIN Bryk A. S., 1992, HIERARCHICAL LINEAR Carvajal SC, 2001, MULTIVAR BEHAV RES, V36, P185, DOI 10.1207/S15327906MBR3602_03 Cochran W. G., 1977, SAMPLING TECHNIQUES, V3rd Cohen J., 1988, STAT POWER ANAL BEHA, V2nd COHEN J, 1992, PSYCHOL BULL, V112, P155, DOI 10.1037/0033-2909.112.1.155 Cronbach L. J., 1972, DEPENDABILITY BEHAV GOLDSTEIN H, 1991, BIOMETRIKA, V78, P42 Goldstein H, 1995, MULTILEVEL STAT MODE, V2nd GOLDSTEIN H, 1988, PSYCHOMETRIKA, V53, P455, DOI 10.1007/BF02294400 GOLDSTEIN H, 1994, STAT MED, V13, P1643, DOI 10.1002/sim.4780131605 Haggard E., 1958, INTRACLASS CORRELATI Hall DB, 2001, FOREST SCI, V47, P311 Hox J. J., 1995, APPL MULTILEVEL ANAL Kirk RR, 1995, EXPT DESIGN PROCEDUR Kish L., 1967, SURVEY SAMPLING Kreft I. G. G., 1998, INTRO MULTILEVEL MOD Lochner K, 2001, AM J PUBLIC HEALTH, V91, P385, DOI 10.2105/AJPH.91.3.385 Longford NT, 1993, RANDOM COEFFICIENT M Max L, 1999, J SPEECH LANG HEAR R, V42, P261 Maxwell S E, 2004, DESIGNING EXPT ANAL MCCULLOCH CE, 2001, GENERALIZED LINEAR Merlo J, 2001, J EPIDEMIOL COMMUN H, V55, P791, DOI 10.1136/jech.55.11.791 OBRIEN RG, 1985, PSYCHOL BULL, V97, P316, DOI 10.1037//0033-2909.97.2.316 Pedhazur E. J., 1991, MEASUREMENT DESIGN A Pinheiro J. C., 2000, MIXED EFFECTS MODELS Raaijmakers JGW, 1999, J MEM LANG, V41, P416, DOI 10.1006/jmla.1999.2650 RASBASH J, 2000, USERS GUIDE MLWIN CO Raudenbush S. W., 2002, HIERARCHICAL LINEAR, V2nd Reise SP, 2001, MULTIVAR BEHAV RES, V36, P153, DOI 10.1207/S15327906MBR3602_01 Rijlaarsdam Gert, 1996, SCI WRITING THEORIES, P207 Searle S.R., 1987, LINEAR MODELS UNBALA Searle SR, 1992, VARIANCE COMPONENTS Singer JD, 1998, J EDUC BEHAV STAT, V23, P323 Sluijter A., 1995, PHONETIC CORRELATES Snijders TAB, 1999, MULTILEVEL ANAL INTR, V1st Van der Leeden R, 1998, QUAL QUANT, V32, P15, DOI 10.1023/A:1004233225855 van Rossum MA, 2002, J SPEECH LANG HEAR R, V45, P1106, DOI 10.1044/1092-4388(2002/089) Winer B.J., 1971, STAT PRINCIPLES EXPT NR 42 TC 141 Z9 141 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2004 VL 43 IS 1-2 BP 103 EP 121 DI 10.1016/j.specom.2004.02.004 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 832LL UT WOS:000222268900007 ER PT J AU Palomaki, KJ Brown, GJ Barker, JP AF Palomaki, KJ Brown, GJ Barker, JP TI Techniques for handling convolutional distortion with 'missing data' automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; missing data; spectral distortion; spectral normalisation; reverberation ID INTELLIGIBILITY AB In this study we describe two techniques for handling convolutional distortion with 'missing data' speech recognition using spectral features. The missing data approach to automatic speech recognition (ASR) is motivated by a model of human speech perception, and involves the modification of a hidden Markov model (HMM) classifier to deal with missing or unreliable features. Although the missing data paradigm was proposed as a means of handling additive noise in ASR, we demonstrate that it can also be effective in dealing with convolutional distortion. Firstly, we propose a normalisation technique for handling spectral distortions and changes of input level (possibly in the presence of additive noise). The technique computes a normalising factor only from the most intense regions of the speech spectrum, which are likely to remain intact across various noise conditions. We show that the proposed normalisation method improves performance compared to a conventional missing data approach with spectrally distorted and noise contaminated speech, and in conditions where the gain of the input signal varies. Secondly, we propose a method for handling reverberated speech which attempts to identify time-frequency regions that are not badly contaminated by reverberation and have strong speech energy. This is achieved by using modulation filtering to identify 'reliable' regions of the speech spectrum. We demonstrate that our approach improves recognition performance in cases where the reverberation time T-60 exceeds 0.7 s, compared to a baseline system which uses acoustic features derived from perceptual linear prediction and the modulation-filtered spectrogram. (C) 2004 Elsevier B.V. All rights reserved. C1 Aalto Univ, Lab Acoust & Audio Signal Proc, FIN-02015 Helsinki, Finland. Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. Univ Helsinki, Dept Psychol, FIN-00014 Helsinki, Finland. RP Palomaki, KJ (reprint author), Aalto Univ, Lab Acoust & Audio Signal Proc, POB 3000, FIN-02015 Helsinki, Finland. 
EM kalle.palomaki@hut.fi; g.brown@dcs.shef.ac.uk; j.barker@dcs.shef.ac.uk CR Akbacak M., 2003, P ICASSP 2003, P113 [Anonymous], 1997, 3382 ISO ASSMANN P, 2003, SPRINGER HDB AUDITOR, V18 ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 Avendano C., 1997, THESIS OREGON GRADUA Barker J., 2001, P EUR 2001 ESCA, P213 Barker J., 2000, P ICSLP 2000, V1, P373 BARKER J, 2001, P WORKSH INN SPEECH BARKER J, 2000, P ICSLP 00, V4, P270 BRADLEY JS, 1986, J ACOUST SOC AM, V80, P837, DOI 10.1121/1.393907 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 BROWN GJ, 2001, P IJCNN 01, P2907 Cole R., 1995, P EUR C SPEECH COMM, P821 Cooke M., 1993, MODELLING AUDITORY P Cooke M, 2001, SPEECH COMMUN, V34, P267, DOI 10.1016/S0167-6393(00)00034-0 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Droppo J., 2002, P ICASSP 2002, V1, P57 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DUPONT S, 1999, P WORKSH ROB METH SP, P115 Eronen A. J., 2003, P IEEE INT C AC SPEE, V5, P529 Gold B., 2000, SPEECH AUDIO SIGNAL GOLZER H, 2003, P EL SPRACHS ARB ESS Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H, 2000, P ICASSP, V3, P1635 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hirsch H. G., 1995, P IEEE INT C AC SPEE, V1, P153 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Hyvarinen A, 2001, INDEPENDENT COMPONEN KANDERA N, 1999, SPPECH COMM, V28, P43 Kingsbury B., 1998, THESIS U CALIFORNIA Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 Kleinschmidt M, 2003, SPEECH COMMUN, V39, P47, DOI 10.1016/S0167-6393(02)00058-4 Li DG, 2001, PATTERN RECOGN LETT, V22, P533, DOI 10.1016/S0167-8655(00)00119-7 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 LUO Y, 2003, P ISCAS, V2, P564 MARTIN KD, 1999, THESIS MIT MASSACHUS MARTIN R, 1993, P EUR 1993, P37 *MATHW INC, 2003, MATLAB REL 13 REF MA Moore BC., 2003, INTRO PSYCHOL HEARIN MORRIS AC, 1998, P 1998 IEEE INT C AC, V2, P737, DOI 10.1109/ICASSP.1998.675370 MORRIS AC, 2002, 0229 IDIAP NABELEK AK, 1982, J ACOUST SOC AM, V71, P1242 Omologo M, 1998, SPEECH COMMUN, V25, P75, DOI 10.1016/S0167-6393(98)00030-2 Oppenheim A. V., 1989, DISCRETE TIME SIGNAL PALOMAKI K, 2002, P 2002 INT C AC SPEE, V1, P65 PALOMAKI KJ, 2001, P CRAC EUR SAT WORKS PALOMAKI KJ, IN PRESS SPEECH COMM PATTERSON RD, 1988, 2341 APL SVOS Pearce D., 2000, P ICSLP, V4, P29 PELTONEN V, 2002, IEEE INT C AC SPEECH, V2, P1941 RAJ B, IN PRESS SPEECH COMM Raj B., 2000, THESIS CARNEGIE MELL Rosenberg A.E., 1994, P INT C SPOK LANG PR, V4, P1835 Young S., 2001, HTK BOOK 1997, STRUT VERS 2 4 NR 57 TC 36 Z9 36 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2004 VL 43 IS 1-2 BP 123 EP 142 DI 10.1016/j.specom.2004.02.005 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 832LL UT WOS:000222268900008 ER PT J AU Kilic, MA Ogut, F AF Kilic, MA Ogut, F TI A high unrounded vowel in Turkish: is it a central or back vowel? 
SO SPEECH COMMUNICATION LA English DT Article DE magnetic resonance imaging; speech acoustics; Turkish; vowels ID RESONANCE; SHAPE AB The aim of this paper is to investigate the phonetic properties of a Turkish vowel whose backness is indefinite. The five members ([a],[epsilon],[i],[u] and the high unrounded Turkish vowel, in short HUTV) of the Turkish vowel system were investigated in five adult male native speakers of Turkish. For the articulatory analysis, midsagittal magnetic resonance images were obtained during sustained phonation of the vowels, and the distances of the main constrictions from the glottis and the areas of the oral and pharyngeal cavities were calculated. For the acoustic analysis, both the Turkish vowels' and HUTV-like IPA vowels' fundamental frequencies (f(0)) and the first three formants (F-1, F-2 and F-3) were calculated. The acoustic parameters of HUTV were compared both with those of the other Turkish vowels and with those of the IPA vowels. For the auditory analysis, 220 synthetic stimuli and 26 IPA vowels were used in an identification test. Articulatory analyses revealed that there were no statistically significant differences between HUTV and [u], and HUTV and [epsilon]. Acoustic analyses revealed that there were no statistically significant differences between HUTV and [epsilon], and HUTV and phoneticians' [+] and [w], and [v] vowels. Auditory investigation revealed that the [+] and [w], and [v] vowels were perceived as HUTV. These results suggested that HUTV's position in the vowel space was between the [epsilon] and [u] vowels, but its subarea was fairly wide. (C) 2004 Elsevier B.V. All rights reserved. C1 Kahramanmaras Sutcu Imam Univ, Sch Med, Dept Otolaryngol, TR-46050 Kahramanmaras, Turkey. Ege Univ, Sch Med, Dept Otolaryngol, Izmir, Turkey. RP Kilic, MA (reprint author), Kahramanmaras Sutcu Imam Univ, Sch Med, Dept Otolaryngol, TR-46050 Kahramanmaras, Turkey. EM makilic@doruk.net.tr CR BAER T, 1991, J ACOUST SOC AM, V90, P799, DOI 10.1121/1.401949 Catford John C., 1977, FUNDAMENTAL PROBLEMS DEMIRCAN, 1979, TURKIYE TURKCESININ Demirezen M, 1986, PHONEMICS PHONOLOGY Demolin D, 2002, CR BIOL, V325, P547, DOI 10.1016/S1631-0691(02)01458-0 Ergenc I., 1989, TURKIYE TURKCESININ ESLING JH, 1994, U VICTORIA PHONETIC Fant G., 1970, ACOUSTIC THEORY SPEE *IPA TRANSCR TUT, 1993, COMP PROGR CSL MOD 4 Jaklin Kornfilt, 1997, TURKISH JOHNSON K, 1993, LANGUAGE, V69, P505, DOI 10.2307/416697 KILIC MA, 2003, STUDIES TURKISH LING, P3 Ladefoged P., 1985, Computer speech processing Ladefoged Peter, 1993, COURSE PHONETICS Lewis Geoffrey, 1967, TURKISH GRAMMAR MOORE CA, 1992, J SPEECH HEAR RES, V35, P1009 Rosner B. S., 1994, VOWEL PERCEPTION PRO SELEN N, 1979, SOYLEYIS SESBILIMI A Shriberg LD, 1995, CLIN PHONETICS SYRDAL AK, 1985, SPEECH COMMUN, V4, P121, DOI 10.1016/0167-6393(85)90040-8 TRAUNMULLER H, 1988, PHONETICA, V45, P1 WELLS J, 1995, SOUNDS INT PHONETIC Whalen DH, 1999, J SPEECH LANG HEAR R, V42, P592 Zimmer Karl, 1999, HDB INT PHONETIC ASS, P154 NR 24 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD JUN PY 2004 VL 43 IS 1-2 BP 143 EP 154 DI 10.1016/j.specom.2004.03.001 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 832LL UT WOS:000222268900009 ER PT J AU Hirschberg, J Litman, D Swerts, M AF Hirschberg, J Litman, D Swerts, M TI Prosodic and other cues to speech recognition failures SO SPEECH COMMUNICATION LA English DT Article DE prosody; confidence scores; recognition error AB In spoken dialogue systems, it is important for the system to know how likely a speech recognition hypothesis is to be correct, so it can reject misrecognized user turns, or, in cases where many errors have occurred, change its interaction strategy or switch the caller to a human attendant. We have identified prosodic features which predict more accurately when a recognition hypothesis contains errors than the acoustic confidence scores traditionally used in automatic speech recognition in spoken dialogue systems. We describe statistical comparisons of features of correctly and incorrectly recognized turns in the TOOT train information corpus and the W99 conference registration corpus, which reveal significant prosodic differences between the two sets of turns. We then present machine learning results showing that the use of prosodic features, alone and in combination with other automatically available features, can predict more accurately whether or not a user turn was correctly recognized, when compared to the use of acoustic confidence scores alone. (C) 2004 Published by Elsevier B.V. C1 Columbia Univ, Dept Comp Sci, New York, NY 10027 USA. Univ Pittsburgh, Dept Comp Sci, Pittsburgh, PA 15260 USA. Univ Pittsburgh, LRDC, Pittsburgh, PA 15260 USA. Tilburg Univ, Fac Arts Commun & Cognit, NL-5000 LE Tilburg, Netherlands. Univ Antwerp, CNTS, B-2610 Antwerp, Belgium. RP Hirschberg, J (reprint author), Columbia Univ, Dept Comp Sci, 1241 Amsterdam Ave,M-C 0401, New York, NY 10027 USA. EM julia@cs.columbia.edu; litman@cs.pitt.edu; m.g.j.swerts@uvt.nl RI Swerts, Marc/C-8855-2013 CR Ammicht E, 2001, P EUR, P2217 ANDORNO M, 2002, P INT C SPOK LANG PR, P1377 BELL L, 1999, P ICPHS99 SAN FRANS, P1221 BLAAUW E, 1992, P INT C SPOK LANG PR, V1, P751 BOUWMAN AG, 1999, P INT C AC SPEECH SI, V1, P493 BRUCE G, 1995, P 13 INT C PHON SCI, V2, P28 COHEN W, 1996, 14 C AM ASS ART INT, P709 DODDINGTON G, 1998, P INT C SPOK LANG PR, P608 FALAVIGNA D, 2002, P INT C SPOK LANG PR, P1621 FANT G, 1995, 6975 BR ESPRIT, V27 GUILLEVIC D, 2002, P INT C SPOK LANG PR, P853 HIROSE K, 1997, COMPUTING PROSODY CO, P327 Hirschberg J., 1999, P AUT SPEECH REC UND, P349 HIRSCHBERG J, 2001, P 2 N AM ANN M ASS C, P208 Hirschberg J., 1991, P 2 EUR C SPEECH COM, P1275 Hirschberg J., 1995, P 13 INT C PHON SCI, V2, P36 KAMM CA, 1997, P EUROSPEECH 1997, P2203 KRAAYEVELD H, 1997, THESIS NIJMEGEN U Krahmer E., 2001, International Journal of Speech Technology, V4, DOI 10.1023/A:1009648614566 LEVOW GA, 1998, P 36 ANN M ASS COMP, P736 LITMAN D, 2001, P ACL2001 TOUL, P329 LITMAN DJ, 1999, P 7 INT C US MOD, P55 Litman D.J., 1999, P 37 ANN M ASS COMP, P309, DOI 10.3115/1034678.1034729 MORENO PJ, 2001, P EUROSPEECH 01 AALB, P2109 OSTENDORF M, 1997, 1996 CLSP JHU WORKSH Oviatt S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607722 RAHIM M, 1999, P ASRU 99 KEYST SHARP RD, 1997, P INT C AC SPEECH SI, P4065 SOLTAU H, 2002, P INT C SPOK LANG PR, P83 SOLTAU H, 1998, P INT C SPOK LANG PR, P225 SOLTAU H, 2000, P ICASSP 2000, P1779 Swerts M, 1997, SPEECH COMMUN, V22, P25, DOI 10.1016/S0167-6393(97)00011-3 SWERTS M, 1997, INTONATION THEORY MO, P297 SWERTS M, 2000, P 6 INT C SPOK LANG, P615 Talkin D., 1995, SPEECH CODING SYNTHE, P495 Veilleux N. M., 1994, THESIS BOSTON U WADE E, 1992, P INT C SPOK LANG PR, V2, P995 Walker M., 2000, P N AM M ASS COMP LI, P210 Walker M., 1998, P 36 ANN M ASS COMP, P1345 Walker M. A., 2000, NAT LANG ENG, V6, P363, DOI 10.1017/S1351324900002503 WANG HM, 2002, P INT C SPOK LANG PR, P1625 WEINTRAUB M, 1996, P INT C SPOK LANG PR, pS16 Zeljkovic I, 1996, INT CONF ACOUST SPEE, P129, DOI 10.1109/ICASSP.1996.540307 Zhang R., 2001, P 7 EUR C SPEECH COM, P2105 NR 44 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2004 VL 43 IS 1-2 BP 155 EP 175 DI 10.1016/j.specom.2004.01.006 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 832LL UT WOS:000222268900010 ER PT J AU Smith, CL AF Smith, CL TI Topic transitions and durational prosody in reading aloud: production and modeling SO SPEECH COMMUNICATION LA English DT Article DE prosody; final lengthening; prosodic modeling; relationship to text structure; perception of prosody ID TO-SPEECH SYNTHESIS; DISCOURSE STRUCTURE; SENTENCE; ENGLISH; TEXT; INTONATION; BOUNDARIES AB The linguistic structure of an utterance is known to affect the durational prosody of sounds, words and phrases. There has been increasing interest in how discourse-level organization affects prosody, in part because modeling discourse-level effects could improve the comprehensibility of longer passages of synthesized text. The approach taken here is to look at how topics are sequenced in a text, and how this affects durational prosody when that text is read aloud. Two speakers of American English were recorded reading a set of text materials on 10 separate occasions. Measurements of these recordings indicated that the type of transition in topic between two successive sentences had a significant effect on the amount of sentence-final lengthening, the duration of the pause between sentences, and the speech rate at the end of a sentence and the beginning of the following sentence. These measurements were then used to create a mathematical model of one speaker, and to generate several versions of one of this speaker's original recordings, with each version incorporating different manipulations of the durational patterns and their variability. These versions were played to listeners, who preferred those where the manipulations included durational patterns reflecting the organization of topics in the text. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ New Mexico, Dept Linguist, Albuquerque, NM 87131 USA. RP Smith, CL (reprint author), Univ New Mexico, Dept Linguist, Humanities 526,MSC 03 2130, Albuquerque, NM 87131 USA. 
EM caroline@unm.edu CR Ayers G., 1994, WORKING PAPERS LINGU, V44, P1 BARBOSA P, 1994, SPEECH COMMUN, V15, P127, DOI 10.1016/0167-6393(94)90047-7 Brown Gillian, 1980, QUESTIONS INTONATION CAMPBELL N, 1990, TALKING MACHINES THE, P211 CAMPBELL Nick, 2000, PROSODY THEORY EXPT, P281 COHEN J, 1993, BEHAV RES METH INSTR, V25, P257, DOI 10.3758/BF03204507 CRYSTAL TH, 1982, J ACOUST SOC AM, V72, P705, DOI 10.1121/1.388251 *DEN SYST, 1997, CANV 5 US GUID DENOUDEN H, 2000, P 10 ANN M SOC TEXT, P40 EDWARDS J, 1991, J ACOUST SOC AM, V89, P369, DOI 10.1121/1.400674 Efron B, 1993, INTRO BOOTSTRAP, P45 FLEISS JL, 1971, PSYCHOL BULL, V76, P378, DOI 10.1037/h0031619 Fon Y.-J. J., 2002, THESIS OHIO STATE U GEE JP, 1983, COGNITIVE PSYCHOL, V15, P411, DOI 10.1016/0010-0285(83)90014-2 Grosz B., 1992, P INT C SPOK LANG PR, P429 Grosz B. J., 1986, Computational Linguistics, V12 HAASE M, 2001, P EUROSPEECH, P2157 Herman R, 2000, J PHONETICS, V28, P466, DOI 10.1006/jpho.2000.0127 Hirschberg J., 1996, P 34 ANN M ASS COMP, P286, DOI 10.3115/981863.981901 Hirschberg J., 1986, P 24 ANN M ASS COMP, P136, DOI 10.3115/981131.981152 HIRSCHBERG J, 1993, P ESCA WORKSH PROS, P90 Jurafsky D, 1997, 9702 U COL I COGN SC KOOPMANSVANBEIN.F, 1996, P I PHON SC U AMST, P1 KREIMAN J, 1982, J PHONETICS, V10, P163 LANDIS JR, 1977, BIOMETRICS, V33, P159, DOI 10.2307/2529310 Lehiste I., 1975, STRUCTURE PROCESS SP, P195 Lehiste I., 1979, FRONTIERS SPEECH COM, P191 Littell RC, SAS SYSTEM MIXED MOD Mann W. C., 1988, TEXT, V8, P243, DOI 10.1515/text.1.1988.8.3.243 MUNHALL KG, 1985, J ACOUST SOC AM, V78, P1548, DOI 10.1121/1.392790 NAKAJIMA S, 1997, COMPUTING PROSODY CO, P81 NAKAJIMA S, 1993, PHONETICA, V50, P197 NAKATANI C, 1996, TR2195 HARV U CTR RE Noordman L, 1999, AMST STUD THEORY HIS, V176, P133 PASSONNEAU RJ, 1996, COMPUTATIONAL CONVER, P161 SAS Institute, 1998, STATV REF MAN ShattuckHufnagel S, 1996, J PSYCHOLINGUIST RES, V25, P193, DOI 10.1007/BF01708572 Shriberg E, 2000, SPEECH COMMUN, V32, P127, DOI 10.1016/S0167-6393(00)00028-5 SLUIJTER AMC, 1993, PHONETICA, V50, P180 Stirling L, 2001, SPEECH COMMUN, V33, P113, DOI 10.1016/S0167-6393(00)00072-8 Swerts M, 1997, SPEECH COMMUN, V22, P25, DOI 10.1016/S0167-6393(97)00011-3 Swerts M, 1997, J ACOUST SOC AM, V101, P514, DOI 10.1121/1.418114 SWERTS M, 1994, LANG SPEECH, V37, P21 THORSEN NG, 1985, J ACOUST SOC AM, V77, P1205, DOI 10.1121/1.392187 Turk AE, 1999, J PHONETICS, V27, P171, DOI 10.1006/jpho.1999.0093 UMEDA N, 1975, J ACOUST SOC AM, V58, P434, DOI 10.1121/1.380688 VANSANTEN JPH, 1994, COMPUT SPEECH LANG, V8, P95, DOI 10.1006/csla.1994.1005 VANDONZEL M, 1999, PROSODIC ASPECTS INF WIGHTMAN S, 1992, J ACOUST SOC AM, V92, P1707 Wouters J, 2002, J ACOUST SOC AM, V111, P417, DOI 10.1121/1.1428262 YULE G, 1980, LINGUA, V52, P33, DOI 10.1016/0024-3841(80)90016-9 NR 51 TC 18 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD APR PY 2004 VL 42 IS 3-4 BP 247 EP 270 DI 10.1016/j.specom.2003.09.004 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900001 ER PT J AU Ramirez, J Segura, JC Benitez, C de la Torre, A Rubio, A AF Ramirez, J Segura, JC Benitez, C de la Torre, A Rubio, A TI Efficient voice activity detection algorithms using long-term speech information SO SPEECH COMMUNICATION LA English DT Article DE speech/non-speech detection; speech enhancement; speech recognition; long-term spectral envelope; long-term spectral divergence ID NOISE SPECTRUM AB Currently, technology barriers inhibit speech processing systems from working under extremely noisy conditions. The emerging applications of speech technology, especially in the fields of wireless communications, digital hearing aids or speech recognition, are examples of such systems and often require a noise reduction technique operating in combination with a precise voice activity detector (VAD). This paper presents a new VAD algorithm for improving speech detection robustness in noisy environments and the performance of speech recognition systems. The algorithm measures the long-term spectral divergence (LTSD) between speech and noise and formulates the speech/non-speech decision rule by comparing the long-term spectral envelope to the average noise spectrum, thus yielding a highly discriminating decision rule and minimizing the average number of decision errors. The decision threshold is adapted to the measured noise energy while a controlled hang-over is activated only when the observed signal-to-noise ratio is low. An analysis of the speech/non-speech LTSD distributions shows that using long-term information about speech signals is beneficial for VAD. The proposed algorithm is compared to the most commonly used VADs in the field, in terms of speech/non-speech discrimination and in terms of recognition performance when the VAD is used for an automatic speech recognition system. Experimental results demonstrate a sustained advantage over standard VADs such as G.729 and adaptive multi-rate (AMR) which were used as a reference, and over the VADs of the advanced front-end for distributed speech recognition. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Granada, Dept Elect & Tecnol Comp, E-18071 Granada, Spain. RP Ramirez, J (reprint author), Univ Granada, Dept Elect & Tecnol Comp, Campus Univ Fuentenueva, E-18071 Granada, Spain. EM javierrp@ugr.es; segura@ugr.es; carmen@ugr.es; atv@ugr.es; rubio@ugr.es RI de la Torre, Angel/C-6618-2012; Benitez Ortuzar, M Del Carmen/C-2424-2012; Segura, Jose/B-7008-2008; Prieto, Ignacio/B-5361-2013; Ramirez, Javier/B-1836-2012 OI Segura, Jose/0000-0003-3746-0978; Ramirez, Javier/0000-0002-6229-2921 CR Benyassine A, 1997, IEEE COMMUN MAG, V35, P64, DOI 10.1109/35.620527 Beritelli F, 1998, IEEE J SEL AREA COMM, V16, P1818, DOI 10.1109/49.737650 Beritelli F, 2002, IEEE SIGNAL PROC LET, V9, P85, DOI 10.1109/97.995824 Berouti M., 1979, ICASSP 79. 1979 IEEE International Conference on Acoustics, Speech and Signal Processing BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Bouquin-Jeannes R. L., 1995, SPEECH COMMUN, V16, P245 BOUQUINJEANNES RL, 1994, ELECTRON LETT, V30, P930 Cho YD, 2001, ELECTRON LETT, V37, P540, DOI 10.1049/el:20010368 CHO YD, 2001, INT C AC SPEECH SIGN, V2, P737 Cho YD, 2001, IEEE SIGNAL PROC LET, V8, P276 Freeman D.K., 1989, IEEE INT C AC SPEECH, V1, P369 Hirsch H.
G., 2000, ISCA ITRW ASR2000 AU ITOH K, 1997, INT C ACOUST SPEECH, V1, P419 Karray L, 2003, SPEECH COMMUN, V40, P261, DOI 10.1016/S0167-6393(02)00066-3 Macho D., 2002, P ICSLP, P17 Madisetti V.K., 1999, DIGITAL SIGNAL PROCE MARTIN R, 1993, EUROSPEECH, V1, P1093 Marzinzik M, 2002, IEEE T SPEECH AUDI P, V10, P109, DOI 10.1109/89.985548 MORENO A, 2000, P 2 LREC Nemer E, 2001, IEEE T SPEECH AUDI P, V9, P217, DOI 10.1109/89.905996 *NOK, 2000, BAS RES SUBS SPEECHD Sangwan A., 2002, IEEE INT C HIGH SPEE, P46 SOHN J, 1998, INT C AC SPEECH SIGN, V1, P365 Sohn J, 1999, IEEE SIGNAL PROC LET, V6, P1 *TEX INSTR, 2001, DESCR BAS RES SUBS S Woo KH, 2000, ELECTRON LETT, V36, P180, DOI 10.1049/el:20000192 Young S., 2001, HTK BOOK HTK VERSION NR 27 TC 124 Z9 138 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2004 VL 42 IS 3-4 BP 271 EP 287 DI 10.1016/j.specom.2003.10.002 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900002 ER PT J AU Cox, S Vinagre, L AF Cox, S Vinagre, L TI Modelling of confusions in aircraft call-signs SO SPEECH COMMUNICATION LA English DT Article ID ENGLISH CONSONANTS; INTELLIGIBILITY; INDEX AB Air traffic has grown rapidly in the last twenty years and concern has been mounting about the safety implications of mis-recognition of call-signs by both pilots and air-traffic controllers. This paper presents the results of a preliminary study into perceptual (i.e. non-cognitive) confusions in two closed vocabularies of the type used as aircraft call-signs. Conventional methods of subjective and objective testing were found to be unsuitable for our aim of predicting potential confusions within a vocabulary. Hence a method for modelling confusion probability in a closed vocabulary at a certain signal-to-noise ratio has been developed. The method is based on the use of a phoneme confusion matrix and a technique for comparing phoneme strings. The method is presented and results are given. These suggest that the behaviour of the model is plausible, and a comparison of its predictions with a set of real confusions showed that it correctly predicted the position of the confusion in three-word phrases. The predictions of the model need to be verified by subjective testing before it can be deployed in a system that designs low-confusability call-signs, which is the ultimate goal of the research. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. Natl Air Traff Serv Ltd, London Terminal Control Ctr, W Drayton UB7 9AX, England. RP Cox, S (reprint author), Univ E Anglia, Sch Comp Sci, Norwich NR4 7TJ, Norfolk, England. EM sjc@sys.uea.ac.uk CR Baddeley A. D., 1990, HUMAN MEMORY THEORY DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Duda R.
O., 2001, PATTERN CLASSIFICATI Fletcher H., 1953, SPEECH HEARING COMMU Fransen J., 1994, CUEDFINFENGTR192 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 JANSEN J, 1996, HTK BOOK Kruskal JB, 1964, PSYCHOMETRIKA, V29, P1 Mendel LL, 1998, J ACOUST SOC AM, V104, P1609, DOI 10.1121/1.424373 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 MOORE R, 1977, IEEE T ACOUSTICS SPE, V25, P176 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 PICKETT JM, 1957, J ACOUST SOC AM, V29, P613, DOI 10.1121/1.1908983 ROBINSON T, 1996, BRIT ENGLISH EXAMPLE SERVICES NAT, 1996, 1121996 AIC SHEPARD RN, 1957, PSYCHOMETRIKA, V22, P325, DOI 10.1007/BF02288967 SIMONS A, 1995, P EUR, P1465 SINGH S, 1972, J ACOUST SOC AM, V52, P1698, DOI 10.1121/1.1913304 STEENEKEN H, 2002, SPEECH COMMUN, V38, P412 STEENEKEN HJM, 1985, J AUDIO ENG SOC, V33, P1007 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 VANDEELEN GW, 1990, AVIAT SPACE ENVIR MD, V61, P52 WANG MD, 1973, J ACOUST SOC AM, V54, P1248, DOI 10.1121/1.1914417 WILSON K, 1967, AM J PSYCHOL, V76, P89 1990, ICAO MANUAL RADIOTEL NR 25 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2004 VL 42 IS 3-4 BP 289 EP 312 DI 10.1016/j.specom.2003.09.006 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900003 ER PT J AU Zera, J AF Zera, J TI Speech intelligibility measured by adaptive maximum-likelihood procedure SO SPEECH COMMUNICATION LA English DT Article DE speech intelligibility tests; MRT test; maximum-likelihood adaptive procedure ID HEARING-IMPAIRED SUBJECTS; UP-DOWN METHODS; PSYCHOMETRIC FUNCTIONS; PSYCHOPHYSICAL PROCEDURES; RECEPTION THRESHOLD; NOISE; RELIABILITY; PERCEPTION; SENTENCES; STIMULUS AB This paper describes an adaptive maximum-likelihood procedure, originally developed for psychoacoustic measurements, applied to measure speech intelligibility using the modified rhyme test (MRT) [J. Acoust. Soc. Am. 37 (1965) 158]. The use of this adaptive procedure to estimate the speech-to-noise ratio corresponding to various pre-selected word scores, given as an input parameter, is presented. Listening tests conducted on subjects demonstrated that the required speech-to-noise ratio could be estimated with sufficient accuracy in an adaptive run consisting of fewer than 25 test items. Such a run length is also sufficient to maintain the relative frequency distribution of consonants present in the original MRT with an average error not exceeding 3 percentage points. Depending on the target word score, the adaptive procedure exhibits different values of bias and standard deviation in its level estimates. These variables were analyzed using numerical simulations as well as the results of experiments, in which the signal-to-noise ratio was determined for a number of target word scores. The maximum-likelihood adaptive procedure is considered more efficient than other adaptive methods (e.g. staircase procedures). Owing to this efficiency, the maximum-likelihood adaptive procedure can be a useful tool for quick assessment of the speech-to-noise ratio of communication systems at various percent-correct word scores. (C) 2003 Elsevier B.V. All rights reserved. C1 Natl Res Inst, Cent Inst Labour Protect, Dept Acoust & Electromagnet Hazards, PL-00701 Warsaw, Poland. Chopin Acad Mus, Dept Sound Engn, Mus Acoust Lab, PL-00368 Warsaw, Poland.
RP Zera, J (reprint author), Natl Res Inst, Cent Inst Labour Protect, Dept Acoust & Electromagnet Hazards, Czernialowska 16, PL-00701 Warsaw, Poland. CR ANSI, 1989, S32 ANSI BODE DL, 1973, IEEE T ACOUST SPEECH, VAU21, P196, DOI 10.1109/TAU.1973.1162479 BRONKHORST AW, 1992, J ACOUST SOC AM, V92, P3132, DOI 10.1121/1.404209 Creelman C. D., 1991, DETECTION THEORY USE DAI H, 1994, J ACOUST SOC AM, V96, P1646 Dai HP, 1995, J ACOUST SOC AM, V98, P3135, DOI 10.1121/1.413802 DRESCHLER WA, 1980, J ACOUST SOC AM, V68, P1608, DOI 10.1121/1.385215 DRESCHLER WA, 1985, J ACOUST SOC AM, V78, P1261, DOI 10.1121/1.392895 FOSTER J R, 1987, British Journal of Audiology, V21, P165, DOI 10.3109/03005368709076402 GOWER DW, 1994, HUM FACTORS, V36, P350 Green D. M., 1966, SIGNAL DETECTION THE Green D. M., 1964, SIGNAL DETECT RECOG, P609 GREEN DM, 1995, J ACOUST SOC AM, V97, P3749, DOI 10.1121/1.412390 GREEN DM, 1993, J ACOUST SOC AM, V93, P2096, DOI 10.1121/1.406696 GREEN DM, 1991, PERCEPT PSYCHOPHYS, V49, P100, DOI 10.3758/BF03211621 GREEN DM, 1990, J ACOUST SOC AM, V87, P2662, DOI 10.1121/1.399058 HALL JL, 1981, J ACOUST SOC AM, V69, P1763, DOI 10.1121/1.385912 HOUSE AS, 1965, J ACOUST SOC AM, V37, P158, DOI 10.1121/1.1909295 LAMING D, 1988, PERCEPT PSYCHOPHYS, V44, P99, DOI 10.3758/BF03208701 LEEK MR, 1992, PERCEPT PSYCHOPHYS, V51, P247, DOI 10.3758/BF03212251 LEVITT H, 1967, J ACOUST SOC AM, V39, P609 LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375 *MIL, 1989, MILSTD1472D DOD NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 PLOMP R, 1979, AUDIOLOGY, V18, P43 PLOMP R, 1979, J ACOUST SOC AM, V66, P1333, DOI 10.1121/1.383554 SHELTON BR, 1982, J ACOUST SOC AM, V71, P1527, DOI 10.1121/1.387806 SMOORENBURG GF, 1992, J ACOUST SOC AM, V91, P421, DOI 10.1121/1.402729 VOIERS WD, 1983, SPEECH TECHNOL, P30 WATSON AB, 1983, PERCEPT PSYCHOPHYS, V33, P113, DOI 10.3758/BF03202828 ZHOU B, 1995, J ACOUST SOC AM, V98, P828, DOI 10.1121/1.413509 1991, ISOTR4870 NR 32 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2004 VL 42 IS 3-4 BP 313 EP 328 DI 10.1016/j.specom.2003.08.007 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900004 ER PT J AU Pargellis, AN Kuo, HKJ Lee, CH AF Pargellis, AN Kuo, HKJ Lee, CH TI An automatic dialogue generation platform for personalized dialogue applications SO SPEECH COMMUNICATION LA English DT Article DE voice user interface; dynamic generation; finite state dialogue; user profile; speech system AB We explore a voice user interface for human-machine communication that uses a dialogue structure personalized to an individual user, subject to the constraints of the system's resources. Conventional dialogue systems typically use architectures that create a set of predefined speech objects, or subdialogues, by combining several static components, such as grammars and other language components. Such systems are limited because most databases are dynamic, and users have different preferences for topical content and presentation format. Ideally, a dialogue would combine a user's intentions, encoded in a profile, with information and services available in a dynamic and distributed external environment. We use a modular architecture where a centralized Application Generator (AG) interacts with two managers. A user's preferences are stored in a profile handled by a Profile Manager. 
An Information Manager uses these preferences when accessing external databases and extracting, filtering, or presenting information in a form customized to that particular user. The AG then builds an anticipated dialogue, allowing a user to navigate between a personalized set of services, each of which presents information and services in a manner customized for that user. Therefore, the AG generates, in a uniform and consistent manner, a finite state dialogue for any task described by a set of specifications residing on a distributed network. Finally, a dialogue manager uses a set of protocols to carry out an actual dialogue session with the user. (C) 2003 Elsevier B.V. All rights reserved. C1 Bell Labs Lucent Technol, Dialogue Syst Res Dept, Murray Hill, NJ 07974 USA. RP Pargellis, AN (reprint author), Bell Labs Lucent Technol, Dialogue Syst Res Dept, 600 Mt Ave, Murray Hill, NJ 07974 USA. EM apargellis@aol.com; hkuo@us.ibm.com; chl@ece.gatech.edu CR AUST H, 1998, P INT S SPOK DIAL SY, P27 CHUCARROLL J, 1999, P EUR 99 BUD HUNG *COMM SYST, HEYAN 1 800 44ANITA DEVILLERS L, 1998, P ICSLP 1998 SYDN AU Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X ISSAR S, 1997, P EUR 1997 RHOD GREE KAMM C, 1997, P EUR 1997 RHOD GREE KUO HKJ, 2000, Patent No. 0030479332201 KUO HKJ, 1999, P INT DIAL MULT SYST LAMEL L, 1999, P ICASSP 99 PHOEN AR MCGLASHAN S, 2001, URL PAPINENI KA, 1999, P EUR 99 BUD HUNG PARGELLIS AN, 1999, P EUR 99 BUD HUNG PARGELLIS AN, 1998, P ICSLP 1998 SYDN AU PARGELLIS AN, 1999, P INT DIAL MULT SYST POTAMIANOS A, 1999, P INT DIAL MULT SYST SALTON G, 1988, INFORM PROCESS MANAG, V24, P513, DOI 10.1016/0306-4573(88)90021-0 SENEFF S, 1998, P ICSLP 1998 SYDN AU TSAI A, 2001, P EUR AALB DENM WALKER M, 1997, P EUR 97 RHOD GREEC Walker MA, 2000, J ARTIF INTELL RES, V12, P387 NR 21 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2004 VL 42 IS 3-4 BP 329 EP 351 DI 10.1016/j.specom.2003.10.003 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900005 ER PT J AU Heikkinen, A AF Heikkinen, A TI Development of a 4 kbit/s hybrid sinusoidal/CELP speech coder SO SPEECH COMMUNICATION LA English DT Article DE speech coding; hybrid coding; sinusoidal coding; CELP coding ID VOCODER; MODEL AB A comprehensive performance analysis of sinusoidal and code excited linear prediction (CELP) speech coding is given around 4 kbit/s, using both subjective and objective measurements. Based on the observations made, justification for the multi-modal hybrid coding approach employing both sinusoidal and CELP coding is given, and an implementation of such a coder is described. This 4 kbit/s sinusoidal/CELP speech coder utilizes four modes to classify the input speech segment: voiced, jittery-voiced, plosive and unvoiced. For voiced segments sinusoidal coding is used whereas different CELP versions are employed for the other modes. The quality of the implemented 4 kbit/s sinusoidal/ CELP speech coder in clean speech conditions is finally verified by a listening test. In the test, the 4 kbit/s coder performed almost as well as the high-quality references used, but it still needs improvements to be classified as a high-quality 4 kbit/s speech coder. (C) 2003 Elsevier B.V. All rights reserved. C1 Nokia Res Ctr, Audio Visual Syst Lab, FIN-33721 Tampere, Finland. 
RP Heikkinen, A (reprint author), Nokia Res Ctr, Audio Visual Syst Lab, POB 100, FIN-33721 Tampere, Finland. EM ari.p.heikkinen@nokia.com CR Almeida L. B., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing ALMEIDA LB, 1984, P IEEE INT C AC SPEE AMADA T, 1999, P IEEE INT C AC SPEE, P13 Atal B.S., 1984, P INT C COMM AMST, P1610 ATKINSON I, 1997, P IEEE INT C AC SPEE, P1559 Chang WW, 1998, INT CONF ACOUST SPEE, P525 CUPERMAN V, 1995, INT CONF ACOUST SPEE, P496, DOI 10.1109/ICASSP.1995.479637 ETEMOGLU CO, 2000, P IEEE INT C AC SPEE, P1371 GARDNER WR, 1994, P IEEE INT C AC SPEE, P205 George EB, 1997, IEEE T SPEECH AUDI P, V5, P389, DOI 10.1109/89.622558 GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651 Hagen R, 1998, INT CONF ACOUST SPEE, P145, DOI 10.1109/ICASSP.1998.674388 HEDELIN P, 1981, P IEEE INT C AC SPEE, P205 HEIKKINEN A, 2002, THESIS TAMPERE U TEC HEIKKINEN A, 2001, P EUR, P1965 HEIKKINEN A, 2001, P WSES IEEE C SPEECH HEIKKINEN A, 2000, P 10 EUR SIGN PROC C JENSEN J, 1999, P IEEE INT C AC SPEE, P473 JENSEN J, 2000, P IEEE INT C AC SPEE, P1439 KLEIJN WB, 1993, P INT C AC SPEECH SI, V2, P596 KLEIJN WB, 1991, THESIS DELFT U TECHN KLEIJN WB, 1994, EUR T TELECOMMUN, V5, P573 LAFLAMME C, 1996, P IEEE INT C AC SPEE, P204 Li CY, 1998, INT CONF ACOUST SPEE, P581 Li CY, 2000, INT CONF ACOUST SPEE, P1367 McAulay R., 1995, SPEECH CODING SYNTHE, P121 McAulay R. J., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 McCree A, 1998, INT CONF ACOUST SPEE, P593, DOI 10.1109/ICASSP.1998.675334 MCCREE AV, 1995, IEEE T SPEECH AUDI P, V3, P242, DOI 10.1109/89.397089 Ozawa K, 1999, ACTA HORTIC, P189 POBLOTH H, 1999, P INT C AC SPEECH SI, V1, P29 QIAN X, 1996, P IEEE INT C AC SPEE, P228 Shlomot E, 2001, IEEE T SPEECH AUDI P, V9, P632, DOI 10.1109/89.943341 Shlomot E, 1998, INT CONF ACOUST SPEE, P585, DOI 10.1109/ICASSP.1998.675332 SHLOMOT E, 1997, P IEEE SPEECH COD WO, P37 SKOGLUND J, 1997, P IEEE SPEECH COD WO, P51 Skoglund J, 2000, IEEE T SPEECH AUDI P, V8, P361, DOI 10.1109/89.848218 Stachurski J., 2000, P IEEE INT C AC SPEE, P1379 STACHURSKI J, 1999, P IEEE INT C AC SPEE, P485 STEGMANN J, 1996, P IEEE INT C AC SPEE, P546 SUN X, 1997, P IEEE INT C AC SPEE, P1691 *TEL IND ASS EL IN, 1996, TIAEIAIS641 THYSSEN J, 2001, P IEEE INT C AC SPEE, P681 TRANCOSO IM, 1986, P IEEE INT C ASSP TO, P1709 TRANCOSO IM, 1990, SPEECH COMMUN, V19, P389 *US DEP DEF, 1998, SPEC AN DIG CONV VOI VILLETTE S, 1999, P IEEE INT C AC SPEE, P249 YASUNAGA K, 2000, P ICASSP, P1503 YELDENER S, 1999, P ICASSP 99 MARCH AR, P481 2001, FIXED POINT SUBJECTI NR 51 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2004 VL 42 IS 3-4 BP 353 EP 371 DI 10.1016/j.specom.2003.10.004 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900006 ER PT J AU Seneff, S AF Seneff, S TI The use of subword linguistic modeling for multiple tasks in speech recognition SO SPEECH COMMUNICATION LA English DT Article AB Over the past several years, I have been conducting research on subword modeling in speech recognition. 
The research is most specifically aimed at the difficult task of identifying and characterizing unknown words, although the proposed framework also has utility in other recognition tasks such as phonological and prosodic modeling. The approach exploits the linguistic substructure of words by describing graphemic, phonemic, phonological, syllabic, and morphemic constraints through a set of context-free rules, and supporting the resulting parse trees with a corpus-trained probability model. A derived finite state transducer representation forms a natural means for integrating the trained model into a recognizer search. This paper describes several research projects I have been engaged in, together with my students and associates, aimed at exploring ways in which recognition tasks can benefit from such formal modeling of word substructure. These include phonological modeling, hierarchical duration modeling, sound-to-letter and letter-to-sound mapping, and automatic acquisition of unknown words in a speech understanding system. Results of several experiments in these areas are summarized here. (C) 2003 Elsevier B.V. All rights reserved. C1 MIT, Spoken Languages Syst Grp, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA. RP Seneff, S (reprint author), MIT, Spoken Languages Syst Grp, Comp Sci & Artificial Intelligence Lab, 200 Technol Sq,Room 643, Cambridge, MA 02139 USA. EM seneff@csail.mit.edu CR Allen J., 1987, TEXT SPEECH MITTALK BACCHIANI M, 1998, P ICSLP 98 SYDN AUST, V4, P1319 Bazzi I., 2001, P EUR 2001, P61 Chomsky N., 1968, SOUND PATTERN ENGLIS Chung G., 2003, P HLT NAACL EDM CAN, P32 CHUNG G, 1997, THESIS MIT DEP EL EN CHUNG G, 2001, THESIS MIT DEP EL EN CHUNG G, 2000, P ICSLP BEIJ CHIN OC, P520 CHUNG G, 1997, P ICSLP 97, P1475 CHUNG G, 2000, P ICSLP 2000 BEIJ CH, P266 CHUNG G, 2002, P ICSLP 02 DENV CO S, V3, P2061 CHUNG G, 1998, ICSLP 98 SYDN AUSTR, P935 CHURCH KW, 1983, THESIS MIT CAMBRIDGE COHEN MH, 1989, THESIS U CALIFORNIA COLE R, 1992, P ICSLP 92 BANFF CAN DUPONT S, 1990, P ICASSP ALB NM GABOVICH VY, 2002, THESIS MIT MAY GAUVAIN JL, 1993, P EUR, P125 Glass J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607261 GLASS JR, 1998, ICSLP 98 SYDN AUSTR, P1327 Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858 Greenberg S, 1999, SPEECH COMMUN, V29, P159, DOI 10.1016/S0167-6393(99)00050-3 Hayes B., 1995, METRICAL STRESS THEO Hetherington I.
L., 2001, P EUR 2001 AALB DENM, P1599 HETHERINGTON IL, 1994, THESIS MIT DEP EL EN HETHERINGTON IL, 1993, P 3 EUR C SPEECH COM KAHN D, 1976, THESIS MIT DEP LING Kucera H., 1967, COMPUTATIONAL ANAL P LAU R, 1998, ICSLP 98 SYDN AUSTR, P2443 Lau R., 1998, THESIS MIT CAMBRIDGE LAU R, 1997, P EUR 97 RHOD GREEC, P263 Meng H, 1996, SPEECH COMMUN, V18, P47, DOI 10.1016/0167-6393(95)00032-1 MENG HM, 1995, THESIS MIT CAMBRIDGE MOU X, 2001, P EUROSPEECH 2001 AA, P451 NGUYEN L, 1995, P ARPA SPOK LANG SYS, P693 ONISHI S, 2001, EUROSPEECH 2001 AALB, P693 PARMAR AD, 1997, THESIS MIT CAMBRIDGE Randolph M., 1989, THESIS MIT CAMBRIDGE Scalise S, 1986, GENERATIVE MORPHOLOG SENEFF S, 2000, P ICSLP 00 BEIJ CHIN, V2, P142 Seneff S., 1992, Computational Linguistics, V18 SENEFF S, 1998, ICSLP 98 SYDN AUSTR, P3321 SENEFF S, 2002, ISCA TUT RES WORKSH, P71 Seneff S., 2000, P ANLP NAACL 2000 SA, P1 SENEFF S, 1996, P IC SLP PHIL PA OCT, V1, P110, DOI 10.1109/ICSLP.1996.607049 WEINTRAUB M, 1989, P IEEE INT C ASSP GL, P699 WOODLAND PC, 1993, P EUR C SPEECH COMM, V3, P2207 ZUE V, 1991, P EUR C SPEECH COMM, V2, P537 ZUE VW, 1983, SPEECH COMMUN, V2, P181, DOI 10.1016/0167-6393(83)90023-7 NR 49 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2004 VL 42 IS 3-4 BP 373 EP 390 DI 10.1016/j.specom.2003.11.001 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900007 ER PT J AU Liu, CJ Yan, YH AF Liu, CJ Yan, YH TI Robust state clustering using phonetic decision trees SO SPEECH COMMUNICATION LA English DT Article DE phonetic decision tree; HMM; state clustering; two-level decision tree ID SPEECH RECOGNITION AB The widely used acoustic modeling approach of phonetic decision-tree based context clustering does not take full advantage of limited training data, and therefore fails to produce robust acoustic models. Two problems are identified: (1) all states clustered in a leaf node must share the same set of Gaussian components and mixture weights; no distinction is provided among those states; (2) rarely seen triphones in the training data might be poorly estimated and cause an adverse effect on decision-tree clustering. We propose a number of approaches to address these problems by more efficient use of training data. Specifically, (1) a two-level decision-tree approach for the first problem that ties Gaussian components and mixture weights separately, as they require different amounts of data to obtain robust estimation of their parameters; and (2) a two-stage decision-tree based clustering approach and a MAP-based approach for the second problem. Each approach gives a statistically significant reduction of the word error rate (WER) over the traditional approach. The systems combining all new approaches achieve the best performance, reducing the WERs of the baseline systems by 14-17% and the model sizes by 8-11% on the WSJ tasks. (C) 2003 Elsevier B.V. All rights reserved. C1 Oregon Hlth & Sci Univ, OGI, Sch Sci & Engn, Dept Comp Sci & Engn, Beaverton, OR 97006 USA. Chinese Acad Sci, Inst Acoust, Beijing 100080, Peoples R China. RP Liu, CJ (reprint author), Oregon Hlth & Sci Univ, OGI, Sch Sci & Engn, Dept Comp Sci & Engn, 20000 NW Walker Rd, Beaverton, OR 97006 USA.
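As background for the abstract above, the following minimal Python sketch (our own illustration, not the authors' code) shows the standard likelihood criterion used to grow phonetic decision trees: a yes/no question is scored by the gain in single diagonal-Gaussian log-likelihood when the pooled frames of a node are split.

import numpy as np

def gaussian_loglik(X):
    """Total log-likelihood of the rows of X under the diagonal Gaussian ML-fit to X."""
    n, d = X.shape
    var = X.var(axis=0) + 1e-8           # variance floor to avoid log(0)
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(X, answers):
    """Likelihood gain of splitting pooled frames X by a yes/no question.

    answers: boolean mask over rows of X, True where the question holds."""
    yes, no = X[answers], X[~answers]
    if len(yes) == 0 or len(no) == 0:
        return -np.inf
    return gaussian_loglik(yes) + gaussian_loglik(no) - gaussian_loglik(X)

The paper's two-level and two-stage refinements change what is tied at the resulting leaves and how rare triphones contribute, not this basic splitting criterion.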
EM cliu@cse.ogi.edu; yan@cse.ogi.edu CR BELLEGARDA JR, 1990, IEEE T ACOUST SPEECH, V38, P2033, DOI 10.1109/29.61531 Beulen K., 1998, P IEEE C ACOUSTICS S, V2, P805, DOI 10.1109/ICASSP.1998.675387 Brieman L, 1984, CLASSIFICATION REGRE CHOU PA, 1991, IEEE T PATTERN ANAL, V13, P340, DOI 10.1109/34.88569 CHOU W, 1998, P INT C SPOK LANG PR, P2203 DIGALAKIS V, 1996, IEEE T SPEECH AUDIO, V4, P284 GAUVAIN JL, 1992, P DARPA SPEECH NAT L, P272 GAUVAIN JL, 1991, P DARPA SPEECH NAT L, P185 GILLICK L, 1990, P ICASSP, V1, P97 Huang X. D., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90020-X Hwang MY, 1993, IEEE T SPEECH AUDI P, V1, P414 JELINEK F, 1990, PATTERN RECOGNITION KIM D, 1999, P EUROSPEECH SEPT, V3, P1335 KUHN R, 1995, P INT C AC SPEECH SI, V1, P552 Lazarides A., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607786 LEE AZ, 1997, P EUROSPEECH, V1, P19 LEE KF, 1990, AUTOMATIC SPEECH SPE, P347 LIU C, 2001, P 2001 INT C ART INT, V2, P568 LIU C, 2002, THESIS OREGON HLTH S LIU C, 1999, P EUROSPEECH, V4, P1703 LUO X, 1998, P ICSLP98, V1, P65 NOCK HJ, 1997, P EUROSPEECH, V1, P111 PAUL DB, 1997, P INT C AC SPEECH SI, P1487 PAUL DB, 1992, P ICSLP92 REICHL W, 1997, P IEEE WORKSH AUT SP, P185 REICHL W, 1999, P INT C AC SPEECH SI, P573 SINGH R, 1999, P INT C SPOK LANG PR, V1, P117 SJOLANDER K, 1987, P IEEE WORKSH AUT SP, P179 WU X, 1999, DARPA BROADC NEWS WO YOUNG SJ, 1994, ARPA HUM LANG TECHN, P286 YOUNG SJ, 1994, COMPUT SPEECH LANG, V8, P369, DOI 10.1006/csla.1994.1019 NR 31 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2004 VL 42 IS 3-4 BP 391 EP 408 DI 10.1016/j.specom.2003.12.003 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900008 ER PT J AU Abdou, S Scordilis, MS AF Abdou, S Scordilis, MS TI Beam search pruning in speech recognition using a posterior probability-based confidence measure SO SPEECH COMMUNICATION LA English DT Article DE confidence measure; speech recognition; pruning; discriminative training ID DISCRIMINATIVE UTTERANCE VERIFICATION; MODEL AB In this work we propose the early incorporation of confidence information in the decoding process of large vocabulary speech recognition. A confidence based pruning technique is used to guide the search to the most promising paths. We introduce a posterior probability-based confidence measure that can be estimated efficiently and synchronously from the available information during the search process. The accuracy of this measure is enhanced using a discriminative training technique whose objective is to maximize the discrimination between the correct and incorrect decoding hypotheses. For this purpose, phone-level confidence scores are combined to derive word level scores. Highly compact models that exhibit minimal degradation in performance are introduced. Experimental results using large speech corpora show that the proposed method improves both the decoding accuracy and the decoding time when compared to a baseline recognition system that uses a conventional search approach. Furthermore, the introduced confidence measures are well-suited for cross-task portability. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Miami, Dept Elect & Comp Engn, Coral Gables, FL 33124 USA. BBN Technol, Cambridge, MA 02138 USA. 
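A minimal Python sketch of the confidence-based pruning idea in the abstract above, under the simplifying assumption that the posterior of each active partial path can be approximated by its likelihood normalized over all active paths; the function and threshold names are illustrative, not the paper's.

import numpy as np
from scipy.special import logsumexp

def confidence_prune(path_logliks, log_threshold=np.log(1e-4)):
    """Return a boolean mask of active partial paths to keep."""
    log_posterior = path_logliks - logsumexp(path_logliks)
    return log_posterior >= log_threshold

# Example: the third path falls far below the others and is pruned.
keep = confidence_prune(np.array([-100.0, -102.0, -130.0]))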
RP Scordilis, MS (reprint author), Univ Miami, Dept Elect & Comp Engn, 1251 Mem Dr,Rm 514, Coral Gables, FL 33124 USA. EM m.scordilis@miami.edu CR AMTRUP JW, 1997, P EUROSPEECH 1997 RH, P2663 BENNETT C, 2002, INT C SPOK LANG PROC, P341 Bocchieri E., 1993, P ICASSP, V2, P692 Bourlard Ha, 1994, CONNECTIONIST SPEECH BOUWMAN G, 2000, P COST249 WORKSH VOI, P59 BRUSHTEIN D, 1996, IEEE T SPEECH AUDIO, V4, P240 Chase L., 1997, P EUR C SPEECH COMM, P815 Chase Lin, 1997, THESIS CARNEGIE MELL Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P281, DOI 10.1109/89.506931 Duda R. O., 1973, PATTERN CLASSIFICATI FINKE M, 1996, DARPA SWITCHBOARD AP, P29 Gillick L., 1997, P IEEE INT C AC SPEE, P879 GRAFF D, 1996, P ARPA WORKSH HUM LA, P50 Halberstadt A., 1998, THESIS MIT HERMAN SM, 1997, P 1997 WORKSH AUT SP, P331 Hernandez G.A., 2000, THESIS U POLITECNICA Huang X.D., 1990, P ICASSP, P689 Hwang M. Y., 1993, P ICASSP, P311 JIANG H, 2001, P EUR C SPEECH COMM, P2573 KAMPPARI S, 1999, THESIS MIT Koo MW, 1998, INT CONF ACOUST SPEE, P213 LEE CH, 2001, ICSP 2001, P356 LLEIDA E, 1996, P ICASSP 1996 IEEE I, V1, P507 Mak B., 1996, ICSLP 96, V4, P2005 Mangu L., 1999, P EUR C SPEECH COMM, P495 NEY H, 1992, IEEE T SIGNAL PROCES, V40, P272, DOI 10.1109/78.124938 PLACEWAY P, 1997, P DARPA SPEECH REC W, P95 Rahim MG, 1997, IEEE T SPEECH AUDI P, V5, P266, DOI 10.1109/89.568733 SANSEGUNDO R, 2001, P IEEE INT C AC SPEE, V1, P492 SCHAAF T, 1997, P EUROSPEECH 1997 RH, P827 STEINBISS V, 1994, P 1994 INT C SPOK LA, V1, P397 Sukkar RA, 1996, IEEE T SPEECH AUDI P, V4, P420, DOI 10.1109/89.544527 SUKKAR RA, 1996, P IEEE INT C AC SPEE, P518 SUKKAR RA, 1994, P IEEE INT C AC SPEE, V1, P381 THOMAS DE, 1993, IEEE DES TEST COMPUT, V10, P6, DOI 10.1109/54.232468 VERGYRI D, 2000, P IEEE INT C AC SPEE, P1782 WEINTRAUB M, 1997, P IEEE INT C AC SPEE, P887 Wessel F, 2001, IEEE T SPEECH AUDI P, V9, P288, DOI 10.1109/89.906002 Williams G., 1997, P EUROSPEECH 97, P1955 WILPON JG, 1990, IEEE T ACOUST SPEECH, V38, P1870, DOI 10.1109/29.103088 ZHAN P, 1996, P 1996 INT C SPOK LA, P836 NR 41 TC 24 Z9 27 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2004 VL 42 IS 3-4 BP 409 EP 428 DI 10.1016/j.specom.2003.11.002 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900009 ER PT J AU Prasad, VK Nagarajan, T Murthy, HA AF Prasad, VK Nagarajan, T Murthy, HA TI Automatic segmentation of continuous speech using minimum phase group delay functions SO SPEECH COMMUNICATION LA English DT Article DE minimum phase group delay functions; root cepstrum; speech segmentation AB In this paper, we present a new algorithm to automatically segment a continuous speech signal into syllable-like segments. The algorithm for segmentation is based on processing the short-term energy function of the continuous speech signal. The short-term energy function is a positive function and can therefore be processed in a manner similar to that of the magnitude spectrum. In this paper, we employ an algorithm, based on group delay processing of the magnitude spectrum, to determine segment boundaries in the speech signal. The experiments have been carried out on the TIMIT and TIDIGITS databases. The error in segment boundary is less than or equal to 20% of syllable duration for 70% of the syllables. In addition to true segments, an overall 5% rate of insertions and deletions has also been observed. (C) 2003 Elsevier B.V.
All rights reserved. C1 Indian Inst Technol, Dept Comp Sci & Engn, Madras 600036, Tamil Nadu, India. RP Nagarajan, T (reprint author), Indian Inst Technol, Dept Comp Sci & Engn, IIT Campus, Madras 600036, Tamil Nadu, India. EM raju@lantana.iitm.ernet.in; hema@lantana.tenet.res.in CR BERKHOUT AJ, 1974, GEOPHYS PROSPECT, P683 BERKHOUT AJ, 1973, GEOPHYSICS, V38, P657, DOI 10.1190/1.1440365 Fisher W. M., 1986, P DARPA WORKSH SPEEC, P93 Ganapathiraju A, 2001, IEEE T SPEECH AUDI P, V9, P358, DOI 10.1109/89.917681 Greenberg S, 1999, SPEECH COMMUN, V29, P159, DOI 10.1016/S0167-6393(99)00050-3 LEONARD RG, 1984, P ICASSP, V3, P42 MERMELSTEIN P, 1975, J ACOUST SOC AM, V58, P880, DOI 10.1121/1.380738 MURTHY HA, 1991, SPEECH COMMUN, V10, P209, DOI 10.1016/0167-6393(91)90011-H MURTHY HA, 1997, NAT C COMM, P180 MURTHY HA, 1992, THESIS INDIAN I TECH NAGARAJAN T, 2003, IEE ELECT LETT, V39, P941 NAGARAJAN T, 2001, 6 BIENN C P SPEECH C, P95 RABINER LR, 1982, J ACOUST SOC AM, V71, P1588, DOI 10.1121/1.387813 SARGENT DC, 1974, J ACOUST SOC AM, V45, P880 VANHEMERT JP, 1991, IEEE T SIGNAL PROCES, V39, P1008, DOI 10.1109/78.80941 Wilpon J. G., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) YEGNANARAYANA B, 1984, IEEE T ACOUST SPEECH, V32, P610, DOI 10.1109/TASSP.1984.1164365 YEGNANARAYANA B, 1992, IEEE T SIGNAL PROCES, V40, P2281, DOI 10.1109/78.157227 NR 18 TC 18 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2004 VL 42 IS 3-4 BP 429 EP 446 DI 10.1016/j.specom.2003.12.002 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900010 ER PT J AU Zhang, JS Hirose, K AF Zhang, JS Hirose, K TI Tone nucleus modeling for Chinese lexical tone recognition SO SPEECH COMMUNICATION LA English DT Article DE tone recognition; underlying-target; articulatory transition; tone-nucleus ID CONTINUOUS MANDARINE SPEECH; LARGE VOCABULARY; INFORMATION; ALIGNMENT; LANGUAGE AB This paper presents a new scheme to deal with variations in fundamental frequency (F0) contours for lexical tone recognition in continuous Chinese speech. We divide the F0 contour of a syllable into a tone nucleus and adjacent articulatory transitions, and use only the acoustic features of the tone nucleus for tone recognition. The tone nucleus of a syllable is assumed to be the target F0 of the associated lexical tone, and usually conforms more closely to the standard tone pattern than the articulatory transitions do. A tone nucleus can be detected from a syllable F0 contour by a two-step algorithm. First, the syllable F0 contour is segmented into several linear F0 loci that serve as candidates for the tone nucleus, using the segmental K-means segmentation algorithm. Then, the tone nucleus is chosen from the set of candidates by a predictor based on linear discriminant analysis. Speaker-dependent tone recognition experiments using tonal HMMs showed that our new approach achieved an improvement of up to 6% in tone recognition rate compared with a conventional one. This indicates not only that the tone nucleus keeps important discriminant information for the lexical tones, but also that our tone-nucleus based tone recognition algorithm works properly. (C) 2004 Elsevier B.V. All rights reserved. C1 Univ Tokyo, Sch Engn, Dept Informat & Commun Engn, Bunkyo Ku, Tokyo 1138656, Japan.
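To illustrate the first step of the two-step algorithm described in the abstract above, here is a small Python sketch that segments an F0 contour into a fixed number of linear loci. It uses an exhaustive dynamic-programming least-squares segmentation as a stand-in for the segmental K-means algorithm the paper actually uses, and all names are ours.

import numpy as np

def linear_sse(y, i, j):
    """Squared error of a least-squares line fit to y[i:j] (j exclusive)."""
    x = np.arange(i, j, dtype=float)
    A = np.column_stack([x, np.ones_like(x)])
    coef, *_ = np.linalg.lstsq(A, y[i:j], rcond=None)
    return float(((A @ coef - y[i:j]) ** 2).sum())

def segment_f0(f0, n_segments, min_len=3):
    """Boundaries of n_segments linear F0 loci minimising the total SSE."""
    T = len(f0)
    cost = np.full((n_segments + 1, T + 1), np.inf)
    back = np.zeros((n_segments + 1, T + 1), dtype=int)
    cost[0, 0] = 0.0
    for s in range(1, n_segments + 1):
        for t in range(s * min_len, T + 1):
            for k in range((s - 1) * min_len, t - min_len + 1):
                c = cost[s - 1, k] + linear_sse(f0, k, t)
                if c < cost[s, t]:
                    cost[s, t], back[s, t] = c, k
    bounds, t = [T], T           # backtrack the optimal boundaries
    for s in range(n_segments, 0, -1):
        t = back[s, t]
        bounds.append(t)
    return bounds[::-1]          # e.g. [0, b1, b2, T] for three loci

# Example: a rise-fall contour split into three candidate loci.
f0 = np.concatenate([np.linspace(120, 180, 20), np.linspace(180, 110, 20)])
print(segment_f0(f0, 3))

The second step, choosing which locus is the tone nucleus, would then apply a trained linear-discriminant predictor to features of each candidate.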
RP Zhang, JS (reprint author), ATR, Spoken Language Translat Labs, 2-2-2 Kansai Sci City, Kyoto 6190288, Japan. EM jinsong.zhang@atr.co.jp CR Chao Yuen Ren, 1968, GRAMMAR SPOKEN CHINE CHEN CJ, 2001, P ICASSP CHEN SH, 1995, IEEE T SPEECH AUDI P, V3, P146 CHENG YB, 1990, SPEECH SIGNAL PROCES Fujisaki H., 1997, COMPUTING PROSODY CO, P27 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 GARDING E, 1997, ESCA WORKSH INT THEO, P145 HIROSE K, 1995, P 45 EUR C SPEECH CO, V1, P31 HIROSE K, 1999, ICASSP99 HOWIE JM, 1974, PHONETICA, V30, P129 HSIE CT, 1988, J IEICE D, V71, P661 Lee LS, 1993, IEEE T SPEECH AUDI P, V1, P158, DOI 10.1109/89.222876 LIN MC, 1995, CHINA ACTA ACUSTICA, V20, P437 LIU J, 1999, P EUR 99 BUD HUNG, V2, P891 Ljolje A., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90025-8 MILTON JS, 1992, INTRO PROBABILITY ST NIEMANN H, 1998, P SPECOM WORKSH ST P, P17 Rabiner L, 1993, FUNDAMENTALS SPEECH ROSE PJ, 1988, PACIFIC LINGUISTIC C, V104, P55 Secrest B. G., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing Shih C., 1988, WORKING PAPERS CORNE, V3, P83 SHIH CL, 1990, STUDIES CHINESE PHON, P81 STOLCKE A, 1999, EUR 99 BUD HUNG SEPT WANG C, 1998, P ICSLP98 SYDN AUSTR WANG CJ, 1990, P REL MAINT S, P221 Wang HM, 1997, IEEE T SPEECH AUDI P, V5, P195 WANG YR, 1994, J ACOUST SOC AM, V96, P2637, DOI 10.1121/1.411274 Webb A, 1999, STAT PATTERN RECOGNI WHALEN DH, 1992, PHONETICA, V49, P25 Wightman CW, 1994, IEEE T SPEECH AUDI P, V2, P469, DOI 10.1109/89.326607 Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086 Xu Y, 1997, J PHONETICS, V25, P61, DOI 10.1006/jpho.1996.0034 XU Y, 1994, J ACOUST SOC AM, V95, P2240, DOI 10.1121/1.408684 XU Y, 1997, P ESCA WORKSH INT TH, P337 Xu Y, 1998, PHONETICA, V55, P179, DOI 10.1159/000028432 Xu Y, 2001, SPEECH COMMUN, V33, P319, DOI 10.1016/S0167-6393(00)00063-7 YANG WJ, 1988, IEEE T ACOUST SPEECH, V36, P988, DOI 10.1109/29.1620 ZHANG JS, 2000, ICASSP2000 IST TURK ZHANG JS, 1998, P ICSLP98 SYDN, P703 ZHANG JS, 1999, P EUR 99 BUD HUNG SE, P747 NR 40 TC 21 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2004 VL 42 IS 3-4 BP 447 EP 466 DI 10.1016/j.specom.2004.01.001 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900011 ER PT J AU Kim, DK Kim, NS AF Kim, DK Kim, NS TI Rapid online adaptation using speaker space model evolution SO SPEECH COMMUNICATION LA English DT Article DE speaker space model; prior evolution; latent variable model; Quasi-Bayes estimate; online adaptation; rapid speaker adaptation ID HIDDEN MARKOV-MODELS; SPEECH RECOGNITION; MAXIMUM-LIKELIHOOD; LINEAR-REGRESSION AB This paper presents a new approach to online adaptation of continuous density hidden Markov models (CDHMMs) with a small amount of adaptation data, based on speaker space model (SSM) evolution. The SSM, which characterizes the a priori knowledge of the training speakers, is effectively described in terms of latent variable models such as factor analysis or probabilistic principal component analysis. The SSM provides various sources of information, such as the correlation information, the prior density, and the prior knowledge of the CDHMM parameters, that are very useful for rapid online adaptation.
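Before the online evolution described next, the underlying speaker-space idea can be sketched in a simplified batch form: principal directions of training-speaker mean supervectors span the space, and a new speaker is placed in it by projection. This eigenvoice-style Python sketch is ours and stands in for the paper's factor-analysis/PPCA formulation and its quasi-Bayes updates.

import numpy as np

def build_speaker_space(supervectors, k):
    """supervectors: S x D matrix, one concatenated-HMM-means vector per training speaker."""
    mu = supervectors.mean(axis=0)
    _, _, Vt = np.linalg.svd(supervectors - mu, full_matrices=False)
    return mu, Vt[:k]                       # k orthonormal basis rows (k x D)

def adapt_speaker(mu, W, adaptation_supervector):
    """Project the new speaker into the space, then reconstruct adapted means."""
    z = W @ (adaptation_supervector - mu)   # speaker coordinates (length k)
    return mu + z @ W                       # adapted mean supervector

# Example: 20 training speakers, 50-dimensional supervectors, 5 bases.
rng = np.random.default_rng(0)
mu, W = build_speaker_space(rng.normal(size=(20, 50)), k=5)
adapted = adapt_speaker(mu, W, rng.normal(size=50))

Constraining adaptation to a low-dimensional speaker space is what makes the estimate stable from very little data; the paper additionally maintains a prior over the coordinates and evolves it as data arrive.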
We design the SSM evolution based on the quasi-Bayes estimation technique which incrementally updates the hyperparameters of the SSM and the CDHMM parameters simultaneously. In a series of speaker adaptation experiments on the continuous digit and large vocabulary recognition tasks, we demonstrate that the proposed approach not only achieves a good performance for a small amount of adaptation data but also maintains a good asymptotic convergence property as the data size increases. (C) 2004 Elsevier B.V. All rights reserved. C1 Elect & Telecommun Res Inst, Comp & Software Res Lab, Taejon 305350, South Korea. Seoul Natl Univ, Sch Elect Engn, Seoul 151742, South Korea. Seoul Natl Univ, INMC, Seoul 151742, South Korea. RP Kim, DK (reprint author), Chonnam Natl Univ, Dept Elect Comp & Informat Engn, 300 Yongbong Dong, Gwangju 500757, South Korea. EM dkkim@etri.re.kr; nkim@snu.ac.kr CR BOTTERWECK H, 2001, P IEEE INT C AC SPEE CHEN KT, 2001, P IEEE INT C AC SPEE CHEN KT, 2000, P ICSLP, V3, P742 Chien JT, 1999, IEEE T SPEECH AUDI P, V7, P656 Chien JT, 2002, IEEE T SPEECH AUDI P, V10, P268, DOI 10.1109/TSA.2002.800555 Chou W., 1999, P ICASSP, P1 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Gales MJF, 2000, IEEE T SPEECH AUDI P, V8, P417, DOI 10.1109/89.848223 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Huo Q, 1998, IEEE T SPEECH AUDI P, V6, P386 Huo Q, 1997, IEEE T SPEECH AUDI P, V5, P161 Huo Q, 2001, IEEE T SPEECH AUDI P, V9, P388 Jolliffe I. T., 1986, PRINCIPAL COMPONENT Kim DK, 2004, SPEECH COMMUN, V42, P59, DOI 10.1016/j.specom.2003.08.001 KIM DK, 2003, P IEEE INT C AC SPEE, P304 KIM DK, 2002, P INT C SPOK LANG PR, P1393 KIM DK, 2001, ADAPTATION METHODS S, P25 Kim N. S., 2000, P INT C SPOK LANG PR, P734 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 KUHN R, 2001, ADAPTATION METHODS S, P33 Lee CH, 1998, SPEECH COMMUN, V25, P29, DOI 10.1016/S0167-6393(98)00028-4 Lee CH, 2000, P IEEE, V88, P1241 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 RUBIN DB, 1982, PSYCHOMETRIKA, V47, P69, DOI 10.1007/BF02293851 Tipping ME, 1999, NEURAL COMPUT, V11, P443, DOI 10.1162/089976699300016728 Woodland P. C., 2001, ADAPTATION METHODS S, P11 YOUNG SJ, 1994, P ARPA HUM LANG TECH, V1, P286 ZAVALIAGKOS G, 1995, THESIS NE U BOSTON NR 28 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2004 VL 42 IS 3-4 BP 467 EP 478 DI 10.1016/j.specom.2004.01.002 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 814YB UT WOS:000221008900012 ER PT J AU Muralishankar, R Ramakrishnan, AG Prathibha, P AF Muralishankar, R Ramakrishnan, AG Prathibha, P TI Modification of pitch using DCT in the source domain SO SPEECH COMMUNICATION LA English DT Article DE linear prediction; concatenative synthesis; residual signal; resampling; 3 dB bandwidth; spectral broadening ID SINUSOIDAL REPRESENTATION; SPEECH AB In this paper, we propose a novel algorithm for pitch modification. The linear prediction residual is obtained from pitch synchronous frames by inverse filtering the speech signal. Then the discrete cosine transform (DCT) of these residual frames is taken. Based on the desired factor of pitch modification, the dimension of the DCT coefficients of the residual is modified by truncating or zero padding, and then the inverse discrete cosine transform is obtained. 
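A minimal Python sketch of the residual-domain steps just described (including the forward filtering the abstract goes on to mention), assuming a single pitch-synchronous frame is already extracted; it omits the pole-radius broadening refinement discussed later in the abstract, and all function names are ours.

import numpy as np
from scipy.fft import dct, idct
from scipy.linalg import solve_toeplitz
from scipy.signal import lfilter

def lpc(frame, order=12):
    """Autocorrelation-method LP analysis filter [1, -a1, ..., -ap]."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz((r[:order], r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def modify_pitch_frame(frame, factor, order=12):
    """Scale one pitch period by 1/factor (factor > 1 raises the pitch)."""
    a = lpc(frame, order)
    residual = lfilter(a, [1.0], frame)        # inverse (analysis) filtering
    c = dct(residual, norm="ortho")
    new_len = int(round(len(frame) / factor))
    if new_len <= len(c):
        c = c[:new_len]                        # truncate DCT coefficients
    else:
        c = np.pad(c, (0, new_len - len(c)))   # zero-pad DCT coefficients
    residual_mod = idct(c, norm="ortho")
    return lfilter([1.0], a, residual_mod)     # forward (synthesis) filtering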
This period-modified residual signal is then forward-filtered to obtain the pitch-modified speech. The mismatch between the positions of the harmonics of the pitch-modified signal and the LP spectrum of the original signal introduces gain variations, which are more pronounced in the case of female speech [Proc. Int. Conf. on Acoust. Speech and Signal Process. (1997) 1623]. This is minimised by modifying the radii of the poles of the filter to broaden the otherwise peaky linear predictive spectrum. The modified LP coefficients are used for both inverse and forward filtering. This pitch modification scheme is used in our concatenative speech synthesis system for Kannada. The technique has also been successfully applied to creating interrogative sentences from affirmative sentences. The modified speech has been evaluated in terms of intelligibility, distortion and speaker identity. Results indicate that our scheme produces acceptable speech in terms of all these parameters for the pitch-change factors required for our speech synthesis work. (C) 2003 Elsevier B.V. All rights reserved. C1 Indian Inst Sci, Dept Elect Engn, Bangalore 560012, Karnataka, India. RP Muralishankar, R (reprint author), Indian Inst Sci, Dept Elect Engn, Bangalore 560012, Karnataka, India. EM sripad@ee.iisc.ernet.in RI Ramakrishnan, A/B-8317-2013 OI Ramakrishnan, A/0000-0002-3646-1955 CR ABE M, 1996, SPEAKING STYLES STAT Ahmed N., 1975, ORTHOGONAL TRANSFORM ANSARI R, 1997, P INT C AC SPEECH SI, P1623 ANSSI R, 1999, THESIS TAMPERE U TEC Charpentier F. J., 1986, P ICASSP, P2015 DELOSGALANES FMG, 1995, P ICASSP, P636 EDGINGTON M, 1996, ICSLP96 GEORGE EB, 1992, J AUDIO ENG SOC, V40, P497 Kleijn W. B., 1995, SPEECH CODING SYNTHE LIBERMAN M, 1994, COMPUTER SPEECH SYNT MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z PORTNOFF MR, 1981, IEEE T ACOUST SPEECH, V29, P374, DOI 10.1109/TASSP.1981.1163581 QUATIERI TF, 1986, IEEE T ACOUST SPEECH, V34, P1449, DOI 10.1109/TASSP.1986.1164985 RABINER L, 1975, DIGITAL PROCESSING S Rao K. R., 1990, DISCRETE COSINE TRAN ROE DB, 1994, VOICE COMMUNICATION SYRDAL A, 1995, APPL SPEECH TECHNOLO VERGIN R, 1997, P 1997 IEEE INT C AC, V2, P947 NR 20 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2004 VL 42 IS 2 BP 143 EP 154 DI 10.1016/j.specom.2003.05.001 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 780JP UT WOS:000189377800001 ER PT J AU Janse, E AF Janse, E TI Word perception in fast speech: artificially time-compressed vs. naturally produced fast speech SO SPEECH COMMUNICATION LA English DT Article DE timing; prosody; perception; word recognition; fast speech; time compression; segmental reduction ID LEXICAL ACCESS; PHONOLOGICAL VARIATION; FORMANT MOVEMENTS; VOWEL REDUCTION; DUTCH VOWELS; INTELLIGIBILITY; ENGLISH; RECOGNITION; INFORMATION; INFERENCE AB Natural fast speech differs from normal-rate speech with respect to its temporal pattern. Previous results showed that word intelligibility of heavily artificially time-compressed speech could not be improved by making its temporal pattern more similar to that of natural fast speech.
This might have been due to the extrapolation of timing rules for natural fast speech to rates that are much faster than can be attained by human speakers. The present study investigates whether, at a speech rate that human speakers can attain, artificially time-compressed speech is easier to process if its timing pattern is similar to that of naturally produced fast speech. Our first experiment suggests, however, that word processing speed was slowed down, relative to linear compression. In a second experiment, word processing of artificially time-compressed speech was compared with processing of naturally produced fast speech. Even when naturally produced fast speech is perfectly intelligible, its less careful articulation, combined with the changed timing pattern, slows down processing, relative to linearly time-compressed speech. Furthermore, listeners preferred artificially time-compressed speech over naturally produced fast speech. These results suggest that linearly time-compressed speech has both a temporal and a segmental advantage over natural fast speech. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Utrecht, Utrecht Inst Linguist OTS, NL-3512 JK Utrecht, Netherlands. RP Janse, E (reprint author), Univ Utrecht, Utrecht Inst Linguist OTS, Trans 10, NL-3512 JK Utrecht, Netherlands. EM esther.janse@let.uu.nl RI Janse, Esther/E-3967-2012 CR ANDRUSKI JE, 1994, COGNITION, V52, P163, DOI 10.1016/0010-0277(94)90042-6 BARD E, 2000, P WORKSH SPOK WORD A, P3 Bard EG, 2001, LANG COGNITIVE PROC, V16, P731 Bard EG, 2000, J MEM LANG, V42, P1, DOI 10.1006/jmla.1999.2667 CHO TH, 2001, THESIS U CALIFORNIA Covell M., 1998, P IEEE INT C AC SPEE CUTLER A, 1979, SENTENCE PROCESSING Cutler A., 2000, P 6 INT C SPOK LANG, V1, P593 Cutler A, 2001, LANG SPEECH, V44, P171 CUTLER A, 1984, ATTENTION PERFORM, V10, P183 CUTLER A, 1987, COGNITIVE PSYCHOL, V19, P141, DOI 10.1016/0010-0285(87)90010-7 DEJONG KJ, 1995, J ACOUST SOC AM, V97, P491, DOI 10.1121/1.412275 DELOGU C, 1992, 2589 PRIT SAM Fowler C.A., 1981, J SPEECH HEAR RES, V46, P127 Gaskell MG, 1998, J EXP PSYCHOL HUMAN, V24, P380, DOI 10.1037/0096-1523.24.2.380 Gaskell MG, 1996, J EXP PSYCHOL HUMAN, V22, P144, DOI 10.1037//0096-1523.22.1.144 GAY T, 1978, J ACOUST SOC AM, V63, P223, DOI 10.1121/1.381717 HAWKINS S, 1994, J PHONETICS, V22, P493 He L., 2001, P C MULT OTT, P382 Horton WS, 1996, COGNITION, V59, P91, DOI 10.1016/0010-0277(96)81418-1 *ITU, ITUP800 JANSE E, 2001, P 11 EUR C SPEECH CO, V2, P1407 Janse E, 2003, SPEECH COMMUN, V41, P287, DOI 10.1016/S0167-6393(02)00130-9 KOHLER KJ, 1990, NATO ADV SCI I D-BEH, V55, P69 Lehiste I., 1970, SUPRASEGMENTALS Lindblom B., 1990, SPEECH PRODUCTION SP LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 MARSLENWILSON W, 1995, LANG COGNITIVE PROC, V10, P285, DOI 10.1080/01690969508407097 Max L, 1997, J SPEECH LANG HEAR R, V40, P1097 MEHTA G, 1988, LANG SPEECH, V31, P135 MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 NAKATANI LH, 1973, J ACOUST SOC AM, V53, P1083, DOI 10.1121/1.1913428 Nix A. J., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1011 Norris D, 2000, BEHAV BRAIN SCI, V23, P299, DOI 10.1017/S0140525X00003241 PAVLOVIC CV, 1990, J ACOUST SOC AM, V87, P373, DOI 10.1121/1.399258 Perkell J.
S., 1997, HDB PHONETIC SCI, P333 PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183 PISONI DB, 1987, TEXT SPEECH MITALK S Pisoni D.B., 1997, PROGR SPEECH SYNTHES PORT RF, 1981, J ACOUST SOC AM, V69, P262, DOI 10.1121/1.385347 QUENE H, 1999, P 14 INT C PHON SCI, P1831 SEIDENBERG MS, 1979, J EXP PSYCHOL-HUM L, V5, P546, DOI 10.1037//0278-7393.5.6.546 SLOWIACZEK LM, 1990, LANG SPEECH, V33, P47 SOTILLO C, 1998, P SPOSS, P109 Utman JA, 2001, BRAIN LANG, V79, P444, DOI 10.1006/brln.2001.2500 VANBERGEM DR, 1993, SPEECH COMMUN, V12, P1, DOI 10.1016/0167-6393(93)90015-D van Bezooijen Renee, 1997, HDB STANDARDS RESOUR, P481 van Heuven V. J., 1985, J ACOUST SOC AM, V78, pS21, DOI 10.1121/1.2022696 VANDONSELAAR W, 1994, LANG SPEECH, V37, P375 VANLEYDEN K, 1996, LINGUISTICS NETHERLA van Santen J. P. H., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1004 VANSON RJJH, 1992, J ACOUST SOC AM, V92, P121, DOI 10.1121/1.404277 VANSON RJJH, 1990, J ACOUST SOC AM, V88, P1683, DOI 10.1121/1.400243 WHALEN DH, 1991, PERCEPT PSYCHOPHYS, V50, P351, DOI 10.3758/BF03212227 WINGFIELD A, 1984, J SPEECH HEAR RES, V27, P128 WINGFIELD A, 1975, STRUCTURE PROCESS SP NR 56 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2004 VL 42 IS 2 BP 155 EP 173 DI 10.1016/j.specom.2003.07.001 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 780JP UT WOS:000189377800002 ER PT J AU van Dinther, R Kohlrausch, A Veldhuis, R AF van Dinther, R Kohlrausch, A Veldhuis, R TI A method for analysing the perceptual relevance of glottal-pulse parameter variations SO SPEECH COMMUNICATION LA English DT Article DE speech synthesis; glottal-pulse; LF parameters; perceptual distance measure ID VOICE QUALITY; THRESHOLDS; LOUDNESS; MODEL AB This paper describes a method for analysing the perceptual relevance of parameter variations of the Liljencrants-Fant (LF) model for the glottal pulse. A perceptual distance measure based on excitation patterns was developed and evaluated in order to predict audibility discrimination thresholds for small changes to the R-parameters of the LF model. For a number of R-parameter sets, taken from real source data, an approximation Q of the distance measure was used to compute the directions of maximal and minimal perceptual sensitivity. In addition, we show that the inverse of Q can be used to calculate the amount of variation of the R-parameters to reach a just noticeable difference (JND). The results were evaluated in a listening test in which JNDs for R-parameter changes were measured. Discrimination thresholds were fairly constant across all tested conditions and corresponded on average to an excitation pattern distance of 4.3 dB. An additional error analysis demonstrated that Q is a fair approximation of the perceptual distance measure for small variations of the R-parameters up to one just noticeable difference. (C) 2003 Elsevier B.V. All rights reserved. C1 Eindhoven Univ Technol, Dept Technol Management, NL-5600 MB Eindhoven, Netherlands. Philips Res Labs, NL-5656 AA Eindhoven, Netherlands. Univ Twente, Fac Engn, Chair Signals & Syst, NL-7500 AE Enschede, Netherlands. RP van Dinther, R (reprint author), Univ Cambridge, Ctr Neurol Basis Hearing, Dept Physiol, Downing St, Cambridge CB2 3EG, England. 
EM ralph.van-dinther@mrc-cbu.cam.ac.uk CR CAMPBELL S, 1998, P JOINT M 16 INT C A, V4, P2603 CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 DOVAL B, 1997, P EUROSPEECH97, V1, P533 DOVAL B, 1997, P ICASSP 97, P1295 Fant Gunnar, 1985, STL QPSR, V4, P1 Gobl C., 1989, STL QPSR, P9 GOBL C, 1992, SPEECH COMMUN, V11, P481, DOI 10.1016/0167-6393(92)90055-C HENRICH N, 2003, IN PRESS J VOICE KARLSSON I, 1996, TMH QPSR, V2, P143 KARLSSON I, 1990, P ICSLP90 KOB JAP, P69 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 LEVITT H, 1971, J ACOUST SOC AM, V49, P971 Moore B.C.J., 1986, FREQUENCY SELECTIVIT Moore BCJ, 1997, J AUDIO ENG SOC, V45, P224 PLOMP R, 1969, J ACOUST SOC AM, V46, P409, DOI 10.1121/1.1911705 Rao P, 2001, J ACOUST SOC AM, V109, P2085, DOI 10.1121/1.1354986 Scherer RC, 1998, J VOICE, V12, P21, DOI 10.1016/S0892-1997(98)80072-6 VANDINTHER R, 2001, P EUROSPEECH 01, V2, P1507 Veldhuis RNJ, 1998, INT CONF ACOUST SPEE, P873, DOI 10.1109/ICASSP.1998.675404 ZWICKER E, 1965, PSYCHOL REV, V72, P3, DOI 10.1037/h0021703 NR 20 TC 4 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2004 VL 42 IS 2 BP 175 EP 189 DI 10.1016/j.specom.2003.07.002 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 780JP UT WOS:000189377800003 ER PT J AU Seward, A AF Seward, A TI A fast HMM match algorithm for very large vocabulary speech recognition SO SPEECH COMMUNICATION LA English DT Article DE HMM; acoustic match; parallel; large vocabulary speech recognition ID SEARCH AB The search over context-dependent continuous density Hidden Markov Models (HMMs), including state-likelihood computations, accounts for a considerable part of the total decoding time for a speech recognizer. This is especially apparent in tasks that incorporate large vocabularies and long-dependency n-gram grammars, since these impose a high degree of context dependency and HMMs have to be treated differently in each context. This paper proposes a strategy for acoustic match of typical continuous density HMMs, decoupled from the main search and conducted as a separate component suited for parallelization. Instead of computing a large amount of probabilities for different alignments of each HMM, the proposed method computes all alignments, but more efficiently. Each HMM is matched only once against any time interval, and thus may be instantly looked up by the main search algorithm as required. In order to accomplish this in real time, a fast time-warping match algorithm is proposed, exploiting the specifics of the 3-state left-to-right HMM topology without skips. In proof-of-concept tests, using a highly optimized SIMD-parallel implementation, the algorithm was able to perform time-synchronous decoupled evaluation of a triphone acoustic model, with maximum phone duration of 40 frames, with a real-time factor of 0.83 on one of the CPUs of a Dual-Xeon 2 GHz workstation. The algorithm was able to compute the likelihood for 636,000 locally optimal HMM paths/second, with full state evaluation. (C) 2003 Elsevier B.V. All rights reserved. C1 Royal Inst Technol, KTH, Ctr Speech Technol, SE-10044 Stockholm, Sweden. RP Seward, A (reprint author), Royal Inst Technol, KTH, Ctr Speech Technol, Drottning Kristinas V 31, SE-10044 Stockholm, Sweden. 
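The inner computation that the match component in the abstract above evaluates exhaustively can be sketched as follows: the best-path (Viterbi) log-likelihood of a 3-state left-to-right HMM without skips over one time interval, in Python with illustrative names. The paper's actual contribution, sharing this work across all intervals and vectorising it SIMD-style, is not reproduced here.

import numpy as np

def viterbi_3state(log_obs, log_self, log_next):
    """log_obs: T x 3 state observation log-likelihoods.
    log_self / log_next: per-state self-loop and forward transition scores."""
    T = log_obs.shape[0]
    delta = np.full(3, -np.inf)
    delta[0] = log_obs[0, 0]                 # must start in the first state
    for t in range(1, T):
        stay = delta + log_self
        move = np.full(3, -np.inf)
        move[1:] = delta[:-1] + log_next[:-1]
        delta = np.maximum(stay, move) + log_obs[t]
    return delta[2]                          # must end in the last state

# Example with uniform transitions and random frame scores:
rng = np.random.default_rng(0)
score = viterbi_3state(np.log(rng.random((10, 3))),
                       np.log([0.5] * 3), np.log([0.5] * 3))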
EM alec@speech.kth.se CR AIYER A, 2000, IEEE INT C AC SPEECH, V3, P1519 AUBERT XL, 2000, P AUT SPEECH REC WOR, V1, P91 Bahl LR, 1993, IEEE T SPEECH AUDI P, V1, P59, DOI 10.1109/89.221368 Bakis R., 1976, 91 M AC SOC AM Bellman R. E., 1957, DYNAMIC PROGRAMMING CHUNG SH, 1999, 13 INT 10 S PAR DIST, P45 Elenius K., 2000, International Journal of Speech Technology, V3, DOI 10.1023/A:1009641213324 FORNEY GD, 1973, P IEEE, V61, P268, DOI 10.1109/PROC.1973.9030 Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P152, DOI 10.1109/89.748120 HART P, 1968, IEEE T SYST SCI CYB, V2, P100, DOI DOI 10.1109/TSSC.1968.300136 Jelinek F., 1969, IBM Journal of Research and Development, V13 KANTHAK S, 2000, P IEEE INT C AC SPEE, P1531 KENNOVIN GD, 1993, ENDOTHELIUM, V1, P1 KNILL KM, 1996, P INT C AC SPEECH SI, V1, P522 KORF RE, 1985, ARTIF INTELL, V27, P97, DOI 10.1016/0004-3702(85)90084-0 LEE KF, 1990, IEEE T ACOUST SPEECH, V38, P599, DOI 10.1109/29.52701 LINDBERG B, 2000, P ICSLP 2000 BEIJ, V3, P370 LJOLJE A, 1991, IEEE T SIGNAL PROCES, V39, P29, DOI 10.1109/78.80762 LJOLJE M, 2000, P NIST LARG VOC CONV MITCHUM CC, 1995, NEUROPSYCHOL REHABIL, V5, P1, DOI 10.1080/09602019508520173 Mohri M, 2002, COMPUT SPEECH LANG, V16, P69, DOI 10.1006/csla.2001.0184 Ney H, 1999, IEEE SIGNAL PROC MAG, V16, P64, DOI 10.1109/79.790984 Odell JJ, 1995, THESIS CAMBRIDGE U PAUL DB, 1992, IEEE C AC SPEECH SIG, V1, P25 RENALS S, 1994, CUEDFINFENGTR186 Renals S, 1999, IEEE T SPEECH AUDI P, V7, P542, DOI 10.1109/89.784107 Robinson T, 1998, INT CONF ACOUST SPEE, P829, DOI 10.1109/ICASSP.1998.675393 SAKOE H, 1979, IEEE T ACOUST SPEECH, V27, P588, DOI 10.1109/TASSP.1979.1163310 SEWARD A, 2001, P EUROSPEECH, P1607 SILVERMAN HF, 1990, IEEE ASSP MAGAZINE, P6 SIXTUS A, 2000, P IEEE INT C AC SPEE, P1671 VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 Young S., 2002, HTK BOOK, P3 NR 33 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2004 VL 42 IS 2 BP 191 EP 206 DI 10.1016/j.specom.2003.08.005 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 780JP UT WOS:000189377800004 ER PT J AU Bechet, F Gorin, AL Wright, JH Tur, DH AF Bechet, F Gorin, AL Wright, JH Tur, DH TI Detecting and extracting named entities from spontaneous speech in a mixed-initiative spoken dialogue context: How May I Help You?(sm,tm) SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; spoken dialogue systems; spoken language understanding; named entities ID RECOGNITION AB The understanding module of a spoken dialogue system must extract, from the speech recognizer output, the kind of request expressed by the caller (the call type) and its parameters (numerical expressions, time expressions or proper-names). Such expressions are called Named Entities and their definitions can be either generic or linked to the dialogue application domain. Detecting and extracting such Named Entities within a mixed-initiative dialogue context like How May I Help You?(sm,tm) (HMIHY) is the subject of this study. 
After reviewing standard methods based on hand-written grammars and statistical tagging, we propose a new approach, combining the advantages of both in a 2-step process. We also propose a novel architecture which exploits understanding to improve recognition accuracy: the output of the Automatic Speech Recognition module is now a word lattice and the understanding module is responsible for transcribing the word strings which are useful to the Dialogue Manager. All the methods proposed are trained and evaluated on a corpus comprising utterances from live customer traffic. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Avignon, LIA, F-84911 Avignon 09, France. AT&T Labs Res, Florham Pk, NJ 07932 USA. RP Bechet, F (reprint author), Univ Avignon, LIA, BP1228, F-84911 Avignon 09, France. EM frederic.bechet@lia.univ-avignon.fr; algor@research.att.com; jwright@research.att.com; dtur@research.att.com CR BECHET F, 2000, 38 ANN M ASS COMP LI, P77 Bikel DM, 1999, MACH LEARN, V34, P211, DOI 10.1023/A:1007558221122 BLACK WJ, 1998, FACILE DESCRIPTION N BORTHWICK A, 1998, P 7 MESS UND C MUC 7 Carrasco RC, 1999, RAIRO-INF THEOR APPL, V33, P1, DOI 10.1051/ita:1999102 CHAPPELIER J, 1999, P 6 C TRAIT AUT LANG CHARNIAK E, 1993, 11 NAT C ART INT, P784 Chelba C, 2000, COMPUT SPEECH LANG, V14, P283, DOI 10.1006/csla.2000.0147 CHINCHOR N, 1998, MUC7 NAMED ENTITY TA Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X GRISHMAN R, 1998, P DARPA BROADC NEWS Huang J., 2001, P 39 ANN M ASS COMP, P290 KIEFER B, 2000, VERBMOBIL FDN SPEECH, P280 KIM J, 2000, P ICSLP 2000 Kubala F., 1998, P DARPA BROADC NEWS KUHN R, 1995, IEEE T PATTERN ANAL, V17, P449, DOI 10.1109/34.391397 Mohri M, 2002, COMPUT SPEECH LANG, V16, P69, DOI 10.1006/csla.2001.0184 Mohri M, 2000, THEOR COMPUT SCI, V231, P17, DOI 10.1016/S0304-3975(99)00014-6 Mohri M., 2000, ROBUSTNESS LANGUAGE, P251 PADMANABHAN M, 1999, P EUROSPEECH 99 BUD PALMER DD, 2001, THESIS U WASHINGTON PALMER DD, 1999, P EUROSPEECH 99 BUD Rahim M, 2001, SPEECH COMMUN, V34, P195, DOI 10.1016/S0167-6393(00)00054-6 ROARK B, 2002, P 40 ACL M PHIL Ron D, 1998, J COMPUT SYST SCI, V56, P133, DOI 10.1006/jcss.1997.1555 STOLCKE A, 1994, INT C GRAMM INF TUR G, 2002, P ICSLP 02 NR 27 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2004 VL 42 IS 2 BP 207 EP 225 DI 10.1016/j.specom.2003.07.003 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 780JP UT WOS:000189377800005 ER PT J AU Carre, R AF Carre, R TI From an acoustic tube to speech production SO SPEECH COMMUNICATION LA English DT Article DE speech production; modeling; acoustic tube ID STOP CONSONANTS; FORMANT TRANSITIONS; PERCEPTION; ARTICULATION; PLACE; CUES AB The general aim of this paper is to find out if speech production characteristics may be explained by the physical and permanent properties of an acoustic tube that is 18 cm long. In a way, we intend to generalize the deductive approach used by Lindblom [Phonetic Universal in Vowel Systems, in: J.J. Ohala, J.J. Jaeger (Eds.), Experimental Phonology, Academic Press, Orlando, p. 13] to explain vowel systems from production and perception properties. However, the question here is not to account for vowel systems but rather to explain certain characteristics of the speech production system.
In the present research, these characteristics are not observed per se and used as unquestionable "constraints". In order to answer this question related to the characteristics of the speech production system, the properties of the acoustic tube are first studied to build an acoustic production model having the following specific property: the shape of the tube is deformed to perform maximum acoustic changes, i.e., a minimum area deformation provokes a maximum acoustic variation. Following this approach, a set of distinctive deformation gestures involving corresponding distinctive acoustic changes is obtained and used to set up the intrinsic "phonological" system of the tube designed for communication needs. Then, it is shown that the Distinctive Region Model (DRM) summarizes the main results obtained without any constraints. Finally, model and speech production characteristics are compared. With a limited number of constraints, which can be explained, included in the model (such as a fixed larynx cavity), they fit surprisingly well. Thus, can these speech production characteristics be explained by the proposed deductive approach, i.e., are the main characteristics of the vocal tract, and of speech production in general, consequences of specific deformations of the shape of the tube to perform maximum acoustic changes? Implications of the findings are discussed. (C) 2003 Elsevier B.V. All rights reserved. C1 CNRS, ENST, Dept TSI, F-75634 Paris 13, France. RP Carre, R (reprint author), CNRS, ENST, Dept TSI, 46 Barrault, F-75634 Paris 13, France. EM rene.carre@enst.fr CR ALDAKKAR O, 1994, LINGUISTICA COMMUNI, V6, P56 BADIN P, 1984, 23 STL QPSR, P53 BLUMSTEIN SE, 1979, J ACOUST SOC AM, V66, P1001, DOI 10.1121/1.383319 Browman CP, 1989, PHONOLOGY, V6, P201, DOI 10.1017/S0952675700001019 Carre R, 2001, PHONETICA, V58, P163, DOI 10.1159/000056197 Carre R., 1992, Journal d'Acoustique, V5 CARRE R, 1994, ASHA M NEW ORL, P86 CARRE R, 1995, J AC SOC AM, V97, pS3420 CARRE R, 2002, P INT C SPEECH LANG CARRE R, 1994, J ACOUST SOC AM, V95, pS2924 Carre R, 2000, PHONETICA, V57, P152, DOI 10.1159/000028469 CARRE R, 2000, P INT C SPEECH LANG, P13 CARRE R, 1990, SPEECH PRODUCTION SP CARRE R, 1995, CR ACAD SCI II B, V30, P471 CARRE R, 1997, P 3 M ACL SPEC INT G, P26 DORMAN MF, 1977, PERCEPT PSYCHOPHYS, V22, P109, DOI 10.3758/BF03198744 Fant G., 1960, ACOUSTIC THEORY SPEE FANT G, 1975, 4 SPEECH TRANSM LAB, P1 Fant G., 1973, SPEECH SOUNDS FEATUR Fant G., 1974, P SPEECH COMM SEM, P121 HARRIS KS, 1958, J ACOUST SOC AM, V30, P122, DOI 10.1121/1.1909501 ISKAROUS K, 2001, THESIS U ILLINOIS UR KEWLEYPORT D, 1983, J ACOUST SOC AM, V73, P1779, DOI 10.1121/1.389402 LINDBLOM B, 1999, P LP 98 COL, P401 Lindblom B., 1985, PHONETIC LINGUISTICS, P169 Lindblom B, 1996, J ACOUST SOC AM, V99, P1683, DOI 10.1121/1.414691 Lindblom B., 1986, EXPT PHONOLOGY, P13 Lindblom Bjorn, 1990, PHONETIC CONTENTS PH, V11, P101 MRAYATI M, 1990, SPEECH COMMUN, V9, P231, DOI 10.1016/0167-6393(90)90059-I MRAYATI M, 1988, SPEECH COMMUN, V7, P257, DOI 10.1016/0167-6393(88)90073-8 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 SCHROEDE.MR, 1967, J ACOUST SOC AM, V41, P1002, DOI 10.1121/1.1910429 SHEPARD RN, 1984, PSYCHOL REV, V91, P417, DOI 10.1037/0033-295X.91.4.417 Stevens K.
N., 1981, PERSPECTIVES STUDY S STEVENS KN, 1989, J PHONETICS, V17, P3 Ungeheuer G., 1962, ELEMENTE AKUSTISCHEN WALLEY AC, 1983, J ACOUST SOC AM, V73, P1011, DOI 10.1121/1.389149 NR 37 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2004 VL 42 IS 2 BP 227 EP 240 DI 10.1016/j.specom.2003.12.001 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 780JP UT WOS:000189377800006 ER PT J AU Junqua, JC Wellekens, C AF Junqua, Jean-Claude Wellekens, Christian TI Special Issue on Adaptation Methods for Speech Recognition - Editorial SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Panason Tech Co, Panason Speech Technol Lab, Div Matsushita Elect Corp Amer, Santa Barbara, CA 93105 USA. Inst Eurecom, MultiMedia Commun Dept, F-06904 Sophia Antipolis, France. RP Junqua, JC (reprint author), Panason Tech Co, Panason Speech Technol Lab, Div Matsushita Elect Corp Amer, Suite 202,3888 State St, Santa Barbara, CA 93105 USA. EM jcj@research.panasonic.com; welleken@eurecom.fr NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2004 VL 42 IS 1 BP 1 EP 3 DI 10.1016/j.specom.2003.10.001 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 772AD UT WOS:000188817400001 ER PT J AU Yao, KS Paliwal, KK Nakamura, S AF Yao, KS Paliwal, KK Nakamura, S TI Noise adaptive speech recognition based on sequential noise parameter estimation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Adaptation Methods for Speech Recognition CY AUG 29-30, 2001 CL SOPHIA ANTIPOLIS, FRANCE DE noisy speech recognition; non-stationary noise; expectation maximization algorithm; Kullback proximal algorithm ID MAXIMUM-LIKELIHOOD-ESTIMATION; COMPENSATION AB In this paper, a noise adaptive speech recognition approach is proposed for recognizing speech which is corrupted by additive non-stationary background noise. The approach sequentially estimates noise parameters, through which a nonlinear parametric function adapts mean vectors of acoustic models. In the estimation process, the posterior probability of the state sequence, given the observation sequence and the previously estimated noise parameter sequence, is approximated by the normalized joint likelihood of active partial paths and the observation sequence given the previously estimated noise parameter sequence. The Viterbi process provides the normalized joint likelihood. The acoustic models are not required to be trained from clean speech; they can be trained from noisy speech. The approach can be applied to perform continuous speech recognition in the presence of non-stationary noise. Experiments conducted on speech contaminated by simulated and real non-stationary noise show that when acoustic models are trained from clean speech, the noise adaptive speech recognition system provides improvements in word accuracy as compared to the normal noise compensation system (which assumes the noise to be stationary) in slowly time-varying noise. When the acoustic models are trained from noisy speech, the noise adaptive speech recognition system is found to give improved performance in slowly time-varying noise over a system employing multi-conditional training. (C) 2003 Elsevier B.V. All rights reserved.
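One common instance of the "nonlinear parametric function" mentioned in the abstract above, given here as our own minimal Python sketch rather than the paper's exact formulation: in the log-filterbank domain, an additive-noise estimate shifts each acoustic-model mean by a log-add term. The sequential, posterior-weighted estimation of the noise parameters themselves is the core of the paper and is not reproduced.

import numpy as np

def adapt_mean_log_add(clean_mean, noise_mean):
    """Shift a log-filterbank mean to its noisy value, log(exp(m) + exp(n)),
    computed stably with logaddexp. Applied to every Gaussian mean in practice."""
    return np.logaddexp(clean_mean, noise_mean)

# Example: channels where the noise dominates move toward the noise mean.
print(adapt_mean_log_add(np.array([1.0, 4.0]), np.array([3.0, 1.0])))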
C1 ATR, Spoken Language Translat Res Labs, Kyoto, Japan. Griffith Univ, Sch Microelect Engn, Brisbane, Qld, Australia. RP Yao, KS (reprint author), Univ Calif San Diego, Inst Neural Computat, 9500 Gilman Dr, La Jolla, CA 92093 USA. EM kyao@ucsd.edu; k.paliwal@me.gu.edu.au; satoshi.nakamura@atr.co.jp CR Acero A., 1990, THESIS CARNEGIE MELL Afify M, 2001, INT CONF ACOUST SPEE, P229, DOI 10.1109/ICASSP.2001.940809 CERISARA C, 2001, ENV ADAPTATION BASED, P213 Chretien S, 2000, IEEE T INFORM THEORY, V46, P1800, DOI 10.1109/18.857792 Deng L, 2001, INT CONF ACOUST SPEE, P301 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Frey B., 2001, EUROSPEECH, P901 GALES M, 1997, COMPUTER SPEECH LANG, V9, P289 HANSON BA, 1990, INT CONF ACOUST SPEE, P857, DOI 10.1109/ICASSP.1990.115973 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hirsch H.G., 2000, ISCA ITRW ASR2000 Kim NS, 1998, IEEE SIGNAL PROC LET, V5, P57 KRISHNAMURTHY V, 1993, IEEE T SIGNAL PROCES, V41, P2557, DOI 10.1109/78.229888 Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225 Morris AC, 1998, INT CONF ACOUST SPEE, P737, DOI 10.1109/ICASSP.1998.675370 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Sagayama S, 1997, INT CONF ACOUST SPEE, P835, DOI 10.1109/ICASSP.1997.596063 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 Surendran AC, 1999, IEEE T SPEECH AUDI P, V7, P643, DOI 10.1109/89.799689 Takiguchi T, 2000, INT CONF ACOUST SPEE, P1403, DOI 10.1109/ICASSP.2000.861848 VARGA AP, 1990, INT CONF ACOUST SPEE, P845, DOI 10.1109/ICASSP.1990.115970 Vaseghi SV, 1997, IEEE T SPEECH AUDI P, V5, P11, DOI 10.1109/89.554264 YAO K, 2002, ADV NEURAL INFORMATI, P1213 YAO K, 2001, EUROSPEECH, P1139 Yao KS, 2002, INT CONF ACOUST SPEE, P189 YOUNG S, 1997, HTK BOOK VER 2 1 Zhao YX, 2001, INT CONF ACOUST SPEE, P225 Zhao YX, 2000, IEEE T SPEECH AUDI P, V8, P255, DOI 10.1109/89.841208 NR 28 TC 19 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2004 VL 42 IS 1 BP 5 EP 23 DI 10.1016/j.specom.2003.09.002 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 772AD UT WOS:000188817400002 ER PT J AU Cerisara, C Rigazio, L Junqua, JC AF Cerisara, C Rigazio, L Junqua, JC TI alpha-Jacobian environmental adaptation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Adaptation Methods for Speech Recognition CY AUG 29-30, 2001 CL SOPHIA ANTIPOLIS, FRANCE DE model compensation; noise robustness; automatic speech recognition; PMC; Jacobian adaptation; fast environmental adaptation ID ROBUST SPEECH RECOGNITION; MAXIMUM-LIKELIHOOD-ESTIMATION; PARALLEL MODEL COMBINATION; HIDDEN MARKOV-MODELS AB The robustness of automatic speech recognition systems to noise is still a problem, especially for small footprint systems. This paper addresses the problem of noise robustness using model compensation methods. Such algorithms are already available, but their complexity is usually high. An often-referenced method for achieving noise robustness is parallel model combination (PMC). Several algorithms have been proposed to develop more computationally efficient methods than PMC. For example, Jacobian adaptation approximates PMC with a linear transformation function in the cepstral domain. 
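The linear transformation mentioned in the preceding sentence is the derivative of the log-add mapping taken at a reference noise. A minimal Python sketch of that derivative (ours, assuming static cepstra obtained with an orthonormal DCT and no dynamic features):

import numpy as np
from scipy.fft import dct, idct

def jacobian_noise_to_noisy(c_clean, c_noise_ref):
    """J = C diag(n / (s + n)) C^T in the cepstral domain, where s and n are
    the linear filterbank spectra recovered from the clean-speech and
    reference-noise cepstral means, and C is the orthonormal DCT matrix."""
    s = np.exp(idct(c_clean, norm="ortho"))
    n = np.exp(idct(c_noise_ref, norm="ortho"))
    d = len(c_clean)
    C = dct(np.eye(d), axis=0, norm="ortho")   # C @ x == dct(x)
    return C @ np.diag(n / (s + n)) @ C.T      # orthonormal: C^-1 == C^T

# With c_ref the reference noise and c_noisy_ref the log-add (PMC) combination
# of c_clean and c_ref, a new noise estimate c_noise gives, to first order:
# c_noisy ~= c_noisy_ref + jacobian_noise_to_noisy(c_clean, c_ref) @ (c_noise - c_ref)

As the abstract goes on to explain, alpha-JAC replaces this exact derivative with modified linear transformations that remain accurate farther from the training conditions.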
However, the Jacobian approximation is valid only for test environments that are close to the training conditions whereas, in real test conditions, the mismatch between the test and training environments is usually large. In this paper, we propose two methods, respectively called static and dynamic alpha-Jacobian adaptation (or alpha-JAC), to compute new linear approximations of PMC for realistic test environments. We further extend both algorithms to compensate for additive and convolutional noise and we derive the corresponding non-linear algorithm that is approximated. All these algorithms are experimentally compared in important mismatch conditions. As compared to Jacobian adaptation, improvements are observed with both static and dynamic alpha-Jacobian adaptation. (C) 2003 Elsevier B.V. All rights reserved. C1 LORIA, UMR 7503, F-54506 Vandoeuvre Les Nancy, France. Panason Speech Technol Lab, Santa Barbara, CA 93105 USA. RP Cerisara, C (reprint author), LORIA, UMR 7503, Campus Sci,BP 239, F-54506 Vandoeuvre Les Nancy, France. EM cerisara@loria.fr CR BACCHIERI E, 1992, ICASSP 92, V1, P501 BOURLARD H, 1996, ICSLP 96, P422 CERISARA C, 2001, ICASSP 2001 SALT LAK CERISARA C, 2002, ICASSP 2002 ORL US M CERISARA C, 2000, ICSLP 2000 BEIJ CHIN, V1, P369 CERISARA C, 1999, EUROSPEECH 99 PRAG S DAOUDI K, 2000, ICSLP 2000 BEIJ CHIN FLANAGAN JL, 1985, J ACOUST SOC AM, V78, P1508, DOI 10.1121/1.392786 FURUI S, 1999, NATO ASI SERIES F, V169, P102 GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014 Gales M. J. F., 1995, THESIS GONVILLE CAIU Gales MJF, 1998, SPEECH COMMUN, V25, P49, DOI 10.1016/S0167-6393(98)00029-6 GALES MJF, 1995, INT CONF ACOUST SPEE, P133, DOI 10.1109/ICASSP.1995.479291 GASSERT C, 1998, 22 J ETUD PAR MART S, P171 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J HANSEN J, 1995, IEEE T ASSP, V33, P1404 HERMANSKY H, 1993, 1993 P IEEE INT C AC, V2, P83 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hunt MJ, 1979, J ACOUST SOC AM, V66, pS535 JANQUA JC, 1996, ROBUST AUTOMATIC SPE Juang B. H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E Junqua J., 2000, ROBUST SPEECH RECOGN LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z Moreno PJ, 1996, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1996.543225 PONTING K, 1999, NATO ASI SERIES F, V169, P112 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Raj B, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2340 RICHARD W, 1997, DEFINITION CORPUS SC RIGAZIO L, 2001, EUROSPEECH 2001 SCAN SAGAYAMA S, 2001, ISCA WORKSH AD METH, P117 Sagayama S, 1997, INT CONF ACOUST SPEE, P835, DOI 10.1109/ICASSP.1997.596063 SARIKAYA R, 2000, ICSLP 2000, V3, P702 Stern R. M., 1997, ESCA NATO TUT RES WO, P33 WOODLAND P, 1993, EUROSPEECH 93, P2207 Woodland PC, 1996, INT CONF ACOUST SPEE, P65, DOI 10.1109/ICASSP.1996.540291 Zhao YX, 2000, IEEE T SPEECH AUDI P, V8, P255, DOI 10.1109/89.841208 NR 37 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD JAN PY 2004 VL 42 IS 1 BP 25 EP 41 DI 10.1016/j.specom.2003.08.003 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 772AD UT WOS:000188817400003 ER PT J AU Zhang, ZP Furui, S AF Zhang, ZP Furui, S TI Piecewise-linear transformation-based HMM adaptation for noisy speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Adaptation Methods for Speech Recognition CY AUG 29-30, 2001 CL SOPHIA ANTIPOLIS, FRANCE DE robust speech recognition; noise adaptation; piecewise-linear transformation; noise clustering; GMM ID SPEAKER ADAPTATION; RECOGNITION AB This paper proposes a new method using piecewise-linear transformation for adapting phone HMMs to noisy speech. Various noises are clustered according to their spectral property, and a noisy speech HMM corresponding to each clustered noise and SNR condition is made. Based on the likelihood maximization criterion, an HMM that best matches an input noisy speech is selected and further adapted using linear transformation. The proposed method is evaluated by its ability to recognize noisy broadcast-news speech. It is confirmed that the proposed method is effective in recognizing numerically noise-added speech and actual noisy speech under various noise conditions. The proposed method minimizes mismatches between noisy input speech and the HMM's, sentence by sentence, without requiring online noise spectrum/model estimation. The proposed method is therefore easily applicable to real world conditions with frequently changing noise. (C) 2003 Elsevier B.V. All rights reserved. C1 Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. RP Furui, S (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1 Ookayama, Tokyo 1528552, Japan. EM furui@cs.titech.ac.jp CR Acero A., 1992, ACOUSTICAL ENV ROBUS ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 FURUI S, 2001, P IEEE INT C AC SPEE, P365 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 GALES MJF, 1998, P INT C SPOK LANG PR, P369 GALES MJF, 1992, P ICASSP, P233, DOI 10.1109/ICASSP.1992.225929 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Huang CS, 2001, IEEE T SPEECH AUDI P, V9, P866 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MARITN F, 1993, P EUR C SPEECH COMM, P1031 MINAMI Y, 1995, P ICASSP, P129 MORRIS A, 2000, J SPEECH COMM, V34, P25 NEUMEYER L, 1995, IEEE INT C ACOUST SP, P1289 OHKURA K, 1992, P ICSLP 92, P369 OHTSUKI K, 1999, P EUR C SPEECH COMM, P671 PADMANABHAN M, 1997, IEEE T ACOUST SPEECH, P835 SAGAYAMA S, 1997, P IEEE INT C AC SPEE, P835 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 SUGAMURA N, 1982, T COMM SPEECH RES, P505 Vizinho A., 1999, P EUR C SPEECH COMM, P2407 ZHANG Z, 2000, P ICSLP BEIJ, P694 Zhang ZP, 2002, SPEECH COMMUN, V37, P271, DOI 10.1016/S0167-6393(01)00018-8 NR 23 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JAN PY 2004 VL 42 IS 1 BP 43 EP 58 DI 10.1016/j.specom.2003.08.006 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 772AD UT WOS:000188817400004 ER PT J AU Kim, DK Kim, NS AF Kim, DK Kim, NS TI Maximum a posteriori adaptation of HMM parameters based on speaker space projection SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Adaptation Methods for Speech Recognition CY AUG 29-30, 2001 CL SOPHIA ANTIPOLIS, FRANCE DE hidden Markov model; maximum a posteriori; speaker adaptation; probabilistic principal component analysis; EM algorithm ID SPEECH RECOGNITION; MODELS AB This paper presents a novel approach to rapid speaker adaptation based on the speaker space projection paradigm in which the adapted model is constrained to lie on a specific subspace spanned by a small number of basis vectors. In order to select the basis vectors that form the speaker space, we apply probabilistic principal component analysis (PPCA) technique to a set of training speaker models represented by a number of hidden Markov models (HMMs). The PPCA incorporates a probability model to the conventional principal component analysis (PCA) method, and finds the speaker space model by means of the expectation maximization (EM) algorithm which is computationally efficient. The PPCA model provides the information of correlation among different speech units as well as the prior probability density function (pdf) associated with each HMM parameter, which can be directly applied to the maximum a posteriori (MAP) adaptation framework. Through a series of supervised adaptation experiments on the tasks of connected digit and large vocabulary recognition, we show that the proposed approach not only achieves a good performance for a small amount of adaptation data but also guarantees a consistent estimate as the data size grows. (C) 2003 Elsevier B.V. All rights reserved. C1 Seoul Natl Univ, Sch Elect Engn, Seoul 151742, South Korea. Seoul Natl Univ, INMC, Seoul 151742, South Korea. RP Kim, NS (reprint author), Seoul Natl Univ, Sch Elect Engn, Kwanak POB 34, Seoul 151742, South Korea. EM dkkim@hi.snu.ac.kr; nkim@snu.ac.kr CR Bishop GR, 1998, J LEISURE RES, V30, P281 BOTTERWECK H, 2001, P ICASSP SALT LAK CI Bottomley GE, 2000, IEEE COMMUN LETT, V4, P354, DOI 10.1109/4234.892200 Carreira-Perpinan MA, 1998, SPEECH COMMUN, V26, P259, DOI 10.1016/S0167-6393(98)00059-4 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DIGALAKIS V, 1999, P INT C AC SPEECH SI, P765 Gales MJF, 2000, IEEE T SPEECH AUDI P, V8, P417, DOI 10.1109/89.848223 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Jolliffe I. T., 1986, PRINCIPAL COMPONENT JON E, 2001, P ICASSP SALT LAK CI Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 KIHN R, 2001, P ICASSP SALT LAK CI KIHN R, 2001, ISCA ITR WORKSH FRAN, P33 KIM DK, 2001, ISCA ITR WORKSH FRAN, P25 Kim N. S., 2000, P INT C SPOK LANG PR, P734 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 Lee CH, 1998, SPEECH COMMUN, V25, P29, DOI 10.1016/S0167-6393(98)00028-4 Lee CH, 2000, P IEEE, V88, P1241 McLachlan G. 
J., 1997, EM ALGORITHM EXTENSI NGUYEN P, 1999, P EUROSPEECH, P2519 Roweis S, 1997, NEURAL INFORM PROCES, V10, P626 SAGAYAMA S, 2001, ISCA ITR WORKSH FRAN, P67 Tipping ME, 1999, NEURAL COMPUT, V11, P443, DOI 10.1162/089976699300016728 Woodland P.C., 2001, ITRW ADAPTATION METH, P11 YOUNG SJ, 1994, P ARPA HUM LANG TECH, V1, P286 NR 25 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2004 VL 42 IS 1 BP 59 EP 73 DI 10.1016/j.specom.2003.08.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 772AD UT WOS:000188817400005 ER PT J AU McDonough, J Schaaf, T Waibel, A AF McDonough, J Schaaf, T Waibel, A TI Speaker adaptation with all-pass transforms SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Adaptation Methods for Speech Recognition CY AUG 29-30, 2001 CL SOPHIA ANTIPOLIS, FRANCE DE speaker adaptation; speech recognition ID HIDDEN MARKOV-MODELS; MAXIMUM-LIKELIHOOD; SPEECH RECOGNITION; REPRESENTATION; SPACE AB Modern speech recognition systems are based on the hidden Markov model (HMM) and employ cepstral features to represent input speech. In speaker normalization, the cepstral features of speech from a given speaker are transformed to match the speaker independent HMM. In speaker adaptation, the means of the HMM are transformed to match the input speech. Vocal tract length normalization (VTLN) is a popular normalization scheme wherein the frequency axis of the short-time spectrum is rescaled prior to the extraction of cepstral features. In this work, we develop novel speaker adaptation schemes by exploiting the fact that frequency domain transformations similar to that inherent in VTLN can be accomplished entirely in the cepstral domain through the use of conformal maps. We describe two classes of such maps: rational all-pass transforms (RAPTs) which are well-known in the signal processing literature, and sine-log all-pass transforms (SLAPTs) which are novel in this work. For both classes of maps, we develop the relations necessary to perform maximum likelihood estimation of the relevant transform parameters using enrollment data from a new speaker. We also propose the means by which an HMM may be trained specifically for use with this type of adaptation. Finally, in a set of recognition experiments conducted on conversational speech material from the Switchboard Corpus as well as the English Spontaneous Scheduling Task, we demonstrate the capacity of APT-based speaker adaptation to achieve word error rate reductions superior to those obtained with other popular adaptation techniques, and moreover, reductions that are additive with those provided by VTLN. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Karlsruhe, Integrated Syst Lab, Inst Log Komplexitat & Dedukt Syst, D-76128 Karlsruhe, Germany. RP McDonough, J (reprint author), Univ Karlsruhe, Integrated Syst Lab, Inst Log Komplexitat & Dedukt Syst, Fasanengarten 5, D-76128 Karlsruhe, Germany.
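To make the conformal-map idea concrete, here is a small numerical sketch of warping a cepstrum through a first-order all-pass map, the simplest rational case; the closed-form transform matrices derived in the paper, and the sine-log extension, are not reproduced. The grid size and function names are illustrative.

```python
import numpy as np

def bilinear_warp(w, alpha):
    # Frequency map induced by a first-order all-pass filter.
    return w + 2.0 * np.arctan(alpha * np.sin(w) / (1.0 - alpha * np.cos(w)))

def warp_cepstrum(c, alpha, grid=512):
    """Warp a real, even cepstrum through a first-order all-pass map.

    The log-spectrum is L(w) = c[0] + 2 * sum_k c[k] cos(k w). Since the
    inverse of the warp with +alpha is the warp with -alpha, the warped
    spectrum is L(bilinear_warp(w, -alpha)); warped cepstra follow by
    numerical cosine projection. alpha > 0 stretches low frequencies,
    mimicking a shorter vocal tract.
    """
    c = np.asarray(c, dtype=float)
    n = len(c)
    w = np.linspace(0.0, np.pi, grid)
    src = bilinear_warp(w, -alpha)
    k = np.arange(1, n)
    L = c[0] + 2.0 * (c[1:, None] * np.cos(k[:, None] * src[None, :])).sum(axis=0)
    basis = np.cos(np.arange(n)[:, None] * w[None, :])   # cos(m w) rows
    vals = basis * L[None, :]
    integral = ((vals[:, :-1] + vals[:, 1:]) * 0.5 * np.diff(w)).sum(axis=1)
    return integral / np.pi

print(warp_cepstrum([1.0, 0.5, 0.2, 0.1], alpha=0.1))
```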
EM jmcd@ira.uka.de CR Acero A., 1990, THESIS CARNEGIE MELL ANASTASAKOS T, 1996, P ICSLP Andreou A., 1994, P CAIP WORKSH FRONT BOCCHIERI E, 1999, P ICASSP, V2, P773 CHURSCHILL RV, 1990, COMPLEX VARIABLES AP DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DIGALAKIS V, 1996, P ICASSP, V1, P339 DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659 DING GH, 2002, ICSLP, P1389 Eide E., 1996, P IEEE INT C AC SPEE, V1, P346 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 Gill P. E., 1981, PRACTICAL OPTIMIZATI GUNAWARDANA A, 2000, IEEE ICASSP 5 9 JUN, V2, pII985 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 KANNAN A, 1996, P ICASSP, V2, P769 Kuhn R, 2000, IEEE T SPEECH AUDI P, V8, P695, DOI 10.1109/89.876308 KUMAR N, 1998, SPEECH COMMUN, V26, P238 Lee L., 1996, P ICASSP, V1, P353 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Luenberger D. G., 1984, LINEAR NONLINEAR PRO MASRY E, 1968, SIAM J APPL MATH, V16, P552, DOI 10.1137/0116044 McDonough J., 1998, P ICSLP MCDONOUGH JW, 2001, 103 U KARLSR MCDONOUGH JW, 2000, THESIS J HOPK U BALT MCDONOUGH JW, 1999, 39 J HOPK U CTR LANG MCDONOUGH JW, 1999, P EUROSPEECH MCDONOUGH JW, 1998, 36 J HOPK U CTR LANG MDDONOUGH J, 2003, 102 U KARLSR OPPENHEI.AV, 1972, PR INST ELECTR ELECT, V60, P681, DOI 10.1109/PROC.1972.8727 Oppenheim A. V., 1989, DISCRETE TIME SIGNAL PITZ M, 2001, EUROSPEECH, P721 PYE D, 1997, P ICASSP, V2, P1047 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 SAON G, 2000, P ICASSP SHIKANO K, 1986, EVALUATIN LPC SPECTR WEGMANN S, 1996, P ICASSP, V1, P339 WOODLAND P, 2000, ISCA ITRW AUTOMATIC, P7 Young S., 1999, HTK BOOK ZUE V, 1971, 101 MIT RES ELTR NR 41 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2004 VL 42 IS 1 BP 75 EP 91 DI 10.1016/j.specom.2003.09.005 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 772AD UT WOS:000188817400006 ER PT J AU Bellegarda, JR AF Bellegarda, JR TI Statistical language model adaptation: review and perspectives SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Adaptation Methods for Speech Recognition CY AUG 29-30, 2001 CL SOPHIA ANTIPOLIS, FRANCE ID VOCABULARY SPEECH RECOGNITION; SPOKEN DIALOG SYSTEMS; INFORMATION AB Speech recognition performance is severely affected when the lexical, syntactic, or semantic characteristics of the discourse in the training and recognition tasks differ. The aim of language model adaptation is to exploit specific, albeit limited, knowledge about the recognition task to compensate for this mismatch. More generally, an adaptive language model seeks to maintain an adequate representation of the current task domain under changing conditions involving potential variations in vocabulary, syntax, content, and style. This paper presents an overview of the major approaches proposed to address this issue, and offers some perspectives regarding their comparative merits and associated trade-offs. (C) 2003 Elsevier B.V. All rights reserved. C1 Apple Comp Inc, Spoken Language Grp, Cupertino, CA 95014 USA. RP Bellegarda, JR (reprint author), Apple Comp Inc, Spoken Language Grp, MS-302-2LF,2 Infinite Loop, Cupertino, CA 95014 USA. 
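Among the approaches such a review covers, the simplest is linear interpolation of a background model with a task-specific model. A toy bigram version, assuming whitespace tokenisation and with a crude probability floor in place of proper smoothing, might look like this; the interpolation weight would normally be tuned on held-out task data (e.g. by EM).

```python
from collections import Counter

class InterpolatedBigram:
    """Background/task bigram mixture: p(w|h) = lam*p_task + (1-lam)*p_bg."""

    def __init__(self, bg_text, task_text, lam=0.5, floor=1e-6):
        self.lam, self.floor = lam, floor
        self.bg = self._counts(bg_text)
        self.task = self._counts(task_text)

    @staticmethod
    def _counts(text):
        toks = text.split()
        return Counter(zip(toks, toks[1:])), Counter(toks)

    def _p(self, model, h, w):
        # Maximum-likelihood estimate with a crude floor for unseen history.
        bigrams, unigrams = model
        return bigrams[(h, w)] / unigrams[h] if unigrams[h] else self.floor

    def prob(self, h, w):
        return (self.lam * self._p(self.task, h, w)
                + (1.0 - self.lam) * self._p(self.bg, h, w))

bg = "call the bank please call customer service"
task = "transfer funds transfer to savings"
lm = InterpolatedBigram(bg, task, lam=0.7)
print(lm.prob("transfer", "funds"))   # ~0.35
```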
EM jerome@apple.com CR ADDA G, 1999, P EUR C SPEECH COMM, V4, P1759 BAHL LR, 1983, IEEE T PATTERN ANAL, V5, P179 Bellegarda JR, 1998, IEEE T SPEECH AUDI P, V6, P456, DOI 10.1109/89.709671 Bellegarda JR, 2000, IEEE T SPEECH AUDI P, V8, P76, DOI 10.1109/89.817455 BELLEGARDA JR, 1998, P 1998 INT C AC SPEE, V2, P677, DOI 10.1109/ICASSP.1998.675355 BELLEGARDA JR, 1990, IEEE T ACOUST SPEECH, V38, P2033, DOI 10.1109/29.61531 Bellegarda JR, 2000, P IEEE, V88, P1279, DOI 10.1109/5.880084 BELLEGARDA JR, 2001, P 2001 ISCA WORKSH A Berger A, 1998, INT CONF ACOUST SPEE, P705, DOI 10.1109/ICASSP.1998.675362 Berger A., 1998, P ICASSP, VII, P705, DOI 10.1109/ICASSP.1998.675362 BERTOLDI N, 2001, P 2001 INT C AC SPEE BESLING S, 1995, P EUR MADR SPAIN, P1755 CHELBA C, 1997, P 5 EUR C SPEECH COM, V5, P2775 CHELBA C, 2001, P 2001 INT C AC SPEE Chelba C, 2000, COMPUT SPEECH LANG, V14, P283, DOI 10.1006/csla.2000.0147 Chen CC, 2000, APPL IMMUNOHISTO M M, V8, P1, DOI 10.1097/00022744-200003000-00001 CHEN L, 1999, P 1999 EUR C SPEECH, V5, P1923 CHEN SF, 1998, P ICASSP 98, V2, P681, DOI 10.1109/ICASSP.1998.675356 CHURSCH KW, 1987, PHONOLOGICAL PARSING CLARKSON P, 1997, P IEEE INT C AC SPEE, P799 COCCARO N, 1998, P INT C SPOK LANG PR, P2403 DARROCH JN, 1972, ANN MATH STAT, V43, P1470, DOI 10.1214/aoms/1177692379 DEDERICO M, 1996, P 1996 INT C SPOK LA DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9 Della Pietra S., 1997, IEEE T PATTERN ANAL, V19, P1 DELLAPIETRA S, 1992, P 1992 INT C AC SPEE, V1, P633 DONNELLY PG, 1999, P 1999 EUR C SPEECH, V4, P1575 FEDERICO M, 1999, LANGUAGE MODEL ADAPT FEDERICO M, 1999, P 1999 EUR C SPEECH, V4, P15830 GALESCU L, 2000, P 2000 INT C SPOK LA, pII86 Gildea D., 1999, P 6 EUR C SPEECH COM, V5, P2167 GORIN A, 1995, J ACOUST SOC AM, V97, P3441, DOI 10.1121/1.412431 GRETTER R, 2000, P 2001 INT C AC SPEE Hofmann T, 1999, LECT NOTES COMPUT SC, V1642, P161 HOFMANN T, 1999, P 15 C UNC AI STOCKH Iyer R., 1994, P ARPA WORKSH HUM LA, P82, DOI 10.3115/1075812.1075828 IYER R, 1999, IEEE T SPEECH AUDIO, V7 JANISZEK D, 2001, P 2001 INT C AC SPEE JARDINO M, 1996, P 1996 INT C AC SPEE, P1161 JELINEK F, 1985, P IEEE, V73, P1616, DOI 10.1109/PROC.1985.13343 JELINEK F, 1999, P 1999 EUIR C SPEECH, V1 KELLNER A, 1998, P 1998 INT C AC SPEE, V1, P185, DOI 10.1109/ICASSP.1998.674398 KNESER, 1993, P 1993 INT C AC SPEE, V2, P586 Kneser R., 1997, P EUROSPEECH 97 SEPT, V4, P1971 KNESER R, 1997, P 1997 INT C AC SPEE, V2, P779 KUHN R, 1990, IEEE T PATTERN ANAL, V12, P570, DOI 10.1109/34.56193 LAFFERTY JD, 1995, MAXIMUN ENTROPY BAYE LAU R, 1993, P 1993 INT C AC SPEE, pII45 LEFEVRE F, 2001, P 2001 INT C AC SPEE Martin S.C., 1997, P EUROSPEECH 97 SEPT, V3, P1447 MASATAKI H, 1997, P 1997 INT C AC SPEE, V1, P783 Mohri M, 2000, THEOR COMPUT SCI, V234, P177, DOI 10.1016/S0304-3975(98)00115-7 Mood A. 
M., 1974, INTRO THEORY STAT NASR A, 1999, P EUR C SPEECH COMM, V5, P2175 NIESLER T, 1996, P 1996 INT C AC SPEE, pI164 OHTSUKI K, 1999, P ESCA EUR 99 BUD HU, V2, P671 Pereira FCN, 1997, LANG SPEECH & COMMUN, P431 PETERS J, 1999, P 1999 AUT SPEECH RE, P1253 Rabiner L.R., 1996, AUTOMATIC SPEECH SPE, P1 RAO PS, 1997, P EUR C SPEECH COMM, V4, P1979 REICHL W, 1999, P 1999 EUR C SPEECH, V4, P1791 Riccardi G, 2000, IEEE T SPEECH AUDI P, V8, P3, DOI 10.1109/89.817449 ROSENFELD R, 1997, P IEEE WORKSH AUT SP, P230 Rosenfeld R, 2000, P IEEE, V88, P1270, DOI 10.1109/5.880083 Rosenfeld R, 1996, COMPUT SPEECH LANG, V10, P187, DOI 10.1006/csla.1996.0011 Rosenfeld R., 1995, P 4 EUR C SPEECH COM, P1763 ROSENFELD R, 2001, COMPUTER SPEECH LANG, V15 SCHWARTZ R, 1997, P 5 EUR C SPEECH COM, V3, P1455 SEYMORE K, 1997, P EUROSPEECH, V4, P1987 Souvignier B, 2000, IEEE T SPEECH AUDI P, V8, P51, DOI 10.1109/89.817453 WU J, 1999, P EUR 99 SEPT 6 10 B, V5, P2179 Younger D., 1967, INFORM CONTR, V10, P198 ZHANG R, 1999, P 6 EUR C SPEECH COM, V4, P1815 ZHU X, 2001, P 2001 INT C AC SPEE ZHU XJ, 1999, P 6 EUR C SPEECH COM, V4, P1807 Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 76 TC 73 Z9 77 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2004 VL 42 IS 1 BP 93 EP 108 DI 10.1016/j.specom.2003.08.002 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 772AD UT WOS:000188817400007 ER PT J AU Goronzy, S Rapp, S Kompe, R AF Goronzy, S Rapp, S Kompe, R TI Generating non-native pronunciation variants for lexicon adaptation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Adaptation Methods for Speech Recognition CY AUG 29-30, 2001 CL SOPHIA ANTIPOLIS, FRANCE AB Handling non-native speech in automatic speech recognition (ASR) systems is an area of increasing interest. The majority of systems are tailored to native speech only and as a consequence performance for non-native speakers often is not satisfactory. One way to approach the problem is to adapt the acoustic models to the new speaker. Another important means to improve performance for non-native speakers is to consider non-native pronunciations in the dictionary. The difficulty here lies in the generation of the non-native variants, especially if various accents are to be considered. Traditional approaches to model pronunciation variation either require phonetic expertise or extensive speech databases. They are too costly, especially if a flexible modelling of several accents is desired. We propose to exclusively use native speech databases to derive non-native pronunciation variants. We use an English phoneme recogniser to generate English pronunciations for German words and use these to train decision trees that are able to predict the respective English-accented variant from the German canonical transcription. Furthermore we combine this approach with online, incremental weighted MLLR speaker adaptation. Using the enhanced dictionary and the speaker adaptation alone improved the word error rate of the baseline system by 5.2% and 16.8%, respectively. When both methods were combined, we achieved an improvement of 18.2%. (C) 2003 Elsevier B.V. All rights reserved. C1 Sony Int Europe GmbH, Sony Corp Labs Europe, Adv Software Lab, D-70327 Stuttgart, Germany. 
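The variant-prediction step can be sketched with an off-the-shelf decision tree. The toy alignment pairs below are invented for illustration, and the integer encoding of phone identities is a simplification of the context attributes the paper uses; real training pairs would come from aligning the English phoneme recogniser's output with the German canonical transcriptions.

```python
from sklearn.tree import DecisionTreeClassifier

# Toy aligned data: German canonical phone with one phone of left/right
# context, mapped to the English-accented variant to predict.
pairs = [
    (("#", "v", "a"), "w"),   # German /v/ -> English-accented /w/
    (("a", "v", "e"), "w"),
    (("#", "r", "a"), "r"),   # uvular /r/ -> English approximant
    (("t", "r", "o"), "r"),
    (("a", "t", "#"), "t"),
]
phones = sorted({p for ctx, _ in pairs for p in ctx})
idx = {p: i for i, p in enumerate(phones)}

X = [[idx[p] for p in ctx] for ctx, _ in pairs]
y = [lab for _, lab in pairs]

tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict([[idx["#"], idx["v"], idx["a"]]]))  # -> ['w']
```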
RP Goronzy, S (reprint author), Sony Int Europe GmbH, Sony Corp Labs Europe, Adv Software Lab, Hedelfinger Str 61, D-70327 Stuttgart, Germany. EM goronzy@sony.de; rapp@sony.de; kompe@sony.de CR ADDADECKER M, 1999, SPEECH COMMUN, V29, P88 AMDAL I, 2000, ISRU2000 PARIS, V1, P85 BALTLINER A, 2001, 16 SMARTKOM CREMELIE N, 1997, EUROSPEECH 97 RHODES, P2459 DIMORI R, 1998, SPOKEN DIALOGUES COM DOWNEY S, 1998, WORKSH MODELING PRON, P157 EISELE KNM, 2002, THESIS KARLSRUHE U A FLEGE JE, 1987, J PHONETICS, V15, P47 FOSLERLUSSIER E, 1999, EUROSPEECH99 BUDAPES, P463 GORONZY S, 1999, EUROSPEECH99 BUD, P5 GORONZY S, 2001, EUROPEECH2001, V1, P309 GORONZY S, 2002, LECT NOTES ARTIFICIA, V2560 HE X, 2001, EUROSPEECH2001, V2, P1461 Holter T, 1999, SPEECH COMMUN, V29, P177, DOI 10.1016/S0167-6393(99)00036-9 HUANG C, 2000, ICSLP2000, V3, P818 HUMPHRIES JJ, 1996, ICSLP96, V4, P2324 JURAFSKY D, 2001, ICASSP2001 SALT LAK, V1, P577 KAT LW, 2000, ICSLP2000 BEIJ, V3, P738 KIPP A, 1996, ICSLP96 PHIL, V1, P106 KIPP A, 1997, EUROSPEECH97 RHOD, V2, P1023 Riley M, 1999, SPEECH COMMUN, V29, P209, DOI 10.1016/S0167-6393(99)00037-0 SAHAKYAN M, 2001, THESIS U STUTTGART SLOBODA T, 1996, ICSLP96, V4, P2328 STRIK H, 2001, ISCA ITR WORKSH AD M, V1, P123 Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 WESTER M, 1998, ESCA WORKSH MOD PRON, V1, P145 WESTER M, 2000, ICSLP2000, V1, P270 WILLIAMS G, 1998, WORKSH MODELING PRON, V1, P151 WITT S, 1999, EUROSPEECH 99 BUD, P1367 WOODLAND PC, 1999, ASRU99, V1, P85 YANG Q, 2000, PRORISC2000 WORKSH C NR 31 TC 20 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2004 VL 42 IS 1 BP 109 EP 123 DI 10.1016/j.specom.2003.09.003 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 772AD UT WOS:000188817400008 ER PT J AU Sankar, A Kannan, A AF Sankar, A Kannan, A TI A comprehensive study of task-specific adaptation of speech recognition models SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Adaptation Methods for Speech Recognition CY AUG 29-30, 2001 CL SOPHIA ANTIPOLIS, FRANCE DE task adaptation; acoustic adaptation; grammar adaptation; confidence score mapping ID SPEAKER ADAPTATION AB Most published adaptation research focuses on speaker adaptation, and on adaptation for noisy channels and background environments. In this paper, we present a study of task adaptation, where the speech recognition models are adapted to a specific application or task, giving significant performance gains. We explore several new questions about adaptation which have not been studied before, and present novel solutions to these problems. For example, we show that adaptation can result in increased out-of-grammar error rates. We present an automatic confidence score mapping algorithm to correct this problem. We show that grammar-dependent acoustic adaptation gives improved performance. In addition, we show that in-grammar acoustic adaptation gives significantly better results. We study acoustic and grammar task adaptation, and show that the gains are additive. Finally we show that adaptation improves both accuracy and speed, where traditional studies have been more focused on accuracy alone. We also study traditional adaptation modes such as supervised and unsupervised adaptation, the use of confidence thresholds for unsupervised adaptation, and the effect of the amount of data on task adaptation. 
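One plausible realisation of a confidence score mapping, sketched under the assumption that scores are continuous and that held-out score samples are available from both the baseline and the adapted system, is monotone quantile matching; the paper's exact algorithm may differ.

```python
import numpy as np

def quantile_score_map(scores_adapted, scores_baseline, n_knots=20):
    """Monotonic confidence-score mapping via quantile matching.

    Returns a function mapping an adapted-model confidence score onto
    the baseline scale, so a rejection threshold tuned on the baseline
    keeps roughly the same out-of-grammar operating point after
    adaptation.
    """
    q = np.linspace(0.0, 1.0, n_knots)
    xs = np.quantile(scores_adapted, q)
    ys = np.quantile(scores_baseline, q)
    return lambda s: np.interp(s, xs, ys)

rng = np.random.default_rng(0)
remap = quantile_score_map(rng.beta(5, 2, 1000), rng.beta(2, 2, 1000))
print(remap(0.8))
```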
(C) 2003 Elsevier B.V. All rights reserved. C1 Speech Res & Dev, Menlo Pk, CA 94025 USA. RP Sankar, A (reprint author), Speech Res & Dev, 1380 Willow Rd, Menlo Pk, CA 94025 USA. EM sankar@nuance.com; ashvin@nuance.com CR Bottomley GE, 2000, IEEE COMMUN LETT, V4, P354, DOI 10.1109/4234.892200 CHANG E, 1999, P EUROSPEECH, P271 Diakoloukas V., 1997, P IEEE INT C AC SPEE, P1455 DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P281, DOI 10.1109/89.506931 GALES M, 1998, COMPUT SPEECH LANG, P12 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Jelinek F., 1980, Pattern Recognition in Practice. Proceedings of an International Workshop KUHN R, 1998, P ICSLP SYDN AUST, pV1771 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 NEUMEYER L, 1995, P ICASSP, V1, P141 NEUMEYER L, 1995, P EUR C SPEECH COMM, P1127 RIVLIN Z, 1996, P IEEE INT C AC SPEE, P515 SANKA A, 2001, P IEEE WORKSH AUT SP Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 UMESH S, 2000, P ISSLP BEIJ CHIN WEGMANN S, 1996, P ICASSP, P339 NR 17 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2004 VL 42 IS 1 BP 125 EP 139 DI 10.1016/j.specom.2003.09.001 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 772AD UT WOS:000188817400009 ER PT J AU Altincay, H Demirekler, M AF Altincay, H Demirekler, M TI Speaker identification by combining multiple classifiers using Dempster-Shafer theory of evidence SO SPEECH COMMUNICATION LA English DT Article ID DECISION COMBINATION; DATA FUSION; RECOGNITION; CLASSIFICATION; VERIFICATION; INFORMATION; SYSTEMS; SPEECH; MODELS; RULE AB This paper presents a multiple classifier approach as an alternative solution to the closed-set text-independent speaker identification problem. The proposed algorithm which is based on Dempster-Shafer theory of evidence computes the first and Rth level ranking statistics. Rth level confusion matrices extracted from these ranking statistics are used to cluster the speakers into model sets where they share set specific properties. Some of these model sets are used to reflect the strengths and weaknesses of the classifiers while some others carry speaker dependent ranking statistics of the corresponding classifier. These information sets from multiple classifiers are combined to arrive at a joint decision. For the combination task, a rule-based algorithm is developed where Dempster's rule of combination is applied in the final step. Experimental results have shown that the proposed method performed much better compared to some other rank-based combination methods. (C) 2003 Elsevier B.V. All rights reserved. C1 Eastern Mediterranean Univ, Dept Comp Engn, Mersin 10, Turkey. Middle E Tech Univ, Dept Elect & Elect Engn, TR-06531 Ankara, Turkey. RP Altincay, H (reprint author), Eastern Mediterranean Univ, Dept Comp Engn, Gazi Magusa,KKTC, Mersin 10, Turkey. 
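Dempster's rule of combination, the final step of the algorithm described above, is compact enough to state directly. The sketch below represents mass functions as dicts over frozenset focal elements; the example masses are invented.

```python
def dempster_combine(m1, m2):
    """Dempster's rule for two mass functions.

    m(A) = sum over B, C with B & C == A of m1(B)*m2(C), normalised by
    1 - K, where K is the mass assigned to disjoint (conflicting) pairs.
    """
    combined, conflict = {}, 0.0
    for b, mb in m1.items():
        for c, mc in m2.items():
            a = b & c
            if a:
                combined[a] = combined.get(a, 0.0) + mb * mc
            else:
                conflict += mb * mc
    if conflict >= 1.0:
        raise ValueError("total conflict: sources are incompatible")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}

# Two classifiers' evidence over speakers {s1, s2, s3}:
m1 = {frozenset({"s1"}): 0.6, frozenset({"s1", "s2"}): 0.3,
      frozenset({"s1", "s2", "s3"}): 0.1}
m2 = {frozenset({"s1"}): 0.5, frozenset({"s2"}): 0.3,
      frozenset({"s1", "s2", "s3"}): 0.2}
print(dempster_combine(m1, m2))
```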
EM hakan.altincay@emu.edu.tr; demirek@eee.metu.edu.tr CR Al-Ghoneim K, 1998, PATTERN RECOGN, V31, P2077, DOI 10.1016/S0031-3203(98)00030-2 Altincay H, 1999, PROCEEDINGS OF THE IEEE-EURASIP WORKSHOP ON NONLINEAR SIGNAL AND IMAGE PROCESSING (NSIP'99), P321 ALTINCAY H, 1999, EUROSPEECH P SEPT, P971 ALTINCAY H, 2000, IEEE ICASSP P JUN ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 BATTITI R, 1994, NEURAL NETWORKS, V7, P691, DOI 10.1016/0893-6080(94)90046-9 BENEDIKTSSON JA, 1992, IEEE T SYST MAN CYB, V22, P688, DOI 10.1109/21.156582 BHATNAGAR RK, 1986, UNCERTAINTY ARTIFICI, P3 Bhattacharya P, 2000, IEEE T SYST MAN CY A, V30, P526, DOI 10.1109/3468.867860 Bloch I, 1996, IEEE T SYST MAN CY A, V26, P52, DOI 10.1109/3468.477860 BRUNELLI R, 1995, IEEE T PATTERN ANAL, V17, P955, DOI 10.1109/34.464560 Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714 CHEN K, 1996, IEEE INT C NEUR NETW, V4, P2015 Chen K, 1998, NEUROCOMPUTING, V20, P227, DOI 10.1016/S0925-2312(98)00019-8 Chen S, 1997, MOL ENDOCRINOL, V11, P3, DOI 10.1210/me.11.1.3 Doddington GR, 2000, SPEECH COMMUN, V31, P225, DOI 10.1016/S0167-6393(99)00080-1 FARRELL KR, 1995, MODERN METHODS SPEEC, P279 FARRELL KR, 1998, IEEE ICASSP P, V2, P1129 FARRELL KR, 1995, INT CONF ACOUST SPEE, P349, DOI 10.1109/ICASSP.1995.479545 Fredouille C, 2000, DIGIT SIGNAL PROCESS, V10, P172, DOI 10.1006/dspr.1999.0367 FUNG RM, 1986, UNCERTAINTY ARTIFICI, P295 FUNG RM, 1986, UNCERTAINITY ARTIFIC, P117 Furui S, 1997, PATTERN RECOGN LETT, V18, P859, DOI 10.1016/S0167-8655(97)00073-1 GENOUD D, 1996, ICSLP 96, V3, P1756 Gish H, 1994, IEEE SIGNAL PROC MAG, V11, P18, DOI 10.1109/79.317924 HECK LP, 1997, IEEE ICASSP P APR Heck LP, 2000, SPEECH COMMUN, V31, P181, DOI 10.1016/S0167-6393(99)00077-1 Hegarat-Mascle S. L., 1998, Pattern Recognition, V31 HERMANSKY H, 1991, EUROSPEECH P Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HO TK, 1994, IEEE T PATTERN ANAL, V16, P66 MELIN H, 1997, GUIDELINES EXPT POLY Pigeon S, 2000, DIGIT SIGNAL PROCESS, V10, P237, DOI 10.1006/dspr.1999.0358 Quatieri T. F., 2002, DISCRETE TIME SPEECH QUATIERI TF, 1998, IEEE ICASSP P Radova V, 1997, INT CONF ACOUST SPEE, P1135, DOI 10.1109/ICASSP.1997.596142 Rahman AFR, 1999, IEE P-VIS IMAGE SIGN, V146, P40, DOI 10.1049/ip-vis:19990015 REYNOLDS DA, 1997, EUROSPEECH P Reynolds DA, 2000, DIGIT SIGNAL PROCESS, V10, P19, DOI 10.1006/dspr.1999.0361 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 Shafer G., 1976, MATH THEORY EVIDENCE SHAFER G, 1987, ARTIF INTELL, V33, P271, DOI 10.1016/0004-3702(87)90040-3 SMETS P, 1994, ARTIF INTELL, V66, P191, DOI 10.1016/0004-3702(94)90026-4 SOONG FK, 1988, IEEE T ACOUST SPEECH, V36, P871, DOI 10.1109/29.1598 VOORBRAAK F, 1991, ARTIF INTELL, V48, P171, DOI 10.1016/0004-3702(91)90060-W XU L, 1992, IEEE T SYST MAN CYB, V22, P418, DOI 10.1109/21.155943 NR 46 TC 14 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV PY 2003 VL 41 IS 4 BP 531 EP 547 DI 10.1016/S0167-6393(03)00032-3 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 747KC UT WOS:000186802800001 ER PT J AU Peinado, AM Sanchez, V Perez-Cordoba, JL de la Torre, A AF Peinado, AM Sanchez, V Perez-Cordoba, JL de la Torre, A TI HMM-based channel error mitigation and its application to distributed speech recognition SO SPEECH COMMUNICATION LA English DT Article DE channel error mitigation; distributed speech recognition; hidden Markov models; wireless communications; forward-backward algorithm; Viterbi algorithm ID CODES AB The emergence of distributed speech recognition has generated the need to mitigate the degradations that the transmission channel introduces in the speech features used for recognition. This work proposes a hidden Markov model (HMM) framework from which different mitigation techniques oriented to wireless channels can be derived. First, we study the performance of two techniques based on the use of a minimum mean square error (MMSE) estimation, a raw MMSE and a forward MMSE estimation, over additive white Gaussian noise (AWGN) channels. These techniques are also adapted to bursty channels. Then, we propose two new mitigation methods specially suitable for bursty channels. The first one is based on a forward-backward MMSE estimation and the second one on the well-known Viterbi algorithm. Different experiments are carried out, dealing with several issues such as the application of hard decisions on the received bits or the influence of the estimated channel SNR. The experimental results show that the HMM-based techniques can effectively mitigate channel errors, even in very poor channel conditions. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Granada, Fac Ciencias, Dept Elect & Tecnol Comp, Granada 18071, Spain. RP Peinado, AM (reprint author), Univ Granada, Fac Ciencias, Dept Elect & Tecnol Comp, Granada 18071, Spain. EM amp@ugr.es RI de la Torre, Angel/C-6618-2012; Sanchez , Victoria /C-2411-2012; Peinado, Antonio/C-2401-2012 CR [Anonymous], 2002, 202050 ETSI ES BERNARD A, 2001, P EUR C SPEECH COMM, P2703 EBEL WJ, 1995, IEEE T COMMUN, V43, P298, DOI 10.1109/26.380048 *ETSI DSR APPL PRO, 2001, AU33501 ETSI DSR APP *ETSI ES, 2000, 201 ETSI ES *ETSI TR, 2000, 101085 ETSI TR Fingscheidt T, 2001, IEEE T SPEECH AUDI P, V9, P240, DOI 10.1109/89.905998 GERLACH CG, 1993, P ICASSP 93 MINN US, V2, P419 HAGENAUER J, 1989, P IEEE GLOBECOM 89 D, P1680 LIGDAS P, 1997, P 31 C INF SCI SYST, P546 MILNER B, 2001, P IEEE INT C AC SPEE, P261 Pearce D., 2000, AVIOS 2000 SPEECH AP Pearce D., 2000, P ICSLP, V4, P29 PEINADO AM, 2002, P ICSLP 2002 DEN SEP PEINADO AM, 2001, P EUROSPEECH 2001 AA, P2707 Rabiner L, 1993, FUNDAMENTALS SPEECH RISKIN EA, 2001, P EUROSPEECH 2001 AA, P2715 SKOGLUND M, 1994, P ICASSP 1994 AD AUS, V5, P605 VAISHAMPAYAN VA, 1992, IEEE T INFORM THEORY, V38, P1230, DOI 10.1109/18.144704 NR 19 TC 23 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
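The raw MMSE estimator the abstract starts from can be sketched for a single quantised feature under a memoryless binary symmetric channel; the codebook, priors, and bit error rate below are illustrative. The forward and forward-backward variants additionally propagate these posteriors through a Markov model of the source, which this sketch omits.

```python
import numpy as np

def mmse_mitigate(received_bits, codebook, codes, prior, ber):
    """Frame-by-frame (raw) MMSE mitigation for one quantised feature.

    Posterior over codewords given the received bits, with the channel
    likelihood ber^d * (1-ber)^(n-d) for Hamming distance d; the MMSE
    estimate is the posterior-weighted mean of the codebook values.
    """
    hamming = np.array([np.sum(c != received_bits) for c in codes])
    nbits = codes.shape[1]
    lik = (ber ** hamming) * ((1.0 - ber) ** (nbits - hamming))
    post = lik * prior
    post /= post.sum()
    return post @ codebook

codebook = np.array([-1.0, -0.3, 0.3, 1.0])        # 2-bit quantiser levels
codes = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])  # their bit patterns
prior = np.full(4, 0.25)
print(mmse_mitigate(np.array([1, 0]), codebook, codes, prior, ber=0.1))
```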
PD NOV PY 2003 VL 41 IS 4 BP 549 EP 561 DI 10.1016/S0167-6393(03)00048-7 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 747KC UT WOS:000186802800002 ER PT J AU Kim, JH Woodland, PC AF Kim, JH Woodland, PC TI A combined punctuation generation and speech recognition system and its performance enhancement using prosody SO SPEECH COMMUNICATION LA English DT Article DE punctuation generation; speech recognition; prosody; classification and regression tree (CART); N-best rescoring AB A punctuation generation system which combines prosodic information with acoustic and language model information is presented. Experiments have been conducted for both the reference text transcriptions and speech recogniser outputs. For the reference transcription, prosodic information of acoustic data is shown to be more useful than language model information. Several straightforward modifications of a conventional speech recogniser allow the system to produce punctuation and speech recognition hypotheses simultaneously. The multiple hypotheses produced by the automatic speech recogniser are then re-scored using prosodic information. When the prosodic information is incorporated, the F-measure (defined as harmonic mean of recall and precision) can be improved. This speech recognition system including punctuation gives a small reduction in word error rate on the 1-best speech recognition output including punctuation. An alternative approach for generating punctuation from the un-punctuated 1-best speech recognition output is also proposed. The results from these two alternative schemes are compared. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Kim, JH (reprint author), LG Elect Inst Technol, Mobile & Multimedia Lab, 16 Woomyeon, Seoul 137724, South Korea. EM jhk23@eng.cam.ac.uk; pcw@eng.cam.ac.uk CR ABNEY S, 1995, LINGUISTICS FDN LING, P145 Beeferman D, 1998, INT CONF ACOUST SPEE, P689, DOI 10.1109/ICASSP.1998.675358 Breiman L, 1983, CLASSIFICATION REGRE, V1st Chen C.J., 1999, P EUROSPEECH, P447 Fach M.L., 1999, P EUR BUD, P527 Gotoh Y., 2000, P ISCA WORKSH AUT SP, P228 Grishman R., 1995, P 6 MESS UND C, P1, DOI 10.3115/1072399.1072401 HAKKANITUR D, 1999, P EUR, P1991 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Makhoul J., 1999, P DARPA BROADC NEWS, P249 Niesler TR, 1998, INT CONF ACOUST SPEE, P177, DOI 10.1109/ICASSP.1998.674396 *NIST, 1998, NIST HUB 4 INF EXTR ODELL JJ, 1999, P 1999 DARPA BROADC, P271 PALLETT DS, 1999, P DARPA BROADC NEWS, P5 Palmer DD, 1997, COMPUT LINGUIST, V23, P241 Rabiner L, 1993, FUNDAMENTALS SPEECH SHAW H, 1993, PUNCTUATE RIGHT Shriberg E., 1998, LANG SPEECH, V41, P439 Silverman K., 1992, P INT C SPOK LANG PR, P867 Stolcke A, 1999, P DARPA BROADC NEWS, P61 TAYLOR P, 1998, LANG SPEECH, V41, P489 *U CHIC, 1993, CHIC MAN STYL WOODLAND P, 1998, P BORADC NEWS TRANSC Woodland PC, 2002, SPEECH COMMUN, V37, P47, DOI 10.1016/S0167-6393(01)00059-0 Woodland P. C., 1999, P DARPA BROADC NEWS, P265 NR 25 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV PY 2003 VL 41 IS 4 BP 563 EP 577 DI 10.1016/S0167-6393(03)00049-9 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 747KC UT WOS:000186802800003 ER PT J AU Molau, S Keysers, D Ney, H AF Molau, S Keysers, D Ney, H TI Matching training and test data distributions for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE normalization; feature transformation; feature extraction; noise robustness; histogram normalization; feature space rotation AB In this work, normalization techniques in the acoustic feature space are studied that aim at reducing the mismatch between training and test by matching their distributions. Histogram normalization is the first technique explored in detail. The effect of normalization at different signal analysis stages as well as training and test data normalization are investigated. The basic normalization approach is improved by taking care of the variable silence fraction. Feature space rotation is the second technique that is introduced. It accounts for undesired variations in the acoustic signal that are correlated in the feature space dimensions. The interaction of rotation and histogram normalization is analyzed and it is shown that the recognition accuracy is significantly improved by both techniques on corpora with different complexity, acoustic conditions, and speaking styles. The word error rate is reduced from 24.6% to 21.8% on VerbMobil II, a German large vocabulary conversational speech task, and from 16.5% to 15.5% on EuTrans II, an Italian corpus of conversational speech over the telephone. On the CarNavigation task, a German isolated-word corpus recorded partly in noisy car environments, the word error rate is reduced from 74.2% to 11.1% for heavy mismatch conditions between training and test. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Technol, Rhein Westfal TH Aachen, Dept Comp Sci, Lehrstuhl Informat 6, D-52056 Aachen, Germany. RP Molau, S (reprint author), Univ Technol, Rhein Westfal TH Aachen, Dept Comp Sci, Lehrstuhl Informat 6, D-52056 Aachen, Germany. EM molau@informatik.rwth-aachen.de; keysers@informatik.rwth-aachen.de; ney@informatik.rwth-aachen.de CR BALCHANDRAN R, 1998, P ICASSP, V2, P749, DOI 10.1109/ICASSP.1998.675373 Ballard D. H., 1982, COMPUTER VISION CASACUBERTA F, 2001, P IEEE INT C AC SPEE, V1, P613 DHARANIPRAGADA S, 2000, P INT C SPOK LANG PR, V6, P556 GALES MJF, 2001, P IEEE AUT SPEECH RE GUILIANI D, 1999, P EUR C SPEECH COMM, V6, P2487 Hilger F., 2001, P EUR C SPEECH COMM, V2, P1135 HILGER F, 2002, P INT C SPOK LANG PR, V1, P237 Lee CH, 1996, ACIAR PROC, P83 Lee L., 1996, P ICASSP, V1, P353 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MACHEREY W, 2002, P IEEE INT C AC SPEE, V1, P733 MATSUKOTO H, 1992, P ICASSP, V1, P449 MOLAU S, 2002, P ICSLP 02 DENV COL, P1421 Molau S, 2001, P IEEE AUT SPEECH RE NEUMEYER L, 1994, P ICASSP, V1, P417 NEY H, 1998, P INT C AC SPEECH SI, V2, P853, DOI 10.1109/ICASSP.1998.675399 PADMANABHAN M, 2001, P EUR C SPEECH COMM, V4, P2359 SANKAR A, 1995, P ICASSP, V1, P121 SIXTUS A, 2000, P IEEE INT C AC SPEE, V3, P1671 Wahlster W., 2000, VERBMOBIL FDN SPEECH Welling L., 1999, P IEEE INT C AC SPEE, P761 NR 22 TC 9 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
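Plain histogram normalisation reduces to per-dimension CDF matching. A minimal sketch, assuming float feature matrices of shape (frames, dims) and ignoring the paper's silence-fraction refinement and feature-space rotation:

```python
import numpy as np

def histogram_normalize(test_feats, train_feats):
    """Match each test feature dimension's distribution to training.

    Classic CDF matching: replace each test value by the training value
    at the same empirical quantile, independently per dimension.
    """
    out = np.empty_like(test_feats)
    for d in range(test_feats.shape[1]):
        ranks = np.argsort(np.argsort(test_feats[:, d]))
        q = (ranks + 0.5) / len(ranks)          # empirical quantiles in (0,1)
        out[:, d] = np.quantile(train_feats[:, d], q)
    return out

rng = np.random.default_rng(1)
train = rng.normal(0.0, 1.0, (500, 2))
test = rng.normal(2.0, 3.0, (200, 2))            # shifted/scaled mismatch
print(histogram_normalize(test, train).mean(axis=0))  # pulled toward 0
```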
PD NOV PY 2003 VL 41 IS 4 BP 579 EP 601 DI 10.1016/S0167-6393(03)00085-2 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 747KC UT WOS:000186802800004 ER PT J AU Nwe, TL Foo, SW De Silva, LC AF Nwe, TL Foo, SW De Silva, LC TI Speech emotion recognition using hidden Markov models SO SPEECH COMMUNICATION LA English DT Article DE recognition of emotion; emotional speech; log frequency power coefficients; hidden Markov model; human communication ID PROSODIC FEATURES; EXPRESSION; VOICE AB In emotion classification of speech signals, the popular features employed are statistics of fundamental frequency, energy contour, duration of silence and voice quality. However, the performance of systems employing these features degrades substantially when more than two categories of emotion are to be classified. In this paper, a text independent method of emotion classification of speech is proposed. The proposed method makes use of short time log frequency power coefficients (LFPC) to represent the speech signals and a discrete hidden Markov model (HMM) as the classifier. The emotions are classified into six categories. The category labels used are, the archetypal emotions of Anger, Disgust, Fear, Joy, Sadness and Surprise. A database consisting of 60 emotional utterances, each from twelve speakers is constructed and used to train and test the proposed system. Performance of the LFPC feature parameters is compared with that of the linear prediction Cepstral coefficients (LPCC) and mel-frequency Cepstral coefficients (MFCC) feature parameters commonly used in speech recognition systems. Results show that the proposed system yields an average accuracy of 78% and the best accuracy of 96% in the classification of six emotions. This is beyond the 17% chances by a random hit for a sample set of 6 categories. Results also reveal that LFPC is a better choice as feature parameters for emotion classification than the traditional feature parameters. (C) 2003 Elsevier B.V. All rights reserved. C1 Natl Univ Singapore, Dept Elect & Comp Engn, Singapore 117576, Singapore. Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. RP Nwe, TL (reprint author), Natl Univ Singapore, Dept Elect & Comp Engn, 4 Engn Dr 3, Singapore 117576, Singapore. EM engp8469@nus.edu.sg; eswfoo@ntu.edu.sg CR Arnold M. B., 1960, EMOTION PERSONALITY, V2 ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 Becchetti C., 1998, SPEECH RECOGNITION T Cahn J.E., 1990, J AM VOICE I O SOC, V8, P1 CAIRNS DA, 1994, J ACOUST SOC AM, V96, P3392, DOI 10.1121/1.410601 COLEMAN R, 1979, CARE PROFESSIONAL VO, V1 CORNELIUS R, 1996, SCIENCE EMOTION COWAN M, 1936, PITCH INTENSITY CHAR Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 CRYSTAL D, 1975, ENGLIST TONE VOICE Crystal D., 1969, PROSODIC SYSTEMS INT Darwin C., 1965, EXPRESSION EMOTIONS DAVIS M, 1975, RECOGNITION FACIAL E DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Davitz Joel Robert, 1964, COMMUNICATION EMOTIO DELLAERT F, 1996, 4 INT C SPOK LANG PR, V3, P1970 Deller J. 
R., 1993, DISCRETE TIME PROCES De Silva LC, 1998, IEICE T INF SYST, VE81D, P105 Ekman P, 1975, UNMASKING FACE, V1st Ekman P, 1973, DARWIN FACIAL EXPRES ELIAS NJ, 1975, P IEEE INT S CIRC SY, P329 EQUITZ WH, 1989, IEEE T ACOUST SPEECH, V37, P1568, DOI 10.1109/29.35395 Fairbanks G, 1939, SPEECH MONOGR, V6, P87 Fonagy I., 1963, Z PHONETIK SPRACHWIS, V16, P293 FONAGY I, 1978, LANG SPEECH, V21, P34 FOX NA, 1991, AM PSYCHOL, V46, P863, DOI 10.1037/0003-066X.46.8.863 FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412 Furui S., 1989, DIGITAL SPEECH PROCE HAVRDOVA Z, 1979, ACTIV NERV SUPER, V21, P33 HUTTAR GL, 1968, J SPEECH HEAR RES, V11, P481 JOHNSON WF, 1986, ARCH GEN PSYCHIAT, V43, P280 Kaiser L., 1962, SYNTHESE, V14, P300, DOI 10.1007/BF00869311 KOTLYAR GM, 1976, SOV PHYS ACOUST+, V22, P208 Lazarus RS, 1991, EMOTION ADAPTATION LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 LYNCH GE, 1934, ARCH SPEECH, V1, P9 McGilloway S., 2000, APPROACHING AUTOMATI McGilloway S., 1995, P 13 INT C PHON SCI, V1, P250 MULLER A, 1960, THESIS U GOTTINGEN G MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 NICHOLSON J, 1999, 6 INT C NEUR INF PRO, V2, P495 O'Connor John D., 1973, INTONATION COLLOQUIA OATLEY K, 1995, GOALS AFFECT OSTER A, 1986, 41986 SPEECH TRANS L, P79 OTALEY K, 1996, UNDERSTANDING EMOTIO PLUTCHIK R, 1994, PSYCHOL BIOL EMOTION, P58 Rabiner L, 1993, FUNDAMENTALS SPEECH Rabiner L.R., 1978, DIGITAL PROCESSING S SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009 Scherer KR, 1984, APPROACHES EMOTION Schubiger M., 1958, ENGLISH INTONATION I SULC J, 1977, ACTIV NERV SUPER, V19, P215 TROJAN F, 1952, AUSDRUCK SPRECHSTIMM UTSUKI N, 1976, JAPAN AIR SELF DEFEN, V16, P179 van Bezooijen R., 1984, CHARACTERISTICS RECO Williams C., 1981, SPEECH EVALUATION PS, P189 WILLIAMS CE, 1969, AEROSPACE MED, V40, P1369 YAMADA T, 1995, P 1995 IEEE IECON 21, V1, P183, DOI 10.1109/IECON.1995.483355 NR 59 TC 147 Z9 154 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2003 VL 41 IS 4 BP 603 EP 623 DI 10.1016/S0167-6393(03)00099-2 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 747KC UT WOS:000186802800005 ER PT J AU Kochanski, G Shih, C Jing, HY AF Kochanski, G Shih, C Jing, HY TI Quantitative measurement of prosodic strength in Mandarin SO SPEECH COMMUNICATION LA English DT Article DE intonation; tone; tonal variation; prosodic structure; metrical pattern; prosodic strength; prosody modeling; muscle dynamics; text-to-speech ID DIFFERENT SPEAKING CONDITIONS; MOVEMENTS; SPEECH; DURATION; ECONOMY; CONTEXT; CHINESE; ENGLISH; STRESS; MUSCLE AB We describe models of Mandarin prosody that allow us to make quantitative measurements of prosodic strengths. These models use Stem-ML, which is a phenomenological model of the muscle dynamics and planning process that controls the tension of the vocal folds, and therefore the pitch of speech. Because Stem-ML describes the interactions between nearby tones, we were able to capture surface tonal variations using a highly constrained model with only one template for each lexical tone category, and a single prosodic strength per word. The model accurately reproduces the intonation of the speaker, capturing 87% of the variance of f(0) with these strength parameters.
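The strength idea can be loosely illustrated, though not faithfully: the sketch below simply scales toy tone templates by per-word strengths and low-pass filters the result, whereas Stem-ML actually derives the contour from an effort/error optimisation over muscle dynamics. All templates and numbers are invented.

```python
import numpy as np

def f0_contour(tones, strengths, templates, smooth=5):
    """Illustrative strength-weighted tone-template blending (not Stem-ML).

    Each syllable's lexical-tone template is scaled toward a neutral
    pitch line by the carrying word's prosodic strength; a moving
    average stands in for the smoothing effect of vocal-fold dynamics.
    """
    raw = np.concatenate([s * np.asarray(templates[t], dtype=float)
                          for t, s in zip(tones, strengths)])
    kernel = np.ones(smooth) / smooth
    return np.convolve(raw, kernel, mode="same")

templates = {                     # toy normalised tone shapes, 20 pts each
    1: np.full(20, 1.0),                  # tone 1: high level
    2: np.linspace(-0.5, 1.0, 20),        # tone 2: rising
    4: np.linspace(1.0, -1.0, 20),        # tone 4: falling
}
print(f0_contour([1, 4], [0.9, 0.4], templates)[:5])
```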
The result reveals alternating metrical patterns in words, and shows that the speaker marks a hierarchy of boundaries by controlling the prosodic strength of words. The strengths we obtain are also correlated with syllable duration, mutual information and part-of-speech. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Oxford, Phonet Lab, Oxford OX1 2JF, England. Bell Labs, Murray Hill, NJ 07974 USA. RP Kochanski, G (reprint author), Univ Oxford, Phonet Lab, 41 Wellington Sq, Oxford OX1 2JF, England. EM gpk@kochanski.org CR Bellegarda JR, 2001, IEEE T SPEECH AUDI P, V9, P52, DOI 10.1109/89.890071 Browman C. P., 1990, PAPERS LABORATORY PH, P341 CHEN SH, 1992, P ICASSP, P45 CHEN Y, 2000, P 6 INT C SPOK LANG Church K. W., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90016-J *COMP LING SOC REB, 1993, ANNOUNCED LINGUIST L CROWNINSHIELD RD, 1981, J BIOMECH, V14, P793, DOI 10.1016/0021-9290(81)90035-X CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1574, DOI 10.1121/1.395912 EDWARDS J, 1991, J ACOUST SOC AM, V89, P369, DOI 10.1121/1.400674 FANO RM, 1961, TRANSMISSION INFORMA FELDMAN AG, 1990, WINTERS WOO, P195 FLASH T, 1985, J NEUROSCI, V5, P1688 FLEMMING EDWARD, 2001, PHONOLOGY, V18, P7 Flemming Edward, 1997, U MARYLAND WORKING P, P72 Fujisaki H., 1983, PRODUCTION SPEECH, P39 HIRSCHBERG J, 1993, ARTIF INTELL, V63, P305, DOI 10.1016/0004-3702(93)90020-C HIRSCHBERG J, 1986, P 24 ASS COMP LING S, V24, P136 HOGAN N, 1990, WINTERS WOO, P182 HOLLIEN H, 1981, VOLAL FOLD PHYSL CON, P361 KLATT DH, 1973, J ACOUST SOC AM, V54, P1102, DOI 10.1121/1.1914322 KOCHANSKI G, 2001, P EUR 2001 INT SPEEC Kochanski G, 2003, SPEECH COMMUN, V39, P311, DOI 10.1016/S0167-6393(02)00047-X KOCHANSKI G, 2003, J SPEECH TECHNOL, V6, P33 KOCHANSKI GP, 2000, P INT C SPOK LANG PR, V3, P239 Laboissiere R, 1996, BIOL CYBERN, V74, P373 LADD DR, 1996, INTONATIONB PHONOLOG Lea W. A., 1973, CONSONANT TYPES TONE, P15 Lee LS, 1993, IEEE T SPEECH AUDI P, V1, P287, DOI 10.1109/89.232612 Levenberg K., 1944, Quarterly of Applied Mathematics, V2 LIBERMAN M, 1977, LINGUIST INQ, V8, P249 Liberman Mark, 1984, LANGUAGE SOUND STRUC, P157 LIN MC, 1983, P 10 INT C PHON SCI, P504 LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 MACNEILAGE PF, 1979, J ACOUST SOC AM, V65, P1047, DOI 10.1121/1.382573 MARQUARDT DW, 1963, J SOC IND APPL MATH, V11, P431 *MATHS INC, 1995, SPLUS ONL DOC MCFARLAND DH, 1992, J SPEECH HEAR RES, V35, P971 MONSEN RB, 1978, J ACOUST SOC AM, V64, P65, DOI 10.1121/1.381957 MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 NELSON WL, 1983, BIOL CYBERN, V46, P135, DOI 10.1007/BF00339982 OHALA JJ, 1992, PAPERS LAB PHONOLOGY, V2, P16 Ohman S., 1967, WORD SENTENCE INTONA PAN S, 2000, P 38 ANN M ASS COMP, P15 Perkell JS, 2002, J ACOUST SOC AM, V112, P1627, DOI 10.1121/1.1506369 Perkell JS, 2002, J ACOUST SOC AM, V112, P1642, DOI 10.1121/1.1506368 Pierrehumbert J., 1988, JAPANESE TONE STRUCT PRINCE A, 2001, IN PRESS OPTIMALITY SEIFNARAGHI AH, 1990, WINTERS WOO, P312 Selkirk E. 
O., 1984, PHONOLOGY SYNTAX REL Shih C., 2001, P EUR 2001 INT SPEEC, P669 Shih C., 1988, WORKING PAPERS CORNE, V3, P83 SHIH C, 2001, P ISCA TUTORIAL RES, P133 Shih C., 1996, COMPUTATIONAL LINGUI, V1, P37 Shih C., 1992, P IRCS WORKSH PROS N, P193 SHIH C, 2000, P INT C SPOK LANG PR, V2, P67 Shih C., 1997, PROGR SPEECH SYNTHES, P383 Shih Chilin, 1986, THESIS U CALIFORNIA Shih CL, 2000, TEXT SPEECH LANG TEC, V15, P243 SILVERMAN KEA, 1987, THESIS U CAMBRIDGE U Sproat R., 1990, Computer Processing of Chinese & Oriental Languages, V4 Stevens K.N., 1998, ACOUSTIC PHONETICS TALKIN D, 1996, SPEECH CODING SYNTHE VANSANTEN J, 1997, EUROSPEECH97, V2, P553 Whalen DH, 1997, PHONETICA, V54, P138 WILDER CN, 1981, VOCAL FOLD PHYSL CON, P109 WINKWORTH AL, 1995, J SPEECH HEAR RES, V38, P124 WINTERS JM, 1990, WINTERS WOO, P69 Winters JM, 1990, MULTIPLE MUSCLE SYST XU Y, 2000, P 6 INT C SPOK LANG, P16 Xu Y, 2001, SPEECH COMMUN, V33, P319, DOI 10.1016/S0167-6393(00)00063-7 ZAHALAK GI, 1990, WINTERS WOO, P1 ZAJAC FE, 1989, CRIT REV BIOMED ENG, V17, P359 ZAJAC PE, 1990, WINTERS WOO, P139 NR 73 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2003 VL 41 IS 4 BP 625 EP 645 DI 10.1016/S0167-6393(03)00100-6 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 747KC UT WOS:000186802800006 ER PT J AU Zitouni, I Kuo, HKJ Lee, CH AF Zitouni, I Kuo, HKJ Lee, CH TI Boosting and combination of classifiers for natural language call routing systems SO SPEECH COMMUNICATION LA English DT Article DE call routing; relevance feedback; boosting; discriminative training; constrained minimization ID RECOGNITION; MODEL AB In this paper, we present different techniques to improve natural language call routing. We first describe methods to improve a single classifier: boosting, discriminative training (DT) and automatic relevance feedback (ARF). An interesting feature of some of these algorithms is the ability to re-weight the training data in order to focus the classifier on documents judged difficult to classify. We explore ways of deriving and combining uncorrelated classifiers in order to improve accuracy; we discuss specifically the linear interpolation and the constrained minimization techniques. All these approaches are probabilistic and are inspired from the information retrieval domain. They are evaluated using two similarity metrics, a common cosine measure from the vector space model, and a beta measure which had given good results in the similar task of e-mail steering. Compared to the baseline classifiers, we show an interesting improvement in the classification accuracy on call routing for a banking task: up to 20% reported for the ARF method, up to 30% for the boosting technique, and more than 45% for the DT approach. Another relative improvement of 11% is also obtained when we combine the classifiers with the constrained minimization approach using a confusion measure and DT. More importantly, synergistic effects of DT on the boosting algorithm were demonstrated: more iterations were possible because DT reduced the classification error rate of individual classifiers trained on re-weighted data by an average of 72%. (C) 2003 Elsevier B.V. All rights reserved. C1 Bell Labs, Lucent Technol, Murray Hill, NJ 07974 USA. RP Zitouni, I (reprint author), Bell Labs, Lucent Technol, 600 Mt Ave, Murray Hill, NJ 07974 USA. 
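The reweighting at the heart of boosting is easy to exhibit. Below is a bare-bones AdaBoost over +/-1 labels using decision stumps as stand-in weak learners (the paper boosts vector-space routing classifiers instead); examples the current round misclassifies gain weight, focusing later rounds on hard-to-route requests.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost(X, y, rounds=10):
    """Minimal AdaBoost; y must be a numpy array of +1/-1 labels."""
    n = len(y)
    w = np.full(n, 1.0 / n)
    learners, alphas = [], []
    for _ in range(rounds):
        stump = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)
        # Misclassified examples (y*pred = -1) get exp(+alpha) more weight.
        w *= np.exp(-alpha * y * pred)
        w /= w.sum()
        learners.append(stump)
        alphas.append(alpha)
    return lambda Xq: np.sign(
        sum(a * s.predict(Xq) for a, s in zip(alphas, learners)))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
y = np.where(X[:, 0] + X[:, 1] > 0, 1, -1)
clf = adaboost(X, y)
print((clf(X) == y).mean())
```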
EM zitouni@research.bell-labs.com; hkuo@us.ibm.com; chl@ece.gatech.edu CR Arai K, 1999, SPEECH COMMUN, V27, P43, DOI 10.1016/S0167-6393(98)00065-X Beesley KR, 1988, P 29 ANN C AM TRANSL, P47 BIGI B, 2000, SIGNAL PROCESS J, V6 BIGI B, 2001, STRING PROCESSING IN BIGI B, 2001, RECENT ADV NLP CARPENTER B, 1998, P ICSLP 98 SYDN AUST, P2059 CAVNAR W, 1994, S DOC AN INF RETR LA Chu-Carroll J, 1999, COMPUT LINGUIST, V25, P361 Drucker H., 1993, International Journal of Pattern Recognition and Artificial Intelligence, V7, DOI 10.1142/S0218001493000352 DURSTON P, 2001, P EUR 01 AALB DENM Freund Y, 1997, J COMPUT SYST SCI, V55, P119, DOI 10.1006/jcss.1997.1504 FREUND Y, 1995, INFORM COMPUT, V121, P256, DOI 10.1006/inco.1995.1136 GORIN A, 2000, P ATR WORKSH MULT SP Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X Jelinek F., 1980, Pattern Recognition in Practice. Proceedings of an International Workshop Jelinek F., 1990, READINGS SPEECH RECO, P450 Joachims T, 1998, EUR C MACH LEARN ECM, P137 JURAFSY D, 2000, COMPUTATIONAL LINGUI Katagiri S, 1998, P IEEE, V86, P2345, DOI 10.1109/5.726793 KEARNS M, 1994, J ACM, V41, P67, DOI 10.1145/174644.174647 Kearns M. J., 1994, INTRO COMPUTATIONAL KUHN R, 1990, IEEE T PATTERN ANAL, V12, P570, DOI 10.1109/34.56193 KUO HKJ, 2001, P EUR 01 AALB DENM KUO HKJ, 2002, P INT C SPEECH LANG Kuo HKJ, 2003, IEEE T SPEECH AUDI P, V11, P24, DOI 10.1109/TSA.2002.807352 KUO HKJ, 2000, P INT C SPOK LANG PR, P374 LEE CH, 1998, P INT VOIC TECHN TEL MATSUNAGA S, 1992, P INT C AC SPEECH SI, P165, DOI 10.1109/ICASSP.1992.225946 MCDONOUGH J, 1994, P INT C AC SPEECH SI, P385 Mills P, 2000, EUR PHYS J E, V1, P5, DOI 10.1007/s101890050002 NIYOGI P, 2000, P ICASSP Rocchio J. J., 1971, SMART RETRIEVAL SYST, ppp ROCHERY M, 2002, P INT C AC SPEECH SI, P29 SALTON G, 1991, SCIENCE, V253, P974, DOI 10.1126/science.253.5023.974 SALTON G, 1990, J AM SOC INFORM SCI, V4, P182 SALTON G, 1975, COMMUN ACM, V18, P613, DOI 10.1145/361219.361220 SCHAPIRE R, 1998, INT ACM SIGIR C RES SCHAPIRE R, 2002, MSRI WORKSH LIN ESTI SCHAPIRE RE, 1990, MACH LEARN, V5, P197, DOI 10.1007/BF00116037 Schapire RE, 1999, P 16 INT JOINT C ART Schapire RE, 2000, MACH LEARN, V39, P135, DOI 10.1023/A:1007649029923 Schapire RE, 1998, ANN STAT, V26, P1651 SCHWARTZ R, 1997, P EUR C SPEECH COMM SIBUN P, 1996, S DOC AN INF RETR LA, P183 VALIANT LG, 1984, COMMUN ACM, V27, P1134, DOI 10.1145/1968.1972 WRIGHT JH, 1997, P 5 EUR C SPEECH COM, P1419 ZITOUNI I, 2002, P INT C AC SPEECH SI ZITOUNI I, 2001, P ASRU 2001 MAD CAMP NR 48 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2003 VL 41 IS 4 BP 647 EP 661 DI 10.1016/S0167-6393(03)00103-1 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 747KC UT WOS:000186802800007 ER PT J AU Kim, EK Han, WJ Oh, YH AF Kim, EK Han, WJ Oh, YH TI A score function of splitting band for two-band speech model SO SPEECH COMMUNICATION LA English DT Article DE two-band speech model; band-splitting frequency; harmonic plus noise model; speech synthesis; speech coding AB Two-band speech model which assumes lower band is a quasi-periodic component and upper band is a non-periodic component is widely used due to its natural and simple framework. 
In this paper, a score function is defined for splitting the lower and upper bands of the two-band speech model, and an estimation method is proposed for the band-splitting frequency, which is the boundary between the two bands. The score function is calculated for each harmonic frequency using the normalized autocorrelation function of the time signal corresponding to each sub-band divided by the given frequency. Using the score function, a tracking technique is applied to the band-splitting frequency estimation procedure to reflect the continuity between neighboring frames. Experimental tests confirm that the proposed score function is effective for estimation of the band-splitting frequency and produces better results compared with previous methods. (C) 2003 Elsevier B.V. All rights reserved. C1 Korea Adv Inst Sci & Technol, Dept Comp Sci, Taejon 305701, South Korea. RP Kim, EK (reprint author), Korea Adv Inst Sci & Technol, Dept Comp Sci, Ku Song Dong, Taejon 305701, South Korea. EM ekkim@bulsai.kaist.ac.kr; hwjketel@bulsai.kaist.ac.kr; yhoh@cs.kaist.ac.kr RI Oh, Yung-Hwan/C-1915-2011 CR Dutoit T, 1996, SPEECH COMMUN, V19, P119, DOI 10.1016/0167-6393(96)00029-5 GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651 Han WJ, 2002, ELECTRON LETT, V38, P292, DOI 10.1049/el:20020182 Kim CG, 2001, PHYS PLASMAS, V8, P12, DOI 10.1063/1.1324658 KIM EK, 2000, P INT C SPOK LANG PR, V3, P275 MARQUES JS, 1994, SPEECH COMMUN, V14, P231, DOI 10.1016/0167-6393(94)90064-7 Mcaulay R. J., 1990, P INT C AC SPEECH SI, P249 MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 MCAULAY RJ, 1991, ADV SPEECH SIGNAL PR, P165 SERRA X, 1990, THESIS STANFORD U DE STYLIANOU Y, 1996, THEISS ECOLE NATL SU Stylianou Y, 2001, IEEE T SPEECH AUDI P, V9, P21, DOI 10.1109/89.890068 Yeldener S., 1994, Proceedings of the 5th International Conference on Signal Processing Applications and Technology NR 13 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2003 VL 41 IS 4 BP 663 EP 674 DI 10.1016/j.specom.2003.08.004 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 747KC UT WOS:000186802800008 ER PT J AU Zheng, J Franco, H Stolcke, A AF Zheng, J Franco, H Stolcke, A TI Modeling word-level rate-of-speech variation in large vocabulary conversational speech recognition SO SPEECH COMMUNICATION LA English DT Article DE rate-of-speech modeling; large vocabulary conversational speech recognition; pronunciation modeling AB Variations in rate-of-speech (ROS) produce variations in both spectral features and word pronunciations that affect automatic speech recognition systems. To deal with these ROS effects, we propose to use a set of parallel rate-specific acoustic and pronunciation models. Rate switching is permitted at word boundaries, to allow within-sentence speech rate variation, which is common in conversational speech. Because of the parallel structure of rate-specific models and the maximum likelihood decoding method, our approach does not require ROS estimation before recognition, which is hard to achieve. We evaluate our models on a large vocabulary conversational speech recognition task over the telephone. Experiments on the NIST 2000 Hub-5 development set show that word-level ROS-dependent modeling results in a 2.2% absolute reduction in word error rate over a rate-independent baseline system.
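The band-splitting score in the Kim, Han and Oh record above can be sketched in Python: a candidate split frequency divides the spectrum into two sub-bands, each sub-band is turned back into a time signal, and its periodicity is measured by the normalized autocorrelation at the pitch lag. This is a structural reading of the abstract, not the authors' implementation; the exact scoring rule and all parameter choices are assumptions, and the paper's cross-frame tracking is reduced here to a per-frame maximisation:

import numpy as np

def normalized_autocorr(x, lag):
    # Normalized autocorrelation of a frame at the pitch lag.
    x = x - np.mean(x)
    num = np.dot(x[:-lag], x[lag:])
    den = np.sqrt(np.dot(x[:-lag], x[:-lag]) * np.dot(x[lag:], x[lag:]))
    return num / den if den > 0 else 0.0

def split_score(frame, fs, split_hz, pitch_lag, n_fft=1024):
    # Zero the spectrum above/below a candidate split frequency, rebuild the
    # two sub-band time signals, and favour splits that leave the low band
    # periodic and the high band aperiodic.
    spec = np.fft.rfft(frame, n_fft)
    k = int(split_hz * n_fft / fs)
    low, high = spec.copy(), spec.copy()
    low[k:], high[:k] = 0.0, 0.0
    low_t = np.fft.irfft(low, n_fft)[:len(frame)]
    high_t = np.fft.irfft(high, n_fft)[:len(frame)]
    return normalized_autocorr(low_t, pitch_lag) - normalized_autocorr(high_t, pitch_lag)

def estimate_split(frame, fs, pitch_lag, harmonics_hz):
    # Per-frame maximisation over the harmonic frequencies; the paper also
    # tracks the estimate across frames to reflect continuity.
    return max(harmonics_hz, key=lambda f: split_score(frame, fs, f, pitch_lag))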
Relative to an enhanced baseline system that models cross-word phonetic elision and reduction in a multiword dictionary, rate-dependent models achieve an absolute improvement of 1.5%. Furthermore, we introduce a novel method to model the reduced pronunciations that are common in fast speech, based on the approach of skipping short phones in the pronunciation models while preserving the phonetic context for the adjacent phones. This method is shown to also produce a small additional improvement on top of ROS-dependent acoustic modeling. (C) 2002 Elsevier B.V. All rights reserved. C1 SRI Int, Speech Technol & Res Lab, Menlo Pk, CA 94025 USA. RP Zheng, J (reprint author), SRI Int, Speech Technol & Res Lab, 333 Ravenswood Ave, Menlo Pk, CA 94025 USA. EM zj@speech.sri.com CR Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P281, DOI 10.1109/89.506931 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P294, DOI 10.1109/89.506933 FINKE M, 1997, P EUR, V5, P2379 Fosler-Lussier E, 1999, SPEECH COMMUN, V29, P137, DOI 10.1016/S0167-6393(99)00035-7 Gadde V. R. Rao, 2000, P NIST SPEECH TRANSC GONZALEZ RC, 1992, DIGITAL IMAGE PROCES, pCH4 IYER R, 1999, P EUROSPEECH BUD, V1, P479 MIRGHAFORI N, 1996, P IEEE INT C AC SPEE, V1, P335 Morgan N., 1998, P IEEE INT C AC SPEE, V2, P729, DOI 10.1109/ICASSP.1998.675368 Murveit H., 1993, P IEEE INT C AC SPEE, V2, P319 Ostendorf M, 1996, IEEE T SPEECH AUDI P, V4, P360, DOI 10.1109/89.536930 PAUL DB, 1997, P ICASSP 97, V2, P1487 Richardson M., 1999, P EUR C SPEECH COMM, V1, P411 SIEGLER MA, 1995, P IEEE INT C AC SPEE, V1, P612 Stolcke A., 2000, P NIST SPEECH TRANSC TUERK A, 1999, P EUROSPEECH, V1, P419 WEGMANN S, 1996, P ICASSP, V1, P339 WEINTRAUB M, 1998, 9 HUB 5 CONV SPEECH ZHENG J, 2000, P INT C AC SPEECH SI, V3, P1775 NR 19 TC 7 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 273 EP 285 DI 10.1016/S0167-6393(03)00122-X PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900001 ER PT J AU Janse, E Nooteboom, S Quene, H AF Janse, E Nooteboom, S Quene, H TI Word-level intelligibility of time-compressed speech: prosodic and segmental factors SO SPEECH COMMUNICATION LA English DT Article DE perception; time-compression; prosody; intelligibility; timing; fast speech ID VOWEL REDUCTION; RECOGNITION; INFORMATION; DURATION; ENGLISH; STRESS AB In this study we investigate whether speakers, in line with the predictions of the Hyper- and Hypospeech theory, speed up most during the least informative parts and less during the more informative parts, when they are asked to speak faster. We expected listeners to benefit from these changes in timing, and our main goal was to find out whether making the temporal organisation of artificially time-compressed speech more like that of natural fast speech would improve intelligibility over linear time compression. Our production study showed that speakers reduce unstressed syllables more than stressed syllables, thereby making the prosodic pattern more pronounced. We extrapolated fast speech timing to even faster rates because we expected that the more salient prosodic pattern could be exploited in difficult listening situations. However, at very fast speech rates, applying fast speech timing worsens intelligibility.
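The skip-phone idea in the Zheng, Franco and Stolcke record above can be caricatured in a few lines of Python: fast-speech pronunciation variants are generated by dropping phones whose typical duration falls below a threshold. The duration table, the threshold and the rule of always keeping word-edge phones are invented for the sketch; the paper instead preserves the phonetic context of the adjacent phones inside the pronunciation models themselves:

def fast_variant(pron, durations, min_ms=35):
    # Drop short phones to form a fast-speech variant; word-edge phones are
    # always kept here so the neighbours' context survives.
    return [p for i, p in enumerate(pron)
            if i in (0, len(pron) - 1) or durations.get(p, 100) >= min_ms]

# Hypothetical phone durations (ms) for a reduced rendering of "probably":
durs = {"p": 60, "r": 40, "aa": 90, "b": 30, "ax": 25, "l": 45, "iy": 80}
print(fast_variant(["p", "r", "aa", "b", "ax", "b", "l", "iy"], durs))
# -> ['p', 'r', 'aa', 'l', 'iy']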
We argue that the non-uniform way of speeding up may not be due to an underlying communicative principle, but may result from speakers' inability to speed up otherwise. As both prosodic and segmental information contribute to word recognition, we conclude that extrapolating fast speech timing to extremely fast rates distorts this balance between prosodic and segmental information. (C) 2002 Elsevier B.V. All rights reserved. C1 Univ Utrecht, Utrecht Inst Linguist OTS, NL-3512 JK Utrecht, Netherlands. RP Janse, E (reprint author), Univ Utrecht, Utrecht Inst Linguist OTS, Trans 10, NL-3512 JK Utrecht, Netherlands. EM esther.janse@let.uu.nl RI Janse, Esther/E-3967-2012 CR Altman G., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90022-3 CHARPENTIER FJ, 1986, P ICASSP, V86, P2015 Covell M., 1998, P IEEE INT C AC SPEE Cutler A., 2000, P 6 INT C SPOK LANG, V1, P593 Cutler A, 2001, LANG SPEECH, V44, P171 CUTLER A, 1984, ATTENTION PERFORM, V10, P183 Den Os E. L. S., 1988, THESIS UTRECHT U GAY T, 1978, J ACOUST SOC AM, V63, P223, DOI 10.1121/1.381717 Heuven V. J. van, 1988, LINGUISTICS NETHERLA, P59 Horton WS, 1996, COGNITION, V59, P91, DOI 10.1016/0010-0277(96)81418-1 Kager Rene, 1989, METRICAL THEORY STRE Kozhevnikov V. A., 1965, SPEECH ARTICULATION Lehiste I., 1970, SUPRASEGMENTALS LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 Max L, 1997, J SPEECH LANG HEAR R, V40, P1097 MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183 PORT RF, 1981, J ACOUST SOC AM, V69, P262, DOI 10.1121/1.385347 SLOWIACZEK LM, 1990, LANG SPEECH, V33, P47 Sluijter A. M. C., 1995, THESIS LEIDEN U VANBERGEM DR, 1993, SPEECH COMMUN, V12, P1, DOI 10.1016/0167-6393(93)90015-D van Heuven V. J., 1985, J ACOUST SOC AM, V78, pS21, DOI 10.1121/1.2022696 VANDONSELAAR W, 1994, LANG SPEECH, V37, P375 VANLEYDEN K, 1996, LINGUISTICS NETHERLA WINGFIELD A, 1984, J SPEECH HEAR RES, V27, P128 WINGFIELD A, 1975, STRUCTURE PROCESS SP, P146 NR 27 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 287 EP 301 DI 10.1016/S0167-6393(02)00130-9 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900002 ER PT J AU Engwall, O AF Engwall, O TI Combining MRI, EMA and EPG measurements in a three-dimensional tongue model SO SPEECH COMMUNICATION LA English DT Article DE three-dimensional modelling; magnetic resonance imaging; electromagnetic articulography; electropalatography; linear component analysis; kinematic control; parameter tuning; boundary contact handling ID SPEECH PRODUCTION; MOVEMENT; SHAPES AB A three-dimensional (3D) tongue model has been developed using MR images of a reference subject producing 44 artificially sustained Swedish articulations. Based on the difference in tongue shape between the articulations and a reference, the six linear parameters jaw height, tongue body, tongue dorsum, tongue tip, tongue advance and tongue width were determined using an ordered linear factor analysis controlled by articulatory measures. The first five factors explained 88% of the tongue data variance in the midsagittal plane and 78% in the 3D analysis. 
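The ordered linear factor analysis in the Engwall record above is related in spirit to a principal component analysis of tongue-shape data: a mean shape plus a small number of linear components that explain most of the variance. The Python sketch below substitutes a plain SVD-based PCA for the paper's articulatorily guided, ordered analysis, so it is a structural analogy only; the data and parameter count are illustrative:

import numpy as np

def linear_tongue_model(shapes, n_params=6):
    # shapes: (n_articulations, n_points) tongue-surface coordinates, one row
    # per sustained articulation.
    mean = shapes.mean(axis=0)
    u, s, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    basis = vt[:n_params]                   # one row per linear parameter
    weights = (shapes - mean) @ basis.T     # parameter values per articulation
    explained = (s[:n_params] ** 2).sum() / (s ** 2).sum()
    return mean, basis, weights, explained

def reconstruct(mean, basis, w):
    # A tongue shape is the mean plus a weighted sum of the components.
    return mean + w @ basis

rng = np.random.default_rng(0)
shapes = rng.normal(size=(44, 120))         # 44 articulations, toy data
mean, basis, w, ev = linear_tongue_model(shapes)
err = np.abs(reconstruct(mean, basis, w[0]) - shapes[0]).mean()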
The six-parameter model is able to reconstruct the modelled articulations with an overall mean reconstruction error of 0.13 cm, and it specifically handles lateral differences and asymmetries in tongue shape. In order to correct articulations that were hyperarticulated due to the artificial sustaining in the magnetic resonance imaging (MRI) acquisition, the parameter values in the tongue model were readjusted based on a comparison of virtual and natural linguopalatal contact patterns, collected with electropalatography (EPG). Electromagnetic articulography (EMA) data was collected to control the kinematics of the tongue model for vowel-fricative sequences and an algorithm to handle surface contacts has been implemented, preventing the tongue from protruding through the palate and teeth. (C) 2002 Elsevier B.V. All rights reserved. C1 Royal Inst Technol, KTH, Ctr Speech Technol, SE-10044 Stockholm, Sweden. RP Engwall, O (reprint author), Royal Inst Technol, KTH, Ctr Speech Technol, Drottning Kristinas 31, SE-10044 Stockholm, Sweden. EM olov@speech.kth.se CR Badin P., 1998, P 3 ESCA COCOSDA INT, P249 BADIN P, 2000, P 5 SEM SPEECH PROD, P261 BADIN P, 1997, P EUR C SPEECH COMM, V1, P47 Beautemps D, 2001, J ACOUST SOC AM, V109, P2165, DOI 10.1121/1.1361090 Beskow J., 1995, P 4 EUR C SPEECH COM, P299 BRANDERUD P, 1985, P FRENCH SWED S SPEE, P113 CARLSON R, 1982, P ICASSP 82, V3, P1604 Cohen M., 1993, MODELS TECHNIQUES CO Cohen M. M., 1998, P AUD VIS SPEECH PER, P201 COKER CH, 1966, J ACOUST SOC AM, V40, P1271, DOI 10.1121/1.2143456 DANG J, 2000, P ICSLP2000, V1, P457 DANG J, 1998, P 5 INT C SPOK LANG, V5, P1767 ENGWALL O, 2000, P ICSLP 2000, V1, P17 ENGWALL O, 2000, DYNAMICAL ASPECTS CO, V4, P49 ENGWALL O, 1999, COLLECTING ANAL 2 3, V3, P11 ENGWALL O, 2000, THESIS KTH STOCKHOLM ENGWALL O, 1999, P EUR 1999, P113 ENGWALL O, 2002, P ICSLP ENGWALL O, 2000, P 5 SEM SPEECH PROD, P297 ENGWALL O, 2001, P 4 ISCA TUT RES WOR FANT G, 1965, P ICPHS 65, P120 Fujimura O., 1979, FRONTIERS SPEECH COM, P17 FUJIMURA O, 1990, LANG SPEECH, V33, P195 HARSHMAN R, 1977, J ACOUST SOC AM, V62, P693, DOI 10.1121/1.381581 HONDA K, 1994, P 1994 INT C SPOK LA, P175 JONES W, 1977, UNPUB ELECROPALATOGR, V1, P7 KIRITANI S, 1976, RES I LOGOPED PHONIA, V10, P243 Krakow RA, 1999, J PHONETICS, V27, P23, DOI 10.1006/jpho.1999.0089 Le Goff B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607232 LINDBLOM BE, 1971, J ACOUST SOC AM, V50, P1166, DOI 10.1121/1.1912750 MADEA S, 1990, SPEECH PRODUCTION MO, P131 MAEDA S, 1988, J ACOUST SOC AM, V84, pS146, DOI 10.1121/1.2025845 MARCHAL A, 1987, 16 JOURN ET PAR HAMM Masuko T, 1998, INT CONF ACOUST SPEE, P3745, DOI 10.1109/ICASSP.1998.679698 MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427 Nguyen N, 2000, BEHAV RES METH INS C, V32, P464, DOI 10.3758/BF03200817 Payan Y, 1997, SPEECH COMMUN, V22, P185, DOI 10.1016/S0167-6393(97)00019-8 PAYAN Y, 1995, P 13 INT C PHON SCI, V2, P474 Pelachaud C., 1991, THESIS U PENNSYLVANI PERKELL JS, 1974, THESIS MIT CAMBRIDGE PERRIER P, 2000, P ICSLP2000, P162 RUBIN P, 1981, J ACOUST SOC AM, V70, P321, DOI 10.1121/1.386780 Rubin P. 
E., 1996, P 4 EUR SPEECH PROD, P125 SCHWARTZ JL, 2000, P 5 SPEECH PROD SEM, P257 Stone M, 1996, J ACOUST SOC AM, V99, P3728, DOI 10.1121/1.414969 STONE M, 1990, J ACOUST SOC AM, V87, P2207, DOI 10.1121/1.399188 TIEDE M, 1997, J ACOUST SOC AM, V2, P3166 Tiede M.K., 2000, P 5 SEM SPEECH PROD, P25 WILHELMSTRICARICO R, 1995, J ACOUST SOC AM, V97, P3085, DOI 10.1121/1.411871 WILHELMSTRICARI.R, 1997, J ACOUST SOC AM 2, V102, P3163 WILHELMSTRICARI.R, 2000, P 5 SPEECH PROD SEM, P141 Wrench A., 1998, P ICSLP98, P1867 Wrench A. A., 2000, P 5 SEM SPEECH PROD, P305 NR 53 TC 26 Z9 27 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 303 EP 329 DI 10.1016/S0167-6393(03)00132-2 PG 27 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900003 ER PT J AU Elhilali, M Chi, T Shamma, SA AF Elhilali, M Chi, T Shamma, SA TI A spectro-temporal modulation index (STMI) for assessment of speech intelligibility SO SPEECH COMMUNICATION LA English DT Article DE modulation transfer function; spectro-temporal modulations; speech intelligibility; STMI ID PRIMARY AUDITORY-CORTEX; ROOM ACOUSTICS; REPRESENTATIONS; RESPONSES; SYSTEM; MODEL AB We present a biologically motivated method for assessing the intelligibility of speech recorded or transmitted under various types of distortions. The method employs an auditory model to analyze the effects of noise, reverberations, and other distortions on the joint spectro-temporal modulations present in speech, and on the ability of a channel to transmit these modulations. The effects are summarized by a spectro-temporal modulation index (STMI). The index is validated by comparing its predictions to those of the classical STI and to error rates reported by human subjects listening to speech contaminated with combined noise and reverberation. We further demonstrate that the STMI can handle difficult and nonlinear distortions such as phase-jitter and shifts, to which the STI is not sensitive. (C) 2002 Published by Elsevier B.V. C1 Univ Maryland, Dept Elect & Comp Engn, Syst Res Inst, College Pk, MD 20742 USA. RP Shamma, SA (reprint author), Univ Maryland, Dept Elect & Comp Engn, Syst Res Inst, AV Williams Bldg 115,Room 2202, College Pk, MD 20742 USA. EM sas@eng.umd.edu RI Elhilali, Mounya/A-3396-2010; Shamma, Shihab/F-9852-2012 OI Elhilali, Mounya/0000-0003-2597-738X; CR ANSI, 1969, S351969 ANSI Arai T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607318 ATLAS L, 2001, ICASSP 2001 BELLAMY JC, 2000, WILEY SERIES TELECOM BRADLEY JS, 1986, J ACOUST SOC AM, V80, P837, DOI 10.1121/1.393907 Chi TS, 1999, J ACOUST SOC AM, V106, P2719, DOI 10.1121/1.428100 Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959 Depireux DA, 2001, J NEUROPHYSIOL, V85, P1220 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 Greenberg S., 1998, P JOINT M AC SOC AM, P2677 GREENBERG S, 1998, P INT C SPOK LANG PR Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HOUTGAST T, 1980, ACUSTICA, V46, P60 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Kowalski N, 1996, J NEUROPHYSIOL, V76, P3503 KRYTER KD, 1962, J ACOUST SOC AM, V34, P1689, DOI 10.1121/1.1909094 Lee E. 
A., 1994, DIGITAL COMMUNICATIO LYON R, 1996, SPR HDB AUD, V6, P221 Noordhoek IM, 1997, J ACOUST SOC AM, V101, P498, DOI 10.1121/1.417993 Payton KL, 1999, J ACOUST SOC AM, V106, P3637, DOI 10.1121/1.428216 Saberi K, 1999, NATURE, V398, P760, DOI 10.1038/19652 Shamma S, 1998, SPATIAL TEMPORAL PRO, P411 SHAMMA SA, 1986, J ACOUST SOC AM, V80, P133, DOI 10.1121/1.394173 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 Wang KS, 1994, IEEE T SPEECH AUDI P, V2, P421 WANG KS, 1995, IEEE T SPEECH AUDI P, V3, P382 YANG XW, 1992, IEEE T INFORM THEORY, V38, P824, DOI 10.1109/18.119739 NR 28 TC 74 Z9 75 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 331 EP 348 DI 10.1016/S0167-6393(02)00134-6 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900004 ER PT J AU Niyogi, P Ramesh, P AF Niyogi, P Ramesh, P TI The voicing feature for stop consonants: recognition experiments with continuously spoken alphabets SO SPEECH COMMUNICATION LA English DT Article ID SPEECH RECOGNITION AB We consider the possibility of incorporating phonetic features into a statistically based speech recognizer. We develop a two pass strategy for recognition with a hidden Markov model based first pass followed by a second pass that performs an alternative analysis using class-specific features. For the voiced/voiceless distinction on stops for an alphabet recognition task, we show that a perceptually and linguistically motivated acoustic feature exists (the voice onset time (VOT)). We perform acoustic-phonetic analyses demonstrating that this feature provides superior separability to the traditional spectral features. Further, the VOT can be automatically extracted from the speech signal. We describe several such algorithms that can be incorporated into our two pass recognition strategy to reduce error rates by as much as 53% over a baseline HMM recognition system. (C) 2002 Published by Elsevier B.V. C1 Univ Chicago, Dept Comp Sci, Chicago, IL 60637 USA. Bell Labs, Lucent Technol, Murray Hill, NJ 07974 USA. RP Niyogi, P (reprint author), Univ Chicago, Dept Comp Sci, 1100 E 58th St,167 Hyde Pk, Chicago, IL 60637 USA. EM niyogi@cs.uchicago.edu CR Abramson A. S., 1970, P 6 INT C PHON SCI P, P569 Bitar N., 1996, P INT C AC SPEECH SI, P29 DENG L, 1997, P 1997 IEEE WORKSH A, P107 DJEZZAR L, 1995, EUROSPEECH, P2217 EIDE E, 1993, P ICASSP, V2, P483 EIMAS PD, 1973, COGNITIVE PSYCHOL, V4, P99, DOI 10.1016/0010-0285(73)90006-6 ESPYWILSON CY, 1994, J ACOUST SOC AM, V96, P65, DOI 10.1121/1.410375 FANTY M, 1990, ICSLP, P1361 Glass J, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2277 HASEGAWAJOHNSON M, 1996, THESIS MIT Jakobson Roman, 1952, PRELIMINARIES SPEECH KLATT DH, 1975, J SPEECH HEAR RES, V18, P686 KUHL PK, 1977, J ACOUST SOC AM, V73, P322 Lee C.-H., 1996, P ICSLP, P1816 LISKER L, 1975, J ACOUST SOC AM, V57, P1547, DOI 10.1121/1.380602 LISKER L, 1964, WORD, V20, P384 Liu S. 
A., 1995, THESIS MIT CAMBRIDGE MENG HM, 1991, 4 DARPA SPEECH NAT L NIYOGI P, 1998, P ICSLP SYDN AUSTR Rabiner L, 1993, FUNDAMENTALS SPEECH RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Ramesh P, 1992, P INT C AC SPEECH SI, P381, DOI 10.1109/ICASSP.1992.225892 STEVENS KN, 1995, P 4 EUR C SPEECH COM, V1, P3 Talkin D., 1995, SPEECH CODING SYNTHE, P495 THOMSON DJ, 1982, P IEEE, V70, P1055, DOI 10.1109/PROC.1982.12433 Vapnik V, 1998, STAT LEARNING THEORY ZUE VW, 1985, P IEEE, V73, P1602, DOI 10.1109/PROC.1985.13342 ZUE VW, 1975, M ASA, V58, pS96 ZUE VW, 1976, THESIS MIT NR 29 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 349 EP 367 DI 10.1016/S0167-6393(02)00151-6 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900005 ER PT J AU Yamamoto, H Isogai, S Sagisaka, Y AF Yamamoto, H Isogai, S Sagisaka, Y TI Multi-class composite N-gram language model SO SPEECH COMMUNICATION LA English DT Article DE N-gram language model; class N-gram; word clustering; variable length N-gram AB A new language model is proposed to cope with the scarcity of training data. The proposed multi-class N-gram achieves an accurate word prediction capability and high reliability with a small number of model parameters by clustering words multi-dimensionally into classes, where the left and right context are independently treated. Each multiple class is assigned by a grouping process based on the left and right neighboring characteristics. Furthermore, by introducing frequent word successions to partially include higher order statistics, multi-class N-grams are extended to more efficient multi-class composite N-grams. In comparison to conventional word tri-grams, the multi-class composite N-grams achieved 9.5% lower perplexity and a 16% lower word error rate in a speech recognition experiment with a 40% smaller parameter size. (C) 2002 Elsevier B.V. All rights reserved. C1 ATR Spoken Language Translat Res Labs, Kyoto, Japan. RP Yamamoto, H (reprint author), ATR Spoken Language Translat Res Labs, 2-2-2 Hikaridai Seika Cho, Kyoto, Japan. EM hirofumi.yamamoto@atr.co.jp CR Bai SH, 1998, INT CONF ACOUST SPEE, P173 Brown P. F., 1992, Computational Linguistics, V18 DELIGNE S, 1995, P ICASSP, P169 KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 MASATAKI H, 1996, P ICASSP, P188 Ostendorf M, 1997, COMPUT SPEECH LANG, V11, P17, DOI 10.1006/csla.1996.0021 SHIMIZU T, 1996, P ICASSP 96, P145 TAKEZAWA T, 1998, P 1 INT WORKSH E AS, P148 WARD JH, 1963, J AM STAT ASSOC, V58, P236, DOI 10.2307/2282967 YAMAMOTO H, 1999, P ICASSP 99, V1, P533 ZHANG S, 1999, P EUR, P1611 NR 11 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
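The multi-class idea in the Yamamoto, Isogai and Sagisaka record above, clustering each word separately by how it behaves as left context and as the word being predicted, can be sketched with a toy bigram in Python. The class assignments below are hand-made stand-ins for the paper's clustering, and the composite extension to frequent word successions is omitted:

from collections import defaultdict

# A word's "right class" describes its behaviour as left context; its
# "left class" describes its behaviour as the word being predicted.
right_class = {"a": "DET", "the": "DET", "cat": "N", "dog": "N"}
left_class = {"cat": "NOUN", "dog": "NOUN", "a": "ART", "the": "ART"}

def train(corpus):
    trans = defaultdict(lambda: defaultdict(int))  # class transition counts
    emit = defaultdict(lambda: defaultdict(int))   # word emission counts
    for prev, word in zip(corpus, corpus[1:]):
        trans[right_class[prev]][left_class[word]] += 1
        emit[left_class[word]][word] += 1
    return trans, emit

def prob(prev, word, trans, emit):
    # P(word | prev) = P(left class | right class of prev) * P(word | left class)
    rc, lc = right_class[prev], left_class[word]
    p_t = trans[rc][lc] / max(1, sum(trans[rc].values()))
    p_e = emit[lc][word] / max(1, sum(emit[lc].values()))
    return p_t * p_e

trans, emit = train("the cat a dog the dog".split())
print(prob("the", "cat", trans, emit))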
PD OCT PY 2003 VL 41 IS 2-3 BP 369 EP 379 DI 10.1016/S0167-6393(02)00179-6 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900006 ER PT J AU Ozaydin, S Baykal, B AF Ozaydin, S Baykal, B TI Matrix quantization and mixed excitation based linear predictive speech coding at very low bit rates SO SPEECH COMMUNICATION LA English DT Article DE very low bit rate; LSF matrix quantization; LSF vector quantization; mixed excitation; MELP; ARMA prediction ID LPC PARAMETERS; DESIGN AB A matrix quantization scheme and a very low bit rate vocoder are developed to obtain good quality speech for low capacity communication links. The new matrix quantization method operates at bit rates between 400 and 800 bps and, using a 25 ms linear predictive coding (LPC) analysis frame, a spectral distortion of about 1 dB is achieved at 800 bps. Techniques for improving the performance in very low bit rate vocoding include quantization of residual line spectral frequency (LSF) vectors, multistage matrix quantization, joint quantization of pitch and voiced/unvoiced/mixed decisions and a technique to obtain voiced/unvoiced/mixed decisions. In the new matrix quantization based mixed excitation (MQME) vocoder, the residual LSF vectors for two consecutive frames are obtained using autoregressive moving average (ARMA) prediction, then grouped into a superframe and jointly quantized. For other speech parameters, quantization is made in each frame. The residual LSF vector quantization yields bit rate reduction in the vocoder. For the MQME vocoder, listening tests have proven that an efficient and high quality coding has been achieved at a bit rate of 1200 bps. Test results are compared with the mixed excitation based 2400 bps MELP vocoder which is chosen as the new federal standard, and it is observed that the degradation in speech quality is tolerable and the performance is near the 2400 bps MELP vocoder particularly in quiet environments. (C) 2003 Elsevier B.V. All rights reserved. C1 Middle E Tech Univ, Dept Elect & Elect Engn, TR-06531 Ankara, Turkey. Undersecretariat Def Ind, TR-06100 Ankara, Turkey. RP Baykal, B (reprint author), Middle E Tech Univ, Dept Elect & Elect Engn, TR-06531 Ankara, Turkey. EM buyurman@metu.edu.tr CR BHATTACHARYA B, 1992, P ICASSP, P105, DOI 10.1109/ICASSP.1992.225961 BRUHN S, 1995, P IEEE INT C AC SPEE, V1, P724 CAMPBELL JP, 1986, P IEEE INT C AC SPEE, P473 DEMARCA JRB, 1994, IEEE T VEH TECHNOL, V43, P413, DOI 10.1109/25.312805 Ghaemmaghami S, 1999, ELECTRON LETT, V35, P456, DOI 10.1049/el:19990316 Kondoz A.
M., 1994, DIGITAL SPEECH CODIN LeBlanc WP, 1993, IEEE T SPEECH AUDI P, V1, P373, DOI 10.1109/89.242483 LEBLANC WP, 1993, WIRELESS NETWORK APP, V39, P302 MCCREE AV, 1995, IEEE T SPEECH AUDI P, V3, P242, DOI 10.1109/89.397089 MEDAN Y, 1991, IEEE T SIGNAL PROCES, V39, P40, DOI 10.1109/78.80763 NANDKUMAR S, 1998, P IEEE INT C AC SPEE, V1, P41, DOI 10.1109/ICASSP.1998.674362 OHMURO H, 1994, ELECTRON COMM JPN 3, V77, P12, DOI 10.1002/ecjc.4430771002 OZAYDIN S, 2001, P IEEE INT C AC SPEE, P677 OZAYDIN S, 2001, P 3 IEEE SIGN PROC W, P372 SHIRAKI Y, 1988, IEEE T ACOUST SPEECH, V36, P1437, DOI 10.1109/29.90372 SKOGLUND J, 1996, P IEEE INT C AC SPEE, P1351 TOKUDA K, 1998, P ICASSP, V2, P609, DOI 10.1109/ICASSP.1998.675338 Tremain T.E., 1982, SPEECH TECHNOLOG APR, P40 TSAO C, 1985, IEEE T ACOUST SPEECH, V33, P537 WANG T, 2000, P IEEE INT C AC SPEE, P1375 Xydeas CS, 1999, IEEE T SPEECH AUDI P, V7, P113, DOI 10.1109/89.748117 NR 21 TC 4 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 381 EP 392 DI 10.1016/S0167-6393(03)00009-8 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900007 ER PT J AU Visser, E Otsuka, M Lee, TW AF Visser, E Otsuka, M Lee, TW TI A spatio-temporal speech enhancement scheme for robust speech recognition in noisy environments SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; robust speech recognition; blind source separation; noisy environments ID MICROPHONE ARRAYS; BLIND SEPARATION; ACOUSTIC NOISE; TIME-DELAY AB A new speech enhancement scheme is presented integrating spatial and temporal signal processing methods for robust speech recognition in noisy environments. The scheme first separates spatially localized point sources from noisy speech signals recorded by two microphones. Blind source separation algorithms assuming no a priori knowledge about the sources involved are applied in this spatial processing stage. Then denoising of distributed background noise is achieved in a combined spatial/temporal processing approach. The desired speaker signal is first processed along with an artificially constructed noise signal in a supplementary blind source separation step. It is further denoised by exploiting differences in temporal speech and noise statistics in a wavelet filterbank. The scheme's performance is illustrated by speech recognition experiments on real recordings in a noisy car environment. In comparison to a common multi-microphone technique like beamforming with spectral subtraction, the scheme is shown to enable more accurate speech recognition in the presence of a highly interfering point source and strong background noise. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Calif San Diego, Inst Neural Computat, Dept 0523, La Jolla, CA 92093 USA. DENSO Corp, Res Labs, Aichi 4700111, Japan. RP Visser, E (reprint author), Univ Calif San Diego, Inst Neural Computat, Dept 0523, 9500 Gilman Dr, La Jolla, CA 92093 USA. 
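The spatial stage of the Visser, Otsuka and Lee record above rests on blind source separation. As a stand-in for the convolutive separation used on real room recordings, this Python sketch runs natural-gradient Infomax ICA on an instantaneous two-channel mixture; the mixing matrix, learning rate and source statistics are invented for the example:

import numpy as np

def infomax_ica(x, iters=500, lr=0.01):
    # Natural-gradient Infomax ICA on an (n_channels x n_samples) mixture;
    # tanh is the score function for super-Gaussian (speech-like) sources.
    n, t = x.shape
    w = np.eye(n)
    for _ in range(iters):
        y = w @ x
        w += lr * (np.eye(n) - (np.tanh(y) @ y.T) / t) @ w
    return w @ x

rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 5000))            # two super-Gaussian toy sources
a = np.array([[1.0, 0.6], [0.4, 1.0]])     # unknown instantaneous mixing
recovered = infomax_ica(a @ s)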
EM visser@salk.edu; mootsuka@rlab.denso.co.jp; tewon@salk.edu CR Adami A, 2002, P ICSLP, P21 ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 ATTIAS H, 2001, ADV NEURAL INFORMATI, V13 Bell A J, 1995, NEURAL COMPUT, V7, P1004 Berouti M., 1979, P IEEE INT C AC SPEE, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Brandstein MS, 1997, COMPUT SPEECH LANG, V11, P91, DOI 10.1006/csla.1996.0024 BUCCIGROSSI RW, 1997, P IEEE INT C AC SPEE CARDOSO JF, 1993, IEE PROC-F, V140, P362 Carter G. C., 1993, COHERENCE TIME DELAY Champagne B, 1996, IEEE T SPEECH AUDI P, V4, P148, DOI 10.1109/89.486067 CHOW SK, 1981, IEEE T ACOUST SPEECH Dahl M, 1999, IEEE T VEH TECHNOL, V48, P1518, DOI 10.1109/25.790527 DEMBO A, 1988, IEEE T ACOUST SPEECH, V36, P471, DOI 10.1109/29.1551 DONOHO DL, 1995, J ROY STAT SOC B MET, V57, P301 Droppo J., 2001, P EUR, P217 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FERTNER A, 1986, IEEE T ACOUST SPEECH Fischer S, 1996, SPEECH COMMUN, V20, P215, DOI 10.1016/S0167-6393(96)00054-4 FRIEDLANDER B, 1984, IEEE T AEROSPACE ELE, V1 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HIRSCH HG, 2000, ISCA ITRW ASR2000 CH Holschneider M., 1989, WAVELETS TIME FREQUE, P289 Hyvarinen A, 1997, NEURAL COMPUT, V9, P1483, DOI 10.1162/neco.1997.9.7.1483 Johnson D, 1993, ARRAY SIGNAL PROCESS Kim DY, 1998, SPEECH COMMUN, V24, P39, DOI 10.1016/S0167-6393(97)00061-7 KNAPP CH, 1976, IEEE T ACOUST SPEECH, V24, P320, DOI 10.1109/TASSP.1976.1162830 LEE TW, 1998, P IEEE INT C AC SPEE, V2, P1249 Lee TW, 1997, ADV NEUR IN, V9, P758 Li Z, 1999, IEEE T SPEECH AUDI P, V7, P91, DOI 10.1109/89.736335 LIEB M, 2001, P EUR 2001 AALB DENM, P625 Macho D., 2002, P ICSLP, P17 Mahieux Y, 1996, J AUDIO ENG SOC, V44, P365 MAKEIG S, 1995, ADV NEURAL INFORMATI, V8 Marro C, 1998, IEEE T SPEECH AUDI P, V6, P240, DOI 10.1109/89.668818 MOKBEL CE, 1995, IEEE T SPEECH AUDI P, V3, P346, DOI 10.1109/89.466660 Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214 Rabiner L, 1993, FUNDAMENTALS SPEECH SILVERMAN HF, 1997, 1997 IEEE INT C AC S, V1, P251, DOI 10.1109/ICASSP.1997.599616 Vetterli M., 1995, WAVELETS SUBBAND COD Ward D. B., 1998, Acoustics Australia, V26 ZHU Q, 2001, P EUR 2001 AALB DENM, V1, P185 NR 42 TC 25 Z9 27 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 393 EP 407 DI 10.1016/S0167-6393(03)00010-4 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900008 ER PT J AU Lu, CT Wang, HC AF Lu, CT Wang, HC TI Enhancement of single channel speech based on masking property and wavelet transform SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; wavelet packet transform; soft thresholding function; noise masking threshold; critical band ID DECOMPOSITION; SIGNALS AB The wavelet packet transform has increasingly been applied to removing additive white Gaussian noise. By using a soft thresholding function, it performs well in enhancing the corrupted speech. However, it suffers from serious residual noise and speech distortion. In this paper, we propose a method based on critical-band decomposition which converts a noisy signal into wavelet coefficients (WCs), and enhances the WCs by subtracting a threshold from noisy WCs in each subband.
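The per-subband thresholding in the Lu and Wang record above can be sketched with a plain discrete wavelet transform and soft thresholding, assuming the PyWavelets (pywt) package. The SegSNR- and masking-threshold-adaptive parts of the paper are omitted; a fixed universal threshold with a median-based noise estimate stands in for them:

import numpy as np
import pywt

def wavelet_denoise(x, wavelet="db4", level=5):
    # Decompose, soft-threshold each detail subband, and reconstruct.
    coeffs = pywt.wavedec(x, wavelet, level=level)
    sigma = np.median(np.abs(coeffs[-1])) / 0.6745     # noise level estimate
    thr = sigma * np.sqrt(2 * np.log(len(x)))          # universal threshold
    out = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
    return pywt.waverec(out, wavelet)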
The threshold of each subband is adapted according to the segmental SNR (SegSNR) and the noise masking threshold. Thus residual noise can be efficiently suppressed for a speech-dominated frame. In a noise-dominated frame, the background noise can be almost removed by adjusting the wavelet coefficient threshold (WCT) according to the SegSNR. Speech distortion can be reduced by decreasing the WCT in speech-dominated subbands. The proposed method can effectively enhance noisy speech corrupted by colored noise. Its performance is better than other wavelet-based speech enhancement methods in our experiments. (C) 2003 Elsevier B.V. All rights reserved. C1 Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan. Chin Min Coll, Dept Elect Engn, Miaoli 351, Taiwan. RP Lu, CT (reprint author), Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 300, Taiwan. EM lucas1@ms26.hinet.net; hcwang@ee.nthu.edu.tw CR Bahoura M, 2001, IEEE SIGNAL PROC LET, V8, P10, DOI 10.1109/97.889636 BAHOURA M, 2001, P EUR C SPEECH COMM, P1937 Carnero B, 1999, IEEE T SIGNAL PROCES, V47, P1622, DOI 10.1109/78.765133 Chong NR, 2000, IEEE T SPEECH AUDI P, V8, P345, DOI 10.1109/89.841216 DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009 DONOHO DL, 1994, BIOMETRIKA, V81, P425, DOI 10.1093/biomet/81.3.425 DURAND S, 2001, INT C AC SPEECH SIGN JABLOUN F, 2001, INT C AC SPEECH SIGN Jansen M, 2001, IEEE T SIGNAL PROCES, V49, P1113, DOI 10.1109/78.923292 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 LUKASIAK J, 2000, INT C AC SPEECH SIGN, P11 Mallat S., 1999, WAVELET TOUR SIGNAL SCHROEDER MR, 1979, J ACOUST SOC AM, V66, P1647, DOI 10.1121/1.383662 Selesnick IW, 2000, INT CONF ACOUST SPEE, P129, DOI 10.1109/ICASSP.2000.861887 Sheikhzadeh H., 2001, P EUR C SPEECH COMM, P1855 SINGH L, 1997, P IEEE TENCON SITCT, P475 SINHA DP, 1993, IEEE T SIGNAL PROCES, V41, P3463, DOI 10.1109/78.258086 Srinivasan P, 1998, IEEE T SIGNAL PROCES, V46, P1085, DOI 10.1109/78.668558 Strang G., 1996, WAVELETS FILTER BANK Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 YOON S, 2001, P EUR C SPEECH COMM, P1941 NR 21 TC 24 Z9 25 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 409 EP 427 DI 10.1016/S0167-6393(03)00011-6 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900009 ER PT J AU Meyer, G Morse, R AF Meyer, G Morse, R TI The intelligibility of consonants in noisy vowel-consonant-vowel sequences when the vowels are selectively enhanced SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; consonant recognition; automatic speech recognition; relative amplitude ID NASAL CONSONANTS; RELATIVE AMPLITUDE; ACOUSTIC CUES; FORMANT TRANSITIONS; SPEECH RECOGNITION; PERCEPTION; HEARING; PLACE; STOP; ARTICULATION AB The performance of speech enhancement algorithms deteriorates rapidly with decreasing signal-to-noise ratio (SNR). At a low SNR, high-intensity phonemes such as vowels are therefore more likely to be enhanced than low-intensity speech segments such as many consonants. Although the selective enhancement of vowels enhances transitional cues for consonant recognition, it simultaneously degrades relative amplitude cues.
Experiments with normal-hearing subjects were performed to determine the overall effect of selective enhancement of vowels on the intelligibility of consonants in consonant-vowel-consonant utterances. In quiet, a 12-dB enhancement of the vowels did not significantly reduce consonant intelligibility compared with an unenhanced control condition at 65 dB (A). When unenhanced utterances were presented in background noise with an average SNR of -6 dB at the vowel segments, 50.1% of the consonants were correctly identified while 69.8% of consonants were recognised in a condition where the consonant SNR remained unchanged but where the vowels were selectively amplified by 12 dB. Equal enhancement of the vowels and consonants by 12 dB, however, led to 91.5% consonant recognition. We conclude that speech enhancement algorithms should enhance all speech segments to the greatest possible extent, even if this leads to selective enhancement of some phoneme categories over others. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Keele, MacKay Inst Commun & Neurosci, Keele ST5 5BG, Staffs, England. RP Meyer, G (reprint author), Univ Liverpool, Dept Psychol, Eleanor Rathbone Bldg,Bedford St S, Liverpool L69 7ZA, Merseyside, England. EM georg@liverpool.ac.uk CR AINSWORTH WA, 1994, J ACOUST SOC AM, V96, P687, DOI 10.1121/1.410306 BAILEY PJ, 1980, J EXP PSYCHOL HUMAN, V6, P536 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 COHEN I, 2001, SIGNAL PROCESS, V81, P2408 COOPER FS, 1952, J ACOUST SOC AM, V24, P597, DOI 10.1121/1.1906940 DELATTRE PC, 1955, J ACOUST SOC AM, V27, P769, DOI 10.1121/1.1908024 DORMAN MF, 1977, PERCEPT PSYCHOPHYS, V22, P109, DOI 10.3758/BF03198744 DORMAN MF, 1990, J ACOUST SOC AM, V88, P2074, DOI 10.1121/1.400104 DORMAN MF, 1985, SPEECH SCI, P111 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J GORDONSALANT S, 1986, J ACOUST SOC AM, V80, P1599, DOI 10.1121/1.394324 Hazan V, 1998, SPEECH COMMUN, V24, P211, DOI 10.1016/S0167-6393(98)00011-9 Hedrick M, 1997, J SPEECH LANG HEAR R, V40, P925 HEDRICK MS, 1993, J ACOUST SOC AM, V94, P2005, DOI 10.1121/1.407503 Hedrick MS, 1996, J ACOUST SOC AM, V100, P3398, DOI 10.1121/1.416981 Hedrick MS, 1997, J SPEECH LANG HEAR R, V40, P1445 Joos M., 1948, LANGUAGE SUPPL, V24, P1, DOI DOI 10.2307/522229 Kleinschmidt M, 2001, SPEECH COMMUN, V34, P75, DOI 10.1016/S0167-6393(00)00047-9 Liberman AM, 1954, PSYCHOL MONOGR-GEN A, V68, P1 LISKER L, 1986, LANG SPEECH, V29, P3 LUCE PA, 1983, HUM FACTORS, V25, P17 Macaulay R.J., 1980, IEEE T ACOUST SPEECH, V28, P137 MALECOT A, 1956, LANGUAGE, V32, P274, DOI 10.2307/411004 MANN VA, 1980, PERCEPT PSYCHOPHYS, V28, P213, DOI 10.3758/BF03204377 Meyer GF, 2001, NATO SCI S A LIF SCI, V312, P297 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 NAKATA K, 1959, J ACOUST SOC AM, V31, P661, DOI 10.1121/1.1907770 NORDHOLM S, 1993, IEEE T VEH TECHNOL, V42, P514, DOI 10.1109/25.260760 OHDE RN, 1994, J ACOUST SOC AM, V96, P675, DOI 10.1121/1.411326 OHDE RN, 1983, J ACOUST SOC AM, V74, P706, DOI 10.1121/1.389856 PARSONS TW, 1987, VOICE SPEECH PROCESS, P345 Pick G., 1977, PSYCHOPHYSICS PHYSL, P273 Potter R. 
K., 1947, VISIBLE SPEECH RABBITT P, 1968, Q J EXPT PSYCHOL, V20, P1 RECASENS D, 1983, J ACOUST SOC AM, V73, P1346, DOI 10.1121/1.389238 Rosen S., 1986, FREQUENCY SELECTIVIT, P373 Shanley A, 1999, CHEM ENG-NEW YORK, V106, P74 SOLI SD, 1981, J ACOUST SOC AM, V70, P976, DOI 10.1121/1.387032 Soon IY, 1998, SPEECH COMMUN, V24, P249, DOI 10.1016/S0167-6393(98)00019-3 STEVENS KN, 1985, ESSAYS HONOR P LADEF, P243 STRANGE W, 1983, J ACOUST SOC AM, V74, P695, DOI 10.1121/1.389855 Sussman HM, 1998, BEHAV BRAIN SCI, V21, P241 WHALEN DH, 1991, J ACOUST SOC AM, V90, P1776, DOI 10.1121/1.401658 Widrow B, 2003, SPEECH COMMUN, V39, P139, DOI 10.1016/S0167-6393(02)00063-8 WRIGHT HN, 1968, J SPEECH HEAR RES, V11, P842 NR 46 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 429 EP 440 DI 10.1016/S0167-6393(03)00013-X PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900010 ER PT J AU Ishi, CT Hirose, K Minematsu, N AF Ishi, CT Hirose, K Minematsu, N TI Mora F0 representation for accent type identification in continuous speech and considerations on its relation with perceived pitch values SO SPEECH COMMUNICATION LA English DT Article DE mora F0; mora pitch perception; pitch target; accent type identification ID FREQUENCY AB In Japanese continuous speech, a content word is frequently followed by a particle to form an utterance unit with one accent component, called an accentual phrase or a prosodic word. As opposed to accent type identification for isolated word utterances, automatic accent type identification for accentual phrases in continuous speech is quite difficult, and no reliable method has yet been developed. In order to realize accurate identification, a method was proposed which was based on representing the fundamental frequency (F0) movement of an utterance as a sequence of F0 values in mora unit (F0mora's). Although a consonant-vowel (CV) cluster is usually said to correspond to a mora, we also tested the vowel-consonant (VC) cluster in our study. As for F0 values, two definitions were selected and compared: one is to average F0 values of the voiced frames within the mora unit, and the other is to set the target value as the F0 value at the end of a linear regression line fit through the mora. As combinations of the two definitions for mora unit and F0 value, 4 candidates (CV-average, CV-target, VC-average, VC-target) are possible as the F0mora definition. A variable, F0ratio, was then defined as the F0mora difference between two successive morae to quantitatively represent their pitch change, and its distribution for each accent type was analyzed. After constructing a multi-dimensional Gaussian model for each accent type using F0ratio as the feature parameter, an experiment was conducted on accent type identification. The experiment included identification using accent type HMM's of frame-based F0's and delta-F0's as the baseline method. On average, the proposed method out-performed the baseline method for all four candidates of F0mora, with CV-target and VC-average showing better performances. The candidates were further analyzed in regard to how well they corresponded to human perceived mora pitch values. 
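Two of the F0mora definitions compared in the Ishi, Hirose and Minematsu record above, averaging the voiced F0 values within a mora and taking the end value of a fitted regression line, can be written down directly in Python, along with the F0ratio feature computed between successive morae. The frame values and mora boundaries below are toy data, and whether the difference is taken in Hz or on a log scale is an assumption:

import numpy as np

def f0_mora(f0_frames):
    # Two F0mora definitions for one mora's voiced F0 contour: the average,
    # and the end value of a fitted linear regression line (the target).
    t = np.arange(len(f0_frames))
    avg = float(np.mean(f0_frames))
    slope, intercept = np.polyfit(t, f0_frames, 1)
    return avg, float(slope * t[-1] + intercept)

def f0_ratios(mora_values):
    # F0ratio: pitch change between successive morae.
    return [b - a for a, b in zip(mora_values, mora_values[1:])]

# Toy accentual phrase of four morae (frame-level F0 in Hz):
morae = [np.array([180, 182, 185]), np.array([200, 198, 195]),
         np.array([170, 165, 160]), np.array([150, 148, 147])]
averages = [f0_mora(m)[0] for m in morae]
print(f0_ratios(averages))   # feature vector for a per-accent-type Gaussian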
For this purpose, after developing a tool enabling us to control musical instrument digital interface (MIDI) sound pitch to the interval of a quarter of a semitone, we asked subjects to adjust the MIDI sound pitch to the perceived mora pitch. The MIDI sound pitch values obtained after adjustment by the subjects were used to quantify the human-perceived mora pitch (F0human). F0human values were used to evaluate each F0mora candidate. Although VC-average and CV-target had shown a better match with F0human, they showed large mismatches when large F0 changes occurred within the mora. Analysis of mismatches between F0mora and F0human showed that mora pitch was related to the direction of the F0 change. It also showed that the target value obtained from the linear regression approximation of the observed F0 curve over-estimated the F0-change effect on the perceived pitch. An optimal definition for F0mora will be found between averaging and linear-targeting. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Tokyo, Dept Informat & Commun Engn, Grad Sch Engn, Bunkyo Ku, Tokyo 1138656, Japan. Univ Tokyo, Dept Frontier Informat, Grad Sch Frontier Sci, Bunkyo Ku, Kyoto 1138656, Japan. Univ Tokyo, Dept Informat & Commun Engn, Grad Sch Comp Sci & Technol, Bunkyo Ku, Kyoto 1138656, Japan. RP Ishi, CT (reprint author), ATR, Human Informat Sci Labs, Japan Sci & Technol Agcy, 2-2 Hikaridai, Kyoto 6190288, Japan. EM carlos@atr.co.jp; hirose@gavo.t.u-tokyo.ac.jp; mine@gavo.t.u-tokyo.ac.jp CR FUJISAKI H, 1993, IEICE T FUND ELECTR, VE76A, P1919 HASHIMOTO S, 1960, METHODS LINGUISTICS, P428 HIROSE K, 1998, P IEEE ICASSP 98 SEA, V1, P25, DOI 10.1109/ICASSP.1998.674358 HIROSE K, 1993, P ESCA WORKSH PROS, P200 ISHI CT, 2000, P AUT M AC SOC JAP, V1, P199 KAWAI G, 1999, P EUR, V99, P177 Ljolje A., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90025-8 MINEMATSU N, 1996, AUTOMATIC IDENTIFICA, P69 NABELEK IV, 1970, J ACOUST SOC AM, V48, P536 SASAKI H, 2000, P SPR M AC SOC JAP, V1, P255 SATO H, 1987, THESIS, P55 TAKAHASHI S, 1990, ISOLATED WORLD RECOG, P65 YOSHIMURA T, 1992, P AUT M AC SOC JAP, V1, P173 NR 13 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 441 EP 453 DI 10.1016/S0167-6393(03)00014-1 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900011 ER PT J AU Hakkinen, J Suontausta, J Riis, S Jensen, KJ AF Hakkinen, J Suontausta, J Riis, S Jensen, KJ TI Assessing text-to-phoneme mapping strategies in speaker independent isolated word recognition SO SPEECH COMMUNICATION LA English DT Article DE speaker independent speech recognition; isolated word recognition; text-to-phoneme mapping; decision trees; neural networks AB A phonetic transcription of the vocabulary, i.e., a lexicon, is needed in sub-word based speech recognition and text-to-speech systems. Decision trees and neural networks have successfully been used for creating lexicons on-line from an open vocabulary. We briefly review these methods and compare them in detail in the text-to-phoneme mapping task as part of a phoneme based speaker independent speech recognizer. The decision tree and neural network based methods were first evaluated in terms of phoneme accuracy and then in extensive speech recognition tests. American English dictionaries and speech databases were used in all experiments.
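A decision-tree text-to-phoneme mapping of the kind evaluated in the Hakkinen, Suontausta, Riis and Jensen record above can be sketched with letter-context windows and an off-the-shelf tree, assuming scikit-learn is available. The toy lexicon is pre-aligned one letter to one phoneme, which sidesteps the alignment step a real system needs; the features, window width and data are invented:

from sklearn.tree import DecisionTreeClassifier

def letter_windows(word, phones, width=2):
    # One example per letter: the letter and its padded neighbours, labelled
    # with the aligned phoneme.
    padded = "_" * width + word + "_" * width
    feats = [[ord(c) for c in padded[i:i + 2 * width + 1]]
             for i in range(len(word))]
    return feats, list(phones)

# Tiny hypothetical lexicon, pre-aligned one letter to one phoneme:
lexicon = {"cat": ["k", "ae", "t"], "cab": ["k", "ae", "b"],
           "bat": ["b", "ae", "t"]}
X, y = [], []
for w, p in lexicon.items():
    f, labels = letter_windows(w, p)
    X += f
    y += labels
tree = DecisionTreeClassifier().fit(X, y)
print(tree.predict(letter_windows("bab", "???")[0]))  # per-letter phoneme guesses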
The decision tree based method achieved high phoneme accuracies when the training material covered the test vocabulary well. In typical speech recognition tests, the recognition rates obtained using the decision tree based lexicons were close to the baseline that was obtained using accurate transcriptions. Although the lexicons obtained using neural networks resulted in somewhat lower baseline recognition rates, they provided slightly better results in generalization tests. Moreover, when the neural network based mappings were appended with a look-up table comprising the most likely vocabulary items, which would be the practical set-up, their performance increased significantly. The main advantage of neural networks over decision trees is their low memory consumption. (C) 2003 Elsevier B.V. All rights reserved. C1 Nokia Mobile Phones, FIN-33721 Tampere, Finland. Nokia Res Ctr, Speech & Audio Syst Lab, FIN-33721 Tampere, Finland. Nokia Mobile Phones R&D, DK-1790 Copenhagen V, Denmark. Oticon AS, DK-2900 Hellerup, Denmark. RP Hakkinen, J (reprint author), Nokia Mobile Phones, POB 68, FIN-33721 Tampere, Finland. EM juha.m.hakkinen@nokia.com CR ANDERSEN O, 1995, P EUR 95 MADR SPAIN, V2, P1117 Andersen O., 1996, P INT C SPOK LANG PR, V3, P1700, DOI 10.1109/ICSLP.1996.607954 ANDERSEN O, 1994, P INT C SPOK LANG PR, V3, P1627 Bahl L.R., 1991, P INT C AC SPEECH SI, P173, DOI 10.1109/ICASSP.1991.150305 Bengio Y, 2001, ADV NEUR IN, V13, P932 BISHOP M, 1995, CHRISTOPHER NEURAL N BRIDLE J, 1990, NEUROCOMPUTING ALGOR, V6, P227 DESHMUKH N, 1999, THESIS MISSISSIPPI S Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC Gillick L., 1989, P ICASSP, P532 HAKKINEN J, 1999, P IEEE WORKSH ROB ME, P139 Hallahan W. I., 1995, Digital Technical Journal, V7 Hinton G. E., 1986, P 8 ANN C COGN SCI S, P1 Jelinek F., 1998, STAT METHODS SPEECH JENSEN KJ, 2000, P INT C SPOK LANG PR JIANG L, 1997, P EUROSPEECH, P605 KIENAPPEL AK, 2001, P EUR 2001 AALB DENM LeCun Y, 1997, P INT C AC SPEECH SI, V1, P151 McCulloch N., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90013-1 Meng H, 1996, SPEECH COMMUN, V18, P47, DOI 10.1016/0167-6393(95)00032-1 *OFF CHIEF ACT, 1998, 39 OFF CHIEF ACT 199 Pagel V., 1998, P INT C SPOK LANG PR, V5, P2015 Quinlan J. R., 1993, C4 5 PROGRAMS MACHIN RABINER R, 1993, FUNDAMENTALS SPEECH Riis SK, 1996, J COMPUT BIOL, V3, P163, DOI 10.1089/cmb.1996.3.163 Sejnowski T. J., 1987, Complex Systems, V1 SUONTAUSTA J, 2000, P INT C SPOK LANG PR Torkkola K., 1993, P INT C AC SPEECH SI, V2, P199 *US CENS BUR, 1999, FREQ OCC 1 NAM SURN WEIDE RL, 1995, CARN PRON DICT REL 0 NR 30 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 455 EP 467 DI 10.1016/S0167-6393(03)00015-3 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900012 ER PT J AU Chen, JD Paliwal, KK Nakamura, S AF Chen, JD Paliwal, KK Nakamura, S TI Cepstrum derived from differentiated power spectrum for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE robust speech recognition; hidden Markov model; differential power spectrum; linear liftering; cepstral mean normalization; spectral subtraction ID WORD RECOGNITION; ENVIRONMENTS; NOISE AB In this paper, cepstral features derived from the differential power spectrum (DPS) are proposed for improving the robustness of a speech recognizer in the presence of background noise.
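A differential-power-spectrum front end of the kind proposed in the Chen, Paliwal and Nakamura record (power spectrum via FFT, differentiation across frequency, mel-filterbank smoothing of the DPS magnitude, then a log nonlinearity and DCT) can be sketched in Python as below. The frame length, filterbank construction and coefficient counts are generic assumptions, not the authors' exact configuration:

import numpy as np
from scipy.fftpack import dct

def dps_cepstrum(frame, fs, n_fft=512, n_mels=24, n_ceps=13):
    # Power spectrum -> differentiate across frequency -> magnitude ->
    # mel filterbank -> log -> DCT. Filterbank construction is simplified.
    power = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    dps = np.abs(np.diff(power))
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(0, mel(fs / 2), n_mels + 2))
    bins = np.floor((n_fft // 2) * edges / (fs / 2)).astype(int)
    fbank = np.zeros(n_mels)
    for i in range(n_mels):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        for k in range(lo, min(hi, len(dps))):
            w = (k - lo) / max(1, mid - lo) if k < mid else (hi - k) / max(1, hi - mid)
            fbank[i] += w * dps[k]
    return dct(np.log(fbank + 1e-10), norm="ortho")[:n_ceps]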
These robust features are computed from the speech signal of a given frame through the following four steps. First, the short-time power spectrum of the speech signal is computed through the fast Fourier transform algorithm. Second, DPS is obtained by differentiating the power spectrum with respect to frequency. Third, the magnitude of DPS is projected from linear frequency to the mel scale and smoothed by a filter bank. Finally, the outputs of the filter bank are transformed to cepstral coefficients by the discrete cosine transform after a nonlinear transformation. It is shown that this new feature set can be decomposed as the superposition of the standard cepstrum and its nonlinearly liftered counterpart. While a linear lifter has no effect on the continuous density hidden Markov model based speech recognition, we show that the proposed feature set embedded with a nonlinear liftering transformation is quite effective for robust speech recognition. For this, we conduct a number of speech recognition experiments (including isolated word recognition, connected digits recognition, and large vocabulary continuous speech recognition) in various operating environments and compare the DPS features with the standard mel-frequency cepstral coefficient features used with cepstral mean normalization and spectral subtraction techniques. (C) 2003 Elsevier B.V. All rights reserved. C1 Bell Labs, Lucent Technol, Murray Hill, NJ 07974 USA. Griffith Univ, Sch Microelect Engn, Brisbane, Qld 4111, Australia. ATR, Spoken Language Translat Res Labs, Kyoto 6190288, Japan. RP Chen, JD (reprint author), Bell Labs, Lucent Technol, 600 Mt Ave, Murray Hill, NJ 07974 USA. EM jingdong@research.bell-labs.com CR Andrassy B, 2001, P 7 EUR C SPEECH COM, P193 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Bourlard H., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607145 CHEN J, 2001, P 7 EUR C SPEECH COM, P571 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Droppo J., 2001, P EUR, P217 FLORES JAN, 1994, P ICASSPP, P409 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 Geller D, 1992, P ESCA WORKSH SPEECH, P203 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H., 1991, P EUROSPEECH, P1367 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hirsch H.-G., 1991, P EUROSPEECH, P413 HIRSCH HG, 2000, P ISCA ASR2000 PAR F JUANG BH, 1987, IEEE T ACOUST SPEECH, V35, P947 JUNQUA JC, 2001, P INT WORKSH HANDSFR, P31 Kim DS, 1999, IEEE T SPEECH AUDI P, V7, P55 Kotnik B., 2001, P EUROSPEECH SCAND, P197 Lamel L. F., 1986, P DARPA SPEECH REC W, P100 LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 MORENO PJ, 1996, P ICASSP, P733 OHKURA K, 1992, P ICSLP 92, P369 Paliwal KK, 1999, P EUR C SPEECH COMM, P85 PALIWAL KK, 1992, SPEECH COMMUN, V18, P151 PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532 POPESCU DC, 1998, P IEEE INT C AC SPEE, V2, P997, DOI 10.1109/ICASSP.1998.675435 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 SAGAYAMA S, 1997, P IEEE INT C AC SPEE, P835 SOONG FK, 1986, P ICASSP TOK JAP, P877 TOHKURA Y, 1987, IEEE T ACOUST SPEECH, V35, P1414, DOI 10.1109/TASSP.1987.1165058 Varga A.
P., 1992, NOISEX92 STUDY EFFEC Vaseghi SV, 1997, IEEE T SPEECH AUDI P, V5, P11, DOI 10.1109/89.554264 WOODLAND PC, 1996, P ICASSP ATL GA MAY, P65 ZHU Q, 2001, P EUROSPEECH SCAND, P185 ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 NR 36 TC 23 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 469 EP 484 DI 10.1016/S0167-6393(03)00016-5 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900013 ER PT J AU Sivakumaran, P Ariyaeeinia, AM Loomes, MJ AF Sivakumaran, P Ariyaeeinia, AM Loomes, MJ TI Sub-band based text-dependent speaker verification SO SPEECH COMMUNICATION LA English DT Article DE speaker verification; sub-band analysis; cepstrum ID HIDDEN MARKOV-MODELS; SPEECH RECOGNITION AB This paper addresses various issues involved in sub-band based text-dependent speaker verification. The first part of the discussions is concerned with the classification methods. An important issue addressed in this part is the determination of a set of weights which emphasises the sub-bands that are specific to the target speaker while de-emphasising or removing the contaminated ones. In particular, techniques for determining these weights dynamically according to the level of contamination in the sub-bands are described. Furthermore, the effectiveness of these methods is experimentally analysed through a set of comparative studies. The second part of the discussions focuses on the feature extraction process. Analytically, it is shown that for a sub-band system of S bands, the cepstral coefficients with the quefrency of p have a strong linear relationship to the (S x p)th full-band cepstral parameter. With the aid of a set of experimental results, it is demonstrated that this means the conventional classification methods adapted to work with sub-band cepstral parameters may not be able to capture all the useful spectral information contained in the full-band cepstral parameters. In order to tackle this problem, two methods are described and their relative effectiveness is experimentally examined. The experimental investigations also include an examination of speaker discrimination abilities of different sub-bands and an analysis of different possible recombination levels. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Hertfordshire, Fac Engn & Informat Sci, Hatfield AL10 9AB, Herts, England. RP Ariyaeeinia, AM (reprint author), Univ Hertfordshire, Fac Engn & Informat Sci, Coll Lane, Hatfield AL10 9AB, Herts, England. EM p.sivakumaran@2020speech.com; a.m.ariyaeeinia@herts.ac.uk; m.j.lommes@herts.ac.uk CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 ARIYAEEINIA A, 1997, P EUR 1997 RHOD, P1379 Auckenthaler R, 2000, DIGIT SIGNAL PROCESS, V10, P42, DOI 10.1006/dspr.1999.0360 AUCKENTHALER R, 1997, P EUR 97, P2303 BESACIER L, 1997, P 1 INT C AUD VIS BA, P195 Bourlard H., 1996, P ICSLP 96, V1, P426, DOI 10.1109/ICSLP.1996.607145 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Deller J.
R., 1993, DISCRETE TIME PROCES DODDINGTON GR, 1985, P IEEE, V73, P1651, DOI 10.1109/PROC.1985.13345 HAYAKAWA S, 1994, P IEEE INT C AC SPEE, P137 Hermansky H., 1996, P ICSLP96, V1, P462 HIRSCH HG, 1993, TR93012 ICSI LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 ROSENBERG AE, 1991, P ICASSP, P381, DOI 10.1109/ICASSP.1991.150356 SIVAKUMARAN P, 2000, P ICASSP2000 JUN, V2, P1073 SIVAKUMARAN P, 1998, P ICSLP98, V3, P551 SIVAKUMARAN P, 2000, P ICSLP2000 OCT, V2, P458 TOMLINSON MJ, 1997, P IEEE INT C AC SPEE, P1247 NR 19 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 485 EP 509 DI 10.1016/S0167-6393(03)00017-7 PG 25 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900014 ER PT J AU Richardson, M Bilmes, J Diorio, C AF Richardson, M Bilmes, J Diorio, C TI Hidden-articulator Markov models for speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; articulatory models; noise robustness; factorial HMM ID FEATURES; HMM; REPRESENTATION; PHONEBOOK; INTERFACE; UNITS AB Most existing automatic speech recognition systems today do not explicitly use knowledge about human speech production. We show that the incorporation of articulatory knowledge into these systems is a promising direction for speech recognition, with the potential for lower error rates and more robust performance. To this end, we introduce the Hidden-Articulator Markov model (HAMM), a model which directly integrates articulatory information into speech recognition. The HAMM is an extension of the articulatory-feature model introduced by Erler in 1996. We extend the model by using diphone units, developing a new technique for model initialization, and constructing a novel articulatory feature mapping. We also introduce a method to decrease the number of parameters, making the HAMM comparable in size to standard HMMs. We demonstrate that the HAMM can reasonably predict the movement of articulators, which results in a decreased word error rate (WER). The articulatory knowledge also proves useful in noisy acoustic conditions. When combined with a standard model, the HAMM reduces WER 28-35% relative to the standard model alone. (C) 2003 Elsevier B.V. All rights reserved. C1 Univ Washington, Dept Comp Sci & Engn, Seattle, WA 98195 USA. RP Richardson, M (reprint author), Univ Washington, Dept Comp Sci & Engn, Box 352350, Seattle, WA 98195 USA. EM mattr@cs.washington.edu; bilmes@ee.washington.edu; diorio@cs.washington.edu CR BAILLY G, 1992, SIGNAL PROCESSING 6, V1, P159 Bilmes J., 2000, P 16 C UNC ART INT, P38 BILMES JA, 1999, ICASSP, V2, P713 Bishop C. M., 1995, NEURAL NETWORKS PATT BLACKBURN C, 1995, P EUR, V2, P1623 BLOMBERG M, 1991, P EUR Deng L, 1997, SPEECH COMMUN, V22, P93, DOI 10.1016/S0167-6393(97)00018-6 Deng L, 1998, SPEECH COMMUN, V24, P299, DOI 10.1016/S0167-6393(98)00023-5 Deng L, 1997, SPEECH COMMUN, V23, P211, DOI 10.1016/S0167-6393(97)00047-2 DENG L, 1994, INT CONF ACOUST SPEE, P45 DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839 Dupont S, 1997, INT CONF ACOUST SPEE, P1767, DOI 10.1109/ICASSP.1997.598872 Edwards H. 
T., 1997, APPL PHONETICS SOUND EIDE E, 1993, P ICASSP 93, P483 ELENIUS K, 1992, P ICSLP, P1279 Erler K, 1996, J ACOUST SOC AM, V100, P2500, DOI 10.1121/1.417358 FRANKEL J, 2001, P EUR Frankel J., 2000, P ICSLP GHAHRAMANI Z, 1998, LECT NOTES ARTIFICIA Hardcastle W., 1999, COARTICULATION THEOR KIRCHHOFF K., 1998, P ICSLP, P891 Lauritzen S. L., 1996, GRAPHICAL MODELS LIVESCU K, 2001, P EUR Logan B, 1998, INT CONF ACOUST SPEE, P813, DOI 10.1109/ICASSP.1998.675389 PICONE J, 1999, P INT C AC SPEECH SI, V1, P109 PITRELLI JF, 1995, INT CONF ACOUST SPEE, P101, DOI 10.1109/ICASSP.1995.479283 RICHARDSON M, 2000, ICSLP 2000, V3, P131 RICHARDSON M, 2000, ASR 2000, P133 Rose RC, 1996, J ACOUST SOC AM, V99, P1699, DOI 10.1121/1.414679 Saul LK, 1999, MACH LEARN, V37, P75, DOI 10.1023/A:1007649326333 Schmidbauer O., 1989, P ICASSP, P616 Wrench A, 2000, WORKSH PHON PHON ASR Young S, 1996, IEEE SIGNAL PROC MAG, V13, P45, DOI 10.1109/79.536824 Zweig G., 1998, AAAI IAAI, P173 NR 34 TC 23 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2003 VL 41 IS 2-3 BP 511 EP 529 DI 10.1016/S0167-6393(03)00031-1 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 715KW UT WOS:000184971900015 ER PT J AU Schouten, B AF Schouten, B TI The nature of speech perception - (The psychophysics of speech perception III) - Preface SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Univ Utrecht, UiL OTS, NL-3512 JK Utrecht, Netherlands. RP Schouten, B (reprint author), Univ Utrecht, UiL OTS, Trans 10, NL-3512 JK Utrecht, Netherlands. EM bert.schouten@let.uu.nl CR Lotto AJ, 1998, J ACOUST SOC AM, V103, P3648, DOI 10.1121/1.423087 SCHOUTEN MEH, 1992, AUDITORY PROCESSING SCHOUTEN MEH, 1987, NATO ASI SERIES D, V39 NR 3 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 1 EP 6 DI 10.1016/S0167-6393(02)00088-2 PG 6 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900001 ER PT J AU Scott, SK Wise, RJS AF Scott, SK Wise, RJS TI Functional imaging and language: A critical guide to methodology and analysis SO SPEECH COMMUNICATION LA English DT Article DE PET; fMRI; functional imaging terminology; functional imaging methodology; language ID HUMAN BRAIN; AUDITORY-CORTEX; ACOUSTIC NOISE; FMRI; SPEECH; SYSTEM; OXYGENATION; CONTRAST; REGIONS; MRI AB This paper summarizes the methodology involved in functional neuroimaging, both experimental designs and data analyses. It is intended as a general introduction to the techniques and terminology involved, and aimed at speech scientists new to the area. The methods covered are positron emission tomography (PET) and functional magnetic resonance imaging (fMRI). Other imaging methods, reliant on the pattern of electrical discharges associated with neural activity, have also been used clinically and experimentally, and provide excellent temporal resolution but poor spatial resolution. It is not within the scope of this review to address these. The emphasis is on potential criticisms and problems concerning PET and fMRI, since much has already been published about the advantages, real or perceived. The strengths and weaknesses of PET and fMRI are addressed, with reference to language studies. (C) 2002 Elsevier Science B.V. All rights reserved.
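To make the analysis terminology above concrete, here is a minimal Python sketch of the general-linear-model approach such methodology guides survey: a block-design regressor is built by convolving a boxcar with a canonical haemodynamic response function and fitted to a voxel time series by least squares. The double-gamma HRF parameters, the 2 s TR, and the synthetic voxel are illustrative assumptions, not values from this article.

    import numpy as np
    from scipy.stats import gamma

    TR = 2.0                       # seconds per scan (assumed)
    n_scans = 120
    t = np.arange(n_scans) * TR

    def hrf(tt):
        # Double-gamma canonical HRF: early peak minus late undershoot.
        return gamma.pdf(tt, 6) - 0.35 * gamma.pdf(tt, 16)

    boxcar = ((t // 30) % 2 == 1).astype(float)          # 30 s rest/task blocks
    regressor = np.convolve(boxcar, hrf(np.arange(0.0, 32.0, TR)))[:n_scans]

    X = np.column_stack([regressor, np.ones(n_scans)])   # design matrix
    rng = np.random.default_rng(0)
    voxel = 2.0 * regressor + 100.0 + rng.normal(0.0, 1.0, n_scans)

    beta, *_ = np.linalg.lstsq(X, voxel, rcond=None)
    resid = voxel - X @ beta
    se = np.sqrt(resid.var(ddof=X.shape[1]) * np.linalg.inv(X.T @ X)[0, 0])
    print(f"task effect = {beta[0]:.2f}, t = {beta[0] / se:.1f}")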
C1 UCL, Dept Psychol, London WC1E 6BT, England. UCL, Dept Phonet, London WC1E 6BT, England. Hammersmith Hosp, MRC, Ctr Clin Sci, London, England. RP Scott, SK (reprint author), UCL, Dept Psychol, Gower St, London WC1E 6BT, England. EM sophie.scott@ucl.ac.uk RI Scott, Sophie/A-1843-2010 CR Aguirre GK, 1998, NEUROIMAGE, V8, P360, DOI 10.1006/nimg.1998.0369 BINDER JR, 1994, COGNITIVE BRAIN RES, V2, P31, DOI 10.1016/0926-6410(94)90018-3 Binder JR, 1997, J NEUROSCI, V17, P353 Binder JR, 1999, J COGNITIVE NEUROSCI, V11, P80, DOI 10.1162/089892999563265 Blank SC, 2001, NEUROIMAGE, V13, pS509 Bullmore ET, 1999, HUM BRAIN MAPP, V7, P38, DOI 10.1002/(SICI)1097-0193(1999)7:1<38::AID-HBM4>3.3.CO;2-H Counter SA, 2000, ACTA OTO-LARYNGOL, V120, P739 Counter SA, 1997, JMRI-J MAGN RESON IM, V7, P606, DOI 10.1002/jmri.1880070327 Devlin JT, 2000, NEUROIMAGE, V11, P589, DOI 10.1006/nimg.2000.0595 Dhankhar A, 1997, J NEUROPHYSIOL, V77, P476 Dronkers NF, 1996, NATURE, V384, P159, DOI 10.1038/384159a0 Friston KJ, 1994, HUMAN BRAIN MAPPING, V2, P189, DOI DOI 10.1002/HBM.460020402 Friston KJ, 1996, NEUROIMAGE, V4, P97, DOI 10.1006/nimg.1996.0033 Friston KJ, 1999, NEUROIMAGE, V10, P385, DOI 10.1006/nimg.1999.0484 Friston KJ, 1996, NEUROIMAGE, V4, P223, DOI 10.1006/nimg.1996.0074 Friston KJ, 1997, HUMAN BRAIN FUNCTION, P141 Gorno-Tempini ML, 2001, NEUROIMAGE, V14, P465, DOI 10.1006/nimg.2001.0811 Griffiths TD, 1998, NAT NEUROSCI, V1, P422, DOI 10.1038/1637 Guimaraes AR, 1998, HUM BRAIN MAPP, V6, P33, DOI 10.1002/(SICI)1097-0193(1998)6:1<33::AID-HBM3>3.0.CO;2-M Hall DA, 1999, HUM BRAIN MAPP, V7, P213, DOI 10.1002/(SICI)1097-0193(1999)7:3<213::AID-HBM5>3.0.CO;2-N Kircher TTJ, 2000, NEUROREPORT, V11, P4093, DOI 10.1097/00001756-200012180-00036 KWONG KK, 1992, P NATL ACAD SCI USA, V89, P5675, DOI 10.1073/pnas.89.12.5675 Leff AP, 2000, ANN NEUROL, V47, P171, DOI 10.1002/1531-8249(200002)47:2<171::AID-ANA6>3.0.CO;2-P Maguire EA, 1999, HIPPOCAMPUS, V9, P54, DOI 10.1002/(SICI)1098-1063(1999)9:1<54::AID-HIPO6>3.0.CO;2-O Mansfield P, 1977, J PHYS C SOLID STATE, V10, P55 MarslenWilson WD, 1997, NATURE, V387, P592, DOI 10.1038/42456 McColl J H, 1994, Stat Methods Med Res, V3, P63, DOI 10.1177/096228029400300105 Moore CJ, 1999, NEUROIMAGE, V10, P181, DOI 10.1006/nimg.1999.0450 Morris JS, 1996, NATURE, V383, P812, DOI 10.1038/383812a0 Mummery CJ, 1998, J COGNITIVE NEUROSCI, V10, P766, DOI 10.1162/089892998563059 Mummery CJ, 1999, J ACOUST SOC AM, V106, P449, DOI 10.1121/1.427068 Ni W, 2000, J COGNITIVE NEUROSCI, V12, P120, DOI 10.1162/08989290051137648 OGAWA S, 1990, MAGNET RESON MED, V14, P68, DOI 10.1002/mrm.1910140108 OGAWA S, 1990, P NATL ACAD SCI USA, V87, P9868, DOI 10.1073/pnas.87.24.9868 Poeppel D, 1996, BRAIN LANG, V55, P317, DOI 10.1006/brln.1996.0108 Postle BR, 2000, BRAIN RES PROTOC, V5, P57, DOI 10.1016/S1385-299X(99)00053-7 Price CJ, 1996, CEREB CORTEX, V6, P62, DOI 10.1093/cercor/6.1.62 Price CJ, 1997, NEUROIMAGE, V5, P261, DOI 10.1006/nimg.1997.0269 RABEHESKETH S, 1997, STAT METHODS MED RES, V6, P15 Ravicz ME, 2001, J ACOUST SOC AM, V109, P216, DOI 10.1121/1.1326083 SCHNEIDER W, 1977, PSYCHOL REV, V84, P1, DOI 10.1037/0033-295X.84.1.1 Scott SK, 2000, NEUROREPORT, V11, P1523, DOI 10.1097/00001756-200005150-00031 SHIFFRIN RM, 1977, PSYCHOL REV, V84, P127, DOI 10.1037/0033-295X.84.2.127 Talairach J., 1988, COPLANAR STEREOTAXIC Talavage TM, 2000, HEARING RES, V150, P225, DOI 10.1016/S0378-5955(00)00203-3 TURNER R, 1991, MAGNET RESON MED, V22, P159, DOI 10.1002/mrm.1910220117 VILLRINGER A, 1988, MAGNET RESON MED, V6, 
P164, DOI 10.1002/mrm.1910060205 Wise RJS, 2000, NEUROPSYCHOLOGIA, V38, P985, DOI 10.1016/S0028-3932(99)00152-9 Wise RJS, 1999, LANCET, V353, P1057, DOI 10.1016/S0140-6736(98)07491-1 Zarahn E, 1997, NEUROIMAGE, V6, P122, DOI 10.1006/nimg.1997.0279 NR 50 TC 8 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 7 EP 21 DI 10.1016/S0167-6393(02)00089-4 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900002 ER PT J AU Scott, SK Wise, RJS AF Scott, SK Wise, RJS TI PET and MRI studies of the neural basis of speech perception SO SPEECH COMMUNICATION LA English DT Article DE speech perception; PET; fMRI; superior temporal gyrus; superior temporal sulcus; auditory cortex; processing streams ID HUMAN AUDITORY-CORTEX; ITERATED RIPPLED NOISE; HUMAN BRAIN; TEMPORAL-LOBE; FMRI; LATERALIZATION; NEUROANATOMY; INFORMATION; PATHWAYS; VISION AB Functional imaging (PET and fMRI) has made great advances in our understanding of the neural basis of auditory processing and speech perception. Here we review perceptual and sensory processing aspects of speech and hearing, as revealed by functional imaging studies. A comparison is made of the peaks of activity in the left hemisphere across the different studies discussed, with implications for patterns of hierarchical processing in the auditory system addressed. This is related to the functional anatomy of the auditory cortex as determined from non-human primate studies. (C) 2002 Elsevier Science B.V. All rights reserved. C1 UCL, Dept Psychol, London WC1E 6BT, England. UCL, Dept Phonet, London WC1E 6BT, England. Hammersmith Hosp, MRC, Ctr Clin Sci, London, England. RP Scott, SK (reprint author), UCL, Dept Psychol, Gower St, London WC1E 6BT, England. 
EM sophie.scott@ucl.ac.uk RI Scott, Sophie/A-1843-2010 CR BAILEY PJ, 1980, J EXP PSYCHOL HUMAN, V6, P536 Belin P, 2000, NATURE, V403, P309, DOI 10.1038/35002078 Belin P, 1998, J COGNITIVE NEUROSCI, V10, P536, DOI 10.1162/089892998562834 Binder J, 2000, BRAIN, V123, P2371, DOI 10.1093/brain/123.12.2371 Binder JR, 2000, CEREB CORTEX, V10, P512, DOI 10.1093/cercor/10.5.512 Binder JR, 1997, J NEUROSCI, V17, P353 BLESSER B, 1972, J SPEECH HEAR RES, V151, P5 Calvert GA, 1999, NEUROREPORT, V10, P2619, DOI 10.1097/00001756-199908200-00033 GALUSKE RAW, 1999, NEUROIMAGE, V9, pS994 GIRAUD AL, 2000, NEUROPHYSIOLOGY, V843, P1588 GOODALE MA, 1992, TRENDS NEUROSCI, V15, P20, DOI 10.1016/0166-2236(92)90344-8 Griffiths TD, 2001, NAT NEUROSCI, V4, P633, DOI 10.1038/88459 Griffiths TD, 1998, NAT NEUROSCI, V1, P422, DOI 10.1038/1637 Hall DA, 2002, CEREB CORTEX, V12, P140, DOI 10.1093/cercor/12.2.140 Hall DA, 1999, HUM BRAIN MAPP, V7, P213, DOI 10.1002/(SICI)1097-0193(1999)7:3<213::AID-HBM5>3.0.CO;2-N HANDEL S, 1988, J EXP PSYCHOL HUMAN, V14, P315, DOI 10.1037//0096-1523.14.2.315 HARMS MP, 1998, NEUROIMAGE, V7, pS365 Hickok G, 2000, TRENDS COGN SCI, V4, P131, DOI 10.1016/S1364-6613(00)01463-7 Kaas JH, 1999, NAT NEUROSCI, V2, P1045, DOI 10.1038/15967 MISHKIN M, 1983, TRENDS NEUROSCI, V6, P414, DOI 10.1016/0166-2236(83)90190-X Mummery CJ, 1999, J ACOUST SOC AM, V106, P449, DOI 10.1121/1.427068 Patterson RD, 1996, J ACOUST SOC AM, V100, P3286, DOI 10.1121/1.417212 Penhune VB, 1996, CEREB CORTEX, V6, P661, DOI 10.1093/cercor/6.5.661 POEPPEL D, 1996, BRAIN RES COGNITIVE, V44, P231 Poldrack RA, 2001, J COGNITIVE NEUROSCI, V13, P687, DOI 10.1162/089892901750363235 Rauschecker JP, 1998, CURR OPIN NEUROBIOL, V8, P516, DOI 10.1016/S0959-4388(98)80040-8 REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191 ROSEN S, 1992, PHILOS T ROY SOC B, V336, P367, DOI 10.1098/rstb.1992.0070 SCHROEDE.MR, 1968, J ACOUST SOC AM, V44, P1735, DOI 10.1121/1.1911323 Scott SK, 2000, BRAIN, V123, P2400, DOI 10.1093/brain/123.12.2400 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 Talairach J., 1988, COPLANAR STEREOTAXIC Talavage TM, 2000, HEARING RES, V150, P225, DOI 10.1016/S0378-5955(00)00203-3 Tanaka H, 2000, NEUROREPORT, V11, P2045, DOI 10.1097/00001756-200006260-00047 THIVARD L, 2001, NEUROREPORT, V1113, P2969 WESSINGER CM, 2001, J COGNITIVE NEUROSCI, V131, P1 Wessinger CM, 1997, HUM BRAIN MAPP, V5, P18, DOI 10.1002/(SICI)1097-0193(1997)5:1<18::AID-HBM3>3.0.CO;2-Q WISE R, 1991, BRAIN, V114, P1803, DOI 10.1093/brain/114.4.1803 Wise RJS, 2001, BRAIN, V124, P83, DOI 10.1093/brain/124.1.83 Yost WA, 1996, J ACOUST SOC AM, V100, P511, DOI 10.1121/1.415873 ZATORRE RJ, 1992, SCIENCE, V256, P846, DOI 10.1126/science.1589767 NR 41 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD AUG PY 2003 VL 41 IS 1 BP 23 EP 34 DI 10.1016/S0167-6393(02)00090-0 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900003 ER PT J AU Kraus, N Nicol, T AF Kraus, N Nicol, T TI Aggregate neural responses to speech sounds in the central auditory system SO SPEECH COMMUNICATION LA English DT Article DE auditory; neural; perception; speech; perceptual learning; evoked responses; plasticity ID ADULT OWL MONKEYS; LEARNING-PROBLEMS; LANGUAGE COMPREHENSION; RIGHT-HEMISPHERE; CHILDREN; DISCRIMINATION; PERCEPTION; BRAIN; DEFICITS; STIMULI AB The fundamental complexity of speech, in both the spectral and temporal domains, elicits extensive dynamic activity from a broad neural population. Evoked potentials rely on a summation of synchronous aggregate neural activity, making them especially suitable for speech-sound investigation. This paper summarizes research from our lab that demonstrates the efficacy of speech-evoked responses in addressing three fundamental issues. First, the neural bases of left-brain specialization to speech are investigated in an animal model. Second, studies are aimed at inferring the underlying causes of certain language-based learning disabilities. Finally, in a series of before-and-after designs, the underlying neural plasticity that accompanies directed speech-sound training is explored. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Northwestern Univ, Dept Commun Sci, Evanston, IL 60208 USA. Northwestern Univ, Dept Neurobiol & Physiol, Evanston, IL 60208 USA. Northwestern Univ, Dept Otolaryngol, Evanston, IL 60208 USA. RP Kraus, N (reprint author), Northwestern Univ, Dept Commun Sci, Frances Searle Bldg,2299 N Campus Dr, Evanston, IL 60208 USA. EM nkraus@northwestern.edu CR Bellis TJ, 2000, J NEUROSCI, V20, P791 Bradlow AR, 1999, J ACOUST SOC AM, V106, P2074, DOI 10.1121/1.427952 Carrell TD, 1999, EAR HEARING, V20, P175, DOI 10.1097/00003446-199904000-00008 Conley EM, 1999, CLIN NEUROPHYSIOL, V110, P2086, DOI 10.1016/S1388-2457(99)00183-2 Cunningham J, 2002, HEARING RES, V169, P97, DOI 10.1016/S0378-5955(02)00344-1 Cunningham J, 2001, CLIN NEUROPHYSIOL, V112, P758, DOI 10.1016/S1388-2457(01)00465-5 Dehaene-Lambertz G, 1998, NEUROREPORT, V9, P1885, DOI 10.1097/00001756-199806010-00040 Diehl SF, 1999, LANG SPEECH HEAR SER, V30, P108 ELBERLING C, 1982, SCAND AUDIOL, V11, P61, DOI 10.3109/01050398209076201 ELLIOTT LL, 1989, J SPEECH HEAR RES, V32, P112 FITCH RH, 1993, ANN NY ACAD SCI, V682, P346, DOI 10.1111/j.1749-6632.1993.tb22989.x GAZZANIGA MS, 1983, AM PSYCHOL, V38, P525, DOI 10.1037/0003-066X.38.5.525 GESCHWIN.N, 1972, SCI AM, V226, P76 GODFREY JJ, 1981, J EXP CHILD PSYCHOL, V32, P401, DOI 10.1016/0022-0965(81)90105-3 HAYES E, 2001, AUDITORY PROCESSING HEFFNER HE, 1986, J NEUROPHYSIOL, V56, P683 JENKINS WM, 1990, J NEUROPHYSIOL, V63, P82 JUSCZYK PW, 1993, CHILD DEV, V64, P675, DOI 10.1111/j.1467-8624.1993.tb02935.x KIMURA D, 1961, CAN J PSYCHOLOGY, V15, P166, DOI 10.1037/h0083219 King C, 2002, NEUROSCI LETT, V319, P111, DOI 10.1016/S0304-3940(01)02556-3 King C, 1999, NEUROSCI LETT, V267, P89, DOI 10.1016/S0304-3940(99)00336-5 Kraus N., 1995, J COGNITIVE NEUROSCI, V7, P27 Kraus N, 1996, SCIENCE, V273, P971, DOI 10.1126/science.273.5277.971 KUHL PK, 1992, SCIENCE, V255, P606, DOI 10.1126/science.1736364 MCCLASKEY CL, 1983, PERCEPT PSYCHOPHYS, V34, P323, DOI 10.3758/BF03203044 MEHLER J, 1978, PERCEPTION, V7, P491, DOI 10.1068/p070491 Merzenich MM, 1996, SCIENCE, V271, P77, DOI
10.1126/science.271.5245.77 MERZENICH MM, 1990, COLD SH Q B, V55, P873 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 MILNER B, 1971, BRIT MED BULL, V27, P272 MOLLER AR, 1999, CLIN NEUROPHYSIOL, V49, P27 MORRISON S, 1998, EAROBICS PRO CHILD L, V14, P279 NAATANEN R, 1987, PSYCHOPHYSIOLOGY, V24, P375, DOI 10.1111/j.1469-8986.1987.tb00311.x Naatanen R, 1997, NATURE, V385, P432, DOI 10.1038/385432a0 Orton S. T., 1937, READING WRITING SPEE PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 PISONI DB, 1982, J EXP PSYCHOL HUMAN, V8, P297 POLAT U, 1988, MATURATIONAL WINDOWS, P111 Ponton CW, 2000, CLIN NEUROPHYSIOL, V111, P220, DOI 10.1016/S1388-2457(99)00236-9 RECANZONE GH, 1993, J NEUROSCI, V13, P87 SAMS M, 1985, ELECTROEN CLIN NEURO, V62, P437, DOI 10.1016/0168-5597(85)90054-1 SAMS M, 1991, PSYCHOPHYSIOLOGY, V28, P21, DOI 10.1111/j.1469-8986.1991.tb03382.x SHANWEILER D, 1995, PSYCHOL SCI, V6, P149, DOI 10.1111/j.1467-9280.1995.tb00324.x Sharma A, 1994, J ACOUST SOC AM, V95, P3011, DOI 10.1121/1.408785 TALLAL P, 1974, NEUROPSYCHOLOGIA, V12, P83, DOI 10.1016/0028-3932(74)90030-X TALLAL P, 1978, BRAIN LANG, V5, P13, DOI 10.1016/0093-934X(78)90003-2 Tallal P, 1996, SCIENCE, V271, P81, DOI 10.1126/science.271.5245.81 Tallal P, 1998, EXP BRAIN RES, V123, P210, DOI 10.1007/s002210050563 TAYLOR MM, 1967, J ACOUST SOC AM, V41, P782, DOI 10.1121/1.1910407 Tremblay K, 1998, NEUROREPORT, V9, P3557 Tremblay K, 1997, J ACOUST SOC AM, V102, P3762, DOI 10.1121/1.420139 Tremblay K, 2001, EAR HEARING, V22, P79, DOI 10.1097/00003446-200104000-00001 WERKER JF, 1992, CAN J PSYCHOL, V46, P551, DOI 10.1037/h0084331 Wible B, 2002, CLIN NEUROPHYSIOL, V113, P485, DOI 10.1016/S1388-2457(02)00017-2 Wilkinson G. S., 1993, WIDE RANGE ACHIEVEME Woodcock R. W., 1989, WOODCOCKJOHNSON PSYC Woodcock R. W., 1977, WOODCOCKJOHNSON PSYC ZATORRE RJ, 1992, SCIENCE, V256, P846, DOI 10.1126/science.1589767 NR 58 TC 4 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 35 EP 47 DI 10.1016/S0167-6393(02)00091-2 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900004 ER PT J AU May, BJ AF May, BJ TI Physiological and psychophysical assessments of the dynamic range of vowel representations in the auditory periphery SO SPEECH COMMUNICATION LA English DT Article DE speech coding; auditory-nerve fibers; ventral cochlear nucleus; rate saturation; two-tone suppression; formant frequency discrimination ID ANTEROVENTRAL COCHLEAR NUCLEUS; STEADY-STATE VOWELS; NERVE FIBERS; DISCHARGE PATTERNS; SPECTRAL CUES; CAT; DISCRIMINATION; CLASSIFICATION; THRESHOLDS; RESPONSES AB This review summarizes our work on the neural encoding of steady-state vowels. As in previous studies from our laboratory, the speech code is described in terms of average discharge rates for populations of neurons in the auditory nerve and ventral cochlear nucleus of barbiturate anesthetized cats. Our current analyses extend these population measures with new statistical models and signal detection methods to facilitate quantitative comparisons of the effects of stimulus level on vowel formant representations. These measures are applied to low and high spontaneous rate auditory-nerve fibers, and four of the principal response types of the ventral cochlear nucleus. 
The perceptual significance of the speech code is examined by relating neural response patterns to the behavioral performance of cats in vowel formant discrimination tasks. In combination, these physiological and psychophysical assessments suggest that vowel-coding mechanisms based on discharge rates in the auditory periphery are sufficient to support the dynamic range of speech perception. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Johns Hopkins Univ, Dept Otolaryngol HNS, Baltimore, MD 21205 USA. RP May, BJ (reprint author), Johns Hopkins Univ, Dept Otolaryngol HNS, 505 Traylor Res Bldg,720 Rutland Ave, Baltimore, MD 21205 USA. EM bmay@jhu.edu CR BLACKBURN CC, 1990, J NEUROPHYSIOL, V63, P1191 BLACKBURN CC, 1989, J NEUROPHYSIOL, V62, P1303 Bourk TR, 1976, THESIS MIT CAMBRIDGE CANT NB, 1984, HEARING SCI RECENT A, P371 CANT NB, 1981, NEUROSCIENCE, V6, P2643, DOI 10.1016/0306-4522(81)90109-3 Conley RA, 1995, J ACOUST SOC AM, V98, P3223, DOI 10.1121/1.413812 DELGUTTE BERTRAND, 1982, REPRESENTATION SPEEC, P131 Hienz RD, 1996, J ACOUST SOC AM, V100, P1052, DOI 10.1121/1.416291 Hienz RD, 1998, HEARING RES, V116, P10, DOI 10.1016/S0378-5955(97)00197-4 Hienz RD, 1996, J ACOUST SOC AM, V99, P3656, DOI 10.1121/1.414980 Lai Y C, 1994, J Comput Neurosci, V1, P167, DOI 10.1007/BF00961733 LePrell G, 1996, AUDIT NEUROSCI, V2, P275 May BJ, 1997, J ACOUST SOC AM, V101, P2705, DOI 10.1121/1.418559 May BJ, 1997, ACOUSTICAL SIGNAL PROCESSING IN THE CENTRAL AUDITORY SYSTEM, P413, DOI 10.1007/978-1-4419-8712-9_38 MAY BJ, 1998, PSYCHOPHYSICAL PHYSL, P376 May BJ, 1996, AUDIT NEUROSCI, V3, P135 May BJ, 1998, J NEUROPHYSIOL, V79, P1755 Miller RL, 1999, J ACOUST SOC AM, V106, P2693, DOI 10.1121/1.428135 Miller RL, 1997, J ACOUST SOC AM, V101, P3602, DOI 10.1121/1.418321 PALMER AR, 1986, J ACOUST SOC AM, V79, P100, DOI 10.1121/1.393633 PFEIFFER RR, 1966, EXP BRAIN RES, V1, P220 RHODE WS, 1983, J COMP NEUROL, V213, P448, DOI 10.1002/cne.902130408 RICE JJ, 1995, J ACOUST SOC AM, V97, P1764, DOI 10.1121/1.412053 RYUGO DK, 1982, J COMP NEUROL, V210, P239, DOI 10.1002/cne.902100304 SACHS MB, 1968, J ACOUST SOC AM, V43, P1120, DOI 10.1121/1.1910947 SACHS MB, 1979, J ACOUST SOC AM, V66, P470, DOI 10.1121/1.383098 SINEX DG, 1983, J ACOUST SOC AM, V73, P602, DOI 10.1121/1.389007 Winslow R. L., 1987, AUDITORY PROCESSING, P212 YOUNG ED, 1979, J ACOUST SOC AM, V66, P1381, DOI 10.1121/1.383532 NR 29 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 49 EP 57 DI 10.1016/S0167-6393(02)00092-4 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900005 ER PT J AU Kluender, KR Coady, JA Kiefte, M AF Kluender, KR Coady, JA Kiefte, M TI Sensitivity to change in perception of speech SO SPEECH COMMUNICATION LA English DT Article ID AUDITORY-NERVE FIBERS; SPECTRAL-ENVELOPE DISTORTION; STOP-CONSONANT PERCEPTION; PRECEDING LIQUID; COMPENSATION; ADAPTATION; RESPONSES; SOUNDS; COARTICULATION; IDENTIFICATION AB Perceptual systems in all modalities are predominantly sensitive to stimulus change, and many examples of perceptual systems responding to change can be portrayed as instances of enhancing contrast. 
Multiple findings from perception experiments serve as evidence for spectral contrast explaining fundamental aspects of perception of coarticulated speech, and these findings are consistent with a broad array of known psychoacoustic and neurophysiological phenomena. Beyond coarticulation, important characteristics of speech perception that extend across broader spectral and temporal ranges may best be accounted for by the constant calibration of perceptual systems to maximize sensitivity to change. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Wisconsin, Dept Psychol, Madison, WI 53706 USA. Dalhousie Univ, Sch Human Commun Disorders, Halifax, NS B3H 1R2, Canada. RP Kluender, KR (reprint author), Univ Wisconsin, Dept Psychol, 1202 W Johnson St, Madison, WI 53706 USA. EM krkluend@facstaff.wisc.edu CR ABELES M, 1972, BRAIN RES, V42, P337, DOI 10.1016/0006-8993(72)90535-5 Abrahams H, 1937, AM J PSYCHOL, V49, P462, DOI 10.2307/1415781 CARDOZO BL, 1967, IPO ANN PROG REP, V2, P59 CATHCART EP, 1928, BRIT J PSYCHOL, V19, P343 CHRISTMAN RJ, 1954, AM J PSYCHOL, V67, P484, DOI 10.2307/1417939 COADY JA, 2001, J ACOUST SOC AM 2, V109, P2315 CREUTZFELDT O, 1980, EXP BRAIN RES, V39, P87 Delgutte B., 1986, INVARIANCE VARIABILI, P131 DELGUTTE B, 1984, J ACOUST SOC AM, V75, P897, DOI 10.1121/1.390599 DELGUTTE B, 1980, J ACOUST SOC AM, V68, P843, DOI 10.1121/1.384824 Delgutte B., 1996, AUDITORY BASIS SPEEC, P1 Delgutte B., 1996, HDB PHONETIC SCI, P507 ENGEL T, 1982, PERCEPTION ODORS FESTEN JM, 1981, J ACOUST SOC AM, V70, P356, DOI 10.1121/1.386771 FOWLER CA, 1990, PERCEPT PSYCHOPHYS, V48, P559, DOI 10.3758/BF03211602 GREEN DM, 1959, J ACOUST SOC AM, V31, P1146 Hoagland H, 1933, J GEN PHYSIOL, V16, P911, DOI 10.1085/jgp.16.6.911 Holt LL, 2000, J ACOUST SOC AM, V108, P710, DOI 10.1121/1.429604 HOLT LL, 1999, THESIS U WISCONSINMA HOOD JD, 1950, ACTA OTOLARYN S, V92 HOUTGAST T, 1974, ACUSTICA, V31, P320 HOUTGAST T, 1972, J ACOUST SOC AM, V51, P1885, DOI 10.1121/1.1913048 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 Kluender K. R., 2001, J ACOUST SOC AM 2, V109, P2294 LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 LINDBLOM BE, 1967, J ACOUST SOC AM, V42, P830, DOI 10.1121/1.1910655 Lotto AJ, 1997, J ACOUST SOC AM, V102, P1134, DOI 10.1121/1.419865 Lotto AJ, 1998, PERCEPT PSYCHOPHYS, V60, P602, DOI 10.3758/BF03206049 MANN VA, 1980, PERCEPT PSYCHOPHYS, V28, P407, DOI 10.3758/BF03204884 MANN VA, 1981, J ACOUST SOC AM, V69, P548, DOI 10.1121/1.385483 MANN VA, 1986, COGNITION, V24, P169, DOI 10.1016/S0010-0277(86)80001-4 Marr D, 1982, VISION MARR D, 1976, PHILOS T ROY SOC B, V275, P483, DOI 10.1098/rstb.1976.0090 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 NEAREY TM, 1989, J ACOUST SOC AM, V85, P2088, DOI 10.1121/1.397861 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 RIGGS LA, 1953, J OPT SOC AM, V43, P495, DOI 10.1364/JOSA.43.000495 Schouten JF, 1940, P K NED AKAD WETENSC, V43, P356 Schreiner C. 
E., 1984, AUDITORY FUNCTION NE, P337 SMITH RL, 1979, J ACOUST SOC AM, V65, P166, DOI 10.1121/1.382260 SMITH RL, 1971, J ACOUST SOC AM, V50, P1520, DOI 10.1121/1.1912805 SMITH RL, 1985, J ACOUST SOC AM, V78, P1310, DOI 10.1121/1.392900 SUMMERFIELD Q, 1987, J ACOUST SOC AM, V81, P700, DOI 10.1121/1.394838 SUMMERFIELD Q, 1984, PERCEPT PSYCHOPHYS, V35, P203, DOI 10.3758/BF03205933 Urbantschitsch V., 1876, BEOBACHTUNGEN ANOMAL VIEMEISTER NF, 1982, J ACOUST SOC AM, V71, P1502, DOI 10.1121/1.387849 Viemeister NF, 1980, PSYCHOPHYSICAL PHYSL, P190 Watkins AJ, 1996, J ACOUST SOC AM, V99, P588, DOI 10.1121/1.414515 WATKINS AJ, 1994, J ACOUST SOC AM, V96, P1263, DOI 10.1121/1.410275 Watkins AJ, 1996, J ACOUST SOC AM, V99, P3749, DOI 10.1121/1.414981 Wightman F., 1977, PSYCHOPHYSICS PHYSL, P295 Yarbus A. L., 1967, EYE MOVEMENTS VISION Zwaardemaker H, 1895, PHYSL GERUCHS NR 53 TC 34 Z9 35 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 59 EP 69 DI 10.1016/S0167-6393(02)00093-6 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900006 ER PT J AU Schouten, B Gerrits, E van Hessen, A AF Schouten, B Gerrits, E van Hessen, A TI The end of categorical perception as we know it SO SPEECH COMMUNICATION LA English DT Article DE categorical perception ID MODELING PHONEME PERCEPTION; SPEECH-PERCEPTION; MOTOR THEORY; DISCRIMINATION AB Comparing phoneme classification and discrimination (or "categorical perception") of a stimulus continuum has for a long time been regarded as a useful method for investigating the storage and retrieval of phoneme categories in long-term memory. The closeness of the relationship between the two tasks, i.e. the degree of categorical perception, depends on a number of factors, some of which are unknown or random. One very important factor, however, seems to be the degree of bias (in the signal-detection sense of the term) in the discrimination task. When the task is such (as it is in 2IFC, for example) that the listener has to rely heavily on an internal, subjective, criterion, discrimination can seem to be almost perfectly categorical, if the stimuli are natural enough. Presenting the same stimuli in a much less biasing task, however, leads to discrimination results that are completely unrelated to phoneme classification. Even the otherwise ubiquitous peak at the phoneme boundary has disappeared. The traditional categorical-perception experiment measures the bias inherent in the discrimination task; if we want to know how speech sounds are categorized, we will have to look elsewhere. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Utrecht, UiL OTS, NL-3512 JK Utrecht, Netherlands. Univ Twente, NL-7500 AE Enschede, Netherlands. RP Schouten, B (reprint author), Univ Utrecht, UiL OTS, Trans 10, NL-3512 JK Utrecht, Netherlands.
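To make the signal-detection terms above concrete, a minimal Python sketch with made-up hit and false-alarm rates: d' measures discriminability, the criterion c measures bias, and a listener who leans on an internal criterion, as the abstract argues happens in 2IFC, can shift c markedly with no change in d'.

    from statistics import NormalDist

    def dprime_and_criterion(hit_rate, fa_rate):
        z = NormalDist().inv_cdf
        d = z(hit_rate) - z(fa_rate)            # sensitivity d'
        c = -0.5 * (z(hit_rate) + z(fa_rate))   # criterion (0 = unbiased)
        return d, c

    # Equal sensitivity, different response bias:
    print(dprime_and_criterion(0.84, 0.16))     # d' ~ 2.0, c ~ 0.0
    print(dprime_and_criterion(0.69, 0.07))     # d' ~ 2.0, c ~ 0.5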
EM bert.schouten@let.uu.nl CR COWAN N, 1986, J ACOUST SOC AM, V79, P500, DOI 10.1121/1.393537 CUTTING JE, 1982, PERCEPT PSYCHOPHYS, V31, P462, DOI 10.3758/BF03204856 GERRITS E, 2002, UNPUB CATEGORICAL PE LANE H, 1965, PSYCHOL REV, V72, P275, DOI 10.1037/h0021986 LIBERMAN AM, 1957, J EXP PSYCHOL, V54, P358, DOI 10.1037/h0044417 Massaro D., 1987, CATEGORICAL PERCEPTI, P254 van Hessen AJ, 1999, PHONETICA, V56, P56, DOI 10.1159/000028441 SCHOUTEN MEH, 1980, ACTA PSYCHOL, V44, P71, DOI 10.1016/0001-6918(80)90077-3 SCHOUTEN MEH, 1992, J ACOUST SOC AM, V92, P1841, DOI 10.1121/1.403841 STUDDERT.M, 1970, PSYCHOL REV, V77, P234, DOI 10.1037/h0029078 THOMASSEN K, 1993, THESIS UTRECHT U VANHESSEN AJ, 1992, J ACOUST SOC AM, V92, P1856, DOI 10.1121/1.403842 NR 12 TC 33 Z9 33 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 71 EP 80 DI 10.1016/S0167-6393(02)00094-8 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900007 ER PT J AU Moore, BCJ AF Moore, BCJ TI Speech processing for the hearing-impaired: successes, failures, and implications for speech mechanisms SO SPEECH COMMUNICATION LA English DT Article DE speech intelligibility; hearing impairment; effects of noise; speech enhancement; loudness recruitment ID PSYCHOPHYSICAL TUNING CURVES; AUDITORY-NERVE FIBERS; MULTICHANNEL COMPRESSION; LOUDNESS RECRUITMENT; RECEPTION THRESHOLD; DEAD REGIONS; SENTENCE INTELLIGIBILITY; SPECTRAL ENHANCEMENT; INTERFERING SPEECH; ARTICULATION INDEX AB People with sensorineural hearing impairment typically have more difficulty than normally hearing people in understanding speech in the presence of background sounds. This paper starts by quantifying the magnitude of the problem in various listening situations and with various types of background sound. It then considers some of the factors that contribute to this difficulty, including: reduced audibility; reduced frequency selectivity; loudness recruitment; and regions in the cochlea which have no surviving inner hair cells and/or neurones (dead regions). Methods of compensating for the effects of some of these factors are described and evaluated. Signal-processing methods to compensate for the effects of reduced frequency selectivity using the output of a single microphone have had only limited success, although methods using multiple microphones have worked well. Amplitude compression can compensate for some of the effects of loudness recruitment, allowing speech to be understood over a wide range of sound levels. The exact form of the compression (fast-acting versus slow-acting, single-channel versus multiple channel) does not seem to be critical, suggesting that the relative loudness of different components of speech, and dynamic aspects of loudness perception do not need to be restored to "normal". (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Cambridge, Dept Expt Psychol, Cambridge CB2 3EB, England. RP Moore, BCJ (reprint author), Univ Cambridge, Dept Expt Psychol, Downing St, Cambridge CB2 3EB, England. 
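A minimal Python sketch of the kind of single-channel amplitude compressor evaluated above; the threshold, ratio, and time constants are illustrative assumptions, not the paper's algorithm. Gain is derived from a smoothed envelope, and the attack and release constants are what make a compressor "fast-acting" or "slow-acting".

    import numpy as np

    def compress(x, fs, threshold_db=-40.0, ratio=3.0,
                 attack_ms=5.0, release_ms=100.0):
        a_att = np.exp(-1000.0 / (fs * attack_ms))
        a_rel = np.exp(-1000.0 / (fs * release_ms))
        env = 0.0
        y = np.empty_like(x)
        for i, s in enumerate(x):
            level = abs(s)
            a = a_att if level > env else a_rel   # follow onsets quickly
            env = a * env + (1.0 - a) * level     # smoothed envelope
            level_db = 20.0 * np.log10(max(env, 1e-9))
            over_db = max(level_db - threshold_db, 0.0)
            gain_db = -over_db * (1.0 - 1.0 / ratio)  # static I/O curve
            y[i] = s * 10.0 ** (gain_db / 20.0)
        return y

    fs = 16000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 440 * t) * np.linspace(0.01, 1.0, fs)
    y = compress(x, fs)   # level range above threshold is squeezed ~3:1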
EM bcjm@cus.cam.ac.uk RI Moore, Brian/I-5541-2012 CR ALCANTARA JI, IN PRESS INT J AUDIO, V41 ANIANSSO.G, 1974, ACTA OTO-LARYNGOL, P1 ANSI, 1969, S35 ANSI ANSI, 1997, S351997 ANSI BAER T, 1994, J ACOUST SOC AM, V95, P2277, DOI 10.1121/1.408640 BAER T, 1993, J REHABIL RES DEV, V30, P49 BAER T, 1997, MODELING SENSORINEUR BAER T, 1993, J ACOUST SOC AM, V94, P1229, DOI 10.1121/1.408176 Bench J., 1979, SPEECH HEARING TESTS Bentler RA, 2000, EAR HEARING, V21, P625, DOI 10.1097/00003446-200012000-00009 Boothroyd A, 1968, SOUND, V2, P3 BRONKHORST AW, 1989, J ACOUST SOC AM, V86, P1374, DOI 10.1121/1.398697 Ching TYC, 1998, J ACOUST SOC AM, V103, P1128, DOI 10.1121/1.421224 COX RM, 1995, EAR HEARING, V16, P176, DOI 10.1097/00003446-199504000-00005 DILLON H, 1993, ACOUSTICAL FACTORS A DUGAL R, 1978, ACOUSTICAL FACTORS A Elberling C, 1993, Scand Audiol Suppl, V38, P39 FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247 Fletcher H, 1940, REV MOD PHYS, V12, P0047, DOI 10.1103/RevModPhys.12.47 Fletcher H., 1953, SPEECH HEARING COMMU FLETCHER H, 1952, J ACOUST SOC AM, V24, P490, DOI 10.1121/1.1906926 FLORENTINE M, 1983, J ACOUST SOC AM, V73, P961, DOI 10.1121/1.389021 DUQUESNOY AJ, 1983, J ACOUST SOC AM, V74, P739, DOI 10.1121/1.389859 Fowler EP, 1936, ARCHIV OTOLARYNGOL, V24, P731 Franck BAM, 1999, J ACOUST SOC AM, V106, P1452, DOI 10.1121/1.428055 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 GLASBERG BR, 1986, J ACOUST SOC AM, V79, P1020, DOI 10.1121/1.393374 GREENBERG S, 1998, J ACOUST SOC AM, V103, P3057, DOI 10.1121/1.422679 Hogan CA, 1998, J ACOUST SOC AM, V104, P432, DOI 10.1121/1.423247 HOHMANN V, 1995, J ACOUST SOC AM, V97, P1191, DOI 10.1121/1.413092 HYGGE S, 1992, J SPEECH HEAR RES, V35, P208 KATES JM, 1994, J SPEECH HEAR RES, V37, P449 KUK FK, 1990, SCAND AUDIOL, V19, P237, DOI 10.3109/01050399009070778 LEE LW, 1993, J ACOUST SOC AM, V93, P2879, DOI 10.1121/1.405807 LIM JS, 1983, SPEECH ENHANCEMENT LOEB GE, 1983, BIOL CYBERN, V47, P149, DOI 10.1007/BF00337005 MACLEOD A, 1990, British Journal of Audiology, V24, P29, DOI 10.3109/03005369009077840 Miller RL, 1997, J ACOUST SOC AM, V101, P3602, DOI 10.1121/1.418321 Moore B., 1998, COCHLEAR HEARING LOS Moore BCJ, 1997, AUDIT NEUROSCI, V3, P289 Moore B C, 2001, Trends Amplif, V5, P1, DOI 10.1177/108471380100500102 Moore BCJ, 1998, BRIT J AUDIOL, V32, P317, DOI 10.3109/03005364000000083 Moore B.C.J., 1995, PERCEPTUAL CONSEQUEN MOORE B C J, 1988, British Journal of Audiology, V22, P93, DOI 10.3109/03005368809077803 MOORE BCJ, 1995, BRIT J AUDIOL, V29, P131, DOI 10.3109/03005369509086590 Moore BCJ, 1996, J ACOUST SOC AM, V100, P481, DOI 10.1121/1.415861 Moore BCJ, 1999, J ACOUST SOC AM, V105, P400, DOI 10.1121/1.424571 Moore B C, 1991, Br J Audiol, V25, P171, DOI 10.3109/03005369109079851 Moore BCJ, 1997, INTRO PSYCHOL HEARIN MOORE BCJ, 1992, EAR HEARING, V13, P349 Moore BCJ, 2000, BRIT J AUDIOL, V34, P205 Moore BCJ, 2001, EAR HEARING, V22, P268, DOI 10.1097/00003446-200108000-00002 Nejime Y, 1997, J ACOUST SOC AM, V102, P603, DOI 10.1121/1.419733 NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 Noordhoek IM, 1997, J ACOUST SOC AM, V101, P498, DOI 10.1121/1.417993 PAVLOVIC CV, 1986, J ACOUST SOC AM, V80, P50, DOI 10.1121/1.394082 PAVLOVIC CV, 1984, J ACOUST SOC AM, V75, P1253, DOI 10.1121/1.390731 Peters RW, 1998, J ACOUST SOC AM, V103, P577, DOI 10.1121/1.421128 Pick G, 1977, PSYCHOPHYSICS PHYSL PLOMP R, 1979, AUDIOLOGY, V18, P43 PLOMP R, 1994, EAR HEARING, V15, P2 PLOMP R, 1988, J ACOUST SOC AM, 
V83, P2322, DOI 10.1121/1.396363 Ricketts T, 1999, Am J Audiol, V8, P117, DOI 10.1044/1059-0889(1999/018) ROSEN S, 1986, FREQUENCY SELECTIVIV Rosen S, 1999, J ACOUST SOC AM, V106, P3629, DOI 10.1121/1.428215 Shannon RV, 1998, J ACOUST SOC AM, V104, P2467, DOI 10.1121/1.423774 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 SIMPSON AM, 1990, ACTA OTO-LARYNGOL, P101 SMOORENBURG GF, 1992, J ACOUST SOC AM, V91, P421, DOI 10.1121/1.402729 SOEDE W, 1993, J ACOUST SOC AM, V94, P799, DOI 10.1121/1.408181 SRULOVICZ P, 1983, J ACOUST SOC AM, V73, P1266, DOI 10.1121/1.389275 Steinberg JC, 1937, J ACOUST SOC AM, V9, P11, DOI 10.1121/1.1915905 Stone MA, 1999, J ACOUST SOC AM, V106, P3603, DOI 10.1121/1.428213 STONE MA, 1992, BRIT J AUDIOL, V26, P351, DOI 10.3109/03005369209076659 TERKEURS M, 1992, J ACOUST SOC AM, V91, P2872, DOI 10.1121/1.402950 TERKEURS M, 1993, J ACOUST SOC AM, V93, P1547, DOI 10.1121/1.406813 THORNTON AR, 1980, J ACOUST SOC AM, V67, P638, DOI 10.1121/1.383888 TURNER C, 1983, J ACOUST SOC AM, V73, P966, DOI 10.1121/1.389022 TYLER RS, 1990, ACTA OTO-LARYNGOL, P224 TYLER RS, 1986, FREQUENCY SELECTIVIT van Buuren RA, 1999, J ACOUST SOC AM, V105, P2903, DOI 10.1121/1.426943 Vickers DA, 2001, J ACOUST SOC AM, V110, P1164, DOI 10.1121/1.1381534 VILLCHUR E, 1973, J ACOUST SOC AM, V53, P1646, DOI 10.1121/1.1913514 YOUNG ED, 1979, J ACOUST SOC AM, V66, P1381, DOI 10.1121/1.383532 YUND EW, 1995, J ACOUST SOC AM, V97, P1224, DOI 10.1121/1.412232 YUND EW, 1995, J ACOUST SOC AM, V97, P1206, DOI 10.1121/1.413093 NR 86 TC 19 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 81 EP 91 DI 10.1016/S0167-6393(02)00095-X PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900008 ER PT J AU Wong, SW Schreiner, CE AF Wong, SW Schreiner, CE TI Representation of CV-sounds in cat primary auditory cortex: intensity dependence SO SPEECH COMMUNICATION LA English DT Article DE neural activity; voice-onset time; distributed representation; auditory cortex ID SPEECH-EVOKED ACTIVITY; VOICE ONSET TIME; CONSONANT VOWEL SYLLABLES; COCHLEAR NUCLEUS; BACKGROUND-NOISE; SINGLE NEURONS; TONE INTENSITY; RESPONSES; NERVE; DISCRIMINATION AB The level-dependent representation of simple speech sounds in cat primary auditory cortex (AI) is explored in naive cats and in animals that have been exposed to these sounds in behavioral detection and discrimination tasks. Population analyses of multiple unit responses in the form of post-stimulus time histograms (PSTHs), neurograms, and spatial distribution were made for synthetic consonant-vowel sounds across AI. The temporal profile of cortical responses was robust across neurons, characterized by brief phasic responses at the onset of consonantal burst and voicing. The spectral profile of the sounds, i.e., the formant structure, was only weakly expressed in the response magnitude across characteristic frequency. The spatial response distribution across AI was discontinuous, and consisted of several patches of activation. Intensity-dependence in the spatial activity distribution was more strongly expressed than in population PSTHs and neurograms. Differences attributable to behavioral training were observed for rate-encoding and temporal encoding of speech sounds. (C) 2002 Elsevier Science B.V. All rights reserved.
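A minimal Python sketch of the post-stimulus time histogram (PSTH) computation behind such population analyses, with fabricated spike times and an assumed bin width: spikes are pooled over trials into fixed bins, and stacking the PSTHs of units ordered by characteristic frequency yields a neurogram-style display.

    import numpy as np

    def psth(trials, duration_s, bin_ms=5.0):
        """Trial-averaged firing rate (spikes/s) per time bin."""
        edges = np.arange(0.0, duration_s + 1e-9, bin_ms / 1000.0)
        counts = np.zeros(len(edges) - 1)
        for spike_times in trials:
            counts += np.histogram(spike_times, bins=edges)[0]
        return counts / (len(trials) * bin_ms / 1000.0), edges

    # Two fabricated trials with a phasic onset response near 10-15 ms:
    trials = [np.array([0.012, 0.015, 0.150]), np.array([0.011, 0.017])]
    rates, edges = psth(trials, duration_s=0.2)
    print(rates[:4])   # large values in the onset bins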
C1 Univ Calif San Francisco, Sloan Swartz Ctr Theoret Neurobiol, WM Keck Ctr Integrat Neurosci, Coleman Mem Lab, San Francisco, CA 94143 USA. RP Schreiner, CE (reprint author), Univ Calif San Francisco, Sloan Swartz Ctr Theoret Neurobiol, WM Keck Ctr Integrat Neurosci, Coleman Mem Lab, 513 Paranassus Ave,Box 0732, San Francisco, CA 94143 USA. EM chris@phy.ucsf.edu CR AHISSAR E, 1992, SCIENCE, V257, P1412, DOI 10.1126/science.1529342 BLACKBURN CC, 1990, J NEUROPHYSIOL, V63, P1191 Brosch M, 1997, J NEUROPHYSIOL, V77, P923 Brosch M, 2000, CEREB CORTEX, V10, P1155, DOI 10.1093/cercor/10.12.1155 BRUGGE JF, 1973, J NEUROPHYSIOL, V32, P1136 CALFORD MB, 1995, J NEUROPHYSIOL, V73, P1876 Chen GD, 1996, AUDIT NEUROSCI, V3, P179 DELGUTTE B, 1984, J ACOUST SOC AM, V75, P866, DOI 10.1121/1.390596 EGGERMONT JJ, 1995, J ACOUST SOC AM, V98, P911, DOI 10.1121/1.413517 Gehr DD, 2000, HEARING RES, V150, P27, DOI 10.1016/S0378-5955(00)00170-2 HEIL P, 1994, HEARING RES, V76, P188, DOI 10.1016/0378-5955(94)90099-X Kowalski N, 1996, J NEUROPHYSIOL, V76, P3503 KUHL PK, 1994, CURR OPIN NEUROBIOL, V4, P812, DOI 10.1016/0959-4388(94)90128-7 KUHL PK, 1981, J ACOUST SOC AM, V70, P340, DOI 10.1121/1.386782 May BJ, 1998, J NEUROPHYSIOL, V79, P1755 MERZENICH MM, 1990, NEURAL ARTIFICIAL PA, P177 Merzenich M.M., 1988, NEUROBIOLOGY NEOCORT, P41 Nagarajan SS, 2002, J NEUROPHYSIOL, V87, P1723, DOI 10.1152/jn.00632.2001 PHILLIPS DP, 1985, HEARING RES, V18, P73, DOI 10.1016/0378-5955(85)90111-X PHILLIPS DP, 1994, EXP BRAIN RES, V102, P210 RECANZONE GH, 1993, J NEUROSCI, V13, P87 RECANZONE GH, 1992, J NEUROPHYSIOL, V67, P1071 SACHS MB, 1979, J ACOUST SOC AM, V66, P470, DOI 10.1121/1.383098 SACHS MB, 1983, J NEUROPHYSIOL, V50, P27 SCHREINER CE, 1992, EXP BRAIN RES, V92, P105 Schreiner CE, 1994, AUDIT NEUROSCI, V1, P39 Schreiner CE, 2000, ANNU REV NEUROSCI, V23, P501, DOI 10.1146/annurev.neuro.23.1.501 SINEX DG, 1983, J ACOUST SOC AM, V73, P602, DOI 10.1121/1.389007 STEINSCHNEIDER M, 1994, ELECTROEN CLIN NEURO, V92, P30, DOI 10.1016/0168-5597(94)90005-1 STEINSCHNEIDER M, 1990, BRAIN RES, V519, P158, DOI 10.1016/0006-8993(90)90074-L STEINSCHNEIDER M, 1995, BRAIN RES, V674, P147, DOI 10.1016/0006-8993(95)00008-E STEINSCHNEIDER M, 1982, BRAIN RES, V252, P353, DOI 10.1016/0006-8993(82)90403-6 SUTTER ML, 1995, J NEUROPHYSIOL, V73, P190 Wang XQ, 2000, P NATL ACAD SCI USA, V97, P11843, DOI 10.1073/pnas.97.22.11843 WANG XQ, 1994, J NEUROPHYSIOL, V71, P59 Wang XQ, 1995, J NEUROPHYSIOL, V74, P2685 Wang XQ, 2001, J NEUROPHYSIOL, V86, P2616 NR 37 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 93 EP 106 DI 10.1016/S0167-6393(02)00096-1 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900009 ER PT J AU Wang, XQ Lu, T Liang, L AF Wang, XQ Lu, T Liang, L TI Cortical processing of temporal modulations SO SPEECH COMMUNICATION LA English DT Article DE auditory cortex; temporal processing; temporal integration; amplitude modulation; frequency modulation; temporal asymmetry; species-specific vocalization ID PRIMARY AUDITORY-CORTEX; ANTEROVENTRAL COCHLEAR NUCLEUS; SINGLE-FORMANT STIMULI; STEADY-STATE VOWELS; DISCHARGE PATTERNS; NERVE FIBERS; PHYSIOLOGICAL-MECHANISMS; NEURAL REPRESENTATIONS; AMPLITUDE-MODULATION; COMMON MARMOSET AB Temporal modulations are fundamental components of human speech and animal communication sounds. 
Understanding their representations in the auditory cortex is a crucial step towards our understanding of brain mechanisms underlying speech processing. While modulated signals have long been used as experimental stimuli, their cortical representations are not completely understood, particularly for rapid modulations. Known physiological data do not adequately explain psychophysical observations on the perception of rapid modulations, largely due to slow stimulus-synchronized temporal discharge patterns of cortical neurons. In this article, we summarize recent findings from our laboratory on temporal processing mechanisms in the auditory cortex. These findings show that the auditory cortex represents slow modulations explicitly using a temporal code and fast modulations implicitly by a discharge rate code. Rapidly modulated signals within a short-time window (~20-30 ms) are integrated and transformed into a discharge rate-based representation. The findings also indicate that there is a shared representation of temporal modulations by cortical neurons that encodes the temporal profile embedded in complex sounds of various spectral contents. Our results suggest that cortical processing of sound streams operates on a "segment-by-segment" basis with a temporal integration window on the order of ~20-30 ms. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Johns Hopkins Univ, Sch Med, Dept Biomed Engn, Lab Auditory Neurophysiol, Baltimore, MD 21205 USA. First Med Univ, Pear River Hosp, Hearing Ctr, Guangzhou 510282, Guangdong, Peoples R China. RP Wang, XQ (reprint author), Johns Hopkins Univ, Sch Med, Dept Biomed Engn, Lab Auditory Neurophysiol, 720 Rutland Ave,Ross 424, Baltimore, MD 21205 USA. EM xwang@bme.jhu.edu CR AGAMAITE JA, 1997, ASS RES OT ABSTR, V20, P144 AITKIN L, 1993, PROG NEUROBIOL, V41, P345, DOI 10.1016/0301-0082(93)90004-C AKEROYD MA, 1995, J ACOUST SOC AM, V98, P2466, DOI 10.1121/1.414462 Bieser A, 1996, EXP BRAIN RES, V108, P273 BLACKBURN CC, 1990, J NEUROPHYSIOL, V63, P1191 BLACKBURN CC, 1989, J NEUROPHYSIOL, V62, P1303 Buonomano DV, 1998, ANNU REV NEUROSCI, V21, P149, DOI 10.1146/annurev.neuro.21.1.149 CREUTZFELDT O, 1980, EXP BRAIN RES, V39, P87 De Ribaupierre F, 1972, Brain Res, V48, P205, DOI 10.1016/0006-8993(72)90179-5 DERIBAUPIERRE F, 1980, HEARING RES, V3, P65, DOI 10.1016/0378-5955(80)90008-8 EGGERMONT JJ, 1991, HEARING RES, V56, P153, DOI 10.1016/0378-5955(91)90165-6 EGGERMONT JJ, 1994, HEARING RES, V74, P51, DOI 10.1016/0378-5955(94)90175-9 Esser KH, 1997, P NATL ACAD SCI USA, V94, P14019, DOI 10.1073/pnas.94.25.14019 EVANS EF, 1964, J PHYSIOL-LONDON, V171, P476 FRISINA RD, 1990, HEARING RES, V44, P99, DOI 10.1016/0378-5955(90)90074-Y GAESE BH, 1995, EUR J NEUROSCI, V7, P438, DOI 10.1111/j.1460-9568.1995.tb00340.x GOLDBERG JM, 1969, J NEUROPHYSIOL, V32, P613 GOLDSTEIN MH, 1959, J ACOUST SOC AM, V31, P356, DOI 10.1121/1.1907724 HOUTGAST T, 1973, ACUSTICA, V28, P66 JOHNSON DH, 1980, J ACOUST SOC AM, V68, P1115, DOI 10.1121/1.384982 JORIS PX, 1992, J ACOUST SOC AM, V91, P215, DOI 10.1121/1.402757 Krumbholz K, 2000, J ACOUST SOC AM, V108, P1170, DOI 10.1121/1.1287843 LANGNER G, 1988, J NEUROPHYSIOL, V60, P1799 LIANG L, 1999, SOC NEUR ABSTR, V29 Liang L, 2002, J NEUROPHYSIOL, V87, P2237, DOI 10.1152/jn.00834.2001 Lu T, 2001, J NEUROPHYSIOL, V85, P2364 Lu T, 2001, NAT NEUROSCI, V4, P1131, DOI 10.1038/nn737 Lu T, 2000, J NEUROPHYSIOL, V84, P236 MARDIN KV, 2000, DIRECTIONAL STAT MARGOLIASH D, 1983, J NEUROSCI, V3, P1039 MERZENICH MM, 1984, J COMP
NEUROL, V224, P591, DOI 10.1002/cne.902240408 PALMER AR, 1982, ARCH OTO-RHINO-LARYN, V236, P197, DOI 10.1007/BF00454039 PATTERSON RD, 1994, J ACOUST SOC AM, V96, P1409, DOI 10.1121/1.410285 PATTERSON RD, 1994, J ACOUST SOC AM, V96, P1419, DOI 10.1121/1.410286 Pressnitzer D, 2001, J ACOUST SOC AM, V109, P2074, DOI 10.1121/1.1359797 Rall W., 1998, METHODS NEURONAL MOD ROSEN S, 1992, PHILOS T ROY SOC B, V336, P367, DOI 10.1098/rstb.1992.0070 SACHS MB, 1979, J ACOUST SOC AM, V66, P470, DOI 10.1121/1.383098 SACHS MB, 1992, AUDITORY PROCESSING, P47 SCHREINER CE, 1988, HEARING RES, V32, P49, DOI 10.1016/0378-5955(88)90146-3 Schreiner CE, 2000, ANNU REV NEUROSCI, V23, P501, DOI 10.1146/annurev.neuro.23.1.501 WANG XQ, 1995, NATURE, V378, P71, DOI 10.1038/378071a0 WANG XQ, 1993, J NEUROPHYSIOL, V70, P1054 Wang XQ, 2000, P NATL ACAD SCI USA, V97, P11843, DOI 10.1073/pnas.97.22.11843 WANG XQ, 1994, J NEUROPHYSIOL, V71, P59 Wang XQ, 1995, J NEUROPHYSIOL, V74, P2685 Wang XQ, 2001, J NEUROPHYSIOL, V86, P2616 WANG XQ, 1995, J NEUROPHYSIOL, V73, P1600 WHITFIEL.IC, 1965, J NEUROPHYSIOL, V28, P655 YOUNG ED, 1979, J ACOUST SOC AM, V66, P1381, DOI 10.1121/1.383532 ZURITA P, 1994, NEUROSCI RES, V19, P303, DOI 10.1016/0168-0102(94)90043-4 Zwicker E., 1999, PSYCHOACOUSTICS NR 52 TC 41 Z9 42 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 107 EP 121 DI 10.1016/S0167-6393(02)00097-3 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900010 ER PT J AU Heil, P AF Heil, P TI Coding of temporal onset envelope in the auditory system SO SPEECH COMMUNICATION LA English DT Article DE auditory cortex; transient; temporal coding; temporal envelope; latency ID GERBIL MERIONES-UNGUICULATUS; SINGLE NEURONS; RISE TIME; SPEECH RECOGNITION; MODULATED STIMULI; POSTERIOR FIELD; TONE INTENSITY; NERVE FIBERS; CORTEX; RESPONSES AB The details of temporal envelopes provide important information in speech. Here I show how response latency, precision of response timing, and response magnitude of auditory neurons to tones shaped with different temporal envelopes depend on the dynamics of those envelopes at onset. The joint consideration of these response parameters, and of the stimulus properties on which they depend, suggests a point-by-point sampling, or tracking, mechanism for the onset envelope. This mechanism is characterized by a sampling rate and precision of spike timing that are both automatically adjusted to the rapidity of the temporal envelope, which is altered, for example, by changes in a signal's sound pressure level (SPL). The relative resolution of amplitude remains rather constant with changes in SPL. Each temporal envelope will evoke a unique spatiotemporal response pattern that involves both the tonotopic and the isofrequency axes of cortical maps. Such a mechanism could provide a temporal resolution of the time course of the onset envelope which is likely orders of magnitude higher than that inferred from the phase-locking capabilities of neurons in cortical fields to periodic signals. Observations on auditory evoked potentials and magnetic fields show that stimulus-response relationships similar to those seen in individual neurons of anesthetized animals exist in the auditory system of awake humans. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Leibniz Inst Neurobiol, D-39118 Magdeburg, Germany. 
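A minimal Python sketch of vector strength, the standard measure behind statements about stimulus-synchronized discharge and phase-locking in the two records above; the spike trains here are simulated, not recorded data. Values near 1 indicate spikes locked to the modulation period (a temporal code); values near 0 imply the modulation can only be carried by discharge rate.

    import numpy as np

    def vector_strength(spike_times_s, mod_freq_hz):
        # Project each spike onto the unit circle at its stimulus phase.
        phases = 2.0 * np.pi * mod_freq_hz * np.asarray(spike_times_s)
        return np.hypot(np.cos(phases).sum(), np.sin(phases).sum()) / len(phases)

    rng = np.random.default_rng(1)
    fm = 10.0                                          # 10 Hz modulation
    locked = np.arange(50) / fm + rng.normal(0.0, 0.002, 50)
    print(vector_strength(locked, fm))                 # ~1: phase-locked
    print(vector_strength(rng.uniform(0.0, 5.0, 50), fm))  # ~0: unlocked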
RP Heil, P (reprint author), Leibniz Inst Neurobiol, Brenneckestr 6, D-39118 Magdeburg, Germany. EM peter.heil@ifn-magdeburg.de CR Biermann S, 2000, J NEUROPHYSIOL, V84, P2426 CUTTING JE, 1974, PERCEPT PSYCHOPHYS, V16, P564, DOI 10.3758/BF03198588 DARWIN CJ, 1981, Q J EXP PSYCHOL-A, V33, P185 DELGUTTE B, 1984, J ACOUST SOC AM, V75, P897, DOI 10.1121/1.390599 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 EGGERMONT JJ, 1993, HEARING RES, V65, P75 Fishbach A, 2001, J NEUROPHYSIOL, V85, P2303 GOOLER DM, 1992, J NEUROPHYSIOL, V67, P1 GREY JM, 1978, J ACOUST SOC AM, V63, P1493, DOI 10.1121/1.381843 HALL JC, 1988, HEARING RES, V36, P261, DOI 10.1016/0378-5955(88)90067-6 HALL JC, 1991, J NEUROPHYSIOL, V66, P955 Heil P, 1996, NEUROREPORT, V7, P3073, DOI 10.1097/00001756-199611250-00056 Heil P, 2001, J NEUROSCI, V21, P7404 HEIL P, 2001, NATO ADV STUDI I, P189 Heil P, 1998, CEREB CORTEX, V8, P125, DOI 10.1093/cercor/8.2.125 Heil P, 1998, J NEUROPHYSIOL, V79, P3041 Heil P, 1997, J NEUROPHYSIOL, V77, P2616 HEIL P, 1994, HEARING RES, V76, P188, DOI 10.1016/0378-5955(94)90099-X Heil P, 1997, J NEUROPHYSIOL, V77, P2642 Heil P, 1997, J NEUROPHYSIOL, V78, P2438 KRUMHANSL CL, 1992, J EXP PSYCHOL HUMAN, V18, P739, DOI 10.1037//0096-1523.18.3.739 Langner G, 1997, J COMP PHYSIOL A, V181, P665, DOI 10.1007/s003590050148 LANGNER G, 1992, HEARING RES, V60, P115, DOI 10.1016/0378-5955(92)90015-F Lu T, 2001, NAT NEUROSCI, V4, P1131, DOI 10.1038/nn737 MACK M, 1983, J ACOUST SOC AM, V73, P1739, DOI 10.1121/1.389398 ONISHI S, 1968, J ACOUST SOC AM, V44, P582, DOI 10.1121/1.1911124 PANTEV C, 1989, ELECTROEN CLIN NEURO, V72, P225, DOI 10.1016/0013-4694(89)90247-2 Paquette C, 1997, CORTEX, V33, P689, DOI 10.1016/S0010-9452(08)70726-3 PHILLIPS DP, 1987, EXP BRAIN RES, V67, P479 PHILLIPS DP, 1988, J NEUROPHYSIOL, V59, P1524 PHILLIPS DP, 1995, J NEUROPHYSIOL, V73, P674 PHILLIPS DP, 1989, J ACOUST SOC AM, V85, P2537, DOI 10.1121/1.397748 PHILLIPS DP, 1990, BEHAV BRAIN RES, V37, P197, DOI 10.1016/0166-4328(90)90132-X PITT MA, 1992, J EXP PSYCHOL HUMAN, V18, P728, DOI 10.1037//0096-1523.18.3.728 SCHARF B, 1986, HDB PERCEPTION HUMAN SCHREINER CE, 1988, HEARING RES, V32, P49, DOI 10.1016/0378-5955(88)90146-3 SCHREINER CE, 1992, EXP BRAIN RES, V92, P105 Schulze H, 1997, J COMP PHYSIOL A, V181, P651, DOI 10.1007/s003590050147 SEMPLE MN, 1993, J NEUROPHYSIOL, V69, P449 Shannon RV, 1998, J ACOUST SOC AM, V104, P2467, DOI 10.1121/1.423774 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 SMITH RL, 1980, PSYCHOPHYSICAL PHYSL, P312 SUGA N, 1971, J PHYSIOL-LONDON, V217, P159 THOMAS H, 1993, EUR J NEUROSCI, V5, P882, DOI 10.1111/j.1460-9568.1993.tb00940.x NR 44 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD AUG PY 2003 VL 41 IS 1 BP 123 EP 134 DI 10.1016/S0167-6393(02)00099-7 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900011 ER PT J AU Winter, IM Palmer, AR Wiegrebe, L Patterson, RD AF Winter, IM Palmer, AR Wiegrebe, L Patterson, RD TI Temporal coding of the pitch of complex sounds by presumed multipolar cells in the ventral cochlear nucleus SO SPEECH COMMUNICATION LA English DT Article DE fundamental frequency; autocorrelation; iterated rippled noise; onset units; chopper units ID ITERATED RIPPLED NOISE; AUDITORY-NERVE FIBERS; AMPLITUDE-MODULATION; GUINEA-PIG; DISCHARGE PATTERNS; SINGLE UNITS; REPRESENTATION; RESPONSES; TONES; CAT AB Extensive studies of the encoding of fundamental frequency (f(0)) in the auditory nerve indicate that f(0) can be represented by either the timing of the neuronal discharges or the mean discharge rate as a function of characteristic frequency. It is therefore of considerable interest to examine what happens to this information at the next level of the auditory pathway, the cochlear nucleus. Both physiologically and anatomically the cochlear nucleus is considerably more heterogeneous than the auditory nerve. There are two main cell types in the ventral division of the cochlear nucleus: bushy and multipolar. Bushy cells give rise to primary-like responses whereas multipolar cells may be characterised by either onset or chopper type responses. Physiological studies have suggested that onset and chopper units may be good at representing the f(0) of complex sounds in their temporal discharge properties. However, in these studies the pitch-producing sounds were usually characterised by highly modulated envelopes and it was not possible to tell if the units were simply responding to the modulation or the temporal fine structure. In this paper we examine the ability of onset and chopper units to encode the f(0) of complex sounds when the modulation cue has been greatly reduced. These stimuli were steady-state vowels in the presence of background noise, and iterated rippled noise (IRN). The response of onset units to the vowel f(0) in the presence of background noise varied, but many still maintained a strong response. In contrast, the majority of chopper units showed a greater reduction in their response to vowel f(0) in the presence of background noise. In keeping with the vowel study, the responses of both types of unit to the delay of the IRN were reduced in comparison with their response to more highly modulated stimuli. Increasing anatomical, pharmacological and physiological evidence would seem to argue against onset units playing a direct role in pitch perception. However, some units, identified as sustained choppers, may be able to represent the pitch of complex sounds in their temporal discharges. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Ctr Neural Basis Hearing, Dept Physiol, Cambridge CB2 3EG, England. Univ Nottingham, MRC, Inst Hearing Res, Nottingham NG7 2RD, England. Univ Munich, Inst Zool, D-80333 Munich, Germany. RP Winter, IM (reprint author), Ctr Neural Basis Hearing, Dept Physiol, Downing St, Cambridge CB2 3EG, England.
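A minimal Python sketch of the standard delay-and-add construction of iterated rippled noise (IRN); the sampling rate, delay, gain, and iteration count are illustrative, not the authors' stimulus parameters. Each pass adds a copy of the noise delayed by d samples, producing a pitch at fs/d with only weak envelope modulation, visible as an autocorrelation peak at the delay.

    import numpy as np

    def irn(n_samples, delay, iterations, gain=1.0, seed=0):
        x = np.random.default_rng(seed).normal(size=n_samples)
        for _ in range(iterations):
            y = x.copy()
            y[delay:] += gain * x[:-delay]   # delay-and-add network
            x = y
        return x

    fs, d = 16000, 100                       # pitch at fs/d = 160 Hz
    x = irn(4000, d, iterations=8)
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # lags 0..N-1
    print(np.argmax(ac[50:400]) + 50)        # peak near the 100-sample delay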
CR BLACKBURN CC, 1989, J NEUROPHYSIOL, V62, P1303 Bourk TR, 1976, THESIS MIT CAMBRIDGE Cariani PA, 1996, J NEUROPHYSIOL, V76, P1717 Cariani PA, 1996, J NEUROPHYSIOL, V76, P1698 Doucet JR, 1997, J COMP NEUROL, V385, P245 Doucet JR, 1999, J COMP NEUROL, V408, P515 EVANS EF, 1998, PSYCHOPHYSICAL PHYSL, P186 Frisina RD, 1996, J ACOUST SOC AM, V99, P475, DOI 10.1121/1.414559 FRISINA RD, 1990, HEARING RES, V44, P99, DOI 10.1016/0378-5955(90)90074-Y Griffiths TD, 1998, NAT NEUROSCI, V1, P422, DOI 10.1038/1637 HEWITT MJ, 1994, J ACOUST SOC AM, V95, P2145, DOI 10.1121/1.408676 HOUTSMA AJM, 1990, J ACOUST SOC AM, V87, P304, DOI 10.1121/1.399297 Jiang D, 1996, J NEUROPHYSIOL, V75, P380 KIM DO, 1988, BASIC ISSUES HEARING, P252 KIM DO, 1990, HEARING RES, V45, P95, DOI 10.1016/0378-5955(90)90186-S KIM DO, 1986, AUDITORY FREQUENCY S, P281 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 Krumbholz K, 2001, J ACOUST SOC AM, V110, P2096, DOI 10.1121/1.1395583 Krumbholz K, 2000, J ACOUST SOC AM, V108, P1170, DOI 10.1121/1.1287843 MILLER MI, 1984, HEARING RES, V14, P257, DOI 10.1016/0378-5955(84)90054-6 Palmer A R, 1996, Audiol Neurootol, V1, P12 PALMER AR, 1986, HEARING RES, V24, P1, DOI 10.1016/0378-5955(86)90002-X PALMER AR, 1990, J ACOUST SOC AM, V88, P1412, DOI 10.1121/1.400329 PALMER AR, 1993, NATO ADV SCI INST SE, V239, P373 PALMER AR, 1992, ADV BIOSCI, V83, P231 Palmer AR, 1996, J NEUROPHYSIOL, V75, P780 Patterson RD, 1996, J ACOUST SOC AM, V100, P3286, DOI 10.1121/1.417212 RHODE WS, 1994, HEARING RES, V77, P43, DOI 10.1016/0378-5955(94)90252-6 RHODE WS, 1986, J NEUROPHYSIOL, V56, P261 RHODE WS, 1983, J COMP NEUROL, V213, P448, DOI 10.1002/cne.902130408 RHODE WS, 1994, J NEUROPHYSIOL, V71, P1797 RHODE WS, 1995, J ACOUST SOC AM, V97, P2414, DOI 10.1121/1.411963 Shofner WP, 1999, J NEUROPHYSIOL, V81, P2662 SHOFNER WP, 1991, J ACOUST SOC AM, V90, P2450, DOI 10.1121/1.402049 SMITH PH, 1989, J COMP NEUROL, V282, P595, DOI 10.1002/cne.902820410 Wiegrebe L, 2001, J NEUROPHYSIOL, V85, P1206 Wiegrebe L, 1999, HEARING RES, V132, P94, DOI 10.1016/S0378-5955(99)00040-4 WINTER IM, 1990, HEARING RES, V44, P161, DOI 10.1016/0378-5955(90)90078-4 Winter IM, 2001, J PHYSIOL-LONDON, V537, P553, DOI 10.1111/j.1469-7793.2001.00553.x WINTER IM, 1995, J NEUROPHYSIOL, V73, P141 Yost WA, 1996, J ACOUST SOC AM, V99, P1066, DOI 10.1121/1.414593 Yost WA, 1996, J ACOUST SOC AM, V100, P3329, DOI 10.1121/1.416973 Yost WA, 1996, J ACOUST SOC AM, V100, P511, DOI 10.1121/1.415873 YOUNG ED, 1985, J NEUROPHYSIOL, V60, P1 YOUNG ED, 1979, J ACOUST SOC AM, V66, P1381, DOI 10.1121/1.383532 NR 45 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 135 EP 149 DI 10.1016/S0167-6393(02)00098-5 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900012 ER PT J AU Covey, E AF Covey, E TI Brainstem mechanisms for analyzing temporal patterns of echolocation sounds: a model for understanding early stages of speech processing? 
SO SPEECH COMMUNICATION LA English DT Article ID BIG BROWN BAT; MEDIAL SUPERIOR OLIVE; AMPLITUDE-MODULATED STIMULI; INFERIOR COLLICULUS; EPTESICUS-FUSCUS; MOUSTACHED BAT; LATERAL LEMNISCUS; RESPONSE PROPERTIES; COCHLEAR NUCLEUS; PURE-TONES AB Because of their stereotyped audio-vocal behavior and highly accessible brainstem circuitry, echolocating bats provide a good model system in which to study the neural mechanisms that underlie the analysis of temporal features of sound. This paper reviews the lower brainstem auditory circuitry and describes selected forms of information processing that are performed in the pathways of the lower brainstem and the auditory midbrain (inferior colliculus, IC). Several examples of neural circuits in echolocating bats point out the ways in which inputs with different properties converge on IC neurons to create selectivity for specific temporal features of sound that are common to speech and echolocation. The initial transformations of auditory nerve input that occur in the lower brainstem pathways include a change in sign from excitatory input to inhibitory output, changes in discharge pattern, and the creation of delay lines. Convergence of multiple inputs on neurons in the IC produces tuning for temporal features of sound, including duration, the direction of frequency sweeps, modulation rate, and interstimulus interval. The auditory cortex exerts control over some of this processing by sharpening or shifting neuronal filter properties. The computational processes that occur in the IC result in integration across a time scale that is consistent with the rate at which biological sounds are produced, whether they be echolocation signals or human speech components. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Washington, Dept Psychol, Seattle, WA 98195 USA. RP Covey, E (reprint author), Univ Washington, Dept Psychol, Box 351525, Seattle, WA 98195 USA.
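One of the convergence mechanisms reviewed above, duration tuning in the IC, is often described as a coincidence between delayed onset-evoked excitation and offset-evoked rebound excitation. The toy Python sketch below illustrates only that arithmetic; the delay and window values are invented for illustration, not measurements from the review.

# Toy sketch of band-pass duration tuning as onset/offset coincidence:
# delayed excitation from stimulus onset must coincide (within a small
# window) with rebound excitation at stimulus offset. All numbers are
# invented for illustration.
def fires(duration_ms, onset_delay_ms=20.0, window_ms=2.0):
    # Onset excitation arrives at t = onset_delay_ms; offset rebound
    # arrives at t = duration_ms. Coincidence -> spike, so the cell's
    # best duration equals the onset delay.
    return abs(onset_delay_ms - duration_ms) <= window_ms

for dur in (5, 10, 19, 20, 21, 30, 40):
    print("%2d ms stimulus -> %s" % (dur, "spike" if fires(dur) else "no spike"))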
EM ecovey@u.washington.edu CR Brand A, 2000, J NEUROPHYSIOL, V84, P1790 CASSEDAY JH, 1995, NEURAL REPRESENTATIO, P25 Casseday JH, 1997, J NEUROPHYSIOL, V77, P1595 CASSEDAY JH, 1992, J COMP NEUROL, V319, P34, DOI 10.1002/cne.903190106 Casseday JH, 2000, J NEUROPHYSIOL, V84, P1475 CASSEDAY JH, 1994, SCIENCE, V264, P847, DOI 10.1126/science.8171341 Casseday JH, 1996, BRAIN BEHAV EVOLUT, V47, P311, DOI 10.1159/000113249 Chen GD, 1998, HEARING RES, V122, P142, DOI 10.1016/S0378-5955(98)00103-8 CONDON CJ, 1991, J COMP PHYSIOL A, V168, P709 Covey E, 1996, J NEUROSCI, V16, P3009 COVEY E, 1995, SPRINGER HDB AUDITOR, V11, P235 COVEY E, 1986, J NEUROSCI, V6, P2926 COVEY E, 1991, J NEUROSCI, V11, P3456 COVEY E, 1987, J COMP NEUROL, V263, P179, DOI 10.1002/cne.902630203 COVEY E, 1993, NATO ADV SCI INST SE, V239, P321 Ehrlich D, 1997, J NEUROPHYSIOL, V77, P2360 FAINGOLD CL, 1991, HEARING RES, V52, P201, DOI 10.1016/0378-5955(91)90200-S FAURE PA, 2001, ASS RES OT ABSTR, V24, P54 FENG AS, 1990, PROG NEUROBIOL, V34, P313, DOI 10.1016/0301-0082(90)90008-5 Ferragamo MJ, 1998, J COMP PHYSIOL A, V182, P65 Fubara BM, 1996, J COMP NEUROL, V369, P83 FUZESSERY ZM, 1994, J NEUROPHYSIOL, V72, P1061 Fuzessery ZM, 1996, J NEUROPHYSIOL, V76, P1059 GOLDBERG JAY M., 1968, J NEUROPHYSIOL, V31, P639 GOOLER DM, 1992, J NEUROPHYSIOL, V67, P1 GROTHE B, 1994, J NEUROPHYSIOL, V71, P706 Grothe B, 2001, J NEUROPHYSIOL, V86, P2219 GROTHE B, 1992, P NATL ACAD SCI USA, V89, P5108, DOI 10.1073/pnas.89.11.5108 Grothe B, 1997, J NEUROPHYSIOL, V77, P1553 HALL JC, 1991, J NEUROPHYSIOL, V66, P955 HAPLEA S, 1994, J COMP PHYSIOL A, V174, P671 Hattori T, 1997, J COMP PHYSIOL A, V180, P271, DOI 10.1007/s003590050047 Huffman RF, 1998, HEARING RES, V126, P161, DOI 10.1016/S0378-5955(98)00165-8 KALKO EKV, 1989, BEHAV ECOL SOCIOBIOL, V24, P225, DOI 10.1007/BF00295202 Kuwada S., 1987, P146 Kuwada S, 1997, J NEUROSCI, V17, P7565 Lu Y, 1997, J COMP PHYSIOL A, V181, P331, DOI 10.1007/s003590050119 MANIS PB, 1990, J NEUROSCI, V10, P2338 NARINS PM, 1980, BRAIN BEHAV EVOLUT, V17, P48, DOI 10.1159/000121790 NEUWEILER G, 1990, PHYSIOL REV, V70, P615 Oertel D, 1991, Curr Opin Neurobiol, V1, P221, DOI 10.1016/0959-4388(91)90082-I PARK TJ, 1993, J NEUROSCI, V13, P5172 PINHEIRO AD, 1991, J COMP PHYSIOL A, V169, P69 POLLAK GD, 1993, HEARING RES, V65, P99, DOI 10.1016/0378-5955(93)90205-F POTTER HD, 1965, J NEUROPHYSIOL, V28, P1155 ROSE GJ, 1995, NEURAL REPRESENTATIO, P1 ROSS LS, 1988, J COMP NEUROL, V270, P488, DOI 10.1002/cne.902700403 SAITOH I, 1995, J NEUROPHYSIOL, V74, P1 SCHNITZLER HU, 1987, J COMP PHYSIOL A, V161, P267, DOI 10.1007/BF00615246 SCHULLER G, 1991, EUR J NEUROSCI, V3, P648, DOI 10.1111/j.1460-9568.1991.tb00851.x SIMMONS JA, 1989, COGNITION, V33, P155, DOI 10.1016/0010-0277(89)90009-7 SUGA N, 1965, J PHYSIOL-LONDON, V179, P26 SUGA N, 1968, J PHYSIOL-LONDON, V198, P51 Suga N., 1973, BASIC MECH HEARING, P675 SUGA N, 1972, AUDIOLOGY, V11, P58 SUGA N, 1969, J PHYSIOL-LONDON, V200, P555 Trussell LO, 1999, ANNU REV PHYSIOL, V61, P477, DOI 10.1146/annurev.physiol.61.1.477 Vater M, 1997, CELL TISSUE RES, V289, P223, DOI 10.1007/s004410050869 VATER M, 1995, J COMP NEUROL, V351, P632, DOI 10.1002/cne.903510411 WENTHOLD RJ, 1986, BRAIN RES, V380, P7, DOI 10.1016/0006-8993(86)91423-X WENTHOLD RJ, 1987, NEUROSCIENCE, V22, P897, DOI 10.1016/0306-4522(87)92968-X Yan J, 1996, SCIENCE, V273, P1100, DOI 10.1126/science.273.5278.1100 Yan J, 1996, HEARING RES, V93, P102, DOI 10.1016/0378-5955(95)00209-X Yang LC, 1997, J NEUROPHYSIOL, V77, P324 YIN TCT, 1990, J 
NEUROPHYSIOL, V64, P465 Zhang Y, 2000, HEARING RES, V147, P92, DOI 10.1016/S0378-5955(00)00123-4 Zhang YF, 1997, NATURE, V387, P900 Zhang YF, 1997, J NEUROPHYSIOL, V78, P3489 ZOOK JM, 1987, J COMP NEUROL, V261, P347, DOI 10.1002/cne.902610303 NR 69 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 151 EP 163 DI 10.1016/S0167-6393(02)00100-0 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900013 ER PT J AU Margoliash, D AF Margoliash, D TI Offline learning and the role of autogenous speech: new suggestions from birdsong research SO SPEECH COMMUNICATION LA English DT Article DE evolution of speech; birdsong; autogenous speech; sleep and learning ID PRIMARY AUDITORY-CORTEX; ADULT ZEBRA FINCHES; WHITE-CROWNED SPARROW; VOCAL CONTROL PATHWAYS; SONG SYSTEM; TAENIOPYGIA-GUTTATA; MELOPSITTACUS-UNDULATUS; ASSOCIATIONAL MODEL; DYNAMIC SPECTRA; CONTROL NUCLEUS AB Vocal learning in humans is isolated in evolution; nevertheless, the comparative approach can help give insight into mechanisms of speech. Birdsong and speech learning share similar computational problems and may share similar neural mechanisms. In birds, there is a well-known sensory representation of the individual's own song. A similar representation is postulated for speech, and the implications are explored. Auditory responses to own song are observed in an anterior forebrain pathway that is implicated in song recognition. By analogy, speech perception may be referenced to an internal representation of own voice. This also predicts sex differences in speech perception. In awake birds, auditory responses are reduced or absent in motor pathways, which argues against a theory of perception by reference to production. Recent evidence demonstrates that auditory responses in birds are much stronger while animals sleep, and that during sleep there is replay of song motor patterns. If humans exhibit similar phenomena, then there should be circadian patterns in speech production, and possibly in speech perception. Humans should exhibit strong responses (e.g., ERP) to playback of speech during sleep when playback is matched to the perception of own speech. Sleep should be important for speech maintenance and learning. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Chicago, Dept Organismal Biol & Anat, Chicago, IL 60637 USA. RP Margoliash, D (reprint author), Univ Chicago, Dept Organismal Biol & Anat, 1025 E 57th St, Chicago, IL 60637 USA. EM dan@bigbird.uchicago.edu CR BAPTISTA LF, 1986, ANIM BEHAV, V34, P1359, DOI 10.1016/S0003-3472(86)80207-X Binder JR, 1997, J NEUROSCI, V17, P353 BOCO T, 2001, SOC NEUR ABSTR BOTTJER SW, 1984, SCIENCE, V224, P901, DOI 10.1126/science.6719123 Brainard MS, 2000, NATURE, V404, P762, DOI 10.1038/35008083 Brenowitz EA, 1997, J NEUROBIOL, V33, P495, DOI 10.1002/(SICI)1097-4695(19971105)33:5<495::AID-NEU1>3.3.CO;2-D Brooks D.
R., 1991, PHYLOGENY ECOLOGY BE Buzsaki G, 1998, J SLEEP RES, V7, P17, DOI 10.1046/j.1365-2869.7.s1.3.x Cantalupo C, 2001, NATURE, V414, P505, DOI 10.1038/35107134 CARR CE, 1990, J NEUROSCI, V10, P3227 CARR CE, 1988, P NATL ACAD SCI USA, V85, P8311, DOI 10.1073/pnas.85.21.8311 Chew SJ, 1996, P NATL ACAD SCI USA, V93, P1950, DOI 10.1073/pnas.93.5.1950 Cynx J, 1998, ANIM BEHAV, V56, P107, DOI 10.1006/anbe.1998.0746 CYNX J, 1992, P NATL ACAD SCI USA, V89, P1372, DOI 10.1073/pnas.89.4.1372 Dave AS, 1998, SCIENCE, V282, P2250, DOI 10.1126/science.282.5397.2250 Dave AS, 2000, SCIENCE, V290, P812, DOI 10.1126/science.290.5492.812 Doupe AJ, 1999, ANNU REV NEUROSCI, V22, P567, DOI 10.1146/annurev.neuro.22.1.567 DOUPE AJ, 1991, P NATL ACAD SCI USA, V88, P11339, DOI 10.1073/pnas.88.24.11339 Doya K., 1995, ADV NEURAL INFORMATI, V7, P101 Durand SE, 1997, J COMP NEUROL, V377, P179, DOI 10.1002/(SICI)1096-9861(19970113)377:2<179::AID-CNE3>3.0.CO;2-0 Fee MS, 1998, NATURE, V395, P67, DOI 10.1038/25725 Fitch WT, 2000, TRENDS COGN SCI, V4, P258, DOI 10.1016/S1364-6613(00)01494-7 FORTUNE ES, 1995, J COMP NEUROL, V360, P413, DOI 10.1002/cne.903600305 Gardner T, 2001, PHYS REV LETT, V87, DOI 10.1103/PhysRevLett.87.208101 GENTNER TQ, 2002, ACOUSTIC COMMUNICATI Goller F, 1997, P NATL ACAD SCI USA, V94, P14787, DOI 10.1073/pnas.94.26.14787 HEILIGENBENBERG WF, 1991, NEURAL NETS ELECT FI Hessler NA, 1999, J NEUROSCI, V19, P10461 Hirano S, 1996, NEUROREPORT, V8, P363, DOI 10.1097/00001756-199612200-00071 Hirano S, 1997, NEUROREPORT, V8, P2379, DOI 10.1097/00001756-199707070-00055 Janata P, 1999, J NEUROSCI, V19, P5108 JEFFRESS LA, 1948, J COMP PHYSIOL PSYCH, V41, P35, DOI 10.1037/h0061495 Joris PX, 1998, NEURON, V21, P1235, DOI 10.1016/S0896-6273(00)80643-1 Jurgens U., 1979, NEUROBIOLOGY SOCIAL, P11 KLUENDER KR, 1987, SCIENCE, V237, P1195, DOI 10.1126/science.3629235 Konishi M., 1965, Zeitschrift fuer Tierpsychologie, V22, P770 KONISHI M, 1990, COLD SH Q B, V55, P575 Kowalski N, 1996, J NEUROPHYSIOL, V76, P3503 Kowalski N, 1996, J NEUROPHYSIOL, V76, P3524 KROODSMA DE, 1991, ANIM BEHAV, V42, P477, DOI 10.1016/S0003-3472(05)80047-8 KUHL PK, 1979, BRAIN BEHAV EVOLUT, V16, P374, DOI 10.1159/000121877 KUHL PK, 1975, SCIENCE, V190, P69, DOI 10.1126/science.1166301 KUHL PK, 1978, J ACOUST SOC AM, V63, P905, DOI 10.1121/1.381770 Leonardo A, 1999, NATURE, V399, P466, DOI 10.1038/20933 LEONARDO A, 2001, 6 INT C NEUR ABSTR, P149 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279 Lotto AJ, 1997, J ACOUST SOC AM, V102, P1134, DOI 10.1121/1.419865 Manabe K, 1998, J ACOUST SOC AM, V103, P1190, DOI 10.1121/1.421227 MARGOLIASH D, 1994, BRAIN BEHAV EVOLUT, V44, P247, DOI 10.1159/000113580 MARGOLIASH D, 1983, J NEUROSCI, V3, P1039 MARGOLIASH D, 1986, J NEUROSCI, V6, P1643 Margoliash D, 1987, IMPRINTING CORTICAL, P23 Margoliash D, 1997, J NEUROBIOL, V33, P671, DOI 10.1002/(SICI)1097-4695(19971105)33:5<671::AID-NEU12>3.0.CO;2-C MARGOLIASH D, 1985, P NATL ACAD SCI USA, V82, P5997, DOI 10.1073/pnas.82.17.5997 Marler P, 1997, J NEUROBIOL, V33, P501 MARLER P, 1970, J COMP PHYSIOL PSYCH, V71, P1, DOI 10.1037/h0029144 MARLER P, 1970, AM SCI, V58, P669 MARLER P, 1982, DEV PSYCHOBIOL, V15, P369, DOI 10.1002/dev.420150409 MARLER P, 1980, PERSPECTIVES STUDY S, P75 MATTINGLY LG, 1991, MODULARITY MOTOR THE MCCASLAND JS, 1987, J NEUROSCI, V7, P23 MCCASLAND JS, 1981, P NATL ACAD SCI-BIOL, V78, P7815, DOI 10.1073/pnas.78.12.7815 Mello CV, 1998, J COMP NEUROL, V395, P137 
Miller CT, 2001, NAT NEUROSCI, V4, P783, DOI 10.1038/90481 MULLER CM, 1985, EXP BRAIN RES, V59, P587 Nick TA, 2001, P NATL ACAD SCI USA, V98, P14012, DOI 10.1073/pnas.251525298 NORDEEN KW, 1993, BEHAV NEURAL BIOL, V59, P79, DOI 10.1016/0163-1047(93)91215-9 NORDEEN KW, 1992, BEHAV NEURAL BIOL, V57, P58, DOI 10.1016/0163-1047(92)90757-U NOTTEBOHM F, 1977, LATERALIZATION NERVO, P295 NOTTEBOH.F, 1972, AM NAT, V106, P116, DOI 10.1086/282756 NOTTEBOHM F, 1976, J COMP NEUROL, V165, P457, DOI 10.1002/cne.901650405 NOWICKI S, 1987, NATURE, V325, P53, DOI 10.1038/325053a0 OJEMANN GA, 1991, J NEUROSCI, V11, P2281 Penfield W, 1959, SPEECH BRAIN MECH PLOOG D, 1981, BRAIN RES REV, V3, P35, DOI 10.1016/0165-0173(81)90011-4 Plummer TK, 2000, J NEUROBIOL, V42, P79, DOI 10.1002/(SICI)1097-4695(200001)42:1<79::AID-NEU8>3.0.CO;2-W Rauschecker JP, 2000, P NATL ACAD SCI USA, V97, P11800, DOI 10.1073/pnas.97.22.11800 Rauschecker JP, 1998, CURR OPIN NEUROBIOL, V8, P516, DOI 10.1016/S0959-4388(98)80040-8 Rauske P. L., 1999, Society for Neuroscience Abstracts, V25, P624 Rizzolatti G, 1998, TRENDS NEUROSCI, V21, P188, DOI 10.1016/S0166-2236(98)01260-0 Romanski LM, 1999, NAT NEUROSCI, V2, P1131, DOI 10.1038/16056 ROSENFIELD DB, 1984, CRC CR REV CL NEUROB, V1, P117 SACHS MB, 1979, J ACOUST SOC AM, V66, P470, DOI 10.1121/1.383098 Schmidt MF, 1998, NAT NEUROSCI, V1, P513, DOI 10.1038/2232 Schreiner CE, 1997, NATURE, V388, P383, DOI 10.1038/41106 Schreiner CE, 2000, ANNU REV NEUROSCI, V23, P501, DOI 10.1146/annurev.neuro.23.1.501 SCHWAB EC, 1985, HUM FACTORS, V27, P395 Sen K, 2001, J NEUROPHYSIOL, V86, P1445 SHEA SD, 1999, SOC NEUR ABSTR Skaggs WE, 1996, SCIENCE, V271, P1870, DOI 10.1126/science.271.5257.1870 Snowdon Charles T., 1997, P234, DOI 10.1017/CBO9780511758843.012 Stickgold R, 2001, SCIENCE, V294, P1052, DOI 10.1126/science.1063530 STRIEDTER GF, 1994, J COMP NEUROL, V343, P35, DOI 10.1002/cne.903430104 SUGA N, 1990, NEURAL NETWORKS, V3, P3, DOI 10.1016/0893-6080(90)90043-K Sutton R. S., 1998, REINFORCEMENT LEARNI Sutton R. S., 1988, Machine Learning, V3, DOI 10.1007/BF00115009 Tchernichovski O, 2001, SCIENCE, V291, P2564, DOI 10.1126/science.1058522 Theunissen FE, 1998, J NEUROSCI, V18, P3786 TROYER T, 1996, COMPUTATIONAL NEUROS, P409 Troyer TW, 2000, J NEUROPHYSIOL, V84, P1204 Troyer TW, 2000, J NEUROPHYSIOL, V84, P1224 Vates GE, 1996, J COMP NEUROL, V366, P613, DOI 10.1002/(SICI)1096-9861(19960318)366:4<613::AID-CNE5>3.0.CO;2-7 Vicario DS, 2001, J NEUROBIOL, V47, P109, DOI 10.1002/neu.1020 VICARIO DS, 1993, J NEUROBIOL, V24, P488, DOI 10.1002/neu.480240407 Wang XQ, 1995, J NEUROPHYSIOL, V74, P2685 Wang XQ, 2001, J NEUROPHYSIOL, V86, P2616 WEST MJ, 1988, NATURE, V334, P244, DOI 10.1038/334244a0 WILLIAMS H, 1993, J NEUROBIOL, V24, P903, DOI 10.1002/neu.480240704 WILLIAMS H, 1985, SCIENCE, V229, P279, DOI 10.1126/science.4012321 Williams H, 1999, J NEUROBIOL, V39, P14, DOI 10.1002/(SICI)1097-4695(199904)39:1<14::AID-NEU2>3.0.CO;2-X WILSON MA, 1994, SCIENCE, V265, P676, DOI 10.1126/science.8036517 Yu AC, 1996, SCIENCE, V273, P1871, DOI 10.1126/science.273.5283.1871 NR 113 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD AUG PY 2003 VL 41 IS 1 BP 165 EP 178 DI 10.1016/S0167-6393(02)00101-2 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900014 ER PT J AU Esser, KH AF Esser, KH TI Modeling aspects of speech processing in bats - behavioral and neurophysiological studies SO SPEECH COMMUNICATION LA English DT Article DE audio-vocal learning; vocal learning; acoustic communication; communication calls; hand-rearing; syntax processing; auditory cortex; audio-motor integration; frontal cortex; prefrontal cortex; electrophysiology; echolocation ID PHYLLOSTOMUS-DISCOLOR; CAROLLIA-PERSPICILLATA; AUDITORY-CORTEX; NOSED BAT; COMMUNICATION SOUNDS; PTERONOTUS-PARNELLII; SENSITIVE NEURONS; FRONTAL-CORTEX; MOUSTACHED BAT; SYNTAX AB The goal of this article is to provide an integrated view of what we can learn about speech processing from animal studies. 'Integrated' refers here to the attempt to explore biologically important sounds from the level of vocal learning, to neural representation at various stations of the auditory pathway, to audio-motor integration and its influence on sound production. The methodologies required include developmental and behavioral studies as well as electrophysiological characterization of neuronal activity and electrical stimulation of (pre-)frontal/premotor areas. Three unique animal models, otherwise found scattered in the published literature, are brought together here: (1) the lesser spear-nosed bat (Phyllostomus discolor) as an animal model for audio-vocal learning; (2) the mustached bat (Pteronotus parnellii) as an animal for modeling syntax processing at the auditory cortical single-unit level; (3) the short-tailed fruit bat (Carollia perspicillata) as an animal model elucidating cortical audio-motor integration. (C) 2002 Elsevier Science Ltd. All rights reserved. C1 Univ Ulm, NSCL, Dept Neurobiol, D-89069 Ulm, Germany. RP Esser, KH (reprint author), Univ Ulm, NSCL, Dept Neurobiol, Albert Einstein Allee 11, D-89069 Ulm, Germany. EM kalle.esser@biologie.uni-ulm.de CR BALABAN E, 1988, BEHAVIOUR, V105, P292, DOI 10.1163/156853988X00052 Binkofski F, 2000, EUR J NEUROSCI, V12, P189 BITTER KS, 2001, ARO ABSTR, V24, P63 Boughman JW, 1998, P ROY SOC B-BIOL SCI, V265, P227, DOI 10.1098/rspb.1998.0286 Caldwell M.C., 1972, Cetology, VNo. 9, P1 CASSEDAY JH, 1989, J COMP NEUROL, V287, P247, DOI 10.1002/cne.902870208 CLEVELAND J, 1982, Z TIERPSYCHOL, V58, P231 DENES PB, 1993, SPEECH CHAIN PHYSICS Doupe AJ, 1999, ANNU REV NEUROSCI, V22, P567, DOI 10.1146/annurev.neuro.22.1.567 DRONKERS NF, 2000, PRINCIPLES NEURAL SC Ehret G, 2002, P NATL ACAD SCI USA, V99, P479, DOI 10.1073/pnas.012361999 EHRET G, 1992, AUDITORY PROCESSING Eiermann A, 2000, NEUROREPORT, V11, P421, DOI 10.1097/00001756-200002070-00040 ESSER KH, 2001, NEUROSCIENCES TURN C, V1 ESSER KH, 1989, ETHOLOGY, V82, P156 ESSER KH, 1998, CLIN PSYCHOACOUSTICS Esser KH, 1997, J COMP PHYSIOL A, V180, P513, DOI 10.1007/s003590050068 ESSER KH, 2002, ECHOLOCATION BATS DO ESSER KH, 1994, NEUROREPORT, V5, P1718, DOI 10.1097/00001756-199409080-00007 Esser KH, 1996, J COMP PHYSIOL A, V178, P787 Esser KH, 1997, P NATL ACAD SCI USA, V94, P14019, DOI 10.1073/pnas.94.25.14019 Esser KH, 1999, EUR J NEUROSCI, V11, P3669, DOI 10.1046/j.1460-9568.1999.00789.x FITZPATRICK DC, 1993, J NEUROSCI, V13, P931 KANWAL JS, 1994, J ACOUST SOC AM, V96, P1229, DOI 10.1121/1.410273 Kendel E. 
R., 2000, PRINCIPLES NEURAL SC KOBLER JB, 1987, SCIENCE, V236, P824, DOI 10.1126/science.2437655 LOHMANN P, 2001, 6 INT C NEUR BONN GE MARLER P, 1999, DESIGN ANIMAL COMMUN MARLER P, 1988, ETHOLOGY, V77, P125 MARLER P, 1988, PRIMATE VOCAL COMMUN MCCOWAN B, 1997, SOCIAL INFLUENCES VO MULLER K, 2000, RHYTHM PERCEPTION PR NOTTEBOHM F, 1999, DESIGN ANIMAL COMMUN ONEILL WE, 1979, SCIENCE, V203, P69, DOI 10.1126/science.758681 ONEILL WE, 1995, HEARING BATS PAYNE K, 1985, Z TIERPSYCHOL, V68, P89 PISTOHL D, 1998, ZOOLOGY S1, V101, P84 PORTER FL, 1979, Z TIERPSYCHOL, V50, P1 RICHARDS DG, 1984, J COMP PSYCHOL, V98, P10, DOI 10.1037/0735-7036.98.1.10 Snowdon CT, 1982, PRIMATE COMMUNICATIO Snowdon CT, 1997, SOCIAL INFLUENCES VO STRAUB O, 2000, ZOOLOGY S3, V103, P36 Tembrock G., 1977, TIERSTIMMENFORSCHUNG Webster DB, 1992, MAMMALIAN AUDITORY P NR 44 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 179 EP 188 DI 10.1016/S0167-6393(02)00102-4 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900015 ER PT J AU Suga, N Ma, XF Gao, EQ Sakai, M Chowdhury, SA AF Suga, N Ma, XF Gao, EQ Sakai, M Chowdhury, SA TI Descending system and plasticity for auditory signal processing: neuroethological data for speech scientists SO SPEECH COMMUNICATION LA English DT Article DE behaviorally relevant sound; combination-sensitive neurons; communication sound; corticofugal system; feedback loops ID COMBINATION-SENSITIVE NEURONS; BAT INFERIOR COLLICULUS; SPECIES-SPECIFIC VOCALIZATIONS; LATERAL GENICULATE-NUCLEUS; MIDBRAIN FREQUENCY MAP; MOUSTACHED BAT; CORTICOFUGAL MODULATION; MUSTACHE BAT; PHYSIOLOGICAL MEMORY; UNANESTHETIZED CATS AB The auditory system consists of the ascending and descending (corticofugal) systems. One of the major functions of the corticofugal system is the adjustment and improvement of auditory signal processing in the subcortical auditory nuclei, i.e., the adjustment and improvement of the input of cortical neurons. The corticofugal system evokes a small, short-term reorganization (plasticity) of the inferior colliculus, medial geniculate body and auditory cortex (AC) for acoustic signals repetitively delivered to an animal. When these signals become behaviorally relevant to the animal through conditioning (associative learning), the short-term reorganization is augmented and changes into a long-term reorganization of the AC. Mammals may mostly acquire the behavioral relevance of sounds through learning. Human babies may also acquire language through learning. Therefore, the corticofugal system is expected to play an important role in processing behaviorally relevant sounds and in reorganizing the AC according to the behavioral relevance of sounds. Since the ascending and descending systems form multiple feedback loops, the neural mechanisms for auditory information processing cannot be adequately understood without the exploration of the interaction between the ascending and descending systems. (C) 2002 Published by Elsevier Science B.V. C1 Washington Univ, Dept Biol, St Louis, MO 63130 USA. Gen Hosp Navy, Dept Neurol, Beijing 100037, Peoples R China. Dalhousie Univ, Dept Physiol & Biophys, Halifax, NS B3H 4H7, Canada. RP Suga, N (reprint author), Washington Univ, Dept Biol, 1 Brookings Dr, St Louis, MO 63130 USA. 
EM suga@biology.wustl.edu CR AHISSAR E, 1992, SCIENCE, V257, P1412, DOI 10.1126/science.1529342 Bakin JS, 1996, P NATL ACAD SCI USA, V93, P11219, DOI 10.1073/pnas.93.20.11219 BUTMAN JA, 1992, THESIS WASHINGTON U CASSEDAY JH, 1994, SCIENCE, V264, P847, DOI 10.1126/science.8171341 Chowdhury SA, 2000, J NEUROPHYSIOL, V83, P1856 Covey E, 1999, ANNU REV PHYSIOL, V61, P457, DOI 10.1146/annurev.physiol.61.1.457 Doupe AJ, 1997, J NEUROSCI, V17, P1147 Feliciano M. E., 1995, AUDIT NEUROSCI, V1, P287 FENG AS, 1978, SCIENCE, V202, P645, DOI 10.1126/science.705350 FUZESSERY ZM, 1994, J NEUROPHYSIOL, V72, P1061 Fuzessery ZM, 1996, J NEUROPHYSIOL, V76, P1059 FUZESSERY ZM, 1982, J COMP PHYSIOL, V146, P471 Gao E, 1998, P NATL ACAD SCI USA, V95, P12663, DOI 10.1073/pnas.95.21.12663 Gao EQ, 2000, P NATL ACAD SCI USA, V97, P8081, DOI 10.1073/pnas.97.14.8081 GRAY CM, 1989, NATURE, V338, P334, DOI 10.1038/338334a0 HALL JC, 1987, J COMP NEUROL, V258, P407, DOI 10.1002/cne.902580309 He JF, 1997, J NEUROPHYSIOL, V77, P896 HEFFNER HE, 1984, SCIENCE, V226, P75, DOI 10.1126/science.6474192 HERNANDEZPEON R, 1956, SCIENCE, V123, P331, DOI 10.1126/science.123.3191.331 HOSE B, 1987, BRAIN RES, V422, P367, DOI 10.1016/0006-8993(87)90946-2 HUBER F, 1985, SCI AM, V253, P60 HUFFMAN RF, 1990, BRAIN RES REV, V15, P295, DOI 10.1016/0165-0173(90)90005-9 IMIG TJ, 1977, BRAIN RES, V138, P241, DOI 10.1016/0006-8993(77)90743-0 Jen PHS, 1998, J COMP PHYSIOL A, V183, P683, DOI 10.1007/s003590050291 Jen PHS, 1999, J COMP PHYSIOL A, V184, P185, DOI 10.1007/s003590050317 Ji WQ, 2001, J NEUROPHYSIOL, V86, P211 Kaas JH, 2000, P NATL ACAD SCI USA, V97, P11793, DOI 10.1073/pnas.97.22.11793 KELLY JP, 1981, BRAIN RES, V212, P1, DOI 10.1016/0006-8993(81)90027-5 Kilgard MP, 1998, SCIENCE, V279, P1714, DOI 10.1126/science.279.5357.1714 Kim DS, 2000, NAT NEUROSCI, V3, P164 Kujirai K, 1983, Auris Nasus Larynx, V10, P9 LIVINGSTONE M, 1988, SCIENCE, V240, P740, DOI 10.1126/science.3283936 LUKAS JH, 1980, PSYCHOPHYSIOLOGY, V17, P444, DOI 10.1111/j.1469-8986.1980.tb00181.x Ma XF, 2001, P NATL ACAD SCI USA, V98, P14060, DOI 10.1073/pnas.241517098 Ma XF, 2001, J NEUROPHYSIOL, V85, P1078 Malmierca MS, 1996, HEARING RES, V93, P167, DOI 10.1016/0378-5955(95)00227-8 MANABE T, 1978, SCIENCE, V200, P339, DOI 10.1126/science.635594 MARGOLIASH D, 1983, J NEUROSCI, V3, P1039 Misawa H, 2001, HEARING RES, V151, P15, DOI 10.1016/S0300-2977(00)00079-6 Murphy PC, 1999, SCIENCE, V286, P1552, DOI 10.1126/science.286.5444.1552 NAKATA S, 1995, SOC NEUR ABST, V21, P269 OATMAN LC, 1971, EXP NEUROL, V32, P341, DOI 10.1016/0014-4886(71)90003-3 OJIMA H, 1994, CEREB CORTEX, V4, P646, DOI 10.1093/cercor/4.6.646 ONEILL WE, 1985, J COMP PHYSIOL A, V157, P797, DOI 10.1007/BF01350077 ONEILL WE, 1982, J NEUROSCI, V2, P17 PETERSEN MR, 1978, SCIENCE, V202, P324, DOI 10.1126/science.99817 Przybyszewski AW, 2000, VISUAL NEUROSCI, V17, P485, DOI 10.1017/S0952523800174012 PUEL JL, 1988, BRAIN RES, V447, P380, DOI 10.1016/0006-8993(88)91144-4 RAUSCHECKER JP, 1995, SCIENCE, V268, P111, DOI 10.1126/science.7701330 Rauschecker JP, 2000, P NATL ACAD SCI USA, V97, P11800, DOI 10.1073/pnas.97.22.11800 RECANZONE GH, 1993, J NEUROSCI, V13, P87 Sakai M, 2001, P NATL ACAD SCI USA, V98, P3507, DOI 10.1073/pnas.061021698 Saldana E, 1996, J COMP NEUROL, V371, P15, DOI 10.1002/(SICI)1096-9861(19960715)371:1<15::AID-CNE2>3.0.CO;2-O Scheich H, 1991, Curr Opin Neurobiol, V1, P236, DOI 10.1016/0959-4388(91)90084-K SCHREINER CE, 1986, HEARING RES, V21, P227, DOI 10.1016/0378-5955(86)90221-2 SCHREINER CE, 
1990, J NEUROPHYSIOL, V64, P1442 SHAMMA SA, 1993, J NEUROPHYSIOL, V69, P367 SILLITO AM, 1994, NATURE, V20, P1 Solis MM, 2000, P NATL ACAD SCI USA, V97, P11836, DOI 10.1073/pnas.97.22.11836 Steriade M, 1999, TRENDS NEUROSCI, V22, P337, DOI 10.1016/S0166-2236(99)01407-1 Stiebler I, 1997, J COMP PHYSIOL A, V181, P559, DOI 10.1007/s003590050140 Suga N., 1973, BASIC MECH HEARING, P675 SUGA N, 1985, J NEUROPHYSIOL, V53, P1109 SUGA N, 1965, J PHYSIOL-LONDON, V181, P671 Suga N, 1997, J NEUROPHYSIOL, V77, P2098 SUGA N, 1979, SCIENCE, V206, P351, DOI 10.1126/science.482944 Suga N, 1988, AUDITORY FUNCTION NE, P679 SUGA N, 1976, SCIENCE, V194, P542, DOI 10.1126/science.973140 Suga N., 1965, J PHYSL, V179, P25 SUGA N, 1969, J PHYSIOL-LONDON, V200, P555 SUGA N, 1977, SCIENCE, V196, P64, DOI 10.1126/science.190681 Suga N, 2000, P NATL ACAD SCI USA, V97, P11807, DOI 10.1073/pnas.97.22.11807 Suga N., 1994, COGNITIVE NEUROSCIEN, P295 SUGA N, 1983, J NEUROPHYSIOL, V49, P1573 TSUMOTO T, 1978, EXP BRAIN RES, V32, P345 TUNTURI AR, 1952, AM J PHYSIOL, V168, P712 VILLA AEP, 1991, EXP BRAIN RES, V86, P506 Wang XQ, 2000, P NATL ACAD SCI USA, V97, P11843, DOI 10.1073/pnas.97.22.11843 Warr W. B., 1992, MAMMALIAN AUDITORY P, P410 WATANABE T, 1966, EXP BRAIN RES, V2, P302 Weinberger NM, 1998, NEUROBIOL LEARN MEM, V70, P226, DOI 10.1006/nlme.1998.3850 Weinberger Norman M., 1995, P1071 WENSTRUP JJ, 1995, J NEUROSCI, V15, P4693 Wetzel W, 1998, NEUROSCI LETT, V252, P115, DOI 10.1016/S0304-3940(98)00561-8 Woolsey CN, 1942, B JOHNS HOPKINS HOSP, V71, P315 Yan J, 1999, J NEUROPHYSIOL, V81, P817 Yan J, 1996, SCIENCE, V273, P1100, DOI 10.1126/science.273.5278.1100 Yan W, 1998, NAT NEUROSCI, V1, P54, DOI 10.1038/255 YANG LC, 1992, J NEUROPHYSIOL, V68, P1760 Zhang YF, 1997, NATURE, V387, P900 Zhang YF, 1997, J NEUROPHYSIOL, V78, P3489 Zhang YF, 2000, J NEUROPHYSIOL, V84, P325 NR 92 TC 20 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 189 EP 200 DI 10.1016/S0167-6393(02)00103-6 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900016 ER PT J AU Dinse, HR Godde, B Reuter, G Cords, SM Hilger, T AF Dinse, HR Godde, B Reuter, G Cords, SM Hilger, T TI Auditory cortical plasticity under operation: reorganization of auditory cortex induced by electric cochlear stimulation reveals adaptation to altered sensory input statistics SO SPEECH COMMUNICATION LA English DT Article DE auditory cortex; cochleotopy; cortical plasticity; coding; perceptual learning; optical recording; continuous interleaved sampling strategy; hebbian plasticity ID NEONATALLY DEAFENED CAT; NEURONAL RESPONSES; FREQUENCY MAP; INTRACORTICAL MICROSTIMULATION; COCHLEOTOPIC SELECTIVITY; IMPAIRED CHILDREN; INTRINSIC SIGNALS; RECEPTIVE-FIELDS; NUCLEUS BASALIS; SINGLE NEURONS AB We introduce a framework based on plastic-adaptational processes for an interpretation of electrical cochlear implant (CI) stimulation. Cochlear prostheses are used to restore sound perception in adults and children with profound deafness. After providing a review of cortical plasticity, we summarize our findings using optical imaging of intrinsic signals to map cat auditory cortex (AI) activated by CI stimulation. In adult AI of neonatally deafened animals, the acoustic deprivation caused a severe distortion of cochleotopic maps. 
A three-month period of CI-stimulation using the continuous interleaved sampling strategy did not re-install the status typically found in normal adults, but resulted in the emergence of a new topographical organization characterized by large, joint representations of all stimulated electrode sites. We suggest that the effectiveness of CI-stimulation relies primarily on a re-learning of input patterns arising from "artificial" sensory inputs via electrical stimulation, thereby supporting the importance of learning and training for the interpretation and understanding of the effects of CI stimulation. We suggest that the ability for gaining/re-gaining speech understanding mediated by CI-stimulation is accomplished by new strategies of cortical processing due to enhanced cooperativity among large populations of neurons that serve higher processing stages to interpret new patterns arriving from the periphery. These strategies are thought to emerge from adaptational capacities in response to the constraints imposed by the properties of the new input statistics that in turn result from the stimulation strategy employed. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Ruhr Univ Bochum, Inst Neuroinformat, D-44780 Bochum, Germany. Univ Tubingen, D-72074 Tubingen, Germany. Hannover Med Sch, Dept Otolaryngol, D-30623 Hannover, Germany. Max Planck Inst Neurol Res, D-50866 Cologne, Germany. RP Dinse, HR (reprint author), Ruhr Univ Bochum, Inst Neuroinformat, D-44780 Bochum, Germany. EM hubert.dinse@neuroinformatik.ruhr-uni-bochum.de RI Godde, Ben/F-5045-2013 OI Godde, Ben/0000-0001-9101-7286 CR AHISSAR E, 1992, SCIENCE, V257, P1412, DOI 10.1126/science.1529342 Bakin JS, 1996, P NATL ACAD SCI USA, V93, P11219, DOI 10.1073/pnas.93.20.11219 Bakin JS, 1996, CEREB CORTEX, V6, P120, DOI 10.1093/cercor/6.2.120 BALDI P, 1988, BIOL CYBERN, V59, P313, DOI 10.1007/BF00332921 Baskerville KA, 1997, NEUROSCIENCE, V80, P1159, DOI 10.1016/S0306-4522(97)00064-X Bishop DVM, 1999, J SPEECH LANG HEAR R, V42, P1295 Blamey P, 1997, AM J OTOL, V18, pS11 WallhausserFranke E, 1996, NEUROREPORT, V7, P1585, DOI 10.1097/00001756-199607080-00010 Brosch M, 1999, J NEUROPHYSIOL, V82, P1542 BROWN M, 1992, HEARING RES, V59, P224, DOI 10.1016/0378-5955(92)90119-8 Buonomano DV, 1998, ANNU REV NEUROSCI, V21, P149, DOI 10.1146/annurev.neuro.21.1.149 Chowdhury SA, 2000, J NEUROPHYSIOL, V83, P1856 Cruikshank SJ, 1996, BRAIN RES REV, V22, P191, DOI 10.1016/S0165-0173(96)00015-X DIAMOND DM, 1986, BRAIN RES, V372, P357, DOI 10.1016/0006-8993(86)91144-3 Dinse HR, 2002, PERCEPTUAL LEARNING, P19 Dinse HR, 1997, ADV NEUROL, V73, P159 Dinse H. R., 2002, CORTICAL AREAS UNITY, P311, DOI 10.4324/9780203219911_chapter_14 DINSE HR, 2000, 200001 IRINI, P1 Dinse H.
R., 1998, Society for Neuroscience Abstracts, V24, P905 Dinse HR, 1997, EUR J NEUROSCI, V9, P113, DOI 10.1111/j.1460-9568.1997.tb01359.x DINSE HR, 1997, ASS RES OTOLARYNG, V20, P229 Djourno A., 1957, PRESSE MED, V35, P14 EDELINE JM, 1994, EXP BRAIN RES, V97, P373 Edeline JM, 1996, J PHYSIOLOGY-PARIS, V90, P271, DOI 10.1016/S0928-4257(97)81437-4 ELBERT T, 1995, SCIENCE, V270, P305, DOI 10.1126/science.270.5234.305 Florence SL, 1998, SCIENCE, V282, P1117, DOI 10.1126/science.282.5391.1117 FROSTIG RD, 1990, P NATL ACAD SCI USA, V87, P6082, DOI 10.1073/pnas.87.16.6082 Ghazanfar AA, 2000, J NEUROSCI, V20, P3761 GIBSON EJ, 1953, PSYCHOL BULL, V50, P401, DOI 10.1037/h0055517 Godde B, 1996, NEUROREPORT, V8, P281, DOI 10.1097/00001756-199612200-00056 Godde B, 1995, NEUROREPORT, V7, P24 Godde B, 2000, J NEUROSCI, V20, P1597 Gonzalez-Lima F, 1990, Neuroreport, V1, P161, DOI 10.1097/00001756-199010000-00019 GRAHAM J, 1988, INT J PEDIATR OTORHI, V15, P107, DOI 10.1016/0165-5876(88)90061-4 Harrison RV, 1998, AUDIOL NEURO-OTOL, V3, P214, DOI 10.1159/000013791 HARRISON RV, 1991, HEARING RES, V54, P11, DOI 10.1016/0378-5955(91)90131-R HARRISON RV, 1993, ACTA OTO-LARYNGOL, V113, P296, DOI 10.3109/00016489309135812 HARTMANN R, 1984, HEARING RES, V13, P47, DOI 10.1016/0378-5955(84)90094-7 HATSUSHIKA S, 1990, ANN OTO RHINOL LARYN, V99, P871 Hebb D, 1949, ORG BEHAV Hess A, 1996, NEUROREPORT, V7, P2643, DOI 10.1097/00001756-199611040-00047 ROBERTSON D, 1989, J COMP NEUROL, V282, P456, DOI 10.1002/cne.902820311 JAMES W, 1890, PSYCHOL BRIEF COURSE JASTREBOFF PJ, 1990, NEUROSCI RES, V8, P221, DOI 10.1016/0168-0102(90)90031-9 Kaas JH, 1999, P NATL ACAD SCI USA, V96, P7622, DOI 10.1073/pnas.96.14.7622 Kelahan A M, 1981, Brain Res, V223, P152, DOI 10.1016/0006-8993(81)90815-5 Kilgard MP, 1998, NAT NEUROSCI, V1, P727, DOI 10.1038/3729 Kilgard MP, 1998, SCIENCE, V279, P1714, DOI 10.1126/science.279.5357.1714 KING AJ, 1991, TRENDS NEUROSCI, V14, P31, DOI 10.1016/0166-2236(91)90181-S Klinke R, 1999, SCIENCE, V285, P1729, DOI 10.1126/science.285.5434.1729 Leake PA, 1999, J COMP NEUROL, V412, P543, DOI 10.1002/(SICI)1096-9861(19991004)412:4<543::AID-CNE1>3.0.CO;2-3 Lee DS, 2001, NATURE, V409, P149, DOI 10.1038/35051653 Maldonado PE, 1996, EXP BRAIN RES, V112, P431 Maldonado PE, 1996, EXP BRAIN RES, V112, P420 Malonek D, 1997, P NATL ACAD SCI USA, V94, P14826, DOI 10.1073/pnas.94.26.14826 MCKENNA TM, 1989, BRAIN RES, V481, P142, DOI 10.1016/0006-8993(89)90494-0 Menning H, 2000, NEUROREPORT, V11, P817, DOI 10.1097/00001756-200003200-00032 MERZENICH MM, 1984, J COMP NEUROL, V224, P591, DOI 10.1002/cne.902240408 Merzenich MM, 1996, SCIENCE, V271, P77, DOI 10.1126/science.271.5245.77 MERZENICH MM, 1975, J NEUROPHYSIOL, V38, P231 MERZENICH MM, 1993, ANN NY ACAD SCI, V682, P1 MOORE DR, 1993, PROG BRAIN RES, V97, P127 Muhlnickel W, 1998, P NATL ACAD SCI USA, V95, P10340, DOI 10.1073/pnas.95.17.10340 Nicolelis MAL, 1998, NAT NEUROSCI, V1, P621, DOI 10.1038/2855 Pantev C, 1998, NATURE, V392, P811, DOI 10.1038/33918 Pleger B, 2001, P NATL ACAD SCI USA, V98, P12255, DOI 10.1073/pnas.191176298 Poggio T., 2002, PERCEPTUAL LEARNING Ponton CW, 1996, NEUROREPORT, V8, P61, DOI 10.1097/00001756-199612200-00013 Raggio MW, 1999, J NEUROPHYSIOL, V82, P3506 RAGGIO MW, 1994, J NEUROPHYSIOL, V72, P2334 Rajan R, 1998, J COMP NEUROL, V399, P35 Rajan R, 2001, CEREB CORTEX, V11, P171, DOI 10.1093/cercor/11.2.171 RAMACHANDRAN VS, 1992, NEUROREPORT, V3, P583, DOI 10.1097/00001756-199207000-00009 RASMUSSON DD, 1982, J COMP NEUROL, V205, P313, DOI 
10.1002/cne.902050402 Rauschecker JP, 1999, TRENDS NEUROSCI, V22, P74, DOI 10.1016/S0166-2236(98)01303-4 Recanzone G. H., 2000, NEW COGNITIVE NEUROS, P237 RECANZONE GH, 1993, J NEUROSCI, V13, P87 REUTER G, 1997, AM J OTOL S, V18, P13 SCHEICH H, 1993, PROG BRAIN RES, V97, P135 Scheich H, 1991, Curr Opin Neurobiol, V1, P236, DOI 10.1016/0959-4388(91)90084-K Schreiner CE, 1996, J NEUROPHYSIOL, V75, P1283 Schreiner CE, 1997, ACTA OTO-LARYNGOL, P54 SCHUKNECHT HF, 1960, NEURAL MECHANISMS AU, P76 SNYDER RL, 1991, HEARING RES, V56, P246, DOI 10.1016/0378-5955(91)90175-9 SNYDER RL, 1990, HEARING RES, V50, P7, DOI 10.1016/0378-5955(90)90030-S Spitzer MW, 2001, J NEUROPHYSIOL, V85, P1283 Stanton SG, 1996, AUDIT NEUROSCI, V2, P97 Stanton SG, 2000, J COMP NEUROL, V426, P117, DOI 10.1002/1096-9861(20001009)426:1<117::AID-CNE8>3.0.CO;2-S TALLAL P, 1993, ANN NY ACAD SCI, V682, P27, DOI 10.1111/j.1749-6632.1993.tb22957.x Tyler R. S., 1993, COCHLEAR IMPLANTS AU, P191 Weinberger NM, 1990, CONCEPTS NEUROSCIENC, V1, P91 WEINBERGER NM, 1995, ANNU REV NEUROSCI, V18, P129 WESTHEIMER G, 1979, EXP BRAIN RES, V36, P585 Wiemer J, 2000, BIOL CYBERN, V82, P173, DOI 10.1007/s004220050017 WILSON BS, 1995, AM J OTOL, V16, P669 WILSON BS, 1991, NATURE, V352, P236, DOI 10.1038/352236a0 Wright BA, 1997, NATURE, V387, P176, DOI 10.1038/387176a0 XERRI C, 1994, J NEUROSCI, V14, P1710 Yan W, 1998, NAT NEUROSCI, V1, P54, DOI 10.1038/255 NR 99 TC 12 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 201 EP 219 DI 10.1016/S0167-6393(02)00104-8 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900017 ER PT J AU Polka, L Bohn, OS AF Polka, L Bohn, OS TI Asymmetries in vowel perception SO SPEECH COMMUNICATION LA English DT Article DE vowel perception; perceptual asymmetries; infant speech perception; speech development ID STEADY-STATE VOWELS; EARLY INFANCY; INTERNAL STRUCTURE; RISK INFANTS; SPEECH; DISCRIMINATION; CATEGORIES; CUES; PROTOTYPES; SIMILARITY AB Asymmetries in vowel perception occur such that discrimination of a vowel change presented in one direction is easier compared to the same change presented in the reverse direction. Although such effects have been repeatedly reported in the literature there has been little effort to explain when or why they occur. We review studies that report asymmetries in vowel perception in infants and propose that these data indicate that babies are predisposed to respond differently to vowels that occupy different positions in the articulatory/acoustic vowel space (defined by F1-F2) such that the more peripheral vowel within a contrast serves as a reference or perceptual anchor. As such, these asymmetries reveal a language-universal perceptual bias that infants bring to the task of vowel discrimination. We present some new data that support our peripherality hypothesis and then compare the data on asymmetries in human infants with findings obtained with birds and cats. This comparison suggests that asymmetries evident in humans are unlikely to reflect general auditory mechanisms. Several important directions for further research are outlined and some potential implications of these asymmetries for understanding speech development are discussed. (C) 2002 Elsevier Science B.V. All rights reserved. C1 McGill Univ, Sch Commun Sci & Disorders, Montreal, PQ H3G 1A8, Canada. 
Aarhus Univ, Engelsk Inst, Aarhus, Denmark. RP Polka, L (reprint author), McGill Univ, Sch Commun Sci & Disorders, 1266 Pine Ave W, Montreal, PQ H3G 1A8, Canada. EM linda.polka@mcgill.ca; engosb@hum.au.dk CR [Anonymous], 1999, HDB INT PHONETIC ASS RATNER NB, 1984, J CHILD LANG, V11, P557 BEST C, 1997, M SOC RES CHILD DEV Best C. T., 2000, INT C INF STUD BRIGH Bohn OS, 2001, J ACOUST SOC AM, V110, P504, DOI 10.1121/1.1380415 COWAN N, 1986, J ACOUST SOC AM, V79, P201 CUTLER A, 1993, J PSYCHOLINGUIST RES, V22, P109 DEJARDINS RN, 1998, CAN ACOUST P, V26, P96 FERNALD A, 1984, ORIGINS GROWTH COMM GRIESER D, 1989, DEV PSYCHOL, V25, P577, DOI 10.1037/0012-1649.25.4.577 HIENZ RD, 1981, J ACOUST SOC AM, V70, P699, DOI 10.1121/1.386933 Hienz RD, 1996, J ACOUST SOC AM, V99, P3656, DOI 10.1121/1.414980 IVERSON P, 1995, J ACOUST SOC AM, V97, P553, DOI 10.1121/1.412280 JOHNSON K, 1993, LANGUAGE, V69, P505, DOI 10.2307/416697 KENT RD, 1987, J SPEECH HEAR DISORD, V52, P64 KENT RD, 1982, J ACOUST SOC AM, V72, P353, DOI 10.1121/1.388089 KUHL PK, 1991, PERCEPT PSYCHOPHYS, V50, P93, DOI 10.3758/BF03212211 KUHL PK, 1992, SCIENCE, V255, P606, DOI 10.1126/science.1736364 KUHL PK, 1983, INFANT BEHAV DEV, V6, P263, DOI 10.1016/S0163-6383(83)80036-8 KUHL PK, 1979, J ACOUST SOC AM, V66, P1668, DOI 10.1121/1.383639 Kuhl PK, 1996, J ACOUST SOC AM, V100, P2425, DOI 10.1121/1.417951 Kuhl PK, 1997, SCIENCE, V277, P684, DOI 10.1126/science.277.5326.684 Lacerda F, 1993, J ACOUST SOC AM, V93, P2372, DOI 10.1121/1.406120 Ladefoged P., 2001, VOWELS CONSONANTS IN LIEBERMAN P, 1984, BIOL EVOLUTION LANGU Lindblom Bjorn, 1986, EXPT PHONOLOGY, P13 MAREAN GC, 1992, DEV PSYCHOL, V28, P396, DOI 10.1037//0012-1649.28.3.396 Medin D. L., 1987, CATEGORICAL PERCEPTI MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 Miller JL, 1996, PERCEPT PSYCHOPHYS, V58, P1157, DOI 10.3758/BF03207549 Morgan J., 1996, SIGNAL SYNTAX Nearey Terrance Michael, 1978, PHONETIC FEATURE SYS Papousek M., 1985, SOCIAL PERCEPTION IN PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 Polka L, 1996, J ACOUST SOC AM, V100, P577, DOI 10.1121/1.415884 POLKA L, 1995, J ACOUST SOC AM, V97, P1286, DOI 10.1121/1.412170 POLKA L, 1994, J EXP PSYCHOL HUMAN, V20, P421, DOI 10.1037/0096-1523.20.2.421 REPP BH, 1979, J EXP PSYCHOL HUMAN, V5, P129, DOI 10.1037//0096-1523.5.1.129 REPP BH, 1990, J ACOUST SOC AM, V88, P2080, DOI 10.1121/1.400105 ROSCH E, 1975, COGNITIVE PSYCHOL, V7, P573, DOI 10.1016/0010-0285(75)90024-9 ROSCH E, 1975, J EXP PSYCHOL HUMAN, V1, P303, DOI 10.1037//0096-1523.1.4.303 ROSCH E, 1975, COGNITIVE PSYCHOL, V7, P532, DOI 10.1016/0010-0285(75)90021-3 RVACHEW S, 1996, CAN ACOUST, V24, P247 SHEPARD RN, 1974, PSYCHOMETRIKA, V39, P373, DOI 10.1007/BF02291665 Shi RS, 1999, COGNITION, V72, pB11, DOI 10.1016/S0010-0277(99)00047-5 STEVENS KN, 1989, J PHONETICS, V17, P3 SUNDARA M, UNPUB ASYMMETRIES AD SUSSMAN JE, 1995, J ACOUST SOC AM, V97, P539, DOI 10.1121/1.413111 SWOBODA PJ, 1978, CHILD DEV, V49, P332, DOI 10.1111/j.1467-8624.1978.tb02320.x SWOBODA PJ, 1976, CHILD DEV, V47, P459, DOI 10.2307/1128802 TERBEEK D, 1997, UCLA WORKING PAPERS, P37 TVERSKY A, 1977, PSYCHOL REV, V84, P327, DOI 10.1037/0033-295X.84.4.327 Tversky A., 1978, COGNITION CATEGORIZA Werker J. F., 2001, PSYCHOL SCI, V12, P71 NR 55 TC 46 Z9 47 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD AUG PY 2003 VL 41 IS 1 BP 221 EP 231 DI 10.1016/S0167-6393(02)00105-X PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900018 ER PT J AU Nazzi, T Ramus, F AF Nazzi, T Ramus, F TI Perception and acquisition of linguistic rhythm by infants SO SPEECH COMMUNICATION LA English DT Article ID NATIVE-LANGUAGE; SPEECH SEGMENTATION; 2-MONTH-OLD INFANTS; 3-MONTH-OLD INFANTS; WORD SEGMENTATION; SOUND PATTERNS; FLUENT SPEECH; ENGLISH; DISCRIMINATION; SENSITIVITY AB In the present paper, we address the issue of the emergence in infancy of speech segmentation procedures that were found to be specific to rhythmic classes of languages in adulthood. These metrical procedures, which segment fluent speech into its constitutive word sequence, are crucial for the acquisition by infants of the words of their native language. We first present a prosodic bootstrapping proposal according to which the acquisition of these metrical segmentation procedures would be based on an early sensitivity to rhythm (and rhythmic classes). We then review several series of experiments that have studied infants' ability to discriminate languages between birth and 5 months, in an attempt to specify their sensitivity to rhythm and the implication of rhythm perception in the acquisition of these segmentation procedures. The results presented here establish infants' sensitivity to rhythmic classes (from birth onwards). They further show an evolution of infants' language discriminations between birth and 5 months which, though not inconsistent with our proposal, nevertheless call for more studies on the possible implication of rhythm in the acquisition of the metrical segmentation procedures. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Paris 05, CNRS, Lab Cognit & Dev, F-92774 Boulogne, France. UCL, Inst Cognit Neurosci, London WC1E 6BT, England. RP Nazzi, T (reprint author), Univ Paris 05, CNRS, Lab Cognit & Dev, 71 Ave Edouard Vaillant, F-92774 Boulogne, France. EM thierry.nazzi@psycho.univ-paris5.fr RI Ramus, Franck/A-1755-2009 OI Ramus, Franck/0000-0002-1122-5913 CR Abercrombie D, 1967, ELEMENTS GEN PHONETI ARVANITI A, 1994, J PHONETICS, V22, P239 BAHRICK LE, 1988, INFANT BEHAV DEV, V11, P277, DOI 10.1016/0163-6383(88)90014-8 Bosch L, 1997, COGNITION, V65, P33, DOI 10.1016/S0010-0277(97)00040-1 Christophe A, 1998, DEVELOPMENTAL SCI, V1, P215, DOI 10.1111/1467-7687.00033 CUTLER A, 1992, COGNITIVE PSYCHOL, V24, P381, DOI 10.1016/0010-0285(92)90012-Q CUTLER A, 1988, J EXP PSYCHOL HUMAN, V14, P113, DOI 10.1037/0096-1523.14.1.113 CUTLER A, 1986, J MEM LANG, V25, P385, DOI 10.1016/0749-596X(86)90033-1 CUTLER A, 1993, J PHONETICS, V21, P103 Dehaene-Lambertz G, 1998, LANG SPEECH, V41, P21 den Os E., 1988, RHYTHM TEMPO DUTCH I DUTOIT T, 1996, ICSLP 96 PHIL Echols CH, 1997, J MEM LANG, V36, P202, DOI 10.1006/jmla.1996.2483 FERNALD A, 1987, INFANT BEHAV DEV, V10, P279, DOI 10.1016/0163-6383(87)90017-8 FRIEDERICI AD, 1993, PERCEPT PSYCHOPHYS, V54, P287, DOI 10.3758/BF03205263 HOHNE EA, 1994, PERCEPT PSYCHOPHYS, V56, P613, DOI 10.3758/BF03208355 Jusczyk P. 
W., 1997, DISCOVERY SPOKEN LAN JUSCZYK PW, 1993, J MEM LANG, V32, P402, DOI 10.1006/jmla.1993.1022 Jusczyk PW, 1999, PERCEPT PSYCHOPHYS, V61, P1465, DOI 10.3758/BF03213111 JUSCZYK PW, 1999, COGNITIVE PSYCHOL, P39 JUSCZYK PW, 1995, COGNITIVE PSYCHOL, V29, P1, DOI 10.1006/cogp.1995.1010 JUSCZYK PW, 1993, CHILD DEV, V64, P675, DOI 10.1111/j.1467-8624.1993.tb02935.x JUSCZYK PW, 1978, PERCEPT PSYCHOPHYS, V23, P105, DOI 10.3758/BF03208289 JUSCZYK PW, 1994, J MEM LANG, V33, P630, DOI 10.1006/jmla.1994.1030 KARZON RG, 1989, PERCEPT PSYCHOPHYS, V45, P10, DOI 10.3758/BF03208026 KUHL PK, 1982, PERCEPT PSYCHOPHYS, V31, P279, DOI 10.3758/BF03202536 Kuijpers C., 1998, ADV INFANCY RES, V12, P205 Mattys SL, 2001, COGNITION, V78, P91, DOI 10.1016/S0010-0277(00)00109-8 Mattys SL, 1999, COGNITIVE PSYCHOL, V38, P465, DOI 10.1006/cogp.1999.0721 Mehler J., 1988, COGNITION, V29, P144 Mehler J, 1996, SIGNAL TO SYNTAX: BOOTSTRAPPING FROM SPEECH TO GRAMMAR IN EARLY ACQUISITION, P101 MEHLER J, 1981, J VERB LEARN VERB BE, V20, P298, DOI 10.1016/S0022-5371(81)90450-3 Mehler Jacques, 1995, P943 MOON C, 1993, INFANT BEHAV DEV, V16, P495, DOI 10.1016/0163-6383(93)80007-U MORAIS J, 1989, Language and Cognitive Processes, V4, P57, DOI 10.1080/01690968908406357 MORGAN JL, 1995, CHILD DEV, V66, P911, DOI 10.1111/j.1467-8624.1995.tb00913.x Morse P. A, 1972, J EXPT CHILD PSYCHOL, V13, P477 Nazzi T, 2000, J MEM LANG, V43, P1, DOI 10.1006/jmla.2000.2698 Nazzi T, 1998, INFANT BEHAV DEV, V21, P779, DOI 10.1016/S0163-6383(98)90044-3 Nazzi T, 1998, J EXP PSYCHOL HUMAN, V24, P756, DOI 10.1037//0096-1523.24.3.756 Nazzi T., 1997, THESIS ECOLE HAUTES OTAKE T, 1993, J MEM LANG, V32, P258 FANT G, 1991, J PHONETICS, V19, P351 Pike K. L., 1945, INTONATION AM ENGLIS RAMUS F, IN PRESS ANN REV LAN, P2 Ramus F, 1999, J ACOUST SOC AM, V105, P512, DOI 10.1121/1.424522 Ramus F, 2000, SCIENCE, V288, P349, DOI 10.1126/science.288.5464.349 RAMUS F, 1999, COGNITION Saffran JR, 1996, SCIENCE, V274, P1926, DOI 10.1126/science.274.5294.1926 SEBASTIANGALLES N, 1992, J MEM LANG, V31, P18, DOI 10.1016/0749-596X(92)90003-G Shafer VL, 1999, DEV NEUROPSYCHOL, V15, P73 SPRING DR, 1977, J SPEECH HEAR RES, V20, P224 Turk AE, 1995, LANG SPEECH, V38, P143 VROOMEN J, 1996, HUMAN PERCEPTION PER, V21, P98 WERKER JF, 1984, INFANT BEHAV DEV, V7, P49, DOI 10.1016/S0163-6383(84)80022-3 NR 55 TC 69 Z9 69 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 233 EP 243 DI 10.1016/S0167-6393(02)00106-1 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900019 ER PT J AU Poeppel, D AF Poeppel, D TI The analysis of speech in different temporal integration windows: cerebral lateralization as 'asymmetric sampling in time' SO SPEECH COMMUNICATION LA English DT Article DE temporal integration; timing; hemispheric asymmetry; neural basis of speech; auditory cortex; gamma band; theta band; oscillations ID NEURAL MECHANISMS; NONSPEECH SOUNDS; WORD DEAFNESS; PERCEPTION; DISCRIMINATION; INFORMATION; PET; BINDING; CORTEX; MEMORY AB The 'asymmetric sampling in time' (AST) hypothesis developed here provides a framework for understanding a range of psychophysical and neuropsychological data on speech perception in the context of a revised cortical functional anatomic model. 
The AST model is motivated by observations from psychophysics and cognitive neuroscience that speak to the fractionation of auditory processing, in general, and speech perception, in particular. Building on the observations (1) that the speech signal contains more than one time scale relevant to auditory cognition (e.g. time scales commensurate with processing formant transitions versus scales commensurate with syllabicity and intonation contours), and (2) that speech perception is mediated by both left and right auditory cortices, AST suggests a time-based perspective that maintains anatomic symmetry while permitting functional asymmetry. AST proposes that the input speech signal has a neural representation that is bilaterally symmetric at an early representational level. Beyond the initial representation, however, the signal is elaborated asymmetrically in the time domain: left auditory areas preferentially extract information from short (~20-40 ms) temporal integration windows. The right hemisphere homologues preferentially extract information from long (~150-250 ms) integration windows. It is suggested that temporal integration is reflected as oscillatory neuronal activity in different frequency bands (gamma, theta). (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Maryland, Dept Linguist & Biol, Cognit Neurosci Language Lab, College Pk, MD 20742 USA. RP Poeppel, D (reprint author), Univ Maryland, Dept Linguist & Biol, Cognit Neurosci Language Lab, 1401 Marie Mt Hall, College Pk, MD 20742 USA. EM dpoeppel@deans.umd.edu CR Belin P, 2000, NATURE, V403, P309, DOI 10.1038/35002078 Belin P, 1998, J COGNITIVE NEUROSCI, V10, P536, DOI 10.1162/089892998562834 Binder JR, 2000, CEREB CORTEX, V10, P512, DOI 10.1093/cercor/10.5.512 BLUMSTEIN SE, 1995, COGNITIVE NEUROSCIEN Bocker KBE, 1999, PSYCHOPHYSIOLOGY, V36, P706, DOI 10.1111/1469-8986.3660706 BUCHMAN AS, 1986, J NEUROL NEUROSUR PS, V49, P489, DOI 10.1136/jnnp.49.5.489 Buchsbaum BR, 2001, COGNITIVE SCI, V25, P663, DOI 10.1207/s15516709cog2505_2 Burton MW, 2001, COGNITIVE SCI, V25, P695, DOI 10.1016/S0364-0213(01)00051-9 Burton MW, 2000, J COGNITIVE NEUROSCI, V12, P679, DOI 10.1162/089892900562309 Chomsky N., 1995, MINIMALIST PROGRAM Eimas PD, 1999, J ACOUST SOC AM, V105, P1901, DOI 10.1121/1.426726 Engel AK, 2001, TRENDS COGN SCI, V5, P16, DOI 10.1016/S1364-6613(00)01568-0 FIEZ JA, 1995, J COGNITIVE NEUROSCI, V7, P357, DOI 10.1162/jocn.1995.7.3.357 Gandour J, 2000, J COGNITIVE NEUROSCI, V12, P207, DOI 10.1162/089892900561841 GESCHWIN.N, 1968, SCIENCE, V161, P186, DOI 10.1126/science.161.3837.186 Greenberg S, 1998, BEHAV BRAIN SCI, V21, P267, DOI 10.1017/S0140525X98311176 Heinze HJ, 1998, J COGNITIVE NEUROSCI, V10, P485, DOI 10.1162/089892998562898 Hickok G, 2000, TRENDS COGN SCI, V4, P131, DOI 10.1016/S1364-6613(00)01463-7 HICKOK G, IN PRESS COGNITION Hirsh IJ, 1996, ANNU REV PSYCHOL, V47, P461, DOI 10.1146/annurev.psych.47.1.461 Ivry R, 1998, RIGHT HEMISPHERE LANGUAGE COMPREHENSION, P3 Ivry RB, 1998, 2 SIDES PERCEPTION JOHNSRUDE IS, 1997, NEUROREPORT, V8, P61 JOLIOT M, 1994, P NATL ACAD SCI USA, V91, P11748, DOI 10.1073/pnas.91.24.11748 LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279 MEHLER J, 1981, PHILOS T ROY SOC B, V295, P333, DOI 10.1098/rstb.1981.0144 Meinschaefer J, 1999, BRAIN LANG, V70, P287, DOI 10.1006/brln.1999.2153 Naatanen R., 1992, ATTENTION BRAIN FUNC Nicholls M E, 1996, Laterality, V1, P97 NORRIS D, 2000, NEW COGNITIVE NEUROS Nusbaum H.
C., 1992, AUDITORY PROCESSING PEOPPEL D, UNPUB FM TONE SYLLAB Phillips C, 2001, COGNITIVE SCI, V25, P711, DOI 10.1016/S0364-0213(01)00049-0 PISONI DB, 1973, PERCEPT PSYCHOPHYS, V13, P253, DOI 10.3758/BF03214136 Poeppel D, 1996, COGNITIVE BRAIN RES, V4, P231, DOI 10.1016/S0926-6410(96)00643-X POEPPEL D, 2000, HIGH FREQUENCY RESPO Poeppel D, 2001, COGNITIVE SCI, V25, P679, DOI 10.1016/S0364-0213(01)00050-7 Poppel E, 1997, TRENDS COGN SCI, V1, P56, DOI 10.1016/S1364-6613(97)01008-5 Proverbio AM, 1998, COGNITIVE BRAIN RES, V6, P321, DOI 10.1016/S0926-6410(97)00039-6 ROSEN S, 1992, PHILOS T ROY SOC B, V336, P367, DOI 10.1098/rstb.1992.0070 Ross ED, 1997, BRAIN LANG, V56, P27, DOI 10.1006/brln.1997.1731 Saberi K, 1999, NATURE, V398, P760, DOI 10.1038/19652 SCHILLER PH, 1990, TRENDS NEUROSCI, V13, P392, DOI 10.1016/0166-2236(90)90117-S Scott SK, 2000, BRAIN, V123, P2400, DOI 10.1093/brain/123.12.2400 Shamma S, 2001, TRENDS COGN SCI, V5, P340, DOI 10.1016/S1364-6613(00)01704-6 Shtyrov Y, 2000, NEUROIMAGE, V12, P657, DOI 10.1006/nimg.2000.0646 SINGER W, 1993, ANNU REV PHYSIOL, V55, P349, DOI 10.1146/annurev.physiol.55.1.349 TALLAL P, 1993, ANN NY ACAD SCI, V682, P27, DOI 10.1111/j.1749-6632.1993.tb22957.x THEUNISSEN F, 1995, J COMPUT NEUROSCI, V2, P149, DOI 10.1007/BF00961885 VIEMEISTER NF, 1991, J ACOUST SOC AM, V90, P858, DOI 10.1121/1.401953 VIEMEISTER NF, 1993, HUMAN PSYCHOPHYSICS Warren R. M., 1999, AUDITORY PERCEPTION Yabe H, 1997, NEUROREPORT, V8, P1971, DOI 10.1097/00001756-199705260-00035 Yost W. A., 1993, HUMAN PSYCHOPHYSICS Zatorre RJ, 1997, ACOUSTICAL SIGNAL PROCESSING IN THE CENTRAL AUDITORY SYSTEM, P453, DOI 10.1007/978-1-4419-8712-9_42 ZATORRE RJ, 1992, SCIENCE, V256, P846, DOI 10.1126/science.1589767 ZATORRE RJ, 1994, J NEUROSCI, V14, P1908 NR 57 TC 380 Z9 383 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 245 EP 255 DI 10.1016/S0167-6393(02)00107-3 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900020 ER PT J AU McQueen, JM Cutler, A Norris, D AF McQueen, JM Cutler, A Norris, D TI Flow of information in the spoken word recognition system SO SPEECH COMMUNICATION LA English DT Article DE spoken word recognition; levels of processing; feedback; phonetic categorization ID PHONETIC CATEGORIZATION; SPEECH-PERCEPTION; LEXICAL ACCESS; TRACE MODEL; PHONEMES; IDENTIFICATION; PROSODY; COARTICULATION; COMPENSATION; RESTORATION AB Spoken word recognition consists of two major component processes. First, at the prelexical stage, an abstract description of the utterance is generated from the information in the speech signal. Second, at the lexical stage, this description is used to activate all the words stored in the mental lexicon which match the input. These multiple candidate words then compete with each other. We review evidence which suggests that positive (match) and negative (mismatch) information of both a segmental and a suprasegmental nature is used to constrain this activation and competition process. We then ask whether, in addition to the necessary influence of the prelexical stage on the lexical stage, there is also feedback from the lexicon to the prelexical level. 
In two phonetic categorization experiments, Dutch listeners were asked to label both syllable-initial and syllable-final ambiguous fricatives (e.g., sounds ranging from [f] to [s]) in the word-nonword series maf-mas, and the nonword-word series jaf-jas. They tended to label the sounds in a lexically consistent manner (i.e., consistent with the word endpoints of the series). These lexical effects became smaller in listeners' slower responses, even when the listeners were put under pressure to respond as fast as possible. Our results challenge models of spoken word recognition in which feedback modulates the prelexical analysis of the component sounds of a word whenever that word is heard. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Max Planck Inst Psycholinguist, NL-6525 XD Nijmegen, Netherlands. MRC, Cognit & Brain Sci Unit, Cambridge CB2 2EF, England. RP McQueen, JM (reprint author), Max Planck Inst Psycholinguist, Wundtlaan 1, NL-6525 XD Nijmegen, Netherlands. EM james.mcqueen@mpi.nl; dennis.norris@mrc-cbu.cam.ac.uk RI McQueen, James/B-2212-2010; Cutler, Anne/C-9467-2012 CR BURTON MW, 1989, J EXP PSYCHOL HUMAN, V15, P567, DOI 10.1037//0096-1523.15.3.567 CONNINE CM, 1987, J EXP PSYCHOL HUMAN, V13, P291, DOI 10.1037/0096-1523.13.2.291 Connine CM, 1997, J MEM LANG, V37, P463, DOI 10.1006/jmla.1997.2535 Cutler A, 1997, LANG SPEECH, V40, P141 Cutler A, 2001, LANG SPEECH, V44, P171 CUTLER A, 1987, COGNITIVE PSYCHOL, V19, P141, DOI 10.1016/0010-0285(87)90010-7 CUTLER A, 1986, LANG SPEECH, V29, P201 Elman J. L., 1986, INVARIANCE VARIABILI, P360 ELMAN JL, 1988, J MEM LANG, V27, P143, DOI 10.1016/0749-596X(88)90071-X FOX RA, 1984, J EXP PSYCHOL HUMAN, V10, P526, DOI 10.1037//0096-1523.10.4.526 Frauenfelder U. H., 1998, LANGUAGE COMPREHENSI, P1 Frauenfelder UH, 2001, LANG COGNITIVE PROC, V16, P583 GANONG WF, 1980, J EXP PSYCHOL HUMAN, V6, P110, DOI 10.1037/0096-1523.6.1.110 Gaskell MG, 1997, LANG COGNITIVE PROC, V12, P613 Luce PA, 2000, PERCEPT PSYCHOPHYS, V62, P615, DOI 10.3758/BF03212113 MARSLENWILSON W, 1994, PSYCHOL REV, V101, P653, DOI 10.1037//0033-295X.101.4.653 MASSARO DW, 1989, COGNITIVE PSYCHOL, V21, P398, DOI 10.1016/0010-0285(89)90014-5 MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 McClelland J. 
L., 1987, ATTENTION PERFORM, V12, P3 MCQUEEN JM, 1991, J EXP PSYCHOL HUMAN, V17, P433, DOI 10.1037/0096-1523.17.2.433 MCQUEEN JM, 2001, P WORKSH SPEECH REC, P9 McQueen JM, 1999, J EXP PSYCHOL HUMAN, V25, P1363, DOI 10.1037//0096-1523.25.5.1363 MILLER JL, 1988, J EXP PSYCHOL HUMAN, V14, P369, DOI 10.1037/0096-1523.14.3.369 Newman RS, 1997, J EXP PSYCHOL HUMAN, V23, P873, DOI 10.1037/0096-1523.23.3.873 Norris D, 2000, BEHAV BRAIN SCI, V23, P299, DOI 10.1017/S0140525X00003241 NORRIS D, 1994, COGNITION, V52, P189, DOI 10.1016/0010-0277(94)90043-4 PITT MA, 1995, J EXP PSYCHOL LEARN, V21, P1037, DOI 10.1037/0278-7393.21.4.1037 PITT MA, 1993, J EXP PSYCHOL HUMAN, V19, P699, DOI 10.1037/0096-1523.19.4.699 Pitt MA, 1998, J MEM LANG, V39, P347, DOI 10.1006/jmla.1998.2571 Pylyshyn Z, 1999, BEHAV BRAIN SCI, V22, P341 RUBIN P, 1976, PERCEPT PSYCHOPHYS, V19, P394, DOI 10.3758/BF03199398 SAMUEL AG, 1981, J EXP PSYCHOL GEN, V110, P474, DOI 10.1037/0096-3445.110.4.474 Samuel AG, 1997, COGNITIVE PSYCHOL, V32, P97, DOI 10.1006/cogp.1997.0646 Samuel AG, 2001, PSYCHOL SCI, V12, P348, DOI 10.1111/1467-9280.00364 Samuel AG, 1996, J EXP PSYCHOL GEN, V125, P28 Soto-Faraco S, 2001, J MEM LANG, V45, P412, DOI 10.1006/jmla.2000.2783 NR 36 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2003 VL 41 IS 1 BP 257 EP 270 DI 10.1016/S0167-6393(02)00108-5 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 695QC UT WOS:000183840900021 ER PT J AU Mendoza, E Carballo, G Cruz, A Fresneda, MD Munoz, J Marrero, V TI Temporal variability in speech segments of Spanish: context and speaker related differences SO SPEECH COMMUNICATION LA English DT Article ID TO-VOWEL COARTICULATION; AMERICAN-ENGLISH; VERTICAL BAR; DURATION; LENGTH; CONSONANTS; FRENCH; DUTCH; EXPLANATION; CHILDREN AB This article reports on segmental duration measurements of eight selected consonants (voiceless obstruents, nasals and liquids) and three vowels in 192 disyllabic (CVCe) nonsense words with stress on the first syllable, spoken in isolation by 12 Spanish speakers. Durations, as measured from acoustic discontinuities, are discussed along with speaker variability. The intrinsic and context-dependent duration of consonants /f, theta, x, s, m, n, l, r/ and vowels /a, i, u/, as well as the inter-speaker variability of these phonemes were analysed. Results show sizable differences in the duration of consonants (voiceless fricatives are longer than voiced fricatives) and vowels (/a/ has a longer duration than /i/ and /u/). With regard to contextual effects, there is a remarkable decrease and increase in vowel durations preceding voiceless fricatives and sonorants, respectively. These effects are present in all speakers. Our results on durational effects indicate that (a) the initial consonants /x, s/ and /r/ show larger differences among speakers; (b) effects for the vowel /a/ are greater than for the vowels /i/ and /u/; and (c) voiceless fricative consonants in medial position show greater intraspeaker idiosyncrasy than voiced consonants. The effects of anticipatory consonant-to-vowel coarticulation are discussed, as well as differences in segmental duration among speakers. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Granada, Fac Psicol Evaluac & Tratamiento Psicol, Dept Personalidad, E-18071 Granada, Spain.
Univ Nacl Educ Distancia, Dept Lengua Espanola, Madrid 28040, Spain. RP Mendoza, E (reprint author), Univ Granada, Fac Psicol Evaluac & Tratamiento Psicol, Dept Personalidad, Campus Cartuja S-N, E-18071 Granada, Spain. EM emendoza@ugr.es RI Fresneda Lopez, M Dolores/G-3440-2011 CR ANTONIADIS Z, 1984, PHONETICA, V41, P72 BARTKOVA K, 1988, P SPEECH 88 7 FASE S, P763 Baum SR, 1998, BRAIN LANG, V63, P357, DOI 10.1006/brln.1997.1938 BORZONEDEMANRIQ.AM, 1983, J PHONETICS, V11, P117 Braunschweiler N, 1997, LANG SPEECH, V40, P353 Carballo G, 2000, CLIN LINGUIST PHONET, V14, P587 Carballo G, 1997, PERCEPT MOTOR SKILL, V84, P1099 CARBALLO G, 1995, THESIS U GRANADA SPA, P103 CELDRAN EM, 1989, ESTUDIOS FONETICA EX, V1, P73 CRYSTAL TH, 1988, J PHONETICS, V16, P263 CRYSTAL TH, 1988, J PHONETICS, V16, P285 CRYSTAL TM, 1988, J ACOUST SOC AM, V85, P1553 CUENCA MH, 1996, PHILOLOGIA HISPALENS, V11, P295 DANILOFF R, 1980, PHYSL SPEECH HEARING, P219 Daniloff R. G., 1973, J PHONETICS, V1, P239 DEFIOR S, 1996, DIFICULTADES APRENDI, P63 Delattre P, 1965, COMP PHONETIC FEATUR DELBARRIO L, 1999, LINGUISTICA ESPANOLA, V21, P99 FARNETANI E, 1993, LANG SPEECH, V36, P279 Fowler CA, 2000, LANG SPEECH, V43, P1 Hertrich I, 1999, J SPEECH LANG HEAR R, V42, P367 HERTRICH I, 1995, LANG SPEECH, V38, P157 HOOLE P, 1993, LANG SPEECH, V36, P235 HOUSE AS, 1997, SPEECH PRODUCTION LA JOHNSON CC, 1984, J PHONETICS, V12, P319 Jongman A, 1998, J PHONETICS, V26, P207, DOI 10.1006/jpho.1998.0075 KLUENDER KR, 1988, J PHONETICS, V16, P153 KOHLER KJ, 1984, PHONETICA, V41, P150 Ladefoged Peter, 1993, COURSE PHONETICS LAEUFER C, 1992, J PHONETICS, V20, P411 Lehiste I., 1970, SUPRASEGMENTALS MARIN R, 1994, ESTUDIOS LINGUISTICA, V10, P213 Navarro Tomas T., 1918, REV FILOL ESPAN, V5, P367 NOOTEBOO.SG, 1972, LANG SPEECH, V15, P301 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 OSHAUGHNESSY D, 1981, J PHONETICS, V9, P385 OSHAUGHNESSY D, 1984, J ACOUST SOC AM, V76, P1664, DOI 10.1121/1.391613 OSHAUGHNESSY D, 1987, SPEECH COMMUN, P39 PISONI D, 1990, 16 IND U, P169 PORT RF, 1980, PHONETICA, V37, P235 Quilis A., 1979, LINGUISTICA ESPANOLA, V1, P233 SAIZ MG, 1993, ESTUDIOS FONETICA EX, V5, P189 UMEDA N, 1977, J ACOUST SOC AM, V61, P846, DOI 10.1121/1.381374 vandenHeuvel H, 1996, SPEECH COMMUN, V18, P113, DOI 10.1016/0167-6393(95)00039-9 VANDENHEUVEL H, 1994, J PHONETICS, V22, P389 VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5 VANSANTEN JPH, 1992, J ACOUST SOC AM, V2, P2444 WAHLEN DH, 1990, J PHONETICS, V18, P3 WALSH T, 1981, J PHONETICS, V9, P305 ZIMMERMAN SA, 1958, J ACOUST SOC AM, V30, P152, DOI 10.1121/1.1909521 NR 50 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2003 VL 40 IS 4 BP 431 EP 447 DI 10.1016/S0167-6393(02)00086-9 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 679ZZ UT WOS:000182953000001 ER PT J AU Malayath, N Hermansky, H AF Malayath, N Hermansky, H TI Data-driven spectral basis functions for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE data-driven feature extraction; linear discriminant analysis AB Feature extraction plays a major role in any form of pattern recognition. 
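The abstract that follows contrasts cepstra obtained with fixed discrete-cosine bases against basis functions designed by linear discriminant analysis (LDA) of logarithmic critical-band spectra. A minimal sketch of that LDA design step, assuming frame-level log band energies and phone-class labels are available (the array names and labelling scheme are illustrative, not taken from the paper):

    import numpy as np

    def lda_spectral_bases(X, labels, k):
        # X: (N, d) log critical-band energies, one row per frame
        # labels: (N,) phone-class label per frame; k: bases to keep
        mu = X.mean(axis=0)
        d = X.shape[1]
        Sw = np.zeros((d, d))  # within-class scatter
        Sb = np.zeros((d, d))  # between-class scatter
        for c in np.unique(labels):
            Xc = X[labels == c]
            mc = Xc.mean(axis=0)
            Sw += (Xc - mc).T @ (Xc - mc)
            dm = (mc - mu)[:, None]
            Sb += len(Xc) * (dm @ dm.T)
        evals, evecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
        order = np.argsort(-evals.real)
        return evecs.real[:, order[:k]]  # columns replace the DCT bases

Discriminant features would then be obtained by projecting each log spectrum onto these columns, at exactly the point in the front end where the discrete cosine transform would otherwise be applied.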
Current feature extraction methods used for automatic speech recognition (ASR) and speaker verification rely mainly on properties of speech production (modeled by all-pole filters) and perception (critical-band integration simulated by Mel/Bark filter bank). We propose to use stochastic methods for designing feature extraction methods which are trained to alleviate the unwanted variability present in speech signals. In this paper we show that such data-driven methods provide significant advantages over the conventional methods both in terms of performance of ASR and in providing understanding of the nature of the speech signal. The first part of the paper investigates the suitability of the cepstral features obtained by applying discrete cosine transform on logarithmic critical-band power spectra. An alternate set of basis functions was designed by linear discriminant analysis (LDA) of logarithmic critical-band power spectra. Discriminant features extracted by these alternate basis functions are shown to outperform the cepstral features in ASR experiments. The second part of the paper discusses the relevance of non-uniform frequency resolution used by current speech analysis methods like Mel frequency analysis and perceptual linear predictive analysis. It is shown that LDA of the short-time Fourier spectrum of speech yields spectral basis functions which provide comparatively lower resolution to the high-frequency region of the spectrum. This is consistent with critical-band resolution and is shown to be caused by the spectral properties of vowel sounds. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Qualcomm Inc, San Diego, CA 92121 USA. Oregon Hlth & Sci Univ, OGI Sch Sci & Technol, Portland, OR USA. Int Comp Sci Inst, Berkeley, CA 94704 USA. RP Malayath, N (reprint author), Qualcomm Inc, AA-318V,5775 Morehouse Dr, San Diego, CA 92121 USA. EM nmalayat@qualcomm.com; hynek@ece.ogi.edu CR Avendano C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607213 AYER CM, 1993, P EUR C SPEECH COMM, P583 BATLLE E, 1998, P INT C SPOK LANG PR, P951 Brown P., 1987, THESIS CARNEGIE MELL Cole R., 1995, P EUR C SPEECH COMM, P821 COLE RA, 1994, P INT C SPOK LANG PR DODDINGTON G, 1989, P IEEE INT C AC SPEE, P556 Eisele T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607092 FLANAGAN JL, 1955, J ACOUST SOC AM, V27, P613, DOI 10.1121/1.1907979 Fukunaga K., 1990, STAT PATTERN RECOGNI, V2nd FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hermansky H., 1998, P INT C SPOK LANG PR, P1379 HUNT M, 1999, AUTOMATIC SPEECH REC HUNT M, 1979, J ACOUST SOC AM, V66, P35 Hunt M., 1989, P ICASSP, P262 HUNT MJ, 1988, P IEEE ICASSP, P215 HUNT MJ, 1991, P IEEE INT C AC SPEE, P881, DOI 10.1109/ICASSP.1991.150480 JAIN AK, 1979, IEEE T PATTERN ANAL, V1, P356 JELINEK F, 1975, IEEE T INFORM THEORY, V21, P256 Jelinek F., 1997, STAT METHODS SPEECH KAJAREKAR S, 1999, P EUROSPEECH BUD HUN, P343 KAMM T, 1997, P CLSP SUMM WORKSH J KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 Kil D.
H., 1996, PATTERN RECOGNITION KLATT DH, 1977, J ACOUST SOC AM, V62, P1345, DOI 10.1121/1.381666 Kumar N, 1998, SPEECH COMMUN, V26, P283, DOI 10.1016/S0167-6393(98)00061-2 LESSER VR, 1975, IEEE T ACOUST SPEECH, VAS23, P11, DOI 10.1109/TASSP.1975.1162648 MALAYATH N, 1997, P EUROSPEECH 97 GREE, P497 MALAYATH N, 2000, THESIS OREGON GRADUA MERCIER G, 1990, PATTERN RECOGN, P225 MERHAV N, 1993, IEEE T SIGNAL PROCES, V41 Mermelstein P., 1976, PATTERN RECOGN, P374 MORGAN N, 1995, P IEEE, V83, P742, DOI 10.1109/5.381844 NOLL AM, 1967, J ACOUST SOC AM, V41, P293, DOI 10.1121/1.1910339 OPPENHEI.AV, 1968, IEEE T ACOUST SPEECH, VAU16, P221, DOI 10.1109/TAU.1968.1161965 Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE Rabiner L, 1993, FUNDAMENTALS SPEECH REDDY DR, 1976, P IEEE, V64, P502 SAKOE H, 1978, IEEE T ACOUST SPEECH, V26, P43, DOI 10.1109/TASSP.1978.1163055 SCHAFER RW, 1990, PATTERN RECOGN, P49 SCHUKATTALAMAZZ.E, 1995, P ICASSP 95, P369 SHAMMA SS, 1995, IEEE T SPEECH AUDIO, V3, P382 Stevens K.N., 1998, ACOUSTIC PHONETICS SUN DX, 1995, P INT C AC SPEECH SI, P201 UMESH S, 1997, P ICASSP 97 MUN GERM, P983 van Vuuren S., 1997, P EUR, P409 Vintsyuk T.K., 1968, KIBERNETIKA, V4, P81 WOODLAND PC, 1991, P ICASSP 91, P545, DOI 10.1109/ICASSP.1991.150397 YANG H, 1999, P ICASSP, P225 ZUE VW, 1990, PATTERN RECOGN, P200 ZWICKER E, 1957, J ACOUST SOC AM, V29, P548, DOI 10.1121/1.1908963 NR 53 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2003 VL 40 IS 4 BP 449 EP 466 DI 10.1016/S0167-6393(02)00127-9 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 679ZZ UT WOS:000182953000002 ER PT J AU Flege, JE Schirru, C MacKay, IRA TI Interaction between the native and second language phonetic subsystems SO SPEECH COMMUNICATION LA English DT Article DE bilingualism; second language acquisition; vowel production; language interaction; tongue movement; English; Italian ID AMERICAN ENGLISH VOWELS; DYNAMIC SPECIFICATION; FOREIGN ACCENT; 2ND LANGUAGE; L1 USE; PERCEPTION; AGE; SPEAKERS; SPANISH; ACQUISITION AB The underlying premise of this study was that the two phonetic subsystems of a bilingual interact. The study tested the hypothesis that the vowels a bilingual produces in a second language (L2) may differ from vowels produced by monolingual native speakers of the L2 as the result of either of two mechanisms: phonetic category assimilation or phonetic category dissimilation. Earlier work revealed that native speakers of Italian identify English /e(I)/ tokens as instances of the Italian /e/ category even though English /e(I)/ is produced with more tongue movement than Italian /e/ is. Acoustic analyses in the present study examined /e(I)/s produced by four groups of Italian-English bilinguals who differed according to their age of arrival in Canada from Italy (early versus late) and frequency of continued Italian use (low-L1-use versus high-L1-use). Early bilinguals who seldom used Italian (Early-low) were found to produce English /e(I)/ with significantly more movement than native English speakers. However, both groups of late bilinguals (Late-low, Late-high) tended to produce /e(I)/ with less movement than NE speakers. The exaggerated movement in /e(I)/s produced by the Early-low group participants was attributed to the dissimilation of a phonetic category they formed for English /e(I)/ from Italian /e/.
The undershoot of movement in /e(I)/s produced by late bilinguals, on the other hand, was attributed to their failure to establish a new category for English /e(I)/, which led to the merger of the phonetic properties of English /e(I)/ and Italian /e/. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Alabama Birmingham, Div Speech & Hearing Sci, Birmingham, AL 35294 USA. Univ Padua, Dept Linguist, I-35137 Padua, Italy. Univ Ottawa, Dept Linguist, Ottawa, ON K1N 6N5, Canada. RP Flege, JE (reprint author), Univ Alabama Birmingham, Div Speech & Hearing Sci, CH20,Room 119,1530 3rd Ave, Birmingham, AL 35294 USA. EM jeflege@uab.edu CR BAHRICK HP, 1994, J EXP PSYCHOL GEN, V123, P264, DOI 10.1037/0096-3445.123.3.264 BAKER W, 2002, BUCLD, V26 Bohn O.-S., 1992, STUDIES 2 LANGUAGE A, V14, P131, DOI 10.1017/S0272263100010792 Conover WJ, 1980, PRACTICAL NONPARAMET, V2 DELATTRE P, 1964, IRAL-INT REV APPL LI, V2, P71, DOI 10.1515/iral.1964.2.1.71 Flege J., 1998, 34 ANN M CHIC LING S, VII, P213 FLEGE JE, 1989, LANG SPEECH, V32, P123 Flege J. E., 2002, INTEGRATED VIEW LANG Flege J. E., 1995, SPEECH PERCEPTION LI, P233 FLEGE JE, 1991, J ACOUST SOC AM, V89, P395, DOI 10.1121/1.400473 Flege J. E., 1992, PHONOLOGICAL DEV MOD, P565 Flege James Emil, 2001, STUDIES 2 LANGUAGE A, V23, P527 FLEGE JE, 1987, J PHONETICS, V15, P67 FLEGE JE, 1995, SPEECH COMMUN, V16, P1, DOI 10.1016/0167-6393(94)00044-B FLEGE JE, UNPUB CONSTRAINTS PE Flege JE, 1999, SEC LANG ACQ RES, P101 Flege JE, 1999, J MEM LANG, V41, P78, DOI 10.1006/jmla.1999.2638 FLEGE JE, 1981, LANG SPEECH, V24, P125 FLEGE JE, 1995, J ACOUST SOC AM, V97, P3125, DOI 10.1121/1.413041 FLEGE JE, 1986, LANG SPEECH, V29, P361 Flege JE, 1997, J PHONETICS, V25, P169, DOI 10.1006/jpho.1996.0040 FLEGE JE, 1988, J ACOUST SOC AM, V83, P729, DOI 10.1121/1.396115 Flege JE, 1999, J ACOUST SOC AM, V106, P2973, DOI 10.1121/1.428116 FLEGE JE, 1987, J PHONETICS, V15, P47 Gardner R.
C., 1972, ATTITUDES MOTIVATION GAY T, 1968, J ACOUST SOC AM, V44, P1570, DOI 10.1121/1.1911298 GRENIER G, 1984, SOC SCI QUART, V65, P537 Grosjean F., 1997, TUTORIALS BILINGUALI, P225 Grosjean F, 1982, LIFE 2 LANGUAGES GROSJEAN F, 1989, BRAIN LANG, V36, P3, DOI 10.1016/0093-934X(89)90048-5 GROSSER J, 1999, COMMENT MOD PHYS, V1, P117 Guion SG, 2000, J PHONETICS, V28, P27, DOI 10.1006/jpho.2000.0104 Halle PA, 1999, J PHONETICS, V27, P281, DOI 10.1006/jpho.1999.0097 Hazan V, 2000, J PHONETICS, V28, P377, DOI 10.1006/jpho.2000.0121 HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 Jia G, 1999, PROC ANN BUCLD, P301 Johnson C, 2000, J SPEECH LANG HEAR R, V43, P129 Kluender KR, 1998, J ACOUST SOC AM, V104, P3568, DOI 10.1121/1.423939 LAMBERT WE, 1969, J VERB LEARN VERB BE, V8, P604, DOI 10.1016/S0022-5371(69)80111-8 Lee S, 1999, J ACOUST SOC AM, V105, P1455, DOI 10.1121/1.426686 LINDBLOM B, 1998, APPROACHES EVOLUTION Liu S., 2000, BILING-LANG COGN, V3, P131, DOI DOI 10.1017/S1366728900000225 MACK M, 1989, PERCEPT PSYCHOPHYS, V46, P187, DOI 10.3758/BF03204982 Mack M, 2003, MIND BRAIN LANGUAGE Mack M., 1995, IDEAL, V8, P23 MACK M, 1990, LANGUAGE ATTITUDES L MacKay IRA, 2001, J ACOUST SOC AM, V110, P516, DOI 10.1121/1.1377287 MacKay IRA, 2001, PHONETICA, V58, P103, DOI 10.1159/000028490 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 Meador D, 2000, BILING-LANG COGN, V3, P55, DOI 10.1017/S1366728900000134 MUNRO MJ, 1993, LANG SPEECH, V36, P39 Munro MJ, 1996, APPL PSYCHOLINGUIST, V17, P313, DOI 10.1017/S0142716400007967 NEAREY TM, 1989, J ACOUST SOC AM, V85, P2088, DOI 10.1121/1.397861 PARADIS M, 1978, ASPECTS BILINGUALISM, P165 PARNELL MM, 1978, J SPEECH HEAR RES, V21, P682 Patkowski M., 1989, APPL LINGUIST, V11, P73 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 Piske T, 2001, J PHONETICS, V29, P191, DOI 10.1006/jpho.2001.0134 Piske T, 2002, PHONETICA, V59, P49, DOI 10.1159/000056205 Sancier ML, 1997, J PHONETICS, V25, P421, DOI 10.1006/jpho.1997.0051 Scovel T., 1988, TIME SPEAK PSYCHOLIN Sebastian-Galles N, 1999, COGNITION, V72, P111, DOI 10.1016/S0010-0277(99)00024-4 Stevens G, 1999, LANG SOC, V28, P555 STRANGE W, 1989, J ACOUST SOC AM, V85, P2135, DOI 10.1121/1.397863 Strange W, 1998, J ACOUST SOC AM, V104, P488, DOI 10.1121/1.423299 SYRDAL AK, 1986, J ACOUST SOC AM, V79, P1086, DOI 10.1121/1.393381 Walley AC, 1999, J PHONETICS, V27, P307, DOI 10.1006/jpho.1999.0098 Yamada R. A., 1995, SPEECH PERCEPTION LI, P305 NR 68 TC 99 Z9 99 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2003 VL 40 IS 4 BP 467 EP 491 DI 10.1016/S0167-6393(02)00128-0 PG 25 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 679ZZ UT WOS:000182953000003 ER PT J AU Bolia, RS Slyh, RE AF Bolia, RS Slyh, RE TI Perception of stress and speaking style for selected elements of the SUSAS database SO SPEECH COMMUNICATION LA English DT Article DE SUSAS; stressed speech; speech perception; speech recognition ID SPEECH AB The Speech Under Simulated and Actual Stress (SUSAS) database is a collection of utterances recorded under conditions of simulated or actual stress, the purpose of which is to allow researchers to study the effects of stress and speaking style on the speech waveform. 
The aim of the present investigation was to assess the perceptual validity of the simulated portion of the database by determining the extent to which listeners classify its utterances according to their assigned labels. Seven listeners performed an eight-alternative, forced-choice task, judging whether monosyllabic or disyllabic words spoken by talkers from three different regional accent classes (Boston, General American, New York) were best classified as angry, clear, fast, loud, neutral, question, slow, or soft. Mean percentages of "correct" judgments were analysed using a 3 (regional accent class) x 2 (number of syllables) x 8 (speaking style) repeated measures analysis of variance. Results indicate that, overall, listeners correctly classify the utterances only 58% of the time, and that the percentage of correct classifications varies as a function of all three independent variables. (C) 2002 Elsevier Science B.V. All rights reserved. C1 USAF, Res Lab, HECP, Wright Patterson AFB, OH 45433 USA. RP Bolia, RS (reprint author), USAF, Res Lab, HECP, 2255 H St, Wright Patterson AFB, OH 45433 USA. EM robert.bolia@wpafb.af.mil CR ABELIN A, 2000, P ISCA WORKSH SPEECH *AM NAT STAND I, 1989, SPEC AUD Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 BouGhazale SE, 1996, SPEECH COMMUN, V20, P93, DOI 10.1016/S0167-6393(96)00047-7 Bou-Ghazale SE, 1998, IEEE T SPEECH AUDI P, V6, P201, DOI 10.1109/89.668815 BOUGHAZALE SE, 1995, P EUROSPEECH, P455 Cummings K, 1990, P IEEE INT C AC SPEE, V1, P369 Hansen J., 1997, EUROSPEECH 97, V4, P1743 Hansen J. H. L., 1988, THESIS GEORGIA I TEC HANSEN JHL, 2000, RTOTR10AC323ISTTP5 N Lippmann R. P., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) MEJVALDOVA J, 1999, RJC PAR 99 RENC JEUN PAUL DB, 1986, P DARPA WORKSH SPEEC, P81 SCHERER KR, 1984, J ACOUST SOC AM, V76, P1346, DOI 10.1121/1.391450 SLYH RE, 1999, P IEEE INT C AC SPEE, V4 TOLKMITT FJ, 1986, J EXP PSYCHOL HUMAN, V12, P302, DOI 10.1037//0096-1523.12.3.302 Womack BD, 1996, SPEECH COMMUN, V20, P131, DOI 10.1016/S0167-6393(96)00049-0 NR 17 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2003 VL 40 IS 4 BP 493 EP 501 DI 10.1016/S0167-6393(02)00129-2 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 679ZZ UT WOS:000182953000004 ER PT J AU Malfrere, F Deroo, O Dutoit, T Ris, C TI Phonetic alignment: speech synthesis-based vs. Viterbi-based SO SPEECH COMMUNICATION LA English DT Article DE speech segmentation; hidden Markov models; hybrid HMM/ANN systems; speech synthesis; large speech corpora AB In this paper we compare two different methods for automatically phonetically labeling a continuous speech database, as usually required for designing a speech recognition or speech synthesis system. The first method is based on temporal alignment of speech on a synthetic speech pattern; the second method uses either a continuous density hidden Markov model (HMM) or a hybrid HMM/ANN (artificial neural network) system in forced alignment mode. Both systems have been evaluated on read utterances not part of the training set of the HMM systems, and compared to manual segmentation. This study outlines the advantages and drawbacks of both methods.
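The first, synthesis-based method rests on temporally aligning the natural utterance against a synthetic rendition of the same text, whose phone boundaries are known from the synthesiser. A minimal dynamic-time-warping sketch of such an alignment over per-frame feature vectors (the paper's exact local constraints and distance measure are not reproduced here, so Euclidean distance and symmetric steps are assumptions):

    import numpy as np

    def dtw_align(nat, syn):
        # nat: (N, d) features of the natural utterance
        # syn: (M, d) features of the synthetic rendition
        N, M = len(nat), len(syn)
        dist = np.linalg.norm(nat[:, None, :] - syn[None, :, :], axis=2)
        D = np.full((N + 1, M + 1), np.inf)
        D[0, 0] = 0.0
        for i in range(1, N + 1):
            for j in range(1, M + 1):
                D[i, j] = dist[i - 1, j - 1] + min(
                    D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
        path, i, j = [], N, M  # backtrack the warping path
        while i > 0 and j > 0:
            path.append((i - 1, j - 1))
            step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
            if step == 0:
                i, j = i - 1, j - 1
            elif step == 1:
                i -= 1
            else:
                j -= 1
        return path[::-1]

Mapping the synthesiser's known boundary frames through the returned path yields phone boundaries on the natural signal, which is why no trained acoustic model is needed.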
The speech synthesis-based system has the great advantage that no training stage (hence no large labeled database) is needed, while HMM systems easily handle multiple phonetic transcriptions (phonetic lattice). We deduce a method for the automatic creation of large phonetically labeled speech databases, based on using the synthetic speech segmentation tool to bootstrap the training process of either an HMM or a hybrid HMM/ANN system. The importance of such segmentation tools is a key point for the development of improved multilingual speech synthesis and recognition systems. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Fac Polytech Mons, TCTS, B-7000 Mons, Belgium. Babel Technol SA, B-7000 Mons, Belgium. RP Malfrere, F (reprint author), Fac Polytech Mons, TCTS, 31 Bld Dolez, B-7000 Mons, Belgium. EM malfrere@babeltech.com; deroo@tcts.fpms.ac.be; dutoit@tcts.fpms.ac.be; ris@tcts.fpms.ac.be CR BAHL LR, 1995, P ICASSP, P41 BAKER JK, 1975, IEEE T ACOUST SPEECH, VAS23, P24, DOI 10.1109/TASSP.1975.1162650 Baum L. E., 1972, INEQUALITIES, V3, P1 Bourlard Ha, 1994, CONNECTIONIST SPEECH BRUGNARA B, 1993, SPEECH COMMUN, P357 CARRE R, 1984, P INT C AC SPEECH SI Cosi P., 1991, P EUROSPEECH 91, P693 DEROO O, 1998, P EUR C SIGN PROC EU, P1161 DEVILLE G, 1999, P EUR C SPEECH COMM, P843 DUPONT S, 1997, P EUR C SPEECH COMM, P1947 Dutoit T, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1393 FRANCO H, 1994, COMPUT SPEECH LANG, P211 Hermansky H., 1990, J ACOUST SOC AM HOCHBERG MM, 1995, SPOKEN LANGUAGE SYST, P170 HORAK P, 2001, IMPROVEMENTS SPEECH, P331 Hunt A. J., 1996, P ICASSP 96, P373 JELINEK F, 1976, P IEEE, V64, P532, DOI 10.1109/PROC.1976.10159 JUANG BH, 1986, P IEEE 1986 INT C AC, P765 KOEHLER J, 1994, P INT C AC SPEECH SI, P2421 Lamel L. F., 1991, P EUR C SPEECH COMM, P505 Lee L, 1998, IEEE T SPEECH AUDI P, V6, P49, DOI 10.1109/89.650310 LENZO K, 2000, P INT C SPEECH LANG LEUNG HC, 1984, P INT C AC SPEECH SI Ljolje A., 1991, P INT C AC SPEECH SI, P473, DOI 10.1109/ICASSP.1991.150379 Malfrere F., 1997, P EUR C SPEECH COMM, P2631 MYERS CS, 1981, P INT C AC SPEECH SI PAUL DB, 1992, DARPA SPEECH LANG WO Rabiner L, 1993, FUNDAMENTALS SPEECH ROBINSON AJ, 1994, P IEEE T NEUR NETW, P298 ROBINSON AJ, 1991, COMPUT SPEECH LANG, P257 RUSSELL MJ, 1990, P IEEE INT C AC SPEE, P69 TALKIN D, 1996, P 2 ESCA IEEE WORKSH, P89 Traber C., 1995, THESIS ETH ZURICH VANCOILE B, 1994, P ICSLP 94 WOODLAND PC, 1995, P ICASSP, V1, P73 ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 NR 36 TC 24 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2003 VL 40 IS 4 BP 503 EP 515 DI 10.1016/S0167-6393(02)00131-0 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 679ZZ UT WOS:000182953000005 ER PT J AU Kessens, JM Cucchiarini, C Strik, H TI A data-driven method for modeling pronunciation variation SO SPEECH COMMUNICATION LA English DT Article DE pronunciation variation; data-driven; rule-based; speech recognition; error analysis; rule selection AB This paper describes a rule-based data-driven (DD) method to model pronunciation variation in automatic speech recognition (ASR). The DD method consists of the following steps. First, the possible pronunciation variants are generated by making each phone in the canonical transcription of the word optional.
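As an illustration of this first step, making every one of the n phones optional yields up to 2**n - 1 non-empty candidate variants; a sketch (a real system would presumably prune this set before running forced recognition over it):

    from itertools import product

    def pronunciation_variants(canonical):
        # canonical: list of phone symbols, e.g. ['n', 'a:', 't']
        out = []
        for mask in product((True, False), repeat=len(canonical)):
            variant = [p for p, keep in zip(canonical, mask) if keep]
            if variant:  # discard the fully deleted pronunciation
                out.append(variant)
        return out

For example, pronunciation_variants(['n', 'a:', 't']) returns seven variants, from the full canonical form down to each single phone.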
Next, forced recognition is performed in order to determine which variant best matches the acoustic signal. Finally, the rules are derived by aligning the best matching variant with the canonical transcription of the variant. Error analysis is performed in order to gain insight into the process of pronunciation modeling. This analysis shows that although modeling pronunciation variation brings about improvements, deteriorations are also introduced. A strong correlation is found between the number of improvements and deteriorations per rule. This result indicates that it is not possible to improve ASR performance by excluding the rules that cause deteriorations, because these rules also produce a considerable number of improvements. Finally, we compare three different criteria for rule selection. This comparison indicates that the absolute frequency of rule application (F-abs) is the most suitable criterion for rule selection. For the best testing condition, a statistically significant reduction in word error rate (WER) of 1.4% absolutely, or 8% relatively, is found. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Nijmegen, Dept Language & Speech, A2RT, NL-5600 HD Nijmegen, Netherlands. RP Kessens, JM (reprint author), Univ Nijmegen, Dept Language & Speech, A2RT, POB 9103, NL-5600 HD Nijmegen, Netherlands. EM kessens@let.kun.nl; c.cucchiarini@let.kun.nl; w.strik@let.kun.nl CR AMDALL I, 2000, P ICSLP 00, V3, P622 Cremelie N, 1999, SPEECH COMMUN, V29, P115, DOI 10.1016/S0167-6393(99)00034-5 Booij Geert, 1995, PHONOLOGY DUTCH Fosler-Lussier E., 1999, THESIS U CALIFORNIA Fukada T, 1999, SPEECH COMMUN, V27, P63, DOI 10.1016/S0167-6393(98)00066-1 Holter T, 1999, SPEECH COMMUN, V29, P177, DOI 10.1016/S0167-6393(99)00036-9 KERKHOFF J, 1994, P DEP LANG SPEECH U, V18, P107 Kessens JM, 1999, SPEECH COMMUN, V29, P193, DOI 10.1016/S0167-6393(99)00048-5 KESSENS JM, 2000, PHONUS 5 P WORKSH PH, P117 LEHTINEN G, 1998, P ESCA WORKSH MOD PR, P67 RAVISHANKAR M, 1997, P EUR 97 RHOD GREEC, V5, P467 Riley M, 1999, SPEECH COMMUN, V29, P209, DOI 10.1016/S0167-6393(99)00037-0 SCHIEL FA, 1998, P ESCA TUTORIAL RES, P131 STEINBISS V, 1993, P ESCA 3 EUR C SPEEC, P2125 STRIK H, 1997, J SPEECH TECHNOL, V2, P119 Strik H, 1999, SPEECH COMMUN, V29, P225, DOI 10.1016/S0167-6393(99)00038-2 STRIK H, 2001, P ITRW AD METH SPEEC, P123 Wester M., 2000, P ICSLP 00, V4, P488 WESTER M, 1998, P ESCA WORKSH MOD PR, P145 Wester M, 2001, LANG SPEECH, V44, P377 Wester M., 2000, P ICSLP 00 BEIJ CHIN, V4, P270 WILIAMS G, 1999, THESIS U SHEFFIELD S WILLIAMS G, 1998, P WORKSH MOD PRON VA, P151 Yang Q, 2000, P 11 PRORISC WORKSH, P589 YANG Q, 2000, P ICSLP 00 BEIJ CHIN, V1, P417 NR 25 TC 12 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2003 VL 40 IS 4 BP 517 EP 534 DI 10.1016/S0167-6393(02)00150-4 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 679ZZ UT WOS:000182953000006 ER PT J AU Sakurai, A Hirose, K Minematsu, N AF Sakurai, A Hirose, K Minematsu, N TI Data-driven generation of F-0 contours using a superpositional model SO SPEECH COMMUNICATION LA English DT Article DE text-to-speech synthesis; F-0 contour generation; superpositional model ID INFORMATION; INTONATION; SPEECH AB This paper introduces a novel model-constrained, data-driven method to generate fundamental frequency contours for Japanese text-to-speech synthesis. 
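Judging by the Fujisaki entries in this record's reference list, the superpositional model in question is the command-response model: phrase and accent commands are passed through critically damped second-order filters and summed with a base value in the log-F-0 domain. A minimal sketch with typical, assumed time constants (not values from the paper):

    import numpy as np

    ALPHA, BETA, GAMMA = 3.0, 20.0, 0.9  # assumed typical constants

    def phrase_resp(t, a=ALPHA):
        # impulse response of the phrase-control mechanism (0 for t < 0)
        tt = np.maximum(t, 0.0)
        return a * a * tt * np.exp(-a * tt)

    def accent_resp(t, b=BETA, g=GAMMA):
        # step response of the accent-control mechanism, ceiling g
        tt = np.maximum(t, 0.0)
        return np.minimum(1 - (1 + b * tt) * np.exp(-b * tt), g)

    def f0_contour(t, fb, phrases, accents):
        # ln F0(t) = ln Fb + sum Ap*Gp(t-T0) + sum Aa*(Ga(t-T1) - Ga(t-T2))
        lnf0 = np.full_like(t, np.log(fb), dtype=float)
        for Ap, T0 in phrases:
            lnf0 += Ap * phrase_resp(t - T0)
        for Aa, T1, T2 in accents:
            lnf0 += Aa * (accent_resp(t - T1) - accent_resp(t - T2))
        return np.exp(lnf0)

The prediction module described next would then output the base frequency Fb, the command magnitudes Ap and Aa, and the timings T0, T1 and T2 from the linguistic features.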
In the training phase, the relationship between linguistic features and the parameters of a command-response F-0 contour generation model is learned by a prediction module, which is represented by either a neural network or a set of binary regression trees. Input features consist of linguistic information related to accentual phrases that can be automatically derived from text, such as the position of the accentual phrase in the utterance, number of morae, accent type, and morphological information. In the synthesis phase, the prediction module is used to generate appropriate values of model parameters. The use of the parametric model restricts the degrees of freedom of the problem to facilitate the mapping between linguistic and prosodic features. Experimental results show that the method makes it possible to generate quite natural F-0 contours with a relatively small training database. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Texas Instruments Japan, DCES Software Lab, Tsukuba, Ibaraki 3050841, Japan. Univ Tokyo, Grad Sch Frontier Sci, Bunkyo Ku, Tokyo 1138656, Japan. Univ Tokyo, Grad Sch Informat Sci & Technol, Bunkyo Ku, Tokyo 1138656, Japan. RP Sakurai, A (reprint author), Texas Instruments Japan, DCES Software Lab, Miyukigaoka 17, Tsukuba, Ibaraki 3050841, Japan. EM a-sakurai@ti.com CR AUBERGE V, 1992, TALKING MACHINES THE, P307 Breiman L, 1984, CLASSIFICATION REGRE CAMPBELL N, 1995, FALL M AC SOC JPN, P317 Chen SH, 1998, IEEE T SPEECH AUDI P, V6, P226 Fujisaki H, 1986, P IEEE INT C AC SPEE, P2039 Fujisaki H., 1984, Journal of the Acoustical Society of Japan (E), V5 FUJISAKI H, 2000, P ICSLP 2000 Fujisaki H., 1998, P 3 ESCA COCOSDA INT, P299 FUJISAKI H, 1993, IEICE T FUND ELECTR, VE76A, P1919 FUJISAKI H, 1999, P 1999 INT C SPEECH, V1, P19 HIRAI T, 1996, PROG SPEECH SYNTH, V28, P333 HIROSE K, 1993, IEICE T FUND ELECTR, VE76A, P1971 HIRST DJ, 1991, P 12 INT C PHON SCI, V5, P234 HOLM B, 2000, P ICSLP 2000 Imai S., 1978, Proceedings of the 1978 IEEE International Conference on Acoustics, Speech and Signal Processing Jilka M, 1999, SPEECH COMMUN, V28, P83, DOI 10.1016/S0167-6393(99)00008-4 Morlec Y, 2001, SPEECH COMMUN, V33, P357, DOI 10.1016/S0167-6393(00)00065-0 MULLER AF, 2000, ICASSP 2000 PIERREHUMBERT J, 1981, J ACOUST SOC AM, V70, P985, DOI 10.1121/1.387033 SAGISAKA Y, 1983, J IEICE D, V66, P849 Silverman K., 1992, P INT C SPOK LANG PR, P867 TAKEDA K, 1988, RES JAPANESE SPEECH Traber C, 1992, TALKING MACHINES THE, P287 *U STUTTG, 1995, 695 U STUTTG VANSANTEN JPH, 1998, P 3 ESCA COCOSDA INT, P329 VENDITTI J, 2000, P ICSLP 2000 WIDERA C, 1999, EUROSPEECH 99, P999 NR 27 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD JUN PY 2003 VL 40 IS 4 BP 535 EP 549 DI 10.1016/S0167-6393(02)00177-2 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 679ZZ UT WOS:000182953000007 ER PT J AU Warren, RM Bashford, JA Lenz, PW TI Intelligibility of dual rectangular speech bands: Implications of observations concerning amplitude mismatch and asynchrony SO SPEECH COMMUNICATION LA English DT Article DE speech intelligibility; rectangular speech bands; dual-band speech synergy; speech asynchrony; speech-level mismatch ID NOISE AB The present study examines the integration of information present in different spectral regions of speech using two 1/3-octave bands of everyday sentences (center frequencies 1- and 3-kHz). Nearly vertical slopes were employed (4000-order finite impulse response filtering) to avoid the major contribution to intelligibility made even by conventionally steep slopes (e.g., 100 dB/octave) (see [J. Acoust. Soc. Amer. 108 (2000) 1264]). Heard alone at 75 dB, the rectangular band intelligibilities were 5% (1 kHz) and 10% (3 kHz); heard together, their score was 77%. Conformity to the normal spectral profile was not required for this remarkably high degree of synergy: When the 3-kHz band was kept at 75 dB, and the 1-kHz band's level was decreased systematically, intelligibility remained unchanged from 75 to 45 dB (intensity ratio of 1000:1). But when the rectangular bands were kept at their normal levels, and one band was delayed relative to the other, intelligibility dropped to half with a misalignment of only about 35 ms (approximately half the duration of the average phoneme); scores dropped further, approaching that of a single band when asynchrony approximated average phonemic durations. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Wisconsin, Dept Psychol, Milwaukee, WI 53201 USA. RP Warren, RM (reprint author), Univ Wisconsin, Dept Psychol, POB 413, Milwaukee, WI 53201 USA. EM rmwarren@csd.uwm.edu; bashford@csd.uwm.edu; plenz@csd.uwm.edu CR ANSI, 1969, S351969 ANSI ANSI, 1997, S351997 ANSI Arai T, 1998, INT CONF ACOUST SPEE, P933, DOI 10.1109/ICASSP.1998.675419 BASHFORD JA, 1992, PERCEPT PSYCHOPHYS, V51, P211, DOI 10.3758/BF03212247 Bashford J. A. Jr., 2000, Acoustics Research Letters Online, V1, DOI 10.1121/1.1329836 Warren RM, 1999, J ACOUST SOC AM, V106, pL47, DOI 10.1121/1.427606 BILGER RC, 1984, J SPEECH HEAR RES, V27, P32 EGAN JP, 1950, J ACOUST SOC AM, V22, P622, DOI 10.1121/1.1906661 Erdohegyi K., 1999, P 6 EUR C SPEECH COM, P2687 FLANAGAN JL, 1951, J ACOUST SOC AM, V23, P303, DOI 10.1121/1.1906762 Fu QJ, 2001, J ACOUST SOC AM, V109, P1166, DOI 10.1121/1.1344158 GRANT KW, 1991, J ACOUST SOC AM, V89, P2952, DOI 10.1121/1.400733 GREENBERG S, 1998, P 5 INT C SPOK LANG, P74 Greenberg S., 1998, P JOINT M AC SOC AM, P2677 Hays W. L., 1988, STATISTICS KALIKOW DN, 1977, J ACOUST SOC AM, V61, P1337, DOI 10.1121/1.381436 KRYTER KD, 1960, J ACOUST SOC AM, V32, P547, DOI 10.1121/1.1908140 Licklider J. C.
R., 1959, PSYCHOL STUDY SCI, V1, P41 Moore BCJ, 1997, INTRO PSYCHOL HEARIN Musch H, 2001, J ACOUST SOC AM, V109, P2910, DOI 10.1121/1.1371972 Müsch H, 2001, J Acoust Soc Am, V109, P2896, DOI 10.1121/1.1371971 POLLACK I, 1948, J ACOUST SOC AM, V20, P259, DOI 10.1121/1.1906369 Rosen S., 1986, FREQUENCY SELECTIVIT, P373 SILVERMAN S R, 1955, Ann Otol Rhinol Laryngol, V64, P1234 Steeneken HJM, 1999, SPEECH COMMUN, V28, P109, DOI 10.1016/S0167-6393(99)00007-2 Warren RM, 2000, J ACOUST SOC AM, V108, P1264, DOI 10.1121/1.1287710 WARREN RM, 1995, PERCEPT PSYCHOPHYS, V57, P175, DOI 10.3758/BF03206503 NR 27 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2003 VL 40 IS 4 BP 551 EP 558 DI 10.1016/S0167-6393(02)00178-4 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 679ZZ UT WOS:000182953000008 ER PT J AU Suzuki, N Takeuchi, Y Ishii, K Okada, M TI Effects of echoic mimicry using hummed sounds on human-computer interaction SO SPEECH COMMUNICATION LA English DT Article DE human-computer interaction; echoic mimicry; hummed sounds; psychological evaluation; interpersonal relations AB Our research goal is to investigate interpersonal relations involving empathy in human-computer interaction. We focus on mimicry behavior and its ability to elicit the intentional stance of a partner in interaction. In this study, we conducted a psychological experiment to examine how prosodic mimicry by computers affects people. An interactive system in this experiment uses an animated character that mimics the prosodic features in a human's voice echoically by synthesizing the hummed sounds. The sounds consist only of prosodic components similar to continuous humming of the open vowel /a/ or /o/ without any language information. The subjects completed a questionnaire to evaluate the character at different mimicry ratios. The results indicated the following possibilities: First, people favorably interpret a computer's simple response such as echoic mimicry using hummed sounds mixed with a slightly constant prosody response. Second, people may establish interpersonal relations with a computer through such facilitated interaction. (C) 2002 Elsevier Science B.V. All rights reserved. C1 ATR, Media Informat Sci Labs, Kyoto 6190288, Japan. Nagoya Univ, Grad Sch Human Informat, Chigusa Ku, Nagoya, Aichi 4648601, Japan. Shizuoka Univ, Fac Informat, Hamamatsu, Shizuoka 4328011, Japan. Sony Corp, Entertainment Robot Co, Minato Ku, Tokyo 1050004, Japan. ATR, Intelligent Robot & Commun Labs, Kyoto 6190288, Japan. RP Suzuki, N (reprint author), ATR, Media Informat Sci Labs, 2-2-2 Hikaridai,Keilhanna Sci City, Kyoto 6190288, Japan. EM noriko@atr.co.jp CR BESKOW J, 1997, EUROSPEECH 97, P1651 Cassell J, 1999, APPL ARTIF INTELL, V13, P519, DOI 10.1080/088395199117360 CASSELL J, 1994, 16 ANN C COGN SCI SO, P153 Chafe W., 1988, CLAUSE COMBINING GRA, P1 Couper-Kuhlen E., 1996, PROSODY CONVERSATION, P366, DOI 10.1017/CBO9780511597862.011 Dennett D, 1987, INTENTIONAL STANCE FUJITA F, 1997, P AUTONOMOUS AGENTS, P435 MASATAKA N, 1992, J CHILD LANG, V19, P213 NAKATA T, 1998, INTELLIGENT AUTONOMO, V5, P352 NASS C, 1994, INT J HUM-COMPUT ST, V40, P543, DOI 10.1006/ijhc.1994.1025 SHIMOJIMA A, 1998, 20 ANN C COGN SCI SO, P951 Strommen E., 1998, CHI 98. Human Factors in Computing Systems.
CHI 98 Conference Proceedings Suzuki N, 2000, ROBOT AUTON SYST, V31, P171, DOI 10.1016/S0921-8890(99)00106-2 SUZUKI N, EUROSPEECH 99, V259, P99 TAKEUCHI Y, 1998, P JOINT WORKSH CROSS, P114 TANNEN D, 1987, LANGUAGE, V63, P574, DOI 10.2307/415006 WALKER MA, 1992, COLING 92, P345 Wertsch J. V., 1991, VOICE MIND SOCIOCULT NR 18 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2003 VL 40 IS 4 BP 559 EP 573 DI 10.1016/S0167-6393(02)00180-2 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 679ZZ UT WOS:000182953000009 ER PT J AU Arehart, KH Hansen, JHL Gallant, S Kalstein, L TI Evaluation of an auditory masked threshold noise suppression algorithm in normal-hearing and hearing-impaired listeners SO SPEECH COMMUNICATION LA English DT Article DE noise suppression; hearing loss; hearing aids; auditory masked threshold; speech intelligibility; speech quality ID DIFFERENT FREQUENCY RESPONSES; ITERATIVE SPEECH ENHANCEMENT; SOUND QUALITY; RECEPTION THRESHOLD; INTELLIGIBILITY; RECOGNITION; BACKGROUNDS; RATIO; AID AB While there have been numerous studies in the field of speech enhancement, the majority of these studies have focused on noise reduction for normal-hearing (NH) individuals. In addition, no speech enhancement algorithms reported in the signal processing community have demonstrated an improvement in intelligibility, with the exception of a recent study by Tsoukalas et al. [IEEE Transactions on Speech and Audio Processing 5 (6) (1997) 497]. This study addresses the problem of speech enhancement for both NH and hearing-impaired (HI) subjects. A noise suppression algorithm based on auditory masked thresholds was implemented and evaluated for NH and HI subjects. Two different tests for intelligibility were used in the evaluation including the nonsense syllable test and the diagnostic rhyme test. Speech quality was evaluated using sentences from the hearing-in-noise test. Tests were performed using two types of noise (voice communications channel and automobile highway noise) at two different signal-to-noise ratios. Ten NH and 11 HI listeners were used to evaluate the enhancement algorithm. Results indicate that the enhancement algorithm yielded significantly better quality ratings and significantly better intelligibility scores in both NH and HI listeners in some but not all of the test conditions. The algorithm resulted in the greatest intelligibility improvements in the communications channel noise and for the nonsense syllable stimuli. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Colorado, Dept Speech Language & Hearing Sci, Boulder, CO 80309 USA. Univ Colorado, Ctr Spoken Language Res, Robust Speech Proc Lab, Boulder, CO 80309 USA. RP Arehart, KH (reprint author), Univ Colorado, Dept Speech Language & Hearing Sci, Box 409 UCB, Boulder, CO 80309 USA.
EM kathryn.arehart@colorado.edu CR American National Standards Institute, 1989, S36 ANSI BAER T, 1994, J ACOUST SOC AM, V95, P2270 BOLL SF, 1979, IEEE T ACOUST SPEECH, V39, P795 BYRNE D, 1986, EAR HEARING, V7, P257 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 CLARKSON P, 1989, J ACOUST SOC AM, V89, P1378 DUBNO JR, 1982, J SPEECH HEAR RES, V25, P135 DUGGIRALA V, 1988, J ACOUST SOC AM, V83, P2372, DOI 10.1121/1.396316 Elberling C, 1993, Scand Audiol Suppl, V38, P39 FESTEN JM, 1983, J ACOUST SOC AM, V73, P652, DOI 10.1121/1.388957 FESTEN JM, 1990, J ACOUST SOC AM, V88, P1725, DOI 10.1121/1.400247 DUQUESNOY AJ, 1983, J ACOUST SOC AM, V74, P739, DOI 10.1121/1.389859 GABRIELSSON A, 1988, J SPEECH HEAR RES, V31, P166 GABRIELSSON A, 1990, J ACOUST SOC AM, V88, P1359, DOI 10.1121/1.399713 Glasberg B R, 1989, Scand Audiol Suppl, V32, P1 HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P169, DOI 10.1109/89.388143 HANSEN JHL, 1995, J ACOUST SOC AM, V97, P3833, DOI 10.1121/1.413108 HANSEN JHL, 1999, ENCY ELECT ELECT ENG, V20, P159 HOU ZZ, 1994, J ACOUST SOC AM, V96, P1325, DOI 10.1121/1.410279 HUMES LE, 1987, J ACOUST SOC AM, V81, P765, DOI 10.1121/1.394845 HYGGE S, 1992, J SPEECH HEAR RES, V35, P208 JAMIESON DG, 1995, EAR HEARING, V16, P274, DOI 10.1097/00003446-199506000-00004 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P314, DOI 10.1109/49.608 Larson VD, 2000, JAMA-J AM MED ASSOC, V284, P1806, DOI 10.1001/jama.284.14.1806 LEVITT H, 1993, SCANDINAVIAN AUDIO S, V93, P7 Moore B., 1998, COCHLEAR HEARING LOS MOORE BCJ, 1995, BRIT J AUDIOL, V29, P131, DOI 10.3109/03005369509086590 MUELLER G, 2000, HEAR J, V53, P27 NANDKUMAR S, 1995, IEEE T SPEECH AUDI P, V3, P22, DOI 10.1109/89.365384 Neuman AC, 1998, J ACOUST SOC AM, V103, P2273, DOI 10.1121/1.422745 NILSSON M, 1994, J ACOUST SOC AM, V95, P1085, DOI 10.1121/1.408469 PLOMP R, 1986, J SPEECH HEAR RES, V29, P146 Resnick S B, 1975, J ACOUSTICAL SOC S1, V58, P114 SCHROEDER MR, 1979, J ACOUST SOC AM, V66, P1647, DOI 10.1121/1.383662 Tsoukalas DE, 1997, IEEE T SPEECH AUDI P, V5, P497, DOI 10.1109/89.641296 VANROOIJ JCGM, 1990, J ACOUST SOC AM, V88, P2611, DOI 10.1121/1.399981 VIRAG N, 1999, IEEE T SPEECH AUDIO, V7, P1 Voiers W. D., 1983, Speech Technology, V1 WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 ZUREK PM, 1987, J ACOUST SOC AM, V82, P1548, DOI 10.1121/1.395145 NR 41 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2003 VL 40 IS 4 BP 575 EP 592 DI 10.1016/S0167-6393(02)00183-8 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 679ZZ UT WOS:000182953000010 ER PT J AU De Mori, R Sagisaka, Y Alwan, A AF De Mori, R Sagisaka, Y Alwan, A TI Untitled SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Univ Calif Los Angeles, Los Angeles, CA 90095 USA. RP De Mori, R (reprint author), Univ Calif Los Angeles, 66-147E Engr 4,405 Hilgard Ave, Los Angeles, CA 90095 USA. EM alwan@icsl.ucla.edu NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAY PY 2003 VL 40 IS 3 BP 259 EP 260 DI 10.1016/S0167-6393(03)00012-8 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 670UV UT WOS:000182428300001 ER PT J AU Karray, L Martin, A TI Towards improving speech detection robustness for speech recognition in adverse conditions SO SPEECH COMMUNICATION LA English DT Article ID ALGORITHM AB Recognition performance decreases when recognition systems are used over the telephone network, especially over wireless networks and in noisy environments. It appears that inefficient speech/non-speech detection (SND) is an important source of this degradation. Therefore, speech detection robustness to noise is a challenging problem to be examined, in order to improve recognition performance for very noisy communications. Several studies were conducted with the aim of improving the robustness of SND used for speech recognition in adverse conditions. The present paper proposes some solutions aiming to improve SND in wireless environments. Speech enhancement prior to detection is considered. Then, two versions of the SND algorithm, based on statistical criteria, are proposed and compared. Finally, a post-detection technique is introduced in order to reject the wrongly detected noise segments. (C) 2002 Elsevier Science B.V. All rights reserved. C1 IPS, DIH, FTR&D, F-22307 Lannion, France. Univ Bretagne Sud, IUT Vannes, F-56000 Vannes, France. RP Karray, L (reprint author), IPS, DIH, FTR&D, 2 Av P Marzin, F-22307 Lannion, France. EM lamia.karray@francetelecom.com; arnaud.martin@univ-ubs.fr CR Agaiby H., 1997, ESCA EUROSPEECH 97, P1119 Berouti M., 1979, ICASSP 79. 1979 IEEE International Conference on Acoustics, Speech and Signal Processing BURLEY S, 1997, INT C AC SPEECH SIGN, P83 BURSTEIN E, 1997, ROBUST SPEECH RECOGN, P111 DAUBECHIES I, 1988, COMMUN PUR APPL MATH, V41, P909, DOI 10.1002/cpa.3160410705 DONOHO DL, 1995, IEEE T INFORM THEORY, V41, P613, DOI 10.1109/18.382009 Downie TR, 1998, IEEE T SIGNAL PROCES, V46, P2558, DOI 10.1109/78.709546 Hermansky H., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319236 Junqua JC, 1994, IEEE T SPEECH AUDI P, V2, P406, DOI 10.1109/89.294354 KARRAY L, 1997, IEEE WORKSH AUT SPEE, P428 Mauuary L., 1993, EUR C SPEECH COMM TE, P1097 MAUUARY L, 1994, THESIS U RENNES RENN Mokbel C., 1995, EUR C SPEECH COMM TE, P141 Mokbel C, 1997, SPEECH COMMUN, V23, P141, DOI 10.1016/S0167-6393(97)00042-3 SAVOJI MH, 1989, SPEECH COMMUN, V8, P45, DOI 10.1016/0167-6393(89)90067-8 SORIN C, 1995, SPEECH COMMUN, V17, P273, DOI 10.1016/0167-6393(95)00035-M Vetterli M, 1995, WAVELETS SUB BAND CO NR 17 TC 15 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2003 VL 40 IS 3 BP 261 EP 276 DI 10.1016/S0167-6393(02)00066-3 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 670UV UT WOS:000182428300002 ER PT J AU Hiwasaki, Y Mano, K Kaneko, T TI An LPC vocoder based on phase-equalized pitch waveform SO SPEECH COMMUNICATION LA English DT Article DE speech coding; LPC vocoder; pitch waveform; phase equalization; vector quantization AB This paper presents a speech coder operating at a very low bit-rate using a model called "phase-equalized pitch waveform".
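A sketch of the pitchwise residual extraction that this model builds on (elaborated in the next sentence of the abstract); the LPC analysis and the pitch marks delimiting each period are assumed to come from earlier stages, and the names are illustrative:

    import numpy as np
    from scipy.signal import lfilter

    def pitch_waveforms(speech, lpc, pitch_marks):
        # lpc: prediction-error filter A(z), i.e. [1, -a1, ..., -ap]
        # pitch_marks: sample indices of pitch-period boundaries
        residual = lfilter(lpc, [1.0], speech)  # LP inverse filtering
        return [residual[b:e]                   # one waveform per period
                for b, e in zip(pitch_marks[:-1], pitch_marks[1:])]

Each extracted period would then pass through the phase-equalization filter before quantization and interpolation, as described below.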
The basic idea of the coder is to employ pitchwise extraction of the linear predictive residual signal, a pitch waveform, in voiced speech. The residual signal is processed with a phase-equalization filter to increase the efficiency of both the pitch waveform quantization and interpolation. Listening tests showed that efficient and high-quality coding is achieved at 2.0 kbits/s. The quality of the coder is equal to that of the DoD FS1016 standard CELP at 4.8 kbits/s and to that of MELP at 2.4 kbits/s. (C) 2002 Elsevier Science B.V. All rights reserved. C1 NTT Corp, NTT Cyber Space Labs, Musashino, Tokyo 1808585, Japan. RP Hiwasaki, Y (reprint author), NTT Corp, NTT Cyber Space Labs, 3-9-11 Midori Cho, Musashino, Tokyo 1808585, Japan. EM hiwasaki.yusuke@lab.ntt.co.jp RI wallipakorn, sayjai/B-5177-2009 CR Brooks FCA, 2000, IEEE T VEH TECHNOL, V49, P766, DOI 10.1109/25.845096 BURNETT IS, 1993, P IEEE INT C AC SPEE, V2, P175 CAMPBELL JP, 1990, ADV SPEECH CODING, P121 Gersho A., 1991, VECTOR QUANTIZATION GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651 HIWASAKI Y, 1998, SP9887 IEICE, P7 HIWASAKI Y, 1997, P IEEE INT C AC SPEE, V2, P1583 HONDA M, 1990, P IEEE INT C AC SPEE, V1, P213 KANG G, 1995, IEEE WORKSH SPEECH C, P99 Kleijn W.B., 1988, P INT C AC SPEECH SI, P155 Kleijn WB, 1993, IEEE T SPEECH AUDI P, V1, P386, DOI 10.1109/89.242484 KUBIN G, 1993, IEEE WORKSH SPEECH C, P35 MARTIN JCD, 1996, P IEEE INT C AC SPEE, P216 MATSUOKA B, 1993, P AUT M AC SOC JPN, P219 MCCREE AV, 1995, IEEE T SPEECH AUDI P, V3, P242, DOI 10.1109/89.397089 Moriya T., 1986, P IEEE INT C AC SPEE, P1701 Ohmuro H., 1994, Transactions of the Institute of Electronics, Information and Communication Engineers A, VJ77-A SCHROEDER MR, 1985, P ICASSP 85 INT C AC, V1, P937 TASAKI H, 1988, P SPRING M AC SOC JP, P133 TOHKURA Y, 1978, REV ELEC COMMUN LAB, V26, P1456 NR 20 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2003 VL 40 IS 3 BP 277 EP 290 DI 10.1016/S0167-6393(02)00067-5 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 670UV UT WOS:000182428300003 ER PT J AU Hant, JJ Alwan, A TI A psychoacoustic-masking model to predict the perception of speech-like stimuli in noise SO SPEECH COMMUNICATION LA English DT Article ID AUDITORY FILTER SHAPES; INTENSITY DISCRIMINATION; TEMPORAL INTEGRATION; GLIDING TONES; FREQUENCY; THRESHOLD; GLIDES; SUMMATION; DURATION; BURSTS AB In this paper, a time/frequency, multi-look masking model is proposed to predict the detection and discrimination of speech-like stimuli in a variety of noise environments. In the first stage of the model, sound is processed through an auditory front end which includes bandpass filtering, squaring, time windowing, logarithmic compression and additive internal noise. The result is an internal representation of time/frequency "looks" for each sound stimulus. To detect or discriminate a signal in noise, the listener combines information across looks using a weighted d' detection device. Parameters of the model are fit to previously measured masked thresholds of bandpass noises which vary in bandwidth, duration, and center frequency (JASA 101 (1997) 2789). The resulting model is successful in predicting masked thresholds of spectrally shaped noise bursts, glides, and formant transitions of varying durations.
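A sketch of the decision stage just described, assuming the internal time/frequency looks have already been produced by the front end (bandpass filtering, squaring, windowing, log compression, internal noise). The combination rule shown, per-look d' values weighted and root-sum-squared, is one standard multi-look formulation and not necessarily the paper's exact fitted form:

    import numpy as np

    def multi_look_dprime(sig_looks, noise_looks, weights, sigma):
        # sig_looks, noise_looks: (T, F) internal looks for
        # signal-plus-masker and masker alone; sigma: internal-noise s.d.
        d = (sig_looks - noise_looks) / sigma        # per-look d'
        return np.sqrt(np.sum((weights * d) ** 2))   # combined detectability

Threshold predictions would follow by finding the signal level at which the combined d' reaches the value implied by the listening procedure's target percent correct.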
The model is also successful in predicting the discrimination of synthetic plosive CV syllables in a variety of noise environments and vowel contexts. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Calif Los Angeles, Sch Engn & Appl Sci, Dept Elect Engn, Speech Proc & Auditory Percept Lab, Los Angeles, CA 90095 USA. RP Hant, JJ (reprint author), Univ Calif Los Angeles, Sch Engn & Appl Sci, Dept Elect Engn, Speech Proc & Auditory Percept Lab, 405 Hilgard Ave, Los Angeles, CA 90095 USA. EM james.j.hant@aero.org CR BLUMSTEIN SE, 1980, J ACOUST SOC AM, V67, P648, DOI 10.1121/1.383890 COLLINS MJ, 1978, J ACOUST SOC AM, V63, P469, DOI 10.1121/1.381738 CULLEN JK, 1982, HEARING RES, V7, P115, DOI 10.1016/0378-5955(82)90085-5 DURLACH NI, 1986, J ACOUST SOC AM, V80, P63, DOI 10.1121/1.394084 Farar CL, 1987, J ACOUST SOC AM, V81, P1085 Fletcher H, 1940, REV MOD PHYS, V12, P0047, DOI 10.1103/RevModPhys.12.47 FLORENTINE M, 1981, J ACOUST SOC AM, V70, P1646, DOI 10.1121/1.387219 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Green D. M., 1966, SIGNAL DETECTION THE GREEN DM, 1960, J ACOUST SOC AM, V32, P121, DOI 10.1121/1.1907862 Hant JJ, 1997, J ACOUST SOC AM, V101, P2789, DOI 10.1121/1.418565 HUGHES JW, 1946, PROC R SOC SER B-BIO, V133, P486, DOI 10.1098/rspb.1946.0026 KIANG NYS, 1965, RES MONOGR, V35 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375 MADDEN JP, 1994, J ACOUST SOC AM, V95, P454, DOI 10.1121/1.408339 Madden JP, 1996, J ACOUST SOC AM, V100, P3754, DOI 10.1121/1.417235 Moore BCJ, 1998, J ACOUST SOC AM, V104, P411, DOI 10.1121/1.423297 NABELEK IV, 1978, J ACOUST SOC AM, V64, P751 Patterson P. D., 1992, AUDITORY PERCEPTION, P429 PATTERSON RD, 1976, J ACOUST SOC AM, V59, P640, DOI 10.1121/1.380914 PLACK CJ, 1991, J ACOUST SOC AM, V90, P3069, DOI 10.1121/1.401781 PLOMP R, 1959, J ACOUST SOC AM, V31, P749, DOI 10.1121/1.1907781 PLOMP R, 1970, FREQUENCY ANAL PERIO, P376 RAAB DH, 1975, J ACOUST SOC AM, V57, P437, DOI 10.1121/1.380467 SEK A, 1995, J ACOUST SOC AM, V97, P2479, DOI 10.1121/1.411968 Stevens K.N., 1998, ACOUSTIC PHONETICS Strope B, 1997, IEEE T SPEECH AUDI P, V5, P451, DOI 10.1109/89.622569 VANDENBRINK WAC, 1990, J ACOUST SOC AM, V87, P284, DOI 10.1121/1.399295 van Schijndel NH, 1999, J ACOUST SOC AM, V105, P3425, DOI 10.1121/1.424683 VIEMEISTER NF, 1991, J ACOUST SOC AM, V90, P858, DOI 10.1121/1.401953 ZWISLOCK.JJ, 1969, J ACOUST SOC AM, V46, P431, DOI 10.1121/1.1911708 NR 32 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2003 VL 40 IS 3 BP 291 EP 313 DI 10.1016/S0167-6393(02)00068-7 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 670UV UT WOS:000182428300004 ER PT J AU Kiefte, M AF Kiefte, M TI Temporal information in gated stop consonants SO SPEECH COMMUNICATION LA English DT Article DE speech perception; stop consonants; stop bursts; waveform envelope; onset spectrum; temporal distortion; voice-onset time ID TIME-VARYING FEATURES; FORMANT TRANSITIONS; ACOUSTIC CUES; ONSET SPECTRA; SPEECH-PERCEPTION; CROSS-VALIDATION; VOWEL SYLLABLES; MODEL SELECTION; ARTICULATION; PLACE AB The goal of the present paper is to assess the importance of dynamic spectral information in short, gated stop bursts. 
Automatic classification of naturally produced stimuli shows that a dynamic spectral representation gives lower misclassification rates than a static one for place-of-articulation distinctions in stimuli longer than 10 ms. At shorter durations, no significant difference is found. Human listeners were then asked to categorize 10- and 20-ms naturally produced, gated bursts in each of two conditions: unprocessed and temporally distorted. At 20 ms, correct identification was significantly lower for the distorted stimuli, while at 10 ms, no significant difference was found. It is shown that the largest changes in listeners' categorization occur with voiced stops with voice-onset times (VOTs) less than the duration of the stimuli; it is hypothesized that the temporal distortion of the onset of voicing contributes largely to the changes in categorization. It is then shown that the perception of voiceless gated stop bursts remains unaffected by the temporal distortion. These results are also supported by statistical models that compare static versus dynamic representations of the stimuli. It is shown that dynamic properties of stop bursts are important only when they include VOT information-i.e., dynamic spectral properties within isolated bursts appear to contain no phonetic information up to 20 ms. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Dalhousie Univ, Sch Human Commun Disorders, Halifax, NS B3H 1R2, Canada. RP Kiefte, M (reprint author), Dalhousie Univ, Sch Human Commun Disorders, 5599 Fenwick St, Halifax, NS B3H 1R2, Canada. EM michael.kiefte@dal.ca CR ATKINSON AC, 1981, J ECONOMETRICS, V16, P15, DOI 10.1016/0304-4076(81)90072-5 BENKI J, 1998, THESIS U MASSACHUSET BLUMSTEIN SE, 1979, J ACOUST SOC AM, V66, P1001, DOI 10.1121/1.383319 BLUMSTEIN SE, 1982, J ACOUST SOC AM, V72, P43, DOI 10.1121/1.388023 BLUMSTEIN SE, 1980, J ACOUST SOC AM, V67, P648, DOI 10.1121/1.383890 Bonneau A, 1996, J ACOUST SOC AM, V100, P555, DOI 10.1121/1.415866 CLARK HH, 1973, J VERB LEARN VERB BE, V12, P335, DOI 10.1016/S0022-5371(73)80014-3 COLE RA, 1974, PSYCHOL REV, V81, P348, DOI 10.1037/h0036656 COOPER FS, 1952, J ACOUST SOC AM, V24, P597, DOI 10.1121/1.1906940 Davison A.C., 1997, BOOTSTRAP METHODS TH Dorman MF, 1996, J ACOUST SOC AM, V100, P3825, DOI 10.1121/1.417238 DORMAN MF, 1977, PERCEPT PSYCHOPHYS, V22, P109, DOI 10.3758/BF03198744 Efron B., 1993, INTRO BOOTSTRAP Efron B, 1997, J AM STAT ASSOC, V92, P548, DOI 10.2307/2965703 Fant G., 1970, ACOUSTIC THEORY SPEE Fischer-Jorgensen E, 1972, ANN REP I PHON U COP, V6, P104 Fischer-Jorgensen E., 1954, MISC PHONET, V2, P42 Fleiss JL, 1981, STAT METHODS RATES P GUMPERTZ M, 1989, AM STAT, V43, P203, DOI 10.2307/2685362 HALLE M, 1957, J ACOUST SOC AM, V29, P107, DOI 10.1121/1.1908634 Hillenbrand JM, 1999, J ACOUST SOC AM, V105, P3509, DOI 10.1121/1.424676 JONGMAN A, 1991, J ACOUST SOC AM, V89, P867, DOI 10.1121/1.1894648 KEWLEYPORT D, 1983, J ACOUST SOC AM, V73, P1779, DOI 10.1121/1.389402 KEWLEYPORT D, 1983, J ACOUST SOC AM, V73, P322, DOI 10.1121/1.388813 KEWLEYPORT D, 1982, J ACOUST SOC AM, V72, P379, DOI 10.1121/1.388081 KEWLEYPORT D, 1984, PERCEPT PSYCHOPHYS, V35, P353, DOI 10.3758/BF03206339 LAHIRI A, 1984, J ACOUST SOC AM, V76, P391, DOI 10.1121/1.391580 Liberman AM, 1954, PSYCHOL MONOGR-GEN A, V68, P1 Lindsey JK, 1999, MODELS REPEATED MEAS LISKER L, 1964, WORD, V20, P384 LISKER L, 1967, LANG SPEECH, V10, P1 McCullagh P., 1989, GEN LINEAR MODELS MOORE BCJ, 1988, J ACOUST SOC AM, V83, P1102, DOI 10.1121/1.396055 MULLENNIX JW, 1989, J ACOUST SOC AM, 
V85, P365, DOI 10.1121/1.397688 ANDRUSKI JE, 1992, J ACOUST SOC AM, V91, P390, DOI 10.1121/1.402781 NEAREY TM, 1986, J ACOUST SOC AM, V80, P1297, DOI 10.1121/1.394433 NEAREY TM, 1992, PERCEPTUALLY MOTIVAT NOSSAIR ZB, 1991, J ACOUST SOC AM, V89, P2978, DOI 10.1121/1.400735 ODEN GC, 1978, PSYCHOL REV, V85, P172, DOI 10.1037/0033-295X.85.3.172 Oppenheim A. V., 1989, DISCRETE TIME SIGNAL PLACK CJ, 1990, J ACOUST SOC AM, V87, P2178, DOI 10.1121/1.399185 Ripley B., 1996, PATTERN RECOGNITION SEARLE CL, 1980, PERCEPTION PRODUCTIO, P73 SHAO J, 1993, J AM STAT ASSOC, V88, P486, DOI 10.2307/2290328 Shao J, 1996, J AM STAT ASSOC, V91, P655, DOI 10.2307/2291661 Slaney M., 1994, 45 APPL COMP INC Smits R, 1996, J ACOUST SOC AM, V100, P3852, DOI 10.1121/1.417241 Smits R, 1996, J ACOUST SOC AM, V100, P3865, DOI 10.1121/1.417242 STEVENS KN, 1993, SPEECH COMMUN, V13, P367, DOI 10.1016/0167-6393(93)90035-J STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102 Strickland EA, 1997, J ACOUST SOC AM, V102, P1799, DOI 10.1121/1.419617 Sussman HM, 1998, BEHAV BRAIN SCI, V21, P241 TEKIELI ME, 1979, J SPEECH HEAR RES, V22, P103 Venables W.N., 1998, MODERN APPL STAT S P WALLEY AC, 1983, J ACOUST SOC AM, V73, P1011, DOI 10.1121/1.389149 WINITZ H, 1972, J ACOUST SOC AM, V51, P1309, DOI 10.1121/1.1912976 ZUE VW, 1976, THESIS MIT NR 57 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2003 VL 40 IS 3 BP 315 EP 333 DI 10.1016/S0167-6393(02)00069-9 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 670UV UT WOS:000182428300005 ER PT J AU van den Heuvel, H van Kuijk, D Boves, L AF van den Heuvel, H van Kuijk, D Boves, L TI Modeling lexical stress in continuous speech recognition for Dutch SO SPEECH COMMUNICATION LA English DT Article DE Dutch; prosody; lexical stress; continuous speech recognition ID LINGUISTIC STRESS; SPECTRAL BALANCE AB The acoustic realization of vowels with lexical stress generally differs substantially from their unstressed counterparts, which are more reduced in spectral quality, shorter in duration, weaker in intensity and tend to have a flatter spectral tilt. Therefore, in a continuous speech recognizer (CSR) it would appear profitable to train separate models for the stressed and unstressed variants of each vowel. In the experiments reported on here, we applied stress modeling in both training and testing of the recognizer. Recognition experiments on an independent test set showed that recognition rates did not improve by this use of stress in our CSR. However, if we swapped the stress markers in the recognition lexicon the recognition rates did significantly deteriorate. This demonstrated that the acoustic models for the stressed and unstressed variants of the vowels were different. A pitfall in this experiment was that lexical stress information and phonemic context were possibly confounded. In a follow-up experiment we controlled for context by using generalized context-dependent models. In this experiment the recognition results were not improved either, although the vowel models were better tailored to capture lexical stress-related information. We conclude that the mapping of lexical stress to the acoustic surface of fluent speech is not sufficiently straightforward to be of direct benefit for CSR, due to interaction of lexical stress with rhythm and sentence accent in real speech. (C) 2002 Elsevier Science B.V. 
All rights reserved. C1 Univ Nijmegen, A2RT, Dept Languages & Speech, NL-6500 HD Nijmegen, Netherlands. RP van den Heuvel, H (reprint author), Univ Nijmegen, A2RT, Dept Languages & Speech, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM h.v.d.heuvel@let.kun.nl CR ADDADECKER M, 1992, P INT C AC SPEECH SI, P561, DOI 10.1109/ICASSP.1992.225846 Baayen R. H., 1993, CELEX LEXICAL DATABA DENOS EA, 1995, P EUR 95, P825 DEVETH J, 2002, IN PRESS EFFICIENCY DUMOUCHEL P, 1993, P EUR 93, P2195 Freij G. J., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90020-7 Greenberg S., 1998, P ESCA WORKSH MOD PR, P47 HAYES B, 1993, METRICAL STRESS THEO HIERONYMUS JL, 1992, P INT C AC SPEECH SI, P225, DOI 10.1109/ICASSP.1992.225931 Hogberg J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607157 MOLL KL, 1971, J ACOUST SOC AM, V50, P678, DOI 10.1121/1.1912683 REICHL W, 1999, P INT C AC SPEECH SI, P573 RIETVELD T, 1999, P 14 INT C PHON SCI, P463 Sluijter AMC, 1997, J ACOUST SOC AM, V101, P503, DOI 10.1121/1.417994 Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955 STRIK H, 2000, P IT C SPOKEN LANG P, V6, P740 VANBERGEM DR, 1993, SPEECH COMMUN, V12, P1, DOI 10.1016/0167-6393(93)90015-D van Kuijk D, 1999, SPEECH COMMUN, V27, P95, DOI 10.1016/S0167-6393(98)00069-7 van Kuijk D., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607963 WANG C, 2001, P EUROSPEECH 2001, V4, P2761 Ying G. S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607932 NR 21 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2003 VL 40 IS 3 BP 335 EP 350 DI 10.1016/S0167-6393(02)00085-7 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 670UV UT WOS:000182428300006 ER PT J AU Ajmera, J McCowan, I Bourlard, H AF Ajmera, J McCowan, I Bourlard, H TI Speech/music segmentation using entropy and dynamism features in a HMM classification framework SO SPEECH COMMUNICATION LA English DT Article DE speech/music discrimination; audio segmentation; entropy; dynamism; HMM; GMM; MLP AB In this paper, we present a new approach towards high performance speech/music discrimination on realistic tasks related to the automatic transcription of broadcast news. In the approach presented here, an artificial neural network (ANN) trained on clean speech only (as used in a standard large vocabulary speech recognition system) is used as a channel model at the output of which the entropy and "dynamism" will be measured every 10 ms. These features are then integrated over time through an ergodic 2-state (speech and non-speech) hidden Markov model (HMM) with minimum duration constraints on each HMM state. For instance, in the case of entropy, it is indeed clear (and observed in practice) that, on average, the entropy at the output of the ANN will be larger for non-speech segments than speech segments presented at their input. In our case, the ANN acoustic model was a multi-layer perceptron (MLP, as often used in hybrid HMM/ANN systems) generating at its output estimators of the phonetic posterior probabilities based on the acoustic vectors at its input. 
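A minimal sketch of the two features just described, computed from per-frame posterior vectors, is given below. The entropy formula is standard; the dynamism formula used here (the mean squared difference between consecutive posterior vectors) is an illustrative assumption rather than a statement of the paper's exact definition:

```python
import numpy as np

def entropy_and_dynamism(posteriors):
    """Per-frame entropy and dynamism of MLP posterior vectors.

    posteriors: array of shape (T, K), one K-class posterior
    distribution per 10 ms frame. Dynamism is taken here as the mean
    squared difference between consecutive posterior vectors (an
    assumed, illustrative formulation).
    """
    p = np.clip(posteriors, 1e-12, 1.0)        # guard against log(0)
    entropy = -np.sum(p * np.log2(p), axis=1)  # H(t): tends to be higher for non-speech
    diffs = np.diff(posteriors, axis=0)        # p(t) - p(t-1)
    dynamism = np.mean(diffs ** 2, axis=1)     # D(t): how fast the posteriors move
    return entropy, dynamism
```

These two per-frame values are exactly the kind of low-dimensional observation sequence that a 2-state (speech/non-speech) HMM with minimum duration constraints can then smooth over time.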
It is from these outputs, thus from "real" probabilities, that the entropy and dynamism are estimated. The 2-state speech/non-speech HMM will take these two-dimensional features (entropy and dynamism) whose distributions will be modeled through multi-Gaussian densities or a secondary MLP. The parameters of this HMM are trained in a supervised manner using Viterbi algorithm. Although the proposed method can easily be adapted to other speech/non-speech discrimination applications, the present paper only focuses on speech/music segmentation. Different experiments, including different speech and music styles, as well as different temporal distributions of the speech and music signals (real data distribution, mostly speech, or mostly music), illustrate the robustness of the approach, always resulting in a correct segmentation performance higher than 90%. Finally, we will show how a confidence measure can be used to further improve the segmentation results, and also discuss how this may be used to extend the technique to the case of speech/music mixtures. (C) 2002 Elsevier Science B.V. All rights reserved. C1 IDIAP, CH-1920 Martigny, Switzerland. Ecole Polytech Fed Lausanne, CH-1015 Lausanne, Switzerland. RP Ajmera, J (reprint author), IDIAP, Case Postal 592,Rue Simplon 4, CH-1920 Martigny, Switzerland. EM jitendra@idiap.ch; mccowan@idiap.ch; bourlard@idiap.ch CR Ajmera J, 2002, INT CONF ACOUST SPEE, P297 BERNARDIS G, 1998, INT C SPOK LANG PROC, V3, P775 CAREY MJ, 1999, IEEE INT C AC SPEECH CHEN S, 1998, IBM TECH J El-Maleh K, 2000, INT CONF ACOUST SPEE, P2445, DOI 10.1109/ICASSP.2000.859336 Morgan N., 1995, IEEE SIGNAL PROC MAY, P25 Papoulis A, 1991, PROBABILITY RANDOM V, V3rd PARRIS ES, 1999, EUR C SPEECH COMM TE, P2191 Saunders J, 1996, INT CONF ACOUST SPEE, P993, DOI 10.1109/ICASSP.1996.543290 SHEIRER E, 1997, IEEE INT C AC SPEECH, P1331 Williams G., 1999, EUR C SPEECH COMM TE, P687 Zhang T, 1999, INT CONF ACOUST SPEE, P3001 NR 12 TC 50 Z9 52 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2003 VL 40 IS 3 BP 351 EP 363 DI 10.1016/S0167-6393(02)00087-0 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 670UV UT WOS:000182428300007 ER PT J AU Parsa, V Jamieson, DG AF Parsa, V Jamieson, DG TI Interactions between speech coders and disordered speech SO SPEECH COMMUNICATION LA English DT Article DE telecommunications; speech coders; GSM; LPC; CELP; speech disorders; intelligibility; objective speech quality measures ID NORMALIZING BLOCK TECHNIQUE; OBJECTIVE ESTIMATION; QUALITY AB We examined the impact of standard speech coders currently used in modern communication systems, on the quality of speech from persons with common speech and voice disorders. Four standardized coders, viz. G. 728 LD-CELP, GSM 6.10 RPE-LTP, FS1016 CELP, FS1015 LPC and the recently proposed US Federal Standard 2400 bps MELP were evaluated with speech samples collected from 30 disordered talkers. Objective speech quality measures, including the auditory distance parameter based on the measuring normalizing blocks technique, and the perceptual speech quality measure, and subjective impressions of speech coder performance were used to assess the interaction between speech coder and speech disorder. 
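To make the notion of an objective speech quality measure concrete, a generic frame-averaged log-spectral distance between the original and the coded signal is sketched below. This simple measure is illustrative only; the measuring-normalizing-blocks (MNB) auditory distance and the perceptual speech quality measure used in the study are considerably more elaborate, perceptually motivated schemes:

```python
import numpy as np

def log_spectral_distance(ref, coded, n_fft=512, hop=160):
    """Frame-averaged log-spectral distance (in dB) between a reference
    signal and its coded version. Assumes equal length and sample rate,
    and signals longer than one analysis frame. Illustrative only; not
    the MNB or PSQM measures used in the paper."""
    win = np.hanning(n_fft)

    def power_spectra(x):
        starts = range(0, len(x) - n_fft, hop)
        return np.array([np.abs(np.fft.rfft(x[s:s + n_fft] * win)) ** 2
                         for s in starts]) + 1e-12  # floor avoids log(0)

    d = 10.0 * np.log10(power_spectra(ref) / power_spectra(coded))
    return float(np.mean(np.sqrt(np.mean(d ** 2, axis=1))))  # dB
```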
Objective speech quality measures revealed that the performance of the LD-CELP and GSM RPE-LTP coders was not measurably influenced by the type of input speech, and that MELP, FS1015 LPC and to a certain extent FS1016 CELP exhibited degraded performance with speech samples from disordered talkers. Results from perceptual experiments were in contrast with the objective measures of speech quality; ratings of speech coder performance indicated that the listeners are less sensitive to coder-induced distortions with abnormal speech samples. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Western Ontario, Natl Ctr Audiol, Elborn Coll, London, ON N6G 1H1, Canada. RP Parsa, V (reprint author), Univ Western Ontario, Natl Ctr Audiol, Elborn Coll, Room 2300, London, ON N6G 1H1, Canada. EM parsa@nca.uwo.ca CR *AV INN INC, 1998, TFR TIM FREQ REPR SO Campos M, 2000, MANAG INFORMAT SYST, V1, P145 DEGENER J, 1994, DOBBS J DEC Deller J. R., 1993, DISCRETE TIME PROCES Fairbanks G., 1960, VOICE ARTICULATION D, P127 FENICHEL R, 1991, 1016 NAT COMM SYST O Hall JL, 2001, J ACOUST SOC AM, V110, P2167, DOI 10.1121/1.1397322 HANSEN JHL, 1995, J ACOUST SOC AM, V97, P609, DOI 10.1121/1.412283 JAMIESON DG, 2002, J SPEECH LANGUAGE HE, V45 JAMIESON DG, 1996, P INT C SPOK LANG PR, V3, P737 KREIMAN J, 1993, J SPEECH HEAR RES, V36, P21 *MASS EYE EAR INF, 1994, VOIC DIS DAT VERS 1 MCCREE AV, 1996, P IEEE ICASSP, P242 NATVIG JE, 1989, IEEE GLOBECOM NOV Papamichalis P.E., 1987, PRACTICAL APPROACHES QUACKENBUSH RS, 1988, OBJECTIVE MEASURES S SHAMES GH, 1982, HUMAN COMMUNICATION, P152 *SPSS INC, SPSS WIND V8 0 Tremain T.E., 1982, SPEECH TECHNOLOG APR, P40 Voran S, 1999, IEEE T SPEECH AUDI P, V7, P383, DOI 10.1109/89.771260 Voran S, 1999, IEEE T SPEECH AUDI P, V7, P371, DOI 10.1109/89.771259 NR 21 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2003 VL 40 IS 3 BP 365 EP 385 DI 10.1016/S0167-6393(02)00125-5 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 670UV UT WOS:000182428300008 ER PT J AU Lopez-Cozar, R De la Torre, A Segura, JC Rubio, AJ AF Lopez-Cozar, R De la Torre, A Segura, JC Rubio, AJ TI Assessment of dialogue systems by means of a new simulation technique SO SPEECH COMMUNICATION LA English DT Article DE dialogue systems; speech recognition; speech understanding; dialogue management; system evaluation ID TELEPHONE; INFORMATION; STRATEGIES AB In recent years, a question of great interest has been the development of tools and techniques to facilitate the evaluation of dialogue systems. The latter can be evaluated from various points of view, such as recognition and understanding rates, dialogue naturalness and robustness against recognition errors. Evaluation usually requires compiling a large corpus of words and sentences uttered by users, relevant to the application domain the system is designed for. This paper proposes a new technique that makes it possible to reuse such a corpus for the evaluation and to check the performance of the system when different dialogue strategies are used. The technique is based on the automatic generation of conversations between the dialogue system, together with an additional dialogue system called user simulator that represents the user's interaction with the dialogue system. 
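The automatic generation of conversations between the dialogue system and the user simulator can be pictured as a short evaluation loop. All interface names below (reset, get_prompt, respond, process, is_finished, task_completed) are hypothetical placeholders for illustration, not the authors' software:

```python
import random

def run_simulated_dialogues(system, simulator, scenarios, n_dialogues=100):
    """Generate system <-> user-simulator conversations and collect
    simple evaluation statistics. A sketch under assumed interfaces."""
    results = []
    for _ in range(n_dialogues):
        scenario = random.choice(scenarios)    # e.g. a set of target slot values
        system.reset()
        simulator.reset(scenario)
        turns = 0
        while not system.is_finished() and turns < 30:   # cap runaway dialogues
            prompt = system.get_prompt()                 # system turn
            utterance = simulator.respond(prompt)        # simulated user turn,
            system.process(utterance)                    # e.g. reusing corpus utterances
            turns += 1
        results.append({"turns": turns,
                        "success": system.task_completed(scenario)})
    success_rate = sum(r["success"] for r in results) / len(results)
    return success_rate, results
```

The same loop can be rerun with different recognition front-ends or confirmation strategies plugged into the system, which is the kind of comparison the study reports.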
The technique has been applied to evaluate a dialogue system developed in our lab using two different recognition front-ends and two different dialogue strategies to handle user confirmations. The experiments show that the prompt-dependent recognition front-end achieves better results, but that this front-end is appropriate only if users limit their utterances to those related to the current system prompt. The prompt-independent front-end achieves inferior results, but enables front-end users to utter any permitted utterance at any time, irrespective of the system prompt. In consequence, this front-end may allow a more natural and comfortable interaction. The experiments also show that the re-prompting confirmation strategy enhances system performance for both recognition front-ends. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Granada, Dept Elect & Tecnol Comp, E-18071 Granada, Spain. RP Lopez-Cozar, R (reprint author), Univ Granada, Dept Elect & Tecnol Comp, E-18071 Granada, Spain. EM rlopezc@ugr.es RI de la Torre, Angel/C-6618-2012; Segura, Jose/B-7008-2008; Lopez-Cozar, Ramon/A-7686-2012 OI Segura, Jose/0000-0003-3746-0978; Lopez-Cozar, Ramon/0000-0003-2078-495X CR Albesano D, 1997, INT CONF ACOUST SPEE, P1147, DOI 10.1109/ICASSP.1997.596145 Allen J, 1995, NATURAL LANGUAGE UND Araki M, 1997, P IJCAI WORKSH COLL, P13 Araki M, 1997, LECT NOTES ARTIF INT, V1236, P183 BOROS M, 1999, EUROSPEECH 99, P1986 BOUWMAN G, 1998, 1 INT C LANG RES EV, P191 BUNTSCHUH B, 1998, EUROSPEECH 97, P815 CASACUBERTA F, 1991, WORKSH INT COOP STAN DANIELI M, 1995, AAAI SPRING S EMP ME, P34 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X HAIN T, 1999, ICASSP 99 Kellner A, 1997, SPEECH COMMUN, V23, P95, DOI 10.1016/S0167-6393(97)00036-8 KELLNER A, 1998, IVTTA 98, P21 LAMEL L, 1998, ICSLP 98, P2875 Lamel LF, 1997, SPEECH COMMUN, V23, P67, DOI 10.1016/S0167-6393(97)00037-X LAVELLE CA, 1999, EUROSPEECH 99, P1399 Levin E, 2000, IEEE T SPEECH AUDI P, V8, P11, DOI 10.1109/89.817450 LOPEZCOZAR R, 1998, LREC 1998, P55 LOPEZCOZAR R, 2000, LREC 2000, P743 LOPEZCOZAR R, 2000, SPANISH SOC NATURAL, V26, P169 LOPEZCOZAR R, 1999, EUROSPEECH 99, P1395 NASR A, 1999, EUROSPEECH 99, P2175 NIIMI Y, 1999, EUROSPEECH 99, P1403 NOETH E, 1999, EUROSPEECH 99, P2019 Rabiner L, 1993, FUNDAMENTALS SPEECH ROSSET S, 1999, EUROSPEECH 99, P1535 SCHADLE I, 1999, EUROSPEECH 99 P BUD, P2035 Siu MH, 2000, IEEE T SPEECH AUDI P, V8, P63 Souvignier B, 2000, IEEE T SPEECH AUDI P, V8, P51, DOI 10.1109/89.817453 Swerts M, 1997, SPEECH COMMUN, V22, P25, DOI 10.1016/S0167-6393(97)00011-3 WOODLAND PC, 1999, DARPA BROADC NEWS WO ZUE V, 1997, EURO SPEECH, P2227 Zue V, 2000, IEEE T SPEECH AUDI P, V8, P85, DOI 10.1109/89.817460 NR 33 TC 36 Z9 36 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAY PY 2003 VL 40 IS 3 BP 387 EP 407 DI 10.1016/S0167-6393(02)00126-7 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 670UV UT WOS:000182428300009 ER PT J AU Sorokin, VN AF Sorokin, VN TI Some coding properties of speech SO SPEECH COMMUNICATION LA English DT Article DE prefix code; robust cues; potential recognition rate; fast sorting; sequential decoding ID STOP CONSONANTS; INVERSE PROBLEM; ACOUSTIC CUES; RECOGNITION; PERCEPTION; ARTICULATION; PLACE; ENGLISH; SPECTRA AB Some important properties of speech are considered from the point of view of the theory of error-correcting codes. It has been found experimentally that the properties of Russian words encoded in terms of phonemes are largely similar to the properties of so-called prefix codes. In a prefix code, no code word is a prefix of another word. According to coding theory, for any prefix code there exists an algorithm for unambiguous decoding in which no pause or special delimiter symbol separates code words. Apparently, word segmentation in the continuous speech signal is provided mainly by the use of the prefix property. Phoneme probability in Russian follows the Mandelbrot law. This finding is evidence in favour of the assumption that the probability is determined by the "complexity" or "expense" of phoneme generation. Speech recognition for a large vocabulary requires much time for access to word templates. Thus, a preliminary sorting of the templates is necessary to restrict the number of candidates for final recognition. The preliminary sorting can be executed by encoding words with a few phonemic cues. Auditory experiments with speech masked by white noise have revealed the most reliable cues. These cues are "vowel, voiced, nasal, fricative". About 150 templates were left after the fast sorting procedure was applied to approximately 100,000 templates in a vocabulary of 10,000 of the most frequent English words. The speech recognition rate obtained by an automatic recognition system must be compared with the potentially achievable rate. The potential rate of word recognition for various S/N ratios can be computed with the use of methods developed in the theory of coding. It can be argued that an optimal machine for automatic speech recognition should treat as robust the same cues that humans find robust. The potential rate for words encoded in terms of independent distinctive features is closer to the subjective reliability of word perception than the rate for words encoded by phonemes. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Russian Acad Sci, Inst Informat Transmiss Problems, Moscow 101447, Russia. RP Sorokin, VN (reprint author), Russian Acad Sci, Inst Informat Transmiss Problems, Bolshoy Karetny 19, Moscow 101447, Russia.
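The prefix property central to the abstract above is mechanically testable. A minimal sketch follows, using toy, hypothetical phoneme strings rather than the paper's Russian material; sorting places any word directly before its lexicographic extensions, so checking adjacent pairs suffices:

```python
def is_prefix_code(words):
    """Check the prefix property: no code word is a prefix of another.
    After sorting, a word that is a prefix of any other word is
    immediately followed by one of its extensions, so one pass over
    adjacent pairs is enough."""
    ws = sorted(words)
    return all(not ws[i + 1].startswith(ws[i]) for i in range(len(ws) - 1))

# Toy illustration with hypothetical phoneme strings:
# is_prefix_code(["ab", "ac", "b"])  -> True
# is_prefix_code(["ab", "abc"])      -> False ("ab" is a prefix of "abc")
```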
EM vns@iitp.ru CR BLUMSTEIN SE, 1980, J ACOUST SOC AM, V67, P648, DOI 10.1121/1.383890 CARROL JB, 1971, WORD FREQUENCY BOOK CASSIDY S, 1995, PHONETICA, V52, P263 Demichelis P., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90018-8 DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839 FANO RM, 1961, TRANSMISSION INFORMA FERRERO EE, 1979, FRONTIERS SPEECH COM, P159 FISCHER RM, 1990, J ACOUST SOC AM, V83, P1251 FORNEY GD, 1966, RES MONOGRAPH N, V37 FUJIMURA O, 1962, J ACOUST SOC AM, V34, P1865, DOI 10.1121/1.1909142 Greenberg S, 1999, SPEECH COMMUN, V29, P159, DOI 10.1016/S0167-6393(99)00050-3 GUBRINOWITZ R, 1983, RECH ACOUST CNET LAN, V7, P93 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 JELINEK F, 1975, IEEE T INFORM THEORY, V21, P250, DOI 10.1109/TIT.1975.1055384 JOHANENNSSON R, 1999, FUNDAMENTALS CONVOLU KEWLEYPORT D, 1983, J ACOUST SOC AM, V73, P1779, DOI 10.1121/1.389402 KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 Kucera H., 1967, COMPUTATIONAL ANAL P LADEFOGED P, 1986, N64 UCLA Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Liu SA, 1996, J ACOUST SOC AM, V100, P3417, DOI 10.1121/1.416983 MANDELBROT B, 1953, THESIS U PARIS 2 MARTIN A, 1998, P 9 HUB 5 CONV SPEEC MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 NARTEY JNA, 1982, N55 UCLA Niyogi P, 2002, J ACOUST SOC AM, V111, P1063, DOI 10.1121/1.1427666 NOSAIR ZB, 1991, J ACOUST SOC AM, V89, P2978 OHDE RN, 1983, J ACOUST SOC AM, V74, P706, DOI 10.1121/1.389856 Peterson W. W., 1961, ERROR CORRECTING COD POKROVSKY NB, 1962, CALCULATION MEASUREM SAVIN HB, 1970, J VERB LEARN VERB BE, V9, P295, DOI 10.1016/S0022-5371(70)80064-0 Shipman D. W., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing Smits R, 1996, J ACOUST SOC AM, V100, P3852, DOI 10.1121/1.417241 Sorokin V. N., 1992, SPEECH SYNTHESIS SOROKIN VN, 1994, SPEECH COMMUN, V14, P249, DOI 10.1016/0167-6393(94)90065-5 Sorokin VN, 2000, SPEECH COMMUN, V30, P55, DOI 10.1016/S0167-6393(99)00031-X STEINFELDT E, 1964, RUSSIAN WORD COUNT P STEVENS K, 1984, EVIDENCE ROLE ACOUST, P1 STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102 Takeuchi S., 1975, Journal of the Acoustical Society of Japan, V31 Tchorz J, 1999, J ACOUST SOC AM, V106, P2040, DOI 10.1121/1.427950 WALSH T, 1987, J PHONETICS, V15, P101 Wozencraft J.M., 1965, PRINCIPLES COMMUNICA ZIGANGIROV KS, 1974, PROCEDURES SEQUENTIA ZIPF C, 1949, HUMAN BEHAV PRINCIPL ZUE V, 1990, P ICASSP 90 GLASG SC, P389 NR 48 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2003 VL 40 IS 3 BP 409 EP 423 DI 10.1016/S0167-6393(02)00152-8 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 670UV UT WOS:000182428300010 ER PT J AU Douglas-Cowie, E Cowie, R Campbell, N AF Douglas-Cowie, E Cowie, R Campbell, N TI Speech and emotion SO SPEECH COMMUNICATION LA English DT Editorial Material NR 0 TC 20 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD APR PY 2003 VL 40 IS 1-2 BP 1 EP 3 DI 10.1016/S0167-6393(02)00072-9 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100001 ER PT J AU Cowie, R Cornelius, RR AF Cowie, R Cornelius, RR TI Describing the emotional states that are expressed in speech SO SPEECH COMMUNICATION LA English DT Review DE speech; emotion; categories; dimensions; instruments ID MOOD INDUCTION PROCEDURE; BASIC EMOTIONS; FACIAL EXPRESSIONS; VOCAL EMOTION; VELTEN; STRESS; PERCEPTION; RESPONSES; CONTEXT; EYE AB To study relations between speech and emotion, it is necessary to have methods of describing emotion. Finding appropriate methods is not straightforward, and there are difficulties associated with the most familiar. The word emotion itself is problematic: a narrow sense is often seen as "correct", but it excludes what may be key areas in relation to speech-including states where emotion is present but not full-blown, and related states (e.g., arousal, attitude). Everyday emotion words form a rich descriptive system, but it is intractable because it involves so many categories, and the relationships among them are undefined. Several alternative types of description are available. Emotion-related biological changes are well documented, although reductionist conceptions of them are problematic. Psychology offers descriptive systems based on dimensions such as evaluation (positive or negative) and level of activation, or on logical elements that can be used to define an appraisal of the situation. Adequate descriptive systems need to recognise the importance of both time course and interactions involving multiple emotions and/or deliberate control. From these conceptions of emotion come various tools and techniques for describing particular episodes. Different tools and techniques are appropriate for different purposes. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Queens Univ Belfast, Sch Psychol, Belfast BT7 1NN, Antrim, North Ireland. Vassar Coll, Poughkeepsie, NY 12601 USA. RP Cowie, R (reprint author), Queens Univ Belfast, Sch Psychol, Belfast BT7 1NN, Antrim, North Ireland. EM r.cowie@qub.ac.uk CR Anscombe E., 1970, DESCARTES PHILOS WRI Arnold M. B., 1960, EMOTION PERSONALITY Averill J, 1980, EMOTION THEORY RES E, P305 Averill J. R., 1982, ANGER AGGRESSION ESS Averill J. R., 1975, SEMANTIC ATLAS EMOTI AVERILL JR, 1978, J PERS, V46, P323, DOI 10.1111/j.1467-6494.1978.tb00183.x Bachorowski JA, 1999, CURR DIR PSYCHOL SCI, V8, P53, DOI 10.1111/1467-8721.00013 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Blascovich J., 2000, HDB RES METHODS SOCI, P117 Braunstein-Bercovitz H, 2001, EMOTION, V1, P182, DOI 10.1037//1528-3542.1.2.182 Buck R, 1999, PSYCHOL REV, V106, P301, DOI 10.1037/0033-295X.106.2.301 Cacioppo J.T., 1993, HDB EMOTIONS, P119 Carroll JM, 1996, J PERS SOC PSYCHOL, V70, P205, DOI 10.1037/0022-3514.70.2.205 Cauldwell RT, 2000, P ISCA ITRW SPEECH E, P127 CLARK DM, 1983, ADV BEHAV RES THER, V5, P27, DOI 10.1016/0146-6402(83)90014-0 Cornelius R. 
R., 1996, SCI EMOTION RES TRAD CORNELIUS RR, 1982, J RES PERS, V14, P503 Corsini RJ, 1994, ENCY PSYCHOL Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Cowie R., 1999, P ESCA WORKSH DIAL P, P41 Cowie R., 2000, P ISCA WORKSH SPEECH, P19 Cowie R., 1999, COMPUT INTELL, P109 Crystal D., 1969, PROSODIC SYSTEMS INT Crystal David, 1975, ENGLISH TONE VOICE *CSEA NIMH, 1999, INT AFF PICT SYST DI Darwin C., 1872, EXPRESSION EMOTION M Davidson R. J., 1994, NATURE EMOTION FUNDA DAVIDSON RJ, 1990, J PERS SOC PSYCHOL, V58, P330, DOI 10.1037/0022-3514.58.2.330 Davidson RJ, 2000, PSYCHOL BULL, V126, P890, DOI 10.1037//0033-2909.126.6.890 Davidson RJ, 2000, BIOL PSYCHIAT, V47, P85, DOI 10.1016/S0006-3223(99)00222-X Davidson RJ, 1999, HDB COGNITION EMOTIO, P103 de Gelder B, 2000, COGNITION EMOTION, V14, P289 de Gelder B, 2000, COGNITION EMOTION, V14, P321 DOUGLASCOWIE E, 2001, EMOTION RES, V15, P8 Ekman P., 1969, SEMIOTICA, V1, P49 EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068 Ekman P., 1978, FACIAL ACTION CODING EKMAN P, 1983, SCIENCE, V221, P1208, DOI 10.1126/science.6612338 EKMAN P, 1988, MOTIV EMOTION, V12, P303, DOI 10.1007/BF00993116 Ekman P., 1999, HDB COGNITION EMOTIO, P301 EKMAN P, 1994, PSYCHOL BULL, V115, P268, DOI 10.1037//0033-2909.115.2.268 Engebretson TO, 1999, J PSYCHOSOM RES, V47, P13, DOI 10.1016/S0022-3999(99)00012-4 FEHR B, 1984, J EXP PSYCHOL GEN, V113, P464, DOI 10.1037/0096-3445.113.3.464 Feldman PJ, 1999, ANN BEHAV MED, V21, P216, DOI 10.1007/BF02884836 FINEGAN JE, 1995, CAN J BEHAV SCI, V27, P405, DOI 10.1037/0008-400X.27.4.405 France DJ, 2000, IEEE T BIO-MED ENG, V47, P829, DOI 10.1109/10.846676 Fridlund A. J., 1994, HUMAN FACIAL EXPRESS Frijda N. H., 1993, HDB EMOTIONS, P381 Frijda N. H., 1986, EMOTIONS FROST RO, 1982, PERS SOC PSYCHOL B, V8, P341, DOI 10.1177/0146167282082024 GERRARDSHESSE A, 1994, BRIT J PSYCHOL, V85, P55 Gibson J. J., 1979, ECOLOGICAL APPROACH Goffman E., 1981, FORMS TALK Gottman JM, 2000, J MARRIAGE FAM, V62, P737, DOI 10.1111/j.1741-3737.2000.00737.x GOTTMAN JM, 1992, J PERS SOC PSYCHOL, V63, P221, DOI 10.1037/0022-3514.63.2.221 GREASLEY P, 1995, P 13 ICPHS STOCKH, V1, P242 Greasley P, 2000, LANG SPEECH, V43, P355 GREEN DP, 1993, J PERS SOC PSYCHOL, V64, P1029, DOI 10.1037//0022-3514.64.6.1029 GROSS JJ, 1995, COGNITION EMOTION, V9, P87, DOI 10.1080/02699939508408966 HADFIELD P, 2000, NEW SCI, P21 Hagemann D, 1999, PERS INDIV DIFFER, V26, P627, DOI 10.1016/S0191-8869(98)00159-7 Harre R., 1986, SOCIAL CONSTRUCTION HEALEY J, 1997, DIGITAL PROCESSING A HESS U, 1995, J PERS SOC PSYCHOL, V69, P280, DOI 10.1037/0022-3514.69.2.280 Hucklebridge F, 2000, BIOL PSYCHOL, V53, P25, DOI 10.1016/S0301-0511(00)00040-5 Izard C. E., 1972, PATTERNS EMOTIONS NE Izard Carroll E, 1993, HDB EMOTIONS, P631 JACOBSON AF, 1978, J CLIN PSYCHOL, V34, P677, DOI 10.1002/1097-4679(197807)34:3<677::AID-JCLP2270340320>3.0.CO;2-T James W., 1984, MIND, V19, P188, DOI DOI 10.1093/MIND/OS-IX.34.188 Jenkins Jennifer, 1996, UNDERSTANDING EMOTIO JOHNSONLAIRD PN, 1992, COGNITION EMOTION, V6, P201, DOI 10.1080/02699939208411069 Kenealy P., 1988, COGNITION EMOTION, V2, P41, DOI DOI 10.1080/02699938808415228 KENEALY PM, 1986, MOTIV EMOTION, V10, P315, DOI 10.1007/BF00992107 Kenny Anthony, 1973, WITTGENSTEIN Koukounas E, 2001, J INTERPERS VIOLENCE, V16, P476, DOI 10.1177/088626001016005006 Lang P.J., 1999, INT AFFECTIVE PICTUR LARSEN RJ, 1991, PERS SOC PSYCHOL B, V17, P323, DOI 10.1177/0146167291173013 Lazarus R. 
S., 1999, HDB COGNITION EMOTIO, P3 Lazarus R. S., 1999, STRESS EMOTION NEW S Lazarus RS, 1994, NATURE EMOTION FUNDA, P79 LEVENSON RW, 1992, PSYCHOL SCI, V3, P23, DOI 10.1111/j.1467-9280.1992.tb00251.x LEWIS M, 1996, HDB EMOTIONS Luminet O, 2000, COGNITION EMOTION, V14, P661 MACDOWELL KA, 1989, MOTIV EMOTION, V13, P105, DOI 10.1007/BF00992957 Mandler G., 1975, MIND EMOTION Massaro DW, 2000, COGNITION EMOTION, V14, P313 MEHRABIAN A, 1974, NONVERBAL COMMUNICAT, P291 MISCHEL W, 1982, PSYCHOL REV, V89, P730, DOI 10.1037//0033-295X.89.6.730 MORROW J, 1990, J PERS SOC PSYCHOL, V58, P519, DOI 10.1037//0022-3514.58.3.519 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 Neisser U., 1976, COGNITION REALITY Nezlek J. B., 1983, NEW DIRECTIONS METHO, V15, P57 O'Connor John D., 1973, INTONATION COLLOQUIA Ortony A., 1988, COGNITIVE STRUCTURE ORTONY A, 1990, PSYCHOL REV, V97, P315, DOI 10.1037//0033-295X.97.3.315 Osgood C.E., 1975, CROSS CULTURAL UNIVE Pereira C., 2000, P ISCA WORKSH SPEECH, P25 Picard RW, 2000, IBM SYST J, V39, P705 Plutchik R., 1997, CIRCUMPLEX MODELS PE Plutchik R., 1984, APPROACHES EMOTION, P197 Plutchik R., 1980, EMOTION PSYCHOEVOLUT Plutchik R., 1989, EMOTION THEORY RES E, P113 POLIVY J, 1981, J PERS SOC PSYCHOL, V41, P803, DOI 10.1037/0022-3514.41.4.803 POLIVY J, 1980, J ABNORM PSYCHOL, V89, P286, DOI 10.1037/0021-843X.89.2.286 Rapson R., 1993, LOVE SEX INTIMACY TH Reisenzein R, 2000, COGNITION EMOTION, V14, P1, DOI 10.1080/026999300378978 ROEMER L, 2001, AABT CLIN ASSESSMENT, P49 Rolls E. T., 1999, BRAIN EMOTION ROSCH E, 1975, COGNITIVE PSYCHOL, V7, P573, DOI 10.1016/0010-0285(75)90024-9 ROSEMAN IJ, 1991, COGNITION EMOTION, V5, P161, DOI 10.1080/02699939108411034 Ruch W., 1993, HDB EMOTIONS, P605 Russell J.A., 1997, CIRCUMPLEX MODELS PE, P205, DOI 10.1037/10261-009 RUSSELL JA, 1994, PSYCHOL BULL, V115, P102, DOI 10.1037/0033-2909.115.1.102 Russell JA, 1999, J PERS SOC PSYCHOL, V76, P805, DOI 10.1037//0022-3514.76.5.805 SALOMON R, 2001, EMOTION RES, V152, P3 Schachter S., 1959, PSYCHOL AFFILIATION Scherer K. R., 1994, NATURE EMOTION FUNDA, P25 Scherer K. R., 1999, HDB COGNITION EMOTIO, P637 Scherer K. R., 1984, REV PERSONALITY SOCI, V5, P37 Scherer K. R., 1984, APPROACHES EMOTION, P293 Scherer KR, 1988, COGNITIVE PERSPECTIV, P89 SCHETTLER G, 1986, PROG LIPID RES, V25, P1, DOI 10.1016/0163-7827(86)90004-4 SCHLOSBERG H, 1954, PSYCHOL REV, V61, P81, DOI 10.1037/h0054570 SCHNEIDER F, 1994, PSYCHIAT RES, V51, P19, DOI 10.1016/0165-1781(94)90044-2 SCHRODER M, 2001, P EUR 2001 AALB, V1, P87 Schubiger M., 1958, ENGLISH INTONATION I Shackelford TK, 2000, COGNITION EMOTION, V14, P643 SHAVER P, 1987, J PERS SOC PSYCHOL, V52, P1061, DOI 10.1037/0022-3514.52.6.1061 SPEISMAN JC, 1964, J ABNORM SOC PSYCH, V68, P367, DOI 10.1037/h0048936 SPIELBEGER CD, 1988, STATE TRAIT ANGER EX STEIN NL, 1992, COGNITION EMOTION, V6, P161, DOI 10.1080/02699939208411067 Stemmler G, 2001, PSYCHOPHYSIOLOGY, V38, P275, DOI 10.1017/S0048577201991668 Sternberg Robert J, 1988, PSYCHOL LOVE STIBBARD RM, 2001, THESIS U READING UK STORM C, 1987, J PERS SOC PSYCHOL, V53, P805, DOI 10.1037//0022-3514.53.4.805 Teasdale J. D., 1999, HDB COGNITION EMOTIO, P665 TOLKMITT FJ, 1986, J EXP PSYCHOL HUMAN, V12, P302, DOI 10.1037//0096-1523.12.3.302 Tomkins S. 
S., 1982, EMOTION HUMAN FACE, P353 Tsapatsoulis N, 2002, MPEG 4 FACIAL ANIMAT VELTEN E, 1968, BEHAV RES THER, V6, P473, DOI 10.1016/0005-7967(68)90028-4 VYAS E, 1999, OFFLINE ONLINE RECOG WATSON D, 1985, PSYCHOL BULL, V98, P219, DOI 10.1037/0033-2909.98.2.219 WATSON D, 1988, J PERS SOC PSYCHOL, V54, P1063, DOI 10.1037/0022-3514.54.6.1063 Westermann R, 1996, EUR J SOC PSYCHOL, V26, P557, DOI 10.1002/(SICI)1099-0992(199607)26:4<557::AID-EJSP769>3.0.CO;2-4 Wichmann Anne, 2000, P ISCA WORKSH SPEECH, P143 WIERZBICKA A, 1992, COGNITION EMOTION, V6, P285, DOI 10.1080/02699939208411073 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 Wittgenstein Ludwig, 1953, PHILOS INVESTIGATION Womack BD, 1996, SPEECH COMMUN, V20, P131, DOI 10.1016/S0167-6393(96)00049-0 WUNDT W, 2003, GRUNDZUGE PHYSL PSYC, V2 YARTZ AR, 2001, AABT CLIN ASSESSMENT, P25 Zammuner VL, 1998, COGNITION EMOTION, V12, P243 ZANNA M, 1908, SOCIAL PSYCHOL KNOWL, P315 Zuckerman M, 1985, MANUAL MULTIPLE AFFE NR 154 TC 154 Z9 157 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2003 VL 40 IS 1-2 BP 5 EP 32 DI 10.1016/S0167-6393(02)00071-7 PG 28 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100002 ER PT J AU Douglas-Cowie, E Campbell, N Cowie, R Roach, P AF Douglas-Cowie, E Campbell, N Cowie, R Roach, P TI Emotional speech: Towards a new generation of databases SO SPEECH COMMUNICATION LA English DT Article DE databases; emotional speech; scope; naturalness; context; descriptors ID VOCAL EXPRESSION; PERCEPTION; RECOGNITION; DEPRESSION; STRESS; EAR; EYE AB Research on speech and emotion is moving from a period of exploratory research into one where there is a prospect of substantial applications, notably in human-computer interaction. Progress in the area relies heavily on the development of appropriate databases. This paper addresses four main issues that need to be considered in developing databases of emotional speech: scope, naturalness, context and descriptors. The state of the art is reviewed. A good deal has been done to address the key issues, but there is still a long way to go. The paper shows how the challenge of developing appropriate databases is being addressed in three major recent projects-the Reading-Leeds project, the Belfast project and the CREST-ESP project. From these and other studies the paper draws together the tools and methods that have been developed, addresses the problems that arise and indicates the future directions for the development of emotional speech databases. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Queens Univ Belfast, Sch English, Belfast BT7 1NN, Antrim, North Ireland. Queens Univ Belfast, Sch Psychol, Belfast BT7 1NN, Antrim, North Ireland. ATR, Human Informat Sci Labs, Seika, Kyoto 6190288, Japan. Univ Reading, Sch Linguist & Appl Language Stud, Reading RG6 6AA, Berks, England. RP Douglas-Cowie, E (reprint author), Queens Univ Belfast, Sch English, Belfast BT7 1NN, Antrim, North Ireland. 
EM e.douglas-cowie@qub.ac.uk CR Abelin A., 2000, P ISCA WORKSH SPEECH, P110 ALTER K, 2000, P ISCA ITRW SPEECH E, P138 Amir N., 2000, P ISCA ITRW SPEECH E, P29 ARNFIELD S, 1995, P ESCA NATO TUT RES, P13 BACHOROWSKI JA, 1995, PSYCHOL SCI, V6, P219, DOI 10.1111/j.1467-9280.1995.tb00596.x Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Batliner A, 2000, P ISCA WORKSH SPEECH, P195 Beckman M. E., 1994, GUIDELINES TOBI LABE BIRD S, 2001, ANNOTATION CORPUS TO, V33 Bonner MR, 1943, AM J PSYCHOL, V56, P262, DOI 10.2307/1417508 BREND RM, 1975, LANGUAGE SEX DIFFERE CAMPBELL WN, 1996, SP967 IEICE, P45 CAMPBELL WN, 2002, P LREC 2002 LAS PALM CAMPBELL WN, 2000, P ICSLP 2000 BEIJ, V4, P468 Cauldwell RT, 2000, P ISCA ITRW SPEECH E, P127 CHUNG S, 2000, EXPRESSION PERCEPTIO Cowie R., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608027 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Cowie R., 2000, P ISCA WORKSH SPEECH, P19 Cowie R., 1999, COMPUT INTELL, P109 COWIE R, 2000, P ISCA ITRW SPEECH E COWIE R, 1995, P 13 INT C PHON SCI, V3, P278 COWIE R, 1999, P 14 INT C PHON SCI, P2327 Crystal D., 1964, SYSTEMS PROSODIC PAR Crystal D., 1969, PROSODIC SYSTEMS INT de Gelder B, 2000, COGNITION EMOTION, V14, P289 de Gelder B, 2000, COGNITION EMOTION, V14, P321 Douglas-Cowie E., 2000, P ISCA WORKSH SPEECH, P39 Douglas-Cowie E, 1998, LANG SPEECH, V41, P351 Ekman P., 1969, SEMIOTICA, V1, P49 Ekman P., 1978, FACIAL ACTION CODING Ekman P., 1999, HDB COGNITION EMOTIO, P45, DOI DOI 10.1002/0470013494.CH3 EKMAN P, 1994, PSYCHOL BULL, V115, P268, DOI 10.1037//0033-2909.115.2.268 ENGBERG IS, 1997, P EUR 97 RHOD GREEC FERNANDEZ R, 2000, P ISCA ITRW SPEECH E, P129 France DJ, 2000, IEEE T BIO-MED ENG, V47, P829, DOI 10.1109/10.846676 Fridlund A. J., 1994, HUMAN FACIAL EXPRESS Frolov M. V., 1999, HUM PHYSL, V25, P42 GREASLEY P, 1995, P 13 ICPHS STOCKH, V1, P242 GREASLEY P, 1996, ABS INT J PSYCH, V31, P406 Greasley P, 2000, LANG SPEECH, V43, P355 HANSEN J, 1997, P EUR 1997 RHOD GREE, V5, P2387 HARGREAVES WILLIAM A., 1965, J ABNORM PSYCHOL, V70, P218, DOI 10.1037/h0022151 Harre R., 1986, SOCIAL CONSTRUCTION Iida A., 2000, P ISCA WORKSH SPEECH, P167 Iriondo I, 2000, P ISCA WORKSH SPEECH, P161 *ISO IEC JTCISC29W, 1996, MPEG96N1365 ISOIEC J JOHANNES B, AVIATION SPACE ENV M, V71, pA58 Johns-Lewis C., 1986, INTONATION DISCOURSE, P199 KARLSSON I, 1998, P FON 98 SWED PHON C, P150 Kienast M., 2000, P ISCA ITRW SPEECH E, p[5, 92] Ladd D. Robert, 1986, INTONATION DISCOURSE, P125 MARCUS H, 2001, EMOTIONS SOCIAL PSYC Massaro DW, 2000, COGNITION EMOTION, V14, P313 McEnery T., 1996, CORPUS LINGUISTICS MCGILLOWAY S, 1997, THESIS QUEENS U BELF Milroy L., 1987, OBSERVING ANAL NATUR MOKHTARI P, 2002, P LREC 2002 Mozziconacci S. J. 
L., 1998, SPEECH VARIABILITY E ORTONY A, 1988, COGNITIVE STRUCTRE E Osgood Charles E., 1957, MEASUREMENT MEANING Paeschke A., 2000, P ISCA WORKSH SPEECH, P75 Pereira C., 2000, P ISCA WORKSH SPEECH, P25 PEREIRA C, 2000, THESIS MACQUARIE U A Polzin T., 2000, P ISCA WORKSH SPEECH, P201 ROACH P, 1994, SPEECH COMMUN, V15, P91, DOI 10.1016/0167-6393(94)90044-2 ROACH P, 2000, P ISCA WORKSH SPEECH, P53 Roach P., 1998, J INT PHON ASSOC, V28, P83 ROESSLER R, 1979, PHENOMENOLLOGY TREAT Scherer KR, 2000, PERS SOC PSYCHOL B, V26, P327, DOI 10.1177/0146167200265006 SCHERER KR, 2000, ISCA WORKSH SPEECH E Scherer KR, 1997, MOTIV EMOTION, V21, P211, DOI 10.1023/A:1024498629430 SCHERER KR, 1985, J PSYCHOLINGUIST RES, V14, P409, DOI 10.1007/BF01067884 Sherrard C, 1996, INT J PSYCHOL, V31, P4762 SLANEY M, 1998, P ICASSP SEATTL WA U STASSEN HH, 1991, PSYCHOPATHOLOGY, V24, P88 Stemmler G., 1992, J PSYCHOPHYSIOL, V6, P17 Stibbard R., 2000, P ISCA WORKSH SPEECH, P60 SUMMERFIELD AQ, 1983, HEARING SCI HERING D, P132 TENBOSCH L, 2000, P ISCA ITRW SPEECH E, P189 TOLKMITT FJ, 1986, J EXP PSYCHOL HUMAN, V12, P302, DOI 10.1037//0096-1523.12.3.302 Trainor LJ, 2000, PSYCHOL SCI, V11, P188, DOI 10.1111/1467-9280.00240 Trudgill P., 1983, SOCIOLINGUISTICS INT VANBEEZOOIJEN R, 1984, CHARACTERISTICS RECO Waterman M, 1996, INT J PSYCHOL, V31, P4761 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 1996, GRONINGEN CORPUS S00 NR 87 TC 130 Z9 132 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2003 VL 40 IS 1-2 BP 33 EP 60 DI 10.1016/S0167-6393(02)00070-5 PG 28 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100003 ER PT J AU Alter, K Rank, E Kotz, SA Toepel, U Besson, M Schirmer, A Friederici, AD AF Alter, K Rank, E Kotz, SA Toepel, U Besson, M Schirmer, A Friederici, AD TI Affective encoding in the speech signal and in event-related brain potentials SO SPEECH COMMUNICATION LA English DT Article DE acoustics of affective speech; event-related brain potentials ID ACTIVATION; EMOTION AB A number of perceptual features have been utilized for the characterization of the emotional state of a speaker. However, for automatic recognition suitable objective features are needed. We have examined several features of the speech signal in relation to accentuation and traces of event-related brain potentials (ERPs) during affective speech perception. Concerning the features of the speech signal we focus on measures related to breathiness and roughness. The objective measures used were an estimation of the harmonics-to-noise ratio, the glottal-to-noise excitation ratio, a measure for spectral flatness, as well as the maximum prediction gain for a speech production model computed by the mutual information function and the ERPs. Results indicate that in particular the maximum prediction gain shows a good differentiation between neutral and non-neutral emotional speaker state. This differentiation is partly comparable to the ERP results that show a differentiation of neutral, positive and negative affect. Other objective measures are more related to accentuation than to emotional state of the speaker. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Max Planck Inst Cognit Neurosci, D-04103 Leipzig, Germany. Vienna Univ Technol, Inst Commun & Radio Frequency Engn, A-1040 Vienna, Austria. Univ Potsdam, Dept Linguist, D-14476 Potsdam, Germany. 
CNRS, Language & Mus Grp, Natl Ctr Sci Res, CRNC, F-13402 Marseille 20, France. RP Alter, K (reprint author), Max Planck Inst Cognit Neurosci, Stephanstr 1a, D-04103 Leipzig, Germany. EM alter@cns.mpg.de RI Schirmer, Annett /A-5257-2012 CR Bernhard HP, 1998, IEEE T SIGNAL PROCES, V46, P2909, DOI 10.1109/78.726805 BERNHARD HP, 1997, THESIS U TECHNOLOGY BLANKEN G., 1993, LINGUISTIC DISORDERS Cahn J.E., 1990, J AM VOICE I O SOC, V8, P1 DAVIDSON RJ, 1989, HDB NEUROPSYCHOLOGY, V3, P419 DEKROM G, 1994, THESIS UTRECHT FRIEDERICI AD, 1995, BRAIN LANG, V50, P259, DOI 10.1006/brln.1995.1048 FROHLICH M, 1998, P ICASSP 98 SEATTL W, V2, P937, DOI 10.1109/ICASSP.1998.675420 KLASMEYER G, 1997, P ICASSP 97 MUN GERM, P1615 KLASMEYER G, 1995, P ICPHS 95 STOCKH SW, V1, P181 KOTZ SA, 2000, J COG NEUROSCI S, V123 KUTAS M, 1980, SCIENCE, V207, P203, DOI 10.1126/science.7350657 LADD DR, 1985, J ACOUST SOC AM, V78, P435, DOI 10.1121/1.392466 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 Markel JD, 1976, LINEAR PREDICTION SP MICHAELIS D, 1995, ACTA ACUSTICA, V81, P700 Pihan H, 1997, NEUROREPORT, V8, P623, DOI 10.1097/00001756-199702100-00009 Pihan H, 2000, BRAIN, V123, P2338, DOI 10.1093/brain/123.11.2338 PINTO NB, 1990, J ACOUST SOC AM, V87, P1278, DOI 10.1121/1.398803 SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 NR 20 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2003 VL 40 IS 1-2 BP 61 EP 70 DI 10.1016/S0167-6393(02)00075-4 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100004 ER PT J AU Menezes, C Pardo, B Erickson, D Fujimura, O AF Menezes, C Pardo, B Erickson, D Fujimura, O TI Changes in syllable magnitude and timing due to repeated correction SO SPEECH COMMUNICATION LA English DT Article DE C/D model; articulatory gestures; X-ray microbeam; jaw movement; syllable magnitude; syllable timing; rhythm; prosody; emphasis; repeated correction AB In a semi-spontaneous conversational setting, subjects were made to repeat the same correction of one digit in a three-digit sequence consisting of "five" or "nine" followed by "Pine Street". Articulatory and acoustic signals were recorded by the University of Wisconsin Microbeam Facility for four speakers of American English. By analyzing jaw movements, syllable magnitude and time values were evaluated, to represent the rhythmic organization of the utterance by a linear string of syllable pulses. Preliminary results suggest that not only does the magnitude of the corrected syllable increase by the correction of a digit, but also, in most cases, there is some systematic increase of syllable magnitude both in the corrected digit and other digits in the same utterance, as the same correction is repeated. Considerable difference among different speakers is observed and discussed in terms of syllable magnitude and timing patterns. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Ohio State Univ, Dept Speech & Hearing Sci, Columbus, OH 43210 USA. Univ Michigan, Ann Arbor, MI 48109 USA. Gifu City Womens Coll, Gifu 5010192, Japan. RP Menezes, C (reprint author), Ohio State Univ, Dept Speech & Hearing Sci, Columbus, OH 43210 USA. 
RI Pardo, Bryan/B-7262-2009 CR BELL L, 1999, P ICPHS99 SAN FRANS, P1221 BROWMAN CP, 1992, PHONETICA, V49, P155 ERICKSON D, 1999, J ACOUST SOC AM, V106, P2241, DOI 10.1121/1.427636 Erickson D, 1998, PHONETICA, V55, P147, DOI 10.1159/000028429 ERICKSON D, 2000, P INT C SPOK LANG PR, V3, P247 Erickson D, 2002, PHONETICA, V59, P134, DOI 10.1159/000066067 Erickson D., 1998, LANG SPEECH, V41, P395 Fant G, 2000, PHONETICA, V57, P113, DOI 10.1159/000028466 FARNETANI E., 1997, HDB PHONETIC SCI, P371 FUJIMURA O, 1973, Computers in Biology and Medicine, V3, P371, DOI 10.1016/0010-4825(73)90003-6 FUJIMURA O, 1998, P EUR SPEECH COMM AS, P23 FUJIMURA O, 1999, IT ORD P LP 98, P40 Fujimura O., 1992, Journal of the Acoustical Society of Japan (E), V13 Fujimura O, 2000, PHONETICA, V57, P128, DOI 10.1159/000028467 FUJIMURA O, 1994, DIMACS SERIES DISCRE, V17, P1 Fujisaki H., 1992, SPEECH PERCEPTION PR, P313 KEATING P, 1995, P 13 INT C PHON SCI, V3, P26 KIRITANI S, 1975, J ACOUST SOC AM, V57, P1516, DOI 10.1121/1.380593 Krakow RA, 1999, J PHONETICS, V27, P23, DOI 10.1006/jpho.1999.0089 Laver John, 1994, PRINCIPLES PHONETICS Levelt W. J., 1989, SPEAKING INTENTION A LEVOW GA, 1998, P 36 ANN M ASS COMP, P736 LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 MAEKAWA K, 2000, P INT C SPOK LANG PR, V2, P349 MITCHELL CJ, 2000, P ISCA WORKSH SPEECH, P98 NADLER RD, 1987, P 11 INT C PHON SCI, V6, P10 Oviatt S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607722 Pierrehumbert J. B., 1980, THESIS MIT BLOOMINGT PIERREHUMBERT JB, 1988, LINGUIST INQUIRY MON, V17 SPRING C, 1992, P INT C SPOK LANG PR, P679 SPROAT R, 1993, J PHONETICS, V21, P291 Westbury J., 1989, J ACOUSTICAL SOC S1, VS98, DOI 10.1121/1.2027241 Westbury J.R., 1994, XRAY MICROBEAM SPEEC WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 NR 34 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2003 VL 40 IS 1-2 BP 71 EP 85 DI 10.1016/S0167-6393(02)00076-6 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100005 ER PT J AU Auberge, V Cathiard, M AF Auberge, V Cathiard, M TI Can we hear the prosody of smile? SO SPEECH COMMUNICATION LA English DT Article DE expression of amusement; audio-visual perception; prosodic parameters ID EXPRESSION; EMOTION; SPEECH AB Visual expression alone (a smile or a laugh) is often enough to identify an emotion such as amusement: in a perception task, subjects correctly identified visual and audio-visual stimuli of amused speech to the same degree (Proceedings of ICSLP, Sydney, 1998, p. 559), but even from the acoustic signal alone it has been demonstrated that the consequences of a (mechanical) smile gesture can be perceived as amusement (JASA 96 (1994) 2101). A hypothesis developed in the present work is that the expression of amusement in speech involves specific control of prosody and cannot be reduced simply to a change in voice quality as a consequence of the facial smile gesture. Speech stimuli were produced by French speakers for various tasks (spontaneous amusement, simulated amusement, mechanical smiling, ...), and in a first experiment, listeners were able to identify speech from the spontaneous smile condition as more amused than the "mechanical smile". 
It was shown from a second experiment that, even under clear visual conditions, the auditory modality contributes to audio-visual perception. A McGurk paradigm applied to discordant amused/mechanical stimuli clearly showed that acoustic information interacts with the visual decoding. The stimuli were analysed using a set of parameters chosen following Tartter (Percept. Psychophys. 27 (1980) 24), Banse and Scherer (J. Pers. Soc. Psych. 70 (1996) 614) and Mozziconacci (PhD Thesis, Eindhoven University, 1998). The prosodic parameters affected in the expression of amusement are primarily intensity and F0 declination, but they are different for different speakers. Our results confirm the finding by Mozziconacci (PhD Thesis, Eindhoven University, 1998) that there may be numerous ways of using the same parameters to express emotions such as amusement. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Grenoble 3, Inst Commun Parlee, INPG, F-38040 Grenoble 9, France. RP Auberge, V (reprint author), Univ Grenoble 3, Inst Commun Parlee, INPG, 1180 Av Cent,BP25, F-38040 Grenoble 9, France. EM auberge@icp.inpg.fr CR AUBERGE V, 1997, EUROSPEECH 97, P871 BANSE R, 1999, J PERS SOC PSYCHOL, V170, P614 Damasio A., 1994, DESCARTES ERROR EMOT de Gelder B, 2000, COGNITION EMOTION, V14, P289 Duchenne B., 1862, MECH PHYSIONOMIE HUM EKMAN P, 1990, J PERS SOC PSYCHOL, V58, P342, DOI 10.1037/0022-3514.58.2.342 Greasley P, 2000, LANG SPEECH, V43, P355 LEMAITRE L, 2000, RECHERCHE PARAMETRES MASSARO DW, 2000, P ISCA WORKSH SPEECH, P114 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 MOZZICONNACI S, 1998, THESIS EINDHOVEN U Ohala J. J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607982 Pourtois G, 2000, NEUROREPORT, V11, P1329, DOI 10.1097/00001756-200004270-00036 Provine RR, 1996, AM SCI, V84, P38 ROACH P, 2000, P ISCA WORKSH SPEECH, P53 RUSSELL JA, 1994, PSYCHOL BULL, V115, P102, DOI 10.1037/0033-2909.115.1.102 SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 SCHRODER M, 1998, P ICSLP SYDN, P559 SCHRODER M, 1998, EXPRESSION VOCALE AM SHIGENO S, 1998, P 5 INT C SPOK LANG, V1, P149 TARTTER VC, 1980, PERCEPT PSYCHOPHYS, V27, P24, DOI 10.3758/BF03199901 TARTTER VC, 1994, J ACOUST SOC AM, V96, P2101, DOI 10.1121/1.410151 VIAL A, 1999, CONTRIBUTION INFORMA Wichmann Anne, 2000, P ISCA WORKSH SPEECH, P143 NR 24 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2003 VL 40 IS 1-2 BP 87 EP 97 DI 10.1016/S0167-6393(02)00077-8 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100006 ER PT J AU Schroder, M AF Schroder, M TI Experimental study of affect bursts SO SPEECH COMMUNICATION LA English DT Article DE affect bursts; interjections; emotion ID EMOTION RECOGNITION; SPEAKER AFFECT; VOCAL EMOTION; EXPRESSION AB The study described here investigates the perceived emotional content of "affect bursts" for German. Affect bursts are defined as short emotional non-speech expressions. This study shows that affect bursts, presented without context, can convey a clearly identifiable emotional meaning. The influence of the segmental structure on emotion recognition, as opposed to prosody and voice quality, is investigated.
Agreement between transcribers is used as an experimental criterion for distinguishing between reflexive raw affect bursts and conventionalised affect emblems. A detailed account of 28 affect burst classes is given, including perceived emotion and recognition rate in listening and reading perception tests as well as a phonetic transcription of segmental structure, voice quality and intonation. (C) 2002 Elsevier Science B.V. All rights reserved. C1 DFKI, D-66123 Saarbrucken, Germany. Univ Saarland, Inst Phonet, D-6600 Saarbrucken, Germany. RP Schroder, M (reprint author), DFKI, Stuhlsatzenhausweg 3, D-66123 Saarbrucken, Germany. EM schroed@dfki.de CR Andre E, 2000, EMBODIED CONVERSATIONAL AGENTS, P220 Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Campbell N., 2000, P ISCA WORKSH SPEECH, P34 Cauldwell RT, 2000, P ISCA ITRW SPEECH E, P127 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 COWIE R, 2003, COMMUNICATION, V40 Cowie R., 2000, P ISCA WORKSH SPEECH, P11 Cowie R., 1999, COMPUT INTELL, P109 DIETZ RB, P 3 INT COGN TECHN C Douglas-Cowie E., 2000, P ISCA WORKSH SPEECH, P39 DROSDOWSKI G, 1989, DUDEN HERKUNFTSWORTE Dutoit T., 1996, P ICSLP 96 PHIL, V3, P1393, DOI 10.1109/ICSLP.1996.607874 Ehlich K., 1986, INTERJEKTIONEN Ekman P., 1982, EMOTION HUMAN FACE GOBL C, 2000, P ISCA WORKSH SPEECH, P178 JOHNSTONE T, 1995, P INT C PHON SCI STO, V4, P2 KIENAST M, 1999, P EUR 1999, P117 LADD DR, 1985, J ACOUST SOC AM, V78, P435, DOI 10.1121/1.392466 Leinonen L, 1997, J ACOUST SOC AM, V102, P1853, DOI 10.1121/1.420109 McQueen J. M., 1997, HDB PHONETIC SCI, P566 MOZZICONACCI SJL, 1998, THESIS TU EINDHOVEN MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 ORTONY A, 1990, PSYCHOL REV, V97, P315, DOI 10.1037//0033-295X.97.3.315 Paeschke A., 2000, P ISCA WORKSH SPEECH, P75 Pereira C., 2000, P ISCA WORKSH SPEECH, P25 SCHEMPP P, 1988, J TEACH PHYS EDUC, V7, P79 SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 SCHERER KR, 1984, J ACOUST SOC AM, V76, P1346, DOI 10.1121/1.391450 Scherer KR, 2000, PERS SOC PSYCHOL B, V26, P327, DOI 10.1177/0146167200265006 SCHERER KR, 1994, EMOTIONS: ESSAYS ON EMOTION THEORY, P161 SCHRODER M, 1998, P ICSLP SYDN, P559 SCHRODER M, 2001, P EUR 2001 AALB, V1, P87 SCHRODER M, 1999, PHONUS 4 RES REPORT, P37 STALLO J, 2000, THESIS CURTIN U TECH Wells RS, 1945, LANGUAGE, V21, P27, DOI 10.2307/410202 Wichmann Anne, 2000, P ISCA WORKSH SPEECH, P143 ZERLING JP, 1995, TRAVAUX I PHONETIQUE, V25, P95 NR 37 TC 45 Z9 48 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2003 VL 40 IS 1-2 BP 99 EP 116 DI 10.1016/S0167-6393(02)00078-X PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100007 ER PT J AU Batliner, A Fischer, K Huber, R Spilker, J Noth, E AF Batliner, A Fischer, K Huber, R Spilker, J Noth, E TI How to find trouble in communication SO SPEECH COMMUNICATION LA English DT Article DE emotion; dialogue; prosody; annotation; automatic classification; spontaneous speech; neural networks ID ERROR RESOLUTION; SPEECH; EXPRESSION; PROSODY AB Automatic dialogue systems used, for instance, in call centers, should be able to determine in a critical phase of the dialogue (indicated by the customer's vocal expression of anger/irritation) when it is better to pass over to a human operator.
At first glance, this does not seem to be a complicated task: it is reported in the literature that emotions can be told apart quite reliably on the basis of prosodic features. However, these results are achieved most of the time in a laboratory setting, with experienced speakers (actors), and with elicited, controlled speech. We compare classification results obtained with the same feature set for elicited speech and for a Wizard-of-Oz scenario, where users believe that they are really communicating with an automatic dialogue system. It turns out that the closer we get to a realistic scenario, the less reliable prosody is as an indicator of the speakers' emotional state. As a consequence, we propose to change the target such that we cease looking for traces of particular emotions in the users' speech, but instead look for indicators of TROUBLE IN COMMUNICATION. For this reason, we propose the module Monitoring of User State [especially of] Emotion (MOUSE), in which a prosodic classifier is combined with other knowledge sources, such as conversationally peculiar linguistic behavior, for example, the use of repetitions. For this module, preliminary experimental results are reported showing a more adequate modelling of TROUBLE IN COMMUNICATION. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Erlangen Nurnberg, Lehrstuhl Mustererkennung Informat 5, D-91058 Erlangen, Germany. Univ Bremen, Fachbereich 10, Sprach & Literaturwissensch, D-28334 Bremen, Germany. RP Batliner, A (reprint author), Univ Erlangen Nurnberg, Lehrstuhl Mustererkennung Informat 5, Martensstr 3, D-91058 Erlangen, Germany. EM batliner@informatik.uni-erlangen.de CR Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 Batliner A, 1998, SPEECH COMMUN, V25, P193, DOI 10.1016/S0167-6393(98)00037-5 BATLINER A, 1997, P ESCA WORKSH INT DE, P39 BATLINER A, 2001, P EUR C SPEECH COMM, V4, P2781 BATLINER A, 2000, VERBMOBIL FDN SPEECH, P106 Batliner A, 2001, P ISCA TUT RES WORKS, P23 BATLINER A, 1999, P EUR C SPEECH COMM, V1, P519 BATLINER A, 1994, FOCUS NATURAL LANGUA, V1, P11 Batliner A., 1999, P 14 INT C PHON SCI, V3, P2315 BATLINER A, 2000, VERBMOBIL FDN SPEECH, P122 Cowie R., 2000, P ISCA WORKSH SPEECH ECKERT W, 1995, P ESCA WORKSH SPOK D, P193 Ekman P., 1969, SEMIOTICA, V1, P49 Fiehler Reinhard, 1990, KOMMUNIKATION EMOTIO FISCHER K, 1999, 23L VERBM FISCHER K, 1999, HUMAN COMPUTER INTER, V1, P560 Fraser N.
M., 1991, COMPUT SPEECH LANG, V5, DOI 10.1016/0885-2308(91)90019-M Gunthner S, 1997, LANGUAGE OF EMOTIONS, P247 HAAS J, 2001, PROBABILISTIC METHOD Hirschberg J., 1999, P AUT SPEECH REC UND, P349 JOHNSTONE T, 1995, P INT C PHON SCI STO, V4, P2 Kaiser S., 1998, P 10 C INT SOC RES E, P82 Kiessling A., 1997, EXTRAKTION KLASSIFIK KOMPE R, 1997, LECT NOTES ARTIFICIA LEVOW GA, 1999, P ESCA WORKSH DIAL P, P193 LEVOW GA, 1998, P 36 ANN M ASS COMP, P736 LI Y, 1998, P INT C SPOK LANG PR, V6, P2255 Noth E, 2002, SPEECH COMMUN, V36, P45, DOI 10.1016/S0167-6393(01)00025-5 Oviatt S, 1998, SPEECH COMMUN, V24, P87, DOI 10.1016/S0167-6393(98)00005-3 Oviatt S, 1998, LANG SPEECH, V41, P419 PAESCHKE A, 1999, P 14 INT C PHON SCI, V2, P929 Pirker H., 1999, P ESCA WORKSH DIAL P, P181 REITHINGER N, 2000, VERBMOBIL FDN SPEECH, P428 Scherer KR, 2000, PERS SOC PSYCHOL B, V26, P327, DOI 10.1177/0146167200265006 SCHERER KR, 1995, P INT C PHON SCI STO, V3, P90 Scherer KR, 1997, MOTIV EMOTION, V21, P211, DOI 10.1023/A:1024498629430 SELTING M, 1994, J PRAGMATICS, V22, P375, DOI 10.1016/0378-2166(94)90116-3 TISCHER B, 1993, VOKALE KOMMUNIKATION, V18 WAHLSTER W, 2001, P EUR 2001 7 EUR C S, V3, P1547 Walker M., 2000, P N AM M ASS COMP LI, P210 WARNKE V, 1999, P EUR C SPEECH COMM, V1, P235 NR 41 TC 117 Z9 119 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2003 VL 40 IS 1-2 BP 117 EP 143 DI 10.1016/S0167-6393(02)00079-1 PG 27 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100008 ER PT J AU Fernandez, R Picard, RW AF Fernandez, R Picard, RW TI Modeling drivers' speech under stress SO SPEECH COMMUNICATION LA English DT Article DE stress in speech; automatic recognition AB We explore the use of features derived from multiresolution analysis of speech and the Teager energy operator for classification of drivers' speech under stressed conditions. We apply this set of features to a database of short speech utterances to create user-dependent discriminants of four stress categories. In addition, we address the problem of choosing a suitable temporal scale for representing categorical differences in the data. This leads to two modeling approaches. In the first approach, the dynamics of the feature set within the utterance are assumed to be important for the classification task. These features are then classified using dynamic Bayesian network models as well as a model consisting of a mixture of hidden Markov models (M-HMM). In the second approach, we define an utterance-level feature set by taking the mean value of the features across the utterance. This feature set is then modeled with a support vector machine and a multilayer perceptron classifier. We compare the performance of the sparser and full dynamic representations against a chance-level performance of 25% and obtain the best performance with the speaker-dependent mixture model (96.4% on the training set, and 61.2% on a separate testing set). We also investigate how these models perform on the speaker-independent task. Although the performance of the speaker-independent models degrades with respect to the models trained on individual speakers, the mixture model still outperforms the competing models and achieves significantly better-than-chance recognition (80.4% on the training set, and 51.2% on a separate testing set). (C) 2002 Elsevier Science B.V. All rights reserved.
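[Editor's illustration] The Fernandez and Picard abstract above builds its stress features on the Teager energy operator, whose discrete form is psi[n] = x[n]^2 - x[n-1]*x[n+1] (Kaiser's operator). The sketch below computes frame-level Teager-energy statistics; it is a minimal Python sketch assuming a 16 kHz signal with 25 ms frames and a 10 ms hop, not the authors' multiresolution pipeline, and the helper names teager_energy and frame_features are hypothetical.

```python
# Minimal sketch: frame-level Teager-energy features (not the authors' code).
import numpy as np

def teager_energy(x: np.ndarray) -> np.ndarray:
    """Discrete Teager energy operator: psi[n] = x[n]^2 - x[n-1]*x[n+1]."""
    x = np.asarray(x, dtype=float)
    psi = np.empty_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    psi[0], psi[-1] = psi[1], psi[-2]  # replicate values at the borders
    return psi

def frame_features(x: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Mean and standard deviation of the Teager energy per frame."""
    psi = teager_energy(x)
    frames = [psi[i:i + frame_len] for i in range(0, len(psi) - frame_len + 1, hop)]
    return np.array([[f.mean(), f.std()] for f in frames])

if __name__ == "__main__":
    # Toy input: a 200 Hz tone with light noise, 1 s at 16 kHz.
    rng = np.random.default_rng(0)
    x = np.sin(2 * np.pi * 200 * np.arange(16000) / 16000) + 0.01 * rng.standard_normal(16000)
    print(frame_features(x).shape)  # (frames, 2): one mean/std pair per 25 ms frame
```

Such per-frame statistics could feed either of the two modeling routes the abstract describes: kept as a sequence for the dynamic (HMM/DBN) models, or averaged over the utterance for the SVM and perceptron classifiers.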
C1 MIT, Media Lab, Cambridge, MA 02139 USA. RP Fernandez, R (reprint author), MIT, Media Lab, 20 Ames St, Cambridge, MA 02139 USA. EM galt@media.mit.edu; picard@media.mit.edu CR Bishop C. M., 1995, NEURAL NETWORKS PATT COWIE R, 2000, P ISCA ITRW SPEECH E Daubechies I., 1992, REGIONAL C SERIES AP Ghahramani Z, 1997, MACH LEARN, V29, P245, DOI 10.1023/A:1007425814087 Gunn S., 1998, SUPPORT VECTOR MACHI Hansen JHL, 1996, IEEE T SPEECH AUDI P, V4, P307, DOI 10.1109/89.506935 HANSEN JHL, 1998, GETTING STARTED SUSA JABLOUN F, 1999, P IEEE INT C AC SPEE, V1, P273 Jensen F. V., 1996, INTRO BAYESIAN NETWO JORDAN MI, 1996, ADV NEURAL INFORMATI, V9, P501 McGilloway S., 2000, P ISCA WORKSH SPEECH, P207 MINKA TP, 1999, BAYESIAN INFERENCE M MURPHY K, 1998, FITTING CONSTRAINED MURRAY IR, 1996, SPEECH COMMUN, V20, P1 OSUNA EE, 1997, 1602CBCL AI Pittman J., 1993, HDB EMOTIONS, P185 POLZIN T, 2000, THESIS CARNEGIE MELL Rabiner L, 1993, FUNDAMENTALS SPEECH SARIKAYA R, 1997, P SOUTH 97 ENG NEW C, P92, DOI 10.1109/SECON.1997.598617 SARIKAYA R, 1998, P IEEE INT C AC SPEE, V1, P569, DOI 10.1109/ICASSP.1998.674494 Steeneken H.J.M., 1999, P ICASSP, V4, P2079 van Bezooijen R., 1984, CHARACTERISTICS RECO ZHOU G, 1998, P IEEE INT C AC SPEE, V1, P549 ZHOU G, 1999, P IEEE INT C AC SPEE, V4, P2087 NR 24 TC 47 Z9 48 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2003 VL 40 IS 1-2 BP 145 EP 159 DI 10.1016/S0167-6393(02)00080-8 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100009 ER PT J AU Iida, A Campbell, N Higuchi, F Yasumura, M AF Iida, A Campbell, N Higuchi, F Yasumura, M TI A corpus-based speech synthesis system with emotion SO SPEECH COMMUNICATION LA English DT Article DE emotion; natural speech; corpus; source database; concatenative speech synthesis ID VOCAL EMOTION; EXPRESSION AB We propose a new approach to synthesizing emotional speech by a corpus-based concatenative speech synthesis system (ATR CHATR) using corpora of emotional speech. In this study, neither emotion-dependent prosody prediction nor signal processing per se is performed for emotional speech. Instead, a large speech corpus is created per emotion, and speech with the appropriate emotion is synthesized by simply switching between the emotional corpora. This is made possible by the normalization procedure incorporated in CHATR that transforms its standard predicted prosody range according to the source database in use. We evaluate our approach by creating three emotional speech corpora (anger, joy, and sadness) from recordings of a male and a female speaker of Japanese. The acoustic characteristics of each corpus are different and the emotions identifiable. The acoustic characteristics of each emotional utterance synthesized by our method show clear correlations with those of the corresponding corpus. Perceptual experiments using synthesized speech confirmed that our method can synthesize recognizably emotional speech. We further evaluated the method's intelligibility and the overall impression it gives to the listeners. The results show that the proposed method can synthesize speech with high intelligibility and gives a favorable impression. With these encouraging results, we have developed a workable text-to-speech system with emotion to support the immediate needs of nonspeaking individuals.
This paper describes the proposed method, the design and acoustic characteristics of the corpora, and the results of the perceptual evaluations. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Keio Univ, Keio Res Inst SFC, Fujisawa, Kanagawa 2528520, Japan. JST, CREST, Kyoto, Japan. ATR Human Informat Sci Res Labs, Kyoto, Japan. Keio Univ, Grad Sch Media & Governance, Kanagawa, Japan. RP Iida, A (reprint author), Keio Univ, Keio Res Inst SFC, 5322, Endo, Fujisawa, Kanagawa 2528520, Japan. CR ABE M, 1990, ATR TECHNICAL REPORT Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 BLACK A, 1996, P ICSLP 96 PHIL US, V3, P1385, DOI 10.1109/ICSLP.1996.607872 BUNNELL HT, 1998, P ICSLP 98 SYDN AUST, V5, P1723 CAHN JE, 1989, P 1989 C AM VOIC I O, P251 Campbell W. N., 1997, P INT C SPEECH PROC, P183 CAMPBELL WN, 1997, PROGR SPEECH SYNTHES, P279 CAMPBELL WN, 1996, P ICSLP 96 PHIL US, P2399 CAMPBELL WN, 1997, COMPUTING PROSODY, P165 CARLSON R, 1992, P ICSLP 92 BANFF ALB, V1, P671 Cowie R, 2001, IEEE SIGNAL PROC MAG, V18, P32, DOI 10.1109/79.911197 Davitz J. R, 1964, COMMUNICATION EMOTIO, P101 Davitz JR, 1964, COMMUNICATION EMOTIO, P13 *ENTR RES LAB INC, 1996, ESPS PROGR A L Fairbanks G, 1939, SPEECH MONOGR, V6, P87 Guerrero L. K., 1998, HDB COMMUNICATION EM, P3 ICHIKAWA A, 1967, P AC SOC JAP FALL M, P95 IIDA A, 1998, P ICSLP 98 SYDN AUST, V4, P1559 Iida A., 2000, P ISCA WORKSH SPEECH, P167 ITO K, 1986, ERGONOMICS, V22, P211 *JEIDA, 2000, GUID SPEECH SYNTH SY KAMIMURA K, 1990, ASHITAWO TSUKURU SEZ KATAE N, 2000, P AC SOC JAP FALL M, P187 KEATING PA, 1984, PHONETICA, V41, P191 KITAHARA Y, 1992, IEICE T FUND ELECTR, VE75A, P155 KITAHARA Y, 1987, J INTERACTION I EL D, V70, P2095 MAKINO T, 1989, IEICE, V72, P837 MOKHTARI P, 2001, P INT C SPEECH PROC, P431 MOZZICONACCI SJL, 1998, THESIS TU EINDHOVEN Murray IR, 2000, P ISCA WORKSH SPEECH, P173 MURRAY IR, 1995, SPEECH COMMUN, V16, P369, DOI 10.1016/0167-6393(95)00005-9 MURRAY IR, 1991, P EUR 91 GEN IT, P311 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 NAGAE Y, 1998, THESIS UTSUNOMIYA U OHIRA Y, 1995, WATASHIRASHIKU NINGE Russell JA, 1989, EMOTION THEORY RES E, V4, P83 SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 SHAVER P, 1987, J PERS SOC PSYCHOL, V52, P1061, DOI 10.1037/0022-3514.52.6.1061 TAKEDA S, 2000, P AC SOC JAP FALL M, P191 TODOROKI T, 1993, KOUSAI KAGAYAKI TUZU NR 41 TC 49 Z9 51 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2003 VL 40 IS 1-2 BP 161 EP 187 DI 10.1016/S0167-6393(02)00081-X PG 27 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100010 ER EF
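[Editor's illustration] The Iida et al. record above attributes its corpus-switching approach to a normalization step in CHATR that remaps the synthesizer's standard predicted prosody range onto the range of the emotional source database in use. The sketch below is a schematic, assumption-laden illustration of that kind of range remapping (a simple z-score transform of an F0 contour); it is not the CHATR implementation, and the function name renormalize_f0 and all corpus statistics are hypothetical.

```python
# Schematic sketch: remap a predicted F0 contour from the synthesizer's
# default range onto the range of the emotional corpus in use.
import numpy as np

def renormalize_f0(f0_pred: np.ndarray,
                   default_mean: float, default_std: float,
                   corpus_mean: float, corpus_std: float) -> np.ndarray:
    """Z-score the predicted contour, then rescale to the corpus statistics."""
    z = (np.asarray(f0_pred, dtype=float) - default_mean) / default_std
    return corpus_mean + corpus_std * z

if __name__ == "__main__":
    f0 = np.array([110.0, 130.0, 150.0])  # predicted contour, Hz
    # Hypothetical statistics for a "joy" corpus with a higher, wider F0 range.
    print(renormalize_f0(f0, default_mean=120.0, default_std=20.0,
                         corpus_mean=180.0, corpus_std=35.0))
```

Under this scheme, switching the emotional corpus changes only the target statistics, which is consistent with the abstract's claim that no emotion-dependent prosody prediction or signal processing is needed.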