FN Thomson Reuters Web of Science™ VR 1.0 PT J AU Gobl, C Ni Chasaide, A AF Gobl, C Ni Chasaide, A TI The role of voice quality in communicating emotion, mood and attitude SO SPEECH COMMUNICATION LA English DT Article DE voice quality; affect; emotion; mood; attitude; voice source; inverse filtering; fundamental frequency; synthesis; perception ID PROSODIC FEATURES; CONNECTED SPEECH; SPEAKER AFFECT; GLOTTAL WAVE; FEMALE; MODEL; PERCEPTION; PARAMETERS; PHONATION AB This paper explores the role of voice quality in the communication of emotions, moods and attitudes. Listeners' reactions to an utterance synthesised with seven different voice qualities were elicited in terms of pairs of opposing affective attributes. The voice qualities included harsh voice, tense voice, modal voice, breathy voice, whispery voice, creaky voice and lax-creaky voice. These were synthesised using a formant synthesiser, and the voice source parameter settings were guided by prior analytic studies as well as auditory judgements. Results offer support for some past observations on the association of voice quality and affect, and suggest a number of refinements in some cases. Listeners' ratings further suggest that these qualities are considerably more effective in signalling milder affective states than the strong emotions. It is clear that there is no one-to-one mapping between voice quality and affect: rather a given quality tends to be associated with a cluster of affective attributes. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Trinity Coll Dublin, Phonet & Speech Sci Lab, Ctr Language & Commun Studies, Dublin 2, Ireland. RP Gobl, C (reprint author), Trinity Coll Dublin, Phonet & Speech Sci Lab, Ctr Language & Commun Studies, Dublin 2, Ireland. EM cegobl@tcd.ie CR ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R Alku P., 1994, P INT C SPOK LANG PR, P1619 Alku P, 1996, FOLIA PHONIATR LOGO, V48, P240 ALTER K, 1999, P 14 INT C PHON SCI, P2121 ANANTHAPADMANABHA T.V., 1984, STL QPSR, V2, P1 Burkhardt F., 2000, P ISCA WORKSH SPEECH, P151 CAHN JE, 1990, GENERATING EXPRESSIO Cahn J.E., 1990, J AM VOICE I O SOC, V8, P1 CARLSON R, 1991, SPEECH COMMUN, V10, P481, DOI 10.1016/0167-6393(91)90051-T CARLSON R, 1992, SPEECH COMMUN, V2, P347 CHAN DSF, 1989, P EUR 89 PAR CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 CRANEN B, 1985, J ACOUST SOC AM, V77, P1543, DOI 10.1121/1.391997 CUMMINGS KE, 1995, J ACOUST SOC AM, V98, P88, DOI 10.1121/1.413664 DING W, 1994, P INT C SPOK LANG PR, P159 Fant G, 1997, SPEECH COMMUN, V22, P125, DOI 10.1016/S0167-6393(97)00017-4 Fant G., 1995, STL QPSR, V36, P119 FANT G, 1982, STL QPSR, V4, P28 FANT G, 1979, STL QPSR, V3, P31 FANT G, 1979, STL QPSR, V1, P85 FANT G, 1991, VOCAL FOLD PHYSL ACO, P47 Fant Gunnar, 1985, STL QPSR, V4, P1 FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412 Frohlich M, 2001, J ACOUST SOC AM, V110, P479, DOI 10.1121/1.1379076 FUJISAKI H, 1986, P IEEE INT C AC SPEE GOBL C, 1988, STL QPSR, V1, P123 Gobl C., 1989, STL QPSR, P9 GOBL C, 1999, P 14 INT C PHON SCI, P2437 GOBL C, 1992, SPEECH COMMUN, V11, P481, DOI 10.1016/0167-6393(92)90055-C Gobl C, 1995, P 13 INT C PHON SCI, V1, P74 HAMMARBERG B, 1986, STUDIES LOGOPEDICS P, V1 HEDELIN P, 1984, P IEEE INT C AC SPEE Hertegard S, 1991, VOCAL FOLD PHYSL ACO, P243 HUNT MJ, 1987, P 11 INT C PHON SCI, V3, P23 Hunt M. 
J., 1978, Proceedings of the 1978 IEEE International Conference on Acoustics, Speech and Signal Processing Iida A., 2000, P ISCA WORKSH SPEECH, P167 JANSEN J, 1990, THESIS NIJMEGEN U Kane P, 1992, J CLIN SPEECH LANGUA, V1, P17 KARLSSON I, 1990, P INT C SPOK LANG PR, P225 KARLSSON I, 1996, STL QPSR, V2, P143 KARLSSON I, 1992, SPEECH COMMUN, V11, P1 Kasuya H., 1999, P INT C PHON SCI SAN, P2505 KIENAST M, 1999, P EUR 1999, P117 KITZING P, 1975, MED BIOL ENG, V13, P644, DOI 10.1007/BF02477320 Klasmeyer G, 1995, P 13 INT C PHON SCI, V1, P182 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 KLATT DH, UNPUB DESCRIPTION CA, pCH3 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 KOREMAN J, 1995, PHONUS, V1, P105 LADD DR, 1985, J ACOUST SOC AM, V78, P435, DOI 10.1121/1.392466 Laukkanen AM, 1996, J PHONETICS, V24, P313, DOI 10.1006/jpho.1996.0017 LAUKKANEN AM, 1997, J LOGOPEDICS PHONIAT, V22, P157 LAUKKANEN AM, 1995, P INT C PHON SCI STO, V1, P246 Laver J, 1980, PHONETIC DESCRIPTION LEE CK, 1991, VOCAL FOLD PHYSL ACO, P233 LJUNGQVIST M, 1985, T COMMITTEE SPEECH S, V85, P153 MAHSHIE J, 1999, P 14 INT C PHON SCI, P1009 MCKENNA J, 1999, P EUR 99 BUD, P2793 MEURLINGER C, 1997, THESIS SPEECH MUSIC *MIN INC, 2001, MINITAB STAT SOFTW R MONSEN RB, 1977, J ACOUST SOC AM, V62, P981, DOI 10.1121/1.381593 MOZZICONACCI SJL, 1995, P 13 INT C PHONETICS, V1, P178 MOZZICONACCI SJL, 1998, THESIS TU EINDHOVEN MURRAY IR, 1995, P ISCA WORKSH SPEECH, V20, P85 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 Ni Chasaide A., 1997, HDB PHONETIC SCI, P427 NICHASAIDE A, 1993, LANG SPEECH, V36, P303 Ni Chasaide A., 1999, COARTICULATION THEOR, P300 NICHASAIDE A, 1995, P 13 INT C PHON SCI, V4, P6 NICHASAIDE A, 1992, J CLIN SPEECH LANGUA, V1, P1 NICHASAIDE A, 2002, IMPROVEMENTS SPEECH, P252 OLIVERA LC, 1997, PROGR SPEECH SYNTHES, P27 OLIVERA LC, 1993, P EUR 93 BERL, P99 PALMER SK, 1992, P INT C SPOK LANG PR, P129 PIERREHUMBERT JB, 1989, SPEECH TRANSMISSION, V4, P23 PRICE PJ, 1989, SPEECH COMMUN, V8, P261, DOI 10.1016/0167-6393(89)90005-8 QI YY, 1994, J ACOUST SOC AM, V96, P1182, DOI 10.1121/1.410392 ROSENBER.AE, 1971, J ACOUST SOC AM, V49, P583, DOI 10.1121/1.1912389 ROTHENBERG M., 1975, P SPEECH COMM SEM SO, V2, P235 Scherer K., 1989, EMOTION THEORY RES E, V4, P233 Scherer K., 1991, FUNDAMENTALS NONVERB, P200 Scherer K. R., 1999, P 14 INT C PHON SCI, P2029 SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 SCHERER KR, 1984, J ACOUST SOC AM, V76, P1346, DOI 10.1121/1.391450 SCHERER KR, 1981, EVALUATION SPEECH PS, P189 SCHERER KR, 1994, EMOTIONS: ESSAYS ON EMOTION THEORY, P161 SCHOENTGEN J., 1993, P EUROSPEECH 93, P107 SCHRODER M, 2000, P ISCA WORKSH SPEECH, P132 SCULLY C, 1995, P 13 INT C PHON SCI, V2, P482 Stibbard R., 2000, P ISCA WORKSH SPEECH, P60 STRIK H, 1992, SPEECH COMMUN, V11, P167, DOI 10.1016/0167-6393(92)90011-U Strik H., 1993, P EUR 93 BERL, P103 STRIK H, 1994, P INT C SPOK LANG PR, P155 STRIK H, 1992, P INT C SPOK LANG PR, V1, P121 Swerts M, 2001, SPEECH COMMUN, V33, P297, DOI 10.1016/S0167-6393(00)00061-3 TALKIN D, 1990, P ESCA WORKSH SPEECH, P55 Uldall E. T., 1964, HONOUR D JONES, P271 Veldhuis R, 1998, J ACOUST SOC AM, V103, P566, DOI 10.1121/1.421103 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 WONG D, 1979, IEEE T ACOUSTICS SPE, V24, P350 NR 100 TC 137 Z9 143 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
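For readers who want to experiment with the kind of voice source manipulation described in the Gobl and Ni Chasaide abstract above, the sketch below implements one cycle of the Liljencrants-Fant (LF) glottal flow derivative model (the Fant 1985 STL-QPSR entry in the record's reference list). It is an illustration only, not the authors' synthesis code: the parameter values are hypothetical, and alpha is left as a free tilt control rather than solved from the LF zero-net-flow condition as full implementations do. Tense versus breathy or lax qualities are conventionally approximated by varying Ee, Ta and the open-phase timing.

```python
import numpy as np

def lf_pulse(T0=0.008, Tp=0.0045, Te=0.006, Ta=0.0003,
             Ee=1.0, alpha=200.0, fs=16000):
    """One cycle of the LF glottal flow derivative (Fant et al., 1985).

    T0: period (s); Tp: instant of peak glottal flow; Te: instant of
    main excitation; Ta: return-phase time constant; Ee: excitation
    strength; alpha: open-phase growth factor (hypothetical value here;
    full LF solvers derive it from the zero-net-flow condition).
    """
    wg = np.pi / Tp                       # open-phase sinusoid frequency
    # Solve eps*Ta = 1 - exp(-eps*(T0 - Te)) by fixed-point iteration.
    eps = 1.0 / Ta
    for _ in range(50):
        eps = (1.0 - np.exp(-eps * (T0 - Te))) / Ta
    # Scale the open phase so that E(Te) = -Ee (the main excitation).
    E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))
    t = np.arange(int(T0 * fs)) / fs
    open_phase = E0 * np.exp(alpha * t) * np.sin(wg * t)
    return_phase = -(Ee / (eps * Ta)) * (np.exp(-eps * (t - Te))
                                         - np.exp(-eps * (T0 - Te)))
    return np.where(t < Te, open_phase, return_phase)

pulse = lf_pulse()  # feed a pulse train through a formant filter to hear it
```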
PD APR PY 2003 VL 40 IS 1-2 BP 189 EP 212 DI 10.1016/S0167-6393(02)00082-1 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100011 ER PT J AU ten Bosch, L AF ten Bosch, L TI Emotions, speech and the ASR framework SO SPEECH COMMUNICATION LA English DT Article DE emotion; prosody; automatic speech recognition AB Automatic recognition and understanding of speech are crucial steps towards natural human-machine interaction. Apart from the recognition of the word sequence, the recognition of properties such as prosody, emotion tags or stress tags may be of particular importance in this communication process. This paper discusses the possibilities of recognizing emotion from the speech signal, primarily from the viewpoint of automatic speech recognition (ASR). The general focus is on the extraction of acoustic features from the speech signal that can be used for the detection of the emotional state or stress state of the speaker. After the introduction, a short overview of the ASR framework is presented. Next, we discuss the relation between recognition of emotion and ASR, and the different approaches found in the literature that deal with the correspondence between emotions and acoustic features. The conclusion is that automatic emotional tagging of the speech signal is difficult to perform with high accuracy, but prosodic information is nevertheless potentially useful to improve the dialogue handling in ASR tasks on a limited domain. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Nijmegen, Dept Language & Speech, NL-6500 HD Nijmegen, Netherlands. RP ten Bosch, L (reprint author), Univ Nijmegen, Dept Language & Speech, A2RT,POB 9103, NL-6500 HD Nijmegen, Netherlands. EM l.tenbosch@let.kun.nl CR ALTER K, 2000, P ISCA ITRW SPEECH E, P138 AMIR N, 1998, P ICSLP, P555 *ARPA, 1995, P ARPA SPOK LANG SYS, P241 Campbell N., 2000, P ISCA WORKSH SPEECH, P34 Chu M, 2001, P 7 EUR C SPEECH COM, P2087 Cowie R., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608027 Cowie R., 2000, P ISCA WORKSH SPEECH, P11 COWIE R, 2000, P ISCA ITRW NEWC N I, P151 FERNANDEZ R, 2000, P ISCA WORKSH SPEECH, P219 FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412 HIROSE K, 2000, P ICSLP 2000, P369 HIROSE K, 1997, COMPUTING PROSODY CO, P327 HIRSCHBERG J, 1999, ASRU 99 KEYST US DEC Hirschberg J, 1999, P ESCA WORKSH DIAL P, P7 HUBER R, 2000, P INT C SPOK LANG PR, V1, P665 IIDA A, 1998, P ICSLP, P1559 Iida A., 2000, P ISCA WORKSH SPEECH, P167 Juang BH, 2000, P IEEE, V88, P1142 KANG BS, 2000, P ICSLP, P383 KIENAST M, 1999, P EUR 1999, P117 Kitazoe T., 2000, P ICSLP, P653 KLEIN J, 1999, COMPUTER RESPONSE US Koike K, 1998, P INT C SPOK LANG PR, P679 Li Y., 1998, P INT C SPOK LANG PR, P2255 LITMAN D, 2000, P NAACL SEATTL MAY MAKHOUL J, 1994, VOICE COMMUNICATION BETWEEN HUMANS AND MACHINES, P165 MANNING C, 2000, PROBABILISTIC MODELS MIZUNO O, 1998, P ICSLP, P2007 Montero J.
M., 1998, P 5 INT C SPOK LANG, P923 Mozziconacci S., 1998, STUDY INTONATION PAT MOZZICONACCI S, 2000, P ISCA ITRW NEWC N I, P45 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 NOGUEIRAS A, 2001, P EUR 2001 AALB DENM NOTH E, 1999, P ESCA WORKSH DIAL P, P25 Ostendorf M., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1010 OVIATT SL, 1998, P ICSLP 1998, P2311 Pereira C., 1998, P INT C SPOK LANG PR, P927 Petrushin V.A., 2000, P 6 INT C SPOK LANG, P222 PICARD RW, 1997, AFFECTIVE COMPUTING Rabiner L, 1993, FUNDAMENTALS SPEECH Rank E., 1998, P ICSLP 1998, P671 Scherer K., 2000, P ICSLP 2000, P379 Scherer K.R., 1981, SPEECH EVALUATION PS, P189 SHIGENO S, 1998, P ICSLP 1998, P281 Shriberg E., 1998, LANG SPEECH, V41, P439 SLANEY M, 1998, P ICASSP SEATTL WA SOLTAU H, 1998, P ICSLP98, P229 STREEFKERK BM, 1998, P ICSLP 1998, P683 van Bezooijen R., 1984, CHARACTERISTICS RECO Veilleux N. M., 1994, THESIS BOSTON U WEINTRAUB M, 1996, P ICSLP 1996, P1457 WHITESIDE SP, 1998, P ICSLP 1998, P699 YANG L, 2000, P INT C SPOK LANG PR, P74 ZETTERHOLM E, 1998, SST P 7 AUSTR INT C, P109 ZHAO L, 2000, P ICSLP 2000, P961 ZHOU G, 1998, P ICSLP 1998, P883 2000, AFFECTIVE COMPUTING NR 57 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2003 VL 40 IS 1-2 BP 213 EP 225 DI 10.1016/S0167-6393(02)00083-3 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100012 ER PT J AU Scherer, KR AF Scherer, KR TI Vocal communication of emotion: A review of research paradigms SO SPEECH COMMUNICATION LA English DT Review DE vocal communication; expression of emotion; speaker moods and attitudes; speech technology; theories of emotion; evaluation of emotion effects on voice and speech; acoustic markers of emotion; emotion induction; emotion simulation; stress effects on voice; perception/decoding ID VOICE QUALITY; FACIAL EXPRESSIONS; NONVERBAL BEHAVIOR; PROSODIC FEATURES; SYNTHETIC SPEECH; SPEAKING STYLES; CUE UTILIZATION; PERCEPTION; RECOGNITION; STRESS AB The current state of research on emotion effects on voice and speech is reviewed and issues for future research efforts are discussed. In particular, it is suggested to use the Brunswikian lens model as a base for research on the vocal communication of emotion. This approach allows one to model the complete process, including the encoding (expression), transmission, and decoding (impression) of vocal emotion communication. Special emphasis is placed on the conceptualization and operationalization of the major elements of the model (i.e., the speaker's emotional state, the listener's attribution, and the mediating acoustic cues). In addition, the advantages and disadvantages of research paradigms for the induction or observation of emotional expression in voice and speech and the experimental manipulation of vocal cues are discussed, using pertinent examples drawn from past and present research. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Geneva, Dept Psychol, CH-1205 Geneva, Switzerland. RP Scherer, KR (reprint author), Univ Geneva, Dept Psychol, 40 Blvd Pont Arve, CH-1205 Geneva, Switzerland.
EM klaus.scherer@pse.unige.ch CR ABADJIEVA E, 1995, P EUR 95, P909 Allport GW, 1934, J SOC PSYCHOL, V5, P37 ALPERT M, 1963, ARCH GEN PSYCHIAT, V8, P362 Ambady N, 1996, J PERS SOC PSYCHOL, V70, P996, DOI 10.1037/0022-3514.70.5.996 AMIR N, 1998, P ICSLP 98, V3, P555 Anolli L, 1997, J NONVERBAL BEHAV, V21, P259, DOI 10.1023/A:1024916214403 Baber C, 1996, HUM FACTORS, V38, P142, DOI 10.1518/001872096778940840 Bachorowski JA, 1999, CURR DIR PSYCHOL SCI, V8, P53, DOI 10.1111/1467-8721.00013 BACHOROWSKI JA, 1995, PSYCHOL SCI, V6, P219, DOI 10.1111/j.1467-9280.1995.tb00596.x Banse R, 1996, J PERS SOC PSYCHOL, V70, P614, DOI 10.1037/0022-3514.70.3.614 BANZIGER T, 2001, P JOURN PROS GREN 10 Bonner MR, 1943, AM J PSYCHOL, V56, P262, DOI 10.2307/1417508 Borod JC, 2000, COGNITION EMOTION, V14, P193 BORTZ J, 1966, PHYSIKALISCH AKUSTIS, V9 Breitenstein C, 2001, COGNITION EMOTION, V15, P57, DOI 10.1080/0269993004200114 Brunswik E., 1956, PERCEPTION REPRESENT Burkhardt F., 2000, P ISCA WORKSH SPEECH, P151 CAFFI C, 1994, J PRAGMATICS, V22, P325, DOI 10.1016/0378-2166(94)90115-5 Cahn J.E., 1990, J AM VOICE I O SOC, V8, P1 CARLSON R, 1992, SPEECH COMMUN, V11, P159, DOI 10.1016/0167-6393(92)90010-5 CARLSON R, 1992, P ICSLP 92 BANFF ALB, V1, P671 COLEMAN RF, 1979, 8 S CAR PROF VOIC 1 COSMIDES L, 1983, J EXP PSYCHOL HUMAN, V9, P864, DOI 10.1037/0096-1523.9.6.864 COSTANZO FS, 1969, J COUNS PSYCHOL, V16, P267, DOI 10.1037/h0027355 Cowie R., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608027 Cramer D., 1990, QUANTITATIVE DATA AN Darwin C, 1872, EXPRESSION EMOTIONS Davitz Joel Robert, 1964, COMMUNICATION EMOTIO de Gelder B, 2000, COGNITION EMOTION, V14, P321 DEGELDER B, 2000, COGNITIVE NEUROSCIEN, P84 DUNCAN G, 1983, WORK PROGR, V16, P9 Ekman P, 1972, NEBRASKA S MOTIVATIO, P207 EKMAN P, 1992, COGNITION EMOTION, V6, P169, DOI 10.1080/02699939208411068 EKMAN P, 1994, PSYCHOL BULL, V115, P268, DOI 10.1037//0033-2909.115.2.268 EKMAN P, 1991, J NONVERBAL BEHAV, V15, P125, DOI 10.1007/BF00998267 ELDRED SH, 1958, PSYCHIATR, V21, P115 Erickson D, 1998, LANG SPEECH, V41, P399 Fairbanks G, 1941, SPEECH MONOGR, V8, P85 Fairbanks G, 1939, SPEECH MONOGR, V6, P87 FELDMAN RS, 1991, FUNDAMENTALS NOVERBA Fichtel C, 2001, BEHAVIOUR, V138, P97, DOI 10.1163/15685390151067094 Fonagy I., 1963, Z PHONETIK SPRACHWIS, V16, P293 FONAGY I, 1978, LANG SPEECH, V21, P34 Fonagy I., 1983, VIVE VOIX Frank MG, 2001, J PERS SOC PSYCHOL, V80, P75, DOI 10.1037//0022-3514.80.1.75 FRICK RW, 1985, PSYCHOL BULL, V97, P412, DOI 10.1037//0033-2909.97.3.412 FRIEND M, 1994, J ACOUST SOC AM, V96, P1283, DOI 10.1121/1.410276 Frolov M. V., 1999, HUM PHYSL, V25, P42 GIFFORD R, 1994, J PERS SOC PSYCHOL, V66, P398, DOI 10.1037/0022-3514.66.2.398 GRAMMING P, 1988, J ACOUST SOC AM, V83, P2352, DOI 10.1121/1.396366 GRANSTROM B, 1992, SPEECH COMMUN, V11, P347, DOI 10.1016/0167-6393(92)90040-E GREEN RS, 1975, PERCEPT PSYCHOPHYS, V17, P429, DOI 10.3758/BF03203289 Gross JJ, 1997, J ABNORM PSYCHOL, V106, P95, DOI 10.1037/0021-843X.106.1.95 Hammond K.R., 2001, ESSENTIAL BRUNSWIK B HARGREAVES WILLIAM A., 1965, J ABNORM PSYCHOL, V70, P218, DOI 10.1037/h0022151 Harper R. 
G., 1978, NONVERBAL COMMUNICAT Hauser MD, 2000, ORIGINS OF MUSIC, P77 HAVRDOVA Z, 1979, ACTIV NERV SUPER, V21, P33 HELFRICH H, 1984, SPEECH COMMUN, V3, P245, DOI 10.1016/0167-6393(84)90019-0 HERZOG A, 1933, Z PSYCHOL, V130, P300 HEUFT B, 1996, P ICSLP 96, V3, P1974, DOI 10.1109/ICSLP.1996.608023 HICKS JW, 1979, DISS ABSTR INT, V41 Hochschild AR, 1983, MANAGED HEART COMMER HOFFE WL, 1960, PHONETICA, V5, P129 HUTTAR GL, 1968, J SPEECH HEAR RES, V11, P481 IIDA A, 1998, P ICSLP 98 SYDN AUST, V4, P1559 Isserlin M, 1925, Z GESAMTE NEUROL PSY, V94, P437, DOI 10.1007/BF02878062 Izard C. E., 1977, HUMAN EMOTIONS Izard C.E., 1971, FACE EMOTION Johannes B, 2000, AVIAT SPACE ENVIR MD, V71, pA58 Johnstone T., 2001, APPRAISAL PROCESSES, P271 Johnstone T., 2000, HDB EMOTIONS, V2nd, P220 JOHNSTONE T, 2001, THESIS U W AUSTR Juslin PN, 2000, J EXP PSYCHOL HUMAN, V26, P1797, DOI 10.1037//0096-1523.26.6.1797 Juslin PN, 2001, EMOTION, V1, P381, DOI 10.1037//1528-3542.1.4.381 Kaiser L., 1962, SYNTHESE, V14, P300, DOI 10.1007/BF00869311 KAPPAS A, 1997, 37 ANN M SOC PSYCH R Karlsson I, 2000, SPEECH COMMUN, V31, P121, DOI 10.1016/S0167-6393(99)00073-4 KARLSSON I, 1998, P RLA2C AV, P207 Kennedy G. A., 1972, ART RHETORIC ROMAN W KIENAST M, 1999, P EUR 99 BUD, V1, P117 Klasmeyer G., 2000, P ISCA WORKSH SPEECH, P66 KLASMEYER G, 1999, DAGA TAG ASA TAG 99 KLASMEYER G, 1999, VOICE QUALITY MEASUR, P339 Klasmeyer G, 1999, FORUM PHONETICUM, V67 KLASMEYER G, 1997, FORENSIC LINGUIST, V4, P104 Knapp M.L., 1972, NONVERBAL COMMUNICAT KOTLYAR GM, 1976, SOV PHYS ACOUST+, V22, P208 KURODA I, 1976, AVIAT SPACE ENVIR MD, V47, P528 LADD DR, 1985, J ACOUST SOC AM, V78, P435, DOI 10.1121/1.392466 Laver J., 1991, GIFT SPEECH Laver J, 1980, PHONETIC DESCRIPTION Lavner Y, 2000, SPEECH COMMUN, V30, P9, DOI 10.1016/S0167-6393(99)00028-X LIEBERMAN P, 1962, J ACOUST SOC AM, V34, P922, DOI 10.1121/1.1918222 LEVIN H, 1975, IEEE T SYST MAN CYB, VSMC5, P259 MAHL GF, 1964, APPROACHES SEMIOTICS, P51 MARKEL NN, 1973, LANG SPEECH, V16, P15 MASSARO DW, 2000, P ISCA WORKSH SPEECH, P114 Morris JS, 1999, NEUROPSYCHOLOGIA, V37, P1155, DOI 10.1016/S0028-3932(99)00015-9 Morton JB, 2001, CHILD DEV, V72, P834, DOI 10.1111/1467-8624.00318 Moses P., 1954, VOICE NEUROSIS MOZZICONACCI SJL, 1998, THESIS TU EINDHOVEN Murray IR, 1996, SPEECH COMMUN, V20, P85, DOI 10.1016/S0167-6393(96)00046-5 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 NIWA S, 1971, 11 JAP AIR SELF DEF, P246 Ofuka E, 2000, SPEECH COMMUN, V32, P199, DOI 10.1016/S0167-6393(00)00009-1 UTSUKI N, 1976, Reports of Aeromedical Laboratory, V16, P179 OSTWALD PF, 1964, DISORDERS COMMUNICAT, V42, P450 PAESCHKE A, 1999, P 14 INT C PHON SCI, V2, P929 Pear T, 1931, VOICE PERSONALITY Pell MD, 1998, NEUROPSYCHOLOGIA, V36, P701, DOI 10.1016/S0028-3932(98)00008-6 PEREIRA C, 1998, P ICSLP 98 SYDN, V3, P927 Picard R. W., 1997, AFFECTIVE COMPUTING Pittenger Robert E., 1960, 1 5 MINUTES SAMPLE M PLAIKNER D, 1970, THESIS U INNSBRUCK A RANK E, 1998, P 5 INT C SPOK LANG, V3, P671 ROESSLER R, 1979, PHENOMENOLOGY TREATM ROESSLER R, 1976, J NERV MENT DIS, V163, P166, DOI 10.1097/00005053-197609000-00004 Roseman I. J., 2001, APPRAISAL PROCESSES, P3 RUSSELL JA, 1994, PSYCHOL BULL, V115, P102, DOI 10.1037/0033-2909.115.1.102 SANGSUE J, 1997, GENEVA STUD EMOT COM, V11 Scherer K., 1991, FUNDAMENTALS NONVERB, P200 Scherer K. R., 1994, NATURE EMOTION FUNDA, P25 Scherer K. R., 1977, MOTIV EMOTION, V1, P331, DOI 10.1007/BF00992539 Scherer K. R., 2001, APPRAISAL PROCESSES Scherer K. 
R., 2000, NEUROPSYCHOLOGY EMOT, P137 Scherer KR, 2000, CONTROL OF HUMAN BEHAVIOR, MENTAL PROCESSES, AND CONSCIOUSNESS, P227 Scherer K. R., 1982, HDB METHODS NONVERBA, P136 Scherer K. R., 1999, HDB COGNITION EMOTIO, P637 Scherer K. R., 1989, HDB SOCIAL PSYCHOPHY, P165 Scherer K. R., 1984, APPROACHES EMOTION, P293 Scherer K. R., 1992, INT REV STUDIES EMOT, P139 SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 SCHERER KR, 1984, J ACOUST SOC AM, V76, P1346, DOI 10.1121/1.391450 Scherer K.R., 1979, EMOTIONS PERSONALITY, P495 SCHERER KR, 1978, EUR J SOC PSYCHOL, V8, P467, DOI 10.1002/ejsp.2420080405 SCHERER KR, 1999, ENCY HUMAN EMOTIONS, P669 SCHERER KR, 1998, 22 JOURN ET PAR MART Scherer K.R., 1982, VOKALE KOMMUNIKATION Scherer KR, 2001, J CROSS CULT PSYCHOL, V32, P76, DOI 10.1177/0022022101032001009 SCHERER KR, 1994, EMOTIONS: ESSAYS ON EMOTION THEORY, P161 SCHERER KR, 1972, J PSYCHOLINGUIST RES, V1, P269, DOI 10.1007/BF01074443 SCHERER KR, 1992, SOCIAL REPRESENTATIO, P30 SCHERER KR, 1985, ADV STUD BEHAV, V15, P189, DOI 10.1016/S0065-3454(08)60490-8 SCHERER KR, 1972, EUR J SOC PSYCHOL, V2, P163, DOI 10.1002/ejsp.2420020205 SCHERER KR, 1991, MOTIV EMOTION, V15, P123, DOI 10.1007/BF00995674 SCHERER KR, 2000, P ICSLP2000 BEIJ CHI SCHERER KR, 1974, J PSYCHOLINGUIST RES, V3, P281, DOI 10.1007/BF01069244 SCHERER KR, 1973, J RES PERS, V7, P31, DOI 10.1016/0092-6566(73)90030-5 SCHERER KR, 1985, J PSYCHOLINGUIST RES, V14, P409, DOI 10.1007/BF01067884 SCHERER KR, IN PRESS HDB AFFECTI SCHERER T, 2000, THESIS U MARBURG GER Scripture E., 1921, VOX, V31, P179 SEDLACEK K, 1963, FOLIA PHONIATR, V15, P89 SHIPLEYBROWN F, 1988, BRAIN LANG, V33, P16, DOI 10.1016/0093-934X(88)90051-X Siegman A., 1987, NONVERBAL BEHAV COMM, P351 SIEGMAN AW, 1987, DEPRESSION EXPRESSIV, P83 SIMONOV PV, 1973, AEROSPACE MED, V44, P256 Skinner ER, 1935, SPEECH MONOGR, V2, P81 Sobin C, 1999, J PSYCHOLINGUIST RES, V28, P347, DOI 10.1023/A:1023237014909 STARKWEATHER JA, 1956, J PERS SOC PSYCHOL, V35, P345 Steinhauer K, 1999, NAT NEUROSCI, V2, P191, DOI 10.1038/5757 Stemmler G, 2001, PSYCHOPHYSIOLOGY, V38, P275, DOI 10.1017/S0048577201991668 SULC J, 1977, ACTIV NERV SUPER, V19, P215 TISCHER BERND, 1993, VOKALE KOMMUNIKATION, V18 TITZE IR, 1992, J SPEECH HEAR RES, V35, P21 TOLKMITT F, 1988, FACETS EMOTION RECEN, P119 TOLKMITT FJ, 1986, J EXP PSYCHOL HUMAN, V12, P302, DOI 10.1037//0096-1523.12.3.302 Tomkins S. S., 1962, AFFECT IMAGERY CONSC, V1 Trager G. 
L., 1958, STUD LINGUIST, V13, P1 Tusing KJ, 2000, HUM COMMUN RES, V26, P148, DOI 10.1111/j.1468-2958.2000.tb00754.x van Bezooijen R., 1984, CHARACTERISTICS RECO VANBEZOOIJEN R, 1983, J CROSS CULT PSYCHOL, V14, P387, DOI 10.1177/0022002183014004001 WAGNER HL, 1993, J NONVERBAL BEHAV, V17, P3, DOI 10.1007/BF00987006 WALLBOTT HG, 1986, J PERS SOC PSYCHOL, V51, P690, DOI 10.1037//0022-3514.51.4.690 Wehrle T, 2000, J PERS SOC PSYCHOL, V78, P105, DOI 10.1037//0022-3514.78.1.105 Westermann R, 1996, EUR J SOC PSYCHOL, V26, P557, DOI 10.1002/(SICI)1099-0992(199607)26:4<557::AID-EJSP769>3.0.CO;2-4 Whiteside SP, 1999, PERCEPT MOTOR SKILL, V89, P1195, DOI 10.2466/PMS.89.7.1195-1208 WILLIAMS CE, 1969, AEROSPACE MED, V40, P1369 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 Wundt W., 1874, GRUNDZUGE PHYSL PSYC ZUBERBIER E, 1957, Z Psychother Med Psychol, V7, P239 ZWICKER E, 1982, PSYCHOACOUSTICS ZWIRNER E, 1930, BEITRAG SPRACHE DEPR, V1, P171 NR 184 TC 393 Z9 401 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2003 VL 40 IS 1-2 BP 227 EP 256 DI 10.1016/S0167-6393(02)00084-5 PG 30 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 643MB UT WOS:000180864100013 ER PT J AU Harding, S Meyer, G AF Harding, S Meyer, G TI Changes in the perception of synthetic nasal consonants as a result of vowel formant manipulations SO SPEECH COMMUNICATION LA English DT Article DE auditory scene analysis; proximity grouping; formant continuity; nasal consonants ID AUDITORY SCENE ANALYSIS; SPEECH; PLACE; CUES; TRANSITIONS; ARTICULATION; ORGANIZATION; SOUND AB There is controversy over the role of auditory scene analysis in speech perception and in particular whether listeners form perceptual streams of formants. The role of vowel formant frequencies in the perception of synthetic vowel-nasal syllables and the importance of formant continuity between vowel and nasal were examined in three experiments. The nasal prototypes /m/ and /n/ were used in all experiments, together with a range of preceding vowels differing only in the frequency and transitions of their second formant (F2). When no explicit transitions were present between the vowel and nasal, the perception of each nasal changed from /m/ to /n/ as the vowel F2 increased. Introducing explicit formant transitions removed this effect, and listeners heard the appropriate percepts for each nasal prototype. However, if the transition and the nasal prototype were inconsistent, the percept was determined by the transition alone. In each experiment, therefore, the target frequency of the vowel F2 transition into the nasal consonant determined the percept, taking precedence over the formant structure of the nasal prototype. The results do not show strong evidence for formant streaming and are more consistent with pattern matching processes. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Keele, MacKay Inst Commun & Neurosci, Keele ST5 5BG, Staffs, England. RP Harding, S (reprint author), Univ Keele, MacKay Inst Commun & Neurosci, Keele ST5 5BG, Staffs, England. 
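The stimuli in the Harding and Meyer study above were built with a Klatt-type synthesiser; as a rough illustration of the one variable they manipulate, here is a sketch of an F2 contour for a vowel-nasal syllable with an optional explicit transition. Durations, frequencies and the linear transition shape are assumptions for illustration, not the authors' stimulus parameters.

```python
import numpy as np

def f2_track(vowel_f2, nasal_f2, frame_rate=100,
             vowel_ms=200, trans_ms=40, nasal_ms=100):
    """F2 contour (Hz), one value per frame, for a vowel-nasal syllable.

    With trans_ms = 0 the vowel F2 abuts the nasal prototype directly
    (the no-explicit-transition condition); otherwise F2 moves linearly
    to the nasal target, whose endpoint frequency is what drove the
    /m/-/n/ percept in the experiments above.
    """
    frames = lambda ms: int(round(ms * frame_rate / 1000))
    vowel = np.full(frames(vowel_ms), float(vowel_f2))
    trans = np.linspace(vowel_f2, nasal_f2, frames(trans_ms), endpoint=False)
    nasal = np.full(frames(nasal_ms), float(nasal_f2))
    return np.concatenate([vowel, trans, nasal])

track = f2_track(vowel_f2=1800.0, nasal_f2=1200.0)  # illustrative values
```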
EM s.m.harding@cns.keele.ac.uk CR ASSMANN PF, 1995, J ACOUST SOC AM, V97, P575, DOI 10.1121/1.412281 Barker J, 1999, SPEECH COMMUN, V27, P159, DOI 10.1016/S0167-6393(98)00081-8 Bregman AS., 1990, AUDITORY SCENE ANAL COOKE M, 1996, P ESCA WORKSH AUD BA Cooke M., 1993, MODELLING AUDITORY P Cooke M, 2001, SPEECH COMMUN, V35, P141, DOI 10.1016/S0167-6393(00)00078-9 COWAN N, 1984, PSYCHOL BULL, V96, P341, DOI 10.1037/0033-2909.96.2.341 Delattre P, 1952, WORD, V8, P195 DELATTRE PC, 1955, J ACOUST SOC AM, V27, P769, DOI 10.1121/1.1908024 Dewey Godfrey, 1923, RELATIVE FREQUENCY E Dorman M, 1975, J EXPT PSYCHOL HUMAN, V1, P121, DOI 10.1037//0096-1523.1.2.121 Ellis DPW, 1999, SPEECH COMMUN, V27, P281, DOI 10.1016/S0167-6393(98)00083-1 Fant G., 1960, ACOUSTIC THEORY SPEE FUJIMURA O, 1962, J ACOUST SOC AM, V34, P1865, DOI 10.1121/1.1909142 Godsmark D, 1999, SPEECH COMMUN, V27, P351, DOI 10.1016/S0167-6393(98)00082-X HAZAN V, 1991, PERCEPT PSYCHOPHYS, V49, P187, DOI 10.3758/BF03205038 HOUSE AS, 1957, J SPEECH HEAR DISORD, V22, P190 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 KUROWSKI K, 1984, J ACOUST SOC AM, V76, P383, DOI 10.1121/1.391139 KUROWSKI K, 1987, J ACOUST SOC AM, V81, P1917, DOI 10.1121/1.394756 LADEFOGED P, 1957, J ACOUST SOC AM, V29, P98, DOI 10.1121/1.1908694 LARKEY LS, 1978, PERCEPT PSYCHOPHYS, V23, P299, DOI 10.3758/BF03199713 Liberman AM, 1954, PSYCHOL MONOGR-GEN A, V68, P1 LIBERMAN AM, 1989, SCIENCE, V243, P489, DOI 10.1126/science.2643163 LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 MALECOT A, 1956, LANGUAGE, V32, P274, DOI 10.2307/411004 Masuda-Katsuse I, 1999, SPEECH COMMUN, V27, P235, DOI 10.1016/S0167-6393(98)00084-3 MEYER G, 1999, P 14 INT C PHON SCI NAKATA K, 1959, J ACOUST SOC AM, V31, P661, DOI 10.1121/1.1907770 PLOMP R, 1967, J ACOUST SOC AM, V41, P707, DOI 10.1121/1.1910398 RAND TC, 1974, J ACOUST SOC AM, V55, P678, DOI 10.1121/1.1914584 RECASENS D, 1983, J ACOUST SOC AM, V73, P1346, DOI 10.1121/1.389238 REMEZ RE, 1994, PSYCHOL REV, V101, P129, DOI 10.1037/0033-295X.101.1.129 Repp B. H., 1987, CATEGORICAL PERCEPTI, P89 REPP BH, 1988, J ACOUST SOC AM, V83, P237, DOI 10.1121/1.396529 Stevens K.N., 1998, ACOUSTIC PHONETICS STEVENS KN, 1963, J SPEECH HEAR RES, V6, P111 STRANGE W, 1983, J ACOUST SOC AM, V74, P695, DOI 10.1121/1.389855 Warren R. M., 1970, SCIENCE, V167, P393 NR 39 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2003 VL 39 IS 3-4 BP 173 EP 189 DI 10.1016/S0167-6393(02)00014-6 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 637XJ UT WOS:000180538800001 ER PT J AU Yu, MS Huang, FL AF Yu, MS Huang, FL TI Disambiguating the senses of non-text symbols for Mandarin TTS systems with a three-layer classifier SO SPEECH COMMUNICATION LA English DT Article DE sense disambiguation; non-text symbol; three-layer classifier; Bayesian theory; voting scheme ID LANGUAGE AB Various kinds of non-text symbols appear in texts. The oral expressions of these symbols may vary with their senses. This paper proposes a three-layer classifier (TLC) which can disambiguate the senses of these symbols effectively. The layers within TLC are employed in sequence. The 1st layer is composed of two components: pattern table and decision tree. If this layer can disambiguate the sense of the target symbol, the disambiguation task stops. Otherwise the next two layers will be triggered.
In such a situation, the procedure continues through the remaining layers of the TLC. Based on Bayesian theory, the 2nd layer adopts a voting scheme to compute the disambiguation score. Several features of tokens, which may affect the effectiveness of our voting scheme, are analyzed and compared with each other to achieve better accuracy. According to the confidence of the sense disambiguation algorithm, the 3rd layer may exploit an alternative model to enhance the performance. Experiments show that our approaches can learn well even with only a small amount of data. The overall accuracies of training and testing sets are 99.8% and 97.5%, respectively. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Natl Chung Hsing Univ, Dept Appl Math, Text To Speech Syst Lab, Taichung 40227, Taiwan. RP Yu, MS (reprint author), Natl Chung Hsing Univ, Dept Appl Math, Text To Speech Syst Lab, Taichung 40227, Taiwan. EM msyu@dragon.nchu.edu.tw; flhuang@amath.nchu.edu.tw CR Brown P. F., 1991, P 29 ANN M ASS COMP, P264, DOI 10.3115/981344.981378 Brown P. F., 1992, Computational Linguistics, V18 Chen F.-Y., 1999, COMPUTATIONAL LINGUI, V4, P87 CHEN KJ, 1994, P RES COMP LING C RO, V7, P111 Chen SF, 1999, COMPUT SPEECH LANG, V13, P359, DOI 10.1006/csla.1999.0128 Cheng J., 1996, MOL DIAGN, V1, P183, DOI 10.1016/S1084-8592(96)70004-8 CHIANG TH, 1996, J CHINESE LINGUISTIC, V9, P147 *CKIP GROUP, 1995, 9502 CKIP GROUP Duda R. O., 1973, PATTERN CLASSIFICATI Dunning T., 1993, Computational Linguistics, V19 FAN CK, 1988, COMPUTER PROCESSING, V4, P33 Fujii A, 1998, COMPUT LINGUIST, V24, P573 GALE WA, 1995, COMPUT STAT DATA AN, V19, P135, DOI 10.1016/0167-9473(93)E0052-6 Golding A.R, 1995, P 3 WORKSH VER LARG, P39 Huang C, 1995, P ROCLING VIII, P81 HUANG FL, 2001, P IEEESMC IEEE SYST, P512 Ide N, 1998, COMPUT LINGUIST, V24, P1 Jurafsky Daniel, 2000, SPEECH LANGUAGE PROC Leacock C, 1998, COMPUT LINGUIST, V24, P147 LEACOCK C, 1993, P ARPA WORKSH HUM LA LEE H, 1997, P RES COMP LING C, P49 Lee LS, 1993, IEEE T SPEECH AUDI P, V1, P158, DOI 10.1109/89.222876 LEE LS, 1989, IEEE T ACOUSTICS SPE, V37, P619 LI J, 1999, INT J COMPUTATIONAL, P1 Liang Nanyuan, 1991, Communications of COLIPS, V1 MERIALDO A, 1990, P IBM NAT LANG ITL P, P161 MILLER GA, 1991, LANG COGNITIVE PROC, V6, P1, DOI 10.1080/01690969108406936 NIE JY, 1995, P RES COMP LING C RO, V8, P175 SCHUTZE H, 1995, THESIS STANFORD U Sproat R., 1990, Computer Processing of Chinese & Oriental Languages, V4 Sproat R., 1992, P INT C SPOK LANG PR STREITER O, 2000, P RES COMP LING C RO, V13, P41 SU KY, 1996, COMPUTATIONAL LINGUI, V1, P101 TOWELL G, 1999, COMPUTATIONAL LINGUI, V23, P125 URAMOTO N, 1994, IEICE T INF SYST, VE77D, P240 VERONIS J, 1990, P COLING 90 WITTEN IH, 1991, IEEE T INFORM THEORY, V37, P1085, DOI 10.1109/18.87000 YAO TS, 1990, J CHINESE INFORMATIO, V5, P38 YAROWSKY D, 1997, PROGR SPEECH SYNTHES, P159 Yarowsky D., 1994, P 32 ANN M ASS COMP, P88, DOI 10.3115/981732.981745 NR 40 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
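The second layer of the TLC described above combines Bayesian scoring with voting over contextual token features. The sketch below shows one plausible reading of such a layer as a smoothed naive-Bayes scorer in which each feature casts a log-probability vote per sense; the feature inventory, the smoothing and the example senses are assumptions of this sketch, not the published model.

```python
import math
from collections import Counter, defaultdict

class BayesVoter:
    """Naive-Bayes-style sense scorer: every contextual feature of the
    target symbol casts a log-probability vote for each sense."""

    def __init__(self, alpha=1.0):        # alpha: additive smoothing
        self.alpha = alpha
        self.sense_counts = Counter()
        self.feat_counts = defaultdict(Counter)   # sense -> feature counts

    def train(self, examples):             # examples: [(features, sense)]
        for feats, sense in examples:
            self.sense_counts[sense] += 1
            for f in feats:
                self.feat_counts[sense][f] += 1

    def score(self, feats, sense):
        n = sum(self.sense_counts.values())
        vocab = len({f for c in self.feat_counts.values() for f in c})
        s = math.log((self.sense_counts[sense] + self.alpha)
                     / (n + self.alpha * len(self.sense_counts)))
        total = sum(self.feat_counts[sense].values())
        for f in feats:                    # one vote per contextual feature
            s += math.log((self.feat_counts[sense][f] + self.alpha)
                          / (total + self.alpha * vocab))
        return s

    def disambiguate(self, feats):         # highest summed vote wins
        return max(self.sense_counts,
                   key=lambda sense: self.score(feats, sense))

# Hypothetical use: deciding whether "/" in "3/4" is a date or a fraction.
voter = BayesVoter()
voter.train([(("prev_digit", "next_digit", "month_range"), "date"),
             (("prev_digit", "next_digit", "small_numbers"), "fraction")])
print(voter.disambiguate(("prev_digit", "month_range")))
```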
PD FEB PY 2003 VL 39 IS 3-4 BP 191 EP 229 DI 10.1016/S0167-6393(02)00015-8 PG 39 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 637XJ UT WOS:000180538800002 ER PT J AU Baum, SR AF Baum, SR TI Age differences in the influence of metrical structure on phonetic identification SO SPEECH COMMUNICATION LA English DT Article DE speech perception; aging; phonetic identification; lexical stress ID SPOKEN WORD RECOGNITION; SPEECH-PERCEPTION; LEXICAL STATUS; OLDER-ADULTS; ELDERLY LISTENERS; COGNITIVE-FACTORS; GAP DETECTION; CATEGORIZATION; YOUNG; INFORMATION AB Two phonetic identification experiments were conducted with two groups of participants: a young adult group and an older adult group. In Experiment 1, subjects were required to make voiced-voiceless decisions for initial alveolar stop consonants to stimuli along two voice onset time (VOT) continua: one ranging from "di'gress" to "ti'gress" and the other from "'digress" to "'tigress" (i.e., in one continuum, the voiced endpoint was consistent with the word's stress pattern while in the other continuum, the voiceless endpoint was consistent with the word's stress pattern). Results revealed that both groups of participants were influenced by the stress pattern of the stimuli, but stress seemed to override VOT cues for a large number of the older individuals. To confirm that the effect was not simply due to a lexical influence, a follow-up experiment utilized two word-nonword continua ("diamond-tiamond" and "diming-timing") to examine the magnitude of lexical effects in these subject groups. Typical lexical status effects emerged for both young and older adults, which were smaller than the effects of stress pattern found in Experiment 1. The findings are discussed with respect to the role of prosodic context in language processing in aging. (C) 2002 Elsevier Science B.V. All rights reserved. C1 McGill Univ, Sch Commun Sci & Disorders, Montreal, PQ H3G 1A8, Canada. RP Baum, SR (reprint author), McGill Univ, Sch Commun Sci & Disorders, 1266 Pine Ave W, Montreal, PQ H3G 1A8, Canada. EM shari.baum@mcgill.ca CR BLUMSTEIN SE, 1994, BRAIN LANG, V46, P181, DOI 10.1006/brln.1994.1011 BOOTHROYD A, 1988, J ACOUST SOC AM, V84, P101, DOI 10.1121/1.396976 BOYCZUK J, 1999, J ACOUST SOC AM, V105, P1402, DOI 10.1121/1.426659 Boyczuk JP, 1999, BRAIN LANG, V67, P46, DOI 10.1006/brln.1998.2049 BURTON MW, 1989, J EXP PSYCHOL HUMAN, V15, P567, DOI 10.1037//0096-1523.15.3.567 CAPLAN D, 1994, J ACOUST SOC AM, V95, P512, DOI 10.1121/1.408345 COHEN G, 1986, LANG COMMUN, V6, P91, DOI 10.1016/0271-5309(86)90008-X COHEN G, 1983, BRIT J PSYCHOL, V74, P239 CONNINE CM, 1987, J EXP PSYCHOL HUMAN, V13, P291, DOI 10.1037/0096-1523.13.2.291 CONNINE CM, 1986, THESIS U MASSACHUSET CRAIK FIM, 1968, HUMAN AGING BEHAV RE Elman J.
L., 1986, INVARIANCE VARIABILI FOLSTEIN MF, 1975, J PSYCHIAT RES, V12, P189, DOI 10.1016/0022-3956(75)90026-6 FOX RA, 1984, J EXP PSYCHOL HUMAN, V10, P526, DOI 10.1037//0096-1523.10.4.526 GANONG WF, 1980, J EXP PSYCHOL HUMAN, V6, P110, DOI 10.1037/0096-1523.6.1.110 HUMES LE, 1994, J SPEECH HEAR RES, V37, P465 MCQUEEN JM, 1991, J EXP PSYCHOL HUMAN, V17, P433, DOI 10.1037/0096-1523.17.2.433 MERTUS J, 1989, BLISS USERS MANUAL MEYER BJF, 1981, EXP AGING RES, V7, P253 MILLER JL, 1988, J EXP PSYCHOL HUMAN, V14, P369, DOI 10.1037/0096-1523.14.3.369 NEWMAN R, 1999, PAPERS LAB PHONOLOGY, V5 PICHORAFULLER MK, 1995, J ACOUST SOC AM, V97, P593, DOI 10.1121/1.412282 PITT MA, 1993, J EXP PSYCHOL HUMAN, V19, P699, DOI 10.1037/0096-1523.19.4.699 SALTHOUSE T, 1988, LANGUAGE MEMORY AGIN SAMUEL AG, 1986, PATTERN RECOGNITION, V1 Schneider BA, 1999, J ACOUST SOC AM, V106, P371, DOI 10.1121/1.427062 Snell KB, 1997, J ACOUST SOC AM, V101, P2214, DOI 10.1121/1.418205 Sommers MS, 1999, PSYCHOL AGING, V14, P458, DOI 10.1037/0882-7974.14.3.458 STINE EAL, 1994, AGING COGNITION, V1, P152, DOI 10.1080/09289919408251456 Stine-Morrow EAL, 1999, J GERONTOL B-PSYCHOL, V54B, P125, DOI [10.1093/geronb/54B.2.P125, DOI 10.1093/GERONB/54B.2.P125] Strouse A, 1998, J ACOUST SOC AM, V104, P2385, DOI 10.1121/1.423748 TAUB HA, 1979, EXP AGING RES, V5, P3, DOI 10.1080/03610737908257184 TUN PA, 1991, PSYCHOL AGING, V6, P3, DOI 10.1037//0882-7974.6.1.3 VANROOIJ JCGM, 1990, J ACOUST SOC AM, V88, P2611, DOI 10.1121/1.399981 VANROOIJ JCGM, 1992, J ACOUST SOC AM, V91, P1028, DOI 10.1121/1.402628 WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 Wingfield A., 1989, J GERONTOL B-PSYCHOL, V44, P106 Wingfield A, 1991, J GERONTOL B-PSYCHOL, V46, P127 WINGFIELD A, 1992, J GERONTOL, V47, P350 WINGFIELD A, 1989, MEMORY AGING DEMENTI Wingfield A, 1996, J Am Acad Audiol, V7, P175 Wingfield A, 2000, J SPEECH LANG HEAR R, V43, P915 NR 42 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2003 VL 39 IS 3-4 BP 231 EP 242 DI 10.1016/S0167-6393(02)00028-6 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 637XJ UT WOS:000180538800003 ER PT J AU Fitch, HL Kupin, JJ Kessler, IJ DeLucia, J AF Fitch, HL Kupin, JJ Kessler, IJ DeLucia, J TI Relating articulation and acoustics through a sinusoidal description of vocal tract shape SO SPEECH COMMUNICATION LA English DT Article DE vocal; tract; shape; acoustics; formants; model; articulatory; tongue; speech; sinusoidal; equivalence ID FORMANT FREQUENCIES; TONGUE SHAPES; SPEECH; VOWELS AB A description of vocal tract shape for vowels is given in terms of a weighted sum of two sinusoids, displaced from each other along the length of the tract. The impetus for the description comes from uncovering a particularly simple relationship between F2 and a function on the parameters of a model of articulatory data. When the statistically derived articulatory model is simplified as displaced sinusoids, the relationship can be easily understood in terms of the physics of acoustic tubes. Thus, while the description is based on articulatory data, it relates simply to acoustics. The function (the sum of the weights of the two sinusoids) creates sets of articulatory shapes that can be said to trade in the amount of tongue front raising and the amount of tongue back raising to produce a constant F2. 
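(As a numeric aside, the two-sinusoid description just given is compact enough to state in a few lines. The sketch below is illustrative only: the normalisation, the displacement value and the weight ranges are assumptions, and no acoustic simulation is attempted.)

```python
import numpy as np

def tract_shape(w1, w2, displacement, n=40):
    """Cross-dimension profile along the tract (glottis -> lips) as a
    weighted sum of two sinusoids displaced by `displacement` radians.
    Per the abstract, shapes sharing the same w1 + w2 trade front
    against back tongue raising while keeping F2 roughly constant."""
    x = np.linspace(0.0, np.pi, n)   # normalised position along the tract
    return w1 * np.sin(x) + w2 * np.sin(x + displacement)

# Two members of one hypothetical F2-equivalence class: w1 + w2 = 1.0.
shape_a = tract_shape(0.8, 0.2, displacement=np.pi / 3)
shape_b = tract_shape(0.3, 0.7, displacement=np.pi / 3)
```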
Alternatively, these equivalence classes of shapes can be said to trade in place of articulation and amount of constriction: this is so despite the fact that neither of these two variables is an explicit parameter in generating the shapes. In embodying this compensatory property, it is also a description that is consistent with theories of motor control based on specifying functional equivalence classes, which are controlled with few degrees of freedom and allow context-conditioned variability. (C) 2002 Elsevier Science B.V. All rights reserved. C1 IDA Ctr Commun Res, Princeton, NJ 08540 USA. RP Fitch, HL (reprint author), IDA Ctr Commun Res, 805 Bunn Dr, Princeton, NJ 08540 USA. EM hollis@idaccr.org CR ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 Bernstein N., 1967, COORDINATION REGULAT Carre Rene, 1995, P73 CARRE R, 1994, PHONETICA, V51, P7 Chiba T., 1941, VOWEL ITS NATURE STR DELUCIA J, 1995, 1201 CCR Fant G., 1960, ACOUSTIC THEORY SPEE FANT G, 1974, P SPEECH COMM SEM, V74, P121 HARSHMAN R, 1977, J ACOUST SOC AM, V62, P693, DOI 10.1121/1.381581 JACKSON MTT, 1988, J ACOUST SOC AM, V84, P124, DOI 10.1121/1.396979 KELSO JAS, 1984, J EXP PSYCHOL HUMAN, V10, P812, DOI 10.1037/0096-1523.10.6.812 Kelso J., 1982, HUMAN MOTOR BEHAV IN LADEFOGED P, 1978, J ACOUST SOC AM, V64, P1027, DOI 10.1121/1.382086 LILJENCRANTS J, 1971, 4 ROYAL I TECHN SPEE, P9 Mace W. M., 1978, ATTENTION PERFORM, VVII, P557 MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131 MERMELST.P, 1967, J ACOUST SOC AM, V41, P1283, DOI 10.1121/1.1910470 MRAYATI M, 1988, SPEECH COMMUN, V7, P257, DOI 10.1016/0167-6393(88)90073-8 Nix DA, 1996, J ACOUST SOC AM, V99, P3707, DOI 10.1121/1.414968 SCHROEDE.MR, 1967, J ACOUST SOC AM, V41, P1002, DOI 10.1121/1.1910429 STEVENS KN, 1955, J ACOUST SOC AM, V27, P484, DOI 10.1121/1.1907943 Stone M, 1996, J ACOUST SOC AM, V99, P3728, DOI 10.1121/1.414969 NR 22 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2003 VL 39 IS 3-4 BP 243 EP 268 DI 10.1016/S0167-6393(02)00029-8 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 637XJ UT WOS:000180538800004 ER PT J AU de Veth, J Boves, L AF de Veth, J Boves, L TI On the efficiency of classical RASTA filtering for continuous speech recognition: Keeping the balance between acoustic pre-processing and acoustic modelling SO SPEECH COMMUNICATION LA English DT Article DE continuous speech recognition; telephone lines; channel normalisation; classical RASTA filtering; phase-corrected RASTA ID TELEPHONE AB The efficiency of classical RASTA filtering for channel normalisation was investigated for continuous speech recognition based on context-independent and context-dependent hidden Markov models. For a medium and a large vocabulary continuous speech recognition task, recognition performance was established for classical RASTA filtering and compared to using no channel normalisation, cepstrum mean normalisation, and phase-corrected RASTA. Phase-corrected RASTA is a technique that consists of classical RASTA filtering followed by a phase correction operation. In this manner, channel bias is as effectively removed as with classical RASTA. However, for phase-corrected RASTA, amplitude drift towards zero in stationary signal portions is diminished compared to classical RASTA.
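(For concreteness, the classical RASTA filter discussed in this abstract is the band-pass filter of Hermansky and Morgan (1994), H(z) = 0.1(2 + z^-1 - z^-3 - 2z^-4) / (1 - 0.98z^-1), applied along time to each log-spectral or cepstral trajectory. The sketch below implements it together with cepstrum mean normalisation for comparison; the phase correction operation of phase-corrected RASTA is only noted in a comment, since its exact filter is defined in the paper itself.)

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(cep):
    """Classical RASTA filtering of cepstral trajectories.

    cep: array of shape (num_frames, num_coeffs). The pole at 0.98 is
    what suppresses stationary (DC-like) channel components, and is
    also the source of the amplitude drift towards zero that
    phase-corrected RASTA mitigates with an additional
    phase-correction filter (not implemented here).
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, cep, axis=0)

def cepstrum_mean_norm(cep):
    """Cepstrum mean normalisation: subtract the per-utterance mean,
    removing the convolutional channel bias in one step."""
    return cep - cep.mean(axis=0, keepdims=True)
```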
The results show that application of classical RASTA filtering resulted in decreased recognition performance when compared to using no channel normalisation for all conditions studied, although the decrease appeared to be smaller for context-dependent models than for context-independent models. However, for all conditions, recognition performance was significantly and substantially improved when phase-corrected RASTA was used and reached the same performance level as obtained for cepstrum mean normalisation in some cases. It is concluded that classical RASTA filtering can only be effective for channel robustness if the impact of the amplitude drift towards zero can be kept as limited as possible. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Nijmegen, Dept Language & Speech, NL-6500 HD Nijmegen, Netherlands. RP de Veth, J (reprint author), Univ Nijmegen, Dept Language & Speech, POB 9103,A2RT, NL-6500 HD Nijmegen, Netherlands. EM j.deveth@let.kun.nl CR AIKAWA K, 1993, P INT C AC SIGN SPEE, P668 ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155 DENOS EA, 1995, P EUR 95, P825 DEVETH J, 1997, P ESCA NATO WORKSH R, P119 DEVETH J, 1997, P INT C AC SIGN SPEE, P1239 de Veth J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607275 DEVETH J, 2001, ROBUSTNESS LANGUAGE, P9 de Veth J, 1998, SPEECH COMMUN, V25, P149, DOI 10.1016/S0167-6393(98)00034-X DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1997, P ESCA TUT RES WORKS, P1 HERMANSKY H, 1995, P INT C PHON SC Hirsch H.-G., 1991, P EUROSPEECH, P413 Junqua J.C., 1996, ROBUSTNESS AUTOMATIC JUNQUA JC, 1995, P EUROSPEECH, P1385 KOEHLER J, 1994, P ICASSP, P421 MILNER B, 1999, P WORKSH ROB METH AS, P79 MILNER B, 1995, P EUROSPEECH, P519 NADEU C, 1995, P EUR, P923 Paches-Leal P., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607789 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 Steinbiss C, 1995, PHILIPS J RES, V49, P317 van den Heuvel Henk, 1997, INT J SPEECH TECHNOL, V2, P119 van Vuuren S., 1997, P EUR, P409 Young S, 1995, HTK BOOK HTK VERSION NR 27 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2003 VL 39 IS 3-4 BP 269 EP 286 DI 10.1016/S0167-6393(02)00030-4 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 637XJ UT WOS:000180538800005 ER PT J AU Kwon, OW Park, J AF Kwon, OW Park, J TI Korean large vocabulary continuous speech recognition with morpheme-based recognition units SO SPEECH COMMUNICATION LA English DT Article DE Korean large vocabulary continuous speech recognition; morpheme-based language model; broadcast news transcription AB In Korean writing, a space is placed between two adjacent word-phrases, each of which generally corresponds to two or three words in English in a semantic sense. If the word-phrase is used as a recognition unit for Korean large vocabulary continuous speech recognition (LVCSR), the out-of-vocabulary (OOV) rate becomes very large.
If a morpheme or a syllable is used instead, a severe inter-morpheme coarticulation problem arises due to short morphemes. We propose to use a merged morpheme as the recognition unit and pronunciation-dependent entries in a language model (LM) so that we can reduce such difficulties and incorporate the between-word phonology rule into the decoding algorithm of a Korean LVCSR system. Starting from the original morpheme units defined in the Korean morphology, we merge pairs of short and frequent morphemes into larger units by using a rule-based method and a statistical method. We define the merged morpheme unit as a word and use it as the recognition unit. The performance of the system was evaluated in two business-related tasks: a read speech recognition task and a broadcast news transcription task. The OOV rate was reduced to a level comparable to that of American English in both tasks. In the read speech recognition task, with a 32k vocabulary and a word-based trigram LM computed from a newspaper text corpus, the word error rate (WER) of the baseline system was reduced from 25.0% to 20.0% by cross-word modeling and pronunciation-dependent language modeling, and finally to 15.5% by increasing the speech database and text corpora. For the broadcast news transcription task, we showed that the statistical method reduced the WER of the baseline system without morpheme merging by a relative 3.4% and both of the proposed methods yielded similar performance. Applying all the proposed techniques, we achieved 17.6% WER for clean speech and 27.7% for noisy speech. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Korea Adv Inst Sci & Technol, Brain Sci Res Ctr, Yuseong gu, Taejon 305701, South Korea. ETRI, Spoken Language Proc Team, Yuseong Gu, Taejon 305350, South Korea. RP Kwon, OW (reprint author), Univ Calif San Diego, Inst Neural Computat, 9500 Gilman Dr, La Jolla, CA 92093 USA. EM owkwon@rhythm.ucsd.edu CR CLARKSON PR, 1997, P EUROSPEECH 97 RHOD DAELEMANS W, 1996, PROGR SPEECH SYNTHES Demuynck K, 2000, SPEECH COMMUN, V30, P37, DOI 10.1016/S0167-6393(99)00030-8 EIDE E, 2000, P DARPA SPEECH TRANS GAO S, 2000, P ICASSP 2000 IST TU GAUVAIN JL, 1996, ICASSP 96 ATL GEUTNER P, 1998, P ICASSP 98 GILLICK L, 1989, P ICASSP 89 GUO XF, 1999, P DARPA BROADC NEWS JEON J, 1998, ICSLP 98 SYDN AUSTR KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 KIM JW, 1996, THESIS KOREA ADV I S KOU HJJ, 1999, P EUROSPEECH 99 BUD KWON OW, 2000, P ICASSP 2000, P1567 KWON OW, 1999, P EUROSPEECH 99 BUD, P483 LAMEL L, 1995, P IEEE AUT SPEECH RE, P51 Lee C. H., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90002-N Lee K.-F., 1989, AUTOMATIC SPEECH REC Ohtsuki K, 1999, SPEECH COMMUN, V28, P155, DOI 10.1016/S0167-6393(99)00006-0 OHTSUKI K, 1999, P 1999 DARPA BROADC Ortmanns S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607215 PALLET DS, 1990, P ICASSP 90 PALLET DS, 1999, P 1999 DARPA BROADC Ravishankar M.
K., 1996, THESIS CARNEGIE MELL TOMOKIYO LM, 1998, ICASSP 98 SEATTL US WOODLAND PC, 1999, P DARPA BROADC NEWS WOODLAND PC, 1995, P ARPA SLT WORKSH Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 Young SJ, 1997, COMPUT SPEECH LANG, V11, P73, DOI 10.1006/csla.1996.0023 YU HJ, 2000, P ICSLP 2000 Yun SJ, 1999, IEEE SIGNAL PROC LET, V6, P28 NR 31 TC 31 Z9 32 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2003 VL 39 IS 3-4 BP 287 EP 300 DI 10.1016/S0167-6393(02)00031-6 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 637XJ UT WOS:000180538800006 ER PT J AU Misra, H Ikbal, S Yegnanarayana, B AF Misra, H Ikbal, S Yegnanarayana, B TI Speaker-specific mapping for text-independent speaker recognition SO SPEECH COMMUNICATION LA English DT Article DE speaker recognition; artificial neural network; speaker-specific mapping; linguistic information; speaker information; background normalization; network error criterion; equal error rate ID VERIFICATION; IDENTIFICATION; NETWORKS AB In this paper, we present the concept of speaker-specific mapping for the task of speaker recognition. The speaker-specific mapping is realized using a multilayer feedforward neural network. In the mapping approach, the aim is to capture the speaker-specific information by mapping a set of parameter vectors specific to linguistic information in the speech to a set of parameter vectors having linguistic and speaker information. In this study, parameter vectors suitable for speaker-specific mapping are explored. Background normalization for score comparison and network error criterion for frame selection are proposed to improve the performance of the basic system. It is shown that removing the high frequency components of speech results in loss of performance of the speaker verification system. For all the 630 speakers of the TIMIT database, an equal error rate (EER) of 0.5% and 100% identification is achieved by the mapping approach. On a set of 38 speakers of the dialect region "dr1" of the NTIMIT database, an EER of 6.6% is obtained. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Indian Inst Technol, Dept Elect Engn, Madras 600036, Tamil Nadu, India. Indian Inst Technol, Dept Comp Sci & Engn, Madras 600036, Tamil Nadu, India. RP Misra, H (reprint author), Indian Inst Technol, Dept Elect Engn, Madras 600036, Tamil Nadu, India. EM yegna@iitm.ernet.in CR EATOCK J, 1990, P INT C SPOK LANG PR, P133 FUNAHASHI K, 1989, NEURAL NETWORKS, V2, P183, DOI 10.1016/0893-6080(89)90003-8 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Gish H, 1994, IEEE SIGNAL PROC MAG, V11, P18, DOI 10.1109/79.317924 GONG Y, 1992, P IEEE ICASSP, V2, P173 Haykin S., 1999, NEURAL NETWORKS COMP, V2nd HECK LP, 1997, P IEEE INT C AC SPEE, V2, P1071 HERMANSKY H, 1998, P RLA2C AV FRANC HERMANSKY H, 1992, P IEEE INT C AC SPEE, V1, P1121 HORNIK K, 1991, NEURAL NETWORKS, V4, P251, DOI 10.1016/0893-6080(91)90009-T Jankowski C., 1990, P IEEE INT C AC SPEE, V1, P109 Lippman R., 1987, IEEE ASSP MAGAZI APR, P4 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MATSUI T, 1996, P IEEE INT C AC SPEE, V1, P97 Rabiner L, 1993, FUNDAMENTALS SPEECH REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds D.
A., 1994, P ESCA WORKSH AUT SP, P27 Reynolds D.A., 1995, P IEEE INT C AC SPEE, P329 ROSENBERG AE, 1976, P IEEE, V64, P475, DOI 10.1109/PROC.1976.10156 Yegnanarayana B., 1979, P IEEE INT C AC SPEE, P744 Yegnanarayana B., 1999, ARTIFICIAL NEURAL NE NR 21 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2003 VL 39 IS 3-4 BP 301 EP 310 DI 10.1016/S0167-6393(02)00046-8 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 637XJ UT WOS:000180538800007 ER PT J AU Kochanski, G Shih, C AF Kochanski, G Shih, C TI Prosody modeling with soft templates SO SPEECH COMMUNICATION LA English DT Article DE intonation; prosody; tone; modeling; dynamics; physiology; algorithm; computer language; speech; pitch; XML; markup language; communication; text-to-speech ID SUBGLOTTAL AIR-PRESSURE; FUNDAMENTAL-FREQUENCY; LINGUISTIC STRESS; SPECTRAL BALANCE; SPEECH SYNTHESIS; WORD STRESS; DURATION; ENGLISH; PERCEPTION; INTENSITY AB This paper describes a novel prosody generation model. We intend it to broadly support many linguistic theories and multiple languages, for the model imposes no restriction on accent categories and shapes. This capability is crucial to the next generation of text-to-speech systems that will need to synthesize intonation variations for different speech acts, emotions, and styles of speech. The system supports mark-up tags that are mathematically defined and generate f0 deterministically. Underlying the tags is an articulatory model of accent interaction which balances physiological and communication constraints. We specify the model by way of an algorithm for calculating the pitch, and by way of examples. The model allows localized, linguistically reasonable tags, and is suitable for a data-driven fitting process. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Lucent Technol, Bell Labs, Murray Hill, NJ 07974 USA. RP Kochanski, G (reprint author), Lucent Technol, Bell Labs, 600 Mt Ave, Murray Hill, NJ 07974 USA. EM gpk@bell-labs.com CR ABRY C, 1998, B COMMUNICATION PARL, V4 ANDERSON M, 1984, P ICASSP SAN DIEG CA, V1 ATKINSON JE, 1978, J ACOUST SOC AM, V63, P211, DOI 10.1121/1.381716 Beckman M, 1997, GUIDELINES TOBI LABE Berry DA, 1996, J VOICE, V10, P129, DOI 10.1016/S0892-1997(96)80039-7 BLACK AW, 1996, P ICSLP 96 PHIL PA U Bolinger D., 1958, WORD, V14, P109 Browman C. P., 1990, PAPERS LABORATORY PH, P341 CHEN SH, 1992, P IEEE ICASSP, V2, P45 de Pijper Jan Roelof, 1983, MODELLING BRIT ENGLI DUSTERHOFF KE, 1999, P 6 EUR C SPEECH COM Erickson D, 1998, PHONETICA, V55, P147, DOI 10.1159/000028429 FRY DB, 1955, J ACOUST SOC AM, V27, P765, DOI 10.1121/1.1908022 FRY DB, 1958, LANG SPEECH, V1, P126 Fujimura O, 2000, PHONETICA, V57, P128, DOI 10.1159/000028467 Fujisaki H., 1983, PRODUCTION SPEECH, P39 Fujisaki H., 1988, VOCAL PHYSL VOICE PR, P347 GARDING E, 1970, ANN B RES I LOGOPED, V27, P135 Gronnum N., 1992, GROUNDWORKS DANISH I Hadding-Koch Kerstin, 1961, ACOUSTICO PHONETIC S HAGGARD M, 1970, J ACOUST SOC AM, V47, P613, DOI 10.1121/1.1911936 Herzel H., 1995, NONLINEAR DYNAMICS N Hillenbrand JM, 1996, J SPEECH HEAR RES, V39, P1182 Hirst D., 2000, PROSODY THEORY EXPT HOMBERT JM, 1978, TONE LINGUISTIC SURV Keating PA, 1990, PAPERS LABORATORY PH, VI, P451 KEHOE M, 1995, J SPEECH HEAR RES, V38, P338 KOCHANSKI GP, 2001, 4 ISCA TUT RES WORKS KOCHANSKI GP, 2001, EUROSPEECH, P911 KOCHANSKI GP, 2000, P INT C SPOK LANG PR Ladd D.
R., 1996, INTONATIONAL PHONOLO LADEFOGED P, 1962, P 4 INT C PHON SCI, P247 LEVITT H, 1970, J ACOUST SOC AM 2, V49, P570 Liberman Mark, 1984, LANGUAGE SOUND STRUC, P157 LIEBERMAN P, 1960, J ACOUST SOC AM, V32, P451, DOI 10.1121/1.1908095 LIEBERMA.P, 1969, J ACOUST SOC AM, V45, P1537, DOI 10.1121/1.1911635 LOFQVIST A, 1989, J ACOUST SOC AM, V85, P1314 MAEKAWA K, 1998, P INT C SPOK LANG PR MALFRERE F, 1998, P INT C SPOK LANG PR MASSARO DW, 1976, J ACOUST SOC AM, V60, P704, DOI 10.1121/1.381143 MCFARLAND DH, 1992, J SPEECH HEAR RES, V35, P971 MONSEN RB, 1978, J ACOUST SOC AM, V64, P65, DOI 10.1121/1.381957 Moon F.C., 1987, CHAOTIC VIBRATIONS I MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 Moore B. C. J., 1989, INTRO PSYCHOL HEARIN MUNHALL K, 1992, J PHONETICS, V20, P111 Ogorzalek MJ, 1997, CHAOS COMPLEXITY NON OHALA J, 1970, UCLA WORKING PAPERS, V14, P12 OHALA J, 1967, UCLA WORKING PAPERS, P80 OHALA JJ, 1992, PAPERS LABORATORY PH, V2, P166 Ohman S., 1967, WORD SENTENCE INTONA, P20 OLIVE JP, 1975, J ACOUST SOC AM, V57, P476, DOI 10.1121/1.380436 Perrier P, 1996, J SPEECH HEAR RES, V39, P365 Pierrehumbert J., 1988, JAPANESE TONE STRUCT Pierrehumbert J, 1980, THESIS MIT PIPES LA, 1970, APPL MECH ENG PHYSIC POLLOCK KE, 1993, J PHONETICS, V21, P183 Ross KN, 1999, IEEE T SPEECH AUDI P, V7, P295, DOI 10.1109/89.759037 SHIH C, 2000, P INT C SPOK LANG PR Shih C., 1992, P IRCS WORKSH PROS N, P193 Shih Chilin, 1986, THESIS U CALIFORNIA Shih CL, 2000, TEXT SPEECH LANG TEC, V15, P243 Silverman K., 1992, P INT C SPOK LANG PR, V2, P867 SIMADA ZB, 1978, ANN B RES I LOGOPED, V5, P41 Sluijter AMC, 1997, J ACOUST SOC AM, V101, P503, DOI 10.1121/1.417994 Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955 SPROAT R, 1998, MULTILINGUAL TEXT SP SPROAT R, 1998, P ICSLP, P1719 Stevens K.N., 1998, ACOUSTIC PHONETICS TALKIN D, 1996, SPEECH CODING SYNTHE Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 Taylor P, 1997, SPEECH COMMUN, V21, P123, DOI 10.1016/S0167-6393(96)00068-4 TAYLOR PA, 1998, P INT C SPOK LANG PR TITZE IR, 1989, J ACOUST SOC AM, V85, P901, DOI 10.1121/1.397562 TITZE IR, 1993, PRINCIPLES VOICE PRO, P205 TITZE IR, 1988, J ACOUST SOC AM, V83, P1536, DOI 10.1121/1.395910 TITZE IR, 1993, PRINCIPLES VOICE PRO, P36 Turk AE, 1996, J ACOUST SOC AM, V99, P3782, DOI 10.1121/1.414995 Tyson JA, 1998, ASTROPHYS J, V498, pL107, DOI 10.1086/311314 van Santen J, 1998, MULTILINGUAL TEXT-TO-SPEECH SYNTHESIS: THE BELL LABS APPROACH, P141 VANSANTEN J, 1997, P ESCA WORKSH INT TH, P321 VANSANTEN JPH, 2000, INTONATION ANAL MODE Whalen DH, 1997, PHONETICA, V54, P138 WIER CC, 1977, J ACOUST SOC AM, V61, P178, DOI 10.1121/1.381251 WINKWORTH AL, 1994, J SPEECH HEAR RES, V37, P535 WINKWORTH AL, 1995, J SPEECH HEAR RES, V38, P124 XU CX, 1999, P 14 INT C PHON SCI, P2359 Xu Y, 1993, THESIS U CONNECTICUT XU Y, 2000, P INT C SPOK LANG PR NR 89 TC 41 Z9 42 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
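The Kochanski and Shih abstract above defines its tags and dynamics precisely in the paper itself; the sketch below is only a schematic reconstruction of the general "soft template" idea, with f0 obtained by trading template fidelity against a smoothness penalty that stands in for physiological effort. The quadratic cost, the curvature penalty and the tag format are all assumptions of this sketch, not the published model.

```python
import numpy as np

def soft_template_f0(n, tags, smooth=50.0):
    """Schematic soft-template pitch generation.

    Minimises sum_i s_i * ||f - g_i||^2 (over each tag's span) plus
    smooth * ||D2 f||^2: templates pull f0 toward their shapes with
    tag-specific strength, while a curvature penalty stands in for
    articulatory effort. tags: list of (start, template_array, strength).
    """
    D2 = np.diff(np.eye(n), 2, axis=0)        # second-difference operator
    A = smooth * (D2.T @ D2)
    b = np.zeros(n)
    for start, g, s in tags:
        idx = np.arange(start, start + len(g))
        A[idx, idx] += s                      # diagonal template weights
        b[idx] += s * g
    return np.linalg.solve(A + 1e-9 * np.eye(n), b)

# A strong rise followed by a weak fall; weak tags yield to smoothness.
f0 = soft_template_f0(100, [(10, np.linspace(120, 180, 30), 5.0),
                            (60, np.linspace(170, 110, 30), 0.3)])
```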
PD FEB PY 2003 VL 39 IS 3-4 BP 311 EP 352 DI 10.1016/S0167-6393(02)00047-X PG 42 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 637XJ UT WOS:000180538800008 ER PT J AU Imperl, B Kacic, Z Horvat, B Zgank, A AF Imperl, B Kacic, Z Horvat, B Zgank, A TI Clustering of triphones using phoneme similarity estimation for the definition of a multilingual set of triphones SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; multilingual acoustic modelling; agglomerative triphone clustering; tree-based clustering; SpeechDat(II) databases AB This paper addresses the problem of multilingual acoustic modelling for the design of multilingual speech recognisers. An agglomerative clustering algorithm for the definition of a multilingual set of triphones is proposed. This clustering algorithm is based on the definition of an indirect distance measure for triphones defined as a weighted sum of the explicit estimates of the context similarity on a monophone level. The monophone similarity estimation method is based on the algorithm of Houtgast. The new clustering algorithm was tested in a multilingual speech recognition experiment for three languages. The algorithm was applied to the monolingual triphone sets of language-specific recognisers for all languages. In order to evaluate the clustering algorithm, the performance of the multilingual set of triphones was compared to the performance of the reference system composed of all three language-specific recognisers operating in parallel, and to the performance of the multilingual set of triphones produced by the tree-based clustering algorithm. All experiments were based on the 1000 FDB SpeechDat(II) databases (Slovenian, Spanish and German). Experiments have shown that the use of the clustering algorithm results in a significant reduction of the number of triphones with minor degradation of recognition rate. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Maribor, Inst Elect, Fac Elect Engn & Comp Sci, SI-2000 Maribor, Slovenia. RP Zgank, A (reprint author), Univ Maribor, Inst Elect, Fac Elect Engn & Comp Sci, Smetanova 17, SI-2000 Maribor, Slovenia. EM bojan.imperl@uni-mb.si; andrej.zgank@uni-mb.si RI Zgank, Andrej/A-5711-2008 CR ANDERSEN O, 1993, P EUR 1993 BERL, P759 Berkling K. M., 1994, P ICSLP 94, P1891 BERKLING KM, 1996, THESIS OREGON GRADUA BONVENTURA P, 1997, P EUR 97 RHOD BOURLARD H, 1995, P EUR 95 MADR, P883 HARBECK S, 1997, MULTILINGUAL INFORMA, P9 HAUNSTEIN A, 1995, P ICASSP 95 HOEGE H, 1997, P ICASSP, P1771 JOHANSEN FT, 2000, P LREC 2000 ATH KADAMBE S, 1994, P INT C SPOKEN LANGU, P1879 KAISER J, 1998, SPEECH DATABASE DEV KOEHLER J, 1996, P ICSLP 96, P1780 KOEHLER J, 1998, P ICASSP 98 SEATTL KOEHLER J, 1997, MULTILINGUAL INFORMA, P16 METZE F, 2000, P ICASSP 2000 IST NOTH E, 1996, SPEECH IMAGE UNDERST, P59 Van Den Heuvel H., 2001, International Journal of Speech Technology, V4, DOI 10.1023/A:1011375311203 SCHULTZ T, 1999, P MIST WORKSH LEUSD Waibel A, 1998, P DARPA BROADC NEWS WENG F, 1997, P EUR 97 RHOD Young S., 1994, P ARPA HUM LANG TECH YOUNG SJ, 1996, IEEE T SAP, V4, P31 YOUNG SJ, 1997, HTK BOOK HTC VERSION ZISSMAN MA, 1994, P INT C ACOUSTICS SP, P305 NR 24 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
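As a sketch of the indirect distance measure described in the Imperl et al. abstract above: the distance between two triphones is taken as a weighted sum of monophone-level context dissimilarities. The similarity table, weights and phoneme names below are illustrative assumptions; the paper estimates the similarities with Houtgast's algorithm.

```python
def triphone_distance(tri_a, tri_b, sim, weights=(0.25, 0.5, 0.25)):
    """Weighted sum of monophone dissimilarities of left/centre/right contexts.

    tri_a, tri_b: (left, centre, right) phoneme triples.
    sim: symmetric similarity lookup, (phoneme, phoneme) -> [0, 1].
    """
    return sum(w * (1.0 - sim.get((p, q), sim.get((q, p), 0.0)))
               for w, p, q in zip(weights, tri_a, tri_b))

sim = {("a", "a"): 1.0, ("b", "p"): 0.8, ("n", "m"): 0.7}
d = triphone_distance(("a", "b", "n"), ("a", "p", "m"), sim)
# 0.25*(1-1.0) + 0.5*(1-0.8) + 0.25*(1-0.7) = 0.175
```

An agglomerative pass would then repeatedly merge the pair of triphone clusters with the smallest such distance until a target inventory size is reached.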
PD FEB PY 2003 VL 39 IS 3-4 BP 353 EP 366 DI 10.1016/S0167-6393(02)00048-1 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 637XJ UT WOS:000180538800009 ER PT J AU Slaney, M McRoberts, G AF Slaney, M McRoberts, G TI BabyEars: A recognition system for affective vocalizations SO SPEECH COMMUNICATION LA English DT Article ID SPEECH; INFANTS; LANGUAGES; INTONATION; MOTHERESE; PITCH AB Our goal was to see how much of the affective message we could recover using simple acoustic measures of the speech signal. Using pitch and broad spectral-shape measures, a multidimensional Gaussian mixture-model discriminator classified adult-directed (neutral affect) versus infant-directed speech correctly more than 80% of the time, and classified the affective message of infant-directed speech correctly nearly 70% of the time. We confirmed previous findings that changes in pitch provide an important cue for affective messages. In addition, we found that timbre or cepstral coefficients also provide important information about the affective message. Mothers' speech was significantly easier to classify than fathers' speech, suggesting either clearer distinctions among these messages in mothers' speech to infants, or a difference between fathers and mothers in the acoustic information used to convey these messages. Our research is a step towards machines that sense the "emotional state" of a speaker. (C) 2002 Elsevier Science B.V. All rights reserved. C1 IBM Corp, Almaden Res Ctr, San Jose, CA 95120 USA. Lehigh Univ, Dept Psychol, Bethlehem, PA 18015 USA. RP Slaney, M (reprint author), IBM Corp, Almaden Res Ctr, 650 Harry Rd, San Jose, CA 95120 USA. EM malcolm@almaden.ibm.com; gwm3@lehigh.edu CR Bishop C. M., 1995, NEURAL NETWORKS PATT Droppo J., 1998, P INT C SPOK LANG PR Efron B., 1993, INTRO BOOTSTRAP Engberg I., 1997, P EUROSPEECH 97, V4, P1695 Fairbanks G, 1939, SPEECH MONOGR, V6, P87 FERGUSON CA, 1964, AM ANTHROPOL, V66, P103, DOI 10.1525/aa.1964.66.suppl_3.02a00060 Fernald A., 1992, ADAPTED MIND EVOLUTI FERNALD A, 1993, CHILD DEV, V64, P657, DOI 10.1111/j.1467-8624.1993.tb02934.x FERNALD A, 1989, J CHILD LANG, V16, P477 FERNALD A, 1989, CHILD DEV, V60, P1497, DOI 10.1111/j.1467-8624.1989.tb04020.x FERNALD A, 1985, INFANT BEHAV DEV, V8, P181, DOI 10.1016/S0163-6383(85)80005-9 Garnica O. K., 1977, TALKING CHILDREN LAN Hunt M. J., 1980, P INT C AC SPEECH SI, P880 Katz GS, 1996, CHILD DEV, V67, P205, DOI 10.1111/j.1467-8624.1996.tb01729.x LAMEL LF, 1981, IEEE T ACOUST SPEECH, V29, P777, DOI 10.1109/TASSP.1981.1163642 MCROBERTS G, UNPUB DEV PSYCHOL Nabney I., 2001, NETLAB ALGORITHMS PA PAPOUSEK M, 1991, INFANT BEHAV DEV, V14, P415, DOI 10.1016/0163-6383(91)90031-M Picard R. W., 1997, AFFECTIVE COMPUTING PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770 ROY D, 1996, IEEE FAC GEST C KILL, P363 SCHERER KR, 1986, PSYCHOL BULL, V99, P143 Skinner ER, 1935, SPEECH MONOGR, V2, P81 Slaney M., 1998, P 1998 INT C AC SPEE STERN DN, 1982, DEV PSYCHOL, V18, P727 Talkin D., 1995, SPEECH CODING SYNTHE, P495 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 NR 27 TC 32 Z9 32 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
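The classifier family described in the BabyEars abstract above (one multidimensional Gaussian mixture model per affective class, with maximum-likelihood decisions) can be sketched as follows. The class names, the three-dimensional feature vectors and the random stand-in data are assumptions for illustration, and scikit-learn's GaussianMixture stands in for whatever training code the authors used.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
classes = ["approval", "attention", "prohibition"]
# Stand-in training data: 200 frames of 3 features (pitch + spectral shape).
train = {c: rng.normal(loc=i, size=(200, 3)) for i, c in enumerate(classes)}

models = {c: GaussianMixture(n_components=4, covariance_type="diag",
                             random_state=0).fit(x) for c, x in train.items()}

def classify(utterance):
    """Return the class whose GMM gives the highest average log-likelihood."""
    return max(models, key=lambda c: models[c].score(utterance))

print(classify(rng.normal(loc=2, size=(30, 3))))  # likely "prohibition"
```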
PD FEB PY 2003 VL 39 IS 3-4 BP 367 EP 384 DI 10.1016/S0167-6393(02)00049-3 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 637XJ UT WOS:000180538800010 ER PT J AU Luo, FL Widrow, B Pavlovic, C AF Luo, FL Widrow, B Pavlovic, C TI Special issue on speech processing for hearing aids SO SPEECH COMMUNICATION LA English DT Editorial Material NR 0 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2003 VL 39 IS 1-2 BP 1 EP 3 DI 10.1016/S0167-6393(02)00054-7 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100001 ER PT J AU Huettel, LG Collins, LM AF Huettel, LG Collins, LM TI A theoretical comparison of information transmission in the peripheral auditory system: Normal and impaired frequency discrimination SO SPEECH COMMUNICATION LA English DT Article DE frequency discrimination; signal detection theory; hearing impaired; theoretical analysis and predictions ID ONE-PARAMETER DISCRIMINATION; NERVE FIBERS; COMPUTATIONAL MODEL; PERFORMANCE LIMITS; NORMAL-HEARING; GAP DETECTION; RESPONSES; LISTENERS; INTENSITY; MASKING AB A theoretical analysis of auditory signal processing and the distortions introduced by various types of hearing impairments can aid in the design and development of digital hearing aids. In this paper, we investigate the differences between normal and impaired auditory processing on a frequency discrimination task by analyzing the responses of a computational auditory model using signal detection theory. Hearing impairments that were simulated included a threshold shift, damage to the outer hair cells, and impaired neural synchrony. We implemented two detectors, one using all of the information in the signal, the other using only the number of neural firings, and used them to generate theoretical predictions of performance. Evaluation of performance differences between theoretical detectors and experimental data allows quantification of both the type of information present in the auditory system and the efficiency of its use. This method of analysis predicted both the trends observed in comparable experimental data and the relation between normal and impaired behavior. Finally, a very simple hearing aid was simulated and the gains in performance in the impaired cases were related to the physiological bases of the impairments. This demonstrates the utility of the proposed approach in the design of more complex hearing aids. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Duke Univ, Dept Elect & Comp Engn, Durham, NC 27708 USA. RP Collins, LM (reprint author), Duke Univ, Dept Elect & Comp Engn, Durham, NC 27708 USA. EM lisa.huettel@duke.edu; lcollins@ee.duke.edu CR BIONDI E, 1978, AUDIOLOGY, V17, P43 CARNEY LH, 1993, J ACOUST SOC AM, V93, P401, DOI 10.1121/1.405620 Cox D. R., 1966, STAT ANAL SERIES EVE Cramer H., 1951, MATH METHODS STAT FITZGIBBONS PJ, 1982, J ACOUST SOC AM, V72, P761, DOI 10.1121/1.388256 FREYMAN RL, 1991, J SPEECH HEAR RES, V34, P1371 GIGUERE C, 1994, J ACOUST SOC AM, V95, P331 GLASBERG BR, 1987, J ACOUST SOC AM, V81, P1546, DOI 10.1121/1.394507 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Green D.
M., 1974, SIGNAL DETECTION THE Gresham LC, 1998, J ACOUST SOC AM, V103, P2520, DOI 10.1121/1.422773 HARRISON JM, 1974, HDB SENSORY PHYSL Heinz MG, 2001, NEURAL COMPUT, V13, P2317, DOI 10.1162/089976601750541813 Heinz MG, 2001, NEURAL COMPUT, V13, P2273, DOI 10.1162/089976601750541804 Huettel LG, 1999, IEEE T BIO-MED ENG, V46, P1432, DOI 10.1109/10.804571 JOHNSON DH, 1980, J ACOUST SOC AM, V68, P1115, DOI 10.1121/1.384982 KATES JM, 1993, J REHABIL RES DEV, V30, P39 KIANG NYS, 1965, MIT RES MONO, V35 LIBERMAN MC, 1984, HEARING RES, V16, P55, DOI 10.1016/0378-5955(84)90025-X MEDDIS R, 1988, J ACOUST SOC AM, V83, P1056, DOI 10.1121/1.396050 MEDDIS R, 1990, J ACOUST SOC AM, V87, P1813, DOI 10.1121/1.399379 MOORE BCJ, 1973, J ACOUST SOC AM, V54, P610, DOI 10.1121/1.1913640 Moore B.C.J., 1995, PERCEPTUAL CONSEQUEN MOORE BCJ, 1989, INTRO PSYCHOL HEARIN, P76 PATTERSON RD, 1995, J ACOUST SOC AM, V98, P1890, DOI 10.1121/1.414456 PERKEL DH, 1967, BIOPHYS J, V7, P391 Peterson W., 1954, IRE T PGIT, V4, P171, DOI 10.1109/TIT.1954.1057460 Pickles JO, 1988, INTRO PHYSL HEARING Robert A, 1999, J ACOUST SOC AM, V106, P1852, DOI 10.1121/1.427935 ROSE JE, 1971, J NEUROPHYSIOL, V34, P685 SHAMMA SA, 1986, J ACOUST SOC AM, V80, P133, DOI 10.1121/1.394173 SIEBERT WM, 1968, RECOGNIZING PATTERNS SIEBERT WM, 1970, PR INST ELECTR ELECT, V58, P723, DOI 10.1109/PROC.1970.7727 SIMON HJ, 1993, EAR HEARING, V14, P190, DOI 10.1097/00003446-199306000-00006 Snyder D.L., 1991, RANDOM POINT PROCESS, Vsecond TANNER WP, 1958, J ACOUST SOC AM, V32, P1140 VANTREES H, 1968, DETECTION ESTIMATION, P86 Zhang XD, 2001, J ACOUST SOC AM, V109, P648, DOI 10.1121/1.1336503 NR 38 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2003 VL 39 IS 1-2 BP 5 EP 21 DI 10.1016/S0167-6393(02)00055-9 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100002 ER PT J AU Fillon, T Prado, J AF Fillon, T Prado, J TI Evaluation of an ERB frequency scale noise reduction for hearing aids: A comparative study SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; non-uniform filter-banks; psychoacoustic; ERB frequency scale ID SPECTRAL AMPLITUDE ESTIMATOR; SPEECH ENHANCEMENT; AUDIO AB Among speech enhancement methods, the Ephraim and Malah suppression rule (EMSR) has proven to be efficient in reducing background noise while avoiding a common artefact: musical noise. Motivated by psychoacoustic considerations, an implementation of the EMSR with a perceptually relevant frequency partition is proposed. This implementation is based on non-uniform oversampled filter-banks. The frequency resolution is nevertheless uniform on the equivalent rectangular bandwidth (ERB)-scale. Objective and subjective comparisons with classical EMSR implementations based on uniform, critically decimated filter-banks were carried out. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Dept TSI ENST Paris, F-75634 Paris, France. RP Fillon, T (reprint author), Dept TSI ENST Paris, 46 Rue Barrault, F-75634 Paris, France.
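For readers unfamiliar with the Ephraim and Malah suppression rule discussed in the Fillon and Prado abstract above, a minimal per-band sketch follows, assuming an STFT or filter-bank front end and a known noise power estimate. The decision-directed smoothing constant alpha is the illustrative choice usually credited with limiting musical noise; the ERB filter-bank itself is omitted.

```python
import numpy as np
from scipy.special import i0, i1

def emsr_gain(post_snr, prio_snr):
    """Ephraim & Malah (1984) MMSE short-time spectral amplitude gain."""
    v = prio_snr / (1.0 + prio_snr) * post_snr
    v = np.minimum(v, 500.0)          # avoid Bessel overflow at very high SNR
    return (np.sqrt(np.pi * v) / (2.0 * post_snr) * np.exp(-v / 2.0)
            * ((1.0 + v) * i0(v / 2.0) + v * i1(v / 2.0)))

def enhance(mag, noise_pow, alpha=0.98):
    """mag: |spectrum| frames (T x F); noise_pow: noise power estimate (F,)."""
    out = np.empty_like(mag)
    prev = np.ones(mag.shape[1])
    for t, frame in enumerate(mag):
        post = np.maximum(frame**2 / noise_pow, 1e-6)       # a posteriori SNR
        prio = np.maximum(alpha * prev
                          + (1.0 - alpha) * np.maximum(post - 1.0, 0.0), 1e-6)
        g = emsr_gain(post, prio)
        out[t] = g * frame
        prev = (g**2) * post          # decision-directed a priori SNR recursion
    return out
```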
EM fillon@tsi.enst.fr; prado@tsi.enst.fr CR Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 CAPPE Olivier, 1993, THESIS ECOLE NATL SU DIETHORN EJ, 2000, ACOUSTIC SIGNAL PROC EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 GOODWIN M, 1997, 30 AS C SIGN SYST CO, V2, P1229 GOYE A, 2000, THESIS ECOLE NATL SU Gulzow T, 1998, SIGNAL PROCESS, V64, P5, DOI 10.1016/S0165-1684(97)00172-2 HALL J, 1998, DIGITAL SIGNAL PROCE Harma A, 2000, J AUDIO ENG SOC, V48, P1011 Malah D, 1999, INT CONF ACOUST SPEE, P789 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 MOORE B J, 1982, INTRO PSYCHOL HEARIN Moore B.C.J., 1995, PERCEPTUAL CONSEQUEN Painter T, 2000, P IEEE, V88, P451, DOI 10.1109/5.842996 Smith JO, 1999, IEEE T SPEECH AUDI P, V7, P697, DOI 10.1109/89.799695 ZWICKER E, 1981, COLLECTION TECHNIQUE NR 17 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2003 VL 39 IS 1-2 BP 23 EP 32 DI 10.1016/S0167-6393(02)00056-0 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100003 ER PT J AU Yang, J Luo, FL Nehorai, A AF Yang, J Luo, FL Nehorai, A TI Spectral contrast enhancement: Algorithms and comparisons SO SPEECH COMMUNICATION LA English DT Article DE signal processing; noise reduction; speech enhancement; human audition; auditory system; real-time implementation ID SENSORINEURAL HEARING IMPAIRMENT; FREQUENCY-SELECTIVITY; SPEECH; LISTENERS; INTELLIGIBILITY; QUALITY; NOISE AB This paper investigates spectral contrast enhancement techniques and their implementation complexity. Three algorithms are dealt with in this paper. The first is the method described by Baer, Moore and Gatehouse. Two alternative methods are also proposed and investigated in this paper from a practical application and implementation point of view. Theoretical analyses and results from laboratory, simulation and subject listening show that spectral contrast enhancement and performance improvement can be achieved by use of these three methods with the appropriate selection of their relevant parameters. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Quicksilver Technol, San Jose, CA 95119 USA. Fortemedia Inc, Cupertino, CA 95014 USA. Univ Illinois, ECE Dept, Chicago, IL 60607 USA. RP Luo, FL (reprint author), Quicksilver Technol, 6640 Via Del Oro, San Jose, CA 95119 USA. EM falongl@yahoo.com RI Nehorai, Arye/G-1661-2011 CR BAER T, 1993, J REHABIL RES DEV, V30, P49 BOERS PM, 1980, 15 IPO, P21 BUNNELL HT, 1990, J ACOUST SOC AM, V88, P2546, DOI 10.1121/1.399976 BUSTAMANTE DK, 1986, J ACOUST SOC AM S1, V80, pS12, DOI 10.1121/1.2023659 Finan R. A., 1994, Biomedical Engineering, Applications Basis Communications, V6 Guo HT, 1998, IEEE T SIGNAL PROCES, V46, P335 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 RIBIC Z, 1996, P 1996 IEEE INT C AC, V2, P937 SIMPSON AM, 1990, ACTA OTO-LARYNGOL, P101 STONE MA, 1992, J REHABIL RES DEV, V29, P39, DOI 10.1682/JRRD.1992.04.0039 SUMMERFIELD Q, 1985, SPEECH COMMUN, V4, P213, DOI 10.1016/0167-6393(85)90048-2 NR 11 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
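A generic spectral contrast enhancement sketch, in the spirit of the three algorithms compared in the Yang, Luo and Nehorai abstract above: the log-magnitude spectrum is split into a smooth envelope and a fine structure, and the peak-to-valley contrast is exaggerated by re-weighting the fine structure. The moving-average kernel and the enhancement factor are illustrative assumptions, not any of the paper's three methods.

```python
import numpy as np

def enhance_contrast(mag_spectrum, factor=1.5, width=9):
    log_mag = np.log(mag_spectrum + 1e-12)
    kernel = np.ones(width) / width
    envelope = np.convolve(log_mag, kernel, mode="same")   # smooth envelope
    detail = log_mag - envelope                            # peaks and valleys
    return np.exp(envelope + factor * detail)              # factor > 1 sharpens

spectrum = np.abs(np.fft.rfft(np.random.randn(512)))
sharpened = enhance_contrast(spectrum)
```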
PD JAN PY 2003 VL 39 IS 1-2 BP 33 EP 46 DI 10.1016/S0167-6393(02)00057-2 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100004 ER PT J AU Kleinschmidt, M Hohmann, V AF Kleinschmidt, M Hohmann, V TI Sub-band SNR estimation using auditory feature processing SO SPEECH COMMUNICATION LA English DT Article DE sub-band SNR estimation; situation classification; auditory front end; sigma-pi cells; neural networks ID ROBUST SPEECH RECOGNITION; AMPLITUDE-MODULATION; MODEL; ENHANCEMENT; PERCEPTION AB In this paper a new approach is presented for estimating the long-term speech-to-noise ratio (SNR) in individual frequency bands that is based on methods known from automatic speech recognition (ASR). It uses a model of auditory perception as a front end, physiologically and psychoacoustically motivated sigma-pi cells as secondary features, and a linear or non-linear neural network as classifier. A non-linear neural network back end is capable of estimating the SNR in time segments of 1 s with a root-mean-square error of 5.68 dB on unknown test material. This performance is obtained on a large set of natural types of noise, containing non-stationary signals and alarm sounds. However, the SNR estimation works best for more stationary types of noise. The individual components of the estimation algorithms are examined with respect to their importance for the estimation accuracy. The algorithm presented in this paper yields similar or better results with comparable computational effort relative to other methods known from the literature for short-term SNR estimation. The new approach is purely based on slow spectro-temporal modulations and is therefore a valuable contribution to both digital hearing aids and ASR systems. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany. RP Kleinschmidt, M (reprint author), Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany. EM michael@medi.physik.uni-oldenburg.de CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 Avendano C., 1996, Proceedings. Third IEEE Workshop on Interactive Voice Technology for Telecommunications Applications. IVTTA-96 (Cat. No.96TH8178), DOI 10.1109/IVTTA.1996.552760 BOURLARD H, 1996, EUR SIGN P C TRIEST, P1579 Chi TS, 1999, J ACOUST SOC AM, V106, P2719, DOI 10.1121/1.428100 Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959 Dau T, 1997, J ACOUST SOC AM, V102, P2892, DOI 10.1121/1.420344 deCharms RC, 1998, SCIENCE, V280, P1439, DOI 10.1126/science.280.5368.1439 DUPONT S, 1999, P WORKSH ROB METH SP, P115 GRAMSS T, 1991, IEE CONF PUBL, V349, P180 GRAMSS T, 1990, SPEECH COMMUN, V9, P35, DOI 10.1016/0167-6393(90)90043-9 HAGAN MT, 1994, IEEE T NEURAL NETWOR, V5, P989, DOI 10.1109/72.329697 Hansen M, 2000, J AUDIO ENG SOC, V48, P395 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1998, P ICSLP 98, V3, P1003 Hirsch H., 1995, P ICASSP, P153 Hirsch H.
G., 1993, TR93012 INT COMP SCI Hohmann V, 2002, ACTA ACUST UNITED AC, V88, P433 KAERNBACH C, 2000, CONTRIB PSYCHOL, P295 Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 KLEINSCHMIDT M, 2000, FORTSCHRITTE AKUSTIK, P382 Kleinschmidt M, 2001, SPEECH COMMUN, V34, P75, DOI 10.1016/S0167-6393(00)00047-9 KOHLER DR, 1994, PHARMACOTHERAPY, V14, P3 KOLLMEIER B, 1994, J ACOUST SOC AM, V95, P1593, DOI 10.1121/1.408546 Martin R., 1993, P EUROSPEECH 93 BERL, P1093 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 Patterson RD, 1987, M IOC SPEECH GROUP A Puschel D., 1988, THESIS U GOTTINGEN Ris C, 2001, SPEECH COMMUN, V34, P141, DOI 10.1016/S0167-6393(00)00051-0 TCHORZ J, 1999, P EUR ISCA BUD HUNG, P2399 Tchorz J, 1999, J ACOUST SOC AM, V106, P2040, DOI 10.1121/1.427950 Tchorz J, 2001, ADV NEUR IN, V13, P821 Tchorz J, 2002, SPEECH COMMUN, V38, P1, DOI 10.1016/S0167-6393(01)00040-1 NR 32 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2003 VL 39 IS 1-2 BP 47 EP 63 DI 10.1016/S0167-6393(02)00058-4 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100005 ER PT J AU Douglas, SC Sun, XA AF Douglas, SC Sun, XA TI Convolutive blind separation of speech mixtures using the natural gradient SO SPEECH COMMUNICATION LA English DT Article ID ARRAY HEARING-AIDS; SIGNAL SEPARATION; BINAURAL OUTPUT; DECORRELATION AB Convolutive blind separation of speech, also known as the "cocktail party problem", is a challenging task for which few successful algorithms have been developed. In this paper, we explore two novel algorithms for separating mixtures of multiple speech signals as measured by multiple microphones in a room environment. Both algorithms are modifications of an existing approach for density-based multichannel blind deconvolution (MBD) using natural gradient adaptation. The first approach employs non-holonomic constraints on the multichannel separation system to effectively avoid the partial deconvolution of the extracted speech signals within the separation system's outputs. The second approach employs linear predictors within the coefficient updates and produces separated speech signals whose autocorrelation properties can be arbitrarily specified. Unlike MBD methods, the proposed techniques maintain the spectral content of the original speech signals in the extracted outputs. Performance comparisons of the proposed methods with existing techniques show their usefulness in separating real-world speech signal mixtures. (C) 2002 Published by Elsevier Science B.V. C1 So Methodist Univ, Dept Elect Engn, Dallas, TX 75275 USA. RP Douglas, SC (reprint author), So Methodist Univ, Dept Elect Engn, POB 750338, Dallas, TX 75275 USA. EM douglas@engr.smu.edu CR Amari S, 2000, NEURAL COMPUT, V12, P1463, DOI 10.1162/089976600300015466 Amari S., 1997, P 11 IFAC S SYST ID, V3, P1057 Amari S, 1998, NEURAL COMPUT, V10, P251, DOI 10.1162/089976698300017746 Amari S., 1997, P IEEE WORKSH SIGN P, P101 Anemuller J., 2000, P 2 INT WORKSH IND C, P215 CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 DAVENPORT WB, 1950, 148 MIT RES LAB EL Desloge JG, 1997, IEEE T SPEECH AUDI P, V5, P529, DOI 10.1109/89.641298 Douglas S. C., 2000, UNSUPERVISED ADAPTIV, V1, P13 Douglas S. 
C., 2001, HDB NEURAL NETWORK S DOUGLAS SC, 2001, MICROPHONE ARRAYS TE, P355 FUDGE GL, 1994, IEEE T SIGNAL PROCES, V42, P2871, DOI 10.1109/78.324759 GIBSON JD, 1993, ELECT ENG HDB, P279 LAMBERT RH, 1997, P IEEE INT C AC SPEE, V1, P423 Lambert RH, 1996, THESIS U SO CALIFORN LEE TW, 1998, P IEEE INT C AC SPEE, P1089 Parra L, 1998, P IEEE WORKSH NEUR N, P23 Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214 Paulraj AJ, 1997, IEEE SIGNAL PROC MAG, V14, P49, DOI 10.1109/79.637317 Pham D., 1999, P 1 WORKSH IND COMP, P215 Shynk JJ, 1992, IEEE SIGNAL PROC MAG, V9, P14, DOI 10.1109/79.109205 SUN X, 2000, P 34 AS C SIGN SYST, V2, P1412 VANGERVEN S, 1995, IEEE T SIGNAL PROCES, V43, P1602, DOI 10.1109/78.398721 Weinstein E, 1993, IEEE T SPEECH AUDI P, V1, P405, DOI 10.1109/89.242486 Welker DP, 1997, IEEE T SPEECH AUDI P, V5, P543, DOI 10.1109/89.641299 NR 25 TC 38 Z9 39 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2003 VL 39 IS 1-2 BP 65 EP 78 DI 10.1016/S0167-6393(02)00059-6 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100006 ER PT J AU Anemuller, J Kollmeier, B AF Anemuller, J Kollmeier, B TI Adaptive separation of acoustic sources for anechoic conditions: A constrained frequency domain approach SO SPEECH COMMUNICATION LA English DT Article DE blind source separation; independent component analysis; noise reduction; speech signal processing; hearing aids ID BLIND SOURCE SEPARATION; SPEECH SIGNALS; ALGORITHM; INFORMATION AB Blind source separation represents a signal processing technique with a large potential for noise reduction. However, its application in modern digital hearing aids poses high demands with respect to computational efficiency and speed of adaptation towards the desired solution. In this paper, an algorithm is presented which fulfills these goals under the idealized assumption that the superposition of sources in rooms can be approximated as a superposition under anechoic conditions. Specifically, attenuation, the signals' finite propagation speed, and diffuse noise are accounted for, whereas reflections and reverberation are considered as negligible effects. This approximation is referred to as the 'free field' assumption. Starting from a general blind source separation algorithm for Fourier transformed speech signals, the free field assumption is incorporated into the framework, yielding a simple, fast and adaptive algorithm that is able to track moving sources. Implementation details are given which were found to be indispensable for fast and robust signal separation. Performance is evaluated both by simulations and experimentally, including separation of a moving and a fixed speaker in a recorded real anechoic environment. The potential benefits and shortcomings of this algorithm are discussed with regard to its inclusion into the signal processing framework of digital hearing aids for real reverberant acoustic situations. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany. RP Anemuller, J (reprint author), Salk Inst Biol Studies, Computat Neurobiol Lab, POB 85800, San Diego, CA 92186 USA. 
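The two blind source separation abstracts above (Douglas and Sun; Anemuller and Kollmeier) both build on gradient-based independent component analysis. A minimal instantaneous, non-convolutive natural gradient sketch is shown below; the convolutive and frequency-domain variants replace the single unmixing matrix by FIR filters or per-frequency matrices. The tanh score function, step size and iteration count are common illustrative choices, not either paper's exact algorithm.

```python
import numpy as np

def natural_gradient_bss(x, mu=0.01, iters=300):
    """x: mixtures, shape (n_sources, n_samples). Returns unmixing matrix W."""
    n = x.shape[0]
    W = np.eye(n)
    for _ in range(iters):
        y = W @ x
        phi = np.tanh(y)              # score function for super-Gaussian sources
        W += mu * (np.eye(n) - phi @ y.T / x.shape[1]) @ W
    return W                          # W <- W + mu*(I - E[phi(y) y^T]) W

rng = np.random.default_rng(1)
s = rng.laplace(size=(2, 5000))       # independent super-Gaussian sources
A = np.array([[1.0, 0.6], [0.4, 1.0]])
W = natural_gradient_bss(A @ s)
y = W @ (A @ s)                       # approximately separated sources
```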
EM jorn@salk.edu RI Anemuller, Jorn/A-8090-2009 CR Amari S, 1996, ADV NEUR IN, V8, P757 ANEMULLER J, 2000, FORTSCHRITTE AKUSTIK, P364 ANEMULLER J, 1999, P 1 INT WORKSH IND C, P331 Anemuller J., 2000, P 2 INT WORKSH IND C, P215 Back A.D., 1994, NEURAL NETWORKS SIGN, V4, P565 BELL AJ, 1995, NEURAL COMPUT, V7, P1129, DOI 10.1162/neco.1995.7.6.1129 Bingham E, 2000, Int J Neural Syst, V10, P1, DOI 10.1142/S0129065700000028 Bishop C. M., 1995, NEURAL NETWORKS PATT BREHM H, 1987, SIGNAL PROCESS, V12, P119, DOI 10.1016/0165-1684(87)90001-6 Cardoso JF, 1996, IEEE T SIGNAL PROCES, V44, P3017, DOI 10.1109/78.553476 COMON P, 1994, SIGNAL PROCESS, V36, P287, DOI 10.1016/0165-1684(94)90029-9 Emile B, 1998, SIGNAL PROCESS, V69, P93, DOI 10.1016/S0165-1684(98)00061-9 Fiori S, 2000, NEUROCOMPUTING, V34, P239, DOI 10.1016/S0925-2312(00)00161-2 Heckl M., 1994, TASCHENBUCH TECHNISC, V2 Hyvarinen A, 2001, INDEPENDENT COMPONEN JUTTEN C, 1991, SIGNAL PROCESS, V24, P1, DOI 10.1016/0165-1684(91)90079-X Lee T. W., 1998, INDEPENDENT COMPONEN LEE TW, 1998, P IEEE INT C AC SPEE, V2, P1249 MacKay D. J. C., 1996, MAXIMUM LIKELIHOOD C MOREAU E, 1994, P EUSIPCO 94 ED SCOT, P1157 Murata N, 2001, NEUROCOMPUTING, V41, P1, DOI 10.1016/S0925-2312(00)00345-3 Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE Papoulis A, 1991, PROBABILITY RANDOM V, V3rd Parra L, 2000, J VLSI SIG PROC SYST, V26, P39, DOI 10.1023/A:1008187132177 PARRA L, 2001, ADV NEURAL INFORMATI, V13 Parra L, 2000, IEEE T SPEECH AUDI P, V8, P320, DOI 10.1109/89.841214 PHAM DT, 1992, SIGNAL PROCESS, V6, P771 PLATT JC, 1992, ADV NEUR IN, V4, P730 ROBBINS H, 1951, ANN MATH STAT, V22, P400, DOI 10.1214/aoms/1177729586 Sahlin H, 1998, SIGNAL PROCESS, V64, P103, DOI 10.1016/S0165-1684(97)00180-1 Smaragdis P, 1998, NEUROCOMPUTING, V22, P21, DOI 10.1016/S0925-2312(98)00047-2 Sompolinsky H., 1995, Neural Networks: The Statistical Mechanics Perspective. Proceedings of the CTP-PBSRI. Joint Workshop on Theoretical Physics Tong L., 1991, IEEE Transactions on Circuits and Systems, V38, DOI 10.1109/31.76486 TORKKOLA K, 1998, P 1998 IEEE DIG SIGN TORKKOLA K, 1996, P IEEE INT C AC SPEE, P3509 van der Kouwe AJW, 2001, IEEE T SPEECH AUDI P, V9, P189, DOI 10.1109/89.905993 Wittkop T, 2001, THESIS U OLDENBURG Yang HH, 1997, NEURAL COMPUT, V9, P1457, DOI 10.1162/neco.1997.9.7.1457 YEREDOR A, 2001, P ICA 2001 SAN DIEG ZELINSKI R, 1977, IEEE T ACOUST SPEECH, V25, P299, DOI 10.1109/TASSP.1977.1162974 NR 40 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2003 VL 39 IS 1-2 BP 79 EP 95 DI 10.1016/S0167-6393(02)00060-2 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100007 ER PT J AU Campbell, DR Shields, PW AF Campbell, DR Shields, PW TI Speech enhancement using sub-band adaptive Griffiths-Jim signal processing SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; sub-band processing; adaptive noise cancellation ID HEARING-AIDS; SOUND FIELDS; NOISE; SYSTEM; INTELLIGIBILITY; REVERBERATION; IMPROVEMENTS; SCHEME AB Results are presented from intelligibility tests of a two-microphone sub-band adaptive Griffiths-Jim (SBAGJ) processing scheme that has possible application to future hearing aids as a method of improving speech intelligibility and quality in a noisy reverberant environment. 
This SBAGJ scheme combines sub-band processing with a Griffiths-Jim "front-end" that delivers a simple system more robust to the "causality" issue inherent in some multi-microphone adaptive noise cancelling configurations. Intelligibility testing is described and results are presented from an assessment of the SBAGJ scheme using ten normal-hearing listeners and signals from a real reverberant environment. The results support the hypothesis that speech processing using the SBAGJ scheme can provide a statistically and practically significant improvement in speech intelligibility. Analysis of mean opinion score data shows a corresponding statistically significant improvement in subjective quality. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Paisley, Sch Informat & Commun Technol, Paisley PA1 2BE, Renfrew, Scotland. RP Campbell, DR (reprint author), Univ Paisley, Sch Informat & Commun Technol, High St, Paisley PA1 2BE, Renfrew, Scotland. CR AGAIBY H, 1999, THESIS U PAISLEY SCO Agaiby H., 1997, ESCA EUROSPEECH 97, P1119 ALLEN JB, 1977, J ACOUST SOC AM, V62, P912, DOI 10.1121/1.381621 Durlach N. I., 1987, RESNA '87: Meeting the Challenge. Proceedings of the 10th Annual Conference on Rehabilitation Technology Duttweiler DL, 2001, IEEE T SIGNAL PROCES, V49, P593, DOI 10.1109/78.905886 FERRARA ER, 1981, IEEE T ACOUST SPEECH, V29, P766, DOI 10.1109/TASSP.1981.1163589 FOSTER J R, 1987, British Journal of Audiology, V21, P165, DOI 10.3109/03005368709076402 FOSTER JR, 1979, P I AC, P9 GREENBERG JE, 1992, J ACOUST SOC AM, V91, P1662, DOI 10.1121/1.402446 GREENWOOD DD, 1990, J ACOUST SOC AM, V87, P2592, DOI 10.1121/1.399052 GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 GUY RW, 1994, NOISE CONTROL ENG, V42, P8, DOI 10.3397/1.2827855 Haykin S., 1996, ADAPTIVE FILTER THEO Kollmeier B, 1993, Scand Audiol Suppl, V38, P28 Kompis M, 2001, J ACOUST SOC AM, V109, P1134, DOI 10.1121/1.1338558 LEHNERT H, 1992, APPL ACOUST, V36, P259, DOI 10.1016/0003-682X(92)90049-X LINDEVALD IM, 1986, J ACOUST SOC AM, V80, P661, DOI 10.1121/1.394061 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 MAHALANOBIS A, 1993, IEEE T CIRCUITS-II, V40, P375, DOI 10.1109/82.277882 PLOMP R, 1994, EAR HEARING, V15, P2 Shields PW, 1998, SPEECH COMMUN, V25, P165, DOI 10.1016/S0167-6393(98)00035-1 Shields PW, 2001, J ACOUST SOC AM, V110, P3232, DOI 10.1121/1.1413750 SHIELDS PW, 1999, EUR 99 BUD HUNG 5 10, V6, P2559 STRUBE HW, 1981, SIGNAL PROCESS, V3, P355 TONER E, 1993, SPEECH COMMUN, V12, P253, DOI 10.1016/0167-6393(93)90096-4 Van Compernolle D., 1989, P EUR 89 PAR FRANC, P657 VANHOESEL RJM, 1995, J ACOUST SOC AM, V97, P2498, DOI 10.1121/1.411970 Weiss M, 1987, J Rehabil Res Dev, V24, P93 Welker DP, 1997, IEEE T SPEECH AUDI P, V5, P543, DOI 10.1109/89.641299 Widrow B, 1985, ADAPTIVE SIGNAL PROC NR 30 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
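A single-band sketch of the two-microphone Griffiths-Jim front end used by the SBAGJ scheme in the Campbell and Shields abstract above; the paper runs such processing per sub-band, and the explicit delay in the primary path is one common way to ease the causality issue the abstract mentions. Filter length, step size and delay are illustrative assumptions.

```python
import numpy as np

def griffiths_jim(x1, x2, taps=32, mu=0.01, delay=16):
    target = 0.5 * (x1 + x2)          # fixed beamformer: passes broadside target
    noise_ref = x1 - x2               # blocking path: cancels broadside target
    w = np.zeros(taps)
    out = np.zeros_like(target)
    for n in range(taps, len(target)):
        u = noise_ref[n - taps:n][::-1]        # reference tap vector
        e = target[n - delay] - w @ u          # causality margin via the delay
        w += mu * e * u / (u @ u + 1e-8)       # normalised LMS update
        out[n] = e                             # enhanced output
    return out
```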
PD JAN PY 2003 VL 39 IS 1-2 BP 97 EP 110 DI 10.1016/S0167-6393(02)00061-4 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100008 ER PT J AU Wittkop, T Hohmann, V AF Wittkop, T Hohmann, V TI Strategy-selective noise reduction for binaural digital hearing aids SO SPEECH COMMUNICATION LA English DT Article DE digital hearing aids; noise reduction; binaural signal processing; sub-band processing ID SPEECH ENHANCEMENT; SUPPRESSION AB In this paper, different binaural signal processing strategies for noise reduction are derived and assessed which are based on particular assumptions about the (spatial) properties of the target signal and the undesired interfering signals. The processing strategies are evaluated with respect to their technical performance using artificial signals. They are shown to function if the underlying assumptions are met. The processing strategies are combined within a single, strategy-selective algorithm which automatically selects appropriate processing strategies depending on the acoustical situation. For this, a measure of the general diffusiveness (coherence) of the sound field is employed to classify the situation and to switch off particular processing strategies if necessary. Time constants were optimized with respect to the sound quality by subjective preference measurements. The algorithm was then assessed by measurements with eight hearing-impaired subjects who exhibit two different types of hearing loss (high frequency hearing loss and flat hearing loss). The subjective preference as well as speech reception thresholds (SRTs) in noise are measured under realistic free-field conditions in a laboratory environment. No significant improvement of the SRT was found on average. However, the results also suggest that there might be an improvement of speech intelligibility for subjects with a flat hearing loss in the free-field (dichotic) listening situation with interfering speech signals or diffuse cafeteria noise. Furthermore, the results of the subjective assessment exhibit a higher quality for the processed signal than for the unprocessed signal, especially in the diffuse cafeteria noise situation. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany. RP Wittkop, T (reprint author), Carl von Ossietzky Univ Oldenburg, D-26111 Oldenburg, Germany.
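The situation classifier in the Wittkop and Hohmann abstract above relies on a measure of the diffuseness of the sound field; a standard sketch of such a measure is the recursively smoothed magnitude-squared coherence (MSC) between the left and right ear signals, computed per sub-band. Low coherence indicates a diffuse (for example cafeteria) field. The smoothing constant is an illustrative assumption.

```python
import numpy as np

def msc(left_spec, right_spec, alpha=0.9):
    """left_spec/right_spec: complex STFT frames (T x F). Returns MSC per frame/band."""
    p_ll = p_rr = p_lr = 1e-12
    out = np.empty(left_spec.shape)
    for t, (l, r) in enumerate(zip(left_spec, right_spec)):
        p_ll = alpha * p_ll + (1 - alpha) * np.abs(l) ** 2     # auto-spectra
        p_rr = alpha * p_rr + (1 - alpha) * np.abs(r) ** 2
        p_lr = alpha * p_lr + (1 - alpha) * l * np.conj(r)     # cross-spectrum
        out[t] = np.abs(p_lr) ** 2 / (p_ll * p_rr)             # MSC in [0, 1]
    return out
```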
EM tw@medi.physik.uni-oldenburg.de CR ALBANI S, 1998, Z AUDIOLOGIE S, V1, P100 ALLEN JB, 1977, J ACOUST SOC AM, V62, P912, DOI 10.1121/1.381621 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Bortz J., 1990, VERTEILUNGSFREIE MET Desloge JG, 1997, IEEE T SPEECH AUDI P, V5, P529, DOI 10.1109/89.641298 DORBECKER M, 1998, 5 ITG FACHT SPRACHK, P53 GAIK W, 1986, FORTSCHRITTE AKUSTIK, P721 GREENBERG JE, 1992, J ACOUST SOC AM, V91, P1662, DOI 10.1121/1.402446 HOHMANN V, 1993, DYNAMIKKOMPRESSION 1 Hohmann V., 1995, Audiologische Akustik, V34 Kates JM, 1996, J ACOUST SOC AM, V99, P3138, DOI 10.1121/1.414798 Kendall MG, 1975, RANK CORRELATION MET Kollmeier B, 1993, Scand Audiol Suppl, V38, P28 Kompis M, 2001, J ACOUST SOC AM, V109, P1123, DOI 10.1121/1.1338557 KOMPIS M, 1994, J ACOUST SOC AM, V96, P1910, DOI 10.1121/1.410204 LAUNER S, 1996, AUDIOL ACOUST, V35, P156 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 Margo Vena, 1997, Seminars in Hearing, V18, P405, DOI 10.1055/s-0028-1083040 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 NIX J, 1999, PSYCHOPHYSICS PHYSL, P263 PASTOORS AD, 1998, Z AUDIOLOGIE S, V1, P103 PAVLOVIC CV, 1987, J ACOUST SOC AM, V82, P413, DOI 10.1121/1.395442 PEISSIG J, 1993, BINAURALE HORGERA 17 SOEDE W, 1993, J ACOUST SOC AM, V94, P785, DOI 10.1121/1.408180 Wagener K, 1999, Z AUDIOL, V38, P44 Wagener K, 1999, Z AUDIOL, V38, P4 Wagener K., 1999, Z AUDIOL, V38, P86 WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 WELKER DP, 1997, IEEE T SPEECH AUDIO, V5, P529 Wittkop T, 1997, ACUSTICA, V83, P684 Wittkop T, 2001, THESIS U OLDENBURG NR 31 TC 28 Z9 29 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2003 VL 39 IS 1-2 BP 111 EP 138 DI 10.1016/S0167-6393(02)00062-6 PG 28 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100009 ER PT J AU Widrow, B Luo, FL AF Widrow, B Luo, FL TI Microphone arrays for hearing aids: An overview SO SPEECH COMMUNICATION LA English DT Article ID SYSTEM AB Using the difference in the spatial domain (direction or location) between a target signal and noise, a system with a microphone array can achieve the goal of noise reduction and speech enhancement in various environments, especially in speech-like noisy environments. This paper deals with various issues related to the use of microphone arrays in hearing aids. It includes overall principles, algorithms, real-time implementation, configuration, processing mode, geometry, combination with binaural processing, directivity pattern, frequency responses, and compensation for mismatch and misplacement of microphones. A practical microphone array device is presented that uses microphone-array-based processing techniques to help hearing aids deliver an improved signal-to-noise ratio, reduced effects of reverberation, and reduced feedback, given appropriate configuration and connections. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Stanford Univ, Stanford, CA 94305 USA. Quicksilver Technol Inc, San Jose, CA 95119 USA. RP Widrow, B (reprint author), Stanford Univ, Packard Elect Engn Bldg,Room 273,350 Serra Mall, Stanford, CA 94305 USA.
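The simplest array principle surveyed in the Widrow and Luo overview above is delay-and-sum beamforming: each microphone signal is time-aligned for the target direction and averaged, so the target adds coherently while off-axis interference does not. The linear geometry, sample rate and integer-sample steering below are illustrative simplifications.

```python
import numpy as np

def delay_and_sum(mics, positions, angle_deg, fs=16000, c=343.0):
    """mics: (M, N) signals; positions: (M,) mic coordinates in metres along a line."""
    angle = np.deg2rad(angle_deg)
    delays = positions * np.cos(angle) / c            # relative time delays
    out = np.zeros(mics.shape[1])
    for sig, d in zip(mics, delays):
        shift = int(round(d * fs))
        out += np.roll(sig, -shift)                   # integer-sample steering (wraps)
    return out / len(mics)

sig = np.random.randn(16000)
positions = np.array([0.0, 0.01, 0.02, 0.03])         # four mics, 1 cm spacing
mics = np.stack([sig] * 4)                            # target already aligned
y = delay_and_sum(mics, positions, angle_deg=90)      # broadside: zero delays
```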
CR BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Edwards B, 1998, HEARING J, V51, P44 GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 KATES JM, 1997, IEEE SIGNAL PROCESS, V14, P41 LEHR MA, 1998, Patent No. 5793875 LUO FL, 2000, Patent No. 09593728 SOEDE W, 1993, J ACOUST SOC AM, V94, P785, DOI 10.1121/1.408180 Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 Welker DP, 1997, IEEE T SPEECH AUDI P, V5, P543, DOI 10.1109/89.641299 Widrow B, 1985, ADAPTIVE SIGNAL PROC NR 10 TC 15 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2003 VL 39 IS 1-2 BP 139 EP 146 DI 10.1016/S0167-6393(02)00063-8 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100010 ER PT J AU Chi, HF Gao, SX Soli, SD Alwan, A AF Chi, HF Gao, SX Soli, SD Alwan, A TI Band-limited feedback cancellation with a modified filtered-X LMS algorithm for hearing aids SO SPEECH COMMUNICATION LA English DT Article DE hearing aids; acoustic feedback cancellation; adaptive filtering; oscillation frequency; filtered-X LMS; convergence behavior ID ACOUSTIC FEEDBACK AB Commonly used wideband adaptive feedback cancellation techniques do not provide satisfactory performance for reducing feedback oscillation in hearing aids. In this paper, a band-limited adaptive feedback cancellation algorithm using normalized filtered-X LMS techniques is proposed that provides good cancellation efficiency, convergence behavior and better output sound quality for speech signals, when compared to the wideband approach. Convergence analysis and computer simulations illustrate the advantages of the proposed approach. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Virata Corp, Santa Clara, CA 95051 USA. House Ear Res Inst, Los Angeles, CA 90057 USA. Univ Calif Los Angeles, Dept Elect Engn, Los Angeles, CA 90095 USA. RP Chi, HF (reprint author), Virata Corp, 2700 San Tomas Expressway, Santa Clara, CA 95051 USA. EM hsiangfengchi@yahoo.com CR AGNEW J, 1993, HEARING J, V46, P37 BUSTAMANTE DK, 1989, P INT C AC SPEECH SI, P2017 CHI HF, 1999, P IEEE ISCAS 99, P187 CHI HF, 1999, THESIS U CALIFORNIA COX RM, 1982, EAR HEARING, V3, P12, DOI 10.1097/00003446-198201000-00003 DILLON H, 1991, EAR HEARING, V12, P406, DOI 10.1097/00003446-199112000-00005 DYRLUND O, 1991, SCAND AUDIOL, V20, P49, DOI 10.3109/01050399109070790 DYRLUND O, 1989, SCAND AUDIOL, V18, P143, DOI 10.3109/01050398909070737 EGOLF DP, 1982, MONOGRAPHS CONT AUDI, P94 ELLIOTT SJ, 1987, IEEE T ACOUST SPEECH, V35, P1423, DOI 10.1109/TASSP.1987.1165044 ENGEBRETSON AM, 1990, P 12 ANN C IEEE EN 5, P2286 Graupe D., 1988, US Patent, Patent No. 
4783818 Haykin S., 1993, ADAPTIVE FILTER THEO HELLGREN J, 2000, THESIS LINKOPINGS U HELLGREN J, 2000, P IEEE ICASSP 2000, P869 JOSON HAL, 1993, J ACOUST SOC AM, V94, P3248, DOI 10.1121/1.407231 Kates JM, 2001, J ACOUST SOC AM, V109, P367, DOI 10.1121/1.1332379 KATES JM, 1991, IEEE T SIGNAL PROCES, V39, P553, DOI 10.1109/78.80875 LONG G, 1989, IEEE T ACOUST SPEECH, V37, P1397, DOI 10.1109/29.31293 LONG G, 1992, IEEE T SIGNAL PROCES, V40, P230, DOI 10.1109/78.157202 MAXWELL JA, 1995, IEEE T SPEECH AUDI P, V3, P304, DOI 10.1109/89.397095 MORGAN DR, 1980, IEEE T ACOUST SPEECH, V28, P454, DOI 10.1109/TASSP.1980.1163430 NIELSEN JL, 1995, P 15 INT C AC, P597 Nyquist H, 1932, BELL SYST TECH J, V11, P126 SIQUEIRA M, 1999, P 1999 IEEE INT C AC, P925 SIQUEIRA M, 1997, P IEEE WORKSH AUD EL SNYDER SD, 1994, IEEE T SIGNAL PROCES, V42, P950, DOI 10.1109/78.285659 SVENSSON PU, 1995, J AUDIO ENG SOC, V43, P667 Widrow B, 1985, ADAPTIVE SIGNAL PROC NR 29 TC 18 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2003 VL 39 IS 1-2 BP 147 EP 161 DI 10.1016/S0167-6393(02)00064-X PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100011 ER PT J AU Rafaely, B Shusina, NA Hayes, JL AF Rafaely, B Shusina, NA Hayes, JL TI Robust compensation with adaptive feedback cancellation in hearing aids SO SPEECH COMMUNICATION LA English DT Article DE feedback cancellation; system identification; hearing aids; acoustic modeling; adaptive filtering; robustness; robust stability ID ACOUSTIC FEEDBACK; ADAPTATION; ALGORITHMS AB Howling caused by feedback in hearing aids reduces comfort and limits the hearing aid amplification. Recent studies have investigated adaptive feedback cancellation systems which track the response of the feedback path and attempt to cancel the feedback signal, thereby reducing the howling. Since in practice the feedback signal is never cancelled completely, the closed-loop system can still become unstable. This paper presents a formulation of a robust hearing aid system with guaranteed closed-loop stability which is free from howling. The system incorporates an adaptive feedback cancellation filter integrated with a robust compensation filter. The limit on the tracking error of the feedback cancellation filter is first defined, from which a limit on the gain of the compensation filter which guarantees robust stability is derived. Then, an adaptive feedback cancellation algorithm with improved robustness is proposed. A simulation example of adaptive feedback cancellation in a hearing aid with a time-varying acoustic leak is then presented, showing the benefit of the robust-adaptive algorithm. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Southampton, Inst Sound & Vibrat Res, Southampton SO17 1BJ, Hants, England. RP Rafaely, B (reprint author), Univ Southampton, Inst Sound & Vibrat Res, Southampton SO17 1BJ, Hants, England. EM br@isvr.soton.ac.uk RI RAFAELY, BOAZ/F-2016-2012 CR CHI HF, 1999, P IEEE INT S CIRC SY, V3, P195 Doyle J.
C., 1992, FEEDBACK CONTROL THE EGOLF DP, 1989, J ACOUST SOC AM, V85, P454, DOI 10.1121/1.397697 EGOLF DP, 1977, J ACOUST SOC AM, V61, P200, DOI 10.1121/1.381256 Greenburg JE, 2000, J ACOUST SOC AM, V108, P2366, DOI 10.1121/1.1316095 Hellgren J, 2001, IEEE T SPEECH AUDI P, V9, P906, DOI 10.1109/89.966094 Hellgren J, 1999, J ACOUST SOC AM, V106, P2821, DOI 10.1121/1.428107 Kates JM, 1999, J ACOUST SOC AM, V106, P1010, DOI 10.1121/1.427112 Killion M. C., 1988, HEARING INSTRUMENTS, V39, P14 MAXWELL JA, 1995, IEEE T SPEECH AUDI P, V3, P304, DOI 10.1109/89.397095 Oliveira R J, 1997, J Am Acad Audiol, V8, P401 Rafaely B, 2000, IEEE T SPEECH AUDI P, V8, P754, DOI 10.1109/89.876316 Rafaely B, 2000, J ACOUST SOC AM, V107, P2665, DOI 10.1121/1.428652 Shaw BAG, 1974, HDB SENSORY PHYSL, VV/1, P455 Siqueira MG, 2000, IEEE T SPEECH AUDI P, V8, P443, DOI 10.1109/89.848225 WESTERMANN S, 1987, HEAR INSTRUM, V38, P43 Widrow B, 1985, ADAPTIVE SIGNAL PROC NR 17 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2003 VL 39 IS 1-2 BP 163 EP 170 DI 10.1016/S0167-6393(02)00065-1 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 628PN UT WOS:000180005100012 ER PT J AU Reinhard, K Niranjan, M AF Reinhard, K Niranjan, M TI Diphone subspace mixture trajectory models for HMM complementation SO SPEECH COMMUNICATION LA English DT Article DE speech dynamics; diphones; subspace trajectory; time-constraint; PCA; N-best rescoring ID HIDDEN MARKOV-MODELS; SPEECH RECOGNITION AB This paper describes an extension of the previously reported attempt to capture segmental transition information for speech recognition tasks [Speech Communication 27 (1) (1999) 19]. Representations in the subspace with multiple projected trajectories are discussed, employing EM-based methods to find optimal anchor points. Experimental work is carried out to illustrate that useful discriminant information is preserved in the subspace trajectories. These experiments include the development of "matched filters" to spot particular diphones in continuous speech, and the inclusion of diphone-based discriminant information into a phone-based HMM recognition framework to rerank multiple hypotheses. The difficulties in constructing the models, caused by the limited number of tokens per diphone in the phone-balanced TIMIT database, are discussed. The influence of the restricted diphone coverage on the rescoring results is reported. Improvements in phone recognition accuracy have been obtained on a speaker-by-speaker basis. The improvements obtained over baseline HMMs augmented with first-order derivatives suggest the importance of explicitly modelled between-phone information. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Reinhard, K (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England.
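The subspace trajectory representation in the Reinhard and Niranjan abstract above can be sketched as follows: the frames of a diphone segment are projected onto a low-dimensional PCA subspace and resampled to a fixed number of anchor points, giving a trajectory that can be matched against diphone templates. Dimensionality and anchor count are illustrative assumptions; the paper fits anchor points with EM rather than fixed linear resampling.

```python
import numpy as np

def subspace_trajectory(frames, n_dims=2, n_anchors=10):
    """frames: (T, D) acoustic feature vectors for one diphone segment."""
    centred = frames - frames.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    traj = centred @ vt[:n_dims].T                    # project onto PCA subspace
    # Linearly resample the trajectory to fixed-length anchor points.
    t_old = np.linspace(0.0, 1.0, len(traj))
    t_new = np.linspace(0.0, 1.0, n_anchors)
    return np.column_stack([np.interp(t_new, t_old, traj[:, k])
                            for k in range(n_dims)])
```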
EM klaus.reinhard@eed.ericsson.se; m.niranjan@dcs.shef.ac.uk CR ATAL B, 1983, INT C AC SPEECH SIGN, V1, P81 BLACKBURN C, 1995, EUR C SPEECH COMMUN, V3, P1623 BLACKBURN C, 1996, THESIS CAMBRIDGE U BOURLAND H, 1995, EUR C SPEECH COMMUN, V2, P883 BRIDLE J, 1998, WORKSH LANG ENG J HO Deng L, 1997, IEEE T SPEECH AUDI P, V5, P319 Deng L, 1998, SPEECH COMMUN, V24, P299, DOI 10.1016/S0167-6393(98)00023-5 Deng L, 1994, IEEE T SPEECH AUDI P, V2, P507 FUKADA T, 1997, P ICASSP, V2, P1403 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 GALES M, 1993, EUR C SPEECH COMMUN, V3, P1579 Garofolo J., 1988, GETTING STARTED DARP Ghitza O., 1993, COMPUTER SPEECH LANG, V2, P101 Gillick L., 1989, ICASSP 1989 GLASG UK, V1, P532 GISH H, 1993, IEEE P INT C AC SPEE, V2, P447, DOI 10.1109/ICASSP.1993.319337 Gish H, 1996, P ICSLP, P466, DOI 10.1109/ICSLP.1996.607155 GOLDENTHAL W, 1994, THESIS MIT HERMANSKY H, 1999, INT C ACOUST SPEECH, V1, P289 HOLMES W, 1997, INT C AC SPEECH SIGN, V2, P1399 HOLMES W, 1995, EUR C SPEECH COMMUN, V3, P1611 HON HW, 2000, INT C ACOUST SPEECH, V2, P1017 HOSOM JP, 1997, INT C ACOUST SPEECH, V4, P3369 KANNAN A, 1992, P DARPA SPEECH NAT L, P455, DOI 10.3115/1075527.1075638 Lamel L. F., 1986, P DARPA SPEECH REC W, P100 LLOYD SP, 1982, IEEE T INFORM THEORY, V28, P129, DOI 10.1109/TIT.1982.1056489 MARCUS S, 1984, TEMPORAL DECOMPOSITI, P25 MARTEAU P, 1988, INT C ACOUSTICS SPEE, V1, P615 MORGAN N, 1994, INT C SPOKEN LANGUAG, V4, P1943 MORGAN N, 1995, INT C ACOUST SPEECH, V1, P397 NIRANJAN M, 1987, EUR C SPEECH COMM TE, V1, P71 O'Shaughnessy D, 2000, SPEECH COMMUNICATION, V2nd Oja E., 1983, SUBSPACE METHODS PAT Ostendorf M, 1996, IEEE T SPEECH AUDI P, V4, P360, DOI 10.1109/89.536930 Ostendorf M., 1991, P DARPA WORKSH SPEEC, P83, DOI 10.3115/112405.112416 OSTENDORF M, 1989, IEEE T ACOUST SPEECH, V37, P1857, DOI 10.1109/29.45533 PICONE J, 1999, INT C AC SPEECH SIGN, V1, P109 Reinecke D, 1999, PLANT GROWTH REGUL, V27, P1, DOI 10.1023/A:1006165905968 RICHARDS H, 1999, INT C ACOUST SPEECH, V1, P357 Russell M., 1993, IEEE INT C AC SPEECH, V2, P499, DOI 10.1109/ICASSP.1993.319351 SAUL L, 1998, ADV NEURAL INF PROCE, V11, P751 SCHWAB M, 1990, GENE CHROMOSOME CANC, V1, P81 SCHWARTZ R, 1992, INT C ACOUSTICS SPEE, V1, P1 VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 Young S.J., 1994, ARPA WORKSH HUM LANG, P307 NR 44 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 237 EP 265 DI 10.1016/S0167-6393(01)00054-1 PG 29 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200001 ER PT J AU de la Torre, A Peinado, AM Rubio, AJ Segura, JC Benitez, C AF de la Torre, A Peinado, AM Rubio, AJ Segura, JC Benitez, C TI Discriminative feature weighting for HMM-based continuous speech recognizers SO SPEECH COMMUNICATION LA English DT Article DE continuous speech recognition; discriminative feature extraction; error-rate; cost function; probability density function; minimum classification error; hidden Markov model; discriminative feature weighting; discriminative weighting by transformation; partial probability weighting ID PATTERN-RECOGNITION; WORD RECOGNITION; CLASSIFICATION AB The Discriminative Feature Extraction (DFE) method provides an appropriate formalism for the design of the front-end feature extraction module in pattern classification systems. 
In recent years, this formalism has been successfully applied to different speech recognition problems, such as classification of vowels, classification of phonemes or isolated word recognition. The DFE formalism can be applied to weight the contribution of the components in the feature vector. This variant of DFE, which we call Discriminative Feature Weighting (DFW), improves pattern classification systems by enhancing those components most relevant for discrimination among the different classes. This paper is dedicated to the application of the DFW formalism to Continuous Speech Recognizers (CSR) based on Hidden Markov Models (HMMs). Two different types of HMM-based speech recognizers are considered: recognizers based on Discrete-HMMs (DHMMs) (for which the acoustic evaluation is based on a Euclidean distance measure) and Semi-Continuous-HMMs (SCHMMs) (for which the acoustic evaluation is performed using a mixture of multivariate Gaussians). We report how the components can be weighted and how the weights can be discriminatively trained and applied to the speech recognizers. We present recognition results for several continuous speech recognition tasks. The experimental results show the utility of DFW for HMM-based continuous speech recognizers. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Granada, Fac Ciencias, Dept Elect & Tecnol Comp, E-18071 Granada, Spain. RP de la Torre, A (reprint author), Univ Granada, Fac Ciencias, Dept Elect & Tecnol Comp, E-18071 Granada, Spain. EM atv@ugr.es RI de la Torre, Angel/C-6618-2012; Benitez Ortuzar, M Del Carmen/C-2424-2012; Peinado, Antonio/C-2401-2012; Segura, Jose/B-7008-2008; Prieto, Ignacio/B-5361-2013 OI Segura, Jose/0000-0003-3746-0978; CR BACCHIANI M, 1994, P INT C AC SPEECH SI, V2, P275 Bahl L. R., 1986, P IEEE INT C AC SPEE, P49 BIEM A, 1997, P IEEE INT C AC SPEE, V2, P1503 BIEM A, 1994, P INT C SIGN SPEECH, V1, P485 Biem A., 1993, P IEEE INT C AC SPEE, P275 BIEM A, 1995, P EUROSPEECH 95 MADR, V1, P545 Biem A, 1997, IEEE T SIGNAL PROCES, V45, P500, DOI 10.1109/78.554319 CASACUBERTA F, 1991, P WORKSH INT COOP ST, P26 CHOU W, 1992, P INT C AC SPEECH SI, V1, P473 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DELATORRE A, 1996, P EUSIPCO 96, V3, P1575 DELATORRE A, 1997, P EUROSPEECH 97, V1, P291 DELATORRE A, 1996, SPEECH COMMUN, V20, P243 DIAZVERDEJO JE, 1998, P 1 INT C LANG RES E, V1, P497 Duda R. O., 1973, PATTERN CLASSIFICATI Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 Huang X., 1989, P IEEE INT C AC SPEE, P639 Huang X.D., 1990, P ICASSP, P689 JIN Q, 2000, P ICSLP 00, V2, P250 JUANG BH, 1987, IEEE T ACOUST SPEECH, V35, P947 Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 JUNQUA JC, 1989, P INT C AC SPEECH SI, P476 Kumar N, 1998, SPEECH COMMUN, V26, P283, DOI 10.1016/S0167-6393(98)00061-2 Lee C.
H., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90002-N LEE KF, 1990, IEEE T ACOUST SPEECH, V38, P599, DOI 10.1109/29.52701 LLISTERRI J, 1993, 6819 ESPRIT PROJECT MCDERMOTT E, 1994, COMPUT SPEECH LANG, V8, P351, DOI 10.1006/csla.1994.1018 MILNER B, 1997, P EUR 97, V1, P405 MORENO PJ, 1997, P EUR 97, V5, P2599 PALIWAL KK, 1995, P EUR C SPEECH COMM, V1, P541 PEINADO AM, 1995, P EUROSPEECH 95 SEPT, V1, P533 PEINADO AM, 1996, IEEE T SPEECH AUDIO, V4, P88 PEINADO AM, 1990, P EUSIPCO 90, V2, P1243 Rabiner L, 1993, FUNDAMENTALS SPEECH RUBIO AJ, 1997, P EUR 97, V4, P1779 SEGURA JC, 1994, SPEECH COMMUN, V14, P163, DOI 10.1016/0167-6393(94)90006-X TOHKURA Y, 1987, IEEE T ACOUST SPEECH, V35, P1414, DOI 10.1109/TASSP.1987.1165058 WATANABE H, 1995, P ICASSP MAY, P3439 Watanabe H, 1997, IEEE T SIGNAL PROCES, V45, P2655, DOI 10.1109/78.650091 YOUNG S, 1997, HTK BOOK Young S, 1996, IEEE SIGNAL PROC MAG, V13, P45, DOI 10.1109/79.536824 NR 43 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 267 EP 286 DI 10.1016/S0167-6393(01)00068-1 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200002 ER PT J AU San-Segundo, R Colas, J de Cordoba, R Pardo, JM AF San-Segundo, R Colas, J de Cordoba, R Pardo, JM TI Spanish recognizer of continuously spelled names over the telephone SO SPEECH COMMUNICATION LA English DT Article DE continuously spelled names recognition; Spanish spelling task; recognition over the telephone ID SPEECH RECOGNITION; SYSTEM AB In this paper we present a hypothesis-verification approach for a Spanish recognizer of continuously spelled names over the telephone. We give a detailed description of the spelling task for Spanish, identifying the most confusable letter sets. We introduce a new HMM topology with contextual silences incorporated into the letter model to deal with pauses between letters, increasing the Letter Accuracy by 6.6 points compared with a single silence model approach. For the final configuration of the hypothesis step we obtain a Letter Accuracy of 88.1% and a Name Recognition Rate of 94.2% for a 1000-name dictionary. In this configuration, we also use noise models for reducing letter insertions, and a Letter Graph to incorporate N-gram language models and to calculate the N-best letter sequences. In the verification step, we consider the M-best candidates provided by the hypothesis step. We evaluate the whole system for different dictionaries, obtaining more than 90.0% Name Recognition Rate for a 10,000-name dictionary. Finally, we demonstrate the utility of incorporating a Spelled Name Recognizer in a Directory Assistance Service over the telephone, increasing the percentage of calls automatically serviced from 39.4% to 58.7%. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Politecn Madrid, ETSI Telecomunicac, Grp Tecnol Habla, Dept Ingn Elect, E-28040 Madrid, Spain. RP San-Segundo, R (reprint author), Univ Politecn Madrid, ETSI Telecomunicac, Grp Tecnol Habla, Dept Ingn Elect, Ciudad Univ S-N, E-28040 Madrid, Spain.
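The hypothesis-verification structure in the San-Segundo et al. abstract above leaves room for different verification scorers; a minimal stand-in is to match the N-best letter sequences against the name dictionary by edit distance, as sketched below. This plain Levenshtein scoring is an illustrative assumption, not the paper's method, which re-scores the M-best dictionary candidates with the recognizer itself.

```python
def levenshtein(a, b):
    """Edit distance between two letter strings via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def verify(nbest_letters, dictionary, m_best=3):
    """Rank dictionary names by distance to the best letter hypotheses."""
    scored = [(min(levenshtein(h, name) for h in nbest_letters), name)
              for name in dictionary]
    return [name for _, name in sorted(scored)[:m_best]]

print(verify(["GACRIA", "GARCIA"], ["GARCIA", "GOMEZ", "GARZON"]))
```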
EM lapiz@die.upm.es; colas@die.upm.es; cordoba@die.upm.es; pardo@die.upm.es RI Pardo, Jose/H-3745-2013; Cordoba, Ricardo/B-5861-2008 OI Cordoba, Ricardo/0000-0002-7136-9636 CR BAUER JG, 1999, P EUROSPEECH, P263 BROWN PF, 1987, THESIS CARNEGIEMELLO COLE R, 1991, P ICASSP, P325, DOI 10.1109/ICASSP.1991.150342 COLE R, 1991, P EUROSPEECH, P479 COLE RA, 1986, VARIABILITY INVARIAN, P325 EULER SA, 1990, P ICASSP, P745 Fanty M, 1990, P NEUR INF PROC SYST, P220 FISSORE L, 1989, IEEE ACOUSTICS SPEEC, V17, P1197 HANEL S, 2000, P ICASSP, P1755 Hermansky H., 1991, P EUROSPEECH, P1367 HILD H, 1997, P EUROSPEECH, P346 HILD H, 1993, P EUROSPEECH, P1481 JOUVET D, 1993, P ICASSP, pII235 JOUVET D, 1993, P EUROSPEECH, P2081 Jouvet D., 1999, P EUROSPEECH 99 BUD, P283 JUNQUA JC, 1995, P ICASSP 95 DETR MIC, P852 JUNQUA JC, 1991, SPEECH COMMUN, V10, P33, DOI 10.1016/0167-6393(91)90026-P JUNQUA JC, 1997, IEEE T SPEECH AUDIO, V5 KASPAR B, 1995, P EUROSPEECH, P1161 Kuroiwa S, 1999, SPEECH COMMUN, V27, P135, DOI 10.1016/S0167-6393(98)00072-7 Lamel L, 2000, SPEECH COMMUN, V31, P339, DOI 10.1016/S0167-6393(99)00067-9 LEHTINEN G, 2000, IDAS INTERACTIVE DIR LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 LOIZOU P, 1996, IEEE T SPEECH AUDIO, V4 LOIZOU P, 1995, P ICSPAT, P2014 MITCHELL G, 1999, P ICASSP, P597 MORENO A, 1997, SPEECHDAT VER 1 0 NEY H, 1994, P INT C SPOK LANG PR, V3, P1355 Ney H, 1999, IEEE SIGNAL PROC MAG, V16, P64, DOI 10.1109/79.790984 Ravishankar M. K., 1996, THESIS CARNEGIE MELL ROGINSKI K, 1991, THESIS OREGON GRAD I Schramm H, 2000, SPEECH COMMUN, V31, P329, DOI 10.1016/S0167-6393(99)00066-7 THIELE F, 2000, P ICASSP, P1715 Weiss NA, 1993, INTRO STAT, P407 NR 34 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 287 EP 303 DI 10.1016/S0167-6393(01)00069-3 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200003 ER PT J AU Tammi, M Heikkinen, A Saarinen, J AF Tammi, M Heikkinen, A Saarinen, J TI On methods for perfect reconstruction WI speech coding with preprocessing SO SPEECH COMMUNICATION LA English DT Article DE speech coding; waveform interpolation; perfect reconstruction; preprocessing ID SIGNAL AB The waveform interpolation (WI) speech coding algorithm has been shown to be an efficient method to describe the evolution of periodic voiced components in the speech signal. However, conventional WI coding does not provide the perfect reconstruction property, i.e. the decoded signal does not converge to the original signal with decreasing quantization error. Therefore errors in the coding model cannot be fixed by quantization. In this paper we discuss the characteristics of the WI coding model and modifications to the model which enable the perfect reconstruction property. The new requirements and features are examined and discussed in detail. While the perfect reconstruction property brings many benefits, it also places new demands on the operation of the coder. Particularly high requirements are set on the accuracy of the pitch estimate; inaccuracies rapidly undermine the possibility of quantizing the parameters efficiently. To overcome this, we introduce a preprocessing method which slightly modifies the pitch structure of the residual signal before waveform extraction.
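The pitch-structure preprocessing just described in the Tammi et al. abstract lends itself to a compact illustration. The sketch below is only a schematic reading of the idea, not the authors' implementation: it assumes pitch marks (cycle_starts) and a smoothed period contour (smoothed_periods) are already available, and simply retimes each residual cycle by linear interpolation; all names are invented for illustration.

```python
import numpy as np

def retime_cycle(cycle, target_len):
    # Linearly resample one pitch cycle of the residual to target_len samples.
    src = np.linspace(0.0, 1.0, num=len(cycle))
    dst = np.linspace(0.0, 1.0, num=target_len)
    return np.interp(dst, src, cycle)

def preprocess_residual(residual, cycle_starts, smoothed_periods):
    # Nudge each cycle length toward the smoothed pitch contour so that
    # waveform extraction sees a cleaner, more regular pitch structure.
    out = []
    for a, b, T in zip(cycle_starts[:-1], cycle_starts[1:], smoothed_periods):
        out.append(retime_cycle(residual[a:b], int(round(T))))
    return np.concatenate(out)
```

Because the per-cycle length changes are small, the preprocessed signal stays perceptually close to the input, which is the property the abstract relies on.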
The modifications to the signal are minor and therefore the quality of the preprocessed signal is very close to that of the input speech. In the proposed method, the perfect reconstruction property is maintained in relation to the preprocessed signal. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Nokia Res Ctr, Speech & Audio Syst Lab, FIN-33721 Tampere, Finland. Tampere Univ Technol, Digital & Comp Syst Lab, FIN-33101 Tampere, Finland. RP Heikkinen, A (reprint author), Nokia Res Ctr, Speech & Audio Syst Lab, POB 100, FIN-33721 Tampere, Finland. EM ari.p.heikkinen@nokia.com CR *CCITT, 1989, BLUE BOOK, V5 CHONG NR, 1999, P IEEE WORKSH SPEECH, P96 Choy E.L.T., 1998, THESIS MCGILL U MONT ERIKSSON T, 1999, P IEEE WORKSH SPEECH, P93 HESS W, 1992, PITCH DETERMINATION *ITU T REC G 723 1, 1996, DUAL RAT SPEECH COD *ITU T REC G 729, 1995, COD SPEECH 8 KBITS U Jayant N. S., 1984, DIGITAL CODING WAVEF KLEIJN WB, 1995, SPEECH CODING SYNTHE, P175 KLEIJN WB, 1994, EUR T TELECOMMUN, V5, P573 KLEIJN WB, 1998, P INT C SPEECH LANG KLEIN DF, 1994, ANXIETY, V1, P1 McAulay R., 1995, SPEECH CODING SYNTHE, P121 RABINER LR, 1977, IEEE T ACOUST SPEECH, V25, P24, DOI 10.1109/TASSP.1977.1162905 RUOPPILA VT, 2000, P ICASSP, P1359 Schroeder M., 1985, P IEEE INT C AC SPEE, P937 TAMMI M, 2000, P 4 WORLD MULT C CIR, P4241 TAMMI M, 1999, P IEEE WORKSH SPEECH, P102 TAMMI M, 2000, P IEEE WORKSH SPEECH, P53 TRANCOSO IM, 1990, IEEE T ACOUST SPEECH, V38, P385, DOI 10.1109/29.106858 Unser M, 1999, IEEE SIGNAL PROC MAG, V16, P22, DOI 10.1109/79.799930 UNSER M, 1993, IEEE T SIGNAL PROCES, V41, P834, DOI 10.1109/78.193221 UNSER M, 1993, IEEE T SIGNAL PROCES, V41, P821, DOI 10.1109/78.193220 YANG H, 1998, P IEEE INT C SIGN PR, P591 NR 24 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 305 EP 320 DI 10.1016/S0167-6393(01)00071-1 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200004 ER PT J AU Alku, P Vintturi, J Vilkman, E AF Alku, P Vintturi, J Vilkman, E TI Measuring the effect of fundamental frequency raising as a strategy for increasing vocal intensity in soft, normal and loud phonation SO SPEECH COMMUNICATION LA English DT Article DE voice production; intensity regulation; sound pressure level; glottal source ID SOUND PRESSURE LEVEL; GLOTTAL AIR-FLOW; VOICE SOURCE; SUBGLOTTAL PRESSURE; WAVE-FORM; FEMALE; PHONETOGRAM; SINGERS; SPEECH; SPEAKERS AB A method is presented to estimate the effect of intentional raising of fundamental frequency (F0) on vocal intensity. The method, energy of the synthesised period (ESP), is based on computation of the energy of a hypothetical speech sound synthesised using a single period of the glottal volume velocity waveform and a digital filter that models the vocal tract. If the intensity of speech is regulated by modifying either the characteristics of glottal flow or the vocal tract, the change in the ESP-value should correspond to an equal change in the value of the sound pressure level (SPL). However, F0 can only change the value of SPL, but has no effect on ESP. Hence, by comparing the behaviours of SPL and ESP, it is possible to measure the way in which speakers use F0 raising as a strategy to increase vocal intensity.
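The SPL/ESP comparison of the Alku et al. abstract can be made concrete with a small numerical sketch. Under the stated definition, ESP depends only on a single glottal period and a vocal tract filter, so raising F0 (more periods per second) changes SPL but not ESP, and the gap between the two isolates the F0 contribution. The toy code below assumes a one-period glottal flow waveform and autoregressive vocal tract coefficients are given; it illustrates the definition, not the paper's measurement procedure.

```python
import numpy as np

def all_pole_filter(x, a):
    # y[n] = x[n] - sum_k a[k] * y[n-k]; a holds the vocal tract AR coefficients.
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = x[n]
        for k, ak in enumerate(a, start=1):
            if n - k >= 0:
                acc -= ak * y[n - k]
        y[n] = acc
    return y

def esp_db(glottal_period, ar_coeffs):
    # Energy (in dB) of one speech period synthesised from a single glottal
    # flow cycle; by construction this quantity is independent of F0.
    y = all_pole_filter(glottal_period, ar_coeffs)
    return 10.0 * np.log10(np.sum(y ** 2) + 1e-12)
```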
The results show that, in producing loud voice, speakers use F0 to increase the number of glottal closures per time unit, which increases rapid fluctuations in the speech pressure waveform, which, in turn, raises vocal intensity. The average increase of SPL due to this active use of F0 was approximately 4 dB in loud speech produced by both female and male speakers. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Aalto Univ, Lab Acoust & Audio Signal Proc, Helsinki 02015, Finland. Univ Helsinki, Cent Hosp, Dept Otolaryngol & Phoniatr, Helsinki, Finland. Oulu Univ, Dept Otolaryngol & Phoniatr, Helsinki, Finland. RP Alku, P (reprint author), Aalto Univ, Lab Acoust & Audio Signal Proc, POB 3000, Helsinki 02015, Finland. EM paavo.alku@hut.fi RI Alku, Paavo/E-2400-2012 CR AKERLUND L, 1992, J VOICE, V6, P55, DOI 10.1016/S0892-1997(05)80009-8 ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R Alku P., 1994, P INT C SPOK LANG PR, P1619 Alku P, 1999, SPEECH COMMUN, V28, P269, DOI 10.1016/S0167-6393(99)00020-5 BOUHUYS A, 1968, ANN NY ACAD SCI, V155, P165, DOI 10.1111/j.1749-6632.1968.tb56760.x CARLSSON G, 1992, J VOICE, V6, P256, DOI 10.1016/S0892-1997(05)80150-X COLEMAN RF, 1977, J SPEECH HEAR RES, V20, P197 COLEMAN RF, 1993, J VOICE, V7, P1, DOI 10.1016/S0892-1997(05)80107-9 DAMSTE PH, 1970, PRACT-OTO-RHINO-LARY, V32, P185 DROMEY C, 1992, J VOICE, V6, P44, DOI 10.1016/S0892-1997(05)80008-6 ELJAROUDI A, 1991, IEEE T SIGNAL PROCES, V39, P411, DOI 10.1109/78.80824 Fant G, 1982, SPEECH TRANSMISSION, V4, P1 FANT G, 1993, SPEECH COMMUN, V13, P7, DOI 10.1016/0167-6393(93)90055-P Fant G., 1970, ACOUSTIC THEORY SPEE, P15 GAUFFIN J, 1989, J SPEECH HEAR RES, V32, P556 GRAMMING P, 1988, J ACOUST SOC AM, V83, P2352, DOI 10.1121/1.396366 Gramming P., 1988, J VOICE, V2, P118, DOI 10.1016/S0892-1997(88)80067-5 HERTEGARD S, 1992, J VOICE, V6, P224, DOI 10.1016/S0892-1997(05)80147-X HERTEGARD S, 1990, J VOICE, V4, P220, DOI 10.1016/S0892-1997(05)80017-7 HERTEGARD S, 1992, 239218 ROYAL I TECHN HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829 ISSHIKI N, 1964, J SPEECH HEAR RES, V7, P17 ISSHIKI N, 1965, FOLIA PHONIATR, V17, P92 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 KOMIYAMA S, 1984, FOLIA PHONIATR, V36, P1 LADEFOGED P, 1963, J ACOUST SOC AM, V35, P454, DOI 10.1121/1.1918503 LOFQVIST A, 1982, J ACOUST SOC AM, V72, P633 Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE STRIK H, 1992, J PHONETICS, V20, P15 Sulter AM, 1996, J ACOUST SOC AM, V100, P3360, DOI 10.1121/1.416977 SUNDBERG J, 1993, J VOICE, V7, P15, DOI 10.1016/S0892-1997(05)80108-0 SUNDBERG J, 1990, J VOICE, V4, P107, DOI 10.1016/S0892-1997(05)80135-3 TITZE IR, 1992, J SPEECH HEAR RES, V35, P21 Titze IR, 1994, PRINCIPLES VOICE PRO TITZE IR, 1992, J ACOUST SOC AM, V91, P2936, DOI 10.1121/1.402929 TITZE IR, 1992, J ACOUST SOC AM, V91, P2926, DOI 10.1121/1.402928 VAN DEN BERG J, 1956, Folia Phoniatr (Basel), V8, P1 YOUNG HD, 1996, U PHYSICS, P612 NR 38 TC 17 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV PY 2002 VL 38 IS 3-4 BP 321 EP 334 DI 10.1016/S0167-6393(01)00072-3 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200005 ER PT J AU Bertoldi, N Brugnara, F Cettolo, M Federico, M Giuliani, D AF Bertoldi, N Brugnara, F Cettolo, M Federico, M Giuliani, D TI Cross-task portability of a broadcast news speech recognition system SO SPEECH COMMUNICATION LA English DT Article DE acoustic model adaptation; language model adaptation; spontaneous speech phenomena ID MAXIMUM-LIKELIHOOD AB This paper reports on experiments on porting the ITC-irst Italian broadcast news recognition system to two spontaneous dialogue domains. Porting was investigated by applying state-of-the-art adaptation methods to acoustic and language models, and by evaluating the trade-off between performance and the required amount of task-specific annotated data. The use of different levels of supervision for acoustic model adaptation was also studied. By employing 2 h of manually annotated speech, word error rates of 26.0% and 28.4% were achieved by the adapted systems. These results are to be compared with the performance of two domain-specific baseline systems, 22.6% and 21.2%, respectively, which were developed on much more training data. Finally, a robust method is presented that allows the insertion of spontaneous speech phenomena by the speech decoder to be tuned. (C) 2001 Elsevier Science B.V. All rights reserved. C1 IRST, ITC, Ctr Ric Sci & Tecnol, I-38050 Povo, Italy. RP Federico, M (reprint author), IRST, ITC, Ctr Ric Sci & Tecnol, Via Sommar 18, I-38050 Povo, Italy. EM federico@itc.it CR BERTOLDI N, 2001, P IEEE INT C AC SPEE BRUGNARA F, 2000, P INT C SPOK LANG PR, V2, P660 BRUGNARA F, 1997, P 5 EUR C SPEECH COM, P2751 BRUGNARA F, 1995, P EUROSPEECH, P2075 CETTOLO M, 1998, P INT C SPOK LANG PR, P1551 CETTOLO M, 1999, P ENTER C INNSBR AUS CHEVALIER H, 1995, P IEEE INT C AC SPEE, V1, P217 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Duda R. O., 2001, PATTERN CLASSIFICATI FEDERICO M, 2000, P 2 INT C LANG RES E, V2, P921 FEDERICO M, 1995, COMPUT SPEECH LANG, V9, P353, DOI 10.1006/csla.1995.0017 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Jelinek F., 1990, READINGS SPEECH RECO, P450 JELINEK F, 1992, ADV SPEECH SIGNAL PR, P651 LEFEVRE F, 2001, P IEEE INT C AC SPEE LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Leggetter C. J., 1995, P EUR 95 MADR, P1155 NEY H, 1994, COMPUT SPEECH LANG, V8, P1, DOI 10.1006/csla.1994.1001 NR 18 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 335 EP 347 DI 10.1016/S0167-6393(01)00074-7 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200006 ER PT J AU Chen, YJ Wu, CH Chiu, YH Liao, HC AF Chen, YJ Wu, CH Chiu, YH Liao, HC TI Generation of robust phonetic set and decision tree for Mandarin using chi-square testing SO SPEECH COMMUNICATION LA English DT Article DE confusion characteristic; phonetic representation; decision tree; Mandarin speech recognition AB A phonetic representation of a language is used to describe the corresponding pronunciation and synthesize the acoustic model of any vocabulary.
A phonetic representation with small phonetic units, such as SAMPA-C for Mandarin Chinese, and decision trees for parameter sharing are widely applied to deal with the problem of large numbers of recognition units. However, the confusable phonetic units in SAMPA-C generally degrade the recognition performance. In this paper, a statistical method based on chi-square testing is used to identify confusable phonetic units and to develop a more reliable phonetic set, named modified SAMPA-C. A corresponding question set for the modified SAMPA-C and a two-level splitting criterion are also proposed to effectively and efficiently construct the decision trees. Experiments on continuous Mandarin telephone speech recognition were conducted. Experimental results show that an encouraging improvement in recognition performance can be obtained. The proposed approaches represent a good compromise between the demands of accurate acoustic modeling and the limitations imposed by insufficient training data. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. RP Wu, CH (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. RI Wu, Chung-Hsien/E-7970-2013 CR Agresti A., 1990, CATEGORICAL DATA ANA Bahl L, 1991, P INT C AC SPEECH SI, P185, DOI 10.1109/ICASSP.1991.150308 BAHL LR, 1997, P ICASSP94 AD, P533 Boulianne G., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607126 Chen CC, 1998, COMMUN ANAL GEOM, V6, P1 COCHRAN WG, 1954, J INT BIOMETRIC SOC, P417 DUCHATEAU J, 1997, P EUR 97, V3, P1183 HUO Q, 1999, P ICASSP99, V2, P577 Hwang M. Y., 1993, P ICASSP, P311 Hwang MY, 1998, INT CONF ACOUST SPEE, P669 Krzanowski WJ, 1988, PRINCIPLES MULTIVARI KUHN R, 1995, P INT C AC SPEECH SI, P552 LANG BI, 1998, THESIS U TAIWAN TAIW Lyu RY, 1998, IEEE T SPEECH AUDI P, V6, P293 Mathews R. H., 1975, MATHEWS CHINESE ENGL Odell J., 1995, THESIS U CAMBRIDGE U Rabiner L, 1993, FUNDAMENTALS SPEECH SEIDE F, 1999, P OR COCOSDA WORKSH, P105 SEIDE F, 1998, P ISCSLP 98, P54 WELLS J, 1997, EAGGLES HDB SPOKEN L WILLETT D, 1999, P ICASSP, V2, P565 WU C, 1999, P ICASSP99, V1, P345 Wu CH, 2000, IEE P-VIS IMAGE SIGN, V147, P55, DOI 10.1049/ip-vis:20000099 NR 23 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 349 EP 364 DI 10.1016/S0167-6393(01)00076-0 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200007 ER PT J AU Beritelli, F Casale, S Ruggeri, G AF Beritelli, F Casale, S Ruggeri, G TI Hybrid multimode/multirate CS-ACELP speech coding for adaptive voice over IP SO SPEECH COMMUNICATION LA English DT Article DE IP telephony; VBR speech coding; rate control ID CLASSIFICATION AB This paper proposes a variable-rate, toll-quality CS-ACELP coder that uses coding modes compatible with the three 6.4, 8 and 11.8 kbit/s coding schemes standardised by ITU-T in G.729. In particular, we propose a hybrid multimode/multirate (M3R) codec with four coding categories and an average bit rate ranging between 3 and 8 kbit/s, which adapts the rate to changes in network conditions.
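Returning briefly to the chi-square procedure of the Chen, Wu, Chiu and Liao record above: the test they apply to phonetic confusions is the standard Pearson statistic on a confusion contingency table. A generic sketch follows; how the authors group SAMPA-C units and choose significance thresholds is specific to the paper and not reproduced here.

```python
import numpy as np

def chi_square(confusions):
    # Pearson chi-square statistic for a confusion contingency table
    # (rows: spoken units, columns: recognised units). Expected counts
    # come from the marginals; all marginals are assumed non-zero.
    obs = np.asarray(confusions, dtype=float)
    expected = np.outer(obs.sum(axis=1), obs.sum(axis=0)) / obs.sum()
    return float(np.sum((obs - expected) ** 2 / expected))
```

Comparing the statistic against a critical value with (rows-1)(cols-1) degrees of freedom then indicates whether the confusion patterns of two candidate units are statistically distinguishable, i.e. whether they should remain separate in the phonetic set.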
The results indicate that, both for the current Internet and for future scenarios, hybrid M3R adaptive coding seems to be a valid solution for achieving the quality currently guaranteed by PSTN networks. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Catania, Dipartimento Ingn Informat & Telecomunicaz, I-95125 Catania, Italy. RP Beritelli, F (reprint author), Univ Catania, Dipartimento Ingn Informat & Telecomunicaz, Vle A Doria 6, I-95125 Catania, Italy. EM beritelli@diit.unict.it; casale@diit.unict.it; gruggeri@diit.unict.it CR Barberis A, 2001, COMPUT COMMUN, V24, P757, DOI 10.1016/S0140-3664(00)00349-2 Beritelli F, 1998, IEEE J SEL AREA COMM, V16, P1818, DOI 10.1109/49.737650 BERITELLI F, 1999, IEEE SIGNAL PROCESSI, V17, P31 BERITELLI F, 2000, P ICSP BEIJ CHIN AUG Beritelli F, 1998, TELECOMMUN SYST, V9, P375, DOI 10.1023/A:1019112310453 BERITELLI F, 1997, P IEEE WORKSH SPEECH, P5 BERITELLI F, 1998, P IEEE INT C AC SPEE, P565 BERITELLI F, 2001, P ICASSP 2001 SALT L BERITELLI F, 1999, IEEE J SELECTED AREA, V17 BIALLY T, 1980, IEEE T COMMUN, V28 CASETTI C, 2000, ICC2000 Casner S., 1999, 2508 IETF RFC ELMALEH K, 1999, P IEEE INT C AC SPEE Floyd S., 2000, EQUATION BASED CONGE Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd GERSHO A, 1994, P IEEE, V82, P900, DOI 10.1109/5.286194 Hassan M, 2000, IEEE COMMUN MAG, V38, P96, DOI 10.1109/35.833564 HERRE J, 2000, ICSP 2000, V1, P11 Huber JF, 2000, IEEE COMMUN MAG, V38, P129, DOI 10.1109/35.868152 *ITU T, 2000, COMF NOIS PAYL DEF I LARZON LA, 2000, LECT NOTES COMPUT SC, V1989, P273 LARZON LA, 2000, EFFICIENT TRANSPORT Minoli D., 1998, DELIVERING VOICE IP PAKSOY E, 1994, EUR T TELECOMMUN, V5, P591 *REC G 729, 1998, ANN E ANN E REC G 72 *REC G 729, 1996, ANN B A SIL COMPR SC *REC G 729, 1998, ANN D ANN D REC G 72 *REC G 729, 1996, COD SPEECH 8 KBITS U *REC H323, 1999, PACK BAS MULT COMM S *REC P 800, 1996, METH SUBJ DET TRANSM Russo M, 1998, IEEE T FUZZY SYST, V6, P373, DOI 10.1109/91.705506 SCULZRINNE H, 1996, RTP TRANSPORT PROTOC SIRAM K, 1989, IEEE T COMMUN, V37, P703 *THREE GPP, 1999, 0812 3GPP YIN NY, 1990, IEEE T COMMUN, V38, P674, DOI 10.1109/26.54981 NR 35 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 365 EP 381 DI 10.1016/S0167-6393(01)00077-2 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200008 ER PT J AU Kaiser, J Horvat, B Kacic, Z AF Kaiser, J Horvat, B Kacic, Z TI Overall risk criterion estimation of hidden Markov model parameters SO SPEECH COMMUNICATION LA English DT Article DE continuous speech recognition; hidden Markov models; discriminative training; overall risk; extended Baum-Welch algorithm ID MINIMUM ERROR CLASSIFICATION; SPEECH RECOGNITION AB In this paper, we propose a novel discriminative objective function for the estimation of hidden Markov model (HMM) parameters, based on the calculation of overall risk. For continuous speech recognition, the algorithm minimises the risk of misclassification on the training database and thus maximises recognition accuracy. The calculation of the risk is based on the measurement of the Levenshtein distance between the correct transcription and the N-best recognised transcriptions, which counts the number of recognition errors - deleted, substituted and inserted words.
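The Levenshtein distance mentioned in the Kaiser et al. abstract above is the classic dynamic program; a minimal word-level version, counting deletions, substitutions and insertions at unit cost, is sketched below. (The paper's risk criterion averages this distance over the N-best transcriptions, which is not shown here.)

```python
def levenshtein(ref, hyp):
    # Minimum number of substitutions, deletions and insertions turning
    # the reference word list into the hypothesis (the WER alignment).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                      # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                      # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)]
```

For example, levenshtein("a b c".split(), "a x c d".split()) returns 2: one substitution and one insertion.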
The minimisation of the proposed criterion is implemented by using the extended Baum-Welch algorithm for the estimation of the discrete HMM parameters and Normandin's extension of the algorithm for the estimation of the continuous densities. We tested the performance of the proposed algorithm on two tasks: phoneme recognition on the TIMIT database and continuous speech recognition on the Resource Management (RM) database. Results show a consistent decrease in the recognition error rate when compared to standard maximum likelihood estimation training. The largest decrease in word error rate achieved in our experiments is 20.8% on the TIMIT phoneme recognition task, 39.6% on the RM task with context-independent HMMs and 18.22% with context-dependent models. (C) 2002 Elsevier Science B.V. All rights reserved. C1 HERMES SoftLab, SI-2000 Maribor, Slovenia. Univ Maribor, Fac Elect Engn & Comp Sci, SI-2000 Maribor, Slovenia. RP Kaiser, J (reprint author), HERMES SoftLab, Gosposvetska 84, SI-2000 Maribor, Slovenia. EM janez.kaiser@hermes.si; bogo.horvat@uni-mb.si; kacic@uni-mb.si CR Bahl L. R., 1986, P IEEE INT C AC SPEE, P49 Chen H P, 1994, Bioorg Med Chem, V2, P1, DOI 10.1016/S0968-0896(00)82195-1 Chow Y.-L., 1990, P INT C AC SPEECH SI, P701 Duda R. O., 1973, PATTERN CLASSIFICATI Gopalakrishnan P. S., 1989, P ICASSP, P631 Huang X. D., 1991, P ICASSP 91, P345, DOI 10.1109/ICASSP.1991.150347 Huang X.D., 1990, HIDDEN MARKOV MODELS JOHANSEN FT, 1996, THESIS NORWEGIAN U S Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 KAISER J, 2000, P ICSLP 2000 BEIJ CH KAPADIA S, 1993, P ICASSP, V2, P491 LAMEL L, 1992, FINAL REV DARPA ARTI, P59 LAMEL LF, 1987, P DARPA SPEECH REC W, P26 LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 Levenshtein V., 1966, SOV PHYS DOKL, V10, P707 Lin MT, 2000, SPEECH COMMUN, V30, P27, DOI 10.1016/S0167-6393(99)00027-8 NA K, 1995, P EUR 95 MADR, P97 NADAS A, 1983, IEEE T ACOUST SPEECH, V31, P814, DOI 10.1109/TASSP.1983.1164173 Normandin Y., 1991, THESIS MCGILL U MONT Price P., 1988, P IEEE INT C AC SPEE, P651 ROBINSON AJ, 1994, IEEE T NEURAL NETWOR, V5, P298, DOI 10.1109/72.279192 VALTCHEV V, 1996, P INT C AC SPEECH SI, V2, P605 WOODLAND P, 1991, P EUR 93 BERL GERM, P2207 Young S., 1999, HTK BOOK Zavaliagkos G, 1994, IEEE T SPEECH AUDI P, V2, P151, DOI 10.1109/89.260358 NR 26 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 383 EP 398 DI 10.1016/S0167-6393(02)00009-2 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200009 ER PT J AU Steeneken, HJM Houtgast, T AF Steeneken, HJM Houtgast, T TI Phoneme-group specific octave-band weights in predicting speech intelligibility SO SPEECH COMMUNICATION LA English DT Article DE speech intelligibility; octave-band contributions; frequency-importance function; phoneme groups; diagnostic prediction; objective measurement; speech transmission index AB In an earlier study we derived robust frequency-weighting functions for prediction of the intelligibility of short nonsense words. These frequency-weighting functions are applied for the prediction of intelligibility, such as with the speech transmission index (STI).
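In its simplest form, the frequency-weighted prediction underlying the STI referenced in the Steeneken and Houtgast abstract above reduces to clipping per-octave-band effective SNRs, mapping them to 0-1 transmission indices, and combining them with band weights. The sketch below uses the standard (SNR + 15)/30 mapping; the full STI computation (modulation transfer functions, redundancy corrections between adjacent bands) is omitted.

```python
import numpy as np

def weighted_index(band_snr_db, weights):
    # Clip each octave-band SNR to [-15, +15] dB, map it to a 0..1
    # transmission index, and combine with weights that sum to one.
    snr = np.clip(np.asarray(band_snr_db, dtype=float), -15.0, 15.0)
    ti = (snr + 15.0) / 30.0
    return float(np.dot(weights, ti))
```

The phoneme-group refinement of the paper then amounts to evaluating this weighted sum with four group-specific weight vectors and combining the resulting group scores into a word score.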
Six independent experiments revealed essentially similar frequency-weighting functions for the prediction of the nonsense word scores with respect to signal-to-noise ratio and gender [Speech Communication 28 (1999) 109]. Although the frequency weightings do not vary significantly for signal-to-noise ratio or gender, other studies have shown that using different types of speech material (e.g., nonsense words, phonetically balanced words and connected discourse) resulted in quite different frequency-weighting functions. This may be related to the distribution of specific phonemes in the test material. In order to obtain a more generic description of the frequency weighting, four relevant groups of phonemes were identified. In situations with reduced intelligibility, a small confusion rate of the phonemes between the groups and a high confusion rate of the phonemes within each group were observed. For each group a specific frequency-weighting function and a good prediction of the phoneme group scores could be obtained. It was shown that from these (weighted) phoneme group scores, word scores could be predicted with a prediction accuracy of ca. 4% (this corresponds to a signal-to-noise ratio of about 1 dB). Hence, this method provides a more generic way to predict intelligibility scores for different types of speech material. (C) 2002 Elsevier Science B.V. All rights reserved. C1 TNO, Human Factors, NL-3769 ZG Soesterberg, Netherlands. RP Steeneken, HJM (reprint author), TNO, Human Factors, POB 23, NL-3769 ZG Soesterberg, Netherlands. EM steeneken@tm.tno.nl CR ANSI, 1997, S35 ANSI BRONKHORST AW, 1992, J ACOUST SOC AM, V93, P499 DUGGIRALA V, 1988, J ACOUST SOC AM, V83, P2372, DOI 10.1121/1.396316 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 *IEC INT STAND, 1998, SOUND SYST EQ 16 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 PAVLOVIC CV, 1984, J ACOUST SOC AM, V75, P1606, DOI 10.1121/1.390870 PAVLOVIC CV, 1987, J ACOUST SOC AM, V82, P413, DOI 10.1121/1.395442 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 STEENEKEN JHM, 1989, P IEEE ICASSP GLASG, P540 STEENEKEN JHM, 2002, SPEECH COMMUN, V38, P413 STEENEKEN JHM, 1992, THESIS U AMSTERDAM STEENEKEN JHM, 1999, SPEECH COMMUN, V28, P109 STUDEBAKER GA, 1987, J ACOUST SOC AM, V81, P1130, DOI 10.1121/1.394633 WELLS JC, 1987, J INT PHONETIC ASS, V17, P94 NR 16 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 399 EP 411 DI 10.1016/S0167-6393(02)00011-0 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200010 ER PT J AU Steeneken, HJM Houtgast, T AF Steeneken, HJM Houtgast, T TI Validation of the revised STIr method SO SPEECH COMMUNICATION LA English DT Article DE speech intelligibility; speech transmission index; objective measure; validation ID PREDICTING SPEECH-INTELLIGIBILITY; OCTAVE-BAND WEIGHTS AB The revised model for the speech transmission index (STIr, Speech Communication 28 (1999) 109) was validated with an independent set of 68 test conditions. For a subset of 18 conditions, including only additive noise and band-pass limiting, it was verified that the STIr provides a good prediction of the CVC-word score.
The additional 50 conditions included non-linear distortion, echoes, automatic gain control, and waveform coding. For conditions with these types of distortion, specific parameters of the test signal are of interest. The parameters of the STI model were tuned in an earlier study for an optimal fit between the traditional STI and the CVC score, for a similar set of transmission conditions [J. Acoust. Soc. Amer. 67 (1980) 318]. It was found that the parameter settings also apply to the present revised model. The prediction accuracy for both male and female speech is 4-6% when expressed in CVC-word scores. This corresponds to a signal-to-noise ratio of about 1-2 dB. (C) 2002 Elsevier Science B.V. All rights reserved. C1 TNO, Human Factors, NL-3769 ZG Soesterberg, Netherlands. RP Steeneken, HJM (reprint author), TNO, Human Factors, POB 23, NL-3769 ZG Soesterberg, Netherlands. EM steeneken@tm.tno.nl CR ANDERSON BW, 1987, J ACOUST SOC AM, V81, P1982, DOI 10.1121/1.394764 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 HOUTGAST T, 1984, ACUSTICA, V54, P185 HOUTGAST T, 1973, ACUSTICA, V28, P66 HOUTGAST T, 1971, ACUSTICA, V25, P355 *IEC, 1998, PUBL, P1998 KRYTER KD, 1964, ESDTDR64674 KRYTER KD, 1962, J ACOUST SOC AM, V34, P1698, DOI 10.1121/1.1909096 LICKLIDER JCR, 1959, P NATL EL C, V15, P329 PAYNE JA, 1973, 7314 OFF TEL DEP COM PLOMP R, 1979, AUDIOLOGY, V8, P915 SCHWARTZLANDER H, 1959, ELECTRONICS, V29, P88 STEENEKEN HJM, 1991, P EUR 91 GEN, P1133 Steeneken H. J. M., 1992, Digital speech processing, speech coding, synthesis and recognition STEENEKEN HJM, 1973, P S INT PAR LIEG, P73 Steeneken HJM, 1999, SPEECH COMMUN, V28, P109, DOI 10.1016/S0167-6393(99)00007-2 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 STEENEKEN HJM, 1993, P EUR 93 3 C SPEECH, P203 Steeneken HJM, 2002, SPEECH COMMUN, V38, P399, DOI 10.1016/S0167-6393(02)00011-0 Steeneken HJM, 1983, P 11 INT C AC GALF P, V7, P85 WIJNGAARDEN SJ, 1999, EUROSPEECH 99, V6, P2639 NR 21 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 413 EP 425 DI 10.1016/S0167-6393(02)00010-9 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200011 ER PT J AU Sirigos, J Fakotakis, N Kokkinakis, G AF Sirigos, J Fakotakis, N Kokkinakis, G TI A hybrid syllable recognition system based on vowel spotting SO SPEECH COMMUNICATION LA English DT Article DE syllable recognition; speech recognition; speech processing ID SPEECH RECOGNITION AB In this paper we present a hybrid ANN/HMM syllable recognition system based on vowel spotting. Using an advanced multilevel vowel-spotting module, we track all vowel phonemes in the speech signal and then model the speech segments located between two successive vowels, which are defined as syllables. In order to achieve minimum vowel losses and accurate detection, we take special care with the vowel spotter, which is based on three different techniques: discrete hidden Markov models (DHMMs), multilayer perceptrons and heuristic rules. To set up the models of the syllable segments, hybrid DHMMs with multiple codebooks are used. The usual DHMM probability parameters are replaced by combined neural network outputs. For this purpose, we use both context-dependent and context-independent neural networks.
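The Sirigos et al. abstract above replaces DHMM emission probabilities with combined neural network outputs. A common way to wire network outputs into HMM decoding, shown here only as a hedged sketch of the general hybrid recipe rather than the paper's exact scheme, is to divide state posteriors by state priors to obtain scaled likelihoods:

```python
import numpy as np

def scaled_likelihoods(posteriors, priors, floor=1e-8):
    # P(state | x) / P(state) is proportional to p(x | state), so these
    # ratios can replace the usual emission probabilities during decoding.
    post = np.maximum(np.asarray(posteriors, dtype=float), floor)
    return post / np.asarray(priors, dtype=float)
```

The prior division matters: without it, frequent states (such as silence) would dominate the search simply because the network sees them more often in training.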
The syllable recognition system was tested with the TIMIT and NTIMIT databases, and the results showed 75.09% and 59.30% average syllable recognition accuracy, respectively. Note that no grammars or syllable-based lexicons were used to achieve these results. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Patras, Wire Commun Lab, Patras 26500, Greece. RP Sirigos, J (reprint author), Univ Patras, Wire Commun Lab, Patras 26500, Greece. EM john@pat.forthnet.gr CR ANQUITA D, 1993, YPROP ACCELERATING T, P500 BACCHIANI M, 1998, 586 ICSLP Bahl L., 1986, P INT C AC SPEECH SI, V11, P49, DOI 10.1109/ICASSP.1986.1169179 Bourlard Ha, 1994, CONNECTIONIST SPEECH BREIMAN L, 1994, STACKED REGRESSIONS FAKOTAKIS N, 1993, SPEECH COMMUN, V12, P57, DOI 10.1016/0167-6393(93)90018-G FISHER W, 1986, JASA S A, V81, P592 FRANCO H, 1994, COMPUT SPEECH LANG, V8, P211, DOI 10.1006/csla.1994.1010 FRITSCH J, 1997, P INT C AC SPEECH SI, P340 HANSEN LK, 1990, IEEE T PATTERN ANAL, V12, P993, DOI 10.1109/34.58871 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HU Z, 1996, P ICSLP PHIL, P452 JANKOWSKI C, 1990, P INT C AC SPEECH SI, P198 JONES RJ, 1997, EUR C SPEECH COMM TE, P1171 JUANG BH, 1991, TECHNOMETRICS, V33, P251 Krogh A., 1995, ADV NEURAL INFORMATI, V25, P231 Lippmann R. P., 1989, Neural Computation, V1, DOI 10.1162/neco.1989.1.1.1 MAK B, 1997, P INT C AC SPEECH SI, P670 MAKHOUL J, 1994, VOICE COMMUNICATION BETWEEN HUMANS AND MACHINES, P165 MIRCHANDANI G, 1989, IEEE T CIRCUITS SYST, V36, P152 RIIS SK, 1997, P INT C AC SPEECH SI, P670 Saul L., 1995, ADV NEURAL INFORMATI, V7, P435 SCHALKWYK J, 1995, INT C NEUR NETW SIGN, P800 SIRIGOS J, 1995, EUR C SPEECH COMM TE, P1301 SIRIGOS J, 1998, EUSIPCO 98 RHODES GR, P800 VOGL TP, 1990, J BIOL CYBERNET, V59, P257 Yu HJ, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P506 Zavaliagkos G, 1994, IEEE T SPEECH AUDI P, V2, P151, DOI 10.1109/89.260358 NR 28 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 427 EP 440 DI 10.1016/S0167-6393(02)00012-2 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200012 ER PT J AU Moller, S Bourlard, H AF Moller, S Bourlard, H TI Analytic assessment of telephone transmission impact on ASR performance using a simulation model SO SPEECH COMMUNICATION LA English DT Article DE telephone speech; ASR performance; quality assessment; simulation ID SPEECH; RECOGNITION AB This paper addresses the impact of telephone transmission channels on automatic speech recognition (ASR) performance. A real-time simulation model is described and implemented, which allows impairments that are encountered in traditional as well as modern (mobile, IP-based) networks to be flexibly and efficiently generated. The model is based on input parameters which are known to telephone network planners; thus, it can be applied without measuring specific network characteristics. It can be used for an analytic assessment of the impact of channel impairments on ASR performance, for producing training material with defined transmission characteristics, or for testing spoken dialogue systems in realistic network environments. In the present paper, we investigate the first point.
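To give a flavour of the kind of transmission simulation discussed in the Moller and Bourlard abstract above: the toy function below band-limits a signal to the classical 300-3400 Hz telephone band and adds white noise at a requested SNR. The real simulation model is driven by network-planning parameters and covers many more impairments (codecs, echo, delay); everything here, including the function name and default values, is illustrative only.

```python
import numpy as np

def toy_telephone_channel(x, fs, snr_db=20.0, low=300.0, high=3400.0):
    # Brick-wall band-pass via the FFT, then additive white noise
    # scaled so that the output has the requested overall SNR.
    spec = np.fft.rfft(x)
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    spec[(freqs < low) | (freqs > high)] = 0.0
    y = np.fft.irfft(spec, n=len(x))
    noise = np.random.randn(len(y))
    gain = np.sqrt(np.mean(y ** 2) / (np.mean(noise ** 2) * 10 ** (snr_db / 10.0)))
    return y + gain * noise
```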
Two speech recognizers integrated into a spoken dialogue system for information retrieval are assessed under controlled amounts of transmission degradation. The measured ASR performance degradation is compared to speech quality degradation in human-human communication. It turns out that, for some impairments, ASR behaves differently than would be expected from human quality judgments. This fact has to be taken into account both in telephone network planning and in speech and language technology development. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Ruhr Univ Bochum, Inst Kommunikat Akust, D-44780 Bochum, Germany. Inst Dalle Molle Intelligence Artificielle Percept, IDIAP, CH-1920 Martigny, Switzerland. RP Moller, S (reprint author), Ruhr Univ Bochum, Inst Kommunikat Akust, D-44780 Bochum, Germany. EM moeller@ika.ruhr-uni-bochum.de CR Chang HM, 2000, SPEECH COMMUN, V31, P293, DOI 10.1016/S0167-6393(99)00063-1 CHOLLET G, 1996, RR9601 IDIAP DAS S, 1999, P 6 EUR C SPEECH COM, V5, P1959 EULER S, 1994, P INT C AC SPEECH SI, V1, P621 Giuliani D., 1999, P ICASSP MARCH, V1, P449 Hennebert J, 2000, SPEECH COMMUN, V31, P265, DOI 10.1016/S0167-6393(99)00082-5 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HOGE H, 1997, P ICASSP, V3, P1771 *ITUT, 2000, G107 ITUT *ITUT, 1988, G722 ITUT *ITUT, 1989, P48 ITUT *ITUT, 1996, P561 ITUT *ITUT, 1996, G712 ITUT *ITUT, 1999, G109 ITUT *ITUT, 2000, P562 ITUT *ITUT, 1996, G113 ITUT *ITUT, 1999, P79 ITUT *ITUT, 1996, P810 ITUT JEKOSCH U, 2000, THESIS U GH ESSEN JOHANNESSON NO, 1997, IEEE COMMUN MAG, P70 KARRAY L, 1998, P ICASSP 98, V1, P261, DOI 10.1109/ICASSP.1998.674417 LILLY B, 1996, P ICSLP, V4, P2344, DOI 10.1109/ICSLP.1996.607278 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 MOKBEL C, 1993, P EUROSPEECH, P1247 Mokbel C, 1997, SPEECH COMMUN, V23, P141, DOI 10.1016/S0167-6393(97)00042-3 Moller S, 2002, SPEECH COMMUN, V38, P47, DOI 10.1016/S0167-6393(01)00043-7 Moller S., 2000, ASSESSMENT PREDICTIO MOLLER S, 2000, P INT C SPOK LANG PR, V1, P750 *NIST, 2001, SPEECH REC SCOR TOOL PUEL JB, 1997, P 5 EUR C SPEECH COM, P1151 RAAKE A, 1999, UNPUB ANAL VERIFICAT TARCISIO C, 1999, P 6 EUR C SPEECH COM, V6, P2825 TUCKER R, 1999, P EUR BUD HUNG, V5, P2155 Walker Marilyn A, 1997, P 35 ANN M ASS COMP, P271 WYARD P, 1993, P 3 EUR C SPEECH COM, P1805 NR 35 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2002 VL 38 IS 3-4 BP 441 EP 459 DI 10.1016/S0167-6393(02)00013-4 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 612VM UT WOS:000179096200013 ER PT J AU Tchorz, J Kollmeier, B AF Tchorz, J Kollmeier, B TI Estimation of the signal-to-noise ratio with amplitude modulation spectrograms SO SPEECH COMMUNICATION LA English DT Article DE sound classification; signal-to-noise ratio estimation ID FREQUENCY-SELECTIVITY; SPEECH RECOGNITION; PERIODICITY; MASKING AB An algorithm is proposed which automatically estimates the local signal-to-noise ratio (SNR) between speech and noise. The feature extraction stage of the algorithm is motivated by neurophysiological findings on amplitude modulation processing in higher stages of the auditory system in mammals. It analyzes information on both center frequencies and amplitude modulations of the input signal.
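One way to compute such a joint centre-frequency/modulation-frequency analysis, sketched here with invented frame and block sizes rather than the Tchorz and Kollmeier front end, is to take a short-time spectrogram and then Fourier-transform each frequency channel's envelope across a block of frames (the input must be long enough to fill at least one block).

```python
import numpy as np

def ams_pattern(x, win=256, hop=64, block=32):
    # Short-time magnitude spectra over time ...
    frames = np.stack([x[i:i + win] * np.hanning(win)
                       for i in range(0, len(x) - win, hop)])
    env = np.abs(np.fft.rfft(frames, axis=1))   # (frames, centre frequencies)
    env = env[:block]                           # one analysis block of frames
    # ... then a second FFT across time per channel gives the modulation axis.
    return np.abs(np.fft.rfft(env, axis=0))     # (mod. freqs, centre freqs)
```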
This information is represented in two-dimensional, so-called amplitude modulation spectrograms (AMS). A neural network is trained on a large number of AMS patterns generated from mixtures of speech and noise. After training, the network supplies estimates of the local SNR when AMS patterns from "unknown" sound sources are presented. Classification experiments show a relatively accurate estimation of the current SNR in independent 32 ms analysis frames. Harmonicity appears to be the most important cue for analysis frames to be classified as "speech-like", but the spectro-temporal representation of sound in AMS patterns also allows for a reliable discrimination between unvoiced speech and noise. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Carl von Ossietzky Univ Oldenburg, AG Med Phys, D-26111 Oldenburg, Germany. RP Tchorz, J (reprint author), Phonak Hearing Syst, Laubisrutistr 28, CH-8712 Stafa, Switzerland. EM juergen.tchorz@phonak.ch CR BACON SP, 1989, J ACOUST SOC AM, V85, P2575, DOI 10.1121/1.397751 BERTHOMMIER F, 1999, P INT C PHON SCI ICP, V14 Bregman A. S., 1993, THINKING SOUND COGNI, P10 Dau T, 1997, J ACOUST SOC AM, V102, P2906, DOI 10.1121/1.420345 Dau T, 1997, J ACOUST SOC AM, V102, P2892, DOI 10.1121/1.420344 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DUPONT S, 1999, P WORKSH ROB METH SP, P115 *ETSI, 1996, 0632 GSM ETS ETSI Ewert SD, 2000, J ACOUST SOC AM, V108, P1181, DOI 10.1121/1.1288665 Hirsch H., 1995, P ICASSP, P153 HOUTGAST T, 1989, J ACOUST SOC AM, V85, P1676, DOI 10.1121/1.397956 KATES JM, 1995, J ACOUST SOC AM, V97, P461, DOI 10.1121/1.412274 KOHLER DR, 1994, PHARMACOTHERAPY, V14, P3 KOLLMEIER B, 1994, J ACOUST SOC AM, V95, P1593, DOI 10.1121/1.408546 Langner G, 1997, J COMP PHYSIOL A, V181, P665, DOI 10.1007/s003590050148 LANGNER G, 1988, J NEUROPHYSIOL, V60, P1799 Martin R., 1993, P EUROSPEECH 93 BERL, P1093 NEMER E, 1988, IEEE SIGNAL PROCESS, V6, P1799 OSTENDORF M, 1998, FORSTSCHRITTE AKUSTI, P402 Rumelhart D.E., 1986, PARALLEL DISTRIBUTED, V1, P318 SHANNON RV, 1995, SCIENCE, V270, P303, DOI 10.1126/science.270.5234.303 Strube H., 1999, J ACOUST SOC AM, V105, P1092, DOI 10.1121/1.425118 Tchorz J, 2001, ADV NEUR IN, V13, P821 Tchorz J., 1999, J ACOUST SOC AM, V105, P1157 Unoki M, 1999, SPEECH COMMUN, V27, P261, DOI 10.1016/S0167-6393(98)00077-6 YANG D, 1999, J ACOUST SOC AM, V105, P1092 ZELL A, 1995, KONNEKTIONISMUS NEUR, V272, P335 NR 27 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 1 EP 17 DI 10.1016/S0167-6393(01)00040-1 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900001 ER PT J AU Klakow, D Peters, J AF Klakow, D Peters, J TI Testing the correlation of word error rate and perplexity SO SPEECH COMMUNICATION LA English DT Article DE language model training; perplexity; correlation with word error rate AB Many groups have investigated the relationship between the word error rate and the perplexity of language models. This issue is of central interest because perplexity optimization can be done independently of a recognizer, and in most cases it is possible to find simple perplexity optimization procedures.
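Both quantities studied in the Klakow and Peters abstract above are easy to state in code. The sketch below computes perplexity from per-word log-probabilities and fits the hypothesised power law WER ~ a * PP^b by linear regression in log-log space; function and variable names are invented for illustration.

```python
import math
import numpy as np

def perplexity(log_probs):
    # log_probs: natural-log LM probabilities, one per word in the test text.
    return math.exp(-sum(log_probs) / len(log_probs))

def fit_power_law(pp, wer):
    # If WER = a * PP**b, then log WER = log a + b * log PP: a straight line.
    b, log_a = np.polyfit(np.log(pp), np.log(wer), 1)
    return math.exp(log_a), b
```

Given paired measurements from several language models, fit_power_law returns the (a, b) estimates whose plausibility the paper tests against the measurement uncertainty.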
Moreover, many tasks in language model training, such as the optimization of word classes, may use perplexity as the target function, resulting in explicit optimization formulas which are not available if error rates are used as the target. This paper first presents some theoretical arguments for a close relationship between perplexity and word error rate. Thereafter the notion of uncertainty of a measurement is introduced and is then used to test the hypothesis that word error rate and perplexity are correlated by a power law. There is no evidence to reject this hypothesis. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Philips GmbH Forschungslab, D-52066 Aachen, Germany. RP Klakow, D (reprint author), Philips GmbH Forschungslab, Weisshausstr 2, D-52066 Aachen, Germany. EM dietrich.klakow@philips.com; jochen.peters@philips.com CR BESLING S, 1995, EUROSPEECH, P1755 BEYERLEIN P, 1998, DAPRA BROADCAST NEWS Bronshtein I. N., 1997, HDB MATH CHEN S, 1998, DARPA BROADCAST NEWS CLARKSON P, 1999, P EUR, P2707 DUGAST C, 1995, ARPA SPOKEN LANGUAGE ITO A, 1999, P EUR, P1591 Iyer R., 1997, P ASRU, P254 KLAKOW D, 1998, LOG LINEAR INTERPOLA, P1695 KLAKOW D, 2000, P ICASSP, P1695 KLAKOW D, 1998, DARPA BROADCAST NEWS KNESER R, 1993, P ICASSP, P586 Kneser R., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607162 Kneser R., 1993, P EUR C SPEECH COMM, P973 Kneser Reinhard, 1997, P EUR, P1971 KUHN R, 1990, IEEE T PATTERN ANAL, V12, P570, DOI 10.1109/34.56193 PRINTZ H, 2000, P ISCA ITRW ASR2000, P77 NR 17 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 19 EP 28 DI 10.1016/S0167-6393(01)00041-3 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900002 ER PT J AU Kurimo, M AF Kurimo, M TI Thematic indexing of spoken documents by using self-organizing maps SO SPEECH COMMUNICATION LA English DT Article DE latent semantic indexing; self-organizing maps; spoken document retrieval; broadcast news indexing AB A method is presented to provide a useful searchable index for spoken audio documents. The task differs from traditional (text) document indexing because large audio databases are decoded by automatic speech recognition and decoding errors occur frequently. The idea in this paper is to take advantage of the large size of the database and select the best index terms for each document with the help of the other documents close to it in a semantic vector space. First, the audio stream is converted into a text stream by a speech recognizer. Then the text of each story is represented in a vector space as a document vector which is the normalized sum of the word vectors in the story. A large collection of such document vectors is used to train a self-organizing map (SOM) to find latent semantic structures in the collection. As the stories in spoken news are short and will include speech recognition errors, smoothing of the document vectors using the semantic clusters determined by the SOM is introduced to enhance the indexing. The application in this paper is the indexing and retrieval of broadcast news on radio and television. Test results are given using the evaluation data from the text retrieval conference (TREC) spoken document retrieval (SDR) task. (C) 2002 Elsevier Science B.V. All rights reserved.
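The document-vector and SOM-smoothing steps of the Kurimo abstract above can be outlined as follows. This is a schematic reading only: the paper's smoothing uses the map's semantic clusters in its own way, whereas the sketch simply pulls each (possibly ASR-noisy) document vector toward its best-matching map unit; word_vecs, som_units and the blending weight alpha are all assumptions.

```python
import numpy as np

def doc_vector(words, word_vecs):
    # Normalized sum of the word vectors occurring in one story
    # (assumes at least one word is in the vocabulary).
    v = np.sum([word_vecs[w] for w in words if w in word_vecs], axis=0)
    return v / (np.linalg.norm(v) + 1e-12)

def smooth_with_som(doc, som_units, alpha=0.5):
    # Blend the document vector with its best-matching SOM unit, so
    # documents that land near each other on the map reinforce the
    # index terms they share, dampening recognition errors.
    bmu = som_units[np.argmin(np.linalg.norm(som_units - doc, axis=1))]
    v = (1.0 - alpha) * doc + alpha * bmu
    return v / (np.linalg.norm(v) + 1e-12)
```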
C1 Aalto Univ, Neural Networks Res Ctr, Espoo 02150, Finland. RP Kurimo, M (reprint author), Aalto Univ, Neural Networks Res Ctr, POB 5400,Konemiehentie 2, Espoo 02150, Finland. EM mikko.kurimo@hut.fi RI Kurimo, Mikko/F-6647-2012 CR ABBERLEY D, 1999, ESCA ETRW WORKSH ACV, P14 ABBERLEY D, 1999, P 8 TEXT RETR C ALLAN J, 1998, P 6 TEXT RETR C TREC, P169 ANDERSEN J, 1998, 987 COM IDIAP BELLEGARDA JR, 1997, IEEE WORKSH AUT SPEE, P262 BELLEGARDA JR, 1999, P IEEE INT C AC SPEE, P717 BERROL S, 1992, PHYSICAL MED REHABIL, V6, P1 Bourland H., 1994, CONNECTIONIST SPEECH Chen S., 1998, DARPA BROADC NEWS TR COCCARO N, 1998, P INT C SPOK LANG PR DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9 GAROFOLO J, 1999, ESCA ETRW ACCESSING, P1 Gillick L., 1989, P ICASSP, P532 GOLUB G, 1971, HDB MATRIX COMPUTATI, V2 HOFMANN T, 1999, P 3 S INT DAT AN IDA HOFMANN T, 1998, 98042 TR INT COMP SC Honkela T., 1996, A32 HELS U TECHN LAB JOHNSON SE, 1999, P IEEE INT C AC SPEE, P49 Johnson W.B., 1984, CONT MATH, V26, P189 Kaski S., 1998, P IJCNN 98 INT JOINT, V1, P413 Kohonen T, 1997, SELF ORG MAPS Kohonen T, 1999, KOHONEN MAPS, P171, DOI 10.1016/B978-044450270-4/50013-9 KURIMO M, 1999, ESCA ETRW WORKSH ACC, P25 Kurimo M, 1999, KOHONEN MAPS, P363, DOI 10.1016/B978-044450270-4/50029-2 Lagus K., 1999, P ICANN99 9 INT C AR, V1, P371 Ng K, 1998, INT CONF ACOUST SPEE, P325, DOI 10.1109/ICASSP.1998.674433 PAPADIMITRIOU C, 1998, P 17 ACM S PRINC DAT PORTER MF, 1980, PROGRAM-AUTOM LIBR, V14, P130, DOI 10.1108/eb046814 RAUBER A, 1999, P 3 PAC AS C KNOWL D RENALS S, 1998, P 7 TEXT RETR C TREC RITTER H, 1989, BIOL CYBERN, V61, P241, DOI 10.1007/BF00203171 ROBERTSON SE, 1976, J AM SOC INFORM SCI, V27, P129, DOI 10.1002/asi.4630270302 Robinson T., 1996, AUTOMATIC SPEECH SPE, P233 Robinson T, 1998, INT CONF ACOUST SPEE, P829, DOI 10.1109/ICASSP.1998.675393 ROBINSON T, 1999, P EUR BUD, P1067 Salton G, 1971, SMART RETRIEVAL SYST SIEGLER M, 1999, P IEEE INT C AC SPEE, P505 Simula O, 1999, KOHONEN MAPS, P375, DOI 10.1016/B978-044450270-4/50030-9 Ultsch A, 1999, KOHONEN MAPS, P33, DOI 10.1016/B978-044450270-4/50003-6 Xu Jinxi, 1996, P 19 ANN INT ACM SIG, P4, DOI 10.1145/243199.243202 NR 40 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 29 EP 45 DI 10.1016/S0167-6393(01)00042-5 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900003 ER PT J AU Moller, S Raake, A AF Moller, S Raake, A TI Telephone speech quality prediction: Towards network planning and monitoring models for modern network scenarios SO SPEECH COMMUNICATION LA English DT Article DE speech quality; telephone speech; quality prediction model; network planning; monitoring AB This paper addresses the problem of predicting the quality of telephone speech. Starting from a definition of quality, which takes communicative as well as service-related factors into account, a new classification scheme for prediction models is proposed. It considers input and output parameters, the network components and application area the model is used for, as well as the psychoacoustic and judgment-related bases. According to this scheme, quality prediction models can be classified into signal-based comparative measures, network planning models and monitoring models. 
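One of the network planning models analyzed just below, the ITU-T E-model of Recommendation G.107, combines impairment factors additively into a rating R that maps to a mean opinion score. The skeleton below states only the two standard G.107 formulas; all parameter values must come from the Recommendation and are deliberately not defaulted here.

```python
def e_model_r(ro, i_s, i_d, i_e, a=0.0):
    # G.107 rating: basic signal-to-noise term minus simultaneous, delay
    # and equipment impairments, plus an advantage (expectation) factor.
    return ro - i_s - i_d - i_e + a

def r_to_mos(r):
    # Standard R-to-MOS conversion from G.107.
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6
```

The additivity of the impairment terms is exactly the psychoacoustic assumption whose validity the paper examines against auditory test data.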
Whereas signal-based approaches have been described extensively in the literature, this paper discusses the latter two approaches in detail. The underlying psychoacoustic properties of two network planning models, the E-model and the SUBMOD model, are analyzed, and combined approaches for monitoring models are developed. Quality predictions obtained from the models are compared to the results of auditory test data, and weaknesses as well as network elements that remain uncovered are identified. Possible future extensions to the models are pointed out, including wide-band scenarios and speech sound quality, non-stationary impairments, as well as speech technology devices. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Ruhr Univ Bochum, Inst Kommunikationsakust, D-44780 Bochum, Germany. RP Moller, S (reprint author), Ruhr Univ Bochum, Inst Kommunikationsakust, D-44780 Bochum, Germany. EM moeller@ika.ruhr-uni-bochum.de CR ALLNATT J, 1975, INT J MAN MACH STUD, V7, P801, DOI 10.1016/S0020-7373(75)80037-X Bappert V., 1994, Acta Acustica, V2 BEERENDS JG, 1995, 98 CONV AUD ENG SOC BERGER J, 1998, ARBEITEN DIGITALE SI, V13 BODDEN M, 1996, UNPUB ENTWICKLUNG DU Collard J., 1929, Electrical Communication, V7 *ETSI ETR, 1996, 250 ESTI ETR EULER S, 1994, P INT C AC SPEECH SI, V1, P621 Fletcher H, 1937, J ACOUST SOC AM, V9, P1, DOI 10.1121/1.1915904 GLEISS N, 1992, USABILITY CONCEPTS E, P24 HANSEN M, 1998, THESIS CARLVONOSSIET HAUENSTEIN M, 1997, ARBETEN DIGITALE SIG Hollier MP, 1996, BT TECHNOL J, V14, P206 JEKOSCH U, 2000, THESIS U GH ESSEN JOHANNESSON NO, 1996, UP IPSW M 23 27 SEPT JOHANNESSON NO, 1997, IEEE COMMUN MAG, P70 KARIS D, 1991, P HUM FACT SOC 35 AN, V1, P217 KITAWAKI N, 1991, IEEE J SEL AREA COMM, V9, P586, DOI 10.1109/49.81952 McAdams S., 1993, THINKING SOUND COGNI MCDERMOT.BJ, 1969, J ACOUST SOC AM, V45, P774, DOI 10.1121/1.1911465 Moller S., 2000, ASSESSMENT PREDICTIO MOLLER S, 1999, J ACTA ACUSTICA S1, V85, pS49 MOLLER S, 2000, P INT C SPOK LANG PR, V1, P750 NAKATANI LH, 1973, J ACOUST SOC AM, V53, P1083, DOI 10.1121/1.1913428 RAAKE A, 1999, UNPUB ANAL VERIFICAT RAAKE A, 2000, P ICSLP 2000 CHN BEI, V4, P744 RICHARDS DL, 1973, COMMUNICATION RICHARDS DL, 1974, P I ELECTR ENG, V121, P313 RIX AW, 2000, 109 CONV AUD ENG SOC VEAUX C, 1999, P 6 EUR C SPEECH COM, V6, P2579 Voiers W., 1977, P IEEE INT C AC SPEE, P204 WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 NR 32 TC 8 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 47 EP 75 DI 10.1016/S0167-6393(01)00043-7 PG 29 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900004 ER PT J AU Yoma, NB Pegoraro, TF AF Yoma, NB Pegoraro, TF TI Robust speaker verification with state duration modeling SO SPEECH COMMUNICATION LA English DT Article DE speaker verification; noise robustness; temporal restrictions; HMM ID HIDDEN MARKOV MODEL; SPEECH RECOGNITION; NOISE; COMPENSATION; STRESS AB This paper addresses the problem of state duration modeling in the Viterbi algorithm in a text-dependent speaker verification task. The results presented in this paper suggest that temporal constraints can lead to reductions of 10% and 20% in the error rates with signals corrupted by noise at SNR equal to 6 and 0 dB, respectively, and that the accurate statistical modeling of state duration (e.g.
with a gamma probability distribution) does not seem to be very relevant if maximum and minimum state duration restrictions are imposed. In contrast, temporal restrictions do not seem to give any improvement in a speaker verification task with clean speech or high SNR. It is also shown that state duration constraints can easily be applied with the likelihood normalization metrics based on speaker-dependent temporal parameters. Finally, the results presented here show that word position-dependent state duration parameters give no significant improvement when compared with the word position-independent approach if the coarticulation effect between contiguous words is low. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Chile, Dept Elect Engn, Santiago, Chile. Ericsson Brasil, Idaiatuba, SP, Brazil. RP Yoma, NB (reprint author), Univ Chile, Dept Elect Engn, Av Tupper 2007,POB 412-3, Santiago, Chile. EM nbecerra@cec.uchile.cl CR Bou-Ghazale SE, 1998, IEEE T SPEECH AUDI P, V6, P201, DOI 10.1109/89.668815 BURSHTEIN D, 1996, IEEE T SPEECH AUDIO, V4 Carey MJ, 1992, P I ACOUSTICS, V14, P95 CHEN Y, 1988, IEEE T ASSP, V36 FORSYTH ME, 1995, THESIS U EDINBURGH Furui S, 1997, PATTERN RECOGN LETT, V18, P859, DOI 10.1016/S0167-8655(97)00073-1 Ghitza O., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Hansen JHL, 1996, SPEECH COMMUN, V20, P151, DOI 10.1016/S0167-6393(96)00050-7 JUNQUA JC, 1999, P ICSLP99 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 LJOLJE A, 1994, COMPUT SPEECH LANG, V8, P129, DOI 10.1006/csla.1994.1006 LJOLJE A, 1991, IEEE T SIGNAL PROCES, V39, P29, DOI 10.1109/78.80762 MOKBEL CE, 1995, IEEE T SPEECH AUDI P, V3, P346, DOI 10.1109/89.466660 VANSUMMERS W, 1988, J ACOUST SOC AM, V84 VARGA AP, 1992, NOISEX 92 STUDY EFFE Vaseghi SV, 1997, IEEE T SPEECH AUDI P, V5, P11, DOI 10.1109/89.554264 Yoma NB, 2001, IEEE T SPEECH AUDI P, V9, P179, DOI 10.1109/89.902285 NR 17 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 77 EP 88 DI 10.1016/S0167-6393(01)00044-9 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900005 ER PT J AU Lee, KS Cox, RV AF Lee, KS Cox, RV TI A segmental speech coder based on a concatenative TTS SO SPEECH COMMUNICATION LA English DT Article DE very low bit rate speech coding; segmentation; unit selection ID MODEL AB An extremely low bit rate speech coder based on a recognition/synthesis paradigm is proposed. In our speech coder, the speech signal is produced in a way which is similar to concatenative speech synthesis of text-to-speech (TTS). Hence, database construction, unit selection and prosody modification, which are the major parts of concatenative TTS, are employed to implement the speech coder. The synthesis units are automatically found in a large database using a joint segmentation/classification scheme. Dynamic programming (DP) is applied to unit selection, in which two cost functions, an acoustic target cost and a concatenation cost, are used to increase naturalness as well as intelligibility. Prosodic differences between the selected unit and the input segment are compensated for by time-scale and pitch modifications which are based on the harmonic plus noise (HNM) model framework.
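The DP unit selection described in the Lee and Cox abstract above follows the familiar Viterbi pattern over candidate units. A compact sketch, assuming (as a simplification) the same number of candidates per input segment and precomputed cost arrays; the actual cost definitions are the paper's own and are not reproduced here:

```python
import numpy as np

def select_units(target_costs, concat_costs):
    # target_costs[t, u]: acoustic target cost of unit u for segment t.
    # concat_costs[t, p, u]: join cost from unit p at t-1 to unit u at t.
    T, U = target_costs.shape
    best = target_costs[0].copy()
    back = np.zeros((T, U), dtype=int)
    for t in range(1, T):
        total = best[:, None] + concat_costs[t] + target_costs[t][None, :]
        back[t] = np.argmin(total, axis=0)   # best predecessor per unit
        best = np.min(total, axis=0)
    path = [int(np.argmin(best))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]                        # one unit index per segment
```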
In single-speaker tests, the proposed scheme gave intelligible and natural-sounding speech at an average bit rate of about 580 b/s. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Konkuk Univ, Dept Elect Engn, Gwangjin Gu, Seoul 143701, South Korea. AT&T Labs Res, Speech Proc Software & Technol Res Dept, Florham Pk, NJ 07932 USA. RP Lee, KS (reprint author), Konkuk Univ, Dept Elect Engn, Gwangjin Gu, 1 Hwayang Dong, Seoul 143701, South Korea. EM kslee1@sait.samsung.co.kr CR BENBASSAT G, 1984, P ICASSP 84, V1 Beutnagel M., 1999, P JOINT M ASA EAA DA CERNOCKY J, 1998, P IEEE INT C AC SPEE, V2, P605, DOI 10.1109/ICASSP.1998.675337 CHEN HC, 1997, IEEE WORKSH SPEECH C, P27 Fujisaki H., 1988, P IEEE INT C AC SPEE, V1, P663 Hunt A., 1996, P INT C AC SPEECH SI, V1, P373 ISMAIL M, 1997, P EUROSPEECH 97, V1, P441 Katsaggelos AK, 1998, P IEEE, V86, P1126, DOI 10.1109/5.687833 Klabbers E., 1998, P ICSLP, P1983 KLEIJN WB, 1995, SPEECH CODING SYNTHE, P175 Kroon Peter, 1995, SPEECH CODING SYNTHE, P467 Lee JY, 2001, TRENDS MICROBIOL, V9, P5, DOI 10.1016/S0966-842X(00)01901-6 LEE KS, 1999, P IEEE INT C AC SPEE, P181 MCCREE AV, 1995, IEEE T SPEECH AUDI P, V3, P242, DOI 10.1109/89.397089 OSTENDORF M, 1989, IEEE T ACOUST SPEECH, V37, P1857, DOI 10.1109/29.45533 Paliwal K.K., 1995, SPEECH CODING SYNTHE, P433 RIBEIRO CM, 1997, P EUROSPEECH 97, V3, P1291 Roucos S., 1985, P ICASSP 85, P236 SHIRAKI Y, 1988, IEEE T ACOUST SPEECH, V36, P1437, DOI 10.1109/29.90372 Stylianou Y, 2001, IEEE T SPEECH AUDI P, V9, P21, DOI 10.1109/89.890068 STYLIANOU Y, 1997, P EUROSPEECH 97, P613 TAYLOR P, 1994, SPEECH COMMUN, V15, P169, DOI 10.1016/0167-6393(94)90050-7 TSAO C, 1985, IEEE T ACOUST SPEECH, V33, P537 NR 23 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 89 EP 100 DI 10.1016/S0167-6393(01)00045-0 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900006 ER PT J AU Nieuwoudt, C Botha, EC AF Nieuwoudt, C Botha, EC TI Cross-language use of acoustic information for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE multilingual speech recognition; cross-language acoustic adaptation; Bayesian adaptation; transformation-based adaptation ID SPEAKER ADAPTATION AB Techniques are investigated that use acoustic information from existing source language databases to implement automatic speech recognition (ASR) systems for new target languages. The assumption is that the amount of target language data available is too little for the training of a robust ASR system. Strategies for cross-language use of acoustic information are evaluated, including (i) training on pooled source and target language data, (ii) adapting source language models using target language data, (iii) adapting models trained on pooled source and target language data using target language data only and (iv) transforming source language data to augment target language data for model training. These strategies are allied with Bayesian and transformation-based techniques to present a framework for cross-language reuse of acoustic information.
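Of the strategies listed in the Nieuwoudt and Botha abstract above, the Bayesian one has the most compact core. For a Gaussian mean it reduces to the usual MAP blend of the source-language prior with the target-language sample statistics; this is the generic formula, not the paper's full adaptation scheme, and the prior weight tau is an invented tuning constant.

```python
import numpy as np

def map_adapt_mean(prior_mean, target_frames, tau=10.0):
    # MAP estimate: (tau * prior + sum of target data) / (tau + n).
    # With little target data the estimate stays near the source-language
    # prior; with lots of target data it converges to the target mean.
    n = len(target_frames)
    return (tau * np.asarray(prior_mean) + np.sum(target_frames, axis=0)) / (tau + n)
```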
Experiments are performed for a large number of approaches from the framework, using relatively large amounts of English speech data from either a separate database or from the same database as smaller amounts of Afrikaans speech data to improve the performance of an Afrikaans speech recogniser. Results indicate that a significant reduction in word error rate is achievable (between 14% and 48% across the experiments), depending on the amount of target language data available. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Pretoria, Dept Elect Elect & Comp Engn, ZA-0002 Pretoria, South Africa. RP Botha, EC (reprint author), Univ Pretoria, Dept Elect Elect & Comp Engn, ZA-0002 Pretoria, South Africa. EM liesbeth.botha@eng.up.ac.za CR Billa J., 1997, P EUR, P363 BONAVENTURA P, 1997, P EUR RHOD GREEC, P355 BUB U, 1997, P ICASSP, P1451 CHOU W, 1999, P EUROSPEECH 99 BUD DeGroot M., 1970, OPTIMAL STAT DECISIO DeGroot M. H., 1975, PROBABILITY STAT DIGALAKIS V, 1995, P ICASSP 95 DETR MI, P680 DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659 Duda R. O., 1973, PATTERN CLASSIFICATI Gales M.J.F., 1997, 291 CUEDFINFENGTR Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Gibbon D, 1997, HDB STANDARDS RESOUR GLASS J, 1995, SPEECH COMMUN, V17, P1, DOI 10.1016/0167-6393(95)00008-C Kohler J, 1998, INT CONF ACOUST SPEE, P417, DOI 10.1109/ICASSP.1998.674456 KOHLER J, 1996, P ICSLP 96 PHIL PA, V4, P2195, DOI 10.1109/ICSLP.1996.607240 Lamel L. F., 1986, P DARPA SPEECH REC W, P100 LAVIE A, 1996, P ICSLP 96 PHIL PA O, V4, P2375, DOI 10.1109/ICSLP.1996.607286 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Malt B., 1996, P ICSLP96, V4, P2005, DOI 10.1109/ICSLP.1996.607191 NIEUWOUDT C, 2001, THESIS U PRETORIA S Padmanabhan M, 1998, IEEE T SPEECH AUDI P, V6, P71, DOI 10.1109/89.650313 SCHULTZ T, 1998, P ICSLP SYDN AUSTR, V5, P1819 Schultz T., 1997, P EUROSPEECH, P371 WAARDENBURG T, 1992, P ICASSP 92 SAN FRAN, P1585 WANG C, 1997, P EUROSPEECH 97 RHOD, P351 Weng F., 1997, P EUR RHOD, P359 Wheatley B., 1994, P ICASSP 94 AD AUSTR, P1237 NR 27 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 101 EP 113 DI 10.1016/S0167-6393(01)00046-2 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900007 ER PT J AU Yun, YS Oh, YH AF Yun, YS Oh, YH TI A segmental-feature HMM for continuous speech recognition based on a parametric trajectory model SO SPEECH COMMUNICATION LA English DT Article DE parametric trajectory model; segmental HMM; segmental-feature HMM; speech recognition AB In this paper, we propose a new acoustic model for characterizing segmental features and an algorithm based upon a general framework of hidden Markov models (HMMs). The segmental features are represented as a trajectory of observed vector sequences by a polynomial regression function. To obtain the polynomial trajectory from speech segments, we modify the design matrix to include transitional information for contiguous frames. We also propose methods for estimating the likelihood of a given segment and trajectory parameters. The observation probability of a given segment is represented as the relation between the segment likelihood and the estimation error of the trajectories.
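The Yun and Oh abstract above represents each segment as a polynomial trajectory and ties the segment's likelihood weight to the trajectory estimation error. A minimal sketch of that regression step follows; it uses a plain Vandermonde design matrix and so omits the authors' modification that injects transitional information for contiguous frames.

import numpy as np

def fit_trajectory(segment, order=2):
    """Fit a polynomial trajectory to a variable-length segment.

    segment: (L, D) array of L observation vectors of dimension D.
    order:   polynomial order R of the regression function.
    Returns the (R+1, D) trajectory parameters and the mean squared
    residual, which plays the role of the trajectory estimation error
    that weights the segment likelihood in the model.
    """
    L, D = segment.shape
    t = np.linspace(0.0, 1.0, L)                   # normalized time axis
    Z = np.vander(t, order + 1, increasing=True)   # (L, R+1) design matrix
    B, *_ = np.linalg.lstsq(Z, segment, rcond=None)
    residual = segment - Z @ B
    return B, float(np.mean(residual ** 2))

Because the time axis is normalized to [0, 1], segments of different durations share one parameter space, which is what lets a single trajectory model score variable-length segments.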
The estimation error of a trajectory is considered the weight of the likelihood of a given segment in a state. This weight represents how well the corresponding trajectory characterizes the segment. The proposed model can be regarded as a generalization of a conventional HMM and a parametric trajectory model. We conducted several experiments to establish the effectiveness of the proposed method and the characteristics of the segmental features. The recognition results on the TIMIT database demonstrate that the performance of segmental-feature HMM (SFHMM) is better than that of a conventional HMM. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Hannam Univ, Sch Informat Technol & Multimedia Engn, Daeduk Gu, Taejon 306791, South Korea. Korea Adv Inst Sci & Technol, Dept Elect Engn & Comp Sci, Div Comp Sci, Yusong Gu, Taejon 305701, South Korea. RP Yun, YS (reprint author), Hannam Univ, Sch Informat Technol & Multimedia Engn, Daeduk Gu, 133 Ojung Dong, Taejon 306791, South Korea. EM ysyun@mail.hannam.ac.kr; yhoh@cs.kaist.ac.kr RI Oh, Yung-Hwan/C-1915-2011 CR DENG L, 1992, SIGNAL PROCESS, V27, P65, DOI 10.1016/0165-1684(92)90112-A Deng L, 1994, IEEE T SPEECH AUDI P, V2, P507 Flannery B. P., 1992, NUMERICAL RECIPES C FUKADA T, 1997, INT C AC SPEECH SIGN FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 GALES M, 1993, 133 CUEDFINFENGTR GISH H, 1993, INT C AC SPEECH SIGN GISH H, 1996, INT C SPOK LANG PROC GOLDENTHAL W, 1993, EUR C SPEECH COMM TE HOLMES W, 1995, INT C AC SPEECH SIGN Holmes WJ, 1999, COMPUT SPEECH LANG, V13, P3, DOI 10.1006/csla.1998.0048 Lee K.F., 1989, IEEE T AUDIO SPEECH, V37, P1641 Ostendorf M, 1996, IEEE T SPEECH AUDI P, V4, P360, DOI 10.1109/89.536930 RUSSELL M, 1993, INT C AC SPEECH SIGN NR 14 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 115 EP 130 DI 10.1016/S0167-6393(01)00047-4 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900008 ER PT J AU Lamel, L Bennacef, S Gauvain, JL Dartigues, H Temem, JN AF Lamel, L Bennacef, S Gauvain, JL Dartigues, H Temem, JN TI User evaluation of the Mask kiosk SO SPEECH COMMUNICATION LA English DT Article DE spoken language systems; speech recognition; speech understanding; natural language understanding; information retrieval dialog AB In this paper we report on a series of user trials carried out to assess the performance and usability of the Multimodal Multimedia Service Kiosk (MASK) prototype. The aim of the ESPRIT MASK project was to pave the way for advanced public service applications with user interfaces employing multimodal, multimedia input and output. The prototype kiosk was developed after analyzing the technological requirements in the context of users performing travel enquiry tasks, in close collaboration with the French Railways (SNCF) and the Ergonomics group at the University College of London (UCL). The time to complete the transaction with the MASK kiosk is reduced by about 30% compared to that required for the standard kiosk, and the transaction success rate is 85% for novices and 94% once familiar with the system. In addition to meeting or exceeding the performance goals set at the project onset in terms of success rate, transaction time, and user satisfaction, the MASK kiosk was judged to be user-friendly and simple to use.
(C) 2002 Elsevier Science B.V. All rights reserved. C1 CNRS, LIMSI, F-91403 Orsay, France. SNCF, Direct Rech & Technol, F-75379 Paris, France. RP Lamel, L (reprint author), CNRS, LIMSI, BP 133, F-91403 Orsay, France. EM lamel@limsi.fr CR BENNACEF SK, 1994, P ICSLP 94 YOK, V3, P1271 BERNARD F, 1997, P IEA 97 TEMP FINL, V6, P264 BERNARD F, 1997, P INT 97 MONTP FRANC, P287 CHHOR E, 1995, HUM COMF SEC WORKSH DARTIGUES H, 1997, WORLD C RAILW RES 97, P513 DOWELL J, 1995, INT ERG ASS WORLD C Fraser N. M., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90019-M Gauvain J. L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607734 GAUVAIN JL, 1997, HUMAN COMFORT SECURI GAUVAIN JL, 1996, I ELECT INFORMAT DEC, P2005 LAMEL L, 1993, ESCA NATO WORKSH APP, P207 LAMEL L, 1998, ICSLP 98, P2875 LIFE A, 1994, ERGONOMICS, V37, P1801 Life A., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607947 *MASK CONS, 1998, FIN REP ESPRIT MASK TEMEM JN, 1999, P WORLD C RAILW RES NR 16 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 131 EP 139 DI 10.1016/S0167-6393(01)00048-6 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900009 ER PT J AU Vallabha, GK Tuller, B AF Vallabha, GK Tuller, B TI Systematic errors in the formant analysis of steady-state vowels SO SPEECH COMMUNICATION LA English DT Article DE speech analysis; LPC; vowels; formant estimation ID ENGLISH VOWELS AB The locations of formants in a speech signal are usually estimated by computing the linear predictive coefficients (LPC) over a sliding window and finding the peaks in the spectrum of the resulting LP filter. The peak locations are estimated either by root-solving or by computing a coarse spectrum and finding its maxima. We discuss four sources of systematic error in this analysis: (1) quantization of the speech signal due to the fundamental frequency, (2) incorrect order for the LP filter, (3) exclusive reliance upon root-solving, and (4) the three-point parabolic interpolation used to compensate for the coarse spectrum. We show that the expected error due to F0 quantization is approximately 10% of F0, and that the other three sources can independently skew the final formant estimates by 10-80 Hz. We also show that errors due to incorrect filter order are related to systematic differences between speakers and phonetic classes, and that root-solving is especially error-prone for low formants or when formants are close to each other. We discuss methods for avoiding these errors and improving the accuracy of formant estimation, and give a heuristic for estimating the optimal filter order of a steady-state signal. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Florida Atlantic Univ, Ctr Complex Syst & Brain Sci, Boca Raton, FL 33431 USA. RP Vallabha, GK (reprint author), Florida Atlantic Univ, Ctr Complex Syst & Brain Sci, 77 Glades Rd, Boca Raton, FL 33431 USA. EM vallabha@walt.ccs.fau.edu CR ATAL BS, 1974, P 1974 STOCKH SPEECH, V1, P27 CHANDRA S, 1974, IEEE T ACOUST SPEECH, VAS22, P403, DOI 10.1109/TASSP.1974.1162613 Deller J.
R., 1993, DISCRETE TIME PROCES HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 JOHNSON K, 1993, LANGUAGE, V69, P505, DOI 10.2307/416697 JOHNSON K, 1993, J ACOUST SOC AM, V94, P701, DOI 10.1121/1.406887 KEWLEYPORT D, 1994, J ACOUST SOC AM, V95, P485, DOI 10.1121/1.410024 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 Markel JD, 1976, LINEAR PREDICTION SP Maurer D, 1995, INT J NEUROSCI, V83, P25 Rabiner L.R., 1978, DIGITAL PROCESSING S REPP BH, 1987, SPEECH COMMUN, V6, P1, DOI 10.1016/0167-6393(87)90065-3 Snell RC, 1993, IEEE T SPEECH AUDI P, V1, P129, DOI 10.1109/89.222882 VALLABHA GK, UNPUB CHOICE FILTER Welling L, 1998, IEEE T SPEECH AUDI P, V6, P36, DOI 10.1109/89.650308 NR 15 TC 29 Z9 30 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 141 EP 160 DI 10.1016/S0167-6393(01)00049-8 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900010 ER PT J AU McLoughlin, I Ding, ZQ Tan, EC AF McLoughlin, I Ding, ZQ Tan, EC TI Intelligibility evaluation of GSM coder for Mandarin speech using CDRT SO SPEECH COMMUNICATION LA English DT Article DE speech coder; GSM; Chinese Diagnostic Rhyme Test; intelligibility ID CONFUSIONS AB The Global System for Mobile communications (GSM) speech-coding algorithm is evaluated by using the newly proposed Chinese Diagnostic Rhyme Test (CDRT) to determine its suitability for coding Mandarin speech. Experimental results show that the algorithm is not effective for all phonetic features in Mandarin and it reduces the intelligibility of sibilated consonants. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Nanyang Technol Univ, Sch Comp Engn, Singapore 639798, Singapore. RP Tan, EC (reprint author), Nanyang Technol Univ, Sch Comp Engn, Nanyang Ave, Singapore 639798, Singapore. EM asectan@ntu.edu.sg RI McLoughlin, Ian/A-3674-2011 OI McLoughlin, Ian/0000-0001-7111-2008 CR ANSI, 1989, S32 ANSI DING ZQ, 2001, P IEEE PAC RIM C COM DUBNO JR, 1981, J ACOUST SOC AM, V69, P249, DOI 10.1121/1.385345 *ETSI, 1998, 300961 ETSI HANSEN JHL, 1995, J ACOUST SOC AM, V97, P609, DOI 10.1121/1.412283 Li Z, 2000, IEE P-VIS IMAGE SIGN, V147, P254, DOI 10.1049/ip-vis:20000189 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 VOIERS WD, 1983, SPEECH TECHNOL, P30 ZHANG JL, 1994, P INT S SPEECH IM PR, P117 1999, INTRO DYNASTAT NR 10 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 161 EP 165 DI 10.1016/S0167-6393(01)00050-4 PG 5 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900011 ER PT J AU Ainsworth, WA AF Ainsworth, WA TI Factors affecting the identification of concurrent and overlapping approximants SO SPEECH COMMUNICATION LA English DT Article DE perception; approximants; syllables; concurrent ID PERCEPTUAL SEPARATION; FORMANT TRANSITIONS; VOWELS AB It is possible to recognise a message spoken by one voice in the presence of another voice, but how this is done is not currently understood. A number of studies have shown that two simultaneous vowel sounds may be identified more accurately when their fundamental frequencies (F0s) differ sufficiently than when they are the same.
Four experiments are described which explore the perception of the approximants /w/ and /j/ spoken simultaneously. These consonants cannot be spoken in isolation, like vowel sounds, but must be combined with vowels to form syllables. It was found that identification of consonants in simultaneous syllables depended upon the vowels in the syllables, their F0s and their relative amplitudes. If both the vowels and their F0s differed and their amplitudes were the same, both consonants could be identified more reliably (Experiments 1 and 2). When syllables had different vowels and inharmonically related F0s the identification of the consonants did not depend on the value of the F0 difference (Experiment 3). If one syllable was delayed relative to the other it became easier to identify both syllables as the delay was increased, although the identification scores did not increase to any greater extent when the F0s of the syllables were different (Experiment 4). The experiments suggest that F0 differences in isolated syllables are only employed for separating voiced consonants when other factors are neutralised. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Keele, Sch Life Sci, MacKay Inst Commun & Neurosci, Ctr Human & Machine Percept Res, Keele ST5 5BG, Staffs, England. RP Ainsworth, WA (reprint author), Univ Keele, Sch Life Sci, MacKay Inst Commun & Neurosci, Ctr Human & Machine Percept Res, Keele ST5 5BG, Staffs, England. CR Ainsworth WA, 1997, SPEECH COMMUN, V21, P273, DOI 10.1016/S0167-6393(97)00010-1 AINSWORTH WA, 1997, P EUR 97, V4, P2115 AINSWORT.WA, 1968, J ACOUST SOC AM, V44, P689, DOI 10.1121/1.1911162 ASSMANN PF, 1995, J ACOUST SOC AM, V97, P575, DOI 10.1121/1.412281 ASSMANN PF, 1990, J ACOUST SOC AM, V88, P680, DOI 10.1121/1.399772 BERTHOMMIER F, 1995, P EUR 95 MADR, P135 Bird J., 1998, PSYCHOPHYSICAL PHYSL, P263 Blauert J., 1983, SPATIAL HEARING BREGMAN AL, 1990, AUDITORY SCENE ANAL, P584 BROKX JPL, 1982, J PHONETICS, V10, P23 BROKX JPL, 1979, IPO ANN PROG REP, V14, P55 CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 CULLING JF, 1993, J ACOUST SOC AM, V93, P3454, DOI 10.1121/1.405675 DECHEVEIGNE A, 1995, J ACOUST SOC AM, V97, P3736 deCheveigne A, 1997, J ACOUST SOC AM, V101, P2839, DOI 10.1121/1.418517 Greenberg Steven, 1996, P ESCA WORKSH AUD BA, P1 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 MCKEOWN JD, 1992, SPEECH COMMUN, V11, P1 MEYER GF, 1997, P EUR 97, V5, P2495 MEYER GF, 1996, P ESCA WORKSH AUD BA, P212 OCONNOR JD, 1957, WORD, V13, P24 SHEFFERS MTM, 1983, THESIS GRONINGEN U N NR 22 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 167 EP 182 DI 10.1016/S0167-6393(01)00051-6 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900012 ER PT J AU Wu, CH Yan, GL Lin, CL AF Wu, CH Yan, GL Lin, CL TI Speech act modeling in a spoken dialog system using a fuzzy fragment-class Markov model SO SPEECH COMMUNICATION LA English DT Article DE spoken language; speech act; fuzzy fragment-class Markov model; verification AB In a spoken dialog system, it is an important problem for the computer to identify the speech act (SA) from a user's utterance due to the variability of spoken language. 
In this paper, a corpus-based fuzzy fragment-class Markov model (FFCMM) is proposed to model the syntactic characteristics of a speech act and used to choose the speech act candidates. A speech act verification process, which estimates the conditional probability of a speech act given a sequence of fragments, is used to verify the speech act candidate. Most of the main design procedures are statistical and corpus-based, which reduces manual work. In order to evaluate the proposed method, a spoken dialog system for air travel information service (ATIS) is investigated. The experiments were carried out using a test database from 25 speakers (15 male and 10 female). There are 480 dialogs, containing 3038 sentences, in the test database. The experimental results show that the speech act identification rate can be improved by 10.5% using the FFCMM and speech act verification with a rejection rate of 6% compared to a baseline system. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. RP Wu, CH (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, 1 Ta Hsueh Rd, Tainan 70101, Taiwan. RI Wu, Chung-Hsien/E-7970-2013 CR ALLEN J, 1994, NATURAL LANGUAGE UND, P542 Arai K, 1999, SPEECH COMMUN, V27, P43, DOI 10.1016/S0167-6393(98)00065-X BENNACEF S, 1996, P ICSLP, V1, P550, DOI 10.1109/ICSLP.1996.607176 CHIANG TH, 1998, P COTEC 98 TAIP Jelinek F., 1990, P IEEE ICASSP1990 AL, V1, P621 KIM H, 1999, FUZZ SYST C P 1999 F, V2, P598 LAI YS, 2000, INT J COMPUTER PROCE, V13, P83, DOI 10.1142/S0219427900000041 LEE CJ, 1997, P 1997 WORKSH DIST S, P197 Martin T, 1998, PERS INDIV DIFFER, V24, P1, DOI 10.1016/S0191-8869(97)00143-8 MENG H, 1996, P INT C SPOK LANG PR, V1, P542, DOI 10.1109/ICSLP.1996.607174 Patterson D. W., 1990, INTRO ARTIFICIAL INT, P107 Pieraccini Roberto, 1992, P IEEE ICASSP, V1, P193 RICCARDI G, 2000, SPEECH AUDIO PROCESS, V81, P3 Saeki M, 1996, PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON REQUIREMENTS ENGINEERING, P21, DOI 10.1109/ICRE.1996.491426 SEIDE F, 1997, P EUROSPEECH, V3, P1327 TRAN D, 1998, P ICSLP98 SYDN AUSTR Wang H.C., 1997, P ROCLING 10 INT C, P325 WRIGHT JH, 1997, P 5 EUR C SPEECH COM, P1419 WU CH, 1999, P ICASSP 99 PHOEN US WU CH, 1998, P ICSLP98 SYDN AUSTR ZIMMERMANN HJ, 1991, FUZZY SET THEORY ITS, P230 NR 21 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 183 EP 199 DI 10.1016/S0167-6393(01)00052-8 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900013 ER PT J AU Yang, CS Stone, M AF Yang, CS Stone, M TI Dynamic programming method for temporal registration of three-dimensional tongue surface motion from multiple utterances SO SPEECH COMMUNICATION LA English DT Article DE reconstruction of 3D tongue surface; 3D tongue surface motion; ultrasound imaging; dynamic programming ID SPEECH RECOGNITION; WORD RECOGNITION; ALGORITHM AB This study proposes a new method to reconstruct three-dimensional (3D) tongue surface motion during speech using only a few sections of the tongue measured with ultrasound imaging. Reconstruction of static 3D tongue surfaces has been reported. This is the first report for reconstruction of 3D tongue surface motion using ultrasound imaging.
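Returning to the Wu, Yan and Lin record above: the fragment-class Markov idea can be reduced to one bigram model over fragment classes per speech act, with the act chosen by maximum sequence probability. The sketch below keeps only that non-fuzzy core; the fuzzy class memberships and the verification stage of the FFCMM are omitted, and the add-one smoothing is an assumption.

import numpy as np

def train_sa_models(corpus, n_classes):
    """Estimate one fragment-class bigram (Markov) model per speech act.

    corpus: dict mapping speech-act label -> list of fragment-class
            index sequences observed for that act in the training data.
    Returns a dict of (initial, transition) probability tables.
    """
    models = {}
    for act, seqs in corpus.items():
        init = np.ones(n_classes)                  # add-one smoothing
        trans = np.ones((n_classes, n_classes))
        for seq in seqs:
            init[seq[0]] += 1
            for a, b in zip(seq, seq[1:]):
                trans[a, b] += 1
        models[act] = (init / init.sum(),
                       trans / trans.sum(axis=1, keepdims=True))
    return models

def identify_speech_act(models, seq):
    """Pick the act whose Markov model gives the fragment sequence the
    highest log-probability."""
    def logprob(model):
        init, trans = model
        lp = np.log(init[seq[0]])
        for a, b in zip(seq, seq[1:]):
            lp += np.log(trans[a, b])
        return lp
    return max(models, key=lambda act: logprob(models[act]))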
To temporally align data from multiple scan locations, a dynamic programming (DP) algorithm was used to line up the tokens collected from different repetitions by using the acoustic signals recorded simultaneously with the ultrasound images. Reconstruction error was evaluated by using a pseudo-motion measurement of known 3D tongue shapes. The average error was 0.39 mm, which was within the ultrasound measurement error of 0.5 mm. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Maryland, Sch Dent, Dept Orthodont, Dept Oral & Craniofacial Biol Sci, Baltimore, MD 21201 USA. Symantec Corp, Hampton, VA 23666 USA. RP Stone, M (reprint author), Univ Maryland, Sch Dent, Dept Orthodont, Dept Oral & Craniofacial Biol Sci, 666 W Baltimore ST,5-A-12, Baltimore, MD 21201 USA. EM mstone@umaryland.edu CR Akgul YS, 1999, IEEE T MED IMAGING, V18, P1035, DOI 10.1109/42.811315 DANG J, 1997, SP97119 IEICE, P9 ITAKURA F, 1975, IEEE T ACOUST SPEECH, VAS23, P67, DOI 10.1109/TASSP.1975.1162641 Lundberg AJ, 1999, J ACOUST SOC AM, V106, P2858, DOI 10.1121/1.428110 Ney H, 1999, IEEE SIGNAL PROC MAG, V16, P64, DOI 10.1109/79.790984 PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204 PERKELL JS, 1969, RES MONOGRAPH MIT, V53 Rabiner L, 1993, FUNDAMENTALS SPEECH Rabiner L.R., 1978, DIGITAL PROCESSING S SAKOE H, 1979, IEEE T ACOUST SPEECH, V27, P588, DOI 10.1109/TASSP.1979.1163310 SAKOE H, 1978, IEEE T ACOUST SPEECH, V26, P43, DOI 10.1109/TASSP.1978.1163055 Stone M, 1995, J ACOUST SOC AM, V98, P3107, DOI 10.1121/1.413799 Stone M, 1996, J ACOUST SOC AM, V99, P3728, DOI 10.1121/1.414969 STRIK H, 1991, J PHONETICS, V19, P367 NR 14 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2002 VL 38 IS 1-2 BP 201 EP 209 DI 10.1016/S0167-6393(01)00053-X PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900014 ER PT J AU Pallett, DS Lamel, L AF Pallett, DS Lamel, L TI Automatic transcription of broadcast news data (vol 37, pg 1, 2002) SO SPEECH COMMUNICATION LA English DT Correction C1 CNRS, LIMSI, Spoken Language Proc Grp, F-91403 Orsay, France. Natl Inst Stand & Technol, Gaithersburg, MD 20899 USA. RP Pallett, DS (reprint author), CNRS, LIMSI, Spoken Language Proc Grp, BP 133, F-91403 Orsay, France. EM david.pallett@nist.gov; lamel@limsi.fr CR Nguyen L, 2002, SPEECH COMMUN, V38, P213, DOI 10.1016/S0167-6393(02)00050-X Pallett DS, 2002, SPEECH COMMUN, V37, P1, DOI 10.1016/S0167-6393(01)00055-3 NR 2 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD SEP PY 2002 VL 38 IS 1-2 BP 211 EP 211 DI 10.1016/S0167-6393(02)00051-1 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900015 ER PT J AU Nguyen, L Matsoukas, S Davenport, J Kubala, F Schwartz, R Makhoul, J AF Nguyen, L Matsoukas, S Davenport, J Kubala, F Schwartz, R Makhoul, J TI Progress in transcription of broadcast news using Byblos SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; broadcast news transcription; hidden Markov models; acoustic modeling; adaptation; search algorithms; single-tree fast-match; fast Gaussian computation; grammar spreading AB In this paper, we describe our progress during the last four years (1995-1999) in automatic transcription of broadcast news from radio and television using the BBN Byblos speech recognition system. Overall, we achieved steady progress as reflected through the results of the last four DARPA Hub-4 evaluations, with word error rates of 42.7%, 31.8%, 20.4% and 14.7% in 1995, 1996, 1997 and 1998, respectively. This progress can be attributed to improvements in acoustic modeling, channel and speaker adaptation, and search algorithms, as well as dealing with specific characteristics of the real-life variable speech found in broadcast news. Besides improving recognition accuracy, we also succeeded in developing several algorithms to achieve close-to-real-time recognition speed without a significant sacrifice in recognition accuracy. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Verizon Commun, BBN Technol, Cambridge, MA 02138 USA. RP Nguyen, L (reprint author), Verizon Commun, BBN Technol, 70 Fawcett St, Cambridge, MA 02138 USA. EM ln@bbn.com CR ACERO A, 1995, P IEEE AUT SPEECH RE, P147 ANASTASAKOS T, 1996, P ICSLP 96 PHIL PA O Austin S., 1991, P IEEE INT C AC SPEE, P697, DOI 10.1109/ICASSP.1991.150435 BAUM LE, 1970, ANN MATH STAT, V41, P164, DOI 10.1214/aoms/1177697196 BILLA J, 1999, P ICASSP 99, V1, P41 DAVENPORT J, 1999, P DARPA BROADC NEWS, P261 DAVENPORT J, 1999, P EUR 99 BUD HUNG SE, P651 DAVENPORT J, 1999, P IEEE ICASSP 99 PHO, P613 FISCUS JG, 1997, WORKSH LINTH MAR MAY Garofolo J. S., 1997, P DARPA SPEECH REC W, P15 HWANG MY, 1993, P IEEE ICASSP 93 MIN, V2, P311 JIN H, 1998, P DARPA BROADC NEWS, P105 KUBALA F, 1997, P EUR 97 RHOD GREEC, P927 Kubala F, 1997, P SPEECH REC WORKSH, P90 KUBALA F, 1996, P DARPA SPEECH REC W, P55 KUBALA F, 1998, P DARPA BROADC NEWS, P35 Leggetter C.J., 1994, CUEDFINFENGTR181 LIU D, 1998, P DARPA BROADC NEWS, P123 LIU D, 1999, P ESCA EUR 99 BUD HU, V3, P1031 Lowerre B. T., 1976, THESIS CARNEGIE MELL MATSOUKAS S, 1999, P DARPA BROADCAST NE, P255 MATSOUKAS S, 1997, P DARPA SPEECH REC W, P133 Ney H., 1992, P IEEE INT C AC SPEE, V1, P9 Nguyen L., 1993, P DARPA HUM LANG TEC, P91, DOI 10.3115/1075671.1075692 NGUYEN L, 1999, P IEEE INT C AC SPEE, V2, P613 NGUYEN L, 1995, P ARPA SPOK LANG SYS, P77 Nguyen L., 1997, P EUR 97 RHOD GREEC, P167 NGUYEN L, 1999, P EUR 99 BUD HUNG SE PADMANABHAN M, 1997, P 1997 IEEE WORKSH A, P325 Placeway P., 1993, P INT C AC SPEECH SI, P33 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 SCHWARTZ R, 1996, MULTIPLE PASS SEARCH, P429 NR 32 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD SEP PY 2002 VL 38 IS 1-2 BP 213 EP 230 DI 10.1016/S0167-6393(02)00050-X PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 596KW UT WOS:000178164900016 ER PT J AU Probst, K Ke, Y Eskenazi, M AF Probst, K Ke, Y Eskenazi, M TI Enhancing foreign language tutors - In search of the golden speaker SO SPEECH COMMUNICATION LA English DT Article DE language tutor; pronunciation correction; foreign language teaching ID SPEECH AB In the past, educators relied on classroom observation to determine the relevance of various pedagogical techniques. Automated language learning now allows us to examine pedagogical questions in a much more rigorous manner. We can use a computer-assisted language learning (CALL) system as a base, tracing all user responses and controlling the information given out. We have thus used the Fluency system [Proceedings of Speech Technology in Language and Learning, 1998, p. 77] to answer the question of what voice a language learner should imitate when working on pronunciation. In this article, we will examine whether there should be a choice of model speakers and what characteristics of a model's voice may be important to match when there is a choice. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Carnegie Mellon Univ, Language Technol Inst, Pittsburgh, PA 15213 USA. RP Probst, K (reprint author), Carnegie Mellon Univ, Language Technol Inst, 5000 Forbes Ave, Pittsburgh, PA 15213 USA. EM kathrin@cs.cmu.edu; yke@cs.cmu.edu; max@cs.cmu.edu CR Aleven V., 2000, P 5 INT C INT TUT SY, P292 AYAMADA R, 1996, P 1996 INT C SPOK LA, P606 BOND ZS, 1994, SPEECH COMMUN, V14, P325, DOI 10.1016/0167-6393(94)90026-4 Child J., 1987, DEFINING DEV PROFICI, P97 Cucchiarini C., 1998, P 1998 INT C SPOK LA, P1739 EGAN KB, 1998, P SPEECH TECHN LANG, P13 Eskenazi M., 2000, P INSTIL 2000 INT SP, P73 Eskenazi M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607892 Eskenazi M, 1998, P SPEECH TECHN LANG, P77 Eskenazi M., 1999, LANGUAGE LEARNING TE, V2, P62 Markham D. J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607894 MASSARO DW, 1993, J PHONETICS, V21, P445 MCALLISTER R, 1996, KTH SPEECH MUSIC HEA, V2, P69 MCALLISTER R, 1998, P SPEECH TECHN LANG, P155 Munakata Y, 1997, PSYCHOL REV, V104, P686, DOI 10.1037/0033-295X.104.4.686 Neumeyer L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607890 PLAUT DC, 1996, PSYCHOL REV, V103, P115 RAVISHANKUR M, 1996, THESIS CARNEGIE MELL, P96 RYPA M, 1996, P CALICO 1996 TOMOKIYO LM, 2000, P ICSLP BEIJ OCT, P62 NR 20 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL PY 2002 VL 37 IS 3-4 BP 161 EP 173 DI 10.1016/S0167-6393(01)00009-7 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600001 ER PT J AU Broad, DJ Clermont, F AF Broad, DJ Clermont, F TI Linear scaling of vowel-formant ensembles (VFEs) in consonantal contexts SO SPEECH COMMUNICATION LA English DT Article DE vowel; coarticulation; context; formant; scaling ID LOCUS EQUATIONS; COARTICULATION; FREQUENCY; MODEL; NUCLEI AB There are familiar terms such as "contour" and "trajectory" to refer to a vowel formant frequency as a function defined on the time axis, but there is no readily understood term for the analogous idea of how a formant behaves on the "vowel axis". For this we introduce the concept of a vowel-formant ensemble (VFE) as the set of values realized for a given formant (e.g., F-2) in going from vowel to vowel among a speaker's vowel phonemes for a fixed time frame in a fixed CVC context. The VFE affords a simple description of our development: we observe that D.J. Broad and F. Clermont's [J. Acoust. Soc. Am. 81 (1987) 155] formant-contour model is a linear function of its vowel target and that as a consequence all its VFEs for a given speaker and formant number are linearly scaled copies of one another. Are VFEs in actual speech also linearly scaled? To show how this question can be addressed, we use F-1 and F-2 data on one male speaker's productions of 7 Australian English vowels in 7 CVd contexts, with each CVd repeated 5 times. Our hypothesized scaling relation gives a remarkably good fit to these data, with a residual rms error of only about 14 Hz for either formant after discounting random variations among repetitions. The linear scaling implies a type of normalization for context which shrinks the intra-vowel scatter in the F-1, F-2 plane. VFE scaling is also a new tool which should be useful for showing how contextual effects vary over the duration of the syllable's vocalic nucleus. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ New S Wales, Univ Coll, Sch Comp Sci, Canberra, ACT 2600, Australia. RP Broad, DJ (reprint author), 2638 State St,Unit 12, Santa Barbara, CA 93105 USA. EM djbroad@silcom.com; frantz@cs.adfa.edu.au CR Bernard J. R. L., 1970, Z PHONETIK, V23, P113 BROAD D J, 1984, Journal of the Acoustical Society of America, V76, pS14, DOI 10.1121/1.2021722 BROAD DJ, 1970, J ACOUST SOC AM, V47, P1572, DOI 10.1121/1.1912090 BROAD DJ, 1976, PHONETICA, V33, P401 BROAD DJ, 1987, J ACOUST SOC AM, V81, P155, DOI 10.1121/1.395025 CLERMONT F, 1993, SPEECH COMMUN, V13, P377, DOI 10.1016/0167-6393(93)90036-K Clermont F., 1991, THESIS AUSTR NATL U DELATTRE PC, 1955, J ACOUST SOC AM, V27, P769, DOI 10.1121/1.1908024 Fant C. G.
M., 1960, ACOUSTIC THEORY SPEE FLANAGAN JL, 1955, J ACOUST SOC AM, V27, P613, DOI 10.1121/1.1907979 HOUDE RA, 1968, SCRL MONOGRAPH, V2 LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 Lofqvist A, 1999, J ACOUST SOC AM, V106, P2022, DOI 10.1121/1.427948 MCCANDLE.SS, 1974, IEEE T ACOUST SPEECH, VSP22, P135, DOI 10.1109/TASSP.1974.1162559 OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 Scheffe H., 1959, ANAL VARIANCE, P56 STEVENS KN, 1966, J ACOUST SOC AM, V40, P123, DOI 10.1121/1.1910027 STEVENS KN, 1963, J SPEECH HEAR RES, V6, P111 SUSSMAN HM, 1991, J ACOUST SOC AM, V90, P1309, DOI 10.1121/1.401923 VANBERGEM DR, 1994, SPEECH COMMUN, V14, P143, DOI 10.1016/0167-6393(94)90005-1 VANSON RJJH, 1992, J ACOUST SOC AM, V92, P121, DOI 10.1121/1.404277 Yang HH, 2000, SPEECH COMMUN, V31, P35, DOI 10.1016/S0167-6393(00)00007-8 NR 23 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2002 VL 37 IS 3-4 BP 175 EP 195 DI 10.1016/S0167-6393(01)00010-3 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600002 ER PT J AU Thomson, DL Chengalvarayan, R AF Thomson, DL Chengalvarayan, R TI Use of voicing features in HMM-based speech recognition SO SPEECH COMMUNICATION LA English DT Article DE voiced and unvoiced speech; voicing; autocorrelation function; periodicity; jitter; hierarchical signal bias removal; discriminative training; hidden Markov models; cepstral mean subtraction; speech recognition features ID HIDDEN MARKOV-MODELS; CLASSIFICATION; ALGORITHM; JITTER; PITCH AB We investigate speech recognition features related to voicing functions that indicate whether the vocal folds are vibrating. We describe two voicing features, periodicity and jitter, and demonstrate that they are powerful voicing discriminators. The periodicity and jitter features and their first and second time derivatives are appended to a standard 38-dimensional feature vector comprising the first and second time derivatives of the frame energy and the cepstral coefficients with their first and second time derivatives. HMM-based connected-digit (CD) and large-vocabulary (LV) recognition experiments comparing the traditional and extended feature sets show that voicing features and spectral information are complementary and that improved speech recognition performance is obtained by combining the two sources of information. We further conclude that the difference in performance with and without voicing becomes more significant when minimum string error (MSE) training is used than when maximum likelihood (ML) training is used. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Lucent Technol Inc, Lucent Speech Solut, Naperville, IL 60566 USA. RP Thomson, DL (reprint author), Lucent Technol Inc, Lucent Speech Solut, 200 N Naperville Rd, Naperville, IL 60566 USA. EM davidt@lucent.com CR ATAL BS, 1976, IEEE T ACOUST SPEECH, V24, P201, DOI 10.1109/TASSP.1976.1162800 Bocchieri E. 
L., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1012 Bourlard H., 1997, P ICASSP, P1251 Chang HM, 2000, SPEECH COMMUN, V31, P293, DOI 10.1016/S0167-6393(99)00063-1 Chengalvarayan R, 1998, INT CONF ACOUST SPEE, P17, DOI 10.1109/ICASSP.1998.674356 CHOU W, 1995, P EUR C SPEECH COMM, P495 DAS S, 1999, P EUROSPEECH, P1959 Furui S., 1997, P ESCA NATO TUT RES, P11 GUPTA SK, 1996, P ICASSP, P57 HAGGARD M, 1970, J ACOUST SOC AM, V47, P613, DOI 10.1121/1.1911936 Hanson B. A., 1990, P ICASSP, P857 Hess W., 1983, PITCH DETERMINATION JANKOWSKI CR, 1995, IEEE T SPEECH AUDI P, V3, P286, DOI 10.1109/89.397093 Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 JUANG BH, 1990, IEEE T ACOUST SPEECH, V38, P1639, DOI 10.1109/29.60082 Katagiri S, 1998, P IEEE, V86, P2345, DOI 10.1109/5.726793 Murphy PJ, 2000, J ACOUST SOC AM, V107, P978, DOI 10.1121/1.428272 PREZAS DP, 1986, P ICASSP, P109 RABINER LR, 1985, AT&T TECH J, V64, P1211 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Rabiner L.R., 1978, DIGITAL PROCESSING S Rahim MG, 1996, IEEE SIGNAL PROC LET, V3, P107, DOI 10.1109/97.489062 Rouat J, 1997, SPEECH COMMUN, V21, P191, DOI 10.1016/S0167-6393(97)00002-2 Schoentgen J, 1997, SPEECH COMMUN, V21, P255, DOI 10.1016/S0167-6393(97)00008-3 SOONG FK, 1991, P INT C AC SPEECH SI, V1, P705 Sukkar RA, 1996, IEEE T SPEECH AUDI P, V4, P420, DOI 10.1109/89.544527 THOMSON DL, 1997, P ASRU 97 WORKSH, P511 THOMSON DL, 1998, P IEEE INT C AC SPEE, V1, P21, DOI 10.1109/ICASSP.1998.674357 VALTCHEV V, 1996, P INT C AC SPEECH SI, V2, P605 NR 29 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2002 VL 37 IS 3-4 BP 197 EP 211 DI 10.1016/S0167-6393(01)00011-5 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600003 ER PT J AU Gagne, JP Rochette, AJ Charest, M AF Gagne, JP Rochette, AJ Charest, M TI Auditory, visual and audiovisual clear speech SO SPEECH COMMUNICATION LA English DT Article DE speech intelligibility; clear speech; intraspeaker variability; interspeaker variability; auditory-speech perception; visual-speech perception; audiovisual-speech perception ID HARD-OF-HEARING; CONVERSATIONAL SPEECH; SPEAKING RATE; INTELLIGIBILITY; PERCEPTION; ARTICULATION; LISTENERS; DURATION AB The speech intelligibility of syllables spoken under conditions of conversational and clear speech was compared. The stimuli were 18 monosyllables (/C-v/) and 18 bisyllables (/v-C-v/) consisting of six voiced consonants (/b, d, g, v, z, ʒ/) presented in each of three vowel contexts (/a, i, y/). Six female adults were recorded while they produced four iterations of the stimulus set in each of the two speaking styles. The 1728 videotaped test items were edited, randomized and presented to 12 subjects with normal hearing and normal visual acuity under three conditions: visual-only, auditory-only and audiovisually. A broadband noise was mixed with the signal for the latter two conditions. The results revealed a significant three-way interaction of talker, speaking style and perceptual modality. Post-hoc analyses revealed intra- and interspeaker differences in speech intelligibility for both speaking styles, in all three perceptual modalities. Overall, positive clear speech effects were observed in all three modalities. Intermodality comparisons revealed differences in the pattern of clear speech effects displayed by individual talkers.
This finding indicates that there is not a direct association between the beneficial effects of clear speech in one perceptual modality and its effects on speech intelligibility in another perceptual modality. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Montreal, Ecole Orthophonie & Audiol, Montreal, PQ, Canada. Ctr Hosp Cotes des Neiges, Dept Audiol, Montreal, PQ, Canada. Purdue Univ, Dept Audiol & Speech Sci, W Lafayette, IN 47907 USA. RP Gagne, JP (reprint author), Univ Montreal, Ecole Orthophonie & Audiol, CP 6128,Succurslae Ctr Ville, Montreal, PQ, Canada. EM jean-pierre.gagne@umontreal.ca CR ABBS JH, 1986, INVARIANCE VARIABILI ADJOUDANI A, 1996, NATO ASI SER, P315 Anderson TW, 1984, INTRO MULTIVARIATE S *ANSI, 1969, S361969 ANSI BENOIT C, 1996, NATO ASI SER, P461 BINNIE CA, 1974, J SPEECH HEAR RES, V17, P619 BRAIDA LD, 1991, Q J EXP PSYCHOL-A, V43, P647 BRANDY WT, 1966, J SPEECH HEAR RES, V9, P461 COX RM, 1987, J ACOUST SOC AM, V81, P1598, DOI 10.1121/1.394512 CREELMAN CD, 1957, J ACOUST SOC AM, V29, P655, DOI 10.1121/1.1909003 CRYSTAL TH, 1990, J ACOUST SOC AM, V88, P101, DOI 10.1121/1.399955 ERBER NP, 1974, J SPEECH HEAR RES, V17, P99 Erber NP, 1996, COMMUNICATION THERAP FOLKINS JW, 1975, J SPEECH HEAR RES, V18, P207 Fowler CA, 1996, J ACOUST SOC AM, V99, P1730, DOI 10.1121/1.415237 Gagne J. P., 1994, J ACAD REHABIL AUDIO, V27, P135 GAGNE JP, 1994, J ACAD REHABILITATIV, V27, P133 GAGNE JP, 1995, VOLTA REV, V97, P33 GAGNE JP, 2000, AUDIOLOGY TREATMENT, P547 GARSTECKI DC, 1983, J ACAD REHAB AUD, V16, P222 GRANT KW, 1991, J ACOUST SOC AM, V89, P2952, DOI 10.1121/1.400733 Harris R. J., 1975, PRIMER MULTIVARIATE Helfer KS, 1997, J SPEECH LANG HEAR R, V40, P432 HOOD JD, 1980, AUDIOLOGY, V19, P434 JOHNSON K, 1993, J ACOUST SOC AM, V94, P701, DOI 10.1121/1.406887 KRAUSE JC, 1995, THESIS MIT CAMBRIDGE LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 Lindblom B, 1996, J ACOUST SOC AM, V99, P1683, DOI 10.1121/1.414691 Massaro D. W., 1987, SPEECH PERCEPTION EA MASSARO DW, 1990, PSYCH SCI, V1, P1 MATTINGLY IG, 1989, MODULARITY MOTOR THE MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 PAYTON KL, 1994, J ACOUST SOC AM, V95, P1581, DOI 10.1121/1.408545 PERKELL JS, 1992, J ACOUST SOC AM, V91, P2911, DOI 10.1121/1.403778 PICHENY MA, 1989, J SPEECH HEAR RES, V32, P600 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434 Schum D J, 1996, J Am Acad Audiol, V7, P212 Schum DJ, 1997, HEARING J, V50, P36 STUDEBAKER GA, 1985, J SPEECH HEAR RES, V28, P455 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 SUMMERFIELD Q, 1987, HEARING EYE SUMMERFIELD Q, 1989, MODULARITY MOTOR THE TYEMURRAY N, 1994, J ACAD REHABILITATIV, V27, P209 UCHANSKI RM, 1992, AUDITORY PROCESSING Uchanski RM, 1996, J SPEECH HEAR RES, V39, P494 WALDEN BE, 1975, J SPEECH HEAR RES, V18, P272 NR 47 TC 23 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUL PY 2002 VL 37 IS 3-4 BP 213 EP 230 DI 10.1016/S0167-6393(01)00012-7 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600004 ER PT J AU Kim, NS AF Kim, NS TI Feature domain compensation of nonstationary noise for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE robust speech recognition; nonstationary noise; interacting multiple model; fixed-interval smoothing ID ENVIRONMENT COMPENSATION; MARKOV-MODELS; SIGNAL AB One of the key issues in practical speech recognition is to achieve robustness against the environmental mismatches resulting from the background noises or different channels. Most of the conventional approaches have tried to compensate for the effects of such mismatches based on the assumption that the environmental characteristics are stationary, which, however, is far from the real observation. In this paper, we propose an approach to cope with time-varying environmental characteristics. With a direct modeling of the environment evolution process and the clean speech feature distribution, we construct a set of multiple linear state space models. Suboptimal state estimation under the given model structure can be efficiently performed with the interacting multiple model (IMM) algorithm. In addition to providing a comprehensive description of the compensation technique, we propose an adaptive Kalman filtering approach with which nonstationary noise evolution characteristics can be tracked. Moreover, we propose a novel way to do fixed-interval smoothing within the IMM framework. Performance of the presented compensation technique in both slowly and rapidly varying noise conditions is evaluated through a number of continuous digit recognition experiments. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Seoul Natl Univ, Sch Elect Engn, Seoul 151742, South Korea. RP Kim, NS (reprint author), Seoul Natl Univ, Sch Elect Engn, Kwanak POB 34, Seoul 151742, South Korea. EM nkim@snu.ac.kr CR Acero A., 1992, ACOUSTICAL ENV ROBUS BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 EPHRAIM Y, 1989, IEEE T ACOUST SPEECH, V37, P1846, DOI 10.1109/29.45532 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Erell A, 1993, IEEE T SPEECH AUDI P, V1, P68, DOI 10.1109/89.221385 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 GISH H, 1990, P ICASSP 90, P117 GUTMAN PO, 1990, IEEE T AERO ELEC SYS, V26, P691, DOI 10.1109/7.102704 Helmick RE, 1995, IEEE T INFORM THEORY, V41, P1845, DOI 10.1109/18.476310 Juang B.
H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E Junqua J.C., 1996, ROBUSTNESS AUTOMATIC Kim DY, 1998, SPEECH COMMUN, V24, P39, DOI 10.1016/S0167-6393(97)00061-7 Kim NS, 1998, IEEE SIGNAL PROC LET, V5, P57 Kim NS, 1998, IEEE SIGNAL PROC LET, V5, P8 KIM NS, 1997, P ESCA WORKSH ROB SP, P99 Kim NS, 2000, IEEE SIGNAL PROC LET, V7, P108 Kim NS, 1998, IEEE SIGNAL PROC LET, V5, P146 LEGGETTER CJ, 1995, THESIS CAMBRIDGE U U Moreno P.J., 1996, THESIS CARNEGIE MELL NEUMEYER L, 1994, P ICASSP, V1, P417 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Rose RC, 1994, IEEE T SPEECH AUDI P, V2, P245, DOI 10.1109/89.279273 SAGAYAMA S, 1997, P IEEE INT C AC SPEE, P835 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 Varga A.P., 1990, P ICASSP, P845 NR 25 TC 16 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2002 VL 37 IS 3-4 BP 231 EP 248 DI 10.1016/S0167-6393(01)00013-9 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600005 ER PT J AU Veprek, P Scordilis, MS AF Veprek, P Scordilis, MS TI Analysis, enhancement and evaluation of five pitch determination techniques SO SPEECH COMMUNICATION LA English DT Article DE speech analysis; pitch determination; speech segmentation ID COMPARATIVE PERFORMANCE; SPEECH SIGNALS; ALGORITHM; PERIOD AB Speech classification into voiced and unvoiced (or silent) portions is important in many speech processing applications. In addition, segmentation of voiced speech into individual pitch epochs is necessary in several high quality speech synthesis and coding techniques. This paper introduces criteria for measuring the performance of automatic procedures performing this task against manually segmented and labeled data. First, five basic pitch determination algorithms (PDAs) (SIFT, comb filter energy maximization, spectrum decimation/accumulation, optimal temporal similarity and dyadic wavelet transform) are evaluated and their performance is analyzed. A set of enhancements is then developed and applied to the basic algorithms, which yields superior performance by virtually eliminating multiple and sub-multiple pitch assignment errors and reducing all other errors. Evaluation shows that the enhancements improved performance of all five PDAs with the improvement ranging from 3.5% for the comb filter energy maximization method to 8.3% for the dyadic wavelet transform method. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Miami, Dept Elect & Comp Engn, Coral Gables, FL 33124 USA. RP Scordilis, MS (reprint author), Univ Miami, Dept Elect & Comp Engn, 1251 Mem Dr,Rm 406, Coral Gables, FL 33124 USA. EM m.scordilis@miami.edu CR CHENG YM, 1989, IEEE T ACOUST SPEECH, V37, P1805, DOI 10.1109/29.45529 DALESSANDRO C, 1995, P 1995 INT C AC SPEE Deller J. R., 1993, DISCRETE TIME PROCES DIFRANCESCO R, 1989, P EUROSPEECH, P39 DUBNOWSKI JJ, 1976, IEEE T ACOUST SPEECH, V24, P2, DOI 10.1109/TASSP.1976.1162765 DUIFHUIS H, 1982, J ACOUST SOC AM, V71, P1568, DOI 10.1121/1.387811 DUTOIT T, 1994, P 1994 INT C AC SPEE GRAYDEN DB, 1994, P 5 AUSTR INT C SPEE, P473 GRAYDEN DB, 1994, P 1994 INT C AC SPEE Hess W., 1983, PITCH DETERMINATION Hess W.J., 1992, ADV SPEECH SIGNAL PR HUGGINS AWF, 1985, J ACOUST SOC AM, V77, P1896, DOI 10.1121/1.391941 KADAMBE S, 1991, P 1991 INT C AC SPEE KADAMBE S, 1992, IEEE T INFORM THEORY, V38, P917, DOI 10.1109/18.119752 Lim T. 
S., 1992, Proceedings of the Fourth Australian International Conference on Speech Science and Technology Mallat S., 1992, IEEE T PATTERN ANAL, V14, P7 MARKEL JD, 1972, IEEE T ACOUST SPEECH, VAU20, P367, DOI 10.1109/TAU.1972.1162410 MEDAN Y, 1991, IEEE T SIGNAL PROCES, V39, P40, DOI 10.1109/78.80763 MOULINES E, 1990, P 1990 INT C AC SPEE, V1, P309 O'Shaughnessy D, 2000, SPEECH COMMUNICATION, V2nd O'Brien D, 2001, IEEE T SPEECH AUDI P, V9, P11, DOI 10.1109/89.890067 PALIWAL KK, 1984, SPEECH COMMUN, V3, P253, DOI 10.1016/0167-6393(84)90020-7 RABINER LR, 1975, IEEE T ACOUST SPEECH, V23, P552, DOI 10.1109/TASSP.1975.1162749 RABINER LR, 1976, IEEE T ACOUST SPEECH, V24, P399, DOI 10.1109/TASSP.1976.1162846 RABINER LR, 1977, IEEE T ACOUST SPEECH, V25, P24, DOI 10.1109/TASSP.1977.1162905 SCHROEDE.MR, 1968, J ACOUST SOC AM, V43, P829, DOI 10.1121/1.1910902 Taori R., 1995, P IEEE INT C AC SPEE, V1, P512 Titze IR, 1994, PRINCIPLES VOICE PRO VEPREK P, 1996, THESIS U MELBOURNE Walpole RE, 1998, PROBABILITY STAT ENG NR 30 TC 22 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2002 VL 37 IS 3-4 BP 249 EP 270 DI 10.1016/S0167-6393(01)00017-6 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600006 ER PT J AU Zhang, ZP Furui, S Ohtsuki, K AF Zhang, ZP Furui, S Ohtsuki, K TI On-line incremental speaker adaptation for broadcast news transcription SO SPEECH COMMUNICATION LA English DT Article DE speaker adaptation; speaker-change detection; likelihood comparison; GMM (Gaussian mixture models); SA (speaker-adaptive) GMM ID MODELS AB This paper describes a new unsupervised, on-line and incremental speaker adaptation technique that improves the performance of speech recognition systems when there are frequent changes in speaker identity and each speaker utters a series of several sentences. The speaker change is detected using speaker-independent (SI) and speaker-adaptive (SA) Gaussian mixture models (GMMs), and both phone hidden Markov model (HMM) and GMM are adapted by maximum likelihood linear regression (MLLR) transformation. Using this method, the word error rate of a broadcast news transcription task was reduced by 10.0% relative to the results using the SI models. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, Tokyo 1528552, Japan. NTT Corp, Cyber Space Labs, Media Proc Project, Yokosuka, Kanagawa 2390847, Japan. RP Zhang, ZP (reprint author), Tokyo Inst Technol, Dept Comp Sci, Meguro Ku, 2-12-1 Ookayama, Tokyo 1528552, Japan. EM zzp@furui.cs.titech.ac.jp CR Acero A., 1990, P ICASSP, P849 COX SJ, 1989, P IEEE INT C AC SPEE, P294 De Brabandere K, 2007, P IEEE INT C AC SPEE, P1 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MATSUI T, 1992, P INT C AC SPEECH SI, V2, P157 MOREAU N, 2000, P IEEE INT C AC SPEE, P1807 OHKURA K, 1992, P ICSLP 92, P369 OHTSUKI K, 1999, P EUR C SPEECH COMM, P671 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 SCHWARTZ R, 1991, NATO ASI F, V75, P31 TSENG BL, 1992, P IEEE INT C AC SPEE, P164 ZHAO Y, 1993, P IEEE INT C AC SPEE, P592 NR 13 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD JUL PY 2002 VL 37 IS 3-4 BP 271 EP 281 DI 10.1016/S0167-6393(01)00018-8 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600007 ER PT J AU Markovic, MZ Milosavljevic, MM Kovacevic, BD AF Markovic, MZ Milosavljevic, MM Kovacevic, BD TI Quadratic classifier with sliding training data set in robust recursive AR speech analysis SO SPEECH COMMUNICATION LA English DT Article DE AR speech analysis; LP parameters; robust recursive estimation; quadratic classifier; sliding training data set ID LINEAR PREDICTION AB We propose a robust recursive procedure, based on a weighted recursive least squares (WRLS) algorithm with variable forgetting factor (VFF) and a quadratic classifier with sliding training data set, for identification of a nonstationary autoregressive (AR) model of the speech production system. Experimental evaluation is done using the results obtained by analyzing speech signals with voiced and mixed excitation frames. Experimental results have shown that the proposed robust recursive procedure achieves more accurate AR speech parameter estimates and provides improved tracking performance. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Inst Appl Math & Elect, YU-11000 Belgrade, Serbia Monteneg. Univ Belgrade, Fac Elect Engn, YU-11000 Belgrade, Serbia Monteneg. RP Markovic, MZ (reprint author), Inst Appl Math & Elect, Kneza Milosa 37, YU-11000 Belgrade, Serbia Monteneg. EM markovic@kiklop.etf.bg.ac.yu; emilosam@ubbg.etf.bg.ac.yu; kovacevic_b@kiklop.etf.bg.ac.yu CR ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 DENOEL E, 1985, IEEE T ACOUST SPEECH, V33, P1397, DOI 10.1109/TASSP.1985.1164759 Duda R. O., 1973, PATTERN CLASSIFICATI Fant G., 1970, ACOUSTIC THEORY SPEE Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd FUKUNAGA K, 1987, IEEE T PATTERN ANAL, V9, P634 Hamilton J. D., 1994, TIME SERIES ANAL LEE CH, 1988, IEEE T ACOUST SPEECH, V36, P642, DOI 10.1109/29.1574 Ljung L., 1987, SYSTEM IDENTIFICATIO Markel JD, 1976, LINEAR PREDICTION SP MARKOVIC M, 1992, THESIS U BELGRADE MARKOVIC M, 1994, SIGNAL PROCESS, V7, P1803 Markovic MZ, 1996, IEEE T SPEECH AUDI P, V4, P456, DOI 10.1109/89.544530 MILOSAVLJEVIC M, 1996, P 3 IEEE INT C EL CI, V2, P720, DOI 10.1109/ICECS.1996.584463 MILOSAVLJEVIC M, 1988, P 22 ANN C INF SCI S POLYAK BT, 1979, AUTOMAT REM CONTR, V40, P387 Tong H., 1983, THRESHOLD MODELS NON VEINOVIC MDJ, 1994, SIGNAL PROCESS, V37, P189, DOI 10.1016/0165-1684(94)90102-3 Yang T, 1997, SIGNAL PROCESS, V63, P151, DOI 10.1016/S0165-1684(97)00150-3 NR 19 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2002 VL 37 IS 3-4 BP 283 EP 302 DI 10.1016/S0167-6393(01)00019-X PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600008 ER PT J AU Kirchhoff, K Fink, GA Sagerer, G AF Kirchhoff, K Fink, GA Sagerer, G TI Combining acoustic and articulatory feature information for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; articulatory representations; neural networks; classifier combination ID DESIGN; UNITS AB The idea of using articulatory representations for automatic speech recognition (ASR) continues to attract much attention in the speech community.
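Looking back at the Markovic, Milosavljevic and Kovacevic record above: the core recursion of recursive least squares with a variable forgetting factor can be sketched as below. The rule that lowers the forgetting factor when the prediction error grows is a generic placeholder; the article's robust weighting and its quadratic-classifier control of the estimator are not reproduced here.

import numpy as np

def wrls_vff_ar(x, p=10, lam_min=0.95):
    """Recursive least-squares AR(p) estimation with a variable
    forgetting factor, for tracking a nonstationary speech signal.

    Returns the trajectory of AR coefficient estimates (one row per
    sample) so the tracking behaviour can be inspected.
    """
    x = np.asarray(x, dtype=float)
    theta = np.zeros(p)                    # AR parameter estimates
    P = 1e3 * np.eye(p)                    # inverse correlation matrix
    err_var = float(np.var(x[:p])) + 1e-8  # running residual variance
    history = np.zeros((len(x), p))
    for n in range(p, len(x)):
        phi = x[n - p:n][::-1]             # most recent samples first
        e = x[n] - phi @ theta             # a priori prediction error
        # shrink the forgetting factor on unusually large errors so the
        # estimator forgets old data faster and tracks the change
        lam = lam_min if e * e > 4.0 * err_var else 1.0
        err_var = 0.99 * err_var + 0.01 * e * e
        k = P @ phi / (lam + phi @ P @ phi)
        theta = theta + k * e
        P = (P - np.outer(k, phi @ P)) / lam
        history[n] = theta
    return history

A forgetting factor of 1.0 reduces this to ordinary growing-memory RLS; values below 1.0 exponentially discount old samples, which is what lets the estimate follow frame-to-frame changes in the vocal tract.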
Representations which are grouped under the label "articulatory" include articulatory parameters derived by means of acoustic-articulatory transformations (inverse filtering), direct physical measurements or classification scores for pseudo-articulatory features. In this study, we revisit the use of features belonging to the third category. In particular, we concentrate on the potential benefits of pseudo-articulatory features in adverse acoustic environments and on their combination with standard acoustic features. Systems based on articulatory features only and combined acoustic-articulatory systems are tested on two different recognition tasks: telephone-speech continuous numbers recognition and conversational speech recognition. We show that articulatory feature (AF) systems are capable of achieving a superior performance at high noise levels and that the combination of acoustic features and AFs consistently leads to a significant reduction of word error rate across all acoustic conditions. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Washington, Dept Elect Engn, Signal Speech & Language Interpretat Lab, Seattle, WA 98195 USA. Univ Bielefeld, Fac Technol, Appl Comp Sci Grp, D-33594 Bielefeld, Germany. RP Kirchhoff, K (reprint author), Univ Washington, Dept Elect Engn, Signal Speech & Language Interpretat Lab, Seattle, WA 98195 USA. CR Berouti M., 1979, P IEEE INT C AC SPEE, P208 Bitar N., 1996, P INT C AC SPEECH SI, P29 BITAR NN, 1997, P EUR C SPEECH COMM, P1239 Bocchieri E. L., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1012 BOLL SF, 1992, ADV SPEECH SIGNAL PR, P309 Chase Lin, 1997, THESIS CARNEGIE MELL COHN RP, 1992, P INT C AC SPEECH SI, P473 Cole R., 1995, P EUR C SPEECH COMM, P821 CRAVEN MW, 1996, THESIS U WSICONSINMA DENG L, 1992, J ACOUST SOC AM, V92, P3058, DOI 10.1121/1.404202 Deng L., 1994, P ICASSP 94, pI DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839 DUPONT S, 1998, P ICSLP 98, P1283 EIDE E, 1993, P ICASSP 93, P483 ELENIUS K, 1991, P EUR C SPEECH COMM, P121 Erler K, 1996, J ACOUST SOC AM, V100, P2500, DOI 10.1121/1.417358 Fink GA, 1999, LECT NOTES ARTIF INT, V1692, P229 FISCUS J, 1997, P IEEE WORKSH AUT SP GALES MJF, 1996, IEEE T SPEECH AUDIO, V4 GREENBERG S, 1997, P INT C AC SPEECH SI, V2, P1647 Halberstadt A., 1998, P ICSLP, P995 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 JIANG L, 1999, P EUR C SPEECH COMM JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 KANADERA N, 1998, P INT C AC SPEECH SI KINGSBURY BED, 1997, P INT C AC SPEECH SI Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 KIRCHHOFF K, 1999, P INT C AC SPEECH SI KIRCHHOFF K, 1999, THESIS BIELEFIELD U Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881 KOHLER DR, 1994, PHARMACOTHERAPY, V14, P3 KOLLER D, 1996, MACH LEARN, P281 KRSTULOVIC S, 1999, P EUR C SPEECH COMM Lee K.-F., 1989, AUTOMATIC SPEECH REC LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 MCMAHON P, 1998, P INT C SPOK LANG PR, P1055 MORGANTI R, 1995, PUBL ASTRON SOC AUST, V12, P3 PAPCUN G, 1992, J ACOUST SOC AM, V92, P688, DOI 10.1121/1.403994 Potamianos G, 1998, INT CONF ACOUST SPEE, P3733, DOI 10.1109/ICASSP.1998.679695 RICHARDS HB, 1996, P INT C SPOK LANG PR RICHARDS HB, 1997, P IEEE ICASSP, P1287 SALEH GMK, 1997, P INT C AC SPEECH SI, P389 Schmidbauer O., 1989, P ICASSP, P616 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 STEINGRIMSSON P, 1995, P INT C PHON
SCI STROPE B, 1998, P INT C AC SPEECH SI Wesenick M.-B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607054 Wu SL, 1998, INT CONF ACOUST SPEE, P721 NR 48 TC 46 Z9 50 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2002 VL 37 IS 3-4 BP 303 EP 319 DI 10.1016/S0167-6393(01)00020-6 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600009 ER PT J AU Chien, JT AF Chien, JT TI A Bayesian prediction approach to robust speech recognition and online environmental learning SO SPEECH COMMUNICATION LA English DT Article DE Bayesian predictive classification (BPC); online unsupervised learning; speaker adaptation; speech recognition; hidden Markov model ID HIDDEN MARKOV-MODELS; CLASSIFICATION APPROACH; SPEAKER ADAPTATION; INCREMENTAL ESTIMATION; MAXIMUM-LIKELIHOOD; ALGORITHMS AB A robust speech recognizer is developed to tackle the inevitable mismatch between training and testing environments. Because realistic environments are uncertain and nonstationary, it is necessary to characterize the uncertainty of speech hidden Markov models (HMMs) for recognition and trace the uncertainty incrementally to catch the newest environmental statistics. In this paper, we develop a new Bayesian predictive classification (BPC) for robust decision and online environmental learning. The BPC decision is adequately established by modeling the uncertainties of both the HMM mean vector and precision matrix using a conjugate prior density. The frame-based predictive distributions using multivariate t distributions and approximate Gaussian distributions are herein exploited. After the recognition, the prior density is pooled with the likelihood of the current test sentence to generate the reproducible prior density. The hyperparameters of the prior density are accordingly adjusted to meet the newest environments and apply for the recognition of upcoming data. As a result, an efficient online unsupervised learning strategy is developed for HMM-based speech recognition without needing adaptation data. In the experiments, the proposed approach is significantly better than the conventional plug-in maximum a posteriori (MAP) decision on the recognition of connected Chinese digits in hands-free car environments. This approach is economical in computation. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. RP Chien, JT (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. EM jtchien@mail.ncku.edu.tw CR Berger J.
O., 1985, STAT DECISION THEORY, V2nd Chien JT, 1999, IEEE T SPEECH AUDI P, V7, P656 Chien JT, 2000, SPEECH COMMUN, V30, P235, DOI 10.1016/S0167-6393(99)00052-7 DeGroot M., 1970, OPTIMAL STAT DECISIO DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Digalakis VV, 1999, IEEE T SPEECH AUDI P, V7, P253, DOI 10.1109/89.759031 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Geisser S, 1993, PREDICTIVE INFERENCE Gotoh Y, 1998, IEEE T SPEECH AUDI P, V6, P539, DOI 10.1109/89.725320 Huo Q, 1997, IEEE T SPEECH AUDI P, V5, P161 Huo Q, 2000, IEEE T SPEECH AUDI P, V8, P200 HUO Q, 1998, P INT S CHIN SPOK LA, P31 HUO Q, 1997, IEEE P INT C AC SPEE, V2, P1547 HUO Q, 1997, P EUR C SPEECH COMM, V4, P1847 Jiang H, 1999, SPEECH COMMUN, V28, P313, DOI 10.1016/S0167-6393(99)00018-7 JIANG H, 1997, IEEE P INT C AC SPEE, V2, P1551 KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694 LEE CH, 1999, P WORKSH ROB METH SP, P45 Lee JJ, 1998, SYMBIOSIS, V25, P1 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Matsui T, 1998, COMPUT SPEECH LANG, V12, P41, DOI 10.1006/csla.1997.0036 Merhav N, 1993, IEEE T SPEECH AUDI P, V1, P90, DOI 10.1109/89.221371 MERHAV N, 1991, IEEE T SIGNAL PROCES, V39, P2157, DOI 10.1109/78.91172 NADAS A, 1985, IEEE T ACOUST SPEECH, V33, P326, DOI 10.1109/TASSP.1985.1164513 Ripley B., 1996, PATTERN RECOGNITION Shahshahani BM, 1997, IEEE T SPEECH AUDI P, V5, P183, DOI 10.1109/89.554780 SURENDRAN AC, 1999, P WORKSH ROB METH SP, P155 SURENDRAN AC, 1998, P INT C SPOK LANG PR, V2, P463 VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 NR 30 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2002 VL 37 IS 3-4 BP 321 EP 334 DI 10.1016/S0167-6393(01)00032-2 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600010 ER PT J AU Kim, HK Kang, HG AF Kim, HK Kang, HG TI An adaptive short-term postfilter based on pseudo-cepstral representation of line spectral frequencies SO SPEECH COMMUNICATION LA English DT Article DE adaptive short-term postfilter; pseudo-cepstrum; line spectral frequency; speech coding ID SPEECH AB We propose an adaptive short-term postfilter for speech coders by incorporating the properties of the pseudo-cepstrum [IEEE Trans. Speech Audio Process. 8 (2) (2000) 195]. Since the proposed postfilter implicitly has a characteristic of tilt compensation, it does not require an additional tilt compensation filter as conventional techniques do. We derive a relationship between the postfilter parameters so that the postfilter has a minimum phase distortion, which helps simplify the tuning procedure for the parameters. We also show that the postfilter can be successfully implemented with a lower order. By applying this postfilter to several international speech coding standards, we reduce the complexity while obtaining performance comparable to that of conventional approaches. (C) 2002 Elsevier Science B.V. All rights reserved. C1 AT&T Labs Res, Florham Pk, NJ 07932 USA. RP Kim, HK (reprint author), AT&T Labs Res, 180 Pk Ave,Rm D107, Florham Pk, NJ 07932 USA. EM hkkim@research.att.com; goo@research.att.com RI Kang, Hong-Goo/G-8545-2012 CR Campbell J. P. Jr., 1991, Advances in Speech Coding CHEN JH, 1995, IEEE T SPEECH AUDI P, V3, P59 Flannery B.
P., 1992, NUMERICAL RECIPES C Gnedenko B., 1999, STAT RELIABILITY ENG HAGEN R, 1997, P IEEE WORKSH SPEECH, P59 Honkanen T., 1997, P INT C AC SPEECH SI, P731 Itakura F., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Jarvinen K., 1997, P INT C AC SPEECH SI, P771 KABAL P, 1991, P IEEE INT S CIRC SY, V1, P312 Kim HK, 1999, IEEE T SPEECH AUDI P, V7, P87 Kim HK, 2000, IEEE T SPEECH AUDI P, V8, P195 Markel JD, 1976, LINEAR PREDICTION SP MORIYA T, 1995, P IEEE WORKSH SPEECH, P57 MUSTAPHA A, 1999, P ICASSP, P197 Recchione M. C., 1999, International Journal of Speech Technology, V2, DOI 10.1007/BF02108646 Salami R, 1998, IEEE T SPEECH AUDI P, V6, P116, DOI 10.1109/89.661471 SUPPLEE LM, 1997, P IEEE INT C AC SPEE, P1591 TASAKI H, 1997, P IEEE WORKSH SPEECH, P57 NR 18 TC 2 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2002 VL 37 IS 3-4 BP 335 EP 348 DI 10.1016/S0167-6393(01)00033-4 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600011 ER PT J AU Hariharan, R Viikki, O AF Hariharan, R Viikki, O TI An integrated study of speaker normalisation and HMM adaptation for noise robust speaker-independent speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speaker variability; noise robustness; HMM adaptation; vocal tract length normalisation; gender-dependent acoustic modelling ID HIDDEN MARKOV-MODELS AB Inter-speaker variability and sensitivity to background noise are two major problems in modern speech recognition systems. In this paper, we investigate different techniques that have been developed to overcome these issues. These methods include vocal tract length normalisation (VTLN), on-line HMM adaptation and gender-dependent acoustic modelling. Our objective in this paper is to combine these techniques so that the system recognition performance is maximised. Moreover, we propose a vocal tract length normalisation technique, which is more implementation-friendly than the previously published utterance-specific VTLN (u-VTLN). In order to ensure the wide applicability of the methods to be studied, the performance evaluation is done both in connected digit recognition and monophone-based isolated word recognition. The recognition results obtained indicate the importance of the combined use of these techniques. The integrated use of VTLN and on-line adaptation always provided the highest performance in both types of recognition experiments using gender-independent models. As expected, on-line HMM adaptation provided the major performance improvement with respect to a gender- and speaker-independent baseline system. The addition of speaker-specific VTLN (s-VTLN) or gender-dependent acoustic modelling further improved system accuracy. However, while the joint use of s-VTLN and gender-dependent HMMs improved the recognition rate with original unadapted models, a minor performance degradation was observed when s-VTLN was applied to on-line adapted gender-dependent HMMs. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Nokia Res Ctr, Speech & Audio Syst Lab, Tampere 33721, Finland. RP Hariharan, R (reprint author), Nokia Res Ctr, Speech & Audio Syst Lab, POB 100,Vissiokatu 1, Tampere 33721, Finland.
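A minimal sketch of the piecewise-linear frequency warping and per-speaker grid search that VTLN schemes of the kind discussed in the Hariharan and Viikki record typically use. The breakpoint, the search grid, the Nyquist frequency and the helper names feats_fn and loglik_fn are illustrative assumptions, not the paper's u-VTLN or s-VTLN recipes.

    import numpy as np

    def warp_frequency(f, alpha, f_nyq=4000.0, brk=0.875):
        """Piecewise-linear VTLN warp: scale frequencies by alpha up to
        a breakpoint, then interpolate linearly so the Nyquist endpoint
        is preserved (one common scheme among several)."""
        f = np.asarray(f, dtype=float)
        fb = brk * f_nyq
        lo = alpha * f
        hi = alpha * fb + (f_nyq - alpha * fb) * (f - fb) / (f_nyq - fb)
        return np.where(f <= fb, lo, hi)

    def pick_warp_factor(utt, feats_fn, loglik_fn,
                         grid=np.arange(0.88, 1.13, 0.02)):
        """Speaker-specific warp: keep the factor whose warped features
        score best under the acoustic model (feats_fn and loglik_fn are
        assumed helpers supplied by the recogniser)."""
        return max(grid, key=lambda a: loglik_fn(feats_fn(utt, a)))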
EM ramalingam.hariharan@nokia.com CR Ahadi SM, 1997, COMPUT SPEECH LANG, V11, P187, DOI 10.1006/csla.1997.0031 Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807 ANDREOU A, 1994, P CAIP WORKSH FRONT, pR2 EIDE E, 1996, P ICASSP, P346 Fant G., 1973, SPEECH SOUNDS FEATUR Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 GONG Y, 1999, P INT C AC SPEECH SI HAKKINEN J, 1999, P IEEE WORKSH ROB ME, P139 HARIHARAN R, 1999, P EUR 99 BUD HUNG, V1, P215 HUNT MJ, 1999, P WORKSH ROB METH SP, P25 Laurila K, 1998, INT CONF ACOUST SPEE, P85, DOI 10.1109/ICASSP.1998.674373 LAURILA K, 1997, P IEEE INT C AC SPEE, P871 Lee L., 1996, P ICASSP LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 PYE D, 1997, P ICASSP, P1047 Rosenberg A. E., 1994, P INT C SPOK LANG PR, P1835 SURENDRAN AC, 1999, P WORKSH ROB METH SP, P155 VIIKKI O, 1998, P INT C SPOK LANG PR, P1779 Viikki O, 1998, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1998.675369 WOODLAND PC, 1995, P ARPA WORKSH SPOK L, P104 NR 20 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2002 VL 37 IS 3-4 BP 349 EP 361 DI 10.1016/S0167-6393(01)00039-5 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 564KY UT WOS:000176314600012 ER PT J AU Pallett, DS Lamel, L AF Pallett, DS Lamel, L TI Automatic transcription of Broadcast News data SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Natl Inst Stand & Technol, Gaithersburg, MD 20899 USA. RP Pallett, DS (reprint author), Natl Inst Stand & Technol, Room A216 Technol Bldg, Gaithersburg, MD 20899 USA. EM dpallett@nist.gov; lamel@limsi.fr NR 0 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2002 VL 37 IS 1-2 BP 1 EP 2 DI 10.1016/S0167-6393(01)00055-3 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 542EC UT WOS:000175029700001 ER PT J AU Pallett, DS AF Pallett, DS TI The role of the National Institute of Standards and Technology in DARPA's Broadcast News continuous speech recognition research program SO SPEECH COMMUNICATION LA English DT Article AB This paper reviews the role that the National Institute of Standards and Technology has played in the "Broadcast News" research community. The 1995 "Dry Run" Marketplace set the scene for the introduction of broadcast materials, followed by larger-scale benchmark tests in 1996, 1997, 1998 and (most recently) 1999. This paper discusses the 1998 test results in some detail; these also involve tests based on Broadcast News corpora, including the Spoken Document Retrieval, Information Extraction - Named Entity, and Topic Detection and Tracking tasks. Published by Elsevier Science B.V. C1 Natl Inst Stand & Technol, Gaithersburg, MD 20899 USA. RP Pallett, DS (reprint author), Natl Inst Stand & Technol, Room A216 Technol Bldg, Gaithersburg, MD 20899 USA. EM dpallett@nist.gov CR Fiscus J. 
G., 1997, P IEEE WORKSH AUT SP, P347 PALLETT D, 1995, COMMUNICATION 0424 PALLETT DS, 1997, P SPEECH REC WORKSH PALLETT DS, 1998, P BROADC NEWS TRANSC PALLETT DS, 1996, P SPEECH REC WORKSH ROSENFELD R, 1995, COMMUNICATION 0315 SEARS A, 1995, COMMUNICATION 0324 STERN RM, 1996, P SPEECH REC WORKSH NR 8 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2002 VL 37 IS 1-2 BP 3 EP 14 DI 10.1016/S0167-6393(01)00056-5 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 542EC UT WOS:000175029700002 ER PT J AU Graff, D AF Graff, D TI An overview of Broadcast News corpora SO SPEECH COMMUNICATION LA English DT Article AB The LDC began its first Broadcast News (BN) speech collection in the spring of 1996, facing a host of challenges including IPR negotiations with broadcasters, establishment of new transcription conventions and tools, and a compressed schedule for creation and release of speech, transcripts and in-domain language model data. The amount of acoustic training data available for participants in the DARPA Hub4 English benchmark tests doubled from 50 h in 1996 to 100 h in 1997, and doubled again to 200 h in 1998. An additional 40 h has been made available as of the summer of 1999. The 1997 benchmark test also saw the addition of BN speech and transcripts in Spanish and Mandarin Chinese, though in lesser quantity, with 30 h of training data in each language. Supplements to the existing pronunciation lexicons in each language were also produced. More recently, the coordinated research project on topic detection and tracking (TDT) has called for a large collection of BN speech data, totaling about 1100 h in English and 300 h in Mandarin over two phases (TDT2 and TDT3), although the level of detail and quality in the TDT transcriptions is not comparable to that of the Hub4 collections. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Linguist Data Consortium, Philadelphia, PA 19204 USA. RP Graff, D (reprint author), Linguist Data Consortium, Suite 200,3615 Market St, Philadelphia, PA 19204 USA. EM graff@ldc.upenn.edu CR ALLAN J, 1998, P 1998 DARPA BROADC Bird S, 1999, MSCIS9901 U PENNS DE BYRNE B, 1999, LANGUAGE INDEPENDENT CIERI C, 1999, P 1999 DARPA BROADC DODDINGTON G, 1996, 1996 HUB 4 ANNOTATIO FISCUS J, 1998, UNIVERSAL TRANSCRIPT GAROFOLO J, 1997, P 1997 DARPA SPEECH Graff D., 1997, P 1997 DARPA SPEECH PALLETT D, 2000, P 2000 SPEECH TRANSC PALLETT D, 1997, P 1997 DARPA SPEECH Pallett D., 1998, P 1999 DARPA BROADC PALLETT D, 1998, P 1998 DARPA BROADC PALLETT D, 1996, P 1996 DARPA SPEECH Voorhees E. M., 2000, RIAO 2000, P1 NR 14 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD MAY PY 2002 VL 37 IS 1-2 BP 15 EP 26 DI 10.1016/S0167-6393(01)00057-7 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 542EC UT WOS:000175029700003 ER PT J AU Robinson, AJ Cook, GD Ellis, DPW Fosler-Lussier, E Renals, SJ Williams, DAG AF Robinson, AJ Cook, GD Ellis, DPW Fosler-Lussier, E Renals, SJ Williams, DAG TI Connectionist speech recognition of Broadcast News SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; neural networks; acoustic features; pronunciation modelling; search techniques; stack decoder AB This paper describes connectionist techniques for recognition of Broadcast News. The fundamental difference between connectionist systems and more conventional mixture-of-Gaussian systems is that connectionist models directly estimate posterior probabilities as opposed to likelihoods. Access to posterior probabilities has enabled us to develop a number of novel approaches to confidence estimation, pronunciation modelling and search. In addition we have investigated a new feature extraction technique based on the modulation-filtered spectrogram (MSG), and methods for combining multiple information sources. We have incorporated all of these techniques into a system for the transcription of Broadcast News, and we present results on the 1998 DARPA Hub-4E Broadcast News evaluation data. (C) 2002 Elsevier Science B.V. All rights reserved. C1 SoftSound Ltd, Autonomy Syst Ltd, Cambridge CB4 0WS, England. Phonet Syst UK Ltd, Cheltenham GL52 8RW, Glos, England. Columbia Univ, Dept Elect Engn, New York, NY 10027 USA. Bell Labs, Lucent Technol, Murray Hill, NJ 07974 USA. Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. Int Comp Sci Inst, Berkeley, CA 94704 USA. RP Robinson, AJ (reprint author), SoftSound Ltd, Autonomy Syst Ltd, Cambridge Business Pk,Cowley Rd, Cambridge CB4 0WS, England. EM ajr@softsound.com CR BARKER J, 1998, P INT C SPOK LANG PR, P2719 Bernardis G., 1998, P ICSLP, P775 BOURLARD H, 1994, P IEEE INT C AC SPEE, P373 BOURLARD H, 1994, CONTINUOUS SPEECH RE Bourlard H., 1996, AUTOMATIC SPEECH SPE, P259 Clarkson P., 1997, EUR C SPEECH COMM TE, P2707 Cook G, 1998, INT CONF ACOUST SPEE, P917, DOI 10.1109/ICASSP.1998.675415 COOK G, 1997, P DARPA SPEECH REC W, P79 COX S, 1996, P INT C AC SPEECH SI, P511 EIDE E, 1995, P IEEE INT C AC SPEE, P221 Finke M., 1997, EUROSPEECH 97, P2379 FISCUS JG, 1997, IEEE WORKSH AUT SPEE FOSLERLUSSIER E, 1999, P DARPA BROADC NEWS Fosler-Lussier J. 
E., 1999, THESIS U CALIFORNIA FRITSCH J, 1998, P INT C SPOK LANG PR GOPALAKRISHNAN PS, 1995, INT CONF ACOUST SPEE, P572, DOI 10.1109/ICASSP.1995.479662 Hain T., 1998, P DARPA BROADC NEWS, P133 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HETHERINGTON L, 1995, EUROSPEECH 95, P1645 HOCHBERG MM, 1995, P IEEE INT C AC SPEE, P401 Jelinek F., 1969, IBM Journal of Research and Development, V13 KERSHAW D, 1996, THESIS CAMBRIDGE U E Kingsbury B., 1998, THESIS U CALIFORNIA Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 KOHLRAUSCH A, 1992, AUDITORY PROCESSING, P85 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Lowerre B.T., 1980, TRENDS SPEECH RECOGN LUCASSEN J, 1984, P IEEE INT C AC SPEE MORGAN N, 1992, J PARALLEL DISTR COM, V14, P248, DOI 10.1016/0743-7315(92)90067-W NETI CV, 1997, P INT C AC SPEECH SI, P883 Neto Joao, 1995, EUROSPEECH 95, P2171 Nilsson N.J., 1971, PROBLEM SOLVING METH Odell J.J., 1995, THESIS U CAMBRIDGE PALLETT D, 2002, SPEECH COMMUNICATION Paul D.B., 1992, P IEEE INT C AC SPEE, P25, DOI 10.1109/ICASSP.1992.225981 RENALS S, 1995, P IEEE INT C AC SPEE, P596 Renals S, 1996, IEEE SIGNAL PROC LET, V3, P4, DOI 10.1109/97.475820 Renals S, 1999, IEEE T SPEECH AUDI P, V7, P542, DOI 10.1109/89.784107 RILEY M, 1998, ESCA TUT RES WORKSH, P109 Robinson T., 1996, AUTOMATIC SPEECH SPE, P233 Robinson T, 1998, INT CONF ACOUST SPEE, P829, DOI 10.1109/ICASSP.1998.675393 Robinson T, 1994, IEEE T NEURAL NETWOR, V5, P298 Schwartz R., 1996, AUTOMATIC SPEECH SPE, P429 Siegler M., 1997, P DARPA SPEECH REC W, P97 Wawrzynek J, 1996, COMPUTER, V29, P79, DOI 10.1109/2.485896 Williams G, 1999, COMPUT SPEECH LANG, V13, P395, DOI 10.1006/csla.1999.0129 Williams G., 1998, ESCA WORKSH MOD PRON, P151 Williford RE, 1999, ELEC SOC S, V99, P687 WU SL, 1998, THESIS U CALIFORNIA NR 50 TC 10 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2002 VL 37 IS 1-2 BP 27 EP 45 DI 10.1016/S0167-6393(01)00058-9 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 542EC UT WOS:000175029700004 ER PT J AU Woodland, PC AF Woodland, PC TI The development of the HTK Broadcast News transcription system: An overview SO SPEECH COMMUNICATION LA English DT Article DE large vocabulary speech recognition; Broadcast News transcription ID SPEECH RECOGNITION; ADAPTATION AB This paper describes in detail the development of the HTK Broadcast News (BN) transcription system and presents full evaluation results from the 1996, 1997 and 1998 DARPA BN evaluations. It starts with a description of the underlying HTK large vocabulary recognition system and presents the modifications used in successive generations of the HTK BN system. Initially, acoustic models that relied on fairly precise manual audio-type classification were used. To enable the use of automatic segmentation and classification systems, acoustic models were developed that were independent of fine audio classifications. The basic structure of the current HTK BN system includes a high-quality segmentation stage, multiple decoding passes which initially use triphones and trigrams, and then quinphone acoustic models along with word 4-gram and category language models applied in the final pass.
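The contrast the Robinson et al. record above draws between posteriors and likelihoods is usually bridged in hybrid connectionist decoding by converting network outputs to scaled likelihoods; a standard identity from Bayes' rule (general to hybrid systems, not a result of this particular paper), in LaTeX:

    p(x_t \mid q_k) \;=\; \frac{p(q_k \mid x_t)\, p(x_t)}{P(q_k)}
                    \;\propto\; \frac{p(q_k \mid x_t)}{P(q_k)}

Here p(q_k | x_t) is the network output for state q_k at frame x_t, P(q_k) is the state prior estimated from the training alignment, and p(x_t) is common to all states at a given frame and can therefore be dropped during Viterbi search.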
This system gave the lowest error rate in the 1997 BN evaluation by a statistically significant margin. Refinements to the system are then described that examine the use of a larger acoustic training set, vocal tract length normalisation, full variance transforms and improved language modelling. Furthermore, a version of the system was developed that ran in less than 10 times real time with only a small increase in error rate; this version has been used for the bulk transcription of broadcast news for information retrieval from audio data. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Woodland, PC (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England. EM pcw@eng.cam.ac.uk CR Anastasakos T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607807 Bimbot F., 1993, P EUROSPEECH, P169 Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 GAUVAIN JL, 1997, P EUR, P907 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Graff D, 2002, SPEECH COMMUN, V37, P15, DOI 10.1016/S0167-6393(01)00057-7 Hain T., 1998, P DARPA BROADC NEWS, P133 HAIN T, 1999, P ICASSP, P57 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Johnson LL, 1999, J SHOULDER ELB SURG, V8, P500, DOI 10.1016/S1058-2746(99)90085-X KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 Kneser R., 1993, P EUR C SPEECH COMM, P973 KUBALA F, 1997, P EUR 97 RHOD GREEC, P927 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Leggetter C. J., 1995, P EUR 95 MADR, P1155 MARTIN S, 1995, P EUROSPEECH MADR SE, P1253 Niesler TR, 1998, INT CONF ACOUST SPEE, P177, DOI 10.1109/ICASSP.1998.674396 Odell J. J., 1994, P ARPA SPOK LANG TEC, P405, DOI 10.3115/1075812.1075905 Odell J.J., 1995, THESIS CAMBRIDGE U E ODELL JJ, 1999, P 1999 DARPA BROADC, P271 Pallett DS, 2002, SPEECH COMMUN, V37, P3, DOI 10.1016/S0167-6393(01)00056-5 PALLETT DS, 1997, P 1998 DARPA BROADC, P5 PYE D, 1997, P ICASSP, P1047 Siegler M., 1997, P DARPA SPEECH REC W, P97 TUERK A, 2000, P RIAO 2000 CONT BAS, V3, P14 WEGMANN S, 1998, P DARPA BROADC NEWS, P60 Woodland P., 1998, P DARPA BROADC NEWS, P41 Woodland P. C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607806 Woodland P. C., 1996, P ARPA SPEECH REC WO, P99 Woodland P. C., 1999, P DARPA BROADC NEWS, P265 WOODLAND PC, 1995, P ICASSP, V1, P73 WOODLAND PC, 1997, P ICASSP 97 MUN, P719 Woodland P.C., 1996, P DARPA SPEECH REC W, P73 WOODLAND PC, 1994, P IEEE INT C AC SPEE, V2, P125 WOODLAND PC, 2000, P ISCA ITRW ASR2000, P7 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 NR 38 TC 14 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD MAY PY 2002 VL 37 IS 1-2 BP 47 EP 67 DI 10.1016/S0167-6393(01)00059-0 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 542EC UT WOS:000175029700005 ER PT J AU Chen, SS Eide, E Gales, MJF Gopinath, RA Kanvesky, D Olsen, P AF Chen, SS Eide, E Gales, MJF Gopinath, RA Kanvesky, D Olsen, P TI Automatic transcription of Broadcast News SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; Broadcast News; acoustic modeling; adaptive training; segmentation and clustering; model selection ID HIDDEN MARKOV-MODELS AB This paper describes the IBM approach to Broadcast News (BN) transcription. Typical problems in the BN transcription task are segmentation, clustering, acoustic modeling, language modeling and acoustic model adaptation. This paper presents new algorithms for each of these focus problems. Some key ideas include Bayesian information criterion (BIC) (for segmentation, clustering and acoustic modeling) and speaker/cluster adapted training (SAT/CAT). (C) 2002 Elsevier Science B.V. All rights reserved. C1 IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA. RP Gopinath, RA (reprint author), IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA. EM rameshg@us.ibm.com CR ANASTASAKOS T, 1996, P ICSLP 96 BAHL LR, 1994, P ICASSP BAHL LR, 1995, P ICASSP, P41 BAKIS R, 1997, P DARPA SP REC WORKS Bakis R., 1997, P SPEECH REC WORKSH, P67 BASU S, 1999, POWER EXPONENTIAL DE BEIGI H, 1998, P WORLD C AUT BROWN P, 1987, THESIS IBM TJ WATSON Brown P. F., 1992, Computational Linguistics, V18 BURSHTEIN D, 1995, INT CONF ACOUST SPEE, P548, DOI 10.1109/ICASSP.1995.479656 CHEN SS, 1999, P BROADC NEWS TRANSC Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347 Gales M. J. F., 1998, P ICSLP, P1783 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gales MJF, 1999, IEEE T SPEECH AUDI P, V7, P272, DOI 10.1109/89.759034 Gish H, 1994, IEEE SIGNAL PROC MAG, V11, P18, DOI 10.1109/79.317924 Gopinath R.A., 1998, P ICASSP Jelinek F., 1997, STAT METHODS SPEECH Jin H., 1997, P DARPA SPEECH REC W, P108 Kubala F, 1997, P SPEECH REC WORKSH, P90 LIPORACE LA, 1982, IEEE T INFORM THEORY, V28, P729, DOI 10.1109/TIT.1982.1056544 MONKOWSKI MD, CONTEXT DEPENDENT PH Pallet D., 1997, P DARPA SPEECH REC W RICHTER AG, 1986, ADV SPEECH PROCESSIN SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136 SIEGER MA, 1995, THESIS CARNEGIE MELL Siegler M., 1997, P DARPA SPEECH REC W, P97 STOLCKE A, 1994, TR94003 INT COMP SCI Subbotin M.T., 1923, MAT SBORNIK, V31, P296 TRITSCHLER A, 1999, P EUR 1999 BUD HUNG Woodland P., 1997, P SPEECH REC WORKSH, P73 NR 31 TC 18 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2002 VL 37 IS 1-2 BP 69 EP 87 DI 10.1016/S0167-6393(01)00060-7 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 542EC UT WOS:000175029700006 ER PT J AU Gauvain, JL Lamel, L Adda, G AF Gauvain, JL Lamel, L Adda, G TI The LIMSI Broadcast News transcription system SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; Broadcast News transcription; audio partitioning; acoustic modeling; language modeling; lexical modeling ID SPEECH AB This paper reports on activities at LIMSI over the last few years directed at the transcription of broadcast news data.
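A minimal sketch of the Bayesian information criterion change-point test that underlies the segmentation and clustering ideas in the Chen et al. record above. Modelling each side of a candidate boundary as a single full-covariance Gaussian is the standard formulation; the penalty weight lam and the covariance regularisation constant are illustrative assumptions.

    import numpy as np

    def delta_bic(X, t, lam=1.0):
        """Delta-BIC for a candidate change point at frame t of feature
        matrix X (frames x dims); positive values favour a boundary."""
        n, d = X.shape
        def n_logdet(Z):
            cov = np.cov(Z.T) + 1e-6 * np.eye(d)   # regularised covariance
            return len(Z) * np.linalg.slogdet(cov)[1]
        gain = 0.5 * (n_logdet(X) - n_logdet(X[:t]) - n_logdet(X[t:]))
        penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
        return gain - penalty

Sliding this test over the audio and keeping positive maxima yields segment boundaries; the same criterion, applied as a merge-versus-keep-separate comparison, can drive the clustering stage.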
We describe our development work in moving from laboratory read speech data to real-world or 'found' speech data in preparation for the DARPA evaluations on this task from 1996 to 1999. Two main problems needed to be addressed to deal with the continuous flow of inhomogeneous data. These concern the varied acoustic nature of the signal (signal quality, environmental and transmission noise, music) and different linguistic styles (prepared and spontaneous speech on a wide range of topics, spoken by a large variety of speakers). The problem of partitioning the continuous stream of data is addressed using an iterative segmentation and clustering algorithm with Gaussian mixtures. The speech recognizer makes use of continuous density HMMs with Gaussian mixtures for acoustic modeling and 4-gram statistics estimated on large text corpora. Word recognition is performed in multiple passes, where current hypotheses are used for cluster-based acoustic model adaptation prior to the next decoding pass. The overall word transcription errors of the LIMSI evaluation systems were 27.1% (Nov96, partitioned test data), 18.3% (Nov97, unpartitioned data), 13.6% (Nov98, unpartitioned data) and 17.1% (Fall99, unpartitioned data with computation time under 10x real-time). (C) 2002 Elsevier Science B.V. All rights reserved. C1 CNRS, LIMSI, Spoken Language Proc Grp, F-91403 Orsay, France. RP Gauvain, JL (reprint author), CNRS, LIMSI, Spoken Language Proc Grp, BP 133, F-91403 Orsay, France. EM gauvain@limsi.fr; lamel@limsi.fr; gadda@limsi.fr CR *ARPA, 1995, P ARPA SPOK LANG SYS AUBERT X, 1999, P EUROSPEECH BUD HUN, V4, P1559 Chen S. S., 1998, P DARPA BROADC NEWS, P127 Clarkson P., 1997, P EUR 97 RHOD GREEC, P2707 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Gauvain JL, 1998, P ICSLP 98 SYDN AUST, P1335 GAUVAIN JL, 1997, P ESCA EUR 97 RHOD G, V2, P907 GAUVAIN JL, 1997, P ARPA SPEECH REC WO, P56 GAUVAIN JL, 2000, P INT C SPOK LANG PR, V3, P794 GAUVAIN JL, 1996, P DARPA SPEECH REC W, P105 GAUVAIN JL, 1998, P DARPA BROADC NEWS, P75 GAUVAIN JL, 1994, P ARPA SPOK LANG TEC GAUVAIN JL, 1998, P DARPA BROADC NEWS, P99 GAUVAIN JL, 1995, P IEEE ICASSP 95 DET, P65 Gish H., 1991, P IEEE INT C AC SPEE, P873, DOI 10.1109/ICASSP.1991.150477 Hain T., 1998, P DARPA BROADC NEWS, P133 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 JARDINO M, 1996, P INT C AC SPEECH SI, V1, P161 KANNAN A, 1994, IEEE T SPEECH AUDIO, V2 KAUFMANN M, 1997, P DARPA SPEECH REC W KAUFMANN M, 1999, P DARPA BROADC NEWS KAUFMANN M, 1996, P DARPA SPEECH REC W KAUFMANN M, 1998, P DARPA BROADC NEWS KUBALA F, 1996, P DARPA SPEECH REC W, P55 LAMEL LF, 1996, P ICSLP 96 OCT PHIL, V1, P6, DOI 10.1109/ICSLP.1996.606916 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 LIU D, 1999, P ESCA EUR 99 BUD HU, V3, P1031 Ney H., 1992, P IEEE INT C AC SPEE, V1, P9 Odell J.
J., 1994, P ARPA SPOK LANG TEC, P405, DOI 10.3115/1075812.1075905 PAUL DB, 1992, P ICSLP 92 BANFF, V2, P899 Pitz M., 1999, P DARPA BROADC NEWS, P157 Robinson T., 1995, P INT C AC SPEECH SI, V1, P81 Schwartz R., 1997, P DARPA SPEECH REC W, P115 SEYMORE K, 1997, P DARPA SPEECH REC W, P141 Siegler M., 1997, P DARPA SPEECH REC W, P97 STERN R, 1996, SPECIFICATION ARPA N *STW, 2000, P 2000 SPEECH TRANSC WEGMANN S, 1999, P INT C AC SPEECH SI, P33 WEGMANN S, 1998, P DARPA BROADC NEWS, P60 WOODLAND PC, 1998, 1998 HUB5E WORKSH SE Young SJ, 1997, COMPUT SPEECH LANG, V11, P73, DOI 10.1006/csla.1996.0023 NR 41 TC 126 Z9 128 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2002 VL 37 IS 1-2 BP 89 EP 108 DI 10.1016/S0167-6393(01)00061-9 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 542EC UT WOS:000175029700007 ER PT J AU Beyerlein, P Aubert, X Haeb-Umbach, R Harris, M Klakow, D Wendemuth, A Molau, S Ney, H Pitz, M Sixtus, A AF Beyerlein, P Aubert, X Haeb-Umbach, R Harris, M Klakow, D Wendemuth, A Molau, S Ney, H Pitz, M Sixtus, A TI Large vocabulary continuous speech recognition of Broadcast News - The Philips/RWTH approach SO SPEECH COMMUNICATION LA English DT Article DE Broadcast News; Hub-4; automatic segmentation; speaker clustering; time-synchronous one-pass trigram decoding; language-model look-ahead; discriminative model combination; log-linear interpolation; distance language models; phrases; vocal tract normalization; script verification AB Automatic speech recognition of real-life broadcast news (BN) data (Hub-4) has become a challenging research topic in recent years. This paper summarizes our key efforts to build a large vocabulary continuous speech recognition system for the heterogeneous BN task without incurring undesired complexity and computational cost. These key efforts included: automatic segmentation of the audio signal into speech utterances; efficient one-pass trigram decoding using look-ahead techniques; optimal log-linear interpolation of a variety of acoustic and language models using discriminative model combination (DMC); handling short-range and weak longer-range correlations in natural speech and language by the use of phrases and of distance-language models; improving the acoustic modeling by a robust feature extraction, channel normalization, adaptation techniques as well as automatic script selection and verification. The starting point of the system development was the Philips 64k-NAB word-internal triphone trigram system. On the speaker-independent but microphone-dependent NAB-task (transcription of read newspaper texts) we obtained a word error rate of about 10%. Now, at the conclusion of the system development, we at Philips have arrived at a DMC-interpolated phrase-based crossword-pentaphone 4-gram system. This system transcribes BN data with an overall word error rate of about 17%. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Philips Res Labs, D-52066 Aachen, Germany. Rhein Westfal TH Aachen, Lehrstuhl Informat 4, D-52056 Aachen, Germany. RP Beyerlein, P (reprint author), Philips Res Labs, Weisshaustr 2, D-52066 Aachen, Germany.
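The discriminative model combination named in the Beyerlein et al. record posits a log-linear posterior over word sequences; in a generic form (the weight-training criterion, which minimises word error, is beyond this sketch):

    p_{\Lambda}(w \mid x) \;=\;
        \frac{\prod_i p_i(w, x)^{\lambda_i}}
             {\sum_{w'} \prod_i p_i(w', x)^{\lambda_i}}

where the p_i are the combined knowledge sources (e.g., triphone and pentaphone acoustic models, trigram, phrase and distance language models) and the exponents lambda_i are the interpolation weights.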
CR ALLEVA F, 1996, P IEEE INT C AC SPEE, P133 AUBERT X, 1995, P IEEE INT C AC SPEE, P49 AUBERT X, 1999, P EUR C SPEECH COMM, P1559 AUBERT X, 1994, P IEEE INT C AC SPEE, V2, P129 BEYERLEIN P, 1998, P DARPA BROADC NEWS BEYERLEIN P, 1997, P EUROSPEECH RHOD GR, P1163 Beyerlein P., 1997, P IEEE AUT SPEECH RE, P238 BEYERLEIN P, 1999, P EUROSPEECH BUD HUN, P647 Chen S., 1998, P DARPA BROADC NEWS DARROCH JN, 1972, ANN MATH STAT, V43, P1470, DOI 10.1214/aoms/1177692379 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Fiscus J. G., 1997, P IEEE WORKSH AUT SP, P347 HAEBUMBACH R, 1999, P EUROSPEECH BUD HUN, P1323 HAIN T, 1998, P DARPA BROADC NEWS Harris M., 1999, P EUROSPEECH BUD HUN, P1027 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 JIN H, 1997, P DARPA SPEECH REC W KLAKOW D, 1997, P IEEE INT C AC SPEE, P701 Klakow Dietrich, 1998, P ICSLP, P1695 Kneser R., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607162 Kubala F., 1998, P DARPA BROADC NEWS Lee L., 1996, P ICASSP, V1, P353 NEY H, 1995, IEEE T PATTERN ANAL, V17, P107, DOI 10.1109/34.368176 NEY H, 1992, P IEEE INT C AC SPEE, P13 Odell J. J., 1994, P ARPA SPOK LANG TEC, P405, DOI 10.3115/1075812.1075905 ODELL JJ, 1995, THESIS U CAMBRIDGE E OPENSHAW JP, 1994, P IEEE INT C AC SPEE, P49 Ortmanns S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607215 Ortmanns S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607214 Ortmanns S, 1997, COMPUT SPEECH LANG, V11, P43, DOI 10.1006/csla.1996.0022 PETERS J, 1999, VERBMOBIL FDN SPEECH, P79 PITZ M, 1999, P EUROSPEECH BUD HUN, P675 ROSENFELD R, 1994, THESIS CMU SCHWARTZ R, 1997, P DARPA SPEECH REC W Siegler M. A., 1997, P DARPA SPEECH REC W Steinbiss V., 1994, P INT C SPOK LANG PR, P2143 THELEN E, 1997, P IEEE INT C AC SPEE, P1035 WELLING L, 1998, P ICASSP SEATTL WA M, V2, P797, DOI 10.1109/ICASSP.1998.675385 WOODLAND PC, 1997, P ICASSP 97 MUN, P719 NR 39 TC 14 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2002 VL 37 IS 1-2 BP 109 EP 131 DI 10.1016/S0167-6393(01)00062-0 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 542EC UT WOS:000175029700008 ER PT J AU Sankar, A Gadde, VRR Stolcke, A Weng, FL AF Sankar, A Gadde, VRR Stolcke, A Weng, FL TI Improved modeling and efficiency for automatic transcription of Broadcast News SO SPEECH COMMUNICATION LA English DT Article DE acoustic modeling; acoustic adaptation; language modeling; lattice-based decoding ID ROBUST SPEECH RECOGNITION; MAXIMUM-LIKELIHOOD; MARKOV-CHAINS; ALGORITHM AB Over the last few years, the DARPA-sponsored Hub-4 continuous speech recognition evaluations have advanced speech recognition technology for automatic transcription of broadcast news. In this paper, we report on our research and progress in this domain, with an emphasis on efficient modeling with significantly fewer parameters for faster and more accurate recognition. In the acoustic modeling area, this was achieved through new parameter tying, Gaussian clustering, and mixture weight thresholding schemes. The effectiveness of acoustic adaptation is greatly increased through unsupervised clustering of test data. 
In language modeling, we explored the use of non-broadcast-news training data as well as the adaptation to topic and speaking styles. We developed an effective and efficient parameter pruning technique for backoff language models that allowed us to cope with ever increasing amounts of training data and expanded N-gram scopes. Finally, we improved our progressive search architecture with more efficient algorithms for lattice generation, compaction, and incorporation of higher-order language models. (C) 2002 Elsevier Science B.V. All rights reserved. C1 SRI Int, Menlo Pk, CA 94025 USA. RP Sankar, A (reprint author), Nuance Commun, 1380 Willow Rd, Menlo Pk, CA 94025 USA. CR Austin S., 1991, P IEEE INT C AC SPEE, P697, DOI 10.1109/ICASSP.1991.150435 BELLEGARDA JR, 1997, P 5 EUR C SPEECH COM, V3, P1451 CLARKSON P, 1997, P IEEE INT C AC SPEE, V2, P799 *DARPA, 1998, P DARPA BROADC NEWS DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DIGALAKIS V, 1994, P DARPA HUM LANG TEC, P292 DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P281, DOI 10.1109/89.506931 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P294, DOI 10.1109/89.506933 DODDINGTON G, 1992, P DARPA SPEECH NAT L, P363, DOI 10.3115/1075527.1075615 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Gersho A., 1991, VECTOR QUANTIZATION Godfrey J., 1992, ICASSP 92 IEEE INT C, V1, P517 GOOD IJ, 1953, BIOMETRIKA, V40, P237, DOI 10.2307/2333344 GUPTA SK, 1996, P INT C SPOK LANG PR, V3, P1828, DOI 10.1109/ICSLP.1996.607986 HECK L, 1997, P 5 EUR C SPEECH COM, V4, P1867 HOPCROFT J, 1979, INTRO AUT THEOR LANG Hwang M. Y., 1993, P ICASSP, P311 Iyer R., 1996, P INT C SPOK LANG PR, V1, P236, DOI 10.1109/ICSLP.1996.607085 JUANG BH, 1985, AT&T TECH J, V64, P1235 KNESER R, 1996, P INT C SPOK LANG PR, V1, P494, DOI 10.1109/ICSLP.1996.607162 Kubala F, 1997, P SPEECH REC WORKSH, P90 Legetter C.J., 1995, P ARPA WORKSH SPOK L, P110 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 MOHRI M, 1997, FSM LIB GEN PURPOSE MOHRI M, 1997, P EUROSPEECH RHOD GR, V1, P131 Murveit H., 1993, P IEEE INT C AC SPEE, V2, P319 NEUMEYER LR, 1995, P EUROSPEECH, V2, P1127 NEY H, 1994, P INT C SPOK LANG PR, V3, P1355 Odell J.J., 1995, THESIS CAMBRIDGE U E SANKAR A, 1999, P 6 EUR C SPEECH COM, V4, P1711 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 Sankar A, 1994, IEEE SIGNAL PROC LET, V1, P124, DOI 10.1109/97.311815 Sankar A., 1998, P 1997 DARPA BROADC, P99 SANKAR A, 1998, P DARPA BROADC NEWS, P91 SANKAR A, 1998, P INT C SPOK LANG PR, V6, P2499 SANKAR A, 1999, P DARPA BROADC NEWS, P281 SANKAR A, 1998, P INT C SPOK LANG PR, V5, P2219 SANKAR A, 1997, P DARPA SPEECH REC W, P127 Schwartz R., 1991, P IEEE INT C AC SPEE, V1, P701 SEYMORE K, 1997, P EUROSPEECH, V4, P1987 SEYMORE K, 1996, P ICSLP, V1, P232, DOI 10.1109/ICSLP.1996.607084 Stern R., 1997, P DARPA SPEECH REC W, P7 STROM N, 1997, THESIS KTH STOCKHOLM WENG F, 1998, P INT C SPOK LANG PR, V6, P2531 WENG F, 1998, P DARPA BROADC NEWS, P138 WENG F, 1997, P DARPA SPEECH REC W, P147 Woodland P., 1998, P DARPA BROADC NEWS, P41 WOODLAND PC, 1994, P IEEE INT C AC SPEE, V2, P125 Young S. J., 1993, P EUR C SPEECH COMM, V3, P2203 NR 50 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
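A minimal sketch of entropy-style pruning for backoff N-gram models, the family of techniques the Sankar et al. record above reports for coping with growing training data and expanded N-gram scopes. The divergence approximation, the threshold and the data layout are illustrative assumptions, not SRI's exact criterion.

    import math

    def prune_trigrams(trigrams, bigram_prob, backoff_wt, threshold=1e-7):
        """trigrams: dict (u, v, w) -> P(w | u, v); bigram_prob: dict
        (v, w) -> P(w | v); backoff_wt: dict (u, v) -> alpha(u, v).
        Keep only trigram entries whose replacement by their backoff
        estimate would noticeably change the model."""
        kept = {}
        for (u, v, w), p in trigrams.items():
            p_bo = backoff_wt[(u, v)] * bigram_prob[(v, w)]
            # approximate contribution of this entry to model divergence
            divergence = p * (math.log(p) - math.log(p_bo))
            if divergence > threshold:
                kept[(u, v, w)] = p
        return kept

After pruning, the backoff weights alpha(u, v) must be recomputed so that each history's probabilities again sum to one.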
PD MAY PY 2002 VL 37 IS 1-2 BP 133 EP 158 DI 10.1016/S0167-6393(01)00063-2 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 542EC UT WOS:000175029700009 ER PT J AU Soquet, A Lecuit, V Metens, T Demolin, D AF Soquet, A Lecuit, V Metens, T Demolin, D TI Mid-sagittal cut to area function transformations: Direct measurements of mid-sagittal distance and area with MRI SO SPEECH COMMUNICATION LA English DT Article DE mid-sagittal profile; area function; articulatory data ID VOCAL-TRACT; TONGUE MOVEMENT; MODEL; DIMENSIONS; VOWELS AB This paper presents a comparative study of transformations used to compute the area of cross-sections of the vocal tract from mid-sagittal measurements of the vocal tract. MRI techniques have been used to obtain both mid-sagittal distances and cross-sections of the vocal tract for French oral vowels uttered by two subjects. The measured cross-sectional areas can thus be compared to the cross-sectional areas computed by the different transformations. The evaluation is performed with a jackknife method where the parameters of the transformation are estimated from all but one measurement of a speaker's vocal tract region and evaluated on the remaining measurement. This procedure allows the study of both the performance of the different forms of transformation as a function of the vocal tract region and the stability of the transformation parameters for a given vocal tract region. Three different forms of transformation are compared: linear, polynomial and power function. The estimation performances are also compared with four existing transformations. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Free Univ Brussels, Lab Phonol, B-1050 Brussels, Belgium. Free Univ Brussels, Inst Langues Vivantes & Phonet, Lab Phonet Expt, Brussels, Belgium. Free Univ Brussels, Hop Erasme, Unite Resonance Magnet, Brussels, Belgium. RP Soquet, A (reprint author), Free Univ Brussels, Lab Phonol, 50 Av F D Roosevelt,CP 175, B-1050 Brussels, Belgium. EM asoquet@ulb.ac.be CR BAER T, 1991, J ACOUST SOC AM, V90, P799, DOI 10.1121/1.401949 BEAUTEMPS D, 1995, SPEECH COMMUN, V16, P27, DOI 10.1016/0167-6393(94)00045-C Bothorel A., 1986, CINERADIOGRAPHIE VOY Chiba T., 1941, VOWEL ITS NATURE STR Demolin D., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607098 FANT G, 1992, P INT C SPOK LANG BA, P807 FUJIMURA O, 1973, Computers in Biology and Medicine, V3, P371, DOI 10.1016/0010-4825(73)90003-6 GREENWOOD AR, 1992, IEE PROC-I, V139, P553 HEINZ JM, 1965, P 5 INT C AC LIEG PA JOHANSSON C, 1983, ROYAL I TECHNOLOGY S, V4, P39 LADEFOGED P, 1971, DIRECT MEASUREMENT V, P4 LAKSHMINARAYANAN AV, 1991, JMRI-J MAGN RESON IM, V1, P71, DOI 10.1002/jmri.1880010109 MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131 MAEDA S, 1978, ACT 9 JOURN ET PAR, P191 MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427 NARAYANAN SS, 1995, J ACOUST SOC AM, V98, P1325, DOI 10.1121/1.413469 PERRIER P, 1992, J SPEECH HEAR RES, V35, P53 SCHONLE PW, 1987, BRAIN LANG, V31, P26, DOI 10.1016/0093-934X(87)90058-7 Soquet A., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat.
No.96TH8206), DOI 10.1109/ICSLP.1996.607824 STONE M, 1991, J PHONETICS, V19, P309 STONE M, 1990, J ACOUST SOC AM, V87, P2207, DOI 10.1121/1.399188 SUNDBERG J, 1969, PROBLEM OBTAINING AR, P43 SUNDBERG J, 1987, PHONETICA, V44, P76 NR 23 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2002 VL 36 IS 3-4 BP 169 EP 180 DI 10.1016/S0167-6393(00)00084-4 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700001 ER PT J AU Irino, T Patterson, RD AF Irino, T Patterson, RD TI Segregating information about the size and shape of the vocal tract using a time-domain auditory model: The stabilised wavelet-Mellin transform SO SPEECH COMMUNICATION LA English DT Article DE auditory pathway; Mellin transform; wavelet transform; stabilised auditory image; size-shape image; gammachirp auditory filter ID TEMPORAL ASYMMETRY; SCALE; REPRESENTATIONS; NORMALIZATION; SPEECH; FILTER; SYSTEM; SOUND AB We hear vowels pronounced by men and women as approximately the same although the length of the vocal tract varies considerably from group to group. At the same time, we can identify the speaker group. This suggests that the auditory system can extract and separate information about the size of the vocal-tract from information about its shape. The duration of the impulse response of the vocal tract expands or contracts as the length of the vocal tract increases or decreases. There is a transform, the Mellin transform, that is immune to the effects of time dilation; it maps impulse responses that differ in temporal scale onto a single distribution and encodes the size information separately as a scalar constant. In this paper we investigate the use of the Mellin transform for vowel normalisation. In the auditory system, sounds are initially subjected to a form of wavelet analysis in the cochlea and then, in each frequency channel, the repeating patterns produced by periodic sounds appear to be stabilised by a form of time-interval calculation. The result is like a two-dimensional array of interval histograms and it is referred to as an auditory image. In this paper, we show that there is a two-dimensional form of the Mellin transform that can convert the auditory images of vowel sounds from vocal tracts with different sizes into an invariant Mellin image (MI) and, thereby, facilitate the extraction and separation of the size and shape information associated with a given vowel type. In signal processing terms, the MI of a sound is the Mellin transform of a stabilised wavelet transform of the sound. We suggest that the MI provides a good model of auditory vowel normalisation, and that this provides a good framework for auditory processing from cochlea to cortex. (C) 2002 Elsevier Science B.V. All rights reserved. C1 ATR, Human Informat Proc Res Labs, Kyoto 6190288, Japan. Univ Cambridge, Dept Physiol, Ctr Neural Basis Hearing, Cambridge CB2 3EG, England. RP Irino, T (reprint author), Nippon Telegraph & Tel Corp, Commun Sci Labs, 2-4 Hikaridai,Seika Cho, Kyoto 6190237, Japan.
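The dilation invariance motivating the Irino and Patterson record can be stated in one line. With the Mellin transform written in the kernel convention below, time-scaling a signal changes the transform only by a multiplicative factor, so shape information is preserved while size appears as a scalar (a standard property of the transform, not a result unique to the paper):

    \mathcal{M}\{s\}(c) = \int_0^{\infty} s(t)\, t^{-c-1}\, dt,
    \qquad
    \mathcal{M}\{s(at)\}(c) = a^{c}\, \mathcal{M}\{s\}(c)

Thus impulse responses of longer and shorter vocal tracts, which under a uniform-scaling idealisation are time-dilated versions of one another, map to Mellin representations that differ only in that scalar factor.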
EM irino@hip.atr.co.jp; roy.patterson@mrc-cbu.cam.ac.uk CR ALTES RA, 1978, J ACOUST SOC AM, V63, P174, DOI 10.1121/1.381708 Bachorowski JA, 1999, J ACOUST SOC AM, V106, P1054, DOI 10.1121/1.427115 BERTRAND J, 1996, TRANSFORMS APPL HDB Carney LH, 1999, J ACOUST SOC AM, V105, P2384, DOI 10.1121/1.426843 COHEN L, 1993, IEEE T SIGNAL PROCES, V41, P3275, DOI 10.1109/78.258073 COHEN L, 1991, P SOC PHOTO-OPT INS, V1566, P109, DOI 10.1117/12.49816 Combes J. M., 1989, WAVELETS DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DEBOER E, 1978, J ACOUST SOC AM, V63, P115, DOI 10.1121/1.381704 Fant G., 1970, ACOUSTIC THEORY SPEE Fitch WT, 1999, J ACOUST SOC AM, V106, P1511, DOI 10.1121/1.427148 Flanagan J., 1972, SPEECH ANAL SYNTHESI Gabor D., 1946, Journal of the Institution of Electrical Engineers. III. Radio and Communication Engineering, V93 GAMBARDELLA G, 1979, J ACOUST SOC AM, V66, P913, DOI 10.1121/1.383203 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T Hashi M, 1998, J ACOUST SOC AM, V104, P2426, DOI 10.1121/1.423750 Huber JE, 1999, J ACOUST SOC AM, V106, P1532, DOI 10.1121/1.427150 Imai S, 1983, ICASSP 83, P93 Irino T, 1997, J ACOUST SOC AM, V101, P412, DOI 10.1121/1.417975 IRINO T, 1999, S REC DEV AUD MECH S IRINO T, 1999, TRH264 ATR IRINO T, 1999, P EUR 99 BUD HUNG Irino T, 1996, J ACOUST SOC AM, V99, P2316, DOI 10.1121/1.415419 KLAUDER J, 1980, FUNCTIONAL INTEGRATI Patterson RD, 1998, J ACOUST SOC AM, V104, P2967, DOI 10.1121/1.423879 PATTERSON RD, 1994, J ACOUST SOC AM, V96, P1409, DOI 10.1121/1.410285 PATTERSON RD, 1987, J ACOUST SOC AM, V82, P1560, DOI 10.1121/1.395146 PATTERSON RD, 1995, J ACOUST SOC AM, V98, P1890, DOI 10.1121/1.414456 PATTERSON RD, 1994, J ACOUST SOC AM, V96, P1419, DOI 10.1121/1.410286 Pullum G. K., 1986, PHONETIC SYMBOL GUID Rabiner L.R., 1978, DIGITAL PROCESSING S Titchmarsh E C, 1948, INTRO THEORY FOURIER Umesh S, 1999, IEEE T SPEECH AUDI P, V7, P40, DOI 10.1109/89.736329 WAKITA H, 1977, IEEE T ACOUST SPEECH, V25, P183, DOI 10.1109/TASSP.1977.1162929 YANG CS, 1995, J ACOUST SOC JPN E, V16, P41 NR 35 TC 53 Z9 54 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2002 VL 36 IS 3-4 BP 181 EP 203 DI 10.1016/S0167-6393(00)00085-6 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700002 ER PT J AU Hu, HT Kuo, FJ Wang, HJ AF Hu, HT Kuo, FJ Wang, HJ TI Supplementary schemes to spectral subtraction for speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE spectral subtraction; cepstral processing; comb filtering ID NOISE; RECOGNITION AB Three supplementary schemes, namely spectral smoothing, formant intensification and comb filtering, are proposed to enhance noisy speech that has been processed with a classic spectral subtraction method. Performance improvements are mainly achieved by exploring cepstral properties and comb filtering, and they are evaluated using both objective and subjective measures based on 12 noise-corrupted speech files. Noise types considered in this study consist of white Gaussian and car noise with signal-to-noise ratio (SNR) set to 5 and 10 dB, respectively.
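A minimal sketch of the classic magnitude spectral subtraction that the Hu et al. record takes as its starting point, with the usual oversubtraction factor and spectral floor; the parameter values are illustrative, and the paper's supplementary schemes (spectral smoothing, formant intensification, comb filtering) are not reproduced here.

    import numpy as np

    def spectral_subtract(frames, noise_mag, alpha=2.0, beta=0.02):
        """frames: complex STFT (num_frames x num_bins); noise_mag:
        mean noise magnitude spectrum from speech-free frames. alpha
        oversubtracts to curb residual noise; beta floors the result
        to limit musical noise. Returns enhanced complex spectra."""
        mag = np.abs(frames)
        phase = np.angle(frames)
        clean = np.maximum(mag - alpha * noise_mag, beta * mag)
        return clean * np.exp(1j * phase)   # reuse the noisy phase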
Although experimental results indicate a certain inconsistency between objective and subjective measures, it can still be concluded that spectral smoothing and comb filtering contribute to the amelioration of speech quality for speech corrupted by additive white Gaussian noise. On the other hand, advantages due to formant intensification can only be recognized whenever musical noise is suppressed and the original speech formant characteristics are preserved. The employment of all proposed schemes together generally leads to a notable improvement in perceived quality. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Natl I Lan Inst Technol, Dept Elect Engn, I Lan 260, Taiwan. RP Hu, HT (reprint author), Natl I Lan Inst Technol, Dept Elect Engn, 1 Shern Nong Rd, I Lan 260, Taiwan. EM hthu@mail.ilantech.edu.tw CR ARSLAN L, 1995, P IEEE INT C AC SPEE, P812 ATAL BS, 1979, IEEE T ACOUST SPEECH, V27, P247, DOI 10.1109/TASSP.1979.1163237 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 CHEN JH, 1995, IEEE T SPEECH AUDI P, V3, P59 CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427 DEMOOR B, 1993, IEEE T SIGNAL PROCES, V41, P2826, DOI 10.1109/78.236505 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 Hu HT, 1998, ELECTRON LETT, V34, P16, DOI 10.1049/el:19980107 JENSEN SH, 1995, IEEE T SPEECH AUDI P, V3, P439, DOI 10.1109/89.482211 JOHNSTON JD, 1988, IEEE J SEL AREA COMM, V6, P323 KNAGENHJELM HP, 1995, P IEEE INT C AC SPEE, P732 LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P354 Sim BL, 1998, IEEE T SPEECH AUDI P, V6, P328 TSOUKALAS D, 1993, P IEEE INT C AC SPEE, V2, P359 Vaseghi S. V., 1996, ADV SIGNAL PROCESSIN Virag N, 1999, IEEE T SPEECH AUDI P, V7, P126, DOI 10.1109/89.748118 Xydeas CS, 1999, IEEE T SPEECH AUDI P, V7, P113, DOI 10.1109/89.748117 Yegnanarayana B, 1999, SPEECH COMMUN, V28, P25, DOI 10.1016/S0167-6393(98)00070-3 NR 21 TC 8 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2002 VL 36 IS 3-4 BP 205 EP 218 DI 10.1016/S0167-6393(00)00086-8 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700003 ER PT J AU Nemer, E Goubran, R Mahmoud, S AF Nemer, E Goubran, R Mahmoud, S TI Speech enhancement using fourth-order cumulants and optimum filters in the subband domain SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; higher order statistics; noise reduction ID NOISE AB A new method for speech enhancement using time-domain optimum filters and fourth-order cumulants (FOC) is proposed based on newly established properties of the FOC of speech signals. In the exploratory part of the paper, the analytical expression of the FOC of subbanded speech is derived assuming a sinusoidal model and up to two harmonics per band. Important properties of this cumulant are revealed and actual speech data is used to verify the derivations and the underlying model. In the application part of the work, speech enhancement is formulated as an estimation problem and the expression for the time-domain causal optimum filters is derived for a pth order system.
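One way to see why fourth-order statistics help in the setting of the Nemer et al. record: for zero-mean Gaussian noise the fourth moment equals three times the squared variance, so excess kurtosis in a subband indicates non-Gaussian (speech) energy. A toy detector built on that fact follows; the threshold and decision rule are illustrative assumptions, not the paper's optimum-filter design.

    import numpy as np

    def excess_kurtosis(x):
        """Normalised fourth-order cumulant of zero-mean x: about zero
        for Gaussian noise, typically positive for voiced speech."""
        m2 = np.mean(x ** 2)
        m4 = np.mean(x ** 4)
        return m4 / (m2 ** 2) - 3.0

    def speech_present(subband_frame, thresh=0.5):
        """Flag a subband frame as speech-dominated when its excess
        kurtosis exceeds a threshold (illustrative decision rule)."""
        return excess_kurtosis(subband_frame) > thresh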
The key idea is to use the FOC of the noisy speech to estimate the parameters required for the enhancement filters, namely the second-order statistics of the speech and noise. It is shown that the kurtosis and the diagonal slice of the FOC may be used to estimate such parameters as the SNR, the speech autocorrelation and the probability of speech presence in a given band. Subjective listening and examination of the spectrograms show that the resulting algorithm is effective on typical noises encountered in mobile telephony. Compared to the TIA-IS127 standard for noise reduction, it results in more overall noise reduction and better speech preservation in Gaussian, street and fan noise. Its effectiveness diminishes, however, for harmonic and impulsive noise types such as office and car-engine noise, where discrimination between speech and noise based on FOC becomes more difficult. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Intel Corp, San Jose, CA 95134 USA. Carleton Univ, Ottawa, ON K1S 5B6, Canada. RP Nemer, E (reprint author), Nortel Networks, St Laurent, PQ, Canada. EM enemer@ieee.org; goubran@sce.carleton.ca; mahmoud@sce.carleton.ca CR Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 Deller J. R., 1993, DISCRETE TIME PROCES EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FULCHIERO R, 1993, P IN C AC SPEECH SIG, V4, P488 HOELDRICH R, 1997, P ICSPAT, P265 KOILPILLAI RD, 1992, IEEE T SIGNAL PROCES, V40, P770, DOI 10.1109/78.127951 LEONGARCIA A, 1989, PROBABILITY RANDOM 4, P405 MASGRAU E, 1992, SIGNAL PROCESS, V6, P307 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 MENDEL JM, 1991, P IEEE, V79, P278, DOI 10.1109/5.75086 MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 MOORE BCJ, 1981, J ACOUST SOC AM, V70, P1003, DOI 10.1121/1.386950 Nemer E., 1999, THESIS CARLETON U OT NIKIAS CL, 1993, IEEE SIGNAL PROC JUL, P10 OSHAUGHNESSY D, 1989, IEEE COMMUN MAG, V27, P46, DOI 10.1109/35.17653 PALIWAL KK, 1991, P INT C AC SPEECH SI, P429, DOI 10.1109/ICASSP.1991.150368 RUIZ DP, 1995, IEEE T SIGNAL PROCES, V43, P2665, DOI 10.1109/78.482116 SCALART P, 1996, P IEEE INT C AC SPEE, P629 SWAMI A, 1991, IEEE T SIGNAL PROCES, V39, P1099, DOI 10.1109/78.80965 *TIA EIA, 1997, IS127 TIAEIA NR 21 TC 7 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2002 VL 36 IS 3-4 BP 219 EP 246 DI 10.1016/S0167-6393(00)00081-9 PG 28 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700004 ER PT J AU Wang, WJ Liao, YF Chen, SH AF Wang, WJ Liao, YF Chen, SH TI RNN-based prosodic modeling for Mandarin speech and its application to speech-to-text conversion SO SPEECH COMMUNICATION LA English DT Article DE recurrent neural network; prosodic modeling; speech-to-text conversion; acoustic decoding; linguistic decoding ID RECOGNITION; INFORMATION AB In this paper, a recurrent neural network (RNN) based prosodic modeling method for Mandarin speech-to-text conversion is proposed. The prosodic modeling is performed in the post-processing stage of acoustic decoding and aims at detecting word-boundary cues to assist in linguistic decoding.
It employs a simple three-layer RNN to learn the relationship between input prosodic features, extracted from the input utterance with syllable boundaries pre-determined by the preceding acoustic decoder, and output word-boundary information of the associated text. After the RNN prosodic model is properly trained, it can be used to generate word-boundary cues to help the linguistic decoder solve the problem of word-boundary ambiguity. Two schemes of using these word-boundary cues are proposed. Scheme 1 modifies the baseline scheme of the conventional linguistic decoding search by directly taking the RNN outputs as additional scores and adding them to all word-sequence hypotheses to assist in selecting the best recognized word sequence. Scheme 2 extends Scheme 1 by further using the RNN outputs to drive a finite state machine (FSM) that sets path constraints to restrict the linguistic decoding search. Character accuracy rates of 73.6%, 74.6% and 74.7% were obtained for the systems using the baseline scheme, Scheme 1 and Scheme 2, respectively. In addition, a 17% reduction in the computational complexity of the linguistic decoding search was obtained for Scheme 2. The proposed prosodic modeling method is therefore promising for Mandarin speech recognition. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Natl Chiao Tung Univ, Dept Commun Engn, Hsinchu, Taiwan. RP Wang, WJ (reprint author), 6F,82 Tz Chung St, Taoyuan 320, Taiwan. EM wernjun@ms.chttl.com.tw; schen@cc.nctu.edu.tw CR BAI BR, 1997, P ICASSP, P903 BATLINER A, 1996, P INT C SPOK LANG PR, V3, P1720, DOI 10.1109/ICSLP.1996.607959 Bou-Ghazale SE, 1998, IEEE T SPEECH AUDI P, V6, P201, DOI 10.1109/89.668815 CAMPBELL N, 1993, SPEECH COMMUN, V13, P343, DOI 10.1016/0167-6393(93)90033-H Chen SH, 1998, IEEE T SPEECH AUDI P, V6, P226 Cheng C.-C., 1973, SYNCHRONIC PHONOLOGY Chiang TH, 1996, IEEE T SPEECH AUDI P, V4, P167 *CHIN KNOWL INF PR, 1995, 9502 CHIN KNOWL INF ELMAN JL, 1990, COGNITIVE SCI, V14, P179, DOI 10.1207/s15516709cog1402_1 Grice M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607958 Haykin S, 1994, NEURAL NETWORKS COMP HIROSE K, 1998, P IEEE ICASSP 98 SEA, V1, P25, DOI 10.1109/ICASSP.1998.674358 HIROSE K, 1997, P EUR C SPEECH COMM, V1, P311 HSIEH HY, 1996, P INT C SPOK LANG PR, V1, P809 HUANG TL, 1994, IEEE PARALL DISTRIB, V2, P3 HUNT A, 1994, P INT C AC SPEECH SI, V2, P169 IWANO K, 1998, P INT C SPOK LANG PR, V3, P599 IWANO K, 1999, P INT C AC SPEECH SI, V1, P133 Iwano K., 1999, P EUR C SPEECH COMM, V1, P231 KOMPE R, 1995, P EUR C SPEECH COMM, V2, P1333 KOMPE R, 1997, P IEEE INT C AC SPEE, P811 Lyu R.-Y., 1995, P IEEE ICASSP, P57 Markel JD, 1976, LINEAR PREDICTION SP Morgan D.P., 1991, NEURAL NETWORKS SPEE NIEMANN H, 1997, P IEEE INT C AC SPEE, P75 PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770 ROBINSON AJ, 1994, IEEE T NEURAL NETWOR, V5, P298, DOI 10.1109/72.279192 Silverman K., 1992, P INT C SPOK LANG PR, V2, P867 SU YS, 1994, THESIS NATL CHIAO TU WANG YR, 1994, J ACOUST SOC AM, V96, P2637, DOI 10.1121/1.411274 Wightman CW, 1994, IEEE T SPEECH AUDI P, V2, P469, DOI 10.1109/89.326607 WU Z, 1998, P C PHON LANG CHIN, P125 YANG YJ, 1994, P INT C SPOK LANG PR, V3, P1371 YIN YM, 1989, PHONOLOGICAL ASPECTS NR 34 TC 20 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD MAR PY 2002 VL 36 IS 3-4 BP 247 EP 265 DI 10.1016/S0167-6393(01)00006-1 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700005 ER PT J AU Zhang, YX Medievski, A Lawrence, J Song, JM AF Zhang, YX Medievski, A Lawrence, J Song, JM TI A study on tone statistics in Chinese names SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; tonal language ID RECOGNITION AB This paper describes a study on tone statistics of people's names in Mandarin Chinese. The problem arose when we tried to apply an English version of a speech recognizer to a Chinese voice tag dialing task. The questions were: (1) How serious is the tone-confusable problem in a Chinese voice tag dialing system? (2) Do we need to enhance the recognizer to properly recognize tonal languages? To gain more information, we studied a Chinese name database consisting of 1.6 million names. The statistical analysis shows the potential for a problem with tone-confusable names in a Mandarin voice tag dialing system. We developed a tone enhancement approach to turn an English speech recognizer into a tonal-language speech recognizer. We performed benchmark testing to compare the recognition performance of an English-version speech recognizer and a tone-enhanced version, both working on a small database of Chinese names. Then we did a pseudo-recognition analysis for the large Chinese voice tag database. From this analysis we conclude that enhancing an English-version speech recognizer to add tonal recognition capabilities would improve recognition of Chinese names, reducing the recognition error rate by 10-20%. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Motorola Inc, China Res Ctr, Shanghai 200002, Peoples R China. RP Zhang, YX (reprint author), Motorola Inc, China Res Ctr, 3-F Cen Pl,16 Henan Rd S, Shanghai 200002, Peoples R China. EM A12586@email.mot.com CR CHEN SH, 1995, IEEE T SPEECH AUDI P, V3, P146 Hon H.-W., 1994, P IEEE INT C AC SPEE, V1, P545 Lee LS, 1997, IEEE SIGNAL PROC MAG, V14, P63 LEE T, 1995, IEEE T SAP, V3, P194 LIN CH, 1993, P ICASSP 1993, V2, P227 YANG WJ, 1988, IEEE T ACOUST SPEECH, V36, P988, DOI 10.1109/29.1620 NR 6 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2002 VL 36 IS 3-4 BP 267 EP 275 DI 10.1016/S0167-6393(01)00007-3 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700006 ER PT J AU Nakamura, A AF Nakamura, A TI Restructuring Gaussian mixture density functions in speaker-independent acoustic models SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; acoustic model; Gaussian mixture density function; modeling mismatch; frame-level error AB In continuous speech recognition featuring hidden Markov model (HMM), word N-gram, and time-synchronous beam search, a local modeling mismatch in the HMM will often cause the recognition performance to degrade. To cope with this problem, this paper proposes a method of restructuring Gaussian mixture output probability density functions (pdfs) in a pre-trained speaker-independent HMM set based on speech data. In this method, Gaussians are copied from other mixture pdfs, taking the distribution of local errors into account.
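As a toy illustration of this copying step (our reading of the abstract, not the published algorithm; the state names, donor mapping, error counts and renormalisation rule are all illustrative assumptions):

```python
# States with many frame-level errors receive a Gaussian shared from a
# donor state's mixture; appending the same immutable tuple shares the
# component, so the number of distinct Gaussians does not grow.
def share_gaussians(mixtures, frame_errors, donor_of, top_k=2):
    """mixtures: {state: [(weight, mean, var), ...]};
    frame_errors: {state: error count}; donor_of: {state: donor state}."""
    worst = sorted(frame_errors, key=frame_errors.get, reverse=True)[:top_k]
    for s in worst:
        # share the donor's dominant Gaussian with the error-prone state
        w, mu, var = max(mixtures[donor_of[s]], key=lambda g: g[0])
        mixtures[s].append((w, mu, var))
        total = sum(g[0] for g in mixtures[s])
        mixtures[s] = [(wt / total, m, v) for wt, m, v in mixtures[s]]
    return mixtures

mix = {"s1": [(1.0, 0.0, 1.0)], "s2": [(0.6, 2.0, 1.0), (0.4, 3.0, 1.0)]}
mix = share_gaussians(mix, {"s1": 120, "s2": 5}, {"s1": "s2"}, top_k=1)
print(mix["s1"])   # now two components, weights renormalised to sum to 1
```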
This method leads to a restructuring of the mixture pdfs, where some Gaussians are shared by several states and the total number of Gaussians is not modified. Furthermore, the distribution of local errors is extracted by comparing the pre-trained HMM set and the speech data used in the pre-training, and thus new training data are not needed for this restructuring method. Experimental results prove that the proposed restructuring method can effectively resolve local modeling mismatches and improve recognition performance. (C) 2002 Elsevier Science B.V. All rights reserved. C1 ATR, Interpreting Telephony Res Labs, Kyoto 6190288, Japan. RP Nakamura, A (reprint author), Nippon Telegraph & Tel Corp, Commun Sci Labs, D-202,2-4 Hikaridai Seika Cho, Kyoto 6190237, Japan. EM ats@cslab.kecl.ntt.co.jp CR Bahl L. R., 1986, P IEEE INT C AC SPEE, P49 CHOU W, 1992, P IEEE INT C AC SPEE, P473, DOI 10.1109/ICASSP.1992.225869 Huang X., 1989, P IEEE INT C AC SPEE, P639 Huang X.D., 1990, HIDDEN MARKOV MODELS Huang X.D., 1990, P ICASSP, P689 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 Katagiri S, 1998, P IEEE, V86, P2345, DOI 10.1109/5.726793 MASATAKI H, 1996, P ICASSP, P188 Morimoto T., 1994, P ICSLP, P1791 NAKAMURA A, 1997, P EUR C SPEECH COMM, P1567 Nakamura A., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607241 Ostendorf M, 1997, COMPUT SPEECH LANG, V11, P17, DOI 10.1006/csla.1996.0021 PEINADO AM, 1994, P IEEE INT C AC SPEE, P61 SEGRA JC, 1994, SPEECH COMMUN, V14, P163 SHIMIZU T, 1996, P ICASSP 96, P145 SHIMIZU T, 1996, P 1996 FALL M AC SOC, P97 ZELJKOVIC I, 1996, P IEEE INT C AC SPEE, P129 NR 17 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2002 VL 36 IS 3-4 BP 277 EP 289 DI 10.1016/S0167-6393(00)00087-X PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700007 ER PT J AU Chien, JT AF Chien, JT TI Adaptive hierarchy of hidden Markov models for transformation-based adaptation SO SPEECH COMMUNICATION LA English DT Article DE transformation-based adaptation; tree structure; speaker adaptation; speech recognition; hidden Markov model ID A-POSTERIORI ESTIMATION; SPEECH RECOGNITION; MAXIMUM-LIKELIHOOD; SPEAKER ADAPTATION; ALGORITHM AB Transformation-based adaptation, which transforms clusters of speaker-independent (SI) hidden Markov model (HMM) parameters to an enrolled speaker by using cluster-dependent transformation functions, is an effective algorithm for robust speech recognition. To obtain desirable performance for any amount of adaptation data, it is beneficial to establish a tree structure of HMM parameters and apply it to dynamically control the sharing of transformation parameters. Traditionally, the transformation sharing is determined by phonetic rules or by clustering the acoustic space of training data. The tree structure is then kept unchanged for speaker adaptation (SA). In this paper, we adapt the tree structure to the new environment so that the transformation parameters can be extracted adaptively by referring to the newest hierarchy of HMM parameters. The adaptation of the hierarchical tree is herein combined into the maximum likelihood (ML) estimation of transformation parameters.
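For orientation, here is a minimal sketch of the tree-controlled transformation sharing that the paper makes adaptive: the general MLLR-style mechanism, with node ids, the occupancy threshold and the root fallback all being illustrative assumptions of ours rather than the paper's specifics.

```python
# Each tree node holds an affine mean transform (A, b); a Gaussian uses
# the deepest ancestor node that has enough adaptation data.
import numpy as np

def transform_mean(mean, path, transforms, counts, min_count=500.0):
    """mean: Gaussian mean vector; path: node ids from root to the
    Gaussian's leaf cluster; transforms: {node: (A, b)};
    counts: {node: adaptation frames assigned to that node}."""
    for node in reversed(path):          # prefer the deepest usable node
        if counts.get(node, 0.0) >= min_count:
            A, b = transforms[node]
            return A @ mean + b
    A, b = transforms[path[0]]           # fall back to the root transform
    return A @ mean + b

root, leaf = "root", "vowel/a"
transforms = {root: (np.eye(2), np.zeros(2)),
              leaf: (1.1 * np.eye(2), np.array([0.2, -0.1]))}
# the leaf has too little data (120 frames), so the root transform is used
print(transform_mean(np.array([1.0, 2.0]), [root, leaf],
                     transforms, {root: 9000.0, leaf: 120.0}))
```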
From a series of speaker adaptation experiments, we find that transformation-based adaptation with an adaptive hierarchy of HMM parameters outperforms that with a static hierarchy for different cases of tree depths and adaptation data lengths. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. RP Chien, JT (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. EM jtchien@mail.ncku.edu.tw CR Chien JT, 1997, SPEECH COMMUN, V22, P369, DOI 10.1016/S0167-6393(97)00033-2 Chien JT, 1999, IEEE T SPEECH AUDI P, V7, P656 Chien JT, 2000, SPEECH COMMUN, V30, P235, DOI 10.1016/S0167-6393(99)00052-7 Chien JT, 1997, IEEE SIGNAL PROC LET, V4, P167 Chou W., 1999, P EUR C SPEECH COMM, V1, P1 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 GALES MJF, 1997, P 5 EUR C SPEECH COM, V4, P2067 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 HUO Q, 1999, INT C AC SPEECH SIGN, V2, P577 Jacobs R. A., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.1.79 Johnson S., 1998, P 5 INT C SPOK LANG, P1775 LASRY MJ, 1984, IEEE T PATTERN ANAL, V6, P530 LEE CH, 1999, P WORKSH ROB METH SP, P45 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Sankar A, 1996, INT CONF ACOUST SPEE, P713, DOI 10.1109/ICASSP.1996.543220 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 Shinoda K, 1996, INT CONF ACOUST SPEE, P717, DOI 10.1109/ICASSP.1996.543221 SHINODA K, 1995, P EUR 95, P1143 SHINODA K, 1998, IEEE P INT C AC SPEE, V2, P793 Siohan O., 1999, P WORKSH ROB METH SP, P147 TAKAHASHI J, 1994, P INT C SPOK LANG PR, P991 Tou J. T., 1974, PATTERN RECOGNITION VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 ZAVALIAGKOS G, 1995, IEEE P INT C AC SPEE, V1, P676 NR 24 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2002 VL 36 IS 3-4 BP 291 EP 304 DI 10.1016/S0167-6393(00)00088-1 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700008 ER PT J AU Naito, M Deng, L Sagisaka, Y AF Naito, M Deng, L Sagisaka, Y TI Speaker clustering for speech recognition using vocal tract parameters SO SPEECH COMMUNICATION LA English DT Article DE vocal tract parameters; speaker-clustering; speech recognition AB We propose speaker clustering methods for speech recognition based on vocal tract (VT) size-related articulatory parameters associated with individual speakers. Two parameters characterizing gross VT dimensions are first derived from the formant frequencies of two vowels and are then used to cluster speakers. The resulting speaker clusters are significantly different from speaker clusters obtained by conventional acoustic criteria. Then phoneme recognition experiments are carried out by using speaker-clustered HMMs (SC-HMMs) trained for each cluster. The proposed method requires a small amount of speech data for speaker clustering and for selecting the most suitable SC-HMM for a target speaker, but gives higher recognition rates than conventional speaker clustering methods based on acoustic criteria. (C) 2002 Elsevier Science B.V. All rights reserved. C1 ATR, Interpreting Telephony Res Labs, Kyoto 6190288, Japan. Univ Waterloo, Dept Elect & Comp Engn, Waterloo, ON N2L 3G1, Canada.
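A hedged sketch of the clustering step in the abstract above: two crude vocal-tract-size correlates per speaker, derived from the formants of two vowels and then k-means clustered. The particular formant ratios and the example formant values are our simplifications; the paper derives its own VT parameters.

```python
import numpy as np

def vt_parameters(f_a, f_i):
    """f_a, f_i: (F1, F2, F3) in Hz for /a/ and /i/ of one speaker."""
    overall = 3500.0 / np.mean(np.concatenate([f_a, f_i]))  # inverse scale
    front_back = f_i[1] / f_a[1]          # F2(/i/) relative to F2(/a/)
    return np.array([overall, front_back])

def kmeans(points, k=2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = ((points[:, None] - centers) ** 2).sum(-1).argmin(1)
        for j in range(k):                # keep old center if cluster empty
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(0)
    return labels

spk = np.stack([vt_parameters(np.array(a), np.array(i)) for a, i in [
    ((730, 1090, 2440), (270, 2290, 3010)),   # larger (male-like) tract
    ((850, 1220, 2810), (310, 2790, 3310)),   # smaller (female-like) tract
    ((780, 1150, 2600), (290, 2500, 3100)),
]])
print(kmeans(spk, k=2))
```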
RP Naito, M (reprint author), ATR, Interpreting Telephony Res Labs, 2-2 Hikaridai,Seika Cho, Kyoto 6190288, Japan. CR FLANAGAN JL, 1983, SPEECH ANAL SYNTHESI GALVAN A, 1997, THESIS I NATL POLYTE GALVAN A, 1998, 9811 UWECE KOSAKA T, 1994, P ICASSP 94, P245 Ostendorf M, 1997, COMPUT SPEECH LANG, V11, P17, DOI 10.1006/csla.1996.0021 SUGAMURA N, 1983, P ICASSP 83, P243 TAKEZAWA T, 1998, P 1 INT WORKSH E AS, P148 TONOMURA M, 1995, P ICASSP 95, P688 NR 8 TC 9 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2002 VL 36 IS 3-4 BP 305 EP 315 DI 10.1016/S0167-6393(00)00089-3 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700009 ER PT J AU Tsai, WH Chang, WW AF Tsai, WH Chang, WW TI Discriminative training of Gaussian mixture bigram models with application to Chinese dialect identification SO SPEECH COMMUNICATION LA English DT Article DE Gaussian mixture bigram model; minimum classification error algorithm; Chinese dialect identification ID AUTOMATIC LANGUAGE IDENTIFICATION; DYNAMIC FEATURE PARAMETERS; SPEECH RECOGNITION AB This study focuses on the parametric stochastic modeling of characteristic sound features that distinguish languages from one another. A new stochastic model, the Gaussian mixture bigram model (GMBM), which allows exploitation of the acoustic feature bigram statistics without requiring transcribed training data, is introduced. For greater efficiency, a minimum classification error (MCE) algorithm is employed to accomplish discriminative training of a GMBM-based Chinese dialect identification system. Simulation results demonstrate the effectiveness of the GMBM for dialect-specific acoustic modeling, and use of this model allows the proposed system to distinguish between the three major Chinese dialects spoken in Taiwan with 94.4% accuracy. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Natl Chiao Tung Univ, Dept Commun Engn, Hsinchu, Taiwan. RP Chang, WW (reprint author), Natl Chiao Tung Univ, Dept Commun Engn, Hsinchu, Taiwan. CR Chengalvarayan R, 1997, IEEE T SPEECH AUDI P, V5, P232, DOI 10.1109/89.568730 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Deng L, 1994, IEEE SIGNAL PROC LET, V1, P66 HARBECK S, 1999, P EUROSPEECH 99 Hazen T. J., 1993, P EUROSPEECH 93, P1303 Hazen TJ, 1997, J ACOUST SOC AM, V101, P2323, DOI 10.1121/1.418211 HOUSE AS, 1977, J ACOUST SOC AM, V62, P708, DOI 10.1121/1.381582 Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 Katagiri S., 1991, P IEEE WORKSH NEUR N, P299 Lee LS, 1997, IEEE SIGNAL PROC MAG, V14, P63 MARKEL JD, 1972, IEEE T ACOUST SPEECH, VAU20, P129, DOI 10.1109/TAU.1972.1162367 Muthusamy YK, 1994, IEEE SIGNAL PROC MAG, V11, P33, DOI 10.1109/79.317925 Ramsey S. R., 1987, LANGUAGES CHINA TSAI WH, 1997, THESIS NATL CHIAO TU Wellekens C. J., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Zissman MA, 1996, IEEE T SPEECH AUDI P, V4, P31, DOI 10.1109/TSA.1996.481450 NR 16 TC 10 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD MAR PY 2002 VL 36 IS 3-4 BP 317 EP 326 DI 10.1016/S0167-6393(00)00090-X PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700010 ER PT J AU Lee, T Lo, WK Ching, PC Meng, H AF Lee, T Lo, WK Ching, PC Meng, H TI Spoken language resources for Cantonese speech processing SO SPEECH COMMUNICATION LA English DT Article DE speech databases development; Chinese dialects; Chinese phonology and phonetics; annotation of speech data; applications of speech technology; speech recognition; text-to-speech synthesis ID RECOGNITION; DATABASE AB This paper describes the development of CU Corpora, a series of large-scale speech corpora for Cantonese. Cantonese is the most commonly spoken Chinese dialect in Southern China and Hong Kong. CU Corpora are the first of their kind and are intended to serve as an important infrastructure for the advancement of speech recognition and synthesis technologies for this widely used Chinese dialect. They contain a large amount of speech data that cover various linguistic units of spoken Cantonese, including isolated syllables, polysyllabic words and continuous sentences. While some of the corpora are created for specific applications of common interest, the others are designed with emphasis on the coverage and distributions of different phonetic units, including the contextual ones. The speech data are annotated manually so as to provide sufficient orthographic and phonetic information for the development of different applications. Statistical analysis of the annotated data shows that CU Corpora contain rich and balanced phonetic content. The usefulness of the corpora is also demonstrated with a number of speech recognition and speech synthesis applications. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Chinese Univ Hong Kong, Dept Elect Engn, Shatin, Hong Kong, Peoples R China. Chinese Univ Hong Kong, Dept Ind Engn & Management Syst, Shatin, Hong Kong, Peoples R China. RP Lee, T (reprint author), Chinese Univ Hong Kong, Dept Elect Engn, Shatin, Hong Kong, Peoples R China. EM tanlee@ee.cuhk.edu.hk RI Meng, Helen/F-6043-2011; Lee, Tan/D-5475-2011 OI Lee, Tan/0000-0002-7089-3436 CR Bigorne D., 1993, P INT C AC SPEECH SI, V2, P187 *CCDICT, 2000, DICT CHIN CHAR VERS Chan C., 1998, P C PHON LANG CHIN, P13 Chen H P, 1994, Bioorg Med Chem, V2, P1, DOI 10.1016/S0968-0896(00)82195-1 Ching P. C., 1994, P 1994 INT S SPEECH, V1, P127 CHOU FC, 1997, P ICASSP, V2, P923 CHOUKRI K, 1999, P 1999 OR COCOSDA WO CHOW KF, 1998, P 1998 INT S CHIN SP, P75 Chu M., 1996, CHINESE J ACOUSTICS, V15, P81 CHU M, 1998, P 1998 INT C AC SPEE, V1, P277, DOI 10.1109/ICASSP.1998.674421 Gao S., 2000, P 2000 INT C AC SPEE, V3, P1261 HASHIMOTO OKY, 1972, STUDIES YUE DIALECTS HOGE H, 1997, P ICASSP, V3, P1771 Huo Q., 1999, P 1999 OR COCOSDA WO, P85 KUREMATSU A, 1990, SPEECH COMMUN, V9, P357, DOI 10.1016/0167-6393(90)90011-W KUWABARA H, 1989, P 1989 INT C AC SPEE, V1, P560 Lamel L.
F., 1986, P DARPA SPEECH REC W, P100 *LDC, 2000, VAR RES Lee T, 1999, IEEE T SPEECH AUDI P, V7, P466 Lee T., 1999, P 6 EUR C SPEECH COM, V4, P1855 LEE T, 1995, IEEE T SPEECH AUDI P, V3, P204 LO WK, 1998, P 1998 INT S CHIN SP, P102 LSHK, 1997, HONG KONG JYUT PING Matthews S, 1994, CANTONESE COMPREHENS MOULINES E, 1990, P 1990 INT C AC SPEE, V1, P309 Ohtsuki K, 1999, SPEECH COMMUN, V28, P155, DOI 10.1016/S0167-6393(99)00006-0 PAUL D, 1992, P 5 DARPA SPEECH NAT Price P., 1988, P IEEE INT C AC SPEE, P651 Price P., 1990, P 3 DARPA SPEECH NAT TSENG CY, 1995, P 1995 INT C PHON SC, V3, P326 Wang H.-C., 1999, P 199 OR COCOSDA WOR, P53 Wang R., 1996, P 1996 INT C SPOK LA, V3, P1894 WINSKI R, 1994, ADV SPEECH APPL EURO, P25 Wong S.-L., 1941, CHINESE SYLLABARY PR WONG YW, 1999, P EUROSPEECH 99, V3, P1091 WU Y, 1998, NEWSLETTER ISCSLP 98, P1 Young S. J., 1993, P EUR C SPEECH COMM, V3, P2203 Yuan J, 1983, HANYU FANGYAN GAIYAO ZHANG B, 1999, P 5 INT S SIGN PROC, P629 ZHANG J, 1998, NEWSLETTER ISCSLP 98, P4 ZU YQ, 1996, HKU96 PUTONGHUA CORP ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 NR 42 TC 30 Z9 30 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2002 VL 36 IS 3-4 BP 327 EP 342 DI 10.1016/S0167-6393(00)00101-1 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700011 ER PT J AU Chappell, DT Hansen, JHL AF Chappell, DT Hansen, JHL TI A comparison of spectral smoothing methods for segment concatenation based speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE speech synthesis; speech coding; spectral smoothing; spectral interpolation ID AUDITORY-NERVE FIBERS; FREQUENCY; MODEL; CAT AB There are many scenarios in both speech synthesis and coding in which adjacent time-frames of speech are spectrally discontinuous. This paper addresses the topic of improving concatenative speech synthesis with a limited database by proposing methods to smooth, adjust, or interpolate the spectral transitions between speech segments. The objective is to produce natural-sounding speech via segment concatenation when formants and other spectral features do not align properly. We consider several methods for adjusting the spectra at the boundaries between waveform segments. Techniques examined include optimal coupling, waveform interpolation (WI), linear predictive parameter interpolation, and psychoacoustic closure. Several of these algorithms were previously developed for either coding or synthesis, while others are enhanced here. We also consider the connection between speech science and articulation in determining the type of smoothing appropriate for given phoneme-phoneme transitions. Moreover, this work incorporates the use of a recently-proposed auditory-neural based distance measure (ANBM), which employs a computational model of the auditory system to assess perceived spectral discontinuities. We demonstrate how actual ANBM scores can be used to help determine the need for smoothing. In addition, formal evaluation of four smoothing methods, using the ANBM and extensive listener tests, reveals that smoothing can distinctly improve the quality of speech, but when applied inappropriately can also degrade it. It is shown that after proper spectral smoothing, or spectral interpolation, the final synthesized speech sounds more natural and has a more continuous spectral structure.
(C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Colorado, CSLR, RSPL, Boulder, CO 80309 USA. Duke Univ, Dept Elect Engn, Durham, NC 27708 USA. RP Hansen, JHL (reprint author), Univ Colorado, CSLR, RSPL, Room E265,3215 Marine St,POB 594, Boulder, CO 80309 USA. EM John.Hansen@colorado.edu CR Atal B., 1989, P 1989 IEEE ICASSP G, V1, P69 Berry D. A., 1996, STAT BAYESIAN PERSPE BREEN AP, 1998, P 1998 ICSLP SYDN AU Bregman AS., 1990, AUDITORY SCENE ANAL CARNEY LH, 1993, J ACOUST SOC AM, V93, P401, DOI 10.1121/1.405620 CHAPPELL DT, 1997, P 1997 IEEE ICASSP M, V3, P1639 COKER CH, 1976, P IEEE, V64, P452, DOI 10.1109/PROC.1976.10154 Conkie A., 1997, PROGR SPEECH SYNTHES, P293 Deller J., 2000, DISCRETE TIME PROCES Donovan R., 1996, THESIS CAMBRIDGE U DUTOIT T, 1994, P INT C AC SPEECH SI, V1, P565 DUTOIT T, 1993, SPEECH COMMUN, V13, P435, DOI 10.1016/0167-6393(93)90042-J ERKELENS JS, 1994, P 1994 IEEE ICASSP A, V1, P481 Fant G., 1960, ACOUSTIC THEORY SPEE Flanagan J., 1972, SPEECH ANAL SYNTHESI Goncharoff V., 1995, P 1995 IEEE ICASSP, V1, P780 Hansen JHL, 1998, IEEE T SPEECH AUDI P, V6, P489, DOI 10.1109/89.709674 HIROKAWA T, 1990, P 1990 ICSLP KOB JAP, V1, P337 HUANG X, 1997, P 1997 IEEE ICASSP M, V2, P959 Hunt A. J., 1996, P ICASSP 96, P373 KLABBERS E, 1998, P 5 INT C SPOK LANG, V5, P1983 KLEIJN WB, 1995, SPEECH CODING SYNTHE, P175 KLEIJN WB, 1996, P INT C AC SPEECH SI, V1, P212 Ladefoged P., 1975, COURSE PHONETICS LADEFOGED P, 1981, PRELIMINARIES LINGUI LIBERMAN MC, 1982, J ACOUST SOC AM, V72, P1441, DOI 10.1121/1.388677 MIZUNO H, 1993, P 1993 ICASSP, V2, P195 MIZUNO H, 1995, SPEECH COMMUN, V16, P153, DOI 10.1016/0167-6393(94)00052-C Moore BCJ, 1997, INTRO PSYCHOL HEARIN MOULINES E, 1995, SPEECH COMMUN, V16, P175, DOI 10.1016/0167-6393(94)00054-E MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z OSHAUGHNESSY D, 1990, SPEECH COMMUNICATION Paliwal K.K., 1995, SPEECH CODING SYNTHE, P433 PALIWAL KK, 1995, P EUR 95 MADR, V2, P1029 Papamichalis P.E., 1987, PRACTICAL APPROACHES Parthasarathy S., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90043-4 PELLOM B, 1998, THESIS DUKE U Pellom BL, 1998, SPEECH COMMUN, V25, P97, DOI 10.1016/S0167-6393(98)00031-4 Pickett J. M., 1980, SOUNDS SPEECH COMMUN PLUMPE M, 1998, P 1998 ICSLP SYDN AU, V6, P2751 Quackenbush S. R., 1988, OBJECTIVE MEASURES S Rabiner L, 1993, FUNDAMENTALS SPEECH Savic M., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90099-7 SHADLE CH, 1979, J ACOUST SOC AM, V66, P1325, DOI 10.1121/1.383553 SHIGA Y, 1998, P 1998 ICSLP SYDN AU, V5, P2035 SLANEY M, 1996, P 1996 IEEE ICASSP A, P1001 SLIFKA J, 1995, ICASSP 95, V1, P644 Snell RC, 1993, IEEE T SPEECH AUDI P, V1, P129, DOI 10.1109/89.222882 STEVENS KN, 1955, J ACOUST SOC AM, V27, P484, DOI 10.1121/1.1907943 SYRDAL A, 1998, ICASSP 98 SEATTLE, V1, P273 WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 WITTEN IH, 1982, PRINCIPLES COMPUTER ZEMLIN WR, 1968, SPEECH HEARING SCI A NR 53 TC 19 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAR PY 2002 VL 36 IS 3-4 BP 343 EP 374 DI 10.1016/S0167-6393(01)00008-5 PG 32 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 520HE UT WOS:000173774700012 ER PT J AU Swerts, M Terken, J AF Swerts, M Terken, J TI Dialogue and prosody SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Tech Univ Eindhoven, IPO, Ctr User Syst Interact, NL-5600 MB Eindhoven, Netherlands. Univ Instelling Antwerp, CNTS Ctr Dutch Language & Speech, B-2610 Antwerp, Belgium. RP Swerts, M (reprint author), Tech Univ Eindhoven, IPO, Ctr User Syst Interact, POB 513, NL-5600 MB Eindhoven, Netherlands. EM m.g.j.swerts@tue.nl; j.m.b.terken@tue.nl RI Swerts, Marc/C-8855-2013 NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2002 VL 36 IS 1-2 BP 1 EP 3 DI 10.1016/S0167-6393(01)00021-8 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 514TZ UT WOS:000173458300001 ER PT J AU Clark, HH AF Clark, HH TI Speaking in time SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Dialogue and Prosody CY SEP 01-03, 1999 CL VELDHOVEN, NETHERLANDS SP ESCA DE disfluencies; spontaneous speech; timing; pauses; production ID SPONTANEOUS SPEECH; WORDS AB Most spoken disfluencies, it is argued, are not problems in speaking, but the solutions to problems in speaking. Speakers design most forms of disfluencies as signals, communicative acts, for coordinating with their addressees on certain of their speech actions. At the lowest level, speakers try to synchronize their vocalizations with their addressees' attention. At the next level up, they try to synchronize, or pace, the presentation of each expression with their addressees' analysis of those expressions. Speakers have a variety of strategies for achieving synchronization, and many of these lead to the common forms of disfluencies. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Stanford Univ, Dept Psychol, Stanford, CA 94305 USA. RP Clark, HH (reprint author), Stanford Univ, Dept Psychol, Bldg 420,Jordan Hall, Stanford, CA 94305 USA. EM herb@psych.stanford.edu CR CHAFE WALLACE L., 1980, PEAR STORIES COGNITI CLARK HH, 1994, SPEECH COMMUN, V15, P243, DOI 10.1016/0167-6393(94)90075-2 Clark H. H., 1996, USING LANGUAGE Clark Herbert H, 1977, PSYCHOL LANGUAGE CLARK HH, UNPUB USING UH UM SP Clark HH, 1998, COGNITIVE PSYCHOL, V37, P201, DOI 10.1006/cogp.1998.0693 Tree JEF, 1995, J MEM LANG, V34, P709, DOI 10.1006/jmla.1995.1032 Goodwin C., 1981, CONVERSATIONAL ORG I LEVELT WJM, 1983, COGNITION, V14, P41, DOI 10.1016/0010-0277(83)90026-4 Selkirk E, 1996, SIGNAL TO SYNTAX: BOOTSTRAPPING FROM SPEECH TO GRAMMAR IN EARLY ACQUISITION, P187 Svartvik J., 1980, CORPUS ENGLISH CONVE Tree Jean E. Fox, 1997, Cognition, V62, P151 Woodworth R. S., 1938, EXPT PSYCHOL NR 13 TC 33 Z9 34 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD JAN PY 2002 VL 36 IS 1-2 BP 5 EP 13 DI 10.1016/S0167-6393(01)00022-X PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 514TZ UT WOS:000173458300002 ER PT J AU Pulman, S AF Pulman, S TI Relating dialogue games to information state SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Dialogue and Prosody CY SEP 01-03, 1999 CL VELDHOVEN, NETHERLANDS SP ESCA DE dialogue acts; conversational games; information state; focus AB This paper discusses the use of 'conversational' or 'dialogue games' as a basis for building dialogue systems. We give a tutorial overview of some recent attempts to relate the notion of a dialogue act to changes of information state of the participants in a dialogue. These attempts all distinguish some notion of 'grounded' or 'common' propositions. We raise the question as to whether these attempts might make the notion of dialogue game redundant, reducing it to an epiphenomenon arising out of the manipulation of information states. The answer to the question is no, not quite, yet. We also look briefly at a suggestion for augmenting the notion of information state so as to be able to deal with some types of prosodically marked focus. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Oxford, Ctr Linguist & Philol, Oxford OX1 2HG, England. RP Pulman, S (reprint author), SRI Int, Cambridge, England. EM stephen.pulman@linguistics-philology.ox.ac.uk CR ALLEN J, 1997, DAMSL DIALOGUE MARKU BOLINGER D, 1972, ACCENT PREDICTABLE Y CLARK HH, 1989, COGNITIVE SCI, V13, P259, DOI 10.1207/s15516709cog1302_7 COHEN P, 1979, COGNITIVE SCI, V3, P77 Cohen Philip R., 1990, INTENTIONS COMMUNICA Cooper R., 1999, CODING INSTRUCTIONAL ENGDAHL E, 1999, FOCUS GROUND ARTICUL Ginzburg J., 1994, P INT WORKSH COMP SE, P111 Ginzburg J., 1996, HDB CONT SEMANTIC TH Grosz B. J., 1986, Computational Linguistics, V12 HAMBLIN CL, 1971, THEORIA, V37, P130 Isard S . D., 1975, FORMAL SEMANTICS NAT Kamp H., 1993, DISCOURSE LOGIC INTR KOWTKO JC, 1992, RP31 HCRC LEWIS JA, 1993, BIOCONTROL SCI TECHN, V3, P3, DOI 10.1080/09583159309355253 MOORE RK, 1992, P I AC, V14, P613 POESIO M, 1997, COMPUTATIONAL INTELL POESIO M, 1998, P ESSLLI WORKSH MUT POESIO M, 1998, 13 TWLT PULMAN SG, 1997, CLIN, V7, P1 REITHINGER N, 1996, P 33 ACL CAMBR MASS, P116 Stalnaker Robert, 1978, SYNTAX SEMANTICS, P315 NR 22 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2002 VL 36 IS 1-2 BP 15 EP 30 DI 10.1016/S0167-6393(01)00023-1 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 514TZ UT WOS:000173458300003 ER PT J AU Hirschberg, J AF Hirschberg, J TI Communication and prosody: Functional aspects of prosody SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Dialogue and Prosody CY SEP 01-03, 1999 CL VELDHOVEN, NETHERLANDS SP ESCA DE prosody; spoken dialogue systems; intonational meaning ID ENGLISH SENTENCE STRESS; PERFORMANCE STRUCTURES; DISCOURSE STRUCTURE; PITCH ACCENTS; SPEECH; INTONATION; PATTERNS; CONTEXT; WORDS; TEXT AB Interest in the contribution prosodic information makes to human communication has led to increasing expectations that such information could be of use in text-to-speech and speech understanding systems, and in application of these technologies to spoken dialogue systems. 
To date, research results far exceed their technology applications. This paper suggests some areas in which progress has been made, and some in which more might be made, with particular emphasis upon text-to-speech synthesis and spoken dialogue systems. (C) 2002 Elsevier Science B.V. All rights reserved. C1 AT&T Labs Res, A 257 Shannon Lab, Florham Pk, NJ 07932 USA. RP Hirschberg, J (reprint author), AT&T Labs Res, A 257 Shannon Lab, 180 Pk Ave, Florham Pk, NJ 07932 USA. EM julia@research.att.com CR ABE M, 1997, PROGR SPEECH SYNTHES, P495 ALLERTON DJ, 1979, J LINGUIST, V15, P49, DOI 10.1017/S0022226700013104 Altenberg B., 1987, LUND STUDIES ENGLISH, V76 AMIR N, 1998, P INT C SPOK LANG PR AVESANI C, 1988, CORS STAMP NEGL CONV, P8 AYERS GM, 1992, LING SOC AM ANN M BAART JLG, 1987, THESIS U LEIDEN LEID BALL CN, 1977, LINGUIST INQ, V8, P585 BARD E, 1999, P INT C PHON SCI ICP BARDOVIHARLIG K, 1983, 18 REG M CHIC LING S BARDOVIHARLIG K, 1983, THESIS U CHICAGO CHI BEACH CM, 1991, J MEM LANG, V30, P644, DOI 10.1016/0749-596X(91)90030-N BELL L, 1999, P INT C PHON SCI ICP BENOIT C, 1989, P EUR SPEECH COMM AS BING JANET, 1979, THESIS U MASSACHUSET BOLINGER D, 1972, LANGUAGE, V48, P633, DOI 10.2307/412039 Bolinger D., 1986, INTONATION ITS PARTS Bolinger D., 1989, INTONATION ITS USES Botinis Antonis, 1997, INTONATION THEORY MO Bouton L.E., 1982, PAPERS PARASESSION N, P23 BRESNAN JW, 1971, LANGUAGE, V47, P257, DOI 10.2307/412081 Brown G., 1983, PROSODY MODELS MEASU, P67 Brown Gillian, 1980, QUESTIONS INTONATION CAHN J, 1998, 3 ESCA COCOSDA WORKS, P121 CAHN JE, 1988, P SPEECH TECH 88 NEW, P35 Campbell N., 1997, INTONATION THEORY MO, P67 CASPERS J, 1998, LANGUAGE SPEECH, V41 Chafe W. L., 1976, SUBJECT TOPIC, P25 Cooper W. E., 1980, SYNTAX SPEECH CULICOVER PW, 1983, LANGUAGE, V59, P123, DOI 10.2307/414063 CUTLER A, 1977, LANG SPEECH, V20, P1 Cutler Anne, 1983, PROSODY MODELS MEASU DEMAREUIL PB, 1998, 3 ESCA COCOSDA WORKS, P127 Dirksen A, 1992, P COLING 92, P865 DIRKSEN A, 1993, ANAL SYNTHESIS SPEEC, P131 Downing B. 
T., 1970, THESIS U TEXAS AUSTI ENKVIST NE, 1979, STUDIES ENGLISH LING, P134 ERTESCHIKSHIR N, 1983, J LINGUIST, V19, P419, DOI 10.1017/S0022226700007805 *ESCA, 1994, C P 2 ESCA IEEE WORK *ESCA COCOSDA, 1998, P 3 ESCA COCOSDA WOR FOWLER CA, 1987, J MEM LANG, V26, P489, DOI 10.1016/0749-596X(87)90136-7 FUCHS A, 1984, INTONATION ACCENT RH, P134 FUCHS A, 1980, WEGE UNIVERSALIENFOR FUJIO S, 1997, COMPUTING PROSODY CO, P271 GAWRONSKA B, 1998, P ICSLP98 INT C SPOK GEE JP, 1983, COGNITIVE PSYCHOL, V15, P411, DOI 10.1016/0010-0285(83)90014-2 GELUYKENS R, 1994, SPEECH COMMUN, V15, P69, DOI 10.1016/0167-6393(94)90042-6 GORIN AI, 1995, P ESCA WORKSH SPOK D GROSJEAN F, 1979, COGNITIVE PSYCHOL, V11, P58, DOI 10.1016/0010-0285(79)90004-5 GROSZ B, 1992, P INT C SPOK LANG PR GUNDEL J, 1978, U HAWAII WORKING PAP, V10, P1 GUSSENHOVEN C, 1997, INTONATION THEORY MO, P18 GUSSENHOVEN CARLOS, 1983, GRAMMAR SEMANTICS SE HESS W, 1997, COMPUTING PROSODY CO, P361 HIRSCH BA, 1992, ENVIRON MOL MUTAGEN, V20, P2, DOI 10.1002/em.2850200103 HIRSCHBERG J, 1997, ESCA TUT RES WORKSH HIRSCHBERG J, 1999, P AUT SPEECH REC UND HIRSCHBERG J, 1993, ARTIF INTELL, V63, P305, DOI 10.1016/0004-3702(93)90020-C HIRSCHBERG J, 1996, P 34 ANN M SANT CRUZ Hirschberg J, 1996, SPEECH COMMUN, V18, P281, DOI 10.1016/0167-6393(96)00017-9 HORNE M, 1987, 32 LUND U DEP LING HORNE M, 1994, C P 2 ESCA IEEE WORK, P220 HORNE M, 1991, P 12 INT C PHON SCI, P230 HORNE M, 1991, 36 LUND U DEP LING HORNE MA, 1985, STUD LINGUISTICA, V39, P51 HOUSE D, 1993, P ESCA WORKSH PROS L Jackendoff Ray S., 1972, SEMANTIC INTERPRETAT KLABBERS E, 1998, P INT C SPOK LANG PR KOIKE K, 1998, P INT C SPOK LANG PR KOISO H, 1998, LANGUAGE SPEECH, V41 KOMPE R, 1994, SPEECH COMMUN, V15, P155, DOI 10.1016/0167-6393(94)90049-3 Kouroupetroglou G., 1997, INTONATION THEORY MO, P161 KRAHMER E, 1998, P INT C SPOK LANG PR Kruyt J. G., 1985, THESIS U LEIDEN Ladd D. R., 1980, STRUCTURE INTONATION Ladd D. R., 1996, INTONATIONAL PHONOLO LADD DR, 1979, CONTRIBUTIONS GRAMMA, P93 LADD DR, 1978, LANGUAGE, V54, P517, DOI 10.2307/412785 LADD DR, 1977, FUNCTION A RISE ACCE Lakoff George, 1971, SEMANTICS INTERDISCI, P329 Lehiste I., 1979, FRONTIERS SPEECH COM, P191 LEHMAN C, 1977, P 13 ANN M CHIC LING, P316 LEVOW GA, 1998, P 36 ANN M ASS COMP, P736 Liberman M., 1974, 10 REG M CHIC LING S, P416 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 NOOTEBOOM SG, 1982, PHONETICA, V39, P317 Noth E, 2002, SPEECH COMMUN, V36, P45, DOI 10.1016/S0167-6393(01)00025-5 Ostendorf M., 1994, Computational Linguistics, V20 Oviatt S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607722 PASSONEAU R, 1993, P 31 ANN M OH STAT U PIERREHUMBERT J, 1990, SYS DEV FDN, P271 Pierrehumbert J, 1980, THESIS MIT Pitrelli J. 
F., 1994, P 3 INT C SPOK LANG, V2, P123 PREVOST S, 1994, SPEECH COMMUN, V15, P139, DOI 10.1016/0167-6393(94)90048-5 PRICE PJ, 1990, J ACOUST SOC AM DEC Prince Ellen, 1981, RADICAL PRAGMATICS, P223 RANK E, 1998, P INT C SPOK LANG PR Rochemont Michael S., 1990, ENGLISH FOCUS CONSTR ROOTH M, 1991, WORKSH SYNT SEM FOC Rooth Mats, 1985, THESIS U MASSACHUSET SAG IA, 1975, 11 REG M CHIC LING S SAGISAKA Y, 1997, COMPUTING PROSODY CO Sanderman AA, 1997, LANG SPEECH, V40, P391 SCHMERLI.SF, 1974, LANGUAGE, V50, P66, DOI 10.2307/412010 SCHMERLING S, 1975, TEXAS LINGUISTICS FO, P135 SCHMERLING S, 1971, 7TH REG M CHIC LING, P242 SCHMERLING SF, 1976, THESIS U ILLINOIS UR SCHRODER M, 1998, P INT C SPOK LANG PR SELKIRK E., 1984, PHONOLOGY SYNTAX SHIMOJIMA A, 2001, SPEECH COMMUNICATION SHRIBERG E, 1998, LANGUAGE SPEECH, V41 SILVERMAN K, 1987, THESIS CAMBRIDGE U C SILVERMAN K, 1993, P 1993 EUR, V3, P2169 Silverman K., 1992, P INT C SPOK LANG PR, P867 SPROAT R, 1998, MULTILINGUAL TEXT SP Swerts M, 1997, SPEECH COMMUN, V22, P25, DOI 10.1016/S0167-6393(97)00011-3 SWERTS M, 1994, SPEECH COMMUN, V15, P79, DOI 10.1016/0167-6393(94)90043-4 TAMOTO M, 1998, P INT C SPOK LANG PR TAYLOR P, 1998, LANGUAGE SPEECH, V41 Terken J., 1997, COMPUTING PROSODY, P95 Terken J, 1987, LANG COGNITIVE PROC, V2, P145, DOI 10.1080/01690968708406928 TERKEN JMB, 1984, LANG SPEECH, V27, P269 TERKEN JMB, 1985, THESIS U LEIDEN HELM TERKEN J, 1994, LANG SPEECH, V37, P125 van Santen J., 1997, PROGR SPEECH SYNTHES VANDONZEL M, 1999, PROSODIC ASPECTS INF VANHUEVEN VJ, 1993, ANAL SYNTHESIS SPEEC WADE E, 1992, P INT C SPOK LANG PR, V2, P995 Wales R., 1979, SENTENCE PROCESSING WARD G, 1985, LANGUAGE, V61, P747, DOI 10.2307/414489 Warnke V., 1997, P 5 EUR C SPEECH COM, V1, P207 WHITESIDE SP, 1998, P INT C SPOK LANG PR WILLIAMS S, 1998, P INT C SPOK LANG PR WILSON D, 1979, SYNTAX SEMANTICS, V11, P229 ZACHARSKI R, 1992, 15 INT C COMP LING I, P253 NR 135 TC 37 Z9 38 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2002 VL 36 IS 1-2 BP 31 EP 43 DI 10.1016/S0167-6393(01)00024-3 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 514TZ UT WOS:000173458300004 ER PT J AU Noth, E Batliner, A Warnke, V Haas, J Boros, M Buckow, J Huber, R Gallwitz, F Nutt, M Niemann, H AF Noth, E Batliner, A Warnke, V Haas, J Boros, M Buckow, J Huber, R Gallwitz, F Nutt, M Niemann, H TI On the use of prosody in automatic dialogue understanding SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Dialogue and Prosody CY SEP 01-03, 1999 CL VELDHOVEN, NETHERLANDS SP ESCA DE dialogue; prosody; prosodic labelling; automatic classification; spontaneous speech; large databases; neural networks; stochastic language models; partial parser; A* search AB In this paper, we show how prosodic information can be used in automatic dialogue systems and give some examples of promising new approaches. Most of these examples are taken from our own work in the VERBMOBIL speech-to-speech translation system and in the EVAR train timetable dialogue system. In a 'prosodic orbit', we first present units, phenomena, annotations and statistical methods from the signal (acoustics) to the dialogue understanding phase. 
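As a concrete hint of what such statistical methods consume, here is an illustrative per-word prosodic feature vector scored by a toy logistic model; the feature choice, names and weights are our assumptions, not VERBMOBIL's or EVAR's.

```python
# Illustrative only: duration, pause, F0 and energy features for one word,
# mapped to a boundary probability by a toy logistic scorer.
import numpy as np

def prosodic_features(dur_norm, pause_after_s, f0_slope, energy_norm):
    """Per-word features plus a bias term."""
    return np.array([dur_norm, pause_after_s, f0_slope, energy_norm, 1.0])

def boundary_probability(feats, weights):
    return 1.0 / (1.0 + np.exp(-feats @ weights))    # logistic score

w = np.array([0.8, 2.5, -0.6, 0.1, -1.2])            # toy weights
x = prosodic_features(1.4, 0.25, -0.8, 0.9)          # lengthened word,
print(boundary_probability(x, w))                     # pause follows
```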
We then show how prosody can be used together with other knowledge sources to resegment an utterance when a first segmentation turns out to be wrong, and how an integrated approach leads to better results than a sequential use of the different knowledge sources; next, we present a hybrid approach that performs shallow parsing and uses prosody to guide it; finally, we show how a critical system evaluation can help to improve the overall performance of automatic dialogue systems. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Erlangen Nurnberg, Chair Pattern Recognit Inf 5, D-91058 Erlangen, Germany. Bavarian Res Ctr Knowledge Based Syst FORWISS, D-91058 Erlangen, Germany. RP Noth, E (reprint author), Univ Erlangen Nurnberg, Chair Pattern Recognit Inf 5, D-91058 Erlangen, Germany. EM noeth@informatik.uni-erlangen.de CR Albesano D., 1997, International Journal of Speech Technology, V2, DOI 10.1007/BF02208822 AUST H, 1995, SPEECH COMMUN, V17, P249, DOI 10.1016/0167-6393(95)00028-M AUST H, 1998, P INT S SPOK DIAL SY, P27 Batliner A., 1989, INTONATION MODUS FOK, p[H, 21] Batliner A, 1998, SPEECH COMMUN, V25, P193, DOI 10.1016/S0167-6393(98)00037-5 Batliner A., 1995, P 13 INT C PHON SCI, V3, P472 BATLINER A, 2000, VERBMOBIL FDN SPEECH, P106 BATLINER A, 1993, P ESCA WORKSH PROS L, P112 BATLINER A, 1999, P EUR C SPEECH COMM, V1, P519 BATLINER A, 1991, BETRIEBSLINGUISTIK L, P147 Batliner A., 1999, P 14 INT C PHON SCI, V3, P2315 BLOCK HU, 1997, P INT C AC SPEECH SI, V1, P79 CARLETTA J, 1997, DAGST SEM REP, P167 GALLWITZ F, 1999, P ESCA WORKSH DIAL P, P163 GALLWITZ F, 1998, P 1998 INT S SPOK DI, P19 GALLWITZ F, 1998, P INT C SPOK LANG PR, V7, P2883 GRICE M, 1996, P INT C SPOK LANG PR, V3, P1716, DOI 10.1109/ICSLP.1996.607958 HORMANN H, 1978, MEINEN VERSTEHEN STU Jekat S., 1995, 65 VERBM Kasper Walter, 1999, P 37 ANN M ASS COMP, P405, DOI 10.3115/1034678.1034741 Kiessling A., 1997, EXTRAKTION KLASSIFIK KIESSLING K, 1994, P INT C SPOK LANG PR, V1, P115 Kompe R, 1997, PROSODY SPEECH UNDER Lea W., 1980, TRENDS SPEECH RECOGN, P166 Levelt W. J., 1989, SPEAKING INTENTION A MAST M, 1996, P INT C SPOKEN LANGU, V3, P1728 MECKLENBURG K, 1995, P 5 INT WORKSH NAT L, P127 NIEMANN H, 1998, P SPECOM WORKSH ST P, P17 Nilsson N., 1982, PRINCIPLES ARTIFICIA NOTH E, 1999, P EUR C SPEECH COMM, V5, P2019 NUTT M, 1999, P ESCA WORKSH DIAL P, P151 REYELT M, 1994, 33 VERBM Shriberg E., 1998, LANG SPEECH, V41, P439 SPILKER J, 1999, P EUR C SPEECH COMM, V5, P2031 SPILKER J, 2000, VERBMOBIL FDN SPEECH, P131 STROM V, 1996, P ICSLP PHIL US, V3, P1497, DOI 10.1109/ICSLP.1996.607900 TAYLOR P, 1999, LANG SPEECH, V41, P489 VAISSIERE J, 1988, NATO ASI SERIES F, V46, P71 Wahlster W., 2000, VERBMOBIL FDN SPEECH WAHLSTER W, 1997, P INT C AC SPEECH SI, V1, P71 Warnke V., 1997, P 5 EUR C SPEECH COM, V1, P207 WARNKE V, 1999, P EUR C SPEECH COMM, V1, P235 NR 42 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD JAN PY 2002 VL 36 IS 1-2 BP 45 EP 62 DI 10.1016/S0167-6393(01)00025-5 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 514TZ UT WOS:000173458300005 ER PT J AU Hastie, HW Poesio, M Isard, S AF Hastie, HW Poesio, M Isard, S TI Automatically predicting dialogue structure using prosodic features SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Dialogue and Prosody CY SEP 01-03, 1999 CL VELDHOVEN, NETHERLANDS SP ESCA DE prosody; intonation; duration; dialogue acts; moves; games; discourse function; prediction; recognition ID INTONATION AB Spoken dialogue systems need to track dialogue structure in order to conduct sensible conversations. In previous work, we used only a shallow analysis of past dialogue in predicting the current dialogue act. Here we show that a hierarchical analysis of dialogue structure can significantly improve dialogue act recognition. Our approach is to integrate dialogue act recognition with speech recognition, seeking a best overall hypothesis for what words have been spoken and what dialogue act they represent, in the light of both the dialogue history so far and the current speech signal. A useful feature of this approach is that intonation can be used to aid dialogue act recognition by combining it with other information sources in a natural way. (C) 2002 Published by Elsevier Science B.V. C1 Univ Edinburgh, Human Commun Res Ctr, Ctr Speech Technol Res, Edinburgh EH8 9LW, Midlothian, Scotland. RP Hastie, HW (reprint author), AT&T Labs Res, Shannon Lab, Bldg 103,Rm D109,180 Pk Ave,POB 971, Florham Pk, NJ 07932 USA. EM hhastie@research.att.com; poesio@cogsci.ed.ac.uk; stepheni@cstr.ed.ac.uk CR Bard EG, 1996, SPEECH COMMUN, V20, P71, DOI 10.1016/S0167-6393(96)00045-3 Berger AL, 1996, COMPUT LINGUIST, V22, P39 Carletta J, 1997, COMPUT LINGUIST, V23, P13 CHUCARROLL J, 1998, APPL MACHINE LEARNIN, P98 HOCKEY BA, 1997, P 5 EUR C SPEECH COM, P2267 Jelinek F., 1980, Pattern Recognition in Practice. Proceedings of an International Workshop KING S, 1998, THESIS U EDINBURGH NAGATA M, 1994, SPEECH COMMUN, V15, P193, DOI 10.1016/0167-6393(94)90071-X NAKAJIMA S, 1993, PHONETICA, V50, P197 Olshen R., 1984, CLASSIFICATION REGRE, V1st POESIO M, 1998, P ICSLP98 SYDN AUSTR, P405 POWER R, 1979, LINGUISTICS, V17, P107, DOI 10.1515/ling.1979.17.1-2.107 RABINER L, 1994, FUNDAMENTALS SPEECH Reithinger N., 1997, P 5 EUR C SPEECH COM, P2235 REITHINGER N, 1995, P AAAI SPRING S EMP, P126 ROSENFELD R, 1997, CMU CAMBRIDGE STAT L Shriberg E., 1998, LANG SPEECH, V41, P439 Siegel S., 1988, NONPARAMETRIC STAT B Taylor P, 1998, LANG SPEECH, V41, P493 Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 TERRY M, 1994, P ICSLP94 YOK JAP, P891 WRIGHT H, 1999, THESIS U EDINBURGH Wright H., 1998, P ICSLP98 SYDN AUSTR, P1403 NR 23 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JAN PY 2002 VL 36 IS 1-2 BP 63 EP 79 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 514TZ UT WOS:000173458300006 ER PT J AU Gallwitz, F Niemann, H Noth, E Warnke, V AF Gallwitz, F Niemann, H Noth, E Warnke, V TI Integrated recognition of words and prosodic phrase boundaries SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Dialogue and Prosody CY SEP 01-03, 1999 CL VELDHOVEN, NETHERLANDS SP ESCA DE speech recognition; prosody; speech understanding AB In this paper, we present an integrated approach for recognizing both the word sequence and the syntactic-prosodic structure of a spontaneous utterance. The approach aims at improving the performance of the understanding component of speech understanding systems by exploiting not only acoustic-phonetic and syntactic information, but also prosodic information directly within the speech recognition process. Whereas spoken utterances are typically modelled as unstructured word sequences in the speech recognizer, our approach includes phrase boundary information in the language model and provides HMMs to model the acoustic and prosodic characteristics of phrase boundaries. This methodology has two major advantages compared to purely word-based speech recognizers. First, additional syntactic-prosodic boundaries are determined by the speech recognizer, which facilitates parsing and helps resolve syntactic and semantic ambiguities. Second - after having removed the boundary information from the result of the recognizer - the integrated model yields a 4% relative word error rate (WER) reduction compared to a traditional word recognizer. The boundary classification performance is equal to that of a separate prosodic classifier operating on the word recognizer output, thus making a separate classifier unnecessary for this task and saving the computation time involved. Compared to the baseline word recognizer, the integrated word-and-boundary recognizer does not involve any computational overhead. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Erlangen Nurnberg, Chair Pattern Recognit, D-91058 Erlangen, Germany. RP Gallwitz, F (reprint author), Sympalog Speech Technol AG, Erlangen, Germany.
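A minimal sketch of the integrated-model idea in the abstract above: the language model is trained on word sequences with phrase-boundary tokens interleaved, so the decoder can hypothesise boundaries like ordinary words. The token name, the add-0.5 smoothing and the vocabulary size are deliberately naive assumptions of ours, not the paper's setup.

```python
from collections import defaultdict

BOUNDARY = "<B>"
VOCAB = 1000.0            # assumed vocabulary size for smoothing

def train_bigram(sentences):
    """Bigram LM over words and boundary tokens alike."""
    bigrams, unigrams = defaultdict(float), defaultdict(float)
    for sent in sentences:                    # boundaries appear as tokens
        for a, b in zip(["<s>"] + sent, sent + ["</s>"]):
            bigrams[(a, b)] += 1.0
            unigrams[a] += 1.0
    return lambda a, b: (bigrams[(a, b)] + 0.5) / (unigrams[a] + 0.5 * VOCAB)

lm = train_bigram([["yes", BOUNDARY, "on", "friday"],
                   ["on", "friday", BOUNDARY, "yes"]])
print(lm("friday", BOUNDARY) > lm("on", BOUNDARY))   # True: phrase-final
```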
EM gallwitz@sympalog.de CR Batliner A, 1998, SPEECH COMMUN, V25, P193, DOI 10.1016/S0167-6393(98)00037-5 BECCETTI C, 1999, SPEECH RECOGNITION T BUB T, 1996, P INT C SPOK LANG PR, V4, P1026 GALLWITZ F, 1998, P INT C SPOK LANG PR, V7, P2883 GALLWITZ F, 2001, THESIS U ERLANGEN NU HEEMAN PA, 1999, COMPUTATIONAL LINGUI, V25 HEEMAN PA, 1999, P IEEE WORKSH AUT SP IWANO K, 1999, P INT C AC SPEECH SI, V1, P133 Jelinek F., 1997, STAT METHODS SPEECH Kiessling A., 1997, EXTRAKTION KLASSIFIK KOMPE R, 1997, LECT NOTES ARTIFICIA KOMPE R, 1994, P INT C AC SPEECH SI, V2, P173 KOMPE R, 1995, P EUR C SPEECH COMM, V2, P1333 Lea W., 1980, TRENDS SPEECH RECOGN, P166 OSHAUGHNESSY D, 1992, P INT C AC SPEECH SI, V1, P521 Ostendorf M., 1994, Computational Linguistics, V20 PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770 SCHUKATTALAMAZZ.EG, 1995, AUTOMATISCHE SPRACHE SHRIBERG E, 1996, P INT C SPOK LANG PR, V3, P1868, DOI 10.1109/ICSLP.1996.607996 STOLCKE A, 1996, P INT C AC SPEECH SI, V1, P405 STOLCKE A, 1999, P EUR C SPEECH COMM STOLCKE A, 1998, P INT C SPOK LANG PR VAISSIERE J, 1988, NATO ASI SERIES F, V46, P71 VEILLEUX N, 1993, HUMAN LANGUAGE TECHN, P315 Wahlster W., 1993, P 3 EUR C SPEECH COM, P29 WARNKE V, 1999, P EUR C SPEECH COMM, V1, P235 WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 NR 27 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2002 VL 36 IS 1-2 BP 81 EP 95 DI 10.1016/S0167-6393(01)00027-9 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 514TZ UT WOS:000173458300007 ER PT J AU Hirose, K Kawanami, H AF Hirose, K Kawanami, H TI Temporal rate change of dialogue speech in prosodic units as compared to read speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Dialogue and Prosody CY SEP 01-03, 1999 CL VELDHOVEN, NETHERLANDS SP ESCA DE speech rate; dialogue speech; prosodic structure; F-0 contour model; prosodic phrase; mora reduction rate AB A comparative study on speech rate was conducted between dialogue speech and read speech of Japanese. Based on a model of fundamental frequency contour generation, four prosodic units, prosodic sentence, clause, phrase and word, are defined. Speech rate was analyzed with respect to these units, especially to prosodic phrases. In order to suppress various factors affecting speech rate, and to clarify features of dialogue speech, the mora reduction rate of dialogue speech relative to its read-speech counterpart was defined and used for the analysis. Through the analysis of speech samples recorded during simulated dialogues, it was found that, in a prosodic phrase, dialogue speech rate starts with a value slightly larger than that of read speech. Then it gradually increases and, after passing the middle of the phrase, decreases. This result was also supported through a linear regression analysis and a listening test of synthetic speech. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Tokyo, Dept Frontier Informat, Bunkyo Ku, Tokyo 1130033, Japan. Univ Tokyo, Dept Informat & Commun Engn, Bunkyo Ku, Tokyo 1138656, Japan. RP Hirose, K (reprint author), Univ Tokyo, Dept Frontier Informat, Bunkyo Ku, 7-3-1 Hongo, Tokyo 1130033, Japan.
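A sketch of the comparison the abstract above defines, under our reading that the mora reduction rate expresses how much shorter each mora is in dialogue speech than in its read-speech counterpart; the exact normalisation in the paper may differ, and the example durations are invented.

```python
def mora_reduction_rate(dialogue_ms, read_ms):
    """Aligned per-mora durations (ms) -> per-mora reduction rates."""
    return [1.0 - d / r for d, r in zip(dialogue_ms, read_ms)]

# one prosodic phrase: mid-phrase moras are reduced the most, matching
# the rise-then-fall of dialogue speech rate reported in the abstract
rates = mora_reduction_rate([78, 68, 62, 66, 84], [80, 81, 80, 79, 86])
print([round(r, 2) for r in rates])
```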
EM hirose@gavo.t.u-tokyo.ac.jp; kawanami@is.aist-nara.ac.jp CR Fujisaki H., 1984, Journal of the Acoustical Society of Japan (E), V5 FUJISAKI H, 1993, IEICE T FUNDAM ECCS, P1919 HIROSE K, 1996, T IEICE INFORM SYST, P2154 HIROSE K, 1998, P INT C SPOK LANG PR, V5, P1979 HIROSE K, 1996, P INT C SPOK LANG PR, V1, P378, DOI 10.1109/ICSLP.1996.607133 HIROSE K, 1994, P 2 ESCA IEEE WORKSH, P167 Minematsu N., 1995, Journal of the Acoustical Society of Japan (E), V16 HIROSE K, 1993, IEICE T FUND ELECTR, VE76A, P1971 IWAHASHI N, 1995, P EUR C SPEECH COMM, V1, P329 KAIKI N, 1993, IEICE T FUNDAM, P1927 KAIKI N, 1992, T I ELECT INFORM COM, P467 KAWANAMI H, 1999, P ESCA TUT RES WORKS, P59 SAKATA M, 1995, P EUR C SPEECH COMM, V2, P1007 Silverman K., 1992, P INT C SPOK LANG PR, P867 SLUIJTER A, 1998, P ESCA COCOSDA INT W, P213 TAKEDA K, 1989, J ACOUST SOC AM, V86, P2081, DOI 10.1121/1.398467 TAKEDA K, 1997, 97SLP183 IPSJ SIG NR 17 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2002 VL 36 IS 1-2 BP 97 EP 111 DI 10.1016/S0167-6393(01)00028-0 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 514TZ UT WOS:000173458300008 ER PT J AU Shimojima, A Katagiri, Y Koiso, H Swerts, M AF Shimojima, A Katagiri, Y Koiso, H Swerts, M TI Informational and dialogue-coordinating functions of prosodic features of Japanese echoic responses SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Dialogue and Prosody CY SEP 01-03, 1999 CL VELDHOVEN, NETHERLANDS SP ESCA DE echoic response; repeat; spoken-dialogue corpus; information flow; dialogue coordination AB Echoic responses, which reuse portions of the texts uttered in the preceding turns, abound in dialogues, although semantically they contribute little new information. In this paper, we attempt to identify the informational and dialogue-coordinating functions of Japanese echoic responses while focusing on their prosodic and temporal features. Toward this goal, we conducted an observational study based on a corpus of spoken dialogues, as well as three complementary experiments, where particular prosodic/temporal features of echoic responses were studied in a controlled and focused manner. In combination, the two lines of analyses provide evidence that (1) echoic responses with different timings, intonations, pitches, and speeds signal different degrees to which the speakers have integrated the repeated information into their prior knowledge, and (2) owing to this informational function, the prosodic/temporal features of an echoic response also have the dialogue-coordinating function of directing the listener in how to handle the information just repeated. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Japan Adv Inst Sci & Technol, Grad Sch Knowledge Sci, Nomi, Ishikawa 9231292, Japan. ATR, Media Integrat & Commun Res Labs, Kyoto 6190288, Japan. Natl Language Res Inst, Kita Ku, Tokyo 1158620, Japan. Tech Univ Eindhoven, IPO, Ctr User Syst Interact, NL-5600 MB Eindhoven, Netherlands. Univ Instelling Antwerp, CNTS, Ctr Dutch Language & Speech, B-2610 Antwerp, Belgium. RP Shimojima, A (reprint author), Japan Adv Inst Sci & Technol, Grad Sch Knowledge Sci, 1-1 Asahi, Nomi, Ishikawa 9231292, Japan.
EM ashimoji@jaist.ac.jp RI Swerts, Marc/C-8855-2013 CR BEUN RJ, 1995, 20 IPO Campbell N., 1992, THESIS U SUSSEX CLARK HH, 1989, COGNITIVE SCI, V13, P259, DOI 10.1207/s15516709cog1302_7 Ginzburg J., 1996, LOGIC LANGUAGE COMPU, V1, P221 Grosz B., 1992, P INT C SPOK LANG PR, P429 GUMPERZ J, 1991, RETHINKING CONTEXT KATAGIRI Y, 2000, P GOTALOG 2000 4 WOR KAWAHARA H, 1997, IJCAI97 WORKSH COMP KAWAMORI M, 1996, NLC9642 I EL INF COM, P31 KOISO H, 1996, P 19 ANN C COGN SCI, P394 NAKADA T, 1991, NIHONGOGAKU, V10, P52 NORRICK NR, 1996, REPETITION DISCOURSE, V2, P15 SHIMOJIMA A, 1997, P MUN WORKSH SEM PRA, P172 Shimojima A, 1998, PROCEEDINGS OF THE TWENTIETH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, P951 Tannen Deborah, 1994, TALKING VOICES REPET Traum D. R., 1994, P 32 ANN M ASS COMP, P1, DOI 10.3115/981732.981733 TRAUM DR, 1994, 545 U ROCH VENDITTI JJ, 1995, JAPANESE TOBI LABELL WALKER MA, 1992, 14 INT C COMP LING, P345 Walker MA, 1996, LANG SPEECH, V39, P265 NR 20 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2002 VL 36 IS 1-2 BP 113 EP 132 DI 10.1016/S0167-6393(01)00029-2 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 514TZ UT WOS:000173458300009 ER PT J AU Krahmer, E Swerts, M Theune, M Weegels, M AF Krahmer, E Swerts, M Theune, M Weegels, M TI The dual of denial: Two uses of disconfirmations in dialogue and their prosodic correlates SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Dialogue and Prosody CY SEP 01-03, 1999 CL VELDHOVEN, NETHERLANDS SP ESCA DE spoken dialogue systems; prosody; error detection; information grounding; perception ID ERROR RESOLUTION; SPEECH; DISCOURSE AB In human-human communication, dialogue participants are continuously sending and receiving signals on the status of the information being exchanged. These signals may either be positive ('go on') or negative ('go back'), where it is usually found that the latter are comparatively marked to make sure that the dialogue partner is made aware of a communication problem. This article focuses on the users' signaling of information status in human-machine interactions, and in particular looks at the role prosody may play in this respect. Using a corpus of interactions with two Dutch spoken dialogue systems, prosodic correlates of users' disconfirmations were investigated. In this corpus, disconfirmations can have two uses: they may serve as a positive signal in one context and as a negative signal in another. With the data obtained from the corpus, an acoustic analysis and a perception experiment were carried out. The acoustic analysis shows that the difference in signaling function is reflected in the distribution of the various types of disconfirmations as well as in different prosodic variables (pause, duration, intonation contour and pitch range). The perception experiment revealed that subjects are very good at classifying disconfirmations as positive or negative signals (without context), which strongly suggests that the acoustic features have communicative relevance. The implications of these results for human-machine communication are discussed. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Eindhoven Univ Technol, IPO, Ctr User Syst Interact, NL-5600 MB Eindhoven, Netherlands. Univ Instelling Antwerp, CNTS Ctr Dutch Language & Speech, B-2610 Antwerp, Belgium.
RP Krahmer, E (reprint author), Tilburg Univ, BDM Computat Linguist, POB 90153, NL-5000 LE Tilburg, Netherlands. EM e.j.krahmer@kub.nl; m.g.j.swerts@tue.nl; theune@cs.utwente.nl RI Swerts, Marc/C-8855-2013 CR AHA DW, 1991, MACH LEARN, V6, P37, DOI 10.1007/BF00153759 ALLEN J, 1997, DAMSL DIALOGUE MARKU Bou-Ghazale SE, 1998, IEEE T SPEECH AUDI P, V6, P201, DOI 10.1109/89.668815 CLARK HH, 1989, COGNITIVE SCI, V13, P259, DOI 10.1207/s15516709cog1302_7 DAELEMANS W, 2000, 0001 TIMBL ILK Erickson D, 1998, LANG SPEECH, V41, P399 GROENENDIJK J, 1996, CONTEXT DEPENDENCE A, P195 HERMES DJ, 1988, J ACOUST SOC AM, V83, P257, DOI 10.1121/1.396427 HIRSCHBERG J, 1999, P INT WORKSH SPEECH HOCKEY BA, 1997, P 5 EUR C SPEECH COM, P2267 Krahmer E., 2001, International Journal of Speech Technology, V4, DOI 10.1023/A:1009648614566 Levelt W. J., 1989, SPEAKING INTENTION A LEVOW GA, 1998, P 36 ANN M ASS COMP, P736 LINDBLOM B, 1992, SPEECH COMMUN, V11, P357, DOI 10.1016/0167-6393(92)90041-5 LITMAN D, 2000, P 1 M N AM CHAPT COM Lombard E., 1911, ANN MALADIES OREILLE, V37, P101 Oviatt S, 1998, SPEECH COMMUN, V24, P87, DOI 10.1016/S0167-6393(98)00005-3 Oviatt S, 1998, LANG SPEECH, V41, P419 Pierrehumbert J., 1990, INTENTIONS COMMUNICA, P342 Pulman S, 2002, SPEECH COMMUN, V36, P15, DOI 10.1016/S0167-6393(01)00023-1 Shriberg E., 1992, P DARPA SPEECH NAT L, P49, DOI 10.3115/1075527.1075538 SOLTAU H, 1998, P INT C SPOK LANG PR Swerts M, 1997, SPEECH COMMUN, V22, P25, DOI 10.1016/S0167-6393(97)00011-3 SWERTS M, 1998, P INT C SPOK LANG PR Swerts M, 2000, P INT C SPOK LANG PR TRAUM DR, 1994, THESIS ROCHESTER Traunmuller H, 2000, J ACOUST SOC AM, V107, P3438, DOI 10.1121/1.429414 WEEGELS M, 1999, IPO ANN PROGR REPORT, P45 NR 28 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2002 VL 36 IS 1-2 BP 133 EP 145 DI 10.1016/S0167-6393(01)00030-9 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 514TZ UT WOS:000173458300010 ER PT J AU Levow, GA AF Levow, GA TI Adaptations in spoken corrections: Implications for models of conversational speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Dialogue and Prosody CY SEP 01-03, 1999 CL VELDHOVEN, NETHERLANDS SP ESCA DE spoken language systems; dialogue; prosody AB Miscommunication in spoken human-computer interaction is unavoidable. Ironically, the user's attempts to repair these miscommunications are even more likely to result in recognition failures, leading to frustrating error "spirals". In this paper we investigate users' adaptations to recognition errors made by a spoken language system and the impact of these adaptations on models for speech recognition. In analyzing over 300 pairs of original and repeat correction utterances, matched on speaker and lexical content, we found overall increases in utterance and pause duration from original to correction. Here we focus on those adaptations - phonological and durational - that are most likely to adversely impact the accuracy of speech recognizers. We identify several phonological shifts from conversational to clear speech style. We determine that the observed durations of spoken user corrections from a field trial represent increases over, and divergences from, those derived from a speech recognizer's underlying model. 
Furthermore, words in final position diverge significantly more than those in non-final position, due to the additional effects of phrase-final lengthening. These systematic changes argue for a general model of pronunciation and duration, extending beyond the sentence level to incorporate higher-level dialog features, and illustrate important features for such a model to capture. (C) 2002 Elsevier Science B.V. All rights reserved. C1 Univ Maryland, Inst Adv Comp Studies, College Pk, MD 20742 USA. RP Levow, GA (reprint author), Univ Maryland, Inst Adv Comp Studies, College Pk, MD 20742 USA. EM gina@umiacs.umd.edu CR Allen J., 1987, TEXT SPEECH MITALK S Bachenko J., 1990, Computational Linguistics, V16 Bear J., 1992, P ACL, P56, DOI 10.3115/981967.981975 BELL L, 1999, P ICPHS99 CHUNG G, 1997, THESIS MIT Collier R., 1990, PERCEPTUAL STUDY INT COLTON D, 1995, CSLU00795 OR GRAD I FERNALD A, 1989, J CHILD LANG, V16, P477 FISCHER K, 1999, HCI INT 99 GREENBERG S, 2000, P HUB5 SPEECH REC WO Hardcastle W. J., 1997, HDB PHONETIC SCI HEEMAN P, 1994, P 32 ANN M ASS COMP, P295, DOI 10.3115/981732.981773 JURAFSKY D, 2001, P ICASSP01 SALT LAK LEVOW GA, 1998, P COLING ACL 98 LJOLJE A, 2000, P HUB5 SPEECH REC WO NAKATANI CH, 1994, J ACOUST SOC AM, V95, P1603, DOI 10.1121/1.408547 Ostendorf M., 1996, P INT C SPOK LANG PR Oviatt S, 1998, SPEECH COMMUN, V24, P87, DOI 10.1016/S0167-6393(98)00005-3 OVIATT S, 1996, P INT C SPOK LANG PR, V2, P801, DOI 10.1109/ICSLP.1996.607722 PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434 PIERREHUMBERT J, 1990, SYS DEV FDN, P271 PIRKER H, 1999, EUROSPEECH 99 Shriberg E., 1992, P DARPA SPEECH NAT L, P49, DOI 10.3115/1075527.1075538 SHRIBERG E, 1997, EUROSPEECH 97 SWERTS M, 1995, P ECSA TUT RES WORKS YANKELOVICH N, 1995, CHI95 C HUM FACT COM NR 26 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2002 VL 36 IS 1-2 BP 147 EP 163 DI 10.1016/S0167-6393(01)00031-0 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 514TZ UT WOS:000173458300011 ER PT J AU Cooke, M Ellis, DPW AF Cooke, M Ellis, DPW TI The auditory organization of speech and other sources in listeners and computational models SO SPEECH COMMUNICATION LA English DT Review DE auditory scene analysis; speech perception; streaming; auditory induction; double vowels; robust ASR ID CONCURRENT VOWEL IDENTIFICATION; DIFFERENT FUNDAMENTAL FREQUENCIES; INTERAURAL TIME DIFFERENCE; WAVE-FORM INTERACTIONS; STREAM SEGREGATION; SCENE ANALYSIS; PITCH PERCEPTION; ONSET ASYNCHRONY; TONE SEQUENCES; COMPUTER-MODEL AB Speech is typically perceived against a background of other sounds. Listeners are adept at extracting target sources from the acoustic mixture reaching the ears. The auditory scene analysis (ASA) account holds that this feat is the result of a two-stage process. In the first-stage, sound is decomposed into collections of fragments in several dimensions. Subsequent processes of perceptual organization reassemble these fragments, based on cues indicating common source of origin which are interpreted in the light of prior experience. In this way, the decomposed auditory scene is processed to extract coherent evidence for one or more sources. Auditory scene analysis in listeners has been studied for several decades and recent years have seen a steady accumulation of computational models of perceptual organization. 
The purpose of this review is to describe the evidence for the nature of auditory organization in listeners and to explore the computational models which have been motivated by such evidence. The primary focus is on speech rather than on sources such as polyphonic music or non-speech ambient backgrounds, although all these domains are equally amenable to auditory organization. The review includes a discussion of the relationship between auditory scene analysis and alternative approaches to sound source segregation. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, S Yorkshire, England. Columbia Univ, Dept Elect Engn, New York, NY 10027 USA. RP Cooke, M (reprint author), Univ Sheffield, Dept Comp Sci, Sheffield S10 2TN, S Yorkshire, England. EM m.cooke@dcs.shef.ac.uk; dpwe@icsi.berkeley.edu CR ANSTIS S, 1985, J EXP PSYCHOL HUMAN, V11, P257, DOI 10.1037/0096-1523.11.3.257 ASSMANN PF, 1990, J ACOUST SOC AM, V88, P680, DOI 10.1121/1.399772 ASSMANN PF, IN PRESS AUDITORY BA ASSMANN PF, 1994, J ACOUST SOC AM, V95, P471, DOI 10.1121/1.408342 BAILEY PJ, 1977, SR5152 HASK LABS BEAUVOIS MW, 1991, Q J EXP PSYCHOL-A, V43, P517 Beauvois MW, 1996, J ACOUST SOC AM, V99, P2270, DOI 10.1121/1.415414 BELL AJ, 1995, NEURAL COMPUT, V7, P1129, DOI 10.1162/neco.1995.7.6.1129 BERTHOMMIER F, 1998, P INT C SPOK LANG PR BERTHOMMIER M, 1997, P 2 WORKSH COMP AUD Bird J., 1998, PSYCHOPHYSICAL PHYSL, P263 BODDEN M, 1995, P IEEE WORKSH APPL S Bourlard H., 1996, P EUR SIGN PROC C TR, P1579 BREGMAN A, 1995, DEMONSTRATIONS AUDIT BREGMAN AS, 1978, J EXP PSYCHOL HUMAN, V4, P380, DOI 10.1037//0096-1523.4.3.380 BREGMAN AS, 1990, CAN J PSYCHOL, V44, P400, DOI 10.1037/h0084255 BREGMAN AS, 1975, J EXP PSYCHOL HUMAN, V1, P263, DOI 10.1037//0096-1523.1.3.263 BREGMAN AS, 1978, CAN J PSYCHOL, V32, P19, DOI 10.1037/h0081664 BREGMAN AS, 1971, J EXP PSYCHOL, V89, P244, DOI 10.1037/h0031163 BREGMAN AS, 1985, PERCEPT PSYCHOPHYS, V37, P483, DOI 10.3758/BF03202881 BREGMAN AS, 1995, P IEEE WORKSH APPL S BREGMAN AS, 1984, P 7 INT C PATT REC S Bregman AS., 1990, AUDITORY SCENE ANAL BROADBENT DE, 1957, J ACOUST SOC AM, V29, P708, DOI 10.1121/1.1909019 BROKX JPL, 1982, J PHONETICS, V10, P23 BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 BROWN GJ, 1998, READINGS COMPUTATION BROWN GJ, 1995, P IEEE WORKSH APPL S BROWN GJ, 1996, ESCA TUT WORKSH AUD BROWN GJ, 1992, THESIS U SHEFFIELD BUUS S, 1985, J ACOUST SOC AM, V78, P1958, DOI 10.1121/1.392652 CARDOSO JF, 1997, P ICASSP 97 CARLYON RP, 1994, J ACOUST SOC AM, V95, P949, DOI 10.1121/1.410012 CARVER N, 1992, SYMBOLIC KNOWLEDGE B CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 Churchland P., 1994, LARGE SCALE NEURONAL CIOCCA V, 1993, J ACOUST SOC AM, V93, P2870, DOI 10.1121/1.405806 COLE RA, 1973, CAN J PSYCHOL, V27, P441, DOI 10.1037/h0082495 COMON P, 1994, SIGNAL PROCESS, V36, P287, DOI 10.1016/0165-1684(94)90029-9 COOK MP, 1994, FUNDAMENTALS SPEECH, P295 COOKE M, 1993, ENDEAVOUR, V17, P186, DOI 10.1016/0160-9327(93)90061-7 Cooke M., 1994, P 3 INT C SPOK LANG, P1555 COOKE MP, IN PRESS LISTENING S COOKE MP, 1993, THESIS CAMBRIDGE U P CRICK F, 1984, P NATL ACAD SCI-BIOL, V81, P4586, DOI 10.1073/pnas.81.14.4586 CULLING JF, 1993, J ACOUST SOC AM, V93, P3454, DOI 10.1121/1.405675 CULLING JF, 1995, J ACOUST SOC AM, V98, P837, DOI 10.1121/1.413510 CULLING JF, 1994, SPEECH COMMUN, V14, P71, DOI 10.1016/0167-6393(94)90058-2 CULLING JF, 1995, J ACOUST SOC AM, V98, P785, DOI 10.1121/1.413571 CULLING JF, 1994, J 
ACOUST SOC AM, V95, P1559, DOI 10.1121/1.408543 CUTTING JE, 1976, PSYCHOL REV, V83, P114, DOI 10.1037/0033-295X.83.2.114 DARWIN CJ, 1984, J ACOUST SOC AM, V76, P1636, DOI 10.1121/1.391610 DARWIN CJ, 1992, J ACOUST SOC AM, V91, P3381, DOI 10.1121/1.402828 DARWIN CJ, 1986, J ACOUST SOC AM, V79, P838, DOI 10.1121/1.393474 DARWIN CJ, 1981, Q J EXP PSYCHOL-A, V33, P185 DARWIN CJ, 1995, J ACOUST SOC AM, V98, P880, DOI 10.1121/1.413513 DARWIN CJ, 1977, J EXP PSYCHOL HUMAN, V3, P665, DOI 10.1037/0096-1523.3.4.665 DARWIN CJ, 1989, PERCEPT PSYCHOPHYS, V45, P333, DOI 10.3758/BF03204948 DARWIN CJ, 1990, SPEECH COMMUN, V9, P469, DOI 10.1016/0167-6393(90)90022-2 de Cheveigne A, 1999, J ACOUST SOC AM, V106, P2959, DOI 10.1121/1.428115 DECHEVEIGNE A, 1995, J ACOUST SOC AM, V97, P3736 deCheveigne A, 1997, J ACOUST SOC AM, V101, P2839, DOI 10.1121/1.418517 deCheveigne A, 1997, J ACOUST SOC AM, V101, P2848, DOI 10.1121/1.419476 DECHEVEIGNE A, 1993, J ACOUST SOC AM, V93, P3271 deCheveigne A, 1997, J ACOUST SOC AM, V101, P2857, DOI 10.1121/1.419480 DENBIGH PN, 1992, SPEECH COMMUN, V11, P119, DOI 10.1016/0167-6393(92)90006-S DEUTSCH D, 1975, J ACOUST SOC AM, V57, P1156, DOI 10.1121/1.380573 DOWLING WJ, 1973, PERCEPT PSYCHOPHYS, V14, P37, DOI 10.3758/BF03198614 DUIFHUIS H, 1982, J ACOUST SOC AM, V71, P1568, DOI 10.1121/1.387811 DURLACH NI, 1963, J ACOUST SOC AM, V35, P1206, DOI 10.1121/1.1918675 Ellis D. P. W., 1996, THESIS MIT ELLIS DPW, 1997, P INT C AC SPEECH SI, P130 Ellis DPW, 1999, SPEECH COMMUN, V27, P281, DOI 10.1016/S0167-6393(98)00083-1 ELLIS DPW, 1993, P IEEE WORKSH APPL S GARDNER RB, 1989, J ACOUST SOC AM, V85, P1329, DOI 10.1121/1.397464 GARDNER RB, 1986, PERCEPT PSYCHOPHYS, V40, P183, DOI 10.3758/BF03203015 Godsmark D, 1999, SPEECH COMMUN, V27, P351, DOI 10.1016/S0167-6393(98)00082-X GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J GRAY CM, 1989, NATURE, V338, P334, DOI 10.1038/338334a0 Green PD, 1995, P INT C AC SPEECH SI, P401 GUTTMAN N, 1963, J ACOUST SOC AM, V35, P610, DOI 10.1121/1.1918551 HALL JW, 1984, J ACOUST SOC AM, V76, P50, DOI 10.1121/1.391005 Handel S, 1989, LISTENING INTRO PERC HARTMANN WM, 1991, MUSIC PERCEPT, V9, P155 Hill N. J., 1993, J ACOUST SOC AM, V93, P2307, DOI 10.1121/1.406429 HOUTGAST T, 1972, J ACOUST SOC AM, V51, P1885, DOI 10.1121/1.1913048 HOWARDJONES PA, 1993, J ACOUST SOC AM, V93, P2915, DOI 10.1121/1.405811 HUKIN RW, 1995, J ACOUST SOC AM, V98, P1380, DOI 10.1121/1.414348 JEFFRESS LA, 1948, J COMP PHYSIOL PSYCH, V41, P35, DOI 10.1037/h0061495 JONES MR, 1976, PSYCHOL REV, V83, P323, DOI 10.1037/0033-295X.83.5.323 Junqua J.C., 1996, ROBUSTNESS AUTOMATIC KAERNBACH C, 1992, J ACOUST SOC AM, V92, P788, DOI 10.1121/1.403948 KARLSEN BL, 1998, READINGS COMPUTATION KASHINO K, 1998, READINGS COMPUTATION KLASSNER F, 1996, THESIS U MASSACHUSET Lea A., 1992, THESIS U NOTTINGHAM LEE TW, 1997, P IEEE INT C NEUR NE, P2129 LIBERMAN AM, 1982, AM PSYCHOL, V37, P148, DOI 10.1037//0003-066X.37.2.148 LICKLIDER JCR, 1951, EXPERIENTIA, V7, P128, DOI 10.1007/BF02156143 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 LIU F, 1994, BIOL CYBERN, V71, P105, DOI 10.1007/BF00197313 Lyon R. F., 1983, Proceedings of ICASSP 83. 
IEEE International Conference on Acoustics, Speech and Signal Processing Marr D, 1982, VISION McAdams S., 1984, THESIS STANFORD U McCabe SL, 1997, J ACOUST SOC AM, V101, P1611, DOI 10.1121/1.418176 MCKEOWN JD, 1995, J ACOUST SOC AM, V98, P1866, DOI 10.1121/1.413373 MEDDIS R, 1992, J ACOUST SOC AM, V91, P233, DOI 10.1121/1.402767 MEDDIS R, 1991, J ACOUST SOC AM, V89, P2866, DOI 10.1121/1.400725 MELLINGER DK, 1991, THESIS STANFORD U MILLER GA, 1950, J ACOUST SOC AM, V22, P637, DOI 10.1121/1.1906663 MONTREAL, 1995, P 1 WORKSH COMP AUD Moore B. C. J., 1995, HDB PERCEPTION COGNI, V6, P387 MOORE BCJ, 1985, J ACOUST SOC AM, V77, P1853, DOI 10.1121/1.391936 Moore BCJ, 1997, INTRO PSYCHOL HEARIN MOORE BCJ, 1986, J ACOUST SOC AM, V80, P479, DOI 10.1121/1.394043 MOORE DR, 1987, BRIT MED BULL, V43, P856 MORRIS AC, 1988, P INT C AC SPEECH SI NAGOYA, 1997, P 2 WORKSH COMP AUD NAKATANI T, 1998, READINGS COMPUTATION NAKATANI T, 1997, P 2 WORKSH COMP AUD Nawab SH, 1992, SYMBOLIC KNOWLEDGE B OKUNO HG, 1997, P 2 WORKSH COMP AUD PALMER AR, 1990, J ACOUST SOC AM, V88, P1412, DOI 10.1121/1.400329 PARSONS TW, 1976, J ACOUST SOC AM, V60, P911, DOI 10.1121/1.381172 PATTERSON RD, 1987, J ACOUST SOC AM, V82, P1560, DOI 10.1121/1.395146 Pierce JR, 1983, SCI MUSICAL SOUND RAND TC, 1974, J ACOUST SOC AM, V55, P678, DOI 10.1121/1.1914584 REMEZ RE, 1994, PSYCHOL REV, V101, P129, DOI 10.1037/0033-295X.101.1.129 REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191 RICHARDS VM, 1987, J ACOUST SOC AM, V82, P1621, DOI 10.1121/1.395153 ROGERS WL, 1993, PERCEPT PSYCHOPHYS, V53, P179, DOI 10.3758/BF03211728 ROSENTHAL D, 1998, READINGS COMPUTATION SABERI K, 1995, NATURE, V374, P537, DOI 10.1038/374537a0 Scheffers M. T. M., 1983, THESIS U GRONINGEN SCHOONEVELDT GP, 1989, J ACOUST SOC AM, V85, P273, DOI 10.1121/1.397734 SHACKLETON TM, 1992, J ACOUST SOC AM, V91, P3579, DOI 10.1121/1.402811 SLANEY M, 1998, READINGS COMPUTATION STUBBS RJ, 1988, J ACOUST SOC AM, V84, P1236, DOI 10.1121/1.396624 SUMMERFIELD Q, 1995, FUNDAMENTALS SPEECH SUMMERFIELD Q, 1992, PHILOS T R SOC B, V336, P415 THOMPSON H, 1993, P INT S SPOK DIAL TO, P33 THURLOW WR, 1959, J ACOUST SOC AM, V31, P1337, DOI 10.1121/1.1907631 Todd N, 1996, NETWORK-COMP NEURAL, V7, P349, DOI 10.1088/0954-898X/7/2/016 TORKKOLA K, 1998, P IEEE DSP WORKSH BR Unoki M, 1999, SPEECH COMMUN, V27, P261, DOI 10.1016/S0167-6393(98)00077-6 van Noorden L. P. A. 
S., 1975, THESIS EINDHOVEN U T VANDERKOUWE AJW, 1999, OSUCISRC699TR15 Varga A.P., 1990, P ICASSP, P845 VARGA AP, 1992, NOISEX 92 STUDY EFFE Vliegen J, 1999, J ACOUST SOC AM, V106, P938, DOI 10.1121/1.427140 Vliegen J, 1999, J ACOUST SOC AM, V105, P339, DOI 10.1121/1.424503 VONDERMALSBURG C, 1986, BIOL CYBERN, V54, P29 Wang DLL, 1999, IEEE T NEURAL NETWOR, V10, P684, DOI 10.1109/72.761727 WARREN RM, 1990, PERCEPT PSYCHOPHYS, V47, P423, DOI 10.3758/BF03208175 Warren RM, 1996, J ACOUST SOC AM, V100, P2452, DOI 10.1121/1.417953 WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 WARREN RM, 1976, PERCEPT PSYCHOPHYS, V20, P380, DOI 10.3758/BF03199419 Warren RM, 1997, PERCEPT PSYCHOPHYS, V59, P275, DOI 10.3758/BF03211895 WARREN RM, 1970, SCI AM, V223, P30 WARREN RM, 1972, SCIENCE, V176, P1149, DOI 10.1126/science.176.4039.1149 WATKINS AJ, 1991, J ACOUST SOC AM, V90, P2942, DOI 10.1121/1.401769 Weintraub M., 1985, THESIS STANFORD U WOODS WS, 1992, J ACOUST SOC AM, V91, P2894, DOI 10.1121/1.402926 ZAKARAUSKAS P, 1993, J ACOUST SOC AM, V94, P1323, DOI 10.1121/1.408160 NR 164 TC 68 Z9 70 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2001 VL 35 IS 3-4 BP 141 EP 177 DI 10.1016/S0167-6393(00)00078-9 PG 37 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 488ZV UT WOS:000171968000001 ER PT J AU Pelorson, X AF Pelorson, X TI On the meaning and accuracy of the pressure-flow technique to determine constriction areas within the vocal tract SO SPEECH COMMUNICATION LA English DT Article DE speech production; speech aerodynamics; fluid mechanics ID COLLAPSIBLE TUBES; PHONATION; GLOTTIS; SPEECH; MODEL AB Since Warren and DuBois (D.W. Warren, A.B. DuBois, Cleft Palate Journal 1 (1964) 52-71), the "Pressure-Flow technique" has been widely used to estimate constriction areas within the vocal tract. In this paper, three fundamental questions regarding this technique are addressed: (1) What exactly is measured (minimum, maximum or "mean" areas)? (2) What degree of accuracy can be expected from this technique? (3) To what extent can this method be applied to unsteady flow conditions? A theoretical and experimental study based on a mechanical vocal tract model, including various constriction shapes, is presented. The pressure-flow technique is shown to be relatively insensitive to the exact constriction shape (circular, uniform or diverging), and the estimated area to be close to the minimum area of the constriction. This result can be theoretically rationalised by considering that in all cases studied here, the flow separation point is always close to the minimum constriction. Compared with much more complex viscous flow solutions, a simple one-dimensional flow model is shown to yield fair estimates of the areas (within 20%), except for low Reynolds number flows. The empirical head-loss factor, or flow coefficient, k = 0.65, sometimes used, appears to be disputable and is probably due to an experimental artefact. Lastly, these results are extended to the case of unsteady flow. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Grenoble 3, UMR CNRS Q5009, INPG, Inst Commun Parlee, F-38031 Grenoble, France. RP Pelorson, X (reprint author), Univ Grenoble 3, UMR CNRS Q5009, INPG, Inst Commun Parlee, 46 Ave Felix Viallet, F-38031 Grenoble, France.
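The one-dimensional flow model underlying the pressure-flow technique in the Pelorson record above reduces to the orifice equation: the constriction area is A = Q / (k * sqrt(2*dP/rho)), with Q the volume flow, dP the pressure drop, rho the air density and k the flow coefficient (k = 1 for the plain 1D model; k = 0.65 is the empirical value the paper calls into question). A sketch with invented measurement values:

import math

RHO = 1.2  # air density in kg/m^3

def constriction_area(flow_m3s, delta_p_pa, k=1.0):
    # One-dimensional (Bernoulli) estimate of the minimum constriction
    # area from volume flow and the pressure drop across the constriction.
    return flow_m3s / (k * math.sqrt(2.0 * delta_p_pa / RHO))

q, dp = 200e-6, 500.0   # 200 cm^3/s, 500 Pa (illustrative values only)
print(constriction_area(q, dp) * 1e6, "mm^2")          # plain 1D model
print(constriction_area(q, dp, k=0.65) * 1e6, "mm^2")  # with k = 0.65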
EM pelorson@icp.inpg.fr CR BERTRAM CD, 1986, J BIOMECH, V19, P61, DOI 10.1016/0021-9290(86)90109-0 Blevins R. D., 1992, APPL FLUID DYNAMICS CONRAD WA, 1969, IEEE T BIO-MED ENG, VBM16, P284, DOI 10.1109/TBME.1969.4502660 GUYETTE TW, 1988, J SPEECH HEAR RES, V31, P538 HIXON T, 1966, FOLIA PHONIATR, V18, P168 MAIR SJ, 1996, P 4 SPEECH PROD SEM, P29 MULLER E, 1980, SPEECH LANGUAGE ADV PELORSON X, 1995, ACTA ACUST, V3, P191 PELORSON X, 1994, J ACOUST SOC AM, V96, P3416, DOI 10.1121/1.411449 ROTHENBE.M, 1973, J ACOUST SOC AM, V53, P1632, DOI 10.1121/1.1913513 Schlichting H., 1968, BOUNDARY LAYER THEOR SCULLY C, 1986, J PHONETICS, V14, P407 SHADLE CH, 1995, J PHONETICS, V23, P53, DOI 10.1016/S0095-4470(95)80032-8 STROMBERG K, 1994, P I ACOUSTICS, V16, P325 Warren D. W., 1964, CLEFT PALATE J, V1, P52 Warren D W, 1966, Cleft Palate J, V3, P103 YATES CC, 1990, CLEFT PALATE J, V27, P193, DOI 10.1597/1545-1569(1990)027<0193:TPFMSF>2.3.CO;2 ZAJAC DJ, 1991, J SPEECH HEAR RES, V34, P1073 NR 18 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2001 VL 35 IS 3-4 BP 179 EP 190 DI 10.1016/S0167-6393(00)00082-0 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 488ZV UT WOS:000171968000002 ER PT J AU Lee, MS Kim, HK Lee, HS AF Lee, MS Kim, HK Lee, HS TI A new distortion measure for spectral quantization based on the LSF intermodel interlacing property SO SPEECH COMMUNICATION LA English DT Article DE spectral quantization; line spectral frequency; LSF weighting function; formant bounded weighting function; LSF distortion measure ID VECTOR QUANTIZATION; LPC PARAMETERS; FREQUENCIES AB The line spectral frequencies (LSFs) extracted from successive analysis orders are interlaced with each other. This intermodel interlacing property gives a new relationship between the closeness of LSFs and their spectral sensitivities, which motivates a new weighting function for LSF distortion measurement. By applying this new weighting function to LSF quantization, we have achieved a significantly better performance than the conventional heuristic weighting functions in both clean and noise environments. In addition, the proposed weighting function gives better performance than the weighting function based on a high-rate approximation (Gardner weighting (GW)) [W.R. Gardner, B.D. Rao, IEEE Trans. Speech Audio Processing 3 (5) (1995) 367] in noise environments while their performances are comparable in clean environments. Moreover, the complexity of the proposed weighting function is much lower than that of the GW function. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Korea Adv Inst Sci & Technol, Dept Elect Engn, Yusong Gu, Taejon 305701, South Korea. AT&T Labs Res, Florham Pk, NJ 07932 USA. Voiceware, Kangnam Gu, Seoul 135280, South Korea. RP Lee, HS (reprint author), Korea Adv Inst Sci & Technol, Dept Elect Engn, Yusong Gu, 373-1 Kusong Dong, Taejon 305701, South Korea. EM lms@spl.kaist.ac.kr; hkkim@research.att.com; hslee@ee.kaist.ac.kr RI Lee, Hwang-Soo/C-1867-2011 CR BERESON ML, 1988, APPL STAT ERKELENS JS, 1995, ELECTRON LETT, V31, P1410, DOI 10.1049/el:19950988 ERKELENS JS, 1995, P ICASSP, P768 GARDNER WR, 1995, IEEE T SPEECH AUDI P, V3, P367, DOI 10.1109/89.466658 ITAKURA F, 1975, J ACOUST SOC AM, V57, pS35, DOI 10.1121/1.1995189 Kim HK, 1999, IEEE T SPEECH AUDI P, V7, P87 KIM MY, 1997, P 1997 IEEE WORKSH S, P77 Kleijn W. 
B., 1995, SPEECH CODING SYNTHE Kovesi B, 1999, SPEECH COMMUN, V29, P39, DOI 10.1016/S0167-6393(99)00026-6 Laroia R., 1991, P IEEE INT C AC SPEE, P641, DOI 10.1109/ICASSP.1991.150421 Laurent PA, 1997, IEEE T SPEECH AUDI P, V5, P481, DOI 10.1109/89.622575 LINDE Y, 1980, IEEE T COMMUN, V28, P1 NTT-AT, 1994, MULT LING SPEECH DAT Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363 *QUALC INC, 1996, HIG RAT SPEECH SERV VU HL, 1998, P ICASSP, P45 NR 16 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2001 VL 35 IS 3-4 BP 191 EP 202 DI 10.1016/S0167-6393(00)00080-7 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 488ZV UT WOS:000171968000003 ER PT J AU Miyajima, C Watanabe, H Tokuda, K Kitamura, T Katagiri, S AF Miyajima, C Watanabe, H Tokuda, K Kitamura, T Katagiri, S TI A new approach to designing a feature extractor in speaker identification based on discriminative feature extraction SO SPEECH COMMUNICATION LA English DT Article DE speaker identification; discriminative feature extraction; minimum classification error; mel-cepstral estimation; second-order all-pass system function ID WARPED FREQUENCY SCALE; LINEAR PREDICTION; RECOGNITION; CLASSIFICATION; ALGORITHMS AB This paper presents a new framework for designing a feature extractor in a speaker identification system based on the discriminative feature extraction (DFE) method. In order to find the frequency scale appropriate for accurate speaker identification, a mel-cepstral estimation technique using a second-order all-pass warping function is applied to the feature extractor; the frequency warping parameters and the text-independent speaker model parameters are jointly optimized based on a minimum classification error (MCE) criterion. Experimental results show that the frequency scale after optimization is different from traditional Linear/Mel scales and the proposed system outperforms conventional systems in which only the classifier is optimized with the MCE criterion. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Nagoya Inst Technol, Dept Comp Sci, Showa Ku, Nagoya, Aichi 4668555, Japan. ATR, Human Informat Proc Res Labs, Kyoto 6190288, Japan. Nippon Telegraph & Tel Corp, Commun Sci Labs, Kyoto 6190237, Japan. RP Miyajima, C (reprint author), Nagoya Inst Technol, Dept Comp Sci, Showa Ku, Gokiso Cho, Nagoya, Aichi 4668555, Japan. 
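The Miyajima et al. record above jointly optimizes a frequency warping with the speaker models. The building block is all-pass frequency warping as used in mel-cepstral analysis; for a first-order all-pass with coefficient alpha, the warped frequency is w' = w + 2*arctan(alpha*sin(w) / (1 - alpha*cos(w))) (the paper itself uses a second-order all-pass, which adds a second parameter and is not reproduced here). A sketch of the first-order curve; alpha = 0.31 as a mel approximation at 8 kHz is an assumption for illustration:

import math

def allpass_warp(omega, alpha):
    # First-order all-pass (bilinear) frequency warping: the phase
    # response of (z^-1 - alpha)/(1 - alpha*z^-1) evaluated at omega.
    return omega + 2.0 * math.atan2(alpha * math.sin(omega),
                                    1.0 - alpha * math.cos(omega))

for hz in (500, 1000, 2000, 3000):
    omega = math.pi * hz / 4000.0            # normalized, fs = 8 kHz
    warped = allpass_warp(omega, alpha=0.31)
    print(hz, "Hz ->", round(warped * 4000.0 / math.pi), "Hz warped")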
EM chiyomi@ics.nitech.ac.jp CR ARIKI Y, 1996, P ICASSP 96 ATL, V1, P319 ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 AUCKENTHALER R, 1997, P EUROSPEECH 97 RHOD, V5, P2303 AUCKENTHALER R, 1998, P RLA2C 98 BESACIER L, 1997, P ICSP 97, V2, P575 BIEM A, 1995, P EUROSPEECH 95 MADR, V1, P545 BIEM A, 1993, P 1993 IEEE WORKSH N, P392 Biem A, 1997, IEEE T SIGNAL PROCES, V45, P500, DOI 10.1109/78.554319 Chen H P, 1994, Bioorg Med Chem, V2, P1, DOI 10.1016/S0968-0896(00)82195-1 CHOU W, 1993, P IEEE INT C AC SPEE, V2, P652 CHOU W, 1992, P INT C AC SPEECH SI, V1, P473 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DELALAMO CM, 1996, P ICASSP96 MAY, V1, P89 FRANCO H, 1991, P ICASSP 91 TOR CAN, V1, P357 Fukada T., 1992, P ICASSP 92, V1, P137 GRAVIER G, 1997, P EUROSPEECH 97 RHOD, V5, P2299 HAYAKAWA S, 1994, P ICASSP 94 AD AUSTR, V1, P137 Imai S., 1988, P EURASIP 88, P203 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 Katagiri S., 1991, P IEEE WORKSH NEUR N, P299 Katagiri S, 1998, P IEEE, V86, P2345, DOI 10.1109/5.726793 KATAGIRI S, 1994, ARTIFICIAL NEURAL NE, P278 KOMORI T, 1992, P ICASSP 92 SAN FRAN, V1, P497 KRUGER E, 1988, IEEE T ACOUST SPEECH, V36, P1529, DOI 10.1109/29.90384 LIN Q, 1996, P ICSLP 96 PHIL, V4, P2415 LINDE Y, 1980, IEEE T COMMUN, V28, P1 Liu Chi-Shi, 1994, P 1994 IEEE INT C AC, V1, P325 LIU CS, 1995, J ACOUST SOC AM, V97, P637, DOI 10.1121/1.412286 LIU CS, 1996, P ICASSP 96 ATL, V2, P669 Markov KP, 1998, SPEECH COMMUN, V24, P193, DOI 10.1016/S0167-6393(98)00010-7 MATSUI T, 1992, P INT C AC SPEECH SI, V2, P157 MCDERMOTT E, 1994, COMPUT SPEECH LANG, V8, P351, DOI 10.1006/csla.1994.1018 McDonough J., 1998, P INT C SPOK LANG PR, V16, P2307 MIYAJIMA C, 1999, P EUROSCPEECH 99 BUD, V2, P779 Oppenheim A. V., 1989, DISCRETE TIME SIGNAL PALIWAL KK, 1995, P EUR C SPEECH COMM, V1, P541 PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 Rosenberg A.E., 1998, P IEEE INT C AC SPEE, V1, P105, DOI 10.1109/ICASSP.1998.674378 Sato A, 1996, ADV NEUR IN, V8, P423 SHIKANO K, 1986, CMUCS86108 SIOHAN O, 1998, P IEEE INT C AC SPEE, V1, P109, DOI 10.1109/ICASSP.1998.674379 Soong F.K., 1985, P INT C AC SPEECH SI, P387 STRUBE HW, 1980, J ACOUST SOC AM, V68, P1071, DOI 10.1121/1.384992 Tokuda K., 1991, Transactions of the Institute of Electronics, Information and Communication Engineers A, VJ74A Wakahara Y, 1999, ELECTRON COMM JPN 1, V82, P11, DOI 10.1002/(SICI)1520-6424(199902)82:2<11::AID-ECJA2>3.0.CO;2-P WATANABE H, 1997, P ICASSP 97 MUN GERM, P3237 YAGLE AE, 1991, IEEE T SIGNAL PROCES, V39, P2457, DOI 10.1109/78.98001 NR 48 TC 10 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD OCT PY 2001 VL 35 IS 3-4 BP 203 EP 218 DI 10.1016/S0167-6393(00)00079-0 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 488ZV UT WOS:000171968000004 ER PT J AU Wu, CH Chen, JH AF Wu, CH Chen, JH TI Automatic generation of synthesis units and prosodic information for Chinese concatenative synthesis SO SPEECH COMMUNICATION LA English DT Article DE Chinese text-to-speech conversion; synthesis units; prosodic information; concatenative synthesis; pitch contour; syllable duration ID TEXT-TO-SPEECH; MANDARINE SPEECH; ALGORITHM; SYSTEM; RULES; MODEL AB In this paper, some approaches to the generation of synthesis units and prosodic information are proposed for Mandarin Chinese text-to-speech (TTS) conversion. The monosyllables are adopted as the basic synthesis units. A set of synthesis units is selected from a large continuous speech database based on two cost functions, which minimize the inter- and intra-syllable distortion. The speech database is also employed to establish a word-prosody-based template tree according to the linguistic features: tone combination, word length, part-of-speech (POS) of the word, and word position in a phrase. This template tree stores the prosodic features including pitch contour, average energy, and syllable duration of a word for possible combinations of linguistic features. Two modules for sentence intonation and template selection are proposed to generate the target prosodic templates. The experimental results showed that the synthesized prosodic features matched quite well with their original counterparts. Evaluation by subjective experiments also confirmed the satisfactory performance of these approaches. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan, Taiwan. RP Wu, CH (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, 1 Ta Hsueh Rd, Tainan, Taiwan. RI Wu, Chung-Hsien/E-7970-2013 CR BIGORGNE D, 1993, P ICASSP, P187 CHAN NC, 1992, J INF SCI ENG, V8, P261 Charpentier F. J., 1986, P ICASSP, P2015 CHEN SH, 1990, IEEE T COMMUN, V38, P1317 Chen SH, 1998, IEEE T SPEECH AUDI P, V6, P226 Chou FC, 1998, INT CONF ACOUST SPEE, P893 George EB, 1997, IEEE T SPEECH AUDI P, V5, P389, DOI 10.1109/89.622558 KAWAI H, 1995, P ICASSP, P1569 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 LEE LS, 1989, IEEE T ACOUST SPEECH, V37, P1309 Lee LS, 1993, IEEE T SPEECH AUDI P, V1, P287, DOI 10.1109/89.232612 QUATIERI TF, 1992, IEEE T SIGNAL PROCES, V40, P497, DOI 10.1109/78.120793 RABINER LR, 1978, DIGITAL PROCESSING S, P399 SCORDILIS MS, 1989, P IEEE INT C AC SPEE, V1, P219 SHIH CL, 1996, LANGUAGE PROCESSING, V1, P37 WANG JF, 1991, IEEE T SIGNAL PROCES, V39, P2141, DOI 10.1109/78.134458 Wu CH, 1997, IEEE T SPEECH AUDI P, V5, P106 NR 17 TC 33 Z9 33 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2001 VL 35 IS 3-4 BP 219 EP 237 DI 10.1016/S0167-6393(00)00075-3 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 488ZV UT WOS:000171968000005 ER PT J AU Steeneken, HJM AF Steeneken, HJM TI Multi-lingual interoperability in speech technology SO SPEECH COMMUNICATION LA English DT Editorial Material C1 TNO, NL-3769 ZG Soesterberg, Netherlands. RP Steeneken, HJM (reprint author), TNO, POB 23, NL-3769 ZG Soesterberg, Netherlands.
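The Wu and Chen record above selects synthesis units with two cost functions minimizing inter- and intra-syllable distortion. The usual computational frame for this is a dynamic-programming search over candidate units per syllable; the sketch below is that generic unit-selection search, not the authors' exact cost functions, and all numbers are invented:

def select_units(candidates, target_cost, concat_cost):
    # candidates[i]: database units for syllable i.  Finds the sequence
    # minimising summed intra-syllable (target) and inter-syllable
    # (concatenation) costs by dynamic programming over syllables.
    best = {u: (target_cost(0, u), [u]) for u in candidates[0]}
    for i in range(1, len(candidates)):
        step = {}
        for u in candidates[i]:
            cost, path = min(best.values(),
                             key=lambda cp: cp[0] + concat_cost(cp[1][-1], u))
            step[u] = (cost + concat_cost(path[-1], u) + target_cost(i, u),
                       path + [u])
        best = step
    return min(best.values(), key=lambda cp: cp[0])[1]

# Toy units: (mean pitch Hz, duration s); targets from a prosody template.
cands = [[(120, 0.20), (130, 0.25)], [(125, 0.22), (150, 0.30)]]
goal = [(128, 0.24), (126, 0.22)]
t = lambda i, u: abs(u[0] - goal[i][0]) + 100 * abs(u[1] - goal[i][1])
c = lambda a, b: 0.5 * abs(a[0] - b[0])
print(select_units(cands, t, c))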
EM steeneken@tm.tno.nl NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2001 VL 35 IS 1-2 BP 1 EP 3 DI 10.1016/S0167-6393(00)00091-1 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 453ZM UT WOS:000169948000001 ER PT J AU Adda-Decker, M AF Adda-Decker, M TI Towards multilingual interoperability in automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Multi-Lingual Interoperability in Speech Technology CY 1999 CL LEUSDEN, NETHERLANDS SP RTO, ISCA DE interoperability; multilinguality; speech recognition AB This communication addresses multilingual aspects in speech recognition and tries to link them to the concept of interoperability. After a tentative definition of multilingual interoperability, the speech recognition components are discussed with a view towards separating language-specific from language-independent elements. An overview gives examples of previous multilingual speech recognition research and developments across different speaking styles (read, prepared and conversational). The problem of adaptation across languages is addressed. In particular there exist language-independent and cross-language acoustic modeling techniques to port recognition systems from one language to another without language-specific acoustic data. However, these data remain valuable for acoustic model adaptation. At this time pronunciation dictionaries and text material appear to be the most crucial language-dependent resources. In our opinion fast porting, enabled by the existence of these language-dependent resources, is a step towards multilingual interoperability. On-going efforts to produce multilingual pronunciation dictionaries and to collect multilingual text corpora, including speech transcripts, should be extended to the largest possible number of languages. These efforts could be shared with initiatives aiming at the support of minority languages. (C) 2001 Elsevier Science B.V. All rights reserved. C1 CNRS, LIMSI, Spoken Language Proc Grp, F-91403 Orsay, France. RP Adda-Decker, M (reprint author), CNRS, LIMSI, Spoken Language Proc Grp, BP133, F-91403 Orsay, France.
EM madda@limsi.fr CR ADDA G, 1997, P EUR SPEECH RHOD ADDADECKER M, 1996, P IEEE ICASSP ATL ADDADECKER M, 1999, P IEEE ICASSP PHOEN ARMSTRONG S, 1998, P 1 INT C LANG RES E, V1, P975 BAKER JK, 1975, IEEE T ACOUST SPEECH, VAS23, P24, DOI 10.1109/TASSP.1975.1162650 Barras C., 1998, P 1 INT C LANG RES E BERKLING KM, 1994, P ICSLP YOK BILLA J, 1997, P EUR RHOD BIRD S, 1998, P ICSLP SYDN, V7, P3179 BYRNE B, 1999, SUMM RES WORKSH SPEE CHASE L, 1998, P 1 INT C LANG RES E, V2, P789 CORREDORARDOY C, 1997, P EUR RHOD CULHANE CS, 1996, P DARPA SPEECH REC W DEJONG F, 1999, P CBMI99 EUR WORKSH DRAXLER C., 1998, P 1 INT C LANG RES E, VI, P361 DUGAST C, 1995, P EUR MADR EKLUND R, 1999, COMMUNICATION FUNG P, 1999, P EUR BUD, V2, P871 Gauvain J., 1994, IEEE T SPEECH AUDIO, V2 GAUVAIN JL, 1997, P IEEE ICASSP MUN GAUVAIN JL, 1994, SPEECH COMMUN, V15, P21, DOI 10.1016/0167-6393(94)90038-8 Gauvain J.-L., 1998, P INT C SPOK LANG PR, P1335 GAUVAIN JL, 1999, P EUR BUD, P655 GEUTNER P, 1998, P DARPA BROADC NEWS Gibbon D, 1997, HDB STANDARDS RESOUR Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858 GREFENSTETTE G, 1998, KLUWER INT SERIES IN Grimes B., 1996, ETHNOLOGUE LANGUAGE HABERT B, 1998, P 1 INT C LANG RES E HUERTE JM, 1998, P DARPA BROADC NEWS IDE N, 1998, P 1 INT C LANG RES E, V1, P463 JELINEK F, 1976, P IEEE, V64, P532, DOI 10.1109/PROC.1976.10159 JONES G, 1997, INTELLIGENET MULTIME KHUDANPUR S, 1999, SUMM RES WORKSH SPEE KOHLER J, 1998, P IEEE ICASSP SEATTL, V1, P417, DOI 10.1109/ICASSP.1998.674456 KUBALA F, 1999, P DARPA BROADC NEWS, P83 LAMEL L, 1998, P 1 INT C LANG RES E LAMEL LF, 1995, P EUR MADR LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MALHERBE M, 1995, LANGAGES HUMANITE BO MARIANI J, 1999, P DARPA BROADC NEWS, P237 MARIANI J, 1998, P 1998 BROADC NEWS T, P247 MATSOUKAS S, 1999, P DARPA BROADC NEWS, P255 MATSUKA T, 1996, P DARPA SPEECH REC W, P137 NEY H, 1999, P IEEE ICASSP PHOEN, V1, P517 PALLETT D, 1998, P 1 INT C LANG RES E, V1, P327 PYE D, 1995, P EUR MADR RUIMY N, 1998, P 1 INT C LANG RES E, V1, P241 SCHULTZ T, 1997, MULTILINGUAL INFORMA SCHULTZ T, 1998, P ICSLP SYDN AUSTR, V5, P1819 Schultz T., 1997, P EUROSPEECH, P371 Schwartz R., 1997, P DARPA SPEECH REC W, P115 TUFIS D, 1998, P 1 INT C LANG RES E, V1, P233 UEHLER U, 1998, P ICSLP SYDN, V5, P1687 VANCOMPERNOLLE D, 1999, ESCA NATO WORKSH MUL Wahlster W., 1993, P 3 EUR C SPEECH COM, P29 Weng F., 1997, P EUR RHOD, P359 Young S., 1997, CORPUS BASED METHODS Young SJ, 1997, COMPUT SPEECH LANG, V11, P73, DOI 10.1006/csla.1996.0023 ZAVALIAGKOS G, 1998, P DARPA BROADC NEWS, P301 ZHAN P, 1997, P DARPA BROADC NEWS ZISSMAN MA, 1996, IEEE T, V4 ZUE V, 1997, P EUR RHOD, V1, pKN9 NR 63 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD AUG PY 2001 VL 35 IS 1-2 BP 5 EP 20 DI 10.1016/S0167-6393(00)00092-3 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 453ZM UT WOS:000169948000002 ER PT J AU Kohler, J AF Kohler, J TI Multilingual phone models for vocabulary-independent speech recognition tasks SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Multi-Lingual Interoperability in Speech Technology CY 1999 CL LEUSDEN, NETHERLANDS SP RTO, ISCA DE multilinguality; multilingual phone modeling; IPA; acoustic modeling; cross-language transfer AB This paper presents three different methods for developing multilingual phone models for flexible speech recognition tasks. The main goal of our investigations is to find multilingual speech units that work equally well in many languages. With such a universal set it is possible to build speech recognition systems for a variety of languages. One advantage of this approach is that acoustic-phonetic parameters in a HMM-based speech recognition system can then be shared. The multilingual approach starts with the phone sets of six languages, a total of 232 language-dependent and context-independent phone models. Then, we develop three different methods to map the language-dependent models to a multilingual phone set. The first method is a direct mapping to the phone set of the International Phonetic Association (IPA). In the second approach we apply an automatic clustering algorithm for the phone models. The third method exploits the similarities of single mixture components of the language-dependent models. Like the first method the language-specific models are mapped to the IPA inventory. In the second step an agglomerative clustering is performed on the density level to find regions of similarity between the phone models of different languages. The experiments carried out with the SpeechDat(M) database, show that the third method yields almost the same recognition rate as language-dependent models. However, using this method we achieve a huge reduction of the number of densities in the multilingual system. (C) 2001 Elsevier Science B.V. All rights reserved. C1 German Natl Res Ctr Informat Technol, GMD, IMK, Inst Media Commun, D-53754 St Augustin, Germany. RP Kohler, J (reprint author), German Natl Res Ctr Informat Technol, GMD, IMK, Inst Media Commun, Schloss Birlinghoven, D-53754 St Augustin, Germany. EM joachim.koehler@gmd.de CR ANDERSEN O, 1993, P EUR 1993 BERL, P759 BERKLING KM, 1996, AUTOMATIC LANGUAGE I BONAVENTURA P, 1997, P EUR RHOD GREEC, P355 BUB U, 1997, P ICASSP, P1451 Campbell George L, 1995, CONCISE COMPENDIUM W CORREDORARDOY C, 1997, P EUR 1997 RHOD, P55 Dalsgaard P., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607981 Falkhausen M., 1995, P EUR 1995 MADR, P1487 GLASS J, 1995, SPEECH COMMUN, V17, P1, DOI 10.1016/0167-6393(95)00008-C HAUENSEIN A, 1995, P IEEE INT C AC SPEE, P425 HIERONYMUS JL, 1993, BELL LABS TECHNICAL *INT PHOEN ASS, 1993, J INT PHONETIC ASS, V1 JUANG BH, 1985, AT&T TECH J, V64, P391 Kohler J, 1998, INT CONF ACOUST SPEE, P417, DOI 10.1109/ICASSP.1998.674456 Kohler J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607240 KOHLER J, 2000, THESIS TU MUENCHEN LADEFOGED P, 1995, SOUNDS WORLDS LANGUA Lamel L., 1995, P EUR, P185 LAMEL L, 1993, P INT C AC SPEECH SI, P507 Muthusamy Y. 
K., 1992, P INT C SPOK LANG PR, P895 Schultz T., 1997, P EUROSPEECH, P371 SILVERMAN H., 1994, P ICASSP 1994 AD, P317 NR 22 TC 14 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2001 VL 35 IS 1-2 BP 21 EP 30 DI 10.1016/S0167-6393(00)00093-5 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 453ZM UT WOS:000169948000003 ER PT J AU Schultz, T Waibel, A AF Schultz, T Waibel, A TI Language-independent and language-adaptive acoustic modeling for speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Multi-Lingual Interoperability in Speech Technology CY 1999 CL LEUSDEN, NETHERLANDS SP RTO, ISCA DE language portability; multilingual acoustic models; large vocabulary continuous speech recognition; polyphone decision tree specialization (PDTS); GlobalPhone AB With the distribution of speech technology products all over the world, the portability to new target languages becomes a practical concern. As a consequence our research focuses on the question of how to port large vocabulary continuous speech recognition (LVCSR) systems in a fast and efficient way. More specifically we want to estimate acoustic models for a new target language using speech data from varied source languages, but only limited data from the target language. For this purpose, we introduce different methods for multilingual acoustic model combination and a polyphone decision tree specialization procedure. Recognition results using language-dependent, independent and language-adaptive acoustic models are presented and discussed in the framework of our GlobalPhone project which investigates LVCSR systems in 15 languages. (C) 2001 Published by Elsevier Science B.V. C1 Univ Karlsruhe, Interact Syst Labs, D-76131 Karlsruhe, Germany. Carnegie Mellon Univ, Interact Syst Labs, Pittsburgh, PA 15213 USA. RP Schultz, T (reprint author), Univ Karlsruhe, Interact Syst Labs, D-76131 Karlsruhe, Germany. EM tanja@ira.uka.de CR ANDERSEN O, 1993, P EUR 1993 BERL, P759 ANDERSEN O, 1997, P EUR RHOD, P67 Barnett J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607239 Billa J., 1997, P EUR, P363 BONAVENTURA P, 1997, P EUR RHOD GREEC, P355 BUB U, 1997, P ICASSP, P1451 CARKI K, 2000, P IEEE INT C AC SPEE, V3, P1563 Cohen P., 1997, P AUT SPEECH REC UND, P591 Constantinescu A., 1997, P ASRU, P606 Corredor-Ardoy C., 1997, P EUR RHOD 1997, P355 Dugast C., 1995, P EUROSPEECH, P197 FINKE M, 1997, P INT C AC SPEECH SI, P83 FINKE M, 1997, P ICASSP MUN GERM, P1743 GLASS J, 1995, SPEECH COMMUN, V17, P1, DOI 10.1016/0167-6393(95)00008-C Gokcen S., 1997, P AUT SPEECH REC UND, P599 HIERONYMUS JL, 1993, J INT PHON ASSOC, P23 *IPA, 1993, J INT PHONETIC ASS, P23 Kiecza D., 1999, P INT C SPEECH PROC, P323 Kohler J, 1998, INT CONF ACOUST SPEE, P417, DOI 10.1109/ICASSP.1998.674456 Lamel L., 1995, P EUR, P185 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 OSTERHOLTZ L, 1992, P ICASSP SAN FRANC 1 REICHERT J, 1999, P EUR BUD 1999, P815 Schultz T, 1997, P SQEL 2 WORKSH MULT, P20 Schultz T., 2000, THESIS U KARLSRUHE SCHULTZ T, 1998, P SPEC ST PET RUSS, P207 Schultz T., 1997, P EUROSPEECH, P371 Schultz T., 1998, P DARPA WORKSH BROAD, P259 SCHULTZ T, 1998, P ICSLP SYDN, P1819 WEBSTER, 1992, NEW ENCY DICT Wells C. 
J., 1989, J INT PHONETIC ASS, V19, P32 WHEATLEY B, 1994, P ICASSP, P237 Young SJ, 1997, COMPUT SPEECH LANG, V11, P73, DOI 10.1006/csla.1996.0023 NR 33 TC 110 Z9 114 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2001 VL 35 IS 1-2 BP 31 EP 51 DI 10.1016/S0167-6393(00)00094-7 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 453ZM UT WOS:000169948000004 ER PT J AU Uebler, U AF Uebler, U TI Multilingual speech recognition in seven languages SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Multi-Lingual Interoperability in Speech Technology CY 1999 CL LEUSDEN, NETHERLANDS SP RTO, ISCA DE dialect; non-native; bilingual; multilingual AB In this study we present approaches to multilingual speech recognition. We first define different approaches, namely portation, cross-lingual and simultaneous multilingual speech recognition. We will show some experiments performed in the fields of multilingual speech recognition. In recent years we have ported our recognizer to other languages than German (Italian, Slovak, Slovenian, Czech, English, Japanese). We found that some languages achieve a higher recognition performance with comparable tasks, and are thus easier for automatic speech recognition than others. Furthermore, we present experiments which show the performance of cross-lingual speech recognition of an untrained language with a recognizer trained with other languages. The substitution of phones is important for cross-lingual and simultaneous multilingual recognition. We compared results in cross-lingual recognition for different baseline systems and found that the number of shared acoustic units is very important for the performance. With simultaneous multilingual recognition, performance usually decreases compared to monolingual recognition. In few cases, like in the case of non-native speech, however, the recognition can be improved. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Bavarian Res Ctr Knowledge Based Syst, FORWISS, Res Grp Knowledge Proc, D-91058 Erlangen, Germany. RP Uebler, U (reprint author), Bavarian Res Ctr Knowledge Based Syst, FORWISS, Res Grp Knowledge Proc, Weichselgarten 7, D-91058 Erlangen, Germany. 
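The three preceding records (Kohler; Schultz and Waibel; Uebler) all hinge on measuring how similar language-dependent phone models are, so that acoustic units can be shared or mapped across languages. One common distance between two Gaussian densities is the Bhattacharyya distance; the sketch below handles the diagonal-covariance case and is offered only as an illustration of the idea (Kohler's agglomerative clustering operates at the mixture-density level, and the exact metric may differ):

import math

def bhattacharyya(mu1, var1, mu2, var2):
    # Bhattacharyya distance between two diagonal-covariance Gaussians;
    # small values flag phone models from different languages as
    # candidates for a shared multilingual unit.
    d = 0.0
    for m1, v1, m2, v2 in zip(mu1, var1, mu2, var2):
        v = 0.5 * (v1 + v2)
        d += 0.125 * (m1 - m2) ** 2 / v + 0.5 * math.log(v / math.sqrt(v1 * v2))
    return d

# Invented 2-dimensional models for two vowels from different languages.
print(bhattacharyya([1.0, 0.2], [0.5, 0.4], [1.1, 0.3], [0.6, 0.4]))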
EM ulla.uebler@eed.ericsson.se CR ACKERMANN U, 1996, 3 CRIM FORW WORKSH M ACKERMANN U, 1996, P INT C SPOK LANG PR BARNETT J, 1996, P INT C SPOK LANG PR BONAVENTURA P, 1997, P EUR C SPEECH COMM, V1, P355 BUB U, 1997, P ICASSP97 MUN, V2, P1451 CERFDANON H, 1991, P EUR C SPEECH COMM, V1, P183 DALSGAARD P, 1991, P INT C AC SPEECH SI, P197, DOI 10.1109/ICASSP.1991.150311 DALSGAARD P, 1998, P INT C SPOK LANG PR, V6, P2627 Delattre P, 1965, COMP PHONETIC FEATUR DENG L, 1997, P IEEE INT C AC SPEE, V2, P1007 FABIAN P, 1997, P 2 SQEL WORKSH MULT, P96 GLASS J, 1995, SPEECH COMMUN, V17, P1, DOI 10.1016/0167-6393(95)00008-C HARBECK S, 1997, P 2 SQEL WORKSH MULT, P9 IPSIC I, 1996, 3 SLOV GERM 2 SDRV W, P87 JO CH, 1998, P INT C SPOK LANG PR, V6, P2639 KAWAI G, 1998, P INT C SPOK LANG PR, V5, P1823 KLECKOVA J, 1997, P 2 SQEL WORKSH MULT Kohler J., 1996, P INT C SPOK LANG PR KROKAVEC D, 1997, P 2 SQEL WORKSH MULT KUHN T, 1995, INFIX ST AUGUSTIN, V80 Ladefoged P., 1996, SOUNDS WORLDS LANGUA LAMEL L, 1995, P EUR C SPEECH COMM NOTH E, 1996, SPEECH IMAGE UNDERST, P59 SCHUKATTALAMAZZ.EG, 1995, GRUNDLAGEN STAT MODE Waibel A, 1998, P DARPA BROADC NEWS WARD T, 1998, P INT C SPOK LANG PR, V5, P2243 WENG F, 1997, P EUR C SPEECH COMM, V1, P359 WHEATLEY B, 1994, P ICASSP, P237 Young SJ, 1997, COMPUT SPEECH LANG, V11, P73, DOI 10.1006/csla.1996.0023 NR 29 TC 21 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2001 VL 35 IS 1-2 BP 53 EP 69 DI 10.1016/S0167-6393(00)00095-9 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 453ZM UT WOS:000169948000005 ER PT J AU Van Compernolle, D AF Van Compernolle, D TI Recognizing speech of goats, wolves, sheep and ... non-natives SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Multi-Lingual Interoperability in Speech Technology CY 1999 CL LEUSDEN, NETHERLANDS SP RTO, ISCA DE speech recognition; non-natives; accents; acoustic-phonetics AB This paper reviews the current understanding of acoustic-phonetic issues and the problems arising when trying to recognize speech from non-native speakers. Conceptually, regional accents are well modeled by systematic shifts in pronunciation. Therefore, simultaneous recognition of multiple regional variants may be performed by using multiple acoustic models in parallel, or by adding pronunciation variants in the dictionary. Recognition of non-native speech is much more difficult because it is influenced both by the native language of the speaker and the non-native target language. It is characterized by a much greater speaker variability due to different levels of proficiency. A few language-pair specific transformation rules describing prototypical nativized pronunciations were found to be useful both in general speech recognition and in dedicated applications. However, due to the nature of the errors and the cross-language transformations, non-native speech recognition will remain inherently much harder. Moreover, the trend in speech recognition towards more detailed modeling seems to be counterproductive for the recognition of non-native speech and limits progress in this field. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Katholieke Univ Leuven, PSI, ESAT, B-3001 Heverlee, Belgium. RP Van Compernolle, D (reprint author), Katholieke Univ Leuven, PSI, ESAT, Kasteelpk Arenberg 10, B-3001 Heverlee, Belgium.
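The Van Compernolle record above argues that a few language-pair specific transformation rules capture prototypical nativized pronunciations. A sketch of how such rules could expand a recognition lexicon with variants; the rules and phone symbols are invented examples, not the paper's:

def nativize(phones, rules):
    # Apply phone-substitution rules to one dictionary pronunciation,
    # yielding a nativized variant to be added to the lexicon.
    out = list(phones)
    for target, replacement in rules:
        out = [replacement if p == target else p for p in out]
    return out

# Invented Dutch-accented-English rules: dental fricatives become
# stops, /z/ devoices (applied in all contexts for simplicity).
rules = [("th", "t"), ("dh", "d"), ("z", "s")]
print(nativize(["dh", "ax", "d", "ao", "g", "z"], rules))  # 'the dogs'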
EM dirk.vancompernolle@esat.kuleuven.ac.be CR ADDADECKER M, 1998, P ESCA WORKSH MOD PR, P1 BARY WJ, 1989, COMPUTER SPEECH LANG, V3, P355 Beattie V, 1995, P EUR 95, P1123 Bonaventura P., 1998, P ESCA WORKSH MOD PR, P17 BONAVENTURA P, 1997, P EUR RHOD GREEC, P355 COHEN M, 1989, THESIS UC BERKELEY CREMELIE N, 1989, P ESCA WORKSH MOD PR, P23 CUCCHIARINI C, 1998, P ICSLP98, P751 DRAXLER C, 1997, P EUR 97, P747 *ESCA ETRW, 1998, ESCA ETRW WORKSH STI FOX RA, 1995, JASA, V97, P2511 Hieronymus J.-L., 1993, J INT PHONETIC ASS KAWAI G, 1998, P ICSLP98, P782 Kohler J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607240 Neumeyer L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607890 RILEY M, 1998, P ESCA WORKSH MOD PR, P108 SCHULTZ T, 1998, P ICSLP98 TRANCOSO L, 1999, P EUR 99, P195 UEBLER U, 1999, P EUR 99, P903 Van Compernolle D., 1991, P EUR 91, P723 *VODIS, ADV SPEECH TECHN VOI WITT S, 1999, P EUR 99, P1367 ZAVALIAKOS G, 1996, P ICASSP, P725 NR 23 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2001 VL 35 IS 1-2 BP 71 EP 79 DI 10.1016/S0167-6393(00)00096-0 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 453ZM UT WOS:000169948000006 ER PT J AU Eklund, R Lindstrom, A AF Eklund, R Lindstrom, A TI Xenophones: An investigation of phone set expansion in Swedish and implications for speech recognition and speech synthesis SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Multi-Lingual Interoperability in Speech Technology CY 1999 CL LEUSDEN, NETHERLANDS SP RTO, ISCA DE xenophones; phonetic expansion; automatic speech recognition (ASR); text-to-speech conversion (TTS); second language acquisition (SLA); multi-linguality; phonology AB In recent years, both automatic speech recognition (ASR) and text-to-speech (TTS) conversion systems have attained quality levels that allow inclusion in everyday applications. One remaining problem to be solved in both these types of applications is that alleged phone inventories of specific languages are commonly expanded with phones from other languages, a problem that becomes more acute in an increasingly internationalized world where multilingual automatic speech-based services are a desideratum. This paper investigates the nature of phone set expansion in Swedish. The status of these phones is discussed, and since such added phones do not have a phonemic (or allophonic) function, the term 'xenophones' is suggested. The analysis is based on a production study involving 491 subjects, and the observed xenophonic expansion is described in terms of three categories along the "awareness" and the "fidelity" dimensions. The results show that very few subjects resort to full rephonematization and that xenophonic expansion is the rule, although there is an uneven distribution depending on particular phones, spanning from phones produced by most subjects, to phones produced by almost no subjects. Of the possible explanatory factors analyzed - regional background, gender, age and educational level - the latter is by far the most important. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Telia Res AB, S-12386 Farsta, Sweden. Linkoping Univ, Dept Comp & Informat Sci, NLPLab, S-58183 Linkoping, Sweden.
RP Lindstrom, A (reprint author), Telia Res AB, Vitsandsgatan 9, S-12386 Farsta, Sweden. EM robert.h.eklund@telia.se; anders.p.lindstrom@telia.se CR ABELIN A, 1985, 21 GOTH U DEP COMP L BAYARD D, 2000, P FON 2000 SWED PHON, P33 BILLI R, J RECORD HUMAN LANGU CARLSMITH RS, 1990, PROCEEDINGS OF THE 25TH INTERSOCIETY ENERGY CONVERSION ENGINEERING CONFERENCE, VOLS 1-6, P74 CARLSON R, 1989, P 1 EUR C SPEECH COM, V1, P113 DAHLSTEDT KH, 1969, FRAMMANDE ORD SVENSK DIGALAKIS V, 2000, SPOKEN LANGUAGE TRAN, P274 EKLUND R, 1998, P ICSLP 98 SYDN 30 N, V7, P2831 EKLUND R, 2000, SPOKEN LANGUAGE TRAN, P265 Eklund R., 1996, P FONETIK 96 SWED PH, V2, P123 ELERT CC, 1994, KULTURGRANSER MYT EL, P215 FITT SE, 1998, THESIS U EINDBURGH FLEGE JE, 1987, SOUND PATTERNS 2 LAN, P9 GUSTAFSON J, 1994, 8 SWED PHON C LUND U, V43, P66 GUSTAFSON J, 1996, THESIS TMH STOCKHOLM Gustafson J., 1995, P ICPHS 95 STOCK SWE, V2, P318 HAMMARBERG B, 1990, NEW SOUNDS 90, P198 Hemphill C., 1990, P DARPA SPEECH NAT L, P96, DOI 10.3115/116580.116613 KIRKNESS A, 1976, PROBLEME LEXIKOLOGIE, V39, P226 LINDSTROM A, 1999, P FON 99 SWED PHON C, V81, P109 LINDSTROM A, 1999, P ICPHS 99 SAN FRANC, V3, P2227 LINDSTROM A, 1999, RTO M P 28 MULT INT, P15 LINDSTROM A, 2000, P ICSLP 00 BEIJ 16 2, V1, P54 LIPSKI JM, 1976, AM SPEECH, V51, P109, DOI 10.2307/455361 MADDIESON I, 1984, PATTERNS SOUNDS, P162 MOBARG M, 1997, MAJOR VARIEITES ENGL, P249 MOBIUS B, 1997, P ESCA EUR 97 RHOD G, V5, P2443 MULLER W, 1976, PROBLEME LEXIKOLOGIE, V39, P211 *ON CONS, 1995, P ESCA EUR 95 MADR S, P829 Rayner M., 2000, SPOKEN LANGUAGE TRAN TRANCOSO I, 1999, P ESCA EUR 98 BUD HU, V1, P195 VIRTANEN T, 1998, MAJOR VARIEITES ENGL, P273 WENG F, 2000, SPOKEN LANGUAGE TRAN, P250 WENG F, 1997, P EUR C SPEECH COMM, V1, P359 NR 34 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2001 VL 35 IS 1-2 BP 81 EP 102 DI 10.1016/S0167-6393(00)00097-2 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 453ZM UT WOS:000169948000007 ER PT J AU van Wijngaarden, SJ AF van Wijngaarden, SJ TI Intelligibility of native and non-native Dutch speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Multi-Lingual Interoperability in Speech Technology CY 1999 CL LEUSDEN, NETHERLANDS SP RTO, ISCA ID PERCEPTION; ENGLISH; CONSONANTS; EXPERIENCE; LISTENERS; SPEAKERS AB The intelligibility of speech is known to be lower if the speaker is non-native instead of native for the given language. This study is aimed at quantifying the overall degradation due to limitations of non-native speakers of Dutch, specifically of Dutch-speaking Americans who have lived in the Netherlands 1-3 years. Experiments were focused on phoneme intelligibility and sentence intelligibility, using additive noise as a means of degrading the intelligibility of speech utterances for test purposes. The overall difference in sentence intelligibility between native Dutch speakers and American speakers of Dutch, using native Dutch listeners, was found to correspond to a difference in speech-to-noise ratio (SNR) of approximately 3 dB. The main segmental contribution to the degradation of speech intelligibility by introducing non-native speakers and/or listeners is the confusion of vowels, especially those that do not occur in American English. 
Vowels that are difficult for second-language speakers to produce are also difficult for second-language listeners to classify; such vowels attract false recognition, reducing the overall recognition rate for all vowels. (C) 2001 Elsevier Science B.V. All rights reserved. C1 TNO, NL-3769 ZG Soesterberg, Netherlands. RP van Wijngaarden, SJ (reprint author), TNO, POB 23, NL-3769 ZG Soesterberg, Netherlands. EM vanwijngaarden@tm.tno.nl CR Bergman M., 1980, AGING PERCEPTION SPE BOERSMA P, 1999, PRAAT 3 8 12 SYSTEM Bradlow AR, 1999, J ACOUST SOC AM, V106, P2074, DOI 10.1121/1.427952 BUUS S, 1986, P INT 86, P895 Flege JE, 1997, J PHONETICS, V25, P437, DOI 10.1006/jpho.1997.0052 Florentine M., 1985, P INT 85, P1021 GAT IB, 1978, AUDIOLOGY, V17, P339 Mayo LH, 1997, J SPEECH LANG HEAR R, V40, P686 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 NABELEK AK, 1984, J ACOUST SOC AM, V75, P632 PLOMP R, 1979, AUDIOLOGY, V18, P43 Pols LCW, 1977, THESIS FREE U AMSTER STEENEKEN HJM, 1992, THESIS U AMSTERDAM NR 13 TC 40 Z9 40 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2001 VL 35 IS 1-2 BP 103 EP 113 DI 10.1016/S0167-6393(00)00098-4 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 453ZM UT WOS:000169948000008 ER PT J AU Zissman, MA Berkling, KM AF Zissman, MA Berkling, KM TI Automatic language identification SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Multi-Lingual Interoperability in Speech Technology CY 1999 CL LEUSDEN, NETHERLANDS SP RTO, ISCA ID SPEECH; TEXT AB Automatic language identification of speech is the process by which the language of a digitized speech utterance is recognized by a computer. In this paper, we will describe the set of available cues for language identification of speech and discuss the different approaches to building working systems. This overview includes a range of historical approaches, contemporary systems that have been evaluated on standard databases, and promising future approaches. Comparative results are also reported. (C) 2001 Elsevier Science B.V. All rights reserved. C1 MIT, Lincoln Lab, Lexington, MA 02420 USA. RP Berkling, KM (reprint author), Buckhauser Str 3, CH-8048 Zurich, Switzerland. 
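A number of the historical approaches surveyed in the language identification record above score a decoded phone-token stream against per-language n-gram models (phone recognition followed by language modelling, as compared in the Zissman (1996) reference cited below). A minimal sketch of that scoring step in Python; the phone strings, language names and smoothing constant are illustrative assumptions, not material from the record:

import math
from collections import Counter

def train_bigram(token_seqs, alpha=0.5):
    # Add-alpha smoothed bigram model over phone tokens.
    bigrams, histories, vocab = Counter(), Counter(), set()
    for seq in token_seqs:
        padded = ["<s>"] + list(seq)
        vocab.update(padded)
        for a, b in zip(padded, padded[1:]):
            bigrams[(a, b)] += 1
            histories[a] += 1
    V = len(vocab) + 1  # reserve probability mass for unseen tokens
    return lambda a, b: math.log((bigrams[(a, b)] + alpha) / (histories[a] + alpha * V))

def log_likelihood(model, seq):
    padded = ["<s>"] + list(seq)
    return sum(model(a, b) for a, b in zip(padded, padded[1:]))

# Hypothetical decoded phone strings per language (illustrative only).
train = {"lang_A": [["dh", "ax", "k", "ae", "t"], ["s", "ih", "t", "s"]],
         "lang_B": [["e", "l", "g", "a", "t", "o"], ["k", "o", "m", "e"]]}
models = {lang: train_bigram(seqs) for lang, seqs in train.items()}

utterance = ["e", "l", "p", "a", "t", "o"]  # unseen test token stream
print(max(models, key=lambda lang: log_likelihood(models[lang], utterance)))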
EM maz@ll.mit.edu CR BERKLING KM, 1995, EUROSPEECH, V1, P351 BERKLING KM, 1994, INT C AC SPEECH SIGN, V1, P289 BRAUN J, 1998, INT C SPOK LANG PROC, V7, P3201 CIMARUSTI D, 1982, INT C AC SPEECH SIGN, P1661 Comrie B., 1990, WORLDS MAJOR LANGUAG Crystal David, 1987, CAMBRIDGE ENCY LANGU DALSGAARD P, 1992, INT C SPOK LANG PROC, P547 DAMASHEK M, 1995, SCIENCE, V267, P843, DOI 10.1126/science.267.5199.843 FOIL JT, 1986, INT C AC SPEECH SIGN, V2, P861 Fromkin V., 1993, INTRO LANGUAGE GOODMAN FJ, 1989, INT C AC SPEECH SIGN, V1, P528 HAZEN TJ, 1993, THESIS MIT HAZEN TJ, 1993, EUROSPEECH, V2, P1303 HIERONYMUS JL, 1997, INT C AC SPEECH SIGN, V2, P1111 HOUSE AS, 1977, J ACOUST SOC AM, V62, P708, DOI 10.1121/1.381582 ITAHASHI S, 1995, EUROSPEECH, V2, P1359 ITAHASHI S, 1994, INT C SPOK LANG PROC, V4, P1899 KADAMBE S, 1995, INT C AC SPEECH SIGN, V5, P3507 KIMBRELL RE, 1988, BYTE, V13, P297 KOEHLER J, 1998, INT C AC SPEECH SIGN, V1, P417 KOEHLER J, 1997, INT C AC SPEECH SIGN, V2, P1451 KWAN HK, 1997, EUROSPEECH, V1, P63 LAMEL LF, 1993, INT C AC SPEECH SIGN, V2, P507, DOI 10.1109/ICASSP.1993.319353 LI KP, 1994, INT C AC SPEECH SIGN, V1, P297 LUND MA, 1995, EUROSPEECH, V2, P1363 LUND MA, 1996, INT C AC SPEECH SIGN, V2, P793 MAH C. P., 1985, P 8 INT ACM C RES DE, P155, DOI 10.1145/253495.253521 MATROUF D, 1998, INT C SPOK LANG PROC, V2, P181 MENDOZA S, 1996, INT C AC SPEECH SIGN, V2, P785 MORI K, 1999, EUROSPEECH MUTHUSAMY Y, 1994, INT C AC SPEECH SIGN, V1, P333 MUTHUSAMY Y, 1993, EUROSPEECH, V2, P1307 Muthusamy Y. K., 1992, INT C SPOK LANG PROC, V2, P895 Muthusamy YK, 1994, IEEE SIGNAL PROC MAG, V11, P33, DOI 10.1109/79.317925 NAKAGAWA S, 1992, INT C SPOK LANG PROC, V2, P1011 NAKAGAWA S, 1994, ELECTRON COMM JPN 3, V77, P70, DOI 10.1002/ecjc.4430770607 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 RAMESH P, 1994, INT C SPOK LANG PROC, V4, P1887 RIEK L, 1991, SPCOT91002 LOCKH SAN Savic M., 1991, INT C AC SPEECH SIGN, V2, P817 SCHMITT JC, 1991, Patent No. 5062143 SCHULTZ T, 1996, INT C AC SPEECH SIGN, V2, P781 SCHULTZ T, 1998, INT C SPOK LANG PROC, V5, P1819 SUGIYAMA M, 1991, INT C AC SPEECH SIGN, V2, P813 THOMAS HL, 1998, INT C SPOK LANG PROC, V2, P169 Thyme-Gobbel A. E., 1996, INT C SPOK LANG PROC, V3, P1768 WHEATLY B, 1994, INT C AC SPEECH SIGN, V1, P237 Yan Y, 1995, ICASSP, V5, P3511 Zissman M. A., 1995, IEEE INT C AC SPEECH, V5, P3503 ZISSMAN MA, 1994, INT C AC SPEECH SIGN, V1, P305 ZISSMAN MA, 1993, INT C AC SPEECH SIGN, V2, P399, DOI 10.1109/ICASSP.1993.319323 Zissman MA, 1996, IEEE T SPEECH AUDI P, V4, P31, DOI 10.1109/TSA.1996.481450 NR 52 TC 40 Z9 42 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2001 VL 35 IS 1-2 BP 115 EP 124 DI 10.1016/S0167-6393(00)00099-6 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 453ZM UT WOS:000169948000009 ER PT J AU Berkling, K AF Berkling, K TI SCoPE, syllable core and periphery evaluation: Automatic syllabification and foreign accent identification SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Multi-Lingual Interoperability in Speech Technology CY 1999 CL LEUSDEN, NETHERLANDS SP RTO, ISCA DE foreign accent identification; syllabification; linguistic knowledge AB In this paper we apply a study of the structure of the English language towards an automatic syllabification algorithm and consequently an automatic foreign accent identification system. 
Any word consists of syllables, which can in turn be divided into their constituents. Elements within the syllable structure are defined according to both their position within the syllable and the position of the syllable within the word structure. Elements of syllable structure that only occur at morpheme boundaries or that extend for the duration of morphemes are identified as peripheral elements; those that can occur anywhere with regard to word morphology are identified as core elements. All languages potentially make a distinction between core and peripheral elements of their syllable structure; however, the specific forms these structures take will vary from language to language. In addition to problems posed by differences in phoneme inventories (a detailed analysis of comparative phoneme inventories across the languages treated here is outside the scope of this paper), we expect speakers with the greatest syllable structural differences between native and foreign language to have the greatest difficulty with pronunciation in the foreign language. In this paper, we will analyze two accents of Australian English: Arabic, whose core/periphery structure is similar to that of English, and Vietnamese, whose structure is maximally different from English. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Sydney, Dept Elect Engn, Speech Technol Grp, Sydney, NSW 2006, Australia. RP Berkling, K (reprint author), Buckhauser Str 3, CH-8048 Zurich, Switzerland. EM kay@berkling.com CR BERKLING K, 1998, INT C SPOK LANG PROC, V2 CLEIRIGH C, 1998, THESIS SYDNEY U CLEIRIGH C, 1994, INT C SPOK LANG PROC, V1 Goldsmith J., 1990, AUTOSEGMENTAL METRIC HANSEN J, 1995, IEEE INT C AC SPEECH, V1 KAHN D, 1980, THESIS MIT KUMPF K, 1997, EUROSPEECH, V4 Lewis Wendy, 1995, J QUANT LINGUIST, V2, P177, DOI 10.1080/09296179508590049 MIXDORFF H, 1996, INT C SPOK LANG PROC, V2 OSTENDORF M, 1996, SUMM WORKSH SPEECH R TEIXEIRA C, 1997, EUROSPEECH, V4 NR 11 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2001 VL 35 IS 1-2 BP 125 EP 138 DI 10.1016/S0167-6393(00)00100-X PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 453ZM UT WOS:000169948000010 ER PT J AU Rose, RC Yao, H Riccardi, G Wright, J AF Rose, RC Yao, H Riccardi, G Wright, J TI Integration of utterance verification with statistical language modeling and spoken language understanding SO SPEECH COMMUNICATION LA English DT Article ID RECOGNITION AB Methods for utterance verification (UV) and their integration into statistical language modeling and understanding formalisms for a large vocabulary spoken understanding system are presented. The paper consists of three parts. First, a set of acoustic likelihood ratio (LR) based UV techniques is described and applied to the problem of rejecting portions of a hypothesized word string that may have been incorrectly decoded by a large vocabulary continuous speech recognizer. Second, a procedure for integrating the acoustic level confidence measures with the statistical language model is described. Finally, the effect of integrating acoustic level confidence into the spoken language understanding unit (SLU) in a call-type classification task is discussed. These techniques were evaluated on utterances collected from a highly unconstrained call routing task performed over the telephone network. 
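The acoustic likelihood-ratio rejection described in the utterance verification abstract above amounts, per hypothesized word, to a thresholded log likelihood ratio between the target word model and an alternative (filler) model. A minimal sketch under an assumed duration normalisation; all scores and the threshold are hypothetical:

def accept_word(log_lik_target, log_lik_alternative, n_frames, threshold=1.0):
    # Duration-normalised log likelihood ratio test for one hypothesized word.
    llr = (log_lik_target - log_lik_alternative) / n_frames
    return llr >= threshold

# Hypothetical acoustic log likelihoods for two decoded words (30 frames each).
print(accept_word(-420.0, -455.0, 30))  # True: well supported, keep the word
print(accept_word(-430.0, -428.0, 30))  # False: reject as likely misrecognised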
They have been evaluated in terms of their ability to classify utterances into a set of 15 call-types that are accepted by the application. (C) 2001 Elsevier Science B.V. All rights reserved. C1 AT&T Labs Res, Speech & Image Proc Lab, Florham Pk, NJ 07932 USA. RP Rose, RC (reprint author), AT&T Labs Res, Speech & Image Proc Lab, 180 Pk Ave,Room D129, Florham Pk, NJ 07932 USA. EM rose@research.att.com; s_yao@research.att.com; dsp3@research.att.com; jwright@research.att.com RI riccardi, gabriele/A-9269-2012 CR Berstel J., 1979, TRANSDUCTIONS CONTEX COX S, 1996, P INT C AC SPEECH SI, P511 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X LLEIDA E, 1996, P INT C AC SPEECH SI, P507 Lleida E, 2000, IEEE T SPEECH AUDI P, V8, P126, DOI 10.1109/89.824697 NETI CV, 1997, P INT C AC SPEECH SI, P883 Rahim M, 1996, P IEEE INT C AC SPEE, P3585 Riccardi G., 1997, P INT C AC SPEECH SI, P1143 Riccardi G, 1996, COMPUT SPEECH LANG, V10, P265, DOI 10.1006/csla.1996.0014 ROSE RC, 1995, P INT C AC SPEECH SI, P281 ROSE RC, 1999, P EUR C SPEECH COMM, P303 WILPON JG, 1990, IEEE T ACOUST SPEECH, V38, P1870, DOI 10.1109/29.103088 WRIGHT JH, 1997, P 5 EUR C SPEECH COM, P1419 YOUNG SR, 1993, P EUR C SPEECH COMM, P1177 NR 14 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2001 VL 34 IS 4 BP 321 EP 331 DI 10.1016/S0167-6393(00)00040-6 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 442GL UT WOS:000169276800001 ER PT J AU Lopez-Soler, JM Sanchez, V de la Torre, A Rubio-Ayuso, AJ AF Lopez-Soler, JM Sanchez, V de la Torre, A Rubio-Ayuso, AJ TI Linear inter-frame dependencies for very low bit-rate speech coding SO SPEECH COMMUNICATION LA English DT Article DE speech coding; quantization; interpolation; inter-frame dependencies ID TRELLIS-CODED MODULATION; REDUNDANT SIGNAL SETS; QUANTIZER DESIGN; LPC; ALGORITHM; LSP AB We have studied experimentally the operational rate-distortion performance for very low bit-rate speech coding using linear inter-frame dependencies. We propose an algorithm that efficiently combines quantization and linear interpolation procedures. With a maximum delay of 200 ms, for the spectral envelope information, using line spectrum pair (LSP) parameters as the input space, the proposed algorithm performs best at rates between 200 and 300 b/s. For comparison's sake, several other procedures, such as the multi-frame encoder (Kemp D., Collura J., Tremain T., Multi-Frame Coding of LPC Parameters at 600-800 bps. In: IEEE ICASSP-91, 1991, pp. 609-612) and the matrix quantizer (Tsao C., Gray R., Matrix quantizer design for LPC speech using the generalized Lloyd algorithm. IEEE Transactions on Acoustics, Speech, and Signal Processing ASSP-33, 1985, 537-545), are simulated. Furthermore, a mono-dimensional version of the proposed procedure is shown experimentally to provide the best operational rate-distortion trade-off when coding a parametric representation (pitch, gain and voicing information) of the excitation signal. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Granada, Fac Ciencias, Dept Elect & Tecnol Comp, E-18071 Granada, Spain. RP Lopez-Soler, JM (reprint author), Univ Granada, Fac Ciencias, Dept Elect & Tecnol Comp, E-18071 Granada, Spain. 
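The combination of quantization and linear interpolation proposed in the speech-coding abstract above can be illustrated with a toy decoder that reconstructs untransmitted LSP frames by linear interpolation between quantised keyframes. The frame spacing, quantiser step and LSP trajectories below are invented for illustration, not the paper's configuration:

import numpy as np

def decode_lsp(keyframes, key_idx, n_frames):
    # Reconstruct untransmitted frames by linear interpolation per LSP dimension.
    out = np.empty((n_frames, keyframes.shape[1]))
    for d in range(keyframes.shape[1]):
        out[:, d] = np.interp(np.arange(n_frames), key_idx, keyframes[:, d])
    return out

rng = np.random.default_rng(0)
lsp = np.linspace(0.1, 3.0, 10) + np.cumsum(rng.normal(0, 0.01, (20, 10)), axis=0)
key_idx = np.arange(0, 20, 5)                     # transmit every 5th frame
quantised = np.round(lsp[key_idx] / 0.05) * 0.05  # crude scalar quantiser stand-in
decoded = decode_lsp(quantised, key_idx, 20)
print("mean squared LSP distortion:", np.mean((decoded - lsp) ** 2))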
EM juanma@hal.ugr.es RI de la Torre, Angel/C-6618-2012; Sanchez , Victoria /C-2411-2012; Prieto, Ignacio/B-5361-2013; Lopez-Soler, Juan/C-2437-2012 OI Lopez-Soler, Juan/0000-0003-4572-2237 CR Atal B. S, 1983, ICASSP, P81, DOI 10.1109/ICASSP.1983.1172248 Cheng YM, 1993, IEEE T SPEECH AUDI P, V1, P207, DOI 10.1109/89.222879 CUPERMAN V, 1985, IEEE T COMMUN, V33, P685, DOI 10.1109/TCOM.1985.1096372 Gersho A., 1992, VECTOR QUANTIZATION GRAY RM, 1980, IEEE T ACOUST SPEECH, V28, P367, DOI 10.1109/TASSP.1980.1163421 HONDA M, 1992, ADV SPEECH SIGNAL PR, P209 Jayant N. S., 1984, DIGITAL CODING WAVEF JUANG BH, 1984, AT&T TECH J, V63, P1477 KANG G, 1985, 8857 NRL KEMP D, 1987, NSA LPC 10 VERSION 5 KEMP DP, 1991, INT CONF ACOUST SPEE, P609, DOI 10.1109/ICASSP.1991.150413 KITAWAKI N, 1991, ADV SPEECH SIGNAL PR, P357 KNAGENHJELM HP, 1995, IEEE ICASSP 95, V1, P732 LAROIA R, 1991, INT CONF ACOUST SPEE, P641, DOI 10.1109/ICASSP.1991.150421 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 LIU YJ, 1989, IEEE ICASSP 89, V89, P204 LOPEZSOLER JM, 1993, IEEE ICASSP 93, V2, P21, DOI 10.1109/ICASSP.1993.319218 LOPEZSOLER JM, 1995, THESIS U GRANADA MARCELLIN MW, 1990, IEEE T COMMUN, V38, P82, DOI 10.1109/26.46532 PALIWAL KK, 1991, INT CONF ACOUST SPEE, P661, DOI 10.1109/ICASSP.1991.150426 PICONE J, 1987, IEEE ICASSP 87, P1653 ROTHWEILER J, 1985, IEEE ICASSP 85, V82, P248 ROUCOS S, 1982, IEEE GLOBECOM, V82, P1074 ROUCOS S, 1983, IEEE ICASSP 83, P61 SANCHEZ V, 1995, IEEE T SIGNAL PROCES, V43, P2631, DOI 10.1109/78.482113 SCHWARTZ R, 1983, IEEE ICASSP 83, P69 SHIRAKI Y, 1988, IEEE T ACOUST SPEECH, V36, P1437, DOI 10.1109/29.90372 SHORE JE, 1983, IEEE T INFORM THEORY, V29, P473, DOI 10.1109/TIT.1983.1056716 SONG FK, 1984, IEEE ICASSP 84, P1 SUGAMURA N, 1986, SPEECH COMMUN, V5, P199, DOI 10.1016/0167-6393(86)90008-7 SUGAMURA N, 1988, IEEE J SEL AREA COMM, V6, P432, DOI 10.1109/49.618 *TIMIT, 1990, DARPA TIMIT ACOUSTIC TREMAIN T, 1982, GOVT STANDARD LINEAR TSAO C, 1985, IEEE T ACOUST SPEECH, V33, P537 UNGERBOECK G, 1987, IEEE COMMUN MAG, V25, P5, DOI 10.1109/MCOM.1987.1093542 UNGERBOECK G, 1987, IEEE COMMUN MAG, V25, P12, DOI 10.1109/MCOM.1987.1093541 VILGUS AM, 1983, IEEE ICASSP 83, P77 WELCH V, 1993, IEEE WORKSH SPEECH C, P41 WONG DY, 1982, IEEE T ACOUST SPEECH, V30, P770, DOI 10.1109/TASSP.1982.1163960 WONG DY, 1983, IEEE P ICASSP 83, P65 Yong M., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), DOI 10.1109/ICASSP.1988.196602 NR 41 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2001 VL 34 IS 4 BP 333 EP 349 DI 10.1016/S0167-6393(00)00039-X PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 442GL UT WOS:000169276800002 ER PT J AU Logan, B Robinson, T AF Logan, B Robinson, T TI Adaptive model-based speech enhancement SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; autoregressive hidden Markov models; robust speech recognition ID HIDDEN MARKOV-MODELS; NOISY SPEECH; RECOGNITION; SIGNALS AB We investigate the enhancement of speech corrupted by unknown independent additive noise when only a single microphone is available. We present adaptive enhancement systems based on an existing non-adaptive technique [Ephraim, Y., 1992. IEEE Transactions on Signal Processing 40 (4), 725-735]. 
This approach models the speech and noise statistics using autoregressive hidden Markov models (AR-HMMs). We develop two main extensions. The first estimates the noise statistics from detected pauses. The second forms maximum likelihood (ML) estimates of the unknown noise parameters using the whole utterance. Both techniques operate within the AR-HMM framework. We have previously shown that the ability of AR-HMMs to model speech can be improved by the incorporation of perceptual frequency using the bilinear transform. We incorporate this improvement into our enhancement systems. We evaluate our techniques on the NOISEX-92 and Resource Management (RM) databases, giving indications of performance on simple and more complex tasks, respectively. Both enhancement schemes proposed are able to improve substantially on baseline results. The technique of forming ML estimates of the noise parameters is found to be the most effective. Its performance is evaluated over a wide range of noise conditions ranging from -6 to 18 dB and on various types of stationary real-world noises. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Compaq Comp Corp, Cambridge Res Lab, Cambridge, MA 02142 USA. Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Logan, B (reprint author), Compaq Comp Corp, Cambridge Res Lab, 1 Cambridge Ctr, Cambridge, MA 02142 USA. EM beth.logan@compaq.com; ajr@eng.cam.ac.uk CR AFIFY M, 1997, P IEEE INT C AC SPEE BAUM LE, 1970, ANN MATH STAT, V41, P164, DOI 10.1214/aoms/1177697196 Deller Jr. J. R., 1993, DISCRETETIME PROCESS EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P1303, DOI 10.1109/78.139237 EPHRAIM Y, 1989, IEEE T ACOUST SPEECH, V37, P1846, DOI 10.1109/29.45532 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 Gannot S, 1998, IEEE T SPEECH AUDI P, V6, P373, DOI 10.1109/89.701367 Gillick L., 1989, P ICASSP, P532 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J GRAY RM, 1980, IEEE T ACOUST SPEECH, V28, P367, DOI 10.1109/TASSP.1980.1163421 JUANG BH, 1984, AT&T TECH J, V63, P1213 JUANG BH, 1985, IEEE T ACOUST SPEECH, V33, P1404 LEE BG, 1995, SIGNAL PROCESS, V46, P1, DOI 10.1016/0165-1684(95)00068-O LEE CH, 1997, ROBUST SPEECH RECOGN, P45 LEE KY, 1996, P IEEE INT C AC SPEE, P621 LOGAN BT, 1997, P 5 EUR C SPEECH COM, P2103 LOGAN BT, 1998, THESIS U CAMBRIDGE LOGAN BT, 1997, P IEEE ICASSP 97, P843 LOGAN BT, 1996, P 6 AUSTR INT C SPEE, P85 LOGAN BT, 1998, P INT C SPOK LANG PR MCKINLEY BL, 1997, P ICASSP 97, P1179 MERHAV N, 1991, IEEE T SIGNAL PROCES, V39, P2111, DOI 10.1109/78.134449 MOKBEL C, 1997, P ESCA NATO TUT RES MOKBEL CE, 1995, IEEE T SPEECH AUDI P, V3, P346, DOI 10.1109/89.466660 MORENO PJ, 1995, P ICASSP, P733 OPPENHEI.AV, 1972, PR INST ELECTR ELECT, V60, P681, DOI 10.1109/PROC.1972.8727 Price P., 1988, P IEEE INT C AC SPEE, P651 Rabiner L, 1993, FUNDAMENTALS SPEECH Rose RC, 1994, IEEE T SPEECH AUDI P, V2, P245, DOI 10.1109/89.279273 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 SEYMOUR CW, 1996, THESIS U CAMBRIDGE SHEIKHZADEH H, 1995, P ICASSP, P808 SHEIKHZADEH H, 1994, P IEEE INT C AC SPEE, P113 SHIKANO K, 1985, CMUCS86108 CARN MELL STRUBE HW, 1980, J ACOUST SOC AM, V68, P1071, DOI 10.1121/1.384992 VARGA AP, 1992, NOISEX 92 STUDY EFFE YOUNG S, 1996, HTK BOOK HTK V20 Young S. J., 1993, HTK HIDDEN MARKOV MO
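The first extension above, estimating noise statistics from detected pauses, can be caricatured with an energy-based pause detector that averages the power spectra of the quietest frames. The AR-HMM machinery of the paper is deliberately omitted, so this is an illustration of the idea rather than the authors' algorithm; all signal parameters are invented:

import numpy as np

def noise_spectrum_from_pauses(frames, quantile=0.2):
    # Average the power spectra of the lowest-energy frames, taken as pauses.
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    energy = power.sum(axis=1)
    pauses = energy <= np.quantile(energy, quantile)  # crude pause detection
    return power[pauses].mean(axis=0)

rng = np.random.default_rng(1)
frames = rng.normal(0.0, 0.1, (100, 256))                   # background noise
frames[20:40] += np.sin(2 * np.pi * 0.05 * np.arange(256))  # "speech" frames
print(noise_spectrum_from_pauses(frames).shape)             # (129,) estimate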
NR 39 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2001 VL 34 IS 4 BP 351 EP 368 DI 10.1016/S0167-6393(00)00038-8 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 442GL UT WOS:000169276800003 ER PT J AU Ahmadi, S Spanias, AS AF Ahmadi, S Spanias, AS TI Low bit-rate speech coding based on an improved sinusoidal model SO SPEECH COMMUNICATION LA English DT Article DE speech coding; sinusoidal model; phase modeling; speech classification; linear prediction; frame interpolation ID DESIGN; ALGORITHM; ENHANCEMENT; DISTORTION; FIR AB This paper addresses the design, implementation and evaluation of efficient low bit-rate speech coding algorithms based on an improved sinusoidal model. A series of algorithms were developed for speech classification and pitch frequency determination, modeling of sinusoidal amplitudes and phases, and frame interpolation. An improved paradigm for sinusoidal phase coding is presented, where short-time sinusoidal phases are modeled using a combination of linear prediction, spectral sampling, linear phase alignment and all-pass phase error correction components. A class-dependent split vector quantization scheme is used to encode the sinusoidal amplitudes. The masking properties of the human auditory system are effectively exploited in the algorithms. The algorithms were successfully integrated into a 2.4 kbps sinusoidal coder. The performance of the 2.4 kbps coder was evaluated in terms of informal subjective tests such as the mean opinion score (MOS) and the diagnostic rhyme test (DRT), as well as some perceptually motivated objective distortion measures. Performance analysis on a large speech database indicates considerable improvement in short-time signal matching both in the time and the spectral domains. In addition, subjective quality of the reproduced speech is considerably improved. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Arizona State Univ, Dept Elect Engn, Ctr Telecommun Res, Tempe, AZ 85287 USA. Nokia Mobile Phones Inc, San Diego, CA 92131 USA. RP Spanias, AS (reprint author), Arizona State Univ, Dept Elect Engn, Ctr Telecommun Res, Tempe, AZ 85287 USA. EM sassan.ahmadi@nokia.com; spanias@asu.edu CR AHMADI S, 1997, THESIS ARIZONA STATE Ahmadi S, 1999, IEEE T SPEECH AUDI P, V7, P333, DOI 10.1109/89.759042 Ahmadi S, 1998, IEEE T SPEECH AUDI P, V6, P495, DOI 10.1109/89.709675 ALMEIDA LB, 1983, IEEE T ACOUST SPEECH, V31, P664, DOI 10.1109/TASSP.1983.1164128 Almeida L. B., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing ATAL BS, 1989, P INT C AC SPEECH SI, P69 CHAMPION TG, 1994, P IEEE ICASSP 94 CHEN CK, 1994, IEEE T CIRCUITS-II, V41, P346 CHEN JH, 1995, IEEE T SPEECH AUDI P, V3, P59 CROCHIERE RE, 1980, IEEE T ACOUST SPEECH, V28, P99, DOI 10.1109/TASSP.1980.1163353 Garofolo J., 1988, GETTING STARTED DARP GERSHO A, 1994, P IEEE, V82, P900, DOI 10.1109/5.286194 GRAY RM, 1980, IEEE T ACOUST SPEECH, V28, P367, DOI 10.1109/TASSP.1980.1163421 HEDELIN P, 1986, P IEEE ICASSP86 TOKY, P465 HEDELIN P, 1988, P IEEE ICASSP 88, P339 Honda M., 1990, P IEEE ICASSP 90, P213 Kleijn W. B., 1995, SPEECH CODING SYNTHE
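The sinusoidal model behind the coder described in the Ahmadi and Spanias abstract above synthesises each frame as a sum of sinusoids, s(n) = sum_k A_k cos(2 pi f_k n / f_s + phi_k). A minimal NumPy synthesis; the pitch, amplitudes and phases are invented for illustration:

import numpy as np

def synthesize_frame(amps, freqs_hz, phases, n_samples, fs=8000):
    # Sum-of-sinusoids synthesis of one speech frame.
    n = np.arange(n_samples)
    return sum(a * np.cos(2 * np.pi * f * n / fs + p)
               for a, f, p in zip(amps, freqs_hz, phases))

f0 = 120.0  # assumed pitch; three harmonics with made-up amplitudes and phases
frame = synthesize_frame([1.0, 0.6, 0.3], [f0, 2 * f0, 3 * f0],
                         [0.0, 0.4, -0.9], n_samples=160)  # 20 ms at 8 kHz
print(frame[:5])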
LIM YC, 1992, IEEE T SIGNAL PROCES, V40, P551, DOI 10.1109/78.120798 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 Markel JD, 1976, LINEAR PREDICTION SP MARQUES JS, 1997, P IEEE ICASSP 90, P17 MAX J, 1960, IRE T INFORM THEOR, V6, P7, DOI 10.1109/TIT.1960.1057548 McAulay R., 1995, SPEECH CODING SYNTHE, P121 Mcaulay R. J., 1990, P INT C AC SPEECH SI, P249 MCAULAY RJ, 1992, ADV SPEECH SIGNAL PR, P165 MCAULAY RJ, 1988, P IEEE INT C AC SPEE, P370 McLarnon E., 1978, Proceedings of the 1978 IEEE International Conference on Acoustics, Speech and Signal Processing NOLL AM, 1967, J ACOUST SOC AM, V41, P293, DOI 10.1121/1.1910339 Paliwal K.K., 1991, P INT C AC SPEECH SI, P661, DOI 10.1109/ICASSP.1991.150426 Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363 Papamichalis P.E., 1987, PRACTICAL APPROACHES RABINER LR, 1976, IEEE T ACOUST SPEECH, V24, P399, DOI 10.1109/TASSP.1976.1162846 RAMAMOORTHY V, 1988, IEEE J SEL AREA COMM, V6, P364, DOI 10.1109/49.613 SPANIAS AS, 1994, P IEEE, V82, P1541, DOI 10.1109/5.326413 STEIGLITZ K, 1981, IEEE T ACOUST SPEECH, V29, P171, DOI 10.1109/TASSP.1981.1163537 TRANCOSO IM, 1988, P IEEE ICASSP 88, P382 TRANCOSO IM, 1984, P IEEE ICASSP 84 VISWANATHAN VR, 1985, P IEEE ICASSP 85 VOIERS W, 1983, EVALUATING PROCESSED, P30 NR 39 TC 6 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2001 VL 34 IS 4 BP 369 EP 390 DI 10.1016/S0167-6393(00)00057-1 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 442GL UT WOS:000169276800004 ER PT J AU Krahmer, E Swerts, M AF Krahmer, E Swerts, M TI On the alleged existence of contrastive accents SO SPEECH COMMUNICATION LA English DT Article DE contrast; prosody; dialogue games; phonetics; perception AB Speakers may use pitch accents as pointers to new information, or as signals of a contrast relation between the accented item and a limited set of alternatives. There is no consensus in the literature on whether a separately identifiable contrastive accent exists. Some studies report that contrastive accents are more emphatic than newness accents and have a different melodic shape. In other studies, however, it is maintained that contrastiveness can only be determined by looking at how accents are distributed in an utterance. It is argued that these two contrasting views on contrastiveness can be reconciled by showing that they apply on different levels. To this end, accent patterns were obtained in a (semi-) spontaneous way via a dialogue game (Dutch) in which two participants had to describe coloured figures in consecutive turns. By varying the sequential order, target descriptions ('blue square') were collected in four contexts: no contrast (all new), contrast in the adjective, contrast in the noun, all contrast. A distributional analysis revealed that both all new and all contrast situations correspond with double accents, whereas single accents on the adjective or the noun are used when these are contrastive. Single contrastive accents on the adjective are acoustically different from newness accents in the same syntactic position. The former have the shape of a 'nuclear' accent, whereas the newness accents on the adjective are 'prenuclear'. Contrastive accents stand out as perceptually more prominent than newness accents. This difference in salience tends to disappear if the accented word is heard in isolation. 
(C) 2001 Elsevier Science B.V. All rights reserved. C1 Eindhoven Univ Technol, IPO, Ctr User Syst Interact, NL-5600 MB Eindhoven, Netherlands. Univ Instelling Antwerp, CNTS, B-2610 Antwerp, Belgium. RP Krahmer, E (reprint author), Eindhoven Univ Technol, IPO, Ctr User Syst Interact, POB 513, NL-5600 MB Eindhoven, Netherlands. EM e.j.krahmer@tue.nl; m.g.j.swerts@tue.nl RI Swerts, Marc/C-8855-2013 CR BARTELS C, 1994, P J SEM C FOC IBM WO BLOK P, 1993, THESIS U GRONINGEN BOLINGER DL, 1961, LANGUAGE, V37, P83, DOI 10.2307/411252 Bolinger D., 1986, INTONATION ITS PARTS Brown Gillian, 1980, QUESTIONS INTONATION CHAFE WL, 1974, LANGUAGE, V50, P111, DOI 10.2307/412014 Chafe W. L., 1976, SUBJECT TOPIC, P25 Couper-Kuhlen E., 1984, MODES INTERPRETATION, P137 Cruttenden A., 1986, INTONATION Cutler A, 1977, 13 REG M CHIC LING S, P104 DIMPERIO M, 1997, P ESCA WORKSH INT AT, P87 GUSSENHOVEN CARLOS, 1983, GRAMMAR SEMANTICS SE Hendriks H., 1995, P 10 AMST C ILLC AMS, P339 KEIJSPER CE, 1984, FORUM LETTEREN, V25, P20 LADD D, 1980, STRUCTURE INT MEANIN LADD DR, 1983, LANGUAGE, V59, P721, DOI 10.2307/413371 Levelt W. J., 1989, SPEAKING INTENTION A MOORE B, 1997, J AUDIO ENG SOC, V46, P224 PECHMANN T, 1984, THESIS MANNHEIM U Pierrehumbert J., 1990, INTENTIONS COMMUNICA, P342 PIWEK P, 1998, THESIS EINDHOVEN PREVOST S, 1995, THESIS U PENN Rooth Mats, 1985, THESIS U MASSACHUSET Rooth Mats, 1992, NAT LANG SEMANT, V1, P75, DOI 10.1007/BF02342617 Schmerling S., 1976, ASPECTS ENGLISH SENT STEVENS SS, 1957, PSYCHOL REV, V64, P153, DOI 10.1037/h0046162 SWERTS M, UNPUB RECONSTRUCTING SWERTS M, 1999, P 14 INT C PHON SCI VALLDUVI E, 1991, P ESCOL, V7, P295 VANDEEMTER K, 1999, FOCUS LINGUISTIC COG ZWICKER E, 1965, PSYCHOL REV, V72, P3, DOI 10.1037/h0021703 NR 31 TC 53 Z9 53 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 2001 VL 34 IS 4 BP 391 EP 405 DI 10.1016/S0167-6393(00)00058-3 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 442GL UT WOS:000169276800005 ER PT J AU Moller, S Jekosch, U Mersdorf, J Kraft, V AF Moller, S Jekosch, U Mersdorf, J Kraft, V TI Auditory assessment of synthesized speech in application scenarios: Two case studies SO SPEECH COMMUNICATION LA English DT Article DE speech synthesis; auditory assessment; speech quality; acceptability; synthesis in use ID QUALITY EVALUATION; SYSTEMS AB For assessing the synthesized speech output component in a complex application system, application-oriented evaluation methods and methodologies are needed which are not supplied by standardized test batteries so far. Many standardized tests analyze synthetic speech mainly with regard to its form (surface structure), and only to a lesser degree with regard to the meaning that is assigned to it (deep structure). In turn, in order to obtain a valid assessment focus for an application system, the functional aspect of speech (which depends on its deep structure) has to be taken into account. In the paper two case studies are presented which focus on the acceptability of the synthesis component and its constituent dimensions in different application scenarios. In the first one, synthetic speech in a car navigation and traffic information system is assessed. The second study relates to synthetic speech in a dialogue system. The assessment is limited to laboratory experiments and avoids costly field tests. 
It turns out that different dimensions contribute to overall acceptability to a variable degree, depending on the application scenario. Application-oriented testing is thus required to identify the application-specific dimensions. It is discussed which characteristics of the application have to be modeled in the assessment, and examples are given for both applications. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Ruhr Univ Bochum, Inst Commun Acoust, D-44780 Bochum, Germany. Mannesmann Mobilfunk GMBH, Syst & Serv, Dusseldorf, Germany. RP Moller, S (reprint author), Ruhr Univ Bochum, Inst Commun Acoust, D-44780 Bochum, Germany. EM moeller@ika.ruhr-uni-bochum.de CR Bappert V., 1994, Acta Acustica, V2 BELHOULA A, 1996, GEGELBASIERTES VERFA BENNETT RW, 1985, P 11 INT S HUM FACT, V3, P1 BENOIT C, 1991, P EUROSPEECH, V2, P875 Blauert J., 1997, SPATIAL HEARING PSYC BLAUERT J, 1994, FORTSCHRITTE AKUSTIK, P905 BOHM A, 1992, THESIS RUHR U D BOCH Delogu C, 1998, SPEECH COMMUN, V24, P153, DOI 10.1016/S0167-6393(98)00009-0 DELOGU C, 1991, P 2 EUR C SPEECH COM, V1, P353 *EUR TEL STAND I, 1993, 095 ETSI ETR Gibbon D, 1997, HDB STANDARDS RESOUR GLEISS N, 1992, USABILITY CONCEPTS E, P24 HOWARDJONES P, 1992, MULTILINGUAL SPEECH *INT TEL UN, 1997, ITU T DEL CONTR D, V9 JEKOSCH U, 1998, THESIS JEKOSCH U, 1994, FORTSCHRITTE AKUSTIK, P1387 JEKOSCH U, 1992, P 2 INT C SPOK LANG, V1, P205 Klaus H, 1997, ACUSTICA, V83, P124 KRAFT V, 1995, ACTA ACUST, V3, P351 MERSDORF J, 1996, ACUSTICA ACTA ACU S1, V82, P230 MOLLER S, 1999, THESIS RUHR U BOCHUM Pavlovic C. V., 1990, Journal d'Acoustique, V3 RINSCHEID A, 1994, FORTSCHRITTE AKUSTIK, P1325 Salza PL, 1996, ACUSTICA, V82, P650 SCHULTEFORTKAMP B, 1994, GERAUSCHE BEURTEILEN SILVERMAN K, 1990, P 1990 INT C SPOK LA, V2, P981 VANBEZOOIJEN R, 1990, SPEECH COMMUN, V9, P263, DOI 10.1016/0167-6393(90)90002-Q van Santen J. P. H., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1004 NR 28 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2001 VL 34 IS 3 BP 229 EP 246 DI 10.1016/S0167-6393(00)00036-4 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 426KQ UT WOS:000168348700001 ER PT J AU de Veth, J Cranen, B Boves, L AF de Veth, J Cranen, B Boves, L TI Acoustic backing-off as an implementation of missing feature theory SO SPEECH COMMUNICATION LA English DT Article DE automatic speech recognition; mismatched training-test conditions; missing feature theory; acoustic backing-off ID RECOGNITION AB In this paper, we discuss acoustic backing-off as a method to improve automatic speech recognition robustness. Acoustic backing-off aims to achieve the same objective as the marginalization approach of missing feature theory: the detrimental influence of outlier values is effectively removed from the local distance computation in the Viterbi algorithm. The proposed method is based on one of the principles of robust statistical pattern matching: during recognition the local distance function (LDF) is modeled using a mixture of the distribution observed during training and a distribution describing observations not previously seen. In order to assess the effectiveness of the new method, we used artificial distortions of the acoustic vectors in connected digit recognition over telephone lines. 
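The mixture-based local distance function just described can be sketched for a single Gaussian feature model: the trained density is mixed with a broad outlier density, so the local distance saturates for far-out values instead of growing quadratically. The mixture weight and outlier range below are illustrative choices, not the settings used in the paper:

import math

def backing_off_distance(x, mu, sigma, eps=0.05, lo=-10.0, hi=10.0):
    # -log of a mixture of the trained Gaussian and a flat "unseen" density.
    gauss = math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))
    flat = 1.0 / (hi - lo)
    return -math.log((1.0 - eps) * gauss + eps * flat)

print(backing_off_distance(0.1, 0.0, 1.0))  # near the model: ~conventional distance
print(backing_off_distance(8.0, 0.0, 1.0))  # outlier: distance saturates at ~6.0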
We found that acoustic backing-off is capable of restoring recognition performance almost to the level observed for the undisturbed features, even in cases where a conventional LDF completely fails. These results show that recognition robustness can be improved using a marginalization approach, where making the distinction between reliable and corrupted feature values is wired into the recognition process. In addition, the results show that application of acoustic backing-off is not limited to feature representations based on filter bank outputs. Finally, the results indicate that acoustic backing-off is much less effective when local distortions are smeared over all vector elements. Therefore, the acoustic pre-processing steps should be chosen with care, so that the dispersion of distortions over all acoustic vector elements as a result of within-vector feature transformations is minimal. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Nijmegen, Dept Language & Speech, NL-6500 HD Nijmegen, Netherlands. RP de Veth, J (reprint author), Univ Nijmegen, Dept Language & Speech, A2RT,POB 9103, NL-6500 HD Nijmegen, Netherlands. EM deveth@let.kun.nl; cranen@let.kun.nl; boves@let.kun.nl CR Cooke M., 1996, P ESCA WORKSH AUD BA, P297 DAUTRICH BA, 1983, IEEE T ACOUST SPEECH, V31, P793, DOI 10.1109/TASSP.1983.1164172 DAUTRICH BA, 1983, AT&T TECH J, V62, P1311 DENOS EA, 1995, P EUR 95, P825 DEVETH J, 1999, P WORKSH ROB METH SP, P231 DEVETH J, 1999, P EUR C SPEECH COMM, P65 de Veth J, 1998, SPEECH COMMUN, V25, P149, DOI 10.1016/S0167-6393(98)00034-X Dupont S., 1997, P ESCA NATO WORKSH R, P95 Huber PJ, 1981, ROBUST STAT JELINEK F, 1992, ADV SPEECH SIGNAL PR, P651 Kharin Y., 1996, ROBUSTNESS STAT PATT Lippmann R., 1997, P EUR 97 RHOD GREEC, P37 MATSUI T, 1992, P INT C AC SPEECH SI, V2, P157 Morris AC, 1998, INT CONF ACOUST SPEE, P737, DOI 10.1109/ICASSP.1998.675370 NADEU C, 1995, P EUR 95, P1381 Okawa S, 1998, INT CONF ACOUST SPEE, P641, DOI 10.1109/ICASSP.1998.675346 RABINER LR, 1988, NATO ASI SERIES F, V46, P183 Tibrewala S., 1997, P ICASSP, P1255 Young S, 1995, HTK BOOK HTK VERSION NR 19 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2001 VL 34 IS 3 BP 247 EP 265 DI 10.1016/S0167-6393(00)00037-6 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 426KQ UT WOS:000168348700002 ER PT J AU Cooke, M Green, P Josifovski, L Vizinho, A AF Cooke, M Green, P Josifovski, L Vizinho, A TI Robust automatic speech recognition with missing and unreliable acoustic data SO SPEECH COMMUNICATION LA English DT Article DE robust ASR; missing data; data imputation; HMM; spectral subtraction ID VOWELS; HUMANS AB Human speech perception is robust in the face of a wide variety of distortions, both experimentally applied and naturally occurring. In these conditions, state-of-the-art automatic speech recognition (ASR) technology fails. This paper describes an approach to robust ASR which acknowledges the fact that some spectro-temporal regions will be dominated by noise. For the purposes of recognition, these regions are treated as missing or unreliable. The primary advantage of this viewpoint is that it makes minimal assumptions about any noise background. Instead, reliable regions are identified, and subsequent decoding is based on this evidence. We introduce two approaches for dealing with unreliable evidence. 
The first - marginalisation - computes output probabilities on the basis of the reliable evidence only. The second - state-based data imputation - estimates values for the unreliable regions by conditioning on the reliable parts and the recognition hypothesis. A further source of information lies in the bounds on the energy of any constituent acoustic source in an additive mixture. This additional knowledge can be incorporated into the missing data framework. These approaches are applied to continuous-density hidden Markov model (HMM)-based speech recognisers and evaluated on the TIDigits corpus for several noise conditions. Two criteria which use simple noise estimates are employed as a means of identifying reliable regions. The first treats regions which are negative after spectral subtraction as unreliable. The second uses the estimated noise spectrum to derive local signal-to-noise ratios, which are then thresholded to identify reliable data points. Both marginalisation and state-based data imputation produce a substantial performance advantage over spectral subtraction alone. The use of energy bounds leads to a further increase in performance for both approaches. While marginalisation outperforms data imputation, the latter allows the approach to act as a preprocessor for conventional recognisers, or in speech-enhancement applications. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Sheffield, Dept Comp Sci, Speech & Hearing Res Grp, Sheffield S1 4DP, S Yorkshire, England. RP Cooke, M (reprint author), Univ Sheffield, Dept Comp Sci, Speech & Hearing Res Grp, Regent Court,211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. EM m.cooke@dcs.shef.ac.uk; p.green@dcs.shef.ac.uk; l.josifovski@dcs.shef.ac.uk; a.vizinho@dcs.shef.ac.uk CR AHMED S, 1993, ADV NEURAL INFORMATI, V5, P393 Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 Barker J., 1997, P EUR 97, P2127 Barker J, 1999, SPEECH COMMUN, V27, P159, DOI 10.1016/S0167-6393(98)00081-8 Bell A J, 1995, NEURAL COMPUT, V7, P1004 BOURLARD H, 1996, P ICSLP 96 Bregman AS., 1990, AUDITORY SCENE ANAL BRENDBORG MK, 1997, P EUR 97, P295 BRIDLE JS, 1994, P I AC, P307 BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 Brown GJ, 1997, NEURAL NETWORKS, V10, P1547, DOI 10.1016/S0893-6080(97)00046-4 BROWN GJ, 1996, P WORKSH AUD BAS SPE CARREIRAPERPINA.MA, 1999, CS9903 U SHEFF DEP C CHEVEIGNE A, 1999, J ACOUST SOC AM, V105, P3497, DOI 10.1121/1.424675 COMON P, 1994, SIGNAL PROCESS, V36, P287, DOI 10.1016/0165-1684(94)90029-9 Cooke M., 1994, P 3 INT C SPOK LANG, P1555 Cooke M., 1993, MODELLING AUDITORY P COOKE M, 1997, P ICASSP, P863 COOKE MP, IN PRESS LISTENING S CUNNINGHAM S, 1999, IN PRESS INT C PHON DEVETH J, 1999, P WORKSH ROB METH SP, P231 Drygajlo A., 1998, P ICASSP 98, V1, P121, DOI 10.1109/ICASSP.1998.674382 ELLIS DPW, 1996, THESIS CAMBRIDGE MA ELMALIKI M, 1999, P COST 250 WORKSH SP Fletcher H., 1953, SPEECH HEARING COMMU Furui S., 1997, P ESCA NATO TUT RES, P11 Gales M.J.F., 1993, P EUROSPEECH, P837 Ghahramani Z., 1993, ADV NEURAL INFORMATI, V6, P120 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Green PD, 1995, P INT C AC SPEECH SI, P401 GRENIE M, 1992, P ESCA WORKSH SPEECH HERMANSKY H, 1996, P ICSLP 96 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HIRSCH HG, P ICASSP 95, P153 Holmes J., 1986, P ICASSP86 TOKYO, P741 Klatt D., 1976, P ICASSP, P573 Leonard R. G., 1984, P ICASSP 84, P111
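For a diagonal-Gaussian state, the marginalisation approach described above keeps only the reliable dimensions, and the energy bounds allow each unreliable dimension to be integrated up to the observed value rather than over the whole axis. A sketch using SciPy's normal CDF; the mask and feature values are hypothetical:

import numpy as np
from scipy.stats import norm

def state_log_prob(x, reliable, mu, sigma, bounded=True):
    # Reliable dimensions: ordinary Gaussian score. Unreliable dimensions:
    # integrate out, either fully (contribution 1) or only up to the bound.
    logp = norm.logpdf(x[reliable], mu[reliable], sigma[reliable]).sum()
    if bounded:
        u = ~reliable
        logp += np.log(norm.cdf(x[u], mu[u], sigma[u])).sum()
    return logp

x = np.array([0.2, -0.1, 3.5, 4.0])              # observed (noisy) features
reliable = np.array([True, True, False, False])  # e.g. from a local SNR test
mu, sigma = np.zeros(4), np.ones(4)
print(state_log_prob(x, reliable, mu, sigma))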
Lippmann R., 1997, P EUR 97 RHOD GREEC, P37 Lippmann RP, 1996, IEEE T SPEECH AUDI P, V4, P66, DOI 10.1109/TSA.1996.481454 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 LIU F, 1994, BIOL CYBERN, V54, P29 Martin R., 1993, P EUROSPEECH 93 BERL, P1093 MILLER GA, 1950, J ACOUST SOC AM, V22, P167, DOI 10.1121/1.1906584 MING J, 1999, P WORKSH ROB METH SP, P175 MOKBEL C, 1992, RECONNAISSANCE PAROL Moore B. C. J., 1995, HDB PERCEPTION COGNI, V6, P387 Moore BCJ, 1997, INTRO PSYCHOL HEARIN MORRIS A, 1999, P EUR BUD HUNG, P599 MORRIS AC, 1998, ICASSP 98 SEATTL Morrison DF, 1990, MULTIVARIATE STAT ME Nadeu C, 1997, SPEECH COMMUN, V22, P315, DOI 10.1016/S0167-6393(97)00030-7 RAJ B, 1998, P INT C SPOK LANG PR, P1491 STEENEKEN HJM, 1992, THESIS U AMSTERDAM STRANGE W, 1983, J ACOUST SOC AM, V74, P695, DOI 10.1121/1.389855 Varga A.P., 1990, P ICASSP, P845 VARGA AP, 1992, NOISEX 92 STUDY EFFE VARGA AP, 1988, P ICASSP, P481 Vizinho A., 1999, P EUR C SPEECH COMM, P2407 VONDERMALSBURG C, 1986, BIOL CYBERN, V54, P29 WARREN RM, 1995, PERCEPT PSYCHOPHYS, V57, P175, DOI 10.3758/BF03206503 YOUNG SJ, 1993, HTK VERSION 1 5 USER 1996, P NIPS 95 WORKSH MIS NR 62 TC 313 Z9 321 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2001 VL 34 IS 3 BP 267 EP 285 DI 10.1016/S0167-6393(00)00034-0 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 426KQ UT WOS:000168348700003 ER PT J AU Schluter, R Macherey, W Muller, B Ney, H AF Schluter, R Macherey, W Muller, B Ney, H TI Comparison of discriminative training criteria and optimization methods for speech recognition SO SPEECH COMMUNICATION LA English DT Article DE discriminative training; maximum mutual information; minimum classification error; corrective training; speech recognition ID STATISTICAL ESTIMATION; INEQUALITY AB The aim of this work is to build up a common framework for a class of discriminative training criteria and optimization methods for continuous speech recognition. A unified discriminative criterion based on likelihood ratios of correct and competing models with optional smoothing is presented. The unified criterion leads to particular criteria through the choice of competing word sequences and the choice of smoothing. Analytic and experimental comparisons are presented for both the maximum mutual information (MMI) and the minimum classification error (MCE) criterion together with the optimization methods gradient descent (GD) and extended Baum (EB) algorithm. A tree search-based restricted recognition method using word graphs is presented, so as to reduce the computational complexity of large vocabulary discriminative training. Moreover, for MCE training, a method using word graphs for efficient calculation of discriminative statistics is introduced. Experiments were performed for continuous speech recognition using the ARPA wall street journal (WSJ) corpus with a vocabulary of 5k words and for the recognition of continuously spoken digit strings using both the TI digit string corpus for American English digits, and the SieTill corpus for telephone line recorded German digits. For the MMI criterion, neither analytical nor experimental results indicate significant differences between EB and GD optimization. For acoustic models of low complexity, MCE training gave significantly better results than MMI training. 
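The two criteria compared in this abstract can be made concrete for a single utterance with hypothetical acoustic log-scores: MMI maximises the log posterior of the correct word sequence, while MCE passes a misclassification measure through a sigmoid (the max over competitors below is the limiting case of the usual softmax). A toy computation only, not the paper's word-graph implementation:

import math

def mmi(log_scores, correct):
    # Log posterior of the correct hypothesis (maximise this).
    denom = math.log(sum(math.exp(s) for s in log_scores.values()))
    return log_scores[correct] - denom

def mce(log_scores, correct, gamma=1.0):
    # Sigmoid-smoothed classification error (minimise this).
    d = -log_scores[correct] + max(s for w, s in log_scores.items() if w != correct)
    return 1.0 / (1.0 + math.exp(-gamma * d))

scores = {"yes": -100.0, "no": -103.0, "maybe": -108.0}  # hypothetical log p(X|W)P(W)
print(mmi(scores, "yes"))  # near 0 when the correct hypothesis dominates
print(mce(scores, "yes"))  # near 0 for a correct, well-separated decision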
The recognition results for large vocabulary MMI training on the WSJ corpus show a significant dependence on the context length of the language model used for training. Best results were obtained using a unigram language model for MMI training. No significant correlation has been observed between the language models chosen for training and recognition. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Technol, Rhein Westfal TH Aachen, Lehrstuhl Informat 6, D-52056 Aachen, Germany. RP Schluter, R (reprint author), Univ Technol, Rhein Westfal TH Aachen, Lehrstuhl Informat 6, Ahornstr 55, D-52056 Aachen, Germany. EM schlueter@informatik.rwth-aachen.de; ney@informatik.rwth-aachen.de CR Bahl L., 1986, P INT C AC SPEECH SI, V11, P49, DOI 10.1109/ICASSP.1986.1169179 BAHL LR, 1996, P INT C AC SPEECH SI, V2, P613 BAUM LE, 1967, B AM MATH SOC, V73, P360, DOI 10.1090/S0002-9904-1967-11751-8 BROWN PF, 1987, THESIS CARNEGIEMELLO, P119 CARDIN R, 1993, P INT C ACOUST SPEEC, V2, P243 CHOU W, 1994, P INT C SPEECH LANG, V2, P439 CHOU W, 1993, P IEEE INT C AC SPEE, V2, P652 CHOU W, 1992, P INT C AC SPEECH SI, V1, P473 Chow Y.-L., 1990, P INT C AC SPEECH SI, P701 EISELE T, 1996, P INT C SPOK LANG PR, V1, P252, DOI 10.1109/ICSLP.1996.607092 GOPALAKRISHNAN PS, 1991, IEEE T INFORM THEORY, V37, P107, DOI 10.1109/18.61108 KANEVSKY D, 1995, P INT C AC SPEECH SI, V1, P473 KAPADIA S, 1993, P ICASSP, V2, P491 Leonard R. G., 1984, P INT C AC SPEECH SI NEY H, 1998, P INT C AC SPEECH SI, V2, P853, DOI 10.1109/ICASSP.1998.675399 NEY H, 1990, P 5 EUR SIGN PROC C, P65 NORMANDIN Y, 1991, THESIS MCGILL U MONT, P159 Normandin Y, 1994, IEEE T SPEECH AUDI P, V2, P299, DOI 10.1109/89.279279 NORMANDIN Y, 1996, AUTOMATIC SPEECH SPE, P57 Normandin Y., 1991, P INT C AC SPEECH SI, V1, P537 Normandin Y., 1994, P ICSLP YOK, V3, P1367 Ortmanns S, 1997, COMPUT SPEECH LANG, V11, P43, DOI 10.1006/csla.1996.0022 PALIWAL KK, 1995, P EUR C SPEECH COMM, V1, P541 POVEY D, 1999, P INT C AC SPEECH SI, V1, P333 REICHL W, 1995, P 1995 EUROSPEECH 95, V1, P537 SCHLUTER R, 1997, P EUR C SPEECH COMM, V1, P15 SCHLUTER R, 1998, P IEEE INT C AC SPEE, V1, P493, DOI 10.1109/ICASSP.1998.674475 Schwartz R., 1991, P IEEE INT C AC SPEE, P701, DOI 10.1109/ICASSP.1991.150436 VALTCHEV V, 1996, P INT C AC SPEECH SI, V2, P605 Valtchev V, 1997, SPEECH COMMUN, V22, P303, DOI 10.1016/S0167-6393(97)00029-0 WELLING L, 1995, P 1995 EUR C SPEECH, V2, P1483 Wessel F, 1998, P INT C AC SPEECH SI, V1, P225, DOI 10.1109/ICASSP.1998.674408 NR 32 TC 48 Z9 49 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2001 VL 34 IS 3 BP 287 EP 310 DI 10.1016/S0167-6393(00)00035-2 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 426KQ UT WOS:000168348700004 ER PT J AU Viikki, O AF Viikki, O TI Noise robust ASR SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Nokia Res Ctr, Speech & Audio Syst Lab, Tampere 33721, Finland. RP Viikki, O (reprint author), Nokia Res Ctr, Speech & Audio Syst Lab, POB 100, Tampere 33721, Finland. EM olli.viikki@nokia.com NR 0 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD APR PY 2001 VL 34 IS 1-2 BP 1 EP 2 DI 10.1016/S0167-6393(00)00041-8 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000001 ER PT J AU Bitzer, J Simmer, KU Kammeyer, KD AF Bitzer, J Simmer, KU Kammeyer, KD TI Multi-microphone noise reduction techniques as front-end devices for speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE superdirective beamformer; multi-microphone noise reduction; coherence; speech recognition; microphone arrays ID COHERENCE AB In this paper, we describe different multi-microphone noise reduction techniques as front-ends for a speaker-independent isolated word recognizer in an office environment. Our focus lies on examining the recognition rate if the noise source is not Gaussian and stationary, but a second speaker in the same room. In this case, standard noise reduction techniques like spectral subtraction fail, whereas multi-microphone techniques can raise the recognition rate by using spatial information. We compare the delay-and-sum beamformer, superdirective beamformers, and two post-filter systems. A new adaptive post-filter for superdirective beamformers (APES) is introduced. Our results show that multi-microphone techniques can increase the recognition rate significantly and that the new APES system outperforms related techniques. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Houpert Digital Audio, D-28359 Bremen, Germany. Aureca GmbH, D-28203 Bremen, Germany. Univ Bremen, Dept Telecommun, D-28334 Bremen, Germany. RP Bitzer, J (reprint author), Houpert Digital Audio, Fahrenheitstr 1, D-28359 Bremen, Germany. EM j.bitzer@hda.de; uwe.simmer@aureca.com; kammeyer@comm.uni-bremen.de CR Bitzer J., 1999, IEEE WORKSH APPL SIG, P7 BITZER J, 1998, P EURASIP EUR SIGN P, V1, P105 BUCKLEY KM, 1986, IEEE T ACOUST SPEECH, V34, P1322, DOI 10.1109/TASSP.1986.1164927 COX H, 1987, IEEE T ACOUST SPEECH, V35, P1365, DOI 10.1109/TASSP.1987.1165054 CRON BF, 1962, J ACOUST SOC AM, V34, P1732, DOI 10.1121/1.1909110 DOERBECKER M, 1997, P INT WORKSH AC ECH, P100 FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817 GILBERT EN, 1955, BELL SYSTEM TECH MAY, P637 GIULIANI D, 1995, P IEEE INT C AC SPEE, P860 GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 LIN Q, 1996, P IEEE INT C AC SPEE, V1, P21 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P795, DOI 10.1109/ASSP.1989.28053 Marro C, 1998, IEEE T SPEECH AUDI P, V6, P240, DOI 10.1109/89.668818 MOKBEL CE, 1995, IEEE T SPEECH AUDI P, V3, P346, DOI 10.1109/89.466660 PIERSOL AG, 1978, J SOUND VIB, V56, P215, DOI 10.1016/S0022-460X(78)80016-9 REX JA, 1994, P EURASIP EUR SIGN P, P1752 Shamsoddini A, 1996, ICSP '96 - 1996 3RD INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING, PROCEEDINGS, VOLS I AND II, P753, DOI 10.1109/ICSIGP.1996.567372 SIMMER KU, 1992, 2 COST 229 WORKSH AD, P185 Zelinski R., 1988, P INT C AC SPEECH SI, P2578 NR 19 TC 13 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
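The simplest of the front-ends compared in the multi-microphone article above is the delay-and-sum beamformer: each channel is delayed so that the desired source aligns across microphones, and the channels are averaged, so coherent speech adds up while uncorrelated noise partially cancels. A sketch with integer-sample steering delays; the geometry, delays and noise levels are invented:

import numpy as np

def delay_and_sum(mics, delays):
    # Advance each channel by its steering delay, then average.
    n = min(len(x) - d for x, d in zip(mics, delays))
    return sum(x[d:d + n] for x, d in zip(mics, delays)) / len(mics)

rng = np.random.default_rng(2)
s = np.sin(2 * np.pi * 0.02 * np.arange(400))  # desired "speech" signal
mic0 = s[3:] + rng.normal(0, 0.5, 397)         # source reaches mic0 first
mic1 = s[:-3] + rng.normal(0, 0.5, 397)        # and mic1 three samples later
out = delay_and_sum([mic0, mic1], delays=[0, 3])
print(np.var(out - s[3:3 + len(out)]))  # residual noise power, ~half of 0.25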
PD APR PY 2001 VL 34 IS 1-2 BP 3 EP 12 DI 10.1016/S0167-6393(00)00042-X PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000002 ER PT J AU Aubauer, R Leckschat, D AF Aubauer, R Leckschat, D TI Optimized second-order gradient microphone for hands-free speech recordings in cars SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE automobile; car; directivity; gradient microphone; hands-free; mel-frequency cepstral coefficients; microphone; multipath propagation; noise; quadrupole; reverberation; room impulse response; room transfer function; speech recognition; telecommunication; vehicle AB Hands-free telephony and automatic speech recognition in adverse acoustic conditions require noise reduction and speech enhancement methods to achieve acceptable quality and speech recognition rates. This contribution is primarily focussed on the optimization of the recording microphone. The directivity of the microphone can be used to reduce sound components outside the microphone's main axis, which suppresses noise and room reverberation effectively. Considering their outer dimensions, gradient microphones possess a high directivity and good noise suppression properties. In this study, the feasibility and the noise-cancelling characteristics of an optimized second-order gradient microphone are investigated. The development of a simple sensitivity-matching principle allows the use of low-priced microphone capsules. The directivity index, characterizing the diffuse noise suppression, is nearly frequency-independent in the telephone bandwidth and is about 3 dB higher than that of conventional directional microphones used in telecommunication. The microphone is evaluated in an automobile environment with the help of a speaker-dependent single-word recognizer and is compared with lower-order gradient microphones. The noise robustness of the speech recognizer is improved by at least 4.7 dB, provided the microphone is in an optimum position. Conclusions are then drawn regarding such an optimum microphone positioning. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Siemens AG, ICM CD TI 2, D-46395 Bocholt, Germany. RP Aubauer, R (reprint author), Siemens AG, ICM CD TI 2, Frankenstr 2, D-46395 Bocholt, Germany. EM roland.aubauer@bch.siemens.de CR EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 *ITUT, 1996, ITUT REC, P58 OLSON HF, 1979, J AUDIO ENG SOC, V1, P190 SESSLER GM, 1975, J ACOUST SOC AM, V58, P273, DOI 10.1121/1.380657 ZOLLNER M, 1993, ELEKTROAKUSTIK, P189 NR 5 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
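For ideal gradient patterns R(theta) = cos^n(theta), the directivity index DI = 10 log10(4 pi / integral of |R(theta)|^2 dOmega) can be evaluated numerically, which illustrates the roughly 2-3 dB advantage of a second-order over a first-order gradient pattern reported in the abstract above. Ideal free-field, axially symmetric patterns are assumed, not the measured microphone:

import numpy as np

def directivity_index(pattern, n=4000):
    # DI in dB for an axially symmetric pattern normalised to 1 on-axis.
    theta = np.linspace(0.0, np.pi, n)
    integrand = np.abs(pattern(theta)) ** 2 * np.sin(theta)
    integral = 2 * np.pi * integrand.sum() * (theta[1] - theta[0])
    return 10 * np.log10(4 * np.pi / integral)

print(directivity_index(np.cos))                    # 1st-order gradient: ~4.8 dB
print(directivity_index(lambda t: np.cos(t) ** 2))  # 2nd-order gradient: ~7.0 dB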
PD APR PY 2001 VL 34 IS 1-2 BP 13 EP 23 DI 10.1016/S0167-6393(00)00043-1 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000003 ER PT J AU Morris, A Hagen, A Glotin, H Bourlard, H AF Morris, A Hagen, A Glotin, H Bourlard, H TI Multi-stream adaptive evidence combination for noise robust ASR SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE noise robust ASR; multi-band ASR; expert combination; noise adaptation; latent variable decomposition ID SPEECH RECOGNITION AB In this paper, we develop different mathematical models in the framework of the multi-stream paradigm for noise robust automatic speech recognition (ASR), and discuss their close relationship with human speech perception. Largely inspired by Fletcher's "product-of-errors" rule (PoE rule) in psychoacoustics, multi-band ASR aims for robustness to data mismatch through the exploitation of spectral redundancy, while making minimum assumptions about noise type. Previous ASR tests have shown that independent sub-band processing can lead to decreased recognition performance with clean speech. We have overcome this problem by considering every combination of data sub-bands as an independent data stream. After introducing the background to multi-band ASR, we show how this "full combination" approach can be formalised, in the context of hidden Markov model/artificial neural network (HMM/ANN) based ASR, by introducing a latent variable to specify which data sub-bands in each data frame are free from data mismatch. This enables us to decompose the posterior probability for each phoneme into a reliability-weighted integral over all possible positions of clean data. This approach offers great potential for adaptation to rapidly changing and unpredictable noise. (C) 2001 Elsevier Science B.V. All rights reserved. C1 IDIAP, CH-1920 Martigny, Switzerland. RP Morris, A (reprint author), IDIAP, Rue Simplon 4,BP 592, CH-1920 Martigny, Switzerland. 
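The full-combination decomposition described in the abstract above sums, over every subset of sub-bands, the posterior of the corresponding sub-band expert weighted by the probability that exactly those bands are clean. A sketch with four bands and hypothetical expert posteriors; with the uniform weights used here it reduces to a plain average:

from itertools import combinations

BANDS = (0, 1, 2, 3)
SUBSETS = [c for r in range(1, len(BANDS) + 1) for c in combinations(BANDS, r)]

def full_combination(expert_post, weights):
    # P(q|x) = sum over band subsets of P(subset clean) * P(q | bands in subset).
    n_classes = len(next(iter(expert_post.values())))
    return [sum(weights[s] * expert_post[s][q] for s in SUBSETS)
            for q in range(n_classes)]

# One expert posterior per non-empty band subset (3 classes), plus uniform
# reliability weights over the 15 subsets -- both hypothetical.
expert_post = {s: [0.5, 0.3, 0.2] for s in SUBSETS}
weights = {s: 1.0 / len(SUBSETS) for s in SUBSETS}
print(full_combination(expert_post, weights))  # still a proper posterior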
EM morris@idiap.ch; hagen@idiap.ch; glotin@idiap.ch; bourlard@idiap.ch CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 Berthommier F., 1999, P 14 INT C PHON SCI, P711 BISHOP C, 1995, NEURAL NETWORKS PATT, P365 BOURLARD, 1999, P TAMP WORKSH ROB ME, P1 Bourlard H., 1996, P INT C SPOK LANG PR, P422 BOURLARD H, 1997, P INT SCH NEUR NETS Bourlard Ha, 1994, CONNECTIONIST SPEECH Cole R., 1995, P EUR C SPEECH COMM, P821 DEVETH J, 1999, P WORKSH ROB METH SP, P231 Duda R.O., 1993, PATTERN CLASSIFICATI DUPONT S, 1998, P ICSLP 98, P1283 Fletcher H, 1922, J FRANKL INST, V193, P0729, DOI 10.1016/S0016-0032(22)90319-9 Gales M.J.F., 1993, P EUROSPEECH, P837 Girin L, 1998, INT CONF ACOUST SPEE, P1005, DOI 10.1109/ICASSP.1998.675437 GLOTIN H, 1999, P EUROSPEECH, P2351 Greenberg S., 1997, P ESCA WORKSH ROB SP, P23 HAGEN A, 1999, P TAMP WORKSH ROB ME, P199 Hennebert J., 1997, P EUR C SPEECH COMM, P1951 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1996, P INT C SPOK LANG PR, P462 HERMANSKY H, 1999, P ICASSP 99, P298 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HIRSCH HG, 1995, INT CONF ACOUST SPEE, P153, DOI 10.1109/ICASSP.1995.479387 JORDAN MI, 1994, NEURAL COMPUT, V6, P181, DOI 10.1162/neco.1994.6.2.181 Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 Lippmann R., 1997, P EUR 97 RHOD GREEC, P37 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 MING J, 1999, P WORKSH ROB METH SP, P175 MIRGHAFORI N, 1999, THESIS U CALIFORNIA Moore BCJ, 1997, INTRO PSYCHOL HEARIN MORGAN N, 1998, 9817 IDIAPRR MORRIS A, 1999, P EUR BUD HUNG, P599 Morris AC, 1998, INT CONF ACOUST SPEE, P737, DOI 10.1109/ICASSP.1998.675370 MORRIS AC, 1999, 9904 IDIAPCOM NADEU C, 1995, P EUR 95, P1381 Okawa S, 1998, INT CONF ACOUST SPEE, P641, DOI 10.1109/ICASSP.1998.675346 Pickles JO, 1988, INTRO PHYSL HEARING Rao S, 1996, IEEE T INFORM THEORY, V42, P1160, DOI 10.1109/18.508839 Raviv Y., 1996, CONNECT SCI, V8, P356 Richard M. D., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.4.461 Steeneken HJM, 1999, SPEECH COMMUN, V28, P109, DOI 10.1016/S0167-6393(99)00007-2 TOMLINSON J, 1996, P ICASSP 96, P821 TOMLINSON J, 1997, P ICASSP 97, P1247 Varga A.P., 1990, P ICASSP, P845 VARGA AP, 1992, NOISEX 92 STUDY EFFE WESTPHAL M, 1999, P EUROSPEECH 99, P1955 Wu S., 1998, P ICASSP 98, P459 NR 47 TC 30 Z9 31 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2001 VL 34 IS 1-2 BP 25 EP 40 DI 10.1016/S0167-6393(00)00044-3 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000004 ER PT J AU Ming, J Smith, FJ AF Ming, J Smith, FJ TI Union: A new approach for combining sub-band observations for noisy speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE robust speech recognition; band-limited noise; unknown, time-varying noise statistics ID DEPENDENCE AB Recent studies have shown that the sub-band based speech recognition approach has the potential of improving upon the conventional, full-band based model against frequency-selective noise. A critical issue towards exploiting this potential is the choice of the method for combining the sub-band observations. 
This paper introduces a new method, namely, the probabilistic-union model, for this combination. The new model is based on the probability theory for the union of random events, and represents a new method for modeling partially corrupted observations given little knowledge about the corruption. The new model has been incorporated into a hidden Markov model (HMM) and tested for recognizing a speaker-independent E-set, corrupted by various types of additive noise. The results show that the new model offers robustness to partial frequency corruption, requiring little or no knowledge about the noise statistics. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Queens Univ Belfast, Sch Comp Sci, Belfast BT7 1NN, Antrim, North Ireland. RP Ming, J (reprint author), Queens Univ Belfast, Sch Comp Sci, Belfast BT7 1NN, Antrim, North Ireland. EM j.ming@qub.ac.uk CR Bourlard H., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607145 Bourlard H., 1999, P WORKSH ROB METH SP, P1 Bourlard H., 1997, P ICASSP, P1251 Cerisara C, 1998, INT CONF ACOUST SPEE, P717, DOI 10.1109/ICASSP.1998.675365 DUPONT S, 1997, P EUR 97 RHOD GREEC, P3 Hanna P, 1999, SPEECH COMMUN, V28, P301, DOI 10.1016/S0167-6393(99)00019-9 Harris B., 1966, THEORY PROBABILITY HERMANSKY H, 1996, P INT C SPOK LANG PR, P462 Leonard R. G., 1984, P INT C AC SPEECH SI MING J, 1999, P IEEE WORKSH AUT SP, P43 Ming J, 1996, COMPUT SPEECH LANG, V10, P229, DOI 10.1006/csla.1996.0012 MING J, 1999, P WORKSH ROB METH SP, P175 MING J, 1999, P INT C AC SPEECH SI, P161 Mirghafori N, 1998, INT CONF ACOUST SPEE, P713, DOI 10.1109/ICASSP.1998.675364 MORRIS A, 1999, P EUR BUD HUNG, P599 Okawa S, 1998, INT CONF ACOUST SPEE, P641, DOI 10.1109/ICASSP.1998.675346 Tibrewala S., 1997, P ICASSP, P1255 Tibrewala S., 1997, P EUR 97, P2619 VALTCHEV V, 1995, THESIS CAMBRIDGE U E WOODLAND PC, 1991, P ICASSP 91, P545, DOI 10.1109/ICASSP.1991.150397 NR 20 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2001 VL 34 IS 1-2 BP 41 EP 55 DI 10.1016/S0167-6393(00)00045-5 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000005 ER PT J AU de Veth, J de Wet, F Cranen, B Boves, L AF de Veth, J de Wet, F Cranen, B Boves, L TI Acoustic features and a distance measure that reduce the impact of training-test mismatch in ASR SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE automatic speech recognition; noise robustness; acoustic features; missing feature theory; robust statistical pattern recognition; robust local distance function ID ROBUST SPEECH RECOGNITION AB For improved recognition robustness in mismatched training-test conditions, the application of key ideas from missing feature theory and robust statistical pattern recognition in the framework of an otherwise conventional automatic speech recognition (ASR) system was investigated. To this end, both the type of features used to represent the speech signals and the algorithm used to compute the distance measure between an observed feature vector and a previously trained parametric model were studied.
Two different types of feature representations were used: a type in which spectrally local distortions are smeared over the entire feature vector and a type in which distortions are only smeared over part of the feature vector. In addition, two different distance measures were investigated, viz., a conventional distance measure and a robust local distance function in the form of acoustic backing-off. The effects on recognition performance were studied for artificially created, band-limited noise and NOISEX noise added to the speech signals. The results for artificial band-limited noise indicate that a partially smearing feature transform is to be preferred over a fully smearing transform. In addition, for artificial, band-limited noise, a robust local distance function is to be preferred over the conventional distance measure as long as the distorted feature values are outliers with respect to the feature distribution observed during training. The experiments with NOISEX noise show that the combination of feature type and distance measure that is optimal for artificial, band-limited noise is also capable of improving recognition robustness for NOISEX noise, provided that it is band-limited. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Nijmegen, Dept Language & Speech, A2RT, NL-6500 HD Nijmegen, Netherlands. RP de Veth, J (reprint author), Univ Nijmegen, Dept Language & Speech, A2RT, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM j.deveth@let.kun.nl; f.de.wet@let.kun.nl; b.cranen@let.kun.nl; l.boves@let.kun.nl CR BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cooke M., 1996, P ESCA WORKSH AUD BA, P297 DAUTRICH BA, 1983, AT&T TECH J, V62, P1311 DENOS EA, 1995, P EUR 95, P825 DEVETH J, 1999, 81 PRIOR PROGR LANG DEVETH J, 2001, SPEECH COMMUNICATION, V34 de Veth J, 1998, SPEECH COMMUN, V25, P149, DOI 10.1016/S0167-6393(98)00034-X DEVETH J, 1998, P INT C SPOK LANG PR, P1427 Dupont S., 1997, P ESCA NATO WORKSH R, P95 Gales MJF, 1998, SPEECH COMMUN, V25, P49, DOI 10.1016/S0167-6393(98)00029-6 Huber PJ, 1981, ROBUST STAT HUERTA J, 1998, CD ROM P INT C SPOK HUNT MJ, 1991, P IEEE INT C AC SPEE, P881, DOI 10.1109/ICASSP.1991.150480 Kharin Y., 1996, ROBUSTNESS STAT PATT LEE CH, 1999, P WORKSH ROB METH SP, P45 Lee CH, 1998, SPEECH COMMUN, V25, P29, DOI 10.1016/S0167-6393(98)00028-4 Lippmann R., 1997, P EUR 97 RHOD GREEC, P37 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 Morris AC, 1998, INT CONF ACOUST SPEE, P737, DOI 10.1109/ICASSP.1998.675370 NADEU C, 1995, P EUR 95, P1381 NEY H, 1999, EUROSPEECH 1999 *NOISEX, 1990, NOISE ROM 0 NATO AC2 Okawa S, 1998, INT CONF ACOUST SPEE, P641, DOI 10.1109/ICASSP.1998.675346 Tibrewala S., 1997, P ICASSP, P1255 Vizinho A., 1999, P EUR C SPEECH COMM, P2407 Young S, 1995, HTK BOOK HTK VERSION Zaveri K, 1979, ACOUSTIC NOISE MEASU NR 27 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
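A robust local distance in the spirit of the acoustic backing-off investigated by de Veth et al. above can be sketched as follows. The diagonal-Gaussian local score, the outlier probability `eps` and the flat density range `feat_range` are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def backing_off_distance(x, mean, var, eps=0.01, feat_range=20.0):
    """Per-component robust local distance with backing-off.

    Each feature's Gaussian density is mixed with a flat 'outlier'
    density 1/feat_range, so a single wildly distorted component cannot
    dominate the frame score: its contribution is bounded near
    -log(eps / feat_range) instead of growing quadratically.
    """
    x, mean, var = map(np.asarray, (x, mean, var))
    gauss = np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)
    mixed = (1.0 - eps) * gauss + eps / feat_range
    return -np.log(mixed).sum()

mean, var = np.zeros(4), np.ones(4)
clean = np.array([0.1, -0.2, 0.0, 0.3])
distorted = np.array([0.1, -0.2, 9.0, 0.3])   # one outlier component
print(backing_off_distance(clean, mean, var))
print(backing_off_distance(distorted, mean, var))  # bounded penalty
```

This only helps when the distorted components really are outliers with respect to the trained distribution, which matches the condition reported in the abstract above.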
PD APR PY 2001 VL 34 IS 1-2 BP 57 EP 74 DI 10.1016/S0167-6393(00)00046-7 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000006 ER PT J AU Kleinschmidt, M Tchorz, J Kollmeier, B AF Kleinschmidt, M Tchorz, J Kollmeier, B TI Combining speech enhancement and auditory feature extraction for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE robust speech recognition; perceptive modeling; auditory front end; speech enhancement ID SPECTRAL AMPLITUDE ESTIMATOR; NOISE-REDUCTION; IMPAIRED LISTENERS; QUANTITATIVE MODEL; HEARING-AIDS; FRONT-END; PERCEPTION; MODULATION; SYSTEM AB A major deficiency in state-of-the-art automatic speech recognition (ASR) systems is the lack of robustness in additive and convolutional noise. The model of auditory perception (PEMO), developed by Dau et al. (T. Dau, D. Puschel, A. Kohlrausch, J. Acoust. Soc. Am. 99 (6) (1996) 3615-3622) for psychoacoustical purposes, partly overcomes these difficulties when used as a front end for automatic speech recognition. To further improve the performance of this auditory-based recognition system in background noise, different speech enhancement methods were examined, which have been evaluated in earlier studies as components of digital hearing aids. Monaural noise reduction, as proposed by Ephraim and Malah (Y. Ephraim, D. Malah, IEEE Trans. Acoust. Speech Signal Process. ASSP-32 (6) (1984) 1109-1121), was compared to a binaural filter and dereverberation algorithm after Wittkop et al. (T. Wittkop, S. Albani, V. Hohmann, J. Peissig, W. Woods, B. Kollmeier, Acustica United with Acta Acustica 83 (4) (1997) 684-699). Both noise reduction algorithms yield improvements in recognition performance equivalent to up to 10 dB SNR in non-reverberant conditions for all types of noise, while the performance in clean speech is not significantly affected. Even in real-world reverberant conditions the speech enhancement schemes lead to improvements in recognition performance comparable to an SNR gain of up to 5 dB. This effect exceeds the expectations as earlier studies found no increase in speech intelligibility for hearing-impaired human subjects. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Carl von Ossietzky Univ Oldenburg, AG Med Phys, D-26111 Oldenburg, Germany. RP Kleinschmidt, M (reprint author), Carl von Ossietzky Univ Oldenburg, AG Med Phys, D-26111 Oldenburg, Germany. EM michael@medi.physik.uni-oldenburg.de CR Bitzer J., 1999, P WORKSH ROB METH SP, P171 Blauert J., 1997, SPATIAL HEARING BODDEN M, 1995, FORTSCHRITTE AKUSTIK, P1145 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 Colburn H.S., 1996, SPRINGER HDB AUDITOR, P332 Dau T, 1996, J ACOUST SOC AM, V99, P3615, DOI 10.1121/1.414959 Dau T, 1996, J ACOUST SOC AM, V99, P3623, DOI 10.1121/1.414960 Dau T, 1997, J ACOUST SOC AM, V102, P2892, DOI 10.1121/1.420344 DERLETH RP, 1999, THESIS U OLDENBURG Durlach N.
I., 1972, F MODERN AUDITORY TH, V2, P369 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 FISCHER A, 1999, P WORKSH ROB METH SP, P75 FRANCIS IF, 1997, P ICASSP 97, P1231 GELIN P, 1999, P EUR 1999 BUD HUNG, V6, P2483 GHITZA O, 1988, J PHONETICS, V16, P109 HANSEN M, 1997, P ICASSP 97, P1387 HERMUS K, 1999, P EUR 1999 BUD HUNG, V5, P1951 Holube I, 1996, J ACOUST SOC AM, V100, P1703, DOI 10.1121/1.417354 JANKOWSKI CR, 1995, IEEE T SPEECH AUDI P, V3, P286, DOI 10.1109/89.397093 Kanedera N, 1999, SPEECH COMMUN, V28, P43, DOI 10.1016/S0167-6393(99)00002-3 KASPER K, 1995, NEURAL NETWORKS SIGN, V5, P272 KASPER K, 1997, P ICASSP 1997, P1223 KASPER K, 1999, J ACOUST SOC AM, V105, P1175 KERMORVANT C, 1999, P EUR 1999 BUD HUNG, V6, P2841 KIYOHARA K, 1997, P IEEE INT C AC SPEE, P215 KLEINSCHMIDT M, 1999, PSYCHOPHYSICS PHYSL, P267 KLEINSCHMIDT M, 1998, FORTSCHRITTE AKUSTIK, P396 KOLLMEIER B, 1993, J REHABIL RES DEV, V30, P82 Kollmeier B., 1988, Audiologische Akustik, V27 MARZINZIK M, 1999, J ACOUST SOC AM, V105, P977 MARZINZIK M, 1999, PSYCHOPHYSICS PHYSL MEYER J, 1997, P ICASSP 97, V2, P1167 Mine R, 1996, SYST COMPUT JPN, V27, P37, DOI 10.1002/scj.4690271405 OMOLOGO M, 1997, P ICASSP97, P227 Patterson RD, 1987, M IOC SPEECH GROUP A Peissig J, 1997, J ACOUST SOC AM, V101, P1660, DOI 10.1121/1.418150 PEISSIG J, 1993, BINAURALE HORGERATES, V88 SENEFF S, 1988, J PHONETICS, V16, P55 TCHORZ J, 1997, P EUROSPEECH 97, V4, P2075 Tchorz J, 1999, J ACOUST SOC AM, V106, P2040, DOI 10.1121/1.427950 VARGA AP, 1992, NOISEX 92 STUDY EFFE VIZINHO A, 1999, P EUR 1999 BUD HUNG, V5, P2407 WESSELKAMP M, 1994, THESIS U GOTTINGEN WILMERS H, 1999, J ACOUST SOC AM, V105, P1092 Wittkop T, 1997, ACUSTICA, V83, P684 WITTKOP T, 1999, J ACOUST SOC AM, V105, P977 ZERBS C, 1999, PHYSOPHYSICS PHYSL M, P277 NR 48 TC 21 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2001 VL 34 IS 1-2 BP 75 EP 91 DI 10.1016/S0167-6393(00)00047-9 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000007 ER PT J AU Nadeu, C Macho, D Hernando, J AF Nadeu, C Macho, D Hernando, J TI Time and frequency filtering of filter-bank energies for robust HMM speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE robust speech recognition; time and frequency filtering; modulation spectrum; filter-bank energies ID WORD RECOGNITION; SPECTRUM; FEATURES; NOISE AB Every speech recognition system requires a signal representation that parametrically models the temporal evolution of the speech spectral envelope. Current parameterizations involve, either explicitly or implicitly, a set of energies from frequency bands which are often distributed in a mel scale. The computation of those energies is performed in diverse ways, but it always includes smoothing of basic spectral measurements and non-linear amplitude compression. Several linear transformations are then applied to the two-dimensional time-frequency sequence of energies before entering the HMM pattern matching stage. 
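To make such linear transformations concrete before the abstract continues: a minimal sketch that filters the log filter-bank energies along frequency with the first-difference filter H(z) = z - z^{-1} and along time with a standard slope (delta) filter. The filter choices are illustrative of this line of work, not the tuned designs evaluated in the paper.

```python
import numpy as np

def frequency_filter(logE):
    """Filter log filter-bank energies along frequency, H(z) = z - z^-1.

    logE: (frames, bands) array. Output band k is logE[k+1] - logE[k-1],
    with zero padding at the edge bands; this decorrelates neighbouring
    bands while keeping the parameters in the frequency domain.
    """
    padded = np.pad(logE, ((0, 0), (1, 1)))
    return padded[:, 2:] - padded[:, :-2]

def time_filter(params, span=2):
    """Standard delta (slope) filter along the time axis.

    Uses np.roll, so the utterance edges wrap around -- a simplification
    acceptable for a sketch but not for a production front end.
    """
    num = sum(k * (np.roll(params, -k, axis=0) - np.roll(params, k, axis=0))
              for k in range(1, span + 1))
    return num / (2 * sum(k * k for k in range(1, span + 1)))

logE = np.log(np.random.default_rng(1).random((10, 8)) + 1e-3)
ff = frequency_filter(logE)   # static frequency-filtered parameters
dff = time_filter(ff)         # their dynamic (time-filtered) counterparts
print(ff.shape, dff.shape)
```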
In this paper, a recently introduced technique that consists of filtering that sequence of energies along the frequency dimension is presented, and its resulting parameters are compared with the widely used cepstral coefficients. Then, that frequency filtering transformation is jointly considered with the time filtering transformation that is used to compute dynamic parameters, showing that the flexibility of this combined (tiffing) approach can be used to design a robust set of filters. Recognition experiment results are reported which show the potential of tiffing for an enhanced and more robust HMM speech recognition. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Politecn Cataluna, Dept Signal Theory & Commun, TALP Res Ctr, E-08034 Barcelona, Spain. RP Nadeu, C (reprint author), Univ Politecn Cataluna, Dept Signal Theory & Commun, TALP Res Ctr, J Girona 1-3,Campus Nord,Edifici D5, E-08034 Barcelona, Spain. EM climent@talp.upc.es; dusan@talp.upc.es; javier@talp.upc.es RI Nadeu, Climent/B-9638-2014; Hernando, Javier/G-1863-2014 OI Nadeu, Climent/0000-0002-5863-0983; CR AKANSU AN, 1992, MULTIRRESOLUTION SIG ALEXANDRE P, 1993, SPEECH COMMUN, V12, P277, DOI 10.1016/0167-6393(93)90099-7 Arai T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607318 Atal B. S., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing Avendano C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607213 BATLLE E, 1998, P ICSLP, V3, P951 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Deller J. R., 1993, DISCRETE TIME PROCES DEVETH J, 1999, P WORKSH ROB METH SP, P231 *ETSI SQL, W1007 ETSI SQL Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 GREENBERG S, 1997, P ICASSP, V3, P1647 HANSON BA, 1996, ADV TOPICS AUTOMATIC HANSON BA, 1987, IEEE T ACOUST SPEECH, V35, P968, DOI 10.1109/TASSP.1987.1165241 HERMANN AM, 1994, APPL PHYS S, V2, P1 Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HERNANDO J, 1997, P EUROSPEECH, V5, P2363 HERNANDO J, 1997, P EUROSPEECH, V1, P417 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 HUNT MJ, 1989, P ICASSP, V1, P262 HUNT MJ, 1999, P WORKSH ASRU JUANG BH, 1987, IEEE T ACOUST SPEECH, V35, P947 JUNQUA JC, 1996, ROBUSTNESS AUTOMATIC, P1996 Klatt D. H., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing LEONARD RG, 1984, P ICASSP, V3, P42 LJOLJE A, 1994, COMPUT SPEECH LANG, V8, P223, DOI 10.1006/csla.1994.1011 MACHO D, 1998, P ICSLP, P1487 MACHO D, 1999, P WORKSH ROB METH SP, P111 MACHO D, 1999, P EUROSPEECH, V1, P77 Mashao DJ, 1996, IEEE SIGNAL PROC LET, V3, P103, DOI 10.1109/97.489061 NADEU C, 1995, P EUR 95, P1381 NADEU C, 1998, P ICSLP, V3, P1071 NADEU C, 1994, P ICSLP, P1927 Nadeu C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607146 Nadeu C, 1997, SPEECH COMMUN, V22, P315, DOI 10.1016/S0167-6393(97)00030-7 NADEU C, 1997, P ICASSP, P3953 Oppenheim A. V., 1989, DISCRETE TIME SIGNAL PACHESLEAL P, 1999, P EUROSPEECH, V1, P89 Pai H. 
F., 1992, COMPUT SPEECH LANG, V6, P361, DOI 10.1016/0885-2308(92)90029-4 PALIWAL KK, 1999, P EUROSPEECH, V1, P85 Paliwal K. K., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90034-6 PEARCE D, 1998, EXPT FRAMEWORK PERFO PICONE JW, 1991, P IEEE, V79, P1214 Rabiner L, 1993, FUNDAMENTALS SPEECH Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Rosenberg A. E., 1994, P INT C SPOK LANG PR, P1835 SILVERMA.HF, 1974, IEEE T ACOUST SPEECH, VAS22, P362, DOI 10.1109/TASSP.1974.1162599 THOMSON DJ, 1982, P IEEE, V70, P1055, DOI 10.1109/PROC.1982.12433 TIAN J, 1999, P EUROSPEECH, P87 TOKHURA Y, 1987, IEEE T ACOUST SPEECH, V35, P1414 VASEGHI SV, 1993, IEE PROC-I, V140, P317 YOUNG S, 1997, HIDDEN MARKOV MODEL NR 54 TC 52 Z9 52 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2001 VL 34 IS 1-2 BP 93 EP 114 DI 10.1016/S0167-6393(00)00048-0 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000008 ER PT J AU Menendez-Pidal, X Chen, RX Wu, DP Tanaka, M AF Menendez-Pidal, X Chen, RX Wu, DP Tanaka, M TI Compensation of channel and noise distortions combining normalization and speech enhancement techniques SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE robust speech recognition; noise and channel compensation ID HIDDEN MARKOV-MODELS; RECOGNITION AB This paper introduces two techniques to obtain robust speech recognition devices in mismatch conditions (additive noise mismatch and channel mismatch). The first algorithm, adaptive Gaussian attenuation algorithm (AGA), is a speech enhancement technique developed to reduce the effects of additive background noise in a wide range of signal noise ratio (SNR) and noise conditions. The technique is closely related to the classical noise spectral subtraction (SS) scheme, but in the proposed model the mean and variance of noise are used to better attenuate the noise. Information of the SNR is also introduced to provide adaptability at different SNR conditions. The second algorithm, cepstral mean normalization and variance-scaling technique (CMNVS), is an extension of the cepstral mean normalization (CMN) technique to provide robust features to convolutive and additive noise distortions. The requirements of the techniques are also analyzed in the paper. Combining both techniques the relative channel distortion effects were reduced to 90% on the HTIMIT task and the relative additive noise effects were reduced to 77% using the TIMIT database mixed with car noises at different SNR conditions. (C) 2001 Elsevier Science B.V. All rights reserved. C1 SONY US Res Labs, San Jose, CA 95134 USA. RP Menendez-Pidal, X (reprint author), SONY US Res Labs, 3300 Zanker Rd,SJ1B5, San Jose, CA 95134 USA. EM xavier@slt.sel.sony.com CR BEROUTI M, 1979, P INT C AC SPEECH SI, P849 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cheng RT, 1998, PHYSICS OF ESTUARIES AND COASTAL SEAS, P3 Gales MJF, 1998, SPEECH COMMUN, V25, P49, DOI 10.1016/S0167-6393(98)00029-6 Gauvain J. L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607734 HANSON B, 1995, AUTOMATIC SPEECH SPE IWAHASHI N, 1998, P INT C AC SPEECH SI, V2, P633, DOI 10.1109/ICASSP.1998.675344 LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z MENENDEZPIDAL X, 1999, WORKSH ROB METH SPEE, P101 Milner B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607093 Neumeyer LG, 1994, IEEE T SPEECH AUDI P, V2, P590, DOI 10.1109/89.326617 NOLAZCO J, 1993, TR123 CUED REYNOLDS DH, 1997, P INT AC SPEECH SIGN, P1537 Rosenberg A. E., 1994, P INT C SPOK LANG PR, P1835 SCHLESS V, 1998, P INT SPOK LANG PROC, P1495 Tibrewala S., 1997, P EUR 97, P2619 Verdu S, 1998, IEEE T INFORM THEORY, V44, P2057, DOI 10.1109/18.720531 VIIKKI O, 1997, ESCA NATO WORKSH ROB, P107 Viikki O, 1998, INT CONF ACOUST SPEE, P733, DOI 10.1109/ICASSP.1998.675369 Xie F, 1996, SPEECH COMMUN, V19, P89, DOI 10.1016/0167-6393(96)00022-2 XIE F, 1994, P INT AC SPEECH SIGN, P53 NR 22 TC 1 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2001 VL 34 IS 1-2 BP 115 EP 126 DI 10.1016/S0167-6393(00)00049-2 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000009 ER PT J AU Hirsch, HG AF Hirsch, HG TI HMM adaptation for applications in telecommunication SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE robust speech recognition; HMM adaptation ID ROBUST SPEECH RECOGNITION; NOISE AB The mismatch between the acoustic conditions during training and recognition often causes a performance deterioration in practical applications of speech recognition systems. Two important effects are the presence of a stationary background noise and the frequency response of the transmission channel from the speaker to the audio input of the recognizer. The original contributions of this work are two signal processing schemes for the estimation of the actual noise spectrum and the difference of the frequency responses between training and recognition. The estimated noise components are taken to adapt the cepstral parameters of the recognizer's references, which are described by hidden Markov models (HMMs). The adaptation process is based on the parallel model combination (PMC) approach (M.J.F. Gales, Model based techniques for noise robust speech recognition, Dissertation at the University of Cambridge, 1995). For speaker independent connected or isolated word recognition, considerable improvements can be achieved in the presence of just one type of noise as well as in the presence of both types together. Furthermore, this adaptation scheme is integrated as part of a complete dialogue and recognition system which is accessible via the public telephone network. The usability and the gain in recognition performance are shown for this application in a real telecommunication scenario under consideration of all real-time aspects. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Inst Speech Mus & Hearing, Ctr Speech Technol, Stockholm, Sweden. Ericsson Eurolab Deutschland GmbH, D-90411 Nurnberg, Germany. RP Hirsch, HG (reprint author), Inst Speech Mus & Hearing, Ctr Speech Technol, Stockholm, Sweden.
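The core PMC step underlying the adaptation scheme in the Hirsch abstract above can be sketched in a few lines: the log-add approximation combines a clean state mean with a noise estimate in the linear spectral domain. This bare-bones version works on log filter-bank means and ignores variances, channel terms and the cepstral rotation; the numbers are illustrative.

```python
import numpy as np

def pmc_log_add(clean_log_mean, noise_log_mean, gain=1.0):
    """Log-add approximation used in parallel model combination (PMC).

    Map both means to the linear energy domain, add them (speech and
    noise powers are assumed additive), and map back:
        mu_noisy = log(gain * exp(mu_speech) + exp(mu_noise))
    """
    return np.log(gain * np.exp(clean_log_mean) + np.exp(noise_log_mean))

clean = np.array([2.0, 3.5, 1.0])   # log-energies of a clean HMM state
noise = np.array([1.5, 0.5, 2.5])   # estimated noise log-spectrum
print(pmc_log_add(clean, noise))    # adapted 'noisy' state means
```

Bands where the noise estimate dominates are pulled toward the noise level, which is exactly the mismatch compensation the abstract describes.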
EM hans-guenter.hirsch@eed.ericsson.se CR Gales M. J., 1995, THESIS U CAMBRIDGE GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014 GALES MJF, 1997, ESCA WORKSH ROB SPEE, P55 HIRSCH HG, 1995, INT CONF ACOUST SPEE, P153, DOI 10.1109/ICASSP.1995.479387 LEONARD RG, 1984, ICASSP, V84 Minami Y, 1996, INT CONF ACOUST SPEE, P327, DOI 10.1109/ICASSP.1996.541098 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 Stern R. M., 1997, ESCA NATO TUT RES WO, P33 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 YOUNG S, 1996, HTK BOOK MANUAL HTK NR 10 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2001 VL 34 IS 1-2 BP 127 EP 139 DI 10.1016/S0167-6393(00)00050-9 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000010 ER PT J AU Ris, C Dupont, S AF Ris, C Dupont, S TI Assessing local noise level estimation methods: Application to noise robust ASR SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE robust automatic speech recognition; noise level estimation; noise reduction; spectral subtraction; missing data ID SPEECH AB In this paper, we assess and compare four methods for the local estimation of noise spectra, namely the energy clustering, the Hirsch histograms, the weighted average method and the low-energy envelope tracking. Moreover we introduce, for these four approaches, the harmonic filtering strategy, a new pre-processing technique, expected to better track fast modulations of the noise energy. The speech periodicity property is used to update the noise level estimate during voiced parts of speech, without explicit detection of voiced portions. Our evaluation is performed with six different kinds of noises (both artificial and real noises) added to clean speech. The best noise level estimation method is then applied to noise robust speech recognition based on techniques requiring a dynamic estimation of the noise spectra, namely spectral subtraction and missing data compensation. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Multitel, Fac Polytech Mons, TCTS, B-7000 Mons, Belgium. RP Ris, C (reprint author), Multitel, Fac Polytech Mons, TCTS, Parc Initialis, B-7000 Mons, Belgium. EM ris@tcts.fpms.ac.be; dupont@tcts.fpms.ac.be CR Berouti M., 1979, P IEEE INT C AC SPEE, P208 BOLL SF, 1979, IEEE ASSP, V2 Bourlard H., 1996, P EUR SIGN PROC C TR, P1579 Bourlard Ha, 1994, CONNECTIONIST SPEECH COOKE M, 1997, P ICASSP 97 MUN APR COOKE M, 1999, CS9905 U SHEFF DEP C DUPONT S, 1998, P INT C SPOK LANG PR ELMALIKI M, 1998, P 22 JOURN ET PAR MA, P409 GALES MJF, 1997, P ESCA NATO WORKSH R, P55 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hirsch H., 1995, P ICASSP, P153 Korthauer A., 1999, P ROBUST 99 WORKSH T, P123 Leonard R. 
G., 1984, P ICASSP 84, P111 MARTIN R, 1993, EUROSPEECH 93, P1093 MCKINLEY BL, 1997, P ICASSP 97, P1179 MIRGHAFORI N, 1998, P INT C SPOK LANG PR Price P., 1988, P IEEE INT C AC SPEE, P651 SARIKAYA R, 1998, P INT C SPOK LANG PR SINGH L, 1998, P INT C SPOK LANG PR Tibrewala S., 1997, P ICASSP, P1255 *ULG, 1998, ULG AC LAB MADRAS PR Van Compernolle D., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90027-2 VARGA AP, 1992, NOISEX 92 STUDY EFFE Vizinho A., 1999, P EUR C SPEECH COMM, P2407 NR 25 TC 34 Z9 37 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2001 VL 34 IS 1-2 BP 141 EP 158 DI 10.1016/S0167-6393(00)00051-0 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000011 ER PT J AU Surendran, AC Lee, CH AF Surendran, AC Lee, CH TI Transformation-based Bayesian prediction for adaptation of HMMs SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE speech recognition; model adaptation; Bayesian prediction ID HIDDEN MARKOV-MODELS; ROBUST SPEECH RECOGNITION; MAXIMUM-LIKELIHOOD; SPEAKER ADAPTATION; DENSITIES AB Due to inaccuracies in the modeling procedure, estimation errors, and poor data to parameter ratios, adaptation techniques can perform poorly when only a limited amount of data is available. Modeling inflexibility, on the other hand, limits their potential when large amounts of data are present. In this paper, we present a transformation-based Bayesian predictive approach to hidden Markov model (HMM) adaptation that addresses the above problems. The new technique, called Bayesian predictive adaptation (BPA), treats adaptation as model evolution arising from attempted transformation of the model parameters. The transformation is a structural representation of the assumed mismatch between the trained models and the adaptation data. Instead of estimating the transformation parameters directly, and blindly treating the estimates as if they are the true values, BPA averages over the variation of the parameters to generate a new model that can be used in the decoding process. By combining the power of Bayesian prediction to take into consideration the errors in estimation and modeling, with the power of transformation based techniques to use fewer parameters for adaptation, the proposed approach creates a new family of techniques that tend to be robust to estimation and modeling errors when only limited data are available, and to modeling inflexibility when large amounts of data are present. We present adaptation results under channel and speaker mismatches, and compare the performance of BPA to other adaptation techniques to demonstrate its effectiveness. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Bell Labs, Lucent Technol, Multimedia Commun Res Lab, Murray Hill, NJ 07974 USA. RP Surendran, AC (reprint author), Bell Labs, Lucent Technol, Multimedia Commun Res Lab, 600 Mt Ave, Murray Hill, NJ 07974 USA. 
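The "averaging over the variation of the parameters" described in the Surendran and Lee abstract above has a closed form in a toy Gaussian case. The sketch below assumes a one-dimensional observation model N(x; mu + b, sigma2) and a Gaussian posterior N(b; b_hat, tau2) over an additive bias transformation; it illustrates the predictive idea only, not the paper's HMM machinery.

```python
import numpy as np

def predictive_loglike(x, mu, sigma2, b_hat, tau2):
    """Bayesian predictive log-likelihood under a random bias transform.

    Instead of plugging in the bias estimate b_hat, integrate it out:
        p(x) = Int N(x; mu + b, sigma2) N(b; b_hat, tau2) db
             = N(x; mu + b_hat, sigma2 + tau2)
    Uncertainty in the transform inflates the variance, making the
    score robust when the bias was estimated from little adaptation data.
    """
    var = sigma2 + tau2
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu - b_hat) ** 2 / var)

x, mu, sigma2 = 1.8, 0.0, 1.0
print(predictive_loglike(x, mu, sigma2, b_hat=1.0, tau2=0.0))  # plug-in
print(predictive_loglike(x, mu, sigma2, b_hat=1.0, tau2=2.0))  # predictive
```

With tau2 = 0 the rule degenerates to the conventional plug-in score; large tau2 flattens the density, which is the robustness-versus-sharpness trade-off the paper analyses.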
EM acs@research.bell-labs.com; chl@research.bell-labs.com CR Berger JO, 1980, STAT DECISION THEORY Chesta C., 1999, P EUR C SPEECH COMM, P211 CHIEN JT, 1997, P EUR RHOD GREEC, P2575 Chien JT, 1999, IEEE T SPEECH AUDI P, V7, P656 De Brabandere K, 2007, P IEEE INT C AC SPEE, P1 DeGroot M., 1970, OPTIMAL STAT DECISIO DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P294, DOI 10.1109/89.506933 GALES MJF, 1998, P INT C SPOK LANG PR, V5, P1783 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 GELFAND AE, 1990, J AM STAT ASSOC, V85, P398, DOI 10.2307/2289776 Huo Q, 1997, IEEE T SPEECH AUDI P, V5, P161 Huo Q, 1997, INT CONF ACOUST SPEE, P1547 Jiang H, 1999, IEEE T SPEECH AUDI P, V7, P426 JOHNSON VE, 1992, J AM STAT ASSOC, V87, P852, DOI 10.2307/2290224 KUHN R, 1998, P ICSLP, V5, P1771 LEE CH, 1993, SPEECH COMMUN, V13, P263, DOI 10.1016/0167-6393(93)90025-G Lee C.-H., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90022-V LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Price P., 1988, P IEEE INT C AC SPEE, P651 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Ripley B., 1996, PATTERN RECOGNITION Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 SHINODA K, 1997, P IEEE WORKSH AUT SP SIOHAN O, 2000, P IEEE C AC SPEECH S Surendran AC, 1999, IEEE T SPEECH AUDI P, V7, P643, DOI 10.1109/89.799689 SURENDRAN AC, 2000, P IEEE C AC SPEECH S SURENDRAN AC, 1998, P INT C SPOK LANG PR SURENDRAN AC, 1999, P WORKSH ROB METH SP TANNER M, 1990, J AM STAT ASSOC, V82, P528 NR 30 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2001 VL 34 IS 1-2 BP 159 EP 174 DI 10.1016/S0167-6393(00)00052-2 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000012 ER PT J AU Huo, Q Lee, CH AF Huo, Q Lee, CH TI Robust speech recognition based on adaptive classification and decision strategies SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE decision rules; plug-in MAP decision rule; minimax decision rule; Bayesian predictive classification; adaptive classification; adaptation; compensation; robust automatic speech recognition ID HIDDEN MARKOV-MODELS; BAYESIAN PREDICTIVE CLASSIFICATION; A-POSTERIORI ESTIMATION; SPEAKER ADAPTATION; MAXIMUM-LIKELIHOOD; CHANNEL ADAPTATION; BIAS REMOVAL; TRANSFORMATION; PARAMETERS; ALGORITHM AB We examine key research issues in adaptively modifying the conventional plug-in MAP decision rules in order to improve the robustness of the classification and decision strategies used in automatic speech recognition (ASR) systems. It is well known that the commonly adopted plug-in MAP decoder does not achieve the minimum error rate desired in ASR because the joint probability distribution of speech and language is usually not known exactly. The optimality issue becomes even more serious when there exists acoustic mismatch between training and testing conditions. We review in detail two recently proposed classification rules, namely minimax classification and Bayesian predictive classification. Both of them model classifier parameter uncertainty and modify the classification rules to satisfy some desired robustness properties. 
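For orientation, the minimax idea just mentioned can be rendered as a toy decision rule: score each class by its worst-case likelihood over a parameter uncertainty set, then pick the best class. The Gaussian class models and uncertainty sets below are hypothetical, not the decision rules worked out in the paper.

```python
import numpy as np

def gaussian_loglike(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

def minimax_classify(x, class_param_sets):
    """Pick the class whose worst-case log-likelihood is largest.

    class_param_sets: one list of (mu, var) pairs per class, each list
    being that class's parameter uncertainty set. The plug-in MAP rule
    would instead evaluate a single point estimate per class.
    """
    worst_case = [min(gaussian_loglike(x, mu, var) for mu, var in ps)
                  for ps in class_param_sets]
    return int(np.argmax(worst_case))

# Two classes; class 0's mean is uncertain (e.g. under channel mismatch).
class0 = [(0.0, 1.0), (0.8, 1.0)]   # uncertainty set for class 0
class1 = [(2.0, 1.0)]               # class 1 assumed well estimated
print(minimax_classify(1.2, [class0, class1]))
```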
We also present an overview on a number of related techniques and discuss how these algorithms can be used to improve the robustness of speech recognizers. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Hong Kong, Dept Comp Sci & Informat Syst, Hong Kong, Hong Kong, Peoples R China. Bell Labs, Lucent Technol, Dialogue Syst Res Dept, Murray Hill, NJ 07974 USA. RP Huo, Q (reprint author), Univ Hong Kong, Dept Comp Sci & Informat Syst, Pokfulam Rd, Hong Kong, Hong Kong, Peoples R China. EM qhuo@csis.hku.hk; chl@research.bell-labs.com CR Acero A, 1993, ACOUSTICAL ENV ROBUS AITCHISON J, 1975, STAT PREDICATION ANA ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 CHESTA C, 1999, P EUR 99 BUD HUNG Chien JT, 1997, SPEECH COMMUN, V22, P369, DOI 10.1016/S0167-6393(97)00033-2 Chien JT, 1999, IEEE T SPEECH AUDI P, V7, P656 Chien JT, 1997, IEEE SIGNAL PROC LET, V4, P167 CHOU W, 1999, P EUR 99 BUD HUNG DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Diakoloukas VD, 1999, IEEE T SPEECH AUDI P, V7, P177, DOI 10.1109/89.748122 DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659 Digalakis VV, 1999, IEEE T SPEECH AUDI P, V7, P253, DOI 10.1109/89.759031 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P294, DOI 10.1109/89.506933 Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Geisser S, 1993, PREDICTIVE INFERENCE Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HUNT MJ, 1999, P 1999 IEEE WORKSH A Huo Q, 1998, IEEE T SPEECH AUDI P, V6, P386 Huo Q, 1997, IEEE T SPEECH AUDI P, V5, P161 Huo Q, 2000, IEEE T SPEECH AUDI P, V8, P200 Huo Q, 1997, INT CONF ACOUST SPEE, P1547 HUO Q, 1999, TR9907 U HONG KONG D Huo Q, 1998, INT CONF ACOUST SPEE, P741 HUO Q, 1995, IEEE T SPEECH AUDI P, V3, P334 HUO Q, 1999, P EUR 99 BUD HUNG, P2721 HUO Q, 1997, P EUR 97 RHOD GREEC Jelinek F., 1991, ADV SPEECH SIGNAL PR, P651 Jiang H, 1999, SPEECH COMMUN, V28, P313, DOI 10.1016/S0167-6393(99)00018-7 JIANG H, 1998, P ICSLP 98 SYDN, P389 Jiang H, 1999, IEEE T SPEECH AUDI P, V7, P426 Kharin Y., 1996, ROBUSTNESS STAT PATT LASRY MJ, 1984, IEEE T PATTERN ANAL, V6, P530 Lawrence C, 1999, COMPUT SPEECH LANG, V13, P283, DOI 10.1006/csla.1999.0125 LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902 Lee C. H., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90002-N Lee C.-H., 1996, AUTOMATIC SPEECH SPE Lee JJ, 1998, SYMBIOSIS, V25, P1 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Leggetter C. 
J., 1995, P EUR 95 MADR, P1155 Merhav N, 1993, IEEE T SPEECH AUDI P, V1, P90, DOI 10.1109/89.221371 MERHAV N, 1991, IEEE T SIGNAL PROCES, V39, P2157, DOI 10.1109/78.91172 MOKBEL C, 1999, P WORKSH ROB METH SP, P227 Moon S, 1997, IEEE T NEURAL NETWOR, V8, P194 MORGAN N, 1999, P 1999 IEEE WORKSH A NADAS A, 1985, IEEE T ACOUST SPEECH, V33, P326, DOI 10.1109/TASSP.1985.1164513 PAUL DB, 1997, P INT C AC SPEECH SI, P1487 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Rahim MG, 1996, IEEE SIGNAL PROC LET, V3, P107, DOI 10.1109/97.489062 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Ripley B., 1996, PATTERN RECOGNITION Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 Shahshahani BM, 1997, IEEE T SPEECH AUDI P, V5, P183, DOI 10.1109/89.554780 Shinoda K, 1998, INT CONF ACOUST SPEE, P793, DOI 10.1109/ICASSP.1998.675384 Shinoda K., 1997, P 1997 IEEE WORKSH A, P381 Siohan O., 1999, P WORKSH ROB METH SP, P147 SIOHAN O, 2000, P ICASSP 2000 IST TU STERN RM, 1987, IEEE T ACOUST SPEECH, V35, P751, DOI 10.1109/TASSP.1987.1165203 Surendran AC, 1999, IEEE T SPEECH AUDI P, V7, P643, DOI 10.1109/89.799689 SURENDRAN A, 1998, P ICSLP 98 SYDN SURENDRAN AC, 1999, P WORKSH ROB METH SP, P155 Takahashi J, 1997, COMPUT SPEECH LANG, V11, P127, DOI 10.1006/csla.1996.0025 WANG SJ, 1999, P IEEE AUT SPEECH RE ZAVALIAGKOS G, 1995, P 4 EUR C SPEECH COM, P1131 ZAVALIAGKOS G, 1995, P ICASSP 95, P676 Zhao YX, 1996, SPEECH COMMUN, V18, P65, DOI 10.1016/0167-6393(95)00036-4 Zhao YX, 1994, IEEE T SPEECH AUDI P, V2, P380 NR 68 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2001 VL 34 IS 1-2 BP 175 EP 194 DI 10.1016/S0167-6393(00)00053-4 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000013 ER PT J AU Rahim, M Riccardi, G Saul, L Wright, J Buntschuh, B Gorin, A AF Rahim, M Riccardi, G Saul, L Wright, J Buntschuh, B Gorin, A TI Robust numeric recognition in spoken language dialogue SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 DE robustness; spoken dialogue system; speech recognition; utterance verification; discriminative training; understanding; language modeling; numeric recognition; digits ID SPEECH RECOGNITION AB This paper addresses the problem of automatic numeric recognition and understanding in spoken language dialogue. We show that accurate numeric understanding in fluent unconstrained speech demands maintaining robustness at several different levels of system design, including acoustic, language, understanding and dialogue. We describe a robust system for numeric recognition and present algorithms for feature extraction, acoustic and language modeling, discriminative training, utterance verification and numeric understanding and validation. Experimental results from a field-trial of a spoken dialogue system are presented that include customers' responses to credit card and telephone number requests. (C) 2001 Elsevier Science B.V. All rights reserved. C1 AT&T Labs Res, Florham Pk, NJ 07932 USA. RP Rahim, M (reprint author), AT&T Labs Res, Room E105,180 Pk Ave, Florham Pk, NJ 07932 USA. EM mazin@research.att.com RI riccardi, gabriele/A-9269-2012 CR ABELLA A, 1997, P 5 EUR C SPEECH COM, P1879 Bickel P. 
J., 1977, MATH STAT BASIC IDEA BOYCE S, 1996, P INT S SPOK DIAL IS, P65 BUHRKE E, 1994, P INT C AC SPEECH SI CARDIN R, 1993, P INT C AC SPEECH SI, P243 CHOU W, 1995, P EUR C SPEECH COMM, P495 CHUCARROLL J, 1999, P EUR C SPEECH COMM, P1519 GORIN A, 1995, J ACOUST SOC AM, V97, P3441, DOI 10.1121/1.412431 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X JORDAN MI, 1994, NEURAL COMPUT, V6, P181, DOI 10.1162/neco.1994.6.2.181 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 Lamel Lori, 1999, P INT C AC SPEECH SI Lee H, 1992, BIOTECHNOL TECH, V6, P127, DOI 10.1007/BF02438817 LLEIDA E, 1995, P IEEE ASR WORKSH MITCHELL C, 1999, P INT C AC SPEECH SI OS E, 1999, P EUR C SPEECH COMM, P1527 PIERACCINI R, 1990, P AC SOC AM, P106 Rabiner L, 1993, FUNDAMENTALS SPEECH RAHIM M, 1999, P INT C AC SPEECH SI RAHIM M, 1999, P IEEE ASRU WORKSH RAHIM M, 1999, P EUR C SPEECH COMM, P495 Rahim MG, 1997, IEEE T SPEECH AUDI P, V5, P266, DOI 10.1109/89.568733 Rahim MG, 1996, IEEE SIGNAL PROC LET, V3, P107, DOI 10.1109/97.489062 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 RAMASWAMY G, 1999, P EUR C SPEECH COMM, P2662 RICCARDI G, 1998, ACL WORKSH VER LARG, P188 RICCARDI G, 1999, IEEE T SPEECH AUDIO, V8, P3 Riccardi G, 1996, COMPUT SPEECH LANG, V10, P265, DOI 10.1006/csla.1996.0014 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 SHARP RD, 1997, P INT C AC SPEECH SI, P4065 Shoup J. E., 1980, TRENDS SPEECH RECOGN WENDEMUTH A, 1999, P INT C AC SPEECH SI Wright J., 1998, P ICSLP WRIGHT JH, 1997, P 5 EUR C SPEECH COM, P1419 NR 34 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2001 VL 34 IS 1-2 BP 195 EP 212 DI 10.1016/S0167-6393(00)00054-6 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000014 ER PT J AU Huerta, JM Stern, RM AF Huerta, JM Stern, RM TI Distortion-class modeling for robust speech recognition under GSM RPE-LTP coding SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Methods for Speech Recognition in Adverse Conditions CY MAY, 1999 CL TAMPERE, FINLAND SP Nokia, COST249 AB We present a method to reduce the degradation in recognition accuracy introduced by full-rate GSM RPE-LTP coding by combining sets of acoustic models trained under different distortion conditions. During recognition, the a posteriori probabilities of an utterance are calculated as a weighted sum of the posteriors corresponding to the individual models. The phonemes used by the system's word pronunciations are grouped into classes according to the amount of distortion they undergo in coding. The acoustic model used in the decoding process is a weighted combination of models derived from clean speech and models derived from speech that had been degraded by GSM coding (the source models), with the relative combination of the two sources depending on the extent to which each class of phonemes is degraded by the coding process. To determine the distortion class membership, and hence the weights, we measure the spectral distortion introduced to the quantized long-term residual by the RPE-LTP codec. We discuss how this distortion varies according to phonetic class.
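The weighted combination just described can be sketched directly before the abstract continues. The distortion classes, interpolation weights and likelihood values below are invented for illustration and are not the paper's estimates.

```python
# Hypothetical interpolation weights per distortion class: phonemes the
# codec distorts heavily lean more on the GSM-trained source models.
LAMBDA = {"low": 0.8, "medium": 0.5, "high": 0.2}  # weight on clean models

def combined_likelihood(p_clean, p_gsm, distortion_class):
    """Weighted combination of two source models' frame likelihoods."""
    lam = LAMBDA[distortion_class]
    return lam * p_clean + (1.0 - lam) * p_gsm

# A vowel may pass through the codec nearly intact ("low" distortion),
# while a fricative may be heavily altered ("high" distortion).
print(combined_likelihood(0.02, 0.05, "low"))
print(combined_likelihood(0.02, 0.05, "high"))
```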
The method described reduces the degradation in recognition accuracy introduced by GSM coding of sentences in the TIMIT database by more than 70% relative to the baseline accuracy obtained in matched training and testing conditions with respect to a system using the source acoustic models, and up to 60% relative to the best baseline systems regardless of the number of Gaussians. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Carnegie Mellon Univ, Dept Elect Comp Engn, Pittsburgh, PA 15213 USA. Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA. RP Huerta, JM (reprint author), Carnegie Mellon Univ, Dept Elect Comp Engn, Pittsburgh, PA 15213 USA. EM juan@speech.cs.cmu.edu RI yu, yan/C-2322-2012 CR BEYERLEIN P, 1998, P IEEE INT C AC SPEE, V1, P481, DOI 10.1109/ICASSP.1998.674472 DAS S, 1999, P EUROSPEECH 1999 DEGENER J, 1992, GSM SPEECH COMPRESSI DELPHINPOULAT L, 1997, IEEE P ASRU 1997 DIGALAKIS V, 1998, P ICASSP 1998 DUFOUR S, 1996, P ICASSP 1996 ELVIRA JM, 1998, P ICASSP 1998 EULER S, 1994, IEEE T ACOUSTICS SPE *EUR TEL STAND I, 1994, EUR DIG TEL SYST PHA FISSORE L, 1999, P WORKSH ROB METH SP GALLARDOANTOLIN A, 1999, P ICASSP 1999 GALLARDOANTOLIN A, 1998, P ICSLP 1998 GILLICK L, 1989, P ICASSP 1989 GUPTA SK, 1996, P ICASSP 1996 HAAVISTO P, 1999, P WORKSH ROB METH SP HAEBUMBACH R, 1997, P EUROSPEECH 97 HUANG XD, 1996, P 1996 IEEE INT C AC HUERTA JM, 1999, P ROB METH SPEECH RE HUERTA JM, 1998, P ICSLP 98 KARRAY L, 1998, P ICASSP 1998 Kleijn W. B., 1995, SPEECH CODING SYNTHE KROON P, 1986, IEEE T ACOUST SPEECH, V34, P1054, DOI 10.1109/TASSP.1986.1164946 KROON P, 1995, SPEECH CODING SYNTHE *LDC, 1993, TIMIT AC PHON CONT S Lilly B. T., 1996, P ICSLP 96 MING J, 1999, P ICASSP 1999 MOKBEL C, 1996, P 2 IEEE WORKSH INT PAPING M, 1997, P EUR 1997 PUEL JB, 1997, P EUR 1997 SALONIDIS T, 1998, P ICASSP 1998 SOULAS T, 1997, P ICASSP 1997 VARY P, 1988, SPEECH COMMUN, V7, P209, DOI 10.1016/0167-6393(88)90040-4 NR 32 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2001 VL 34 IS 1-2 BP 213 EP 225 DI 10.1016/S0167-6393(00)00055-8 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 408YP UT WOS:000167355000015 ER PT J AU Botinis, A AF Botinis, A TI Intonation SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Univ Skovde, Dept Languages, S-54128 Skovde, Sweden. RP Botinis, A (reprint author), Univ Skovde, Dept Languages, Box 408, S-54128 Skovde, Sweden. EM antonis.botinis@isp.his.se NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2001 VL 33 IS 4 BP 261 EP 262 DI 10.1016/S0167-6393(00)00059-5 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 400TH UT WOS:000166887400001 ER PT J AU Botinis, A Granstrom, B Mobius, B AF Botinis, A Granstrom, B Mobius, B TI Developments and paradigms in intonation research SO SPEECH COMMUNICATION LA English DT Review DE pitch; tone; stress; accent; intonation models; intonation applications ID ENGLISH INTONATION; SPEECH SYNTHESIS; PITCH CONTOURS; MODEL; STRESS; VOWELS; PERCEPTION; MANDARIN; PROSODY; GERMAN AB The present tutorial paper is addressed to a wide audience with different discipline backgrounds as well as variable expertise on intonation. The paper is structured into five sections. 
In Section 1, "Introduction", basic concepts of intonation and prosody are summarised and cornerstones of intonation research are highlighted. In Section 2, "Functions and forms of intonation", a wide range of functions from morpholexical and phrase levels to discourse and dialogue levels are discussed and forms of intonation with examples from different languages are presented. In Section 3, "Modelling and labelling of intonation", established models of intonation as well as labelling systems are presented. In Section 4, "Applications of intonation" the most widespread applications of intonation and especially technological ones are presented and methodological issues are discussed. In Section 5, "Research perspective" research avenues and ultimate goals as well as the significance and benefits of intonation research in the upcoming years are outlined. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Skovde, Dept Languages, S-54128 Skovde, Sweden. Royal Inst Technol, Dept Speech Mus & Hearing, Stockholm, Sweden. Univ Stuttgart, Inst Nat Language Proc, D-7000 Stuttgart, Germany. RP Botinis, A (reprint author), Univ Skovde, Dept Languages, POB 408, S-54128 Skovde, Sweden. EM antonis.botinis@.isp.his.se CR Adriaens L.M.H, 1991, THESIS TU EINDHOVEN 't Hart J., 1975, J PHONETICS, V3, P235 ATAL BS, 1972, J ACOUST SOC AM, V52, P1687, DOI 10.1121/1.1913303 ATKINSON JE, 1978, J ACOUST SOC AM, V63, P211, DOI 10.1121/1.381716 BANNERT R, 1990, VAG MOT SVENSKT UTTA BANNERT R, 1988, KOPENHAGENER BEITRAG, V24, P26 BANNERT R, 1985, FOLIA LINGUIST, V19, P321, DOI 10.1515/flin.1985.19.3-4.321 Beckman M, 1997, GUIDELINES TOBI LABE Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X BELL AG, 1879, AM J OTOL, V1, P163 BERINSTEIN AE, 1979, UCLA WORKING PAPERS, P1 BERTENSTAM J, 1997, PHONUM, V4, P57 Bloomfield Leonard, 1933, LANGUAGE Bolinger D., 1958, WORD, V14, P109 BOTINIS A, 1997, P ESCA WORKSH INT AT, P47 BOTINIS A, 1992, TRAVAUX I PHONETIQUE, V14, P13 BOTINIS A, 1999, P SWED PHON C FON 99, P41 Botinis A., 1998, INTONATION SYSTEMS S, P288 Botinis Antonis, 1989, STRESS PROSODIC STRU Brazil D., 1997, COMMUNICATIVE VALUE Brazil D., 1980, DISCOURSE INTONATION BRESNAN JW, 1971, LANGUAGE, V47, P257, DOI 10.2307/412081 Brown G., 1983, DISCOURSE ANAL Brown Gillian, 1980, QUESTIONS INTONATION BRUCE G, 1993, P ESCA WORKSH PROS L, P180 BRUCE G, 1998, PRACTICAL LINGUISTIC, V16 Bruce Gosta, 1978, NORDIC PROSODY, P219 Bruce Gosta, 1977, SWEDISH WORD ACCENTS CARLSON R, 1990, ADV SPEECH HEARING L, P269 Chomsky N., 1972, STUDIES SEMANTICS GE, P62 Collier R., 1990, PERCEPTUAL STUDY INT COLLIER R, 1975, J ACOUST SOC AM, V58, P249, DOI 10.1121/1.380654 Cooper W. 
E., 1981, FUNDAMENTAL FREQUENC Cruttenden Alan, 1997, INTONATION, V2nd Cutler Anne, 1983, PROSODY MODELS MEASU DALESSANDRO C, 1995, COMPUT SPEECH LANG, V9, P257, DOI 10.1006/csla.1995.0013 DANES F, 1960, WORD, V16, P34 de Pijper Jan Roelof, 1983, MODELLING BRIT ENGLI Di Cristo A., 1998, INTONATION SYSTEMS S, P195 DICRISTO A, 1986, PHONETICA, V43, P11 DICRISTO A, 2000, IN PRESS INTONATION FANT G, 2000, IN PRESS INTONATION FISCHERJORGENSEN E, 1990, PHONETICA, V47, P99 Fourakis M, 1999, PHONETICA, V56, P28, DOI 10.1159/000028439 FRY DB, 1958, LANG SPEECH, V1, P126 Fujisaki H., 1983, PRODUCTION SPEECH, P39 Fujisaki H., 1988, VOCAL PHYSL VOICE PR, P347 Garde P., 1968, ACCENT GARDING E, 1977, SO CALIFORNIA OCCASI, V4, P27 Garding E, 1983, PROSODY MODELS MEASU, P11 GARDING E, 1982, WORKING PAPERS, V22, P137 Garding Eva, 1977, SCANDINAVIAN WORD AC Gimson A. C., 1962, INTRO PRONUNCIATION Goldsmith J., 1990, AUTOSEGMENTAL METRIC GOLDSMITH J, 1976, LINGUIST ANAL, V2, P23 Goldsmith John A, 1976, THESIS MIT GRONNUM NT, 1982, J PHONETICS, V39, P302 GRONNUM NT, 1978, J PHONETICS, V6, P151 Gussenhoven C., 1984, GRAMMAR SEMANTICS SE HADDINGKOCH K, 1961, ACOUSTICOPHONETIC ST HADDINGKOCH K, 1964, PHONETICA, V11, P175 Halliday M. A. K., 1967, J LINGUIST, V3, P199, DOI DOI 10.1017/S0022226700016613 Hamon C., 1989, P INT C AC SPEECH SI, P238 Helmholtz H, 1877, SENSATIONS TONE Hirschberg J., 1986, P 24 ANN M ASS COMP, P136, DOI 10.3115/981131.981152 HIRST D, 1991, P 12 INT C PHON SCI, P234 Hirst D., 1998, INTONATION SYSTEMS S HIRST D, 1993, LINGUIST INQ, V24, P781 Hirst D. J., 1998, INTONATION SYSTEMS S, P1 Hirst Daniel, 1998, INTONATION SYSTEMS S, P96 HIRST DJ, 1994, P 2 ESCA IEEE WORKSH, P77 Hockett C. F., 1955, MANUAL PHONOLOGY HOMBERT Jean-Marie, 1978, TONE LINGUISTIC SURV, P77 House David, 1990, TONAL PERCEPTION SPE Hyman Larry M., 1977, SO CALIFORNIA OCCASI, V4, P37 Jackendoff Ray S., 1972, SEMANTIC INTERPRETAT Jones Daniel, 1956, OUTLINE ENGLISH PHON JUN SA, 2000, IN PRESS INTONATION KOENIG W, 1946, J ACOUST SOC AM, V18, P19, DOI 10.1121/1.1916342 KOHLER KJ, 1991, J PHONETICS, V19, P121 KOHLER KJ, 1987, P 11 INT C PHON SC T, P149 Kohler K.J., 1990, PAPERS LAB PHONOLOGY, P115 LADD DR, 1988, J ACOUST SOC AM, V84, P530, DOI 10.1121/1.396830 Ladd D. 
R., 1996, INTONATIONAL PHONOLO LADD DR, 1983, LANGUAGE, V59, P721, DOI 10.2307/413371 Ladefoged P., 1967, 3 AREAS EXPT PHONETI LADEFOGED P, 1963, J ACOUST SOC AM, V35, P454, DOI 10.1121/1.1918503 LAMBRECHT K, 1996, INFORMATION STRUCTUR Lea W., 1980, TRENDS SPEECH RECOGN, P166 LEBEN WR, 1976, LINGUIST ANAL, V2, P69 Lehiste I., 1970, SUPRASEGMENTALS LIBERMAN M, 1977, LINGUIST INQ, V8, P249 Liberman Mark, 1984, LANGUAGE SOUND STRUC, P157 Lieberman Philip, 1967, INTONATION PERCEPTIO LIESKE C, 1997, P EUR C SPEECH COMM, P1431 MALFRERE F, 1998, P 3 ESCA WORKSH SPEE, P323 MALMBERG B, 1967, STRUCTURAL LINGUISTI Martinet Andre, 1954, MISC PHONET, V2, P13 Matsui T, 1990, P INT C SPOK LANG PR, P137 MERON Y, 1996, P INT C SPOK LANG PR, P1449 MERTENS P, 1989, P EUR C SPEECH COMM, P46 Mertens Piet, 1987, THESIS KATHOLIEKE U MEYER EA, 1937, STUDIES SCAND PHILOL, V10 MEYER EA, 1954, STUDIES SCAND PHILOL, V11 MOBIUS B, 1995, P 13 INT C PHON SC S, P108 Mobius B., 1993, QUANTITATIVES MODELL MOHLER G, 1998, IMS FESTIVAL MOHLER G, 1998, THEORIEBASIERTE MODE MONAGHAN AIC, 1993, J PRAGMATICS, V19, P559, DOI 10.1016/0378-2166(93)90112-3 MONAGHAN AIC, 1990, SPEECH COMMUN, V9, P305, DOI 10.1016/0167-6393(90)90006-U MONAGHAN AIC, 1998, STATE ART SUMMARY EU Morlec Y, 2001, SPEECH COMMUN, V33, P357, DOI 10.1016/S0167-6393(00)00065-0 MORTON K, 1995, P EUR C SPEECH COMM, P1819 Nespor M., 1986, PROSODIC PHONOLOGY NICKERSON RS, 1976, J SPEECH HEAR DISORD, V41, P120 NIEMANN H, 1997, P IEEE INT C AC SPEE, P75 NILSONNE A, 1987, THESIS KAROLINSKA I NORD N, 1995, SCAND J LOGOPEDICS P, V20, P107 Ode C., 1989, RUSSIAN INTONATION P OHALA Johni, 1978, TONE LINGUISTIC SURV, P5 OHMAN SEG, 1967, 23 STLQPSR, P20 OHMAN SEG, 1966, 4 STLQPSR, P1 OSTER AM, 1997, PHONUM, V4, P145 PARRIS ES, 1996, P IEEE ICASSP, P685 Pierrehumbert J., 1988, JAPANESE TONE STRUCT Pierrehumbert J, 1980, THESIS MIT Pike K. L., 1945, INTONATION AM ENGLIS Potter R. K., 1947, VISIBLE SPEECH POTTER RK, 1945, SCIENCE, V102, P463, DOI 10.1126/science.102.2654.463 RISBERG A, 1976, AM ANN DEAF, P178 ROONEY E, 1992, P INT C SPOK LANG PR, P413 ROSSI M, 1978, LANG SPEECH, V21, P284 ROSSI M, 2000, IN PRESS INTONATION Rossi Mario, 1999, INTONATION SYSTEME F ROSSLER W, 1985, NEUROPSYCHIATRIE, V1, P8 SEARLE JR, 1976, LANG SOC, V5, P1 Searle John R., 1969, SPEECH ACTS Searle J.R., 1979, EXPRESSION MEANING SILKERK EO, 1984, PHONOLOGY SYNTAX REL Silverman K., 1992, P INT C SPOK LANG PR, P867 Silverman K. E. A., 1987, THESIS U CAMBRIDGE C Stockwell R. P., 1972, INTONATION, P87 SUNDSTROM A, 1997, THESIS KTH STOCKHOLM TAMS A, 1995, P EUR C SPEECH COMM, P2081 TATHAM M, 2000, P SWED PHON C FON 20, P133 TATHAM M, 1995, SPOKEN DIALOGUE SYST, P221 Taylor P, 2000, J ACOUST SOC AM, V107, P1697, DOI 10.1121/1.428453 TAYLOR PA, 1994, PHONETIC MODEL INTON Terken J., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1003 THORSEN NG, 1998, INTONATION SYSTEMS S, P131 THORSEN NG, 1995, P 13 INT C PHON SC S, P124 THORSEN NG, 1992, GROUNDWORKS DANISH I Thyme-Gobbel A. E., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607971 Trager George L., 1951, OUTLINE ENGLISH STRU TRUBETZKOY, 1939, GRUNDZUGE PHONOLOGIE VAISSIERE J, 1995, PHONETICA, V52, P123 van Santen J. P. 
H., 1998, P 3 ESCA WORKSH SPEE, P293 Van Santen J.P.H., 1994, P INT C SPOK LANG PR, P719 VANGEEL R, 1983, THESIS U UTRECHT VANHEMERT JP, 1987, ANAL SYNTHESE GESPRO, P34 VANHEUVEN VJ, 2000, IN PRESS INTONATION VANHEUVEN VJ, 1993, ANAL SYNTHESIS SPEEC VANSANTEN J, 1997, P ESCA WORKSH INT TH, P321 VANSANTEN JPH, 2000, IN PRESS INTONATION van Santen J. P. H., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1004 Venditti J.J., 1998, P 3 ESCA WORKSH SPEE, P317 VERONIS J, 1998, P INT C SPOK LANG PR, P2899 Veronis J, 1998, SPEECH COMMUN, V26, P233, DOI 10.1016/S0167-6393(98)00063-6 WANG HD, 1993, P EUR C SPEECH COMM, P991 WELTENS B, 1984, SPEECH COMMUN, V3, P157, DOI 10.1016/0167-6393(84)90037-2 WHALEN DH, 1995, J PHONETICS, V23, P349, DOI 10.1016/S0095-4470(95)80165-0 WHEATSTONE C, 1837, WESTMINSTER REV, V27, P30 WILLEMS N, 1988, J ACOUST SOC AM, V84, P1250, DOI 10.1121/1.396625 Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086 Xu Y, 1997, J PHONETICS, V25, P61, DOI 10.1006/jpho.1996.0034 Xu Y, 2001, SPEECH COMMUN, V33, P319, DOI 10.1016/S0167-6393(00)00063-7 NR 176 TC 33 Z9 34 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2001 VL 33 IS 4 BP 263 EP 296 DI 10.1016/S0167-6393(00)00060-1 PG 34 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 400TH UT WOS:000166887400002 ER PT J AU Swerts, M Veldhuis, R AF Swerts, M Veldhuis, R TI The effect of speech melody on voice quality SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop of the European-Speech-Communication-Association on Intonation (ESCA) CY SEP 18-20, 1997 CL ATHENS, GREECE SP European Speech Commun Assoc, Athens Univ, Hellen Minist Educ, Hellen Minist Dev, Gen Secretariat Res & Technol DE voice quality; intonation; H1-H2; glottal pulse ID FEMALE SPEAKERS; PERCEPTION AB This paper explores whether a speaker's voice quality, defined as the perceived timbre of someone's speech, changes as a function of variation in speech melody. Analyses are based on several productions of the vowel 'a', provided with different intonation patterns. It appears that in general fundamental frequency covaries with the strength relationship between the first two harmonics (H1-H2). That relationship determines the voice quality to some extent, and is often claimed to reflect open quotient. However, correlating the H1-H2 measure to parameters of the LF-model reveals that both the open quotient and the skewness of the glottal pulse have an impact on the lower part of the harmonic spectrum. (C) 2001 Elsevier Science B.V. All rights reserved. C1 IPO, Ctr User Syst Interact, NL-5600 MB Eindhoven, Netherlands. Univ Instelling Antwerp, CNTS, B-2610 Antwerp, Belgium. RP Swerts, M (reprint author), IPO, Ctr User Syst Interact, POB 513, NL-5600 MB Eindhoven, Netherlands. EM m.g.j.s@tue.nl RI Swerts, Marc/C-8855-2013 CR Baken R.
J., 1987, CLIN MEASUREMENTS SP CHASAIDE AN, 1997, HDB PHONETIC SCI CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 CLEVELAND T, 1983, 4 STLQPSR, P24 DOVAL B, 1997, P ICASSP 97, P1295 Fant G, 1997, SPEECH COMMUN, V22, P125, DOI 10.1016/S0167-6393(97)00017-4 Fant G., 1960, ACOUSTIC THEORY SPEE FANT G, 1995, TIME FREQUENCY DOMAI Fant G., 1985, 4 PARAMETER MODEL GL, P1 GOBL C, 1988, STL QPSR, V1, P123 Hanson HM, 1997, J ACOUST SOC AM, V101, P466, DOI 10.1121/1.417991 HOLMBERG E, 1995, P ICPHS 95 STOCKH, P178 HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829 KARLSSON I, 1996, 21996 TMHQPSR, P143 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 KOREMAN J, 1995, PHONUS, V1, P105 KULLBACK S, 1951, ANN MATH STAT, V22, P79, DOI 10.1214/aoms/1177729694 OLIVEIRA LC, 1996, PROGR SPEECH SYNTHES, P27 PIERREHUMBERT J, 1989, 4 STLQPSR, P23 Sluijter A., 1995, PHONETIC CORRELATES STEVENS K, 1994, INT S PROS YOK JAP, P53 Stevens KN, 1994, VOCAL FOLD PHYSL, P147 NR 22 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2001 VL 33 IS 4 BP 297 EP 303 DI 10.1016/S0167-6393(00)00061-3 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 400TH UT WOS:000166887400003 ER PT J AU Beaugendre, F House, D Hermes, DJ AF Beaugendre, F House, D Hermes, DJ TI Accentuation boundaries in Dutch, French and Swedish SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop of the European-Speech-Communication-Association on Intonation (ESCA) CY SEP 18-20, 1997 CL ATHENS, GREECE SP European Speech Commun Assoc, Athens Univ, Hellen Minist Educ, Hellen Minist Dev, Gen Secretariat Res & Technol DE intonation; accentuation; temporal alignment; language differences AB This paper presents a comparative study investigating the relation between the timing of a rising or falling pitch movement and the temporal structure of the syllable it accentuates for three languages: Dutch, French and Swedish. In a perception experiment, the five-syllable utterances /mamamamama/ and /ʔaʔaʔaʔaʔa/ were provided with a relatively fast rising or falling pitch movement. The timing of the movement was systematically varied so that it accented the third or the fourth syllable. Subjects were asked to indicate which syllable they perceived as accented. The accentuation boundary (AB) between the third and the fourth syllable was then defined as the moment before which more than half of the subjects indicated the third syllable as accented and after which more than half of the subjects indicated the fourth syllable. The results show that there are significant differences between the three languages as to the location of the AB. In general, for the rises, well-defined ABs were found. They were located in the middle of the vowel of the third syllable for French subjects, and later in that vowel for Dutch and Swedish subjects. For the falls, a clear AB was obtained only for the Dutch and the Swedish listeners. This was located at the end of the third syllable. For the French listeners, the fall did not yield a clear AB. This corroborates the absence of accentuation by means of falls in French. By varying the duration of the pitch movement, it could be shown that, in all cases in which a clear AB was found, the cue for accentuation was located at the beginning of the pitch movement. (C) 2001 Elsevier Science B.V. All rights reserved.
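The accentuation boundary above is operationally a 50% crossing in listeners' responses. The sketch below locates such a crossing by linear interpolation over response proportions; the timing grid, the proportions and the function name are invented for illustration and are not taken from the study.

```python
# Locate an accentuation boundary (AB) as defined in the abstract above: the
# timing before which >50% of listeners report the third syllable as accented
# and after which >50% report the fourth. All data values here are invented.

def accentuation_boundary(timings_ms, prop_third):
    """Linearly interpolate the 50% crossing of the 'third syllable' curve."""
    points = list(zip(timings_ms, prop_third))
    for (t0, p0), (t1, p1) in zip(points, points[1:]):
        if p0 >= 0.5 > p1:  # response curve falls through 50% between t0 and t1
            return t0 + (p0 - 0.5) / (p0 - p1) * (t1 - t0)
    return None  # no clear AB (cf. the falls for the French listeners)

timings = [0, 40, 80, 120, 160, 200]            # onset of pitch movement, ms
p_third = [0.95, 0.90, 0.75, 0.40, 0.15, 0.05]  # proportion choosing syllable 3

print(f"AB at roughly {accentuation_boundary(timings, p_third):.0f} ms")
```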
C1 IPO, Ctr Res User Syst Interact, NL-5600 MB Eindhoven, Netherlands. KTH, Dept Speech Mus & Hearing, Stockholm, Sweden. Lund Univ, S-22100 Lund, Sweden. Univ Skovde, Skovde, Sweden. RP Beaugendre, F (reprint author), Lernout & Hauspie Speech Prod, Koning Albert I Laan 64, B-1780 Wemmel, Belgium. CR 't Hart J., 1991, PERCEPTUAL STUDY INT BEAUGENDRE F, 1994, THESIS U PARIS 11 OR Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X BRUCE G, 1992, TALKING MACHINES THE, P113 BRUCE G, 1993, P EUR C SPEECH COMM, V2, P1205 Bruce Gosta, 1977, SWEDISH WORD ACCENTS COLLIER R, 1970, 5 IPO, P82 Hamon C., 1989, P INT C AC SPEECH SI, P238 Hermes DJ, 1997, J ACOUST SOC AM, V102, P2390, DOI 10.1121/1.419623 HERMES DJ, 1997, 32 IPO, P131 HERMES DJ, 1997, P ESCA TUT RES WORKS, P879 HERMES DJ, 1991, J ACOUST SOC AM, V90, P97, DOI 10.1121/1.402397 HILL DR, 1977, INT J MAN MACH STUD, V9, P337, DOI 10.1016/S0020-7373(77)80030-8 HOUSE D, 1998, P INT C SPOK LANG PR, P2799 HOUSE D, 1997, P EUR C SPEECH COMM, V2, P879 House David, 1990, TONAL PERCEPTION SPE Ladd D. R., 1996, INTONATIONAL PHONOLO RIETVELD T, 1995, J PHONETICS, V23, P375, DOI 10.1006/jpho.1995.0029 SAGEY E, 1988, LINGUIST INQ, V19, P108 TERKEN J, IN PRESS PERCEPTION Touati P., 1987, STRUCTURES PROSODIQU Vaissiere Jacqueline, 1980, ANN SCUOLA NORMALE S, P530 vanHeuven VJ, 1996, J ACOUST SOC AM, V100, P2439, DOI 10.1121/1.417952 VANKATWIJK A, 1967, 2 IPO, P115 NR 24 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2001 VL 33 IS 4 BP 305 EP 318 DI 10.1016/S0167-6393(00)00062-5 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 400TH UT WOS:000166887400004 ER PT J AU Xu, Y Wang, QE AF Xu, Y Wang, QE TI Pitch targets and their realization: Evidence from Mandarin Chinese SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop of the European-Speech-Communication-Association on Intonation (ESCA) CY SEP 18-20, 1997 CL ATHENS, GREECE SP European Speech Commun Assoc, Athens Univ, Hellen Minist Educ, Hellen Minist Dev, Gen Secretariat Res & Technol DE pitch targets; tone; pitch accents; F-0 contour; peak delay; articulatory constraint; downstep; declination; topic initiation ID PHONOLOGICAL FEATURES; TONAL COARTICULATION; INTONATION; ALIGNMENT; FOCUS; DECLINATION; ENGLISH; PROSODY; SPANISH; CONTOUR AB In this paper, we propose a preliminary framework for accounting for certain surface F-0 variations in speech. The framework consists of definitions for pitch targets and rules of their implementation. Pitch targets are defined as the smallest operable units associated with linguistically functional pitch units, and they are comparable to segmental phones. The implementation rules are based on possible articulatory constraints on the production of surface F-0 contours. Due to these constraints, the implementation of a simple pitch target may result in surface F-0 forms that only partially reflect the underlying pitch targets. We will also discuss possible implications of this framework for our understanding of various observed F-0 patterns, including carryover and anticipatory variations, downstep, declination, and F-0 peak alignment. Finally, we will consider possible interactions between local and non-local pitch targets. (C) 2001 Elsevier Science B.V. All rights reserved.
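A toy numerical illustration of the claim in the record above, not the authors' implementation: if articulation can only move F0 towards an underlying target at a limited rate, a short syllable ends before the target is reached, so the surface contour reflects the target only partially and carries the deficit into the next syllable. The rate constant and all frequency values are assumptions.

```python
import math

def realise_target(f0_onset, target, dur_ms, lam=0.015, step=10):
    """Surface F0 for one syllable: exponential approach to a static target.
    lam (1/ms) caps how fast F0 may change; values here are illustrative."""
    return [target + (f0_onset - target) * math.exp(-lam * t)
            for t in range(0, dur_ms + step, step)]

# Two syllables with a High (180 Hz) then a Low (110 Hz) underlying target.
syl1 = realise_target(f0_onset=120, target=180, dur_ms=150)
syl2 = realise_target(f0_onset=syl1[-1], target=110, dur_ms=150)  # carryover

print(f"syl1 ends at {syl1[-1]:.1f} Hz (target 180); "
      f"syl2 ends at {syl2[-1]:.1f} Hz (target 110)")
```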
C1 Northwestern Univ, Dept Commun Sci & Disorders Speech & Language Pat, Evanston, IL 60208 USA. Rush Univ, Dept Commun Sci & Disorders, Chicago, IL USA. Rush Presbyterian St Lukes Med Ctr, Chicago, IL 60612 USA. RP Xu, Y (reprint author), Northwestern Univ, Dept Commun Sci & Disorders Speech & Language Pat, 2299 N Campus Dr, Evanston, IL 60208 USA. EM xuqi@northwestern.edu RI Xu, Yi/C-4013-2008 OI Xu, Yi/0000-0002-8541-2658 CR Abramson A. S., 1979, STUDIES TAI MON KHME, P1 ABRAMSON AS, 1978, P 12 INT C LING VIEN Anderson S. R., 1978, TONE LINGUISTIC SURV, P133 Arnold G. F., 1961, INTONATION COLLOQUIA ARVANITI A, 1998, J PHONETICS, V36, P3 Bolinger D., 1989, INTONATION ITS USES Bruce Gosta, 1977, SWEDISH WORD ACCENTS CASPERS J, 1993, PHONETICA, V50, P161 Chao Yuan-Ren, 1930, MAITRE PHONETIQUE, V45, P24 Chao Yuen Ren, 1968, GRAMMAR SPOKEN CHINE CLARK M, 1978, THESIS U MASSACHUSET CLEMENTS GN, 1979, LINGUIST INQ, V10, P179 COHEN A, 1982, PHONETICA, V39, P254 Collier R., 1987, LARYNGEAL FUNCTION P, P403 COOPER WE, 1985, J ACOUST SOC AM, V77, P2142, DOI 10.1121/1.392372 Crystal D., 1969, PROSODIC SYSTEMS INT De Jong K., 1994, OSU WORKING PAPERS L, V43, P1 DUANMU S, 1994, LINGUIST INQ, V25, P555 EADY SJ, 1986, LANG SPEECH, V29, P233 EADY SJ, 1986, J ACOUST SOC AM, V80, P402, DOI 10.1121/1.394091 Elenius Kjell, 1995, P 13 INT C PHON SCI, P220 Fujisaki H., 1988, VOCAL PHYSL VOICE PR, P347 GANDOUR J, 1974, UCLA WORKING PAPERS, V27, P118 GANDOUR J, 1994, J PHONETICS, V22, P477 GARDING E, 1987, PHONETICA, V44, P13 GARDING E, 1979, PHONETICA, V36, P207 GELFER CE, 1985, VOCAL FOLD PHYSL BIO, P113 Goldsmith J., 1990, AUTOSEGMENTAL METRIC Goldsmith J., 1979, AUTOSEGMENTAL PHONOL GRIMM C, 1997, 1997 ANN M LING SOC Gronnum N., 1995, P 13 INT C PHON SCI, V2, P124 HIRSCHBERG J, 1992, J PHONETICS, V20, P241 HOMBERT Jean-Marie, 1978, TONE LINGUISTIC SURV, P77 HOMBERT JM, 1974, STUDIES AFRICAN LI S, V5, P169 HOWIE JM, 1974, PHONETICA, V30, P129 Hyman Larry M., 1974, LINGUIST INQ, V5, P81 HYMAN LM, 1993, PHONOLOGY TONE, P75 Jin S., 1996, THESIS OHIO STATE U Kelso J.A.S., 1984, AM J PHYSIOL-REG I, V246, pR1000 Ladd D. R., 1996, INTONATIONAL PHONOLO Ladd D. R., 1984, PHONOLOGY YB, V1, P53, DOI DOI 10.1017/S0952675700000294 Ladd D. R., 1995, P 13 INT C PHON SCI, V2, P116 LADD DR, 1983, LANGUAGE, V59, P721, DOI 10.2307/413371 LADD D R, 1984, Phonetica, V41, P31 LANIRAN Y, 1997, 71 ANN M LING SOC AM Laniran Yetunde O., 1992, THESIS CORNELL U Leben William Ronald, 1973, THESIS MIT Lehiste I., 1975, STRUCTURE PROCESS SP, P195 LEHISTE I, 1961, J ACOUST SOC AM, V33, P419, DOI 10.1121/1.1908681 Liberman Mark, 1984, LANGUAGE SOUND STRUC, P157 Lieberman Philip, 1967, INTONATION PERCEPTIO LIN M, 1991, P 12 INT C PHON SCI, P242 LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 MAEDA S, 1976, CHARACTERIZATION AM MANFREDI V, 1993, PHONOLOGY TONE, P133 MATTINGL.IG, 1966, LANG SPEECH, V9, P1 NAKAJIMA S, 1993, PHONETICA, V50, P197 OHALA JJ, 1973, J ACOUST SOC AM, V53, P345, DOI 10.1121/1.1982441 OHALA JJ, 1990, NATO ADV SCI I D-BEH, V55, P23 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 PIERREHUMBERT J, 1990, SYS DEV FDN, P271 Pierrehumbert J, 1980, THESIS MIT PIERREHUMBERT JB, 1989, PHONETICA, V46, P181 Pike K. L., 1948, TONE LANGUAGES Poser W. 
J., 1984, THESIS MIT Prieto P, 1996, J PHONETICS, V24, P445, DOI 10.1006/jpho.1996.0024 PRIETO P, 1995, J PHONETICS, V23, P429, DOI 10.1006/jpho.1995.0032 Rose P.J., 1988, PROSODIC ANAL ASIAN, P55 SCHMIDT RC, 1990, J EXP PSYCHOL HUMAN, V16, P227, DOI 10.1037//0096-1523.16.2.227 SCHUH RG, 1978, TONE LINGUISTIC SURV, P221 SHEN XS, 1990, J PHONETICS, V18, P281 SHI B, 1987, P 11 INT C PHON SCI, P142 Shih C., 1988, WORKING PAPERS CORNE, V3, P83 SHIH CL, 1997, INTONATION THEORY MO, P293 Silverman Kim E. A., 1990, PAPERS LABORATORY PH, P72 STEELE SA, 1986, J ACOUST SOC AM, V80, pS51, DOI 10.1121/1.2023842 STEELE SA, 1986, PHONETICA, V43, P92 STEVENS KN, 1963, J SPEECH HEAR RES, V6, P111 STEWART JM, 1983, J AFRICAN LANGUAGES, V3, P113 SUNDBERG J, 1979, J PHONETICS, V7, P71 TITZE IR, 1987, LARYNGEAL FUNCTION P, P304 UMEDA N, 1982, J PHONETICS, V10, P279 Van Santen J.P.H., 1994, P INT C SPOK LANG PR, P719 WANG WSY, 1967, INT J AM LINGUIST, V33, P93, DOI 10.1086/464946 WHALEN DH, 1995, J PHONETICS, V23, P349, DOI 10.1016/S0095-4470(95)80165-0 Woo Nancy, 1969, THESIS MIT XU CX, 1999, P 14 INT C PHON SCI, P2359 Xu Y, 1999, J PHONETICS, V27, P55, DOI 10.1006/jpho.1999.0086 Xu Y, 1997, J PHONETICS, V25, P61, DOI 10.1006/jpho.1996.0034 XU Y, 1994, J ACOUST SOC AM, V95, P2240, DOI 10.1121/1.408684 XU Y, 1997, P ESCA WORKSH INT TH, P337 Xu Y, 1993, THESIS U CONNECTICUT Xu Y., 1999, P 14 INT C PHON SCI, P1881 Xu Y, 1998, PHONETICA, V55, P179, DOI 10.1159/000028432 YIP M, 1991, TONAL PHONOLOGY CHIN NR 95 TC 90 Z9 98 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2001 VL 33 IS 4 BP 319 EP 337 DI 10.1016/S0167-6393(00)00063-7 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 400TH UT WOS:000166887400005 ER PT J AU D'Imperio, M AF D'Imperio, M TI Focus and tonal structure in Neapolitan Italian SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop of the European-Speech-Communication-Association on Intonation (ESCA) CY SEP 18-20, 1997 CL ATHENS, GREECE SP European Speech Commun Assoc, Athens Univ, Hellen Minist Educ, Hellen Minist Dev, Gen Secretariat Res & Technol DE intonation; focus; nuclear accent; tonal repulsion; tonal undershoot; broad focus; narrow focus; tonal compression; questions; early focus; postnuclear accent; alignment; perception; Italian; Neapolitan; truncation; prominence AB This work investigates some acoustic and perceptual characteristics of focal and postfocal accents in questions of Neapolitan Italian. In this variety, yes/no question pitch accents are characterized by a rise-fall configuration, with a very conspicuous peak (L* + H). When intended focus is early, a postfocal accent is produced, which aligns with the last stressed syllable of the intonation phrase (!H*). Results from a perception study suggest that the postfocal !H* is not the nuclear accent of the intonation phrase, despite being final. The phonetic and phonological nature of the focal L* + H and the postfocal !H* is also investigated in production through a set of yes/no questions varying in intended focus scope and focus placement. The results of this study support the hypothesis that focal and postfocal accents are structurally different, in that postfocal accents are acoustically much reduced. Finally, we explore the temporal alignment and melodic values of the initial rise and final fall in focus constituents varying in size.
The results suggest an effect of "tonal repulsion" (Silverman and Pierrehumbert, The timing of prenuclear high accents in English, in: J. Kingston, M.E. Beckman (Eds.), Papers in Laboratory Phonology: Between the Grammar and the Physics of Speech, Cambridge University, Cambridge, 1990, pp. 71-106) on the temporal location of the L* + H peak as well as "seeming truncation" of the focus constituent final fall in one-word focus constituents. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Ohio State Univ, Dept Linguist, Columbus, OH 43210 USA. RP D'Imperio, M (reprint author), Ohio State Univ, Dept Linguist, 1712 Neil Ave, 222 Oxley Hall, Columbus, OH 43210 USA. EM dimperio@ling.ohio-state.edu CR Bruce Gosta, 1977, SWEDISH WORD ACCENTS d'Imperio M., 1997, OHIO STATE U WORKING, V50, P19 D'Imperio M., 1997, P EUROSPEECH 97 RHOD, V1, P251 DIMPERIO M, 1999, IN PRESS P 14 INT C DIMPERIO M, 1995, J ACOUST SOC AM, V98, P2894, DOI 10.1121/1.414392 Grabe E, 1998, J PHONETICS, V26, P129, DOI 10.1006/jpho.1997.0072 Grice M., 1995, INTONATION INTERROGA Ladd R. D., 1980, STRUCTURE INTONATION Pierrehumbert J., 1988, JAPANESE TONE STRUCT Pierrehumbert J.B., 1980, THESIS Silverman K., 1990, PAPERS LAB PHONOLOGY, VI, P71 NR 11 TC 30 Z9 30 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2001 VL 33 IS 4 BP 339 EP 356 DI 10.1016/S0167-6393(00)00064-9 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 400TH UT WOS:000166887400006 ER PT J AU Morlec, Y Bailly, G Auberge, V AF Morlec, Y Bailly, G Auberge, V TI Generating prosodic attitudes in French: Data, model and evaluation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop of the European-Speech-Communication-Association on Intonation (ESCA) CY SEP 18-20, 1997 CL ATHENS, GREECE SP European Speech Commun Assoc, Athens Univ, Hellen Minist Educ, Hellen Minist Dev, Gen Secretariat Res & Technol DE automatic training; corpus design; F0 and macrorhythm generation; gating experiment; perceptual evaluation; prosodic attitudes; prosodic model; prosodic movement expansion ID INTONATION; PERCEPTION; SENTENCE AB A corpus of 322 syntactically balanced sentences uttered by one speaker with six different prosodic attitudes is analysed. The syntactic and phonotactic structures of the sentences are systematically varied in order to understand how two functions can be carried out in parallel in the prosodic continuum: (1) enunciative: demarcation of constituents; (2) illocutory: the speaker's attitude. The statistical analysis of the corpus demonstrates that global prototypical prosodic contours characterise each attitude. Such a global encoding is consistent with gating experiments showing that attitudes can be discriminated very early in utterances. These results are discussed in relation to a morphological and superpositional model of intonation. This model proposes that the information specific to each linguistic level (structure, hierarchy of constituents, semantic and pragmatic attributes) is encoded via superposed multiparametric contours. An implementation of this model is described that automatically captures and generates these prototypical prosodic contours. This implementation consists of parallel Recurrent Neural Networks, each responsible for the encoding of one linguistic level.
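A minimal sketch of the superpositional idea described in this record, not the ICP training scheme itself: each linguistic level contributes its own contour, and the surface melody is their superposition, here rendered as a simple sum in semitones over invented values.

```python
# Surface F0 per syllable as the superposition of two levels of contour:
# a global "attitude" shape plus local demarcation movements. Toy values only.
n = 10  # syllables in a hypothetical utterance

# Illocutory level: one prototypical global rise-fall for some attitude (st).
attitude = [4 * i / (n - 1) if i < n // 2 else 4 * (1 - i / (n - 1))
            for i in range(n)]

# Enunciative level: continuation rises marking constituent boundaries (st).
demarcation = [0.0] * n
for edge in (3, 7):            # hypothetical constituent-final syllables
    demarcation[edge] += 2.0

base = 100.0                   # speaker baseline, Hz
surface = [base * 2 ** ((a + d) / 12) for a, d in zip(attitude, demarcation)]
print([round(f0, 1) for f0 in surface])
```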
The identification rates of attitudes for both training and test synthetic utterances are similar to those for natural stimuli. We conclude that the study of discourse-level linguistic attributes such as prosodic attitudes is a valuable paradigm for comparing intonation models. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Grenoble 3, INPG, ENSERG, Inst Commun Parlee, UPRESA CNRS 5009, F-38700 Grenoble 01, France. RP Morlec, Y (reprint author), Univ Grenoble 3, INPG, ENSERG, Inst Commun Parlee, UPRESA CNRS 5009, 46 Ave Felix Viallet, F-38700 Grenoble 01, France. EM morlec@icp.inpg.fr CR AUBERGE V, 1997, P EUROSPEECH 97 RHOD, V2, P871 AUBERGE V, 1992, TALKING MACHINES THE, P307 BARBOSA P, 1994, SPEECH COMMUN, V15, P127, DOI 10.1016/0167-6393(94)90047-7 BOE LJ, 1975, B I PHONETIQUE GRENO Bolinger D., 1989, INTONATION ITS USES CALLAMAND M, 1973, INTONATION EXPRESSIV Campbell W. N., 1992, TALKING MACHINES THE, P211 CAMPBELL WN, 1991, J PHONETICS, V19, P37 CLARK HH, 1981, ELEMENTS DISCOURSE U, P11 Damasio A., 1994, DESCARTES ERROR EMOT DELATTRE PC, 1969, FRANCAIS MONDE, V64, P6 Elman JL, 1988, 8801 CRL U CAL SAN D Fonagy I., 1984, FOLIA LINGUIST, V17, P153 Fujisaki H., 1971, Annual Report of the Engineering Research Institute, Faculty of Engineering, University of Tokyo, V30 GARDING E, 1991, P 12 INT C PHON SCI, V1, P300 GEE JP, 1983, COGNITIVE PSYCHOL, V15, P411, DOI 10.1016/0010-0285(83)90014-2 GROSJEAN F, 1983, LINGUISTICS, V21, P501, DOI 10.1515/ling.1983.21.3.501 GUSSENHOVEN C, 1991, BERK C DUTCH LING 19, P139 HIRST D, 1991, P 12 INT C PHON SCI, V1, P305 JORDAN MI, 1988, 8827 COINS U MASS CO Kohler K. J., 1997, COMPUTING PROSODY, P187 MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 MORLEC Y, 1998, P 1 INT C LANG RES E, V1, P647 MORLEC Y, 1997, THESIS I NATL POLYTE MORONI V, 1997, ENQUETE QUELQUES ATT Murray IR, 1996, SPEECH COMMUN, V20, P85, DOI 10.1016/S0167-6393(96)00046-5 OHALA JJ, 1996, P INT C SPEECH LANG, V3, P1812, DOI 10.1109/ICSLP.1996.607982 OHMAN SEG, 1967, 23 KTH DEP SPEECH CO Petitot J., 1985, CATASTROPHES PAROLE PETITOT J, 1986, MORPHOLOGICAL TURN P Petitot J., 1990, REV SYNTHESE, V4, P139 PIERREHUMBERT J, 1981, J ACOUST SOC AM, V70, P985, DOI 10.1121/1.387033 SCHERER KR, 1996, P INT C SPEECH LANG t'Hart J., 1973, J PHONETICS, V1, P309 THORSEN NG, 1980, J ACOUST SOC AM, V67, P1014, DOI 10.1121/1.384069 VANHEUVEN VJ, 1997, ESCA WORKSH INT THEO, P317 NR 36 TC 15 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 2001 VL 33 IS 4 BP 357 EP 371 DI 10.1016/S0167-6393(00)00065-0 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 400TH UT WOS:000166887400007 ER PT J AU Shamsoddini, A Denbigh, PN AF Shamsoddini, A Denbigh, PN TI A sound segregation algorithm for reverberant conditions SO SPEECH COMMUNICATION LA English DT Article DE speech segregation; speech separation; speech enhancement; hearing aid; speech recognition; reverberation ID SPEECH; SEPARATION; HEARING AB A system has been developed that enables a wanted speech signal to be extracted from a background of unwanted speech and other interference under real-life conditions. Using only two microphones 25 cm apart, it exploits directional and harmonicity cues in a hybrid algorithm which takes advantage of both.
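A sketch of the directional half of the cue pair just mentioned (the harmonicity half is omitted): the inter-microphone delay of the dominant source, bounded by the 25 cm spacing, can be estimated by cross-correlation. This is a generic illustration, not the paper's algorithm; all signals are synthetic.

```python
import numpy as np

fs = 16000                       # sample rate, Hz
max_lag = int(0.25 / 343 * fs)   # 25 cm spacing -> |delay| <= ~0.73 ms (~11 samples)

rng = np.random.default_rng(0)
src = rng.standard_normal(4096)  # stand-in for a frame of source signal
true_lag = 5                     # source reaches mic 2 five samples later
mic1 = src + 0.1 * rng.standard_normal(4096)
mic2 = np.roll(src, true_lag) + 0.1 * rng.standard_normal(4096)

lags = np.arange(-max_lag, max_lag + 1)
xcorr = [float(np.dot(mic1, np.roll(mic2, -lag))) for lag in lags]
print("estimated delay:", lags[int(np.argmax(xcorr))], "samples; true:", true_lag)
```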
The output of the algorithm has been tested subjectively by human listeners, and objectively by a speech recognition system. These tests show improved intelligibility of the wanted speech signal. In one set of tests, for example, the performance of a speech recogniser on segregated speech at a signal-to-noise ratio of 6 dB was similar to its performance on unsegregated speech at a signal-to-noise ratio of 20 dB. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Sussex, Sch Engn, Brighton BN1 9QT, E Sussex, England. RP Denbigh, PN (reprint author), Univ Sussex, Sch Engn, Brighton BN1 9QT, E Sussex, England. EM p.n.denbigh@sussex.ac.uk CR ALLEN JB, 1977, IEEE T ACOUST SPEECH, V25, P235, DOI 10.1109/TASSP.1977.1162950 ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599 ALLEN JB, 1977, J ACOUST SOC AM, V62, P912, DOI 10.1121/1.381621 ALLEN JB, 1996, MODELING SENSORINEUR, P99 CHAZAN D, 1993, IEEE ICASSP 93, V93, P728 GHITZA O, 1988, J PHONETICS, V16, P109 CULLING JF, 1994, SPEECH COMMUN, V14, P71, DOI 10.1016/0167-6393(94)90058-2 DENBIGH PN, 1992, SPEECH COMMUN, V11, P119, DOI 10.1016/0167-6393(92)90006-S Kollmeier B, 1993, Scand Audiol Suppl, V38, P28 LINDEMANN E, 1995, IEEE SIGN PROC SOC W Luo H. Y., 1994, ISSIPNN '94. 1994 International Symposium on Speech, Image Processing and Neural Networks Proceedings (Cat. No.94TH0638-7), DOI 10.1109/SIPNN.1994.344897 MACLEOD A, 1990, British Journal of Audiology, V24, P29, DOI 10.3109/03005369009077840 MILLER G, 1951, J EXP PSYCHOL, P41 Moore B.C.J., 1995, PERCEPTUAL CONSEQUEN MOORE BCJ, 1990, ACTA OTO-LARYNGOL, P250 NEELY ST, 1979, J ACOUST SOC AM, V66, P165, DOI 10.1121/1.383069 PARSONS TW, 1976, J ACOUST SOC AM, V60, P911, DOI 10.1121/1.381172 SCHROEDE.MR, 1968, J ACOUST SOC AM, V43, P829, DOI 10.1121/1.1910902 SHAMSODDINI A, 1997, THESIS U SUSSEX BRIG STUBBS RJ, 1990, J ACOUST SOC AM, V87, P359, DOI 10.1121/1.399257 NR 20 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2001 VL 33 IS 3 BP 179 EP 196 DI 10.1016/S0167-6393(00)00015-7 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BW UT WOS:000166335100001 ER PT J AU Wu, CH Chen, YJ AF Wu, CH Chen, YJ TI Multi-keyword spotting of telephone speech using a fuzzy search algorithm and keyword-driven two-level CBSM SO SPEECH COMMUNICATION LA English DT Article DE multi-keyword spotting; fuzzy search algorithm; keyword-driven two-level CBSM; telephone speech recognition ID DISCRIMINATIVE UTTERANCE VERIFICATION; SPECTRAL SUBTRACTION; RECOGNITION; NOISE AB In telephone speech recognition, the acoustic mismatch between training and testing environments often causes a severe degradation in the recognition performance. This paper presents a keyword-driven two-level codebook-based stochastic matching (CBSM) algorithm to eliminate the acoustic mismatch. Additionally, in Mandarin speech, it is difficult to correctly recognize the unvoiced part in a syllable. In order to reduce the recognition error of unvoiced segments, a fuzzy search algorithm is proposed to extract keyword candidates from a syllable lattice. Finally, a keyword relation and a weighting function for keyword combinations are presented for multi-keyword spotting. In the multi-keyword spotting of Mandarin speech, 94 right context-dependent and 38 context-independent subsyllables are used as the basic recognition units.
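An illustrative sketch of the fuzzy-search idea in this record: when scoring a keyword against recognised syllables, confusable unvoiced initials incur a reduced penalty rather than a full mismatch. The syllable encoding, the confusion pairs and all costs below are invented, not the paper's.

```python
# Score keyword candidates leniently where Mandarin unvoiced initials are
# easily confused. A syllable is modelled as an (initial, final) pair.
CONFUSABLE = {("zh", "z"), ("ch", "c"), ("sh", "s")}  # hypothetical pairs

def syllable_cost(a, b):
    (ini_a, fin_a), (ini_b, fin_b) = a, b
    if a == b:
        return 0.0
    if fin_a == fin_b and ((ini_a, ini_b) in CONFUSABLE
                           or (ini_b, ini_a) in CONFUSABLE):
        return 0.3          # fuzzy match on a confusable unvoiced initial
    return 1.0              # plain mismatch

def keyword_score(keyword, hypothesis):
    """Mean per-syllable cost; lower keeps the candidate (equal lengths assumed)."""
    return sum(map(syllable_cost, keyword, hypothesis)) / len(keyword)

kw = [("zh", "ang"), ("s", "an")]    # hypothetical keyword syllables
hyp = [("z", "ang"), ("s", "an")]    # recogniser heard z- for zh-
print(keyword_score(kw, hyp))        # 0.15 -> survives as a keyword candidate
```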
A corresponding anti-subsyllable model for each subsyllable is trained and used for verification. In this system, 2583 faculty names and 39 department names are selected as the primary keywords and the secondary keywords, respectively. Using a testing set of 3088 conversational speech utterances from 33 speakers (20 male, 13 female), these techniques reduced the recognition error rate from 29.6% to 20.6% for multi-keywords embedded in non-keyword speech. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. RP Wu, CH (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. RI Wu, Chung-Hsien/E-7970-2013 CR BAI BR, 1997, P 1997 IEEE INT C AC, V2, P903 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Fukunaga K., 1972, INTRO STAT PATTERN R HUANG TL, 1994, IEEE PARALL DISTRIB, V2, P3 LAWRENCE C, 1997, P EUR 97, P2567 LECOMTE I, 1989, P INT C ACOUST SPEEC, V1, P512 LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P471 Mokbel C., 1994, P ICSLP, P987 RABINER L, 1993, FUNDAMENTALS SPEECH, P365 Rahim MG, 1997, IEEE T SPEECH AUDI P, V5, P266, DOI 10.1109/89.568733 Rahim MG, 1996, IEEE SIGNAL PROC LET, V3, P107, DOI 10.1109/97.489062 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 Sim BL, 1998, IEEE T SPEECH AUDI P, V6, P328 Sukkar RA, 1996, IEEE T SPEECH AUDI P, V4, P420, DOI 10.1109/89.544527 Yoma NB, 1998, IEEE T SPEECH AUDI P, V6, P579, DOI 10.1109/89.725325 NR 16 TC 9 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2001 VL 33 IS 3 BP 197 EP 212 DI 10.1016/S0167-6393(00)00016-9 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BW UT WOS:000166335100002 ER PT J AU Meng, H AF Meng, H TI A hierarchical lexical representation for bi-directional spelling-to-pronunciation/pronunciation-to-spelling generation SO SPEECH COMMUNICATION LA English DT Article ID ENGLISH; RULES AB We propose a hierarchical framework for integrating a variety of linguistic knowledge sources of lexical representation in English, in order to facilitate their concurrent utilization in language applications. Our unified lexical representation encompasses information including morphology, stress, syllabification, phonemics and graphemics. Each linguistic knowledge source occupies a distinct stratum in the hierarchy. The merits of the proposed framework are demonstrated on the test bed of bi-directional spelling-to-pronunciation/pronunciation-to-spelling generation. Constraints from the multiple linguistic knowledge sources are administered in parallel during generation, by means of a probabilistic parsing paradigm. This paper extends the previous work on spelling-to-pronunciation generation as reported in Meng et al. (1996), by presenting our full results on bi-directional generation which includes pronunciation-to-spelling generation. We will also introduce a robust parsing technique which is aimed at maximizing the coverage of our parser for generation. We believe that our formalism will be especially applicable for augmenting the vocabulary of existing speech recognition and synthesis systems.
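One way to picture the multi-stratum lexical entry described above is as a shared hierarchy from which either surface form can be read off. This is a sketch under assumptions, with an invented entry and symbol set, not the paper's layered parsing formalism.

```python
# A single hierarchy carries morphology, stress, phonemics and graphemics, so
# pronunciation and spelling are two read-outs of the same structure.
entry = {
    "word": "famous",
    "morphology": ["fame", "-ous"],  # stem + derivational suffix
    "syllables": [
        {"stress": "primary", "phonemes": ["f", "ey"], "graphemes": ["f", "a"]},
        {"stress": "none", "phonemes": ["m", "ax", "s"], "graphemes": ["m", "ou", "s"]},
    ],
}

def pronunciation(e):
    """Spelling-to-pronunciation direction: read the phonemic stratum."""
    return [p for syl in e["syllables"] for p in syl["phonemes"]]

def spelling(e):
    """Pronunciation-to-spelling direction: read the graphemic stratum."""
    return "".join(g for syl in e["syllables"] for g in syl["graphemes"])

print(pronunciation(entry), spelling(entry))
```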
This work is also the precursor to the ANGIE system (Lau and Seneff, 1997; Seneff et al., 1996), which extends our lexical representation to the phonetic level, and applies successfully in speech recognition, word spotting and durational modeling (Chung and Seneff, 1997). (C) 2001 Elsevier Science B.V. All rights reserved. C1 Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Shatin, Hong Kong, Peoples R China. RP Meng, H (reprint author), Chinese Univ Hong Kong, Dept Syst Engn & Engn Management, Shatin, Hong Kong, Peoples R China. EM hmmeng@se.cuhk.edu.hk RI Meng, Helen/F-6043-2011 CR Allen J., 1987, TEXT SPEECH MITALK S ALLEVA F, 1989, P DARPA SPEECH NAT L, P266, DOI 10.3115/1075434.1075478 CHUNG G, 1997, P ICSLP 97, P1475 COKER C, 1990, P C SPEECH SYNTH EUR CONROY D, 1986, DECTALK DTC03 TEXT T DAMPER RI, 1995, P CONN MOD MEM LANG, P117 Dedina M. J., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90017-K ELOVITZ H, 1976, 7949 NAV RES LAB FUKADA T, 1997, P EUROSPEECH 97, P2471 GOLDING AR, 1991, THESIS STANFORD U HOCHBERG J, 1991, IEEE T PATTERN ANAL, V13, P957, DOI 10.1109/34.93813 JIANG L, 1997, P EUROSPEECH, P605 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 Kucera H., 1967, COMPUTATIONAL ANAL P KUHN R, 1998, P ICSLP 98 Lamel L. F., 1986, P DARPA SPEECH REC W, P100 LAU R, 1997, P EUR 97 RHOD GREEC, P263 LUCAS SM, 1992, TALKING MACHINES THE, P127 LUCASSEN J, 1984, P ICASSP 84 LUK R, 1993, P INT C ACOUST SPEEC, P203 MENG H, 1996, SPEECH COMMUN, V18, P45 MENG H, 1994, P ICASSP 94, P1 MENG H, 1994, P ARPA HUMAN LANGUAG, P289, DOI 10.3115/1075812.1075876 MENG H, 1994, P INT S SPEECH IMAGE, P670, DOI 10.1109/SIPNN.1994.344822 OAKEY S, 1981, P IJCAI VANCOUVER, P109 OSHIKA BT, 1975, IEEE T ACOUST SPEECH, VAS23, P104, DOI 10.1109/TASSP.1975.1162639 PARFITT S, 1991, P EUROSPEECH, P801 SEGRE A, 1983, P 1 EUR ACL, P35, DOI 10.3115/980092.980098 Sejnowski T. J., 1987, Complex Systems, V1 Seneff S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607049 Seneff S., 1992, Computational Linguistics, V18 Seneff S., 1992, P ICSLP 92, P317 STANFILL C, 1987, P 6 NAT C ART INT AA, P577 Stanfill C., 1986, Communications of the ACM, V29, DOI 10.1145/7902.7906 STANFILL CW, 1988, P CAS BAS REAS WORKS, P406 SULLIVAN KPH, 1992, TALKING MACHINES THE, P183 VANCOILE B, 1992, P ICSLP BANFF, P487 VANCOILE B, 1990, P ICSLP 90, P765 VANDENBOSCH A, 1993, P 6 C EUR CHAPT ASS, P45, DOI 10.3115/976744.976751 VANLEEUWEN HC, 1993, COMPUTER SPEECH LANG, V7 WEINTRAUB M, 1987, P DARPA SPEECH REC W, P44 YANNAKOUDAKIS EJ, 1991, SPEECH COMMUN, V10, P381, DOI 10.1016/0167-6393(91)90005-E ZUE V, 1990, P ICASSP ALB NM, P49 NR 43 TC 1 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2001 VL 33 IS 3 BP 213 EP 239 DI 10.1016/S0167-6393(00)00014-5 PG 27 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BW UT WOS:000166335100003 ER PT J AU Demeechai, T Makelainen, K AF Demeechai, T Makelainen, K TI Recognition of syllables in a tone language SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; pitch; tone; tone language; HMM AB Speech recognition of tone languages requires detection of the tone in addition to detection of the consonants and vowels of a syllable. 
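A small sketch of the feature construction behind the joint-detection scheme this record goes on to compare: pitch and its time derivative are appended to the per-frame phonetic features. The array shapes and the convention of holding the last voiced F0 through unvoiced frames are assumptions, not taken from the paper.

```python
import numpy as np

def tonal_features(cepstra, f0_hz):
    """cepstra: (frames, n) phonetic features; f0_hz: per-frame F0, 0 = unvoiced."""
    f0 = f0_hz.astype(float)
    for i in range(1, len(f0)):          # carry last voiced F0 through gaps
        if f0[i] == 0.0:
            f0[i] = f0[i - 1]
    log_f0 = np.log(np.maximum(f0, 1.0))
    d_log_f0 = np.gradient(log_f0)       # time derivative of the pitch track
    return np.hstack([cepstra, log_f0[:, None], d_log_f0[:, None]])

frames = tonal_features(np.zeros((6, 12)), np.array([0, 110, 115, 0, 0, 120]))
print(frames.shape)                      # (6, 14): 12 phonetic + F0 + delta-F0
```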
This paper compares three different methods based on the hidden Markov model (HMM) framework for recognition of tonal syllables in continuous speech. In joint detection, recognition is done by employing a HMM of connected tonal syllables, in which the pitch and its time derivative are included into the feature vector in addition to the phonetic features. In sequential detection, base syllables (syllables ignoring their tones) are recognized by using a HMM of connected base syllables only; the estimated syllable boundaries are then used for subsequent tone recognition in a separate HMM of tones. In linked detection, the recognition in the HMM of connected base syllables is modified to periodically take into account also tonal likelihood computed from a HMM of tones. Linked detection can provide performance that is comparable to the performance of joint detection, which is clearly superior to that of sequential detection. The computational complexity of linked detection is lower than that of joint detection for a large vocabulary task, where the number of states in the HMM of connected tonal syllables is substantially larger than that in the HMM of connected base syllables. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Asian Inst Technol, Telecommun Program, Klongluang 12120, Pathumthani, Thailand. RP Demeechai, T (reprint author), Asian Inst Technol, Telecommun Program, POB 4, Klongluang 12120, Pathumthani, Thailand. CR BAI BR, 1997, P 1997 IEEE INT C AC, V2, P903 GAO Y, 1995, P ICASSP, V1, P77 HANPANICH S, 1993, THESIS CHULALONGKORN HERMES DJ, 1988, J ACOUST SOC AM, V83, P257, DOI 10.1121/1.396427 *ITU T, 1988, BLUE BOOK, V5, P81 LEE T, 1995, IEEE T SPEECH AUDI P, V3, P204 LIU FH, 1996, P ICASSP 96, V1, P157 LYU RY, 1995, P IEEE INT C AC SPEE, V1, P57 Owens F. J., 1993, SIGNAL PROCESSING SP Picone J., 1990, IEEE ASSP Magazine, V7, DOI 10.1109/53.54527 POTISUK S, 1995, P IEEE INT C AC SPEE, V1, P632 Rabiner L, 1993, FUNDAMENTALS SPEECH SHEN JL, 1996, P ICASSP, V1, P125 WANG HM, 1995, P IEEE INT C AC SPEE, V1, P61 NR 14 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2001 VL 33 IS 3 BP 241 EP 254 DI 10.1016/S0167-6393(00)00017-0 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BW UT WOS:000166335100004 ER PT J AU Bird, S Harrington, J AF Bird, S Harrington, J TI Speech annotation and corpus tools SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Univ Penn, Linguist Data Consortium, Philadelphia, PA 19104 USA. Macquarie Univ, Macquarie Ctr Cognit Sci, Sydney, NSW 2109, Australia. Macquarie Univ, Speech & Language Res Ctr, Sydney, NSW 2109, Australia. RP Bird, S (reprint author), Univ Penn, Linguist Data Consortium, 3615 Market St,Suite 200, Philadelphia, PA 19104 USA. EM sb@ldc.upenn.edu; jmh@shlrc.mq.edu.au NR 0 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JAN PY 2001 VL 33 IS 1-2 BP 1 EP 4 DI 10.1016/S0167-6393(00)00066-2 PG 4 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BV UT WOS:000166335000001 ER PT J AU Barras, C Geoffrois, E Wu, ZB Liberman, M AF Barras, C Geoffrois, E Wu, ZB Liberman, M TI Transcriber: Development and use of a tool for assisting speech corpora production SO SPEECH COMMUNICATION LA English DT Article DE transcription tool; speech corpora; broadcast news; linguistic annotation formats ID ANNOTATION AB We present "Transcriber", a tool for assisting in the creation of speech corpora, and describe some aspects of its development and use. Transcriber was designed for the manual segmentation and transcription of long duration broadcast news recordings, including annotation of speech turns, topics and acoustic conditions. It is highly portable, relying on the scripting language Tcl/Tk with extensions such as Snack for advanced audio functions and tcLex for lexical analysis, and has been tested on various Unix systems and Windows. The data format follows the XML standard with Unicode support for multilingual transcriptions. Distributed as free software in order to encourage the production of corpora, ease their sharing, increase user feedback and motivate software contributions, Transcriber has been in use for over a year in several countries. As a result of this collective experience, new requirements arose to support additional data formats, video control, and a better management of conversational speech. Using the annotation graphs framework recently formalized, adaptation of the tool towards new tasks and support of different data formats will become easier. (C) 2001 Elsevier Science B.V. All rights reserved. C1 LIMSI, Spoken Language Proc Grp, CNRS, F-91403 Orsay, France. GIP, CTA, DGA, F-94114 Arcueil, France. LDC, Philadelphia, PA 19104 USA. RP Barras, C (reprint author), LIMSI, Spoken Language Proc Grp, CNRS, BP 133, F-91403 Orsay, France. EM claude.barras@limsi.fr CR Barras C., 1998, P 1 INT C LANG RES E, P1373 BIRD S, 1999, LINGUISTIC ANNOTATIO Bird S, 2001, SPEECH COMMUN, V33, P23, DOI 10.1016/S0167-6393(00)00068-6 BONNET F, 1998, TCLEX LEXICAL ANAL G Bray T., 1998, EXTENSIBLE MARKUP LA BURGER S, 1999, 9 INT COCOSDA WORKSH Cassidy S, 2001, SPEECH COMMUN, V33, P61, DOI 10.1016/S0167-6393(00)00069-8 Clark J., 1999, XSL TRANSFORMATIONS DEJONG F, 2000, P 6 RIAO C PAR FRANC Deshmukh N., 1998, P INT C SPOK LANG PR, P1543 Free Software Foundation, 1991, GNU GEN PUBL LIC Gauvain J.-L., 1998, P INT C SPOK LANG PR, P1335 GEOFFROIS E, 2000, P 2 INT C LANG RES E Hetherington L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608015 HUCKVALE M, 1987, SFS SPEECH FILING SY Jacobson M, 2001, SPEECH COMMUN, V33, P79, DOI 10.1016/S0167-6393(00)00070-4 MacWhinney B., 2000, CHILDES PROJECT TOOL MCCANDLESS MK, 1998, THESIS MIT McKelvie D, 2001, SPEECH COMMUN, V33, P97, DOI 10.1016/S0167-6393(00)00071-6 *NIST, 1998, UN TRANSCR FORM UTF Ousterhout J., 1994, TEL TK TOOLKIT OUSTERHOUT JK, 1998, IEEE COMPUTER MAGAZI, V31 SCHALKWYK J, 1997, P EUROSPEECH 97, P689 SJOLANDER K, 1997, SNACK SOUND VISUALIZ SJOLANDER K, 1998, P 5 INT C SPOK LANG, P3217 Sperberg-McQueen C.
M., 1994, TEI GUIDELINES ELECT STALLMAN R, 1998, OPEN SOURCES VOICES STERN R, 1996, P DARPA SPEECH REC W, P7 *UN CONS, 2000, UN STAND VERS 3 0 WOOD L, 1998, DOCUMENT OBJECT MODE NR 30 TC 51 Z9 52 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2001 VL 33 IS 1-2 BP 5 EP 22 DI 10.1016/S0167-6393(00)00067-4 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BV UT WOS:000166335000002 ER PT J AU Bird, S Liberman, M AF Bird, S Liberman, M TI A formal framework for linguistic annotation SO SPEECH COMMUNICATION LA English DT Article DE speech markup; speech corpus; general-purpose architecture; directed graph; phonological representation ID SPEECH CORPORA; TOOL AB 'Linguistic annotation' covers any descriptive or analytic notations applied to raw language data. The basic data may be in the form of time functions - audio, video and/or physiological recordings - or it may be textual. The added notations may include transcriptions of all sorts (from phonetic features to discourse structures), part-of-speech and sense tagging, syntactic analysis, 'named entity' identification, coreference annotation, and so on. While there are several ongoing efforts to provide formats and tools for such annotations and to publish annotated linguistic databases, the lack of widely accepted standards is becoming a critical problem. Proposed standards, to the extent they exist, have focused on file formats. This paper focuses instead on the logical structure of linguistic annotations. We survey a wide variety of existing annotation formats and demonstrate a common conceptual core, the annotation graph. This provides a formal framework for constructing, maintaining and searching linguistic annotations, while remaining consistent with many alternative data structures and file formats. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Penn, Linguist Data Consortium, Philadelphia, PA 19104 USA. RP Bird, S (reprint author), Univ Penn, Linguist Data Consortium, 3615 Market St, Philadelphia, PA 19104 USA. EM sb@ldc.upenn.edu CR Abiteboul S., 1995, FDN DATABASES ALTOSAAR T, 1998, P 1 INT C LANG RES E ANDERSON AH, 1991, LANG SPEECH, V34, P351 Barras C., 1998, P 1 INT C LANG RES E, P1373 BARRAS C, 2000, P 2 INT C LANG RES E, P1517 Barras C, 2001, SPEECH COMMUN, V33, P5, DOI 10.1016/S0167-6393(00)00067-4 BIRD S, 1997, STANDARDS TOOLS DISC, P1 BIRD S, 1997, P 3 M ACL SPEC INT G Bird S., 1995, COMPUTATIONAL PHONOL Bird S, 2000, P 2 INT C LANG RES E, P1699 BIRD S, 2000, P 2 INT C LANG RES E Browman CP, 1989, PHONOLOGY, V6, P201, DOI 10.1017/S0952675700001019 Brown P. F., 1990, Computational Linguistics, V16 CARPENTER B, 1992, CAMBRIDGE TRACTS THE, V32 CASSIDY S, 2000, P 11 AUSTR DAT C IEE, P12 Cassidy S, 2001, SPEECH COMMUN, V33, P61, DOI 10.1016/S0167-6393(00)00069-8 Garofolo J. S., 1986, DARPA TIMIT ACOUSTIC Godfrey J., 1992, ICASSP 92 IEEE INT C, V1, P517 GRAFF D, 2000, P 2 INT C LANG RES E, P427 GREENBERG S, 1996, SWITCHBOARD TRANSCRI Grishman R, 1997, TIPSTER ARCHITECTURE HERTZ SR, 1990, PAPERS LAB PHONOLOGY, P215 HIRSCHMAN L, 1997, MESS UND C P Jacobson M, 2001, SPEECH COMMUN, V33, P79, DOI 10.1016/S0167-6393(00)00070-4 Jurafsky D, 1997, 9702 U COL I COGN SC Jurafsky D., 1997, P IEEE WORKSH SPEECH, P88 MacWhinney B., 1995, CHILDES PROJECT TOOL McKelvie D, 2001, SPEECH COMMUN, V33, P97, DOI 10.1016/S0167-6393(00)00071-6 Mellish C.
S., 1989, NATURAL LANGUAGE PRO Mitchell P, 1993, COMPUTATIONAL LINGUI, V19, P313 Schegloff EA, 1998, LANG SPEECH, V41, P235 SCHIEL F, 1998, P 1 INT C LANG RES E Skut W., 1997, P 5 C APPL NAT LANG TAYLOR A, 1995, DYSFLUENCY ANNOTATIO TAYLOR A, 2001, SPEECH COMMUN, V33, P153 TAYLOR PA, 1998, P 5 INT C SPOK LANG *UTF, UTF98 UTF NR 37 TC 63 Z9 65 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2001 VL 33 IS 1-2 BP 23 EP 60 DI 10.1016/S0167-6393(00)00068-6 PG 38 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BV UT WOS:000166335000003 ER PT J AU Cassidy, S Harrington, J AF Cassidy, S Harrington, J TI Multi-level annotation in the Emu speech database management system SO SPEECH COMMUNICATION LA English DT Article DE speech databases; speech annotation ID FORMALISM; TOOL AB Researchers in various fields, from acoustic phonetics to child language development, rely on digitised collections of spoken language data as raw material for research. Access to this data had, in the past, been provided in an ad-hoc manner with labelling standards and software tools developed to serve only one or two projects. A few attempts have been made at providing generalised access to speech corpora but none of these have gained widespread popularity. The Emu system, described here, is a general purpose speech database management system which supports complex multilevel annotations. Emu can read a number of popular label and data file formats and supports overlaying additional annotation with inter-token relations on existing time-aligned label files. Emu provides a graphical labelling tool which can be extended to provide special purpose displays. The software is easily extended via the Tcl/Tk scripting language which can be used, for example, to manipulate annotations and build graphical tools for database creation. This paper discusses the design of the Emu system, giving a detailed description of the annotation structures that it supports. It is argued that these structures are sufficiently general to allow Emu to read potentially any time-aligned linguistic annotation. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Macquarie Univ, Speech Hearing & Language Res Ctr, Sydney, NSW 2109, Australia. Macquarie Univ, Macquarie Ctr Cognit Sci, Sydney, NSW 2109, Australia. RP Cassidy, S (reprint author), Macquarie Univ, Speech Hearing & Language Res Ctr, Sydney, NSW 2109, Australia. CR ALTOSAAR T, 1999, P EUR 99 BUD HUNG, P915 Barras C., 1998, P 1 INT C LANG RES E, P1373 Beckman M. E., 1994, GUIDELINES TOBI LABE Bird S, 2001, SPEECH COMMUN, V33, P23, DOI 10.1016/S0167-6393(00)00068-6 Bird S, 1999, MSCIS9901 U PENNS DE BIRD S, 2000, P LREC 2000 ATH GREE Bray T., 1998, EXTENSIBLE MARKUP LA BUNEMAN P, 1998, QUER LANG WORKSH QL CARLSON R, 1990, SPEECH COMMUN, V9, P375, DOI 10.1016/0167-6393(90)90013-Y Cassidy K, 1996, INT J ENVIRON POLLUT, V6, P361 Cassidy S., 1999, P EUR 99 BUD HUNG, P2239 CASSIDY S, 2000, P 11 AUSTR DAT C CAN, V22, P12 Clark J., 1999, XML PATH LANGUAGE XP COLEMAN J, 1991, LINGUIST PHILOS, V14, P295, DOI 10.1007/BF00627405 Core M. G., 1997, AAAI FALL S COMM ACT Crawford M. D., 1994, P I ACOUSTICS 5, V16, P183 Deutsch A., 1998, XML QL QUERY LANGUAG Flammia G., 1995, Empirical Methods in Discourse Interpretation and Generation. 
Papers from the 1995 AAAI Symposium (TR SS-95-06) Goldfarb Charles F., 1990, SGML HDB Harrington J., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1017 HEDELIN P, 1990, SPEECH COMMUN, V9, P365, DOI 10.1016/0167-6393(90)90012-X HENDRIKS JPM, 1990, SPEECH COMMUN, V9, P381, DOI 10.1016/0167-6393(90)90014-Z Hieronymus J. L., 1994, ASCII PHONETIC SYMBO ISARD A, 1998, P 5 INT C SPOK LANG KUREMATSU A, 1990, SPEECH COMMUN, V9, P357, DOI 10.1016/0167-6393(90)90011-W LIEFKE H, 1999, WEBDB 99 MacWhinney B., 1995, CHILDES PROJECT TOOL McKelvie D, 2001, SPEECH COMMUN, V33, P97, DOI 10.1016/S0167-6393(00)00071-6 Ousterhout J., 1994, TEL TK TOOLKIT SCHIEL F, 1998, P 1 INT C LANG RES E TAYLOR P, 1998, EDINBURGH SPEECH TOO Taylor P, 2001, SPEECH COMMUN, V33, P153, DOI 10.1016/S0167-6393(00)00074-1 THOMPSON HS, 1997, SGML EUROPE 97 Watson CI, 1999, J ACOUST SOC AM, V106, P458, DOI 10.1121/1.427069 NR 34 TC 46 Z9 46 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2001 VL 33 IS 1-2 BP 61 EP 77 DI 10.1016/S0167-6393(00)00069-8 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BV UT WOS:000166335000004 ER PT J AU Jacobson, M Michailovsky, B Lowe, JB AF Jacobson, M Michailovsky, B Lowe, JB TI Linguistic documents synchronizing sound and text SO SPEECH COMMUNICATION LA English DT Article ID SPEECH CORPORA; ANNOTATION; TOOL AB The goal of the Langues et Civilisations a Tradition Orale (LACITO) Linguistic Archive project is to conserve and disseminate recorded and transcribed oral literature and other linguistic materials, mainly in unwritten languages, giving simultaneous access to sound recordings and text annotation. The project uses XML markup for the kinds of annotation traditionally used in field linguistics. Transcriptions are segmented into sentences (roughly) and words. Annotations are associated with different levels: metadata at the text level, free translation at the sentence level, interlinear glosses at the word level, etc. Time-alignment is at the sentence and optionally at the word level. The project makes maximum use of standard, generic software tools. Marked-up data are processed using freely available XML software and displayed using standard browsers. The project has developed (1) an authoring tool, SoundIndex, to facilitate time-alignment, (2) a Java applet, which enables browsers to access time-aligned speech, (3) XSL stylesheets, which specify "views" on the data, and (4) Common Gateway Interface (CGI) scripts, which allow the user to choose documents and views and to enter queries. Current objectives include development of the annotation and software to facilitate linguistic research beyond simple browsing. Over 100 texts in 20 languages have been processed at the time of writing; some of these are available on the Internet for browsing and simple querying. (C) 2001 Elsevier Science B.V. All rights reserved. C1 LACITO, CNRS, F-94800 Villejuif, France. RP Jacobson, M (reprint author), LACITO, CNRS, 7 Rue Guy Moquet,Bat 23, F-94800 Villejuif, France. 
EM jacobson@idf.ext.jussieu.fr CR ADLER S, 2000, LAST CALL WORKING DR Apparao V., 1998, DOCUMENT OBJECT MODE Barras C, 2001, SPEECH COMMUN, V33, P5, DOI 10.1016/S0167-6393(00)00067-4 Bird S, 2001, SPEECH COMMUN, V33, P23, DOI 10.1016/S0167-6393(00)00068-6 Bray T., 1998, EXTENSIBLE MARKUP LA BUSEMAN A, 1998, LINGUISTS SHOEBOX Clark J., 1999, XML PATH LANGUAGE XP Clark J., 1999, XSL TRANSFORMATIONS Deutsch A., 1998, XML QL QUERY LANGUAG DYBKJAER L, 1998, MATE MARKUP FRAMEWOR FANKHAUSER P, 2000, XML QUERY REQUIREMEN HSU R, 1989, LEXWARE MANUAL IDE N, 1999, CORPUS ENDODING STAN *LACITO ARCH PROJ, CNRS MACWHINNEY B, 1995, HDB LANGUAGE ACQUISI McKelvie D, 2001, SPEECH COMMUN, V33, P97, DOI 10.1016/S0167-6393(00)00071-6 Raggett D., 1999, HTML 4 01 SPECIFICAT Robie J, 1998, XML QUERY LANGUAGE X SJOLANDER K, 1997, SNACK SOUND EXTENSIO Sperberg-McQueen CM, 1994, GUIDELINES ELECT TEX THIEBERGER N, 1999, USING SOUNDINDEX TRA THOMPSON H, 1997, LT XML SOFTWARE LANG *U PENNS, LING ANN WEBS LING D *UN CONS, 2000, UN STAND VERS 3 0 NR 24 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2001 VL 33 IS 1-2 BP 79 EP 96 DI 10.1016/S0167-6393(00)00070-4 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BV UT WOS:000166335000005 ER PT J AU McKelvie, D Isard, A Mengel, A Moller, MB Grosse, M Klein, M AF McKelvie, D Isard, A Mengel, A Moller, MB Grosse, M Klein, M TI The MATE workbench - An annotation tool for XML coded speech corpora SO SPEECH COMMUNICATION LA English DT Article AB This paper describes the design and implementation of the MATE workbench, a program which provides support for the annotation of speech and text. It provides facilities for flexible display and editing of such annotations, and complex querying of a resulting corpus. The workbench offers a more flexible approach than most existing annotation tools, which were often designed with a specific annotation scheme in mind. Any annotation scheme can be used with the MATE workbench, provided it is coded using XML markup (linked to the speech signal, if available, using certain conventions). The workbench uses a transformation language to define specialised editors optimised for particular annotation tasks, with suitable display formats and allowable editing operations tailored to the task. The workbench is written in Java, which means that it is platform-independent. This paper outlines the design of the workbench software and compares it with other annotation programs. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Edinburgh, Language Technol Grp, Edinburgh EH8 9LW, Midlothian, Scotland. Univ Stuttgart, Inst Maschinelle Sprachverarbeitung, D-7000 Stuttgart, Germany. Odense Univ, Nat Interact Syst Lab, DK-5230 Odense M, Denmark. Deutsch Forsch Zentrum Kunstliche Intelligenz, Saarbrucken, Germany. RP McKelvie, D (reprint author), Univ Edinburgh, Language Technol Grp, 2 Buccleuch Pl, Edinburgh EH8 9LW, Midlothian, Scotland. 
EM dmck@cogsci.ed.ac.uk CR Abiteboul S., 2000, DATA WEB RELATIONS S Barras C, 2001, SPEECH COMMUN, V33, P5, DOI 10.1016/S0167-6393(00)00067-4 BIRD S, 2000, LINGUISTIC ANNOTATIO Bird S, 2001, SPEECH COMMUN, V33, P23, DOI 10.1016/S0167-6393(00)00068-6 BOERSMA P, 2000, 132 U AMST I PHON SC Bray T., 1998, EXTENSIBLE MARKUP LA Carletta J, 1997, COMPUT LINGUIST, V23, P13 Cassidy S, 2001, SPEECH COMMUN, V33, P61, DOI 10.1016/S0167-6393(00)00069-8 Clark J., 1999, XSL TRANSFORMATIONS Cunningham H., 1996, P 16 INT C COMP LING, P1057 Day D., 1997, P 5 C APPL NAT LANG DEROSE S, 2000, XML LINKING LANGUAGE DEROSE S, 1999, XML POINT LANG XPOIN Deutsch A., 1998, XML QL QUERY LANGUAG Eckstein Robert, 1998, JAVA SWING ENTROPIC, 1996, WAVES PLUS MANUAL EN Ferguson G., 1998, P 15 NAT C ART INT A, P567 GOLDMAN R, 1999, P 2 INT WORKSH WEB D GRISHAM R, 2000, TIPSTER TEXT ARCHITE GRUGMAN H, 1999, EUDICO EUROPEAN DIST HEID U, 1999, P INT C PHON SCI ICP IDE N, 2000, CORPUS ENCODING STAN Ide Nancy, 1995, TEXT ENCODING INITIA ISARD A, 1998, P 5 INT C SPOK LANG JACOBSON M, 2000, SPEECH COMMUN, V33, P79 JELLIFFE R, 2000, SCHEMATRON ACAD SINI KLEIN M, 1998, MATE DELIVERABLE 1 1 LEMAITRE J, 1996, P 4 EUR C INF SYST E MARCHIORI M, 1998, QL 98 QUER LANG WORK MENGEL A, 1999, MANUAL Q4M MENGEL A, 1999, MATE DELIVERABLE 2 1 MURATA M, 1998, P QL 98 QUER LANG WO NORSKOG L, 1995, SOX SOUND FILE FORMA PEMBERTON S, 2000, XHTML 1 0 EXTENSIBLE SCHIEL F, 1998, P 1 INT C LANG RES E SJOLANDER K, 2000, SNACK SOUND EXTENSIO Taylor P., 1999, EDINBURGH SPEECH TOO Taylor P, 2001, SPEECH COMMUN, V33, P153, DOI 10.1016/S0167-6393(00)00074-1 THOMPSON HS, 2000, XML SCHEMA 1 VATTON I, 1999, AMAYA W3CS Wood L., 2000, DOCUMENT OBJECT MODE NR 41 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2001 VL 33 IS 1-2 BP 97 EP 112 DI 10.1016/S0167-6393(00)00071-6 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BV UT WOS:000166335000006 ER PT J AU Stirling, L Fletcher, J Mushin, I Wales, R AF Stirling, L Fletcher, J Mushin, I Wales, R TI Representational issues in annotation: Using the Australian map task corpus to relate prosody and discourse structure SO SPEECH COMMUNICATION LA English DT Article DE dialogue; prosody; map task; ToBI; pitch; pause; DAMSL; dialogue act ID FEATURES AB This paper reports part of an ongoing investigation of the interaction of prosody and discourse structure. A digital speech corpus (4 dialogues from the ANDOSL Australian map task corpus) was coded for prosodic structure (ToBI). Independently, two different coding systems for dialogue micro-structure were applied to the same corpus: the HCRC map task coding scheme (Carletta et al., 1996, 1997b) and the 'Switchboard' version of the DRI/DAMSL scheme (Jurafsky et al., 1997). We investigated whether silent pause location and duration, intonational boundaries associated with Break Indices 3 and 4, as well as pitch range reset were significantly correlated with dialogue act boundaries as has been found for other varieties of English (e.g., Lehiste, 1975; Hirschberg and Nakatani, 1996; Silverman, 1987) and Dutch (Swerts, 1997). The dialogue coding systems were systematically evaluated both against one another and in terms of their correlation with the prosodic structure.
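The kind of comparison reported above can be pictured as below: pause durations at dialogue-act boundaries against pauses elsewhere, summarised by a Welch t statistic. All durations are invented; the study itself works from the annotated corpus, not from these numbers.

```python
import statistics as st

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    va, vb = st.variance(a) / len(a), st.variance(b) / len(b)
    return (st.mean(a) - st.mean(b)) / (va + vb) ** 0.5

boundary_pauses = [0.62, 0.48, 0.95, 0.71, 0.55]  # s, at DA boundaries (toy)
internal_pauses = [0.12, 0.20, 0.09, 0.31, 0.18]  # s, utterance-internal (toy)

print(f"means: {st.mean(boundary_pauses):.2f} s vs {st.mean(internal_pauses):.2f} s; "
      f"t = {welch_t(boundary_pauses, internal_pauses):.2f}")
```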
The paper explores a number of methodological issues which arise in effectively comparing and relating structures from different domains of analysis across a large speech corpus. It also exemplifies the way in which annotated corpora can be used to evaluate theories and systems. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Melbourne, Dept Linguist & Appl Linguist, Parkville, Vic, Australia. La Trobe Univ, Bundoora, Vic 3083, Australia. RP Stirling, L (reprint author), Univ Melbourne, Dept Linguist & Appl Linguist, Parkville, Vic, Australia. EM l.stirling@linguistics.unimelb.edu.au CR AHRENBERG L, 1995, WORK NOT AAAI SPRING ALLEN J, 1997, DRAFT DAMSL DIALOG A ALLWOOD J, 1994, SEMANTICS SPOKEN LAN ANDERSON WP, 1991, INT J QUANTUM CHEM, V39, P31, DOI 10.1002/qua.560390106 Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X BECKMAN ME, 1994, GUIDE TOBI LABELLING CARLETTA J, 1996, HCRCTR82 U ED HUM CO CARLETTA J, 1997, STANDARDS DIALOGUE C Carlson R, 1997, ELECT J DIFFERENTIAL, V23, P1 Cooper R., 1999, CODING INSTRUCTIONAL CORE M, 1999, 3 WORKSH DISC RES IN FLETCHER J, 1996, P 6 AUSTR INT C SPEE, P611 GRICE M, 1995, ICPHS95, P648 GRICE M, 1995, PHONUS, V1, P19 Grosz B., 1992, P INT C SPOK LANG PR, P429 Harrington J., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1017 Hirschberg J., 1996, P 34 ANN M ASS COMP, P286, DOI 10.3115/981863.981901 ISARD A, 1995, HCRCRP65 U ED HUM CO JURAFSKY D, 1997, 9702 TR U COL I COGN Koiso H, 1998, LANG SPEECH, V41, P295 Lehiste I., 1975, STRUCTURE PROCESS SP, P195 Millar J., 1994, P ICASSP 94, P97 NAKATANI C, 1995, P AAAI 95 SPRING S E NAKATANI C, 1999, UMIACSTR9903 U MARYL Shriberg E, 1998, LANG SPEECH, V41, P443 Silverman K, 1987, THESIS CAMBRIDGE U Swerts M, 1997, J ACOUST SOC AM, V101, P514, DOI 10.1121/1.418114 TRAUM D, 1998, UNPUB NOTES DIALOGUE Traum D., 1992, COMPUT INTELL, V8, P575, DOI DOI 10.1111/J.1467-8640.1992.TB00380.X NR 29 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2001 VL 33 IS 1-2 BP 113 EP 134 DI 10.1016/S0167-6393(00)00072-8 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BV UT WOS:000166335000007 ER PT J AU Syrdal, AK Hirschberg, J McGory, J Beckman, M AF Syrdal, AK Hirschberg, J McGory, J Beckman, M TI Automatic ToBI prediction and alignment to speed manual labeling of prosody SO SPEECH COMMUNICATION LA English DT Article ID ENGLISH; TEXT AB Tagging of corpora for useful linguistic categories can be a time-consuming process, especially with linguistic categories for which annotation standards are relatively new, such as discourse segment boundaries or the intonational events marked in the Tones and Break Indices (ToBI) system for American English. A ToBI prosodic labeling of speech typically takes even experienced labelers from 100 to 200 times real time. An experiment was conducted to determine (1) whether manual correction of automatically assigned ToBI labels would speed labeling, and (2) whether default labels introduced any bias in label assignment. A large speech corpus of one female speaker reading several types of texts was automatically assigned default labels. Default accent placement and phrase boundary location were predicted from text using machine learning techniques. The most common ToBI labels were assigned to these locations for default tones and break type. 
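A sketch of such a default pass, using the alignment conventions spelled out in the next sentences of this record; the concrete default labels chosen here (H*, L-L%, break index 4) are assumptions for illustration, not quoted from the study.

```python
def default_tobi(words):
    """words: (word, t_start, t_end, accented?, phrase_final?) tuples.
    Returns default tone and break-index labels with their time stamps."""
    tones, breaks = [], []
    for w, t0, t1, accented, final in words:
        if accented:
            tones.append(("H*", round((t0 + t1) / 2, 2)))  # accent at word midpoint
        if final:
            tones.append(("L-L%", t1))                     # edge tone at word end
            breaks.append(("4", t1))                       # default break index
    return tones, breaks

demo = [("Marianna", 0.10, 0.62, True, False), ("made", 0.62, 0.84, False, False),
        ("the", 0.84, 0.92, False, False), ("marmalade", 0.92, 1.55, True, True)]
print(default_tobi(demo))
```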
Predicted pitch accents were automatically aligned to the mid-point of the word, while breaks and edge tones were aligned to the end of the phrase-final word. The corpus was then labeled by a group of five trained transcribers working over a period of nine months. Half of each set of recordings was labeled in the standard fashion without default labels, and the other half was presented with preassigned default labels for labelers to correct. Results indicate that labeling from defaults was generally faster than standard labeling, and that defaults had relatively little impact on label assignment. (C) 2001 Elsevier Science B.V. All rights reserved. C1 AT&T Labs Res, Florham Pk, NJ 07932 USA. Ohio State Univ, Dept Linguist, Columbus, OH 43210 USA. RP Syrdal, AK (reprint author), AT&T Labs Res, Florham Pk, NJ 07932 USA. CR Aho A.V., 1988, AWK PROGRAMMING LANG Beckman M, 1997, GUIDELINES TOBI LABE Beckman M. E., 1994, TOBI ANNOTATION CONV BEUTNAGEL M, 1999, J ACOUST SOC AM 2, V105, P1030, DOI 10.1121/1.424924 BYRD D, 1992, P INT C SPOK LANG PR, P827 CONKIE A, 1999, P EUR C SPEECH COMM, V1, P523 *ENTR RES LAB INC, 1996, WAV PLUS MAN Garofolo J. S., 1986, DARPA TIMIT ACOUSTIC Grosz B. J., 1986, Computational Linguistics, V12 HIRSCHBERG J, 1993, ARTIF INTELL, V63, P305, DOI 10.1016/0004-3702(93)90020-C Hirschberg J, 1996, SPEECH COMMUN, V18, P281, DOI 10.1016/0167-6393(96)00017-9 HOUSE AS, 1953, J ACOUST SOC AM, V25, P105, DOI 10.1121/1.1906982 Kenyon John S., 1953, PRONOUNCING DICT AM LADD DR, 1979, CONTRIBUTIONS GRAMMA, P93 Lehiste I., 1960, PHONETICA S, V5, P1 LIBERMAN M, 1992, LEXICAL MATTERS Litman Diane, 1990, P 13 INT C COMP LING MCGORY J, 1999, J ACOUST SOC AM, V106, P2242, DOI 10.1121/1.427641 Mitchell P, 1993, COMPUTATIONAL LINGUI, V19, P313 NOOTEBOOM SG, 1982, PHONETICA, V39, P317 Olive J. P., 1993, ACOUSTICS AM ENGLISH Olshen R., 1984, CLASSIFICATION REGRE, V1st PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183 Pierrehumbert Janet B., 1990, PLANS INTENTIONS COM, P271 Pitrelli J. F., 1994, P 3 INT C SPOK LANG, V2, P123 PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770 Prince Ellen, 1992, DISCOURSE DESCRIPTIO, P295, DOI 10.1075/pbns.16.12pri Silverman K., 1992, P INT C SPOK LANG PR, P867 SPROAT R, 1992, P INT C SPOK LANG PR, P563 Wall L., 1996, PROGRAMMING PERL Wang M. Q., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90025-Y WIGHTMAN C, 1999, COMMUNICATION WIGHTMAN C, 1994, ALIGNER SYSTEM AUTOM NR 33 TC 22 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2001 VL 33 IS 1-2 BP 135 EP 151 DI 10.1016/S0167-6393(00)00073-X PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BV UT WOS:000166335000008 ER PT J AU Taylor, P Black, AW Caley, R AF Taylor, P Black, AW Caley, R TI Heterogeneous relation graphs as a formalism for representing linguistic information SO SPEECH COMMUNICATION LA English DT Article AB Heterogeneous relation graphs (HRGs) can be used to represent arbitrary linguistic information. Originally designed for use in speech synthesis, HRGs can be used for speech annotation purposes also. In the HRG formalism, atomic linguistic entities such as words, syllables and phones are represented by attribute value matrices known as linguistic items. Using attribute value matrices for items allows them to contain any type or amount of linguistic information. 
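A minimal Python sketch of such attribute-value items, and of the sharing of one item across several relations that the abstract goes on to describe; all class and attribute names here are illustrative assumptions, not the actual Festival/HRG API.

    class Item:
        """An atomic linguistic entity carrying an attribute-value matrix."""
        def __init__(self, **features):
            self.features = dict(features)   # any type or amount of information

    class Relation:
        """A named ordered structure (a simple list here) over items."""
        def __init__(self, name):
            self.name, self.items = name, []
        def append(self, item):
            self.items.append(item)

    # one and the same item may belong to several relations at once
    word = Item(name="edinburgh", pos="NNP")
    word_rel, syntax_rel = Relation("Word"), Relation("Syntax")
    word_rel.append(word)
    syntax_rel.append(word)
    word.features["accented"] = True   # the update is visible from both relations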
Items are organised into linguistic relations, which take the form of lists, trees or other structures. Items can belong to more than one relation, which allows words to appear in the word relation and, say, the syntax relation. The ability to have items in multiple relations, along with a function mechanism which can calculate certain values on the fly, eliminates much of the redundancy present in simpler systems. The HRG formalism is not tied to any particular linguistic theory, nor does it impose any preset ideas about what sort of format syntax, prosody or phonology information should have. This paper explains the HRG formalism in detail, and shows why we think this is superior to the types of "multi-level" formats normally used in speech synthesis and database annotation. (C) 2001 Elsevier Science B.V. All rights reserved. C1 Univ Edinburgh, Ctr Speech Technol Res, Edinburgh EH1 1HN, Midlothian, Scotland. RP Taylor, P (reprint author), Univ Edinburgh, Ctr Speech Technol Res, 80 S Bridge, Edinburgh EH1 1HN, Midlothian, Scotland. EM paul.taylor@ed.ac.uk CR Bailly G., 1989, Eurospeech 89. European Conference on Speech Communication and Technology Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X BIRD S, 1999, CIS9901 U PENNS BLACK AW, 1996, FESTIVAL SPEECH SYNT BLACK AW, 1994, COLING 94 KYOTO JAPA, P983 BOVES L, 1991, J PHONETICS, V19, P309 Cassidy K, 1996, INT J ENVIRON POLLUT, V6, P361 Chomsky N., 1968, SOUND PATTERN ENGLIS Dutoit T., 1997, INTRO TEXT SPEECH SY Goldsmith J., 1990, AUTOSEGMENTAL METRIC HERTZ SR, 1990, PAPERS LAB PHONOLOGY, V1, P215 Ladd D. Robert, 1992, COMPOUND PROSODIC DO LIBERMAN MY, 1975, THESIS INDIANA U LIN Local John, 1992, PAPERS LABORATORY PH, VII, P190 SELKIRK E., 1984, PHONOLOGY SYNTAX SPROAT R, 1994, P 2 ESCA IEEE WORKSH, P187 SPROAT R, 1996, J NAT LANG ENG, V2, P369 TAYLOR P, 1996, EDINBURGH SPEECH TOO Taylor P., 1999, EUROSPEECH 99, P623 TRABER C, 1995, THESIS SWISS FEDERAL Van Coile B. M. J., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266412 van Leeuwen H. C., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90016-8 NR 22 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2001 VL 33 IS 1-2 BP 153 EP 174 DI 10.1016/S0167-6393(00)00074-1 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 391BV UT WOS:000166335000009 ER PT J AU Parry, JJ Burnett, IS Chicharo, JF AF Parry, JJ Burnett, IS Chicharo, JF TI Language-specific phonetic structure and the quantisation of the spectral envelope of speech SO SPEECH COMMUNICATION LA English DT Article DE speech coding; speech analysis; multi-lingual speech processing; speech structure; line spectral pair frequencies; split vector quantisation; multi-stage vector quantisation ID VECTOR QUANTIZATION; MATHEMATICAL-THEORY; COMMUNICATION; PERFORMANCE; ALGORITHM AB In the design of low-bit-rate (LBR) speech coding algorithms, language variability is often considered to be of secondary importance in comparison with other operational factors such as speaker variability and noise.
Given that languages differ extensively in the composition of the spectral envelope and that the quantised spectral envelope of speech represents an important part of the bit allocation in speech coding, it is surprising to find that no comprehensive studies have ever been carried out on the role of language in spectral quantisation. This paper addresses this through a series of performance studies of spectral quantisation carried out across a set of language families typical of global mobile telecommunications. The study considers factors of quantiser design such as the size and structure of codebooks, and the quantity of monolingual data used in codebook training. This study found that quantisation distortion is not uniform across languages. It is shown that a significant difference exists in the behaviour of spectral quantisation across languages, in particular the behaviour of high distortion outliers. Detailed analysis of the spectral distortion data on a phonetic level revealed that the nature of the distribution of spectral energy in phonemes influenced the behaviour of monolingual codebooks. Some explanations for codebook performance are presented as well as a set of recommendations for codebook design for multi-lingual environments. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Univ Wollongong, Sch Elect Comp & Telecommun Engn, TITR Whisper Labs, Wollongong, NSW 2500, Australia. Univ Wollongong, Sch Elect & Comp Engn, Wollongong, NSW 2522, Australia. RP Burnett, IS (reprint author), Univ Wollongong, Sch Elect Comp & Telecommun Engn, TITR Whisper Labs, Wollongong, NSW 2500, Australia. CR BHATTACHARYA B, 1992, MAR P IEEE INT C AC, V1, P105 CRYSTAL D, 1997, CAMBRIDGE ENCY WORLD DAS A, 1995, INT CONF ACOUST SPEE, P492, DOI 10.1109/ICASSP.1995.479636 ERZIN E, 1997, INT C AC SPEECH SIGN FERNHAOUI M, 1999, IEEE WORKSH SPEECH C, P25 GERSHO A, 1979, IEEE T INFORM THEORY, V25, P373, DOI 10.1109/TIT.1979.1056067 HAGEN R, 1990, P IEEE INT C AC SPEE HOLMES W, 1998, INT C SPOK LANG PROC IRII H, 1993, SPEECH COMMUN, V12, P151, DOI 10.1016/S0167-6393(05)80007-X Itakura F, 1975, J ACOUST SOC AM, V57 JAYANT NS, 1985, DIGITAL CODING WAVEF, P57 JUANG BH, 1982, IEEE T ACOUST SPEECH, V30, P294 KANG GS, 1985, 8857 NAV RES LAB Kleijn W. B., 1995, SPEECH CODING SYNTHE Kleijn WB, 1996, IEEE SIGNAL PROC LET, V3, P228, DOI 10.1109/97.511802 KROON P, 1995, SPEECH CODING SYNTHE LEBLANC WF, 1993, IEEE T SPEECH AUDIO, V14, P373 LINDAU M, 1978, LANGUAGE, V54, P541, DOI 10.2307/412786 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 Maddieson I., 1984, PATTERNS SOUNDS MAKHOUL J, 1985, P IEEE, V73, P1551, DOI 10.1109/PROC.1985.13340 MCCREE A, 1998, P INT C AC SPEECH SI MONTAGNA R, 1993, P IEEE SPEECH CODING, P95, DOI 10.1109/SCFT.1993.762356 MUTHUSAMY YK, 1992, P INT C SPOK LANG PR Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 Ruhlen Merritt, 1976, GUIDE LANGUAGES WORL SHANNON CE, 1948, AT&T TECH J, V27, P623 SHOHAM Y, 1997, P INT C AC SPEECH SI SOONG FK, 1988, P IEEE INT C AC SPEE, P394 SOUTH CR, 1993, SPEECH COMMUN, V12, P113, DOI 10.1016/S0167-6393(05)80004-4 NR 31 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV PY 2000 VL 32 IS 4 BP 229 EP 250 DI 10.1016/S0167-6393(00)00011-X PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 379DE UT WOS:000165625000001 ER PT J AU Tokuma, S AF Tokuma, S TI Quality perception of vowels with simulated vertical bar CVC vertical bar formant trajectories SO SPEECH COMMUNICATION LA English DT Article DE formant undershoot; grid-matching experiment; vowel quality perception; formant trajectory range AB This study investigates the perceived vowel quality change caused by formant undershoot, where vowels in /CVC/ environments are compared with steady-state vowels. In the perceptual experiment of this study, listeners match constant /CVC/ stimuli of /bVb/ or /dVd/ to variable /#V#/ stimuli, using a schematic grid on a PC screen. The grid represents an acoustic vowel diagram, and the subjects change the F1/F2 frequencies of /#V#/ by moving a mouse. The results of the study show that, in vowel quality perception, the performance of subjects was affected by the formant trajectory range of the /CVC/ stimuli. When the formant trajectory range was small, they selected a value between the edge and peak frequencies, while they selected a value outside the trajectory range when it was large. This demonstrated phenomenon is compatible with the results of existing vowel perception studies, although the models proposed by these studies do not account for this phenomenon. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Sagami Womens Univ, Dept English Literature & Language, Kanagawa 2288533, Japan. RP Tokuma, S (reprint author), Sagami Womens Univ, Dept English Literature & Language, 2-1-1 Bunkyo, Kanagawa 2288533, Japan. EM tokuma@sagami-wu.ac.jp CR AKAGI M, 1994, P ICSLP 94 YOK JAP S, P503 BLADON RAW, 1981, J ACOUST SOC AM, V69, P1414, DOI 10.1121/1.385824 DIBENEDETTO MG, 1989, J ACOUST SOC AM, V86, P67, DOI 10.1121/1.398221 HOLMES JN, 1982, 1017 JSRU HUANG C, 1985, THESIS MIT CAMBRIDGE KUHL PK, 1983, SPEECH PERCEPTION PR, P239 Kuhl PK, 1995, SPEECH PERCEPTION LI, P121 Ladefoged P., 1967, 3 AREAS EXPT PHONETI LINDBLOM BE, 1967, J ACOUST SOC AM, V42, P830, DOI 10.1121/1.1910655 ANDRUSKI JE, 1992, J ACOUST SOC AM, V91, P390, DOI 10.1121/1.402781 NEAREY TM, 1989, J ACOUST SOC AM, V85, P2088, DOI 10.1121/1.397861 NEAREY TM, 1986, J ACOUST SOC AM, V80, P1293 NORD L, 1986, STL QPSR, V4, P19 POLS LCW, 1984, P I ACOUST, V6, P371 SHIGENO S, 1991, J ACOUST SOC AM, V90, P103, DOI 10.1121/1.401303 STRANGE W, 1989, J ACOUST SOC AM, V85, P2081, DOI 10.1121/1.397860 TOKUMA S, 1997, P EUR 97 RHOD GREEC, V4, P2163 TOKUMA S, 1996, THESIS U LONDON TOKUMA S, 1995, J PHONETIC SOC JAPAN, V208, P45 NR 19 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV PY 2000 VL 32 IS 4 BP 251 EP 265 DI 10.1016/S0167-6393(00)00012-1 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 379DE UT WOS:000165625000002 ER PT J AU Whiteside, SP Hodgson, C AF Whiteside, SP Hodgson, C TI Speech patterns of children and adults elicited via a picture-naming task: An acoustic study SO SPEECH COMMUNICATION LA English DT Article DE motor speech development; formant frequencies; temporal parameters; age, sex and individual differences ID FRICATIVE-VOWEL SYLLABLES; LOCUS EQUATIONS; STOP PLACE; COARTICULATION; CATEGORIZATION; ARTICULATION; SEQUENCES; EMERGENCE; SPOKEN AB This brief study presents some acoustic phonetic characteristics that reflect both the voice characteristics and motor speech behaviour of 20 pre-adolescent (6-, 8- and 10-year olds) boys and girls, and 9 adults in speech data that were elicited via a picture-naming task. The acoustic phonetic characteristics that were investigated included formant frequency values, coarticulation (or gestural overlap) and temporal patterns. Both voice characteristics and motor speech behaviour presented evidence of age and sex differences, and age by sex interactions. In addition there were significant correlations between formant frequencies and their associated formant frequency changes (or excursions). There was also evidence of individual differences in the patterns of maturation, which did not conform to chronological age. These data are presented and discussed with reference to the sexual dimorphism of the vocal apparatus, the development of vocal characteristics, and motor speech development and behaviour. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Univ Sheffield, Dept Human Commun Sci, Sheffield S10 2TA, S Yorkshire, England. Stanley Hlth Ctr, Durham SH9 OXE, England. RP Whiteside, SP (reprint author), Univ Sheffield, Dept Human Commun Sci, Sheffield S10 2TA, S Yorkshire, England. EM s.whiteside@sheffield.ac.uk CR BENNETT S, 1981, J ACOUST SOC AM, V69, P231, DOI 10.1121/1.385343 BLADON A., 1976, J PHONETICS, V4, P137 BUSBY PA, 1995, J ACOUST SOC AM, V97, P2603, DOI 10.1121/1.412975 BYRD D, 1992, J ACOUST SOC AM, V92, P593, DOI 10.1121/1.404271 BYRD D, 1994, SPEECH COMMUN, V15, P39, DOI 10.1016/0167-6393(94)90039-6 CHILDERS DG, 1991, J ACOUST SOC AM, V90, P1841, DOI 10.1121/1.401664 DALSTON RM, 1975, J ACOUST SOC AM, V57, P462, DOI 10.1121/1.380469 DAVIS BL, 1994, LANG SPEECH, V37, P341 DAVIS BL, 1990, J SPEECH HEAR RES, V33, P16 Eguchi S., 1969, ACTA OTO-LARYNGOL, V257, P5 Fant G., 1966, SPEECH TRANSMISSION, V4, P22 GOODELL EW, 1993, J SPEECH HEAR RES, V36, P707 KENT RD, 1976, J SPEECH HEAR RES, V19, P421 KRULL D, 1996, SWED PHON C NASSL 29, P73 KRULL D, 1995, P 13 INT C PHON SCI, V3, P436 Laver John, 1994, PRINCIPLES PHONETICS LINDBLOM B., 1983, PRODUCTION SPEECH, P217 Locke J. L., 1983, PHONOLOGICAL ACQUISI Locke JL, 1997, BRAIN LANG, V58, P265, DOI 10.1006/brln.1997.1791 LOCKE JL, 1995, PATHOLOGIES SPEECH L NITTROUER S, 1993, J SPEECH HEAR RES, V36, P959 NITTROUER S, 1989, J SPEECH HEAR RES, V32, P120 Nittrouer S, 1996, J SPEECH HEAR RES, V39, P379 PETERSON GE, 1952, J ACOUST SOC AM, V24, P629, DOI 10.1121/1.1906945 Pickett J. M., 1980, SOUNDS SPEECH COMMUN RECASENS D, 1991, J PHONETICS, V19, P177 RECASENS D, 1987, J PHONETICS, V15, P299 REPP BH, 1986, J ACOUST SOC AM, V79, P1616, DOI 10.1121/1.393298 SERENO JA, 1987, J ACOUST SOC AM, V81, P512, DOI 10.1121/1.394917 Sharkey S. 
G., 1985, J SPEECH HEAR RES, V28, P3 Slawinski EB, 1998, J PHONETICS, V26, P27, DOI 10.1006/jpho.1997.0057 Smith BL, 1998, J PHONETICS, V26, P95, DOI 10.1006/jpho.1997.0061 STOELGAMMON C, 1983, J CHILD LANG, V10, P455 Sussman HM, 1998, PHONETICA, V55, P204, DOI 10.1159/000028433 SUSSMAN HM, 1991, J ACOUST SOC AM, V90, P1309, DOI 10.1121/1.401923 SUSSMAN HM, 1992, J SPEECH HEAR RES, V35, P769 SWARTZ BL, 1992, PERCEPT MOTOR SKILL, V75, P983, DOI 10.2466/PMS.75.7.983-992 Turner R. Jay, 1985, RES COMMUNITY MENTAL, V5, P77 VANBERGEM DR, 1994, SPEECH COMMUN, V14, P143, DOI 10.1016/0167-6393(94)90005-1 Whiteside S. P., 1996, J INT PHON ASSOC, V26, P23 Whiteside S. P., 1999, LOGOP PHONIATR VOCO, V24, P6, DOI 10.1080/140154399434508 NR 41 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 2000 VL 32 IS 4 BP 267 EP 285 DI 10.1016/S0167-6393(00)00013-3 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 379DE UT WOS:000165625000003 ER PT J AU Ng, K Zue, VW AF Ng, K Zue, VW TI Subword-based approaches for spoken document retrieval SO SPEECH COMMUNICATION LA English DT Article DE spoken document retrieval; audio indexing; information retrieval ID RECOGNITION; ALGORITHM AB This paper explores approaches to the problem of spoken document retrieval (SDR), which is the task of automatically indexing and then retrieving relevant items from a large collection of recorded speech messages in response to a user specified natural language text query. We investigate the use of subword unit representations for SDR as an alternative to words generated by either keyword spotting or continuous speech recognition. In this study, we explore the space of possible subword units to determine the complexity of the subword units needed for SDR; describe the development and application of a phonetic recognition system to extract subword units from the speech signal; examine the behavior and sensitivity of the subword units to speech recognition errors; measure the effect of speech recognition performance on retrieval performance; and investigate a number of robust indexing and retrieval methods in an effort to improve retrieval performance in the presence of speech recognition errors. We find that with the appropriate subword units, it is possible to achieve performance comparable to that of text-based word units if the underlying phonetic units are recognized correctly. In the presence of speech recognition errors, retrieval performance degrades to 60% of the clean reference level. This performance can be improved by 23% (to 74% of the clean reference) with use of the robust methods. (C) 2000 Elsevier Science B.V. All rights reserved. C1 MIT, Comp Sci Lab, Spoken Language Syst Grp, Cambridge, MA 02139 USA. RP Ng, K (reprint author), MIT, Comp Sci Lab, Spoken Language Syst Grp, 545 Technol Sq,Room 638, Cambridge, MA 02139 USA. EM kng@mit.edu CR Acero A., 1990, P ICASSP, P849 Buckley C., 1985, 85686 CORN U COMP SC CHANG J, 1997, P EUR RHOD GREEC OCT, P1199 Chase L., 1997, P EUR C SPEECH COMM, P815 Chomsky N., 1968, SOUND PATTERN ENGLIS DAMASHEK M, 1995, SCIENCE, V267, P843, DOI 10.1126/science.267.5199.843 DELIGNE S, 1995, P ICASSP, P169 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DHARANIPRAGADA S, 1998, 7 TEXT RETR C TREC 7 Duda R. 
O., 1973, PATTERN CLASSIFICATI FISHER W, 1996, AUTOMATIC SYLLABIFIC Foote J.T., 1995, P EUR 95 MADR SPAIN, P2145 GAROFOLO JS, 1993, 4930 NISTIR NAT I ST GLASS J, 1996, P ICSLP, V4, P2277, DOI 10.1109/ICSLP.1996.607261 GLASS JR, 1988, P ICASSP NEW YORK AP, P429 GLAVITSCH U, 1992, P 15 ANN INT ACM SIG, P168, DOI 10.1145/133160.133194 Halberstadt A. K., 1998, THESIS MIT CAMBRIDGE HARMAN DK, 1998, 7 TEXT RETR C TREC 7 HARMAN DK, 1997, 6 TEXT RETR C TREC 6 Hartigan J., 1975, CLUSTERING ALGORITHM HAUPTMANN AG, 1997, P 1997 IEEE INT C AC, P195 Hazen TJ, 1998, INT CONF ACOUST SPEE, P653, DOI 10.1109/ICASSP.1998.675349 James D.A., 1995, THESIS U CAMBRIDGE C JOHNSON S, 1998, 7 TEXT RETR C TREC 7 Jones G.J.F., 1996, P 19 ANN INT ACM SIG, P30, DOI 10.1145/243199.243208 JONES GJF, 1995, P ICASSP 95, V1, P309 Jourlin P., 1999, Proceedings of SIGIR '99. 22nd International Conference on Research and Development in Information Retrieval, DOI 10.1145/312624.312701 Kahn D., 1976, THESIS MIT CAMBRIDGE Lee K.-F., 1989, AUTOMATIC SPEECH REC Marukawa K, 1997, PATTERN RECOGN, V30, P1361, DOI 10.1016/S0031-3203(96)00155-0 Ng C., 1998, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, DOI 10.1145/290941.291055 NG K, 1998, P ICSLP 98 SYDN AUST Ng K., 1997, P EUR C SPEECH COMM, P1607 Ng K, 1998, INT CONF ACOUST SPEE, P325, DOI 10.1109/ICASSP.1998.674433 PORTER MF, 1980, PROGRAM-AUTOM LIBR, V14, P130, DOI 10.1108/eb046814 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Rijsbergen CJV, 1979, INFORMATION RETRIEVA, V2nd ROBERTSON SE, 1976, J AM SOC INFORM SCI, V27, P129, DOI 10.1002/asi.4630270302 ROSE RC, 1991, P ICASSP 91 TOR CAN, P317, DOI 10.1109/ICASSP.1991.150340 SALTON G, 1983, INTRO MODERN INFORMA SCHAUBLE P, 1994, P ARPA HUM LANG TECH, P370, DOI 10.3115/1075812.1075897 Schmandt C., 1994, VOICE COMMUNICATION SINGHAL A, 1998, 7 TEXT RETR C TREC 7 Siu M., 1997, P EUR C SPEECH COMM, P831 SPINA MS, 1996, P ICSLP 96 PHIL PA, V2, P594, DOI 10.1109/ICSLP.1996.607431 SPINA MS, 1997, P EUR 97 RHOD GREEC, P1547 VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 Wechsler M., 1998, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, DOI 10.1145/290941.290950 WECHSLER M, 1995, P MIRO WORKSH GLASG Winston P. H., 1992, ARTIFICIAL INTELLIGE WITBROCK MJ, 1997, P DARPA SPEECH REC W ZHAI C, 1996, P 5 TEXT RETR C TREC NR 52 TC 29 Z9 36 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2000 VL 32 IS 3 BP 157 EP 186 DI 10.1016/S0167-6393(00)00008-X PG 30 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 362ME UT WOS:000089781100001 ER PT J AU Marino, JB Nogueiras, A Paches-Leal, P Bonafonte, A AF Marino, JB Nogueiras, A Paches-Leal, P Bonafonte, A TI The demiphone: An efficient contextual subword unit for continuous speech recognition SO SPEECH COMMUNICATION LA English DT Article DE context-dependent phonetic units; coarticulation modeling; continuous speech recognition AB In this paper, we introduce the demiphone as a context-dependent phonetic unit for continuous speech recognition. A phoneme is divided into two parts: a left demiphone that accounts for the left coarticulation and a right demiphone that copes with the right-hand side context. 
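A toy Python illustration of the decomposition just described, using HTK-style "a-b+c" triphone names; the "a-b" and "b+c" demiphone naming convention below is an assumption made for the example.

    def triphone_to_demiphones(triphone):
        """Split an HTK-style triphone 'a-b+c' into a left demiphone
        and a right demiphone."""
        left_ctx, rest = triphone.split("-")
        centre, right_ctx = rest.split("+")
        return "%s-%s" % (left_ctx, centre), "%s+%s" % (centre, right_ctx)

    # a triphone unseen in training can still be assembled from
    # demiphones observed in other contexts:
    print(triphone_to_demiphones("s-a+n"))   # ('s-a', 'a+n')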
This unit discards the dependence between the effects of both side contexts, but it models the transition between phonemes as the triphone does. By concatenating a left demiphone and a right demiphone a triphone can be built, although the left and the right-context coarticulations are modeled independently. The main appeal of this unit stems from its reduced number (with respect to the number of triphones) and its capability to model left and right contexts unseen together in the training material. Thus, the demiphone combines in a simple way the advantages of smoothed parameter estimation with the ability to generalize. In the present work, the demiphone is motivated and experimentally supported. Furthermore, demiphones are compared with triphones smoothed and generalized by decision-tree state-tying, accepted as the most powerful tool for coarticulation modeling at the present state of the art. The main conclusion of our work is that the demiphone simplifies the recognition system and yields a better performance than the triphone, at least for small or moderate size databases. This result may be explained by the ability of the demiphone to provide an excellent trade-off between a detailed coarticulation modeling and a proper parameter estimation. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Univ Politecn Cataluna, ES-08034 Barcelona, Spain. RP Marino, JB (reprint author), Univ Politecn Cataluna, Jordi Girona 1-3, ES-08034 Barcelona, Spain. EM canton@gps.tsc.upc.es RI Nogueiras Rodriguez, Albino/G-1418-2013; Marino, Jose /N-1626-2014 OI Nogueiras Rodriguez, Albino/0000-0002-3159-1718; CR Bonafonte A., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607137 BONAFONTE A, 1995, P EUR 95 MADR, P1607 Casacuberta F., 1991, P WORKSH INT COOP ST CHESTA C, 1997, P EUROSPEECH 97, P11 FEBRER A, 1998, P 1998 INT WORKSH SP FISSORE L, 1994, P ICSLP 94, P447 Gibbon D, 1997, HDB STANDARDS RESOUR Kannan A, 1994, IEEE T SPEECH AUDI P, V2, P453, DOI 10.1109/89.294362 LEE CH, 1996, P ICSLP 96, P1820 Llisterri J., 1993, SAMAUPC001V1 MARINO JB, 1997, P EUROSPEECH 97, P1215 MARINO JB, 1997, CONTEXTOS INTRAPALAB MOREN A, 1997, LRE63314 MOREN A, 1993, EUROM 1 SPANISH DATA NAVARRO TT, 1990, MANUAL PRONUNCIACION O'Shaughnessy D., 1987, SPEECH COMMUNICATION ROSENBERG AE, 1983, IEEE T ACOUST SPEECH, V31, P713, DOI 10.1109/TASSP.1983.1164132 VILLARRUBIA L, 1996, P ICASSP 96, P451 WOOD LC, 1991, P ICASSP 91, P181, DOI 10.1109/ICASSP.1991.150307 WU JJX, 1996, P ICSLP 96, P2281 YOUNG S, 1997, HTK BOOK VERSION 2 1 YOUNG SJ, 1994, COMPUT SPEECH LANG, V8, P369, DOI 10.1006/csla.1994.1019 NR 22 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD OCT PY 2000 VL 32 IS 3 BP 187 EP 197 DI 10.1016/S0167-6393(00)00010-8 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 362ME UT WOS:000089781100002 ER PT J AU Ofuka, E McKeown, JD Waterman, MG Roach, PJ AF Ofuka, E McKeown, JD Waterman, MG Roach, PJ TI Prosodic cues for rated politeness in Japanese speech SO SPEECH COMMUNICATION LA English DT Article DE prosody; politeness; Japanese speech; cue manipulation ID PITCH; EMOTION; FEMALE; PERCEPTION; WAVE AB In order to examine potential acoustic cues for politeness in Japanese speech, F0 and temporal aspects of polite and casual utterances of two question sentences spoken by 6 male native speakers were acoustically analysed. The analysis showed that F0 movement in the final part of utterances and the speech rate of the utterance were consistently used differently in the two speaking styles across all the speakers. Perceptual experiments with listeners confirmed that these acoustic variables, which were manipulated using digital resynthesis, had an impact on politeness perception. It was shown that how the final intonation of a sentence is spoken had a great impact on politeness judgements. In some cases the duration and F0 characteristics of the final vowel did change the overall impression of the utterance's politeness. An experiment which used speech rate variations of a polite utterance showed the important role of this variable in perceived politeness. Politeness ratings showed an inverted U-shape as a function of speech rate, but differed according to particular speakers. The speech rate of listeners was found to affect their utterance rate preference; listeners preferred rates close to their own. These findings suggest that listener characteristics should be considered important in research on politeness in speech. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Eastern Michigan Univ, Dept Foreign Languages & Bilingual Studies, Ypsilanti, MI 48197 USA. Univ Leeds, Sch Psychol, Leeds LS2 9JT, W Yorkshire, England. Univ Reading, Dept Linguist Sci, Reading RG6 6AA, Berks, England. RP Ofuka, E (reprint author), Eastern Michigan Univ, Dept Foreign Languages & Bilingual Studies, Ypsilanti, MI 48197 USA. EM eofuka@hotmail.com CR APPLE W, 1979, J PERS SOC PSYCHOL, V37, P715, DOI 10.1037//0022-3514.37.5.715 BELLBERTI F, 1995, P 13 INT C PHON SC S, P162 BICKLEY C, 1972, MIT WORKING PAPERS S, V1, P73 Brazil D., 1980, DISCOURSE INTONATION Brown B. L., 1985, RECENT ADV LANGUAGE, P144 BROWN BL, 1985, LANG COMMUN, V5, P207, DOI 10.1016/0271-5309(85)90011-4 BROWN BL, 1974, J ACOUST SOC AM, V55, P313, DOI 10.1121/1.1914504 Brown Bruce, 1980, LANGUAGE SOCIAL PSYC, P293 CAMPBELL WN, 1992, THESIS U SUSSEX UK CHANG TNC, 1958, PHONETICA, V2, P60 Charpentier F.
J., 1986, P ICASSP, P2015 CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 COSMIDES L, 1983, J EXP PSYCHOL HUMAN, V9, P864, DOI 10.1037/0096-1523.9.6.864 DAVITZ JR, 1959, J COMMUN, V6, P6 de Krom G., 1994, P INT C SPOK LANG PR, P1471 *ENTR RES LAB, 1993, WAVESPLUS ESPS VERS Fairbanks G, 1939, SPEECH MONOGR, V6, P87 Fairbanks G, 1940, J ACOUST SOC AM, V11, P457, DOI 10.1121/1.1916060 Frokjaer-Jensen B, 1976, BRUEL KJAER TECHNICA, V3, P3 GREEN RS, 1975, PERCEPT PSYCHOPHYS, V17, P429, DOI 10.3758/BF03203289 HENTON C, 1995, LANG COMMUN, V15, P43, DOI 10.1016/0271-5309(94)00011-Z HENTON CG, 1985, LANG COMMUN, V5, P221, DOI 10.1016/0271-5309(85)90012-6 HONG M, 1992, NIHONGO NIHON BUNGAK, V17, P32 HONG M, 1993, B PHON SOC JPN, V204, P13 IMAIZUMI S, 1994, ANN B I LOGOPEDICS P, V28, P59 KLASMEYER G, 1995, P 13 INT C PHON SC S, P182 Laver J, 1980, PHONETIC DESCRIPTION LEVIN H, 1975, IEEE T SYST MAN CYB, VSMC5, P259 LOVEDAY L, 1981, LANG SPEECH, V24, P71 LOVEDAY LJ, 1986, J PRAGMATICS, V10, P287, DOI 10.1016/0378-2166(86)90004-4 MALLORY EB, 1958, SPEECH MONOGR, V25, P255 MATSUMOTO Y, 1988, J PRAGMATICS, V12, P403, DOI 10.1016/0378-2166(88)90003-3 MILLER JL, 1984, PHONETICA, V41, P215 MINAMI F, 1987, KEIGO MONSEN RB, 1977, J ACOUST SOC AM, V62, P981, DOI 10.1121/1.381593 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 Nakane C., 1967, TATE SHAKAI NINGEN K Nakane C, 1970, JAPANESE SOC, V4 *NHK, 1995, NHK ANN SUT HAN KOT OGINO T, 1992, NIHONGO INTONATION J, P215 ROSS ED, 1986, J PHONETICS, V14, P283 Scherer K. R., 1982, HDB METHODS NONVERBA, P136 SCHERER KR, 1984, J ACOUST SOC AM, V76, P1346, DOI 10.1121/1.391450 Scherer K.R., 1979, EMOTIONS PERSONALITY, P495 Scherer K.R., 1979, SOCIAL MARKERS SPEEC, P147 SCHERER U, 1980, ANAL SOCIAL SKILL, P315 Shibatani M., 1990, LANGUAGE JAPAN SMITH BL, 1975, LANG SPEECH, V18, P145 STARKWEATHER JA, 1956, J ABNORM SOC PSYCH, V52, P394, DOI 10.1037/h0041133 Takefuta Yukio, 1975, MEASUREMENT PROCEDUR, P363 TOKUGAWA M, 1981, KOTOBA NISHI HIGASHI VANBEZOOYEN R, 1984, CHARACTERISTICS RECO Van Dusen CR, 1941, J SPEECH DISORD, V6, P137 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 NR 55 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 2000 VL 32 IS 3 BP 199 EP 217 DI 10.1016/S0167-6393(00)00009-1 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 362ME UT WOS:000089781100003 ER PT J AU Renals, S Robinson, T AF Renals, S Robinson, T TI Accessing information in spoken audio SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Renals, S (reprint author), Univ Sheffield, Dept Comp Sci, 211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. EM s.renals@dcs.shef.ac.uk; ajr@eng.cam.ac.uk CR HAUPTMANN AG, 1999, INTELLIGENT MULTIMED, P213 JONES GJF, 1996, P IEEE INT C AC SPEE, P311 ROBERTSON SE, 1976, J AM SOC INFORM SCI, V27, P129, DOI 10.1002/asi.4630270302 ROBINSON T, 1999, P ESCA WORKSH ACC IN SALTON G, 1983, INTRO MODERN INFORMA SCHAUBLE P, 1997, MULTIMEDIA INFORMATI NR 6 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD SEP PY 2000 VL 32 IS 1-2 BP 1 EP 3 DI 10.1016/S0167-6393(00)00019-4 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 350HM UT WOS:000089095100001 ER PT J AU Renals, S Abberley, D Kirby, D Robinson, T AF Renals, S Abberley, D Kirby, D Robinson, T TI Indexing and retrieval of broadcast news SO SPEECH COMMUNICATION LA English DT Article DE spoken document retrieval; information retrieval; broadcast speech; large vocabulary speech recognition ID DOCUMENT-RETRIEVAL; RELEVANCE AB This paper describes a spoken document retrieval (SDR) system for British and North American Broadcast News. The system is based on a connectionist large vocabulary speech recognizer and a probabilistic information retrieval (IR) system. We discuss the development of a real-time Broadcast News speech recognizer, and its integration into an SDR system. Two advances were made for this task: automatic segmentation and statistical query expansion using a secondary corpus. Precision and recall results using the Text Retrieval Conference (TREC) SDR evaluation infrastructure are reported throughout the paper, and we discuss the application of these developments to a large scale SDR task based on an archive of British English broadcast news. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. Univ Cambridge, Dept Engn, Cambridge CB2 1TN, England. RP Renals, S (reprint author), Univ Sheffield, Dept Comp Sci, 211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. EM s.renals@dcs.shef.ac.uk CR Abberley D, 1998, INT CONF ACOUST SPEE, P3781, DOI 10.1109/ICASSP.1998.679707 ABBERLEY D, 1999, P ESCA WORKSH ACC IN, P19 ALLAN J, 1998, P 6 TEXT RETR C TREC, P169 Callan James P., 1994, P 17 ANN INT ACM SIG, P302 COOK G, 1999, P DARPA BROADC NEWS, P161 CROFT WB, 1979, J DOC, V35, P285, DOI 10.1108/eb026683 Dharanipragada S, 1998, INT CONF ACOUST SPEE, P233, DOI 10.1109/ICASSP.1998.674410 Ferrari A., 1999, Elettronica Oggi Foote JT, 1997, COMPUT SPEECH LANG, V11, P207, DOI 10.1006/csla.1997.0027 GAROFOLO J, 1999, ESCA ETRW ACCESSING, P1 Harman D., 1992, INFORMATION RETRIEVA, P241 HARMAN DK, 1996, P 4 TEXT RETR C TREC, pA6 Hauptmann A., 1997, INTELLIGENT MULTIMED, P213 Hearst MA, 1997, COMPUT LINGUIST, V23, P33 JAMES DA, 1994, P INT C AC SPEECH SI, V1, P377 JOHNSON SE, 1999, P IEEE INT C AC SPEE, P49 JONES GJF, 1996, P IEEE INT C AC SPEE, P311 JONES K. 
S., 1998, TR446 CAMBR U COMP L Kaszkiel M, 1997, PROCEEDINGS OF THE 20TH ANNUAL INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL, P178, DOI 10.1145/258525.258561 KEMP T, 1998, P INT C SPOK LANG PR, P1839 KRAAIJ W, 1998, P 14 TWENT WORKSH LA, P141 MORRISON P, 1998, SUM HUMAN KNOWLEDGE Ng C., 1998, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, DOI 10.1145/290941.291055 Ng K, 1998, INT CONF ACOUST SPEE, P325, DOI 10.1109/ICASSP.1998.674433 PORTER MF, 1980, PROGRAM-AUTOM LIBR, V14, P130, DOI 10.1108/eb046814 ROBERTS SB, 1995, OBES RES, V3, P3 ROBERTSON SE, 1976, J AM SOC INFORM SCI, V27, P129, DOI 10.1002/asi.4630270302 ROBERTSON SE, 1994, P ACM SIGIR, P16 ROBERTSON SE, 1990, J DOC, V46, P359, DOI 10.1108/eb026866 ROBINSON AJ, 2000, UNPUB SPEECH COMMUNI ROBINSON AJ, 1994, IEEE T NEURAL NETWOR, V5, P298, DOI 10.1109/72.279192 Robinson T., 1996, AUTOMATIC SPEECH SPE, P233 Robinson T, 1998, INT CONF ACOUST SPEE, P829, DOI 10.1109/ICASSP.1998.675393 ROBINSON T, 1999, P EUR BUD, P1067 ROBINSON T, 2000, UNPUB SPEECH COMMUNI SIEGLER M, 1999, P IEEE INT C AC SPEE, P505 Singhal A., 1999, P TREC 7 NIST GAITH, P239 Singhal A., 1999, Proceedings of SIGIR '99. 22nd International Conference on Research and Development in Information Retrieval, DOI 10.1145/312624.312645 SMEATON AF, 1998, LNCS, V1513, P429 VANRIJSBERGEN CJ, 1979, INFORMATION RETRIEVA Wechsler M., 1998, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, DOI 10.1145/290941.290950 Williams HE, 1999, COMPUT J, V42, P193, DOI 10.1093/comjnl/42.3.193 Witbrock M. J., 1997, P 2 ACM INT C DIG LI, P30, DOI 10.1145/263690.263779 Xu Jinxi, 1996, P 19 ANN INT ACM SIG, P4, DOI 10.1145/243199.243202 Yamron JP, 1998, INT CONF ACOUST SPEE, P333, DOI 10.1109/ICASSP.1998.674435 NR 45 TC 17 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2000 VL 32 IS 1-2 BP 5 EP 20 DI 10.1016/S0167-6393(00)00020-0 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 350HM UT WOS:000089095100002 ER PT J AU Jourlin, P Johnson, SE Sparck-Jones, K Woodland, PC AF Jourlin, P Johnson, SE Sparck-Jones, K Woodland, PC TI Spoken document representations for probabilistic retrieval SO SPEECH COMMUNICATION LA English DT Article DE spoken document retrieval; automatic speech recognition; information retrieval AB This paper presents some developments in query expansion and document representation of our spoken document retrieval system and shows how various retrieval techniques affect performance for different sets of transcriptions derived from a common speech source. Modifications of the document representation are used, which combine several techniques for query expansion, knowledge-based on one hand and statistics-based on the other. Taken together, these techniques can improve Average Precision by over 19% relative to a system similar to that which we presented at TREC-7. These new experiments have also confirmed that the degradation of Average Precision due to a word error rate (WER) of 25% is quite small (3.7% relative) and can be reduced to almost zero (0.2% relative). 
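As a rough sketch of the statistics-based side of such query expansion, the Python fragment below implements blind relevance feedback: the query is enlarged with frequent terms from the top-ranked documents of a first retrieval pass. It is a deliberate simplification; the actual system also draws on knowledge-based resources such as WordNet and weights candidate terms rather than counting them raw.

    from collections import Counter

    def expand_query(query_terms, ranked_docs, n_docs=10, n_terms=5):
        """Blind relevance feedback: add the most frequent non-query
        terms of the top-ranked documents (each a list of index terms)."""
        counts = Counter()
        for doc in ranked_docs[:n_docs]:
            counts.update(t for t in doc if t not in query_terms)
        return list(query_terms) + [t for t, _ in counts.most_common(n_terms)]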
The overall improvement of the retrieval system can also be observed for seven different sets of transcriptions from different recognition engines with a WER ranging from 24.8% to 61.5%. We hope to repeat these experiments when larger document collections become available, in order to evaluate the scalability of these techniques. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Univ Cambridge, Comp Lab, Cambridge CB2 3QG, England. Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Jourlin, P (reprint author), Univ Cambridge, Comp Lab, Pembroke St,New Museums Site, Cambridge CB2 3QG, England. EM pierre.jourlin@cl.cam.ac.uk CR Abberley D., 1999, Seventh Text REtrieval Conference (TREC-7) (NIST SP 500-242) Allan J., 1999, Seventh Text REtrieval Conference (TREC-7) (NIST SP 500-242) Dushnik B, 1941, AM J MATH, V63, P600, DOI 10.2307/2371374 Fellbaum C., 1998, WORD NET ELECT LEXIC Fox C., 1992, INFORMATION RETRIEVA, P102 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Garofolo J.S., 1999, P ESCA WORKSH ACC IN, P1 Hain T., 1998, P DARPA BROADC NEWS, P133 Johnson S., 1998, P 5 INT C SPOK LANG, P1775 JOHNSON SE, 2000, 8 TEXT RETR C TREC8 JOHNSON SE, 1999, P 1999 IEEE INT C AC, V1, P49 Johnson S. E., 1999, Seventh Text REtrieval Conference (TREC-7) (NIST SP 500-242) JONES KS, 2000, INFORM PROCESS MANAG, V36, P37 JONES KS, 1998, 446 TR U CAMBR COMP KNIGHT SF, 1998, COMMUNICATION Leggetter C., 1995, P ARPA WORKSH SPOK L, P110 Mandala R., 1999, Seventh Text REtrieval Conference (TREC-7) (NIST SP 500-242) Mitra M, 1997, P RIAO97 COMP ASS IN, P200 Nowell P., 1999, Seventh Text REtrieval Conference (TREC-7) (NIST SP 500-242) PORTER MF, 1980, PROGRAM-AUTOM LIBR, V14, P130, DOI 10.1108/eb046814 SALTON G, 1971, SMART RETRIEVAL SYST, P143 Singhal A., 1999, Seventh Text REtrieval Conference (TREC-7) (NIST SP 500-242) Singhal A., 1999, Proceedings of SIGIR '99. 22nd International Conference on Research and Development in Information Retrieval, DOI 10.1145/312624.312645 VOORHEES EM, 1999, 7 TEXT RETR C TREC7 Voorhees EM, 1994, P 17 ANN INT ACM SIG, P61 Woodland P., 1998, P DARPA BROADC NEWS, P41 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 NR 27 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2000 VL 32 IS 1-2 BP 21 EP 36 DI 10.1016/S0167-6393(00)00021-2 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 350HM UT WOS:000089095100003 ER PT J AU Federico, M AF Federico, M TI A system for the retrieval of Italian broadcast news SO SPEECH COMMUNICATION LA English DT Article DE automatic speech recognition; audio indexing; information retrieval AB This paper presents a prototype for the retrieval of Italian broadcast news, which has been developed at ITC-irst. The architecture employs a speech recognition engine for the automatic transcription of audio news. Moreover, it features document indexing based on part-of-speech tagging of text coupled with morphological analysis, and query expansion exploiting the Italian WordNet thesaurus. Query-document matching is based on a statistical term weighting scheme. The system was tested on a 203-story collection of audio news, augmented with 9500 newspaper articles. The evaluation was based on a "known item" retrieval task and aimed at evaluating the impact of speech recognition errors and query expansion on retrieval performance. 
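For readers unfamiliar with statistical term weighting, the Python fragment below shows one common scheme of this family (a BM25-style formula). The prototype's actual weighting formula is not given in this record, so treat the function purely as an illustration under assumed parameter values.

    import math

    def bm25_score(query_terms, doc_terms, doc_freq, n_docs, avg_len,
                   k1=1.2, b=0.75):
        """Score one document (a list of index terms) against a query;
        doc_freq maps each term to the number of documents containing it."""
        norm = k1 * (1 - b + b * len(doc_terms) / avg_len)  # length normalisation
        score = 0.0
        for term in set(query_terms):
            tf = doc_terms.count(term)
            df = doc_freq.get(term, 0)
            if tf == 0 or df == 0:
                continue
            idf = math.log((n_docs - df + 0.5) / (df + 0.5))
            score += idf * tf * (k1 + 1) / (tf + norm)
        return score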
(C) 2000 Elsevier Science B.V. All rights reserved. C1 Ctr Ric Sci & Tecnol, ITC IRST, I-38050 Trento, Italy. RP Federico, M (reprint author), Ctr Ric Sci & Tecnol, ITC IRST, I-38050 Trento, Italy. EM federico@itc.it CR ARTALE A, 1997, ADV ARTIFICIAL INTEL BRUGNARA F, 1997, P 5 EUR C SPEECH COM, P2751 BRUGNARA M, 2000, P RIAO CONT BAS MULT Chen S., 1998, DARPA BROADC NEWS TR CORAZZARI O, 1991, ELSNET ITALIAN CORPU DHARANIPROGADA S, 1998, P 7 TEXT RETR C GAIT, P115 EAGLES, 1996, RECOMMENDATIONS MORP Fellbaum C., 1998, WORD NET ELECT LEXIC FRAKES WR, 1992, INFORMATION RETRIEVA Gales MJF, 1998, COMPUT SPEECH LANG, V12, P75, DOI 10.1006/csla.1998.0043 GAROFOLO J, 1998, P TREC C GAITH MD US, P79 GAROFOLO JS, 1997, P 6 TEXT RETR C, P83 GRETTER R, 1991, MORPHOLOGICAL ANAL I HAUPTMANN A. G., 1997, INTELLIGENT MULTIMED Johnson R.A., 1992, APPL MULTIVARIATE ST MANDALA R, 1998, P 7 TEXT RETR C GAIT, P457 Merialdo B., 1994, Computational Linguistics, V20 ROBERTS SB, 1995, OBES RES, V3, P3 NR 18 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2000 VL 32 IS 1-2 BP 37 EP 47 DI 10.1016/S0167-6393(00)00022-4 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 350HM UT WOS:000089095100004 ER PT J AU Wang, HM AF Wang, HM TI Experiments in syllable-based retrieval of broadcast news speech in Mandarin Chinese SO SPEECH COMMUNICATION LA English DT Article DE spoken document retrieval; broadcast news; Mandarin Chinese; syllable lattice; speech recognition; hidden Markov Model AB Spoken document retrieval (SDR) has been extensively studied in recent years because of its potential use in navigating large multi-media collections in the near future. Considering the characteristics and monosyllabic structure of the Chinese language, the syllable-based indexing for retrieval of spoken documents in Mandarin Chinese has been investigated, and extensive experiments on retrieval of broadcast news speech collected in Taiwan were performed. This paper reports some interesting results and findings obtained in this research. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Acad Sinica, Inst Informat Sci, Taipei 115, Taiwan. RP Wang, HM (reprint author), Acad Sinica, Inst Informat Sci, Taipei 115, Taiwan. EM whm@iis.sinica.edu.tw CR BAI BR, 1999, P INT C MULT INT, P46 Bai B.-R., 1996, P INT C SPOK LANG PR, P1950 CHEN B, 1998, P INT C SPOK LANG PR Chen K.J., 1992, P COLING 92, P101, DOI 10.3115/992066.992085 *CKIP GROUP, 1993, 9305 CKIP GROUP I IN GLAVITSCH U, 1992, P 15 ANN INT ACM SIG, P168, DOI 10.1145/133160.133194 HARMAN D, 1995, P 4 TEXT RETR C JAMES DA, 1995, THESIS U CAMBRIDGE U Lee LS, 1997, IEEE SIGNAL PROC MAG, V14, P63 LIN SC, 1998, COMPUTER PROCESSING, V12, P123 Ng K., 1997, P EUR C SPEECH COMM, P1607 Rabiner L, 1993, FUNDAMENTALS SPEECH SALTON G, 1983, INTRO MODERN INFORMA Sparck-Jones K, 1996, INFORM PROCESS MANAG, V32, P399, DOI 10.1016/0306-4573(95)00077-1 Wactlar H. D., 1996, IEEE COMPUT, V29, P46 WANG HM, 1999, P INT WORKSH INF RET, P48 Wang HM, 1997, IEEE T SPEECH AUDI P, V5, P195 WECHSLER M, 1995, P MULT INF RETR WORK Wechsler M., 1998, THESIS SWISS FEDERAL NR 19 TC 14 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD SEP PY 2000 VL 32 IS 1-2 BP 49 EP 60 DI 10.1016/S0167-6393(00)00023-6 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 350HM UT WOS:000089095100005 ER PT J AU Ng, C Wilkinson, R Zobel, J AF Ng, C Wilkinson, R Zobel, J TI Experiments in spoken document retrieval using phoneme n-grams SO SPEECH COMMUNICATION LA English DT Article AB In spoken document retrieval (SDR), speech recognition is applied to a collection to obtain either words or subword units, such as phonemes, that can be matched against queries. We have explored retrieval based on phoneme n-grams. The use of phonemes addresses the out-of-vocabulary (OOV) problem, while use of n-grams allows approximate matching on inaccurate phoneme transcriptions. Our experiments explored the utility of word boundary information, stopword elimination, query expansion, varying the length of phoneme sequences to be matched and various combinations of n-grams of different lengths. Given word-based recognition (WBR), we can match queries to speech using a phoneme representation of the words, permitting us to test whether it was the recognition or the matching process that was most crucial to retrieval performance. Our experiments show that there is some deterioration in effectiveness, but the particular form of matching is less vital if the sequence of phonemes was correct. When phone sequences are recognised directly, with higher error rates than for words, it was more important to select a good matching approach. Varying gram length trades precision against recall; combination of n-grams of different lengths, in particular 3-grams and 4-grams, can improve retrieval. Overall, phoneme-based retrieval is not as effective as word-based retrieval, but is sufficient for situations in which word-based retrieval is either impractical or undesirable. (C) 2000 Elsevier Science B.V. All rights reserved. C1 RMIT Univ, Dept Comp Sci, Melbourne, Vic 3001, Australia. CSIRO, Div Math & Informat Sci, Melbourne, Vic 3053, Australia. RP Ng, C (reprint author), RMIT Univ, Dept Comp Sci, GPO Box 2476V, Melbourne, Vic 3001, Australia. EM chienn@cs.rmit.edu.au; ross.wilkinson@cmis.csiro.au; jz@cs.rmit.edu.au CR CANVAR WB, 1994, P 3 TEXT RETR C TREC, P269 FULLER M, 1997, P 6 TEXT RETR C, P241 FULLER M, 1998, P 7 TEXT RETR C TREC, P465 GAROFOLO J, 1998, P TREC C GAITH MD US, P79 Jones G.J.F., 1996, P 19 ANN INT ACM SIG, P30, DOI 10.1145/243199.243208 Lee K.-F., 1989, AUTOMATIC SPEECH REC *LING DAT CONS, 1997, HUB4 CSRVI *LING DAT CONS, 1996, HUB4 CSRIV *LING DAT CONS, 1996, CSRVHUB4 CDROM MATEEV B, 1997, P 6 TEXT RETR C TREC, P623 NG K, 1998, P INT C SPOK LANG PR, V3, P939 Ng K., 1997, P EUR C SPEECH COMM, P1607 Ng K, 1998, INT CONF ACOUST SPEE, P325, DOI 10.1109/ICASSP.1998.674433 Rabiner L, 1993, FUNDAMENTALS SPEECH SALTON G, 1988, INFORM PROCESS MANAG, V24, P513, DOI 10.1016/0306-4573(88)90021-0 SALTON G, 1983, INTRO MODERN INFORMA SINGHAL A, 1997, P 6 TEXT RETR C TREC, P215 SMEATON AF, 1998, P 2 EUR C RES ADV TE VOORHEES E, 1998, P 6 TEXT RETR C TREC, P1 Voorhees E. M., 1997, P 6 TEXT RETR C TREC, P1 WALKER S, 1997, P 6 TEXT RETR C TREC, P125 WECHSLER M, 1995, WORKSH COMP SCI MIRO Wechsler M., 1998, Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, DOI 10.1145/290941.290950 Witbrock M. 
J., 1997, P 2 ACM INT C DIG LI, P30, DOI 10.1145/263690.263779 Witten I., 1994, MANAGING GIGABYTES C WOODLAND PC, 1995, P INT C AC SPEECH SI Young S., 1995, HTK BOOK Zobel J., 1996, P 19 ANN INT ACM SIG, P166, DOI 10.1145/243199.243258 NR 28 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2000 VL 32 IS 1-2 BP 61 EP 77 DI 10.1016/S0167-6393(00)00024-8 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 350HM UT WOS:000089095100006 ER PT J AU Benitez, MC Rubio, A Garcia, P de la Torre, A AF Benitez, MC Rubio, A Garcia, P de la Torre, A TI Different confidence measures for word verification in speech recognition SO SPEECH COMMUNICATION LA English DT Article DE key-word spotting; key-word verification; non-parametric classifiers; confidence measures ID DISCRIMINATIVE UTTERANCE VERIFICATION; INFORMATION; SYSTEM AB Recent research in Automatic Speech Recognition (ASR) technologies has shown key-word spotting (KWS) systems to be one of the most interesting options for accessing information using speech. KWS systems can accept spontaneous speech, which allows potential users to ask for information without learning complex protocols for human-machine communication. One of the most relevant aspects in KWS systems is the verification of key-word candidates. Utterances detected as key-words could be either 'false alarms' (non-key-words or incorrectly recognized key-words) or 'correct key-words'. The use of confidence measurements allows (by additional processing of the spoken sentence) the verification of the candidates and the decision as to whether each utterance must be accepted as a correctly recognized key-word or rejected as a false alarm. In this work we propose a novel method for verification in KWS systems based on phone models. Under our new approach, a phonemic speech recognizer decodes the spoken sentence in parallel with the KWS recognizer. The first one produces a phone string as output while the second one generates a key-word/filler-model string. By aligning both strings, a set of characteristics is extracted and used to verify the putative key-words. For this we have built two classifiers: in the first, the Euclidean metric is modified and adapted in a local and iterative way in order to give greater importance to the most discriminative directions between the classes. The second is a vector quantizer trained using adaptive learning techniques. We have applied the proposed method to several KWS tasks. Experimental results presented in this paper show that the proposed verification method improves the performance of KWS systems by reducing the false alarm rate without a significant increase in the rejection of correctly detected keywords. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Univ Granada, Fac Ciencias, Dept Elect & Tecnol Comp, E-18071 Granada, Spain. RP Benitez, MC (reprint author), Univ Granada, Fac Ciencias, Dept Elect & Tecnol Comp, E-18071 Granada, Spain.
EM carmen@hal.ugr.es RI de la Torre, Angel/C-6618-2012; Benitez Ortuzar, M Del Carmen/C-2424-2012; Prieto, Ignacio/B-5361-2013 CR BENITEZ MC, 1998, THESIS U GRANADA BLOOTHOOFT, 1997, ELSNET BOURLARD H, 1994, P IEEE INT C AC SPEE, P373 Caminero J, 1997, INT CONF ACOUST SPEE, P891, DOI 10.1109/ICASSP.1997.596079 Casacuberta F., 1991, P WORKSH INT COOP ST Chigier B., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. No.92CH3103-9), DOI 10.1109/ICASSP.1992.226112 Cole RA, 1997, SPEECH COMMUN, V23, P243, DOI 10.1016/S0167-6393(97)00049-6 COLEMAN EA, 1995, J VINYL ADDIT TECHN, V1, P1, DOI 10.1002/vnl.730010102 COVER TM, 1967, IEEE T INFORM THEORY, V13, P21, DOI 10.1109/TIT.1967.1053964 COX S, 1996, P INT C AC SPEECH SI, P511 DIAZ J, 1995, THESIS U GRANADA FOOTE JT, 1995, EUR C SPEECH COMM TE, P2145 Fukunaga K., 1990, STAT PATTERN RECOGNI, V2nd GARCIA P, 1996, THESIS U GRANADA GARCIA P, 1998, 1 INT C LANG RES EV, P1263 Gillick L, 1997, INT CONF ACOUST SPEE, P879, DOI 10.1109/ICASSP.1997.596076 GRAY RM, 1990, ENTROPY INFORMATION Hastie T, 1996, IEEE T PATTERN ANAL, V18, P607, DOI 10.1109/34.506411 James D.A., 1995, THESIS U CAMBRIDGE, P3 Kellner A, 1997, SPEECH COMMUN, V23, P95, DOI 10.1016/S0167-6393(97)00036-8 KOHONEN T, 1990, NEURAL NETWORK THEOR, P74 KOO MW, 1999, EUROSPEECH, P287 Lamel LF, 1997, SPEECH COMMUN, V23, P67, DOI 10.1016/S0167-6393(97)00037-X Lleida E, 1996, INT CONF ACOUST SPEE, P507, DOI 10.1109/ICASSP.1996.541144 MATHAN L, 1991, INT CONF ACOUST SPEE, P93, DOI 10.1109/ICASSP.1991.150286 NETI CV, 1997, P INT C AC SPEECH SI, P883 RICARDI G, 1997, INT C AC SPEECH SIGN, P1143 ROHLICEK J, 1993, P INT C AC SPEECH SI, V2, P459 ROSE RC, 1995, COMPUT SPEECH LANG, V9, P309, DOI 10.1006/csla.1995.0015 RUBIO AJ, 1997, EURO SPEECH, P1779 Rubio AJ, 1997, INT CONF ACOUST SPEE, P895, DOI 10.1109/ICASSP.1997.596080 Schaaf T, 1997, INT CONF ACOUST SPEE, P875, DOI 10.1109/ICASSP.1997.596075 Sukkar RA, 1996, IEEE T SPEECH AUDI P, V4, P420, DOI 10.1109/89.544527 Sukkar RA, 1997, SPEECH COMMUN, V22, P333, DOI 10.1016/S0167-6393(97)00031-9 *UCL, 1991, SAMSCL018 U COLL LON *UPC, 1993, SANA002 UPC U AUT BA Weintraub M, 1997, INT CONF ACOUST SPEE, P887, DOI 10.1109/ICASSP.1997.596078 WILCOX LD, 1991, EURO SPEECH WILPON JG, 1990, IEEE T ACOUST SPEECH, V38, P1870, DOI 10.1109/29.103088 ZUE V, 1997, EURO SPEECH, P2227 ZUE V, 1997, EURO SPEECH NR 41 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2000 VL 32 IS 1-2 BP 79 EP 94 DI 10.1016/S0167-6393(00)00025-X PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 350HM UT WOS:000089095100007 ER PT J AU Palmer, DD Ostendorf, M Burger, JD AF Palmer, DD Ostendorf, M Burger, JD TI Robust information extraction from automatically generated speech transcriptions SO SPEECH COMMUNICATION LA English DT Article DE information extraction; phrase language model; named entity; word recognition confidence AB This paper describes a robust system for information extraction (IE) from spoken language data. The system extends previous hidden Markov model (HMM) work in IE, using a state topology designed for explicit modeling of variable-length phrases and class-based statistical language model smoothing to produce state-of-the-art performance for a wide range of speech error rates. 
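One standard way to realise such class-based smoothing of an HMM state's word emissions is to interpolate the state-specific estimate with a class-level backoff, as in the Python sketch below. The interpolation form and the weight `lam` are assumptions for illustration, not the paper's exact model.

    def smoothed_emission(word, p_word_given_state, p_class_given_state,
                          p_word_given_class, word_to_class, lam=0.7):
        """Class-smoothed emission probability:
        P(w|s) ~ lam * P_ml(w|s) + (1 - lam) * P(c(w)|s) * P(w|c(w)).
        All probability tables are plain dicts mapping keys to floats."""
        cls = word_to_class.get(word, "<unk>")
        backoff = (p_class_given_state.get(cls, 0.0)
                   * p_word_given_class.get(cls, {}).get(word, 0.0))
        return lam * p_word_given_state.get(word, 0.0) + (1 - lam) * backoff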
Experiments on broadcast news data show that the system performs well with temporal and source differences in the data. In addition, strategies for integrating word-level confidence estimates into the model are introduced, showing improved performance by using a generic error token for incorrectly recognized words in the training data and low confidence words in the test data. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Univ Washington, Dept Elect Engn, Seattle, WA 98195 USA. Mitre Corp, Bedford, MA 01730 USA. RP Palmer, DD (reprint author), Univ Washington, Dept Elect Engn, Seattle, WA 98195 USA. CR Aberdeen J. B., 1995, P 6 MESS UND C MUC 6, P141, DOI 10.3115/1072399.1072413 Appelt D., 1999, P 1999 DARPA BROADC, P51 BENNETT S, 1997, P 2 C EMP METH NEUR Bikel DM, 1997, P 5 C APPL NAT LANG, P194, DOI 10.3115/974557.974586 Bikel DM, 1999, MACH LEARN, V34, P211, DOI 10.1023/A:1007558221122 Brill E., 1992, P 3 C APPL NAT LANG, P152, DOI 10.3115/974499.974526 Brill Eric, 1993, THESIS U PENNSYLVANI BURGER J, 1998, P COLING ACL 98 36 A, P201 CHINCHOR N, 1995, P 5 MESS UND C MUC5, P69 Gillick L., 1997, P INT C AC SPEECH SI, V2, P879 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X GOTOH Y, 1999, P ESCA TUT RES WORKS, P43 GOTOH Y, 1999, P INT C AC SPEECH SI, P513 HAAS J, 1997, P 2 SQEL WORKSH MULT, P65 HAAS J, 1997, MUSTERERKENNUNG 1997, P270 HOBBS J, 1997, FINITE STATE LANGUAG, P381 IYER R, 1997, P EUR C SPEECH COMM, V4, P1975 Kemp T., 1997, P EUR C SPEECH COMM, P827 Krupka G., 1995, P 6 MESS UND C MUC 6, P221, DOI 10.3115/1072399.1072419 Makhoul J., 1999, P DARPA BROADC NEWS, P249 METEER M, 1993, P INT C AC SPEECH SI, V2, P173 Miller D., 1999, P DARPA BROADC NEWS, P37 Mitchell P, 1993, COMPUTATIONAL LINGUI, V19, P313 Palmer D., 1999, P DARPA BROADC NEWS, P41 PRZYBOCKI M, 1998, P DARPA BROADC NEWS, P13 RENALS S, 1999, P DARPA BROADC NEWS, P47 RENALS S, 1999, P EUR C, V3, P1039 ROBINSON P, 1999, P DARPA BROADC NEWS, P27 SENEFF S, 1992, P 2 INT C SPOK LANG, V1, P317 Siu M., 1997, P EUR C SPEECH COMM, P831 WEINTRAUB M, 1997, P ICASSP, V2, P887 WITTEN IH, 1991, IEEE T INFORM THEORY, V37, P1085, DOI 10.1109/18.87000 NR 32 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2000 VL 32 IS 1-2 BP 95 EP 109 DI 10.1016/S0167-6393(00)00026-1 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 350HM UT WOS:000089095100008 ER PT J AU Delacourt, P Wellekens, CJ AF Delacourt, P Wellekens, CJ TI DISTBIC: A speaker-based segmentation for audio data indexing SO SPEECH COMMUNICATION LA English DT Article DE speaker turn detection; generalized likelihood ratio; Bayesian information criterion AB In this paper, we address the problem of speaker-based segmentation, which is the first necessary step for several indexing tasks. It aims to extract homogeneous segments containing the longest possible utterances produced by a single speaker. In our context, no assumption is made about prior knowledge of the speaker or speech signal characteristics (neither speaker model, nor speech model). However, we assume that people do not speak simultaneously and that we have no real-time constraints. We review existing techniques and propose a new segmentation method, which combines two different segmentation techniques. 
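The Bayesian information criterion named in this record's keywords is typically applied to speaker-change detection as a penalised comparison between modelling two adjacent windows with one Gaussian or with two. The Python/NumPy sketch below gives that standard formulation (it assumes each window has more frames than feature dimensions; the penalty weight and the exact variant used by DISTBIC may differ).

    import numpy as np

    def delta_bic(x, y, lam=1.0):
        """BIC change score for frame matrices x (n1 x d) and y (n2 x d);
        positive values favour a speaker turn between the two windows."""
        z = np.vstack([x, y])
        n, d = z.shape
        logdet = lambda m: np.linalg.slogdet(np.cov(m, rowvar=False))[1]
        gain = 0.5 * (n * logdet(z) - len(x) * logdet(x) - len(y) * logdet(y))
        # penalty for the extra mean and covariance parameters of the
        # two-model hypothesis
        penalty = 0.5 * lam * (d + d * (d + 1) / 2) * np.log(n)
        return gain - penalty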
This method, called DISTBIC, is organized into two passes: first the most likely speaker turns are detected, and then they are validated or discarded. The advantage of our algorithm is its efficiency in detecting speaker turns even close to one another (i.e., separated by a few seconds). (C) 2000 Elsevier Science B.V. All rights reserved. C1 Inst Eurecom, F-06904 Sophia Antipolis, France. RP Delacourt, P (reprint author), Inst Eurecom, 2229 Route des Cretes, F-06904 Sophia Antipolis, France. CR BEIGI HSM, 1998, WORLD C AUT BIMBOT F, 1995, SPEECH COMMUN, V17, P177, DOI 10.1016/0167-6393(95)00013-E BONASTRE JF, 2000, IN PRESS IEEE INT C Chen S.S., 1998, DARPA SPEECH REC WOR Gauvain J.L., 1998, INT C SPEECH LANG PR, V4, P1335 GISH H, 1991, INT CONF ACOUST SPEE, P873, DOI 10.1109/ICASSP.1991.150477 Gish H, 1994, IEEE SIGNAL PROC MAG, V11, P18, DOI 10.1109/79.317924 Godfrey J., 1992, ICASSP 92 IEEE INT C, V1, P517 LIU D, 1999, EUROSPEECH 99, V3, P1031 MONTACIE C, 1998, INT C SPOK LANG PROC, V4, P1579 NISHIDA M, 1998, INT C SPOK LANG PROC, V4, P1347 Nishida M, 1999, IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA COMPUTING AND SYSTEMS, PROCEEDINGS VOL 2, P466 REYNOLDS DA, 1998, INT C SPOK LANG PROC, V7, P3193 RISSANEN J, 1989, SERIES COMPUTER SCI, V15, pCH3 ROSENBERG AE, 1998, INT C SPOK LANG PROC, V4, P1339 Siegler M., 1997, DARPA SPEECH REC WOR, P97 TRITSCHLER A, 1998, THESIS I EURECOM FRA WOODLAND PC, 1997, DARPA SPEECH REC WOR, P97 NR 18 TC 101 Z9 110 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 2000 VL 32 IS 1-2 BP 111 EP 126 DI 10.1016/S0167-6393(00)00027-3 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 350HM UT WOS:000089095100009 ER PT J AU Shriberg, E Stolcke, A Hakkani-Tur, D Tur, G AF Shriberg, E Stolcke, A Hakkani-Tur, D Tur, G TI Prosody-based automatic segmentation of speech into sentences and topics SO SPEECH COMMUNICATION LA English DT Article DE sentence segmentation; topic segmentation; prosody; information extraction; automatic speech recognition; broadcast news; switchboard ID LANGUAGE MODEL; DISCOURSE; TEXT AB A crucial step in processing speech audio data for information extraction, topic detection, or browsing/playback is to segment the input into sentence and topic units. Speech segmentation is challenging, since the cues typically present for segmenting text (headers, paragraphs, punctuation) are absent in spoken language. We investigate the use of prosody (information gleaned from the timing and melody of speech) for these tasks. Using decision tree and hidden Markov modeling techniques, we combine prosodic cues with word-based approaches, and evaluate performance on two speech corpora, Broadcast News and Switchboard. Results show that the prosodic model alone performs on par with, or better than, word-based statistical language models - for both true and automatically recognized words in news speech. The prosodic model achieves comparable performance with significantly less training data, and requires no hand-labeling of prosodic events. Across tasks and corpora, we obtain a significant improvement over word-only models using a probabilistic combination of prosodic and lexical information. Inspection reveals that the prosodic models capture language-independent boundary indicators described in the literature. Finally, cue usage is task and corpus dependent. 
For example, pause and pitch features are highly informative for segmenting news speech, whereas pause, duration and word-based cues dominate for natural conversation. (C) 2000 Elsevier Science B.V. All rights reserved. C1 SRI Int, Speech Technol & Res Lab, Menlo Pk, CA 94025 USA. Bilkent Univ, Dept Comp Engn, TR-06533 Ankara, Turkey. RP Shriberg, E (reprint author), SRI Int, Speech Technol & Res Lab, 333 Ravenswood Ave, Menlo Pk, CA 94025 USA. EM ees@speech.sri.com; stolcke@speech.sri.com; hakkani@cs.bilkent.edu.tr; tur@cs.bilkent.edu.tr CR Allan J., 1998, P DARPA BROADC NEWS, P194 BAHL LR, 1989, IEEE T ACOUST SPEECH, V37, P1001, DOI 10.1109/29.32278 BAUM LE, 1970, ANN MATH STAT, V41, P164, DOI 10.1214/aoms/1177697196 Beeferman D, 1999, MACH LEARN, V34, P177, DOI 10.1023/A:1007506220214 Brown Gillian, 1980, QUESTIONS INTONATION BRUCE G, 1982, PHONETICA, V39, P274 BUNTINE W, 1992, INTRO IND VERSION 2 Cieri C., 1999, P DARPA BROADC NEWS, P57 *DARPA, 1997, CONV SPEECH REC WORK DERMATAS E, 1995, COMPUT LINGUIST, V21, P137 DIGALAKIS V, 1994, P IEEE INT C AC SPEE, V1, P537 DODDINGTON G, 1998, P DARPA BROADC NEWS, P223 Entropic Research Laboratory Washington D.C., 1993, ESPS VERS 5 0 PROGR Godfrey J., 1992, ICASSP 92 IEEE INT C, V1, P517 GRAFF D, 1997, P DARPA WORKSH SPOK, P11 Grosz B., 1992, P INT C SPOK LANG PR, V1, P429 Hakkani-Tur D., 1999, P 6 EUR C SPEECH COM, V5, P1991 Hearst MA, 1997, COMPUT LINGUIST, V23, P33 HEEMAN PA, 1997, P 35 ANN M ASS COMP Hirschberg J., 1996, P 34 ANN M ASS COMP, P286, DOI 10.3115/981863.981901 KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 Koopmans-Van Beinum F. J., 1996, P INT C SPOK LANG PR, V3, P1724, DOI 10.1109/ICSLP.1996.607960 Kozima H., 1993, P 31 ANN M ASS COMP, P286, DOI DOI 10.1016/S0306-4573(02)00035-3 Kubala F., 1998, P DARPA BROADC NEWS, P287 Lehiste I, 1980, STRUCTURE PROCESS SP, P195 LEHISTE L, 1979, FRONTIERS SPEECH COM, P191 LIU D, 1999, P ESCA EUR 99 BUD HU, V3, P1031 *LVCSR, 1999, LVCSR HUB 5 WORKSH L MacIntyre R, 1995, DYSFLUENCY ANNOTATIO NAKAJIMA S, 1997, COMPUTING PROSODY CO, P81 Olshen R., 1984, CLASSIFICATION REGRE, V1st Palmer DD, 1997, COMPUT LINGUIST, V23, P241 PRZYBOCKI MA, 1999, P 6 EUR C SPEECH COM, V5, P2215 SANKAR A, 1998, P DARPA BROADC NEWS, P91 Shriberg E., 1999, P INT C PHON SCI SAN, P619 Shriberg E., 1998, LANG SPEECH, V41, P439 Shriberg E., 1997, P 5 EUR C SPEECH COM, V5, P2383 SILVERMAN K, 1987, THESIS CAMBRIDGE U C SLUIJTER A, 1994, PHONETICA, V50, P180 Sonmez K., 1998, P INT C SPOK LANG PR, P3189 SONMEZ K, 1999, P 6 EUR C SPEECH COM, V5, P2219 Stolcke A, 1999, P DARPA BROADC NEWS, P61 Stolcke A., 1996, P INT C SPOK LANG PR, P1005, DOI 10.1109/ICSLP.1996.607773 Stolcke A, 1998, P INT C SPOK LANG PR, P2247 Swerts M, 1997, SPEECH COMMUN, V22, P25, DOI 10.1016/S0167-6393(97)00011-3 Swerts M, 1997, J ACOUST SOC AM, V101, P514, DOI 10.1121/1.418114 SWERTS M, 1994, LANG SPEECH, V37, P21 Talkin D., 1995, SPEECH CODING SYNTHE THORSEN NG, 1985, J ACOUST SOC AM, V77, P1205, DOI 10.1121/1.392187 TUR G, IN PRESS COMPUTATION Vaissiere Jacqueline, 1983, PROSODY MODELS MEASU, P53 VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 Yamron J.P., 1998, P IEEE C AC SPEECH S, V1, P333, DOI 10.1109/ICASSP.1998.674435 NR 53 TC 126 Z9 129 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
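One simple way to realise the probabilistic combination of prosodic and lexical cues described in the record above is a log-linear interpolation of the two boundary posteriors. A minimal sketch, where the interpolation weight and the example posteriors are invented; the paper's actual models (decision trees, HMMs) are not reproduced here:

    def combined_boundary_posterior(p_prosody, p_lm, weight=0.5):
        """Log-linear interpolation of two boundary posteriors, renormalised."""
        yes = (p_prosody ** weight) * (p_lm ** (1.0 - weight))
        no = ((1.0 - p_prosody) ** weight) * ((1.0 - p_lm) ** (1.0 - weight))
        return yes / (yes + no)

    # A word boundary where the pause-based model is confident but the LM is not:
    print(round(combined_boundary_posterior(0.9, 0.4), 3))  # 0.71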
PD SEP PY 2000 VL 32 IS 1-2 BP 127 EP 154 DI 10.1016/S0167-6393(00)00028-5 PG 28 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 350HM UT WOS:000089095100010 ER PT J AU Billi, R AF Billi, R TI Special issue on interactive voice technology for telecommunication applications (IVTTA'98) SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Ctr Studi & Lab Telecomun SpA, I-10148 Turin, Italy. RP Billi, R (reprint author), Ctr Studi & Lab Telecomun SpA, Via G Reiss Romoli 274, I-10148 Turin, Italy. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2000 VL 31 IS 4 BP 277 EP 277 DI 10.1016/S0167-6393(99)00061-8 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333EH UT WOS:000088115800001 ER PT J AU Attwater, D Edgington, M Durston, P Whittaker, S AF Attwater, D Edgington, M Durston, P Whittaker, S TI Practical issues in the application of speech technology to network and customer service applications SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 98) CY SEP 29-30, 1998 CL TURIN, ITALY SP IEEE Commun Soc, European Speech Commun Assoc DE speech recognition; dialogue modelling; network service automation; address recognition; natural language processing; semantic classification AB This paper proposes a simple model to characterise the different stages of short telephone transactions. It also discusses the impact of the context of the caller when entering an automated service. Three different styles of service were then identified, namely, large vocabulary information gathering, spoken language command and natural language task identification for helpdesks. By considering human dialogue equivalents, the requirements for each style are examined. Consequently, it is shown that each style pushes different technological limits. Three case studies, selected from current projects at BT Laboratories, are presented to highlight the practical design issues in these different styles. The styles and case studies presented are: Information gathering - UK name and address recognition. Spoken language command - network service configuration. Natural language helpdesks - BT operator services. It is shown that large vocabulary information gathering systems require high accuracy, careful data modelling and well-designed strategies to boost confidence and accuracy. Spoken language command requires dialogue and grammar design and test complexity to be managed. Natural language task identification requires large volumes of training data, good learning algorithms and good data generalisation techniques. These styles can be mixed into a single interaction, meaning that design frameworks of the future will have to address all of the aspects of the different interaction styles. (C) 2000 Elsevier Science B.V. All rights reserved. C1 BT Labs, Mobil & Network Serv, Ipswich IP5 3RE, Suffolk, England. SRI Int, Menlo Pk, CA 94025 USA. RP Attwater, D (reprint author), BT Labs, Mobil & Network Serv, Martlesham Heath, Ipswich IP5 3RE, Suffolk, England.
EM david.attwater@bt.com CR ATTWATER, 1998, BT TELECOMMUNICATION, V11 ATTWATER DJ, 1996, P IOA 18 9 ATTWATER DJ, 1998, P AVIOS 1998 ATTWATER DJ, 1998, P VOIC EUR 1998 ATTWATER DJ, P AVIOS 1997, P97 BENNACEF SK, 1995, P ESCA WORKSH DIAL S COLE, 1997, FREE SPEECH J Garner PN, 1997, COMPUT SPEECH LANG, V11, P275, DOI 10.1006/csla.1997.0032 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X Lee C.-H., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727690 MCINNES F, 1999, IN PRESS P EUR BUD 1 PESKIN B, 1997, ICASSP 97 NR 12 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2000 VL 31 IS 4 BP 279 EP 291 DI 10.1016/S0167-6393(99)00062-X PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333EH UT WOS:000088115800002 ER PT J AU Chang, HM AF Chang, HM TI Is ASR ready for wireless primetime: Measuring the core technology for selected applications SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 98) CY SEP 29-30, 1998 CL TURIN, ITALY SP IEEE Commun Soc, European Speech Commun Assoc DE automatic speech recognition; wireless applications; speech analysis; performance assessment; benchmarking; digits recognition; noisy speech recognition AB It is estimated that by the end of 2001 as many as 500 million people worldwide will use cellular services. The nature of hands-busy and eyes-busy situations inherent in the anywhere and anytime wireless communication paradigm presents exciting marketing opportunities and, at the same time, unique technical challenges to the current-generation ASR technology and its new applications. Current industry trends clearly show that incorporating ASR technology into existing or new wireless services as a replacement for touch-tone input is a natural progression in user interface. But is the current-generation ASR technology ready for prime time over wireless channels? Both qualitative and quantitative assessments for the core technology must be adopted by the industry before answering this question. In this paper, we will describe a set of benchmark tasks designed to evaluate the state-of-the-art ASR technologies from a wireless perspective and present the results of these benchmark tests on two commercially available software-based ASR systems that represent the best core ASR technology on the market. (C) 2000 Elsevier Science B.V. All rights reserved. C1 SBC Technol Resources, Speech & Comp Technol, Austin, TX 78759 USA. RP Chang, HM (reprint author), SBC Technol Resources, Speech & Comp Technol, 9505 Arboretum Blvd, Austin, TX 78759 USA. EM hchang@tri.sbc.com CR Chang H. M., 1996, Proceedings. Third IEEE Workshop on Interactive Voice Technology for Telecommunications Applications. IVTTA-96 (Cat. No.96TH8178), DOI 10.1109/IVTTA.1996.552764 CHANG HM, 1989, SPEECHTEK 89 NEW YOR CHIEN JT, 1997, P EUROSPEECH 97 SEPT, P2563 DODDINGTON GR, 1986, SPEECH SCI PUBLICATI, P556 HAEBUMACH R, 1997, P EUROSPEECH 97 SEPT, P2427 Karray L., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat.
No.98TH8376), DOI 10.1109/IVTTA.1998.727714 MAUUARY L, 1993, P EUROSPEECH 93, P1097 RIETMAN J, 1997, 13437 IDC LINK ROSE RC, 1995, P INT C AC SPEECH SI, P281 WILPON JG, 1990, IEEE T ACOUST SPEECH, V38, P1870, DOI 10.1109/29.103088 NR 10 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2000 VL 31 IS 4 BP 293 EP 307 DI 10.1016/S0167-6393(99)00063-1 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333EH UT WOS:000088115800003 ER PT J AU Lee, CH Carpenter, B Chou, W Chu-Carroll, J Reichl, W Saad, A Zhou, QR AF Lee, CH Carpenter, B Chou, W Chu-Carroll, J Reichl, W Saad, A Zhou, QR TI On natural language call routing SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 98) CY SEP 29-30, 1998 CL TURIN, ITALY SP IEEE Commun Soc, European Speech Commun Assoc DE acoustic modeling; automatic speech recognition; call routing; dialogue processing; disambiguation dialogue; hidden Markov model; information retrieval; interactive voice response; language modeling; touch tone interface ID HIDDEN MARKOV-MODELS; RECOGNITION AB Automated call routing is the process of associating a user's request with the desired destination. Although some of the call routing functions can often be accomplished through the use of a touch-tone menu in an interactive voice response system, the interaction between the user and such a system is typically very limited. It is therefore desirable to have a call routing system that takes natural language spoken inputs from the user and asks for additional information to complete the user's request as a human agent would. In this paper we present a recent study on natural language call routing and discuss the capabilities and limitations of current technologies. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Lucent Technol, Bell Labs, Dialogue Syst Res Dept, Murray Hill, NJ 07974 USA. RP Lee, CH (reprint author), Lucent Technol, Bell Labs, Dialogue Syst Res Dept, 600 Mt Ave, Murray Hill, NJ 07974 USA. EM chl@research.bell-labs.com CR ABELLA GA, 1997, P EUROSPEECH 97 RHOD, P1879 CHUCARROLL J, 1998, P ACL COLING MONTR DEERWESTER S, 1990, J AM SOC INFORM SCI, V41, P391, DOI 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X HARMAN D, 1995, P TREC Kawahara T, 1998, IEEE T SPEECH AUDI P, V6, P558, DOI 10.1109/89.725322 LEE CH, 1996, P ICSLP 96, P957 MCDONOUGH J, 1994, P INT C AC SPEECH SI, P385 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 REICHL W, 1998, P ICSLP 98 SYDN REICHL W, 1998, P ICASSP 98 SEATTL SALTON G, 1971, SMART INFORMATION RE SPROAT R, 1997, MULTILINGUAL TEXT SP THOMSON DL, 1997, P ASRU 97 WORKSH, P511 WILPON JG, 1990, IEEE T ACOUST SPEECH, V38, P1870, DOI 10.1109/29.103088 ZHOU Q, 1997, P ICASSP 97 MUN ZHOU Q, 1997, P EUROSPEECH 97 RHOD 1998, BUSINESS WEEK 0223, P60 NR 18 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
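Given the information-retrieval references in the call routing record above (SALTON's SMART system, DEERWESTER's latent semantic analysis), one plausible reading of the routing step is retrieval of the destination whose vocabulary best matches the caller's request. A minimal vector-space sketch, with the destinations, their vocabularies and the query all invented for illustration:

    import math
    from collections import Counter

    # Each destination is represented by a bag of words weighted by counts;
    # these toy vocabularies are invented, not taken from the paper.
    DESTINATIONS = {
        "billing": Counter({"bill": 3, "charge": 2, "payment": 2}),
        "repair": Counter({"broken": 3, "repair": 2, "line": 1}),
    }

    def cosine(a, b):
        dot = sum(a[k] * b[k] for k in a if k in b)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    def route(utterance):
        """Route the caller's words to the closest destination vector."""
        query = Counter(utterance.lower().split())
        return max(DESTINATIONS, key=lambda d: cosine(query, DESTINATIONS[d]))

    print(route("there is a charge on my bill I do not understand"))  # billing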
PD AUG PY 2000 VL 31 IS 4 BP 309 EP 320 DI 10.1016/S0167-6393(99)00064-3 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333EH UT WOS:000088115800004 ER PT J AU Gupta, V Robillard, S Pelletier, C AF Gupta, V Robillard, S Pelletier, C TI Automation of locality recognition in ADAS plus SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 98) CY SEP 29-30, 1998 CL TURIN, ITALY SP IEEE Commun Soc, European Speech Commun Assoc DE speech recognition; directory assistance automation; directory assistance; city name recognition AB In North America, people call the directory assistance operator to find the phone number of a business or residential listing. The directory assistance service is generally maintained by telcos, and it represents a significant cost to them. Partial or complete automation of directory assistance would result in significant cost savings for telcos. Nortel Networks has a product called Automated Directory Assistance System (ADAS) Plus which partially automates this directory assistance function through the use of speech recognition. The system has been deployed all across Quebec, through most of US West and BellSouth. ADAS Plus primarily automates the response to the question "for what city?" through speech recognition. We give details of this speech recognition system and outline its performance in the deployed regions. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Nortel Networks, Nuns Isl, PQ H3E 1H6, Canada. RP Gupta, V (reprint author), Speech Works Int, Montreal, PQ, Canada. EM vgupta@speechworks.com CR Billi R., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727685 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 GUPTA VN, 1988, J ACOUST SOC AM, V84, P2007, DOI 10.1121/1.397045 KASPAR B, 1997, VOIC 97 GERM Kellner A., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727687 KELLNER A, 1997, P IEEE WORKSH AUT SP, P566 LENNIG M, 1995, SPEECH COMMUN, V17, P227, DOI 10.1016/0167-6393(95)00024-I LENNIG M, 1992, Patent No. 5097509 NR 8 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2000 VL 31 IS 4 BP 321 EP 328 DI 10.1016/S0167-6393(99)00065-5 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333EH UT WOS:000088115800005 ER PT J AU Schramm, H Rueber, B Kellner, A AF Schramm, H Rueber, B Kellner, A TI Strategies for name recognition in automatic directory assistance systems SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 98) CY SEP 29-30, 1998 CL TURIN, ITALY SP IEEE Commun Soc, European Speech Commun Assoc DE directory information; joint recognition; confidence; database constraints; large vocabulary; spelling ID ISSUES AB The commercial viability of automating large scale directory assistance is shown by presenting new results on the recognition of large numbers of different names. 
Satisfactory recognition performance is achieved by employing a stochastic combination of N-best lists retrieved from multiple user utterances with the telephone database as an additional knowledge source. The strategy is used in a prototype of a fully automated directory information system which is designed to cover a whole country: After the city has been selected, the user is asked for first and last name of the desired person and, if necessary, also for the street or a spelling of the last name. Confidence measures are used for an optimal dialogue flow. We present results of different recognition strategies for databases of various sizes with up to 1.3 million entries (city of Berlin). The experiments show that for cooperative users more than 90% of all simple requests can be automated. Although many practical problems remain to be overcome in the field, such as database and lexicon management and acquainting users with the new systems, the authors deem the technology to be highly relevant for commercial deployment. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Philips Res Labs, D-52021 Aachen, Germany. RP Kellner, A (reprint author), Philips Res Labs, POB 1980, D-52021 Aachen, Germany. EM hauke.schramm@philips.com; bernhard.rueber@philips.com; andreas.kellner@philips.com CR Attwater DJ, 1996, BT TECHNOL J, V14, P177 AUST H, 1995, ESCA WORKSH SPOK DIA, P121 BILLI R, 1998, IVTTA TOR SEPT, P11 Haeb-Umbach R., 1992, P IEEE INT C AC SPEE, V1, P13 KAMM CA, 1995, SPEECH COMMUN, V17, P303, DOI 10.1016/0167-6393(95)00023-H KASPAR B, 1995, P EUROSPEECH, P1161 KELLNER A, 1998, P ICSLP SYDN DEC 199, V7, P2859 Kellner A., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727687 Lennig M., 1994, Proceedings. Second IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 94) (Cat. No.94TH0695-7), DOI 10.1109/IVTTA.1994.341553 Meyer M., 1997, P EUROSPEECH, V3, P1579 RUEBER B, 1997, P EUROSPEECH, V2, P739 SEIDE F, 1997, P EUROSPEECH, V3, P1327 WHITTAKER SJ, 1995, ESCA WORKSH SPOK DIA, P113 NR 13 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2000 VL 31 IS 4 BP 329 EP 338 DI 10.1016/S0167-6393(99)00066-7 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333EH UT WOS:000088115800006 ER PT J AU Lamel, L Rosset, S Gauvain, JL Bennacef, S Garnier-Rizet, M Prouts, B AF Lamel, L Rosset, S Gauvain, JL Bennacef, S Garnier-Rizet, M Prouts, B TI The LIMSI ARISE system SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 98) CY SEP 29-30, 1998 CL TURIN, ITALY SP IEEE Commun Soc, European Speech Commun Assoc DE spoken language systems; speech recognition; speech understanding; natural language understanding; information retrieval dialog ID LANGUAGE AB The LIMSI ARISE system provides vocal access by telephone to rail travel information for main French intercity connections, including timetables, simulated fares and reservations, reductions and services. Our goal is to obtain high dialog success rates with a very open interaction, where the user is free to ask any question or to provide any information at any point in time.
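Both the directory-assistance record above ("Confidence measures are used for an optimal dialogue flow") and the ARISE record that continues below fall back to more constrained dialogue when recognition is uncertain. A minimal sketch of such a thresholded policy, with the thresholds invented for illustration:

    def next_dialogue_action(confidence, accept=0.85, confirm=0.5):
        """Choose the next system action from the recogniser's confidence."""
        if confidence >= accept:
            return "accept"        # proceed with the hypothesis
        if confidence >= confirm:
            return "confirm"       # implicit or explicit confirmation turn
        return "reask_or_spell"    # constrain the dialogue further

    for c in (0.92, 0.7, 0.3):
        print(c, "->", next_dialogue_action(c))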
In order to improve performance with such an open dialog strategy, we make use of implicit confirmation using the caller's wording (when possible), and change to a more constrained dialog level when the dialog is not going well. (C) 2000 Elsevier Science B.V. All rights reserved. C1 CNRS, LIMSI, Spoken Language Proc Grp, F-91403 Orsay, France. VECSYS, F-91952 Courtaboeuf, France. RP Lamel, L (reprint author), CNRS, LIMSI, Spoken Language Proc Grp, F-91403 Orsay, France. CR ADDA G, 1997, P JOURN SCI TECHN RE, P35 Baggia P., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727701 Bennacef S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607176 BENNACEF SK, 1994, P ICSLP 94 YOK, V3, P1271 BENNACEF SK, 1995, P ESCA WORKSH SPOK D, P23 BLASBAND M, 1998, P NIMES 98 NIM, P207 BOUWMAN G, 1999, P IEEE ICASSP 99 PHO, V1, P493 Bouwman G., 1998, P 1 INT C LANG RES E, P191 BRUCE B, 1975, ARTIF INTELL, V6, P327, DOI 10.1016/0004-3702(75)90020-X *EAGLES, 1998, HDB STAND RES SPOK L, V3 FILLMORE Charles, 1968, UNIVERSALS LINGUISTI GAUVAIN JL, 1994, SPEECH COMMUN, V15, P21, DOI 10.1016/0167-6393(94)90038-8 GAUVAIN JL, 1996, I ELECT INFORMATIO D, V79, P2005 GAUVAIN JL, 1996, P ICSLP 96 PHIL, P1672 GITTON A, 1998, SPOKEN LANGUAGE UNDE KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 Lamel L., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727722 LAMEL L, 1999, P IEEE ICASSP 99 PHO, V1, P501 LAMEL LF, 1993, P ESCA NATO WORKSH A, P207 Lamel LF, 1997, SPEECH COMMUN, V23, P67, DOI 10.1016/S0167-6393(97)00037-X Lavelle C.-A., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727705 POPOVICI C, 1997, P ICASSP MUN GERM, P815 *RAILT, 1995, RESULTS FIELD TRIALS *RAILT, 1995, DEF EV METH FIELD TR Sanderman A., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727700 Wessel F, 1998, INT CONF ACOUST SPEE, P225, DOI 10.1109/ICASSP.1998.674408 1992, P DARPA SPEECH NAT L, P7 NR 27 TC 23 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2000 VL 31 IS 4 BP 339 EP 353 DI 10.1016/S0167-6393(99)00067-9 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333EH UT WOS:000088115800007 ER PT J AU Baggia, P Castagneri, G Danieli, M AF Baggia, P Castagneri, G Danieli, M TI Field trials of the Italian ARISE train timetable system SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 98) CY SEP 29-30, 1998 CL TURIN, ITALY SP IEEE Commun Soc, European Speech Commun Assoc DE spoken dialogue system; speech recognition; human-machine interaction; field trial evaluation; public transport information AB This paper reports results from two field trials of the CSELT ARISE spoken dialogue system in the Italian railway call centre FS-Informa.
The system provides voice-driven access to railway timetables for the major Italian and some European cities. On the basis of the initial experiences, we have been able to integrate the automatic system in the architecture of a typical railway call centre, where timetable information is exchanged with a caller via a spoken dialogue system, and where a human operator is involved only for answering more complex user requests. We argue that the results we present are relevant from different points of view. They allowed us to test the impact of the automatic system on the working routines of the human operators, and the reactions of real callers who are traditionally served by human operators. (C) 2000 Published by Elsevier Science B.V. All rights reserved. C1 Ctr Studi & Lab Telecomun SpA, I-10148 Turin, Italy. RP Baggia, P (reprint author), Ctr Studi & Lab Telecomun SpA, Via G Reiss Romoli 274, I-10148 Turin, Italy. EM baggia@cselt.it; castagneri@cselt.it; danieli@cselt.it CR Albesano D., 1997, International Journal of Speech Technology, V2, DOI 10.1007/BF02208822 BAGGIA P, 1999, P EUROSPEECH SEPT BU, P1767 BAGGIA P, 1998, P IVTTA SEPT 1998 TU, P57 Billi R, 1997, SPEECH COMMUN, V23, P83, DOI 10.1016/S0167-6393(97)00041-1 DANIELI M, 1996, P AAAI 96 WORKSH DET, P87 DANIELI M, 1997, P ACL EACL WORKSH SP DENOS E, 1999, P EUR C SPEECH TECHN, P1527 FERGUSON G, 1998, P 15 NAT C ART INT A FISSORE L, 1995, P EUROSPEECH MADR, V1, P799 GEMELLO R, 1997, P IEEE INT C NEUR NE, P2107 GIOVARA C, 1998, THESIS POLITECNICO T Lamel L., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat. No.98TH8376), DOI 10.1109/IVTTA.1998.727722 LAVELLE CA, 1999, P EUROSPEECH SEPT 19, P1399 POPOVICI C, 1998, P ICSLP 98 SYDN, V2, P397 ROSSET S, 1999, P EUR, P1535 STURM J, P EUROSPEECH SEPT 19, V1419, P99 Tchou C., 1999, P EUROSPEECH SEPT 19, P1531 VANHAAREN L, 1998, P LREC, P655 NR 18 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2000 VL 31 IS 4 BP 355 EP 367 DI 10.1016/S0167-6393(99)00068-0 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333EH UT WOS:000088115800008 ER PT J AU Azevedo, J Beires, N Charpentier, F Farrell, M Johnston, D LeFlour, E Micca, G Militello, S Schroeder, K AF Azevedo, J Beires, N Charpentier, F Farrell, M Johnston, D LeFlour, E Micca, G Militello, S Schroeder, K TI Multilinguality in voice activated information services: The P502 EURESCOM project SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 98) CY SEP 29-30, 1998 CL TURIN, ITALY SP IEEE Commun Soc, European Speech Commun Assoc DE multilinguality; IVR; cross-lingual; ASR AB The paper describes the multilingual system developed within the framework of the P502 EURESCOM project. The system described provides information about major telephone services available in the UK, Germany, France, Italy and Portugal in five languages. We present the results of a number of experiments carried out in the five countries, aiming to answer some fundamental questions concerning the exploitation of a multilingual service. Both technological and interface design issues have been investigated and several alternatives have been tested.
We compared speech recognition accuracy and successful transaction completion rates of GSM and PSTN networks, and evaluated cross-country and cross-language effects. Using a new methodological approach to assessment, a powerful predictive model was developed. This model allowed users' subjective ratings to be predicted from objective measurements. The results showed that an average Transaction Success rate of more than 92% was obtained when speech recognisers exhibiting good Word Recognition Accuracy were coupled to suitable dialogue interfaces in the IVR system. (C) 2000 Published by Elsevier Science B.V. All rights reserved. C1 CSELT SpA, SA V R, I-10148 Turin, Italy. Portugal Telecom SA, INESCTEL, Serv & Applicat, P-4050 Oporto, Portugal. France Telecom, CNET, DIH, RCP, F-22307 Lannion, France. British Telecommun PLC, MLB3 5E BT Labs, Ipswich IP5 3RE, Suffolk, England. Deutsch Telekom Berkom GmbH, D-10589 Berlin, Germany. RP Micca, G (reprint author), CSELT SpA, SA V R, Via G Reiss Romoli 274, I-10148 Turin, Italy. EM giorgio.micca@cselt.it CR BILLI R, 1996, IVTTA 96, P129 DANIELI M, 1995, AAAI SPRING S EMP ME, P34 Fienberg SE, 1977, ANAL CROSS CLASSIFIE Kirk RE, 1982, EXPT DESIGN PROCEDUR LEONARDI F, 1997, EUROSPEECH 97, P1771 Nielsen J., 1993, USABILITY ENG Norman K. L., 1991, PSYCHOL MENU SELECTI NR 7 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 2000 VL 31 IS 4 BP 369 EP 379 DI 10.1016/S0167-6393(99)00069-2 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 333EH UT WOS:000088115800009 ER PT J AU Andre-Obrecht, R AF Andre-Obrecht, R TI Special issue on speaker recognition and its commercial and forensic applications SO SPEECH COMMUNICATION LA English DT Editorial Material NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2000 VL 31 IS 2-3 BP 87 EP 88 DI 10.1016/S0167-6393(00)00031-5 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100001 ER PT J AU Besacier, L Bonastre, JF Fredouille, C AF Besacier, L Bonastre, JF Fredouille, C TI Localization and selection of speaker-specific information with statistical modeling SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA DE speaker recognition; speaker-specific information; on-line selection; pruning; statistical modeling; time-frequency architecture ID IDENTIFICATION; VERIFICATION AB Statistical modeling of the speech signal has been widely used in speaker recognition. The performance obtained with this type of modeling is excellent in laboratories but decreases dramatically for telephone or noisy speech. Moreover, it is difficult to know which piece of information is taken into account by the system. In order to solve this problem and to improve the current systems, a better understanding of the nature of the information used by statistical methods is needed. This knowledge should make it possible to select only the relevant information or to add new sources of information. The first part of this paper presents experiments that aim at localizing the most useful acoustic events for speaker recognition.
The relation between discriminant ability and the nature of speech events is studied. In particular, the phonetic content, the signal stability and the frequency domain are explored. Finally, the potential of dynamic information contained in the relation between a frame and its p neighbours is investigated. In the second part, the authors suggest a new selection procedure designed to select the pertinent features. Conventional feature selection techniques (ascendant selection, knock-out) allow only global and a posteriori knowledge about the relevance of an information source. However, some speech clusters may be very effective for recognizing a particular speaker, whereas they can be non-informative for another one. Moreover, some information classes may be corrupted or even missing for particular recording conditions. This necessity for speaker-specific processing and for adaptability to the environment (with no a priori knowledge of the degradation affecting the signal) leads the authors to propose a system that automatically selects the most discriminant parts of a speech utterance. The proposed architecture divides the signal into different time-frequency blocks. The likelihood is calculated after dynamically selecting the most useful blocks. This information selection leads to a significant error rate reduction (up to 41% of relative error rate decrease on TIMIT) for short training and test durations. Finally, experiments in the case of simulated noise degradation show that this approach is a very efficient way to deal with partially corrupted speech. (C) 2000 Published by Elsevier Science B.V. All rights reserved. C1 Lab Informat Avignon LIA CERI Agroparc, F-84911 Avignon 9, France. RP Besacier, L (reprint author), Lab Informat Avignon LIA CERI Agroparc, 339 Chemin Meinajaries,BP 1228, F-84911 Avignon 9, France. CR BERNASCONI C, 1990, SPEECH COMMUN, V9, P129, DOI 10.1016/0167-6393(90)90066-I BESACIER L, 1998, THESIS U AVIGNON BESACIER L, 1998, P C SPEAK REC ITS CO BESACIER L, 1998, P IEEE INT C AC SPEE BESACIER L, 1997, P 1 INT C AUD VIS BA, P195 BIMBOT F, 1995, SPEECH COMMUN, V17, P177, DOI 10.1016/0167-6393(95)00013-E BONASTRE JF, 1994, WORKSH AUT SPEAK REC, P157 CHARLET D, 1996, 21 JOURN ET PAR AV F, P399 CHARLET D, 1998, P C SPEAK REC ITS CO FISHER WM, 1987, J ACOUST SOC AM, V81, pS92, DOI 10.1121/1.2034854 FREDOUILLE C, 1998, P C SPEAK REC ITS C Fukunaga K., 1990, STAT PATTERN RECOGNI, V2nd FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 FURUI S, 1994, WORKSH AUT SPEAK REC, P1 FURUI S, 1997, P 1 INT C AUD VID BA, P237 Gish H, 1994, IEEE SIGNAL PROC MAG, V11, P18, DOI 10.1109/79.317924 JANKOWSKI C, 1990, P IEEE INT C AC SPEE LINDBERG J, 1998, P C SPEAK REC ITS CO MAGRINCHAGNOLEA.I, 1995, P EUR 95 MADR SPAIN MAGRINCHAGNOLEA.I, 1997, THESIS PARIS Markov K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607970 MELLA O, 1992, 19 JOURN ET PAR BRUS, P549 MONTACIE C, 1993, P EUROSPEECH 93, P161 Nolan F, 1983, PHONETIC BASES SPEAK REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D SAMBUR MR, 1975, IEEE T ACOUST SPEECH, VAS23, P176, DOI 10.1109/TASSP.1975.1162664 SOONG FK, 1988, IEEE T ACOUST SPEECH, V36, P871, DOI 10.1109/29.1598 van Vuuren S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat.
No.96TH8206), DOI 10.1109/ICSLP.1996.607976 NR 28 TC 18 Z9 18 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2000 VL 31 IS 2-3 BP 89 EP 106 DI 10.1016/S0167-6393(99)00070-9 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100002 ER PT J AU Cerrato, L Falcone, M Paoloni, A AF Cerrato, L Falcone, M Paoloni, A TI Subjective age estimation of telephonic voices SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA AB The aim of this study is to investigate the extent to which listeners can judge some unknown speakers' characteristics. In particular, we investigated the accuracy of assessing age and gender by only hearing the speaker's voice recorded over the telephone line. Moreover, we tried to evaluate the reliability of subjective age estimation in the forensic context. The results of the statistical analysis we carried out show that listeners are capable of assigning a general chronological age category to a voice without seeing or knowing the talker, and they are able to distinguish between male and female voices transmitted over the telephone line. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Fdn Ugo Bordoni, I-00149 Rome, Italy. RP Paoloni, A (reprint author), Fdn Ugo Bordoni, 59 Via Baldassarre Castiglione, I-00149 Rome, Italy. CR BRAUN A, 1996, J SPEECH LANGUAGE LA, V3, P65 Chin S. B., 1997, ALCOHOL SPEECH HORII Y, 1981, FOLIA PHONIATR, V33, P227 KUNZEL HJ, 1989, PHONETICA, V46, P117 LASS NJ, 1980, J PHONETICS, V8, P91 LINVILLE SE, 1985, J ACOUST SOC AM, V78, P40, DOI 10.1121/1.392452 Mullennix JW, 1995, J ACOUST SOC AM, V98, P3080, DOI 10.1121/1.413832 NEIMAN GS, 1990, FOLIA PHONIATR, V42, P327 NOLAN F, 1996, HDB PHONETIC SCI, P744 ROHLFS G, 1996, STUDI RICERCHE LINGU RYAN WJ, 1974, J COMMUN DISORD, V7, P181, DOI 10.1016/0021-9924(74)90030-6 SHIPP T, 1969, J SPEECH HEAR RES, V12, P703 *SYSTAT, 1992, SYSTAT WIND VERS 5 NR 13 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2000 VL 31 IS 2-3 BP 107 EP 112 DI 10.1016/S0167-6393(99)00071-0 PG 6 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100003 ER PT J AU Charlet, D Jouvet, D Collin, O AF Charlet, D Jouvet, D Collin, O TI An alternative normalization scheme in HMM-based text-dependent speaker verification SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA DE HMM-based speaker verification; Viterbi alignment; duration model; Bayesian training AB This paper proposes a normalization scheme for HMM-based text-dependent speaker verification in which the claimed-speaker model score and the background model score are computed for a common alignment made on the speaker-independent model of the password. It is shown that such a normalization preserves some speaker-specific information contained in the alignment and makes the normalization score more consistent in emphasizing salient parts of the claimed-speaker model. A special training procedure is proposed.
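A minimal sketch of the common-alignment scoring idea just described: the password utterance is aligned once on the speaker-independent model, and the claimed-speaker and background models are then scored on that same alignment before forming a log-likelihood ratio. The per-state Gaussian parameters, frames and alignment below are invented for illustration:

    import numpy as np

    def score_on_alignment(frames, alignment, model):
        """Sum of per-frame log-likelihoods under per-state Gaussians."""
        total = 0.0
        for x, state in zip(frames, alignment):
            mu, var = model[state]
            total += -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
        return total

    frames = np.array([0.1, 0.2, 1.1, 0.9])      # toy 1-D acoustic features
    alignment = [0, 0, 1, 1]                     # from Viterbi on the SI model
    claimed = {0: (0.0, 1.0), 1: (1.0, 1.0)}     # claimed-speaker model
    background = {0: (0.5, 2.0), 1: (0.5, 2.0)}  # background model

    # Both scores use the same alignment, so the ratio reflects the models only
    llr = score_on_alignment(frames, alignment, claimed) - \
          score_on_alignment(frames, alignment, background)
    print("accept" if llr > 0.0 else "reject")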
Experiments on a large-scale and realistic telephone database are reported. Finally, initial experiments on integrating alignment-based information into the decision making are presented. All these results demonstrate the merit of the method and encourage further investigation of speaker modeling in such an approach. (C) 2000 Elsevier Science B.V. All rights reserved. C1 France Telecom, BD CNET, DIH DIPS, F-22307 Lannion, France. RP Charlet, D (reprint author), France Telecom, BD CNET, DIH DIPS, 2 Ave Pierre Marzin, F-22307 Lannion, France. EM delphine.charlet@cnet.francetelecom.fr CR CHARLET D, 1999, EUROSPEECH 99, P1967 Charlet D, 1997, PATTERN RECOGN LETT, V18, P873, DOI 10.1016/S0167-8655(97)00064-0 FORSYTH M, 1993, EUROSPEECH 93, P319 FORSYTH M, 1995, SPEECH COMMUN, V17, P117, DOI 10.1016/0167-6393(95)00020-O Higgins A., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90098-6 MARIETHOZ J, 1999, EUROSPEECH 99, P1979 OLSEN J, 1997, EUROSPEECH 97, P1375 Parthasarathy S, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2403 Reynolds D. A., 1997, EUROSPEECH, P963 SAPORTA G, 1992, PROBABILITES ANAL DO NR 10 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2000 VL 31 IS 2-3 BP 113 EP 120 DI 10.1016/S0167-6393(99)00072-2 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100004 ER PT J AU Karlsson, I Banziger, T Dankovicova, J Johnstone, T Lindberg, J Melin, H Nolan, F Scherer, K AF Karlsson, I Banziger, T Dankovicova, J Johnstone, T Lindberg, J Melin, H Nolan, F Scherer, K TI Speaker verification with elicited speaking styles in the VeriVox project SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA DE speaker verification; speaker variation; stressed speech ID SPEECH AB Some experiments have been carried out to study and compensate for within-speaker variations in speaker verification. To induce speaker variation, a speaking behaviour elicitation software package has been developed. A 50-speaker database with voluntary and involuntary speech variation has been recorded using this software. The database has been used for acoustic analysis as well as for automatic speaker verification (ASV) tests. The voluntary speech variations are used to form an enrolment set for the ASV system. This set is called structured training and is compared to neutral training where only normal speech is used. Both sets contain the same number of utterances. It is found that the ASV system improves its performance when testing on a mixed speaking style test without decreasing the performance of the tests with normal speech. (C) 2000 Elsevier Science B.V. All rights reserved. C1 KTH, Dept Speech Mus & Hearing, S-10044 Stockholm, Sweden. Univ Geneva, FAPSE, Dept Psychol, Geneva, Switzerland. Univ Cambridge, CULD, Dept Linguist, Cambridge, England. RP Karlsson, I (reprint author), KTH, Dept Speech Mus & Hearing, S-10044 Stockholm, Sweden.
EM inger@speech.kth.se CR BIMBOT F, 1997, P EUROSPEECH 97 RHOD, P971 FURUI S, 1986, SPEECH COMMUN, V5, P183, DOI 10.1016/0167-6393(86)90007-5 FURUI S, 1997, P 1 INT C AUD VID BA, P237 FURUI S, 1994, P ESCA WORKSH AUT SP, P1 HANSEN JHL, 1995, P ESCA NATO WORKSH S, P91 JUNQUA JC, 1995, P ESCA NATO TUT WORK, P83 KARLSSON I, 1998, P ICSLP 98 SYDN 30 N, P2379 Martin A. F., 1997, P EUROSPEECH, P1895 Melin H., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.608018 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 YOUNG S, 1997, HTK BOOK NR 11 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2000 VL 31 IS 2-3 BP 121 EP 129 DI 10.1016/S0167-6393(99)00073-4 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100005 ER PT J AU Rosenberg, AE Siohan, O Parthasarathy, S AF Rosenberg, AE Siohan, O Parthasarathy, S TI Small group speaker identification with common password phrases SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA ID RECOGNITION; VERIFICATION; CLASSIFICATION AB Text-dependent speaker identification performance is investigated for small groups of speakers in which each speaker in a group is assigned the same sentence-long password utterance. Password utterances are modeled by whole-phrase hidden Markov models (HMMs). Several model construction conditions are studied. Baseline maximum likelihood estimate (MLE) models are constructed from three same-session training utterances. Minimum classification error (MCE) models are constructed using the training utterances of all speakers in a group. In addition, models are constructed using additional test utterances from speakers in the group or additional utterances from speakers outside the group. Results show that error rates approximately double from 5-speaker groups to 10-speaker groups. MCE models provide about 25% improvement in closed- and open-set identification error rates, but less improvement, about 10%, in imposter accept rates. The greatest improvements are obtained, for both MLE and MCE models, when customer test utterances augment the training utterances. For MCE models, closed-set identification error rates are approximately 0.4% and 0.6% for 5- and 10-speaker groups, respectively, while imposter accept rates are approximately 4% and 10%, respectively, when customer reject rates are 5%. (C) 2000 Elsevier Science B.V. All rights reserved. C1 AT&T Labs Res, Speech & Image Proc Serv Res Lab, Florham Pk, NJ 07932 USA. RP Rosenberg, AE (reprint author), AT&T Labs Res, Speech & Image Proc Serv Res Lab, 180 Pk Ave, Florham Pk, NJ 07932 USA. CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 del Alamo C. M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607969 DELALAMO CM, 1996, P ICASSP 96 IEEE INT, P89 Juang BH, 1997, IEEE T SPEECH AUDI P, V5, P257 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 KIMBALL O, 1997, P EUR 97 5 EUR C SPE, P967 Korkmazskiy F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat.
No.96TH8206), DOI 10.1109/ICSLP.1996.607965 LIU CS, 1995, J ACOUST SOC AM, V97, P637, DOI 10.1121/1.412286 Ljolje A., 1990, P ICASSP 90, P709 NETSCH LP, 1992, P ICASSP 92 IEEE INT, V2, P181 Parthasarathy S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607293 RABINER LR, 1986, AT&T TECH J, V65, P21 Rahim MG, 1997, IEEE T SPEECH AUDI P, V5, P266, DOI 10.1109/89.568733 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Rosenberg A. E., 1992, P INT C SPOK LANG PR, P599 ROSENBERG AE, 1996, P INT C AC SPEECH SI, P81 Rosenberg AE, 1998, INT CONF ACOUST SPEE, P105, DOI 10.1109/ICASSP.1998.674378 ROSENBERG AE, 1997, P EUR 97, P1371 ROSENBERG AE, 1991, P ICASSP, P381, DOI 10.1109/ICASSP.1991.150356 Siegel S., 1956, NONPARAMETRIC STAT, P75 Siohan O, 1998, INT CONF ACOUST SPEE, P109, DOI 10.1109/ICASSP.1998.674379 SUKKAR RA, 1996, P IEEE INT C AC SPEE, P518 NR 22 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2000 VL 31 IS 2-3 BP 131 EP 140 DI 10.1016/S0167-6393(99)00074-6 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100006 ER PT J AU Lamel, LF Gauvain, JL AF Lamel, LF Gauvain, JL TI Speaker verification over the telephone SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA DE text-dependent and text-independent speaker verification and speaker identification; Hidden Markov models ID IDENTIFICATION AB Speaker verification has been the subject of active research for many years, yet despite these efforts and promising results on laboratory data, speaker verification performance over the telephone remains below that required for many applications. This experimental study aimed to quantify speaker recognition performance out of the context of any specific application, as a function of factors more-or-less acknowledged to affect the accuracy. Some of the issues addressed are: the speaker model (Gaussian mixture models are compared with phone-based models); the influence of the amount and content of training and test data on performance; performance degradation due to model aging and how this can be counteracted by using adaptation techniques; achievable performance levels using text-dependent and text-independent recognition modes. These and other factors were addressed using a large corpus of read and spontaneous speech (over 250 hours collected from 100 target speakers and 1000 imposters) in French, designed and recorded for the purpose of this study. On these data, the lowest equal error rate is 1% for the text-dependent mode when two trials are allowed per verification attempt and with a minimum of 1.5 s of speech per trial. (C) 2000 Elsevier Science B.V. All rights reserved. C1 LIMSI, CNRS, Spoken Language Proc Grp, F-91403 Orsay, France. RP Lamel, LF (reprint author), LIMSI, CNRS, Spoken Language Proc Grp, BP 133, F-91403 Orsay, France. EM lamel@limsi.fr; gauvain@limsi.fr CR ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155 BERNSTEIN J, 1994, P ICASSP 9J AD AUSTR, V1, P81 Boves L., 1998, Proceedings 1998 IEEE 4th Workshop Interactive Voice Technology for Telecommunications Applications. IVTTA '98 (Cat.
No.98TH8376), DOI 10.1109/IVTTA.1998.727721 CAMPBELL JP, 1999, P INT C AC SPEECH SI, V2, P829 DODDINGTON GR, 1985, P IEEE, V73, P1651, DOI 10.1109/PROC.1985.13345 FURUI S, 1972, T IECE A, V55, P549 FURUI S, 1994, P ESCA WORKSH AUT SP, P1 GAUVAIN J, 1995, P EUROSPEECH, P651 GAUVAIN JL, 1993, P ARPA HUM LANG TECH, P96, DOI 10.3115/1075671.1075693 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Gish H, 1994, IEEE SIGNAL PROC MAG, V11, P18, DOI 10.1109/79.317924 GODFREY J, 1994, P ARPA HUM LANG TECH, P23, DOI 10.3115/1075812.1075819 Godfrey J., 1992, ICASSP 92 IEEE INT C, V1, P517 Lamel L. F., 1991, P EUR C SPEECH COMM, P505 LAMEL LF, 1997, P IEEE ICASSP 97 MUN, P1067 LAMEL LF, 1995, COMPUT SPEECH LANG, V9, P87, DOI 10.1006/csla.1995.0005 LAMEL LF, 1992, P DARPA ART NEUR NET LAMEL LF, 1993, P EUR 93 BERL, V1, P23 LEGETTER JC, 1994, 181 CUED FINFENG MATSUI T, 1993, P ICASSP 93, V2, P391 NAIK JM, 1990, IEEE COMMUN MAG JAN, P42 Newman M., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607297 Przybocki M., 1998, P RLA2C 1998 AV, P120 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Rosenberg A. E., 1990, P IEEE INT C AC SPEE, P269 ROSENBERG AE, 1992, ADV SPEECH SIGNAL PR, pCH22 ROSENBERG AE, 1976, P IEEE, V64, P475, DOI 10.1109/PROC.1976.10156 NR 27 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2000 VL 31 IS 2-3 BP 141 EP 154 DI 10.1016/S0167-6393(99)00075-8 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100007 ER PT J AU Bimbot, F Blomberg, M Boves, L Genoud, D Hutter, HP Jaboulet, C Koolwaaij, J Lindberg, J Pierrot, JB AF Bimbot, F Blomberg, M Boves, L Genoud, D Hutter, HP Jaboulet, C Koolwaaij, J Lindberg, J Pierrot, JB TI An overview of the CAVE project research activities in speaker verification SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA AB This article presents an overview of the research activities carried out in the European CAVE project, which focused on text-dependent speaker verification on the telephone network using whole word Hidden Markov Models. It documents in detail various aspects of the technology and the methodology used within the project. In particular, it addresses the issue of model estimation in the context of limited enrollment data and the problem of a posteriori decision threshold setting. Experiments are carried out on the realistic telephone speech database SESP. State-of-the-art performance levels are obtained, which validates the technical approaches developed and assessed during the project as well as the working infrastructure which facilitated cooperation between the partners. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Ecole Natl Super Telecommun Bretagne, Dept Signal, CNRS, URA 820, F-75634 Paris 13, France. KTH, Dept Speech Mus & Hearing, S-10044 Stockholm, Sweden. KUN, Dept Language & Speech, NL-6525 HT Nijmegen, Netherlands. UBS Ubilab, CH-8021 Zurich, Switzerland. IDIAP, CH-1920 Martigny, Switzerland. RP Bimbot, F (reprint author), Inst Rech Informat & Syst Aleatoires, CNRS, Sigma2, Campus Univ Beaulieu, F-35042 Rennes, France.
EM bimbot@irisa.fr; mats@speech.kth.se; boves@let.kun.nl; genoud@idiap.ch; cedric.jaboulet@ubs.com; koolwaaij@let.kun.nl; lindberg@speech.kth.se; pierrot@sig.enst.fr CR ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155 BIMBOT F, 1997, 1930 CAVE LE BIMBOT F, 1997, Patent No. 970409 BIMBOT F, 1995, SPOKEN LANGUAGE RESO BIMBOT F, 1997, P EUR 97 RHOD, P1387 BIMBOT F, 1997, EUROSPEECH 97, P971 BIMBOT F, 1998, RLA2C WORKSH AV, P215 CAREY MJ, 1991, PI ICASSP TOR, V91, P396 Duda R. O., 1973, PATTERN CLASSIFICATI FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Furui S., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification Higgins A., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90098-6 *ITT LDC, 1994, YOHO VER CORP JABOULET C, 1998, RLA2C WORKSH AV, P202 LINDBERG J, 1998, RLA2C WORKSH AV, P89 Martin A., 1997, EUROSPEECH, P1895 PIERROT JB, 1998, IEEE ICASSP, V1, P331 ROSENBERG AE, 1991, IEEE ICASSP91, V1, P381 Scharf L. L., 1991, STAT SIGNAL PROCESSI YOUNG S, 1995, HTK BOOK HTK V2 0 US NR 20 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2000 VL 31 IS 2-3 BP 155 EP 180 DI 10.1016/S0167-6393(99)00076-X PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100008 ER PT J AU Heck, LP Konig, Y Sonmez, MK Weintraub, M AF Heck, LP Konig, Y Sonmez, MK Weintraub, M TI Robustness to telephone handset distortion in speaker recognition by discriminative feature design SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA DE speaker recognition; speaker verification; speaker identification; channel compensation; channel robustness; telephone handset distortion; feature extraction; neural network; discriminative design ID IDENTIFICATION; VERIFICATION AB A method is described for designing speaker recognition features that are robust to telephone handset distortion. The approach transforms features such as mel-cepstral features, log spectrum, and prosody-based features with a non-linear artificial neural network. The neural network is discriminatively trained to maximize speaker recognition performance specifically in the setting of telephone handset mismatch between training and testing. The algorithm requires neither stereo recordings of speech during training nor manual labeling of handset types either in training or testing. Results on the 1998 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation corpus show relative improvements as high as 28% for the new multilayered perceptron (MLP)-based features as compared to a standard mel-cepstral feature set with cepstral mean subtraction (CMS) and handset-dependent normalizing impostor models. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Nuance Commun, Menlo Pk, CA 94025 USA. Utopy Inc, San Francisco, CA 94102 USA. SRI Int, Menlo Pk, CA 94025 USA. RP Heck, LP (reprint author), Nuance Commun, 1380 Willow Rd, Menlo Pk, CA 94025 USA. 
EM heck@nuance.com CR Baum E.B., 1988, NEURAL INFORMATION P, P52 BENGIO Y, 1992, IEEE T NEURAL NETWOR, V3 Chengalvarayan R, 1997, IEEE T SPEECH AUDI P, V5, P243, DOI 10.1109/89.568731 EULER S, 1995, P EUROSPEECH SEP, P109 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 HECK KLP, 1997, P INT C AC SPEECH SI Hermansky H., 1991, P EUROSPEECH, P1367 LEHR M, 1996, THESIS STANFORD U LIU FH, 1994, P INT C AC SPEECH SI, V2, P19 Mammone RJ, 1996, IEEE SIGNAL PROC MAG, V13, P58, DOI 10.1109/79.536825 Murthy HA, 1999, IEEE T SPEECH AUDI P, V7, P554, DOI 10.1109/89.784108 NEUMEYER L, 1994, P ICASSP, V1, P417 *NIST, 1996, NIST WORKSH NOT *NIST, 1997, NIST WORKSH NOT *NIST, 1998, NIST WORKSH NOT PALIWAL KK, 1995, P EUROSPEECH, P541 PRZYBOCKI MA, 1998, LREC, P331 QUATIERI TF, 1998, P INT C AC SPEECH SI, V2, P745, DOI 10.1109/ICASSP.1998.675372 RAHIM M, 1997, P EUR C SPEECH COMM REYNOLDS DA, 1997, P EUR C SPEECH COMM REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D REYNOLDS DA, 1997, P IEEE INT C AC SPEE, V2, P1535 Richard M. D., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.4.461 Rumelhart D.E., 1986, PARALLEL DISTRIBUTED, V1, P318 STERN RM, 1994, P INT C SPOK LANG PR, V3, P1027 Weintraub M., 1985, THESIS STANFORD U NR 26 TC 26 Z9 25 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2000 VL 31 IS 2-3 BP 181 EP 192 DI 10.1016/S0167-6393(99)00077-1 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100009 ER PT J AU Champod, C Meuwly, D AF Champod, C Meuwly, D TI The inference of identity in forensic speaker recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA DE forensic speaker recognition; inference; Bayesian approach ID PROSECUTORS FALLACY; SCIENTIFIC EVIDENCE; DNA EVIDENCE; PROBABILITY AB The aim of this paper is to investigate the ways of interpreting evidence within the field of speaker recognition. Several methods - speaker verification, speaker identification and statements of type I and type II errors - will be presented and evaluated in the light of judicial needs. It will be shown that these methods for interpreting evidence unfortunately force the scientist to adopt a role and to formulate answers that are outside his scientific province. A Bayesian interpretation framework (based on the likelihood ratio) will be proposed. It represents an adequate solution for the interpretation of the aforementioned evidence in the judicial process. It fills in the majority of the gaps of the other inference frameworks and allows speaker recognition evidence to be interpreted with the same logic as other forensic identification evidence. (C) 2000 Published by Elsevier Science B.V. All rights reserved. C1 Univ Lausanne, Inst Police Sci & Criminol, CH-1015 Lausanne, Switzerland. RP Champod, C (reprint author), Metropolitan Police Forens Sci Lab, Forens Sci Serv, 109 Lambeth Rd, London SE1 7LP, England.
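[Editor's note] The Bayesian framework advocated in the Champod and Meuwly record above reduces to Bayes' theorem in odds form: the scientist reports a likelihood ratio, and the court combines it with prior odds. A worked numerical sketch, with illustrative numbers only:

```python
def posterior_odds(prior_odds, likelihood_ratio):
    """Bayes in odds form: posterior odds = LR * prior odds.

    The forensic scientist reports only
    LR = P(evidence | same speaker) / P(evidence | different speaker);
    setting the prior odds is left to the court, not the scientist.
    """
    return likelihood_ratio * prior_odds

# Illustrative numbers only: an LR of 100 moves prior odds of 1:1000
# to posterior odds of 1:10.
print(posterior_odds(1 / 1000, 100.0))  # 0.1
```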
EM christophe.champod@ipsc.unil.ch; didier.meuwly@ipsc.unil.ch CR AFTE Criteria for Identification Committee, 1992, AFTE J, V24, P336 Aitken C.G.G., 1991, USE STAT FORENSIC SC Aitken Colin, 1995, STAT EVALUATION EVID [Anonymous], 1980, IDENTIFICATION NEWS, V30, P3 BALDING DJ, 1994, CRIM LAW REV, P711 BERRY DA, 1991, STAT FORENSIC SCI, P150 BODZIAK WJ, 1990, FOOTWEAR IMPRESSION Broeders APA, 1995, P INT C PHON SCI STO, V3, P154 CHAMPOD C, 1994, REV PENALE SUISSE, V112, P194 CHAMPOD C, 1993, REV PENALE SUISSE, V111, P223 *COMM EV SOUND SPE, 1979, THEOR PRACT VOIC ID DIENET W, 1984, 10 IAFS TRIENN M OXF DODDINGTON GR, 1985, P IEEE, V73, P1651, DOI 10.1109/PROC.1985.13345 EVETT IEW, 1993, INT CRIMINAL POLICE, V444, P12 EVETT IW, 1983, J FORENSIC SCI SOC, V23, P35, DOI 10.1016/S0015-7368(83)71540-9 EVETT IW, 1991, CHANCE, V4, P19 EVETT IW, 1995, SCI JUSTICE, V35, P127, DOI 10.1016/S1355-0306(95)72645-4 FAIRLEY WB, 1974, U CHICAGO LAW REV, V41, P242, DOI 10.2307/1599147 FIENBERG SE, 1989, EVOLVING ROLE STAT A, V1, P357 FIENBERG SE, 1996, BAYESIAN STAT, V5, P129 FIGUEIREDO RM, 1995, ADV FORENSIC SCI, V3, P35 HILTON O, 1995, INT J FORENSIC DOCUM, V1, P224 HUBER RA, 1980, CANADIAN SOC FORENSI, V13, P1 HUMMEL K, 1984, FORENSIC SCI INT, V25, P1, DOI 10.1016/0379-0738(84)90010-0 Kaye D. H., 1992, JURIMETRICS J, V32, P313 KAYE DH, 1993, HARVARD J LAW TECHNO, V7, P101 KIND SS, 1994, J FORENSIC SCI SOC, V34, P155, DOI 10.1016/S0015-7368(94)72908-X Koehler J.J., 1993, JURIMETRICS J, V34, P21 KUNZEL H, 1994, P ESCA WORKSH AUT SP Kwan Q.Y., 1977, THESIS U CALIFORNIA LEMPERT RO, 1977, MICH LAW REV, V75, P1021, DOI 10.2307/1288024 LEMPERT RO, 1993, JURIMETRICS J, V34, P1 Lewis S., 1984, P I ACOUSTICS, V6, P69 MAJEWSKI W, 1996, FORENSIC LINGUISTICS, V3, P50 O'Shaughnessy D., 1986, IEEE ASSP Magazine, V3, DOI 10.1109/MASSP.1986.1165388 PAOLONI A, 1994, ANN M INT ASS FOR PH REDMAYNE M, 1995, CRIM LAW REV, P464 ROBERTSON B, 1992, NZ LAW J, V9, P315 Robertson B., 1995, INTERPRETING EVIDENC ROBERTSON B, 1994, EXPERT EVIDENCE, V3, P3 ROBERTSON BWN, 1994, J FORENSIC SCI SOC, V34, P270, DOI 10.1016/S0015-7368(94)72934-0 Royall R. M., 1997, STAT EVIDENCE LIKELI Rudram DA, 1996, SCI JUSTICE, V36, P133, DOI 10.1016/S1355-0306(96)72587-X SEZ VI, 1994, GIUSTIZIA PENALE, V35, P42 Stoney D.A., 1985, THESIS U CALIFORNIA STONEY DA, 1991, USE STAT FORENSIC SC, P27 Taroni F, 1998, JURIMETRICS J, V38, P183 TARONI F, 1996, 75 JAHR DTSCH GES RE Taroni F, 1996, SCI JUSTICE, V36, P290, DOI 10.1016/S1355-0306(96)72617-5 Taroni F., 1997, JURIMETRICS, V37, P327 THOMPSON WC, 1987, LAW HUMAN BEHAV, V11, P167, DOI 10.1007/BF01044641 Tuthill Harold, 1994, INDIVIDUALIZATION PR *TWGFAST, 1997, J FORENSIC IDENTIFIC, V47, P423 Walsh KAJ, 1996, SCI JUSTICE, V36, P213, DOI 10.1016/S1355-0306(96)72607-2 NR 54 TC 43 Z9 43 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2000 VL 31 IS 2-3 BP 193 EP 203 DI 10.1016/S0167-6393(99)00078-3 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100010 ER PT J AU Boe, LJ AF Boe, LJ TI Forensic voice identification in France SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA AB Among the sources of information used in legal identification, fingerprints and genetic data seem to provide a high degree of reliability. It is possible to evaluate the probability of confusing two individuals who might possess the same fingerprint characteristics or the same genetic markers, and to quantify the risk of a false alarm. By their very nature, these data do not vary significantly over the course of time, and they cannot be modified by a suspect. The erroneous metaphoric term "voiceprint" leads many people (not only the general public) to believe that the voice is as reliable as the papillary ridges of the fingertips. This is not the case. According to present evidence, certain magistrates in France attach far too much importance to analyses of the voice which, along with other indices, should not be used except to help in directing an investigation. In this communication, the author will detail the conditions under which, in France, voice analyses are carried out in the course of an investigation undertaken by the law, and will attempt to define the limits of this protocol, and the difficulty (and impossibility) of producing a reliable statistical test. A historical review will then be presented of the discussions initiated by and position statements adopted by the French speech community since 1900. Finally some ideas and proposals will be put forward in conclusion, which might be discussed by specialists in speech in collaboration with the police, the gendarmerie, and the magistrature, on a national, European, and international level, to advance the search for legal proof of identification within a scientific framework, and to end up with well-defined protocols. (C) 2000 Published by Elsevier Science B.V. All rights reserved. C1 Univ Grenoble, INPG, Inst Commun Parlee, CNRS,UMR 5009, F-38040 Grenoble, France. RP Boe, LJ (reprint author), Univ Grenoble, INPG, Inst Commun Parlee, CNRS,UMR 5009, BP 25, F-38040 Grenoble, France. CR ALBADER HO, 1992, 10 C INT POL SCI LYO BIMBOT F, 1998, HDB STANDARDS RESSOU BOE LJ, 1984, NTLAATSS226 CNET BOE LJ, 1998, HIST PROBLEMATIQUE P, P222 BOLT RH, 1970, J ACOUST SOC AM, V47, P597, DOI 10.1121/1.1911935 Bricker P. D., 1976, CONT ISSUES EXPT PHO, P295 CHAMPOD C, 1998, RLA2C SPEAK REC ITS, P125 Duda R. O., 1973, PATTERN CLASSIFICATI FOMBONNE J, 1996, CRIMINALISTIQUE Galton Francis, 1965, FINGER PRINTS HECKER MHL, 1971, ASHA MONOGRAPHS, V16 JEFFREYS AJ, 1985, NATURE, V316, P76, DOI 10.1038/316076a0 KERSTA LG, 1962, NATURE, V196, P1253, DOI 10.1038/1961253a0 Nadler M., 1993, PATTERN RECOGNITION Nolan F, 1983, PHONETIC BASES SPEAK STEVANI G, 1996, PROCEDURE PENALE Tosi O., 1979, VOICE IDENTIFICATION VANLANCKER D, 1985, J PHOENITCS, V13, P13 VANLANCKER D, 1985, J PHONETICS, V13, P39 VANLANCKER DR, 1988, CORTEX, V24, P195 YARMEY AD, 1991, J FORENSIC SCI SOC, V31, P421 NR 21 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2000 VL 31 IS 2-3 BP 205 EP 224 DI 10.1016/S0167-6393(99)00079-5 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100011 ER PT J AU Doddington, GR Przybocki, MA Martin, AF Reynolds, DA AF Doddington, GR Przybocki, MA Martin, AF Reynolds, DA TI The NIST speaker recognition evaluation - Overview, methodology, systems, results, perspective SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA DE speaker recognition; identification; verification; performance evaluation; NIST evaluations; detection error trade-off (DET) curve AB This paper, based on three presentations made in 1998 at the RLA2C Workshop in Avignon, discusses the evaluation of speaker recognition systems from several perspectives. A general discussion of the speaker recognition task and the challenges and issues involved in its evaluation is offered. The NIST evaluations in this area and specifically the 1998 evaluation, its objectives, protocols and test data, are described. The algorithms used by the systems that were developed for this evaluation are summarized, compared and contrasted. Overall performance results of this evaluation are presented by means of detection error trade-off (DET) curves. These show the performance trade-off of missed detections and false alarms for each system and the effects on performance of training condition, test segment duration, the speakers' sex and the match or mismatch of training and test handsets. Several factors that were found to have an impact on performance, including pitch frequency, handset type and noise, are discussed and DET curves showing their effects are presented. The paper concludes with some perspective on the history of this technology and where it may be going. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Natl Inst Stand & Technol, Gaithersburg, MD 20899 USA. SRI Int, Menlo Pk, CA 94025 USA. MIT, Lincoln Lab, Lexington, MA 02173 USA. RP Martin, AF (reprint author), Natl Inst Stand & Technol, Technol Bldg 225, Gaithersburg, MD 20899 USA. EM alvin.martin@nist.gov CR BESACIER L, 1998, RLA2C APR, P106 CAREY M, 1998, RLA2C APR, P161 DODDINGTON G, 1998, P ICSLP 98 DODDINGTON G, 1998, P RLA2C 20 23 APR AV, P630 FISCUS J, 1997, P IEEE WORKSH AUT SP GILLICK L, 1993, ICASSP APR Heck LP, 1997, INT CONF ACOUST SPEE, P1071, DOI 10.1109/ICASSP.1997.596126 HENNEBERT J, 1998, RLA2C AV FRANC, P55 HERMANSKY H, 1998, RLA2C APR, P111 JABOULET C, 1998, RLA2C WORKSH AV, P202 KONIG Y, 1998, RLA2C APR, P72 Martin A. F., 1997, P EUROSPEECH, P1895 MONTACIE C, 1992, ICSLP BANFF CAN, P611 Przybocki M., 1998, P RLA2C 1998 AV, P120 PRZYBOCKI M, 1998, P LREC 28 30 MAY GRA, V1, P331 Quatieri TF, 1998, INT CONF ACOUST SPEE, P745, DOI 10.1109/ICASSP.1998.675372 REYNOLDS D, 1997, EUROSPEECH, V2, P963 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D Reynolds DA, 1997, INT CONF ACOUST SPEE, P1535, DOI 10.1109/ICASSP.1997.596243 Schmidt- Nielsen A., 1998, P ICSLP 98 VANVUUREN S, 1998, RLA2C APR, P198 NR 21 TC 108 Z9 112 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
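[Editor's note] The NIST evaluation record above presents all results as detection error trade-off (DET) curves: the trade-off between missed detections and false alarms as the decision threshold is swept, plotted on normal-deviate axes. A minimal sketch of how such points are computed from target and impostor scores (names are ours, not NIST's tooling):

```python
import numpy as np
from scipy.stats import norm

def det_points(target_scores, impostor_scores):
    """P(miss) and P(false alarm) swept over every observed threshold."""
    target = np.asarray(target_scores)
    impostor = np.asarray(impostor_scores)
    thresholds = np.sort(np.concatenate([target, impostor]))
    p_miss = np.array([np.mean(target < t) for t in thresholds])
    p_fa = np.array([np.mean(impostor >= t) for t in thresholds])
    return p_miss, p_fa

def to_det_axes(p, eps=1e-6):
    """DET plots use normal-deviate (probit) axes, which render roughly
    Gaussian score distributions as near-straight lines."""
    return norm.ppf(np.clip(p, eps, 1 - eps))
```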
PD JUN PY 2000 VL 31 IS 2-3 BP 225 EP 254 DI 10.1016/S0167-6393(99)00080-1 PG 30 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100012 ER PT J AU Ortega-Garcia, J Gonzalez-Rodriguez, J Marrero-Aguiar, V AF Ortega-Garcia, J Gonzalez-Rodriguez, J Marrero-Aguiar, V TI AHUMADA: A large speech corpus in Spanish for speaker characterization and identification SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA DE speech databases; speaker characterization; speaker recognition AB Speaker recognition is an emerging task in both commercial and forensic applications. Nevertheless, while in certain applications we can estimate, adapt or hypothesize about our working conditions, most of the commercial applications and almost the whole of the forensic approaches to speaker recognition are still open problems, due to several reasons. Some of these reasons can be stated: environmental conditions are (usually) rapidly changing or highly degraded, acquisition processes are not always under control, incriminated people exhibit low degree of cooperativeness, etc., inducing a wide range of variability sources on speech utterances. In this sense, real approaches to speaker identification necessarily imply taking into account all these variability factors. In order to isolate, analyze and measure the effect of some of the main variability sources that can be found in real commercial and forensic applications, and their influence in automatic recognition systems, a specific large speech database in Castilian Spanish called AHUMADA (/aumada/) has been designed and acquired under controlled conditions. In this paper, together with a detailed description of the database, some experimental results including different speech variability factors are also presented. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Univ Politecn Madrid, EUIT Telecomun, Dept Ingn Audiovisual & Comun, Madrid 23031, Spain. Univ Nacl Educ Distancia, Dept Lengua Espanola, E-28040 Madrid, Spain. RP Ortega-Garcia, J (reprint author), Univ Politecn Madrid, EUIT Telecomun, Dept Ingn Audiovisual & Comun, Ctra Valencia Km 7, Madrid 23031, Spain. 
EM jortega@diac.upm.es RI Gonzalez-Rodriguez, Joaquin/B-2629-2009; Ortega-Garcia, Javier/J-7065-2012; Marrero Aguiar, Victoria/D-9895-2014 OI Gonzalez-Rodriguez, Joaquin/0000-0003-0910-2575; Ortega-Garcia, Javier/0000-0003-0557-1948; Marrero Aguiar, Victoria/0000-0001-6295-7463 CR ACERO A, 1993, AC ENV ROB AUT SPEEC Boves L., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification CHAMPOD C, 1998, ESCA WORKSH SPEAK RE, P125 Furui S., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification Gibbon D, 1997, HDB STANDARDS RESOUR Godfrey J., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification GonzalezRodriguez J, 1997, INT CONF ACOUST SPEE, P1103, DOI 10.1109/ICASSP.1997.596134 GUERRA R, 1983, ESTUDIOS FONETICA, V1, P9 JUILLAND A, 1969, FREQUENCY DICT SPANI Junqua J.C., 1996, ROBUSTNESS AUTOMATIC Matsui T., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification Naik J., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification ORTEGAGARCIA J, 1998, IEEE INT C AC SPEECH, V2, P773 Ortega-Garcia J., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification OrtegaGarcia J, 1997, INT CONF ACOUST SPEE, P1107, DOI 10.1109/ICASSP.1997.596135 Quilis A., 1980, LINGUISTICA ESPANOLA, VII, P1 Reynolds D. A., 1997, P EUR, P963 Reynolds DA, 1997, INT CONF ACOUST SPEE, P1535, DOI 10.1109/ICASSP.1997.596243 Reynolds D.A., 1992, THESIS GEORGIA I TEC Rosenberg A. E., 1992, P INT C SPOK LANG PR, P599 STEENEKEN HJM, 1985, B K TECH REV, V3, P13 NR 21 TC 32 Z9 33 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 2000 VL 31 IS 2-3 BP 255 EP 264 DI 10.1016/S0167-6393(99)00081-3 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100013 ER PT J AU Hennebert, J Melin, H Petrovska, D Genoud, D AF Hennebert, J Melin, H Petrovska, D Genoud, D TI POLYCOST: A telephone-speech database for speaker recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Speaker Recognition and its Commercial and Forensic Applications (RLA2C) CY APR 20-23, 1998 CL AVIGNON, FRANCE SP ESCA DE speaker recognition; databases AB This article presents an overview of the POLYCOST database dedicated to speaker recognition applications over the telephone network. The main characteristics of this database are: medium mixed speech corpus size (>100 speakers), English spoken by foreigners, mainly digits with some free speech, collected through international telephone lines, and minimum of nine sessions for 85% of the speakers. (C) 2000 Published by Elsevier Science B.V. All rights reserved. C1 Ecole Polytech Fed Lausanne, CIRC, CH-1015 Lausanne, Switzerland. KTH, TMH, SE-10044 Stockholm, Sweden. IDIAP, CH-1920 Martigny, Switzerland. RP Hennebert, J (reprint author), UbiCall Commun Inc, 1095 Market St,Suite 3001, San Francisco, CA 94103 USA. EM jean.hennebert@ubicall.com CR BIMBOT F, 1995, EAGLES HDB SPOKEN LA DENOS E, 1997, EAGLES HDB SPOKEN LA MELIN H, 1996, P COST 250 WORKSH AP, P59 PETROVSKA D, 1996, P COST 259 WORKSH AP, P23 NR 4 TC 12 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JUN PY 2000 VL 31 IS 2-3 BP 265 EP 270 DI 10.1016/S0167-6393(99)00082-5 PG 6 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 329BW UT WOS:000087887100014 ER PT J AU Roberts, WJJ Ephraim, Y AF Roberts, WJJ Ephraim, Y TI Hidden Markov modeling of speech using Toeplitz covariance matrices SO SPEECH COMMUNICATION LA English DT Article ID MAXIMUM-LIKELIHOOD; INCOMPLETE-DATA; NOISY SPEECH; RECOGNITION AB Hidden Markov modeling of speech waveforms using structured covariance matrices is studied and applied to recognition of clean and noisy speech signals. This technique allows for easier model adaptation in additive noise than does cepstral modeling of speech. Waveform modeling using autoregressive (AR) structured covariances has been extensively studied and applied previously. However, other covariance structures are possible, and here we consider waveform modeling using Toeplitz and circulant structured covariances. We detail maximum likelihood (ML) hidden Markov model training and recognition routines using these matrices, and ML speech gain estimation routines. We show that, under certain conditions, the asymptotic probabilities of recognition error using Toeplitz and circulant matrices are equivalent to those using AR matrices. In experimental results on isolated digits in clean conditions, the Toeplitz covariance structure provides higher performance than the AR structure and has performance similar to that reported in the literature of a cepstral system on the same database. In additive Gaussian noise, we demonstrate superior performance to both the cepstral system and the AR system. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Def Sci & Technol Org, Div Informat Technol, Salisbury, SA 5108, Australia. George Mason Univ, Dept Elect & Comp Engn, Fairfax, VA 22030 USA. George Mason Univ, Sch Informat Technol & Engn, Fairfax, VA 22030 USA. RP Roberts, WJJ (reprint author), Def Sci & Technol Org, Div Informat Technol, POB 1500, Salisbury, SA 5108, Australia. EM bill.roberts@dsto.defence.gov.au; yephraim@gmu.edu CR BURG JP, 1982, P IEEE, V70, P963, DOI 10.1109/PROC.1982.12427 DAUTRICH BA, 1983, IEEE T ACOUST SPEECH, V31, P793, DOI 10.1109/TASSP.1983.1164172 DEMBO A, 1989, IEEE T INFORM THEORY, V35, P1206, DOI 10.1109/18.45276 DEMBO A, 1986, IEEE T ACOUST SPEECH, V34, P661 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P1303, DOI 10.1109/78.139237 EPHRAIM Y, 1989, IEEE T ACOUST SPEECH, V37, P1846, DOI 10.1109/29.45532 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 FUHRMANN DR, 1988, IEEE T INFORM THEORY, V34, P722, DOI 10.1109/18.9771 GALES MJF, 1992, P ICASSP, P233, DOI 10.1109/ICASSP.1992.225929 Gray R.M., 1977, TOEPLITZ CIRCULANT M Juang B. H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E JUANG BH, 1985, IEEE T ACOUST SPEECH, V33, P1404 KAILATH T, 1978, SIAM REV, V20, P107
Kay S. M., 1988, MODERN SPECTRAL ESTI LEONARD RG, 1984, P C IEEE ICASSP LEVINSON SE, 1983, AT&T TECH J, V62, P1035 Markel JD, 1976, LINEAR PREDICTION SP Merhav N, 1993, IEEE T SPEECH AUDI P, V1, P90, DOI 10.1109/89.221371 MERHAV N, 1991, IEEE T SIGNAL PROCES, V39, P2157, DOI 10.1109/78.91172 Merhav N., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90002-8 MERHAV N, 1991, IEEE T SIGNAL PROCES, V39, P2111, DOI 10.1109/78.134449 MILLER MI, 1987, P IEEE, V75, P892, DOI 10.1109/PROC.1987.13825 NADAS A, 1989, IEEE T ACOUST SPEECH, V37, P1495, DOI 10.1109/29.35387 NEWSAM GN, 1994, IEEE T INFORM THEORY, V40, P1218, DOI 10.1109/18.335952 OLKIN I, 1969, ANN MATH STAT, V40, P1358, DOI 10.1214/aoms/1177697508 ROBERTS WJJ, 1996, THESIS G MASON U RUBIN DB, 1982, BIOMETRIKA, V69, P657 Scharf L. L., 1991, STAT SIGNAL PROCESSI SZATROWSKI TH, 1980, ANN STAT, V8, P802, DOI 10.1214/aos/1176345072 Van Trees H. L., 1968, DETECTION ESTIMATION, V1 NR 31 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2000 VL 31 IS 1 BP 1 EP 14 DI 10.1016/S0167-6393(00)00005-4 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 323WK UT WOS:000087587800001 ER PT J AU Hazen, TJ AF Hazen, TJ TI A comparison of novel techniques for rapid speaker adaptation SO SPEECH COMMUNICATION LA English DT Article DE speaker adaptation; speaker constraint; speaker clustering ID LIKELIHOOD AB This paper introduces two novel techniques for rapid speaker adaptation, reference speaker weighting and consistency modeling. Also presented is an adaptation technique called speaker cluster weighting (SCW) which provides a means for improving upon generic hierarchical speaker clustering techniques. Each of these adaptation methods attempts to utilize the underlying within-speaker correlations that are present between the acoustic realizations of different phones. By accounting for these correlations, a limited amount of adaptation data can be used to adapt every phonetic acoustic model, including those for phones which have not been observed in the adaptation data. Results were obtained using the DARPA Resource Management corpus for a set of rapid adaptation experiments where single test utterances were used for adaptation and recognition simultaneously. Using the new adaptation techniques, relative word error rate reductions ranging from 4.9% to 8.4% were obtained under various conditions. Using a combination of hierarchical speaker clustering techniques and the novel adaptation techniques, a word error rate reduction of 20% has been achieved from the baseline speaker independent (SI) recognition system. (C) 2000 Elsevier Science B.V. All rights reserved. C1 MIT, Comp Sci Lab, Spoken Language Syst Grp, Cambridge, MA 02139 USA. RP Hazen, TJ (reprint author), MIT, Comp Sci Lab, Spoken Language Syst Grp, Room 646,545 Technol Sq, Cambridge, MA 02139 USA. EM hazen@sls.lcs.mit.edu CR BAHL LR, 1991, P EUR C SPEECH COMM, P1209 BAHL LR, 1995, P ICASSP, P41 BAHL LR, 1983, IEEE T PATTERN ANAL, V5, P179 BAKER JK, 1975, THESIS CARNEGIE MELL De Brabandere K, 2007, P IEEE INT C AC SPEE, P1 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 GAUVAIN JL, 1995, P IEEE ICASSP 95 DET, P65
Glass J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607261 HAZEN TJ, 1998, THESIS MIT HUANG X, 1996, P 1996 INT C AC SPEE, P885 HUO Q, 1997, P EUR C SPEECH COMM, P1847 KANNAN A, 1997, P EUROSPEECH, P1863 KOSAKA T, 1994, P ICSLP, P1375 KOSAKA T, 1994, P 1994 IEEE INT C AC, V1, P245 KUBALA F, 1997, P EUR 97 RHOD GREEC, P927 Kuhn R., 1998, P ICSLP, P1771 LASRY MJ, 1984, IEEE T PATTERN ANAL, V6, P530 Lee K.-F., 1988, THESIS CARNEGIE MELL LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MATHAN L, 1990, P ICASSP 90, P149 Paliwal K.K., 1993, P ICASSP, V11, P215 Price P., 1988, P IEEE INT C AC SPEE, P651 Shahshahani BM, 1997, IEEE T SPEECH AUDI P, V5, P183, DOI 10.1109/89.554780 Shinoda K., 1997, P 1997 IEEE WORKSH A, P381 SZARVAS M, 1998, P 1998 INT C SPOK LA, P2967 ZAVALIAGKOS G, 1995, P ARPA SPOK LANG SYS, P82 ZAVALIAGKOS G, 1995, P 4 EUR C SPEECH COM, P1131 ZAVALIAGKOS G, 1995, P ICASSP 95, P676 ZUE V, 1997, P 5 EUR C SPEECH COM, P2227 NR 29 TC 23 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2000 VL 31 IS 1 BP 15 EP 33 DI 10.1016/S0167-6393(99)00059-X PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 323WK UT WOS:000087587800002 ER PT J AU Yang, HH Van Vuuren, S Sharma, S Hermansky, H AF Yang, HH Van Vuuren, S Sharma, S Hermansky, H TI Relevance of time-frequency features for phonetic and speaker-channel classification SO SPEECH COMMUNICATION LA English DT Article DE mutual information; sources of variability; spectral feature; input selection; phonetic classification; multi-layer perceptron AB The mutual information concept is used to study the distribution of speech information in frequency and in time. The main focus is on the information that is relevant for phonetic classification. A large database of hand-labeled fluent speech is used to (a) compute the mutual information (MI) between a phonetic classification variable and one spectral feature variable in the time-frequency plane, and (b) compute the joint mutual information (JMI) between the phonetic classification variable and two feature variables in the time-frequency plane. The MI and the JMI of the feature variables are used as relevance measures to select inputs for phonetic classifiers. Multi-layer perceptron (MLP) classifiers with one or two inputs are trained to recognize phonemes to examine the effectiveness of the input selection method based on the MI and the JMI. To analyze the non-linguistic sources of variability, we use speaker-channel labels to represent different speakers and different telephone channels and estimate the MI between the speaker-channel variable and one or two feature variables. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Oregon Grad Inst Sci & Technol, Dept Elect & Comp Engn, Beaverton, OR 97006 USA. RP Yang, HH (reprint author), Oregon Grad Inst Sci & Technol, Dept Elect & Comp Engn, 20000 NW Walker Rd, Beaverton, OR 97006 USA. EM hyang@cse.ogi.edu CR BARROWS G, 1996, IEEE INT S TIM FREQ, P249 BATTITI R, 1994, IEEE T NEURAL NETWOR, V5, P537, DOI 10.1109/72.298224 Bilmes JA, 1998, INT CONF ACOUST SPEE, P469, DOI 10.1109/ICASSP.1998.674469 Bonnlander BV, 1996, THESIS U COLORADO Cole R., 1994, ICSLP 94.
1994 International Conference on Spoken Language Processing COVER TM, 1991, INFORMATION THEORY Hermansky H, 1998, SPEECH COMMUN, V25, P3, DOI 10.1016/S0167-6393(98)00027-2 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Morris A., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1006 Stuart A, 1994, KENDALLS ADV THEORY, V1 Venables WN, 1994, MODERN APPL STAT S P YANG H, 2000, ADV NEURAL INFORMATI, V12 YANG H, 1999, ICASSP 99 PHOEN, V1, P225 Yang H.H., 1999, ADV INTELLIGENT DATA NR 14 TC 27 Z9 29 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2000 VL 31 IS 1 BP 35 EP 50 DI 10.1016/S0167-6393(00)00007-8 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 323WK UT WOS:000087587800003 ER PT J AU Ma, KW Zavaliagkos, G Meteer, M AF Ma, KW Zavaliagkos, G Meteer, M TI Bi-modal sentence structure for language modeling SO SPEECH COMMUNICATION LA English DT Article DE large vocabulary continuous speech recognition; statistical language model; discourse model AB According to discourse theories in linguistics, conversational utterances possess an informational structure. That is, each sentence consists of two components: the given and the new. The given refers to information that has previously been conveyed in the conversation such as that in That's interesting. The new section of a sentence introduces additional information that is new to the conversation such as the word interesting in the previous example. In this work, we take advantage of this inherent structure for the purpose of automatic conversational speech recognition by building sub-sentence discourse language models (LMs) to represent the bi-modal nature of each conversational sentence. The internal sentence structure is captured with a statistical sentence model regardless of whether the input sentences are linguistically or acoustically segmented. The proposed model is verified on the Switchboard corpus. The resulting model contributes to a reduction in both LM perplexity and word recognition error rate. (C) 2000 Elsevier Science B.V. All rights reserved. C1 GTE BBN Technol, Speech & Language Dept, Cambridge, MA 02138 USA. RP Ma, KW (reprint author), Lernout & Hauspie Speech Prod, Burlington, MA 01803 USA. EM kma@lhsl.com CR BIBER D, 1986, LANGUAGE, V62, P384, DOI 10.2307/414678 Clark H. H., 1977, DISCOURSE PRODUCTION, P1 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 GAVALDA M, 1997, APPL NATURAL LANGUAG, P12 Gillick L., 1989, P ICASSP, P532 Godfrey J., 1992, ICASSP 92 IEEE INT C, V1, P517 GOTOH Y, 1997, P 5 EUR C SPEECH COM, P1443 Halliday Michael, 1976, COHESION ENGLISH Jelinek F., 1997, STAT METHODS SPEECH JURAFSKY D, 1997, J HOPK SUMM WORKSH J Kneser R., 1993, P INT C AC SPEECH SI, V2, P586 Lin S.-C., 1997, P EUR, P1463 MA K, P INT C AC SPEECH SI, V2, P693 METEER M, 1996, P C EMP METH NAT LAN Placeway P., 1993, P INT C AC SPEECH SI, P33 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 ROSENFELD R, 1995, P 1995 LANG MOD WORK Stolcke A., 1996, P INT C SPOK LANG PR, P1005, DOI 10.1109/ICSLP.1996.607773 WANG Y, 1997, P EUR, P2703 Woszczyna M., 1994, P INT C SPOK LANG PR, P847 ZAVALIAGKOS G, 1998, P INT C ACOUSTICS SP, V2, P905, DOI 10.1109/ICASSP.1998.675412 NR 21 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
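[Editor's note] The Yang et al. record above ranks time-frequency features by their mutual information with a phonetic label. A minimal plug-in MI estimator for two discrete variables, e.g. a phone label against a quantized spectral feature at one time-frequency point, is sketched below (a simplified illustration under our naming, not the authors' code):

```python
import numpy as np

def mutual_information(x_labels, y_labels):
    """Plug-in MI estimate (in bits) between two discrete variables."""
    x = np.asarray(x_labels)
    y = np.asarray(y_labels)
    xs, xi = np.unique(x, return_inverse=True)
    ys, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((len(xs), len(ys)))
    np.add.at(joint, (xi, yi), 1)          # joint count table
    pxy = joint / joint.sum()              # joint distribution
    px = pxy.sum(axis=1, keepdims=True)    # marginals
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return float((pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])).sum())
```

The joint mutual information of two feature variables with the label can be estimated the same way by treating the feature pair as a single discrete variable.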
PD MAY PY 2000 VL 31 IS 1 BP 51 EP 67 DI 10.1016/S0167-6393(99)00060-6 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 323WK UT WOS:000087587800004 ER PT J AU Cucchiarini, C Strik, H Boves, L AF Cucchiarini, C Strik, H Boves, L TI Different aspects of expert pronunciation quality ratings and their relation to scores produced by speech recognition algorithms (vol 30, pg 109, 2000) SO SPEECH COMMUNICATION LA English DT Correction C1 Univ Nijmegen, Dept Language & Speech, NL-6500 HD Nijmegen, Netherlands. RP Cucchiarini, C (reprint author), Univ Nijmegen, Dept Language & Speech, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM catia@let.kun.nl; strik@let.kun.nl; boves@let.kun.nl CR Cucchiarini C, 2000, SPEECH COMMUN, V30, P109, DOI 10.1016/S0167-6393(99)00040-0 NR 1 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2000 VL 31 IS 1 BP 69 EP 69 DI 10.1016/S0167-6393(00)00030-3 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 323WK UT WOS:000087587800005 ER PT J AU Fosler-Lussier, E Morgan, N AF Fosler-Lussier, E Morgan, N TI Effects of speaking rate and word frequency on pronunciations in conversational speech (vol 29, pg 137, 1999) SO SPEECH COMMUNICATION LA English DT Correction C1 Int Comp Sci Inst, Berkeley, CA 94704 USA. Univ Calif Berkeley, Berkeley, CA 94720 USA. RP Fosler-Lussier, E (reprint author), Int Comp Sci Inst, 1947 Ctr St,Suite 600, Berkeley, CA 94704 USA. EM fosler@icsi.berkeley.edu CR Fosler-Lussier E, 1999, SPEECH COMMUN, V29, P137, DOI 10.1016/S0167-6393(99)00035-7 NR 1 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 2000 VL 31 IS 1 BP 71 EP 71 DI 10.1016/S0167-6393(00)00029-7 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 323WK UT WOS:000087587800006 ER PT J AU Gordon, PC AF Gordon, PC TI Masking protection in the perception of auditory objects SO SPEECH COMMUNICATION LA English DT Article DE auditory objects; identification; masking; vowels; thresholds ID DUPLEX PERCEPTION; ONSET ASYNCHRONY; SPEECH; NOISE; VOWELS; RESTORATION AB Three experiments demonstrate the phenomenon of masking protection, where the threshold for identifying a brief masked signal is lowered when that signal is presented in conjunction with other sounds that provide no information about the correct response and which are separated from the distinctive signal by more than a critical band (Gordon, 1997a,b). The first experiment shows that listeners' thresholds for distinguishing a low tone (375 Hz) from a high tone (625 Hz) are lower when those tones are accompanied by a synthetic speech sound that combines with the tones to give percepts of /I/ or /epsilon/, respectively. This effect is reliable for individual listeners and involves a change in perceptual sensitivity. The second experiment shows that a similar lowering of identification thresholds is produced when the distinctive signals are combined with high-frequency, acoustic energy that does not prompt a speech percept. The third experiment shows that identification thresholds are elevated when the non-distinctive, high-frequency acoustic energy leads and lags the distinctive signals.
The results of the experiments indicate that mechanisms of perceptual object formation that exploit the temporal alignment of energy changes across the spectrum can contribute to the accurate identification of speech and non-speech sounds. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Univ N Carolina, Dept Psychol, Chapel Hill, NC 27599 USA. RP Gordon, PC (reprint author), Univ N Carolina, Dept Psychol, CB 3270,Davie Hall, Chapel Hill, NC 27599 USA. EM pcg@email.unc.edu CR BAILEY PJ, 1993, PERCEPT PSYCHOPHYS, V54, P20, DOI 10.3758/BF03206934 BREGMAN AS, 1978, CAN J PSYCHOL, V32, P19, DOI 10.1037/h0081664 Creelman C. D., 1991, DETECTION THEORY USE DARWIN CJ, 1984, J ACOUST SOC AM, V76, P1636, DOI 10.1121/1.391610 DARWIN CJ, 1984, ATTENTION PERFORM, V10, P197 DARWIN CJ, 1992, J ACOUST SOC AM, V91, P3381, DOI 10.1121/1.402828 DARWIN CJ, 1986, J ACOUST SOC AM, V79, P838, DOI 10.1121/1.393474 DARWIN CJ, 1984, Q J EXP PSYCHOL-A, V36, P193 Gordon PC, 1997, J ACOUST SOC AM, V102, P2276, DOI 10.1121/1.419600 Gordon PC, 1997, PERCEPT PSYCHOPHYS, V59, P232, DOI 10.3758/BF03211891 Green D. M., 1988, PROFILE ANAL AUDITOR HALL JW, 1984, J ACOUST SOC AM, V76, P50, DOI 10.1121/1.391005 HALL MD, 1992, J EXP PSYCHOL HUMAN, V18, P752, DOI 10.1037/0096-1523.18.3.752 Hill NI, 1996, J ACOUST SOC AM, V100, P2352, DOI 10.1121/1.417945 KAISER JF, 1978, REV SCI INSTRUMENTAT, V48, P1103 LEVITT H, 1971, J ACOUST SOC AM, V49, P467, DOI 10.1121/1.1912375 LIBERMAN AM, 1981, PERCEPT PSYCHOPHYS, V30, P133, DOI 10.3758/BF03204471 RAND TC, 1974, J ACOUST SOC AM, V55, P678, DOI 10.1121/1.1914584 REICHER GM, 1969, J EXP PSYCHOL, V81, P275, DOI 10.1037/h0027768 SHRIBERG EE, 1992, LANG SPEECH, V35, P127 Warren RM, 1997, PERCEPT PSYCHOPHYS, V59, P275, DOI 10.3758/BF03211895 WHALEN DH, 1987, SCIENCE, V237, P169, DOI 10.1126/science.3603014 YOST WA, 1989, J ACOUST SOC AM, V86, P2138, DOI 10.1121/1.398474 NR 23 TC 7 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2000 VL 30 IS 4 BP 197 EP 206 DI 10.1016/S0167-6393(99)00053-9 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 291KP UT WOS:000085733500001 ER PT J AU Verhelst, W AF Verhelst, W TI Overlap-add methods for time-scaling of speech SO SPEECH COMMUNICATION LA English DT Article DE speech processing; speech modification; time-scaling; time-warping; short-time Fourier transform; overlap-add; WSOLA AB In this tutorial on time-scaling we follow one particular line of thought towards computationally efficient high quality methods. We favor time-scaling based on time-frequency representations over model based approaches, and proceed to review an iterative phase reconstruction method for time-scaled magnitude spectrograms. The search for a good initial phase estimate leads us to consider synchronized overlap-add methods which are further optimized to eventually arrive at WSOLA, a technique based on a waveform similarity criterion. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Katholieke Univ Leuven, ESAT, B-3001 Heverlee, Belgium. RP Verhelst, W (reprint author), Katholieke Univ Leuven, ESAT, Kardinaal Mercierlaan 94, B-3001 Heverlee, Belgium. EM werner.verhelst@esat.kuleuven.ac.be CR BELLENS E, 1994, THESIS MSEE DALESSANDRO C, 1989, P EUROSPEECH 89, P211 Deller J. 
R., 1993, DISCRETE TIME PROCES DUTOIT T, 1992, SIGNAL PROCESS, V6, P343 GABOR D, 1947, NATURE, V159, P591, DOI 10.1038/159591a0 GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317 HESS W, 1992, SIGNAL PROCESS, V6, P37 LAROCHE J, 1993, P WORKSH APPL SIGN P, P200 LINDSAY PH, 1977, HUMAN INFORMATION PR MOORE B J, 1982, INTRO PSYCHOL HEARIN MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z Moulines E., 1995, SPEECH CODING SYNTHE, P519 ROADS C, 1998, COMPUTER MUSIC TUTOR ROUCOS S, 1985, P IEEE INT C AC SPEE, P493 SPLEESTERS G, 1994, 96 CONV AUD ENG SOC VALBRET H, 1992, 92E017 ENST van Heuven V.J., 1995, SPEECH CODING SYNTHE, P707 VERHELST W, 1990, 733 IPO I PERC RES Verhelst W., 1991, P EUROSPEECH 91, P1319 Verhelst W., 1997, P EUR ESCA, P899 Verhelst W., 1993, P IEEE INT C AC SPEE, P554 VERHELST W, 1991, P IEEE ICASSP, P501, DOI 10.1109/ICASSP.1991.150386 Vogten L.L.M., 1991, European patent, Patent No. [91202044.3, 912020443] Weinrichter H., 1986, Signal Processing III: Theories and Applications. Proceedings of EUSIPCO-86: Third European Signal Processing Conference NR 24 TC 21 Z9 25 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2000 VL 30 IS 4 BP 207 EP 221 DI 10.1016/S0167-6393(99)00051-5 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 291KP UT WOS:000085733500002 ER PT J AU Choi, SH Kim, HK Lee, HS AF Choi, SH Kim, HK Lee, HS TI Speech recognition using quantized LSP parameters and their transformations in digital communication SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; speech coder; digital communication; line spectrum pairs; pseudo-cepstrum AB In digital communication networks, speech recognition systems conventionally first reconstruct speech and then extract feature parameters. In this paper, we consider a useful approach of incorporating speech coding parameters into the speech recognizer. Most speech coders employed in digital communication networks use line spectrum pairs (LSPs) as spectral parameters. We introduce two ways to improve the recognition performance of the LSP-based speech recognizer. One is to devise weighted distance measures of LSPs and the other is to transform LSPs into a new feature set, named pseudo-cepstrum (PCEP). The speaker-independent connected-digit recognition experiments based on the discrete hidden Markov model showed that the weighted distance measures provide better recognition accuracy than unweighted ones do. Additionally, a mel-scale PCEP gives an even better performance than the weighted distance measures do. To clarify the performance improvement of the proposed methods, a significance test is introduced. As a result, the proposed methods achieved higher performances in recognition accuracy, compared with the conventional methods employing mel-frequency cepstral coefficients. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Korea Adv Inst Sci & Technol, Dept Elect Engn, Yusong Gu, Taejon 305701, South Korea. Samsung Adv Inst Technol, Human & Comp Interact Lab, Yongin 449712, Kyungki Do, South Korea. AT&T Labs Res, Florham Pk, NJ 07932 USA. SK Telecom, Cent Res Lab, Pundang Gu, Songnam 463020, Kyungki Do, South Korea. RP Choi, SH (reprint author), Korea Adv Inst Sci & Technol, Dept Elect Engn, Yusong Gu, 373-1 Kusong Dong, Taejon 305701, South Korea. 
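[Editor's note] The Choi, Kim and Lee record above reports gains from weighted LSP distance measures. The sketch below shows one plausible weighting, not necessarily the paper's exact measure: since closely spaced LSPs mark spectral peaks, each coefficient is weighted by the inverse of its distance to neighbouring LSPs. All names and the specific weighting rule are our illustrative assumptions.

```python
import numpy as np

def weighted_lsp_distance(lsp_a, lsp_b):
    """Weighted squared distance between two LSP vectors (radians, 0..pi).

    Weighting assumption: coefficients near a spectral peak (small gap
    to a neighbouring LSP) matter more perceptually, so they receive a
    larger weight, here the inverse of the smaller neighbouring gap.
    """
    a = np.asarray(lsp_a, dtype=float)
    b = np.asarray(lsp_b, dtype=float)
    padded = np.concatenate([[0.0], a, [np.pi]])       # LSPs lie in (0, pi)
    gaps = np.minimum(np.diff(padded)[:-1], np.diff(padded)[1:])
    weights = 1.0 / np.maximum(gaps, 1e-6)             # avoid divide-by-zero
    return float(np.sum(weights * (a - b) ** 2))
```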
EM shchoi@sait.samsung.co.kr; hkkim@research.att.com; hwanglee@sktelecom.com RI Lee, Hwang-Soo/C-1867-2011 CR Digalakis V, 1998, INT CONF ACOUST SPEE, P989, DOI 10.1109/ICASSP.1998.675433 EULER S, 1994, P INT C AC SPEECH SI, V1, P621 GARDNER WR, 1995, IEEE T SPEECH AUDI P, V3, P367, DOI 10.1109/89.466658 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 ITAKURA F, 1975, J ACOUST SOC AM, V57, pS35, DOI 10.1121/1.1995189 KIM HK, 1999, IN PRESS IEEE T SPEE Laroia R., 1991, P IEEE INT C AC SPEE, P641, DOI 10.1109/ICASSP.1991.150421 LILLY B, 1996, P ICSLP, V4, P2344, DOI 10.1109/ICSLP.1996.607278 Mendenhall W., 1995, STAT ENG SCI, V4th Oppenheim A. V., 1989, DISCRETE TIME SIGNAL Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363 *QUALC INC, 1993, SPEECH OPT STAND WID Ramaswamy GN, 1998, INT CONF ACOUST SPEE, P977, DOI 10.1109/ICASSP.1998.675430 Salonidis T, 1998, INT CONF ACOUST SPEE, P101, DOI 10.1109/ICASSP.1998.674377 SCHROEDER MR, 1981, IEEE T ACOUST SPEECH, V29, P297, DOI 10.1109/TASSP.1981.1163546 VU HL, 1998, P ICASSP, P45 William H.P., 1992, NUMERICAL RECIPES C NR 17 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2000 VL 30 IS 4 BP 223 EP 233 DI 10.1016/S0167-6393(99)00047-3 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 291KP UT WOS:000085733500003 ER PT J AU Chien, JT Junqua, JC AF Chien, JT Junqua, JC TI Unsupervised hierarchical adaptation using reliable selection of cluster-dependent parameters SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; unsupervised learning; confidence measure; speaker adaptation; hidden Markov model ID MAXIMUM-LIKELIHOOD-ESTIMATION; TELEPHONE SPEECH RECOGNITION; HIDDEN MARKOV-MODELS; SPEAKER ADAPTATION; TRANSFORMATION; ALGORITHM AB Adaptation of speaker-independent hidden Markov models (HMMs) to a new speaker using speaker-specific data is an effective approach to improve speech recognition performance for the enrolled speaker. Practically, it is desirable to flexibly perform the adaptation without any prior knowledge or limitation on the enrolled adaptation data (e.g, data transcription, length and content). However, the inevitable transcription errors may cause unreliability in the model adaptation (or transformation). The variable length and content of adaptation data usually make it necessary to dynamically control the degree of sharing in transformation-based adaptation. This paper presents an unsupervised hierarchical adaptation algorithm for flexible speaker adaptation. We build a tree structure of HMMs such that the control of transformation sharing can be achieved. To perform the unsupervised learning, we apply Bayesian theory to estimate the transformation parameters and data transcription. To select the parameters for hierarchical model transformation, we developed a new algorithm based on the maximum confidence measure (MCM) and minimum description length (MDL) criteria. Experimental comparisons on unsupervised speaker adaptation show that the hybrid adaptation scheme based on MCM and MDL criteria achieves the best recognition results for any lengths of enrollment data. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. Panason Technol Inc, Speech Technol Lab, Santa Barbara, CA USA. 
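[Editor's note] The Chien and Junqua record above selects how many cluster-dependent transformations to share using, among other criteria, minimum description length (MDL). A minimal sketch of the standard two-part MDL form is given below; this is an assumption of the generic criterion, not the paper's exact MCM/MDL formulation, and all names are ours.

```python
import numpy as np

def mdl_score(log_likelihood, n_params, n_samples):
    """Two-part MDL: data cost (-log L) plus a model cost that grows
    with the number of free transformation parameters."""
    return -log_likelihood + 0.5 * n_params * np.log(n_samples)

def select_tree_cut(candidate_cuts, n_frames):
    """candidate_cuts: (log_likelihood, n_params) for each cut through
    the HMM tree; a coarser cut means fewer, more widely shared
    transforms. Returns the index of the minimum-MDL cut, so the
    sharing granularity adapts to the amount of enrollment data."""
    scores = [mdl_score(ll, k, n_frames) for (ll, k) in candidate_cuts]
    return int(np.argmin(scores))
```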
RP Chien, JT (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. EM jtchien@mail.ncku.edu.tw; jcj@research.panasonic.com CR Abrash V, 1996, INT CONF ACOUST SPEE, P729, DOI 10.1109/ICASSP.1996.543224 AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705 Anastasakos T, 1998, P INT C SPOK LANG PR, P2303 Chien JT, 1997, SPEECH COMMUN, V22, P369, DOI 10.1016/S0167-6393(97)00033-2 CHIEN JT, 1998, P INT C SPOK LANG PR, P2295 Chien JT, 1997, IEEE SIGNAL PROC LET, V4, P167 Cox S. J., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266423 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P294, DOI 10.1109/89.506933 Fukunaga K., 1972, INTRO STAT PATTERN R FURUI S, 1989, IEEE T ACOUST SPEECH, V37, P1923, DOI 10.1109/29.45538 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Homma S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607808 Homma S, 1997, INT CONF ACOUST SPEE, P1023, DOI 10.1109/ICASSP.1997.596114 Huo Q, 1997, IEEE T SPEECH AUDI P, V5, P161 JUANG BH, 1986, IEEE T INFORM THEORY, V32, P307 KEMP T, 1998, P INT C SPOK LANG PR, P2207 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 Matsui T, 1998, COMPUT SPEECH LANG, V12, P41, DOI 10.1006/csla.1997.0036 Matsui T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607765 NGUYEN P, 1999, IEEE P INT C AC SPEE, V1, P173 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 RISSANEN J, 1984, IEEE T INFORM THEORY, V30, P629, DOI 10.1109/TIT.1984.1056936 ROBBINS H, 1964, ANN MATH STAT, V35, P1, DOI 10.1214/aoms/1177703729 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136 Shinoda K, 1996, INT CONF ACOUST SPEE, P717, DOI 10.1109/ICASSP.1996.543221 Shinoda K., 1997, P EUR SHINODA K, 1998, IEEE P INT C AC SPEE, V2, P793 Sukkar RA, 1996, IEEE T SPEECH AUDI P, V4, P420, DOI 10.1109/89.544527 Tou J. T., 1974, PATTERN RECOGNITION VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 NR 33 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2000 VL 30 IS 4 BP 235 EP 253 DI 10.1016/S0167-6393(99)00052-7 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 291KP UT WOS:000085733500004 ER PT J AU Altincay, H Demirekler, M AF Altincay, H Demirekler, M TI An information theoretic framework for weight estimation in the combination of probabilistic classifiers for speaker identification SO SPEECH COMMUNICATION LA English DT Article DE speaker identification; classifier combination; weight estimation; complementariness; decision consensus measure; information theory ID DECISION COMBINATION; MUTUAL INFORMATION; RECOGNITION; CLASSIFICATION AB In this paper, we describe a relation between classification systems and information transmission systems. 
By looking at the classification systems from this perspective, we propose a method of classifier weight estimation for the linear (LIN-OP) and logarithmic opinion pool (LOG-OP) type classifier combination schemes for which some tools from information theory are used. These weights provide contextual information about the classifiers such as class dependent classifier reliability and global classifier reliability. A measure for decision consensus among the classifiers is also proposed which is formulated as a multiplicative part of the classifier weights. A method of selecting the classifiers which provide complementary information for the combination operation is given. Using the proposed method, two classifiers are selected to be used in the combination operation. Simulation experiments in closed set speaker identification have shown that the method of weight estimation described in this paper improved the identification rates of both linear and logarithmic opinion type combination schemes. A comparison between the proposed method and some other methods of weight selection is also given at the end of the paper. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Middle E Tech Univ, Dept Elect & Elect Engn, Speech Proc Lab, TR-06531 Ankara, Turkey. RP Altincay, H (reprint author), Middle E Tech Univ, Dept Elect & Elect Engn, Speech Proc Lab, TR-06531 Ankara, Turkey. EM auto@metu.edu.tr; demirek@metu.edu.tr CR Al-Ghoneim K, 1998, PATTERN RECOGN, V31, P2077, DOI 10.1016/S0031-3203(98)00030-2 ALTINCAY H, 1999, P EUR BUD HUNG SEPT, P971 Altincay H, 1999, PROCEEDINGS OF THE IEEE-EURASIP WORKSHOP ON NONLINEAR SIGNAL AND IMAGE PROCESSING (NSIP'99), P321 Ash R. B., 1965, INFORMATION THEORY ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 BATTITI R, 1994, NEURAL NETWORKS, V7, P691, DOI 10.1016/0893-6080(94)90046-9 BATTITI R, 1994, IEEE T NEURAL NETWOR, V5, P537, DOI 10.1109/72.298224 BENEDIKTSSON JA, 1992, IEEE T SYST MAN CYB, V22, P688, DOI 10.1109/21.156582 Bloch I, 1996, IEEE T SYST MAN CY A, V26, P52, DOI 10.1109/3468.477860 BRUNELLI R, 1995, IEEE T PATTERN ANAL, V17, P955, DOI 10.1109/34.464560 Campbell JP, 1997, P IEEE, V85, P1437, DOI 10.1109/5.628714 CASTELLANO PJ, 1997, P IEEE ICASSP, P1075 Chen K, 1998, NEUROCOMPUTING, V20, P227, DOI 10.1016/S0925-2312(98)00019-8 FARRELL KR, 1995, P IEEE ICASSP 95 1, P349 Furui S, 1997, PATTERN RECOGN LETT, V18, P859, DOI 10.1016/S0167-8655(97)00073-1 Genest C., 1986, STAT SCI, V1, P114, DOI 10.1214/ss/1177013825 Gish H, 1994, IEEE SIGNAL PROC MAG, V11, P18, DOI 10.1109/79.317924 Hashem S, 1997, NEURAL NETWORKS, V10, P599, DOI 10.1016/S0893-6080(96)00098-6 HO TK, 1994, IEEE T PATTERN ANAL, V16, P66 HOBALLAH IY, 1989, IEEE T INFORM THEORY, V35, P988, DOI 10.1109/18.42216 HUANG TS, 1995, ENVIRON GEOCHEM HLTH, V17, P1, DOI 10.1007/BF00188624 JACOBS RA, 1995, NEURAL COMPUT, V7, P865 Kittler J, 1998, IEEE T PATTERN ANAL, V20, P226, DOI 10.1109/34.667881 LAM L, 1995, PATTERN RECOGN LETT, V16, P945, DOI 10.1016/0167-8655(95)00050-Q Lathi B. 
P., 1989, MODERN DIGITAL ANALO Lin XF, 1998, PATTERN RECOGN LETT, V19, P975, DOI 10.1016/S0167-8655(98)00072-5 LINARES LR, 1997, EUROSPEECH P, P2315 MCELIECE RJ, 1977, THEORY INFORMATION C MELIN H, 1997, GUIDELINES EXPT POLY Pearl J., 1988, PROBABILISTIC REASON Radova V, 1997, INT CONF ACOUST SPEE, P1135, DOI 10.1109/ICASSP.1997.596142 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 ROGOVA G, 1994, NEURAL NETWORKS, V7, P777, DOI 10.1016/0893-6080(94)90099-X SETHI IK, 1982, IEEE T PATTERN ANAL, V4, P441 Shafer G., 1976, MATH THEORY EVIDENCE SOONG FK, 1988, IEEE T ACOUST SPEECH, V36, P871, DOI 10.1109/29.1598 STEPHANOU HE, 1988, IEEE T PATTERN ANAL, V10, P544, DOI 10.1109/34.3916 Tumer K, 1996, PATTERN RECOGN, V29, P341, DOI 10.1016/0031-3203(95)00085-2 Varshney P.K., 1997, DISTRIBUTED DETECTIO XU L, 1992, IEEE T SYST MAN CYB, V22, P418, DOI 10.1109/21.155943 Yu K, 1997, PATTERN RECOGN LETT, V18, P1421, DOI 10.1016/S0167-8655(97)00113-X NR 41 TC 25 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2000 VL 30 IS 4 BP 255 EP 272 DI 10.1016/S0167-6393(99)00054-0 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 291KP UT WOS:000085733500005 ER PT J AU Hong, WT Chen, SH AF Hong, WT Chen, SH TI A robust training algorithm for adverse speech recognition SO SPEECH COMMUNICATION LA English DT Article DE robust training algorithm; PMC noise-compensation; signal bias-compensation; Mandarin speech recognition ID HIDDEN MARKOV-MODELS; MAXIMUM-LIKELIHOOD; NOISE; COMPENSATION; ENVIRONMENTS; ENHANCEMENT; ADAPTATION; SYSTEMS; CARS AB In this paper, a new robust training algorithm is proposed for the generation of a set of bias-removed, noise-suppressed reference speech HMM models in adverse environments suffering from both channel bias and additive noise. Its main idea is to incorporate a signal bias-compensation operation and a PMC noise-compensation operation into its iterative training process. This makes the resulting speech HMM models more suitable for the given robust speech recognition method using the same signal bias-compensation and PMC noise-compensation operations in the recognition process. Experimental results showed that the speech HMM models it generated outperformed both the clean-speech HMM models and those generated by the conventional k-means algorithm for two adverse Mandarin speech recognition tasks. It is thus a promising robust training algorithm. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Ind Technol Res Inst, Hsinchu, Taiwan. Natl Chiao Tung Univ, Dept Commun Engn, Hsinchu, Taiwan. RP Hong, WT (reprint author), Ind Technol Res Inst, E000-CCL, Hsinchu, Taiwan.
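[Editor's note] The Altincay and Demirekler record above combines speaker-identification classifiers with linear (LIN-OP) and logarithmic (LOG-OP) opinion pools. The two pooling rules themselves are standard and sketched below; the paper's information-theoretic weight estimation is not reproduced, and the function names are ours.

```python
import numpy as np

def linear_opinion_pool(probs, weights):
    """LIN-OP: weighted arithmetic mean of classifier posteriors.
    probs: (n_classifiers, n_classes); weights sum to one."""
    return np.average(np.asarray(probs), axis=0, weights=weights)

def log_opinion_pool(probs, weights):
    """LOG-OP: weighted geometric mean of posteriors, renormalized."""
    probs = np.asarray(probs, dtype=float)
    w = np.asarray(weights, dtype=float)[:, None]
    logp = np.sum(w * np.log(probs), axis=0)
    p = np.exp(logp - logp.max())          # stabilize before normalizing
    return p / p.sum()
```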
EM jfhong@taiwan.com; schen@cc.nctu.edu.tw CR Acero A., 1990, P ICASSP, P849 Acero A., 1991, P IEEE ICASSP TOR CA, P893, DOI 10.1109/ICASSP.1991.150483 ANASTASAKOS T, 1997, P ICASSP, P1043 Chang PC, 1993, IEEE T SPEECH AUDI P, V1, P326, DOI 10.1109/89.232616 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 FORNEY GD, 1973, P IEEE, V61, P268, DOI 10.1109/PROC.1973.9030 Furui S., 1992, P ESCA WORKSH SPEECH, P31 GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 GALES MJF, 1993, SPEECH COMMUN, V12, P231, DOI 10.1016/0167-6393(93)90093-Z GONG Y, 1997, P EUR C SPEECH COMM, V3, P1555 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HONG WT, 1999, J ACOUST SOC AM HONG WT, 1997, P EUROSPEECH 97, V3, P1083 JUANG BH, 1990, IEEE T ACOUST SPEECH, V38, P1639, DOI 10.1109/29.60082 Juang B. H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E JUNGUA JC, 1996, ROBUSTNESS AUTOMATIC Junqua JC, 1994, IEEE T SPEECH AUDI P, V2, P406, DOI 10.1109/89.294354 Lee C, 1996, FOLD DES, V1, P1, DOI 10.1016/S1359-0278(96)00006-5 Lee CH, 1998, SPEECH COMMUN, V25, P29, DOI 10.1016/S0167-6393(98)00028-4 LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 LIU FH, 1996, P ICASSP 96, V1, P157 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z MINAMI Y, 1996, P IEEE INT C AC SPEE, P327 MOKBEL CE, 1995, IEEE T SPEECH AUDI P, V3, P346, DOI 10.1109/89.466660 NAKAMURA S, 1996, P ICASSP 96, V1, P69 NICHOLSON S, 1997, P EUROSPEECH 97, V1, P413 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 Vaseghi SV, 1997, IEEE T SPEECH AUDI P, V5, P11, DOI 10.1109/89.554264 WANG YR, 1998, P ICASSP 98, V2, P841 Zhao YX, 1996, SPEECH COMMUN, V18, P65, DOI 10.1016/0167-6393(95)00036-4 NR 37 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 2000 VL 30 IS 4 BP 273 EP 293 DI 10.1016/S0167-6393(99)00057-6 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 291KP UT WOS:000085733500006 ER PT J AU Bernstein, J AF Bernstein, J TI Special issue on language learning SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Ordinate Corp, Menlo Pk, CA 94025 USA. RP Bernstein, J (reprint author), Ordinate Corp, 1040 Noel Dr, Menlo Pk, CA 94025 USA. EM jared@ordinate.com NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
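[Editor's note] The PMC noise-compensation operation named in the Hong and Chen record above builds on parallel model combination, which maps clean-speech and noise model parameters back to the linear spectral domain, adds them, and returns to the cepstral domain. A sketch of the standard log-add approximation for static cepstral means follows; covariance compensation and exact constants are omitted, and the orthonormal DCT pair is our simplifying choice.

```python
import numpy as np
from scipy.fft import dct, idct

def pmc_log_add(clean_cep_mean, noise_cep_mean, gain=1.0):
    """Combine a clean-speech cepstral mean with a noise cepstral mean.

    Steps: cepstrum -> log spectrum (inverse DCT) -> linear spectrum
    (exp), add speech and gain-scaled noise energies, then map back.
    """
    clean_lin = np.exp(idct(np.asarray(clean_cep_mean), norm='ortho'))
    noise_lin = np.exp(idct(np.asarray(noise_cep_mean), norm='ortho'))
    noisy_lin = clean_lin + gain * noise_lin   # additivity holds here
    return dct(np.log(noisy_lin), norm='ortho')
```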
PD FEB PY 2000 VL 30 IS 2-3 BP 81 EP 82 DI 10.1016/S0167-6393(99)00058-8 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 286NU UT WOS:000085452700001 ER PT J AU Neumeyer, L Franco, H Digalakis, V Weintraub, M AF Neumeyer, L Franco, H Digalakis, V Weintraub, M TI Automatic scoring of pronunciation quality SO SPEECH COMMUNICATION LA English DT Article DE automatic pronunciation scoring; speech technology; hidden Markov models; speech recognition; pronunciation quality assessment; language instruction systems; computer aided language learning AB We present a paradigm for the automatic assessment of pronunciation quality by machine. In this scoring paradigm, both native and nonnative speech data is collected and a database of human-expert ratings is created to enable the development of a variety of machine scores. We first discuss issues related to the design of speech databases and the reliability of human ratings. We then address pronunciation evaluation as a prediction problem, trying to predict the grade a human expert would assign to a particular skill. Using the speech and the expert-ratings databases, we build statistical models and introduce different machine scores that can be used as predictor variables. We validate these machine scores on the Voice Interactive Language Training System (VILTS) corpus, evaluating the pronunciation of American speakers speaking French and we show that certain machine scores, like the log-posterior and the normalized duration, achieve a correlation with the targeted human grades that is comparable to the human-to-human correlation when a sufficient amount of speech data is available. (C) 2000 Elsevier Science B.V. All rights reserved. C1 SRI Int, Menlo Pk, CA 94025 USA. RP Neumeyer, L (reprint author), SRI Int, 333 Ravenswood Ave, Menlo Pk, CA 94025 USA. CR BERNSTEIN J, 1992, SRI INT INTERNAL REP BERNSTEIN J, 1990, P INT C SPOK LANG PR DIGALAKIS V, 1996, IEEE T SPEECH AUDIO, P281 DIGALAKIS V, 1992, ALGORITHM DEV AUTOGR DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659 FRNCO H, 1997, P INT C AC SPEECH SI, P1471 GREENBERG S, 1992, EVALUATING PRONUNCIA Kim Y., 1997, P 4 EUR C SPEECH COM, P649 Neumeyer L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607890 Ronen O., 1997, P EUR C SPEECH COMM, P645 RYPA M, 1996, VILTS VOICE INTERACT NR 11 TC 76 Z9 83 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2000 VL 30 IS 2-3 BP 83 EP 93 DI 10.1016/S0167-6393(99)00046-1 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 286NU UT WOS:000085452700002 ER PT J AU Witt, SM Young, SJ AF Witt, SM Young, SJ TI Phone-level pronunciation scoring and assessment for interactive language learning SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; computer-assisted language learning; pronunciation assessment; pronunciation teaching AB This paper investigates a method of automatic pronunciation scoring for use in computer-assisted language learning (CALL) systems. The method utilises a likelihood-based 'Goodness of Pronunciation' (GOP) measure which is extended to include individual thresholds for each phone based on both averaged native confidence scores and on rejection statistics provided by human judges. 
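The 'Goodness of Pronunciation' measure described at the end of the record above is, in essence, a duration-normalised log-likelihood ratio between the intended phone and the best competing phone. A minimal sketch, assuming the segment log-likelihoods have already been computed elsewhere; the values and the threshold are invented for illustration.

```python
def goodness_of_pronunciation(loglik_intended, loglik_phone_loop, n_frames):
    """Duration-normalised log-likelihood ratio for one phone segment.

    loglik_intended: total log-likelihood of the segment under the phone
        the speaker was supposed to produce (forced alignment).
    loglik_phone_loop: total log-likelihood under an unconstrained phone
        loop (the best competing phone sequence).
    """
    return abs(loglik_intended - loglik_phone_loop) / max(n_frames, 1)

# Illustrative values only: a 12-frame segment whose intended-phone score
# falls well below the phone-loop score receives a large (bad) GOP.
gop = goodness_of_pronunciation(-310.0, -250.0, 12)
threshold = 4.0  # invented per-phone rejection threshold
print(f"GOP = {gop:.2f} -> {'reject' if gop > threshold else 'accept'}")
```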
Further improvements are obtained by incorporating models of the subject's native language and by augmenting the recognition networks to include expected pronunciation errors. The various GOP measures are assessed using a specially recorded database of non-native speakers which has been annotated to mark phone-level pronunciation errors. Since pronunciation assessment is highly subjective, a set of four performance measures has been designed, each of them measuring different aspects of how well computer-derived phone-level scores agree with human scores. These performance measures are used to cross-validate the reference annotations and to assess the basic GOP algorithm and its refinements. The experimental results suggest that a likelihood-based pronunciation scoring metric can achieve usable performance, especially after applying the various enhancements. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Witt, SM (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England. EM smw24@eng.cam.ac.uk; sjy@eng.cam.ac.uk CR ARSLAN L, 1997, ICASSP 97 MUNICH GER Bernstein J., 1990, ICSLP 90 KOB JAP, P1185 CHANDLER R, 1991, FAREWELL MY LOVELY EHSANI F, 1997, P EUROSPEECH97 RHOD Eskenazi M., 1996, ICSLP 96 PHIL PA US FINE A, 1995, PENGUIN READERS SERI FRANSEN J, 1994, 192 CUED FINFENG TR GODDIJN A, 1997, P EUROSPEECH 97 RHOD HAMADA H, 1993, IEICE T INF SYST, VE76D, P352 HILLER S, 1993, SPEECH COMMUN, V13, P463, DOI 10.1016/0167-6393(93)90045-M Kawai G., 1997, P EUROSPEECH 97 RHOD Kenworthy Joanne, 1987, TEACHING ENGLISH PRO Kim Y., 1997, P EUROSPEECH 97 RHOD KNOLL K, 1994, 193 CUED FINFENG TR LEGGETTER C, 1994, 181 CUED F INFENG TR NEUMEYER L, 1996, ICSLP 96 PHIL PA US Rogers C., 1994, J ACOUST SOC AM 2, V96 Ronen O., 1997, P EUROSPEECH 97 RHOD Young S., 1996, HTK BOOK NR 19 TC 98 Z9 107 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2000 VL 30 IS 2-3 BP 95 EP 108 DI 10.1016/S0167-6393(99)00044-8 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 286NU UT WOS:000085452700003 ER PT J AU Cucchiarini, C Strik, H Boves, L AF Cucchiarini, C Strik, H Boves, L TI Different aspects of expert pronunciation quality ratings and their relation to scores produced by speech recognition algorithms SO SPEECH COMMUNICATION LA English DT Article DE automatic pronunciation assessment; expert ratings; native and non-native pronunciation AB The ultimate aim of the research reported on here is to develop an automatic testing system for Dutch pronunciation. In the experiment described in this paper automatic scores of telephone speech produced by native and non-native speakers of Dutch are compared with specific, i.e., temporal and segmental, and global pronunciation ratings assigned by three groups of experts: three phoneticians and two groups of three speech therapists. The goals of this experiment are to determine (1) whether specific expert ratings of pronunciation quality contribute to our understanding of the relation between human pronunciation scores and machine scores of speech quality, (2) whether different expert groups assign essentially different ratings, and (3) to what extent rater pronunciation scores can be predicted on the basis of automatic scores.
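Goal (3) above, predicting expert ratings from automatic scores, can be pictured as a simple regression plus a correlation check. A toy sketch on synthetic data; none of these numbers come from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
machine = rng.uniform(0.0, 1.0, size=60)            # synthetic machine scores
expert = 10.0 * machine + rng.normal(0.0, 1.0, 60)  # synthetic expert ratings

# Least-squares fit: expert ~ a*machine + b
a, b = np.polyfit(machine, expert, deg=1)
predicted = a * machine + b

# Pearson correlation between predicted and observed ratings
r = np.corrcoef(predicted, expert)[0, 1]
print(f"fit: expert ~ {a:.2f}*machine + {b:.2f}; r = {r:.3f}")
```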
The results show that collecting specific ratings along with overall ones leads to a better understanding of the relation between human and automatic pronunciation assessment. Furthermore, after normalization no considerable differences are observed between the ratings by the three expert groups. Finally, it appears that the speech quality scores produced by our speech recognizer can predict expert pronunciation ratings with a high degree of accuracy. (C) 2000 Published by Elsevier Science B.V. All rights reserved. C1 Univ Nijmegen, Dept Language & Speech, NL-6500 HD Nijmegen, Netherlands. RP Cucchiarini, C (reprint author), Univ Nijmegen, Dept Language & Speech, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM catia@let.kun.nl; strik@let.kun.nl; boves@let.kun.nl CR ANDERSONHSIEH J, 1992, LANG LEARN, V42, P529, DOI 10.1111/j.1467-1770.1992.tb01043.x Bernstein J., 1990, P INT C SPOK LANG PR, P1185 Cucchiarini C., 1997, P IEEE WORKSH ASRU S, P622 DENOS EA, 1995, P EUR 95, P825 FERGUSON GA, 1987, STAT ANAL PSYCHOL ED FLEGE JE, 1992, J ACOUST SOC AM, V91, P370, DOI 10.1121/1.402780 FRANCO H, 1997, P INT C AC SPEECH SI, P1471 KRAAYEVELD H, 1997, THESIS U NIJMEGEN NI Labov William, 1966, SOCIAL STRATIFICATIO LEE CY, 1997, P 21 NAT C THEOR APP, P63 Neumeyer L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607890 STRIK H, 1997, INT J SPEECH TECHNOL, P121 NR 12 TC 31 Z9 31 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2000 VL 30 IS 2-3 BP 109 EP 119 DI 10.1016/S0167-6393(99)00040-0 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 286NU UT WOS:000085452700004 ER PT J AU Franco, H Neumeyer, L Digalakis, V Ronen, O AF Franco, H Neumeyer, L Digalakis, V Ronen, O TI Combination of machine scores for automatic grading of pronunciation quality SO SPEECH COMMUNICATION LA English DT Article DE automatic pronunciation scoring; combination of scores; hidden Markov models; speech recognition; pronunciation quality assessment; language instruction systems; computer aided language learning AB This work is part of an effort aimed at developing computer-based systems for language instruction; we address the task of grading the pronunciation quality of the speech of a student of a foreign language. The automatic grading system uses SRI's Decipher(TM) continuous speech recognition system to generate phonetic segmentations. Based on these segmentations and probabilistic models we produce different pronunciation scores for individual or groups of sentences that can be used as predictors of the pronunciation quality. Different types of these machine scores can be combined to obtain a better prediction of the overall pronunciation quality. In this paper we review some of the best-performing machine scores and discuss the application of several methods based on linear and nonlinear mapping and combination of individual machine scores to predict the pronunciation quality grade that a human expert would have given. We evaluate these methods in a database that consists of pronunciation-quality-graded speech from American students speaking French. 
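The abstract above resumes after this sketch. In between, a hedged illustration of what combining several machine scores into one predicted grade can look like: linear and nonlinear combiners are compared on synthetic scores. scikit-learn is used purely for convenience and is not implied by the paper; all feature names and data are invented.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(2)
# Synthetic per-sentence machine scores: log-posterior, duration, rate
X = rng.normal(size=(200, 3))
# Synthetic human grade with a mild nonlinearity in the duration score
y = 3.0 + 0.8 * X[:, 0] + np.tanh(X[:, 1]) + 0.1 * rng.normal(size=200)

linear = LinearRegression().fit(X[:150], y[:150])
nonlin = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000,
                      random_state=0).fit(X[:150], y[:150])

# Compare held-out correlation with the "human" grades
for name, model in [("linear", linear), ("nonlinear", nonlin)]:
    r = np.corrcoef(model.predict(X[150:]), y[150:])[0, 1]
    print(f"{name} combination: correlation with grades = {r:.3f}")
```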
With predictors based on spectral match and on durational characteristics, we find that the combination of scores improved the prediction of the human grades and that nonlinear mapping and combination methods performed better than linear ones. Characteristics of the different nonlinear methods studied are discussed. (C) 2000 Elsevier Science B.V. All rights reserved. C1 SRI Int, Speech Technol & Res Lab, Menlo Pk, CA 94025 USA. RP Franco, H (reprint author), SRI Int, Speech Technol & Res Lab, 333 Ravenswood Ave, Menlo Pk, CA 94025 USA. CR BERNSTEIN J, SRI INT INTERNAL REP Bernstein J., 1990, P INT C SPOK LANG PR, P1185 BOURLARD H, 1990, IEEE T PATTERN ANAL, V12, P1167, DOI 10.1109/34.62605 BREIMAN L, 1984, WAXWORKS BROOKS COLE BUNTINE W, 1992, INTRO IND V 2 1 RECU Cybenko G., 1989, Mathematics of Control, Signals, and Systems, V2, DOI 10.1007/BF02551274 DIGALAKIS V, 1994, P INT C AC SPEECH SI, P1537 DIGALAKIS V, 1992, ALGORITHM DEV AUTOGR Draper N., 1981, APPL REGRESSION ANAL FRANCO H, 1997, P INT C AC SPEECH SI, P1471 KAY S, 1993, APPRENTICE HALL SI A NEUMEYE RL, 1996, P ICSLP 96 PHIL PA, P1457 NEUMEYER L, 1998, P WORKSH SPEECH TECH, P61 Quinlan J. R., 1986, Machine Learning, V1, DOI 10.1023/A:1022643204877 Richard M. D., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.4.461 Rumelhart D, 1986, PARALLEL DISTRIBUTED RYPA M, 1996, P CALICO ALB NEW MEX NR 17 TC 45 Z9 48 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2000 VL 30 IS 2-3 BP 121 EP 130 DI 10.1016/S0167-6393(99)00045-X PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 286NU UT WOS:000085452700005 ER PT J AU Kawai, G Hirose, K AF Kawai, G Hirose, K TI Teaching the pronunciation of Japanese double-mora phonemes using speech recognition technology SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; pronunciation teaching; Tokushuhaku; duration AB A CALL (computer-aided language learning) system was developed for teaching the pronunciation of Japanese double-mora phonemes to non-native speakers of Japanese. Double-mora vowels and single-mora vowels are spectrally almost identical but their phone durations differ significantly. Similar conditions exist between moraic nasals and non-moraic nasals, and between moraic and non-moraic obstruents. Our CALL system asks the learner to read minimal pairs. Speech recognition technology is used to measure the durations of each phone and the system tells the learner the likelihood of native speakers understanding the learner's utterance as the learner intended. These intelligibility scores are based on perception experiments where native speakers judged the confusability of minimal pairs containing phones with various synthesized durations. The system then instructs the learner to either shorten or lengthen his pronunciation. The learner can terminate training when his communicative performance has met his expectations. For instance, when a learner hits a learning plateau, intelligibility indices can help him decide whether further learning effort is worthwhile. Given that most adult learners can never attain complete nativeness, it is of practical use to be told when non-native accents cannot be removed further. Learning experiments show that learners quickly capture the relevant duration cues. (C) 2000 Elsevier Science B.V. All rights reserved.
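The shorten/lengthen feedback described in the Kawai and Hirose record above reduces to comparing a measured phone duration against perception-derived boundaries. A minimal sketch with invented durations; a real system would use the calibrated intelligibility curves from the perception experiments.

```python
def duration_feedback(measured_ms, single_mora_ms=80.0, double_mora_ms=160.0):
    """Tell the learner to shorten or lengthen a double-mora vowel.

    Invented boundary: halfway between typical single- and double-mora
    durations. The reference durations here are illustrative stand-ins.
    """
    boundary = 0.5 * (single_mora_ms + double_mora_ms)
    if measured_ms <= boundary:
        return "lengthen: likely heard as a single mora"
    if measured_ms < double_mora_ms:
        return "slightly lengthen for a clearer double mora"
    return "good: duration is in the double-mora range"

for ms in (70.0, 130.0, 170.0):
    print(ms, "->", duration_feedback(ms))
```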
C1 Univ Tokyo, Dept Informat & Commun Engn, Bunkyo Ku, Tokyo 1138656, Japan. RP Kawai, G (reprint author), Univ Tokyo, Dept Informat & Commun Engn, Bunkyo Ku, 7-3-1 Hongo, Tokyo 1138656, Japan. EM goh@kawai.com; hirose@gavo.t.u-tokyo.ac.jp CR *AUR, 1995, AUR VERS 2 50 3 EHSANI F, 1997, P EUR 1997 RHOD GREE, P681 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 HIBIYA J, 1996, J PHONETIC SOC JAPAN, V211, P43 Hiller S., 1994, Computer Assisted Language Learning, V7, DOI 10.1080/0958822940070105 HILLER S, 1993, P EUR 199O BERL GERM, P1343 Hirose K., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing Hirose K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607133 IMAIZUMI H, 1989, SP8936 IEICE ISHII E, 1997, J JAPANESE LANG TEAC, V94, P2 KAWAI G, 1998, P ICSLP 1998 SYDN AU, P1823 KAWAI G, 1997, P EUR 1997 RHOD GREE, P657 MIZUTANI O, 1993, JAPANESE SPEECH JAPA, P1 RAMALAKSHMI V, 1997, JAPANESE LANGUAGE IN, P146 RONEN O, 1997, P EUR 1997 RHOD GREE, P649 SAIDA I, 1991, STUDY PROSODIC FEATU, P137 *SYR LANG SYST, 1996, TRIPL PLAY PLUS TAKEDA K, 1997, 97SLP183 IPSJ SIG TANIGUCHI H, 1991, PROSODY ITS ROLE TJS, P17 TOKI S, 1989, JAPANESE JAPANESE LA, V13, P111 TOKI S, 1988, JAPANESE EXERCISES N, V12 Witt S., 1997, P EUR, P633 WOODFORD P, 1997, J JAPANESE LANG TEAC, V94, P160 YOUNG S, 1997, HTK BOOK VERSION 2 1 NR 24 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2000 VL 30 IS 2-3 BP 131 EP 143 DI 10.1016/S0167-6393(99)00041-2 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 286NU UT WOS:000085452700006 ER PT J AU Delmonte, R AF Delmonte, R TI SLIM prosodic automatic tools for self-learning instruction SO SPEECH COMMUNICATION LA English DT Article DE prosody; computer assisted language learning (CALL); speech recognition; automatic tools for prosodic analysis; friendly man-machine interface; contrastive linguistic analysis; syllable-timed versus stressed-timed languages; syllable-driven versus segment-driven acoustic segmentation; F0 tracking in time domain; fuzzy logics in threshold setting ID ENGLISH AB We present the Prosodic Module of a courseware for computer-assisted foreign language learning called SLIM - an acronym for Multimedia Interactive Linguistic Software, developed at the University of Venice (see Delmonte et al., 1999a,b). The Prosodic Module has been created in order to deal with the problem of improving a student's performance both in the perception and production of prosodic aspects of spoken language activities. It is composed of two different sets of Learning Activities, the first one dealing with phonetic and prosodic problems at word level and at segmental level - where segmental refers to syllable-sized segments; the second one dealing with prosodic aspects at phonological phrase and utterance suprasegmental level. The main goal of Prosodic Activities is to ensure consistent and pedagogically sound feedback to the student intending to improve his/her pronunciation in a foreign language. We argue that the use of Automatic Speech Recognition (ASR) as Teaching Aid should be under-utilized and should be targeted to narrowly focussed spoken exercises, disallowing open-ended dialogues, in order to ensure consistency of evaluation. 
In addition, we argue that ASR alone cannot be used to gauge Goodness of Pronunciation (GOP), being inherently inadequate for that goal. On the contrary, we support the conjoined use of ASR technology and prosodic tools to produce GOP useable for linguistically consistent and adequate feedback to the student. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Univ Ca Foscari Ca Garzoni Moro, Lab Linguist Computaz, I-30124 Venice, Italy. RP Delmonte, R (reprint author), Univ Ca Foscari Ca Garzoni Moro, Lab Linguist Computaz, San Marco 3417, I-30124 Venice, Italy. EM delmont@unive.it CR AKAHANEYAMADA R, 1998, P STILL 98 ESCA MARH, P111 AUBERG S, 1998, P STILL 98 ESCA MARH, P103 AUBERG S, 1998, P STILL 98 ESCA MARH, P69 BACALU C, 1999, ATT 9 C GFS AIA VEN Bagshaw P., 1994, THESIS U EDINBURGH U BAGSHAW PC, P EUROSPEECH 93 BERL, V93, P1003 Bertinetto Pier Marco, 1981, STRUTTURE PROSODICHE BERTINETTO PM, 1980, J PHONETICS, V8, P385 CAMPBELL W, 1993, EUROSPEECH 93, P1081 CAMPBELL WN, 1991, J PHONETICS, V19, P37 DELCLOQUE P, 1998, P STILL 98 ESCA MARH, P9 DELMONTE R, 1995, PROSODICS TOOL IMPRO DELMONTE R, 1991, NATO ASI SERIES F, V75, P481 DELMONTE R, 1999, CONV GFS AIA ROM, P47 DELMONTE R, 1987, P 11 ICPHS, V2, P101 DELMONTE R, 1991, P EUROSPEECH91 GEN, P1291 DELMONTE R, 1999, SLIM MODEL AUTOMATIC, P326 DELMONTE R, 1986, P ICASSP 86 IEEE TOK, V4, P2407 DELMONTE R, 1981, STUDI GRAMMATICA ITA, P69 DELMONTE R, 1999, ATT 9 CONV GFSAIA VE *ESCA, 1998, P STILL 98 ESCA MARH Eskenazi M, 1998, P SPEECH TECHN LANG, P77 HAMADA H, 1993, IEICE T INF SYST, VE76D, P352 HILLER S, 1993, SPEECH COMMUN, V13, P463, DOI 10.1016/0167-6393(93)90045-M KAWAI G, 1997, P EUR 97, V2, P657 KIM Y, 1997, P EUR 97, V2, P645 KITTLER J, 1990, NATO ASI SERIES, V75 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 Lehiste I., 1977, J PHONETICS, V5, P253 LEHISTE I, 1977, T SIGNAL PROCESSING, V39, P40 MEADOR J, 1998, P STILL 98 ESCA MARH, P65 MEDAN Y, 1991, SUPER RESOLUTION PIT MENNEN I, 1998, P STILL 98 ESCA MARH, P17 NEUMEYER L, 1998, P WORKSH SPEECH TECH, P61 PRICE P, 1998, P STILL 98 ESCA MARH, P103 Ronen O., 1997, P EUR 97, V2, P649 UEYAMA M, 1997, P ESCA 97, V5, P2411 UMEDA N, 1977, J ACOUST SOC AM, V61, P846, DOI 10.1121/1.381374 van Son Rob J. J. H., 1997, P EUR 97, V1, P319 VANSANTEN J, 1997, P EUR 97, V5, P2651 VANSANTEN J, 1997, P EUR 97, V1, P19 VEKAS D, 1991, INTERFACCIA TRA FONO WAIBEL A, 1986, P ICASSP, P2287 WALLACE J, 1998, P STILL 98 ESCA MARH, P21 Witt S., 1998, P SPEECH TECHN LANG, P99 WITT S, 1997, P EUR 97, V2, P633 1999, CALICO J, V16 NR 47 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2000 VL 30 IS 2-3 BP 145 EP 166 DI 10.1016/S0167-6393(99)00043-6 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 286NU UT WOS:000085452700007 ER PT J AU Ehsani, F Bernstein, J Najmi, A AF Ehsani, F Bernstein, J Najmi, A TI An interactive dialog system for learning Japanese SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; computer tutor; Japanese language AB Subarashii is a system that uses automatic speech recognition (ASR) to offer first-level, computer-based exercises in the Japanese language for beginning high school students. Building the Subarashii system has identified strengths and limitations of ASR technology. 
The system was tested with 34 students at Silver Creek High School in San Jose, California and with 13 students at Stanford University in Stanford, California. Recognition accuracy was measured and user errors were analyzed. The functional accuracy defined as the percentage of time when the system performs the correct functional behavior turned out to be generally higher than the per-utterance speech recognition accuracy. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Sehda Inc, Menlo Pk, CA 91025 USA. Ordinate Corp, Menlo Pk, CA 94025 USA. Stanford Univ, Stanford, CA 94305 USA. RP Ehsani, F (reprint author), Sehda Inc, 1040 Noel Dr, Menlo Pk, CA 91025 USA. EM farzad@sehda.com CR BERNSTEIN J, 1990, P ICSLP 90 KOB JAP BILANGE E, 1991, 5 C EUR CHAPT ASS CO DAHLBACK N, 1992, PROCEEDINGS OF THE FOURTEENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, P785 EHSANI F, 1997, EUR C SPEECH COMM TE, P681 JORDAN EH, 1987, JAPANESE SPOKEN LA 3 JORDAN EH, 1987, JAPANESE SPOKEN LA 2 JORDAN EH, 1987, JAPANESE SPOKEN LA 1 LEVINSON SC, 1981, DISCOURSE PROCESS, V4, P93 MEADOR J, 1998, P STILL 98 ESCA MARH, P65 Neumeyer L, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P1457 TOHSAKU YH, 1994, YOOKOSO INVITATION C Warschauer M, 1996, MULTIMEDIA LANGUAGE, P3 Waters R. C., 1995, Computer Assisted Language Learning, V8, DOI 10.1080/0958822950080403 YOUNG S, 1997, HTK BOOK, P171 NR 14 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2000 VL 30 IS 2-3 BP 167 EP 177 DI 10.1016/S0167-6393(99)00042-4 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 286NU UT WOS:000085452700008 ER PT J AU Yamada, Y Javkin, H Youdelman, K AF Yamada, Y Javkin, H Youdelman, K TI Assistive speech technology for persons with speech impairments SO SPEECH COMMUNICATION LA English DT Article DE speech training with deaf persons; speech-impairment; visual feedback; articulation; dynamic palatograph AB This paper describes a computer-based speech training system being developed for persons with speech-impairments, especially for profoundly deaf children. This system, called computer integrated speech training aid (CISTA) provides objective data and on-line diagnostic information visually, facilitates record keeping for teachers and increases student motivation. CISTA has been commercially available in Japan since 1988, mainly used in deaf schools and hospitals. CISTA has been used for some years to evaluate its efficiency and applicability in the schools for the deaf and speech clinics in rehabilitation centers in Japan and the USA. The results of two experiments have shown its effectiveness. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Matsushita Elect Ind Co Ltd, Adv Technol Res Labs, Osaka 5708501, Japan. Univ Calif Santa Barbara, Dept Linguist, Santa Barbara, CA 93106 USA. RP Yamada, Y (reprint author), Matsushita Elect Ind Co Ltd, Adv Technol Res Labs, 3-1-1 Yagumonaka Machi, Osaka 5708501, Japan. 
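The Subarashii record above distinguishes functional accuracy (did the system do the right thing?) from per-utterance recognition accuracy (was the utterance recognized exactly?). A toy illustration with invented interaction outcomes, showing how the former can exceed the latter:

```python
# Each tuple: (recognized exactly?, system behaviour correct anyway?).
# A misrecognized utterance can still trigger the right system response.
interactions = [
    (True, True), (False, True), (True, True), (False, False),
    (False, True), (True, True), (False, True), (True, True),
]

recognition_acc = sum(r for r, _ in interactions) / len(interactions)
functional_acc = sum(f for _, f in interactions) / len(interactions)
print(f"per-utterance recognition accuracy: {recognition_acc:.0%}")
print(f"functional accuracy:                {functional_acc:.0%}")
```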
EM yamada@crl.mei.co.jp CR JAVKIN H, 1993, P SPEECH LANG TECHN, P137 McGarr N S, 1986, J Rehabil Res Dev, V23, P101 NICKERSON RS, 1976, J SPEECH HEAR DISORD, V41, P120 NICKERSON RS, 1983, SPEECH HEARING IMPAI, P313 NICKERSO.RS, 1973, IEEE T ACOUST SPEECH, VAU21, P445, DOI 10.1109/TAU.1973.1162508 Samuel G., 1992, ARTICULATION PHYSL A YOUDELMAN K, 1988, VOLTA REV, V90, P197 YOUDELMAN K, 1991, P INT S SPEECH HEAR, P1 YOUDELMAN K, 1989, VOLTA REV, V91, P197 NR 9 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 2000 VL 30 IS 2-3 BP 179 EP 187 DI 10.1016/S0167-6393(99)00039-4 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 286NU UT WOS:000085452700009 ER PT J AU Jia, WH Chan, WY AF Jia, WH Chan, WY TI An experimental assessment of personal speech coding SO SPEECH COMMUNICATION LA English DT Article DE speech coding; linear prediction based analysis-by-synthesis; spectral quantization; vector quantization; speaker dependency; personal speech communication; personal communications system ID INDIVIDUALITY; CONVERSION AB In the speech coders in common use today, all quantizer codebooks are designed to suit the statistical and perceptual characteristics of speech signals of a population of speakers. However, an individual's speech signal does not exhibit, even over a long time, the entire range of characteristics of the population. With the advent of personal communication systems, personal information might become available and hence may be exploited to improve the rate-distortion performance of speech coders. In this paper, we experimentally assess the potential gain of personal speech coding by designing codebooks for individual speakers. Experiments are performed using linear prediction based analysis-by-synthesis (LPAS) speech coders. The spectral envelope, excitation, and pitch lag quantization codebooks of LPAS coders are redesigned. The gains appear to be modest, suggesting the need to use a coding framework that models personal characteristics more explicitly, e.g. by using prosodic features. Amongst the coder components, the spectral quantizer is found to be most amenable to personalization. (C) 2000 Elsevier Science B.V. All rights reserved. C1 IIT, Dept Elect & Comp Engn, Chicago, IL 60616 USA. RP Chan, WY (reprint author), IIT, Dept Elect & Comp Engn, 3301 S Dearborn, Chicago, IL 60616 USA. EM chan@ece.iit.edu CR Campbell J. P. Jr., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90106-U CHEN JH, 1990, INT CONF ACOUST SPEE, P453, DOI 10.1109/ICASSP.1990.115747 DAVIDSON G, 1987, IEEE INT C AC SPEECH, P2189 Dorsey E., 1981, ICASSP 81. Proceedings of the 1981 IEEE International Conference on Acoustics, Speech and Signal Processing *ETSI, 1996, EUR TEL STAND GSM 06 FURUI S, 1986, SPEECH COMMUN, V5, P183, DOI 10.1016/0167-6393(86)90007-5 GERSHO A, 1994, P IEEE, V82, P900, DOI 10.1109/5.286194 GOLDSTEIN UG, 1976, J ACOUST SOC AM, V59, P176, DOI 10.1121/1.380837 HAGEN R, 1995, INT CONF ACOUST SPEE, P748, DOI 10.1109/ICASSP.1995.479802 KUWABARA H, 1995, SPEECH COMMUN, V16, P165, DOI 10.1016/0167-6393(94)00053-D Paliwal KK, 1993, IEEE T SPEECH AUDI P, V1, P3, DOI 10.1109/89.221363 Stylianou Y, 1998, IEEE T SPEECH AUDI P, V6, P131, DOI 10.1109/89.661472 NR 12 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
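Personalising a quantizer codebook, as assessed in the Jia and Chan record above, amounts to re-training the codebook on one speaker's vectors rather than a population's. Below, plain k-means stands in for LBG codebook design, on synthetic "spectral" vectors; all data and sizes are invented.

```python
import numpy as np

def train_codebook(vectors, size=8, n_iter=20, seed=0):
    """Plain k-means as a stand-in for LBG codebook training."""
    rng = np.random.default_rng(seed)
    code = vectors[rng.choice(len(vectors), size, replace=False)]
    for _ in range(n_iter):
        # Assign each vector to its nearest codeword
        d = ((vectors[:, None, :] - code[None, :, :]) ** 2).sum(-1)
        nearest = d.argmin(1)
        # Move each codeword to the centroid of its cell
        for k in range(size):
            cell = vectors[nearest == k]
            if len(cell):
                code[k] = cell.mean(0)
    return code

def distortion(vectors, code):
    d = ((vectors[:, None, :] - code[None, :, :]) ** 2).sum(-1)
    return d.min(1).mean()

rng = np.random.default_rng(3)
population = rng.normal(0.0, 1.0, size=(2000, 10))  # many speakers
speaker = rng.normal(0.5, 0.4, size=(300, 10))      # one speaker, narrower range

generic = train_codebook(population)
personal = train_codebook(speaker)
print("generic codebook distortion on speaker: ", round(distortion(speaker, generic), 3))
print("personal codebook distortion on speaker:", round(distortion(speaker, personal), 3))
```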
PD JAN PY 2000 VL 30 IS 1 BP 1 EP 8 DI 10.1016/S0167-6393(99)00029-1 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 275AJ UT WOS:000084797000001 ER PT J AU Lavner, Y Gath, I Rosenhouse, J AF Lavner, Y Gath, I Rosenhouse, J TI The effects of acoustic modifications on the identification of familiar voices speaking isolated vowels SO SPEECH COMMUNICATION LA English DT Article DE familiar voices; speaker identification; acoustic features; analysis/synthesis ID GLOTTAL WAVE; PARAMETERS; FEMALE; RECOGNITION; PERCEPTION; PATTERNS AB The aim of the present study was to examine the relative importance of various acoustic features as cues to familiar speaker identification for the vowel /a/. To this aim, a group of 20 speakers was recorded. The speakers' voices were modified using an analysis-synthesis system, which enabled analysis and modification of the glottal waveform, of the fundamental frequency, and of the vocal tract formants. Thirty listeners, very familiar with the speakers' voices, had to identify the speakers in an open-set, multiple-choice experiment. The results suggest that on average, the contribution of the vocal tract features to the identification process is more important than that of the glottal source features. The exact shape of the glottal waveform was found to be of minor importance. Examination of individual speakers reveals that changes of identical features affect the identification rate of various speakers differently. This finding suggests that for each speaker a different group of acoustic features serves as the cue to the vocal identity. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Technion Israel Inst Technol, Dept Biomed Engn, IL-32000 Haifa, Israel. Technion Israel Inst Technol, Dept Gen Studies, IL-32000 Haifa, Israel. RP Gath, I (reprint author), Technion Israel Inst Technol, Dept Biomed Engn, IL-32000 Haifa, Israel. EM isak@biomed.technion.ac.il CR ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R CARRELL TD, 1984, 5 IND U SPEECH RES L CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 CUMMINGS KE, 1995, J ACOUST SOC AM, V98, P88, DOI 10.1121/1.413664 Deller J. R., 1993, DISCRETE TIME PROCES Fant G., 1960, ACOUSTIC THEORY SPEE HEDELIN P, 1986, P IEEE ICASSP86 TOKY, P465 HEDELIN P, 1984, P ICASSP 84 Itoh K., 1992, SPEECH SCI TECHNOLOG, P133 KARLSSON I, 1985, 1 ROYAL I TECHN SPEE, P31 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 KUWABARA H, 1991, SPEECH COMMUN, V10, P491, DOI 10.1016/0167-6393(91)90052-U Ladefoged P., 1980, UCLA WORKING PAPERS, V49, P43 LINDEN J, 1994, P EUSIPCO 94, P4 MATSUMOT.H, 1973, IEEE T ACOUST SPEECH, VAU21, P428, DOI 10.1109/TAU.1973.1162507 MILLER JE, 1964, J ACOUST SOC AM, V36, P2002, DOI 10.1121/1.1939278 MONSEN RB, 1977, J ACOUST SOC AM, V62, P981, DOI 10.1121/1.381593 PAPCUN G, 1989, J ACOUST SOC AM, V85, P913, DOI 10.1121/1.397564 PRICE PJ, 1989, SPEECH COMMUN, V8, P261, DOI 10.1016/0167-6393(89)90005-8 Proakis J.
G., 1992, DIGITAL SIGNAL PROCE ROSENBER.AE, 1971, J ACOUST SOC AM, V49, P583, DOI 10.1121/1.1912389 SCHMIDTNIELSEN A, 1985, J ACOUST SOC AM, V77, P658, DOI 10.1121/1.391884 SONDHI MM, 1975, J ACOUST SOC AM, V57, P228, DOI 10.1121/1.380429 Sundberg J, 1979, FRONTIERS SPEECH COM, P301 VANLANCKER D, 1985, J PHONETICS, V13, P19 VANLANCKER D, 1985, J PHONETICS, V13, P39 WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260 NR 27 TC 34 Z9 35 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2000 VL 30 IS 1 BP 9 EP 26 DI 10.1016/S0167-6393(99)00028-X PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 275AJ UT WOS:000084797000002 ER PT J AU Lin, MT Spanias, A Loizou, P AF Lin, MT Spanias, A Loizou, P TI An improved approach to robust speech recognition using minimum error classification SO SPEECH COMMUNICATION LA English DT Article DE minimum error classification (MEC); hidden Markov model (HMM) AB An effective way of applying minimum error classification (MEC) to improve robustness in speech recognition is presented in this paper. In contrast to the traditional maximum likelihood (ML) training procedure that attempts to maximize the a priori probability of generating the training data set, MEC training attempts to minimize a function of the recognition error on the given training data set. In the MEC training procedure, the N-best algorithm is used to maximize the separation between the correct and competing models over confusable training tokens. The main focus of this paper is to investigate the effectiveness of MEC training when combined with four existing speech recognition algorithms under noisy and telephone mismatched environments. These algorithms are the weighted projection measure (WPM), the minimax approach (MA), the cepstral mean subtraction (CMS) method and the stochastic matching algorithms (SMAs). Experiments were performed using the Texas Instruments isolated digits database and the E-set words from the OGI Spelled and Spoken Telephone Corpus. The average word error rate reduction due to MEC training was 22.5% for isolated digit recognition and 8% for E-set word recognition. (C) 2000 Published by Elsevier Science B.V. All rights reserved. C1 Arizona State Univ, Dept Elect Engn, Tempe, AZ 85287 USA. Univ Arkansas, Dept Appl Sci, Little Rock, AR 72204 USA. RP Spanias, A (reprint author), Arizona State Univ, Dept Elect Engn, Tempe, AZ 85287 USA. EM spanias@asu.edu CR BEATTIE L, 1991, P IEEE INT C AC SPEE, P917 CARLSON BA, 1992, P ICASSP, V1, P237 CHANG PC, 1991, P IEEE INT C AC SPEE, P549 CHOU W, 1993, P IEEE INT C AC SPEE, V2, P652 CHOU W, 1992, P IEEE INT C AC SPEE, P473, DOI 10.1109/ICASSP.1992.225869 COLE R, 1995, IEEE T SPEECH AUDI P, V3, P1, DOI 10.1109/89.365385 COLE RA, 1992, P INT C SPOK LANG PR, P891 Juang B.
H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 Liu Chi-Shi, 1994, P 1994 IEEE INT C AC, V1, P325 Loizou PC, 1996, IEEE T SPEECH AUDI P, V4, P430, DOI 10.1109/89.544528 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P1659, DOI 10.1109/29.46548 Merhav N, 1993, IEEE T SPEECH AUDI P, V1, P90, DOI 10.1109/89.221371 MOON SY, 1995, P IEEE INT C AC SPEE, P145 OHKURA K, 1993, P IEEE INT C AC SPEE, V2, P75 RAHIM MG, 1994, P ICASSP, V1, P445 SANKAR A, 1995, P IEEE INT C AC SPEE, P121 SPANIAS AS, 1992, IEICE T FUND ELECTR, VE75A, P132 NR 18 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2000 VL 30 IS 1 BP 27 EP 36 DI 10.1016/S0167-6393(99)00027-8 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 275AJ UT WOS:000084797000003 ER PT J AU Demuynck, K Duchateau, J Van Compernolle, D Wambacq, P AF Demuynck, K Duchateau, J Van Compernolle, D Wambacq, P TI An efficient search space representation for large vocabulary continuous speech recognition SO SPEECH COMMUNICATION LA English DT Article DE continuous speech recognition; large vocabulary speech recognition; search algorithms; context dependent acoustic modelling AB In pursuance of better performance, current speech recognition systems tend to use more and more complicated models for both the acoustic and the language component. Cross-word context dependent (CD) phone models and long-span statistical language models (LMs) are now widely used. In this paper, we present a memory-efficient search topology that enables the use of such detailed acoustic and language models in a one pass time-synchronous recognition system. Characteristic of our approach is (1) the decoupling of the two basic knowledge sources, namely pronunciation information and LM information, and (2) the representation of pronunciation information - the lexicon in terms of CD units - by means of a compact static network. The LM information is incorporated into the search at run-time by means of a slightly modified token-passing algorithm. The decoupling of the LM and lexicon allows great flexibility in the choice of LMs, while the static lexicon representation avoids the cost of dynamic tree expansion and facilitates the integration of additional pronunciation information such as assimilation rules. Moreover, the network representation results in a compact structure when words have various pronunciations, and due to its construction, it offers partial LM forwarding at no extra cost. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Katholieke Univ Leuven, Dept Elect Engn, ESAT, PSI, B-3001 Heverlee, Belgium. RP Demuynck, K (reprint author), Katholieke Univ Leuven, Dept Elect Engn, ESAT, PSI, Kardinaal Mercierlaan 94, B-3001 Heverlee, Belgium.
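The decoupled search described above applies language model scores at run time as tokens cross word-end arcs of a static lexicon network. A heavily simplified, invented token-passing step follows; real decoders handle beams, LM histories, and tree-structured networks, none of which are modelled here.

```python
import math

# Invented toy lexicon network: each word collapses to one word-end arc
# carrying an accumulated acoustic score.
acoustic = {"cat": -4.1, "cap": -4.3, "can": -5.0}
# Invented bigram probabilities P(word | history).
bigram = {("the", "cat"): 0.5, ("the", "cap"): 0.2, ("the", "can"): 0.3}

def extend_token(history, token_score):
    """Pass a token over each word-end arc, adding the LM score there."""
    best = None
    for word, ac in acoustic.items():
        lm = math.log(bigram.get((history, word), 1e-6))  # LM applied at run time
        score = token_score + ac + lm
        if best is None or score > best[1]:
            best = (word, score)
    return best

print(extend_token("the", 0.0))  # -> best (word, combined score) pair
```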
EM kris.demuynck@esat.kuleuven.ac.be CR ANTONIOL G, 1992, P ICSPAT CAMBR MA, P1005 AUSTIN S, 1989, P ICASSP GLASG SCOTL, V1, P667 AUSTIN S, 1990, DARPA SPEECH NAT LAN BAHL L, 1989, P EUROSPEECH PAR FRA, V1, P156 BEYERLEIN P, 1997, P EUROSPEECH RHOD GR, V3, P1163 Demuynck K., 1998, P INT C SPOK LANG PR, VVII, P2907 DEMUYNCK K, 1997, P EUROSPEECH RHOD GR, V1, P143 DEMUYNCK K, 1996, P ICSLP, V4, P2289, DOI 10.1109/ICSLP.1996.607264 Duchateau J, 1998, SPEECH COMMUN, V24, P5, DOI 10.1016/S0167-6393(98)00002-8 FEDERICO M, 1994, COMPUTER SPEECH LANG, V9, P353 HANAZAWA K, 1997, P ICASSP MUN GERM, V3, P1787 MOHRI M, 1997, P EUROSPEECH RHOD GR, V1, P131 MOHRI M, 1998, P ICASSP SEATTL WA, V2, P665, DOI 10.1109/ICASSP.1998.675352 Murveit H., 1993, P IEEE INT C AC SPEE, V2, P319 Ney H., 1992, P IEEE INT C AC SPEE, V1, P9 Odell J., 1995, THESIS U CAMBRIDGE U OERDER M, 1993, P IEEE INT C AC SPEE, V2, P119 Placeway P., 1993, P INT C AC SPEECH SI, P33 RILEY M, 1997, P EUR C SPEECH COMM, V3, P1427 SCHWARTZ R, 1992, P ICASSP SAN FRANC C, V1, P1 Schwartz R., 1990, P IEEE INT C AC SPEE, P81 Steinbiss V., 1994, P INT C SPOK LANG PR, V4, P2143 Van Gerven S., 1997, P EUROSPEECH, V3, P1095 Young S., 1989, CUEDFINFENGTR38 NR 24 TC 22 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2000 VL 30 IS 1 BP 37 EP 53 DI 10.1016/S0167-6393(99)00030-8 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 275AJ UT WOS:000084797000004 ER PT J AU Sorokin, VN Leonov, AS Trushkin, AV AF Sorokin, VN Leonov, AS Trushkin, AV TI Estimation of stability and accuracy of inverse problem solution for the vocal tract SO SPEECH COMMUNICATION LA English DT Article DE speech; inverse problem; solution accuracy; calibrating curves ID SPEECH-PERCEPTION; MODEL; AREA; RECOVERY; VOWELS; SHAPE; MOVEMENTS; FORMANTS; SIGNAL AB The inverse problem for the vocal tract is under consideration from the viewpoint of the ill-posed problem theory. The proposed approach, which permits overcoming the difficulties related to ambiguity and instability, is based on the variational regularization with constraints. The work of articulators is used as a functional of regularization and a criterion of optimality for finding an approximate solution. The measured acoustical parameters of the speech signal serve as external constraints while the geometry of the vocal tract, the mechanics of the articulation, and the phonetic properties of the language play the role of internal constraints. An effective numerical implementation of the proposed approach is based on a local piecewise linear approximation of the articulatory-to-acoustics mapping and a polynomial approximation of the discrepancy measure. A heuristic method named the "calibrating curves method" is applied for estimating the accuracy of the obtained approximate solution. It was shown that in some cases the error of the inverse problem solution is weakly dependent on the errors of formant frequency measurements. The vocal tract shapes obtained by virtue of the proposed approach are very close to those measured in X-ray experiments. (C) 2000 Elsevier Science B.V. All rights reserved. C1 Russian Acad Sci, Inst Informat Transmiss Problems, Moscow 101447, Russia. Moscow Engn Phys Inst, Moscow 115409, Russia. RP Sorokin, VN (reprint author), Russian Acad Sci, Inst Informat Transmiss Problems, Bolshoy Karetny 19,GSP-4, Moscow 101447, Russia.
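Generically, the variational scheme described above minimises a data-fit term plus a regularising functional (here, the work of articulators). The sketch below is a hypothetical stand-in that minimises ||F(x) - y||^2 + alpha*Omega(x) for a toy forward map; scipy's general-purpose optimiser and the quadratic Omega are illustrative choices, not the paper's method.

```python
import numpy as np
from scipy.optimize import minimize

# Toy forward map: "articulatory" parameters x -> "formant" vector F(x)
A = np.array([[1.0, 0.4], [0.2, 1.0], [0.5, 0.5]])
def forward(x):
    return A @ x + 0.1 * np.sin(x).sum()

y_measured = np.array([1.2, 0.9, 1.0])  # invented formant measurements
alpha = 0.1                             # regularisation weight (illustrative)

def objective(x):
    misfit = forward(x) - y_measured
    # Stand-in for the "work of articulators": squared parameter norm
    return misfit @ misfit + alpha * (x @ x)

sol = minimize(objective, np.zeros(2), method="BFGS")
print("regularised inverse solution:", np.round(sol.x, 3))
```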
EM vns@iitp.ru CR ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 ATAL BS, 1970, J ACOUST SOC AM, V47, P65 BADIN P, 1995, J PHONETICS, V23, P221, DOI 10.1016/S0095-4470(95)80044-1 Bakushinsky AB, 1994, ILL POSED PROBLEMS T BAVEGARD M, 1995, STL QPSR, V4, P55 BERNSTEIN MA, 1967, COORDINATION REGULAT BOE LJ, 1992, J PHONETICS, V20, P27 BORG G, 1946, ACTA MATH-DJURSHOLM, V78, P1, DOI 10.1007/BF02421600 BOUISSET S, 1977, ELECTROEN CLIN NEURO, V42, P543, DOI 10.1016/0013-4694(77)90218-8 BRILLOUIN L, 1956, SCI INFORMATION THEO CHARPENTIER F, 1984, SPEECH COMMUN, V3, P291, DOI 10.1016/0167-6393(84)90025-6 COKER CH, 1976, P IEEE, V64, P452, DOI 10.1109/PROC.1976.10154 Fant G., 1960, ACOUSTIC THEORY SPEE FLANAGAN JL, 1980, J ACOUST SOC AM, V68, P780, DOI 10.1121/1.384817 FOWLER CA, 1986, J PHONETICS, V14, P3 GARDING L, 1977, ARK MAT, V15, P63, DOI 10.1007/BF02386033 GOPINATH B, 1970, AT&T TECH J, V49, P1195 HATZE H, 1980, IEEE T AUTOMAT CONTR, V25, P375, DOI 10.1109/TAC.1980.1102380 HEINZ JM, 1965, P 5 INT C AC LIEG, pA44 Hogden J, 1996, J ACOUST SOC AM, V100, P1819, DOI 10.1121/1.416001 JORDAN MI, 1992, COGNITIVE SCI, V16, P307, DOI 10.1207/s15516709cog1603_1 KAWATO M, 1987, BIOL CYBERN, V57, P169, DOI 10.1007/BF00364149 KIRITANI S, 1978, ANN B RES I LOGOPEDI, V12, P1 KOBAYASHI T, 1991, P ICASSP 91 TOR, P489, DOI 10.1109/ICASSP.1991.150383 LADEFOGED P, 1978, J ACOUST SOC AM, V64, P1027, DOI 10.1121/1.382086 LARAR JN, 1988, IEEE T ACOUST SPEECH, V36, P1812, DOI 10.1109/29.9026 LEONOV AS, 1995, MOSCOW U PHYSICS B, V50, P25 Levinson N., 1949, MAT TIDSSKR B, P25 LEVINSON SE, 1983, J ACOUST SOC AM, V74, P1145, DOI 10.1121/1.390038 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279 MAEDA S, 1979, SPEECH COMMUN, P67 Markel JD, 1976, LINEAR PREDICTION SP McGowan RS, 1996, J ACOUST SOC AM, V99, P595, DOI 10.1121/1.415220 MCGOWAN RS, 1994, SPEECH COMMUN, V14, P19, DOI 10.1016/0167-6393(94)90055-8 MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427 MERMELST.P, 1967, J ACOUST SOC AM, V41, P1283, DOI 10.1121/1.1910470 MOLLER JW, 1976, J ACOUST SOC AM, V60, pS77 NAKAJIMA T, 1977, DYNAMIC ASPECTS SPEE, P251 NELSON WL, 1983, BIOL CYBERN, V46, P135, DOI 10.1007/BF00339982 PAIGE A, 1970, IEEE T ACOUST SPEECH, VAU18, P7, DOI 10.1109/TAU.1970.1162074 PERRIER P, 1992, J SPEECH HEAR RES, V35, P53 RAHIM MG, 1990, SPEECH COMMUN, V9, P49, DOI 10.1016/0167-6393(90)90045-B RAHIM MG, 1993, J ACOUST SOC AM, V93, P1109, DOI 10.1121/1.405559 SALTZMAN L, 1989, ECOL PSYCHOL, V14, P333 Schoentgen J, 1997, SPEECH COMMUN, V21, P227, DOI 10.1016/S0167-6393(97)00007-1 SCHROEDE.MR, 1967, J ACOUST SOC AM, V41, P1002, DOI 10.1121/1.1910429 SCHROETER J, 1990, P INT C AC SPEECH SI, P393 Schroeter J., 1992, ADV SPEECH SIGNAL PR, P231 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 SHIRAI K, 1976, ELECT COMM JAPAN A, V59, P35 SHIRAI K, 1986, SPEECH COMMUN, V5, P159, DOI 10.1016/0167-6393(86)90005-1 SHIRAI K, 1977, DYNAMIC ASPECTS SPEE, P279 SHIRAI K, 1983, COMPUTER ANAL PERCEP, V2, P101 SONDHI MM, 1983, J ACOUST SOC AM, V73, P985, DOI 10.1121/1.389024 SONDHI MM, 1987, IEEE T ACOUST SPEECH, V35, P955 Sorokin V. N., 1992, SPEECH SYNTHESIS Sorokin V. 
N., 1985, THEORY SPEECH PRODUC SOROKIN VN, 1994, SPEECH COMMUN, V14, P249, DOI 10.1016/0167-6393(94)90065-5 SOROKIN VN, 1987, P 11 INT C PHON SCI, V3, P382 Sorokin VN, 1996, SPEECH COMMUN, V19, P105, DOI 10.1016/0167-6393(96)00028-3 SOROKIN VN, 1996, P 1 ESCA TUT RES WOR, P129 SOROKIN VN, 1992, SPEECH COMMUN, V11, P71, DOI 10.1016/0167-6393(92)90064-E SUNDBERG J, 1969, STL QPSR N, V1, P43 SUNDBERG J, 1987, PHONETICA, V44, P76 Tikhonov A. N., 1965, USSR COMP MATH MATH, V5, P93, DOI 10.1016/0041-5553(65)90150-3 Tikhonov A. N., 1977, SOLUTION ILL POSED P TIKHONOV AN, 1963, DOKL AKAD NAUK SSSR+, V153, P49 Tikhonov AN, 1998, NONLINEAR ILL POSED VIVIANI P, 1985, J EXP PSYCHOL HUMAN, V11, P828, DOI 10.1037//0096-1523.11.6.828 WAKITA H, 1975, IEEE T ACOUST SPEECH, V23, P574, DOI 10.1109/TASSP.1975.1162733 WAKITA H, 1973, IEEE T ACOUST SPEECH, VAU21, P417, DOI 10.1109/TAU.1973.1162506 WOOD S, 1979, J PHONETICS, V7, P25 Yehia H, 1996, SPEECH COMMUN, V18, P151, DOI 10.1016/0167-6393(95)00042-9 NR 74 TC 13 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 2000 VL 30 IS 1 BP 55 EP 74 DI 10.1016/S0167-6393(99)00031-X PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 275AJ UT WOS:000084797000005 ER PT J AU Strik, H AF Strik, H TI Special issue on modeling pronunciation variation for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Univ Nijmegen, Dept Language & Speech, A2 RT, NL-6500 HD Nijmegen, Netherlands. RP Strik, H (reprint author), Univ Nijmegen, Dept Language & Speech, A2 RT, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM strik@let.kun.nl NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1999 VL 29 IS 2-4 BP 81 EP 82 DI 10.1016/S0167-6393(99)00049-7 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 261WF UT WOS:000084032700001 ER PT J AU Adda-Decker, M Lamel, L AF Adda-Decker, M Lamel, L TI Pronunciation variants across system configuration, language and speaking style SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition CY MAY 04-06, 1998 CL KERKRADE, NETHERLANDS SP ESCA DE automatic speech recognition; lexical modeling; pronunciation variants; acoustic modeling AB This contribution aims at evaluating the use of pronunciation variants for different recognition system configurations, languages and speaking styles. This study is limited to the use of variants during speech alignment, given an orthographic transcription of the utterance and a phonemically represented lexicon and is thus focused on the modeling capabilities of the acoustic word models. To measure the need for variants we have defined the variant2+ rate which is the percentage of words in the corpus not aligned with the most common phonemic transcription. This measure may be indicative of the possible need for pronunciation variants in the recognition system. Pronunciation lexica have been automatically created so as to include a large number of variants (overgeneration). In particular, lexica with parallel and sequential variants were automatically generated in order to assess the spectral and temporal modeling accuracy. 
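The variant2+ rate just defined is straightforward to compute from alignment output: the share of word tokens whose aligned transcription is not that word's most common one. A small sketch over an invented alignment list (the abstract continues after it):

```python
from collections import Counter, defaultdict

# Invented alignment output: (word, aligned phonemic transcription)
alignments = [
    ("and", "ae n d"), ("and", "ax n"), ("and", "ax n"), ("and", "ae n d"),
    ("and", "ax n"), ("the", "dh ax"), ("the", "dh ax"), ("the", "dh iy"),
]

by_word = defaultdict(Counter)
for word, pron in alignments:
    by_word[word][pron] += 1

# Count tokens not aligned with their word's most common transcription
non_modal = sum(
    count
    for counts in by_word.values()
    for pron, count in counts.items()
    if pron != counts.most_common(1)[0][0]
)
print(f"variant2+ rate: {non_modal / len(alignments):.0%}")
```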
We first investigated the dependence of the aligned variants on the recognizer configuration. Then a cross-lingual study was carried out for read speech in French and American English using the BREF and the WSJ corpora. A comparison between read and spontaneous speech was made for French based on alignments of BREF (read) and MASK (spontaneous) data. Comparative alignment results using different acoustic model sets demonstrate the dependency between the acoustic model accuracy and the need for pronunciation variants. The alignment results obtained with the above lexica have been used to study the link between word frequencies and variants using different acoustic model sets. (C) 1999 Elsevier Science B.V. All rights reserved. C1 CNRS, LIMSI, Spoken Language Proc Grp, F-91403 Orsay, France. RP Adda-Decker, M (reprint author), CNRS, LIMSI, Spoken Language Proc Grp, BP 133, F-91403 Orsay, France. EM madda@limsi.fr; lamel@limsi.fr CR ADDA G, 1996, SYSTEME DICTEE LIMSI Adda G., 1997, P EUR C SPEECH COMM, V5, P2711 COHEN M, 1989, THESIS U CA BERKELEY COHEN PS, 1975, SPEECH RECOGNITION I FOSLER E, 1996, P INT C SPOK LANG PR, P28 GAUVAIN JL, 1997, HUMAN COMFORT SECURI JELINEK F, 1996, P DARPA SPEECH REC W, P148 Lamel L., 1991, P EUR 91 GEN, V2, P505 LAMEL LF, 1996, P ICSLP 96 OCT PHIL, V1, P6, DOI 10.1109/ICSLP.1996.606916 LAMEL LF, 1992, P FIN REV DARPA ANNT Lea W. A., 1980, TRENDS SPEECH RECOGN, P125 MIRGHAFORI N, 1995, P EUR 95 MADR SEPT, V1, P491 OSHIKA BT, 1975, IEEE T ACOUST SPEECH, VAS23, P104, DOI 10.1109/TASSP.1975.1162639 PAUL DB, 1992, P ICSLP 92 BANFF, V2, P899 Riley M., 1996, AUTOMATIC SPEECH SPE, P285 NR 15 TC 19 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1999 VL 29 IS 2-4 BP 83 EP 98 DI 10.1016/S0167-6393(99)00032-1 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 261WF UT WOS:000084032700002 ER PT J AU Bacchiani, M Ostendorf, M AF Bacchiani, M Ostendorf, M TI Joint lexicon, acoustic unit inventory and model design SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition CY MAY 04-06, 1998 CL KERKRADE, NETHERLANDS SP ESCA DE lexicon design; acoustic model clustering; pronunciation modeling AB Although most parameters in a speech recognition system are estimated from data by the use of an objective function, the unit inventory and lexicon are generally hand crafted and therefore unlikely to be optimal. This paper proposes a joint solution to the related problems of learning a unit inventory and corresponding lexicon from data. On a speaker-independent read speech task with a 1k vocabulary, the proposed algorithm outperforms phone-based systems at both high and low complexities. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Boston Univ, Dept Elect & Comp Engn, Boston, MA 02215 USA. RP Bacchiani, M (reprint author), AT&T Labs Res, Shannon Lab, Rm B235,180 Pk Ave, Florham Pk, NJ 07932 USA. 
CR Bacchiani M., 1996, P INT C AC SPEECH SI, V1, P443 BACCHIANI M, 1999, THESIS BOST U BACCHIANI M, 1998, P INT C SPOK LANG PR, V5, P1719 Bahl LR, 1993, IEEE T SPEECH AUDI P, V1, P443, DOI 10.1109/89.242490 HOLTER T, 1997, P IEEE WORKSH AUT SP, P199 Holter T., 1998, P ESCA WORKSH MOD PR, P63 HOLTER T, 1997, P EUR C SPEECH COMM, P1159 Kannan A, 1998, IEEE T SPEECH AUDI P, V6, P303, DOI 10.1109/89.668825 LEE CH, 1989, P INT C AC SPEECH SI, V1, P683 Ostendorf M, 1997, COMPUT SPEECH LANG, V11, P17, DOI 10.1006/csla.1996.0021 PALIWAL KK, 1990, P INT C AC SPEECH SI, V2, P729 Price P., 1988, P IEEE INT C AC SPEE, V1, P651 Svendsen T., 1989, P INT C AC SPEECH SI, V1, P108 SVENDSEN T, 1987, P INT C AC SPEECH SI, V1, P77 SVENDSEN T, 1995, P EUR C SPEECH COMM, V1, P783 TAKAMI J, 1992, P INT C AC SPEECH SI, V1, P573 WOODLAND PC, 1993, P EUR C SPEECH COMM, V3, P2207 Young S. J., 1993, P EUR C SPEECH COMM, V3, P2203 NR 18 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1999 VL 29 IS 2-4 BP 99 EP 114 DI 10.1016/S0167-6393(99)00033-3 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 261WF UT WOS:000084032700003 ER PT J AU Cremelie, N Martens, JP AF Cremelie, N Martens, JP TI In search of better pronunciation models for speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition CY MAY 04-06, 1998 CL KERKRADE, NETHERLANDS SP ESCA DE pronunciation modeling; pronunciation variants; pronunciation rules; data-driven approach; rule learning ID PROBABILITY AB The lexicon of a speech recognizer is supposed to contain pronunciation models describing how words can be realized as sequences of subword units (usually phonemes). In this contribution we present a method for upgrading initially simple pronunciation models to new models that can explain several pronunciation variants of each word. Since the presented strategy is capable of producing pronunciation variants and cross-word dependencies completely automatically, it is an attractive alternative to the manual encoding of multiple pronunciations in the lexicon. The method learns pronunciation rules from orthographically transcribed speech utterances, and subsequently applies these rules to generate common pronunciation variants. All variants of one word are then compiled into a compact pronunciation model. The obtained models are properly integrated in the speech recognizer, where they replace the formerly used simple models. By learning pronunciation rules rather than pronunciation variants from the data, one can combine the advantages of data-driven and rule-based approaches. Important properties of the proposed methodology are that it incorporates dependencies between the rules from the very beginning (during the training), that it supports exception rules not producing pronunciation variants but affecting the production of such variants by other rules (called production rules), and that it has a sound probabilistic basis for the attachment of likelihoods to the word pronunciation variants. Experiments showed that the introduction of such variants in a segment-based recognizer significantly improves the recognition accuracy: on TIMIT a relative word error rate reduction of as high as 17% was obtained. (C) 1999 Elsevier Science B.V. All rights reserved. 
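Rule learning of the kind described in the Cremelie and Martens record above ends by applying learned rewrite rules, each with an attached probability, to a canonical transcription in order to enumerate weighted variants. A minimal, invented sketch with two toy rules; this is not the authors' rule formalism, and the rules and probabilities are made up.

```python
# Invented rules: (focus phone, required left context, replacement, probability)
rules = [
    ("t", "n", "", 0.4),   # /t/ after /n/ may delete
    ("ax", "", "", 0.2),   # schwa may delete anywhere
]

def variants(phones, rules):
    """Enumerate pronunciation variants with attached likelihoods."""
    results = [(tuple(phones), 1.0)]
    for focus, left, repl, p in rules:
        expanded = []
        for seq, prob in results:
            expanded.append((seq, prob * (1.0 - p)))  # rule not applied
            for i, ph in enumerate(seq):
                if ph == focus and (not left or (i > 0 and seq[i - 1] == left)):
                    new = seq[:i] + ((repl,) if repl else ()) + seq[i + 1:]
                    expanded.append((new, prob * p))  # rule applied at site i
        results = expanded
    return results

for seq, prob in sorted(variants(("w", "ax", "n", "t"), rules), key=lambda v: -v[1]):
    print(" ".join(seq), round(prob, 3))
```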
C1 Univ Ghent, ELIS, B-9000 Ghent, Belgium. RP Cremelie, N (reprint author), Univ Ghent, ELIS, Sint Pietersnieuwstr 41, B-9000 Ghent, Belgium. EM cremelie@elis.rug.ac.be CR AUBERT X, 1995, P EUROSPEECH, P767 CREMELIE N, 1994, P ICSLP 94, P275 CREMELIE N, 1997, P EUROSPEECH 97, P2459 FERREIROS J, 1998, P ESCA WORKSH MOD PR, P29 Fisher W.M., 1986, P DARPA SPEECH REC W FUKADA T, 1997, P EUROSPEECH 97, P2471 GLASS J, 1996, P ICSLP, V4, P2277, DOI 10.1109/ICSLP.1996.607261 GOLDENTHAL W, 1994, THESIS MIT HOLTER T, 1997, THESIS NORW U SCI TE Kipp A., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607048 Kipp Andreas, 1997, P EUR 97, P1023 Lamel L. F., 1986, P DARPA SPEECH REC W, P100 LAMEL LF, 1993, P 3 EUR C SPEECH COM, V1, P121 LEE KF, 1989, IEEE T ACOUST SPEECH, V37 MARI JF, 1996, P IEEE INT C AC SPEE, V1, P435 MERCER RL, 1987, IBM J RES DEV, V31, P81 Ravishankar M, 1997, P EUR C SPEECH COMM, P2467 Riley M. D., 1991, P INT C AC SPEECH SI, P737, DOI 10.1109/ICASSP.1991.150446 ROBINSON AJ, 1994, IEEE T NEURAL NETWOR, V5, P298, DOI 10.1109/72.279192 TAJCHMAN G, 1995, P EUR C SPEECH COMM, P2247 TORRE D, 1997, P ICASSP, P1463 VERHASSELT J, 1997, P IEEE INT C AC SPEE, P1407 VERHASSELT J, 1998, P ICASSP 98, V1, P501, DOI 10.1109/ICASSP.1998.674477 Verhasselt J, 1998, SPEECH COMMUN, V24, P51, DOI 10.1016/S0167-6393(97)00064-2 Wesenick M.-B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607053 WESTER M, 1998, P ESCA WORKSH MOD PR, P145 NR 26 TC 18 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1999 VL 29 IS 2-4 BP 115 EP 136 DI 10.1016/S0167-6393(99)00034-5 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 261WF UT WOS:000084032700004 ER PT J AU Fosler-Lussier, E Morgan, N AF Fosler-Lussier, E Morgan, N TI Effects of speaking rate and word frequency on pronunciations in conversational speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition CY MAY 04-06, 1998 CL KERKRADE, NETHERLANDS SP ESCA DE ASR pronunciation models; speaking rate; word predictability ID PERCEPTION AB Automatic speech recognition (ASR) systems typically have a static dictionary of word pronunciations for matching acoustic models to words. In this work, we argue that, in fact, pronunciations in spontaneous speech are dynamic and that ASR systems should change models in accordance with contextual factors. Two variables, speaking rate and word frequency, should be particularly promising for determining dynamic pronunciations, according to the linguistic literature. We analyze the relationship between these factors and realized pronunciations through a statistical exploration of the effects of these factors at the word, syllable, and phone levels in the Switchboard corpus. Both increased speaking rate and word likelihood can induce a significant shift in probabilities of the pronunciations of frequent words. However, the interplay between all of these variables in the realization of pronunciations is complex. We also confirm the intuition that variations in these factors correlate with changes in ASR system performance for both the Switchboard and Broadcast News corpora.
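One concrete way to probe the speaking-rate effect reported in the abstract above is to bin word tokens by local syllable rate and compare how often the canonical pronunciation survives in each bin. All counts below are invented for illustration:

```python
# Invented counts of (canonical, reduced) pronunciations of one frequent
# word, binned by local speaking rate in syllables per second.
bins = {
    "slow   (<4 syl/s)": (180, 70),
    "medium (4-6 syl/s)": (140, 110),
    "fast   (>6 syl/s)": (80, 170),
}

for label, (canonical, reduced) in bins.items():
    total = canonical + reduced
    print(f"{label}: canonical in {canonical / total:.0%} of {total} tokens")
```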
(C) 1999 Elsevier Science B.V. All rights reserved. C1 Int Comp Sci Inst, Berkeley, CA 94704 USA. Univ Calif Berkeley, Berkeley, CA 94720 USA. RP Fosler-Lussier, E (reprint author), Int Comp Sci Inst, 1947 Ctr St,Suite 600, Berkeley, CA 94704 USA. EM fosler@icsi.berkeley.edu CR BERNSTEIN J, 1992, DARPA SPEECH REC WOR, P41 BYBEE J, 1996, USAGE BASED MODELS L CHEN FR, 1990, INT CONF ACOUST SPEE, P753, DOI 10.1109/ICASSP.1990.115902 Chomsky N., 1968, SOUND PATTERN ENGLIS COOK G, 1999, DARPA BROADCAST NEWS COOK G, 1997, DARPA SPEECH REC WOR Finke M, 1997, 1997 IEEE WORKSHOP ON AUTOMATIC SPEECH RECOGNITION AND UNDERSTANDING, PROCEEDINGS, P34, DOI 10.1109/ASRU.1997.658974 FINKE M, 1997, EUR 97 FISHER W, 1996, DARPA SPEECH RECOGNI FISHER W, 1996, TSYLB2 PROGRAM ALGOR FOSLERLUISSIER E, 1999, DARPA BROADC NEWS WO GANONG WF, 1980, J EXP PSYCHOL HUMAN, V6, P110, DOI 10.1037/0096-1523.6.1.110 GAUVAIN JL, 1997, DARPA SPEECH REC WOR GREENBERG S, 1998, ESCA TUT RES WORKSH, P47 GREENBERG S, 1997, 1996 LVCSR SUMM RES JURAFSKY D, 1998, ICSLP 98 SYDN AUSTR KAHN D, 1980, SYLLABLE BASED GENER KINGSBURY BED, 1998, THESIS U CAL BERK CA KITAZAWA S, 1997, EUR 97 RHOD GREEC, P641 *LDC, 1996, PRONLEX PRON DICT MCALLASTER D, 1998, ICSLP 98 SYDN AUSTR, P1847 MILLER JL, 1981, J EXP PSYCHOL HUMAN, V7, P208, DOI 10.1037/0096-1523.7.1.208 MIRGHAFORI N, 1995, EUROSPEECH 95 MIRGHFORI N, 1996, ICASSP 96 ATLANTA GE, P1335 MORGAN N, 1998, IEEE ICASSP 98 SEATT MORGAN N, 1997, EUROSPEECH 97 *NIST, 1992, SWITCHB CORP REC TEL *NIST, 1996, BROADC NEWS SPEECH C OSTENDORF M, 1997, LVCSR SUMM RES WORKS, pCH4 PALLETT DS, 1994, ARPA SPOK LANG SYST Placeway P., 1997, DARPA SPEECH REC WOR RILEY M, 1998, ESCA TUT RES WORKSH, P109 RILEY MD, 1991, INT CONF ACOUST SPEE, P737, DOI 10.1109/ICASSP.1991.150446 SARACLAR M, 1997, CONV SPEECH REC WORK SIEGLER MA, 1995, IEEE ICASSP 95 SLOBODA T, 1996, ICSLP 96 SUMMERFIELD Q, 1981, J EXP PSYCHOL HUMAN, V7, P1074, DOI 10.1037/0096-1523.7.5.1074 TAJCHMAN G, 1995, EUROSPEECH 95 MADRID Verhasselt JP, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P2258 WEINTRAUB M, 1997, 1996 LVCSR SUMM RES, pCH3 WITHGOTT MM, 1993, COMPUTATIONAL MODELS YOUNG SJ, 1994, IEEE ICASSP 94, P307 NR 42 TC 47 Z9 48 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1999 VL 29 IS 2-4 BP 137 EP 158 DI 10.1016/S0167-6393(99)00035-7 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 261WF UT WOS:000084032700005 ER PT J AU Greenberg, S AF Greenberg, S TI Speaking in shorthand - A syllable-centric perspective for understanding pronunciation variation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition CY MAY 04-06, 1998 CL KERKRADE, NETHERLANDS SP ESCA DE automatic speech recognition; pronunciation variation; spoken language; syllables AB Current-generation automatic speech recognition (ASR) systems model spoken discourse as a quasi-linear sequence of words and phones. Because it is unusual for every phone within a word to be pronounced in a standard ("canonical") way, ASR systems often depend on a multi-pronunciation lexicon to match an acoustic sequence with a lexical unit. 
Since there are, in practice, many different ways for a word to be pronounced, this standard approach adds a layer of complexity and ambiguity to the decoding process which, if simplified, could potentially improve recognition performance. Systematic analysis of pronunciation variation in a corpus of spontaneous English discourse (Switchboard) demonstrates that the variation observed is more systematic at the level of the syllable than at the phonetic-segment level. Thus, syllabic onsets are realized in canonical form far more frequently than either coda or nuclear constituents. Prosodic prominence and lexical stress also appear to play an important role in pronunciation variation. The governing mechanism is likely to involve the informational valence associated with syllabic and lexical elements, and for this reason pronunciation variation offers a potential window onto the mechanisms responsible for the production and understanding of spoken language. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Int Comp Sci Inst, Berkeley, CA 94704 USA. RP Greenberg, S (reprint author), Int Comp Sci Inst, 1947 Ctr St, Berkeley, CA 94704 USA. EM steveng@icsi.berkeley.edu CR Arai T., 1997, P EUR RHOD GREEC, P1011 Bernstein B., 1974, CLASS CODES CONTROL BERNSTEIN J, 1992, P DARPA SPEECH REC W, P41 Byrne W, 1998, INT CONF ACOUST SPEE, P313, DOI 10.1109/ICASSP.1998.674430 BYRNE W, 1997, P IEEE WORKSH AUT SP, P26 Coleman John, 1992, PHONOLOGY, V9, P1, DOI 10.1017/S0952675700001482 Crystal D, 1995, CAMBRIDGE ENCY ENGLI Dewey Godfrey, 1923, RELATIVE FREQUENCY E Doyle A. C., 1892, ADVENTURES S HOLMES FOSLER E, 1996, P INT C SPOK LANG PR, pS28 FOSLERLUSSIER E, 1998, P INT C PHON SCI SAN, V29, P137 French NR, 1930, BELL SYST TECH J, V9, P290 GANAPATHIRAJU A, 1997, P IEEE AUT SPEECH RE, P207 GAUVAIN J, 1994, P IEEE INT C AC SPEE, P557 Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858 Goldinger S. D., 1996, PRINCIPLES EXPT PHON, P277 Greenberg S., 1996, P INT C SPOK LANG PR, pS32 GREENBERG S, 1997, SWITCHBOARD TRANSCRI GREENBERG S, 1999, UNPUB PHONETIC TRANS Greenberg S., 1998, P ESCA WORKSH MOD PR, P47 Greenberg S., 1997, P ESCA WORKSH ROB SP, P23 Greenberg S., 1997, ENCY ACOUSTICS, P1301 JESPERSEN O., 1922, LANGUAGE ITS NATURE Kahn D., 1980, SYLLABLE BASED GEN E Kenyon John S., 1953, PRONOUNCING DICT AM Kingsbury BED, 1998, SPEECH COMMUN, V25, P117, DOI 10.1016/S0167-6393(98)00032-6 Kirchoff K., 1999, THESIS U BIELEFELD KOHLER KJ, 1995, P 13 INT C PHON SC S, V2, P12 Kompe R, 1997, PROSODY SPEECH UNDER Labov William, 1972, SOCIOLINGUISTIC PATT Lehiste I., 1996, PRINCIPLES EXPT PHON, P226 Levelt W. J. M., 1989, SPEAKING LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 Lyovin Anatole V., 1997, INTRO LANGUAGES WORL MCALLASTER D, 1998, P DARPA WORKSH CONVE NIEMANN H, 1997, P IEEE INT C AC SPEE, P75 OSTENDORF M, 1997, MODELING SYSTEMATIC Rabiner L, 1993, FUNDAMENTALS SPEECH Riley M., 1998, P ETRW MOD PRON VAR, P109 Riley M, 1995, AUTOMATIC SPEECH SPE SCHIEL FA, 1998, P ESCA TUTORIAL RES, P131 SILIPO R, 1999, P INT C PHON SCI SAN VANKUIK D, 1999, COMMUNICATION, V27, P95 VANSON RJJH, 1998, P INT C SPOKEN LANGU, P2375 VANWIERINGEN A, 1995, THESIS U AMSTERDAM Waibel A., 1988, PROSODY SPEECH RECOG Weintraub M., 1996, P INT C SPOK LANG PR WEINTRAUB M, 1997, WS96 PROJECT REPORT Wu SL, 1998, INT CONF ACOUST SPEE, P721 Wu S.-L., 1998, P INT C SPOK LANG PR, P854 ZIPF G K, 1945, J Gen Psychol, V33, P251 Zue V. 
W., 1996, Recent research towards advanced man-machine interface through spoken language, DOI 10.1016/B978-044481607-8/50088-8 NR 53 TC 98 Z9 100 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1999 VL 29 IS 2-4 BP 159 EP 176 DI 10.1016/S0167-6393(99)00050-3 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 261WF UT WOS:000084032700006 ER PT J AU Holter, T Svendsen, T AF Holter, T Svendsen, T TI Maximum likelihood modelling of pronunciation variation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition CY MAY 04-06, 1998 CL KERKRADE, NETHERLANDS SP ESCA DE pronunciation modelling; baseform optimisation; maximum likelihood; automatic speech recognition AB This paper addresses the problem of generating lexical word representations that properly represent natural pronunciation variations for the purpose of improved speech recognition accuracy. In order to create a consistent framework for optimisation of automatic speech recognition systems, we present a maximum likelihood based algorithm for fully automatic data-driven modelling of pronunciation, given a set of subword hidden Markov models (HMMs) and acoustic tokens of a word. We also propose an extension of this formulation in order to achieve optimal modelling of pronunciation variations. Since different words will not in general exhibit the same amount of pronunciation variation, the procedure allows words to be represented by a different number of baseforms. The methods improve the subword description of the vocabulary words and have been shown to improve recognition performance on the DARPA Resource Management task. (C) 1999 Elsevier Science B.V. All rights reserved. C1 SINTEF Telecom & Informat, Dept Signal Proc & Syst Design, N-7465 Trondheim, Norway. Norwegian Univ Sci & Technol, Dept Telecommun, Trondheim, Norway. RP Holter, T (reprint author), SINTEF Telecom & Informat, Dept Signal Proc & Syst Design, OS Bragstads Plass 2, N-7465 Trondheim, Norway. EM trym.holter@informatics.sintef.no CR Asadi A., 1991, P INT C AC SPEECH SI, P305, DOI 10.1109/ICASSP.1991.150337 BACCHIANI M, 1998, P ESCA WORKSH MOD PR, P7 BAHL LR, 1993, IEEE T SPEECH AUDIO, V1, P442 Bahl L.R., 1991, P INT C AC SPEECH SI, P173, DOI 10.1109/ICASSP.1991.150305 COHEN MH, 1989, THESIS U CAL BERKELE Gillick L., 1989, P ICASSP, P532 HAEBUMBACH R, 1995, P ICASSP, P840 HOLTER T, 1997, P IEEE WORKSH AUT SP, P199 HOLTER T, 1997, P EUR C SPEECH COMM, P1159 HOLTER T, 1996, P IEEE REG 10 C DIG, P102 HOLTER T, 1997, THESIS NORWEGIAN U S KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 Lee C.-H., 1988, P ICASSP, P501 Lee C.-H., 1989, P INT C AC SPEECH SI, P683 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 LU SY, 1978, IEEE T SYST MAN CYB, V8, P381, DOI 10.1109/TSMC.1978.4309979 LUCASSEN J, 1984, P INT C AC SPEECH SI MOKBEL H, 1998, P ESCA WORKSH MOD PR, P73 Nilsson N.J., 1971, PROBLEM SOLVING METH *NIST, 1992, RES MAN CONT SPEECH PALIWAL KK, 1990, P INT C AC SPEECH SI, P729 Price P., 1988, P IEEE INT C AC SPEE, P651 Ramabhadran B, 1998, INT CONF ACOUST SPEE, P309, DOI 10.1109/ICASSP.1998.674429 Slobada T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607274 SLOBODA T, 1995, P INT C AC SPEECH SI, P453 SOONG FK, 1991, P INT C AC SPEECH SI, V1, P705 STRIK H, 1998, P ESCA WORKSH MOD PR, P137 Svendsen T., 1995, P EUROSPEECH, P783 Svendsen T., 1989, P INT C AC SPEECH SI, P108 Wilpon J. G., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) YOUNG SJ, 1993, HTD HIDDEN MARKOV MO Zhao YX, 1993, IEEE T SPEECH AUDI P, V1, P345, DOI 10.1109/89.232618 NR 32 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1999 VL 29 IS 2-4 BP 177 EP 191 DI 10.1016/S0167-6393(99)00036-9 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 261WF UT WOS:000084032700007 ER PT J AU Kessens, JM Wester, M Strik, H AF Kessens, JM Wester, M Strik, H TI Improving the performance of a Dutch CSR by modeling within-word and cross-word pronunciation variation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition CY MAY 04-06, 1998 CL KERKRADE, NETHERLANDS SP ESCA DE continuous speech recognition; modeling pronunciation variation; within-word variation; cross-word variation AB This article describes how the performance of a Dutch continuous speech recognizer was improved by modeling pronunciation variation. We propose a general procedure for modeling pronunciation variation. In short, it consists of adding pronunciation variants to the lexicon, retraining phone models and using language models to which the pronunciation variants have been added. First, within-word pronunciation variants were generated by applying a set of five optional phonological rules to the words in the baseline lexicon. Next, a limited number of cross-word processes were modeled, using two different methods. In the first approach, cross-word processes were modeled by directly adding the cross-word variants to the lexicon, and in the second approach this was done by using multi-words. Finally, the combination of the within-word method with the two cross-word methods was tested. The word error rate (WER) measured for the baseline system was 12.75%. Compared to the baseline, a small but statistically significant improvement of 0.68% in WER was measured for the within-word method, whereas both cross-word methods in isolation led to small, non-significant improvements. The combination of the within-word method and cross-word method 2 led to the best result: an absolute improvement of 1.12% in WER was found compared to the baseline, which is a relative improvement of 8.8% in WER. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Nijmegen, Dept Language & Speech, A2 RT, NL-6500 HD Nijmegen, Netherlands. RP Kessens, JM (reprint author), Univ Nijmegen, Dept Language & Speech, A2 RT, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM j.kessens@let.kun.nl CR BAAYEN H, 1991, FORUM LETT, V32, P221 Booij Geert, 1995, PHONOLOGY DUTCH COHEN MH, 1989, THESIS U CAL BERKELE COHEN PS, 1974, P IEEE S SPEECH REC, P177 Cremelie N., 1998, P ESCA WORKSH MOD PR, P23 CUCCHIARINI C, 1995, P DEP LANG SPEECH, V19, P59 KERKHOFF J, 1994, P DEP LANG SPEECH U, V18, P107 KESSENS JM, 1997, P CLS OP AC YEAR 97, P1 Lamel L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.606916 PERENNOU G, 1998, P ESCA WORKSH MOD PR, P91 STEINBISS V, 1993, P ESCA 3 EUR C SPEEC, P2125 STRIK H, 1998, P ESCA WORKSH MOD PR, P137 van den Heuvel Henk, 1997, INT J SPEECH TECHNOL, V2, P119 WISEMAN R, 1998, P ESCA WORKSH MOD PR, P157 NR 14 TC 26 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1999 VL 29 IS 2-4 BP 193 EP 207 DI 10.1016/S0167-6393(99)00048-5 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 261WF UT WOS:000084032700008 ER PT J AU Riley, M Byrne, W Finke, M Khudanpur, S Ljolje, A McDonough, J Nock, H Saraclar, M Wooters, C Zavaliagkos, G AF Riley, M Byrne, W Finke, M Khudanpur, S Ljolje, A McDonough, J Nock, H Saraclar, M Wooters, C Zavaliagkos, G TI Stochastic pronunciation modelling from hand-labelled phonetic corpora SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition CY MAY 04-06, 1998 CL KERKRADE, NETHERLANDS SP ESCA DE pronunciation modelling; decision trees; spontaneous speech; speech recognition AB In the early 1990s, the availability of the TIMIT read-speech phonetically transcribed corpus led to work at AT&T on the automatic inference of pronunciation variation. This work, briefly summarized here, used stochastic decision trees trained on phonetic and linguistic features, and was applied to the DARPA North American Business News read-speech ASR task. More recently, the ICSI spontaneous-speech phonetically transcribed corpus was collected at the behest of the 1996 and 1997 LVCSR Summer Workshops held at Johns Hopkins University. A 1997 workshop (WS97) group focused on pronunciation inference from this corpus for application to the DoD Switchboard spontaneous telephone speech ASR task. We describe several approaches taken there. These include (1) one analogous to the AT&T approach, (2) one, inspired by work at WS96 and CMU, that involved adding pronunciation variants of a sequence of one or more words ('multiwords') in the corpus (with corpus-derived probabilities) into the ASR lexicon, and (1 + 2) a hybrid approach in which a decision-tree model was used to automatically phonetically transcribe a much larger speech corpus than ICSI and then the multiword approach was used to construct an ASR recognition pronunciation lexicon. (C) 1999 Elsevier Science B.V. All rights reserved. C1 AT&T Labs Res, Florham Pk, NJ 07932 USA. Johns Hopkins Univ, Baltimore, MD USA. Carnegie Mellon Univ, Pittsburgh, PA 15213 USA. Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. US Dept Def, Ft George G Meade, MD USA. BBN, Cambridge, MA USA. RP Riley, M (reprint author), AT&T Labs Res, Room E141,180 Pk, Florham Pk, NJ 07932 USA. 
RI Saraclar, Murat/E-8640-2010 OI Saraclar, Murat/0000-0002-7435-8510 CR Brieman L, 1984, CLASSIFICATION REGRE BYRNE W, 1997, 1997 IEEE WORKSH SPE BYRNE W, 1998, P ICASSP 98 SEATTL W CHEN F, 1990, P ICASSP 90 CHOU P, 1988, THESIS STANF U STANF COKER C, 1985, J ACOUST SOC AM, V78 CREMELIE N, 1997, P EUROSPEECH 97, P2459 DOWNEY S, 1997, P EUROSPEECH 97, P1027 FINKE M, 1997, P EUROSPEECH 97 FISHER W, 1987, J ACOUST SOC AM, V81 FUKADA T, 1997, P EUROSPEECH 97, P2471 GREENBERG S, 1996, 1996 LVCSR SUMM WORK HOLTER T, 1997, THESIS NORWEGIAN U S Kipp Andreas, 1997, P EUR 97, P1023 Ladefoged P., 1975, COURSE PHONETICS LAMEL L, 1996, PO ICSLP 96 RANDOLPH M, 1990, P ICASSP 90 Riley M, 1995, AUTOMATIC SPEECH SPE SHOUP J, 1990, TRENDS SPEECH RECOGN, P125 STRIK H, 1998, P ETRW WORKSH MOD PR TAJCHMAN G, 1995, P EUROSPEECH 95 WEINTRAUB M, 1989, P ICASSP 89 WEINTRAUB M, 1996, 1996 LVCSR SUMM WORK WOOTERS C, 1994, P ICSLP 94, P1363 Young S., 1995, HTK BOOK NR 25 TC 46 Z9 46 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1999 VL 29 IS 2-4 BP 209 EP 224 DI 10.1016/S0167-6393(99)00037-0 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 261WF UT WOS:000084032700009 ER PT J AU Strik, H Cucchiarini, C AF Strik, H Cucchiarini, C TI Modeling pronunciation variation for ASR: A survey of the literature SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Tutorial and Research Workshop on Modeling Pronunciation Variation for Automatic Speech Recognition CY MAY 04-06, 1998 CL KERKRADE, NETHERLANDS SP ESCA DE pronunciation variation; automatic speech recognition ID SPEECH AB The focus in automatic speech recognition (ASR) research has gradually shifted from isolated words to conversational speech. Consequently, the amount of pronunciation variation present in the speech under study has gradually increased. Pronunciation variation will deteriorate the performance of an ASR system if it is not well accounted for. This is probably the main reason why research on modeling pronunciation variation for ASR has increased lately. In this contribution, we provide an overview of the publications on this topic, paying particular attention to the papers in this special issue and the papers presented at 'the Rolduc workshop'. (1) First, the most important characteristics that distinguish the various studies on pronunciation variation modeling are discussed. Subsequently, the issues of evaluation and comparison are addressed. Particular attention is paid to some of the most important factors that make it difficult to compare the different methods in an objective way. Finally, some conclusions are drawn as to the importance of objective evaluation and the way in which it could be carried out. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Nijmegen, Dept Language & Speech, A2 RT, NL-6500 HD Nijmegen, Netherlands. RP Strik, H (reprint author), Univ Nijmegen, Dept Language & Speech, A2 RT, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM strik@let.kun.nl CR ADDADECKER M, 1999, COMMUNICATION, V29, P83 ADDADECKER M, 1998, P ESCA WORKSH MOD PR, P1 AUBERT X, 1995, P EUROSPEECH, P767 BACCHIANI M, 1998, P ESCA WORKSH MOD PR, P7 BACCHIANI M, 1999, COMMUNICATION, V29, P99 BARNETT J, 1974, P IEEE S SPEECH REC, P188 BELL A, 1984, LANG SOC, V13, P145 Beulen K, 1998, P ESCA WORKSH MOD PR, P13 BLACKBURN C, 1995, P EUR C SPEECH COMM, P1623 Blackburn C. 
S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607764 Bonaventura P., 1998, P ESCA WORKSH MOD PR, P17 Cohen P. S., 1975, SPEECH RECOGNITION, P275 COHEN PS, 1989, P IEEE S SPEECH RECO Coupland N., 1984, INT J SOCIOL LANG, V1984, P49 CREMELIE N, 1999, COMMUNICATION, V29, P115 CREMELIE N, 1997, P EUROSPEECH 97, P2459 CREMELIE N, 1995, P EUROSPEECH 95 MADR, P1747 Cremelie N., 1998, P ESCA WORKSH MOD PR, P23 DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839 DESHMUKH N, 1996, P ICASSP 96 ATLANTA, P283 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 NR 21 TC 84 Z9 85 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1999 VL 29 IS 2-4 BP 225 EP 246 DI 10.1016/S0167-6393(99)00038-2 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 261WF UT WOS:000084032700010 ER PT J AU van Son, RJJH Pols, LCW AF van Son, RJJH Pols, LCW TI Perisegmental speech improves consonant and vowel identification SO SPEECH COMMUNICATION LA English DT Article DE vowel identification; consonant identification; context ID LOCUS EQUATIONS; DUTCH VOWELS; PERCEPTION; RECOGNITION; FORMANT; SYLLABLES; CONTEXT; INFORMATION; SEGMENTS; ENGLISH AB In two papers, Nearey (1992, 1997) discusses the fact that theories on phoneme identification generally favor strong cues that are localized in the speech signal. He proposes an alternative view in which cues to phoneme identity are relatively weak and dispersed. In the present listening experiment, Dutch subjects identified speech tokens containing fragments of vowel and consonant realizations and their immediate neighbors, taken from connected read speech. Using a measure of listener confusion based on the perplexity of the confusion matrix, it is possible to quantify the amount of information extracted by the listeners from different parts of the speech signal. Around half the information needed for the identification task was extracted from only a short, 40-50 ms, speech fragment. Considerable amounts of additional information were extracted from parts of the signal at, and beyond, the conventional boundaries of the segment, here called perisegmental speech. Speech in front of the target segment improved identification more than speech following the target segment, even if this speech was actually not part of the target phoneme itself. Correct identification of pre-vocalic consonants correlated with the correct identification of the following vowel, and vice versa. The identification of post-vocalic consonants was not correlated with the identification of the vowel in front. It is concluded that human listeners extract an important fraction of the information needed to identify phonemes from outside the conventional segment boundaries. This supports the proposal of Nearey that extended, "weak" cues might play an important part in the identification of phonemes. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Amsterdam, Inst Phonet Sci, IFOTT, NL-1016 CG Amsterdam, Netherlands. RP van Son, RJJH (reprint author), Univ Amsterdam, Inst Phonet Sci, IFOTT, Herengracht 338, NL-1016 CG Amsterdam, Netherlands. EM rob.van.son@hum.uva.nl CR Bahl L. 
R., 1990, READINGS SPEECH RECO, P308 BROAD DJ, 1970, J ACOUST SOC AM, V47, P1572, DOI 10.1121/1.1912090 BROAD DJ, 1987, J ACOUST SOC AM, V81, P155, DOI 10.1121/1.395025 Chennoukh S, 1997, J ACOUST SOC AM, V102, P2380, DOI 10.1121/1.419622 COOPER FS, 1952, J ACOUST SOC AM, V24, P597, DOI 10.1121/1.1906940 Cutler A, 1997, SPEECH COMMUN, V21, P3, DOI 10.1016/S0167-6393(96)00075-1 DELATTRE PC, 1955, J ACOUST SOC AM, V27, P769, DOI 10.1121/1.1908024 DIBENEDETTO MG, 1989, J ACOUST SOC AM, V86, P67, DOI 10.1121/1.398221 FOWLER CA, 1991, MODULARITY AND THE MOTOR THEORY OF SPEECH PERCEPTION, P33 FOX RA, 1989, PHONETICA, V46, P97 Fruchter D, 1997, J ACOUST SOC AM, V102, P2997, DOI 10.1121/1.421012 GROSJEAN F, 1980, PERCEPT PSYCHOPHYS, V28, P267, DOI 10.3758/BF03204386 Grossberg S., 1984, COGNITION BRAIN THEO, V7, P263 Hays WL., 1973, STAT SOCIAL SCI, V2nd HOUTSMA AJM, 1983, J ACOUST SOC AM, V74, P1626, DOI 10.1121/1.390125 KEATING PA, 1994, J PHONETICS, V22, P407 KHINCHIN AI, 1957, MATH FDN INFORMATION KOOPMANSVANBEIN.FJ, 1984, P I ACOUSTICS, V6, P363 Krull D., 1989, PERILUS, VX, P87 LEHISTE I, 1961, J ACOUST SOC AM, V33, P268, DOI 10.1121/1.1908638 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 Lindblom B, 1996, J ACOUST SOC AM, V99, P1683, DOI 10.1121/1.414691 LINDBLOM BE, 1967, J ACOUST SOC AM, V42, P830, DOI 10.1121/1.1910655 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 LISKER L, 1986, SR8687 HASK LAB, P45 MACNEILAGE PF, 1991, MODULARITY AND THE MOTOR THEORY OF SPEECH PERCEPTION, P61 Magen HS, 1997, J PHONETICS, V25, P187, DOI 10.1006/jpho.1996.0041 MANN V, 1991, PERCEPT PSYCHOPHYS, V49, P399, DOI 10.3758/BF03212174 MARASCUILO LA, 1988, STAT METHODS BEHAV S Marslen-Wilson W. D., 1989, LEXICAL REPRESENTATI, P169 MASSARO DW, 1991, DEV PSYCHOL, V27, P85, DOI 10.1037//0012-1649.27.1.85 MASSARO DW, 1993, SPEECH COMMUN, V13, P127, DOI 10.1016/0167-6393(93)90064-R MASSARO DW, 1990, PSYCHOL REV, V97, P225, DOI 10.1037//0033-295X.97.2.225 MASSARO DW, 1974, J EXP PSYCHOL, V102, P199, DOI 10.1037/h0035854 McGowan RS, 1996, J ACOUST SOC AM, V99, P1680, DOI 10.1121/1.414690 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 Nabelek AK, 1997, J ACOUST SOC AM, V101, P488, DOI 10.1121/1.417992 ANDRUSKI JE, 1992, J ACOUST SOC AM, V91, P390, DOI 10.1121/1.402781 NEAREY TM, 1986, J ACOUST SOC AM, V80, P1297, DOI 10.1121/1.394433 NEAREY TM, 1992, LANG SPEECH, V35, P153 Nearey TM, 1997, J ACOUST SOC AM, V101, P3241, DOI 10.1121/1.418290 Ohde RN, 1996, J ACOUST SOC AM, V100, P3813, DOI 10.1121/1.417338 OHDE RN, 1977, J SPEECH HEAR RES, V20, P543 OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 OHTA F, 1962, STUDIA PHONOLOGICA, V2, P61 PEETERS WJM, 1991, THESIS U ULTRECHT TH PICKETT JM, 1995, PHONETICA, V52, P1 Pisoni D. 
B., 1986, PATTERN RECOGN, V1, P1 POLS LCW, 1993, SPEECH COMMUN, V13, P135, DOI 10.1016/0167-6393(93)90065-S POLS LCW, 1979, ASA 50 SPEECH COMMUN, P459 PRESS WH, 1998, NUMERICAL RECIPES C, P632 Recasens D, 1997, J ACOUST SOC AM, V102, P544, DOI 10.1121/1.419727 Siegel S., 1956, NONPARAMETRIC STAT B SMITS R, 1997, SPEECH HEARING LANGU, V10, P115 SMITS R, 1997, SPEECH HEARING LANGU, V9, P195 SON RJJ, 1993, P EUR 93, P285 STEVENS KN, 1989, J PHONETICS, V17, P3 STRANGE W, 1989, J ACOUST SOC AM, V85, P2081, DOI 10.1121/1.397860 Sussman HM, 1997, J ACOUST SOC AM, V101, P2826, DOI 10.1121/1.418567 SVESHNIKOV AA, 1968, PROBLEMS PROBABILITY, P157 VANBERGEM DR, 1993, SPEECH COMMUN, V12, P1, DOI 10.1016/0167-6393(93)90015-D VANDERKA.LJ, 1971, ACTA PSYCHOL, V35, P64, DOI 10.1016/0001-6918(71)90032-1 VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5 VANSON RJJ, 1995, P EUR 95, P2277 VANSON RJJ, 1993, STUDIES LANGUAGE LAN, V3 van Son R. J. J. H., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607908 VANSON RJJ, 1998, P ICSLP 98 SYDN, V6, P2375 VANSON RJJ, 1997, P EUR 97 RHOD, P2135 van Son RJJH, 1999, SPEECH COMMUN, V28, P125, DOI 10.1016/S0167-6393(99)00009-6 VANSON RJJH, 1992, J ACOUST SOC AM, V92, P121, DOI 10.1121/1.404277 VANSON RJJH, 1990, J ACOUST SOC AM, V88, P1683, DOI 10.1121/1.400243 VANWIERINGEN A, 1991, P 12 INT C PHON SCI, P446 VANWIERINGEN A, 1995, J ACOUST SOC AM, V98, P1304, DOI 10.1121/1.413467 VANWIERINGEN A, 1995, THESIS U AMSTERDAM ZUE VW, 1985, P IEEE, V73, P1603 NR 79 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1999 VL 29 IS 1 BP 1 EP 22 DI 10.1016/S0167-6393(99)00024-2 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 235ZH UT WOS:000082573400001 ER PT J AU Lee, I Gibson, JD AF Lee, I Gibson, JD TI Tree coding combined with TDHS for speech coding at 6.4 and 4.8 kbps SO SPEECH COMMUNICATION LA English DT Article DE speech coding; speech analysis ID CODERS AB Tree coding is combined with time domain harmonic scaling (TDHS) for speech coding at 6.4 and 4.8 kbps. In order to improve the robustness to channel errors, new pitch predictor, short-term predictor adaptation and gain adaptation methods are proposed for the tree coder. New code trees with appropriate gain adaptation rules, new backward adaptive pitch predictor and robust short-term predictor adaptation algorithms are evaluated for both ideal and noisy channels. Paired comparison listening tests show that the 6.4 kbps coder (2-to-1 TDHS/2 bits/sample tree coding) has speech quality equivalent to 6 bit log-PCM at a sampling rate of 6400 samples/s. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Chungbuk Natl Univ, Sch Elect & Elect Engn, Cheongju 361763, Chungbuk, South Korea. So Methodist Univ, Dept Elect Engn, Dallas, TX 75275 USA. RP Lee, I (reprint author), Chungbuk Natl Univ, Sch Elect & Elect Engn, San 48,Gaesin Dong, Cheongju 361763, Chungbuk, South Korea. EM inslee@cbucc.chungbuk.ac.kr CR ANDERSON JB, 1975, IEEE T INFORM THEORY, V21, P379, DOI 10.1109/TIT.1975.1055415 CHANG WW, 1990, IEEE T INFORM THEORY, V36, P1134 CHEN JH, 1992, IEEE J SEL AREA COMM, V10, P830, DOI 10.1109/49.138988 Chen J.-H., 1991, IEEE Global Telecommunications Conference. GLOBECOM '91. Countdown to the New Millennium.
Featuring a Mini-Theme on: Personal Communications Services (PCS). Conference Record (Cat. No.91CH2980-1), DOI 10.1109/GLOCOM.1991.188691 CHEONG YC, 1992, THESIS TEXAS A M U EINARSSON G, 1981, IEEE T COMMUN, V29, P830, DOI 10.1109/TCOM.1981.1095058 Gerson I. A., 1990, P IEEE INT C AC SPEE, P461 GIBSON JD, 1991, IEEE T COMMUN, V39, P963, DOI 10.1109/26.87186 GORIS AC, 1979, IEEE T INFORM THEORY, V27, P165 HAYKIN S, 1989, ADAPTIVE FILTER THEO IYENGAR V, 1991, IEEE T SIGNAL PROCES, V39, P1049, DOI 10.1109/78.80962 IYENGAR V, 1988, P IEEE INT C ACOUSTI, P243 JAYANT NS, 1970, BELL SYST TECH J, P321 JAYANT NS, 1989, DIGITAL CODING WAVEF KEIJIN WB, 1995, SPEECH CODING SYNTHE MALAH D, 1979, IEEE T ACOUST SPEECH, V27, P121, DOI 10.1109/TASSP.1979.1163210 MALAH D, 1981, BELL SYST TECH J, P2107 MALAH D, 1981, IEEE T ACOUST SPEECH, V29, P273, DOI 10.1109/TASSP.1981.1163547 MALAH D, 1980, P IEEE INT C AC SPEE, P504 MELSA JL, 1981, P IEEE INT C AC SPEE, P603 NAM SH, 1994, IEEE T SPEECH AUDIO, V2 Pettigrew R., 1989, GLOBECOM '89. IEEE Global Telecommunications Conference and Exhibition. Communications Technology for the 1990s and Beyond (Cat. No.89CH2682-3), DOI 10.1109/GLOCOM.1989.64154 Rabiner L.R., 1978, DIGITAL PROCESSING S RAMACHANDRAN RP, 1987, IEEE T ACOUST SPEECH, V35, P937, DOI 10.1109/TASSP.1987.1165238 Salami R, 1998, IEEE T SPEECH AUDI P, V6, P116, DOI 10.1109/89.661471 Woo HC, 1994, IEEE T SPEECH AUDI P, V2, P361 WOO HC, 1994, IEEE T SPEECH AUDIO, V2, P1884 YATROU P, 1988, IEEE J SEL AREA COMM, V6, P249, DOI 10.1109/49.602 YUAN J, 1990, ADV SPEECH CODING NR 29 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1999 VL 29 IS 1 BP 23 EP 37 DI 10.1016/S0167-6393(99)00025-4 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 235ZH UT WOS:000082573400002 ER PT J AU Kovesi, B Saoudi, S Boucher, JM Horvath, G AF Kovesi, B Saoudi, S Boucher, JM Horvath, G TI Real time vector quantization of LSP parameters SO SPEECH COMMUNICATION LA English DT Article DE speech coding; line spectrum pairs; split vector quantization; multi-stage vector quantization; low bit rate; spectral distance measure; power spectral density AB The distance measure is of great importance in both the design and coding stage of a vector quantizer. Due to its complexity, however, the spectral distance which best correlates with the perceptual quality is seldom used. On the other hand, various weighted squared Euclidean distance measures give close or even accurate estimation of the meaningful spectral distance. Since they are in general mathematically more tractable, these weighted squared Euclidean distance measures are more commonly used. Significant differences can be found in the performance of different distance measures suggested in the previous literature. In this paper, a complete study and comparison of weighted squared Euclidean distance measures is given. This paper also proposes a new weighted squared Euclidean distance measure for vector quantization of Line Spectrum Pairs (LSP) or Cosine of LSP (CLSP) parameters. It also presents an efficient adaptation apparatus for using the proposed distance measure in the case of split or multi-stage vector quantizers. (C) 1999 Elsevier Science B.V. All rights reserved. C1 CNET, Dept CMC, Lannion, France. ENST Bretagne, Grp Signal, Dept Signal & Commun, F-29285 Brest, France.
Tech Univ Budapest, Dept MMT, H-1521 Budapest, Hungary. RP Saoudi, S (reprint author), CNET, Dept CMC, Av P Marzin, Lannion, France. EM samir.saoudi@enst-bretagne.fr CR BRUHN S, 1994, P IEEE INT C AC SPEE COLLURA JS, 1993, P IEEE INT C AC SPEE DEKETELARE S, 1991, P 13 C TRAIT SIGN IM, P761 ERZIN E, 1993, P IEEE INT C AC SPEE, P25 GURGEN FS, 1990, P INT C SPOK LANG PR, P512 Hagen R., 1990, P IEEE INT C AC SPEE, P189 JUANG BH, 1982, P IEEE ICASSP, P597 Kleijn W. B., 1995, SPEECH CODING SYNTHE KOVESI B, 1995, P IEEE INT C AC SPEE, P269 KOVESI B, 1997, THESIS U RENNES 1 KUO CC, 1992, P IEEE INT C AC SPEE Laroia R., 1991, P IEEE INT C AC SPEE, P641, DOI 10.1109/ICASSP.1991.150421 PAKSOY E, 1992, P INT C SPOK LANG PR, P33 Paliwal K.K., 1991, P INT C AC SPEECH SI, P661, DOI 10.1109/ICASSP.1991.150426 PAN J, 1994, P IEEE INT C AC SPEE SAOUDI S, 1990, THESIS U RENNES 1 Soong F., 1990, P ICASSP90, P185 SUGAMURA N, 1986, SPEECH COMMUN, V5, P199, DOI 10.1016/0167-6393(86)90008-7 SVENDSEN T, 1994, P IEEE INT C AC SPEE NR 19 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1999 VL 29 IS 1 BP 39 EP 47 DI 10.1016/S0167-6393(99)00026-6 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 235ZH UT WOS:000082573400003 ER PT J AU Mokbel, H Jouvet, D AF Mokbel, H Jouvet, D TI Derivation of the optimal set of phonetic transcriptions for a word from its acoustic realizations SO SPEECH COMMUNICATION LA English DT Article DE phonetic transcriptions; acoustic realizations; N-best decoding; frequency criterion; maximum likelihood criterion; phonotactic constraints; partition; optimal set AB This paper deals with a set of methods developed in order to derive multiple variants of phonetic transcriptions for words, given sample utterances of the words and an inventory of context-dependent sub-word units. These methods use a two-phase process to derive the phonetic transcriptions. The first phase consists in generating a set of possible transcriptions for the word using an N-best phonetic decoding of the available utterances of that word. The second phase consists in selecting, in the set of possible transcriptions of the word, the ones that best describe the word. The selection of the "best" transcriptions is accomplished according to different criteria. The Frequency criterion chooses the k most frequent transcriptions in the set of all phonetic decodings of that word, while the Maximum Likelihood (ML) criterion chooses the k most likely ones. With the two criteria, k is the same whatever the word is, and each of the k variants "describes" all the training utterances of the word. A partition procedure, which determines the "optimal" number of transcriptions for each word, is then investigated. This procedure assumes that, in the set of the selected transcriptions, each transcription must "describe" a subset of the utterances of the word. So, the goal is to find the "suitable" transcriptions and to associate each of them to a subset of the pronunciations (utterances). Two iterative algorithms are developed and evaluated, and a compromise between the likelihood and the number of elements of the "optimal" set of transcriptions is studied.
Speaker-independent speech recognition experiments showed that the ML criterion outperforms the Frequency criterion and that the performance obtained with the former criterion is comparable to that obtained with reference transcriptions. Moreover, results obtained on different speaker-independent speech recognition tasks showed an improvement in performance when the training set of utterances is partitioned between the selected transcriptions; i.e. when each selected transcription represents only a subset of the utterances. (C) 1999 Elsevier Science B.V. All rights reserved. C1 France Telecom, CNET, DIH, DIPS, F-22307 Lannion, France. RP Mokbel, H (reprint author), IDIAP, Case Postale 592, CH-1920 Martigny, Switzerland. EM houda.mokbel@cnet.francetelecom.fr; denis.jouvet@cnet.francetelecom.fr CR ADDADECKER M, 1998, P ESCA WORKSH MOD PR, P1 Asadi A., 1991, P INT C AC SPEECH SI, P305, DOI 10.1109/ICASSP.1991.150337 AUBERGE V, 1988, P JOURN ET PAR NANC, P55 Bahl LR, 1993, IEEE T SPEECH AUDI P, V1, P443, DOI 10.1109/89.242490 CREMELIE N, 1998, ESCA WORKSH MOD PRON, P23 FERREIROS J, 1998, P ESCA WORKSH MOD PR, P29 HAEBUMBACH R, 1995, P ICASSP, P840 Holter T., 1998, P ESCA WORKSH MOD PR, P63 JOUVET D, 1991, P EUROSPEECH 91, P923 JOUVET D, 1994, P ICSLP 94 YOKOHAMA, P283 MOKBEL H, 1998, P ESCA WORKSH MOD PR, P73 MOKBEL H, 1997, P EUR C SPEECH COMM, P1619 Ravishankar M, 1997, P EUR C SPEECH COMM, P2467 SLOBODA T, 1995, P INT C AC SPEECH SI, P453 SOONG FK, 1991, P INT C AC SPEECH SI, V1, P705 SORIN C, 1995, SPEECH COMMUN, V17, P273, DOI 10.1016/0167-6393(95)00035-M WESTER M, 1998, P ESCA WORKSH MOD PR, P145 NR 17 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1999 VL 29 IS 1 BP 49 EP 64 DI 10.1016/S0167-6393(99)00021-7 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 235ZH UT WOS:000082573400004 ER PT J AU Ferreiros, J Pardo, JM AF Ferreiros, J Pardo, JM TI Improving continuous speech recognition in Spanish by phone-class semicontinuous HMMs with pausing and multiple pronunciations SO SPEECH COMMUNICATION LA English DT Article DE Spanish continuous speech recognition; semicontinuous HMMs phone-class dependent; use of pausing information during training; use of multiple pronunciations during recognition AB This paper presents a comprehensive study of continuous speech recognition in Spanish. It shows the use and optimisation of several well-known techniques together with the application for the first time to Spanish of language specific knowledge to these systems, i.e. the careful selection of the phone inventory, the phone-classes used, and the selection of alternative pronunciation rules. We have developed a semicontinuous phone-class dependent contextual modelling. Using four phone-classes, we have obtained recognition error rate reductions roughly equivalent to the percentage increase of the number of parameters, compared to baseline semicontinuous contextual modelling. We also show that the use of pausing in the training system and multiple pronunciations in the vocabulary help to improve recognition rates significantly. The actual pausing of the training sentences and the application of assimilation effects improve the transcription into context-dependent units. Multiple pronunciation possibilities are generated using general rules that are easily applied to any Spanish vocabulary. 
With all these ideas, we have reduced the recognition errors of the baseline system by more than 30% in a task parallel to DARPA-RM translated into Spanish with a vocabulary of 979 words. Our database contains four speakers with 600 training sentences and 100 testing sentences each. All experiments have been carried out with a perplexity of 979, and even slightly higher in the case of multiple pronunciations, to be able to study the acoustic modelling power of the systems with no grammar constraints. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Politecn Madrid, ETSI Telecomunic, Dept Ingn Elect, Grp Tecnol Habla, E-28040 Madrid, Spain. RP Ferreiros, J (reprint author), Univ Politecn Madrid, ETSI Telecomunic, Dept Ingn Elect, Grp Tecnol Habla, Ciudad Univ S-N, E-28040 Madrid, Spain. EM jfl@die.upm.es RI Pardo, Jose/H-3745-2013 CR BRIDLE JS, 1982, P IEEE INT C AC SPEE, V2, P899 FERREIROS J, 1996, THESIS U POLITECNICA FERREIROS J, 1998, P ESCA WORKSH MOD PR, P29 FERREIROS J, 1995, SPEECH RECOGNITION C, P68 FERREIROS J, 1995, P EUR 18 21 SEPT 199, V2, P1507 HASAN H, 1989, P IEEE INT C AC SPEE, V1, P342 HUANG XD, 1989, P EUROPEAN C SPEECH, V1, P163 HUANG XD, 1989, P IEEE INT C AC SPEE, V1, P639 HUERTA JM, 1998, P DARPA BN TRANSCR U HWANG MY, 1989, P EUR SEPT 1989 PAR, V1, P5 HWANG MY, 1994, P IEEE INT C AC SPEE, V1, P549 LEE KF, 1988, THESIS CMU NEY H, 1984, IEEE T ACOUST SPEECH, V32, P263, DOI 10.1109/TASSP.1984.1164320 PARDO JM, 1989, P EUR SEPT 1989 PAR, V2, P146 PEINADO AM, 1994, P IEEE INT C AC SPEE, V1, P61 Price P., 1988, P IEEE INT C AC SPEE, V1, P651 WEINTRAUB M, 1989, P IEEE INT C AC SPEE, V2, P699 Weiss NA, 1993, INTRO STAT, P407 NR 18 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1999 VL 29 IS 1 BP 65 EP 76 DI 10.1016/S0167-6393(99)00013-8 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 235ZH UT WOS:000082573400005 ER PT J AU Alku, P Vintturi, J Vilkman, E AF Alku, P Vintturi, J Vilkman, E TI On the linearity of the relationship between the sound pressure level and the negative peak amplitude of the differentiated glottal flow in vowel production SO SPEECH COMMUNICATION LA English DT Article ID VOICE SOURCE; FUNDAMENTAL-FREQUENCY; SUBGLOTTAL PRESSURE; WAVE-FORM; AIR-FLOW; SINGERS; PHONETOGRAM; INTENSITY; FEATURES; SPEAKERS AB The negative peak amplitude of the differentiated glottal flow (d(peak)) is known to correlate strongly with the sound pressure level (SPL) of speech. Therefore, the function between d(peak) and SPL has been conventionally modeled as a single line. In this survey, the linearity of the function between d(peak) and SPL is revisited by analyzing glottal flows that were inverse filtered from speech sounds of largely different intensities. It is shown that SPL-d(peak)-graphs can be modeled more accurately by using two linear functions, the first of which models soft phonation, and the second of which models normal and loud speech sounds. For all of the analyzed SPL-d(peak)-graphs, the slope of the modeling line matching soft phonation was larger than the slope of the line for normal and loud speech. This result suggests that vocal intensity is affected not only by the single amplitude domain value of the voice source, d(peak), but also by the shape of the differentiated glottal flow near the instant of the negative peak. (C) 1999 Elsevier Science B.V. All rights reserved.
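The two-line modeling of SPL-d(peak) graphs described above amounts to a piecewise linear fit: try each candidate breakpoint, fit a least-squares line on either side, and keep the split with the smallest total squared error. The sketch below runs on synthetic data built to mimic the reported trend (a steeper slope in the soft-phonation region); it is not the authors' analysis code, and the slopes, ranges and noise level are invented.

```python
import numpy as np

# Synthetic SPL-versus-d_peak data with a steeper slope for soft phonation.
rng = np.random.default_rng(0)
dpeak_db = np.sort(rng.uniform(-40.0, 0.0, 60))  # d_peak on a dB-like scale
spl = np.where(dpeak_db < -20.0,
               1.4 * dpeak_db + 110.0,           # soft phonation: larger slope
               0.8 * dpeak_db + 98.0)            # normal/loud: smaller slope
spl = spl + rng.normal(0.0, 1.0, dpeak_db.size)  # measurement noise

def line_sse(x, y):
    """Least-squares line and its sum of squared errors."""
    slope, intercept = np.polyfit(x, y, 1)
    err = np.sum((y - (slope * x + intercept)) ** 2)
    return err, (slope, intercept)

def two_line_fit(x, y, min_pts=5):
    """Fit two lines with a free breakpoint, minimising total squared error."""
    best = (np.inf, None, None, None)
    for k in range(min_pts, len(x) - min_pts):  # x must be sorted
        e1, line1 = line_sse(x[:k], y[:k])
        e2, line2 = line_sse(x[k:], y[k:])
        if e1 + e2 < best[0]:
            best = (e1 + e2, x[k], line1, line2)
    return best

_, breakpoint_db, soft, loud = two_line_fit(dpeak_db, spl)
print(f"breakpoint near {breakpoint_db:.1f} dB")
print(f"soft-phonation slope {soft[0]:.2f} > normal/loud slope {loud[0]:.2f}")
```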
C1 Aalto Univ, Acoust Lab, TKK, FIN-02015 Helsinki, Finland. Univ Helsinki, Cent Hosp, Dept Otolaryngol & Phoniatr, Helsinki, Finland. Univ Oulu, Dept Otolaryngol & Phoniatr, Oulu, Finland. RP Alku, P (reprint author), Aalto Univ, Acoust Lab, TKK, POB 3000, FIN-02015 Helsinki, Finland. EM paavo.alku@hut.fi RI Alku, Paavo/E-2400-2012 CR AKERLUND L, 1994, J VOICE, V8, P263, DOI 10.1016/S0892-1997(05)80298-X ALKU P, 1992, SPEECH COMMUN, V11, P109, DOI 10.1016/0167-6393(92)90005-R Alku P, 1997, SPEECH COMMUN, V22, P67, DOI 10.1016/S0167-6393(97)00020-4 Alku P, 1998, J SPEECH LANG HEAR R, V41, P990 Alku P, 1998, SPEECH COMMUN, V24, P123, DOI 10.1016/S0167-6393(98)00004-1 Alku P., 1994, P INT C SPOK LANG PR, P1619 BOUHUYS A, 1968, ANN NY ACAD SCI, V155, P165, DOI 10.1111/j.1749-6632.1968.tb56760.x CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 DROMEY C, 1992, J VOICE, V6, P44, DOI 10.1016/S0892-1997(05)80008-6 ELJAROUDI A, 1991, IEEE T SIGNAL PROCES, V39, P411, DOI 10.1109/78.80824 Fant G., 1960, ACOUSTIC THEORY SPEE FANT G, 1993, SPEECH COMMUN, V13, P7, DOI 10.1016/0167-6393(93)90055-P FANT G, 1995, LF MODEL REVISITED T, P119 Fant Gunnar, 1985, STL QPSR, V4, P1 FLANAGAN J, 1972, ANAL SYNTHESIS PERCE GAUFFIN J, 1989, J SPEECH HEAR RES, V32, P556 GRAMMING P, 1988, J ACOUST SOC AM, V83, P2352, DOI 10.1121/1.396366 Gramming P., 1988, J VOICE, V2, P118, DOI 10.1016/S0892-1997(88)80067-5 HERTEGARD S, 1992, J VOICE, V6, P224, DOI 10.1016/S0892-1997(05)80147-X HERTEGARD S, 1990, J VOICE, V4, P220, DOI 10.1016/S0892-1997(05)80017-7 HILLMAN RE, 1990, J VOICE, V4, P52, DOI 10.1016/S0892-1997(05)80082-7 HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829 HOLMES J, 1976, P IEEE INT C AC SPEE, P39 HUNT M, 1987, P 11 INT C PHON SCI, V3, P22 Sulter AM, 1996, J ACOUST SOC AM, V100, P3360, DOI 10.1121/1.416977 SUNDBERG J, 1993, J VOICE, V7, P15, DOI 10.1016/S0892-1997(05)80108-0 SUNDBERG J, 1990, J VOICE, V4, P107, DOI 10.1016/S0892-1997(05)80135-3 TITZE IR, 1992, J SPEECH HEAR RES, V35, P21 Titze IR, 1994, PRINCIPLES VOICE PRO TITZE IR, 1992, J ACOUST SOC AM, V91, P2936, DOI 10.1121/1.402929 Vilkman E, 1997, FOLIA PHONIATR LOGO, V49, P247 WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260 NR 32 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1999 VL 28 IS 4 BP 269 EP 281 DI 10.1016/S0167-6393(99)00020-5 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 223DJ UT WOS:000081824700001 ER PT J AU Lee, S Oh, YH AF Lee, S Oh, YH TI Tree-based modeling of prosodic phrasing and segmental duration for Korean TTS systems SO SPEECH COMMUNICATION LA English DT Article DE prosodic phrasing; pause duration; segmental duration; Korean prosody; TTS system; tree-based modeling ID SPEECH; TEXT AB This study describes the tree-based modeling of prosodic phrasing, pause duration between phrases and segmental duration for Korean TTS systems. We collected 400 sentences from various genres and built a corresponding speech corpus uttered by a professional female announcer. The phonemic and prosodic boundaries were manually marked on the recorded speech, and morphological analysis, grapheme-to-phoneme conversion and syntactic analysis were also done on the text. 
A decision tree and regression trees were trained on 240 sentences (of approximately 20 min length), and tested on 160 sentences (of approximately 13 min length). Features for modeling prosody are proposed, and their effectiveness is measured by interpreting the resulting trees. The misclassification rate of the decision tree was 14.46%, and the RMSEs of the regression trees, which predict pause duration and segmental duration, were 132 and 22 ms, respectively, for the test set. To understand the performance of our approach at run time in TTS systems, we trained and tested trees with the output of our text analyzer. The misclassification rate and the RMSE were 18.49% and 134 ms, respectively, for prosodic phrasing and pause duration on the test set. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Korea Adv Inst Sci & Technol, Dept Comp Sci, Taejon 305701, South Korea. RP Lee, S (reprint author), Korea Adv Inst Sci & Technol, Dept Comp Sci, 373-1 Kusong Dong, Taejon 305701, South Korea. EM shlee@bulsai.kaist.ac.kr; yhoh@cs.kaist.ac.kr RI Oh, Yung-Hwan/C-1915-2011 CR BECKMAN M, 1996, LABELING CONVENTIONS Breiman L., 1984, WADSWORTH STAT PROBA CHOU PA, 1991, IEEE T PATTERN ANAL, V13, P340, DOI 10.1109/34.88569 COVINGTON MA, 1990, AI199001 U GEORG ART Devore J.L., 1991, PROBABILITY STAT ENG Esposito F, 1997, IEEE T PATTERN ANAL, V19, P476, DOI 10.1109/34.589207 FUJIO S, 1995, P ICASSP, P604 HAMMING RW, 1980, CODING INFORMATION T Haykin S., 1994, NEURAL NETWORKS HIRSCHBERG J, 1993, ARTIF INTELL, V63, P305, DOI 10.1016/0004-3702(93)90020-C Hirschberg J, 1996, SPEECH COMMUN, V18, P281, DOI 10.1016/0167-6393(96)00017-9 Huang X.D., 1990, HIDDEN MARKOV MODELS KIM CH, 1994, COMPUTER PROCESSING, V8, P105 LEE HB, 1989, STANDARD KOREAN PRON LEE SH, 1997, P SUMM M AC SOC KOR, P293 LEE SH, 1996, P ICSLP, P1692 LEE YJ, 1998, P INT C SPOK LANG PR, P1995 LJOLJE A, 1986, IEEE T ACOUST SPEECH, V34, P1074, DOI 10.1109/TASSP.1986.1164948 MANNA F, 1995, P EUROSPEECH, P589 Ostendorf M., 1994, Computational Linguistics, V20 Riley M., 1992, TALKING MACHINES THE, P265 Ross K, 1996, COMPUT SPEECH LANG, V10, P155, DOI 10.1006/csla.1996.0010 SAFAVIAN SR, 1991, IEEE T SYST MAN CYB, V21, P660, DOI 10.1109/21.97458 SAGISAKA Y, 1992, P ICASSP Sagisaka Y., 1992, P ICSLP, P357 Seong C.-J, 1995, THESIS SEOUL NATL U SEONG CJ, 1997, P EUROSPEECH, P755 Silverman K., 1992, P INT C SPOK LANG PR, P867 Sproat R, 1998, MULTILINGUAL TEXT-TO-SPEECH SYNTHESIS: THE BELL LABS APPROACH, P89 TERKEN J, 1995, SPEECH CODING SYNTHE, P611 Traber C, 1992, TALKING MACHINES THE, P287 Wang M. Q., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90025-Y WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 NR 33 TC 20 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD AUG PY 1999 VL 28 IS 4 BP 283 EP 300 DI 10.1016/S0167-6393(99)00014-X PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 223DJ UT WOS:000081824700002 ER PT J AU Hanna, P Ming, J Smith, FJ AF Hanna, P Ming, J Smith, FJ TI Inter-frame dependence arising from preceding and succeeding frames - Application to speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; HMM; interframe dependence ID HMM AB This paper extends the notion of capturing temporal information within an HMM framework by permitting the observed frame not only to be dependent upon preceding frames, but also upon succeeding frames. In particular the IFD-HMM (Ming and Smith, 1996) is extended to support any number of preceding and/or succeeding frame dependencies. The means through which such a dependency might be integrated into an HMM framework are explored, and details given of the resultant changes to the IFD-HMM. Experimental results are provided, contrasting the use of bi-directional frame dependencies to the use of preceding only frame dependencies and exploring how such dependencies can be best employed. It was found that a dependency upon succeeding frames enabled dynamic spectral information not found in the preceding frames to be usefully employed, resulting in a significant increase in the recognition accuracy. It was also found that the use of frame dependencies proved to be a more effective means of increasing recognition accuracy than the use of multiple mixtures. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Queens Univ Belfast, Sch Elect Engn & Comp Sci, Belfast BT7 1NN, Antrim, North Ireland. RP Hanna, P (reprint author), Queens Univ Belfast, Sch Elect Engn & Comp Sci, Belfast BT7 1NN, Antrim, North Ireland. EM p.hanna@qub.ac.uk; J.Ming@qub.ac.uk; FJ.Smith@qub.ac.uk CR Baum L. E., 1972, INEQUALITIES, V3, P1 HANNA P, EUR 97, V3, P1167 HARTE H, 1996, ICSLP 96, P933 JUANG BH, 1985, AT&T TECH J, V64, P1235 KENNY P, 1990, IEEE T ACOUST SPEECH, V38, P220, DOI 10.1109/29.103057 Ming J, 1996, COMPUT SPEECH LANG, V10, P229, DOI 10.1006/csla.1996.0012 PALIWAL KK, 1993, P ICASSP 93, P209 SMITH FJ, 1995, P ICASSP 95, P209 TAKAHASHI S, 1993, P ICASSP 93, P219 Wellekens C. J., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Woodland P., 1992, P ICASSP 92, P509, DOI 10.1109/ICASSP.1992.225860 WOODLAND PC, 1991, P ICASSP 91, P545, DOI 10.1109/ICASSP.1991.150397 NR 12 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD AUG PY 1999 VL 28 IS 4 BP 301 EP 312 DI 10.1016/S0167-6393(99)00019-9 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 223DJ UT WOS:000081824700003 ER PT J AU Jiang, H Hirose, K Huo, Q AF Jiang, H Hirose, K Huo, Q TI Improving Viterbi Bayesian predictive classification via sequential Bayesian learning in robust speech recognition SO SPEECH COMMUNICATION LA English DT Article DE Bayesian predictive classification (BPC); Viterbi BPC (VBPC); sequential Bayesian learning; robust speech recognition; natural conjugate prior AB In this paper, we extend our proposed Viterbi Bayesian predictive classification (VBPC) algorithm to a new class of prior probability density function (pdf), namely a family of natural conjugate prior pdf's of the complete-data density in continuous density hidden Markov model (CDHMM) and their mixtures. In this way, we can on-line adapt the prior pdf via a sequential Bayesian learning algorithm when some new data are available, so that the performance of VBPC can be continuously improved. Moreover, we also study a sequential Bayesian learning strategy for CDHMM based on a finite mixture approximation of its prior/posterior density which attempts to derive a more accurate prior pdf to describe the unknown mismatches. The experimental results on a speaker-independent recognition task of isolated Japanese digits confirm the viability and the usefulness of the proposed method. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Tokyo, Dept Informat & Commun Engn, Bunkyo Ku, Tokyo 1138656, Japan. Univ Waterloo, Dept Elect & Comp Engn, Waterloo, ON N2L 3G1, Canada. Univ Hong Kong, Dept Informat Syst & Comp Sci, Hong Kong, Peoples R China. RP Hirose, K (reprint author), Univ Tokyo, Dept Informat & Commun Engn, Bunkyo Ku, Hongo 7-3-1, Tokyo 1138656, Japan. EM hjiang@crg3.uwaterloo.ca; hirose@gavo.t.u-tokyo.ac.jp; qhuo@csis.hku.hk CR Bernardo J. M., 1988, BAYESIAN STATISTICS, V3, P67 Furui S., 1997, P ESCA NATO TUT RES, P11 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 HUO Q, 1997, P EUR C SPEECH COMM, P1847 Huo Q, 1997, IEEE T SPEECH AUDI P, V5, P161 HUO Q, 1997, P INT C AC SPEECH SI HUO Q, 1997, UNPUB IEEE T SPEECH JIANG H, 1999, IN PRESS IEEE T SPEE, V7 JIANG H, 1997, P INT C AC SPEECH SI Lee CH, 1998, SPEECH COMMUN, V25, P29, DOI 10.1016/S0167-6393(98)00028-4 Matsui T, 1998, COMPUT SPEECH LANG, V12, P41, DOI 10.1006/csla.1997.0036 Merhav N, 1993, IEEE T SPEECH AUDI P, V1, P90, DOI 10.1109/89.221371 MOKBEL C, 1997, P ESCA WORKSH ROB SP, P211 SMITH AFM, 1985, BAYESIAN STAT, V2 Titterington DM, 1985, STAT ANAL FINITE MIX NR 15 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD AUG PY 1999 VL 28 IS 4 BP 313 EP 326 DI 10.1016/S0167-6393(99)00018-7 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 223DJ UT WOS:000081824700004 ER PT J AU Moller, S Schonweiler, R AF Moller, S Schonweiler, R TI Analysis of infant cries for the early detection of hearing impairment SO SPEECH COMMUNICATION LA English DT Article DE infant cry; perception; hearing impairment; vocal development; automatic classification ID ACOUSTIC FEATURES; NEURAL NETWORKS; RECOGNITION; PERCEPTION AB The basic hypothesis is that cry vocalizations of hearing-impaired infants differ from those of their counterparts with normal hearing abilities due to the lack of auditory feedback. This assumption, based on observations made by clinical experts, is investigated by means of auditory experiments with naive and expert listeners, and by signal analysis of the cries. The listening experiment shows that it is possible for experts to auditorily classify cries for both infant groups, based on the voice-related and melodic cry features. The cries of profoundly hearing-impaired infants are different regarding their perceived sound, rhythm and melody. The sound may well be correlated to spectral characteristics, and melodic and rhythmic parameters are extracted which differ significantly for the two infant groups. The findings are discussed in the context of a cry production model. The extracted signal parameters enable an automatic classification of the cries by means of topological feature maps, which may later be used as the basis for an early supplementary diagnostic tool. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Ruhr Univ Bochum, Inst Kommunikat Akust, D-44780 Bochum, Germany. Hannover Med Sch, Dept Commun Disorders, Hannover, Germany. RP Moller, S (reprint author), Ruhr Univ Bochum, Inst Kommunikat Akust, Univ Str 150, D-44780 Bochum, Germany. EM moeller@ika.ruhr-uni-bochum.de CR BAXT WG, 1995, LANCET, V346, P1135, DOI 10.1016/S0140-6736(95)91804-3 CLEMENT CJ, 1995, P I PHON SCI AMST, V19, P25 CULLEN JK, 1968, J SPEECH HEAR RES, V11, P85 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 EILERS RE, 1994, J PEDIATR-US, V124, P199, DOI 10.1016/S0022-3476(94)70303-5 Fujisaki H., 1983, PRODUCTION SPEECH, P39 GOLUB HL, 1980, THESIS MIT MA Gravel Judith S., 1998, P1 GUSTAFSON GE, 1989, CHILD DEV, V60, P772, DOI 10.1111/j.1467-8624.1989.tb03508.x JONES MC, 1971, J COMMUN DISORD, V4, P310, DOI 10.1016/0021-9924(71)90010-4 KOHONEN T, 1990, P IEEE, V78, P1464, DOI 10.1109/5.58325 KREIMAN J, 1990, J SPEECH HEAR RES, V33, P103 KUHL PK, 1996, J ACOUST SOC AM, V100, P1 LEINONEN L, 1992, J SPEECH HEAR RES, V35, P287 MERSDORF J, 1997, P ESCA TUT RES WORKS MICHAELIS D, 1995, FORTSCHRITTE AKUSTIK MORSBACH G, 1979, J CHILD LANG, V6, P175 Osgood Charles E., 1957, MEASUREMENT MEANING Partanen T J, 1967, Ann Paediatr Fenn, V13, P56 ROSENHOUSE J, 1980, J PHONETICS, V8, P135 SCHULTEFORTKAMP B, 1994, GERAUSCHE BEURTEILEN SHEPPARD WC, 1968, J SPEECH HEAR RES, V11, P94 TONKOVAYAMPOLSK.RV, 1973, STUDIES CHILD LANGUA, P128 Truby H.
M, 1965, ACTA PAEDIATR SC S, V163, P8 TSUKAMOTO T, 1990, J ACOUST SOC AM, V87, pS73, DOI 10.1121/1.2028351 TSUKAMOTO T, 1989, J ACOUST SOC AM, V85, pS53, DOI 10.1121/1.2027023 WAIBEL A, 1989, IEEE T ACOUST SPEECH, V37, P1888, DOI 10.1109/29.45535 WERMKE K, 1986, THESIS HUMBOLD U BER ZESKIND PS, 1978, CHILD DEV, V49, P580, DOI 10.1111/j.1467-8624.1978.tb02357.x ZESKIND PS, 1993, DEV PSYCHOBIOL, V26, P321, DOI 10.1002/dev.420260603 ZWICKER E, 1961, J ACOUST SOC AM, V33, P248, DOI 10.1121/1.1908630 NR 31 TC 19 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1999 VL 28 IS 3 BP 175 EP 193 DI 10.1016/S0167-6393(99)00016-3 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 218TQ UT WOS:000081568100001 ER PT J AU Potamianos, A Maragos, P AF Potamianos, A Maragos, P TI Speech analysis and synthesis using an AM-FM modulation model SO SPEECH COMMUNICATION LA English DT Article DE multiband demodulation; energy separation algorithm; AM-FM modulation model; pitch tracking; AM-FM vocoder; speech synthesis ID ENERGY OPERATORS; DEMODULATION; SEPARATION; FREQUENCY; PARALLEL; CASCADE; SIGNAL AB In this paper, the AM-FM modulation model is applied to speech analysis, synthesis and coding. The AM-FM model represents the speech signal as the sum of formant resonance signals each of which contains amplitude and frequency modulation. Multiband filtering and demodulation using the energy separation algorithm are the basic tools used for speech analysis. First, multiband demodulation analysis (MDA) is applied to the problem of fundamental frequency estimation using the average instantaneous frequency as estimates of pitch harmonics. The MDA pitch tracking algorithm is shown to produce smooth and accurate fundamental frequency contours. Next, the AM-FM modulation vocoder is introduced, which represents speech as the sum of resonance signals. A time-varying filterbank is used to extract the formant bands and then the energy separation algorithm is used to demodulate the resonance signals into the amplitude envelope and instantaneous frequency signals. Efficient modeling and coding (at 4.8-9.6 kbits/sec) algorithms are proposed for the amplitude envelope and instantaneous frequency of speech resonances. Finally, the perceptual importance of modulations in speech resonances is investigated and it is shown that amplitude modulation patterns are both speaker and phone dependent. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Lucent Technol, Bell Labs, Murray Hill, NJ 07974 USA. Natl Tech Univ Athens, Dept ECE, GR-15773 Athens, Greece. RP Potamianos, A (reprint author), Lucent Technol, Bell Labs, 600 Mt Ave,Room 2D 463, Murray Hill, NJ 07974 USA. EM potam@research.bell-labs.com CR Atal B. S., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing BOVIK AC, 1993, IEEE T SIGNAL PROCES, V41, P3245, DOI 10.1109/78.258071 Campbell J. P.
Jr., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90106-U Cohen L., 1992, TIME FREQUENCY SIGNA Flanagan J., 1972, SPEECH ANAL SYNTHESI FLANAGAN JL, 1980, J ACOUST SOC AM, V68, P412, DOI 10.1121/1.384752 GEORGE EB, 1991, THESIS GEORGIA I TEC HAVLICK JP, 1996, THESIS U TEXAS AUSTI HOLMES JN, 1983, SPEECH COMMUN, V2, P251, DOI 10.1016/0167-6393(83)90044-4 JANKOWSKI CR, 1996, THESIS MIT CAMBRIDGE KAISER JF, 1990, P 4 IEEE DSP WORKSH KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 Lu S, 1996, IEEE T SIGNAL PROCES, V44, P773 MARAGOS P, 1993, IEEE T SIGNAL PROCES, V41, P3024, DOI 10.1109/78.277799 Potamianos A, 1996, J ACOUST SOC AM, V99, P3795, DOI 10.1121/1.414997 Maragos P., 1991, P IEEE INT C AC SPEE, P421, DOI 10.1109/ICASSP.1991.150366 MARAGOS P, 1993, IEEE T SIGNAL PROCES, V41, P1532, DOI 10.1109/78.212729 Mcaulay R. J., 1990, P INT C AC SPEECH SI, P249 MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 McEachern R., 1992, Proceedings of the IEEE-SP International Symposium Time-Frequency and Time-Scale Analysis (Cat.No.92TH0478-8), DOI 10.1109/TFTSA.1992.274124 Potamianos A., 1995, THESIS HARVARD U CAM POTAMIANOS A, 1994, SIGNAL PROCESS, V37, P95, DOI 10.1016/0165-1684(94)90169-4 POTAMIANOS A, 1997, P EUR C SPEECH COMM, P1355 QUATIERI TF, 1997, COMMUNICATION Quatieri TF, 1997, IEEE T SPEECH AUDI P, V5, P465, DOI 10.1109/89.622571 RAMALHO MA, 1994, THESIS STATE U NEW J RAO P, 1996, P ICASSP 96 ATL GEOR SANTHANAM B, 1998, THESIS GEORGIA I TEC Secrest B. G., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing Stylianou Y., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607826 SUSSMAN RB, 1996, THESIS RUTGERS STATE TEAGER HM, 1990, NATO ADV SCI I D-BEH, V55, P241 Tremain T.E., 1982, SPEECH TECHNOLOG APR, P40 NR 33 TC 25 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1999 VL 28 IS 3 BP 195 EP 209 DI 10.1016/S0167-6393(99)00012-6 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 218TQ UT WOS:000081568100002 ER PT J AU Arslan, LM AF Arslan, LM TI Speaker Transformation Algorithm using Segmental Codebooks (STASC) SO SPEECH COMMUNICATION LA English DT Article DE voice conversion; speaker transformation; codebook; line spectral frequencies; hidden Markov models; time-varying filter; overlap-add analysis ID VOICE CONVERSION; NETWORKS AB This paper presents a new voice conversion algorithm which modifies the utterance of a source speaker to sound like speech from a target speaker. We refer to the method as Speaker Transformation Algorithm using Segmental Codebooks (STASC). A novel method is proposed which finds accurate alignments between source and target speaker utterances. Using the alignments, source speaker acoustic characteristics are mapped to target speaker acoustic characteristics. The acoustic parameters included in the mapping are vocal tract, excitation, intonation, energy, and duration characteristics. Informal listening tests suggest that convincing voice conversion is achieved while maintaining high speech quality.
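The mapping from source to target acoustic characteristics described above can be pictured as a soft codebook lookup: a source spectral vector is weighted against the source codebook entries and converted as the same weighted combination of the paired target entries. A minimal sketch in that spirit, assuming paired line-spectral-frequency codebooks and an exponential distance weighting; the codebooks, the weighting rule, and gamma are illustrative stand-ins, not the STASC implementation.

# Soft codebook mapping of a line-spectral-frequency (LSF) vector: compute
# posterior-like weights over the source codebook and apply them to the
# paired target codebook. Codebooks below are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
src_codebook = np.sort(rng.uniform(0, np.pi, (8, 10)), axis=1)  # 8 entries, order-10 LSF
tgt_codebook = np.sort(rng.uniform(0, np.pi, (8, 10)), axis=1)

def convert(lsf_src, gamma=2.0):
    d = np.linalg.norm(src_codebook - lsf_src, axis=1)   # distance to each entry
    w = np.exp(-gamma * d)
    w /= w.sum()                                         # weights over codebook entries
    return w @ tgt_codebook                              # weighted sum of target centroids

lsf_out = convert(np.sort(rng.uniform(0, np.pi, 10)))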
The performance of the proposed system is also evaluated on a simple Gaussian mixture model-based speaker identification system, and the results show that the transformed speech is assigned higher likelihood by the target speaker model when compared to the source speaker model. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Bogazici Univ, Elect & Elect Dept, TR-80815 Bebek, Turkey. RP Arslan, LM (reprint author), Bogazici Univ, Elect & Elect Dept, TR-80815 Bebek, Turkey. EM arslanle@boun.edu.tr RI Arslan, Levent/D-6377-2015 OI Arslan, Levent/0000-0002-6086-8018 CR Abe M., 1988, P IEEE ICASSP, P565 Arslan L. M., 1997, P EUROSPEECH RHOD GR, V3, P1347 ARSLAN LM, 1995, P IEEE INT C AC SPEE, V1, P812 BAUDOIN G, 1996, P IEEE INT C SPOK LA, P1405 CHILDERS DG, 1995, SPEECH COMMUN, V16, P127, DOI 10.1016/0167-6393(94)00050-K CROSMER JR, 1985, THESIS GEORGIA I TEC HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 Itakura F, 1975, J ACOUST SOC AM, V57 IWAHASHI N, 1995, SPEECH COMMUN, V16, P139, DOI 10.1016/0167-6393(94)00051-B KUWABARA H, 1995, SPEECH COMMUN, V16, P165, DOI 10.1016/0167-6393(94)00053-D Laroia R., 1991, P IEEE INT C AC SPEE, P641, DOI 10.1109/ICASSP.1991.150421 Lee KS., 1996, P ICSLP, P1401 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z NARENDRANATH M, 1995, SPEECH COMMUN, V16, P207, DOI 10.1016/0167-6393(94)00058-I PALIWAL KK, 1995, P 4 EUR C SPEECH COM PELLOM BL, 1997, P IEEE INT C AC SPEE, V2, P943 STYLIANOU Y, 1995, P EUROSPEECH MADR SP WIGHTMAN C, 1994, ALIGNER USERS MANUAL ZERO A, 1993, ACOUSTICAL ENV ROBUS NR 19 TC 62 Z9 72 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1999 VL 28 IS 3 BP 211 EP 226 DI 10.1016/S0167-6393(99)00015-1 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 218TQ UT WOS:000081568100003 ER PT J AU Yuo, KH Wang, HC AF Yuo, KH Wang, HC TI Joint estimation of feature transformation parameters and Gaussian mixture model for speaker identification SO SPEECH COMMUNICATION LA English DT Article DE Karhunen-Loeve transform; transformation embedded GMM; generalized covariance matrices ID SPEECH RECOGNITION; MAXIMUM-LIKELIHOOD; ADAPTATION AB The Karhunen-Loeve transform is a well-known technique for orthonormally mapping features into an uncorrelated space. The Gaussian mixture model (GMM) with diagonal covariance matrices is a popular technique for modeling the speech feature distributions. These two techniques can be combined to improve the performance of speaker or speech recognition systems. The drawback of the combination is that both sets of parameters are not optimized together. This paper presents a new model structure that integrates both orthonormal transformation and diagonal-covariance Gaussian mixture into a unified framework. All parameters of this model are obtained simultaneously by maximum likelihood estimation. This idea is further extended to attain a new GMM with generalized covariance matrices (GC-GMM). The traditional GMM with diagonal or full covariance matrices is a special case of the GC-GMM. The proposed method is demonstrated on a 100-person connected digit database for text independent speaker identification. In comparison with the traditional GMM, the computational complexity and the number of parameters can be greatly reduced with no degradation in system performance. (C) 1999 Elsevier Science B.V. All rights reserved.
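The model structure in the joint-estimation abstract above evaluates diagonal-covariance Gaussians on orthonormally transformed features. A minimal sketch of that likelihood with the transform held fixed; the paper estimates the transform and the mixture jointly by maximum likelihood, which this fragment does not attempt, and all values below are random stand-ins.

# Log-likelihood of a transformation-embedded diagonal GMM: the feature is
# passed through an orthonormal matrix U before the diagonal Gaussians are
# evaluated, then the mixture components are combined with log-sum-exp.
import numpy as np

def loglik(x, U, weights, means, variances):
    y = U @ x                                     # orthonormally transformed feature
    comp = (np.log(weights)
            - 0.5 * np.sum(np.log(2 * np.pi * variances)
                           + (y - means) ** 2 / variances, axis=1))
    m = comp.max()
    return m + np.log(np.sum(np.exp(comp - m)))   # stable log-sum-exp over mixtures

d, k = 4, 3
rng = np.random.default_rng(1)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))  # random orthonormal transform
print(loglik(rng.standard_normal(d), U,
             np.full(k, 1.0 / k), rng.standard_normal((k, d)), np.ones((k, d))))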
C1 Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 30043, Taiwan. RP Wang, HC (reprint author), Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 30043, Taiwan. EM hcwang@ee.nthu.edu.tw CR Chengalvarayan R, 1997, IEEE T SPEECH AUDI P, V5, P243, DOI 10.1109/89.568731 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659 Flury B., 1988, COMMON PRINCIPAL COM FLURY BN, 1984, J AM STAT ASSOC, V79, P892, DOI 10.2307/2288721 Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Gales M.J.F., 1998, P ICASSP 98, VII, P657, DOI 10.1109/ICASSP.1998.675350 GALES MJF, 1997, CUEDFINFENGTR287 HUANG XD, 1990, INFORMATION TECHNOLO, V7 LI H, 1995, P EUROSPEECH 95, P363 LJOLJE A, 1994, COMPUT SPEECH LANG, V8, P223, DOI 10.1006/csla.1994.1011 Rabiner L, 1993, FUNDAMENTALS SPEECH REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 Tseng B., 1992, P ICASSP 92, VII, P161 YUO KH, 1997, P EUROSPEECH 97, V5, P2279 NR 17 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1999 VL 28 IS 3 BP 227 EP 241 DI 10.1016/S0167-6393(99)00017-5 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 218TQ UT WOS:000081568100004 ER PT J AU Hung, WW Wang, HC AF Hung, WW Wang, HC TI Smoothing hidden Markov models by using an adaptive signal limiter for noisy speech recognition SO SPEECH COMMUNICATION LA English DT Article DE hidden Markov model; hard limiter; adaptive signal limiter; autocorrelation function; arcsin transformation ID FAMILY AB When a speech recognition system is deployed in the real world, environmental interference will make noisy speech signals and reference models mismatched and cause serious degradation in recognition accuracy. To deal with the effect of environmental mismatch, a family of signal limiters has been successfully applied to a template-based DTW recognizer to reduce the variability of speech features in noisy conditions. Though simulation results indicate that heavy smoothing can effectively reduce the variability of speech features at low signal-to-noise ratio (SNR), it would also cause the loss of information in speech features. Therefore, we suggest that the smoothing factor of a signal limiter should be related to SNR and adapted on a frame-by-frame basis. In this paper, an adaptive signal limiter (ASL) is proposed to smooth the instantaneous and dynamic spectral features of reference models and test speech. By smoothing spectral features, the smoothed covariance matrices of reference models can be obtained by means of maximum likelihood (ML) estimation. A speech recognition task for multispeaker isolated Mandarin digits has been conducted to evaluate the effectiveness and robustness of the proposed method. Experimental results indicate that the adaptive signal limiter can achieve significant improvement in noisy conditions and is more robust than the hard limiter over a wider range of SNR values. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 30043, Taiwan. Ming Chi Inst Technol, Dept Elect Engn, Taishan 243, Taiwan. RP Wang, HC (reprint author), Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 30043, Taiwan.
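One way to read the frame-adaptive smoothing described above: map each frame's estimated SNR to a smoothing factor in [0, 1] and shrink the feature trajectory toward its mean by that amount, so that low-SNR frames are limited heavily and high-SNR frames pass nearly unchanged. A speculative sketch only; the paper's actual limiter family and its SNR mapping differ in detail, and the linear ramp and its endpoints here are assumptions.

# Frame-adaptive limiting of a feature trajectory: the smoothing factor s
# is 1 (full limiting) at or below `lo` dB SNR and 0 (no limiting) at or
# above `hi` dB, with a linear ramp in between.
import numpy as np

def adaptive_limit(traj, snr_db, lo=-5.0, hi=20.0):
    """traj: (T,) feature trajectory; snr_db: (T,) per-frame SNR estimates."""
    mean = traj.mean()
    s = 1.0 - np.clip((snr_db - lo) / (hi - lo), 0.0, 1.0)
    return mean + (1.0 - s) * (traj - mean)      # shrink deviations toward the mean

smoothed = adaptive_limit(np.random.randn(100), np.random.uniform(-10, 30, 100))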
EM hcwang@ee.nthu.edu.tw CR Carlson BA, 1994, IEEE T SPEECH AUDI P, V2, P97, DOI 10.1109/89.260341 CHIEN JT, 1997, THESIS NATL TSING HU FLORES JAN, 1992, IEEE INT C AC SPEECH, V1, P2409 GALES MJF, 1995, COMPUTER SPEECH LANG, V4, P352 JUANG BH, 1990, IEEE T ACOUST SPEECH, V38, P1639, DOI 10.1109/29.60082 LEE CH, 1993, SPEECH COMMUN, V12, P383, DOI 10.1016/0167-6393(93)90085-Y LEE LM, 1994, P INT C SPOK LANG PR, P1011 LEE LM, 1995, ELECTRON LETT, V31, P616, DOI 10.1049/el:19950410 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P1659, DOI 10.1109/29.46548 RABINER L, 1993, FUNDAMENTALS SPEECH, P112 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 VARGA AP, 1992, IEEE INT C AC SPEECH, P845 VARGA AP, 1992, NOISEX 92 STUDY EFFE NR 13 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1999 VL 28 IS 3 BP 243 EP 260 DI 10.1016/S0167-6393(99)00011-4 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 218TQ UT WOS:000081568100005 ER PT J AU Jilka, M Mohler, G Dogil, G AF Jilka, M Mohler, G Dogil, G TI Rules for the generation of ToBI-based American English intonation SO SPEECH COMMUNICATION LA English DT Article DE American English intonation; ToBI; rule-based F0 generation; synthesis; evaluation ID SPEECH SYNTHESIS AB This study presents an approach to the generation of American English intonation based on prescriptive rules that define the respective features of certain tone labels that in turn represent linguistically relevant F0 configurations. In accordance with the principles of the Tone Sequence Model the F0 contour is analyzed as a series of discrete target values that are connected by means of transitional functions. The target values are associated either with stressed syllables (pitch accents) or the margins of the phrase (phrasal tones). The targets' exact position is represented relative to pitch range and time. All tone labels are examined according to these parameters and the results are then converted into a set of rules that allows the generation of an F0 contour. Tones and Break Indices (ToBI), a system for transcribing the intonation patterns of American English, provides an inventory of tone labels and a set of example utterances available for analysis. Utterances from ToBI and the Boston Radio News Corpus were used for the evaluation of the generation rules: root mean squared error (RMSE) and correlation between generated and original contour were determined, and in a perception test native speakers assessed the quality of the resynthesized contours which, in general, were judged to sound natural and show few differences to the corresponding originals. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Stuttgart, Inst Nat Language Proc, Chair Expt Phonet, D-70174 Stuttgart, Germany. RP Jilka, M (reprint author), Univ Stuttgart, Inst Nat Language Proc, Chair Expt Phonet, Azenbergstr 12, D-70174 Stuttgart, Germany. EM jilka@ims.uni-stuttgart.de CR ANDERSON MD, 1984, P INT C AC SPEECH SI Beckman M. E., 1994, GUIDELINES TOBI LABE Beckman M. E., 1994, TOBI ANNOTATION CONV BECKMAN ME, 1997, PROGR SPEECH SYNTHES BLACK A, 1997, HCRCTR83 HUM COMM RE Black A. W., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat.
No.96TH8206), DOI 10.1109/ICSLP.1996.607872 BOLINGER D, 1958, WORD, V7, P199 BRUCE G, 1995, P 13 INT C PHON SCI, V2, P28 Bruce Gosta, 1977, SWEDISH WORD ACCENTS DALY N, 1990, P INT C SPOK LANG PR DUSTERHOFF K, 1997, P ESCA WORKSH INT TH, P107 Fujisaki H., 1988, VOCAL PHYSL VOICE PR, P347 Garding E, 1983, PROSODY MODELS MEASU, P11 Halliday M. A. K., 1967, INTONATION GRAMMAR B Hirschberg J., 1986, P 24 ANN M ASS COMP, P136, DOI 10.3115/981131.981152 Hirschberg J., 1995, P 13 INT C PHON SCI, V2, P36 HOUSE D, 1996, P ICSLP, P949 House David, 1990, TONAL PERCEPTION SPE KOHLER KJ, 1991, J PHONETICS, V19, P121 KUTIK EJ, 1983, J ACOUST SOC AM, V73, P1731, DOI 10.1121/1.389397 MAYER J, 1997, INTONATION BEDEUTUNG MOHLER G, 1998, THESIS U STUTTGART MOHLER G, 1995, P EUR MADR, P1019 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z OCONOR JD, 1959, INTONATION C ENGLISH Ostendorf M., 1995, ECS95001 BOST U EL C PIERREHUMBERT J, 1981, J ACOUST SOC AM, V70, P985, DOI 10.1121/1.387033 PIERREHUMBERT J, 1990, SYS DEV FDN, P271 Pierrehumbert J, 1980, THESIS MIT Portele T, 1997, SPEECH COMMUN, V21, P61, DOI 10.1016/S0167-6393(96)00072-6 ROSS K, 1995, THESIS BOSTON U SAG O, 1986, 11 REG M CHIC LING S, P498 Silverman K., 1992, P INT C SPOK LANG PR, V2, P867 Silverman K. E. A., 1987, THESIS U CAMBRIDGE THORSEN NG, 1988, ARIPUC, V22, P221 VANDERSL.R, 1972, LANGUAGE, V48, P819, DOI 10.2307/411990 VANSANTEN J, 1997, P ESCA WORKSH INT TH, P321 WELLS J, 1992, SAMUCL307 PHON LING ZIMMER DE, 1988, KOMMT MUNSCH SPRACHE NR 39 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 1999 VL 28 IS 2 BP 83 EP 108 DI 10.1016/S0167-6393(99)00008-4 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 205RT UT WOS:000080834900001 ER PT J AU Steeneken, HJM Houtgast, T AF Steeneken, HJM Houtgast, T TI Mutual dependence of the octave-band weights in predicting speech intelligibility SO SPEECH COMMUNICATION LA English DT Article DE speech intelligibility; prediction; objective measurement; octave-band contribution ID ROOM ACOUSTICS AB Current objective measures for predicting the intelligibility of speech by an index assume that this can be obtained by simple addition of the contributions of individual frequency bands. The Articulation Index (AI, and the related Speech Intelligibility Index) and the Speech Transmission Index (STI) are based on this assumption. There is evidence that the underlying assumption of additive (mutually independent) contributions from a number of frequency bands is not optimal and may lead to erroneous prediction of the intelligibility for conditions with a limited or with a discontiguous frequency transfer. Depending on the frequency band considered, errors between 0.1 and 0.25 STI may occur. An experiment was designed to estimate the contribution of individual frequency bands, and their mutual dependence. For this purpose the speech spectrum was subdivided into seven octave bands with center frequencies ranging from 125 Hz to 8 kHz. For 26 different combinations of three or more octave bands, the Consonant-Vowel-Consonant-word score (CVC, nonsense words) was determined at three signal-to-noise ratios. It was found that successful prediction of the scores required a revised model which accounts for mutual dependency between adjacent octave bands. In this model a so-called redundancy correction is introduced. 
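The revised index that the abstract leads up to combines per-band contributions with a correction for adjacent-band redundancy: a weighted sum of the per-band transmission indices minus cross terms for neighbouring octave bands. A sketch of that functional form; the alpha and beta values below are illustrative placeholders, not the weights tabulated in the IEC standard.

# Band-weighted intelligibility index with a redundancy correction between
# adjacent octave bands: direct weighted sum minus sqrt cross terms.
import numpy as np

def sti(ti, alpha, beta):
    """ti: per-octave-band transmission indices (7,), each in [0, 1]."""
    ti = np.asarray(ti, dtype=float)
    direct = np.sum(alpha * ti)
    redundancy = np.sum(beta * np.sqrt(ti[:-1] * ti[1:]))  # adjacent-band overlap
    return direct - redundancy

alpha = np.full(7, 1.0 / 7)      # placeholder band weights
beta = np.full(6, 0.02)          # placeholder redundancy factors
print(sti([0.4, 0.5, 0.6, 0.7, 0.6, 0.5, 0.4], alpha, beta))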
Consequences for the existing objective measures are discussed. The presented results are included in the revised IEC standard (IEC 60268-16, 1998). (C) 1999 Elsevier Science B.V. All rights reserved. C1 TNO, Human Factors Res Inst, NL-3769 ZG Soesterberg, Netherlands. RP Steeneken, HJM (reprint author), TNO, Human Factors Res Inst, POB 23, NL-3769 ZG Soesterberg, Netherlands. CR ANSI, 1969, S351969 ANSI Dunn HK, 1940, J ACOUST SOC AM, V11, P278, DOI 10.1121/1.1916034 FLETCHER H, 1950, J ACOUST SOC AM, V22, P89, DOI 10.1121/1.1906605 Fletcher H, 1929, BELL SYST TECH J, V8, P806 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 GRANT KW, 1991, J ACOUST SOC AM, V89, P2952, DOI 10.1121/1.400733 HOUTGAST T, 1991, P EUR 91 GEN, P285 HOUTGAST T, 1973, ACUSTICA, V28, P66 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 *IEC, 1998, 6026816 IEC KRYTER KD, 1960, J ACOUST SOC AM, V32, P547, DOI 10.1121/1.1908140 KRYTER KD, 1962, J ACOUST SOC AM, V34, P1689, DOI 10.1121/1.1909094 Licklider J. C. R., 1959, PSYCHOL STUDY SCI, V1, P41 PAVLOVIC CV, 1987, J ACOUST SOC AM, V82, P413, DOI 10.1121/1.395442 POLLACK I, 1948, J ACOUST SOC AM, V20, P259, DOI 10.1121/1.1906369 STEENEKEN HJM, 1986, U98620 IZF TNO HFRI STEENEKEN HJM, 1991, P EUR 91 GEN, P1133 STEENEKEN HJM, 1992, THESIS U AMSTERDAM A Steeneken H. J. M., 1992, Digital speech processing, speech coding, synthesis and recognition STEENEKEN HJM, A13 TNO I PERC IZF STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 STUDEBAKER GA, 1987, J ACOUST SOC AM, V81, P1130, DOI 10.1121/1.394633 VANRAAIJ JL, 1991, 1991A7 IZF TNO HFRI NR 23 TC 46 Z9 46 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 1999 VL 28 IS 2 BP 109 EP 123 DI 10.1016/S0167-6393(99)00007-2 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 205RT UT WOS:000080834900002 ER PT J AU van Son, RJJH Pols, LCW AF van Son, RJJH Pols, LCW TI An acoustic description of consonant reduction SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ICSLP 96 Meeting CY 1996 CL PHILADELPHIA, PA DE acoustic consonant reduction; speech style; spectral balance; speech effort ID LOCUS EQUATIONS; CLEAR SPEECH; DUTCH VOWELS; FORMANT TRANSITIONS; LINGUISTIC STRESS; SPECTRAL BALANCE; SPEAKING STYLE; STOP PLACE; ARTICULATION; ENGLISH AB The acoustic consequences of the articulatory reduction of consonants remain largely unknown. Much more is known about acoustic vowel reduction. Whether the acoustical and perceptual consequences of articulatory consonant reduction are comparable in kind and extent to the consequences of vowel reduction is still an open question. In this study we compare acoustic data for 791 VCV realizations, containing 17 Dutch intervocalic consonants and 13 vowels, extracted from read speech from a single male speaker, to otherwise identical segments isolated from spontaneous speech. Five acoustic correlates of reduction were studied. Acoustic tracers of articulation were based on F2 slope differences and locus equations. Speech effort was assessed by measuring duration, spectral balance, and the intervocalic sound energy difference of consonants. On a global level, the data show that consonants reduce acoustically like vowels on all investigated accounts when the speaking style becomes informal or syllables become unstressed.
Methods that are sensitive to speech effort proved to be more reliable indicators of reduction than F2-based measures. On a more detailed level there are differences related to the type of consonant. The acoustic results suggest that articulatory reduction will decrease the intelligibility of consonants and vowels in comparable ways. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Amsterdam, Inst Phonet Sci, IFOTT, NL-1016 CG Amsterdam, Netherlands. RP van Son, RJJH (reprint author), Univ Amsterdam, Inst Phonet Sci, IFOTT, Herengracht 338, NL-1016 CG Amsterdam, Netherlands. EM rob@fon.hum.uva.nl CR BEINUM FJK, 1992, SPEECH COMMUN, V11, P439 BEINUM FJK, 1980, THESIS U AMSTERDAM Boersma P., 1996, PRAAT SYSTEM DOING P BOND ZS, 1994, SPEECH COMMUN, V14, P325, DOI 10.1016/0167-6393(94)90026-4 BYRD D, 1994, SPEECH COMMUN, V15, P39, DOI 10.1016/0167-6393(94)90039-6 Cutler A., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90004-0 CASSIDY S, 1995, PHONETICA, V52, P263 Charles-Luce J, 1997, LANG SPEECH, V40, P229 Chennoukh S, 1997, J ACOUST SOC AM, V102, P2380, DOI 10.1121/1.419622 CLARK J, 1990, INTRO PHONETICS PHON, P116 Cutler A, 1997, SPEECH COMMUN, V21, P3, DOI 10.1016/S0167-6393(96)00075-1 CUTLER A, 1991, SPEECH COMMUN, V10, P335, DOI 10.1016/0167-6393(91)90002-B CUTLER A, 1990, SPEECH COMMUN, V9, P485, DOI 10.1016/0167-6393(90)90024-4 DEJONG K, 1993, LANG SPEECH, V36, P197 DELSARTE P, 1986, IEEE T ACOUST SPEECH, V34, P470, DOI 10.1109/TASSP.1986.1164830 Dorman MF, 1996, J ACOUST SOC AM, V100, P3825, DOI 10.1121/1.417238 DUEZ D, 1995, J PHONETICS, V23, P407, DOI 10.1006/jpho.1995.0031 EEFTING W, 1991, J ACOUST SOC AM, V89, P412, DOI 10.1121/1.400475 FARNETANI E, 1995, P EUR 95 MAD, P2255 FLEGE JE, 1988, J ACOUST SOC AM, V84, P901, DOI 10.1121/1.396659 Fougeron C, 1997, J ACOUST SOC AM, V101, P3728, DOI 10.1121/1.418332 FOURAKIS M, 1991, J ACOUST SOC AM, V90, P1816, DOI 10.1121/1.401662 HARRINGTON J, 1994, LANG SPEECH, V37, P357 Hertrich I, 1997, J ACOUST SOC AM, V102, P523, DOI 10.1121/1.419725 KEATING PA, 1994, J PHONETICS, V22, P407 Kerkhoff J., 1984, LINGUISTICS NETHERLA, P111 Krull D., 1989, PERILUS, VX, P87 Laan GPM, 1997, SPEECH COMMUN, V22, P43, DOI 10.1016/S0167-6393(97)00012-5 LIEBERMAN P, 1963, LANG SPEECH, V6, P172 LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 MILLER JL, 1983, J ACOUST SOC AM, V73, P1751, DOI 10.1121/1.389399 MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 Noordhoek IM, 1997, J ACOUST SOC AM, V101, P498, DOI 10.1121/1.417993 OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 OSHAUGHNESSY D, 1987, SPEECH COMMUNICATION Palethorpe S, 1996, J ACOUST SOC AM, V100, P3843, DOI 10.1121/1.417240 POLS LCW, 1993, SPEECH COMMUN, V13, P135, DOI 10.1016/0167-6393(93)90065-S POLS LCW, 1973, J ACOUST SOC AM, V53, P1093, DOI 10.1121/1.1913429 RECASENS D, 1998, J ACOUST SOC AM, V102, P544 SCHMIDT K, 1995, J STAT COMPUT SIM, V52, P1, DOI 10.1080/00949659508811648 SLIS IH, 1985, THESIS U NIJMEGEN Sluijter A.M.C, 1995, THESIS U LEIDEN Sluijter AMC, 1997, J ACOUST SOC AM, V101, P503, DOI 10.1121/1.417994 Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955 SLUIJTER AMC, 1995, P EUR 95 MADR, P941 SUSSMAN HM, 1995, J ACOUST SOC AM, V97, P3112, DOI 10.1121/1.411873 SUSSMAN HM, 1993, J ACOUST SOC AM, V94, P1256, DOI 10.1121/1.408178 Sussman HM, 1997, J ACOUST SOC AM, V101, P2826, DOI 10.1121/1.418567 SUSSMAN HM, 1991, J
ACOUST SOC AM, V90, P1309, DOI 10.1121/1.401923 van Bergem Dick, 1995, THESIS U AMSTERDAM VANHEUWEN VJ, 1993, ANAL SYNTHESIS SPEEC van Kuijk D., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607923 VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5 VANSON R, 1993, THESIS U AMSTERDAM A VANSON RJJ, 1997, P EUR 97 RHOD, P319 van Son R. J. J. H., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607908 VANSON RJJ, 1997, P EUR 97 RHOD, P2135 VANSON RJJH, 1992, J ACOUST SOC AM, V92, P121, DOI 10.1121/1.404277 VANSON RJJH, 1990, J ACOUST SOC AM, V88, P1683, DOI 10.1121/1.400243 WILLEMS LF, 1986, IPO ANN PROGR REPORT, V21, P34 NR 62 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 1999 VL 28 IS 2 BP 125 EP 140 DI 10.1016/S0167-6393(99)00009-6 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 205RT UT WOS:000080834900003 ER PT J AU Minker, W AF Minker, W TI Design considerations for knowledge source representations of a stochastically-based natural language understanding component SO SPEECH COMMUNICATION LA English DT Article AB We describe a stochastic component for spoken natural language understanding in an application for train travel information retrieval in French. The aim is to discuss the design considerations for representing knowledge sources appropriately in such a parsing component. The development focuses on the design of a stochastic model topology that is optimally adapted in quality and complexity to the task model and the available training data. Another important issue concerns the iterative semantic labeling of the large amounts of data used for training the component. The parser has been evaluated on both corrected and uncorrected speech recognizer output transcriptions. (C) 1999 Elsevier Science B.V. All rights reserved. C1 CNRS, LIMSI, Spoken Language Proc Grp, F-91403 Orsay, France. RP Minker, W (reprint author), CNRS, LIMSI, Spoken Language Proc Grp, BP 133, F-91403 Orsay, France. EM minker@limsi.fr CR BENNACEF SK, 1995, P ESCA WORKSH SPOK D, P237 BRUCE B, 1975, ARTIF INTELL, V6, P327, DOI 10.1016/0004-3702(75)90020-X CASTANO MA, 1993, P EUR SEP, P1017 EPSTEIN M, 1996, P 1996 INT C AC SPEE, P176 Fillmore C. J., 1968, UNIVERSALS LINGUIST, P1 GAUVAIN JL, 1997, SPOKEN LANGUAGE COMP, P93 KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 Levin E., 1995, P DARPA SPEECH NAT L, P269 Life A., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607947 MINKER W, 1997, P EUR C SPEECH COMM, P1423 Minker W., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607775 Rabiner L., 1989, IEEE P, V77, P257, DOI 10.1109/5.18626 Schwartz R., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607771 NR 13 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
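Stochastic understanding components of the kind just described typically decode a word string into its most likely semantic label sequence with the Viterbi algorithm over an HMM-style model. A toy sketch of that decoding step; the vocabulary, label set, and probabilities below are invented for illustration and have nothing to do with the actual task model.

# Viterbi decoding of words into semantic labels under a toy HMM:
# logA holds label-transition log-probabilities, logB word-emission ones.
import numpy as np

labels = ["DEPART", "ARRIVE", "TIME"]
words = ["from", "paris", "to", "lyon", "at", "eight"]
logA = np.log(np.array([[0.6, 0.3, 0.1],
                        [0.2, 0.6, 0.2],
                        [0.1, 0.3, 0.6]]))
emit = {"from": [0.8, 0.1, 0.1], "paris": [0.7, 0.2, 0.1],
        "to": [0.1, 0.8, 0.1], "lyon": [0.2, 0.7, 0.1],
        "at": [0.1, 0.1, 0.8], "eight": [0.1, 0.1, 0.8]}
logB = {w: np.log(p) for w, p in emit.items()}

delta = logB[words[0]] + np.log(1.0 / 3)     # uniform initial label distribution
back = []
for w in words[1:]:
    scores = delta[:, None] + logA           # all predecessor/successor pairs
    back.append(scores.argmax(axis=0))       # best predecessor per label
    delta = scores.max(axis=0) + logB[w]
path = [int(delta.argmax())]
for bp in reversed(back):                    # trace the best path backwards
    path.append(int(bp[path[-1]]))
print([labels[i] for i in reversed(path)])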
PD JUN PY 1999 VL 28 IS 2 BP 141 EP 154 DI 10.1016/S0167-6393(99)00005-9 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 205RT UT WOS:000080834900004 ER PT J AU Ohtsuki, K Matsuoka, T Mori, T Yoshida, K Taguchi, Y Furui, S Shirai, K AF Ohtsuki, K Matsuoka, T Mori, T Yoshida, K Taguchi, Y Furui, S Shirai, K TI Japanese large-vocabulary continuous-speech recognition using a newspaper corpus and broadcast news SO SPEECH COMMUNICATION LA English DT Article DE large-vocabulary continuous-speech recognition; read newspaper speech; broadcast-news transcription; text corpus; statistical language model AB This paper describes large-vocabulary continuous-speech recognition (LVCSR) of Japanese newspaper speech read aloud and Japanese broadcast-news speech. It describes the first Japanese LVCSR experiments using morpheme-based statistical language models. The statistical language models were trained using a large text corpus constructed from several years of newspaper texts and our LVCSR system was evaluated using recorded newspaper speech read by 10 male speakers. It is difficult to train statistical n-gram language models for Japanese because Japanese sentences are written without spaces between words. This difficulty was overcome by segmenting sentences into words with a morphological analyzer and then training the n-gram language models using those words. The LVCSR system was constructed with the language models trained using the newspaper articles, and the acoustic models, which were phoneme hidden Markov models (HMMs) trained using 20 h of speech. The results for recognition of read newspaper speech with a 7k vocabulary were comparable to those for other languages. For the automatic transcription of broadcast-news speech with our LVCSR system, the language models had 20k word vocabularies and were trained using broadcast-news manuscripts. These models achieved better performance than the language models trained using newspaper texts. Our experiments indicate that LVCSR for Japanese works in much the same way as LVCSR for European languages. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Nippon Telegraph & Tel Corp, Human Interface Labs, Speech & Acoust Lab, Yokosuka, Kanagawa 2390847, Japan. Tokyo Inst Technol, Meguro Ku, Tokyo 1528552, Japan. Waseda Univ, Shinjuku Ku, Tokyo 1698555, Japan. RP Ohtsuki, K (reprint author), Nippon Telegraph & Tel Corp, Human Interface Labs, Speech & Acoust Lab, Room 420C,1-1 HikarinoOka, Yokosuka, Kanagawa 2390847, Japan. EM ohtsuki@nttspch.hil.ntt.co.jp CR BAKIS R, 1997, P ICASSP, P711 COOK GD, 1997, P ICASSP, P723 GAROFOLO JS, 1996, P DARPA SPEECH REC W, P15 Gauvain J., 1990, P ICSLP, P1097 GAUVAIN JL, 1997, P EUR, P907 GAUVAIN JL, 1995, P DARPA SPEECH REC W, P105 INOUE T, 1997, NTT R D, V46, P1103 KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 KUBALA F, 1997, P EUR 97 RHOD GREEC, P927 KUBALA F, 1988, P ICASSP, P291 LAMEL L, 1995, P IEEE AUT SPEECH RE, P51 MATSUOKA T, 1996, P DARPA SPEECH REC W, P137 MATSUOKA T, 1997, P ICASSP, V3, P1803 MATSUOKA T, 1997, P EUROSPEECH, V2, P9156 MATSUOKA T, 1996, P ICSLP, V1, P22, DOI 10.1109/ICSLP.1996.607005 MATSUOKA T, 1997, P DARPA SPEECH REC W, P181 Pallett D. S., 1995, P SPOK LANG TECHN WO, P5 PALLETT DS, 1996, P ARPA SPEECH REC WO, P27 Paul D., 1992, P ICSLP, P899 ROBINSON T, 1995, P ICASSP, P81 Schwartz R., 1996, AUTOMATIC SPEECH SPE, P429 STEENEKEN HJM, 1995, P EUROSPEECH, P1271 Woodland P.
C., 1996, P ARPA SPEECH REC WO, P99 WOODLAND PC, 1997, P ICASSP 97 MUN, P719 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 Young SJ, 1997, COMPUT SPEECH LANG, V11, P73, DOI 10.1006/csla.1996.0023 NR 26 TC 7 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 1999 VL 28 IS 2 BP 155 EP 166 DI 10.1016/S0167-6393(99)00006-0 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 205RT UT WOS:000080834900005 ER PT J AU Mohri, M Riley, M AF Mohri, M Riley, M TI Network optimizations for large-vocabulary speech recognition SO SPEECH COMMUNICATION LA English DT Article DE large-vocabulary speech recognition; search; network optimization; weighted finite-state transducers; stochastic automata AB The redundancy and the size of networks in large-vocabulary speech recognition systems can have a critical effect on their overall performance. We describe the use of two new algorithms: weighted determinization and minimization (Mohri, 1997a). These algorithms transform labeled recognition networks into equivalent ones that require much less time and space in large-vocabulary speech recognition. They are both optimal: weighted determinization reduces the number of alternatives at each state to the minimum, and weighted minimization reduces the size of deterministic networks to the smallest possible number of states and transitions. These algorithms generalize classical automata determinization and minimization to deal properly with the probabilities of alternative hypotheses and with the relationships between units (distributions, phones, words) at different levels in the recognition system. We illustrate their use in several applications, and report the results of our experiments. (C) 1999 Elsevier Science B.V. All rights reserved. C1 AT&T Bell Labs, Res, Florham Pk, NJ 07932 USA. RP Mohri, M (reprint author), AT&T Bell Labs, Res, Room B019,180 Pk Ave,POB 971, Florham Pk, NJ 07932 USA. EM mohri@research.att.com; riley@research.att.com CR Aho A. V., 1986, COMPILERS PRINCIPLES Aho A. V., 1974, DESIGN ANAL COMPUTER ANTONIOL G, 1995, P INT C AC SPEECH SI, P588 Berstel J., 1979, TRANSDUCTIONS CONTEX Eilenberg S., 1974, AUTOMATA LANGUAGES M, VA GOPALAKRISHNAN PS, 1995, INT CONF ACOUST SPEE, P572, DOI 10.1109/ICASSP.1995.479662 Kaplan R. M., 1994, COMPUTATIONAL LINGUI, V20 LEE KF, 1990, IEEE T ACOUST SPEECH, V38, P599, DOI 10.1109/29.52701 MOHRI M, 1996, 34 M ASS COMP LING A MOHRI M, 1996, J NATURAL LANGUAGE E, V2, P1 MOHRI M, P INT C AC SPEECH SI MOHRI M, 1994, LECT NOTES COMPUTER, V807 Mohri M., 1997, COMPUTATIONAL LINGUI, V23 MOHRI M, 1997, FINITE STATE LANGUAG MOHRI M, IN PRESS MINIMIZATIO MOHRI M, 1996, 16 INT C COMP LING C MOHRI M, 1996, ECAI 96 WORKSH BUD H Odell J. J., 1994, P ARPA SPOK LANG TEC, P405, DOI 10.3115/1075812.1075905 Ortmanns S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607215 Ortmanns S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607214 RILEY M, 1997, P EUROSPEECH 97 RHOD YOUNG S, 1994, ARPA HUM LANG TECHN NR 22 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
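Weighted determinization, as described in the Mohri and Riley abstract above, can be pictured as a subset construction in which each subset state carries residual weights and every emitted transition takes the minimum achievable weight, leaving the leftovers in the residuals. A compact sketch over the tropical (min, +) semiring for acyclic acceptors with a single initial state; the published algorithm is more general (transducers, cyclic machines, other semirings).

# Weighted subset construction: subset states are frozensets of
# (state, residual-weight) pairs; arcs are (src, label, weight, dst).
from collections import defaultdict

def determinize(arcs, initial):
    out_arcs, queue = [], [frozenset({(initial, 0.0)})]
    seen = {queue[0]}
    while queue:
        subset = queue.pop()
        by_label = defaultdict(list)
        for (q, v) in subset:
            for (src, a, w, dst) in arcs:
                if src == q:
                    by_label[a].append((v + w, dst))
        for a, pairs in by_label.items():
            wmin = min(p[0] for p in pairs)              # emitted (minimum) weight
            nxt = {}
            for (vw, dst) in pairs:                      # keep best residual per state
                nxt[dst] = min(nxt.get(dst, float("inf")), vw - wmin)
            target = frozenset(nxt.items())
            out_arcs.append((subset, a, wmin, target))
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return out_arcs

arcs = [(0, "a", 1.0, 1), (0, "a", 3.0, 2), (1, "b", 2.0, 3), (2, "b", 1.0, 3)]
for t in determinize(arcs, 0):
    print(t)      # one 'a' arc of weight 1.0 remains; the extra 2.0 becomes residual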
PD MAY PY 1999 VL 28 IS 1 BP 1 EP 12 DI 10.1016/S0167-6393(98)00043-0 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 202TY UT WOS:000080669300001 ER PT J AU Yuo, KH Wang, HC AF Yuo, KH Wang, HC TI Robust features for noisy speech recognition based on temporal trajectory filtering of short-time autocorrelation sequences SO SPEECH COMMUNICATION LA English DT Article DE noisy speech recognition; temporal trajectory filtering; relative autocorrelation sequences ID PROJECTION AB This paper introduces a new representation of speech that is invariant to noise. The idea is to filter the temporal trajectories of short-time one-sided autocorrelation sequences of speech such that the noise effect is removed. The filtered sequences are denoted by the relative autocorrelation sequences (RASs), and the mel-scale frequency cepstral coefficients (MFCC) are extracted from RAS instead of the original speech. This new speech feature set is denoted as RAS-MFCC. Experiments were conducted on a task of multispeaker isolated Mandarin digit recognition to demonstrate the effectiveness of RAS-MFCC features in the presence of white noise and colored noise. The proposed features are also shown to be superior to other robust representations and compensation techniques. (C) 1999 Published by Elsevier Science B.V. All rights reserved. C1 Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 30043, Taiwan. RP Wang, HC (reprint author), Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 30043, Taiwan. EM hcwang@ee.nthu.edu.tw CR Acero A., 1991, P IEEE ICASSP TOR CA, P893, DOI 10.1109/ICASSP.1991.150483 AVENDANO C, 1996, P ICSLP 96, V2, P889, DOI 10.1109/ICSLP.1996.607744 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Carlson BA, 1994, IEEE T SPEECH AUDI P, V2, P97, DOI 10.1109/89.260341 FURUI S, 1986, P 1986 IEEE IECEJ AS, P1991 GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014 HANSON BA, 1993, 1993 P IEEE INT C AC, V2, P79 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERNANDO J, 1994, P ICSLP 94, V4, P1847 Hernando J, 1997, IEEE T SPEECH AUDI P, V5, P80, DOI 10.1109/89.554273 Hirsch H.-G., 1991, P EUROSPEECH, P413 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P1659, DOI 10.1109/29.46548 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P795, DOI 10.1109/ASSP.1989.28053 Nadeu C, 1997, SPEECH COMMUN, V22, P315, DOI 10.1016/S0167-6393(97)00030-7 NADEU C, 1994, P ICSLP 94, V4, P1927 Oppenheim A. V., 1989, DISCRETE TIME SIGNAL Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Vaseghi SV, 1997, IEEE T SPEECH AUDI P, V5, P11, DOI 10.1109/89.554264 NR 19 TC 11 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1999 VL 28 IS 1 BP 13 EP 24 DI 10.1016/S0167-6393(99)00004-7 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 202TY UT WOS:000080669300002 ER PT J AU Yegnanarayana, B Avendano, C Hermansky, H Murthy, PS AF Yegnanarayana, B Avendano, C Hermansky, H Murthy, PS TI Speech enhancement using linear prediction residual SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; linear prediction residual signal ID COLORED NOISE; SIGNALS AB In this paper we propose a method for enhancement of speech in the presence of additive noise.
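The RAS front end in the Yuo and Wang record above rests on one operation: compute short-time one-sided autocorrelations per frame, then filter each lag trajectory along time so that stationary (noise-dominated) components drop out. A minimal sketch; the frame sizes and the delta-style trajectory filter are plausible choices, not necessarily the paper's.

# Relative autocorrelation sequences: per-frame one-sided autocorrelation,
# then a regression-style FIR differentiator applied along the time axis
# of each lag trajectory. MFCC extraction would follow from the result.
import numpy as np

def ras(signal, frame=256, hop=128, lags=64, M=2):
    frames = [signal[i:i + frame] * np.hamming(frame)
              for i in range(0, len(signal) - frame, hop)]
    ac = np.array([[np.dot(f[:frame - k], f[k:]) for k in range(lags)]
                   for f in frames])                       # shape (T, lags)
    taps = np.arange(-M, M + 1, dtype=float)
    taps /= np.sum(taps ** 2)                              # delta-style FIR taps
    return np.apply_along_axis(lambda t: np.convolve(t, taps[::-1], "same"),
                               0, ac)                      # filter each lag trajectory

features = ras(np.random.randn(4000))    # stand-in waveform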
The objective is to selectively enhance the high signal-to-noise ratio (SNR) regions in the noisy speech in the temporal and spectral domains, without causing significant distortion in the resulting enhanced speech. This is proposed to be done at three different levels. (a) At the gross level, by identifying the regions of speech and noise in the temporal domain. (b) At the finer level, by identifying the regions of high and low SNR portions in the noisy speech. (c) At the short-time spectrum level, by enhancing the spectral peaks over spectral valleys. The basis for the proposed approach is to analyze linear prediction (LP) residual signal in short (1-2 ms) segments to determine whether a segment belongs to a noise region or speech region. In the speech regions the inverse spectral flatness factor is significantly higher than in the noisy regions. The LP residual signal enables us to deal with short segments of data due to uncorrelatedness of the samples. Processing of noisy speech for enhancement involves mostly weighting the LP residual signal samples. The weighted residual signal samples are used to excite the time-varying all-pole filter to produce enhanced speech. As the additive noise level in the speech signal is increased, the quality of the resulting enhanced speech decreases progressively due to loss of speech information in the low SNR, high noise regions. Thus the degradation in performance of enhancement is graceful as the overall SNR of the noisy speech is decreased. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Indian Inst Technol, Dept Comp Sci & Engn, Madras 600036, Tamil Nadu, India. Oregon Grad Inst Sci & Technol, Dept Elect Engn, Portland, OR USA. Indian Inst Technol, Dept Elect Engn, Madras 600036, Tamil Nadu, India. RP Yegnanarayana, B (reprint author), Indian Inst Technol, Dept Comp Sci & Engn, Madras 600036, Tamil Nadu, India. EM yegna@iitm.ernet.in CR ANANTHAPADMANABHA TV, 1979, IEEE T ACOUST SPEECH, V27, P309, DOI 10.1109/TASSP.1979.1163267 Avendano C., 1996, Proceedings. Third IEEE Workshop on Interactive Voice Technology for Telecommunications Applications. IVTTA-96 (Cat. No.96TH8178), DOI 10.1109/IVTTA.1996.552760 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 CHEN JH, 1995, IEEE T SPEECH AUDI P, V3, P59 CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427 COOPER FS, 1980, J ACOUST SOC AM, V68, P18, DOI 10.1121/1.384620 Deller J. R., 1993, DISCRETE TIME PROCES EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 Erell A, 1994, IEEE T SPEECH AUDI P, V2, P1, DOI 10.1109/89.260328 FANT G, 1993, SPEECH COMMUN, V13, P1 GIBSON JD, 1991, IEEE T SIGNAL PROCES, V39, P1732, DOI 10.1109/78.91144 Hamon C., 1989, P INT C AC SPEECH SI, P238 Haykin S., 1991, ADAPTIVE FILTER THEO Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Jayant N. S., 1984, DIGITAL CODING WAVEF Junqua J.C., 1996, ROBUSTNESS AUTOMATIC LeBouquin R, 1996, SPEECH COMMUN, V18, P3, DOI 10.1016/0167-6393(95)00021-6 Lee KY, 1996, IEEE SIGNAL PROC LET, V3, P196 Leon S. 
J., 1990, LINEAR ALGEBRA APPL LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 MERMELSTEIN P, 1982, IEEE T ACOUST SPEECH, V72, P1368 MURTHY PS, 1998, IN PRESS IEEE T SPEE Rose RC, 1994, IEEE T SPEECH AUDI P, V2, P245, DOI 10.1109/89.279273 SMITS R, 1995, IEEE T SPEECH AUDI P, V3, P325, DOI 10.1109/89.466662 Yegnanarayana B, 1998, IEEE T SPEECH AUDI P, V6, P1, DOI 10.1109/89.650304 Yegnanarayana B., 1998, P IEEE INT C AC SPEE, V1, P405, DOI 10.1109/ICASSP.1998.674453 YEGNANARAYANA B, 1994, 1029 I PERC RES YEGNANARAYANA B, 1992, P ESCA WORKSH COMP S, P411 Yegnanarayana B, 1996, IEEE T SPEECH AUDI P, V4, P133, DOI 10.1109/89.486063 NR 30 TC 32 Z9 33 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1999 VL 28 IS 1 BP 25 EP 42 DI 10.1016/S0167-6393(98)00070-3 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 202TY UT WOS:000080669300003 ER PT J AU Kanedera, N Arai, T Hermansky, H Pavel, M AF Kanedera, N Arai, T Hermansky, H Pavel, M TI On the relative importance of various components of the modulation spectrum for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article DE modulation frequency; modulation spectrum; automatic speech recognition ID RECEPTION AB We measured the accuracy of speech recognition as a function of band-pass filtering of the time trajectories of spectral envelopes. We examined (i) several types of recognizers such as dynamic time warping (DTW) and hidden Markov model (HMM), and (ii) several types of features, such as filter bank output, mel-frequency cepstral coefficients (MFCC), and perceptual linear predictive (PLP) coefficients. We used the resulting recognition data to determine the relative importance of information in different modulation spectral components of speech for automatic speech recognition. We concluded that: (1) most of the useful linguistic information is in modulation frequency components from the range between 1 and 16 Hz, with the dominant component at around 4 Hz; (2) in some realistic environments, the use of components from the range below 2 Hz or above 16 Hz can degrade the recognition accuracy. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Ishikawa Natl Coll Technol, Tsubata, Ishikawa 92903, Japan. Sophia Univ, Tokyo 102, Japan. Oregon Grad Inst Sci & Technol, Portland, OR USA. Int Comp Sci Inst, Berkeley, CA 94704 USA. AT&T Labs W, Menlo Pk, CA USA. RP Kanedera, N (reprint author), Ishikawa Natl Coll Technol, Tsubata, Ishikawa 92903, Japan. EM kane@ishikawa-nct.ac.jp CR Arai T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607318 ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 Greenberg Steven, 1996, P ESCA WORKSH AUD BA, P1 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1993, P INT C AC SPEECH SI HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Kuwabara H, 1989, TRI0086 ATR YOUNG S, 1997, CORPUS BASED METHODS, P38 NR 12 TC 58 Z9 59 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1999 VL 28 IS 1 BP 43 EP 55 DI 10.1016/S0167-6393(99)00002-3 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 202TY UT WOS:000080669300004 ER PT J AU Aguilar, L AF Aguilar, L TI Hiatus and diphthong: Acoustic cues and speech situation differences SO SPEECH COMMUNICATION LA English DT Article DE vowel-vowel sequences; glide-vowel sequences; hiatus; diphthong; speech situation; formant modeling; Spanish; phonetic reduction processes AB The aim of this study is to determine the acoustic properties of hiatuses (vowel-vowel sequences) and diphthongs (glide-vowel sequences) in Spanish and to observe how these properties are modified depending on communicative factors. To do this, two groups of data were used: speech samples gathered from conversations between two speakers participating in the execution of a map task, in which the corpus items corresponded to toponyms, and the reading of the same sequences at a normal speaking rate. The comparison was done phonetically and phonologically: first, diphthongs and hiatuses were analyzed acoustically, studying their duration and spectral dynamics, and later, an inventory of diphthongizations and monophthongizations was made. Results show that hiatuses and diphthongs differ in the temporal and frequential domain: hiatuses have a longer duration and a greater degree of curvature in the F2 trajectory than diphthongs. Differences between the two categories (hiatus and diphthong) exist in both communicative situations, although changes within each category due to the speech situation were also observed: sequences from the map task are shorter and the degree of curvature of their formant trajectories is lower than for the reading task. We have also found that vowel-vowel and glide-vowel sequences behave differently in the way they are phonetically reduced: a reduction axis can be drawn in which hiatuses become diphthongs, and diphthongs vowels. It is concluded that hiatus and diphthong are two phonetic categories which can be described on the basis of their acoustic characteristics and are subject, like any other phonetic category, to modifications due to a change in the communicative situation. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Autonoma Barcelona, Dept Filol Espanyola, E-08193 Barcelona, Spain. RP Aguilar, L (reprint author), Univ Autonoma Barcelona, Dept Filol Espanyola, E-08193 Barcelona, Spain. EM lourdes@liceu.uab.es CR AGUILAR L, 1993, P EUROSPEECH 93 21 2, V1, P433 AGUILAR L, 1995, P 13 INT C PHON SC S, V3, P460 ALARCOS E, 1965, FONOLOGIA ESPANOLA G ANDERSON AH, 1991, LANG SPEECH, V34, P351 Anderson S.
R., 1985, PHONOLOGY 20 CENTURY BURGESS N, 1969, LANG SPEECH, V12, P238 CARRE R, 1991, J PHONETICS, V19, P433 CLERMONT F, 1993, SPEECH COMMUN, V13, P377, DOI 10.1016/0167-6393(93)90036-K de Manrique A. M., 1979, PHONETICA, V36, P194 DEMANRIQUE AMB, 1976, LANG SPEECH, V19, P121 GAY T, 1968, J ACOUST SOC AM, V44, P1570, DOI 10.1121/1.1911298 Harris J. W., 1969, SPANISH PHONOLOGY HARRIS JW, 1971, FUNDAMENTOS GRAMATIC, P164 Hualde Jose Ignacio, 1991, CURRENT STUDIES SPAN, P475 JHA SK, 1985, J PHONETICS, V13, P107 KOHLER KJ, 1995, P 13 INT C PHON SC S, V2, P12 KOHLER KJ, 1990, SPEECH PRODUCTION SP LEHISTE I, 1961, J ACOUST SOC AM, V33, P268, DOI 10.1121/1.1908638 Lindblom B., 1990, SPEECH PRODUCTION SP LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 MADDIESON I, 1985, PHONETICA, V42, P163 MENENDEZPIDAL R, 1940, MANUAL GRAMATICA HIS MOOSMULLER S, 1997, P EUROSPEECH 97 RHOD, V22, P787 MORGAN AT, 1984, THESIS U TEXAS AUSTI NAVARROTOMAS T, 1918, MANUAL PRONUNCIACION NAVARROTOMAS T, 1946, ESTUDIOS FONOLOGIA E Quilis A., 1981, FONETICA ACUSTICA LE *RAE, 1973, ESBOZO UNA NUEVA GRA REN H, 1986, UCLA WORKING PAPERS TOLEDO GA, 1987, P 11 INT C PHON SC T, V3, P125 VANBERGEM DR, 1993, SPEECH COMMUN, V12, P1, DOI 10.1016/0167-6393(93)90015-D WAKSLER R, 1990, THESIS HARVARD U CAM YANG S, 1987, P 11 INT C PHON SC 1, P239 NR 33 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1999 VL 28 IS 1 BP 57 EP 74 DI 10.1016/S0167-6393(99)00003-5 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 202TY UT WOS:000080669300005 ER PT J AU Cooke, M Okuno, H AF Cooke, M Okuno, H TI Introduction to the special issue on computational auditory scene analysis SO SPEECH COMMUNICATION LA English DT Editorial Material CR BREGMAN, 1990, AUDITORY SCENE ANAL ROSENTHAL, 1998, COMPUTATIONAL AUDITO NR 2 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 155 EP 157 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300001 ER PT J AU Barker, J Cooke, M AF Barker, J Cooke, M TI Is the sine-wave speech cocktail party worth attending? SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN ID PERCEPTION AB Listeners are remarkably adept at recognising speech that has undergone extensive spectral reduction. Natural speech can be reproduced using as few as three time-varying sinusoids mimicking the corresponding speech formants. Untrained listeners are able to transcribe this 'sine-wave' speech with a high degree of reliability. Coherent phonetic percepts generated by sine-wave speech occur despite an apparent lack of those cues on which low-level auditory grouping processes are believed to operate. Consequently, it has been proposed that speech perception is governed by processes operating independently of those described by auditory scene analysis. This paper re-examines the evidence provided by previous perceptual studies of sine-wave speech and presents new data from perceptual studies employing stimuli constructed from simultaneous sine-wave speech sources.
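Sine-wave speech of the kind discussed above is straightforward to generate once formant tracks are in hand: each of three sinusoids follows a frequency and amplitude track, and their sum is the stimulus. A minimal sketch with linearly interpolated synthetic tracks standing in for tracks measured from a real utterance.

# Three time-varying sinusoids following formant-like tracks: integrate
# each frequency track to a phase, scale by its amplitude, and sum.
import numpy as np

fs, dur = 16000, 0.5
t = np.arange(int(fs * dur)) / fs
tracks = [(np.linspace(700, 300, t.size), 1.0),     # (F1-like Hz track, amplitude)
          (np.linspace(1200, 2100, t.size), 0.5),
          (np.linspace(2500, 2700, t.size), 0.25)]
sws = np.zeros_like(t)
for freq, amp in tracks:
    phase = 2 * np.pi * np.cumsum(freq) / fs        # integrate frequency to phase
    sws += amp * np.sin(phase)
sws /= np.max(np.abs(sws))                          # normalised sine-wave stimulus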
These new studies suggest that in conditions that are closer to those of everyday listening, grouping cues have an important role in the formation of coherent speech percepts. In conjunction with these perceptual studies, results from automatic segregation and recognition tasks suggest that sine-wave speech contains sufficient low-level, non-speech-specific structure to allow partial descriptions of sine-wave sources to be recovered from two-source mixtures. It is argued that these partial descriptions may be sufficient to account for the limited intelligibility observed in two-source sine-wave speech listening tests. (C) 1999 Published by Elsevier Science B.V. All rights reserved. C1 CNRS, UPRESA 5009, Inst Commun Parlee, F-38031 Grenoble 1, France. Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Barker, J (reprint author), CNRS, UPRESA 5009, Inst Commun Parlee, 46 Ave Felix Viallet, F-38031 Grenoble 1, France. EM barker@icp.inpg.fr; m.cooke@dcs.shef.ac.uk CR BAILEY PJ, 1977, SR5152 HASK LAB BARKER JP, 1997, P EUR 97 RHOD 1997 BARKER JP, 1998, THESIS SHEFFIELD U U BASHFORD JA, 1992, PERCEPT PSYCHOPHYS, V51, P211, DOI 10.3758/BF03212247 BREGMAN AS, 1978, J EXP PSYCHOL HUMAN, V4, P380, DOI 10.1037//0096-1523.4.3.380 Bregman AS., 1990, AUDITORY SCENE ANAL CARRELL TD, 1992, PERCEPT PSYCHOPHYS, V52, P437, DOI 10.3758/BF03206703 COOKE M, 1997, P ICASSP, P863 CROWE AS, 1988, P 7 S FASE SPEECH 88, P683 GAROFOLO JS, 1989, EUR C SPEECH COMM TE PRICE P, 1988, P ICASSP, V88, P651 REMEZ RE, 1994, PSYCHOL REV, V101, P129, DOI 10.1037/0033-295X.101.1.129 REMEZ RE, 1990, PERCEPT PSYCHOPHYS, V48, P313, DOI 10.3758/BF03206682 REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191 Warren R.M., 1996, P WORKSH AUD BAS SPE, P226 YOUNG SJ, 1993, HTK VERSION 1 5 NR 16 TC 26 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 159 EP 174 DI 10.1016/S0167-6393(98)00081-8 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300002 ER PT J AU de Cheveigne, A Kawahara, H AF de Cheveigne, A Kawahara, H TI Multiple period estimation and pitch perception model SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN DE pitch perception; fundamental frequency; speech analysis; computational auditory scene analysis ID DIFFERENT FUNDAMENTAL FREQUENCIES; CONCURRENT VOWELS; COMPLEX TONE; CANCELLATION MODEL; AUDITORY PERIPHERY; MISTUNED PARTIALS; PHASE SENSITIVITY; COMPUTER-MODEL; VIRTUAL PITCH; IDENTIFICATION AB The pitch of a periodic sound is strongly correlated with its period. To perceive the multiple pitches evoked by several simultaneous sounds, the auditory system must estimate their periods. This paper proposes a process in which the periodic sounds are canceled in turn (multistep cancellation model) or simultaneously (joint cancellation model). As an example of multistep cancellation, the pitch perception model of Meddis and Hewitt (1991a,b) can be associated with the concurrent vowel identification model of Meddis and Hewitt (1992). A first period estimate is used to suppress correlates of the dominant sound, and a second period is then estimated from the remainder.
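The multistep strategy just described can be miniaturised in the time domain: pick the lag that minimises the output power of a cancellation (comb) filter, apply the filter to suppress that source, and search the remainder for the second period. A bare sketch on a synthetic two-source mixture; the lag range, amplitudes, and square-wave sources are arbitrary choices for illustration.

# Two-step period estimation by cancellation: a delay-and-subtract filter
# nulls a periodic source at its period, so the lag of minimum output
# power is a period estimate; cancelling and re-searching finds the next.
import numpy as np

def best_lag(x, lo=20, hi=100):
    powers = [np.mean((x[l:] - x[:-l]) ** 2) for l in range(lo, hi)]
    return lo + int(np.argmin(powers))

def cancel(x, lag):
    return x[lag:] - x[:-lag]                  # suppress the period-`lag` source

fs = 8000
n = np.arange(4 * fs)
mix = (np.sign(np.sin(2 * np.pi * 100 * n / fs))              # 100 Hz source
       + 0.8 * np.sign(np.sin(2 * np.pi * 133 * n / fs)))     # 133 Hz source
lag1 = best_lag(mix)                           # first period estimate
lag2 = best_lag(cancel(mix, lag1))             # second, from the remainder
print(fs / lag1, fs / lag2)                    # roughly the two source frequencies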
The process may be repeated to estimate further pitches, or else to recursively refine the initial estimates. Meddis and Hewitt's models are spectrotemporal (filter channel selection based on temporal cues) but multistep cancellation can also be performed in the spectral or time domain. In the joint cancellation model, estimation and cancellation are performed together in the time domain: the parameter space of several cascaded cancellation filters is searched exhaustively for a minimum output. The parameters that yield this minimum are the period estimates. Joint cancellation is guaranteed to find all periods, except in certain situations for which the stimulus is inherently ambiguous. (C) 1999 Published by Elsevier Science B.V. All rights reserved. C1 Univ Paris 07, CNRS, F-75251 Paris, France. ATR Human Informat Proc Res Labs, Kyoto 6190288, Japan. Wakayama Univ, Fac Syst Engn, Design Informat Sci Dept, Media Design Informat Grp, Wakayama 6408510, Japan. RP de Cheveigne, A (reprint author), Lab Linguist Formelle, Case 7003,2 Pl Jussieu, F-75251 Paris, France. EM alain@linguist.jussieu.fr RI Hideki, Kawahara/H-6034-2011; de Cheveigne, Alain/F-4947-2012 CR Assmann PF, 1998, J ACOUST SOC AM, V103, P1150, DOI 10.1121/1.421249 ASSMANN PF, 1990, J ACOUST SOC AM, V88, P680, DOI 10.1121/1.399772 Barlow H. B., 1990, VISION CODING EFFICI, P363 BEERENDS JG, 1989, J ACOUST SOC AM, V85, P813, DOI 10.1121/1.397974 BEERENDS JG, 1988, BASIC ISSUES HEARING, P380 DARWIN CJ, 1992, ADV BIOSCI, V83, P223 DARWIN CJ, 1992, J ACOUST SOC AM, V91, P3381, DOI 10.1121/1.402828 DARWIN CJ, 1994, J ACOUST SOC AM, V95, P2631, DOI 10.1121/1.409832 DEBOER E, 1976, HDB SENSORY PHYSL, V5, P479 de Cheveigne A, 1998, J ACOUST SOC AM, V103, P1261, DOI 10.1121/1.423232 deCheveigne A, 1997, J ACOUST SOC AM, V102, P1083, DOI 10.1121/1.419612 DECHEVEIGNE A, 1993, J ACOUST SOC AM, V93, P3271 deCheveigne A, 1997, J ACOUST SOC AM, V101, P2857, DOI 10.1121/1.419480 DUIFHUIS H, 1982, J ACOUST SOC AM, V71, P1568, DOI 10.1121/1.387811 EVANS EF, 1978, AUDIOLOGY, V17, P369 GOLDSTEI.JL, 1973, J ACOUST SOC AM, V54, P1496, DOI 10.1121/1.1914448 HARTMANN WM, 1990, J ACOUST SOC AM, V88, P1712, DOI 10.1121/1.400246 Hartmann WM, 1996, J ACOUST SOC AM, V99, P567, DOI 10.1121/1.414514 Houtsma Adrianus J. M., 1995, P267, DOI 10.1016/B978-012505626-7/50010-8 HOUTSMA AJM, 1992, AUDITORY FREQUENCY S, P237 KUBOVY M, 1979, PERCEPTUAL ORG LAMORE PJJ, 1977, ACUSTICA, V37, P250 LICKLIDER JCR, 1951, EXPERIENTIA, V7, P128, DOI 10.1007/BF02156143 Lin JY, 1998, J ACOUST SOC AM, V103, P2608, DOI 10.1121/1.422781 MCADAMS S, 1984, THESIS STANFORD MCADAMS S, 1989, J ACOUST SOC AM, V86, P2148, DOI 10.1121/1.398475 MEDDIS R, 1992, J ACOUST SOC AM, V91, P233, DOI 10.1121/1.402767 MEDDIS R, 1991, J ACOUST SOC AM, V89, P2883, DOI 10.1121/1.400726 MEDDIS R, 1991, J ACOUST SOC AM, V89, P2866, DOI 10.1121/1.400725 MOORE B J, 1982, INTRO PSYCHOL HEARIN MOORE BCJ, 1986, J ACOUST SOC AM, V80, P479, DOI 10.1121/1.394043 NAGABUCHI H, 1979, T IECE, V62, P627 Nakagawa T., 1995, Proceedings of the 7th International Symposium on Power Semiconductor Devices and ICs, ISPSD '95 (IEEE Cat. 
No.95CH35785), DOI 10.1109/ISPSD.1995.515014 NAKATANI T, 1995, P IJCAI, P165 NEY H, 1982, IEEE T SYST MAN CYB, V12, P383, DOI 10.1109/TSMC.1982.4308828 NORDMARK JO, 1978, HDB PERCEPTION, V4, P243 PARSONS TW, 1976, J ACOUST SOC AM, V60, P911, DOI 10.1121/1.381172 PETERS RW, 1983, J ACOUST SOC AM, V73, P924, DOI 10.1121/1.389017 RASCH RA, 1978, ACUSTICA, V40, P21 ROBINSON K, 1995, J ACOUST SOC AM, V98, P1858, DOI 10.1121/1.414405 SCHEFFERS MTM, 1983, SIFTING VOWELS SCHEFFERS MTM, 1983, J ACOUST SOC AM, V74, P1716, DOI 10.1121/1.390280 Schouten J. F., 1970, FREQUENCY ANAL PERIO, P41 SCHROEDE.MR, 1968, J ACOUST SOC AM, V43, P829, DOI 10.1121/1.1910902 TERHARDT E, 1974, J ACOUST SOC AM, V55, P1061, DOI 10.1121/1.1914648 WEINTRAUB M, 1985, THESIS STANFORD WIGHTMAN FL, 1973, J ACOUST SOC AM, V54, P407, DOI 10.1121/1.1913592 NR 47 TC 29 Z9 29 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 175 EP 185 DI 10.1016/S0167-6393(98)00074-0 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300003 ER PT J AU Kawahara, H Masuda-Katsuse, I de Cheveigne, A AF Kawahara, H Masuda-Katsuse, I de Cheveigne, A TI Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency-based F0 extraction: Possible role of a repetitive structure in sounds SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN DE speech analysis; pitch-synchronous; spline smoothing; instantaneous frequency; F0 extraction; speech synthesis; speech modification ID PERCEPTION; MODEL AB A set of simple new procedures has been developed to enable the real-time manipulation of speech parameters. The proposed method uses pitch-adaptive spectral analysis combined with a surface reconstruction method in the time-frequency region. The method also consists of a fundamental frequency (F0) extraction using instantaneous frequency calculation based on a new concept called 'fundamentalness'. The proposed procedures preserve the details of time-frequency surfaces while almost perfectly removing fine structures due to signal periodicity. This close-to-perfect elimination of interferences and smooth F0 trajectory allow for over 600% manipulation of such speech parameters as pitch, vocal tract length, and speaking rate, while maintaining high reproductive quality. (C) 1999 Elsevier Science B.V. All rights reserved. C1 ATR, Human Informat Proc Res Labs, Kyoto 61902, Japan. RP Kawahara, H (reprint author), Wakayama Univ, Fac Syst Engn, Dept Design Informat Sci, 930 Sakedani, Wakayama 6408510, Japan.
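The pitch-adaptive smoothing idea in the abstract above lends itself to a compact illustration. The Python sketch below smooths a short-time magnitude spectrum with a window matched to the harmonic spacing of F0, so periodicity ripple is averaged out while the spectral envelope survives; the function name, the rectangular smoother and the synthetic example are illustrative assumptions, not the published STRAIGHT algorithm.

import numpy as np

def pitch_adaptive_smooth(frame, f0_hz, fs):
    # Smooth the short-time magnitude spectrum with a window whose width
    # equals the harmonic spacing (F0 in FFT bins), so periodicity ripple
    # is averaged out while the spectral envelope is kept.
    n = len(frame)
    spec = np.abs(np.fft.rfft(frame * np.hanning(n)))
    bins_per_f0 = max(1, int(round(f0_hz * n / fs)))
    kernel = np.ones(bins_per_f0) / bins_per_f0
    return np.convolve(spec, kernel, mode='same')

# Example: a harmonic-rich 100 Hz square wave; the returned envelope no
# longer shows peaks at every harmonic.
fs = 16000
t = np.arange(1024) / fs
frame = np.sign(np.sin(2 * np.pi * 100 * t))
envelope = pitch_adaptive_smooth(frame, 100.0, fs)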
EM kawahara@sys.wakayama-u.ac.jp RI Hideki, Kawahara/H-6034-2011; de Cheveigne, Alain/F-4947-2012 CR ABE T, 1996, P ICSLP, V2, P1277, DOI 10.1109/ICSLP.1996.607843 ABE T, 1995, IEICE T INF SYST, VE78D, P1188 ABRANTES AJ, 1991, P EUR 91 PAR, P231 ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 BLAUERT J, 1978, J ACOUST SOC AM, V63, P1478, DOI 10.1121/1.381841 BOASHASH B, 1992, P IEEE, V80, P520, DOI 10.1109/5.135376 BOASHASH B, 1992, P IEEE, V80, P550 Bregman AS., 1990, AUDITORY SCENE ANAL CASPERS B, 1987, P IEEE INT C AC SPEE, V4, P2388 COHEN L, 1989, P IEEE, V77, P941, DOI 10.1109/5.30749 Cooke M., 1993, MODELLING AUDITORY P DECHEVEIGNE A, 1996, TRH195 ATR HIP de Cheveigne A, 1998, J ACOUST SOC AM, V103, P1261, DOI 10.1121/1.423232 Dudley H, 1939, J ACOUST SOC AM, V11, P169, DOI 10.1121/1.1916020 DUTOIT T, 1993, P EUROSPEECH 93, P531 ELJAROUDI A, 1991, IEEE T SIGNAL PROCES, V39, P411, DOI 10.1109/78.80824 GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651 ITAKURA F, 1970, ELECTRON COMMUN JPN, V53, P36 Kawabe T, 1996, PROG THEOR PHYS, V96, P1, DOI 10.1143/PTP.96.1 KAWAHARA H, 1996, EA9628 IEICE, P9 Kawahara H, 1996, VOCAL FOLD, P263 Kawahara H., 1997, P IEEE INT C AC SPEE, V1, P1303 Marr D., 1982, VISION COMPUTATIONAL MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 NARENDRANATH M, 1995, SPEECH COMMUN, V16, P207, DOI 10.1016/0167-6393(94)00058-I Oppenheim A. V., 1989, DISCRETE TIME SIGNAL PATTERSON RD, 1987, J ACOUST SOC AM, V82, P1560, DOI 10.1121/1.395146 SCHOUTEN J. F., 1962, JOUR ACOUSTICAL SOC AMER, V34, P1418, DOI 10.1121/1.1918360 Secrest B. G., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing SLANEY M, 1996, P IEEE INT C AC SPEE, P1 Stylianou Y., 1995, P EUROSPEECH, P451 Veldhuis R, 1996, SPEECH COMMUN, V18, P257, DOI 10.1016/0167-6393(95)00044-5 Webster DB, 1992, MAMMALIAN AUDITORY P NR 33 TC 562 Z9 592 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 187 EP 207 DI 10.1016/S0167-6393(98)00085-5 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300004 ER PT J AU Nakatani, T Okuno, HG AF Nakatani, T Okuno, HG TI Harmonic sound stream segregation using localization and its application to speech stream segregation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN DE speech stream segregation; harmonic structure; localization; information integration; computational auditory scene analysis; spectrum distortion ID BLIND SEPARATION AB Sound stream segregation is essential to understand auditory events in the real world. In this paper, we present a new method of segregating a series of harmonic sounds. The harmonic structure and sound source direction are used as clues for segregation. The direction information of the sources is used to extract fundamental frequencies of individual harmonic sounds, and harmonic sounds are segregated according to the extracted fundamental frequencies. Sequential grouping of harmonic sounds is achieved by using both sound source directions and fundamental frequencies. An application of the harmonic stream segregation to speech stream segregation is presented. 
It provides effective speech stream segregation using binaural microphones. Experimental results show that the method reduces the spectrum distortions and the fundamental frequency errors compared to an existing monaural system, and that it can segregate three simultaneous harmonic streams with only two microphones. (C) 1999 Published by Elsevier Science B.V. All rights reserved. C1 Nippon Telegraph & Tel Corp, Multimedia Business Dept, Chiyoda Ku, Tokyo 1000004, Japan. Nippon Telegraph & Tel Corp, Basic Res Labs, Kanagawa 2430198, Japan. RP Nakatani, T (reprint author), Nippon Telegraph & Tel Corp, Multimedia Business Dept, Chiyoda Ku, UrbanNet Otemachi Bldg 18F,2-2-2 Otemachi, Tokyo 1000004, Japan. EM nakatani@horn.brl.ntt.co.jp CR AMARI S, 1996, ADV NEURAL INFORMATI, V8 BELL AJ, 1995, NEURAL COMPUT, V7, P1129, DOI 10.1162/neco.1995.7.6.1129 BODDEN M, 1993, ACTA ACUSTICA, V1 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1992, THESIS U SHEFFIELD CHEVEIGNE A, 1993, JASA, V93 COMON P, 1994, SIGNAL PROCESS, V36, P287, DOI 10.1016/0165-1684(94)90029-9 COOKE MP, 1993, ENDEAVOUR, V17 Costas J. P., 1981, P 1 ASSP WORKSH SPEC, P651 Furui Sadaoki, 1992, ADV SPEECH SIGNAL PR GRABKE JW, 1995, P IJCAI 95 WORKSH CO, P105 JUTTEN C, 1991, SIGNAL PROCESS, V24, P1, DOI 10.1016/0165-1684(91)90079-X LYON RF, 1984, P ICASSP 84 MORITA T, 1990, T I ELECT INF COMM A, V73 NAGABUCHI H, 1979, T IECE, V62, P627 NAKATANI T, 1995, ICASSP 95 NAKATANI T, 1996, P ICASSP 96 NAKATANI T, 1997, P IJCAI 97 WORKSH CO NAKATANI T, 1995, P IJCAI 95 WORKSH CO NAKATANI T, 1995, P IJCAI 95 OKUNO HG, 1995, HCI 95, V2 OKUNO HG, 1996, JSAS, V4, P2356 PARSONS TW, 1976, J ACOUST SOC AM, V60, P911, DOI 10.1121/1.381172 RAMALINGAM CS, 1994, P ICASSP 94 Rosenthal D. F., 1998, COMPUTATIONAL AUDITO SCHACKELTON TM, 1998, J ACOUST SOC AM, V91, P3579 SCHMIDS RO, 1986, IEEE T AP, V34 SECREST B, 1983, P ICASSP 83, V3, P1352 STADLER RW, 1993, J ACOUST SOC AM, V94, P1332, DOI 10.1121/1.408161 SUGIE N, 1988, P ICNN WAINTRAUB M, 1986, ICASSP 86 NR 31 TC 29 Z9 32 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 209 EP 222 DI 10.1016/S0167-6393(98)00079-X PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300005 ER PT J AU Huang, J Ohnishi, N Guo, XL Sugie, N AF Huang, J Ohnishi, N Guo, XL Sugie, N TI Echo avoidance in a computational model of the precedence effect SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN DE precedence effect; sound localization; echo estimation; echo avoidance; reverberant environment ID CROSS-CORRELATION MODEL; CONTRALATERAL INHIBITION; SOUND LOCALIZATION; LATERALIZATION; SUPPRESSION; BREAKDOWN; EXTENSION; ROOMS AB The precedence effect, though a topic of continuous theoretical interest in hearing research, so far cannot be predicted explicitly by a model. Zurek (1987) reviewed the precedence effect and gave a general structure for the model of the precedence effect. However, it is still not clear how the inhibition of localization is generated. This paper proposes a computational echo-avoidance model by assuming that the precedence effect is caused by the inhibition of sound localization which depends on the estimated sound-to-echo ratio. 
We assume there is a neural echo-estimation mechanism in the human auditory system. It is found that a temporal pattern of aftereffect with delay and decay features, which is adopted from the previous psychological tests on the increase of the perceptual threshold of interaural time difference following a preceding impulsive sound, can be used for echo estimation. The time variance of the precedence effect is briefly discussed. The results of psychological experiments, e.g. equal-level and unequal-level paired click tests, Haas's tests and Franssen's tests, are interpreted consistently. This model can also give criteria for an available onset and an explanation of why the precedence effect occurs in transient onsets. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Aizu, Multimedia Devices Lab, Aizu Wakamatsu, Fukushima 9658580, Japan. RIKEN, Inst Phys & Chem Res, Biomimet Sensory Syst Lab, Moriyama Ku, Nagoya, Aichi 463, Japan. Nagoya Univ, Dept Informat Engn, Chikusa Ku, Nagoya, Aichi 46401, Japan. Aichi Konan Jr Coll, Dept Child Educ, Takaya, Konan 483, Japan. Meijo Univ, Fac Sci & Technol, Tenpaku Ku, Nagoya, Aichi 468, Japan. RP Huang, J (reprint author), Univ Aizu, Multimedia Devices Lab, Aizu Wakamatsu, Fukushima 9658580, Japan. EM j-huang@u-aizu.ac.jp; ohnishi@nuie.nagoya-u.ac.jp; sugie@nuie.nagoya-u.ac.jp CR Blauert J., 1997, SPATIAL HEARING PSYC BLAUERT J, 1992, ADV BIOSCI, V83, P531 BLAUERT J, 1989, 118 LMA CRNS Bodden M., 1993, Acta Acustica, V1 Bregman AS., 1990, AUDITORY SCENE ANAL CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 CLIFTON RK, 1989, PERCEPT PSYCHOPHYS, V46, P139, DOI 10.3758/BF03204973 CLIFTON RK, 1987, J ACOUST SOC AM, V82, P1834, DOI 10.1121/1.395802 CLIFTON RK, 1994, J ACOUST SOC AM, V95, P1525, DOI 10.1121/1.408540 CLIFTON RK, 1984, DEV PSYCHOBIOL, V17, P519, DOI 10.1002/dev.420170509 Cooke M.P., 1993, MODELING AUDITORY PR Duda RO, 1996, ACUSTICA, V82, P346 Ellis D. P. W., 1994, P 12 INT C PATT REC FRANSSEN NV, 1959, P 3 INT C AC, V1, P787 FREYMAN RL, 1991, J ACOUST SOC AM, V90, P874, DOI 10.1121/1.401955 GARDNER MB, 1968, J ACOUST SOC AM, V43, P1243, DOI 10.1121/1.1910974 Grantham D. Wesley, 1995, P297, DOI 10.1016/B978-012505626-7/50011-X HAAS H, 1951, ACUSTICA, V1, P49 HAFTER ER, 1988, AUDITORY FUNCTION, P647 HARRIS GERARD G., 1963, JOUR ACOUSTICAL SOC AMER, V35, P672, DOI 10.1121/1.1918583 HARTMANN WM, 1989, J ACOUST SOC AM, V86, P1366, DOI 10.1121/1.398696 HUANG J, 1995, IEEE T INSTRUM MEAS, V44, P733 Huang J., 1997, ARTIFICIAL LIFE ROBO, V1, P157, DOI 10.1007/BF02471133 Huang J, 1997, IEEE T INSTRUM MEAS, V46, P842, DOI 10.1109/19.650785 KIMURA S, 1988, P SPRING M AC SOC JP Kuffler S.
W., 1984, NEURON BRAIN CELLULA LEHN KH, 1997, P IEEE WORKSH WASPAA LINDEMANN W, 1986, J ACOUST SOC AM, V80, P1623, DOI 10.1121/1.394326 LINDEMANN W, 1986, J ACOUST SOC AM, V80, P1608, DOI 10.1121/1.394325 LITOVSKY RY, 1994, J ACOUST SOC AM, V96, P752, DOI 10.1121/1.411390 MARTIN KD, 1997, P IEEE WORKSH WASPAA MCFADDEN D, 1973, J ACOUST SOC AM, V54, P528, DOI 10.1121/1.1913611 MEDDIS R, 1990, J ACOUST SOC AM, V87, P952 OERTEL D, 1996, ADV SP HEAR A&B, V3, P293 Parkin P., 1958, ACOUSTICS NOISE BUIL RAKERD B, 1985, J ACOUST SOC AM, V78, P524, DOI 10.1121/1.392474 RAKERD B, 1992, J ACOUST SOC AM, V92, P2296, DOI 10.1121/1.405156 SABERI K, 1990, J ACOUST SOC AM, V87, P1732, DOI 10.1121/1.399422 Snow WB, 1953, J SOC MOTION PICT T, V61, P567 THURLOW WR, 1965, J ACOUST SOC AM, V37, P837, DOI 10.1121/1.1909456 von Bekesy G, 1930, PHYS Z, V31, P824 von Bekesy G, 1930, PHYS Z, V31, P857 WALLACH H, 1949, AM J PSYCHOL, V62, P315, DOI 10.2307/1418275 YOST WA, 1984, J ACOUST SOC AM, V76, P1377, DOI 10.1121/1.391454 Zurek P. M., 1987, DIRECTIONAL HEARING, P85, DOI 10.1007/978-1-4612-4738-8_4 ZUREK PM, 1980, J ACOUST SOC AM, V67, P952, DOI 10.1121/1.383974 NR 46 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 223 EP 233 DI 10.1016/S0167-6393(98)00075-2 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300006 ER PT J AU Masuda-Katsuse, I Kawahara, H AF Masuda-Katsuse, I Kawahara, H TI Dynamic sound stream formation based on continuity of spectral change SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN DE auditory scene analysis; computational model; dynamic stream formation; speech segregation; phonemic restoration; prediction ID DIFFERENT FUNDAMENTAL FREQUENCIES; CONCURRENT VOWEL IDENTIFICATION; PHONEMIC RESTORATION; FORMANT TRANSITIONS; PERCEPTION; CANCELLATION; ILLUSION AB A proposed computational model that dynamically tracks and predicts changes in spectral shapes was verified in both psychophysical experiments and engineering applications. The results of the psychophysical experiments confirmed the model's validity and suggested that 'the rule of good continuity' also held in audition. Furthermore, a stream segregation system was implemented with the proposed model. It was composed of simultaneous grouping and sequential integration processes. An effective integration of two processes was performed by dynamically controlling the sequential integration based on the reliability of the output of the simultaneous grouping. Finally, we applied this system to phonemic restoration and segregation of two simultaneous utterances, showing the proposed model to be effective for such engineering applications. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Inst Syst & Informat Technol, Sawara Ku, Fukuoka 8140001, Japan. RP Masuda-Katsuse, I (reprint author), Inst Syst & Informat Technol, Sawara Ku, Kyushu 2-1-22-7F, Fukuoka 8140001, Japan. EM ikuyo@k-isit.or.jp CR AIKAWA K, 1995, J ACOUST SOC AM 2, V98, P2926, DOI 10.1121/1.414186 Aikawa K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607183 Akagi M., 1986, Transactions of the Institute of Electronics and Communication Engineers of Japan, Part A, VJ69A ASSMANN PF, 1995, J ACOUST SOC AM, V97, P575, DOI 10.1121/1.412281 ASSMANN PF, 1990, J ACOUST SOC AM, V88, P680, DOI 10.1121/1.399772 Assmann PF, 1996, J ACOUST SOC AM, V100, P1141, DOI 10.1121/1.416299 BRADY PT, 1961, J ACOUST SOC AM, V33, P1357, DOI 10.1121/1.1908439 Bregman A. S., 1993, THINKING SOUND COGNI, P10 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 BROWN GJ, 1992, RES REPORTS DEP COMP Cooke M., 1993, MODELLING AUDITORY P DARWIN CJ, 1991, MODULARITY AND THE MOTOR THEORY OF SPEECH PERCEPTION, P239 DECHEVEIGNE A, 1995, J ACOUST SOC AM, V97, P3736 deCheveigne A, 1997, J ACOUST SOC AM, V101, P2839, DOI 10.1121/1.418517 deCheveigne A, 1997, J ACOUST SOC AM, V101, P2857, DOI 10.1121/1.419480 DELATTRE PC, 1955, J ACOUST SOC AM, V27, P769, DOI 10.1121/1.1908024 DIVENYI PL, 1996, J ACOUST SOC AM 2, V10, P2682 EEK A, 1995, P 13 INT C PHON SCI, V1, P18 HANDEL S, 1986, LISTENING INTRO PERC, P185 KAWAHARA H, 1997, COMP AUD SCEN AN WOR, P103 KAWAHARA H, 1994, H9463 AC SOC JPN Kawahara H., 1997, P IEEE INT C AC SPEE, V1, P1303 KLASSNER F, 1995, COMP AUD SCEN AN WOR, P48 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 KLUENDER KR, 1992, PERCEPT PSYCHOPHYS, V51, P231, DOI 10.3758/BF03212249 LINDBLOM BE, 1967, J ACOUST SOC AM, V42, P830, DOI 10.1121/1.1910655 Marr D, 1982, VISION MEDDIS R, 1992, J ACOUST SOC AM, V91, P233, DOI 10.1121/1.402767 NAKATANI T, 1995, COMP AUD SCEN AN WOR, P84 POLS LCW, 1993, SPEECH COMMUN, V13, P135, DOI 10.1016/0167-6393(93)90065-S REPP BH, 1992, PERCEPT PSYCHOPHYS, V51, P14, DOI 10.3758/BF03205070 SAMUEL AG, 1981, J EXP PSYCHOL GEN, V110, P474, DOI 10.1037/0096-3445.110.4.474 SAMUEL AG, 1987, J MEM LANG, V26, P36, DOI 10.1016/0749-596X(87)90061-1 SAMUEL AG, 1981, J EXP PSYCHOL HUMAN, V7, P1124, DOI 10.1037/0096-1523.7.5.1124 STRANGE W, 1983, J ACOUST SOC AM, V74, P695, DOI 10.1121/1.389855 TOYAMA K, 1997, P FALL M AC SOC JPN, P207 WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 WARREN RM, 1972, SCIENCE, V176, P1149, DOI 10.1126/science.176.4039.1149 NR 39 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 235 EP 259 DI 10.1016/S0167-6393(98)00084-3 PG 25 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300007 ER PT J AU Unoki, M Akagi, M AF Unoki, M Akagi, M TI A method of signal extraction from noisy signal based on auditory scene analysis SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN ID CANCELLATION; MODEL AB This paper proposes a method of extracting the desired signal from a noisy signal, addressing the problem of segregating two acoustic sources as a model of acoustic source segregation based on Auditory Scene Analysis. Since the problem of segregating two acoustic sources is an ill-posed inverse problem, constraints are needed to determine a unique solution. 
The proposed method uses the four heuristic regularities proposed by Bregman as constraints and uses the instantaneous amplitude and phase of noisy signal components that have passed through a wavelet filterbank as features of acoustic sources. Then the model can extract the instantaneous amplitude and phase of the desired signal. Simulations were performed to segregate the harmonic complex tone from a noise-added harmonic complex tone and to compare the results of using all or only some constraints. The results show that the method can segregate the harmonic complex tone precisely using all the constraints related to the four regularities and that the absence of some constraints reduces the accuracy. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Japan Adv Inst Sci & Technol, Sch Informat Sci, Tatsunokuchi, Ishikawa 9231292, Japan. RP Unoki, M (reprint author), Japan Adv Inst Sci & Technol, Sch Informat Sci, 1-1 Asahidai, Tatsunokuchi, Ishikawa 9231292, Japan. CR Boll S. F., 1979, IEEE T ACOUSTIC SPEE, VASSP-27 Bregman A. S., 1993, THINKING SOUND COGNI, P10 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1992, THESIS U SHEFFIELD BROWN RG, 1992, INTRO RANDOM SIGNALS, P210 Chui C. K., 1992, INTRO WAVELETS COOKE MP, 1993, THESIS U SHEFFIELD De Boor C., 1978, PRACTICAL GUIDE SPLI DECHEVEIGNE A, 1993, J ACOUST SOC AM, V93, P3271 deCheveigne A, 1997, J ACOUST SOC AM, V101, P2857, DOI 10.1121/1.419480 Ellis D. P. W., 1996, THESIS MIT Ellis D. P. W., 1994, P 12 INT C PATT REC Furui S., 1991, ADV SPEECH SIGNAL PR IMAI S, 1977, IEEE T ACOUST SPEECH, V25, P127, DOI 10.1109/TASSP.1977.1162927 Imai S., 1978, Proceedings of the 1978 IEEE International Conference on Acoustics, Speech and Signal Processing Junqua J.C., 1996, ROBUSTNESS AUTOMATIC KASHINO K, 1994, IEICE T, V77, P731 KAWAHARA H, 1997, P SMC 97 12 15 OCT 1 KASHINO K, 1993, P INT COMP MUS C, P248 NAKATANI T, 1995, P 1995 INT C AC SPEE, V4, P2671 NAKATANI T, 1994, P ICSLP 94, V24 NAKATANI T, 1995, P IJCAI, P165 Papoulis A., 1977, SIGNAL ANAL Papoulis A, 1991, PROBABILITY RANDOM V, V3rd PATTERSON RD, 1995, J ACOUST SOC AM, V98, P1890, DOI 10.1121/1.414456 SHAMSUNDER S, 1997, IEEE T SPEECH AUDIO, V5 UNOKI M, 1997, P EUR 97 RHOD GREEC, V5, P2583 Unoki M, 1997, ELECTRON COMM JPN 3, V80, P1, DOI 10.1002/(SICI)1520-6440(199711)80:11<1::AID-ECJC1>3.0.CO;2-8 NR 28 TC 13 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 261 EP 279 DI 10.1016/S0167-6393(98)00077-6 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300008 ER PT J AU Ellis, DPW AF Ellis, DPW TI Using knowledge to organize sound: The prediction-driven approach to computational auditory scene analysis and its application to speech/nonspeech mixtures SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN DE computational auditory scene analysis; nonspeech; prediction; top-down; auditory illusion; phonemic restoration AB Computational auditory scene analysis - modeling the human ability to organize sound mixtures according to their sources - has experienced a rapid evolution from simple implementations of psychoacoustically inspired rules to complex systems able to process demanding real-world sounds.
Phenomena such as the continuity illusion and phonemic restoration show that the brain is able to use a wide range of knowledge-based contextual constraints when interpreting obscured or overlapping mixtures: To model such processing, we need architectures that operate by confirming hypotheses about the observations rather than relying on directly extracted descriptions. One such architecture, the 'prediction-driven' approach, is presented along with results from its initial implementation. This architecture can be extended to take advantage of the high-level knowledge implicit in today's speech recognizers by modifying a recognizer to act as one of the 'component models' providing the explanations of the signal mixture. A preliminary investigation indicates the viability of this approach while at the same time raising a number of issues which are discussed. These results point to the conclusion that successful scene analysis must, at every level, exploit abstract knowledge about sound sources. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Int Comp Sci Inst, Berkeley, CA 94704 USA. RP Ellis, DPW (reprint author), Int Comp Sci Inst, 1947 Ctr St,Suite 600, Berkeley, CA 94704 USA. EM dpwe@icsi.berkeley.edu CR Bregman Albert S., 1998, COMPUTATIONAL AUDITO, P1 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 BROWN GJ, 1992, CS9222 SHEFF U CARVER N, 1992, SYMBOLIC KNOWLEDGE B, P205 COOKE MP, 1993, THESIS SHEFFIELD U COOKE MP, 1996, P INT WORKSH AUD BAS, P186 COOKE MP, 1997, P IEEE INT C AC SPEE, V2, P863 DARWIN CJ, 1984, J ACOUST SOC AM, V76, P1636, DOI 10.1121/1.391610 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 ELLIS DPW, 1997, P IEEE WORKSH APPS S ELLIS DPW, 1997, P IEEE INT C AC SPEE, V2, P1307 ELLIS DPW, 1993, P IEEE WORKSH APPS S ELLIS DPW, 1996, THESIS DEP EECS ELLIS DPW, 1998, COMPUTATIONAL AUDITO, P257 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HOUTGAST T, 1972, J ACOUST SOC AM, V51, P1885, DOI 10.1121/1.1913048 KLASSNER F, 1996, THESIS U MASSACHUSET MALLAT SG, 1993, IEEE T SIGNAL PROCES, V41, P3397, DOI 10.1109/78.258082 MOORE RK, 1986, 3931 ROYAL SIG RES E MORGANTI R, 1995, PUBL ASTRON SOC AUST, V12, P3 OKUNO HG, 1996, P INT C SPOK LANG P, V4, P2356, DOI 10.1109/ICSLP.1996.607281 PATTERSON RD, 1996, ADV SP HEAR A&B, V3, P547 QUINLAN JR, 1989, INFORM COMPUT, V80, P227 Serra X., 1989, THESIS STANFORD U SLANEY M, 1995, COMPUTATIONAL AUDITO, P27 SLANEY M, 1992, VISUAL REPRESENTATIO, P95 STEVENS K, 1967, P S, P88 Summerfield Q., 1992, J ACOUST SOC AM, V92, P2317, DOI 10.1121/1.405031 Varga A.P., 1990, P ICASSP, V2, P845 WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 WARREN RM, 1970, SCI AM, V223, P30 Warren R.M., 1996, PRINCIPLES EXPT PHON, P435 Weintraub M., 1985, THESIS STANFORD U NR 34 TC 25 Z9 25 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
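The hypothesis-confirming architecture described in the Ellis abstract above can be caricatured in a few lines. The sketch below is a minimal prediction-driven analysis step, assuming nonnegative per-bin spectra and an arbitrary residual threshold; it is a toy of the idea, not Ellis's implementation.

import numpy as np

def prediction_driven_step(observation, components):
    # Each component model 'predicts' its share of the observed spectrum;
    # the unexplained residual is not discarded but becomes a candidate
    # new hypothesis, mirroring the confirm-hypotheses flow described above.
    if components:
        prediction = np.sum(components, axis=0)
    else:
        prediction = np.zeros_like(observation)
    residual = np.maximum(observation - prediction, 0.0)
    if residual.sum() > 0.1 * observation.sum():  # threshold is an assumption
        components.append(residual.copy())        # hypothesize a new source
    return components, residual

# Example: a flat noise bed cannot explain a strong spectral peak, so the
# step posits a second component that covers it.
obs = np.ones(64)
obs[20] = 12.0
comps, resid = prediction_driven_step(obs, [np.ones(64)])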
PD APR PY 1999 VL 27 IS 3-4 BP 281 EP 298 DI 10.1016/S0167-6393(98)00083-1 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300009 ER PT J AU Okuno, HG Nakatani, T Kawabata, T AF Okuno, HG Nakatani, T Kawabata, T TI Listening to two simultaneous speeches SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN DE speech stream segregation; simultaneous speakers; auditory scene analysis AB Speech stream segregation is presented as a new speech enhancement for automatic speech recognition. Two issues are addressed: speech stream segregation from a mixture of sounds, and interfacing speech stream segregation with automatic speech recognition. Speech stream segregation is modeled as a process of extracting harmonic fragments, grouping these extracted harmonic fragments, and substituting non-harmonic residue for non-harmonic parts of a group. The main problem in interfacing speech stream segregation with hidden Markov model (HMM)-based speech recognition is how to reduce the degradation of recognition performance due to spectral distortion of segregated sounds, which is caused mainly by the transfer function of a binaural input. Our solution is to re-train the HMM parameters with training data binauralized for four directions. Experiments with 500 mixtures of two women's utterances of an isolated word showed that the error reduction rate of the 1-best/10-best word recognition of each woman's utterance is, on average, 64% and 75%, respectively. (C) 1999 Elsevier Science B.V. All rights reserved. C1 NTT Corp, NTT Basic Res Labs, Kanagawa 2430198, Japan. Japan Sci & Technol Corp, ERATO, Kitano Symbiot Syst Project, Tokyo 150001, Japan. Nippon Telegraph & Tel Corp, Multimedia Business Dev Ctr, Tokyo, Japan. Nippon Telegraph & Tel Corp, Cyber Sci Labs, Kanagawa, Japan. RP Okuno, HG (reprint author), NTT Corp, NTT Basic Res Labs, 3-1 Morinosato Wakamiya, Kanagawa 2430198, Japan. EM okuno@nue.org CR Blauert J., 1983, SPATIAL HEARING PSYC Bodden M., 1993, Acta Acustica, V1 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1992, THESIS U SHEFFIELD BROWN GJ, 1992, P ICSLP 92, P523 CHERRY EC, 1953, J ACOUST SOC AM, V25, P975, DOI 10.1121/1.1907229 COOKE M, 1993, ENDEAVOUR, V17, P186, DOI 10.1016/0160-9327(93)90061-7 Costas J. P., 1981, P 1 ASSP WORKSH SPEC, P651 Ellis D. P.
W., 1996, THESIS MIT ELLIS DPW, 1995, 1998 COMPUTATIONAL A FRASER NM, 1996, P 1996 INT S SPOK DI, P25 GREEN PD, 1995, P 1995 IEEE INT C AC, V1, P401 Handel S, 1989, LISTENING INTRO PERC Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P549 INOUE M, 1997, EUROSPEECH97, P331 Kashino M., 1996, J ACOUST SOC AM, V99, P2596 Kita K., 1990, Transactions of the Information Processing Society of Japan, V31 LESSER V, 1993, PROCEEDINGS OF THE ELEVENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, P249 MINAMI Y, 1995, P 1995 IEEE INT C AC, V1, P129 NAKATANI T, 1997, P IJCAI 97 WORKSH CO, P25 NAKATANI T, 1995, P 1995 INT C AC SPEE, V4, P2671 NAKATANI T, 1995, P 14 INT JOINT C ART, V1, P165 Nakatani T, 1999, SPEECH COMMUN, V27, P209, DOI 10.1016/S0167-6393(98)00079-X NAKATANI T, 1996, P 1996 IEEE INT C AC, V2, P653 NAKATANI T, 1994, PROCEEDINGS OF THE TWELFTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOLS 1 AND 2, P100 NAWAB SH, 1992, SYMBOLIC KNOWLEDGE B, P251 OKUNO HG, 1996, P INT C SPOK LANG P, V4, P2356, DOI 10.1109/ICSLP.1996.607281 OKUNO HG, 1997, P IJCAI 97 WORKSH CO, P61 OKUNO HG, 1997, P 15 INT JOINT C ART, V1, P30 OKUNO HG, 1996, P 14 NAT C ART INT A, V2, P1082 OKUNO HG, 1995, SYMBIOSIS HUMAN ARTI, V2, P503 PALLET DS, 1994, LANG TECH WORKSH, P15 RAMALINGAM CS, 1994, P 1994 IEEE INT C AC, V1, P473 Rosenthal D. F., 1998, COMPUTATIONAL AUDITO SELMAN B, 1996, P 13 NAT C ART INT A STADLER RW, 1993, J ACOUST SOC AM, V94, P1332, DOI 10.1121/1.408161 SULLIVAN TM, 1993, P 1993 IEEE INT C AC WARREN RM, 1970, SCIENCE, V167, P392, DOI 10.1126/science.167.3917.392 NR 38 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 299 EP 310 DI 10.1016/S0167-6393(98)00080-6 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300010 ER PT J AU Goto, M Muraoka, Y AF Goto, M Muraoka, Y TI Real-time beat tracking for drumless audio signals: Chord change detection for musical decisions SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN DE beat tracking; rhythm perception; chord change detection; music understanding; computational auditory scene analysis ID RHYTHM AB This paper describes a real-time beat-tracking system that detects a hierarchical beat structure in musical audio signals without drum-sounds. Most previous systems have dealt with MIDI signals and had difficulty in applying, in real time, musical heuristics to audio signals containing sounds of various instruments and in tracking beats above the quarter-note level. Our system not only tracks beats at the quarter-note level but also detects beat structure at the half-note and measure levels. To make musical decisions about the audio signals, we propose a method of detecting chord changes that does not require chord names to be identified. The method enables the system to track beats at different rhythmic levels - for example, to find the beginnings of half notes and measures - and to select the best of various hypotheses about beat positions. Experimental results show that the proposed method was effective in detecting the beat structure in real-world audio signals sampled from compact discs of popular music. (C) 1999 Elsevier Science B.V. All rights reserved.
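The chord-change detection described in the Goto and Muraoka abstract above works without naming chords; a minimal reading of that idea is to compare spectral content on either side of each provisional beat. The Python sketch below assumes a frames-by-bins power spectrogram and increasing beat frame indices; the cosine-dissimilarity score is an illustrative stand-in for the paper's actual detector.

import numpy as np

def chord_change_scores(power_spec_frames, beat_frames):
    # Compare the average spectrum on either side of each provisional beat;
    # a low cosine similarity across the boundary suggests the underlying
    # chord changed there, without ever naming the chord.
    scores = []
    for b0, b1, b2 in zip(beat_frames[:-2], beat_frames[1:-1], beat_frames[2:]):
        left = power_spec_frames[b0:b1].mean(axis=0)
        right = power_spec_frames[b1:b2].mean(axis=0)
        cos = np.dot(left, right) / (np.linalg.norm(left) * np.linalg.norm(right) + 1e-12)
        scores.append(1.0 - cos)
    return np.array(scores)

# Example: an abrupt spectral shift halfway through yields a peak score at
# the boundary beat.
spec = np.vstack([np.tile([1.0, 0.1, 0.1], (40, 1)), np.tile([0.1, 1.0, 0.1], (40, 1))])
print(chord_change_scores(spec, [0, 20, 40, 60, 80]))

In the system described above, such scores would then serve as one cue for preferring beat hypotheses whose chord changes align with half-note and measure boundaries.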
C1 Waseda Univ, Sch Sci & Engn, Shinjuku Ku, Tokyo 1698555, Japan. RP Goto, M (reprint author), Electrotech Lab, Machine Understanding Div, 1-1-4 Ume Zono, Tsukuba, Ibaraki 3058568, Japan. EM goto@etl.go.jp RI Goto, Masataka/K-8205-2012 OI Goto, Masataka/0000-0003-1167-0977 CR Allen P.E., 1990, P 1990 INT COMP MUS, P140 DANNENBERG RB, 1987, P 1987 INT COMP MUS, P241 DESAIN P, 1992, P 1992 INT COMP MUS, P42 DESAIN P, 1995, IJCAI 95 WORKSH ART, P1 DESAIN P, 1989, COMPUT MUSIC J, V13, P56, DOI 10.2307/3680012 DESAIN P, 1994, P 1994 INT COMP MUS, P92 DRIESSE A, 1991, P 1991 INT COMP MUS, P578 Goto M., 1995, P 1995 INT COMP MUS, P171 GOTO M, 1997, IJCAI 97 WORKSH ISS, P9 Goto M., 1996, P 2 INT C MULT SYST, P103 GOTO M, 1995, IJCAI 95 WORKSH COMP, P68 Goto M., 1994, Proceedings ACM Multimedia '94, DOI 10.1145/192593.192700 ISHIHATA H, 1991, IEEE PACIF, P13, DOI 10.1109/PACRIM.1991.160669 KATAYOSE H, 1989, PROCEEDINGS : 1989 INTERNATIONAL COMPUTER MUSIC CONFERENCE, NOVEMBER 2-5, P139 Large E. W., 1995, IJCAI WORKSH AI MUS, P24 Lee C.S., 1985, MUSICAL STRUCTURE CO, P53 ROSENTHAL D, 1992, THESIS MIT MA ROSENTHAL D, 1992, COMPUT MUSIC J, V16, P64, DOI 10.2307/3680495 Rowe R., 1993, INTERACTIVE MUSIC SY SCHEIRER ED, 1996, UNPUB USING BANDPASS Schloss W. A, 1985, THESIS STANFORD U SMITH LM, 1996, P 1996 INT COMP MUS, P392 TODD MPM, 1994, J NEW MUSIC RES, V23, P25 TODD NPM, 1995, J ACOUST SOC AM, V97, P1940, DOI 10.1121/1.412067 Todd NPM, 1996, ARTIF INTELL REV, V10, P253 VERCOE B, 1994, P 3 INT C PERC COGN, P59 NR 26 TC 20 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 311 EP 335 DI 10.1016/S0167-6393(98)00076-4 PG 25 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300011 ER PT J AU Kashino, K Murase, H AF Kashino, K Murase, H TI A sound source identification system for ensemble music based on template adaptation and music stream extraction SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN DE sound source identification; music recognition; template adaptation; music stream; probabilistic network ID SEPARATION; NETWORKS AB Sound source identification is an important problem in auditory scene analysis when multiple sound objects are simultaneously present in the scene. This paper proposes an adaptive method for sound source identification that is applicable to real performances of ensemble music. For musical sound source identification, feature-based methods and template-matching-based methods have already been proposed. However, it is difficult to extract features of a single note from a sound mixture. In addition, sound variability has been a problem when dealing with real music performances. Thus this paper proposes an adaptive method for template matching that can cope with variability in musical sounds. The method is based on matched filtering and does not require a feature extraction process. Moreover, this paper discusses musical context integration based on Bayesian probabilistic networks. Evaluations using recordings of real ensemble performances have revealed that the proposed method improves the source identification accuracy from 60.8% to 88.5% on average. (C) 1999 Elsevier Science B.V. All rights reserved.
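The template-adaptation step described in the Kashino and Murase abstract above can be sketched as matched filtering with a per-template gain fitted by least squares before scoring. Everything in the sketch below (waveform templates, gain-only adaptation) is a simplifying assumption, and the Bayesian musical-context integration is omitted.

import numpy as np

def identify_note(mixture, templates):
    # Matched filtering with gain-only template adaptation: each template's
    # amplitude is fitted to the mixture by least squares, and the source
    # whose adapted template leaves the smallest residual is reported.
    best_name, best_score = None, -np.inf
    for name, tpl in templates.items():
        gain = np.dot(mixture, tpl) / (np.dot(tpl, tpl) + 1e-12)
        residual = mixture - gain * tpl
        score = -np.sum(residual ** 2)
        if score > best_score:
            best_name, best_score = name, score
    return best_name

# Example: the mixture is a scaled copy of template 'a', so 'a' is returned.
tpls = {'a': np.sin(np.linspace(0, 20, 256)), 'b': np.cos(np.linspace(0, 20, 256))}
print(identify_note(0.7 * tpls['a'], tpls))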
C1 Nippon Telegraph & Tel Corp, Basic Res Labs, Atsugi, Kanagawa 2430198, Japan. RP Kashino, K (reprint author), Nippon Telegraph & Tel Corp, Basic Res Labs, 3-1 Morinosato Wakamiya, Atsugi, Kanagawa 2430198, Japan. EM kunio@ca-sunl.brl.ntt.co.jp CR BELL AJ, 1995, NEURAL COMPUT, V7, P1129, DOI 10.1162/neco.1995.7.6.1129 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, J NEW MUSIC RES, V23, P107, DOI 10.1080/09298219408570651 Cardoso Jean-Francois, 1989, P IEEE INT C AC SPEE, P2109 Chafe C., 1986, P IEEE INT C AC SPEE, P1289 Cooke M., 1991, THESIS U SHEFFIELD COSI P, 1994, J NEW MUSIC RES, V23, P71, DOI 10.1080/09298219408570648 Ellis D. P. W., 1996, THESIS MIT FLANAGAN JL, 1985, J ACOUST SOC AM, V78, P1508, DOI 10.1121/1.392786 Kashino K., 1993, P INT COMP MUS C, P248 Kashino K., 1995, COMP AUD SCEN AN WOR, P32 KASHINO K, 1995, P INT JOINT C ART IN, V1, P158 KATAYOSE H, 1989, P INT C MUS PERC COG, P95 LEE TW, 1997, P IEEE INT C AC SPEE, P1199 LESSER V, 1993, PROCEEDINGS OF THE ELEVENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, P249 Maes P., 1990, DESIGNING AUTONOMOUS MITCHELL OM, 1971, J ACOUST SOC AM, V50, P656, DOI 10.1121/1.1912680 MONTREYNAUD B, 1985, P IJCAI, P916 NAKATANI T, 1995, P 14 INT JOINT C ART, V1, P165 NEHORAI A, 1986, IEEE T ACOUST SPEECH, V34, P1124, DOI 10.1109/TASSP.1986.1164952 NIIHARA T, 1986, P ICASSP 86, P1277 PARSONS TW, 1976, J ACOUST SOC AM, V60, P911, DOI 10.1121/1.381172 PEARL J, 1986, ARTIF INTELL, V29, P241, DOI 10.1016/0004-3702(86)90072-X Piszczalski M., 1977, COMPUT MUSIC J, V1, P24 NR 24 TC 34 Z9 35 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 337 EP 349 DI 10.1016/S0167-6393(98)00078-8 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300012 ER PT J AU Godsmark, D Brown, GJ AF Godsmark, D Brown, GJ TI A blackboard architecture for computational auditory scene analysis SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 2nd Workshop on Computational Auditory Scene Analysis at the International Joint Conference on Artificial Intelligence CY AUG, 1997 CL NAGOYA, JAPAN DE auditory scene analysis; computer model; blackboard ID STREAM SEGREGATION; SOUNDS; NOISE AB A challenging problem for research in computational auditory scene analysis is the integration of evidence derived from multiple grouping principles. We describe a computational model which addresses this issue through the use of a 'blackboard' architecture. The model integrates evidence from multiple grouping principles at several levels of abstraction, and manages competition between principles in a manner that is consistent with psychophysical findings. In addition, the blackboard architecture allows heuristic knowledge to influence the organisation of an auditory scene. We demonstrate that the model can replicate listeners' perception of interleaved melodies, and is also able to segregate melodic lines from polyphonic, multi-timbral audio recordings. The applicability of the blackboard architecture to speech processing tasks is also discussed. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Brown, GJ (reprint author), Univ Sheffield, Dept Comp Sci, Regent Court,211 Portobello St, Sheffield S1 4DP, S Yorkshire, England. 
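A blackboard of the kind described in the Godsmark and Brown abstract above can be caricatured as shared hypotheses scored by independent knowledge sources. The toy loop below assumes numeric events and hand-written scoring functions standing in for the paper's grouping principles; it illustrates only the evidence-integration pattern, not the published system.

def run_blackboard(events, knowledge_sources, hypotheses):
    # Shared board of organisation hypotheses; each knowledge source (one
    # per grouping principle) independently adds evidence, and the most
    # supported hypothesis wins.
    board = {h: 0.0 for h in hypotheses}
    for event in events:
        for ks in knowledge_sources:
            for h in board:
                board[h] += ks(event, h)
    return max(board, key=board.get)

# Example: a pitch-proximity source votes for splitting a wide-interval
# tone sequence into two streams; a weak temporal source prefers one stream.
by_pitch = lambda e, h: 1.0 if h == 'two_streams' and abs(e) > 5 else 0.0
by_time = lambda e, h: 0.5 if h == 'one_stream' else 0.0
print(run_blackboard([7, -6, 8], [by_pitch, by_time], ['one_stream', 'two_streams']))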
EM g.brown@dcs.shef.ac.uk CR BREGMAN AS, 1998, COMPUTATIONAL AUDITO BREGMAN AS, 1989, PERCEPT PSYCHOPHYS, V46, P395, DOI 10.3758/BF03204994 Bregman AS., 1990, AUDITORY SCENE ANAL BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 BROWN GJ, 1994, J NEW MUSIC RES, V23, P107, DOI 10.1080/09298219408570651 COOKE M, 1993, ENDEAVOUR, V17, P186, DOI 10.1016/0160-9327(93)90061-7 Cooke M., 1993, MODELLING AUDITORY P DEUTSCH D, 1980, PERCEPT PSYCHOPHYS, V28, P381, DOI 10.3758/BF03204881 Ellis D. P. W., 1996, THESIS MIT Engelmore R.S., 1988, BLACKBOARD SYSTEMS ERMAN LD, 1980, ACM COMPUT SURV, V12, P213, DOI 10.1145/356810.356816 Foley J., 1996, COMPUTER GRAPHICS PR GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T GODSMARK D, 1998, COMPUTATIONAL AUDITO GODSMARK D, 1996, P I ACOUSTICS, V18, P11 GODSMARK D, 1996, P I ACOUSTICS, V8, P83 GODSMARK D, 1998, THESIS U SHEFFIELD HANDEL S, 1995, HDB PERCEPTION COGNI HARTMANN WM, 1991, MUSIC PERCEPT, V9, P155 *ISO, 1988, 226 ISO IVERSON P, 1993, J ACOUST SOC AM, V94, P2595, DOI 10.1121/1.407371 Kashino K., 1992, Annual Report of the Engineering Research Institute, Faculty of Engineering, University of Tokyo, V51 KASHINO K, 1998, COMPUTATIONAL AUDITO KLUENDER KR, 1992, PERCEPT PSYCHOPHYS, V51, P231, DOI 10.3758/BF03212249 LESSER VR, 1995, ARTIF INTELL, V77, P129, DOI 10.1016/0004-3702(94)00033-W MELLINGER DK, 1991, THESIS STANFORD U MILLER GA, 1950, J ACOUST SOC AM, V22, P637, DOI 10.1121/1.1906663 NAKATANI T, 1998, COMPUTATIONAL AUDITO NII P, 1988, BLACKBOARD SYSTEMS Patterson R.D., 1988, 2341 APU ROGERS WL, 1993, PERCEPT PSYCHOPHYS, V53, P179, DOI 10.3758/BF03211728 ROSENTHAL D, 1992, COMPUT MUSIC J, V16, P64, DOI 10.2307/3680495 VANNOORDEN LPA, 1975, THESIS U EINDHOVEN WARREN RM, 1984, PSYCHOL BULL, V96, P371 NR 34 TC 29 Z9 31 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1999 VL 27 IS 3-4 BP 351 EP 366 DI 10.1016/S0167-6393(98)00082-X PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 184YL UT WOS:000079641300013 ER PT J AU Arslan, LM Talkin, D AF Arslan, LM Talkin, D TI Codebook based face point trajectory synthesis algorithm using speech input SO SPEECH COMMUNICATION LA English DT Article DE principal component analysis; 3-D; face synthesis; lip synching; codebook; audio-visual; visual similarity matrix AB This paper presents a novel algorithm which generates three-dimensional face point trajectories for a given speech file with or without its text. The proposed algorithm first employs an off-line training phase. In this phase, recorded face point trajectories along with their speech data and phonetic labels are used to generate phonetic codebooks. These codebooks consist of both acoustic and visual features. Acoustics are represented by line spectral frequencies (LSF), and face points are represented with their principal components (PC). During the synthesis stage, speech input is rated in terms of its similarity to the codebook entries. Based on the similarity, each codebook entry is assigned a weighting coefficient. If the phonetic information about the test speech is available, this is utilized in restricting the codebook search to only several codebook entries which are visually closest to the current phoneme (a visual phoneme similarity matrix is generated for this purpose). 
Then these weights are used to synthesize the principal components of the face point trajectory. The performance of the algorithm is tested on held-out data, and the synthesized face point trajectories showed a correlation of 0.73 with true face point trajectories. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Entrop Res Lab, Washington, DC 20001 USA. RP Arslan, LM (reprint author), Bogazici Univ, Elect & Elect Dept, TR-80815 Bebek, Turkey. EM arslanle@boun.edu.tr RI Arslan, Levent/D-6377-2015 OI Arslan, Levent/0000-0002-6086-8018 CR Arslan L. M., 1997, P EUROSPEECH RHOD GR, V3, P1347 ARSLAN LM, 1995, P IEEE INT C AC SPEE, V1, P812 Auer E.T., 1997, P ESCA ESCOP WORKSH, P21 BERTENSTAM J, 1995, P SPOK DIAL SYST VIG Beskow J., 1997, P ESCA WORKSH AUD VI, P149 BREGLER C, 1997, P WORKSH AUD VIS SPE, P153 CASSEL JM, 1994, P 16 ANN C COGN SCI CROSMER JR, 1985, THESIS GEORGIA I TEC Deller J. R., 1993, DISCRETE TIME PROCES Fukunaga K., 1990, STAT PATTERN RECOGNI, V2nd KATASHI N, 1994, P 32 ANN M ASS COMP, P102 Kleijn W. B., 1995, SPEECH CODING SYNTHE Laroia R., 1991, P IEEE INT C AC SPEE, P641, DOI 10.1109/ICASSP.1991.150421 WIGHTMAN C, 1994, ALIGNER USERS MANUAL YEHIA H, 1997, P WORKSH AUD VIS SPE, P41 Yehia H, 1998, SPEECH COMMUN, V26, P23, DOI 10.1016/S0167-6393(98)00048-X NR 16 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 1999 VL 27 IS 2 BP 81 EP 93 DI 10.1016/S0167-6393(98)00068-5 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 175JX UT WOS:000079091000001 ER PT J AU van Kuijk, D Boves, L AF van Kuijk, D Boves, L TI Acoustic characteristics of lexical stress in continuous telephone speech SO SPEECH COMMUNICATION LA English DT Article DE Dutch; prosody; lexical stress; automatic vowel classification; automatic speech recognition AB In this paper we investigate acoustic differences between vowels in syllables that do or do not carry lexical stress. In doing so, we concentrated on segmental acoustic phonetic features that are conventionally assumed to differ between stressed and unstressed syllables, viz. Duration, Energy and Spectral Tilt. The speech material in this study differs from the type of material used in previous research: instead of specially constructed sentences we used phonetically rich sentences from the Dutch POLYPHONE corpus. Most of the Duration, Energy and Spectral Tilt features that we used in the investigation show statistically significant differences for the population means of stressed and unstressed vowels. However, it also appears that the distributions overlap to such an extent that automatic detection of stressed and unstressed syllables yields correct classifications of 72.6% at best. It is argued that this result is due to the large variety in the ways in which the abstract linguistic feature 'lexical stress' is realized in the acoustic speech signal. Our findings suggest that a lexical stress detector has little use for a single pass decoder in an automatic speech recognition (ASR) system, but could still play a useful role as an additional knowledge source in a multi-pass decoder. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Nijmegen, Dept Language & Speech, Nijmegen, Netherlands. Max Planck Inst Psycholinguist, Nijmegen, Netherlands. RP van Kuijk, D (reprint author), Roukensstr 40, NL-6521 BP Nijmegen, Netherlands. EM david@lcn.nl CR Baayen R.
H., 1993, CELEX LEXICAL DATABA DENOS EA, 1995, P 13 INT C PHON SCI, V3, P536 DENOS EA, 1995, P EUR 95, P825 DUMOUCHEL P, 1993, P EUR 93, P2195 HIERONYMUS JL, 1992, P INT C AC SPEECH SI, P225, DOI 10.1109/ICASSP.1992.225931 LEHISTE I, 1959, J ACOUST SOC AM, V31, P428, DOI 10.1121/1.1907729 PIERREHUMBERT J, 1994, VOCAL FOLD PHYSL, V8 RADEAU M, 1990, SPEECH COMMUN, V9, P155, DOI 10.1016/0167-6393(90)90068-K Sluijter AMC, 1996, J ACOUST SOC AM, V100, P2471, DOI 10.1121/1.417955 VANBERGEM DR, 1993, SPEECH COMMUN, V12, P1, DOI 10.1016/0167-6393(93)90015-D van Kuijk D., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607963 VANWIJK C, 1980, ITL REV APPL LINGUIS, V47, P53 WAIBEL A, 1986, P ICASSP, P2287 Wang X., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607818 WIGHTMAN CW, 1992, THESIS BOSTON U Ying G. S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607932 NR 16 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 1999 VL 27 IS 2 BP 95 EP 111 DI 10.1016/S0167-6393(98)00069-7 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 175JX UT WOS:000079091000002 ER PT J AU Chung, GY Seneff, S AF Chung, GY Seneff, S TI A hierarchical duration model for speech recognition based on the ANGIE framework SO SPEECH COMMUNICATION LA English DT Article DE duration modelling; prosodic modelling; speech recognition ID SEGMENTAL DURATIONS; AMERICAN ENGLISH; VOWEL DURATION AB This paper presents a hierarchical duration model applied to enhance speech recognition. The model is based on the novel ANGIE framework which is a flexible unified sublexical representation designed for speech applications. This duration model captures duration phenomena operating at the phonological, phonemic, syllabic and morphological levels. At the core of the modelling scheme is a hierarchical normalization procedure performed on the ANGIE parse structure. From this, we derive a robust measure for the rate of speech. The model uses two sets of statistical models - a first set based on relative duration between sublexical units and a second set based on absolute duration that has been normalized with respect to the speaking rate. We have used this paradigm to explore some speech timing phenomena such as the secondary effects on relative duration due to variations in speaking rate, the characteristics of anomalously slow words, and prepausal lengthening effects. Finally, we successfully demonstrate the utility of durational information for recognition applications. In phonetic recognition, we achieve a relative improvement of up to 7.7% by incorporating our model over and above a standard phone duration model, and similarly, in a word spotting task, an improvement from 89.3 to 91.6 (FOM) has resulted. (C) 1999 Elsevier Science B.V. All rights reserved. C1 MIT, Comp Sci Lab, Spoken Language Syst Grp, Cambridge, MA 02139 USA. RP Chung, GY (reprint author), MIT, Comp Sci Lab, Spoken Language Syst Grp, Cambridge, MA 02139 USA. EM graceyc@mit.edu CR Campbell W. 
N., 1992, TALKING MACHINES THE, P211 CAMPBELL WN, 1991, J PHONETICS, V19, P37 CHUNG G, 1997, THESIS MIT CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1553, DOI 10.1121/1.395911 Dahl D.A., 1994, P ARPA WORKSH HUM LA, P43, DOI 10.3115/1075812.1075823 HOUSE AS, 1961, J ACOUST SOC AM, V33, P1174, DOI 10.1121/1.1908941 KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986 LAU R, 1997, P EUR 97 RHOD GREEC, P263 LAU R, 1998, THESIS MIT Pallett D. S., 1995, P SPOK LANG TECHN WO, P5 PITRELLI JF, 1990, THESIS MIT *ROAD RALL WORD SP, 1991, 611 ROAD RALL WORD S SENEFF S, 1996, P IC SLP PHIL PA OCT, V1, P110, DOI 10.1109/ICSLP.1996.607049 UMEDA N, 1975, J ACOUST SOC AM, V58, P434, DOI 10.1121/1.380688 UMEDA N, 1977, J ACOUST SOC AM, V61, P846, DOI 10.1121/1.381374 VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5 WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 ZUE V, 1994, P ARPA SPOK LANG TEC, P67 NR 18 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 1999 VL 27 IS 2 BP 113 EP 134 DI 10.1016/S0167-6393(98)00071-5 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 175JX UT WOS:000079091000003 ER PT J AU Kuroiwa, S Naito, M Yamamoto, S Higuchi, N AF Kuroiwa, S Naito, M Yamamoto, S Higuchi, N TI Robust speech detection method for telephone speech recognition system SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; telephone; endpoint detection; irrelevant sounds; garbage model AB This paper describes speech endpoint detection methods for continuous speech recognition systems used over telephone networks. Speech input to these systems may be contaminated not only by various ambient noises but also by various irrelevant sounds generated by users such as coughs, tongue clicking, lip noises and certain out-of-task utterances. Under these adverse conditions, robust speech endpoint detection remains an unsolved problem. We found in fact, that speech endpoint detection errors occurred in over 10% of the inputs in field trials of a voice activated telephone extension system. These errors were caused by problems of (1) low SNR, (2) long pauses between phrases and (3) irrelevant sounds prior to task sentences. To solve the first two problems, we propose a real-time speech ending point detection algorithm based on the implicit approach, which finds a sentence end by comparing the likelihood of a complete sentence hypothesis and other hypotheses. For the third problem, we propose a speech beginning point detection algorithm which rejects irrelevant sounds by using likelihood ratio and duration conditions. The effectiveness of these methods was evaluated under various conditions. As a result, we found that the ending point detection algorithm was not affected by long pauses and that the beginning point detection algorithm successfully rejected irrelevant sounds by using phone HMMs that fit the task. Furthermore, a garbage model of irrelevant sounds was also evaluated and we found that the garbage modeling technique and the proposed method compensated each other in their respective weak points and that the best recognition accuracy was achieved by integrating these methods. (C) 1999 Elsevier Science B.V. All rights reserved. C1 KDD R&D Labs Inc, Kamifukuoka, Saitama 3566502, Japan. 
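The beginning-point rule described in the Kuroiwa et al. abstract above combines a likelihood ratio against a garbage model with a duration condition. The sketch below assumes per-frame log likelihoods already computed from speech and garbage models; the threshold and minimum-duration values are placeholders, not the paper's tuned settings.

def detect_beginning(frame_loglik_speech, frame_loglik_garbage, min_dur=30, thresh=0.0):
    # A frame is speech-like when the log likelihood ratio of the task
    # speech models over the garbage model exceeds the threshold; a
    # beginning point is declared only after min_dur consecutive such
    # frames, so short coughs, clicks and lip noises are rejected.
    run = 0
    for i, (ls, lg) in enumerate(zip(frame_loglik_speech, frame_loglik_garbage)):
        run = run + 1 if (ls - lg) > thresh else 0
        if run >= min_dur:
            return i - min_dur + 1
    return None  # no acceptable beginning point found

# Example: garbage-like frames followed by sustained speech-like frames.
sp = [0.0] * 20 + [1.0] * 40
gb = [0.5] * 60
print(detect_beginning(sp, gb, min_dur=10))  # -> 20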
RP Kuroiwa, S (reprint author), KDD R&D Labs Inc, 2-1-15 Ohara, Kamifukuoka, Saitama 3566502, Japan. EM kuroiwa@lab.kdd.co.jp CR ACERO A, 1993, P EUR 93 BERL 21 23, V3, P1551 FUJIOKA M, 1997, IN97116 IEICE, P33 INOUE N, 1994, IEICE T A, V77, P215 Johnston D, 1997, SPEECH COMMUN, V23, P5, DOI 10.1016/S0167-6393(97)00050-2 JUNQUA JC, 1996, ROBUSTNESS AUTOMATIC, P173 Kitai M, 1997, SPEECH COMMUN, V23, P17, DOI 10.1016/S0167-6393(97)00044-7 KUREMATSU A, 1990, SPEECH COMMUN, V9, P357, DOI 10.1016/0167-6393(90)90011-W KUROIWA S, 1993, P INT S SPOK DIAL TO, P25 KUROIWA S, 1997, P AUT M ASJ, V1, P159 KUROIWA S, 1995, IEICE T INF SYST, VE78D, P636 KUROIWA S, 1995, P AUT M ASJ, V1, P5 LAMEL LF, 1981, IEEE T ACOUST SPEECH, V29, P777, DOI 10.1109/TASSP.1981.1163642 LENNIG M, 1995, SPEECH COMMUN, V17, P227, DOI 10.1016/0167-6393(95)00024-I MORISHIMA M, 1997, P 1997 IEEE WORKSH A, P436 NADEU C, 1995, P EUR 95 MADR 18 21, V2, P923 NAITO M, 1995, P ESCA WORKSH SPOK D, P129 NAITO M, 1997, IEICE T D, V80, P2895 RABINER LR, 1997, P 1997 IEEE WORKSH A, P501 Rose R., 1990, P INT C AC SPEECH SI, V1, P129 Rose R. C., 1996, AUTOMATIC SPEECH SPE, P303 ROSE RC, 1995, COMPUTER SPEECH LANG, V9, P303 SCHULTZ T, 1995, P ICASSP, V1, P293 TAKEDA K, 1995, P EUROSPEECH95, V2, P1075 TAKEDA K, 1991, P KOR JAP JOINT WORK, P62 Watanabe T., 1992, IEICE T D, V75-D2, P2002 Wilpon J. G., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90015-5 WILPON JG, 1990, IEEE T ACOUST SPEECH, V38, P1870, DOI 10.1109/29.103088 NR 27 TC 9 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAR PY 1999 VL 27 IS 2 BP 135 EP 148 DI 10.1016/S0167-6393(98)00072-7 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 175JX UT WOS:000079091000004 ER PT J AU Feijoo, S Fernandez, S Balsa, R AF Feijoo, S Fernandez, S Balsa, R TI Acoustic and perceptual study of phonetic integration in Spanish voiceless stops SO SPEECH COMMUNICATION LA English DT Article DE perceptual identification; acoustic analysis; voiceless stops; phonetic integration; acoustic-phonetics; speech recognition ID SPECTRAL-SHAPE-FEATURES; SPEECH-PERCEPTION; CONSONANT RECOGNITION; VOWEL IDENTIFICATION; FORMANT TRANSITIONS; WORD RECOGNITION; INVARIANT CUES; ARTICULATION; PLACE; CLASSIFICATION AB The relationship between the acoustic content and the perceptual identification of Spanish voiceless stops, /p/, /t/ and /k/, in word initial position, has been studied in two different conditions: (a) isolated plosive noise (C condition); (b) Plosive noise plus 51.2 ms of the following vowel (CV condition). The purpose of the study was to assess whether there was a clear correspondence between the perceptual identification made by listeners and the acoustic classification performed using a spectral representation combined with the duration, energy and zero-crossings of the plosive noise. The acoustic classification was represented by a distance profile, formed by the acoustic distances between a given token and the three classes corresponding to the three stops. The acoustic distances were defined as the a posteriori probabilities of membership in each class (APP scores). The perceptual identification was represented by a response profile, formed by the number of listeners' responses assigned to each of the classes for a given token. 
The correlation between the acoustic and perceptual distances increased from the C condition (overall correlation 0.81) to the CV condition (0.95), indicating that as the signal is more precisely defined from the perceptual point of view, the acoustic content is also less ambiguous, since both the perceptual and acoustic classifications improve in the CV condition. The best correlations were achieved when the variables obtained in the temporal domain (duration, energy and zero-crossings of the plosive noise) were included in the analysis. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Santiago, Fac Fis, Dept Fis Aplicada, E-15706 Santiago, Spain. RP Feijoo, S (reprint author), Univ Santiago, Fac Fis, Dept Fis Aplicada, E-15706 Santiago, Spain. EM fasergio@uscmail.usc.es CR ASSMANN PF, 1982, J ACOUST SOC AM, V71, P975, DOI 10.1121/1.387579 BLUMSTEIN SE, 1979, J ACOUST SOC AM, V66, P1001, DOI 10.1121/1.383319 Bonneau A, 1996, J ACOUST SOC AM, V100, P555, DOI 10.1121/1.415866 COLE RA, 1974, PSYCHOL REV, V81, P348, DOI 10.1037/h0036656 COLE RA, 1974, PERCEPT PSYCHOPHYS, V15, P101, DOI 10.3758/BF03205836 CRYSTAL TH, 1988, J PHONETICS, V16, P285 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DENG L, 1994, J ACOUST SOC AM, V96, P2008, DOI 10.1121/1.410144 DORMAN MF, 1977, PERCEPT PSYCHOPHYS, V22, P109, DOI 10.3758/BF03198744 FEIJOO S, 1995, P 6 SPAN S PATT REC, P279 FORREST K, 1988, J ACOUST SOC AM, V84, P115, DOI 10.1121/1.396977 FUKUNAGA K, 1972, STAT PATTERN RECOGNI FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 FURUI S, 1986, J ACOUST SOC AM, V80, P1016, DOI 10.1121/1.393842 HALLE M, 1957, J ACOUST SOC AM, V29, P107, DOI 10.1121/1.1908634 Hedrick MS, 1996, J ACOUST SOC AM, V100, P3398, DOI 10.1121/1.416981 JONGMAN A, 1991, J ACOUST SOC AM, V89, P867, DOI 10.1121/1.1894648 JONGMAN A, 1985, J PHONETICS, V13, P235 KEWLEYPORT D, 1982, J ACOUST SOC AM, V72, P379, DOI 10.1121/1.388081 KOBATAKE H, 1987, J ACOUST SOC AM, V81, P1146, DOI 10.1121/1.394635 KRULL D, 1990, J ACOUST SOC AM, V88, P2557, DOI 10.1121/1.399977 LARIVIERE C, 1975, J ACOUST SOC AM, V57, P470, DOI 10.1121/1.380470 Markel JD, 1976, LINEAR PREDICTION SP NEAREY TM, 1986, J ACOUST SOC AM, V80, P1297, DOI 10.1121/1.394433 NOCERINO N, 1985, SPEECH COMMUN, V4, P317, DOI 10.1016/0167-6393(85)90057-3 NOSSAIR ZB, 1991, J ACOUST SOC AM, V89, P2978, DOI 10.1121/1.400735 NYGAARD LC, 1995, RECENT ADV SPEECH UN OHDE RN, 1983, J ACOUST SOC AM, V74, P706, DOI 10.1121/1.389856 OHDE RN, 1988, J ACOUST SOC AM, V84, P1551, DOI 10.1121/1.396603 QUILIS A, 1989, FONETICA ACUSTICA LE REPP BH, 1988, LANG SPEECH, V31, P239 SEARLE CL, 1979, J ACOUST SOC AM, V65, P799, DOI 10.1121/1.382501 SHIKANO K, 1982, T IECE J D, V65, P535 Smits R, 1996, J ACOUST SOC AM, V100, P3852, DOI 10.1121/1.417241 STEVENS KN, 1978, J ACOUST SOC AM, V64, P1358, DOI 10.1121/1.382102 SUOMI K, 1987, J PHONETICS, V15, P85 TANAKA K, 1981, IEEE T ACOUST SPEECH, V29, P1117, DOI 10.1109/TASSP.1981.1163693 TEKIELI ME, 1979, J SPEECH HEAR RES, V22, P13 Torres MI, 1996, SPEECH COMMUN, V18, P369, DOI 10.1016/0167-6393(96)00025-8 VANTASELL DJ, 1987, J ACOUST SOC AM, V82, P1152, DOI 10.1121/1.395251 WHALEN DH, 1984, PERCEPT PSYCHOPHYS, V35, P49, DOI 10.3758/BF03205924 WINITZ H, 1972, J ACOUST SOC AM, V51, P1309, DOI 10.1121/1.1912976 ZAHORIAN SA, 1993, J ACOUST SOC AM, V94, P1966, DOI 10.1121/1.407520 NR 43 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 
1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1999 VL 27 IS 1 BP 1 EP 18 DI 10.1016/S0167-6393(98)00064-8 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 169UY UT WOS:000078768500001 ER PT J AU Reinhard, K Niranjan, M AF Reinhard, K Niranjan, M TI Parametric subspace modeling of speech transitions SO SPEECH COMMUNICATION LA English DT Article DE speech dynamics; diphones; subspace trajectory; GTM; principal curves; time-constraint PCA ID PROJECTION PURSUIT; RECOGNITION AB This paper describes an attempt at capturing segmental transition information for speech recognition tasks. The slowly varying dynamics of spectral trajectories carries much discriminant information that is very crudely modelled by traditional approaches such as HMMs. In approaches such as recurrent neural networks there is the hope, but not the convincing demonstration, that such transitional information could be captured. The method presented here starts from the very different position of explicitly capturing the trajectory of short time spectral parameter vectors on a subspace in which the temporal sequence information is preserved. This was approached by introducing a temporal constraint into the well known technique of Principal Component Analysis (PCA). On this subspace, an attempt of parametric modelling the trajectory was made, and a distance metric was computed to perform classification of diphones. Using the Principal Curves method of Hastie and Stuetzle and the Generative Topographic map (GTM) technique of Bishop, Svensen and Williams as description of the temporal evolution in terms of latent variables was performed. On the difficult problem of /bee/, /dee/, /gee/ it was possible to retain discriminatory information with a small number of parameters. Experimental illustrations present results on ISOLET and TIMIT database. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Reinhard, K (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England. EM kr10000@eng.cam.ac.uk CR AFIFY M, 1994, INT C SPOK LANG PROC, V1, P291 AFIFY M, 1995, EUROSPEECH 95, P515 AHLBOM G, 1987, INT C AC SPEECH SIGN, V1, P13 ATAL B, 1983, INT C AC SPEECH SIGN, V1, P81 BISHOP C, 1996, NCRG96015 AST U Bishop C. M., 1995, NEURAL NETWORKS PATT Bishop CM, 1997, ADV NEUR IN, V9, P354 Bishop CM, 1997, IEE CONF PUBL, P111, DOI 10.1049/cp:19970711 Bourlard Ha, 1994, CONNECTIONIST SPEECH CLEVELAND WS, 1979, J AM STAT ASSOC, V74, P368 COLE R, 1994, 90004 CSE OR GRAD I Deng L, 1994, IEEE T SPEECH AUDI P, V2, P507 Digalakis V, 1993, IEEE T SPEECH AUDI P, V1, P431, DOI 10.1109/89.242489 DIGALAKIS V, 1992, THESIS BOSTON U DIGALAKIS V, 1991, INT C AC SPEECH SIGN, V1, P289 Duda R. 
O., 1973, PATTERN CLASSIFICATI FISHER W, 1986, P SPEECH REC WORKSH FRIEDMAN JH, 1987, J AM STAT ASSOC, V82, P249, DOI 10.2307/2289161 FUKADA T, 1997, INT C AC SPEECH SIGN, V2, P1403 Garofolo J., 1988, GETTING STARTED DARP Ghitza O., 1993, COMPUTER SPEECH LANG, V2, P101 GISH H, 1996, INT C SPOK LANG PROC, V1, P466 Goldenthal W.D., 1994, THESIS MIT CAMBRIDGE GONG Y, 1996, INT C SPOK LANG PROC, V1, P334 GONG Y, 1994, P IEEE INT C AC SPEE, V1, P57 HASTIE T, 1989, J AM STAT ASSOC, V84, P502, DOI 10.2307/2289936 HOLMES W, 1997, INT C AC SPEECH SIGN, V2, P1399 HU Z, 1997, INT C AC SPEECH SIGN, V2, P979 HUBER PJ, 1985, ANN STAT, V13, P435, DOI 10.1214/aos/1176349519 KANNAN A, 1997, INT C AC SPEECH SIGN, V2, P1411 Lancaster P, 1986, CURVE SURFACE FITTIN MARCUS S, 1984, TEMPORAL DECOMPOSITI, P25 MARTEAU P, 1988, INT C ACOUSTICS SPEE, V1, P615 NIRANJAN M, 1987, EUR C SPEECH COMM TE, V1, P71 Oja E., 1983, SUBSPACE METHODS PAT Ostendorf M, 1996, IEEE T SPEECH AUDI P, V4, P360, DOI 10.1109/89.536930 RAYNER M, 1994, P 1994 ARPA WORKSH H REINHARD K, 1998, CUEDFINFENGTR308 CAM Reinhard K, 1997, IEE CONF PUBL, P257, DOI 10.1049/cp:19970736 ROBINSON A, 1996, EUR C SPEECH COMM TE, P1941 Robinson T., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90010-N Salza PL, 1996, ACUSTICA, V82, P650 SCHMID P, 1997, INT C AC SPEECH SIGN, V2, P991 SCHMID P, 1996, THESIS OREGON GRADUA SCHWARTZ R, 1979, INT C AC SPEECH SIGN, V1, P891 SUN D, 1997, INT C AC SPEECH SIGN, V3, P1751 Tibshirani R., 1992, Statistics and Computing, V2, DOI 10.1007/BF01889678 VALTCHEV V, 1995, THESIS CAMBRIDGE U WAIBEL A, 1989, IEEE T ACOUST SPEECH, V37, P328, DOI 10.1109/29.21701 YOUNG S, 1995, HTK BOOK VERSION 2 0 Young S.J., 1994, ARPA WORKSH HUM LANG, P307 ZUE V, 1991, LECT NOTES NR 52 TC 15 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1999 VL 27 IS 1 BP 19 EP 42 DI 10.1016/S0167-6393(98)00067-3 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 169UY UT WOS:000078768500002 ER PT J AU Arai, K Wright, JH Riccardi, G Gorin, AL AF Arai, K Wright, JH Riccardi, G Gorin, AL TI Grammar Fragment acquisition using syntactic and semantic clustering SO SPEECH COMMUNICATION LA English DT Article DE spoken language understanding; phrase similarity; the Kullback-Leibler distance; grammar fragments; call-type classification ID LANGUAGE AB A new method for automatically acquiring Fragments for understanding fluent speech is proposed. The goal of this method is to generate a collection of Fragments, each representing a set of syntactically and semantically similar phrases. First, phrases observed frequently in the training set are selected as candidates. Each candidate phrase has three associated probability distributions: of following contexts, of preceding contexts, and of associated semantic actions. The similarity between candidate phrases is measured by applying the Kullback-Leibler distance to these three probability distributions. Candidate phrases that are close in all three distances are clustered into a Fragment. Salient sequences of these Fragments are then automatically acquired, and exploited by a spoken language understanding module to classify calls in AT&T's "How may I help you?" task. These Fragments allow us to generalize unobserved phrases. For instance, they detected 246 phrases in the test-set that were not present in the training-set. 
This result shows that unseen phrases can be automatically discovered by our new method. Experimental results show that 2.8% of the improvement in call-type classification performance was achieved by introducing these Fragments. (C) 1999 Elsevier Science B.V. All rights reserved. C1 Nippon Telegraph & Tel Corp, Human Interface Labs, Kanagawa 2390847, Japan. AT&T Bell Labs, Florham Pk, NJ 07932 USA. RP Arai, K (reprint author), Nippon Telegraph & Tel Corp, Human Interface Labs, 1-1 Hikarinooka, Kanagawa 2390847, Japan. EM arai@nttspch.hil.ntt.co.jp; jwright@research.att.com; dsp3@research.att.com; algor@research.att.com RI riccardi, gabriele/A-9269-2012 CR ABELLA A, 1997, P EUR 97, V4, P1879 BELLEGARDA JR, 1996, P ICASSP 96, V1, P172 BOYCE S, 1996, P INT S SPOK DIAL IS, P65 Brown P. F., 1992, Computational Linguistics, V18 FARHAT A, 1996, P ICASSP 96, V1, P180 GIACHIN E, 1995, P ICASSP 95, V1, P225 GORIN A, 1995, J ACOUST SOC AM, V97, P3441, DOI 10.1121/1.412431 GORIN AL, 1996, P ICSLP 96, V2, P1001, DOI 10.1109/ICSLP.1996.607772 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X MASATAKI H, 1996, P INT C AC SPEECH SI, V1, P188 NEY H, 1993, P EUR 93, V3, P2239 Riccardi G, 1996, COMPUT SPEECH LANG, V10, P265, DOI 10.1006/csla.1996.0014 Riccardi G., 1997, P ICASSP, V2, P1143 SHARP RD, 1997, P ICASSP97, V5, P4065 WARD W, 1996, P ICASSP 96, V1, P416 WRIGHT J, 1997, P EUR 97, V3, P1419 NR 16 TC 14 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1999 VL 27 IS 1 BP 43 EP 62 DI 10.1016/S0167-6393(98)00065-X PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 169UY UT WOS:000078768500003 ER PT J AU Fukada, T Yoshimura, T Sagisaka, Y AF Fukada, T Yoshimura, T Sagisaka, Y TI Automatic generation of multiple pronunciations based on neural networks SO SPEECH COMMUNICATION LA English DT Article DE pronunciation dictionary; neural networks; spontaneous speech; speech recognition AB We propose a method for automatically generating a pronunciation dictionary based on a pronunciation neural network that can predict plausible pronunciations (realized pronunciations) from the canonical pronunciation. This method can generate multiple forms of realized pronunciations using the pronunciation network. For generating a sophisticated realized pronunciation dictionary, two techniques are described: (1) realized pronunciations with likelihoods and (2) realized pronunciations for word boundary phonemes. Experimental results on spontaneous speech show that the automatically derived pronunciation dictionaries give consistently higher recognition rates than a conventional dictionary, (C) 1999 Elsevier Science B.V. All rights reserved. C1 ATR Interpreting Telecommun Lab, Seika, Kyoto 6190288, Japan. RP Fukada, T (reprint author), ATR Interpreting Telecommun Lab, 2-2 Hikaridai, Seika, Kyoto 6190288, Japan. EM fukada@itl.atr.co.jp CR Bahl L. R., 1978, Proceedings of the 1978 IEEE International Conference on Acoustics, Speech and Signal Processing BYRNE B, 1997, P 1997 IEEE WORKSH S FOSLER E, 1996, P INT C SPOK LANG PR, P28 FUKADA T, 1997, P EUROSPEECH 97, P2471 HUMPHRIES J, 1997, THESIS U CAMBRIDGE C IMAI T, 1995, P ICASSP 95, P864 Lamel L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.606916 MASATAKI H, 1996, P ICASSP, P188 Nakamura A., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607241 Ostendorf M, 1997, COMPUT SPEECH LANG, V11, P17, DOI 10.1006/csla.1996.0021 RANDOLPH M, 1990, P ICASSP 90, P1177 RILEY M, 1995, P IEEE AUT SPEECH RE, P139 Riley M. D., 1991, P INT C AC SPEECH SI, P737, DOI 10.1109/ICASSP.1991.150446 SCHMID P, 1993, P ICASSP 93 Sejnowski T., 1986, JHUEECS8601 SHIMIZU T, 1996, P ICASSP 96, P145 SLOBODA T, 1995, P INT C AC SPEECH SI, P453 WEINTRAUB M, 1996, JHU WORKSH 96 WOOTERS C, 1994, P ICSLP 94, P1363 NR 19 TC 17 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1999 VL 27 IS 1 BP 63 EP 73 DI 10.1016/S0167-6393(98)00066-1 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 169UY UT WOS:000078768500004 ER PT J AU Veronis, J Di Cristo, P Courtois, F Chaumette, C AF Veronis, J Di Cristo, P Courtois, F Chaumette, C TI A stochastic model of intonation for text-to-speech synthesis SO SPEECH COMMUNICATION LA English DT Article DE text-to-speech synthesis; prosody; intonation; stochastic model; part-of-speech tagging; French ID PERFORMANCE STRUCTURES; RECOGNITION AB This paper presents a stochastic model of intonation contours for use in text-to-speech synthesis. The model has two modules, a linguistic module that generates abstract prosodic labels from text, and a phonetic module that generates an F-0 curve from the abstract prosodic labels. This model differs from previous work in the abstract prosodic labels used, which can be automatically derived from the training corpus. This feature makes it possible to use large corpora or several corpora of different speech styles, in addition to making it easy to adapt to new languages. The present paper focuses on the linguistic module, which does not require full syntactic analysis of the text but simply relies on part-of-speech tagging. The results were validated on French by means of a perception test. Listeners did not perceive a significant difference in quality between the sentences synthesised using the phonetic module only, with prosodic labels derived from original recordings as input, and those synthesised directly from the text using the linguistic module followed by the phonetic module. The proposed model thus appears to capture most of the grammatical information needed to generate F-0. (C) 1998 Published by Elsevier Science B.V. All rights reserved. C1 Univ Aix Marseille 1, Lab Parole & Langage, F-13621 Aix En Provence 1, France. CNRS, F-13621 Aix En Provence 1, France. RP Veronis, J (reprint author), Univ Aix Marseille 1, Lab Parole & Langage, 29 Av Robert Schuman, F-13621 Aix En Provence 1, France. EM Jean.Veronis@pl.univ-aix.fr CR Abney S, 1997, TEXT SPEECH LANG TEC, V2, P118 Abney S., 1991, PRINCIPLE BASED PARS, P257 ALLEN J, 1979, 97 M ASA, P507 Allen J., 1987, TEXT SPEECH MITALK S Bahl L. R., 1976, International Symposium on Information Theory. (Abstracts only received) BAKER JK, 1975, IEEE T ACOUST SPEECH, VAS23, P24, DOI 10.1109/TASSP.1975.1162650 Beckman M. E., 1986, PHONOLOGY YB, V3, P255, DOI 10.1017/S095267570000066X BLACK A, 1996, P ICSLP 96 Brown P. F., 1991, P 29 ANN M ASS COMP, P264, DOI 10.3115/981344.981378 Brown P. 
F., 1990, Computational Linguistics, V16 CAELENHAUMONT G, IN PRESS PROSODIE SE CAMPIONE E, 1997, MEMOIRE DEA U PROVEN CAMPIONE E, 1997, ESCA TUT RES WORKSH, P71 CAMPIONE E, 1998, ACT 20 JOURN ET PAR, P99 CAMPIONE E, IN PRESS UNE EVALUAT CAMPIONE E, IN PRESS P ICSLP 98 Chan D., 1995, P 4 EUR C SPEECH COM, V1, P867 CHOUEKA Y, 1985, COMPUT HUMANITIES, V19, P147, DOI 10.1007/BF02259530 CHOUEKA Y, 1983, ALLC J, V4, P34 COLLIER R, 1991, J PHONETICS, V19, P61 DAILLE B, 1994, P 15 INT C COMP LING DEBILI F, 1977, THESIS U PARIS 7 Di Cristo A., 1985, MICROPROSODIE INTONO DICRISTO P, 1997, 4 C FRANC AC MARS, P425 DUTOIT T, 1993, SPEECH COMMUN, V13, P435, DOI 10.1016/0167-6393(93)90042-J Ejerhed E., 1988, P 2 C APPL NAT LANG, P219, DOI 10.3115/974235.974275 FAURE G, 1974, MELANGES OFFERTS C R, P283 GALE W, 1993, COMPUT HUMANITIES, V26, P415 GEE JP, 1983, COGNITIVE PSYCHOL, V15, P411, DOI 10.1016/0010-0285(83)90014-2 GROSJEAN F, 1980, TEMPORAL VARIABLES S, P307 GROSJEAN F, 1983, ANN PSYCHOL, V83, P513 Hamon C., 1989, P INT C AC SPEECH SI, P238 Church K. W., 1990, Computational Linguistics, V16 HINDLE D, 1994, COMPUTATIONAL APPROA, P103 HIRST DJ, 1994, P 2 ESCA IEEE WORKSH, P77 HIRST DJ, IN PRESS INTONATION HIRST DJ, 1993, AUTOMATIC MODELLING, V15, P71 HIRST DJ, IN PRESS PROSODY THE HIRST DJ, 1991, P 12 INT C PHON SCI, V5, P234 JELINEK F, 1976, P IEEE, V64, P532, DOI 10.1109/PROC.1976.10159 KARLSSON F, 1995, CONSTRAINT GRAMMARS KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 Kupiec J., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90019-Z LEECH G, 1983, NEWSLETTER INT COMPU, V7, P13 Levin H., 1979, EYE VOICE SPAN Liberman MY, 1992, ADV SPEECH SIGNAL PR, P791 Merialdo B., 1994, Computational Linguistics, V20 MONAGHAN AIC, 1990, P ESCA TUTORIAL DAY, P109 OSHAUGHNESSY D, 1990, P ESCA TUTORIAL DAY, P39 Ostendorf M., 1994, Computational Linguistics, V20 Ostendorf M., 1997, COMPUTING PROSODY, P291 PAVLOVIC C, 1995, ASSIST TECHN RES SER, V1, P332 Quene H., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90044-5 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 ROSS K, 1995, THESIS BOSTON U ROSSI M, 1977, B SOC LING PARIS, V62, P55 Selkirk E. O., 1984, PHONOLOGY SYNTAX REL SHANNON CE, 1948, AT&T TECH J, V27, P623 Silverman K., 1992, P INT C SPOK LANG PR, V2, P867 SILVERMAN K, 1909, P ARPA WORSH HUM LAN, P317 SORIN C, 1987, P INT C PHON SCI, V1, P125 TAYLOR P, 1994, SPEECH COMMUN, V15, P169, DOI 10.1016/0167-6393(94)90050-7 VERONIS J, IN PRESS P ICSLP 98 VERONIS J, 1997, 5 EUR C SPEECH COMM, V5, P2643 VERONIS J, 1994, AAAI 94 WORKSH INT N, P72 WIGHTMAN CW, 1992, P ICASSP 92, V1, P221 ZELLNER B, 1997, ETUDES LETT U LAUSAN, V3, P47 NR 67 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD DEC PY 1998 VL 26 IS 4 BP 233 EP 244 DI 10.1016/S0167-6393(98)00063-6 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 160GH UT WOS:000078221700001 ER PT J AU Illina, I Afify, M Gong, Y AF Illina, I Afify, M Gong, Y TI Environment normalization training and environment adaptation using mixture stochastic trajectory model SO SPEECH COMMUNICATION LA English DT Article DE continuous speech recognition; mixture stochastic trajectory model; adaptation; model normalization; linear transformation ID SPEECH RECOGNITION AB This paper presents a theoretical framework for environment normalization training and adaptation in the context of mixture stochastic trajectory models. The presented approach extends, to segment based models, the currently successful technique of environment normalization used in adapting Hidden Markov models. It also adds to the environment normalization framework a novel method for representing and combining different sources of variability. In our approach the normalization and adaptation are performed using linear transformations. When applied to speaker and noise adaptation in a continuous speech recognition task, our method led to up to 34% improvement in the recognition accuracy for speaker adaptation compared to unadapted models. For noise adaptation the technique outperformed environment dependent models for some of the tested cases, It was also observed that using environment normalization training in conjunction with transformation adaptation outperforms conventional MLLR. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Comp Sci Res Ctr Nancy, LORIA, F-54506 Vandoeuvre Les Nancy, France. Texas Instruments Inc, Speech Res Media Lab, Dallas, TX 75265 USA. RP Illina, I (reprint author), Comp Sci Res Ctr Nancy, LORIA, BP239, F-54506 Vandoeuvre Les Nancy, France. EM illina@loria.fr CR ACERO A, 1996, P IEEE INT C AC SPEE, V1, P342 ANASTASAKOS T, 1997, P IEEE INT C AC SPEE, V2, P1043 Anastasakos T., 1996, P INT C SPOK LANG PR, V2, P1137, DOI 10.1109/ICSLP.1996.607807 AUBERT X, 1997, P EUR C SPEECH COMM, V4, P1851 DeGroot M., 1970, OPTIMAL STAT DECISIO Dempster AP, 1997, J ROYAL STAT SOC B, V39, P1 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P294, DOI 10.1109/89.506933 GONG Y, 1992, P INT C SPOK LANG PR, V2, P863 GONG Y, 1993, P EUR C SPEECH COMM, V3, P1759 GONG Y, 1994, P IEEE INT C AC SPEE, V1, P57 GONG Y, 1997, P EUR C SPEECH COMM, V3, P1555 Gong YF, 1997, IEEE T SPEECH AUDI P, V5, P33 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J ILLINA I, 1997, P IEEE INT C AC SPEE, V2, P1395 ILLINA I, 1997, P EUR C SPEECH COMM, V4, P1855 ISHII J, 1997, P IEEE INT C AC SPEE, V2, P1055 Juang B. H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E Junqua J.C., 1996, ROBUSTNESS AUTOMATIC Lee C.-H., 1997, P ESCA NATO WORKSH R, P45 Leggetter C., 1995, P ARPA WORKSH SPOK L, P110 MCDONOUGH J, 1997, P IEEE INT C AC SPEE, V2, P1059 NAGESHA V, 1997, P IEEE INT C AC SPEE, V2, P1031 NG KT, 1995, P EUR C SPEECH COMM, V1, P317 Ostendorf M, 1996, IEEE T SPEECH AUDI P, V4, P360, DOI 10.1109/89.536930 SIOHAN O, 1995, P EUR C SPEECH COMM, V1, P465 Siohan O, 1996, SPEECH COMMUN, V18, P335, DOI 10.1016/0167-6393(96)00015-5 ZAVALIAGKOS G, 1995, THESIS NE U NR 27 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD DEC PY 1998 VL 26 IS 4 BP 245 EP 258 DI 10.1016/S0167-6393(98)00060-0 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 160GH UT WOS:000078221700002 ER PT J AU Carreira-Perpinan, MA Renals, S AF Carreira-Perpinan, MA Renals, S TI Dimensionality reduction of electropalatographic data using latent variable models SO SPEECH COMMUNICATION LA English DT Article DE electropalatography; articulatory modelling; data reduction methods; dimensionality reduction; latent variable models; finite mixture distributions; mixture models; principal component analysis; factor analysis; mixtures of factor analysers; generalised topographic mapping; mixtures of multivariate Bernoulli distributions ID TONGUE-PALATE CONTACT; EPG DATA REDUCTION; NEURAL NETWORKS; COARTICULATION; PATTERNS; SPEECH AB We consider the problem of obtaining a reduced dimension representation of electropalatographic (EPG) data. An unsupervised learning approach based on latent variable modelling is adopted, in which an underlying lower dimension representation is inferred directly from the data. Several latent variable models are investigated, including factor analysis and the generative topographic mapping (GTM). Experiments were carried out using a subset of the EUR-ACCOR database, and the results indicate that these automatic methods capture important, adaptive structure in the EPG data. Nonlinear latent variable modelling clearly outperforms the investigated linear models in terms of log-likelihood and reconstruction error and suggests a substantially smaller intrinsic dimensionality for the EPG data than that claimed by previous studies. A two-dimensional representation is produced with applications to speech therapy, language learning and articulatory dynamics. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. RP Carreira-Perpinan, MA (reprint author), Univ Sheffield, Dept Comp Sci, Sheffield S1 4DP, S Yorkshire, England. EM M.Carreira@dcs.shef.ac.uk CR BALDI P, 1989, NEURAL NETWORKS, V2, P53, DOI 10.1016/0893-6080(89)90014-2 Bartholomew D.J., 1987, LATENT VARIABLE MODE Berger J. O., 1985, STAT DECISION THEORY, V2nd Bishop CM, 1998, NEURAL COMPUT, V10, P215, DOI 10.1162/089976698300017953 BOURLARD H, 1988, BIOL CYBERN, V59, P291, DOI 10.1007/BF00332918 BYRD D, 1995, J SPEECH HEAR RES, V38, P821 CARREIRAPERPINA.MA, 1998, UNPUB NEURAL COMPUTA CARREIRAPERPINA.MA, 1997, CS9709 U SHEFF DEP C DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Everitt B. 
S., 1981, FINITE MIXTURE DISTR Everitt BS, 1984, INTRO LATENT VARIABL Ghahramani Z., 1996, CRGTR961 U TOR GIROSI F, 1995, NEURAL COMPUT, V7, P219, DOI 10.1162/neco.1995.7.2.219 GYLLENBERG M, 1994, J APPL PROBAB, V31, P542, DOI 10.2307/3215044 HARDCASTLE W, 1989, CLIN LINGUIST PHONET, V3, P1, DOI 10.3109/02699208908985268 HARDCASTLE WJ, 1991, BRIT J DISORD COMMUN, V26, P41 HARDCASTLE WJ, 1991, J PHONETICS, V19, P251 HOLST T, 1995, EUR J DISORDER COMM, V30, P161 HUBER PJ, 1985, ANN STAT, V13, P435, DOI 10.1214/aos/1176349519 JONES W, 1995, EUR J DISORDER COMM, V30, P183 KAISER HF, 1958, PSYCHOMETRIKA, V23, P187, DOI 10.1007/BF02289233 Kohonen T., 1995, SPRINGER SERIES INFO, V30 MARCHAL A, 1993, LANG SPEECH, V36, P137 Mardia K.V., 1979, PROBABILITY MATH STA NGUYEN N, 1995, EUR J DISORDER COMM, V30, P175 Nguyen N, 1996, J PHONETICS, V24, P77, DOI 10.1006/jpho.1996.0006 NGUYEN N, 1994, J ACOUST SOC AM, V96, P33, DOI 10.1121/1.411435 NICOLAIDIS K, 1994, ARTICULATORY ACOUSTI PARK J, 1993, NEURAL COMPUT, V5, P305, DOI 10.1162/neco.1993.5.2.305 PRATT S, 1993, J SPEECH HEAR RES, V29, P99 RUBIN DB, 1982, PSYCHOMETRIKA, V47, P69, DOI 10.1007/BF02293851 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 STONE M, 1991, J PHONETICS, V19, P309 TIPPING ME, 1997, P IEE 5 INT C ART NE WOLFE JH, 1970, MULTIVAR BEHAV RES, V5, P329, DOI 10.1207/s15327906mbr0503_6 NR 35 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1998 VL 26 IS 4 BP 259 EP 282 DI 10.1016/S0167-6393(98)00059-4 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 160GH UT WOS:000078221700003 ER PT J AU Kumar, N Andreou, AG AF Kumar, N Andreou, AG TI Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition SO SPEECH COMMUNICATION LA English DT Article DE heteroscedastic; discriminant analysis; speech recognition; reduced rank HMMs ID WORD RECOGNITION; MODEL AB We present the theory for heteroscedastic discriminant analysis (HDA), a model-based generalization of linear discriminant analysis (LDA) derived in the maximum-likelihood framework to handle heteroscedastic-unequal variance-classifier models. We show how to estimate the heteroscedastic Gaussian model parameters jointly with the dimensionality reducing transform, using the EM algorithm. In doing so, we alleviate the need for an a priori ad hoc class assignment. We apply the theoretical results to the problem of speech recognition and observe word-error reduction in systems that employed both diagonal and full covariance heteroscedastic Gaussian models tested on the TI-DIGITS database. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Johns Hopkins Univ, Dept Elect & Comp Engn, Ctr Language & Speech Proc, Baltimore, MD 21218 USA. RP Andreou, AG (reprint author), Johns Hopkins Univ, Dept Elect & Comp Engn, Ctr Language & Speech Proc, 3400 N Charles St, Baltimore, MD 21218 USA. EM andreou@jhu.edu RI Andreou, Andreas G./A-3271-2010 OI Andreou, Andreas G./0000-0003-3826-600X CR AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705 Aubert X., 1993, P ICASSP, VII, P648 Ayer C. M., 1993, P EUROSPEECH, V1, P583 AYER CM, 1992, THESIS U LONDON Bartlett M.S., 1947, J R STAT SOC B, V9, P176 BAUM LE, 1970, ANN MATH STAT, V41, P164, DOI 10.1214/aoms/1177697196 Bocchieri E. 
L., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1012 BROWN PF, 1987, THESIS CARNEGIEMELLO Campbell N. A., 1984, AUSTR J STAT, V26, P86, DOI DOI 10.1111/J.1467-842X.1984.TB01271.X COHEN JR, 1989, J ACOUST SOC AM, V85, P2623, DOI 10.1121/1.397756 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Dempster A. P., 1977, J ROYAL STAT SOC B, P1 Dillon W. R., 1984, MULTIVARIATE ANAL DODDINGTON G, 1989, P IEEE INT C AC SPEE, P556 Duda R. O., 1973, PATTERN CLASSIFICATI Engles RF, 1995, ARCH SELECTED READIN Fisher RA, 1938, ANN EUGENIC, V8, P376 Fisher RA, 1936, ANN EUGENIC, V7, P179 Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 HAEBUMBACH R, 1993, P ICASSP, P239 Haeb-Umbach R., 1992, P IEEE INT C AC SPEE, V1, P13 Hastie T., 1994, DISCRIMINANT ANAL GA HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HUNT M, 1979, 98 M AC SOC AM NOV HUNT MJ, 1989, P ICASSP, V1, P262 JANKOWSKI CR, 1992, THESIS MIT KUMAR N, 1995, P 15 ANN SPEECH RES, P153 KUMAR N, 1996, 9607 EL COMP ENG KUMAR N., 1996, P JOINT M AM STAT AS KUMAR N, UNPUB IEEE T PATT AN Kumar N, 1997, THESIS J HOPKINS U RABINER LR, 1975, AT&T TECH J, V54, P297 Rao C., 1965, LINEAR STAT INFERENC Rissanen J., 1989, SERIES COMPUTER SCI, V15 ROTH R, 1993, P ICASSP, V2, P640 SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136 SIOHAN O, 1995, P ICASSP, V1, P125 SUN D, 1997, INT C LANG SPEECH, P244 WOOD L, 1991, P ICASSP, V1, P181 WOODLAND PC, 1991, P ICASSP, V1, P545 YU G, 1990, P INT C AC SPEECH SI, P685 ZAHORIAN SA, 1991, P ICASSP, V1, P561 NR 43 TC 138 Z9 142 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1998 VL 26 IS 4 BP 283 EP 297 DI 10.1016/S0167-6393(98)00061-2 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 160GH UT WOS:000078221700004 ER PT J AU Tsukada, H Yamamoto, H Takezawa, T Sagisaka, Y AF Tsukada, H Yamamoto, H Takezawa, T Sagisaka, Y TI Reliable utterance segment recognition by integrating a grammar with statistical language constraints SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; language model; context-free grammar; finite-state automaton; finite-state transducer; robust parsing AB This paper proposes a novel approach to the recognition of complete utterances and partial segments of utterances. This approach ensures a high level of confidence in the results. The proposed method is based on the cooperative use of a conventional n-gram constraint and additional grammatical constraints which take deviations from the grammar into account with a multi-pass search strategy. The partial utterance segments are obtained with high confidence as the segments that satisfy both n-gram and grammatical constraints. For improved efficiency, the context-free grammar expressing the grammatical constraints is approximated by a finite-state automaton. We consider all kinds of deviations from the grammar such as insertions, deletions and substitutions when applying the grammatical constraints. As a result, we can achieve a more robust application of grammatical constraints compared to a conventional word-skipping robust parser that can only handle one type of deviation, that is, insertions. 
Our experiments confirm that the proposed method can recognize partial segments of utterances more reliably than conventional continuous speech recognition methods using only n-grams. In addition, our results indicate that allowing more deviations from the grammatical constraints leads to better performance than the conventional word-skipping robust parser approach. (C) 1998 Elsevier Science B.V. All rights reserved. C1 ATR Intpreting Telecommun Res Labs, Seika, Kyoto 6190288, Japan. RP Tsukada, H (reprint author), ATR Intpreting Telecommun Res Labs, 2-2 Hikaridai, Seika, Kyoto 6190288, Japan. EM tsukada@itl.atr.co.jp CR LAVIE A, 1996, CMUCS96126 LLOYDTHOMAS H, 1995, P INT C AC SPEECH SI, V1, P173 MASATAKI H, 1996, P INT C AC SPEECH SI, V1, P188 METEER M, 1993, P INT C AC SPEECH SI, V2 MORIMOTO T, 1994, P 3 INT C SPOK LANG, V4, P1791 NAKAMURA A, 1996, P ICSLP, V4, P2199, DOI 10.1109/ICSLP.1996.607241 PEREIRA FCN, 1991, 29TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS : PROCEEDINGS OF THE CONFERENCE, P246 PIERACCINI R, 1993, P NATO ASI SUMM SCH Roche E, 1997, LANG SPEECH & COMMUN, P1 ROE DB, 1992, SPEECH COMMUN, V11, P311, DOI 10.1016/0167-6393(92)90025-3 SCHWARTZ R, 1997, P INT C AC SPEECH SI, V2, P1479 SENEFF S, 1992, P 2 INT C SPOK LANG, V1, P317 SHIMIZU T, 1996, P ICASSP 96 APR, V1, P145 Takezawa T., 1997, Systems and Computers in Japan, V28 WAKITA Y, 1997, P SPOK LANG TRANSL W, P24 WARD W, 1991, P ICASSP 91, V1, P365 NR 16 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1998 VL 26 IS 4 BP 299 EP 309 DI 10.1016/S0167-6393(98)00062-4 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 160GH UT WOS:000078221700005 ER PT J AU Huang, J Zhao, YX AF Huang, J Zhao, YX TI An energy-constrained signal subspace method for speech enhancement and recognition in white and colored noises SO SPEECH COMMUNICATION LA English DT Article DE signal subspace; Karhunen-Loeve transform (KLT); short-time energy; autoregressive process AB In this paper, an energy-constrained signal subspace (ECSS) method is proposed for speech enhancement and automatic speech recognition under additive noise condition. The key idea is to match the short-time energy of the enhanced speech signal to the unbiased estimate of the short-time energy of the clean speech, which is proven very effective for improving the estimation of the noise-like, low-energy segments in continuous speech. The ECSS method is applied to both white and colored noises where the additive colored noise is modelled by an autoregressive (AR) process. A modified covariance method is used to estimate the AR parameters of the colored noise and a prewhitening filter is constructed based on the estimated parameters. The performances of the proposed algorithms were evaluated using the TI46 digit database and the TIMIT continuous speech database. It was found that the ECSS method can achieve very high word recognition accuracy (WRA) for the digits set under low SNR conditions. For continuous speech data set, this method helped to improve the SNR by 2-6 dB and the WRA by 13.7-45.5% for the white noise and 18.6-55.9% for the colored noise under various SNR conditions. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Illinois, Dept Elect & Comp Engn, Beckman Inst 2137, Urbana, IL 61801 USA. 
RP Huang, J (reprint author), Univ Illinois, Dept Elect & Comp Engn, Beckman Inst 2137, 405 N Mathews Ave, Urbana, IL 61801 USA. EM jhuang@ifp.uiuc.edu; yxz@ifp.uiuc.edu CR Acero A., 1990, P ICASSP, P849 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 COLE R, 1995, IEEE T SPEECH AUDI P, V3, P1, DOI 10.1109/89.365385 COUVREUR C, 1995, P IEEE ICASSP 95, P1605 DENDRINOS M, 1991, SPEECH COMMUN, V10, P45, DOI 10.1016/0167-6393(91)90027-Q EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 FRIEDLANDER B, 1984, IEEE T ACOUST SPEECH, V32, P338, DOI 10.1109/TASSP.1984.1164328 Huang J, 1997, IEEE SIGNAL PROC LET, V4, P283 JENSEN SH, 1995, IEEE T SPEECH AUDI P, V3, P439, DOI 10.1109/89.482211 KAY SM, 1988, MODERN SPECTRAL ESTI, P222 LeBouquin R, 1996, SPEECH COMMUN, V18, P3, DOI 10.1016/0167-6393(95)00021-6 LOGAN BT, 1997, P IEEE ICASSP 97, P843 MERHAV N, 1989, IEEE T INFORM THEORY, V35, P1109, DOI 10.1109/18.42231 NOLAZCO JA, 1994, P IEEE ICASSP 94, P1409 OJA E, 1983, SUBSPACE METHODS PAT, P7 OPPENHEIM AV, 1989, DISCRETE TIME SIGNAL, P581 PAPOULIS A, 1991, PROBABILITY RANDOM V, P240 PRIESTLEY MB, 1981, SPECIAL ANAL TIME SE, P317 RABINER L, 1993, FUNDAMENTALS SPEECH, P305 RISSANEN J, 1978, AUTOMATICA, V14, P465, DOI 10.1016/0005-1098(78)90005-5 SCHWARZ G, 1978, ANN STAT, V6, P461, DOI 10.1214/aos/1176344136 Yamashita Y, 1996, IEEE T SIGNAL PROCES, V44, P371, DOI 10.1109/78.485932 Zhao YX, 1993, IEEE T SPEECH AUDI P, V1, P345, DOI 10.1109/89.232618 NR 23 TC 15 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1998 VL 26 IS 3 BP 165 EP 181 DI 10.1016/S0167-6393(98)00041-7 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 152FB UT WOS:000077764600001 ER PT J AU Alleva, F Huang, XD Hwang, MY Jiang, L AF Alleva, F Huang, XD Hwang, MY Jiang, L TI Can continuous speech recognizers handle isolated speech? SO SPEECH COMMUNICATION LA English DT Article DE isolated speech recognition (ISR); continuous speech recognition (CSR); senones; out-of-vocabulary (OOV); classification and regression trees (CARTs) AB Continuous speech is far more natural and efficient than isolated speech for communication. However, for current state-of-the-art automatic speech recognition systems, isolated speech recognition (ISR) is far more accurate than continuous speech recognition (CSR). It is common practice in the speech research community to build CSR systems using only CS data. However, slowing of the speaking rate is a natural reaction for a user faced with the high error rates of current CSR systems. Ironically, CSR systems typically have a much higher word error rate when speakers slow down since the acoustic models are usually derived exclusively from continuous speech corpora. In this paper, we summarize our efforts to improve the robustness of our speaker-independent CSR system against speaking styles, without suffering a recognition accuracy penalty. In particular the multi-style trained system described in this paper attains a 7.0% word error rate for a test set consisting of both isolated and continuous speech, in contrast to the 10.9% word error rate achieved by the same system trained only on continuous speech. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Microsoft Corp, Res, Redmond, WA 98052 USA. RP Hwang, MY (reprint author), Microsoft Corp, Res, 1 Microsoft Way, Redmond, WA 98052 USA. 
EM mhwang@microsoft.com CR Alleva F, 1996, INT CONF ACOUST SPEE, P133, DOI 10.1109/ICASSP.1996.540308 BREIMAN L, 1984, CLASSIFICATON REGRES DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 HUANG X, 1994, SPEECH SPEAKER RECOG, P481 Hwang MY, 1996, IEEE T SPEECH AUDI P, V4, P412 *LING DAT CONS, 1994, CSR 3 TEXT LANG MOD *LING DAT CONS, 1994, CSR 2 ARPA CONT SPEE NR 8 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1998 VL 26 IS 3 BP 183 EP 189 DI 10.1016/S0167-6393(98)00042-9 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 152FB UT WOS:000077764600002 ER PT J AU Chengalvarayan, R AF Chengalvarayan, R TI Improved speech modelling and recognition using a new training algorithm based on outlier-emphasis for nonstationary state HMM SO SPEECH COMMUNICATION LA English DT Article DE speech signal; speech recognition; discriminative training; hidden Markov models; outlier-emphasis; non-stationary states AB In this study, we develop a modified maximum likelihood algorithm for optimally estimating the state-dependent polynomial parameters in the nonstationary-state HMM. The newly devised training method controls the influence of outliers in the training data on the constructed models. For an alphabet recognition task, outlier emphasis resulted in improved performance. An error rate reduction of 14% is achieved for the linear trend and 7.5% is obtained for the stationary-state HMMs over the conventional models trained by the Viterbi algorithm based on the joint-state maximum likelihood criterion. The properties of the nonstationary-state HMM trained with the proposed approach are analysed by examining goodness-of-fit of the real speech data to the polynomial trajectories in the model. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Lucent Technol, Bell Labs, Speech Proc Grp, Naperville, IL 60566 USA. RP Chengalvarayan, R (reprint author), Lucent Technol, Bell Labs, Speech Proc Grp, 2000 N Naperville Rd, Naperville, IL 60566 USA. 
EM rathi@lucent.com CR AMARI S, 1967, IEEE TRANS ELECTRON, VEC16, P299, DOI 10.1109/PGEC.1967.264665 ARSLAN LM, 1996, P ICASSP, V2, P589 Bahl L., 1986, P INT C AC SPEECH SI, V11, P49, DOI DOI 10.1109/ICASSP.1986.1169179> BORWN PF, 1987, THESIS CARNEGIE MELL Chen JK, 1994, IEEE T SPEECH AUDI P, V2, P206 CHENGALVARAYAN R, 1996, P ICSLP, V2, P1049 Chengalvarayan R, 1997, IEEE T SPEECH AUDI P, V5, P232, DOI 10.1109/89.568730 CHENGALVARAYAN R, 1997, P ICASSP, V2, P1415 Chou W., 1994, International Journal of Pattern Recognition and Artificial Intelligence, V8, DOI 10.1142/S0218001494000024 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Deng L, 1997, IEEE T SPEECH AUDI P, V5, P319 Deng L, 1994, IEEE T SPEECH AUDI P, V2, P507 FRANCO H, 1990, P ICASSP, V1, P357 FUKADA T, 1997, P ICASSP, V2, P1403 GISH H, 1992, P ICASSP, V2, P289 Gish H, 1996, P ICSLP, P466, DOI 10.1109/ICSLP.1996.607155 HAFFNER P, 1993, P EUROSPEECH, V4, P1929 HOLMES WJ, 1997, IEEE SIGNAL PROCESSI, V4, P72 JOHANSEN FT, 1996, THESIS NORWEGIAN U S KATAGIRI S, 1991, IEEE WORKSH NEUR NET, P299 LI H, 1995, P EUROSPEECH, V1, P363 LJOLJE A, 1990, P IEEE INT C AC SPEE, V2, P709 McDermott E., 1997, THESIS WASEDA U TOKY NADAS A, 1988, IEEE T ACOUST SPEECH, V36, P1432, DOI 10.1109/29.90371 NILES LT, 1990, P ICASSP, V1, P493 Normandin Y., 1991, THESIS MCGILL U MONT Ostendorf M., 1996, AUTOMATIC SPEECH SPE, P185 VALTCHEV V, 1996, P INT C AC SPEECH SI, V2, P605 NR 28 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1998 VL 26 IS 3 BP 191 EP 201 DI 10.1016/S0167-6393(98)00057-0 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 152FB UT WOS:000077764600003 ER PT J AU Westbury, JR Hashi, M Lindstrom, MJ AF Westbury, JR Hashi, M Lindstrom, MJ TI Differences among speakers in lingual articulation for American English vertical bar upside-down-r vertical bar SO SPEECH COMMUNICATION LA English DT Article DE tongue posture; articulation; cross-speaker variation; American English vertical bar upside-down-r vertical bar ID ANTICIPATORY LABIAL COARTICULATION; SPEECH PRODUCTION AB X-ray microbeam fleshpoint measures of lingual articulation for pre-vocalic /lambda/ were obtained for five test words spoken by 53 normal, young adult talkers of American English. The data were used to develop quantitative descriptions of cross-speaker variation in tongue shapes at acoustically-defined r-moments in the test words, and to understand whether and how /lambda/-related tongue shapes might vary across the available sample of phonetic contexts. Key results suggest that tongue shapes for this sound vary widely across speakers within any single phonetic context, and more continuously than categorically across the representational space. Shapes also vary by context in ways that are similar across most speakers, and in some contexts, in ways that can be expected given simple assumptions about lingual movements associated with adjacent sounds. Interestingly, tongue shapes for American English /lambda/ do not seem to be reliably linked to gender, measures of oral cavity size, or formant frequencies measured for two of the test words. 
Together, these results provide unique insight about the nature and bases of inter-speaker variation in lingual articulation for this infamously variable sound, and may prove useful to other investigators interested in speech motor control, speech synthesis, and automatic speech recognition. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Wisconsin, Waisman Ctr, Madison, WI 53705 USA. Univ Wisconsin, Dept Communicat Disorders, Madison, WI 53705 USA. Univ Wisconsin, Dept Biostat, Madison, WI 53792 USA. RP Westbury, JR (reprint author), Univ Wisconsin, Waisman Ctr, 1500 Highland Ave, Madison, WI 53705 USA. EM westbury@facstaff.wisc.edu CR Alwan A, 1997, J ACOUST SOC AM, V101, P1078, DOI 10.1121/1.417972 Anderson TW, 1984, INTRO MULTIVARIATE S BECKER R.A., 1988, NEW S LANGUAGE PROGR BELLBERTI F, 1981, PHONETICA, V38, P9 Blackburn C. S., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607764 Boyce S, 1997, J ACOUST SOC AM, V101, P3741, DOI 10.1121/1.418333 BROWMAN CP, 1992, PHONETICA, V49, P155 Delattre P., 1968, LINGUISTICS, V44, P29 Dembowski J, 1998, NEUROMOTOR SPEECH DISORDERS, P27 EDWARDS J, 1990, J SPEECH HEAR RES, V33, P550 ESPYWILSON CY, 1997, P EUROSPEECH, V1, P393 Fant G., 1960, ACOUSTIC THEORY SPEE FISHER RA, 1925, STAT METHODS RES WOR, P4 GUENTHER FH, 1998, ARTICULATORY TRADE R Hagiwara R., 1995, UCLA WORKING PAPERS, V90, P1 HASHI M, IN PRESS J ACOUST SO HENKE WL, 1966, THESIS MIT BOSTON Hockett Charles F., 1958, COURSE MODERN LINGUI Honda K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607480 JOHNSON K, 1993, J ACOUST SOC AM, V94, P701, DOI 10.1121/1.406887 KENT RD, 1987, J SPEECH HEAR DISORD, V52, P367 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 Kuehn David P., 1976, J PHONETICS, V4, P303 LADD WM, UNPUB STAT EXAMINATI Ladefoged P., 1996, SOUNDS WORLDS LANGUA Lindau M., 1985, PHONETIC LINGUISTICS, P157 LINDBLOM B., 1983, PRODUCTION SPEECH, P217 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 LUBKER J, 1981, PHONETICA, V38, P51 MILENKOVIC P, UNPUB J ACOUST SOC A MILENKOVIC P, 1992, C SPEECH VERSION 4 U MUNHALL K, 1992, J PHONETICS, V20, P111 PERKELL JS, 1990, NATO ADV SCI I D-BEH, V55, P263 PERKELL JS, 1992, J ACOUST SOC AM, V91, P2911, DOI 10.1121/1.403778 PERKELL JS, 1993, J ACOUST SOC AM, V93, P2948, DOI 10.1121/1.405814 RUSSELL GO, 1928, VOWEL ITS PHYSL MECH Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 SHRIBERG LD, 1993, J SPEECH HEAR RES, V36, P105 Venables WN, 1994, MODERN APPL STAT S P WEISMER G, 1988, J ACOUST SOC AM, V84, P1281, DOI 10.1121/1.396627 Westbury J.R., 1994, XRAY MICROBEAM SPEEC WESTBURY JR, 1994, J ACOUST SOC AM, V95, P2271, DOI 10.1121/1.408638 YAMADA RA, 1992, SPEECH PERCEPTION PR ZAWADZKI PA, 1980, PHONETICA, V37, P253 NR 44 TC 43 Z9 43 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1998 VL 26 IS 3 BP 203 EP 226 DI 10.1016/S0167-6393(98)00058-2 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 152FB UT WOS:000077764600004 ER PT J AU Rubin, P Vatikiotis-Bateson, E AF Rubin, P Vatikiotis-Bateson, E TI Special issue on auditory-visual speech processing SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Haskins Labs Inc, New Haven, CT 06511 USA. 
Yale Univ, Sch Med, Dept Surg, New Haven, CT 06510 USA. ATR Human Informat Proc Res Labs, Seika, Kyoto 6190288, Japan. RP Rubin, P (reprint author), Haskins Labs Inc, 270 Crown St, New Haven, CT 06511 USA. CR Dodd B., 1987, HEARING EYE PSYCHOL Massaro D. W., 1987, SPEECH PERCEPTION EA MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 Stork D. G., 1996, SPEECHREADING HUMANS, V150 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 SUMMERFIELD Q, 1979, PHONETICA, V36, P314 WOLFF GJ, 1994, ADV NEURAL INFORMATI, V6, P1027 NR 7 TC 14 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1998 VL 26 IS 1-2 BP 1 EP 4 DI 10.1016/S0167-6393(98)00046-6 PG 4 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100001 ER PT J AU Poggi, I Pelachaud, C AF Poggi, I Pelachaud, C TI Performative faces SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Meeting on Auditory-Visual Speech Processing (AVSP'97) CY 1997 CL RHODES, GREECE DE speech acts; visual modality; facial expression; 3D synthetic agents ID FACIAL EXPRESSIONS; DISPLAYS; SPEECH AB The paper presents a model for the construction of an artificial agent that can express performatives through facial expression. The performative of a speech act or communicative act is the particular communicative intention a Sender has to one's Addressee, the way one wants to socially relate oneself to the interlocutor. Performatives are decomposed both on the meaning and on the signal side: on the meaning side, a performative is represented as a cluster of cognitive units, that in turn include subclusters mentioning the Sender's general goal (informing, asking, requesting), the power relationship between Sender and Addressee, the Sender's affective state, and further information peculiar of specific performatives; on the signal side, facial expressions are decomposed into Action Units. The proposed system computes the appropriate performative of one's communicative acts through consideration of the context of communication, particularly of the Addressee's cognitive capacity, social relationship and personality traits, and then expresses the computed performatives through 3D facial displays. (C) 1998 Published by Elsevier Science B.V. All rights reserved. C1 Univ Roma Tre, Dipartimento Linguist, I-00185 Rome, Italy. Univ Roma La Sapienza, Dipartimento Informat & Sistemist, I-00185 Rome, Italy. RP Poggi, I (reprint author), Univ Roma Tre, Dipartimento Linguist, Via Castro Pretorio 20, I-00185 Rome, Italy. EM poggi@uniroma3.it; cath@peano.dis.uniromal.it CR Argyle Michael, 1976, GAZE AND MUTUAL GAZE AUSTIN JL, 1962, DO THINKS WORDS BEATTIE GW, 1981, NONVERBAL COMMUNICAT, P297 BESKOW J, 1997, P IJCAI 97 WORKSH AN CALDOGNETTO EM, 1997, P AVSP RHOD 26 27 SE Cassell J., 1994, Computer Graphics Proceedings. Annual Conference Series 1994. SIGGRAPH 94 Conference Proceedings CASTELFRANCHI C, SHYNESS EMBARRASSMEN CASTELFRANCHI C, 1990, DECENTRALIZED CASTELFRANCHI C, 1997, P S LOG APPR AG MOD, P18 CHOVIL N, 1991, J NONVERBAL BEHAV, V15, P141, DOI 10.1007/BF01672216 Cohen M., 1993, MODELS TECHNIQUES CO Cohen P. 
R., 1979, COGNITIVE SCI, V3, P177, DOI DOI 10.1207/S15516709COG0303_1 COHEN PR, 1995, P INT C MULT AG SYST COHEN PR, 1990, P 28 ANN M ASS COMP, P79, DOI 10.3115/981823.981834 Conte R., 1995, COGNITIVE SOCIAL ACT DAVIS JR, 1988, ACL88, P187 Duncan S, 1974, NONVERBAL COMMUNICAT Ekman P., 1979, HUMAN ETHOLOGY, P169 Ekman P., 1978, FACIAL ACTION CODING Ekman P., 1982, EMOTION HUMAN FACE GUIARDMARIGNY T, 1996, 3D MODELS LIPS REALI, P80 KALRA P, 1993, INT C MULT MOD MMM93, P59 KELTNER D, 1995, J PERS SOC PSYCHOL, V68, P441, DOI 10.1037/0022-3514.68.3.441 Kendon A., 1990, CONDUCTING INTERACTI LADD DR, 1985, INTEGRATED APPROACH Lee Y, 1995, COMPUTER GRAPHICS, P55 Monaghan A. I. C., 1991, THESIS U EDINBURGH PAKOSZ M, 1983, J PSYCHOL RES, V12 Pelachaud C, 1996, COGNITIVE SCI, V20, P1 PELACHAUD C, 1994, P ESCA AAAI IEEE WOR PEZZATO N, 1998, P 6 INT PRAGM C REIM PIERREHUMBERT J, 1990, SYS DEV FDN, P271 PLATT SM, 1985, THESIS U PENNSYLVANI POGGI I, 1990, B DI L IT, V3, P29 POGGI I, 1997, MANI CHE PARLANO Poggi I., 1996, P WORKSH INT GEST LA, P235 POGGI I, 1987, PAROLE NELLA TESTA PREVOST S, 1994, SPEECH COMMUN, V15, P139, DOI 10.1016/0167-6393(94)90048-5 RIST T, 1997, ADDING ANIMATED PRES, P79 Scherer K. R., 1988, FACETS EMOTION RECEN Scherer K.R., 1980, SOCIAL PSYCHOL CONTE, P225 Scherer K.R., 1981, SPEECH EVALUATION PS, P189 Searle John R., 1969, SPEECH ACTS TAKEUCHI A, 1993, ACM IFIP INTERCHI 93 TERZOPOULOS D, 1991, COMPUTER ANIMATION 9, P45 THORISSON KR, 1997, P COMP GRAPH EUR 97 ULDALL E, 1960, LANG SPEECH, V3, P223 VATIKIOTISBATES.E, 1996, NATO ASI SERIES F, V150, P221 Waters K., 1987, COMPUT GRAPH, V22, P17 WIGGERS M, 1982, J NONVERBAL BEHAV, V7, P101, DOI 10.1007/BF00986872 WILLIAMS CW, 1981, SPEECH EVALUATION PS NR 51 TC 14 Z9 14 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1998 VL 26 IS 1-2 BP 5 EP 21 DI 10.1016/S0167-6393(98)00047-8 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100002 ER PT J AU Yehia, H Rubin, P Vatikiotis-Bateson, E AF Yehia, H Rubin, P Vatikiotis-Bateson, E TI Quantitative association of vocal-tract and facial behavior SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Meeting on Auditory-Visual Speech Processing (AVSP'97) CY 1997 CL RHODES, GREECE DE vocal-tract motion; facial motion; line spectrum pair (LSP); singular value decomposition; principal component analysis; dynamic time warping (DTW); linear estimator ID SPEECH PRODUCTION; MODEL; SYNTHESIZER; MOVEMENT; MOTION AB This paper examines the degrees of correlation among vocal-tract and facial movement data and the speech acoustics. Multilinear techniques are applied to support the claims that facial motion during speech is largely a byproduct of producing the speech acoustics and further that the spectral envelope of the speech acoustics can be better estimated by the 3D motion of the face than by the midsagittal motion of the anterior vocal-tract (lips, tongue and jaw). Experimental data include measurements of the motion of markers placed on the face and in the vocal-tract, as well as the speech acoustics, for two subjects. The numerical results obtained show that, for both subjects, 91% of the total variance observed in the facial motion data could be determined from vocal-tract motion by means of simple linear estimators. For the inverse path, i.e. 
recovery of vocal-tract motion from facial motion, the results indicate that about 80% of the variance observed in the vocal-tract can be estimated from the face. Regarding the speech acoustics, it is observed that, in spite of the nonlinear relation between vocal-tract geometry and acoustics, linear estimators are sufficient to determine between 72 and 85% (depending on subject and utterance) of the variance observed in the RMS amplitude and LSP parametric representation of the spectral envelope. A dimensionality analysis is also carried out, and shows that between four and eight components are sufficient to represent the mappings examined. Finally, it is shown that even the tongue, which is an articulator not necessarily coupled with the face, can be recovered reasonably well from facial motion since it frequently displays the same kind of temporal pattern as the jaw during speech. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Fed Minas Gerais, Dept Elect Engn, BR-30161970 Belo Horizonte, MG, Brazil. Yale Univ, New Haven, CT USA. Haskins Labs Inc, New Haven, CT 06511 USA. ATR Human Informat Proc Res Labs, Kyoto, Japan. RP Yehia, H (reprint author), Univ Fed Minas Gerais, Dept Elect Engn, Av Antonio Carlos 6627,CP209, BR-30161970 Belo Horizonte, MG, Brazil. EM hani@cpdee.ufmg.br RI Yehia, Hani Camille/E-8684-2010 OI Yehia, Hani Camille/0000-0003-4578-0525 CR ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 BADIN P, 1995, J PHONETICS, V23, P221, DOI 10.1016/S0095-4470(95)80044-1 CARTER JN, 1996, P 1 ESCA TUT RES WOR, P229 Fant G., 1960, ACOUSTIC THEORY SPEE HOGDEN J, 1993, SR115116 HASK LAB Horn R. A., 1985, MATRIX ANAL Itakura F., 1975, J ACOUST SOC AM, V57, P535 KABURAGI T, 1994, J ACOUST SOC AM, V96, P1356, DOI 10.1121/1.410280 KELSO JAS, 1984, AM J PSYCHOL, V15, pR928 Lin Q., 1990, THESIS ROYAL I TECHN Maeda S., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90017-6 MCGOWAN RS, 1994, SPEECH COMMUN, V14, P19, DOI 10.1016/0167-6393(94)90055-8 MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427 MERMELST.P, 1967, J ACOUST SOC AM, V41, P1283, DOI 10.1121/1.1910470 MUNHALL KG, 1985, J ACOUST SOC AM, V78, P1548, DOI 10.1121/1.392790 NITTROUER S, 1988, J ACOUST SOC AM, V84, P1653, DOI 10.1121/1.397180 OSTRY DJ, 1994, J NEUROPHYSIOL, V71, P1528 PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204 Rabiner L, 1993, FUNDAMENTALS SPEECH RUBIN P, 1981, J ACOUST SOC AM, V70, P321, DOI 10.1121/1.386780 Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 SCHROEDE.MR, 1967, J ACOUST SOC AM, V41, P1002, DOI 10.1121/1.1910429 Schroeter J., 1991, ADV SPEECH PROCESSIN, P231 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 SCULLY C, 1990, NATO ADV SCI I D-BEH, V55, P151 SHIRAI K, 1993, SPEECH COMMUN, V13, P45, DOI 10.1016/0167-6393(93)90058-S SONDHI MM, 1987, IEEE T ACOUST SPEECH, V35, P955 STEVENS KN, 1955, J ACOUST SOC AM, V27, P484, DOI 10.1121/1.1907943 STONE M, 1990, J ACOUST SOC AM, V87, P2207, DOI 10.1121/1.399188 SUGAMURA N, 1986, SPEECH COMMUN, V5, P199, DOI 10.1016/0167-6393(86)90008-7 TIEDE MK, 1994, P INT C SPOK LANG PR VATIKIOTISBATES.E, 1997, 5 EUR C SPEECH COMM VATIKIOTISBATES.E, 1998, TRH237 ATRHIP Vatikiotis-Bateson E., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607897 VATIKIOTISBATES.E, 1996, H9665 AC SOC JAP VATIKIOTISBATES.E, 1996, NATO ASI SERIES F, V150, P221 VATIKIOTISBATESON E, 1995, J PHONETICS, V23, P101, DOI 10.1016/S0095-4470(95)80035-2 Yehia H, 1996, SPEECH COMMUN, V18, P151, DOI 10.1016/0167-6393(95)00042-9 YEHIA H, UNPUB SPEECH COMMUNI YEHIA H, 1994, P IEEE INT C AC SPEE, P477 Yehia HC, 1996, IEICE T INF SYST, VE79D, P1198 YEHIA HC, 1997, EUR TUT RES WORKSH A ZACKS J, 1994, COMPUT SPEECH LANG, V8, P189, DOI 10.1006/csla.1994.1009 ZACKS Sh., 1971, THEORY STAT INFERENC NR 44 TC 154 Z9 155 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1998 VL 26 IS 1-2 BP 23 EP 43 DI 10.1016/S0167-6393(98)00048-X PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100003 ER PT J AU Iverson, P Bernstein, LE Auer, ET AF Iverson, P Bernstein, LE Auer, ET TI Modeling the interaction of phonemic intelligibility and lexical structure in audiovisual word recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Meeting on Auditory-Visual Speech Processing (AVSP'97) CY 1997 CL RHODES, GREECE DE audiovisual perception; speech perception; word recognition ID SPEECH-PERCEPTION; NORMAL-HEARING AB Studies of audiovisual perception of spoken language have mostly modeled phoneme identification in nonsense syllables, but it is doubtful that models or theories of phonetic processing can adequately account for audiovisual word recognition. The present study took a computational approach to examine how lexical structure may additionally constrain word recognition, given the phonetic information available under vocoded audio, visual and audiovisual stimulus conditions. Subjects made phonemic identification judgments on recordings of spoken nonsense syllables. Hierarchical cluster analysis was used first to select classes of perceptually equivalent phonemes for each of the stimulus conditions, and then a machine-readable phonemically transcribed lexicon was retranscribed in terms of these phonemic equivalence classes. Several statistics were computed for each of the transcriptions, including percent information extracted, percent words unique and expected class size. The findings suggest that superadditive levels of audiovisual enhancement are more likely for monosyllabic than for multisyllabic words. That is, impoverished phonetic information may be sufficient to recognize multisyllabic words, but the recognition of monosyllabic words seems to require additional phonetic information. (C) 1998 Elsevier Science B.V. All rights reserved. C1 House Ear Res Inst, Spoken Language Proc Lab, Los Angeles, CA 90057 USA. RP Iverson, P (reprint author), House Ear Res Inst, Spoken Language Proc Lab, 2100 W 3rd St, Los Angeles, CA 90057 USA. EM piverson@hei.org CR Aldenderfer MS, 1984, CLUSTER ANAL Altman G., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90022-3 ALTMANN GTM, 1990, COGNITIVE MODELS SPE Auer ET, 1996, ICSLP 96 - FOURTH INTERNATIONAL CONFERENCE ON SPOKEN LANGUAGE PROCESSING, PROCEEDINGS, VOLS 1-4, P86 Auer ET, 1997, J ACOUST SOC AM, V102, P3704, DOI 10.1121/1.420402 Aull A. M., 1985, ICASSP 85. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No. 85CH2118-8) Bernstein L. 
E., 1986, J HOPKINS LIPREADING BERNSTEIN LE, 1997, P ESCA ESCOP WORKSH, P89 BERNSTEIN LE, 1991, J ACOUST SOC AM, V90, P2971, DOI 10.1121/1.401771 BERNSTEIN LE, 1993, P 2 INT C TACT AIDS, P57 BERNSTEIN LE, 1989, J ACOUST SOC AM, V85, P397, DOI 10.1121/1.397690 BRAIDA LD, 1991, Q J EXP PSYCHOL-A, V43, P647 CARLSON R, 1985, 11985 SPEECH TRANSM, P63 Carter D. M., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90023-4 ENGEBRETSON AM, 1986, IEEE T BIO-MED ENG, V33, P712, DOI 10.1109/TBME.1986.325762 FISHER CG, 1968, J SPEECH HEAR RES, V11, P796 Grant KW, 1996, J ACOUST SOC AM, V100, P2415, DOI 10.1121/1.417950 GREEN KP, 1989, PERCEPT PSYCHOPHYS, V45, P34, DOI 10.3758/BF03208030 HUTTENLOCHER DP, 1984, ICASSP P SAN DIEG 19 IVERSON P, 1997, J ACOUST SOC AM, V102, P3189, DOI 10.1121/1.420874 KRICOS PB, 1982, VOLTA REV, V84, P219 Kucera H., 1967, COMPUTATIONAL ANAL P LAHIRI A, 1991, COGNITION, V38, P245, DOI 10.1016/0010-0277(91)90008-R LANDAUER TK, 1973, J VERB LEARN VERB BE, V12, P119, DOI 10.1016/S0022-5371(73)80001-5 LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279 Luce P. A., 1986, THESIS INDIANA U LUCE PA, 1990, ACL MIT NAT, P122 Massaro D. W., 1987, SPEECH PERCEPTION EA MCCLELLAND JL, 1986, COGNITIVE PSYCHOL, V18, P1, DOI 10.1016/0010-0285(86)90015-0 NORRIS D, 1994, COGNITION, V52, P189, DOI 10.1016/0010-0277(94)90043-4 Norusis M. J., 1993, SPSS PROFESSIONAL ST OWENS E, 1985, J SPEECH HEAR RES, V28, P381 PISONI DB, 1985, SPEECH COMMUN, V4, P75, DOI 10.1016/0167-6393(85)90037-8 Reisberg D., 1987, HEARING EYE PSYCHOL, P97 SAVIN HB, 1963, J ACOUST SOC AM, V35, P200, DOI 10.1121/1.1918432 SEITZ PF, 1995, PHLEX PHONOLOGICALLY SEKIYAMA K, 1991, J ACOUST SOC AM, V90, P1797, DOI 10.1121/1.401660 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 WALDEN BE, 1977, J SPEECH HEAR RES, V20, P130 WOODWARD MF, 1960, J SPEECH HEAR RES, V3, P212 NR 40 TC 29 Z9 29 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1998 VL 26 IS 1-2 BP 45 EP 63 DI 10.1016/S0167-6393(98)00049-1 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100004 ER PT J AU Remez, RE Fellowes, JM Pisoni, DB Goh, WD Rubin, PE AF Remez, RE Fellowes, JM Pisoni, DB Goh, WD Rubin, PE TI Multimodal perceptual organization of speech: Evidence from tone analogs of spoken utterances SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Meeting on Auditory-Visual Speech Processing (AVSP'97) CY 1997 CL RHODES, GREECE DE multimodal speech perception; perceptual organization; auditory-visual speech perception; intersensory integration; speechreading; sinewave speech ID INFORMATION AB Theoretical and practical motives alike have prompted recent investigations of multimodal speech perception. Theoretically, multimodal studies have extended the conceptualization of perceptual organization beyond the familiar modality-bound accounts deriving from Gestalt psychology. Practically, such investigations have been driven by a need to understand the proficiency of multimodal speech perception using an electrocochlear prosthesis for hearing. In each domain, studies have shown that perceptual organization of speech can occur even when the perceiver's auditory experience departs from natural speech qualities. 
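As a rough illustration of the lexical equivalence-class analysis summarized in the Iverson, Bernstein and Auer abstract above, the sketch below retranscribes a toy lexicon into perceptual equivalence classes and computes two of the statistics mentioned, percent words unique and expected class size. The class inventory and the lexicon are invented placeholders, not data from the paper.

from collections import Counter

# Hypothetical perceptual equivalence classes (e.g. visually confusable phonemes).
CLASSES = {"p": "A", "b": "A", "m": "A",
           "f": "B", "v": "B",
           "t": "C", "d": "C", "n": "C",
           "a": "V1", "i": "V2", "u": "V3"}

# Toy phonemically transcribed lexicon, one phoneme per character.
LEXICON = ["pat", "bat", "mat", "fan", "van", "tin", "din", "nip"]

def retranscribe(word):
    # Map a phoneme string onto its sequence of equivalence classes.
    return tuple(CLASSES[p] for p in word)

patterns = Counter(retranscribe(w) for w in LEXICON)

# Percent of words whose class transcription is unique in the lexicon.
unique = sum(1 for w in LEXICON if patterns[retranscribe(w)] == 1)
percent_unique = 100.0 * unique / len(LEXICON)

# Expected class size: mean number of words sharing a word's transcription.
expected_size = sum(patterns[retranscribe(w)] for w in LEXICON) / len(LEXICON)

print(f"{percent_unique:.1f}% words unique, expected class size {expected_size:.2f}")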
Accordingly, our research examined auditory-visual multimodal integration of videotaped faces and selected acoustic constituents of speech signals, each realized as a single sinewave tone accompanying a video image of an articulating face. The single tone reproduced the frequency and amplitude of the phonatory cycle or of one of the lower three oral formants. Our results showed a distinct advantage for the condition pairing the video image of the face with a sinewave replicating the second formant, despite its unnatural timbre and its presentation in acoustic isolation from the rest of the speech signal. Perceptual coherence of multimodal speech in these circumstances is established when the two modalities concurrently specify the same underlying phonetic attributes. (C) 1998 Published by Elsevier Science B.V. All rights reserved. C1 Columbia Univ Barnard Coll, Dept Psychol, New York, NY 10027 USA. Columbia Univ Coll Phys & Surg, New York, NY 10032 USA. Indiana Univ, Dept Psychol, Speech Res Lab, Bloomington, IN 47405 USA. Yale Univ, Sch Med, Dept Surg, New Haven, CT 06511 USA. Yale Univ, Sch Med, Haskins Labs, New Haven, CT 06511 USA. RP Remez, RE (reprint author), Columbia Univ Barnard Coll, Dept Psychol, 3009 Broadway, New York, NY 10027 USA. EM remez@paradise.barnard.columbia.edu CR BERNSTEIN LE, 1992, 2 INT C TACT AIDS HE Berthomieu P, 1997, POSITIF, P97 Bosman AJ, 1997, AUDIOLOGY, V36, P29 Bradlow AR, 1996, SPEECH COMMUN, V20, P255, DOI 10.1016/S0167-6393(96)00063-5 BREEUWER M, 1985, J ACOUST SOC AM, V77, P314, DOI 10.1121/1.392230 Bregman AS., 1990, AUDITORY SCENE ANAL BROADBENT DE, 1957, J ACOUST SOC AM, V29, P708, DOI 10.1121/1.1909019 GREEN KP, 1985, PERCEPT PSYCHOPHYS, V38, P269, DOI 10.3758/BF03207154 Julesz B., 1972, HUMAN COMMUNICATION, P283 Massaro D. W., 1998, PERCEIVING TALKING F Massaro D. W., 1994, HDB PSYCHOLINGUISTIC, P219 MUNHALL KG, 1996, PERCEPT PSYCHOPHYS, V58, P981 REMEZ RE, 1994, PSYCHOL REV, V101, P129, DOI 10.1037/0033-295X.101.1.129 REMEZ RE, 1984, PERCEPT PSYCHOPHYS, V35, P429, DOI 10.3758/BF03203919 REMEZ RE, 1981, SCIENCE, V212, P947, DOI 10.1126/science.7233191 ROSEN SM, 1981, NATURE, V291, P150, DOI 10.1038/291150a0 RUBIN PE, 1980, SINEWAVE SYNTHESIS SALDANA HM, 1996, SPEECHREADING MAN MA, P145 WELCH RB, 1980, PSYCHOL BULL, V88, P638, DOI 10.1037/0033-2909.88.3.638 Wertheimer M, 1923, PSYCHOL FORSCH, V4, P301, DOI 10.1007/BF00410640 NR 20 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1998 VL 26 IS 1-2 BP 65 EP 73 DI 10.1016/S0167-6393(98)00050-8 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100005 ER PT J AU Sams, M Manninen, P Surakka, V Helin, P Katto, R AF Sams, M Manninen, P Surakka, V Helin, P Katto, R TI McGurk effect in Finnish syllables, isolated words, and words in sentences: Effects of word meaning and sentence context SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Meeting on Auditory-Visual Speech Processing (AVSP'97) CY 1997 CL RHODES, GREECE DE multimodal; multisensory; polysensory; polymodal; auditory; lipreading; speechreading; sensory integration; speech integration ID SPEECH-PERCEPTION; PHONEMIC RESTORATION; SELECTIVE ADAPTATION; AUDITORY-CORTEX; INFORMATION; VOICES AB The "McGurk effect" is a robust illusion in which a subject's perception of an acoustical syllable is modified by the view of the talker's articulation.
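To make the sinewave construction used by Remez and colleagues concrete: the tone analog is a sinusoid whose frequency and amplitude follow a measured track such as the second formant. A minimal sketch follows; the linear F2 ramp and Hanning amplitude contour are invented for illustration, whereas a real stimulus would take both contours from acoustic analysis of the recorded utterance.

import numpy as np

fs = 16000                     # sample rate (Hz)
n = int(0.5 * fs)              # 500 ms of samples

# Hypothetical F2 frequency (Hz) and amplitude contours over the utterance.
f2 = np.linspace(1100.0, 1800.0, n)
amp = np.hanning(n)

# Phase accumulation lets the instantaneous frequency vary sample by sample.
phase = 2.0 * np.pi * np.cumsum(f2) / fs
tone = amp * np.sin(phase)     # single-tone analog, ready to pair with video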
This effect is not perceived universally; it is, however, experienced by the majority of subjects. For example, if the acoustical syllable /ba/ is presented in synchrony with a face articulating /ga/, English-speaking subjects typically perceive /da/ and less frequently /ga/. We studied the McGurk effect in Finnish syllables, isolated words, and words presented in sentence context in 65 subjects. Audiovisual combinations expected to be perceived either as meaningful words or nonwords were used. Words were also presented in various positions of three-word sentences in which the expected word could match or mismatch with the sentence context. A strong McGurk effect was obtained with each stimulus type. In addition, the strength of the McGurk effect did not appear to be influenced by word meaning or sentence context. These findings support the idea that audiovisual speech integration occurs at the phonetic perceptual level, before word meaning is extracted. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Aalto Univ, Lab Computat Engn, FIN-02015 Espoo, Finland. Tampere Univ, Dept Psychol, FIN-33101 Tampere, Finland. RP Sams, M (reprint author), Aalto Univ, Lab Computat Engn, POB 9400, FIN-02015 Espoo, Finland. EM Mikko.Sams@hut.fi RI Sams, Mikko/G-7060-2012 CR BINNIE CA, 1974, J SPEECH HEAR RES, V17, P619 BRANCAZIO L, 1997, 134 M AC SOC AM SAN Calvert GA, 1997, SCIENCE, V276, P593, DOI 10.1126/science.276.5312.593 DEFILIPPO CL, 1988, NEW REFLECTIONS SPEE DEKLE DJ, 1992, PERCEPT PSYCHOPHYS, V51, P355, DOI 10.3758/BF03211629 DODD B, 1977, PERCEPTION, V6, P31, DOI 10.1068/p060031 FOWLER CA, 1991, J EXP PSYCHOL HUMAN, V17, P816 GANONG WF, 1980, J EXP PSYCHOL HUMAN, V6, P110, DOI 10.1037/0096-1523.6.1.110 GREEN K, 1996, NATO ASI SER, P55 GREEN KP, 1991, PERCEPT PSYCHOPHYS, V50, P524, DOI 10.3758/BF03207536 GREEN KP, 1985, PERCEPT PSYCHOPHYS, V38, P269, DOI 10.3758/BF03207154 HAMALAINEN M, 1993, REV MOD PHYS, V65, P413, DOI 10.1103/RevModPhys.65.413 JACKSON PL, 1988, VOLTA REV, V90, P99 JONES JA, 1998, IN PRESS CANADIAN AC KRICOS P, 1996, NATO ASI SER, P43 MACDONALD J, 1978, PERCEPT PSYCHOPHYS, V24, P253, DOI 10.3758/BF03206096 Massaro D. W., 1987, SPEECH PERCEPTION EA MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 O'Neill JJ, 1954, J SPEECH HEAR DISORD, V19, P429 PESONEN J, 1968, ANN ACAD SCI FENNI B, V151, P1 Reisberg D., 1987, HEARING EYE PSYCHOL, P97 ROBERTS M, 1981, PERCEPT PSYCHOPHYS, V30, P309, DOI 10.3758/BF03206144 SALDANA HM, 1994, J ACOUST SOC AM, V95, P3658 SAMS M, 1996, SPEECHREADING HUMANS, P233 SAMS M, 1998, P 3 PAN PAC C BRAIN SAMS M, 1991, NEUROSCI LETT, V127, P141, DOI 10.1016/0304-3940(91)90914-F SAMUEL AG, 1981, J EXP PSYCHOL GEN, V110, P474, DOI 10.1037/0096-3445.110.4.474 SAMUEL AG, 1987, J MEM LANG, V26, P36, DOI 10.1016/0749-596X(87)90061-1 SAMUEL AG, 1981, J EXP PSYCHOL HUMAN, V7, P1124, DOI 10.1037/0096-1523.7.5.1124 SEKIYAMA K, 1991, J ACOUST SOC AM, V90, P1797, DOI 10.1121/1.401660 SEKIYAMA K, 1993, J PHONETICS, V21, P427 Sekiyama K, 1997, PERCEPT PSYCHOPHYS, V59, P73, DOI 10.3758/BF03206849 Stork D.G., 1995, SPEECHREADING HUMANS SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 NR 35 TC 27 Z9 27 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD OCT PY 1998 VL 26 IS 1-2 BP 75 EP 87 DI 10.1016/S0167-6393(98)00051-X PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100006 ER PT J AU de Gelder, B Vroomen, J AF de Gelder, B Vroomen, J TI Impairment of speech-reading in prosopagnosia SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Meeting on Auditory-Visual Speech Processing (AVSP'97) CY 1997 CL RHODES, GREECE DE prosopagnosia; speech-reading; face processing ID FACE RECOGNITION AB The face is a source of information processed by a complex system of partly independent subsystems. The extent of the independence of processing personal identity, facial expression and facial speech remains at present unclear. We investigated the speech-reading ability of a prosopagnosic patient, LH, who is severely impaired on recognition of personal identity and recognition of facial expressions. Previous reports of such cases raised the possibility that speechreading might still be intact, even if almost all other aspects of face processing are lost. A series of speech-reading tasks were administered to LH including still photographs, video clips, short-term memory tasks for auditory and speech-read materials, and tasks aimed at assessing the impact of the visual input on auditory speech recognition. LH was severely impaired on these tasks. We conclude that in LH there is a strong association between severe face processing deficits and loss of speech-reading skills. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Tilburg Univ, Dept Psychol, NL-5000 LE Tilburg, Netherlands. RP de Gelder, B (reprint author), Tilburg Univ, Dept Psychol, POB 90153, NL-5000 LE Tilburg, Netherlands. EM b.degelder@kub.nl RI Vroomen, Jean/K-1033-2013 CR BRUCE V, 1986, BRIT J PSYCHOL, V77, P305 CAMPBELL R, 1992, PHILOS T ROY SOC B, V335, P39, DOI 10.1098/rstb.1992.0005 CAMPBELL R, 1986, BRAIN, V109, P509, DOI 10.1093/brain/109.3.509 CAMPBELL R, 1996, P 4 INT C SPOK LANG CAMPBELL R, 1996, SPEECHREADING HUMANS, P115 Campbell R, 1996, NEUROPSYCHOLOGIA, V34, P1235, DOI 10.1016/0028-3932(96)00046-2 DAMASIO AR, 1990, ANNU REV NEUROSCI, V13, P89, DOI 10.1146/annurev.neuro.13.1.89 De Gelder B, 1991, EUROPEAN J COGNITIVE, V3, P69, DOI 10.1080/09541449108406220 DEGELDER B, 1998, HEARING EYE, V2, P195 DEGELDER B, 1997, M PSYCH SOC PHIL 22 DEGELDER B, UNPUB INVERSION SUPE DEGELDER B, 1998, VISION RES ETCOFF NL, 1991, J COGNITIVE NEUROSCI, V3, P25, DOI 10.1162/jocn.1991.3.1.25 FARAH MJ, 1995, NEUROPSYCHOLOGIA, V33, P661, DOI 10.1016/0028-3932(95)00002-K FARAH MJ, 1995, VISION RES, V35, P2089, DOI 10.1016/0042-6989(94)00273-O LEVINE DN, 1980, PSYCHOL RES-PSYCH FO, V41, P217, DOI 10.1007/BF00308658 LEVINE DN, 1989, BRAIN COGNITION, V10, P149, DOI 10.1016/0278-2626(89)90051-1 MASSARO DW, 1990, PSYCH SCI, V1, P1 SUMMERFIELD Q, 1991, MODULARITY AND THE MOTOR THEORY OF SPEECH PERCEPTION, P117 WALKER S, 1995, PERCEPT PSYCHOPHYS, V57, P1124, DOI 10.3758/BF03208369 ZIHL J, 1983, BRAIN, V106, P313, DOI 10.1093/brain/106.2.313 NR 21 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD OCT PY 1998 VL 26 IS 1-2 BP 89 EP 96 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100007 ER PT J AU Blokland, A Anderson, AH AF Blokland, A Anderson, AH TI Effect of low frame-rate video on intelligibility of speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Meeting on Auditory-Visual Speech Processing (AVSP'97) CY 1997 CL RHODES, GREECE DE frame-rate; intelligibility; speech perception; video communication ID TASK AB Video communication is not a half-way stage between traditional audio communication, such as the telephone, and the even more traditional face-to-face communication. It has its own characteristics and evokes unexpected behaviour from the participants. Many studies investigate the influence that video mediation has on the process of communication. In this paper we look at a little-studied effect on speech production. We show that when speakers can see each other on a low frame-rate video screen, they articulate more clearly than when they cannot see each other and communicate only over an audio link. This is unexpected, because when speakers can see each other face-to-face, their speech is less clear. A video image encourages speech that is more clearly articulated, and this paper will argue that video images can be a distraction rather than a help. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Glasgow, ESRC Human Commun Res Ctr, Multimedia Commun Grp, Glasgow G12 8QB, Lanark, Scotland. RP Blokland, A (reprint author), Univ Glasgow, ESRC Human Commun Res Ctr, Multimedia Commun Grp, 52 Hillhead St, Glasgow G12 8QB, Lanark, Scotland. EM art@mcg.gla.ac.uk CR Anderson A. H., 1997, VIDEO MEDIATED COMMU, P133 Anderson AH, 1997, PERCEPT PSYCHOPHYS, V59, P580, DOI 10.3758/BF03211866 ANDERSON AH, 1991, LANG SPEECH, V34, P351 ANDERSON AH, 1994, P WORKSH VID MED COM ARGYLE M, 1990, BODILY COMMUNICATION ARGYLE M, 1965, SOCIOMETRY, V28, P289, DOI 10.2307/2786027 ARGYLE M, 1977, J ENV PSYCHOL NONVER, V1 BARBER P, 1994, P IEEE INT S MULT TE, P163 BARD EG, 1994, J CHILD LANG, V21, P623 BEATTIE G., 1978, SEMIOTICA, V23, P29, DOI 10.1515/semi.1978.23.1-2.29 BOLINGER, 1981, 2 KINDS VOWELS 2 KIN Bolinger D. L, 1963, LINGUISTICS, V1, P5, DOI 10.1515/ling.1963.1.1.5 BOYLE EA, 1994, LANG SPEECH, V37, P1 CHAFE WL, 1974, LANGUAGE, V50, P111, DOI 10.2307/412014 CLARK HH, 1991, PERSPECTIVES ON SOCIALLY SHARED COGNITION, P127, DOI 10.1037/10096-006 Egido C., 1990, INTELLECTUAL TEAMWOR, P351 Fowler C., 1988, LANG SPEECH, V28, P47 FOWLER CA, 1987, J MEM LANG, V26, P489, DOI 10.1016/0749-596X(87)90136-7 KENDON A, 1967, ACTA PSYCHOL, V26, P22, DOI 10.1016/0001-6918(67)90005-4 KRANTZ M, 1983, J PSYCHOL, V113, P9 LIEBERMAN P, 1963, LANG SPEECH, V6, P172 MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 O'Conaill B., 1993, Human-Computer Interaction, V8, DOI 10.1207/s15327051hci0804_4 OMALLEY C, 1994, WORKSH VID MED COMM Reisberg D., 1987, HEARING EYE PSYCHOL, P97 Rutter D. R., 1987, COMMUNICATING TELEPH Doherty-Sneddon G., 1997, J EXPT PSYCHOL APPL, V3, P1 NR 27 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD OCT PY 1998 VL 26 IS 1-2 BP 97 EP 103 DI 10.1016/S0167-6393(98)00053-3 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100008 ER PT J AU Yamamoto, E Nakamura, S Shikano, K AF Yamamoto, E Nakamura, S Shikano, K TI Lip movement synthesis from speech based on Hidden Markov Models SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Meeting on Auditory-Visual Speech Processing (AVSP'97) CY 1997 CL RHODES, GREECE DE multimodal interface; lip synchronization; lip movement synthesis; coarticulation; hidden Markov model; Viterbi alignment AB Speech intelligibility can be improved by adding lip images to the speech signal. Lip movement synthesis thus plays an important role in realizing a natural, human-like face for computer agents. This paper proposes a novel lip movement synthesis method from speech input based on Hidden Markov Models (HMMs). The difficulties of lip movement synthesis are caused by coarticulation effects from preceding and succeeding phonemes. The proposed method gives a simple solution that generates context dependent lip parameters by looking ahead to the HMM state sequence obtained using context independent HMMs. In objective evaluation experiments, the proposed method is evaluated by the time-averaged error and the time-averaged differential error between synthesized lip parameters and original ones. The result shows that the time-averaged error and the time-averaged differential error of the HMM-based method with context independent lip parameters are 8.7% and 32% smaller than those obtained using a Vector Quantization (VQ) based method. Moreover, the time-averaged error and time-averaged differential error generated by the proposed HMM-based method with context dependent lip parameters are further reduced by 10.5% and 11% compared to the HMM-based method with the context independent lip parameters. The proposed HMM-based method with context dependent lip parameters mainly reduces the errors of the phonemes /h/, /g/ and /k/. In subjective evaluation experiments, although differences in the audio-visual intelligibility between the synthesized lip parameters and the original ones are insignificant, the acceptability test to evaluate naturalness reflects the results of the objective evaluation. Mean opinion scores of acceptability for the VQ-based method and the proposed HMM-based method are 3.25 and 3.74, respectively. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Nara Inst Sci & Technol, Nara 6300101, Japan. RP Yamamoto, E (reprint author), Nara Inst Sci & Technol, 8916-5 Takayama, Nara 6300101, Japan.
EM eli-y@is.aist-nara.ac.jp; nakamura@is.aist-nara.ac.jp; shikano@is.aist-nara.ac.jp CR BREGLER C, 1997, P WORKSH AUD VIS SPE, P153 BROOKE NM, 1996, SPEECHREADING HUMANS, P351 CHEN T, 1997, P ICASSP MUN APR, V1, P179 CHOU W, 1995, P IEEE INT C AC SPEE, V4, P2253 ERBER NP, 1969, J SPEECH HEAR RES, V12, P423 GOLDENTHAL W, 1997, EUR 97 P, V4, P1995 GUIARDMARIGNY T, 1994, PROGR SPEECH SYNTHES, P247 HIKI S, 1996, SPEECHREADING MAN MA, P239 IEEE, 1969, IEEE T AUDIO ELECTRO, VAE-17, P227 Lavagetto F., 1995, IEEE Transactions on Rehabilitation Engineering, V3, DOI 10.1109/86.372898 LINDE Y, 1980, IEEE T COMMUN, V28, P1 MORISHIMA S, 1989, P IEE INT C AC SPEEC, V3, P1795 MORISHIMA S, 1991, IEEE J SEL AREA COMM, V9, P594, DOI 10.1109/49.81953 NEELY KK, 1956, J ACOUST SOC AM, V28, P1275, DOI 10.1121/1.1908620 Rao RR, 1996, INT CONF ACOUST SPEE, P2056, DOI 10.1109/ICASSP.1996.545722 Simons A. D., 1990, P I AC, V12, P475 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 SUMMERFIELD Q, 1979, PHONETICA, V36, P314 YAMAMOTO E, 1997, P EUR SPEECH C ASS W, P137 NR 19 TC 52 Z9 53 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1998 VL 26 IS 1-2 BP 105 EP 115 DI 10.1016/S0167-6393(98)00054-5 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100009 ER PT J AU Benoit, C Le Goff, B AF Benoit, C Le Goff, B TI Audio-visual speech synthesis from French text: Eight years of models, designs and evaluation at the ICP SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Meeting on Auditory-Visual Speech Processing (AVSP'97) CY 1997 CL RHODES, GREECE DE face animation; 3D lip model; speechreading; coarticulation; text-to-audiovisual speech synthesis; intelligibility; French visemes; speaking rate; loudness AB Since 1990, a series of visual speech synthesizers have been developed and synchronized with a French text-to-speech synthesizer at the ICP in Grenoble. In this article, we describe the different structures of these visual synthesizers. The techniques used include key-frame approaches based on 24 lip/chin images carefully selected to account for most of the basic coarticulated shapes in French, 2D parametric models of the lip contours, and finally 3D parametric models of the main components of the face. The successive versions were systematically evaluated, with the same reference corpus, according to a standard procedure. Auditory intelligibility and audio-visual intelligibility were compared under several conditions of acoustic distortion to evaluate the benefit of speechreading. Tests were run with acoustic material produced by a text-to-speech synthesizer or by a reference human speaker. Our results show that while visual speech is unnecessary under clear acoustic conditions, it adds intelligibility to the auditory information when the acoustics are degraded. Furthermore, the intelligibility provided by the visual channel increased constantly through successive improvements of our text-to-visual speech synthesizers. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Grenoble 3, INPG ENSERG, Inst Commun Parlee, CNRS,UPRESA 5009, F-38040 Grenoble, France. RP Benoit, C (reprint author), ATR Human Informat Proc Res Labs, 2-2 Hikaridai, Seika, Kyoto 6190288, Japan. 
EM bateson@hip.atr.co.jp CR ABRY C, 1991, P 12 INT C PHON SCI, V1, P220 ADJOUDANI A, 1993, MEMOIRE DEA SIP INPG ALISSALI M, 1993, THESIS I NATL POLYTE Bailly G, 1997, SPEECH COMMUN, V22, P251, DOI 10.1016/S0167-6393(97)00025-3 Bailly G., 1992, Traitement du Signal, V9 BAILLY G, 1991, P 12 INT C PHON SCI, V2, P506 BENGUERE.AP, 1974, PHONETICA, V30, P41 Benoit C, 1996, NATO ASI SER, V150, P315 Benoit C., 1992, TALKING MACHINES THE, P485 BENOIT C, 1996, P ETRW 96 AUTR, P237 BENOIT C, 1997, P 5 EUR C RHOD, V3, P1671 BESKOW J, 1995, P EUR 95 MADR, V1, P299 BROOKE NM, 1983, J PHONETICS, V11, P63 CHARPENTIER F, 1989, P EUROSPEECH 89, V2, P13 Cohen M., 1993, COMPUTER ANIMATION, P139 COHEN MM, 1990, BEHAV RES METH INSTR, V22, P260 CORNETT RO, 1967, AM ANN DEAF, V112, P3 ERBER NP, 1975, J SPEECH HEAR DISORD, V40, P481 ERBER NP, 1969, J SPEECH HEAR RES, V12, P423 Fisher C., 1968, J SPEECH HEAR RES, V15, P474 FOWLER CA, 1983, J EXP PSYCHOL GEN, V112, P386, DOI 10.1037//0096-3445.112.3.386 GUIARDMARIGNY T, 1992, ANIMATION TEMPS REEL, P72 GUIARDMARIGNY T, 1996, PROGR SPEECH SYNTHES, P247 Joos M., 1948, LANGUAGE SUPPL, V24, P1, DOI DOI 10.2307/522229 BENOI C, 1994, J SPEECH HEAR RES, V37, P1195 LEGOFF B, 1996, P 4 INT C SPOK LANG, V4, P2163, DOI 10.1109/ICSLP.1996.607232 LEGOFF B, 1996, PROGR SPEECH SYNTHES, P235 LEGOFF B, 1997, THESIS INP GRENOBLE, P256 LEGOFF B, 1993, COMMANDES PARAMETRIQ LEGOFF B, 1997, P ESCA WORKSH AUD VI, P145 LEGOFF B, 1997, P 5 EUR C RHOD, V3, P1667 LOFQVIST A, 1990, NATO ADV SCI I D-BEH, V55, P289 Massaro D. W., 1997, PERCEIVING TALKING F MCGURK H, 1976, NATURE, V264, P746, DOI 10.1038/264746a0 MOHAMADI T, 1993, THESIS INP GRENOBLE Parke F, 1974, THESIS U UTAH SAINTOURENS M, 1990, P 1 ESCA WORKSH SPEE, P249 Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 SUMBY WH, 1954, J ACOUST SOC AM, V26, P212, DOI 10.1121/1.1907309 Summerfield Q., 1989, HDB RES FACE PROCESS, P223 WOODWARD P, 1992, 19 JOURN ET PAR BRUS, P319 WOODWARD P, 1991, THESIS INP GRENOBLE NR 42 TC 26 Z9 27 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1998 VL 26 IS 1-2 BP 117 EP 129 DI 10.1016/S0167-6393(98)00045-4 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100010 ER PT J AU Basu, S Oliver, N Pentland, A AF Basu, S Oliver, N Pentland, A TI 3D lip shapes from video: A combined physical-statistical model SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Meeting on Auditory-Visual Speech Processing (AVSP'97) CY 1997 CL RHODES, GREECE DE lip models; deformable/non-rigid models; finite element models; analysis-synthesis models; training models from video; model-based tracking AB Tracking human lips in video is an important but notoriously difficult task. To accurately recover their motions in 3D from any head pose is an even more challenging task, though still necessary for natural interactions. Our approach is to build and train 3D models of lip motion to make up for the information we cannot always observe when tracking. We use physical models as a prior and combine them with statistical models, showing how the two can be smoothly and naturally integrated into a synthesis method and a MAP estimation framework for tracking. We have found that this approach allows us to accurately and robustly track and synthesize the 3D shape of the lips from arbitrary head poses in a 2D video stream. 
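The combination of a physical prior and a statistical model described in this abstract can be caricatured as MAP estimation of shape-basis coefficients: a data term pulls the fit toward the observed lip points while a prior term penalizes shapes that the training statistics consider unlikely. In the sketch below the basis, mode variances and observations are synthetic stand-ins, not the authors' trained model.

import numpy as np

rng = np.random.default_rng(0)
n_points, n_modes = 20, 4

B = rng.standard_normal((n_points, n_modes))  # shape basis (e.g. from PCA)
prior_var = np.array([4.0, 2.0, 1.0, 0.5])    # learned per-mode variances
obs_var = 0.1                                 # observation noise variance

x_true = rng.standard_normal(n_modes)
y = B @ x_true + np.sqrt(obs_var) * rng.standard_normal(n_points)  # noisy data

# MAP estimate: argmin ||y - B x||^2 / obs_var + x' diag(1/prior_var) x,
# i.e. ridge regression with per-mode regularization supplied by the prior.
A = B.T @ B / obs_var + np.diag(1.0 / prior_var)
x_map = np.linalg.solve(A, B.T @ y / obs_var)

print("true:", np.round(x_true, 2), "MAP:", np.round(x_map, 2))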
We demonstrate this with numerical results on reconstruction accuracy, examples of static fits, and audio-visual sequences. (C) 1998 Elsevier Science B.V. All rights reserved. C1 MIT, Media Lab, Perceptual Comp Sect, Cambridge, MA 02139 USA. RP Basu, S (reprint author), MIT, Media Lab, Perceptual Comp Sect, E15-383,20 Ames St, Cambridge, MA 02139 USA. EM sbasu@media.mit.edu CR ADJOUDANI A, 1995, SPEECHREADING MAN MA, P461 Basu S., 1998, P IEEE INT C COMP VI, P337 BASU S, 1997, THESIS MIT CAMBRIDGE BASU S, 1996, P 13 INT C PATT REC, VC, P611 BREGLER C, 1995, ADV NEURAL INFORMATI, V7, P401 COIANIZ T, 1995, SPEECHREADING MAN MA, P391 ESSA I, 1995, THESIS MIT CAMBRIDGE Gelb A., 1974, APPL OPTIMAL ESTIMAT Jebara TS, 1997, PROC CVPR IEEE, P144, DOI 10.1109/CVPR.1997.609312 Kass M., 1988, INT J COMPUT VISION, V1, P321, DOI DOI 10.1007/BF00133570 K-J Bathe, 1982, FINITE ELEMENT PROCE Lee Y. C., 1995, P SIGGRAPH 95, P55, DOI DOI 10.1145/218380.218407 Luettin J., 1996, P IEEE INT C AC SPEE, V2, P817 Martin J, 1998, IEEE T PATTERN ANAL, V20, P97, DOI 10.1109/34.659928 Oliver N, 1997, PROC CVPR IEEE, P123, DOI 10.1109/CVPR.1997.609309 SUMMERFIELD Q, 1979, PHONETICA, V36, P314 WATERS K, 1995, GRAPH INTER, P163 Zienkiewicz O., 1967, FINITE ELEMENT METHO NR 18 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1998 VL 26 IS 1-2 BP 131 EP 148 DI 10.1016/S0167-6393(98)00055-7 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100011 ER PT J AU Rogozan, A Deleglise, P AF Rogozan, A Deleglise, P TI Adaptive fusion of acoustic and visual sources for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Meeting on Auditory-Visual Speech Processing (AVSP'97) CY 1997 CL RHODES, GREECE ID PERCEPTION AB Among the various methods proposed to improve the accuracy and the robustness of automatic speech recognition (ASR), the use of additional knowledge sources is a successful one. In particular, a recent method proposes supplementing the acoustic information with visual data mostly derived from the speaker's lip shape. Perceptual studies support this approach by emphasising the importance of visual information for speech recognition in humans. This paper describes a method we have developed for adaptive integration of acoustic and visual information in ASR. Each modality is involved in the recognition process with a different weight, which is dynamically adapted during this process mainly according to the signal-to-noise ratio provided as a contextual input. We tested this method on continuous hidden Markov model-based systems developed according to direct identification (DI), separate identification (SI) and hybrid identification (DI + SI) strategies. Experiments performed under various noise-level conditions show that the DI + SI-based system is the most promising when compared to both the DI- and SI-based systems on a speaker-dependent task of recognizing continuously spelled French letters. They also confirm that using adaptive modality weights instead of fixed weights allows for performance improvement and that weight estimation could benefit from using visemes as decision units for the visual recogniser in SI and DI + SI-based systems. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Maine, Lab Informat, F-72085 Le Mans 9, France.
RP Rogozan, A (reprint author), Univ Maine, Lab Informat, F-72085 Le Mans 9, France. EM alexandrina.rogozan@lium.univ-lemans.fr CR Abry C., 1991, P 12 INT C PHON SCI, P220 Adjoudani A, 1996, NATO ASI SERIES F, P461 ALISSALI M, 1996, P INT C SPOK LANG PR, P34 ANDREOBRECHT R, 1997, P ESCA ESCOP WORKSH, P49 Baum L. E., 1972, INEQUALITIES, V3, P1 BENOIT C, 1992, HIRADA TECHNIKA 43 J, V43, P32 Benoit C., 1992, TALKING MACHINES THE, P485 BROOKE M, 1996, NATO ASI SERIES F, V150, P351 CAELEN J, 1996, 21 JOURN ET PAR AV 1, P325 Chen JY, 1997, PROCEEDINGS OF THE FOURTH ASIAN SYMPOSIUM ON INFORMATION DISPLAY, P57 CLINE AK, 1974, COMMUN ACM, V17, P218, DOI 10.1145/360924.360971 FICHER G, 1968, J SPEECH HEAR DISORD, V40, P481 GOLDSCHEN J, 1993, THESIS G WASHINGTON GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Hennecke M, 1996, NATO ASI SERIES F, V150, P331 Jourlin P., 1997, P ESCA ESCOP WORKSH, P69 JOURLIN P, 1996, P EUR SIGN PROC C TR, P133 BENOI C, 1994, J SPEECH HEAR RES, V37, P1195 Kohonen T., 1988, SELF ORG ASS MEMORY KRICOS PB, 1996, NATO ASI SERIES F, V150, P43 LALLOUACHE T, 1990, JOURN ET PAR MONTR LEE KF, 1992, AUTOMATIC SPEECH REC MASSARO DW, 1995, P 13 INT C PHON SCI, P106 MEIER U, 1996, P INT C AC SPEECH SI MOVELLAN JR, 1997, ADV NEURAL INFORMATI, P10 Reisberg D., 1987, HEARING EYE PSYCHOL, P97 ROBERTRIBES J, 1995, ARTIF INTELL REV, V9, P323, DOI 10.1007/BF00849043 ROBERTRIBES J, 1996, NATO ASI SERIES F, V150, P193 ROGOZAN A, 1997, P 5 C SPEECH COMM TE, P1999 Silsbee PL, 1996, IEEE T SPEECH AUDI P, V4, P337, DOI 10.1109/89.536928 SUAUDEAU N, 1993, P EUR, P307 Summerfield A. Q., 1987, HEARING EYE PSYCHOL, P3 SUMMERFIELD Q, 1979, PHONETICA, V36, P314 TEISSIER P, 1997, P 1 IEEE WORKSH MULT, P37, DOI 10.1109/MMSP.1997.602610 VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 NR 35 TC 27 Z9 27 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1998 VL 26 IS 1-2 BP 149 EP 161 DI 10.1016/S0167-6393(98)00056-9 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 145VA UT WOS:000077391100012 ER PT J AU Batliner, A Kompe, R Kiessling, A Mast, M Niemann, H Noth, E AF Batliner, A Kompe, R Kiessling, A Mast, M Niemann, H Noth, E TI M = Syntax plus Prosody: A syntactic-prosodic labelling scheme for large spontaneous speech databases SO SPEECH COMMUNICATION LA English DT Article DE syntax; dialogue; prosody; phrase boundaries; prosodic labelling; automatic boundary classification; spontaneous speech; large databases; neural networks; stochastic language models AB In automatic speech understanding, division of continuous running speech into syntactic chunks is a great problem. Syntactic boundaries are often marked by prosodic means. For the training of statistical models for prosodic boundaries large databases are necessary. For the German Verbmobil (VM) project (automatic speech-to-speech translation), we developed a syntactic-prosodic labelling scheme where different types of syntactic boundaries are labelled for a large spontaneous speech corpus. This labelling scheme is presented and compared with other labelling schemes for perceptual-prosodic, syntactic, and dialogue act boundaries. Interlabeller consistencies and estimation of effort needed are discussed. 
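A drastically reduced sketch of the kind of statistical boundary classification the Batliner et al. abstract discusses: tokens labelled with syntactic-prosodic boundaries are used to estimate P(boundary | word), which is then thresholded. The training pairs are invented, and the actual system used n-gram language models and multi-layer perceptrons over prosodic features rather than this unigram toy.

from collections import defaultdict

# Hypothetical (word, follows_boundary) pairs from a labelled corpus.
train = [("yes", True), ("okay", True), ("monday", True),
         ("the", False), ("on", False), ("yes", True), ("on", False)]

counts = defaultdict(lambda: [0, 0])     # word -> [no-boundary, boundary]
for word, boundary in train:
    counts[word][int(boundary)] += 1

def p_boundary(word):
    # P(boundary | word) with add-one smoothing so unseen words stay neutral.
    no_b, b = counts[word]
    return (b + 1) / (no_b + b + 2)

for w in ["yes", "the", "tuesday"]:
    label = "boundary" if p_boundary(w) > 0.5 else "no boundary"
    print(w, label, round(p_boundary(w), 2))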
We compare the results of classifiers (multi-layer perceptrons (MLPs) and n-gram language models) trained on these syntactic-prosodic boundary labels with classifiers trained on perceptual-prosodic and pure syntactic labels. The main advantage of the rough syntactic-prosodic labels presented in this paper is that large amounts of data can be labelled with relatively little effort. The classifiers trained with these labels turned out to be superior to those trained on purely prosodic or purely syntactic labelling schemes, yielding recognition rates of up to 96% for the two-class problem 'boundary versus no boundary'. The use of boundary information leads to a marked improvement in the syntactic processing of the VM system. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Erlangen Nurnberg, Lehrstuhl Mustererkennung, D-91058 Erlangen, Germany. Sony Stuttgart Technol Ctr, D-70736 Fellbach, Germany. Ericsson Eurolab Deutschland GMBH, D-90411 Nurnberg, Germany. IBM Heidelberg Sci Ctr, D-69115 Heidelberg, Germany. RP Batliner, A (reprint author), Univ Erlangen Nurnberg, Lehrstuhl Mustererkennung, Martensstr 3, D-91058 Erlangen, Germany. EM batliner@informatik.uni-erlangen.de CR BATLINER A, 1995, NATO ASI SERIES F, V147, P325 BATLINER A, 1996, 102 VERBM BATLINER A, 1997, P ESCA WORKSH INT DE, P39 BATLINER A, 1996, P INT C COMP LING CO, V1, P71, DOI 10.3115/992628.992644 BATLINER A, 1997, 124 VERBM BEAR J, 1990, P 28 C ASS COMP LING, P17, DOI 10.3115/981823.981826 BECKMAN M, 1994, GUIDELINES TOBI TRAN BLOCK H, 1992, P INT C COMP LING NA, V1, P87, DOI 10.3115/992066.992083 BLOCK H, 1997, P INT C AC SPEECH SI, V79, P1 BUB T, 1996, INT C SPOK LANG PROC, V4, P1026 CAHN J, 1992, P IRCS WORKSH PROS N, P19 CARLETTA J, 1997, 167 DAGST SEM Cutler A, 1997, LANG SPEECH, V40, P141 DEPIJPER JR, 1994, J ACOUST SOC AM, V96, P2037, DOI 10.1121/1.410145 FELDHAUS A, 1995, 94 VERBM Grammont Maurice, 1923, B SOC LINGUISTIQUE, V24, P1 GRICE M, 1996, INT C SPOK LANG PROC, V3, P1716 GUTTMAN L, 1977, STATISTICIAN, V26, P81, DOI 10.2307/2987957 HAFTKA B, 1993, SYNTAX INT HDB ZEITG, V1, P846 HIRSCHBERG J, 1994, P INT S PROS YOK JAP, P103 HUNT A, 1994, INT C SPOK LANG PROC, V3, P1119 JEKAT S, 1995, 65 VERMB Kiessling A., 1997, EXTRAKTION KLASSIFIK KIESSLING A, 1994, PROGR PROSPECTS SPEE, P266 KIESSLING A, 1994, INT C SPOK LANG PROC, V1, P115 KISS T, 1995, MERKMALE REPRASENTAT KOMPE R, 1997, LECT NOTES ARTIFICIA KOMPE R, 1997, P INT C AC SPEECH SI, V2, P811 KOMPE R, 1994, P INT C AC SPEECH SI, V2, P173 KOMPE R, 1995, P EUR C SPEECH COMM, V2, P1333 LAMEL L, 1992, ESCA NEWSLETTER, V8, P7 Lea W., 1980, TRENDS SPEECH RECOGN, P166 MAIER E, 1997, 193 VERBM MAST M, 1995, 97 VERBM NIEMANN H, 1997, P INT C AC SPEECH SI, V1, P75 OSTENDORF M, 1990, SPEECH NATURAL LANGU, P26 Ostendorf M., 1994, Computational Linguistics, V20 Ostendorf M., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1010 Pitrelli John F., 1994, INT C SPOK LANG PROC, V1, P123 Pollard Carl, 1987, CSLI LECT NOTES, V13 PRICE P, 1990, INT C SPOKEN LANGUAG, V1, P13 PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770 REITHINGER N, 1997, COMMUNICATION REYELT M, 1998, EXPT UNTERSUCHUNGEN REYELT M, 1997, COMMUNICATION REYELT M, 1994, 33 VERBM REYELT M, 1995, P 13 INT C PHON SCI, V4, P212 SWERTS M, 1992, INT C SPOK LANG PROC, V1, P421 Thon W., 1994, HDB DATENAUFNAHME TR TROPF H, 1994, 54 ZFE ST SN SIEM AG VAISSIERE J, 1988, NATO ASI SERIES F, V46, P71 Wahlster W., 1993, P 3 EUR C SPEECH COM, P29 WAHLSTER W, 1997, P INT C AC SPEECH SI, V1, P71 Wang M.
Q., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90025-Y WIGHTMAN C, 1992, THESIS BOSTON U GRAD NR 55 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1998 VL 25 IS 4 BP 193 EP 222 DI 10.1016/S0167-6393(98)00037-5 PG 30 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 132AA UT WOS:000076606400001 ER PT J AU Minker, W AF Minker, W TI Stochastic versus rule-based speech understanding for information retrieval SO SPEECH COMMUNICATION LA English DT Article DE natural language; semantic representation; case grammar; task and language independency; Hidden Markov Models ID LANGUAGE; MODELS AB In this paper we report our experience at LIMSI-CNRS in developing and porting a stochastic component for natural language understanding to different tasks and human languages. The domains in which we test this component are the American ATIS (Air Travel Information Services) and the French MASK (Multimodal-Multimedia Automated Service Kiosk) applications. The study demonstrates that for limited applications, a stochastic method outperforms a well-tuned rule-based component. In addition we show that the human effort can be limited to the task of data labeling, which is much simpler than the design, maintenance and extension of the grammar rules. Since a stochastic method automatically learns the semantic formalism through an analysis of these data, it is comparatively flexible and robust. (C) 1998 Elsevier Science B.V. All rights reserved. C1 CNRS, LIMSI, Spoken Language Proc Grp, F-91403 Orsay, France. RP Minker, W (reprint author), CNRS, LIMSI, Spoken Language Proc Grp, BP 133, F-91403 Orsay, France. EM minker@limsi.fr CR BATES M, 1992, P DARPA SPEECH NAT L, P102 Bennacef S., 1994, P ICSLP, P1271 BENNACEF SK, 1995, THESIS U PARIS 11 OR BONNEAUMAYNARD H, 1993, P EUROSPEECH, P2059 BRESNAN JOAN, 1982, MENTAL REPRESENTATIO BRUCE B, 1975, ARTIF INTELL, V6, P327, DOI 10.1016/0004-3702(75)90020-X BURTON R, 1976, 3353 BBN Chomsky N., 1965, ASPECTS THEORY SYNTA FELDMAN JA, 1982, COGNITIVE SCI, V6, P205, DOI 10.1207/s15516709cog0603_1 Fillmore C. J., 1968, UNIVERSALS LINGUIST, P1 Gauvain J. L., 1997, HUMAN COMFORT SECURI, P93 HAYES P, 1986, P COLING, P587, DOI 10.3115/991365.991537 ISSAR S, 1993, P EUROSPEECH 93, P2147 JELINEK F, 1992, RECENT ADV, V75, P345 JELINEK F, 1994, P ARPA WORKSH HUM LA, P260 JOSHI A, 1985, TREE AUTOMATA LGS KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 Levin E., 1995, P DARPA SPEECH NAT L, P269 *MADCOW WORK GROUP, 1992, PN SPEECH NAT LANG W, P7 Miller S., 1996, P 34 ANN M ASS COMP, P55, DOI org/10.3115/981863.981871 MINKER W, 1997, P EUR C SPEECH COMM, P1423 MINKER W, 1996, 9520 LIMSICNRS Minker W., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607775 MINKER W, 1996, P JOURNEES ETUDES PA, P417 Pallett D. S., 1995, P SPOK LANG TECHN WO, P5 Price P., 1990, P DARPA SPEECH NAT L, P91, DOI 10.3115/116580.116612 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 RABINOWICZ E, 1986, TRIBOLOGY MECHANICS, V3, P1 Seneff S., 1992, Computational Linguistics, V18 Ward W., 1990, P DARPA SPEECH NAT L, P127, DOI 10.3115/116580.116621 NR 30 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD SEP PY 1998 VL 25 IS 4 BP 223 EP 247 DI 10.1016/S0167-6393(98)00038-7 PG 25 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 132AA UT WOS:000076606400002 ER PT J AU Sorokin, V Olshansky, V Kozhanov, L AF Sorokin, V Olshansky, V Kozhanov, L TI Internal model in articulatory control: Evidence from speaking without larynx SO SPEECH COMMUNICATION LA English DT Article DE speech pathology; motor control; internal model; inverse problem ID SPEECH-PERCEPTION; INVERSE PROBLEM; MOTOR CONTROL; LOWER-LIP; CRITIQUE; VOWELS; PERTURBATION; COMPENSATION; RESPONSES; MOVEMENT AB The compensating ability of the articulatory control system of laryngectomized patients was studied. X-rays of the vocal tract and acoustic measurements were carried out on three patients before and after the operation, using a trachea-esophagus bypass. Within two weeks of the operation, the patients produced the Russian vowels /a, u, i/ with formant frequencies closer to the phonetic norm than before the operation. After two years, two patients produced the vowels with normal formant parameters. The acoustical characteristics of speech after the operation were measured on 14 patients. One to two years after the operation, four patients were able to make the voicing-unvoicing distinction. One patient recovered complete control of the vocal source. The results obtained imply that the adaptation of the articulatory control system to the distorted conditions of articulation and voice generation can be governed not only by acoustical parameters such as formant frequencies, but also by a phonetic element as complex as the voicing cue. The control system has demonstrated its ability to reorganize the activity of articulatory muscles and to transfer the functions of the excised laryngeal muscles to muscles that had never been used for voice control. The implications of the observed phenomena for the concept of an internal model are discussed. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Russian Acad Sci, Inst Informat Transmiss Problems, Moscow 101447, Russia. Herzen Res Inst Oncol, Moscow 125284, Russia. RP Sorokin, V (reprint author), Russian Acad Sci, Inst Informat Transmiss Problems, Bolshoy Karetny 19, Moscow 101447, Russia. EM vns@ippi.ac.msk.su CR Abbs J. H., 1983, TRENDS NEUROSCI, V6, P393 Adams J. A., 1976, MOTOR CONTROL ISSUES, P87 BENGUERE.AP, 1974, PHONETICA, V30, P41 BLOM ED, 1986, ARCH OTOLARYNGOL, V112, P440 Fant G., 1960, ACOUSTIC THEORY SPEE FLEGE JE, 1988, J ACOUST SOC AM, V83, P212, DOI 10.1121/1.396424 FOLKINS JW, 1982, J ACOUST SOC AM, V71, P1225, DOI 10.1121/1.387771 FOLKINS JW, 1975, J SPEECH HEAR RES, V18, P207 Fowler C. A., 1994, ENCY LANGUAGE LINGUI, V8, P4199 FOWLER CA, 1986, J PHONETICS, V14, P3 GANDOUR J, 1980, PHONETICA, V37, P344 GARCIA CE, 1989, AUTOMATICA, V25, P335, DOI 10.1016/0005-1098(89)90002-2 GATES GA, 1982, AM J OTOLARYNG, V3, P1, DOI 10.1016/S0196-0709(82)80025-2 GAY T, 1981, J ACOUST SOC AM, V69, P802, DOI 10.1121/1.385591 Gracco V.
L, 1987, MOTOR SENSORY PROCES, P163 GRILLNER S, 1982, SPEECH MOTOR CONTROL, P217 Gurfinkel V.S., 1988, STANCE MOTION FACTS, P185 HAMLET SL, 1976, J PHONETICS, V7, P196 HILGERS FJM, 1993, CLIN OTOLARYNGOL, V18, P721 HOUSE AS, 1957, J SPEECH HEAR DISORD, V22, P190 HOUSE AS, 1965, J ACOUST SOC AM, V37, P158, DOI 10.1121/1.1909295 JORDAN MI, 1992, COGNITIVE SCI, V16, P307, DOI 10.1207/s15516709cog1603_1 KAWATO M, 1987, BIOL CYBERN, V57, P169, DOI 10.1007/BF00364149 KELSO JAS, 1976, MOTOR CONTROL ISSUES, P3 LIBERMAN AM, 1985, COGNITION, V21, P1, DOI 10.1016/0010-0277(85)90021-6 LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279 LINDBLOM B, 1979, J PHONETICS, V7, P147 Lindblom B, 1996, J ACOUST SOC AM, V99, P1683, DOI 10.1121/1.414691 MACFARLAND DH, 1995, J ACOUST SOC AM, V97, P1865 MACNEILAGE PF, 1972, SPEECH CORTICAL FUNC, P1 MUNHALL KG, 1994, J ACOUST SOC AM, V95, P3605, DOI 10.1121/1.409929 Ohala JJ, 1996, J ACOUST SOC AM, V99, P1718, DOI 10.1121/1.414696 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P1235 OLSHANSKY VO, 1997, RUSSIAN J ONCOLOGY, V3, P17 OShaughnessy D, 1996, J ACOUST SOC AM, V99, P1726, DOI 10.1121/1.414697 PERKELL JS, 1995, J PHONETICS, V23, P23, DOI 10.1016/S0095-4470(95)80030-1 Poeck K, 1971, Cortex, V7, P254 Remez RE, 1996, J ACOUST SOC AM, V99, P1695, DOI 10.1121/1.414693 RUSSEL DG, 1976, SPATIAL LOCATION CUE, P67 SAVARIAUX C, 1995, J ACOUST SOC AM, V98, P2428, DOI 10.1121/1.413277 SCHMIDT RA, 1976, SCHEMA SOLUTION SOME, P41 SINGER MI, 1980, ANN OTO RHINOL LARYN, V89, P529 SISTY NL, 1972, J SPEECH HEAR RES, V15, P439 Sorokin V. N., 1985, THEORY SPEECH PRODUC SOROKIN VN, 1994, SPEECH COMMUN, V14, P249, DOI 10.1016/0167-6393(94)90065-5 SOROKIN VN, 1987, P 11 INT C PHON SCI, V3, P382 Sorokin VN, 1996, SPEECH COMMUN, V19, P105, DOI 10.1016/0167-6393(96)00028-3 SOROKIN VN, 1996, P 1 ESCA TUT RES WOR, P129 SOROKIN VN, 1992, SPEECH COMMUN, V11, P71, DOI 10.1016/0167-6393(92)90064-E STAFFIERI M, 1987, REV LARYNGOL, V95, P145 Stevens KN, 1996, J ACOUST SOC AM, V99, P1693, DOI 10.1121/1.414692 TAPTAPOVA CL, 1985, REHABILITATION VOICE VANBERG DJ, 1958, FOLIA PHONIATR, V10, P66 VETTER RJ, 1967, NEUROPSYCHOLOGIA, V5, P335, DOI 10.1016/0028-3932(67)90005-X WEINBERG B, 1971, J SPEECH HEAR RES, V14, P351 WEINSTEIN S, 1961, NEUROLOGY, V11, P905 ZANOFF DJ, 1990, LARYNGOSCOPE, V100, P408 NR 57 TC 6 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1998 VL 25 IS 4 BP 249 EP 268 DI 10.1016/S0167-6393(98)00039-9 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 132AA UT WOS:000076606400003 ER PT J AU Junqua, JC Haton, JP AF Junqua, JC Haton, JP TI Special issue on robust speech recognition SO SPEECH COMMUNICATION LA English DT Editorial Material NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1998 VL 25 IS 1-3 BP 1 EP 2 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 124TQ UT WOS:000076198500001 ER PT J AU Hermansky, H AF Hermansky, H TI Should recognizers have ears? 
SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Speech Recognition for Unknown Communication Channels CY APR 17-18, 1997 CL PONT A MOUSSON, FRANCE SP European Speech Commun Assoc (ESCA), NATO Res Study Grp on Speech Processing DE auditory modeling; human-like processing; modulation frequency; automatic speech recognition ID SPEECH RECEPTION; RECOGNITION; SYSTEM AB Recently, techniques motivated by human auditory perception have been applied in mainstream speech technology, and there seems to be renewed interest in building more knowledge of human speech communication into the design of a speech recognizer. The paper discusses the author's experience with applying auditory knowledge to automatic recognition of speech. It advances the notion that the reason for applying such knowledge in speech engineering should be the ability of perception to suppress some of the irrelevant information in the speech message, and it argues against the blind implementation of scattered, accidental knowledge which may be irrelevant to a speech recognition task. The following three properties of human speech perception are discussed in some detail: limited spectral resolution, the use of information from roughly syllable-length segments, and the ability to ignore corrupted or irrelevant components of speech. By referring to published works, it shows that selective use of auditory knowledge, optimized on and in some cases derived from real speech data, can be consistent with current stochastic approaches to ASR and could yield advantages in practical engineering applications. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Oregon Grad Inst Sci & Technol, Portland, OR USA. Int Comp Sci Inst, Berkeley, CA 94704 USA. Tech Univ, Brno, Czech Republic. RP Hermansky, H (reprint author), Oregon Grad Inst Sci & Technol, 20000 NW Walker Rd, Beaverton, OR 97006 USA. EM hynek@ee.ogi.edu CR AIKAWA K, 1993, P INT C AC SIGN SPEE, P668 Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 Arai T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607318 ATTIAS H, 1998, ADV NEURAL INFORMATI, V10 AVENDANO C, 1996, P INT C SPOK LANG PR AVENDANO C, 1997, P 1997 WORKSH APPL S BLADON A, 1983, SPEECH COMMUN, V2, P305, DOI 10.1016/0167-6393(83)90047-X Bourlard H., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat.
No.96TH8206), DOI 10.1109/ICSLP.1996.607145 BOURLARD H, 1996, P ARPA ASR WORKSH SP, P157 Bourlard Ha, 1994, CONNECTIONIST SPEECH Bridle J., 1974, 1003 JSRU BROAD D, 1989, J ACOUST SOC AM S1, V86, pS13 Brown P., 1987, THESIS CARNEGIE MELL CHISTOVICH LA, 1985, J ACOUST SOC AM, V77, P789, DOI 10.1121/1.392049 COHEN JR, 1989, J ACOUST SOC AM, V85, P2623, DOI 10.1121/1.397756 COOK GD, 1996, P INT C AC SPEECH SI, P141 Cooper F., 1952, J ACOUST SOC AM, V24, P579 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DECHARMS CR, 1997, 1997 WORKSH ADV NEUR DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 FANT G, 1965, ACOUSTIC DESCRIPTION FANT G, 1962, 4 ROYAL I TECHN SPEE Flanagan J., 1972, SPEECH ANAL SYNTHESI Fletcher H., 1953, SPEECH HEARING COMMU FUJIMURA O, 1964, LANG SPEECH, V10, P181 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 GEMAN S, 1992, NEURAL COMPUT, V4, P1, DOI 10.1162/neco.1992.4.1.1 Green PD, 1995, P INT C AC SPEECH SI, P401 Greenberg S., 1997, P ESCA WORKSH ROB SP, P23 HANSON B, 1984, P INT C AC SPEECH SI HANSON BA, 1996, AUTOMATIC SPEECH SPE Helmholtz H., 1954, SENSATION TONE Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1988, SIGNAL ANAL PREDICTI HERMANSKY H, 1996, P INT C SPOK LANG PR, P462 HERMANSKY H, 1995, P 15 INT C AC TRONDH, V2, P61 Hermansky H., 1989, P INT C AC SPEECH SI, P480 Hermansky H., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing HERMANSKY H, 1995, P INT C AC SPEECH SI, P405 Hermansky H., 1991, P EUROSPEECH, P1367 HERMANSKY H, 1995, 13 INT C PHON SCI ST, P42 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 Hirsch H.-G., 1991, P EUROSPEECH, P413 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 Hunt M., 1989, P ICASSP, P262 HUNT MJ, 1979, J ACOUST SOC AM, V66, pS35, DOI 10.1121/1.2017735 Itahashi S., 1976, P INT C AC SPEECH SI, P310 JANSEEN RDT, 1991, P INT JOINT C NEUR N, P801 JESTEAD W, 1982, J ACOUST SOC AM, P950 Kanedera N., 1997, P EUROSPEECH 97, P1079 KANEDERA N, 1997, UNPUB SPEECH COMMUNI Klatt D. H, 1982, REPRESENTATION SPEEC, P181 KOZHEVNIKOV VA, 1967, SPEECH ARTICULATION, P250 LADEFOGED P, 1967, 3 AREAS EXPT PHONETI, P65 LIM JS, 1979, IEEE T ACOUST SPEECH, V27, P223 Lippmann RP, 1996, IEEE T SPEECH AUDI P, V4, P66, DOI 10.1109/TSA.1996.481454 Makino S., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing MALAYATH N, 1997, P EUROSPEECH 97 Marr D, 1982, VISION Mermelstein P., 1976, PATTERN RECOGN, P374 NEUMEYER L, 1994, P ICASSP 94, P417 PAVEL M, 1980, THESIS NEW YORK U PAVEL M, 1994, J ACOUST SOC AM, V95, P2876, DOI 10.1121/1.409409 POLS LCW, 1971, IEEE T COMPUT, VC 20, P972, DOI 10.1109/T-C.1971.223391 Rosenberg A. 
E., 1994, P INT C SPOK LANG PR, P1835 SENEFF S, 1988, J PHONETICS, V16, P55 STEVENS JC, 1966, PERCEPT PSYCHOPHYS, V1, P319, DOI 10.3758/BF03215796 STEVENS KN, 1996, P EUROSPEECH 95, P3 Tibrewala S., 1997, P EUR 97, P2619 van Vuuren S., 1997, P EUR, P409 WAIBEL A, 1988, P IEEE ICASSP NEW YO, P107 WANG KS, 1995, IEEE T SPEECH AUDI P, V3, P382 Watkins AJ, 1996, J ACOUST SOC AM, V99, P588, DOI 10.1121/1.414515 WOODLAND PC, 1996, P ICASSP ATL GA MAY, P65 ZWICKER E, 1975, HDB SENSORY PHYSL, V3, P401 NR 76 TC 75 Z9 76 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1998 VL 25 IS 1-3 BP 3 EP 27 DI 10.1016/S0167-6393(98)00027-2 PG 25 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 124TQ UT WOS:000076198500002 ER PT J AU Lee, CH AF Lee, CH TI On stochastic feature and model compensation approaches to robust speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Speech Recognition for Unknown Communication Channels CY APR 17-18, 1997 CL PONT A MOUSSON, FRANCE SP European Speech Commun Assoc (ESCA), NATO Res Study Grp on Speech Processing ID HIDDEN MARKOV-MODELS; ISOLATED WORD RECOGNITION; MAXIMUM-LIKELIHOOD-ESTIMATION; SPEAKER ADAPTATION; NOISY SPEECH; TRANSFORMATION; PARAMETERS; ALGORITHM; SYSTEMS; CHAINS AB By now it should not be surprising that high performance speech recognition systems can be designed for a wide variety of tasks in many different languages. This is mainly attributed to the use of powerful statistical pattern matching paradigms coupled with the availability of a large amount of task-specific language and speech training examples. However, it is also well-known that such high performance cannot be maintained when the testing data do not resemble the training data. The speech distortion usually appears as a combination of various acoustic differences, but the exact form of the distortion is often unknown and difficult to model. One way to reduce such acoustic mismatches is to adjust speech features according to some models of the differences. Another method is to modify the parameters of the statistical models, e.g. hidden Markov models, to make the modified models characterize the distorted speech features better. Depending on the knowledge used, this family of feature and model compensation techniques can be roughly categorized into three classes, namely: (1) training-based compensation, (2) blind compensation, and (3) structure-based compensation. This paper provides an overview of the capabilities and limitations of the compensation approaches and illustrates their similarities and differences. The relationship between adaptation and compensation is also discussed. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Lucent Technol, Bell Labs, Dialogue Syst Res Dept, Murray Hill, NJ 07974 USA. RP Lee, CH (reprint author), Lucent Technol, Bell Labs, Dialogue Syst Res Dept, Murray Hill, NJ 07974 USA.
EM chl@research.bell-labs.com CR ABRASH V, 1996, P IEEE INT C AC SPEE, P729 Acero A., 1992, ACOUSTICAL ENV ROBUS ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 BAHL LR, 1983, IEEE T PATTERN ANAL, V5, P179 BAUM LE, 1970, ANN MATH STAT, V41, P164, DOI 10.1214/aoms/1177697196 CHIEN JT, 1996, P ICASSP 9, V6, P45 Chien JT, 1997, IEEE SIGNAL PROC LET, V4, P167 CHOU W, 1995, P EUROSPEECH 95 COX S, 1995, COMPUT SPEECH LANG, V9, P1, DOI 10.1006/csla.1995.0001 COX SJ, 1989, P IEEE INT C AC SPEE, P294 COX SJ, 1990, INT CONF ACOUST SPEE, P161, DOI 10.1109/ICASSP.1990.115563 DAS S, 1993, P IEEE INT C AC SPEE, P71 De Brabandere K, 2007, P IEEE INT C AC SPEE, P1 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P294, DOI 10.1109/89.506933 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P1303, DOI 10.1109/78.139237 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 Erell A, 1993, IEEE T SPEECH AUDI P, V1, P84, DOI 10.1109/89.221370 Flanagan J., 1972, SPEECH ANAL SYNTHESI FURUI S, 1997, P ESCA WORKSH ROB SP FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 FURUI S, 1980, IEEE T ACOUST SPEECH, V28, P129, DOI 10.1109/TASSP.1980.1163393 GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 GAUVAIN JL, 1992, SPEECH COMMUN, V11, P205, DOI 10.1016/0167-6393(92)90015-Y HANSON BA, 1996, AUTOMATIC SPEECH SPE, pCH14 HATTORI H, 1992, P ICSLP 92, P381 HERMANSKY H, 1971, P EUROSPEECH 91 HOLMES J, 1986, P ICASSP 86 HOU Q, 1997, P ICASSP 97, P1549 Huo Q, 1997, IEEE T SPEECH AUDI P, V5, P161 HUO Q, 1996, P ICSLP 96, P981 HUO Q, 1995, IEEE T SPEECH AUDI P, V3, P334 JUANG BH, 1996, AUTOMATIC SPEECH SPE, pCH5 JUANG BH, 1985, AT&T TECH J, V64, P1235 Juang B. 
H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E Junqua JC, 1990, P ICASSP, P841 LEE CH, 1993, SPEECH COMMUN, V12, P383, DOI 10.1016/0167-6393(93)90085-Y LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902 LEE CH, 1996, SPEECH SPEAKER RECOG, pCH4 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 LEVINSON SE, 1985, P IEEE, V73, P1625, DOI 10.1109/PROC.1985.13344 LIU FH, 1992, P ICASSP 92, P257, DOI 10.1109/ICASSP.1992.225923 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P1659, DOI 10.1109/29.46548 MATSUOKA T, 1993, P EUR C SPEECH COMM, P815 Merhav N, 1993, IEEE T SPEECH AUDI P, V1, P90, DOI 10.1109/89.221371 MINAMI Y, 1995, P ICASSP, P129 MOON SY, 1995, P IEEE INT C AC SPEE, P145 MORENO PJ, 1996, P ICASSP, P733 NADAS A, 1988, P ICASSP, P517 NEUMEYER L, 1994, P ICASSP, V1, P417 NEY H, 1996, AUTOMATIC SPEECH SPE, pCH16 NORMANDIN Y, 1996, AUTOMATIC SPEECH SPE, pCH3 OHKURA K, 1992, P ICSLP 92, P369 Ostendorf M, 1996, IEEE T SPEECH AUDI P, V4, P360, DOI 10.1109/89.536930 RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 RABINER LR, 1996, AUTOMATIC SPEECH SPE, pCH1 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 RAJASKARAN PK, 1986, P INT C AC SPEECH SI, P733 Rose RC, 1994, IEEE T SPEECH AUDI P, V2, P245, DOI 10.1109/89.279273 Rozzi W., 1991, P ICASSP 91 TOR MAY, P865, DOI 10.1109/ICASSP.1991.150475 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 SCHWARTZ R, 1996, AUTOMATIC SPEECH SPE, pCH18 SIOHAN O, 1997, IN PRESS IEEE SIGNAL SOONG FK, 1988, IEEE T ACOUST SPEECH, V36, P871, DOI 10.1109/29.1598 STERN R, 1996, AUTOMATIC SPEECH SPE, pCH15 STERN RM, 1987, IEEE T ACOUST SPEECH, V35, P751, DOI 10.1109/TASSP.1987.1165203 SURENDRAN AC, 1996, P ICSLP 96, P1832 TAKAGI K, 1994, P ICSLP 94, P1023 TAKAHASHI JI, 1995, P ICASSP 95, P696 TONOMURA M, 1995, P ICASSP 95, P688 Van Compernolle D., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90027-2 Varga A.P., 1990, P ICASSP, P845 VARGA AP, 1988, P ICASSP, P481 ZAVALIAGKOS G, 1995, P ICASSP 95, P676 Zhao Y., 1993, P INT C ACOUST SPEEC, P562 ZHAO Y, 1995, P IEEE INT C AC SPEE, P712 NR 77 TC 49 Z9 50 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1998 VL 25 IS 1-3 BP 29 EP 47 DI 10.1016/S0167-6393(98)00028-4 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 124TQ UT WOS:000076198500003 ER PT J AU Gales, MJF AF Gales, MJF TI Predictive model-based compensation schemes for robust speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Speech Recognition for Unknown Communication Channels CY APR 17-18, 1997 CL PONT A MOUSSON, FRANCE SP European Speech Commun Assoc (ESCA), NATO Res Study Grp on Speech Processing DE noise robustness; hidden Markov models; predictive model-based compensation; parallel model combination ID HIDDEN MARKOV-MODELS; NOISY SPEECH; WORD RECOGNITION; COMBINATION; ADAPTATION AB For practical applications speech recognition systems need to be insensitive to differences between training and test acoustic conditions. Differences in the acoustic environment may result from various sources, such as ambient background noise, channel variations and speaker stress. These differences can dramatically degrade the performance of a speech recognition system. A wide range of techniques have been proposed for achieving noise robustness. 
This paper considers one particular approach to model-based compensation, predictive model-based compensation, which has been shown to achieve good noise robustness in a wide range of acoustic environments. The characteristic of these schemes is that they combine a speech model with an additive noise model, a channel model and, in the general case, a speaker stress model, to generate a corrupted-speech model. The general theory of these predictive techniques is discussed. Various approximations for rapidly performing the model combination stage have been proposed and are reviewed in this paper. The advantages and the limitations of such a predictive approach to noise robustness are also discussed. In addition, methods for combining predictive schemes with schemes which make use of speech data in the new environment, adaptive schemes, are detailed. This combined approach overcomes some of the limitations of the predictive schemes. (C) 1998 Published by Elsevier science B.V. All rights reserved. C1 Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Gales, MJF (reprint author), IBM Corp, TJ Watson Res Ctr, Yorktown Hts, NY 10598 USA. EM mjfg@watson.ibm.com CR BEATTIE VL, 1992, P ICSLP, P519 Bellegarda JR, 1997, P EUROSPEECH Berstein A., 1991, P ICASSP, P913, DOI 10.1109/ICASSP.1991.150488 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P1303, DOI 10.1109/78.139237 EPHRAIM Y, 1989, IEEE T ACOUST SPEECH, V37, P1846, DOI 10.1109/29.45532 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 Furui Sadaoki, 1992, ADV SPEECH SIGNAL PR GAGNON L, 1992, P ESCA WORKSH SPEECH, P139 GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014 Gales MJF, 1996, COMPUT SPEECH LANG, V10, P249, DOI 10.1006/csla.1996.0013 Gales M.J.F, 1995, THESIS CAMBRIDGE U GALES MJF, 1992, P ICASSP, P233, DOI 10.1109/ICASSP.1992.225929 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 GALES MJF, 1993, SPEECH COMMUN, V12, P231, DOI 10.1016/0167-6393(93)90093-Z Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 GAUVAIN JL, 1996, P IEEE INT C AC SPEE, P73 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Gopinath RA, 1995, P ARPA WORKSH SPOK L, P127 HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P169, DOI 10.1109/89.388143 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 JUANG BH, 1985, IEEE T ACOUST SPEECH, V33, P1404 Junqua JC, 1990, P ICASSP, P841 KLATT DH, 1979, P ICASSP, P573 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MARTIN F, 1993, P EUROSPEECH, P1031 MILNER BP, 1994, IEE P-VIS IMAGE SIGN, P280 MINAMI Y, 1996, P IEEE INT C AC SPEE, P327 MINAMI Y, 1995, P ICASSP, P129 Moreno P.J., 1996, THESIS CARNEGIE MELL MORENO PJ, 1996, P ICASSP, P733 Morgan N., 1992, P ESCA WORKSH SPEECH, P115 NEUMEYER L, 1995, P EUR C SPEECH COMM, P1127 OPENSHAW JP, 1994, P ICASSP, V2, P49 Rose RC, 1994, IEEE T SPEECH AUDI P, V2, P245, DOI 10.1109/89.279273 SANKAR A, 1995, P IEEE INT C AC SPEE, P121 Seymour C. W., 1994, P ICSLP, P1595 Varga A.P., 1990, P ICASSP, P845 VARGA AP, 1988, P ICASSP, P481 WOODLAND PC, 1996, P ICASSP ATL GA MAY, P65 NR 41 TC 22 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
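The model-combination step at the heart of the predictive schemes reviewed in the Gales record above is usually made tractable by the log-normal approximation. Below is a minimal sketch of that approximation for a single pair of diagonal Gaussians in the log-spectral domain; the gain term g and all variable names are illustrative, and the full schemes also handle cepstral-to-log-spectral transforms and dynamic coefficients, which are omitted here.

    import numpy as np

    def lognormal_pmc(mu_s, var_s, mu_n, var_n, g=1.0):
        # Log-normal parameters -> linear-domain mean and variance.
        s_mean = np.exp(mu_s + var_s / 2.0)
        n_mean = np.exp(mu_n + var_n / 2.0)
        s_var = s_mean ** 2 * (np.exp(var_s) - 1.0)
        n_var = n_mean ** 2 * (np.exp(var_n) - 1.0)
        # Speech and noise assumed additive and independent in the linear domain.
        y_mean = g * s_mean + n_mean
        y_var = g ** 2 * s_var + n_var
        # Assume the sum is itself approximately log-normal and map back.
        var_y = np.log(y_var / y_mean ** 2 + 1.0)
        mu_y = np.log(y_mean) - var_y / 2.0
        return mu_y, var_y

A corrupted-speech HMM is obtained by applying this combination to every Gaussian of the clean model, paired with a noise model estimated from background frames.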
PD AUG PY 1998 VL 25 IS 1-3 BP 49 EP 74 DI 10.1016/S0167-6393(98)00029-6 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 124TQ UT WOS:000076198500004 ER PT J AU Omologo, M Svaizer, P Matassoni, M AF Omologo, M Svaizer, P Matassoni, M TI Environmental conditions and acoustic transduction in hands-free speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Speech Recognition for Unknown Communication Channels CY APR 17-18, 1997 CL PONT A MOUSSON, FRANCE SP European Speech Commun Assoc (ESCA), NATO Res Study Grp on Speech Processing DE hands-free speech recognition; robustness; environmental noise; microphone arrays; acoustics; MAP adaptation ID MICROPHONE ARRAY; PROCESSING TECHNIQUE; ROOM ACOUSTICS; ENHANCEMENT; ACQUISITION; ALGORITHM; NOISE AB Hands-free interaction represents a key point for increasing the flexibility of present applications and for the development of new speech recognition applications, where the user cannot be encumbered by either hand-held or head-mounted microphones. When the microphone is far from the speaker, the transduced signal is affected by degradations of different natures, which are often unpredictable. Special microphones and multi-microphone acquisition systems represent a way of reducing some environmental noise effects. Robust processing and adaptation techniques can further be used in order to compensate for different kinds of variability that may be present in the recognizer input. The purpose of this paper is to re-visit some of the assumptions about the different sources of this variability and to discuss both special transducer systems and compensation/adaptation techniques that can be adopted. In particular, the paper will refer to the use of multi-microphone systems to overcome some undesired effects caused by room acoustics (e.g. reverberation) and by coherent/incoherent noise (e.g. competing talkers, computer fans). The paper concludes with the description of some experiments that were conducted on both real and simulated speech data. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Ist Ric Sci & Tecnol, I-38050 Trento, Italy. RP Omologo, M (reprint author), Ist Ric Sci & Tecnol, I-38050 Trento, Italy. EM omologo@itc.it CR Acero A, 1993, ACOUSTICAL ENV ROBUS ADCOCK J, 1996, P IEEE ICASSP, P897 AFFES S, 1996, P IEEE ICASSP, P909 Affes S, 1997, IEEE T SPEECH AUDI P, V5, P425, DOI 10.1109/89.622565 ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599 ALLEN JB, 1977, J ACOUST SOC AM, V62, P912, DOI 10.1121/1.381621 ANGELINI B, 1994, P ICSLP YOK, P1391 Avendano C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat.
No.96TH8206), DOI 10.1109/ICSLP.1996.607744 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 BRANDSTEIN M, 1995, THESIS BROWN U RI CHE C, 1994, P HUM LANG TECHN WOR, P342, DOI 10.3115/1075812.1075891 CHOU T, 1995, P IEEE INT C AC SPEE, V5, P2995 CHU P, 1995, P IEEE ICASSP, P2999 Elko GW, 1996, SPEECH COMMUN, V20, P229, DOI 10.1016/S0167-6393(96)00057-X FISCHER S, 1995, P 4 INT WORKSH AC EC, P44 Fischer S, 1996, SPEECH COMMUN, V20, P215, DOI 10.1016/S0167-6393(96)00054-4 FLANAGAN JL, 1987, J ACOUST SOC AM, V82, pS39, DOI 10.1121/1.2024789 FLANAGAN JL, 1991, ACUSTICA, V73, P58 FLANAGAN JL, 1993, SPEECH COMMUN, V13, P207, DOI 10.1016/0167-6393(93)90072-S FLANAGAN JL, 1985, J ACOUST SOC AM, V78, P1508, DOI 10.1121/1.392786 FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817 Furui S., 1997, P ESCA NATO TUT RES, P11 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Giuliani D., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607858 Giuliani D, 1998, INT CONF ACOUST SPEE, P473, DOI 10.1109/ICASSP.1998.674470 Goodwin M. M., 1993, P IEEE INT C AC SPEE, P169 GREENBERG JE, 1992, J ACOUST SOC AM, V91, P1662, DOI 10.1121/1.402446 GRENIER Y, 1993, SPEECH COMMUN, V12, P25, DOI 10.1016/0167-6393(93)90016-E GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 HOFFMAN MW, 1995, IEEE T SPEECH AUDI P, V3, P193, DOI 10.1109/89.388145 Johnson D, 1993, ARRAY SIGNAL PROCESS Junqua J.C., 1996, ROBUSTNESS AUTOMATIC Kinsler LE, 1982, FUNDAMENTALS ACOUSTI Lee C.-H., 1996, AUTOMATIC SPEECH SPE LEE CH, 1998, NATO WORKSH ROB SPEE, V25, P29 Leggetter C.J., 1994, P INT C SPOK LANG PR, P451 LIM JS, 1983, SPEECH ENHANCEMENT Liu QG, 1996, SPEECH COMMUN, V18, P317, DOI 10.1016/0167-6393(96)00011-8 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z MIYOSHI M, 1988, IEEE T ACOUST SPEECH, V36, P145, DOI 10.1109/29.1509 Monzingo R. A., 1980, INTRO ADAPTIVE ARRAY NAKAMURA S, 1996, P IEEE INT C AC SPEE, P69 OH S, 1992, P ICASSP 92, P281, DOI 10.1109/ICASSP.1992.225916 Omologo M, 1997, IEEE T SPEECH AUDI P, V5, P288, DOI 10.1109/89.568735 OMOLOGO M, 1996, P ICASSP96, P921 Omologo M., 1994, P ICASSP94, P273 Rabiner L, 1993, FUNDAMENTALS SPEECH Scalart P, 1996, SPEECH COMMUN, V20, P203, DOI 10.1016/S0167-6393(96)00056-8 SESSLER GM, 1969, J ACOUST SOC AM, V46, P28, DOI 10.1121/1.1911657 SESSLER GM, 1989, J ACOUST SOC AM, V86, P2063, DOI 10.1121/1.398464 SILVERMAN HF, 1987, IEEE T ACOUST SPEECH, V35, P1699, DOI 10.1109/TASSP.1987.1165098 SMOLDERS J, 1994, P IEEE ICASSP, P429 STURIM D, 1997, P IEEE ICASSP, P371 SULLIVAN TM, 1993, P IEEE ICASSP, P91 TOHYAMA M, 1993, P IEEE ICASSP, P157 Van Compernolle D., 1990, P IEEE INT C AC SPEE, V2, P833 Van Veen B. D., 1988, IEEE ASSP MAGAZI APR, P4 WANG H, 1991, P IEEE ICASSP, P853 WARD DB, 1995, J ACOUST SOC AM, V97, P1023, DOI 10.1121/1.412215 Widrow B, 1985, ADAPTIVE SIGNAL PROC Xie F, 1996, SPEECH COMMUN, V19, P89, DOI 10.1016/0167-6393(96)00022-2 Zelinski R., 1988, P INT C AC SPEECH SI, P2578 NR 62 TC 41 Z9 43 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
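Two building blocks behind the multi-microphone systems discussed in the Omologo et al. record above are time-delay estimation and delay-and-sum beamforming. The sketch below shows both in their most basic form; function names and the PHAT weighting choice are ours, and the sign convention for delays depends on the array geometry.

    import numpy as np

    def tdoa_gcc_phat(x, y):
        # PHAT-weighted generalized cross-correlation delay estimate.
        X, Y = np.fft.rfft(x), np.fft.rfft(y)
        cross = X * np.conj(Y)
        cross /= np.abs(cross) + 1e-12
        cc = np.fft.irfft(cross, n=len(x))
        lag = int(np.argmax(np.abs(cc)))
        return lag if lag < len(x) // 2 else lag - len(x)

    def delay_and_sum(signals, delays):
        # Advance each channel by its estimated delay, then average,
        # reinforcing the source direction against incoherent noise.
        n = min(len(s) for s in signals)
        out = np.zeros(n)
        for sig, d in zip(signals, delays):
            out += np.roll(sig[:n], -d)
        return out / len(signals)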
PD AUG PY 1998 VL 25 IS 1-3 BP 75 EP 95 DI 10.1016/S0167-6393(98)00030-2 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 124TQ UT WOS:000076198500005 ER PT J AU Pellom, BL Hansen, JHL AF Pellom, BL Hansen, JHL TI Automatic segmentation of speech recorded in unknown noisy channel characteristics SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Speech Recognition for Unknown Communication Channels CY APR 17-18, 1997 CL PONT A MOUSSON, FRANCE SP European Speech Commun Assoc (ESCA), NATO Res Study Grp on Speech Processing DE speech segmentation; noise robustness; speech enhancement; speech corpus development ID HIDDEN MARKOV-MODELS; RECOGNITION; ENHANCEMENT; COMBINATION AB This paper investigates the problem of automatic segmentation of speech recorded in noisy channel corrupted environments. Using an HMM-based speech segmentation algorithm, speech enhancement and parameter compensation techniques previously proposed for robust speech recognition are evaluated and compared for improved segmentation in colored noise. Speech enhancement algorithms considered include: Generalized Spectral Subtraction, Nonlinear Spectral Subtraction, Ephraim-Malah MMSE enhancement, and Auto-LSP Constrained Iterative Wiener filtering. In addition, the Parallel Model Combination (PMC) technique is also compared for additive noise compensation. In telephone environments, we compare channel normalization techniques including Cepstral Mean Normalization (CMN) and Signal Bias Removal (SBR) and consider the coupling of channel compensation with front-end speech enhancement for improved automatic segmentation. Compensation performance is assessed for each method by automatically segmenting TIMIT degraded by additive colored noise (i.e., aircraft cockpit, automobile highway, etc.), telephone transmitted NTIMIT, and cellular telephone transmitted CTIMIT databases. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Duke Univ, Dept Elect Engn, Robust Speech Proc Lab, Durham, NC 27708 USA. RP Hansen, JHL (reprint author), Duke Univ, Dept Elect Engn, Robust Speech Proc Lab, Box 90291, Durham, NC 27708 USA. EM jhlh@ee.duke.edu CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 Berouti M., 1979, P IEEE INT C AC SPEE, P208 Bonafonte A., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607841 BROWN K, 1995, P IEEE INT C AC SPEE, P105 BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Deller J. R., 1993, DISCRETE TIME PROCES EISEN B, 1992, P ICSLP, P871 EISEN B, 1991, P EUROSPEECH, P673 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 GALES MJF, 1993, SPEECH COMMUN, V12, P231, DOI 10.1016/0167-6393(93)90093-Z Giachin E. P., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90022-I HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hunt A. 
J., 1996, P ICASSP 96, P373 Itakura F., 1975, J ACOUST SOC AM, V57, P535 Jankowski C., 1990, P IEEE INT C AC SPEE, P109 LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 LEUNG H, 1984, P IEEE INT C AC SPEE LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 Ljolje A., 1991, P INT C AC SPEECH SI, P473, DOI 10.1109/ICASSP.1991.150379 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z MARZAL A, 1990, P EUSIPCO BARC, P43 Neumeyer LG, 1994, IEEE T SPEECH AUDI P, V2, P590, DOI 10.1109/89.326617 *NIST, 1989, GETT START DARPA TIM Petek B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607750 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Reynolds DA, 1997, INT CONF ACOUST SPEE, P1535, DOI 10.1109/ICASSP.1997.596243 Svendsen T., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) TORKKOLA K, 1988, P IEEE INT C AC SPEE, P611 VANHEMERT JP, 1991, IEEE T SIGNAL PROCES, V39, P1008, DOI 10.1109/78.80941 VARGA AP, 1990, INT CONF ACOUST SPEE, P845, DOI 10.1109/ICASSP.1990.115970 Vorstermans A, 1996, SPEECH COMMUN, V19, P271, DOI 10.1016/S0167-6393(96)00037-4 VORSTERMANS A, 1995, P EUROSPEECH, P1397 Wesenick M.-B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607054 Wightman C., 1997, PROGR SPEECH SYNTHES, P313 WIGHTMAN C, 1994, ALIGNER SYSTEM AUTOM NR 41 TC 26 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1998 VL 25 IS 1-3 BP 97 EP 116 DI 10.1016/S0167-6393(98)00031-4 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 124TQ UT WOS:000076198500006 ER PT J AU Kingsbury, BED Morgan, N Greenberg, S AF Kingsbury, BED Morgan, N Greenberg, S TI Robust speech recognition using the modulation spectrogram SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Speech Recognition for Unknown Communication Channels CY APR 17-18, 1997 CL PONT A MOUSSON, FRANCE SP European Speech Commun Assoc (ESCA), NATO Res Study Grp on Speech Processing DE robust speech recognition; reverberation ID ROOM ACOUSTICS; INTELLIGIBILITY; REPRESENTATION; VOWELS AB The performance of present-day automatic speech recognition (ASR) systems is seriously compromised by levels of acoustic interference (such as additive noise and room reverberation) representative of real-world speaking conditions. Studies on the perception of speech by human listeners suggest that recognizer robustness might be improved by focusing on temporal structure in the speech signal that appears as low-frequency (below 16 Hz) amplitude modulations in subband channels following critical-band frequency analysis. A speech representation that emphasizes this temporal structure, the "modulation spectrogram", has been developed. Visual displays of speech produced with the modulation spectrogram are relatively stable in the presence of high levels of background noise and reverberation. Using the modulation spectrogram as a front end for ASR provides a significant improvement in performance on highly reverberant speech. 
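A rough illustration of the front end just described in this Kingsbury et al. record: subband energy envelopes are computed and band-limited to the slow (below 16 Hz) amplitude modulations. The sketch below substitutes a linear pooling of FFT bins for true critical-band analysis, and all parameter values are illustrative only.

    import numpy as np
    from scipy.signal import spectrogram, butter, lfilter

    def modulation_spectrogram(x, fs, n_bands=15):
        nperseg, noverlap = int(0.025 * fs), int(0.015 * fs)
        f, t, S = spectrogram(x, fs, nperseg=nperseg, noverlap=noverlap)
        # Pool FFT bins into coarse channels (a stand-in for critical bands).
        edges = np.linspace(0, len(f), n_bands + 1, dtype=int)
        env = np.array([S[lo:hi].sum(axis=0)
                        for lo, hi in zip(edges[:-1], edges[1:])])
        # Keep only the slow amplitude modulations (below 16 Hz) per channel.
        env_fs = fs / (nperseg - noverlap)       # envelope rate, about 100 Hz
        b, a = butter(2, 16.0 / (env_fs / 2.0), btype="low")
        return lfilter(b, a, np.log(env + 1e-10), axis=1)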
When the modulation spectrogram is used in combination with log-RASTA-PLP (log RelAtive SpecTrAl Perceptual Linear Predictive analysis) performance over a range of noisy and reverberant conditions is significantly improved, suggesting that the use of multiple representations is another promising method for improving the robustness of ASR systems. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Int Comp Sci Inst, Berkeley, CA 94704 USA. Univ Calif Berkeley, Berkeley, CA 94720 USA. RP Kingsbury, BED (reprint author), Int Comp Sci Inst, Suite 600,1947 Ctr St, Berkeley, CA 94704 USA. EM bedk@icsi.berkeley.edu CR ARAI T, 1998, P 1998 IEEE INT C AC BOURLARD H, 1994, CONNECTIONIST SPEECH, P155 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 Dudley H, 1939, J ACOUST SOC AM, V11, P169, DOI 10.1121/1.1916020 FURUI S, 1986, P 1986 IEEE IECEJ AS, P1991 GREENBERG S, 1996, P 4 INT C SPOK LANG, pS24 Greenberg S, 1997, P IEEE INT C AC SPEE, P1647 Greenberg S., 1997, P ESCA WORKSH ROB SP, P23 GREENBERG S, 1998, P JOINT M AC SOC AM GREENWOOD D, 1961, J ACOUST SOC AM, V33, P1344, DOI 10.1121/1.1908437 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1985, SPEECH COMMUN, V4, P181, DOI 10.1016/0167-6393(85)90045-7 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HIRSCH HG, 1988, SIGNAL PROCESS, V4, P1177 HOUTGAST T, 1980, ACUSTICA, V46, P60 HOUTGAST T, 1973, ACUSTICA, V28, P66 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 HUGGINS AWF, 1975, PERCEPT PSYCHOPHYS, V18, P49 KINGSBURY B, 1997, P ESCA WORKSH ROB SP, P87 KINGSBURY BE, 1997, P ICASSP MUN GERM, P1259 KOLLMEIER B, 1994, J ACOUST SOC AM, V95, P1593, DOI 10.1121/1.408546 Langhans T., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing LEA AP, 1994, PERCEPT PSYCHOPHYS, V56, P379, DOI 10.3758/BF03206730 LEONARD RG, 1984, P 1984 IEEE INT C AC MILNER BP, 1995, EUROSPEECH 95 P 4 EU, P519 MORGAN N, 1992, ESCA WORKSH SPEECH P, P115 RAMSAY G, 1995, EUROSPEECH 95 P 4 EU, P1401 SCHREINER CE, 1988, HEARING RES, V32, P49, DOI 10.1016/0378-5955(88)90146-3 STEENEKEN HJM, 1980, J ACOUST SOC AM, V67, P318, DOI 10.1121/1.384464 SUMMERFIELD Q, 1984, PERCEPT PSYCHOPHYS, V35, P203, DOI 10.3758/BF03205933 WU SL, 1998, P 1998 IEEE INT C AC WU SL, 1998, THESIS U CALIFORNIA NR 32 TC 82 Z9 84 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1998 VL 25 IS 1-3 BP 117 EP 132 DI 10.1016/S0167-6393(98)00032-6 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 124TQ UT WOS:000076198500007 ER PT J AU Viikki, O Laurila, K AF Viikki, O Laurila, K TI Cepstral domain segmental feature vector normalization for noise robust speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Speech Recognition for Unknown Communication Channels CY APR 17-18, 1997 CL PONT A MOUSSON, FRANCE SP European Speech Commun Assoc (ESCA), NATO Res Study Grp on Speech Processing DE speech recognition; noise robustness; feature vector normalization AB To date, speech recognition systems have been applied in real world applications in which they must be able to provide a satisfactory recognition performance under various noise conditions. 
However, a mismatch between the training and testing conditions often causes a drastic decrease in the performance of the systems. In this paper, we propose a segmental feature vector normalization technique which makes an automatic speech recognition system more robust to environmental changes by normalizing the output of the signal-processing front-end to have similar segmental parameter statistics in all noise conditions. The viability of the suggested technique was verified in various experiments using different background noises and microphones. In an isolated word recognition task, the proposed normalization technique reduced the error rates by over 70% in noisy conditions with respect to the baseline tests, and in a microphone mismatch case, over 75% error rate reduction was achieved. In a multi-environment speaker-independent connected digit recognition task, the proposed method reduced the error rates by over 16%. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Nokia Res Ctr, Speech & Audio Syst Lab, FIN-33721 Tampere, Finland. RP Viikki, O (reprint author), Nokia Res Ctr, Speech & Audio Syst Lab, POB 100, FIN-33721 Tampere, Finland. EM olli.viikki@research.nokia.fi CR Acero A., 1992, P SPEECH PROCESSING, P89 BOURLARD H, 1994, P INT C AC SPEECH SI, V1, P373 COOK GD, 1996, P INT C AC SPEECH SI, V1, P141 Freeman D.K., 1989, P IEEE INT C AC SPEE, V1, P369 GALES MJF, 1993, SPEECH COMMUN, V12, P231, DOI 10.1016/0167-6393(93)90093-Z HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 LAURILA K, 1997, P INT C AC SPEECH SI, V2, P871 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z OPENSHAW JP, 1994, P ICASSP, V2, P49 PALIWAL KK, 1990, P INT C AC SPEECH SI, V2, P801 RIIS SK, 1996, P IEEE NORD SIGN PRO, P431 Rosenberg A.E., 1994, P INT C SPOK LANG PR, V4, P1835 TIBREWALA S, 1997, P EUROSPEECH, V5, P2619 VANCOMPERNOLLE D, 1996, P INT C AC SPEECH SI, V1, P331 YANG L, 1995, AUDIT NEUROSCI, V1, P1 YANG R, 1996, P INT C AC SPEECH SI, V1, P49 NR 16 TC 92 Z9 99 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1998 VL 25 IS 1-3 BP 133 EP 147 DI 10.1016/S0167-6393(98)00033-8 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 124TQ UT WOS:000076198500008 ER PT J AU de Veth, J Boves, L AF de Veth, J Boves, L TI Channel normalization techniques for automatic speech recognition over the telephone SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Speech Recognition for Unknown Communication Channels CY APR 17-18, 1997 CL PONT A MOUSSON, FRANCE SP European Speech Commun Assoc (ESCA), NATO Res Study Grp on Speech Processing DE automatic speech recognition; telephone speech; channel normalisation; phase-corrected RASTA ID SPEAKER VERIFICATION AB In this paper we aim to identify the underlying causes that can explain the performance of different channel normalization techniques. To this end, we compared four different channel normalization techniques within the context of connected digit recognition over telephone lines: cepstrum mean subtraction, the dynamic cepstrum representation, RASTA filtering and phase-corrected RASTA. We used context-dependent and context-independent hidden Markov models that were trained using a wide range of different model complexities.
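The channel-normalization candidates compared in the de Veth & Boves record just quoted, and the segmental normalization idea from the Viikki & Laurila record above it, can all be written as short operations on a cepstral feature matrix C of shape (frames, coefficients). In the sketch below, the RASTA coefficients are the commonly quoted causal form of the filter; the phase-corrected variant studied in the paper alters the filter's phase response and is not reproduced here, and the window length and epsilon are illustrative.

    import numpy as np
    from scipy.signal import lfilter

    def cepstral_mean_subtraction(C):
        # A stationary channel appears as an additive cepstral offset.
        return C - C.mean(axis=0, keepdims=True)

    def rasta_filter(C):
        # Causal form of the classic RASTA band-pass filter, run along
        # time over every cepstral coefficient track.
        num = np.array([0.2, 0.1, 0.0, -0.1, -0.2])
        den = np.array([1.0, -0.98])
        return lfilter(num, den, C, axis=0)

    def segmental_mvn(C, win=101, eps=1e-8):
        # Sliding-window mean/variance normalization of each coefficient,
        # in the spirit of segmental feature vector normalization.
        out = np.empty_like(C)
        for t in range(len(C)):
            seg = C[max(0, t - win // 2): t + win // 2 + 1]
            out[t] = (C[t] - seg.mean(axis=0)) / (seg.std(axis=0) + eps)
        return out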
The results of our recognition experiments indicate that each channel normalization technique should preserve the modulation frequencies in the range between 2 and 16 Hz in the spectrum of the speech signals. At the same time, DC components in the modulation spectrum should be effectively removed. With context-independent models the channel normalization filter should have a flat phase response. Finally, for our connected digit recognition task it appeared that cepstrum mean subtraction and phase-corrected RASTA performed equally well for context-dependent and context-independent models when equal amounts of model parameters were used. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Nijmegen, Dept Language & Speech, A2RT, NL-6500 HD Nijmegen, Netherlands. RP Boves, L (reprint author), Univ Nijmegen, Dept Language & Speech, A2RT, POB 9103, NL-6500 HD Nijmegen, Netherlands. EM boves@let.kun.nl CR AIKAWA K, 1993, P INT C AC SIGN SPEE, P668 ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155 BODA P, 1996, P ESCA WORKSH AUD BA, P317 DENOS EA, 1995, P EUR 95, P825 DEVETH J, 1997, P ESCA NATO WORKSH R, P119 DEVETH J, 1997, P INT C AC SIGN SPEE, P1239 de Veth J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607275 DEVETH J, 1995, SPEECH COMMUN, V17, P81, DOI 10.1016/0167-6393(95)00015-G DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 GISH H, 1985, P ICASSP, P379 GISH H, 1986, P INT C AC SIGN SPEE, P865 HaebUmbach R, 1995, PHILIPS J RES, V49, P381, DOI 10.1016/0165-5817(96)81587-7 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1995, P INT C PHON SC Hermansky H., 1991, P EUROSPEECH, P1367 HERMANSKY H, 1996, P ESCA WORKSH AUD BA Hirsch H.-G., 1991, P EUROSPEECH, P413 HUNT M, 1978, ACOUSTICS LETT, V32, P6 JUNQUA JC, 1995, P EUROSPEECH, P1385 LITTLE J, 1993, MATLAB SIGNAL PROCES MILNER B, 1995, P EUROSPEECH, P519 NADEU C, 1995, P EUR, P923 Schetzen M., 1980, VOLTERRA WIENER THEO, Vfirst SINGER H, 1995, P EUROSPEECH, P487 SOONG FK, 1986, P ICASSP TOK JAP, P877 Steinbiss C, 1995, PHILIPS J RES, V49, P317 WANG HC, 1993, P INT C AC SIGN SPEE, P407 YOUNG S, 1992, HTK V1 4 USER MANUAL NR 29 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1998 VL 25 IS 1-3 BP 149 EP 164 DI 10.1016/S0167-6393(98)00034-X PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 124TQ UT WOS:000076198500009 ER PT J AU Shields, PW Campbell, DR AF Shields, PW Campbell, DR TI Intelligibility improvements obtained by an enhancement method applied to speech corrupted by noise and reverberation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Speech Recognition for Unknown Communication Channels CY APR 17-18, 1997 CL PONT A MOUSSON, FRANCE SP European Speech Commun Assoc (ESCA), NATO Res Study Grp on Speech Processing ID ARRAY AB A series of psychoacoustic experiments are described, which attempt to assess the capability of a Multi-Microphone Sub-band Adaptive Signal (MMSBA) processing scheme, for improving the intelligibility of speech corrupted with noise and reverberation. 
The processing scheme applies the Least Mean Squares (LMS) adaptive algorithm in frequency delimited sub-bands to process speech signals from simulated and real room acoustic environments with various realistic signal to noise ratios (SNR). The processing scheme aims to take advantage of binaural input channels to perform noise cancellation. The two wide-band signals are split into linear or cochlear distributed sub-bands, then processed according to their sub-band signal characteristics. The results of a series of intelligibility tests are presented in which speech and noise data, generated in simulated and real room conditions, was presented to human volunteer subjects at various SNRs, sub-band distributions and sub-band spacings. The results from both simulated and real room acoustical environments show that the MMSBA processing scheme significantly improves both SNR and intelligibility. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Paisley, Dept Elect Engn & Phys, Paisley PA1 2BE, Renfrew, Scotland. RP Shields, PW (reprint author), Univ Paisley, Dept Elect Engn & Phys, High St, Paisley PA1 2BE, Renfrew, Scotland. EM paul@diana22.paisley.ac.uk CR Agaiby H., 1997, ESCA EUROSPEECH 97, P1119 ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599 BAER T, 1993, J REHABIL RES DEV, V30, P49 CULLING JF, 1994, SPEECH COMMUN, V14, P71, DOI 10.1016/0167-6393(94)90058-2 DARLINGTON DJ, 1996, IEEE DIG SIGN PROC W DEBRUNNER VE, 1995, J ACOUST SOC AM, V98, P437, DOI 10.1121/1.414360 DESHMUUKH DN, 1996, P 4 INT C SPEECH LAN, P2486 Elberling C, 1993, Scand Audiol Suppl, V38, P39 FERRARA ER, 1981, IEEE T ACOUST SPEECH, V29, P756 FOSTER J R, 1987, British Journal of Audiology, V21, P165, DOI 10.3109/03005368709076402 GREENWOOD DD, 1990, J ACOUST SOC AM, V87, P2592, DOI 10.1121/1.399052 KOLLMEIER JP, 1993, SCAND AUDIOL S, V38, P28 Ludvigsen C, 1993, Scand Audiol Suppl, V38, P50 MACKENZIE GW, 1964, ACOUSTICS MONCUR JP, 1967, J SPEECH HEAR RES, V10, P186 Plomp R., 1976, Acustica, V34 SHIELDS PW, 1997, INT C AC SPEECH SIGN, V1, P415 SHIELDS PW, 1997, EUR SPEECH COMM ASS, P91 SOEDE W, 1993, J ACOUST SOC AM, V94, P785, DOI 10.1121/1.408180 Stearns SD, 1988, SIGNAL PROCESSING AL TONER E, 1993, SPEECH COMMUN, V12, P253, DOI 10.1016/0167-6393(93)90096-4 NR 21 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1998 VL 25 IS 1-3 BP 165 EP 175 DI 10.1016/S0167-6393(98)00035-1 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 124TQ UT WOS:000076198500010 ER PT J AU Hussain, A Campbell, DR AF Hussain, A Campbell, DR TI Binaural sub-band adaptive speech enhancement using artificial neural networks SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Workshop on Robust Speech Recognition for Unknown Communication Channels CY APR 17-18, 1997 CL PONT A MOUSSON, FRANCE SP European Speech Commun Assoc (ESCA), NATO Res Study Grp on Speech Processing DE adaptive speech enhancement; sub-band processing; artificial neural networks ID FILTERS AB In this paper, a general class of "single-hidden layered, feedforward" Artificial Neural Network (ANN) based adaptive non-linear filters is proposed for processing band-limited signals in a multi-microphone sub-band adaptive speech-enhancement scheme. 
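The core operation inside each sub-band of the MMSBA scheme described in the Shields & Campbell record above, and the conventional baseline that the ANN-based scheme in this Hussain & Campbell record generalizes, is an LMS-family adaptive noise canceller. A minimal normalized-LMS sketch follows; the filter order, step size and names are illustrative, and in MMSBA one such canceller runs per sub-band.

    import numpy as np

    def nlms_noise_canceller(primary, reference, order=32, mu=0.5, eps=1e-8):
        # Predict the noise in the primary channel from the reference
        # channel and subtract it; the error signal is the enhanced output.
        w = np.zeros(order)
        out = np.zeros(len(primary))
        for n in range(order, len(primary)):
            x = reference[n - order:n][::-1]      # most recent sample first
            e = primary[n] - w @ x
            w += mu * e * x / (x @ x + eps)       # normalized LMS update
            out[n] = e
        return out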
Initial comparative results achieved in simulation experiments using both simulated and real automobile reverberant data demonstrate that the proposed speech-enhancement system employing ANN-based sub-band processing is capable of outperforming conventional noise cancellation schemes. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Paisley, Dept Elect Engn & Phys, Paisley PA1 2BE, Renfrew, Scotland. RP Hussain, A (reprint author), Univ Paisley, Dept Elect Engn & Phys, High St, Paisley PA1 2BE, Renfrew, Scotland. EM huss-ee0@paisley.ac.uk CR Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 BOUQUIN RL, 1994, P EUSIPCO 94, P1206 BOURLARD H, 1997, P IEEE INT C AC SPEE, V2, P1251 CAMPBELL DR, 1996, SIGNAL PROCESS, V8, P467 CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427 DARWIN CJ, 1989, SPEECH COMMUN, V8, P221, DOI 10.1016/0167-6393(89)90003-4 Ghitza O, 1994, IEEE T SPEECH AUDI P, V2, P115, DOI 10.1109/89.260357 GOULDING MM, 1990, IEEE T VEH TECHNOL, V39, P316, DOI 10.1109/25.61353 GREENWOOD DD, 1990, J ACOUST SOC AM, V87, P2592, DOI 10.1121/1.399052 HAYKIN S, 1996, IEEE SIGNAL PROCESS, V13, P25 HERMANSKY H, 1997, P IEEE ICASSP 97, V2, P1255 HOOGENDOORN A, 1994, P IEEE, V82, P1479, DOI 10.1109/5.326405 HUSH DR, 1993, IEEE SIGNAL PROCESS, V10, P9 HUSSAIN A, 1997, P IEEE ICASSP 97, P3341 HUSSAIN A, 1997, INT J COMPUT INTELLI, V5, P16 Hussain A, 1997, IEEE T COMMUN, V45, P1358, DOI 10.1109/26.649741 HUSSAIN A, 1997, P ESCA NATO WORKSH R, P123 HUSSAIN A, 1996, THESIS U STRATHCLYDE KNECHT WG, 1995, IEEE T SPEECH AUDI P, V3, P433, DOI 10.1109/89.482210 Lippmann RP, 1997, SPEECH COMMUN, V22, P1, DOI 10.1016/S0167-6393(97)00021-6 LYNCH MR, 1991, P 2 IEE INT C ART NE, P44 MAHALANOBIS A, 1993, IEEE T CIRCUITS-II, V40, P375, DOI 10.1109/82.277882 Moore B.C.J., 1995, PERCEPTUAL CONSEQUEN RAYNER PJW, 1989, P IEEE INT C AC SPEE, P1191 SOMAYAJULU V, 1989, P ICASSP MAY, P928 TONER E, 1993, SPEECH COMMUN, V12, P253, DOI 10.1016/0167-6393(93)90096-4 WALLACE RB, 1992, IEEE T SIGNAL PROCES, V40, P700, DOI 10.1109/78.120817 Widrow B, 1985, ADAPTIVE SIGNAL PROC NR 28 TC 3 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1998 VL 25 IS 1-3 BP 177 EP 186 DI 10.1016/S0167-6393(98)00036-3 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 124TQ UT WOS:000076198500011 ER PT J AU Moreno, PJ Raj, B Stern, RM AF Moreno, PJ Raj, B Stern, RM TI Data-driven environmental compensation for speech recognition: A unified approach SO SPEECH COMMUNICATION LA English DT Article ID MAXIMUM-LIKELIHOOD; MODEL AB Environmental robustness for automatic speech recognition systems based on parameter modification can be accomplished in two complementary ways. One approach is to modify the incoming features of environmentally-degraded speech to more closely resemble the features of the (normally undegraded) speech used to train the classifier. The other approach is to modify the internal statistical representations of speech features used by the classifier to more closely resemble the features representing degraded speech in a particular target environment.
This paper attempts to unify these two approaches to robust speech recognition by presenting several techniques that share the same basic assumptions and internal structure while differing in whether they modify the features of incoming speech or whether they modify the statistics of the classifier itself. We present the multivaRiate gAussian-based cepsTral normaliZation (RATZ) family of algorithms which modify incoming cepstral features, along with the STAR (STAtistical Reestimation) family of algorithms, which modify the internal statistics of the classifier. Both types of algorithms are data driven, in that they make use of a certain amount of adaptation data for learning compensation parameters. The algorithms were evaluated using the SPHINX-II speech recognition system on subsets of the Wall Street Journal database. While all algorithms demonstrated improved recognition accuracy compared to previous algorithms, the STAR family of algorithms tended to provide lower error rates than the RATZ family of algorithms as the SNR was decreased. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Carnegie Mellon Univ, Dept Elect & Comp Engn, Pittsburgh, PA 15213 USA. Carnegie Mellon Univ, Sch Comp Sci, Pittsburgh, PA 15213 USA. RP Moreno, PJ (reprint author), Digital Equipment Corp, Cambridge Res Lab, Speech Interact Technol Grp, 1 Kendall Sq,Bldg 700, Cambridge, MA 02139 USA. EM pjm@crl.dec.com CR Acero A., 1990, P ICASSP, P849 ACERO A, 1991, ACOUSTICAL ENV ROBUS ANASTASAKOS A, 1994, P IEEE INT C AC SPEE, V1, P433 Baum L. E., 1972, INEQUALITIES, V3, P1 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 FLANAGAN JL, 1985, J ACOUST SOC AM, V78, P1508, DOI 10.1121/1.392786 Gales M. J. F., 1995, THESIS CAMBRIDGE U C GALES MJF, 1993, SPEECH COMMUN, V12, P231, DOI 10.1016/0167-6393(93)90093-Z Ghitza O., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80018-3 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HUANG X, 1991, P IEEE INT C AC SPEE, V1, P235 HUANG XD, 1993, HIDDEN MARKOV MODELS Juang B. H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E Junqua J.C., 1996, ROBUSTNESS AUTOMATIC LEE KF, 1990, IEEE T ACOUST SPEECH, V38, P35, DOI 10.1109/29.45616 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 LIU SQ, 1994, MAT SCI ENG C-BIOMIM, V2, P61 MOKBEL CE, 1995, IEEE T SPEECH AUDI P, V3, P346, DOI 10.1109/89.466660 Moreno P.J., 1996, THESIS CARNEGIE MELL MORENO PJ, 1995, P EUROSPEECH, V1, P481 MORENO PJ, 1995, P ICASSP, V1, P137 Moreno PJ, 1996, P IEEE INT C AC SPEE, V2, P733 NEUMEYER L, 1994, P ICASSP, V1, P417 PAUL DB, 1992, P WORKSH SPEECH NAT, P357, DOI 10.3115/1075527.1075614 SANKAR A, 1995, P ICASSP, V1, P121 SENEFF S, 1988, J PHONETICS, V16, P55 STERN RM, 1996, AUTOMATIC SPEECH SPE, P351 STERN RM, 1994, P INT C SPOK LANG PR, V3, P1027 SULLIVAN TM, 1993, P ICASSP 93 MINN, V2, P91 Varga A.P., 1990, P ICASSP, V2, P845 NR 32 TC 40 Z9 40 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
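A bare-bones caricature of the stereo-data version of the RATZ idea from this Moreno, Raj & Stern record: learn per-Gaussian corrections from paired clean and degraded adaptation frames. The published algorithms also correct variances, handle SNR dependence and include blind (non-stereo) variants; the names and simplifications below are ours.

    import numpy as np
    from scipy.stats import multivariate_normal

    def ratz_style_mean_shifts(clean, noisy, weights, means, covs):
        # Responsibility of each clean-speech Gaussian for each clean frame.
        post = np.stack([w * multivariate_normal.pdf(clean, m, c)
                         for w, m, c in zip(weights, means, covs)], axis=1)
        post /= post.sum(axis=1, keepdims=True)
        diff = noisy - clean                     # frame-level degradation
        # Mean shift per Gaussian: posterior-weighted average degradation.
        return (post.T @ diff) / post.sum(axis=0)[:, None]

    # RATZ-style feature compensation subtracts the posterior-averaged shift
    # from each incoming degraded frame; STAR-style model compensation adds
    # the shifts to the recognizer's Gaussian means instead.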
PD JUL PY 1998 VL 24 IS 4 BP 267 EP 285 DI 10.1016/S0167-6393(98)00025-9 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 111UB UT WOS:000075455400001 ER PT J AU Chien, JT Wang, HC Lee, LM AF Chien, JT Wang, HC Lee, LM TI A novel projection-based likelihood measure for noisy speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; noise interference; robustness; likelihood measure; hidden Markov model ID HIDDEN MARKOV-MODELS AB The projection-based likelihood measure, an effective means of reducing noise contamination in speech recognition, dynamically searches for an optimal equalization factor for adapting the cepstral mean vector of a hidden Markov model (HMM) to equalize the noisy observation. In this paper, we present a novel likelihood measure which extends the adaptation mechanism to the shrinkage of the covariance matrix and an adaptation bias for the mean vector. A set of adaptation functions is proposed for obtaining the compensation factors. Experiments indicate that the likelihood measure proposed herein can markedly elevate the recognition accuracy. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 30043, Taiwan. Mingchi Inst Technol, Dept Elect Engn, Taipei, Taiwan. RP Chien, JT (reprint author), Natl Cheng Kung Univ, Dept Comp Sci & Informat Engn, Tainan 70101, Taiwan. EM jtchien@mail.ncku.edu.tw CR BELLEGARDA JR, 1997, P 5 EUR C SPEECH COM, V1, P33 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 CARLSON BA, 1991, INT CONF ACOUST SPEE, P921, DOI 10.1109/ICASSP.1991.150490 Carlson BA, 1994, IEEE T SPEECH AUDI P, V2, P97, DOI 10.1109/89.260341 CHENGALVARAYAN R, 1997, P 5 EUR C SPEECH COM, V5, P2343 Chien JT, 1996, INT CONF ACOUST SPEE, P45 CHIEN JT, 1997, THESIS NATL TSING HU CHIEN JT, 1995, ELECTRON LETT, V31, P1555, DOI 10.1049/el:19951066 Gales M. J. F., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607987 Gales MJF, 1996, IEEE T SPEECH AUDI P, V4, P352, DOI 10.1109/89.536929 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Lee C.-H., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90022-V Lee C.-H., 1997, P ESCA NATO WORKSH R, P45 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P1659, DOI 10.1109/29.46548 OPENSHAW JP, 1994, P ICASSP, V2, P49 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 NR 17 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
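For the Chien, Wang & Lee record above, the basic projection idea admits a closed form for a single diagonal-covariance Gaussian: choose the equalization factor alpha that maximizes the likelihood of the frame under a scaled mean. The sketch below shows only this baseline; the paper's extensions (covariance shrinkage and a mean adaptation bias obtained through adaptation functions) are not reproduced.

    import numpy as np

    def projection_loglik(o, mu, var):
        # Optimal equalization factor for N(alpha * mu, var), diagonal var:
        # alpha* = sum(o * mu / var) / sum(mu^2 / var).
        alpha = (o * mu / var).sum() / ((mu * mu / var).sum() + 1e-12)
        resid = o - alpha * mu
        loglik = -0.5 * ((resid ** 2 / var).sum()
                         + np.log(2 * np.pi * var).sum())
        return loglik, alpha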
PD JUL PY 1998 VL 24 IS 4 BP 287 EP 297 DI 10.1016/S0167-6393(98)00024-7 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 111UB UT WOS:000075455400002 ER PT J AU Deng, L AF Deng, L TI A dynamic, feature-based approach to the interface between phonology and phonetics for speech modeling and recognition SO SPEECH COMMUNICATION LA English DT Article DE speech; dynamic; recognition; features; EM-algorithm; task variables; continuity constraint ID EM ALGORITHM; MARKOV MODEL; UNITS; TIME AB An overview of a statistical paradigm for speech recognition is given where phonetic and phonological knowledge sources, drawn from the current understanding of the global characteristics of human speech communication, are seamlessly integrated into the structure of a stochastic model of speech. A consistent statistical formalism is presented in which the submodels for the discrete, feature-based phonological process and the continuous, dynamic phonetic process in human speech production are computationally interfaced. This interface enables global optimization of a parsimonious set of model parameters that accurately characterize the symbolic, dynamic, and static components in speech production and explicitly separates distinct sources of the speech variability observable at the acoustic level. The formalism is founded on a rigorous mathematical basis, encompassing computational phonology, Bayesian analysis and statistical estimation theory, nonstationary time series and dynamic system theory, and nonlinear function approximation (neural network) theory. Two principal ways of implementing the speech model and recognizer are presented, one based on the trended hidden Markov model (HMM) or explicitly defined trajectory model, and the other on the state-space or recursively defined trajectory model. Both implementations build into their respective recognition and model-training algorithms a continuity constraint on the internal, production-affiliate trajectories across feature-defined phonological units. The continuity and the parameterized structure in the dynamic speech model permit a joint characterization of the contextual and speaking-style variations manifested in speech acoustics, thereby holding promises to overcome some key limitations of the current speech recognition technology (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Waterloo, Dept Elect & Comp Engn, Waterloo, ON N2L G1, Canada. RP Deng, L (reprint author), Univ Waterloo, Dept Elect & Comp Engn, Waterloo, ON N2L 3G1, Canada. EM deng@crg3.uwaterloo.ca CR BAKIS R, 1991, P IEEE WORKSH AUT SP, P20 BEDDOR P, 1990, J PHONETICS, P18 Bishop C. M., 1995, NEURAL NETWORKS PATT BLACKBURN C, 1995, P EUR, V2, P1623 Blevins J., 1995, HDB PHONOLOGICAL THE, P206 Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9 Browman Catherine, 1986, PHONOLOGY YB, V3, P219 Buntine W. L., 1991, Complex Systems, V5 Church K. 
W., 1987, PHONOLOGICAL PARSING Clements George N., 1985, PHONOLOGY YB, V2, P225, DOI 10.1017/S0952675700000440 COHEN J, 1996, P ADDENDUM ICSLP, pS9 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DENES PB, 1973, SPEECH CHAIN PHYSICS Deng L, 1993, IEEE T SPEECH AUDI P, V1, P471 Deng L, 1997, SPEECH COMMUN, V22, P93, DOI 10.1016/S0167-6393(97)00018-6 DENG L, 1995, P IEEE WORKSH AUT SP, P183 DENG L, 1992, COMPUTATIONAL MODEL DENG L, 1997, P IEEE INT C AC SPEE, V2, P1007 DENG L, 1991, P IEEE WORKSH AUT SP, P24 DENG L, 1995, P INT C AC SPEECH SI, P385 DENG L, 1992, SIGNAL PROCESS, V27, P65, DOI 10.1016/0165-1684(92)90112-A DENG L, 1996, P INT C SPOK LANG PR, P2266 DENG L, IN PRESS COMPUTATION DENG L, 1994, J ACOUST SOC AM, V96, P2008, DOI 10.1121/1.410144 Deng L, 1994, IEEE T SPEECH AUDI P, V2, P507 Deng L, 1996, IEEE T SPEECH AUDI P, V4, P301, DOI 10.1109/89.506934 DENG L, 1993, P IEEE WORKSH AUT SP, P83 DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839 DENG L, 1993, J ACO8UST SOC AM 2, V93, P2318, DOI 10.1121/1.406375 Digalakis V, 1993, IEEE T SPEECH AUDI P, V1, P431, DOI 10.1109/89.242489 Fujisaka H., 1984, Journal of the Acoustical Society of Japan (E), V5 FURUI S, 1995, P EUR, V3, P1595 Goldsmith J., 1995, HDB PHONOLOGICAL THE Greenberg S., 1997, P ESCA WORKSH ROB SP, P23 Greenberg Steven, 1996, P ESCA WORKSH AUD BA, P1 Hardcastle William J., 1997, HDB PHONETIC SCI HOLMES JN, 1997, P EUR C SPEECH COMM, V4, P2083 HUO Q, IN PRESS IEEE T SPEE JELINEK F, 1976, P IEEE, V64, P532, DOI 10.1109/PROC.1976.10159 Keating P., 1988, LINGUISTICS CAMBRIDG, V1, P281 KEYSER J, 1994, PHONOLOGY, V11, P207 Kitagawa G., 1996, SMOOTHNESS PRIORS AN KITAGAWA G, 1987, J AM STAT ASSOC, V82, P1032, DOI 10.2307/2289375 Ladefoged P., 1996, SOUNDS WORLDS LANGUA LASS N, 1995, PRINCIPLES EXPT PHON Lee C.-H., 1996, AUTOMATIC SPEECH SPE Levelt W. J., 1989, SPEAKING INTENTION A MACKAY DJC, 1992, NEURAL COMPUT, V4, P415, DOI 10.1162/neco.1992.4.3.415 Maddieson I., 1984, PATTERNS SOUNDS MAEDA S, 1991, J PHONETICS, V19, P321 Mendel JM, 1995, LESSONS ESTIMATION T MENG HML, 1995, THESIS MIT MOORE R, IN PRESS COMPUTATION MORGAN N, 1994, CONNECTIONIST SPEECH OSTENDORF M, IN PRESS COMPUTATION Ostendorf M, 1996, IEEE T SPEECH AUDI P, V4, P360, DOI 10.1109/89.536930 PERKELL JS, 1995, J PHONETICS, V23, P23, DOI 10.1016/S0095-4470(95)80030-1 PERKELL JS, 1980, LANGUAGE PRODUCTION PUSKORIUS GV, 1994, IEEE T NEURAL NETWOR, V5, P279, DOI 10.1109/72.279191 Rabiner L.R., 1996, AUTOMATIC SPEECH SPE, P1 RAMSAY G, 1996, P ICSLP, V2, P1113, DOI 10.1109/ICSLP.1996.607801 Rubin P. E., 1996, P 4 EUR SPEECH PROD, P125 Saltzman E. L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 Selkirk E., 1982, STRUCTURE PHONOLOG 2, P337 Spall JC, 1988, BAYESIAN ANAL TIME S STEVENS JL, 1989, COENZYMES COFACTOR B, V3, P45 West M., 1997, BAYESIAN FORECASTING, V2nd WU CFJ, 1983, ANN STAT, V11, P95, DOI 10.1214/aos/1176346060 ZUE V, 1991, NOTES SPEECH SPECTRO NR 69 TC 38 Z9 38 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
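One concrete instance of the explicitly defined trajectory models in the Deng record above is a "trended" HMM state whose mean follows a polynomial in the time elapsed since state entry. A minimal sketch of a segment's log-likelihood under such a state follows; the diagonal-variance assumption, coefficient layout and names are ours.

    import numpy as np

    def trended_state_loglik(segment, coeffs, var):
        # segment: (T, D) observations in one state; coeffs: list of (D,)
        # polynomial coefficients, lowest order first; var: (D,) variances.
        T, D = segment.shape
        t = np.arange(T)[:, None]
        mean = sum(c * t ** p for p, c in enumerate(coeffs))
        resid = segment - mean
        return -0.5 * np.sum(resid ** 2 / var + np.log(2 * np.pi * var))

Training re-estimates the coefficients from state-aligned segments; the state-space alternative mentioned in the abstract replaces the explicit polynomial with a recursively defined hidden trajectory.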
PD JUL PY 1998 VL 24 IS 4 BP 299 EP 323 DI 10.1016/S0167-6393(98)00023-5 PG 25 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 111UB UT WOS:000075455400003 ER PT J AU Kato, H Kawahara, H AF Kato, H Kawahara, H TI An application of the Bayesian time series model and statistical system analysis for F0 control SO SPEECH COMMUNICATION LA English DT Article DE auditory sense; speech production; auditory feedback; multivariate nonstationary time series; vector autoregressive model; open loop impulse response ID AUDITORY-FEEDBACK; PITCH; FREQUENCY; SPEECH AB F0 (fundamental frequency) control of human speech production is studied by using both a stochastic time series model and a system analysis with a vector autoregressive (VAR) model. We use two-dimensional time series data of F0 (ff0 and sf0) obtained through the transformed auditory feedback (TAF) experimentation system developed by one of the authors. sf0 is extracted from speech data that includes prolonged phonation of the vowel id, and the signal ff0 is extracted from feedback speech frequency-modulated by white Gaussian noise. Most of the data had mean-nonstationary characteristics. The stochastic procedure decomposes the mean-nonstationary and other components in one stage without pre-manufacturing the data. The cyclical components around the mean-nonstationary components are assumed to be generated by the VAR model. We can execute a stochastic system analysis using the estimates of the VAR model to analyze the physical characteristics of the data. We also performed a simulation study using the estimates obtained by the model to discover the role of F0 control under the situation in which hearing ability is completely lost. The results clearly indicate the dynamic properties of F0 control for each segment occurring with each breath taken during a sustained tone. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Hiroshima Univ, Fac Engn, Dept Appl Math, Hiroshima 739, Japan. Wakayama Univ, Fac Syst Engn, Design Informat Sci Dept, Wakayama 640, Japan. RP Kato, H (reprint author), Hiroshima Univ, Fac Engn, Dept Appl Math, 1-4-1 Kagamiyama, Hiroshima 739, Japan. EM katohi@vita.amath.hiroshima-u.ac.jp; kawahara@center.wakayama-u.ac.jp CR Akaike H., 1967, SPECTRAL ANAL TIME S, P81 AKAIKE H, 1968, ANN I STAT MATH, V20, P425, DOI 10.1007/BF02911655 AKAIKE H, 1974, IEEE T AUTOMAT CONTR, VAC19, P716, DOI 10.1109/TAC.1974.1100705 ANDERSON BDO, 1979, OPTIMAL FILTERING, P165 BOX GEP, 1958, ANN MATH STAT, V29, P610, DOI 10.1214/aoms/1177706645 ELMAN JL, 1981, J ACOUST SOC AM, V70, P45, DOI 10.1121/1.386580 HOWELL P, 1984, PERCEPT PSYCHOPHYS, V36, P296, DOI 10.3758/BF03206371 ISHIGURO M, 1994, P 1 US JAP C FRONT S, P79 Ishiguro M., 1989, COMPUTER SCI MONOGRA, V25 KATO H, 1996, TRH166 ATR HUM INF P KATO H, 1994, J J BIOMETRICS, V15, P41 Kawahara H, 1996, VOCAL FOLD, P263 KAWAHARA H, 1994, P ICSLP 94 18 22 SEP, P1399 KAWATO M, 1987, J NEUROSCI, V57, P257 Larson CR, 1995, VOCAL FOLD, P321 MARSAGLIA G, 1964, SIAM REV, V6, P260, DOI 10.1137/1006063 PERKELL J, 1992, J ACOUST SOC AM, V91, P2961, DOI 10.1121/1.402932 RAUCH HE, 1965, AIAA J, V3, P1445 SAPIR S, 1983, J ACOUST SOC AM, V73, P1070, DOI 10.1121/1.389135 SVIRSKY MA, 1992, J ACOUST SOC AM, V92, P1284, DOI 10.1121/1.403923 TIMMONS BA, 1982, PERCEPT MOTOR SKILL, V55, P1179 NR 21 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
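The VAR machinery used in the Kato & Kawahara record above reduces, in its simplest form, to a least-squares fit of lagged regression matrices followed by a recursion for the open-loop impulse response. A compact sketch follows; the detrending of the mean-nonstationary component, order selection by AIC and the Bayesian decomposition used in the paper are omitted.

    import numpy as np

    def fit_var(X, p):
        # Least-squares fit of X_t = sum_k A_k X_{t-k} + e_t, X: (T, d).
        T, d = X.shape
        Y = X[p:]
        Z = np.hstack([X[p - k:T - k] for k in range(1, p + 1)])
        B, *_ = np.linalg.lstsq(Z, Y, rcond=None)
        return B.reshape(p, d, d).transpose(0, 2, 1)   # A[k-1] = B_k^T

    def impulse_response(A, horizon):
        # Open-loop impulse response implied by the fitted coefficients.
        p, d, _ = A.shape
        psi = [np.eye(d)]
        for h in range(1, horizon):
            psi.append(sum(A[k] @ psi[h - 1 - k] for k in range(min(p, h))))
        return np.stack(psi)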
PD JUL PY 1998 VL 24 IS 4 BP 325 EP 339 DI 10.1016/S0167-6393(98)00020-X PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 111UB UT WOS:000075455400004 ER PT J AU Potamianos, G Jelinek, F AF Potamianos, G Jelinek, F TI A study of n-gram and decision tree letter language modeling methods SO SPEECH COMMUNICATION LA English DT Article DE language modeling; n-grams; decision trees; smoothing; laws of succession; back-off language model; deleted interpolation; Brown corpus ID SPEECH RECOGNITION; PROBABILITIES AB The goal of this paper is to investigate various language model smoothing techniques and decision tree based language model design algorithms. For this purpose, we build language models for printable characters (letters), based on the Brown corpus. We consider two classes of models for the text generation process: the n-gram language model and various decision tree based language models. In the first part of the paper, we compare the most popular smoothing algorithms applied to the former. We conclude that the bottom-up deleted interpolation algorithm performs the best in the task of n-gram letter language model smoothing, significantly outperforming the back-off smoothing technique for large values of n. In the second part of the paper, we consider various decision tree development algorithms. Among them, a K-means clustering type algorithm for the design of the decision tree questions gives the best results. However, the n-gram language model outperforms the decision tree language models for letter language modeling. We believe that this is due to the predictive nature of letter strings, which seems to be naturally modeled by n-grams, (C) 1998 Elsevier Science B.V. All rights reserved. C1 AT&T Bell Labs, Res, Speech & Image Proc Serv Res Lab, Florham Pk, NJ 07932 USA. Johns Hopkins Univ, Ctr Language & Speech Proc, Baltimore, MD 21218 USA. RP Potamianos, G (reprint author), AT&T Bell Labs, Res, Speech & Image Proc Serv Res Lab, 180 Park Ave,Room D191, Florham Pk, NJ 07932 USA. EM makis@research.att.com CR BAHL LR, 1991, P EUR C SPEECH COMM, P1209 BAHL LR, 1983, IEEE T PATTERN ANAL, V5, P179 BAHL LR, 1989, IEEE T ACOUST SPEECH, V37, P1001, DOI 10.1109/29.32278 Bell T., 1990, TEXT COMPRESSION BETZ M, 1995, P ICASSP, P856 BROWN PF, 1992, AM J COMPUTATIONAL L, V18, P31 Brown P. F., 1992, Computational Linguistics, V18 Chen S. F., 1996, P 34 ANN M ASS COMP, P310, DOI 10.3115/981863.981904 CHOU PA, 1991, IEEE T PATTERN ANAL, V13, P340, DOI 10.1109/34.88569 EFRON B, 1982, CBMS NSF REG C SER A GOOD IJ, 1953, BIOMETRIKA, V40, P237, DOI 10.2307/2333344 Jelinek F., 1980, Pattern Recognition in Practice. Proceedings of an International Workshop Jelinek F., 1997, STAT METHODS SPEECH KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 Kucera H., 1967, COMPUTATIONAL ANAL P NEY H, 1995, IEEE T PATTERN ANAL, V17, P1202, DOI 10.1109/34.476512 Olshen R., 1984, CLASSIFICATION REGRE, V1st POTAMIANOS G, 1997, CLSP RES NOTES, V13 Press WH, 1988, NUMERICAL RECIPES C RISTAD ES, 1997, CSTR54497 PRINC U DE RISTAD ES, 1995, P 33 ANN M ASS COMP, P220, DOI 10.3115/981658.981688 RISTAD ES, 1997, P 35 ANN M ACL MADR, P381 Ristad E.S, 1995, CSTR49595 PRINC U SHANNON CE, 1951, AT&T TECH J, V30, P50 NR 24 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
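The smoothing techniques compared in the Potamianos & Jelinek record above all blend higher-order and lower-order n-gram estimates. The sketch below shows the simplest relative of deleted interpolation for a bigram letter model; in the paper the interpolation weights are bucketed and tuned on held-out data by EM rather than fixed as here.

    from collections import Counter

    def interpolated_bigram(tokens, lam=0.7):
        # Bigram letter model linearly interpolated with the unigram model.
        uni = Counter(tokens)
        bi = Counter(zip(tokens[:-1], tokens[1:]))
        N = len(tokens)

        def prob(w, prev):
            p_uni = uni[w] / N
            p_bi = bi[(prev, w)] / uni[prev] if uni[prev] else 0.0
            return lam * p_bi + (1 - lam) * p_uni

        return prob

For example, prob = interpolated_bigram(list(text)) then prob('e', 'th'[-1]) returns a smoothed letter probability that never vanishes on unseen bigrams.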
PD JUN PY 1998 VL 24 IS 3 BP 171 EP 192 DI 10.1016/S0167-6393(98)00018-1 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 103XF UT WOS:000074982600001 ER PT J AU Markov, KP Nakagawa, S AF Markov, KP Nakagawa, S TI Text-independent speaker recognition using non-linear frame likelihood transformation SO SPEECH COMMUNICATION LA English DT Article DE speaker identification; speaker verification; likelihood normalization; frame level normalization ID IDENTIFICATION; MODELS AB When the reference speakers are represented by a Gaussian mixture model (GMM), the conventional approach is to accumulate the frame likelihoods over the whole test utterance and compare the results as in speaker identification or apply a threshold as in speaker verification. In this paper we describe a method where frame likelihoods are transformed into new scores according to some non-linear function prior to their accumulation. We have studied two families of such functions. The first, in effect, performs likelihood normalization - a technique widely used in speaker verification, but applied here at the frame level. The second kind of functions transforms the likelihoods into weights according to some criterion. We call this transformation weighting models rank (WMR). Both kinds of transformations require frame likelihoods from all (or a subset of all) reference models to be available. For this, every frame of the test utterance is input to the required reference models in parallel and then the likelihood transformation is applied. The new scores are further accumulated over the whole test utterance in order to obtain an utterance level score for a given speaker model. We have found that the normalization of these utterance scores is also effective for speaker verification. The experiments using two databases - the TIMIT corpus and the NTT database for speaker recognition - showed better speaker identification rates and a significant reduction of speaker verification equal error rates (EER) when the frame likelihood transformation was used. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Toyohashi Univ Technol, Dept Informat & Comp Sci, Toyohashi, Aichi 4418122, Japan. RP Markov, KP (reprint author), Toyohashi Univ Technol, Dept Informat & Comp Sci, Toyohashi, Aichi 4418122, Japan. CR BIMBOT F, 1995, SPEECH COMMUN, V17, P177, DOI 10.1016/0167-6393(95)00013-E Dempster A. P., 1979, J ROYAL STAT SOC B, V39, P1 DODDINGTON GR, 1985, P IEEE, V73, P1651, DOI 10.1109/PROC.1985.13345 DUDA R, 1973, PATTERN CLASSIFICATI, P46 Fukunaga K., 1990, INTRO STAT PATTERN R, V2nd FURUI S, 1978, THESIS U TOKYO FURUI S, 1991, SPEECH COMMUN, V10, P505, DOI 10.1016/0167-6393(91)90054-W Gish H, 1994, IEEE SIGNAL PROC MAG, V11, P18, DOI 10.1109/79.317924 Higgins A., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90098-6 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 LLEIDA E, 1996, P INT C AC SPEECH SI, P507 MARKOV K, 1995, P ACOUS SOC JAP, P83 Markov K., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607970 MARKOV K, 1996, SP9617 IEICE, P37 MATSUI T, 1992, P INT C AC SPEECH SI, V2, P157 MATSUI T, 1993, P ICASSP 93, V2, P391 MATSUI T, 1995, SPEECH COMMUN, V17, P109, DOI 10.1016/0167-6393(95)00011-C MATSUI T, 1996, P IEEE INT C AC SPEE, V1, P97 Nakagawa S., 1994, J ACOUSTICAL SOC JAP, V50, P849 PAPOULIS A, 1991, PROBABILITY RANDOM V, P105 REYNOLDS DA, 1995, SPEECH COMMUN, V17, P91, DOI 10.1016/0167-6393(95)00009-D REYNOLDS DA, 1995, IEEE SIGNAL PROC LET, V2, P46, DOI 10.1109/97.372913 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 Rosenberg A. E., 1992, P INT C SPOK LANG PR, P599 Rosenberg A. E., 1994, P INT C SPOK LANG PR, P1835 ROSENBERG AE, 1991, P ICASSP, P381, DOI 10.1109/ICASSP.1991.150356 Siegel S., 1956, NONPARAMETRIC STAT, P68 SOONG FK, 1987, AT&T TECH J, V66, P14 SVIC M, 1990, P ICASSP 90, P281 TISHBY NZ, 1991, IEEE T SIGNAL PROCES, V39, P563 Tseng B., 1992, P ICASSP 92, VII, P161 NR 31 TC 22 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 1998 VL 24 IS 3 BP 193 EP 209 DI 10.1016/S0167-6393(98)00010-7 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 103XF UT WOS:000074982600002 ER PT J AU Hazan, V Simpson, A AF Hazan, V Simpson, A TI The effect of cue-enhancement on the intelligibility of nonsense word and sentence materials presented in noise SO SPEECH COMMUNICATION LA English DT Article DE speech intelligibility; cue-enhancement; speech perception ID SEMANTICALLY UNPREDICTABLE SENTENCES; CONSONANT CONFUSIONS; SPEECH; HEARING; RECOGNITION AB Two sets of experiments were performed to test the perceptual benefits of enhancing consonantal regions which contain a high density of acoustic cues to phonemic contrasts. In the first set, hand-annotated consonantal regions of natural vowel-consonant-vowel (VCV) stimuli were amplified to increase their salience, and filtered to stylise the cues they contained. In the second set, corresponding regions in natural semantically-unpredictable sentence (SUS) material were annotated and enhanced in the same way. Both sets of stimuli were combined with speech-shaped noise and presented to normally-hearing listeners. The VCV experiments showed statistically significant improvements in intelligibility as a result of enhancement; significant improvements were also obtained for sentence material after some adjustments in enhancement strategies and levels. These results demonstrate the benefits gained from enhancement techniques which use knowledge of acoustic cues to phonetic contrasts to improve the intelligibility of speech in the presence of background noise. (C) 1998 Elsevier Science B.V. All rights reserved. C1 UCL, Dept Phonet & Linguist, London NW1 2HE, England. RP Hazan, V (reprint author), UCL, Dept Phonet & Linguist, 4 Stephenson Way, London NW1 2HE, England. 
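A minimal sketch of the frame-level likelihood normalization described in the Markov and Nakagawa record above, assuming diagonal-covariance GMM speaker models and illustrative shapes throughout: every frame is scored against all reference models in parallel, each frame log-likelihood is normalized by the aggregate over the competing models, and only the transformed scores are accumulated over the utterance.

```python
# Per-frame likelihood normalization over parallel reference GMMs (toy data).
import numpy as np
from scipy.stats import multivariate_normal

def frame_loglik(frames, gmm):
    """Log-likelihood of each frame under one diagonal-covariance GMM.
    gmm: tuple (weights (M,), means (M, d), variances (M, d))."""
    w, mu, var = gmm
    comp = [np.log(w[m]) + multivariate_normal.logpdf(frames, mu[m], np.diag(var[m]))
            for m in range(len(w))]
    return np.logaddexp.reduce(np.stack(comp), axis=0)       # shape (T,)

def normalized_scores(frames, gmms):
    """Utterance score per model after frame-level likelihood normalization."""
    L = np.stack([frame_loglik(frames, g) for g in gmms])    # (S, T)
    norm = np.logaddexp.reduce(L, axis=0)                    # log-sum over models
    return (L - norm).sum(axis=1)                            # accumulate per model

rng = np.random.default_rng(1)
d, M = 3, 2
gmms = [(np.full(M, 1 / M), rng.normal(s, 1, (M, d)), np.ones((M, d)))
        for s in range(3)]                                   # 3 toy "speakers"
frames = rng.normal(0, 1, (50, d))
print(normalized_scores(frames, gmms).argmax())              # closest model wins
```

The paper's WMR transformation, which maps frame likelihoods to rank-based weights instead, would replace the subtraction of the log-sum by a rank lookup; it is not reproduced here.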
EM v.hazan@phon.ucl.ac.uk RI Hazan, Valerie/C-9722-2009 OI Hazan, Valerie/0000-0001-6572-6679 CR BENOIT C, 1990, SPEECH COMMUN, V9, P293, DOI 10.1016/0167-6393(90)90005-T Benoit C, 1996, SPEECH COMMUN, V18, P381, DOI 10.1016/0167-6393(96)00026-X CHENG YM, 1995, P INT C SPOK LANG PR, V1, P515 Deisher ME, 1997, J ACOUST SOC AM, V102, P1141, DOI 10.1121/1.419866 DREHER JJ, 1957, J ACOUST SOC AM, V29, P1320, DOI 10.1121/1.1908780 DUBNO JR, 1981, J ACOUST SOC AM, V69, P249, DOI 10.1121/1.385345 FRANCIS WN, 1982, FREQUENCY ANAL ENGLI GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T GORDONSALANT S, 1986, J ACOUST SOC AM, V80, P1599, DOI 10.1121/1.394324 Hazan Valerie L., 1992, SPEECH HEARING LANGU, V6, P75 HOWARDJONES PA, 1993, ACUSTICA, V78, P258 Jamieson Donald G., 1995, P 13 INT C PHON SCI, V4, P100 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 LANE H, 1971, J SPEECH HEAR RES, V14, P677 Liu C, 1997, J ACOUST SOC AM, V101, P2877, DOI 10.1121/1.418518 Liu SA, 1996, J ACOUST SOC AM, V100, P3417, DOI 10.1121/1.416983 LUCE PA, 1983, HUM FACTORS, V25, P17 LUCE PA, 1983, RES SPEECH PERCEPTIO, V9, P295 MILLER GA, 1955, J ACOUST SOC AM, V27, P338, DOI 10.1121/1.1907526 NIEDERJOHN RJ, 1976, IEEE T ACOUST SPEECH, V24, P277, DOI 10.1109/TASSP.1976.1162824 PARSONS TW, 1987, VOICE SPEECH PROCESS, P345 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 RABBITT P, 1968, Q J EXPT PSYCHOL, V20, P1 SENDLMEIER WF, 1989, CLIN LINGUIST PHONET, V3, P151, DOI 10.3109/02699208908985278 SIMPSON AS, 1997, UNPUB SPEECH HEARING, V10, P79 STEVENS KN, 1985, PHONETIC LINGUISTICS Tallal P, 1996, SCIENCE, V271, P81, DOI 10.1126/science.271.5245.81 VANSUMMERS W, 1988, J ACOUST SOC AM, V84, P917 WANG MD, 1973, J ACOUST SOC AM, V54, P1248, DOI 10.1121/1.1914417 NR 29 TC 35 Z9 35 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 1998 VL 24 IS 3 BP 211 EP 226 DI 10.1016/S0167-6393(98)00011-9 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 103XF UT WOS:000074982600003 ER PT J AU Bouabana, S Maeda, S AF Bouabana, S Maeda, S TI Multi-pulse LPC modeling of articulatory movements SO SPEECH COMMUNICATION LA English DT Article DE speech production; articulatory movement modeling; multi-pulse LPC; syllables ID SKELETAL-MUSCLE; VARYING MODEL; SPEECH; PARAMETERS AB The frame-by-frame variation of tongue profiles derived from X-ray film data is described in terms of the temporal patterns of four articulatory parameters. The temporal variation of each parameter, i.e., movement, is assumed to be the output of a time-invariant auto-regressive filter. Each filter is excited by a sequence of pulses, representing articulatory commands. The filter coefficients and the position and amplitude of the pulses are determined by applying an MLPC method. The curve of synthesis error for each movement shows a rapid decrease up to a number of pulses corresponding to that of the syllables in the sentence, and then the decrease becomes distinctly slower, suggesting the presence of syllable-size motor organization. The minimum number of pulses is determined by using an acoustic criterion. It depends on the number of phonetic features in the sentence whose realization is crucially related to the pertinent parameters. (C) 1998 Elsevier Science B.V. All rights reserved. 
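For the Hazan and Simpson record above, the amplification half of the cue-enhancement procedure can be sketched directly. The gain value, ramp length and region format below are assumptions, and the paper's filtering-based stylisation of cues is not reproduced: hand-annotated consonantal regions of a waveform are simply boosted, with short linear ramps to avoid discontinuities.

```python
# Boost annotated cue regions of a waveform by a fixed gain with short ramps.
import numpy as np

def enhance_regions(signal, regions, fs, gain_db=6.0, ramp_ms=5.0):
    """regions: list of (start_s, end_s) annotated cue regions, in seconds."""
    env = np.ones(len(signal))
    ramp = int(fs * ramp_ms / 1000)
    g = 10 ** (gain_db / 20)
    for t0, t1 in regions:
        i0, i1 = int(t0 * fs), int(t1 * fs)
        env[i0:i1] = g
        env[max(i0 - ramp, 0):i0] = np.linspace(1, g, min(ramp, i0))
        env[i1:i1 + ramp] = np.linspace(g, 1, len(env[i1:i1 + ramp]))
    return signal * env

fs = 16000
x = np.random.default_rng(2).standard_normal(fs)   # 1 s of noise as a stand-in
y = enhance_regions(x, [(0.30, 0.38)], fs)
print(y[int(0.34 * fs)] / x[int(0.34 * fs)])       # ~2.0, i.e. +6 dB in region
```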
C1 Ecole Natl Super Telecommun Bretagne, Dept Signal, F-75634 Paris 13, France. CNRS, URA 820, F-75634 Paris 13, France. RP Maeda, S (reprint author), Ecole Natl Super Telecommun Bretagne, Dept Signal, 46 Rue Barrault, F-75634 Paris 13, France. EM maeda@sig.enst.fr CR AKAZAWA K, 1969, TECHNOLOGY REPORTS O, V19, P577 Asatryan DG, 1965, BIOPHYSICS-USSR, V10, p[925, 837] Atal B. S., 1986, P IEEE INT C AC SPEE, P1681 Atal B. S., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing BOBET J, 1993, IEEE T BIO-MED ENG, V40, P1000, DOI 10.1109/10.247798 BOTHOREL A, 1986, TRAVAUX I PHONETIQUE BOUABANA S, 1994, 20 JOURN ET PAR TREG, P439 FLANAGAN JL, 1955, J ACOUST SOC AM, V27, P613, DOI 10.1121/1.1907979 FUJIMURA O, 1973, Computers in Biology and Medicine, V3, P371, DOI 10.1016/0010-4825(73)90003-6 FUJIMURA O, 1961, J SPEECH HEAR RES, V4, P233 FUJIMURA O, 1994, DIMACS SERIES DISCRE, V17, P1 FUJIMURA O, 1995, P ICPHS 95 STOCKH, P10 Gabioud B., 1994, FUNDAMENTALS SPEECH, P215 HATZE H, 1978, BIOL CYBERN, V28, P143, DOI 10.1007/BF00337136 HEINZ JM, 1964, J ACOUST SOC AM, V36, P1037, DOI 10.1121/1.2143313 Henke W., 1966, THESIS MIT CAMBRIDGE Hill AV, 1938, PROC R SOC SER B-BIO, V126, P136, DOI 10.1098/rspb.1938.0050 ISHIZAKA K, 1975, IEEE T ACOUST SPEECH, V23, P370, DOI 10.1109/TASSP.1975.1162701 KAY SM, 1981, P IEEE, V69, P1380, DOI 10.1109/PROC.1981.12184 Kent R. D., 1977, J PHONETICS, V15, P115 KENT RD, 1974, J SPEECH HEAR RES, V17, P470 KIRITANI S, 1986, SPEECH COMMUN, V5, P119, DOI 10.1016/0167-6393(86)90003-8 KRAKOW RA, 1993, PHONETICS PHONOLOGY, V5, P87 LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279 LINDBLOM BE, 1971, J ACOUST SOC AM, V50, P1166, DOI 10.1121/1.1912750 MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131 Maeda S, 1992, J PHYS, VIV, P191 MAEDA S, 1994, PHONETICA, V51, P17 Maeda S., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90017-6 Maeda S., 1979, 10 JOURN ET PAR, P1 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MANNARD A, 1973, J PHYSIOL-LONDON, V229, P275 MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427 MOLL KL, 1971, J ACOUST SOC AM, V50, P678, DOI 10.1121/1.1912683 NELSON WL, 1983, BIOL CYBERN, V46, P135, DOI 10.1007/BF00339982 OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310 OSTRY DJ, 1985, J ACOUST SOC AM, V77, P640, DOI 10.1121/1.391882 OVERALL JE, 1962, PSYCHOL REP, V10, P651 PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204 ROY R, 1989, IEEE T ACOUST SPEECH, V37, P984, DOI 10.1109/29.32276 Schmidt R, 1988, MOTOR CONTROL LEARNI SCHONLE PW, 1987, BRAIN LANG, V31, P26, DOI 10.1016/0093-934X(87)90058-7 SCHROEDER MR, 1979, J ACOUST SOC AM, V66, P1647, DOI 10.1121/1.383662 SMITH CL, 1993, J ACOUST SOC AM, V93, P1580, DOI 10.1121/1.406817 SONODA Y, 1976, ANN B RILP, V10, P29 WICKELGR.WA, 1969, PSYCHOL REV, V76, P1, DOI 10.1037/h0026823 WINTERS JM, 1987, BIOL CYBERN, V55, P403, DOI 10.1007/BF00318375 YASUHARA M, 1983, IEEE T CIRCUITS SYST, V30, P828, DOI 10.1109/TCS.1983.1085302 NR 48 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
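The MLPC analysis in the Bouabana and Maeda record above determines the positions and amplitudes of pulses exciting a fixed auto-regressive filter. Below is a sketch in the spirit of classical greedy multi-pulse search, assuming known filter coefficients and ignoring the paper's acoustic stopping criterion: pulses are added one at a time, each chosen to remove as much of the remaining error energy as possible.

```python
# Greedy multi-pulse search against a known all-pole filter (toy example).
import numpy as np
from scipy.signal import lfilter

def multipulse(target, a, n_pulses):
    """Place pulses exciting 1/A(z) so the synthetic output tracks the target."""
    N = len(target)
    h = lfilter([1.0], a, np.concatenate(([1.0], np.zeros(N - 1))))  # impulse resp.
    err = target.astype(float).copy()
    pulses = []
    for _ in range(n_pulses):
        best = (0, 0.0, -1.0)                      # (position, amplitude, gain)
        for pos in range(N):
            hp = h[:N - pos]
            c, e = np.dot(err[pos:], hp), np.dot(hp, hp)
            if c * c / e > best[2]:
                best = (pos, c / e, c * c / e)
        pos, amp, _ = best
        err[pos:] -= amp * h[:N - pos]             # subtract this pulse's effect
        pulses.append((pos, amp))
    return pulses, target - err                    # pulse train, synthetic signal

a = np.array([1.0, -1.5, 0.7])                     # toy AR(2) "articulator"
exc = np.zeros(100)
exc[[10, 45, 80]] = [1.0, -0.6, 0.8]
target = lfilter([1.0], a, exc)
pulses, synth = multipulse(target, a, 3)
print(sorted(p for p, _ in pulses))                # recovers positions [10, 45, 80]
```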
PD JUN PY 1998 VL 24 IS 3 BP 227 EP 248 DI 10.1016/S0167-6393(98)00012-0 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 103XF UT WOS:000074982600004 ER PT J AU Soon, IY Koh, SN Yeo, CK AF Soon, IY Koh, SN Yeo, CK TI Noisy speech enhancement using discrete cosine transform SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; MMSE amplitude estimation; noise removal; discrete cosine transform (DCT) ID SUPPRESSION AB This paper illustrates the advantages of using the Discrete Cosine Transform (DCT) as compared to the standard Discrete Fourier Transform (DFT) for the purpose of removing noise embedded in a speech signal. The derivation of the Minimum Mean Square Error (MMSE) filter based on the statistical modelling of the DCT coefficients is shown. Also shown is the derivation of an over-attenuation factor based on the fact that speech energy is not present in the noisy signal at all times or in all coefficients. This over-attenuation factor is useful in suppressing any musical residual noise which may be present. The proposed methods are evaluated against the noise reduction filter proposed by Y. Ephraim and D. Malah (1984), using both Gaussian distributed white noise as well as recorded fan noise, with favourable results. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Nanyang Technol Univ, Sch Elect & Elect Engn, Singapore 639798, Singapore. Nanyang Technol Univ, Sch Appl Sci, Singapore 639798, Singapore. RP Soon, IY (reprint author), Nanyang Technol Univ, Sch Elect & Elect Engn, Block S2,Nanyang Ave, Singapore 639798, Singapore. EM eiysoon@ntu.edu.sg RI Soon, Ing Yann/A-5173-2011 CR AHMED N, 1974, IEEE T COMPUT, VC 23, P90, DOI 10.1109/T-C.1974.223784 BEROUTI M, 1979, IEEE P ICASSP, V1, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 CHEN WH, 1977, IEEE T COMMUN, V25, P1004 CROCHIERE RE, 1980, IEEE T ACOUST SPEECH, V28, P99, DOI 10.1109/TASSP.1980.1163353 EPHRAIM Y, 1995, IEEE T SPEECH AUDI P, V3, P251, DOI 10.1109/89.397090 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 JUANG BH, 1990, P INT C SPOK LANG PR, P1113 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 MUNDAY E, 1988, BRIT TELECOM TECHNOL, V6, P71 NARASIMHA MJ, 1978, IEEE T COMMUN, V26, P934, DOI 10.1109/TCOM.1978.1094144 SAMBUR MR, 1976, IEEE T ACOUST SPEECH, V24, P484 Scalart P., 1996, P ICASSP, V2, P629 VARY P, 1985, SIGNAL PROCESS, V8, P387, DOI 10.1016/0165-1684(85)90002-7 ZELINSKI R, 1977, IEEE T ACOUST SPEECH, V25, P299, DOI 10.1109/TASSP.1977.1162974 NR 16 TC 57 Z9 68 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
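A minimal sketch of frame-based noise suppression in the DCT domain for the Soon, Koh and Yeo record above. The gain rule here is a generic Wiener-style stand-in rather than the paper's exact MMSE estimator, and the over-attenuation and floor constants are illustrative.

```python
# Wiener-style gain with over-attenuation applied per DCT coefficient (toy).
import numpy as np
from scipy.fft import dct, idct

def denoise_frame(frame, noise_var, floor=0.1, over=1.4):
    Y = dct(frame, type=2, norm='ortho')
    snr = np.maximum(Y ** 2 / noise_var - 1.0, 0.0)   # crude a priori SNR
    gain = snr / (snr + over)                          # over-attenuated gain
    return idct(np.maximum(gain, floor) * Y, type=2, norm='ortho')

rng = np.random.default_rng(4)
clean = np.sin(2 * np.pi * 0.05 * np.arange(256))
noisy = clean + 0.3 * rng.standard_normal(256)
out = denoise_frame(noisy, noise_var=0.09)
print(np.mean((out - clean) ** 2) < np.mean((noisy - clean) ** 2))  # True
```

Because the DCT of white noise is again white with unchanged variance under the orthonormal convention, the same per-coefficient noise variance can be used in every frame, which is part of the appeal of the DCT formulation.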
PD JUN PY 1998 VL 24 IS 3 BP 249 EP 257 DI 10.1016/S0167-6393(98)00019-3 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA 103XF UT WOS:000074982600005 ER PT J AU Oviatt, S MacEachern, M Levow, GA AF Oviatt, S MacEachern, M Levow, GA TI Predicting hyperarticulate speech during human-computer error resolution SO SPEECH COMMUNICATION LA English DT Article DE hyperarticulation; linguistic adaptation; CHAM model; speech recognition errors; human-computer interaction ID CLEAR SPEECH; CONVERSATIONAL SPEECH; WORD BOUNDARIES; HEARING; INTELLIGIBILITY; CUES; HARD; RECOGNITION; LISTENERS; LANGUAGE AB When speaking to interactive systems, people sometimes hyperarticulate - or adopt a clarified form of speech that has been associated with increased recognition errors. The goals of the present study were (1) to establish a flexible simulation method for studying users' reactions to system errors, (2) to analyze the type and magnitude of linguistic adaptations in speech during human-computer error resolution, (3) to provide a unified theoretical model for interpreting and predicting users' spoken adaptations during system error handling, and (4) to outline the implications for developing more robust interactive systems. A semi-automatic simulation method with a novel error generation capability was developed to compare users' speech immediately before and after system recognition errors, and under conditions varying in error base-rate. Matched original-repeat utterance pairs then were analyzed for type and magnitude of linguistic adaptation. When resolving errors with a computer, it was revealed that users actively tailor their speech along a spectrum of hyperarticulation, and as a predictable reaction to their perception of the computer as an "at risk" listener. During both low and high error rates, durational changes were pervasive, including elongation of the speech segment and large relative increases in the number and duration of pauses. During a high error rate, speech also was adapted to include more hyper-clear phonological features, fewer disfluencies, and change in fundamental frequency. The two-stage CHAM model (Computer-elicited Hyperarticulate Adaptation Model) is proposed to account for these changes in users' speech during interactive error resolution. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Oregon Grad Inst Sci & Technol, Dept Comp Sci, Ctr Human Comp Commun, Portland, OR 97291 USA. Univ Pittsburgh, Dept Linguist, Pittsburgh, PA 15260 USA. MIT, Artificial Intelligence Lab, Cambridge, MA 02139 USA. RP Oviatt, S (reprint author), Oregon Grad Inst Sci & Technol, Dept Comp Sci, Ctr Human Comp Commun, POB 91000, Portland, OR 97291 USA. EM oviatt@cse.ogi.edu CR APPLEBAUM TH, 1990, P EUSIPCO 90, P1183 BOND ZS, 1994, SPEECH COMMUN, V14, P325, DOI 10.1016/0167-6393(94)90026-4 Brenner M., 1985, VOCAL FOLD PHYSL BIO, P239 BRUCE G, 1995, P ESCA WORKSH SPOK D, P201 CARAMAZZA A, 1991, NATURE, V349, P788, DOI 10.1038/349788a0 Chen F.R., 1980, THESIS MIT COHEN J, 1996, P ADD ICSLP, P9 CUTLER A, 1991, SPEECH COMMUN, V10, P335, DOI 10.1016/0167-6393(91)90002-B CUTLER A, 1990, SPEECH COMMUN, V9, P485, DOI 10.1016/0167-6393(90)90024-4 DANIS CM, 1989, PROC HUM FACT SOC AN, P301 EISEN B, 1992, P INT C SPOK LANG PR, V2, P871 Ferguson C. 
A., 1977, TALKING CHILDREN LAN, P219 FERGUSON CA, 1975, ANTHROPOL LINGUIST, V17, P1 FERNALD A, 1989, J CHILD LANG, V16, P477 Frankish C., 1995, P CHI 95, P503, DOI 10.1145/223904.223972 FREED B, 1978, THESIS U PENNSYLVANI GAGNOULET C, 1989, INT VOICE SYSTEMS RE, V1 Garnica O. K., 1977, TALKING CHILDREN LAN, P63 GORDONSALANT S, 1987, J ACOUST SOC AM, V81, P1199, DOI 10.1121/1.394643 Hanley TD, 1949, J SPEECH HEAR DISORD, V14, P363 JELINEK F, 1985, P IEEE, V73, P1616, DOI 10.1109/PROC.1985.13343 Johnston M., 1997, P 35 ANN M ASS COMP, P281 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 KAMM C, 1994, VOICE COMMUNICATION BETWEEN HUMANS AND MACHINES, P422 KARIS D, 1991, IEEE J SEL AREA COMM, V9, P574, DOI 10.1109/49.81951 LEWIS C, 1986, USER CTR SYSTEM DESI, P411 LINDBLOM B, 1992, SPEECH COMMUN, V11, P357, DOI 10.1016/0167-6393(92)90041-5 Lindblom B, 1996, J ACOUST SOC AM, V99, P1683, DOI 10.1121/1.414691 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 Lippmann R. P., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) LIVELY SE, 1993, J ACOUST SOC AM, V93, P2962, DOI 10.1121/1.405815 Lombard E., 1911, ANN MALADIES OREILLE, V37, P101 MAASSEN B, 1986, J SPEECH HEAR RES, V29, P227 MALECOT A, 1958, LANGUAGE, V34, P370, DOI 10.2307/410929 MIRGHAFORI N, 1996, P IEEE INT C AC SPEE, V1, P335 MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 MOON SJ, 1991, THESIS U TEXAS AUSTI Nass C., 1994, P SIGCHI C HUM FACT, P72, DOI 10.1145/191666.191703 OVIATT S, 1995, COMPUT SPEECH LANG, V9, P19, DOI 10.1006/csla.1995.0002 Oviatt S., 1996, P INT C SPOK LANG PR, V1, P204, DOI 10.1109/ICSLP.1996.607077 OVIATT SL, 1992, P INT C SPOK LANG PR, V2, P1351 OVIATT SL, UNPUB MODELING GLOBA OVIATT SL, UNPUB LINGUISTIC ADA OVIATT SL, 1994, SPEECH COMMUN, V15, P283, DOI 10.1016/0167-6393(94)90079-5 OVIATT SL, 1994, P INT C SPOK LANG PR, V2, P551 OVIATT SL, IN PRESS 10 MYTHS MU PAYTON KL, 1994, J ACOUST SOC AM, V95, P1581, DOI 10.1121/1.408545 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434 PICKETT JM, 1956, J ACOUST SOC AM, V28, P902, DOI 10.1121/1.1908510 Rhyne J. R., 1993, ADV HUMAN COMPUTER I, V4, P191 ROE DB, 1994, VOICE COMMUNICATION SCHULMAN R, 1989, J ACOUST SOC AM, V85, P295, DOI 10.1121/1.397737 SHATZ M, 1973, MONOGR SOC RES CHILD, V38, P1, DOI 10.2307/1165783 Shriberg E., 1992, P DARPA SPEECH NAT L, P49, DOI 10.3115/1075527.1075538 SPITZ J, 1991, P 4 DARP WORKSH SPEE Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660 SWERTS M, 1994, J ACOUST SOC AM, V96, P2064, DOI 10.1121/1.410148 TOLKMITT FJ, 1986, J EXP PSYCHOL HUMAN, V12, P302, DOI 10.1037//0096-1523.12.3.302 Uchanski RM, 1996, J SPEECH HEAR RES, V39, P494 WEINSTEIN C, 1994, ARPA WORKSH P HUM LA WILLIAMS CE, 1969, AEROSPACE MED, V40, P1369 WOLF CG, 1990, P HUM FACT SOC 34 AN, P249 Yankelovich N., 1995, P C HUM FACT COMP SY, P369, DOI 10.1145/223904.223952 NR 64 TC 35 Z9 35 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAY PY 1998 VL 24 IS 2 BP 87 EP 110 DI 10.1016/S0167-6393(98)00005-3 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX843 UT WOS:000074561000001 ER PT J AU Varho, S Alku, P AF Varho, S Alku, P TI Separated linear prediction - A new all-pole modelling technique for speech analysis SO SPEECH COMMUNICATION LA English DT Article ID FAST ALGORITHMS; SIGNALS; FILTERS AB This study presents a new predictive method, Separated Linear Prediction (SLP), for spectral estimation of speech. The prediction of the signal value x(n) is computed from its p + 1 preceding samples by emphasising, among the previous samples, the one which is located next to sample x(n). All the p samples from x(n - 2) to x(n - (p + 1)) are linearly extrapolated with x(n - 1) to obtain p new values, which are used for prediction. Optimisation of the filter coefficients is carried out using the autocorrelation criterion as in conventional LP. The SLP-analysis yields an all-pole filter of the order p + 1 with p unknowns in the normal equations. The performance of SLP was compared to conventional LP by analysing vowels produced by two female and four male speakers. Results showed that the proposed method yielded, in general, more accurate modelling of higher formants, residuals with smaller energy, and increased flatness. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Turku Univ, Dept Appl Phys, Lab Elect & Informat Technol, FIN-20014 Turku, Finland. RP Varho, S (reprint author), Turku Univ, Dept Appl Phys, Lab Elect & Informat Technol, Vesilinnantle 5, FIN-20014 Turku, Finland. EM susanna.varho@utu.fi RI Alku, Paavo/E-2400-2012 CR ATAL BS, 1979, IEEE T ACOUST SPEECH, V27, P247, DOI 10.1109/TASSP.1979.1163237 ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 DAVID S, 1991, IEEE T SIGNAL PROCES, V39, P789, DOI 10.1109/78.80900 DENOEL E, 1985, IEEE T ACOUST SPEECH, V33, P1397, DOI 10.1109/TASSP.1985.1164759 ELJAROUDI A, 1991, IEEE T SIGNAL PROCES, V39, P411, DOI 10.1109/78.80824 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HSUE JJ, 1993, IEEE T SIGNAL PROCES, V41, P2349, DOI 10.1109/78.224244 KAY S, 1983, IEEE T ACOUST SPEECH, V31, P746, DOI 10.1109/TASSP.1983.1164088 Kay S. M., 1988, MODERN SPECTRAL ESTI LAINE UK, 1995, INT CONF ACOUST SPEE, P1701, DOI 10.1109/ICASSP.1995.479933 LEE AC, 1989, J ACOUST SOC AM, V86, P150, DOI 10.1121/1.398334 MA CX, 1993, SPEECH COMMUN, V12, P69, DOI 10.1016/0167-6393(93)90019-H MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Markel JD, 1976, LINEAR PREDICTION SP MARPLE SL, 1982, IEEE T ACOUST SPEECH, V30, P942, DOI 10.1109/TASSP.1982.1163987 MIYOSHI Y, 1987, IEEE T ACOUST SPEECH, V35, P1233, DOI 10.1109/TASSP.1987.1165282 Rabiner L.R., 1978, DIGITAL PROCESSING S STRUBE HW, 1980, J ACOUST SOC AM, V68, P1071, DOI 10.1121/1.384992 TOWNSHEND B, 1991, INT CONF ACOUST SPEE, P425, DOI 10.1109/ICASSP.1991.150367 WONG DY, 1980, IEEE T ACOUST SPEECH, V28, P263, DOI 10.1109/TASSP.1980.1163387 NR 20 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
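The extrapolation step in the Varho and Alku abstract above can be written out directly. In this sketch, each of the p samples x(n-2)..x(n-(p+1)) is linearly extrapolated through x(n-1) to time n, and the p extrapolated values serve as regressors for x(n); a covariance-style least squares stands in for the paper's autocorrelation criterion, so the details and numbers are illustrative only.

```python
# Prediction of x(n) from p values extrapolated through x(n-1) (toy signal).
import numpy as np

def slp_coeffs(x, p):
    N = len(x)
    n = np.arange(p + 1, N)                   # indices with a full history
    Y = np.empty((len(n), p))
    for j, k in enumerate(range(2, p + 2)):   # k = 2 .. p+1
        Y[:, j] = x[n - 1] + (x[n - 1] - x[n - k]) / (k - 1)
    b, *_ = np.linalg.lstsq(Y, x[n], rcond=None)
    return b, Y

rng = np.random.default_rng(5)
x = np.sin(0.3 * np.arange(500)) + 0.01 * rng.standard_normal(500)
p = 10
b, Y = slp_coeffs(x, p)
resid = x[p + 1:] - Y @ b
print(np.var(resid) < np.var(x))              # True: residual energy is far lower
```

For k = 2 the extrapolated value is simply 2x(n-1) - x(n-2); larger k values are weighted down by the 1/(k-1) slope term, which is the sense in which the sample next to x(n) is emphasised.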
PD MAY PY 1998 VL 24 IS 2 BP 111 EP 121 DI 10.1016/S0167-6393(98)00003-X PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX843 UT WOS:000074561000002 ER PT J AU Alku, P Vilkman, E Laukkanen, AM AF Alku, P Vilkman, E Laukkanen, AM TI Estimation of amplitude features of the glottal flow by inverse filtering speech pressure signals SO SPEECH COMMUNICATION LA English DT Article DE inverse filtering; voice source analysis ID VOICE SOURCE; AIR-FLOW; WAVE-FORM; FUNDAMENTAL-FREQUENCY; FEMALE; PARAMETERS; PHONATION; SPEAKERS AB In this study a new scaling technique is presented which makes it possible to estimate the voice source including its amplitude values by inverse filtering the speech pressure waveform without applying a flow mask. The new technique is based on adjusting the DC-gain of the vocal tract model in inverse filtering to unity. The performance of the new method is tested by analysing the correlation between the minimum peak amplitude of the differentiated glottal flow given by the new technique and the sound pressure level of speech. The results show that the new method yields reliable information on the amplitude values of the glottal source without applying a flow mask. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Turku Univ, Dept Appl Phys, Lab Elect & Informat Sci, FIN-20014 Turun, Finland. Univ Oulu, Dept Otolaryngol & Phoniatr, FIN-90220 Oulu, Finland. Univ Tampere, Inst Speech Commun & Voice Res, FIN-33101 Tampere, Finland. RP Alku, P (reprint author), Aalto Univ, Acoust Lab, Otakaari 5A, FIN-02150 Espoo, Finland. RI Alku, Paavo/E-2400-2012 CR Alku P, 1996, SPEECH COMMUN, V18, P131, DOI 10.1016/0167-6393(95)00040-2 Alku P., 1994, P INT C SPOK LANG PR, P1619 ALKU P, 1995, J ACOUST SOC AM, V98, P763, DOI 10.1121/1.413569 Alku P, 1996, FOLIA PHONIATR LOGO, V48, P240 CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 CUMMINGS KE, 1995, J ACOUST SOC AM, V98, P88, DOI 10.1121/1.413664 ELJAROUDI A, 1991, IEEE T SIGNAL PROCES, V39, P411, DOI 10.1109/78.80824 Fant G., 1960, ACOUSTIC THEORY SPEE FANT G, 1993, SPEECH COMMUN, V13, P7, DOI 10.1016/0167-6393(93)90055-P FANT G, 1995, 231995 STLQPSR, P119 FANT G, 1994, P INT C SPOK LANG P, P1619 Fant G., 1985, 4 PARAMETER MODEL GL, P1 GAUFFIN J, 1989, J SPEECH HEAR RES, V32, P556 HERTEGARD S, 1992, J VOICE, V6, P224, DOI 10.1016/S0892-1997(05)80147-X HERTEGARD S, 1990, J VOICE, V4, P220, DOI 10.1016/S0892-1997(05)80017-7 HERTEGARD S, 1992, 23 ROYAL I TECHN SPE, P9 HIGGINS MB, 1993, J VOICE, V7, P47, DOI 10.1016/S0892-1997(05)80111-0 HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829 Laukkanen AM, 1996, J PHONETICS, V24, P313, DOI 10.1006/jpho.1996.0017 MILLER RL, 1959, J ACOUST SOC AM, V31, P667, DOI 10.1121/1.1907771 PRICE PJ, 1989, SPEECH COMMUN, V8, P261, DOI 10.1016/0167-6393(89)90005-8 Rabiner L.R., 1978, DIGITAL PROCESSING S ROTHENBE.M, 1973, J ACOUST SOC AM, V53, P1632, DOI 10.1121/1.1913513 STRIK H, 1992, SPEECH COMMUN, V11, P167, DOI 10.1016/0167-6393(92)90011-U Sulter AM, 1996, J ACOUST SOC AM, V100, P3360, DOI 10.1121/1.416977 SUNDBERG J, 1993, J VOICE, V7, P15, DOI 10.1016/S0892-1997(05)80108-0 WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260 NR 27 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
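A schematic reading of the scaling idea in the Alku, Vilkman and Laukkanen record above: estimate an all-pole vocal tract model by linear prediction, scale its coefficients so that the model's DC gain is unity, and inverse filter the pressure signal with the scaled model. The LP order, the toy excitation and the omission of lip-radiation compensation are all simplifying assumptions, not the paper's exact procedure.

```python
# LP-based inverse filtering with the inverse filter scaled to unit DC gain.
import numpy as np
from scipy.linalg import toeplitz, solve
from scipy.signal import lfilter

def lp_coeffs(x, order):
    """Autocorrelation-method LP: returns A(z) = 1 - sum a_k z^-k."""
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + order]
    a = solve(toeplitz(r[:order]), r[1:order + 1])
    return np.concatenate(([1.0], -a))

def inverse_filter_unit_dc(speech, order=12):
    a = lp_coeffs(speech, order)
    a = a / np.sum(a)                   # A(1) = 1, so the tract model 1/A(z) has DC gain 1
    return lfilter(a, [1.0], speech)    # amplitude-calibrated source estimate

exc = np.zeros(800)
exc[::80] = 1.0                                     # crude excitation pulses
speech = lfilter([1.0], [1.0, -1.2, 0.8], exc)      # toy "vocal tract" output
source = inverse_filter_unit_dc(speech, order=2)
print(np.argmax(np.abs(source[:200])))              # peaks near pulse instants
```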
PD MAY PY 1998 VL 24 IS 2 BP 123 EP 132 DI 10.1016/S0167-6393(98)00004-1 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX843 UT WOS:000074561000003 ER PT J AU Skoglund, J AF Skoglund, J TI Analysis and quantization of glottal pulse shapes SO SPEECH COMMUNICATION LA English DT Article DE speech coding; excitation coding; glottal source modeling; glottal codebook; vector quantization; memory based vector quantization ID VECTOR QUANTIZATION; SPEECH SIGNALS; INPUT MODEL AB In source-filter based speech coding for low bit rates, an efficient representation of excitation pulses is required to attain high quality synthetic speech. In this paper, we discuss a pulse waveform representation by a codebook populated with pulse shapes. The codebook is designed from glottal derivative pulses obtained by a linear predictive inverse filtering technique. Pulses are extracted and normalized in time and amplitude to form prototype pulses. Design methods and performance evaluation of the codebooks are investigated in a vector quantization (VQ) framework. The quantization gains obtained by exploiting the correlation between pulses are studied by theoretical calculations, which suggest that about 2 bits per vector (in a budget of 7-10 bits) can be gained when exploiting the correlation. Memory based VQ is a generic term for quantization schemes which utilize previously quantized pulses. We study traditional memory based VQ methods and an extension of memory based VQ with memoryless VQ, denoted a safety-net extension. The experiments show that performance improves when extending memory based VQ with a safety-net. It is found that, at the designated bit rates, a safety-net extended memory based VQ can gain about 1.5-2 bits in comparison with memoryless VQ. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Chalmers, Dept Signals & Syst, Informat Theory Grp, S-41296 Gothenburg, Sweden. RP Skoglund, J (reprint author), Chalmers, Dept Signals & Syst, Informat Theory Grp, S-41296 Gothenburg, Sweden. EM jans@it.chalmers.se CR ALKU P, 1990, P IEEE INT S CIRC 3, V3, P2149 ATAL BS, 1979, IEEE T ACOUST SPEECH, V27, P247, DOI 10.1109/TASSP.1979.1163237 Atal B.S., 1984, P INT C COMM AMST, P1610 BENVENUTO N, 1986, AT&T TECH J, V65, P12 Berger T., 1971, RATE DISTORTION THEO BERGSTROM A, 1989, P IEEE INT C AC SPEE, V1, P53 BEROUTI MG, 1977, P IEEE INT C AC SPEE, V1, P33 Cheng YM, 1993, IEEE T SPEECH AUDI P, V1, P207, DOI 10.1109/89.222879 CHILDERS DG, 1995, SPEECH COMMUN, V16, P127, DOI 10.1016/0167-6393(94)00050-K Cover T. 
M., 1991, ELEMENTS INFORMATION CUPERMAN V, 1982, P IEEE GLOB TEL C MI, V3, P1092 Das A, 1996, IEEE SIGNAL PROC LET, V3, P200, DOI 10.1109/97.508164 Dudley H, 1939, J FRANKL INST, V227, P0739, DOI 10.1016/S0016-0032(39)90816-1 DUDLEY H, 1950, J ACOUST SOC AM, V22, P151, DOI 10.1121/1.1906583 DUNHAM MO, 1985, IEEE T COMMUN, V33, P83, DOI 10.1109/TCOM.1985.1096198 ERIKSSON T, 1995, P INT C DIG SIGN PRO, V1, P96 ERIKSSON T, 1996, P IEEE INT C AC SPEE, V2, P765 Fant G., 1970, ACOUSTIC THEORY SPEE Fant Gunnar, 1985, STL QPSR, V4, P1 FISCHER TR, 1982, P IEEE C DEC CONTR O, V3, P1222 Flanagan J., 1972, SPEECH ANAL SYNTHESI FORGY EW, 1965, BIOMETRICS, V21, P728 FOSTER J, 1985, IEEE T INFORM THEORY, V31, P348, DOI 10.1109/TIT.1985.1057035 FUJISAKI H, 1986, IEEE T ACOUST SPEECH, V3, P1605 Gersho A., 1992, VECTOR QUANTIZATION GRASS J, 1991, P IEEE INT C AC SPEE, P657, DOI 10.1109/ICASSP.1991.150425 GRAY AH, 1974, IEEE T ACOUST SPEECH, VAS22, P207, DOI 10.1109/TASSP.1974.1162572 Gray R. M., 1990, SOURCE CODING THEORY GUPTA SK, 1991, P IEEE INT C AC SPEE, V1, P481 HAAGEN J, 1992, P IEEE INT C AC SPEE, V2, P145 HEDELIN P., 1984, P IEEE INT C AC SPEE, P161 Hedelin P., 1990, P IEEE INT C AC SPEE, P361 HEDELIN P, 1986, 1986 P INT C AC SPEE, V1, P465 HEDELIN P, 1990, SPEECH COMMUN, V9, P365, DOI 10.1016/0167-6393(90)90012-X Hussain Y, 1993, IEEE T SPEECH AUDI P, V1, P25, DOI 10.1109/89.221365 ISAKSSON A, 1989, SIGNAL PROCESS, V18, P435, DOI 10.1016/0165-1684(89)90085-6 ISHIZAKA K, 1972, AT&T TECH J, V51, P1233 JOHANSSON B, 1984, THESIS CHALMERS U TE Kleijn WB, 1993, IEEE T SPEECH AUDI P, V1, P386, DOI 10.1109/89.242484 KLEIJN WB, 1996, P INT C AC SPEECH SI, V1, P212 Kubin G., 1993, P IEEE WORKSH SPEECH, P35, DOI 10.1109/SCFT.1993.762326 Lim IT, 1996, IEEE T SPEECH AUDI P, V4, P81 LINDEN J, 1994, P NORD SIGN PROC S N, V1, P198 LINDEN J, 1995, P IEEE WORKSH SPEECH, V1, P105 MCAULAY RJ, 1995, SPEECH CODING SYNTHE MCCREE A, 1996, P IEEE INT C AC SPEE, V1, P200 MILENKOVIC P, 1986, IEEE T ACOUST SPEECH, V34, P28, DOI 10.1109/TASSP.1986.1164778 PRICE PJ, 1989, SPEECH COMMUN, V8, P261, DOI 10.1016/0167-6393(89)90005-8 Rabiner L.R., 1978, DIGITAL PROCESSING S ROSENBER.AE, 1971, J ACOUST SOC AM, V49, P583, DOI 10.1121/1.1912389 Schroeder M., 1985, P IEEE INT C AC SPEE, P937 SHOHAM Y, 1993, P IEEE INT C AC SPEE, V2, P167 TEAGUE KA, 1994, P IEEE WICH C COMM N, P129 THYSSEN J, 1997, P IEEE INT C AC SPEE, V2, P1595 Tremain T.E., 1982, SPEECH TECHNOLOG APR, P40 WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260 YONG M, 1988, P IEEE INT C AC SPEE, V1, P402 NR 57 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1998 VL 24 IS 2 BP 133 EP 152 DI 10.1016/S0167-6393(98)00008-9 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX843 UT WOS:000074561000004 ER PT J AU Delogu, C Conte, S Sementina, C AF Delogu, C Conte, S Sementina, C TI Cognitive factors in the evaluation of synthetic speech SO SPEECH COMMUNICATION LA English DT Article DE cognitive factors; evaluation; perception; text-to-speech synthesis ID NATURAL SPEECH; INTELLIGIBILITY; SYSTEMS; PERCEPTION; PERFORMANCE; DEMANDS; MEMORY AB This paper illustrates the importance of various cognitive factors involved in perceiving and comprehending synthetic speech. 
It includes findings drawn from the relevant psychological and psycholinguistic literature together with experimental results obtained at the Fondazione Ugo Bordoni laboratory. Overall, it is shown that listening to and comprehending synthetic voices is more difficult than with a natural voice. However, and more importantly, this difficulty can and does decrease with the subjects' exposure to the synthetic voices. Furthermore, greater workload demands are associated with synthetic speech, and subjects listening to synthetic passages are required to pay more attention than those listening to natural passages. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Fdn Ugo Bordoni, Multimedia Commun Div, Voice Commun Grp, I-00142 Rome, Italy. Univ Palermo, Dipartimento Psicol, Palermo, Italy. Univ Rome, Dipartimento Psicol, Rome, Italy. RP Delogu, C (reprint author), Fdn Ugo Bordoni, Multimedia Commun Div, Voice Commun Grp, Via B Castiglione 59, I-00142 Rome, Italy. EM cristina@fub.it CR ABRAMS K, 1969, Q J EXP PSYCHOL, V21, P280, DOI 10.1080/14640746908400223 ALLEN J, 1992, ADV SPEECH SIGNAL PR, P741 BENDAT JS, 1972, ANAL MEASUREMENT PRO BENOIT C, 1990, SPEECH COMMUN, V9, P293, DOI 10.1016/0167-6393(90)90005-T BOOGAART T, 1992, P INT C SPEECH LANGU, V2, P1207 CARLSON R, 1992, P 6 SWED PHON C GOTH, P63 CARLSON R, 1989, P ESCA WORKSH NOORDW DAMOS DL, 1985, HUM FACTORS, V27, P409 DELOGU C, 1992, P ECCOS 92, P109 Delogu C., 1991, P EUROSPEECH 91 GENO, P353 DELOGU C, 1991, IEEE J SELECT AREAS, V9 DELOGU C, 1992, P ICSLP 92 INT C LAN, P1231 DELOGU C, 1993, P EUROSPEECH 93, V2, P1427 DELOGU C, 1995, ACTA ACUST, V3, P89 DOUST JWL, 1978, NEUROPSYCHOBIOLOGY, V4, P93 *ETSI, 1993, TECHN REP FOSS DJ, 1970, J VERB LEARN VERB BE, V9, P699, DOI 10.1016/S0022-5371(70)80035-4 GLEISS N, 1992, USABILITY CONCEPTS E GOLDSTEIN M, 1995, SPEECH COMMUN, V16, P225, DOI 10.1016/0167-6393(94)00047-E GRICE M, 1991, P EUROSPEECH 91, V2, P879 HAZAN V, 1992, QUANTIFICATION LISTE HOUSE AS, 1965, J ACOUST SOC AM, V37, P158, DOI 10.1121/1.1909295 HOWARDJONES P, 1992, SPECIFICATION LISTEN JEKOSCH U, 1993, P EUROSPEECH 93, V2, P1387 JEKOSCH U, 1992, P 2 INT C SPOK LANG, V1, P205 JEKOSCH U, 1994, J AM VOICE I O SOC, V15, P63 JERISON HJ, 1970, ATTENTION CONT THEOR, P127 LEVELT WJM, 1978, STUDIES PERCEPTION L LUCE PA, 1983, HUM FACTORS, V25, P17 MALSHEEN BJ, 1990, P ICSLP, P333 MARICS MA, 1988, HUM FACTORS, V30, P719 MARSLENWILSON W, 1980, COGNITION, V8, P1, DOI 10.1016/0010-0277(80)90015-3 MILLER GA, 1963, J VERB LEARN VERB BE, V2, P217, DOI 10.1016/S0022-5371(63)80087-0 MORAY N, 1967, ACTA PSYCHOL, V27, P84, DOI 10.1016/0001-6918(67)90048-0 Neisser U, 1967, COGNITIVE PSYCHOL NEOVIUS L, 1993, P EUROSPEECH 93, P1687 NOOTEBOOM SG, 1977, STUDIES PERCEPTION L, P75 NUSBAUM HC, 1990, P ICSLP90, P409 Nusbaum H. C., 1995, International Journal of Speech Technology, V1, DOI 10.1007/BF02277176 NYE PW, 1975, SR41 HASK LAB PARASURAMAN R, 1979, SCIENCE, V205, P924, DOI 10.1126/science.472714 PAVLOVIC CV, 1990, J ACOUST SOC AM, V87, P373, DOI 10.1121/1.399258 PISONI DB, 1981, COGNITION, V10, P249, DOI 10.1016/0010-0277(81)90054-8 POLS LCW, 1992, P 2 INT C SPOK LANG, V1, P181 POLS LCW, 1992, ADV SPEECH SIGNAL PR RALSTON JV, 1990, 16 IND U RALSTON JV, 1989, 15 IND U RALSTON JV, 1991, HUM FACTORS, V33, P471 ROSSON MB, 1985, HUMAN FACTORS COMPUT, P193 SALZA PL, 1993, DEV CONTEXT DEPENDEN Samuels S. 
J., 1987, COMPREHENDING ORAL W, P295 SCHWAB EC, 1985, HUM FACTORS, V27, P395 SILVERMAN K, 1990, P ICSLP 90, P981 SLOWIACZEK LM, 1985, HUM FACTORS, V27, P701 SPIEGEL MF, 1990, SPEECH COMMUN, V9, P279, DOI 10.1016/0167-6393(90)90004-S SYDESERFF HA, 1991, EUROSPEECH 91, V1, P277 TREISMAN AM, 1969, PSYCHOL REV, V76, P282, DOI 10.1037/h0027242 VANBEZOOIJEN R, 1990, SAM SEGMENTAL TEST E van Santen J. P. H., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1004 Voiers W. D., 1983, Speech Technology, V1 Warm J. S., 1984, SUSTAINED ATTENTION, P1 WATERWORTH JA, 1987, SPEECH LANGUAGE BASE WICKENS CD, 1981, HUM FACTORS, V23, P211 NR 63 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1998 VL 24 IS 2 BP 153 EP 168 DI 10.1016/S0167-6393(98)00009-0 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX843 UT WOS:000074561000005 ER PT J AU Evans, P AF Evans, P TI Publisher's Note SO SPEECH COMMUNICATION LA English DT Editorial Material NR 0 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1998 VL 24 IS 1 BP 1 EP 1 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX674 UT WOS:000074542600001 ER PT J AU Bourlard, H Furui, S Morgan, N AF Bourlard, H Furui, S Morgan, N TI Untitled SO SPEECH COMMUNICATION LA English DT Editorial Material NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1998 VL 24 IS 1 BP 3 EP 4 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX674 UT WOS:000074542600002 ER PT J AU Duchateau, J Demuynck, K Van Compernolle, D AF Duchateau, J Demuynck, K Van Compernolle, D TI Fast and accurate acoustic modelling with semi-continuous HMMs SO SPEECH COMMUNICATION LA English DT Article DE large vocabulary speech recognition; acoustic modelling; semi-continuous hidden Markov models; parameter tying; decision trees AB In this paper the design of accurate Semi-Continuous Density Hidden Markov Models (SC-HMMs) for acoustic modelling in large vocabulary continuous speech recognition is presented. Two methods are described to drastically improve the efficiency of the observation likelihood calculations for the SC-HMMs. First, reduced SC-HMMs are created, where each state does not share all the Gaussian probability density functions (pdfs) but only those which are important for it. It is shown how the average number of Gaussians per state can be reduced to 70 for a total set of 10000 Gaussians. Second, a novel scalar selection algorithm is presented, reducing to 5% the number of Gaussians which have to be calculated out of the total set of 10000, without any degradation in recognition performance. Furthermore, the concept of tied state context-dependent modelling with phonetic decision trees is adapted to SC-HMMs. In fact, a node splitting criterion appropriate for SC-HMMs is introduced: it is based on a distance measure between the mixtures of Gaussian pdfs as involved in SC-HMM state modelling. This contrasts with other criteria from the literature which are based on simplified pdfs to manage the algorithmic complexity. 
On the ARPA Resource Management task, a relative reduction in word error rate of 8% was achieved with the proposed criterion, compared with two known criteria based on simplified pdfs. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Katholieke Univ Leuven, Dept Elect Engn, B-3001 Heverlee, Belgium. RP Duchateau, J (reprint author), ESAT, PSI, Kardinaal Mercierlaan 94, B-3001 Heverlee, Belgium. EM jacques.duchateau@esat.kuleuven.ac.be CR Bahl L, 1991, P INT C AC SPEECH SI, P185, DOI 10.1109/ICASSP.1991.150308 BAHL LR, 1994, P ICASSP 94, V1, P533 BELLEGARDA J, 1989, P INT C AC SPEECH SI, P13 BEYERLEIN P, 1994, P ICSLP, P271 Bocchieri E., 1993, P ICASSP, V2, P692 BOULIANNE G, 1996, P ICSLP, V1, P350, DOI 10.1109/ICSLP.1996.607126 DEMUYNCK K, 1996, P ICSLP, V4, P2289, DOI 10.1109/ICSLP.1996.607264 Digalakis VV, 1996, IEEE T SPEECH AUDI P, V4, P281, DOI 10.1109/89.506931 DUCHATEAU J, 1997, P EUROSPEECH, P1183 DUGAST C, 1995, P IEEE INT C AC SPEE, V1, P524 FRITSCH J, 1995, P EUROSPEECH, P1091 Huang X. D., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90020-X HWANG MY, 1993, P IEEE ICASSP 93 MIN, V2, P311 KNILL KM, 1996, P ICSLP, V1, P470, DOI 10.1109/ICSLP.1996.607156 KUHN R, 1995, P INT C AC SPEECH SI, V1, P552 Lee C. H., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90002-N Odell J. J., 1995, THESIS U CAMBRIDGE C PAUL DB, 1991, P ICASSP 91, P329, DOI 10.1109/ICASSP.1991.150343 SIMONIN J, 1996, P ICSLP, V2, P1089, DOI 10.1109/ICSLP.1996.607795 WATANABE T, 1995, P INT C AC SPEECH SI, V1, P556 NR 20 TC 19 Z9 19 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1998 VL 24 IS 1 BP 5 EP 17 DI 10.1016/S0167-6393(98)00002-8 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX674 UT WOS:000074542600003 ER PT J AU Martin, S Liermann, J Ney, H AF Martin, S Liermann, J Ney, H TI Algorithms for bigram and trigram word clustering SO SPEECH COMMUNICATION LA English DT Article DE stochastic language modeling; statistical clustering; word equivalence classes; Wall Street Journal corpus AB In this paper, we describe an efficient method for obtaining word classes for class language models. The method employs an exchange algorithm using the criterion of perplexity improvement. The novel contributions of this paper are the extension of the class bigram perplexity criterion to the class trigram perplexity criterion, the description of an efficient implementation for speeding up the clustering process, the detailed computational complexity analysis of the clustering algorithm, and, finally, experimental results on large text corpora of about 1, 4, 39 and 241 million words, including examples of word classes, test corpus perplexities in comparison to word language models, and speech recognition results. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Technol, Lehrstuhl Informat 6, RWTH Aachen, D-52056 Aachen, Germany. RP Martin, S (reprint author), Univ Technol, Lehrstuhl Informat 6, RWTH Aachen, Ahornstr 55, D-52056 Aachen, Germany. EM martin@informatik.rwth.aachen.de; ney@informatik.rwth.aachen.de CR BAHL LR, 1983, IEEE T PATTERN ANAL, V5, P179 BELLEGARDA JR, 1996, P 1996 IEEE INT C AC, P172 Brill Eric, 1993, THESIS U PENNSYLVANI Brown P. F., 1992, Computational Linguistics, V18 Duda R. 
O., 1973, PATTERN CLASSIFICATI JARDINO M, 1996, P 1996 IEEE INT C AC, P161 JARDINO M, 1993, P 3 EUR C SPEECH COM, P1191 JARDINO M, 1994, LECT NOTES ARTIF INT, V862, P57 JELINEK F, 1991, READINGS SPEECH RECO, P450 Kneser R., 1993, P EUR C SPEECH COMM, P973 Lafferty J., 1993, Probabilistic Approaches to Natural Language. Papers from the 1992 AAAI Fall Symposium Ney H, 1997, TEXT SPEECH LANG TEC, V2, P174 NEY H, 1994, COMPUT SPEECH LANG, V8, P1, DOI 10.1006/csla.1994.1001 NEY H, 1995, IEEE T PATTERN ANAL, V17, P1202, DOI 10.1109/34.476512 Ortmanns S, 1997, COMPUT SPEECH LANG, V11, P43, DOI 10.1006/csla.1996.0022 PAUL DB, 1992, P WORKSH SPEECH NAT, P357, DOI 10.3115/1075527.1075614 Rosenfeld R., 1994, THESIS CARNEGIE MELL TAKACS L, 1984, HDB STAT, V4, P123, DOI 10.1016/S0169-7161(84)04009-8 Wessel F., 1997, P SQEL WORKSH MULT I, P55 NR 19 TC 36 Z9 36 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1998 VL 24 IS 1 BP 19 EP 37 DI 10.1016/S0167-6393(97)00062-9 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX674 UT WOS:000074542600004 ER PT J AU Kim, DY Un, CK Kim, NS AF Kim, DY Un, CK Kim, NS TI Speech recognition in noisy environments using first-order vector Taylor series SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; noise-robust; Taylor series AB In this paper, we generalize relations between clean and noisy speech signals using vector Taylor series (VTS) expansion for noise-robust speech recognition. We use it for both noisy data compensation and hidden Markov model (HMM) parameter adaptation, and apply it in the cepstral domain directly, while Moreno used it to estimate the log-spectral parameters. Also, we develop a detailed procedure to estimate environmental variables in the cepstral domain using the expectation-maximization (EM) algorithm in the maximum likelihood (ML) sense. To evaluate the developed method, we conduct speaker-independent isolated word and continuous speech recognition experiments. White Gaussian and driving car noises added to clean speech at various SNRs are used as disturbing sources. Using only noise statistics obtained from three frames of silence and noisy speech to be recognized, we achieve significant performance improvement. In particular, HMM parameter adaptation with VTS is more effective than the parallel model combination (PMC) based on the log-normal assumption. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Korea Adv Inst Sci & Technol, Dept Elect Engn, Yusong Gu, Taejon 305701, South Korea. Samsung Adv Inst Technol, Human & Comp Interact Lab, Suwon 440600, South Korea. Seoul Natl Univ, Sch Elect Engn, Kwanak Gu, Seoul 151742, South Korea. RP Kim, DY (reprint author), Korea Adv Inst Sci & Technol, Dept Elect Engn, Yusong Gu, 373-1 Kusong Dong, Taejon 305701, South Korea. 
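The exchange algorithm in the Martin, Liermann and Ney record above admits a deliberately naive sketch: each word in turn is moved to the class that maximizes the class-bigram log-likelihood, until no move helps. The efficient count-update bookkeeping that makes the published algorithm practical on hundred-million-word corpora is omitted, and the criterion below is one common form of the class-bigram likelihood, not necessarily the paper's exact formulation.

```python
# Naive exchange clustering of words into classes (toy corpus).
from collections import Counter
from math import log

def class_loglik(bigrams, cls):
    """One common form of the class-bigram likelihood criterion."""
    cc = Counter((cls[u], cls[v]) for u, v in bigrams)
    cl = Counter(cls[u] for u, _ in bigrams)
    cr = Counter(cls[v] for _, v in bigrams)
    return (sum(n * log(n) for n in cc.values())
            - sum(n * log(n) for n in cl.values())
            - sum(n * log(n) for n in cr.values()))

def exchange(words, bigrams, n_classes, iters=10):
    cls = {w: i % n_classes for i, w in enumerate(words)}   # arbitrary start
    for _ in range(iters):
        moved = False
        for w in words:             # try every class for w, keep the best
            best = max(range(n_classes),
                       key=lambda g: class_loglik(bigrams, {**cls, w: g}))
            if best != cls[w]:
                cls[w], moved = best, True
        if not moved:
            break
    return cls

text = "a b a b a b c d c d c d".split()
bigrams = list(zip(text, text[1:]))
print(exchange(sorted(set(text)), bigrams, 2))   # groups {a, c} vs {b, d}
```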
EM dykim@eekaist.kaist.ac.kr CR Acero A, 1993, ACOUSTICAL ENV ROBUS DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Erell A, 1993, IEEE T SPEECH AUDI P, V1, P68, DOI 10.1109/89.221385 GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014 Junqua J.C., 1996, ROBUSTNESS AUTOMATIC Kim DY, 1996, ELECTRON LETT, V32, P1550, DOI 10.1049/el:19961081 KIM NS, 1997, P ESCA WORKSH ROB SP, P99 Kwon OW, 1996, SPEECH COMMUN, V19, P197, DOI 10.1016/0167-6393(96)00035-0 MORENO PJ, 1996, THESIS CARNEGIE MELO NEUMEYER L, 1995, P ICASSP, V1, P141 Rose RC, 1994, IEEE T SPEECH AUDI P, V2, P245, DOI 10.1109/89.279273 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 Siohan O, 1996, SPEECH COMMUN, V18, P335, DOI 10.1016/0167-6393(96)00015-5 ZAVALIAGKOS G, 1995, P 4 EUR C SPEECH COM, P1131 NR 14 TC 70 Z9 76 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1998 VL 24 IS 1 BP 39 EP 49 DI 10.1016/S0167-6393(97)00061-7 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX674 UT WOS:000074542600005 ER PT J AU Verhasselt, J Illina, I Martens, JP Gong, Y Haton, JP AF Verhasselt, J Illina, I Martens, JP Gong, Y Haton, JP TI Assessing the importance of the segmentation probability in segment-based speech recognition SO SPEECH COMMUNICATION LA English DT Article DE segmentation probability; segment-based speech recognition; posterior segment modeling formalism ID MODELS AB The segment-based speech recognition algorithms that have been developed over the years can be divided into two broad classes. On the one hand are those using the conditional segment modeling formalism (CSM), which requires the computation of the likelihood of the sequence of acoustic vectors, conditioned on the sub-word unit sequence and corresponding segmentation. On the other hand are those using the posterior segment modeling formalism (PSM), which requires the computation of the joint posterior probability of the unit sequence and segmentation, conditioned on the sequence of acoustic vectors. The latter probability can be written as the product of a segmentation probability and a unit classification probability. In this paper, we focus on the role of the segmentation probability. After having shown that the segmentation probability is not required in the CSM formalism, we motivate its importance in the PSM formalism. Next, we describe its modeling and training. Experiments with two PSM-based recognizers on several speech recognition tasks demonstrate that the segmentation probability is essential in order to obtain a high recognition accuracy. Moreover, the importance of the segmentation probability is shown to be strongly correlated with the magnitudes of the unit probability estimates on segments that do not correspond with a unit. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Univ Ghent, Elect & Informat Syst Dept, B-9000 Ghent, Belgium. Inst Natl Rech Informat & Automat Lorraine, CNRS, Comp Sci Res Ctr, F-54506 Vandoeuvre Les Nancy, France. RP Verhasselt, J (reprint author), Univ Ghent, Elect & Informat Syst Dept, St Pietersnieuwstr 41, B-9000 Ghent, Belgium. 
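The first-order VTS relation used by Kim, Un and Kim (record above) can be sketched in the log-spectral domain, where the clean-to-noisy mismatch is y = x + log(1 + exp(n - x)). Linearizing this around the clean-model and noise means gives the adapted Gaussian parameters. The paper works in the cepstral domain via the DCT and re-estimates the noise by EM; both refinements are omitted here, and all numbers are illustrative.

```python
# First-order VTS adaptation of a diagonal Gaussian in the log-spectral domain.
import numpy as np

def vts_adapt(mu_x, var_x, mu_n, var_n):
    """y = x + log(1 + exp(n - x)), linearized around (mu_x, mu_n)."""
    d = mu_n - mu_x
    g = np.log1p(np.exp(d))          # mismatch at the expansion point
    s = 1.0 / (1.0 + np.exp(-d))     # dy/dn = sigmoid(d); dy/dx = 1 - sigmoid(d)
    mu_y = mu_x + g
    var_y = (1 - s) ** 2 * var_x + s ** 2 * var_n
    return mu_y, var_y

mu_x, var_x = np.array([5.0, 2.0]), np.array([1.0, 1.0])
mu_n, var_n = np.array([1.0, 3.0]), np.array([0.5, 0.5])
print(vts_adapt(mu_x, var_x, mu_n, var_n))
```

When the noise mean is far below the clean mean (high SNR), the sigmoid term is near zero and the adapted parameters stay close to the clean ones; when the noise dominates, the adapted Gaussian is pulled toward the noise model, which matches the intuition behind the expansion.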
CR AFIFY M, 1995, P EUROSPEECH, V1, P515 AUSTIN S, 1992, P INT C AC SPEECH SI, V1, P625 Bourlard H., 1994, CONNECTIONIST SPEECH CERISARA C, ACT 21 JOURN ET PAR, P31796 CHANG J, 1997, P EUROSPEECH, V3, P1199 FLAMMIA G, 1992, P INT C SPOK LANG PR, P983 GLASS J, 1996, P ICSLP, V4, P2277, DOI 10.1109/ICSLP.1996.607261 GONG Y, 1993, P EUR C SPEECH COMM, V3, P1759 GONG Y, 1996, P ICSLP, V1, P334 GONG Y, 1994, P IEEE INT C AC SPEE, V1, P57 Gosselin B, 1996, NEURAL PROCESS LETT, V3, P3, DOI 10.1007/BF00417783 Hashem S, 1997, NEURAL NETWORKS, V10, P599, DOI 10.1016/S0893-6080(96)00098-6 ILLINA I, 1996, P ICSLP, V1, P342, DOI 10.1109/ICSLP.1996.607124 KATKE W, 1985, AI MAG, P64 KIMBALL O, 1995, THESIS BOSTON U LAMEL I, 1993, P EUROSPEECH, V1, P121 LAMEL L, 1988, THESIS MIT LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 LEUNG HC, 1991, P EUR C SPEECH COMM, V2, P931 LEUNG HC, 1989, THESIS MIT LEUNG HC, 1990, P ICSLP, V2, P1061 LEUNG HC, 1992, P INT C AC SPEECH SI, V1, P613 MARI JF, 1996, P IEEE INT C AC SPEE, V1, P435 MARTENS JP, 1991, SPEECH COMMUN, V10, P81, DOI 10.1016/0167-6393(91)90029-S MARTENS JP, 1990, P INT C AC SPEECH SI, P401 MARTENS JP, 1994, P FORWISS CRIM ESPRI, P26 Ostendorf M, 1996, IEEE T SPEECH AUDI P, V4, P360, DOI 10.1109/89.536930 OSTENDORF M, 1989, IEEE T ACOUST SPEECH, V37, P1857, DOI 10.1109/29.45533 RABINER LR, 1986, AT&T TECH J, V65, P21 Richard M. D., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.4.461 ROBINSON AJ, 1994, IEEE T NEURAL NETWOR, V5, P298, DOI 10.1109/72.279192 Rumelhart D. E., 1986, PARALLEL DISTRIBUTED, V1 SIOHAN O, 1996, P INT C AC SPEECH SI, V1, P471 VANIMMERSEEL LM, 1992, J ACOUST SOC AM, V91, P3511, DOI 10.1121/1.402840 VEREECKEN H, 1996, P ICSLP, V1, P566, DOI 10.1109/ICSLP.1996.607180 VERHASSELT J, 1996, P IEEE PRORISC, P367 VERHASSELT J, 1997, P INT C AC SPEECH SI, V2, P1407 Werbos P., 1974, THESIS HARVARD U CAM XU L, 1992, IEEE T SYST MAN CYB, V22, P418, DOI 10.1109/21.155943 ZUE V, 1989, P INT C AC SPEECH SI, V1, P389 NR 40 TC 4 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1998 VL 24 IS 1 BP 51 EP 72 DI 10.1016/S0167-6393(97)00064-2 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX674 UT WOS:000074542600006 ER PT J AU Samudravijaya, K Singh, SK Rao, PVS AF Samudravijaya, K Singh, SK Rao, PVS TI Pre-recognition measures of speaking rate SO SPEECH COMMUNICATION LA English DT Article DE speaking rate; fast speech; speech recognition AB The accuracy of speech recognition systems is known to be affected by fast speech. If fast speech can be detected by means of a measure of speaking rate, the acoustic as well as language models of a speech recognition system can be adapted to compensate for fast speech effects. We have studied several measures of speaking rate which have the advantage that they can be computed prior to speech recognition. The proposed measures have been compared with conventional measures, viz., word and phone rate, on the TIMIT database. Some of the proposed measures have significant correlations with phone rate and vowel duration. We have shown that the mismatch between actual and expected durations of test vowels is reduced if the vowel duration models are adapted to speaking rate, as estimated by the proposed measures. 
These measures can be computed from features commonly employed in speech recognition, do not entail significant additional computational load and do not need labeling or segmentation of the unknown utterance in terms of linguistic units. (C) 1998 Elsevier Science B.V. All rights reserved. C1 Tata Inst Fundamental Res, Comp Syst & Commun Grp, Bombay 400005, Maharashtra, India. RP Samudravijaya, K (reprint author), Tata Inst Fundamental Res, Comp Syst & Commun Grp, Homi Bhabha Rd, Bombay 400005, Maharashtra, India. EM chief@tifrvax.tifr.res.in; sks@tifrvax.tifr.res.in; rao@tifrvax.tifr.res.in CR ANASTASAKOS A, 1995, P IEEE INT C AC SPEE, V1, P628 HANSEN JHL, 1995, SPEECH COMMUN, V16, P391, DOI 10.1016/0167-6393(95)00007-B JONES M, 1993, P EUR 93 BERL, V1, P311 Lamel L. F., 1986, P DARPA SPEECH REC W, P100 LINDBLOM B, 1963, J ACOUST SOC AM, V35, P1773, DOI 10.1121/1.1918816 MIRGHAFORI N, 1996, P IEEE INT C AC SPEE, P335 Mirghafori N., 1995, P EUROSPEECH 95, P491 PALLET DS, 1994, P ARPA SPOKEN LANGUAG PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183 SIEGLER MA, 1995, P IEEE INT C AC SPEE SUAUDEAU N, 1994, P IEEE INT C AC SPEE, V1, P65 Verhasselt J., 1996, P INT C SPOK LANG PR, V4, P2258, DOI 10.1109/ICSLP.1996.607256 NR 12 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1998 VL 24 IS 1 BP 73 EP 84 DI 10.1016/S0167-6393(97)00063-0 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZX674 UT WOS:000074542600007 ER PT J AU Kamm, C Walker, M Rabiner, L AF Kamm, C Walker, M Rabiner, L TI The role of speech processing in human-computer intelligent communication SO SPEECH COMMUNICATION LA English DT Article AB We are currently in the midst of a revolution in communications that promises to provide ubiquitous access to multimedia communication services. In order to succeed, this revolution demands seamless, easy-to-use, high quality interfaces to support broadband communication between people and machines. In this paper we argue that spoken language interfaces (SLIs) are essential to making this vision a reality. We discuss potential applications of SLIs, the technologies underlying them, the principles we have developed for designing them, and key areas for future research in both spoken language processing and human-computer interfaces. (C) 1997 Elsevier Science B.V. C1 AT&T Labs Res, Speech & Image Proc Serv Res Lab, Florham Pk, NJ 07932 USA. RP Kamm, C (reprint author), AT&T Labs Res, Speech & Image Proc Serv Res Lab, Florham Pk, NJ 07932 USA. CR ABELLA A, 1996, ECAI 96 SPOK DIAL PR ALSHAWI H, 1996, P 34 ANN M ASS COMP, P167, DOI 10.3115/981863.981886 Baddeley A. D., 1986, WORKING MEMORY BENNACEF S, 1996, P ISSD 96, P173 BIGORGNE D, 1993, P ICASSP, P187 Billi R., 1996, Proceedings. Third IEEE Workshop on Interactive Voice Technology for Telecommunications Applications. IVTTA-96 (Cat. No.96TH8178), DOI 10.1109/IVTTA.1996.552792 BOYCE S, 1996, P INT S SPOK DIAL IS, P65 Bub T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. 
No.96TH8206), DOI 10.1109/ICSLP.1996.607285 Campbell N., 1997, PROGR SPEECH SYNTHES, P279 CARNE EB, 1995, TELECOMMUNICATIONS P COHEN PR, 1994, VOICE COMMUNICATION, P324 COLE Ronald A., 1996, SURVEY STATE ART HUM Danieli M., 1995, P 1995 AAAI SPRING S, P34 DANIELI M, 1992, WP6000D3 Gorin AL, 1997, SPEECH COMMUN, V23, P113, DOI 10.1016/S0167-6393(97)00040-X HIRSCHBERG J, 1993, ARTIF INTELL, V63, P305, DOI 10.1016/0004-3702(93)90020-C HIRSCHBERG J, 1996, 34 ANN M ASS COMP LI, P286 HIRSCHMAN L, 1993, P 3 EUR C SPEECH COM, P1419 Hirschman L., 1993, P HUM LANG TECHN WOR, P19, DOI 10.3115/1075671.1075676 JOHNSTON JD, 1991, ADV SPEECH SIGNAL PR, P109 KAMM C, 1994, VOICE COMMUNICATION BETWEEN HUMANS AND MACHINES, P422 KAMM CA, 1997, P EUROSPEECH 1997, P2203 Kleijn W. B., 1995, SPEECH CODING SYNTHE, P1 KRAUSS RM, 1967, J ACOUST SOC AM LEVIN E, 1995, P 1995 ARPA SPOK LAN MARX M, 1996, P C HUM COMP INT CHI MENG H, 1996, P 1996 INT S SPOK DI, P165 Miller B. W., 1996, P 34 ANN M ASS COMP, P62, DOI 10.3115/981863.981872 MILLER GA, 1956, PSYCHOL REV, V63, P81, DOI 10.1037//0033-295X.101.2.343 Perdue R. J., 1996, Proceedings. Third IEEE Workshop on Interactive Voice Technology for Telecommunications Applications. IVTTA-96 (Cat. No.96TH8178), DOI 10.1109/IVTTA.1996.552708 Rabiner L, 1993, FUNDAMENTALS SPEECH Rabiner L.R., 1996, AUTOMATIC SPEECH SPE, P1 SADEK MD, 1996, P 1996 INT S SPOK DI, P169 Sagisaka Y, 1992, P ICSLP 92, V1, P483 SAPRCKJONES K, 1996, EVALUATING NATURAL L Shneiderman B., 1986, DESIGNING USER INTER Simpson A., 1993, P 3 EUR C SPEECH COM, P1423 SMITH R, 1992, P 3 C APPL NAT LANG SORIN C, 1994, PROGR PROSPECTS SPEE, P53 SPROAT R, 1995, SPEECH CODING SYNTHE, P611 SUMITA E, 1995, P TMI95 INT C THEOR, P273 VANSANTEN JPH, 1996, PROGR SPEECH SYNTHES Walker M., 1997, P EUR C SPEECH COMM, P2219 WALKER MA, 1989, HPLBRCTR89020 HEWL P WALKER MA, 1989, P HCI INT, P502 Walker Marilyn A, 1997, P 35 ANN M ASS COMP, P271 WHITTAKER S, 1989, P 4 C EUR CHAPT ASS, P116, DOI 10.3115/976815.976831 WHITTAKER S, 1991, P AAAI WORKSH MULT I WISOWATY J, 1995, P HUM FACT ERG SOC YANKELOVICH N, 1995, C HUM FACT COMP SYST NR 50 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1997 VL 23 IS 4 BP 263 EP 278 DI 10.1016/S0167-6393(97)00059-9 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZJ894 UT WOS:000073264400001 ER PT J AU Hunt, MJ Neel, FD AF Hunt, MJ Neel, FD TI Untitled SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Dragon Syst UK, Cheltenham GL52 4RW, Glos, England. CNRS, LIMSI, F-91403 Orsay, France. RP Hunt, MJ (reprint author), Dragon Syst UK, Stoke Rd, Cheltenham GL52 4RW, Glos, England. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1997 VL 23 IS 4 BP 281 EP 283 DI 10.1016/S0167-6393(97)00055-1 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZJ894 UT WOS:000073264400002 ER PT J AU Sharman, RA AF Sharman, RA TI Using a speaker-adaptable, task-customisable, dictation system for high accuracy text input: A case study SO SPEECH COMMUNICATION LA English DT Article AB This paper describes the design principles of a large vocabulary, free text, dictation system, employing advanced speech recognition technology. 
It can be used for a wide variety of tasks, and is rapidly customisable to new domains. A case study of the application of the technology in the creation of a Pathology reporting workstation is described. The objectives of using large vocabulary speech recognition technology to implement a system for the dictation of free text are described. The process of creating text, from draft to final version is discussed in terms of the pre-requisites on both the technology and its implementation. The constraints on the performance and user acceptability of a system using this technology are noted. The design of a flexible system component supporting user adaptation is shown. The use of a general purpose, standalone dictation system for a wide variety of potential applications is noted. The customisation of the system to a new task by the rapid rebuilding of its parameter tables is described. The application of a customised version of the system to an actual end user environment is exemplified by a case study of a Pathology reporting workstation. (C) 1997 Elsevier Science B.V. C1 IBM Corp, Winchester SO23 9DR, Hants, England. RP Sharman, RA (reprint author), IBM Corp, Hurlsey Pk, Winchester SO23 9DR, Hants, England. CR AVERBUCH A, 1986, P IEEE INT C AC SPEE, V1 BAHL L, 1994, P 1994 IEEE INT C AC Bahl L., 1986, P INT C AC SPEECH SI, V11, P49, DOI DOI 10.1109/ICASSP.1986.1169179> Bahl L. R., 1988, P ICASSP 88 NEW YORK, pA93 BAHL LR, 1988, P ICASSP 88 NEW YORK, P40 Brown P., 1988, P 12 INT C COMP LING, P71 BROWN PF, 1990, P IBM NAT LANG ITL P CERFDANON H, 1991, EUROSPEECH 91 GENOA DANIS C, 1989, P HUM FACT SOC AM 33 DANIS C, 1991, P HUM FACT SOC AM 35 DAS S, 1994, P INT C AC SPEECH SI, V1, P121 JARMULOVICZ M, 1996, ACP NEWS WINTER 1995, P20 JELINEK F, 1994, P HUM LANG TECHN WOR, P272, DOI 10.3115/1075812.1075873 Jelinek F., 1974, IEEE Symposium on Speech Recognition Contributed Papers JELINEK F, 1985, P IEEE, V73, P1616, DOI 10.1109/PROC.1985.13343 LAU J, 1993, P INT C AC SPEECH SI, V11, P45 LUCASSEN J, 1993, P INT C AC SPEECH SI NADAS A, 1988, P IEEE INT C AC SPEE, V1, P521 NADAS A, 1989, IEEE T ACOUST SPEECH, V37, P1495, DOI 10.1109/29.35387 RAO P, 1995, P 1995 INT C AC SPEE Ratnaparkhi A., 1994, P INT C SPOK LANG PR, P803 SHARMAN R, 1993, DISPLAYS, V14 SHARMAN R, 1994, COMPUTER J, V37 SHARMAN R, 1994, ESCA NATO C SPEECH I NR 24 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1997 VL 23 IS 4 BP 285 EP 295 DI 10.1016/S0167-6393(97)00058-7 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZJ894 UT WOS:000073264400003 ER PT J AU Hunt, MJ AF Hunt, MJ TI Practical large-vocabulary speech recognition in a multilingual environment SO SPEECH COMMUNICATION LA English DT Article AB This paper is concerned with the use of a commercial large-vocabulary speech recognition system by a team of mainstream users in the course of their everyday work. In particular, it describes use by translators working in four languages in a multilingual environment at the European Commission. The paper begins by describing some of the differences between the point-of-view of a typical researcher in speech recognition and that of a typical mainstream user. 
It points out some of the psychological barriers that must be overcome if speech recognition is to gain really widespread acceptance, and it concludes that such acceptance will depend at least as much on the sharing of experience between users as on technical advances. The overall results of the trials at the European Commission were encouragingly positive, but several unexpected problems were encountered, many of them related to the multilingual environment. The paper describes how most of these problems are being addressed. (C) 1997 Elsevier Science B.V. C1 Dragon Syst UK, Cheltenham GL52 4RW, Glos, England. RP Hunt, MJ (reprint author), Dragon Syst UK, Stoke Rd, Cheltenham GL52 4RW, Glos, England. CR BAKER JM, 1989, P ESCA EUR 89, V2 BARNETT J, 1995, P ESCA EUR 95 MADR 1, P189 Hunt M. J., 1994, Proceedings of Language Engineering Convention LAMEL L, 1995, P IEEE AUT SPEECH RE, P51 NR 4 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1997 VL 23 IS 4 BP 297 EP 305 DI 10.1016/S0167-6393(97)00056-3 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZJ894 UT WOS:000073264400004 ER PT J AU Van Coile, B Ruhl, HW Vogten, L Thoone, M Goss, S Delaey, D Moons, E Terken, JMB de Pijper, JR Kugler, M Kaufholz, P Kruger, R Leys, S Willems, S AF Van Coile, B Ruhl, HW Vogten, L Thoone, M Goss, S Delaey, D Moons, E Terken, JMB de Pijper, JR Kugler, M Kaufholz, P Kruger, R Leys, S Willems, S TI Speech synthesis for the new Pan-European traffic message control system RDS-TMC SO SPEECH COMMUNICATION LA English DT Article AB This paper reports on the speech synthesis module used to present spoken traffic messages through the car radio, as part of the "Traffic Message Control" system RDS-TMC. One of the basic ideas of this intended Pan-European service is its ability to provide traffic information in the driver's native language, independent of the language used in the geographical area or used for broadcasting. To accomplish this for an unlimited set of location names, speech synthesis is a must. To achieve best speech quality, text-to-speech techniques are combined with phonetic input that has been optimised off-line, resulting in a flexible system with almost natural sounding speech output. A prototype has been developed for German. First TMC radios for German and Germany will be presented at Internationale Funkausstellung by end of August 1997 in Berlin, which also is the start date for a regular TMC broadcasting service. Other countries and languages will follow in 1998. (C) 1997 Elsevier Science B.V. C1 Philips Car Syst Int, D-35576 Wetzlar, Germany. Lernout & Hauspie Speech Prod, B-8900 Ieper, Belgium. Inst Perceptie Onderzoek, NL-5600 MB Eindhoven, Netherlands. Philips Kommun Ind AG, D-90327 Nurnberg, Germany. Robert Bosch GmbH, D-31139 Hildesheim, Germany. RP Ruhl, HW (reprint author), Philips Car Syst Int, Philips Str 1,POB 1440, D-35576 Wetzlar, Germany. 
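The combination in the Van Coile et al. abstract above - general text-to-speech rules plus phonetic input optimised off-line for an unlimited set of location names - can be pictured as an exceptions lexicon consulted before ordinary grapheme-to-phoneme conversion. A minimal sketch; the lexicon entries and the fallback rule are invented stand-ins, not the RDS-TMC pronunciation tables:

```python
# Hand-tuned transcriptions prepared off-line for known location names
# (entries invented for illustration, SAMPA-like notation).
TUNED_LEXICON = {
    "Wetzlar": "v E t s l a: 6",
    "Ieper": "i: p @ r",
}

def naive_g2p(word):
    """Placeholder letter-to-sound fallback for out-of-lexicon words;
    a real system applies full grapheme-to-phoneme rules here."""
    return " ".join(word.lower())

def phonetic_input(word):
    # Prefer the off-line optimised transcription; fall back to rules.
    return TUNED_LEXICON.get(word, naive_g2p(word))

for name in ("Wetzlar", "Hildesheim"):
    print(name, "->", phonetic_input(name))
```

Keeping the tuned transcriptions in data rather than in code is what lets the pronunciation of individual names be improved off-line without touching the synthesiser itself.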
CR ADRIAENS LMH, 1991, THESIS EINDHOVEN Allen J., 1987, TEXT SPEECH MITALK S *CENELEC EN, 50067 CENELEC EN HIRSH IJ, 1954, J ACOUST SOC AM, V26, P530, DOI 10.1121/1.1907370 *MICH, 1995, MICH DEUTSCHL 1994 OWENS E, 1961, J SPEECH HEAR RES, V4, P113 PISONI DB, 1985, P IEEE, V73, P1665, DOI 10.1109/PROC.1985.13346 PISONI DB, 1987, TEXT SPEECH MITALK S *RDS TMC, IN PRESS RDS TMC CON VANCOILE B, 1994, P 1994 INT C SPOK LA, P423 WELLS J, 1992, 3 SAM STAG REP SEN NR 11 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1997 VL 23 IS 4 BP 307 EP 317 DI 10.1016/S0167-6393(97)00057-5 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZJ894 UT WOS:000073264400005 ER PT J AU Bellik, Y AF Bellik, Y TI Multimodal text editor interface including speech for the blind SO SPEECH COMMUNICATION LA English DT Article DE multimodal interfaces; speech recognition; speech synthesis; non visual interfaces; blind users; Rehabilitation engineering AB This paper describes the results of a joint project funded by two French research laboratories, LIMSI-CNRS and INSERM-CREARE, and one end-user organisation, INJA (National Institute for Young Blind People). This project applies multimodal interfaces including speech recognition and synthesis to provide improved computer access for the blind. A multimodal text editor designed to provide enriched texts, direct manipulation and immediate feedback for text editing tasks is described. Using speech recognition and synthesis in a combined way with pointing gestures and touch helped to resolve many problems that blind people encounter with traditional access methods. Promising results are presented, but combining speech with other modalities in the same interface also reveals some new technological problems that are hidden when speech is used in an isolated way. These problems are discussed, and user needs and expectations are presented. (C) 1997 Elsevier Science B.V. C1 CNRS, LIMSI, F-91403 Orsay, France. RP Bellik, Y (reprint author), CNRS, LIMSI, BP 133, F-91403 Orsay, France. EM bellik@limsi.fr CR BELLIK Y, 1995, THESIS PARIS 11 U OR BERLISS J, 1993, NONVISUAL HUMAN COMP, P131 BERNSEN NO, 1994, AAAI SPRING S SER S BLATTNERMM, 1990, SIGCHI B, V22, P54 BURGER D, 1994, IEEE T REHABILIT JUN EMERSON M, 1991, P 1 WORLD C TECHN WA, V3, P65 FELLBAUM K, 1994, P INT MULT HAND VIS Foley J., 1990, COMPUTER GRAPHICS PR, V2nd FROHLICH DM, 1991, P 1 EUR WORKSH STOCK HUNT MJ, 1996, P ENTR TECHN PAR MAR KAWAI S, 1996, P ASSETS 96 2 ANN AC KOCHANEK, 1994, P ICCHP 94 4 INT C C, P89 RAMSTEIN C, 1996, P ASSETS96 2 ANN ACM RUBIN A, 1992, MACINTOSCH ACCESS PE SMITH A, 1996, P ASSETS 96 2 ANN AC SORIN C, 1995, SPEECH COMMUN, V17, P273, DOI 10.1016/0167-6393(95)00035-M STEPHANIDIS C, 1991, ACCESS GRAPHICAL USE THATCHER J, 1994, P 4 INT C COMP HAND, P76 THIMBLEBY A, 1990, USER INTERFACE DESIG NR 19 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD DEC PY 1997 VL 23 IS 4 BP 319 EP 332 DI 10.1016/S0167-6393(97)00052-6 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZJ894 UT WOS:000073264400006 ER PT J AU Sorin, C Evans, P AF Sorin, C Evans, P TI Special issue on Free Speech Journal - Foreword SO SPEECH COMMUNICATION LA English DT Editorial Material NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1997 VL 23 IS 3 BP 179 EP 179 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZJ332 UT WOS:000073203400001 ER PT J AU Bernsen, NO AF Bernsen, NO TI Towards a tool for predicting speech functionality SO SPEECH COMMUNICATION LA English DT Article ID INPUT AB In these days of multimodal systems and interfaces, many research teams are investigating the purposes for which novel combinations of modalities can be used. It is easy to forget that we still lack solid foundations for evaluating the functionality of individual families of input-output modalities, such as the speech modalities. The reason why these foundations are missing is the complexity of the problem. Based on the study of particular applications, empirical investigations of speech functionality address points in a vast multi-dimensional design space. At best, solid findings yield low-level generalisations which can be used by designers developing almost identical applications. Furthermore, the conceptual and theoretical apparatus needed to describe these findings in a principled way is largely missing. This paper argues that a shift in perspective can help address issues of modality choice both scientifically and in design practice. Instead of empirically focusing on fragments of the virtually infinite combinatorics of tasks, environments, performance parameters, user groups, cognitive properties, etc., the problem of modality functionality is addressed as a problem of choosing between modalities which have very different properties with respect to the representation and exchange of information between user and system. Based on a study of 120 claims on speech functionality from the literature, it is shown that a small set of modality properties are surprisingly powerful in justifying, supporting and correcting the claims set. The paper analyses why modality properties can be used for these purposes and argues that their power could be made available to systems and interface designers who have to make modality choices during early design of speech-related systems and interfaces. Using hypertext, it is illustrated how this power may be harnessed for the purpose of predictively supporting speech modality choice during early systems and interface design. (C) 1997 Published by Elsevier Science B.V. C1 Odense Univ, Maersk Inst Prod Technol, Odense, Denmark. RP Bernsen, NO (reprint author), Odense Univ, Maersk Inst Prod Technol, Odense, Denmark. CR BABER C, 1992, BEHAV INFORM TECHNOL, V11, P216 BABER C, 1993, INTERACTIVE SPEECH T, P21 BERNSEN NO, 1997, IN PRESS COMPUTER ST BERNSEN NO, 1994, RP5TMWP11 ESPR BAS R Bernsen N. O., 1995, Design, Specification and Verification of Interactive Systems '95. 
Proceedings of the Eurographics Workshop BERNSEN NO, 1994, INTERACT COMPUT, V6, P347, DOI 10.1016/0953-5438(94)90008-6 BERNSEN NO, 1995, INTERACTIVE SYSTEMS, P235 BLESSER T, 1990, P ACM US INT SOFTW T, P135, DOI 10.1145/97924.97940 BRANDETTI M, 1988, P SPEECH 88 7 FASE S, P1305 CARD SK, 1991, ACM T INFORM SYST, V9, P99, DOI 10.1145/123078.128726 Carletta Jean C., 1995, P TWENT WORKSH LANG, P25 COLER CR, 1984, SPEECH TECH 84, V1, P95 COWLEY CK, 1990, CONT ERGONOMICS DAMPER RI, 1993, INTERACTIVE SPEECH T, P59 GORDEN DF, 1989, VOICE RECOGNITION SY GOULD JD, 1983, COMMUN ACM, V26, P295, DOI 10.1145/2163.358100 HAPESHI K, 1993, INTERACTIVE SPEECH T, P177 HELANDER M, 1988, HDB HUMAN COMPUTER I HELANDER MG, 1993, INTERACTIVE SPEECH T, pR9 HOVY E, 1990, AAAI S HUM COMP INT LAYCOCK J, 1980, 80019 RAE Lee K.-F., 1989, AUTOMATIC SPEECH REC LEWIS E, 1993, INTERACTIVE SPEECH T, P37 Mackinlay J., 1990, Human-Computer Interaction, V5, DOI 10.1207/s15327051hci0502&3_2 MARTIN GL, 1989, INT J MAN MACH STUD, V30, P355, DOI 10.1016/S0020-7373(89)80023-9 MURRAY IR, 1993, INTERACTIVE SPEECH T, P99 NICHOLSON RT, 1985, ACM T OFF INF SYST, V3, P307, DOI 10.1145/4229.4231 Noyes J., 1993, INTERACTIVE SPEECH T NOYES J, 1993, INTERACTIVE SPEECH T, P189 NOYES JM, 1989, BEHAV INFORM TECHNOL, V8, P475 Searle John R., 1979, EXPRESSION MEANING S, P1, DOI 10.1017/CBO9780511609213.003 STANTON N, 1993, INTERACTIVE SPEECH T, P45 STARR AFC, 1993, INTERACTIVE SPEECH T, P85 TUCKER P, 1993, INTERACTIVE SPEECH T, P109 USHER DM, 1993, INTERACTIVE SPEECH T, P73 VANNES FL, 1988, 23 IPO WARNER N, 1984, SPEECH TECH 84, V1, P110 WHITE RG, 1983, FSF515 RAE WHITE RW, 1984, 6 AIAA IEEE DIG AV S NR 39 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1997 VL 23 IS 3 BP 181 EP 210 DI 10.1016/S0167-6393(97)00046-0 PG 30 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZJ332 UT WOS:000073203400002 ER PT J AU Deng, L AF Deng, L TI Speech recognition using autosegmental representation of phonological units with interface to the trended HMM SO SPEECH COMMUNICATION LA English DT Article ID HIDDEN MARKOV-MODELS AB A novel speech recognizer is described which capitalizes on multi-dimensional articulatory structures and incorporates key ideas from autosegmental phonology and articulatory phonology. The novelty has been in the design of the atomic units of speech so as to arrive at a unified and parsimonious way to account for the context-dependent behavior of speech acoustics. At the heart of the recognizer is a procedure developed to automatically convert a probabilistic overlap pattern over five articulatory feature dimensions into a finite-state automaton which serves as the phonological construct of the recognizer. The phonetic-interface component of the recognizer, based on the nonstationary-state hidden Markov model or the trended HMM, is also described. Some phonetic recognition results using the TIMIT database are reported. C1 Univ Waterloo, Dept Elect & Comp Engn, Waterloo, ON N2L 3G1, Canada. RP Deng, L (reprint author), Univ Waterloo, Dept Elect & Comp Engn, Waterloo, ON N2L 3G1, Canada. 
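The trended HMM named in the Deng abstract above replaces the constant state mean of a conventional HMM with a polynomial in the time elapsed since entering the state. A minimal sketch of the resulting state-conditional log-likelihood (one-dimensional case; the coefficients and variance below are invented):

```python
import math

def trended_state_loglik(obs, coeffs, var):
    """Log-likelihood of an observation segment under one trended-HMM
    state: o_t = sum_i coeffs[i] * t**i + e_t, with e_t ~ N(0, var)
    and t the frame index measured from state entry."""
    ll = 0.0
    for t, o in enumerate(obs):
        mean = sum(b * t ** i for i, b in enumerate(coeffs))
        ll += -0.5 * (math.log(2 * math.pi * var) + (o - mean) ** 2 / var)
    return ll

# A segment that drifts upward scores well against a linear trend...
segment = [0.1, 0.4, 0.9, 1.3, 1.8]
print(trended_state_loglik(segment, coeffs=[0.0, 0.45], var=0.05))
# ...and poorly against a flat (conventional) state with the same variance.
print(trended_state_loglik(segment, coeffs=[0.9], var=0.05))
```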
EM deng@crg5.uwaterloo.ca CR BAKIS R, 1993, ARTICULATORY LIKE SP BAKIS R, 1991, P IEEE WORKSH AUT SP, P20 BITAR N, 1995, P EUR, V2, P1411 BLACKBURN C, 1995, P EUR, V2, P1623 BROWMAN CP, 1992, PHONETICA, V49, P155 DENG L, 1992, J ACOUST SOC AM, V92, P3058, DOI 10.1121/1.404202 DENG L, 1995, P INT C AC SPEECH SI, P385 DENG L, 1992, SIGNAL PROCESS, V27, P65, DOI 10.1016/0165-1684(92)90112-A DENG L, 1994, J ACOUST SOC AM, V96, P2008, DOI 10.1121/1.410144 Deng L, 1994, IEEE T SPEECH AUDI P, V2, P507 Deng L, 1996, IEEE T SPEECH AUDI P, V4, P301, DOI 10.1109/89.506934 DENG L, 1990, J ACOUST SOC AM, V87, P2738, DOI 10.1121/1.399064 DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839 EIDE E, 1993, P ICASSP, V2, P483 Goldsmith J., 1990, AUTOSEGMENTAL METRIC HUCKVALE M, 1994, SPEECH HEARING LANGU, V7, P133 HWANG M, 1993, P ICASSP, V1, P311 JUANG BH, 1990, IEEE T ACOUST SPEECH, V38, P1639, DOI 10.1109/29.60082 KENNY P, 1991, P IEEE WORKSH AUT SP, P22 KEYSER J, 1994, PHONOLOGY, V11, P207 LASS N, 1995, PRINCIPLES EXPT PHON LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 Liu S. A., 1995, THESIS MIT MAHKOUL J, 1995, P NATL ACAD SCI USA, V92, P9956 MENG H, 1991, P ICASSP, V1, P285 PEREIRA C, 1996, 9603001 CMPLG RANDOLPH M, 1994, J ACOUST SOC AM, V95 Sagey Elizabeth, 1986, THESIS MIT CAMBRIDGE Sheikhzadeh H, 1994, IEEE T SPEECH AUDI P, V2, P80, DOI 10.1109/89.260337 SIMONS AJH, 1988, P 8 EUR C ART INT, P464 Stevens K. N., 1992, P ICSLP, V1, P499 Young S., 1995, P IEEE WORKSH AUT SP, P3 NR 32 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1997 VL 23 IS 3 BP 211 EP 222 DI 10.1016/S0167-6393(97)00047-2 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZJ332 UT WOS:000073203400003 ER PT J AU Deligne, S Bimbot, F AF Deligne, S Bimbot, F TI Inference of variable-length linguistic and acoustic units by multigrams SO SPEECH COMMUNICATION LA English DT Article AB The efficiency of pattern recognition algorithms is highly conditioned to a proper definition of the patterns assumed to structure the data. The multigram model provides a statistical tool to retrieve sequential variable-length regularities within streams of data. In this paper, we present a general formulation of the model, applicable to single or multiple parallel strings of data having either discrete or continuous values. The model is first assessed to derive an inventory of variable-length sequences of letters from text data, where all spaces between the words have been removed. It turns out that the sequences of letters inferred during this fully unsupervised procedure clearly relate to the morphological structure of the text. The model is then used to infer a set of variable-length acoustic units, directly from speech data. Speech files containing examples of acoustic units are provided along with this paper in order to illustrate their consistency from an auditory point of view. We also report experiments using these acoustically defined units for continuous speech recognition. C1 Ecole Natl Super Telecommun Bretagne, Dept Signal, CNRS, URA 820, F-75634 Paris 13, France. RP Deligne, S (reprint author), ATR, Interpreting Telecommun Res Labs, Dep 1, 2-2 Hikaridai, Seika, Kyoto 61902, Japan. 
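Retrieving variable-length units with the multigram model involves two coupled problems: estimating unit probabilities (with EM) and segmenting a stream into its most probable sequence of units. The second is a short dynamic programme; the sketch below decodes with an invented, already-estimated unit inventory and assumes the string can be segmented at all:

```python
import math

# Invented inventory of variable-length units with probabilities.
UNITS = {"a": 0.3, "b": 0.2, "ab": 0.25, "ba": 0.15, "abb": 0.1}
MAX_LEN = 3  # longest unit in the inventory

def best_segmentation(s):
    """Viterbi-style search for the most probable segmentation of s,
    scoring a segmentation as the product of its units' probabilities."""
    n = len(s)
    best = [(-math.inf, None)] * (n + 1)  # (log-prob, backpointer)
    best[0] = (0.0, None)
    for i in range(1, n + 1):
        for l in range(1, min(MAX_LEN, i) + 1):
            unit = s[i - l:i]
            if unit in UNITS and best[i - l][0] > -math.inf:
                score = best[i - l][0] + math.log(UNITS[unit])
                if score > best[i][0]:
                    best[i] = (score, i - l)
    units, i = [], n
    while i > 0:  # backtrace the chosen cut points
        j = best[i][1]
        units.append(s[j:i])
        i = j
    return list(reversed(units))

print(best_segmentation("abba"))  # ['ab', 'ba']
```

In the full model the same dynamic programme, run in expectation rather than maximisation, supplies the statistics for re-estimating the unit probabilities.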
EM sdeligne@itl.atr.co.jp; bimbot@sig.enst.fr CR ATAL BS, 1983, P INT C AC SPEECH SI BACCHIANI M, 1996, P INT C AC SPEECH SI BIMBOT F, 1995, P ICPHS 95 AUG BIMBOT F, 1988, THESIS BIMBOT F, 1995, IEEE SIGNAL PROC LET, V2 BIMBOT F, 1994, P 20 JEP TREG FRANC BRENT M, 1996, COGNITION, V61 CARTWRIGHT TA, 1994, P 1 M ASS COMP LING CHOU PA, 1994, P INT C AC SPEECH SI, V1, P505 de Marcken C., 1995, UNSUPERVISED ACQUISI DELIGNE S, 1995, P INT C AC SPEECH SI DELIGNE S, 1996, LECT NOTES ARTIF INT, V1147, P156 DELIGNE S, 1995, P EUROSPEECH 95, V3, P169 DELIGNE S, 1996, THESIS DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Lee C.-H., 1988, P ICASSP, P501 Lee C.-H., 1989, P INT C AC SPEECH SI, P683 MCCANDLESS MK, 1993, P EUROSPEECH Moore R. K., 1994, PROGR PROSPECTS SPEE RIES K, 1995, P INT C AC SPEECH SI RISSANEN J, 1989, WORLD SCI SERIES COM, V15, P79 RUANAIDH JO, 1996, NUMERICAL BAYESIAN M, P8 SAGISAKA Y, 1995, SPEECH CODING SYNTHE, P685 SUHM B, 1994, P ICSLP 94 NR 24 TC 28 Z9 28 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1997 VL 23 IS 3 BP 223 EP 241 DI 10.1016/S0167-6393(97)00048-4 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZJ332 UT WOS:000073203400004 ER PT J AU Cole, RA Novick, DG Vermeulen, PJE Sutton, S Fanty, M Wessels, LFA de Villiers, JH Schalkwyk, J Hansen, B Burnett, D AF Cole, RA Novick, DG Vermeulen, PJE Sutton, S Fanty, M Wessels, LFA de Villiers, JH Schalkwyk, J Hansen, B Burnett, D TI Experiments with a spoken dialogue system for taking the US census SO SPEECH COMMUNICATION LA English DT Article AB This paper reports the results of the development, deployment and testing of a large spoken-language dialogue application for use by the general public. We built an automated spoken questionnaire for the US Bureau of the Census. In the project's first phase, the basic recognizers and dialogue system were developed using 4000 calls. In the second phase, the system was adapted to meet Census Bureau requirements and deployed in the Bureau's 1995 national test of new technologies. In the third phase, we refined the system and showed empirically that an automated spoken questionnaire could successfully collect and recognize census data, and that subjects preferred the spoken system to written questionnaires. Our large data collection effort and two subsequent field tests showed that, when questions are asked correctly, the answers contain information within the desired response categories about 99% of the time. C1 Oregon Grad Inst Sci & Technol, Dept Comp Sci & Engn, Ctr Spoken Language Understanding, Portland, OR 97291 USA. RP Cole, RA (reprint author), Oregon Grad Inst Sci & Technol, Dept Comp Sci & Engn, Ctr Spoken Language Understanding, POB 91000, Portland, OR 97291 USA. CR BARNARD E, 1995, P INT WORKSH APPL NE, V2, P186 BARNARD E, 1989, 89014 CSE OR GRAD I Bishop C. 
M., 1995, NEURAL NETWORKS PATT Boite J., 1993, P 3 EUR C SPEECH COM, P1273 COLE R, 1995, IEEE T SPEECH AUDI P, V3, P1, DOI 10.1109/89.365385 COLE R, 1994, P ICSLP 94, P683 COLE R, 1994, P INT C AC SPEECH SI, V1, P93 COLE R, 1993, P C SPOK LANG SYST T, P19 COLE RA, 1992, ADV NEURAL INFORMATI, V4 COLE RA, 1994, P ARPA WORKSH HUM LA COLE RA, 1995, P EUR MADR SPAIN FANTY M, 1992, ADV NEURAL INFORMATI, V4 FONTAINE V, 1996, P 1996 IEEE INT C AC HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HUTTER H, 1995, P 1995 IEEE INT C AC, P3311 JENKINS C, 1994, REPORT RESPONDENTS A KONIG Y, 1996, P 1996 IEEE INT C AC, P3350 *NAT CTR ED STAT, 1993, 065000005883 GPO US Richard M. D., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.4.461 SUTTON S, 1995, CSE9512 OR GRAD I DE NR 20 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1997 VL 23 IS 3 BP 243 EP 260 DI 10.1016/S0167-6393(97)00049-6 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZJ332 UT WOS:000073203400005 ER PT J AU Spiegel, M Kamm, C AF Spiegel, M Kamm, C TI 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Bell Commun Res Inc, Remote Interact Technol Res Grp, Morristown, NJ 07960 USA. AT&T Bell Labs, Res Speech & Image Proc Serv Res Lab, Florham Pk, NJ 07932 USA. RP Spiegel, M (reprint author), Bell Commun Res Inc, Remote Interact Technol Res Grp, 445 South St, Morristown, NJ 07960 USA. NR 0 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1997 VL 23 IS 1-2 BP 1 EP 3 DI 10.1016/S0167-6393(97)00051-4 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300001 ER PT J AU Johnston, D Sorin, C Gagnoulet, C Charpentier, F Canavesio, F Lochschmidt, B Alvarez, J Cortazar, I Tapias, D Crespo, C Azevedo, J Chaves, R AF Johnston, D Sorin, C Gagnoulet, C Charpentier, F Canavesio, F Lochschmidt, B Alvarez, J Cortazar, I Tapias, D Crespo, C Azevedo, J Chaves, R TI Current and experimental applications of speech technology for telecom services in Europe SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc AB This paper provides a 'snapshot' of telephony based speech technology research within European telecommunications companies. It focuses on operational and experimental telephone services exploiting speech in France, UK, Italy, Germany, Spain and Portugal and summarizes the current status of a European project (EURESCOM P502: "Multilingual Interactive Voice Activated telephone services (MIVA)"). (C) 1997 Published by Elsevier Science B.V. C1 BT Dev & Procurement, DSA23, Speech Technol Sect, Ipswich IP5 7RE, Suffolk, England. France Telecom, CNET, Lannion, France. CSELT Telecom Italia, Turin, Italy. Deutsch Telekom, Darmstadt, Germany. Telefon I&D, Madrid, Spain. Portugal Telecom, Lisbon, Portugal. RP Johnston, D (reprint author), BT Dev & Procurement, DSA23, Speech Technol Sect, Ipswich IP5 7RE, Suffolk, England. 
EM denis.johnston@bt-sys.bt.co.uk CR BILLI R, 1995, SPEECH COMMUN, V17, P263, DOI 10.1016/0167-6393(95)00030-R BILLI R, 1996, P IVTTA 96 WORKSHOP BROOKS R, 1996, BRIT TELECOM TECHN J, V14 CAMINERO J, 1996, P ICASSP 96, P401 ISSAR S, 1993, P EUROSPEECH 93, P2147 KASPAR B, 1995, P EUROSPEECH, P1161 Lamel LF, 1997, SPEECH COMMUN, V23, P67, DOI 10.1016/S0167-6393(97)00037-X ROSSET S, 1996, P IVTTA 96 WORKSHOP SADEK D, 1995, P ESCA WORKSH SPOK D, P145 SORIN C, 1995, SPEECH COMMUN, V17, P273, DOI 10.1016/0167-6393(95)00035-M SUBBIACO M, 1996, P ICIN 96 WESTALL FA, 1996, BRIT TELECOM TECHN J, V14 NR 12 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1997 VL 23 IS 1-2 BP 5 EP 16 DI 10.1016/S0167-6393(97)00050-2 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300002 ER PT J AU Kitai, M Hakoda, K Sagayama, S Yamada, T Tsukada, H Takahashi, S Noda, Y Takahashi, J Yoshida, Y Arai, K Imoto, T Hirokawa, T AF Kitai, M Hakoda, K Sagayama, S Yamada, T Tsukada, H Takahashi, S Noda, Y Takahashi, J Yoshida, Y Arai, K Imoto, T Hirokawa, T TI ASR and TTS telecommunications applications in Japan SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc DE speech recognition; text-to-speech; voice interaction; telecommunication applications AB This paper first describes recent trends of ASR and TTS telecommunications applications in Japan. ASR applications focus on public services such as operator automation, operator assistance, voice-activated information retrieval, and voice dialing. Major TTS applications include information service by voice and e-mail reading. The usage of ASR and TTS functions is expected to dramatically increase in the near future with the penetration of handy and mobile telephone terminals; hot topics are text broadcasting and digital communication. Secondly this paper describes NTT's experimental interactive system featuring (1) highly accurate speaker independent and large vocabulary speech recognition based on context-dependent accurate acoustic phoneme HMM models trained with speech data from more than 10,000 speakers collected over telephone network, (2) high quality text-to-speech synthesis that generates speech by concatenating triphone-context-dependent waveform segments, (3) software-based configuration that requires no special hardware except a PC equipped with a sound board and a voice modem, and (4) easy and rapid prototyping which enables the developer to build a system by writing some types of service scenarios. (C) 1997 Elsevier Science B.V. C1 Nippon Telegraph & Tel Corp, Human Interface Labs, Yokosuka, Kanagawa 23803, Japan. Nippon Telegraph & Tel Corp, Adv Technol Corp, Musashino, Tokyo 180, Japan. ATR, Interpreting Telecommun Res Labs, Kyoto 61902, Japan. Nippon Telegraph & Tel Corp, LSI Labs, Atsugi, Kanagawa 24301, Japan. Nippon Telegraph & Tel Corp, Business Commun Headquarters, Minato Ku, Tokyo 10819, Japan. RP Kitai, M (reprint author), Nippon Telegraph & Tel Corp, Human Interface Labs, 1-2356 Take, Yokosuka, Kanagawa 23803, Japan. 
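Point (2) of the Kitai et al. abstract above, synthesis by concatenating triphone-context-dependent waveform segments, reduces at its simplest to a lookup keyed by (left, centre, right) phone with a back-off when the exact context was never recorded. A toy sketch; the segment names and the back-off rule are invented:

```python
# Invented segment inventory: (left, centre, right) phone -> waveform.
SEGMENTS = {
    ("sil", "k", "o"): "k_sil-o.wav",
    ("k", "o", "n"): "o_k-n.wav",
    ("o", "n", "sil"): "n_o-sil.wav",
}

def select_segments(phones):
    """One waveform per phone, chosen by triphone context; if the exact
    context is unseen, back off to any segment with the right centre."""
    padded = ["sil"] + phones + ["sil"]
    chosen = []
    for i, p in enumerate(phones, start=1):
        key = (padded[i - 1], p, padded[i + 1])
        wav = SEGMENTS.get(key)
        if wav is None:  # back-off: ignore the flanking context
            wav = next((w for k, w in SEGMENTS.items() if k[1] == p), None)
        chosen.append(wav)
    return chosen

print(select_segments(["k", "o", "n"]))
```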
EM kitai@nttspch.hil.ntt.jp CR ABE M, 1996, P 1996 SOC C IEICE HAKODA K, 1995, P AVIOS95, P65 HIROKAWA T, 1993, T IEICE A, V76, P1964 HIROKAWA T, 1994, P ICSLP94, P675 *IEICE, 1996, P INF SYST SOC C IEI ISO, 1996, P 1996 INF SYST SOC, P213 *JASJ, 1996, P JASJ MAR 1995 SEP KUROIWA S, 1997, P JASJ, P173 MINAMI Y, 1992, P IVTTA92 MITOME Y, 1988, JASJ, V49, P875 MURATA Y, 1996, P 1996 S MOB COMP MA NITTA T, 1994, P ICSLP94, V2, P671 NODA Y, 1995, P EUROSPEECH 95, P913 Sugamura N., 1994, Proceedings. Second IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 94) (Cat. No.94TH0695-7), DOI 10.1109/IVTTA.1994.341548 TAKAHASHI S, 1995, P ICASSP, P520 YAMADA T, 1994, P JASJ C OCT 1994, P123 YAMAMOTO S, 1996, P JASJ C MAR 1996, P33 NR 17 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1997 VL 23 IS 1-2 BP 17 EP 30 DI 10.1016/S0167-6393(97)00044-7 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300003 ER PT J AU Perdue, RJ AF Perdue, RJ TI The way we were: Speech technology, platforms and applications in the 'Old' AT & T SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc DE connected digit recognition; final state grammars; interactive voice response; speaker dependent recognition; speaker independent recognition; speaker verification; speech recognition; speech technology; telephone operator automation; voice dialing AB The last several years have been an exciting time at AT&T in the field of advanced speech applications for telecommunications: technical progress and platform/processor advances have enabled the identification, development and testing of a range of new services. During this period, prior to the divestiture by AT&T of Lucent Technologies and NCR, AT&T brought together, under a single corporate 'roof', a research laboratory committed to advancing speech technology, business organizations building platforms to leverage this technology for telecommunications applications, and yet other business organizations with responsibility for deploying speech-enabled services to facilitate the use and reduce the cost of telecommunications services for both consumers and businesses. While this period of our corporate history has drawn to a close, we can look back to provide an overview of how technical progress, platform advances and network services needs and opportunities interacted to make speech technology an everyday experience for millions of people - and some of the lessons we learned along the way. (C) 1997 Elsevier Science B.V. C1 Lucent Technol, Bell Labs, Columbus, OH 43213 USA. RP Perdue, RJ (reprint author), Lucent Technol, Bell Labs, Room 1B-386,6200 E Broad St, Columbus, OH 43213 USA. EM rjperdue@lucent.com CR HUANG BH, 1995, AT T TECHN J, V74, P45 Longenbaker W. E., 1994, Proceedings. Second IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA 94) (Cat. No.94TH0695-7), DOI 10.1109/IVTTA.1994.341520 MIKKILINENI P, 1996, P 1996 IEEE WORKSH I ROSENBERG AE, 1994, P INT C SPEECH LANG NR 4 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD OCT PY 1997 VL 23 IS 1-2 BP 31 EP 39 DI 10.1016/S0167-6393(97)00035-6 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300004 ER PT J AU Cerf, G Naik, J Raman, V Ramanujam, V Vysotsky, G AF Cerf, G Naik, J Raman, V Ramanujam, V Vysotsky, G TI Progress in deployment and further development of the NYNEX VoiceDialing(SM) service SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc DE telephony based applications; spoken name recognition; spoken digit recognition; spoken phrase recognition; hidden Markov models AB As was reported at IVTTA-94, the NYNEX VoiceDialing(SM) service was first deployed in the NYNEX region in mid 1993. It has since been deployed in several other Bell Operating Companies' regions and substantial new developments have taken place. One of them was a transition, starting in 1995, from the hardware implementation of a DTW-based recognition algorithm to the implementation, on a general-purpose DSP, of a continuous density multi-Gaussian mixture HMM-based algorithm. As a result, it has been possible to expand the service beyond speaker-dependent name recognition to speaker-independent continuous digit recognition (SICDR) and voice-activated network control (VANC) command recognition. This paper highlights major efforts in this transition and provides a more detailed description of the system's speech recognition components. (C) 1997 Published by Elsevier Science B.V. C1 NYNEX Sci & Technol Inc, White Plains, NY 10604 USA. RP Cerf, G (reprint author), NYNEX Sci & Technol Inc, 500 Westchester Ave, White Plains, NY 10604 USA. EM george@nynexst.com CR ASADI A, 1995, T EUROSPEECH 95, P273 JOHNSTON D, 1996, P 3 IEEE WORKSH INT KITAI M, 1994, P 2 IEEE WORKSH INT NAIK J, 1994, P 2 IEEE WORKSH INT NAIK JM, 1990, IEEE COMMUN MAG JAN, P42 NETSCH LK, 1994, P 2 IEEE WORKSH INT PERDUE RJ, 1996, P 3 IEE WORKSH INT V RAMAN V, 1997, P INT C AC SPEECH SI RAMAN V, 1994, P ICSLP SORIN C, 1996, P 3 IEEE WORKSH INT VYSOTSKY GJ, 1995, SPEECH COMMUN, V17, P235, DOI 10.1016/0167-6393(95)00025-J YAMAMOTO S, 1994, P 2 IEEE WORKSH INT ZREIK L, 1994, P 2 IEEE WORKSH INT NR 13 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1997 VL 23 IS 1-2 BP 41 EP 50 DI 10.1016/S0167-6393(97)00045-9 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300005 ER PT J AU Spiegel, MF AF Spiegel, MF TI Advanced database preprocessing and preparations that enable telecommunication services based on speech synthesis SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc DE database preprocessing; speech synthesis; reverse directory services AB Speech synthesis now profitably automates services by speaking information from computer databases. Some of these services provide driving directions, traffic and timetable information, stock quotes and related financial services, and catalog ordering. 
One particularly successful telecommunication service, Automated Customer Name and Address (ACNA), sometimes called Reverse Directory Assistance (RDA), requires synthesis with high intelligibility and name pronunciation accuracy, both of which are achieved by current synthesis technology. However, even the best of current technology is not good enough to mindlessly 'drop' into complex services. Customized directory preprocessing is necessary to transform listing data, which commonly contains unconventional abbreviations, acronyms unknown to the synthesizer, and scrambled word ordering, into a sentence suitable for synthesis. We describe our directory preprocessing programs used in successful implementations of synthesis in two major US telephone companies. It is also necessary that locality terms, which have considerable geographical variability, are pronounced in accordance with local customs; otherwise, the service will have an outsider's feel to the customers. We also describe an experiment that determined whether the naturalness of recorded speech for prompts and other fixed messages offsets the undesirable discontinuity between natural and synthesized utterances. (C) 1997 Published by Elsevier Science B.V. C1 BELLCORE, Bell Commun Res, Morristown, NJ 07960 USA. RP Spiegel, MF (reprint author), BELLCORE, Bell Commun Res, 445 South St, Morristown, NJ 07960 USA. EM spiegel@bellcore.com CR BELHOULA K, 1993, P EUR, P881 COILE BV, 1992, P ICSLP 92, P487 COKER C, 1990, ESCA WORKSH SPEECH S, P83 DELOGU C, 1995, P AVIOS, P137 DOBLER S, 1993, ESCA WORKSH APPL SPE, P23 GOLDING AR, 1993, J AVIOS, V14, P1 GRANSTROM B, 1993, ESCA NATO WORKSH APP, P79 KALYANSWAMY A, 1991, P AVIOS, P205 Kitai M., 1996, Proceedings. Third IEEE Workshop on Interactive Voice Technology for Telecommunications Applications. IVTTA-96 (Cat. No.96TH8178), DOI 10.1109/IVTTA.1996.552712 LIBERMAN MY, 1978, J ACOUST SOC AM, V64, pS163, DOI 10.1121/1.2003966 PATTERSON BR, 1986, DEV EVOLUTION AUTOMA Schmandt C., 1994, VOICE COMMUNICATION SILVERMAN KEA, 1993, P 3 EUR C SPEECH COM, P2169 Sorin C., 1996, Proceedings. Third IEEE Workshop on Interactive Voice Technology for Telecommunications Applications. IVTTA-96 (Cat. No.96TH8178), DOI 10.1109/IVTTA.1996.552707 SPIEGEL MF, 1993, P AVIOS, P75 SPIEGEL MF, 1985, P AVIOS, P107 SPIEGEL MF, 1994, CURRENT TOPICS ACOUS SPIEGEL MF, 1990, P AVIOS SPIEGEL MF, 1995, 6 US TELC SPEECH RES VITALE AJ, 1991, COMPUT LINGUIST, V17, P257 YASCHIN D, 1992, P AVIOS, P309 YUSCHIK M, 1994, J AVIOS, V15, P21 NR 22 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1997 VL 23 IS 1-2 BP 51 EP 62 DI 10.1016/S0167-6393(97)00039-3 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300006 ER PT J AU Billi, R Lamel, LF AF Billi, R Lamel, LF TI RailTel: Railway Telephone Services SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc C1 LIMSI, CNRS, Spoken Language Proc Grp, F-91403 Orsay, France. RP Lamel, LF (reprint author), LIMSI, CNRS, Spoken Language Proc Grp, F-91403 Orsay, France. 
EM lamel@limsi.fr CR *RAILTEL, 1995, DEF EV METH FIELD TR *RAILTEL, 1995, RES FIELD TRIALS NR 2 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1997 VL 23 IS 1-2 BP 63 EP 65 DI 10.1016/S0167-6393(97)00038-1 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300007 ER PT J AU Lamel, LF Bennacef, SK Rosset, S Devillers, L Foukia, S Gangolf, JJ Gauvain, JL AF Lamel, LF Bennacef, SK Rosset, S Devillers, L Foukia, S Gangolf, JJ Gauvain, JL TI The LIMSI RailTel system: Field trial of a telephone service for rail travel information SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc DE spoken language systems; speech recognition; speech understanding; natural language understanding; information retrieval dialog AB This paper describes the RailTel system developed at LIMSI to provide vocal access to static train timetable information in French, and a field trial carried out to assess the technical adequacy of available speech technology for interactive services. The data collection system used to carry out the field trials is based on the LIMSI Mask spoken language system and runs on a Unix workstation with a high quality telephone interface. The spoken language system allows a mixed-initiative dialog where the user can provide any information at any point in time. Experienced users are thus able to provide all the information needed for database access in a single sentence, whereas less experienced users tend to provide shorter responses, allowing the system to guide them. The RailTel field trial was carried out using a common methodology defined by the consortium. 100 novice subjects participated in the field trials, each calling the system one time and completing a user questionnaire. Of the callers, 72% successfully completed their scenario. The subjective assessment of the prototype was for the most part favourable, with subjects expressing an interest in using such a service. (C) 1997 Elsevier Science B.V. C1 LIMSI, CNRS, Spoken Language Proc Grp, F-91403 Orsay, France. RP Lamel, LF (reprint author), LIMSI, CNRS, Spoken Language Proc Grp, F-91403 Orsay, France. EM lamel@limsi.fr CR Gauvain J. L., 1997, HUMAN COMFORT SECURI, P93 KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 Lamel L. F., 1996, Proceedings. Third IEEE Workshop on Interactive Voice Technology for Telecommunications Applications. IVTTA-96 (Cat. No.96TH8178), DOI 10.1109/IVTTA.1996.552773 LAMEL LF, 1995, P ESCA WORKSH SPOK D, P17 LAMEL LF, 1995, P EUROSPEECH 95, P1961 LAMEL LF, 1993, P ESCA NATO WORKSH A, P207 POPOVICI C, 1997, P ICASSP MUN GERM, P815 *RAILTEL, 1995, DEF EV METH FIELD TR *RAILTEL, 1995, RES FIELD TRIALS NR 9 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
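The mixed-initiative behaviour described in the LIMSI RailTel abstract above - experienced callers supply everything in one sentence, novices are guided - comes down to frame filling: accept whatever slots the understanding component returns, then prompt only for what is still missing. A minimal sketch with invented slot names and prompts:

```python
# Invented slots for a train-timetable query frame.
PROMPTS = {
    "origin": "From which station would you like to leave?",
    "destination": "Where would you like to go?",
    "time": "At what time would you like to travel?",
}

def next_action(frame):
    """Ask for the first unfilled slot; query the timetable database
    once the frame is complete."""
    for slot, prompt in PROMPTS.items():
        if slot not in frame:
            return prompt
    return "QUERY_TIMETABLE"

# One-sentence expert input completes the frame immediately...
print(next_action({"origin": "Paris", "destination": "Lyon", "time": "8am"}))
# ...while a shorter novice answer triggers guided prompting.
print(next_action({"destination": "Lyon"}))
```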
PD OCT PY 1997 VL 23 IS 1-2 BP 67 EP 82 DI 10.1016/S0167-6393(97)00037-X PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300008 ER PT J AU Billi, R Castagneri, G Danieli, N AF Billi, R Castagneri, G Danieli, N TI Field trial evaluations of two different information inquiry systems SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc DE spoken dialogue systems; speech recognition; interactive telephone services AB This paper presents the evaluations of the field trials of two information inquiry systems that use different speech technologies and different human-machine interfaces. The first system, RAILTEL, uses isolated word recognition and system driven dialogue. The second system, Dialogos, understands spontaneous speech and implements a mixed initiative dialogue strategy. Both systems allow access to the Italian Railway timetable by using the telephone over the public network. RAILTEL and Dialogos were tested in extensive field trials by inexperienced subjects. Moreover, a comparative trial with a limited number of subjects was performed in order to gain some insights on the impact of the different speech technologies on users' behaviour. The paper will provide a brief description of the two systems, details of the experimental set up of the trials, and the discussion of the results. (C) 1997 Elsevier Science B.V. C1 CSELT SpA, Ctr Studi & Lab Telecomun, I-10148 Turin, Italy. RP Billi, R (reprint author), CSELT SpA, Ctr Studi & Lab Telecomun, Via G Reiss Romoli 274, I-10148 Turin, Italy. EM roberio.billi@cselt.it CR BAGGIA P, 1993, P ICASSP 93 MINN, V2, P123 Banaka W.H., 1971, TRAINING DEPTH INTER BOROS M, 1996, P ICSLP 96 PHIL CASTAGNERI G, 1995, EEC MLAP PROJ 422 DANIELI M, 1995, WORKING NOTES AAAI95, P34 DANIELI M, 1996, P AAAI 96 WORKSH DET DANIELI M, 1997, P ACL 97 WORKSH INT GERBINO E, 1993, P 3 EUR C SPEECH COM, P1161 MANA F, 1996, P IEEE WORKSH NEUR N WALKER MA, 1997, P 35 ANN M ACL MADRI NR 10 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1997 VL 23 IS 1-2 BP 83 EP 93 DI 10.1016/S0167-6393(97)00041-1 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300009 ER PT J AU Kellner, A Rueber, B Seide, F Tran, BH AF Kellner, A Rueber, B Seide, F Tran, BH TI PADIS - An automatic telephone switchboard and directory information system SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc DE directory information; speech understanding; speech recognition; natural-language understanding; mixed-initiative dialogue; database constraints; dialogue history; N-best; stochastic context-free grammar; attributed grammar; user behaviour AB The Philips automatic telephone switchboard and directory information system PADIS provides a natural-language user interface to a telephone directory database. 
Using speech recognition and language understanding technologies, the system offers phone numbers, fax numbers, e-mail addresses, and room numbers as well as direct call completion to a desired party. In this paper, we present the underlying probabilistic framework, the system architecture, and the individual modules for speech recognition, language understanding, dialogue control, and speech output. In addition, we report results on performance and user behaviour obtained from a field test in our research lab with a 600-entry database. We derive a new maximum-a-posteriori decision rule which incorporates database knowledge and dialogue history as constraints in speech recognition and language understanding. It has improved speech understanding accuracy by 19% (in terms of concept error rate), and reduced attribute substitution errors (e.g. recognition of a wrong name) by 38%. The decision rule is implemented in a multi-stage approach as a combination of state-of-the-art speech recognition, partial parsing with an attributed stochastic context-free grammar, and an N-best algorithm which is also described in this paper. The system conducts a flexible mixed-initiative dialogue rather than using a rigid form-filling scheme, and incorporates database knowledge to optimize the dialogue flow. (C) 1997 Elsevier Science B.V. C1 Philips GmbH, Forschungslab, D-52021 Aachen, Germany. Philips Res Labs, Taipei, Taiwan. Philips Speech Proc, Dialogue Syst, D-52072 Aachen, Germany. RP Kellner, A (reprint author), Philips GmbH, Forschungslab, POB 1980, D-52021 Aachen, Germany. EM kellner@pfa.research.philips.com; rueber@pfa.research.philips.com; seide@prlt.research.philips.com CR AUST H, 1995, ESCA WORKSH SPOK DIA, P121 AUST H, 1995, SPEECH COMMUN, V17, P249, DOI 10.1016/0167-6393(95)00028-M AUST H, 1994, INT C SPOK LANG PROC, V2, P703 BESLING S, 1994, P 10 ANN C UW CTR NE, P5 DUGAST C, 1995, P IEEE INT C AC SPEE, V1, P524 FU KS, 1982, SYNTACTIC PATTERN RE, P196 GIACHIN E, 1995, P INT C AC SPEECH SI, P225 JELINEK F, 1990, NATO ASI, P345 KELLNER A, 1996, IVTTA, P117 METEER M, 1993, P IEEE INT C AC SPEE, V2, P37 NEY H, 1994, INT C SPOK LANG PROC, V4, P1355 NEY H, 1994, COMPUT SPEECH LANG, V8, P1, DOI 10.1006/csla.1994.1001 NEY H, 1986, IEEE T ACOUST SPEECH, V34, P509 OERDER M, 1993, P IEEE INT C AC SPEE, V2, P119 ORTMANNS S, 1996, INT C SPOK LANG PROC, V4, P2091 PIERACCINI R, 1991, P EUROSPEECH, P383 Schwartz R., 1991, P IEEE INT C AC SPEE, P701, DOI 10.1109/ICASSP.1991.150436 SEIDE F, 1996, INT C SPOK LANG PROC, V2, P1017 SOONG FK, 1991, P INT C AC SPEECH SI, V1, P705 TRAN BH, 1996, INT C SPOK LANG PROC, V4, P2127 Young S. J., 1993, P EUROSPEECH, P2203 NR 21 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1997 VL 23 IS 1-2 BP 95 EP 111 DI 10.1016/S0167-6393(97)00036-8 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300010 ER PT J AU Gorin, AL Riccardi, G Wright, JH AF Gorin, AL Riccardi, G Wright, JH TI How may I help you? 
SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc DE spoken language understanding; spoken dialog system; speech recognition; stochastic language modeling; salient phrase acquisition; topic classification ID RECOGNITION; MODELS AB We are interested in providing automated services via natural spoken dialog systems. By natural, we mean that the machine understands and acts upon what people actually say, in contrast to what one would like them to say. There are many issues that arise when such systems are targeted for large populations of non-expert users. In this paper, we focus on the task of automatically routing telephone calls based on a user's fluently spoken response to the open-ended prompt of "How may I help you?". We first describe a database generated from 10,000 spoken transactions between customers and human agents. We then describe methods for automatically acquiring language models for both recognition and understanding from such data. Experimental results evaluating call-classification from speech are reported for that database. These methods have been embedded within a spoken dialog system, with subsequent processing for information retrieval and form-filling. (C) 1997 Elsevier Science B.V. C1 AT&T Bell Labs, Res, Florham Pk, NJ USA. RP Gorin, AL (reprint author), AT&T Bell Labs, Res, Florham Pk, NJ USA. EM algor@research.ab.com; dsp3@research.att.com; jwright@research.att.com RI riccardi, gabriele/A-9269-2012 CR ABELLA A, 1996, P ECAI SPOK DIAL SYS BLACHMAN NM, 1968, IEEE T INFORM THEORY, V14, P27, DOI 10.1109/TIT.1968.1054094 BOYCE S, 1996, P INT S SPOK DIAL IS, P65 Cover T. M., 1991, ELEMENTS INFORMATION GARNER PN, 1997, P INT C ACOUST SPEEC GERTNER AN, 1993, ARTIFICIAL NEURAL NE, P401 GIACHIN E, 1995, P INT C AC SPEECH SI, P225 GORIN A, 1995, J ACOUST SOC AM, V97, P3441, DOI 10.1121/1.412431 Gorin A. L., 1996, Proceedings. Third IEEE Workshop on Interactive Voice Technology for Telecommunications Applications. IVTTA-96 (Cat. No.96TH8178), DOI 10.1109/IVTTA.1996.552741 GORIN AI, 1995, P ESCA WORKSH SPOK D GORIN AL, 1994, P INT C SPOKEN LANGU, P1483 Gorin AL, 1994, IEEE T SPEECH AUDI P, V2, P224, DOI 10.1109/89.260365 Gorin A. L., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607772 HENIS EA, 1994, P 8 YAL WORKSH AD LE JELINEK F, 1990, READINGS SPEECH REC, P449 LJOLJE A, 1994, COMPUT SPEECH LANG, V8, P129, DOI 10.1006/csla.1994.1006 MASATAKI H, 1996, P INT C AC SPEECH SI, V1, P188 MATSUMURA T, 1995, SPEECH COMMUN, V17, P321, DOI 10.1016/0167-6393(95)00031-I MCDONOUGH J, 1994, P ICSLP, P2163 Miller L. G., 1993, International Journal of Pattern Recognition and Artificial Intelligence, V7, DOI 10.1142/S0218001493000443 PEREIRA FCN, 1997, FINITE STATE DEVICES PESKIN B, 1993, P ARPA WORKSH HUM LA Riccardi G, 1996, COMPUT SPEECH LANG, V10, P265, DOI 10.1006/csla.1996.0014 RICCARDI G, 1997, P INT C AC SPEECH SI RILEY M, 1995, P ASR WORKSH SNOWB Riley M., 1995, P EUROSPEECH 95, P207 SANKAR A, 1993, ARTIFICIAL NEURAL NE, P324 SHARP RD, 1997, P INT C AC SPEECH SI WILPON JG, 1990, IEEE T ACOUST SPEECH, V38, P1870, DOI 10.1109/29.103088 NR 29 TC 180 Z9 180 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
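The abstract above does not give the routing rule itself, so the following is only one common reading of salience-based call-type classification: score each destination by the strongest salient phrase found in the recognised word string, and route to the best-scoring one, falling back to a clarifying subdialog when nothing salient is seen. The phrases and probabilities are invented:

```python
# Invented salience table: phrase -> P(call type | phrase).
SALIENCE = {
    "collect call": {"collect": 0.92},
    "charge this to my card": {"calling_card": 0.78, "third_number": 0.15},
    "made a long distance call": {"billing_credit": 0.85},
}

def route(utterance):
    """Route to the call type with the highest-salience matching phrase;
    with no salient phrase, hand the turn back to the dialog manager."""
    text = utterance.lower()
    scores = {}
    for phrase, dist in SALIENCE.items():
        if phrase in text:
            for call_type, p in dist.items():
                scores[call_type] = max(scores.get(call_type, 0.0), p)
    return max(scores, key=scores.get) if scores else "REPROMPT"

print(route("yes I need to make a collect call please"))  # -> collect
```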
PD OCT PY 1997 VL 23 IS 1-2 BP 113 EP 127 DI 10.1016/S0167-6393(97)00040-X PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300011 ER PT J AU Pepper, DJ Singhal, S Soper, S AF Pepper, DJ Singhal, S Soper, S TI The CallManager system: A platform for intelligent telecommunications services SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc AB Bellcore's CallManager system is an experimental prototype of a call screening and 'follow-me anywhere' service platform that makes use of a personal appointment calendar and a personal phonebook (accessed through the customer's wireline or wireless PDA) to determine the customer's current location and the importance of each caller to the customer. This is an example of a future class of network-provided services that utilize advanced speech processing and customer specific data to greatly enhance the value of the telephone network to the customer. Such services allow important calls to reach the customer while screening out unwanted or unimportant calls. These services can potentially also proactively provide information to a customer 'as it happens' wherever the customer may happen to be. (C) 1997 Published by Elsevier Science B.V. C1 BELLCORE, Morristown, NJ 07960 USA. RP Pepper, DJ (reprint author), Appl Language Technol Inc, 695 Atlantic Ave,3rd Floor, Boston, MA 02111 USA. EM pepper@alltech.com CR Belina F., 1991, SDL APPL PROTOCOL SP *BELLCORE, 1996, BELLC AIRB PROD OUSTERHOUT JK, 1993, INTRO TCL TK PEPPER D, 1996, 1996 IEEE 3 WORKSH I, P49 TELELOGIC AB, 1995, GETTING STARTED SDT VAUDREUIL G, 1996, VOICE PROFILE INTERN NR 6 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1997 VL 23 IS 1-2 BP 129 EP 139 DI 10.1016/S0167-6393(97)00043-5 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300012 ER PT J AU Mokbel, C Mauuary, L Karray, L Jouvet, D Monne, J Simonin, J Bartkova, K AF Mokbel, C Mauuary, L Karray, L Jouvet, D Monne, J Simonin, J Bartkova, K TI Towards improving ASR robustness for PSN and GSM telephone applications SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc ID SPEECH RECOGNITION AB In real-life applications, errors in the speech recognition system are mainly due to inefficient detection of speech segments, unreliable rejection of Out-Of-Vocabulary (OOV) words, and insufficient account of noise and transmission channel effects. In this paper, we review a set of techniques developed at CNET in order to increase the robustness to mismatches between training and testing conditions. These techniques are divided into two classes: preprocessing techniques and Hidden Markov Model (HMM) parameter adaptation. The results of several experiments carried out on field databases, as well as on databases collected over PSN and GSM networks, are presented. The main sources of errors are analyzed. We show that a blind equalization scheme significantly improves the recognition accuracy regarding both field and GSM data.
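The blind equalization just mentioned removes a slowly varying convolutional channel without any reference signal; one simplified way to picture it is in the cepstral domain, where a linear channel becomes an additive offset that a running mean can track and subtract. The sketch below is recursive cepstral mean subtraction under that assumption, not CNET's exact algorithm:

```python
import numpy as np

def blind_equalize(cepstra, alpha=0.995):
    """Subtract a recursively updated long-term cepstral mean from each
    frame. A stationary telephone channel shifts every cepstrum by the
    same vector, so removing the running mean approximately undoes the
    channel online, frame by frame."""
    mean = np.zeros(cepstra.shape[1])
    out = np.empty_like(cepstra)
    for t, frame in enumerate(cepstra):
        mean = alpha * mean + (1.0 - alpha) * frame
        out[t] = frame - mean
    return out

# Toy check: a constant +3.0 offset mimicking a channel is largely
# gone by the end of a 20 s (2000-frame) utterance stream.
rng = np.random.default_rng(1)
frames = rng.normal(size=(2000, 13)) + 3.0
print(np.abs(blind_equalize(frames)[-100:].mean(axis=0)).max())
```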
Speech detection allows a system to delimit the boundaries of the words to be recognized. We also use preprocessing techniques to increase the robustness of such detectors to noisy GSM speech. We show that spectral subtraction improves speech detection under noisy GSM conditions. Bayesian adaptation of HMM parameters produces models which are robust to field and GSM conditions. Models robust to GSM conditions can also be generated by linear regression adaptation of HMM parameters. Our experiments show an equivalent performance obtained with both Bayesian and linear regression adaptation of HMM parameters. The results obtained also prove that HMM adaptation and preprocessing techniques can be advantageously combined to improve Automatic Speech Recognition (ASR) robustness. (C) 1997 Elsevier Science B.V. C1 France Telecom, CNET, LAA, TSS,RCP, F-22307 Lannion, France. RP Mokbel, C (reprint author), France Telecom, CNET, LAA, TSS,RCP, 2 Av Pierre Marzin, F-22307 Lannion, France. CR BARTKOVA K, 1995, P EUROSPEECH 95, P1275 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 JOUVET D, 1991, P EUROPSPEECH, P927 JOUVET D, 1991, P EUROSPEECH 91, P923 LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902 LEGETTER CJ, 1994, P ICSLP 94, P451 LOCKWOOD P, 1991, P EUROSPEECH 91, V1, P79 MAUUARY L, 1996, P EUSIPCO 96, P125 MAUUARY L, 1993, P EUROSPEECH 93, P1097 MAUUARY L, 1994, THESIS U RENNES RENN Miglietta C. G., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607751 MOKBEL C, 1995, P EUROSPEECH 95, P1987 MOKBEL C, 1991, P INT C AC SPEECH SI Mokbel C, 1996, SPEECH COMMUN, V19, P185, DOI 10.1016/0167-6393(96)00032-5 MOKBEL C, 1992, THESIS ENST TELECOM MOKBEL C, 1995, P IEEE WORKSH ASR, P167 MOKBEL CE, 1995, IEEE T SPEECH AUDI P, V3, P346, DOI 10.1109/89.466660 MORIN D, 1991, P EUROSPEECH 9U, V2, P735 Neumeyer LG, 1994, IEEE T SPEECH AUDI P, V2, P590, DOI 10.1109/89.326617 SANKAR A, 1995, P IEEE INT C AC SPEE, P121 Shynk J.J., 1992, IEEE SIGNAL PROC JAN, P15 SORIN C, 1992, IEEE WORKSH IVTTA PI SORIN C, 1995, SPEECH COMMUN, V17, P273, DOI 10.1016/0167-6393(95)00035-M NR 24 TC 23 Z9 23 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1997 VL 23 IS 1-2 BP 141 EP 159 DI 10.1016/S0167-6393(97)00042-3 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300013 ER PT J AU Gamm, S Haeb-Umbach, R Langmann, D AF Gamm, S Haeb-Umbach, R Langmann, D TI The development of a command-based speech interface for a telephone answering machine SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 3rd IEEE Workshop on Interactive Voice Technology for Telecommunications Applications (IVTTA-96) CY SEP 30-OCT 01, 1996 CL BASKING RIDGE, NJ SP IEEE Commun Soc DE user interface design; speech interface; speech recognition; word spotting; voice mail; answering machine ID RECOGNITION AB This paper reports the design of a command-based speech interface for an answering machine or a voice mail system. Automatic speech recognition was integrated in order to facilitate the remote control and the retrieval of voice messages from any telephone in a speech-only dialogue. 
The design goal was that consumers would perceive the speech interface as a benefit compared with the common touch-tone interface. In this paper we will first describe the speech technology underlying the system. Then it will be shown how, based on this technology, the user interface was designed in a top-down approach. We started with the development of a concept and tested it by means of a Wizard-of-Oz simulation. After refining the concept in parallel design, it was implemented in a high-fidelity prototype. By means of qualitative user testing the design was improved in three iteration steps. The achievement of the design goal was finally verified with user tests in two countries. (C) 1997 Elsevier Science B.V. C1 Philips GmbH, Forschungslab, D-52085 Aachen, Germany. RP Gamm, S (reprint author), Philips Speech Proc, Kackertstr 10, D-52072 Aachen, Germany. EM gamm@acn.be.philips.com CR BENNETT RW, 1992, P SPEECH TECHNOLOGY, P222 DUGAST C, 1995, P INT C AC SPEECH SI, P524 FALCK T, 1993, P APPLICATIONS SPEEC, P125 Gamm S, 1995, PHILIPS J RES, V49, P439, DOI 10.1016/0165-5817(96)81590-7 GOULD JD, 1983, COMMUN ACM, V26, P295, DOI 10.1145/2163.358100 *GUID PROJ RACE, 1992, US ENG METH IBC SERV HaebUmbach R, 1995, PHILIPS J RES, V49, P381, DOI 10.1016/0165-5817(96)81587-7 HAUENSEIN A, 1995, P IEEE INT C AC SPEE, P425 Hermansky H., 1991, P EUROSPEECH, P1367 Hirsch H.-G., 1991, P EUROSPEECH, P413 KARIS D, 1991, IEEE J SEL AREA COMM, V9, P574, DOI 10.1109/49.81951 NEY H, 1994, PROGR PROSPECTS SPEE, P75 NIELSEN J, 1993, P ACM INTERCHI 93 C, P414, DOI 10.1145/169059.169327 OERDER M, 1994, P ICSLP 94, P703 Riley C. A., 1987, Proceedings of the Human Factors Society 31st Annual Meeting: Rising to New Heights with Technology Steinbiss V., 1994, P INT C SPOK LANG PR, P2143 *YANK GROUP, 1990, VOIC MESS ROLL BEG YANKELOVICH N, 1995, P CHI 95 NR 18 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1997 VL 23 IS 1-2 BP 161 EP 171 DI 10.1016/S0167-6393(97)00034-4 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA ZH567 UT WOS:000073124300014 ER PT J AU Valtchev, V Odell, JJ Woodland, PC Young, SJ AF Valtchev, V Odell, JJ Woodland, PC Young, SJ TI MMIE training of large vocabulary recognition systems SO SPEECH COMMUNICATION LA English DT Article AB This paper describes a framework for optimising the structure and parameters of a continuous density HMM-based large vocabulary recognition system using the Maximum Mutual Information Estimation (MMIE) criterion. To reduce the computational complexity of the MMIE training algorithm, confusable segments of speech are identified and stored as word lattices of alternative utterance hypotheses. An iterative mixture splitting procedure is also employed to adjust the number of mixture components in each state during training such that the optimal balance between the number of parameters and the available training data is achieved. Experiments are presented on various test sets from the Wall Street Journal database using up to 66 hours of acoustic training data. These demonstrate that the use of lattices makes MMIE training practicable for very complex recognition systems and large training sets. Furthermore, the experimental results show that MMIE optimisation of system structure and parameters can yield useful increases in recognition accuracy. (C) 1997 Elsevier Science B.V.
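The role of the word lattices in the MMIE abstract above is easiest to see in the criterion itself: the denominator of the mutual-information objective, in principle a sum over all possible word sequences, is restricted to the hypotheses stored in the lattice. A minimal sketch (not the authors' implementation; the data layout is assumed):

    import numpy as np
    from scipy.special import logsumexp

    def mmie_objective(utterances):
        # Each utterance: {'ref': index of the correct hypothesis,
        #                  'hyps': [(log_acoustic, log_lm), ...]}  (the lattice)
        total = 0.0
        for utt in utterances:
            scores = np.array([la + lp for la, lp in utt['hyps']])
            num = scores[utt['ref']]      # joint score of the reference string
            den = logsumexp(scores)       # competition from lattice paths only
            total += num - den            # log posterior of the correct string
        return total                      # quantity maximized by MMIE training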
C1 Univ Cambridge, Dept Engn, Cambridge CB2 1PZ, England. RP Young, SJ (reprint author), Univ Cambridge, Dept Engn, Trumpington St, Cambridge CB2 1PZ, England. EM vvl@entropic.co.uk; jo@entropic.co.uk; pcw@eng.cam.ac.uk; sjy@eng.cam.ac.uk CR AUBERT X, 1995, P ICASSP, V1, P49 Bahl L. R., 1986, P IEEE INT C AC SPEE, P49 BAHL LR, 1993, T IEEE SPEECH AUDIO, V1, P77 Gopalakrishnan P. S., 1989, P ICASSP, P631 Juang B.-H., 1992, Journal of the Acoustical Society of Japan (E), V13 KAPADIA S, 1993, P ICASSP, V2, P491 KUBALA F, 1994, P ARPA HUM LANG TECH, P37, DOI 10.3115/1075812.1075822 MCDERMOTT E, 1994, COMPUT SPEECH LANG, V8, P351, DOI 10.1006/csla.1994.1018 Normandin Y, 1994, IEEE T SPEECH AUDI P, V2, P299, DOI 10.1109/89.279279 NORMANDIN Y, 1995, P INT C ACOUST SPEEC, V1, P449 Normandin Y., 1991, THESIS MCGILL U MONT Normandin Y., 1994, P ICSLP YOK, V3, P1367 Odell J. J., 1994, P ARPA SPOK LANG TEC, P405, DOI 10.3115/1075812.1075905 Pallett D. S., 1995, P SPOK LANG TECHN WO, P5 RAINTON D, 1992, P INT C SPOK LANG PR, P233 Richardson F., 1995, P INT C AC SPEECH SI, V1, P576 STEENEKEN HJM, 1995, P EUROSPEECH, P1271 VALTCHEV V, 1996, P INT C AC SPEECH SI, V2, P605 WOODLAND PC, 1995, P ICASSP, V1, P73 WOODLAND PC, 1994, P IEEE INT C AC SPEE, V2, P125 WOODLAND PC, 1995, P ARPA WORKSH SPOK L, P104 Young SJ, 1994, P ARPA HUM LANG TECH, P307, DOI 10.3115/1075812.1075885 NR 22 TC 78 Z9 83 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1997 VL 22 IS 4 BP 303 EP 314 DI 10.1016/S0167-6393(97)00029-0 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YX763 UT WOS:000072074800001 ER PT J AU Nadeu, C Paches-Leal, P Juang, BH AF Nadeu, C Paches-Leal, P Juang, BH TI Filtering the time sequences of spectral parameters for speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; hidden Markov model (HMM); spectral estimation; linear filter; modulation frequency ID RECEPTION AB In automatic speech recognition, the signal is usually represented by a set of time sequences of spectral parameters (TSSPs) that model the temporal evolution of the spectral envelope frame-to-frame. Those sequences are then filtered either to make them more robust to environmental conditions or to compute differential parameters (dynamic features) which enhance discrimination. In this paper, we apply frequency analysis to TSSPs in order to provide an interpretation framework for the various types of parameter filters used so far. Thus, the analysis of the average long-term spectrum of the successfully filtered sequences reveals a combined effect of equalization and band selection that provides insights into TSSP filtering. Also, we show in the paper that, when supplementary differential parameters are not used, the recognition rate can be improved even for clean speech, just by properly filtering the TSSPs. To support this claim, a number of experimental results are presented, using both whole-word and subword based models. The empirically optimum filters attenuate the low-pass band and emphasize a higher band so that the peak of the average long-term spectrum of the output of these filters lies at around the average syllable rate of the employed database (approximate to 3 Hz). (C) 1997 Elsevier Science B.V. C1 Univ Politecn Cataluna, Dept Teor Senyal & Comunicacions, Barcelona 08034, Spain. AT&T Bell Labs, Lucent Technol, Murray Hill, NJ 07974 USA.
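The Nadeu et al. abstract above reduces to a simple operation: each spectral-parameter trajectory is passed through a linear filter whose passband, in the modulation-frequency domain, sits around the syllable rate. A hedged sketch (the band edges and filter order here are illustrative, not the paper's empirically optimum filters):

    import numpy as np
    from scipy.signal import butter, lfilter

    def filter_tssp(params, frame_rate=100.0, band=(1.0, 8.0), order=2):
        # params: (num_frames, num_coeffs) time sequences of spectral parameters.
        nyq = frame_rate / 2.0
        b, a = butter(order, [band[0] / nyq, band[1] / nyq], btype='band')
        # Filter every parameter trajectory along the frame (time) axis,
        # attenuating the low-pass band and emphasizing energy near ~3 Hz.
        return lfilter(b, a, params, axis=0)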
RP Nadeu, C (reprint author), Univ Politecn Cataluna, Dept Teor Senyal & Comunicacions, J Girona 1-3, Barcelona 08034, Spain. EM climent@gps.tsc.upc.es RI Nadeu, Climent/B-9638-2014 OI Nadeu, Climent/0000-0002-5863-0983 CR APPLEBAUM TH, 1990, P EUSIPCO 90, P1183 Arai T., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607318 Avendano C., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607213 BONAFONTE A, 1995, P EUR 95 MADR, P1607 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P1053, DOI 10.1121/1.408467 DRULLMAN R, 1994, J ACOUST SOC AM, V95, P2670, DOI 10.1121/1.409836 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 Greenberg S, 1997, P IEEE INT C AC SPEE, P1647 HAEBUMBACH R, 1993, P ICASSP, P239 HANSON BA, 1996, ADV TOPICS AUTOMATIC Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1997, P ESCA NATO WORKSH R, P103 Hirsch H.-G., 1991, P EUROSPEECH, P413 HOUTGAST T, 1985, J ACOUST SOC AM, V77, P1069, DOI 10.1121/1.392224 JUANG BH, 1987, IEEE T ACOUST SPEECH, V35, P947 KAISER JF, 1980, IEEE T ACOUST SPEECH, V28, P105, DOI 10.1109/TASSP.1980.1163349 KATAGISHI K, 1993, SPEECH COMMUN, V13, P297, DOI 10.1016/0167-6393(93)90028-J Lee C.-H., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90022-V Leonard R. G., 1984, P INT C AC SPEECH SI MORENO A, 1993, SAMA6002 NADEU C, 1995, P EUR 95, P1381 NADEU C, 1994, P ICSLP, P1927 NADEU C, 1996, P ICSLP 96 PHIL, P2348 Oppenheim A. V., 1989, DISCRETE TIME SIGNAL Paches-Leal P., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607789 PICONE JW, 1993, P IEEE, V81, P1215, DOI 10.1109/5.237532 Rabiner L, 1993, FUNDAMENTALS SPEECH Rosenberg A. E., 1994, P INT C SPOK LANG PR, P1835 SLEPIAN D, 1978, AT&T TECH J, V57, P1371 TOHKURA Y, 1987, IEEE T ACOUST SPEECH, V35 Wilpon J. G., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1002 NR 31 TC 22 Z9 22 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1997 VL 22 IS 4 BP 315 EP 332 DI 10.1016/S0167-6393(97)00030-7 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YX763 UT WOS:000072074800002 ER PT J AU Sukkar, RA Setlur, AR Lee, CH Jacob, J AF Sukkar, RA Setlur, AR Lee, CH Jacob, J TI Verifying and correcting recognition string hypotheses using discriminative utterance verification SO SPEECH COMMUNICATION LA English DT Article AB Utterance verification (UV) is a process by which the output of a speech recognizer is verified to determine if the input speech actually includes the recognized keyword(s). The output of the speech verifier is a binary decision to accept or reject the recognized utterance based on a UV confidence score. In this paper, we extend the notion of utterance verification by presenting an utterance verification method that will be utilized to perform three tasks: (1) detect non-keyword strings (false alarms), (2) detect keyword substitution errors, and (3) selectively correct substitution errors when N-best string hypotheses are available. The utterance verification method presented here employs a set of verification-specific models that are independent of the models used in the recognition process.
The verification models are trained using a discriminative training procedure that seeks to minimize the verification error by simultaneously maximizing the rejection of non-keywords and misrecognized keywords while minimizing the rejection of correctly recognized keywords. The error correction is performed by reordering the hypotheses produced by an N-best recognizer based on a UV confidence score. (C) 1997 Elsevier Science B.V. C1 AT&T Bell Labs, Lucent Technol, Naperville, IL 60566 USA. AT&T Bell Labs, Lucent Technol, Murray Hill, NJ 07974 USA. RP Sukkar, RA (reprint author), AT&T Bell Labs, Lucent Technol, 2000 N Naperville Rd,Room 4G-150, Naperville, IL 60566 USA. EM rafid.sukkar@lucent.com CR BOURLARD H, 1994, P INT C AC SPEECH SI, V1, P373 CAMINEROGIL FJ, 1995, P EUR C MADR, V3, P2099 CHIGIER B, 1992, P INT C AC SPEECH SI, V2, P93 CHOU W, 1994, P ICSLP 94, P439 CHOU W, 1992, P INT C AC SPEECH SI, V1, P473 FENG MW, 1992, P 1992 INT C SPOK LA, V1, P21 JEANRENAUD P, 1994, P INT C AC SPEECH SI, V1, P381 KOMORI Y, 1992, P 1992 INT C SPOK LA, P9 LIPPMANN RP, 1994, P INT C AC SPEECH SI, V1, P389 Mardia KV, 1979, MULTIVARIATE ANAL RAHIM M, 1995, P EUR C SPEECH COMM, V1, P529 RAHIM M, 1995, P INT C AC SPEECH SI, V1, P285 RAHIM MG, 1996, P INT C AC SPEECH SI, V4, P3585 ROHLICEK J, 1993, P INT C AC SPEECH SI, V2, P459 ROSE R, 1995, P INT C AC SPEECH SI, V1, P281 ROSE R, 1992, P INT C AC SPEECH SI, V2, P105 SETLUR AR, 1996, P INT C SPOK LANG PR, V2, P602, DOI 10.1109/ICSLP.1996.607433 SUKKAR R, 1994, P INT C AC SPEECH SI, V1, P393 SUKKAR RA, 1995, P EUR C MADR, V3, P1629 SUKKAR RA, 1996, P INT C AC SPEECH SI, V1, P518 Sukkar R.A., 1993, P INT C AC SPEECH SI, V2, P451 VILLARRUBIA L, 1993, P INT C AC SPEECH SI, V2, P455 Weintraub M, 1995, P INT C AC SPEECH SI, V1, P297 NR 23 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1997 VL 22 IS 4 BP 333 EP 342 DI 10.1016/S0167-6393(97)00031-9 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YX763 UT WOS:000072074800003 ER PT J AU Bangayan, P Long, C Alwan, AA Kreiman, J Gerratt, BR AF Bangayan, P Long, C Alwan, AA Kreiman, J Gerratt, BR TI Analysis by synthesis of pathological voices using the Klatt synthesizer SO SPEECH COMMUNICATION LA English DT Article DE analysis-by-synthesis; voice quality; pathological voices ID SPEECH SYNTHESIS; VOCAL QUALITY; PERCEPTION; NOISE AB The ability to synthesize pathological voices may provide a tool for the development of a standard protocol for assessment of vocal quality. An analysis-by-synthesis approach using the Klatt formant synthesizer was applied to study 24 tokens of the vowel /a/ spoken by males and females with moderate-to-severe voice disorders. Both temporal and spectral features of the natural waveforms were analyzed and the results were used to guide synthesis. Perceptual evaluation indicated that about half the synthetic voices matched the natural waveforms they modeled in quality. The stimuli that received poor ratings reflected failures to model very unsteady or "gargled" voices or failures in synthesizing perfect copies of the natural spectra. Several modifications to the Klatt synthesizer may improve synthesis of pathological voices. 
These modifications include providing jitter and shimmer parameters; updating synthesis parameters as a function of period, rather than absolute time; modeling diplophonia with independent parameters for fundamental frequency and amplitude variations; providing a parameter to increase low-frequency energy; and adding more pole-zero pairs. (C) 1997 Elsevier Science B.V. C1 Univ Calif Los Angeles, Sch Engn & Appl Sci, Dept Elect Engn, Los Angeles, CA 90095 USA. Univ Calif Los Angeles, Sch Med, Div Head & Neck Surg, Los Angeles, CA 90095 USA. RP Alwan, AA (reprint author), Univ Calif Los Angeles, Sch Engn & Appl Sci, Dept Elect Engn, 66-147E Engr 4,405 Hilgard Ave,Box 951594, Los Angeles, CA 90095 USA. EM alwan@icsl.ucla.edu CR Ananthapadmanabha T., 1984, 2 ROYAL I TECHN SPEE, P1 BICKLEY C, 1982, 1 MIT SPEECH COMM GR, P71 CARLSON R, 1991, SPEECH COMMUN, V10, P481, DOI 10.1016/0167-6393(91)90051-T CHILDERS DG, 1995, J ACOUST SOC AM, V97, P505, DOI 10.1121/1.412276 CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 Fant Gunnar, 1985, STL QPSR, V4, P1 Fujisaki H., 1986, P IEEE ICASSP APR, V11, P1605 GERRATT BR, 1993, J SPEECH HEAR RES, V36, P14 GERRATT BR, 1988, 115 M AC SOC AM SEAT GOBL C, 1988, STL QPSR, V1, P123 GOBL C, 1992, SPEECH COMMUN, V11, P481, DOI 10.1016/0167-6393(92)90055-C Heiberger V, 1982, SPEECH LANGUAGE ADV, V7, P299 HILLENBRAND J, 1987, J SPEECH HEAR RES, V30, P448 HILLENBRAND J, 1988, J ACOUST SOC AM, V83, P2361, DOI 10.1121/1.396367 IMAIZUMI S, 1991, VOCAL FOLD PHYSL ACO, P225 JENSEN PJ, 1965, EYE EAR NOSE THROAT, V44, P77 KARLSSON I, 1992, SPEECH COMMUN, V11, P491, DOI 10.1016/0167-6393(92)90056-D KARLSSON I, 1991, J PHONETICS, V19, P111 KEMPSTER GB, 1991, J SPEECH HEAR RES, V34, P534 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 KREIMAN J, 1994, J ACOUST SOC AM, V96, P1291, DOI 10.1121/1.410277 KREIMAN J, 1993, 125 M AC SOC AM Kreiman J, 1996, J ACOUST SOC AM, V100, P1787, DOI 10.1121/1.416074 Ladefoged P, 1995, VOCAL FOLD, P61 LALWANI AL, 1991, P IEEE, P505, DOI 10.1109/ICASSP.1991.150387 Laver J, 1980, PHONETIC DESCRIPTION Lofqvist A, 1995, VOCAL FOLD, P3 MOORE P, 1958, FOLIA PHONIATR, V10, P205 PRICE PJ, 1989, SPEECH COMMUN, V8, P261, DOI 10.1016/0167-6393(89)90005-8 QI YY, 1995, J ACOUST SOC AM, V98, P2461, DOI 10.1121/1.413279 Winholtz WS, 1997, J SPEECH LANG HEAR R, V40, P894 YUMOTO E, 1982, J ACOUST SOC AM, V71, P1544, DOI 10.1121/1.387808 NR 32 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1997 VL 22 IS 4 BP 343 EP 368 DI 10.1016/S0167-6393(97)00032-0 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YX763 UT WOS:000072074800004 ER PT J AU Chien, JT Wang, HC AF Chien, JT Wang, HC TI Telephone speech recognition based on Bayesian adaptation of hidden Markov models SO SPEECH COMMUNICATION LA English DT Article DE telephone speech recognition; model adaptation; hidden Markov model; maximum a posteriori estimation; affine transformation ID MAXIMUM-LIKELIHOOD; SPEAKER ADAPTATION; PARAMETERS; VERIFICATION; ALGORITHM AB This paper presents an adaptation method of speech hidden Markov models (HMMs) for telephone speech recognition. Our goal is to automatically adapt the HMM parameters so that the adapted HMM parameters can match the telephone environment. In this study, two kinds of transformation-based adaptations are investigated.
One is the bias transformation and the other is the affine transformation. A Bayesian estimation technique which incorporates prior knowledge into the transformation is applied for estimating the transformation parameters. Experiments show that the proposed approach can be successfully employed for self adaptation as well as supervised adaptation. Besides, the performance of telephone speech recognition using Bayesian adaptation is shown to be superior to that using maximum-likelihood adaptation. The affine transformation is also demonstrated to be significantly better than the bias transformation. (C) 1997 Elsevier Science B.V. C1 Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 30043, Taiwan. RP Wang, HC (reprint author), Natl Tsing Hua Univ, Dept Elect Engn, Hsinchu 30043, Taiwan. EM hcwang@ee.nthu.edu.tw CR ACERO A, 1990, IEEE P INT C AC SPEE, V1, P849 ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 CHIEN JT, 1997, THESIS NATL TSING HU CHIEN JT, 1996, P INT C SPOK LANG PR, V3, P1840 CHIEN JT, 1997, IN PRESS P 5 EUR C S CHIEN JT, 1995, P EUR C SPEECH COMM, V2, P1541 Chien JT, 1995, IEE P-VIS IMAGE SIGN, V142, P395, DOI 10.1049/ip-vis:19952274 Cox S. J., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266423 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DIGALAKIS VV, 1995, IEEE T SPEECH AUDI P, V3, P357, DOI 10.1109/89.466659 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Gauvain JL, 1994, IEEE T SPEECH AUDI P, V2, P291, DOI 10.1109/89.279278 GAUVAIN JL, 1992, SPEECH COMMUN, V11, P205, DOI 10.1016/0167-6393(92)90015-Y GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HUO Q, 1995, IEEE T SPEECH AUDI P, V3, P334 JUANG BH, 1990, IEEE T ACOUST SPEECH, V38, P1639, DOI 10.1109/29.60082 Juang B. H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E LASRY MJ, 1984, IEEE T PATTERN ANAL, V6, P530 LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902 Lee C.-H., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90022-V Lee C.-H., 1997, P ESCA NATO WORKSH R, P45 LEGGETTER CJ, 1995, COMPUT SPEECH LANG, V9, P171, DOI 10.1006/csla.1995.0010 MENDEL JM, 1995, LESSONS ESTIMATION T, P165 Neumeyer LG, 1994, IEEE T SPEECH AUDI P, V2, P590, DOI 10.1109/89.326617 Rahim MG, 1996, IEEE T SPEECH AUDI P, V4, P19 Sankar A, 1996, IEEE T SPEECH AUDI P, V4, P190, DOI 10.1109/89.496215 TAKAGI K, 1995, INT CONF ACOUST SPEE, P149, DOI 10.1109/ICASSP.1995.479386 TAKAHASHI J, 1994, P INT C SPOK LANG PR, P991 VITERBI AJ, 1967, IEEE T INFORM THEORY, V13, P260, DOI 10.1109/TIT.1967.1054010 WIDROW B, 1985, ADAPTIVE SIGNAL PROC, P56 ZAVALIAGKOS G, 1995, INT CONF ACOUST SPEE, P676, DOI 10.1109/ICASSP.1995.479688 ZAVALIAGKOS G, 1995, P 4 EUR C SPEECH COM, V2, P1131 NR 33 TC 21 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
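For the bias transformation discussed in the Chien and Wang abstract above, the Bayesian (MAP) estimate has a closed form under a conjugate Gaussian prior; a sketch with an illustrative prior weight tau (the affine case additionally estimates a scaling term, omitted here):

    import numpy as np

    def map_bias(obs, aligned_means, tau=10.0, prior_mean=None):
        # obs: adaptation frames; aligned_means: HMM state means aligned
        # to those frames (same shape).  tau and the zero-bias prior are
        # assumptions for illustration, not the paper's settings.
        if prior_mean is None:
            prior_mean = np.zeros(obs.shape[1])
        residual = (obs - aligned_means).sum(axis=0)  # ML sufficient statistic
        T = obs.shape[0]
        # MAP interpolates the prior mean and the ML bias; more adaptation
        # data (larger T) pulls the estimate toward the ML solution.
        return (tau * prior_mean + residual) / (tau + T)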
PD SEP PY 1997 VL 22 IS 4 BP 369 EP 384 DI 10.1016/S0167-6393(97)00033-2 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YX763 UT WOS:000072074800005 ER PT J AU Imperl, B Kacic, Z Horvat, B AF Imperl, B Kacic, Z Horvat, B TI A study of harmonic features for the speaker recognition SO SPEECH COMMUNICATION LA English DT Article DE speaker identification; speaker verification; harmonic features; harmonic decomposition of line spectrum ID VERIFICATION AB In this paper the harmonic features based on the harmonic decomposition of the Hildebrand-Prony line spectrum are introduced. A Hildebrand-Prony method of spectral analysis was applied because of its high resolution and accuracy. Comparative tests with the LP and LP-cepstral features were made with 50 speakers from the Slovene database SNABI (isolated words corpus) and 50 speakers of the German database BAS Siemens 100 (utterances of sentences). With both databases the advantages of the harmonic features were observed especially for speaker identification, while for speaker verification the harmonic features performed better on the SNABI database and as well as the LP-cepstral features on the BAS Siemens 100 database. (C) 1997 Elsevier Science B.V. C1 Univ Maribor, Fac Elect Engn & Comp Sci, Digital Signal Proc Lab, SLO-2000 Maribor, Slovenia. RP Imperl, B (reprint author), Univ Maribor, Fac Elect Engn & Comp Sci, Digital Signal Proc Lab, Smetanova 17, SLO-2000 Maribor, Slovenia. EM bojan.imperl@uni-mb.si CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 CHAGNOLLEAU IM, 1995, P EUROSPEECH MADR, P337 EATOCK JP, 1994, P INT C AC SPEECH SI, P133 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 Furui S., 1989, DIGITAL SPEECH PROCE KACIC Z, 1994, SLOV SECT IEEE ERK 9, P327 Kacic Z., 1992, Proceedings of the IEEE-SP International Symposium Time-Frequency and Time-Scale Analysis (Cat.No.92TH0478-8), DOI 10.1109/TFTSA.1992.274126 KACIC Z, 1993, P EUROSPEECH, P697 KAY SM, 1981, P IEEE, V69, P1380, DOI 10.1109/PROC.1981.12184 LEFLOCH JL, 1994, P INT C AC SPEECH SI, P149 LIU CS, 1990, P IEEE INT C AC SPEE, P277 Marple Jr S. L., 1987, DIGITAL SPECTRAL ANA Parsons T.W., 1986, VOICE SPEECH PROCESS SULLIVAN TM, 1978, HIGH RESOLUTION SIGN NR 14 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1997 VL 22 IS 4 BP 385 EP 402 DI 10.1016/S0167-6393(97)00053-8 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YX763 UT WOS:000072074800006 ER PT J AU Liu, L He, JL Palm, G AF Liu, L He, JL Palm, G TI Effects of phase on the perception of intervocalic stop consonants SO SPEECH COMMUNICATION LA English DT Article DE phase; amplitude; stop consonant; perception; voicing; place of articulation ID RECONSTRUCTION; MAGNITUDE; SIGNAL AB Identification experiments were performed to assess the relative importance of Fourier phase versus amplitude for intervocalic stop consonant perception. In the first experiment, three types of stimuli were constructed from VCV signals: (1) swapped stimuli: a swapped stimulus has the amplitude spectra of its constituent segments from one VCV signal and the corresponding phase spectra from another; (2) phase-only stimuli; and (3) amplitude-only stimuli.
It was shown that the perception of intervocalic stop consonants varies from amplitude dominance to phase dominance as the Fourier analysis window size increases. The crossover lies somewhere between 192ms and 256ms. The influence of phase at smaller window sizes was elaborated in the second experiment. This experiment demonstrated that phonetically different signals can be constructed by combining the same short-time amplitude spectra with different phase spectra, so the short-time amplitude spectra displayed on a spectrogram cannot exclusively specify a stop consonant. In both experiments, the perception of voicing in stops was found to rely strongly on phase information while the perception of the place of articulation was mainly determined by amplitude information. (C) 1997 Elsevier Science B.V. C1 Univ Ulm, Dept Neural Informat Proc, D-89069 Ulm, Germany. RP Liu, L (reprint author), Univ Ulm, Dept Neural Informat Proc, Geb O27, D-89069 Ulm, Germany. CR BUUNEN TJF, 1974, J ACOUST SOC AM, V55, P297, DOI 10.1121/1.1914501 Chapin EK, 1934, J ACOUST SOC AM, V5, P173, DOI 10.1121/1.1915646 COLE RA, 1974, PERCEPT PSYCHOPHYS, V15, P101, DOI 10.3758/BF03205836 DARWIN CJ, 1986, J ACOUST SOC AM, V79, P838, DOI 10.1121/1.393474 Fant G., 1973, SPEECH SOUNDS FEATUR GOLDSTEI.JL, 1967, J ACOUST SOC AM, V41, P458, DOI 10.1121/1.1910357 HAYES MH, 1980, IEEE T ACOUST SPEECH, V28, P672, DOI 10.1109/TASSP.1980.1163463 HAYES MH, 1982, IEEE T ACOUST SPEECH, V30, P140, DOI 10.1109/TASSP.1982.1163863 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 Licklider JCR, 1957, J ACOUST SOC AM, V29, P780, DOI 10.1121/1.1918901 LIU L, 1992, J ACOUST SOC AM, V92, P2340, DOI 10.1121/1.404958 Massaro D., 1975, UNDERSTANDING LANGUA MATHES RC, 1947, J ACOUST SOC AM, V19, P780, DOI 10.1121/1.1916623 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE OPPENHEIM AV, 1981, P IEEE, V69, P529, DOI 10.1109/PROC.1981.12022 OSHAUGHNESSY D, 1990, SPEECH COMMUNICATION PATTERSON RD, 1987, J ACOUST SOC AM, V82, P1560, DOI 10.1121/1.395146 Rabiner L.R., 1978, DIGITAL PROCESSING S Schatz CD, 1954, LANGUAGE, V30, P47, DOI 10.2307/410219 SCHROEDER MR, 1959, J ACOUST SOC AM, V31, P1579, DOI 10.1121/1.1930316 SCHROEDER MR, 1986, J ACOUST SOC AM, V79, P1580, DOI 10.1121/1.393292 TARTTER VC, 1983, J ACOUST SOC AM, V74, P715, DOI 10.1121/1.389857 TRAUNMULLER H, 1987, PSYCHOPHYSICS SPEECH, P377 VON HELMHOLTZ H. L. F., 1954, SENSATIONS TONE WINTZ H, 1972, J ACOUST SOC AM, V51, P1309 Zwicker E., 1952, ACUSTICA S3, V2, P125 ZWICKER E, 1980, J ACOUST SOC AM, V68, P1523, DOI 10.1121/1.385079 NR 28 TC 34 Z9 34 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1997 VL 22 IS 4 BP 403 EP 417 DI 10.1016/S0167-6393(97)00054-X PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YX763 UT WOS:000072074800007 ER PT J AU Perrier, P Laboissiere, R Abry, C Maeda, S AF Perrier, P Laboissiere, R Abry, C Maeda, S TI Speech production: Models and data SO SPEECH COMMUNICATION LA English DT Editorial Material C1 Ecole Natl Super Elect & Radioelect Grenoble, Grenoble, France. CNRS, Inst Commun parlee, INPG, F-75700 Paris, France. Univ Grenoble 3, Maitre Conferences Phonet, Grenoble, France. Ecole Natl Super Telecommun Bretagne, CNRS, Res Unit 820, Paris, France. RP Perrier, P (reprint author), Ecole Natl Super Elect & Radioelect Grenoble, Grenoble, France. 
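The 'swapped stimuli' of the Liu et al. experiment above (amplitude spectra from one VCV token, phase spectra from another) can be approximated with a short-time Fourier transform; a sketch, with the caveat that overlap-add resynthesis of a hybrid spectrogram is only approximately consistent:

    import numpy as np
    from scipy.signal import stft, istft

    def swapped_stimulus(x_amp, x_phase, fs, win_ms=256):
        # x_amp and x_phase must be equal-length signals.  The analysis
        # window size is the variable that moved listeners from amplitude
        # dominance to phase dominance in the experiments.
        nper = int(fs * win_ms / 1000.0)
        _, _, A = stft(x_amp, fs=fs, nperseg=nper)
        _, _, P = stft(x_phase, fs=fs, nperseg=nper)
        hybrid = np.abs(A) * np.exp(1j * np.angle(P))
        _, y = istft(hybrid, fs=fs, nperseg=nper)
        return y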
RI Laboissiere, Rafael/E-9814-2013 OI Laboissiere, Rafael/0000-0002-2180-9250 NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1997 VL 22 IS 2-3 BP 89 EP 92 DI 10.1016/S0167-6393(97)00028-9 PG 4 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600001 ER PT J AU Deng, L Ramsay, G Sun, D AF Deng, L Ramsay, G Sun, D TI Production models as a structural basis for automatic speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE speech production; speech recognition; analysis by synthesis; stochastic modeling; nonlinear phonology; phonetic interface; articulatory features; articulatory dynamics; stochastic target model ID HIDDEN MARKOV MODEL; MOTOR CONTROL; UNITS; TRAJECTORIES; CONSONANTS; DYNAMICS; FEATURES AB We postulate in this paper that highly structured speech production models will have much to contribute to the ultimate success of speech recognition in view of the weaknesses of the theoretical foundation underpinning current technology. These weaknesses are analyzed in terms of phonological modeling and of phonetic-interface modeling. We present two probabilistic speech recognition models with the structure designed based on approximations to human speech production mechanisms, and conclude by suggesting that many of the advantages to be gained from interaction between the speech production and speech recognition communities will develop from integrating production models with the probabilistic analysis-by-synthesis strategy currently used by the technology community. (C) 1997 Elsevier Science B.V. C1 Univ Waterloo, Dept Elect & Comp Engn, Waterloo, ON N2L 3G1, Canada. AT&T Bell Labs, Murray Hill, NJ 07974 USA. RP Deng, L (reprint author), Univ Waterloo, Dept Elect & Comp Engn, Waterloo, ON N2L 3G1, Canada. EM deng@crg5.uwaterloo.ca CR ANDERSON SR, 1976, LANGUAGE, V52, P326, DOI 10.2307/412563 BAILLY G, 1991, J PHONETICS, V19, P9 BAILLY G, 1995, P 13 INT C PHON SCI, V2, P230 BAKIS R, 1991, P IEEE WORKSH AUT SP, P20 BAKIS R, 1993, FRONTIERS SPEECH PRO Baum L.
E., 1972, INEQUALITIES, V3, P1 BITAR N, 1995, P EUR, V2, P1411 BLACKBURN C, 1995, P EUR, V2, P1623 BOE LJ, 1992, J PHONETICS, V20, P27 Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9 BROWMAN CP, 1992, PHONETICA, V49, P155 CHOW W, 1996, AUTOMATIC SPEECH SPE, P109 CHOW W, 1996, INT J PATTERN RECOGN, V8, P5 CLEMENTS N, 1995, HDB PHONOLOGICAL THE, P206 COHEN J, 1996, P ADD ICSLP, P9 COKER CH, 1976, P IEEE, V64, P452, DOI 10.1109/PROC.1976.10154 DENG L, 1992, J ACOUST SOC AM, V92, P3058, DOI 10.1121/1.404202 DENG L, 1994, NEURAL NETWORKS, V7, P331, DOI 10.1016/0893-6080(94)90027-2 DENG L, 1992, SIGNAL PROCESS, V27, P65, DOI 10.1016/0165-1684(92)90112-A DENG L, 1994, J ACOUST SOC AM, V96, P2008, DOI 10.1121/1.410144 Deng L, 1994, IEEE T SPEECH AUDI P, V2, P507 Deng L, 1996, IEEE T SPEECH AUDI P, V4, P301, DOI 10.1109/89.506934 DENG L, 1990, J ACOUST SOC AM, V87, P2738, DOI 10.1121/1.399064 DENG L, 1995, P ICASSP, V1, P385 DENG L, 1994, J ACOUST SOC AM, V95, P2702, DOI 10.1121/1.409839 DENG L, 1995, COMPUT SPEECH LANG, V9, P63, DOI 10.1006/csla.1995.0004 DENG L, 1994, P INT C SPOK LANG PR, V4, P2167 DENG L, 1993, J ACO8UST SOC AM 2, V93, P2318, DOI 10.1121/1.406375 Digalakis V, 1993, IEEE T SPEECH AUDI P, V1, P431, DOI 10.1109/89.242489 DURAND J, 1995, FRONTIERS PHONOLOGY EIDE E, 1993, P ICASSP, V2, P483 Erler K., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1014 FOSTER E, 1996, P ADD ICSLP, P28 FOURAKIS M, 1986, J PHONETICS, V14, P197 FURUI S, 1995, P EUR, V3, P1595 Ghitza O., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1005 Goldsmith J., 1990, AUTOSEGMENTAL METRIC GUENTHER FH, 1995, P 13 INT C PHON SCI, V2, P92 HOGDEN J, 1996, J ACOUST SOC AM, V100, P2663, DOI 10.1121/1.417475 HOLMES W, 1996, P EUR, V1, P447 HONDA M, 1995, P ICSLP, V1, P179 HOUDE RA, 1968, SCRL MONOGR, V2 HWANG M, 1993, P ICASSP, V1, P311 Kaburagi T, 1996, J ACOUST SOC AM, V99, P3154, DOI 10.1121/1.414800 Keating PA, 1990, PAPERS LABORATORY PH, VI, P451 KELSO JAS, 1986, J PHONETICS, V14, P29 KENNY P, 1991, P IEEE WORKSH AUT SP, P22 KENT R, 1995, PRINCIPLES EXPT PHON, P3 KEYSER J, 1994, PHONOLOGY, V11, P207 LABOISSIERE R, 1995, P 13 INT C PHON SCI, V2, P60 LEVINSON SE, 1983, AT&T TECH J, V62, P1035 Liu S. A., 1995, THESIS MIT CAMBRIDGE MACNEILA.PF, 1970, PSYCHOL REV, V77, P182, DOI 10.1037/h0029070 MAEDA S, 1991, J PHONETICS, V19, P321 MAHKOUL J, 1995, P NATL ACAD SCI USA, V92, P9956 McGowan RS, 1997, J ACOUST SOC AM, V101, P28, DOI 10.1121/1.418310 MCGOWAN RS, 1994, SPEECH COMMUN, V14, P19, DOI 10.1016/0167-6393(94)90055-8 MENG H, 1991, P ICASSP, V1, P285 MORGAN N, 1994, CONNECTIONIST SPEECH Ostendorf M., 1996, AUTOMATIC SPEECH SPE, P185 PERKEL JS, 1980, LANGUAGE PRODUCTION PERKELL JS, 1995, J PHONETICS, V23, P23, DOI 10.1016/S0095-4470(95)80030-1 Perkell JS, 1996, J PHONETICS, V24, P3, DOI 10.1006/jpho.1996.0002 Perrier P, 1996, J SPEECH HEAR RES, V39, P365 Rabiner L.R., 1996, AUTOMATIC SPEECH SPE, P1 RAMSAY G, 1995, P 13 ICPHS, V2, P338 RAMSAY G, 1995, P EUR, V2, P1401 RAMSAY G, 1994, J ACOUST SOC AM 2, V95 RAMSAY G, 1996, P ICSLP, V2, P1113, DOI 10.1109/ICSLP.1996.607801 RANDOLPH M, 1994, J ACOUST SOC AM 2, V95 Robinson T., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90010-N Rose RC, 1996, J ACOUST SOC AM, V99, P1699, DOI 10.1121/1.414679 Rubin P. E., 1996, P 4 EUR SPEECH PROD, P125 Saltzman E. 
L., 1989, ECOL PSYCHOL, V1, P333, DOI 10.1207/s15326969eco0104_2 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 Sheikhzadeh H, 1994, IEEE T SPEECH AUDI P, V2, P80, DOI 10.1109/89.260337 SHIRAI K, 1976, DYNAMIC ASPECTS SPEE Stevens K. N., 1992, P ICSLP, V1, P499 TOKUDA K, 1995, P EUR, V1, P757 Varga A.P., 1990, P ICASSP, P845 WILHELMS R, 1987, THESIS GEORG AUGUST Young S., 1995, P IEEE WORKSH AUT SP, P3 NR 82 TC 27 Z9 27 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1997 VL 22 IS 2-3 BP 93 EP 111 DI 10.1016/S0167-6393(97)00018-6 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600002 ER PT J AU Titze, I Wong, D Story, B Long, R AF Titze, I Wong, D Story, B Long, R TI Considerations in voice transformation with physiologic scaling principles SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE voice transformation; speech synthesis; speech simulation; vowel quality; voice conversion ID VOCAL INTENSITY VARIATION; AIR-FLOW; AREA AB This study begins to explore the importance of the physiological domain in voice transformation. A general approach is outlined for transforming the voice quality of sentence-level speech while maintaining the same phonetic content. Transformations will eventually include gender, age, voice quality, emotional state, disordered state, dialect or impersonation. In this paper, only a specific voice quality, twang, is described as an example. The basic question is: relative to pure signal processing, can voices be transformed more effectively if biomechanical, acoustic and anatomical scaling principles are applied? At present, two approaches are contrasted, a Linear Predictive Coding approach and a biomechanical simulation approach. (C) 1997 Elsevier Science B.V. C1 Univ Iowa, Dept Speech Pathol & Audiol, Iowa City, IA 52242 USA. Univ Iowa, Natl Ctr Voice & Speech, Iowa City, IA 52242 USA. Denver Ctr Performing Arts, Wilbur James Gould Voice Res Ctr, Denver, CO USA. RP Titze, I (reprint author), Univ Iowa, Dept Speech Pathol & Audiol, 330-SHC, Iowa City, IA 52242 USA. 
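Of the two approaches contrasted in the Titze et al. abstract above, the Linear Predictive Coding route is the purely signal-based one. Below is a generic sketch of LPC analysis with pole-angle (formant) warping, not the authors' system; the scaling factor and its interpretation are assumptions for illustration:

    import numpy as np

    def lpc(frame, order):
        # Autocorrelation-method LPC via the Levinson-Durbin recursion.
        frame = np.asarray(frame, dtype=float)
        r = np.correlate(frame, frame, 'full')[len(frame) - 1:][:order + 1]
        a, err = np.array([1.0]), r[0]
        for i in range(1, order + 1):
            k = -(r[i] + np.dot(a[1:], r[i - 1:0:-1])) / err
            a = np.concatenate([a, [0.0]])
            a = a + k * a[::-1]
            err *= 1.0 - k * k
        return a                       # A(z) coefficients, a[0] = 1

    def shift_formants(a, factor):
        # Scale the angles of complex poles (formant frequencies) while
        # keeping their magnitudes (bandwidths); real poles are left
        # untouched so the polynomial stays real-valued.
        roots = np.roots(a)
        cplx = np.abs(roots.imag) > 1e-8
        roots[cplx] = (np.abs(roots[cplx])
                       * np.exp(1j * np.angle(roots[cplx]) * factor))
        return np.real(np.poly(roots))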
RI Titze, Ingo/G-4780-2010 CR Ananthapadmanabha TV, 1995, VOCAL FOLD, P113 ANANTHAPADMANAB.TV, 1989, J ACOUST SOC AM, V86, pS36, DOI 10.1121/1.2027477 CHILDERS DG, 1989, SPEECH COMMUN, V8, P147, DOI 10.1016/0167-6393(89)90041-1 CHILDERS DG, 1994, J ACOUST SOC AM, V96, P2026, DOI 10.1121/1.411319 COLTON R, 1978, 7 S CAR PROF VOIC, P71 Fant G., 1985, 4 PARAMETER MODEL GL, P1 HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829 IMAIZUMI S, 1991, VOCAL FOLD PHYSL ACO, P233 KANGAS J, 1992, P IEEE INT C AC SPEE, P341, DOI 10.1109/ICASSP.1992.226050 KARLSSON I, 1970, 23 SPEECH TRANSM LAB, P8 MILENKOVIC PH, 1993, J ACOUST SOC AM, V93, P1087, DOI 10.1121/1.405557 MILENKOVIC P, 1987, J SPEECH HEAR RES, V30, P529 PERKELL JS, 1994, J ACOUST SOC AM, V96, P695, DOI 10.1121/1.410307 Rabiner L.R., 1978, DIGITAL PROCESSING S RAHIM MG, 1994, ARTIFICIAL NEURAL NE Scherer R., 1991, VOCAL FOLD PHYSL ACO, P83 STATHOPOULOS ET, 1993, J ACOUST SOC AM, V94, P2531, DOI 10.1121/1.407365 STATHOPOULOS ET, 1993, J SPEECH HEAR RES, V36, P64 Story BH, 1996, J ACOUST SOC AM, V100, P537, DOI 10.1121/1.415960 TITZE IR, 1989, J ACOUST SOC AM, V85, P1699, DOI 10.1121/1.397959 TITZE IR, 1994, J ACOUST SOC AM, V95, P1133, DOI 10.1121/1.408461 TITZE IR, 1984, J ACOUST SOC AM, V75, P570, DOI 10.1121/1.390530 TITZE IR, 1983, 11 S CAR PROF VOIC, P90 Titze IR, 1997, J ACOUST SOC AM, V101, P2234, DOI 10.1121/1.418246 Valimaki V., 1995, THESIS HELSINKI U TE WONG D, 1996, UNPUB J ACOUSTICAL S YANAGISAWA E, 1989, Journal of Voice, V3, P342, DOI 10.1016/S0892-1997(89)80057-8 YANG CS, 1995, IEICE T INF SYST, VE78D, P732 NR 28 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1997 VL 22 IS 2-3 BP 113 EP 123 DI 10.1016/S0167-6393(97)00014-9 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600003 ER PT J AU Fant, G AF Fant, G TI The voice source in connected speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE voice source; LF-model; parameters; rules; connected speech ID QUALITY; FEMALE AB This is an attempt to formulate an outline of the properties of the human voice source in connected speech. Six aspects of the production process are considered: (1) reference data for a particular speaker; (2) segment-specific values and source-tract interactions; (3) coarticulation of glottal gestures and interpolation at boundaries; (4) basic F0 dependencies; (5) the influence of stress, accents and voice intensity; (6) the phrasal contour of source variations. The parameterization of source data is based on the transformed LF-model and frequency domain correspondences (Fant, 1995) which allows for maximal specificational power with a limited number of parameters. (C) 1997 Elsevier Science B.V. C1 Dept Speech Mus & Hearing, S-10044 Stockholm, Sweden. RP Fant, G (reprint author), Dept Speech Mus & Hearing, KTH Box 70014, S-10044 Stockholm, Sweden.
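The LF-model underlying the Fant abstract above is compact enough to state in full: an exponentially growing sinusoid up to the main excitation, then an exponential return phase, with the two free constants fixed by continuity and the zero-net-flow condition. A sketch with illustrative parameter values, not the paper's reference data (the alpha search bracket suits these defaults only):

    import numpy as np
    from scipy.optimize import brentq

    def lf_pulse(fs=16000, f0=110.0, Ee=1.0, tp=0.45, te=0.55, ta=0.02):
        # tp, te, ta: opening peak, main excitation and return-phase time
        # constant, as fractions of the period T0.
        T0 = 1.0 / f0
        t = np.linspace(0.0, T0, int(fs * T0), endpoint=False)
        Te, Ta, Tc = te * T0, ta * T0, T0
        wg = np.pi / (tp * T0)
        # Return-phase constant: eps * Ta = 1 - exp(-eps * (Tc - Te)).
        eps = brentq(lambda e: e * Ta - 1.0 + np.exp(-e * (Tc - Te)), 1.0, 1e6)

        def waveform(alpha):
            e = np.empty_like(t)
            open_ph = t <= Te
            E0 = -Ee / (np.exp(alpha * Te) * np.sin(wg * Te))  # E(Te) = -Ee
            e[open_ph] = E0 * np.exp(alpha * t[open_ph]) * np.sin(wg * t[open_ph])
            tr = t[~open_ph] - Te
            e[~open_ph] = -(Ee / (eps * Ta)) * (np.exp(-eps * tr)
                                                - np.exp(-eps * (Tc - Te)))
            return e

        # Fix the open-phase growth alpha by requiring zero net flow.
        alpha = brentq(lambda a: waveform(a).mean(), -1000.0, 3000.0)
        return waveform(alpha)    # glottal flow derivative, one period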
EM gunnar@speech.kth.se CR Alku P, 1996, SPEECH COMMUN, V18, P131, DOI 10.1016/0167-6393(95)00040-2 BAVEGARD M, 1994, 41994 STLQPSR, P63 BICKLEY CA, 1986, J PHONETICS, V14, P373 FANT G, 1996, 21996 TMHQPSR, P29 Fant G, 1989, 21989 STLQPSR, P1 FANT G, 1991, SPEECH COMMUN, V10, P521, DOI 10.1016/0167-6393(91)90055-X FANT G, 1986, 41986 STLQPSR, P1 FANT G, 1996, 4 TMH QPSR, P45 FANT G, 1993, SPEECH COMMUN, V13, P7, DOI 10.1016/0167-6393(93)90055-P FANT G, 1982, 41982 STLQPSR, P1 FANT G, 1995, 231995 STLQPSR, P119 FANT G, 1995, ICPHS 95 2, P622 FANT G, 1994, P INT S PROS YOK 18 FANT G, 1988, 231988 STLQPSR, P1 Fant G., 1985, 4 PARAMETER MODEL GL, P1 FANT G, 1994, INT C SPOK LANG PROC FANT G, 1959, ERICSSON TECHNIS GOBL C, 1988, 11988 STLQPSR, P123 GOBL C, 1988, 23 ROYAL I TECHN SPE, P23 Hanson M., 1997, J ACOUST SOC AM, V101, P466 KARLSSON I, 1992, SPEECH COMMUN, V11, P491, DOI 10.1016/0167-6393(92)90056-D KARLSSON L, 1995, SPEECH MAPS, P9 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 NICHASAIDE A, 1994, SPEECH MAPS Sluijter A. M. C., 1995, THESIS U LEIDEN LEID STEVENS K, 1994, INT S PROS YOK JAP, P53 Stevens KN, 1994, VOCAL FOLD PHYSL, P147 STRIK H, 1992, SPEECH COMMUN, V11, P167, DOI 10.1016/0167-6393(92)90011-U STRIK H, 1992, J PHONETICS, V20, P15 NR 29 TC 45 Z9 46 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1997 VL 22 IS 2-3 BP 125 EP 139 DI 10.1016/S0167-6393(97)00017-4 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600004 ER PT J AU Mergell, P Herzel, H AF Mergell, P Herzel, H TI Modelling biphonation - The role of the vocal tract SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE vocal folds; 2-mass model; biphonation; bifurcation; source-tract coupling ID GLOTTAL FLOW; 2-MASS MODEL; FOLD MODEL; BIFURCATIONS; SIMULATION; DYNAMICS AB Instabilities of the human voice source appear in normal voices under certain conditions (newborn cries, vocal fry, creaky voice) and are symptomatic of voice pathologies. Vocal instabilities are intimately related to bifurcations of the underlying nonlinear dynamical system. We analyse in this paper bifurcations in 2-mass models of the vocal folds and study, in particular, how the incorporation of the vocal tract affects bifurcation diagrams. A comparison of a simplified model (Steinecke and Herzel, 1995) with an extended version including vocal tract resonances reveals that essential features of the bifurcation diagrams (e.g. frequency locking of both folds and toroidal oscillations) are found in both model versions. However, vocal instabilities appear in the extended model at lower subglottal pressures and even for weak asymmetries. (C) 1997 Elsevier Science B.V. C1 Humboldt Univ, Inst Theoret Biol, D-10115 Berlin, Germany. Univ Erlangen Nurnberg, ENT Clin, D-91054 Erlangen, Germany. RP Herzel, H (reprint author), Humboldt Univ, Inst Theoret Biol, Invalidenstr 43, D-10115 Berlin, Germany.
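Bifurcation diagrams of the kind analysed by Mergell and Herzel are usually read off a parameter sweep: simulate, discard the transient, and record the local maxima of the steady-state oscillation. A model-agnostic harness (the two-mass simulator itself is assumed, not reimplemented here):

    import numpy as np

    def bifurcation_diagram(simulate, pressures, discard=0.5):
        # simulate(p) -> sampled time series (e.g. glottal area) at
        # subglottal pressure p; any vocal-fold model can be plugged in.
        points = []
        for p in pressures:
            x = np.asarray(simulate(p))
            x = x[int(len(x) * discard):]                # drop the transient
            mid = x[1:-1]
            peaks = mid[(mid > x[:-2]) & (mid > x[2:])]  # local maxima
            # One cluster of maxima = limit cycle; several = subharmonics
            # or toroidal oscillation; a broad band = chaos.
            points.extend((p, v) for v in np.unique(np.round(peaks, 6)))
        return np.array(points)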
EM herzel@itp1.physik.tu-berlin.de CR Berge P., 1986, ORDER CHAOS Berry DA, 1996, J VOICE, V10, P129, DOI 10.1016/S0892-1997(96)80039-7 BERRY DA, 1994, J ACOUST SOC AM, V95, P3595, DOI 10.1121/1.409875 CRANEN B, 1988, J ACOUST SOC AM, V84, P888, DOI 10.1121/1.396658 DOLANSKY L, 1968, IEEE T ACOUST SPEECH, VAU16, P51, DOI 10.1109/TAU.1968.1161962 FLANAGAN JL, 1968, IEEE T ACOUST SPEECH, VAU16, P57, DOI 10.1109/TAU.1968.1161949 Glass L, 1988, CLOCKS CHAOS GUO CG, 1993, J ACOUST SOC AM, V94, P688, DOI 10.1121/1.406886 HERZEL H, 1991, P EUROSPEECH GENOVA, P263 HERZEL H, 1991, NATO ADV SCI I B-PHY, V270, P41 HERZEL H, 1996, IN PRESS FOLIA PHONI HERZEL H, 1996, NONLINEAR CHAOTIC AD HERZEL H, 1994, J SPEECH HEAR RES, V37, P1008 Herzel H., 1993, APPL MECH REV, V46, P399 HERZEL H, 1995, NONLINEAR DYNAM, V7, P53 Herzel H, 1996, VOCAL FOLD, P63 ISHIZAKA K, 1976, J ACOUST SOC AM, V60, P1194 ISHIZAKA K, 1972, AT&T TECH J, V51, P1233 Kaplan D., 1995, UNDERSTANDING NONLIN KOHLER KJ, 1996, P 4 SPEECH PROD SEM, P1 LABOISSIERE R, 1995, P ICPHS 95 STOCKH, V3, P190 LUCERO JC, 1993, J ACOUST SOC AM, V94, P3104, DOI 10.1121/1.407216 Mazo M, 1995, VOCAL FOLD, P173 MENDE W, 1990, PHYS LETT A, V145, P418, DOI 10.1016/0375-9601(90)90305-8 MERGELL P, 1997, THESIS HUMBOLDT U BE MORSE P, 1967, THEORETICAL ACOUSTIC PELORSON X, 1994, J ACOUST SOC AM, V96, P3416, DOI 10.1121/1.411449 ROBB MP, 1988, J ACOUST SOC AM, V83, P1876, DOI 10.1121/1.396523 SIRVIO P, 1976, FOLIA PHONIATR, V28, P161 SMITH ME, 1992, J SPEECH HEAR RES, V35, P545 STEINECKE I, 1995, J ACOUST SOC AM, V97, P1874, DOI 10.1121/1.412061 STORY BH, 1995, J ACOUST SOC AM, V97, P1249, DOI 10.1121/1.412234 TIGGES M, 1996, IN PRESS ACUSTICA TITZE IR, 1996, UNPUB ACOUSTIC INTER TITZE IR, 1984, J ACOUST SOC AM, V75, P570, DOI 10.1121/1.390530 Titze I.R., 1993, FRONTIERS BASIC SCI, P143 Titze IR, 1994, PRINCIPLES VOICE PRO NR 37 TC 44 Z9 44 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1997 VL 22 IS 2-3 BP 141 EP 154 DI 10.1016/S0167-6393(97)00016-2 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600005 ER PT J AU Pelorson, X Hofmans, GCJ Ranucci, M Bosch, RCM AF Pelorson, X Hofmans, GCJ Ranucci, M Bosch, RCM TI On the fluid mechanics of bilabial plosives SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE speech production; stop consonants; plosives; fluid mechanics ID VOCAL-TRACT AERODYNAMICS; VERTICAL-BAR UTTERANCES; PHONATION; GLOTTIS; MODELS AB In this paper we present a review of some fluid mechanical phenomena involved in bilabial plosive sound production. As a basis for further discussion, an in vivo experimental set-up is first described. The order of magnitude of some important geometrical and fluid dynamical quantities is presented. Different theoretical flow models are then discussed and evaluated using in vitro measurements on a replica of the lips and using numerical simulations. (C) 1997 Elsevier Science B.V. C1 Univ Grenoble 3, INPG, Inst Commun Parlee, F-38031 Grenoble 1, France. Eindhoven Univ Technol, Gasdynam Aeroacoust Lab, NL-5600 MB Eindhoven, Netherlands. Univ Roma La Sapienza, Dipartimento Meccan & Aeronaut, I-00184 Rome, Italy. RP Pelorson, X (reprint author), Univ Grenoble 3, INPG, Inst Commun Parlee, 46 Ave F Viallet, F-38031 Grenoble 1, France.
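Among the simple flow descriptions of the kind reviewed by Pelorson et al., a quasi-steady Bernoulli model with a Poiseuille-type viscous correction already yields a usable pressure-flow relation for a rectangular lip constriction; a sketch with rough order-of-magnitude geometry, not measured data:

    import numpy as np

    def volume_flow(dp, h, w=0.03, d=0.01, rho=1.2, mu=1.8e-5):
        # dp: transconstriction pressure drop (Pa); h, w, d: constriction
        # height, width and length (m); rho, mu: air density and viscosity.
        a = rho / (2.0 * (w * h) ** 2)     # Bernoulli (kinetic) term
        b = 12.0 * mu * d / (w * h ** 3)   # Poiseuille (viscous) term
        # Solve dp = a*U**2 + b*U for the volume flow U (m^3/s).
        return (-b + np.sqrt(b * b + 4.0 * a * dp)) / (2.0 * a)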
EM pelorson@icp.grenet.fr CR Abramowitz M., 1972, HDB MATH FUNCTIONS ABRY C, 1986, SPEECH COMMUN, V5, P97, DOI 10.1016/0167-6393(86)90032-4 ALIPOUR F, 1993, 3 SEM SPEECH PROD NE BAILLIET H, 1995, P INT S MUS AC, P23 Baken R. J., 1987, CLIN MEASUREMENTS SP BELFROID S, 1993, R1239S TUE EINDH U T BOSCH RCM, 1996, R1411A TUE EINDH U T CHORIN AJ, 1978, COMMUN PUR APPL MATH, V31, P205, DOI 10.1002/cpa.3160310205 Copley DC, 1996, J ACOUST SOC AM, V99, P1219, DOI 10.1121/1.414603 FORESTER JH, 1970, J BIOMECH, V3, P297 FUJIMURA O, 1961, J SPEECH HEAR RES, V4, P233 GEORGIOU GA, 1985, J FLUID MECH, V159, P259, DOI 10.1017/S0022112085003202 GRAZIANI G, 1994, MECCANICA, V29, P465, DOI 10.1007/BF00987579 GRAZIANI G, UNPUB FAST VORTEX ME Hirschberg A, 1996, MECCANICA, V31, P131, DOI 10.1007/BF00426256 HIRSCHBERG A, 1996, VOCAL FOLDS PHYSL CO ISSHIKI N, 1964, J SPEECH HEAR RES, V7, P233 KAISER JF, 1983, VOCAL FOLDS PHYSL BI LOFQVIST A, 1995, SPEECH COMMUN, V16, P49, DOI 10.1016/0167-6393(94)00049-G MAEDA S, 1987, P 11 INT C PHON SCI, P11 MCGOWAN RS, 1995, SPEECH COMMUN, V16, P67, DOI 10.1016/0167-6393(94)00048-F PELORSON X, 1995, ACTA ACUST, V3, P191 PELORSON X, 1994, J ACOUST SOC AM, V96, P3416, DOI 10.1121/1.411449 PELORSON X, 1994, P INT C SPOK LANG PR, V2, P599 RANUCCI M, 1994, THESIS U ROME LA SAP ROTHENBE.M, 1973, J ACOUST SOC AM, V53, P1632, DOI 10.1121/1.1913513 STEVENS KN, 1993, SPEECH COMMUN, V13, P367, DOI 10.1016/0167-6393(93)90035-J NR 27 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1997 VL 22 IS 2-3 BP 155 EP 172 DI 10.1016/S0167-6393(97)00015-0 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600006 ER PT J AU Stone, M Goldstein, MH Zhang, YQ AF Stone, M Goldstein, MH Zhang, YQ TI Principal component analysis of cross sections of tongue shapes in vowel production SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE principal component analysis; tongue shapes; ultrasound; vowels ID ULTRASOUND IMAGES; JAW AB Images of the vocal tract provide the speech researcher with new and unique data. However, quantification of the "biologically important" features present in the image is a significant challenge because these features are typically complex, inherently variable and difficult to describe. The present study quantified cross-sectional tongue shape for vowels in a single plane (post-alveolar) using Principal Component Analysis. A single subject repeated eleven English vowels in two consonant contexts, five times each. Each of the resulting cross-sectional waveshapes (tokens) was represented by 70 samples. The 110 tokens were placed in a 110 x 70 matrix for PCA. The resulting principal components represent a basis set of orthonormal waveshapes with properties particularly suitable for the present project. The first two components accounted for 93% of the variance in the data. The loadings of the eleven vowels on the first two PCs indicated three distinct shape groups. These were examined statistically and found to represent high vowels, front vowels and back vowels. (C) 1997 Elsevier Science B.V. C1 Univ Maryland, Sch Med, Div Otolaryngol, Baltimore, MD 21201 USA. Johns Hopkins Univ, Dept Elect & Comp Engn, Baltimore, MD 21218 USA. Johns Hopkins Univ, Dept Math Sci, Baltimore, MD 21218 USA.
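The analysis in the Stone et al. abstract above maps directly onto standard PCA; a sketch assuming the 110 x 70 token matrix is available in a hypothetical file:

    import numpy as np
    from sklearn.decomposition import PCA

    tokens = np.load('tongue_tokens.npy')  # hypothetical: 110 tokens x 70 samples
    pca = PCA(n_components=5)
    loadings = pca.fit_transform(tokens)   # PCA centers the data itself
    print(pca.explained_variance_ratio_[:2].sum())  # ~0.93 reported in the paper
    # Scatter loadings[:, 0] against loadings[:, 1]: the three reported
    # vowel-shape groups (high, front, back) should appear as clusters.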
RP Stone, M (reprint author), Univ Maryland, Sch Med, Div Otolaryngol, 16 S Eutaw St,Room 525, Baltimore, MD 21201 USA. CR ABELES M, 1977, P IEEE, V65, P762, DOI 10.1109/PROC.1977.10559 BAER T, 1991, J ACOUST SOC AM, V90, P799, DOI 10.1121/1.401949 Eubank R., 1988, SPLINE SMOOTHING NON Everitt B., 1993, CLUSTER ANAL, V3rd Friedman D. H., 1968, DETECTION SIGNALS TE Glaser E. M., 1976, PRINCIPLES NEUROBIOL HARSHMAN R, 1977, J ACOUST SOC AM, V62, P693, DOI 10.1121/1.381581 Hollander M, 1973, NONPARAMETRIC STAT M JACKSON JE, 1990, USERS GUIDE PRINCIPA Jolliffe I. T., 1986, PRINCIPAL COMPONENT MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131 MAEDA S, 1991, J PHONETICS, V19, P321 Reyment R., 1993, APPL FACTOR ANAL NAT SMITH K, 1989, AM SCI, V17, P28 Stone M, 1995, J ACOUST SOC AM, V98, P3107, DOI 10.1121/1.413799 STONE M, 1996, J ACOUST SOC AM, V99, P3782 STONE M, 1988, J ACOUST SOC AM, V83, P1586, DOI 10.1121/1.395913 STONE M, 1995, J PHONETICS, V23, P81, DOI 10.1016/S0095-4470(95)80034-4 SUNDBERG J, 1987, PHONETICA, V44, P76 UNSER M, 1992, J ACOUST SOC AM, V91, P3001, DOI 10.1121/1.402934 Wahba G., 1990, SPLINE MODELS OBSERV NR 21 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1997 VL 22 IS 2-3 BP 173 EP 184 DI 10.1016/S0167-6393(97)00027-7 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600007 ER PT J AU Payan, Y Perrier, P AF Payan, Y Perrier, P TI Synthesis of V-V sequences with a 2D biomechanical tongue model controlled by the Equilibrium Point Hypothesis SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE biomechanical modeling; speech production; speech motor control; velocity profiles ID MOTOR CONTROL; VELOCITY PROFILES; SPEECH PRODUCTION; HUMAN JAW; MOVEMENTS; VOWELS; PATTERNS; SYSTEMS; FORCE AB An assessment of a target-based control model of speech production using Feldman's Equilibrium Point Hypothesis is presented. It consists of simulations of articulatory movements during vowel-to-vowel sequences with a 2D biomechanical tongue model. In the model the main muscles responsible for tongue movements and tongue shaping in the mid-sagittal plane are represented. The elastic properties are accounted for through finite-element modeling, while force generation principles are implemented according to the non-linear force-length Invariant Characteristics proposed by Feldman. Movement is produced through control variable shifts at rates that are constant throughout each transition. The external contours of the model are adjusted to approximate X-ray data collected on a native speaker of French, and it is inserted in the vocal tract contours of the speaker. Thus, from tongue shapes generated with the model, it was possible to produce formant trajectories compatible with the speaker's acoustic space. It permitted a comparison of simulations with real data collected on the speaker in the kinematic and acoustic domains. Emphasis is put on the realism of synthesized formant trajectories, and on the potential influence of biomechanical tongue properties on measurable kinematic features. (C) 1997 Elsevier Science B.V. C1 Inst Natl Polytech Grenoble, UPRESA 5009, CNRS, Inst Commun Parlee, F-38031 Grenoble 01, France. Univ Grenoble 3, F-38031 Grenoble 01, France.
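The force law behind the equilibrium-point control in the Payan and Perrier abstract above is Feldman's invariant characteristic: force grows exponentially once muscle length exceeds the centrally specified threshold lambda, and movement is produced by ramping lambda at a constant rate. A sketch with illustrative constants, not the model's calibrated values:

    import numpy as np

    def ic_force(l, lam, rho=1.0, c=40.0, vel=0.0, mu=0.01):
        # Activation: how far length (plus a velocity-dependent term)
        # exceeds the threshold; no active force below threshold.
        a = np.maximum(l - lam + mu * vel, 0.0)
        return rho * (np.exp(c * a) - 1.0)

    # A control-variable shift, as in the simulated V-V transitions: ramp
    # the threshold at a constant rate and the plant tracks the shifting
    # equilibrium between agonist and antagonist muscles.
    lam_ramp = np.linspace(0.080, 0.060, 200)  # hypothetical lambda ramp (m)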
RP Payan, Y (reprint author), Inst Natl Polytech Grenoble, UPRESA 5009, CNRS, Inst Commun Parlee, 46 Ave Felix Viallet, F-38031 Grenoble 01, France. EM payan@icp.grenet.fr; perrier@icp.grenet.fr RI PAYAN, Yohan/E-3993-2013 CR ADAMS SG, 1993, J SPEECH HEAR RES, V36, P41 ADATIA AK, 1971, J ANAT, V110, P215 Badin P., 1995, P 15 INT C AC TRONDH, VIV, P349 BADIN P, 1984, 23 STL QPSR, P53 BADIN P, 1990, J ACOUST SOC AM, V87, P1290, DOI 10.1121/1.398804 Baer T., 1988, ANN B RES I LOGOPEDI, V22, P7 BEAUTEMPS D, 1995, SPEECH COMMUN, V16, P27, DOI 10.1016/0167-6393(94)00045-C BOTHOREL A, 1975, TRAVAUX I PHONETIQUE, V7, P80 BOYCE SE, 1990, J PHONETICS, V18, P173 BUNTON K, 1994, J SPEECH HEAR RES, V37, P1020 CARRE R, 1991, J PHONETICS, V19, P433 COOPER S, 1953, J PHYSIOL-LONDON, V122, P193 DUCK FA, 1990, PHYSICAL PROPERTES T DWORKIN JP, 1980, J SPEECH HEAR RES, V23, P828 Fant G., 1960, ACOUSTIC THEORY SPEE Fant G., 1992, P ICSLP 92, V1, P807 FELDMAN AG, 1986, J MOTOR BEHAV, V18, P17 Feldman A. G., 1990, MULTIPLE MUSCLE SYST, P195 FELDMAN AG, 1966, BIOPHYS-USSR, V11, P565 FELDMAN AG, 1981, BIOL CYBERN, V42, P107, DOI 10.1007/BF00336728 FELDMAN AG, 1972, EXP NEUROL, V37, P481, DOI 10.1016/0014-4886(72)90091-X Flanagan J.R, 1990, CEREBRAL CONTROL SPE, P29 GAMBARELLI J, 1977, COUPES SERIEES CORPS GERHARDT P, 1988, ATLAS CORRELATIONS A GUENTHER FH, 1995, PSYCHOL REV, V102, P594 HOGAN N, 1984, J NEUROSCI, V4, P2745 Honda K, 1996, J PHONETICS, V24, P39, DOI 10.1006/jpho.1996.0004 HUXLEY AF, 1957, PROG BIOPHYS MOL BIO, V7, P255 KAKITA Y, 1985, PHONETIC LINGUISTICS, P133 KAKITA Y, 1977, J ACOUST SOC AM, V62, pS15, DOI 10.1121/1.2016043 KIRITANI S, 1975, J ACOUST SOC AM, V57, pS3, DOI 10.1121/1.1995215 KITANI S, 1976, ANN B RES I LOGOPEDI, V10, P243 Laboissiere R, 1996, BIOL CYBERN, V74, P373 SOECHTING JF, 1981, J NEUROSCI, V1, P710 LALLOUACHE MT, 1990, 18 JOURN ET PAR MONT, P27 LOEVENBRUCK L, 1996, THESIS I NATL POLYTE MACNEILA.PF, 1970, PSYCHOL REV, V77, P182, DOI 10.1037/h0029070 Maeda S, 1979, 10 JOURN ET PAR, P152 MAEDA S, 1994, PHONETICA, V51, P17 MCCLEAN MD, 1995, J SPEECH HEAR RES, V38, P772 MIN Y, 1994, NCVS STATUS PROGR, V7, P131 Miyawaki K., 1974, ANN B RES I LOGOPEDI, V8, P23 MORASSO P, 1981, EXP BRAIN RES, V42, P223 MUNHALL KG, 1985, J EXP PSYCHOL HUMAN, V11, P457, DOI 10.1037/0096-1523.11.4.457 NELSON WL, 1983, BIOL CYBERN, V46, P135, DOI 10.1007/BF00339982 NITTROUER S, 1988, J ACOUST SOC AM, V84, P1653, DOI 10.1121/1.397180 OKA S, 1974, RHEOLOGY BIORHEOLOGY, P454 OSTRY DJ, 1992, TUTORIALS MOTOR BEHA, V2, P646 OSTRY DJ, 1989, ARCH ORAL BIOL, V34, P685, DOI 10.1016/0003-9969(89)90074-5 OSTRY DJ, 1985, J ACOUST SOC AM, V77, P640, DOI 10.1121/1.391882 PAYAN Y, 1995, P 13 INT C PHON SCI, V2, P474 PELORSON X, 1995, P 15 INT C AC TRONDH, P505 PERKELL JS, 1974, THESIS MIT BOSTON Perkell JS, 1996, J PHONETICS, V24, P3, DOI 10.1006/jpho.1996.0002 Perrier P, 1996, J PHONETICS, V24, P53, DOI 10.1006/jpho.1996.0005 Perrier P, 1996, J SPEECH HEAR RES, V39, P365 SCHWARTZ HR, 1984, FINITE ELEMENT METHO SMITH KK, 1989, AM SCI, V77, P29 SOCK R, 1995, J PHONETICS, V23, P129, DOI 10.1016/S0095-4470(95)80037-9 WALKER LB, 1959, ANAT REC, V133, P438 Weddell G, 1940, J ANAT, V74, P255 WELLS JB, 1965, J PHYSIOL-LONDON, V178, P252 WilhelmsTricarico R, 1996, J PHONETICS, V24, P23, DOI 10.1006/jpho.1996.0003 WILHELMSTRICARICO R, 1995, J ACOUST SOC AM, V97, P3085, DOI 10.1121/1.411871 Zienkiewicz OC, 1989, FINITE ELEMENT METHO NR 65 TC 37 Z9 37 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS 
SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1997 VL 22 IS 2-3 BP 185 EP 205 DI 10.1016/S0167-6393(97)00019-8 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600008 ER PT J AU Wood, SAJ AF Wood, SAJ TI A cinefluorographic study of the temporal organization of articulator gestures: Examples from Greenlandic SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE assimilation; coarticulation; gestural organization; Inuit; motor control; temporal coordination; West Greenlandic ID SPEECH PRODUCTION; COARTICULATION; COORDINATION; MOVEMENTS; BULGARIAN; MODEL AB Movement data on articulator gestures in West Greenlandic are presented in order to elucidate principles of articulator coordination, especially the domain of coarticulation (as distinct from the domain of assimilation), the handling of conflicting demands on articulators, and the relation of vowels to consonants. The present data are consistent with results previously obtained from Swedish and Bulgarian. The Greenlandic informant varied his domain of coarticulation by up to two phonemes either side of the current phoneme; potential gesture conflicts were resolved in accordance with the model of Kozhevnikov and Chistovich, oncoming gestures being delayed when they were antagonistic to ongoing gestures; finally, articulator gestures were organized according to the same principles for both vowels and consonants. (C) 1997 Elsevier Science B.V. C1 Lund Univ, Dept Linguist, S-22362 Lund, Sweden. RP Wood, SAJ (reprint author), Lund Univ, Dept Linguist, Helgonabacken 12, S-22362 Lund, Sweden. EM sidney.wood@ling.lu.se CR AFIFI AK, 1986, BASIC NEUROSCIENCE BELLBERTI F, 1991, J ACOUST SOC AM, V90, P112, DOI 10.1121/1.401304 BELLBERTI F, 1979, J ACOUST SOC AM, V165, P268 BELLBERTI F, 1981, PHONETICA, V38, P9 Browman Catherine, 1986, PHONOLOGY YB, V3, P219 Daniloff R. G., 1973, J PHONETICS, V1, P239 Darley F.L, 1975, MOTOR SPEECH DISORDE Fant C. G. M., 1960, ACOUSTIC THEORY SPEE Fowler C, 1977, TIMING CONTROL SPEEC FOWLER CA, 1993, LANG SPEECH, V36, P171 FOWLER CA, 1980, J PHONETICS, V8, P113 GAY T, 1977, J ACOUST SOC AM, V62, P183, DOI 10.1121/1.381480 HARDCASTLE W, 1989, CLIN LINGUIST PHONET, V3, P1, DOI 10.3109/02699208908985268 Hawkins S., 1992, PAPERS LABORATORY PH, P9 Henke W., 1966, THESIS MIT CAMBRIDGE Jespersen O, 1897, FONETIK JOOS M, 1948, LANGUAGE S, V24 KIRITANI S, 1975, J ACOUST SOC AM, V57, P1516, DOI 10.1121/1.380593 Kozhevnikov V. A., 1965, SPEECH ARTICULATION LOFQVIST A, 1990, NATO ADV SCI I D-BEH, V55, P289 MENZERATH P, 1933, PHONETISCHE STUDIEN, V1 MOLL KL, 1971, J ACOUST SOC AM, V50, P678, DOI 10.1121/1.1912683 MUNHALL K, 1992, J PHONETICS, V20, P111 OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 PERKELL JS, 1992, J ACOUST SOC AM, V91, P2911, DOI 10.1121/1.403778 PERKELL JS, 1992, J ACOUST SOC AM, V92, P3078, DOI 10.1121/1.404204 Perrier P, 1996, J SPEECH HEAR RES, V39, P365 Pisoni D. 
B., 1987, SPOKEN WORD RECOGNIT, P21 Porter R, 1993, CORTICOSPINAL FUNCTI RISCHEL J, 1974, TOPCIS W GREENLANDIC SCHULTZLORENTZE.CW, 1969, MEDDELELSER GRONLAND, V129 Sweet Henry, 1877, HDB PHONETICS VAXELAIRE B, 1995, EUROSPEECH 95 GTH UP, V2, P1285 Wertz RT, 1984, APRAXIA SPEECH ADULT WHALEN DH, 1990, J PHONETICS, V18, P3 WOOD S, 1979, J PHONETICS, V7, P25 WOOD SAJ, 1991, J PHONETICS, V19, P281 WOOD SAJ, 1994, PUBLICATIONS DEP PHO, V39, P191 WOOD SAJ, 1988, FOLIA LINGUIST, V22, P239, DOI 10.1515/flin.1988.22.3-4.239 WOOD SAJ, 1995, P 13 INT C PHON SCI, V1, P392 Wood SAJ, 1996, J PHONETICS, V24, P139, DOI 10.1006/jpho.1996.0009 NR 42 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1997 VL 22 IS 2-3 BP 207 EP 225 DI 10.1016/S0167-6393(97)00024-1 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600009 ER PT J AU Perkell, J Matthies, M Lane, H Guenther, F Wilhelms-Tricarico, R Wozniak, J Guiod, P AF Perkell, J Matthies, M Lane, H Guenther, F Wilhelms-Tricarico, R Wozniak, J Guiod, P TI Speech motor control: Acoustic goals, saturation effects, auditory feedback and internal models SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE speech motor programming; acoustic goals; motor equivalence; quantal mechanisms; saturation effects; auditory feedback; intelligibility; internal model; cochlear implants ID COCHLEAR IMPLANT USERS; NEURAL-NETWORK MODEL; FORMANT FREQUENCIES; TRADING RELATIONS; VOCAL-TRACT; VOWEL-U; COMPENSATION; COARTICULATION; CATEGORIES; LIP AB A theoretical overview and supporting data are presented about the control of the segmental component of speech production. Findings of "motor-equivalent" trading relations between the contributions of two constrictions to the same acoustic transfer function provide preliminary support for the idea that segmental control is based on acoustic or auditory-perceptual goals. The goals are determined partly by non-linear, quantal relations (called "saturation effects") between motor commands and articulatory movements and between articulation and sound. Since processing times would be too long to allow the use of auditory feedback for closed-loop error correction in achieving acoustic goals, the control mechanism must use a robust "internal model" of the relation between articulation and the sound output that is learned during speech acquisition. Studies of the speech of cochlear implant and bilateral acoustic neuroma patients provide evidence supporting two roles for auditory feedback in adults: maintenance of the internal model, and monitoring the acoustic environment to help assure intelligibility by guiding relatively rapid adjustments in "postural" parameters underlying average sound level, speaking rate and the amount of prosodically-based inflection of F0 and SPL. (C) 1997 Elsevier Science B.V. C1 MIT, Elect Res Lab, Speech Commun Grp, Cambridge, MA 02139 USA. MIT, Dept Brain & Cognit Sci, Cambridge, MA 02139 USA. Boston Univ, Dept Commun Disorders, Boston, MA 02215 USA. Northeastern Univ, Dept Psychol, Boston, MA USA. Boston Univ, Dept Cognit & Neural Syst, Boston, MA 02215 USA. RP Perkell, J (reprint author), MIT, Elect Res Lab, Speech Commun Grp, Room 36-511,50 Vassar St, Cambridge, MA 02139 USA.
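A small numeric illustration of the "saturation effect" invoked in the Perkell et al. abstract above. This sketch is not from the paper; the sigmoidal motor-to-articulation relation and all constants are assumptions, chosen only to show how identical motor-command variability produces far smaller articulatory, and hence acoustic, variability in a saturated region of the mapping than in its steep mid-range.

import numpy as np

def articulator_position(motor_cmd):
    # hypothetical saturating (sigmoid) motor-to-articulation relation
    return 1.0 / (1.0 + np.exp(-motor_cmd))

rng = np.random.default_rng(1)
noise = rng.normal(scale=0.5, size=10_000)   # identical motor variability

mid = articulator_position(0.0 + noise)      # steep part of the curve
sat = articulator_position(4.0 + noise)      # saturated part of the curve

print(f"articulatory s.d., mid-range : {mid.std():.4f}")
print(f"articulatory s.d., saturated : {sat.std():.4f}")  # much smaller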
EM perkell@speech.mit.edu CR ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 BECKMAN ME, 1995, J ACOUST SOC AM, V97, P471, DOI 10.1121/1.412945 Cowie R. I., 1983, HEARING SCI HEARING, P183 Cowie R. I., 1992, POSTLINGUALLY ACQUIR deJong KJ, 1997, J ACOUST SOC AM, V101, P2221, DOI 10.1121/1.418206 Fairbanks G, 1960, VOICE ARTICULATION D FAIRBANKS G, 1961, J SPEECH HEAR RES, V4, P203 FLEGE JE, 1988, J ACOUST SOC AM, V83, P212, DOI 10.1121/1.396424 FORREST K, 1988, J ACOUST SOC AM, V84, P115, DOI 10.1121/1.396977 FOWLER CA, 1980, PHONETICA, V37, P306 FOWLER CA, 1980, J PHONETICS, V8, P113 Fujimura O., 1979, FRONTIERS SPEECH COM, P17 GAY T, 1992, J ACOUST SOC AM, V92, P1301, DOI 10.1121/1.403924 GUENTHER FH, IN PRESS PSYCHOL REV GUENTHER FH, 1992, THESIS BOSTON U BOST GUENTHER FH, 1995, PSYCHOL REV, V102, P594 HALLE M, 1990, MUSIC LANGUAGE SPEEC, P1 HAMLET S, 1978, J PROSTHET DENT, V40, P60, DOI 10.1016/0022-3913(78)90160-9 Hamlet S. L., 1976, J PHONETICS, V4, P199 Hodgson P, 1996, J ACOUST SOC AM, V100, P565, DOI 10.1121/1.415867 HOGAN N, 1987, TRENDS NEUROSCI, V10, P170, DOI 10.1016/0166-2236(87)90043-9 HOUDE JF, 1997, THESIS MIT CAMBRIDGE Iverson P, 1996, J ACOUST SOC AM, V99, P1130, DOI 10.1121/1.415234 IVERSON P, 1995, J ACOUST SOC AM, V97, P553, DOI 10.1121/1.412280 Jordan M. I, 1990, ATTENTION PERFORM, V13, P796 Jordan M. I., 1996, HDB PERCEPTION ACTIO, V2, P71 JORDAN MI, 1992, COGNITIVE SCI, V16, P307, DOI 10.1207/s15516709cog1603_1 KAWATO M, 1987, BIOL CYBERN, V57, P169, DOI 10.1007/BF00364149 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 Ladefoged P., 1996, SOUNDS WORLDS LANGUA Lane H, 1997, J ACOUST SOC AM, V101, P2244, DOI 10.1121/1.418245 LANE H, 1971, J SPEECH HEAR RES, V14, P677 LANE H, 1991, J ACOUST SOC AM, V89, P859, DOI 10.1121/1.1894647 Lane H, 1995, J ACOUST SOC AM, V98, P3096, DOI 10.1121/1.413798 LEHISTE I, 1959, J ACOUST SOC AM, V31, P428, DOI 10.1121/1.1907729 LIBERMAN AM, 1957, J EXP PSYCHOL, V54, P358, DOI 10.1037/h0044417 LIBERMAN AM, 1961, J EXP PSYCHOL, V61, P379, DOI 10.1037/h0049038 LINDBLOM B, 1979, J PHONETICS, V7, P147 LINDBLOM B, 1989, J PHONETICS, V17, P107 MAEDA S, 1991, J PHONETICS, V19, P321 MANUEL SY, 1990, J ACOUST SOC AM, V88, P1286, DOI 10.1121/1.399705 Matthies ML, 1996, J SPEECH HEAR RES, V39, P936 MATTHIES ML, 1994, J ACOUST SOC AM, V96, P1367, DOI 10.1121/1.410281 MCFARLAND DH, 1995, J ACOUST SOC AM, V97, P1865, DOI 10.1121/1.412060 McFarland DH, 1996, J ACOUST SOC AM, V100, P1093, DOI 10.1121/1.416286 MERMELST.P, 1973, J ACOUST SOC AM, V53, P1070, DOI 10.1121/1.1913427 MILLER JD, 1989, J ACOUST SOC AM, V85, P2114, DOI 10.1121/1.397862 MUNHALL KG, 1994, J ACOUST SOC AM, V95, P3605, DOI 10.1121/1.409929 PARRY DM, 1994, AM J MED GENET, V52, P450, DOI 10.1002/ajmg.1320520411 PERKELL J, 1992, J ACOUST SOC AM, V91, P2961, DOI 10.1121/1.402932 Perkell J., 1979, 97 M AC SOC AM, P109 Perkell J. 
S., 1997, HDB PHONETIC SCI, P333 PERKELL JS, 1995, J PHONETICS, V23, P23, DOI 10.1016/S0095-4470(95)80030-1 PERKELL JS, 1993, J ACOUST SOC AM, V93, P2948, DOI 10.1121/1.405814 Perkell JS, 1996, J PHONETICS, V24, P3, DOI 10.1006/jpho.1996.0002 PERKELL JS, 1995, P 13 INT C PHON SCI, V3, P194 PERKELL JS, 1985, J ACOUST SOC AM, V77, P1889, DOI 10.1121/1.391940 PERKELL JS, 1994, J ACOUSTICAL SOC A 2, V96, P3326, DOI 10.1121/1.410722 Perrier P, 1996, J SPEECH HEAR RES, V39, P365 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434 PISONI DB, 1973, PERCEPT PSYCHOPHYS, V13, P253, DOI 10.3758/BF03214136 PLANT G, 1986, 11986 SPEECH TRANSM, P65 SAVARIAUX C, 1995, J ACOUST SOC AM, V98, P2428, DOI 10.1121/1.413277 SCHWARTZ JL, IN PRESS J PHONETICS Shadle C., 1985, 506 MIT RES LAB EL Stevens KN, 1972, HUMAN COMMUNICATION, P51 STEVENS KN, 1989, J PHONETICS, V17, P3 SVIRSKY MA, 1991, J ACOUST SOC AM, V89, P2895, DOI 10.1121/1.400727 SVIRSKY MA, 1992, J ACOUST SOC AM, V92, P1284, DOI 10.1121/1.403923 VOLAITIS LE, 1992, J ACOUST SOC AM, V92, P723, DOI 10.1121/1.403997 WIEGNER AW, 1992, EXP BRAIN RES, V88, P665 NR 72 TC 68 Z9 70 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1997 VL 22 IS 2-3 BP 227 EP 250 DI 10.1016/S0167-6393(97)00026-5 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600010 ER PT J AU Bailly, G AF Bailly, G TI Learning to speak. Sensori-motor control of speech movements SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE speech sound acquisition; speech production; coarticulation; acoustic-to-articulatory inversion; motor control ID VOWEL-U; COARTICULATION AB This paper shows how an articulatory model, able to produce acoustic signals from articulatory motion, can learn to speak, i.e. coordinate its movements in such a way that it utters meaningful sequences of sounds belonging to a given language. This complex learning procedure is accomplished in four major steps: (a) a babbling phase, when the device builds up a model of the forward transforms, i.e. the articulatory-to-audio-visual mapping; (b) an imitation stage, where it tries to reproduce a limited set of sound sequences by audio-visual-to-articulatory inversion; (c) a "shaping" stage, where phonemes are associated with the most efficient available sensori-motor representation; and finally, (d) a "rhythmic" phase, where it learns the appropriate coordination of the activations of these sensori-motor targets. (C) 1997 Elsevier Science B.V. C1 INPG, Inst Commun Parlee, F-38031 Grenoble 1, France. Univ Grenoble 3, F-38031 Grenoble 1, France. RP Bailly, G (reprint author), INPG, Inst Commun Parlee, 46 Ave Felix Viallet, F-38031 Grenoble 1, France. 
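The first two stages of the learning procedure described in the Bailly abstract above, babbling and imitation, can be sketched in a few lines under strong simplifying assumptions: the "vocal apparatus" below is a made-up two-parameter function, the learned forward model is linear, and audio-visual-to-articulatory inversion is plain gradient descent on that learned model; none of the names or dimensions are the paper's.

import numpy as np

def true_forward(a):
    # stand-in vocal apparatus: 2 articulatory params -> 2 formant-like values
    return np.array([700.0 - 400.0 * np.tanh(a[0]),
                     1500.0 + 800.0 * np.tanh(a[1] - a[0])])

rng = np.random.default_rng(2)

# (a) babbling: random motor exploration, recording (articulation, sound) pairs
A = rng.uniform(-1.0, 1.0, size=(500, 2))
S = np.array([true_forward(a) for a in A])

# learn a linear forward model S ~ [A, 1] @ W by least squares
A_aug = np.hstack([A, np.ones((len(A), 1))])
W, *_ = np.linalg.lstsq(A_aug, S, rcond=None)
fwd = lambda a: np.append(a, 1.0) @ W

# (b) imitation: gradient descent on ||fwd(a) - target||^2 inverts the model
target = true_forward(np.array([0.3, -0.5]))
a = np.zeros(2)
for _ in range(500):
    grad = 2.0 * W[:2] @ (fwd(a) - target)
    a -= 1e-6 * grad

print("target sound  :", target)
print("imitated sound:", true_forward(a))   # approximate; the model is only linear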
EM bailly@icp.grenet.fr CR ABRY C, 1995, INT C PHON SCI STOCK, V4, P152 ABRY C, 1985, JOURNEES ETUDES PARO, P133 ABRY C, 1994, ADV SPEECH APPL, P182 ABRY C, 1996, ETRW SPEECH PRODUCTI BADIN P, 1996, ETRW SPEECH PRODUCTI, P221 BADIN P, 1995, ICA 95 - PROCEEDINGS OF THE 15TH INTERNATIONAL CONGRESS ON ACOUSTICS, VOL IV, P349 BADIN P, 1984, 2 KTH DEP SPEECH COM BAILLY G, 1994, 2ESCA WORKH SPEECH S, P9 BAILLY G, 1991, J PHONETICS, V19, P9 BAILLY G, 1995, P EUR C SPEECH COMM, V2, P1913 BAILLY G, 1995, INT C PHON SCI STOCK, V1, P230 Bailly Gerard, 1995, P91 BEAUTEMPS D, 1996, ETRW SPEECH PRODUCTI, P45 BIJELJACBABIC R, 1993, DEV PSYCHOL, V29, P711, DOI 10.1037/0012-1649.29.4.711 BOE LJ, 1994, FUNDAMENTALS SPEECH, P185 DAVIS BL, 1994, LANG SPEECH, V37, P341 DAVIS BL, 1995, J SPEECH HEAR RES, V38, P1199 FANT G, 1992, ISCLP 92 P, V1, P807 FOURAKIS M, 1991, J ACOUST SOC AM, V90, P1816, DOI 10.1121/1.401662 GUENTHER FH, 1992, THESIS BOSTON U BOST GUENTHER FH, 1995, INT C PHON SCI STOCK, V2, P93 GUENTHER FH, 1995, PSYCHOL REV, V102, P594 JORDAN MI, 1988, 8827 U MASS COMP INF KABURAGI T, 1996, ETRW SPEECH PRODUCTI, P137 KUHL PK, 1987, CATEGORICAL PERCEPTI, P335 LABOISSIERE R, 1995, EUR C SPEECH COMM TE, V2, P1289 LECLERCQ C, 1996, THESIS I NATL POLYTE LEE SH, 1994, INT C SPEECH LANG PR, V1, P37 LILJENCR.J, 1972, LANGUAGE, V48, P839, DOI 10.2307/411991 LINDBLOM B, 1979, J PHONETICS, V7, P141 LOEVENBRUCK H, 1993, P EUR C SPEECH COMM, P85 LOEVENBRUCK H, 1996, ETRW SPEECH PRODUCTI, P117 Markey K. L., 1994, THESIS U COLORADO BO MARKEY KL, 1994, PROCEEDINGS OF THE SIXTEENTH ANNUAL CONFERENCE OF THE COGNITIVE SCIENCE SOCIETY, P595 MEHLER J, 1996, BOOTSTRAPPING SPEECH, P101 MOON SJ, 1994, THESIS U TEXAS AUSTI MORASSO P, 1994, SPEECH MAPS WP3 DYNA, P42 NOWLAN SJ, 1991, NEURAL INFORMATION P, V2, P574 OHMAN SEG, 1967, J ACOUST SOC AM, V41, P310 PAYAN Y, 1995, INT C PHON SCI STOCK, V2, P474 PERKELL JS, 1993, J ACOUST SOC AM, V93, P2948, DOI 10.1121/1.405814 Perrier P, 1996, J SPEECH HEAR RES, V39, P365 REPP BH, 1985, SPEECH COMMUN, V4, P105, DOI 10.1016/0167-6393(85)90039-1 RUBIN P, 1981, J ACOUST SOC AM, V70, P321, DOI 10.1121/1.386780 SALTZMAN EL, 1989, ECOL PSYCHOL, V1, P1615 SANGUINETI V, 1997, SELF ORG COMPUTATION, P1 SANGUINETI V, UNPUB BIOL CYBERNETI SAVARIAUX C, 1995, J ACOUST SOC AM, V98, P2428, DOI 10.1121/1.413277 Smits R, 1996, J ACOUST SOC AM, V100, P3852, DOI 10.1121/1.417241 Smits R., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607311 SOQUET A, 1994, INT C SPEECH LANG PR, V2, P1643 SOQUET A, 1995, THESIS U LIBRE BRUXE STEVENS KN, 1991, SPEECH PERCEPTION BA, P83 SUSSMAN HM, 1991, J ACOUST SOC AM, V90, P1309, DOI 10.1121/1.401923 WHALEN DH, 1990, J PHONETICS, V18, P3 WILHELMSTRICARICO R, 1995, J ACOUST SOC AM, V97, P3085, DOI 10.1121/1.411871 NR 56 TC 28 Z9 29 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD AUG PY 1997 VL 22 IS 2-3 BP 251 EP 267 DI 10.1016/S0167-6393(97)00025-3 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600011 ER PT J AU MacNeilage, PF Davis, BL Matyear, CL AF MacNeilage, PF Davis, BL Matyear, CL TI Babbling and first words: Phonetic similarities and differences SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE phonetic; acquisition; babbling; first words AB There is a strong consensus that the sounds and sound patterns of babbling and early speech are basically the same. The common state is one of "Frame Dominance" - a syllabic frame produced by an open-close mandibular oscillation dominates both stages, with limited ability of other articulators, including the tongue, to produce active intrasyllabic and intersyllabic changes. The question of whether the first words are similar to babbling in all respects was evaluated in four subjects, using a database consisting of 152 hours of audio recording. A tendency towards increasing use of labial consonants relative to alveolar consonants was observed in three of the four subjects, and this was interpreted as a regression towards an easier production form. Progress in words took the form of an increase in variegation of utterances, mainly due to vowel variegation, much of which derived from an increase in the use of high vowels and mid back vowels, especially in word-final position. The presence of regression and the limited nature of the progress were taken as evidence of the strength of the Frame Dominance pattern and the consequent difficulty of escaping from it. (C) 1997 Elsevier Science B.V. C1 Univ Texas, Dept Psychol, Austin, TX 78712 USA. RP MacNeilage, PF (reprint author), Univ Texas, Dept Psychol, Austin, TX 78712 USA. EM babs@mail.utexas.edu CR BOYSSONBARDIES B, 1992, PHONOLOGICAL DEV MOD DAVIS BL, 1995, J SPEECH HEAR RES, V38, P1199 DAVIS BL, 1990, J SPEECH HEAR RES, V33, P16 DAVIS BL, 1994, LANG SPEECH, V31, P341 ELBERS L, 1982, COGNITION, V12, P45, DOI 10.1016/0010-0277(82)90029-4 Jakobson R., 1968, CHILD LANGUAGE APHAS Locke J. L., 1983, PHONOLOGICAL ACQUISI MACKEN MA, 1992, PHONOLOGICAL DEV MOD MACNEILAGE PF, 1996, 1 EUR SPEECH COMM AS MACNEILAGE PF, 1990, ATTENTION PERFORMANC, V11 MACNEILAGE PF, 1990, SPEECH PODUCTION SPE Maddieson I., 1984, PATTERNS SOUNDS MITCHELL PR, 1990, J CHILD LANG, V17, P247 Oller D. K., 1980, CHILD PHONOLOGY, V1 Oller D.K., 1976, J CHILD LANG, V3, P1 Paschall L., 1983, PHONOLOGICAL DEV CHI, P27 ROBB MP, 1994, CLIN LINGUIST PHONET, V8, P295, DOI 10.3109/02699209408985314 SCHWARTZ RG, 1982, J CHILD LANG, V9, P319 SMITH BL, 1989, 1ST LANGUAGE, V17, P147 Stark R., 1980, CHILD PHONOLOGY, V1 THORNDIKE EL, 1944, TEACHERS BOOK 30000 VIHMAN M, 1985, LANGUAGE, V60, P397 VIHMAN MM, 1986, APPL PSYCHOLINGUIST, V7, P3, DOI 10.1017/S0142716400007165 Waterson N., 1971, J LINGUIST, V7, P179, DOI [10.1017/S0022226700002917S0022226700002917, DOI 10.1017/S0022226700002917] NR 24 TC 26 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD AUG PY 1997 VL 22 IS 2-3 BP 269 EP 277 DI 10.1016/S0167-6393(97)00022-8 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600012 ER PT J AU Piske, T AF Piske, T TI Phonological organization in early speech production: Evidence for the importance of articulatory patterns SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 1st ESCA Workshop on Speech Production Modeling CY MAY 20-24, 1996 CL AUTRANS, FRANCE SP ESCA DE early phonological and lexical development; articulatory patterns; phonetic variation ID ACQUISITION AB The present study investigates the units and strategies on the basis of which children initially organize their speech. Transcribed longitudinal data from children acquiring German as their L1 are presented. The data were obtained in weekly recording sessions, which began when the subjects were between seven and thirteen months old. All the material was collected within the framework of the Kiel Project on Early Phonological Development. The evidence presented in this paper suggests that a limited inventory of articulatory patterns functioning as underlying organizational units may determine the phonetic structure of the large majority of a child's first words. The patterns are most probably constructed on the basis of a child's preferred articulations as well as on the basis of the acoustically and auditorily most salient features of adult model words. The articulatory patterns a child relies on may involve different types of syllable structure and they usually allow for a certain amount of phonetic variation in the child's forms for a specific word. There are basically five ways in which the early or original patterns change over time. (C) 1997 Elsevier Science B.V. C1 Univ Kiel, Engl Seminar, Dept English, D-24098 Kiel, Germany. RP Piske, T (reprint author), Univ Kiel, Engl Seminar, Dept English, Olshausenstr 40, D-24098 Kiel, Germany. EM piske@anglistik.uni-kiel.de CR DAVIS BL, 1995, P 13 INT C PHON SCI, V1, P150 DAVIS BL, 1990, J SPEECH HEAR RES, V33, P16 Ferguson C. A., 1986, INVARIANCE VARIABILI, P36 FERGUSON CA, 1975, LANGUAGE, V51, P419, DOI 10.2307/412864 Ferguson Charles A., 1978, COMMUNICATIVE COGNIT, P273 Friederici AD, 1996, Z SEMIOTIK, V18, P251 Ingram D., 1974, J CHILD LANG, V1, P49 Ingram David, 1989, 1 LANGUAGE ACQUISITI Jakobson R., 1941, KINDERSPRACHE APHASI Jusczyk P. W., 1992, PHONOLOGICAL DEV MOD, P17 Kent R. D., 1992, PHONOLOGICAL DEV MOD, P65 KRUGER B, 1995, P 13 INT C PHON SCI, V2, P694 KRUGER B, 1997, Z ANGEWANDTE LINGUIS, V26, P41 Lindblom B., 1992, PHONOLOGICAL DEV MOD, P131 LINDNER U, IN PRESS ROLLE SILBE LUDTKE H, 1969, PHONETICA, V20, P147 Macken M, 1992, PHONOLOGICAL DEV MOD, P249 Macken M. A., 1983, CHILDRENS LANGUAGE, V4, P255 MACKEN MA, 1979, LINGUA, V49, P11, DOI 10.1016/0024-3841(79)90073-1 Macneilage P. F., 1990, ATTENTION PERFORM, P453 McCune L., 1992, PHONOLOGICAL DEV MOD, P313 Menn L., 1983, LANGUAGE PRODUCTION, V2, P3 Menyuk P., 1979, LANG ACQUIS, P49 MOSKOWITZ AI, 1973, APPROACHES NATURAL L, P48 NITTROUER S, 1989, J SPEECH HEAR RES, V32, P120 Oller D. K., 1986, PRECURSORS EARLY SPE, P21 SENDLMEIER WF, 1995, PHONETICA, V52, P131 SENDLMEIER WF, 1991, SPRACHE KOGNIT, V10, P162 STAMPE D, 1969, 5 REG M CHIC LING SO, P433 Studdert-Kennedy M., 1992, SR111112 HASK LAB, P89 Studdert-Kennedy M., 1987, LANGUAGE PERCEPTION, P67 Vihman M.
M., 1987, PAPERS REPORTS CHILD, V26, P72 VIHMAN MM, 1985, LANGUAGE, V61, P397, DOI 10.2307/414151 VIHMAN MM, 1988, ARTICULATION PHONOLO, P60 VIHMAN MM, 1988, ARTICULATION PHONOLO, P110 WATERSON N, 1991, S CURR PHON RES PAR, P119 WATERSON N, 1987, SOZIOKULTURELLE PERS, P101 Waterson N., 1971, J LINGUIST, V7, P179, DOI [10.1017/S0022226700002917S0022226700002917, DOI 10.1017/S0022226700002917] WATERSON N, 1976, BABY TALK INFANT SPE, P294 WODE H, 1990, VARIABILITY 2 LANGUA, V1, P85 Wode H., 1992, PHONOLOGICAL DEV MOD, P605 WODE H, 1989, VARIATION 2 LANGUAGE, P176 Wode H., 1994, INT J APPL LINGUISTI, V4, P143, DOI 10.1111/j.1473-4192.1994.tb00061.x NR 43 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1997 VL 22 IS 2-3 BP 279 EP 295 DI 10.1016/S0167-6393(97)00023-X PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA YM111 UT WOS:000071029600013 ER PT J AU Lippmann, RP AF Lippmann, RP TI Speech recognition by machines and humans SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; speech perception; speech; perception; automatic speech recognition; machine recognition; performance; noise; nonsense syllables; nonsense sentences AB This paper reviews past work comparing modern speech recognition systems and humans to determine how far recent dramatic advances in technology have progressed towards the goal of human-like performance. Comparisons use six modern speech corpora with vocabularies ranging from 10 to more than 65,000 words and content ranging from read isolated words to spontaneous conversations. Error rates of machines are often more than an order of magnitude greater than those of humans for quiet, wideband, read speech. Machine performance degrades further below that of humans in noise, with channel variability, and for spontaneous speech. Humans can also recognize quiet, clearly spoken nonsense syllables and nonsense sentences with little high-level grammatical information. These comparisons suggest that the human-machine performance gap can be reduced by basic research on improving low-level acoustic-phonetic modeling, on improving robustness with noise and channel variability, and on more accurately modeling spontaneous speech. (C) 1997 Elsevier Science B.V. RP Lippmann, RP (reprint author), MIT, LINCOLN LAB, 244 WOOD ST, ROOM S4-121, LEXINGTON, MA 02173 USA. CR Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9 BROWN GJ, 1994, COMPUT SPEECH LANG, V8, P297, DOI 10.1006/csla.1994.1016 CHANG E, 1996, P IEEE INT C AC SPEE, P526 CHOU W, 1994, P INT C SPOK LANG PR COLE R, 1990, P INT JOINT C NEUR N, V2, P45 CULHANE C, 1996, P DARPA SPEECH REC W, P143 DALY N, 1987, THESIS MIT DESHMUKH N, 1996, P DARPA SPEECH REC W, P129 EBEL WJ, 1995, P SPOK LANG SYST TEC, P53 Fletcher H, 1929, BELL SYST TECH J, V8, P806 Godfrey J., 1992, P ICASSP, P517, DOI 10.1109/ICASSP.1992.225858 Gopinath RA, 1995, P ARPA WORKSH SPOK L, P127 Gorin AL, 1994, IEEE T SPEECH AUDI P, V2, P224, DOI 10.1109/89.260365 Huang X.
D., 1991, P ICASSP 91, P345, DOI 10.1109/ICASSP.1991.150347 JELINEK F, 1985, P IEEE, V73, P1616, DOI 10.1109/PROC.1985.13343 Kakehi K., 1992, SPEECH PERCEPTION PR, P135 KRYTER KD, 1960, J ACOUST SOC AM, V32, P547, DOI 10.1121/1.1908140 KUBALA F, 1995, P ARPA SPOK LANG TEC, P41 *LDC, 1995, SWITCHBOARD US MAN Lee K.-F., 1989, AUTOMATIC SPEECH REC Leonard R., 1984, P IEEE INT C AC SPEE LICKLIDER JCR, 1948, J ACOUST SOC AM, V20, P42, DOI 10.1121/1.1906346 Lippmann RP, 1996, IEEE T SPEECH AUDI P, V4, P66, DOI 10.1109/TSA.1996.481454 LIPPMANN RP, 1987, 1987 P IEEE INT C AC, P705 LIPPMANN RP, 1981, J ACOUST SOC AM, V69, P524, DOI 10.1121/1.385375 Liu JY, 1996, P TECH AS P, P157 MARTIN A, 1996, COMMUNICATION Miller G, 1991, SCI WORDS MILLER GA, 1962, IRE T INFORM THEOR, V8, P81, DOI 10.1109/TIT.1962.1057697 Pallett D. S., 1995, P SPOK LANG TECHN WO, P5 PALLETT DS, 1991, P DARPA SPEECH NAT L, P49, DOI 10.3115/112405.112411 PAUL DB, 1992, P WORKSH SPEECH NAT, P357, DOI 10.3115/1075527.1075614 PESKIN B, 1996, P IEEE INT C AC SPEE, P303 POLLACK I, 1963, LANG SPEECH, V6, P165 Pols L. C. W., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing Price P., 1988, P IEEE INT C AC SPEE, P651 STERN RM, 1995, P SPEECH REC WORKSH, P5 VANLEEUWEN DA, 1995, HUMAN BENCHMARKS SPE, P1461 VARGA A, 1993, SPEECH COMMUN, V12, P247, DOI 10.1016/0167-6393(93)90095-3 WILLIAMS CE, 1968, J ACOUST SOC AM, V44, P1002, DOI 10.1121/1.1911189 Woodland P. C., 1996, P ARPA SPEECH REC WO, P99 Young S, 1996, IEEE SIGNAL PROC MAG, V13, P45, DOI 10.1109/79.536824 Young SJ, 1994, IEEE T SPEECH AUDI P, V2, P615, DOI 10.1109/89.326619 Zue V. W., 1989, P DARPA SPEECH NAT L, P179, DOI 10.3115/100964.100983 NR 44 TC 201 Z9 205 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1997 VL 22 IS 1 BP 1 EP 15 DI 10.1016/S0167-6393(97)00021-6 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XW430 UT WOS:A1997XW43000001 ER PT J AU Poo, GS AF Poo, GS TI Large vocabulary Mandarin final recognition based on Two-level Time-delay Neural Networks (TLTDNN) SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; Mandarin finals; Time-Delay Neural Network (TDNN) AB A Two-Level Time-Delay Neural Network (TLTDNN) technique has been developed to recognize all Mandarin Finals across the entire set of Chinese syllables. The first level discriminates the vowel-group based on (a,e,i,o,u,v) and the nasal-group based on nasal ending (-n, -ng, -others). The nasal-group discriminator is used to further split the large /a/ subgroup produced by the vowel-group discriminator. The two groupings in the first level produce 8 subgroups in the second level. Further discrimination in the second level enables the identification of all 35 Mandarin Finals. The technique was thoroughly tested using 8 sets of 1265 isolated Hanyu Pinyin syllables, with 6 sets used for training and 2 sets used for testing. The overall results show that high recognition rates of 99.4% on the training datasets and 95.6% on the test datasets are achievable. The top-4 recognition rate attained on the test datasets is as high as 99.1%. (C) 1997 Elsevier Science B.V. RP Poo, GS (reprint author), NATL UNIV SINGAPORE, DEPT INFORMAT SYST & COMP SCI, SINGAPORE 119260, SINGAPORE.
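The two-level decision rule in the Poo abstract above can be sketched as pure routing logic. Everything below is a stand-in: the trained TDNNs are replaced by random-weight stub classifiers and the per-subgroup final inventories are invented placeholders; only the structure, six vowel groups with /a/ split three ways by nasal ending into eight second-level subgroups, follows the abstract.

import numpy as np

rng = np.random.default_rng(0)
N_FEATS = 16                                  # assumed input feature size

SUBGROUPS = ["a-n", "a-ng", "a-other", "e", "i", "o", "u", "v"]
FINALS = {g: [f"{g}_final{k}" for k in range(4)] for g in SUBGROUPS}  # placeholders

def stub_classifier(n_in, n_out):
    # stands in for a trained TDNN; returns the index of the winning class
    W = rng.normal(size=(n_out, n_in))
    return lambda x: int(np.argmax(W @ x))

vowel_net = stub_classifier(N_FEATS, 6)       # a, e, i, o, u, v
nasal_net = stub_classifier(N_FEATS, 3)       # -n, -ng, -other
second_level = {g: stub_classifier(N_FEATS, len(FINALS[g])) for g in SUBGROUPS}

def classify_final(x):
    vowel = "aeiouv"[vowel_net(x)]
    if vowel == "a":                          # /a/ subgroup split by nasal ending
        group = ["a-n", "a-ng", "a-other"][nasal_net(x)]
    else:
        group = vowel
    return FINALS[group][second_level[group](x)]

print(classify_final(rng.normal(size=N_FEATS)))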
CR CHAN LCM, 1986, SPEECH COMMUN, V5, P299, DOI 10.1016/0167-6393(86)90015-4 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 HAFFNER P, 1989, P EUR C SPEECH COMM, P553 HWANG CH, 1988, COMPUTER P CHINESE O, V3, P257 Lee LS, 1993, IEEE T SPEECH AUDI P, V1, P158, DOI 10.1109/89.222876 Lee L.-s., 1991, COMPUTER SPEECH LANG, V5, P181, DOI 10.1016/0885-2308(91)90024-K LIN CH, 1993, P INT C AC SPEECH SI, P227 LIN LJ, 1987, P INT C CHIN OR LANG, P234 WAIBEL A, 1989, IEEE T ACOUST SPEECH, V37, P1888, DOI 10.1109/29.45535 WAIBEL A, 1989, IEEE T ACOUST SPEECH, V37, P328, DOI 10.1109/29.21701 WU JX, 1987, P INT C CHIN OR LANG, P234 WU Z, 1987, INTRO EXPT SPEECH NR 12 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1997 VL 22 IS 1 BP 17 EP 24 DI 10.1016/S0167-6393(97)00013-7 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XW430 UT WOS:A1997XW43000002 ER PT J AU Swerts, M Ostendorf, M AF Swerts, M Ostendorf, M TI Prosodic and lexical indications of discourse structure in human-machine interactions SO SPEECH COMMUNICATION LA English DT Article DE prosody; discourse structure; human-machine interaction; spoken dialogue systems ID SPEECH-ACT TYPE; INTONATION; DIALOG; TEXT AB From a discourse perspective, utterances may vary in at least two important respects: (i) they can occupy a different hierarchical position in a larger-scale information unit and (ii) they can represent different types of speech acts. Spoken language systems will improve if they adequately take into account both discourse segmentation and utterance purpose. An important question then is how such discourse-structural features can be detected. Analyses of monologues and human-human dialogues have shown that a good indicator of these factors is prosody, defined as the set of suprasegmental speech features. This paper explores whether speakers also use prosody to highlight discourse structure in a particular type of human-machine interaction, viz., information query in a travel-planning domain. More specifically, it investigates if speakers signal (i) the start of a new topic by marking the initial utterance of a discourse segment, and (ii) whether an utterance is a normal request for information or part of a correction sub-dialogue. The study reveals that in human-machine interactions, both discourse segmentation and utterance purpose can have particular prosodic correlates, although speakers also mark this information through choice of wording. Therefore, it is useful to explore in the future the possibilities of incorporating prosody in spoken language systems as a cue to discourse structure. (C) 1997 Elsevier Science B.V. C1 IPO, CTR RES USER SYST INTERACT, NL-5600 MB EINDHOVEN, NETHERLANDS. BOSTON UNIV, DEPT ELECT COMP & SYST ENGN, BOSTON, MA 02215 USA. RP Swerts, M (reprint author), UNIV INSTELLING ANTWERP, CTR NEDERLANDSE TAAL & SPRAAK, UNIV PL 1, B-2610 WILRIJK, BELGIUM. 
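Purely as a hypothetical illustration of how prosodic cues of the kind analysed by Swerts and Ostendorf could be operationalised in a spoken language system (this is not the authors' analysis): synthetic pause-duration, onset-F0 and F0-range features feed a CART-style classification tree (cf. the cited Olshen et al., 1984) that labels an utterance as discourse-segment-initial or not.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
n = 200
initial = rng.integers(0, 2, size=n)                     # 1 = segment-initial
pause_before = rng.gamma(2.0, 0.3, size=n) + 0.8 * initial          # seconds
onset_f0 = 180.0 + 25.0 * initial + rng.normal(0.0, 10.0, size=n)   # Hz
f0_range = 40.0 + 15.0 * initial + rng.normal(0.0, 8.0, size=n)     # Hz

X = np.column_stack([pause_before, onset_f0, f0_range])
tree = DecisionTreeClassifier(max_depth=3).fit(X[:150], initial[:150])
print("held-out accuracy:", tree.score(X[150:], initial[150:]))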
RI Swerts, Marc/C-8855-2013 CR BATLINER A, 1993, P ESCA WORKSH PROS L, P112 BENNACEF SK, 1995, P ESCA WORKSH SPOK D, P237 BLACK A, 1996, COMPUTING PROSODY Brown Gillian, 1980, QUESTIONS INTONATION BRUBAKER RS, 1972, J PSYCHOLINGUIST RES, V1, P141, DOI 10.1007/BF01068103 BRUCE G, 1982, PHONETICA, V39, P274 Chafe W., 1980, PEAR STORIES COLE R, 1995, IEEE T SPEECH AUDIO, V30, P1 Dahl D.A., 1994, P ARPA WORKSH HUM LA, P43, DOI 10.3115/1075812.1075823 DALY N, 1990, P INT C SPOK LANG PR, P497 FLAMMIA G, 1995, P EUR C SPEECH COMM, P1965 GELUYKENS R, 1989, J PRAGMATICS, V13, P567, DOI 10.1016/0378-2166(89)90041-6 GELUYKENS R, 1987, J PRAGMATICS, V11, P483, DOI 10.1016/0378-2166(87)90091-9 Grosz B., 1992, P INT C SPOK LANG PR, P429 Grosz B. J., 1986, Computational Linguistics, V12 HADDINGKOCH K, 1964, PHONETICA, V11, P175 HEARST M, 1994, P 32 ANN M ASS COMP HIRSCHBERG J, UNPUB PROSODIC ANAL HIRSCHBERG J, 1993, ARTIF INTELL, V63, P305, DOI 10.1016/0004-3702(93)90020-C Hirschberg Julia, 1987, P 25 ANN M ASS COMP, P163, DOI 10.3115/981175.981198 HOWTKO J, 1992, HCRCRP31 U ED HUM CO Kenne P. E., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607463 KOMPE R, 1994, SPEECH COMMUN, V15, P155, DOI 10.1016/0167-6393(94)90049-3 Lehiste I, 1980, STRUCTURE PROCESS SP, P195 MOORE R, 1994, P ARPA WORKSH SPOK L, P72 NAGATA M, 1994, SPEECH COMMUN, V15, P193, DOI 10.1016/0167-6393(94)90071-X NAKATANI C, 1996, PROGR SPEECH SYNTHES, P139 NAKATANI C, 1996, COMPUTING PROSODY Olshen R., 1984, CLASSIFICATION REGRE, V1st OSTENDORF M, 1996, COMPUTING PROSODY PASSONNEAU R, 1993, P ACL 93 Pitrelli J., 1994, P 3 INT C SPOK LANG, P123 SAG IA, 1975, 11 REG M CHIC LING S Shriberg E. E., 1994, THESIS U CALIFORNIA SLUIJTER AMC, 1993, PHONETICA, V50, P180 Swerts M, 1997, J ACOUST SOC AM, V101, P514, DOI 10.1121/1.418114 SWERTS M, 1994, LANG SPEECH, V37, P21 SWERTS M, 1994, SPEECH COMMUN, V15, P79, DOI 10.1016/0167-6393(94)90043-4 SWERTS MGJ, 1994, THESIS EINDHOVEN U T TERKEN JMB, 1984, LANG SPEECH, V27, P53 THORSEN NG, 1985, J ACOUST SOC AM, V77, P1205, DOI 10.1121/1.392187 WARD G, 1985, LANGUAGE, V61, P747, DOI 10.2307/414489 NR 42 TC 17 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1997 VL 22 IS 1 BP 25 EP 41 DI 10.1016/S0167-6393(97)00011-3 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XW430 UT WOS:A1997XW43000003 ER PT J AU Laan, GPM AF Laan, GPM TI The contribution of intonation, segmental durations, and spectral features to the perception of a spontaneous and a read speaking style SO SPEECH COMMUNICATION LA English DT Article DE speaking styles; spontaneous speech; read out speech; perception of speaking styles; acoustic features of speaking styles ID SPONTANEOUS SPEECH AB The influence of pitch contour, segmental durations, and spectral features on the perception of two speaking styles was studied. For this purpose two male speakers each spoke ''spontaneously'' to an interviewer and afterwards read out their own literally transcribed spontaneous text. Pairs of identical spontaneous and read utterances were selected that were fluently spoken in both speaking styles (no false starts, hesitations, etc.). 
Five test conditions were constructed in which the utterances had: (1) no manipulations; (2) phoneme durations from the opposite speaking style; (3) the pitch contour from the opposite speaking style; (4) a monotonous pitch contour; (5) the original spectral features combined with the prosodic features of the opposite speaking style. The stimuli were presented to 32 subjects in a listening experiment. Their task was to classify each utterance as either ''spontaneous'' or ''read out'' speech. All manipulations of the test utterances had a significant effect on the classification of the speaking style. We also analysed the original utterances with respect to several acoustic measures for intonation, duration, jitter and shimmer, and spectral vowel quality. Overall, read speech compared to spontaneous speech had: a lower articulation rate, more F0 variation, more F0 declination, less shimmer, and less vowel reduction. However, none of these acoustic features by itself can clearly discriminate between the two speaking styles. Above all it became clear that the performance of the speakers and the listeners varied enormously. (C) 1997 Elsevier Science B.V. RP Laan, GPM (reprint author), UNIV AMSTERDAM, INST PHONET SCI, HERENGRACHT 338, NL-1016 CG AMSTERDAM, NETHERLANDS. CR BARIK HC, 1977, LANG SPEECH, V20, P116 Batliner A., 1994, SPEECH RECOGNITION C, P321 Blaauw E., 1995, THESIS UTRECHT U BLAAUW E, 1994, SPEECH COMMUN, V14, P359, DOI 10.1016/0167-6393(94)90028-0 COHEN J, 1960, EDUC PSYCHOL MEAS, V20, P37, DOI 10.1177/001316446002000104 DAVIS S, 1976, SPEECH COMMUN LABS M, V13 ESKENAZI L, 1990, J SPEECH HEAR RES, V33, P298 Eskenazi M., 1993, P EUROSPEECH 93, V1, P501 ESKENAZI M, 1992, P ICSLP 92, V1, P755 FLEISS JL, 1971, PSYCHOL BULL, V76, P378, DOI 10.1037/h0031619 Freeman D., 1987, APPL CATEGORICAL DAT HIGGINS MB, 1989, J ACOUST SOC AM, V86, P911, DOI 10.1121/1.398778 HORII Y, 1980, J SPEECH HEAR RES, V23, P202 HOWELL P, 1991, SPEECH COMMUN, V10, P163, DOI 10.1016/0167-6393(91)90039-V KOCH GG, 1977, BIOMETRICS, V33, P133, DOI 10.2307/2529309 KOOPMANSVANBEIN.FJ, 1980, THESIS U AMSTERDAM KOOPMANSVANBEINUM FJ, 1992, SPEECH COMMUN, V11, P439, DOI 10.1016/0167-6393(92)90049-D KRAAYEVELD J, 1997, THESIS U NIJMEGEN LAAN GPM, 1991, P EUROSPEECH 91, V3, P1129 LAAN GPM, 1991, 116 U AMST I PHON SC LEVIN H, 1982, LANG SPEECH, V25, P43 LIEBERMAN P, 1985, J ACOUST SOC AM, V77, P649, DOI 10.1121/1.391883 MAKHOUL J, 1976, P IEEE INT C AC SPEE, P466 MEHTA G, 1988, LANG SPEECH, V31, P135 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z RAMIG LA, 1983, J SPEECH HEAR RES, V26, P22 REMEZ RE, 1991, J ACOUST SOC AM S4, V89, P2011, DOI 10.1121/1.2029889 REMEZ RE, 1986, J ACOUST SOC AM S1, V79, P26 REMEZ RE, 1987, J ACOUST SOC AM S1, V81, P2 SYSTAT, 1992, SYSTAT STAT VERS 5 2 TIELEN MTJ, 1992, THESIS U AMSTERDAM UMEDA N, 1992, P ICSLP 92, V1, P759 UMEDA N, 1982, J PHONETICS, V10, P279 VANBERGEM DR, 1993, SPEECH COMMUN, V12, P1, DOI 10.1016/0167-6393(93)90015-D van Bergem Dick, 1995, THESIS U AMSTERDAM VANBERGEM DR, 1988, P I PHON SIC U AMSTE, V12, P61 VANBERGEM DR, 1990, P I PHON SCI U AMST, V14, P17 Verhelst W., 1991, P EUROSPEECH 91, P1319 WILLEMS LF, 1986, 21 IPO, P34 NR 39 TC 29 Z9 30 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
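Two of the acoustic measures in the Laan abstract above reduce to one-liners under assumed definitions: F0 declination as the slope of a straight line fitted to the utterance's F0 contour, and articulation rate as phones per second of actual speech. The contour, phone count and speech duration below are synthetic stand-ins for pitch-tracker and segmentation output.

import numpy as np

t = np.linspace(0.0, 2.0, 200)                  # a 2 s utterance
f0 = 220.0 - 12.0 * t + 5.0 * np.sin(12.0 * t)  # declining F0 contour (Hz)

slope, _ = np.polyfit(t, f0, 1)
print(f"F0 declination   : {slope:.1f} Hz/s")

n_phones, speech_dur = 28, 1.8                  # assumed segmentation result
print(f"articulation rate: {n_phones / speech_dur:.1f} phones/s")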
PD JUL PY 1997 VL 22 IS 1 BP 43 EP 65 DI 10.1016/S0167-6393(97)00012-5 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XW430 UT WOS:A1997XW43000004 ER PT J AU Alku, P Strik, H Vilkman, E AF Alku, P Strik, H Vilkman, E TI Parabolic spectral parameter - A new method for quantification of the glottal flow SO SPEECH COMMUNICATION LA English DT Article DE voice production ID VOICE SOURCE; AIR-FLOW; VOCAL QUALITY; SPEECH; PERCEPTION; WAVEFORM; PRESSURE; SPEAKERS AB This study presents a new frequency domain parameter, Parabolic Spectral Parameter (PSP), for the quantification of the glottal volume velocity waveform. PSP is based on fitting a parabolic function to the low-frequency part of a pitch-synchronously computed spectrum of the estimated glottal flow. PSP gives a single numerical value that describes how the spectral decay of the obtained glottal flow behaves with respect to a theoretical limit corresponding to maximal spectral decay. By analyzing speech signals of different phonation types, the performance of the new parameter is compared to that of three commonly used time-based parameters and to one previously developed frequency domain method. (C) 1997 Elsevier Science B.V. C1 UNIV TURKU, DEPT APPL PHYS, FIN-20014 TURKU, FINLAND. UNIV NIJMEGEN, DEPT LANGUAGE & SPEECH, NL-6500 HD NIJMEGEN, NETHERLANDS. UNIV OULU, DEPT OTOLARYNGOL & PHONIATR, FIN-90220 OULU, FINLAND. RI Alku, Paavo/E-2400-2012 CR Alku P., 1994, P INT C SPOK LANG PR, P1619 ALKU P, 1995, J ACOUST SOC AM, V98, P763, DOI 10.1121/1.413569 CARLSON R, 1991, SPEECH COMMUN, V10, P481, DOI 10.1016/0167-6393(91)90051-T CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 DROMEY C, 1992, J VOICE, V6, P44, DOI 10.1016/S0892-1997(05)80008-6 ELJAROUDI A, 1991, IEEE T SIGNAL PROCES, V39, P411, DOI 10.1109/78.80824 FANT G, 1993, SPEECH COMMUN, V13, P7, DOI 10.1016/0167-6393(93)90055-P FANT G, 1995, 231995 STLQPSR, P119 Fant G., 1985, 4 PARAMETER MODEL GL, P1 GAUFFIN J, 1989, J SPEECH HEAR RES, V32, P556 HERTEGARD S, 1992, J VOICE, V6, P224, DOI 10.1016/S0892-1997(05)80147-X HERTEGARD S, 1990, J VOICE, V4, P220, DOI 10.1016/S0892-1997(05)80017-7 HERTEGARD S, 1992, 23 ROYAL I TECHN SPE, P9 Hillenbrand J, 1996, J SPEECH HEAR RES, V39, P311 HILLMAN RE, 1990, J VOICE, V4, P52, DOI 10.1016/S0892-1997(05)80082-7 HOLMBERG E B, 1989, Journal of Voice, V3, P294, DOI 10.1016/S0892-1997(89)80051-7 HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829 HOWELL P, 1988, J ACOUST SOC AM, V84, P80, DOI 10.1121/1.396877 HOWELL P, 1992, J ACOUST SOC AM, V91, P1697, DOI 10.1121/1.402449 KARLSSON I, 1990, P INT C SPOKEN LANGU, V1, P69 MATAUSEK MR, 1980, IEEE T ACOUST SPEECH, V28, P616, DOI 10.1109/TASSP.1980.1163483 O'Shaughnessy D., 1987, SPEECH COMMUNICATION Ohlsson A-C, 1987, SCAND J LOGOP PHONIA, V12, P70 Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE PRICE PJ, 1989, SPEECH COMMUN, V8, P261, DOI 10.1016/0167-6393(89)90005-8 ROTHENBE.M, 1973, J ACOUST SOC AM, V53, P1632, DOI 10.1121/1.1913513 STRIK H, 1992, SPEECH COMMUN, V11, P167, DOI 10.1016/0167-6393(92)90011-U SUNDBERG J, 1993, J VOICE, V7, P15, DOI 10.1016/S0892-1997(05)80108-0 TITZE IR, 1992, J ACOUST SOC AM, V91, P2936, DOI 10.1121/1.402929 VILKMAN E, 1997, IN PRESS FOLIA PHONI NR 30 TC 20 Z9 21 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
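The core of PSP as described in the Alku, Strik and Vilkman abstract above is a least-squares parabola fitted to the low-frequency dB spectrum of one glottal cycle. The sketch below does exactly that on a crude synthetic pulse, fitting spec_db = a*f^2 + b; the pulse shape, the 1 kHz band and the omission of the paper's normalisation against the maximal-decay limit are all simplifying assumptions.

import numpy as np

fs, n = 8000, 80                                  # 8 kHz, one 100 Hz cycle
t = np.arange(n) / n
flow = np.maximum(np.sin(np.pi * t / 0.55), 0.0) ** 2   # crude glottal pulse

spec = np.abs(np.fft.rfft(flow))
freqs = np.fft.rfftfreq(n, d=1.0 / fs)
lo = (freqs > 0) & (freqs <= 1000)                # low-frequency harmonics
spec_db = 20.0 * np.log10(spec[lo] / spec[lo].max())

# parabola in f with no linear term: spec_db ~ a*f^2 + b
A = np.column_stack([freqs[lo] ** 2, np.ones(lo.sum())])
(a, b), *_ = np.linalg.lstsq(A, spec_db, rcond=None)
print(f"fitted curvature a = {a:.3e} dB/Hz^2 (more negative = steeper decay)")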
PD JUL PY 1997 VL 22 IS 1 BP 67 EP 79 DI 10.1016/S0167-6393(97)00020-4 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XW430 UT WOS:A1997XW43000005 ER PT J AU Schoentgen, J Ciocea, S AF Schoentgen, J Ciocea, S TI Kinematic formant-to-area mapping SO SPEECH COMMUNICATION LA English DT Article DE area function models; vocal tract eigenmodes; formant-to-area mapping ID VOCAL-TRACT; ARTICULATORY MOTION; NEURAL NETWORKS; SPEECH; FREQUENCIES AB This article presents a method of formant-to-area mapping consisting of the direct calculation of the time derivatives of the cross-sections and length of a vocal tract model so that the time derivatives of the observed formant frequencies and the model's eigenfrequencies match. The vocal tract model is a concatenation of uniform tubelets whose cross-section areas and lengths can vary in time. Time derivatives of the tubelet parameters are obtained by solving a linear algebraic system of equations. The derivatives are then numerically integrated to arrive at cross-section and length movements. Since more than one area function is compatible with the observed formant frequencies, pseudo-energy constraints are used to determine a unique solution. The results show that the formant-matched movements of the tubelet cross-sections and lengths are smooth, and that the agreement between the observed and model-generated formant frequencies is better than 0.01 Hz. C1 FREE UNIV BRUSSELS, INST MODERN LANGUAGES & PHONET, B-1050 BRUSSELS, BELGIUM. CR ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 ATAL BS, 1989, J ACOUST SOC AM, V86, P123 BADIN P, 1984, STL QPSR, P53 BAJPAI A, 1980, SPECIALIST TECHNIQUE BEAUTEMPS D, 1995, SPEECH COMMUN, V16, P27, DOI 10.1016/0167-6393(94)00045-C BONDER LJ, 1983, ACUSTICA, V52, P216 Carre R., 1992, Journal d'Acoustique, V5 CHARPENTIER F, 1984, SPEECH COMMUN, V3, P291, DOI 10.1016/0167-6393(84)90025-6 Chiba T, 1958, VOWEL ITS NATURE STR Fant G., 1960, ACOUSTIC THEORY SPEE FANT G, 1992, P INT C SPOK LANG BA, P807 Flanagan J., 1972, SPEECH ANAL SYNTHESI FLANAGAN JL, 1980, J ACOUST SOC AM, V68, P780, DOI 10.1121/1.384817 Jospa P., 1995, P103 KELLY J, 1962, P 4 INT C AC, V1, P1 LABOISSIERE R, 1991, P 1990 CONN MOD SUMM, P319 LABOISSIERE R, 1992, THESIS I COMMUNICATI LADEFOGED P, 1978, J ACOUST SOC AM, V64, P1027, DOI 10.1121/1.382086 LIU Q, 1989, P EUROSPEECH, P673 MAEDA S, 1990, SPEECH PRODUCTION SP, P113 MAJID R, 1986, ACT 15 JOURN ET PAR, P59 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 Markel JD, 1976, LINEAR PREDICTION SP MRAYATI M, 1988, SPEECH COMMUN, V7, P257, DOI 10.1016/0167-6393(88)90073-8 Mura T, 1992, VARIATIONAL METHODS PITERMANN M, 1995, P 4 EUR C SPEECH COM, P1917 PRADO P, 1992, P INT C AC SPEECH SI, P33, DOI 10.1109/ICASSP.1992.226127 Press W. H., 1987, NUMERICAL RECIPES AR Rahim M.
G., 1991, P IEEE INT C AC SPEE, P485, DOI 10.1109/ICASSP.1991.150382 RAHIM MG, 1993, J ACOUST SOC AM, V93, P1109, DOI 10.1121/1.405559 RAHIM MG, 1994, ARTIFICIAL NEURAL NE SCHOENTGEN J, 1994, P INT C SPOK LANG PR, V2, P611 SCHOENTGEN J, 1995, J PHONETICS, V23, P189, DOI 10.1016/S0095-4470(95)80042-5 SCHROEDE.MR, 1967, J ACOUST SOC AM, V41, P1002, DOI 10.1121/1.1910429 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 SCHROETER P, 1990, P INT C AC SPEECH SI, P393 SHIRAI K, 1986, SPEECH COMMUN, V5, P159, DOI 10.1016/0167-6393(86)90005-1 SHIRAI K, 1977, DYNAMIC ASPECTS SPEE, P279 SHIRAI K, 1993, SPEECH COMMUN, V13, P45, DOI 10.1016/0167-6393(93)90058-S SHIRAI K, 1986, P INT C AC SPEECH SI, P2247 SHIRAI K, 1991, J PHONETICS, V19, P379 SMIRNOV V, 1970, COURS MATH SUPERLEUR, V2 SONDHI MM, 1986, J ACOUST SOC AM, V79, P1113, DOI 10.1121/1.393383 SONDHI MM, 1974, J ACOUST SOC AM, V55, P1070, DOI 10.1121/1.1914649 SOQUET A, 1990, P ESCA WORKSH SPEECH, P71 SOQUET A, 1995, THESIS U LIBRE BRUXE Stewart G. W., 1973, INTRO MATRIX COMPUTA NR 47 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1997 VL 21 IS 4 BP 227 EP 244 DI 10.1016/S0167-6393(97)00007-1 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XM478 UT WOS:A1997XM47800001 ER PT J AU Barnard, E Yan, YH AF Barnard, E Yan, YH TI Toward new language adaptation for language identification SO SPEECH COMMUNICATION LA English DT Article AB We study the adaptation of an existing language-identification system to new languages using a limited amount of training data. The platform used for this study is the system recently developed (Yan and Barnard, 1995a,b) to exploit phonotactic constraints based on language-dependent phone recognition. Using the proposed language model re-estimation technique based on probabilistic gradient descent, two new approaches and their combination are proposed and tested. These approaches all modify the phonotactic language models, so that they no longer equal the conventional maximum-likelihood estimate. The differences between these methods can be viewed as different ways of resampling information from the same amount of data. Experiments were conducted using the standard OGI_TS database (Muthusamy et al., 1992). For comparison, the baseline system (with traditional model estimation) was also subjected to the same set of tests. Systems trained with different amounts of training data in the new languages were evaluated. Compared with the conventional model estimation, the results demonstrate that the new methods improve adaptation to new languages. The success of the discriminative model shows that conventional model estimation is not optimal for language identification, so that improvements can be obtained by modifying the maximum-likelihood estimates of the language models. C1 OREGON GRAD INST SCI & TECHNOL, CTR SPOKEN LANGUAGE UNDERSTANDING, BEAVERTON, OR 97291 USA.
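Hedged sketch of the phonotactic setup summarised in the Barnard and Yan abstract above: per-language bigram models over phone strings score an utterance, and a sparsely trained new-language model is nudged away from its maximum-likelihood estimate by a gradient-style log-space boost of the bigrams seen in the adaptation data. The three-phone alphabet, the toy strings and the single update step are assumptions; the paper's re-estimation differs in detail.

import numpy as np

IDX = {p: i for i, p in enumerate("abc")}

def ml_bigram(seqs, alpha=1.0):
    # add-alpha (smoothed) maximum-likelihood bigram estimate
    counts = np.full((3, 3), alpha)
    for s in seqs:
        for x, y in zip(s, s[1:]):
            counts[IDX[x], IDX[y]] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def logprob(model, s):
    return sum(np.log(model[IDX[x], IDX[y]]) for x, y in zip(s, s[1:]))

old_lang = ml_bigram(["abcabc" * 5] * 20)     # well-trained existing language
new_lang = ml_bigram(["aabbaabb"] * 2)        # very little new-language data

# one discriminative step: boost observed bigrams in log-space, renormalise
logits = np.log(new_lang)
for x, y in zip("aabbaabb", "aabbaabb"[1:]):
    logits[IDX[x], IDX[y]] += 0.1
adapted = np.exp(logits)
adapted /= adapted.sum(axis=1, keepdims=True)

test = "aabbaabbaabb"
print("margin over old language, ML model     :", logprob(new_lang, test) - logprob(old_lang, test))
print("margin over old language, adapted model:", logprob(adapted, test) - logprob(old_lang, test))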
CR ANDERSEN O, 1994, P 1994 IEEE INT C AC, P121 BARNARD E, 1989, 89014 CSE OR GRAD I HUTCHINS S, 1995, P 15 ANN SPEECH RES, P76 ITAHASHI S, 1995, EUROSPEECH P, P1359 KADAMBE S, 1995, P IEEE INT C AC SPEE, P3507 KWAN H, 1995, EUROSPEECH P, P1367 LAMEL LF, 1994, P INT C ACOUSTICS SP, P293 LI K, 1995, P 1995 IEEE INT C AC, P3515 LUND M, 1995, EUROSPEECH P, P1363 MUTHUSAMY Y, 1992, P INT C SPOK LANG PR, V92, P895 Muthusamy YK, 1994, IEEE SIGNAL PROC MAG, V11, P33, DOI 10.1109/79.317925 PARRIS E, 1995, P 1995 IEEE INT C AC, P3503 RAMESH P, 1994, P INT C SPOK LANG PR, P1879 REYES AA, 1994, P INT C SPOKEN LANGU, P1895 YAN Y, 1995, P IEEE INT C AC SPEE, P3511 YAN Y, 1995, EUROSPEECH P, P1351 Yan YH, 1996, COMPUT SPEECH LANG, V10, P37, DOI 10.1006/csla.1996.0003 ZISSMAN M, 1995, P 15 ANN SPEECH RES, P2 Zissman M. A., 1995, P 1995 INT C AC SPEE, P3503 ZISSMAN MA, 1994, P INT C ACOUSTICS SP, P305 NR 20 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1997 VL 21 IS 4 BP 245 EP 254 DI 10.1016/S0167-6393(97)00009-5 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XM478 UT WOS:A1997XM47800002 ER PT J AU Schoentgen, J DeGuchteneere, R AF Schoentgen, J DeGuchteneere, R TI Predictable and random components of jitter SO SPEECH COMMUNICATION LA English DT Article DE jitter; linear auto-regressive modeling ID VOICE PERTURBATION MEASUREMENTS; LARYNGEAL PATHOLOGY; ACOUSTIC MEASURES; SPEECH SIGNALS; SPEAKERS AB The subject of this article is the study of the deterministic and random components of jitter by means of a statistical time series model. Jitter refers to the small fluctuations in glottal cycle lengths. The purpose of time series analysis is to take into account the fact that glottal cycles are produced sequentially and that relations between neighbouring perturbations exist. The jitter time series model statistically represents the present perturbation as a weighted sum of past perturbations and random noise. The model is fitted to observed jitter time series by means of conventional linear methods. A discriminant analysis of jitter time series extracted from 279 sustained vocoids [a] [i] [u] shows that the jitter features which separately describe the predictable and random components better characterise healthy and dysphonic speakers than a traditional jitter feature. The conclusion is that the relations between neighbouring cycle length perturbations are an aspect of jitter independent of the scatter of the cycle lengths which is described by conventional jitter features. C1 FREE UNIV BRUSSELS, INST MODERN LANGUAGES & PHONET, LAB EXPT PHONET, B-1050 BRUSSELS, BELGIUM. CR ASKENFELD A, 1980, SPEECH TRANSMISSION, V4, P40 ASKENFELD A, 1981, SPEECH TRANSMISSION, V4, P49 ASKENFELT A, 1980, J SPEECH HEAR RES, V23, P258 Awrejcewicz J., 1991, BIFURCATION CHAOS CO BAER T, 1983, J ACOUST SOC AM, V73, P1304, DOI 10.1121/1.389279 Box G.E.P., 1976, TIME SERIES ANAL COX NB, 1989, J ACOUST SOC AM, V86, P42, DOI 10.1121/1.398309 Davis S. B., 1979, SPEECH LANGUAGE ADV, V1, P271 Davis S.B., 1976, SCRL MONOGRAPH, V13 DEEM JF, 1989, J SPEECH HEAR RES, V32, P689 DEGUCHTENEERE R, 1991, P 12 INT C PHON SCI, P354 DEKROM G, 1994, THESIS ONDERZOEKSINS Demidovich B. P., 1976, COMPUTATIONAL MATH Flannery B.
P., 1992, NUMERICAL RECIPES C GARRETT KL, 1987, J ACOUST SOC AM, V82, P58, DOI 10.1121/1.395437 GILCHRIST W, 1984, STATISTICAL MODELLIN GOLD B, 1969, J ACOUST SOC AM, V46, P442, DOI 10.1121/1.1911709 GUBRYNOWICZ R, 1977, ACTES 8 JOURN ET PAR, P21 Gubrynowicz R., 1980, Archives of Acoustics, V5 Guilford JP, 1965, FUNDAMENTAL STATISTI HECKER MHL, 1971, J ACOUST SOC AM, V49, P1275, DOI 10.1121/1.1912490 Heiberger V, 1982, SPEECH LANGUAGE ADV, V7, P299 HESS W, 1987, SPEECH COMMUN, V6, P55, DOI 10.1016/0167-6393(87)90069-0 HIGGINS M B, 1989, Journal of Voice, V3, P233, DOI 10.1016/S0892-1997(89)80005-0 HIGGINS MB, 1989, J ACOUST SOC AM, V86, P911, DOI 10.1121/1.398778 HILLENBRAND J, 1988, J ACOUST SOC AM, V83, P2361, DOI 10.1121/1.396367 Hollien H., 1973, J PHONETICS, V1, P85 HORIGUCHI S, 1987, LARYNGEAL FUNCTION P, P509 HORII Y, 1982, J SPEECH HEAR RES, V25, P12 HORII Y, 1979, J SPEECH HEAR RES, V22, P5 IMAIZUMI S, 1986, J PHONETICS, V14, P457 ISHIZAKA K, 1976, J ACOUST SOC AM, V60, P1193, DOI 10.1121/1.381221 ISSHIKI N, 1972, STUDIA PHONOLOGICA, V6, P38 KITAJIMA K, 1975, STUDIA PHONOL KYOTO, V9, P25 KOIKE Y, 1972, STUDIA PHONOLOGICA, V6, P45 KOIKE Y, 1977, ACTA OTO-LARYNGOL, V84, P105, DOI 10.3109/00016487709123948 Koike Y., 1973, STUDIA PHONOLOGICA, V7, P17 Laver J., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing LAVER J, 1984, P I ACOUSTICS, V6, P425 LAVER J, 1986, J PHONETICS, V14, P517 LIEBERMAN P, 1963, J ACOUST SOC AM, V35, P344, DOI 10.1121/1.1918465 Ludlow C.L., 1987, LARYNGEAL FUNCTION P, P492 Melard G., 1990, METHODES PREVISION C PINTO NB, 1990, J ACOUST SOC AM, V87, P1278, DOI 10.1121/1.398803 Romeder J. M., 1973, METHODES PROGRAMMES ROSS SM, 1987, INTRO PROBABILITY ST, P230 SCHOENTGEN J, 1991, SPEECH COMMUN, V10, P533, DOI 10.1016/0167-6393(91)90056-Y SCHOENTGEN J, 1991, ACT SEM TRAIT REPR S, P108 SCHOENTGEN J, 1989, SPEECH COMMUN, V8, P61, DOI 10.1016/0167-6393(89)90068-X Schoentgen J., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607472 SCHOENTGEN J, 1995, J PHONETICS, V23, P189, DOI 10.1016/S0095-4470(95)80042-5 SMITH BE, 1978, J SPEECH HEAR RES, V21, P240 *SPSS INC, 1992, SPSS WIND ADV STAT TITZE IR, 1993, J SPEECH HEAR RES, V36, P1177 TITZE IR, 1987, J SPEECH HEAR RES, V30, P252 WONG D, 1991, J ACOUST SOC AM, V89, P383, DOI 10.1121/1.400472 NR 56 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1997 VL 21 IS 4 BP 255 EP 272 DI 10.1016/S0167-6393(97)00008-3 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XM478 UT WOS:A1997XM47800003 ER PT J AU Ainsworth, WA Carre, R AF Ainsworth, WA Carre, R TI Perception of synthetic two-formant vowel transitions SO SPEECH COMMUNICATION LA English DT Article ID DISCRIMINATION AB Speech analysis shows that the second formant transitions in vowel-vowel utterances are not always of the same duration as those of the first formant transitions nor are they always synchronised. Moreover, the formant transitions often move initially in a different direction from their final target. In order to investigate whether these deviations from linearity and synchrony are perceptually significant, a series of listening tests has been conducted with the vowel pair /a/-/i/.
It was found that delays between the first and second formant transitions of up to 30 ms are not perceived, nor are differences in duration of up to 40 ms if the first and second formants start or end simultaneously. If the second formant transition is symmetric in time with respect to the first formant, differences of up to 50 ms are tolerated. Excursions in second formant transition shape of up to about 500 Hz are also not perceived. These results suggest that most of the deviations from linearity and synchrony found in natural vowel-vowel utterances are not perceptually significant. C1 ECOLE NATL SUPER TELECOMMUN BRETAGNE, DEPT SIGNAL, F-75634 PARIS 13, FRANCE. RP Ainsworth, WA (reprint author), UNIV KEELE, DEPT COMMUN & NEUROSCI, KEELE ST5 5BG, STAFFS, ENGLAND. CR AINSWORTH WA, 1968, J ACOUST SOC AM, V44, P694 AINWORTH WA, 1995, P 13 INT C PHON SCI, V2, P666 CARRE R, 1991, J PHONETICS, V19, P433 Carre R, 1996, ACUSTICA, V82, pS128 FLANAGAN JL, 1955, J ACOUST SOC AM, V27, P613, DOI 10.1121/1.1907979 HOLMES JN, 1964, LANG SPEECH, V7, P127 KEWLEYPORT D, 1994, J ACOUST SOC AM, V95, P485, DOI 10.1121/1.410024 KewleyPort D, 1996, J ACOUST SOC AM, V100, P2462, DOI 10.1121/1.417954 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 Liberman AM, 1954, PSYCHOL MONOGR-GEN A, V68, P1 LIBERMAN AM, 1956, J EXP PSYCHOL, V52, P127, DOI 10.1037/h0041240 LISKER L, 1957, WORD, V13, P256 MACMILLAN NA, 1988, J ACOUST SOC AM, V84, P1262, DOI 10.1121/1.396626 MAEDA S, 1990, NATO ASI SERIES MRAYATI M, 1988, SPEECH COMMUN, V7, P257, DOI 10.1016/0167-6393(88)90073-8 OCONNOR JD, 1957, WORD, V13, P24 SECREST B, 1983, P IEEE ICASSP, V83, P1352 VANSON RJJ, 1993, P I PHON SCI, V17, P33 VANWIERINGEN A, 1995, J ACOUST SOC AM, V98, P1304, DOI 10.1121/1.413467 NR 19 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1997 VL 21 IS 4 BP 273 EP 282 DI 10.1016/S0167-6393(97)00010-1 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XM478 UT WOS:A1997XM47800004 ER PT J AU Sanchez, V Rubio, AJ LopezSoler, JM delaTorre, A Peinado, AM AF Sanchez, V Rubio, AJ LopezSoler, JM delaTorre, A Peinado, AM TI A low-delay transform domain approach to trellis coded quantization SO SPEECH COMMUNICATION LA English DT Article DE speech coding; transform domain; trellis coded quantization; low-delay; discrete cosine transform AB In this paper we propose a new transform domain coding technique called transform trellis coded quantization (TTCQ) that is based on the trellis coded quantization technique recently proposed, where Ungerboeck's amplitude modulation trellises and set partitioning ideas are used for source coding. We combine the proposed technique with a transform domain formulation for small frame sizes, obtaining a transform domain scheme suitable for low-delay speech coding. We have applied this scheme to the coding of 7 kHz wideband speech, studying its performance for several discrete transforms and comparing it with a predictive version of the trellis coded quantization technique for different bit rates and number of states in the trellis. RP Sanchez, V (reprint author), UNIV GRANADA, FAC CIENCIAS, DEPT ELECT & TECNOL COMPUTADORES, E-18071 GRANADA, SPAIN.
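A toy transform-domain coder in the spirit of the Sanchez et al. abstract above: a short frame is mapped to the DCT domain, quantized and inverse-transformed. For brevity the trellis coded quantization stage (a Viterbi search over an Ungerboeck-style trellis of partitioned scalar codebooks) is replaced by plain uniform scalar quantization, so only the low-delay small-frame transform framing is shown, not the TCQ gain.

import numpy as np
from scipy.fft import dct, idct

def code_frame(frame, step=0.05):
    coeffs = dct(frame, norm="ortho")          # orthonormal DCT-II
    q = np.round(coeffs / step) * step         # uniform scalar quantizer
    return idct(q, norm="ortho")               # decoder: inverse transform

rng = np.random.default_rng(4)
frame = rng.normal(size=32)                    # small frame => low delay
rec = code_frame(frame)
snr = 10.0 * np.log10(np.sum(frame**2) / np.sum((frame - rec) ** 2))
print(f"toy frame SNR: {snr:.1f} dB")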
RI de la Torre, Angel/C-6618-2012; Sanchez , Victoria /C-2411-2012; Peinado, Antonio/C-2401-2012; Prieto, Ignacio/B-5361-2013; Lopez-Soler, Juan/C-2437-2012 OI Lopez-Soler, Juan/0000-0003-4572-2237 CR AHMED N, 1974, IEEE T COMPUT, VC 23, P90, DOI 10.1109/T-C.1974.223784 ATAL BS, 1989, P INT C AC SPEECH SI, P45 Gray R. M., 1971, TOEPLITZ CIRCULANT M Jayant N. S., 1984, DIGITAL CODING WAVEF LEE PZ, 1994, IEEE T SIGNAL PR AUG, P1996 MARCELLIN M, 1990, IEEE T ACOUST SP JAN, P46 MARCELLIN MW, 1990, IEEE T COMMUN, V38, P82, DOI 10.1109/26.46532 Marcellin M. W., 1991, Advances in Speech Coding Moriya T., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) PEARL J, 1973, IEEE T INFORM THEORY, V19, P229, DOI 10.1109/TIT.1973.1054985 PEARLMAN WA, 1978, IEEE T INFORM THEORY, V24, P683, DOI 10.1109/TIT.1978.1055950 SANCEZ V, 1992, P ICSPAT, V2, P937 SANCEZ V, 1994, P INT C AC SPEECH SI, V1, P177 SANCEZ V, 1995, THESIS U GRANADA SANCEZ V, 1993, P INT C AC SPEECH SI, P415 SANCHEZ V, 1995, IEEE T SIGNAL PROCES, V43, P2631, DOI 10.1109/78.482113 TRANOSO IM, 1990, IEEE T ACOUST SP MAR, P385 UNGERBOECK G, 1981, IEEE T INFORM TH JAN, P55 WANG Z, 1985, APPL MATH COMPUT, V16, P19, DOI 10.1016/0096-3003(85)90008-6 WANG ZD, 1984, IEEE T ACOUST SPEECH, V32, P803 NR 20 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1997 VL 21 IS 3 BP 141 EP 153 DI 10.1016/S0167-6393(97)00006-X PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XV541 UT WOS:A1997XV54100001 ER PT J AU Djezzar, L Pican, N AF Djezzar, L Pican, N TI Phonetic knowledge embedded in a context sensitive MLP for French speaker-independent speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speaker-independent recognition system; context sensitive MLP; articulatory feature; stop consonant AB The stop /p,t,k/ recognition part of a speaker-independent speech recognition system is described in this paper. This work is based on the conclusions of several perceptual experiments and on the results of an acoustic investigation with stop consonants. These experiments allowed us to evaluate the discrimination power of the burst regarding the stop place of articulation, and how the vocalic information may help stop identification, which could not be done efficiently without taking into account the nature of the following vowel. Thus, a novel system architecture is proposed which is made up of two stages: first, an automatic detector of reliable cues regarding stop and vowel features, and, then, a context sensitive multilayered perceptron (ODWE) fed by the previous acoustic cues. The training and the test have been done on two different corpora including male and female speakers. The results show a recognition rate of 90% over the stop consonants. C1 INST NATL RECH INFORMAT & AUTOMAT LORRAINE, CRIN, CNRS, F-54506 VANDOEUVRE LES NANCY, FRANCE. LERIA, F-49045 ANGERS 01, FRANCE. 
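The ODWE network itself is not specified in this record, so the sketch below only illustrates the input-coding idea behind a "context sensitive" MLP for /p,t,k/: burst-spectrum features are augmented with the identity of the following vowel, letting the network learn vowel-dependent decision regions. All data, dimensions and labels are synthetic placeholders.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

# Synthetic stand-ins: 12 burst-spectrum features per token, plus the identity
# of the following vowel (assumed delivered by the first-stage cue detector).
n, n_vowels = 600, 3                    # tokens; vowel contexts, e.g. /a,i,u/
burst = rng.normal(size=(n, 12))
vowel = rng.integers(0, n_vowels, size=n)
stop = rng.integers(0, 3, size=n)       # targets /p,t,k/ (random here)

# Context sensitivity as input coding: append a one-hot vowel vector so the
# burst cues can be interpreted relative to the following vowel.
X = np.hstack([burst, np.eye(n_vowels)[vowel]])

clf = MLPClassifier(hidden_layer_sizes=(20,), max_iter=500).fit(X, stop)
print("train accuracy:", clf.score(X, stop))
```

With real burst measurements in place of the random features, the same coding reproduces the two-stage structure described in the abstract: reliable cue detection first, context-conditioned classification second.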
CR BENGIO Y, 1990, P NIPS, V2, P218 BENGIO Y, 1992, SPEECH COMMUN, V11, P261, DOI 10.1016/0167-6393(92)90020-8 BLUMSTEIN SE, 1979, J ACOUST SOC AM, V66, P1001, DOI 10.1121/1.383319 BONNEAU A, 1996, IN PRESS J ACOUST SO BOURLARD H, 1990, IEEE T PATTERN ANAL, V12, P1167, DOI 10.1109/34.62605 Bourlard Ha, 1994, CONNECTIONIST SPEECH CARRE R, 1984, P IEE INT C AC SPEEC DJEZZAR L, 1995, P EUROSPEECH MADRID, V2, P1389 DJEZZAR L, 1995, P 13 ICPHS STOCKH, V2, P262 DJEZZAR L, 1995, P EUROSPEECH MADRID, V3, P2217 DJEZZAR L, 1995, THESIS U NANCY 1 CRI DURBIN R, 1990, NATO ASI SERIES F, V68 EDWARDS TJ, 1981, J ACOUST SOC AM, V69, P535, DOI 10.1121/1.385482 Fant G., 1973, SPEECH SOUNDS FEATUR FISHERJORGENSEN E, 1954, MISCELLANA PHONETICA, V2, P42 HAMPSHIRE JB, 1992, IEEE T PATTERN ANAL, V14, P751, DOI 10.1109/34.142911 HERTZ J, 1992, INTRO THEORY NEURAL, V1 HILLENBRAND J, 1995, J ACOUST SOC AM, V97, P3099, DOI 10.1121/1.411872 HUANG X, 1991, NEURAL COMPUT, V3, P79 JORDAN MI, 1993, COMPUTATIONAL COGNIT JORDAN MI, 1989, ADV CONNECTIONISM TH LAPRIE Y, 1994, P INT C AC SPEECH SI, V2, P201 LEE KF, 1990, COMPUTER SPEECH LANG, V4, P57, DOI 10.1016/0885-2308(90)90023-Y LI DENG, 1993, P EUROSPEECH 93, V3, P1635 Lippmann R. P., 1989, Neural Computation, V1, DOI 10.1162/neco.1989.1.1.1 MARI JF, 1996, IEEE INT C AC SPEECH MARI JF, 1994, P ICSLP MILLER JD, 1989, J ACOUST SOC AM, V85, P2114, DOI 10.1121/1.397862 PICAN N, 1996, ICNN 96 P WASH DC 3 PICAN N, 1994, NEURAL PROCESS LETT, V1, P21, DOI 10.1007/BF02312397 PICAN N, 1994, ESANN 94 P 20 20 APR PICAN N, 1996, P ICSLP PHIL PA PICAN N, 1996, J MATH COMPUTERS SIM, V41 ROBINSON T, 1990, TR42 CUEDFINFENG ROSE NR, 1993, IMMUNOMETHODS, V2, P137, DOI 10.1006/immu.1993.1016 SCHMID P, 1993, P EUROSPEECH, V3, P1723 WAIBEL A, 1989, IEE INT C AC SPEECH Zue V. W., 1976, THESIS MIT CAMBRIDGE NR 38 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1997 VL 21 IS 3 BP 155 EP 167 DI 10.1016/S0167-6393(97)00005-8 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XV541 UT WOS:A1997XV54100002 ER PT J AU Hansen, JHL Pellom, BL AF Hansen, JHL Pellom, BL TI Text-directed speech enhancement employing phone class parsing and feature map constrained vector quantization SO SPEECH COMMUNICATION LA English DT Article ID HIDDEN MARKOV-MODELS; RECOGNITION; PROJECTION AB There are many situations where non-real-time speech enhancement is required. For such applications, employing any available a priori knowledge can lead to more effective enhancement solutions. In this study, a novel text-directed speech enhancement algorithm is developed for use in non-real-time applications. In our approach, the text of the intended dialogue is used to partition noisy speech into regions of broad phoneme classifications. Classes considered include stops, fricatives, affricates, nasals, vowels, semivowels, diphthongs and silence. These partitions are then used to direct a new vector quantizer based enhancement scheme in which phone-class directed constraints are applied to improve speech quality. The proposed algorithm is evaluated using both objective and subjective quality assessment techniques.
It is shown that the text-directed approach improves the quality of the degraded speech over a broad range of noise sources (i.e., flat communications channel noise, aircraft cockpit noise, helicopter fly-by noise, and automobile highway noise) and over a broad range of signal-to-noise ratios (i.e., 10, 5, 0 and -5 dB). In each case, the proposed method is shown consistently to exhibit improved objective quality over linear and generalized spectral subtraction, as well as the Auto-LSP constrained iterative enhancement method using the Itakura-Saito measure and a 100-sentence evaluation speech corpus. Subjective quality assessment was conducted in the form of an A-B comparison test. Results of these evaluations demonstrate that, for wideband noise distortions, the proposed algorithm is preferred over the unprocessed noisy speech more than 2 to 1, while the proposed algorithm is preferred over spectral subtraction by more than 3 to 1. RP Hansen, JHL (reprint author), DUKE UNIV, DEPT ELECT ENGN, ROBUST SPEECH PROC LAB, BOX 90291, DURHAM, NC 27708 USA. CR ARSLAN L, 1995, P IEEE INT C AC SPEE, P812 Azirani A. Akbari, 1995, P 1995 IEEE INT C AC, P800 BEROUTI M, 1979, APR P IEEE INT C AC, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W CHENG YM, 1991, IEEE T SIGNAL PROCES, V39, P1943, DOI 10.1109/78.134427 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 Deller J. R., 1993, DISCRETE TIME PROCES DRUCKER H, 1968, IEEE T ACOUST SPEECH, VAU16, P165, DOI 10.1109/TAU.1968.1161979 EPHRAIM Y, 1992, IEEE T SIGNAL PROCES, V40, P725, DOI 10.1109/78.127947 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 Gray R.M., 1984, IEEE ASSP MAG APR, P4 HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 Hansen J. H., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P98, DOI 10.1109/89.365376 LARAR JN, 1988, IEEE T ACOUST SPEECH, V36, P1812, DOI 10.1109/29.9026 LIM JS, 1978, IEEE T ACOUST SPEECH, V26, P197 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P1659, DOI 10.1109/29.46548 MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 NANDKUMAR S, 1995, IEEE T SPEECH AUDI P, V3, P22, DOI 10.1109/89.365384 OSHAUGHNESSY D, 1988, APR P IEEE INT C AC, P549 PELLOM B, 1996, THESIS DUKE UI DURHA Quackenbush S. R., 1988, OBJECTIVE MEASURES S TSOUKALAS D, 1993, P IEEE INT C AC SPEE, V2, P359 WEISS MR, 1983, RADCTR83109 NR 26 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD APR PY 1997 VL 21 IS 3 BP 169 EP 189 DI 10.1016/S0167-6393(97)00003-4 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XV541 UT WOS:A1997XV54100003 ER PT J AU Rouat, J Liu, YC Morissette, D AF Rouat, J Liu, YC Morissette, D TI A pitch determination and voiced/unvoiced decision algorithm for noisy speech SO SPEECH COMMUNICATION LA English DT Article DE auditory model; car speech; telephone speech; multi-channel selection; Teager energy operator; amplitude modulation; residue pitch ID AUDITORY FILTER SHAPES; DISCHARGE PATTERNS; NERVE FIBERS; REPRESENTATION; SIGNALS; MODEL AB The design of a pitch tracking system for noisy speech is a challenging and yet unsolved issue due to the association of ''traditional'' pitch determination problems with those of noise processing. We have developed a multi-channel pitch determination algorithm (PDA) that has been tested on three speech databases (0 dB SNR telephone speech, speech recorded in a car and clean speech) involving fifty-eight speakers. Our system has been compared to a multi-channel PDA based on auditory modelling (AMPEX), to hand-labelled and to Laryngograph pitch contours. Our PDA comprises an automatic channel selection module and a pitch extraction module that relies on a pseudo-periodic histogram (combination of normalised scalar products for the less corrupted channels) in order to find pitch. Our PDA clearly outperformed the reference system on 0 dB telephone and car speech. The automatic selection of channels was effective on the very noisy telephone speech (0 dB) but contributed less on car speech, where the robustness of the system relative to AMPEX is mainly due to the pitch extraction module. This paper reports in detail the voiced/unvoiced and unvoiced/voiced performance and the pitch estimation errors for the proposed PDA and the reference system on the three speech databases. RP Rouat, J (reprint author), UNIV QUEBEC, ERMETIS, DEPT APPL SCI, 555 BLVD, CHICOUTIMI, PQ G7H 2B1, CANADA. CR Bagshaw P., 1993, P EUR C SPEECH COMM, P1003 DELGUTTE B, 1980, J ACOUST SOC AM, V68, P843, DOI 10.1121/1.384824 GLASBERG BR, 1990, HEARING RES, V47, P103, DOI 10.1016/0378-5955(90)90170-T HERMES DJ, 1988, J ACOUST SOC AM, V83, P257, DOI 10.1121/1.396427 HERMES DJ, 1992, VISUAL REPRESENTATIO, P3 Hess W., 1983, PITCH DETERMINATION Hess W. J., 1980, Signal Processing: Theories and Applications. Proceedings of EUSIPCO-80, First European Signal Processing Conference HUNT MJ, 1988, P IEEE ICASSP, P215 Kaiser J. F., 1990, P IEEE INT C AC SPEE, P381 Kaiser J.F., 1993, P INT C AC SPEECH SI, V3, P149 LICKLIDER JCR, 1956, INFORMATION THEORY Maragos P., 1991, P IEEE INT C AC SPEE, P421, DOI 10.1109/ICASSP.1991.150366 MARKEL JD, 1972, IEEE T ACOUST SPEECH, VAU20, P367, DOI 10.1109/TAU.1972.1162410 MARTENS JP, 1990, P INT C AC SPEECH SI, P401 MARTENS JP, 1982, J ACOUST SOC AM, V72, P397, DOI 10.1121/1.388091 MEDAN Y, 1991, IEEE T SIGNAL PROCES, V39, P40, DOI 10.1109/78.80763 MILLER MI, 1984, HEARING RES, V14, P257, DOI 10.1016/0378-5955(84)90054-6 Moore B. C. J., 1989, INTRO PSYCHOL HEARIN MOORE BCJ, 1983, J ACOUST SOC AM, V74, P750, DOI 10.1121/1.389861 Patterson R.
D., 1986, PSYCHOL MUSIC, V14, P44, DOI 10.1177/0305735686141004 PATTERSON RD, 1976, J ACOUST SOC AM, V59, P640, DOI 10.1121/1.380914 PATTERSON RD, 1982, J ACOUST SOC AM, V72, P1788, DOI 10.1121/1.388652 ROUAT J, 1992, P INT C SPOK LANG PR, V2, P1629 ROUAT J, 1992, P ETRW SPEECH PROCES, P158 ROUAT J, 1993, VISUAL REPRESENTATIO, P335 SENEFF S, 1985, THESIS MIT SENEFF S, 1988, J PHONETICS, V16, P55 TERHARDT E, 1979, HEARING RES, V1, P155, DOI 10.1016/0378-5955(79)90025-X TERHARDT E, 1982, J ACOUST SOC AM, V71, P679, DOI 10.1121/1.387544 VANIMMERSEEL IM, 1993, THESIS GENT U BELGIU VANIMMERSEEL LM, 1992, J ACOUST SOC AM, V91, P3511, DOI 10.1121/1.402840 NR 31 TC 36 Z9 37 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1997 VL 21 IS 3 BP 191 EP 207 DI 10.1016/S0167-6393(97)00002-2 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XV541 UT WOS:A1997XV54100004 ER PT J AU Hadjitodorov, S Boyanov, B Dalakchieva, N AF Hadjitodorov, S Boyanov, B Dalakchieva, N TI A two-level classifier for text-independent speaker identification SO SPEECH COMMUNICATION LA English DT Article DE speaker identification; neural networks; self-organizing map; MLP network; two-level classifier AB A two-level scheme for speaker identification is proposed. The first classifier level is based on the self-organizing map (SOM) of Kohonen. LPCC coefficients are used as input vectors for this classifier. LPCC coefficients are passed again through the already trained SOMs and as result the prototype distribution maps (PDMs) are obtained. The PDMs are the input for the second classifier level. The second level consists of multilayer perceptron (MLP) networks for each speaker. The first level of the classifier is a preprocessing procedure for the second level, where the final classification is made. The goal of the proposed approach is to combine the advantages of the two type of networks into one classification scheme in order to achieve higher identification accuracy. The experiments show an increased accuracy of the proposed two-level classifier, especially in the case of noise-corrupted signals. RP Hadjitodorov, S (reprint author), BULGARIAN ACAD SCI, CENT LAB BIOMED ENGN, ACAD G BONCHEV STR, BLOCK 105, BU-1113 SOFIA, BULGARIA. 
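A compressed sketch of the two-level idea in the speaker-identification record above, under simplifying assumptions: a toy one-dimensional SOM stands in for Kohonen's map, and a prototype distribution map (PDM) is taken to be the normalized histogram of winning units over an utterance, which the paper then feeds to per-speaker MLPs. The frame data here are random stand-ins for LPCC vectors.

```python
import numpy as np

rng = np.random.default_rng(1)

def train_som(frames, n_units=16, epochs=5, lr0=0.5):
    """Toy 1-D self-organizing map trained on feature frames."""
    w = frames[rng.choice(len(frames), n_units)].copy()
    for e in range(epochs):
        lr = lr0 * (1 - e / epochs)               # decaying learning rate
        for x in frames[rng.permutation(len(frames))]:
            bmu = np.argmin(((w - x) ** 2).sum(axis=1))
            # update the winner and its immediate grid neighbours
            for j in range(max(0, bmu - 1), min(n_units, bmu + 2)):
                w[j] += lr * (x - w[j])
    return w

def pdm(frames, w):
    """Prototype distribution map: normalized histogram of winning units."""
    bmu = np.argmin(((frames[:, None, :] - w[None]) ** 2).sum(-1), axis=1)
    return np.bincount(bmu, minlength=len(w)) / len(frames)

lpcc = rng.normal(size=(400, 12))    # stand-in for an utterance's LPCC frames
som = train_som(lpcc)
print("PDM:", np.round(pdm(lpcc, som), 3))
```

The PDM reduces a variable-length utterance to one fixed-size vector per speaker model, which is what makes a conventional MLP usable as the second, final classification level.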
CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 BARRON AR, 1988, P 1988 S INT STAT CO Bennani Y, 1991, P ICASSP, P385, DOI 10.1109/ICASSP.1991.150357 BIMBOT F, 1992, P INT C ASSP SAN FRA, P5, DOI 10.1109/ICASSP.1992.226134 Farrell KR, 1994, IEEE T SPEECH AUDI P, V2, P194, DOI 10.1109/89.260362 HADJITODOROV S, 1994, ELECTRON LETT, V30, P838, DOI 10.1049/el:19940587 KOHONEN T, 1990, P IEEE, V78, P1464, DOI 10.1109/5.58325 Kohonen T, 1984, SELFORGANIZATION ASS LEFLOH JL, 1994, P INT C AC SPEECH SI, V1, P149 Lippman R., 1987, IEEE ASSP MAGAZI APR, P4 LIPPMANN RP, 1989, IEEE COMMUN MAG, V27, P47, DOI 10.1109/35.41401 LO ZP, 1993, IEEE T NEURAL NETWOR, V4, P207, DOI 10.1109/72.207609 LONGSTAFF ID, 1987, PATTERN RECOGN LETT, V5, P315, DOI 10.1016/0167-8655(87)90072-9 MASON S, 1990, P VOICE SYSTEMS WORL, P241 MATSUI GT, 1994, P INT C AC SPEECH SI, V1, P309 MATSUI T, 1992, P INT C AC SPEECH SI, V2, P157 MONTACIE C, 1992, P INT C SPOKEN LANGU, P475 MONTACIE C, 1993, P EUROSPEECH 93, P161 Morgan D.P., 1991, NEURAL NETWORKS SPEE OGLESBY J, 1991, P INT C AC SPEECH SI, P393, DOI 10.1109/ICASSP.1991.150359 Rudasi L., 1991, P ICASSP TORONTO, P389, DOI 10.1109/ICASSP.1991.150358 NR 21 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1997 VL 21 IS 3 BP 209 EP 217 DI 10.1016/S0167-6393(97)00004-6 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA XV541 UT WOS:A1997XV54100005 ER PT J AU Juang, BH Sorin, C Furui, S Pols, L AF Juang, BH Sorin, C Furui, S Pols, L TI Tribute to James L. Flanagan SO SPEECH COMMUNICATION LA English DT Biographical-Item NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1997 VL 21 IS 1-2 BP 1 EP 2 DI 10.1016/S0167-6393(97)88374-4 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WV320 UT WOS:A1997WV32000001 ER PT J AU Cutler, A AF Cutler, A TI The comparative perspective on spoken-language processing SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th International Congress on Spoken Language Processing CY OCT 03-06, 1996 CL PHILADELPHIA, PA SP Univ Delaware, Alfred I DuPont Inst, Acoust Soc Amer, Acoust Soc Japan, Amer Speech Language Hearing Assoc, Austr Speech Sci & Technol Assoc, European Speech Commun Assoc, IEEE Signal Proc Soc, Inc Canadian Acoust Assoc, Int Phonet Assoc, Linguist Soc Amer ID WORD RECOGNITION; LEXICAL ACCESS; SEGMENTATION; SYLLABLES; COMPETITION AB Psycholinguists strive to construct a model of human language processing in general. But this does not imply that they should confine their research to universal aspects of linguistic structure, and avoid research on language-specific phenomena. First, even universal characteristics of language structure can only be accurately observed cross-linguistically. This point is illustrated here by research on the role of the syllable in spoken-word recognition, on the perceptual processing of vowels versus consonants, and on the contribution of phonetic assimilation phenomena to phoneme identification. In each case, it is only by looking at the pattern of effects across languages that it is possible to understand the general principle. Second, language-specific processing can certainly shed light on the universal model of language comprehension.
This second point is illustrated by studies of the exploitation of vowel harmony in the lexical segmentation of Finnish, of the recognition of Dutch words with and without vowel epenthesis, and of the contribution of different kinds of lexical prosodic structure (tone, pitch accent, stress) to the initial activation of candidate words in lexical access. In each case, aspects of the universal processing model are revealed by analysis of these language-specific effects. In short, the study of spoken-language processing by human listeners requires cross-linguistic comparison. RP Cutler, A (reprint author), MAX PLANCK INST PSYCHOLINGUIST, POB 310, NL-6500 AH NIJMEGEN, NETHERLANDS. RI Cutler, Anne/C-9467-2012 CR Cutler A., 1995, P 13 INT C PHON SCI, V1, P106 Cutler A., 1992, P INT C SPOKEN LANGU, V1, P221 CUTLER A, 1988, J EXP PSYCHOL HUMAN, V14, P113, DOI 10.1037/0096-1523.14.1.113 CUTLER A, 1985, LINGUISTICS, V23, P659, DOI 10.1515/ling.1985.23.5.659 CUTLER A, 1994, P 5 AUSTR INT C SPEE, V1, P285 CUTLER A, 1986, J MEM LANG, V25, P385, DOI 10.1016/0749-596X(86)90033-1 Cutler A, 1996, PERCEPT PSYCHOPHYS, V58, P807, DOI 10.3758/BF03205485 CUTLER A, 1996, P 6 AUSTR INT C SPEE, P599 CUTLER A, 1997, PERCEPTION PSYCHOPHY CUTLER A, 1986, LANG SPEECH, V29, P201 Jongenburger W., 1995, P 13 INT C PHON SCI, P368 KUIJPERS C, IN PRESS EFFECTS REG KUIJPERS C, 1996, P 4 INT C SPEECH PRO, V1, P149, DOI 10.1109/ICSLP.1996.607060 MCQUEEN JM, 1994, J EXP PSYCHOL LEARN, V20, P621, DOI 10.1037/0278-7393.20.3.621 MEHLER J, 1996, PHONOLOGICAL STRUCTU, P145 MEHLER J, 1981, J VERB LEARN VERB BE, V20, P298, DOI 10.1016/S0022-5371(81)90450-3 NORRIS D, 1995, J EXP PSYCHOL LEARN, V21, P1209 CUTLER A, 1994, J MEM LANG, V33, P824, DOI 10.1006/jmla.1994.1039 Otake T, 1993, J MEM LANG, V32, P358 Otake T., 1996, PHONOLOGICAL STRUCTU OTAKE T, 1996, J ACOUST SOC AM SUOMI K, IN PRESS J MEMORY LA TRUBETSKOI NS, 1939, TCLP, V7 VANDONSELAAR W, IN PRESS VOWEL EPENT VANDONSELAAR W, 1996, P 4 INT C SPOK LANG, V1, P94, DOI 10.1109/ICSLP.1996.607040 VANDONSELAAR W, IN PRESS VOORNAAM IS VANOOIJEN B, 1994, THESIS LEIDEN VANOOIJEN B, IN PRESS PROCESSING NR 28 TC 28 Z9 28 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1997 VL 21 IS 1-2 BP 3 EP 15 DI 10.1016/S0167-6393(96)00075-1 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WV320 UT WOS:A1997WV32000002 ER PT J AU Hernando, J Nadeu, C Marino, JB AF Hernando, J Nadeu, C Marino, JB TI Speech recognition in a noisy car environment based on LP of the one-sided autocorrelation sequence and robust similarity measuring techniques SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; noise robustness; feature extraction; spectral analysis of speech; distortion measures; vector quantization ID WORD RECOGNITION AB The performance of existing speech recognition systems degrades rapidly in the presence of background noise. A novel representation of the speech signal, which is based on Linear Prediction of the One-Sided Autocorrelation sequence (OSALPC), has been shown to be attractive for noisy speech recognition because of both its high recognition performance with respect to the conventional LPC in severe conditions of additive white noise and its computational simplicity.
The aim of this work is twofold: (1) to show that OSALPC also achieves a good performance in a case of real noisy speech (in a car environment), and (2) to explore its combination with several robust similarity measuring techniques, showing that its performance improves by using cepstral liftering, dynamic features and multilabeling. RP Hernando, J (reprint author), UNIV POLITECN CATALUNA, DEPT SIGNAL THEORY & COMMUN, E-08028 BARCELONA, SPAIN. RI Nadeu, Climent/B-9638-2014; Hernando, Javier/G-1863-2014; Marino, Jose /N-1626-2014 OI Nadeu, Climent/0000-0002-5863-0983; CR ALEXANDRE P, 1993, SPEECH COMMUN, V12, P277, DOI 10.1016/0167-6393(93)90099-7 CADZOW JA, 1982, P IEEE, V70, P907, DOI 10.1109/PROC.1982.12424 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 HANSON BA, 1987, IEEE T ACOUST SPEECH, V35, P968, DOI 10.1109/TASSP.1987.1165241 HERNANDO J, 1993, P EUR 93 BERL SEPT 1, P1643 HERNANDO J, 1991, P EUR 91 SEPT 1991 G, P91 HERNANDO J, 1992, P ICSLP 92 BANFF OCT, P1593 HERNANDO J, 1994, P INT C AC SPEECH SI, V2, P69 HERNANDO J, 1993, THESIS POLYTECHNICAL HUANG XD, 1992, IEEE T SIGNAL PROCES, V40, P1062, DOI 10.1109/78.134469 ITAKURA F, 1975, IEEE T ACOUST SPEECH, VAS23, P67, DOI 10.1109/TASSP.1975.1162641 JUANG BH, 1987, IEEE T ACOUST SPEECH, V35, P947 Juang B. H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E Lagunas M. A., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) LECOMTE I, 1989, P INT C AC SPEECH SI, P512 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P795, DOI 10.1109/ASSP.1989.28053 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P1959 Marple Jr S. L., 1987, DIGITAL SPECTRAL ANA MASUMOTO H, 1986, P INT C AC SPEECH SI, P769 McGinn D., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing NADEU C, 1994, P ICSLP, P1927 Nishimura M., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) SUGAWARA K, 1985, P MMIJ FALL M G, P1 TOHKURA Y, 1987, IEEE T ACOUST SPEECH, V35, P1414, DOI 10.1109/TASSP.1987.1165058 Tseng H.-P., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) NR 25 TC 17 Z9 17 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1997 VL 21 IS 1-2 BP 17 EP 31 DI 10.1016/S0167-6393(96)00074-X PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WV320 UT WOS:A1997WV32000003 ER PT J AU Bateman, JA AF Bateman, JA TI Special section on Speak! SO SPEECH COMMUNICATION LA English DT Editorial Material RP Bateman, JA (reprint author), GERMAN NATL RES CTR INFORMAT TECHNOL, INST INTERGRATED INFORMAT & PUBLICAT SYST, DARMSTADT, GERMANY. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
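The core of the OSALPC representation, as the abstract describes it, is to run standard linear-prediction analysis on the one-sided autocorrelation sequence of the frame instead of on the waveform itself. A minimal sketch follows, with illustrative lag counts and model order (the paper's exact analysis settings are not given in this record):

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def autocorr(x, nlags):
    """Normalized one-sided autocorrelation sequence, lags 0..nlags-1."""
    x = x - x.mean()
    r = np.correlate(x, x, mode="full")[len(x) - 1:]
    return r[:nlags] / r[0]

def lp_coeffs(y, order):
    """Autocorrelation-method LP coefficients of an arbitrary sequence y,
    obtained by solving the Yule-Walker equations R a = r."""
    r = autocorr(y, order + 1)
    return solve_toeplitz((r[:-1], r[:-1]), r[1:])

def osalpc(frame, order=10, nlags=60):
    """LP of the one-sided autocorrelation sequence: the frame's
    autocorrelation is treated as the analysis signal."""
    return lp_coeffs(autocorr(frame, nlags), order)

frame = np.random.default_rng(2).normal(size=240)   # stand-in 30 ms frame
print(osalpc(frame))
```

Additive white noise concentrates, in expectation, at lag zero of the autocorrelation sequence, which is the intuition behind the representation's robustness relative to conventional LPC.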
PD FEB PY 1997 VL 21 IS 1-2 BP 35 EP 36 DI 10.1016/S0167-6393(96)00067-2 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WV320 UT WOS:A1997WV32000004 ER PT J AU Olaszy, G Nemeth, G AF Olaszy, G Nemeth, G TI Prosody generation for German CTS/TTS systems (from theoretical intonation patterns to practical realisation) SO SPEECH COMMUNICATION LA English DT Article DE speech synthesis; tone group prediction; prosody generation; dialog systems with synthetic speech output AB The work described in the paper was carried out in the SPEAK! project (Speech Generation in Multimodal Information Systems). The aim of the project was to improve the quality of synthesised speech output to be used in dialogue systems as an additional element of multimodal man-machine interfaces. German text and dialogue interaction analysis (theoretical research) has been carried out to predict the tone groups (TGs), the phrase boundaries in sentences and the place of the focus in the phrase. Tone groups represent the general intonation structure of the phrase not taking into account word level intonation. The results of this research are the intonation markers described in (Teich et al., 1997). The CTS synthesiser constructs the main intonation patterns from texts containing these additional markers. This paper describes the research results on German intonation, including the construction of intonation rules, combined with the study on timing adjustments, pause generation for rhythm (both for segmental and suprasegmental levels) for the MULTIVOX-SPEAK! system. Detailed rules and a new tone-group based prosody generation module are also introduced: these have been integrated into the MULTIVOX TTS system. Preliminary evaluation results are also given. C1 TU BUDAPEST, DEPT TELECOMMUN & TELEMAT, BUDAPEST, HUNGARY. RP Olaszy, G (reprint author), RES INST LINGUIST PHL, PHONET LAB, BUDAPEST, HUNGARY. CR ADRIAENS H, 1991, THESIS U LEIDEN *CELEX, 1994, CELEX LEX DAT CD ROM COLIER R, 1990, P ESCA WORKSH SPEECH, P273 DIRKSEN A, 1993, ANAL SYNTHESIS SPEEC, P131 *DUD, 1990, DUD AUSSPR GOSY M, 1996, FINAL YEAR INTONATIO GRICE M, 1995, TRANSCRIPTION GERMAN, P33 LEWY N, 1994, TEXT SPEECH TECHNOLO OLASZY G, 1995, STUDIES APPL LINGUIS, V2, P49 OLASZY G, 1991, P 12 ICPHS AIX PROV, V4, P210 OLASZY G, 1992, TALKING MACH THEORIE, P385 PFISTER B, 1994, FUNDAMENTALS SPEECH, P88 QUENE H, 1993, ANAL SYNTHESIS SPEEC, P115 SWERTS M, 1993, PHONETICA, V50, P189 Teich E, 1997, SPEECH COMMUN, V21, P73, DOI 10.1016/S0167-6393(96)00070-2 TERKEN J, 1990, P ESCA WORKSH SPEECH, P205 WERNER S, 1994, FUNDAMENTALS SPEECH, P24 NR 17 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1997 VL 21 IS 1-2 BP 37 EP 60 DI 10.1016/S0167-6393(96)00071-4 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WV320 UT WOS:A1997WV32000005 ER PT J AU Portele, T Heuft, B AF Portele, T Heuft, B TI Towards a prominence-based synthesis system SO SPEECH COMMUNICATION LA English DT Article DE speech synthesis; prominence; prosody ID SPEECH SYNTHESIS; PROSODY AB The structure of a synthesis system is described that uses prominence as a central parameter. A definition of prominence suitable for this application is given. For the empirical foundation the reliability of prominence ratings by human listeners is assessed. 
These ratings were compared with acoustic data on F-0 and duration. A linear relationship between ratings and parameter values was found. Two algorithms to transform prominence values to prosodic parameters are briefly described and evaluated. The application of prominence to the synthesis of focal accents is demonstrated. The results indicate the validity of the prominence based approach as an interface between linguistics and acoustics. RP Portele, T (reprint author), UNIV BONN, INST KOMMUN FORSCH & PHONET, POPPELSDORFER ALLEE 47, D-53115 BONN, GERMANY. CR AYERS GM, 1995, P 13 INT C PHON SCI, V3, P660 Brown G., 1983, PROSODY MODELS MEASU, P67 Bruce Gosta, 1990, WORKING PAPERS, V36, P37 COOPER WE, 1985, J ACOUST SOC AM, V77, P2142, DOI 10.1121/1.392372 DEPIJPER JR, 1994, J ACOUST SOC AM, V96, P2037, DOI 10.1121/1.410145 EADY SJ, 1986, LANG SPEECH, V9, P233 Fahlman S.E., 1989, P 1988 CONN MOD SUMM FANT G, 1989, SPEECH TRANSMISSION, P1 HEUFT B, 1995, P ICPHS, V2, P378 HEUFT B, 1996, IN PRESS P EL SPRACH Heuft B., 1996, Proceedings ICSLP 96. Fourth International Conference on Spoken Language Processing (Cat. No.96TH8206), DOI 10.1109/ICSLP.1996.607866 Hirst D. J., 1993, TRAVAUX I PHONETIQUE, V15, P71 KLATT DH, 1979, P 9 INT C PHON SC KO, V2, P290 KOHLER KJ, 1991, J PHONETICS, V19, P121 KOHLER KJ, 1987, P 11 INT C PHON SC T KRAFT V, 1995, ACTA ACUST, V3, P351 KREIMAN J, 1982, J PHONETICS, V10, P163 LADD DR, 1994, J PHONETICS, V22, P87 NOOTEBOOM SG, 1987, J ACOUST SOC AM, V82, P1512, DOI 10.1121/1.395195 PIERREHUMBERT J, 1979, J ACOUST SOC AM, V66, P363, DOI 10.1121/1.383670 PORTELE T, 1995, P 13 INT C PHON SC S, V1, P126 PORTELE T, 1994, P ICSLP 94 YOK, P1759 PORTELE T, 1996, P WORLD C NEUR NETW, P41 PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770 RIETVELD A, 1992, P DEP LANG SPEECH U, V16, P86 RIETVELD ACM, 1985, J PHONETICS, V13, P299 Silverman K., 1992, P INT C SPOK LANG PR, P867 TAYLOR P, 1994, SPEECH COMMUN, V15, P169, DOI 10.1016/0167-6393(94)90050-7 TERKEN J, 1991, J ACOUST SOC AM, V89, P1768, DOI 10.1121/1.401019 Wahlster W., 1993, P 3 EUR C SPEECH COM, P29 NR 30 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1997 VL 21 IS 1-2 BP 61 EP 72 DI 10.1016/S0167-6393(96)00072-6 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WV320 UT WOS:A1997WV32000006 ER PT J AU Teich, E Hagen, E Grote, B Bateman, J AF Teich, E Hagen, E Grote, B Bateman, J TI From communicative context to speech: Integrating dialogue processing, speech production and natural language generation SO SPEECH COMMUNICATION LA English DT Article DE spoken language generation; text generation; dialogue models; intonation; information system interfaces; system-functional linguistics ID TEXT AB The current article discusses the problem of appropriate intonation selection in Person-Machine dialogues, such as those expected in intelligent information systems when, for example, information retrieval is required. An approach is proposed which integrates the previously mostly separate paradigms of automatic natural language generation and speech synthesis in a Person-Machine dialogue scenario. 
The article introduces the two independent basis components adopted in the approach - a dialogue model for information retrieval (COR) and a text generation system for German (KOMET-PENMAN) - and develops from these a communicative-context-to-speech system architecture. This system provides for the flexible and context-appropriate selection of intonation patterns. The paper argues that such an approach removes some of the well-known gaps in both text-to-speech and concept-to-speech systems. C1 GERMAN NATL CTR INFORMAT TECHNOL GMD, D-64293 DARMSTADT, GERMANY. UNIV SAARLAND, D-6600 SAARBRUCKEN, GERMANY. TECH UNIV DARMSTADT, D-64287 DARMSTADT, GERMANY. OTTO VON GUERICKE UNIV, MAGDEBURG, GERMANY. CR ABB B, 1996, LECT NOTES ARTIF INT, V1036, P277 ALEXA M, 1996, ECRIM NEWS, P18 ALLEN J, 1987, STUDIES SPEECH SCI C ALSHAWI H, 1992, CORE LANGUAGE ENG ALTMANN H, 1989, INTONATION MODUS FOK Altmann Hans, 1988, INTONATIONSFORSCHUNG BATEMAN J, 1995, INFORM PROCESS MANAG, V31, P753, DOI 10.1016/0306-4573(95)00053-J BATEMAN JA, IN PRESS ENCY LIB IN BATEMAN JA, 1994, ADV SPEECH APPL EURO, V1 BATEMAN JA, 1991, INT C CURR ISS COMP BATEMAN JA, 1989, WORD, V40, P263 BELKIN NJ, 1985, INTERACTION INFORMAT BELKIN NJ, 1994, 875 GMD BELKIN NJ, 1995, EXPERT SYST APPL, V9, P379, DOI 10.1016/0957-4174(95)00011-W BERRY M, 1981, STUDIES DISCOURSE AN BIERWISCH M, 1966, STUD GRAMMATICA, V7, P99 BILANGE E, 1991, P 5 EACL, P83, DOI 10.3115/977180.977195 BOLINGER D, 1972, LANGUAGE, V48 DAVIS JR, 1988, P 26 ANN M ASS COMP, P187, DOI 10.3115/982023.982046 DEGAND L, 1993, DUTCH GRAMMAR DOCUME DILLEY S, 1992, P 3 C APPL NAT LANG, P72, DOI 10.3115/974499.974512 DORFFNER G, 1990, 13 INT C COMP LING, V2, P89 ENGDAHL E, 1994, INTEGRATING INFORMAT Fawcett R. P., 1988, NEW DEV SYSTEMIC LIN, V2, P116 FAWCETT RP, 1990, 5 INT WORKSH NAT LAN Fery C, 1993, GERMAN INTONATIONAL GOSY M, 1996, FINAL YEAR INTONATIO Grosz B. J., 1986, Computational Linguistics, V12 GROTE B, 1995, COPERNICUS 93 HAGEN E, 1996, P 11 CAN C ART INT C, P84 Halliday M. A. K., 1970, COURSE SPOKEN ENGLIS Halliday M. A. K., 1967, J LINGUIST, V3, p[37, 199], DOI 10.1017/S0022226700012949 Halliday M. A. K., 1967, INTONATION GRAMMAR B Halliday M. A. K., 1985, INTRO FUNCTIONAL GRA Halliday M. A. K., 1967, J LINGUIST, V3, P199, DOI DOI 10.1017/S0022226700016613 HALLIDAY MA, 1989, LANGUAGE LEARNING SE Halliday Michael A. K., 1978, LANGUAGE SOCIAL SEMI Hans Kamp, 1981, TRUTH INTERPRETATION HEMERT J, 1987, ANAL SYNTHESE GESPRO, P34 HENSCHEL R, 1995, P CLNLP 95 APR 1995 HIRSCHBERG J, 1995, P ESCA WORKSH SPOK D, P189 HIRSCHBERG J, 1992, TALKING MACHINES THE, P367 Hirschberg J., 1986, 24th Annual Meeting of the Association for Computational Linguistics. Proceedings of the Conference HOVY E, 1992, LECT NOTES ARTIF INT, V587, P57 HUBER K, 1987, ANAL SYNTHESE GESPRO, P26 INGWERSEN P, 1992, INFORMATION RETRIEVA Iordanskaja L., 1991, NATURAL LANGUAGE GEN, P293 JOKINEN K, 1996, LECT NOTES ARTIF INT, V1036, P168 KAMPS T, 1996, INFORMATION RETRIEVA, P225 KASPAR B, 1995, P EUROSPEECH, P1161 KASPER RT, 1989, P DARPA WORKSH SPEEC Kohler KJ, 1977, EINFUHRUNG PHONETIK MANN WC, 1982, 151RR82104 USC INF S Martin J.R., 1992, ENGLISH TEXT SYSTEMS Matthiessen, 1995, LEXICOGRAMMATICAL CA McKeown K. R., 1985, TEXT GENERATION USIN MOHLER G, 1996, SPEECH GENERATION MU Moore J. 
D., 1993, Computational Linguistics, V19 NAKATANI CH, 1995, PROGR SPEECH SYNTHES NICOLOV N, 1996, P 8 INT WORKSH NAT L, P31 ODONNELL M, 1990, WORD, V41, P293 Olaszy G, 1997, SPEECH COMMUN, V21, P37, DOI 10.1016/S0167-6393(96)00071-4 OLASZY G, 1992, TALKING MACH THEORIE, P385 OSGOOD R, 1993, PROCEEDINGS OF THE ELEVENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, P309 PHEBY J, 1980, INTONATION GRAMMATIK Pheby J, 1980, GRUNDZUGE DTSCH GRAM, P839 PIERREHUMBERT J, 1990, SYS DEV FDN, P271 Portele T, 1997, SPEECH COMMUN, V21, P61, DOI 10.1016/S0167-6393(96)00072-6 PREVOST S, 1994, SPEECH COMMUN, V15, P139, DOI 10.1016/0167-6393(94)90048-5 Quirk Randolph, 1985, COMPREHENSIVE GRAMMA REITHINGER N, 1995, P ESCA WORKSH SPOK D, P33 ROSTEK L, 1994, INT J ELECT PUBLISHI, V6, P495 Searle John R., 1969, SPEECH ACTS SELTING M, 1993, Z SPRACHWISS, V11, P99 SITTER S, 1992, INFORM PROCESS MANAG, V28, P165, DOI 10.1016/0306-4573(92)90044-Z STEIN A, 1993, PROCEEDINGS OF THE ELEVENTH NATIONAL CONFERENCE ON ARTIFICIAL INTELLIGENCE, P283 Stock O, 1993, INTELLIGENT MULTIMED, P197 TEICH E, 1994, P 7 INT WORKSH NAT L TEICH E, 1992, KOMET GRAMMAR DOCUME THART J, 1990, PRECEPTUAL STUDY INT THIEL U, 1996, ELECT SERIES Traum D., 1992, COMPUT INTELL, V8, P575, DOI DOI 10.1111/J.1467-8640.1992.TB00380.X vanDeemter K, 1997, SPEECH COMMUN, V21, P101, DOI 10.1016/S0167-6393(96)00069-6 VANDERLINDEN K, 1995, COMPUT LINGUIST, V21, P29 Ventola Eija, 1987, STRUCTURE SOCIAL INT WHALEN T, 1988, P CHI 88, P289 Wunderlich D., 1988, INTONATIONSFORSCHUNG, P1 NR 87 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1997 VL 21 IS 1-2 BP 73 EP 99 DI 10.1016/S0167-6393(96)00070-2 PG 27 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WV320 UT WOS:A1997WV32000007 ER PT J AU vanDeemter, K Odijk, J AF vanDeemter, K Odijk, J TI Context modeling and the generation of spoken discourse SO SPEECH COMMUNICATION LA English DT Article DE language/speech generation; music information system; accent location; prosody; context modeling; discourse structure AB This paper presents the Dial-Your-Disc (DYD) system, an interactive system that supports browsing through a large database of musical information and generates a spoken monologue once a musical composition has been selected. The paper focuses on the generation of spoken monologues and, more specifically, on the various ways in which the generation of an utterance at a given point in the monologue requires modeling of the linguistic context of the utterance. RP vanDeemter, K (reprint author), INST PERCEPT RES, POB 513, NL-5600 MB EINDHOVEN, NETHERLANDS. 
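The DYD record above ties accent placement to a model of the linguistic context. The fragment below is not the DYD algorithm, only the classic given/new heuristic it builds on: first mentions of content words are accented, subsequent mentions are deaccented. The word lists and the set-based discourse model are invented for illustration.

```python
# A minimal given/new accenting sketch: the "discourse model" is just a set
# of previously mentioned stems; accent is marked here by upper case.
FUNCTION_WORDS = {"the", "a", "of", "by", "was", "in", "is"}

def assign_accents(words, discourse_model):
    out = []
    for w in words:
        stem = w.lower().strip(".,")
        if not stem or stem in FUNCTION_WORDS:
            out.append(w)                  # function words: leave alone
        elif stem in discourse_model:
            out.append(w)                  # given information: deaccent
        else:
            out.append(w.upper())          # new information: accent
            discourse_model.add(stem)
    return " ".join(out)

model = set()
print(assign_accents("The sonata was written by Brahms .".split(), model))
print(assign_accents("Brahms wrote the sonata in 1886 .".split(), model))
```

On the second sentence, "Brahms" and "sonata" are already in the model and so stay deaccented, which is the kind of context-dependent decision the DYD monologue generator has to make at every utterance.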
CR ABB B, 1996, TRENDS NATURAL LANGU Asher N., 1988, Journal of Semantics, V6, DOI 10.1093/jos/6.1.309 AUGUSTEIJN L, 1990, LECT NOTES COMPUTER BARWISE J, 1985, NOUN PHRASES GEN QUA BUVAC S, IN PRESS FUND INFORM, V23 CARTER D, 1987, INTERPRETING ANAPHOR Chomsky N., 1986, KNOWLEDGE LANGUAGE I Chomsky Noam, 1981, LECTURES GOVT BINDIN COLE R, 1995, IEEE T SPEECH AUDI P, V3, P1, DOI 10.1109/89.365385 Collier R., 1990, PERCEPTUAL STUDY INT CONDORAVDI C, 1996, QUANTIFIERS DEDUCTIO DIRKSEN A, 1992, P COL NANT FRANC GORDON PC, 1993, COGNITIVE SCI, V17, P311, DOI 10.1207/s15516709cog1703_1 GROENENDIJK J, 1991, LINGUIST PHILOS, V14, P39, DOI 10.1007/BF00628304 GROSZ BJ, 1995, COMPUT LINGUIST, V21, P203 Heim I., 1982, THESIS U MASSACHUSET Heim Irene, 1992, J SEMANT, V9, P183, DOI DOI 10.1093/JOS/9.3.183 HINRICHS E, 1986, LINGUIST PHILOS, V9, P63 HIRSCHBERG J, 1990, P AAAI, P953 Kamp H., 1993, STUDIES LINGUISTICS, V42 KAPLAN D., 1979, CONT PERSPECTIVES PH, P401 Kempen G., 1987, NATURAL LANGUAGE GEN KLEIN E, 1980, LINGUISTICS PHILOS, V4 Ladd D. R., 1980, STRUCTURE INTONATION McKeown K., 1985, STUDIES NATURAL LANG Montague R, 1974, FORMAL PHILOS, P95 NEY H, 1994, INT J PATTERN RECOGN ODIJK J, 1995, CLIN 5, P123 ODIJK J, 1995, 1079 IPO ODIJK J, 1996, 30 IPO, P114 Partee Barbara Hall, 1989, PAPERS CHICAGO LINGU, V25, P342 *PENM NAT LANG GRO, 1989, PENM US GUID PIERREHUMBERT J, 1990, SYS DEV FDN, P271 PINKAL M, 1986, P COL 86 BONN 25 29, P368, DOI 10.3115/991365.991474 Pollard Carl, 1994, STUDIES CONT LINGUIS ROSETTA M, 1994, INT SERIES ENG COMP, V273 Sidner Candace, 1979, TR537 MIT AI LAB Terken J, 1987, LANG COGNITIVE PROC, V2, P145, DOI 10.1080/01690968708406928 USZKOREIT H, 1996, SURVEY STATE ART HUM, P161 van Deemter K., 1994, J SEMANTICS, V11, P1 van der Sandt Rob, 1992, J SEMANT, V9, P333, DOI DOI 10.1093/JOS/9.4.333 VANDEEMTER K, 1994, FOCUS NATURAL LANGUA VANDEEMTER K, 1994, P TWLT 8 TWENT U TWE, P87 van Deemter K., 1992, Journal of Semantics, V9, DOI 10.1093/jos/9.1.27 VANDONSELAAR W, 1995, UNPUB LANGUAGE SPEEC VANDONSELAAR W, 1995, IN PRESS P EUR 95 MA WILLEMS N, 1988, J ACOUST SOC AM, V84 NR 47 TC 13 Z9 13 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1997 VL 21 IS 1-2 BP 101 EP 121 DI 10.1016/S0167-6393(96)00069-6 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WV320 UT WOS:A1997WV32000008 ER PT J AU Taylor, P Isard, A AF Taylor, P Isard, A TI SSML: A speech synthesis markup language SO SPEECH COMMUNICATION LA English DT Article DE speech synthesis; markup language; SGML; API; standardization AB This paper describes the Speech Synthesis Markup Language, SSML, which has been designed as a platform independent interface standard for speech synthesis systems. The paper discusses the need for standardisation in speech synthesizers and how this will help builders of systems make better use of synthesis. Next the features of SSML (based on SGML, standard generalised markup language) are discussed, and details of the Edinburgh SSML interpreter are given as a guide on how to implement an SSML-based synthesizer. RP Taylor, P (reprint author), UNIV EDINBURGH, CTR SPEECH TECHNOL RES, 80 S BRIDGE, EDINBURGH EH1 1HN, MIDLOTHIAN, SCOTLAND. 
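The actual SSML tag inventory is defined in the paper and is not reproduced in this record, so the <emph> and <pause> tags below are hypothetical stand-ins. The fragment only shows the general shape of an interpreter: SGML-style markup is flattened into a directive stream that a synthesizer front end could consume.

```python
import re

# Hypothetical SGML-style markup loosely in the spirit of SSML.
marked_up = 'Press <emph>one</emph> for sales. <pause dur="400"/> Goodbye.'

# Split on tags, then walk the tokens, tracking emphasis state and emitting
# (directive, ...) tuples for a downstream synthesizer module.
tokens = re.split(r'(<[^>]+>)', marked_up)
emph, stream = False, []
for tok in tokens:
    if tok == '<emph>':
        emph = True
    elif tok == '</emph>':
        emph = False
    elif tok.startswith('<pause'):
        ms = re.search(r'dur="(\d+)"', tok)
        stream.append(('PAUSE', int(ms.group(1)) if ms else 200))
    elif tok.strip():
        stream.append(('SAY', tok.strip(), emph))
print(stream)
```

Keeping the markup platform-independent, as the abstract argues, means any synthesizer that can consume such a directive stream can be driven from the same marked-up text.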
CR Allen J., 1987, TEXT SPEECH MITALK S BLACK AW, 1994, ICSLP 94 YOK JAP BLACK AW, 1994, COL 94 KYOT JAP CRYSTAL D, 1969, CAMBRIDGE STUDIES LI GOLDFARB CF, 1970, P AM SOC INFORM SCI, V7, P147 HERTZ S, 1990, PAPERS LAB PHONOLOGY, V1 ISARD AC, 1995, THESIS U EDINBURGH LADD DR, 1983, LANGUAGE, V59, P721, DOI 10.2307/413371 O'Connor John D., 1973, INTONATION COLLOQUIA Pierrehumbert J, 1980, THESIS MIT Raman T.V., 1994, THESIS CORNELL U SANDERS E, 1995, P EUR 95 MADR SILVERMAN K, 1992, INT C SPEECH LANG PR Taylor P. A., 1992, THESIS U EDINBURGH TAYLOR PA, 1994, 2 ESCA IEEE WORKSH S NR 15 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD FEB PY 1997 VL 21 IS 1-2 BP 123 EP 133 DI 10.1016/S0167-6393(96)00068-4 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WV320 UT WOS:A1997WV32000009 ER PT J AU Gilloire, A Hansler, E Kellerman, W Svean, J AF Gilloire, A Hansler, E Kellerman, W Svean, J TI Special issue on acoustic echo control and speech enhancement techniques SO SPEECH COMMUNICATION LA English DT Editorial Material C1 TH DARMSTADT, DARMSTADT, GERMANY. FACHHSCH REGENSBURG, REGENSBURG, GERMANY. SINTEF, DELAB, TRONDHEIM, NORWAY. RP Gilloire, A (reprint author), CNRS, GDR, PRC, ISIS, F-75700 PARIS, FRANCE. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1996 VL 20 IS 3-4 BP 177 EP 179 DI 10.1016/S0167-6393(96)00065-9 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WE990 UT WOS:A1996WE99000001 ER PT J AU Martin, R Gustafsson, S AF Martin, R Gustafsson, S TI The echo shaping approach to acoustic echo control SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th International Workshop on Acoustic Echo and Noise Control (IWAENC 95) CY JUN 22-23, 1995 CL ROROS, NORWAY DE acoustic echo control; psychoacoustics; Wiener filter AB This paper describes and analyses algorithms for hands-free telephony which use an acoustic echo canceller combined with an additional FIR filter in the sending path of the hands-free telephone. We describe two different methods to adapt the additional FIR filter (called ''the echo shaping filter'') which are highly effective and easy to implement. It is shown that these algorithms allow the order of the compensator to be reduced significantly while still providing high echo attenuation and only little distortion of the near-end speech signal during double talk. RP Martin, R (reprint author), RHEIN WESTFAL TH AACHEN, INST COMMUN SYST & DATA PROC, MUFFETERWEG 3, D-52056 AACHEN, GERMANY. CR ANTWEILER C, 1995, THESIS AACHEN U TECH FRENZEL R, 1992, FORTSCHRITT BERICHTE, V10 GUSTAFSSON S, 1996, P WORKSH QUAL ASS SP, P36 HANSLER E, 1992, SIGNAL PROCESS, V27, P259, DOI 10.1016/0165-1684(92)90074-7 HANSLER E, 1994, ANN TELECOMMUN, V49, P360 Haykin S., 1986, ADAPTIVE FILTER THEO Jayant N. S., 1984, DIGITAL CODING WAVEF Martin R., 1993, P EUROSPEECH 93 BERL, P1093 MARTIN R, 1995, THESIS AACHEN U TECH MARTIN R, 1995, P 4 INT WORKSH AC EC, P48 MARTIN R, 1995, P INT C AC SPEECH SI, P3043 NR 11 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD DEC PY 1996 VL 20 IS 3-4 BP 181 EP 190 DI 10.1016/S0167-6393(96)00055-6 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WE990 UT WOS:A1996WE99000002 ER PT J AU Jeannes, RL Faucon, G Ayad, B AF Jeannes, RL Faucon, G Ayad, B TI How to improve acoustic echo and noise cancelling using a single talk detector SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th International Workshop on Acoustic Echo and Noise Control (IWAENC 95) CY JUN 22-23, 1995 CL ROROS, NORWAY DE acoustic echo cancellation; noise reduction; mobile communications AB This paper deals with speech enhancement for hands-free audio terminals, including two major problems: noise reduction and acoustic echo cancellation. Our objective is to combine a noise reduction system and an acoustic echo canceller to get a near-end speech signal with minimum distortion and low levels of echo and noise. We present four structures (using one or two microphones and one loudspeaker) where the operation of echo cancellation comes before that of noise reduction. Generally, the noise reduction system is derived from the output of the acoustic echo canceller. An alternative is to derive the noise reduction directly from the microphone observation in order to decrease the distortion on the near-end speech signal. Experimental results are presented. Finally, in the mono-channel situation, an optimized structure controlled by an echo detector is proposed and tested. C1 UNIV RENNES 1, LAB TRAITEMENT SIGNAL & IMAGE, F-35042 RENNES, FRANCE. CR AZIRANI AA, 1995, THESIS U RENNES 1 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 JEANNES RL, 1995, SPEECH COMMUN, V16, P245 MARTIN R, 1994, ANN TELECOMMUN, V49, P414 PRADO J, 1993, 3 INT WORKSH AC ECH, P249 YASUKAWA H, 1992, ELECTRON LETT, V28, P1403, DOI 10.1049/el:19920892 NR 6 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1996 VL 20 IS 3-4 BP 191 EP 202 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WE990 UT WOS:A1996WE99000003 ER PT J AU Scalart, P Benamar, A AF Scalart, P Benamar, A TI A system for speech enhancement in the context of hands-free radiotelephony with combined noise reduction and acoustic echo cancellation SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th International Workshop on Acoustic Echo and Noise Control (IWAENC 95) CY JUN 22-23, 1995 CL ROROS, NORWAY DE speech enhancement; hands-free radiotelephony; noise reduction; acoustic echo cancellation; combined systems ID SUPPRESSION AB This paper addresses typical problems encountered with hands-free equipment in the context of GSM radiotelephony. We first summarise some important characteristics of the noise field in moving vehicles, and we also describe the acoustical echo phenomenon. We show that, in order to provide sufficiently high speech quality, this hands-free equipment should include noise reduction (NR) and acoustic echo control (AEC) devices. We then describe two possible structures combining noise reduction and acoustic echo control. In conclusion, we note that the choice of a particular structure, among those proposed, is conditioned by the performance of the adaptation algorithm of the AEC solution. C1 ALCATEL MOBILE PHONES, F-92707 COLOMBES, FRANCE.
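Both records above combine an acoustic echo canceller with a noise reduction stage, with echo cancellation applied first. A minimal sketch of that structure follows, using a standard NLMS canceller and plain magnitude spectral subtraction as stand-ins; the filter length, step size, simulated echo path and noise-floor estimate are invented, and the papers' actual NR stages are more elaborate.

```python
import numpy as np

rng = np.random.default_rng(3)

def nlms_aec(mic, far, order=64, mu=0.5, eps=1e-6):
    """NLMS echo canceller: adaptively model the echo path from the far-end
    signal and subtract the estimated echo from the microphone signal."""
    w, buf, e = np.zeros(order), np.zeros(order), np.zeros(len(mic))
    for n in range(len(mic)):
        buf = np.roll(buf, 1); buf[0] = far[n]
        e[n] = mic[n] - w @ buf
        w += mu * e[n] * buf / (buf @ buf + eps)
    return e

def spectral_subtraction(x, noise_psd, nfft=256):
    """Per-frame magnitude spectral subtraction (rectangular windows, no
    overlap, spectral floor at 5% -- deliberately minimal)."""
    out = np.zeros_like(x)
    for i in range(0, len(x) - nfft, nfft):
        X = np.fft.rfft(x[i:i + nfft])
        mag = np.maximum(np.abs(X) - np.sqrt(noise_psd), 0.05 * np.abs(X))
        out[i:i + nfft] = np.fft.irfft(mag * np.exp(1j * np.angle(X)), nfft)
    return out

far = rng.normal(size=8000)                       # far-end loudspeaker signal
echo_path = 0.1 * rng.normal(size=64) * np.exp(-np.arange(64) / 8.0)
mic = np.convolve(far, echo_path)[:8000] + 0.05 * rng.normal(size=8000)
e = nlms_aec(mic, far)                            # echo cancellation first...
clean = spectral_subtraction(e, np.full(129, 0.05 ** 2 * 256))  # ...then NR
print("echo attenuation (dB):", 10 * np.log10(np.mean(mic**2) / np.mean(e**2)))
```

The ordering matters: running the canceller on the raw microphone signal keeps the NR stage's nonlinear processing out of the canceller's adaptation loop, which is the design point both abstracts turn on.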
RP Scalart, P (reprint author), FRANCE TELECOM, CNET,CMC,TSS,LAA, AV PIERRE MARZIN, F-22307 LANNION, FRANCE. CR AULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 BRANCACCIO A, 1993, P EUROSPEECH 93, P1259 CAO Z, 1994, SIGNAL PROCESS, V7, P1206 Cappe O, 1994, IEEE T SPEECH AUDI P, V2, P345, DOI 10.1109/89.279283 COLOMES C, 1995, J AUDIO ENG SOC, V43, P233 Doblinger G., 1995, P 4 EUR C SPEECH COM, P1513 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 *ETSI, 1995, SMG2 ETSISTC FAUCON G, 1995, P EUROSPEECH, P1525 GILLOIRE A, 1987, ECHO RECHERCHES, P43 GILLOIRE A, 1994, ANN TELECOMMUN, V49, P368 GOUBRAN RA, 1986, P IEEE VEH TECH C, P72 GOULDING MM, 1990, IEEE T VEH TECHNOL, V39, P316, DOI 10.1109/25.61353 HAKKINEN J, 1993, 4 INT C ICSPAT, P300 HANSEN JHL, 1995, TUT INT C AC SPEECH Haykin S., 1991, ADAPTIVE FILTER THEO *ITU T, 1994, 1278 COM ITUT *ITU TSS, 1993, 1212E COM ITUTSS LEBOUQUINJEANNE.R, 1994, SIGNAL PROCESSING 7, V2, P1206 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 MARTIN R, 1994, ANN TELECOMMUN, V49, P429 Ozeki K., 1984, Electronics and Communications in Japan, V67 SCALART P, 1995, P 4 INT WORKSH AC EC, P83 VARY P, 1985, SIGNAL PROCESS, V8, P387, DOI 10.1016/0165-1684(85)90002-7 Yang J., 1993, INT C AC SPEECH SIGN, V2, P363, DOI 10.1109/ICASSP.1993.319313 NR 26 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1996 VL 20 IS 3-4 BP 203 EP 214 DI 10.1016/S0167-6393(96)00056-8 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WE990 UT WOS:A1996WE99000004 ER PT J AU Fischer, S Simmer, KU AF Fischer, S Simmer, KU TI Beamforming microphone arrays for speech acquisition in noisy environments SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th International Workshop on Acoustic Echo and Noise Control (IWAENC 95) CY JUN 22-23, 1995 CL ROROS, NORWAY DE microphone-arrays; audio-beamforming; noise reduction; speech processing ID HEARING-AIDS; IMPLEMENTATION; SUPPRESSION; ALGORITHM; REDUCTION; SYSTEMS AB In this paper we present an adaptive microphone array with adaptive constraint values to suppress coherent as well as incoherent noise in disturbed speech signals. We use a generalized sidelobe cancelling (GSC) structure implemented in the frequency domain, since it allows the determination of the adaptive look-direction response (to suppress incoherent noise) and the adjustment of the adaptive filters (for cancellation of coherent noise) to be handled separately. The transfer function in the look-direction is an adaptive Wiener filter, which is estimated by using the short-time Fourier transform and the Nuttall/Carter method for spectrum estimation. The experimental results demonstrate that the proposed method works well for a large range of reverberation times and is therefore able to operate independently of the correlation properties of the noise field. C1 HOUPERT DIGITAL AUDIO, D-28359 BREMEN, GERMANY. RP Fischer, S (reprint author), UNIV BREMEN, DEPT PHYS & ELECT ENGN, POB 330 440, D-28334 BREMEN, GERMANY.
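As background for the beamforming records in this span, here is a delay-sum beamformer in its simplest form: each microphone signal is advanced by its steering delay so that the look direction adds coherently. The array geometry and sampling rate are arbitrary choices for the sketch; the GSC structure of the record above adds an adaptive sidelobe-cancelling path on top of such a fixed beamformer.

```python
import numpy as np

C, FS, SPACING = 343.0, 16000, 0.05   # sound speed (m/s), rate (Hz), mic gap (m)

def frac_delay(x, tau):
    """Circularly delay x by tau seconds via an FFT phase shift."""
    f = np.fft.rfftfreq(len(x), 1.0 / FS)
    return np.fft.irfft(np.fft.rfft(x) * np.exp(-2j * np.pi * f * tau), len(x))

def delay_sum(channels, theta):
    """Delay-sum beamformer for a uniform linear array, steered to angle
    theta (radians from broadside): advance each mic by its arrival delay,
    then average so the look direction adds coherently."""
    m, n = channels.shape
    X = np.fft.rfft(channels, axis=1)
    f = np.fft.rfftfreq(n, 1.0 / FS)
    for i in range(m):
        tau = i * SPACING * np.sin(theta) / C
        X[i] *= np.exp(2j * np.pi * f * tau)
    return np.fft.irfft(X.sum(axis=0), n) / m

# Toy check: a plane wave from 30 degrees on a 4-mic array is recovered
# almost exactly once the array is steered to the same angle.
src = np.random.default_rng(4).normal(size=2048)
theta = np.pi / 6
chans = np.stack([frac_delay(src, i * SPACING * np.sin(theta) / C)
                  for i in range(4)])
print("max error:", np.abs(delay_sum(chans, theta) - src).max())
```

A filter-sum beamformer, as in the Elko record below, generalizes this by giving each channel its own frequency-dependent filter instead of a pure delay.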
RI Simmer, Karen/H-5834-2014 CR AFFES S, 1994, P 5 INT C SIGN PROC, V1, P154 ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599 AN J, 1994, IEE P-RADAR SON NAV, V141, P270, DOI 10.1049/ip-rsn:19941411 ARMBRUSTER W, 1986, P EUR SIGN PROC C EU, P391 BARNWELL T, 1979, DCA10078003 GEORG I BOUCHER RE, 1981, IEEE T ACOUST SPEECH, V29, P609, DOI 10.1109/TASSP.1981.1163623 BUCKLEY KM, 1986, IEEE T ACOUST SPEECH, V34, P1322, DOI 10.1109/TASSP.1986.1164927 Carter G. C., 1993, COHERENCE TIME DELAY CHEN YH, 1992, J ACOUST SOC AM, V91, P3354, DOI 10.1121/1.402825 CLAESSON I, 1992, IEEE T ANTENN PROPAG, V40, P1093, DOI 10.1109/8.166535 COX H, 1987, IEEE T ACOUST SPEECH, V35, P1365, DOI 10.1109/TASSP.1987.1165054 DOWLING EM, 1992, MICROPROCESS MICROSY, V16, P507, DOI 10.1016/0141-9331(92)90080-D FLANAGAN JL, 1991, ACUSTICA, V73, P58 FLANAGAN JL, 1985, AT&T TECH J, V64, P983 FLANAGAN JL, 1985, J ACOUST SOC AM, V78, P1508, DOI 10.1121/1.392786 FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817 GIERL S, 1990, P 22 INT S AUT TECHN, P517 GREENBERG JE, 1992, J ACOUST SOC AM, V91, P1662, DOI 10.1121/1.402446 GRENIER Y, 1993, SPEECH COMMUN, V12, P25, DOI 10.1016/0167-6393(93)90016-E GRIFFITHS LJ, 1982, IEEE T ANTENN PROPAG, V30, P27, DOI 10.1109/TAP.1982.1142739 HOFFMAN MW, 1994, J ACOUST SOC AM, V96, P759, DOI 10.1121/1.410313 JIM CW, 1977, P IEEE, V65, P1730, DOI 10.1109/PROC.1977.10820 Johson D., 1993, ARRAY SIGNAL PROCESS KANEDA Y, 1986, IEEE T ACOUST SPEECH, V34, P1391, DOI 10.1109/TASSP.1986.1164975 KELLERMANN W, 1991, P IEEE INT C AC SPEE, V5, P3581 KHALIL F, 1994, J AUDIO ENG SOC, V42, P691 KROSCHEL K, 1991, DIGITAL SIGNAL PROCESSING - 91, P223 KUCZNYSKI P, 1994, KLEINHEUBACHER BER, V38, P369 LIM JS, 1983, SPEECH ENHANCEMENT MAHIEUX Y, 1995, AES 98 CONV PAR MAHIEUX Y, 1993, 1993 IEEE WORKSH APP NORDHOLM S, 1993, IEEE T VEH TECHNOL, V42, P514, DOI 10.1109/25.260760 NUTTALL AH, 1982, P IEEE, V70, P1115, DOI 10.1109/PROC.1982.12435 Peterson P M, 1987, J Rehabil Res Dev, V24, P103 PIRZ F, 1979, AT&T TECH J, V58, P1839 SIMMER KU, 1994, ANN TELECOMMUN, V49, P439 SIMMER KU, 1992, 2 COST 229 WORKSH AD, P185 SONDHI MM, 1986, P ICASSP TOK JAP APR, V2, P981 SYDOW C, 1994, J ACOUST SOC AM, V96, P845, DOI 10.1121/1.410323 Van Veen B. D., 1988, IEEE ASSP MAGAZI APR, P4 VANCOMPERNOLLE D, 1990, SPEECH COMMUN, V9, P433, DOI 10.1016/0167-6393(90)90019-6 WIDROW B, 1967, PR INST ELECTR ELECT, V55, P2143, DOI 10.1109/PROC.1967.6092 WIDROW B, 1975, P IEEE, V63, P1692, DOI 10.1109/PROC.1975.10036 YANG Z, 1993, QUAT CGRETSI JUAN LE, P479 Zelinski R., 1988, P INT C AC SPEECH SI, P2578 ZELINSKI R, 1990, ELECTRON LETT, V26, P2036, DOI 10.1049/el:19901314 NR 46 TC 29 Z9 31 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD DEC PY 1996 VL 20 IS 3-4 BP 215 EP 227 DI 10.1016/S0167-6393(96)00054-4 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WE990 UT WOS:A1996WE99000005 ER PT J AU Elko, GW AF Elko, GW TI Microphone array systems for hands-free telecommunication SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th International Workshop on Acoustic Echo and Noise Control (IWAENC 95) CY JUN 22-23, 1995 CL ROROS, NORWAY DE microphone arrays; beamforming; adaptive beamforming; directional microphones; steerable microphones AB Microphone array systems can be effective in combating the detrimental effects of acoustic noise and reverberation in hands-free telecommunication. This paper discusses classical delay-sum beamformers as well as the more general filter-sum beamformers. Filter-sum beamformers add the ability to control the array beampattern as a function of frequency and a new design method for a constant beamwidth filter-sum beamformer is presented. The delay-sum and filter-sum beamformers require array sizes that are comparable to the acoustic wavelength. These designs can result in arrays that are large in size. For applications that are space constrained, differential microphone array systems are presented. Finally, two types of adaptive beamformers are presented: a broadside array and a two-element differential microphone. RP Elko, GW (reprint author), AT&T BELL LABS, ACOUST RES DEPT, LUCENT TECHNOL, MURRAY HILL, NJ 07974 USA. CR CEZANNE J, 1995, J ACOUST SOC AM, V96, P3262 CHENG DK, 1965, IEEE T ANTENN PROPAG, VAP13, P973, DOI 10.1109/TAP.1965.1138542 CHOU TC, 1994, THESIS MIT CAMBRIDGE FLANAGAN JL, 1985, J ACOUST SOC AM, V78, P1508, DOI 10.1121/1.392786 FROST OL, 1972, PR INST ELECTR ELECT, V60, P926, DOI 10.1109/PROC.1972.8817 GOODWIN MM, 1992, THESIS MIT CAMBRIDGE Johnson D, 1993, ARRAY SIGNAL PROCESS KANEDA Y, 1986, IEEE T ACOUST SPEECH, V34, P1391, DOI 10.1109/TASSP.1986.1164975 KELLERMANN W, 1991, P IEEE INT C AC SPEE, V5, P3581 LUSTBERG RJ, 1993, THESIS MIT CAMBRIDGE Monzingo R. A., 1980, INTRO ADAPTIVE ARRAY Olson H. F., 1957, ACOUSTICAL ENG SONDHI MM, 1986, P ICASSP TOK JAP APR, V2, P981 WEST JE, 1995, J ACOUST SOC AM, V96, P3262 Widrow B, 1985, ADAPTIVE SIGNAL PROC NR 15 TC 40 Z9 40 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1996 VL 20 IS 3-4 BP 229 EP 240 DI 10.1016/S0167-6393(96)00057-X PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WE990 UT WOS:A1996WE99000006 ER PT J AU Gierlich, HW AF Gierlich, HW TI The auditory perceived quality of hands-free telephones: Auditory judgements, instrumental measurements and their relationship SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT 4th International Workshop on Acoustic Echo and Noise Control (IWAENC 95) CY JUN 22-23, 1995 CL ROROS, NORWAY DE quality evaluation; instrumental measurement; auditory evaluation; hands-free telephone AB In this paper quality evaluation procedures for hands-free telephones are described with reference to some examples. The paper briefly describes the hearing tests conducted in order to investigate the auditory perceived quality in single talk condition. The parameters found to be most relevant in this situation are listed. Special tests were conducted in order to evaluate the conversational quality of the systems. 
The parameters found to be most interesting are listed as well. Procedures for instrumental evaluation are then illustrated using examples from the different hands-free telephones. As a consequence of the auditory evaluation, a classification of different types of hands-free telephones is proposed. For each class of hands-free telephones the relevant instrumental measurement parameters are listed. RP Gierlich, HW (reprint author), HEAD ACOUST GMBH, EBERTSTR 30A, D-52134 HERZOGENRATH, GERMANY. CR AURES W, 1984, THESIS MUNCHEN BEERENDS JG, 1992, CONTR 90 AES CONV PA BERGER J, 1994, P EL SPRACHS 5 7 OCT Blauert J., 1983, SPATIAL HEARING BODDEN M, 1992, THESIS RUHR U BOCHUM *ETSI, 1994, ETS 300 245 3 FRENCH NR, 1947, J ACOUST SOC AM, V19, P90, DOI 10.1121/1.1916407 GIERLICH HW, 1995, P DAGA 95, P671 GIERLICH HW, 1992, SIGNAL PROCESS, V27, P281, DOI 10.1016/0165-1684(92)90076-9 GIERLICH HW, 1995, P DAGA 95, P1039 HALKA U, 1993, P AS C SIG SYST COMP, P1196, DOI 10.1109/ACSSC.1993.342382 *ITU T, 1995, 12R25 COM ITUT *ITU T, 1996, 12R32E COM ITUT *ITU T, 1993, P34 ITU T *ITU T, 1994, 15 ITUT SG *ITU T, 1995, HDB TEL *ITU T COM, 1993, 1213E ITUT COM JORDAN VC, 1982, APPL ACOUST, P321 VONHOVEL H, 1984, THESIS RWTH AACHEN Zwicker E., 1982, PSYCHOAKUSTIK NR 20 TC 6 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1996 VL 20 IS 3-4 BP 241 EP 254 DI 10.1016/S0167-6393(96)00066-0 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WE990 UT WOS:A1996WE99000007 ER PT J AU Bradlow, AR Torretta, GM Pisoni, DB AF Bradlow, AR Torretta, GM Pisoni, DB TI Intelligibility of normal speech .1. Global and fine-grained acoustic-phonetic talker characteristics SO SPEECH COMMUNICATION LA English DT Article DE intelligibility; talker characteristics; acoustic-phonetics ID SPOKEN WORD RECOGNITION; CONVERSATIONAL SPEECH; SPEAKING RATE; STIMULUS VARIABILITY; PERCEPTION; CLEAR; ENGLISH; HEARING; HARD; MEMORY AB This study used a multi-talker database containing intelligibility scores for 2000 sentences (20 talkers, 100 sentences) to identify talker-related correlates of speech intelligibility. We first investigated ''global'' talker characteristics (e.g., gender, F0 and speaking rate). Findings showed female talkers to be more intelligible as a group than male talkers. Additionally, we found a tendency for F0 range to correlate positively with speech intelligibility scores. However, F0 mean and speaking rate did not correlate with intelligibility. We then examined several fine-grained acoustic-phonetic talker characteristics as correlates of overall intelligibility. We found that talkers with larger vowel spaces were generally more intelligible than talkers with reduced spaces. In investigating two cases of consistent listener errors (segment deletion and syllable affiliation), we found that these perceptual errors could be traced directly to detailed timing characteristics in the speech signal. Results suggest that a substantial portion of variability in normal speech intelligibility is traceable to specific acoustic-phonetic characteristics of the talker. Knowledge about these factors may be valuable for improving speech synthesis and recognition strategies, and for special populations (e.g., the hearing-impaired and second-language learners) who are particularly sensitive to intelligibility differences among talkers.
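The Bradlow et al. record above links larger vowel spaces to higher intelligibility. A common way to quantify a talker's vowel space is the area of the polygon spanned by the corner vowels in the F1-F2 plane; the Python sketch below computes that area with the shoelace formula. The formant values are illustrative placeholders, not measurements from the study, and the paper's own metric may differ.

    # Minimal sketch: vowel space area via the shoelace formula.
    # Formant values (Hz) are illustrative placeholders, not data from
    # the Bradlow et al. study.
    def polygon_area(points):
        """Area of a simple polygon given (F2, F1) vertices in order."""
        area = 0.0
        for i in range(len(points)):
            x1, y1 = points[i]
            x2, y2 = points[(i + 1) % len(points)]
            area += x1 * y2 - x2 * y1
        return abs(area) / 2.0

    # Corner vowels /i/, /a/, /u/ as (F2, F1) pairs in Hz.
    expanded = [(2300, 300), (1200, 750), (900, 350)]
    reduced = [(2000, 350), (1300, 650), (1100, 400)]
    print(polygon_area(expanded), polygon_area(reduced))  # larger vs smaller space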
RP Bradlow, AR (reprint author), INDIANA UNIV, DEPT PSYCHOL, SPEECH RES LAB, BLOOMINGTON, IN 47405 USA. RI Bradlow, Ann/G-9319-2011 CR BLACK JW, 1957, J SPEECH HEAR DISORD, V22, P213 BOND ZS, 1994, SPEECH COMMUN, V14, P325, DOI 10.1016/0167-6393(94)90026-4 Bradlow A. R., 1995, P13 BYRD D, 1994, SPEECH COMMUN, V15, P39, DOI 10.1016/0167-6393(94)90039-6 Fant G., 1973, SPEECH SOUNDS FEATUR GERSTMAN LJ, 1968, IEEE T ACOUST SPEECH, VAU16, P78, DOI 10.1109/TAU.1968.1161953 HANSON HW, 1995, J ACOUST SOC AM, V97, P3422, DOI 10.1121/1.412417 HIRSH IJ, 1954, J ACOUST SOC AM, V26, P530, DOI 10.1121/1.1907370 HOOD JD, 1980, AUDIOLOGY, V19, P434 *IEE, 1969, 297 IEEE Karl JR, 1994, J ACOUST SOC AM, V95, P2873, DOI 10.1121/1.409447 KEATING PA, 1994, SPEECH COMMUN, V14, P131, DOI 10.1016/0167-6393(94)90004-3 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 KLUENDER KR, 1988, J PHONETICS, V16, P153 KRAUSE JC, 1995, J ACOUST SOC AM, V98, P2982, DOI 10.1121/1.413900 LADEFOGED P, 1957, J ACOUST SOC AM, V29, P98, DOI 10.1121/1.1908694 LAMEL L, 1986, FEB P DARPA SPEECH R, P100 Laver J, 1979, SOCIAL MARKERS SPEEC, P1 LINDBLOM B, 1990, NATO ADV SCI I D-BEH, V55, P403 LUCE PA, 1981, 7 IND U SPEECH RES L Miller J. L., 1981, PERSPECTIVES STUDY S, P39 Monsen R. B., 1976, J PHONETICS, V4, P189 MOON SJ, 1994, J ACOUST SOC AM, V96, P40, DOI 10.1121/1.410492 MULLENNIX JW, 1989, J ACOUST SOC AM, V85, P365, DOI 10.1121/1.397688 NEEL AT, 1995, J ACOUST SOC AM, V98, P2982, DOI 10.1121/1.413903 NYGAARD LC, 1995, PERCEPT PSYCHOPHYS, V57, P989, DOI 10.3758/BF03205458 NYGAARD LC, 1994, PSYCHOL SCI, V5, P42, DOI 10.1111/j.1467-9280.1994.tb00612.x PALLETT D, 1990, P INT C SPOK LANG PR Palmeri T. J., 1993, J EXPT PSYCHOL LEARN, V19, P1 PARKER EM, 1986, PERCEPT PSYCHOPHYS, V34, P314 PICHENY MA, 1989, J SPEECH HEAR RES, V32, P600 PICHENY MA, 1985, J SPEECH HEAR RES, V28, P96 PICHENY MA, 1986, J SPEECH HEAR RES, V29, P434 PISONI DB, 1993, SPEECH COMMUN, V13, P109, DOI 10.1016/0167-6393(93)90063-Q PORT RF, 1981, J ACOUST SOC AM, V69, P262, DOI 10.1121/1.385347 PORT RF, 1982, PERCEPT PSYCHOPHYS, V32, P141, DOI 10.3758/BF03204273 RUNYON RP, 1991, FUNDAMENTALS BEHAV S, P201 SOMMERS MS, 1994, J ACOUST SOC AM, V96, P1314, DOI 10.1121/1.411453 TIELEN MTJ, 1992, THESIS U AMSTERDAM Uchanski RM, 1996, J SPEECH HEAR RES, V39, P494 Weismer G, 1992, INTELLIGIBILITY SPEE, P67 ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 NR 42 TC 165 Z9 169 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1996 VL 20 IS 3-4 BP 255 EP 272 DI 10.1016/S0167-6393(96)00063-5 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WE990 UT WOS:A1996WE99000008 ER PT J AU delaTorre, A Peinado, AM Rubio, AJ Sanchez, VE Diaz, JE AF delaTorre, A Peinado, AM Rubio, AJ Sanchez, VE Diaz, JE TI An application of minimum classification error to feature space transformations for speech recognition SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; distance measure; feature space transformation; liftering; clustering; discriminative feature extraction (DFE); error-rate; cost function; probability density function (pdf); minimum classification error (MCE); hidden Markov model (HMM) AB The use of signal transformations is a necessary step for feature extraction in pattern recognition systems. 
These transformations should take into account the main goal of pattern recognition: error-rate minimization. In this paper we propose a new method to obtain feature space transformations based on the Minimum Classification Error criterion. The goal of these transformations is to obtain a new representation space where the Euclidean distance is optimal for classification. The proposed method is tested on a speech recognition system using different types of Hidden Markov Models. The comparison with standard pre-processing techniques shows that our method provides an error-rate reduction in all the performed experiments. RP delaTorre, A (reprint author), UNIV GRANADA, DEPT ELECT & TECNOL COMP, E-18071 GRANADA, SPAIN. RI de la Torre, Angel/C-6618-2012; Sanchez, Victoria/C-2411-2012; Peinado, Antonio/C-2401-2012; diaz-verdejo, jesus/B-5372-2011 OI diaz-verdejo, jesus/0000-0002-8424-9932 CR BACCHIANI M, 1994, P INT C AC SPEECH SI, V2, P197 BIEM A, 1994, P INT C SIGN SPEECH, V1, P485 Biem A., 1993, P IEEE INT C AC SPEE, P275 Duda R. O., 1973, PATTERN CLASSIFICATI FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 JUANG BH, 1987, IEEE T ACOUST SPEECH, V35, P947 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 JUNQUA JC, 1989, P INT C AC SPEECH SI, P476 LINDE Y, 1980, IEEE T COMMUN, V28, P84, DOI 10.1109/TCOM.1980.1094577 PALIWAL KK, 1995, P EUR C SPEECH COMM, V1, P541 PEINADO AM, 1994, P INT C AC SPEECH SI PEINADO AM, 1996, IEEE T SPEECH AUDIO, V4, P88 PEINADO AM, 1990, P EUSIPCO 90, V2, P1243 Rabiner L, 1993, FUNDAMENTALS SPEECH RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 SEGURA JC, 1994, SPEECH COMMUN, V14, P163, DOI 10.1016/0167-6393(94)90006-X TOHKURA Y, 1987, IEEE T ACOUST SPEECH, V35, P1414, DOI 10.1109/TASSP.1987.1165058 NR 17 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1996 VL 20 IS 3-4 BP 273 EP 290 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WE990 UT WOS:A1996WE99000009 ER PT J AU Jeong, CG Jeong, H AF Jeong, CG Jeong, H TI Automatic phone segmentation and labeling of continuous speech SO SPEECH COMMUNICATION LA English DT Article DE speech; recognition; segmentation; neural networks; Mean field theory ID RECOGNITION; ALGORITHMS AB To obtain an accurate phone sequence from a continuous speech signal, we suggest a novel approach consisting of tightly coupled bottom-up and top-down processing. The bottom-up path consists of segmentation, recognition and labeling. The top-down path likewise consists of labeling, speech generation and segmentation. In this manner, the four processes form a closed feedback loop that efficiently achieves an optimal interpretation for a given noisy observation of the speech signal and a priori knowledge. The major goal of this paper is to identify the system model using both stochastic estimation theory and mean field theory. Experimental results are obtained on the TIMIT database. It is shown that adding the top-down path to the traditional bottom-up path can improve the recognition rate by 19.7% and reduce the error (substitution, deletion and insertion) rate by 16.1%. As a result, the overall system can transform the incoming continuous signal into one of the 61 phone classes at a rate of 73.7%. C1 POHANG UNIV SCI & TECHNOL, DEPT EE, POHANG 790784, SOUTH KOREA.
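The de la Torre et al. record above derives a feature-space transformation by descending on a smoothed classification-error criterion. A minimal sketch of that idea follows: a linear transform A is trained so that Euclidean distance to class centroids classifies better in the transformed space. The sigmoid-smoothed loss and update rule are generic MCE/GPD choices, not the paper's exact formulation, and the data are synthetic.

    # Minimal MCE sketch: learn a linear transform A so that Euclidean
    # distance to class centroids classifies better in the new space.
    # Generic MCE/GPD loss on synthetic data; not the paper's exact method.
    import numpy as np

    rng = np.random.default_rng(0)
    dim, classes, n = 4, 3, 90
    centers = rng.normal(scale=2.0, size=(classes, dim))
    y = rng.integers(0, classes, size=n)
    X = centers[y] + rng.normal(size=(n, dim))
    means = np.stack([X[y == k].mean(axis=0) for k in range(classes)])

    def error_rate(A):
        d = (((X @ A.T)[:, None, :] - (means @ A.T)[None, :, :]) ** 2).sum(axis=2)
        return float((d.argmin(axis=1) != y).mean())

    A, lr, alpha = np.eye(dim), 0.01, 1.0
    before = error_rate(A)
    for _ in range(100):
        grad = np.zeros_like(A)
        for x, k in zip(X, y):
            d = (((A @ x) - (means @ A.T)) ** 2).sum(axis=1)
            rival = min((j for j in range(classes) if j != k), key=lambda j: d[j])
            g = d[k] - d[rival]                   # misclassification measure
            s = 1.0 / (1.0 + np.exp(-alpha * g))  # sigmoid-smoothed 0/1 error
            w = alpha * s * (1.0 - s)
            ek, er = x - means[k], x - means[rival]
            grad += 2.0 * w * (A @ np.outer(ek, ek) - A @ np.outer(er, er))
        A -= lr * grad / n
    print(before, error_rate(A))  # error typically drops after training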
CR ANDREOBRECHT R, 1988, IEEE T ACOUST SPEECH, V36, P29, DOI 10.1109/29.1486 BESAG J, 1974, J ROY STAT SOC B MET, V36, P192 BILBRO GL, 1992, IEEE T NEURAL NETWOR, V3, P131, DOI 10.1109/72.105426 Bridle J. S., 1990, ADV NEURAL INFORMATI, V2, P211 COPPERI M, 1988, P INT C AC SPEECH SI, P143 Dugast C, 1994, IEEE T SPEECH AUDI P, V2, P217, DOI 10.1109/89.260364 FUKUGANA K, 1972, INTRO STAT PATTERN R GEIGER D, 1991, IEEE T PATTERN ANAL, V13, P401, DOI 10.1109/34.134040 GEMAN S, 1984, IEEE T PATTERN ANAL, V6, P721 HUBENER K, 1993, P EUR C SPEECH TECHN, V3, P1763 Jeong C. G., 1994, Neural, Parallel & Scientific Computations, V2 Koza J. R., 1992, GENETIC PROGRAMMING LAKSHMANNA S, 1989, IEEE T PATTERN ANAL, V11, P790 LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 OSTENDORF M, 1989, IEEE T ACOUST SPEECH, V37, P1857, DOI 10.1109/29.45533 Parisi G., 1988, STAT FIELD THEORY Picone J., 1990, IEEE ASSP Magazine, V7, DOI 10.1109/53.54527 RENALS S, 1992, P IEEE INT C AC SPEE, P601, DOI 10.1109/ICASSP.1992.225837 Renals S, 1994, IEEE T SPEECH AUDI P, V2, P161, DOI 10.1109/89.260359 ROBINSON AJ, 1994, IEEE T NEURAL NETWOR, V5, P298, DOI 10.1109/72.279192 SAWAI H, 1991, P INT C AC SPEECH SI, P53, DOI 10.1109/ICASSP.1991.150276 SILVERMAN HF, 1990, IEEE ASSP MAGAZINE, P6 VANHEMERT JP, 1991, IEEE T SIGNAL PROCES, V39, P1008, DOI 10.1109/78.80941 VIDAL E, 1990, SIGNAL PROCESS, V5, P43 WAIBEL A, 1989, IEEE T ACOUST SPEECH, V37, P1888, DOI 10.1109/29.45535 WANG S, 1989, P INT C AC SPEECH SI Young S. J., 1993, P EUROSPEECH, P2203 ZUE V, 1989, P IEEE INT C ACOUSTI, P389 NR 28 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD DEC PY 1996 VL 20 IS 3-4 BP 291 EP 311 DI 10.1016/S0167-6393(96)00064-7 PG 21 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA WE990 UT WOS:A1996WE99000010 ER PT J AU Moore, RK AF Moore, RK TI Special issue on speech under stress SO SPEECH COMMUNICATION LA English DT Editorial Material RP Moore, RK (reprint author), DEF RES AGCY, SPEECH RES UNIT, MALVERN WR14 3PS, WORCS, ENGLAND. NR 0 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1996 VL 20 IS 1-2 BP 1 EP 2 DI 10.1016/S0167-6393(96)90039-4 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400001 ER PT J AU Murray, IR Baber, C South, A AF Murray, IR Baber, C South, A TI Towards a definition and working model of stress and its effects on speech SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA-NATO Workshop on Speech Under Stress CY SEP, 1995 CL LISBON, PORTUGAL SP European Speech Commun Assoc, NATO DE speech; stress; stress modelling; speech under stress AB Discussions at the ESCA-NATO Workshop on Speech Under Stress (Lisbon, Portugal, September 1995) centred on definitions and models of stress and its effects. Based on the Workshop discussions, in this paper we attempt to produce a definition of stress, and propose a number of stress models which clarify issues in this field, and which might be adopted by the speech community. The concept of stress is very broad and used differently in a number of domains - a definition of stress has thus remained elusive. 
Similarly, our understanding of the processes of stress and the ways in which it affects speech is incomplete, and any models used to describe stress are somewhat elementary. Greater separation of stressors (the causes of stress) and strain (the effects of stress) is proposed, and methods for relating stressors to strains are presented. Suggestions for future research directions in this field are also made. C1 UNIV BIRMINGHAM, SCH MFG & MECH ENGN, BIRMINGHAM B15 2TT, W MIDLANDS, ENGLAND. DEF RES AGCY, MAN MACHINE INTEGRAT DEPT, FARNBOROUGH GU14 6TD, HANTS, ENGLAND. RP Murray, IR (reprint author), UNIV DUNDEE, APPL COMP STUDIES DIV, DUNDEE DD1 4HN, SCOTLAND. RI Baber, Chris/A-7412-2014 CR Baber C, 1996, HUM FACTORS, V38, P142, DOI 10.1518/001872096778940840 BERGMAN H, 1985, REC REV ALCOHOLISM, V3, P336 Cox T, 1978, STRESS FOREMAN JB, 1968, COLLINS DICT FRICK RW, 1986, AGGRESSIVE BEHAV, V12, P121, DOI 10.1002/1098-2337(1986)12:2<121::AID-AB2480120206>3.0.CO;2-F HAYRE HS, 1980, APPL ACOUST, V13, P57, DOI 10.1016/0003-682X(80)90043-2 HAYRE HS, 1980, APPL ACOUST, V13, P63, DOI 10.1016/0003-682X(80)90044-4 JOYCE H, 1987, AM MED NEWS AUG, P29 Moore RK, 1996, J ACOUST SOC AM, V99, P1710, DOI 10.1121/1.414694 Roach P, 1992, INTRO PHONETICS Sapolsky R.M., 1994, WHY ZEBRAS DONT GET STORM C, 1987, J PERS SOC PSYCHOL, V53, P805, DOI 10.1037//0022-3514.53.4.805 TRANCOSO I, 1995, P ESCA NATO TUT RES UVAROV EB, 1979, PENGUIN DICT SCI NR 14 TC 32 Z9 32 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1996 VL 20 IS 1-2 BP 3 EP 12 DI 10.1016/S0167-6393(96)00040-4 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400002 ER PT J AU Junqua, JC AF Junqua, JC TI The influence of acoustics on speech production: A noise-induced stress phenomenon known as the Lombard reflex SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA-NATO Workshop on Speech Under Stress CY SEP, 1995 CL LISBON, PORTUGAL SP European Speech Commun Assoc, NATO DE Lombard reflex; stress; noise; speech production; relational invariance; speech perception AB Recently, a number of researchers have reported quantitative results about the acoustic changes between normal and Lombard speech. These results highlighted that the nature of the Lombard reflex is highly speaker-dependent. In this paper, after briefly discussing the influence of acoustics on speech production, we summarize some important characteristics of the Lombard reflex. Then, we review some experimental results showing how the Lombard reflex varies with the speaker's gender, the language and the environment (type of noise). Finally, we briefly discuss the use of relational features as a way to reduce the influence of the Lombard reflex on automatic speech recognizers. RP Junqua, JC (reprint author), PANASON TECHNOL INC, SPEECH TECHNOL LAB, 3888 STATE ST, SUITE 202, SANTA BARBARA, CA 93105 USA. CR ANGLADE Y, 1992, ICSLP, P595 BABER C, 1995, ESCA NATO WORKSH SPE, P37 BOND ZS, 1989, J ACOUST SOC AM, V85, P907, DOI 10.1121/1.397563 CASTELLANOS A, 1995, ESCA NATO WORKSH SPE, P57 CASTELLANOS A, 1993, D26 ROARS U POL VAL CASTELLANOS A, 1992, D26 ROARS U POL VAL Crawford M.
D., 1994, P I ACOUSTICS 5, V16, P183 DREHER JJ, 1957, J ACOUST SOC AM, V29, P1320, DOI 10.1121/1.1908780 EGAN JJ, 1967, THESIS W RESERVE U Fairbanks G, 1954, J SPEECH HEAR DISORD, V19, P133 Fletcher H, 1918, 19412 W EL CO GAY T, 1977, J ACOUST SOC AM, V62, P183, DOI 10.1121/1.381480 HALPHEN E, 1910, THESIS FM PARIS Hansen J. H. L., 1988, THESIS GEORGIA I TEC Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618 HANSEN JHL, 1995, ESCA NATO WORKSH SPE, P91 HANSON BA, 1990, 1990 P IEEE INT C AC, P857 HOWELL P, 1992, ETRW SPEECH PROC NOV, P223 HOWES D, 1957, J ACOUST SOC AM, V29, P296, DOI 10.1121/1.1908862 Junqua J.C., 1992, ICSLP, P811 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 JUNQUA JC, 1990, INT CONF ACOUST SPEE, P841, DOI 10.1109/ICASSP.1990.115969 Junqua J.-C., 1994, ICSLP 94. 1994 International Conference on Spoken Language Processing Kawahara H., 1994, ICSLP 94. 1994 International Conference on Spoken Language Processing LANE H, 1971, J SPEECH HEAR RES, V14, P677 LANE H, 1970, J ACOUST SOC AM, V47, P618, DOI 10.1121/1.1911937 LEE BS, 1950, J ACOUST SOC AM, V22, P824, DOI 10.1121/1.1906696 Lombard E., 1911, ANN MALADIES OREILLE, V37, P101 Mokbel C., 1992, THESIS ECOLE NATL SU NOYES J, 1995, ESCA NATO WORKSH SPE, P17 OLIVIER JF, 1991, STRESS MOTEUR VIE PICK HL, 1989, J ACOUST SOC AM, V85, P894, DOI 10.1121/1.397561 PICKETT JM, 1956, J ACOUST SOC AM, V28, P902, DOI 10.1121/1.1908510 PORT RF, 1981, J ACOUST SOC AM, V69, P262, DOI 10.1121/1.385347 RAMEZ R, 1992, D25 CRINCNRSINRIA RO Rostolland D., 1973, S SPEECH INTELLIGIBI, P293 ROSTOLLAND D, 1975, S INTELLIGIBILITY SP SCHULMAN R, 1985, ARTICULATORY TARGETI SCHULMAN R, 1989, J ACOUST SOC AM, V85, P295, DOI 10.1121/1.397737 Stanton B. J., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), DOI 10.1109/ICASSP.1988.196583 STEVENS KN, 1987, 11TH P ICPHS TALL, P352 Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660 Suzuki T., 1994, ICSLP 94. 1994 International Conference on Spoken Language Processing TAKIZAWA Y, 1990, ICSLP, P293 ZEILIGER J, 1994, 10 JEP LANN JUN, P287 NR 45 TC 40 Z9 40 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1996 VL 20 IS 1-2 BP 13 EP 22 DI 10.1016/S0167-6393(96)00041-6 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400003 ER PT J AU Castellanos, A Benedi, JM Casacuberta, F AF Castellanos, A Benedi, JM Casacuberta, F TI An analysis of general acoustic-phonetic features for Spanish speech produced with the Lombard effect SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA-NATO Workshop on Speech Under Stress CY SEP, 1995 CL LISBON, PORTUGAL SP European Speech Commun Assoc, NATO DE Lombard effect; continuous speech; speech production AB A noisy environment usually degrades the intelligibility of a human speaker or the performance of a speech recognizer. Due to this noise, a phenomenon appears which is caused by the articulatory changes made by speakers in order to be more intelligible in the noisy environment: the Lombard effect. Over the last few years, special emphasis has been placed on analyzing and dealing with the Lombard effect within the framework of Automatic Speech Recognition. 
Thus, the first purpose of the work presented in this paper was to study the possible common tendencies of some acoustic features in different phonetic units for Lombard speech. Another goal was to study the influence of gender on the characterization of the above tendencies. Extensive statistical tests were carried out for each feature and each phonetic unit, using a large Spanish continuous speech corpus. The results reported here confirm the changes produced in Lombard speech with regard to normal speech. Nevertheless, some new tendencies have been observed from the outcome of the statistical tests. C1 UNIV POLITECN, DEPT SISTEMAS INFORMAT & COMPUTAC, VALENCIA, SPAIN. RP Castellanos, A (reprint author), UNIV JAIME 1, DEPT INFORMAT, CAMPUS PENYETA ROJA, CASTELLO DE LA PLANA 12071, SPAIN. RI Benedi, Juana/K-9740-2014 OI Benedi, Juana/0000-0002-3796-639X CR ALINAT P, 1994, FINAL PUBLIC REPORT ALINAT P, 1991, D11 ROARS ANGLADE Y, 1990, SIGNAL PROCESS, V5, P1195 APPLEBAUM TH, 1990, SIGNAL PROCESS, V5, P1183 BENEDI JM, 1992, D13 ROARS BOND ZS, 1989, J ACOUST SOC AM, V85, P907, DOI 10.1121/1.397563 CAIRNS DA, 1992, P INT C SPOK LANG PR, P703 CASACUBERTA F, 1991, WORKSH INT COOP STAN CASTELLANOS A, 1992, D26 ROARS DVORAK S, 1991, P EUROPEAN C SPEECH, V3, P1375 HAJISLAM R, 1992, D25 ROARS HANSEN JHL, 1992, SIGNAL PROCESS, V6, P403 HANSON BA, 1993, 1993 P IEEE INT C AC, V2, P79 JUNQUA JC, 1990, IEEE T ACOUST SPEECH, V2, P841 Junqua J.-C., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266467 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 LANE H, 1971, J SPEECH HEAR RES, V14, P677 Lombard E., 1911, ANN MALADIES OREILLE, V37, P101 MAK B, 1992, IEEE T ACOUST SPEECH, V1, P269 PENA D, 1982, ESTADISTICA MODELOS, V2 PICK HL, 1989, J ACOUST SOC AM, V85, P894, DOI 10.1121/1.397561 QUILLIS A, 1988, FONETICA ACUSTICA LE Rohatgi VK, 1976, INTRO PROBABILITY TH STANTON B, 1989, IEEE T ACOUST SPEECH, V2, P675 STANTON B, 1988, IEEE T ACOUST SPEECH, V1, P331 SUMMER V, 1989, J ACOUST SOC AM, V86, P1717 Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660 WELLS JC, 1989, J INT PHON ASSOC, V1, P31 NR 28 TC 19 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1996 VL 20 IS 1-2 BP 23 EP 35 DI 10.1016/S0167-6393(96)00042-8 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400004 ER PT J AU Baber, C Mellor, B Graham, R Noyes, JM Tunley, C AF Baber, C Mellor, B Graham, R Noyes, JM Tunley, C TI Workload and the use of automatic speech recognition: The effects of time and resource demands SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA-NATO Workshop on Speech Under Stress CY SEP, 1995 CL LISBON, PORTUGAL SP European Speech Commun Assoc, NATO ID COMPATIBILITY; TASKS; MODALITIES; INPUT AB Previous research has indicated that workload can have an adverse effect on the use of speech recognition systems. In this paper, the relationship between workload and speech is discussed, and two studies are reported. In the first study, time-stress is considered. In the second study, dual-task performance is considered. Both studies show workload to significantly reduce recognition accuracy and user performance. The nature of the impairment is shown to differ between individuals and types of workload.
Furthermore, it appears that workload affects the selection of words to use, the articulation of the words and the relationship between speaking to ASR and performing other tasks. It is proposed that speaking to ASR is, in itself, demanding and that as workload increases so the ability to perform the task within the limits required by ASR suffers. C1 DRA, SPEECH RES UNIT, GREAT MALVERN, WORCS, ENGLAND. UNIV LOUGHBOROUGH, HUSAT RES INST, LOUGHBOROUGH LE11 1RG, LEICS, ENGLAND. UNIV BRISTOL, DEPT PSYCHOL, BRISTOL BS8 1TN, AVON, ENGLAND. RP Baber, C (reprint author), UNIV BIRMINGHAM, IND ERGON GRP, SCH MAN & MECH ENG, BIRMINGHAM B15 2TT, W MIDLANDS, ENGLAND. RI Baber, Chris/A-7412-2014 CR ARMSTRONG JW, 1981, NPS5581016 ARMSTRONG JW, 1981, NPS5581017 Baber C, 1996, HUM FACTORS, V38, P142, DOI 10.1518/001872096778940840 BABER C, 1995, P NATO ESCA WORKSH S, P37 BABER C, 1992, INT J MAN MACH STUD, V37, P703, DOI 10.1016/0020-7373(92)90064-R BABER C, 1991, P 11 INT ERG ASS C L, P833 BABER C, 1991, SPEECH TECHNOLOGY CO BABER C, 1990, BEHAV INFORM TECHNOL, V9, P371 BROWN ID, 1961, ERGONOMICS, V4, P35, DOI 10.1080/00140136108930505 Chiles W. D., 1982, HUMAN PERFORMANCE PR COOPER GE, 1969, ASDTR7619 NASA DELL GS, 1986, PSYCHOL REV, V93, P283, DOI 10.1037//0033-295X.93.3.283 DODDINGTON GR, 1981, IEEE SPECTRUM SEP, P26 EGGEMEIER FT, 1991, MULTIPLE TASK PERFOR FITTS PM, 1953, J EXP PSYCHOL, V46, P199, DOI 10.1037/h0062827 FRANKISH C, 1990, HUM FACTORS, V32, P697 GARRETT MF, 1984, NORMALITY PATHOLOGY GRAAHM R, 1993, USER STRESS ASR, P463 Hapeshi K., 1989, Work With Computers: Organizational, Management, Stress and Health Aspects. Proceedings of the Third Conference on Human-Computer Interaction. Vol.1 HARRIS SD, 1978, BEHAV RES METH INSTR, V10, P329 HART S, 1990, MANPRINT EMERGING TE HECKER MHL, 1968, J ACOUST SOC AM, V44, P993, DOI 10.1121/1.1911241 HITCH GJ, 1978, COGNITIVE PSYCHOL, V10, P302, DOI 10.1016/0010-0285(78)90002-6 KLAPP ST, 1981, MEM COGNITION, V9, P398, DOI 10.3758/BF03197565 LINDE C, 1988, P HUM FACT SOC 32 AN, P237 MACKAY DG, 1982, PSYCHOL REV, V89, P483, DOI 10.1037/0033-295X.89.5.483 MCCAULEY ME, 1982, 81C00551 NAVTRAEQUIP MELLOR B, 1995, P ESCA WORKSH SPOK D, P117 Meshkati N., 1990, EVALUATION HUMAN WOR MOORE TJ, 1989, AVIATION PSYCHOL *NAT RES COUNC COM, 1984, AUT SPEECH REC SEV E NORMAN DA, 1975, COGNITIVE PSYCHOL, V7, P44, DOI 10.1016/0010-0285(75)90004-3 SENDERS AF, 1983, ACTA PSYCHOL, V53, P61 SIMPSON CA, 1986, ERGONOMICS, V29, P1343, DOI 10.1080/00140138608967250 SIMPSON CA, 1985, HUM FACTORS, V27, P115 SPERANDI.JC, 1971, ERGONOMICS, V14, P571, DOI 10.1080/00140137108931277 TAYLOR RM, 1989, STRUCTURE MULTIMODAL Wickens C. D., 1992, ENG PSYCHOL HUMAN PE, V2nd WICKENS CD, 1991, TIME PRESSURE STRESS WICKENS CD, 1983, HUM FACTORS, V25, P227 Wickens CD, 1980, ATTENTION PERFORMANC, VVIII WICKENS CD, 1988, HUM FACTORS, V30, P599 WICKENS CD, 1984, HUM FACTORS, V26, P533 WIERWILLE WW, 1985, HUM FACTORS, V27, P489 NR 44 TC 30 Z9 30 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
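The Baber et al. record above reports that workload significantly reduces recognition accuracy. A standard way to score such accuracy is word-level edit distance; the generic scorer below is an illustration, not the paper's own procedure, and the two transcripts are invented.

    # Minimal sketch: word error rate via Levenshtein distance, a standard
    # recognition-accuracy score; not the Baber et al. scoring procedure.
    def word_error_rate(ref, hyp):
        r, h = ref.split(), hyp.split()
        d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
        for i in range(len(r) + 1):
            d[i][0] = i                    # deletions
        for j in range(len(h) + 1):
            d[0][j] = j                    # insertions
        for i in range(1, len(r) + 1):
            for j in range(1, len(h) + 1):
                d[i][j] = min(d[i - 1][j - 1] + (r[i - 1] != h[j - 1]),
                              d[i - 1][j] + 1, d[i][j - 1] + 1)
        return d[len(r)][len(h)] / len(r)

    # Invented reference/hypothesis pair for illustration.
    print(word_error_rate("move cursor up two lines",
                          "move cursor up to lines"))  # 0.2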
PD NOV PY 1996 VL 20 IS 1-2 BP 37 EP 53 DI 10.1016/S0167-6393(96)00043-X PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400005 ER PT J AU Whitmore, J Fisher, S AF Whitmore, J Fisher, S TI Speech during sustained operations SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA-NATO Workshop on Speech Under Stress CY SEP, 1995 CL LISBON, PORTUGAL SP European Speech Commun Assoc, NATO DE fatigue; sustained operations; voice; sleep deprivation; speech ID ALCOHOL-INTOXICATION; SLEEP LOSS; PERFORMANCE; FATIGUE AB Research was conducted to determine whether alterations in the acoustical characteristics of voice occur over periods of sustained operations. Twelve male United States Air Force B-1B bomber aircrewmen participated in the study. The participants served in crews of four and performed three 36-hour experimental periods (missions) in a high-fidelity simulator. The missions were interspersed with 36-hour rest breaks. Data were lost from two members of the third team due to a communication malfunction. Speech, cognitive and subjective fatigue data were collected approximately every three hours for 11 trials per mission. Fundamental frequency and word duration were both found to vary significantly over trials (fundamental frequency F(10,90) = 2.63, p = 0.0076; word duration F(10,90) = 2.5, p = 0.0106). Speech duration results also showed a significant main effect of mission (F(2,18) = 6.91, p = 0.0082). The speech data follow the same trend as the data from the cognitive tests and subjective measures. A strong diurnal pattern is reflected in nearly all of the dependent measures. Overall, the results support the proposition that voice may be a valid indicator of a speaker's fatigue state. C1 ST MARYS UNIV, SAN ANTONIO, TX 78284 USA. RP Whitmore, J (reprint author), ARMSTRONG LAB, 2504 GILLINGHAM DR ST 25, BROOKS AFB, TX 78235 USA. CR *AGARD NATO ADV GR, 1989, 308 AGARD NATO ADV G ANGUS RG, 1985, BEHAV RES METH INS C, V17, P55, DOI 10.3758/BF03200897 BRENNER M, 1994, AVIAT SPACE ENVIR MD, V65, P21 BRENNER M, 1991, AVIAT SPACE ENVIR MD, V62, P893 CANNINGS R, 1979, SURVEY METHODS ASSES, P115 FOWLER CA, 1987, J MEM LANG, V26, P489, DOI 10.1016/0749-596X(87)90136-7 FRENCH J, 1990, ANN REV CHRONOPHARMA, V7, P45 GRIFFIN GR, 1987, AVIAT SPACE ENVIR MD, V58, P1165 Hockey G., 1986, HDB PERCEPTION HUMAN, p44/41 JONES WA, 1990, THESIS TEXAS A M U KLINGHOLZ F, 1988, J ACOUST SOC AM, V84, P929, DOI 10.1121/1.396661 KRUEGER GP, 1989, WORK STRESS, V3, P129, DOI 10.1080/02678378908256939 KURODA I, 1976, AVIAT SPACE ENVIR MD, V47, P528 MAYER DL, 1994, HUM FAC ERG SOC P, P124 National Transportation Safety Board, 1990, NTSBMAR9004 PEARSON RG, 1956, 56115 SCH AV MED PISONI DB, 1990, AAMRLSR90507 RUIZ RR, 1990, AVIATION SPACE ENV M, V3, P266 SAITO I, 1980, AVIAT SPACE ENVIR MD, V51, P402 SANDER EK, 1983, ANN OTO RHINOL LARYN, V92, P141 SCHLEGEL RE, 1993, ALTR19920145 SOBELL LC, 1982, FOLIA PHONIATR, V34, P316 SOBELL LC, 1972, J SPEECH HEAR RES, V15, P852 WILLIAMS CE, 1969, AEROSPACE MED, V40, P1369 WOODWARD DP, 1974, ACR206 OFF NAV RES NR 25 TC 29 Z9 29 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
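The Whitmore and Fisher record above tracks fundamental frequency across sustained-operations trials. The record does not say how F0 was extracted, so the sketch below shows a generic autocorrelation-based estimate on a synthetic voiced frame, purely to illustrate the measurement.

    # Minimal sketch: autocorrelation F0 estimate for one voiced frame.
    # Generic method on a synthetic 120 Hz tone; the Whitmore & Fisher
    # record does not specify its extraction procedure.
    import numpy as np

    fs = 16000
    t = np.arange(int(0.04 * fs)) / fs            # one 40 ms frame
    frame = np.sin(2 * np.pi * 120 * t)
    frame += 0.1 * np.random.default_rng(1).normal(size=t.size)

    ac = np.correlate(frame, frame, mode="full")[frame.size - 1:]
    lo, hi = int(fs / 400), int(fs / 60)          # search the 60-400 Hz range
    lag = lo + int(np.argmax(ac[lo:hi]))
    print(f"estimated F0: {fs / lag:.1f} Hz")     # close to 120 Hz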
PD NOV PY 1996 VL 20 IS 1-2 BP 55 EP 70 DI 10.1016/S0167-6393(96)00044-1 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400006 ER PT J AU Bard, EG Sotillo, C Anderson, AH Thompson, HS Taylor, MM AF Bard, EG Sotillo, C Anderson, AH Thompson, HS Taylor, MM TI The DCIEM Map Task Corpus: Spontaneous dialogue under sleep deprivation and drug treatment SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA-NATO Workshop on Speech Under Stress CY SEP, 1995 CL LISBON, PORTUGAL SP European Speech Commun Assoc, NATO DE dialogue; spontaneous speech; sleep-deprivation; stress; map task; modafinil; amphetamine ID CONTEXTS AB This paper describes a resource designed for the general study of spontaneous speech under the stress of sleep deprivation. It is a corpus of 216 unscripted task-oriented dialogues produced by normal adults in the course of a major sleep deprivation study. The study itself examined continuous task performance through baseline, sleepless and recovery periods by groups treated with placebo or one of two drugs (Modafinil, d-amphetamine) reputed to counter the effects of sleep deprivation. The dialogues were all produced while carrying out the route communication task used in the HCRC Map Task Corpus. Pairs of talkers collaborated to reproduce on one partner's schematic map a route preprinted on the other's. Controlled differences between the maps and use of labelled imaginary locations limit genre, vocabulary and effects of real-world knowledge. The designs for the construction of maps and the allocation of subjects to maps make the corpus a controlled elicitation experiment. Each talker participated in 12 dialogues over the course of the study. Preliminary examinations of dialogue length and task performance measures indicate effects of drug treatment, sleep deprivation and number of conversational partners. The corpus is available to researchers interested in all levels of speech and dialogue analysis, in both normal and stressed conditions. C1 UNIV GLASGOW, DEPT PSYCHOL, HUMAN COMMUN RES CTR, GLASGOW, LANARK, SCOTLAND. DEF & CIVIL INST ENVIRONM MED, TORONTO, ON M3M 3B9, CANADA. RP Bard, EG (reprint author), UNIV EDINBURGH, HUMAN COMMUN RES CTR, 2 BUCCLEUCH PL, EDINBURGH EH8 9LL, MIDLOTHIAN, SCOTLAND. CR ANDERSON AH, 1994, LANG COGNITIVE PROC, V9, P101, DOI 10.1080/01690969408402111 ANDERSON AH, IN PRESS CONVERSATIO ANDERSON AH, 1994, J CHILD LANG, V21, P439 ANDERSON AH, 1991, LANG SPEECH, V34, P351 BARANSKI J, 1995, P 37 ANN C MIL TEST BOYLE EA, 1994, LANG SPEECH, V37, P1 BROWN G, 1984, TEACHING TALK Buguet A, 1995, J SLEEP RES, V4, P229 CARLETTA JC, IN PRESS J PRAGMATIC FOWLER CA, 1988, LANG SPEECH, V31, P307 GARROD S, 1994, COGNITION, V53, P181, DOI 10.1016/0010-0277(94)90048-5 KILBORN K, 1995, BRAIN LANG, V48, P120, DOI 10.1006/brln.1995.1005 LYONS TJ, 1991, AVIAT SPACE ENVIR MD, V62, P432 Pigeau R, 1995, J SLEEP RES, V4, P212 PIGEAU R, 1995, P 35 ANN C MIL TEST SHADBOLT N, 1984, THESIS U EDINBURGH TAYLOR MM, 1996, P 37 ANN C MIL TEST WHITMORE J, 1995, P ESCA NATO WORKSH S NR 18 TC 24 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
PD NOV PY 1996 VL 20 IS 1-2 BP 71 EP 84 DI 10.1016/S0167-6393(96)00045-3 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400007 ER PT J AU Murray, IR Arnott, JL Rohwer, EA AF Murray, IR Arnott, JL Rohwer, EA TI Emotional stress in synthetic speech: Progress and future directions SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA-NATO Workshop on Speech Under Stress CY SEP, 1995 CL LISBON, PORTUGAL SP European Speech Commun Assoc, NATO DE synthetic speech; speech; emotion; stress ID EXPRESSION AB Current text-to-speech systems have very good intelligibility, but most are still easily identified as artificial voices and no commercial system incorporates prosodic variation resulting from emotion and related factors. This is largely due to the complexity of identifying and categorising the emotion factors in natural human speech, and implementing these factors within synthetic speech. However, prosodic content in synthetic speech is seen as increasingly important, and there is presently renewed interest in the investigation of human vocal emotion and the expansion of synthesis models to allow greater prosodic variation. Such models could also be used as practical tools in the investigation and validation of models of emotion and other speech-altering stressors. This paper reviews progress to date in the investigation of human vocal emotions and their simulation in synthetic speech, and requirements for the future research needed to develop this area are also presented. RP Murray, IR (reprint author), UNIV DUNDEE, APPL COMP STUDIES DIV, DUNDEE DD1 4HN, SCOTLAND. CR ABADJIEVA E, 1993, P EUROSPEECH 93 BERL, P909 ARNFIELD S, 1995, P ESCA NATO TUT RES, P13 CAHN JE, 1990, GENERATING EXPRESSIO COWLEY CK, 1993, INTERACTIVE SPEECH T Davitz Joel Robert, 1964, COMMUNICATION EMOTIO Eskenazi M., 1993, P EUR 93 BERL, P501 FRICK RW, 1986, AGGRESSIVE BEHAV, V12, P121, DOI 10.1002/1098-2337(1986)12:2<121::AID-AB2480120206>3.0.CO;2-F Hara F., 1992, Proceedings. IEEE International Workshop on Robot and Human Communication (Cat. No.92TH0469-7), DOI 10.1109/ROMAN.1992.253933 MAGNENATTHALMAN.N, 1992, P 6 INT WORKSH ASP A, P1 MOORE RK, 1994, P I ACOUST, V16, P1 MURRAY IR, 1995, SPEECH COMMUN, V16, P369, DOI 10.1016/0167-6393(95)00005-9 Murray IR, 1996, SPEECH COMMUN, V20, P3, DOI 10.1016/S0167-6393(96)00040-4 MURRAY IR, 1989, THESIS U DUNDEE MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 ORTONY A, 1990, PSYCHOL REV, V97, P315, DOI 10.1037//0033-295X.97.3.315 SCHERER KR, 1986, PSYCHOL BULL, V99, P143, DOI 10.1037//0033-2909.99.2.143 SCHERER KR, 1984, J ACOUST SOC AM, V76, P1346, DOI 10.1121/1.391450 SCHERZ JW, 1995, AUGMENTATIVE ALTERNA, V11, P74, DOI 10.1080/07434619512331277159 SCHLOSBERG H, 1954, PSYCHOL REV, V61, P81, DOI 10.1037/h0054570 STORM C, 1987, J PERS SOC PSYCHOL, V53, P805, DOI 10.1037//0022-3514.53.4.805 Tatham M., 1992, P I ACOUST, V14, P447 VANBEZOOIJEN R, 1983, J CROSS CULT PSYCHOL, V14, P387, DOI 10.1177/0022002183014004001 VROOMEN J, 1993, P 3 EUR C SPEECH COM, P577 NR 23 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
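The Murray et al. record above surveys rule-based simulation of vocal emotion in synthesizers. The sketch below illustrates the general shape of such rules only: per-emotion offsets applied to baseline prosody settings. The emotions, parameter names and numbers are invented placeholders, not values from any system the review covers.

    # Minimal sketch of rule-based emotional prosody: per-emotion offsets
    # applied to baseline synthesis settings. All names and numbers are
    # invented placeholders, not values from the systems reviewed above.
    BASELINE = {"f0_mean_hz": 120.0, "f0_range_hz": 40.0, "rate_wpm": 160.0}
    RULES = {
        "anger": {"f0_mean_hz": +20.0, "f0_range_hz": +30.0, "rate_wpm": +20.0},
        "sadness": {"f0_mean_hz": -15.0, "f0_range_hz": -20.0, "rate_wpm": -30.0},
    }

    def prosody_for(emotion):
        offsets = RULES.get(emotion, {})
        return {k: v + offsets.get(k, 0.0) for k, v in BASELINE.items()}

    print(prosody_for("anger"))    # raised, wider, faster
    print(prosody_for("sadness"))  # lowered, narrower, slower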
PD NOV PY 1996 VL 20 IS 1-2 BP 85 EP 91 DI 10.1016/S0167-6393(96)00046-5 PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400008 ER PT J AU BouGhazale, SE Hansen, JHL AF BouGhazale, SE Hansen, JHL TI Generating stressed speech from neutral speech using a modified CELP vocoder SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA-NATO Workshop on Speech Under Stress CY SEP, 1995 CL LISBON, PORTUGAL SP European Speech Commun Assoc, NATO ID SYNTHETIC SPEECH; EMOTION AB The problem of speech modeling for generating stressed speech using a source generator framework is addressed in this paper. In general, stress in this context refers to emotional or task induced speaking conditions. Throughout this particular study, the focus will be limited to speech under angry, loud and Lombard effect (i.e., speech produced in noise) speaking conditions. Source generator theory was originally developed for equalization of speech under stress for robust recognition (Hansen, 1993, 1994). It was later used for simulated stressed training token generation for improved recognition (Bou-Ghazale, 1993; Bou-Ghazale and Hansen, 1994). The objective here is to generate stressed perturbed speech from neutral speech using a source generator framework previously employed for stressed speech recognition. The approach is based on (i) developing a mathematical model that provides a means for representing the change in speech production under stressed conditions for perturbation, and (ii) employing this framework in an isolated word speech processing system to produce emotional/stressed perturbed speech from neutral speech. A stress perturbation algorithm is formulated based on a CELP (code-excited linear prediction) speech synthesis structure. The algorithm is evaluated using four different speech feature perturbation sets. The stressed speech parameter evaluations from this study revealed that pitch is capable of reflecting the emotional state of the speaker, while formant information alone is not as good a correlate of stress. However, the combination of formant location, pitch and gain information proved to be the most reliable indicator of emotional stress under a CELP speech model. Results from formal listener evaluations of the generated stressed speech show successful classification rates of 87% for angry speech, 75% for Lombard effect speech and 92% for loud speech. C1 DUKE UNIV, DEPT ELECT & COMP ENGN, ROBUST SPEECH PROC LAB, DURHAM, NC 27708 USA. CR ATAL BS, 1984, P IEEE INT C COMM BOUGHAZALE SE, 1993, THESIS DUKE U DURHAM BOUGHAZALE SE, 1994, INT CONF ACOUST SPEE, P413 CAHN JE, 1990, J AM VOICE I O SOC, P1 CAIRNS DA, 1994, J ACOUST SOC AM, V96, P3392, DOI 10.1121/1.410601 Campbell J. P. Jr., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266532 Crosmer J. R., 1985, ICASSP 85. Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No. 85CH2118-8) Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618 HANSEN JHL, 1988, THESIS SCH ELECT ENG HANSEN JHL, 1987, P ACOUST SOC AM MIAM Hansen J. H. L., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319239 Kroon P., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. 
No.88CH2561-9), DOI 10.1109/ICASSP.1988.196535 LIEBERMAN P, 1962, J ACOUST SOC AM, V34, P922, DOI 10.1121/1.1918222 LIPPMANN RP, 1987, 1987 P IEEE INT C AC, P705 Lombard E., 1911, ANN MALADIES OREILLE, V37, P101 MURRAY IR, 1995, SPEECH COMMUN, V16, P369, DOI 10.1016/0167-6393(95)00005-9 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 Pellom BL, 1996, INT CONF ACOUST SPEE, P645, DOI 10.1109/ICASSP.1996.543203 QUANCKENBUSH SR, 1988, OBJECTIVE MEASURE SP Scherer KR, 1981, SPEECH EVALUATION PS STREETER LA, 1983, J ACOUST SOC AM, V73, P1354, DOI 10.1121/1.389239 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 NR 22 TC 10 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1996 VL 20 IS 1-2 BP 93 EP 110 DI 10.1016/S0167-6393(96)00047-7 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400009 ER PT J AU Ruiz, R Absil, E Harmegnies, B Legros, C Poch, D AF Ruiz, R Absil, E Harmegnies, B Legros, C Poch, D TI Time- and spectrum-related variabilities in stressed speech under laboratory and real conditions SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA-NATO Workshop on Speech Under Stress CY SEP, 1995 CL LISBON, PORTUGAL SP European Speech Commun Assoc, NATO DE voice analysis; pitch; spectrum; stress; emotion ID VOICE; STATE AB Stress induced by various types of situation leads to vocal signal modifications. Previous studies have indicated that stressed speech is associated with a higher fundamental frequency and noticeable changes in vowel spectrum. This paper presents pitch- and spectrum-based analyses of stressed speech corpora drawn from both artificial and real situations. The laboratory corpus is obtained by means of the Stroop test; the real-case corpus is extracted from the Cockpit Voice Recording of a crashed aeroplane. Analyses relative to pitch are presented and an index of microprosodic variation, mu, is introduced. Spectrum-related indicators of stress are derived from a cumulative histogram of sound level and from statistical analyses of formant frequencies. Distances to the F1-F2-F3 centre are also investigated. All these variations, throughout the two different situations, show the direct link between some new vocal parameters and the appearance of stress. The results confirm the validity of laboratory experiments on stress, but emphasize quantitative as well as qualitative differences between the situations and the speakers involved. C1 UNIV MONS, LAB PHONET, B-7000 MONS, BELGIUM. UNIV AUTONOMA BARCELONA, LAB FONET, BELLATERRA 08193, SPAIN. RP Ruiz, R (reprint author), UNIV TOULOUSE 2, ACOUST LAB, 5 ALLEES ANTONIO MACHADO, F-31058 TOULOUSE 1, FRANCE.
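The Ruiz et al. record above examines distances to the F1-F2-F3 centre as a spectral indicator of stress. The sketch below assumes the centre is the per-speaker mean formant vector and the distance is Euclidean; the paper may define both differently, and the formant values are placeholders.

    # Minimal sketch: distance of vowel tokens to the F1-F2-F3 centre,
    # assuming the centre is the mean formant vector and the distance is
    # Euclidean (the Ruiz et al. definitions may differ). Values are
    # illustrative placeholders in Hz.
    import numpy as np

    tokens = np.array([      # rows: tokens; columns: F1, F2, F3
        [520, 1450, 2500],
        [610, 1300, 2450],
        [480, 1600, 2600],
        [700, 1200, 2400],
    ], dtype=float)

    centre = tokens.mean(axis=0)
    dists = np.linalg.norm(tokens - centre, axis=1)
    print(centre, dists)     # a wider spread may accompany stressed speech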
CR ALLEN MT, 1989, PSYCHOPHYSIOLOGY, V26, P603, DOI 10.1111/j.1469-8986.1989.tb00718.x Benson P., 1995, P ESCA NATO TUT RES, P61 Black JW, 1951, SPEECH MONOGR, V18, P74 BRENNER M, 1994, AVIAT SPACE ENVIR MD, V65, P21 DOHERTY ET, 1978, J PHONETICS, V6, P1 Fairbanks G, 1941, SPEECH MONOGR, V8, P85 GRAMATICA B, 1992, J PHYS IV, V2, P335, DOI 10.1051/jp4:1992172 GRIFFIN GR, 1987, AVIAT SPACE ENV MED, V58, P1167 HARGREAVES WA, 1964, LANG SPEECH, V7, P84 HARMEGNIES B, 1992, P ESCA CANNESMANDELI, P231 HECKER MHL, 1968, J ACOUST SOC AM, V44, P993, DOI 10.1121/1.1911241 HELFRICH H, 1984, SPEECH COMMUN, V3, P245, DOI 10.1016/0167-6393(84)90019-0 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 JUNQUA JC, 1992, P ESCA WORKSH SPEECH, P43 LANE H, 1971, J SPEECH HEAR RES, V14, P679 Lombard E., 1911, ANN MALADIES OREILLE, V37, P101 MOSKO JD, 1983, 1300 NAV AER MED RES NEWMAN SS, 1938, AM J PSYCHIAT, V94, P912 OSTWALD PF, 1965, SCI AM, V212, P82 Pedhazur E. J., 1991, MEASUREMENT DESIGN A PROTOPAPAS A, 1995, P ESCA NATO TUT RES, P1 RUBENSTE.L, 1966, BEHAV RES THER, V4, P135, DOI 10.1016/0005-7967(66)90053-2 Ruiz R., 1994, Acta Acustica, V2 RUIZ R, 1990, AVIAT SPACE ENVIR MD, V61, P266 RUIZ R, 1995, P 15 INT C AC TRONDH, V3, P141 Scherer K.R., 1981, SPEECH EVALUATION PS, P189 SIMONOV PV, 1973, AEROSPACE MED, V44, P256 SIMONOV PV, 1977, AVIAT SPACE ENV MED, V46, P1014 Stroop JR, 1935, J EXP PSYCHOL, V18, P643, DOI 10.1037/0096-3445.121.1.15 SULC J, 1985, ACTA NEUROBIOLOGICAL, V46, P347 THOMPSON J, 1995, P ESCA NATO TUT RES, P21 WEBSTER JC, 1962, J ACOUST SOC AM, V34, P936, DOI 10.1121/1.1918224 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 WILLIAMS CW, 1981, SPEECH EVALUATION PS NR 34 TC 24 Z9 24 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1996 VL 20 IS 1-2 BP 111 EP 129 DI 10.1016/S0167-6393(96)00048-9 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400010 ER PT J AU Womack, BD Hansen, JHL AF Womack, BD Hansen, JHL TI Classification of speech under stress using target driven features SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA-NATO Workshop on Speech Under Stress CY SEP, 1995 CL LISBON, PORTUGAL SP European Speech Commun Assoc, NATO ID RECOGNITION; NOISE AB Speech production variations due to perceptually induced stress contribute significantly to reduced speech processing performance. One approach for assessment of production variations due to stress is to formulate an objective classification of speaker stress based upon the acoustic speech signal. This study proposes an algorithm for estimation of the probability of perceptually induced stress. It is suggested that the resulting stress score could be integrated into robust speech processing algorithms to improve robustness in adverse conditions. First, results from a previous stress classification study are employed to motivate selection of a targeted set of speech features on a per phoneme and stress group level. Analysis of articulatory, excitation and cepstral based features is conducted using a previously established stressed speech database (Speech Under Simulated and Actual Stress (SUSAS)). Stress sensitive targeted feature sets are then selected across ten stress conditions (including Apache helicopter cockpit, Angry, Clear, Lombard effect, Loud, etc.) 
and incorporated into a new targeted neural network stress classifier. Second, the targeted feature stress classification system is then evaluated and shown to achieve closed speaker, open token classification rates of 91.0%. Finally, the proposed stress classification algorithm is incorporated into a stress directed speech recognition system, where separate hidden Markov model recognizers are trained for each stress condition. An improvement of +10.1% and +15.4% over conventionally trained neutral and multi-style trained recognizers is demonstrated using the new stress directed recognition approach. C1 DUKE UNIV, DEPT ELECT ENGN, ROBUST SPEECH PROC LAB, DURHAM, NC 27708 USA. CR ARSLAN LM, 1994, INT CONF ACOUST SPEE, P45 BOUGHAZALE SE, 1995, P EUROSPEECH, P455 CAIRNS DA, 1994, J ACOUST SOC AM, V96, P3392, DOI 10.1121/1.410601 CLARY GJ, 1992, INT C SPOK LANG PROC, P13 FISHER W, 1986, P DARPA SPEECH RECOG HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P415, DOI 10.1109/89.466654 Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618 HANSEN JHL, 1988, THESIS GEORIGA I TEC HANSEN JHL, 1995, SPEECH COMMUN, V16, P391, DOI 10.1016/0167-6393(95)00007-B HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P407, DOI 10.1109/89.466655 Hansen JHL, 1996, IEEE T SPEECH AUDI P, V4, P307, DOI 10.1109/89.506935 HANSEN JHL, 1995, ESCA NATO WORKSH SPE, P91 Hansen J. H. L., 1994, ICSLP 94. 1994 International Conference on Spoken Language Processing Hansen J. H. L., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319239 HANSEN JHL, 1995, UNPUB J ACOUST SOC A HANSON BA, 1990, 1990 P IEEE INT C AC, P857 JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 KOBAYASHI T, 1991, INT CONF ACOUST SPEE, P489, DOI 10.1109/ICASSP.1991.150383 LEE HS, 1995, SPEECH COMMUN, V17, P59, DOI 10.1016/0167-6393(95)00018-J LIEBERMAN P, 1962, J ACOUST SOC AM, V34, P922, DOI 10.1121/1.1918222 LIPPMANN RP, 1987, 1987 P IEEE INT C AC, P705 Lombard E., 1911, ANN MALADIES OREILLE, V37, P101 Minai A. A., 1990, IJCNN International Joint Conference on Neural Networks (Cat. No.90CH2879-5), DOI 10.1109/IJCNN.1990.137634 MRAYATI M, 1988, SPEECH COMMUN, V7, P257, DOI 10.1016/0167-6393(88)90073-8 PELLOM BL, 1996, INT C AC SPEECH SIGN Richard H.B., 1995, P EUROSPEECH, P761 SIMONOV PV, 1977, AVIATION SPACE ENV S, V1, P23 SOONG FK, 1988, IEEE T ACOUST SPEECH, V36, P871, DOI 10.1109/29.1598 Stanton B. J., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266517 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 Womack BD, 1996, INT CONF ACOUST SPEE, P53, DOI 10.1109/ICASSP.1996.540288 WOMACK BD, 1995, P EUROSPEECH, P1999 NR 32 TC 23 Z9 25 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
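The Womack and Hansen record above routes incoming speech to per-stress-condition HMM recognizers according to a stress classifier's output. The sketch below shows only that routing step, with stand-in classifier and recognizers; the names and scores are hypothetical and make no claim about the paper's actual models.

    # Minimal sketch of stress-directed recognition routing: score the
    # stress condition, then decode with the matching recognizer. The
    # classifier and recognizers are hypothetical stand-ins.
    from typing import Callable, Dict, Sequence

    def route_and_decode(features: Sequence[float],
                         stress_scores: Callable[[Sequence[float]], Dict[str, float]],
                         recognizers: Dict[str, Callable[[Sequence[float]], str]]) -> str:
        scores = stress_scores(features)          # e.g. classifier posteriors
        condition = max(scores, key=scores.get)   # most probable stress class
        return recognizers[condition](features)   # condition-matched decoder

    fake_scores = lambda f: {"neutral": 0.2, "lombard": 0.7, "loud": 0.1}
    recognizers = {c: (lambda f, c=c: f"decoded with {c} models")
                   for c in ("neutral", "lombard", "loud")}
    print(route_and_decode([0.0], fake_scores, recognizers))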
PD NOV PY 1996 VL 20 IS 1-2 BP 131 EP 150 DI 10.1016/S0167-6393(96)00049-0 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400011 ER PT J AU Hansen, JHL AF Hansen, JHL TI Analysis and compensation of speech under stress and noise for environmental robustness in speech recognition SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA-NATO Workshop on Speech Under Stress CY SEP, 1995 CL LISBON, PORTUGAL SP European Speech Commun Assoc, NATO DE speech under stress; Lombard effect; robust speech recognition; noise suppression ID WORD RECOGNITION; ENHANCEMENT; CARS AB It is well known that the introduction of acoustic background distortion and the variability resulting from environmentally induced stress causes speech recognition algorithms to fail. In this paper, several causes for recognition performance degradation are explored. It is suggested that recent studies based on a Source Generator Framework can provide a viable foundation in which to establish robust speech recognition techniques. This research encompasses three inter-related issues: (i) analysis and modeling of speech characteristics brought on by workload task stress, speaker emotion/stress or speech produced in noise (Lombard effect), (ii) adaptive signal processing methods tailored to speech enhancement and stress equalization, and (iii) formulation of new recognition algorithms which are robust in adverse environments. An overview of a statistical analysis of a Speech Under Simulated and Actual Stress (SUSAS) database is presented. This study was conducted on over 200 parameters in the domains of pitch, duration, intensity, glottal source and vocal tract spectral variations. These studies motivate the development of a speech modeling approach entitled Source Generator Framework in which to represent the dynamics of speech under stress. This framework provides an attractive means for performing feature equalization of speech under stress. In the second half of this paper, three novel approaches for signal enhancement and stress equalization are considered to address the issue of recognition under noisy stressful conditions. The first method employs (Auto:I,LSP:T) constrained iterative speech enhancement to address background noise and maximum likelihood stress equalization across formant location and bandwidth. The second method uses a feature enhancing artificial neural network which transforms the input stressed speech feature set during parameterization for keyword recognition. The final method employs morphological constrained feature enhancement to address noise and an adaptive Mel-cepstral compensation algorithm to equalize the impact of stress. Recognition performance is demonstrated for speech under a range of stress conditions, signal-to-noise ratios and background noise types. RP Hansen, JHL (reprint author), DUKE UNIV, DEPT ELECT ENGN, ROBUST SPEECH PROC LAB, BOX 90291, DURHAM, NC 27708 USA. CR ALEXANDRE P, 1993, SPEECH COMMUN, V12, P277, DOI 10.1016/0167-6393(93)90099-7 Bond Z.S., 1990, INT C SPEECH LANG PR, P969 BOUGHAZALE SE, 1995, ECSA NATO P SPEECH S, P45 BOUGHAZALE SE, 1994, INT CONF ACOUST SPEE, P413 CAIRNS DA, 1994, J ACOUST SOC AM, V96, P3392, DOI 10.1121/1.410601 Carlson B. A., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. 
No.92CH3103-9), DOI 10.1109/ICASSP.1992.225928 CHEN YN, 1988, IEEE T ACOUST SPEECH, V36, P433, DOI 10.1109/29.1547 CLARY GJ, 1992, INT C SPOK LANG PROC, P13 CUMMINGS KE, 1990, INT CONF ACOUST SPEE, P369, DOI 10.1109/ICASSP.1990.115687 Darby Jr JK, 1981, SPEECH EVALUATION PS DAUTRICH BA, 1983, IEEE T ACOUST SPEECH, V31, P793, DOI 10.1109/TASSP.1983.1164172 EPHRAIM Y, 1992, P IEEE, V80, P1526, DOI 10.1109/5.168664 FLACK M, 1918, FLYING STRESS GALES MJF, 1995, COMPUT SPEECH LANG, V9, P289, DOI 10.1006/csla.1995.0014 GARDNER MB, 1966, J ACOUST SOC AM, V40, P955, DOI 10.1121/1.1910220 Goldberger L, 1982, HDB STRESS THEORETIC HANLEY CN, 1965, J SPEECH HEAR DISORD, V30, P274 Hansen J. H. L., 1990, ICSLP 90, P1125 Hansen JHL, 1994, IEEE T SPEECH AUDI P, V2, P598, DOI 10.1109/89.326618 HANSEN JHL, 1991, IEEE T SIGNAL PROCES, V39, P795, DOI 10.1109/78.80901 HANSEN JHL, 1988, THESIS GEORIGA I TEC HANSEN JHL, 1991, INT CONF ACOUST SPEE, P901, DOI 10.1109/ICASSP.1991.150485 HANSEN JHL, 1995, SPEECH COMMUN, V16, P391, DOI 10.1016/0167-6393(95)00007-B HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P407, DOI 10.1109/89.466655 Hansen JHL, 1996, IEEE T SPEECH AUDI P, V4, P307, DOI 10.1109/89.506935 HANSEN JHL, 1995, RSPL9531 DUK U DEP E HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P169, DOI 10.1109/89.388143 Hansen JHL, 1989, IEEE P 15 NE BIOENG, P31 HANSEN JHL, 1995, J ACOUST SOC AM, V97, P3833, DOI 10.1121/1.413108 Hansen J. H. L., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319239 HANSEN JHL, 1987, P ACOUST SOC AM S, V82, pS17 HANSEN JHL, 1992, SIGNAL PROCESS, V6, P403 HANSEN JHL, 1995, IEEE T SPEECH AUDI P, V3, P98, DOI 10.1109/89.365376 HANSEN JHL, 1989, 1989 P IEEE INT C AC, P266 HANSEN JHL, 1995, UNPUB J ACOUST SOC A HANSON BA, 1993, 1993 P IEEE INT C AC, V2, P79 HANSON BA, 1990, 1990 P IEEE INT C AC, P857 HECKER MHL, 1968, J ACOUST SOC AM, V44, P993, DOI 10.1121/1.1911241 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 Hermansky H., 1993, ICASSP-93. 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319236 HICKS JW, 1981, 1981 CARN C CRIM COU, P189 Huang X.D., 1990, HIDDEN MARKOV MODELS Hunt M. J., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266415 Juang B. H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E JUNQUA JC, 1993, J ACOUST SOC AM, V93, P510, DOI 10.1121/1.405631 KURODA I, 1976, AVIAT SPACE ENVIR MD, V47, P528 LIEBERMAN P, 1962, J ACOUST SOC AM, V34, P922, DOI 10.1121/1.1918222 LIPPMANN RP, 1987, 1987 P IEEE INT C AC, P705 Liu F.-H., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. No.92CH3103-9), DOI 10.1109/ICASSP.1992.225923 LIVELY SE, 1993, J ACOUST SOC AM, V93, P2962, DOI 10.1121/1.405815 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z Lockwood P., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. 
No.92CH3103-9), DOI 10.1109/ICASSP.1992.225921 Lombard E., 1911, ANN MALADIES OREILLE, V37, P101 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P1659, DOI 10.1109/29.46548 MOKBEL CE, 1995, IEEE T SPEECH AUDI P, V3, P346, DOI 10.1109/89.466660 MURRAY IR, 1993, J ACOUST SOC AM, V93, P1097, DOI 10.1121/1.405558 NANDKUMAR S, 1992, ICSLP 92 INT C SPOK, P527 NANDKUMAR S, 1995, IEEE T SPEECH AUDI P, V3, P22, DOI 10.1109/89.365384 PAUL DB, 1987, IEEE INT C AC SPEECH, P713 PISONI DB, 1985, IEEE INT C AC SPEECH Rajasekaran P. K., 1986, ICASSP 86 Proceedings. IEEE-IECEJ-ASJ International Conference on Acoustics, Speech and Signal Processing (Cat. No.86CH2243-4) RUSSELL MJ, 1983, RECORDINGS MADE AUTO SIMONOV PV, 1977, AVIATION SPACE ENV S, V1, P23 Stanton B. J., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266517 Stanton B. J., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), DOI 10.1109/ICASSP.1988.196583 STREETER LA, 1983, J ACOUST SOC AM, V73, P1354, DOI 10.1121/1.389239 Summers W V, 1988, J Acoust Soc Am, V84, P917, DOI 10.1121/1.396660 Wang M. Q., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. No.92CH3103-9), DOI 10.1109/ICASSP.1992.225924 Whalen A D, 1971, DETECTION SIGNALS NO WILLIAMS CE, 1969, AEROSPACE MED, V40, P1369 WILLIAMS CE, 1972, J ACOUST SOC AM, V52, P1238, DOI 10.1121/1.1913238 NR 71 TC 84 Z9 87 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1996 VL 20 IS 1-2 BP 151 EP 173 DI 10.1016/S0167-6393(96)00050-7 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VY514 UT WOS:A1996VY51400012 ER PT J AU Laprie, Y Berger, MO AF Laprie, Y Berger, MO TI Cooperation of regularization and speech heuristics to control automatic formant tracking SO SPEECH COMMUNICATION LA English DT Article DE formant tracking; speech analysis AB This paper describes an automatic formant tracking algorithm incorporating speech knowledge. It operates in two phases. The first detects and interprets spectrogram peak lines in terms of formants. The second uses an image contour extraction method to regularise the peak lines thus detected. Speech knowledge serves as acoustic constraints to guide the interpretation of peak lines. The proposed algorithm has the advantage of providing formant trajectories which, in addition to being sufficiently close to the spectral peaks of the respective formants, are sufficiently smooth to allow an accurate evaluation of formant transitions. The results obtained demonstrate the value of the proposed approach. RP Laprie, Y (reprint author), INST NATL RECH INFORMAT & AUTOMAT LORRAINE, CRIN, CNRS, BP 239, F-54506 VANDOEUVRE LES NANCY, FRANCE.
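The Laprie and Berger record above regularises raw spectrogram peak lines so that formant tracks stay close to the spectral peaks while remaining smooth. The authors use an image contour (snake-style) method; the sketch below shows the same closeness-versus-smoothness trade-off with a simple quadratic penalty solved in closed form, as a stand-in rather than their algorithm.

    # Minimal sketch of the closeness-vs-smoothness trade-off: regularise
    # a noisy formant track x by minimising ||x - y||^2 + lam * ||D x||^2
    # with D the second-difference operator. A stand-in for the snake-style
    # contour method of the record above, not the paper's algorithm.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 50
    true_f2 = 1400 + 300 * np.sin(np.linspace(0, np.pi, n))  # smooth transition
    y = true_f2 + rng.normal(scale=60.0, size=n)             # raw peak picks (Hz)

    D = np.diff(np.eye(n), n=2, axis=0)   # second-difference operator
    lam = 25.0                            # smoothness weight
    x = np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

    # Mean absolute error drops after regularisation.
    print(float(np.abs(y - true_f2).mean()), float(np.abs(x - true_f2).mean()))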
CR BERGER MO, 1991, THESIS NATL POLYTECH CALLIOPE, 1989, PAROLE SON TRAITEMEN, pCH3 COSTEMARQUIS S, 1994, THESIS U H POINCARE FANT G, 1970, ACOUSTIC THEORY SPEE, P48 GUPTA SK, 1993, J ACOUST SOC AM, V94, P2517, DOI 10.1121/1.407364 Hanson Helen M., 1995, P 13 INT C PHON SCI, V3, P182 Kass M., 1988, INT J COMPUT VISION, V1, P321, DOI 10.1007/BF00133570 KOPEC GE, 1986, IEEE T ACOUST SPEECH, V34, P709, DOI 10.1109/TASSP.1986.1164908 LAFACE P, 1980, SIGNAL PROCESS, V2, P113, DOI 10.1016/0165-1684(80)90003-1 LAPRIE Y, 1990, P INT C SPOK LANG PR, V2, P1261 LAPRIE Y, 1994, P INT C AC SPEECH SI, V2 MARKEL JD, 1976, LINEAR PREDICTION SP, pCH6 MCCANDLE.SS, 1974, IEEE T ACOUST SPEECH, VSP22, P135, DOI 10.1109/TASSP.1974.1162559 MRAYATI M, 1976, REV ACOUSTIQUE, V36 OLIVE JP, 1971, J ACOUST SOC AM, V50, P661, DOI 10.1121/1.1912681 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 POLS LCW, 1973, J ACOUST SOC AM, V53, P1093, DOI 10.1121/1.1913429 RABINER LR, 1969, AT&T TECH J, V48, P1249 RIGOLL G, 1986, P IEEE INT C AC SPEE, P307 Schechter R., 1967, VARIATIONAL METHOD E NR 20 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1996 VL 19 IS 4 BP 255 EP 269 DI 10.1016/S0167-6393(96)00036-2 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VV401 UT WOS:A1996VV40100001 ER PT J AU Vorstermans, A Martens, JP VanCoile, B AF Vorstermans, A Martens, JP VanCoile, B TI Automatic segmentation and labelling of multi-lingual speech data SO SPEECH COMMUNICATION LA English DT Article DE automatic segmentation and labelling; multi-lingual; neural networks ID HIDDEN MARKOV-MODELS; RECOGNITION AB A new system for the automatic segmentation and labelling of speech is presented. The system is capable of labelling speech originating from different languages without requiring extensive linguistic knowledge or large (manually segmented and labelled) training databases of that language. The system comprises small neural networks for the segmentation and the broad phonetic classification of the speech. These networks were originally trained on one task (Flemish continuous speech), and are automatically adapted to a new task. Due to the limited size of the neural networks, the segmentation and labelling strategy requires only a limited amount of computation, and the adaptation to a new task can be accomplished very quickly. The system was first evaluated on five isolated word corpora designed for the development of Dutch, French, American English, Spanish and Korean text-to-speech systems. The results show that the accuracy of the obtained automatic segmentation and labelling is comparable to that of human experts. In order to provide segmentation and labelling results which can be compared to data reported in the literature, additional tests were run on TIMIT and on the English, Danish and Italian portions of the EUROM0 continuous speech utterances. The performance of our system appears to compare favourably to that of other systems. C1 STATE UNIV GHENT, ELIS, B-9000 GHENT, BELGIUM. RP Vorstermans, A (reprint author), LERNOUT & HAUSPIE SPEECH PROD NV, SINT KRISPIJNSTR 7, B-8900 IEPER, BELGIUM. CR Anderson T.R., 1993, P 1993 INT C AC SPEE, V2, P231 Angelini B., 1993, P EUROSPEECH 93, P653 ARAI K, 1990, P ICSLP 90, P1005 Barry W.
J., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90041-2 BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W Cosi P., 1991, P EUROSPEECH 91, P693 DALSGAARD P, 1991, P EUROSPEECH 91, P685 DALSGAARD P, 1992, P INT C AC SPEECH SI, P549, DOI 10.1109/ICASSP.1992.225849 Dalsgaard P., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90026-Z DALSGAARD P, 1991, P INT C AC SPEECH SI, P197, DOI 10.1109/ICASSP.1991.150311 GRICE M, 1989, G001B2 SAM KABRE H, 1991, P EUROSPEECH 91, P689 KVALE K, 1993, THESIS I TELETEKNIK LAMEL L, 1986, FEB P DARPA SPEECH R, P100 LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 LEUNG HC, 1984, P INT C AC SPEECH SI Ljolje A., 1991, P INT C AC SPEECH SI, P473, DOI 10.1109/ICASSP.1991.150379 MARTENS JP, 1991, SPEECH COMMUN, V10, P81, DOI 10.1016/0167-6393(91)90029-S MARTENS JP, 1990, P INT C AC SPEECH SI, P401 MENG HM, 1991, P INT C AC SPEECH SI, P285, DOI 10.1109/ICASSP.1991.150333 MENG HM, 1990, P ICSLP 90, P1053 OSTENDORF M, 1989, IEEE T ACOUST SPEECH, V37, P1857, DOI 10.1109/29.45533 Richard M. D., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.4.461 Riley M. D., 1991, P INT C AC SPEECH SI, P737, DOI 10.1109/ICASSP.1991.150446 SENEFF S, 1990, READINGS SPEECH RECO, P101 SORENSEN HBD, 1989, P EUROSPEECH 89, V2, P79 SVENDSEN T, 1990, P ICSLP 90 KOB JAP, P997 TORKKOLA K, 1988, P IEEE INT C AC SPEE, P611 VANERP A, 1989, P EUROSPEECH 89, V2, P88 VANERP A, 1988, P SPEECH 88, P1131 VANHEMERT JP, 1985, AUTOMATIC DIPHONE PR, P23 VANIMMERSEEL LM, 1993, THESIS U GENT BELGIU VANIMMERSEEL LM, 1992, J ACOUST SOC AM, V91, P3511, DOI 10.1121/1.402840 VEREECKEN H, 1995, P EUROSPEECH 95, P1995 WANG HD, 1990, P ICSLP 90, P457 NR 35 TC 15 Z9 15 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1996 VL 19 IS 4 BP 271 EP 293 DI 10.1016/S0167-6393(96)00037-4 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VV401 UT WOS:A1996VV40100002 ER PT J AU Pind, J AF Pind, J TI Rate-dependent perception of quantity in released and unreleased syllables in Icelandic SO SPEECH COMMUNICATION LA English DT Article DE speech perception; higher-order invariants; quantity; speaking rate; Icelandic ID ARTICULATORY-RATE; SPEAKING RATE; PHONEME AB Previous research (Pind, 1986, 1995a) has shown that the ratio of vowel to rhyme (vowel + consonant) duration is a major cue for quantity in Icelandic and serves as a higher-order invariant which enables the listener to disentangle those durational transformations of the speech signal which are due to changes in the speaking rate from those which involve a change of phonemic quantity. For the listener to be able to calculate this ratio, both segments, vowel and consonant, need to be present in the acoustic waveform. This paper reports two perceptual experiments using edited natural speech where a closure following the vowel is either audible or not (in the latter case the closure is unreleased). Results support the hypothesis that the surrounding context will have a greater effect on the location of the phoneme boundaries for vowel quantity in the unreleased syllables since in that case the listener will not be able to calculate the vowel to rhyme ratio. RP Pind, J (reprint author), UNIV ICELAND, FAC SOCIAL SCI, IS-101 REYKJAVIK, ICELAND. CR ABRAMSON AS, 1990, J PHONETICS, V18, P79 Ellis A.
W., 1987, PROGR PSYCHOL LANGUA, V3, P119 Finney D. J., 1971, PROBIT ANAL, V3rd Hadding-Koch K., 1964, STUD LINGUISTICA, V18, P94, DOI 10.1111/j.1467-9582.1964.tb00451.x KIDD GR, 1989, J EXP PSYCHOL HUMAN, V15, P736 LIBERMAN AM, 1967, PSYCHOL REV, V74, P431, DOI 10.1037/h0020279 LINDBLOM B, 1994, 127 M AC SOC AM JUN LISKER L, 1986, LANG SPEECH, V29, P3 LISKER L, 1964, WORD, V20, P384 Miller J. L., 1981, PERSPECTIVES STUDY S, P39 MILLER JL, 1986, PHONETICA, V43, P106 PETERSON GE, 1952, J ACOUST SOC AM, V24, P175, DOI 10.1121/1.1906875 PICKETT JM, 1960, LANG SPEECH, V3, P11 PIND J, 1995, PERCEPT PSYCHOPHYS, V57, P291, DOI 10.3758/BF03213055 Pind J, 1996, SCAND J PSYCHOL, V37, P121, DOI 10.1111/j.1467-9450.1996.tb00645.x PIND J, 1986, PHONETICA, V43, P116 PIND J, 1995, P 13 INT C PHON SCI, V2, P538 Stevens K. N., 1981, PERSPECTIVES STUDY S, P1 SUMMERFIELD Q, 1981, J EXP PSYCHOL HUMAN, V7, P1074, DOI 10.1037/0096-1523.7.5.1074 WHALEN DH, 1993, J ACOUST SOC AM, V93, P2152, DOI 10.1121/1.406678 NR 20 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1996 VL 19 IS 4 BP 295 EP 306 DI 10.1016/S0167-6393(96)00051-9 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VV401 UT WOS:A1996VV40100003 ER PT J AU Chung, YJ Un, CK AF Chung, YJ Un, CK TI An MLP/HMM hybrid model using nonlinear predictors SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; multilayer perceptron; HMM; nonlinear prediction ID WORD RECOGNITION; NETWORK AB In this paper, we propose an MLP/HMM hybrid model in which the input feature vectors are transformed by nonlinear predictors using multilayer perceptrons (MLPs) assigned to each state of a Hidden Markov Model (HMM). The prediction error vectors in the states are modeled by Gaussian mixture densities. The use of a hybrid model is motivated by the need to model the prediction errors in the conventional neural prediction model (NPM) where the prediction errors are variable due to the effect of varying contexts and speaker identity. The MLP/HMM hybrid model is advantageous because frame-correlation in the input speech signal is exploited by employing the MLP predictors, and the variabilities in the prediction error signals are explicitly modeled. We present the training algorithms based on the maximum likelihood (ML) criterion and discriminative criterion for minimum error classification. Experiments were done on speaker-independent continuous speech recognition. By ML training of the hybrid model, we obtained much better performance than a conventional NPM which does not explicitly model the prediction error signals. Training with the discriminative criterion significantly reduced confusion among different models and reduced the word error rate by 56% compared with ML training. C1 KOREA ADV INST SCI & TECHNOL, DEPT ELECT ENGN, COMMUN RES LAB, YUSUNG KU, TAEJON 305701, SOUTH KOREA.
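
The scoring step described in the Chung and Un abstract above can be sketched in a few lines: each HMM state owns a small MLP that predicts the current frame from its predecessor, and the prediction error is scored under a Gaussian density. A minimal sketch, assuming a one-hidden-layer predictor and a single diagonal Gaussian in place of the paper's Gaussian mixture densities; all names and dimensions are illustrative.

import numpy as np

def mlp_predict(frame_prev, W1, b1, W2, b2):
    # One-hidden-layer MLP predictor for a state: maps the previous
    # frame to a prediction of the current frame.
    h = np.tanh(frame_prev @ W1 + b1)
    return h @ W2 + b2

def state_log_likelihood(frame_prev, frame_cur, params, mean, var):
    # Log-likelihood of the prediction error under a diagonal Gaussian,
    # as in an MLP/HMM hybrid where each state owns a nonlinear predictor.
    err = frame_cur - mlp_predict(frame_prev, *params)
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (err - mean) ** 2 / var)

# Toy dimensions: 13-dim cepstral frames, 8 hidden units.
rng = np.random.default_rng(0)
params = (rng.standard_normal((13, 8)) * 0.1, np.zeros(8),
          rng.standard_normal((8, 13)) * 0.1, np.zeros(13))
ll = state_log_likelihood(rng.standard_normal(13), rng.standard_normal(13),
                          params, np.zeros(13), np.ones(13))
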
CR BENGIO Y, 1992, IEEE T NEURAL NETWOR, V3, P252, DOI 10.1109/72.125866 BOURLARD H, 1990, IEEE T PATTERN ANAL, V12, P1167, DOI 10.1109/34.62605 CHOU W, 1992, P IEEE INT C AC SPEE, P473, DOI 10.1109/ICASSP.1992.225869 FUNAHASHI K, 1989, NEURAL NETWORKS, V2, P183, DOI 10.1016/0893-6080(89)90003-8 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 ISO K, 1990, P INT C AC SPEECH SI, P441 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 KENNY P, 1990, IEEE T ACOUST SPEECH, V38, P220, DOI 10.1109/29.103057 LEE CH, 1989, IEEE T ACOUST SPEECH, V37, P1649, DOI 10.1109/29.46547 LEVIN E, 1990, P IEEE INT C AC SPEE, P433 MELLOUK A, 1994, P IEEE INT C AC SPEE, P233 TEBELSKIS J, 1990, P IEEE INT C ACOUSTI, P437 Wellekens C. J., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) NR 13 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD OCT PY 1996 VL 19 IS 4 BP 307 EP 316 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VV401 UT WOS:A1996VV40100004 ER PT J AU Jang, CS Un, CK AF Jang, CS Un, CK TI A new parameter smoothing method in the hybrid TDNN/HMM architecture for speech recognition SO SPEECH COMMUNICATION LA English DT Article ID NEURAL NETWORKS AB In this paper, we propose a new parameter smoothing method in the hybrid time-delay neural network (TDNN)/hidden Markov model (HMM) architecture for speech recognition. In the hybrid architecture, the TDNN and the HMM are combined using the activations from the second hidden layer of TDNN as the outputs of a fuzzy vector quantizer (FVQ). The HMM algorithm is modified to accommodate these FVQ outputs. In our modular construction of TDNN, the input layer is divided into two states to deal with the temporal structure of phonemic features, and the second hidden layer consists of two states in a time sequence. To improve the performance of the hybrid architecture, a new smoothing method is proposed. The average values of the activation vectors from the second hidden layer of the modular TDNN are used to generate the smoothing matrix from which the smoothed output symbol observation probability is obtained. With this proposed approach, our simulations on speaker-independent Korean isolated words show a 44.9% reduction in error rate compared to the floor smoothing method. C1 KOREA ADV INST SCI & TECHNOL, DEPT ELECT ENGN, COMMUN RES LAB, TAEJON 305701, SOUTH KOREA. RP Jang, CS (reprint author), KUM OH NATL INST TECHNOL, DEPT COMP ENGN, 188 KUMI, KYUNGBUK, SOUTH KOREA. CR FENG MW, 1988, P IEEE INT C AC SPEE, P131 KOMORI Y, 1991, P IEEE INT C AC SPEE Lee K.-F., 1989, AUTOMATIC SPEECH REC LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 MA W, 1990, P IEEE INT C AC SPEE MORGAN N, 1990, P IEEE INT C AC SPEE Picone J., 1990, IEEE ASSP Magazine, V7, DOI 10.1109/53.54527 Rabiner LR, 1986, IEEE ASSP MAGAZI JAN, P4 SCHWARTZ R, 1989, P INT C AC SPEECH SI, P548 TSENG HP, 1987, P IEEE INT C AC SPEE WAIBEL A, 1989, IEEE T ACOUST SPEECH, V37, P1888, DOI 10.1109/29.45535 WAIBEL A, 1989, IEEE T ACOUST SPEECH, V37, P328, DOI 10.1109/29.21701 WAIBEL A, 1989, P IEEE INT C AC SPEE NR 13 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
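
The smoothing idea in the Jang and Un record above can be pictured as follows: instead of flooring rare output symbols at a small constant, probability mass is shared among symbols whose average second-hidden-layer activation vectors are similar. The exact construction of the paper's smoothing matrix is not reproduced here; the sketch below derives one from cosine similarity of average activations (assuming non-negative activations, e.g. sigmoid units), purely for illustration.

import numpy as np

def smoothing_matrix(avg_activations):
    # avg_activations: (K symbols, D hidden units). Symbols whose average
    # activation patterns are similar end up sharing probability mass.
    A = np.asarray(avg_activations, dtype=float)
    norms = np.linalg.norm(A, axis=1)
    sim = (A @ A.T) / (norms[:, None] * norms[None, :])
    return sim / sim.sum(axis=1, keepdims=True)   # rows sum to 1

def smooth_observation_probs(B, Sm):
    # B: HMM output probabilities (states x symbols); Sm: smoothing matrix.
    # Replaces the floor-smoothing step with matrix smoothing.
    Bs = B @ Sm.T
    return Bs / Bs.sum(axis=1, keepdims=True)
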
PD OCT PY 1996 VL 19 IS 4 BP 317 EP 324 DI 10.1016/S0167-6393(96)00052-0 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VV401 UT WOS:A1996VV40100005 ER PT J AU Mokbel, C Jouvet, D Monne, J AF Mokbel, C Jouvet, D Monne, J TI Deconvolution of telephone line effects for speech recognition SO SPEECH COMMUNICATION LA English DT Article DE telephone line effects; blind deconvolution; cepstral subtraction; highpass filtering; adaptive filtering AB This paper presents a new approach to equalize the telephone line effects in the transmitted signal, with the aim of improving the performance of speech recognition systems. This new approach implements a blind equalization scheme where an adaptive filter, using some known statistics about the signals, deconvolves the channel from the transmitted signal. Measurements carried out on actual telephone data confirm that telephone lines introduce disturbing convolved components in speech signals. Line effects are almost constant for a given call but vary with the calls. The proposed adaptive filtering of the telephone line effects is compared to two conventional techniques, namely subtraction of the long-term cepstrum and highpass filtering of cepstral trajectories. Recognition experiments are carried out on several telephone databases in a speaker-independent mode. The results show that reducing the channel effects significantly improves the recognition performance. In terms of error rates, the proposed adaptive filter yields better performance than the conventional highpass filters. However, this adaptive filtering is not as good as the off-line cepstral subtraction technique where the long-term cepstrum is estimated on several recordings of a call. Experiments were also conducted to measure the amount of speech data necessary to obtain a reliable estimate of channel effects. Averaging cepstral vectors over a few seconds of speech produces a reliable estimate of the constant convolved perturbation. RP Mokbel, C (reprint author), FRANCE TELECOM, CNET,LAA,TSS,RCP,TECHNOPOLE ANTICIPA, 2 AV PIERRE MARZIN, F-22307 LANNION, FRANCE. CR Acero A., 1990, P IEEE INT C AC SPEE, V2, P849 HANSON BA, 1993, 1993 P IEEE INT C AC, V2, P79 Hermansky H., 1993, P ICASSP 93, P83 Hermansky H., 1991, P EUROSPEECH, P1367 HILAL K, 1993, THESIS ENST PARIS Hirsch H.-G., 1991, P EUROSPEECH, P413 JOUVET D, 1991, P EUROSPEECH 91, P923 Juang B. H., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90011-E LOCKWOOD P, 1991, P EUR, P79 MANSOUR D, 1989, IEEE T ACOUST SPEECH, V37, P795, DOI 10.1109/ASSP.1989.28053 Mokbel C., 1994, P ICSLP, P987 MOKBEL C, 1993, P EUROSPEECH, P1247 MOKBEL C, 1992, P ICSLP 92 BANFF 12, P707 MOKBEL CE, 1995, IEEE T SPEECH AUDI P, V3, P346, DOI 10.1109/89.466660 MONTACIE C, 1988, NATO ASI SERIES F, V46 Shynk J.J., 1992, IEEE SIGNAL PROC JAN, P15 WITTMANN M, 1993, P EUR 93 BERL SEP 19, P1251 ZHAO Y, 1993, P EUR 93 BERL SEPT 1, P359 NR 18 TC 8 Z9 8 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
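
The two conventional reference techniques named in the Mokbel, Jouvet and Monne abstract above, subtraction of the long-term cepstrum and highpass filtering of cepstral trajectories, are simple to state in code; the paper's own adaptive filter is not reproduced here. A minimal sketch, with an assumed first-order highpass and illustrative parameter values:

import numpy as np

def cepstral_mean_subtraction(cepstra):
    # Off-line technique: remove a constant convolutional channel by
    # subtracting the long-term cepstrum computed over the call.
    return cepstra - cepstra.mean(axis=0, keepdims=True)

def highpass_cepstra(cepstra, alpha=0.95):
    # On-line alternative: highpass filter each cepstral trajectory,
    # y[t] = x[t] - x[t-1] + alpha * y[t-1], so slowly varying channel
    # components are removed as speech comes in.
    y = np.zeros_like(cepstra)
    for t in range(1, len(cepstra)):
        y[t] = cepstra[t] - cepstra[t - 1] + alpha * y[t - 1]
    return y
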
PD SEP PY 1996 VL 19 IS 3 BP 185 EP 196 DI 10.1016/0167-6393(96)00032-5 PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VP631 UT WOS:A1996VP63100001 ER PT J AU Kwon, OW Un, CK AF Kwon, OW Un, CK TI Performance of HMM-based speech recognizers with discriminative state-weights SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; hidden Markov models AB In this paper, assuming that the score of a speech utterance is a weighted sum of hidden Markov model (HMM) log state-likelihoods, we propose a new method of finding discriminative state-weights recursively using the generalized probabilistic descent method. By constraining the sum of the state-weights to the number of states in a recognition unit, the conventional parameter estimation method and the Viterbi algorithm can be applied, with only minor modification, to continuous speech recognition as well as isolated word recognition. Compared with the previous approaches, this method does not increase complexity. To evaluate the performance of the state-weighted HMM recognizer, we perform two kinds of experiments with phoneme-based and word-based state-weights using various kinds of speech databases. Experimental results showed that the recognizers with phoneme-based and word-based state-weights achieved 20% and 50% decreases in word error rate, respectively, for isolated word recognition, and a 5% decrease for continuous speech recognition. Our approach yields recognition accuracies comparable to those of the previous approaches for continuous speech recognition, but is much simpler to implement. C1 KOREA ADV INST SCI & TECHNOL, DEPT ELECT ENGN, COMMUN RES LAB, TAEJON 305701, SOUTH KOREA. CR AMARI S, 1967, IEEE TRANS ELECTRON, VEC16, P299, DOI 10.1109/PGEC.1967.264665 Chang PC, 1993, IEEE T SPEECH AUDI P, V1, P326, DOI 10.1109/89.232616 CHOU W, 1993, P IEEE INT C AC SPEE, V2, P652 CHOU W, 1992, P IEEE INT C AC SPEE, P473, DOI 10.1109/ICASSP.1992.225869 Chung YJ, 1996, SPEECH COMMUN, V18, P79, DOI 10.1016/0167-6393(95)00038-0 LEE CH, 1989, IEEE T ACOUST SPEECH, V37, P1649, DOI 10.1109/29.46547 Lee C. H., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90002-N Su KY, 1994, IEEE T SPEECH AUDI P, V2, P69 WOLFERSTETTER F, 1994, P INT C SPOK LANG PR, P219 NR 9 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1996 VL 19 IS 3 BP 197 EP 205 DI 10.1016/0167-6393(96)00035-0 PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VP631 UT WOS:A1996VP63100002 ER PT J AU Pauws, S Kamp, Y Willems, L AF Pauws, S Kamp, Y Willems, L TI A hierarchical method of automatic speech segmentation for synthesis applications SO SPEECH COMMUNICATION LA English DT Article DE speech segmentation; hidden Markov models (HMM); vector quantization ID AMERICAN ENGLISH; RECOGNITION; DURATION AB The paper describes a method for automatically segmenting a database of isolated words as required for the purpose of speech synthesis.
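
The scoring and constraint described in the Kwon and Un record above reduce to a few lines: the utterance score is a weighted sum of log state-likelihoods, and after each generalized-probabilistic-descent step the weights are rescaled so that they sum to the number of states. A minimal sketch; the gradient of the misclassification measure is assumed to be supplied, and the update shown is only schematic, not the authors' full training procedure.

import numpy as np

def weighted_utterance_score(state_loglikes, weights):
    # Utterance score as a weighted sum of HMM log state-likelihoods.
    return float(np.dot(weights, state_loglikes))

def project_weights(weights):
    # Enforce the constraint: state-weights stay positive and sum to
    # the number of states in the recognition unit.
    w = np.maximum(np.asarray(weights, dtype=float), 1e-6)
    return w * (len(w) / w.sum())

def gpd_step(weights, grad_d, eps=0.01):
    # One schematic GPD update on a misclassification measure d,
    # followed by projection back onto the constraint set.
    return project_weights(weights - eps * grad_d)
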
The phoneme-like units in the phonetic transcription of the utterances are represented by dedicated hidden Markov models (HMMs) and segmentation is performed by aligning the speech signal against the sequence of HMMs representing the words. The specific advantage of the method presented here is that it does not need manually segmented speech material to initialize the training of the HMMs. Therefore, it can be regarded as an improved variant of established techniques for automatic segmentation. The problem of proper initialization of the HMMs without resorting to manually segmented material is solved by a hierarchical approach consisting of three successive steps. In the first step a segmentation in broad phonetic classes is realized that provides anchor points for the second stage, consisting of a sequence-constrained vector quantization. In this stage each broad phonetic class is further segmented into its constituent phonemes. The result is a crude phonetic segmentation which is then used as initialization of the HMMs in the last stage. Fine-tuning of the models is realized via Baum-Welch estimation. The final segmentation is obtained by Viterbi alignment of the utterances against the HMMs. This hierarchical approach was used to segment a database of isolated words recorded from a male speaker. An accuracy of 89.51% was obtained in the location of the phoneme boundaries with a tolerance of 20 ms. RP Pauws, S (reprint author), INST PERCEPT RES, POB 513, NL-5600 MB EINDHOVEN, NETHERLANDS. CR BOEFFARD O, 1992, INT C SPEECH LANG PR, P1211 BOEFFARD O, 1993, EUR 93 BERL, P1449 COSI P, 1991, EUR 91 GEN, V2, P693 FALAVIGNA D, 1990, EUR C SIGN PROC EUSI, P1139 Gray R. M., 1984, IEEE ASSP Magazine, V1, DOI 10.1109/MASSP.1984.1162229 ITAKURA F, 1975, IEEE T ACOUST SPEECH, VAS23, P67, DOI 10.1109/TASSP.1975.1162641 Ljolje A., 1991, P INT C AC SPEECH SI, P473, DOI 10.1109/ICASSP.1991.150379 LJOLJE A, 1993, EUR 93 BERL, V2, P1445 NEY H, 1993, EUR 93 BERL, V1, P491 Olive J. P., 1977, P INT C ACOUST SPEEC, P568 Rabiner L, 1993, FUNDAMENTALS SPEECH RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 Svendsen T., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) TAYLOR PA, 1991, EUR 91 GEN, V2, P709 *U COLL CENTR SERV, 1992, SAMUCLG004 CENTR SER UMEDA N, 1975, J ACOUST SOC AM, V58, P434, DOI 10.1121/1.380688 UMEDA N, 1977, J ACOUST SOC AM, V61, P846, DOI 10.1121/1.381374 VIDAL E, 1990, SIGNAL PROCESS, V5, P43 WILPON JG, 1985, IEEE T ACOUST SPEECH, V33, P587, DOI 10.1109/TASSP.1985.1164581 NR 19 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1996 VL 19 IS 3 BP 207 EP 220 DI 10.1016/0167-6393(96)00031-3 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VP631 UT WOS:A1996VP63100003 ER PT J AU Richard, G dAlessandro, C AF Richard, G dAlessandro, C TI Analysis/synthesis and modification of the speech aperiodic component SO SPEECH COMMUNICATION LA English DT Article DE speech decomposition; aperiodic component of speech; speech noises; random formant wave forms; analysis synthesis; rice representation; speech modifications ID TO-NOISE RATIO; SIGNALS; REPRESENTATION; PERCEPTION; FREQUENCY; PARALLEL; CASCADE AB The general framework of this paper is speech analysis and synthesis. 
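
The 89.51% figure in the Pauws, Kamp and Willems record above is a boundary-location accuracy at a 20 ms tolerance, which can be computed directly once automatic and manual boundary times are available. A minimal sketch of that metric; the exact pairing rule between automatic and manual boundaries is an assumption here, since it is not spelled out in the record.

def boundary_accuracy(auto_ms, manual_ms, tol_ms=20.0):
    # Fraction of manual phoneme boundaries matched by at least one
    # automatic boundary within +/- tol_ms.
    hits = sum(1 for m in manual_ms
               if any(abs(m - a) <= tol_ms for a in auto_ms))
    return hits / len(manual_ms)

print(boundary_accuracy([10, 95, 210], [12, 100, 250]))  # -> 0.666...
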
The speech signal may be separated into two components: (1) a periodic component (which includes the quasi-periodic or voiced sounds produced by regular vocal cord vibrations); (2) an aperiodic component (which includes the non-periodic part of voiced sounds (e.g. fricative noise in /v/) or sound emitted without any vocal cord vibration (e.g. unvoiced fricatives, or plosives)). This work is intended to contribute to a precise modelling of this second component and particularly of modulated noises. First, a synthesis method, inspired by the ''shot noise effect'', is introduced. This technique uses random point processes which define the times of arrival of spectral events (represented by Formant Wave Forms (FWF)). Based on the theoretical framework provided by the Rice representation and the random modulation theory, an analysis/synthesis scheme is proposed. Perception tests show that this method makes it possible to synthesize very natural speech signals. The representation proposed also enables new types of voice quality modification (time scaling, vocal effort, breathiness, etc.). C1 RUTGERS STATE UNIV, CTR CAIP, PISCATAWAY, NJ 08855 USA. RP Richard, G (reprint author), CNRS, LIMSI, BP 133, F-91403 ORSAY, FRANCE. CR ALMEIDA LB, 1984, P IEEE ICASSP 84 INT Atal B. S., 1982, Proceedings of ICASSP 82. IEEE International Conference on Acoustics, Speech and Signal Processing ATAL BS, 1970, AT&T TECH J, V49, P1973 BERTHOMIER C, 1983, SIGNAL PROCESS, V5, P31, DOI 10.1016/0165-1684(83)90033-6 CARL H, 1991, P IEEE ICASSP 91 INT, P581, DOI 10.1109/ICASSP.1991.150406 CHAFE C, 1990, P IEEE INT C AC SPEE, P1157 CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 DALESSANDRO C, 1990, SPEECH COMMUN, V9, P419, DOI 10.1016/0167-6393(90)90018-5 DALESSANDRO C, 1989, THESIS U PARIS 6 DALESSANDRO C, 1995, P ISMA 95 INT S MUS, P446 DALESSANDRO C, 1995, P IEEE ICASSP 84 INT, P760 DAVENPORT WB, 1952, J ACOUST SOC AM, V24, P390, DOI 10.1121/1.1906909 DAVENPORT WB, 1958, INTRO THEORY RANDOM, P113 DEKROM G, 1993, J SPEECH HEAR RES, V36, P254 DEPALLE P, 1991, THESIS U MAINE DUTOIT T, 1993, SPEECH COMMUN, V13, P435, DOI 10.1016/0167-6393(93)90042-J Fant G., 1960, ACOUSTIC THEORY SPEE Flanagan J., 1972, SPEECH ANAL SYNTHESI FLANAGAN JL, 1980, J ACOUST SOC AM, V68, P412, DOI 10.1121/1.384752 FOUCART T, 1991, INTRO AUX TESTS STAT GEORGE EB, 1992, J AUDIO ENG SOC, V40, P497 GRANDSTEN IR, 1993, P EUR 93 EUR C SPEEC, P385 GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651 HERMES DJ, 1991, SPEECH COMMUN, V10, P497, DOI 10.1016/0167-6393(91)90053-V HILLENBRAND J, 1987, J SPEECH HEAR RES, V30, P448 HIRAOKA N, 1984, J ACOUST SOC AM, V76, P1648, DOI 10.1121/1.391611 HOLMES JN, 1973, IEEE T ACOUST SPEECH, VAU21, P298, DOI 10.1109/TAU.1973.1162466 HOLMES JN, 1983, SPEECH COMMUN, V2, P251, DOI 10.1016/0167-6393(83)90044-4 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 KLINGHOLZ F, 1987, SPEECH COMMUN, V6, P15, DOI 10.1016/0167-6393(87)90066-5 KOJIMA H, 1980, ACTA OTO-LARYNGOL, V89, P547, DOI 10.3109/00016488009127173 Laroche J., 1993, P IEEE INT C AC SPEE, P550 LEADBETTER MR, 1972, STOCHASTIC POINT PRO, P436 Lienard J., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat.
No.87CH2396-0) LILJENCR.JC, 1968, IEEE T ACOUST SPEECH, VAU16, P137, DOI 10.1109/TAU.1968.1161961 MANDEL L, 1967, J OPT SOC AM, V57, P613, DOI 10.1364/JOSA.57.000613 MARQUES JS, 1994, SPEECH COMMUN, V14, P231, DOI 10.1016/0167-6393(94)90064-7 MCAULAY RJ, 1992, ADV SPEECH SIGNAL PR, P165 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z PAEZ MD, 1972, IEEE T COMMUN, VCO20, P225, DOI 10.1109/TCOM.1972.1091123 PAPOULIS A, 1983, IEEE T ACOUST SPEECH, V31, P96, DOI 10.1109/TASSP.1983.1164046 Papoulis A., 1986, PROBABILITY RANDOM V PICINBONO B, 1983, ANN TELECOMMUN, V38, P179 PICINBONO B, 1989, CISM COURSES LECT, V309, P1 RABINER LR, 1968, J ACOUST SOC AM, V49, P822 Rice SO, 1944, BELL SYST TECH J, V23, P282 RICE SO, 1944, BELL SYST TECH J, V25, P46 RICHARD G, 1992, P EUSIPCO 92 EUR SIG, P347 RICHARD G, 1993, P SMAC 93 STOCK MUS, P580 RICHARD G, 1994, THESIS U PARIS 11 OR RODET X, 1987, P EUR C SPEECH COMM RODET X, 1980, COMPUT MUSIC J, V8, P9 Serra X., 1990, COMPUTER MUSIC J, V14 Snyder D. L., 1975, RANDOM POINT PROCESS STEVENS KN, 1971, J ACOUST SOC AM, V50, P1180, DOI 10.1121/1.1912751 STEVENS P, 1960, LANGUAGE SPEECH, V3 TSOPANOGLOU A, 1993, VISUAL REPRESENTATIO, P341 WENDLER J, 1976, 16TH INT C LOG PHON, P518 YUMOTO E, 1982, J ACOUST SOC AM, V71, P1544, DOI 10.1121/1.387808 NR 60 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1996 VL 19 IS 3 BP 221 EP 244 DI 10.1016/0167-6393(96)00038-6 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VP631 UT WOS:A1996VP63100004 ER PT J AU Cosi, P Dugatto, M Ferrero, F Caldognetto, EM Vagges, K AF Cosi, P Dugatto, M Ferrero, F Caldognetto, EM Vagges, K TI Phonetic recognition by recurrent neural networks working on audio and visual information SO SPEECH COMMUNICATION LA English DT Article AB A phonetic classification scheme based on a feed forward recurrent back-propagation neural network working on audio and visual information is described. The speech signal is processed by an auditory model producing spectral-like parameters, while the visual signal is processed by specialised hardware, called ELITE, which computes lip and jaw kinematics parameters. Results are given for various speaker-dependent and speaker-independent phonetic recognition experiments on the Italian plosive consonants. RP Cosi, P (reprint author), UNIV PADUA, CTR STUDIO RIC FONET, CNR, DEPT LINGUIST, VIA G ANGHINONI 10, I-35121 PADUA, ITALY. CR ADJOUDANI A, 1995, P EUR 95 18 21 SEPT, V2, P1563 Borghese N. A., 1988, Proceedings of the 1988 IEEE International Conference on Systems, Man, and Cybernetics (IEEE Cat. No.88CH2556-9) CALDOGNETTO EM, 1992, P INT C SPOK PROC BA, V1, P61 CALDOGNETTO EM, 1980, ATT SEM PERC LING FI, P123 CALDOGNETTO EM, 1993, P EUR 93 BERL GERM 2, V1, P409 COSI P, 1994, P IEEE INT JOINT C A COSI P, 1993, P EUR 93 BERL GERM 2, P665 COSI P, 1992, VISUAL REPRESENTATIO, P205 COSI P, 1991, P GRETSI 91 JUAN LES Dodd B., 1987, HEARING EYE PSYCHOL FERRIGNO G, 1985, IEEE T BIO-MED ENG, V32, P943, DOI 10.1109/TBME.1985.325627 GORI M, 1989, P INT JOINT C NEURAL, V2, P417 JANKOWSKI CR, 1995, IEEE T SPEECH AUDI P, V3, P286, DOI 10.1109/89.397093 MACLEOD A, 1987, British Journal of Audiology, V21, P131, DOI 10.3109/03005368709077786 Massaro D. W., 1987, SPEECH PERCEPTION EA MOHAMADI T, 1992, B COMMUNICATION PARL, V2, P31 Petajan E.
D., 1984, THESIS U ILLINOIS UR SENEFF S, 1988, J PHONETICS, V16, P55 SILSBEE PL, 1993, P NATO ASI NEW ADV T, P13 STORK DG, 1995, IN PRESS NATO ASI F STORK DG, 1992, P INT JOINT C NEUR N, P285 WOLF RP, 1983, ELEMENTS PHOTOGRAMME NR 22 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD SEP PY 1996 VL 19 IS 3 BP 245 EP 252 DI 10.1016/0167-6393(96)00034-9 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VP631 UT WOS:A1996VP63100005 ER PT J AU Xie, F VanCompernolle, D AF Xie, F VanCompernolle, D TI Speech enhancement by spectral magnitude estimation - A unifying approach SO SPEECH COMMUNICATION LA English DT Article DE speech enhancement; Minimum Mean Square Error (MMSE) estimation; spectral magnitude estimation ID AMPLITUDE ESTIMATOR; SUBTRACTION; NOISE AB In this paper we present a solution to the nonlinear spectral estimation problem for speech enhancement. We start from a rather simple statistical model (log-normal) for the short time spectral estimates of speech and noise. By empirical data generation and curve fitting, we are able to obtain explicit, though simple, expressions for the MMSE estimator as a function of input level and the model parameters for each frequency component. The great advantage of our approach is that it has a sound theoretical foundation, is general through the choice of its parameters, and is almost as simple to use as classical spectral subtraction. Moreover, using a neural network as a function approximator, which is found to be the best for our curve fitting problem, other model-based MMSE estimators can be readily implemented with the proposed approach. RP Xie, F (reprint author), KATHOLIEKE UNIV LEUVEN, ESAT, KARDINAAL MERCIERLAAN 94, B-3001 HEVERLEE, BELGIUM. CR Acero A., 1990, P ICASSP, P849 BEROUTI M, 1979, APR P IEEE INT C AC, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 Ephraim Y., 1989, P INT C AC SPEECH SI, P353 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 Erell A, 1993, IEEE T SPEECH AUDI P, V1, P84, DOI 10.1109/89.221370 GAGNON I, 1991, P INT C AC SPEECH SI, P981 GAGNON L, 1993, SPEECH COMMUN, V12, P213, DOI 10.1016/0167-6393(93)90091-X HORNIK K, 1989, NEURAL NETWORKS, V2, P359, DOI 10.1016/0893-6080(89)90020-8 KANG GS, 1989, IEEE T ACOUST SPEECH, V37, P939, DOI 10.1109/ASSP.1989.28065 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z MCAULAY RJ, 1980, IEEE T ACOUST SPEECH, V28, P137, DOI 10.1109/TASSP.1980.1163394 NAWAB H, 1981, P INT C AC SPEECH SI, P1105 PORTER JE, 1984, P INT C AC SPEECH SI VANCOMPERNOLLE D, 1989, P IEEE INT C AC SPEE, P258 Van Compernolle D., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90027-2 VARGA A, 1992, NOISEX 92 STUDY EFFE VARGA AP, 1990, P INT C AC SPEECH SI, P844 QUACKENBUSH SR, 1988, OBJECTIVE MEASUREMEN XIE F, 1993, P EUROSPEECH 93 BERL, P617 NR 21 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
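
The estimator described in the Xie and Van Compernolle abstract above amounts to a per-frequency gain that is a smooth function of the observed input level relative to the noise. Below is a minimal sketch in which a fitted polynomial in dB stands in for the paper's curve-fitted or neural-network approximator; the coefficients shown are hypothetical and would in practice be fitted on empirically generated data.

import numpy as np

def mmse_gain(snr_db, coeffs):
    # Per-bin magnitude gain as a smooth function of per-bin SNR (dB).
    # A polynomial stands in for the fitted approximator here.
    return np.clip(np.polyval(coeffs, snr_db), 0.0, 1.0)

def enhance_frame(noisy_mag, noise_mag, coeffs):
    # Apply the gain curve to one short-time magnitude spectrum.
    snr_db = 20 * np.log10(np.maximum(noisy_mag, 1e-9) /
                           np.maximum(noise_mag, 1e-9))
    return mmse_gain(snr_db, coeffs) * noisy_mag

coeffs = np.array([0.0, 0.02, 0.5])   # hypothetical fitted coefficients
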
PD AUG PY 1996 VL 19 IS 2 BP 89 EP 104 DI 10.1016/0167-6393(96)00022-2 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VL932 UT WOS:A1996VL93200001 ER PT J AU Sorokin, VN Trushkin, AV AF Sorokin, VN Trushkin, AV TI Articulatory-to-acoustic mapping for inverse problem SO SPEECH COMMUNICATION LA English DT Article DE speech; inverse problem; codebook; global optimum; mapping ID VOCAL-TRACT AB Mutual dependence of articulatory parameters allows the codebook volume to be reduced and helps to improve conditions for the global optimum search. The initial approximations of articulatory vectors for solving the inverse problem are sampled along the trajectories of articulatory parameters in synthesized syllables. Piece-wise linear mapping of the space of articulatory parameters onto the space of acoustic parameters, the minimal value of cross-sectional area of the vocal tract and the Reynolds number accelerate the optimization process by a factor of more than 100. RP Sorokin, VN (reprint author), RUSSIAN ACAD SCI, INST INFORMAT TRANSMISS PROBLEMS, BOLSHOY KARETNY 19, MOSCOW 101447, RUSSIA. CR ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 Badin P., 1984, NOTES VOCAL TRACT CO, P53 DADSON RS, 1954, BRIT J APPL PHYS, V5, P435, DOI 10.1088/0508-3443/5/12/304 FANT G, 1976, STL QPSR, V4, P28 Fant G., 1972, VOCAL TRACT WALL EFF, P28 Flanagan J., 1972, SPEECH ANAL SYNTHESI FLANAGAN JL, 1980, J ACOUST SOC AM, V68, P780, DOI 10.1121/1.384817 FRANKE EK, 1952, J ACOUST SOC AM, V24, P410, DOI 10.1121/1.1906911 ISHIZAKA K, 1975, IEEE T ACOUST SPEECH, V23, P370, DOI 10.1109/TASSP.1975.1162701 LARAR JN, 1988, IEEE T ACOUST SPEECH, V36, P1812, DOI 10.1109/29.9026 Lin Q., 1990, THESIS ROYAL I TECHN SCHROETER J, 1990, P INT C AC SPEECH SI, P393 Schroeter J., 1992, ADV SPEECH SIGNAL PR, P231 SKUDRZYK E, 1954, GRUNDLAGEN AKUSTIC Sorokin V. N., 1992, SPEECH SYNTHESIS Sorokin V. N., 1985, THEORY SPEECH PRODUC SOROKIN VN, 1994, SPEECH COMMUN, V14, P249, DOI 10.1016/0167-6393(94)90065-5 SOROKIN VN, 1992, SPEECH COMMUN, V11, P71, DOI 10.1016/0167-6393(92)90064-E Tikhonov A.N., 1974, METHODS SOLVING INCO UNGENHEUER G, 1962, ELEMENTE EINER AKUST WAKITA H, 1978, STL QPS, P9 NR 21 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1996 VL 19 IS 2 BP 105 EP 118 DI 10.1016/0167-6393(96)00028-3 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VL932 UT WOS:A1996VL93200002 ER PT J AU Dutoit, T Gosselin, B AF Dutoit, T Gosselin, B TI On the use of a hybrid harmonic/stochastic model for TTS synthesis-by-concatenation SO SPEECH COMMUNICATION LA English DT Article DE TTS synthesis; hybrid harmonic/stochastic; overlap-add; concatenation; segmental quality ID SPEECH AB In this paper, we address the possibilities offered by hybrid harmonic/stochastic (H/S) models in the context of wide-band text-to-speech synthesis based on segment concatenation. After briefly recalling the hypotheses underlying such models and comprehensively reviewing a well-known analysis algorithm, namely the one provided by the multi-band excited (MBE) analysis framework, we study how H/S models allow the prosody of segments to be modified and how segment concatenation can be organized, with the aim of minimizing mismatches at segment borders.
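
The codebook step described in the Sorokin and Trushkin record above can be pictured as a constrained nearest-neighbour lookup: the initial articulatory vector is the codebook entry whose stored acoustic image best matches the target, subject to a physical admissibility check such as a minimal cross-sectional area. A minimal sketch; the column layout (last column holding the minimal area) and the threshold are assumptions made here for illustration only.

import numpy as np

def initial_articulatory_guess(acoustic_target, cb_acoustic, cb_artic,
                               min_area=1e-6):
    # cb_acoustic: acoustic images of codebook entries (N x A);
    # cb_artic: matching articulatory vectors (N x P), with the
    # minimal vocal tract area assumed to sit in the last column.
    cb_acoustic = np.asarray(cb_acoustic, dtype=float)
    cb_artic = np.asarray(cb_artic, dtype=float)
    ok = cb_artic[:, -1] > min_area          # physical admissibility
    d = np.linalg.norm(cb_acoustic[ok] - acoustic_target, axis=1)
    return cb_artic[ok][np.argmin(d)]        # seed for the optimisation
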
In this context, we introduce an original concatenation algorithm which takes advantage of some analysis errors. Speech synthesis algorithms are then described, including an original synthesis technique based on judiciously prepared IFFTs, and the final segmental quality is detailed. More specifically, we examine the differences in the quality obtained when using the model in a narrow-band speech coding context and in a wide-band, concatenation-based synthesis context. We study three possible causes for these differences: the choice of an analysis criterion, the inadequacy of the model due to pitch variations, and the effect of coarticulation on phases. RP Dutoit, T (reprint author), FAC POLYTECH MONS, 31 BVD DOLEZ, B-7000 MONS, BELGIUM. CR Abrantes A. J., 1991, EUROSPEECH 91. 2nd European Conference on Speech Communication and Technology Proceedings ABRANTES AJ, 1992, EUSIPCO 92, P487 ALMEIDA LB, 1984, P INT C AC SPEECH SI BANGA ER, 1993, P INT C AC SPEECH SI, V2, P183 BOEFFARD O, 1994, P 2 ESCA IEEE WORKSH, P111 DEKETELAERE S, 1992, EUSIPCO 92 BRUSS, P475 DUTOIT T, 1994, P EUSIPCO, V1, P8 DUTOIT T, 1993, THESIS FAC POLYTECHN DUTOIT T, 1993, P EUROSPEECH 93, P531 DUTOIT T, 1994, P INT C AC SPEECH SI, V1, P565 FUJIMURA O, 1968, IEEE T AUDIO ELE MAR, P68 GRADHSTEYN LS, 1965, TABLE INTEGRAL SERIE, P485 GRIFFIN DW, 1987, THESIS MIT CAMBRIDGE HARDWICK JC, 1988, P INT C AC SPEECH SI, P374 HARRIS FJ, 1978, P IEEE, V66, P51, DOI 10.1109/PROC.1978.10837 HOLGER C, 1991, P INT C AC SPEECH SI, P581 Kay S. M., 1988, MODERN SPECTRAL ESTI Laroche J., 1993, P INT C AC SPEECH SI, V2, P550 MACAULAY RJ, 1991, P INT C AC SPEECH SI, P577 MACAULAY RJ, 1986, IEEE T AC SPEECH SIG, V34, P744 MACAULAY RJ, 1984, P INT C AC SPEECH SI MALLAT SG, 1989, T AM MATH SOC, V315, P69, DOI 10.2307/2001373 MARQUES J, 1988, P EUUSIPCO, P891 MARQUES JS, 1989, IEEE T ACOUST SPEECH, V37, P763 MARQUES JS, 1990, P IEEE INT C AC SPEE, V1, P17 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z ROWE D, 1991, P EUR 91 GEN, P239 SERRA X, 1990, COMPUT MUSIC J, V14, P12, DOI 10.2307/3680788 SORENSEN HV, 1987, IEEE T ACOUST SPEECH, V35, P849, DOI 10.1109/TASSP.1987.1165220 Wery B., 1989, Traitement du Signal, V6 NR 30 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1996 VL 19 IS 2 BP 119 EP 143 DI 10.1016/0167-6393(96)00029-5 PG 25 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VL932 UT WOS:A1996VL93200003 ER PT J AU Lublinskaja, V Sappok, C AF Lublinskaja, V Sappok, C TI Speaker attribution of successive utterances: The role of discontinuities in voice characteristics and prosody SO SPEECH COMMUNICATION LA English DT Article DE speaker attribution; voice characteristics; intonation; auditory processing; perceptual organization ID PERCEPTION; INTONATION; SPEECH; UNITS; TIME AB In our work we attempted to answer the following question: How do listeners identify two sentences as spoken by one or by two persons when they can take into consideration only acoustic speaker characteristics and intonation? We used successive utterance portions forming a single turn or two subsequent turns in a dialogue, the lexical structure of which was held constant.
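
The hybrid model in the Dutoit and Gosselin record above separates each frame into a harmonic part and a stochastic part; the paper synthesises efficiently with judiciously prepared IFFTs, but the same decomposition can be shown with a direct oscillator bank plus shaped noise. A minimal conceptual sketch, with illustrative amplitudes and noise level rather than analysed parameters:

import numpy as np

def hs_synthesize(f0, amps, noise_env, fs=16000, dur=0.02):
    # One frame = harmonic part (sinusoids at multiples of f0) plus
    # stochastic part (white noise scaled by a simple envelope).
    t = np.arange(int(fs * dur)) / fs
    harmonic = sum(a * np.cos(2 * np.pi * (k + 1) * f0 * t)
                   for k, a in enumerate(amps))
    stochastic = noise_env * np.random.randn(t.size)
    return harmonic + stochastic

frame = hs_synthesize(f0=120.0, amps=[1.0, 0.5, 0.25], noise_env=0.05)
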
The experiments are based on sentence combinations varying with respect to speakers, sentence types, and types of communicative coherence. Pairs of sentences with and without constant intervening pauses were presented to subjects with the task of ascribing them to one of three categories: ''dialogue'', ''monologue'' and ''metadialogue'' (i.e., the imitation of dialogue by a single voice). The results showed that the following factors influence the decision of the listeners: voice properties, presence or absence of pause, and the communicative types of the sentences. Two principles involved in processing the input signals are hypothesised. One of them applies to the case when two sentences are presented without a pause. Under these circumstances, the continuity or discontinuity of the acoustic input is likely to be the main criterion used by the listener to decide between ''monologue'' and ''dialogue''. If the sentences are separated by a pause, a comparison of the auditory voice properties of the two portions may take place. C1 RUHR UNIV BOCHUM, D-44780 BOCHUM, GERMANY. IP PAVLOV PHYSIOL INST, ST PETERSBURG 199034, RUSSIA. CR Bregman A. S., 1993, THINKING SOUND COGNI, P10 Bregman AS., 1990, AUDITORY SCENE ANAL BROKX JPL, 1982, J PHONETICS, V10, P23 CHERRY MP, 1953, J ACOUST SOC AM, V25, P554 CHISTOVICH LA, 1982, REPRESENTATION SPEEC, P165 Cooke M., 1991, THESIS U SHEFFIELD DARWIN CJ, 1983, ATTENTION PERFORMANC DARWIN CJ, 1977, J EXP PSYCHOL HUMAN, V3, P665, DOI 10.1037/0096-1523.3.4.665 Darwin C.J., 1975, STRUCTURE PROCESS SP, P178 DORMAN MF, 1975, J EXP PSYCHOL, V71, P121 ECKERT E, 1994, MENSCHEN IHRE STIMME Handel S., 1989, LISTENING Jones M. R., 1993, THINKING SOUND COGNI, P69 KNIPSCHILD M, 1991, FORTSCHRITTE AKUSTIK, P1045 KRIVNOVA OF, 1987, EXPT METHODS PSYCHOL, P177 LUBLINSKAJA VV, 1991, QUANTITATIVE LINGUIS, V46, P201 LUBLINSKAJA VV, 1991, BJULLETEN FONETICHES, P80 LUBLINSKAJA VV, 1991, P 12 INT C PHON SCI, V3, P318 MASSARO DW, 1987, P 11 INT C PHON SCI, V5, P334 MASSARO DW, 1972, PSYCHOL REV, V79, P124, DOI 10.1037/h0032264 McAdams S., 1993, THINKING SOUND COGNI, P146 NOOTEBOOM SG, 1985, TIME MIND BEHAV, P242 Nooteboom S.G., 1978, STUDIES PERCEPTION L, P75 PISONI DB, 1993, SPEECH COMMUN, V13, P109, DOI 10.1016/0167-6393(93)90063-Q PISONI DB, 1990, P ICSL 90 KOB JAP, V2, P1399 REMEZ E, 1987, PSYCHOPHYSICS SPEECH, P419 REMEZ RE, 1990, PERCEPT PSYCHOPHYS, V48, P313, DOI 10.3758/BF03206682 REPP BH, 1987, P 11 INT C PHON SCI, V2, P21 SCHUETZECOBURN S, 1991, LANG SPEECH, V34, P207 TERHARDT E, 1987, PSYCHOPHYSICS SPEECH, P271 VENCOV AV, 1993, PROBLEMS PHONETICS, P242 Warren R. M., 1993, THINKING SOUND COGNI, P37 WEINTRAUB M, 1985, THESIS STANFORD NR 33 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1996 VL 19 IS 2 BP 145 EP 159 DI 10.1016/0167-6393(96)00030-1 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VL932 UT WOS:A1996VL93200004 ER PT J AU Pols, LCW Wang, X tenBosch, LFM AF Pols, LCW Wang, X tenBosch, LFM TI Modelling of phone duration (using the TIMIT database) and its potential benefit for ASR SO SPEECH COMMUNICATION LA English DT Article ID CONNECTED-SPEECH SIGNALS; HIDDEN MARKOV-MODELS; SEGMENTAL DURATIONS; AMERICAN ENGLISH; VOWEL DURATION; RECOGNITION; STRESS AB As indicated by Bourlard et al.
(1996), the best and simplest solution so far in standard ASR technology to implement durational knowledge seems to consist of imposing a (trained) minimum segment duration, simply by duplicating or adding states that cannot be skipped. We want to argue that recognition performance can be further improved by incorporating ''specific knowledge'' (such as duration and pitch) into the recognizer. This can be achieved by optimising the probabilistic acoustic and language models, and probably also by a postprocessing step that is fully based on this specific knowledge. We used the widely available, hand-segmented TIMIT database to extract duration regularities that persist despite the great speaker variability. Two main approaches were used. In the first approach, duration distributions are considered for single phones, as well as for various broader classes, such as those specified by long or short vowels, word stress, syllable position within the word and within an utterance, post-vocalic consonants, and utterance speaking rate. The other approach is to use a hierarchically structured analysis of variance to study the numerical contributions of 11 different factors to the variation in duration. Several systematic effects have been found, but several other effects appeared to be obscured by the inherent variability in this speech material. Whether this specific use of knowledge about duration in a post-processor will actually improve recognition performance still has to be shown. However, in line with the prophetic message of Bourlard et al.'s paper, we here consider the improvement of performance as of secondary importance. C1 LERNOUT & HAUSPIE SPEECH PROD NV, BRUSSELS, BELGIUM. RP Pols, LCW (reprint author), UNIV AMSTERDAM, IFOTT, INST PHONET SCI, KRUISLAAN 403, AMSTERDAM, NETHERLANDS. CR Bourlard H, 1996, SPEECH COMMUN, V18, P205, DOI 10.1016/0167-6393(96)00003-9 BRUGNARA F, 1993, SPEECH COMMUN, V12, P357, DOI 10.1016/0167-6393(93)90083-W BYRD D, 1992, P ICSLP 92 BANFF, V1, P827 CHEN M, 1970, PHONETICA, V22, P129 CRYSTAL TH, 1990, J ACOUST SOC AM, V88, P101, DOI 10.1121/1.399955 CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1553, DOI 10.1121/1.395911 CRYSTAL TH, 1982, J ACOUST SOC AM, V72, P705, DOI 10.1121/1.388251 CRYSTAL TH, 1988, J ACOUST SOC AM, V83, P1574, DOI 10.1121/1.395912 EEFTING W, 1991, THESIS U UTRECHT Garofolo J. S., 1993, DARPA TIMIT ACOUSTIC GAUVAIN JL, 1994, SPEECH COMMUN, V15, P21, DOI 10.1016/0167-6393(94)90038-8 HOCHBERG MM, 1993, P EUR 93 BERL, V1, P323 JONES M, 1993, P EUR 93 BERL, V1, P311 KEATING PA, 1994, SPEECH COMMUN, V14, P131, DOI 10.1016/0167-6393(94)90004-3 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 LAMEL L, 1986, FEB P DARPA SPEECH R, P100 LAMEL LF, 1993, P 3 EUR C SPEECH COM, V1, P121 LAMEL LF, 1993, P EUR 93 BERL, V1, P23 LEE KF, 1989, IEEE T ACOUST SPEECH, V37, P1641, DOI 10.1109/29.46546 Levinson S. E., 1986, Computer Speech and Language, V1, DOI 10.1016/S0885-2308(86)80009-2 LJOLJE A, 1994, COMPUT SPEECH LANG, V8, P129, DOI 10.1006/csla.1994.1006 Ljolje A., 1991, P INT C AC SPEECH SI, P473, DOI 10.1109/ICASSP.1991.150379 NOOTEBOOM SG, 1970, THESIS U UTRECHT OLLER DK, 1973, J ACOUST SOC AM, V54, P1235, DOI 10.1121/1.1914393 PALLETT D, 1990, P ICSLP 90 KOB Pallett D.
S., 1995, P SPOK LANG TECHN WO, P5 PETERSON GE, 1960, J ACOUST SOC AM, V32, P693, DOI 10.1121/1.1908183 ROBINSON T, 1991, CUEDFINFENTTR82 SUN DX, 1995, P INT C AC SPEECH SI, P201 UMEDA N, 1975, J ACOUST SOC AM, V58, P434, DOI 10.1121/1.380688 VANHEUVEN VJ, 1993, ANAL SYNTHESIS SPEEC VANSANTEN JPH, 1992, SPEECH COMMUN, V11, P513, DOI 10.1016/0167-6393(92)90027-5 VANSON R, 1993, THESIS U AMSTERDAM A VORSTERMANS A, 1995, P EUR 95 MADR, V2, P1397 Wang X., 1994, P I PHONETIC SCI, V18, P111 WANG X, 1996, THESIS U AMSTERDAM WANG X, 1995, SPEECH REC COD NEW A, P128 WOODLAND PC, 1993, P EUR C SPEECH COMM, V3, P2207 Young S. J., 1993, P EUR C SPEECH COMM, V3, P2203 YOUNG SJ, 1992, HTK VERSION 1 4 USER YOUNG SJ, 1994, COMPUT SPEECH LANG, V8, P369, DOI 10.1006/csla.1994.1019 ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 NR 42 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1996 VL 19 IS 2 BP 161 EP 176 DI 10.1016/0167-6393(96)00033-7 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VL932 UT WOS:A1996VL93200005 ER PT J AU Cranen, B Schroeter, J AF Cranen, B Schroeter, J TI Physiologically motivated modelling of the voice source in articulatory analysis synthesis SO SPEECH COMMUNICATION LA English DT Article DE glottal modelling; glottal leakage; articulatory synthesis of male and female speech ID FLOW AB This paper describes the implementation of a new parametric model of the glottal geometry aimed at improving male and female speech synthesis in the framework of articulatory analysis synthesis. The model represents glottal geometry in terms of inlet and outlet area waveforms and is controlled by parameters that are tightly coupled to physiology, such as vocal fold abduction. It is embedded in an articulatory analysis synthesis system (articulatory speech mimic). To introduce naturally occurring details in our synthetic glottal flow waveforms, we modelled two different kinds of leakage: a ''linked leak'' and a ''parallel chink''. While the first is basically an incomplete glottal closure, the latter models a second glottal duct that is independent of the membranous (vibrating) part of the glottis. Characteristic of both types of leak is that they increase dc flow and source/tract interaction. A linked leak, however, gives rise to a steeper roll-off of the entire glottal flow spectrum, whereas a parallel chink decreases the energy of the lower frequencies more than the higher frequencies. In fact, for a parallel chink, the slope at the higher frequencies is more or less the same as in the no-leakage case. C1 AT&T BELL LABS, ACOUST RES DEPT, MURRAY HILL, NJ 07974 USA. RP Cranen, B (reprint author), UNIV NIJMEGEN, DEPT LANGUAGE & SPEECH, POB 9103, NL-6500 HD NIJMEGEN, NETHERLANDS.
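
The two kinds of leakage in the Cranen and Schroeter record above can be pictured with a toy area waveform: a linked leak keeps the membranous glottis from closing completely (an area floor), while a parallel chink adds a constant second duct that does not vibrate. The pulse shape, open quotient and leak areas below are illustrative assumptions, not the paper's parametric model.

import numpy as np

def glottal_area(t, f0=100.0, peak=0.2, linked_leak=0.0, chink=0.0):
    # Toy glottal area waveform (cm^2): a raised-cosine pulse over a
    # 60% open phase. linked_leak floors the membranous area
    # (incomplete closure); chink adds an independent constant duct.
    phase = (t * f0) % 1.0
    pulse = np.where(phase < 0.6,
                     0.5 * peak * (1 - np.cos(2 * np.pi * phase / 0.6)),
                     0.0)
    membranous = np.maximum(pulse, linked_leak)
    return membranous + chink

t = np.arange(0, 0.04, 1 / 16000)
a_linked = glottal_area(t, linked_leak=0.02)   # incomplete closure
a_chink = glottal_area(t, chink=0.02)          # parallel leak path
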
CR Beranek L.L., 1986, ACOUSTICS CRANEN B, 1985, J ACOUST SOC AM, V77, P1543, DOI 10.1121/1.391997 CRANEN B, 1995, J PHONETICS, V23, P165, DOI 10.1016/S0095-4470(95)80040-9 CRANEN B, 1995, P 13 INT C PHON SCI, V2, P626 CRANEN B, 1987, LARYNGEAL FUNCTION P, P203 CRANEN B, 1990, P INT C SPOK LANG PR, P65 GUPTA SK, 1993, J ACOUST SOC AM, V94, P2517, DOI 10.1121/1.407364 HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829 ISHIZAKA K, 1989, UNPUB ISHIZAKA K, 1983, VOCAL FOLD PHYSL, P414 ISHIZAKA K, 1972, AT&T TECH J, V51, P1233 Schroeter J., 1991, ADV SPEECH PROCESSIN, P231 SODERSTEN M, 1994, STUDIES LOGOPEDICS P, V3 SONDHI MM, 1987, IEEE T ACOUST SPEECH, V35, P955 TITZE IR, 1984, J ACOUST SOC AM, V75, P570, DOI 10.1121/1.390530 NR 15 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1996 VL 19 IS 1 BP 1 EP 19 DI 10.1016/0167-6393(96)00016-7 PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VL361 UT WOS:A1996VL36100001 ER PT J AU Yiourgalis, N Kokkinakis, G AF Yiourgalis, N Kokkinakis, G TI A TtS system for the Greek language based on concatenation of formant coded segments SO SPEECH COMMUNICATION LA English DT Article DE synthesis by rule; formant contours; speech units; concatenation; coarticulation rules; duration rules ID WAVE; PARALLEL; CASCADE; ENGLISH AB This paper presents an operational Pitch Synchronously Excited (PSE) Formant based TtS system for the Greek language, developed at WCL. The system uses a thesaurus comprising 140 speech segments, including the Consonant (C), Vowel (V), CV and CCV types. Particular attention is paid to the concatenation scheme applied to these segments, as well as to their context-sensitive duration and the coarticulation rules written for the Greek language. The formant synthesizer runs on a DSP32C board. C1 UNIV PATRAS, SCH ENGN, WIRE COMMUN LAB, PATRAS, GREECE. CR Allen J., 1987, TEXT SPEECH MITALK S Atal B. S., 1982, Proceedings of ICASSP 82.
IEEE International Conference on Acoustics, Speech and Signal Processing ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 BENOIT C, 1992, TALKING MACHINES THE BOVES L, 1987, P 1 EUR SPEECH TECHN, V2, P385 BRIAN D, 1987, PHONOLOGY CARLSON R, 1992, P INT C AC SPEECH PR, P1043 DARSINOS V, 1993, THESIS U PATRAS DUTOIT, 1992, EUSIPCO 92 BRUSSELS, P343 EPITROPAKIS G, 1993, ESCA WORKSH PROS LUN EPITROPAKIS G, 1993, EUROSPEECH 9O BERLIN, P1999 Fant G., 1960, ACOUSTIC THEORY SPEE FANT G, 1993, SPEECH COMMUN, V13, P7, DOI 10.1016/0167-6393(93)90055-P Flanagan JL, 1973, SPEECH SYNTHESIS FRANCINI GL, 1968, J ACOUST SOC AM, V43, P1282, DOI 10.1121/1.1910980 FUJISAKI H, 1987, INT C ACOUST SPEECH, V2, P637 GRIFFIN DW, 1987, THESIS MIT CAMBRIDGE HOLMES JN, 1973, IEEE T ACOUST SPEECH, VAU21, P298, DOI 10.1109/TAU.1973.1162466 HOLMES JN, 1983, SPEECH COMMUN, V2, P251, DOI 10.1016/0167-6393(83)90044-4 HOLMES JN, 1984, P 10 INT C PHON SCI, P125 ISARD S, 1986, P IEE SPEECH INP OUT, P77 ITAKURA F, 1968, P 6 INT C AC TOK KLATT DH, 1976, J ACOUST SOC AM, V59, P1208, DOI 10.1121/1.380986 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 KLATT DH, 1980, J ACOUST SOC AM, V67, P971, DOI 10.1121/1.383940 LALWANI A, 1991, P INT C AC SPEECH SI, P777, DOI 10.1109/ICASSP.1991.150454 LINGGARD R, 1985, ELECT SYNTHESIS SPEE MAKHOUL J, 1973, IEEE T ACOUST SPEECH, VAU21, P140, DOI 10.1109/TAU.1973.1162470 MARKEL JD, 1972, IEEE T ACOUST SPEECH, VAU20, P129, DOI 10.1109/TAU.1972.1162367 MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z OLABE J, 1984, P IEEE INT C AC SPEE Oliviera L. C., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. No.92CH3103-9), DOI 10.1109/ICASSP.1992.226117 OSHAUGHNESSY D, 1989, ETRW WORKSH SPEECH S RABINER LR, 1971, AT&T TECH J, V50, P1541 RODET X, 1984, COMPUT MUSIC J, V8, P9, DOI 10.2307/3679809 STATHOPOULOU P, 1979, INT C INF SCI SYST U, V1, P506 VERHELST W, 1986, P IEEE INT C AC SPEE, P2007 VETH J, 1990, P ITN C AC SPEECH SI, P301 YIOURGALIS N, 1993, 1 TIDE C BRUSS APR YIOURGALIS N, 1991, P INT C AC SPEECH SI, P525, DOI 10.1109/ICASSP.1991.150392 Yiourgalis N., 1985, Digital Techniques in Simulation, Communication and Control. Proceedings of the IMACS European Meeting 1993, MULTILINGUAL SYSTEM 1990, DSP32 WE 1993, LINGUISTIC ANAL EURO 1990, DSP32C NR 45 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1996 VL 19 IS 1 BP 21 EP 38 DI 10.1016/0167-6393(96)00012-X PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VL361 UT WOS:A1996VL36100002 ER PT J AU YaegerDror, M AF YaegerDror, M TI Register as a variable in prosodic analysis: The case of the English negative SO SPEECH COMMUNICATION LA English DT Article DE Spontaneous speech; Read speech; Speaking style; Discourse variables; Focus ID SPEECH SYNTHESIS; SPEAKING STYLES; WORDS AB The next generation of text-to-speech systems will have to be more sensitive to sociolinguistic 'style' variables. In order to assist in the adaptation of synthesis to a wider range of contexts, this article examines several sociolinguistic parameters which have been shown to influence the realization of negatives in actual discourse, analyzing their effects on the realization of negatives in English prose readings. 
Consistent with the results found in an earlier study, the analysis shows that pitch prominence on negatives is not common in read prose passages, and is even less common in read dialogue. Informational content (and consequently pitch prominence on negatives) is more important in prose addressed to children than in narrative reading for adults, while the more formal the prose for adults, the less likely prominence is to occur. The results show a surprising absence of conformity with 'theoretical' linguistic expectations, highlighting the necessity for consideration of register as an important variable for speech synthesis. RP YaegerDror, M (reprint author), UNIV ARIZONA, DEPT LINGUIST, TUCSON, AZ 85721 USA. CR *AC SOC JAP, 1994, P ICSLP 94 YOK *AC SOC JAP, 1994, P INT S PROS YOK ADAMS K, 1991, MALE FEMALE VIOLATIO ADAMS K, 1988, LINGUISTIC CHANGE CO, P18 ADAMS K, 1990, J LANG SOC PSYCHOL, V9, P171 ADAMS K, 1991, MUST POLITICIAN TALK ARGENTE JA, 1992, SPEECH COMMUN, V11, P325, DOI 10.1016/0167-6393(92)90038-9 ATWOOD M, 1979, LIFE BEFORE MAN Atwood Margaret, 1983, BLUEBEARDS EGG, P133 Bell A., 1991, CONTEXTS ACCOMMODATI, P69, DOI 10.1017/CBO9780511663673.002 Bell Allan, 1991, LANGUAGE NEWS MEDIA BELL A, 1984, LANG SOC, V13, P145 Biber D., 1988, VARIATION SPEECH WRI BIBER D, 1994, PERSPECTIVES REGISTE BLADON A, 1987, EUROPEAN C SPEECH TE, V1, P55 Bolinger D., 1978, UNIVERSALS HUMAN LAN, V2, P471 BRAZIL D, 1984, INTONATION ACCENT RH, P46 Brown Penelope, 1978, QUESTIONS POLITENESS, P56 BRUCE G, 1992, SPEECH COMMUN, V11, P453, DOI 10.1016/0167-6393(92)90050-H CLAYMAN S, 1993, TALK WORK, P163 CLAYMAN SE, 1988, SOC PROBL, V35, P474, DOI 10.1525/sp.1988.35.4.03a00100 CLEARY B, 1981, R QUIMBY AGE 8 Cleary B., 1968, RAMONA PEST Coker C. H., 1971, Proceedings of the 7th International congress on acoustics DIPAOLO M, 1992, LANG COMMUN, V12, P267 DREW P, 1993, TALK WORK, P470 ENGSTRAND O, 1992, SPEECH COMMUN, V11, P337, DOI 10.1016/0167-6393(92)90039-A ENGSTRAND O, 1989, PERILUS, V10, P1 ENGSTRAND O, 1989, PERILUS, V10, P13 Fant G., 1989, STL QPSR, V2 FINEGAN E, 1994, PERSPECTIVES REGISTE, P315 Finegan Edward, 1994, LANGUAGE ITS STRUCTU FOWLER CA, 1988, LANG SPEECH, V31, P307 FOWLER CA, 1987, J MEM LANG, V26, P489, DOI 10.1016/0749-596X(87)90136-7 Goffman Erving, 1971, RELATIONS PUBLIC Goffman E., 1967, PRESENTATION SELF EV Goffman E., 1981, FORMS TALK GRANSTROM B, 1992, SPEECH COMMUN, V11, P347, DOI 10.1016/0167-6393(92)90040-E GRANSTROM B, 1992, SPEECH COMMUN, V11, P459, DOI 10.1016/0167-6393(92)90051-8 GREATBACH D, 1993, TALK WORK, P268 HERITAGE J, 1991, TALK SOCIAL STRUCTUR, P47 Heritage John, 1985, HDB DISCOURSE ANAL, V3, P95 Hirschberg J., 1990, P 8 NAT C ART INT, V2, P952 HIRSCHBERG J, 1993, J ACOUST SOC AM, V94, P1841, DOI 10.1121/1.407735 HORN LR, 1990, NATURAL HIST NEGATIO HUTCHBY I, 1992, SOCIOLOGY, V26, P673, DOI 10.1177/0038038592026004008 Ian Hutchby, 1992, TEXT, V12, P343, DOI 10.1515/text.1.1992.12.3.343 *ICSLP, 1995, P ICSLP 94 YOK *ICSLP, 1995, P INT S PROS YOK JEFFERSON G, 1975, LANG SOC, V3, P181 Keillor Garrison, 1985, LAKE WOBEGON DAYS KOOPMANSVANBEINUM FJ, 1992, SPEECH COMMUN, V11, P439, DOI 10.1016/0167-6393(92)90049-D Labov W., 1986, INVARIANCE VARIABILI, P402 Labov William, 1972, SOCIOLINGUISTIC PATT Labov William, 1989, LANGUAGE CHANGE VARI, P1 LABOV William, 1966, STRATIFICATION ENGLI Lehiste I., 1970, SUPRASEGMENTALS LEVAC L, 1993, NWAV C OTT 23 25 OCT Levelt W.
J., 1989, SPEAKING INTENTION A MACWHINNEY B, 1978, J VERB LEARN VERB BE, V17, P539, DOI 10.1016/S0022-5371(78)90326-2 NAGABUCHI H, 1993, J ACOUST SOC AM, V94, P1798, DOI 10.1121/1.407894 NOOTEBOOM SG, 1987, J ACOUST SOC AM, V82, P1512, DOI 10.1121/1.395195 Ochs Elinor, 1979, DISCOURSE SYNTAX, P51 OEHRLE R, 1992, P IRCS WORKSH PROS N, P139 OSHAUGHNESSY D, 1983, J ACOUST SOC AM, V74, P1155, DOI 10.1121/1.390039 Pomerantz Anita, 1989, RES LANG SOC INTERAC, V22, P293 PRICE PJ, 1991, J ACOUST SOC AM, V90, P2956, DOI 10.1121/1.401770 Prince Ellen, 1981, RADICAL PRAGMATICS, P223 Prince Ellen E, 1987, INT J SOCIOL LANG, V67, P83, DOI 10.1515/ijs1.1987.67.83 RODMAN R, 1988, LANGUAGE SPEECH MIND, P269 Sacks Harvey, 1992, COLLECTED LECT SAGISAKI Y, IN PRESS COMPUTING P SCHEGLOFF EA, 1977, LANGUAGE, V53, P361, DOI 10.2307/413107 Schegloff Emmanuel, 1989, RES LANG SOC INTERAC, V22, P215 SHOCKLEY L, 1995, U READING SPEECH RES, V9 Svartvik Jan, 1990, LONDON LUND CORPUS S TENBOSCH M, 1993, J ACOUST SOC AM, V94, P1798 Tottie Gunnel, 1991, NEGATION ENGLISH SPE TRUDGILL P, 1986, DIALECTICS CONTACT TYLER A, 1988, BREATHING LESSONS WIGHTMAN C, 1992, IEEE INT C AC SPEECH, V1, P221 WIGHTMAN CW, 1991, INT CONF ACOUST SPEE, P321, DOI 10.1109/ICASSP.1991.150341 WIGHTMAN CW, 1992, J ACOUST SOC AM, V91, P1707, DOI 10.1121/1.402450 YAEGER M, 1974, PENNSYLVANIA WORKING, V1, P1 YAEGERDROR M, 1996, LANGUAGE VARIATION C, V8 YAEGERDROR M, 1991, LANG COMMUN, V11, P309, DOI 10.1016/0271-5309(91)90035-T YAEGERDROR M, 1992, J ACOUST SOC AM, V91, pS3288 YAEGERDROR ML, 1985, LANG SPEECH, V28, P197 NR 88 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1996 VL 19 IS 1 BP 39 EP 60 DI 10.1016/0167-6393(96)00013-1 PG 22 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VL361 UT WOS:A1996VL36100003 ER PT J AU Yang, H Koh, SN Sivaprakasapillai, P Xydeas, C AF Yang, H Koh, SN Sivaprakasapillai, P Xydeas, C TI Pitch synchronous multi-band (PSMB) coding of speech signals SO SPEECH COMMUNICATION LA English DT Article AB A novel speech coding algorithm, named pitch synchronous multi-band (PSMB), is proposed. The new coding algorithm uses the multi-band excitation (MBE) model to generate a representative pitch-cycle waveform (PCW) for each frame. The representative PCW of a frame is encoded by two out of three codebooks depending upon whether the frame is related or unrelated to the previous frame. When a frame is related to its previous frame, the PCW is encoded by a length-converted-excitation (LCE) codebook and a stochastic codebook. The codevectors of the LCE codebook are derived from the previous PCW. When a frame is unrelated to its previous frame, it is encoded by a bandlimited single pulse excitation (BSPE) codebook and the stochastic codebook. The new speech coder introduces a pitch-period-based coding feature. It overcomes some weaknesses existing in the improved MBE (IMBE) speech coder. The PSMB coder operating at 4 kbps outperforms the Inmarsat 4.15 kbps IMBE coder by a clear margin. Our listening tests also indicate that it is slightly better than the FS1016 4.8 kbps code excited linear predictive (CELP) coder in terms of perceptual quality. Fast search algorithms for the three codebooks used in PSMB are also developed. The fast algorithms render the new speech coder comparable to the FS1016 CELP coder, in terms of computational complexity. 
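The frame-relatedness decision that drives the PSMB codebook selection above can be pictured in a short sketch. Below is a minimal Python illustration under stated assumptions: a cycle-averaging PCW extractor and a normalised-correlation test with an arbitrary threshold stand in for the coder's actual relatedness criterion and length conversion; all names are illustrative, not the authors' implementation.

```python
# Minimal sketch of PCW extraction and codebook-branch selection.
# The averaging extractor, the correlation test and RELATED_THRESHOLD
# are illustrative assumptions, not the PSMB coder's exact method.
import numpy as np

RELATED_THRESHOLD = 0.5  # assumed decision threshold

def extract_pcw(frame, pitch_period):
    """Average the pitch-aligned cycles of a voiced frame into one
    representative pitch-cycle waveform (PCW)."""
    n_cycles = len(frame) // pitch_period  # whole cycles in the frame
    if n_cycles == 0:
        raise ValueError("frame shorter than one pitch period")
    cycles = frame[:n_cycles * pitch_period].reshape(n_cycles, pitch_period)
    return cycles.mean(axis=0)

def choose_codebooks(pcw, prev_pcw):
    """Frames 'related' to the previous frame use the LCE codebook,
    unrelated frames the BSPE codebook; both add the stochastic codebook."""
    if prev_pcw is not None and len(prev_pcw) == len(pcw):
        r = pcw @ prev_pcw / (np.linalg.norm(pcw) * np.linalg.norm(prev_pcw) + 1e-12)
        if r > RELATED_THRESHOLD:
            return ("LCE", "stochastic")
    return ("BSPE", "stochastic")
```

In the real coder the previous PCW is length-converted to the current pitch period before comparison; the equal-length check here is a simplification.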
C1 NANYANG TECHNOL UNIV, SCH ELECT & ELECT ENGN, SINGAPORE 2263, SINGAPORE. UNIV MANCHESTER, DEPT ELECT ENGN, MANCHESTER M13 9PL, LANCS, ENGLAND. CR Campbell J. P. Jr., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266532 *DIG VOIC SYST INC, 1991, INM M VOIC COD VERS FENICHEL R, 1990, 1016 OFF TECHN STAND GRIFFIN DW, 1985, P IEEE INT C AC SPEE, P513 GRIFFIN DW, 1988, IEEE T ACOUST SPEECH, V36, P1223, DOI 10.1109/29.1651 KLEIJN WB, 1993, IEEE T SPEECH AUDIO, V1, P386 KROON P, 1989, P IEEE INT C AC SPEE, P735 MCAULAY RJ, 1986, IEEE T ACOUST SPEECH, V34, P744, DOI 10.1109/TASSP.1986.1164910 MEUSE PC, 1990, INT CONF ACOUST SPEE, P9, DOI 10.1109/ICASSP.1990.115524 NISHIGUCHI M, 1993, P IEEE INT C AC SPEE, P151 Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE Rabiner L.R., 1978, DIGITAL PROCESSING S Schroeder M., 1985, P IEEE INT C AC SPEE, P937 Soong F., 1984, P IEEE INT C AC SPEE, P1101 SOONG FK, 1988, IEEE INT S INF THEO Soong FK, 1993, IEEE T SPEECH AUDI P, V1, P15, DOI 10.1109/89.221364 YANG H, 1994, UNPUB IEE EL LETT YANG HY, 1995, INT CONF ACOUST SPEE, P516, DOI 10.1109/ICASSP.1995.479642 NR 18 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUL PY 1996 VL 19 IS 1 BP 61 EP 80 DI 10.1016/0167-6393(96)00027-1 PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VL361 UT WOS:A1996VL36100004 ER PT J AU Liu, QG Champagne, B Kabal, P AF Liu, QG Champagne, B Kabal, P TI A microphone array processing technique for speech enhancement in a reverberant space SO SPEECH COMMUNICATION LA English DT Article ID ROOM IMPULSE-RESPONSE; FOURIER-TRANSFORM; INVERTIBILITY; ENVIRONMENTS; ACOUSTICS AB In this paper, a new microphone array processing technique is proposed for blind dereverberation of speech signals affected by room acoustics. It is based on the separate processing of the minimum-phase and all-pass components of delay-steered multi-microphone signals. The minimum-phase components are processed in the cepstrum-domain, where spatial averaging followed by low-time filtering is applied. The all-pass components, which contain the source location information, are processed in the frequency-domain by performing spatial averaging and by retaining only the all-pass component of the resulting output. The underlying motivation for the new processor is to use spatio-temporal processing over a single set of synchronous speech segments from several microphones to reconstruct the source speech, such that it is applicable to practical time-variant acoustic environments. Simulated room impulse responses are used to evaluate the new processor and to compare it to a conventional beamformer. Significant improvements in array gain and important reductions of reverberation in listening tests are observed. C1 UNIV QUEBEC, INRS TELECOMMUN, VERDUN, PQ H3E 1H6, CANADA. MCGILL UNIV, DEPT ELECT ENGN, MONTREAL, PQ H3A 2A7, CANADA. CR ALLEN JB, 1977, IEEE T ACOUST SPEECH, V25, P235, DOI 10.1109/TASSP.1977.1162950 ALLEN JB, 1979, J ACOUST SOC AM, V65, P943, DOI 10.1121/1.382599 ALLEN JB, 1977, J ACOUST SOC AM, V62, P912, DOI 10.1121/1.381621 Bees D., 1991, P IEEE INT C AC SPEE, P977, DOI 10.1109/ICASSP.1991.150504 Carter G.
C., 1993, COHERENCE TIME DELAY DOWLING EM, 1992, MICROPROCESS MICROSYS, V16, P507, DOI 10.1016/0141-9331(92)90080-D FLANAGAN JL, 1991, ACUSTICA, V73, P58 FLANAGAN JL, 1985, AT&T TECH J, V64, P983 FLANAGAN JL, 1985, J ACOUST SOC AM, V78, P1508, DOI 10.1121/1.392786 GOODWIN MM, 1993, P INT C AC SPEECH SI GRENIER Y, 1993, SPEECH COMMUN, V12, P25, DOI 10.1016/0167-6393(93)90016-E KELLERMANN W, 1991, P IEEE INT C AC SPEE, V5, P3581 MIYOSHI M, 1988, IEEE T ACOUST SPEECH, V36, P145, DOI 10.1109/29.1509 MOURJOPOULOS J, 1985, J SOUND VIB, V102, P217, DOI 10.1016/S0022-460X(85)80054-7 NEELY ST, 1979, J ACOUST SOC AM, V66, P165, DOI 10.1121/1.383069 Oppenheim A. V., 1975, DIGITAL SIGNAL PROCE PETERSON PM, 1986, J ACOUST SOC AM, V80, P1527, DOI 10.1121/1.394357 PIRZ F, 1979, AT&T TECH J, V58, P1839 PORTNOFF MR, 1976, IEEE T ACOUST SPEECH, V24, P243, DOI 10.1109/TASSP.1976.1162810 Silverman H. F., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90023-W SYDOW C, 1994, J ACOUST SOC AM, V96, P845, DOI 10.1121/1.410323 Tanaka M., 1993, Journal of the Acoustical Society of Japan (E), V14 TOHYAMA M, 1993, P INT C AC SPEECH SI VANCOMPERNOLLE D, 1990, SPEECH COMMUN, V9, P433, DOI 10.1016/0167-6393(90)90019-6 WALSH JP, 1985, J ACOUST SOC AM, V77, P547, DOI 10.1121/1.392377 NR 25 TC 19 Z9 20 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 1996 VL 18 IS 4 BP 317 EP 334 DI 10.1016/0167-6393(96)00011-8 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VD486 UT WOS:A1996VD48600001 ER PT J AU Siohan, O Gong, YF Haton, JP AF Siohan, O Gong, YF Haton, JP TI Comparative experiments of several adaptation approaches to noisy speech recognition using stochastic trajectory models SO SPEECH COMMUNICATION LA English DT Article AB The paper describes experiments on noisy speech recognition, using acoustic models based on the framework of Stochastic Trajectory Models (STM). We present the theoretical framework of 4 different approaches dealing with speech model adaptation: model-specific linear regression, speech feature space transformation, noise and speech models combination, STM state-based filtering. Experiments are performed on a speaker-dependent, 1011 word continuous speech recognition application with a word-pair perplexity of 28, using vocabulary-independent acoustic training, context independent phone models, and in various noisy testing environments. To measure the performance of each approach, recognition rate variation is studied under different noise types and noise levels. Our results show that the linear regression approach significantly outperforms the other methods, for every tested noise type at medium SNRs (between 6 and 24 dB). For the Gaussian noise, with an SNR between 6 and 24 dB, we observe a reduction of the word error rate from 20% to 59% when the linear regression is used, compared to the other methods. C1 CTR RECH INFORMAT NANCY, CNRS, F-54506 VANDOEUVRE LES NANCY, FRANCE. INRIA LORRAINE, F-54506 VANDOEUVRE LES NANCY, FRANCE.
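The winning linear-regression adaptation can be sketched compactly: move the Gaussian mean vectors of the acoustic models with a single affine transform estimated on adaptation data. The least-squares estimator below is only an illustration of the idea, not the paper's model-specific procedure; all function names are assumptions.

```python
# Minimal sketch of linear-regression model adaptation: estimate an
# affine map (W, b) from clean-model means to their noisy counterparts,
# then apply it to all means. Least squares is an assumed estimator.
import numpy as np

def estimate_affine_transform(clean_means, noisy_means):
    """Fit W, b such that noisy ~= clean @ W.T + b (least squares).
    clean_means, noisy_means: (n_means, dim) arrays."""
    X = np.hstack([clean_means, np.ones((len(clean_means), 1))])
    theta, *_ = np.linalg.lstsq(X, noisy_means, rcond=None)
    return theta[:-1].T, theta[-1]  # W (dim x dim), b (dim,)

def adapt_means(means, W, b):
    """Move the acoustic-model means towards the noisy environment."""
    return means @ W.T + b
```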
CR DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 Erell A, 1993, IEEE T SPEECH AUDI P, V1, P84, DOI 10.1109/89.221370 GALES MJF, 1993, SPEECH COMMUN, V12, P231, DOI 10.1016/0167-6393(93)90093-Z GONG Y, 1992, P INT C SPOK LANG PR, V2, P863 GONG Y, 1993, P EUR C SPEECH COMM, V3, P1759 GONG Y, 1994, P IEEE INT C AC SPEE, V1, P57 GONG YF, 1995, SPEECH COMMUN, V16, P261, DOI 10.1016/0167-6393(94)00059-J GRAY RM, 1980, IEEE T ACOUST SPEECH, V28, P367, DOI 10.1109/TASSP.1980.1163421 LEGGETTER CJ, 1994, FINFENGTR181 CUED CA MONTACIE C, 1988, NATO ASI SERIES F, V46 SIOHAN O, 1995, P EUR C SPEECH COMM SIOHAN O, 1994, P INT C SPOK LANG PR, V3, P1031 TREURNIET WC, 1994, P IEEE INT C AC SPEE, V1, P437 VANCOMPERNOLLE D, 1989, P IEEE INT C AC SPEE, P258 VARGA AP, 1992, NOISEX 92 STUDY EFFE XIE F, 1993, P EUR C SPEECH COMM, V1, P617 XIE F, 1994, P IEEE INT C AC SPEE, V2, P53 YOUNG SJ, 1992, HTK HIDDEN MARKOV MO NR 19 TC 5 Z9 6 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 1996 VL 18 IS 4 BP 335 EP 352 DI 10.1016/0167-6393(96)00015-5 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VD486 UT WOS:A1996VD48600002 ER PT J AU Arslan, LM Hansen, JHL AF Arslan, LM Hansen, JHL TI Language accent classification in American English SO SPEECH COMMUNICATION LA English DT Article ID FRENCH AB It is well known that speaker variability caused by accent is one factor that degrades performance of speech recognition algorithms. If knowledge of speaker accent can be estimated accurately, then a modified set of recognition models which addresses speaker accent could be employed to increase recognition accuracy. In this study, the problem of language accent classification in American English is considered. A database of foreign language accent is established that consists of words and phrases that are known to be sensitive to accent. Next, isolated word and phoneme based accent classification algorithms are developed. The feature set under consideration includes Mel-cepstrum coefficients and energy, and their first order differences. It is shown that as test utterance length increases, higher classification accuracy is achieved. Isolated word strings of 7-8 words uttered by the speaker result in an accent classification rate of 93% among four different language accents. A subjective listening test is also conducted in order to compare human performance with computer algorithm performance in accent discrimination. The results show that computer based accent classification consistently achieves superior performance over human listener responses for classification. It is shown, however, that some listeners are able to match algorithm performance for accent detection. Finally, an experimental study is performed to investigate the influence of foreign accent on speech recognition algorithms. It is shown that training separate models for each accent rather than using a single model for each word can improve recognition accuracy dramatically. C1 DUKE UNIV, DEPT ELECT ENGN, ROBUST SPEECH PROC LAB, DURHAM, NC 27708 USA. RI Arslan, Levent/D-6377-2015 OI Arslan, Levent/0000-0002-6086-8018 CR ASHER, 1969, MOD LANG J, V38, P334 Barry W.
J., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90003-X Berkling K., 1994, P IEEE INT C AC SPEE, V1, P289 BERKLING KM, 1994, P INT C SPOK LANG PR, V4, P1891 Chreist F., 1964, FOREIGN ACCENT DELLER J, 1993, MACMILLAN SERIES PRE FLEGE JE, 1984, J ACOUST SOC AM, V76, P692, DOI 10.1121/1.391256 FLEGE JE, 1988, J ACOUST SOC AM, V84, P70, DOI 10.1121/1.396876 GROVER C, 1987, LANG SPEECH, V30, P277 GUPTA V, 1982, J ACOUST SOC AM, V71, P1581, DOI 10.1121/1.387812 HANSEN JHL, 1995, INT CONF ACOUST SPEE, P836, DOI 10.1109/ICASSP.1995.479824 HAZEN TJ, 1994, P INT C SPOK LANG PR, V4, P1883 HOUSE AS, 1977, J ACOUST SOC AM, V62, P708, DOI 10.1121/1.381582 Ljolje A., 1987, Computer Speech and Language, V2, DOI 10.1016/0885-2308(87)90025-8 MUTHUSAMY YK, 1994, IEEE SIGNAL PROC OCT, P33 PIPER T, 1988, CAN MOD LANG REV, V44, P334 RABINER LR, 1977, IEEE T ACOUST SPEECH, V27, P583 TAHTA S, 1981, LANG SPEECH, V24, P265 Zissman M. A., 1993, P IEEE INT C AC SPEE, P399 ZISSMAN MA, 1995, P INT C SPOK LANG PR, V5, P3503 NR 20 TC 47 Z9 47 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 1996 VL 18 IS 4 BP 353 EP 367 DI 10.1016/0167-6393(96)00024-6 PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VD486 UT WOS:A1996VD48600003 ER PT J AU Torres, MI Iparraguirre, P AF Torres, MI Iparraguirre, P TI Acoustic parameters for place of articulation identification and classification of Spanish unvoiced stops SO SPEECH COMMUNICATION LA English DT Article ID FORMANT TRANSITIONS; CONSONANTS; PERCEPTION AB The analysis of the acoustic parameters which best summarize the cues to phone discrimination for the language under consideration should be a preliminary step in acoustic-phonetic decoding, regardless of the methodology to be used. The Spanish language has not been widely analyzed from this point of view. This work deals with the acoustic discrimination of Spanish stop consonants. Our main goal was to find a reliable and reduced set of parameters for place of articulation identification of Spanish unvoiced stops. On the basis of the obtained parameters, two automatic classifiers were developed and tested. Only the acoustic features of the burst segment, automatically segmented from the speech waveform, were considered in the parameter estimation. The analysis of these features was carried out in both the time and frequency domains over a CV context corpus uttered by 6 speakers. In the first case, the classifier was designed as a procedural form. Alternatively, in the second case a statistical classifier was obtained from a previous automatic discriminant analysis of the parameters. Both classifiers were tested over a CV context corpus uttered by 40 new speakers not included in the analysis corpus, which resulted in a good rate of identification. C1 UNIV BASQUE COUNTRY, DPTO ARQUITECTURA COMPUTADORES, SAN SEBASTIAN 20080, SPAIN. RP Torres, MI (reprint author), UNIV BASQUE COUNTRY, DPTO ELECT & ELECT, APDO 644, E-48080 BILBAO, SPAIN. RI Torres, Maria Ines/M-5490-2013 OI Torres, Maria Ines/0000-0002-1773-3214 CR BENEDI JM, 1992, CONTEXTUAL FACTORS P BENEDI JM, 1994, NOVATICA, V56, P27 BENEDI JM, 1991, WP1T6 ROARS BENGIO Y, 1992, SPEECH COMMUN, V11, P261, DOI 10.1016/0167-6393(92)90020-8 BENLLOCH I, 1992, WP1T7 ROARS BLUMSTEIN SE, 1979, J ACOUST SOC AM, V66, P1001, DOI 10.1121/1.383319 BLUMSTEIN SE, 1982, J ACOUST SOC AM, V72, P43, DOI 10.1121/1.388023 Bush M.
A., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing CASTANEDA ML, 1986, ESTUDIOS FONETICA EX, V2, P91 CELDRAN EM, 1986, FONETICA CROWE A, 1987, ELECTRON LETT, V23, P1019, DOI 10.1049/el:19870714 EDERVEEN D, 1991, P EUROSPEECH, P42 FANT G, 1973, SPEECH SOUNDS FEATUR, P111 FISSORE L, 1991, P EUROSPEECH, P1389 Galiano I., 1994, International Journal of Pattern Recognition and Artificial Intelligence, V8, DOI 10.1142/S0218001494000073 GURLEKIAN JA, 1985, J ACOUST SOC JAPAN, V65, P271 HATON JP, 1988, RECENT ADV SPEECH UN KEWLEYPORT D, 1983, J ACOUST SOC AM, V73, P1779, DOI 10.1121/1.389402 KEWLEYPORT D, 1983, J ACOUST SOC AM, V73, P322, DOI 10.1121/1.388813 KEWLEYPORT D, 1982, J ACOUST SOC AM, V72, P379, DOI 10.1121/1.388081 KOBATAKE H, 1987, J ACOUST SOC AM, V81, P1146, DOI 10.1121/1.394635 Lee C. H., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90002-N MARIANI J, 1989, P IEEE INT C AC SPEE, P429 NATHAN KS, 1991, P EUROSPEECH, P147 NEY H, 1991, P EUROSPEECH, P193 POCH D, 1984, FOLIA PHONETICA, V1, P89 POLS LCW, 1985, J ACOUST SOC AM, V78, P322 QUILIS A, 1989, FONETICA ACUSTICA LE REPP BH, 1989, J ACOUST SOC AM, V85, P379, DOI 10.1121/1.397689 SCHWARTZ RM, 1988, RECENT ADV SPEECH UN TARTTER VC, 1983, J ACOUST SOC AM, V74, P715, DOI 10.1121/1.389857 TORRES I, 1993, P EUROSPEECH, P457 TORRES I, 1994, ADV PATTERN RECOGNIT, P207 TORRES I, 1990, THESIS U PAIS VASCO ZAHORIAN SA, 1987, M AC SOC AM, V81, P1 NR 35 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 1996 VL 18 IS 4 BP 369 EP 379 DI 10.1016/0167-6393(96)00025-8 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VD486 UT WOS:A1996VD48600004 ER PT J AU Benoit, C Grice, M Hazan, V AF Benoit, C Grice, M Hazan, V TI The SUS test: A method for the assessment of text-to-speech synthesis intelligibility using Semantically Unpredictable Sentences SO SPEECH COMMUNICATION LA English DT Article AB This paper describes the experimental set-up used by the SAM (ESPRIT-BRA Project no. 2589: Multilingual Speech Input/Output: Assessment, Methodology and Standardisation) group for evaluating the intelligibility of text-to-speech systems at sentence level. The SUS test measures overall intelligibility of Semantically Unpredictable Sentences which can be automatically generated using five basic syntactic structures and a number of lexicons containing the most frequently occurring mono-syllabic words in each language. The sentence material has the advantage of not being fixed, as words can be extracted from the lexicons randomly to form a new set of sentences each time the test is run. Various text-to-speech systems in a number of languages have been evaluated using this test. Results have demonstrated that the SUS test is effective and that it allows for reliable comparison across synthesisers provided guidelines are followed carefully regarding the definition of the test material and actual running of the test. These recommendations are the result of experience gained during the SAM project and beyond. They are presented here so as to provide users with a standardized evaluation method which is flexible and easy to use and is applicable to a number of different languages. C1 UNIV SAARLAND, D-6600 SAARBRUCKEN, GERMANY. UNIV LONDON UNIV COLL, DEPT PHONET & LINGUIST, LONDON NW1 2HE, ENGLAND.
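Because the SUS material is generated rather than fixed, the whole sentence generator fits in a few lines. A minimal Python sketch with a toy lexicon and one assumed template follows; the actual test uses five calibrated syntactic structures and the most frequent mono-syllabic words of each language, neither of which is reproduced here.

```python
# Minimal SUS-style generator: fill one syntactic template with words
# drawn at random from per-category lexicons. The lexicon entries and
# the template are toy examples, not the SAM test material.
import random

LEXICON = {
    "DET":  ["the"],
    "ADJ":  ["strong", "plain", "bright"],
    "NOUN": ["chair", "fact", "sound", "plane"],
    "VERB": ["grows", "hears", "spends"],
}
TEMPLATE = ["DET", "ADJ", "NOUN", "VERB", "DET", "NOUN"]  # one assumed structure

def generate_sus(template=TEMPLATE, lexicon=LEXICON):
    """Draw one word per slot, giving a syntactically well-formed but
    semantically unpredictable sentence."""
    words = [random.choice(lexicon[slot]) for slot in template]
    return " ".join(words).capitalize() + "."

print(generate_sus())  # e.g. "The plain fact hears the chair."
```

Randomly re-drawing the words each run is what keeps the test material from being learned by listeners, which is the property the paper stresses.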
RP Benoit, C (reprint author), UNIV GRENOBLE 3, ENSERG,INPG,INST COMMUN PARLEE,UPRESA CNRS 5009, BP 25X, F-38040 GRENOBLE 9, FRANCE. RI Hazan, Valerie/C-9722-2009 OI Hazan, Valerie/0000-0001-6572-6679 CR BENOIT C, 1990, SPEECH COMMUN, V9, P293, DOI 10.1016/0167-6393(90)90005-T BENOIT C, 1992, P 2 INT C SPOK LANG, V2, P999 Benoit C., 1995, P127 BENOIT C, 1989, P EUROSPEECH 89 C ES, V2, P633 BENOIT C, 1989, P ESCA WORKSH SPEECH BENOIT C, 1992, TALKING MACHINES THE, P435 Chomsky Noam, 1957, SYNTACTIC STRUCTURES FOUREIN AJF, 1992, TALKING MACHINES THE, P431 GRICE M, 1989, SPEECH HEARING LANGU, V3, P107 HAZAN V, 1993, P EUROSPEECH 93 C ES, V3, P1849 HAZAN V, 1989, P ESCA WORKSH SPEECH HOLMES JN, 1964, LANG SPEECH, V7, P127 JEKOSCH U, 1992, P 2 INT C SPOK LANG, V1, P205 JEKOSCH U, 1994, J AM VOICE I O SOC, V15, P63 KALIKOW DN, 1977, J ACOUST SOC AM, V61, P1337, DOI 10.1121/1.381436 MILLER GA, 1963, J VERB LEARN VERB BE, V2, P217, DOI 10.1016/S0022-5371(63)80087-0 NYE PW, 1974, HASKINS LAB STAT REP, V37, P169 POLS LCW, 1992, P 2 INT C SPOK LANG, V1, P181 Sotscheck J, 1982, FERNMELDE INGENIEUR, V36, P1 NR 19 TC 51 Z9 51 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JUN PY 1996 VL 18 IS 4 BP 381 EP 392 DI 10.1016/0167-6393(96)00026-X PG 12 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA VD486 UT WOS:A1996VD48600005 ER PT J AU Sorin, C Mokbel, C AF Sorin, C Mokbel, C TI Untitled SO SPEECH COMMUNICATION LA English DT Editorial Material NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 203 EP 203 DI 10.1016/S0167-6393(96)90034-5 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900001 ER PT J AU Bourlard, H Hermansky, H Morgan, N AF Bourlard, H Hermansky, H Morgan, N TI Towards increasing speech recognition error rates SO SPEECH COMMUNICATION LA English DT Review ID HIDDEN MARKOV-MODELS; WORD RECOGNITION; DECONVOLUTION AB In the field of Automatic Speech Recognition (ASR) research, it is conventional to pursue those approaches that reduce the word error rate. However, it is the authors' belief that this seemingly sensible strategy often leads to the suppression of innovation. The leading approaches to ASR have been tuned for years, effectively optimizing on test data for a local minimum in the space of available techniques. In this case, almost any sufficiently new approach will necessarily hurt the accuracy of existing systems and thus increase the error rate. However, if progress is to be made against the remaining difficult problems, new approaches will most likely be necessary. In this paper, we discuss some research directions for ASR that may not always yield an immediate and guaranteed decrease in error rate but which hold some promise for ultimately improving performance in the end applications. Issues that will be addressed in this paper include: discrimination between rival utterance models, the role of prior information in speech recognition, merging the language and acoustic models, feature extraction and temporal information, and decoding procedures reflecting human perceptual properties. C1 FAC POLYTECH MONS, B-7000 MONS, BELGIUM. UNIV CALIF BERKELEY, BERKELEY, CA 94720 USA. OREGON GRAD INST, PORTLAND, OR USA.
RP Bourlard, H (reprint author), INT COMP SCI INST, BERKELEY, CA 94704 USA. CR AHADI SM, 1995, P IEEE INT C AC SPEE, P684 AIKAWA K, 1993, P INT C AC SIGN SPEE, P668 AITKIN LM, 1966, J NEUROPHYSIOL, V29, P109 Allen JB, 1994, IEEE T SPEECH AUDI P, V2, P567, DOI 10.1109/89.326615 ANASTASAKOS A, 1994, P IEEE INT C AC SPEE, P433 APPLEBAUM TH, 1991, P IEEE INT C AC SPEE, P985, DOI 10.1109/ICASSP.1991.150506 Asadi A., 1990, P ICASSP 90, P125 AUSTIN S, 1992, P DARPA SPEECH NAT L, P180, DOI 10.3115/1075527.1075567 Bahl L. R., 1988, P ICASSP 88 NEW YORK, pA93 BAKER JK, 1975, IEEE T ACOUST SPEECH, VAS23, P24, DOI 10.1109/TASSP.1975.1162650 BENGIO Y, 1995, ADV NEURAL INFORMATI, V7 BENGIO Y, 1992, IEEE T NEURAL NETWOR, V3, P252, DOI 10.1109/72.125866 BOUNDS A, 1995, COMMUNICATION BOURLARD H, 1995, P EUR 95 MADR BOURLARD H, 1996, P ARPA SPEECH REC WO BOURLARD H, 1994, TR94064 ICSI BOURLARD H, 1990, IEEE T PATTERN ANAL, V12, P1167, DOI 10.1109/34.62605 Bourlard Ha, 1994, CONNECTIONIST SPEECH Bridle J., 1974, 1003 JSRU BRIDLE JS, 1995, COMMUNICATION BROWN P, THESIS C MELLON U CHISTOVICH LA, 1985, J ACOUST SOC AM, V77, P789, DOI 10.1121/1.392049 COHEN JR, 1995, COMMUNICATION COHEN JR, 1989, J ACOUST SOC AM, V85, P2623, DOI 10.1121/1.397756 COHEN M, 1992, P INT C SPOK LANG PR, P915 DAVIS SB, 1980, IEEE T ACOUST SPEECH, V28, P357, DOI 10.1109/TASSP.1980.1163420 DELANOUE P, 1989, 11226 ATT DELLAPIETRA S, 1992, P DARPA SPEECH NAT L, P103, DOI 10.3115/1075527.1075551 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 Deng L, 1994, IEEE T SPEECH AUDI P, V2, P507 DODDINGTON G, 1989, P IEEE INT C AC SPEE, P556 DUCHNOWSKI P, 1993, THESIS MIT ELENIUS K, 1985, P IEEE INT C AC SPEE, P535 FANTY M, 1993, P IEEE INT C AC SPEE, P1 Fletcher H., 1953, SPEECH HEARING COMMU FRANZINI M, 1990, INT CONF ACOUST SPEE, P425, DOI 10.1109/ICASSP.1990.115733 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 FURUI S, 1986, IEEE T ACOUST SPEECH, V34, P52, DOI 10.1109/TASSP.1986.1164788 Ghitza O., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1005 Gish H., 1990, P IEEE INT C AC SPEE, P1361 GOLDENTHAL W, 1994, THESIS MIT GOOD IJ, 1963, ANN MATH STAT, V34, P911, DOI 10.1214/aoms/1177704014 GOPALARISHNAN PS, 1988, P IEEE ICASSP, P20 GREEN GGR, 1976, THESIS OXFORD UK Greene H.W., 1988, Biology of Reptilia, V16, P1 HAEBUMBACH R, 1994, P IEEE INT C AC SPEE, P239 HANSON B, 1984, P IEEE INT C AC SPEE HAUENSEIN A, 1995, P IEEE INT C AC SPEE, P425 Hermansky H, 1994, IEEE T SPEECH AUDI P, V2, P578, DOI 10.1109/89.326616 HERMANSKY H, 1995, P 15 INT C AC TRONDH, V2, P61 Hermansky H., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) Hermansky H., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing HERMANSKY H, 1995, P INT C AC SPEECH SI, P405 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 HOCHBERG MM, 1995, P IEEE INT C AC SPEE, P69 Hunt M. J., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. 
No.89CH2673-2), DOI 10.1109/ICASSP.1989.266415 ITAHASHI S, 1976, INT C AC SPEECH SIGN, P310 JANSEEN RDT, 1991, P INT JOINT C NEUR N, P801 JELINEK F, 1976, P IEEE, V64, P532, DOI 10.1109/PROC.1976.10159 Jelinek F., 1991, P DARPA SPEECH NAT L, P293, DOI 10.3115/112405.112464 JELINEK F, 1975, IEEE T INFORM THEORY, V21, P250, DOI 10.1109/TIT.1975.1055384 JORDAN MI, 1994, NEURAL COMPUT, V6, P181, DOI 10.1162/neco.1994.6.2.181 JUANG BH, 1985, IEEE T ACOUST SPEECH, V33, P1404 KAMM T, 1995, P 15 ANN SPEECH RES, P175 Katagiri S., 1991, P IEEE WORKSH NEUR N, P299 Klatt D. H, 1982, REPRESENTATION SPEEC, P181 LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902 LEVIN E, 1993, IEEE T NEURAL NETWOR, V4, P109, DOI 10.1109/72.182700 LIM JS, 1979, IEEE T ACOUST SPEECH, V27, P223 LIPORACE LA, 1982, IEEE T INFORM THEORY, V28, P729, DOI 10.1109/TIT.1982.1056544 LUBENSKY DM, 1994, P INT C SPOK LANG PR Makino S., 1983, Proceedings of ICASSP 83. IEEE International Conference on Acoustics, Speech and Signal Processing Mermelstein P., 1976, PATTERN RECOGN, P374 MORGAN N, 1990, INT CONF ACOUST SPEE, P413, DOI 10.1109/ICASSP.1990.115720 MORGAN N, 1995, P EUR 95 MADR, P771 MORGAN N, 1995, INT CONF ACOUST SPEE, P397, DOI 10.1109/ICASSP.1995.479605 MORGAN N, 1995, P IEEE, V83, P741 NEUMEYER L, 1994, INT CONF ACOUST SPEE, P417 Ostendorf M., 1992, ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech and Signal Processing (Cat. No.92CH3103-9), DOI 10.1109/ICASSP.1992.225890 PAUL DB, 1991, INT CONF ACOUST SPEE, P569, DOI 10.1109/ICASSP.1991.150403 PAVEL M, 1994, J ACOUST SOC AM, V95, P2876, DOI 10.1121/1.409409 POLS LCW, 1969, J ACOUST SOC AM, V46, P458, DOI 10.1121/1.1911711 POLS LCW, 1971, IEEE T COMPUT, VC 20, P972, DOI 10.1109/T-C.1971.223391 PORITZ AB, 1982, APR P IEEE INT C AC, P1291 PORITZ AB, 1986, P IEEE INT C AC SPEE, P705 PORTER JE, 1984, P IEEE INT C AC SPEE RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 RENALS S, 1995, P IEEE INT C AC SPEE, P596 Richard M. D., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.4.461 ROBINSON T, 1993, P EUR 93 BERL, P1941 Robinson T., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90010-N Rosenberg A. E., 1994, P INT C SPOK LANG PR, P1835 RUSSELL MJ, 1990, P IEEE INT C AC SPEE, P69 SCHWARTZ R, 1992, P IEEE INT C AC SPEE SENEFF S, 1988, J PHONETICS, V16, P55 SINGER E, 1992, P IEEE INT C AC SPEE, P629, DOI 10.1109/ICASSP.1992.225830 STEENEKEN JM, 1995, P EUR 95 MADR, P1271 STEVENS JC, 1966, PERCEPT PSYCHOPHYS, V1, P319, DOI 10.3758/BF03215796 STEVENS SS, 1957, PSYCHOL REV, V64, P153, DOI 10.1037/h0046162 STOCKHAM TG, 1975, P IEEE, V63, P678, DOI 10.1109/PROC.1975.9800 TAPPERT CC, 1978, IEEE T ACOUST SPEECH, V26, P583, DOI 10.1109/TASSP.1978.1163149 TEBELSKIS J, 1991, P IEEE INT C AC SPEE, P61, DOI 10.1109/ICASSP.1991.150278 WAIBEL A, 1988, P IEEE ICASSP NEW YO, P107 Wellekens C. J., 1987, Proceedings: ICASSP 87. 1987 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.87CH2396-0) ZUE VW, 1985, P IEEE, V73, P1602, DOI 10.1109/PROC.1985.13342 ZWICKER E, 1975, HDB SENSORY PHYSL, V3, P401 NR 106 TC 36 Z9 36 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAY PY 1996 VL 18 IS 3 BP 205 EP 231 DI 10.1016/0167-6393(96)00003-9 PG 27 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900002 ER PT J AU Atal, BS AF Atal, BS TI Towards increasing speech recognition error rates - Comment SO SPEECH COMMUNICATION LA English DT Editorial Material RP Atal, BS (reprint author), AT&T BELL LABS, SPEECH RES DEPT, ROOM 2D-535, 600 MT AVE, MURRAY HILL, NJ 07974 USA. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 233 EP 233 DI 10.1016/0167-6393(96)00005-2 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900003 ER PT J AU DeMori, R AF DeMori, R TI Towards increasing speech recognition error rates - Comment SO SPEECH COMMUNICATION LA English DT Editorial Material RP DeMori, R (reprint author), MCGILL UNIV, SCH COMP SCI, 3480 UNIV ST, MONTREAL, PQ H3A 2A7, CANADA. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 234 EP 235 DI 10.1016/0167-6393(96)00006-4 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900004 ER PT J AU Flanagan, J AF Flanagan, J TI Towards increasing speech recognition error rates - Comment SO SPEECH COMMUNICATION LA English DT Editorial Material RP Flanagan, J (reprint author), RUTGERS STATE UNIV, CTR CAIP, POB 1390, PISCATAWAY, NJ 08855 USA. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 236 EP 237 DI 10.1016/0167-6393(96)00007-6 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900005 ER PT J AU Furui, S AF Furui, S TI Towards increasing speech recognition error rates - Comment SO SPEECH COMMUNICATION LA English DT Editorial Material C1 TOKYO INST TECHNOL, MEGURO KU, TOKYO 152, JAPAN. RP Furui, S (reprint author), NIPPON TELEGRAPH & TEL PUBL CORP, HUMAN INTERFACE LABS, 3-9-11 MIDORI CHO, MUSASHINO, TOKYO 180, JAPAN. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 238 EP 238 DI 10.1016/0167-6393(96)00008-8 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900006 ER PT J AU Haton, JP AF Haton, JP TI Towards increasing speech recognition error rates - Comment SO SPEECH COMMUNICATION LA English DT Editorial Material RP Haton, JP (reprint author), CRIN, INRIA, NANCY, FRANCE. CR AFIFY M, 1995, EUROSPEECH 95, P515 BOURLARD H, 1995, EUROSPEECH 95, P883 GONG YF, 1994, INT CONF ACOUST SPEE, P57 NR 3 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAY PY 1996 VL 18 IS 3 BP 239 EP 239 DI 10.1016/0167-6393(96)00019-2 PG 1 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900007 ER PT J AU Hunt, MJ AF Hunt, MJ TI Towards increasing speech recognition error rates - Comment SO SPEECH COMMUNICATION LA English DT Editorial Material RP Hunt, MJ (reprint author), DRAGON SYST UK LTD, CHELTENHAM GL52 4RW, GLOS, ENGLAND. CR BAHL LR, 1988, P IEEE INT C AC SPEE, V1, P40 HUNT MJ, 1988, P IEEE INT C AC SPEE, V1, P215 Hunt MJ, 1979, J ACOUST SOC AM, V66, pS535 NR 3 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 240 EP 241 DI 10.1016/0167-6393(96)00023-4 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900008 ER PT J AU Jelinek, F AF Jelinek, F TI Five speculations (and a divertimento) on the themes of H. Bourlard, H. Hermansky, and N. Morgan SO SPEECH COMMUNICATION LA English DT Editorial Material RP Jelinek, F (reprint author), JOHNS HOPKINS UNIV, CTR LANGUAGE & SPEECH PROC, BALTIMORE, MD 21218 USA. CR BAHL LR, 1976, 1976 INT C AC SPEECH BOURLARD H, 1995, P EUR 95 MADR, P883 Brill Eric, 1996, RECENT ADV PARSING T CAPEK K, 1935, PRESIDENT MASARYK TE COLE R, 1995, 1995 IEEE AUT SPEECH Levenshtein V., 1966, SOV PHYS DOKL, V10, P707 Merialdo B., 1994, Computational Linguistics, V20 PIERCE JR, 1969, J ACOUST SOC AM, V46, P1049, DOI 10.1121/1.1911801 SHANE S, 1995, BALTIMORE SUN 1203 SHANNON CE, 1948, AT&T TECH J, V27, P623 Young S., 1995, P IEEE WORKSH AUT SP, P3 NR 11 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 242 EP 246 DI 10.1016/0167-6393(96)00009-X PG 5 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900009 ER PT J AU Lippmann, RP AF Lippmann, RP TI Recognition by humans and machines: Miles to go before we sleep SO SPEECH COMMUNICATION LA English DT Editorial Material RP Lippmann, RP (reprint author), MIT, LINCOLN LAB, ROOM S4-121, 244 WOOD ST, LEXINGTON, MA 02173 USA. CR LIPPMANN R, 1996, UNPUB SPEECH COMMUNI Miller G, 1991, SCI WORDS NR 2 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 247 EP 248 DI 10.1016/0167-6393(96)00018-0 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900010 ER PT J AU Mariani, J Gauvain, JL Lamel, L AF Mariani, J Gauvain, JL Lamel, L TI Towards increasing speech recognition error rates - Comment SO SPEECH COMMUNICATION LA English DT Editorial Material RP Mariani, J (reprint author), CNRS, LIMSI, BP 133, F-91403 ORSAY, FRANCE. NR 0 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD MAY PY 1996 VL 18 IS 3 BP 249 EP 252 DI 10.1016/0167-6393(96)00020-9 PG 4 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900011 ER PT J AU Seneff, S AF Seneff, S TI Towards increasing speech recognition error rates - Comment SO SPEECH COMMUNICATION LA English DT Editorial Material RP Seneff, S (reprint author), MIT, 77 MASSACHUSETTS AVE, CAMBRIDGE, MA 02139 USA. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 253 EP 255 DI 10.1016/0167-6393(96)00010-6 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900012 ER PT J AU Veldhuis, R He, HY AF Veldhuis, R He, HY TI Time-scale and pitch modifications of speech signals and resynthesis from the discrete short-time Fourier transform SO SPEECH COMMUNICATION LA English DT Article ID WAVE AB The modification methods described in this paper combine characteristics of PSOLA-based methods and algorithms that resynthesize speech from its short-time Fourier magnitude only. The starting point is a short-time Fourier representation of the signal. In the case of duration modification, portions, corresponding in voiced speech to pitch periods, are removed from or inserted in this representation. In the case of pitch modification, pitch periods are shortened or extended in this representation, and a number of pitch periods is inserted or removed, respectively. Since it is an important tool for both duration and pitch modification, the resynthesis-from-short-time-Fourier-magnitude method of Griffin and Lim (1984) and Griffin et al. (1984) is reviewed and adapted. Duration and pitch modification methods and their results are presented. RP Veldhuis, R (reprint author), INST PERCEPT RES, POB 513, 5600 MB EINDHOVEN, NETHERLANDS. CR ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 BAILLY G, 1992, TALKING MACHINES THE GRIFFIN DW, 1984, IEEE T ACOUST SPEECH, V32, P236, DOI 10.1109/TASSP.1984.1164317 GRIFFIN DW, 1984, P INT C AC SPEECH SI Hamon C., 1989, P INT C AC SPEECH SI, P238 HERMES DJ, 1988, J ACOUST SOC AM, V83, P257, DOI 10.1121/1.396427 JANSSEN AJEM, 1986, IEEE T ACOUST SPEECH, V34, P317, DOI 10.1109/TASSP.1986.1164824 Marple Jr S. L., 1987, DIGITAL SPECTRAL ANA MOULINES E, 1990, SPEECH COMMUN, V9, P453, DOI 10.1016/0167-6393(90)90021-Z OSHAUGHNESSY D, 1990, SPEECH COMMUNICATION PORTNOFF MR, 1981, IEEE T ACOUST SPEECH, V29, P374, DOI 10.1109/TASSP.1981.1163581 SENEFF S, 1982, IEEE T ACOUST SPEECH, V30, P566, DOI 10.1109/TASSP.1982.1163919 Verhelst W., 1993, P IEEE INT C AC SPEE, P554 NR 13 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 257 EP 279 DI 10.1016/0167-6393(95)00044-5 PG 23 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900013 ER PT J AU Hirschberg, J Prieto, P AF Hirschberg, J Prieto, P TI Training intonational phrasing rules automatically for English and Spanish text-to-speech SO SPEECH COMMUNICATION LA English DT Article AB We describe a procedure for acquiring intonational phrasing rules for text-to-speech synthesis automatically, from annotated text, and some evaluation of this procedure for English and Spanish.
The procedure employs decision trees generated automatically, using Classification and Regression Tree techniques, from text corpora which have been hand-labeled by native speakers with likely locations of intonational boundaries, in conjunction with information available about the text via simple text analysis techniques. Rules generated by this method have been implemented in the English version of the Bell Laboratories Text-to-Speech System and have been developed for the Mexican Spanish version of that system. These rules currently achieve better than 95% accuracy for English and better than 94% for Spanish. C1 UNIV AUTONOMA BARCELONA, E-08193 BARCELONA, SPAIN. RP Hirschberg, J (reprint author), AT&T BELL LABS, 600 MT AVE, MURRAY HILL, NJ 07974 USA. CR Altenberg B., 1987, LUND STUDIES ENGLISH, V76 AVESANI C, 1995, P 13 INT C PHON SCI, V1, P174 Bachenko J., 1990, Computational Linguistics, V16 Bolinger D., 1989, INTONATION ITS USES BRUCE B, 1993, P ESCA WORKSH PROS L, V41, P180 DANLOS L, 1986, P 11 INT C COMP LING, P599, DOI 10.3115/991365.991540 HIRSCHBERG J, 1991, P 1991 EUR LARREUR D, 1989, P 1989 EUR, V1, P510 MONAGHAN A, 1991, THESIS U EDINBURGH E Olshen R., 1984, CLASSIFICATION REGRE, V1st O'Shaughnessy D. D., 1989, Computational Linguistics, V15 Ostendorf M., 1994, Computational Linguistics, V20 OSTENDORF M, 1990, DARPA P SPEECH NAT L, P26 Pierrehumbert J, 1980, THESIS MIT Pierrehumbert J. B., 1986, PHONOLOGY YB, V3, P15 PITRELLI J, 1994, ICSLP 94 INT C SPOK, V2, P123 Quene H., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90044-5 RILEY MD, 1989, DARPA P SPEECH NAT L RODRIGUEZ MA, 1993, B SOCIEDAD ESPANOLA, V13, P389 SCHNABEL B, 1990, SEP P ESCA WORKSH SP, P121 SILVERMAN K, 1993, P 1993 EUR, V3, P2169 WANG MQ, 1991, ACL P 29 ANN M Wang M. Q., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90025-Y WANG MQ, 1991, DARPA P SPEECH NAT L, P378 Wightman CW, 1994, IEEE T SPEECH AUDI P, V2, P469, DOI 10.1109/89.326607 YOUNG SJ, 1979, J ACOUST SOC AM, V66, P685, DOI 10.1121/1.383695 NR 26 TC 29 Z9 32 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 281 EP 290 DI 10.1016/0167-6393(96)00017-9 PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900014 ER PT J AU Liu, CS Wang, HC AF Liu, CS Wang, HC TI A segmental probabilistic model of speech using an orthogonal polynomial representation: Application to text-independent speaker verification SO SPEECH COMMUNICATION LA English DT Article ID RECOGNITION; ALGORITHM AB A segmental probabilistic model based on an orthogonal polynomial representation of speech signals is proposed. Unlike the conventional frame based probabilistic model, this segment based model concatenates the similar acoustic characteristics of consecutive frames into an acoustic segment and represents the segment by an orthogonal polynomial function. An iterative algorithm that performs recognition and segmentation processes is proposed for estimating the segment model. This segment model is applied to text-independent speaker verification. Tests were carried out on a 20-speaker database. With the best version of the model, an equal error rate of 0.59% can be reached for test utterances of 10 digits. This corresponds to a relative error rate reduction of more than 50%, compared to the conventional frame based probabilistic model.
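The core of the segment model, representing a variable-length run of frames by a fixed number of coefficients, can be sketched with an orthogonal polynomial fit. Legendre polynomials on a normalised time axis are an assumed choice here for illustration; the paper's basis and its iterative recognition/segmentation loop are not reproduced.

```python
# Minimal sketch: compress an acoustic segment (n_frames x n_features,
# e.g. cepstral vectors) into (order + 1) x n_features coefficients of
# an orthogonal Legendre fit over a normalised time axis t in [-1, 1].
import numpy as np
from numpy.polynomial import legendre

def segment_coefficients(segment, order=2):
    """Fit each feature trajectory of the segment with a low-order
    Legendre polynomial; the coefficient matrix is the fixed-size
    representation of the variable-length segment."""
    t = np.linspace(-1.0, 1.0, len(segment))
    return legendre.legfit(t, segment, order)  # fits every column at once
```

Because segments of any length map to the same (order + 1) x n_features matrix, segments become directly comparable, which is what makes a segment-level probabilistic model possible.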
C1 NATL TSING HUA UNIV, DEPT ELECT ENGN, HSINCHU, TAIWAN. MINIST TRANSPORTAT & COMMUN, TELECOMMUN LABS, TAYUAN, TAIWAN. CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1034 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 GRAYHILL FA, 1983, MATRICES APPL STAT GRENIER Y, 1983, IEEE T ACOUST SPEECH, V31 GRENIER Y, 1988, IEEE T ACOUST SPEECH, V36, P1602, DOI 10.1109/29.7548 HIGGINS A, 1991, P INT C AC SPEECH SI, V1, P405 JUANG BH, 1990, P INT C AC SPEECH SI, V2, P613 MATSUI T, 1992, P INT C AC SPEECH SI, V2, P157 MYERS CS, 1981, IEEE T ACOUST SPEECH, V29, P284, DOI 10.1109/TASSP.1981.1163527 OSTENDORF M, 1989, IEEE T ACOUST SPEECH, V37, P1857, DOI 10.1109/29.45533 Rosenberg A, 1990, P IEEE INT C AC SPEE, V1, P269 Rosenberg A. E., 1992, P INT C SPOK LANG PR, V1, P599 SHIRAKI Y, 1988, IEEE T ACOUST SPEECH, V36, P1437, DOI 10.1109/29.90372 SOONG FK, 1987, AT&T TECH J, V66, P14 SVENDSEN T, 1987, P INT C AC SPEECH SI, V1, P77 TISHBY NZ, 1991, IEEE T SIGNAL PROCES, V39, P563 TSAO C, 1985, IEEE T ACOUST SPEECH, V33, P537 Tseng B., 1992, P ICASSP 92, VII, P161 NR 19 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD MAY PY 1996 VL 18 IS 3 BP 291 EP 304 DI 10.1016/0167-6393(96)00014-3 PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UV749 UT WOS:A1996UV74900015 ER PT J AU vandenHeuvel, H Cranen, B Rietveld, T AF vandenHeuvel, H Cranen, B Rietveld, T TI Speaker variability in the coarticulation of |a,i,u| SO SPEECH COMMUNICATION LA English DT Article DE coarticulation; speaker variability; speaker identification; vowel acoustics ID SPEECH; REDUCTION AB Speaker variability in the coarticulation of the vowels /a,i,u/ was investigated in /C(1)VC(2)e/ pseudo-words, containing the consonants /p,t,k,d,s,m,n,r/. These words were read out in isolation by fifteen male speakers of Dutch. The formants F-1-3 (in Bark) were extracted from the steady-state of each vowel /a,i,u/. Coarticulation in each of 1200 realisations per vowel was measured in F-1-3 as a function of consonantal context, using a score-model based measure called COART. The largest amount of coarticulation was found in /u/ where nasals and alveolars in C-1-position had the largest effect on the formant positions, especially on F-2. Coarticulation in /a,u/ proved to be speaker-specific. For these vowels the speaker variability of COART in a context was larger, generally, if COART itself was larger. Finally, studied in a speaker identification task, COART improved identification results only when three conditions were combined: (a) if COART was used as an additional parameter to F-1-3; (b) if the COART-values for the vowel were high; (c) if all vowel contexts were pooled in the analysis. The two main conclusions from this study are that coarticulation cannot be investigated speaker-independently and that COART can contribute to speaker identification, but only in very restricted conditions. RP vandenHeuvel, H (reprint author), UNIV NIJMEGEN, DEPT LANGUAGE & SPEECH, POB 9103, 6500 HB NIJMEGEN, NETHERLANDS. CR BONASTRE JF, 1994, P ESCA WORKSH SPEAK, P157 Daniloff R.
G., 1973, J PHONETICS, V1, P239 FLEGE JE, 1988, J SPEECH HEAR RES, V31, P525 FOWLER CA, 1980, J PHONETICS, V8, P113 HERMANSKY H, 1990, J ACOUST SOC AM, V87, P1738, DOI 10.1121/1.399423 JOHNSON K, 1993, J ACOUST SOC AM, V94, P701, DOI 10.1121/1.406887 Kent R. D., 1977, J PHONETICS, V15, P115 Kuehn David P., 1976, J PHONETICS, V4, P303 Nolan F, 1983, PHONETIC BASES SPEAK OHDE RN, 1975, J ACOUST SOC AM, V58, P923, DOI 10.1121/1.380746 POLS LCW, 1973, J ACOUST SOC AM, V53, P1093, DOI 10.1121/1.1913429 RIETVELD ACM, 1987, P 11 INT C PHON SCI, V4, P28 SCHOUTEN MEH, 1979, J PHONETICS, V7, P1 SHAIMAN S, 1995, J PHONETICS, V23, P119, DOI 10.1016/S0095-4470(95)80036-0 SHARF DJ, 1981, ADV BASIC RES PRACTI, V5, P154 STEVENS KN, 1966, J ACOUST SOC AM, V40, P123, DOI 10.1121/1.1910027 STEVENS KN, 1963, J SPEECH HEAR RES, V6, P111 SU LS, 1974, J ACOUST SOC AM, V56, P1867 SUOMI K, 1987, J PHONETICS, V15, P85 TOKUMA S, 1993, UNPUB SPEECH HEARING, V7, P233 VANBERGEM DR, 1993, SPEECH COMMUN, V12, P1, DOI 10.1016/0167-6393(93)90015-D VANSON RJJ, 1993, IFOTT STUDIES LANGUA, V3 WHALEN DH, 1990, J PHONETICS, V18, P3 NR 23 TC 4 Z9 4 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1996 VL 18 IS 2 BP 113 EP 130 DI 10.1016/0167-6393(95)00039-9 PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UQ991 UT WOS:A1996UQ99100001 ER PT J AU Alku, P Vilkman, E AF Alku, P Vilkman, E TI Amplitude domain quotient for characterization of the glottal volume velocity waveform estimated by inverse filtering SO SPEECH COMMUNICATION LA English DT Article DE inverse filtering; voice source analysis ID VOICE SOURCE; AIR-FLOW; PRESSURE; SPEECH AB An amplitude-domain quotient for parametrization of the glottal source computed by inverse filtering is presented. The new quotient, AQ, is determined as the ratio between the amplitude of the AC-flow of the glottal waveform and the amplitude of the minimum of the flow derivative. This quotient can be used even though absolute flow values are not given by the recording equipment. The behaviour of AQ was compared to conventional time-based quotients by analysing voices produced by different phonation types. It was shown that phonation types can be quantified effectively when parametrization of the glottal flow estimated by inverse filtering is based on AQ. C1 UNIV OULU, DEPT OTOLARYNGOL & PHONIATR, SF-90220 OULU, FINLAND. RP Alku, P (reprint author), HELSINKI UNIV TECHNOL, ACOUST LAB, OTAKAARI 5A, SF-02150 ESPOO, FINLAND. 
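The quotient itself is a two-line computation once a glottal flow cycle has been obtained by inverse filtering. A minimal sketch follows; the argument names are illustrative, and fs (the sampling rate) is assumed as an input so the discrete derivative has physical units.

```python
# Minimal sketch of the amplitude quotient AQ: the ratio of the AC flow
# amplitude of the glottal waveform to the magnitude of the negative
# peak of the flow derivative. 'flow' is assumed to hold one
# inverse-filtered glottal cycle; no absolute flow calibration is needed
# because AQ is a ratio of amplitudes from the same signal.
import numpy as np

def amplitude_quotient(flow, fs):
    ac_amplitude = flow.max() - flow.min()   # AC amplitude of the glottal flow
    d_flow = np.diff(flow) * fs              # discrete flow derivative
    return ac_amplitude / abs(d_flow.min())  # AQ, with the dimension of time
```

Note that uncalibrated gain cancels in the ratio, which is exactly why the quotient remains usable when the recording equipment does not give absolute flow values.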
RI Alku, Paavo/E-2400-2012 CR ALKU P, 1992, P INT C SPOK LANG PR, P847 ALKU P, 1993, P 1993 AS PAC S INF CARLSON R, 1991, SPEECH COMMUN, V10, P481, DOI 10.1016/0167-6393(91)90051-T CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 DROMEY C, 1992, J VOICE, V6, P44, DOI 10.1016/S0892-1997(05)80008-6 FANT G, 1993, SPEECH COMMUN, V13, P7, DOI 10.1016/0167-6393(93)90055-P Fant Gunnar, 1985, STL QPSR, V4, P1 GAUFFIN J, 1989, J SPEECH HEAR RES, V32, P556 HERTEGARD S, 1992, J VOICE, V6, P224, DOI 10.1016/S0892-1997(05)80147-X HILLMAN RE, 1990, J VOICE, V4, P52, DOI 10.1016/S0892-1997(05)80082-7 HOLMBERG EB, 1988, J ACOUST SOC AM, V84, P511, DOI 10.1121/1.396829 KARLSSON I, 1989, P EUROPEAN C SPEECH, V1, P349 KARLSSON I, 1990, P INT C SPOKEN LANGU, V1, P69 MARKEL JD, 1976, LINEAR PREDICTION SP, P139 PRICE PJ, 1989, SPEECH COMMUN, V8, P261, DOI 10.1016/0167-6393(89)90005-8 ROTHENBE.M, 1973, J ACOUST SOC AM, V53, P1632, DOI 10.1121/1.1913513 STRIK H, 1992, SPEECH COMMUN, V11, P167, DOI 10.1016/0167-6393(92)90011-U STRIK H, 1992, P INT C SPOK LANG PR, V1, P121 SUNDBERG J, 1993, J VOICE, V7, P15, DOI 10.1016/S0892-1997(05)80108-0 WONG DY, 1979, IEEE T ACOUST SPEECH, V27, P350, DOI 10.1109/TASSP.1979.1163260 NR 20 TC 35 Z9 37 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1996 VL 18 IS 2 BP 131 EP 138 DI 10.1016/0167-6393(95)00040-2 PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UQ991 UT WOS:A1996UQ99100002 ER PT J AU Castellano, P Sridharan, S AF Castellano, P Sridharan, S TI A two stage fuzzy decision classifier for speaker identification SO SPEECH COMMUNICATION LA English DT Article DE speaker identification; fuzzy sets AB The Two Stage Fuzzy Decision Classifier (TSFDC) consists of an Artificial Neural Network (ANN) providing a first stage of discrimination and a second post-processing stage. The latter stage uses reference fuzzy set information for each class of data considered. The ANN isolates the two most likely classes for each test vector. Post-processing selects the final class from amongst the two. The TSFDC applies post-processing only to those classes which its ANN has difficulty recognising. Three text-independent Automatic Speaker Identification (ASI) experiments are conducted with emphasis on forensic needs. In these experiments, the signal is degraded by a range of factors affecting communication channels. The TSFDC increases the percentage of correctly identified speech frames, for those speakers poorly identified by its ANN, by a mean of 3.27% over the three experiments. Concurrently, the difference in number of identified frames between true and corresponding runner-up speakers improves by a mean of 5.27%. Post-processing more than halves the number of speakers misclassified by the ANN. RP Castellano, P (reprint author), QUEENSLAND UNIV TECHNOL, SCH ELECT ELECTR & SYST ENGN, SIGNAL PROC RES CTR, 2 GEORGE ST, BRISBANE, QLD 4001, AUSTRALIA. CR BENNANI Y, 1993, P INT C AC SPEECH SI, V1, P541 Berenji HR., 1992, INTRO FUZZY LOGIC AP, P69 Castellano P. J., 1994, Proceedings of the 1994 Second Australian and New Zealand Conference on Intelligent Information Systems (Cat.
No.94TH8019), DOI 10.1109/ANZIIS.1994.396951 CASTELLANO P, IN PRESS APPL SIGNAL CASTELLANO P, IN PRESS AUS J INT I CASTELLANO P, 1994, P 5 AUSTR INT C SPEE, V2, P456 Farrell KR, 1994, IEEE T SPEECH AUDI P, V2, P194, DOI 10.1109/89.260362 Godfrey J. J., 1992, P ICASSP, V1, P517 Jankowski C., 1990, P IEEE INT C AC SPEE, V1, P109 Oglesby J., 1990, P IEEE INT C AC SPEE, V1, P261 OSHAUGHNESSY D, 1990, J FORENSIC SCI, V35, P1163 RABINER LR, 1978, DIGITAL PROCESSING S, P82 RUDASI L, 1991, P INT C AC SPEECH SI, V1, P389 WILENSKY GD, 1992, P INT JOINT C NEUR N, V2, P358, DOI 10.1109/IJCNN.1992.226961 ZIMMERMANN HJ, 1985, INT SERIES MANAGEMEN, P111 NR 15 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1996 VL 18 IS 2 BP 139 EP 149 DI 10.1016/0167-6393(95)00041-0 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UQ991 UT WOS:A1996UQ99100003 ER PT J AU Yehia, H Itakura, F AF Yehia, H Itakura, F TI A method to combine acoustic and morphological constraints in the speech production inverse problem SO SPEECH COMMUNICATION LA English DT Article DE speech production inverse problem; vocal-tract acoustics; vocal-tract morphology; formant; area function; Fourier cosine series ID VOCAL-TRACT; ARTICULATORY MOTION; SHAPES; WAVE AB This paper addresses the inverse of the articulatory-to-acoustic mapping in speech production. A framework based on an explicit combination of vocal-tract morphological and acoustic constraints is proposed. The solution is based on a Fourier analysis of the vocal-tract log-area function: the relationship between the log-area Fourier cosine coefficients and the corresponding formants is used to formulate an acoustic constraint. The same set of coefficients is then used to express a morphological constraint. This representation of both acoustic and morphological constraints in the same parameter space allows an efficient solution for the inverse problem. The basis of the acoustic constraint formulation was first proposed by Mermelstein (1967). However, at that time, the combination with morphological constraints was not realized. The method is tested for some vowels. The results confirm the validity of the method, but they also show the need for dynamic constraints. RP Yehia, H (reprint author), NAGOYA UNIV, DEPT INFORMAT ELECT, ITAKURA LAB, CHIKUSA KU, FURO CHO, NAGOYA, AICHI 46401, JAPAN. RI Yehia, Hani Camille/E-8684-2010 OI Yehia, Hani Camille/0000-0003-4578-0525 CR ATAL BS, 1978, J ACOUST SOC AM, V63, P1535, DOI 10.1121/1.381848 ATAL BS, 1971, J ACOUST SOC AM, V50, P637, DOI 10.1121/1.1912679 BAILLY G, 1991, J PHONETICS, V19, P9 Bothorel A., 1986, CINERADIOGRAPHIE VOY DAVIS HF, 1963, FOURIER SERIES ORTHO Duda R. O., 1973, PATTERN CLASSIFICATI EISNER E, 1967, J ACOUST SOC AM, V41, P1126, DOI 10.1121/1.1910444 FANT G, 1967, READINGS ACOUSTIC PH, P44 FANT G, 1980, PHONETICA, V37, P55 FLANAGAN JL, 1955, J ACOUST SOC AM, V27, P613, DOI 10.1121/1.1907979 Flanagan J., 1972, SPEECH ANAL SYNTHESI FLANAGAN JJ, 1979, J ACOUST SOC AM, V68, P780 GUPTA SK, 1993, J ACOUST SOC AM, V94, P2517, DOI 10.1121/1.407364 HARSHMAN R, 1977, J ACOUST SOC AM, V62, P693, DOI 10.1121/1.381581 Horn R. A., 1985, MATRIX ANAL ITAKURA F, 1973, SPEECH SYNTHESIS, P289 Jayant N.
S., 1984, DIGITAL CODING WAVEF JORDAN M, 1990, ATTENTION PERFORM, V13, P797 LADEFOGED P, 1978, J ACOUST SOC AM, V64, P1027, DOI 10.1121/1.382086 MAEDA S, 1990, NATO ADV SCI I D-BEH, V55, P131 Maeda S., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90017-6 Markel JD, 1976, LINEAR PREDICTION SP MCGOWAN RS, 1994, SPEECH COMMUN, V14, P19, DOI 10.1016/0167-6393(94)90055-8 MERMELST.P, 1967, J ACOUST SOC AM, V41, P1283, DOI 10.1121/1.1910470 PAIGE A, 1970, IEEE T ACOUST SPEECH, VAU18, P268, DOI 10.1109/TAU.1970.1162113 Perkell JS, 1969, PHYSL SPEECH PRODUCT, V53 Rabiner L, 1993, FUNDAMENTALS SPEECH Rabiner L.R., 1978, DIGITAL PROCESSING S SCHROEDE.MR, 1967, J ACOUST SOC AM, V41, P1002, DOI 10.1121/1.1910429 Schroeter J., 1991, ADV SPEECH PROCESSIN, P231 Schroeter J, 1994, IEEE T SPEECH AUDI P, V2, P133, DOI 10.1109/89.260356 SCULLY C, 1990, NATO ADV SCI I D-BEH, V55, P151 SHIRAI K, 1986, SPEECH COMMUN, V5, P159, DOI 10.1016/0167-6393(86)90005-1 SHIRAI K, 1993, SPEECH COMMUN, V13, P45, DOI 10.1016/0167-6393(93)90058-S SONDHI MM, 1979, IEEE T ACOUST SPEECH, V27, P268, DOI 10.1109/TASSP.1979.1163240 SONDHI MM, 1983, J ACOUST SOC AM, V73, P985, DOI 10.1121/1.389024 SONDHI MM, 1987, IEEE T ACOUST SPEECH, V35, P955 SONDHI MM, 1974, J ACOUST SOC AM, V55, P1070, DOI 10.1121/1.1914649 WAKITA H, 1979, IEEE T ACOUST SPEECH, V27, P281, DOI 10.1109/TASSP.1979.1163242 WAKITA H, 1973, IEEE T ACOUST SPEECH, VAU21, P417, DOI 10.1109/TAU.1973.1162506 Webster AG, 1919, P NATL ACAD SCI USA, V5, P275, DOI 10.1073/pnas.5.7.275 YEHIA H, 1994, P IEEE INT C AC SPEE, V1, P477 NR 42 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1996 VL 18 IS 2 BP 151 EP 174 DI 10.1016/0167-6393(95)00042-9 PG 24 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UQ991 UT WOS:A1996UQ99100004 ER PT J AU Lin, CH Wu, CH Ting, PY Wang, HM AF Lin, CH Wu, CH Ting, PY Wang, HM TI Frameworks for recognition of Mandarin syllables with tones using sub-syllabic units SO SPEECH COMMUNICATION LA English DT Article DE recognition of Mandarin syllable with tones; sub-syllabic unit; tone-dependent spectral parameter; hidden Markov model ID HIDDEN MARKOV-MODELS; SPEECH AB The recognition of Mandarin syllables is a key problem in large vocabulary Mandarin speech recognition. Conventionally, the tone and base syllable corresponding to a syllable are separately recognized by using a tone recognizer and a base syllable recognizer, respectively. In this paper, we propose a framework for Mandarin syllable recognition based on the classification of sub-syllabic units such as initials, finals and transitions. The final units are classified in accordance with the variations of tones to enhance the capability of tone discrimination. By using hidden Markov models (HMM) based on LPC-derived cepstral parameters, we develop a Mandarin syllable recognizer in which base syllables and their corresponding tones are jointly recognized. Experimental results indicate that the proposed syllable recognizer yields higher recognition rates than the conventional syllable recognizer does when a sufficient amount of training data is used. We also show that the performance of the proposed syllable recognizer can be further improved with the incorporation of a tone recognizer. C1 NATL TAIWAN UNIV, DEPT ELECT ENGN, TAIPEI 10764, TAIWAN. RP Lin, CH (reprint author), MINIST COMMUN, TELECOMMUN LABS, CHUNGLI, TAIWAN.
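The unit inventory described above amounts to a simple mapping from a tonal syllable to model names. A minimal sketch follows, with invented unit labels and without the transition units or the HMMs themselves; it only illustrates the tone-dependent split of the final units.

```python
# Minimal sketch: map a tonal Mandarin syllable onto sub-syllabic units,
# an (optional) initial plus a tone-dependent final. The labels and the
# naming convention (final + "_T" + tone) are invented for illustration.
def syllable_to_units(initial, final, tone):
    units = [initial] if initial else []  # null-initial syllables contribute no initial unit
    units.append(f"{final}_T{tone}")      # finals are split per tone
    return units

assert syllable_to_units("zh", "ang", 4) == ["zh", "ang_T4"]
assert syllable_to_units("", "an", 1) == ["an_T1"]
```

Splitting the finals by tone is what lets a single HMM decoding pass recover the base syllable and the tone jointly, instead of running two separate recognizers.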
CR Chang PC, 1993, IEEE T SPEECH AUDI P, V1, P135, DOI 10.1109/89.222873 CHEN JK, 1994, P INT C AC SPEECH SI, P137 Chen JK, 1994, IEEE T SPEECH AUDI P, V2, P206 Duda R. O., 1973, PATTERN CLASSIFICATI *GUOY CO, 1976, GUOY TZD MAND CHIN D Hon H.-W., 1992, THESIS CARNEGIE MELL HON HW, 1994, P INT C AC SPEECH SI, P545 Howie John M., 1976, ACOUSTICAL STUDIES M HWANG M, 1992, P INT C AC SPEECH SI, P311 JUANG BH, 1990, IEEE T ACOUST SPEECH, V38, P1639, DOI 10.1109/29.60082 JUANG BH, 1986, IEEE T INFORM THEORY, V32, P307 LEE CH, 1993, SPEECH COMMUN, V12, P383, DOI 10.1016/0167-6393(93)90085-Y Lee C.-H., 1992, Computer Speech and Language, V6, DOI 10.1016/0885-2308(92)90022-V Lee C. H., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90002-N LEE KF, 1990, IEEE T ACOUST SPEECH, V38, P599, DOI 10.1109/29.52701 LEE LS, 1993, IEEE T SPEECH AUDIO, V4, P158 LEE YM, 1993, COMPUTER SPEECH LANG, V7, P247, DOI 10.1006/csla.1993.1013 LIN CH, 1992, P INT C AC SPEECH SI, P227 Markel JD, 1976, LINEAR PREDICTION SP Rabiner L. R., 1986, IEEE ASSP Magazine, V3, DOI 10.1109/MASSP.1986.1165342 RABINER LR, 1989, IEEE T ACOUST SPEECH, V37, P1214, DOI 10.1109/29.31269 Rabiner L.R., 1978, DIGITAL PROCESSING S SINGER H, 1992, P INT C AC SPEECH SI, P1273 SOONG FK, 1990, P INT C AC SPEECH SI, P709 SOONG FK, 1988, IEEE T ACOUST SPEECH, V36, P871, DOI 10.1109/29.1598 TSENG CY, 1981, THESIS BROWN U WANG HM, 1994, J COMPUTER PROCESSIN, V8, P1 WANG J, 1994, J PLANT NUTR, V17, P775, DOI 10.1080/01904169409364766 WANG YR, 1994, J ACOUST SOC AM, V96, P2637, DOI 10.1121/1.411274 NR 29 TC 12 Z9 12 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD APR PY 1996 VL 18 IS 2 BP 175 EP 190 DI 10.1016/0167-6393(95)00043-7 PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA UQ991 UT WOS:A1996UQ99100005 ER PT J AU Sorin, C AF Sorin, C TI Untitled SO SPEECH COMMUNICATION LA English DT Editorial Material RP Sorin, C (reprint author), FRANCE TELECOM, CNET,LAA,TSS,RCP,BAT D, 2 AVE PIERRE MARZIN, F-22307 LANNION, FRANCE. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 1996 VL 18 IS 1 BP 1 EP 2 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TU008 UT WOS:A1996TU00800001 ER PT J AU LeBouquin, R AF LeBouquin, R TI Enhancement of noisy speech signals: Application to mobile radio communications SO SPEECH COMMUNICATION LA English DT Article DE noise cancellation; spectral subtraction; segmentation; coherence function AB This paper deals with the enhancement of noisy speech signals recorded in a car for mobile radio applications. Our concern is the signal estimation when 1 or 2 observations are available; each one is composed of a speech signal s(i) and an additive noise n(i) (i = 1, 2). In the case of one microphone, we consider new techniques that capitalize on aspects of speech perception by focusing on enhancing only the short-time spectral amplitude. A "modified spectral subtraction" method is proposed; it uses a frequency-dependent over-estimate of the noise and can be combined with segmentation of the observation.
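As a reading aid for the LeBouquin record above: a minimal numpy sketch of single-channel spectral subtraction with a frequency-dependent over-estimate of the noise, operating only on the short-time spectral amplitude as the abstract describes. The over-subtraction profile, spectral floor and frame parameters are illustrative assumptions, not the values used in the paper.

```python
import numpy as np

def modified_spectral_subtraction(noisy, noise_psd_est, alpha, floor=0.01):
    """One analysis frame: subtract a frequency-dependent over-estimate
    alpha[k] * noise_psd_est[k] from the short-time power spectrum and
    keep the noisy phase (only the spectral amplitude is enhanced)."""
    spec = np.fft.rfft(noisy)
    power = np.abs(spec) ** 2
    clean_power = power - alpha * noise_psd_est            # over-subtraction
    clean_power = np.maximum(clean_power, floor * power)   # spectral floor
    gain = np.sqrt(clean_power / np.maximum(power, 1e-12))
    return np.fft.irfft(gain * spec, n=len(noisy))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = np.sin(2 * np.pi * 0.05 * np.arange(256)) + 0.3 * rng.standard_normal(256)
    noise_psd = np.full(129, 0.3 ** 2 * 256)   # flat noise PSD estimate (assumed)
    alpha = np.linspace(2.0, 1.2, 129)         # assumed frequency-dependent over-estimate
    enhanced = modified_spectral_subtraction(frame, noise_psd, alpha)
    print(enhanced.shape)
```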
When two microphones are available, in the case of uncorrelated (or slightly correlated) noises, we introduce a new technique based on the coherence function, which is used to filter the observations or to determine a speech/noise classification. Finally, listening tests have been conducted to compare the simplest methods. In the case of stationary noise, the modified spectral subtraction is very promising and in the case of non-stationary and decorrelated noises, the method based on the coherence is more attractive. RP LeBouquin, R (reprint author), UNIV RENNES 1, LAB TRAITEMENT SIGNAL & IMAGE, BAT 22, CAMPUS BEAULIEU, F-35042 RENNES, FRANCE. CR ALLEN JB, 1977, P IEEE, V65, P1558, DOI 10.1109/PROC.1977.10770 ALLEN JB, 1977, J ACOUST SOC AM, V62, P912, DOI 10.1121/1.381621 ANDREOBRECHT R, 1988, IEEE T ACOUST SPEECH, V36, P29, DOI 10.1109/29.1486 BAILLARGEAT C, 1991, THESIS PARIS 6 U BASSEVILLE M, 1986, LECTURE NOTES CONTRO Basseville M., 1985, DETECTION ABRUPT CHA BASSEVILLE M, 1983, IEEE T INFORM THEORY, V29, P709, DOI 10.1109/TIT.1983.1056737 BENVENISTE A, 1982, OUTILS MODELES MATH, V2, P309 BEROUTI M, 1979, APR P IEEE INT C AC, P208 Boll S. F., 1979, IEEE T ACOUST SPEECH, V27 CURTIS RA, 1978, IEEE INT C ACOUST SP, P602 FAUCON G, 1991, SEMINAIRE TRAITEMENT, P42 LIM JS, 1979, P IEEE, V67, P1586, DOI 10.1109/PROC.1979.11540 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z Markel JD, 1976, LINEAR PREDICTION SP MEZALEK ST, 1990, THESIS RENNES U, P30 MOKBEL C, 1992, WORKSHOP SPEECH PROC, P211 Preuss R. D., 1979, ICASSP 79. 1979 IEEE International Conference on Acoustics, Speech and Signal Processing QUACKENBUSH SR, 1985, THESIS GEORGIA I TEC Suzuki H., 1977, Journal of the Acoustical Society of Japan, V33 Van Compernolle D., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90027-2 WANG DL, 1982, IEEE T ACOUST SPEECH, V30, P679, DOI 10.1109/TASSP.1982.1163920 Zelinski R., 1988, ICASSP 88: 1988 International Conference on Acoustics, Speech, and Signal Processing (Cat. No.88CH2561-9), DOI 10.1109/ICASSP.1988.197172 NR 23 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 1996 VL 18 IS 1 BP 3 EP 19 DI 10.1016/0167-6393(95)00021-6 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TU008 UT WOS:A1996TU00800002 ER PT J AU Rajendran, S Yegnanarayana, B AF Rajendran, S Yegnanarayana, B TI Word boundary hypothesization for continuous speech in Hindi based on F-0 patterns SO SPEECH COMMUNICATION LA English DT Article DE word boundary; F-0 pattern; prosody ID INTONATION AB This paper proposes an algorithm based on F-0 patterns to hypothesize word boundaries and function words in continuous speech in Hindi. It makes use of the properties of the F-0 contour such as declination tendency, resetting and fall-rise patterns in Hindi. The syllabic units are identified by using the energy contour, pitch and the first-order LP coefficient. Each syllabic unit is assigned an accent value L (Low), H or h (High) by (i) comparing the F-0 value at the midpoint of each syllabic nucleus with that of the previous syllabic unit and (ii) comparing the F-0 values at two different points within each syllabic unit in a sequence having an accent value L. Word boundaries are placed between the adjacent syllabic units (i) H and L, (ii) h and L, (iii) L and L, (iv) L and h, and (v) H and h.
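The accent-pattern rules in the Rajendran and Yegnanarayana abstract above translate directly into a lookup over adjacent label pairs. A minimal sketch follows; the example accent sequence is invented for illustration.

```python
# Sketch of the word-boundary rule in the record above: place a boundary
# between adjacent syllabic units labelled (i) H,L (ii) h,L (iii) L,L
# (iv) L,h or (v) H,h. The example label sequence is invented.

BOUNDARY_PAIRS = {("H", "L"), ("h", "L"), ("L", "L"), ("L", "h"), ("H", "h")}

def word_boundaries(accents):
    """Return indices i such that a word boundary is hypothesized
    between syllabic unit i and unit i + 1."""
    return [i for i, pair in enumerate(zip(accents, accents[1:]))
            if pair in BOUNDARY_PAIRS]

if __name__ == "__main__":
    labels = ["L", "H", "L", "h", "L", "L", "H", "h"]
    print(word_boundaries(labels))  # boundaries after these unit indices
```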
An evaluation conducted on a corpus of 50 sentences in Hindi read aloud by five native speakers in an ordinary office environment showed that about 74 percent of the word boundaries and about 28 percent of the function words were correctly identified. The results of the word boundary hypothesization can be used to improve the performance of the acoustic-phonetic, lexical and syntactic modules in a speech-to-text conversion system. The robustness of the algorithm in handling noisy speech input conditions and telephone speech is also discussed. RP Rajendran, S (reprint author), INDIAN INST TECHNOL, DEPT COMP SCI & ENGN, SPEECH & VIS LAB, MADRAS 600036, TAMIL NADU, INDIA. CR CUTLER DR, 1990, COGNITIVE MODELS SPE DELGUTTE B, 1978, J ACOUST SOC AM, V64, P1319, DOI 10.1121/1.382118 ESWAR P, 1990, THESIS INDIAN I TECH Fujisaki H., 1984, Journal of the Acoustical Society of Japan (E), V5 GARDING E, 1991, 12TH P INT C PHON SC, V1, P300 GUSSENHOVEN C, 1988, J PHONETICS, V16, P355 HARRINGTON J, 1987, P EUROPEAN C SPEECH, V1, P163 HOUSE AS, 1953, J ACOUST SOC AM, V25, P105, DOI 10.1121/1.1906982 KELKAR AR, 1968, STUDIES HINDI URDU 1, P96 Klatt D.H., 1975, J PHONETICS, V3, P129 KOHLER KJ, 1990, PAPERS LABORATORY PH, P115 Ladd D. R., 1986, PHONOLOGY YB, V3, P311, DOI 10.1017/S0952675700000671 Lea W., 1980, TRENDS SPEECH RECOGN LEHISTE I, 1961, J ACOUST SOC AM, V33, P419, DOI 10.1121/1.1908681 Lieberman M., 1984, LANGUAGE SOUND STRUC, P157 Lieberman Philip, 1967, INTONATION PERCEPTIO MADHUKUMAR AS, 1991, P INT C PHONETIC SCI, V3, P494 Madhukumar A. S., 1993, Computer Speech and Language, V7, DOI 10.1006/csla.1993.1015 MARTIN P, 1979, CURRENT ISSUES LINGU, V9, P1091 MEHROTRA RC, 1965, INDIAN LINGUISTICS, V26, P96 OHALA M, 1991, LANG SCI, V13, P107, DOI 10.1016/0388-0001(91)90009-P Ohala M., 1983, ASPECTS HINDI PHONOL OHALA M, 1986, STRUCTURE CONVERGENC, P81 OHMAN SEG, 1966, J ACOUST SOC AM, V39, P151 PIERREHUMBERT J, 1981, J ACOUST SOC AM, V70, P985, DOI 10.1121/1.387033 PRICE PJ, 1990, P INT C SPOKEN LANGU, V1, P13 RAO GVR, 1991, COMPUTER SPEECH LANG, V5, P379 SHARMA A, 1969, INDIAN LINGUISTICS, V30, P115 THORSEN NG, 1980, J ACOUST SOC AM, V67, P1014, DOI 10.1121/1.384069 Waibel A., 1988, PROSODY SPEECH RECOG WIGHTMAN CW, 1991, P INT C ACOUST SPEEC, V1, P321 YEGNANARAYANA B, 1991, P INT C ACOUST SPEEC, V2, P945 NR 32 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 1996 VL 18 IS 1 BP 21 EP 46 DI 10.1016/0167-6393(95)00022-4 PG 26 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TU008 UT WOS:A1996TU00800003 ER PT J AU Meng, H Hunnicutt, S Seneff, S Zue, V AF Meng, H Hunnicutt, S Seneff, S Zue, V TI Reversible letter-to-sound/sound-to-letter generation based on parsing word morphology SO SPEECH COMMUNICATION LA English DT Article DE reversible letter/sound generation; phonological parsing; morphology ID ENGLISH AB This paper describes a bi-directional letter/sound generation system based on a strategy combining data-driven techniques with a rule-based formalism. Our approach provides a hierarchical analysis of a word, including stress pattern, morphology and syllabification. Generation is achieved by a probabilistic parsing technique, where probabilities are trained from a parsed lexicon. Our training and testing corpora consisted of spellings and pronunciations for the high frequency portion of the Brown Corpus (10,000 words).
The phonetic labels are augmented with markers indicating morphology and stress. We will report on two distinct grammars representing a historical perspective. Our early work with the first grammar inspired us to modify the grammar formalism, leading to greater constraint with fewer rules. We evaluated our performance on letter-to-sound generation in terms of whole word accuracy as well as phoneme accuracy. For the unseen test set, we achieved a word accuracy of 69.3% and a phoneme accuracy of 91.7% using a set of 52 distinct phonemes. While this paper focuses on letter-to-sound generation, our system is also capable of generation in the reverse direction, as reported elsewhere (Meng et al., 1994a). We believe that our formalism will be especially applicable for entering unknown words orally into a recognition system. C1 MIT, COMP SCI LAB, SPOKEN LANGUAGE SYST GRP, CAMBRIDGE, MA 02139 USA. ROYAL INST TECHNOL, DEPT SPEECH COMMUN & MUS ACOUST, S-10044 STOCKHOLM, SWEDEN. CR ALLEN S, CAMBRIDGE STUDIES SP Chomsky N., 1968, SOUND PATTERN ENGLIS COKER CH, 1990, P C SPEECH SYNTHESIS CONROY D, 1986, EKDTC03OM001 DOC DAMPER R, IN PRESS 2ND P NEUR Dedina M. J., 1991, Computer Speech and Language, V5, DOI 10.1016/0885-2308(91)90017-K GODDEAU D, 1994, COMMUNICATION GOLDING AR, 1991, THESIS STANFORD U HERTZ SR, 1973, P IEEE, V11, P1589 Hirschman L., 1993, P HUM LANG TECHN WOR, P19, DOI 10.3115/1075671.1075676 HOCHBERG J, 1991, IEEE T PATTERN ANAL, V13, P957, DOI 10.1109/34.93813 HUNNICUTT S, 1993, P EUROSPEECH BERLIN, P763 HUNNICUTT S, 1976, AJCL MICROFICHE, P57 KLATT D, 1982, J ACOUST SOC AM S1, V72, pS46 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 Kucera H., 1967, COMPUTATIONAL ANAL P LEHNERT M, 1987, P AAA1 87 SEATTLE, P301 LUCAS SM, 1992, TALKING MACHINES THE, P127 LUCASSEN JM, 1984, P INT C ACOUST SPEEC LUK R, 1993, P INT C ACOUST SPEEC, P203 MENG H, 1994, P INT C ACOUST SPEEC, V2, P1 MENG H, 1994, P ARPA HUMAN LANGUAG, P289, DOI 10.3115/1075812.1075876 MENG H, 1994, P INT S SPEECH IMAGE, P670, DOI 10.1109/SIPNN.1994.344822 OAKEY S, 1981, P IJCAI VANCOUVER, P109 PARFITT S, 1991, P EUROSPEECH, P801 Segre A. M., 1983, First Conference of the European Chapter of the Association for Computational Linguistics. Proceedings of the Conference Sejnowski T. J., 1987, Complex Systems, V1 SENEFF H, 1992, P ICSLP BANFF, P397 Seneff S., 1992, Computational Linguistics, V18 SHIPMAN DW, 1982, 1982 P INT C AC SPEE, P546 STANFILL C, 1987, P 6 NAT C ART INT AA, P577 SULLIVAN KPH, 1992, TALKING MACHINES THE, P183 VANCOILE B, 1992, P ICSLP BANFF, P487 VANDENBOSCH A, 1993, 6TH P EUR ACL, P45 VANLEEUWEN HC, 1993, COMPUTER SPEECH OCT, P369 ZUE V, 1993, P ATR INT WORKSHOP S 1984, WEBSTERS 9TH NEW COL NR 37 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD JAN PY 1996 VL 18 IS 1 BP 47 EP 63 DI 10.1016/0167-6393(95)00032-1 PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TU008 UT WOS:A1996TU00800004 ER PT J AU Zhao, YX AF Zhao, YX TI Self-learning speaker and channel adaptation based on spectral variation source decomposition SO SPEECH COMMUNICATION LA English DT Article DE self-learning; acoustic normalization; unsupervised sequential phone model adaptation; iterative phone model adaptation AB A self-learning speaker and channel adaptation technique based on the separation of speech spectral variation sources is developed for improving speaker-independent continuous speech recognition. Statistical methods are formulated to remove spectral biases at the acoustic level and to adapt parameters of Gaussian mixture densities at the phone unit level. The spectral bias is estimated in two steps using unsupervised maximum likelihood estimation: the probability distributions of the speech spectral features are first assumed uniform for severely mismatched channels, and the spectral bias is then reestimated using Gaussian phone models. Unsupervised sequential phone model adaptation (USPA) is performed via Bayesian estimation from the on-line, bias-removed speech data, and iterative phone model adaptation (IPA) is further performed for dictation applications. The task vocabulary size was 853; the grammar perplexity was 105; the test speech data were collected under mismatched recording conditions with each test set containing 198 sentences. Depending on the speakers and channel conditions, the two-step spectral bias removal yielded relative error reductions (RER) of 3% to 11% compared to the conventional cepstral mean removal; the USPA yielded RER of 12% to 26% after the two-step bias removal; the IPA further yielded RER of 8% to 19% after the USPA. C1 UNIV ILLINOIS, DEPT ELECT & COMP ENGN, URBANA, IL 61801 USA. RP Zhao, YX (reprint author), UNIV ILLINOIS, BECKMAN INST, 405 N MATHEWS AVE, URBANA, IL 61801 USA. CR ATAL BS, 1974, J ACOUST SOC AM, V55, P1304, DOI 10.1121/1.1914702 COX SJ, 1989, MAY P INT C AC SPEEC, P294 DeGroot M. H., 1970, OPTIMAL STATISTICAL DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 FURUI S, 1989, MAY P INT C AC SPEEC, P286 GAUVAIN JL, SPEECH COMMUN, V11, P205 HERMANSKY H, 1985, MAR P INT C AC SPEEC, P509 HUNT MJ, 1981, J ACOUST SOC AM, V69, P541 LAMEL LF, 1986, P SPEECH RECOGNITION LEE CH, 1993, APR P INT C AC SPEEC, P558 PAUL DB, 1993, APR P INT C AC SPEEC, V2, P660 ZHAO Y, 1993, SEP P EUROSPEECH 93, P359 Zhao YX, 1993, IEEE T SPEECH AUDI P, V1, P345, DOI 10.1109/89.232618 Zhao YX, 1994, IEEE T SPEECH AUDI P, V2, P380 NR 14 TC 16 Z9 16 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 1996 VL 18 IS 1 BP 65 EP 77 DI 10.1016/0167-6393(95)00036-4 PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TU008 UT WOS:A1996TU00800005 ER PT J AU Chung, YJ Un, CK AF Chung, YJ Un, CK TI Multilayer perceptrons for state-dependent weightings of HMM likelihoods SO SPEECH COMMUNICATION LA English DT Article DE speech recognition; multilayer perceptron; weighting of hidden Markov models ID MARKOV-MODELS; RECOGNITION AB This paper proposes the use of multilayer perceptrons (MLPs) for state-dependent weighting of hidden Markov model (HMM) likelihoods.
The static pattern classification ability of MLPs and the temporal processing capability of HMMs are employed in order to obtain the state-dependent weightings of HMM likelihoods. In this approach, the MLP is trained for phoneme classification, and then the output values of the MLP are used as the state-dependent weightings. Applying the MLP outputs to the state-dependent weightings improves the performance of the conventional HMM without state-dependent weightings. However, in order to further improve the discriminability of competing classes, the discriminative training of the state-dependent weightings is performed by computing the gradient of the optimization criterion for the state-weighted HMM with respect to the MLP parameters. The proposed algorithm reduces the error rate considerably as compared with the conventional HMM in speaker-independent continuous speech recognition. C1 KOREA ADV INST SCI & TECHNOL, DEPT ELECT ENGN, COMMUN RES LAB, YUSUNG KU, TAEJON 305701, SOUTH KOREA. CR AMARI S, 1967, IEEE TRANS ELECTRON, VEC16, P299, DOI 10.1109/PGEC.1967.264665 Bahl L. R., 1986, P IEEE INT C AC SPEE, P49 BAHL LR, 1988, APR P IEEE INT C AC, P493 Baum L. E., 1972, INEQUALITIES, V3, P1 BENGIO Y, 1992, IEEE T NEURAL NETWOR, V3, P252, DOI 10.1109/72.125866 BOURLARD H, 1990, IEEE T PATTERN ANAL, V12, P1167, DOI 10.1109/34.62605 Bourlard Ha, 1994, CONNECTIONIST SPEECH CERF PL, 1994, IEEE T SPEECH AUDIO, V2, P185 CHANG PC, 1992, P IEEE ICASSP 92, P493 CHOU W, 1992, P IEEE INT C AC SPEE, P473, DOI 10.1109/ICASSP.1992.225869 CHOU W, 1993, APR P IEEE INT C AC, P652 CHOW YL, 1990, APR P IEEE INT C AC, P701 CHUNG YJ, 1993, ELECTRON LETT, V29, P824, DOI 10.1049/el:19930550 FRANZINI MA, 1989, MAY P IEEE INT C AC, P425 JUANG BH, 1992, IEEE T SIGNAL PROCES, V40, P3043, DOI 10.1109/78.175747 LEE CH, 1989, IEEE T ACOUST SPEECH, V37, P1649, DOI 10.1109/29.46547 LUBENSKY D, 1994, SEP P INT C SPOK LAN, P295 MERHAV N, 1991, IEEE T SIGNAL PROCES, V39, P2111, DOI 10.1109/78.134449 SU KY, 1991, P IEEE ICASSP 91, P541 WAIBEL A, 1989, IEEE T ACOUST SPEECH, V37, P328, DOI 10.1109/29.21701 WOLFERSTETTER F, 1994, SEP P INT C SPOK LAN, P219 NR 21 TC 7 Z9 7 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD JAN PY 1996 VL 18 IS 1 BP 79 EP 89 DI 10.1016/0167-6393(95)00038-0 PG 11 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TU008 UT WOS:A1996TU00800006 ER PT J AU Rossi, M AF Rossi, M TI The evolution of phonetics: A fundamental and applied science SO SPEECH COMMUNICATION LA English DT Editorial Material CR BOE LJ, 1988, 17EMES ACT J ET PAR, P79 CARROLL JS, 1992, SLOAN MANAGE REV, V33, P91 d'Espagnat B, 1979, RECHERCHE REEL ECO U, 1988, TRATTATO SEMIOTICA G Edelman Gerald M., 1992, BRIGHT AIR BRILLIANT Fant G., 1951, PRELIMINARIES SPEECH FRY DB, 1960, LINGUISTIC THEORY EX Granger G.-G., 1988, CONNAISSANCE PHILOS HJELMSLEV L., 1968, PROLEGOMENES THEORIE JONES WE, 1973, PHONETICS LINGUISTIC Popper Karl, 1972, CONNAISSANCE OBJECTI ROUSSELOT P, 1901, PRINCIPES PHONETIQUE, V2 ROUSSELOT PJ, 1901, PRINCIPES PHONETIQUE, V1 SAUSSURE FD, 1964, COURS LINGUISTIQUE G STANKIEWICZ E, 1972, BAUDOUIN COURTENAY A, V1 NR 15 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
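A minimal sketch of the scoring idea in the Chung and Un record above: per-frame outputs of an MLP trained for phoneme classification are reused as state-dependent weights on HMM emission log-likelihoods. The array shapes and toy numbers are assumptions, and the discriminative gradient training of the weights described in the abstract is not shown.

```python
import numpy as np

def weighted_path_loglik(emission_loglik, mlp_outputs, state_to_phone, path):
    """Score a state path where each frame's emission log-likelihood is
    scaled by the MLP output for the phoneme that the state belongs to.

    emission_loglik: (T, S) HMM emission log-likelihoods per frame/state
    mlp_outputs:     (T, P) MLP phoneme-classification outputs per frame
    state_to_phone:  (S,)   phoneme index of each HMM state
    path:            (T,)   state index visited at each frame
    """
    t = np.arange(len(path))
    weights = mlp_outputs[t, state_to_phone[path]]   # state-dependent weights
    return float(np.sum(weights * emission_loglik[t, path]))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    T, S, P = 5, 4, 2
    loglik = rng.normal(-3.0, 1.0, (T, S))
    mlp = rng.uniform(0.0, 1.0, (T, P))              # toy "phoneme posteriors"
    state_to_phone = np.array([0, 0, 1, 1])
    path = np.array([0, 1, 2, 2, 3])
    print(weighted_path_loglik(loglik, mlp, state_to_phone, path))
```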
PD JAN PY 1996 VL 18 IS 1 BP 96 EP 102 DI 10.1016/S0167-6393(96)90031-X PG 7 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TU008 UT WOS:A1996TU00800007 ER PT J AU FURUI, S ROE, D AF FURUI, S ROE, D TI INTERACTIVE VOICE TECHNOLOGY FOR TELECOMMUNICATION APPLICATIONS SO SPEECH COMMUNICATION LA English DT Editorial Material C1 AT&T BELL LABS, APPL SPEECH RES DEPT, MURRAY HILL, NJ USA. NR 0 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1995 VL 17 IS 3-4 BP 215 EP 216 PG 2 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100001 ER PT J AU RABINER, LR AF RABINER, LR TI THE IMPACT OF VOICE PROCESSING ON MODERN TELECOMMUNICATIONS SO SPEECH COMMUNICATION LA English DT Article AB Research has been conducted in the area of voice processing for over six decades but it has only been in the past few years that the impact of the years of research is starting to be seen in modern telecommunications systems. Virtually every area of voice processing, including speech coding, speech synthesis, speech recognition, and even, to a small extent, speaker verification, has left the research laboratory and now appears in a product or service that is in daily use out in the marketplace, often by millions of customers per day. This revolution in voice processing in telecommunications is fueled by algorithmic advances (which improve the quality of the voice processing systems), hardware advances (which provide high processing power and memory at low cost), and networking advances (which provide high bandwidth pipes to the home, office, and throughout the telecommunications network). In this paper we illustrate the impact of voice processing on modern telecommunications by showing the diverse ways in which speech coding, speech synthesis, speech recognition and speaker verification have become embodied in new products and services. RP RABINER, LR (reprint author), AT&T BELL LABS, INFORMAT PRINCIPLES RES LABS, 600 MT AVE, MURRAY HILL, NJ 07974 USA. CR ALLEN J, 1991, ADV SPEECH SIGNAL PR, P741 JAYANT NS, 1991, ADV SPEECH SIGNAL PR, P85 JAYANT NS, 1993, P IEEE, V10, P1385 KLATT DH, 1987, J ACOUST SOC AM, V82, P737, DOI 10.1121/1.395275 Lee K.-F., 1989, AUTOMATIC SPEECH REC MAKHOUL J, 1994, VOICE COMMUNICATION BETWEEN HUMANS AND MACHINES, P165 Rabiner L, 1993, FUNDAMENTALS SPEECH RABINER LR, 1994, P IEEE, V82, P199, DOI 10.1109/5.265347 Rosenberg AK, 1991, ADV SPEECH SIGNAL PR, P701 Waibel A, 1990, READINGS SPEECH RECO NR 10 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1995 VL 17 IS 3-4 BP 217 EP 226 DI 10.1016/0167-6393(95)00026-K PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100002 ER PT J AU LENNIG, M BIELBY, G MASSICOTTE, J AF LENNIG, M BIELBY, G MASSICOTTE, J TI DIRECTORY ASSISTANCE AUTOMATION IN BELL-CANADA - TRIAL RESULTS SO SPEECH COMMUNICATION LA English DT Article DE SPEECH RECOGNITION; TELEPHONE; DIRECTORY ASSISTANCE; AUTOMATION; FLEXIBLE VOCABULARY RECOGNITION; FVR; SPEAKER INDEPENDENT; PHONEME AB Speech recognition was used to automate directory assistance in a 6-month trial with Bell Canada's public customers. 
The bilingual application gave the caller a choice at the beginning of the dialog to continue in English or French. Over 89% of calls were either partially or fully automated. Customer and operator reactions to the system were positive. Bell-Northern Research's flexible vocabulary recognizer, using a vocabulary of 1,700 city names and synonyms, performed well under real world conditions. Economically significant operator work time savings were demonstrated. C1 BELL SYGMA, MONTREAL, PQ H3B 2M8, CANADA. RP LENNIG, M (reprint author), BELL NO RES LTD, 16 PL COMMERCE, MONTREAL, PQ H3E 1H6, CANADA. CR LENNIG M, 1992, 1992 P INT C SPOK LA, P93 Lennig M., 1993, Telesis LENNIG M, 1990, COMPUTER, V23, P35, DOI 10.1109/2.56869 NR 3 TC 10 Z9 10 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1995 VL 17 IS 3-4 BP 227 EP 234 DI 10.1016/0167-6393(95)00024-I PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100003 ER PT J AU VYSOTSKY, GJ AF VYSOTSKY, GJ TI VOICEDIALING(SM) - THE FIRST SPEECH RECOGNITION BASED SERVICE DELIVERED TO CUSTOMERS HOME FROM THE TELEPHONE NETWORK SO SPEECH COMMUNICATION LA English DT Article DE SPEECH RECOGNITION; TELEPHONE SERVICES; SPEAKER DEPENDENT RECOGNITION AB This paper is an overview of the NYNEX VoiceDialing(SM) service - the first introduction of speech recognition based technology to a mass market of residential and business customers. This is a network based service which allows telephone users to make calls by simply saying the name of the person or place they wish to reach. VoiceDialing(SM) is compatible with both touch-tone and rotary dial service and is designed to work on all existing telephone sets and, therefore, on any extensions in the customer's home. We describe the network architecture, user interface, and speech recognition technology, with special focus on the analysis of success in service deployment and customer acceptance. The paper opens with an overview of speech research and development at NYNEX Science and Technology. RP VYSOTSKY, GJ (reprint author), NYNEX SCI & TECHNOL INC, 500 WESTCHESTER AVE, WHITE PLAINS, NY 10604 USA. CR ASADI A, 1993, 1993 P INT C AC SPEE CHANG H, 1990, 1990 P INT C AC SPEE, P765 KALYANSWAMY A, 1992, 1ST P IEEE WORKSH IN KALYANSWAMY A, 1991, 1991 P AM VOIC INP O KIM J, 1994, 5TH US TELC SPEECH R KONDZIELLA J, 1992, 920008 NYNEX SCI TEC LEUNG H, 1992, 1992 P INT C AC SPEE LEUNG H, 1994, 2ND P IEEE WORKSH IN LUBENSKY D, 1994, 1994 P INT C SPOK LA LUBENSKY D, 1993, 1993 P EUR 93 BERL NAIK J, 1994, 1994 P INT C AC SPEE NAIK J, 1994, 2ND P IEEE WORKSH SP NAIK J, 1994, 1994 P ESCA WORKSH A NAIK JM, 1990, IEEE COMMUN MAG JAN, P42 RAMAN V, 1994, 1994 P ICSLP YOK SILVERMAN K, 1993, 1993 P EUR 93 BERL ZREIK L, 1994, 2ND P IEEE WORKSH IN NR 17 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV PY 1995 VL 17 IS 3-4 BP 235 EP 247 DI 10.1016/0167-6393(95)00025-J PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100004 ER PT J AU AUST, H OERDER, M SEIDE, F STEINBISS, V AF AUST, H OERDER, M SEIDE, F STEINBISS, V TI THE PHILIPS AUTOMATIC TRAIN TIMETABLE INFORMATION-SYSTEM SO SPEECH COMMUNICATION LA English DT Article DE SPEECH RECOGNITION; SPEECH UNDERSTANDING; SPOKEN LANGUAGE SYSTEMS; WORD GRAPH; STOCHASTIC ATTRIBUTED GRAMMAR AB In this article, we describe an automatic system for train timetable information over the telephone that provides accurate connections between 1200 German cities. The caller can talk to it in unrestricted, natural, and fluent speech, very much like he or she would communicate with a human operator, and is not given any instructions in advance. The system's four main components, speech recognition, speech understanding, dialogue control, and speech output, are separated into independent modules that are executed sequentially. Word graphs form the interface between recognition and understanding; an attributed stochastic context-free grammar is then used to determine the meaning of a spoken sentence. In an ongoing field trial, this system has been made available to the general public, both to gather speech data and to evaluate its performance. This field test was organized as a bootstrapping process: initially, the system was trained with just the developers' voices, then the telephone number was passed around within the department, the company and, finally, the outside world. After each step, the newly collected material was used for retraining, as well as for general improvements. C1 PHILIPS GMBH, FORSCHUNGSLAB AACHEN, D-52021 AACHEN, GERMANY. CR ALLEN JF, 1992, 1992 P DARPA SPEECH, P5 AUST H, 1994, 2ND P IVTTA 94 WORKS, P141 BATES M, 1993, P INT C ACOUST SPEEC BERGMANN H, 1992, 1 P K VER NAT SPRACH, P39 BLY B, 1990, 1990 P DARPA SPEECH, P136 DEMMEL A, 1991, THESIS U KAISERSLAUT Fu K.S., 1982, SYNTACTIC PATTERN RE GERBINO E, 1993, P INT C AC SPEECH SI, P135 GIACHIN EP, 1992, P INT C ACOUST SPEEC, P173, DOI 10.1109/ICASSP.1992.225944 Hemphill C., 1990, P DARPA SPEECH NAT L, P96, DOI 10.3115/116580.116613 Jelinek F., 1992, Speech Recognition and Understanding. Recent Advances, Trends and Applications. Proceedings of the NATO Advanced Study Institute MOORE R, 1992, 1992 P DARPA SPEECH, P61 Ney H., 1994, International Journal of Pattern Recognition and Artificial Intelligence, V8, DOI 10.1142/S0218001494000036 Ney H., 1992, P IEEE INT C AC SPEE, P9, DOI 10.1109/ICASSP.1992.225985 OERDER M, 1993, P IEEE INT C AC SPEE, V2, P119 PECKHAM J, 1993, 3RD P EUR C SPEECH C, P1407 Pieraccini R., 1992, P ICASSP, P193, DOI 10.1109/ICASSP.1992.225939 STEINBISS V, 1993, 3 EUR C SPEECH COMM, P2125 Ward W, 1991, P ICASSP 91, P365, DOI 10.1109/ICASSP.1991.150352 Young S. J., 1989, Computer Speech and Language, V3, DOI 10.1016/0885-2308(89)90002-8 NR 20 TC 53 Z9 53 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
PD NOV PY 1995 VL 17 IS 3-4 BP 249 EP 262 DI 10.1016/0167-6393(95)00028-M PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100005 ER PT J AU BILLI, R CANAVESIO, F CIARAMELLA, A NEBBIA, L AF BILLI, R CANAVESIO, F CIARAMELLA, A NEBBIA, L TI INTERACTIVE VOICE TECHNOLOGY AT WORK - THE CSELT EXPERIENCE SO SPEECH COMMUNICATION LA English DT Article DE IVR APPLICATIONS; FIELD TRIALS; VOICE DIALING; DIRECTORY ASSISTANCE AB This paper is a survey of the speech technologies and applications developed at CSELT, some of which are employed in real services deployed in the Italian telephone network. With the rise of significant speech recognition and text-to-speech applications, the activity of our lab now encompasses a broader set of activities, from new algorithmic approaches to speech product engineering and application development. In particular, the paper gives an overview of the products originating from our speech technology research. It describes two operative applications, namely a voice dialing service for large name directories, which is installed in the CSELT PABX, and an automated network service for directory assistance, which is now accessible to all the Italian telephone customers. RP BILLI, R (reprint author), CSELT SPA, CTR STUDI & LAB TELECOMUN SPA, VIA G REISS ROMOLI, I-10148 TURIN, ITALY. CR BALESTRI M, 1993, 3RD P EUR C SPEECH C, P2091 BALESTRI M, 1992, 1992 P INT C SPOK LA, P559 BILLI R, 1982, APR P IEEE INT C AC, P574 CANAVESIO F, 1989, 1989 P AUT OP SERV T CANAVESIO F, 1991, 2ND P EUR C SPEECH C, P731 CRAVERO M, 1984, 1984 P INT C AC SPEE, P35 FISSORE L, 1988, 1988 P INT C AC SPEE, P414 FISSORE L, 1990, 1990 P EUR SIGN PROC, P1204 LENNIG M, 1992, 1ST IEEE WORKSH INT NEBBIA L, 1979, 1979 P INT C AC SPEE, P884 PECKHAM J, 1993, 3RD P EUR C SPEECH C, P33 QUAZZA S, 1993, 1993 ESCA WORKSH PRO, P78 SALZA P, 1994, IN PRESS 1994 AVIOS NR 13 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1995 VL 17 IS 3-4 BP 263 EP 271 DI 10.1016/0167-6393(95)00030-R PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100006 ER PT J AU SORIN, C JOUVET, D GAGNOULET, C DUBOIS, D SADEK, D TOULARHOAT, M AF SORIN, C JOUVET, D GAGNOULET, C DUBOIS, D SADEK, D TOULARHOAT, M TI OPERATIONAL AND EXPERIMENTAL FRENCH TELECOMMUNICATION SERVICES USING CNET SPEECH RECOGNITION AND TEXT-TO-SPEECH SYNTHESIS SO SPEECH COMMUNICATION LA English DT Article DE SPEECH RECOGNITION; TEXT-TO-SPEECH SYNTHESIS; FIELD EVALUATION AB This paper presents a brief overview of current uses, in 1994-1995, for CNET speech recognition and text-to-speech technologies in Interactive Voice Response Services (IVR) in France. It describes several operational and experimental services, and analyzes field evaluations of some of them. Finally, this paper summarizes recent developments in the CNET speech recognition and text-to-speech technology. RP SORIN, C (reprint author), CTR NATL ETUD TELECOMMUN, F-22301 LANNION, FRANCE.
CR BARTKOVA K, 1994, 20EMES JEPS TREG, P181 BIGORGNE D, 1993, P ICASSP, P187 BOEFFARD O, 1993, P EUR 93 BERL, P1449 BOEFFARD O, 1994, P ICSLP 94 YOKOHAMA, P721 DUPONT P, 1993, P EUROSPEECH 93 BERL, P1959 FELLBAUM K, 1994, MAR P DAGA 94 VORK D GAGNOULET C, 1989, P EUROSPEECH 89 PAR, P569 JOUVET D, 1994, 20EMES JEPS TREG, P159 JOUVET D, 1991, P EUROSPEECH 91, P923 JOUVET D, 1993, P EUROSPEECH, P2081 JOUVET D, 1994, P ICSLP 94 YOKOHAMA, P283 LAROCHE J, 1990, P INT C ACOUST SPEEC, P550 MAUUARY L, 1993, P EUROSPEECH 93, P1097 MAUUARY L, 1994, THESIS U RENNES Mokbel C., 1994, P ICSLP, P987 Moulines E., 1990, P ICASSP, P309 SADEK MD, 1995, P ESCA WORKSHOP SPOK, P145 SADEK MD, 1994, P AAAI 94 WORKSHOP I, P100 SORIN C, 1992, IEEE WORKSHOP IVTTA, P27 NR 19 TC 11 Z9 11 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1995 VL 17 IS 3-4 BP 273 EP 286 DI 10.1016/0167-6393(95)00035-M PG 14 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100007 ER PT J AU TAKAHASHI, J SUGAMURA, N HIROKAWA, T SAGAYAMA, S FURUI, S AF TAKAHASHI, J SUGAMURA, N HIROKAWA, T SAGAYAMA, S FURUI, S TI INTERACTIVE VOICE TECHNOLOGY DEVELOPMENT FOR TELECOMMUNICATIONS APPLICATIONS SO SPEECH COMMUNICATION LA English DT Article DE SPEECH RECOGNITION; SPEECH SYNTHESIS; WORD SPOTTING; HIDDEN MARKOV MODEL; SPEAKER ADAPTATION; LINE ADAPTATION; ENVIRONMENT COMPENSATION; INTERACTIVE VOICE PROCESSING; TELECOMMUNICATION AB This paper describes the essential speech processing techniques for interactive voice applications in the telecommunications field. These techniques include speech recognition and speech synthesis, both of which aim to make interactive speech communications between man and machine more natural. Keyword spotting, reduction of background noise effects, and speaker and/or telephone adaptation techniques are considered essential in speech recognition in order to allow a more natural voice input as well as an adequate robustness against environmental variabilities. In the area of text-to-speech synthesis, we propose a rule-based synthesis method applicable to the Japanese language, aiming to produce high-quality speech. The commercial system ANSER of a former project is also described as an example of an interactive speech processing system. Finally, a recently developed speech recognition server which includes a vocabulary-flexible recognition function is described. It illustrates the concept behind the techniques it employs, which allow its range of applications to be easily extended and allow it to adapt to the rapid changes occurring in the telecommunications field. C1 NIPPON TELEGRAPH & TEL PUBL CORP, HUMAN INTERFACE LABS, MUSASHINO, TOKYO 180, JAPAN. RP TAKAHASHI, J (reprint author), NIPPON TELEGRAPH & TEL PUBL CORP, HUMAN INTERFACE LABS, SPEECH & ACOUST LAB, 1-2356 TAKE, YOKOSUKA, KANAGAWA 23803, JAPAN. CR FURUI S, 1992, SPEECH COMMUN, V11, P195, DOI 10.1016/0167-6393(92)90014-X FURUI S, 1992, P DARPA SPEECH NATUR, P162, DOI 10.3115/1075527.1075564 GAUVAIN JL, 1991, FEB P DARPA SPEECH N, P272 HIROKAWA T, 1993, T IEICE A, V76, P1964 HIROKAWA T, 1989, P EUROPEAN C SPEECH, P30 IMAMURA A, 1990, P INT C SPOKEN LANGU, P537 KITAI M, 1994, 2ND P IEEE WORKSH IN, P133 LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902 Lee C.-H., 1993, ICASSP-93.
1993 IEEE International Conference on Acoustics, Speech, and Signal Processing (Cat. No.92CH3252-4), DOI 10.1109/ICASSP.1993.319368 MARTIN F, 1992, FAL P M AC SOC JAP, P50 MATSUOKA T, 1993, P EUR C SPEECH COMM, P815 Miki S., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266547 MINAMI Y, 1994, INT CONF ACOUST SPEE, P141 Nakadai Y, 1990, P INT C SPOKEN LANGU, P1141 NAKAJIMA S, 1988, IEEE T ACOUST SPEECH, P133 NAKATSU R, 1990, COMPUTER, P43 NODA Y, 1994, 1994 P FALL M AC SOC, P139 NOMURA T, 1984, IEEE T ACOUST SPEECH, P2687 OHKURA K, 1992, P ICSLP 92, P369 SHIKANO K, 1982, J ACOUST SOC JAPAN S, V82, P113 SUGAMURA N, 1994, 2ND P IEEE WORKSH IN, P37 SUGAMURA N, 1986, SPEECH COMMUN, V5, P199, DOI 10.1016/0167-6393(86)90008-7 SUGAMURA N, 1983, IEEE T ACOUST SPEECH, P723 SUWA Y, 1993, IEICE SSE9388 TECH R, P147 SUZUKI Y, 1992, P INT VOICE INPUT OU, P273 TAKAHASHI J, 1994, IEICE SP9474 TECHN R, P33 TAKAHASHI J, 1994, 2ND P IEEE WORKSH IN, P97 TAKAHASHI J, 1994, 1994 P FALL M AC SOC, P75 TAKAHASHI J, 1994, 1994 P INT C SPOK LA, P991 TAKAHASHI J, 1995, 1995 P IEEE INT C AC, V1, P696, DOI 10.1109/ICASSP.1995.479789 TAKAHASHI S, 1994, 1994 P FALL M AC SOC, P113 TAKAHASHI S, 1995, MAY P IEEE INT C AC, V1, P520 TONOMURA M, 1994, IEICE SP9451 TECHN R, P25 TONOMURA T, 1994, 1994 P FALL M AC SOC, P77 Tsurumi Y., 1994, P ICSLP, P431 YAMADA Y, 1994, 1994 P FALL M AC SOC, P41 YOSHIOKA O, 1994, P INT C SPOKEN LANGU, P887 NR 37 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1995 VL 17 IS 3-4 BP 287 EP 301 DI 10.1016/0167-6393(95)00029-N PG 15 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100008 ER PT J AU KAMM, CA SHAMIEH, CR SINGHAL, S AF KAMM, CA SHAMIEH, CR SINGHAL, S TI SPEECH RECOGNITION ISSUES FOR DIRECTORY ASSISTANCE APPLICATIONS SO SPEECH COMMUNICATION LA English DT Article DE SPEECH RECOGNITION; DIRECTORY ASSISTANCE AUTOMATION AB Telephone companies in the United States handle over 6 billion Directory Assistance (DA) calls each year. Automation of even a portion of DA calls could significantly reduce the cost of DA services. This paper explores two factors affecting successful automation of DA: (a) the effect of directory size on speech recognition performance, and (b) the complexity of existing DA call interactions. Speech recognition performance for a set of 200 spoken names was measured for directories ranging from 200 to 1.5 million unique names. Recognition accuracy decreased from 82.5% for a 200-name directory to 16.5% for a 1.5 million name directory. In part because high recognition accuracy is not easily achievable for these very large, low-context directories, it is likely that initial implementations of DA automation will focus on a small percentage of calls, requiring a smaller vocabulary. To maximize potential savings, listings that are most frequently requested constitute the optimal vocabulary. To identify critical issues in automating frequent DA requests, approximately 13,000 DA calls from an office near a major metropolitan area in the United States were studied. In this sample, 245 listings covered 10% of the call volume, and 870 listings covered 20% of the call volume. C1 BELLCORE, MORRISTOWN, NJ USA. 
CR COLE R, 1991, P ICASSP, P325, DOI 10.1109/ICASSP.1991.150342 COLE RW, 1992, PROCEEDINGS OF ION GPS-92, P895 *ENTR RES LAB INC, 1993, HTK HIDD MARK MOD TO LENNIG M, 1995, SPEECH COMMUN, V17, P227, DOI 10.1016/0167-6393(95)00024-I PEPPER DJ, 1993, AVIOS P VELIUS G, 1990, P ICSLP 90, P865 NR 6 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1995 VL 17 IS 3-4 BP 303 EP 311 DI 10.1016/0167-6393(95)00023-H PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100009 ER PT J AU MAZOR, B ZEIGLER, BL AF MAZOR, B ZEIGLER, BL TI THE DESIGN OF SPEECH-INTERACTIVE DIALOGS FOR TRANSACTION-AUTOMATION SYSTEMS SO SPEECH COMMUNICATION LA English DT Article DE SPEECH RECOGNITION; DIALOG; SPEECH SYSTEMS AB The use of automatic speech recognition (ASR) technology to automate telephone-service transactions provides opportunities for significant operational cost savings, as well as for improvements in the quality of customer service. At GTE, our focus has been on developing speech-interactive systems, named OASIS, for service-order transactions. In this paper, we discuss our methodology for developing dialogs, describe the dialog developed for our service-disconnect application, and present data from a field trial of the disconnect system. Our design methodology addresses development of a transaction model, specification of dialog flow, design of language structures, and construction of system speech. In addition, following our approach, dialog flow and recognition vocabularies are developed in the context of the system recognition and interpretation capabilities. Primary attributes of our methodology include the following: structured representation of conversational transactions as a progressive flow of information elements; query language that follows the style of discourse to elicit predictable responses; use of recognition outcomes and corresponding system actions to define dialog flow; and generation of scalable solutions. To evaluate the effectiveness of our approach, we present results from the service-disconnect field trial. In general, users responded cooperatively, adhered to the structured interaction, rarely anticipated future information, and provided on-target answers to system queries. RP MAZOR, B (reprint author), GTE LABS INC, WALTHAM, MA 02254 USA. CR BASSON S, 1989, SEP P AVIOS 89 NEWP, P271 MAZOR B, 1994, SEP P ICSLP, V2, P679 MAZOR B, 1992, SEP P AVIOS 92, P187 MURPHY M, 1991, IEEE COMMUN MAG, V29, P25, DOI 10.1109/35.64720 YASHCHIN D, 1991, 2ND P EUR C SPEECH C, V2, P727 ZEIGLER BL, 1995, 4TH P EUR C SPEECH C, V3, P1955 NR 6 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1995 VL 17 IS 3-4 BP 313 EP 320 DI 10.1016/0167-6393(95)00034-L PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100010 ER PT J AU MATSUMURA, T MATSUNAGA, S AF MATSUMURA, T MATSUNAGA, S TI NONUNIFORM UNIT BASED HMMS FOR CONTINUOUS SPEECH RECOGNITION SO SPEECH COMMUNICATION LA English DT Article DE NONUNIFORM UNIT HMMS; LONG-UNIT AB A novel acoustic modeling algorithm that generates non-uniform unit HMMs to effectively cope with spectral variations in fluent speech is proposed. 
The algorithm is devised for the automatic iterative generation of long-span units for non-uniform modeling. This generation algorithm is based on an entropy reduction criterion using text data and a maximum likelihood criterion using speech data. The effectiveness of the non-uniform unit models is confirmed by comparing likelihood values between long-span unit HMMs and conventional phoneme-unit HMMs. Results of classification tests showed that the non-uniform unit HMMs provide more precise representation than do conventional phoneme-unit HMMs, and preliminary phrase recognition tests suggest that non-uniform unit HMMs achieve higher performance than phoneme-unit HMMs. C1 ATR INTERPRETING TELEPHONY RES LABS, KYOTO, JAPAN. CR ARIKI Y, 1994, P INT C ACOUST SPEEC, P253 KITA K, 1989, P ICASSP 89, P703 MOORE R, 1993, P IEEE ASR WORKSHOP, P16 MORIMOTO T, 1993, P EUROSPEECH 93, P573 OHKURA K, 1992, P ICSLP 92, P369 Schwartz R., 1985, P ICASSP 85, P1205 SINGER H, 1994, P INT C ACOUST SPEEC, P149 TAKAMI J, 1992, P INT C AC SPEECH SI, V1, P573 TAMOTO M, 1992, IEICE TECHNICAL REPO, V92 NR 9 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1995 VL 17 IS 3-4 BP 321 EP 329 DI 10.1016/0167-6393(95)00031-I PG 9 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100011 ER PT J AU BIMBOT, F CHOLLET, GFA PAOLONI, A AF BIMBOT, F CHOLLET, GFA PAOLONI, A TI SPECIAL SECTION ON AUTOMATIC SPEAKER RECOGNITION, IDENTIFICATION AND VERIFICATION (VOL 17, PG 77, 1995) SO SPEECH COMMUNICATION LA English DT Correction RP BIMBOT, F (reprint author), ECOLE NATL SUPER TELECOMMUN BRETAGNE, SPEECH GRP, CNRS, PARIS, FRANCE. CR BIMBOT F, 1995, SPEECH COMMUN, V17, P77, DOI 10.1016/0167-6393(95)90046-2 NR 1 TC 0 Z9 0 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD NOV PY 1995 VL 17 IS 3-4 BP 331 EP 333 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA TH481 UT WOS:A1995TH48100012 ER PT J AU GLASS, J FLAMMIA, G GOODINE, D PHILLIPS, M POLIFRONI, J SAKAI, S SENEFF, S ZUE, V AF GLASS, J FLAMMIA, G GOODINE, D PHILLIPS, M POLIFRONI, J SAKAI, S SENEFF, S ZUE, V TI MULTILINGUAL SPOKEN-LANGUAGE UNDERSTANDING IN THE MIT VOYAGER SYSTEM SO SPEECH COMMUNICATION LA English DT Article DE SPOKEN-LANGUAGE SYSTEMS; SPEECH UNDERSTANDING; MULTILINGUAL; SPONTANEOUS SPEECH AB This paper describes our recent work in developing multilingual spoken language systems that support human-computer interactions. Our approach is based on the premise that a common semantic representation can be extracted from the input for all languages, at least within the context of restricted domains. In our design of such systems, language dependent information is separated from the system kernel as much as possible, and encoded in external data structures. The internal system manager, discourse and dialogue component, and database are all maintained in a language transparent form. Our description will focus on the development of the multilingual MIT Voyager spoken language system, which can engage in verbal dialogues with users about a geographical region within Cambridge, MA in the USA. 
The system can provide information about distances, travel times or directions between objects located within this area (e.g., restaurants, hotels, banks, libraries), as well as information such as the addresses, telephone numbers or location of the objects themselves. Voyager has been fully ported to Japanese and Italian, and we are in the process of porting to French and German as well. Evaluations for the English, Japanese and Italian systems are reported. Other related multilingual research activities are also briefly mentioned. RP GLASS, J (reprint author), MIT, COMP SCI LAB, SPOKEN LANGUAGE SYST GRP, CAMBRIDGE, MA 02139 USA. CR BLOMBERG M, 1993, P EUR C SPEECH COMM, P1867 Brown P. F., 1992, Computational Linguistics, V18 CLEMENTINO D, 1993, P EUROSPEECH 93, P1863 ECKERT W, 1993, P EUR C SPEECH TECHN, P1871 FLAMMIA G, 1994, P 1994 INT C SPOK LA, P911 Glass J., 1994, P ICSLP YOK JAP, P983 GLASS J, 1993, P 3 EUR C SPEECH COM, P2063 Goddeau D., 1994, P INT C SPOK LANG PR, P707 Hazen T. J., 1994, P ICSLP 94, P1883 HETHERINGTON L, 1993, P EUROSPEECH 93, P2121 Hutchins W. John, 1992, INTRO MACHINE TRANSL ITOU K, 1992, P INT C ACOUST SPEEC, P21 KUBALA F, 1992, P SPEECH NATURAL LAN, P72, DOI 10.3115/1075527.1075544 MORIMOTO T, 1993, P EUROSPEECH 93 BERL, P1291 OERDER M, 1994, P ICSLP 94, P703 PALLETT D, 1993, P DARPA SPEECH NATUR, P14 Peckham J., 1991, P WORKSH SPEECH NAT, P14, DOI 10.3115/112405.112408 PHILLIPS M, 1991, P EUROSPEECH 91, P577 POLIFRONI J, 1991, P DARPA SPEECH NATUR, P360, DOI 10.3115/112405.112481 Roe D., 1991, P EUROSPEECH 91, P1063 SAKAI S, 1993, P EUROSPEECH 93, P2151 SENEFF J, 1991, P DARPA SPEECH NATUR, P88 Seneff S., 1992, Computational Linguistics, V18 Wahlster W., 1993, P 3 EUR C SPEECH COM, P29 Waibel A., 1991, P ICASSP 91, P793, DOI 10.1109/ICASSP.1991.150456 ZUE V, 1989, P DARPA SPEECH NATUR, P160, DOI 10.3115/1075434.1075460 ZUE V, 1989, P DARPA SPEECH NATUR, P126, DOI 10.3115/1075434.1075456 ZUE V, 1989, P DARPA SPEECH NATUR, P51, DOI 10.3115/1075434.1075444 ZUE V, 1990, P ICSLP, P1317 ZUE V, 1990, SPEECH COMMUN, V9, P351, DOI 10.1016/0167-6393(90)90010-7 ZUE V, 1993, P INT S SPOKEN DIALO Zue V., 1992, P DARPA SPEECH NATUR, P84, DOI 10.3115/1075527.1075546 Zue V. W., 1989, P DARPA SPEECH NAT L, P179, DOI 10.3115/100964.100983 NR 33 TC 37 Z9 37 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1995 VL 17 IS 1-2 BP 1 EP 18 DI 10.1016/0167-6393(95)00008-C PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500001 ER PT J AU STEINBISS, V NEY, H ESSEN, U TRAN, BH AUBERT, X DUGAST, C KNESER, R MEIER, HG OERDER, M HAEBUMBACH, R GELLER, D HOLLERBAUER, W BARTOSIK, H AF STEINBISS, V NEY, H ESSEN, U TRAN, BH AUBERT, X DUGAST, C KNESER, R MEIER, HG OERDER, M HAEBUMBACH, R GELLER, D HOLLERBAUER, W BARTOSIK, H TI CONTINUOUS SPEECH DICTATION - FROM THEORY TO PRACTICE SO SPEECH COMMUNICATION LA English DT Article DE CONTINUOUS SPEECH RECOGNITION; LARGE VOCABULARY RECOGNITION; ACOUSTIC MODEL; HIDDEN MARKOV MODEL (HMM); LANGUAGE MODEL; SEARCH; DICTATION ID RECOGNITION AB This paper gives an overview of the Philips research system for phoneme-based, large-vocabulary, continuous-speech recognition. The system has been successfully applied to various tasks in the German and (American) English languages, ranging from small vocabulary tasks to very large vocabulary tasks. 
Here, we concentrate on continuous-speech recognition for dictation in real applications, the dictation of legal reports and radiology reports in German. We describe this task and report on experimental results. We also describe a commercial PC-based dictation system which includes a PC implementation of our scientific recognition prototype. In order to allow for a comparison with the performance of other systems, a section with an evaluation on the standard Wall Street Journal task (dictation of American English newspaper text) is supplied. The recognition architecture is based on an integrated statistical approach. We describe the characteristic features of the system as opposed to other systems: 1. the Viterbi criterion is consistently applied both in training and testing; 2. continuous mixture densities are used without tying or smoothing; 3. time-synchronous beam search in connection with a phoneme look-ahead is applied to a tree-organized lexicon. C1 PHILIPS DICTAT SYST, A-1102 VIENNA, AUSTRIA. RP STEINBISS, V (reprint author), PHILIPS GMBH, FORSCHUNGSLAB AACHEN, D-52066 AACHEN, GERMANY. CR AUBERT X, 1993, 1993 P INT C AC SPEE, P648 AUBERT X, 1994, P IEEE INT C AC SPEE, V2, P129 AUST H, 1994, 2ND P IVTTA 94 WORKS, P141 AUST H, 1994, 2ND P IVTTA 94 WORKS, P67 BAKER JK, 1975, SPEECH RECOGNITION, P512 BESLING S, 1994, 1994 P KONVENS, P23 DOBLER S, 1993, SPEECH COMMUN, V12, P221, DOI 10.1016/0167-6393(93)90092-Y Duda R. O., 1973, PATTERN CLASSIFICATI GOOD IJ, 1953, BIOMETRIKA, V40, P237, DOI 10.2307/2333344 HAEBUMBACH R, 1993, 1993 P INT C AC SPEE, P239 HAEBUMBACH R, 1991, 1991 P EUR C SPEECH, P495 HAEBUMBACH R, 1992, 1992 P INT C AC SPEE, P13 HUNT MJ, 1989, 1989 P IEEE INT C AC, P262 JELINEK F, 1976, P IEEE, V64, P532, DOI 10.1109/PROC.1976.10159 JELINEK F, 1992, ADV SPEECH SIGNAL PR, P651 KATZ SM, 1987, IEEE T ACOUST SPEECH, V35, P400, DOI 10.1109/TASSP.1987.1165125 KNESER R, 1993, 1993 P EUR C SPEECH, P973 LEVINSON SE, 1983, AT&T TECH J, V62, P1035 MEIER HG, 1994, UNPUB LEAVNING M SAM NEY H, 1992, 1992 P INT C AC SPEE, P9 Ney H., 1994, International Journal of Pattern Recognition and Artificial Intelligence, V8, DOI 10.1142/S0218001494000036 NEY H, 1992, IEEE T SIGNAL PROCES, V40, P272, DOI 10.1109/78.124938 NEY H, 1994, COMPUT SPEECH LANG, V8, P1, DOI 10.1006/csla.1994.1001 NEY H, 1990, 5TH P EUSIPCO 90 EUR, P65 NEY H, 1993, 1993 P EUR C SPEECH, P491 NEY H, 1991, 1991 P INT C AC SPEE, P825 NEY H, 1993, IN PRESS 1993 P NATO OERDER M, 1993, P IEEE INT C AC SPEE, V2, P119 OERDER M, 1994, 1994 P ICSLP 94 INT, P703 PAUL D, 1992, DARPA SPEECH LANGUAG RUEHL HW, 1991, SPEECH COMMUN, V10, P11, DOI 10.1016/0167-6393(91)90024-N Schwartz R., 1991, P IEEE INT C AC SPEE, P701, DOI 10.1109/ICASSP.1991.150436 STEINBISS V, 1994, 1994 P ICSLP INT C S, P2143 NR 33 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. 
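A minimal sketch of the time-synchronous beam search over a tree-organized lexicon described in the Steinbiss et al. record above: all hypotheses are extended frame by frame and pruned against the best partial score. The toy lexicon, scores and beam width are invented, and the phoneme look-ahead and language model from the paper are omitted.

```python
# Toy tree-organized lexicon: each node maps a phoneme to a child node;
# a node may also complete a word. Words sharing prefixes share arcs.
LEXICON = {"k": {"ae": {"t": {"WORD": "cat"}, "b": {"WORD": "cab"}}},
           "d": {"ao": {"g": {"WORD": "dog"}}}}

def beam_search(frame_scores, beam=3):
    """Time-synchronous search: frame_scores[t][ph] is a toy log-likelihood
    of phoneme ph at frame t (one phoneme per frame for simplicity)."""
    hyps = [(0.0, LEXICON, None)]          # (score, tree node, word if complete)
    for scores in frame_scores:
        new_hyps = []
        for score, node, _ in hyps:
            for ph, child in node.items():
                if ph == "WORD":
                    continue
                s = score + scores.get(ph, -1e9)
                new_hyps.append((s, child, child.get("WORD")))
        new_hyps.sort(key=lambda h: -h[0])
        hyps = new_hyps[:beam]             # beam pruning
    done = [h for h in hyps if h[2]]
    return max(done, key=lambda h: h[0]) if done else None

if __name__ == "__main__":
    frames = [{"k": -1.0, "d": -2.0}, {"ae": -0.5, "ao": -1.5},
              {"t": -0.8, "b": -1.2, "g": -2.5}]
    best = beam_search(frames)
    print(best[2], round(best[0], 2))      # -> best word and its score
```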
PD AUG PY 1995 VL 17 IS 1-2 BP 19 EP 38 DI 10.1016/0167-6393(95)00012-D PG 20 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500002 ER PT J AU DE, A KABAL, P AF DE, A KABAL, P TI AUDITORY DISTORTION MEASURE FOR SPEECH CODER EVALUATION - HIDDEN MARKOVIAN APPROACH SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT Canadian-Acoustical-Association Symposium CY OCT, 1993 CL TORONTO, CANADA SP Canadian Acoust Assoc DE AUDITORY (COCHLEAR) MODEL; NEURAL FIRING MECHANISM; HIDDEN MARKOV MODEL; CODED SPEECH QUALITY; DISTORTION MEASURE AB This article introduces a methodology for quantifying the distortion introduced by a low or medium bit-rate speech coder. Since the perceptual acuity of a human being determines the precision with which speech data must be processed, the speech signal is transformed into a perceptual domain (PD). This is done using Lyon's cochlear (auditory) model whose output provides the probability-of-firing information in the neural channels at different clock times. In our present approach, we use a hidden Markov model to describe the basic firing/non-firing process operative in the auditory pathway. We consider a two-state fully-connected model of order one for each neural channel; the two states of the model correspond to the firing and non-firing events. Assuming that the models are stationary over a fixed duration, the model parameters are determined from the PD observations corresponding to the original signal. Then, the PD representations of the coded speech are passed through the respective models and the corresponding likelihood probabilities are calculated. These probability scores are used to define a cochlear hidden Markovian (CHM) distortion measure. This methodology considers the temporal ordering in the neural firing patterns. The CHM measure, which utilizes the contextual information present in the firing pattern, shows robustness against coder delays. C1 UNIV QUEBEC, INRS TELECOMMUN, VERDUN, PQ H3H 1H6, CANADA. RP DE, A (reprint author), MCGILL UNIV, DEPT ELECT ENGN, 3480 UNIV ST, MONTREAL, PQ H3A 2A7, CANADA. CR Baum L. E., 1972, INEQUALITIES, V3, P1 BAUM LE, 1967, B AM MATH SOC, V73, P360, DOI 10.1090/S0002-9904-1967-11751-8 BAUM LE, 1966, ANN MATH STAT, V37, P1554, DOI 10.1214/aoms/1177699147 CAVE RL, 1980, HIDDEN MARKOV IDACRD, P16 CHANG RW, 1966, IEEE T INFORM THEORY, V12, P463, DOI 10.1109/TIT.1966.1053923 DE A, 1993, THESIS MCGILL U DE A, 1994, SPEECH COMMUN, V14, P205, DOI 10.1016/0167-6393(94)90063-9 DE A, 1993, CANADIAN ACOUSTI SEP, P105 JELINEK F, 1976, P IEEE, V64, P532, DOI 10.1109/PROC.1976.10159 JUANG BH, 1985, BELL SYST TECHN JUL, P1235 JUANG BH, 1984, BELL SYST TECHN SEP, P1213 LEVINSON SE, 1983, AT&T TECH J, V62, P1035 LIPORACE LA, 1984, IEEE T INFORM THEORY, V28, P729 RABINER LR, 1985, BELL SYST TECHN JUL, P1211 RABINER LR, 1985, BELL SYST TECHN JUL, P1251 SLANEY M, 1988, 15 APPL COMP INC TEC NR 16 TC 2 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
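The De and Kabal record above scores coded speech against two-state (firing/non-firing) hidden Markov models fitted per neural channel. A minimal sketch of the forward log-likelihood for one such channel follows; the Bernoulli emissions, the toy parameters and the final combination of likelihood scores into a per-frame distortion are assumptions, indicated only in spirit.

```python
import numpy as np

def forward_loglik(obs, trans, emit, init):
    """Scaled forward algorithm for a fully-connected 2-state HMM with
    Bernoulli emissions; obs is a 0/1 firing sequence for one channel."""
    alpha = init * emit[:, obs[0]]
    loglik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ trans) * emit[:, o]
        c = alpha.sum()
        loglik += np.log(c)
        alpha /= c
    return loglik

if __name__ == "__main__":
    trans = np.array([[0.9, 0.1], [0.4, 0.6]])   # toy state transitions
    emit = np.array([[0.95, 0.05], [0.2, 0.8]])  # P(obs | state), Bernoulli
    init = np.array([0.5, 0.5])
    original = [0, 0, 1, 1, 1, 0, 0, 1]
    coded = [0, 1, 0, 1, 0, 1, 0, 0]
    # CHM-style distortion (sketch): drop in per-frame log-likelihood when
    # the coded firing pattern is scored by the model fitted to the original.
    d = (forward_loglik(original, trans, emit, init)
         - forward_loglik(coded, trans, emit, init)) / len(original)
    print(d)
```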
PD AUG PY 1995 VL 17 IS 1-2 BP 39 EP 57 DI 10.1016/0167-6393(95)00016-H PG 19 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500003 ER PT J AU LEE, HS TSOI, AC AF LEE, HS TSOI, AC TI APPLICATION OF MULTILAYER PERCEPTRON IN ESTIMATING SPEECH NOISE CHARACTERISTICS FOR SPEECH RECOGNITION IN NOISY ENVIRONMENT SO SPEECH COMMUNICATION LA English DT Article DE ISOLATED DIGIT RECOGNITION; MMSE ESTIMATORS; NOISY SPEECH RECOGNITION; MULTILAYER PERCEPTRONS ID SPECTRAL AMPLITUDE ESTIMATOR AB In this paper, we will consider the problem of speech recognition under noisy conditions. Multilayer perceptron (MLP) based estimators of filter-bank channel outputs are tested on a speaker-independent isolated digit recognition task in white noise conditions. Local statistical information of the speech and the noise is estimated online and used as inputs to the estimators. The results are comparable to those obtained in the clean condition at a moderate signal-to-noise ratio (SNR) of 20 dB. Substantial improvement is also obtained at lower SNRs. By carefully studying the results, it is noted that the MLP-based estimators appear to perform poorly when there is virtually no detectable significant speech activity. As a result, a modified gain function is introduced. This improves the performance even further. RP LEE, HS (reprint author), UNIV QUEENSLAND, DEPT ELECTR & COMP ENGN, ST LUCIA, QLD 4072, AUSTRALIA. CR BEROUTI M, 1979, APR P IEEE INT C AC, P208 BOLL SF, 1979, IEEE T ACOUST SPEECH, V27, P113, DOI 10.1109/TASSP.1979.1163209 DAUTRICH BA, 1983, BELL SYSTEM TECHN J EPHRAIM Y, 1985, IEEE T ACOUST SPEECH, V33, P443, DOI 10.1109/TASSP.1985.1164550 EPHRAIM Y, 1993, 1993 P IEEE INT C AC, V2, P355 EPHRAIM Y, 1984, IEEE T ACOUST SPEECH, V32, P1109, DOI 10.1109/TASSP.1984.1164453 ERELL A, 1993, IEEE T SPEECH AUDIO HORNIK K, 1990, NEURAL NETWORKS, V3, P551, DOI 10.1016/0893-6080(90)90005-6 LOCKWOOD P, 1992, SPEECH COMMUN, V11, P215, DOI 10.1016/0167-6393(92)90016-Z RABINER LR, 1989, P IEEE, V77, P257, DOI 10.1109/5.18626 XIE F, 1993, 1993 EUR XIE F, 1994, IEEE P INT C ACOUST, V2 NR 12 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1995 VL 17 IS 1-2 BP 59 EP 76 DI 10.1016/0167-6393(95)00018-J PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500004 ER PT J AU BIMBOT, F CHOLLET, G PAOLONI, A AF BIMBOT, F CHOLLET, G PAOLONI, A TI SPECIAL SECTION ON AUTOMATIC SPEAKER RECOGNITION, IDENTIFICATION AND VERIFICATION SO SPEECH COMMUNICATION LA English DT Editorial Material C1 FDN UGO BORDONI, SPEECH PROC GRP, ROME, ITALY. RP BIMBOT, F (reprint author), ECOLE NATL SUPER TELECOMMUN BRETAGNE, SPEECH GRP, CNRS, PARIS, FRANCE. NR 0 TC 1 Z9 1 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun.
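A minimal sketch in the spirit of the Lee and Tsoi record above: a small MLP maps a noisy filter-bank channel value plus local noise statistics to a clean estimate, and a floor on the output stands in for the modified gain function that prevents collapse when little speech activity is present. The weights are random and untrained, and the network shape, inputs and floor rule are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

rng = np.random.default_rng(2)

class TinyMLP:
    """One-hidden-layer perceptron; untrained random weights for illustration."""
    def __init__(self, n_in, n_hidden):
        self.w1 = rng.normal(0, 0.5, (n_in, n_hidden))
        self.w2 = rng.normal(0, 0.5, (n_hidden, 1))

    def __call__(self, x):
        h = np.tanh(x @ self.w1)
        return (h @ self.w2).ravel()

def enhance_channel(noisy, noise_mean, noise_std, mlp, min_gain=0.1):
    """Estimate the clean filter-bank output from the noisy value and local
    noise statistics, then apply a gain floor so the estimator cannot drive
    the channel to zero during speech pauses (sketch of the modified gain)."""
    feats = np.stack([noisy, np.full_like(noisy, noise_mean),
                      np.full_like(noisy, noise_std)], axis=1)
    est = mlp(feats)
    return np.maximum(est, min_gain * noisy)

if __name__ == "__main__":
    mlp = TinyMLP(n_in=3, n_hidden=8)
    noisy = np.abs(rng.normal(1.0, 0.5, 10))     # toy channel magnitudes
    print(enhance_channel(noisy, noise_mean=0.4, noise_std=0.1, mlp=mlp))
```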
PD AUG PY 1995 VL 17 IS 1-2 BP 77 EP 79 DI 10.1016/0167-6393(95)90046-2 PG 3 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500005 ER PT J AU DEVETH, J BOURLARD, H AF DEVETH, J BOURLARD, H TI COMPARISON OF HIDDEN MARKOV MODEL TECHNIQUES FOR AUTOMATIC SPEAKER VERIFICATION IN REAL-WORLD CONDITIONS SO SPEECH COMMUNICATION LA English DT Article DE SPEAKER VERIFICATION; TELEPHONE SPEECH; LIMITED TRAINING DATA; TIED MULTI-GAUSSIAN HMMS; SINGLE GAUSSIAN HMMS AB In this paper, we compare two alternative approaches for speaker verification based on hidden Markov model (HMM) technology: single Gaussian HMMs and different types of tied multi-Gaussian HMMs. In order to assess the performance under real-world constraints, we tested each system using a database of connected digit strings recorded over local and long-distance telephone lines. According to our experiments, tied-mixture models were able to perform better than the single Gaussian approach provided that sufficient training data were available. However, our experiments indicate that the single Gaussian HMM approach is to be preferred for real-world speaker verification when only limited amounts of training data are available. Results are discussed for both text-dependent and text-independent speaker verification. C1 INT COMP SCI INST, BERKELEY, CA 94704 USA. FAC POLYTECH MONS, B-7000 MONS, BELGIUM. RP DEVETH, J (reprint author), LERNOUT & HAUSPIE SPEECH PROD, ST KRISPIJNST 7, B-8900 IEPER, BELGIUM. CR DEVETH J, 1993, 1993 P INT C AC SPEE, P247 DEVETH J, 1993, 1993 P EUR 93 BERL, P2279 FORSYTH ME, 1993, SPEECH COMMUN, V13, P411, DOI 10.1016/0167-6393(93)90039-N GODFREY J, 1994, 1994 P ESCA WORKSH A, P39 HERMANSKY H, 1991, 1991 P EUR C SPEECH, P1367 Higgins A., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90098-6 HUANG XD, 1989, 1989 P EUR 89 PAR, P163 Li K. P., 1988, P IEEE INT C AC SPEE, V1, P595 MATSUI T, 1993, 1993 P INT C AC SPEE, P391 MATSUI T, 1992, 1992 P ICSLP 92 BANF, P603 Rosenberg A. E., 1990, P IEEE INT C AC SPEE, P269 ROSENBERG AE, 1992, 1992 P ICSLP 92 BANF, P599 TSENG BL, 1992, 1992 P INT C AC SPEE, P161 NR 13 TC 5 Z9 5 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1995 VL 17 IS 1-2 BP 81 EP 90 DI 10.1016/0167-6393(95)00015-G PG 10 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500006 ER PT J AU REYNOLDS, DA AF REYNOLDS, DA TI SPEAKER IDENTIFICATION AND VERIFICATION USING GAUSSIAN MIXTURE SPEAKER MODELS SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Automatic Speaker Recognition, Identification and Verification CY APR 05-07, 1994 CL MARTIGNY, SWITZERLAND SP European Speech Commun Assoc DE AUTOMATIC SPEAKER IDENTIFICATION AND VERIFICATION; TEXT-INDEPENDENT; VOCABULARY-DEPENDENT; GAUSSIAN MIXTURE SPEAKER MODELS; TIMIT; NTIMIT; SWITCHBOARD; YOHO AB This paper presents high performance speaker identification and verification systems based on Gaussian mixture speaker models: robust, statistically based representations of speaker identity. The identification system is a maximum likelihood classifier and the verification system is a likelihood ratio hypothesis tester using background speaker normalization. The systems are evaluated on four publicly available speech databases: TIMIT, NTIMIT, Switchboard and YOHO.
The different levels of degradation and variability found in these databases allow the examination of system performance for different task domains. Constraints on the speech range from vocabulary-dependent to extemporaneous, and speech quality varies from near-ideal, clean speech to noisy, telephone speech. Closed set identification accuracies on the 630 speaker TIMIT and NTIMIT databases were 99.5% and 60.7%, respectively. On a 113 speaker population from the Switchboard database, the identification accuracy was 82.8%. Global threshold equal error rates of 0.24%, 7.19%, 5.15% and 0.51% were obtained in verification experiments on the TIMIT, NTIMIT, Switchboard and YOHO databases, respectively. RP REYNOLDS, DA (reprint author), MIT, LINCOLN LAB, 244 WOOD ST, LEXINGTON, MA 02173 USA. CR ARONS BM, 1994, THESIS MIT BAHLER LG, 1994, 1994 P INT C AC SPEE, P321 CAMPBELL JP, 1992, THESIS OKLAHOMA STAT CAMPBELL JP, 1995, INT CONF ACOUST SPEE, P341, DOI 10.1109/ICASSP.1995.479543 DEMPSTER AP, 1977, J ROY STAT SOC B MET, V39, P1 DODDINGTON GR, 1985, P IEEE, V73, P1651, DOI 10.1109/PROC.1985.13345 FISHER W, 1986, 1986 P DARPA SPEECH, P93 FLOCH JL, 1994, 1994 P INT C AC SPEE, P149 GILLICK L, 1993, 1993 P INT C AC SPEE, P471 Gish H, 1994, IEEE SIGNAL PROC MAG, V11, P18, DOI 10.1109/79.317924 GODFREY JJ, 1992, 1992 P INT C AC SPEE, P517 HIGGINS A, 1993, 1993 P INT C AC SPEE, P375 HIGGINS A, 1992, YOHO SPEAKE AUTHENTI Higgins A., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90098-6 JANKOWSKI C, 1990, 1990 P INT C AC SPEE, P109 LIOU HS, 1995, INT CONF ACOUST SPEE, P357 NAIK J, 1987, 1987 P INT C AC SPEE, P2392 NAIK J, 1989, 1989 P INT C AC SPEE, P524 REYNOLDS DA, 1995, INT CONF ACOUST SPEE, P329, DOI 10.1109/ICASSP.1995.479540 REYNOLDS DA, 1994, 1994 P SPIE C AUT SY Reynolds DA, 1994, IEEE T SPEECH AUDI P, V2, P639, DOI 10.1109/89.326623 REYNOLDS DA, 1992, 1992 P INT C SIGN PR, P967 REYNOLDS DA, 1995, IEEE T SPEECH AUDI P, V3, P72, DOI 10.1109/89.365379 Reynolds D.A., 1992, THESIS GEORGIA I TEC Rose R. C., 1990, P ICASSP, P293 ROSENBERG AE, 1992, NOV INT C SPEECH LAN, P599 SCHMANDT C, 1984, IEEE T CONSUM ELECTR, V30, pR21, DOI 10.1109/TCE.1984.354042 WILCOX L, 1994, P ICASSP 94, V1, P161 NR 28 TC 447 Z9 472 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1995 VL 17 IS 1-2 BP 91 EP 108 DI 10.1016/0167-6393(95)00009-D PG 18 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500007 ER PT J AU MATSUI, T FURUI, S AF MATSUI, T FURUI, S TI LIKELIHOOD NORMALIZATION FOR SPEAKER VERIFICATION USING A PHONEME-INDEPENDENT AND SPEAKER-INDEPENDENT MODEL SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Automatic Speaker Recognition, Identification and Verification CY APR 05-07, 1994 CL MARTIGNY, SWITZERLAND SP European Speech Commun Assoc DE LIKELIHOOD NORMALIZATION; SPEAKER VERIFICATION; PHONEME- AND SPEAKER-INDEPENDENT MODEL; A POSTERIORI PROBABILITY AB This paper proposes two methods for creating a phoneme- and speaker-independent model that greatly reduce the amount of calculation needed for similarity (or likelihood) normalization in speaker verification. For each input utterance, these methods only need to calculate the likelihood against a single model instead of against the models of all the reference speakers as in the conventional methods.
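The preceding Reynolds abstract describes its two operating modes concisely enough to sketch. Below is a toy rendering with scikit-learn Gaussian mixtures: maximum-likelihood identification, and likelihood-ratio verification with background speaker normalization. Front-end feature extraction (e.g. mel-cepstra) is assumed already done; the data, mixture sizes and threshold are illustrative, not the paper's settings.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Toy "cepstral" data: 3 speakers, 500 frames each, 12-dimensional.
train = {s: rng.normal(loc=s, scale=1.0, size=(500, 12)) for s in range(3)}

models = {s: GaussianMixture(n_components=8, covariance_type='diag',
                             random_state=0).fit(x)
          for s, x in train.items()}

def identify(frames):
    # Maximum-likelihood classification: average frame log-likelihood per model.
    scores = {s: m.score(frames) for s, m in models.items()}
    return max(scores, key=scores.get)

def verify(frames, claimed, background, threshold=0.0):
    # Likelihood ratio with background speaker normalization: the claimed
    # model's score against the best-scoring background speaker model.
    ratio = models[claimed].score(frames) - max(models[b].score(frames)
                                                for b in background)
    return ratio > threshold

test = rng.normal(loc=2, scale=1.0, size=(200, 12))  # frames from speaker 2
print(identify(test))                                # -> 2 expected
print(verify(test, claimed=2, background=[0, 1]))    # -> True expected
```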
In addition, the new methods perform as well as or better than the conventional methods. Speaker verification is tested by using separate populations of customers and impostors in order to evaluate performance under practical conditions. The speaker (and text) verification error rates are roughly 1.5 times higher than when the same population is used for both customers and impostors. Using 15 customers and a separate group of 15 impostors, a speaker verification error rate of 1.8% for text-independent verification and a speaker-and-text verification error rate of 1.1% for text-prompted verification were obtained after normalization. The latter error rate was about half of that achieved by the original method. RP MATSUI, T (reprint author), NIPPON TELEGRAPH & TEL PUBL CORP, HUMAN INTERFACE LABS, 3-9-11 MIDORI CHO, MUSASHINO, TOKYO 180, JAPAN. CR Baum L. E., 1972, INEQUALITIES, V3, P1 CAREY MJ, 1992, P I AC, V14, P96 DEVETH J, 1993, P INT C ACOUST SPEEC, P247 Higgins A., 1991, Digital Signal Processing, V1, DOI 10.1016/1051-2004(91)90098-6 Li K. P., 1988, P IEEE INT C AC SPEE, V1, P595 MATSUI T, 1992, P INT C AC SPEECH SI, V2, P157 MATSUI T, 1993, P ICASSP 93, V2, P391 MATSUI T, 1994, P INT C ACOUST SPEEC Rosenberg A. E., 1992, P INT C SPOK LANG PR, P599 NR 9 TC 31 Z9 31 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1995 VL 17 IS 1-2 BP 109 EP 116 DI 10.1016/0167-6393(95)00011-C PG 8 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500008 ER PT J AU FORSYTH, M AF FORSYTH, M TI DISCRIMINATING OBSERVATION PROBABILITY (DOP) HMM FOR SPEAKER VERIFICATION SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Automatic Speaker Recognition, Identification and Verification CY APR 05-07, 1994 CL MARTIGNY, SWITZERLAND SP European Speech Commun Assoc DE SPEAKER VERIFICATION; SEMICONTINUOUS HIDDEN MARKOV MODELS (HMM); SPEAKER NORMALIZATION; COHORT NORMALIZATION; TELEPHONE SPEECH; DISCRIMINATING OBSERVATION PROBABILITY (DOP) AB This paper describes the use of a multiple codebook semi-continuous hidden Markov model (SCHMM) automatic speaker verification (ASV) system, which uses a novel technique for discriminative hidden Markov modelling known as discriminative observation probabilities (DOP). DOP is not a discriminative training technique; it is a method of constructing what is effectively a discriminating model by contrasting two standard HMMs so as to improve discrimination between the classes that those models represent. This paper experimentally evaluates the use of DOP HMMs for ASV. The experimental evaluation is based on a text-dependent task using isolated digits. The database contains 24 true (client) speakers and 100 casual impostors, recorded over the public telephone network in the United Kingdom. For speaker verification, DOP has similarities to the so-called speaker normalisation technique, but it is a more flexible approach, and experimental results presented here show it to be superior, even in its simplest form. An argument is also made for preferring DOP models to discriminatively trained models, although this is not evaluated experimentally. The DOP technique can easily be added to an existing HMM system and requires no additional training. The DOP concept can potentially be applied to other two-class classification problems addressed with HMMs.
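A minimal sketch of the normalization idea in the Matsui & Furui record above, with GMMs standing in for the paper's models: a single phoneme- and speaker-independent model trained on pooled data replaces scoring against every reference speaker, so each verification needs exactly two model evaluations. Data, model sizes and the zero threshold are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
pooled = rng.normal(size=(5000, 12))             # pooled multi-speaker data
speaker = rng.normal(0.5, 1.0, size=(800, 12))   # claimed speaker's enrolment data

world = GaussianMixture(n_components=16, covariance_type='diag',
                        random_state=0).fit(pooled)   # one shared background model
claimed = GaussianMixture(n_components=8, covariance_type='diag',
                          random_state=0).fit(speaker)

def normalized_score(frames):
    # One background evaluation replaces scoring against all reference speakers.
    return claimed.score(frames) - world.score(frames)

test = rng.normal(0.5, 1.0, size=(300, 12))
print("accept" if normalized_score(test) > 0.0 else "reject")
```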
An equal error rate (EER) of 0.21% (using speaker-specific thresholds, on a sequence of 12 isolated digits) was obtained using multiple codebook DOP models. This represents a reduction in EER of 38% compared to a common speaker normalisation technique. RP FORSYTH, M (reprint author), UNIV EDINBURGH, CTR SPEECH TECHNOL RES, 80 S BRIDGE, EDINBURGH EH1 1HN, MIDLOTHIAN, SCOTLAND. CR Bennani Y., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification CAREY MJ, 1991, INT CONF ACOUST SPEE, P397, DOI 10.1109/ICASSP.1991.150360 DEVETH J, 1993, IEEE T ACOUST SPEECH, V2, P247, DOI 10.1109/ICASSP.1993.319281 FORSYTH M, 1994, IEEE T ACOUST SPEECH, V1, P313 FORSYTH ME, 1995, THESIS U EDINBURGH Gillick L., 1989, ICASSP-89: 1989 International Conference on Acoustics, Speech and Signal Processing (IEEE Cat. No.89CH2673-2), DOI 10.1109/ICASSP.1989.266481 LIU CS, 1994, IEEE T ACOUST SPEECH, V1, P325 Matsui T., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification McNemar Q, 1947, PSYCHOMETRIKA, V12, P153, DOI 10.1007/BF02295996 Reynolds D. A., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification ROSENBERG AE, 1992, ICSLP, P599 NR 11 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1995 VL 17 IS 1-2 BP 117 EP 129 DI 10.1016/0167-6393(95)00020-O PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500009 ER PT J AU CHOI, HC KING, RW AF CHOI, HC KING, RW TI ON THE USE OF SPECTRAL TRANSFORMATION FOR SPEAKER ADAPTATION IN HMM BASED ISOLATED-WORD SPEECH RECOGNITION SO SPEECH COMMUNICATION LA English DT Article DE SPEAKER ADAPTATION; SPECTRAL TRANSFORMATION; HIDDEN MARKOV MODEL; MINIMUM MEAN SQUARE ERROR; CANONICAL CORRELATION ANALYSIS; MULTILAYER PERCEPTRONS AB We describe the use of spectral transformation to perform speaker adaptation for HMM based isolated-word speech recognition. The paper describes and compares three methods, namely, minimum mean square error (MMSE), canonical correlation analysis (CCA) and multi-layer perceptrons (MLP), to compute the transformations. Using isolated words from the TI-46 speech corpus, we found that CCA offers the best adaptation performance. Three HMM training and adaptation strategies are also discussed. In the "no-retraining" approach, the spectral transformation is computed from a small amount of adaptation data, and may be used, essentially, for on-line adaptation. The "training-after-adaptation" approach computes transformations prior to off-line HMM training, but produces a better set of models. The third approach is a novel two-stage combination of these approaches, which has been found to achieve good adaptation performance while maintaining fast adaptation. Our experiments show that, on average, only around 10% of a new speaker's training data is required for adaptation in order to achieve better recognition accuracy than that obtained using the speaker-dependent models of that new speaker, when the CCA spectral transformation estimation method is used with this two-stage approach. RP CHOI, HC (reprint author), UNIV SYDNEY, DEPT ELECT ENGN, SPEECH TECHNOL RES GRP, SYDNEY, NSW 2006, AUSTRALIA.
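The contrast-of-two-models idea in the Forsyth record above lends itself to a frame-level sketch. The ratio form below (client observation likelihood normalised by the sum of client and non-client likelihoods) is one plausible reading of a DOP-style construction, not necessarily the paper's exact formula; the per-frame log-likelihoods, names and toy scores are all assumptions.

```python
import numpy as np

def log_dop_score(log_b_client, log_b_nonclient):
    # Frame-wise discriminating observation log-probabilities:
    # log( b_c / (b_c + b_n) ), computed stably via logaddexp.
    # Inputs: per-frame observation log-likelihoods under the client HMM and
    # a contrast (non-client) HMM, e.g. along each model's best alignment.
    return log_b_client - np.logaddexp(log_b_client, log_b_nonclient)

# Toy illustration: true-client frames score higher under the client model,
# so the averaged DOP score is less negative; threshold it to accept/reject.
rng = np.random.default_rng(3)
client_ll = rng.normal(-40.0, 2.0, size=120)
nonclient_ll = client_ll - rng.normal(1.5, 0.5, size=120)
print(log_dop_score(client_ll, nonclient_ll).mean())
```

Note how this matches the abstract's claim that no additional training is needed: both HMMs are trained conventionally, and the discrimination arises purely from how their observation probabilities are combined at scoring time.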
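Of the three transformation estimators compared in the Choi & King record above, the MMSE one is the simplest to sketch: with time-aligned frame pairs it reduces to a linear least-squares fit. The pairing step (e.g. by DTW) is assumed already done, and all sizes and names are illustrative; CCA or an MLP would replace the lstsq step.

```python
import numpy as np

rng = np.random.default_rng(4)
ref = rng.normal(size=(300, 16))                        # reference-speaker spectra
A_true = np.eye(16) + 0.1 * rng.normal(size=(16, 16))
new = ref @ A_true + 0.05 * rng.normal(size=(300, 16))  # aligned new-speaker spectra

# Affine MMSE transform: minimise ||[new, 1] W - ref||^2 over W.
X = np.hstack([new, np.ones((300, 1))])
W, *_ = np.linalg.lstsq(X, ref, rcond=None)

adapted = X @ W                       # map the new speaker into the reference space
print(np.mean((adapted - ref) ** 2))  # small residual expected
```

In the "no-retraining" strategy this W would be applied to the new speaker's frames at recognition time; in "training-after-adaptation" the transformed data would feed HMM training instead.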
CR ANDERSON TW, 1984, INTRO MULTIVARIATE, pCH12 CHOUKRI K, 1986, 1986 P IEEE INT C AC, P2659 CLASS F, 1990, 1990 P IEEE INT C AC, P133 COX SJ, 1989, 1989 P IEEE INT C AC, P294 HUANG X, 1992, 1992 P IEEE INT C AC, V1, P465 HUNT MJ, 1981, J ACOUST SOC AM, V69, pS41, DOI 10.1121/1.386266 IMAMURA A, 1991, 1991 P IEEE INT C AC, P841 KNOHL L, 1993, 1993 P EUR 93 BERL, P367 KOSAKA T, 1993, 1993 P EUR 93 BERL, P363 LEE CH, 1991, IEEE T SIGNAL PROCES, V39, P806, DOI 10.1109/78.80902 LEE CH, 1993, 1993 P IEEE INT C AC, V2, P558 LJOLJE A, 1993, 1993 P EUR 93 BERL, P631 NAKAMURA S, 1990, 1990 P IEEE INT C AC, P157 NAKAMURA S, 1991, 1991 P IEEE INT C AC, P853 OHNO S, 1993, 1993 P IEEE INT C AC, V2, P578 Rumelhart D. E., 1986, PARALLEL DISTRIBUTED TAKAMI JI, 1992, 4TH P AUSTR INT C SP, P437 NR 17 TC 1 Z9 2 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1995 VL 17 IS 1-2 BP 131 EP 143 DI 10.1016/0167-6393(95)00019-K PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500010 ER PT J AU THEVENAZ, P HUGLI, H AF THEVENAZ, P HUGLI, H TI USEFULNESS OF THE LPC-RESIDUE IN TEXT-INDEPENDENT SPEAKER VERIFICATION SO SPEECH COMMUNICATION LA English DT Article DE SPEECH PROCESSING; SPEAKER RECOGNITION; SPEAKER VERIFICATION; TEXT-INDEPENDENCE; OPEN-TEST METHODOLOGY; NATURAL SPEECH DATABASE; MULTI-SESSION DATABASE; LINEAR PREDICTION ANALYSIS; SYNTHESIS FILTER; RESIDUE; COMPLEMENTARY FEATURES ID VOCAL QUALITY; RECOGNITION; PERCEPTION; FEATURES AB This paper is a contribution to automatic speaker recognition. It considers speech analysis by linear prediction and investigates the recognition contribution of its two main resulting components, namely the synthesis filter on the one hand and the residue on the other. This investigation is motivated by the orthogonality property and the physiological significance of these two components, which suggest the possibility of an improvement over current speaker recognition approaches based on nothing but the usual synthesis filter features. Specifically, we propose a new representation of the residue and we analyse its corresponding recognition performance by conducting experiments in the context of text-independent speaker verification. Experiments involving both known and new methods allow us to compare the recognition performance of the two components. First we consider separate methods; then we combine them. Each method is tested on the same database and according to the same methodology, with strictly disjoint training and test data sets. The results show the usefulness of the residue when used alone, even if it proves to be less efficient than the synthesis filter. However, when both are combined, the residue shows its true relevance. It achieves a reduction of the error rate which, in our case, went down from 5.7% to 4.0%. RP THEVENAZ, P (reprint author), UNIV NEUCHATEL, INST MICROTECH, ABRAHAM LOUIS BREGUET 2, CH-2000 NEUCHATEL, SWITZERLAND. CR Atal B. S., 1986, P IEEE INT C AC SPEE, P1681 ATAL BS, 1976, P IEEE, V64, P460, DOI 10.1109/PROC.1976.10155 Bennani Y., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification Bimbot F., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification BOITE R, 1990, AGEN MITTEILUNGEN, P5 Carey M. J., 1991, P INT C AC SPEECH SI, P397, DOI 10.1109/ICASSP.1991.150360
CHEN MS, 1993, IEEE T SIGNAL PROCES, V41, P398 CHILDERS DG, 1990, SPEECH COMMUN, V9, P97, DOI 10.1016/0167-6393(90)90064-G CHILDERS DG, 1994, J ACOUST SOC AM, V96, P2026, DOI 10.1121/1.411319 CHILDERS DG, 1991, J ACOUST SOC AM, V90, P2394, DOI 10.1121/1.402044 CHILDERS DG, 1990, J ACOUST SOC AM, V4, P1841 CORSI P, 1981, 2ND P NATO ADV STUD, P277 DODDINGTON GR, 1985, P IEEE, V73, P1651, DOI 10.1109/PROC.1985.13345 Dubreucq V., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification ESKENAZI L, 1990, J SPEECH HEAR RES, V33, P298 FANT G, 1993, SPEECH COMMUN, V13, P7, DOI 10.1016/0167-6393(93)90055-P Feustel T. C., 1989, SPEECH TECH, P169 FUKUNAGA K, 1972, INTRO STATISTICAL PA FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P254, DOI 10.1109/TASSP.1981.1163530 FURUI S, 1986, SPEECH COMMUN, V5, P183, DOI 10.1016/0167-6393(86)90007-5 FURUI S, 1990, ESCA P SPEAKER CHARA, P10 Furui S., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification GIANNINI A, 1989, EUROSPEECH PARIS, V1, P283 Gish H, 1994, IEEE SIGNAL PROC MAG, V11, P18, DOI 10.1109/79.317924 JESORSKY P, 1978, SPEECH COMMUN, P93 KLATT DH, 1990, J ACOUST SOC AM, V87, P820, DOI 10.1121/1.398894 KOIKE Y, 1975, ANN OTO RHINOL LARYN, V84, P117 MAKHOUL J, 1975, P IEEE, V63, P561, DOI 10.1109/PROC.1975.9792 MONAGHAN AIC, 1990, ESCA P SPEAKER CHARA, P167 Naik J., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification O'Shaughnessy D., 1987, SPEECH COMMUNICATION OSAWA K, 1986, P INT C ACOUST SPEEC, P457 PAKRAVAN MR, 1992, INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING APPLICATIONS AND TECHNOLOGY, VOLS 1 AND 2, P1035 PINTO NB, 1989, IEEE T ACOUST SPEECH, V37, P1870, DOI 10.1109/29.45534 PROSEK RA, 1987, J COMMUN DISORD, V20, P105, DOI 10.1016/0021-9924(87)90002-5 REYNOLDS DA, 1992, P INT C ACOUST SPEEC ROSENBERG AE, 1976, P IEEE, V64, P475, DOI 10.1109/PROC.1976.10156 ROSENBERG AE, 1991, P ICASSP, P381, DOI 10.1109/ICASSP.1991.150356 Shridhar M., 1982, Speech Communication, V1, DOI 10.1016/0167-6393(82)90019-X SKVARC J, 1990, ESCA P SPEAKER CHARA, P181 SOONG FK, 1986, P ICASSP TOK JAP, P877 Thevenaz P., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification WOLF JJ, 1972, J ACOUST SOC AM, V51, P2044, DOI 10.1121/1.1913065 Zhu X., 1994, ESCA Workshop on Automatic Speaker Recognition Identification and Verification NR 44 TC 26 Z9 26 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1995 VL 17 IS 1-2 BP 145 EP 157 DI 10.1016/0167-6393(95)00010-L PG 13 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500011 ER PT J AU BENNANI, Y GALLINARI, P AF BENNANI, Y GALLINARI, P TI NEURAL NETWORKS FOR DISCRIMINATION AND MODELIZATION OF SPEAKERS SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Automatic Speaker Recognition, Identification and Verification CY APR 05-07, 1994 CL MARTIGNY, SWITZERLAND SP European Speech Commun Assoc DE DISCRIMINATION; PREDICTIVE MODELING; MODULAR CONNECTIONIST SYSTEM; NEURAL NETS; HYBRID SYSTEM; SPEAKER RECOGNITION; IDENTIFICATION; VERIFICATION ID ARCHITECTURE; ALGORITHMS AB This article reviews current research on neural network systems for speaker recognition tasks.
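The linear-prediction decomposition underlying the Thevenaz & Hugli record above can be shown in a few lines: each frame yields (i) synthesis-filter (LPC) coefficients and (ii) a residue obtained by inverse filtering. The paper's particular residue representation is not reproduced here; the frame length, order and windowing below are illustrative choices.

```python
import numpy as np
from scipy.signal import lfilter

def lpc(frame, order):
    # LPC coefficients a[0..p] (a[0] = 1) via autocorrelation + Levinson-Durbin.
    n = len(frame)
    r = np.correlate(frame, frame, mode='full')[n - 1:n + order]
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k = -acc / err
        a[1:i + 1] = a[1:i + 1] + k * np.concatenate((a[i - 1:0:-1], [1.0]))
        err *= (1.0 - k * k)
    return a, err

rng = np.random.default_rng(5)
excitation = rng.normal(size=4000)
speechlike = lfilter([1.0], [1.0, -1.3, 0.7], excitation)  # synthetic AR "speech"

frame = speechlike[1000:1320] * np.hamming(320)
a, err = lpc(frame, order=12)
residue = lfilter(a, [1.0], frame)       # inverse filtering: apply A(z) to the frame
print(np.var(residue) / np.var(frame))   # residue carries far less energy
```

The synthesis-filter features feed the usual verification front end; the abstract's point is that a suitable representation of `residue` carries complementary speaker information.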
We consider two main approaches: the first relies on direct classification and the second on speaker modelization. The potential of connectionist models for speaker recognition is first presented and the main models are briefly introduced. We then present different systems that have recently been proposed for speaker recognition tasks. We discuss their respective performances and potentials and compare these techniques to more conventional methods like vector quantization and hidden Markov models. The paper ends with a summary and suggestions for further developments. C1 UNIV PARIS 06, LAFORIA, CNRS, URA 1095, PARIS, FRANCE. RP BENNANI, Y (reprint author), UNIV PARIS 13, LIPN, CNRS, URA 1507, PARIS, FRANCE. CR ARTIERES T, 1991, NEURONIMES FRANCE ARTIERES T, 1995, UNPUB EUROSPEECH 95 ARTIERES T, 1993, P EUROSPEECH 93 BERL BENGIO Y, 1992, PATTERN RECOGN LETT, V13, P375, DOI 10.1016/0167-8655(92)90035-X BENGIO Y, 1992, IEEE T NEURAL NETWOR, V3, P252, DOI 10.1109/72.125866 BENGIO Y, 1991, THESIS BENNANI Y, 1992, 1992 IEEE ICANN Bennani Y., 1990, P INT C AC SPEECH SI, P265 BENNANI Y, 1993, APR P IEEE INT C AC BENNANI Y, 1992, AUG P IEEE WORKSH NE BENNANI Y, 1994, INT J NEURAL SYSTEMS BENNANI Y, 1992, P ICSLP 92 BANFF, P607 Bennani Y, 1991, P ICASSP, P385, DOI 10.1109/ICASSP.1991.150357 BOTTOU L, 1992, NEURAL COMPUT, V4, P888, DOI 10.1162/neco.1992.4.6.888 BOURLARD H, 1990, NN SENSORY MOTOR SYS BOURLARD H, 1992, P IEEE INT C AC SPEE, P349, DOI 10.1109/ICASSP.1992.226048 BRIDLE JS, 1990, SPEECH COMMUN, V9, P83, DOI 10.1016/0167-6393(90)90049-F Carey M. J., 1991, P INT C AC SPEECH SI, P397, DOI 10.1109/ICASSP.1991.150360 DEVILLERS L, 1992, P INT C ACOUST SPEEC, P421 DRIANCOURT X, 1992, P INT C ACOUST SPEEC DRIANCOURT X, 1990, NEURONIMES 1990 DRIANCOURT X, 1991, 1991 IEEE IJCNN FARELL KR, 1994, IEEE T SPEECH AUDIO, V2, P194 FISHER WM, 1987, J ACOUST SOC AM, V81, pS92, DOI 10.1121/1.2034854 FRANZINI MA, 1990, P INT C AC SPEECH SI, P425 GALLINARI P, 1992, 2ND WORKSH NEUR NETW, P19 GALLINARI P, 1991, NEURAL NETWORKS, V4, P349, DOI 10.1016/0893-6080(91)90071-C GISH H, 1990, INT CONF ACOUST SPEE, P1361, DOI 10.1109/ICASSP.1990.115636 HAFFNER P, 1991, P INT C ACOUST SPEEC HAFFNER P, 1992, NIPS, P135 HAMPSHIRE JB, 1990, P INT C ACOUST SPEEC, P165 HAMPSHIRE JB, 1989, CMUCS89167 CARN MELL HATTORI H, 1992, P IEEE INT C AC SPEE, V2, P153 Hertz J., 1991, INTRO THEORY NEURAL HORNIK K, 1989, NEURAL NETWORKS, V2, P359, DOI 10.1016/0893-6080(89)90020-8 ISO K, 1990, P INT C ACOUST SPEEC ISO K, 1991, P INT C ACOUST SPEEC, P57, DOI 10.1109/ICASSP.1991.150277 Jacobs R. A., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.1.79 JACOBS RA, 1993, IEEE T SYST MAN CYB, V23, P337, DOI 10.1109/21.229447 KOHONEN T., 1989, SELF ORG ASS MEMORY LANG K, 1988, CMUCS88152 CARN MELL LEVIN E, 1990, NIPS, V3, P147 LEVIN E, 1990, P IEEE INT C AC SPEE, P433 MELLOUK A, 1993, P INT C ACOUST SPEEC Morgan N., 1990, P IEEE INT C AC SPEE, P413 NAIK JM, 1994, P IEEE INT C AC SPEE, V1, P153 NILES LT, 1990, P IEEE INT C AC SPEE, P417 Niranjan M., 1990, Computer Speech and Language, V4, DOI 10.1016/0885-2308(90)90009-U OGLESBY J, 1991, P INT C AC SPEECH SI, P393, DOI 10.1109/ICASSP.1991.150359 Oglesby J., 1990, P INT C AC SPEECH SI, P261 POGGIO T, 1990, SCIENCE, V247, P978, DOI 10.1126/science.247.4945.978 RENALS S, 1992, P IEEE INT C AC SPEE, P601, DOI 10.1109/ICASSP.1992.225837 Richard M. D., 1991, Neural Computation, V3, DOI 10.1162/neco.1991.3.4.461
ROBINSON T, 1992, P INT C ACOUST SPEEC, P617, DOI 10.1109/ICASSP.1992.225833 Rudasi L., 1991, P ICASSP TORONTO, P389, DOI 10.1109/ICASSP.1991.150358 Rumelhart D. E., 1986, PARALLEL DISTRIBUTED, V1 SORENSEN HBD, 1993, P INT C ACOUST SPEEC, P537 TEBELSKIS J, 1991, P IEEE INT C AC SPEE, P61, DOI 10.1109/ICASSP.1991.150278 TSOI AC, 1994, IEEE T NEURAL NE MAR WAIBEL A, 1987, TRI0006 ATR I TECHN White H., 1989, Neural Computation, V1, DOI 10.1162/neco.1989.1.4.425 NR 61 TC 9 Z9 9 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1995 VL 17 IS 1-2 BP 159 EP 175 DI 10.1016/0167-6393(95)00014-F PG 17 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500012 ER PT J AU BIMBOT, F MAGRINCHAGNOLLEAU, I MATHAN, L AF BIMBOT, F MAGRINCHAGNOLLEAU, I MATHAN, L TI 2ND-ORDER STATISTICAL MEASURES FOR TEXT-INDEPENDENT SPEAKER IDENTIFICATION SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Automatic Speaker Recognition, Identification and Verification CY APR 05-07, 1994 CL MARTIGNY, SWITZERLAND SP European Speech Commun Assoc DE SPEAKER RECOGNITION; SPEAKER IDENTIFICATION; TEXT-INDEPENDENT; GAUSSIAN LIKELIHOOD; SPHERICITY TEST; RELATIVE EIGENVALUE DEVIATION; SYMMETRIZATION; TIMIT; REFERENCE SYSTEM; ASSESSMENT METHODOLOGY AB This article presents an overview of several measures for speaker recognition. These measures relate to second-order statistical tests, and can be expressed under a common formalism. Alternative formulations of these measures are given and their mathematical properties are studied. In their basic form, these measures are asymmetric, but they can be symmetrized in various ways. All measures are tested in the framework of text-independent closed-set speaker identification, on 3 variants of the TIMIT database (630 speakers): TIMIT (high quality speech), FTIMIT (a restricted bandwidth version of TIMIT) and NTIMIT (telephone quality). Remarkable performance is obtained on TIMIT, but the results naturally deteriorate with FTIMIT and NTIMIT. Symmetrization appears to be a source of improvement, especially when little speech material is available. The use of some of the proposed measures as a reference benchmark to evaluate the intrinsic complexity of a given database under a given protocol is finally suggested as a conclusion to this work. RP BIMBOT, F (reprint author), ECOLE NATL SUPER TELECOMMUN BRETAGNE, TELECOM PARIS,DEPT SIGNAL,CNRS,URA 820, 46 RUE BARRAULT, F-75634 PARIS 13, FRANCE.
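One member of the measure family surveyed in the preceding abstract can be sketched compactly: a sphericity-style statistic on two speakers' covariance matrices, here the log-ratio of the arithmetic to the geometric mean of the generalized eigenvalues, together with one possible symmetrization. Whether this matches the paper's exact formulation is not guaranteed; the sketch only illustrates the asymmetry of the basic form and the symmetrization step.

```python
import numpy as np
from scipy.linalg import eigvalsh

def sphericity(Cx, Cy):
    # Generalized eigenvalues of (Cx, Cy); the measure is >= 0, zero when all
    # eigenvalues are equal, and asymmetric in (x, y).
    lam = eigvalsh(Cx, Cy)
    return np.log(lam.mean()) - np.log(lam).mean()

def symmetrized(Cx, Cy):
    # One possible symmetrization: average the two orientations.
    return 0.5 * (sphericity(Cx, Cy) + sphericity(Cy, Cx))

rng = np.random.default_rng(6)
Xa = rng.normal(size=(2000, 8))                                  # speaker A frames
Xb = rng.normal(size=(2000, 8)) @ np.diag(np.linspace(0.5, 2.0, 8))  # speaker B frames
Ca1, Ca2 = np.cov(Xa[:1000].T), np.cov(Xa[1000:].T)              # A, two "sessions"
Cb = np.cov(Xb.T)
print(symmetrized(Ca1, Ca2))  # small: same underlying statistics
print(symmetrized(Ca1, Cb))   # larger: different speaker statistics
```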
CR ANDERSON TW, 1958, INTRO MULTIVARIATE A ARTIERES T, 1991, P NEURONIMES 91 NIME BENNANI Y, 1992, 1992 P ICSLP 92 BANF, V1, P607 BIMBOT F, 1993, SAM A ESPRIT I9 TECH BIMBOT F, 1993, 1993 P EUR 93 BERL, V1, P169 BIMBOT F, 1992, P ICASSP SAN FRANC U, V2, P5 CHOLLET G, 1982, 1982 P INT C AC SPEE, V3, P2026 Fisher W., 1986, JASA SA, V81 FURUI S, 1994, 1994 WORKSH AUT SPEA, P1 GISH H, 1986, 1986 P INT C AC SPEE, V2, P865 GISH H, 1990, 1990 P INT C AC SPEE, V1, P289 GRENIER Y, 1980, 11 JOURN ET PAR STRA, P163 GRENIER Y, 1977, THESIS ENST HATTORI H, 1992, 1992 P INT C AC SPEE, V2, P153 JANKOWSKI C, 1990, 1990 P INT C AC SPEE MONTACIE C, 1992, 1992 P INT C AC SPEE, V1, P153 MONTACIE C, 1992, 1992 P ICSLP 92 BANF, V1, P611 PORITZ AB, 1982, APR P IEEE INT C AC, P1291 REYNOLDS DA, 1994, 1994 WORKSH AUT SPEA, P27 RUDASI L, 1991, P INT C AC SPEECH SI, V1, P389 SAVIC M, 1990, 1990 P INT C AC SPEE, V1, P281 SOONG FK, 1987, ATT3 TECHN REP NR 22 TC 45 Z9 46 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1995 VL 17 IS 1-2 BP 177 EP 192 DI 10.1016/0167-6393(95)00013-E PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500013 ER PT J AU OGLESBY, J AF OGLESBY, J TI WHATS IN A NUMBER - MOVING BEYOND THE EQUAL ERROR RATE SO SPEECH COMMUNICATION LA English DT Article; Proceedings Paper CT ESCA Workshop on Automatic Speaker Recognition, Identification and Verification CY APR 05-07, 1994 CL MARTIGNY, SWITZERLAND SP European Speech Commun Assoc DE SPEAKER VERIFICATION; PERFORMANCE; ASSESSMENT AB This paper addresses the issue of speaker verification system assessment. The objective of this work is to develop a way of characterising speaker verification systems in a concise and meaningful manner. Here a performance profile is suggested that encompasses the important aspects of a system under test, namely the actual verification performance, model storage requirements, confusability of the speakers in the test set, quality of the speech data used and the duration of speech data available for enrolment and testing. Results are presented that show how this profile of measures can be used to give a more meaningful representation of a given system. The aim of publishing this work is not to be prescriptive about a particular method of assessment, but rather to highlight the issues involved. In general, better definitions of widely used terms are required before meaningful comparisons can be drawn between different speaker verification systems. Widespread adoption of a set of standardised measures, along the lines of those suggested here, would significantly improve the value of such comparisons. RP OGLESBY, J (reprint author), BT LABS, MARTLESHAM HEATH, IPSWICH IP5 7RE, SUFFOLK, ENGLAND. CR *CCITT, 1972, COD AN SIGN PULS COD FURUI S, 1981, IEEE T ACOUST SPEECH, V29, P245 MILLAR W, 1992, P IOA C SPEECH HEARI ROSENBERG AE, 1991, P INT C ACOUST SPEEC Young S. J., 1993, HTK HIDDEN MARKOV MO NR 5 TC 3 Z9 3 PU ELSEVIER SCIENCE BV PI AMSTERDAM PA PO BOX 211, 1000 AE AMSTERDAM, NETHERLANDS SN 0167-6393 EI 1872-7182 J9 SPEECH COMMUN JI Speech Commun. PD AUG PY 1995 VL 17 IS 1-2 BP 193 EP 208 DI 10.1016/0167-6393(95)00017-I PG 16 WC Acoustics; Computer Science, Interdisciplinary Applications SC Acoustics; Computer Science GA RR225 UT WOS:A1995RR22500014 ER EF
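Since the Oglesby record above takes the equal error rate as its point of departure, a reference sketch of how an EER is conventionally computed from verification scores may be useful. This is the standard threshold-sweep procedure, not anything specific to the paper, and the score distributions below are synthetic stand-ins; the paper's argument is precisely that such a single number should be reported alongside model size, speaker confusability, speech quality and enrolment/test durations.

```python
import numpy as np

def equal_error_rate(genuine, impostor):
    # EER: sweep the decision threshold until the false-accept rate (impostor
    # scores above threshold) meets the false-reject rate (genuine below).
    thresholds = np.sort(np.concatenate([genuine, impostor]))
    far = np.array([(impostor >= t).mean() for t in thresholds])
    frr = np.array([(genuine < t).mean() for t in thresholds])
    i = np.argmin(np.abs(far - frr))
    return 0.5 * (far[i] + frr[i]), thresholds[i]

rng = np.random.default_rng(7)
genuine = rng.normal(2.0, 1.0, size=1000)   # true-speaker scores
impostor = rng.normal(0.0, 1.0, size=5000)  # impostor scores
eer, thr = equal_error_rate(genuine, impostor)
print(f"EER = {eer:.2%} at threshold {thr:.2f}")
```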