Stay Hungry, Stay Foolish: Voice Acoustics

Why study vocal tract acoustics?

There are fundamental scientific questions to answer in acoustical phonetics, but we are also interested in applications to speech training (language teaching) and speech pathology. When adults or teenagers learn a foreign language, they rarely achieve authentic pronunciation and are sometimes almost unintelligible to speakers of that language. This difficulty is due to the imprecision or inadequacy of the auditory feedback usually used to learn languages: students often cannot hear how wrong their imitation of a sound is, and do not know what to do to improve it. (Technically, the problems are called categorization and interference.) The problem is even more severe for the hearing impaired, who have little or no auditory feedback and can learn very little about the interior of the vocal tract from looking at the lips. A feature of the "deaf accent" is inappropriate use of the soft palate, which is not surprising given how difficult it is to see or to feel what one's soft palate is doing during speech.

We also investigate the acoustics of the singing voice, partly for its intrinsic interest, and partly with the aim of improving pedagogy in that field.

We have developed a device for measuring some important acoustic properties of the vocal tract non-invasively, in real time, while the owner of the vocal tract is speaking or singing. We use it as a research tool, and we have also demonstrated its use as a speech trainer.

Existing technologies used in speech pathology and speech trainers to provide visual feedback from the speech sound are inherently limited in precision and practicality. Even the most advanced speech recognition systems still mistake words, which shows how limited they are as precise measures of pronunciation. The basic problem is that the speech signal alone does not contain enough information to allow us to work out, quickly and precisely, the configuration of the vocal tract. This is not a problem for understanding speech, but it may be a problem in learning precise pronunciation. Our approach is therefore to introduce a signal with more information in the frequency domain.

Our technology is called Real-time Acoustic response by Vocal tract Excitation or RAVE. In model experiments using the laboratory prototype, we have shown that one or two hours' training using visual feedback of some key features of the acoustical response of a subject's vocal tract improves the accuracy and intelligibility of pronunciation of foreign phonemes by monolingual adults.

How it works:

We inject into the vocal tract an acoustic current which is synthesized to give high resolution frequency information over the frequency range of interest. We then measure the impedance of the vocal tract in parallel with the external field using the response to this excitation signal.
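One way to picture such an excitation signal is as a sum of sinusoids at closely spaced frequencies. The sketch below is only an illustration under stated assumptions, not the synthesis used in the actual device: the 5 Hz spacing comes from the schematic caption further down, while the 200 Hz to 4.5 kHz range, the 10 kHz sample rate, the equal amplitudes, the quadratic phases and the function name `synthesize_excitation` are all hypothetical choices.

```python
import math

def synthesize_excitation(f_min, f_max, spacing, sample_rate, duration):
    """Sum equal-amplitude sinusoids at f_min, f_min + spacing, ... up to f_max."""
    n_components = int(round((f_max - f_min) / spacing)) + 1
    freqs = [f_min + k * spacing for k in range(n_components)]
    # Assumption: quadratic phases, chosen here only so that the components
    # do not all peak at the same instant. The real device may differ.
    phases = [-math.pi * k * k / n_components for k in range(n_components)]
    n_samples = int(round(sample_rate * duration))
    signal = []
    for i in range(n_samples):
        t = i / sample_rate
        signal.append(sum(math.cos(2 * math.pi * f * t + p)
                          for f, p in zip(freqs, phases)))
    # Normalise so the largest sample has magnitude 1.
    peak = max(abs(s) for s in signal)
    return freqs, [s / peak for s in signal]

# Hypothetical parameters: 200 Hz to 4.5 kHz in 5 Hz steps, 0.2 s at 10 kHz.
freqs, signal = synthesize_excitation(200.0, 4500.0, 5.0, 10000, 0.2)
print(len(freqs), "components,", len(signal), "samples")
```

With components spaced at 5 Hz, all on multiples of 5 Hz, the waveform repeats every 1/5 s, which is why the example generates 0.2 s of signal.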

In this figure, I pronounce the vowel in 'heard'. The sharp vertical peaks are the harmonics of my voice. The broad signal shows the response of my vocal tract to the acoustic current signal being injected at the lips.

For this vowel, my vocal tract behaves rather like a cylinder about 170 mm long, nearly closed at the vocal folds and open at the mouth. A cylinder of length L, closed at one end, has resonances at f0 = v/(4L), at 3f0, 5f0 etc, where v is the speed of sound. (See pipes and harmonics.) So we see resonances at about 0.5, 1.5, 2.5, 3.5 and 4.5 kHz, which appear as the peaks in the smooth curve in this figure. When I pronounce the vowel in "had", I open my mouth wider, so the tract is no longer cylindrical, but flared at the open end, a bit like the flare and bell on a brass instrument. One of the effects of this shape in a brass instrument is to raise the frequencies of the resonances, especially those of the lower resonances. (In a related example, conical pipes have resonances at higher frequencies than do cylindrical ones. See this link for an explanation.)
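The arithmetic for the closed cylinder can be checked with a few lines of code. This is a minimal sketch for an ideal pipe; the value of 340 m/s for the speed of sound is an assumption (the text does not state the value used), and the function name `closed_pipe_resonances` is just for illustration.

```python
def closed_pipe_resonances(length_m, speed_of_sound=340.0, count=5):
    """Resonances of an ideal cylinder closed at one end and open at the other.

    Returns f0, 3*f0, 5*f0, ... in Hz, where f0 = v / (4 * L).
    speed_of_sound=340.0 m/s is an assumed value.
    """
    f0 = speed_of_sound / (4.0 * length_m)
    return [(2 * n - 1) * f0 for n in range(1, count + 1)]

# A 170 mm tract, as for the vowel in 'heard'.
resonances = closed_pipe_resonances(0.170)
print([round(f) for f in resonances])  # [500, 1500, 2500, 3500, 4500]
```

A real vocal tract is neither a perfect cylinder nor perfectly closed at the vocal folds, so the measured peaks only approximately follow this series.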

From this response we can readily determine the resonances of the vocal tract, independently of the speech signal. The resonant frequencies are interesting for fundamental acoustical phonetic research but, if we extract them in real time, they can be used to drive a cursor for speech training. This is how we do it in the real time version.

Schematic diagram. (a) shows the spectrum of the speech signal alone. This male voice has harmonic partials spaced at the pitch frequency, 126 Hz. (b) The injected signal has frequencies spaced at 5 Hz, whose amplitudes are calibrated (in this case) using the radiation field outside the speaker's mouth. (c) The sum of the speech signal and the broadband signal (including the effects of the resonances) goes from the microphone to the ADC. The speech signal is used to measure pitch and amplitude; then the harmonic components below 1 kHz are removed. (d) The resonances are detected from the remaining interpolated signal. Similarly, the broadband signals may be removed to leave just the speech harmonics. In the real-time version of the device used for speech training, the resonance frequencies are used to position the cursor on the vowel plane (see below). Notice that the signal-to-noise ratio in these figures is lower than in the preceding figure. This is a consequence of making the measurements rapidly.
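Steps (c) and (d) can be sketched in code: drop the spectral bins near voice harmonics below 1 kHz, interpolate across the gaps, then pick peaks from what remains. This is a rough illustration, not the device's algorithm. The 10 Hz removal half-width, the 300 Hz minimum peak separation, the simple local-maximum peak picker, the synthetic test spectrum and both function names are assumptions; only the 126 Hz pitch, the 5 Hz spacing and the 1 kHz cutoff come from the caption.

```python
def remove_harmonics(freqs, amps, pitch_hz, cutoff_hz=1000.0, half_width_hz=10.0):
    """Replace bins near voice harmonics below cutoff_hz by linear interpolation."""
    def near_harmonic(f):
        k = round(f / pitch_hz)
        return (k >= 1 and k * pitch_hz < cutoff_hz
                and abs(f - k * pitch_hz) <= half_width_hz)

    keep = [i for i, f in enumerate(freqs) if not near_harmonic(f)]
    cleaned = list(amps)
    # Interpolate between each pair of neighbouring kept bins.
    # Removed bins at the very edges of the spectrum are left unchanged.
    for left, right in zip(keep, keep[1:]):
        for i in range(left + 1, right):
            w = (freqs[i] - freqs[left]) / (freqs[right] - freqs[left])
            cleaned[i] = (1 - w) * amps[left] + w * amps[right]
    return cleaned

def find_resonances(freqs, amps, min_separation_hz=300.0):
    """Return local maxima, strongest first, at least min_separation_hz apart."""
    candidates = [i for i in range(1, len(amps) - 1)
                  if amps[i] > amps[i - 1] and amps[i] >= amps[i + 1]]
    candidates.sort(key=lambda i: amps[i], reverse=True)
    chosen = []
    for i in candidates:
        if all(abs(freqs[i] - f) >= min_separation_hz for f in chosen):
            chosen.append(freqs[i])
    return sorted(chosen)

# Synthetic demo (made-up data): two broad resonances near 500 Hz and 1500 Hz,
# plus narrow voice harmonics at multiples of 126 Hz below 1 kHz.
freqs = [200.0 + 5.0 * i for i in range(361)]  # 200 Hz to 2000 Hz
amps = [1.0 / (1.0 + ((f - 500.0) / 80.0) ** 2)
        + 0.8 / (1.0 + ((f - 1500.0) / 100.0) ** 2) for f in freqs]
for k in range(2, 8):  # harmonics at 252 Hz ... 882 Hz
    nearest = min(range(len(freqs)), key=lambda i: abs(freqs[i] - k * 126.0))
    amps[nearest] += 5.0

cleaned = remove_harmonics(freqs, amps, pitch_hz=126.0)
peaks = find_resonances(freqs, cleaned)
print(peaks)  # two frequencies, close to 500 Hz and 1500 Hz
```

Because a harmonic at 504 Hz sits almost on top of the first resonance in this demo, the interpolated peak lands slightly away from 500 Hz, which illustrates the cost of removing bins before estimating the resonance.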

How it looks:

This is a screen dump of the feedback display in the current speech trainer device, set up with targets from Australian English. The background ellipses are measurements of the vowels of 33 Australian men, with mean values for each vowel at the centre of each ellipse. The semi-axes are the standard deviations in R1 and R2, the first and second resonance frequencies of the vocal tract. These or other areas can be used as targets in speech training. A cursor on the monitor (the cross at (1190,530)) shows the current configuration of the subject's own vocal tract. Initially, subjects 'steer' the motion of the cursor by consciously controlling jaw and tongue position. Speakers of the language displayed can 'aim' towards one of the vowels shown. After some practice, however, it becomes nearly as automatic as using a joystick or a mouse - one just 'makes it go' where one wants, without thinking of the muscular details. In other words, a visual feedback loop is unconsciously used to train articulation.
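Deciding whether the cursor has reached a target ellipse is a short calculation. The sketch below assumes axis-aligned ellipses whose semi-axes are the standard deviations, as described above. The target names and numbers are placeholders, not the measured Australian English data, and reading the cursor position (1190,530) as R2 = 1190 Hz and R1 = 530 Hz is also an assumption.

```python
def inside_target(r1, r2, target):
    """True if (r1, r2) lies inside the ellipse centred on the target mean,
    with semi-axes equal to the standard deviations in R1 and R2."""
    d1 = (r1 - target["mean_r1"]) / target["sd_r1"]
    d2 = (r2 - target["mean_r2"]) / target["sd_r2"]
    return d1 * d1 + d2 * d2 <= 1.0

# Placeholder targets in Hz (hypothetical values, for illustration only).
targets = {
    "target_a": {"mean_r1": 520.0, "sd_r1": 50.0, "mean_r2": 1200.0, "sd_r2": 100.0},
    "target_b": {"mean_r1": 350.0, "sd_r1": 40.0, "mean_r2": 2100.0, "sd_r2": 150.0},
}

# Cursor position, assuming the display coordinates are (R2, R1) in Hz.
cursor_r1, cursor_r2 = 530.0, 1190.0
hits = [name for name, t in targets.items()
        if inside_target(cursor_r1, cursor_r2, t)]
print(hits)
```

Scaling each axis by its own standard deviation means a miss of 50 Hz in R1 counts for more than a miss of 50 Hz in R2 whenever the R2 spread is larger.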

Reference: http://www.phys.unsw.edu.au/jw/speech.html

Tuesday, 29 September 2009
