Tuesday, September 29, 2009

The Structure of Language: Mandarin

Elements of sound / timbre:


Frequency

Frequency is measured against time: the faster the vibration, the higher the frequency and the higher the perceived pitch. Its unit is the hertz (Hz). One vibration means the waveform rises from the axis to a crest, falls through the axis to a trough, and returns to the axis; this is one cycle. The higher the frequency, the higher the pitch; the human ear can hear frequencies from 20 Hz to 20 kHz. Frequency is the number of vibrations of a sound wave per second: one kilohertz (1000 Hz, or 1 kHz) equals one thousand vibrations per second, that is, one thousand sound waves per second.

Amplitude / Loudness

Amplitude is the extent of a sound wave's vibration, also called its intensity; it determines the height of the waveform. The larger the amplitude, the greater the loudness. It is measured by the size of the amplitude, in volts or in decibels (dB). The dB scale grows exponentially: each 20 dB step multiplies the loudness (amplitude) by 10. For example, 40 dB is 10 times the amplitude of 20 dB, 60 dB is 100 times, and 80 dB is 1000 times.
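
As a quick check of that arithmetic, here is a minimal Python sketch of the decibel-to-amplitude relationship; the function name is ours, for illustration only:

```python
import math

def db_gain_to_amplitude_ratio(db):
    """Amplitude ratio corresponding to a gain in dB: ratio = 10**(db/20)."""
    return 10 ** (db / 20)

# Each +20 dB step multiplies amplitude by 10, as in the examples above:
for db in (40, 60, 80):
    print(f"{db} dB vs 20 dB -> amplitude x{db_gain_to_amplitude_ratio(db - 20):g}")
# 40 dB vs 20 dB -> amplitude x10
# 60 dB vs 20 dB -> amplitude x100
# 80 dB vs 20 dB -> amplitude x1000
```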

Digital Audio

Sound waves are converted from analog form to digital form. As for storage formats, WAV is the most common on the PC platform and AIFF on the Mac platform.

Sampling Rate

The sampling rate is the number of times per second the sound card records the sound waveform. A sampled recording can only faithfully reproduce frequencies up to half the sampling rate, so the sound must be sampled at twice its highest frequency for the original to be reproduced accurately. The limit of human hearing is about 20 kHz, so high-quality sampling should be at least twice that. When the source is music, whose frequency content spans this whole range, 44.1 kHz is the standard sampling rate for CD audio; for speech, which extends to only about 10 kHz, doubling gives about 22 kHz. The higher the sampling rate, the clearer the recorded sound, and of course the larger the resulting file.
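
A minimal sketch of that sampling-rate reasoning, assuming NumPy; the helper name and figures for the aliased case are illustrative:

```python
import numpy as np

def sample_tone(freq_hz, rate_hz, duration_s=0.01):
    """Sample a pure tone of freq_hz at rate_hz samples per second."""
    t = np.arange(0.0, duration_s, 1.0 / rate_hz)
    return np.sin(2 * np.pi * freq_hz * t)

# Nyquist: content up to f needs a sampling rate above 2*f.
music  = sample_tone(20_000, 44_100)   # 20 kHz hearing limit -> 44.1 kHz CD rate
speech = sample_tone(10_000, 22_050)   # ~10 kHz speech band  -> ~22 kHz suffices
bad    = sample_tone(15_000, 22_050)   # undersampled: aliases down to ~7.05 kHz
```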

Sampling Resolution

The resolution determines whether a sampled sound wave keeps its original shape; the closer to the original, the higher the resolution required. Sampling at 8 bits gives 2^8 = 256 possible values, meaning 8-bit samples can distinguish 256 levels of sound; sampling at 16 bits distinguishes as many as 2^16 = 65,536 levels, so the precision is naturally much higher.

The difference between 16-bit and 8-bit sampling is the width of the dynamic range. With a wide dynamic range, the rise and fall of volume can be recorded more finely, so everything from the subtlest sounds to powerful, dramatic passages can be rendered vividly. CD-quality audio uses the 16-bit sampling format.
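
To see how bit depth limits dynamic range, here is a small sketch of uniform quantization (a simplification of what real converters do; the helper name is ours):

```python
import numpy as np

def quantize(signal, bits):
    """Uniformly quantize a signal in [-1, 1] to the given bit depth."""
    levels = 2 ** bits           # 8 bits -> 256 levels, 16 bits -> 65536 levels
    step = 2.0 / levels
    return np.round(signal / step) * step

t = np.linspace(0, 1, 1000)
x = 0.001 * np.sin(2 * np.pi * 5 * t)    # a very quiet passage

# At 8 bits this quiet signal falls below one quantization step and vanishes;
# at 16 bits its shape survives (roughly 6 dB of dynamic range per bit).
print(np.max(np.abs(quantize(x, 8))))    # 0.0    -> wiped out
print(np.max(np.abs(quantize(x, 16))))   # ~0.001 -> preserved
```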

Now that we have covered the background on sound, let us analyse the structure of language, which can be roughly divided as follows:

The Structure of Language [Phonetics]

Sentence

「我是一個講國語的台灣人,雖然我的祖先來自福建,但我的台灣話說得並不流暢。」 ("I am a Mandarin-speaking Taiwanese; although my ancestors came from Fujian, my Taiwanese is not fluent.")

Clause

「我是一個講國語的台灣人」 ("I am a Mandarin-speaking Taiwanese")

Phrase

「講國語的台灣人」 ("Mandarin-speaking Taiwanese", from the clause above)

Word

「台灣人」 ("Taiwanese person")

Morpheme

「台灣」 + 「人」 (the word 「台灣人」 consists of these two morphemes)

Syllable

「台」, 「灣」, 「人」: these are what we ordinarily call character sounds, and the syllable is the speech unit most easily picked out by ear. In general, one Chinese character is one syllable. A Mandarin syllable normally consists of an initial, a final, and a tone; sometimes, however, there is no initial and the syllable is formed from a final and a tone alone, which is called a "zero initial" (零聲母).

Phoneme

「人」 (rén) is formed from three phonemes, /r/, /e/, /n/; the phoneme is the smallest unit of speech sound.

Understanding the structure of language helps us analyse how to pronounce it. Next, let us look at the two main elements of articulation:

Vowel

When a vowel is articulated, the airflow vibrates the vocal cords and then passes through the pharynx, oral cavity, and nasal cavity almost unobstructed. Because the vocal cords vibrate, the sound is sonorous.

Consonant

When a consonant is articulated, the airflow is obstructed in the pharynx, oral cavity, or nasal cavity. Because the vocal cords do not necessarily vibrate, the sound is mostly not sonorous.

With the elements of articulation covered, what matters next is how to read the sounds aloud, which we call phonetic spelling; this varies by country and region. The zhuyin symbols we commonly use are known as Mandarin Phonetic Symbols I; whether Tongyong Pinyin or Hanyu Pinyin will serve as Mandarin Phonetic Symbols II has not been settled; and the system used for many years to help foreigners pronounce Mandarin is called romanization (comparison table). Below are the three main components of Mandarin phonetic spelling:

Initial (聲母) = onset = consonant

The sound produced is obstructed. Initials distinguish meaning, e.g. ㄍ vs. ㄎ: 幹什麼 ("do what") vs. 看什麼 ("look at what"); ㄋ vs. ㄌ: 惱怒 ("furious") vs. 老路 ("old road").

An initial is the consonant placed at the beginning of a syllable. Because initials are consonants, they are not sonorous. Mandarin has 21 initials, divided into seven groups by place of articulation:

1. Bilabials (雙唇音): b (ㄅ), p (ㄆ), m (ㄇ)

2. Labiodentals (唇齒音): f (ㄈ)

3. Apicals (舌尖音): d (ㄉ), t (ㄊ), n (ㄋ), l (ㄌ)

4. Velars (舌根音): g (ㄍ), k (ㄎ), h (ㄏ)

5. Front palatals (舌面音): j (ㄐ), q (ㄑ), x (ㄒ)

6. Retroflexes (舌尖後音, "curled-tongue"): zh (ㄓ), ch (ㄔ), sh (ㄕ), r (ㄖ)

7. Blade-alveolars (舌尖前音, "flat-tongue"): z (ㄗ), c (ㄘ), s (ㄙ)



Final (韻母) = rhyme = vowel

Finals relate to the opening of the mouth, e.g. ㄧ => ㄝ => ㄚ (the mouth opens progressively wider).

The final is the part of the syllable after the initial. It can be filled by a vowel and/or consonants, and it is pronounced sonorously:

1. Semi-vowels (半元音): (y)i (ㄧ), (w)u (ㄨ), (y)u (ㄩ)

2. Simple vowels (單韻母): a (ㄚ), o (ㄛ), e (ㄜ), ê (ㄝ)

3. Diphthongs (複韻母): ai (ㄞ), ei (ㄟ), ao (ㄠ), ou (ㄡ)

4. Nasal finals (鼻韻母): an (ㄢ), en (ㄣ), ang (ㄤ), eng (ㄥ)

5. Retroflex final (捲舌韻母): er (ㄦ)



Tone (聲調)

Tone is the variation in the pitch of a syllable: the regular rise and fall of pitch in Mandarin that serves to distinguish word meanings. Whether the tone is correct is the key to accurate pronunciation. For example, by pitch contour Mandarin distinguishes the first tone (yin ping), second tone (yang ping), third tone (shang), and fourth tone (qu).

The Number of Mandarin Syllables

Given the initial-final-tone structure, there are 22 × 39 × 5 = 4290 possible syllable combinations, but Mandarin has strict phonotactic rules: for example, the initials ㄐ, ㄑ, ㄒ can only be followed by finals beginning with ㄧ or ㄩ, and when the final is ㄩ the initial can only be ㄐ, ㄑ, ㄒ, ㄋ, or ㄌ. So in practice there are only about 1,300 usable Mandarin syllables, or 411 if tone is ignored.
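
The combination count can be checked directly; the figures for legal syllables are those cited above, not derived here:

```python
# Upper bound if initials, finals and tones combined freely:
initials, finals, tones = 22, 39, 5
print(initials * finals * tones)   # 4290
print(initials * finals)           # 858 toneless combinations
# Phonotactics (e.g. ㄐ/ㄑ/ㄒ only before ㄧ/ㄩ finals) reduce the legal
# inventory to the cited ~1300 tonal syllables, or 411 ignoring tone.
```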

Composition of the Spelling

The initial comes first, the final follows.

The initial is light and short; the final is sonorous.

Spelling Mnemonic

"Front sound light and short, back sound heavy; join the two sounds with a sudden bump!" (前音輕短後音重,兩音相連猛一碰!)

Reference: http://irw.ncut.edu.tw/peterju/speech.html

Approaches to Speech Synthesis

• Spectral-parameter synthesis (Formant Synthesis):


Examples include Holmes's parallel formant synthesizer (1973), Klatt's cascade/parallel formant synthesizer (1980), and synthesis systems based on acoustic parameters such as LPC. To synthesize clear speech, however, the parameters must be set precisely, which makes these systems difficult to use, and the resulting speech is still not natural enough.

• Waveform splicing (Waveform Synthesis):

For example, pitch-synchronous overlap-add (PSOLA, 1990) performs time-domain prosodic modification directly on the speech waveform, yielding synthesized speech with proper prosody. PSOLA's design goals were to remedy the slowness of frequency-domain processing and the poor splice quality of earlier time-domain joining; the speech it synthesizes is greatly improved in both timbre and naturalness, and its architecture is simple and easy to implement.
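
True PSOLA centres each analysis window on a detected pitch mark; the following simplified Python sketch shows only the underlying time-domain overlap-add idea, with illustrative names and no pitch marking:

```python
import numpy as np

def ola_time_stretch(x, rate, frame=1024, hop=256):
    """Naive overlap-add time stretching: read analysis frames from the input
    at hop*rate and write them to the output at hop, summing under a Hann
    window (rate > 1 speeds the audio up). Real PSOLA additionally places
    each frame pitch-synchronously so pitch and timbre are preserved."""
    window = np.hanning(frame)
    out_len = int(len(x) / rate) + frame
    y = np.zeros(out_len)
    norm = np.zeros(out_len)
    in_pos, out_pos = 0.0, 0
    while int(in_pos) + frame < len(x):
        seg = x[int(in_pos):int(in_pos) + frame] * window
        y[out_pos:out_pos + frame] += seg
        norm[out_pos:out_pos + frame] += window
        in_pos += hop * rate          # walk the input faster/slower than output
        out_pos += hop
    norm[norm == 0] = 1.0
    return y / norm                   # compensate for overlapping windows
```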

For a TTS system, whether the input is a snippet of text or an entire article, the text itself contains no acoustic properties (prosody such as intonation, pausing, and segment duration), only linguistic ones, so the likely acoustic features of the text must be generated by an automatic prediction mechanism. Such mechanisms are generally of two kinds, rule-based and knowledge-based, but both synthesize speech that is flat and unengaging, and both perform poorly on continuous speech or when the speaker's timbre must be preserved; hence concatenative synthesis has lately become dominant.

• Concatenative synthesis (Concatenated Synthesis):

A corpus of recorded speech serves as the matching target: the corresponding sound units are retrieved from the corpus, so much of the detailed prosodic adjustment required under rule-based and knowledge-based methods is eliminated. This simplifies the complex computation of splicing and accent, and it is especially suitable for output over a small vocabulary.

Difficulties in Speech Synthesis

1. Naturalness of pronunciation (clarity, fluency).

2. Handling heteronyms (characters with multiple readings).

3. Real-time processing capability.

The Four Main Modules of Speech Synthesis

1. Text analysis

Analyse the syntax and semantics of the text, then convert them into linguistic feature parameters.

This lets the computer know which parts of the text are words and which are sentences, what sounds to produce and how to produce them, where to pause when speaking, and for how long.

1. Rule-based: maximum matching, reverse maximum matching, word-by-word search, best matching, two-pass scanning, and so on (a sketch of maximum matching follows this list).

2. Data-driven: bigram models, trigram models, hidden Markov models (HMM), neural networks, and so on.
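
As promised above, a minimal sketch of forward maximum matching with a toy lexicon; real systems use large dictionaries plus the refinements listed:

```python
def max_match(text, lexicon, max_word_len=4):
    """Forward maximum matching: at each position take the longest lexicon
    word that matches; fall back to a single character."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in lexicon:
                tokens.append(candidate)
                i += length
                break
    return tokens

lexicon = {"台灣", "台灣人", "國語", "講"}      # toy lexicon for illustration
print(max_match("我是講國語的台灣人", lexicon))
# ['我', '是', '講', '國語', '的', '台灣人']
```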

2. Prosody generator

The linguistic feature parameters are fed into the prosody generator, which produces the prosodic information for each syllable of the text, including pitch contour, volume, and duration.

Convert intonation, tone of voice, pausing, and segment duration into prosodic parameters.

1. Rule-based.

2. Data-driven: neural networks (Neural Network Method).

3. Synthesis unit generator

Output synthesis units based on the monosyllabic phone waveform samples in the speech database.

4. Speech synthesizer

Select suitable acoustic parameters from the sound database according to the sounds to be produced, then generate speech via the speech synthesis algorithm using the prosodic parameters obtained from the prosody model.
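
Read as code, the four modules form a pipeline. The skeleton below is hypothetical: every function name and data shape is an illustrative placeholder, not the API of any actual TTS system:

```python
def text_analysis(text):
    """Module 1: segment words and sentences, decide pronunciations and
    pause positions, and emit linguistic feature parameters."""
    return [{"syllable": ch, "pause_after": False} for ch in text]  # stub

def prosody_generator(features):
    """Module 2: predict per-syllable prosody: pitch contour, volume, duration."""
    for f in features:
        f["prosody"] = {"f0_contour": [], "volume": 1.0, "duration_ms": 200}
    return features

def unit_generator(features):
    """Module 3: fetch matching waveform units from the recorded speech database."""
    return [b"" for _ in features]  # stub: corpus lookup

def synthesizer(units, features):
    """Module 4: warp each unit to the predicted prosody (e.g. via PSOLA)
    and join the results into the output waveform."""
    return b"".join(units)

def tts(text):
    features = prosody_generator(text_analysis(text))       # modules 1 and 2
    return synthesizer(unit_generator(features), features)  # modules 3 and 4
```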

Speech-Related Applications

1. Speech synthesis (Speech Synthesis): using information technology to make a computer or electronic device imitate the human voice.

2. Speech recognition (Speech Recognition): enabling the computer to understand human speech.

1. Speaker-dependent (Speaker Dependent): does not require the speaker to pronounce accurately, but requires prior training.

2. Speaker-independent (Speaker Independent): the speaker must pronounce fairly accurately, but no training is needed.

3. Speaker identification (Speaker Identification): recognizing the speaker's identity.

Reference: http://irw.ncut.edu.tw/peterju/speech.html

Voice Acoustics

Why study vocal tract acoustics?



     There are fundamental scientific questions to answer in the field of acoustical phonetics, but we are also interested in the applications in speech training (language teaching), and in speech pathology. When adults or teenagers learn a foreign language, they rarely achieve authentic pronunciation and are sometimes almost unintelligible to speakers of that language. This difficulty is due to imprecision or inadequacy of the auditory feedback system usually used to learn languages - students often cannot hear how wrong their imitation of a sound is, and do not know what to do to improve it. (Technically, the problems are called categorization and interference.) The problem is even more severe for the hearing impaired who have little or no auditory feedback and can obtain very little feedback about the interior of the vocal tract from looking at the lips. A feature of the "deaf accent" is inappropriate use of the soft palate - which is not surprising given how difficult it is to see or to feel what one's soft palate is doing during speech.

We also investigate the acoustics of the singing voice, partly for its intrinsic interest, and partly with the aim of improving pedagogy in that field.

We have developed a device for measuring some important acoustic properties of the vocal tract non-invasively, in real-time, while the owner of the vocal tract is speaking or singing. We use it as a research tool, but we have demonstrated its use as a speech trainer.

Existing technologies used in speech pathology and speech trainers to provide visual feedback from the speech sound are inherently limited in precision and practicality. Even the most advanced speech recognition systems still mistake words, which indicates the limits of their precision as measures of pronunciation. The basic problem is that the speech signal alone does not have enough information in it to allow us to work out, quickly and precisely, the configuration of the vocal tract. This is not a problem for understanding speech, but it may be a problem in learning precise pronunciation. Our approach is therefore to introduce a signal with more information in the frequency domain.

Our technology is called Real-time Acoustic response by Vocal tract Excitation or RAVE. In model experiments using the laboratory prototype, we have shown that one or two hours' training using visual feedback of some key features of the acoustical response of a subject's vocal tract improves the accuracy and intelligibility of pronunciation of foreign phonemes by monolingual adults.

How it works:

We inject into the vocal tract an acoustic current which is synthesized to give high resolution frequency information over the frequency range of interest. We then measure the impedance of the vocal tract in parallel with the external field using the response to this excitation signal.
 
In this figure, I pronounce the vowel in 'heard'. The sharp vertical peaks are the harmonics of my voice. The broad signal shows the response of my vocal tract to the acoustic current signal being injected from the lips.

For this vowel, my vocal tract behaves rather like a cylinder about 170 mm long, nearly closed at the vocal folds and open at the mouth. A cylinder of length L, closed at one end, has resonances at f0 = v/4L, 3f0, 5f0, etc., where v is the speed of sound. (See pipes and harmonics.) So we see resonances at about 0.5, 1.5, 2.5, 3.5 and 4.5 kHz, which appear as the peaks in the smooth curve in this figure. When I pronounce the vowel in "had", I open my mouth wider, so the tract is no longer cylindrical but flared at the open end, a bit like the flare and bell on a brass instrument. One of the effects of this shape in a brass instrument is to raise the frequencies of the resonances, especially the lower ones. (In a related example, conical pipes have resonances at higher frequencies than cylindrical ones do. See this link for an explanation.)
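
Those resonance frequencies follow directly from the closed-open tube formula; a quick check in Python:

```python
# Resonances of a tube closed at one end and open at the other:
#   f_n = (2n - 1) * v / (4L),  i.e. f0, 3*f0, 5*f0, ...
v = 343.0   # speed of sound in air, m/s
L = 0.170   # vocal tract length from the text, m

print([round((2 * n - 1) * v / (4 * L)) for n in range(1, 6)])
# [504, 1513, 2522, 3531, 4540]  ~ 0.5, 1.5, 2.5, 3.5, 4.5 kHz
```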

From this response we can readily determine the resonances of the vocal tract, independently of the speech signal. The resonant frequencies are interesting for fundamental acoustical phonetic research but, if we extract them in real time, they can be used to drive a cursor for speech training. This is how we do it in the real-time version.

Schematic diagram. (a) shows the spectrum of the speech signal alone. This male voice has harmonic partials spaced at the pitch frequency, 126 Hz. (b) The injected signal has frequencies spaced at 5 Hz, whose amplitudes are calibrated (in this case) using the radiation field outside the speaker's mouth. (c) The sum of the speech signal and the broadband signal (including the effects of the resonances) goes from the microphone to the ADC. The speech signal is used to measure pitch and amplitude; then the harmonic components below 1 kHz are removed. (d) The resonances are detected from the remaining interpolated signal. Similarly, the broadband signals may be removed to leave just the speech harmonics. In the real-time version of the device used for speech training, the resonance frequencies are used to position the cursor on the vowel plane (see below). Notice that the signal-to-noise ratio in these figures is poorer than in the preceding figure, a consequence of making the measurements rapidly.
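
A speculative sketch of the harmonic-removal and peak-picking steps the schematic describes, assuming NumPy/SciPy; the function and its parameters are our invention, not the RAVE implementation:

```python
import numpy as np
from scipy.signal import find_peaks

def find_resonances(freqs, magnitude_db, pitch_hz, min_spacing_hz=200.0):
    """Drop spectral bins near the voice harmonics below 1 kHz, interpolate
    across the gaps, then pick broad peaks of the remaining (injected
    broadband) response as resonance estimates."""
    keep = np.ones(freqs.shape, dtype=bool)
    for h in np.arange(pitch_hz, 1000.0, pitch_hz):
        keep &= np.abs(freqs - h) > 0.2 * pitch_hz     # notch out each harmonic
    smooth = np.interp(freqs, freqs[keep], magnitude_db[keep])
    min_bins = max(int(min_spacing_hz / (freqs[1] - freqs[0])), 1)
    peaks, _ = find_peaks(smooth, distance=min_bins)
    return freqs[peaks]
```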

How it looks:



This is a screen dump of the feedback display in the current speech trainer device, set up with targets from Australian English. The background ellipses are measurements of the vowels of 33 Australian men, with mean values for each vowel at the centre of each ellipse. The semi-axes are the standard deviations in R1 and R2. These or other areas can be used as targets in speech training. A cursor on the monitor (the cross at (1190,530)) shows the current configuration of the subject's own vocal tract. Initially, subjects 'steer' the motion of the cursor by consciously controlling jaw and tongue position. Speakers of the language displayed can 'aim' towards one of the vowels shown. After some practice, however, it becomes nearly as automatic as using a joy-stick or a mouse - one just 'makes it go' where one wants, without thinking of the muscular details. In other words, a visual feedback loop is unconsciously used to train articulation.

Reference: http://www.phys.unsw.edu.au/jw/speech.html