This research aims to build a Chinese-to-Hakka text-to-speech system using several deep neural network-based models. The main goal of the proposed system is to help Hakka language learners who are already familiar with Chinese. The input of the proposed system is therefore a Chinese sentence, which the system first translates into a Hakka sentence written in Hakka characters. Next, a Hakka grapheme-to-phoneme model transforms the Hakka characters into the corresponding phonetic transcription, and finally a pre-trained Hakka speech synthesizer outputs the Hakka speech for the input Chinese sentence. Both the output speech and the related textual outputs are useful for learning the Hakka language. Since Hakka corpora are scarce, several problems had to be overcome to train the models.
The proposed system consists of three sub-models, which are chained into a single pipeline as sketched below.
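Conceptually, the three sub-models form one pipeline from Chinese text to Hakka speech. The following minimal Python sketch shows the data flow only; the function arguments are hypothetical placeholders for the sub-models described below, not the actual implementation.

```python
# Sketch of the end-to-end pipeline: Chinese text -> Hakka text ->
# Hakka pinyin -> Hakka speech. The three callables are hypothetical
# placeholders for the sub-models described in this abstract.
def chinese_to_hakka_speech(zh_sentence, translator, g2p, tts):
    hakka_text = translator(zh_sentence)    # sub-model 1: machine translation
    hakka_pinyin = g2p(hakka_text)          # sub-model 2: grapheme-to-phoneme
    waveform = tts(hakka_pinyin)            # sub-model 3: speech synthesis
    return hakka_text, hakka_pinyin, waveform
```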
The first model is a Chinese-to-Hakka machine translation model, whose training requires parallel Chinese-Hakka sentence pairs. We adopted a CNN-based sequence-to-sequence model (Fairseq-CNN) to realize the translation model. Training a good model requires a reliable word segmenter for both the source and target languages; however, no satisfactory Hakka word segmenter has been available. We therefore used a relatively simple Hakka word segmenter and proposed two remedies: handling unknown words explicitly and generating better segmentations by forcing segmentation with a Hakka idiom dictionary.
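To illustrate the forced-segmentation idea, the sketch below uses jieba's user-word mechanism to keep known Hakka idioms as single tokens. This is a simplified illustration under assumed inputs; the idiom entries and the example sentence are hypothetical, and the thesis's actual segmenter may differ.

```python
# Sketch: forcing a segmenter to respect a Hakka idiom dictionary.
# The idiom entries and example sentence are hypothetical placeholders.
import jieba

hakka_idioms = ["恁仔細", "食飽吂"]  # hypothetical Hakka idiom entries

# Register each idiom so the segmenter treats it as one token;
# a large frequency forces the merged segmentation to win.
for idiom in hakka_idioms:
    jieba.add_word(idiom, freq=1_000_000)

tokens = list(jieba.cut("你食飽吂?"))
print(tokens)  # the registered idiom should now surface as one token
```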
The second model is a Hakka grapheme-to-phoneme model. This model is similar to the first one but much easier to train, so a Bi-LSTM is adopted here to output a suitable phonetic transcription of the input Hakka characters. However, since the training data is small, the model frequently outputs incorrect phonetic transcriptions for unseen characters. We therefore incorporated a phonetic dictionary, not included in the training data, to look up possible transcriptions whenever the model's output is unreliable.
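The dictionary-fallback idea can be sketched as follows: a character-level Bi-LSTM tagger predicts one phonetic label per character, and a pronunciation dictionary overrides the prediction whenever the model's confidence falls below a threshold. This is a minimal PyTorch sketch; all dimensions, names, and the threshold are illustrative assumptions, not the thesis's exact configuration.

```python
# Minimal sketch of a Bi-LSTM grapheme-to-phoneme tagger with a
# dictionary fallback for low-confidence predictions. All sizes,
# vocabularies, and the threshold are hypothetical.
import torch
import torch.nn as nn

class BiLSTMG2P(nn.Module):
    def __init__(self, n_chars, n_phonemes, emb_dim=128, hidden=256):
        super().__init__()
        self.emb = nn.Embedding(n_chars, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_phonemes)

    def forward(self, char_ids):              # (batch, seq)
        h, _ = self.lstm(self.emb(char_ids))  # (batch, seq, 2*hidden)
        return self.out(h)                    # (batch, seq, n_phonemes)

def transcribe(model, char_ids, chars, id2phone, lexicon, threshold=0.8):
    """Keep the model's per-character prediction unless its confidence
    is low, in which case fall back to a pronunciation dictionary."""
    probs = torch.softmax(model(char_ids), dim=-1)[0]  # (seq, n_phonemes)
    conf, pred = probs.max(dim=-1)
    result = []
    for ch, c, p in zip(chars, conf.tolist(), pred.tolist()):
        if c < threshold and ch in lexicon:
            result.append(lexicon[ch])        # dictionary fallback
        else:
            result.append(id2phone[p])        # model prediction
    return result
```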
The third model is a Hakka phoneme-to-speech model. To overcome the lack of Hakka speech data for training a usable model, we combined Hakka and Chinese speech data from various speakers. A non-autoregressive speech synthesis model with speaker embeddings is trained as a code-switching TTS model that can output Chinese or Hakka speech in different speakers' voices.
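The speaker conditioning can be sketched as a learned speaker embedding added to every encoder frame of a non-autoregressive (FastSpeech 2-style) model, so a single network can synthesize either language in any training speaker's voice. The module and dimensions below are assumptions for illustration, not the thesis's exact architecture.

```python
# Sketch: adding a learned speaker embedding to encoder outputs in a
# non-autoregressive TTS model. Names and dimensions are illustrative.
import torch
import torch.nn as nn

class SpeakerConditioner(nn.Module):
    def __init__(self, n_speakers, d_model=256):
        super().__init__()
        self.spk_emb = nn.Embedding(n_speakers, d_model)

    def forward(self, encoder_out, speaker_id):
        # encoder_out: (batch, frames, d_model); speaker_id: (batch,)
        # Broadcast one embedding per utterance across all frames.
        return encoder_out + self.spk_emb(speaker_id).unsqueeze(1)

cond = SpeakerConditioner(n_speakers=4)
enc = torch.randn(2, 100, 256)           # fake encoder outputs
out = cond(enc, torch.tensor([0, 3]))    # two different speakers
print(out.shape)                         # torch.Size([2, 100, 256])
```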
We also investigated the effects of different accents; specifically, we trained Southern and Northern Si-Xian models separately. Based on the subjective and objective experimental results, the proposed system can indeed translate an input Chinese sentence and output suitable Hakka speech with high naturalness and intelligibility. The objective results also revealed that the performance of the Hakka word segmenter greatly affects the machine translation results.
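As a point of reference, objective machine translation quality of this kind is commonly measured with BLEU; the sketch below uses the sacrebleu package on hypothetical, pre-segmented sentences, and the thesis's exact evaluation tooling may differ.

```python
# Sketch: corpus-level BLEU with sacrebleu. The hypothesis and reference
# sentences are hypothetical and already whitespace-segmented, so the
# built-in tokenizer is disabled.
import sacrebleu

hypotheses = ["佢 今晡日 去 學校"]     # system outputs, one per line
references = [["佢 今晡日 去 學校"]]   # one reference stream
score = sacrebleu.corpus_bleu(hypotheses, references, tokenize="none")
print(f"BLEU = {score.score:.2f}")
```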
Abstract (in Chinese)
Abstract (in English)
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Motivation and Objectives
1.2 Hakka Accents
1.3 System Overview
1.4 Thesis Organization
Chapter 2 Literature Review
2.1 Machine Translation
2.1.1 Rule-Based Word-for-Word Machine Translation
2.1.2 Example-Based Machine Translation
2.1.3 Statistical Machine Translation
2.1.4 Neural Machine Translation
2.1.5 Word Segmentation
2.2 Grapheme-to-Phoneme Conversion
2.3 Speech Synthesis
2.3.1 Unit-Selection Speech Synthesis
2.3.2 Statistical Parametric Speech Synthesis
2.3.3 End-to-End Speech Synthesis
2.3.3.1 Autoregressive Speech Synthesis
2.3.3.2 Non-Autoregressive Speech Synthesis
2.3.4 Vocoders
Chapter 3 Southern and Northern Si-Xian Hakka Corpora
3.1 Corpus Data Sources
3.2 Chinese-to-Hakka Machine Translation
3.2.1 Lexical Differences between Southern and Northern Si-Xian
3.2.2 Data Statistics for Chinese-to-Hakka Machine Translation
3.2.2.1 Original Corpus
3.2.2.2 Augmented Corpus
3.3 Hakka Text-to-Pinyin Conversion
3.3.1 Hakka Pinyin Scheme
3.3.1.1 Syllable Structure
3.3.1.2 Initials
3.3.1.3 Finals
3.3.1.4 Tones
3.3.1.5 Tone Sandhi in Si-Xian Hakka
3.3.2 Pinyin Differences between Southern and Northern Si-Xian
3.3.3 Data Statistics for Hakka Text-to-Pinyin Conversion
3.3.4 Pinyin Dictionary
3.4 Hakka Speech Synthesis
Chapter 4 Methodology
4.1 Chinese-to-Hakka Machine Translation
4.1.1 Fairseq-CNN
4.1.1.1 Embedding Layer
4.1.1.2 Convolutional Block Structure
4.1.1.3 Multi-Step Attention
4.1.2 Word Segmentation System
4.1.3 Forced Segmentation
4.1.4 Unknown-Word Handling
4.2 Hakka Text-to-Pinyin Conversion
4.3 Hakka Speech Synthesis
Chapter 5 Experiments and Discussion
5.1 Chinese-to-Hakka Machine Translation
5.1.1 Experimental Setup
5.1.2 Test Corpus
5.1.3 Evaluation Methods
5.1.4 Experiments and Discussion
5.1.4.1 Unknown-Word Penalty and Word Segmenter Differences
5.1.4.2 Data Augmentation for the Southern Si-Xian Corpus
5.1.4.3 Translation Differences between the Southern and Northern Si-Xian Models
5.2 Hakka Text-to-Pinyin Conversion
5.2.1 Experimental Setup
5.2.2 Test Corpus and Evaluation Methods
5.2.3 Experiments and Discussion
5.3 Hakka Speech Synthesis
5.3.1 Experimental Setup
5.3.2 Experiments and Discussion
5.3.2.1 Naturalness and Speaker Similarity of Synthesized Speech
5.3.2.2 Tone Sandhi
5.3.2.3 Synthesis Speed
Chapter 6 Conclusions and Future Work
References
Appendix A: Hakka Pinyin Phonemes and Pinyin Differences between Southern and Northern Si-Xian
Appendix B: Pronunciation Dictionary
Appendix C: Hakka Survey Questionnaire