Author: HUNG, YI-HSIANG (洪翌翔)
Title: 運用少量客家語料建置中文文字轉客語語音系統之研究
Title (English): A Chinese-to-Hakka Text-to-Speech System Construction with Small-sized Hakka Corpus
Advisor: HUANG, YI-CHIN (黃奕欽)
Committee Members: WU, CHUNG-HSIEN (吳宗憲); LIN, YI-KAI (林義凱)
Oral Defense Date: 2023-01-12
Degree: Master's
Institution: National Pingtung University (國立屏東大學)
Department: Master's Program, Department of Computer Science and Artificial Intelligence
Discipline: Engineering
Academic Field: Electrical Engineering and Computer Science
Thesis Type: Academic thesis
Publication Year: 2023
Graduation Academic Year: 111 (2022-2023)
Language: Chinese
Pages: 82
Keywords (Chinese): 客語, 機器翻譯, 字轉拼音, 語音合成
Keywords (English): Hakka, Machine Translation, Grapheme-to-Phoneme, Text-to-Speech
This thesis builds a system that converts Chinese text into Hakka speech. The goal is to let a user enter Chinese text and immediately hear Hakka speech with the corresponding meaning, thereby supporting Hakka language learning. The system consists of three modules: Chinese-to-Hakka machine translation, Hakka grapheme-to-phoneme conversion, and Hakka speech synthesis.
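To make the three-module chain concrete, the following is a minimal Python sketch of the pipeline. Every function here is a hypothetical stand-in, not code from the thesis:

```python
def translate_zh_to_hakka(text: str) -> str:
    """Placeholder for module 1, the Chinese-to-Hakka translation model."""
    return text  # a real system returns Hakka characters

def hakka_g2p(hakka_text: str) -> list[str]:
    """Placeholder for module 2, the grapheme-to-phoneme model."""
    return list(hakka_text)  # a real system returns Hakka pinyin syllables

def synthesize(pinyin: list[str]) -> bytes:
    """Placeholder for module 3, the speech synthesizer and vocoder."""
    return b""  # a real system returns waveform samples

def chinese_to_hakka_speech(chinese_text: str) -> bytes:
    hakka_text = translate_zh_to_hakka(chinese_text)  # module 1: MT
    pinyin = hakka_g2p(hakka_text)                    # module 2: G2P
    return synthesize(pinyin)                         # module 3: TTS
```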
In this system, the Chinese input is first segmented by a Chinese word segmenter and then paired with segmented Hakka text to train a neural machine translation model. Because Hakka corpora are scarce and manually annotated data is limited, no good Hakka word segmenter was available in the past. In recent years, statistically based Hakka segmenters have appeared, along with the segmenter of the new Taiwan Hakka Corpus recently released by the Hakka Affairs Council, and these segmenters benefit the machine translation stage. We also propose a forced-segmentation method that uses a dictionary of Hakka-specific vocabulary, aiming to reduce the translation-quality loss caused by segmentation errors. After the machine translation module converts Chinese into Hakka text, the result is passed to a neural grapheme-to-phoneme model that predicts the Hakka pinyin. Because Hakka pronunciation data is likewise scarce, the training data contains many unknown characters and special pronunciations that the model cannot predict correctly. This thesis therefore proposes consulting a Hakka pinyin dictionary: whenever a character's reading is found in the dictionary and is fully determined, the dictionary reading replaces the neural model's prediction, making the module's pinyin output more accurate. The pinyin is finally fed into a speech synthesis model that produces the Hakka audio. Given the small amount of Hakka speech data, we add speaker embedding vectors so that recordings from different Hakka speakers can be pooled for training, and we further add Chinese speech data so that the model can synthesize cross-lingual sentences in more speakers' voices.
This study also compiled Hakka data in two accents, Southern and Northern Si-Xian, and trained separate models in the machine translation and grapheme-to-phoneme modules to predict translations and pronunciations for each accent. Several native Hakka speakers were invited to evaluate the system subjectively, and the results confirm that it correctly translates into both Southern and Northern Si-Xian and synthesizes audio of good quality. The objective experiments analyzed how different Hakka word segmenters affect machine translation: the jieba segmenter achieved a BLEU score of 36.61, our forced-segmentation method 39.06, and the Taiwan Hakka Corpus segmenter 40.84. The corpus segmenter thus performed best, indicating that its segmentation most effectively improves Hakka machine translation quality.
This research aims to build a Chinese-to-Hakka text-to-speech system using several deep neural network-based models. The main goal of the proposed system is to help Hakka language learners who are already familiar with Chinese. The input to the system is therefore a Chinese sentence, which the system first translates into a Hakka sentence written in Hakka characters. Second, a Hakka grapheme-to-phoneme model transforms the Hakka characters into the corresponding phonetic transcription, and a pre-trained Hakka speech synthesizer then outputs the Hakka speech for the input Chinese sentence. The output speech, together with the intermediate textual output, is useful for learning the Hakka language. Since Hakka corpora are scarce, several problems had to be overcome to train the models.
The proposed system consists of three sub-models.
The first model is a Chinese-to-Hakka machine translation model. Training it requires parallel Chinese-Hakka sentence pairs. We adopted a convolutional sequence-to-sequence (Fairseq-CNN) model to realize the translation model. To train a good model, a reliable word segmenter is essential for both the source and target languages; however, no satisfactory Hakka word segmenter has been available. We therefore used a relatively simple Hakka word segmenter and proposed two methods: one that handles unseen-word problems, and one that produces better segmentations by forcing entries from a Hakka idiom dictionary, as sketched below.
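As an illustration of the forced-segmentation idea (a sketch only, not the thesis's implementation), jieba lets a user register dictionary entries that override its default segmentation; adding a Hakka-specific term with a high frequency forces jieba to keep it as a single token:

```python
import jieba

# Hypothetical Hakka-specific entries; the thesis's actual idiom
# dictionary is far larger than this toy list.
hakka_idioms = ["恁仔細", "食朝"]

# A high registered frequency forces jieba to keep each idiom as one
# token instead of splitting it into single characters.
for word in hakka_idioms:
    jieba.add_word(word, freq=1_000_000)

print(list(jieba.cut("恁仔細,𠊎食朝了")))  # the idioms stay intact
```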
The second model is a Hakka grapheme-to-phoneme model. It is similar to the first but much easier to train, so a Bi-LSTM is adopted here to output the phonetic transcription of the input Hakka characters. However, since the training data is small, the model frequently outputs incorrect transcriptions for unseen characters. We therefore incorporated a phonetic dictionary, not included in the training data, and query it for a transcription whenever the model's output is unreliable.
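A minimal sketch of this dictionary-fallback rule, using a toy dictionary and a stand-in for the Bi-LSTM (both are illustrative assumptions, including the example readings):

```python
# Toy pinyin dictionary: a character that maps to exactly one reading
# is "fully determined", so the dictionary overrides the model.
pinyin_dict = {
    "客": ["hag"],         # unambiguous: trust the dictionary
    "好": ["ho", "hau"],   # ambiguous: defer to the model
}

def model_predict(char: str) -> str:
    """Stand-in for the Bi-LSTM grapheme-to-phoneme model."""
    return "<unk>"

def g2p(hakka_text: str) -> list[str]:
    output = []
    for char in hakka_text:
        readings = pinyin_dict.get(char)
        if readings is not None and len(readings) == 1:
            output.append(readings[0])          # dictionary is certain
        else:
            output.append(model_predict(char))  # model decides
    return output
```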
The third model is a Hakka phoneme-to-speech model. To tackle the shortage of Hakka speech data for training a usable model, we combined Hakka and Chinese speech data from various speakers. A non-autoregressive speech synthesis model with speaker embeddings is implemented to train a code-switching TTS model, which can output Chinese or Hakka speech in different speakers' voices.
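A minimal PyTorch sketch of the speaker-embedding idea (the module structure and dimensions are illustrative assumptions, not the thesis's configuration): a learned per-speaker vector is added to the phoneme representation so that one model can be trained on Hakka and Chinese recordings from multiple speakers.

```python
import torch
import torch.nn as nn

class SpeakerConditionedEncoder(nn.Module):
    """Illustrative encoder front-end for a multi-speaker TTS model."""

    def __init__(self, n_phonemes=64, n_speakers=4, d_model=256):
        super().__init__()
        self.phoneme_emb = nn.Embedding(n_phonemes, d_model)
        self.speaker_emb = nn.Embedding(n_speakers, d_model)

    def forward(self, phoneme_ids, speaker_id):
        x = self.phoneme_emb(phoneme_ids)             # (batch, time, d_model)
        s = self.speaker_emb(speaker_id).unsqueeze(1)  # (batch, 1, d_model)
        return x + s  # broadcast the speaker vector over every time step

enc = SpeakerConditionedEncoder()
phonemes = torch.randint(0, 64, (2, 10))  # two toy utterances
speakers = torch.tensor([0, 3])           # spoken by two different speakers
out = enc(phonemes, speakers)             # shape: (2, 10, 256)
```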
We also investigated the effects of different accents; specifically, we trained separate Southern and Northern Si-Xian models. Based on the subjective and objective experimental results, the proposed system can indeed translate the input Chinese sentence and output suitable Hakka speech with high naturalness and intelligibility. The objective results also reveal that the performance of the Hakka word segmenter greatly affects the machine translation results.
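The thesis does not state its exact BLEU tooling; sacrebleu is one common choice for scoring a segmenter's downstream translations against references. A sketch with fabricated placeholder sentences:

```python
import sacrebleu

# Hypothetical system outputs and references: Hakka sentences,
# whitespace-separated at the word level by the segmenter under test.
hypotheses = ["𠊎 今晡日 當 歡喜"]
references = [["𠊎 今晡日 當 歡喜"]]  # one reference stream, parallel to hypotheses

# With pre-segmented text, the default tokenizer compares word-level
# tokens directly; an exact match scores 100.
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU = {bleu.score:.2f}")
```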
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables

Chapter 1  Introduction
1.1  Research Motivation and Objectives
1.2  Hakka Accents
1.3  System Pipeline
1.4  Thesis Organization
Chapter 2  Literature Review
2.1  Machine Translation
2.1.1  Rule-based Word-for-Word Machine Translation
2.1.2  Example-based Machine Translation
2.1.3  Statistical Machine Translation
2.1.4  Neural Machine Translation
2.1.5  Word Segmentation
2.2  Grapheme-to-Phoneme
2.3  Speech Synthesis
2.3.1  Unit-Selection Speech Synthesis
2.3.2  Statistical Parametric Speech Synthesis
2.3.3  End-to-End Speech Synthesis
2.3.3.1  Autoregressive Speech Synthesis
2.3.3.2  Non-Autoregressive Speech Synthesis
2.3.4  Vocoders
Chapter 3  Southern and Northern Si-Xian Hakka Corpora
3.1  Corpus Data Sources
3.2  Chinese-to-Hakka Machine Translation
3.2.1  Lexical Differences between Southern and Northern Si-Xian
3.2.2  Chinese-to-Hakka Machine Translation Data Statistics
3.2.2.1  Original Corpus
3.2.2.2  Augmented Corpus
3.3  Hakka Grapheme-to-Phoneme
3.3.1  Hakka Pinyin
3.3.1.1  Syllable Structure
3.3.1.2  Initials
3.3.1.3  Finals
3.3.1.4  Tones
3.3.1.5  Tone Sandhi in Si-Xian Hakka
3.3.2  Pinyin Differences between Southern and Northern Si-Xian
3.3.3  Grapheme-to-Phoneme Data Statistics
3.3.4  Pinyin Dictionary
3.4  Hakka Speech Synthesis
Chapter 4  Methodology
4.1  Chinese-to-Hakka Machine Translation
4.1.1  Fairseq-CNN
4.1.1.1  Embedding Layer
4.1.1.2  Convolutional Block Structure
4.1.1.3  Multi-step Attention
4.1.2  Word Segmentation System
4.1.3  Forced Segmentation
4.1.4  Unknown-Word Handling
4.2  Hakka Grapheme-to-Phoneme
4.3  Hakka Speech Synthesis
Chapter 5  Experiments and Discussion
5.1  Chinese-to-Hakka Machine Translation
5.1.1  Experimental Setup
5.1.2  Test Corpus
5.1.3  Evaluation Methods
5.1.4  Experiments and Discussion
5.1.4.1  Unknown-Word Penalty and Segmenter Differences
5.1.4.2  Southern Si-Xian Data Augmentation
5.1.4.3  Translation Differences between Southern and Northern Si-Xian Models
5.2  Hakka Grapheme-to-Phoneme
5.2.1  Experimental Setup
5.2.2  Test Corpus and Evaluation Methods
5.2.3  Experiments and Discussion
5.3  Hakka Speech Synthesis
5.3.1  Experimental Setup
5.3.2  Experiments and Discussion
5.3.2.1  Naturalness and Speaker Similarity of Synthesized Speech
5.3.2.2  Tone Sandhi
5.3.2.3  Synthesis Speed
Chapter 6  Conclusion and Future Work
References
Appendix A: Hakka Pinyin Phonemes and Southern/Northern Si-Xian Pinyin Differences
Appendix B: Pronunciation Dictionary
Appendix C: Hakka Survey Questionnaire