Student: CHEN, KAI-HSIEN (陳塏憲)
Title: Research on the Application of Transformers in Visual Recognition Tasks (基於Transformer應用於視覺辨識任務之研究)
Advisor: HORNG, GWO-JIUN (洪國鈞)
Oral defense committee: CHOU, CHIH-LUN (周智倫); TSOU, YAO-TUNG (鄒耀東)
Oral defense date: 2025-07-04
Degree: Master's
Institution: Southern Taiwan University of Science and Technology (南臺科技大學)
Department: Department of Computer Science and Information Engineering (資訊工程系)
Discipline: Engineering
Field of study: Electrical and Computer Engineering
Thesis type: Academic thesis
Year of publication: 2025
Academic year of graduation: 113 (ROC calendar)
Language: Chinese
Pages: 70
Keywords (Chinese): Vision Transformer; 光學字符辨識; 手寫文字辨識
Keywords (English): Vision Transformer; Optical Character Recognition (OCR); Handwritten Text Recognition
In recent years, Large Language Models (LLMs) have sparked a technological revolution in artificial intelligence and become a focal point for both industry and academia. Whether in Meta's LLaMA [1] or the widely deployed Generative Pre-Trained Transformer (GPT) series [2], the key component behind these models is the Transformer architecture. Proposed by Google in 2017, it combines the self-attention mechanism [3] with a high degree of parallelism to substantially improve the effectiveness and efficiency of natural language processing, advancing applications such as language generation, translation, and dialogue systems. The strengths of the Transformer are not limited to text: in recent years it has extended to multimodal applications spanning vision, speech, and music. The Vision Transformer (ViT) [4] has even surpassed the traditional Convolutional Neural Network (CNN) [5] on image recognition tasks, becoming one of the key technologies in computer vision.
This thesis focuses on applying Transformer models to handwritten Traditional Chinese character recognition. We experiment with three models: the Document Understanding Transformer (Donut) [7], built on the Shifted Window Transformer (Swin Transformer) architecture [6]; LayoutXLM [9], built on a multimodal Transformer architecture [8]; and PaddleOCR V5 [11], built on the SVTR (Scene Text Recognition with a Single Visual Model) architecture [10]. We compare the recognition accuracy of each model on synthetic and real handwritten invoice images and assess their feasibility and stability in practical settings. The final results are integrated into a simple user-facing platform that automatically extracts key fields such as the buyer, the uniform invoice number, and the address. Users only need to verify the output and correct any errors; the system then stores the recognition results and feeds the erroneous samples back into training, continuously improving accuracy.
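The verify-and-feedback loop just described can be sketched in a few lines. The function name `review`, the field names, and the retraining queue below are illustrative assumptions for demonstration, not the platform's actual implementation:

```python
# Illustrative sketch of the invoice platform's correction-and-feedback loop.
# Field names and data structures are assumptions, not the thesis's real code.

def review(predicted: dict, corrections: dict, retrain_queue: list) -> dict:
    """Apply the user's corrections and queue mispredicted fields for retraining."""
    final = dict(predicted)
    for field, fixed_value in corrections.items():
        if predicted.get(field) != fixed_value:
            # Mispredicted sample: keep it so the model can later be
            # fine-tuned on exactly the cases it got wrong.
            retrain_queue.append({"field": field,
                                  "predicted": predicted.get(field),
                                  "corrected": fixed_value})
            final[field] = fixed_value
    return final

queue = []
predicted = {"buyer": "甲公司", "uniform_number": "12345678", "address": "台南市永康區"}
result = review(predicted, {"uniform_number": "12345679"}, queue)
```

After this call, `result` holds the corrected fields while `queue` holds one error sample ready to be fed back into training, matching the continual-learning cycle the platform describes.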
In the course of applying these three models to handwritten Traditional Chinese character recognition, this thesis finds that the most difficult step is the initial collection of training data: whether for fine-tuning, for generating simulated data, or for validating model accuracy, usable handwritten font datasets are uncommon. This thesis therefore additionally develops a Transformer model for font generation. Built on the Vision Transformer (ViT) approach, the model can effectively learn the style of an input font and output other characters that imitate that style. It has the potential to alleviate the scarcity of Traditional Chinese handwriting data, using generative learning to expand the amount of training data and improve recognition performance.
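The ViT-style encoder underlying the font generation model begins by splitting a glyph image into fixed-size patches and flattening each one into a token. A minimal sketch of that patchify step follows; the 28x28 image size and 7x7 patch size are illustrative assumptions, not the thesis's actual configuration:

```python
def patchify(image, patch):
    """Split a 2-D image (list of rows) into flattened patch tokens, row-major.

    This mirrors the ViT front end: each patch becomes one token, which a
    learned linear projection would then map into the model dimension.
    """
    h, w = len(image), len(image[0])
    assert h % patch == 0 and w % patch == 0
    tokens = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            tokens.append([image[py + dy][px + dx]
                           for dy in range(patch)
                           for dx in range(patch)])
    return tokens

# A toy 28x28 binary "glyph"; 7x7 patches give a 4x4 grid of 16 tokens.
glyph = [[(y * 28 + x) % 2 for x in range(28)] for y in range(28)]
tokens = patchify(glyph, 7)
```

Each token (length 49 here) is then treated like a word in a sentence, which is what lets the style encoder attend across the whole glyph at once.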
With its global receptive field and capacity to model long-range dependencies, the Transformer architecture can take longer feature sequences as input than a Long Short-Term Memory (LSTM) network, and its attention-based computation assigns attention weights over both preceding and following positions in the sequence. When handling handwritten glyphs with large structural variation, it can therefore extract the feature relationships between strokes more effectively, improving recognition accuracy and stability. Through experimental analysis, this thesis verifies the potential of Transformers for handwritten Traditional Chinese text recognition and further discusses their extension to other visual recognition tasks such as object detection and scene understanding. The results are expected to provide a theoretical basis and practical reference for future intelligent visual systems, and to advance recognition technology in both accuracy and computational efficiency.
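The attention computation referred to above is the scaled dot-product attention of [3], softmax(QK^T / sqrt(d_k))V, in which every query position receives a weight for every key position, before or after it. A minimal pure-Python sketch (the 2-dimensional toy vectors are illustrative only):

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Every query attends over every key, both earlier and later in the
    sequence, which is what gives the Transformer its global view.
    """
    d_k = len(K[0])
    outputs = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)  # one weight per position, summing to 1
        outputs.append([sum(w * v[j] for w, v in zip(weights, V))
                        for j in range(len(V[0]))])
    return outputs

# One query attending over two positions; the output is a weighted mix of V's rows.
out = attention([[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])
```

Because the query aligns more strongly with the first key, the output lies closer to V's first row; stacking such layers is what lets the model relate strokes anywhere in a glyph.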
Recent advances in Large Language Models (LLMs), such as Meta’s LLaMA and
the GPT series, have highlighted the transformative impact of the Transformer
architecture in AI. Originally proposed by Google in 2017, Transformers utilize self-attention mechanisms and parallel processing to significantly enhance the efficiency of
natural language tasks. Beyond text, Transformer-based models have extended to
multimodal domains including vision, where Vision Transformers (ViT) have
outperformed traditional CNNs in image recognition.
This study explores the use of Transformer-based models for handwritten Traditional
Chinese character recognition. We evaluate three models—Donut (based on Swin
Transformer), LayoutXLM (Multimodal Transformer), and PaddleOCR V5 (SVTR)—on
synthetic and real handwritten invoice images. Their recognition performance and
practical viability are assessed and integrated into a user-friendly system for automatic
field extraction, with mechanisms for error correction and continual learning.
A major challenge lies in the scarcity of handwritten Traditional Chinese datasets,
which hinders fine-tuning and evaluation. To address this, we propose a ViT-based font
generation model capable of learning and mimicking font styles to generate new
characters, effectively expanding training data and improving recognition accuracy.
Thanks to their ability to model long-range dependencies and complex stroke
relationships, Transformers offer advantages over traditional LSTM-based methods. Our
experiments confirm the potential of Transformer architectures in enhancing handwritten
text recognition and suggest their broader applicability to tasks like object detection and
scene understanding. This work provides both theoretical insight and practical tools for
advancing intelligent visual systems.
Research on the Application of Transformers in Visual Recognition Tasks
Acknowledgments
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
Chapter 1 Introduction
1.1 Background and Motivation
1.2 Research Objectives
1.3 Thesis Organization
Chapter 2 Literature Review
2.1 Transformer and the Self-Attention Mechanism
2.2 Vision Transformer (ViT)
2.3 Document Understanding Transformer (Donut)
2.4 LayoutLM
2.5 PaddleOCR V5
2.6 Masked Autoencoder (MAE)
2.7 Summary of the Literature Review
Chapter 3 Research Methods
3.1 Construction Pipeline of the Three Recognition Models
3.2 Fine-Tuning the Handwriting Recognition Models
3.2.1 PaddleOCR V5 Fine-Tuning
3.2.2 Donut and LayoutXLM Fine-Tuning
3.3 Invoice Recognition System
3.4 Font Generation Model
3.4.1 Model Architecture
3.4.2 Image Preprocessing
3.4.3 Style Encoder and Feature Encoder
3.4.4 Multi-Style Fusion Layer
3.4.5 Cross-Attention Mechanism and Fusion Layer
3.4.6 Font Decoder
Chapter 4 Research Results
4.1 Analysis of the Invoice Recognition Model Results
4.1.1 Comparison of Output Results
4.1.2 Donut Fine-Tuning Performance Comparison
4.1.3 Comparison of Donut, LayoutXLM, and PaddleOCR V5
4.2 Font Generation Model Results
4.2.1 MAE Pre-Training Loss Curves
4.2.2 Comparison of Generated Character Results
4.3 Dual-Channel Font Generation Model
4.3.1 Dual-Channel Training Loss Curves
4.3.2 Dual-Channel Model Generation Results
Chapter 5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
References
[1] Touvron, Hugo et al. “LLaMA: Open and Efficient Foundation Language
Models.” ArXiv abs/2302.13971 (2023): n. pag.
[2] Yenduri, G., et al. "GPT (Generative Pre-Trained Transformer)—A Comprehensive
Review on Enabling Technologies, Potential Applications, Emerging Challenges,
and Future Directions." IEEE Access, vol. 12, 2024, pp. 54608–54649. IEEE,
https://doi.org/10.1109/ACCESS.2024.3389497.
[3] Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N.,
Kaiser, L., & Polosukhin, I. (2017). Attention is All you Need. Neural Information
Processing Systems.
[4] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner,
T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N.
(2020). An Image is Worth 16x16 Words: Transformers for Image Recognition at
Scale. ArXiv, abs/2010.11929.
[5] Li, Zewen et al. “A Survey of Convolutional Neural Networks: Analysis,
Applications, and Prospects.” IEEE Transactions on Neural Networks and Learning
Systems 33 (2020): 6999-7019.
[6] Liu, Ze et al. “Swin Transformer: Hierarchical Vision Transformer using Shifted
Windows.” 2021 IEEE/CVF International Conference on Computer Vision
(ICCV) (2021): 9992-10002.
[7] Kim, Geewook et al. “OCR-Free Document Understanding Transformer.” European Conference on Computer Vision (2022).
[8] Xu, Peng et al. “Multimodal Learning With Transformers: A Survey.” IEEE
Transactions on Pattern Analysis and Machine Intelligence 45 (2022): 12113-12132.
[9] Xu, Yiheng et al. “LayoutXLM: Multimodal Pre-training for Multilingual Visually
rich Document Understanding.” ArXiv abs/2104.08836 (2021): n. pag.
[10] Du, Yongkun et al. “SVTR: Scene Text Recognition with a Single Visual
Model.” International Joint Conference on Artificial Intelligence (2022).
[11] Du, Yuning et al. “PP-OCR: A Practical Ultra Lightweight OCR
System.” ArXiv abs/2009.09941 (2020): n. pag.
[12] Devlin, Jacob et al. “BERT: Pre-training of Deep Bidirectional Transformers for
Language Understanding.” North American Chapter of the Association for
Computational Linguistics (2019).
[13] Li, Minghao et al. “TrOCR: Transformer-based Optical Character Recognition with
Pre-trained Models.” ArXiv abs/2109.10282 (2021): n. pag.
[14] Ba, J., Kiros, J.R., & Hinton, G.E. (2016). Layer Normalization. ArXiv,
abs/1607.06450.
[15] Hendrycks, Dan and Kevin Gimpel. “Gaussian Error Linear Units (GELUs).” arXiv:
Learning (2016): n. pag.
[16] Gheini, M., Ren, X., & May, J. (2021). Cross-Attention is All You Need: Adapting Pretrained Transformers for Machine Translation. arXiv. https://arxiv.org/abs/2104.08771
[17] Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., & Guo, B. (2021). Swin
Transformer: Hierarchical Vision Transformer using Shifted Windows. 2021
IEEE/CVF International Conference on Computer Vision (ICCV), 9992-10002.
[18] Xu, Y., Li, M., Cui, L., Huang, S., Wei, F., & Zhou, M. (2020). LayoutLM: Pre-training of Text and Layout for Document Image Understanding. Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
[19] He, Kaiming et al. “Masked Autoencoders Are Scalable Vision Learners.” 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022): 15979-15988.
[20] Rao, J. N. K. et al. “Mean squared error of empirical predictor.” Annals of
Statistics 32 (2004): 818-840.
[21] Simonyan, Karen and Andrew Zisserman. “Very Deep Convolutional Networks for
Large-Scale Image Recognition.” CoRR abs/1409.1556 (2014): n. pag.
[22] Liu, Y., & Lian, Z. (2023). FontTransformer: Few-shot high-resolution Chinese
glyph image synthesis via stacked transformers. Pattern Recognition, 141, 109593.
https://doi.org/10.1016/j.patcog.2023.109593.
[23] Wang, Z., & Liu, J. (2024). One-Shot Multilingual Font Generation Via ViT. arXiv.
https://arxiv.org/abs/2412.11342