Recent advances in Large Language Models (LLMs), such as Meta’s LLaMA and
the GPT series, have highlighted the transformative impact of the Transformer
architecture in AI. Originally proposed by Google in 2017, Transformers utilize self-attention mechanisms and parallel processing to significantly enhance the efficiency of
natural language tasks. Beyond text, Transformer-based models have extended to
multimodal domains including vision, where Vision Transformers (ViT) have
outperformed traditional CNNs in image recognition.
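The scaled dot-product self-attention at the core of the Transformer can be sketched in a few lines of NumPy; the toy token count and embedding size below are illustrative, not taken from this work:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Each position attends to every other position in parallel."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # pairwise similarities, all at once
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                               # attention-weighted mix of values

# toy sequence: 4 tokens with 8-dimensional embeddings
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8))
out = scaled_dot_product_attention(x, x, x)          # self-attention: Q = K = V = x
print(out.shape)  # (4, 8)
```

Because every pairwise score is computed in one matrix product rather than step by step, the whole sequence is processed in parallel, which is the efficiency gain noted above.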
This study explores the use of Transformer-based models for handwritten Traditional
Chinese character recognition. We evaluate three models—Donut (based on Swin
Transformer), LayoutXLM (Multimodal Transformer), and PaddleOCR V5 (SVTR)—on
synthetic and real handwritten invoice images. We assess their recognition performance and practical viability, and integrate them into a user-friendly system for automatic field extraction with mechanisms for error correction and continual learning.
A major challenge lies in the scarcity of handwritten Traditional Chinese datasets,
which hinders fine-tuning and evaluation. To address this, we propose a ViT-based font
generation model capable of learning and mimicking font styles to generate new
characters, effectively expanding training data and improving recognition accuracy.
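One plausible core of such a font-generation model is a cross-attention step in which content (character) tokens query style tokens extracted from reference glyphs; the following is a minimal sketch under that assumption, with illustrative dimensions (14x14 ViT patches, made-up weight shapes), not the thesis's actual architecture:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(content, style, d_k=16, seed=1):
    """Content tokens attend to style tokens, pulling stroke-style
    information into the content representation."""
    rng = np.random.default_rng(seed)
    d_c, d_s = content.shape[-1], style.shape[-1]
    Wq = rng.standard_normal((d_c, d_k)) / np.sqrt(d_c)  # query projection
    Wk = rng.standard_normal((d_s, d_k)) / np.sqrt(d_s)  # key projection
    Wv = rng.standard_normal((d_s, d_k)) / np.sqrt(d_s)  # value projection
    Q, K, V = content @ Wq, style @ Wk, style @ Wv
    A = softmax(Q @ K.T / np.sqrt(d_k))  # content-to-style attention weights
    return A @ V                         # style-conditioned content features

content_tokens = np.ones((196, 32))  # 14x14 patches of the target character
style_tokens = np.ones((196, 32))    # patches of a reference glyph in the target style
fused = cross_attention(content_tokens, style_tokens)
print(fused.shape)  # (196, 16)
```

A decoder would then map the fused tokens back to glyph pixels; the point of the sketch is only that style and content can be kept in separate encoders and merged through attention.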
Thanks to their ability to model long-range dependencies and complex stroke
relationships, Transformers offer advantages over traditional LSTM-based methods. Our
experiments confirm the potential of Transformer architectures in enhancing handwritten
text recognition and suggest their broader applicability to tasks like object detection and
scene understanding. This work provides both theoretical insight and practical tools for
advancing intelligent visual systems.
A Study on Transformer Applications in Visual Recognition Tasks I
Acknowledgements II
Abstract (Chinese) III
ABSTRACT V
Table of Contents VI
List of Figures VIII
Chapter 1 Introduction 1
1.1 Background and Motivation 1
1.2 Research Objectives 3
1.3 Thesis Organization 4
Chapter 2 Literature Review 5
2.1 Transformer and the Self-Attention Mechanism 5
2.2 Vision Transformer (ViT) 7
2.3 Document Understanding Transformer (Donut) 8
2.4 LayoutLM 9
2.5 PaddleOCR V5 11
2.6 Masked Autoencoder (MAE) 11
2.7 Summary of Literature Review 12
Chapter 3 Research Methods 15
3.1 Construction Pipeline of the Three Recognition Models 16
3.2 Fine-Tuning the Handwriting Recognition Models 17
3.2.1 Fine-Tuning PaddleOCR V5 17
3.2.2 Fine-Tuning Donut and LayoutXLM 18
3.3 Invoice Recognition System 19
3.4 Font Generation Model 20
3.4.1 Model Architecture 21
3.4.2 Image Preprocessing 22
3.4.3 Style Encoder and Feature Encoder 23
3.4.4 Multi-Style Fusion Layer 24
3.4.5 Cross-Attention Mechanism and Fusion Layer 26
3.4.6 Font Decoder 27
Chapter 4 Results 32
4.1 Analysis of Invoice Recognition Model Results 33
4.1.1 Comparison of Output Results 35
4.1.2 Donut Fine-Tuning Performance Comparison 38
4.1.3 Comparison of Donut, LayoutXLM, and PaddleOCR V5 42
4.2 Font Generation Model Results 45
4.2.1 MAE Pre-Training Loss Curves 46
4.2.2 Comparison of Generated Character Results 48
4.3 Dual-Channel Font Generation Model 53
4.3.1 Dual-Channel Training Loss Curves 53
4.3.2 Dual-Channel Model Generation Results 54
Chapter 5 Conclusions and Future Work 57
5.1 Conclusions 57
5.2 Future Work 57
References 58
[1] Touvron, H., et al. "LLaMA: Open and Efficient Foundation Language Models." arXiv:2302.13971, 2023.
[2] Yenduri, G., et al. "GPT (Generative Pre-Trained Transformer)—A Comprehensive Review on Enabling Technologies, Potential Applications, Emerging Challenges, and Future Directions." IEEE Access, vol. 12, pp. 54608-54649, 2024. https://doi.org/10.1109/ACCESS.2024.3389497
[3] Vaswani, A., et al. "Attention Is All You Need." Advances in Neural Information Processing Systems (NeurIPS), 2017.
[4] Dosovitskiy, A., et al. "An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale." arXiv:2010.11929, 2020.
[5] Li, Z., et al. "A Survey of Convolutional Neural Networks: Analysis, Applications, and Prospects." IEEE Transactions on Neural Networks and Learning Systems, vol. 33, pp. 6999-7019, 2022.
[6] Liu, Z., et al. "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows." IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992-10002, 2021.
[7] Kim, G., et al. "OCR-Free Document Understanding Transformer." European Conference on Computer Vision (ECCV), 2022.
[8] Xu, P., et al. "Multimodal Learning With Transformers: A Survey." IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, pp. 12113-12132, 2023.
[9] Xu, Y., et al. "LayoutXLM: Multimodal Pre-training for Multilingual Visually-rich Document Understanding." arXiv:2104.08836, 2021.
[10] Du, Y., et al. "SVTR: Scene Text Recognition with a Single Visual Model." International Joint Conference on Artificial Intelligence (IJCAI), 2022.
[11] Du, Y., et al. "PP-OCR: A Practical Ultra Lightweight OCR System." arXiv:2009.09941, 2020.
[12] Devlin, J., et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." NAACL-HLT, 2019.
[13] Li, M., et al. "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models." arXiv:2109.10282, 2021.
[14] Ba, J., Kiros, J. R., and Hinton, G. E. "Layer Normalization." arXiv:1607.06450, 2016.
[15] Hendrycks, D., and Gimpel, K. "Gaussian Error Linear Units (GELUs)." arXiv:1606.08415, 2016.
[16] Gheini, M., Ren, X., and May, J. "Cross-Attention Is All You Need: Adapting Pretrained Transformers for Machine Translation." arXiv:2104.08771, 2021.
[17] Liu, Z., et al. "Swin Transformer: Hierarchical Vision Transformer Using Shifted Windows." IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992-10002, 2021.
[18] Xu, Y., et al. "LayoutLM: Pre-training of Text and Layout for Document Image Understanding." Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020.
[19] He, K., et al. "Masked Autoencoders Are Scalable Vision Learners." IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15979-15988, 2022.
[20] Das, K., Jiang, J., and Rao, J. N. K. "Mean Squared Error of Empirical Predictor." The Annals of Statistics, vol. 32, pp. 818-840, 2004.
[21] Simonyan, K., and Zisserman, A. "Very Deep Convolutional Networks for Large-Scale Image Recognition." arXiv:1409.1556, 2014.
[22] Liu, Y., and Lian, Z. "FontTransformer: Few-Shot High-Resolution Chinese Glyph Image Synthesis via Stacked Transformers." Pattern Recognition, vol. 141, 109593, 2023. https://doi.org/10.1016/j.patcog.2023.109593
[23] Wang, Z., and Liu, J. "One-Shot Multilingual Font Generation Via ViT." arXiv:2412.11342, 2024.