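Graduate student: Yu-Wei Liu (劉又瑋)
Thesis title: Using deep learning to predict antigen binding specificity of T-cell receptors (利用深度學習預測T細胞受體與抗原結合的特異性)
Advisor: Chien-Yu Chen (陳倩瑜)
Degree: Master's
Institution: National Taiwan University
Department: Department of Biomechatronics Engineering
Discipline: Engineering
Field: Mechanical Engineering
Thesis type: Academic thesis
Year of publication: 2022
Graduation academic year: 110 (ROC calendar, i.e. 2021-2022)
Language: Chinese
Number of pages: 38
Keywords: T-cell receptor, major histocompatibility complex class I, peptide
Keywords (English): TCR, TCR-pMHC, MHC-I, peptide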
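Predicting the binding interactions between T cell receptors (TCRs) and complexes of major histocompatibility complex (MHC) molecules and peptides remains a highly challenging computational problem. The challenge stems mainly from three factors: the accuracy of experimental data, the scarcity of such data, and the high complexity of the problem itself. One of the basic unresolved questions in neoantigen and antigen biology is why not all neoantigens or antigens elicit a T cell response. Accurate prediction of the interactions between neoantigens/antigens and TCRs would therefore be critical for research on cancer progression, prognosis, and response to immunotherapy. Meanwhile, many recent natural language processing (NLP) studies have shown that protein sequences can be treated as sentences and amino acids as words, so researchers have begun applying NLP-like techniques to extract useful biological information from protein sequence databases. Several pretrained protein language models have recently been released publicly and have been shown to benefit a variety of downstream prediction tasks. This study therefore builds a prediction model that uses the protein language model ProtBert as its encoder to predict the TCR binding specificity of neoantigens and general T cell antigens presented by MHC class I. The study addresses two prediction problems, MHC-I-peptide binding and TCR-peptide-MHC (pMHC) binding, and compares different encoding methods; the results show that the protein language model improves prediction accuracy on both problems. Finally, the study combines the ProtBert-based model with ensemble learning to further improve its accuracy, with the aim of strengthening downstream applications of TCR-antigen binding specificity prediction.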
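The idea of treating a protein sequence as a sentence and each amino acid as a word can be illustrated with a small preprocessing sketch. It is not taken from the thesis: the function names `to_sentence` and `pair_input` are hypothetical, the example sequences are illustrative strings only, and the conventions shown (space-separated uppercase residues, with the rare residues U, Z, O and B mapped to X) follow the publicly documented input format of ProtBert rather than anything stated in this record.

```python
import re


def to_sentence(sequence: str) -> str:
    """Format a protein sequence as a space-separated 'sentence' of residues.

    Assumption: rare or ambiguous residues (U, Z, O, B) are replaced by X,
    which is the convention described for ProtBert inputs.
    """
    # Remove whitespace and normalise to upper case.
    cleaned = re.sub(r"\s+", "", sequence.upper())
    # Map rare residues to the unknown symbol X.
    cleaned = re.sub(r"[UZOB]", "X", cleaned)
    # Each residue becomes one 'word'.
    return " ".join(cleaned)


def pair_input(peptide: str, receptor: str) -> tuple[str, str]:
    """Hypothetical helper: format a peptide and a TCR segment as two sentences."""
    return to_sentence(peptide), to_sentence(receptor)


if __name__ == "__main__":
    # Illustrative sequences only; no binding relationship is claimed here.
    peptide_sentence, tcr_sentence = pair_input("GILGFVFTL", "cassirssyeqyf")
    print(peptide_sentence)
    print(tcr_sentence)
```

Sentences in this form could then be passed to a pretrained protein language model tokenizer; how the thesis combines the resulting embeddings is not described in this record.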
Predicting the interaction of T cell receptors (TCRs) with complexes of major histocompatibility complex (MHC) molecules and peptides (pMHC) remains challenging. The challenge involves three main issues: the accuracy of the available data, the sparsity of those data, and the complexity of the problem itself. One of the fundamental unanswered questions about neoantigens and antigens is why not all of them elicit T cell responses, even when the peptide may have been presented by MHC on the cell surface. Accurate and comprehensive characterization of the interactions between neoantigens/antigens and TCRs is critical for understanding cancer progression, prognosis, and response to immunotherapy. Meanwhile, many recent natural language processing (NLP) studies have shown that protein sequences can be regarded as sentences and amino acids as words, so NLP techniques can be used to extract biological information from protein sequence databases. Several successful pretrained protein language models are now publicly available. This study therefore developed a prediction model based on the protein language model ProtBert to predict the TCR binding specificity of neoantigens/antigens presented by major histocompatibility complex class I. The results demonstrate that using a protein language model improves prediction accuracy on both problems studied: MHC-peptide binding and TCR-pMHC binding. Moreover, the study integrated ensemble learning to further improve prediction accuracy. The ProtBert-based ensemble model is expected to facilitate immunogenomics studies related to TCR binding in the near future.
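The abstract mentions ensemble learning and the table of contents indicates that AUC is the evaluation metric. The sketch below shows one common form of ensembling, averaging the predicted binding scores of several models, together with a rank-based AUC. The averaging rule is an assumption, since this record does not say how the thesis combines its models, and the function names and toy numbers are hypothetical and are not results from the thesis.

```python
def ensemble_mean(score_lists):
    """Average per-example scores across models (one list of scores per model)."""
    if not score_lists:
        raise ValueError("at least one model is required")
    n_examples = len(score_lists[0])
    if any(len(scores) != n_examples for scores in score_lists):
        raise ValueError("all models must score the same examples")
    return [sum(column) / len(score_lists) for column in zip(*score_lists)]


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives and negatives.

    Equals the probability that a random positive outscores a random negative,
    counting ties as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy data: 1 = binder, 0 = non-binder (made up for illustration).
    labels = [1, 0, 1, 0]
    model_a = [0.9, 0.6, 0.4, 0.2]
    model_b = [0.5, 0.1, 0.7, 0.6]
    combined = ensemble_mean([model_a, model_b])
    print(auc(labels, model_a), auc(labels, model_b), auc(labels, combined))
```

In this toy case the two models make different mistakes, so their average ranks the examples better than either model alone; whether that holds on real data depends on how correlated the models' errors are.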
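Acknowledgements
Abstract (Chinese)
Abstract
Table of Contents
List of Figures
List of Tables
Chapter 1 Research Objectives
Chapter 2 Literature Review
2.1 Binding of TCRs to peptide-MHC complexes
2.2 Natural language processing
2.3 Protein language models
2.4 TCR-pMHC databases
2.4.1 VDJdb
2.4.2 McPAS-TCR
2.5 TCR-pMHC binding prediction tools
2.5.1 NetTCR
2.5.2 ERGO-II
2.5.3 pMTnet
Chapter 3 Methods
3.1 Data
3.1.1 NetMHCpan data collection
3.1.2 pMTnet data collection
3.2 Experimental model
3.3 Experimental workflow
3.3.1 Effect of different protein encoding tools on prediction with NetMHCpan data
3.3.2 Effect of different protein encoding tools on prediction with pMTnet data
3.3.3 Effect of different MHC-I lengths on prediction with pMTnet data
3.3.4 Effect of different padding methods on prediction with pMTnet data
3.3.5 Effect on the test set of removing specific allele groups during training
3.3.6 Evaluation of training results
Chapter 4 Results and Discussion
4.1 Analysis of NetMHCpan data
4.2 Analysis of pMTnet data
4.3 Effect of MHC length on AUC
4.4 Effect of different padding methods on AUC
4.5 Effect of an ensemble of ProtBert-based models on AUC
4.6 Effect on prediction of removing specific allele groups during training
4.7 Effect of adding TCR information on AUC
Chapter 5 Conclusion
Chapter 6 References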
Devlin, J., M.-W. Chang, K. Lee and K. Toutanova (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv:1810.04805.
Elnaggar, A., M. Heinzinger, C. Dallago, G. Rihawi, Y. Wang, L. Jones, T. Gibbs, T. Feher, C. Angerer, M. Steinegger, D. Bhowmik and B. Rost (2020). "ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing." arXiv:2007.06225.
Henikoff, S. and J. G. Henikoff (1992). "Amino acid substitution matrices from protein blocks." Proc Natl Acad Sci U S A 89(22): 10915-10919.
Krogsgaard, M. and M. M. Davis (2005). "How T cells 'see' antigen." Nature Immunology 6(3): 239-245.
Lan, Z., M. Chen, S. Goodman, K. Gimpel, P. Sharma and R. Soricut (2019). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations." arXiv:1909.11942.
Lee, K.-H., Y.-C. Chang, T.-F. Chen, H.-F. Juan, H.-K. Tsai and C.-Y. Chen (2021). "Connecting MHC-I-binding motifs with HLA alleles via deep learning." Communications Biology 4(1): 1194.
Lu, T., Z. Zhang, J. Zhu, Y. Wang, P. Jiang, X. Xiao, C. Bernatchez, J. V. Heymach, D. L. Gibbons, J. Wang, L. Xu, A. Reuben and T. Wang (2021). "Deep learning-based prediction of the T cell receptor–antigen binding specificity." Nature Machine Intelligence 3(10): 864-875.
Montemurro, A., V. Schuster, H. R. Povlsen, A. K. Bentzen, V. Jurtz, W. D. Chronister, A. Crinklaw, S. R. Hadrup, O. Winther, B. Peters, L. E. Jessen and M. Nielsen (2021). "NetTCR-2.0 enables accurate prediction of TCR-peptide binding by using paired TCRα and β sequence data." Communications Biology 4(1): 1060.
Nielsen, M., C. Lundegaard, T. Blicher, K. Lamberth, M. Harndahl, S. Justesen, G. Røder, B. Peters, A. Sette, O. Lund and S. Buus (2007). "NetMHCpan, a Method for Quantitative Predictions of Peptide Binding to Any HLA-A and -B Locus Protein of Known Sequence." PLOS ONE 2(8): e796.
Raffel, C., N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li and P. J. Liu (2019). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer." arXiv:1910.10683.
Rao, R., J. Meier, T. Sercu, S. Ovchinnikov and A. Rives (2020). "Transformer protein language models are unsupervised structure learners." bioRxiv: 2020.12.15.422761.
Reynisson, B., B. Alvarez, S. Paul, B. Peters and M. Nielsen (2020). "NetMHCpan-4.1 and NetMHCIIpan-4.0: improved predictions of MHC antigen presentation by concurrent motif deconvolution and integration of MS MHC eluted ligand data." Nucleic Acids Research 48(W1): W449-W454.
Shugay, M., D. V. Bagaev, I. V. Zvyagin, R. M. Vroomans, J. C. Crawford, G. Dolton, E. A. Komech, A. L. Sycheva, A. E. Koneva, E. S. Egorov, A. V. Eliseev, E. Van Dyk, P. Dash, M. Attaf, C. Rius, K. Ladell, J. E. McLaren, K. K. Matthews, E. B. Clemens, D. C. Douek, F. Luciani, D. van Baarle, K. Kedzierska, C. Kesmir, P. G. Thomas, D. A. Price, A. K. Sewell and D. M. Chudakov (2017). "VDJdb: a curated database of T-cell receptor sequences with known antigen specificity." Nucleic Acids Research 46(D1): D419-D427.
Springer, I., N. Tickotsky and Y. Louzoun (2021). "Contribution of T Cell Receptor Alpha and Beta CDR3, MHC Typing, V and J Genes to Peptide Binding Prediction." Frontiers in Immunology 12.
Tickotsky, N., T. Sagiv, J. Prilusky, E. Shifrut and N. Friedman (2017). "McPAS-TCR: a manually curated catalogue of pathology-associated T cell receptor sequences." Bioinformatics 33(18): 2924-2929.
Vita, R., S. Mahajan, J. A. Overton, S. K. Dhanda, S. Martini, J. R. Cantrell, D. K. Wheeler, A. Sette and B. Peters (2019). "The Immune Epitope Database (IEDB): 2018 update." Nucleic Acids Research 47(D1): D339-D343.