Key Laboratory of Molecular Biophysics, Ministry of Education, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, P. R. China.
Editorial Board of the Journal of Wuhan Institute of Technology, Wuhan Institute of Technology, Wuhan 430074, P. R. China.
Most natural proteins exhibit poor thermostability, which limits their industrial application. Computer-aided rational design is an efficient, purpose-oriented way to improve protein thermostability, and numerous machine-learning-based methods have been designed to predict the changes in thermostability induced by mutations. However, all of these methods are limited by existing mutation coding schemes, which overlook protein sequence features. Here we propose a method that predicts protein thermostability with convolutional neural networks, based on an in-depth study of thermostability-related protein properties. The method comprises a three-dimensional coding algorithm that incorporates protein mutation information and a multiscale-convolution strategy that extracts features from the neighborhood of each mutation site. On the S1615 and S388 data sets, which are widely used for protein thermostability prediction, the accuracy reached 86.4% and 87%, respectively, and the Matthews correlation coefficient was nearly double that of other methods. Furthermore, a model was constructed to predict the thermostability of Rhizomucor miehei lipase mutants based on the S3661 data set, a single amino acid mutation data set screened from the ProTherm protein thermodynamics database. Compared with the RIF strategy, which combines three algorithms (Rosetta ddg_monomer, I-Mutant 3.0, and FoldX), the proposed method achieved higher accuracy (75.0% vs 66.7%) and also discriminated negative samples better. These results indicate that our prediction method assesses protein thermostability and distinguishes its features more effectively, making it a powerful tool for devising mutations that enhance the thermostability of proteins, particularly enzymes.
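The two metrics reported above, accuracy and the Matthews correlation coefficient (MCC), can both be computed from a binary confusion matrix. The following is a minimal sketch of the standard definitions, not the authors' evaluation code. The label convention (1 for a stabilizing mutation, 0 for a destabilizing one), the function names, and the example labels are all assumptions made for illustration and are not data from the study.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels.

    Assumed convention: 1 = stabilizing mutation, 0 = destabilizing mutation.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def matthews_corrcoef(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)).

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical labels for eight mutations (illustrative only).
    y_true = [1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 0, 0, 0, 0, 0, 0, 1]
    counts = confusion_counts(y_true, y_pred)
    print("accuracy:", accuracy(*counts))
    print("MCC:", matthews_corrcoef(*counts))
```

Because mutation data sets often contain far more destabilizing than stabilizing mutations, a classifier can score a high accuracy while rarely identifying the minority class correctly. MCC uses all four cells of the confusion matrix, which is why it is a useful complement to accuracy in comparisons like the one above.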