A Novel Protein Subcellular Localization Method With CNN-XGBoost Model for Alzheimer's Disease

2.2. Tree Boosting and XGBoost

Tree boosting is a learning method to enhance the classification ability of weak classifiers by iteratively adding new decision trees to the ensembles of decision trees. Let $D = {(x_{i}, y_{i})} (| D | = n, x_{i} \in ℝ^{m}, y_{i} \in ℝ^{n})$ denotes a dataset with n classes and m feature. Then the prediction of a tree boosting for a ( x _i , y _i ) is given by

\begin{array}{l} {\hat{y}}_{i} = g_{A} (x_{i}) = \sum_{j = 1}^{M} g_{j} (x_{i}) \end{array}

(6)

where g _j ( x _i ) = w _q ( x _i ) is the prediction of the j -th decision tree with leaf weights w _q on a datapoint x _i , and M is the number of members in the ensemble.

It is well-known that the decision tree tends to overfit when the decision tree is fully grown. Thus, the set prediction function of decision trees g _j can be learned by minimizing the objective function

C (x, g_{A}) = \sum_{i = 1}^{N} l (y_{i}, {\hat{y}}_{i}) + \sum_{j = 1}^{M} Ω (g_{j})

(7)

where l _i ( y _i , ŷ _i ) is a term which measures the goodness of the prediction ŷ _i and the object y _i . Ω( g _j ) is a regularization term that does not depend on the data.

XGBoost implements a parallel tree boosting in a fast and accurate way. In XGBoost, the regularization function is chosen to be

\begin{array}{l} Ω (g) = γ T + \frac{λ}{2} \sum_{l = 1}^{T} w_{l}^{2} \end{array}

(8)

with γ and λ regularization parameters that must be chosen appropriately. Notice this regularization penalizes both large weights on the leaves (similar to L ² -regularization) and has large partitions.

As mentioned above, the tree boosting iteratively enlarges the ensemble of decision trees, then the prediction of the t -th iteration can be defined as

{\hat{y}}_{i}^{(t)} = \sum_{j = 1}^{t} g_{j} (x_{i}) = {\hat{y}}_{i}^{(t - 1)} + g_{t} (x_{i})

(9)

The objective function (7) at step t can be modified as

\begin{array}{l} C_{t} = & \sum_{i = 1}^{N} l (y_{i}, {\hat{y}}_{i}^{(t - 1)} + g_{t} (x_{t})) + Ω (g_{t}) \end{array}

(10)

Apply a Taylor expansion on the objective function (10) to second order and then the final objective function at step t can be approximated as

C_{t} \approx C_{t - 1} + Δ C_{t}

(11)

= C_{t - 1} + b_{i} l (y_{i}, {\hat{y}}_{i}^{(t - 1)}) g_{t} (x_{i}) + \frac{1}{2} a_{i} g_{t} {(x_{i})}^{2} + Ω (g_{t})

(12)

where

a_{i} = \partial_{{\hat{y}}_{i}^{(t - 1)}} l (y_{i}, {\hat{y}}_{i}^{(t - 1)})

(13)

b_{i} = \partial_{{\hat{y}}_{i}^{(t - 1)}}^{2} l (y_{i}, {\hat{y}}_{i}^{(t - 1)})

(14)

Let j : I _j = { i : q _t ( x _i ) = j } denotes the set of point x _i mapped to leaf, $B_{j} = \sum_{i \in I_{j}} b_{i}$ and $A_{j} = \sum_{i \in I_{j}} a_{i}$ . Then we can rewrite the $Δ C_{t}$ as

\begin{array}{rcl} Δ C_{t} = \sum_{j = 1}^{T} [B_{j} w_{j} + \frac{1}{2} (A_{j} + λ_{j}) w_{j}^{2}] + λ T \end{array}

(15)

To find the optimal weight w _j of leaf j for a fixed tree structure, q ( x ) can be obtained by applying the following equation

w_{j}^{o p t} = - \frac{B_{j}}{A_{j} + λ}

(16)

plugging back into $Δ C_{t}$ gives

Δ C_{t}^{o p t} = - \frac{1}{2} \sum_{j = 1}^{K} \frac{B_{j}^{2}}{A_{j} + λ} + γ T

(17)

It is clear that $Δ C_{t}^{o p t}$ measures the in-sample performance of g _t and we should find the decision tree that minimizes this value. However, in practice, this is impossible to enumerate all possible trees over the data and find the tree which can minimize $Δ C_{t}^{o p t}$ . Instead, an approximate greedy algorithm runs to optimize one level of the tree at a time by trying to find optimal splits of the data, leading to a tree with a local minimum of $Δ C_{t}^{o p t}$ , which is then added to the ensemble.

For the multi-label multi-class classification problem, we utilize XGBoost as classifiers and adopt the binary relevance strategy (Boutell et al., 2004 ) to construct m binary classifiers.

2.3. CNN-XGBoost Model

Figure Figure1 1 gives the overall structure of the CNN-XGBoost model for protein subcellular location prediction. The input of the model is a one-dimensional vector and constructed by the position specific scoring matrices (PSSM) and proteins interaction scoring matrix which are extracted from STRING and GO terms semantic similarities. On this basis, a protein can be expressed as L × 1 vector ( L is the number of sequences in training set), analog image data equivalent to a protein is a one-dimensional “image” with 1 channels. So the input is a L × 1 matrix.

After obtaining the proper feature representations by the trained CNN, compared with the classic CNN, our CNN-XGBoost model replaces the soft-max layer of CNN with XGBoost to predict the localization of subcellular of proteins, which enables features automatically obtained from input and provides more precise and efficient classification.

3. Results

3.1. Dataset

To verify the performance of our method, we employ three protein datasets: the Hum-mPloc3.0, the BaCelLo animals, and the Hoglund. Table Table1 1 gives the details of these datasets. The train set of Hum-mPloc 3.0 consists of 3,122 proteins and 1,023 proteins own more than one label. The test set of Hum-mPloc 3.0 consists of 379 proteins, among which 120 proteins belong to multi-label proteins. Each protein in Hum-mPloc 3.0 is assigned at least one label of 12 subcellular locations (Centrosome, Cytoplasm, Cytoskeleton, Endoplasmic reticulum, Endosome, Extracellular, Golgi apparatus, Lysosome, Mitochondrion, Nucleus, Peroxisome, and Plasma membrane).

Table 1

Dataset Summary.

	Hum-mLoc 3.0		BaCelLo		Hoglund
	Training	Testing	Training	Testing	Training	Testing
No. Proteins	3,126	379	2,597	576	5,959	158
No. Labels	4,229	541	2,597	576	5,959	158
No.Locations	12		4		6

Open in a separate window

For the BaCelLo dataset, there are four subcellular locations: Cytoplasm, Mitochondrion, Nucleus, and Secreted. The size of the training set is set to 2,597 and the testing set consists of 576 proteins. All the proteins of BaCelLo dataset are of a single label. In the Hoglund dataset, the training set includes nine subcellular locations (Nucleus, Cytoplasm, Mitochondrion, Endoplasmic reticulum, Golgi apparatus, Peroxisome, Plasma membrane, Extracellular space, Lysosome, and Vacuole), and the test consists of 158 proteins with six subcellular locations (Endoplasmic reticulum, Golgi apparatus, Peroxisome, Plasma membrane, Extracellular space, and Lysosome).

3.2. Measurements

A widely-applied method for evaluating a mutli-label multi-class classifier is to compute the ACC and F1 values. ACC is the average of ACC _{x

_i} of all proteins in the testing set, calculated for protein x _i is

{ACC}_{x_{i}} = \frac{{TP}_{x_{i}}}{{TP}_{x_{i}} + {FP}_{x_{i}} + {FN}_{x_{i}}}

(18)

where TP, FP, and FN are true positive, false positive, and false negative, respectively. The F1 score considers both the harmonic mean of precision and recall of subcellular location y _j , defined as follows:

\begin{array}{rcl} \begin{matrix} {precision}_{y_{j}} & = \frac{\sum_{x_{i} \in P_{j}} \frac{{TP}_{x_{i}}}{{TP}_{x_{i}} + {FP}_{x_{i}}}}{| P_{j} |} \\ {recall}_{y_{j}} & = \frac{\sum_{x_{i} \in T_{j}} \frac{{TP}_{x_{i}}}{{TP}_{x_{i}} + {FN}_{x_{i}}}}{| T_{j} |} \\ {F1}_{y_{j}} & = \frac{2 \times {precision}_{y_{j}} \times {recall}_{y_{j}}}{{recall}_{y_{j}} + {precision}_{y_{j}}} \end{matrix} \end{array}

(19)

where T _j and P _j are the set of proteins for true location y _j and the set of proteins for predicted locations y _j respectively.

3.3. Results and Discussions

To verify the performance of our approach, some typical protein subcellular location tools including Hum-mPLoc 3.0 (Zhou et al., 2016 ), YLoc+ (Briesemeister et al., 2010 ), iLoc-Hum (Chou et al., 2012 ), WegoLoc (Chi and Nam, 2012 ), mLASSO-Hum (Wan et al., 2015 ), and PSL-Recommender (Jamali et al., 2018 ) were compared to our method. The F1 score and ACC for each subcellular localization are summarized in Table Table2 2 and Figure Figure2 2 for Hum-mploc 3.0 dataset. As seen in Table Table2 2 and Figure Figure2, 2 , the CNN-XGBoost outperforms the mean value of F1 score and ACC of all other methods. Also, in 7 out of 12 subcellular locations, CNN-XGBoost has the best performance among all the methods while in the other three locations it has the second best performance. It is only in centrosome and endosome that CNN-XGBoost shows unsatisfactory results. As seen in Table Table3, 3 , the CNN-XGBoost slightly outperforms the second best method by both mean F1 score and ACC.

Table 2

Comparision of CNN-XGBoost on Hum-mPloc 3.0 dataset with other methods.

Location	iLoc-Human			WegoLoc			mLASSO-Hum			Hum-mLoc 3.0			PSL-Recommender			CNN-XGBoost
	pre	re	F1	pre	re	F1	pre	re	F1	pre	re	F1	pre	re	F1	pre	re	F1
Centrosome	0	0	0	0.75	0.14	0.23	0.59	0.59	0.59	0.75	0.55	0.63	0.94	0.75	0.83	0.79	0.50	0.61
Cytoplasm	0.5	0.54	0.52	0.69	0.53	0.60	0.93	0.51	0.66	0.76	0.73	0.74	0.79	0.81	0.80	0.85	0.89	0.87
Cytoskeleton	0	0	0	0.32	0.34	0.33	0.9	0.22	0.35	0.8	0.68	0.74	0.93	0.77	0.84	0.89	0.80	0.85
ER	0	0	0	0.73	0.2	0.31	0.74	0.49	0.59	0.83	0.37	0.51	0.9	0.7	0.79	0.97	0.71	0.82
Endosome	0	0	0	0.25	0.07	0.11	0.38	0.2	0.26	0.58	0.47	0.52	0.57	0.37	0.45	0.80	0.27	0.40
Extracellular	0.62	0.62	0.62	0.67	0.77	0.71	0.16	0.69	0.26	0.5	0.46	0.48	0.66	0.71	0.68	0.80	0.62	0.70
Golgi apparatus	0.6	0.3	0.4	0.6	0.15	0.24	0.72	0.65	0.68	0.69	0.45	0.55	0.88	0.61	0.72	0.80	0.60	0.69
Lysosome	0.5	0.13	0.2	0.2	0.13	0.15	0.55	0.75	0.63	0.71	0.63	0.67	1	0.55	0.71	1.00	0.75	0.86
Mitochondrion	0.95	0.33	0.49	0.79	0.73	0.76	0.83	0.88	0.85	0.78	0.75	0.76	0.92	0.88	0.90	0.96	0.80	0.87
Nucleus	0.54	0.7	0.61	0.65	0.64	0.64	0.85	0.7	0.76	0.75	0.71	0.73	0.81	0.92	0.87	0.83	0.91	0.87
Peroxisome	1	0.5	0.67	0.5	1	0.67	0.29	1	0.44	1	1	1	1	1	1	1	1	1
Plasma membrane	0.42	0.33	0.37	0.44	0.53	0.48	0.58	0.56	0.57	0.65	0.44	0.52	0.78	0.74	0.76	0.89	0.75	0.81
ACC-mean	0.41			0.50			0.65			0.63			0.77			0.78
F1-mean	0.32			0.44			0.56			0.65			0.78			0.80

Open in a separate window

The bold marks the first best result and the underline marks the second best result .

An external file that holds a picture, illustration, etc. Object name is fgene-09-00751-g0002.jpg

Open in a separate window

Figure 2

The accuracy comparison on the Hum-mPloc 3.0 data set.

Table 3

Comparison of CNN-XGBoost ACC/F1-mean on other proteins datasets with other methods.

	BaCelLo	Hoglund
MultiLoc2-LowRes	0.73/0.76	–
MultiLoc2-HighRes	0.68/0.71	0.57/0.41
BaCelLo	0.64/0.66	–
Hum-mPloc 3.0	0.86/0.84	0.64/0.59
PSL-Recommender	0.94/0.92	0.92/0.90
CNN-XGBoost	0.94/0.94	0.94/0.92

Open in a separate window

The bold marks the first best result and the underline marks the second best result .

In addition, we also evaluated our method on the DeepLoc dataset, compared to the DeepLoc, our method provides slightly better prediction with significantly lighter model, meanwhile, it is known that DeepLoc can not handle multilabel multiclass problem, whereas our method still shows outstanding performance.

4. Conclusions

In order to make balance of the classification performance and the complexity when training the model for the protein subcellular location in Alzheimer's disease, this paper proposes a prediction framework integrating CNN and XGBoost, taking advantage of the outstanding ability of feature expression of CNN, and the good classification performance of XGBoost. Experiments are implemented on the Hum-mPloc3.0, the BaCelLo animals, and the Hoglund database, and the results demonstrate that the new method outperforms the typical machine learning based tools. Further work will focus on the verification of our model on more datasets, especially the datasets related to Alzheimer's disease, and the optimization of the structure of CNN utilized in the model.

Author Contributions

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

Conflict of Interest Statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China (NSFC, Grant no. 61305013 and 61872114). The authors would like to thank the reviewers for their valuable comments.

References

Almagro Armenteros J. J., Sønderby C. K., Sønderby S. K., Nielsen H., Winther O. (2017). DeepLoc: prediction of protein subcellular localization using deep learning . Bioinformatics 33 , 3387–3395. 10.1093/bioinformatics/btx431 [ PubMed ] [ CrossRef ] [ Google Scholar ]
Boutell M. R., Luo J., Shen X., Brown C. (2004). M. Learning multi-label scene classification . Pattern Recogn. 37 , 1757–1771. 10.1016/j.patcog.2004.03.009 [ CrossRef ] [ Google Scholar ]
Briesemeister S., Rahnenführer J., Kohlbacher O. (2010). YLoc–an interpretable web server for predicting subcellular localization . Nucleic Acids Res. 38 , W497–W502. 10.1093/nar/gkq477 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
Cai S., Yang S., Zheng F., Lu M., Wu Y., Krishnan S. (2013). Knee joint vibration signal analysis with matching pursuit decomposition and dynamic weighted classifier fusion . Comput. Math. Methods Med. 2013 :904267. 10.1155/2013/904267 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
Cheng L., Hu Y., Sun J., Zhou M., Jiang Q. (2018a). DincRNA: a comprehensive web-based bioinformatics toolkit for exploring disease associations and ncRNA function . Bioinformatics 34 , 1953–1956. 10.1093/bioinformatics/bty002 [ PubMed ] [ CrossRef ] [ Google Scholar ]
Cheng L., Jiang Y., Ju H., Sun J., Peng J., Zhou M., et al.. (2018b). InfAcrOnt: calculating cross-ontology term similarities using information flow by a random walk . BMC Genomics 19 :919. 10.1186/s12864-017-4338-6 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
Cheng L., Jiang Y., Wang Z., Shi H., Sun J., Yang H., et al.. (2016a). DisSim: an online system for exploring significant similar diseases and exhibiting potential therapeutic drugs . Sci. Rep. 6 :30024. 10.1038/srep30024 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
Cheng L., Sun J., Xu W., Dong L., Hu Y., Zhou M. (2016b). OAHG: an integrated resource for annotating human genes with multi-level ontologies . Sci. Rep. 10 :34820 10.1038/srep34820 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
Cheng L., Yang H., Zhao H., Pei X., Shi H., Sun J., et al.. (2017). MetSigDis: a manually curated resource for the metabolic signatures of diseases . Brief. Bioinformatics . 10.1093/bib/bbx103. [Epub ahead of print]. [ PubMed ] [ CrossRef ] [ Google Scholar ]
Chi S.-M., Nam D. (2012). Wegoloc: accurate prediction of protein subcellular localization using weighted gene ontology terms . Bioinformatics 28 , 1028-1030. 10.1093/bioinformatics/bts062 [ PubMed ] [ CrossRef ] [ Google Scholar ]
Chou K.-C., Wu Z.-C., Xiao X. (2012). iloc-hum: using the accumulation-label scale to predict subcellular locations of human proteins with both single and multiple sites . Mol. Biosyst. 8 , 629–641. [ PubMed ] [ Google Scholar ]
Gudenas B. L. (2018). Genomic Data Mining for Functional Annotation of Human Long Noncoding RNAs . Avaialble online at: https://tigerprints.clemson.edu/all_dissertations/2146
Horton P., Park K. J., Obayashi T., Fujita N., Harada H., Adams-Collier C. J., et al.. (2007). WoLF PSORT: protein localization predictor . Nucleic Acids Res. 35 ( Suppl. 2 ), W585–W587. 10.1093/nar/gkm259 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
Hu Y., Cheng L., Zhang Y., Bai W., Wang T., Han Z., et al.. (2017a). Rs4878104 contributes to Alzheimer's disease risk and regulates DAPK1 gene expression . Neurol. Sci. 38 , 1255–1262. 10.1007/s10072-017-2959-9 [ PubMed ] [ CrossRef ] [ Google Scholar ]
Hu Y., Zhao T., Zhang N., Zang T., Zhang J., Cheng L. (2018) Identifying diseases-related metabolites using random walk . BMC Bioinformatics 19 ( Suppl. 5 ):116. 10.1186/s12859-018-2098-1 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
Hu Y., Zheng L., Cheng L., Bai W., Wang T., Han Z., et al.. (2017b). GAB2 rs2373115 variant contributes to Alzheimer's disease risk specifically in European population . J. Neurol. Sci. 375 , 18–22. 10.1016/j.jns.2017.01.030 [ PubMed ] [ CrossRef ] [ Google Scholar ]
Hu Y., Zhou M., Shi H., Ju H., Jiang Q., Cheng L. (2017c). Measuring disease similarity and predicting disease-related ncRNAs by a novel method . BMC Med. Genomics 10 ( Suppl. 5 ):71. 10.1186/s12920-017-0315-9 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
Jamali R., Eslahchi C., Jahangiri-Tazehkand S. (2018). Psl- recommender: protein subcellular localization prediction using recommender system . bioRxiv 462812. 10.1101/462812 [ CrossRef ] [ Google Scholar ]
Krizhevsky A., Sutskever I., Hinton G. E. (2012). Imagenet classification with deep convolutional neural networks , in Advances in Neural Information Processing Systems (Lake Tahoe, NV: ), 1097–1105. [ Google Scholar ]
LeCun Y., Bengio Y., Hinton G. (2015). Deep learning . Nature 521 :436. 10.1038/nature14539 [ PubMed ] [ CrossRef ] [ Google Scholar ]
Liu X., Li Z., Liu J., Liu L., Zeng X. (2015). Implementation of arithmetic operations with time-free spiking neural P systems . IEEE Trans. Nanobiosci. 14 , 617–624. 10.1109/TNB.2015.2438257 [ PubMed ] [ CrossRef ] [ Google Scholar ]
Mikolov T., Sutskever I., Chen K., Corrado G., Dean J. (2013). Distributed representations of words and phrases and their compositionality , in Advances in Neural Information Processing Systems (Lake Tahoe, NV: ), 3111–3119. [ Google Scholar ]
Pierleoni A., Martelli P. L., Fariselli P., Casadio R. (2006). BaCelLo: a balanced subcellular localization predictor . Bioinformatics 22 , e408–e416. 10.1093/bioinformatics/btl222 [ PubMed ] [ CrossRef ] [ Google Scholar ]
Sutskever I., Vinyals O., Le Q. (2014). V. “Sequence to sequence learning with neural networks , in Advances in Neural Information Processing Systems (Montreal, QC: ), 3104–3112. [ Google Scholar ]
Wan S., Mak M.-W., Kung S.-Y. (2015). mlasso-hum: a lasso-based in- terpretable human-protein subcellular localization predictor . J. Theor. Biol. 382 , 223–234. [ PubMed ] [ Google Scholar ]
Wei L., Liao M., Gao X., Wang J., Lin W. (2016). mGOF-loc: a novel ensemble learning method for human protein subcellular localization prediction . Neurocomputing 217 , 73–82. 10.1016/j.neucom.2015.09.137 [ CrossRef ] [ Google Scholar ]
Wu Y., Krishnan S. (2011). Combining least-squares support vector machines for classification of biomedical signals: a case study with knee-joint vibroarthrographic signals . J. Exp. Theor. Artif. Intell. 23 , 63–77. 10.1080/0952813X.2010.506288 [ CrossRef ] [ Google Scholar ]
Wu Y., Luo X., Zheng F., Yang S., Cai S., Ng S. C. (2014). Adaptive linear and normalized combination of radial basis function networks for function approximation and regression . Math. Probl. Eng. 2014 :913897 10.1155/2014/913897 [ CrossRef ] [ Google Scholar ]
Xu Y., Wang Y., Luo J., Zhao W., Zhou X. (2017). Deep learning of the splicing(epi) genetic code reveals a novel candidate mechanism linking histone modifications to ESC fate decision . Nucleic Acids Res. 45 , 12100–12112. 10.1093/nar/gkx870 [ PMC free article ] [ PubMed ] [ CrossRef ] [ Google Scholar ]
Zeng X., Zhang X., Song T., Pan L. (2014). Spiking neural P systems with thresholds . Neural Comput. 26 , 1340–1361. 10.1162/NECO_a_00605 [ PubMed ] [ CrossRef ] [ Google Scholar ]
Zhou H., Yang Y., Shen H.-B. (2016). Hum-mploc 3.0: prediction enhancement of human protein subcellular localization through mod- eling the hidden correlations of gene ontology and functional domain features . Bioinformatics 33 , 843-853. [ PubMed ] [ Google Scholar ]

Articles from Frontiers in Genetics are provided here courtesy of Frontiers Media SA