Identification of nucleosome positioning using a support vector machine method based on comprehensive DNA sequence features

Ying Cui (崔颖) ^1,2, Zelong Xu (徐泽龙) ^2 and Jianzhong Li (李建中) ^1,3,*

Electronic Engineering College, Heilongjiang University, Harbin 150080, P.R. China
School of Bioinformatics Sciences and Technology, Harbin Medical University, Harbin 150081, P.R. China
School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, P.R. China

* Corresponding author: Jianzhong Li, Email: lijzh@hit.edu.cn

Keywords: sequence feature, support vector machine, nucleosome, Z-curve, position weight matrix, Euclidean distance

Introduction

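The nucleosome is the basic structural unit of eukaryotic chromatin. Each nucleosome consists of a core DNA sequence and a linker DNA sequence: the core DNA, also called nucleosomal DNA, is 147 bp of DNA wrapped nearly twice around a histone octamer, and the sequence between two adjacent nucleosomal DNAs is called linker DNA ^{[1-3]}. Nucleosome positioning refers to the position of the DNA double helix relative to the histone octamer, and DNA sequence features have long been regarded as one of the important factors that affect it. Nucleosomes take part in many important biological processes, such as chromatin formation ^{[4]}, antagonism of transcription factors ^{[5]} and repression of gene expression ^{[6]}, and the precise positioning of nucleosomal DNA sequences influences the regulation of gene expression ^{[7]}, DNA replication ^{[8]}, DNA repair ^{[9]} and recombination ^{[10]}. With the rapid development of high-throughput sequencing, high-resolution experimental nucleosome positioning maps have been obtained for several eukaryotes, including yeast, fruit fly and human. Relying entirely on experiments to detect nucleosome positioning still has drawbacks, however: it costs considerable time and money, and it cannot supply data as promptly as some researchers need. Computational models for nucleosome identification and prediction have therefore become a useful complement to experimental work.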

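Algorithms for identifying nucleosome positioning are now an important area of epigenetics research. Many researchers have used methods such as information entropy ^{[11]} and base-pair deflection angle ^{[12]} to represent the sequence features of nucleosomal DNA ^{[13]}, and have then applied pattern recognition or deep learning to identify nucleosome positioning. The accuracy of current methods still needs to improve, and their range of application needs to be extended. The support vector machine (SVM), a supervised learning method ^{[14]}, has particular advantages for small-sample, nonlinear and high-dimensional pattern recognition problems and has been applied successfully in many fields. Based on Z-curve theory and the position weight matrix (PWM), this paper proposes a comprehensive sequence feature model. The Euclidean distance between a sample and this model is used as the feature and fed into an SVM for training and testing to identify nucleosome positioning in yeast, and the method is then extended to other species, including nematode, human and fruit fly.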

1. Experimental data and methods

1.1. Experimental data

1.1.1. Yeast datasets

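Two sets of yeast (S. cerevisiae) nucleosome data were obtained from the literature ^{[15-17]}. The first dataset, denoted S1, contains 5,000 nucleosomal DNA sequences and 5,000 linker DNA sequences; the second, denoted S2, contains 1,880 nucleosomal DNA sequences and 1,740 linker DNA sequences. All sequences in both datasets are 150 bp long. S2 was obtained after removing redundancy with the CD-HIT software ^{[18]} at a threshold of 80%.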

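1.1.2. Nematode, human and fruit fly datasets

Nucleosome datasets for three species, nematode (C. elegans), human (H. sapiens) and fruit fly (D. melanogaster), were obtained from the literature ^{[19]}. The C. elegans dataset contains 2,567 nucleosomal DNA sequences and 2,608 linker DNA sequences; the H. sapiens dataset contains 2,273 nucleosomal DNA sequences and 2,300 linker DNA sequences; and the D. melanogaster dataset contains 2,900 nucleosomal DNA sequences and 2,850 linker DNA sequences. All sequences for the three species are 147 bp long and were processed in the same way as in Section 1.1.1.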

1.2. Experimental methods

1.2.1. Z-curve model of the DNA sequence set

Each DNA sequence is converted into normalized Z-curve coordinates ^{[20]}, as given by Eq. (1).
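As an illustration of this transformation, the following Python sketch uses the standard cumulative Z-curve definition from Z-curve theory. The function name `z_curve` is hypothetical, and the normalization shown (dividing each cumulative coordinate by the position index n) is an assumption; the normalization used in the paper may differ.

```python
from typing import List, Tuple


def z_curve(seq: str) -> List[Tuple[float, float, float]]:
    """Return one (x, y, z) Z-curve point per position of a DNA sequence.

    Coordinates follow the standard cumulative definition. Dividing by the
    position index n is an ASSUMED normalization, chosen for illustration.
    """
    counts = {"A": 0, "C": 0, "G": 0, "T": 0}
    coords = []
    for n, base in enumerate(seq.upper(), start=1):
        if base not in counts:
            raise ValueError(f"unexpected base {base!r} at position {n}")
        counts[base] += 1
        a, c, g, t = counts["A"], counts["C"], counts["G"], counts["T"]
        x = (a + g) - (c + t)  # purine vs. pyrimidine
        y = (a + c) - (g + t)  # amino vs. keto
        z = (a + t) - (c + g)  # weak vs. strong hydrogen bonding
        coords.append((x / n, y / n, z / n))
    return coords


if __name__ == "__main__":
    for point in z_curve("ACGT"):
        print(point)
```

Each sequence of length N yields N three-dimensional points, so sequences of equal length can be compared position by position.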

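Here M_in is an element of the PWM and represents the weight of base i at position n of the DNA sequence, q_in is the frequency with which base i occurs at position n, and M is a 4 × N matrix. S is the similarity weight score: for each candidate DNA sequence, the elements of M that correspond to the base at each position are summed to give the score of that sequence, and the scores of the m sequences are then averaged to give the new score.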
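The scoring procedure described here can be sketched in Python as follows. The weight formula is an assumption: this sketch uses a common log-odds form, ln(q_in / 0.25), with a pseudocount and a uniform background, which may not match the paper's own definition of M_in. The names `build_pwm`, `sequence_score` and `mean_score` are hypothetical.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def build_pwm(seqs: List[str], pseudocount: float = 1.0,
              background: float = 0.25) -> Dict[str, List[float]]:
    """Build a 4 x N weight matrix from equal-length sequences.

    ASSUMPTION: weights are log-odds of the smoothed frequency q_in against
    a uniform background; the original weight formula is not reproduced.
    """
    n_pos = len(seqs[0])
    if any(len(s) != n_pos for s in seqs):
        raise ValueError("all sequences must have the same length")
    total = len(seqs) + 4 * pseudocount
    pwm = {b: [0.0] * n_pos for b in BASES}
    for n in range(n_pos):
        column = [s[n].upper() for s in seqs]
        for b in BASES:
            q = (column.count(b) + pseudocount) / total  # smoothed q_in
            pwm[b][n] = math.log(q / background)
    return pwm


def sequence_score(seq: str, pwm: Dict[str, List[float]]) -> float:
    """Sum the matrix element matching the base at each position."""
    return sum(pwm[b][n] for n, b in enumerate(seq.upper()))


def mean_score(seqs: List[str], pwm: Dict[str, List[float]]) -> float:
    """Average the per-sequence scores over the m sequences to obtain S."""
    return sum(sequence_score(s, pwm) for s in seqs) / len(seqs)


if __name__ == "__main__":
    training = ["ACGT", "ACGA", "TCGT"]
    matrix = build_pwm(training)
    print(mean_score(training, matrix))
```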

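1.2.3. Construction of the comprehensive sequence feature model

The S score computed from the PWM model is multiplied by the three-dimensional coordinate values at the corresponding positions of the Z-curve model to give the new integrated model, as given by Eq. (4).

X_n, Y_n and Z_n denote the three-dimensional coordinate components of the model. Because both the Z-curve model and the PWM model are sequence-based, the resulting model is called the comprehensive sequence feature model (CSeqFM).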
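A minimal Python sketch of this combination step is given below. It rests on two assumptions: S is read as a single scalar (the averaged score), and the Z-curve model of a sequence set is taken as the position-wise mean of the per-sequence coordinates. The function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def mean_coordinates(coord_sets: Sequence[Sequence[Point]]) -> List[Point]:
    """Average per-sequence Z-curve points position by position.

    ASSUMPTION: the set-level Z-curve model is this position-wise mean.
    """
    m = len(coord_sets)
    n_pos = len(coord_sets[0])
    if any(len(c) != n_pos for c in coord_sets):
        raise ValueError("all coordinate lists must have the same length")
    model = []
    for n in range(n_pos):
        x = sum(c[n][0] for c in coord_sets) / m
        y = sum(c[n][1] for c in coord_sets) / m
        z = sum(c[n][2] for c in coord_sets) / m
        model.append((x, y, z))
    return model


def build_cseqfm(z_model: Sequence[Point], s: float) -> List[Point]:
    """Scale every model point by the score S to obtain (X_n, Y_n, Z_n)."""
    return [(s * x, s * y, s * z) for (x, y, z) in z_model]


if __name__ == "__main__":
    z_model = mean_coordinates([[(1.0, 1.0, 1.0)], [(3.0, 1.0, -1.0)]])
    print(build_cseqfm(z_model, 2.0))
```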

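1.2.4. Euclidean distance feature extraction

Any DNA sequence is converted into Z-curve coordinates using Eq. (1), and its Euclidean distance (ED) ^{[23]} to CSeqFM is then calculated. For a DNA sequence of length N, the Euclidean distance between the n-th position of the sequence and the corresponding n-th position of CSeqFM is denoted Ed_n. Calculating this for all N positions gives N distance components Ed_1, Ed_2, …, Ed_N, which together form the Euclidean distance vector ED, as given by Eq. (5) ^{[21]}. ED represents the distance between the DNA sequence and CSeqFM: the smaller the distance, the more similar the sequence is to CSeqFM, and the larger the distance, the less similar.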
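The feature extraction step can be sketched in Python as follows, assuming Ed_n is the ordinary three-dimensional Euclidean distance between the sequence's Z-curve point and the model point at position n. The function name `euclidean_distance_vector` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def euclidean_distance_vector(seq_coords: Sequence[Point],
                              model_coords: Sequence[Point]) -> List[float]:
    """Return [Ed_1, ..., Ed_N], one distance per sequence position."""
    if len(seq_coords) != len(model_coords):
        raise ValueError("sequence and model must have the same length")
    distances = []
    for (x, y, z), (mx, my, mz) in zip(seq_coords, model_coords):
        distances.append(
            math.sqrt((x - mx) ** 2 + (y - my) ** 2 + (z - mz) ** 2)
        )
    return distances


if __name__ == "__main__":
    sequence_points = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    model_points = [(1.0, 2.0, 2.0), (1.0, 1.0, 1.0)]
    print(euclidean_distance_vector(sequence_points, model_points))
```

The resulting vector has one entry per position, so a 150 bp sequence gives a 150-dimensional feature vector for the classifier.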

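1.2.5. Support vector machine

The Euclidean distance vector ED is used as the feature and fed into the SVM for training and testing, as follows: ① the distance between each nucleosomal DNA sequence and CSeqFM is calculated to give a set of Euclidean distance vectors, denoted the positive set; ② the distance between each non-nucleosomal (linker) DNA sequence and CSeqFM is calculated to give a set of Euclidean distance vectors, denoted the negative set; ③ the SVM is evaluated by 10-fold cross-validation: the positive and negative sets are randomly divided into 10 parts, nine parts of the distance feature samples are used for training and the remaining part for testing, and this is repeated 10 times until every part has been used for testing, with the mean of the 10 runs taken as the result for that random partition; ④ the random partition is repeated 10 times. The experiments were programmed in R using the package "e1071".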
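The cross-validation protocol in steps ③ and ④ can be sketched as below. This is an illustration in Python rather than the R/e1071 code used in the paper; the classifier is passed in as hypothetical `fit` and `predict` callables, and splitting each class evenly across folds (stratification) is an assumption about how the random partition was made.

```python
import random
from typing import Any, Callable, List, Sequence


def stratified_folds(labels: Sequence[int], k: int = 10,
                     seed: int = 0) -> List[List[int]]:
    """Randomly split sample indices into k folds, class by class."""
    rng = random.Random(seed)
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for j, i in enumerate(idx):
            folds[j % k].append(i)
    if any(not fold for fold in folds):
        raise ValueError("too few samples for the requested number of folds")
    return folds


def cross_validate(features: Sequence[Any], labels: Sequence[int],
                   fit: Callable, predict: Callable,
                   k: int = 10, seed: int = 0) -> float:
    """Train on k-1 folds, test on the held-out fold, return mean accuracy."""
    accuracies = []
    for test_idx in stratified_folds(labels, k, seed):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]
        model = fit([features[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        preds = predict(model, [features[i] for i in test_idx])
        correct = sum(p == labels[i] for p, i in zip(preds, test_idx))
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / len(accuracies)


def repeated_cross_validation(features, labels, fit, predict,
                              repeats: int = 10, k: int = 10) -> List[float]:
    """Repeat the random partition `repeats` times with different seeds."""
    return [cross_validate(features, labels, fit, predict, k, seed=r)
            for r in range(repeats)]
```

Any classifier with a train and predict step can be plugged in; with an SVM library, `fit` would wrap model training on the distance vectors and `predict` would return the predicted class for each held-out vector.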

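1.3. Evaluation metrics

Sensitivity (Sn), specificity (Sp), accuracy (Acc), the Matthews correlation coefficient (MCC) and the area under the receiver operating characteristic (ROC) curve (AUC) are used as evaluation parameters. The closer the ROC curve lies to the left and upper edges of the plot (that is, the larger the area between the curve and the horizontal axis), the higher the accuracy; an AUC value closer to 1 indicates better identification performance. The first four metrics are commonly used in statistical prediction theory to assess the performance of a prediction system from different angles ^{[16,19]}, as given by Eq. (6). The terms of Eq. (6) include the number of nucleosomal DNA sequences incorrectly identified as linker DNA sequences (the false negatives).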
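For reference, the four threshold-based metrics can be computed from confusion-matrix counts using their standard definitions, as in the Python sketch below. Nucleosomal sequences are treated as the positive class; the function name `evaluate` and the example counts are hypothetical.

```python
import math
from typing import Dict


def evaluate(tp: int, fn: int, tn: int, fp: int) -> Dict[str, float]:
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts.

    tp: nucleosomal sequences identified as nucleosomal
    fn: nucleosomal sequences identified as linker
    tn: linker sequences identified as linker
    fp: linker sequences identified as nucleosomal
    """
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is undefined when a row or column of the matrix is empty;
    # returning 0.0 in that case is a common convention.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


if __name__ == "__main__":
    # Hypothetical counts, not taken from the paper's results.
    print(evaluate(tp=90, fn=10, tn=80, fp=20))
```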

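2. Results and discussion

2.1. Yeast experiments

The results of using CSeqFM with the SVM method to identify nucleosome positioning in S. cerevisiae are shown in Table 1 and Figure 1. For dataset S1, Sn, Sp, Acc and MCC were 97.1%, 96.9%, 94.2% and 0.89, respectively, indicating that the method performs stably and well. Compared with Wu's Z-curve-based model ^{[24]}, all four metrics are higher for CSeqFM. The box plot of AUC distributions shows that the AUC distribution of CSeqFM lies well above that of Wu's model, and the ROC curves give an AUC of 0.9801 for CSeqFM against 0.9382 for Wu's model, indicating that CSeqFM identifies nucleosomes better.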

Table 1

Results of identifying nucleosome positioning in the two S. cerevisiae datasets

Dataset	Model	Sn	Sp	Acc	MCC
S1	CSeqFM	97.1%	96.9%	94.2%	0.89
S1	Wu's model	88.2%	88.2%	88.3%	0.77
S2	CSeqFM	92.4%	93.9%	93.1%	0.86
S2	Wu's model	88.7%	89.1%	88.9%	0.77

Figure 1

Four performance metrics, AUC distribution and ROC curves for S. cerevisiae dataset S1

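To test performance further, CSeqFM was applied to S. cerevisiae dataset S2. As shown in Table 1, Sn, Sp, Acc and MCC were 92.4%, 93.9%, 93.1% and 0.86, respectively, all higher than the results of Wu's model, again indicating that CSeqFM identifies nucleosomes well.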

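2.2. Nematode, human and fruit fly experiments

CSeqFM was extended to nucleosome positioning identification in other species, namely C. elegans, H. sapiens and D. melanogaster. The results are compared with those of the iNuc-STNC ^{[16]} and iNuc-PseKNC ^{[19]} methods in Table 2 and Figure 2.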

Table 2

Comparison of experimental results between CSeqFM and other methods

Species	Method	Sn	Sp	Acc	MCC	AUC
C. elegans	iNuc-STNC	91.6%	86.7%	88.6%	0.77	−
	iNuc-PseKNC	90.3%	83.6%	86.9%	0.74	0.9350
	CSeqFM	81.4%	86.8%	83.9%	0.68	0.9052
H. sapiens	iNuc-STNC	89.3%	85.9%	87.6%	0.75	−
	iNuc-PseKNC	87.9%	84.7%	86.3%	0.73	0.9250
	CSeqFM	90.1%	80.5%	84.6%	0.70	0.9087
D. melanogaster	iNuc-STNC	79.8%	83.6%	81.7%	0.63	−
	iNuc-PseKNC	78.3%	81.7%	80.0%	0.60	0.8740
	CSeqFM	79.9%	92.3%	84.8%	0.71	0.9019

Figure 2

Experimental results for C. elegans, H. sapiens and D. melanogaster

First, compared with iNuc-STNC: for Sn, CSeqFM is higher in H. sapiens and D. melanogaster; for Sp, Acc and MCC, CSeqFM is higher in D. melanogaster; for AUC, iNuc-STNC reports no values, whereas CSeqFM exceeds 0.90 in all three species. As Figure 2 shows, CSeqFM and iNuc-STNC each have advantages across the three species, and their overall performance is broadly similar.

Second, compared with iNuc-PseKNC: for Sn, CSeqFM is higher in H. sapiens and D. melanogaster; for Sp, CSeqFM is higher in C. elegans and D. melanogaster; for Acc, MCC and AUC, CSeqFM is higher in D. melanogaster. The two methods each have advantages in C. elegans and H. sapiens, but in D. melanogaster all five metrics are higher for CSeqFM, which therefore identifies nucleosomes better in that species.

In addition, neither iNuc-STNC nor iNuc-PseKNC reports results for S. cerevisiae, whereas CSeqFM achieved good identification results on both S. cerevisiae datasets (Table 1 and Figure 1).

In summary, compared with iNuc-STNC and iNuc-PseKNC, CSeqFM performs well on the metrics for C. elegans, H. sapiens and D. melanogaster and gives stable identification results. This indicates that the method generalizes reliably across species and further confirms its good identification performance.

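3. Conclusion

This paper proposes CSeqFM, a nucleosome positioning model based on comprehensive sequence features, which is trained and tested with an SVM. In S. cerevisiae, CSeqFM achieved good Sn, Sp, Acc and MCC values and an AUC of 0.9801, outperforming Wu's model on every metric, which indicates good identification performance in this species. When CSeqFM was extended to C. elegans, H. sapiens and D. melanogaster, its performance metrics were good and the AUC exceeded 0.90 in all three species. Compared with iNuc-STNC and iNuc-PseKNC, CSeqFM showed a particular advantage in D. melanogaster, which further supports the reliability and effectiveness of the method. A possible reason is that CSeqFM is a comprehensive sequence feature model that integrates the sequence features captured by the Z-curve model in the horizontal direction with those captured by the PWM in the vertical direction, and so represents the sequence features of nucleosomes more fully. CSeqFM could also be used to classify and identify DNA sequences or functional elements in other biological data. In summary, CSeqFM offers good identification performance and generalizability, and should help advance research on the DNA sequence features and functions of nucleosome positioning.

Conflict of interest statement: All authors declare that they have no conflict of interest.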

Funding Statement

National Natural Science Foundation of China (61832003)

References

1. Maskell D P, Renault L, Serrao E, et al. Structural basis for retroviral integration into nucleosomes. Nature. 2015; 523(7560):366–369. doi: 10.1038/nature14495.

2. Taberlay P C, Statham A L, Kelly T K, et al. Reconfiguration of nucleosome-depleted regions at distal regulatory elements accompanies DNA methylation of enhancers and insulators in cancer. Genome Res. 2014; 24(9):1421. doi: 10.1101/gr.163485.113.

3. Cole H A, Cui F, Ocampo J, et al. Novel nucleosomal particles containing core histones and linker DNA but no histone H1. Nucleic Acids Res. 2016; 44(2):573–581. doi: 10.1093/nar/gkv943.

4. Buckwalter J M, Norouzi D, Harutyunyan A, et al. Regulation of chromatin folding by conformational variations of nucleosome linker DNA. Nucleic Acids Res. 2017; 45(16):9372. doi: 10.1093/nar/gkx562.

5. Murugan R. Theory of site-specific DNA-protein interactions in the presence of nucleosome roadblocks. Biophys J. 2018; 114(11):2516. doi: 10.1016/j.bpj.2018.04.039.

6. Nocetti N, Whitehouse I, et al. Nucleosome repositioning underlies dynamic gene expression. Genes Dev. 2016; 30(6):660. doi: 10.1101/gad.274910.115.

7. Bai L, Morozov A V. Gene regulation by nucleosome positioning. Trends in Genetics. 2010; 26(11):476–483. doi: 10.1016/j.tig.2010.08.003.

8. Eaton M L, Kyriaki G, Sukhyun K, et al. Conserved nucleosome positioning defines replication origins. Genes Dev. 2010; 24(8):748–753. doi: 10.1101/gad.1913210.

9. Hua Y, Epps J, Williams R, et al. Evidence that localized variation in primate sequence divergence arises from an influence of nucleosome placement on DNA repair. Mol Biol Evol. 2010; 27(3):637–649. doi: 10.1093/molbev/msp253.

10. Bevington S, Boyes J. Transcription-coupled eviction of histones H2A/H2B governs V(D)J recombination. EMBO J. 2013; 32(10):1381–1392. doi: 10.1038/emboj.2013.42.

11. Xing Y Q, Liu G Q, Zhao X J, et al. An analysis and prediction of nucleosome positioning based on information content. Chromosome Res. 2013; 21(1):63–74. doi: 10.1007/s10577-013-9338-z.

12. Lieleg C, Krietenstein N, Walker M, et al. Nucleosome positioning in yeasts: methods, maps, and mechanisms. Chromosoma. 2015; 124(2):131–151. doi: 10.1007/s00412-014-0501-x.

13. Zhang J, Peng W, Wang L, et al. LeNup: Learning nucleosome positioning from DNA sequences with improved convolutional neural networks. Bioinformatics. 2018; 34(10):1705–1712. doi: 10.1093/bioinformatics/bty003.

14. Huang Xiaolin, Mehrkanoon S, Suykens J A K. Support vector machines with piecewise linear feature mapping. Neurocomputing. 2013; 117:118–127. doi: 10.1016/j.neucom.2013.01.023.

15. Lee W, Tillo D, Bray N, et al. A high-resolution atlas of nucleosome occupancy in yeast. Nat Genet. 2007; 39(10):1235–1244.

16. Tahir M, Hayat M. iNuc-STNC: a sequence-based predictor for identification of nucleosome positioning in genomes by extending the concept of SAAC and Chou's PseAAC. Mol Biosyst. 2016; 12(8):2587–2593. doi: 10.1039/C6MB00221H.

17. Chen W, Feng P, Ding H, et al. Using deformation energy to analyze nucleosome positioning in genomes. Genomics. 2016; 107:69–75. doi: 10.1016/j.ygeno.2015.12.005.

18. Fu Limin, Niu Beifang, Zhu Zhengwei, et al. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012; 28(23):3150–3152. doi: 10.1093/bioinformatics/bts565.

19. Guo Shouhui, Deng Enze, Xu Liqin, et al. iNuc-PseKNC: a sequence-based predictor for predicting nucleosome positioning in genomes with pseudo k-tuple nucleotide composition. Bioinformatics. 2014; 30(11):1522–1529. doi: 10.1093/bioinformatics/btu083.

20. Zhang R, Zhang C T. A brief review: The Z-curve theory and its application in genome analysis. Curr Genomics. 2014; 15(2):78–94. doi: 10.2174/1389202915999140328162433.

21. Cui Ying. Identification of transcription factor binding sites based on Z-curve theory (in Chinese). Changchun: Northeast Normal University, 2008.

22. Sui Pinpin, Xing Xudong, Wang Hong, et al. Nucleosome identification and functional analysis based on position weight matrix (in Chinese). Chinese Journal of Bioinformatics (生物信息学). 2016; 14(1):1–6. doi: 10.3969/j.issn.1672-5565.2016.01.01.

23. Alencar J, Bonates T, Lavor C, et al. An algorithm for realizing Euclidean distance matrices. Electronic Notes in Discrete Mathematics. 2015; 50:397–402. doi: 10.1016/j.endm.2015.07.066.

24. Wu X, Liu H, Liu H, et al. Z curve theory-based analysis of the dynamic nature of nucleosome positioning in Saccharomyces cerevisiae. Gene. 2013; 530(1):8–18. doi: 10.1016/j.gene.2013.08.018.

Articles from Sheng Wu Yi Xue Gong Cheng Xue Za Zhi = Journal of Biomedical Engineering are provided here courtesy of West China Hospital of Sichuan University
