Bedtools 使用

bedtools主要功能

bedtools: flexible tools for genome arithmetic and DNA sequence analysis.
usage:    bedtools <subcommand> [options]
The bedtools sub-commands include:
[ Genome arithmetic ]
    intersect     Find overlapping intervals in various ways.
                  求区域之间的交集，可以用来注释peak，计算reads比对到的基因组区域
                  不同样品的peak之间的peak重叠情况。
    window        Find overlapping intervals within a window around an interval.
    closest       Find the closest, potentially non-overlapping interval.
                  寻找最近但可能不重叠的区域
    coverage      Compute the coverage over defined intervals.
                  计算区域覆盖度
    map           Apply a function to a column for each overlapping interval.
    genomecov     Compute the coverage over an entire genome.
    merge         Combine overlapping/nearby intervals into a single interval.
                  合并重叠或相接的区域
    cluster       Cluster (but don't merge) overlapping/nearby intervals.
    complement    Extract intervals _not_ represented by an interval file.
                  获得互补区域
    subtract      Remove intervals based on overlaps b/w two files.
                  计算区域差集
    slop          Adjust the size of intervals.
                  调整区域大小，如获得转录起始位点上下游3 K的区域
    flank         Create new intervals from the flanks of existing intervals.
    sort          Order the intervals in a file.
                  排序，部分命令需要排序过的bed文件
    random        Generate random intervals in a genome.
                  获得随机区域，作为背景集
    shuffle       Randomly redistrubute intervals in a genome.
                  根据给定的bed文件获得随机区域，作为背景集
    sample        Sample random records from file using reservoir sampling.
    spacing       Report the gap lengths between intervals in a file.
    annotate      Annotate coverage of features from multiple files.
[ Multi-way file comparisons ]
    multiinter    Identifies common intervals among multiple interval files.
    unionbedg     Combines coverage intervals from multiple BEDGRAPH files.
[ Paired-end manipulation ]
    pairtobed     Find pairs that overlap intervals in various ways.
    pairtopair    Find pairs that overlap other pairs in various ways.
[ Format conversion ]
    bamtobed      Convert BAM alignments to BED (& other) formats.
    bedtobam      Convert intervals to BAM records.
    bamtofastq    Convert BAM records to FASTQ records.
    bedpetobam    Convert BEDPE intervals to BAM records.
    bed12tobed6   Breaks BED12 intervals into discrete BED6 intervals.
[ Fasta manipulation ]
    getfasta      Use intervals to extract sequences from a FASTA file.
                  提取给定位置的FASTA序列
    maskfasta     Use intervals to mask sequences from a FASTA file.
    nuc           Profile the nucleotide content of intervals in a FASTA file.
[ BAM focused tools ]
    multicov      Counts coverage from multiple BAMs at specific intervals.
    tag           Tag BAM alignments based on overlaps with interval files.
[ Statistical relationships ]
    jaccard       Calculate the Jaccard statistic b/w two sets of intervals.
                  计算数据集相似性
    reldist       Calculate the distribution of relative distances b/w two files.
    fisher        Calculate Fisher statistic b/w two feature files.
[ Miscellaneous tools ]
    overlap       Computes the amount of overlap from two intervals.
    igv           Create an IGV snapshot batch script.
                  用于生成一个脚本，批量捕获IGV截图
    links         Create a HTML page of links to UCSC locations.
    makewindows   Make interval "windows" across a genome.
                  把给定区域划分成指定大小和间隔的小区间 (bin)
    groupby       Group by common cols. & summarize oth. cols. (~ SQL "groupBy")
                  分组结算，不只可以用于bed文件。
    expand        Replicate lines based on lists of values in columns.
    split         Split a file into multiple files with equal records or base pairs.
安装bedtools
 
（Linux - Conda软件安装方法） 
ct@ehbio:~$ conda install bedtools
获得测试数据集(http://quinlanlab.org/tutorials/bedtools/bedtools.html)
 
ct@ehbio:~$ mkdir bedtools
ct@ehbio:~$ cd bedtools
ct@ehbio:~$ url=https://s3.amazonaws.com/bedtools-tutorials/web
ct@ehbio:~/bedtools$ curl -O ${url}/maurano.dnaseI.tgz
ct@ehbio:~/bedtools$ curl -O ${url}/cpg.bed
ct@ehbio:~/bedtools$ curl -O ${url}/exons.bed
ct@ehbio:~/bedtools$ curl -O ${url}/gwas.bed
ct@ehbio:~/bedtools$ curl -O ${url}/genome.txt
ct@ehbio:~/bedtools$ curl -O ${url}/hesc.chromHmm.bed
交集 (intersect)
 
查看输入文件，bed格式，至少三列，分别是染色体，起始位置(0-based, 
包括)，终止位置 
(1-based，不包括)。第四列一般为区域名字，第五列一般为空，第六列为链的信息。更详细解释见http://www.genome.ucsc.edu/FAQ/FAQformat.html#format1。 
自己做研究CpG岛信息可以从UCSC的Table Browser获得，具体操作见http://blog.genesino.com/2013/05/ucsc-usages/。 
ct@ehbio:~/bedtools$ head -n 3 cpg.bed exons.bed
==> cpg.bed <==
chr1    28735   29810   CpG:_116
chr1    135124  135563  CpG:_30
chr1    327790  328229  CpG:_29
==> exons.bed <==
chr1    11873   12227   NR_046018_exon_0_0_chr1_11874_f 0   +
chr1    12612   12721   NR_046018_exon_1_0_chr1_12613_f 0   +
chr1    13220   14409   NR_046018_exon_2_0_chr1_13221_f 0   +
获得重叠区域(既是外显子，又是CpG岛的区域) 
ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed | head -5
chr1    29320   29370   CpG:_116
chr1    135124  135563  CpG:_30
chr1    327790  328229  CpG:_29
chr1    327790  328229  CpG:_29
chr1    327790  328229  CpG:_29
输出重叠区域对应的原始区域(与外显子存在交集的CpG岛) 
ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wa -wb > | head -5
 chr1 28735 29810 CpG:_116 chr1 29320 29370
 NR_024540_exon_10_0_chr1_29321_r 0 -
 
 chr1 135124 135563 CpG:_30 chr1 134772 139696
 NR_039983_exon_0_0_chr1_134773_r 0 -
 
 chr1 327790 328229 CpG:_29 chr1 324438 328581
 NR_028322_exon_2_0_chr1_324439_f 0 +
 
 chr1 327790 328229 CpG:_29 chr1 324438 328581
 NR_028325_exon_2_0_chr1_324439_f 0 +
 
 chr1 327790 328229 CpG:_29 chr1 327035 328581
 NR_028327_exon_3_0_chr1_327036_f 0 +
  
计算重叠碱基数 
ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -wo | head -10
 chr1 28735 29810 CpG:_116 chr1 29320 29370
 NR_024540_exon_10_0_chr1_29321_r 0 - 50
 
 chr1 135124 135563 CpG:_30 chr1 134772 139696
 NR_039983_exon_0_0_chr1_134773_r 0 - 439
 
 chr1 327790 328229 CpG:_29 chr1 324438 328581
 NR_028322_exon_2_0_chr1_324439_f 0 + 439
 
 chr1 327790 328229 CpG:_29 chr1 324438 328581
 NR_028325_exon_2_0_chr1_324439_f 0 + 439
 
 chr1 327790 328229 CpG:_29 chr1 327035 328581
 NR_028327_exon_3_0_chr1_327036_f 0 + 439
 
 chr1 713984 714547 CpG:_60 chr1 713663 714068
 NR_033908_exon_6_0_chr1_713664_r 0 - 84
 
 chr1 762416 763445 CpG:_115 chr1 761585 762902
 NR_024321_exon_0_0_chr1_761586_r 0 - 486
 
 chr1 762416 763445 CpG:_115 chr1 762970 763155
 NR_015368_exon_0_0_chr1_762971_f 0 + 185
 
 chr1 762416 763445 CpG:_115 chr1 762970 763155
 NR_047519_exon_0_0_chr1_762971_f 0 + 185
 
 chr1 762416 763445 CpG:_115 chr1 762970 763155
 NR_047520_exon_0_0_chr1_762971_f 0 + 185
  
计算第一个(-a)bed区域有多少个重叠的第二个(-b)bed文件中有多少个区域 
ct@ehbio:~/bedtools$ bedtools intersect -a cpg.bed -b exons.bed -c | head
chr1    28735   29810   CpG:_116    1
chr1    135124  135563  CpG:_30 1
chr1    327790  328229  CpG:_29 3
chr1    437151  438164  CpG:_84 0
chr1    533219  534114  CpG:_94 0
chr1    544738  546649  CpG:_171    0
chr1    713984  714547  CpG:_60 1
chr1    762416  763445  CpG:_115    10
chr1    788863  789211  CpG:_28 9
另外还有-v取出不重叠的区域, 
-f限定重叠最小比例，-sorted可以对按sort -k1,1 -k2,2n排序好的文件加速操作。 
同时对多个区域求交集 (可以用于peak的多维注释) 
# -names标注注释来源
# -sorted: 如果使用了这个参数，提供的一定是排序好的bed文件
ct@ehbio:~/bedtools$ bedtools intersect -a exons.bed \
    -b cpg.bed gwas.bed hesc.chromHmm.bed -sorted -wa -wb -names cpg gwas chromhmm \
    | head -10000  | tail -10
 chr1 27632676 27635124
 NM_001276252_exon_15_0_chr1_27632677_chromhmm chr1 27633213
 27635013 5_Strong_Enhancer
 
 chr1 27632676 27635124
 NM_001276252_exon_15_0_chr1_27632677_chromhmm chr1 27635013
 27635413 7_Weak_Enhancer
 
 chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f
 chromhmm chr1 27632613 27632813 6_Weak_Enhancer
 
 chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f
 chromhmm chr1 27632813 27633213 7_Weak_Enhancer
 
 chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f
 chromhmm chr1 27633213 27635013 5_Strong_Enhancer
 
 chr1 27632676 27635124 NM_015023_exon_15_0_chr1_27632677_f
 chromhmm chr1 27635013 27635413 7_Weak_Enhancer
 
 chr1 27648635 27648882 NM_032125_exon_0_0_chr1_27648636_f cpg
 chr1 27648453 27649006 CpG:_63
 
 chr1 27648635 27648882 NM_032125_exon_0_0_chr1_27648636_f
 chromhmm chr1 27648613 27649413 1_Active_Promoter
 
 chr1 27648635 27648882 NR_037576_exon_0_0_chr1_27648636_f cpg
 chr1 27648453 27649006 CpG:_63
 
 chr1 27648635 27648882 NR_037576_exon_0_0_chr1_27648636_f
 chromhmm chr1 27648613 27649413 1_Active_Promoter
  
bedtools merge输入的是按sort -k1,1 -k2,2n排序好的bed文件。 
只需要输入一个排序好的bed文件，默认合并重叠或邻接区域。 
ct@ehbio:~/bedtools$ bedtools merge -i exons.bed | head -n 5
chr1    11873   12227
chr1    12612   12721
chr1    13220   14829
chr1    14969   15038
chr1    15795   15947
合并区域并输出此合并后区域是由几个区域合并来的 
ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -c 1 -o count | head -n 5
chr1    11873   12227   1
chr1    12612   12721   1
chr1    13220   14829   2
chr1    14969   15038   1
chr1    15795   15947   1
合并相距90 nt内的区域，并输出是由哪些区域合并来的 
# -c: 指定对哪些列进行操作
# -o: 与-c对应，表示对指定列进行哪些操作
# 这里的用法是对第一列做计数操作，输出这个区域是由几个区域合并来的
# 对第4列做收集操作，记录合并的区域的名字，并逗号分隔显示出来
ct@ehbio:~/bedtools$ bedtools merge -i exons.bed -d 340 -c 1,4 -o count,collapse | head -4
chr1    11873   12227   1   NR_046018_exon_0_0_chr1_11874_f
chr1    12612   12721   1   NR_046018_exon_1_0_chr1_12613_f
chr1    13220   15038   3   NR_046018_exon_2_0_chr1_13221_f,NR_024540_exon_0_0_chr1_14362_r,NR_024540_exon_1_0_chr1_14970_r
chr1    15795   15947   1   NR_024540_exon_2_0_chr1_15796_r
计算互补区域
 
给定一个全集，再给定一个子集，求另一个子集。比如给定每条染色体长度和外显子区域，求非外显子区域。给定基因区，求非基因区。给定重复序列，求非重复序列等。 
重复序列区域的获取也可以用下面提供的链接：http://blog.genesino.com/2013/05/ucsc-usages/。 
ct@ehbio:~/bedtools$ head genome.txt
chr1    249250621
chr10   135534747
chr11   135006516
chr11_gl000202_random   40103
chr12   133851895
chr13   115169878
chr14   107349540
chr15   102531392
ct@ehbio:~/bedtools$ bedtools complement -i exons.bed -g genome.txt | head -n 5
chr1    0   11873
chr1    12227   12612
chr1    12721   13220
chr1    14829   14969
chr1    15038   15795
基因组覆盖广度和深度
 
计算基因组某个区域是否被覆盖，覆盖深度多少。有下图多种输出格式，也支持RNA-seq数据，计算junction-reads覆盖。 
genome.txt里面的内容就是染色体及对应的长度。 
# 对单行FASTA，可如此计算
# 如果是多行FASTA，则需要累加
ct@ehbio:~/bedtools$ awk 'BEGIN{OFS=FS="\t"}{\
    if($0~/>/) {seq_name=$0;sub(">","",seq_name);} \
    else {print seq_name,length;} }' ../bio/genome.fa | tee ../bio/genome.txt
chr1    60001
chr2    54001
chr3    54001
chr4    60001
ct@ehbio:~/bedtools$ bedtools genomecov -ibam ../bio/map.sortP.bam -bga \
    -g ../bio/genome.txt | head
# 这个warning很有意思，因为BAM中已经有这个信息了，就不需要提供了
*****
*****WARNING: Genome (-g) files are ignored when BAM input is provided.
*****
# bedgraph文件，前3列与bed相同，最后一列表示前3列指定的区域的覆盖度。
chr1    0   11  0
chr1    11  17  1
chr1    17  20  2
chr1    20  31  3
chr1    31  36  4
chr1    36  43  6
chr1    43  44  7
chr1    44  46  8
chr1    46  48  9
chr1    48  54  10
两个思考题： 
  怎么计算有多少基因组区域被测到了？
 
 怎么计算平均测序深度是多少？
  
数据集相似性
 
bedtools jaccard计算的是给定的两个bed文件之间交集区域(intersection)占总区域(union-intersection)的比例(jaccard)和交集的数目(n_intersections)。 
ct@ehbio:~/bedtools$ bedtools jaccard \    -a fHeart-DS16621.hotspot.twopass.fdr0.05.merge.bed \    -b fHeart-DS15839.hotspot.twopass.fdr0.05.merge.bedintersection    union-intersection  jaccard n_intersections81269248    160493950   0.50637 130852
小思考：1. 如何用bedtools其它工具算出这个结果？2. 如果需要比较的文件很多，怎么充分利用计算资源？ 
一个办法是使用for循环, 
双层嵌套。这种用法也很常见，不管是单层还是双层for循环，都有利于简化重复运算。 
ct@ehbio:~/bedtools$ for i in *.merge.bed; do \    for j in *.merge.bed; do \    bedtools jaccard -a $i -b $j | cut -f3 | tail -n +2 | sed "s/^/$i\t$j\t/"; \    done; done >total.similarity
另一个办法是用parallel，不只可以批量，更可以并行。 
root@ehbio:~# yum install parallel.noarch# parallel 后面双引号("")内的内容为希望用parallel执行的命令，# 整体写法与Linux下命令写法一致。# 双引号后面的 三个相邻冒号 (:::)默认用来传递参数的，可多个连写。# 每个三冒号后面的参数会被循环调用，而在命令中的引用则是根据其出现的位置，分别用{1}, {2}# 表示第一个三冒号后的参数，第二个三冒号后的参数。## 这个命令可以替换原文档里面的整合和替换, 相比于原文命令生成多个文件，这里对每个输出结果# 先进行了比对信息的增加，最后结果可以输入一个文件中。#ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR> | cut -f 3 \    | sed 's/^/{1}\t{2}\t/'" ::: `ls *.merge.bed` ::: `ls *.merge.bed`  >totalSimilarity.2# 上面的命令也有个小隐患，并行计算时的输出冲突问题，可以修改为输出到单个文件,再cat到一起ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} | awk 'NR> | cut -f 3 \    | sed 's/^/{1}\t{2}\t/' >{1}.{2}.totalSimilarity_tmp" ::: `ls *.merge.bed` ::: `ls *.merge.bed`ct@ehbio:~/bedtools$ cat *.totalSimilarity_tmp >totalSimilarity.2# 替换掉无关信息ct@ehbio:~/bedtools$ sed -i -e 's/.hotspot.twopass.fdr0.05.merge.bed//' \    -e 's/.hg19//' totalSimilarity.2  
totalSimilarity.2数据表格式如下 (数据是假的)： 
fMusle_leg-DS19115    fMusle_back-DS18454    0.55fMusle_leg-DS19115    fHeart-DS15643    0.4fHeart-DS15643    fHeart-DS16621    0.8fHeart-DS15643    fHeart-DS15839    0.7fHeart-DS16621    fHeart-DS15839    0.7
得到结果后，就可以绘制相关性热图或PCA了。
 
也可以使用下面的命令转换成Wide format矩阵，用高颜值可定制在线绘图工具-第三版绘制。 
# 这里面b和c可以用一个，因为是一个对称阵# 如果2和3列内容不同，此脚本也可用ct@ehbio:~/bedtools$ awk 'BEGIN{OFS=FS="\t"}{a[$1, $2]=$3; b[$1]=1; c[$2]=1;}END\    {printf("ID"); for(i in c) printf("\t%s",  i); \        for (i in b) {printf("%s", i); for(j in c) {printf("\t%s", a[i, j]);} print "";}}
原文档(http://quinlanlab.org/tutorials/bedtools/bedtools.html)的命令，稍微有些复杂，利于学习不同命令的组合。使用时推荐使用上面的命令。 
ct@ehbio:~/bedtools$ parallel "bedtools jaccard -a {1} -b {2} \      | awk 'NR>1' | cut -f 3 \      > {1}.{2}.jaccard" \     ::: `ls *.merge.bed` ::: `ls *.merge.bed`
This command will create a single file containing the pairwise Jaccard 
measurements from all 400 tests. 
find . \    | grep jaccard \    | xargs grep "" \    | sed -e s"/\.\///" \    | perl -pi -e "s/.bed./.bed\t/" \    | perl -pi -e "s/.jaccard:/\t/" \    > pairwise.dnase.txt
A bit of cleanup to use more intelligible names for each of the samples. 
cat pairwise.dnase.txt \| sed -e 's/.hotspot.twopass.fdr0.05.merge.bed//g' \| sed -e 's/.hg19//g' \> pairwise.dnase.shortnames.txt
Now let’s make a 20x20 matrix of the Jaccard statistic. This will allow 
the data to play nicely with R. 
awk 'NF==3' pairwise.dnase.shortnames.txt \| awk '$1 ~ /^f/ && $2 ~ /^f/' \| python make-matrix.py \> dnase.shortnames.distance.matrix
                                    BedTools：强大的生物信息学工具箱
 bedtoolsA powerful toolset for genome arithmetic.项目地址:https://gitcode.com/gh_mirrors/be/bedtools 是一个非常实用的命令行工具集，用于处理基因组数据，如BED、GFF和VCF文件等。
什么是BedTools？
BedTools是由Matthew Arq Will...
                                    bedtools 用法大全
bedtools等工具号称是可以代替普通的生物信息学数据处理工程师的！我这里用一个专题来讲解它的用法，其实它能实现的需求，我们写脚本都是可以做的，而且我强烈建议正在学编程的小朋友模仿它的各种功能来增强自己的脚本功力。
BEDTools是可用于genomic features的比较，相关操作及进行注释的工具。而genomic features通常使用Browser Ext...
                                    bedtools: a powerful toolset for genome arithmeticbedtools工具是用于广泛的基因组学分析任务的一把利器。最广泛使用的工具能够实现基因组算术：即基因组上的集合理论。例如，bedtools允许人们从广泛使用的基因组文件格式（如BAM、BED、GFF/GTF、VCF）的多个文件中交叉、合并、计数、互补和洗牌基因组区间。虽然每个单独的工具被设计用来做一个相对简单的任务（例如，与两个区间文件相交），但通过在UNIX命令行上结合多个bedtools操作可以进行相当
BEDTools是可用于genomic features的比较，相关操作及进行注释的工具。而genomic features通常使用Browser Extensible Data (BED) 或者 General Feature Format (GFF)文件表示，用UCSC Genome Browser进行可视化比较。
2、与BEDTools...
                                    文章目录官方文档下载安装演示版的bed文件 (demo.bed)我们的基因组文件（genome.txt）两侧的运算填充运算下载测试数据提取与genes.gff的间隔相对应的序列获取测试数据用这个间隔文件去分割bam文件实战案例获取数据bedtools intersect从注释文件中，选取启动子找到跟每个exon最近的启动子
https://bedtools.readthedocs.io...
BEDTools是可用于genomic features的比较，相关操作及进行注释的工具。而genomic features通常使用Browser Extensible Data (BED) 或者 General Feature Format (GFF)文件表示，用UC
                                    这些基因中，那些编码蛋白质的序列被称为外显子，而那些没被表达的序列则被称为内含子。5'-UTR：5'-UTR是5'-非翻译区（5' Untranslated Region）的缩写，是成熟mRNA位于编码区上游不被翻译为蛋白质的区域。3'-UTR：3'-UTR是3'-非翻译区（3'-Untranslated Region）的缩写，是成熟mRNA位于编码区下游不被翻译为蛋白质的区域。1.4 起始密码子属于5’UTR+CDS中CDS区域，终止密码子不属于CDS+3’UTR中的CDS区域。3.按照染色体顺序排序。