![]() |
爱旅游的茄子 · 官宣!体操冠军程然成为金惟他品牌形象大使!_葡萄糖· 9 月前 · |
![]() |
善良的红酒 · NBA2k20 街球日常 约基奇 ...· 9 月前 · |
![]() |
淡定的匕首 · 中国现代奇幻文学为什么很难写出《巫师》和《冰 ...· 10 月前 · |
![]() |
重情义的小马驹 · 一汽是国企吗_太平洋汽车网· 1 年前 · |
![]() |
会搭讪的领结 · 曲根万词班有用吗 ...· 1 年前 · |
import os
import glob
input_folder = "/home/myan/scratch/private/pome_WGS/huoyan_14_variety/"
output_folder = "/home/myan/scratch/private/practice_data/snakemake/20220412/fastp.results/"
SRA,FRR = glob_wildcards(input_folder+"{sra}_{frr}.fq.gz")
rule all:
input:
expand(output_folder+"{sra}.html",sra=SRA),
expand(output_folder+"{sra}.json",sra=SRA)
rule fastp:
input:
read01 = input_folder + "{sra}_1.fq.gz",
read02 = input_folder + "{sra}_2.fq.gz"
output:
read01 = output_folder + "{sra}_1.clean.fq.gz",
read02 = output_folder + "{sra}_2.clean.fq.gz",
html = output_folder + "{sra}.html",
json = output_folder + "{sra}.json"
threads:
shell:
fastp -i {input.read01} -I {input.read02} -o {output.read01} -O {output.read02} --thread {threads} --html {output.html} --json {output.json}
这里rule all的作用还是没有搞明白,看有的文档说是最终保留的文件 ,我这里rule all 只写了了最终的html和json,但是最终的结果里是有过滤后的fastq文件的
还有好多基础知识需要看
路径里的文件夹如果不存在会新建一个文件夹
今天的内容增加了config文件
input_folder:
"/home/myan/scratch/private/practice_data/RNAseq/chrX_data/"
output_folder:
"/home/myan/scratch/private/practice_data/RNAseq/snakemake.rnaseq/"
index_folder:
"/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/indexes/"
index_name:
"chrX_tran"
samples_folder:
"/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/samples/"
gtf_folder:
"/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/genes/chrX.gtf"
config文件主要用来指定文件的存贮路径
snakemake文件的内容
configfile: "config.yaml"
import os
import glob
print(config)
print(config['input_folder'])
print(config['output_folder'])
SRR,FRR = glob_wildcards(config['samples_folder']+"{srr}_chrX_{frr}.fastq.gz")
rule all:
input:
expand(config["output_folder"] + "output.sam/{srr}.sam",srr=SRR),
expand(config['output_folder'] + "output.bam/{srr}.bam",srr=SRR),
expand(config['output_folder'] + "output.gtf/{srr}.gtf",srr=SRR)
rule hisat2:
input:
read01 = config['samples_folder'] + "{srr}_chrX_1.fastq.gz",
read02 = config['samples_folder'] + "{srr}_chrX_2.fastq.gz",
output:
sam = config['output_folder'] + "output.sam/{srr}.sam"
threads:
params:
index_file = config['index_folder']+config['index_name']
shell:
hisat2 -p {threads} --dta -x {params.index_file} -1 {input.read01} -2 {input.read02} -S {output.sam}
rule samtools:
input:
sam = rules.hisat2.output.sam
output:
bam = config['output_folder'] + "output.bam/{srr}.bam"
threads:
shell:
samtools sort -@ {threads} -o {output.bam} {input.sam}
rule stringtie:
input:
gtf = config['gtf_folder'],
bam = rules.samtools.output.bam
output:
gtf = config['output_folder'] + "output.gtf/{srr}.gtf"
params:
l = "{srr}"
threads:
shell:
stringtie -p {threads} -G {input.gtf} -o {output.gtf} -l {params.l} {input.bam}
后面转录本定量的步骤还没有写完,好像还可以把差异表达分析的脚本嵌入进来
未完待续
示例数据用到的是论文
Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie, and Ballgown
中的数据
SRR, = glob_wildcards("output.gtf/"+"{srr}.gtf")
#SRR = ["ERR188401","ERR204916"]
#rule all:
# input:
# expand("output.gtf/"+"{srr}.gtf",srr=SRR),
rule gtflist:
input:
gtffiles = expand("output.gtf/"+"{srr}.gtf",srr=SRR)
output:
output_txt = "MergedList.txt"
with open(output.output_txt,'w') as f:
for gtf in input.gtffiles:
print(gtf,file=f)
最开始没写这个逗号,一直遇到报错
image.png
Building DAG of jobs... MissingInputException in line 3 of /mnt/shared/scratch/myan/private/practice_data/RNAseq/snakemake.rnaseq/gtf_list.py: Missing input files for rule all: output.gtf/['ERR188337', 'ERR188245', 'ERR188428', 'ERR188401', 'ERR204916', 'ERR188383', 'ERR188104', 'ERR188257', 'ERR188273', 'ERR188044', 'ERR188234', 'ERR188454'].gtf
SRR, = glob_wildcards("output.gtf/"+"{srr}.gtf")
#SRR = ["ERR188401","ERR204916"]
rule all:
input:
#expand("output.gtf/"+"{srr}.gtf",srr=SRR),
"MergedList.txt",
rule gtflist:
input:
gtffiles = expand("output.gtf/"+"{srr}.gtf",srr=SRR)
output:
output_txt = "MergedList.txt"
with open(output.output_txt,'w') as f:
for gtf in input.gtffiles:
print(gtf,file=f)
rule stringtieMerge:
input:
refgtf = "/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/genes/chrX.gtf",
gtflist = "MergedList.txt"
output:
gtf = "output.merged.gtf/StringtieMerged.gtf"
"output.merged.gtf/StringtieMerge.logs"
threads:
params:
"-m 200 -c 3"
shell:
stringtie --merge {params} -p {threads} -G {input.refgtf} -o {output.gtf} {input.gtflist}
第二个rule就是不运行
原来是在rule all 代码里少写了 第二个rule的输出文件
正确写法是
SRR, = glob_wildcards("output.gtf/"+"{srr}.gtf")
#SRR = ["ERR188401","ERR204916"]
rule all:
input:
#expand("output.gtf/"+"{srr}.gtf",srr=SRR),
"MergedList.txt",
"output.merged.gtf/StringtieMerged.gtf"
rule gtflist:
input:
gtffiles = expand("output.gtf/"+"{srr}.gtf",srr=SRR)
output:
output_txt = "MergedList.txt"
with open(output.output_txt,'w') as f:
for gtf in input.gtffiles:
print(gtf,file=f)
rule stringtieMerge:
input:
refgtf = "/mnt/shared/scratch/myan/private/practice_data/RNAseq/chrX_data/genes/chrX.gtf",
gtflist = "MergedList.txt"
output:
gtf = "output.merged.gtf/StringtieMerged.gtf"
"output.merged.gtf/StringtieMerge.logs"
threads:
params:
"-m 200 -c 3"
shell:
stringtie --merge {params} -p {threads} -G {input.refgtf} -o {output.gtf} {input.gtflist}
rule all 的作用是啥我还是不懂
rule useRscriptinsnakemake:
input:
ballgown_folder = "ballgown",
pheno = "geuvadis_phenodata.csv"
output:
rdat = "bg_chrX_1.Rdata"
#conda:
#"envs/ballgown.yaml"
script:
"scripts/ballgown_1.R"
尝试嵌入conda的时候遇到报错,暂时不知道是什么原因 我的
ballgown.yaml
文件
name: rnaseq_pra
channels:
- conda-forge
- biocondas
dependencies:
- r=4.0=r40hd8ed1ab_1004
- bioconductor-ballgown=2.22.0=r40hdfd78af_1
我的
ballgown_1.R
文件
library(ballgown)
library(genefilter)
pheno_data = read.csv(snakemake@input[["pheno"]])
bg_chrX = ballgown(dataDir = snakemake@input[["ballgown_folder"]],samplePattern = "ERR",pData = pheno_data)
bg_chrX_filt = subset(bg_chrX,"rowVars(texpr(bg_chrX)) > 1",genomesubset=TRUE)
results_transcripts = stattest(bg_chrX_filt,feature="transcript",covariate="sex",adjustvars=c("population"),getFC=TRUE,meas="FPKM")
save(bg_chrX,results_transcripts,file=snakemake@output[["rdat"]])
这里有一个问题是snakemake流程里怎么样使用已经存在的conda环境,看这个流程的时候 https://github.com/Alipe2021/NLncCirSmk/blob/master/NLncCirSmk.smk
他的写法是
rule Part06_CircRNA_Analysis_06_Bam2Anchors:
input:
bam = OUTPUTDIR + "./CircRNA/05.UnmappedBam/{sample}.unmapped.bam",
bai = OUTPUTDIR + "./CircRNA/05.UnmappedBam/{sample}.unmapped.bai"
output:
fq = OUTPUTDIR + "./CircRNA/06.Bam2Anchors/{sample}/unmapped_anchors.fq.gz"
params:
DIR = OUTPUTDIR + "./CircRNA/06.Bam2Anchors/{sample}"
import os
import subprocess
if not os.path.exists(params.DIR):
os.makedirs(params.DIR)
cmd = ("source activate find_circ_env && unmapped2anchors.py {bam} | "
" gzip > {fq} && conda deactivate").format(bam=input.bam, fq=output.fq)
print(cmd)
subprocess.call(cmd, shell=True)
本文参与 腾讯云自媒体分享计划 ,欢迎热爱写作的你一起参与!