python: 整理Unicycler基因组组装结果

导读

将二代三代Unicycler混合组装结果进行整理:抽提染色体/质粒序列,统计GC含量,统计序列长度,结构。

一、Unicycler结果

Unicycler二代、三代混合组装、矫错、抛光

-rw-rw-r--  1 cheng WST 2933634 1月   7 18:06 001_best_spades_graph.gfa
-rw-rw-r--  1 cheng WST 2909841 1月   7 18:06 002_overlaps_removed.gfa
-rw-rw-r--  1 cheng WST 2920506 1月   7 18:17 003_bridges_applied.gfa
-rw-rw-r--  1 cheng WST 2906329 1月   7 18:17 004_final_clean.gfa
-rw-rw-r--  1 cheng WST 2906330 1月   7 19:32 005_polished.gfa
-rw-rw-r--  1 cheng WST 2945006 1月   7 19:32 assembly.fasta
-rw-rw-r--  1 cheng WST 2906330 1月   7 19:32 assembly.gfa
-rw-rw-r--  1 cheng WST   17815 1月   7 19:32 unicycler.log
  • 查看文件格式
  • head assembly.fasta
    >1 length=1838724 depth=1.00x
    ATTACTTCTTCAGCAAAGTTTTCTTCTTTCTTTTCGATACCTTCGCCAACTTCATAACGAACAAAAGATT
    TTACCTTACCATTCTTAGAAGCAACATATTTTGCAACTGTTAAATCAGGATCCTTAACAAAGTCTTGATC
    ATCCAATGCAACTTCTGCAAAGAACTTGTTCAAGCGACCAGCAACCATTTTTTCAATGATTTTTTCTGGC
    TTACCTTCATTCTTAGCTTCTTCTGTAAGAACTTCCTTTTCATGTGCAACTTCAGCTTCTGGAACTTCAT
    

    二、去换行符

    python & shell:去除fasta文件的换行符

    #!/usr/bin/env python3
    import re, os, sys
    ms, infile, outfile_genome, outfilefold_plasmid, out_table = sys.argv
    with open(infile) as f:
        Dict = {}
        for line in f:
            line = line.strip()
            if line[0] == ">":
                key = line
                Dict[key] = []
            else:
                Dict[key].append(line)
        i = 0
        with open(out_table, 'w') as table:
            with open(outfile_genome, 'w') as genome:
                table.write("\tBase(bp)\tGC%\tStructure\n")
                os.makedirs(outfilefold_plasmid)
                for key, value in Dict.items():
                    if key[0:9] == ">1 length":
                        seq = ''.join(value)   
                        genome.write("{}\n{}\n".format(key, seq))
                        # write table
                        length = len(seq)
                        gc_percent = format((seq.count("G") + seq.count("C"))/length*100, '0.2f')
                        print("\033[32m{}\t{}\033[0m".format(key, length))
                        table.write("Chromosome\t{}\t{}\tChromosome\n".format(length, gc_percent))
                for key, value in Dict.items():
                    if "circular=true" in key:
                        seq = ''.join(value)
                        print("\033[32m{}\t{}\033[0m".format(key, len(seq)))
                        i += 1
                        out_name = outfilefold_plasmid + "/plasmid" + str(i) + ".fasta"
                        with open(out_name, 'w') as plasmid:
                            plasmid.write("{}\n{}\n".format(key, seq))
                            # write table
                            length = len(seq)
                            gc_percent = format((seq.count("G") + seq.count("C"))/length*100, '0.2f')
                            ID = "Plasmid" + str(i)
                            table.write("{}\t{}\t{}\tCircle\n".format(ID, length, gc_percent))
    ├── assembly_enter.fasta
    ├── assembly_genome.fasta
    ├── out_table_html.txt
    ├── out_table.txt
    └── plasmid
        ├── plasmid1.fasta
        ├── plasmid2.fasta
        ├── plasmid3.fasta
        └── plasmid4.fasta
    1 directory, 8 files
    
    cat out_table.txt 
        Base(bp)    GC% Structure
    Chromosome  1838724 33.12   Chromosome