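I have recently been working with SNP data produced by samtools, freebayes and GATK. The result files are all VCF, so I wrote my own classes to handle them, but they were never quite complete. Haibao recommended this module, and since he recommends it there is no point clinging to my own poor code. I will treat what I wrote before as practice.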

Installation:

sudo pip install pyvcf

It then reported that the counter module was missing, so:

sudo pip install counter

After that the installation succeeded.

Basic usage:

import vcf
myvcf = vcf.Reader(open('testpyvcf', 'r'))   # like Python's built-in file objects, the reader does not start over once the loop has finished
for i in myvcf:
    print(i)

Record(CHROM=Chr1, POS=11553, REF=G, ALT=[C])
Record(CHROM=Chr1, POS=12840, REF=T, ALT=[G])
Record(CHROM=Chr1, POS=16188, REF=GAAAAAAAG, ALT=[GAAAAAAAAG])
Record(CHROM=Chr1, POS=18915, REF=CAAAAAAG, ALT=[CAAAAAAAG])
Record(CHROM=Chr1, POS=19439, REF=CTTTTTTTTTA, ALT=[CTTTTTTTTTTTA])
Record(CHROM=Chr1, POS=24810, REF=ATTTTTTTTTC, ALT=[ATTTTTTTTTTC])
Record(CHROM=Chr1, POS=26067, REF=CAAAAAAG, ALT=[CAAAAAAAG])
Record(CHROM=Chr1, POS=26996, REF=CAAAAAAAAT, ALT=[CAAAAAAAAAAT])
Record(CHROM=Chr1, POS=27142, REF=C, ALT=[G])
Record(CHROM=Chr1, POS=27698, REF=CTTTTTTTTC, ALT=[CTTTTTTTTTC])
Record(CHROM=Chr1, POS=30645, REF=A, ALT=[C])
Record(CHROM=Chr1, POS=31478, REF=C, ALT=[T])
Record(CHROM=Chr1, POS=33667, REF=A, ALT=[G])
Record(CHROM=Chr1, POS=34057, REF=C, ALT=[T])
Record(CHROM=Chr1, POS=34339, REF=TAAAAAAAAAC, ALT=[TAAAAAAAAAAC])
Record(CHROM=Chr1, POS=35604, REF=T, ALT=[G])
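
The remark that the reader does not start over is ordinary Python iterator behaviour. Here is a minimal sketch of it, using an in-memory file as a stand-in for the VCF. The sample text is made up for illustration and PyVCF is not involved:

```python
import io

# Hypothetical stand-in for an open VCF file: three data lines in memory.
handle = io.StringIO("Chr1\t11553\nChr1\t12840\nChr1\t16188\n")

first_pass = [line.split("\t")[1].strip() for line in handle]
second_pass = [line for line in handle]   # the handle is already at end of file

print(first_pass)    # ['11553', '12840', '16188']
print(second_pass)   # []

# To go over the data again, either keep the items from the first pass in a
# list (as first_pass does) or rewind / reopen the underlying file.
handle.seek(0)
third_pass = [line for line in handle]
print(len(third_pass))   # 3
```

With a vcf.Reader the same options should apply: build a new Reader on a freshly opened file, or keep the records in a list if the file is small enough to fit in memory.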

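Of course, you can take just the fields you need:

print(i.CHROM)   # the CHROM column of the VCF

print(i.POS)     # the POS column; returned as an int

print(i.ID)      # the ID column; None if there is no ID

print(i.REF)     # the REF column; returned as a string

print(i.ALT)     # returned as a list, not a string, because a site can have more than one ALT allele (for example REF A with ALT T,G). It is always a list, even when there is only one allele, and its elements are not strings but instances of the class vcf.model._Substitution. Use i.ALT[0].sequence to get the corresponding string.

print(i.QUAL)    # the QUAL column; returned as a float

print(i.FILTER)  # the FILTER column; None if there is no value

print(i.INFO)    # the INFO column; returned as a dict. Note that some of the values are lists, some are strings, some are ints and some are floats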

eg:

{'SAP': [5.1817700000000002], 'EPP': [5.1817700000000002], 'SRR': 5, 'DPB': 25.0, 'MQMR': 50.0, 'DP': 25, 'PAO': [0.0], 'RPP': [11.696199999999999], 'PAIRED': [0.75], 'ODDS': 3.4715400000000001, 'MEANALT': [1.0], 'MQM': [50.0], 'SAF': [3], 'PAIREDR': 0.90476199999999996, 'EPPR': 11.385999999999999, 'SAR': [1], 'NS': 3, 'RO': 21, 'AC': [2], 'AB': [0.20000000000000001], 'SRF': 16, 'AF': [0.33333299999999999], 'GTI': 0, 'AO': [4], 'AN': 6, 'ABP': [18.6449], 'SRP': 15.5221, 'DPRA': [2.0], 'RPPR': 15.5221, 'PQA': [0.0], 'QR': 656, 'RUN': [1], 'CIGAR': ['1X'], 'LEN': [1], 'NUMALT': 1, 'QA': [126], 'PQR': 0.0, 'TYPE': ['snp'], 'PRO': 0.0}

Values can be looked up by key, for example:

print(i.INFO['TYPE'])   # returns a list

print(i.INFO['DP'])     # returns an int
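
The mix of lists and scalars in INFO appears to follow the header declarations: keys declared with a single value come back as scalars, the rest as lists. Below is a minimal stdlib-only sketch of that idea and of the column types listed above. It is not PyVCF's own code; `parse_value`, `parse_info`, `parse_line`, the `single_valued` argument and the example line are all made up for illustration, and ALT is left as plain strings rather than PyVCF's allele objects.

```python
def parse_value(text):
    """Convert one INFO value to int or float when possible, else keep the string."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text


def parse_info(info_text, single_valued):
    """Turn 'DP=25;AF=0.2,0.1' into a dict.

    Keys in single_valued (assumed to be declared with Number=1 in the header)
    become scalars; every other key becomes a list.
    """
    info = {}
    for item in info_text.split(";"):
        if "=" not in item:          # flag entries carry no value
            info[item] = True
            continue
        key, raw = item.split("=", 1)
        values = [parse_value(v) for v in raw.split(",")]
        info[key] = values[0] if key in single_valued else values
    return info


def parse_line(line, single_valued):
    """Split one VCF data line into typed fixed columns (first eight columns only)."""
    chrom, pos, ident, ref, alt, qual, filt, info = line.rstrip("\n").split("\t")[:8]
    return {
        "CHROM": chrom,
        "POS": int(pos),
        "ID": None if ident == "." else ident,
        "REF": ref,
        "ALT": alt.split(","),                       # always a list, even for one allele
        "QUAL": None if qual == "." else float(qual),
        "FILTER": None if filt == "." else filt.split(";"),
        "INFO": parse_info(info, single_valued),
    }


# Made-up example line, loosely modelled on the freebayes output above.
line = "Chr1\t11553\t.\tG\tC,T\t59.5\t.\tDP=25;AF=0.2,0.1;TYPE=snp,snp"
record = parse_line(line, single_valued={"DP"})
print(record["ALT"])           # ['C', 'T']
print(record["INFO"]["DP"])    # 25
print(record["INFO"]["AF"])    # [0.2, 0.1]
```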

i.FORMAT  # the FORMAT column, returned as a string; None if your VCF file has no FORMAT column

eg:

print('%s %s' % (i.FORMAT, type(i.FORMAT)))

GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>

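i.samples and i.genotype   # all the other attributes are upper case; these two are lower case

Neither of them corresponds to a single column.

eg:

print(i.samples)  # my VCF file has three samples, named 12, CS48 and F12; this returns a list of three Call objects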

[Call(sample=12, CallData(GT=0/1, DP=12, RO=10, QR=300, AO=2, QA=64, GL=[-4.2871800000000002, 0.0, -10.0])), Call(sample=CS48, CallData(GT=0/1, DP=8, RO=6, QR=199, AO=2, QA=62, GL=[-4.9289199999999997, 0.0, -10.0])), Call(sample=F12, CallData(GT=0/0, DP=5, RO=5, QR=157, AO=0, QA=0, GL=[0.0, -1.50515, -10.0]))]

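How do you get at the information for each sample?

for i in myvcf:             # each i is a record
    for j in i.samples:     # each j is one sample
        print(j['GT'])      # the genotype of that sample; j['DP'], j['RO'], j['AO'] ... work the same way

1/1   # genotype of the first sample. GT is returned as a string; if the type is "others", AO is returned as a list of ints (ins,ins is treated as type ins); RO and DP are returned as ints
0/1    # genotype of the second sample
0/1    # genotype of the third sample

If a sample has no data, None is returned.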
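
The per-sample lookup can be pictured as pairing the keys in the FORMAT column with the colon-separated values in each sample column. Below is a minimal stdlib-only sketch of that pairing. `parse_samples` and the example values are made up for illustration, and the values are left as strings here, whereas PyVCF converts them to ints and lists as described above.

```python
def parse_samples(format_text, sample_names, sample_columns):
    """Pair each sample column with the FORMAT keys; '.' becomes None."""
    keys = format_text.split(":")
    calls = {}
    for name, column in zip(sample_names, sample_columns):
        values = column.split(":")
        # Trailing fields may be omitted, and a bare '.' means no data at all.
        values += ["."] * (len(keys) - len(values))
        calls[name] = {k: (None if v == "." else v) for k, v in zip(keys, values)}
    return calls


calls = parse_samples(
    "GT:DP:RO:AO",
    ["12", "CS48", "F12"],
    ["0/1:12:10:2", "0/1:8:6:2", "."],   # the third sample has no data
)
for name in ["12", "CS48", "F12"]:
    print("%s %s" % (name, calls[name]["GT"]))
# 12 0/1
# CS48 0/1
# F12 None
```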

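Using i.genotype:

eg:

for i in myvcf:
    print(i.genotype('12')['GT'])      # unlike i.samples, you have to know which sample you want, because genotype takes the sample name as its argument

i.genotype returns a Call object.

Each Call object has three attributes: site, sample and data.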

eg:

for i in myvcf:
    call = i.genotype('12')  # returns a Call object
    print(call.site)         # the record the call belongs to: chrom, pos, ref base and alt base
    print(call.sample)       # the sample name
    print(call.data)         # the call data

Record(CHROM=Chr1, POS=35604, REF=T, ALT=[G])     # individual fields can be taken from this
12        # the sample name
CallData(GT=0/1, DP=12, RO=10, QR=300, AO=2, QA=64, GL=[-4.2871800000000002, 0.0, -10.0]) # individual fields can be taken from this

Besides the attributes above, there are some convenient ones that can be used directly, such as

i.is_snp

i.is_indel

i.is_transition

i.is_deletion

i.is_monomorphic

All of the above return a bool.

i.var_type   # e.g. snp

i.var_subtype   # e.g. ts
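
To make these flags concrete, here is a minimal stdlib-only sketch that classifies a site from its REF and ALT strings. It is a simplified illustration rather than PyVCF's implementation; `is_snp`, `is_indel`, `is_transition` and `describe` below are plain hypothetical functions, and edge cases such as symbolic alleles are ignored.

```python
# Transitions are purine<->purine (A<->G) or pyrimidine<->pyrimidine (C<->T) changes.
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def is_snp(ref, alts):
    """True when REF and every ALT allele are a single base."""
    return len(ref) == 1 and all(len(alt) == 1 for alt in alts)


def is_indel(ref, alts):
    """True when any ALT allele differs in length from REF."""
    return any(len(alt) != len(ref) for alt in alts)


def is_transition(ref, alts):
    """True for a SNP with exactly one ALT allele that is a transition."""
    return is_snp(ref, alts) and len(alts) == 1 and (ref, alts[0]) in TRANSITIONS


def describe(ref, alts):
    """Return a rough type/subtype label for one site."""
    if is_snp(ref, alts):
        if len(alts) != 1:
            return "snp/unknown"
        return "snp/" + ("ts" if is_transition(ref, alts) else "tv")
    if is_indel(ref, alts):
        return "indel"
    return "other"


# REF/ALT pairs taken from the records printed earlier.
sites = [("G", ["C"]), ("C", ["T"]), ("GAAAAAAAG", ["GAAAAAAAAG"])]
labels = [describe(ref, alts) for ref, alts in sites]
print(labels)   # ['snp/tv', 'snp/ts', 'indel']
```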

>>> myvcf = vcf.Reader(open('testpyvcf', 'r'))
>>> myvcf.metadata
OrderedDict([('fileformat', 'VCFv4.1'), ('fileDate', '20140817'), ('source', ['freeBayes v0.9.14-15-gc6f49c0']), ('reference', '../../call_snp/refseq/Osativa_204.fa'), ('phasing', ['none']), ('commandline', ['"freebayes -f ../../call_snp/refseq/Osativa_204.fa -u -X -L bamfile.fb.list"'])])

>>> myvcf.samples
['12', 'CS48', 'F12']

There are also myvcf.infos, myvcf.filters and myvcf.formats.

>>> myvcf.infos
OrderedDict([('NS', Info(id='NS', num=1, type='Integer', desc='Number of samples with data')), ('DP', Info(id='DP', num=1, type='Integer', desc='Total read depth at the locus')), ('DPB', Info(id='DPB', num=1, type='Float', desc='Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype')), ('AC', Info(id='AC', num=-1, type='Integer', desc='Total number of alternate alleles in called genotypes')), ('AN', Info(id='AN', num=1, type='Integer', desc='Total number of alleles in called genotypes')), ('AF', Info(id='AF', num=-1, type='Float', desc='Estimated allele frequency in the range (0,1]')), ('RO', Info(id='RO', num=1, type='Integer', desc='Reference allele observation count, with partial observations recorded fractionally')), ('AO', Info(id='AO', num=-1, type='Integer', desc='Alternate allele observations, with partial observations recorded fractionally')), ('PRO', Info(id='PRO', num=1, type='Float', desc='Reference allele observation count, with partial observations recorded fractionally')), ('PAO', Info(id='PAO', num=-1, type='Float', desc='Alternate allele observations, with partial observations recorded fractionally')), ('QR', Info(id='QR', num=1, type='Integer', desc='Reference allele quality sum in phred')), ('QA', Info(id='QA', num=-1, type='Integer', desc='Alternate allele quality sum in phred')), ('PQR', Info(id='PQR', num=1, type='Float', desc='Reference allele quality sum in phred for partial observations')), ('PQA', Info(id='PQA', num=-1, type='Float', desc='Alternate allele quality sum in phred for partial observations')), ('SRF', Info(id='SRF', num=1, type='Integer', desc='Number of reference observations on the forward strand')), ('SRR', Info(id='SRR', num=1, type='Integer', desc='Number of reference observations on the reverse strand')), ('SAF', Info(id='SAF', num=-1, type='Integer', desc='Number of alternate observations on the forward strand')), ('SAR', Info(id='SAR', num=-1, type='Integer', desc='Number of alternate 
observations on the reverse strand')), ('SRP', Info(id='SRP', num=1, type='Float', desc="Strand balance probability for the reference allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SRF and SRR given E(SRF/SRR) ~ 0.5, derived using Hoeffding's inequality")), ('SAP', Info(id='SAP', num=-1, type='Float', desc="Strand balance probability for the alternate allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SAF and SAR given E(SAF/SAR) ~ 0.5, derived using Hoeffding's inequality")), ('AB', Info(id='AB', num=-1, type='Float', desc='Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous')), ('ABP', Info(id='ABP', num=-1, type='Float', desc="Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality")), ('RUN', Info(id='RUN', num=-1, type='Integer', desc='Run length: the number of consecutive repeats of the alternate allele in the reference genome')), ('RPP', Info(id='RPP', num=-1, type='Float', desc="Read Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality")), ('RPPR', Info(id='RPPR', num=1, type='Float', desc="Read Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality")), ('EPP', Info(id='EPP', num=-1, type='Float', desc="End Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's 
inequality")), ('EPPR', Info(id='EPPR', num=1, type='Float', desc="End Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality")), ('DPRA', Info(id='DPRA', num=-1, type='Float', desc='Alternate allele depth ratio.  Ratio between depth in samples with each called alternate allele and those without.')), ('ODDS', Info(id='ODDS', num=1, type='Float', desc='The log odds ratio of the best genotype combination to the second-best.')), ('GTI', Info(id='GTI', num=1, type='Integer', desc='Number of genotyping iterations required to reach convergence or bailout.')), ('TYPE', Info(id='TYPE', num=-1, type='String', desc='The type of allele, either snp, mnp, ins, del, or complex.')), ('CIGAR', Info(id='CIGAR', num=-1, type='String', desc="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing.  
Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.")), ('NUMALT', Info(id='NUMALT', num=1, type='Integer', desc='Number of unique non-reference alleles in called genotypes at this position.')), ('MEANALT', Info(id='MEANALT', num=-1, type='Float', desc='Mean number of unique non-reference allele observations per sample with the corresponding alternate alleles.')), ('LEN', Info(id='LEN', num=-1, type='Integer', desc='allele length')), ('MQM', Info(id='MQM', num=-1, type='Float', desc='Mean mapping quality of observed alternate alleles')), ('MQMR', Info(id='MQMR', num=1, type='Float', desc='Mean mapping quality of observed reference alleles')), ('PAIRED', Info(id='PAIRED', num=-1, type='Float', desc='Proportion of observed alternate alleles which are supported by properly paired read fragments')), ('PAIREDR', Info(id='PAIREDR', num=1, type='Float', desc='Proportion of observed reference alleles which are supported by properly paired read fragments'))])

To see the description of a single abbreviation rather than all of them:

>>> myvcf.infos['DP'].desc
'Total read depth at the locus'

>>> myvcf.infos['RO'].desc
'Reference allele observation count, with partial observations recorded fractionally'
>>> myvcf.infos['AO'].desc
'Alternate allele observations, with partial observations recorded fractionally'
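
These descriptions come from the `##INFO` lines in the VCF header. Below is a minimal stdlib-only sketch of how such a line can be turned into a small record with id, num, type and desc fields. The two header lines are reconstructed from the output above, `parse_info_header` is a hypothetical helper, and symbolic counts such as `A` are kept as text here, whereas the PyVCF output above shows them as -1.

```python
import re
from collections import namedtuple

Info = namedtuple("Info", ["id", "num", "type", "desc"])

INFO_PATTERN = re.compile(
    r'##INFO=<ID=(?P<id>[^,]+),Number=(?P<num>[^,]+),'
    r'Type=(?P<type>[^,]+),Description="(?P<desc>[^"]*)"'
)


def parse_info_header(line):
    """Parse one ##INFO header line into an Info tuple."""
    match = INFO_PATTERN.match(line)
    if match is None:
        raise ValueError("not an INFO header line: %r" % line)
    num = match.group("num")
    # Keep fixed counts as ints; leave symbolic counts such as 'A' or '.' as text.
    num = int(num) if num.isdigit() else num
    return Info(match.group("id"), num, match.group("type"), match.group("desc"))


header = [
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Total read depth at the locus">',
    '##INFO=<ID=AO,Number=A,Type=Integer,Description="Alternate allele observations, '
    'with partial observations recorded fractionally">',
]
infos = {}
for line in header:
    info = parse_info_header(line)
    infos[info.id] = info

print(infos["DP"].desc)   # Total read depth at the locus
print(infos["AO"].num)    # A
```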

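If you do not want to loop over a file from start to finish and only need part of it, you can use fetch. The prerequisite is that the file has been indexed with tabix, and before running tabix it must be compressed with bgzip.

bgzip testpyvcf.vcf  # produces testpyvcf.vcf.gz

tabix -p vcf testpyvcf.vcf.gz     # produces the index file for testpyvcf.vcf.gz: testpyvcf.vcf.gz.tbi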

>>> import vcf
>>> myvcf = vcf.Reader(filename='testpyvcf.vcf.gz')

>>> for i in myvcf.fetch('Chr1', 1111, 444444):      # arguments: sequence name, start, end; the end is inclusive. For example, if the VCF file has two sites at 100 and 200, the interval (100, 200) returns both records.
...     print(i)
...
Record(CHROM=Chr1, POS=11553, REF=G, ALT=[C])
Record(CHROM=Chr1, POS=12840, REF=T, ALT=[G])
Record(CHROM=Chr1, POS=16188, REF=GAAAAAAAG, ALT=[GAAAAAAAAG])
Record(CHROM=Chr1, POS=18915, REF=CAAAAAAG, ALT=[CAAAAAAAG])
Record(CHROM=Chr1, POS=19439, REF=CTTTTTTTTTA, ALT=[CTTTTTTTTTTTA])
Record(CHROM=Chr1, POS=24810, REF=ATTTTTTTTTC, ALT=[ATTTTTTTTTTC])
Record(CHROM=Chr1, POS=26067, REF=CAAAAAAG, ALT=[CAAAAAAAG])
Record(CHROM=Chr1, POS=26996, REF=CAAAAAAAAT, ALT=[CAAAAAAAAAAT])
Record(CHROM=Chr1, POS=27142, REF=C, ALT=[G])
Record(CHROM=Chr1, POS=27698, REF=CTTTTTTTTC, ALT=[CTTTTTTTTTC])
Record(CHROM=Chr1, POS=30645, REF=A, ALT=[C])
Record(CHROM=Chr1, POS=31478, REF=C, ALT=[T])
Record(CHROM=Chr1, POS=33667, REF=A, ALT=[G])
Record(CHROM=Chr1, POS=34057, REF=C, ALT=[T])
Record(CHROM=Chr1, POS=34339, REF=TAAAAAAAAAC, ALT=[TAAAAAAAAAAC])
Record(CHROM=Chr1, POS=35604, REF=T, ALT=[G])

myvcf = vcf.Reader(filename='testpyvcf.vcf.gz')
for i in myvcf.fetch('Chr1',34339): # with a single position instead of an interval, the loop yields the Call objects of the samples at that position
    print(i)

Call(sample=12, CallData(GT=0/1, DP=47, RO=43, QR=1501, AO=4, QA=127, GL=[-2.8504, 0.0, -10.0]))
Call(sample=CS48, CallData(GT=1/1, DP=82, RO=4, QR=146, AO=68, QA=2316, GL=[-10.0, -2.51126, 0.0]))
Call(sample=F12, CallData(GT=0/1, DP=27, RO=22, QR=785, AO=3, QA=95, GL=[-4.4559800000000003, 0.0, -10.0]))
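
When no tabix index is available, the same kind of region query can be done with a plain linear scan. Below is a minimal stdlib-only sketch, assuming the inclusive start and end described above. `records_in_region` and the example lines are made up for illustration. A scan reads every line, so it is much slower on large files than an indexed lookup.

```python
def records_in_region(lines, chrom, start, end):
    """Yield (chrom, pos, line) for data lines with start <= pos <= end on chrom."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        pos = int(fields[1])
        if fields[0] == chrom and start <= pos <= end:
            yield fields[0], pos, line


# Made-up VCF content; in practice `lines` would be an open file handle.
vcf_lines = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "Chr1\t100\t.\tG\tC\n",
    "Chr1\t150\t.\tT\tG\n",
    "Chr1\t200\t.\tA\tC\n",
    "Chr1\t250\t.\tA\tG\n",
    "Chr2\t150\t.\tC\tT\n",
]
hits = [pos for _, pos, _ in records_in_region(vcf_lines, "Chr1", 100, 200)]
print(hits)   # [100, 150, 200]
```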

For writing VCF files, the vcf module provides the Writer class.

eg:

import vcf

vcffile = open('test.vcf', 'r')          # open the input file as usual
outvcf = open('outvcf.vcf', 'w')         # open the file to write to
myvcf = vcf.Reader(vcffile)
woutvcf = vcf.Writer(outvcf, myvcf)      # writes myvcf's header information to outvcf.vcf
for i in myvcf:
    woutvcf.write_record(i)              # writes each record of myvcf to outvcf.vcf
vcffile.close()
outvcf.close()                           # close the opened files
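
A common reason to use the Writer is to keep only some of the records. Below is a minimal sketch of that, assuming PyVCF is installed and using only the calls already shown in this post (vcf.Reader, vcf.Writer, write_record, is_snp). The function name `write_snps_only` is made up for illustration, and the `with` statement closes both files even if an error occurs.

```python
def write_snps_only(in_path, out_path):
    """Copy the header and only the SNP records from in_path to out_path."""
    import vcf  # PyVCF; imported here so the sketch loads even where it is not installed

    kept = 0
    with open(in_path, "r") as src, open(out_path, "w") as dst:
        reader = vcf.Reader(src)
        writer = vcf.Writer(dst, reader)    # the reader serves as the header template
        for record in reader:
            if record.is_snp:               # drop indels and other variant types
                writer.write_record(record)
                kept += 1
    return kept
```

It could be called as write_snps_only('test.vcf', 'snps_only.vcf'), where the output file name is only an example.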

by freemao

FAFU.

free_mao@qq.com
