pyvcf 模块

最近一直在处理samtools freebayes gatk 产生的snp数据，结果文件都是vcf，于是自己就写了相应的类，但是总是不够完善。海宝推荐这个模块，他都推荐了我还抱着我那烂代码不放干啥之前写的就当练习类了

安装：

sudo pip install pyvcf

然后报错说没有counter模块，于是：

sudo pip install counter

然后就安装好了

简单实用：

import vcf
myvcf = vcf.Reader(open('testpyvcf', 'r')) #和python内置的文件类型一样，循环完不会从头开始。
for i in myvcf:
print i

Record(CHROM=Chr1, POS=11553, REF=G, ALT=[C])
Record(CHROM=Chr1, POS=12840, REF=T, ALT=[G])
Record(CHROM=Chr1, POS=16188, REF=GAAAAAAAG, ALT=[GAAAAAAAAG])
Record(CHROM=Chr1, POS=18915, REF=CAAAAAAG, ALT=[CAAAAAAAG])
Record(CHROM=Chr1, POS=19439, REF=CTTTTTTTTTA, ALT=[CTTTTTTTTTTTA])
Record(CHROM=Chr1, POS=24810, REF=ATTTTTTTTTC, ALT=[ATTTTTTTTTTC])
Record(CHROM=Chr1, POS=26067, REF=CAAAAAAG, ALT=[CAAAAAAAG])
Record(CHROM=Chr1, POS=26996, REF=CAAAAAAAAT, ALT=[CAAAAAAAAAAT])
Record(CHROM=Chr1, POS=27142, REF=C, ALT=[G])
Record(CHROM=Chr1, POS=27698, REF=CTTTTTTTTC, ALT=[CTTTTTTTTTC])
Record(CHROM=Chr1, POS=30645, REF=A, ALT=[C])
Record(CHROM=Chr1, POS=31478, REF=C, ALT=[T])
Record(CHROM=Chr1, POS=33667, REF=A, ALT=[G])
Record(CHROM=Chr1, POS=34057, REF=C, ALT=[T])
Record(CHROM=Chr1, POS=34339, REF=TAAAAAAAAAC, ALT=[TAAAAAAAAAAC])
Record(CHROM=Chr1, POS=35604, REF=T, ALT=[G])

当然可以直接取你所需

print i.CHROM #对应vcf chr那一列

print i.POS #对应vcf pos那一列返回的是整型

print i.ID #对应ID 无得话返回None

print i.REF #对应ref列返回的是一个字符串

print i.ALT #返回的是一个列表！知道为什么REF返回字符串而ALT却返回列表么？因为ALT可能不止一个啊！比如： REF为A， ALT为T,G！！！注意返回的是一个列表不管alt那列是几个碱基，都是返回列表，只有一个也是列表，并列表中的元素不是字符串，而是一个类：class 'vcf.model._Substitution'。把元素转化为对应字符串用i.ALT[0].sequence

print i.QUAL #对应qual列返回的是float

print i.FILTER #对应filter 列无的话返回None

print i.INFO #对应vcf文件INFO列返回的是一个字典注意字典的值有的是列表，有的是字符串，有的是int, 有的是float

eg:

{'SAP': [5.1817700000000002], 'EPP': [5.1817700000000002], 'SRR': 5, 'DPB': 25.0, 'MQMR': 50.0, 'DP': 25, 'PAO': [0.0], 'RPP': [11.696199999999999], 'PAIRED': [0.75], 'ODDS': 3.4715400000000001, 'MEANALT': [1.0], 'MQM': [50.0], 'SAF': [3], 'PAIREDR': 0.90476199999999996, 'EPPR': 11.385999999999999, 'SAR': [1], 'NS': 3, 'RO': 21, 'AC': [2], 'AB': [0.20000000000000001], 'SRF': 16, 'AF': [0.33333299999999999], 'GTI': 0, 'AO': [4], 'AN': 6, 'ABP': [18.6449], 'SRP': 15.5221, 'DPRA': [2.0], 'RPPR': 15.5221, 'PQA': [0.0], 'QR': 656, 'RUN': [1], 'CIGAR': ['1X'], 'LEN': [1], 'NUMALT': 1, 'QA': [126], 'PQR': 0.0, 'TYPE': ['snp'], 'PRO': 0.0}

可以通过相应的键取出对应的值如：

print i.INFO[‘TYPE’] 返回列表

print i.INFO[‘DP’] 返回的是int

i.FORMAT #返回format列字符串如果你的vcf文件中没有FORMAT 返回None

eg:

print i.FORMAT, type(i.FORMAT)

GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>
GT:DP:RO:QR:AO:QA:GL <type 'str'>

i.samples 和 i.genotype #其他都是大写这两个是小写

这两个不对应哪一列

eg:

print i.samples # 我的vcf文件有三个样品分别是12 CS48 F12 返回的是三个样 call object 组成的列表。

[Call(sample=12, CallData(GT=0/1, DP=12, RO=10, QR=300, AO=2, QA=64, GL=[-4.2871800000000002, 0.0, -10.0])), Call(sample=CS48, CallData(GT=0/1, DP=8, RO=6, QR=199, AO=2, QA=62, GL=[-4.9289199999999997, 0.0, -10.0])), Call(sample=F12, CallData(GT=0/0, DP=5, RO=5, QR=157, AO=0, QA=0, GL=[0.0, -1.50515, -10.0]))]

如何取出每个样中心信息？

for i in myvcf: #返回record

for j in i.samples: #返回每个sample

print j[‘GT’] #每个sample的genotype 当然也可以j[‘DP’], j[‘RO’], j[‘AO’]…

1/1   #第一个样本的genotype   GT返回字符串；如果类型为others类型，AO返回由int组成的列表（ins,ins 会被认为是ins类型）； RO DP 返回int
0/1    #第二个样本的genotype
0/1    #第三个样本的genotype

如果某个样没有信息返回None

i.genotype的用法：

eg:

for i in myvcf:
print i.genotype('12')['GT'] #和i.samples不太一样的是你必须知道你要取哪个样的信息，因为你必须给genotype传一个样本参数。

i.genotype返回的是call对象

每个call对象有三个属性。 site, sample, data

eg:

for i in myvcf:
    call = i.genotype('12') #返回call对象
    print call.site         #返回call对象的chrom pos refbase altbase 信息
    print call.sample      #返回样本名字
    print call.data       #返回call data

Record(CHROM=Chr1, POS=35604, REF=T, ALT=[G]) #可以讲对应项取出来
12 #样本的名字
CallData(GT=0/1, DP=12, RO=10, QR=300, AO=2, QA=64, GL=[-4.2871800000000002, 0.0, -10.0]) #可以将对应项取出来

除以上的方法外还有些方便的方法直接使用，如

i.is_snp

i.is_indel

i.is_transition

i.is_deletion

i.is_monomorphic

上面的几位都返回bool

i.var_type #如snp,,,,

i.var_subtype #如ts

>>> myvcf = vcf.Reader(open('testpyvcf', 'r'))
>>> myvcf.metadata
OrderedDict([('fileformat', 'VCFv4.1'), ('fileDate', '20140817'), ('source', ['freeBayes v0.9.14-15-gc6f49c0']), ('reference', '../../call_snp/refseq/Osativa_204.fa'), ('phasing', ['none']), ('commandline', ['"freebayes -f ../../call_snp/refseq/Osativa_204.fa -u -X -L bamfile.fb.list"'])])

>>> myvcf.samples
['12', 'CS48', 'F12']

还有myvcf.infos myvcf.filters myvcf.formats

>>> myvcf.infos
OrderedDict([('NS', Info(id='NS', num=1, type='Integer', desc='Number of samples with data')), ('DP', Info(id='DP', num=1, type='Integer', desc='Total read depth at the locus')), ('DPB', Info(id='DPB', num=1, type='Float', desc='Total read depth per bp at the locus; bases in reads overlapping / bases in haplotype')), ('AC', Info(id='AC', num=-1, type='Integer', desc='Total number of alternate alleles in called genotypes')), ('AN', Info(id='AN', num=1, type='Integer', desc='Total number of alleles in called genotypes')), ('AF', Info(id='AF', num=-1, type='Float', desc='Estimated allele frequency in the range (0,1]')), ('RO', Info(id='RO', num=1, type='Integer', desc='Reference allele observation count, with partial observations recorded fractionally')), ('AO', Info(id='AO', num=-1, type='Integer', desc='Alternate allele observations, with partial observations recorded fractionally')), ('PRO', Info(id='PRO', num=1, type='Float', desc='Reference allele observation count, with partial observations recorded fractionally')), ('PAO', Info(id='PAO', num=-1, type='Float', desc='Alternate allele observations, with partial observations recorded fractionally')), ('QR', Info(id='QR', num=1, type='Integer', desc='Reference allele quality sum in phred')), ('QA', Info(id='QA', num=-1, type='Integer', desc='Alternate allele quality sum in phred')), ('PQR', Info(id='PQR', num=1, type='Float', desc='Reference allele quality sum in phred for partial observations')), ('PQA', Info(id='PQA', num=-1, type='Float', desc='Alternate allele quality sum in phred for partial observations')), ('SRF', Info(id='SRF', num=1, type='Integer', desc='Number of reference observations on the forward strand')), ('SRR', Info(id='SRR', num=1, type='Integer', desc='Number of reference observations on the reverse strand')), ('SAF', Info(id='SAF', num=-1, type='Integer', desc='Number of alternate observations on the forward strand')), ('SAR', Info(id='SAR', num=-1, type='Integer', desc='Number of alternate observations on the reverse strand')), ('SRP', Info(id='SRP', num=1, type='Float', desc="Strand balance probability for the reference allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SRF and SRR given E(SRF/SRR) ~ 0.5, derived using Hoeffding's inequality")), ('SAP', Info(id='SAP', num=-1, type='Float', desc="Strand balance probability for the alternate allele: Phred-scaled upper-bounds estimate of the probability of observing the deviation between SAF and SAR given E(SAF/SAR) ~ 0.5, derived using Hoeffding's inequality")), ('AB', Info(id='AB', num=-1, type='Float', desc='Allele balance at heterozygous sites: a number between 0 and 1 representing the ratio of reads showing the reference allele to all reads, considering only reads from individuals called as heterozygous')), ('ABP', Info(id='ABP', num=-1, type='Float', desc="Allele balance probability at heterozygous sites: Phred-scaled upper-bounds estimate of the probability of observing the deviation between ABR and ABA given E(ABR/ABA) ~ 0.5, derived using Hoeffding's inequality")), ('RUN', Info(id='RUN', num=-1, type='Integer', desc='Run length: the number of consecutive repeats of the alternate allele in the reference genome')), ('RPP', Info(id='RPP', num=-1, type='Float', desc="Read Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality")), ('RPPR', Info(id='RPPR', num=1, type='Float', desc="Read Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between RPL and RPR given E(RPL/RPR) ~ 0.5, derived using Hoeffding's inequality")), ('EPP', Info(id='EPP', num=-1, type='Float', desc="End Placement Probability: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality")), ('EPPR', Info(id='EPPR', num=1, type='Float', desc="End Placement Probability for reference observations: Phred-scaled upper-bounds estimate of the probability of observing the deviation between EL and ER given E(EL/ER) ~ 0.5, derived using Hoeffding's inequality")), ('DPRA', Info(id='DPRA', num=-1, type='Float', desc='Alternate allele depth ratio. Ratio between depth in samples with each called alternate allele and those without.')), ('ODDS', Info(id='ODDS', num=1, type='Float', desc='The log odds ratio of the best genotype combination to the second-best.')), ('GTI', Info(id='GTI', num=1, type='Integer', desc='Number of genotyping iterations required to reach convergence or bailout.')), ('TYPE', Info(id='TYPE', num=-1, type='String', desc='The type of allele, either snp, mnp, ins, del, or complex.')), ('CIGAR', Info(id='CIGAR', num=-1, type='String', desc="The extended CIGAR representation of each alternate allele, with the exception that '=' is replaced by 'M' to ease VCF parsing. Note that INDEL alleles do not have the first matched base (which is provided by default, per the spec) referred to by the CIGAR.")), ('NUMALT', Info(id='NUMALT', num=1, type='Integer', desc='Number of unique non-reference alleles in called genotypes at this position.')), ('MEANALT', Info(id='MEANALT', num=-1, type='Float', desc='Mean number of unique non-reference allele observations per sample with the corresponding alternate alleles.')), ('LEN', Info(id='LEN', num=-1, type='Integer', desc='allele length')), ('MQM', Info(id='MQM', num=-1, type='Float', desc='Mean mapping quality of observed alternate alleles')), ('MQMR', Info(id='MQMR', num=1, type='Float', desc='Mean mapping quality of observed reference alleles')), ('PAIRED', Info(id='PAIRED', num=-1, type='Float', desc='Proportion of observed alternate alleles which are supported by properly paired read fragments')), ('PAIREDR', Info(id='PAIREDR', num=1, type='Float', desc='Proportion of observed reference alleles which are supported by properly paired read fragments'))])

如果想看某一个缩写的描述而不是全部可以：

>>> myvcf.infos['DP'].desc
'Total read depth at the locus'

>>> myvcf.infos['RO'].desc
'Reference allele observation count, with partial observations recorded fractionally'
>>> myvcf.infos['AO'].desc
'Alternate allele observations, with partial observations recorded fractionally'

如果你不想从头到尾循环某个文件。只想取某一部分，可以用fetch, 但是前体是用tabix对文件index， tabix前要用bgzip压缩

bgzip testpyvcf.vcf #得到testpyvcf.vcf.gz文件

tabix -p vcf testpyvcf.vcf.gz #得到testpyvcf.vcf.gz的index文件：testpyvcf.vcf.gz.tbi

>>> import vcf
>>> myvcf = vcf.Reader(filename='testpyvcf.vcf.gz')

>>> for i in myvcf.fetch('Chr1', 1111, 444444): #第一个是序列名，第二个起始，第三个end,包括。加入vcf文件中有两个位点100 和 200 如果（100,200），会返回这两个record。... print i
...
Record(CHROM=Chr1, POS=11553, REF=G, ALT=[C])
Record(CHROM=Chr1, POS=12840, REF=T, ALT=[G])
Record(CHROM=Chr1, POS=16188, REF=GAAAAAAAG, ALT=[GAAAAAAAAG])
Record(CHROM=Chr1, POS=18915, REF=CAAAAAAG, ALT=[CAAAAAAAG])
Record(CHROM=Chr1, POS=19439, REF=CTTTTTTTTTA, ALT=[CTTTTTTTTTTTA])
Record(CHROM=Chr1, POS=24810, REF=ATTTTTTTTTC, ALT=[ATTTTTTTTTTC])
Record(CHROM=Chr1, POS=26067, REF=CAAAAAAG, ALT=[CAAAAAAAG])
Record(CHROM=Chr1, POS=26996, REF=CAAAAAAAAT, ALT=[CAAAAAAAAAAT])
Record(CHROM=Chr1, POS=27142, REF=C, ALT=[G])
Record(CHROM=Chr1, POS=27698, REF=CTTTTTTTTC, ALT=[CTTTTTTTTTC])
Record(CHROM=Chr1, POS=30645, REF=A, ALT=[C])
Record(CHROM=Chr1, POS=31478, REF=C, ALT=[T])
Record(CHROM=Chr1, POS=33667, REF=A, ALT=[G])
Record(CHROM=Chr1, POS=34057, REF=C, ALT=[T])
Record(CHROM=Chr1, POS=34339, REF=TAAAAAAAAAC, ALT=[TAAAAAAAAAAC])
Record(CHROM=Chr1, POS=35604, REF=T, ALT=[G])

myvcf = vcf.Reader(filename='testpyvcf.vcf.gz')
for i in myvcf.fetch('Chr1',34339): #只提供一个点，而不是区间，会返回这个点出样本的call对象。
print i

Call(sample=12, CallData(GT=0/1, DP=47, RO=43, QR=1501, AO=4, QA=127, GL=[-2.8504, 0.0, -10.0]))
Call(sample=CS48, CallData(GT=1/1, DP=82, RO=4, QR=146, AO=68, QA=2316, GL=[-10.0, -2.51126, 0.0]))
Call(sample=F12, CallData(GT=0/1, DP=27, RO=22, QR=785, AO=3, QA=95, GL=[-4.4559800000000003, 0.0, -10.0]))

关于vcf的写操作， vcf提供了Writer类

eg:

vcffile = open('test.vcf', 'r') # 普通的文件打开操作

outvcf = open('outvcf.vcf', 'w') #打开要写入的文件

myvcf = vcf.Reader(vcffile)

woutvcf = vcf.Writer(outvcf, myvcf) #将myvcf的header信息，写入到outvcf.vcf

for i in myvcf:

woutvcf.write_record(i) #将myvcf的record写入到outvcf.vcf

vcffile.close()

outvcf.close() #将打开的文件关闭

by freemao

FAFU.

free_mao@qq.com

pyvcf 模块的更多相关文章

基因组与Python --PyVCF 好用的vcf文件处理器
vcf文件的全称是variant call file,即突变识别文件,它是基因组工作流程中产生的一种文件,保存的是基因组上的突变信息.通过对vcf文件进行分析,可以得到个体的变异信息.嗯,总之,这是很 ...
VCF文件处理工具PyVCF
vcf格式示例 ##fileformat=VCFv4.1 ##FILTER=<ID=LowQual,Description=”Low quality”> ##FORMAT=<ID=A ...
npm 私有模块的管理使用
你可以使用 NPM 命令行工具来管理你在 NPM 仓库的私有模块代码,这使得在项目中使用公共模块变的更加方便. 开始前的工作你需要一个 2.7.0 以上版本的 npm ,并且需要有一个可以登陆 np ...
node.js学习（三）简单的node程序&&模块简单使用&&commonJS规范&&深入理解模块原理
一.一个简单的node程序 1.新建一个txt文件 2.修改后缀修改之后会弹出这个,点击"是" 3.运行test.js 源文件使用node.js运行之后的. 如果该路径下没有该 ...
ES6模块import细节
写在前面,目前浏览器对ES6的import支持还不是很好,需要用bable转译. ES6引入外部模块分两种情况: 1.导入外部的变量或函数等: import {firstName, lastName, ...
Python标准模块--ContextManager
1 模块简介在数年前,Python 2.5 加入了一个非常特殊的关键字,就是with.with语句允许开发者创建上下文管理器.什么是上下文管理器?上下文管理器就是允许你可以自动地开始和结束一些事情. ...
Python标准模块--Unicode
1 模块简介 Python 3中最大的变化之一就是删除了Unicode类型.在Python 2中,有str类型和unicode类型,例如, Python 2.7.6 (default, Oct 26 ...
Python标准模块--Iterators和Generators
1 模块简介当你开始使用Python编程时,你或许已经使用了iterators(迭代器)和generators(生成器),你当时可能并没有意识到.在本篇博文中,我们将会学习迭代器和生成器是什么.当然 ...
自己实现一个javascript事件模块
nodejs中的事件模块 nodejs中有一个events模块,用来给别的函数对象提供绑定事件.触发事件的能力.这个别的函数的对象,我把它叫做事件宿主对象(非权威叫法),其原理是把宿主函数的原型链指向 ...

随机推荐

C#图书资源【更新中...】喜欢的就转存吧
百度分享暂时取消了,需要的朋友可以直接关注我,我发给你. 重点篇 1.C#与.NET3.5高级程序设计 2.C#与.NET3.5高级程序设计.rar 3.Effective_C#_中文版_改善C#程序 ...
winform开发中绑定combox到枚举
开发中需要根据下拉框的选择处理一些业务逻辑,使用ID值或Text值都不利于代码维护,所以可以写个扩展方法绑定到枚举上. public static class Extensions { /// < ...
BroadcastReceiver的实例----基于Service的音乐播放器之二
该程序的后台Service会在播放状态发生改变时对外发送广播(广播将会激发前台Activity的BroadcastReceiver):它也会采用BroadcastReceiver监听来自前台Activ ...
oracle之to_char,to_date用法
[转载自]http://www.jb51.net/article/45591.htm 这篇文章主要介绍了oracle中to_date详细用法示例,包括期和字符转换函数用法.字符串和时间互转.求某天是星 ...
js基础之动画（三）
一.链式运动首先,要改进运动框架 function getStyle(obj,attr){ if(obj.currentStyle){ return obj.currentStyle[attr]; ...
python错误集锦
1.lt与list等同,不能作为变量名 2.中文路径名:os.path.normcase(filepath) 如果遇到 ascii码在路径某处不能转换, 那么 filepath.encode('gbk ...
redis学习（一）
一.redis简介 Redis是基于内存.可持久化的日志型.key-value高性能存储系统.关键字(Keys)是用来标识数据块.值(Values)是关联于关键字的实际值,可以是任何东西.有时候你会存 ...
Android 之数据存储
在Android操作系统中,提供了5种数据存储方式:SharedPreferences存储,文件存储,SQLite数据库存储,ContentProvider存储和网络存储. 一.SharedPrefe ...
SyntaxError: missing ; before statement 错误的解决
今天jsp页面中报错:SyntaxError: missing ; before statement 简单的理解是语法错误,F12调试之后发现原来是我定义的一个js中的全局变量的问题. <scr ...
C++中的数组与指针
数组与指针看起来很像 int a[] = {1, 2 ,3}; int *p = a; 如此,我们可以p[0], p[1], p[2] 看起来,与直接使用数组名没什么两样,但是看这段代码 sizeof ...

pyvcf 模块

pyvcf 模块的更多相关文章

随机推荐

热门专题