05 Computing GC Content
Problem
The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
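The definition above can be checked directly with a minimal sketch. The function names `gc_content` and `reverse_complement` are chosen here for illustration and are not part of the problem statement; the example confirms the 37.5% figure and the claim about the reverse complement.

```python
def gc_content(dna):
    """Return the GC-content of a DNA string as a percentage."""
    if not dna:
        raise ValueError("empty DNA string")
    gc = sum(1 for base in dna.upper() if base in "GC")
    return 100.0 * gc / len(dna)


def reverse_complement(dna):
    """Return the reverse complement of a DNA string (A<->T, C<->G)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(dna.upper()))


example = "AGCTATAG"
print(gc_content(example))                      # 37.5
print(gc_content(reverse_complement(example)))  # 37.5
```

Complementing swaps G with C and A with T, so the number of G/C symbols is unchanged, and reversing does not change any counts.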
DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.
In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
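As a rough illustration of the format just described, the sketch below reads FASTA text into (label, sequence) pairs and checks the labels against the "Rosalind_xxxx" pattern. The name `read_fasta`, the regular expression, and the two short records are assumptions made for this example only.

```python
import re

# Assumed pattern for the IDs described above: "Rosalind_" plus four digits.
ROSALIND_ID = re.compile(r"^Rosalind_\d{4}$")


def read_fasta(text):
    """Yield (label, sequence) pairs from FASTA-formatted text."""
    label, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there is one.
            if label is not None:
                yield label, "".join(chunks)
            label, chunks = line[1:], []
        elif label is not None:
            chunks.append(line)
    # The last record has no header after it, so emit it here.
    if label is not None:
        yield label, "".join(chunks)


# Hypothetical records, not taken from the Rosalind dataset.
sample = """>Rosalind_0001
ACGT
TTGA
>Rosalind_0002
GGCC
"""
records = list(read_fasta(sample))
print(records)
print(all(ROSALIND_ID.match(label) for label, _ in records))
```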
Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).
Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
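To make the 0.001 tolerance concrete, here is a small sketch of an absolute-error comparison. How Rosalind's checker is actually implemented is not stated in the text, so `within_tolerance` is only an assumed model of it.

```python
import math


def within_tolerance(answer, expected, abs_error=0.001):
    """Return True if answer differs from expected by at most abs_error."""
    # rel_tol=0.0 turns off the relative check, leaving a pure absolute one.
    return math.isclose(answer, expected, rel_tol=0.0, abs_tol=abs_error)


print(within_tolerance(60.9195, 60.919540))  # True, off by 0.00004
print(within_tolerance(60.93, 60.919540))    # False, off by about 0.0105
```

Under this model, printing the percentage with three or more decimal places is enough to stay inside the allowed error.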
Sample Dataset
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
Sample Output
Rosalind_0808
60.919540
Method 1:
# -*- coding: utf-8 -*-
# Open the FASTA-format sequence file.
s = open('Computing_GC_Content.txt', 'r').readlines()

# Create two lists, one for names, one for sequences.
name_list = []
seq_list = []
data = ''  # collects a sequence that is spread over several lines

for line in s:
    line = line.strip()
    if line.startswith('>'):
        name_list.append(line[1:])
        if data:
            seq_list.append(data)  # store the joined nucleotide string of the previous record
            data = ''  # reset data once the record has been stored
    else:
        data = data + line.upper()  # make sure every base is uppercase before joining
seq_list.append(data)  # is there a way to include the last sequence in the for loop?

GC_list = []
for seq in seq_list:
    i = 0
    for k in seq:
        if k == 'G' or k == 'C':
            i += 1
    GC_cont = float(i) / len(seq) * 100.0
    GC_list.append(GC_cont)

m = max(GC_list)
print(name_list[GC_list.index(m)])  # look up the name at the index of the max GC content
print("{:0.6f}".format(m))  # keep six decimal places
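The comment in Method 1 asks whether the last sequence can be handled inside the loop. One possible answer, sketched here with a hypothetical helper `split_records`, is to append a sentinel header so that the final record is stored by the same branch as the others. This is an alternative sketch, not the original author's approach.

```python
def split_records(lines):
    """Split FASTA lines into parallel lists of names and sequences."""
    names, seqs = [], []
    data = ''
    # The trailing '>' is a fake header: it triggers the same "store the
    # previous sequence" branch for the final record.
    for line in list(lines) + ['>']:
        line = line.strip()
        if line.startswith('>'):
            if data:
                seqs.append(data)
                data = ''
            if len(line) > 1:  # the sentinel carries no name
                names.append(line[1:])
        else:
            data += line.upper()
    return names, seqs


names, seqs = split_records(['>a', 'AC', 'gt', '>b', 'GG'])
print(names)  # ['a', 'b']
print(seqs)   # ['ACGT', 'GG']
```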
Method 2:
# -*- coding: utf-8 -*-
def parse_fasta(s):
    results = {}
    strings = s.strip().split('>')
    # split() cuts the string at the given separator; if the optional
    # maxsplit argument is given, at most that many splits are made.
    for record in strings:
        if len(record) == 0:
            continue
        # An empty chunk (the text before the first '>') is skipped and
        # the loop moves on to the next one.
        parts = record.split()
        label = parts[0]
        bases = ''.join(parts[1:])
        results[label] = bases
    return results


def gc_content(s):
    n = len(s)
    m = 0
    for c in s:
        if c == 'G' or c == 'C':
            m += 1
    return 100 * (float(m) / n)


if __name__ == "__main__":
    small_dataset = """
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
"""
    #large_dataset = open('datasets/rosalind_gc.txt').read()
    results = parse_fasta(small_dataset)
    results = dict([(k, gc_content(v)) for k, v in results.items()])
    # items() yields the same pairs as Python 2's iteritems() and also works on Python 3.
    # The first results maps name -> sequence; the second maps name -> GC percentage.
    highest_k = None
    highest_v = 0
    for k, v in results.items():
        if v > highest_v:
            highest_k = k
            highest_v = v
    # Print the record with the highest GC content.
    print(highest_k)
    print('%f%%' % highest_v)
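The search loop at the end of Method 2 starts from `highest_v = 0`, so `highest_k` would stay `None` if every string had 0% GC. A compact alternative, sketched with a hypothetical helper `highest_gc`, is to let the built-in `max` pick the label by its value.

```python
def highest_gc(gc_by_label):
    """Return (label, gc_percent) for the entry with the largest GC-content."""
    if not gc_by_label:
        raise ValueError("no records")
    # max() over the keys, ranked by the value stored for each key.
    label = max(gc_by_label, key=gc_by_label.get)
    return label, gc_by_label[label]


# Illustrative values only, not computed from the sample dataset.
print(highest_gc({"a": 10.0, "b": 55.5, "c": 30.0}))  # ('b', 55.5)
```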
Method 3:
# -*- coding: utf-8 -*-
### 5. Computing GC Content ###
from operator import itemgetter
from collections import OrderedDict

seqTest = OrderedDict()
gcContent = OrderedDict()

with open('Computing_GC_Content.txt', 'rt') as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):
            seqName = line[1:]
            seqTest[seqName] = ''
            continue
        seqTest[seqName] += line.upper()

for ke, val in seqTest.items():
    totalLength = len(val)
    gcNumber = val.count('G') + val.count('C')
    gcContent[ke] = (float(gcNumber) / totalLength)*100

sortedGCContent = sorted(gcContent.items(), key=itemgetter(1))
largeName = sortedGCContent[-1][0]
largeGCContent = sortedGCContent[-1][1]

print ('most GC ratio gene is %s and it is %s ' % (largeName, largeGCContent))
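Method 3 prints a sentence rather than the two-line layout shown under Sample Output. If that layout is wanted, a small formatting helper could look like the sketch below; `format_answer` is a name made up for this example.

```python
def format_answer(label, gc_percent):
    """Format the result as the ID on one line and the percentage on the next."""
    return "{}\n{:.6f}".format(label, gc_percent)


print(format_answer("Rosalind_0808", 60.919540229))
# Rosalind_0808
# 60.919540
```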
public class Solution { public int[] FindDiagonalOrder(int[,] matrix) { ); ); + col - ; var ary = ne ...