Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.
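As a quick check of the definition above, here is a minimal sketch that computes the GC-content of the example string and of its reverse complement. The helper names `gc_content` and `reverse_complement` are chosen for illustration and are not part of the problem statement; the sketch assumes the input contains only the symbols A, C, G and T.

```python
def gc_content(dna):
    """Return the GC-content of a DNA string as a percentage."""
    if not dna:
        raise ValueError("empty DNA string")
    gc = dna.count("G") + dna.count("C")
    return 100.0 * gc / len(dna)


def reverse_complement(dna):
    """Return the reverse complement (assumes only A, C, G, T)."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(dna))


example = "AGCTATAG"
print(gc_content(example))                      # 37.5
print(gc_content(reverse_complement(example)))  # 37.5
```

Complementing swaps every G with a C and every A with a T, so the number of G and C symbols combined does not change, which is why the two values agree.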

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.
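The format described above can be read with a short loop. The following is a rough sketch rather than a full FASTA parser: the function name `parse_fasta`, the two labels and the short sequences are made up for illustration, and blank lines are simply skipped.

```python
def parse_fasta(text):
    """Parse FASTA text into a list of (label, sequence) pairs."""
    records = []
    label = None
    chunks = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if label is not None:
                records.append((label, "".join(chunks)))
            label = line[1:]
            chunks = []
        elif label is not None:
            chunks.append(line)
    # The final record has no header after it, so close it here.
    if label is not None:
        records.append((label, "".join(chunks)))
    return records


# Hypothetical two-record input in the "Rosalind_xxxx" labeling style.
sample = """>Rosalind_0001
ACGT
TTGA
>Rosalind_0002
GGCC
"""
print(parse_fasta(sample))
# [('Rosalind_0001', 'ACGTTTGA'), ('Rosalind_0002', 'GGCC')]
```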

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.
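To illustrate the error allowance, the sketch below assumes that "error of 0.001" means the absolute difference between the submitted and the expected value may be at most 0.001. The function name `within_tolerance` and the use of "less than or equal" are assumptions made for this example, not Rosalind's actual checker.

```python
def within_tolerance(answer, expected, tol=0.001):
    """True if answer differs from expected by at most tol (absolute error)."""
    return abs(answer - expected) <= tol


expected = 60.919540
print(within_tolerance(60.9195, expected))  # True
print(within_tolerance(60.9, expected))     # False
```

Under this assumption, printing the percentage with four or more decimal places is enough; rounding to one decimal place is not.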

Sample Dataset

>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808
60.919540

Method 1:
# -*- coding: utf-8 -*-

# Read the FASTA file into a list of lines.
with open('Computing_GC_Content.txt', 'r') as f:
    s = f.readlines()

# Two parallel lists: one for names, one for sequences.
name_list = []
seq_list = []
data = ''  # collects a sequence that is spread over several lines
for line in s:
    line = line.strip()
    if line.startswith('>'):
        name_list.append(line[1:])
        if data:
            seq_list.append(data)  # store the joined sequence of the previous record
            data = ''  # reset data once the record has been stored
    else:
        data = data + line.upper()  # upper-case, then append to the current sequence
seq_list.append(data)  # no header follows the last sequence, so it is appended after the loop

GC_list = []
for seq in seq_list:
    i = 0
    for k in seq:
        if k == 'G' or k == 'C':
            i += 1
    GC_cont = float(i) / len(seq) * 100.0
    GC_list.append(GC_cont)

m = max(GC_list)
print(name_list[GC_list.index(m)])  # look up the name at the index of the maximum GC-content
print("{:0.6f}".format(m))  # keep six decimal places
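Method 1 appends the last sequence after the loop and asks whether it could be handled inside the loop instead. One possible answer, sketched below, is to open an empty slot in the sequence list as soon as a header is seen and then keep extending the last slot. The function name `read_records` and the tiny input are made up for this example.

```python
def read_records(lines):
    """Return (names, seqs) without a separate append after the loop."""
    names = []
    seqs = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith('>'):
            names.append(line[1:])
            seqs.append('')  # open an empty slot for the new record
        elif seqs:
            seqs[-1] += line.upper()  # extend the most recent record
    return names, seqs


names, seqs = read_records(['>a', 'ac', 'gt', '>b', 'GG'])
print(names)  # ['a', 'b']
print(seqs)   # ['ACGT', 'GG']
```

Because every record lives in the list from the moment its header is read, the last record needs no special treatment. Method 3 uses the same idea with a dictionary.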

Method 2:

# -*- coding: utf-8 -*-

def parse_fasta(s):
    results = {}
    # split() cuts the string at the given separator; with the optional
    # num argument, at most num splits are made.
    strings = s.strip().split('>')
    for s in strings:
        # Skip empty chunks (the text before the first '>').
        if len(s) == 0:
            continue
        parts = s.split()
        label = parts[0]
        bases = ''.join(parts[1:])
        results[label] = bases
    return results


def gc_content(s):
    n = len(s)
    m = 0
    for c in s:
        if c == 'G' or c == 'C':
            m += 1
    return 100 * (float(m) / n)


if __name__ == "__main__":
    small_dataset = """
>Rosalind_6404
CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC
TCCCACTAATAATTCTGAGG
>Rosalind_5959
CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT
ATATCCATTTGTCAGCAGACACGC
>Rosalind_0808
CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC
TGGGAACCTGCGGGCAGTAGGTGGAAT
"""
    #large_dataset = open('datasets/rosalind_gc.txt').read()

    # First, results maps name -> sequence; after the next line it maps
    # name -> GC percentage.
    results = parse_fasta(small_dataset)
    # items() works in Python 2 and 3; Python 2's iteritems() gives the same result here.
    results = dict([(k, gc_content(v)) for k, v in results.items()])

    highest_k = None
    highest_v = 0
    for k, v in results.items():
        if v > highest_v:
            highest_k = k
            highest_v = v

    # Print the record with the highest GC-content.
    print(highest_k)
    print('%f' % highest_v)  # six decimals, no percent sign, as in the sample output

Method 3:

# -*- coding: utf-8 -*-

### 5. Computing GC Content ###
from operator import itemgetter
from collections import OrderedDict

seqTest = OrderedDict()
gcContent = OrderedDict()

with open('Computing_GC_Content.txt', 'rt') as f:
    for line in f:
        line = line.rstrip()
        if line.startswith('>'):
            seqName = line[1:]
            seqTest[seqName] = ''
            continue
        seqTest[seqName] += line.upper()

for ke, val in seqTest.items():
    totalLength = len(val)
    gcNumber = val.count('G') + val.count('C')
    gcContent[ke] = (float(gcNumber) / totalLength)*100

sortedGCContent = sorted(gcContent.items(), key=itemgetter(1))
largeName = sortedGCContent[-1][0]
largeGCContent = sortedGCContent[-1][1]

print ('most GC ratio gene is %s and it is %s ' % (largeName, largeGCContent))
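Method 3 sorts all records only to read off the last one. As a possible alternative, the built-in `max` with a `key` function picks the largest entry directly. The dictionary below holds made-up names and percentages purely for illustration.

```python
# Hypothetical name -> GC percentage mapping.
gc_by_name = {"seq_a": 50.0, "seq_b": 62.5, "seq_c": 41.2}

# max() compares the (name, value) pairs by their value only.
best_name, best_gc = max(gc_by_name.items(), key=lambda item: item[1])

print(best_name)                 # seq_b
print("{:.6f}".format(best_gc))  # 62.500000
```

With at most 10 strings the difference in running time is negligible; the main gain is that the intent (take the maximum) is stated directly.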

  
