05 Computing GC Content

Problem

The GC-content of a DNA string is given by the percentage of symbols in the string that are 'C' or 'G'. For example, the GC-content of "AGCTATAG" is 37.5%. Note that the reverse complement of any DNA string has the same GC-content.

DNA strings must be labeled when they are consolidated into a database. A commonly used method of string labeling is called FASTA format. In this format, the string is introduced by a line that begins with '>', followed by some labeling information. Subsequent lines contain the string itself; the first line to begin with '>' indicates the label of the next string.

In Rosalind's implementation, a string in FASTA format will be labeled by the ID "Rosalind_xxxx", where "xxxx" denotes a four-digit code between 0000 and 9999.

Given: At most 10 DNA strings in FASTA format (of length at most 1 kbp each).

Return: The ID of the string having the highest GC-content, followed by the GC-content of that string. Rosalind allows for a default error of 0.001 in all decimal answers unless otherwise stated; please see the note on absolute error below.

Sample Dataset

>Rosalind_6404

CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC

TCCCACTAATAATTCTGAGG

>Rosalind_5959

CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT

ATATCCATTTGTCAGCAGACACGC

>Rosalind_0808

CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC

TGGGAACCTGCGGGCAGTAGGTGGAAT

Sample Output

Rosalind_0808

60.919540

方法一：

# -*- coding: utf-8 -*-

# to open FASTA format sequence file:

s=open('Computing_GC_Content.txt','r').readlines()

# to create two lists, one for names, one for sequences

name_list=[]

seq_list=[]

data='' # to put the sequence from several lines together

for line in s:

    line=line.strip()

    for i in line:

        if i == '>':

            name_list.append(line[1:])

            if data:

                seq_list.append(data)         #将每一行的的核苷酸字符串连接起来

                data=''                       # 合完后data 清零

            break

        else:

            line=line.upper()

    if all([k==k.upper() for k in line]):    #验证是不是所有的都是大写

        data=data+line

seq_list.append(data)                         # is there a way to include the last sequence in the for loop?

GC_list=[]

for seq in seq_list:

    i=0

    for k in seq:

        if k=="G" or k=='C':

            i+=1

    GC_cont=float(i)/len(seq)*100.0

    GC_list.append(GC_cont)

m=max(GC_list)

print name_list[GC_list.index(m)]              # to find the index of max GC

print "{:0.6f}".format(m)                    # 保留6位小数

　　方法二：

# -*- coding: utf-8 -*-

def parse_fasta(s):

    results = {}

    strings = s.strip().split('>')

    # Python split()通过指定分隔符对字符串进行切片，如果参数num 有指定值，则仅分隔 num 个子字符串

    for s in strings:

        if len(s) == 0:

            continue

            # 如果字符串长度为0，就跳出循环。

        parts = s.split()

        label = parts[0]

        bases = ''.join(parts[1:])

        results[label] = bases

    return results

def gc_content(s):

    n = len(s)

    m = 0

    for c in s:

        if c == 'G' or c == 'C':

            m += 1

    return 100 * (float(m) / n)

if __name__ == "__main__":

    small_dataset = """

    >Rosalind_6404

    CCTGCGGAAGATCGGCACTAGAATAGCCAGAACCGTTTCTCTGAGGCTTCCGGCCTTCCC

    TCCCACTAATAATTCTGAGG

    >Rosalind_5959

    CCATCGGTAGCGCATCCTTAGTCCAATTAAGTCCCTATCCAGGCGCTCCGCCGAAGGTCT

    ATATCCATTTGTCAGCAGACACGC

    >Rosalind_0808

    CCACCCTCGTGGTATGGCTAGGCATTCAGGAACCGGAGAACGCTTCAGACCAGCCCGGAC

    TGGGAACCTGCGGGCAGTAGGTGGAAT

    """

    #large_dataset = open('datasets/rosalind_gc.txt').read()

    results = parse_fasta(small_dataset)

    results = dict([(k, gc_content(v)) for k, v in results.iteritems()])

    # 这里iteritem()和item()功能是一样的

    # 前一个results输出，名称+序列，后一个results输出，名称+百分比

    highest_k = None

    highest_v = 0

    for k, v in results.iteritems():

        if v > highest_v:

            highest_k = k

            highest_v = v

            # 输出GC含量高的

    print highest_k

    print '%f%%' % highest_v

　　方法三：

# -*- coding: utf-8 -*-

### 5. Computing GC Content ###

from operator import itemgetter

from collections import OrderedDict

seqTest = OrderedDict()

gcContent = OrderedDict()

with open('Computing_GC_Content.txt', 'rt') as f:

    for line in f:

        line = line.rstrip()

        if line.startswith('>'):

            seqName = line[1:]

            seqTest[seqName] = ''

            continue

        seqTest[seqName] += line.upper()

for ke, val in seqTest.items():

    totalLength = len(val)

    gcNumber = val.count('G') + val.count('C')

    gcContent[ke] = (float(gcNumber) / totalLength)*100

sortedGCContent = sorted(gcContent.items(), key=itemgetter(1))

largeName = sortedGCContent[-1][0]

largeGCContent = sortedGCContent[-1][1]

print ('most GC ratio gene is %s and it is %s ' % (largeName, largeGCContent))