All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

Naive 方法就是两层循环,外层for(int i=0; i<=s.length()-10; i++), 内层for(int j=i+1; j<=s.length()-10; j++), 比较两个字符串s.substring(i, i+10)和s.substring(j, j+10)是否equal, 是的话,加入到result里,这样两层循环再加equals()时间复杂度应该到O(N^3)了,TLE了

方法2:进一步的方法是用HashSet, 每次取长度为10的字符串,O(N)时间遍历数组,重复就加入result,但这样需要O(N)的space, 准确说来O(N*10bytes), java而言一个char是2 bytes,所以O(N*20bytes)。String一大就MLE

最优解:是在方法2基础上用bit operation,大概思想是把字符串映射为整数,对整数进行移位以及位与操作,以获取相应的子字符串。众所周知,位操作耗时较少,所以这种方法能节省运算时间。

首先考虑将ACGT进行二进制编码

A -> 00

C -> 01

G -> 10

T -> 11

在编码的情况下,每10位字符串的组合即为一个数字,且10位的字符串有20位;一般来说int有4个字节,32位,即可以用于对应一个10位的字符串。例如

ACGTACGTAC -> 00011011000110110001

AAAAAAAAAA -> 00000000000000000000

每次向右移动1位字符,相当于字符串对应的int值左移2位,再将其最低2位置为新的字符的编码值,最后将高2位置0。

Cost分析:

时间复杂度O(N), 而且众所周知,位操作耗时较少,所以这种方法能节省运算时间。

省空间,原来10个char要10 Byte,现在10个char总共20bit,总共O(N*20bits)

空间复杂度:20位的二进制数,至多有2^20种组合,因此HashSet的大小为2^20,即1024 * 1024,O(1)

 public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
ArrayList<String> res = new ArrayList<String>();
if (s==null || s.length()<=10) return res;
HashMap<Character, Integer> dict = new HashMap<Character, Integer>();
dict.put('A', 0);
dict.put('C', 1);
dict.put('G', 2);
dict.put('T', 3);
HashSet<Integer> set = new HashSet<Integer>();
HashSet<String> result = new HashSet<String>(); //directly use arraylist to store result may not avoid duplicates, so use hashset to preselect
int hashcode = 0;
for (int i=0; i<s.length(); i++) {
if (i < 9) {
hashcode = (hashcode<<2) + dict.get(s.charAt(i));
}
else {
hashcode = (hashcode<<2) + dict.get(s.charAt(i));
hashcode &= (1<<20) - 1;
if (!set.contains(hashcode)) {
set.add(hashcode);
}
else {
//duplicate hashcode, decode the hashcode, and add the string to result
String temp = s.substring(i-9, i+1);
result.add(temp);
}
}
}
for (String item : result) {
res.add(item);
}
return res;
}
}

naive方法:

 public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
ArrayList<String> res = new ArrayList<String>();
if (s==null || s.length()<=10) return res;
for (int i=0; i<=s.length()-10; i++) {
String cur = s.substring(i, i+10);
for (int j=i+1; j<=s.length()-10; j++) {
String comp = s.substring(j, j+10);
if (cur.equals(comp)) {
res.add(cur);
break;
}
}
}
return res;
}
}

Leetcode: Repeated DNA Sequence的更多相关文章

  1. [LeetCode] Repeated DNA Sequences 求重复的DNA序列

    All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACG ...

  2. [Leetcode] Repeated DNA Sequences

    All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACG ...

  3. LeetCode() Repeated DNA Sequences 看的非常的过瘾!

    All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACG ...

  4. [LeetCode] Repeated DNA Sequences hash map

    All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACG ...

  5. LeetCode 187. 重复的DNA序列(Repeated DNA Sequences)

    187. 重复的DNA序列 187. Repeated DNA Sequences 题目描述 All DNA is composed of a series of nucleotides abbrev ...

  6. 187. Repeated DNA Sequences

    题目: All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: " ...

  7. lc面试准备:Repeated DNA Sequences

    1 题目 All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: &quo ...

  8. POJ 2778 DNA Sequence(AC自动机+矩阵加速)

    DNA Sequence Time Limit: 1000MS   Memory Limit: 65536K Total Submissions: 9899   Accepted: 3717 Desc ...

  9. POJ 2778 DNA Sequence (AC自己主动机 + dp)

    DNA Sequence 题意:DNA的序列由ACTG四个字母组成,如今给定m个不可行的序列.问随机构成的长度为n的序列中.有多少种序列是可行的(仅仅要包括一个不可行序列便不可行).个数非常大.对10 ...

随机推荐

  1. truncate表恢复

    Fy_Recover_Data 2014.03.07 更新 -- 现在可以从离线文件中恢复被Truncated的数据了 2014新版本的恢复思路 与 2012的不同 2014年是离线恢复 http:/ ...

  2. spring-data-redis的事务操作深度解析--原来客户端库还可以攒够了事务命令再发?

    一.官方文档 简单介绍下redis的几个事务命令: redis事务四大指令: MULTI.EXEC.DISCARD.WATCH. 这四个指令构成了redis事务处理的基础. 1.MULTI用来组装一个 ...

  3. sysbench 压力测试工具

    一.sysbench压力测试工具简介: sysbench是一个开源的.模块化的.跨平台的多线程性能测试工具,可以用来进行CPU.内存.磁盘I/O.线程.数据库的性能测试.目前支持的数据库有MySQL. ...

  4. HTTP协议详解(真的很经典)(转载)

    HTTP协议详解(真的很经典):http://www.cnblogs.com/li0803/archive/2008/11/03/1324746.html 引言 HTTP是一个属于应用层的面向对象的协 ...

  5. Implicit conversion from enumeration type 'enum CGImageAlphaInfo' to different enumeration type 'CGBitmapinfo' (aka) 'enum CGBitmapInfo')

    The constants for specifying the alpha channel information are declared with the CGImageAlphaInfo ty ...

  6. iOS - Block的循环引用内存泄漏问题探索

    循环引用的原因 众所周知,ARC下用block会产生循环引用的问题,造成泄露的原因是啥呢? 最简单的例子,如下面代码: [self.teacher requestData:^(NSData *data ...

  7. vue---指令怎么写

    我们在考虑做一些功能性的封装的时候,我们会考虑使用vue的指令来做,那么指令应该怎么写: 具体参考: https://cn.vuejs.org/v2/guide/custom-directive.ht ...

  8. Python内置函数之isinstance,issubclass

    isinstance判断一个变量的类型 >>> n1 = 10>>> isinstance (n1,int)True 判断n1是否是数字类型,如果是返回True如果 ...

  9. hdu 1525 Euclid's Game【 博弈论】

    Two players, Stan and Ollie, play, starting with two natural numbers. Stan, the first player, subtra ...

  10. libmv

    What is libmv? libmv, also known as the Library for Multiview Reconstruction (or LMV), is the comput ...