All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACGAATTCCG". When studying DNA, it is sometimes useful to identify repeated sequences within the DNA.

Write a function to find all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule.

For example,

Given s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT",

Return:
["AAAAACCCCC", "CCCCCAAAAA"].

Naive 方法就是两层循环,外层for(int i=0; i<=s.length()-10; i++), 内层for(int j=i+1; j<=s.length()-10; j++), 比较两个字符串s.substring(i, i+10)和s.substring(j, j+10)是否equal, 是的话,加入到result里,这样两层循环再加equals()时间复杂度应该到O(N^3)了,TLE了

方法2:进一步的方法是用HashSet, 每次取长度为10的字符串,O(N)时间遍历数组,重复就加入result,但这样需要O(N)的space, 准确说来O(N*10bytes), java而言一个char是2 bytes,所以O(N*20bytes)。String一大就MLE

最优解:是在方法2基础上用bit operation,大概思想是把字符串映射为整数,对整数进行移位以及位与操作,以获取相应的子字符串。众所周知,位操作耗时较少,所以这种方法能节省运算时间。

首先考虑将ACGT进行二进制编码

A -> 00

C -> 01

G -> 10

T -> 11

在编码的情况下,每10位字符串的组合即为一个数字,且10位的字符串有20位;一般来说int有4个字节,32位,即可以用于对应一个10位的字符串。例如

ACGTACGTAC -> 00011011000110110001

AAAAAAAAAA -> 00000000000000000000

每次向右移动1位字符,相当于字符串对应的int值左移2位,再将其最低2位置为新的字符的编码值,最后将高2位置0。

Cost分析:

时间复杂度O(N), 而且众所周知,位操作耗时较少,所以这种方法能节省运算时间。

省空间,原来10个char要10 Byte,现在10个char总共20bit,总共O(N*20bits)

空间复杂度:20位的二进制数,至多有2^20种组合,因此HashSet的大小为2^20,即1024 * 1024,O(1)

 public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
ArrayList<String> res = new ArrayList<String>();
if (s==null || s.length()<=10) return res;
HashMap<Character, Integer> dict = new HashMap<Character, Integer>();
dict.put('A', 0);
dict.put('C', 1);
dict.put('G', 2);
dict.put('T', 3);
HashSet<Integer> set = new HashSet<Integer>();
HashSet<String> result = new HashSet<String>(); //directly use arraylist to store result may not avoid duplicates, so use hashset to preselect
int hashcode = 0;
for (int i=0; i<s.length(); i++) {
if (i < 9) {
hashcode = (hashcode<<2) + dict.get(s.charAt(i));
}
else {
hashcode = (hashcode<<2) + dict.get(s.charAt(i));
hashcode &= (1<<20) - 1;
if (!set.contains(hashcode)) {
set.add(hashcode);
}
else {
//duplicate hashcode, decode the hashcode, and add the string to result
String temp = s.substring(i-9, i+1);
result.add(temp);
}
}
}
for (String item : result) {
res.add(item);
}
return res;
}
}

naive方法:

 public class Solution {
public List<String> findRepeatedDnaSequences(String s) {
ArrayList<String> res = new ArrayList<String>();
if (s==null || s.length()<=10) return res;
for (int i=0; i<=s.length()-10; i++) {
String cur = s.substring(i, i+10);
for (int j=i+1; j<=s.length()-10; j++) {
String comp = s.substring(j, j+10);
if (cur.equals(comp)) {
res.add(cur);
break;
}
}
}
return res;
}
}

Leetcode: Repeated DNA Sequence的更多相关文章

  1. [LeetCode] Repeated DNA Sequences 求重复的DNA序列

    All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACG ...

  2. [Leetcode] Repeated DNA Sequences

    All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACG ...

  3. LeetCode() Repeated DNA Sequences 看的非常的过瘾!

    All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACG ...

  4. [LeetCode] Repeated DNA Sequences hash map

    All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: "ACG ...

  5. LeetCode 187. 重复的DNA序列(Repeated DNA Sequences)

    187. 重复的DNA序列 187. Repeated DNA Sequences 题目描述 All DNA is composed of a series of nucleotides abbrev ...

  6. 187. Repeated DNA Sequences

    题目: All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: " ...

  7. lc面试准备:Repeated DNA Sequences

    1 题目 All DNA is composed of a series of nucleotides abbreviated as A, C, G, and T, for example: &quo ...

  8. POJ 2778 DNA Sequence(AC自动机+矩阵加速)

    DNA Sequence Time Limit: 1000MS   Memory Limit: 65536K Total Submissions: 9899   Accepted: 3717 Desc ...

  9. POJ 2778 DNA Sequence (AC自己主动机 + dp)

    DNA Sequence 题意:DNA的序列由ACTG四个字母组成,如今给定m个不可行的序列.问随机构成的长度为n的序列中.有多少种序列是可行的(仅仅要包括一个不可行序列便不可行).个数非常大.对10 ...

随机推荐

  1. Android O PackageInstaller 解析

    Android O 8.0 1.src\com\android\packageinstaller\permission\mode\PermissionGroups.java @Override pub ...

  2. DOS 如何取当前时间做为文件名?

    如果要取得以日期为文件名的文件,假设在命令行下键入date返回形式为:当前日期: 2005-06-02 星期四echo > %date:~0,4%%date:~5,2%%date:~8,2%~表 ...

  3. Building Boost for Android with error “cannot find -lrt”

    编辑tools/build/src/tools/gcc.jam rule setup-threading ( targets * : sources * : properties * ){ local ...

  4. 重建索引:ALTER INDEX..REBUILD ONLINE vs ALTER INDEX..REBUILD

    什么时候需要重建索引 1. 删除的空间没有重用,导致 索引出现碎片 2. 删除大量的表数据后,空间没有重用,导致 索引"虚高" 3.索引的 clustering_facto 和表不 ...

  5. web前端面试题(一)

    1  选择题 1.1   默认情况下,使用P标记会形成什么效果() A.在文字P所在位置中加入8个空格 B.P后面的文字会变成粗体 C.开始新的一行 D.P后面的文字会变成斜体 答案: C 1.2   ...

  6. 【CF932F】Escape Through Leaf 启发式合并set维护凸包

    [CF932F]Escape Through Leaf 题意:给你一棵n个点的树,每个点有树形ai和bi,如果x是y的祖先,则你可以从x花费$a_x\times b_y$的费用走到y(费用可以为负). ...

  7. Android LayoutCast 初探

    今天无意间看见了一个神器,顿时让我血气蓬勃! 废话不多说,先上网址:https://github.com/mmin18/LayoutCast 把代码和资源文件的改动直接同步到手机上,应用不需要重启.省 ...

  8. Unity3D笔记 英保通十 射线碰撞器检测

    射线碰撞检测可以用来检测方向和距离: 通过Physics.RayCast光线投射来实现:常用于射击利用发射的射线来判断.还有对战中刀剑交战中.. 一.要涉及到RayCast和RayCastHit 1. ...

  9. Logstash自带正则表达式

    USERNAME [a-zA-Z0-._-]+ USER %{USERNAME} INT (?:[+-]?(?:[-]+)) BASE10NUM (?<![-.+-])(?>[+-]?(? ...

  10. 安装支持eigen线性迭代的ceres_solver

    Ceres可以求解以下形式的有界约束非线性最小二乘问题: 这种形式的问题来源于科学工程的多个领域,从统计学的曲线拟合到计算机视觉中从图像中构建三维模型. 最近在做sfm方面的重建问题,需要对得到的相机 ...