jellyfish K-mer analysis and genome size estimate
To estimate the size manually, take the number of distinct k-mers under the first peak and multiply it by 1; add the number of distinct k-mers under the peak at 2x the depth of the first peak, multiplied by 2; and so on for all peaks. When the first peak corresponds to sequence present on only one haplotype (a 1/0/0/0 genotype in a tetraploid), this sum counts every haplotype copy separately, so for a tetraploid genome the haploid size is 1/4 of the result.
You can make this more accurate by modelling the peaks as a sum of Gaussian curves, but that probably won't change the result much. Of course, this method is subjective because calling peaks is subjective.
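The manual procedure described above can be sketched in Python. This is a minimal illustration, not the original author's code: the function name `estimate_from_peaks`, the `min_depth` cutoff for discarding low-depth error k-mers, and the rule of assigning each depth to the nearest multiple of the first peak are all assumptions made for the example.

```python
def estimate_from_peaks(hist, first_peak, n_peaks, min_depth):
    """Weighted sum of distinct k-mers, peak by peak.

    hist       : dict mapping depth (multiplicity) -> number of distinct k-mers
    first_peak : depth of the first real peak (chosen by eye, hence subjective)
    n_peaks    : number of peaks to consider (assumption: deeper k-mers are
                 assigned to the last peak)
    min_depth  : depths below this are treated as sequencing errors and skipped
    """
    total = 0
    for depth, n_distinct in hist.items():
        if depth < min_depth:
            continue  # ignore the low-depth error k-mers
        # Nearest peak index: 1 for ~1x first_peak, 2 for ~2x, and so on.
        copies = round(depth / first_peak)
        copies = min(max(copies, 1), n_peaks)
        total += n_distinct * copies
    return total


# Toy histogram with peaks at depth 10 and depth 20 (made-up numbers).
toy = {1: 1000, 9: 10, 10: 100, 11: 10, 19: 5, 20: 50, 21: 5}
weighted_sum = estimate_from_peaks(toy, first_peak=10, n_peaks=2, min_depth=5)
print(weighted_sum)
```

The returned value is the weighted sum from the text; any division by ploidy, as in the tetraploid example, is applied afterwards.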
Please note - I think 17-mers are too short for this kind of analysis. I prefer 31-mers because they are the longest computationally-efficient kmers. Also, FYI, BBNorm is faster than Jellyfish and can also generate kmer-frequency histograms:
khist.sh in=reads.fq hist=khist.txt
Outline
- count k-mer occurrence using Jellyfish (jellyfish count)
- summarize as histogram (jellyfish histo)
- plot graph with R
- determine the total number of k-mers analyzed and the peak position
- compare the peak shape with a Poisson distribution
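The last outline step, comparing the peak shape with a Poisson distribution, can be sketched as follows. This is only an illustration under stated assumptions: the function names `poisson_pmf` and `poisson_expected` are hypothetical, the Poisson mean is taken to be the observed peak depth, and the curve is scaled so that its area over a chosen depth window matches the observed counts.

```python
import math


def poisson_pmf(k, lam):
    """Poisson probability of observing depth k when the mean depth is lam.

    Computed in log space so that large k does not overflow.
    """
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def poisson_expected(hist, peak, lo, hi):
    """Expected number of distinct k-mers per depth under Poisson(peak).

    hist   : dict mapping depth -> number of distinct k-mers
    peak   : observed peak depth, used as the Poisson mean (assumption)
    lo, hi : inclusive depth window around the peak used for scaling
    """
    depths = range(lo, hi + 1)
    observed_total = sum(hist.get(d, 0) for d in depths)
    pmf_total = sum(poisson_pmf(d, peak) for d in depths)
    scale = observed_total / pmf_total
    return {d: scale * poisson_pmf(d, peak) for d in depths}


# Toy check: a histogram that is exactly Poisson reproduces itself.
toy = {d: 1000 * poisson_pmf(d, 30) for d in range(100)}
expected = poisson_expected(toy, peak=30, lo=10, hi=60)
worst = max(abs(expected[d] - toy[d]) for d in expected)
print(worst < 1e-6)
```

With real data, plotting the observed and expected values side by side shows how much wider the observed peak is than the Poisson curve; a clearly wider peak is common in practice.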
Count k-mer occurrence
In this example we have five pairs of FASTQ files in three different subdirectories. The files to process can be specified with "*/*.qf.fastq" and verified with ls.
$ ls */*.qf.fastq
run1/s_1_1_sequence.qf.fastq  run2/s_2_2_sequence.qf.fastq
run1/s_1_2_sequence.qf.fastq  run3/s_1_1_sequence.qf.fastq
run2/s_1_1_sequence.qf.fastq  run3/s_1_2_sequence.qf.fastq
run2/s_1_2_sequence.qf.fastq  run3/s_2_1_sequence.qf.fastq
run2/s_2_1_sequence.qf.fastq  run3/s_2_2_sequence.qf.fastq
Next, we issue the jellyfish count command
jellyfish count -t 8 -C -m 25 -s 5G -o spec1_25mer --min-quality=20 --quality-start=33 */*.qf.fastq
- -t 8
- specifies the number of threads to use. This value should equal the number of cores on the machine, or the number of slots you reserved through a job management system ($NSLOTS in SGE or UGE).
- -C
- specifies that both strands are considered, so a k-mer and its reverse complement are counted together. If you do not specify this, the apparent depth is halved, which is undesirable.
- -m 25
- specifies the k-mer length; here 25-mers are counted (k=25).
- -s 5G
- specifies the hash size. Set this as high as the physical memory allows. A larger hash makes counting faster, but exceeding the available memory leads to failure or extremely slow counting.
- -o spec1_25mer
- specifies the prefix of output file names.
- --quality-start=33
- specifies that the quality strings in your FASTQ files use an offset of 33. Check the data format carefully: depending on the sequencing system and software version, your data may use an offset of 64 instead. This option is relevant only when you specify --min-quality.
- --min-quality=20
- specifies that nucleotides with a quality value lower than 20 are not included in the count. This filtering reduces the number of k-mers derived from sequencing errors and makes the peak clearer.
- */*.qf.fastq
- will be expanded to the ten filenames explained above by the shell and passed to jellyfish as input files
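Because a wrong --quality-start silently distorts the --min-quality filter, it can help to check the quality offset before counting. The sketch below is a common heuristic, not part of Jellyfish: the function names `read_quality_lines` and `guess_quality_offset` are hypothetical, and the code assumes plain four-line FASTQ records.

```python
from itertools import islice


def read_quality_lines(path, n_records=10000):
    """Return the quality strings of the first n_records FASTQ records.

    Assumes each record spans exactly four lines (no wrapped sequences).
    """
    with open(path) as handle:
        # The quality string is the 4th line of every record (index 3, 7, ...).
        return [line.rstrip("\n") for line in islice(handle, 3, n_records * 4, 4)]


def guess_quality_offset(quality_lines):
    """Guess the quality offset: 33, 64, or None when it cannot be decided."""
    codes = [ord(c) for line in quality_lines for c in line.strip()]
    if not codes:
        return None
    lowest = min(codes)
    if lowest < 59:
        # Characters below ';' (ASCII 59) do not occur in 64-based encodings.
        return 33
    if lowest >= 64:
        # Every character is '@' or above: most likely a 64-based encoding,
        # although a small sample of very high-quality 33-based reads
        # could look the same.
        return 64
    return None  # ASCII 59..63 alone is ambiguous


print(guess_quality_offset(["IIII#III"]))   # '#' is ASCII 35
print(guess_quality_offset(["hhhhgfeB"]))   # lowest character is 'B' (ASCII 66)
```

For a real run, something like `guess_quality_offset(read_quality_lines("run1/s_1_1_sequence.qf.fastq"))` would be called on one of the input files before choosing --quality-start.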
Summarize as histogram (jellyfish histo)
First confirm that you got the output file
$ ls spec1_25mer*
spec1_25mer_0
In this case there is a single output file, spec1_25mer_0, which is passed to jellyfish histo:
$ jellyfish histo -o spec1_25mer.histo spec1_25mer_0
Confirm that you got the output
$ ls spec1_25mer*
spec1_25mer_0  spec1_25mer.histo
Inspect the numbers by eye
$ head -25 spec1_25mer.histo
1 461938583
2 95606044
3 19280477
4 13836754
5 11018480
6 9555090
7 8557935
8 7863244
9 7319505
10 6920880
11 6589723
12 6321923
13 6148638
14 6036120
15 5972264
16 5962234
17 5987696
18 6051171
19 6154429
20 6297373
21 6485135
22 6700579
23 6932570
24 7217627
25 7533211
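The remaining outline steps (finding the peak position and the total number of k-mers) can be sketched in Python from the two-column histogram. This is an illustration under assumptions, not the original author's script: the function names are hypothetical, everything up to the first valley is treated as sequencing error, and the genome size is taken as the common estimate "total k-mers divided by peak depth".

```python
def parse_histo(text):
    """Parse 'depth count' lines into a dict {depth: number of distinct k-mers}."""
    hist = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            hist[int(fields[0])] = int(fields[1])
    return hist


def first_valley(hist):
    """Depth of the first local minimum, i.e. where the counts start rising again."""
    depths = sorted(hist)
    for prev, cur in zip(depths, depths[1:]):
        if hist[cur] > hist[prev]:
            return prev
    return None  # counts never rise: no peak in this range


def main_peak(hist, valley):
    """Depth with the most distinct k-mers among depths above the valley."""
    candidates = {d: n for d, n in hist.items() if d > valley}
    return max(candidates, key=candidates.get)


def genome_size(hist, valley, peak):
    """Total k-mer occurrences above the valley divided by the peak depth."""
    total_kmers = sum(d * n for d, n in hist.items() if d > valley)
    return total_kmers / peak


# Toy histogram (made-up numbers): errors at depth 1-2, peak at depth 4.
toy = parse_histo("1 100\n2 10\n3 20\n4 40\n5 20\n")
valley = first_valley(toy)
peak = main_peak(toy, valley)
print(valley, peak, genome_size(toy, valley, peak))
```

Applied to the 25 lines shown above, `first_valley` returns 16: the counts fall until depth 16 and rise again from depth 17. The peak itself lies beyond the lines shown, so the full histogram file is needed for `main_peak` and `genome_size`. Depending on the histogram settings, the last bin may aggregate all higher depths, which would affect the total.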
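If the requested hash size exceeds the available memory, jellyfish count aborts with an allocation error like the following:
terminate called after throwing an instance of 'jellyfish::invertible_hash::ErrorAllocation'
  what(): Failed to allocate 628292358736 bytes of memory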