http://sourceforge.net/projects/het-smooth/

Sequencing technologies, such as Illumina sequencing, provide the sequences of
short "reads" of DNA that come from random positions on the genome. These reads
then must be assembled de-novo into the original genome, or, if there is a
reference genome available, mapped onto the reference genome. These tasks, especially de-novo assembly, become more difficult if the genome is heterozygous.

het-smooth is an experimental program to smooth out heterozygosity in DNA sequence reads by identifying isolated SNPs and changing each one to only one of the heterozygous variants. It is intended for use on sequence data from diploid genomes. It accepts input data in FASTA or FASTQ format, and for each input file, it writes an output FASTA or FASTQ file containing the reads with a reduced rate of heterozygosity. See the README for more details.

http://sourceforge.net/p/het-smooth/code/ci/master/tree/

Installation:

git clone git://git.code.sf.net/p/het-smooth/code

Edit the Makefile.

Locate your own Jellyfish installation; if you cannot find it, download another copy.

After making the changes, run sudo make.

Read Me

				  INTRODUCTION

het-smooth is a program to smooth out heterozygosity in DNA sequence reads.

Sequencing technologies, such as Illumina sequencing, provide the sequences of
short "reads" of DNA that come from random positions on the genome.  These reads
then must be assembled de-novo into the original genome, or, if there is a
reference genome available, mapped onto the reference genome.

These tasks, especially de-novo assembly, become more difficult if the genome is
heterozygous.  All diploid organisms have some rate of heterozygosity
(differences between each chromosome set), but some organisms in particular have
a very high rate of heterozygosity.  

`het-smooth' is an experimental program to reduce the heterozygosity rate of
reads.  It accepts input data in FASTA or FASTQ format.

				  INSTALLATION

`het-smooth' must be compiled and run on a modern UNIX system, such as
GNU/Linux.  It currently must be compiled with g++ because it uses built-in
atomic functions.

`het-smooth' also requires Jellyfish
(http://www.cbcb.umd.edu/software/jellyfish/) to be installed because it relies
on it for k-mer counting.  `het-smooth' should be able to spawn the `jellyfish'
binary as well as link to `libjellyfish.so'.  Edit the Makefile to make sure
that JELLYFISH_CXXFLAGS and JELLYFISH_LDFLAGS correctly specify the location of
the Jellyfish library and headers.

After installing Jellyfish and editing the Makefile, run `make' to build the
binary `het-smooth'.  There is no `make install' yet; just copy the binary
somewhere if you want to.

				     USAGE

Usage: het-smooth (FASTA_FILE | FASTQ_FILE)...
Reduce the level of heterozygosity in the read set contained in the
specified FASTA or FASTQ files, which may be gzipped.

   -k, --kmer-len=LEN    use a kmer size of LEN; must be odd (default: 19)
   -B, --bottom-threshold=NUM
                         use NUM as the lower bound on the kmer coverage for
                           heterozygosity
   -T, --top-threshold=NUM
                         use NUM as the upper bound on the kmer coverage for
                           heterozygosity
   -C, --correct-erroneous-kmers
                         do not place a lower bound on the coverage of k-mers
                           for them to be corrected.  This may result in the
                           correction of some of the erroneous k-mers.
                           However, this may not work as well as dedicated
                           error-correction programs such as Quake.
   -j, --jellyfish-hash-file=FILE
                         use existing kmer counts hash table for the reads
                           produced by the Jellyfish program in FILE
   -s, --jellyfish-hash-size=SIZE
                         use SIZE as the Jellyfish hash size parameter
                           (default: 100000000)
   -t, --threads=NUM     use NUM threads for Jellyfish kmer counting and for
                           processing reads files in parallel
                           (default: number of processors)
   -Q, --quality-weight
                         weight kmer counts by quality
   -q, --quality-start=NUM
                         assume that the lowest quality value starts at NUM
                           (default: 64) (only used if -Q given)
   -l, --kmer-replacements-log=FILE
                         log all kmer replacement pairs to FILE
                           (default: don't log)
   --no-multibase-replacements
                         do not do multibase replacements
   --no-reads-log        do not log read changes
   -h, --help            display this help and exit

The essential options to set are --kmer-len, --bottom-threshold, and
--top-threshold.  Furthermore, it is recommended to give
--no-multibase-replacements until the usefulness of the multibase replacements
algorithm is verified.  We have been using --kmer-len=23 for the pineapple
genome, which is a ~400 megabase, very repetitive genome.  Smaller, less
repetitive genomes could do with --kmer-len=19, or perhaps 17 or 21.

If you run `jellyfish count' on your data to count k-mers of this k-mer length
and then run `jellyfish histo', you will see how many k-mers have a given
coverage level.  If you plot this histogram, for a heterozygous genome it will
be bimodal; the lower peak is for k-mers that appear in the diploid genome one
time, while the upper peak is for k-mers that appear in the diploid genome two
times (probably once in each chromosome set).  --bottom-threshold should be set
to approximately the first minimum in the k-mer coverage plot near the beginning
of the lower peak, while --top-threshold should be set to a little past the
second minimum in the k-mer coverage plot, a little past the end of the lower
peak.  The goal is for these thresholds to include the coverage levels of k-mers
that appear only once in the diploid genome, and are therefore in heterozygous
regions.
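
As a rough illustration of the threshold choice described above, the following
Python sketch parses a two-column "coverage count" histogram (the layout printed
by `jellyfish histo') and reports the first and second local minima.  The
function names are hypothetical and are not part of het-smooth, and the
histogram values are made up.  Real histograms are noisy, so smoothing the
counts or checking the plot by eye is still advisable.

```python
def parse_histo(text):
    """Parse 'coverage count' lines into a dict {coverage: count}."""
    hist = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 2:
            hist[int(fields[0])] = int(fields[1])
    return hist


def suggest_thresholds(hist):
    """Return (first_minimum, second_minimum) coverage values.

    Hypothetical helper: it assumes a clean bimodal histogram with no
    noise, which real data will not give you without smoothing.
    """
    coverages = sorted(hist)
    counts = [hist[c] for c in coverages]

    def next_turn(start, want_min):
        # Walk right while the curve keeps falling (towards a minimum)
        # or keeps rising (towards a maximum); stop at the turning point.
        i = start
        while i + 1 < len(counts):
            falling = counts[i + 1] < counts[i]
            if falling != want_min:
                break
            i += 1
        return i

    first_min = next_turn(0, want_min=True)            # valley after the error k-mers
    lower_peak = next_turn(first_min, want_min=False)  # peak of once-per-genome k-mers
    second_min = next_turn(lower_peak, want_min=True)  # valley between the two peaks
    return coverages[first_min], coverages[second_min]


# Made-up histogram with a lower peak at coverage 7 and an upper peak at 11.
EXAMPLE = """1 900
2 300
3 80
4 20
5 35
6 90
7 150
8 95
9 40
10 110
11 260
12 120"""

bottom, second_minimum = suggest_thresholds(parse_histo(EXAMPLE))
print(bottom, second_minimum)
```

The first value is a candidate for --bottom-threshold.  The text above
recommends setting --top-threshold a little past the second minimum, so the
second value is a starting point rather than the final setting.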

An example of using the program would be:

het-smooth --kmer-len=23 --bottom-threshold=38 --top-threshold=220	\
	   --no-multibase-replacements --jellyfish-hash-file=23-mers.jf \
	   reads_1.fq reads_2.fq

In this example, the program will produce new files (reads_1.het-smoothed.fq and
reads_2.het-smoothed.fq) that contain the heterozygosity-smoothed reads.  There
also will be files reads_1.het-smoothed.log and reads_2.het-smoothed.log that
list the original and new reads for each modified read, although this can be
disabled by the --no-reads-log option; furthermore, it may be more useful to log
the actual k-mer replacements that were computed by providing the
--kmer-replacements-log option (this probably should be made the default).  The
name of the Jellyfish hash table containing 23-mer counts is given as
23-mers.jf, but if you do not specify this option, het-smooth will spawn a
jellyfish process that will produce the hash table from the FASTQ files
specified for heterozygosity smoothing.

het-smooth will strip the directory name from the input files when it determines
the output files, so the output files will be written in the current directory.
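
The naming rule described above can be sketched in Python as follows.  This is
only an illustration inferred from the example file names; the function name is
hypothetical, and how het-smooth itself names the output for gzipped inputs is
not covered here.

```python
import os


def smoothed_names(input_path):
    """Return (reads_file, log_file) names for one input file.

    Hypothetical helper mirroring the README example: the directory is
    stripped and ".het-smoothed" is inserted before the extension.
    """
    base = os.path.basename(input_path)    # drop the directory part
    stem, ext = os.path.splitext(base)     # "reads_1", ".fq"
    return stem + ".het-smoothed" + ext, stem + ".het-smoothed.log"


print(smoothed_names("/data/run1/reads_1.fq"))
```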

The --quality-weight option currently does not work.

				   ALGORITHM

We designed and implemented an algorithm to preprocess the libraries to remove
some of the heterozygosity.  Although we have used this algorithm to improve the
pineapple genome assembly, it is primarily a proof-of-concept algorithm; it
eventually would be better to integrate the algorithm (or a modified version of
it) into Allpaths-LG, so that it could assemble highly heterozygous genomes by
default.

Our algorithm analyzes the reads at the level of k-mers.  The basic idea is that
if we use an odd value of k, we can go through each k-mer that appears in the
genome and see if there exists a lexicographically smaller k-mer in the
genome that differs only by the center base.  If so, we can replace the
former k-mer with the latter k-mer every time it appears in the reads.
Reverse complements can be handled by only considering canonical k-mers
when looking for replacements, but then replacing both the forward and
reverse complement versions of the k-mer with the correctly oriented
replacement when actually doing the replacements.
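
The basic idea can be sketched in Python as shown below.  This is a minimal
illustration and not the het-smooth source code: the function names are
hypothetical, the set of k-mers "in the genome" is simply given as input, the
first matching variant is taken, and k = 5 is used only to keep the example
readable.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def canonical(kmer):
    """The lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, reverse_complement(kmer))


def find_replacements(genome_kmers):
    """Map each canonical k-mer to a center-base variant that is smaller.

    genome_kmers is a set of canonical k-mers assumed to be in the genome.
    The stored variant has the same orientation as the key.
    """
    replacements = {}
    for kmer in genome_kmers:
        mid = len(kmer) // 2
        for base in "ACGT":
            if base == kmer[mid]:
                continue
            variant = kmer[:mid] + base + kmer[mid + 1:]
            if canonical(variant) in genome_kmers and canonical(variant) < kmer:
                replacements[kmer] = variant
                break  # sketch only: take the first match
    return replacements


def smooth_read(read, k, replacements):
    """Apply center-base replacements to every k-mer window of a read."""
    bases = list(read)
    mid = k // 2
    for i in range(len(read) - k + 1):
        window = read[i:i + k]
        canon = canonical(window)
        target = replacements.get(canon)
        if target is None:
            continue
        if window != canon:
            # The read shows the reverse-complement strand, so orient
            # the replacement the same way before copying its center base.
            target = reverse_complement(target)
        bases[i + mid] = target[mid]
    return "".join(bases)


table = find_replacements({"AAGTC", "AACTC"})
forward = smooth_read("TTAAGTCGG", 5, table)
reverse = smooth_read(reverse_complement("TTAAGTCGG"), 5, table)
print(table, forward, reverse)
```

Both orientations of the read end up carrying the same variant, which is the
point of looking up canonical k-mers and re-orienting the replacement.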

The effect of this algorithm is to remove SNPs from the data.  Ideally, each SNP
would be changed to only one of the two variants, thereby removing the SNP from
the data given to the assembler.

There are many problems with this heterozygosity smoothing algorithm, some of
which we have addressed in our implementation, and some of which we have not yet
been able to address.

The algorithm requires knowing which k-mers appear in the genome and which do
not.  We estimate this from the reads by examining k-mer coverage.  We require
the "from" and "to" k-mers to have coverage high enough to make it unlikely for
either k-mer to contain a sequencing error.  Furthermore, to reduce the chance
of making incorrect replacements, which would produce misassembled sequence, we
only add a replacement if there is exactly one valid replacement given the
possible substitutions of the middle base.  We also establish an upper bound
on the coverage for the "from" k-mer, since repeated sequences can cause
slightly differing sequences to show up on the same chromosome set, and these
could easily be mistaken for SNPs.
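
A minimal Python sketch of these filters follows.  It is one possible reading
of the paragraph above, not the actual implementation: the function name is
hypothetical, the k-mer counts are invented, and it assumes that a "to" k-mer
only needs to reach the lower bound while the "from" k-mer must lie between the
lower and upper bounds.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """The lexicographically smaller of a k-mer and its reverse complement."""
    return min(kmer, kmer.translate(COMPLEMENT)[::-1])


def build_replacements(counts, bottom, top):
    """Build a {from_kmer: to_kmer} table from canonical k-mer coverage counts.

    Assumed rules (an interpretation of the text, not verified against
    the source code):
      * the "from" k-mer has coverage in [bottom, top];
      * a center-base variant is valid if its coverage is >= bottom;
      * a replacement is added only if exactly one variant is valid
        and that variant is lexicographically smaller.
    """
    replacements = {}
    for kmer, coverage in counts.items():
        if not bottom <= coverage <= top:
            continue  # likely a sequencing error or a repeat
        mid = len(kmer) // 2
        valid = []
        for base in "ACGT":
            if base == kmer[mid]:
                continue
            variant = kmer[:mid] + base + kmer[mid + 1:]
            if counts.get(canonical(variant), 0) >= bottom:
                valid.append(variant)
        if len(valid) == 1 and canonical(valid[0]) < kmer:
            replacements[kmer] = valid[0]
    return replacements


counts = {
    "AAGTC": 40,   # heterozygous variant, within the bounds
    "AACTC": 45,   # the other variant of the same SNP
    "AATTC": 2,    # low coverage, treated as a sequencing error
    "CCGTA": 300,  # above the upper bound, treated as a repeat
    "CCATA": 50,
}
print(build_replacements(counts, bottom=30, top=100))
```

In this made-up example only AAGTC is replaced (by AACTC).  CCGTA is skipped
because its coverage exceeds the upper bound, and a site with three
well-covered variants would be skipped because the replacement is ambiguous.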

If we were to restrict k-mer replacements only to the middle base in each k-mer,
this would prevent any changes from being made to the first k/2 - 1 or the last
k/2 - 1 k-mers of each read.  This is a big problem because a SNP, even in the
best case, would not get smoothed out in reads for which the SNP site happens to
be near one of the ends of the read.  To solve this problem, we add an
additional step to the algorithm; after adding the replacements as already
described, we go through the reads and look for the locations where the
middle-base replacements would be made.  For each such location, we examine the
surrounding k/2-1 k-mers and add replacements containing the SNP site as one of
the non-center base pairs, provided that the coverage falls within the bounds
mentioned earlier.  Later, when actually doing the replacements, these extra
replacements are only actually done if the k-mer appears as the first or last
k-mer in the read.

				  LIMITATIONS

The algorithm currently only smooths out SNPs that are separated from other SNPs
by at least k/2 - 1 base pairs.  If there is a cluster of SNPs, they will not be
smoothed out.

The algorithm does not handle insertions, deletions, or inversions.  These are
believed to be much rarer than SNPs, but this may still be a huge limitation.

Heterozygous sites that occur very close to sequencing errors are unlikely to be
smoothed out.

There is no built-in capability to this program to recover the heterozygosity
after the assembly; other tools would have to be used for this (you would want
to align the original reads to the draft genome and call the SNPs that were
removed).

		    CORRECTNESS OF HETEROZYGOSITY SMOOTHING

By chance, it is possible for this program to make changes to the reads that
would not be strongly supported when considering the larger context.  It is
important to use a large enough k-mer size to reduce the chance of incorrect
replacements being made.  Providing the --no-multibase-replacements
argument is recommended because on large datasets it is much too likely for two
k-mers to differ by only two base pairs purely by chance.  Providing the
--correct-erroneous-kmers option is NOT recommended.

				  PERFORMANCE

The program is multi-threaded, and the running time of the program is roughly
linear with the number of reads that must be processed.  Computing the
middle-base replacements is fast compared to computing the edge-base
replacements and then doing the k-mer replacements.  We have used the program to
reduce the heterozygosity in 235 GB of reads from the pineapple genome, and the
running time was several hours on a server-class computer.

				      TODO

The algorithm needs to be improved.

Make the program automatically determine the lower and upper thresholds for
heterozygosity.

It probably would make more sense for stand-alone assemblers to be able to
assemble highly heterozygous data, rather than have a separate program to smooth
out heterozygosity.

				    LICENCE

This code is licensed under the GNU General Public License version 3 or later.
There is no warranty whatsoever.  See COPYING for details.

				    SPONSORS

The development of this algorithm and program was done as part of Cold Spring
Harbor Laboratory's Undergraduate Research Program.  Funding for this program is
provided in part by the National Science Foundation.

		    QUESTIONS / BUGS / SUGGESTIONS / PATCHES

Eric Biggers		ebiggers3 at gmail.com
Michael Schatz		mschatz at cshl.edu

freemao

FAFU
