The Daligner Overlap Library

/************************************************************************************\

*                                                                                    *

* Copyright (c) 2014, Dr. Eugene W. Myers (EWM). All rights reserved.                *

*                                                                                    *

* Redistribution and use in source and binary forms, with or without modification,   *

* are permitted provided that the following conditions are met:                      *

*                                                                                    *

*  · Redistributions of source code must retain the above copyright notice, this     *

*    list of conditions and the following disclaimer.                                *

*                                                                                    *

*  · Redistributions in binary form must reproduce the above copyright notice, this  *

*    list of conditions and the following disclaimer in the documentation and/or     *

*    other materials provided with the distribution.                                 *

*                                                                                    *

*  · The name of EWM may not be used to endorse or promote products derived from     *

*    this software without specific prior written permission.                        *

*                                                                                    *

* THIS SOFTWARE IS PROVIDED BY EWM ”AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES,    *

* INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND       *

* FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL EWM BE LIABLE   *

* FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES *

* (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS  *

* OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY      *

* THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING     *

* NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN  *

* IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.                                      *

*                                                                                    *

* For any issues regarding this software and its use, contact EWM at:                *

*                                                                                    *

*   Eugene W. Myers Jr.                                                              *

*   Bautzner Str. 122e                                                               *

*   01099 Dresden                                                                    *

*   GERMANY                                                                          *

*   Email: gene.myers@gmail.com                                                      *

*                                                                                    *

\************************************************************************************/

                      The Daligner Overlap Library

                                      Author:  Gene Myers

                                      First:   July 17, 2013

                                      Current: August 29, 2014

  The commands below permit one to find all significant local alignments between reads

encoded in Dazzler database.  The assumption is that the reads are from a PACBIO RS II

long read sequencer.  That is the reads are long and noisy, up to 15% on average.

  Recall that a database has a current partition that divides it into blocks of a size

that can conveniently be handled by calling the "dalign" overlapper on all the pairs of

blocks producing a collection of .las local alignment files that can then be sorted and

merged into an ordered sequence of sorted files containing all alignments between reads

in the data set.  The alignment records are parsimonious in that they do not record an

alignment but simply a set of trace points, typically every 100bp or so, that allow the

efficient reconstruction of alignments on demand.

1. daligner [-vbd] [-k<int(14)>] [-w<int(6)>] [-h<int(35)>] [-t<int>]

                   [-e<double(.70)] [-l<int(1000)] [-s<int(100)>]

                   [-H<int>] [-M<int>] <subject:db> <target:db> ...

Compare sequences in the trimmed <subject> block against those in the list of <target>

blocks searching for local alignments involving at least -l base pairs (default 1000)

or more, that have an average correlation rate of -e (default 70%).  The local

alignments found will be output in a sparse encoding where a trace point on the

alignment is recorded every -s base pairs of the a-read (default 100bp). The subject

reads are compared in both orientations against each target read and local alignments

meeting the criteria are output to one of several created files described below.  The

-v option turns on a verbose reporting mode that gives statistics on each major step of

the computation.  

There cannot be more than 65,535 reads in a given db block, and each read must be less

than 65,535 characters long.  DBsplit is guaranteed to produce blocks that meet this

condition.  

The options -k, -h, and -w control the initial filtration search for possible matches

between reads.  Specifically, our search code looks for a pair of diagonal bands of

width 2^w (default 2^6 = 64) that contain a collection of exact matching k-mers

(default 14) between the two reads, such that the total number of bases covered by the

k-mer hits is h (default 35). k cannot be larger than 16 in the current implementation.

If the -b option is set, then the daligner assumes the data has a strong compositional

bias (e.g. >65% AT rich), and at the cost of a bit more time, dynamically adjusts k-mer

sizes depending on compositional bias, so that the mers used have an effective

specificity of 4^k. If the -d option is set, then the daligner looks for a ".dust"

track produced by DBdust and if it finds it, then it ignores k-mers that contain any

bases from a low-complexity interval annotated by DBdust. Lastly, some k-mers are

significantly over-represented (e.g. homopolymer runs). These can be suppressed with

the -t option: any k-mer that occurs more than t times in either the subject or target

is not counted in the heuristic. If the -t option is absent then t is effectively

infinity.  

bias (e.g. >65% AT rich), and at the cost of a bit more time, dynamically adjusts k-mer

Invariably, some k-mers are significantly over-represented (e.g. homopolymer runs).

These k-mers create an excessive number of matching k-mer pairs and left unaddressed

would cause daligner to overflow the available physical memory.  One way to deal with

this is to explicitly set the -t parameter which suppresses the use of any k-mer that

occurs more than t times in either the subject or target block.  However, a better way

to handle the situation is to let the program automatically select a value of t that

meets a given memory usage limit specified (in Gb) by the -M parameter.  By default

daligner will use the amount of physical memory as the choice for -M.  If you want to

use less, say only 8Gb on a 24Gb HPC cluster node because you want to run 3 daligner

jobs on the node, then specify -M8.  Specifying -M0 basically indicates that you do not

want daligner to self adjust k-mer suppression to fit within a given amount of memory.

For each subject, target pair of blocks, say X and Y, the program produces files

containing alignments with the names X.Y.[C|N]#.las and if X !≠ Y then also files

Y.X.[C|N]#.las, where C indicates that the b-reads are complemented and N indicates

they are not (both comparisons are performed) and # is the thread that detected and

wrote out the collection of overlaps.  For example, if the defined constant NTHREAD (in

file filter.h) is 4 (the default) when the program is compiled, then 8 or 16 files are

output for each subject,target pair depending on whether or not the subject and target

are the same block.  Each found alignment is recorded as -- a[ab,ae] x b^o[bb,be] --

where a and b are the indices (in the trimmed DB) of the reads that overlap, o

indicates whether the b-read is from the same or opposite strand, and [ab,ae] and

[bb,be] are the intervals of a and b^o, respectively, that align.  The file X.Y.O#.las

contains the alignments produced by thread # for which the a-read is from X and the

b-read is from Y and in orientation O.

While the default parameter settings are good for raw Pacbio data, daligner can be used

for efficiently finding alignments in corrected reads or other less noisy reads. For

example, after error correction, we run "daligner -k15 -s5 -h60 -e.95 -m.80 -s500" on

the corrected reads and at these settings it is very fast.

2. LAsort [-v] <align:las> ...

Sort each .las alignment file specified on the command line. For each file it reads in

all the overlaps in the file and sorts them in lexicographical order of (a,b,o,ab)

assuming each alignment is recorded as a[ab,ae] x b^o[bb,be]. It then writes them all

to a file named <align>.S.las (assuming that the input file was <align>.las). With the

-v option set then the program reports the number of records read and written.

3. LAmerge [-v] <merge:las> <parts:las> ...

Merge the .las files <parts> into a singled sorted file <merge>, where it is assumed

that  the input <parts> files are sorted. Due to operating system limits, the number of

<parts> files must be <= 252.  With the -v option set the program reports the # of

records read and written.

Used correctly, LAmerge and LAsort together allow one to perform an "external" sort

that produces a collection of sorted files containing in aggregate all the local

alignments found by the daligner, such that their concatenation is sorted in order of

(a,b,o,ab). In particular, this means that all the alignments for a given a-read will

be found consecutively in one of the files.  So computations that need to look at all

the alignments for a given read can operate in simple sequential scans of these

sorted files.

4. LAshow [-coU] [-(a|r):<db>] [-i<int(4)>] [-w<int(100)>] [-b<int(10)>]

                 <align:las> [ <reads:range> ... ]

LAshow produces a printed listing of the local alignments contained in the specified

.las file. If a list of read ranges is given then only the overlaps for which the

a-read is in the set specified by the list are displayed. See DBshow for an explanation

of how the list of read ranges is interpreted.

If the -c option is given then a cartoon rendering is displayed, and if -a or -r option

is set then an alignment of the local alignment is displayed. If both are set then the

cartoon is first, and if neither are set then a simple one line summary of the local

alignment is displayed. To display alignments the name of the database containing the

reads must be given (and it must refer to the whole database, not a block thereof).

The -a option puts exactly -w columns per segment of the display, whereas the -r option

puts exactly -w a-read symbols in each segment of the display.  The -r display mode is

useful when one wants to visually compare two alignments involving the same a-read.

The -i option sets the indent for the cartoon and/or alignment if they are requested.

The -w option sets the width of an alignment, the -b sets the number of symbols on

either side of the aligned segments to display, and -U specifies that uppercase should

be used for DNA sequence instead of the default lowercase. If the -o option is set then

only proper overlaps are shown.

5. LAcat <source:las> > <target>.las

Given argument <source>, find all files <source>.1.las, <source>.2.las, ...

<source>.n.<las> where <source>.i.las exists for every i in [1,n].  Then

concatenate these files in order into a single .las file and pipe the result

to the standard output.

6. LAsplit <target:las> (<parts:int> | <path:db>) < <source>.las

If the second argument is an integer n, then divide the alignment file <source>, piped

in through the standard input, as evenly as possible into n alignment files with the

name <target>.i.las for i in [1,n], subject to the restriction that all alignment

records for a given a-read are in the same file.

If the second argument refers to a database <path>.db that has been partitioned, then

divide the input alignment file into block .las files where all records whose a-read is

in <path>.i.db are in <align>.i.las.

7. LAcheck [-vS] [-a:<db>] <align:las> ...

LAcheck checks each .las file for structural integrity.  That is, it makes sure each

file makes sense as a plausible .las file, e.g. values are not out of bound, the number

of records is correct, the number of trace points for a record is correct, and so on.

If the -S option is set then it further checks that the alignments are in sorted order,

and if the database is given with the -a option, then read ranges and lengths are

double checked.  If the -v option is set then a line is output for each .las file

saying either the file is OK or reporting the first error.  If the -v option is not set

then the program runs silently.  The exit status is 0 if every file is deemed good, and

1 if at least one of the files looks corrupted.

8. HPCdaligner [-vbd] [-k<int(14)>] [-w<int(6)>] [-h<int(35)>] [-t<int>]

                      [-e<double(.70)] [-l<int(1000)] [-s<int(100)>]

                      [-H<int>] [-M<int>] [-dal<int(4)>] [-mrg<int(25)>]

                      <path:db> [<first:int>[-<last:int>]]

HPCdaligner writes a UNIX shell script to the standard output that consists of a

sequence of commands that effectively run daligner on all pairs of blocks of a split

database and then externally sorts and merges them using LAsort and LAmerge into a

collection of alignment files with names <path>.#.las where # ranges from 1 to the

number of blocks the data base is split into. These sorted files if concatenated by say

LAcat would contain all the alignments in sorted order (of a-read, then b-read, ...).

Moreover, all overlaps for a given a-read are guaranteed to not be split across files,

so one can run artifact analyzers or error correction on each sorted file in parallel.

The data base must have been previously split by DBsplit and all the parameters, except

-v, -dal, and -mrg, are passed through to the calls to daligner. The defaults for these

parameters are as for daligner. The -v flag, for verbose-mode, is also passed to all

calls to LAsort and LAmerge. -dal and -mrg options are described later. For a database

divided into N sub-blocks, the calls to daligner will produce in total 2TN^2 .las files

assuming daligner runs with T threads. These will then be sorted and merged into N^2

sorted .las files, one for each block pair. These are then merged in ceil(log_mrg N)

phases where the number of files decreases geometrically in -mrg until there is 1 file

per row of the N x N block matrix. So at the end one has N sorted .las files that when

concatenated would give a single large sorted overlap file.

The -dal option (default 4) gives the desired number of block comparisons per call to

daligner. Some must contain dal-1 comparisons, and the first dal-2 block comparisons

even less, but the HPCdaligner "planner" does the best it can to give an average load

of dal block comparisons per command. The -mrg option (default 25) gives the maximum

number of files that will be merged in a single LAmerge command. The planner makes the

most even k-ary tree of merges, where the number of levels is ceil(log_mrg N).

If the integers <first> and <last> are missing then the script produced is for every

block in the database.  If <first> is present then HPCdaligner produces an incremental

script that compares blocks <first> through <last> (<last> = <first> if not present)

against each other and all previous blocks 1 through <first>-1, and then incrementally

updates the .las files for blocks 1 through <first>-1, and creates the .las files for

blocks <first> through <last>.

Each UNIX command line output by the HPCdaligner can be a batch job (we use the &&

operator to combine several commands into one line to make this so). Dependencies

between jobs can be maintained simply by first running all the daligner jobs, then all

the initial sort jobs, and then all the jobs in each phase of the external merge sort.

Each of these phases is separated by an informative comment line for your scripting

convenience.

Example:

//  Recall G.db from the example in DAZZ_DB/README

> cat G.db

files =         1

       1862 G Sim

blocks =         2

size =        11 cutoff =         0 all = 0

         0         0

      1024      1024

      1862      1862

> HPCplanner -d -t5 G | csh -v   // Run the HPCdazzler script

# Dazzler jobs (2)

dazzler -d -t5 G.1 G.1

dazzler -d -t5 G.2 G.1 G.2

# Initial sort jobs (4)

LAsort G.1.G.1.*.las && LAmerge G.L1.1.1 G.1.G.1.*.S.las && rm G.1.G.1.*.S.las

LAsort G.1.G.2.*.las && LAmerge G.L1.1.2 G.1.G.2.*.S.las && rm G.1.G.2.*.S.las

LAsort G.2.G.1.*.las && LAmerge G.L1.2.1 G.2.G.1.*.S.las && rm G.2.G.1.*.S.las

LAsort G.2.G.2.*.las && LAmerge G.L1.2.2 G.2.G.2.*.S.las && rm G.2.G.2.*.S.las

# Level 1 jobs (2)

LAmerge G.1 G.L1.1.1 G.L1.1.2 && rm G.L1.1.1.las G.L1.1.2.las

LAmerge G.2 G.L1.2.1 G.L1.2.2 && rm G.L1.2.1.las G.L1.2.2.las

> LAshow -c -a:G -w50 G.1 | more  // Take a look at the result !

G.1: 34,510 records

         1          9 c   [     0.. 1,876] x [ 9,017..10,825]  ( 18 trace pts)

                      12645

    A      ---------+====>   dif/(len1+len2) = 398/(1876+1808) = 21.61%

    B <====+---------

       9017

         1 ..........gtg-cggt--caggggtgcctgc-t-t-atcgcaatgtta

                     |||*||||**||||||||*||||*|*|*||**|*|*||||

      9008 gagaggccaagtggcggtggcaggggtg-ctgcgtcttatatccaggtta  27.5%

        35 ta-ctgggtggttaaacttagccaggaaacctgttgaaataa-acggtgg

           ||*|||||||||||||*|**|*||*|*||||||*|**|||||*|*|||||

      9057 tagctgggtggttaaa-tctg-ca-g-aacctg-t--aataacatggtgg  24.0%

        83 -ctagtggcttgccgtttacccaacagaagcataatgaaa-tttgaaagt

           *||||||||*||||||||*||**||||*|||**|||||||*||||*||||

      9100 gctagtggc-tgccgttt-ccgcacag-agc--aatgaaaatttg-aagt  20.0%

       131 ggtaggttcctgctgtct-acatacagaacgacggagcgaaaaggtaccg

           ||*|||||||||||||*|*||||*|*|*||||||||||*||||||||||*

      9144 gg-aggttcctgctgt-tcacat-c-ggacgacggagc-aaaaggtacc-  16.0%

...

> LAcat G >G.las   //  Combine G.1.las & G.2.las into a single .las file

> LAshow G | more    //   Take another look, now at G.las

G: 62,654 records

   1    9 c   [     0.. 1,876] x [ 9,017..10,825] :   <    398 diffs  ( 18 trace pts)

   1   38 c   [     0.. 7,107] x [ 5,381..12,330] :   <  1,614 diffs  ( 71 trace pts)

   1   49 n   [ 5,493..14,521] x [     0.. 9,065] :   <  2,028 diffs  ( 91 trace pts)

   1   68 n   [12,809..14,521] x [     0.. 1,758] :   <    373 diffs  ( 17 trace pts)

   1  147 c   [     0..13,352] x [   854..14,069] :   <  2,993 diffs  (133 trace pts)

   1  231 n   [10,892..14,521] x [     0.. 3,735] :   <    816 diffs  ( 37 trace pts)

   1  292 c   [ 3,835..14,521] x [     0..10,702] :   <  2,353 diffs  (107 trace pts)

   1  335 n   [ 7,569..14,521] x [     0.. 7,033] :   <  1,544 diffs  ( 70 trace pts)

   1  377 c   [ 9,602..14,521] x [     0.. 5,009] :   <  1,104 diffs  ( 49 trace pts)

   1  414 c   [ 6,804..14,521] x [     0.. 7,812] :   <  1,745 diffs  ( 77 trace pts)

   1  415 c   [     0.. 3,613] x [ 7,685..11,224] :   <    840 diffs  ( 36 trace pts)

   1  445 c   [ 9,828..14,521] x [     0.. 4,789] :   <  1,036 diffs  ( 47 trace pts)

   1  464 n   [     0.. 1,942] x [12,416..14,281] :   <    411 diffs  ( 19 trace pts)

...

The Daligner Overlap Library的更多相关文章

记一个mvn奇怪错误： Archive for required library: 'D:/mvn/repos/junit/junit/3.8.1/junit-3.8.1.jar' in project 'xxx' cannot be read or is not a valid ZIP file
我的maven 项目有一个红色感叹号, 而且Problems 存在 errors : Description Resource Path Location Type Archive for requi ...
代码的坏味道（22）——不完美的库类(Incomplete Library Class)
坏味道--不完美的库类(Incomplete Library Class) 特征当一个类库已经不能满足实际需要时,你就不得不改变这个库(如果这个库是只读的,那就没辙了). 问题原因许多编程技术都建 ...
笔记：Memory Notification: Library Cache Object loaded into SGA
笔记:Memory Notification: Library Cache Object loaded into SGA在警告日志中发现一些这样的警告信息:Mon Nov 21 14:24:22 20 ...
Android Studio2.1.2 Java8环境下引用Java Library编译出错
转载请注明出处:http://www.cnblogs.com/LT5505/p/5685242.html 问题:在Android Studio2.1.2+Java8的环境下,引用Java Librar ...
在数据库访问项目中使用微软企业库Enterprise Library，实现多种数据库的支持
在我们开发很多项目中,数据访问都是必不可少的,有的需要访问Oracle.SQLServer.Mysql这些常规的数据库,也有可能访问SQLite.Access,或者一些我们可能不常用的PostgreS ...
Unable to download data from http://ruby.taobao.org/ & don't have write permissions for the /Library/Ruby/Gems/2.0.0 directory.
安装cocoapods,记录两个问题! 1.镜像已经替换成了 http://ruby.taobao.org/, 还是不能不能安装cocoapods, 报错:Unable to download dat ...
[转载】——故障排除：Shared Pool优化和Library Cache Latch冲突优化 (文档 ID 1523934.1)
原文链接:https://support.oracle.com/epmos/faces/DocumentDisplay?_adf.ctrlstate=23w4l35u5_4&id=152393 ...
使用NuGet发布自己的类库包（Library Package）
STEP 1:注册并获取API Key 首先,你需要到NuGet上注册一个新的账号,然后在My Account页面,获取一个API Key,这个过程很简单,我就不作说明了. STEP 2:下载NuGe ...
启动 Eclipse 弹出“Failed to load the JNI shared library jvm.dll”错误的解决方法
原因1:给定目录下jvm.dll不存在. 对策:(1)重新安装jre或者jdk并配置好环境变量.(2)copy一个jvm.dll放在该目录下. 原因2:eclipse的版本与jre或者jdk版本不一致 ...

随机推荐

python学习笔记一 python入门（基础篇）
简单介绍一下python2.x和3.5的区别 print 在python3.5中print 变为print() Old: print * New: print( * ) 如果想要不换行,之前的 ...
iOS面试和招聘
1, <招聘一个靠谱的iOS>面试题参考答案(上) 2, 招聘一个靠谱的 iOS
SqlSever基础 dateadd month 增加五个月
镇场诗:---大梦谁觉,水月中建博客.百千磨难,才知世事无常.---今持佛语,技术无量愿学.愿尽所学,铸一良心博客.------------------------------------------ ...
Java String 一些实验
1.请查看String.equals()方法的实现代码,注意学习其实现方法. 结果: 原因: 当直接使用new关键字创建字符串对象时,虽然值一致(都是“Hello”),但s1.s2仍然是两个独立的对象 ...
将linux默认python升级到2.7.4版本
第一步:下载python2.7.4版本源码: wget http://python.org/ftp/python/2.7.4/Python-2.7.4.tgz 解压文件 [aa@localhost ~ ...
Linux有问必答：如何在Linux中修改环境变量PATH
提问: 当我试着运行一个程序时,它提示“command not found”. 但这个程序就在/usr/local/bin下.我该如何添加/usr/local/bin到我的PATH变量下,这样我就可以 ...
SQL Server建表和增删改
create database 数据库名 go --穿件完成 go create table 表名(列名类型, 列名类型, 列名类型 --最后一个列名不加逗号) go --创建完成go 以创建表 ...
MyBatis环境搭建
什么是MyBatis: MyBatis 是支持定制化 SQL.存储过程以及高级映射的优秀的持久层框架(O object R relatoin M mapping 框架),MyBatis 避免了几乎所 ...
sql 增加字段
ALTER TABLE [dt_article_goods] ADD [goods_id] int DEFAULT 0
GZFramwork快速开发框架演练之会员系统(二)添加字典模块
开始前请先阅读 GZFramwork快速开发框架之窗体设计说明第一步:准备模块图片图片为2张大小分别为16x16和32x32,放在\Debug\images目录下因为会员管理模块并不多 ...

The Daligner Overlap Library

The Daligner Overlap Library的更多相关文章

随机推荐

热门专题