CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice.

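In short: sequence weighting, position-specific gap penalties, and the choice of weight matrix are used to make progressive multiple sequence alignment more sensitive.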

Abstract

Firstly, individual weights are assigned to each sequence in a partial alignment in order to down-weight near-duplicate sequences and up-weight the most divergent ones.

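That is, each sequence in a partial alignment gets its own weight, so that near-duplicate sequences count for less and the most divergent sequences count for more.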

Secondly, amino acid substitution matrices are varied at different alignment stages according to the divergence of the sequences to be aligned.

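That is, the amino acid substitution matrix changes from one alignment stage to the next, depending on how divergent the sequences being aligned are.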

Thirdly, residue-specific gap penalties and locally reduced gap penalties in hydrophilic regions encourage new gaps in potential loop regions rather than regular secondary structure.

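That is, residue-specific gap penalties, together with locally reduced gap penalties in hydrophilic regions, steer new gaps toward likely loop regions and away from regular secondary structure.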

Fourthly, positions in early alignments where gaps have been opened receive locally reduced gap penalties to encourage the opening up of new gaps at these positions.

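That is, positions where gaps were already opened in earlier alignments get locally reduced gap penalties, which encourages new gaps to open at those same positions.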

Introduction

Currently, the most widely used approach is to exploit the fact that homologous sequences are evolutionarily related. One can build up a multiple alignment progressively by a series of pairwise alignments, following the branching order in a phylogenetic tree. One first aligns the most closely related sequences, gradually adding in the more distant ones. This approach is sufficiently fast to allow alignments of virtually any size.

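The most widely used approach today relies on homologous sequences being evolutionarily related. A multiple alignment is built up step by step from a series of pairwise alignments that follow the branching order of a phylogenetic tree: the most closely related sequences are aligned first, and more distant ones are added gradually. The approach is fast enough to handle alignments of almost any size.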

There are two major problems with the progressive approach: the local minimum problem and the choice of alignment parameters.

The progressive approach has two main problems: the local minimum problem and the choice of alignment parameters.

The local minimum problem

The local minimum problem stems from the 'greedy' nature of the alignment strategy. The algorithm greedily adds sequences together, following the initial tree. There is no guarantee that the global optimal solution, as defined by some overall measure of multiple alignment quality, or anything close to it, will be found. More specifically, any mistakes (misaligned regions) made early in the alignment process cannot be corrected later as new information from other sequences is added.

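The local minimum problem comes from the greedy nature of the alignment strategy: sequences are added one after another in the order fixed by the initial tree. Nothing guarantees that the globally optimal solution, under whatever overall measure of alignment quality is used, or anything close to it, will be found. In particular, a region misaligned early on cannot be corrected later, even when other sequences bring in new information.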

The only way to correct this is to use an iterative or stochastic sampling procedure.

The only way around this is an iterative method or stochastic sampling (I am not sure yet what stochastic sampling involves here).

The alignment parameter choice problem

The alignment parameter choice problem is, in our view, at least as serious as the local minimum problem.

Stochastic or iterative algorithms will be just as badly affected as progressive ones if the parameters are inappropriate: they will arrive at a false global minimum.

Traditionally, one chooses one weight matrix and two gap penalties (one for opening a new gap and one for extending an existing gap) and hopes that these will work well over all parts of all the sequences in the data set. When the sequences are all closely related, this works.

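Traditionally, one picks a single weight matrix and two gap penalties (one for opening a new gap, one for extending an existing gap) and hopes they work well across every part of every sequence in the data set. This is fine when all the sequences are closely related.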

The first reason is that virtually all residue weight matrices give most weight to identities. When identities dominate an alignment, almost any weight matrix will find approximately the correct solution. With very divergent sequences, however, the scores given to non-identical residues will become critically important; there will be more mismatches than identities. Different weight matrices will be optimal at different evolutionary distances or for different classes of proteins.

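First, almost all residue weight matrices give the highest scores to identities. When identities dominate an alignment, nearly any weight matrix finds roughly the correct solution. With very divergent sequences, though, the scores given to non-identical residues become critical, because mismatches outnumber identities. Different weight matrices are optimal at different evolutionary distances or for different classes of proteins.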

The second reason is that the range of gap penalty values that will find the correct or best possible solution can be very broad for highly similar sequences (11). As more and more divergent sequences are used, however, the exact values of the gap penalties become important for success. In each case, there may be a very narrow range of values which will deliver the best alignment.

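Second, for highly similar sequences the range of gap penalty values that finds the correct or best possible solution can be very broad (11). As the sequences become more divergent, the exact gap penalty values start to matter, and in each case only a very narrow range of values may give the best alignment.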
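The two gap penalties mentioned above (one for opening, one for extending) are usually combined into an affine gap cost. The sketch below is only an illustration of that idea: the function name and the penalty values 10.0 and 0.5 are my own assumptions, and conventions differ on whether the first gap position also pays the extension penalty (here it does).

```python
def affine_gap_cost(length: int, gap_open: float, gap_extend: float) -> float:
    """Penalty for a single gap spanning `length` positions (affine model)."""
    if length <= 0:
        return 0.0
    # Assumed convention: the opening penalty is paid once per gap,
    # the extension penalty once for every gap position.
    return gap_open + gap_extend * length


if __name__ == "__main__":
    # Hypothetical penalties, chosen only for illustration.
    one_long_gap = affine_gap_cost(3, gap_open=10.0, gap_extend=0.5)
    three_short_gaps = 3 * affine_gap_cost(1, gap_open=10.0, gap_extend=0.5)
    print(one_long_gap, three_short_gaps)
```

With these values one gap of length 3 is much cheaper than three separate gaps of length 1, which is why the exact penalties shape where gaps end up in divergent sequences.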

Neighbour-Joining (NJ)

In the original CLUSTAL programs, the initial guide trees, used to guide the multiple alignment, were calculated using the UPGMA method (20). We now use the Neighbour-Joining method which is more robust against the effects of unequal evolutionary rates in different lineages and which gives better estimates of individual branch lengths. This is useful because it is these branch lengths which are used to derive the sequence weights.

In other words, UPGMA hierarchical clustering was replaced with NJ clustering when building the guide tree.
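To make the link between branch lengths and sequence weights concrete, here is a minimal sketch. It assumes, as I understand the scheme, that each branch length is shared equally among the sequences below that branch and that a sequence's weight is the sum of its shares from the root down to its leaf. The tree encoding, the function names, and the example tree are hypothetical and not taken from the paper.

```python
# A node is (name, branch_length) for a leaf,
# or (list_of_child_nodes, branch_length) for an internal node.

def leaf_names(node):
    """Return the names of all leaves below (and including) this node."""
    label, _length = node
    if isinstance(label, str):
        return [label]
    names = []
    for child in label:
        names.extend(leaf_names(child))
    return names


def sequence_weights(root):
    """Sum, for every leaf, its equal share of each branch on the path from the root."""
    weights = {}

    def walk(node, inherited):
        label, length = node
        # Assumption: a branch is split equally among the sequences that share it.
        share = inherited + length / len(leaf_names(node))
        if isinstance(label, str):
            weights[label] = share
        else:
            for child in label:
                walk(child, share)

    walk(root, 0.0)
    return weights


if __name__ == "__main__":
    # Hypothetical tree: A and B are near-duplicates, C is divergent.
    tree = ([([("A", 0.1), ("B", 0.1)], 0.2), ("C", 0.4)], 0.0)
    print(sequence_weights(tree))
```

In this toy tree A and B split their shared branch, so each ends up with a lower weight than the divergent sequence C, which is the down-weighting of near-duplicates described in the abstract.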

Material and Methods

The basic alignment method

The basic multiple alignment algorithm consists of three main stages: (i) all pairs of sequences are aligned separately in order to calculate a distance matrix giving the divergence of each pair of sequences; (ii) a guide tree is calculated from the distance matrix; (iii) the sequences are progressively aligned according to the branching order in the guide tree.

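  1. Align all pairs of sequences and compute a distance matrix
  2. Compute a guide tree from the distance matrix
  3. Align the sequences progressively in the order given by the guide tree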

Calculate distance matrix

In the original CLUSTAL programs, the pairwise distances were calculated using a fast approximate method.

This allows very large numbers of sequences to be aligned, even on a microcomputer. The scores are calculated as the number of k-tuple matches (runs of identical residues, typically 1 or 2 long for proteins or 2-4 long for nucleotide sequences) in the best alignment between two sequences minus a fixed penalty for every gap.

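So the score is the number of k-tuple matches (runs of identical residues, usually 1 or 2 long for proteins and 2-4 long for nucleotide sequences) in the best alignment of the two sequences, minus a fixed penalty for each gap.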

We now offer a choice between this method and the slower but more accurate scores from full dynamic programming alignments using two gap penalties (for opening or extending gaps) and a full amino acid weight matrix.

These scores are calculated as the number of identities in the best alignment divided by the number of residues compared (gap positions are excluded). Both of these scores are initially calculated as per cent identity scores and are converted to distances by dividing by 100 and subtracting from 1.0 to give number of differences per site.
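The conversion described above (identities divided by residues compared, gap positions excluded, then distance = 1 - identity/100) can be sketched as follows. This assumes the two sequences are already aligned and that gaps are written as "-"; the function names and the example strings are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Per cent identity over the positions where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # gap positions are excluded from the comparison
        compared += 1
        if x == y:
            identical += 1
    return 100.0 * identical / compared if compared else 0.0


def identity_to_distance(pid: float) -> float:
    """Convert a per cent identity score to differences per site."""
    return 1.0 - pid / 100.0


if __name__ == "__main__":
    # Hypothetical aligned pair: 5 positions compared, 4 identical.
    pid = percent_identity("ACD-EF", "ACDQEY")
    print(pid, round(identity_to_distance(pid), 3))
```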

Progressive alignment

This stage aligns groups against groups, so a score has to be computed at every position.
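One way to score a position of one group against a position of another is to average the pairwise weight matrix scores over all residue pairs, scaled by the sequence weights. The sketch below follows that idea under my own assumptions: `toy_score` is a hypothetical stand-in for a real substitution matrix, and any pair involving a gap is scored as zero.

```python
def toy_score(x: str, y: str) -> float:
    """Hypothetical stand-in for a substitution matrix lookup."""
    return 2.0 if x == y else -1.0


def column_score(col_a, col_b, weights_a, weights_b, score=toy_score, gap="-"):
    """Weighted average of all pairwise scores between two alignment columns."""
    if len(col_a) != len(weights_a) or len(col_b) != len(weights_b):
        raise ValueError("each column needs one weight per sequence")
    total = 0.0
    for x, wx in zip(col_a, weights_a):
        for y, wy in zip(col_b, weights_b):
            if x == gap or y == gap:
                continue  # assumption: a gap against anything contributes zero
            total += wx * wy * score(x, y)
    # Average over every pair, including the pairs that contributed zero.
    return total / (len(col_a) * len(col_b))


if __name__ == "__main__":
    # Column "AA" from one group against column "AV" from another, equal weights.
    print(column_score("AA", "AV", [1.0, 1.0], [1.0, 1.0]))
```

During group-to-group dynamic programming this function would be called for each pair of columns, in place of the single residue-against-residue lookup used in pairwise alignment.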
