Picking OTUs with QIIME 1

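QIIME itself does not implement any clustering algorithm; it only wraps other OTU-picking software.

Depending on how the reads are clustered, there are three strategies:

de novo: pick_de_novo_otus.py

closed-reference: pick_closed_reference_otus.py

open-reference: pick_open_reference_otus.py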

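Pros and cons of each strategy:

de novo: pick_de_novo_otus.py

Pros: every read is clustered.

Cons: it does not run in parallel and is slow; with more than about 10 million reads it becomes very slow.

Use case: studying an uncommon marker gene.

closed-reference: pick_closed_reference_otus.py

Reads are matched against a reference database, and reads that do not hit the database are discarded. The reference sequences carry taxonomy annotations, which makes taxonomic assignment straightforward.

Pros: fully parallel and fast; the tree and taxonomy are better, because the OTUs in the reference database are already well classified.

Cons: it cannot detect species that are absent from the database.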

Because reads that don’t hit the reference sequence collection are discarded, your analyses only focus on the diversity that you “already know about”

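open-reference: pick_open_reference_otus.py

Reads are first matched against the reference database; reads that fail to hit it are then clustered de novo.

Open-reference picking is the recommended OTU-picking strategy.

Pros: every read is clustered; partly parallel, so reasonably fast.

Cons: when many reads come from novel organisms, it becomes very slow.

We use open-reference OTU picking most often; the corresponding script is pick_open_reference_otus.py.

It can be viewed as a pipeline with six steps: the first four pick OTUs, and the last two build the final OTU map and representative set, then the OTU table and the tree.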

Step 1) Prefiltering and picking closed reference OTUs

The first step is an optional prefiltering of the input fasta file to remove sequences that do not hit the reference database with a given sequence identity (PREFILTER_PERCENT_ID). This step can take a very long time, so is disabled by default. The prefilter parameters can be changed with the options:

--prefilter_refseqs_fp

--prefilter_percent_id

This filtering is accomplished by picking closed reference OTUs at the specified prefilter percent id to produce:

prefilter_otus/seqs_otus.log

prefilter_otus/seqs_otus.txt

prefilter_otus/seqs_failures.txt

prefilter_otus/seqs_clusters.uc

Next, the seqs_failures.txt file is used to remove these failed sequences from the original input fasta file to produce:

prefilter_otus/prefiltered_seqs.fna

This prefiltered_seqs.fna file is then considered to contain the reads of the marker gene of interest, rather than spurious reads such as host genomic sequence or sequencing artifacts.

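In other words, the input is optionally prefiltered first: at a given percent identity, closed-reference OTU picking removes the input sequences that do not hit the database.

If prefiltering runs, it produces prefilter_otus/prefiltered_seqs.fna; otherwise the original input fasta file (input.fasta) goes straight to the next stage.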

If prefiltering is applied, this step progresses with the prefiltered_seqs.fna. Otherwise it progresses with the input file. The Step 1 closed reference OTU picking is done against the supplied reference database. This command produces:

step1_otus/_clusters.uc

step1_otus/_failures.txt

step1_otus/_otus.log

step1_otus/_otus.txt

Closed-reference OTU picking is then run against the reference database.

A representative sequence for each of the Step 1 picked OTUs is selected to produce:

step1_otus/step1_rep_set.fna

Next, the sequences that failed to hit the reference database in Step 1 are filtered from the Step 1 input fasta file to produce:

step1_otus/failures.fasta

Then the failures.fasta file is randomly subsampled to PERCENT_SUBSAMPLE of the sequences to produce:

step1_otus/subsampled_failures.fna

Modifying PERCENT_SUBSAMPLE can have a big effect on run time for this workflow, but will not alter the final OTUs.

Reads that fail to hit the database are written to step1_otus/failures.fasta, and a random subset of them is written to step1_otus/subsampled_failures.fna.

Tuning the PERCENT_SUBSAMPLE parameter can shorten the run time.
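The subsampling of failures.fasta can be illustrated with a small standalone Python 3 sketch. It is not QIIME's own code: the function names are hypothetical, and it assumes that PERCENT_SUBSAMPLE is a fraction between 0 and 1 and that each record is kept independently with that probability.

```python
import random


def parse_fasta(lines):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def subsample_fasta(lines, percent_subsample, seed=None):
    """Keep each record with probability percent_subsample (assumed to be a fraction)."""
    rng = random.Random(seed)
    return [(h, s) for h, s in parse_fasta(lines) if rng.random() < percent_subsample]


# Hypothetical failures.fasta content, for illustration only.
failures = [">S1_1", "ACGT", "ACGT", ">S1_2", "TTTT", ">S2_1", "GGGG"]
everything = subsample_fasta(failures, 1.0)
nothing = subsample_fasta(failures, 0.0)
```

A fixed seed makes the subsample reproducible, which is convenient when comparing run times across different fractions.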

Step 2) The subsampled_failures.fna are next clustered de novo, and each cluster centroid is then chosen as a "new reference sequence" for use as the reference database in Step 3, to produce:

step2_otus/subsampled_seqs_clusters.uc

step2_otus/subsampled_seqs_otus.log

step2_otus/subsampled_seqs_otus.txt

step2_otus/step2_rep_set.fna

The step1_otus/subsampled_failures.fna file from Step 1 is clustered de novo, and the cluster centroids become the new reference sequences.

Step 3) Pick Closed Reference OTUs against Step 2 de novo OTUs

Closed reference OTU picking is performed using the failures.fasta file created in Step 1 against the 'reference' de novo database created in Step 2 to produce:

step3_otus/failures_seqs_clusters.uc

step3_otus/failures_seqs_failures.txt

step3_otus/failures_seqs_otus.log

step3_otus/failures_seqs_otus.txt

That is, step1_otus/failures.fasta is matched against step2_otus/step2_rep_set.fna.

Assuming the user has NOT passed the --suppress_step4 flag:

The sequences which failed to hit the reference database in Step 3 are removed from the Step 3 input fasta file to produce:

step3_otus/failures_failures.fasta

Sequences that still fail to match are written to step3_otus/failures_failures.fasta.

Step 4) Additional de novo OTU picking

It is assumed by this point that the majority of sequences have been assigned to an OTU, and thus the sequence count of failures_failures.fasta is small enough that de novo OTU picking is computationally feasible. However, depending on the sequences being used, it might be that the failures_failures.fasta file is still prohibitively large for de novo clustering, and the jobs might take too long to finish. In this case it is likely that the user would want to pass the --suppress_step4 flag to avoid this additional de novo step.

A final round of de novo OTU picking is done on the failures_failures.fasta file to produce:

step4_otus/failures_failures_cluster.uc

step4_otus/failures_failures_otus.log

step4_otus/failures_failures_otus.txt

The failures_failures.fasta file from Step 3 is clustered de novo once more.
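Steps 1 to 4 can be summarized as a toy Python 3 sketch. It is only an illustration under loud assumptions: position-wise identity on equal-length strings stands in for a real aligner, greedy centroid clustering stands in for the de novo picker, and the function names and the OTU id prefixes NewRef and CleanUp are hypothetical, not QIIME's.

```python
import random


def identity(a, b):
    """Fraction of matching positions; a stand-in for a real percent identity."""
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))


def closed_reference(reads, refs, threshold):
    """Assign each read to its best reference at or above threshold."""
    otu_map, failures = {}, {}
    for read_id, seq in reads.items():
        best = max(refs, key=lambda ref_id: identity(seq, refs[ref_id]), default=None)
        if best is not None and identity(seq, refs[best]) >= threshold:
            otu_map.setdefault(best, []).append(read_id)
        else:
            failures[read_id] = seq  # no hit: kept for the next step
    return otu_map, failures


def de_novo(reads, threshold, prefix):
    """Greedy clustering: join the first matching centroid, else found a new OTU."""
    otu_map, centroids = {}, {}
    for read_id, seq in reads.items():
        for otu_id, centroid in centroids.items():
            if identity(seq, centroid) >= threshold:
                otu_map[otu_id].append(read_id)
                break
        else:
            otu_id = f"{prefix}{len(centroids)}"
            centroids[otu_id] = seq
            otu_map[otu_id] = [read_id]
    return otu_map, centroids


def open_reference(reads, refs, threshold=0.9, percent_subsample=0.5, seed=0):
    rng = random.Random(seed)
    # Step 1: closed-reference picking against the supplied database.
    step1_map, failures = closed_reference(reads, refs, threshold)
    # Randomly subsample the failures (fraction assumed to be in [0, 1]).
    subsample = {r: s for r, s in failures.items() if rng.random() < percent_subsample}
    # Step 2: cluster the subsample de novo; centroids become new references.
    _, new_refs = de_novo(subsample, threshold, prefix="NewRef")
    # Step 3: closed-reference picking of all failures against the new references.
    step3_map, failures_failures = closed_reference(failures, new_refs, threshold)
    # Step 4: de novo picking on whatever is still unassigned.
    step4_map, _ = de_novo(failures_failures, threshold, prefix="CleanUp")
    final_map = {}
    for otu_map in (step1_map, step3_map, step4_map):
        final_map.update(otu_map)
    return final_map


refs = {"ref1": "AAAAAAAAAA"}
reads = {"r1": "AAAAAAAAAA", "r2": "AAAAAAAAAT", "r3": "CCCCCCCCCC",
         "r4": "CCCCCCCCCG", "r5": "GGGGGGGGGG"}
all_subsampled = open_reference(reads, refs, percent_subsample=1.0)
none_subsampled = open_reference(reads, refs, percent_subsample=0.0)
```

In this toy data set the read groupings are the same whether the subsample is everything or nothing; only the OTU ids change, because reads that miss Step 3 are picked up by the Step 4 clean-up round.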

Step 5) Produce the final OTU map and rep set

If Step 4 is completed, the OTU maps from Step 1, Step 3, and Step 4 are concatenated to produce:

final_otu_map.txt

That is, when Step 4 runs, the maps from Steps 1, 3, and 4 are merged into final_otu_map.txt.

If Step 4 was not completed, the OTU maps from Step 1 and Step 3 are concatenated together to produce:

final_otu_map.txt

When Step 4 is skipped, only the maps from Steps 1 and 3 are merged into final_otu_map.txt.
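Concatenating the per-step OTU maps might look like the following Python 3 sketch. It assumes the usual tab-separated OTU map layout (OTU id first, then the sequence ids assigned to it); the function names and the sample lines are hypothetical.

```python
def parse_otu_map(lines):
    """Parse tab-separated OTU map lines into {otu_id: [seq_id, ...]}."""
    otu_map = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0]:  # skip blank lines
            otu_map[fields[0]] = fields[1:]
    return otu_map


def concatenate_otu_maps(*otu_maps):
    """Merge OTU maps, refusing to silently overwrite a repeated OTU id."""
    merged = {}
    for otu_map in otu_maps:
        for otu_id, seq_ids in otu_map.items():
            if otu_id in merged:
                raise ValueError(f"duplicate OTU id: {otu_id}")
            merged[otu_id] = list(seq_ids)
    return merged


# Hypothetical map contents for Steps 1, 3, and 4.
step1 = parse_otu_map(["ref1\tS1_1\tS2_4\n"])
step3 = parse_otu_map(["NewRef0\tS1_2\n"])
step4 = parse_otu_map(["CleanUp0\tS2_7\tS2_9\n", "\n"])

final_with_step4 = concatenate_otu_maps(step1, step3, step4)
final_without_step4 = concatenate_otu_maps(step1, step3)
```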

Next, the minimum OTU size required to keep an OTU is set with the --min_otu_size flag. For example, if the user left --min_otu_size at the default value of 2, requiring each OTU to contain at least 2 sequences, then any OTUs which failed to meet this criterion would be removed from the final_otu_map.txt to produce:

final_otu_map_mc2.txt

If --min_otu_size 10 was passed, it would produce:

final_otu_map_mc10.txt

The final_otu_map_mc2.txt is used to build the final representative set:

rep_set.fna

In short, --min_otu_size filters the OTUs, producing final_otu_map_mc2.txt and the corresponding representative sequences in rep_set.fna.
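The effect of --min_otu_size can be sketched in a few lines of Python 3. The helper names and the sample OTU map are hypothetical; only the mc2 and mc10 file-name pattern comes from the text above.

```python
def filter_otu_map(otu_map, min_otu_size=2):
    """Keep only OTUs that contain at least min_otu_size sequences."""
    return {otu_id: seq_ids for otu_id, seq_ids in otu_map.items()
            if len(seq_ids) >= min_otu_size}


def filtered_map_name(min_otu_size=2):
    """File name of the filtered map, following the mcN pattern."""
    return f"final_otu_map_mc{min_otu_size}.txt"


# Hypothetical final OTU map: one OTU with three reads, one singleton.
final_otu_map = {"ref1": ["S1_1", "S1_2", "S2_4"], "CleanUp0": ["S2_7"]}
kept = filter_otu_map(final_otu_map, min_otu_size=2)
```

With the default of 2, the filter drops exactly the singleton OTUs, that is, OTUs supported by a single read.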

Step 6) Making the OTU tables and trees

An OTU table is built using the final_otu_map_mc2.txt file to produce:

otu_table_mc2.biom

That is, final_otu_map_mc2.txt is converted into the OTU table otu_table_mc2.biom.
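Conceptually, the OTU table counts how many reads each sample contributes to each OTU. The Python 3 sketch below assumes sequence ids of the form SampleID_N, so that the sample id is everything before the last underscore; it builds a plain nested dictionary rather than a real BIOM file, and the function name is hypothetical.

```python
from collections import Counter, defaultdict


def otu_table_from_map(otu_map):
    """Return {otu_id: {sample_id: read_count}} from an OTU map."""
    table = defaultdict(Counter)
    for otu_id, seq_ids in otu_map.items():
        for seq_id in seq_ids:
            # Assumption: ids look like "SampleID_N".
            sample_id = seq_id.rsplit("_", 1)[0]
            table[otu_id][sample_id] += 1
    return {otu_id: dict(counts) for otu_id, counts in table.items()}


# Hypothetical filtered OTU map.
otu_map_mc2 = {"ref1": ["S1_1", "S1_2", "S2_4"], "NewRef0": ["S2_7", "S2_9"]}
table = otu_table_from_map(otu_map_mc2)
```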

As long as the --suppress_taxonomy_assignment flag is NOT passed, then taxonomy will be assigned to each of the representative sequences in the final rep_set produced in Step 5, producing:

rep_set_tax_assignments.log

rep_set_tax_assignments.txt

This taxonomic metadata is then added to the otu_table_mc2.biom to produce:

otu_table_mc_w_tax.biom

In short, taxonomy is assigned to the OTU representative sequences, producing otu_table_mc_w_tax.biom.

As long as the --suppress_align_and_tree flag is NOT passed, then the rep_set.fna file will be used to align the sequences and build the phylogenetic tree, which includes the de novo OTUs. Any sequences that fail to align are omitted from the OTU table and tree to produce:

otu_table_mc_no_pynast_failures.biom

rep_set.tre

In short, the OTU representative sequences are aligned and a phylogenetic tree is built from the alignment, producing rep_set.tre.

If both --suppress_taxonomy_assignment and --suppress_align_and_tree are NOT passed, the script will produce:

otu_table_mc_w_tax_no_pynast_failures.biom

It is important to remember that with a large workflow script like this that the user can jump into intermediate steps. For example, imagine that for some reason the script was interrupted on Step 2, and the user did not want to go through the process of re-picking OTUs as was done in Step 1. They can simply rerun the script and pass in the:

--step1_otu_map_fp

--step1_failures_fasta_fp

parameters, and the script will continue with Steps 2 - 4.

A large workflow script like this lets you jump in at an intermediate step instead of rerunning the earlier ones.

**Note:** If most or all of your sequences are failing to hit the reference during the prefiltering or closed-reference OTU picking steps, your sequences may be in the reverse orientation with respect to your reference database. To address this, you should add the following line to your parameters file (creating one, if necessary) and pass this file as -p:

pick_otus:enable_rev_strand_match True

Be aware that this doubles the amount of memory used in these steps of the workflow.

In other words, if a large share of the input sequences fail to hit the database, one likely cause is that they are reverse-complemented relative to the database sequences; adding pick_otus:enable_rev_strand_match True to the parameters file handles this case.

Keep in mind that this setting doubles memory use.
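A parameters file is plain text with one script:option value entry per line, so it can be generated programmatically. The Python 3 sketch below writes the single line quoted above; the helper name and the file name otu_params.txt are hypothetical.

```python
import os
import tempfile


def write_params_file(path, settings):
    """Write one 'script:option value' line per setting."""
    with open(path, "w") as handle:
        for (script, option), value in settings.items():
            handle.write(f"{script}:{option} {value}\n")


# Hypothetical location; any writable path works.
params_path = os.path.join(tempfile.mkdtemp(), "otu_params.txt")
write_params_file(params_path, {("pick_otus", "enable_rev_strand_match"): True})

with open(params_path) as handle:
    params_text = handle.read()
```

The resulting file is the one to pass to the workflow script with -p.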

Basic usage:

pick_open_reference_otus.py -i $PWD/seqs1.fna -r $PWD/refseqs.fna -o $PWD/ucrss_sortmerna_sumaclust/ -p $PWD/ucrss_smr_suma_params.txt -m sortmerna_sumaclust

-i : input sequences, in fasta format

-r : reference sequences, in fasta format; the default is the Greengenes set at /usr/local/lib/python2.7/site-packages/qiime_default_reference/gg_13_8_otus/rep_set/97_otus.fasta

-o : output directory

-p : parameters file

-m : OTU-picking method; one of 'uclust', 'usearch61', 'sortmerna_sumaclust'; the default is uclust
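When the workflow is launched from another script, the argument list can be assembled first and inspected before anything runs. This Python 3 sketch only builds and prints the command; it uses the five options listed above, and the helper name and the file paths are hypothetical.

```python
import shlex

# Methods listed for -m above.
ALLOWED_METHODS = ("uclust", "usearch61", "sortmerna_sumaclust")


def build_command(input_fp, output_dir, reference_fp=None, params_fp=None,
                  method="uclust"):
    """Return the pick_open_reference_otus.py argument list without running it."""
    if method not in ALLOWED_METHODS:
        raise ValueError(f"unknown OTU-picking method: {method}")
    command = ["pick_open_reference_otus.py", "-i", input_fp,
               "-o", output_dir, "-m", method]
    if reference_fp:  # omit -r to fall back on the default reference
        command += ["-r", reference_fp]
    if params_fp:
        command += ["-p", params_fp]
    return command


command = build_command("seqs1.fna", "otus_out/", reference_fp="refseqs.fna",
                        method="sortmerna_sumaclust")
print(" ".join(shlex.quote(arg) for arg in command))
```

The list can then be handed to subprocess.run in an environment where QIIME 1 is installed.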
