使用mothur进行OTU聚类

微生物16S的OTU聚类工具有很多，最常用的就是 usearch、cdhit-OTU、mothur。

这些工具大多都是针对二代测序平台的，usearch的64bit版本是收费的。

如果要跑PacBio的OTU聚类，目前就只能用 mothur 了。

mothur有着非常详细的说明文档！

以下转载一篇教程：原文链接：http://blog.sina.com.cn/s/blog_80572f5d0101ac2v.html

mothur 用来处理高通量测序结果或者克隆文库的序列处理，这里进行OTU分类，就是排除克隆测序结果的重复序列，以97%的相似度进行分类，即cutoff=0.03

版本：mothur v.1.25.1 Last updated: 5/14/2012

mothur > unique.seqs(fasta=*.fasta)

Output File Names:
*.unique.fasta
*.names

mothur > align.seqs(candidate=*.unique.fasta,template=tmp.fasta,flip=T,processors=2)

Reading in the tmp.fasta template sequences... DONE.
It took 0 to read 1 sequences.
Aligning sequences from *.unique.fasta ...

Reading in the tmp.fasta template sequences... DONE.
It took 0 to read 1 sequences.
Some of your sequences generated alignments that eliminated too many bases, a list is provided in *.unique.flip.accnos. If the reverse compliment proved to be better it was reported.
It took 1 secs to align *** sequences.

Output File Names:
*.unique.align
*.unique.align.report
*.unique.flip.accnos

mothur > filter.seqs(fasta=*.unique.align)

Length of filtered alignment: ***

Number of columns removed: 0

Length of the original alignment: ***

Number of sequences used to construct filter: ***

Output File Names:
*.filter
*.unique.filter.fasta

mothur > dist.seqs(fasta=*.unique.filter.fasta,calc=onegap,countends=F,cutoff=0.03,output=lt)

Output File Name: g5gta.unique.filter.phylip.dist

It took 0 to calculate the distances for *** sequences.

mothur > cluster(phylip=*.unique.filter.phylip.dist,method=furthest,cutoff=0.03)

********************#****#****#****#****#****#****#****#****#****#****#
Reading matrix: ||||||||||||||||||||||||||||||||||||||||||||||||||||
***********************************************************************
unique 2 124 2
0.01 3 89 15 3
0.02 5 69 7 10 0 3
0.03 8 60 7 7 3 1 0 0 2

Output File Names:
*.unique.filter.phylip.fn.sabund
*.unique.filter.phylip.fn.rabund
*.unique.filter.phylip.fn.list

It took 0 seconds to cluster

mothur > bin.seqs(fasta=*.fasta,name=*.names)

Using *.unique.filter.phylip.fn.list as input file for the list parameter.
unique
0.01
0.02
0.03

Output File Names:
*.unique.filter.phylip.fn.unique.fasta
*.unique.filter.phylip.fn.0.01.fasta
*.unique.filter.phylip.fn.0.02.fasta
*.unique.filter.phylip.fn.0.03.fasta

mothur > get.oturep(phylip=*.unique.filter.phylip.dist,fasta=*.fasta,list=*.unique.filter.phylip.fn.list,label=0.03)

Output File Names:
*.unique.filter.phylip.fn.0.03.rep.names
*.unique.filter.phylip.fn.0.03.rep.fasta

总结

mothur下载地址：http://www.mothur.org/wiki/Download_mothur
支持Mac，Windows以及Linux操作系统
首先要打开mothur -----cmd，进入命令行界面cd xxx 进入mothur.exe目录，输入mothur.exe 回车
假设该目录内fasta格式文件叫做 Great.fasta
那么使用如下命令处理：

1.mothur > unique.seqs(fasta=Great.fasta)

2.mothur > dist.seqs(fasta=Great.unique.fasta,calc=onegap,countends=F,cutoff=0.03,output=lt)

3.mothur > cluster(phylip=Great.unique.phylip.dist,method=furthest,cutoff=0.03)

4.mothur > bin.seqs(fasta=Great.fasta,name=Great.names)

5.mothur > get.oturep(phylip=Great.unique.phylip.dist,fasta=Great.unique.fasta,list=Great.unique.phylip.fn.list,label=0.03)

最后打开Great.phylip.fn.0.03.rep.fasta 查看各OTU 的数目即可，即每个序列名最后的数字。
另外，还可以看到有多少个OTUs
Great.unique.phylip.fn.0.03.rep.names 这个文件显示了每个OTU都有哪些序列，可以用记事本打开

Great.unique.phylip.fn.list 这个文件显示了unique_0.01_0.02_0.03 分别有哪些类型的OTU,每一组用空格或Tab隔开，每组OTU内的各序列用"，" 隔开。

注意，cluster命令能读取phylip矩阵，也能读取column矩阵，如果读取的是column，还需要提供一个names文件。另外，用unique.seqs命令,是为了生成names文件
如果在运行第1步的时候，提示[ERROR]:your sequences are not the same length, aborting.

那么需要运行一下命令：

先提供一个template fasta文件，以以上fasta文件中序列长度最普遍的某个为序列模板，新建一个fasta，命名为temp.fasta

a. mothur > unique.seqs(fasta=Great.fasta)

b. mothur > align.seqs(candidate=Great.unique.fasta,template=temp.fasta,flip=T,processors=2)

c. mothur > filter.seqs(fasta=Great.unique.align)

再把这里生成的Great.unique.filter.fasta文件更名为Great.fasta 进行以上5步处理。（尽量在一个新的mothur目录内，打开一个新的mothur窗口，以免文件名冲突）

命令详情

unique.seqs

identify the unique sequences in a collection and generate a names file

找出fasta文件中独一无二的序列，即去冗余，在二代中有很多一模一样的冗余序列，需要首先去掉，而在三代中这一步可以不跑。

这一步生成 All_Sample.unique.fasta 和 All_Sample.names

align.seqs

align sequences against a reference alignment 比对到ref上，等于是多重比对

The default threshold is 0.50, meaning if 50% of the bases are removed in the alignment process. The flip parameter is used to specify whether or not you want mothur to try the reverse complement of a sequence if the sequence falls below the threshold. The default is false. If the flip parameter is set to true the reverse complement of the sequence is aligned and the better alignment is reported.

template means ref

这一步生成 All_Sample.unique.align 、 All_Sample.unique.align.report 、 All_Sample.unique.flip.accnos

filter.seqs

filter positions out of an alignment 过滤

we can probably remove a lot of positions from the alignment to speed up our distance calculations.

加速距离计算

dist.seqs

generate a pairwise distance matrix，生成双序列比对距离矩阵

This approach is better than the commonly used DNADIST because the distances are not stored in RAM, rather they are printed directly to a file. Furthermore, it is possible to ignore "large" distances that one might not be interested in.

得到了序列两两之间的距离

cluster

clustering sequences into OTUs based on genetic distance 遗传距离

有多种聚类方法，选哪一种好呢？

The methods available in mothur include opticlust (opti), average neighbor (average), furthest neighbor (furthest), nearest neighbor (nearest), Vsearch agc (agc), Vsearch dgc (dgc). By default cluster() uses the opticlust algorithm.

bin.seqs

identify the OTU that each sequence belongs to

prints out a fasta-formatted file where sequences are ordered according to the OTU that they belong to. Such an output may be helpful for generating primers specific to an OTU or for classification of sequences.

get.oturep

identify a representative sequence from each OTU

While the bin.seqs command reports the OTU number for all sequences, the get.oturep command generates a fasta-formatted sequence file containing only a representative sequence for each OTU.