The Greengenes Database Release 13_5

This is a very important database for 16S analysis.
The Greengenes Database, a public resource since 2002 (DeSantis 2003, DeSantis 2006, McDonald 2011), is a well-characterized and curated database of small subunit ribosomal near-full length sequences from the kingdoms Bacteria and Archaea. With every release, a de novo phylogenetic tree is constructed that incorporates the novel branching patterns, branch lengths and candidate divisions of the sequences that have been deposited into public databases since the prior release. Latest versions are available for download at http://greengenes.secondgenome.com.

Greengenes is operated and maintained by an international consortium, The Greengenes Database Consortium, representing academic and biotech interests. Many improvements and changes have been made to the process by which the database is built in an effort to streamline the release cycle. For this release, a few major changes have been made:

  1. Chimera detection now relies on UCHIME (Edgar 2011) using, as a reference, consensus sequences derived from the 94% OTUs from Greengenes 12_10. Previously identified chimeras are still considered chimeric.
  2. The inference for determining whether a record is a named_isolate, unnamed_isolate, or clone has been dramatically improved. All existing Greengenes records have been updated and now reflect the improved decision status.
  3. Mappings to the Integrated Microbial Genomes database v400 are now provided. See gg_13_5_img.txt for a mapping between the Greengenes ID and IMG genomes.
  4. PyNAST aligned sequences are now provided, and are included in the ARB database. Thank you Les Dethlefsen for pointing out the need for these.
  5. Taxon names above the rank of genus with square brackets are names proposed by the Greengenes curators and will not be found in NCBI. Genus names with square brackets are contested names (usually due to polyphyly of the genus), some of which will be found in NCBI.
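
The square-bracket convention in item 5 can be checked mechanically. The following is a minimal sketch, not part of any Greengenes tooling: the function name, the rank labels, and the return values are assumptions made for illustration, and the example names are only sample inputs.

```python
# Hypothetical helper illustrating the square-bracket naming convention.
# Rank labels and return values are assumptions chosen for this sketch.
RANKS_ABOVE_GENUS = {"kingdom", "phylum", "class", "order", "family"}


def bracket_status(name: str, rank: str) -> str:
    """Classify a taxon name by the square-bracket convention.

    Returns "standard" for names without brackets, "contested" for
    bracketed genus names, "proposed" for bracketed names above genus,
    and "unspecified" for ranks the convention does not cover.
    """
    if not (name.startswith("[") and name.endswith("]")):
        return "standard"
    if rank == "genus":
        return "contested"
    if rank in RANKS_ABOVE_GENUS:
        return "proposed"
    return "unspecified"


if __name__ == "__main__":
    for name, rank in [("[Ruminococcus]", "genus"),
                       ("[Tissierellaceae]", "family"),
                       ("Clostridium", "genus")]:
        print(name, rank, bracket_status(name, rank))
```

A "proposed" result means the name should not be expected in NCBI, while a "contested" genus name may or may not be present there.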

As always, various taxonomic updates have been made, and over the last few months we have received valuable feedback about the taxonomy. Specifically, we'd like to thank the people who provided feedback on errors or inconsistencies in the taxonomy. Greengenes is a living project, and we sincerely value comments from the community and are working on better ways to solicit taxonomic feedback.

Methods. Sequences not previously observed were obtained from Genbank during January of 2013. These records were parsed for viable sequence. All obtained sequences were run through SSU-Align 0.1 (Nawrocki 2009). SSU-Align was run as follows: First, ssu-prep was specified with --dna and set to perform a parallel run. Second, the alignments were run through ssu-mask specifying --dna, to output as aligned FASTA (--afa) and to use the default SSU-Align masks (-d). Sequences that aligned were then filtered by length, dropping all reads shorter than 1200 nt, tallied after SSU-Align deleted bases not fitting its secondary structure model. The remaining sequences were then inflated to NAST width. This bag of reads was run through UCHIME. The reference database used was composed of consensus sequences from the 94% GG_12_10 OTUs, requiring a minimum of 10 sequences per cluster. Any sequence flagged as chimeric that bridged classes or higher ranks was dropped. Any sequence with >= 1% non-ACGT bases was dropped. Any sequence that varied at more than 10% of invariant 16S positions was dropped.

OTU picking. Sequences were sorted so as to increase the stability of the picked representative sequences. The following preferences were used:

  1. in the previous release
  2. correctly named isolate
  3. length

OTUs were picked using QIIME (Caporaso, et al. 2011) and UCLUST (Edgar, 2010) where the sequences clustered were unaligned 16S reads. UCLUST was run after sorting by the above criteria. Initial tree construction was performed using FastTree (Price, et al 2010) with the representative sequences from the 99% OTUs. The parameters specified to FastTree were "-nt -gamma -fastest -no2nd -spr 4" as recommended by FastTree's author, Morgan Price. A donor taxonomy based on the previous Greengenes release was decorated onto the tree. Taxonomy curation was performed on the 99% tree and expanded out to the full set of sequences. Taxonomy verification was performed to ensure that all paths through the tree contained all 7 levels of the taxonomy in proper order, and that the taxonomy itself formed a true hierarchy (custom scripts at the moment, see the note below about Greengenes source code).

File listing and descriptions. All files are prefixed with gg_<release> where the <release> is composed of <year>_<month>. For instance, gg_13_5 refers to the Greengenes release for May 2013.
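
The gg_<year>_<month> prefix can be decoded programmatically when sorting or comparing release files. This is a small sketch only: the function name is hypothetical, and reading the two-digit year as 20xx is an assumption that happens to fit the 12_10 and 13_5 releases named in this document.

```python
import re

# Matches the gg_<year>_<month> prefix at the start of a file name.
_RELEASE_PREFIX = re.compile(r"gg_(\d{1,2})_(\d{1,2})(?=[._]|$)")


def parse_release(filename: str) -> tuple[int, int]:
    """Return (year, month) for a name such as 'gg_13_5_taxonomy.txt.gz'.

    Assumption: the two-digit year counts from 2000.
    """
    match = _RELEASE_PREFIX.match(filename)
    if match is None:
        raise ValueError(f"no gg_<year>_<month> prefix in {filename!r}")
    year = 2000 + int(match.group(1))
    month = int(match.group(2))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {filename!r}")
    return year, month


if __name__ == "__main__":
    print(parse_release("gg_13_5_taxonomy.txt.gz"))  # (2013, 5)
```
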
00README
# This file
00CHANGELOG
# Important changes since the last release
00ROADMAP
# Planned changes and additions including release dates
00STATS
# Quick stats on the number of sequences included at various similarity levels
gg_13_5_otus_99_annotated.tree.gz
# 99% OTU rooted tree with taxonomy decorated
# Phylogenetic reconstruction performed with FastTree (Price, et al 2010)
# Base taxonomy decorated with tax2tree (McDonald, et al 2011)
gg_13_5_taxonomy.txt.gz
# Full taxonomy for every sequence in the release (an important file: one taxonomic record per sequence).
# All taxonomy strings are strictly 7 level and prefixed
gg_13_5.fasta.gz
# Full release unaligned sequences (no bases dropped).
gg_13_5_ssualign.fasta.gz
# Full release aligned sequences
# Sequences were aligned with SSU-Align (Nawrocki 2009). As a consequence of this software, bases corresponding to structural diversions from SSU-Align's model are removed from the sequence. It is not recommended to use this alignment for probe design or any other operation where you need access to all contiguous bases in the sequence.
# Each sequence is represented with 7,682 characters. Dashes (-) represent either missing data, as on the 5' and 3' termini, or an alignment gap, as is interspersed throughout the sequence.
gg_13_5_pynast.fasta.gz
# Near-full release of aligned sequences. Approximately 1400 sequences that were alignable by SSU-Align failed PyNAST (Caporaso 2010) alignment. The original Greengenes coreset was used which, due to out-of-date coverage, probably contributed to the alignment failures.
gg_13_5_accessions.txt.gz
# A mapping from Greengenes IDs to external databases
# This is primarily Genbank references, but includes a few hundred links to IMG genome IDs as there was not an automatic means to infer some of the NCBI accessions.
gg_13_5_img.txt.gz
# A mapping specifically between Greengenes IDs and IMG Genomes.
# Performed by accession, not nearest neighbor
gg_13_5_otus.tgz
# QIIME (Caporaso, et al 2011) compatible OTUs
# Sequences are sorted to place seed preference on:
#   1. in the previous release
#   2. named isolates
#   3. longer sequences
# Clusters are determined using QIIME-wrapped UCLUST (Edgar 2010)
# Now includes the representative sequences as aligned by SSU-Align and can be used as a template for PyNAST.
# Code can be obtained from
# https://github.com/qiime-dev/nested_reference_otus
# commit 821a98df6773ea4e4d209af20b9a8cf34d00324
gg_13_5.sql.gz
# Full Greengenes records
# This is a mysqldump. The user is 'greengenes' without a password. The database is named 'greengenes'
# NOTE: This database currently only contains sequence information for those records included in the release, however all examined Genbank records that contained alignable 16S are described. This is a work in progress with additional record data and functionality to be added in subsequent releases
gg_13_5_chimeras.txt.gz
# The current chimera blacklist
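
The "strictly 7 level and prefixed" rule for the taxonomy file lends itself to a quick sanity check. The sketch below rests on assumptions about the file layout that this listing does not spell out: one record per line, a Greengenes ID and a lineage separated by a tab, levels separated by semicolons, and the rank prefixes k__, p__, c__, o__, f__, g__, s__. The ID in the example is made up.

```python
import gzip

# Assumed rank prefixes, in order from kingdom to species.
PREFIXES = ("k__", "p__", "c__", "o__", "f__", "g__", "s__")


def parse_taxonomy_line(line: str) -> tuple[str, list[str]]:
    """Split one '<id><TAB><lineage>' record and verify its 7 prefixed levels."""
    gg_id, lineage = line.rstrip("\n").split("\t", 1)
    levels = [level.strip() for level in lineage.split(";")]
    if len(levels) != len(PREFIXES):
        raise ValueError(f"{gg_id}: expected 7 levels, found {len(levels)}")
    for level, prefix in zip(levels, PREFIXES):
        if not level.startswith(prefix):
            raise ValueError(f"{gg_id}: {level!r} does not start with {prefix!r}")
    return gg_id, levels


def iter_taxonomy(path: str):
    """Yield (id, levels) for each record in a gzipped taxonomy file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip():
                yield parse_taxonomy_line(line)


if __name__ == "__main__":
    example = ("12345\tk__Bacteria; p__Firmicutes; c__Clostridia; "
               "o__Clostridiales; f__Lachnospiraceae; g__[Ruminococcus]; s__")
    print(parse_taxonomy_line(example))
```

An empty level such as a bare s__ still passes, since the rule requires all seven prefixed levels to be present rather than all seven to be named.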

Source code. Parts of the Greengenes code base are provided here:

https://github.com/greengenes/Greengenes

However, the provided code is still quite limited. We're continuing to make changes on the backend of Greengenes, and the structure of the code base has not yet been finalized.
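
While the release scripts remain unpublished, two of the simpler steps from the Methods can be approximated for illustration: the length and non-ACGT filters, and the seed preference used before OTU picking. This is a sketch under stated assumptions, not the consortium's code. The Record fields and function names are hypothetical, sequences are assumed to be gap-free DNA strings, and the chimera and invariant-position checks are left out.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Hypothetical container; field names are assumptions for this sketch."""
    gg_id: str
    sequence: str
    in_previous_release: bool
    named_isolate: bool


def passes_filters(seq: str, min_length: int = 1200,
                   max_ambiguous_fraction: float = 0.01) -> bool:
    """Keep sequences of at least 1200 nt with fewer than 1% non-ACGT bases."""
    seq = seq.upper()
    if not seq or len(seq) < min_length:
        return False
    ambiguous = sum(1 for base in seq if base not in "ACGT")
    return ambiguous / len(seq) < max_ambiguous_fraction


def seed_sort_key(record: Record) -> tuple[bool, bool, int]:
    """Order: in the previous release, then named isolates, then longer sequences."""
    return (not record.in_previous_release,
            not record.named_isolate,
            -len(record.sequence))


if __name__ == "__main__":
    records = [
        Record("a", "ACGT" * 300, False, True),
        Record("b", "ACGT" * 310, True, False),
        Record("c", "ACGT" * 100, True, True),  # too short, filtered out
    ]
    kept = [r for r in records if passes_filters(r.sequence)]
    print([r.gg_id for r in sorted(kept, key=seed_sort_key)])
```

Sorting on a tuple keeps the three preferences strictly ordered, so length only breaks ties among records that agree on the first two criteria.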
