Today someone asked me how to make sense of raw data from third-generation (PacBio) sequencing. I managed to answer their question, but I still feel I haven't fully grasped it myself.

Basic directory structure:

|-- HG002new_O1l_BP_P6_021315b_MB_100pM
| |-- D01_1.c60e446d-f276-41fc-9384-ffa937e22683.tar.gz
| |-- D01_2.19ee4f13-c420-4974-8262-cb1da56beccd.tar.gz
| |-- D01_3.94e34f0a-eef3-4b71-8f1b-c9790dec647e.tar.gz
| |-- D01_4.53ef7aed-e91e-46f9-bb71-8b021344b951.tar.gz
| |-- D01_5.55b1f7cb-ad44-4afb-bf2b-5c34fcb0a210.tar.gz
| `-- D01_6.b9b564dc-b794-4a7f-bc3b-854a7bc32887.tar.gz
`-- pacbio_README
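
Before unpacking everything, it can help to peek at what one of these archives holds. Below is a minimal Python sketch; the archive name is copied from the listing above, and the output directory is an arbitrary choice of mine:

import tarfile

# One of the downloaded cell archives from the listing above; adjust the path as needed.
archive = "HG002new_O1l_BP_P6_021315b_MB_100pM/D01_1.c60e446d-f276-41fc-9384-ffa937e22683.tar.gz"

with tarfile.open(archive, "r:gz") as tar:
    # List every member first, then unpack into a per-cell directory.
    for member in tar.getmembers():
        print(member.name, member.size)
    tar.extractall(path="D01_1")   # "D01_1" is just an illustrative output directory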

Directory structure after extraction:

Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.1.bax.h5
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.1.log
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.1.subreads.fasta
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.1.subreads.fastq
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.2.bax.h5
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.2.log
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.2.subreads.fasta
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.2.subreads.fastq
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.3.bax.h5
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.3.log
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.3.subreads.fasta
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.3.subreads.fastq
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.bas.h5
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.sts.csv
Analysis_Results/m140612_020550_42156_c100652082550000001823118110071460_s1_p0.sts.xml
m140612_020550_42156_c100652082550000001823118110071460_s1_p0.1.xfer.xml
m140612_020550_42156_c100652082550000001823118110071460_s1_p0.2.xfer.xml
m140612_020550_42156_c100652082550000001823118110071460_s1_p0.3.xfer.xml
m140612_020550_42156_c100652082550000001823118110071460_s1_p0.mcd.h5
m140612_020550_42156_c100652082550000001823118110071460_s1_p0.metadata.xml

As you can see, the data are stored in HDF5 format; for an introduction to the format, see "The HDF5 format for PacBio sequences".
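
Assuming h5py is installed, a quick way to get oriented is simply to walk the HDF5 tree of one bax.h5 file; this is only a sketch for exploration, not a full parser:

import h5py

# One of the *.bax.h5 files from the listing above.
path = ("Analysis_Results/"
        "m140612_020550_42156_c100652082550000001823118110071460_s1_p0.1.bax.h5")

with h5py.File(path, "r") as f:
    # visit() prints the path of every group/dataset; on RS II bax.h5 files the
    # base calls typically live under /PulseData/BaseCalls.
    f.visit(print)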

So what do the directory and file names above actually mean? Reading the accompanying README carefully makes it clear.

============
Introduction
============

These directories contain data from PacBio sequencing of HG002, HG003, and HG004,
which are the son, father, and mother, respectively, in a trio of Ashkenazim Jewish
ancestry from the Personal Genome Project and are candidate Reference Materials being
characterized by NIST and the Genome in a Bottle Consortium. The coverage is approximately
69X, 32X, and 30X for HG002, HG003, and HG004, respectively. 89.7% of the data is
from P6-C4 chemistry, and the remainder from P5-C3 chemistry. Library preparation was
performed at NIST and sequencing was performed at Mt. Sinai School of Medicine.
Details of the library preparation, sequencing, and data are provided below.

==============================================
Library Preparation (include reagent versions)
==============================================

SMRTbell library preparation and sequencing of HG002, HG003, and HG004 AJ Trio gDNA

DNA library preparation and sequencing was performed according to the manufacturer's
instructions with noted modifications. Following the Pacific Biosciences Protocol,
"20-kb Template Preparation Using Blue Pippin Size-Selection System", library
preparation was performed using the Pacific Biosciences SMRTbell Template Prep Kit 1.0
(PN # 100-259-100). In short, 10 µg of extracted, high-quality, genomic DNA from each
of HG-002, HG-003, and HG-004, the AJ trio, were used for library preparation. Genomic
DNA extracts were verified with the Life Technologies Qubit 2.0 Fluorometer using the
High Sensitivity dsDNA assay (PN# Q32851) to quantify the mass of double-stranded DNA
present. After quantification, each sample was diluted to 150 µL, using kit-provided
EB, yielding a concentration of ~66 ng/µL. The 150 µL aliquots were individually
pipetted into the top chambers of Covaris G-tube (PN# 520079) spin columns and sheared
for 60 seconds at 4500 rpm using an Eppendorf 5424 benchtop centrifuge. Once complete,
the spin columns were flipped after verifying that all DNA was now in the lower chamber.
The columns were spun for another 60 seconds at 4500 rpm to further shear the DNA and
place the aliquot back into the upper chamber. In some cases G-tubes were centrifuged
2-3 times, in both directions, to ensure all volume had passed into the appropriate
chamber. Shearing resulted in ~20,000 bp DNA fragments, verified using an Agilent
Bioanalyzer DNA 12000 gel chip (PN# 5067-1508). The sheared DNA isolates were then
purified using a 0.5X AMPure PB magnetic bead purification step (0.5X AMPure PB beads
added, by volume, to each DNA sample, vortexed for 10 minutes at 2,000 rpm, followed
by two washes with 70% alcohol, and finally eluted in EB). This AMPure purification
step assures removal of any small fragments and/or biological contaminants. The sheared
DNA concentration was then measured using the Qubit High Sensitivity dsDNA assay.
These values were used to calculate actual input mass for library preparation following
shearing and purification.

 
After purification, ~8-9 µg of each purified, sheared sample went through the following
library preparation process per the aforementioned protocol:

------------------------------------
         ExoVII Treatment
       (remove ssDNA ends)
                |
    DNA Fragment Damage Repair
                |
     DNA Fragment End Repair
                |
  Purify Blunt-Ended DNA Fragments
                |
Blunt End SMRTbell Adapter Ligation
                |
      Exonuclease Treatment
  (remove failed ligation product)
                |
   Size Selection using BluePippin
                |
Clean and Concentrate Final Library
-------------------------------------

   
All library preparation reaction volumes were scaled to accommodate input mass for a
given sample. Library size selection was performed using the Sage Science BluePippin
0.75% Agarose, Dye Free, PacBio ~20kb templates, S1 cassette (PN# PAC20KB). Size
selections were run overnight to maximize recovered mass. Approximately 2-5 µg of
prepared libraries were size selected using a 10 kb start and 50 kb end in "Range" mode.
This selection is necessary to narrow the library distribution and maximize the SMRTbell
sub-read length for the best de novo assembly possible. Without selection, smaller
2,000-10,000 bp molecules dominate the zero-mode waveguide loading distribution,
decreasing the sub-read length. Size selection was confirmed by running pre- and
post-size-selection DNA on an Agilent DNA 12000 chip. Final library mass was measured
using the Qubit High Sensitivity dsDNA Assay. Approximately 15-20% of the initial gDNA
input mass resulted after elution from the agarose cassette, which was enough yield to
proceed to primer annealing and DNA sequencing on the PacBio RSII instrument. This
entire library preparation and selection strategy was conducted 7, 2, and 2 times for
HG002, HG003, and HG004, respectively, to provide enough library for the duration of
this project.

==================================================
Sequencing (include chemistry/instrument versions)
==================================================

Sequencing used the P6 sequencing enzyme and C4 chemistry. (Note that 10.3% of the data
was collected using the P5-C3 enzyme/chemistry prior to the release of the P6-C4 enzyme
and chemistry.) Primer was annealed to the size-selected SMRTbell with the full-length
libraries (80°C for 2 minutes 30 seconds, followed by decreasing the temperature by
0.1°C/s to 25°C). To prepare the polymerase-template complex, the SMRTbell template
complex was then bound to the P6 enzyme using the Pacific Biosciences DNA Polymerase
Binding Kit P6 v2 (PN# 100-372-700). A ratio of 10:1, polymerase to SMRTbell at 0.5 nM,
was prepared and incubated for 4 hours at 30°C and then held at 4°C until ready for
magbead loading prior to sequencing. The magnetic bead-loading step was conducted using
the Pacific Biosciences MagBead Kit (PN# 100-133-600) at 4°C for 60 minutes per the
manufacturer's guidelines. The magbead-loaded, polymerase-bound SMRTbell libraries were
placed onto the RSII instrument at sequencing concentrations of 100 to 40 pM to optimize
loading across the various SMRTcells. Sequencing was performed using the C4 chemistry
provided in the Pacific Biosciences DNA Sequence Bundle 4.0 (PN# 100-356-400). The RSII
was then configured for 240-minute continuous sequencing runs.

========================================
Preliminary Analyses and Quality Control
========================================

Assuming a 3.2 Gb human genome, sequencing was conducted to approximately 69X, 32X,
and 30X coverage for HG002, HG003, and HG004 across 292, 139, and 132 SMRTcells,
respectively. 27.4M, 13.2M, and 12.4M subreads were generated, resulting in 220.0,
101.6, and 94.9 Gb of sequence data with sub-read length N50 values of 11,087, 10,728,
and 10,629 basepairs.
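
The coverage and N50 figures above follow directly from the subread lengths, so they are easy to sanity-check. Here is a rough Python sketch of the arithmetic, using made-up lengths rather than the real data:

GENOME_SIZE = 3.2e9  # assumed human genome size, as in the README

def n50(lengths):
    """Smallest length L such that subreads of length >= L contain half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0

# Hypothetical subread lengths; in practice these would come from the *.subreads.fasta files.
lengths = [15000, 12000, 11000, 9000, 7000, 3000]
print("total bases:", sum(lengths))
print("coverage:   ", sum(lengths) / GENOME_SIZE)
print("subread N50:", n50(lengths))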

================================
File/Directory naming convention
================================

The file/directory naming convention is defined as follows:

[SampleName]/[WellName]_[CollectionNumber].[UUID].tar.gz

Note that SampleName may contain other genomes in the name, but the data directories
only contain data from HG002, HG003, and HG004. For example, for a SampleName of
HG002new_O1_BP_P6_021815_MB_105pM, a WellName of A01, and a CollectionNumber of 3, you
will see a tar.gz file in the HG002new_O1_BP_P6_021815_MB_105pM directory with the name
A01_3.[UUID].tar.gz. The UUID is currently used only for hashing purposes.
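
Given that convention, pulling the pieces back out of a path is a one-regex job. A Python sketch follows; the well-name pattern and the example path are my own assumptions, stitched together from the example above and a UUID from the earlier listing:

import re

# [SampleName]/[WellName]_[CollectionNumber].[UUID].tar.gz
PATTERN = re.compile(
    r"(?P<sample>[^/]+)/"
    r"(?P<well>[A-Z]\d{2})_(?P<collection>\d+)\."
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\.tar\.gz$"
)

path = "HG002new_O1_BP_P6_021815_MB_105pM/A01_3.c60e446d-f276-41fc-9384-ffa937e22683.tar.gz"
match = PATTERN.match(path)
if match:
    print(match.groupdict())
    # {'sample': 'HG002new_O1_BP_P6_021815_MB_105pM', 'well': 'A01',
    #  'collection': '3', 'uuid': 'c60e446d-f276-41fc-9384-ffa937e22683'}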

The tar.gz file contains the raw SMRTPortal data, including the following contents:

tar.gz
|   [movie name].1.xfer.xml
|   [movie name].2.xfer.xml
|   [movie name].3.xfer.xml
|   [movie name].mcd.h5
|   [movie name].metadata.xml
\---Analysis_Results
    |   [movie name].1.bax.h5
    |   [movie name].1.log
    |   [movie name].1.subreads.fasta
    |   [movie name].1.subreads.fastq
    |   [movie name].2.bax.h5
    |   [movie name].2.log
    |   [movie name].2.subreads.fasta
    |   [movie name].2.subreads.fastq
    |   [movie name].3.bax.h5
    |   [movie name].3.log
    |   [movie name].3.subreads.fasta
    |   [movie name].3.subreads.fastq
    |   [movie name].bas.h5
    |   [movie name].sts.csv
    |   [movie name].sts.xml

The metadata.xml contains all the metadata of this particular sample in XML format; for
example, in the TemplatePrep field you might see "DNA Template Prep Kit 2.0 (3Kb - 10Kb),"
and in the BindingKit field you might see "DNA/Polymerase Binding Kit P6," etc.
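
I have not seen a formal schema for the RS II metadata.xml, so the sketch below simply walks the XML tree and prints any element whose (namespace-stripped) tag matches one of the field names mentioned above; treat it as a starting point only:

import xml.etree.ElementTree as ET

path = "m140612_020550_42156_c100652082550000001823118110071460_s1_p0.metadata.xml"
root = ET.parse(path).getroot()

wanted = {"TemplatePrep", "BindingKit"}  # field names mentioned in the README
for elem in root.iter():
    tag = elem.tag.rsplit("}", 1)[-1]  # strip any "{namespace}" prefix
    if tag in wanted and elem.text and elem.text.strip():
        print(tag + ":", elem.text.strip())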

For information about bas.h5/bax.h5 files, please see:
http://files.pacb.com/software/instrument/2.0.0/bas.h5%20Reference%20Guide.pdf

For information about subreads, please see:
https://speakerdeck.com/pacbio/track-1-de-novo-assembly
