GATK-BWA-MEM handle GRCh38 alternate contig mappings
1. For the Impatient
# Download bwakit (or manually from <http://sourceforge.net/projects/bio-bwa/files/bwakit/>)
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.15_x64-linux.tar.bz2/download \
| bzip2 -dc | tar xf -
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
bwa.kit/run-gen-ref hs38DH # download GRCh38 and write hs38DH.fa
bwa.kit/bwa index hs38DH.fa # create BWA index
# mapping
bwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh # skip "|sh" to show command lines
This generates out.aln.bam as the final alignment, out.hla.top with the best HLA genotype for each gene, and out.hla.all with other possible HLA genotypes. See bwa/bwakit/README.md for details.
1.1 Introduction
Bwakit is a self-contained, installation-free package of scripts and precompiled binaries that provides an end-to-end solution to read mapping. In addition to the basic mapping functionality implemented in bwa, bwakit can generate a proper human reference genome and take advantage of ALT contigs, if present, to improve read mapping and to perform HLA typing for high-coverage human data. It can remap name- or coordinate-sorted BAM with read group and barcode information retained. Bwakit also optionally trims adapters (via trimadap), marks duplicates (via samblaster) and sorts the final alignment (via samtools).
Bwakit has two entry scripts: run-gen-ref, which downloads and generates human reference genomes, and run-bwamem, which prints mapping command lines on the standard output that can be piped to sh for execution. The two scripts call other programs or use data in bwa.kit. The following example shows how to use bwakit:
# Download the bwakit-0.7.15 binary package (download link may change)
wget -O- http://sourceforge.net/projects/bio-bwa/files/bwakit/bwakit-0.7.15_x64-linux.tar.bz2/download \
| bzip2 -dc | tar xf -
# Generate the GRCh38+ALT+decoy+HLA and create the BWA index
bwa.kit/run-gen-ref hs38DH # download GRCh38 and write hs38DH.fa
bwa.kit/bwa index hs38DH.fa # create BWA index
# mapping
bwa.kit/run-bwamem -o out -H hs38DH.fa read1.fq read2.fq | sh
The last mapping command line will generate the following files:
out.aln.bam: unsorted alignments with ALT-aware mapping quality. In this file, one read may be placed on multiple overlapping ALT contigs at the same time even if the read is mapped better to some contigs than others. This makes it possible to analyze each contig independent of others.
out.hla.top: best genotypes for HLA-A, -B, -C, -DQA1, -DQB1 and -DRB1 genes.
out.hla.all: other possible genotypes on the six HLA genes.
out.log.*: bwa-mem, samblaster and HLA typing log files.
Bwakit can be downloaded from <http://sourceforge.net/projects/bio-bwa/files/bwakit/>. It is available only for x86_64 Linux. The scripts in the package are available in the bwa/bwakit directory. Packaging is done manually for now.
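Because run-bwamem prints command lines instead of running them, the same invocation serves as both a dry run and a real run. The sketch below illustrates only that print-then-pipe pattern: print_pipeline is a hypothetical stand-in that echoes two placeholder commands, and it does not reproduce what run-bwamem actually emits.

```shell
#!/bin/sh
# Hypothetical stand-in for run-bwamem: it only PRINTS commands, it runs nothing.
print_pipeline() {
    prefix=$1
    printf 'echo "mapping reads for %s"\n' "$prefix"
    printf 'echo "writing %s.aln.bam"\n' "$prefix"
}

# Dry run: inspect the generated command lines.
print_pipeline out

# Real run: feed the same text to sh, which executes it line by line.
print_pipeline out | sh
```

Keeping generation and execution separate makes it easy to save the printed commands to a file, review or edit them, and run them later.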
1.2 Limitations
HLA typing only works for high-coverage human data. The typing accuracy can still be improved. We encourage researchers to develop better HLA typing tools based on the intermediate output of bwakit (for each HLA gene included in the index, bwakit writes all reads matching it in a separate file).
Duplicate marking only works when all reads from a single paired-end library are provided as the input. This limitation is the necessary tradeoff of fast MarkDuplicate provided by samblaster.
The adapter trimmer is chosen as it is fast, pipe friendly and does not discard reads. However, it is conservative and suboptimal. If this is a concern, it is recommended to preprocess input reads with a more sophisticated adapter trimmer. We also hope existing trimmers can be modified to operate on an interleaved FASTQ stream. We will replace trimadap once a better trimmer meets our needs.
Bwakit can be memory-demanding, depending on the functionality invoked. For 30X human data, bwa-mem takes about 11GB RAM with 32 threads, samblaster uses close to 10GB and BAM shuffling (if the input is sorted BAM) uses several GB. In the current setting, sorting uses about 10GB.
1.3 Package Contents
bwa.kit
|-- README.md                  This README file.
|-- run-bwamem                 *Entry script* for the entire mapping pipeline.
|-- bwa                        *BWA binary*
|-- k8                         Interpreter for *.js scripts.
|-- bwa-postalt.js             Post-process alignments to ALT contigs/decoys/HLA genes.
|-- htsbox                     Used by run-bwamem for shuffling BAMs and BAM=>FASTQ.
|-- samblaster                 MarkDuplicates for reads from the same library. v0.1.20
|-- samtools                   SAMtools for sorting and SAM=>BAM conversion. v1.1
|-- seqtk                      For FASTQ manipulation.
|-- trimadap                   Trim Illumina PE sequencing adapters.
|
|-- run-gen-ref                *Entry script* for generating human reference genomes.
|-- resource-GRCh38            Resources for generating GRCh38
|   |-- hs38DH-extra.fa        Decoy and HLA gene sequences. Used by run-gen-ref.
|   `-- hs38DH.fa.alt          ALT-to-GRCh38 alignment. Used by run-gen-ref.
|
|-- run-HLA                    HLA typing for sequences extracted by bwa-postalt.js.
|-- typeHLA.sh                 Type one HLA-gene. Called by run-HLA.
|-- typeHLA.js                 HLA typing from exon-to-contig alignment. Used by typeHLA.sh.
|-- typeHLA-selctg.js          Select contigs overlapping HLA exons. Used by typeHLA.sh.
|-- fermi2.pl                  Fermi2 wrapper. Used by typeHLA.sh for de novo assembly.
|-- fermi2                     Fermi2 binary. Used by fermi2.pl.
|-- ropebwt2                   RopeBWT2 binary. Used by fermi2.pl.
|-- resource-human-HLA         Resources for HLA typing
|   |-- HLA-ALT-exons.bed      Exonic regions of HLA ALT contigs. Used by typeHLA.sh.
|   |-- HLA-CDS.fa             CDS of HLA-{A,B,C,DQA1,DQB1,DRB1} genes from IMGT/HLA-3.18.0.
|   |-- HLA-ALT-type.txt       HLA types for each HLA ALT contig. Not used.
|   `-- HLA-ALT-idx            BWA indices of each HLA ALT contig. Used by typeHLA.sh
|       `-- (...)
|
`-- doc                        BWA documentation
    |-- bwa.1                  Manpage
    |-- NEWS.md                Release Notes
    |-- README.md              GitHub README page
    `-- README-alt.md          Documentation for ALT mapping
2. Background
GRCh38 consists of several components: chromosomal assembly, unlocalized contigs (chromosome known but location unknown), unplaced contigs (chromosome unknown) and ALT contigs (long clustered variations). The combination of the first three components is called the primary assembly. It is recommended to use the complete primary assembly for all analyses. Using ALT contigs in read mapping is tricky.
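The components can usually be told apart by sequence name. The sketch below assumes the UCSC-style naming used in hs38DH.fa (suffixes such as _alt, _random and _decoy, and the chrUn_ and HLA- prefixes). These patterns and the example names are assumptions about the FASTA headers rather than something stated above, so check them against your own reference before relying on them.

```shell
#!/bin/sh
# Classify a reference sequence name by its naming pattern.
# Assumption: UCSC-style names as found in hs38DH.fa.
classify_contig() {
    case $1 in
        HLA-*)     echo "HLA" ;;
        *_decoy)   echo "decoy" ;;
        *_alt)     echo "ALT" ;;
        chrUn_*)   echo "unplaced" ;;
        *_random)  echo "unlocalized" ;;
        chr[0-9]|chr[0-9][0-9]|chrX|chrY|chrM) echo "chromosome" ;;
        *)         echo "other" ;;
    esac
}

# Example names; chromosome, unlocalized and unplaced together
# form the primary assembly.
for name in chr1 chr1_KI270706v1_random chrUn_KI270302v1 \
            chr1_KI270762v1_alt 'HLA-A*01:01:01:01'; do
    printf '%s\t%s\n' "$name" "$(classify_contig "$name")"
done
```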
GRCh38 ALT contigs total 109Mbp in length and span 60Mbp of the primary assembly. However, sequences that are highly diverged from the primary assembly contribute only a few million bp; most subsequences of ALT contigs are nearly identical to the primary assembly. If we blindly align sequence reads to GRCh38+ALT, many additional reads will get zero mapping quality because they match the primary assembly and an ALT contig equally well, and we will miss variants on them. It is crucial to make mappers aware of ALTs.
BWA-MEM is ALT-aware. It essentially computes mapping quality across the non-redundant content of the primary assembly plus the ALT contigs and is free of the problem above.
3. Methods
Sequence alignment
As of now, ALT mapping is done in two separate steps: BWA-MEM mapping and postprocessing. The bwa.kit/run-bwamem script performs both steps when ALT contigs are present.
[Figure: example of how BWA-MEM infers mapping quality and reports alignments after step 2]
Step 1: BWA-MEM mapping
At this step, BWA-MEM reads the ALT contig names from "idxbase.alt", ignoring the ALT-to-ref alignment, and labels a potential hit as ALT or non-ALT, depending on whether the hit lands on an ALT contig or not. BWA-MEM then reports alignments and assigns mapQ following these two rules:
The mapQ of a non-ALT hit is computed across non-ALT hits only. The mapQ of an ALT hit is computed across all hits.
If there are no non-ALT hits, the best ALT hit is output as the primary alignment. If there are both ALT and non-ALT hits, non-ALT hits will be primary and ALT hits will be supplementary (SAM flag 0x800).
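The two rules can be sketched as a toy model for the hits of a single read. Everything here is illustrative: label_hits is a hypothetical helper, ALT contigs are recognized by an _alt suffix instead of being read from idxbase.alt, and "gap" (own score minus the best competing score) is only a stand-in for mapQ, not the formula BWA-MEM uses.

```shell
#!/bin/sh
# Toy model of the two step-1 rules for the hits of ONE read.
# Input lines: "<contig> <alignment score>".
label_hits() {
    awk '
    {
        ctg[NR] = $1; sc[NR] = $2 + 0; alt[NR] = ($1 ~ /_alt$/)
        if (!alt[NR]) { n_ref++ }
        else if (best_alt == 0 || sc[NR] > sc[best_alt]) { best_alt = NR }
    }
    END {
        for (i = 1; i <= NR; i++) {
            rival = -1
            for (j = 1; j <= NR; j++) {
                if (j == i) continue
                # Rule 1: a non-ALT hit competes with non-ALT hits only;
                # an ALT hit competes with every other hit.
                if (!alt[i] && alt[j]) continue
                if (sc[j] > rival) rival = sc[j]
            }
            gap = (rival < 0) ? sc[i] : sc[i] - rival
            if (gap < 0) gap = 0
            # Rule 2: ALT hits are supplementary when a non-ALT hit exists;
            # otherwise the best ALT hit is the primary alignment.
            if (!alt[i]) { role = "primary" }
            else if (n_ref > 0) { role = "supplementary" }
            else { role = (i == best_alt) ? "primary" : "not-primary" }
            printf "%s\t%s\t%s\tgap=%d\n", ctg[i], (alt[i] ? "ALT" : "non-ALT"), role, gap
        }
    }'
}

# A read that fits an ALT contig slightly better than the chromosome.
printf '%s\n' 'chr6 90' 'chr6_GL000250v2_alt 100' | label_hits
```

In this example the chr6 hit keeps a large gap because the ALT hit is not counted against it, while the ALT hit is compared with the chr6 hit and ends up with a small gap.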
In theory, non-ALT alignments from step 1 should be identical to alignments against the reference genome without ALT contigs, because the mapQ of a non-ALT hit is computed across non-ALT hits only. In practice, the two types of alignments may differ in rare cases due to seeding heuristics. When an ALT hit is significantly better than non-ALT hits, BWA-MEM may miss seeds on the non-ALT hits.
If we don't care about ALT hits, we may skip postprocessing (step 2). Nonetheless, postprocessing is recommended as it improves mapQ and gives more information about ALT hits.
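To see how many ALT hits were reported as supplementary records, one can test the 0x800 bit of the SAM FLAG field together with the contig name. This is a minimal sketch: count_alt_supplementary is a hypothetical helper, ALT contigs are again assumed to carry an _alt suffix, and the two SAM records are made up. Note that 0x800 also marks ordinary supplementary (chimeric) alignments, which is why the contig name is checked as well.

```shell
#!/bin/sh
# Count SAM records that carry the supplementary flag (0x800 = 2048) and
# sit on a contig whose name ends in "_alt" (naming assumption).
count_alt_supplementary() {
    awk -F '\t' '
    /^@/ { next }                      # skip header lines
    {
        supp = int($2 / 2048) % 2      # portable test of the 0x800 bit
        if (supp && $3 ~ /_alt$/) n++
    }
    END { print n + 0 }'
}

# Two made-up records for one read: a primary hit on chr6 and an ALT hit.
{
    printf 'r1\t0\tchr6\t29942500\t60\t50M\t*\t0\t0\t*\t*\n'
    printf 'r1\t2048\tchr6_GL000250v2_alt\t1200\t60\t50M\t*\t0\t0\t*\t*\n'
} | count_alt_supplementary
```

On real data the same function can read the text output of samtools view, for example `samtools view out.aln.bam | count_alt_supplementary`.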
Step 2: Postprocessing
Postprocessing is done with a separate script bwa-postalt.js. It reads all potential hits reported in the XA tag, lifts ALT hits to the chromosomal positions using the ALT-to-ref alignment, groups them based on overlaps between their lifted positions, and then re-estimates mapQ across the best scoring hit in each group. Being aware of the ALT-to-ref alignment, this script can greatly improve mapQ of ALT hits and occasionally improve mapQ of non-ALT hits. It also writes each hit overlapping the reported hit into a separate SAM line. This enables variant calling on each ALT contig independent of others.
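The lift, group and select steps can be sketched as follows. This is a toy illustration, not a port of bwa-postalt.js: best_per_group and ALT_MAP are hypothetical names, only one ALT contig is supported, and lifting is reduced to adding a made-up constant offset, whereas the real script uses the ALT-to-ref alignment.

```shell
#!/bin/sh
# Toy version of "lift, group by overlap, keep the best hit per group".
# Input: "<contig> <start> <end> <score>" for the hits of one read.
# ALT_MAP holds "<ALT contig> <chromosome> <offset>" (made-up values).
ALT_MAP='chr6_demo_alt chr6 1000'

best_per_group() {
    awk -v map="$ALT_MAP" '
    BEGIN { split(map, m, " "); chrom[m[1]] = m[2]; off[m[1]] = m[3] }
    {
        c = $1; s = $2 + 0; e = $3 + 0
        if (c in chrom) { s += off[c]; e += off[c]; c = chrom[c] }  # lift
        print c, s, e, $4, $1
    }' | LC_ALL=C sort -k1,1 -k2,2n | awk '
    {
        # Start a new group when the hit does not overlap the current one.
        if ($1 != gc || $2 + 0 >= ge) {
            if (NR > 1) print best
            gc = $1; ge = $3 + 0; bs = -1
        }
        if ($3 + 0 > ge) ge = $3 + 0
        if ($4 + 0 > bs) { bs = $4 + 0; best = $5 " score=" $4 }
    }
    END { if (NR > 0) print best }'
}

# The chr6 hit and the lifted ALT hit overlap, so they form one group.
printf '%s\n' 'chr6 1100 1200 90' 'chr6_demo_alt 120 220 100' 'chr1 500 600 40' \
    | best_per_group
```

Only the representatives printed here (one per group) would then compete when mapQ is re-estimated, so an ALT hit is no longer penalized by the chromosomal hit it overlaps after lifting.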
On the completeness of GRCh38+ALT
While GRCh38 is much more complete than GRCh37, it is still missing some true human sequences. To make sure every piece of sequence in the reference assembly is correct, the Genome Reference Consortium (GRC) requires each ALT contig to have enough support from multiple sources before considering adding it to the reference assembly. This careful and sophisticated procedure has left out some sequences, one example being a 10kb contig assembled from CHM1 short reads and also present in NA12878. You can try BLAT or BLAST to see where it maps.
For a more complete reference genome, we compiled a new set of decoy sequences from GenBank clones and the de novo assembly of 254 public SGDP samples. The sequences are included in hs38DH-extra.fa from the BWA binary package.
In addition to decoy, we also put multiple alleles of HLA genes in hs38DH-extra.fa. These genomic sequences were acquired from IMGT/HLA, version 3.18.0 and are used to collect reads sequenced from these genes.
HLA typing
HLA genes are known to be associated with many autoimmune diseases, infectious diseases and drug responses. They are among the most important genes but are rarely studied by WGS projects due to the high sequence divergence between HLA genes and the reference genome in these regions.
By including the HLA gene regions in the reference assembly as ALT contigs, we are able to effectively identify reads coming from these genes. We also provide a pipeline, included in the BWA binary package, to type several classic HLA genes. The pipeline is conceptually simple. It de novo assembles sequence reads mapped to each gene, aligns exon sequences of each allele to the assembled contigs and then finds the pairs of alleles that best explain the contigs. In practice, however, the completeness of IMGT/HLA and copy-number changes related to these genes are not so straightforward to resolve, and HLA typing may not always be successful. Users may also consider using other programs for typing, such as those of Warren et al (2012), Liu et al (2013), Bai et al (2014) and Dilthey et al (2014), though most of them are distributed under restrictive licenses.
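The last step, finding the pair of alleles that best explains the contigs, can be pictured with a toy scoring scheme. The sketch below is an assumption-laden illustration and not what typeHLA.js does: best_allele_pair is a hypothetical helper, the input is a complete table of made-up mismatch counts between allele exons and contigs, and a pair is scored by letting each contig be explained by whichever of the two alleles fits it better.

```shell
#!/bin/sh
# Toy allele-pair selection.
# Input lines: "<allele> <contig> <mismatches>", one line for every
# allele/contig combination (missing combinations would count as 0).
best_allele_pair() {
    awk '
    {
        if (!($1 in seen_a)) { seen_a[$1] = 1; a[++na] = $1 }
        if (!($2 in seen_c)) { seen_c[$2] = 1; c[++nc] = $2 }
        mm[$1, $2] = $3 + 0
    }
    END {
        best = -1
        for (i = 1; i <= na; i++)
            for (j = i; j <= na; j++) {
                cost = 0
                for (k = 1; k <= nc; k++) {
                    x = mm[a[i], c[k]]; y = mm[a[j], c[k]]
                    cost += (x < y) ? x : y    # contig explained by the better allele
                }
                if (best < 0 || cost < best) { best = cost; pair = a[i] "/" a[j] }
            }
        print pair, "mismatches=" best
    }'
}

# Made-up mismatch counts for three alleles and two assembled contigs.
printf '%s\n' \
    'A*01:01 ctg1 0' 'A*01:01 ctg2 5' \
    'A*02:01 ctg1 4' 'A*02:01 ctg2 0' \
    'A*03:01 ctg1 3' 'A*03:01 ctg2 3' | best_allele_pair
```

Homozygous genotypes are covered because the inner loop starts at j = i, so an allele may be paired with itself.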
4. Preliminary Evaluation
To check whether GRCh38 is better than GRCh37, we mapped the CHM1 and NA12878 unitigs to GRCh37 primary (hs37), GRCh38 primary (hs38) and GRCh38+ALT+decoy (hs38DH), and called small variants from the alignment. CHM1 is haploid. Ideally, heterozygous calls are false positives (FP). NA12878 is diploid. The true positive (TP) heterozygous calls from NA12878 are approximately equal to the difference between NA12878 and CHM1 heterozygous calls. A better assembly should yield higher TP and lower FP. The following table shows the numbers for these assemblies:
| Assembly | hs37 | hs38 | hs38DH | CHM1_1.1 | huref |
|---|---|---|---|---|---|
| FP | 255706 | 168068 | 142516 | 307172 | 575634 |
| TP | 2142260 | 2163113 | 2150844 | 2167235 | 2137053 |
With this measurement, hs38 is clearly better than hs37. Genome hs38DH reduces FP by ~25k but also reduces TP by ~12k. We manually inspected variants called from hs38 only and found the majority of them are associated with excessive read depth, clustered variants or weak alignment. We believe most hs38-only calls are problematic. In addition, if we compare two NA12878 replicates from HiSeq X10 with nearly identical library construction, the difference is ~140k, an order of magnitude higher than the difference between hs38 and hs38DH. ALT contigs, decoy and HLA genes in hs38DH improve variant calling and enable the analyses of ALT contigs and HLA typing at little cost.
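The rounded figures quoted above can be rechecked from the table with shell arithmetic. The variable names are ad hoc, and the ~140k replicate difference is taken from the text as a round 140000.

```shell
#!/bin/sh
# Numbers copied from the FP and TP rows of the table.
fp_hs37=255706;  fp_hs38=168068;  fp_hs38DH=142516
tp_hs38=2163113; tp_hs38DH=2150844

fp_drop_38=$((fp_hs37 - fp_hs38))      # FP removed by moving hs37 -> hs38
fp_drop=$((fp_hs38 - fp_hs38DH))       # FP removed by moving hs38 -> hs38DH
tp_drop=$((tp_hs38 - tp_hs38DH))       # TP lost by moving hs38 -> hs38DH

echo "FP reduction hs37 -> hs38:   $fp_drop_38"
echo "FP reduction hs38 -> hs38DH: $fp_drop"
echo "TP reduction hs38 -> hs38DH: $tp_drop"
# Integer ratio of the ~140k replicate difference to the TP reduction.
echo "replicate difference / TP reduction: $((140000 / tp_drop))"
```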
5. Problems and Future Development
There are some uncertainties about ALT mappings: we are not sure whether they help biological discovery, and we do not know the best way to analyze them. Without clear demand from downstream analyses, it is very difficult to design the optimal mapping strategy. The current BWA-MEM method is just a start. If it turns out to be useful in research, we will probably rewrite bwa-postalt.js in C for performance; if not, we may make changes. It is also possible that we will make a breakthrough in the representation of multiple genomes, in which case we could get rid of ALT contigs for good.