Why you should QC your reads AND your assembly?
The genome sequence of the Common Carp Cyprinus carpio was published in Nature last week. By coincidence, I was doing some QC on some domesticated Ferret (Mustela ptorius furo) reads, which had thrown some kmer warnings in the FastQC tool. I blasted the kmers in NCBI and was quite perplexed by the number of hits that I found in the carp genome. Nearly all of the first 150 hits were all from the carp genome. Anyway, I looked a bit further into my odd kmers and it turns out that they were the ends of some Illumina adapter sequences that had presumably been incorporated into the paired-reads on the shorter ends of the insert size. This then took me back to the Carp Genome - what had creeped into that?
The only quality control they speak of is "We then filtered out low-quality and short reads to obtain a set of usable reads".
So I thought I'd look at what was actually in their assembly. I downloaded the Carp genome assembly (9377 scaffolds) and created a blast database from it and then created a fasta file of Illumina adapter sequences (found here) and used them as query sequences to blast against the Carp genome. There is some redundancy in the Illumina adapter sequences, so I collapsed them, so retaining only unique sequences and then removed any adapter sequences that were sub-sequences of longer adapter (the final file consisted of 81 sequences). The blast resulted in 3750 hits (evalue < 8.00E-06) of which 1009 were of 100% identity.
This gave me a final tally of at least 20 Illumina adapter sequences incorporated into the final Common Carp genome assembly. Out of the 9377 scaffolds, 277 appears to have Illumina Adapter sequences in them. I've included the counts of the different Illumina adapter sequences (non-redundant) for the scaffolds at the bottom of the page.
I've not looked for adapter sequences used in Solid or 454 sequencing yet. It would be interesting to see what that throws up.
So, a lesson to be learned here. QC your assembly, especially if you're not overly stringent with your read QC.
Here's the data:
Common Carp genome scaffolds
Illumina adapter sequences
Illumina adapter sequences collapsed
Illumina adapters v Carp genome blast
Why you should QC your reads AND your assembly?的更多相关文章
- 10、RNA-seq for DE analysis training(Mapping to assign reads to genes)
1.Goal of mapping 1)We want to assign reads to genes they were derived from 2)The result of the mapp ...
- samtools flagstat
samtools flagstat命令简介: 统计输入文件的相关数据并将这些数据输出至屏幕显示.每一项统计数据都由两部分组成,分别是QC pass和QC failed,表示通过QC的reads数据量和 ...
- sam/bam格式
1)Sam (Sequence Alignment/Map) ------------------------------------------------- 1) SAM 文件产生背景 随着Ill ...
- bam文件测序深度统计-bamdst
最近接触的数据都是靶向测序,或者全外测序的数据.对数据的覆盖深度及靶向捕获效率的评估成为了数据质量监控中必不可少的一环. 以前都是用samtools depth 算出单碱基的深度后,用perl来进行深 ...
- [题解+总结]NOIP2010-2015后四题汇总
1.前言 正式开始的第一周的任务--把NOIP2010至NOIP2015的所有D1/2的T2/3写出暴力.共22题. 暴力顾名思义,用简单粗暴的方式解题,不以正常的思路思考.能够较好的保证正确性,但是 ...
- DISCOVAR de novo
海宝建议用这个拼接软件 http://www.broadinstitute.org/software/discovar/blog/?page_id=98 DISCOVAR – variant call ...
- MySql数据库3【优化3】缓存设置的优化
1.表缓存 相关参数: table_open_cache 指定表缓存的大小.每当MySQL访问一个表时,如果在表缓冲区中还有空间,该表就被打开并放入其中,这样可以更快地访问表内容.通过检查峰值时间的状 ...
- 浅谈MySQL 数据库性能优化
MySQL数据库是 IO 密集型的程序,和其他数据库一样,主要功能就是数据的持久化以及数据的管理工作.本文侧重通过优化MySQL 数据库缓存参数如查询缓存,表缓存,日志缓存,索引缓存,innodb缓存 ...
- 拾人牙慧篇之———QQ微信的第三方登录实现
一.写在前面 关于qq微信登录的原理之流我就不一一赘述了,对应的官网都有,在这里主要是展示我是怎么实现出来的,看了好几个博客,有的是直接复制官网的,有的不知道为什么实现不了.我只能保证我的这个是我实现 ...
随机推荐
- copyWithZone 的使用方法
1.简单复制只能实现浅拷贝:指针赋值,使两个指针指向相同的一块内存空间,操作不安全. 2. Foundation类已经遵守了<NSCopying>和 <NSMutableCopyin ...
- 浅谈OA办公软件市场行情
3.原文:http://www.jiusi.net/detail/472__776__3999__1.html 关键词:oa系统,OA办公软件 浅谈OA办公软件市场行情 中国的OA办公软件市场历经20 ...
- OC 动态类型和静态类型
多态 允许不同的类定义相同的方法 动态类型 程序直到执行时才能确定所属的类 静态类型 将一个变量定义为特定类的对象时,使用的是静态形态 将一个变量定义为特定类的对象时,使用的是静态类型,在编译的时候就 ...
- 简易版DES加密和解密详解
在DES密码里,是如何进行加密和解密的呢?这里采用DES的简易版来进行说明. 二进制数据的变换 由于不仅仅是DES密码,在其它的现代密码中也应用了二进制数据,所以无论是文章还是数字,都需要将明文变换为 ...
- Java中的Object、T(泛型)、?区别
因为最近重新看了泛型,又看了些反射,导致我对Object.T(以下代指泛型).?产生了疑惑. 我们先来试着理解一下Object类,学习Java的应该都知道Object是所有类的父类,注意:那么这就意味 ...
- Telegram学习解析系列(三) : Build Telegram报错分析总结
正好通过这次 Telegram 的运行,很想把常见的项目运行的错误好好的总结一下,在前面的博客中,又星星散散的总结过错误和一些警告的消除方法,这次把错误处理一下,还有Telegram项目中有999+的 ...
- linux 命令(alias , unalias , install ,ar , arch ,uname )
https://linux.die.net/man/ http://man.linuxde.net/ user commands 1.alias [ˈālēəs]:别名 alias --help al ...
- OAuth 2.0 / RCF6749 协议解读
OAuth是第三方应用授权的开放标准,目前版本是2.0版,以下将要介绍的内容和概念主要来源于该版本.恐篇幅太长,OAuth 的诞生背景就不在这里赘述了,可参考 RFC 6749 . 四种角色定义: R ...
- 移动端网页meta设置和响应式
苏宁易购WAP的meta分析 响应式 meta设置 媒体查询时读的width为viewport的宽度.viewport宽度为手机分辨率.比如note2 1280*720.需要重置为设备 640*360 ...
- VR全景:vr元年过后,这些企业如何发动“vr+”应用引擎?
2016年,VR可谓是四处衍生.从如痴如迷的游戏行业到喜闻乐见的影视行业,再到医疗.军事.房地产,随便呼出一个"+",VR便能左右逢源,VR+各行各业,俨然成为一种标配.最近,Ma ...