http://www.molecularevolution.org/resources/activities/QC_of_NGS_data_activity_new

table of contents

expected learning outcomes

The objective of this activity is to understand some relevant properties of an ensemble of next generation sequencing reads such as length, quality scores and base distribution in order to assess the quality of the data and to discard low quality reads.

Next-generation sequencing technologies are becoming widely used, and although a massive number of sequences can be generated in a single experiment, it is still important to characterize the reads. Characterizing reads may prevent undesirable outcomes in the assembly or mapping processes. Discarding low-quality reads, controlling for contamination, and trimming adaptor sequences are examples of preliminary quality control procedures that can be applied to raw reads before further analysis.

We will use basic UNIX commands, the FASTX-Toolkit, and FastQC applied to two small 454 and Illumina datasets.

getting started

  1. Make sure you have the following programs installed (if you used the customized USB flash drive for software installation, you already have them):
    1. FASTX-Toolkit
    2. FastQC
    3. Perl and Bioperl
    4. R
    5. Gnuplot
  2. The data you are going to use have been copied from your USB flash drive during the software installation to the~/wcg_oct2011/activities/QC_of_NGS_data folder. Go to that folder and you should find:
    1. bartonella_illumina.fastq: a small subset of a Illumina run (v.1.5) in a FASTQ file
    2. bartonella_454.fasta and bartonella_454.qual: sequences and qualities for small subset of a 454 Titanium run
    3. plotLengthDistribution.R: a simple R script to plot the distribution of read lengths, taking as input a two column tab file where the second column is the length of the sequences (and the first column is read ID, although this information is not used)
    4. fastx_nucleotide_distribution.R: an R script to plot the distribution of nucleotides along the read position (similar to the FASTX-Toolkit tool fastx_nucleotide_distribution.sh, uses the same input file)
    5. fastx_quality_boxplot.R: an R script to plot the quality of nucleotides along the read position (similar to the FASTX-Toolkit tool fastx_quality_boxplot.sh, uses the same input file)
  3. The Perl scripts that are going to be used in this activity have been installed, if you are interested to look at them:
    1. fastaNamesSizes.pl: takes a sequence file as input (e.g. fastq). The output is a list of sequence IDs followed by the length of each sequence; requires BioPerl
    2. fastaQual2fastq.pl: takes a FASTA and a quality (.qual) file and outputs a FASTQ file; requires BioPerl
  4. Have a look at the documentation for FASTX-Toolkit. Get an idea of the different tools and what they do. In general, you can get help on each program by typing program_name -h

FIX FOR MISSING SCRIPTS

Execute the following commands:

cd ~/wcg_oct2011/local/scripts
wget http://www.molecularevolution.org/molevolfiles/exercises/QC_of_NGS/scripts.tar.gz
tar xvfz scripts.tar.gz

exercise 1: checking Illumina data with the FASTX-Toolkit

The goal of this exercise is to inspect the sequence data of an Illumina run. The sequence data belongs to the intracellular bacterium Bartonella bacilliformis which is zoonotic pathogen transmitted by insect vectors. Although FASTQ is the main file received from a sequence provider, some users want to perform the base calling step themselves, using a different package than the proprietary Illumina software. This is not covererd in this exercise, and we start directly from the FASTQ file.

    1. Inspect the data contained in the sequence file, bartonella_illumina.fastq.
    2. Using the Perl script fastaNamesSizes.pl, calculate the length of each sequence in the fastq file:

      fastaNamesSizes.pl -f fastq-illumina bartonella_illumina.fastq > bartonella_illumina.ns

      -f fastq-illumina = specifies what format the input sequence file is in. Format options are specified in Bioperl (see link for more information about the fastq format specified)
      > bartonella_illumina.ns = specifies the output file

      Have a look at the numbers output on the terminal. How many sequences do we have?

      This information can found in the standard error:
      # 10000 sequences, total length 380000.
      # Minimum len: 38. Max: 38. Average: 38.

      Inspect the output.
      [show answer]

      Answer:
      The output will be in tabular format, with each line containing the sequence ID and length.

    3. Examine the Perl script in a text editor and inspect it to see how the script works. Note how the script (less `which fastaNamesSizes.pl`).

(Note: this is the same as using which to find the file location, and then less to open the file at that path. The backticks (not single quotations) specify that the which command should be performed before less).

    1. Convert the FASTQ file into FASTA format using the Fastx Toolkit script fastq_to_fasta:

      fastq_to_fasta -n -i bartonella_illumina.fastq > bartonella_illumina.fasta

-n = keeps sequences with unknown (N) nucleotides
-i bartonella_illumina.fastq = specifies the input file

    1. Compute quality statistics for the bartonella_illumina dataset, using fastx_quality_stats

      fastx_quality_stats -i bartonella_illumina.fastq > bartonella_illumina.qualstats

    2. View the statistics file with a text editor. What do the columns represent?

[show answer]

Answer:
Run the program with -h to show help documentation which contains this information. The format of this output is described as the "old" style.

    1. (optional): Produce a more extensive output by running the same command with the -N flag. View it. What are the extra columns?

[show answer]

Answer:
The 'new' output contains the original information followed by the analysis for each nucleotide at each cycle (explained in the help documentation).

    1. Draw a box-plot of the reads quality with fastq_quality_boxplot_graph.sh:

      fastq_quality_boxplot_graph.sh -i bartonella_illumina.qualstats -o bartonella_illumina.boxplot.png
      (or, alternatively with R: Rscript fastx_quality_boxplot.R bartonella_illumina.qualstats bartonella_illumina.boxplot2.png)

    2. Open the resulting png file (with open or Preview on a Mac, with Firefox or eog on Ubuntu from the terminal). What do x and y axis represent? What do the boxes represent? What can you say about the quality in general? What is the average probability of error? How is the quality varying along the read?

[show answer]

Answer: 
x axis represents the read length, y axis indicates the Phred quality score. From the top each box represents upper whisker, Q1, median, Q3 and lower whisker.
The sequence reads have a high quality. The probability of error is about 0.0001. The quality drops towards the end of the reads.

    1. Draw the distribution of nucleotides with fastx_nucleotide_distribution_graph.sh:

      fastx_nucleotide_distribution_graph.sh -i bartonella_illumina.qualstats -o bartonella_illumina.nucdistr.png
      (or, alternatively with R: Rscript fastx_nucleotide_distribution.R bartonella_illumina.qualstats bartonella_illumina.nucdistr2.png)

    2. Examine the image from the resulting png file. What can you say about the distribution of nucleotides? Any estimation of the G+C content?

[show answer]

Answer: 
The sequence reads are A-T rich. The estimate of G+C content is about 30-40%.

exercise 2: checking 454 data with the FASTX-Toolkit

The goal of this exercise is to do the same kind of quality checks as in exercise 1, but on 454 data this time. The primary data from 454 is stored in a sff file, but in general, FASTA and qual files are also provided. It is possible to parse the sff files with different parameters using the proprietary software provided by 454 (sfffile/sffinfo) or a free, open-source tool,sff_extract. Sff extraction is not covered in this exercise, and we start from the bartonella_454.fasta andbartonella_454.qual files. How would you perform the following tasks?

    1. Examine the FASTA and qual files.
    2. Some functions of FASTX-Toolkit do not work with FASTA-formatted sequences on multiple lines, thus it is sometimes necessary to transform the file so that fasta_formatter each sequence is on a single line.

[show answer]

Answer: 
fasta_formatter < bartonella_454.fasta > bartonella_454_1.fasta

    1. Some FASTX-Toolkit programs can only take FASTQ as an input, thus necessitating the conversion of FASTA and qual files to a FASTQ file, which can be done using a short Bioperl script:

      fastaQual2fastq.pl bartonella_454.fasta bartonella_454.qual > bartonella_454.fastq

    2. Inspect the resulting FASTQ file
    3. Now calculate the length of each sequence using the fastaNamesSizes.pl script.

[show answer]

Answer: 
fastaNamesSizes.pl bartonella_454.fasta > bartonella_454.ns

    1. Explore the output with a text editor. What would you say about the statistics on length and number of sequences?

Standard error:
# 10000 sequences, total length 3666752.
# Minimum len: 40. Max: 633. Average: 367

Although the average read length is 367 bp, very few sequences are ~200-300 bp. The read length of the majority of the sequences in the file fits in two major groups: less than 100bp and more than 300bp.

    1. Plot the distribution of read lengths using a short R script. Open the resulting png file (with open on a Mac or eog on Ubuntu from the terminal):

      Rscript plotLengthDistribution.R bartonella_454.ns bartonella_454.readDistr.png

    2. (optional) Examine the script.
    3. Can you say something about the distribution? What is the mean length? the median?

[show answer]

Answer: 
The read length distribution is bimodal. Mean read length=367 bp, Median=449 bp.

    1. (optional) Modify the R script to display vertical bars of different colors at the mean and median read length. Hint: useabline(v=...) command in the R script.

[show answer]

Answer: 
The following code can be inserted after the hist line and before the dev.off()line:
abline(v=mean(data$V2),col="red")
abline(v=median(data$V2),col="blue")

    1. Now compute the quality statistics for this subset and look at them in a text editor. Use fastx_quality_stats and view the results in a text editor. Use also the -N switch to produce extensive output.

[show answer]

Answer: 
fastx_quality_stats -i bartonella_454.fastq > bartonella_454.qualstats
fastx_quality_stats -i bartonella_454.fastq -N > bartonella_454.extqualstats

    1. Now plot the quality distribution along the sequence with fastq_quality_boxplot_graph.sh and view the resulting png plot (alternatively, use the R script). What do you see?

[show answer]

Answer: 
fastq_quality_boxplot_graph.sh -i bartonella_454.qualstats -o bartonella_454.qualstats.png
The quality of the sequence reads is high but it drops after ~400bp.

    1. Draw the distribution of nucleotides with fastx_nucleotide_distribution_graph.sh (or with the Rscript) and view the result. What do you see?

[show answer]

Answer: 
fastx_nucleotide_distribution_graph.sh -i bartonella_454.qualstats -o bartonella_454.nucdistr.png
The sequence reads are G-C rich. The distribution of nucleotides is not uniform at the 5' end of the reads.

    1. We need to trim that down. Use head to take only the first lines of the stats file and output it to another file, and redraw the figure. What do you see now? What is your conclusion about this dataset? What should be done about it?

[show answer]

Answer: 
head -n 51 bartonella_454.qualstats > bartonella_454.50f.qualstats

fastx_nucleotide_distribution_graph.sh -i bartonella_454.50f.qualstats -o bartonella_454.50f.nucdistr.png

The distribution of the nucleotides is not uniform in the first 10 nucleotide position of the reads. Adaptor sequences at the 5' end are included in the sequenced reads. The sequence reads need to be trimmed.

exercise 3: fixing extra adaptor sequence with the FASTX-Toolkit

Most of the sequences have some adaptor sequences at the 5' end. We need to remove these adaptor sequences to avoid problems with the assembly or mapping. We will use fastx_trimmer to do this.

    1. fastx_trimmer trims a given number of nucleotides, either from the beginning or from the end of the sequence.

[show answer]

Answer: 
fastx_trimmer -f 11 -m 50 -i bartonella_454.fastq -o bartonella_454.trim.fastq
-f 11 = the position on the read (from the 5' end) of first nucleotide to keep. The first 10 bp will be removed from the 5' end.
-m 50 = the minimum final read length. Reads shorter than this length will be discarded.

    1. Calculate the nucleotide distribution statistics for the trimmed reads and redraw the graph. How does it look now?

[show answer]

Answer: 
fastx_quality_stats -i bartonella_454.trim.fastq | head -n 100 > bartonella_454.trim.qualstats

fastx_nucleotide_distribution_graph.sh -i bartonella_454.trim.qualstats -o bartonella_454.trim.nucdistr.png

The distribution of the nucleotides is now similar over the length of the reads.

    1. (optional): Why is it important to trim x bp from the start of all of the reads rather than searching for the adapter sequence and removing it?

[show answer]

Answer:
If the adapter was removed by searching for a specific sequence then reads with sequencing or PCR errors in the adapter would not being trimmed. These untrimmed reads would cause problems with assembly.

exercise 4: checking Illumina data with FastQC

Another tool to assess sequence quality is FastQC. FastQC can operate in two different modes - as a stand-alone interactive application or in a non-interactive mode which is more suitable for larger pipelines. For the purposes of this activity we will be using the interactive application. To start FastQC:

If you are running the WCG Ubuntu Linux distribution, first run the following commands:
rm ~/wcg_oct2011/local/bin/fastqc
ln -s ~/wcg_oct2011/software/FastQC/fastqc ~/wcg_oct2011/local/bin/fastqc

Then, you should be able to run fastqc from anywhere like so:

fastqc

Or if you are running on Mac OS X, navigate to the ~/wcg_oct2011/software/fastqc_v0.9.3 folder (already copied from your USB during the software installation) and double-click the FastQC icon.

FastQC may take a moment to open and should display a blank window once it has.

Load the bartonella_illumina.fastq Illumina file (File->Open) from the ~/wcg_oct2011/activities/QC_of_NGS_datadirectory. Once loaded, FastQC will process it and generate the report. This report can be exported (File->Save report...) as a zipped HTML folder which makes for easy sharing. You can view the results either within the FastQC application or theexported report. The FastQC documentation provides a detailed explanation of the tooland an explanation of each module report.

One nice feature of FastQC is that it easily handles and detects the many FASTQ variants. Within the Basic Statisticsreport, it shows that our data was Illumina 1.5 encoded.

The per base sequence quality plot is similar to the plot we produced in Exercise 1 with thefastq_quality_boxplot_graph.sh FastX-Toolkit command. Are there any differences between the two plots?

Explore the rest of the plots. What new information can be gathered about this Illumina dataset?

exercise 5: checking 454 data with FastQC

Load the bartonella_454.fastq file (File->Open) that we created from the 454 bartonella_454.fasta andbartonella_454.qual files in Exercise 2, step 3. Note that FastQC handles the following file types:

  • FASTQ (all variants)
  • Colorspace FASTQ (SOLiD)
  • GZip compressed FASTQ
  • SAM
  • BAM
  • SAM/BAM Mapped only (normally used for colorspace data)

You can view the results either within the FastQC application or the exported report.

You will notice for this dataset it shows that our data was Illumina 1.5 encoded. If you remember in exercise 2, we used thefastaQual2fastq.pl script to convert the .fasta and .qual files to FASTQ and it encoded them as Illumina 1.5.

You should notice in this case that 5 of the modules are marked with the red cross as "very unusual." Why was this the case for the 454 data? Try loading the bartonella_454.trim.fastq file to see the impact of our quality control efforts.

Another nice feature of FastQC is that many of the plots bin the nucleotide position on the x-axis which is useful for long reads such as 454. It is especially useful for the per base sequence content plot.

FastQC also provides example datasets on their website for a good Illumina datasetpoor Illumina dataset, a 454 dataset, and PacBio dataset.

see also

Other tools to verify quality of second-generation sequencing results are available:

    • Galaxy, a web-based genomics pipeline, in which FASTX-Toolkit and FastQC are integrated.
    • PRINSEQ, either as a standalone package or through a web-interface can generate summary statistics of sequence and quality data, which can subsequently be used to filter, reformat and trim next-generation sequence data.
    • Perl and Bioperl, to write small scripts. There already exist a very large number of packages devoted to genomics in Bioperl.
    • R and Bioconductor, are other solution to import and verify data. There are many contributed packages and modules available.

Quality assessment and quality control of NGS data的更多相关文章

  1. [论文笔记] Methodologies for Data Quality Assessment and Improvement (ACM Comput.Surv, 2009) (1)

    Carlo Batini, Cinzia Cappiello, Chiara Francalanci, and Andrea Maurino. 2009. Methodologies for data ...

  2. Paper | Blind Quality Assessment Based on Pseudo-Reference Image

    目录 1. 技术细节 1.1 失真识别 1.2 得到对应的PRI并评估质量 块效应 模糊和噪声 1.3 扩展为通用的质量评价指标--BPRI 归一化3种质量评分 判断失真类型 加权求和 2. 总结 这 ...

  3. Linux command line exercises for NGS data processing

    by Umer Zeeshan Ijaz The purpose of this tutorial is to introduce students to the frequently used to ...

  4. [论文笔记] Methodologies for Data Quality Assessment and Improvement (ACM Comput.Surv, 2009) (2)

    本篇博文主要对DMQ(S3.7)的分类进行了研读. 1. 这个章节提出了一种DQM的分类法(如下图) 由上图可见,该分类法的分类标准是对assessment & improvement阶段的支 ...

  5. Paper | No-reference Quality Assessment of Deblocked Images

    目录 故事背景 本文方法(DBIQ) 发表在2016年Neurocomputing. 摘要 JPEG is the most commonly used image compression stand ...

  6. Paper | BLIND QUALITY ASSESSMENT OF COMPRESSED IMAGES VIA PSEUDO STRUCTURAL SIMILARITY

    目录 1. 技术细节 1.1 得到MDI 1.2 判别伪结构,计算伪结构相似性 2. 实验 动机:作者认为,基于块的压缩会产生一种伪结构(pseudo structures),并且不同程度压缩产生的伪 ...

  7. Paper | Quality assessment of deblocked images

    目录 1. 故事 2. 失真变化 3. 方法(PSNR-B) 4. 实验 这篇文章提出了一个PSNR-B指标,旨在衡量 压缩图像的块效应强度 或 去块效应后的残留块效应强度(比较去块效应算法的优劣). ...

  8. SSIS ->> Control Flow And Data Flow

    In the Control Flow, the task is the smallest unit of work, and a task requires completion (success, ...

  9. Code Complete阅读笔记(二)

    2015-03-06   328   Unusual Data Types    ——You can carry this technique to extremes,putting all the ...

随机推荐

  1. POJ 2697 A Board Game(Trie判重+BFS)

    A Board Game Time Limit: 1000MS   Memory Limit: 65536K Total Submissions: 551   Accepted: 373 Descri ...

  2. C#记录程序耗时的方法

    用于准确测量运行时间的方法类: System.Diagnostics.Stopwatch 具体使用方法: using System.Diagnostics; Stopwatch stopwatch = ...

  3. PSR : php编码规范

    诸王混战 关于开发标准这块,可以说一直都是风格迥异,各家都有各家的玩法,民间更是个人玩个人的.目前我们国内比较出名的几个框架(Yii,Laravel) 都已经支持Composer并且加入了PHP-FI ...

  4. python display color output

    起因 在开发项目过程中,为了方便调试代码,经常会向stdout中输出一些日志,默认的这些日志就直接显示在了终端中.而一般的应用服务器,第三方库,甚至服务器的一些通告也会在终端中显示,这样就搅乱了我们想 ...

  5. python学习之 字符串前'r'的用法

    在打开文件的时候open(r'c:\....') 加r和不加''r是有区别的 'r'是防止字符转义的 如果路径中出现'\t'的话 不加r的话\t就会被转义 而加了'r'之后'\t'就能保留原有的样子 ...

  6. TCP 长连接与短连接的区别

    TCP连接 当网络通信时采用TCP协议时,在真正的读写操作之前,server与client之间必须建立一个连接,当读写操作完成后,双方不再需要这个连接时它们可以释放这个连接,连接的建立是需要三次握手的 ...

  7. iOS应用架构谈 网络层设计方案

    网络层在一个App中也是一个不可缺少的部分,工程师们在网络层能够发挥的空间也比较大.另外,苹果对网络请求部分已经做了很好的封装,业界的AFNetworking也被广泛使用.其它的ASIHttpRequ ...

  8. SET TEXTSIZE number

    When you using sqlcmd to export some data by a query, you will found some column data is truncated i ...

  9. nginx执行文件替换掉之后重启提示permission denied

    重启的时候显示 -bash: /usr/local/nginx/sbin/nginx: Permission denied 进入nginx执行文件的目录:

  10. Linux:宿主机通过桥接方式连接的VMware内部Linux14.04虚拟机(静态IP)实现上网方案

    首先,我们要弄清楚三种常见的连接方式中的桥接方式的网络结构: .bridged(桥接模式) 在这种模式下,VMWare虚拟出来的操作系统就像是局域网中的一台独立的主机,它可以访问网内任何一台机器.在桥 ...