Quality assessment and quality control of NGS data

http://www.molecularevolution.org/resources/activities/QC_of_NGS_data_activity_new

expected learning outcomes

The objective of this activity is to understand some relevant properties of an ensemble of next generation sequencing reads such as length, quality scores and base distribution in order to assess the quality of the data and to discard low quality reads.

Next-generation sequencing technologies are becoming widely used, and although a massive number of sequences can be generated in a single experiment, it is still important to characterize the reads. Characterizing reads may prevent undesirable outcomes in the assembly or mapping processes. Discarding low-quality reads, controlling for contamination, and trimming adaptor sequences are examples of preliminary quality control procedures that can be applied to raw reads before further analysis.

We will use basic UNIX commands, the FASTX-Toolkit, and FastQC applied to two small 454 and Illumina datasets.

getting started

Make sure you have the following programs installed (if you used the customized USB flash drive for software installation, you already have them):
1. FASTX-Toolkit
2. FastQC
3. Perl and Bioperl
4. R
5. Gnuplot
The data you are going to use have been copied from your USB flash drive during the software installation to the~/wcg_oct2011/activities/QC_of_NGS_data folder. Go to that folder and you should find:
1. bartonella_illumina.fastq: a small subset of a Illumina run (v.1.5) in a FASTQ file
2. bartonella_454.fasta and bartonella_454.qual: sequences and qualities for small subset of a 454 Titanium run
3. plotLengthDistribution.R: a simple R script to plot the distribution of read lengths, taking as input a two column tab file where the second column is the length of the sequences (and the first column is read ID, although this information is not used)
4. fastx_nucleotide_distribution.R: an R script to plot the distribution of nucleotides along the read position (similar to the FASTX-Toolkit tool fastx_nucleotide_distribution.sh, uses the same input file)
5. fastx_quality_boxplot.R: an R script to plot the quality of nucleotides along the read position (similar to the FASTX-Toolkit tool fastx_quality_boxplot.sh, uses the same input file)
The Perl scripts that are going to be used in this activity have been installed, if you are interested to look at them:
1. fastaNamesSizes.pl: takes a sequence file as input (e.g. fastq). The output is a list of sequence IDs followed by the length of each sequence; requires BioPerl
2. fastaQual2fastq.pl: takes a FASTA and a quality (.qual) file and outputs a FASTQ file; requires BioPerl
Have a look at the documentation for FASTX-Toolkit. Get an idea of the different tools and what they do. In general, you can get help on each program by typing program_name -h

FIX FOR MISSING SCRIPTS

Execute the following commands:

cd ~/wcg_oct2011/local/scripts
wget http://www.molecularevolution.org/molevolfiles/exercises/QC_of_NGS/scripts.tar.gz
tar xvfz scripts.tar.gz

exercise 1: checking Illumina data with the FASTX-Toolkit

The goal of this exercise is to inspect the sequence data of an Illumina run. The sequence data belongs to the intracellular bacterium Bartonella bacilliformis which is zoonotic pathogen transmitted by insect vectors. Although FASTQ is the main file received from a sequence provider, some users want to perform the base calling step themselves, using a different package than the proprietary Illumina software. This is not covererd in this exercise, and we start directly from the FASTQ file.

Inspect the data contained in the sequence file, bartonella_illumina.fastq.
Using the Perl script fastaNamesSizes.pl, calculate the length of each sequence in the fastq file:

fastaNamesSizes.pl -f fastq-illumina bartonella_illumina.fastq > bartonella_illumina.ns

-f fastq-illumina = specifies what format the input sequence file is in. Format options are specified in Bioperl (see link for more information about the fastq format specified)
> bartonella_illumina.ns = specifies the output file

Have a look at the numbers output on the terminal. How many sequences do we have?

This information can found in the standard error:
# 10000 sequences, total length 380000.
# Minimum len: 38. Max: 38. Average: 38.

Inspect the output.
[show answer]

Answer:
The output will be in tabular format, with each line containing the sequence ID and length.
Examine the Perl script in a text editor and inspect it to see how the script works. Note how the script (less `which fastaNamesSizes.pl`).

(Note: this is the same as using which to find the file location, and then less to open the file at that path. The backticks (not single quotations) specify that the which command should be performed before less).

Convert the FASTQ file into FASTA format using the Fastx Toolkit script fastq_to_fasta:

fastq_to_fasta -n -i bartonella_illumina.fastq > bartonella_illumina.fasta

-n = keeps sequences with unknown (N) nucleotides
-i bartonella_illumina.fastq = specifies the input file

Compute quality statistics for the bartonella_illumina dataset, using fastx_quality_stats

fastx_quality_stats -i bartonella_illumina.fastq > bartonella_illumina.qualstats
View the statistics file with a text editor. What do the columns represent?

[show answer]

Answer:
Run the program with -h to show help documentation which contains this information. The format of this output is described as the "old" style.

(optional): Produce a more extensive output by running the same command with the -N flag. View it. What are the extra columns?

[show answer]

Answer:
The 'new' output contains the original information followed by the analysis for each nucleotide at each cycle (explained in the help documentation).

Draw a box-plot of the reads quality with fastq_quality_boxplot_graph.sh:

fastq_quality_boxplot_graph.sh -i bartonella_illumina.qualstats -o bartonella_illumina.boxplot.png
(or, alternatively with R: Rscript fastx_quality_boxplot.R bartonella_illumina.qualstats bartonella_illumina.boxplot2.png)
Open the resulting png file (with open or Preview on a Mac, with Firefox or eog on Ubuntu from the terminal). What do x and y axis represent? What do the boxes represent? What can you say about the quality in general? What is the average probability of error? How is the quality varying along the read?

[show answer]

Answer:
x axis represents the read length, y axis indicates the Phred quality score. From the top each box represents upper whisker, Q1, median, Q3 and lower whisker.
The sequence reads have a high quality. The probability of error is about 0.0001. The quality drops towards the end of the reads.

Draw the distribution of nucleotides with fastx_nucleotide_distribution_graph.sh:

fastx_nucleotide_distribution_graph.sh -i bartonella_illumina.qualstats -o bartonella_illumina.nucdistr.png
(or, alternatively with R: Rscript fastx_nucleotide_distribution.R bartonella_illumina.qualstats bartonella_illumina.nucdistr2.png)
Examine the image from the resulting png file. What can you say about the distribution of nucleotides? Any estimation of the G+C content?

[show answer]

Answer:
The sequence reads are A-T rich. The estimate of G+C content is about 30-40%.

exercise 2: checking 454 data with the FASTX-Toolkit

The goal of this exercise is to do the same kind of quality checks as in exercise 1, but on 454 data this time. The primary data from 454 is stored in a sff file, but in general, FASTA and qual files are also provided. It is possible to parse the sff files with different parameters using the proprietary software provided by 454 (sfffile/sffinfo) or a free, open-source tool,sff_extract. Sff extraction is not covered in this exercise, and we start from the bartonella_454.fasta andbartonella_454.qual files. How would you perform the following tasks?

Examine the FASTA and qual files.
Some functions of FASTX-Toolkit do not work with FASTA-formatted sequences on multiple lines, thus it is sometimes necessary to transform the file so that fasta_formatter each sequence is on a single line.

[show answer]

Answer:
fasta_formatter < bartonella_454.fasta > bartonella_454_1.fasta

Some FASTX-Toolkit programs can only take FASTQ as an input, thus necessitating the conversion of FASTA and qual files to a FASTQ file, which can be done using a short Bioperl script:

fastaQual2fastq.pl bartonella_454.fasta bartonella_454.qual > bartonella_454.fastq
Inspect the resulting FASTQ file
Now calculate the length of each sequence using the fastaNamesSizes.pl script.

[show answer]

Answer:
fastaNamesSizes.pl bartonella_454.fasta > bartonella_454.ns

Explore the output with a text editor. What would you say about the statistics on length and number of sequences?

Standard error:
# 10000 sequences, total length 3666752.
# Minimum len: 40. Max: 633. Average: 367

Although the average read length is 367 bp, very few sequences are ~200-300 bp. The read length of the majority of the sequences in the file fits in two major groups: less than 100bp and more than 300bp.

Plot the distribution of read lengths using a short R script. Open the resulting png file (with open on a Mac or eog on Ubuntu from the terminal):

Rscript plotLengthDistribution.R bartonella_454.ns bartonella_454.readDistr.png
(optional) Examine the script.
Can you say something about the distribution? What is the mean length? the median?

[show answer]

Answer:
The read length distribution is bimodal. Mean read length=367 bp, Median=449 bp.

(optional) Modify the R script to display vertical bars of different colors at the mean and median read length. Hint: useabline(v=...) command in the R script.

[show answer]

Answer:
The following code can be inserted after the hist line and before the dev.off()line:
abline(v=mean(data$V2),col="red") abline(v=median(data$V2),col="blue")

Now compute the quality statistics for this subset and look at them in a text editor. Use fastx_quality_stats and view the results in a text editor. Use also the -N switch to produce extensive output.

[show answer]

Answer:
fastx_quality_stats -i bartonella_454.fastq > bartonella_454.qualstats
fastx_quality_stats -i bartonella_454.fastq -N > bartonella_454.extqualstats

Now plot the quality distribution along the sequence with fastq_quality_boxplot_graph.sh and view the resulting png plot (alternatively, use the R script). What do you see?

[show answer]

Answer:
fastq_quality_boxplot_graph.sh -i bartonella_454.qualstats -o bartonella_454.qualstats.png
The quality of the sequence reads is high but it drops after ~400bp.

Draw the distribution of nucleotides with fastx_nucleotide_distribution_graph.sh (or with the Rscript) and view the result. What do you see?

[show answer]

Answer:
fastx_nucleotide_distribution_graph.sh -i bartonella_454.qualstats -o bartonella_454.nucdistr.png
The sequence reads are G-C rich. The distribution of nucleotides is not uniform at the 5' end of the reads.

We need to trim that down. Use head to take only the first lines of the stats file and output it to another file, and redraw the figure. What do you see now? What is your conclusion about this dataset? What should be done about it?

[show answer]

Answer:
head -n 51 bartonella_454.qualstats > bartonella_454.50f.qualstats

fastx_nucleotide_distribution_graph.sh -i bartonella_454.50f.qualstats -o bartonella_454.50f.nucdistr.png

The distribution of the nucleotides is not uniform in the first 10 nucleotide position of the reads. Adaptor sequences at the 5' end are included in the sequenced reads. The sequence reads need to be trimmed.

exercise 3: fixing extra adaptor sequence with the FASTX-Toolkit

Most of the sequences have some adaptor sequences at the 5' end. We need to remove these adaptor sequences to avoid problems with the assembly or mapping. We will use fastx_trimmer to do this.

fastx_trimmer trims a given number of nucleotides, either from the beginning or from the end of the sequence.

[show answer]

Answer:
fastx_trimmer -f 11 -m 50 -i bartonella_454.fastq -o bartonella_454.trim.fastq
-f 11 = the position on the read (from the 5' end) of first nucleotide to keep. The first 10 bp will be removed from the 5' end.
-m 50 = the minimum final read length. Reads shorter than this length will be discarded.

Calculate the nucleotide distribution statistics for the trimmed reads and redraw the graph. How does it look now?

[show answer]

Answer:
fastx_quality_stats -i bartonella_454.trim.fastq | head -n 100 > bartonella_454.trim.qualstats

fastx_nucleotide_distribution_graph.sh -i bartonella_454.trim.qualstats -o bartonella_454.trim.nucdistr.png

The distribution of the nucleotides is now similar over the length of the reads.

(optional): Why is it important to trim x bp from the start of all of the reads rather than searching for the adapter sequence and removing it?

[show answer]

Answer:
If the adapter was removed by searching for a specific sequence then reads with sequencing or PCR errors in the adapter would not being trimmed. These untrimmed reads would cause problems with assembly.

exercise 4: checking Illumina data with FastQC

Another tool to assess sequence quality is FastQC. FastQC can operate in two different modes - as a stand-alone interactive application or in a non-interactive mode which is more suitable for larger pipelines. For the purposes of this activity we will be using the interactive application. To start FastQC:

If you are running the WCG Ubuntu Linux distribution, first run the following commands:
rm ~/wcg_oct2011/local/bin/fastqc
ln -s ~/wcg_oct2011/software/FastQC/fastqc ~/wcg_oct2011/local/bin/fastqc

Then, you should be able to run fastqc from anywhere like so:

fastqc

Or if you are running on Mac OS X, navigate to the ~/wcg_oct2011/software/fastqc_v0.9.3 folder (already copied from your USB during the software installation) and double-click the FastQC icon.

FastQC may take a moment to open and should display a blank window once it has.

Load the bartonella_illumina.fastq Illumina file (File->Open) from the ~/wcg_oct2011/activities/QC_of_NGS_datadirectory. Once loaded, FastQC will process it and generate the report. This report can be exported (File->Save report...) as a zipped HTML folder which makes for easy sharing. You can view the results either within the FastQC application or theexported report. The FastQC documentation provides a detailed explanation of the tooland an explanation of each module report.

One nice feature of FastQC is that it easily handles and detects the many FASTQ variants. Within the Basic Statisticsreport, it shows that our data was Illumina 1.5 encoded.

The per base sequence quality plot is similar to the plot we produced in Exercise 1 with thefastq_quality_boxplot_graph.sh FastX-Toolkit command. Are there any differences between the two plots?

Explore the rest of the plots. What new information can be gathered about this Illumina dataset?

exercise 5: checking 454 data with FastQC

Load the bartonella_454.fastq file (File->Open) that we created from the 454 bartonella_454.fasta andbartonella_454.qual files in Exercise 2, step 3. Note that FastQC handles the following file types:

FASTQ (all variants)

Colorspace FASTQ (SOLiD)

GZip compressed FASTQ

SAM/BAM Mapped only (normally used for colorspace data)

You can view the results either within the FastQC application or the exported report.

You will notice for this dataset it shows that our data was Illumina 1.5 encoded. If you remember in exercise 2, we used thefastaQual2fastq.pl script to convert the .fasta and .qual files to FASTQ and it encoded them as Illumina 1.5.

You should notice in this case that 5 of the modules are marked with the red cross as "very unusual." Why was this the case for the 454 data? Try loading the bartonella_454.trim.fastq file to see the impact of our quality control efforts.

Another nice feature of FastQC is that many of the plots bin the nucleotide position on the x-axis which is useful for long reads such as 454. It is especially useful for the per base sequence content plot.

FastQC also provides example datasets on their website for a good Illumina dataset, poor Illumina dataset, a 454 dataset, and PacBio dataset.