<Next-generation sequencing> Downloading NCBI SRA files

The latest version of this post is at:

http://blog.csdn.net/tanzuozhev/article/details/51077222

As sequencing technology keeps improving, the volume of next-generation sequencing data is growing exponentially.

NCBI provides the SRA database to store these data.

http://www.ncbi.nlm.nih.gov/sra

To make these data easier to analyze, NCBI provides a command-line tool for downloading them: sra-toolkit.

Official documentation:

http://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=toolkit_doc

It includes the following commands:

prefetch: Allows command-line downloading of SRA, dbGaP, and ADSP data

fastq-dump: Convert SRA data into fastq format # converts downloaded SRA data to fastq files; supports paired-end (PE) data

sam-dump: Convert SRA data to sam format

sra-pileup: Generate pileup statistics on aligned SRA data

vdb-config: Display and modify VDB configuration information

vdb-decrypt: Decrypt non-SRA dbGaP data (“phenotype data”)

prefetch

Commonly used options

Data transfer:

# Whether to force the download when the file has already been downloaded; by default it is not forced

-f  |   --force <value> Force object download. One of: no, yes, all. no [default]: Skip download if the object is found and complete; yes: Download it even if it is found and is complete; all: Ignore lock files (stale locks or if it is currently being downloaded: use at your own risk!).

# Choose the transport, ascp or http; by default ascp is tried first, then http

--transport <value> Value one of: ascp (only), http (only), both (first try ascp, fallback to http). Default: both.

# List the contents and sizes of a kart file

# You can put the items you need to download into a kart file

-l  |   --list  List the contents of a kart file.

-s  |   --list-sizes    List the content of kart file with target file sizes.

# Set the minimum file size

-N  |   --min-size <size>   Minimum file size to download in KB (inclusive).

# Set the maximum file size

-X  |   --max-size <size>   Maximum file size to download in KB (exclusive). Default: 20G.

# Download order

-o  |   --order <value> Kart prefetch order. One of: kart (in kart order), size (by file size: smallest first). default: size.

Examples

prefetch ERR732926

Downloads the file for sample ERR732926 directly; by default it is placed in the ~/ncbi/public/sra directory

prefetch cart_0.krt

Downloads the items listed in the kart file

prefetch -l cart_0.krt

Lists the contents of cart_0.krt
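When there are several accessions and no kart file, the single-accession form above can be wrapped in a loop. The following is only a sketch: the function name prefetch_list and the RUN variable are made up for this example, and by default it only prints the commands so that nothing is downloaded by accident.

```shell
#!/bin/sh
# Sketch: emit one prefetch command per accession read from stdin.
# prefetch_list and RUN are hypothetical names, not part of sra-toolkit.
# Set RUN=1 to actually execute the commands instead of printing them.

prefetch_list() {
    while IFS= read -r acc; do
        # Skip blank lines and comment lines starting with '#'.
        case "$acc" in
            ''|'#'*) continue ;;
        esac
        if [ "${RUN:-0}" = "1" ]; then
            # Keep the loop's stdin away from the child process.
            prefetch "$acc" </dev/null
        else
            echo "prefetch $acc"
        fi
    done
}

# Dry run with the two accessions used in this post.
printf '%s\n' ERR732926 SRR390728 | prefetch_list
```

With a real list, something like `RUN=1 prefetch_list < accessions.txt` would run the downloads, where accessions.txt is a hypothetical file with one accession per line.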

fastq-dump



General:

-h  |   --help  Displays ALL options, general usage, and version information.

-V  |   --version   Display the version of the program.

Data formatting:

# Split paired-end data

--split-files   Dump each read into separate file. Files will receive suffix corresponding to read number.

--split-spot    Split spots into individual reads.

# Keep FASTA only, without quality scores

--fasta <[line width]>  FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).

-I  |   --readids   Append read id after spot id as 'accession.spot.readid' on defline.

-F  |   --origfmt   Defline contains only original sequence name.

-C  |   --dumpcs <[cskey]>  Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.

-B  |   --dumpbase  Formats sequence using base space (default for other than SOLiD).

-Q  |   --offset <integer>  Offset to use for ASCII quality scores. Default is 33 ("!").
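The -Q offset is the value added to each Phred quality score to obtain the ASCII character written on the quality line, so a score can be recovered as the character code minus the offset. A small sketch of that arithmetic (phred_score is a helper name invented for this example):

```shell
#!/bin/sh
# Sketch: decode one quality character back to its Phred score.
# phred_score is a hypothetical helper, not an sra-toolkit command.

phred_score() {
    # $1: a single quality character
    # $2: ASCII offset, 33 if omitted (the -Q default listed above)
    ch=$1
    offset=${2:-33}
    # A leading quote makes printf print the character's numeric code.
    code=$(printf '%d' "'$ch")
    echo $((code - offset))
}

phred_score I       # 'I' is ASCII 73, so this prints 40
phred_score h 64    # 'h' is ASCII 104, so this also prints 40
```

The same score is therefore written as a different character depending on the offset, which is why the offset has to match what downstream tools expect.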

Filtering:

-N  |   --minSpotId <rowid> Minimum spot id to be dumped. Use with "X" to dump a range.

-X  |   --maxSpotId <rowid> Maximum spot id to be dumped. Use with "N" to dump a range.

-M  |   --minReadLen <len>  Filter by sequence length >= <len>

--skip-technical    Dump only biological reads.

--aligned   Dump only aligned sequences. Aligned datasets only; see sra-stat.

--unaligned Dump only unaligned sequences. Will dump all for unaligned datasets.

# Output options

Workflow and piping:

-O  |   --outdir <path> Output directory, default is current working directory ('.').

-Z  |   --stdout    Output to stdout, all split data become joined into single stream.

--gzip  Compress output using gzip.

--bzip2 Compress output using bzip2.

Examples

fastq-dump -X 5 -Z SRR390728

Prints the first five spots (20 lines) of sample SRR390728 without downloading the file first
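The 20-line figure comes from the FASTQ layout: each record takes four lines, so five records give 20 lines. A quick way to check a stream is to count every fourth line, sketched below under the assumption that each record occupies exactly four lines (count_fastq_records is a name made up here):

```shell
#!/bin/sh
# Sketch: count FASTQ records on stdin, assuming 4 lines per record.
# count_fastq_records is a hypothetical helper, not an sra-toolkit command.

count_fastq_records() {
    # Count by line position rather than by a leading '@',
    # because a quality line may itself start with '@'.
    awk 'NR % 4 == 1 { n++ } END { print n + 0 }'
}

# Two made-up records; the second quality line starts with '@' on purpose.
printf '%s\n' '@r1' 'ACGT' '+' 'IIII' '@r2' 'TTGA' '+' '@III' | count_fastq_records
# prints 2
```

Piping the command above into it, as in `fastq-dump -X 5 -Z SRR390728 | count_fastq_records`, gives the number of records actually returned.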

fastq-dump -I --split-files SRR390728

Handles paired-end data

Produces two fastq files (--split-files) containing “.1” and “.2” read suffixes (-I) for paired-end data.

fastq-dump --split-files --fasta 60 SRR390728

Produces two (--split-files) fasta files (--fasta) with 60 bases per line (“60” included after --fasta).

fastq-dump --split-files --aligned -Q 64 SRR390728

Produces two fastq files (--split-files) that contain only aligned reads (--aligned; Note: only for files submitted as aligned data), with a quality offset of 64 (-Q 64). Please see the documentation on vdb-dump if you wish to produce fasta/qual data.
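The options listed earlier can be combined. As a sketch, the wrapper below builds a paired-end dump with gzip compression into a chosen output directory, using only flags from the list above (--split-files, --gzip, -O). The names dump_paired and RUN are invented for this example, and the default is to print the command and not run it.

```shell
#!/bin/sh
# Sketch: build a fastq-dump command for paired-end data.
# dump_paired and RUN are hypothetical names, not part of sra-toolkit.
# Set RUN=1 to execute the command instead of printing it.

dump_paired() {
    # $1: accession
    # $2: output directory, current directory if omitted
    acc=$1
    outdir=${2:-.}
    # Reuse the positional parameters to hold the command line.
    set -- fastq-dump --split-files --gzip -O "$outdir" "$acc"
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$*"
    fi
}

dump_paired SRR390728 fastq
# prints: fastq-dump --split-files --gzip -O fastq SRR390728
```

Printing first makes it easy to review the exact command before committing to a long conversion.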

Only the commonly used options are listed here; if you need anything else, please read the official documentation.