Near-optimal RNA-Seq quantification https://pachterlab.github.io/kallisto

Input and output file description: http://bio.math.berkeley.edu/eXpress/manual.html

Article title:

Pseudoalignment for metagenomic read assignment

Abstract:

We explore connections between metagenomic read assignment and the quantification of transcripts from RNA-Seq data. In particular, we show that the recent idea of pseudoalignment introduced in the RNA-Seq context is suitable in the metagenomics setting. When coupled with the Expectation-Maximization (EM) algorithm, reads can be assigned far more accurately and quickly than is currently possible with state of the art software.

Article URL:

https://arxiv.org/abs/1510.07371v2

Source code:

https://pachterlab.github.io/kallisto/about

Installation:

wget https://github.com/pachterlab/kallisto/releases/download/v0.43.0/kallisto_linux-v0.43.0.tar.gz

Test:

[biostack@localhost.localdomain test]$ /project/metagenomics_benchmark/kallisto_linux-v0.43.0/kallisto index -i transcripts.idx transcripts.fasta

[biostack@localhost.localdomain test]$ /project/metagenomics_benchmark/kallisto_linux-v0.43.0/kallisto quant -i transcripts.idx -o output reads_1.fastq reads_2.fastq  # reads_1.fastq and reads_2.fastq are the input files

[biostack@localhost.localdomain output]$ more abundance.tsv

target_id     length  eff_length  est_counts  tpm
NM_001168316  2283    2105.9      160.606     12581
NM_174914     2385    2207.9      1500.72     112128
NR_031764     1853    1675.9      102.671     10106.2
NM_004503     1681    1503.9      331.118     36320.7
NM_006897     1541    1363.9      664         80311.3
NM_014212     2037    1859.9      55          4878.25
NM_014620     2300    2122.9      591.166     45937.9
NM_017409     1959    1781.9      47          4351.17
NM_017410     2396    2218.9      42          3122.5
NM_018953     1612    1434.9      227.999     26212.1
NM_022658     2288    2110.9      4881        381446
NM_153633     1666    1488.9      361.044     40002.4
NM_153693     2072    1894.9      73.6719     6413.67
NM_173860     849     671.903     962         236189
NR_003084     1640    1462.9      0.00164208  0.18517
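
The tpm column can be recomputed from est_counts and eff_length. The sketch below assumes the usual transcripts-per-million definition (est_counts divided by eff_length for each target, rescaled so the column sums to one million) and compares the result with the values listed above. The function name is made up for illustration.

```python
# Rows copied from the abundance.tsv listing above:
# target_id, length, eff_length, est_counts, tpm
ROWS = """\
NM_001168316 2283 2105.9 160.606 12581
NM_174914 2385 2207.9 1500.72 112128
NR_031764 1853 1675.9 102.671 10106.2
NM_004503 1681 1503.9 331.118 36320.7
NM_006897 1541 1363.9 664 80311.3
NM_014212 2037 1859.9 55 4878.25
NM_014620 2300 2122.9 591.166 45937.9
NM_017409 1959 1781.9 47 4351.17
NM_017410 2396 2218.9 42 3122.5
NM_018953 1612 1434.9 227.999 26212.1
NM_022658 2288 2110.9 4881 381446
NM_153633 1666 1488.9 361.044 40002.4
NM_153693 2072 1894.9 73.6719 6413.67
NM_173860 849 671.903 962 236189
NR_003084 1640 1462.9 0.00164208 0.18517
"""


def tpm_from_counts(est_counts, eff_lengths):
    """Assumed TPM definition: count / eff_length, rescaled to sum to 1e6."""
    rates = [count / length for count, length in zip(est_counts, eff_lengths)]
    total = sum(rates)
    return [rate / total * 1e6 for rate in rates]


records = [line.split() for line in ROWS.splitlines()]
names = [r[0] for r in records]
eff_lengths = [float(r[2]) for r in records]
est_counts = [float(r[3]) for r in records]
reported_tpm = [float(r[4]) for r in records]

computed_tpm = tpm_from_counts(est_counts, eff_lengths)
for name, got, want in zip(names, computed_tpm, reported_tpm):
    # Print recomputed and reported values side by side.
    print(f"{name}\t{got:.6g}\t{want:.6g}")
```

Small differences in the last digits are expected, because the listing shows rounded values.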

Usage:

kallisto

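kallisto is a program for quantifying transcript abundances from RNA-Seq data or, more generally, abundances of target sequences from high-throughput sequencing reads. It is based on pseudoalignment, a technique for rapidly determining which targets a read is compatible with, without the need for alignment. On standard RNA-Seq data, kallisto can build an index in under ten minutes and quantify (that is, assign) 30 million human reads in under three minutes on a Mac desktop. Pseudoalignment preserves the key information needed for quantification, so kallisto is not only fast but also as accurate as existing quantification tools. In fact, because pseudoalignment is robust to errors in the reads, kallisto significantly outperforms existing tools in many benchmarks.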
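
The idea of pseudoalignment can be illustrated with a toy example. The sketch below is not kallisto's implementation (kallisto works on a de Bruijn graph of the transcriptome and also handles reverse complements). It only shows the core idea: look up each k-mer of a read and intersect the sets of targets that contain it. The sequences, names, and the choice to skip k-mers absent from the index are assumptions made for this illustration.

```python
from collections import defaultdict

K = 5  # toy k-mer length; kallisto's default is 31


def build_kmer_index(transcripts, k):
    """Map every k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def pseudoalign(read, index, k):
    """Return the transcripts compatible with all indexed k-mers of the read."""
    compatible = None
    for i in range(len(read) - k + 1):
        hits = index.get(read[i:i + k])
        if hits is None:
            continue  # assumption of this sketch: unknown k-mers are skipped
        compatible = set(hits) if compatible is None else compatible & hits
    return compatible or set()


# Made-up transcripts: t1 and t2 share a prefix, t3 is unrelated.
toy_transcripts = {
    "t1": "ACGTACGGTCA",
    "t2": "ACGTACGGATT",
    "t3": "TTTTGGGGCCC",
}
toy_index = build_kmer_index(toy_transcripts, K)

for read in ("ACGTACGG", "TACGGTCA", "TTGGGGCC", "AAAAAAAA"):
    print(read, sorted(pseudoalign(read, toy_index, K)))
```

The first read falls in the shared prefix and stays ambiguous between t1 and t2; that set of compatible targets is what the quantification step works with.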

RNA-Seq data quantified with kallisto can be analyzed with sleuth.

Running kallisto with no arguments prints a list of usage options:

kallisto 0.43.0

Usage: kallisto <CMD> [arguments] ..

Where <CMD> can be one of:

    index         Builds a kallisto index
    quant         Runs the quantification algorithm
    pseudo        Runs the pseudoalignment step
    h5dump        Converts HDF5-formatted results to plaintext
    version       Prints version information
    cite          Prints citation information

Running kallisto <CMD> without arguments prints usage information for <CMD>

The commands are described below.

index:

kallisto index builds an index from a FASTA-formatted file of target sequences. The arguments for the index command are:

kallisto 0.43.0
Builds a kallisto index

Usage: kallisto index [arguments] FASTA-files

Required argument:
-i, --index=STRING          Filename for the kallisto index to be constructed

Optional argument:
-k, --kmer-size=INT         k-mer (odd) length (default: 31, max value: 31)
    --make-unique           Replace repeated target names with unique names

The input files are in FASTA format and may be gzipped.
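
The usage text does not say why the k-mer length must be odd. A reason commonly given for k-mer based tools, offered here as background and not taken from the kallisto documentation, is that an odd-length k-mer can never equal its own reverse complement, so a k-mer and its reverse-strand counterpart are always distinct. The short check below enumerates all k-mers for a few small k and counts those that are their own reverse complement.

```python
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def count_self_reverse_complement(k):
    """Count k-mers over ACGT that equal their own reverse complement."""
    return sum(
        1
        for bases in product("ACGT", repeat=k)
        if "".join(bases) == reverse_complement("".join(bases))
    )


for k in (3, 4, 5, 6):
    print(k, count_self_reverse_complement(k))
```

For odd k the middle base would have to be its own complement, which is impossible, so the count is zero.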

quant:

kallisto quant runs the quantification algorithm. The arguments for the quant command are:

kallisto 0.43.0
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto quant [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              quantification
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
    --bias                    Perform sequence based bias correction
-b, --bootstrap-samples=INT   Number of bootstrap samples (default: 0)
    --seed=INT                Seed for the bootstrap sampling (default: 42)
    --plaintext               Output plaintext instead of HDF5
    --single                  Quantify single-end reads
    --fr-stranded             Strand specific reads, first read forward
    --rf-stranded             Strand specific reads, first read reverse
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: value is estimated from the input data)
-t, --threads=INT             Number of threads to use (default: 1)
    --pseudobam               Output pseudoalignments in SAM format to stdout
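
The quantification step can be pictured as an EM iteration over equivalence classes, that is, groups of reads that are compatible with the same set of targets. The following is a minimal sketch under simplifying assumptions (no bias correction, a fixed number of iterations, made-up counts and lengths), not kallisto's implementation; all names are hypothetical.

```python
def em_abundance(eq_classes, eff_lengths, n_iter=100):
    """Toy EM: split each class's reads among its targets.

    eq_classes maps a frozenset of target names to a read count.
    eff_lengths maps each target name to its effective length.
    Returns the estimated read count per target.
    """
    targets = list(eff_lengths)
    total_reads = sum(eq_classes.values())
    alpha = {t: 1.0 / len(targets) for t in targets}  # uniform start
    counts = {t: 0.0 for t in targets}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in targets}
        # E-step: share reads in proportion to alpha / effective length.
        for members, n_reads in eq_classes.items():
            weights = {t: alpha[t] / eff_lengths[t] for t in members}
            denom = sum(weights.values())
            if denom == 0.0:
                continue
            for t, w in weights.items():
                counts[t] += n_reads * w / denom
        # M-step: new proportions from the expected counts.
        alpha = {t: counts[t] / total_reads for t in targets}
    return counts


# Made-up example: 60 reads unique to A, 20 unique to B, 20 ambiguous.
toy_classes = {
    frozenset({"A"}): 60,
    frozenset({"B"}): 20,
    frozenset({"A", "B"}): 20,
}
toy_lengths = {"A": 1000.0, "B": 1000.0}

est = em_abundance(toy_classes, toy_lengths)
print({t: round(c, 3) for t, c in est.items()})
```

With equal effective lengths, the ambiguous reads end up split in the same 3:1 ratio as the unique reads, so A converges to 75 reads and B to 25.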

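kallisto can process either single-end or paired-end reads. The default mode is paired-end, which takes an even number of FASTQ files listed as pairs: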

kallisto quant -i index -o output pairA_1.fastq pairA_2.fastq pairB_1.fastq pairB_2.fastq

For single-end reads, supply the --single flag together with the -l and -s options, then list the input FASTQ files:

kallisto quant -i index -o output --single -l 200 -s 20 file1.fastq.gz file2.fastq.gz file3.fastq.gz
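
When many samples have to be processed, it can help to assemble the command line in a script. The helper below is a hypothetical wrapper, not part of kallisto; it only uses flags that appear in the usage text above. Treating -l and -s as mandatory in single-end mode is an assumption about this version.

```python
import shlex


def build_quant_command(index, output_dir, fastq_files, single=False,
                        fragment_length=None, sd=None, threads=1):
    """Return the argument list for a `kallisto quant` call (hypothetical helper)."""
    files = list(fastq_files)
    if not files:
        raise ValueError("at least one FASTQ file is required")
    cmd = ["kallisto", "quant", "-i", index, "-o", output_dir,
           "-t", str(threads)]
    if single:
        # Assumption: single-end mode needs both fragment length and sd.
        if fragment_length is None or sd is None:
            raise ValueError("single-end mode needs fragment_length and sd")
        cmd += ["--single", "-l", str(fragment_length), "-s", str(sd)]
    elif len(files) % 2 != 0:
        raise ValueError("paired-end mode needs an even number of FASTQ files")
    return cmd + files


paired_cmd = build_quant_command(
    "index", "output", ["pairA_1.fastq", "pairA_2.fastq"])
single_cmd = build_quant_command(
    "index", "output", ["file1.fastq.gz"],
    single=True, fragment_length=200, sd=20)

print(shlex.join(paired_cmd))
print(shlex.join(single_cmd))
```

The returned list can be passed to subprocess.run(cmd, check=True) on a machine where kallisto is on the PATH.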

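kallisto quant produces three output files by default:

abundance.h5: a binary HDF5 file containing run information, abundance estimates, bootstrap estimates, and so on. This file can be read by sleuth.
abundance.tsv: a plaintext file of the abundance estimates.
run_info.json: a JSON file containing information about the run.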
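
The two plaintext outputs can be read with the Python standard library. The sketch below assumes that abundance.tsv is tab-separated with the header shown in the test output earlier (target_id, length, eff_length, est_counts, tpm), and it makes no assumption about which keys run_info.json contains. The demo directory and its JSON content are placeholders created only so that the example runs.

```python
import csv
import json
import tempfile
from pathlib import Path


def load_kallisto_output(output_dir):
    """Read abundance.tsv and run_info.json from a kallisto output directory."""
    out = Path(output_dir)
    with open(out / "abundance.tsv", newline="") as handle:
        rows = list(csv.DictReader(handle, delimiter="\t"))
    for row in rows:
        # Convert the numeric columns from text to float.
        for column in ("length", "eff_length", "est_counts", "tpm"):
            row[column] = float(row[column])
    with open(out / "run_info.json") as handle:
        run_info = json.load(handle)
    return rows, run_info


# Demo on a temporary directory with two rows from the listing above
# and a placeholder JSON file (not real kallisto output).
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "abundance.tsv").write_text(
        "target_id\tlength\teff_length\test_counts\ttpm\n"
        "NM_006897\t1541\t1363.9\t664\t80311.3\n"
        "NM_014212\t2037\t1859.9\t55\t4878.25\n"
    )
    Path(tmp, "run_info.json").write_text(json.dumps({"placeholder": True}))
    rows, run_info = load_kallisto_output(tmp)

top = max(rows, key=lambda row: row["tpm"])
print(top["target_id"], top["tpm"])
print(sorted(run_info))
```

For a real run, call load_kallisto_output("output") with the directory given to -o.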

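Optional arguments:

Pseudobam:
--pseudobam writes all pseudoalignments to stdout in SAM format. The output can be redirected to a file or converted to BAM with samtools.

For example: kallisto quant -i index -o out --pseudobam r1.fastq r2.fastq > out.sam

Or, with samtools: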

kallisto quant -i index -o out --pseudobam r1.fastq r2.fastq | samtools view -Sb - > out.bam 
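
Once the pseudoalignments are saved as SAM text, they can be summarized with a few lines of Python. The sketch below relies only on the standard SAM layout (tab-separated fields, FLAG in the second column, reference name in the third, header lines starting with @) and counts records per target. The sample records are invented for the example and are not real kallisto output.

```python
from collections import Counter


def count_sam_records(lines):
    """Count mapped SAM records per reference name (RNAME)."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname = int(fields[1]), fields[2]
        if flag & 0x4 or rname == "*":
            continue  # skip unmapped records
        counts[rname] += 1
    return counts


# Invented records for illustration only.
sample_sam = [
    "@HD\tVN:1.0",
    "@SQ\tSN:NM_006897\tLN:1541",
    "r1\t99\tNM_006897\t10\t255\t8M\t=\t60\t58\tACGTACGG\tIIIIIIII",
    "r1\t147\tNM_006897\t60\t255\t8M\t=\t10\t-58\tCCGTACGT\tIIIIIIII",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGTACGG\tIIIIIIII",
    "r2\t141\t*\t0\t0\t*\t*\t0\t0\tACGTACGG\tIIIIIIII",
    "r3\t0\tNM_014212\t5\t255\t8M\t*\t0\t0\tACGTACGG\tIIIIIIII",
]

record_counts = count_sam_records(sample_sam)
for target, n in record_counts.most_common():
    print(target, n)
```

To process a real file, pass an open file handle, for example count_sam_records(open("out.sam")).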




pseudo:

kallisto pseudo runs only the pseudoalignment step and is intended for use with single-cell RNA-Seq data. The options for the pseudo command are:

kallisto 0.43.0
Computes equivalence classes for reads and quantifies abundances

Usage: kallisto pseudo [arguments] FASTQ-files

Required arguments:
-i, --index=STRING            Filename for the kallisto index to be used for
                              pseudoalignment
-o, --output-dir=STRING       Directory to write output to

Optional arguments:
-u  --umi                     First file in pair is a UMI file
-b  --batch=FILE              Process files listed in FILE
    --single                  Quantify single-end reads
-l, --fragment-length=DOUBLE  Estimated average fragment length
-s, --sd=DOUBLE               Estimated standard deviation of fragment length
                              (default: value is estimated from the input data)
-t, --threads=INT             Number of threads to use (default: 1)
    --pseudobam               Output pseudoalignments in SAM format to stdout

The command format and the meaning of the arguments are the same as for quant. However, pseudo does not run the EM algorithm to quantify abundances. In addition, pseudo has an option to specify many cells in a batch file, for example:

kallisto pseudo -i index -o output -b batch.txt
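
The layout of batch.txt is not shown here. The sketch below assumes one line per cell with a cell id followed by its two paired-end FASTQ files, separated by spaces, and a leading comment line starting with #. That layout is an assumption to check against the kallisto manual before use; the cell ids and file names are made up.

```python
def format_batch_file(cells):
    """Build batch-file text from {cell_id: (read1_file, read2_file)}.

    Assumed layout (verify against the kallisto manual):
    "<id> <file1> <file2>" per line, "#" starts a comment line.
    """
    lines = ["#id file1 file2"]
    for cell_id, (read1, read2) in cells.items():
        for field in (cell_id, read1, read2):
            if any(ch.isspace() for ch in field):
                raise ValueError(f"whitespace not allowed in field: {field!r}")
        lines.append(f"{cell_id} {read1} {read2}")
    return "\n".join(lines) + "\n"


# Made-up cells for illustration.
toy_cells = {
    "cell1": ("cell1_1.fastq.gz", "cell1_2.fastq.gz"),
    "cell2": ("cell2_1.fastq.gz", "cell2_2.fastq.gz"),
}
batch_text = format_batch_file(toy_cells)
print(batch_text, end="")
```

The text can then be written to batch.txt and passed to kallisto pseudo with -b.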

h5dump:

kallisto h5dump converts HDF5-formatted results to plaintext. The arguments for the h5dump command are:

kallisto 0.43.0
Converts HDF5-formatted results to plaintext

Usage:  kallisto h5dump [arguments] abundance.h5

Required argument:
-o, --output-dir=STRING       Directory to write output to
