#===============================      Version 1  ===============================================
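Three ways to use InterProScan
InterProScan predicts protein function by searching databases of protein domains and functional sites. It is the search tool for InterPro, the non-redundant database developed by EBI that integrates protein families, domains, and functional sites. InterProScan combines several of the most widely used member databases and is used to assign InterPro and GO annotations to proteins of unknown function.
The following introduces three methods of InterPro annotation: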

3. Local InterProScan annotation
3.1 Installing and configuring a local InterProScan

3.1.1 Download the following 5 files from ftp://ftp.ebi.ac.uk/pub/databases/interpro/iprscan:

RELEASE/latest/iprscan_v4.8.tar.gz
BIN/4.x/iprscan_bin4.x_[PLATFORM].tar.gz
DATA/iprscan_DATA_[LATESTDATAVERSION].tar.gz
DATA/iprscan_PTHR_DATA_[LATESTDATAVERSION].tar.gz
DATA/iprscan_MATCH_DATA_[LATESTDATAVERSION].tar.gz

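3.1.2 Extract the 5 files into the same folder, then run the Config.pl script found there to configure InterProScan.
3.1.3 During configuration, if you choose to set up the local web interface, edit the configuration file of your local web (www) server so that the web version can run locally.
3.2 Using the local InterProScan
3.2.1 Running iprscan from the command line: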

$bin/iprscan -cli -iprlookup -goterms -format xml -i test.fasta -o test.out

# help

http://www.chenlianfu.com/?tag=iprscan

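The configuration relies on two Perl modules, XML::Parser and XML::Parser::Expat. The latter must be installed first and the former after it. Because these are C-level modules, the expat development library has to be present on the system; otherwise the build stops with: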

Expat must be installed prior to building XML::Parser and I can't find it in the standard library directories. Install 'expat-devel' (or 'libexpat1-dev') package

Tip: with root or sudo privileges, run yum install expat-devel or apt-get install libexpat1-dev (the exact package name depends on your distribution and version).
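
Whether the two Perl modules are already usable can be checked before running Config.pl. The sketch below is not part of InterProScan. It assumes `perl` is on the PATH and relies on the common `perl -MModule -e 1` idiom, which exits with a non-zero status when a module cannot be loaded. The helper name `perl_module_available` is made up for this example.

```python
import subprocess


def perl_module_available(module, run=subprocess.run):
    """Return True when `perl -M<module> -e 1` exits with status 0.

    `run` is injectable only so the function can be exercised without perl.
    """
    try:
        result = run(
            ["perl", f"-M{module}", "-e", "1"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        # perl itself is not installed or not on the PATH
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # Same order as in the text: Expat first, then XML::Parser.
    for name in ("XML::Parser::Expat", "XML::Parser"):
        state = "ok" if perl_module_available(name) else "missing"
        print(f"{name}: {state}")
```

If either module is reported as missing, install the expat development package first and then the Perl module.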

#==============================================    Version 2   =============================================

Original source: https://github.com/ebi-pf-team/interproscan/wiki

Step 1: Environment setup

Software requirements:

  • 64-bit Linux
  • Perl (default on most Linux distributions)
  • Python 2.7.x only
  • Oracle's Java JDK/JRE version 8 (required by InterProScan 5.17-56.0 onwards). Earlier InterProScan release versions required Java 6 (version 6u4 and above) or Java 7.
  • Environment variables set
    • $JAVA_HOME should point to the location of the JVM
    • $JAVA_HOME/bin should be added to the $PATH
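
The environment requirements above can be checked with a few lines of Python before the first run. This is only a sketch: the function name `check_environment` and the wording of the messages are invented here, and it checks only JAVA_HOME, the PATH entry, and the presence of `java` and `perl`, not their versions.

```python
import os
import shutil


def check_environment(env=None, which=shutil.which):
    """Return a list of problems; an empty list means the checks passed.

    `env` and `which` are injectable so the function can be tested
    without touching the real environment.
    """
    env = os.environ if env is None else env
    problems = []

    java_home = env.get("JAVA_HOME")
    if not java_home:
        problems.append("JAVA_HOME is not set")
    else:
        java_bin = os.path.join(java_home, "bin")
        path_dirs = env.get("PATH", "").split(os.pathsep)
        if java_bin not in path_dirs:
            problems.append(f"{java_bin} is not on PATH")

    # Only presence is checked here, not the Java 8 version requirement.
    for tool in ("java", "perl"):
        if which(tool, path=env.get("PATH")) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("environment looks fine" if not issues else "\n".join(issues))
```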

Step 2: Download the software

wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.27-66.0/interproscan-5.27-66.0-64-bit.tar.gz

wget ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/5/5.27-66.0/interproscan-5.27-66.0-64-bit.tar.gz.md5

md5sum -c interproscan-5.27-66.0-64-bit.tar.gz.md5   (before extracting, keep the .tar.gz and .tar.gz.md5 files in the same directory so the integrity check can find the archive)

tar -pxvzf interproscan-5.27-66.0-64-bit.tar.gz   (-p preserves file permissions; -v only lists each file as it is extracted and can be dropped)

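(The extracted directory contains a data directory. The PANTHER data should later be extracted into it, since that is the default path in the configuration file. If you put the data somewhere else, update the configuration accordingly.)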
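
On systems without md5sum, the same integrity check can be done with Python's hashlib. The sketch below assumes the .md5 file has the usual md5sum layout, with the hex digest as the first whitespace-separated field. The function names are made up for this example.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(archive_path, md5_path):
    """Compare the archive's digest with the first field of the .md5 file.

    Assumes the md5sum layout '<hex digest>  <file name>'.
    """
    with open(md5_path) as handle:
        expected = handle.read().split()[0].lower()
    return md5_of_file(archive_path) == expected
```

Calling `verify_md5("interproscan-5.27-66.0-64-bit.tar.gz", "interproscan-5.27-66.0-64-bit.tar.gz.md5")` should return True for an intact download. Reading in chunks keeps memory use low for an archive of this size.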

Step 3: Test run

./interproscan.sh -i test_proteins.fasta -f tsv 
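./interproscan.sh -i test_proteins.fasta -cpu 4 -f GFF3 -goterms -iprlookup -t p -T 20171127tmp

#  Options: -i input FASTA file; -cpu number of cores (the option needs a value, 4 here is an arbitrary example); -f output format; -goterms -iprlookup add GO and InterPro annotation; -t sequence type (p = protein, n = dna/rna); -T name of the temporary file directory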
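
When the same run has to be repeated for many input files, it can help to assemble the command line in a script. The sketch below only builds the argument list from the options used above. The function name `build_command` and its defaults are invented for this example, and the `-cpu` value of 4 is arbitrary.

```python
def build_command(fasta, formats=("tsv",), cpu=None, goterms=False,
                  iprlookup=False, seqtype="p", tempdir=None,
                  script="./interproscan.sh"):
    """Assemble the argument list for interproscan.sh without running it."""
    if seqtype not in ("p", "n"):
        raise ValueError("seqtype must be 'p' (protein) or 'n' (dna/rna)")
    # -f accepts a comma separated list of output formats
    cmd = [script, "-i", fasta, "-f", ",".join(formats), "-t", seqtype]
    if cpu is not None:
        cmd += ["-cpu", str(int(cpu))]
    if goterms:
        cmd.append("-goterms")
    if iprlookup:
        cmd.append("-iprlookup")
    if tempdir:
        cmd += ["-T", tempdir]
    return cmd


if __name__ == "__main__":
    command = build_command("test_proteins.fasta", formats=("GFF3",), cpu=4,
                            goterms=True, iprlookup=True,
                            tempdir="20171127tmp")
    print(" ".join(command))
    # To launch the run for real, pass the list to subprocess.run(command, check=True).
```

Passing the arguments as a list rather than a single string avoids shell quoting problems with unusual file names.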

Tip:

TSV stands for tab-separated values.
CSV stands for comma-separated values.
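
Because the TSV output is plain tab-separated text, the GO annotation can be pulled out with a short script. The sketch below rests on an assumption about the column layout that should be checked against the InterProScan documentation for your release: column 1 holds the protein accession and column 14 holds the GO terms separated by `|` (present only when -goterms is used). The sample rows are placeholders written for this example, not real results.

```python
from collections import defaultdict

# Assumed 0-based column positions in the InterProScan TSV output.
PROTEIN_COLUMN = 0
GO_COLUMN = 13


def go_terms_by_protein(lines):
    """Collect the GO terms of each protein from TSV lines."""
    terms = defaultdict(set)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= GO_COLUMN:
            continue  # row without a GO column
        go_field = fields[GO_COLUMN]
        if go_field in ("", "-"):
            continue  # no GO annotation for this match
        terms[fields[PROTEIN_COLUMN]].update(t for t in go_field.split("|") if t)
    return {protein: sorted(found) for protein, found in terms.items()}


if __name__ == "__main__":
    # Placeholder rows: 13 leading fields followed by the GO column.
    filler = ["x"] * 12
    sample = [
        "\t".join(["protA"] + filler + ["GO:0000001|GO:0000002"]),
        "\t".join(["protA"] + filler + ["GO:0000002"]),
        "\t".join(["protB"] + filler + ["-"]),
    ]
    print(go_terms_by_protein(sample))
```

With a real result file, pass the open file handle: `go_terms_by_protein(open("test_proteins.fasta.tsv"))`, where the file name is whatever your run produced.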

#=============================      Detailed options  ========================================

27/11/2017 14:41:35:049 Welcome to InterProScan-5.27-66.0
usage: java -XX:+UseParallelGC -XX:ParallelGCThreads=2 -XX:+AggressiveOpts
            -XX:+UseFastAccessorMethods -Xms128M -Xmx2048M -jar
            interproscan-5.jar

Please give us your feedback by sending an email to interhelp@ebi.ac.uk

 -appl,--applications <ANALYSES>
        Optional, comma separated list of analyses. If this option is not set, ALL analyses will be run.
 -b,--output-file-base <OUTPUT-FILE-BASE>
        Optional, base output filename (relative or absolute path). Note that this option, the --output-dir (-d) option and the --outfile (-o) option are mutually exclusive. The appropriate file extension for the output format(s) will be appended automatically. By default the input file path/name will be used.
 -cpu,--cpu <CPU>
        Optional, number of cores for inteproscan.
 -d,--output-dir <OUTPUT-DIR>
        Optional, output directory. Note that this option, the --outfile (-o) option and the --output-file-base (-b) option are mutually exclusive. The output filename(s) are the same as the input filename, with the appropriate file extension(s) for the output format(s) appended automatically.
 -dp,--disable-precalc
        Optional. Disables use of the precalculated match lookup service. All match calculations will be run locally.
 -dra,--disable-residue-annot
        Optional, excludes sites from the XML, JSON output
 -f,--formats <OUTPUT-FORMATS>
        Optional, case-insensitive, comma separated list of output formats. Supported formats are TSV, XML, JSON, GFF3, HTML and SVG. Default for protein sequences are TSV, XML and GFF3, or for nucleotide sequences GFF3 and XML.
 -goterms,--goterms
        Optional, switch on lookup of corresponding Gene Ontology annotation (IMPLIES -iprlookup option)
 -help,--help
        Optional, display help information
 -i,--input <INPUT-FILE-PATH>
        Optional, path to fasta file that should be loaded on Master startup. Alternatively, in CONVERT mode, the InterProScan 5 XML file to convert.
 -iprlookup,--iprlookup
        Also include lookup of corresponding InterPro annotation in the TSV and GFF3 output formats.
 -ms,--minsize <MINIMUM-SIZE>
        Optional, minimum nucleotide size of ORF to report. Will only be considered if n is specified as a sequence type. Please be aware of the fact that if you specify a too short value it might be that the analysis takes a very long time!
 -o,--outfile <EXPLICIT_OUTPUT_FILENAME>
        Optional explicit output file name (relative or absolute path). Note that this option, the --output-dir (-d) option and the --output-file-base (-b) option are mutually exclusive. If this option is given, you MUST specify a single output format using the -f option. The output file name will not be modified. Note that specifying an output file name using this option OVERWRITES ANY EXISTING FILE.
 -pa,--pathways
        Optional, switch on lookup of corresponding Pathway annotation (IMPLIES -iprlookup option)
 -t,--seqtype <SEQUENCE-TYPE>
        Optional, the type of the input sequences (dna/rna (n) or protein (p)). The default sequence type is protein.
 -T,--tempdir <TEMP-DIR>
        Optional, specify temporary file directory (relative or absolute path). The default location is temp/.
 -version,--version
        Optional, display version number
 -vtsv,--output-tsv-version
        Optional, includes a TSV version file along with any TSV output (when TSV output requested)

Copyright © EMBL European Bioinformatics Institute, Hinxton, Cambridge,
UK. (http://www.ebi.ac.uk) The InterProScan software itself is provided
under the Apache License, Version 2.0
(http://www.apache.org/licenses/LICENSE-2.0.html). Third party components
(e.g. member database binaries and models) are subject to separate
licensing - please see the individual member database websites for
details.

Available analyses:
    TIGRFAM (15.0) : TIGRFAMs are protein families based on Hidden Markov Models or HMMs
    SFLD (3) : SFLDs are protein families based on Hidden Markov Models or HMMs
    SUPERFAMILY (1.75) : SUPERFAMILY is a database of structural and functional annotation for all proteins and genomes.
    PANTHER (12.0) : The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System is a unique resource that classifies genes by their functions, using published scientific experimental evidence and evolutionary relationships to predict function even in the absence of direct experimental evidence.
    Gene3D (4.1.0) : Structural assignment for whole genes and genomes using the CATH domain structure database
    Hamap (2017_10) : High-quality Automated and Manual Annotation of Microbial Proteomes
    Coils (2.2.1) : Prediction of Coiled Coil Regions in Proteins
    ProSiteProfiles (2017_09) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
    SMART (7.1) : SMART allows the identification and analysis of domain architectures based on Hidden Markov Models or HMMs
    CDD (3.16) : Prediction of CDD domains in Proteins
    PRINTS (42.0) : A fingerprint is a group of conserved motifs used to characterise a protein family
    ProSitePatterns (2017_09) : PROSITE consists of documentation entries describing protein domains, families and functional sites as well as associated patterns and profiles to identify them
    Pfam (31.0) : A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs)
    ProDom (2006.1) : ProDom is a comprehensive set of protein domain families automatically generated from the UniProt Knowledge Database.
    MobiDBLite (1.0) : Prediction of disordered domains Regions in Proteins
    PIRSF (3.02) : The PIRSF concept is being used as a guiding principle to provide comprehensive and non-overlapping clustering of UniProtKB sequences into a hierarchical order to reflect their evolutionary relationships.

Deactivated analyses:
    Phobius (1.01) : Analysis Phobius is deactivated, because the resources expected at the following paths do not exist: bin/phobius/1.01/phobius.pl
    SignalP_EUK (4.1) : Analysis SignalP_EUK is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
    SignalP_GRAM_POSITIVE (4.1) : Analysis SignalP_GRAM_POSITIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
    TMHMM (2.0c) : Analysis TMHMM is deactivated, because the resources expected at the following paths do not exist: bin/tmhmm/2.0c/decodeanhmm, data/tmhmm/2.0c/TMHMM2.0c.model
    SignalP_GRAM_NEGATIVE (4.1) : Analysis SignalP_GRAM_NEGATIVE is deactivated, because the resources expected at the following paths do not exist: bin/signalp/4.1/signalp
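
The help text states two rules about the output options that are easy to trip over in scripts: -o, -d and -b are mutually exclusive, and -o requires exactly one output format. A small pre-flight check might look like the sketch below. The function name and the messages are invented here, and InterProScan performs its own validation regardless.

```python
def check_output_options(outfile=None, output_dir=None,
                         output_file_base=None, formats=()):
    """Return a list of violations of the rules stated in the help text."""
    errors = []
    chosen = [flag for flag, value in (("-o", outfile),
                                       ("-d", output_dir),
                                       ("-b", output_file_base)) if value]
    if len(chosen) > 1:
        errors.append("options " + ", ".join(chosen) + " are mutually exclusive")
    if outfile and len(formats) != 1:
        errors.append("-o requires exactly one output format via -f")
    return errors


if __name__ == "__main__":
    # Two formats together with -o breaks the single-format rule.
    print(check_output_options(outfile="test.out", formats=("tsv", "xml")))
```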
