Bioinformatics Glossary

Original: http://homepages.ulb.ac.be/~dgonze/TEACHING/bioinfo_glossary.html

Affine gap costs: A scoring system for gaps within alignments that charges a penalty for the existence of a gap and an additional per-residue penalty proportional to the gap's length. 
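
As a minimal sketch of the definition above, the cost of a single gap can be written as an opening charge plus a per-residue charge. The function name and the default values (11 and 1) are illustrative assumptions, not values prescribed by this glossary.

```python
def affine_gap_cost(length, gap_open=11, gap_extend=1):
    """Cost of one gap spanning `length` residues.

    gap_open   -- penalty charged once for the existence of the gap (assumed value)
    gap_extend -- penalty charged for every residue in the gap (assumed value)
    """
    if length <= 0:
        return 0  # no gap, no cost
    return gap_open + gap_extend * length


# One gap of three residues versus three separate one-residue gaps.
single_long_gap = affine_gap_cost(3)
three_short_gaps = 3 * affine_gap_cost(1)
```

With these values a single long gap is cheaper than several short ones of the same total length, which is the usual motivation for affine rather than linear gap costs.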

Algorithm: A fixed procedure, implemented in a computer program. 

Alignment score: A numerical value that describes the overall quality of an alignment. Higher scores correspond to higher similarity and are less likely to have been obtained by chance. 

Bit score: A log-scaled, normalized version of a raw alignment score, which makes scores obtained with different scoring systems comparable. 

BLAST (Basic Local Alignment Search Tool): A heuristic sequence comparison algorithm, developed at the National Center for Biotechnology Information (NCBI), that is used to search sequence databases for optimal local alignments to a query sequence. 

BLOSUM (BLOck SUbstitution Matrix): A collection of substitution matrices based on conserved protein domains. 

Bootstrapping: A statistical method often used to estimate the reproducibility of specific features of clustering or phylogenetic trees. 

Cluster analysis (or clustering): A process of assigning data points (sequences) into groups (clusters). 

Command line: Interacting with software by typing specific commands. Generally considered less user friendly than a graphical user interface. 

Comparative genomics: The study of comparing complete genome sequences, often by computational methods, to understand general principles of genome structure and function. 

Controlled vocabulary: A vocabulary that contains specific words that are consistently applied to all entries in a database. The Gene Ontology (GO) and the MeSH system are examples of controlled vocabularies. 

DNA chip technology: New technology for parallel processing thousands of DNA segments, such as for detecting mutation patterns in genomic DNAs or expression patterns of mRNAs. 

Dynamic programming: A type of algorithm widely used for constructing sequence alignments, which is guaranteed to return the optimal solution. 

E-value (Expectation value): A correction of the P-value for multiple testing. In the context of database searches, the E-value is the number of distinct alignments, with a score equivalent to or better than the one of interest, that are expected to occur in a database search purely by chance. The lower the E-value, the more significant the score. 

EST (Expressed Sequence Tag): A short cDNA (complementary DNA) sequence from an expressed gene, which is assumed to be long enough to be specific to a given gene. Often used to confirm gene predictions. 

Extreme value distribution (EVD): The probability distribution applicable to the scores of optimal local alignments. EVD is used to compute the p-value and the e-value when a database search is performed. 
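
The bit score, E-value and P-value of a database hit are related through the extreme value statistics commonly used for local alignments (Karlin-Altschul statistics). The sketch below assumes that form: bit score S' = (lambda*S - ln K) / ln 2, E = m*n*2^(-S'), and P = 1 - exp(-E). The function names are hypothetical, and the values of `lam` and `k` depend on the scoring system, so the numbers used in the example are placeholders only.

```python
import math


def bit_score(raw_score, lam, k):
    """Convert a raw alignment score to a bit score (log base 2, normalized)."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value(raw_score, query_len, db_len, lam, k):
    """Expected number of chance alignments scoring at least `raw_score`."""
    return query_len * db_len * 2.0 ** (-bit_score(raw_score, lam, k))


def p_value_from_e(e):
    """Probability of at least one chance alignment, given the E-value."""
    return -math.expm1(-e)  # numerically stable form of 1 - exp(-e)


# Placeholder parameters, chosen only to make the example runnable.
example_e = e_value(raw_score=100, query_len=250, db_len=1_000_000, lam=0.3, k=0.1)
example_p = p_value_from_e(example_e)
```

For small E-values the P-value and the E-value are nearly identical, which is why search tools usually report only the E-value.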

False negative (often denoted FN): something predicted as negative, but which is actually positive. Ex: a gene which is not predicted to be regulated by a transcription factor X, although in reality it is regulated by X. See also "true positive" and "false positive". 

False positive (often denoted FP): something predicted as positive, but which is actually not. Ex: a gene which is predicted to be regulated by a transcription factor X, but which is in reality not regulated by X. See also "true positive" and "false negative". 

FASTA: A popular heuristic sequence comparison algorithm (Pearson & Lipman) that is used to search sequence databases for local alignments to a query. 

Filtering (in BLAST): See masking. 

Free-form text: The opposite of a "controlled vocabulary". Free text has no structured set of words, such that two related entries might not be identified in a search because different words are used to describe each entry. 

ftp (File Transfer Protocol): A method for transferring files across a network. 

Functional genomics: The study of predicting gene function using genomic information, and, in a broader sense, obtaining an overall picture of genome functions, including the expression profiles at the mRNA level (transcriptome) and the protein level (proteome). 

Gap: Within an alignment of two sequences, one or more adjacent null characters in one sequence aligned with adjacent letters in the other. 

Gap score (or gap cost): The score (or cost) assigned to a gap in an alignment (linear or affine). 

Gapped alignment: An alignment in which gaps are permitted. 

GenBank: A data bank of genetic sequences developed at the National Institutes of Health (NIH). 

Gene family: Two or more genes that are related by divergent evolution from a common ancestor, either by speciation or gene duplication. 

Gene fusion: A fusion gene is a hybrid gene formed from two or several previously separate genes. This can lead, for example, to a situation where one gene in an organism (e.g. yeast) corresponds to two or several genes in another species (e.g. E. coli). 

Global alignment: The alignment of two (or more) complete nucleic acid or protein sequences. 

Graphical user interface: Software that allows a user to interact via user-friendly menu and mouse-driven commands, as is typical of Macintosh and Windows applications and less common for UNIX applications; as opposed to a "command line" interface of typed or scripted commands. 

Heuristics: A term in computer science that refers to "guesses" made by a program to obtain approximately accurate results. Typically, these are used to increase the speed of a program greatly at the cost of potentially yielding suboptimal results. BLAST and FASTA use heuristics. 

High-throughput DNA sequencing: Experimental procedures for determining massive amounts of genomic DNA or cDNA sequence data using highly automated sequencing machines. 

HMM (Hidden Markov Model): An extension of a Markov model in which the states that generate the sequence are not observed directly. A pattern recognition method that can be used to represent the alignment of multiple sequences or sequence segments by attempting to capture common patterns of residue conservation. HMMER is a software package for profile hidden Markov model analysis. 

Homology: Two genes are said to be homologous if they derive from a common ancestor. 

Iterative search: After performing an initial search against the database, the high scoring matches are used to search the database again. In some cases (intermediate sequence path), these sequences are used on their own; in others, the sequences are joined together in an alignment or profile. 

Linux: A freely available but commercial-strength clone of the UNIX operating system. A godsend for starting bioinformatics on a budget. It is easily installed alongside Windows on a PC, so the same machine can be booted into either Linux or Windows. 

Local alignment: The alignment of segments from two (or more) nucleic acid or protein sequences. 

Low-complexity region: A region of a nucleic acid or protein sequence with highly biased residue composition, or consisting of many short near-perfect repeats. 

Markov model: A statistical model for sequences in which the probability of each letter depends on the letters that precede it. If the probability depends on the k preceding letters, the model is said to be of order k. 
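
To make the notion of model order concrete, the following sketch estimates the transition probabilities of an order-k Markov model from a single sequence by counting which letter follows each k-letter context. The function name and the plain maximum-likelihood counting (no pseudocounts) are assumptions chosen for brevity.

```python
from collections import Counter, defaultdict


def markov_transitions(sequence, k=1):
    """Estimate P(letter | k preceding letters) from observed counts."""
    counts = defaultdict(Counter)
    for i in range(k, len(sequence)):
        context = sequence[i - k:i]      # the k letters that precede position i
        counts[context][sequence[i]] += 1
    # Normalize the counts of each context into probabilities.
    return {
        context: {letter: n / sum(followers.values())
                  for letter, n in followers.items()}
        for context, followers in counts.items()
    }


# Order-1 model: each letter depends only on the previous one.
model = markov_transitions("AAAB", k=1)
```

With k=0 every position shares the empty context, so the result reduces to the overall letter frequencies.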

Masking: Some regions of sequences have particular characteristics (such as repeated patterns) that lead to spurious high scores. Masking replaces these "low complexity regions" of sequence with an X (for proteins) or N (for nucleic acids). 

MEDLINE: A free on-line literature database of papers in biomedical sciences (see http://www.ncbi.nlm.nih.gov/Entrez/medline.html). 

Metagenomics: Sequencing and analysis of genetic material retrieved directly from environmental samples. The term metagenome refers to the collection of genes sequenced from metagenomic data. 

Microarray (or DNA chip): A technology that allows the expression of thousands of genes to be measured simultaneously in one or several experimental conditions. 

Motif: A short conserved region in a protein sequence. Motifs frequently form a recognition sequence or are highly conserved parts of domains. Motif is sometimes used in a broader sense for all localized homology regions, independent of their size. 

Motif descriptor: A data structure that stores information about a sequence family, motif or domain family. Typical examples are consensus sequences, patterns, profiles and HMMs. 

mRNA expression profile: The identities and absolute or relative expression levels of mRNAs that characterize a particular cell type or physiological, developmental or pathological state. 

Multiple alignment: An alignment of three or more sequences, with gaps (spaces) inserted in the sequences such that residues with common structural positions and/or ancestral residues are aligned in the same column of the multiple alignment. 

Needleman-Wunsch algorithm: The standard dynamic programming algorithm for finding optimal global alignments. 
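
A minimal sketch of the global alignment recurrence follows. It computes only the optimal score (no traceback) and assumes a simple match/mismatch scheme with a linear gap cost; the function name and the scoring values are illustrative choices, not part of the original definition.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of sequences a and b (linear gap cost)."""
    # Row 0: aligning a prefix of b against nothing costs one gap per letter.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of a aligned against nothing
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # a[i-1] aligned with a gap
                curr[j - 1] + gap,  # b[j-1] aligned with a gap
            ))
        prev = curr
    return prev[-1]  # bottom-right cell: both sequences fully aligned


score = needleman_wunsch_score("GATTACA", "GATACA")
```

Only two rows of the matrix are kept because the score alone is requested; recovering the alignment itself would require storing the full matrix or traceback pointers.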

Neural network: A statistical pattern recognition method. 

Optimal alignment: A global or local alignment of two sequences with the highest possible score. 

Orthologs: Homologous sequences in different species that arose from a common ancestor gene during speciation. Orthologs are often (but not always) responsible for a similar function. See also "paralogs". 

P-value: The probability of obtaining, by chance alone, a result at least as extreme as the one observed. 

PAM (Point Accepted Mutation): A collection of substitution matrices, derived by M. Dayhoff, based on phylogenetic reconstruction. 

Paralogs: Homologous sequences (that is, sequences that share a common evolutionary ancestor) that diverged by gene duplication. See also "orthologs". 

Pattern: A descriptor for short sequence motifs, consisting of amino acid characters and meta-characters that can represent ambiguities or variable length insertions. 

PDB database: Protein Data Bank. The repository of solved protein structures. 

Phylogenetic footprinting: A bioinformatic approach to find functional sequences in the genome that relies on detecting their high degrees of conservation across different species. 

Phylogenetic profile: Distribution of homologous genes in species. Comparison of phylogenetic profiles can be used to predict functionally related genes. 

Position-specific scoring matrix (PSSM): A model representing characteristics of a group of aligned sequences, the simplest form of which is the tabulation of the frequency of amino acids (or nucleotides) in each position of the multiple sequence alignment. Also called "motif profile". 
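
The simplest form mentioned above, a per-position frequency table, can be sketched as follows. The function name and the toy DNA alignment are assumptions; practical PSSMs usually add pseudocounts and convert frequencies to log-odds scores, which this sketch omits.

```python
from collections import Counter


def position_frequency_matrix(aligned, alphabet="ACGT"):
    """Per-column residue frequencies for equal-length aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")
    matrix = []
    for column in zip(*aligned):  # iterate over alignment columns
        counts = Counter(column)
        matrix.append({sym: counts[sym] / len(aligned) for sym in alphabet})
    return matrix


# Toy alignment of four sequences, three columns.
pfm = position_frequency_matrix(["ACG", "ACT", "AGG", "TCG"])
```

Each element of `pfm` is a dictionary for one alignment column, so `pfm[0]["A"]` is the frequency of A in the first position.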

Proteomics: Technically and conceptually similar to functional genomics, but with the aim of studying biological aspects of all proteins at once in a systematic manner. 

PSI-BLAST: Position-specific Iterated BLAST. An iterative search that uses the BLAST algorithm to provide fast searches, and builds a profile at every iteration. 

Regular expression: A text pattern that conforms to regular grammar and that is used for text pattern matching in the UNIX system, as well as for representing consensus sequence patterns in biology. 
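
As an illustration of a consensus pattern written as a regular expression, the sketch below uses Python's `re` module to scan a protein sequence for the N-glycosylation pattern N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline). The function name and the test sequence are made up for the example.

```python
import re

# N-{P}-[ST]-{P} translated to regular expression syntax.
N_GLYCOSYLATION = re.compile(r"N[^P][ST][^P]")


def find_motif(sequence, pattern=N_GLYCOSYLATION):
    """Return (start, matched_text) for each non-overlapping match."""
    return [(m.start(), m.group()) for m in pattern.finditer(sequence)]


# The second N is followed by P, so it is excluded by [^P].
hits = find_motif("MNGTAANPSA")
```

Note that `finditer` reports non-overlapping matches only; overlapping sites would need a lookahead-based pattern.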

Rooted tree: A phylogenetic tree in which the last common ancestor of all genes, or organisms, displayed on the tree is specified by the initial bifurcation of the tree. 

Score: A number used to assess the biological relevance of a finding. 

Sensitivity: One measure of the performance of a program, defined by Sn=TP/(TP+FN) where TP=true positive and FN=false negative. Sn thus measures the proportion of actual positives that are correctly identified as such. A sensitivity of 100% means that the test detects all positives (i.e. it produces no false negatives). See also "specificity". 

Sequence signal: A local functional site in genomic DNA, such as a splice site or a TATA box. 

Smith-Waterman algorithm: The standard dynamic programming algorithm for finding optimal local alignments. 
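
A minimal sketch of the local alignment recurrence follows. Compared with global alignment, every cell is floored at zero and the result is the best cell anywhere in the matrix. Only the score is computed, and the function name, scoring values and linear gap cost are illustrative assumptions.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between sequences a and b (linear gap cost)."""
    best = 0
    prev = [0] * (len(b) + 1)  # row 0: a local alignment may start anywhere
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(
                0,                  # restart: drop a negatively scoring prefix
                prev[j - 1] + sub,  # align a[i-1] with b[j-1]
                prev[j] + gap,      # gap in b
                curr[j - 1] + gap,  # gap in a
            )
            curr.append(cell)
            best = max(best, cell)  # the alignment may end at any cell
        prev = curr
    return best


# The two sequences share only the segment ACGT.
local_score = smith_waterman_score("TTTACGTAAA", "GGGACGTCCC")
```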

Specificity: One measure of the performance of a program, defined by Sp=TN/(TN+FP) where TN=true negative and FP=false positive. Sp thus measures the proportion of actual negatives that are correctly identified as such. A specificity of 100% means that the test correctly recognizes all negatives (i.e. it produces no false positives). See also "sensitivity". 
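
The two formulas Sn=TP/(TP+FN) and Sp=TN/(TN+FP) can be sketched directly. The function names and the example counts are hypothetical.

```python
def sensitivity(tp, fn):
    """Sn = TP / (TP + FN): fraction of actual positives that are detected."""
    if tp + fn == 0:
        raise ValueError("no actual positives: sensitivity is undefined")
    return tp / (tp + fn)


def specificity(tn, fp):
    """Sp = TN / (TN + FP): fraction of actual negatives that are rejected."""
    if tn + fp == 0:
        raise ValueError("no actual negatives: specificity is undefined")
    return tn / (tn + fp)


# Hypothetical prediction of regulated genes: 8 found, 2 missed,
# 90 correctly rejected, 10 wrongly predicted.
sn = sensitivity(tp=8, fn=2)
sp = specificity(tn=90, fp=10)
```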

Substitution matrix: The collection of all substitution scores (ex: PAM, BLOSUM). 

Substitution score: The score for aligning a particular pair of letters. 

SWISS-PROT: A curated protein sequence database that provides a high level of annotations, a minimal level of redundancy and a high level of integration with other databases. It is maintained collaboratively by the Department of Medical Biochemistry at the University of Geneva and the European Bioinformatics Institute (EBI). 

Synteny (literally "on the same ribbon"): Co-localization of genetic loci (genes) on the same chromosome within an individual, regardless of whether or not they are phylogenetically linked. Shared synteny (or homology blocks) describes preserved co-localization of genes on chromosomes of related species. Note that conserved synteny does not imply conserved gene order. 

Taxon (pl. taxa): A group of one or more organisms. The term is often applied to the organisms represented by the terminal branches of a tree. 

True positive (often denoted TP): something correctly predicted as positive. Ex: a gene which is correctly predicted to be regulated by a transcription factor X, as it is in reality. See also "false negative" and "false positive". 

Ungapped alignment: An alignment in which gaps are not permitted. 

UNIX: A computer operating system (an alternative to Windows or MacOS). 

Unrooted tree: A phylogenetic tree in which the last common ancestor of all genes, or organisms, on the tree is not specified. 

Z-score: The number of standard deviations by which a value differs from the mean.
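
As a small worked sketch, a Z-score can be computed from a value and a background sample. The function name and the numbers are made up, and the population standard deviation is used here as an assumption; a sample standard deviation would be an equally valid choice.

```python
import statistics


def z_score(value, background):
    """Number of standard deviations separating `value` from the background mean."""
    mean = statistics.mean(background)
    sd = statistics.pstdev(background)  # population standard deviation
    if sd == 0:
        raise ValueError("background has zero variance")
    return (value - mean) / sd


# Background with mean 5 and standard deviation 2.
z = z_score(9, [2, 4, 4, 4, 5, 5, 7, 9])
```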
