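Several concepts need to be clearly distinguished:

Humans have 23 pairs of chromosomes: 22 pairs of autosomes and one pair of sex chromosomes (XX in females, XY in males).

Naming chromosome bands: designating a specific band requires four items: (1) the chromosome number; (2) the arm symbol; (3) the region number; (4) the band number within that region.

1p22 denotes chromosome 1, short arm (p), region 2, band 2.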
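The four-part naming rule can be illustrated with a small parser. This is only a sketch: `parse_band` and `BAND_PATTERN` are hypothetical names introduced here, and the pattern assumes a plain chromosome-arm-region-band label (sub-bands such as "1p22.1" are deliberately not handled).

```python
import re

# Hypothetical pattern: chromosome (1-22, X or Y), arm (p or q), region digit, band digit.
BAND_PATTERN = re.compile(r"^(\d{1,2}|X|Y)([pq])(\d)(\d)$")


def parse_band(label):
    """Split a band label such as '1p22' into its four components."""
    match = BAND_PATTERN.match(label)
    if match is None:
        raise ValueError(f"not a chromosome-arm-region-band label: {label!r}")
    chromosome, arm, region, band = match.groups()
    return {
        "chromosome": chromosome,
        "arm": "short" if arm == "p" else "long",  # p = short arm, q = long arm
        "region": int(region),
        "band": int(band),
    }


print(parse_band("1p22"))
# {'chromosome': '1', 'arm': 'short', 'region': 2, 'band': 2}
```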

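"Allele" really describes membership in a set: the variant forms of a gene found at the same locus are alleles of one another. "Aa" should not itself be called an allele; the correct statement is that A and a are a pair of alleles. Homozygosity and heterozygosity are defined in terms of alleles.

In diploid and polyploid cells, when the homologous chromosomes carry the same allele at a given locus, the cell is called a homozygote (homozygous). When the same locus carries different alleles, it is called a heterozygote (heterozygous).

A summary statistic, as the name suggests, works like the summary function in R: it condenses the GWAS data and contains the most essential information from the results.

EBI also provides summary statistics from many GWAS studies for download: https://www.ebi.ac.uk/gwas/summary-statistics

Basic principles of GWAS

How to run a GWAS?

See the companion post: GWAS | genome-wide association analysis | Linkage disequilibrium (LD) | Manhattan_plot | QQ_plot | haplotype phasing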

Power

Effect size

Major allele

Minor allele

Metrics that we look at include minor allele frequency (MAF), missingness per genotype, missingness per individual, linkage disequilibrium (LD), variance inflation factor (VIF), and runs of homozygosity (ROH).

These provide a broad 'summary' of the data and allow us to appropriately set thresholds for quality control. It would be wrong, for example, to run a statistical test on a genotype with high missingness because the resulting P value would be misleading and could lead to erroneous conclusions from the data.

PLINK is usually the 'go to' program for analysing GWAS data, but there are other alternatives. It is also possible to read PLINK data into R and do your own analyses, but for now there are not many programs to do that.

Further information can be found here: http://zzz.bwh.harvard.edu/plink/summary.shtml

A tutorial on conducting genome‐wide association studies: Quality control and statistical analysis

Clumping: This is a procedure in which only the most significant SNP (i.e., lowest p value) in each LD block is identified and selected for further analyses. This reduces the correlation between the remaining SNPs, while retaining SNPs with the strongest statistical evidence.
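The clumping idea can be sketched as a greedy loop. This is a simplified illustration, not PLINK's implementation: the function `clump`, the toy p values, and the pairwise r² table are all made up for the example, and real clumping typically also applies a physical distance window and p value cut-offs.

```python
def clump(p_values, r2, r2_threshold=0.5):
    """Greedy clumping: keep the lowest-p SNP, drop SNPs in LD with it, repeat."""
    remaining = sorted(p_values, key=p_values.get)  # most significant first
    index_snps = []
    while remaining:
        index = remaining.pop(0)  # lowest remaining p value becomes an index SNP
        index_snps.append(index)
        # Discard every SNP that is in LD with the index SNP.
        remaining = [
            snp for snp in remaining
            if r2.get(frozenset((index, snp)), 0.0) < r2_threshold
        ]
    return index_snps


# Toy data (assumed): pairs missing from the table are treated as r2 = 0.
p_values = {"rs1": 1e-9, "rs2": 3e-8, "rs3": 2e-4, "rs4": 5e-6}
r2 = {
    frozenset(("rs1", "rs2")): 0.9,
    frozenset(("rs1", "rs3")): 0.1,
    frozenset(("rs3", "rs4")): 0.8,
}
clumped = clump(p_values, r2)
print(clumped)  # ['rs1', 'rs4']
```

Here rs2 is absorbed by rs1 and rs3 by rs4, so one SNP per correlated group survives.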

Co‐heritability: This is a measure of the genetic relationship between disorders. The SNP‐based co‐heritability is the proportion of covariance between disorder pairs (e.g., schizophrenia and bipolar disorder) that is explained by SNPs.

Gene: This is a sequence of nucleotides in the DNA that codes for a molecule (e.g., a protein).

Heterozygosity: This is the carrying of two different alleles of a specific SNP. The heterozygosity rate of an individual is the proportion of heterozygous genotypes. High levels of heterozygosity within an individual might be an indication of low sample quality whereas low levels of heterozygosity may be due to inbreeding.
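As a rough illustration, the heterozygosity rate of one individual could be computed as below. The 0/1/2 coding (count of one allele, so 1 means heterozygous) and the use of `None` for missing calls are assumptions of this sketch, and `heterozygosity_rate` is a hypothetical helper.

```python
def heterozygosity_rate(genotypes):
    """Proportion of heterozygous calls among an individual's non-missing genotypes."""
    called = [g for g in genotypes if g is not None]  # ignore missing calls
    if not called:
        raise ValueError("no called genotypes")
    return sum(1 for g in called if g == 1) / len(called)


# One individual's genotypes across eight SNPs (toy data).
sample = [0, 1, 2, 1, None, 0, 1, 0]
het_rate = heterozygosity_rate(sample)
print(round(het_rate, 3))  # 3 heterozygous calls out of 7 called genotypes
```

In a QC step, individuals whose rate lies far from the sample mean would be the ones flagged for a closer look.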

Individual‐level missingness: This is the number of SNPs that is missing for a specific individual. High levels of missingness can be an indication of poor DNA quality or technical problems.

Linkage disequilibrium (LD): This is a measure of non‐random association between alleles at different loci on the same chromosome in a given population. SNPs are in LD when the frequency of association of their alleles is higher than expected under random assortment. LD concerns patterns of correlations between SNPs.
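A small numerical sketch may make "non-random association" concrete. It computes the two common LD measures, D = p_AB - p_A * p_B and r² = D² / (p_A (1 - p_A) p_B (1 - p_B)), from phased haplotypes. The haplotype data and the function name `ld_stats` are invented for illustration, and both loci are assumed to be polymorphic (otherwise the denominator is zero).

```python
def ld_stats(haplotypes):
    """Return (D, r2) for two biallelic loci.

    haplotypes: list of (allele_at_locus_1, allele_at_locus_2), each coded 0 or 1.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n  # frequency of allele 1 at locus 1
    p_b = sum(b for _, b in haplotypes) / n  # frequency of allele 1 at locus 2
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b  # deviation from independence
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Toy sample of 10 haplotypes: alleles mostly travel together.
haps = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0)] + [(0, 1)]
d, r2 = ld_stats(haps)
print(round(d, 3), round(r2, 3))
```

Under random assortment D would be close to 0; here the "1-1" haplotype is more common than the product of the allele frequencies predicts.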

Minor allele frequency (MAF): This is the frequency of the least often occurring allele at a specific location. Most studies are underpowered to detect associations with SNPs with a low MAF and therefore exclude these SNPs.
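For a single SNP, the MAF can be sketched as follows, assuming genotypes are coded as the count (0, 1, or 2) of one chosen allele and `None` marks a missing call; `minor_allele_frequency` is a hypothetical helper name.

```python
def minor_allele_frequency(genotypes):
    """MAF of one SNP from 0/1/2 allele counts, ignoring missing calls."""
    called = [g for g in genotypes if g is not None]
    if not called:
        raise ValueError("no called genotypes")
    freq = sum(called) / (2 * len(called))  # each individual carries two alleles
    return min(freq, 1 - freq)  # the rarer of the two alleles


# Genotypes of eight individuals at one SNP (toy data).
snp = [0, 0, 1, 2, 2, 2, None, 1]
maf = minor_allele_frequency(snp)
print(round(maf, 3))
```

Taking the minimum means the result does not depend on which allele was chosen for counting.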

Population stratification: This is the presence of multiple subpopulations (e.g., individuals with different ethnic background) in a study. Because allele frequencies can differ between subpopulations, population stratification can lead to false positive associations and/or mask true associations. An excellent example of this is the chopstick gene, where a SNP, due to population stratification, accounted for nearly half of the variance in the capacity to eat with chopsticks (Hamer & Sirota, 2000).

Pruning: This is a method to select a subset of markers that are in approximate linkage equilibrium. In PLINK, this method uses the strength of LD between SNPs within a specific window (region) of the chromosome and selects only SNPs that are approximately uncorrelated, based on a user‐specified threshold of LD. In contrast to clumping, pruning does not take the p value of a SNP into account.
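To contrast with clumping, here is a minimal pruning sketch that walks the SNPs in genomic order and never looks at p values. It is a simplified greedy scheme, not PLINK's exact procedure; `prune`, the window measured in number of SNPs, and the toy r² table are assumptions made for illustration.

```python
def prune(snps, r2, r2_threshold=0.2, window=50):
    """Keep a SNP only if it is not in LD with an already-kept SNP inside the window.

    snps: SNP ids in genomic order; r2: dict keyed by frozenset pairs of ids.
    """
    kept = []  # list of (index_in_order, snp_id)
    for i, snp in enumerate(snps):
        in_ld = any(
            i - j <= window
            and r2.get(frozenset((snp, other)), 0.0) > r2_threshold
            for j, other in kept
        )
        if not in_ld:
            kept.append((i, snp))
    return [snp for _, snp in kept]


# Toy data (assumed): pairs missing from the table are treated as r2 = 0.
snps = ["rs1", "rs2", "rs3", "rs4"]
r2 = {
    frozenset(("rs1", "rs2")): 0.9,
    frozenset(("rs1", "rs3")): 0.1,
    frozenset(("rs3", "rs4")): 0.8,
}
pruned = prune(snps, r2)
print(pruned)  # ['rs1', 'rs3']
```

The retained SNPs depend only on order and LD, which is why pruning suits steps such as ancestry analysis, whereas clumping suits selecting top association signals.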

Relatedness: This indicates how strongly a pair of individuals is genetically related. A conventional GWAS assumes that all subjects are unrelated (i.e., no pair of individuals is more closely related than second‐degree relatives). Without appropriate correction, the inclusion of relatives could lead to biased estimations of standard errors of SNP effect sizes. Note that specific tools for analysing family data have been developed.

Sex discrepancy: This is the difference between the assigned sex and the sex determined based on the genotype. A discrepancy likely points to sample mix‐ups in the lab. Note, this test can only be conducted when SNPs on the sex chromosomes (X and Y) have been assessed.

Single nucleotide polymorphism (SNP): This is a variation in a single nucleotide (i.e., A, C, G, or T) that occurs at a specific position in the genome. A SNP usually exists as two different forms (e.g., A vs. T). These different forms are called alleles. A SNP with two alleles has three different genotypes (e.g., AA, AT, and TT).

SNP‐heritability: This is the fraction of phenotypic variance of a trait explained by all SNPs in the analysis.

SNP‐level missingness: This is the number of individuals in the sample for whom information on a specific SNP is missing. SNPs with a high level of missingness can potentially lead to bias.
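Both kinds of missingness can be read off the same genotype matrix, one along rows and one along columns. The sketch below reports them as proportions; the matrix layout (rows = individuals, columns = SNPs, `None` = missing) and the function name `missingness` are assumptions for illustration.

```python
def missingness(matrix):
    """Return (per_individual, per_snp) missing-call proportions."""
    n_ind = len(matrix)
    n_snp = len(matrix[0])
    # Row-wise: fraction of SNPs missing for each individual.
    per_individual = [sum(g is None for g in row) / n_snp for row in matrix]
    # Column-wise: fraction of individuals missing each SNP.
    per_snp = [
        sum(row[j] is None for row in matrix) / n_ind for j in range(n_snp)
    ]
    return per_individual, per_snp


# Three individuals genotyped at four SNPs (toy data).
genotypes = [
    [0, 1, None, 2],
    [1, None, None, 0],
    [2, 0, 1, 1],
]
per_individual, per_snp = missingness(genotypes)
print(per_individual)  # [0.25, 0.5, 0.0]
print(per_snp)
```

A QC filter would then drop SNPs and individuals whose proportion exceeds a chosen threshold.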

Summary statistics: These are the results obtained after conducting a GWAS, including information on chromosome number, position of the SNP, SNP(rs)‐identifier, MAF, effect size (odds ratio/beta), standard error, and p value. Summary statistics of GWAS are often freely accessible or shared between researchers.
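The fields in a summary-statistics row are linked: under the usual large-sample normal approximation, the test statistic is z = beta / SE and the two-sided p value follows from it. The sketch below assumes that approximation; the row, its field names, and `z_and_p` are hypothetical and not taken from any particular file format. For a case-control study, beta would be the natural log of the odds ratio.

```python
import math


def z_and_p(beta, se):
    """Wald z statistic and two-sided normal p value for an effect estimate."""
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # equals 2 * (1 - Phi(|z|))
    return z, p


# A made-up summary-statistics row (field names are illustrative only).
row = {"chr": 1, "pos": 123456, "snp": "rs0000", "maf": 0.23,
       "beta": 0.12, "se": 0.02}
z, p = z_and_p(row["beta"], row["se"])
print(f"z = {z:.2f}, p = {p:.2e}")
```

This kind of check is handy for spotting rows where the reported p value is inconsistent with the reported effect size and standard error.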

The Hardy–Weinberg (dis)equilibrium (HWE) law: This concerns the relation between the allele and genotype frequencies. It assumes an indefinitely large population, with no selection, mutation, or migration. The law states that the genotype and the allele frequencies are constant over generations. Under HWE, the expected genotype frequencies follow from the allele frequencies (e.g., if the frequency of allele A = 0.20 and the frequency of allele T = 0.80, the expected frequency of genotype AT is 2*0.2*0.8 = 0.32), and the observed frequencies should not differ significantly from these expectations. Violation of the HWE law indicates that the genotype frequencies do differ significantly from expectations. In GWAS, it is generally assumed that deviations from HWE are the result of genotyping errors. The HWE thresholds in cases are often less stringent than those in controls, as the violation of the HWE law in cases can be indicative of true genetic association with disease risk.
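The A = 0.20 / T = 0.80 example can be turned into a quick check. The sketch below uses a 1-degree-of-freedom chi-square goodness-of-fit test, which is one simple way to test HWE (exact tests are often preferred in practice). The observed counts and the name `hwe_chi_square` are made up for illustration.

```python
import math


def hwe_chi_square(n_aa, n_at, n_tt):
    """Chi-square HWE test (1 df) from observed genotype counts."""
    n = n_aa + n_at + n_tt
    p = (2 * n_aa + n_at) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele T
    expected = [n * p * p, 2 * n * p * q, n * q * q]
    if min(expected) == 0:
        raise ValueError("monomorphic SNP: HWE test undefined")
    observed = [n_aa, n_at, n_tt]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-square, 1 df
    return chi2, p_value


# 1000 individuals with allele frequency A = 0.2 but too few heterozygotes.
# Expected counts under HWE would be 40 AA, 320 AT, 640 TT.
chi2, p_value = hwe_chi_square(100, 200, 700)
print(f"chi2 = {chi2:.3f}, p = {p_value:.2e}")
```

With 200 observed heterozygotes against 320 expected, the test rejects HWE decisively, which in a GWAS QC step would usually flag the SNP as a likely genotyping problem.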


Meta-analysis

Generally, if a sample includes multiple ethnic groups (e.g., Africans, Asians, and Europeans), it is recommended to perform tests of association in each of the ethnic groups separately and to use appropriate methods, such as meta‐analysis (Willer, Li, & Abecasis, 2010), to combine the results.
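One common way to combine per-group results is fixed-effect inverse-variance weighting: each group's effect estimate is weighted by 1 / SE², so more precise estimates count for more. The sketch below assumes that scheme; the function name and the two input estimates are invented, and it is not meant to reproduce any specific tool.

```python
import math


def inverse_variance_meta(estimates):
    """Fixed-effect meta-analysis of (beta, se) pairs from separate groups."""
    weights = [1 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    se = math.sqrt(1 / total)  # pooled standard error
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p value
    return beta, se, p


# Effect of the same SNP estimated separately in two ancestry groups (toy data).
per_group = [(0.10, 0.05), (0.20, 0.05)]
meta_beta, meta_se, meta_p = inverse_variance_meta(per_group)
print(f"beta = {meta_beta:.3f}, se = {meta_se:.4f}, p = {meta_p:.2e}")
```

Because the two groups have equal standard errors here, the pooled effect is simply their average, and the pooled standard error is smaller than either input.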

Fast and efficient meta‐analysis of genomewide association scans
