Using Google Cloud Platform | Hail | GWAS | Distributed Regression | LASSO
References:
Hail - Tutorial (it can also be set up on Windows; see "Setting up a Spark environment on Windows")
spark-2.2.0-bin-hadoop2.7 - the parallel-processing platform Hail depends on
google cloud platform - the cloud platform
Broad's data cluster set-up tool
A thin wrapper around the Google Cloud SDK that makes it easier to operate.
cloudtools is a small collection of command line tools intended to make using Hail on clusters running in Google Cloud's Dataproc service simpler.
These tools are written in Python and mostly function as wrappers around the gcloud suite of command line tools included in the Google Cloud SDK.
Basic Google Cloud usage
Install gcloud
Log in: [GCloud] connect gcloud to the Google Cloud Platform under a new Google account
Run a Jupyter Notebook on Google Cloud Platform in just 15 minutes
Basic operations:
Create a project
Open the console and click the three-dot menu
Create and delete virtual machines (Dataproc clusters)
gcloud dataproc clusters create <name>
gcloud dataproc clusters delete <name>
Upload and delete files (with gsutil, which ships with the Cloud SDK)
gsutil cp localfile gs://bucket/path
gsutil rm gs://bucket/path
Read and write files inside a program
f1 = hc.read("gs://somewhere")
So far this only uses a single VM. To use many Google Cloud VMs in parallel, you need a distributed processing framework such as Spark; Hail is built on top of Spark.
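The idea of splitting work into blocks and processing them in parallel can be sketched locally with Python's thread pool (a stand-in for what Spark does across many VMs; the data and the per-block function are made up):

```python
from concurrent.futures import ThreadPoolExecutor

def analyze_block(block):
    # placeholder per-block computation (e.g., a per-block statistic)
    return sum(block)

# hypothetical data split into blocks; Spark applies the same pattern,
# but the blocks live on different worker machines
blocks = [[1, 2], [3, 4], [5, 6]]
with ThreadPoolExecutor() as pool:
    results = list(pool.map(analyze_block, blocks))
print(results)  # [3, 7, 11]
```

The point is only the map-over-partitions shape: each block is processed independently, so adding workers scales the computation.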
Basic Hail usage
This snippet starts a cluster named "testcluster" with 1 master machine and 8 worker machines. After the cluster is started (this can take a few minutes), a Hail script can be submitted to "testcluster".
Spark basics
1. Run the wrapper locally to create Google Cloud virtual machines
cluster start testcluster \
--master-machine-type n1-highmem-8 \
--worker-machine-type n1-standard-8 \
--num-workers 8 \
--version devel \
--spark 2.2.0 \
--zone asia-east1-a
2. Start the notebook
cluster connect testcluster notebook
3. Submit a local script to Google Cloud
cluster submit testcluster myhailscript.py
4. SSH into Google Cloud and install required software
gcloud compute ssh testcluster-m --zone asia-east1-a
5. Install sklearn
sudo su # to be root and install packages
/opt/conda/bin/conda install scikit-learn
/opt/conda/bin/pip install findspark
Paper case study
If you can understand 80% of this paper, you have the basics of genetics and statistics down; it is very hands-on.
Depression is more frequently observed among individuals exposed to traumatic events. The relationship between trauma exposure and depression, including the role of genetic variation, is complex and poorly understood. The UK Biobank concurrently assessed depression and reported trauma exposure in 126,522 genotyped individuals of European ancestry. We compared the shared aetiology of depression and a range of phenotypes, contrasting individuals reporting trauma exposure with those who did not (final sample size range: 24,094-92,957). Depression was heritable in participants reporting trauma exposure and in unexposed individuals, and the genetic correlation between the groups was substantial and not significantly different from 1. Genetic correlations between depression and psychiatric traits were strong regardless of reported trauma exposure, whereas genetic correlations between depression and body mass index (and related phenotypes) were observed only in trauma exposed individuals. The narrower range of genetic correlations in trauma unexposed depression and the lack of correlation with BMI echoes earlier ideas of endogenous depression.
Major depressive disorder (MDD) is a common illness accompanied by considerable morbidity, mortality, costs, and heightened risk of suicide. We conducted a genome-wide association meta-analysis based in 135,458 cases and 344,901 controls and identified 44 independent and significant loci. The genetic findings were associated with clinical features of major depression and implicated brain regions exhibiting anatomical differences in cases. Targets of antidepressant medications and genes involved in gene splicing were enriched for smaller association signal. We found important relationships of genetic risk for major depression with educational attainment, body mass, and schizophrenia: lower educational attainment and higher body mass were putatively causal, whereas major depression and schizophrenia reflected a partly shared biological etiology. All humans carry lesser or greater numbers of genetic risk factors for major depression. These findings help refine the basis of major depression and imply that a continuous measure of risk underlies the clinical phenotype.
A few questions
What is Hail for?
Example: gnomAD
The Neale Lab at the Broad Institute used Hail to perform QC and genome-wide association analysis of 2419 phenotypes across 10 million variants and 337,000 samples from the UK Biobank in 24 hours. paper
Hail’s functionality is exposed through Python and backed by distributed algorithms built on top of Apache Spark to efficiently analyze gigabyte-scale data on a laptop or terabyte-scale data on a cluster.
- a library for analyzing structured tabular and matrix data
- a collection of primitives for operating on data in parallel
- a suite of functionality for processing genetic data
- not an acronym
# conda env create -n hail -f $HAIL_HOME/python/hail/environment.yml
source activate hail
cd $HAIL_HOME/tutorials
jhail
Running a GWAS
1kg_annotations.txt
Sample Population SuperPopulation isFemale PurpleHair CaffeineConsumption
HG00096 GBR EUR False False 77.0
HG00097 GBR EUR True True 67.0
HG00098 GBR EUR False False 83.0
HG00099 GBR EUR True False 64.0
HG00100 GBR EUR True False 59.0
HG00101 GBR EUR False True 77.0
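As a quick local check (plain Python here; in the actual tutorial this file is loaded with Hail's hl.import_table), the sample rows above can be parsed and summarized like this:

```python
# the six sample rows shown above, whitespace-delimited
data = """Sample Population SuperPopulation isFemale PurpleHair CaffeineConsumption
HG00096 GBR EUR False False 77.0
HG00097 GBR EUR True True 67.0
HG00098 GBR EUR False False 83.0
HG00099 GBR EUR True False 64.0
HG00100 GBR EUR True False 59.0
HG00101 GBR EUR False True 77.0"""

rows = [line.split() for line in data.splitlines()]
header, records = rows[0], rows[1:]
col = header.index("CaffeineConsumption")
caffeine = [float(r[col]) for r in records]
print(round(sum(caffeine) / len(caffeine), 2))  # 71.17, mean caffeine consumption
```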
The 1kg.mt directory
.
├── _SUCCESS
├── cols
│ ├── _SUCCESS
│ ├── metadata.json.gz
│ └── rows
│ ├── metadata.json.gz
│ └── parts
│ └── part-0
├── entries
│ ├── _SUCCESS
│ ├── metadata.json.gz
│ └── rows
│ ├── metadata.json.gz
│ └── parts
│ ├── part-00-2-0-0-6886f608-afb6-1e68-684b-3c5920e7edd5
│ ├── part-01-2-1-0-3d30160f-dba0-16f4-e898-4e7c30148855
│ ├── part-02-2-2-0-1051da4b-6799-6074-7d32-9bd7fa9ed9af
├── globals
│ ├── _SUCCESS
│ ├── globals
│ │ ├── metadata.json.gz
│ │ └── parts
│ │ └── part-0
│ ├── metadata.json.gz
│ └── rows
│ ├── metadata.json.gz
│ └── parts
│ └── part-0
├── metadata.json.gz
├── references
└── rows
├── _SUCCESS
├── metadata.json.gz
└── rows
├── metadata.json.gz
└── parts
├── part-00-2-0-0-6886f608-afb6-1e68-684b-3c5920e7edd5
├── part-01-2-1-0-3d30160f-dba0-16f4-e898-4e7c30148855
├── part-02-2-2-0-1051da4b-6799-6074-7d32-9bd7fa9ed9af
Question: run a Jupyter Notebook on Google Cloud Platform in just 15 minutes
How GWAS works
Example: distributed regression - LASSO / elastic net regression on blocks of variants
How to Turn Python Functions into PySpark Functions (UDF)
What is an LD block? And how to choose between 30X whole-genome and 150X whole-exome sequencing?
GWAS does not strictly require the causal site itself to be in the set of SNPs tested, but the set must include SNPs near it. The genome is genetically organized into haplotype blocks (a few kb to several hundred kb); within a block, SNPs in high LD with one another are effectively interchangeable for GWAS testing. This is why, in well-executed GWAS papers, the SNPs neighbouring a causal site usually also come out genome-wide significant, which serves as one sanity check on the results. Detecting a non-synonymous SNP merely provides a shortcut for interpretation. But if only exonic SNPs are used as the test set, it is unlikely they will cover all haplotype blocks.
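The "interchangeable within a block" idea can be made concrete with a toy r^2 calculation between genotype vectors (per-sample alternate-allele counts, 0/1/2; the data are made up):

```python
def pearson_r2(x, y):
    """Squared Pearson correlation between two genotype vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov * cov / (vx * vy)

causal  = [0, 1, 2, 0, 1, 2]   # genotypes at the causal site
tag     = [0, 1, 2, 0, 1, 2]   # nearby SNP in the same haplotype block
distant = [0, 1, 0, 1, 0, 1]   # unlinked SNP
print(pearson_r2(causal, tag))      # 1.0 (perfect LD: tag stands in for causal)
print(pearson_r2(causal, distant))  # 0.0 (no LD: carries no signal)
```

A tag SNP with r^2 = 1 will show the same association signal as the causal site, which is exactly why neighbours of a true hit also reach significance.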
To use a user-defined function (UDF), the base packages must first be installed on every worker. Since we use sklearn, the following packages have to be installed when the cluster is built:
#!/bin/sh
/opt/conda/bin/pip install scikit-learn
/opt/conda/bin/pip install numpy
/opt/conda/bin/pip install scipy
cluster start testcluster \
--master-machine-type n1-highmem-8 \
--worker-machine-type n1-standard-8 \
--num-workers 2 \
--num-preemptible-workers 2 \
--version devel \
--spark 2.2.0 \
--zone asia-east1-a \
--pkgs scikit-learn \
--init gs://ukb_testdata/additional_init.sh
This is a bit like Docker.
import hail as hl
mt = hl.balding_nichols_model(3, 100, 100)
gts_as_rows = mt.annotate_rows(
    mean = hl.agg.mean(hl.float(mt.GT.n_alt_alleles())),
    genotypes = hl.agg.collect(hl.float(mt.GT.n_alt_alleles()))
).rows()

groups = gts_as_rows.group_by(
    ld_block = gts_as_rows.locus.position // 10
).aggregate(
    genotypes = hl.agg.collect(gts_as_rows.genotypes),
    ys = hl.agg.collect(gts_as_rows.mean)
)

df = groups.to_spark()

from pyspark.sql.functions import udf

def get_intercept(X, y):
    from sklearn import linear_model
    clf = linear_model.Lasso(alpha=0.1)
    clf.fit(X, y)
    return float(clf.intercept_)

get_intercept_udf = udf(get_intercept)

df.select(get_intercept_udf("genotypes", "ys").alias("intercept")).show()
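To see what each worker computes per LD block without a cluster, here is a toy pure-Python coordinate-descent LASSO (a sketch of the optimization behind sklearn's linear_model.Lasso used above; the data and alpha values are made up):

```python
def lasso_cd(X, y, alpha=0.1, iters=200):
    """Toy coordinate-descent LASSO for (1/2n)*RSS + alpha*||b||_1."""
    n, p = len(y), len(X[0])
    b, b0 = [0.0] * p, sum(y) / n
    for _ in range(iters):
        # re-fit the intercept to the current residuals
        b0 = sum(y[i] - sum(X[i][j] * b[j] for j in range(p))
                 for i in range(n)) / n
        for j in range(p):
            # correlation of feature j with the partial residual
            rho = sum(X[i][j] * (y[i] - b0 - sum(X[i][k] * b[k]
                      for k in range(p) if k != j)) for i in range(n)) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            # soft-thresholding: weakly correlated features shrink to exactly 0
            if rho > alpha:
                b[j] = (rho - alpha) / z
            elif rho < -alpha:
                b[j] = (rho + alpha) / z
            else:
                b[j] = 0.0
    return b0, b

# y is roughly 1 + 2x, so a small penalty leaves the slope near 2;
# a huge penalty zeroes the slope and the intercept falls back to mean(y)
b0, b = lasso_cd([[0.0], [1.0], [2.0], [3.0]], [1.0, 3.0, 5.0, 7.0], alpha=0.01)
print(b0, b)
```

The soft-thresholding step is what makes LASSO select variants within a block: coefficients of SNPs that add little beyond their neighbours are driven exactly to zero.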
To be continued~