Horovod Install
Horovod documentation
安装
【Step1】安装Open MPI
注意: Open MPI 3.1.3 安装有些问题, 可以安装 Open MPI 3.1.2 或者 Open MPI 4.0.0.
【Step2】安装 TensorFlow
- pip install tensorflow 确保 g++-4.8.5 或者 g++-4.9
- 也可以用conda 安装
【Step3】安装 horovod
cpu
pip install horovod
GPUs with NCCL:
$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install horovod
Docker 文档:
https://horovod.readthedocs.io/en/stable/docker.html
https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu
CPU-Dockerfile
FROM ubuntu:18.04
ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV MXNET_VERSION=1.6.0
# Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python}
# Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"]
RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers
RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
# Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow==${TENSORFLOW_VERSION} \
keras \
h5py
RUN pip install torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION}
RUN pip install mxnet==${MXNET_VERSION}
# Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi
# Install Horovod
RUN HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod
# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd
# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
# Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn
WORKDIR "/examples"
GPU-Dockerfile
FROM nvidia/cuda:10.1-devel-ubuntu18.04
# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.4.8-1+cuda10.1
ENV MXNET_VERSION=1.6.0
# Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python}
# Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"]
RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libcudnn7=${CUDNN_VERSION} \
libnccl2=${NCCL_VERSION} \
libnccl-dev=${NCCL_VERSION} \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers
RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python
RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py
# Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow-gpu==${TENSORFLOW_VERSION} \
keras \
h5py
RUN pip install https://download.pytorch.org/whl/cu101/torch-${PYTORCH_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl \
https://download.pytorch.org/whl/cu101/torchvision-${TORCHVISION_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl
RUN pip install mxnet-cu101==${MXNET_VERSION}
# Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi
# Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod && \
ldconfig
# Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd
# Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config
# Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn
WORKDIR "/examples"
Horovod Install的更多相关文章
- 机器学习分布式框架horovod安装 (Linux环境)
1.openmi 下载安装 下载连接: https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz 安装命令 1 ...
- 安装 openmpi 4.0 用于 horovod 编译
最近编译 horovod框架过程中,需要使用openmpi 4.0但是环境中的openmpi版本比较低,所以在手动安装openmpi4.0 用于编译,下面对过程进行简要记录,进行备忘: curl -O ...
- Horovod 分布式深度学习框架相关
最近需要 Horovod 相关的知识,在这里记录一下,进行备忘: 分布式训练,分为数据并行和模型并行两种: 模型并行:分布式系统中的不同GPU负责网络模型的不同部分.神经网络模型的不同网络层被分配到不 ...
- [源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator
[源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator 目录 [源码解析] 深度学习分布式训练框架 horovod (19) --- kub ...
- OEL上使用yum install oracle-validated 简化主机配置工作
环境:OEL 5.7 + Oracle 10.2.0.5 RAC 如果你正在用OEL(Oracle Enterprise Linux)系统部署Oracle,那么可以使用yum安装oracle-vali ...
- org.jboss.deployment.DeploymentException: Trying to install an already registered mbean: jboss.jca:service=LocalTxCM,name=egmasDS
17:34:37,235 INFO [Http11Protocol] Starting Coyote HTTP/1.1 on http-0.0.0.0-8080 17:34:37,281 INFO [ ...
- 如何使用yum 下载 一个 package ?如何使用 yum install package 但是保留 rpm 格式的 package ? 或者又 如何通过yum 中已经安装的package 导出它,即yum导出rpm?
注意 RHEL5 和 RHEL6 的不同 How to use yum to download a package without installing it Solution Verified - ...
- Install and Configure SharePoint 2013 Workflow
这篇文章主要briefly introduce the Install and configure SharePoint 2013 Workflow. Microsoft 推出了新的Workflow ...
- Basic Tutorials of Redis(1) - Install And Configure Redis
Nowaday, Redis became more and more popular , many projects use it in the cache module and the store ...
随机推荐
- SSL/TLS协议详解(上):密码套件,哈希,加密,密钥交换算法
本文转载自SSL/TLS协议详解(上):密码套件,哈希,加密,密钥交换算法 导语 作为一名安全爱好者,我一向很喜欢SSL(目前是TLS)的运作原理.理解这个复杂协议的基本原理花了我好几天的时间,但只要 ...
- restful风格的理解
简而言之,就是不同的命令响应不同的操作: 关注点在url中的不同参数,是因为不同的参数才使得不同的method去对应的不同的操作.
- linux系统忘记root的登录密码
参考链接:https://www.jb51.net/article/146541.htm 亲测有效 使用场景 linux管理员忘记root密码,需要进行找回操作. 注意事项:本文基于centos7环 ...
- SpringBoot 整合 Shiro 密码登录与邮件验证码登录(多 Realm 认证)
导入依赖(pom.xml) <!--整合Shiro安全框架--> <dependency> <groupId>org.apache.shiro</group ...
- 第34天学习打卡(GUI编程之组件和容器 frame panel 布局管理 事件监听 多个按钮共享一个事件 )
GUI编程 组件 窗口 弹窗 面板 文本框 列表框 按钮 图片 监听事件 鼠标 键盘事件 破解工具 1 简介 GUi的核心技术:Swing AWT 1.界面不美观 2.需要jre环境 为什么要学习GU ...
- const成员函数可以将非const指针作为返回值吗?
先给出一段代码 class A { int *x; public: int *f() const { return x; } }; 成员函数f返回指向私有成员 x 的非常量指针,我认为这会修改成员x ...
- 剑指 Offer 57 - II. 和为s的连续正数序列 + 双指针 + 数论
剑指 Offer 57 - II. 和为s的连续正数序列 Offer_57_2 题目描述 方法一:暴力枚举 package com.walegarrett.offer; /** * @Author W ...
- PAT-1147(Heaps)最大堆和最小堆的判断+构建树
Heaps PAT-1147 #include<iostream> #include<cstring> #include<string> #include<a ...
- ORM思想解析
ORM思想解析 qq_16055765 2019-01-10 11:29:08 1688 收藏 1 分类专栏: # hibernate 最后发布:2019-01-10 11:29:08首次发布:201 ...
- 读 Kafka 源码写优雅业务代码:配置类
这个 Kafka 的专题,我会从系统整体架构,设计到代码落地.和大家一起杠源码,学技巧,涨知识.希望大家持续关注一起见证成长! 我相信:技术的道路,十年如一日!十年磨一剑! 往期文章 Kafka 探险 ...