Horovod documentation

安装

【Step1】安装Open MPI

注意: Open MPI 3.1.3 安装有些问题, 可以安装 Open MPI 3.1.2 或者 Open MPI 4.0.0.

【Step2】安装 TensorFlow

  • pip install tensorflow 确保 g++-4.8.5 或者 g++-4.9
  • 也可以用conda 安装

【Step3】安装 horovod

cpu

pip install horovod

GPUs with NCCL:

$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install horovod

Docker 文档:

https://horovod.readthedocs.io/en/stable/docker.html

https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu

CPU-Dockerfile

FROM ubuntu:18.04

ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV MXNET_VERSION=1.6.0 # Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python} # Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"] RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py # Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow==${TENSORFLOW_VERSION} \
keras \
h5py
RUN pip install torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION}
RUN pip install mxnet==${MXNET_VERSION} # Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi # Install Horovod
RUN HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod # Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd # Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config # Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn WORKDIR "/examples"

GPU-Dockerfile

FROM nvidia/cuda:10.1-devel-ubuntu18.04

# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.4.8-1+cuda10.1
ENV MXNET_VERSION=1.6.0 # Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python} # Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"] RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libcudnn7=${CUDNN_VERSION} \
libnccl2=${NCCL_VERSION} \
libnccl-dev=${NCCL_VERSION} \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py # Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow-gpu==${TENSORFLOW_VERSION} \
keras \
h5py RUN pip install https://download.pytorch.org/whl/cu101/torch-${PYTORCH_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl \
https://download.pytorch.org/whl/cu101/torchvision-${TORCHVISION_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl
RUN pip install mxnet-cu101==${MXNET_VERSION} # Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi # Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod && \
ldconfig # Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd # Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config # Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn WORKDIR "/examples"

Horovod Install的更多相关文章

  1. 机器学习分布式框架horovod安装 (Linux环境)

    1.openmi 下载安装 下载连接: https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz 安装命令 1 ...

  2. 安装 openmpi 4.0 用于 horovod 编译

    最近编译 horovod框架过程中,需要使用openmpi 4.0但是环境中的openmpi版本比较低,所以在手动安装openmpi4.0 用于编译,下面对过程进行简要记录,进行备忘: curl -O ...

  3. Horovod 分布式深度学习框架相关

    最近需要 Horovod 相关的知识,在这里记录一下,进行备忘: 分布式训练,分为数据并行和模型并行两种: 模型并行:分布式系统中的不同GPU负责网络模型的不同部分.神经网络模型的不同网络层被分配到不 ...

  4. [源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator

    [源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator 目录 [源码解析] 深度学习分布式训练框架 horovod (19) --- kub ...

  5. OEL上使用yum install oracle-validated 简化主机配置工作

    环境:OEL 5.7 + Oracle 10.2.0.5 RAC 如果你正在用OEL(Oracle Enterprise Linux)系统部署Oracle,那么可以使用yum安装oracle-vali ...

  6. org.jboss.deployment.DeploymentException: Trying to install an already registered mbean: jboss.jca:service=LocalTxCM,name=egmasDS

    17:34:37,235 INFO [Http11Protocol] Starting Coyote HTTP/1.1 on http-0.0.0.0-8080 17:34:37,281 INFO [ ...

  7. 如何使用yum 下载 一个 package ?如何使用 yum install package 但是保留 rpm 格式的 package ? 或者又 如何通过yum 中已经安装的package 导出它,即yum导出rpm?

    注意 RHEL5 和 RHEL6 的不同 How to use yum to download a package without installing it Solution Verified - ...

  8. Install and Configure SharePoint 2013 Workflow

    这篇文章主要briefly introduce the Install and configure SharePoint 2013 Workflow. Microsoft 推出了新的Workflow ...

  9. Basic Tutorials of Redis(1) - Install And Configure Redis

    Nowaday, Redis became more and more popular , many projects use it in the cache module and the store ...

随机推荐

  1. SSL/TLS协议详解(上):密码套件,哈希,加密,密钥交换算法

    本文转载自SSL/TLS协议详解(上):密码套件,哈希,加密,密钥交换算法 导语 作为一名安全爱好者,我一向很喜欢SSL(目前是TLS)的运作原理.理解这个复杂协议的基本原理花了我好几天的时间,但只要 ...

  2. restful风格的理解

    简而言之,就是不同的命令响应不同的操作: 关注点在url中的不同参数,是因为不同的参数才使得不同的method去对应的不同的操作.

  3. linux系统忘记root的登录密码

    参考链接:https://www.jb51.net/article/146541.htm  亲测有效 使用场景 linux管理员忘记root密码,需要进行找回操作. 注意事项:本文基于centos7环 ...

  4. SpringBoot 整合 Shiro 密码登录与邮件验证码登录(多 Realm 认证)

    导入依赖(pom.xml)  <!--整合Shiro安全框架--> <dependency> <groupId>org.apache.shiro</group ...

  5. 第34天学习打卡(GUI编程之组件和容器 frame panel 布局管理 事件监听 多个按钮共享一个事件 )

    GUI编程 组件 窗口 弹窗 面板 文本框 列表框 按钮 图片 监听事件 鼠标 键盘事件 破解工具 1 简介 GUi的核心技术:Swing AWT 1.界面不美观 2.需要jre环境 为什么要学习GU ...

  6. const成员函数可以将非const指针作为返回值吗?

    先给出一段代码 class A { int *x; public: int *f() const { return x; } }; 成员函数f返回指向私有成员 x 的非常量指针,我认为这会修改成员x ...

  7. 剑指 Offer 57 - II. 和为s的连续正数序列 + 双指针 + 数论

    剑指 Offer 57 - II. 和为s的连续正数序列 Offer_57_2 题目描述 方法一:暴力枚举 package com.walegarrett.offer; /** * @Author W ...

  8. PAT-1147(Heaps)最大堆和最小堆的判断+构建树

    Heaps PAT-1147 #include<iostream> #include<cstring> #include<string> #include<a ...

  9. ORM思想解析

    ORM思想解析 qq_16055765 2019-01-10 11:29:08 1688 收藏 1 分类专栏: # hibernate 最后发布:2019-01-10 11:29:08首次发布:201 ...

  10. 读 Kafka 源码写优雅业务代码:配置类

    这个 Kafka 的专题,我会从系统整体架构,设计到代码落地.和大家一起杠源码,学技巧,涨知识.希望大家持续关注一起见证成长! 我相信:技术的道路,十年如一日!十年磨一剑! 往期文章 Kafka 探险 ...