Horovod documentation

安装

【Step1】安装Open MPI

注意: Open MPI 3.1.3 安装有些问题, 可以安装 Open MPI 3.1.2 或者 Open MPI 4.0.0.

【Step2】安装 TensorFlow

  • pip install tensorflow 确保 g++-4.8.5 或者 g++-4.9
  • 也可以用conda 安装

【Step3】安装 horovod

cpu

pip install horovod

GPUs with NCCL:

$ HOROVOD_GPU_ALLREDUCE=NCCL HOROVOD_GPU_BROADCAST=NCCL pip install horovod

Docker 文档:

https://horovod.readthedocs.io/en/stable/docker.html

https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.cpu
https://raw.githubusercontent.com/horovod/horovod/master/Dockerfile.gpu

CPU-Dockerfile

FROM ubuntu:18.04

ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV MXNET_VERSION=1.6.0 # Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python} # Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"] RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py # Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow==${TENSORFLOW_VERSION} \
keras \
h5py
RUN pip install torch==${PYTORCH_VERSION} torchvision==${TORCHVISION_VERSION}
RUN pip install mxnet==${MXNET_VERSION} # Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi # Install Horovod
RUN HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod # Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd # Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config # Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn WORKDIR "/examples"

GPU-Dockerfile

FROM nvidia/cuda:10.1-devel-ubuntu18.04

# TensorFlow version is tightly coupled to CUDA and cuDNN so it should be selected carefully
ENV TENSORFLOW_VERSION=2.1.0
ENV PYTORCH_VERSION=1.4.0
ENV TORCHVISION_VERSION=0.5.0
ENV CUDNN_VERSION=7.6.5.32-1+cuda10.1
ENV NCCL_VERSION=2.4.8-1+cuda10.1
ENV MXNET_VERSION=1.6.0 # Python 3.6 is supported by Ubuntu Bionic out of the box
ARG python=3.6
ENV PYTHON_VERSION=${python} # Set default shell to /bin/bash
SHELL ["/bin/bash", "-cu"] RUN apt-get update && apt-get install -y --allow-downgrades --allow-change-held-packages --no-install-recommends \
build-essential \
cmake \
g++-4.8 \
git \
curl \
vim \
wget \
ca-certificates \
libcudnn7=${CUDNN_VERSION} \
libnccl2=${NCCL_VERSION} \
libnccl-dev=${NCCL_VERSION} \
libjpeg-dev \
libpng-dev \
python${PYTHON_VERSION} \
python${PYTHON_VERSION}-dev \
python${PYTHON_VERSION}-distutils \
librdmacm1 \
libibverbs1 \
ibverbs-providers RUN ln -s /usr/bin/python${PYTHON_VERSION} /usr/bin/python RUN curl -O https://bootstrap.pypa.io/get-pip.py && \
python get-pip.py && \
rm get-pip.py # Install TensorFlow, Keras, PyTorch and MXNet
RUN pip install future typing
RUN pip install numpy \
tensorflow-gpu==${TENSORFLOW_VERSION} \
keras \
h5py RUN pip install https://download.pytorch.org/whl/cu101/torch-${PYTORCH_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl \
https://download.pytorch.org/whl/cu101/torchvision-${TORCHVISION_VERSION}-$(python -c "import wheel.pep425tags as w; print('-'.join(w.get_supported(None)[0][:-1]))")-linux_x86_64.whl
RUN pip install mxnet-cu101==${MXNET_VERSION} # Install Open MPI
RUN mkdir /tmp/openmpi && \
cd /tmp/openmpi && \
wget https://www.open-mpi.org/software/ompi/v4.0/downloads/openmpi-4.0.0.tar.gz && \
tar zxf openmpi-4.0.0.tar.gz && \
cd openmpi-4.0.0 && \
./configure --enable-orterun-prefix-by-default && \
make -j $(nproc) all && \
make install && \
ldconfig && \
rm -rf /tmp/openmpi # Install Horovod, temporarily using CUDA stubs
RUN ldconfig /usr/local/cuda/targets/x86_64-linux/lib/stubs && \
HOROVOD_GPU_OPERATIONS=NCCL HOROVOD_WITH_TENSORFLOW=1 HOROVOD_WITH_PYTORCH=1 HOROVOD_WITH_MXNET=1 \
pip install --no-cache-dir horovod && \
ldconfig # Install OpenSSH for MPI to communicate between containers
RUN apt-get install -y --no-install-recommends openssh-client openssh-server && \
mkdir -p /var/run/sshd # Allow OpenSSH to talk to containers without asking for confirmation
RUN cat /etc/ssh/ssh_config | grep -v StrictHostKeyChecking > /etc/ssh/ssh_config.new && \
echo " StrictHostKeyChecking no" >> /etc/ssh/ssh_config.new && \
mv /etc/ssh/ssh_config.new /etc/ssh/ssh_config # Download examples
RUN apt-get install -y --no-install-recommends subversion && \
svn checkout https://github.com/horovod/horovod/trunk/examples && \
rm -rf /examples/.svn WORKDIR "/examples"

Horovod Install的更多相关文章

  1. 机器学习分布式框架horovod安装 (Linux环境)

    1.openmi 下载安装 下载连接: https://download.open-mpi.org/release/open-mpi/v4.0/openmpi-4.0.1.tar.gz 安装命令 1 ...

  2. 安装 openmpi 4.0 用于 horovod 编译

    最近编译 horovod框架过程中,需要使用openmpi 4.0但是环境中的openmpi版本比较低,所以在手动安装openmpi4.0 用于编译,下面对过程进行简要记录,进行备忘: curl -O ...

  3. Horovod 分布式深度学习框架相关

    最近需要 Horovod 相关的知识,在这里记录一下,进行备忘: 分布式训练,分为数据并行和模型并行两种: 模型并行:分布式系统中的不同GPU负责网络模型的不同部分.神经网络模型的不同网络层被分配到不 ...

  4. [源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator

    [源码解析] 深度学习分布式训练框架 horovod (19) --- kubeflow MPI-operator 目录 [源码解析] 深度学习分布式训练框架 horovod (19) --- kub ...

  5. OEL上使用yum install oracle-validated 简化主机配置工作

    环境:OEL 5.7 + Oracle 10.2.0.5 RAC 如果你正在用OEL(Oracle Enterprise Linux)系统部署Oracle,那么可以使用yum安装oracle-vali ...

  6. org.jboss.deployment.DeploymentException: Trying to install an already registered mbean: jboss.jca:service=LocalTxCM,name=egmasDS

    17:34:37,235 INFO [Http11Protocol] Starting Coyote HTTP/1.1 on http-0.0.0.0-8080 17:34:37,281 INFO [ ...

  7. 如何使用yum 下载 一个 package ?如何使用 yum install package 但是保留 rpm 格式的 package ? 或者又 如何通过yum 中已经安装的package 导出它,即yum导出rpm?

    注意 RHEL5 和 RHEL6 的不同 How to use yum to download a package without installing it Solution Verified - ...

  8. Install and Configure SharePoint 2013 Workflow

    这篇文章主要briefly introduce the Install and configure SharePoint 2013 Workflow. Microsoft 推出了新的Workflow ...

  9. Basic Tutorials of Redis(1) - Install And Configure Redis

    Nowaday, Redis became more and more popular , many projects use it in the cache module and the store ...

随机推荐

  1. Gradle 差异化构建

    Compile 默认的依赖方式,任何情况下都会依赖. Provided 只提供编译时依赖,打包时不会添加进去. Apk 只在打包Apk包时依赖,这个应该是比较少用到的. TestCompile 只在测 ...

  2. QT现场同步

    // 1线程同步 QFutureSynchronizer<void> synchronizer; //2线程1 synchronizer.addFuture(QtConcurrent::r ...

  3. 后端程序员之路 51、A Tour of Go-1

    # A Tour of Go    - go get golang.org/x/tour/gotour    - https://tour.golang.org/    # welcome    - ...

  4. HDOJ-6645(简单题+贪心+树)

    Stay Real HDOJ-6645 由小根堆的性质可以知道,当前最大的值就在叶节点上面,所以只需要排序后依次取就可以了. #include<iostream> #include< ...

  5. 记录安装freeswitch的日常

    已知安装版本:Linux:Centos7 Freeswitch:1.10.2 解: 注意:(最好呢是先下载好包,然后上传到这个所用的环境中) 1.安装对应依赖 yum install -y git a ...

  6. 解决:layUI数据表格+简单查询

    解决:layUI数据表格+简单查询 最近在用layui写项目,在做到用户查询时,发现在layui框架里只有数据表格,不能增加查询.于是自己摸索了一下,写个笔记记录一下. 我想要的效果: 1.定义查询栏 ...

  7. 数据采集组件:Flume基础用法和Kafka集成

    本文源码:GitHub || GitEE 一.Flume简介 1.基础描述 Flume是Cloudera提供的一个高可用的,高可靠的,分布式的海量日志采集.聚合和传输的系统,Flume支持在日志系统中 ...

  8. three.js cannon.js物理引擎之齿轮动画

    郭先生今天继续说一说cannon.js物理引擎,并用之前已经学习过的知识实现一个小动画,知识点包括ConvexPolyhedron多边形.Shape几何体.Body刚体.HingeConstraint ...

  9. 翻译:《实用的Python编程》03_06_Design_discussion

    目录 | 上一节 (3.5 主模块) | 下一节 (4 类) 3.6 设计讨论 本节,我们重新考虑之前所做的设计决策. 文件名与可迭代对象 考虑以下两个返回相同输出的程序. # Provide a f ...

  10. layui数据表格-通过点击按钮使数据表格中的字段值增加

    通过点击右侧相对应的操作按钮,对迟到.休假次数实现自增效果 jsp页面代码 //监听行工具事件 table.on('tool(test)', function(obj){ var data = obj ...