Detailed tutorial: setting up a TensorFlow/Keras deep learning environment on a Linux server for Kaggle competitions
This article was first published on my personal blog: https://kezunlin.me/post/6b505d27/. Visit there for the latest version!
A full guide to installing and configuring deep learning environments on a Linux server.
Quick Guide
prepare
tools
- MobaXterm (for Windows)
- ssh + VS Code
For Windows: drag and drop files onto MobaXterm to upload them to the server, and use zip format for archives.
commands
view disk usage
du -d 1 -h
df -h
gpu and cpu usage
watch -n 1 nvidia-smi
top
view files and count
wc -l data.csv
# count how many folders
ls -lR | grep '^d' | wc -l
17
# count how many jpg files
ls -lR | grep '.jpg' | wc -l
1360
# view 10 images
ls train | head
ls test | head
link datasets
# link
ln -s src dest
ln -s /data_1/kezunlin/datasets/ dl4cv/datasets
scp
scp -r node17:~/dl4cv ~/git/
scp -r node17:~/.keras ~/
tmux for background tasks
tmux new -s notebook
tmux ls
tmux attach -t notebook
tmux detach
wget download
# wget
# resume an interrupted download
wget -c url
# background download for a large file
wget -b -c url
tail -f wget-log
# kill background wget
pkill -9 wget
tips for training a large model
terminal 1:
tmux new -s train
conda activate keras
time python train_alexnet.py
terminal 2:
tmux detach
tmux attach -t train
Then you can close VS Code safely: because the training runs inside tmux, it keeps running; a process started directly in a VS Code terminal would exit when VS Code is closed.
cuda driver and toolkits
see the cuda-toolkit documentation for the matching CUDA driver version
The cudatoolkit version you can install depends on the CUDA driver version.
install nvidia-drivers
sudo add-apt-repository ppa:graphics-drivers/ppa
sudo apt-get update
sudo apt-cache search nvidia-*
# nvidia-384
# nvidia-396
sudo apt-get -y install nvidia-418
# test
nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
reboot to test again
https://stackoverflow.com/questions/43022843/nvidia-nvml-driver-library-version-mismatch
install cuda-toolkit (drivers)
remove all previous nvidia drivers
sudo apt-get -y purge nvidia-*
Go to the NVIDIA CUDA downloads page and download cuda_10.1:
wget -b -c http://developer.download.nvidia.com/compute/cuda/10.1/Prod/local_installers/cuda_10.1.243_418.87.00_linux.run
sudo sh cuda_10.1.243_418.87.00_linux.run
sudo ./cuda_10.1.243_418.87.00_linux.run
vim .bashrc
# for cuda and cudnn
export PATH=/usr/local/cuda/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH
check cuda driver version
> cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 418.87.00 Thu Aug 8 15:35:46 CDT 2019
GCC version: gcc version 5.4.0 20160609 (Ubuntu 5.4.0-6ubuntu1~16.04.11)
>nvidia-smi
Tue Aug 27 17:36:35 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
> nvidia-smi -L
GPU 0: Quadro RTX 8000 (UUID: GPU-acb01c1b-776d-cafb-ea35-430b3580d123)
GPU 1: Quadro RTX 8000 (UUID: GPU-df7f0fb8-1541-c9ce-e0f8-e92bccabf0ef)
GPU 2: Quadro RTX 8000 (UUID: GPU-67024023-20fd-a522-dcda-261063332731)
GPU 3: Quadro RTX 8000 (UUID: GPU-7f9d6a27-01ec-4ae5-0370-f0c356327913)
> nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243
install conda
./Anaconda3-2019.03-Linux-x86_64.sh
[yes]
[yes]
config channels
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/free/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/pkgs/main/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/conda-forge/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/msys2/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/menpo/
conda config --add channels https://mirrors.tuna.tsinghua.edu.cn/anaconda/cloud/pytorch/
conda config --set show_channel_urls yes
install libraries
conclusions:
- py37/keras: conda install -y tensorflow-gpu keras==2.2.5
- py37/torch: conda install -y pytorch torchvision
- py36/mxnet: conda install -y mxnet
keras 2.2.5 was released on 2019/8/23.
It adds new applications: ResNet101, ResNet152, ResNet50V2, ResNet101V2, and ResNet152V2.
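To confirm the new applications are available in your install, a minimal check might look like the sketch below (the exact import path is an assumption and can differ slightly between Keras versions):
# check that one of the newly added applications can be built (sketch, Keras 2.2.5)
from keras.applications.resnet_v2 import ResNet50V2

# weights are cached under ~/.keras/models/ on first use
model = ResNet50V2(weights="imagenet", include_top=True)
model.summary()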
common libraries
conda install -y scikit-learn scikit-image pandas matplotlib pillow opencv seaborn
pip install imutils progressbar pydot pylint
Use pip install imutils (not conda) to avoid downgrading tensorflow-gpu.
py37
cudatoolkit 10.0.130 0
cudnn 7.6.0 cuda10.0_0
tensorflow-gpu 1.13.1
py36
cudatoolkit anaconda/pkgs/main/linux-64::cudatoolkit-10.1.168-0
cudnn anaconda/pkgs/main/linux-64::cudnn-7.6.0-cuda10.1_0
tensorboard anaconda/pkgs/main/linux-64::tensorboard-1.14.0-py36hf484d3e_0
tensorflow anaconda/pkgs/main/linux-64::tensorflow-1.14.0-gpu_py36h3fb9ad6_0
tensorflow-base anaconda/pkgs/main/linux-64::tensorflow-base-1.14.0-gpu_py36he45bfe2_0
tensorflow-estima~ anaconda/cloud/conda-forge/linux-64::tensorflow-estimator-1.14.0-py36h5ca1d4c_0
tensorflow-gpu anaconda/pkgs/main/linux-64::tensorflow-gpu-1.14.0-h0d30ee6_0
imutils only supports py36 and py37.
mxnet only supports py35 and py36.
details
# remove py35
conda remove -n py35 --all
conda info --envs
conda create -n py37 python==3.7
conda activate py37
# common libraries
conda install -y scikit-learn pandas pillow opencv
pip install imutils
# imutils
conda search imutils
# py36 and py37
# Name Version Build Channel
imutils 0.5.2 py27_0 anaconda/cloud/conda-forge
imutils 0.5.2 py36_0 anaconda/cloud/conda-forge
imutils 0.5.2 py37_0 anaconda/cloud/conda-forge
# tensorflow-gpu and keras
conda install -y tensorflow-gpu keras
# install pytorch
conda install -y pytorch torchvision
# install mxnet
# method 1: pip
pip search mxnet
mxnet-cu80[mkl]/mxnet-cu90[mkl]/mxnet-cu91[mkl]/mxnet-cu92[mkl]/mxnet-cu100[mkl]/mxnet-cu101[mkl]
# method 2: conda
conda install mxnet
# py35 and py36
TensorFlow Object Detection API
home page: the tensorflow/models repository on GitHub
Download the tensorflow/models repository and rename models-master to tfmodels.
vim ~/.bashrc
export PYTHONPATH=/home/kezunlin/dl4cv:/data_1/kezunlin/tfmodels/research:$PYTHONPATH
source ~/.bashrc
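To confirm that the research directory is actually visible to Python, a quick check can be run (a minimal sketch; it assumes PYTHONPATH was exported as above, and fully using the Object Detection API also requires compiling its protobuf files):
# locate the object_detection package without importing it fully
import importlib.util
import os

spec = importlib.util.find_spec("object_detection")
if spec is None:
    print("object_detection not found; check PYTHONPATH")
else:
    print("object_detection found at:", os.path.dirname(spec.origin))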
jupyter notebook
conda activate py37
conda install -y jupyter
install kernels
python -m ipykernel install --user --name=py37
Installed kernelspec py37 in /home/kezunlin/.local/share/jupyter/kernels/py37
config for server
python -c "import IPython;print(IPython.lib.passwd())"
Enter password:
Verify password:
sha1:ef2fb2aacff2:4ea2998699638e58d10d594664bd87f9c3381c04
jupyter notebook --generate-config
Writing default config to: /home/kezunlin/.jupyter/jupyter_notebook_config.py
vim .jupyter/jupyter_notebook_config.py
c.NotebookApp.ip = '*'
c.NotebookApp.password = u'sha1:xxx:xxx'
c.NotebookApp.open_browser = False
c.NotebookApp.port = 8888
c.NotebookApp.enable_mathjax = True
run jupyter in the background
tmux new -s notebook
jupyter notebook
# ctrl+b, d: detach from the session without closing it
# ctrl+d: exit and close the session
Open the notebook page in a browser and enter the password.
test
py37
import cv2
cv2.__version__
import tensorflow as tf
import keras
import torch
import torchvision
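Beyond the imports succeeding, it is worth confirming that the GPUs are actually visible to the frameworks. A small sketch for TF 1.x and PyTorch (not part of the original test list):
import tensorflow as tf
import torch

# TensorFlow 1.x: True when a CUDA device is usable by the runtime
print("tensorflow GPU available:", tf.test.is_gpu_available())

# PyTorch: CUDA availability and number of visible devices
print("torch CUDA available:", torch.cuda.is_available(), "devices:", torch.cuda.device_count())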
cat .keras/keras.json
{
"epsilon": 1e-07,
"floatx": "float32",
"backend": "tensorflow",
"image_data_format": "channels_last"
}
py36
import mxnet
train demo
export
# use CPU only
export CUDA_VISIBLE_DEVICES=""
# use gpu 0 1
export CUDA_VISIBLE_DEVICES="0,1"
code
import os
os.environ['CUDA_VISIBLE_DEVICES'] = "0,1"
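On a shared server it can also help to stop TensorFlow from reserving all GPU memory at startup. A minimal sketch for TF 1.x with the Keras backend (an optional addition, not part of the steps above):
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"  # select GPUs before TensorFlow initializes

import tensorflow as tf
from keras import backend as K

# TF 1.x API: allocate GPU memory on demand instead of grabbing it all up front
config = tf.ConfigProto()
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))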
start train
python train.py
.keras folder
view keras models and datasets
ls .keras/
datasets keras.json models
models saved to
/home/kezunlin/.keras/models/
datasets saved to /home/kezunlin/.keras/datasets/
models lists
xxx_kernels_notop.h5 for include_top = False
xxx_kernels.h5 for include_top = True
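For example, loading a pretrained network with or without its classification head pulls down the corresponding weight file into ~/.keras/models/. A sketch using VGG16 (any keras.applications model behaves the same way):
from keras.applications import VGG16

# downloads vgg16_..._kernels.h5 (full network including the FC classifier)
full_model = VGG16(weights="imagenet", include_top=True)

# downloads vgg16_..._kernels_notop.h5 (convolutional base only, for transfer learning)
base_model = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
print(full_model.output_shape, base_model.output_shape)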
Datasets
mnist
cifar10
to skip download
wget http://www.cs.toronto.edu/~kriz/cifar-10-python.tar.gz
mv ~/Download/cifar-10-python.tar.gz ~/.keras/datasets/cifar-10-batches-py.tar.gz
to load data
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
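A short sanity check on what load_data() returns (a sketch; the shapes are the standard CIFAR-10 ones):
from keras.datasets import cifar10

# uses ~/.keras/datasets/cifar-10-batches-py.tar.gz if present, otherwise downloads it
(x_train, y_train), (x_test, y_test) = cifar10.load_data()
print(x_train.shape, y_train.shape)  # (50000, 32, 32, 3) (50000, 1)
print(x_test.shape, y_test.shape)    # (10000, 32, 32, 3) (10000, 1)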
flowers-17
animals
panda images are WRONG !!!
counts
ls -lR animals/cat | grep ".jpg" | wc -l
1000
ls -lR animals/dog | grep ".jpg" | wc -l
1000
ls -lR animals/panda | grep ".jpg" | wc -l
1000
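Since some images in this dataset are known to be bad (see the panda note above), it is worth scanning for files that cannot be decoded. A minimal sketch using Pillow (the animals/ path and .jpg-only layout are assumptions; adjust to your dataset):
import os
from PIL import Image

dataset_dir = "animals"  # assumed layout: animals/<class>/<image>.jpg

bad_files = []
for root, _, files in os.walk(dataset_dir):
    for name in files:
        if not name.lower().endswith(".jpg"):
            continue
        path = os.path.join(root, name)
        try:
            with Image.open(path) as img:
                img.verify()  # cheap integrity check without a full decode
        except Exception:
            bad_files.append(path)

print("corrupt or unreadable images:", len(bad_files))
for path in bad_files[:10]:
    print(path)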
kaggle cats vs dogs
caltech101
download in the background
wget -b -c http://www.vision.caltech.edu/Image_Datasets/Caltech101/101_ObjectCategories.tar.gz
Kaggle API
install and config
see kaggle-api
conda activate keras
conda install kaggle
# download kaggle.json
mv kaggle.json ~/.kaggle/kaggle.json
chmod 600 ~/.kaggle/kaggle.json
cat kaggle.json
{"username":"xxx","key":"yyy"}
or by export
export KAGGLE_USERNAME=xxx
export KAGGLE_KEY=yyy
tips
- Go to your Kaggle account page and select 'Create API Token'; a kaggle.json file will be downloaded.
- Ensure kaggle.json is placed at ~/.kaggle/kaggle.json so the API can find it.
check version
kaggle --version
Kaggle API 1.5.5
commands overview
commands
kaggle competitions {list, files, download, submit, submissions, leaderboard}
kaggle datasets {list, files, download, create, version, init}
kaggle kernels {list, init, push, pull, output, status}
kaggle config {view, set, unset}
download datasets
kaggle competitions download -c dogs-vs-cats
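The same download can also be scripted from Python via the kaggle package (a sketch; it assumes kaggle.json or the KAGGLE_* environment variables are already configured as above):
from kaggle.api.kaggle_api_extended import KaggleApi

api = KaggleApi()
api.authenticate()  # reads ~/.kaggle/kaggle.json or KAGGLE_USERNAME/KAGGLE_KEY

# download all competition files into ./data (zip archives, like the CLI)
api.competition_download_files("dogs-vs-cats", path="data")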
show leaderboard
kaggle competitions leaderboard dogs-vs-cats --show
teamId teamName submissionDate score
------ --------------------------------- ------------------- -------
71046 Pierre Sermanet 2014-02-01 21:43:19 0.98533
66623 Maxim Milakov 2014-02-01 18:20:58 0.98293
72059 Owen 2014-02-01 17:04:40 0.97973
74563 Paul Covington 2014-02-01 23:05:20 0.97946
74298 we've been in KAIST 2014-02-01 21:15:30 0.97840
71949 orchid 2014-02-01 23:52:30 0.97733
set default competition
kaggle config set --name competition --value dogs-vs-cats
- competition is now set to: dogs-vs-cats
kaggle config set --name competition --value dogs-vs-cats-redux-kernels-edition
dogs-vs-cats
dogs-vs-cats-redux-kernels-edition
submit
kaggle c submissions
- Using competition: dogs-vs-cats
- No submissions found
kaggle c submit -f ./submission.csv -m "first submit"
The competition has already ended, so submissions are no longer accepted.
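For reference, a submission for dogs-vs-cats-redux-kernels-edition is a two-column CSV: an image id and the predicted probability that the image is a dog. A minimal sketch for writing it with pandas (the id range, column names, and the random predictions are placeholders based on the competition's sample submission):
import numpy as np
import pandas as pd

test_ids = np.arange(1, 12501)  # test images are assumed to be named 1.jpg ... 12500.jpg
probs = np.clip(np.random.rand(len(test_ids)), 0.02, 0.98)  # clipping guards against extreme log loss

submission = pd.DataFrame({"id": test_ids, "label": probs})
submission.to_csv("submission.csv", index=False)
print(submission.head())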
Nvidia-docker and containers
install
sudo apt-get -y install docker.io  # note: the Docker engine package on Ubuntu is docker.io, not docker
# Install nvidia-docker2 and reload the Docker daemon configuration
sudo apt-get install -y nvidia-docker2
sudo pkill -SIGHUP dockerd
restart (optional)
cat /etc/docker/daemon.json
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
}
}
sudo systemctl enable docker
sudo systemctl start docker
if errors occur:
Job for docker.service failed because the control process exited with error code.
See "systemctl status docker.service" and "journalctl -xe" for details.
check /etc/docker/daemon.json
test
sudo docker run --runtime=nvidia --rm nvidia/cuda:10.1-base nvidia-smi
sudo nvidia-docker run --rm nvidia/cuda:10.1-base nvidia-smi
Thu Aug 29 00:11:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Quadro RTX 8000 Off | 00000000:02:00.0 Off | Off |
| 43% 67C P2 136W / 260W | 46629MiB / 48571MiB | 17% Default |
+-------------------------------+----------------------+----------------------+
| 1 Quadro RTX 8000 Off | 00000000:03:00.0 Off | Off |
| 34% 54C P0 74W / 260W | 0MiB / 48571MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Quadro RTX 8000 Off | 00000000:82:00.0 Off | Off |
| 34% 49C P0 73W / 260W | 0MiB / 48571MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Quadro RTX 8000 Off | 00000000:83:00.0 Off | Off |
| 33% 50C P0 73W / 260W | 0MiB / 48571MiB | 3% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Add the user to the docker group (for example with usermod -aG docker <user>) so that docker commands no longer need sudo.
command refs
sudo nvidia-docker run --rm nvidia/cuda:10.1-base nvidia-smi
sudo nvidia-docker run -t -i --privileged nvidia/cuda bash
sudo docker run -it --name kzl -v /home/kezunlin/workspace/:/home/kezunlin/workspace nvidia/cuda
Reference
History
- 20190821: created.
Copyright
- Post author: kezunlin
- Post link: https://kezunlin.me/post/6b505d27/
- Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 3.0 unless otherwise stated.