Installation Environment

System Requirements

CPU: 2 cores

Memory: 2 GB

GPU: NVIDIA series

Install Docker

apt install docker.io
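
It also helps to make sure the Docker service is enabled and running; kubeadm's preflight check later warns if it is not:

# Enable Docker at boot and start it immediately.
systemctl enable --now docker.service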

Install k8s

Add the package repository

For convenience, switch Ubuntu's package download mirror to the Aliyun mirror.
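
A minimal sketch of switching the mirror from the shell (it rewrites the archive host in /etc/apt/sources.list in place, so back up the file and review the result):

# Point the default Ubuntu archives at the Aliyun mirror.
cp /etc/apt/sources.list /etc/apt/sources.list.bak
sed -i 's|http://.*archive.ubuntu.com|https://mirrors.aliyun.com|g' /etc/apt/sources.list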

Then add the k8s repository to /etc/apt/sources.list:

deb https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial main
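
One way to append this line from the shell, equivalent to editing the file by hand:

# Append the Kubernetes repository entry to the apt sources list.
echo "deb https://mirrors.aliyun.com/kubernetes/apt kubernetes-xenial main" >> /etc/apt/sources.list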

Then refresh the package index with apt update.

Problem: NO_PUBKEY

NO_PUBKEY BA300B7755AFCFAE

apt-key adv --recv-keys --keyserver keyserver.ubuntu.com 6A030B21BA07F4FB

Problem: unmet dependencies

Run apt update again,

or apt --fix-broken install.

Modify the hosts file

vim /etc/hosts

Comment out

127.0.0.1 computer_name

and add entries for the nodes that will make up the cluster:

192.168.9.103 master

192.168.9.104 node1

192.168.9.105 node2 ......

#127.0.0.1      localhost
#127.0.1.1      dell3

# The following lines are desirable for IPv6 capable hosts
::1     ip6-localhost ip6-loopback
fe00::0 ip6-localnet
ff00::0 ip6-mcastprefix
ff02::1 ip6-allnodes
ff02::2 ip6-allrouters

192.168.9.103 master

Install kubeadm

apt install kubeadm

Installing kubeadm automatically installs kubectl and kubelet as dependencies.
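
Optionally, hold the three packages so a routine apt upgrade does not move the cluster to a new version unexpectedly:

# Pin kubelet, kubeadm and kubectl at their installed versions.
apt-mark hold kubelet kubeadm kubectl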

List the required images

kubeadm config images list

The output is:

k8s.gcr.io/kube-apiserver:v1.17.0
k8s.gcr.io/kube-controller-manager:v1.17.0
k8s.gcr.io/kube-scheduler:v1.17.0
k8s.gcr.io/kube-proxy:v1.17.0
k8s.gcr.io/pause:3.1
k8s.gcr.io/etcd:3.4.3-0
k8s.gcr.io/coredns:1.6.5

Pull these images from a domestic mirror (Aliyun):

docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.17.0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.3-0
docker pull registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.6.5

Use docker tag to re-tag them with the k8s.gcr.io names kubeadm expects:

docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/pause:3.1 k8s.gcr.io/pause:3.1
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-apiserver:v1.17.0 k8s.gcr.io/kube-apiserver:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-controller-manager:v1.17.0 k8s.gcr.io/kube-controller-manager:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-scheduler:v1.17.0 k8s.gcr.io/kube-scheduler:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/kube-proxy:v1.17.0 k8s.gcr.io/kube-proxy:v1.17.0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/etcd:3.4.3-0 k8s.gcr.io/etcd:3.4.3-0
docker tag registry.cn-hangzhou.aliyuncs.com/google_containers/coredns:1.6.5 k8s.gcr.io/coredns:1.6.5
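
The pull and tag commands above can also be written as one loop over the image list printed by kubeadm config images list:

# Pull each image from the Aliyun mirror and re-tag it under k8s.gcr.io.
MIRROR=registry.cn-hangzhou.aliyuncs.com/google_containers
for img in kube-apiserver:v1.17.0 kube-controller-manager:v1.17.0 \
           kube-scheduler:v1.17.0 kube-proxy:v1.17.0 \
           pause:3.1 etcd:3.4.3-0 coredns:1.6.5; do
    docker pull "$MIRROR/$img"
    docker tag "$MIRROR/$img" "k8s.gcr.io/$img"
done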

Configure the master

First, disable swap:

swapoff -a

Then run the initialization:

root@dell3:~# kubeadm init --kubernetes-version=v1.17.0 --pod-network-cidr 192.168.0.0/16
W1218 14:48:40.560734 20883 validation.go:28] Cannot validate kube-proxy config - no validator is available
W1218 14:48:40.560767 20883 validation.go:28] Cannot validate kubelet config - no validator is available
[init] Using Kubernetes version: v1.17.0
[preflight] Running pre-flight checks
[WARNING Service-Docker]: docker service is not enabled, please run 'systemctl enable docker.service'
[WARNING IsDockerSystemdCheck]: detected "cgroupfs" as the Docker cgroup driver. The recommended driver is "systemd". Please follow the guide at https://kubernetes.io/docs/setup/cri/
error execution phase preflight: [preflight] Some fatal errors occurred:
[ERROR Swap]: running with swap on is not supported. Please disable swap
[preflight] If you know what you are doing, you can make a check non-fatal with `--ignore-preflight-errors=...`
To see the stack trace of this error execute with --v=5 or higher

The preflight error says swap must be disabled:

swapoff -a

Then run kubeadm init again.
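
The preflight output above also warned that Docker uses the cgroupfs driver while kubeadm recommends systemd. That warning is non-fatal, but a commonly applied fix is the snippet below; it is a sketch, so merge the key into any existing /etc/docker/daemon.json rather than overwriting it, and keep it when the same file is edited again later for the nvidia runtime:

# Switch Docker's cgroup driver to systemd, then restart Docker.
cat <<'EOF' > /etc/docker/daemon.json
{
    "exec-opts": ["native.cgroupdriver=systemd"]
}
EOF
systemctl restart docker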

When the installation completes, the output is:

Your Kubernetes control-plane has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

Then you can join any number of worker nodes by running the following on each as root:

kubeadm join 192.168.9.103:6443 --token tn3e9a.6fgbdbu3vvus8ia9 \
    --discovery-token-ca-cert-hash sha256:ce5aa219f8fd1da40646997f2c3d27ee905989812b115146356ecfc9304036ba

Run the three commands from the prompt:

mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
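
Since this walkthrough runs everything as root, an equivalent alternative is to point kubectl at the admin kubeconfig directly:

# Alternative for a root shell: use the admin kubeconfig as-is.
export KUBECONFIG=/etc/kubernetes/admin.conf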

Add the pod network (flannel)

kubectl apply -f https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml

On success, the output is:

podsecuritypolicy.policy/psp.flannel.unprivileged created
clusterrole.rbac.authorization.k8s.io/flannel created
clusterrolebinding.rbac.authorization.k8s.io/flannel created
serviceaccount/flannel created
configmap/kube-flannel-cfg created
daemonset.apps/kube-flannel-ds-amd64 created
daemonset.apps/kube-flannel-ds-arm64 created
daemonset.apps/kube-flannel-ds-arm created
daemonset.apps/kube-flannel-ds-ppc64le created
daemonset.apps/kube-flannel-ds-s390x created

Check the pods

root@dell3:~# kubectl get pod
No resources found in default namespace.
root@dell3:~# kubectl get pod -n kube-system
NAME                            READY   STATUS     RESTARTS   AGE
coredns-6955765f44-gccbp        0/1     Pending    0          2m33s
coredns-6955765f44-gl7zg        0/1     Pending    0          2m33s
etcd-dell3                      1/1     Running    0          2m33s
kube-apiserver-dell3            1/1     Running    0          2m33s
kube-controller-manager-dell3   1/1     Running    0          2m33s
kube-flannel-ds-amd64-rrhng     0/1     Init:0/1   0          70s
kube-proxy-srnvg                1/1     Running    0          2m33s
kube-scheduler-dell3            1/1     Running    0          2m33s

Once the core k8s components are all Running, the installation has succeeded; coredns remains Pending until the flannel pod above finishes initializing.
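
To wait for everything to settle, the pod list can be watched until coredns and flannel reach Running:

# Watch the kube-system pods; press Ctrl+C to stop.
kubectl get pod -n kube-system -w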

Install CUDA

NVIDIA recommends installing CUDA first, then the NVIDIA driver.

Run the runfile installer with --override to skip its gcc version check:

./cuda.run --override
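
A quick sanity check once the toolkit is installed (the path assumes the default /usr/local/cuda install prefix):

# Print the version of the installed CUDA compiler.
/usr/local/cuda/bin/nvcc --version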

Install the NVIDIA driver

Find a suitable version

Look up a driver version that matches your GPU model on NVIDIA's website.

Install

For convenience, install it directly through the Additional Drivers tool that Ubuntu 19 provides.

Signs of success

Open NVIDIA X Server Settings and you should see detailed information about the GPU. If the window is blank, no NVIDIA driver is currently installed.

Running nvidia-smi shows detailed driver information:

root@dell3:~# nvidia-smi
Wed Dec 18 15:07:25 2019
+------------------------------------------------------+
| NVIDIA-SMI 340.107    Driver Version: 340.107        |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GT 610      Off  | 0000:01:00.0     N/A |                  N/A |
|100%   56C    P8    N/A /  N/A |    129MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GT 610      Off  | 0000:06:00.0     N/A |                  N/A |
|100%   46C    P8    N/A /  N/A |      3MiB /  1023MiB |     N/A      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0            Not Supported                                               |
|    1            Not Supported                                               |
+-----------------------------------------------------------------------------+

Problems

Login loop

After booting, press Ctrl+Alt+F[1-6] to switch to a text console and uninstall the NVIDIA driver:

apt remove --purge nvidia-*

If you still have the driver's runfile installer, you can also run

./nvidia-*.run --uninstall

Reboot into the stock graphical session, then turn off password login in the settings.

Reinstall the NVIDIA driver.

Cannot enter the graphical interface

Reinstall the NVIDIA driver.

Install the NVIDIA device plugin for k8s

Install nvidia-docker2

# Add the package repositories
$ distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
$ curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
$ curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

$ sudo apt-get update && sudo apt-get install -y nvidia-docker2
$ sudo systemctl restart docker

Test nvidia-docker2

Use nvidia-docker2 to run the cuda image:

docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi

The image is downloaded on the first run; if the download is too slow, configure a registry mirror (accelerator).

The result is:

root@dell:~# docker run --runtime=nvidia --rm nvidia/cuda nvidia-smi
Wed Dec 18 08:55:00 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 430.50       Driver Version: 430.50       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 750 Ti  Off  | 00000000:01:00.0  On |                  N/A |
| 33%   30C    P8     1W /  38W |    359MiB /  1999MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

Add the device plugin configuration

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
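
To confirm the plugin came up, look for its pod in kube-system (the DaemonSet name comes from the upstream manifest):

# The device plugin runs as a DaemonSet; its pods should reach Running.
kubectl get pods -n kube-system | grep nvidia-device-plugin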

Change the default runtime

Edit /etc/docker/daemon.json and add the default-runtime key:

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}

Restart Docker

# systemctl daemon-reload

# systemctl restart docker
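
A quick check that the change took effect:

# Docker should now report "Default Runtime: nvidia".
docker info | grep -i runtime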

Usage

Check whether k8s recognizes the GPU

Run kubectl describe node node_name to view the node's details:

root@dell:~/mypod# kubectl describe nodes
Name: dell
Roles: master
Labels: beta.kubernetes.io/arch=amd64
beta.kubernetes.io/os=linux
disktype=ssd
kubernetes.io/arch=amd64
kubernetes.io/hostname=dell
kubernetes.io/os=linux
node-role.kubernetes.io/master=
Annotations: flannel.alpha.coreos.com/backend-data: {"VtepMAC":"c6:9a:2d:50:03:4b"}
flannel.alpha.coreos.com/backend-type: vxlan
flannel.alpha.coreos.com/kube-subnet-manager: true
flannel.alpha.coreos.com/public-ip: 192.168.8.52
kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
node.alpha.kubernetes.io/ttl: 0
volumes.kubernetes.io/controller-managed-attach-detach: true
CreationTimestamp: Thu, 12 Dec 2019 10:25:16 +0800
Taints: <none>
Unschedulable: false
Conditions:
Type Status LastHeartbeatTime LastTransitionTime Reason Message
---- ------ ----------------- ------------------ ------ -------
MemoryPressure False Wed, 18 Dec 2019 17:17:55 +0800 Thu, 12 Dec 2019 18:00:39 +0800 KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False Wed, 18 Dec 2019 17:17:55 +0800 Thu, 12 Dec 2019 18:00:39 +0800 KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False Wed, 18 Dec 2019 17:17:55 +0800 Thu, 12 Dec 2019 18:00:39 +0800 KubeletHasSufficientPID kubelet has sufficient PID available
Ready True Wed, 18 Dec 2019 17:17:55 +0800 Mon, 16 Dec 2019 10:30:19 +0800 KubeletReady kubelet is posting ready status. AppArmor enabled
Addresses:
InternalIP: 192.168.8.52
Hostname: dell
Capacity:
cpu: 4
ephemeral-storage: 479152840Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 24568140Ki
nvidia.com/gpu: 1
pods: 110
Allocatable:
cpu: 4
ephemeral-storage: 441587256613
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 24465740Ki
nvidia.com/gpu: 1
pods: 110
System Info:
Machine ID: 833fac65cd12401db017c0b0033439e7
System UUID: 28d52460-d7da-11dd-9d00-40167e218cad
Boot ID: 7a4a6548-28da-447c-845a-fab20ed82181
Kernel Version: 5.3.0-24-generic
OS Image: Ubuntu 19.10
Operating System: linux
Architecture: amd64
Container Runtime Version: docker://19.3.2
Kubelet Version: v1.16.3
Kube-Proxy Version: v1.16.3
PodCIDR: 192.168.0.0/24
PodCIDRs: 192.168.0.0/24
Non-terminated Pods: (13 in total)

If the Capacity section includes nvidia.com/gpu: 1, k8s has recognized that this machine has one GPU.
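
A shorter check that looks at just that field:

# Show only the GPU capacity/allocatable lines from the node description.
kubectl describe nodes | grep nvidia.com/gpu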

Using the GPU

Create a pod that requests a GPU

Create a file named gpu-pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: tf-pod
spec:
  containers:
  - name: tf-container
    image: tensorflow/tensorflow:latest-gpu
    resources:
      limits:
        nvidia.com/gpu: 1 # requesting 1 GPU

Then run kubectl apply -f gpu-pod.yaml.

Use kubectl get pod to check the pod's status:

root@dell:~/mypod# kubectl get pod
NAME             READY   STATUS              RESTARTS   AGE
busybox-6t962    0/1     Completed           0          6d2h
cuda3            0/1     Completed           47         2d7h
gpu-cuda         0/1     Completed           0          5d
gpu-pod          0/1     Completed           0          5d1h
gpu-pod23        0/2     Pending             0          2d6h
hello-world      0/1     ContainerCreating   0          6d5h
myjob-k9hx5      0/1     Completed           0          6d3h
myjob2-xmdm8     0/1     Completed           0          6d1h
pi-9cttz         0/1     Completed           0          6d2h

Use kubectl describe pod pod_name to view the pod's details.
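
To confirm the container really sees the GPU, nvidia-smi can be run inside it once the pod is Running (the pod name comes from the YAML above):

# Run nvidia-smi inside the tf-pod container.
kubectl exec tf-pod -- nvidia-smi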

Problem: kubectl commands stop working

An error like this appears:

root@dell3:~# kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
The connection to the server 192.168.9.103:6443 was refused - did you specify the right host or port?

This usually happens after a reboot, when Docker and the control plane have not come back up. The fix:

# swapoff -a
# systemctl daemon-reload
# systemctl restart docker
# systemctl restart kubelet

The most important of these is disabling swap.
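
To keep swap from coming back after every reboot, the swap entry in /etc/fstab can be commented out (a sketch; double-check the matched line before relying on it):

# Comment out any swap lines in /etc/fstab so swap stays off across reboots.
sed -i.bak '/\sswap\s/ s/^/#/' /etc/fstab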
