Study Notes: Slurm
Slurm Workload Manager - Overview
- https://slurm.schedmd.com/overview.html
- Slurm is an open source, fault-tolerant, and highly scalable cluster management and job scheduling system for large and small Linux clusters. Slurm requires no kernel modifications for its operation and is relatively self-contained. As a cluster workload manager, Slurm has three key functions. First, it allocates exclusive and/or non-exclusive access to resources (compute nodes) to users for some duration of time so they can perform work. Second, it provides a framework for starting, executing, and monitoring work (normally a parallel job) on the set of allocated nodes. Finally, it arbitrates contention for resources by managing a queue of pending work. Optional plugins can be used for accounting, advanced reservation, gang scheduling (time sharing for parallel jobs), backfill scheduling, topology optimized resource selection, resource limits by user or bank account, and sophisticated multifactor job prioritization algorithms.
Slurm Workload Manager - Quick Start User Guide
Slurm Workload Manager - Wikipedia
- https://en.wikipedia.org/wiki/Slurm_Workload_Manager
- The Slurm Workload Manager (formerly known as Simple Linux Utility for Resource Management or SLURM), or Slurm, is a free and open-source job scheduler for Linux and Unix-like kernels, used by many of the world's supercomputers and computer clusters.
- It provides three key functions:
- allocating exclusive and/or non-exclusive access to resources (computer nodes) to users for some duration of time so they can perform work,
- providing a framework for starting, executing, and monitoring work (typically a parallel job such as MPI) on a set of allocated nodes, and
- arbitrating contention for resources by managing a queue of pending jobs.
- Slurm is the workload manager on about 60% of the TOP500 supercomputers.[1]
- Slurm uses a best fit algorithm based on Hilbert curve scheduling or fat tree network topology in order to optimize locality of task assignments on parallel computers.[2]
Slurm Workload Manager - sacct
sbatch - Submit a batch script to Slurm
- https://slurm.schedmd.com/sbatch.html
- $ sbatch mytestsbatch.sh
- Note: the second srun starts only after the previous srun has completed, so the sleep in the batch script below is not actually required.
# =============================================================================
# mytestscript.sh
# =============================================================================
#!/bin/sh
date &
# =============================================================================
# mytestsbatch.sh
# =============================================================================
#!/bin/sh
#SBATCH -N 2
#SBATCH -n 10
srun -n10 -o testscript1.log mytestscript.sh
sleep 10; srun -n10 -o testscript2.log mytestscript.sh
wait
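If the two job steps should overlap instead of running one after the other, a common pattern is to put each srun in the background with & and finish with wait. The following is a minimal sketch, assuming the same 2-node, 10-task allocation split into two 5-task steps. The file name parallel_steps.sh and the 5/5 split are illustrative choices, not part of the original notes, and whether the steps really run side by side depends on the Slurm version and cluster configuration.

```shell
# Write a hypothetical batch script, parallel_steps.sh, to the current directory.
cat > parallel_steps.sh <<'EOF'
#!/bin/sh
#SBATCH -N 2
#SBATCH -n 10
# Each srun is sent to the background so the two job steps can overlap.
# The two steps together ask for 10 tasks, which matches the allocation.
srun -n5 -o testscript1.log mytestscript.sh &
srun -n5 -o testscript2.log mytestscript.sh &
# Block until both background steps have finished; without this the
# batch script would exit and the job would end while the steps still run.
wait
EOF
```

On a machine with Slurm installed, the script could then be submitted with sbatch parallel_steps.sh.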
scancel - Used to signal jobs or job steps that are under the control of Slurm.
- https://slurm.schedmd.com/scancel.html
- $ scancel 123
scontrol - view or modify Slurm configuration and state.
- https://slurm.schedmd.com/scontrol.html
- $ scontrol show job 123
squeue - view information about jobs located in the Slurm scheduling queue.
- https://slurm.schedmd.com/squeue.html
- $ squeue
- $ squeue -u username
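The squeue output can also be summarized with ordinary shell tools. Below is a rough sketch, assuming one job state per input line, which is what squeue prints with -h (no header) and -o %T (state in extended form). The helper name count_states is made up for this example.

```shell
# count_states: read one job state per line on stdin and print
# "STATE count" pairs, sorted by state name.
count_states() {
    awk 'NF { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}

# On a cluster (requires Slurm), it could be used like this:
#   squeue -h -u username -o %T | count_states
```

The function itself only needs awk and sort, so it can be tried with hand-written input before pointing it at a real queue.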
srun - Run parallel jobs
- https://slurm.schedmd.com/srun.html
- $ cat testscript.sh
- #!/bin/sh
- python mytest.py --arg test
- $ chmod +x testscript.sh
- $ srun -N5 -n100 testscript.sh
- Run it on 5 nodes with 100 tasks
- $ srun -n5 --nodelist=host1,host2 -o testscript.log testscript.sh
- $ srun -n10 -o testscript.log --begin=now+2hour testscript.sh
- $ srun --begin=now+10 date &
Convenient SLURM Commands | FAS Research Computing
srun: error: --begin is ignored because nodes are already allocated.
- use sleep in lieu of --begin
- bash - Can you help me run tasks in parallel in Slurm? - Stack Overflow
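The sleep workaround can be sketched as follows. This assumes a batch script, hypothetically named delayed_step.sh, in which the job step should start a little after the job itself; the 10-second delay simply reuses the value from the sbatch example earlier in these notes.

```shell
# Write a hypothetical batch script, delayed_step.sh, to the current directory.
cat > delayed_step.sh <<'EOF'
#!/bin/sh
#SBATCH -n 10
# Inside an existing allocation srun ignores --begin,
# so the step is delayed with a plain sleep instead.
sleep 10
srun -n10 -o testscript.log testscript.sh
EOF

# To delay the whole job rather than one step, --begin can be given
# to sbatch at submission time, for example:
#   sbatch --begin=now+2hour delayed_step.sh
```

Note that sleeping inside the script keeps the allocated resources idle for the duration of the delay, so for long delays the sbatch --begin form is likely the better fit.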
srun: error: Unable to create job step: More processors requested than permitted
- In the submission script, you request resources with the #SBATCH directives, and you cannot use more resources than that in the subsequent calls to srun.
- slurm - Questions on alternative ways to run 4 parallel jobs - Stack Overflow
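One way to stay within the allocation is to derive each step's task count from the job's total instead of hard-coding it. This is a small sketch, assuming the job was submitted with -n so that Slurm sets SLURM_NTASKS inside the batch script. The helper name tasks_per_step is invented for this example.

```shell
# tasks_per_step TOTAL STEPS
# Print the largest per-step task count such that STEPS concurrent
# job steps together never request more than TOTAL tasks.
tasks_per_step() {
    echo $(( $1 / $2 ))
}

# Inside a batch script it could be used like this:
#   n=$(tasks_per_step "$SLURM_NTASKS" 2)
#   srun -n "$n" -o testscript1.log mytestscript.sh &
#   srun -n "$n" -o testscript2.log mytestscript.sh &
#   wait
```

Integer division rounds down, so any remainder is left unused rather than over-requested.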