Slurm Configuration¶
Services¶
Slurm packages to be installed on each node class:
Node Class | Services |
---|---|
Controller (VM) | slurm, slurm-perlapi, slurm-slurmctld |
Compute | slurm, slurm-perlapi, slurm-slurmd, slurm-pam |
Frontend | slurm, slurm-perlapi |
SlurmDBD (VM) | slurm, slurm-dbd |
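As a sketch, assuming the packages above are built as RPMs and available from a local repository on a yum-based distribution, installation on the controller and compute nodes might look like:
# Controller (VM)
yum install slurm slurm-perlapi slurm-slurmctld
# Compute nodes
yum install slurm slurm-perlapi slurm-slurmd slurm-pam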
Plugins Dependencies¶
Plugins | Dependencies |
---|---|
MUNGE | munge-devel |
PAM Support | pam-devel |
cgroup Task Affinity | hwloc-devel |
IPMI Energy Consumption | freeipmi-devel |
Lua Support | lua-devel |
MySQL Support | mysql-devel |
X11 | libssh2-devel |
[TBD]
- InfiniBand Accounting: libibmad-devel, libibumad-devel
- cgroup NUMA Affinity: ???
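To build Slurm with the plugins above, the corresponding -devel packages must be present on the build host. A sketch, assuming a yum-based build machine:
yum install munge-devel pam-devel hwloc-devel freeipmi-devel lua-devel mysql-devel libssh2-devel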
Configuration¶
Configuration in /etc/slurm.conf
Config | Value | Detail |
---|---|---|
SlurmctldHost | slurmctld | Might need to be set to slurmctld slurmctld.hpc.nstda.or.th |
AuthType | auth/munge | |
CryptoType | crypto/munge | |
GresTypes | Removed gpu | See Generic Resource (GRES) Scheduling |
JobRequeue | 1 | Automatically requeue batch jobs after node failure or preemption. |
LaunchType | launch/slurm | |
MailProg | /bin/mail | |
MpiDefault | pmix | |
PrivateData | jobs,usage,users | Prevents users from viewing the jobs and usage of any other user, and from viewing information about any user other than themselves. |
ProctrackType | proctrack/cgroup | The slurmd daemon uses this mechanism to identify all processes which are children of processes it spawns for a user job step. The slurmd daemon must be restarted for a change in ProctrackType to take effect. |
SlurmctldPidFile | /var/run/slurm/slurmctld.pid | Local file |
SlurmctldPort | 6817 | |
SlurmdPidFile | /var/run/slurm/slurmd.pid | Local file |
SlurmdPort | 6818 | |
SlurmdSpoolDir | /var/spool/slurm/slurmd | Should be a local file system |
SlurmUser | slurm | |
SlurmdUser | root | |
StateSaveLocation | /var/spool/slurm/slurm.state | Should be a local file system |
SwitchType | switch/none | |
TaskPlugin | task/affinity,task/cgroup | See Cgroups |
TaskPluginParam | Sched | |
TopologyPlugin | topology/tree | |
RoutePlugin | route/topology | [TBD] |
TmpFS | /tmp | A node's TmpDisk space |
CpuFreqGovernors | OnDemand, Performance, PowerSave, UserSpace | See CPU Frequency Governor |
CpuFreqDef | Performance | Default: run the CPU at the maximum frequency. |
Note
The topology.conf file for an InfiniBand switch can be automatically generated using the slurmibtopology tool found here: https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh
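A sketch of how a subset of these settings appears in slurm.conf, using the values from the table above:
# slurm.conf (excerpt) -- general settings
SlurmctldHost=slurmctld
AuthType=auth/munge
MpiDefault=pmix
ProctrackType=proctrack/cgroup
TaskPlugin=task/affinity,task/cgroup
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/slurm.state
SlurmdSpoolDir=/var/spool/slurm/slurmd
TopologyPlugin=topology/tree
CpuFreqDef=Performance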
Job Scheduling¶
Config | Value | Detail |
---|---|---|
FastSchedule | 1 | |
SchedulerType | sched/backfill | |
SchedulerParameters | | |
SelectType | select/cons_res | See Consumable Resources in Slurm |
SelectTypeParameters | CR_Socket_Memory | Sockets and memory are consumable resources. |
KillWait | 30 | The interval given to a job’s processes between the SIGTERM and SIGKILL signals upon reaching its time limit. |
OverTimeLimit | 5 | Number of minutes by which a job can exceed its time limit before being canceled. |
PreemptMode | REQUEUE | Preempts jobs by requeuing them (if possible) or canceling them. |
PreemptType | preempt/qos | Job preemption rules are specified by Quality Of Service (QOS). |
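The same scheduling settings expressed as slurm.conf lines (a sketch based on the table above):
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory
KillWait=30
OverTimeLimit=5
PreemptType=preempt/qos
PreemptMode=REQUEUE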
Job Priority¶
Config | Value | Detail |
---|---|---|
PriorityType | priority/multifactor | See Multifactor plugin |
PriorityDecayHalfLife | 7-0 | The impact of historical usage (for fair share) decays with a half-life of 7 days. |
PriorityCalcPeriod | 5 | Half-life decay will be recalculated every 5 minutes. |
PriorityFavorSmall | NO | Larger jobs will have higher priority. Allocating the whole machine results in a job size factor of 1.0. |
PriorityFlags | TBD | |
PriorityMaxAge | 7-0 | A job gets the maximum age factor (1.0) once it has resided in the queue for more than 7 days. |
PriorityUsageResetPeriod | NONE | Never clear historic usage. |
PriorityWeightAge | 1000 | |
PriorityWeightFairshare | 10000 | |
PriorityWeightJobSize | 1000 | |
PriorityWeightPartition | 1000 | |
PriorityWeightQOS | 1000 | |
PriorityWeightTRES | | |
If PriorityFavorSmall is set to YES, a single-node job will receive the 1.0 job size factor.
[TBD] Some interesting values for PriorityFlags:
ACCRUE_ALWAYS: The priority age factor will be increased despite job dependencies or holds.
This could be beneficial for BioBank workloads where jobs have dependencies, so a dependent job can run as soon as the prior job finishes thanks to its accumulated age factor. However, users could abuse this by submitting many jobs and holding them to increase their age factor.
SMALL_RELATIVE_TO_TIME: The job's size component will be based upon the job size divided by the time limit.
In layman's terms, a job with a large allocation and a short walltime will be more preferable. This could promote better user behavior, since users who estimate their needs well get better priority, which should eventually encourage users to parallelize their programs. However, serial programs with long running times, e.g. MATLAB when limited by the license, will have trouble running on the system. Such a problem could be solved by having a specialized partition for serial jobs, with a priority high enough to compensate for the job size factor.
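With the weights above, the multifactor plugin computes a job's priority roughly as follows (a sketch of the standard Slurm formula; each *_factor is a floating-point value between 0.0 and 1.0):
Job_priority = PriorityWeightFairshare * fairshare_factor   # 10000 * fairshare_factor
             + PriorityWeightAge       * age_factor         #  1000 * age_factor
             + PriorityWeightJobSize   * job_size_factor    #  1000 * job_size_factor
             + PriorityWeightPartition * partition_factor   #  1000 * partition_factor
             + PriorityWeightQOS       * qos_factor         #  1000 * qos_factor
Because PriorityWeightFairshare is ten times the other weights, fair share dominates the ranking; age, job size, partition, and QOS act mostly as tie-breakers.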
Health Check¶
Config | Value | Detail |
---|---|---|
HealthCheckProgram | /usr/sbin/nhc | nhc can be installed from https://github.com/mej/nhc. For more information see [1] and [2]. |
HealthCheckInterval | 3600 | |
HealthCheckNodeState | ANY | Run on nodes in any state. |
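In slurm.conf these health-check settings become (a sketch based on the table above):
HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=3600
HealthCheckNodeState=ANY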
Warning
According to this documentation, there are some bugs in nhc version 1.4.2.
Logging and Accounting¶
Config | Value | Detail |
---|---|---|
AccountingStorageType | accounting_storage/slurmdbd | |
AccountingStorageHost | slurmdbd | |
AccountingStoragePort | 6819 | |
AccountingStoreJobComment | YES | |
AccountingStorageEnforce | associations | Enforce job submission policies: jobs are accepted only if the submitting user has a valid association. |
ClusterName | tara | |
JobCompType | jobcomp/filetxt | If using the accounting infrastructure this plugin may not be of interest since the information here is redundant. |
JobAcctGatherFrequency | 30 | |
JobAcctGatherType | jobacct_gather/linux | |
SlurmctldLogFile | /var/log/slurm/slurmctld.log | |
SlurmdLogFile | /var/log/slurm/slurmd.log | |
SlurmSchedLogFile | /var/log/slurm/slurmsched.log | |
SlurmSchedLogLevel | 1 | Enable scheduler logging |
AccountingStorageTRES | [TBD] | Default: Billing, CPU, Energy, Memory, Node, and FS/Disk. Possible additions: GRES and license. |
AcctGatherEnergyType | acct_gather_energy/ipmi | [TBD] For energy consumption accounting. Only in the case of an exclusive job allocation will the energy consumption measurements reflect the job's real consumption. |
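A sketch of the corresponding accounting and logging lines in slurm.conf:
AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd
AccountingStoragePort=6819
AccountingStorageEnforce=associations
AccountingStoreJobComment=YES
ClusterName=tara
JobAcctGatherType=jobacct_gather/linux
JobAcctGatherFrequency=30
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
SlurmSchedLogLevel=1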
Prolog and Epilog Scripts¶
Config | Value | Detail |
---|---|---|
Prolog | | |
Epilog | | |
PrologFlags | contain | |
PrologSlurmctld | | Executed once on the ControlMachine for each job |
EpilogSlurmctld | | Executed once on the ControlMachine for each job |
For pam_slurm_adopt, PrologFlags=contain must be set in slurm.conf. This sets up the “extern” step into which ssh-launched processes will be adopted. For further discussion, see Issue 4098.
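To reject SSH logins from users without a running job on a node and adopt allowed sessions into the job's extern step, pam_slurm_adopt is added to the compute nodes' SSH PAM stack. A minimal sketch (the exact PAM file and module ordering depend on the distribution):
# /etc/pam.d/sshd on compute nodes
account    required    pam_slurm_adopt.so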
Node Configuration¶
Node Class | NodeName | Notes |
---|---|---|
freeipa | - | |
slurmctld | slurmctld | |
slurmdbd | slurmdbd | |
mysql | - | |
frontend | - | |
compute | tara-c-[001-006] | |
memory | tara-m-[001-002] | FAT nodes |
dgx | tara-dgx1-[001-002] | dgx1 is reserved. |
Warning
Changes in node configuration (e.g. adding nodes, changing their processor count, etc.) require restarting both the slurmctld daemon and the slurmd daemons.
Each NodeName should match the string returned by /bin/hostname -s on that node.
Note
The slurmd -C command can be used to print the hardware configuration of a compute node in a slurm.conf-compatible format.
slurm.conf¶
# COMPUTE NODES
NodeName=tara-c-[001-006] CPUs=4 RealMemory=512 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN TmpDisk=256
NodeName=tara-m-[001-002] CPUs=8 RealMemory=1024 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN TmpDisk=512
NodeName=tara-dgx1-[001-002] CPUs=4 RealMemory=1024 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN TmpDisk=512
# NodeName=tara-dgx1-[001-002] CPUs=4 RealMemory=1024 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 Gres=gpu:volta:8 State=UNKNOWN TmpDisk=512
Partitions¶
Partition | AllocNodes | MaxTime | State | Additional Parameters |
---|---|---|---|---|
debug (default) | tara-c-[001-002] | 02:00:00 | UP | DefaultTime=00:30:00 |
standby | tara-c-[001-006] | 120:00:00 | UP | |
memory | tara-m-[001-002] | 120:00:00 | UP | |
dgx | tara-dgx1-002 | 120:00:00 | UP | OverSubscribe=EXCLUSIVE |
biobank | tara-dgx1-001 | UNLIMITED | UP | AllowGroups=biobank OverSubscribe=EXCLUSIVE |
Jobs submitted to a partition with OverSubscribe=EXCLUSIVE will have exclusive access to all allocated nodes.
slurm.conf¶
# PARTITIONS
PartitionName=debug Nodes=tara-c-[001-002] Default=YES MaxTime=02:00:00 DefaultTime=00:30:00 State=UP
PartitionName=standby Nodes=tara-c-[001-006] MaxTime=120:00:00 State=UP
PartitionName=memory Nodes=tara-m-[001-002] MaxTime=120:00:00 State=UP
PartitionName=dgx Nodes=tara-dgx1-002 MaxTime=120:00:00 State=UP OverSubscribe=EXCLUSIVE
PartitionName=biobank Nodes=tara-dgx1-001 MaxTime=120:00:00 State=UP AllowGroups=biobank OverSubscribe=EXCLUSIVE
Accounting¶
With SlurmDBD, accounting is maintained by username (not UID). A username should refer to the same person across all of the computers. Authentication relies upon UIDs, so UIDs must be uniform across all computers.
Warning
Only lowercase usernames are supported.
SlurmDBD Configuration¶
SlurmDBD configuration is stored in the configuration file slurmdbd.conf.
This file should exist only on the computer where SlurmDBD executes and should
be readable only by the user that executes SlurmDBD.
Config | Value | Detail |
---|---|---|
AuthType | auth/munge | |
DbdHost | slurmdbd | The name of the machine where the slurmdbd daemon is executed |
DbdPort | 6819 | The port number that slurmdbd listens on for work. This value must be equal to the AccountingStoragePort parameter in the slurm.conf file. |
LogFile | /var/log/slurm/slurmdbd.log | |
PidFile | /var/run/slurm/slurmdbd.pid | |
SlurmUser | slurm | The name of the user that the slurmctld daemon executes as. This user must exist on the machine running slurmdbd and have the same UID as on the hosts where slurmctld executes. |
StorageHost | mysql | |
StorageLoc | | The default database is slurm_acct_db |
StoragePass | | |
StoragePort | | |
StorageType | accounting_storage/mysql | |
StorageUser | slurmdbd | |
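A sketch of the resulting slurmdbd.conf; StoragePass is omitted here, and the database name assumes the default slurm_acct_db:
# slurmdbd.conf
AuthType=auth/munge
DbdHost=slurmdbd
DbdPort=6819
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurm/slurmdbd.pid
SlurmUser=slurm
StorageType=accounting_storage/mysql
StorageHost=mysql
StorageUser=slurmdbd
#StoragePass=...           # MySQL password for the slurmdbd user
#StorageLoc=slurm_acct_db  # default database name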
Warning
slurmdbd must be responding when slurmctld is first started.
For slurmctld accounting configuration, see Logging and Accounting.
MPI¶
We will support only MPI libraries and versions that support the PMIx API, as follows:
- OpenMPI
- MPICH (version 3) (Do we need MPICH2 ?)
- IntelMPI
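Since MpiDefault is pmix, an MPI program can be launched directly with srun. A usage sketch (./hello_mpi is a hypothetical application name):
srun -N 2 -n 8 --mpi=pmix ./hello_mpi
srun --mpi=list    # list the MPI plugins available on the cluster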
Generic Resource (GRES) Scheduling¶
Since we require the DGX-1 nodes to be exclusively allocated, there is no need for GRES.
For more information, see DGX Best Practice.
Warning
gres.conf will always be located in the same directory as the slurm.conf file.
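If GPU scheduling via GRES is enabled later (matching the commented-out NodeName line in the node configuration above), the DGX nodes would also need a gres.conf along these lines; the device paths below are an assumption:
NodeName=tara-dgx1-[001-002] Name=gpu Type=volta File=/dev/nvidia[0-7]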
Topology¶
In the production system, this script will be used for generating topology.conf, and we will manually edit the file as needed.
Warning
topology.conf will always be located in the same directory as the slurm.conf file.
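A sketch of what a generated topology.conf might look like; the switch names and the node-to-switch mapping are placeholders to be replaced by the slurmibtopology output:
SwitchName=ibsw1 Nodes=tara-c-[001-006],tara-m-[001-002]
SwitchName=ibsw2 Nodes=tara-dgx1-[001-002]
SwitchName=ibcore Switches=ibsw[1-2]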
Cgroups¶
###
# cgroup.conf
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
#
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes
Note
Slurm documentation recommends stacking task/affinity,task/cgroup together when configuring TaskPlugin, and setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf. This setup uses the task/affinity plugin for setting the affinity of the tasks and uses the task/cgroup plugin to fence tasks into the specified resources, thus combining the best of both pieces.
Warning
cgroup.conf will always be located in the same directory as the slurm.conf file.
Job Preemption¶
The Tara configuration sets PreemptType to preempt/qos, which uses QOS to determine job preemption.
To add a QOS named biobank-preempt, use the following sacctmgr command:
sacctmgr add qos biobank-preempt PreemptMode=REQUEUE
PreemptMode=REQUEUE indicates that a job with this QOS will be requeued after preemption.
To add a QOS named biobank, which has a Priority value of 100 and can preempt jobs with the biobank-preempt QOS:
sacctmgr add qos biobank Priority=100 Preempt=biobank-preempt
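Jobs then request these QOSes at submission time. A usage sketch (myjob.sh is a hypothetical batch script; the submitting user must be permitted to use the QOS):
sbatch --qos=biobank-preempt myjob.sh
sbatch --qos=biobank myjob.sh    # may requeue running biobank-preempt jobs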
Notes¶
CPU Frequency Governor¶
From https://wiki.archlinux.org/index.php/CPU_frequency_scaling#Scaling_governors
Governor | Description |
---|---|
Performance | Run the CPU at the maximum frequency. |
PowerSave | Run the CPU at the minimum frequency. |
OnDemand | Scales the frequency dynamically according to current load. Jumps to the highest frequency and then possibly backs off as the idle time increases. |
UserSpace | Run the CPU at user specified frequencies. |
Conservative (not used) | Scales the frequency dynamically according to current load. Scales the frequency more gradually than ondemand. |
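Because several governors are allowed by CpuFreqGovernors, users can request a governor (or an explicit frequency) per job step with srun's --cpu-freq option. A usage sketch (./app is hypothetical):
srun --cpu-freq=PowerSave ./app      # run this step with the powersave governor
srun --cpu-freq=Performance ./app    # run this step at the maximum frequency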
Configure the Slurm PAM module to limit access to allocated compute nodes.
- On job termination, any processes initiated by the user outside of Slurm’s control may be killed using an Epilog script configured in slurm.conf.
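A minimal sketch of such an Epilog script; the path and the exact cleanup policy are assumptions, and slurm.conf would point to it with Epilog=/etc/slurm/epilog.sh:
#!/bin/bash
# Epilog sketch: runs as root on each allocated node when a job ends.
# Kill leftover processes of the job's user unless the user still has
# other jobs on this node (the job that is finishing is excluded).
if [ -n "$SLURM_JOB_USER" ] && [ "$SLURM_JOB_USER" != "root" ]; then
    other_jobs=$(squeue -h -o "%A" -u "$SLURM_JOB_USER" -w "$(hostname -s)" | grep -v "^${SLURM_JOB_ID}$")
    if [ -z "$other_jobs" ]; then
        pkill -KILL -u "$SLURM_JOB_USER" || true
    fi
fi
exit 0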