Slurm Configuration

Services

SLURM packages to be installed for each node class

Node Class Services
Controller (VM) slurm, slurm-perlapi, slurm-slurmctld
Compute slurm, slurm-perlapi, slurm-slurmd, slurm-pam
Frontend slurm, slurm-perlapi
SlurmDBD (VM) slurm, slurm-dbd
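
As a rough sketch, assuming the RPMs listed above have been built and published to an internal repository (the repository and the exact package names depend on the Slurm version and the spec file used to build them), installation per node class could look like:

# Controller (VM)
yum install -y slurm slurm-perlapi slurm-slurmctld
# Compute
yum install -y slurm slurm-perlapi slurm-slurmd slurm-pam
# Frontend
yum install -y slurm slurm-perlapi
# SlurmDBD (VM)
yum install -y slurm slurm-dbd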

Plugins Dependencies

List of plugins and their dependencies to be installed when building SLURM RPM packages.
We need to check that the resulting packages contain these plugins after installation.
Plugins Dependencies
MUNGE munge-devel
PAM Support pam-devel
cgroup Task Affinity hwloc-devel
IPMI Energy Consumption freeipmi-devel
Lua Support lua-devel
MySQL Support mysql-devel
X11 libssh2-devel
  • [TBD]
    • InfiniBand Accounting: libibmad-devel, libibumad-devel
    • cgroup NUMA Affinity: ???
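
A minimal sketch of preparing a build host, assuming the devel packages above are available from the distribution repositories and the RPMs are built from a release tarball (the tarball name and the plugin file names in the spot check are illustrative):

# Install the build dependencies so the corresponding plugins get built
yum install -y munge-devel pam-devel hwloc-devel freeipmi-devel \
    lua-devel mysql-devel libssh2-devel
# Build the SLURM RPMs from the release tarball
rpmbuild -ta slurm-*.tar.bz2
# Spot-check that the expected plugins ended up in the packages
rpm -qlp ~/rpmbuild/RPMS/x86_64/slurm-*.rpm | grep -E 'task_affinity|accounting_storage_mysql|acct_gather_energy_ipmi'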

Configuration

Configuration in /etc/slurm.conf

Config Value Detail
SlurmctldHost slurmctld Might need to be set to the fully qualified name slurmctld.hpc.nstda.or.th instead of the short name slurmctld
AuthType auth/munge  
CryptoType crypto/munge  
GresTypes   Removed gpu, See. Generic Resource (GRES) Scheduling
JobRequeue 1 Automatically requeue batch jobs after node fail or preemption.
LaunchType launch/slurm  
MailProg /bin/mail  
MpiDefault pmix  
PrivateData jobs,usage,users Prevents users from viewing the jobs or usage of any other user, and from viewing information about any user other than themselves.
ProctrackType proctrack/cgroup The slurmd daemon uses this mechanism to identify all processes which are children of processes it spawns for a user job step. The slurmd daemon must be restarted for a change in ProctrackType to take effect.
SlurmctldPidFile /var/run/slurm/slurmctld.pid Local file
SlurmctldPort 6817  
SlurmdPidFile /var/run/slurm/slurmd.pid Local file
SlurmdPort 6818  
SlurmdSpoolDir /var/spool/slurm/slurmd Should be local file system
SlurmUser slurm  
SlurmdUser root  
StateSaveLocation /var/spool/slurm/slurm.state Should be local file system
SwitchType switch/none  
TaskPlugin task/affinity,task/cgroup See. Cgroups
TaskPluginParam Sched  
TopologyPlugin topology/tree  
RoutePlugin route/topology [TBD]
TmpFS /tmp A node’s TmpDisk space
CpuFreqGovernors OnDemand, Performance, PowerSave, UserSpace See. CPU Frequency Governor
CpuFreqDef Performance Default: Run the CPU at the maximum frequency.
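
Collecting the table above into slurm.conf form gives roughly the following sketch (GresTypes is omitted and RoutePlugin is still [TBD], as noted above):

SlurmctldHost=slurmctld
AuthType=auth/munge
CryptoType=crypto/munge
JobRequeue=1
LaunchType=launch/slurm
MailProg=/bin/mail
MpiDefault=pmix
PrivateData=jobs,usage,users
ProctrackType=proctrack/cgroup
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/slurm.state
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=Sched
TopologyPlugin=topology/tree
# RoutePlugin=route/topology   # [TBD]
TmpFS=/tmp
CpuFreqGovernors=OnDemand,Performance,PowerSave,UserSpace
CpuFreqDef=Performance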

Note

The topology.conf file for an Infiniband switch can be automatically generated using the slurmibtopology tool found here: https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Job Scheduling

Config Value Detail
FastSchedule 1  
SchedulerType sched/backfill  
SchedulerParameters    
SelectType select/cons_res See. Consumable Resources in Slurm
SelectTypeParameters CR_Socket_Memory Sockets and memory are consumable resources.
KillWait 30 The interval given to a job’s processes between the SIGTERM and SIGKILL signals upon reaching its time limit.
OverTimeLimit 5 Number of minutes by which a job can exceed its time limit before being canceled.
PreemptMode REQUEUE Preempts jobs by requeuing them (if possible) or canceling them.
PreemptType preempt/qos Job preemption rules are specified by Quality Of Service (QOS).
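
The corresponding slurm.conf lines might look like the following sketch (SchedulerParameters is left unset, as in the table):

FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory
KillWait=30
OverTimeLimit=5
PreemptMode=REQUEUE
PreemptType=preempt/qos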

Job Priority

Config Value Detail
PriorityType priority/multifactor See. Multifactor plugin
PriorityDecayHalfLife 7-0 The impact of historical usage (for fair share) decays with a half-life of 7 days.
PriorityCalcPeriod 5 The half-life decay will be re-calculated every 5 minutes.
PriorityFavorSmall NO Larger jobs will have higher priority. Allocating the whole machine results in a job size factor of 1.0.
PriorityFlags TBD  
PriorityMaxAge 7-0 A job gets the maximum age factor (1.0) once it has been in the queue for more than 7 days.
PriorityUsageResetPeriod NONE Never clear historic usage
PriorityWeightAge 1000  
PriorityWeightFairshare 10000  
PriorityWeightJobSize 1000  
PriorityWeightPartition 1000  
PriorityWeightQOS 1000  
PriorityWeightTRES    
  • If PriorityFavorSmall is set to YES, a single-node job will receive the 1.0 job size factor.

  • [TBD] Some interesting values for PriorityFlags

    • ACCRUE_ALWAYS: Priority age factor will be increased despite job dependencies or holds.

      This could be beneficial for BioBank jobs, where jobs have dependencies, so a dependent job could run as soon as the prior job finishes thanks to its high age factor. However, users could abuse this by submitting many jobs and holding them to accumulate age factor.

    • SMALL_RELATIVE_TO_TIME: The job’s size component will be based upon the job size divided by the time limit.

      In layman’s terms, a job with a large allocation and a short walltime is preferable. This could promote better user behavior, since users who estimate their needs more accurately will get better priority, which should eventually encourage users to parallelize their programs. However, long-running serial programs (e.g. MATLAB, if limited by the license) will face a problem when trying to run on the system. Such a problem could be solved by having a specialized partition for serial jobs, with a priority high enough to compensate for the job size.
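
Putting the priority table into slurm.conf form gives roughly the following sketch (PriorityFlags and PriorityWeightTRES remain [TBD] and are therefore commented out):

PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=5
PriorityFavorSmall=NO
# PriorityFlags=       # [TBD], see the discussion above
PriorityMaxAge=7-0
PriorityUsageResetPeriod=NONE
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=1000
# PriorityWeightTRES=  # [TBD]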

Health Check

Config Value Detail
HealthCheckProgram /usr/sbin/nhc nhc can be installed from https://github.com/mej/nhc. For more information See. [1] and [2]
HealthCheckInterval 3600  
HealthCheckNodeState ANY Run on nodes in any state. Other possible values: ALLOC, MIXED.
  • Should we set HealthCheckNodeState to IDLE to avoid the performance impact?
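
A sketch of the corresponding slurm.conf lines, assuming nhc is installed at the path shown above:

HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=3600
HealthCheckNodeState=ANY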

Warning

According to this documentation, there are some bugs in nhc version 1.4.2.

Logging and Accounting

Config Value Detail
AccountingStorageType accounting_storage/slurmdbd  
AccountingStorageHost slurmdbd  
AccountingStoragePort 6819  
AccountingStoreJobComment YES  
AccountingStorageEnforce associations Enforce the following job submission policies:
  • associations: No new job is allowed to run unless a corresponding association exists in the system.
ClusterName tara  
JobCompType jobcomp/filetxt If using the accounting infrastructure this plugin may not be of interest since the information here is redundant.
JobAcctGatherFrequency 30  
JobAcctGatherType jobacct_gather/linux  
SlurmctldLogFile /var/log/slurm/slurmctld.log  
SlurmdLogFile /var/log/slurm/slurmd.log  
SlurmSchedLogFile /var/log/slurm/slurmsched.log  
SlurmSchedLogLevel 1 Enable scheduler logging
AccountingStorageTRES   [TBD] Default: Billing, CPU, Energy, Memory, Node, and FS/Disk. Possible additions: GRES and license.
AcctGatherEnergyType acct_gather_energy/ipmi [TBD] For energy consumption accounting. Only in the case of exclusive job allocation will the energy consumption measurements reflect the job’s real consumption.
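
A sketch of the logging and accounting block in slurm.conf (the [TBD] entries are commented out until decided):

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd
AccountingStoragePort=6819
AccountingStoreJobComment=YES
AccountingStorageEnforce=associations
ClusterName=tara
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
SlurmSchedLogLevel=1
# AccountingStorageTRES=                         # [TBD]
# AcctGatherEnergyType=acct_gather_energy/ipmi   # [TBD]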

Prolog and Epilog Scripts

Config Value Detail
Prolog    
Epilog    
PrologFlags contain  
PrologSlurmctld   Executed once on the ControlMachine for each job
EpilogSlurmctld   Executed once on the ControlMachine for each job
  • pam_slurm_adopt: PrologFlags=contain must be set in slurm.conf. This sets up the “extern” step into which ssh-launched processes will be adopted. For further discussion See. Issue 4098.
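
A minimal sketch of wiring this up, assuming pam_slurm_adopt is provided by the slurm-pam package listed above for compute nodes and that sshd is the entry point being restricted (the pam.d layout is distribution dependent):

# slurm.conf
PrologFlags=contain

# /etc/pam.d/sshd -- appended to the end of the account stack
account    required     pam_slurm_adopt.so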

Node Configuration

Node Class NodeName Notes
freeipa -  
slurmctld slurmctld  
slurmdbd slurmdbd  
mysql -  
frontend -  
compute tara-c-[001-006]  
memory tara-m-[001-002] FAT nodes
dgx tara-dgx1-[001-002] tara-dgx1-001 is reserved for the biobank partition (See. Partitions).

Warning

Changes in node configuration (e.g. adding nodes, changing their processor count, etc.) require restarting both the slurmctld daemon and the slurmd daemons.

NodeName: The name used by all Slurm tools when referring to the node
NodeAddr: The name or IP address Slurm uses to communicate with the node
NodeHostname: The name returned by the command /bin/hostname -s

TmpDisk: Total size of temporary disk storage in TmpFS in megabytes (e.g. “16384”). TmpFS (for “Temporary File System”) identifies the location which jobs should use for temporary storage. Note this does not indicate the amount of free space available to the user on the node, only the total file system size. The system administrator should ensure this file system is purged as needed so that user jobs have access to most of this space. The Prolog and/or Epilog programs (specified in the configuration file) might be used to ensure the file system is kept clean.

Note

The slurmd -C command can be used to print the hardware configuration of a compute node in a slurm.conf-compatible format.
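
For example, running it on a compute node prints output along these lines (the values below are illustrative, not real measurements):

slurmd -C
# NodeName=tara-c-001 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=512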

slurm.conf

# COMPUTE NODES
NodeName=tara-c-[001-006] CPUs=4 RealMemory=512 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN TmpDisk=256
NodeName=tara-m-[001-002] CPUs=8 RealMemory=1024 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN TmpDisk=512
NodeName=tara-dgx1-[001-002] CPUs=4 RealMemory=1024 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN TmpDisk=512
# NodeName=tara-dgx1-[001-002] CPUs=4 RealMemory=1024 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 Gres=gpu:volta:8 State=UNKNOWN TmpDisk=512

Partitions

Partition Nodes MaxTime State Additional Parameters
debug (default) tara-c-[001-002] 02:00:00 UP DefaultTime=00:30:00
standby tara-c-[001-006] 120:00:00 UP  
memory tara-m-[001-002] 120:00:00 UP  
dgx tara-dgx1-002 120:00:00 UP OverSubscribe=EXCLUSIVE
biobank tara-dgx1-001 UNLIMITED UP AllowGroups=biobank OverSubscribe=EXCLUSIVE
AllowAccounts: Comma separated list of accounts which may execute jobs in the partition. The default value is “ALL”.
AllowGroups: Comma separated list of group names which may execute jobs in the partition. If at least one group associated with the user attempting to execute the job is in AllowGroups, the user will be permitted to use this partition. Jobs executed as user root can use any partition without regard to the value of AllowGroups.
AllowQos: Comma separated list of Qos which may execute jobs in the partition. Jobs executed as user root can use any partition without regard to the value of AllowQos.
OverSubscribe: Controls the ability of the partition to execute more than one job at a time on each resource. Jobs that run in partitions with OverSubscribe=EXCLUSIVE will have exclusive access to all allocated nodes.

slurm.conf

# PARTITIONS
PartitionName=debug Nodes=tara-c-[001-002] Default=YES MaxTime=02:00:00 DefaultTime=00:30:00 State=UP
PartitionName=standby Nodes=tara-c-[001-006] MaxTime=120:00:00 State=UP
PartitionName=memory Nodes=tara-m-[001-002] MaxTime=120:00:00 State=UP
PartitionName=dgx Nodes=tara-dgx1-002 MaxTime=120:00:00 State=UP OverSubscribe=EXCLUSIVE
PartitionName=biobank Nodes=tara-dgx1-001 MaxTime=120:00:00 State=UP AllowGroups=biobank OverSubscribe=EXCLUSIVE

Accounting

With SlurmDBD, accounting is maintained by username (not UID). A username should refer to the same person across all of the computers. Authentication relies upon UIDs, so UIDs must be uniform across all computers.

Warning

Only lowercase usernames are supported.

SlurmDBD Configuration

SlurmDBD configuration is stored in a configuration file slurmdbd.conf. This file should be only on the computer where SlurmDBD executes and should only be readable by the user which executes SlurmDBD.

Config Value Detail
AuthType auth/munge  
DbdHost slurmdbd The name of the machine where the slurmdbd daemon is executed
DbdPort 6819 The port number that the slurmdbd listens to for work. This value must be equal to the AccountingStoragePort parameter in the slurm.conf file.
LogFile /var/log/slurm/slurmdbd.log  
PidFile /var/run/slurm/slurmdbd.pid  
SlurmUser slurm The name of the user that the slurmctld daemon executes as. This user must have the same UID on the hosts where slurmctld executes.
StorageHost mysql  
StorageLoc   The default database is slurm_acct_db
StoragePass    
StoragePort    
StorageType accounting_storage/mysql  
StorageUser slurmdbd  
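
Putting the table together, slurmdbd.conf might look like the following sketch (StorageLoc, StoragePass, and StoragePort are left as placeholders since they are not fixed above):

AuthType=auth/munge
DbdHost=slurmdbd
DbdPort=6819
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurm/slurmdbd.pid
SlurmUser=slurm
StorageHost=mysql
StorageType=accounting_storage/mysql
StorageUser=slurmdbd
# StorageLoc=    # defaults to slurm_acct_db
# StoragePass=   # MySQL password for StorageUser
# StoragePort=   # defaults to the MySQL port if unset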

Warning

slurmdbd must be responding when slurmctld is first started.

For slurmctld accounting configuration See. Logging and Accounting

MPI

We will support only MPI libraries and versions that support the PMIx API, as follows:

  • OpenMPI
  • MPICH (version 3) (Do we need MPICH2?)
  • IntelMPI
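
Since MpiDefault is set to pmix, srun launches MPI ranks through PMIx by default; it can also be requested explicitly. A sketch of a job script (the application name and resource counts are placeholders):

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --partition=standby
srun --mpi=pmix ./my_mpi_app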

Generic Resource (GRES) Scheduling

Since we require the DGX-1 nodes to be exclusively allocated, there is no need for GRES.

For more information, see. DGX Best Practice

Warning

gres.conf will always be located in the same directory as the slurm.conf file.

Topology

In the production system, the slurmibtopology script (see the Note above) will be used to generate topology.conf, and we will manually edit the file as needed.
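
The generated file follows the topology/tree format; a purely hypothetical sketch (the switch names and the node-to-switch mapping below are placeholders, not the real fabric layout):

SwitchName=ibsw1 Nodes=tara-c-[001-006]
SwitchName=ibsw2 Nodes=tara-m-[001-002],tara-dgx1-[001-002]
SwitchName=ibcore Switches=ibsw[1-2]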

Warning

topology.conf will always be located in the same directory as the slurm.conf file.

Cgroups

###
# cgroup.conf
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
#
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes

Note

Slurm documentation recommends stacking task/affinity,task/cgroup together when configuring TaskPlugin, and setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf. This setup uses the task/affinity plugin for setting the affinity of the tasks and uses the task/cgroup plugin to fence tasks into the specified resources, thus combining the best of both pieces.

Warning

cgroup.conf will always be located in the same directory as the slurm.conf file.

Job Preemption

The Tara configuration sets PreemptType to preempt/qos, which uses QOS to determine job preemption.

To add a QOS named biobank-preempt, use the following sacctmgr command:

sacctmgr add qos biobank-preempt PreemptMode=REQUEUE

PreemptMode=REQUEUE indicates that a job with this QOS will be requeued after being preempted.

To add a QOS named biobank, which has a Priority value of 100 and can preempt jobs with the biobank-preempt QOS, use the following command:

sacctmgr add qos biobank set Priority=100 Preempt=biobank-preempt
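
For the preemption rules to apply, the QOS values still have to be attached to the relevant associations; a hedged example, where the user and account names are placeholders:

sacctmgr modify user where name=someuser set qos+=biobank
sacctmgr modify account where name=someaccount set qos+=biobank-preempt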

Notes

CPU Frequency Governor

From https://wiki.archlinux.org/index.php/CPU_frequency_scaling#Scaling_governors

Governor Description
Performance Run the CPU at the maximum frequency.
PowerSave Run the CPU at the minimum frequency.
OnDemand Scales the frequency dynamically according to current load. Jumps to the highest frequency and then possibly backs off as the idle time increases.
UserSpace Run the CPU at user specified frequencies.
Conservative (not used) Scales the frequency dynamically according to current load. Scales the frequency more gradually than ondemand.
  • Configure SLURM PAM module to limit access to allocated compute nodes.

    • On job termination, any processes initiated by the user outside of Slurm’s control may be killed using an Epilog script configured in slurm.conf.
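
A minimal epilog sketch along those lines (the clean-up policy is an assumption; the slurm.epilog.clean script shipped with the Slurm source covers the same idea and should be preferred where it fits):

#!/bin/bash
# Hypothetical epilog: kill processes the job's user left behind once that
# user has no other jobs running on this node.
[ -n "$SLURM_JOB_USER" ] || exit 0
[ "$SLURM_JOB_USER" != "root" ] || exit 0
# List the user's other jobs on this node, excluding the one that just ended.
other_jobs=$(squeue -h -o %A -u "$SLURM_JOB_USER" -w "$(hostname -s)" \
             | grep -v "^${SLURM_JOB_ID}$")
if [ -z "$other_jobs" ]; then
    pkill -KILL -u "$SLURM_JOB_USER"
fi
exit 0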