Slurm Configuration

Services

SLURM packages to be installed for each node class

Node Class Services
Controller (VM) slurm, slurm-perlapi, slurm-slurmctld
Compute slurm, slurm-perlapi, slurm-slurmd, slurm-pam
Frontend slurm, slurm-perlapi
SlurmDBD (VM) slurm, slurm-dbd
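
As a rough sketch, assuming the RPMs listed above have been built and published to an internal repository (the repository and the exact package names depend on the Slurm version and the spec file used to build them), installation per node class could look like:

# Controller (VM)
yum install -y slurm slurm-perlapi slurm-slurmctld
# Compute
yum install -y slurm slurm-perlapi slurm-slurmd slurm-pam
# Frontend
yum install -y slurm slurm-perlapi
# SlurmDBD (VM)
yum install -y slurm slurm-dbd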

Plugins Dependencies

List of plugins and their dependencies to be installed when building SLURM RPM packages.
We need to check that the resulting packages contain these plugins after installation.
Plugins Dependencies
MUNGE munge-devel
PAM Support pam-devel
cgroup Task Affinity hwloc-devel
IPMI Energy Consumption freeipmi-devel
Lua Support lua-devel
MySQL Support mysql-devel
X11 libssh2-devel
  • [TBD]
    • InfiniBand Accounting: libibmad-devel, libibumad-devel
    • cgroup NUMA Affinity: ???
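
A minimal sketch of preparing a build host, assuming the devel packages above are available from the distribution repositories and the RPMs are built from a release tarball (the tarball name and the plugin file names in the spot check are illustrative):

# Install the build dependencies so the corresponding plugins get built
yum install -y munge-devel pam-devel hwloc-devel freeipmi-devel \
    lua-devel mysql-devel libssh2-devel
# Build the SLURM RPMs from the release tarball
rpmbuild -ta slurm-*.tar.bz2
# Spot-check that the expected plugins ended up in the packages
rpm -qlp ~/rpmbuild/RPMS/x86_64/slurm-*.rpm | grep -E 'task_affinity|accounting_storage_mysql|acct_gather_energy_ipmi'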

Configuration

Configuration in /etc/slurm.conf

Config Value Detail
SlurmctldHost slurmctld Might need to be set to the fully qualified name slurmctld.hpc.nstda.or.th instead of the short name slurmctld
AuthType auth/munge  
CryptoType crypto/munge  
GresTypes   Removed gpu, See. Generic Resource (GRES) Scheduling
JobRequeue 1 Automatically requeue batch jobs after node fail or preemption.
LaunchType launch/slurm  
MailProg /bin/mail  
MpiDefault pmix  
PrivateData jobs,usage,users Prevents users from viewing the jobs or usage of any other user, and from viewing information about any user other than themselves.
ProctrackType proctrack/cgroup The slurmd daemon uses this mechanism to identify all processes which are children of processes it spawns for a user job step. The slurmd daemon must be restarted for a change in ProctrackType to take effect.
SlurmctldPidFile /var/run/slurm/slurmctld.pid Local file
SlurmctldPort 6817  
SlurmdPidFile /var/run/slurm/slurmd.pid Local file
SlurmdPort 6818  
SlurmdSpoolDir /var/spool/slurm/slurmd Should be local file system
SlurmUser slurm  
SlurmdUser root  
StateSaveLocation /var/spool/slurm/slurm.state Should be local file system
SwitchType switch/none  
TaskPlugin task/affinity,task/cgroup See. Cgroups
TaskPluginParam Sched  
TopologyPlugin topology/tree  
RoutePlugin route/topology [TBD]
TmpFS /tmp A node’s TmpDisk space
CpuFreqGovernors OnDemand, Performance, PowerSave, UserSpace See. CPU Frequency Governor
CpuFreqDef Performance Default: Run the CPU at the maximum frequency.
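
Collecting the table above into slurm.conf form gives roughly the following sketch (GresTypes is omitted and RoutePlugin is still [TBD], as noted above):

SlurmctldHost=slurmctld
AuthType=auth/munge
CryptoType=crypto/munge
JobRequeue=1
LaunchType=launch/slurm
MailProg=/bin/mail
MpiDefault=pmix
PrivateData=jobs,usage,users
ProctrackType=proctrack/cgroup
SlurmctldPidFile=/var/run/slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurm/slurmd
SlurmUser=slurm
SlurmdUser=root
StateSaveLocation=/var/spool/slurm/slurm.state
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
TaskPluginParam=Sched
TopologyPlugin=topology/tree
# RoutePlugin=route/topology   # [TBD]
TmpFS=/tmp
CpuFreqGovernors=OnDemand,Performance,PowerSave,UserSpace
CpuFreqDef=Performance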

Note

The topology.conf file for an Infiniband switch can be automatically generated using the slurmibtopology tool found here: https://ftp.fysik.dtu.dk/Slurm/slurmibtopology.sh

Job Scheduling

Config Value Detail
FastSchedule 1  
SchedulerType sched/backfill  
SchedulerParameters    
SelectType select/cons_res See. Consumable Resources in Slurm
SelectTypeParameters CR_Socket_Memory Sockets and memory are consumable resources.
KillWait 30 The interval given to a job’s processes between the SIGTERM and SIGKILL signals upon reaching its time limit.
OverTimeLimit 5 Number of minutes by which a job can exceed its time limit before being canceled.
PreemptMode REQUEUE Preempts jobs by requeuing them (if possible) or canceling them.
PreemptType preempt/qos Job preemption rules are specified by Quality Of Service (QOS).
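
The corresponding slurm.conf lines might look like the following sketch (SchedulerParameters is left unset, as in the table):

FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_Socket_Memory
KillWait=30
OverTimeLimit=5
PreemptMode=REQUEUE
PreemptType=preempt/qos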

Job Priority

Config Value Detail
PriorityType priority/multifactor See. Multifactor plugin
PriorityDecayHalfLife 7-0 The impact of historical usage (for fair share) decays with a half-life of 7 days.
PriorityCalcPeriod 5 The half-life decay will be re-calculated every 5 minutes.
PriorityFavorSmall NO Larger jobs will have higher priority. Allocating the whole machine results in a job size factor of 1.0.
PriorityFlags TBD  
PriorityMaxAge 7-0 A job gets the maximum age factor (1.0) once it has been in the queue for more than 7 days.
PriorityUsageResetPeriod NONE Never clear historic usage
PriorityWeightAge 1000  
PriorityWeightFairshare 10000  
PriorityWeightJobSize 1000  
PriorityWeightPartition 1000  
PriorityWeightQOS 1000  
PriorityWeightTRES    
  • If PriorityFavorSmall is set to YES, a single-node job will receive the 1.0 job size factor.

  • [TBD] Some interesting values for PriorityFlags

    • ACCRUE_ALWAYS: Priority age factor will be increased despite job dependencies or holds.

      This could be beneficial for BioBank jobs, where jobs have dependencies, so a dependent job could run as soon as the prior job finishes thanks to its high age factor. However, users could abuse this by submitting many jobs and holding them to accumulate age factor.

    • SMALL_RELATIVE_TO_TIME: The job’s size component will be based upon the job size divided by the time limit.

      In layman’s terms, a job with a large allocation and a short walltime is preferable. This could promote better user behavior, since users who estimate their needs more accurately will get better priority, which should eventually encourage users to parallelize their programs. However, long-running serial programs (e.g. MATLAB, if limited by the license) will face a problem when trying to run on the system. Such a problem could be solved by having a specialized partition for serial jobs, with a priority high enough to compensate for the job size.
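
Putting the priority table into slurm.conf form gives roughly the following sketch (PriorityFlags and PriorityWeightTRES remain [TBD] and are therefore commented out):

PriorityType=priority/multifactor
PriorityDecayHalfLife=7-0
PriorityCalcPeriod=5
PriorityFavorSmall=NO
# PriorityFlags=       # [TBD], see the discussion above
PriorityMaxAge=7-0
PriorityUsageResetPeriod=NONE
PriorityWeightAge=1000
PriorityWeightFairshare=10000
PriorityWeightJobSize=1000
PriorityWeightPartition=1000
PriorityWeightQOS=1000
# PriorityWeightTRES=  # [TBD]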

Health Check

Config Value Detail
HealthCheckProgram /usr/sbin/nhc nhc can be installed from https://github.com/mej/nhc. For more information See. [1] and [2]
HealthCheckInterval 3600  
HealthCheckNodeState ANY Run on nodes in any state. Other possible values: ALLOC, MIXED.
  • Should we set HealthCheckNodeState to IDLE to avoid the performance impact?
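
A sketch of the corresponding slurm.conf lines, assuming nhc is installed at the path shown above:

HealthCheckProgram=/usr/sbin/nhc
HealthCheckInterval=3600
HealthCheckNodeState=ANY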

Warning

According to this documentation, there are some bugs in nhc version 1.4.2.

Logging and Accounting

Config Value Detail
AccountingStorageType accounting_storage/slurmdbd  
AccountingStorageHost slurmdbd  
AccountingStoragePort 6819  
AccountingStoreJobComment YES  
AccountingStorageEnforce associations Enforce the following job submission policies:
  • associations: No new job is allowed to run unless a corresponding association exists in the system.
ClusterName tara  
JobCompType jobcomp/filetxt If using the accounting infrastructure this plugin may not be of interest since the information here is redundant.
JobAcctGatherFrequency 30  
JobAcctGatherType jobacct_gather/linux  
SlurmctldLogFile /var/log/slurm/slurmctld.log  
SlurmdLogFile /var/log/slurm/slurmd.log  
SlurmSchedLogFile /var/log/slurm/slurmsched.log  
SlurmSchedLogLevel 1 Enable scheduler logging
AccountingStorageTRES   [TBD] Default: Billing, CPU, Energy, Memory, Node, and FS/Disk. Possible additions: GRES and license.
AcctGatherEnergyType acct_gather_energy/ipmi [TBD] For energy consumption accounting. Only in the case of exclusive job allocation will the energy consumption measurements reflect the job’s real consumption.
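
A sketch of the logging and accounting block in slurm.conf (the [TBD] entries are commented out until decided):

AccountingStorageType=accounting_storage/slurmdbd
AccountingStorageHost=slurmdbd
AccountingStoragePort=6819
AccountingStoreJobComment=YES
AccountingStorageEnforce=associations
ClusterName=tara
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
SlurmSchedLogFile=/var/log/slurm/slurmsched.log
SlurmSchedLogLevel=1
# AccountingStorageTRES=                         # [TBD]
# AcctGatherEnergyType=acct_gather_energy/ipmi   # [TBD]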

Prolog and Epilog Scripts

Config Value Detail
Prolog    
Epilog    
PrologFlags contain  
PrologSlurmctld   Executed once on the ControlMachine for each job
EpilogSlurmctld   Executed once on the ControlMachine for each job
  • pam_slurm_adopt: PrologFlags=contain must be set in slurm.conf. This sets up the “extern” step into which ssh-launched processes will be adopted. For further discussion See. Issue 4098.
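
A minimal sketch of wiring this up, assuming pam_slurm_adopt is provided by the slurm-pam package listed above for compute nodes and that sshd is the entry point being restricted (the pam.d layout is distribution dependent):

# slurm.conf
PrologFlags=contain

# /etc/pam.d/sshd -- appended to the end of the account stack
account    required     pam_slurm_adopt.so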

Node Configuration

Node Class NodeName Notes
freeipa -  
slurmctld slurmctld  
slurmdbd slurmdbd  
mysql -  
frontend -  
compute tara-c-[001-006]  
memory tara-m-[001-002] FAT nodes
dgx tara-dgx1-[001-002] tara-dgx1-001 is reserved for the biobank partition (See. Partitions).

Warning

Changes in node configuration (e.g. adding nodes, changing their processor count, etc.) require restarting both the slurmctld daemon and the slurmd daemons.

NodeName: The name used by all Slurm tools when referring to the node
NodeAddr: The name or IP address Slurm uses to communicate with the node
NodeHostname: The name returned by the command /bin/hostname -s

TmpDisk: Total size of temporary disk storage in TmpFS in megabytes (e.g. “16384”). TmpFS (for “Temporary File System”) identifies the location which jobs should use for temporary storage. Note this does not indicate the amount of free space available to the user on the node, only the total file system size. The system administrator should ensure this file system is purged as needed so that user jobs have access to most of this space. The Prolog and/or Epilog programs (specified in the configuration file) might be used to ensure the file system is kept clean.

Note

The slurmd -C command can be used to print the hardware configuration of a compute node in a slurm.conf-compatible format.
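
For example, running it on a compute node prints output along these lines (the values below are illustrative, not real measurements):

slurmd -C
# NodeName=tara-c-001 CPUs=4 Boards=1 SocketsPerBoard=2 CoresPerSocket=2 ThreadsPerCore=1 RealMemory=512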

slurm.conf

# COMPUTE NODES
NodeName=tara-c-[001-006] CPUs=4 RealMemory=512 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN TmpDisk=256
NodeName=tara-m-[001-002] CPUs=8 RealMemory=1024 Sockets=2 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN TmpDisk=512
NodeName=tara-dgx1-[001-002] CPUs=4 RealMemory=1024 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN TmpDisk=512
# NodeName=tara-dgx1-[001-002] CPUs=4 RealMemory=1024 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 Gres=gpu:volta:8 State=UNKNOWN TmpDisk=512

Partitions

Partition Nodes MaxTime State Additional Parameters
debug (default) tara-c-[001-002] 02:00:00 UP DefaultTime=00:30:00
standby tara-c-[001-006] 120:00:00 UP  
memory tara-m-[001-002] 120:00:00 UP  
dgx tara-dgx1-002 120:00:00 UP OverSubscribe=EXCLUSIVE
biobank tara-dgx1-001 UNLIMITED UP AllowGroups=biobank OverSubscribe=EXCLUSIVE
AllowAccounts: Comma separated list of accounts which may execute jobs in the partition. The default value is “ALL”.
AllowGroups: Comma separated list of group names which may execute jobs in the partition. If at least one group associated with the user attempting to execute the job is in AllowGroups, the user will be permitted to use this partition. Jobs executed as user root can use any partition without regard to the value of AllowGroups.
AllowQos: Comma separated list of Qos which may execute jobs in the partition. Jobs executed as user root can use any partition without regard to the value of AllowQos.
OverSubscribe: Controls the ability of the partition to execute more than one job at a time on each resource. Jobs that run in partitions with OverSubscribe=EXCLUSIVE will have exclusive access to all allocated nodes.

slurm.conf

# PARTITIONS
PartitionName=debug Nodes=tara-c-[001-002] Default=YES MaxTime=02:00:00 DefaultTime=00:30:00 State=UP
PartitionName=standby Nodes=tara-c-[001-006] MaxTime=120:00:00 State=UP
PartitionName=memory Nodes=tara-m-[001-002] MaxTime=120:00:00 State=UP
PartitionName=dgx Nodes=tara-dgx1-002 MaxTime=120:00:00 State=UP OverSubscribe=EXCLUSIVE
PartitionName=biobank Nodes=tara-dgx1-001 MaxTime=120:00:00 State=UP AllowGroups=biobank OverSubscribe=EXCLUSIVE

Accounting

With SlurmDBD, accounting is maintained by username (not UID). A username should refer to the same person across all of the computers. Authentication relies upon UIDs, so UIDs must be uniform across all computers.

Warning

Only lowercase usernames are supported.

SlurmDBD Configuration

SlurmDBD configuration is stored in a configuration file slurmdbd.conf. This file should be only on the computer where SlurmDBD executes and should only be readable by the user which executes SlurmDBD.

Config Value Detail
AuthType auth/munge  
DbdHost slurmdbd The name of the machine where the slurmdbd daemon is executed
DbdPort 6819 The port number that the slurmdbd listens to for work. This value must be equal to the AccountingStoragePort parameter in the slurm.conf file.
LogFile /var/log/slurm/slurmdbd.log  
PidFile /var/run/slurm/slurmdbd.pid  
SlurmUser slurm The name of the user that the slurmctld daemon executes as. This user must have the same UID on the hosts where slurmctld executes.
StorageHost mysql  
StorageLoc   The default database is slurm_acct_db
StoragePass    
StoragePort    
StorageType accounting_storage/mysql  
StorageUser slurmdbd  
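
Putting the table together, slurmdbd.conf might look like the following sketch (StorageLoc, StoragePass, and StoragePort are left as placeholders since they are not fixed above):

AuthType=auth/munge
DbdHost=slurmdbd
DbdPort=6819
LogFile=/var/log/slurm/slurmdbd.log
PidFile=/var/run/slurm/slurmdbd.pid
SlurmUser=slurm
StorageHost=mysql
StorageType=accounting_storage/mysql
StorageUser=slurmdbd
# StorageLoc=    # defaults to slurm_acct_db
# StoragePass=   # MySQL password for StorageUser
# StoragePort=   # defaults to the MySQL port if unset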

Warning

slurmdbd must be responding when slurmctld is first started.

For slurmctld accounting configuration See. Logging and Accounting

MPI

We will support only MPI libraries and versions that support the PMIx API, as follows:

  • OpenMPI
  • MPICH (version 3) (Do we need MPICH2?)
  • IntelMPI
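
Since MpiDefault is set to pmix, srun launches MPI ranks through PMIx by default; it can also be requested explicitly. A sketch of a job script (the application name and resource counts are placeholders):

#!/bin/bash
#SBATCH --ntasks=8
#SBATCH --partition=standby
srun --mpi=pmix ./my_mpi_app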

Generic Resource (GRES) Scheduling

Since we require the DGX-1 nodes to be exclusively allocated, there is no need for GRES.

For more information, see. DGX Best Practice

Warning

gres.conf will always be located in the same directory as the slurm.conf file.

Topology

In the production system, the slurmibtopology script (see the Note above) will be used to generate topology.conf, and we will manually edit the file as needed.
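
The generated file follows the topology/tree format; a purely hypothetical sketch (the switch names and the node-to-switch mapping below are placeholders, not the real fabric layout):

SwitchName=ibsw1 Nodes=tara-c-[001-006]
SwitchName=ibsw2 Nodes=tara-m-[001-002],tara-dgx1-[001-002]
SwitchName=ibcore Switches=ibsw[1-2]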

Warning

topology.conf will always be located in the same directory as the slurm.conf file.

Cgroups

###
# cgroup.conf
# Slurm cgroup support configuration file
###
CgroupAutomount=yes
#
TaskAffinity=no
ConstrainCores=yes
ConstrainRAMSpace=yes

Note

Slurm documentation recommends stacking task/affinity,task/cgroup together when configuring TaskPlugin, and setting TaskAffinity=no and ConstrainCores=yes in cgroup.conf. This setup uses the task/affinity plugin for setting the affinity of the tasks and uses the task/cgroup plugin to fence tasks into the specified resources, thus combining the best of both pieces.

Warning

cgroup.conf will always be located in the same directory as the slurm.conf file.

Job Preemption

The Tara configuration sets PreemptType to preempt/qos, which uses QOS to determine job preemption.

To add a QOS named biobank-preempt, use the following sacctmgr command:

sacctmgr add qos biobank-preempt PreemptMode=REQUEUE

PreemptMode=REQUEUE indicates that a job with this QOS will be requeued after being preempted.

To add a QOS named biobank, which has a Priority value of 100 and can preempt jobs with the biobank-preempt QOS, use the following command:

sacctmgr add qos biobank set Priority=100 Preempt=biobank-preempt
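
For the preemption rules to apply, the QOS values still have to be attached to the relevant associations; a hedged example, where the user and account names are placeholders:

sacctmgr modify user where name=someuser set qos+=biobank
sacctmgr modify account where name=someaccount set qos+=biobank-preempt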

Notes

CPU Frequency Governor

From https://wiki.archlinux.org/index.php/CPU_frequency_scaling#Scaling_governors

Governor Description
Performance Run the CPU at the maximum frequency.
PowerSave Run the CPU at the minimum frequency.
OnDemand Scales the frequency dynamically according to current load. Jumps to the highest frequency and then possibly backs off as the idle time increases.
UserSpace Run the CPU at user specified frequencies.
Conservative (not used) Scales the frequency dynamically according to current load. Scales the frequency more gradually than ondemand.
  • Configure SLURM PAM module to limit access to allocated compute nodes.

    • On job termination, any processes initiated by the user outside of Slurm’s control may be killed using an Epilog script configured in slurm.conf.
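
A minimal epilog sketch along those lines (the clean-up policy is an assumption; the slurm.epilog.clean script shipped with the Slurm source covers the same idea and should be preferred where it fits):

#!/bin/bash
# Hypothetical epilog: kill processes the job's user left behind once that
# user has no other jobs running on this node.
[ -n "$SLURM_JOB_USER" ] || exit 0
[ "$SLURM_JOB_USER" != "root" ] || exit 0
# List the user's other jobs on this node, excluding the one that just ended.
other_jobs=$(squeue -h -o %A -u "$SLURM_JOB_USER" -w "$(hostname -s)" \
             | grep -v "^${SLURM_JOB_ID}$")
if [ -z "$other_jobs" ]; then
    pkill -KILL -u "$SLURM_JOB_USER"
fi
exit 0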