System Installation

All RPM packages are built on the tara-c-060 node, which runs the tara-build-centos7 image.

munge Version: 0.5.13 (Sep 27, 2017)
PMIx Version: 3.0.2 (Sep 19, 2018)
UCX Version: 1.4.0 (Oct 30, 2018)
Slurm Version: 18.08.5 (Jan 30, 2019)

Build Order

  1. munge
  2. OpenUCX
  3. PMIx
  4. Slurm
  5. Lmod/EasyBuild

Common Tools

We will use rpm-build for building RPM packages from source code and wget for downloading the code.

$ yum install rpm-build wget

MUNGE

Building MUNGE RPMs

Download the latest version of MUNGE

$ wget https://github.com/dun/munge/releases/download/munge-0.5.13/munge-0.5.13.tar.xz

Note

The EPEL repository does not contain the latest version of the munge package.

Install MUNGE dependencies

$ yum install gcc bzip2-devel openssl-devel zlib-devel

Build RPM package from MUNGE source.

$ rpmbuild -tb --clean munge-0.5.13.tar.xz

Create a MUNGE directory on the parallel file system and move the RPM files there.

$ mkdir -p /utils/munge
$ mv rpmbuild/ /utils/munge/

Install and start MUNGE

Generate munge.key. This needs to be done only once; every node then copies the same key from the parallel file system.

$ dd if=/dev/urandom bs=1 count=1024 > /utils/munge/munge.key
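Because the key must be generated only once, it can help to guard the command so that a rerun never overwrites a key that other nodes already use. The following is a minimal sketch under that assumption; the function name generate_munge_key is hypothetical and the key path is passed as an argument.

```shell
# Hypothetical helper: create a MUNGE key only if none exists yet.
generate_munge_key() {
  key_path="$1"
  if [ -e "$key_path" ]; then
    # Never overwrite a key that may already be deployed on other nodes.
    echo "key already exists: $key_path" >&2
    return 0
  fi
  # Tighten the umask first so the key is never readable by group or others.
  ( umask 077 && dd if=/dev/urandom of="$key_path" bs=1 count=1024 2>/dev/null )
}
```

For the layout used in this guide, the call would be generate_munge_key /utils/munge/munge.key.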

Create munge user and group.

$ groupadd munge -g 2000
$ useradd --system munge -u 2000 -g munge -s /sbin/nologin --no-create-home

Install MUNGE from RPM.

$ rpm -ivh /utils/munge/rpmbuild/RPMS/x86_64/munge-0.5.13-1.el7.x86_64.rpm \
  /utils/munge/rpmbuild/RPMS/x86_64/munge-libs-0.5.13-1.el7.x86_64.rpm \
  /utils/munge/rpmbuild/RPMS/x86_64/munge-devel-0.5.13-1.el7.x86_64.rpm

Create MUNGE local directory and copy munge.key.

$ mkdir -p /etc/munge/
$ cp /utils/munge/munge.key /etc/munge
$ chown -R 2000:2000 /etc/munge/
$ chmod 500 /etc/munge/
$ chmod 400 /etc/munge/munge.key

Start MUNGE service

$ systemctl enable munge
$ systemctl start munge
$ systemctl status munge

Testing MUNGE installation

$ munge -n
$ munge -n | unmunge
$ munge -n | ssh <host> unmunge
$ remunge

Note

By default the MUNGE daemon runs with two threads, but a higher thread count can improve its throughput. For high-throughput support, the MUNGE daemon should be started with ten threads.
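One way to apply the thread count is a systemd drop-in that replaces the daemon's start command. The sketch below is an assumption rather than part of the original setup: it assumes the packaged unit starts /usr/sbin/munged without other options, and the helper name and the drop-in file name threads.conf are hypothetical. The munged option used is --num-threads.

```shell
# Hypothetical helper: write a systemd drop-in that starts munged with N threads.
# Assumes munged lives at /usr/sbin/munged (adjust if the RPM installs it elsewhere).
write_munge_threads_dropin() {
  dropin_dir="${1:-/etc/systemd/system/munge.service.d}"
  threads="${2:-10}"
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/threads.conf" <<EOF
[Service]
# An empty ExecStart= clears the packaged command before it is replaced.
ExecStart=
ExecStart=/usr/sbin/munged --num-threads $threads
EOF
}
```

After writing the drop-in, systemctl daemon-reload followed by systemctl restart munge makes the new command take effect.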

OpenUCX

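Download UCX 1.4.0 from the release page and extract it.

https://github.com/openucx/ucx/releases

Install UCX dependencies

$ yum install numactl numactl-libs numactl-devel

Add the CUDA libraries to the library path.

$ export LD_LIBRARY_PATH=/usr/local/cuda-10.0/lib64${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}

Configure UCX with CUDA support.

$ ./contrib/configure-release --prefix=$PWD/install --with-cuda=/usr/local/cuda/

Build RPM package from the UCX spec file.

$ rpmbuild -bb --define "configure_options --enable-optimizations --with-cuda=/usr/local/cuda" ucx-1.4.0/ucx.spec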

PMIx

Build PMIx RPM Package

Install PMIx dependencies

$ yum install libtool libevent-devel

Download the latest stable version of PMIx

$ wget https://github.com/pmix/pmix/releases/download/v3.0.2/pmix-3.0.2.tar.bz2

Build PMIx package from PMIx source.

$ ./configure --with-munge=/usr --with-munge-libdir=/usr
$ rpmbuild -tb --clean --define "configure_options --with-munge=/usr" pmix-3.0.2.tar.bz2

Note

The PMIx configure script appears to support C11 features, but these require GCC 4.9 or later.

Create a PMIx directory on the parallel file system and move the RPM files there.

$ mkdir -p /utils/pmix
$ mv rpmbuild/ /utils/pmix/

Install PMIx from RPM.

$ rpm -ivh /utils/pmix/rpmbuild/RPMS/x86_64/pmix-3.0.2-1.el7.x86_64.rpm

Checking PMIx installation

$ grep PMIX_VERSION /usr/include/pmix_version.h

#define PMIX_VERSION_MAJOR 3L
#define PMIX_VERSION_MINOR 0L
#define PMIX_VERSION_RELEASE 2L
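When the check is repeated on many nodes, it can be convenient to reduce the three defines to a single version string. The sketch below assumes the header has the layout shown above; the function name pmix_header_version is hypothetical.

```shell
# Hypothetical helper: print MAJOR.MINOR.RELEASE from a pmix_version.h header.
pmix_header_version() {
  header="${1:-/usr/include/pmix_version.h}"
  awk '
    $2 == "PMIX_VERSION_MAJOR"   { major = $3 }
    $2 == "PMIX_VERSION_MINOR"   { minor = $3 }
    $2 == "PMIX_VERSION_RELEASE" { release = $3 }
    END {
      # Strip the trailing L (long literal suffix) from each value.
      sub(/L$/, "", major); sub(/L$/, "", minor); sub(/L$/, "", release)
      print major "." minor "." release
    }' "$header"
}
```

Comparing the printed string with the expected version (3.0.2 in this guide) gives a quick pass or fail per node.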

Slurm

Build SLURM RPM Package

Install the dependencies of SLURM and its plugins (see Plugins Dependencies)

$ yum install readline-devel perl-ExtUtils-MakeMaker pam-devel hwloc-devel freeipmi-devel lua-devel mysql-devel libssh2-devel

Download the latest stable version of SLURM

$ wget https://download.schedmd.com/slurm/slurm-18.08.3.tar.bz2

Build SLURM package from SLURM source with PMIx.

$ rpmbuild -tb --clean slurm-18.08.3.tar.bz2
$ rpmbuild -bb --clean --define "configure_options --with-ucx" slurm.spec

Create a SLURM directory on the parallel file system and move the RPM files there.

$ mkdir -p /utils/slurm
$ mv rpmbuild/ /utils/slurm/

Install Slurm

Create slurm user and group.

$ groupadd slurm -g 2001
$ useradd --system slurm -u 2001 -g slurm -s /sbin/nologin --no-create-home

Install the dependencies of SLURM and its plugins (see Plugins Dependencies)

$ yum install readline-devel perl-ExtUtils-MakeMaker pam-devel hwloc-devel freeipmi-devel lua-devel mysql-devel libssh2-devel

Frontend

Install slurm from RPM packages.

$ rpm -ivh /utils/slurm/rpmbuild/RPMS/x86_64/slurm-18.08.3-1.el7.x86_64.rpm \
  /utils/slurm/rpmbuild/RPMS/x86_64/slurm-perlapi-18.08.3-1.el7.x86_64.rpm

Setup firewall

$ firewall-cmd --add-port 60001-63000/tcp --permanent
$ firewall-cmd --reload
$ iptables -nL

Slurmctld

Install slurmctld from RPM packages.

$ rpm -ivh /utils/slurm/rpmbuild/RPMS/x86_64/slurm-18.08.3-1.el7.x86_64.rpm \
  /utils/slurm/rpmbuild/RPMS/x86_64/slurm-slurmctld-18.08.3-1.el7.x86_64.rpm \
  /utils/slurm/rpmbuild/RPMS/x86_64/slurm-perlapi-18.08.3-1.el7.x86_64.rpm

Create the required directories

$ mkdir -p /var/log/slurm/ /var/run/slurm/ /var/spool/slurm/
$ chown slurm:slurm /var/log/slurm/
$ chown slurm:slurm /var/run/slurm/
$ chown slurm:slurm /var/spool/slurm/

Setup firewall

$ firewall-cmd --add-port 6817/tcp --permanent
$ firewall-cmd --add-port 60001-63000/tcp --permanent
$ firewall-cmd --reload
$ iptables -nL
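The firewall steps in this guide repeat the same firewall-cmd pattern with different ports for each node role. A small wrapper can make that pattern explicit and allow a dry run first. This is only a sketch: the function name open_tcp_ports and the FIREWALL_CMD override variable are hypothetical.

```shell
# Hypothetical helper: permanently open a list of TCP ports or port ranges.
# Set FIREWALL_CMD=echo to print the commands instead of running them.
open_tcp_ports() {
  fw="${FIREWALL_CMD:-firewall-cmd}"
  for port in "$@"; do
    "$fw" --add-port "$port/tcp" --permanent || return 1
  done
  # Permanent rules only become active after a reload.
  "$fw" --reload
}
```

For a slurmctld node, the call would be open_tcp_ports 6817 60001-63000.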

Edit the PIDFile setting in /usr/lib/systemd/system/slurmctld.service so that it matches the location in slurm.conf (current setting: /var/run/slurm/slurmctld.pid).

The following command can be used to make the edit.

$ sed -i -e 's@PIDFile=/var/run/slurmctld.pid@PIDFile=/var/run/slurm/slurmctld.pid@g' /usr/lib/systemd/system/slurmctld.service
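The same PIDFile edit is needed for slurmctld, slurmdbd, and slurmd, so it can be wrapped in one function that also verifies the result. The sketch below is a variation on the command above, not the original procedure: it rewrites whatever PIDFile line is present instead of matching the old path exactly, and the function name set_slurm_pidfile is hypothetical.

```shell
# Hypothetical helper: point a Slurm unit file at the PID file used in this guide.
set_slurm_pidfile() {
  daemon="$1"   # slurmctld, slurmdbd, or slurmd
  unit="${2:-/usr/lib/systemd/system/$daemon.service}"
  # Rewrite any existing PIDFile line, whatever its old value was.
  sed -i -e "s@^PIDFile=.*@PIDFile=/var/run/slurm/$daemon.pid@" "$unit"
  # Return non-zero if the unit file does not contain the expected line.
  grep -qx "PIDFile=/var/run/slurm/$daemon.pid" "$unit"
}
```

After a successful edit, systemctl daemon-reload makes systemd pick up the changed unit file.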

Create slurmctld.conf in /usr/lib/tmpfiles.d/ so that /var/run/slurm is recreated at boot. The content of slurmctld.conf is as follows:

d /var/run/slurm 0755 slurm slurm -
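The tmpfiles.d entry is identical for every Slurm daemon in this guide, so writing it can be scripted. This is a minimal sketch; the function name write_slurm_tmpfiles_entry is hypothetical, and the target directory is a parameter so the function can be tried outside /usr/lib/tmpfiles.d first.

```shell
# Hypothetical helper: write the tmpfiles.d entry that recreates /var/run/slurm.
write_slurm_tmpfiles_entry() {
  name="$1"                          # slurmctld, slurmdbd, or slurmd
  dir="${2:-/usr/lib/tmpfiles.d}"
  # Fields: type, path, mode, user, group, age ("-" means no cleanup by age).
  printf 'd /var/run/slurm 0755 slurm slurm -\n' > "$dir/$name.conf"
}
```

Running systemd-tmpfiles --create on the written file applies the entry immediately, without waiting for a reboot.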

Start slurmctld service

$ systemctl enable slurmctld
$ systemctl start slurmctld
$ systemctl status slurmctld

Note

slurmctld received SIGTERM after the first setup. The problem was solved by editing the PIDFile setting in the .service file and running systemctl daemon-reload.

SlurmDBD

Install slurmdbd from RPM packages.

$ rpm -ivh /utils/slurm/rpmbuild/RPMS/x86_64/slurm-18.08.3-1.el7.x86_64.rpm \
  /utils/slurm/rpmbuild/RPMS/x86_64/slurm-slurmdbd-18.08.3-1.el7.x86_64.rpm

Create the required directories

$ mkdir -p /var/log/slurm/ /var/run/slurm/
$ chown slurm:slurm /var/log/slurm/
$ chown slurm:slurm /var/run/slurm/

Create slurmdbd.conf in /usr/lib/tmpfiles.d/ (a systemd-tmpfiles entry, separate from the SlurmDBD configuration file of the same name). The content of this file is as follows:

d /var/run/slurm 0755 slurm slurm -

Configure MySQL

The following SQL code creates the database slurm_acct_db and the user slurmdbd, and grants the slurmdbd user all privileges on that database.

CREATE DATABASE slurm_acct_db;
CREATE USER 'slurmdbd'@'<slurmdbd_IP>' IDENTIFIED BY '<password>';
GRANT ALL ON slurm_acct_db.* TO 'slurmdbd'@'<slurmdbd_IP>';
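To avoid editing the placeholders by hand, the statements can be generated from the SlurmDBD host address and the password. The sketch below is an illustration only: the function name slurm_acct_sql is hypothetical, and the arguments are inserted verbatim, so they must not contain single quotes.

```shell
# Hypothetical helper: print the accounting-database SQL for one slurmdbd host.
# Arguments are inserted verbatim and must not contain single quotes.
slurm_acct_sql() {
  host="$1"
  password="$2"
  cat <<EOF
CREATE DATABASE slurm_acct_db;
CREATE USER 'slurmdbd'@'$host' IDENTIFIED BY '$password';
GRANT ALL ON slurm_acct_db.* TO 'slurmdbd'@'$host';
EOF
}
```

The output can be reviewed first and then piped into the mysql client run as a database administrator.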

Edit the PIDFile setting in /usr/lib/systemd/system/slurmdbd.service so that it matches the location in the SlurmDBD configuration file slurmdbd.conf (current setting: /var/run/slurm/slurmdbd.pid).

The following command can be used to make the edit.

$ sed -i -e 's@PIDFile=/var/run/slurmdbd.pid@PIDFile=/var/run/slurm/slurmdbd.pid@g' /usr/lib/systemd/system/slurmdbd.service

Setup firewall

$ firewall-cmd --add-port 6819/tcp --permanent
$ firewall-cmd --reload

Start slurmdbd service

$ systemctl enable slurmdbd
$ systemctl start slurmdbd
$ systemctl status slurmdbd

Note

slurmdbd received SIGTERM after the first setup. The problem was solved by editing the PIDFile setting in the .service file and running systemctl daemon-reload.

Slurmd

Install slurmd from RPM packages.

$ rpm -ivh /utils/slurm/rpmbuild/RPMS/x86_64/slurm-18.08.3-1.el7.x86_64.rpm \
  /utils/slurm/rpmbuild/RPMS/x86_64/slurm-slurmd-18.08.3-1.el7.x86_64.rpm \
  /utils/slurm/rpmbuild/RPMS/x86_64/slurm-perlapi-18.08.3-1.el7.x86_64.rpm \
  /utils/slurm/rpmbuild/RPMS/x86_64/slurm-pam_slurm-18.08.3-1.el7.x86_64.rpm

Setup firewall

$ firewall-cmd --add-port 6818/tcp --permanent
$ firewall-cmd --add-port 60001-63000/tcp --permanent
$ firewall-cmd --reload

Create the required directories

$ mkdir -p /var/log/slurm/ /var/run/slurm/ /var/spool/slurm/
$ chown slurm:slurm /var/log/slurm/
$ chown slurm:slurm /var/run/slurm/
$ chown slurm:slurm /var/spool/slurm/

Create slurmd.conf in /usr/lib/tmpfiles.d/ so that /var/run/slurm is recreated at boot. The content of slurmd.conf is as follows:

d /var/run/slurm 0755 slurm slurm -

Edit the PIDFile setting in /usr/lib/systemd/system/slurmd.service so that it matches the location in slurm.conf (current setting: /var/run/slurm/slurmd.pid).

The following command can be used to make the edit.

$ sed -i -e 's@PIDFile=/var/run/slurmd.pid@PIDFile=/var/run/slurm/slurmd.pid@g' /usr/lib/systemd/system/slurmd.service

Start slurmd service

$ systemctl enable slurmd
$ systemctl start slurmd
$ systemctl status slurmd

Note

slurmd received SIGTERM after the first setup. The problem was solved by editing the PIDFile setting in the .service file and running systemctl daemon-reload.

Bring the nodes to the idle state using scontrol. For example:

$ scontrol update NodeName=tara-c-00[1-6] State=DOWN Reason="undraining"
$ scontrol update NodeName=tara-c-00[1-6] State=RESUME

Installing nhc

Download RPM package

$ wget -P /utils/nhc https://github.com/mej/nhc/releases/download/1.4.2/lbnl-nhc-1.4.2-1.el7.noarch.rpm

Install nhc package

$ rpm -ivh /utils/nhc/lbnl-nhc-1.4.2-1.el7.noarch.rpm

PAM Setup

/etc/pam.d/sshd

After the "password include password-auth" line, add

account    sufficient   pam_slurm_adopt.so
account    required     pam_access.so

In the pam_access configuration file (/etc/security/access.conf), add

+:root:ALL
-:ALL:ALL

To guarantee that slurmd starts only after the NFS mount of /etc/slurm is available, update the After= line in /usr/lib/systemd/system/slurmd.service from

After=munge.service network.target remote-fs.target

to

After=munge.service network.target remote-fs.target etc-slurm.mount
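Files under /usr/lib/systemd/system are replaced when the slurm RPM is upgraded, so the same ordering can also be expressed as a systemd drop-in. This is an alternative sketch, not the method used in the original setup; the helper name and the drop-in file name after-etc-slurm.conf are hypothetical. An After= line in a drop-in adds to the dependencies already listed in the packaged unit.

```shell
# Hypothetical helper: order slurmd after the /etc/slurm mount via a drop-in.
write_slurmd_mount_dropin() {
  dropin_dir="${1:-/etc/systemd/system/slurmd.service.d}"
  mkdir -p "$dropin_dir"
  cat > "$dropin_dir/after-etc-slurm.conf" <<'EOF'
[Unit]
# etc-slurm.mount is the systemd mount unit for /etc/slurm.
After=etc-slurm.mount
EOF
}
```

As with any unit change, systemctl daemon-reload is needed before the new ordering applies.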

Lmod and EasyBuild

Lmod

Install Lmod from EPEL repository.

$ yum install lmod

EasyBuild

Create the modules group and user with a home directory on a shared filesystem

$ groupadd modules -g 2002
$ useradd -m -c "Modules user" -d /utils/modules -u 2002 -g modules -s /bin/bash modules

Configure the environment variables for bootstrapping EasyBuild

$ export EASYBUILD_PREFIX=/utils/modules

Download EasyBuild bootstrap script

$ wget https://raw.githubusercontent.com/easybuilders/easybuild-framework/develop/easybuild/scripts/bootstrap_eb.py

Execute bootstrap_eb.py

$ python bootstrap_eb.py $EASYBUILD_PREFIX

Update $MODULEPATH

$ export MODULEPATH="/utils/modules/modules/all:$MODULEPATH"

Test EasyBuild

$ module load EasyBuild
$ eb --version

# OPTIONAL Unittest
$ export TEST_EASYBUILD_MODULES_TOOL=Lmod
$ python -m test.framework.suite

Enable access to all users.

Change permissions of /utils/modules/

$ chmod a+rx /utils/modules

Add z01_EasyBuild.sh to /etc/profile.d/. The content of the file is as follows:

if [ -z "$__Init_Default_Modules" ]; then
  export __Init_Default_Modules=1
  export EASYBUILD_MODULES_TOOL=Lmod
  export EASYBUILD_PREFIX=/utils/modules
  module use $EASYBUILD_PREFIX/modules/all
else
  module refresh
fi

EasyBuild robot path.

/utils/modules/software/EasyBuild/3.7.1/lib/python2.7/site-packages/easybuild_easyconfigs-3.7.1-py2.7.egg/easybuild/easyconfigs

Setup Lmod on other nodes

Install Lmod

$ yum install lmod

Add z01_EasyBuild.sh to /etc/profile.d/. The content of the file is as follows:

if [ -z "$__Init_Default_Modules" ]; then
  export __Init_Default_Modules=1
  export EASYBUILD_MODULES_TOOL=Lmod
  export EASYBUILD_PREFIX=/utils/modules
  module use $EASYBUILD_PREFIX/modules/all
else
  module refresh
fi