With cluster management systems like xCAT, Bright Cluster Manager, HPCM, and others, it’s all too easy to fall into the trap of having a “golden image”: one root image directory maintained over months or years of maintenance downtimes and image changes, where one small misstep could break the image and take the entire cluster offline. Without a well-defined policy for managing image changes and builds, incremental changes to an image directory accumulate over time, eventually forming a snarl of individual hacks, tweaks, and fixes where no one knows who did what or why.
Luckily, there are tools that can help mitigate this problem and enforce a reliable, reproducible image build process. Red Hat’s Ansible configuration management tool integrates nicely with existing cluster management tools, given a bit of scripting, and offers a well-established way to modify host configuration, even inside a cluster image. And by using Git, image configurations can be versioned and maintained effectively, even in an environment with multiple contributors.
Bright Cluster Manager: Modifying images directly using Ansible
Bright Cluster Manager provides a set of tools that can be used to create images from running hosts, but this method requires booting and customizing a host every time an image is changed or updated. Instead of repeating this tedious and time-consuming process, create a base image once using the Bright Cluster Manager toolkit, then clone and modify that image as needed using Ansible.
Cluster node images created in Bright Cluster Manager take the form of a chroot environment: a directory containing all of the files necessary for a node to run, as if that directory were the node’s / (root) directory. Ansible includes a chroot connection plugin that allows a playbook to target a chroot or image directory directly, as if it were a running host. Image playbooks can therefore draw on the hundreds of existing Ansible modules instead of custom scripts, saving time and effort. To create a customized image, simply clone the base image using Bright, then apply the custom configuration using Ansible.
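As a rough illustration, the command that applies a playbook to an image directory can be assembled as shown here. This is a minimal sketch, not Bright’s own tooling: the helper name chroot_playbook_command, the image path /cm/images/compute-image, and the playbook name compute-image.yml are hypothetical, and it assumes the chroot connection plugin is available as community.general.chroot (older Ansible releases expose it simply as chroot). The playbook is assumed to use "hosts: all", and the resulting command generally has to run as root on the node that holds the image.

```python
import shlex


def chroot_playbook_command(image_dir, playbook,
                            connection="community.general.chroot"):
    """Build an ansible-playbook argument list that targets an image directory.

    Nothing is executed here; the caller decides whether to print or run it.
    """
    # The trailing comma tells Ansible that the --inventory value is an
    # inline host list rather than the path to an inventory file.
    inventory = image_dir.rstrip("/") + ","
    return [
        "ansible-playbook",
        "--connection", connection,
        "--inventory", inventory,
        playbook,
    ]


if __name__ == "__main__":
    # Hypothetical image directory and playbook, for illustration only.
    cmd = chroot_playbook_command("/cm/images/compute-image",
                                  "compute-image.yml")
    print(shlex.join(cmd))
```

Printing the command first makes it easy to review before anything touches the image; passing the same list to subprocess.run would execute it.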
xCAT: Extending the postinstall script for automatic Ansible configuration
One of the attributes of an xCAT image is the location of a postinstall script, which runs every time the image is built and can be extended to provide system-specific customization. The most straightforward way to customize an xCAT image is to modify this postinstall script, but this method can lead to a bloated mess of interdependent hacks, where one change made to fix an issue can bring the entire system crashing down.
This is where Ansible comes into play. A playbook can be created for each image, and the postinstall script can then run that playbook directly. Since an xCAT image directory is also essentially a chroot environment, the Ansible chroot connection plugin can apply an image playbook to it, including to images that already exist.
In addition, a little bit of smart scripting can handle multiple images with one set of Ansible playbooks and roles. For example, a site may have two different image types: one for login nodes named centos7-login, and one for compute nodes named centos7-compute. To handle both cases, create two Ansible playbooks, centos7-login.yml and centos7-compute.yml, each containing its own set of variables and roles. Using simple command line tools like find, grep, and cut, the postinstall script can search an established playbook directory for the file that matches the image being built and select the desired playbook automatically.
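The selection logic might look something like the sketch below. It is written in Python rather than with the shell tools named above, and it rests on assumptions: the function names match_playbook and select_playbook are hypothetical, playbooks are assumed to be named exactly after the image with a .yml or .yaml extension, and how the postinstall script learns the image name and playbook directory is left to the site.

```python
from pathlib import Path


def match_playbook(filenames, image_name):
    """Return the playbook file name that matches image_name, or None."""
    for suffix in (".yml", ".yaml"):
        wanted = image_name + suffix
        if wanted in filenames:
            return wanted
    return None


def select_playbook(playbook_dir, image_name):
    """Look in playbook_dir for the playbook named after the image."""
    directory = Path(playbook_dir)
    names = [entry.name for entry in directory.iterdir() if entry.is_file()]
    match = match_playbook(names, image_name)
    # Returning None lets the caller skip images that have no playbook yet.
    return directory / match if match else None


if __name__ == "__main__":
    available = ["centos7-login.yml", "centos7-compute.yml"]
    print(match_playbook(available, "centos7-compute"))
```

Keeping the name matching separate from the directory listing means the naming convention can be checked without a real playbook directory on hand.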
Maintaining playbooks and scripts using Git
Of course, without effective change management of the Ansible playbooks, the mess just moves from the image itself out into the playbooks and scripts used to build it. This is where Git comes into play. By managing the Ansible playbooks created for image management in Git, images can be versioned and rolled back without having to maintain multiple multi-gigabyte image directories on a cluster management node.
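One way to picture a rollback is as two commands: check out an earlier revision of the playbook repository, then apply it to a fresh clone of the base image (re-running an older playbook over an already customized image would not undo later changes). The sketch below only builds those commands and makes several assumptions: the function name rebuild_commands, the repository path, the tag v1.2, and the image path are all hypothetical, and the chroot connection plugin is assumed to be available as community.general.chroot.

```python
import shlex


def rebuild_commands(repo_dir, git_ref, image_dir, playbook):
    """Return the commands that rebuild an image from a past playbook revision.

    image_dir is expected to be a fresh clone of the base image.
    """
    repo = repo_dir.rstrip("/")
    # Detach HEAD at the requested tag or commit without moving any branch.
    checkout = ["git", "-C", repo, "checkout", "--detach", git_ref]
    # Apply the playbook from that revision to the image directory.
    apply = [
        "ansible-playbook",
        "--connection", "community.general.chroot",
        "--inventory", image_dir.rstrip("/") + ",",
        f"{repo}/{playbook}",
    ]
    return [checkout, apply]


if __name__ == "__main__":
    # Hypothetical paths and tag, for illustration only.
    for cmd in rebuild_commands("/srv/image-playbooks", "v1.2",
                                "/srv/images/centos7-compute-rebuild",
                                "centos7-compute.yml"):
        print(shlex.join(cmd))
```

Because only the playbooks are stored in Git, each past image state costs a commit rather than another full image directory.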
In addition, the merge control and branch management features present in Git hosting platforms like GitLab or GitHub further enhance the manageability of Ansible playbooks. A proper Git workflow also enforces good image change control and maintainability. Locking down the master branch ensures that all changes to the production image must go through a testing and approval process. Every change thus has a discussion and ownership chain associated with it, making long-term planning and maintenance of image changes easier. And if there are security concerns about pushing configuration details to a public platform, a self-hosted GitLab instance can be set up to manage playbook repositories.
With a coherent image modification process using Ansible and Git, you no longer have to fear losing the “golden image” that your cluster relies on. Every image customization is easily viewable and reviewable in the Git frontend of your choice, including history, comments, and justification for every change. In a worst-case scenario, you can start from scratch with a new image (or a new cluster) using the existing Ansible playbooks, provided and managed via Git.
These tools also help your cluster management team work together. Git’s source control features allow contributors from around the world to collaborate on a single project. While your cluster management team may not be so widely spread, Git will still allow for easier collaboration between multiple users, administrators, and even multiple cluster sites. Easier upgrades, easier cluster migrations, easier disaster recovery, and more: leveraging Git and Ansible for cluster management makes your life easier overall.
About the Author
Adam Dorsey is an HPC systems administrator and site lead for RedLine Performance Solutions.