Work has continued on my intern project. With the virtual cluster completed, I tested several backup solutions for use with Ceph, such as BackupPC and BorgBackup. I have also looked into benchmarking the Ceph cluster once it is installed, using tools such as rados bench and fio. The hard drives for the cluster have arrived, so the next stage will be installing them into Viper’s compute nodes, ready for the installation of Ceph.
Manual Handling Training
The team undertook manual handling training to better prepare for moving heavy objects in the Datacentre. The training was really useful, as much of the equipment is extremely heavy and requires two people to carry. We soon got the chance to put our training into practice (see below).
Replacing a faulty node
Unfortunately, a compute node failed last month. Over the course of a week I worked with Clustervision, who promptly helped to diagnose and remedy the problem. This involved swapping components and, eventually, swapping the node for a spare.
Node returned to CV for further diagnosis
During the month, a colleague (Seif) and I have been helping a user whose job on Viper would stop at random points while running. The issue has been difficult to solve, so as an interim solution I have produced a restart script that restarts the job from a previous checkpoint.
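For anyone curious what a restart-from-checkpoint wrapper can look like, here is a minimal sketch of the general idea in Python. It is not the script used on Viper: the `*.chk` file pattern, the `--restart` flag and the retry limit are all hypothetical placeholders, and a real job would need whatever restart mechanism its application actually provides.

```python
import subprocess
from pathlib import Path
from typing import List, Optional


def latest_checkpoint(directory: Path, pattern: str = "*.chk") -> Optional[Path]:
    """Return the most recently modified checkpoint file, or None if none exist.

    The "*.chk" pattern is a hypothetical naming convention.
    """
    candidates = [p for p in directory.glob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


def build_command(base_command: List[str], checkpoint: Optional[Path]) -> List[str]:
    """Append a (hypothetical) --restart flag when a checkpoint is available."""
    if checkpoint is None:
        return list(base_command)
    return list(base_command) + ["--restart", str(checkpoint)]


def run_with_restarts(base_command: List[str], directory: Path,
                      max_attempts: int = 3) -> int:
    """Run the job, restarting from the newest checkpoint after each failure.

    Returns the exit code of the last attempt.
    """
    returncode = 1
    for _ in range(max_attempts):
        command = build_command(base_command, latest_checkpoint(directory))
        returncode = subprocess.run(command).returncode
        if returncode == 0:
            break  # job finished cleanly, no restart needed
    return returncode
```

Picking the checkpoint by modification time keeps the wrapper independent of how the files are numbered, and the attempt limit stops a job that fails immediately from looping forever.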
David and I have been given the opportunity to go on placements at two HPC companies. David is going to work for a week with Clustervision in the Netherlands, and I am going to work at OCF in Sheffield. We’re both very excited to get ‘hands on’ and to gain experience.