Late Saturday night into early Sunday morning, I was performing routine maintenance and upgrades, a rotation that regularly includes GitLab. I had previously made changes to the GitLab server, cutting back worker threads in an attempt to reduce memory usage; those changes reduced the server from 3.5 GB of memory to 2.5 GB and added 1 GB of swap space. Things had been running fine, but I could tell from the graphs that it was still struggling.
Before the upgrade, I noticed GitLab was using 100% of its swap space and all of its memory, with very little in cache/buffers, so I decided to shut it down and increase the memory to 3 GB. On a whim before the shutdown, I decided to take a manual GitLab backup; something usually done only before performing an upgrade. After increasing the memory, the server failed to boot. The error message made the problem plain as day to me.
Unable to locate image /qcow2_sda/git01.1702052226.qcow2
Images with this naming convention are created by backup automation I have in place. To back up a VM, you take an external snapshot, creating a new delta image that all new writes go to and locking the original base image as read-only, to be referenced by the delta(s). While new changes are being written to the delta image, you can safely back up the read-only base image. Once backup operations are complete, you must block-commit the delta image back into the base image and pivot the VM back to that disk. If this process fails, you are left with the VM still writing new changes to the delta image, and no changes being written to the read-only base image.
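In virsh terms, the flow above looks roughly like this (a sketch: the domain name git01 and the /qcow2_sda path come from the incident, while the snapshot name, disk target sda, and backup destination are assumptions):

```shell
# 1. Take a disk-only external snapshot; the guest starts writing to a
#    new delta image and the base image becomes read-only.
virsh snapshot-create-as git01 backup-snap --disk-only --atomic --no-metadata

# 2. The base image is now quiescent; copy it off to backup storage.
cp /qcow2_sda/git01.qcow2 /backups/git01.qcow2

# 3. Merge the delta back into the base and pivot the guest onto it.
virsh blockcommit git01 sda --active --pivot

# 4. Only after a successful pivot is the delta image safe to delete.
rm /qcow2_sda/git01.backup-snap
```

If step 3 fails, skipping step 4 is critical: the guest is still writing to the delta.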
After much confusion, I eventually determined that the delta image GitLab was running on did not exist; it had been cleaned up weeks ago. I therefore proceeded to restore GitLab from the backup taken earlier. The base image restored GitLab to 8.16.3, but the backup was from 8.16.4, which posed an issue because the versions must match. Installing GitLab 8.16.4 failed, so a system-wide upgrade was performed on the GitLab host, which bumped GitLab to 8.16.6. GitLab was then rolled back to 8.16.4 and the backup restored successfully. GitLab was then rolled forward to 8.16.6, re-added to Salt, and smoke-checked.
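On an omnibus install, the backup and version-matched restore steps look roughly like this (a sketch: the package versions come from the incident, while the Debian-style version pinning and the timestamp placeholder are assumptions):

```shell
# Take a manual application backup (writes a timestamped archive
# under /var/opt/gitlab/backups by default).
sudo gitlab-rake gitlab:backup:create

# A backup can only be restored into the same GitLab version,
# so pin the package to the backup's version first.
sudo apt-get install gitlab-ce=8.16.4-ce.0

# Stop the services that touch the database, then restore.
sudo gitlab-ctl stop unicorn
sudo gitlab-ctl stop sidekiq
sudo gitlab-rake gitlab:backup:restore BACKUP=<timestamp>
sudo gitlab-ctl start

# With the data back, it is safe to upgrade to the current release.
sudo apt-get install gitlab-ce=8.16.6-ce.0
```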
It was never determined how GitLab ran for 14 days on a nonexistent delta image while maintaining data integrity. On February 2nd, the delta image was removed with an "rm -rf" command after a failed block commit during manual maintenance. There is clearly a design feature of KVM and external snapshots that I do not fully understand.
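The likely explanation (my reading, not confirmed in the incident) is not KVM-specific at all but standard POSIX unlink semantics: rm only removes the directory entry, and the inode survives until the last open file descriptor on it is closed. QEMU held the delta image open, so the "deleted" file kept accepting reads and writes for two weeks; only the path was gone, which is why the next boot, a fresh open by path, failed. A minimal demonstration:

```shell
# Demonstrate that a deleted file stays usable through an open fd.
tmp=$(mktemp)
exec 3>"$tmp"            # hold fd 3 open on the file, as QEMU held the delta
rm "$tmp"                # unlink the path; the inode survives while fd 3 is open
echo "still writing" >&3 # writes through the open fd still succeed
[ ! -e "$tmp" ] && echo "path gone, data not"
exec 3>&-                # closing the last fd finally frees the inode
```

A reopen by path after the rm fails exactly the way the VM's boot did.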
GitLab was successfully restored after 30 minutes of downtime. No data was lost as a result of the incident.
Backups – GitLab automated backup frequency has been increased from once a day to three times a day.
Process – A GitLab maintenance process has been instituted to ensure a backup is always available.
Cold Spare – A GitLab cold spare, git02, was created, salted, and shut down, ready to boot.
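The increased backup frequency amounts to a single crontab entry (a sketch: the run times are assumptions, the gitlab-rake path is the omnibus default, and CRON=1 suppresses progress output):

```shell
# Run the GitLab application backup three times a day.
0 2,10,18 * * * /opt/gitlab/bin/gitlab-rake gitlab:backup:create CRON=1
```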
On February 19th, 2017 PST…
01:54 – Manual GitLab backup taken.
01:55 – Nagios downtime set.
01:56 – GitLab rebooted.
01:57 – GitLab not coming back online…
01:59 – Determined GitLab was running from a nonexistent delta image.
02:00 – Edited XML to boot from base image and booted GitLab.
02:01 – GitLab online from base image dated 2017-02-05 at 22:26.
02:05 – GitLab taken back offline due to 2-week-old data.
02:06 – Investigated root cause in an attempt to restore the machine image.
02:17 – Determined a machine image restore was not possible.
02:18 – GitLab restore failed due to a GitLab version mismatch.
02:19 – GitLab 8.16.4 install failed for unknown reasons.
02:20 – GitLab system upgraded and GitLab upgraded to 8.16.6.
02:22 – Rolled GitLab installation back to version 8.16.4.
02:26 – Restored GitLab from backup on version 8.16.4.
02:28 – Upgraded GitLab to latest version 8.16.6 and verified integrity.
02:35 – Re-salted git01 since it was restored from a non-salted image.