GitLab outage – why everyone needs to know about rmdir
2 February 2017
Louis Bougeard | About a 4 minute read
Yesterday, GitLab had a major outage and appear to have lost 6 hours’ worth of some data, including merge requests, webhooks, etc. If you didn’t catch it, read their blog post about it here.
It was unfortunate, but I thought the way they handled it was wonderful. Their candid and transparent approach to keeping people updated was, in my opinion, great. Obviously it would have been better if this had never happened, but it did, and it could have happened to any of us. A number of events that you’d hope would never happen led to what was an unfortunate human error.
I say it could have happened to any of us, because it really could. In fact, not too long ago, something remarkably similar happened to me. At the time I learnt an important DevOps lesson, and whilst this GitLab incident is fresh in people’s minds I’d like to share it.
A few years ago now, but all too vivid in memory, I was working on a system that had a syncing job to keep certain parts of content aligned across a load-balanced set of two production web servers. After a routine system update, the syncing stopped working. After some investigation, it appeared that on one of the servers the content directory was empty. Whilst investigating why the directory wasn’t being synced to, it transpired that it had been created with the wrong permissions and by the wrong user. I found this out by comparing the directory permissions on the server that worked with those on the one that didn’t.
If the directory had merely not existed, the sync job would have created it on the next sync. Thus I decided to remove the directory and allow the sync job to recreate it and move the data. Job done, home in time for tea and medals. However, in a moment of sleep-deprived madness, I ran rm -rf directory_name on the wrong box.
Luckily, before doing any of this, I had created a backup. I truly sympathise with the guys at GitLab: when it’s late, you’re having a prod issue and you haven’t had enough sleep or coffee, these mistakes happen. I, however, was much luckier. The data set was much smaller (about 8 GB), and the backup I had scp’d to another server before I started was also still on the box when I deleted the data. I was able to get the whole data set restored and replicated in what felt like hours but was in fact under 30 minutes.
That said, I learnt an important lesson that day, which I’d like to share with everyone.
When you are deleting a directory that you believe is empty (as in both the GitLab scenario and my case), don’t use rm, use rmdir. Had I done this on the directory I thought was empty, but which wasn’t, I would have got the following error:
~$ rmdir content_directory
rmdir: content_directory: Directory not empty
It may sound obvious, but for far too many of us the habit of reaching for rm is deeply ingrained in the way we operate.
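To make the habit concrete, here is a minimal sketch of the idea in shell. The helper name remove_if_empty and the directory names are hypothetical, made up purely for illustration; the only real behaviour it relies on is that rmdir refuses to remove a directory that still contains entries.

```shell
#!/bin/sh
# Hypothetical helper (not from the original incident): remove a directory
# only if it is genuinely empty, and fail loudly otherwise.
remove_if_empty() {
    dir="$1"
    # rmdir fails on a non-empty (or missing) directory, so a
    # wrong-box mistake leaves the data untouched.
    if rmdir -- "$dir" 2>/dev/null; then
        echo "removed: $dir"
    else
        echo "refused: $dir is missing or not empty" >&2
        return 1
    fi
}

# Demonstration inside a throwaway temp directory (left in place for inspection).
demo=$(mktemp -d)
mkdir "$demo/empty_dir" "$demo/full_dir"
touch "$demo/full_dir/important_file"

# The empty directory is removed.
remove_if_empty "$demo/empty_dir"

# The directory that still holds data is refused and survives.
remove_if_empty "$demo/full_dir" || echo "full_dir left intact"
```

The point of the design is that the safe path and the lazy path are the same command: if the directory turns out not to be empty, you get an error to investigate rather than an outage.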
However, being snazzy on the terminal is no substitute for good procedure, monitoring, and regularly checking that all the checks and safeguards you have in place are working as expected.
It’s all boring stuff, but checklists, rotas, well-documented procedures, redundancy and good planning for incidents are all hugely important to maintaining uptime. Furthermore, trying to close the separation between an “Ops team” and “the developers” will also help ensure a stable service.
I met quite a few of the GitLab team at the GitLab world tour in London. I really like their product, and here at AND Digital we use it for a few of our clients. It wasn’t great that they had this outage, but I was impressed with how they handled it (especially their YouTube live stream, which I thought was very brave). When things like this happen, let’s all just take a moment to check our estate, ensure our backups are working and try to make sure that we don’t have to go through the same thing.
Well done on getting things turned around, GitLab. #HugOps