I’m usually a late waker. I don’t do mornings. However, this morning I had to be up to take a family member to the airport for 7am. I reluctantly wake up at 6.40am and reach for my phone to see what is going on in the world, my usual routine, just less awake on this occasion! My Slack and the MoT Twitter are on fire with news of the site being down, along with a few more accurate tweets explaining the cause: our SSL certificate has expired. I’m wide awake now! It’s 6.45am.
We are a small team at Ministry of Testing, nine people in total with two of those mostly focused on Tech/Development. So, I’m hoping that one of them is awake and able to sort the issue.
6.45am - Issue reported to the dev team
6.56am - We post updates to Twitter and Slack to inform the community that we are aware of the issue.
7.07am - Our DevBoss, Andrew, replies stating that he can see the Cert Issuer we use to issue our certificates but can’t find any credentials in our 1Password vault.
Twenty-two minutes for a response at that time is great. Unfortunately, Andrew is unable to take any action, so let me share some context here. Andrew joined MoT a little over a year ago, and the last time our certs were updated was three years ago. Our TechBoss, Graham, has done the majority of the development work and system management for MoT for the last 7 years, and SSL certs hadn’t come up in conversations yet. We bought 1Password last July and, again, SSL hadn’t come up in conversations, so, unfortunately, those credentials were never added to 1Password.
Actions: The credentials for our Cert Issuer have now been added to 1Password, so Andrew and other members of the team can get access.
7.28am - We haven’t heard from Graham, so I ping him a WhatsApp message to see if he is around.
Some more context: mornings are very busy in the Sherry household. They have five children, the youngest being one!
8.01am - Graham has used his magnificent juggling skills and found some time to look into the expired certs, ninja!
8.48am - Graham reports to the team that he has nearly resolved the issue; however, some DNS caching is slowing him down. He also discovered, while logging into Comodo, that our account there was set up with an old email address on our old Software Testing Club domain, for those who know the history of MoT.
It took us a while to move off the softwaretestingclub domain, but we finally did it. Guess when? That’s right, June-July 2017, not long after the last SSL cert had been renewed! Upon logging into the old email account, Graham noticed that our Cert Issuer had been sending us warnings about our cert expiring soon. We’d seen none of them, oops!
Actions: The email address has now been updated to a group MoT one, so multiple members of the team will see the alerts in the future. 1Password updated!
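Email warnings from the issuer are only one line of defence, as we found out. Purely as an illustration, and not something that is part of the actions above, here is a minimal Python sketch of an independent check that reads a site's certificate and works out how many days are left on it. The function names and the 30-day threshold are assumptions made up for this example.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after, now=None):
    """Whole days between `now` and a certificate's notAfter string.

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jun  1 00:00:00 2020 GMT". Negative means already expired.
    """
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    if now is None:
        now = datetime.now(timezone.utc)
    return (expires - now).days


def needs_renewal(not_after, now=None, warn_days=30):
    """True when the cert expires within `warn_days` (hypothetical threshold)."""
    return days_until_expiry(not_after, now) <= warn_days


def fetch_not_after(host, port=443, timeout=10):
    """Connect to `host` and return its certificate's notAfter string.

    Uses the default verifying context, so an already expired certificate
    raises ssl.SSLCertVerificationError here, which is itself the alarm.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Something like `needs_renewal(fetch_not_after("example.com"))` could then be run on a schedule and posted to a team channel, so the warning never depends on a single inbox.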
9.26am - Fixed! Graham reports to the team that the issue is now resolved. A few of the team confirm that we are able to access the site again.
9.31am - We let the community know that the issue has been resolved.
9.36am - Graham messages Andrew outlining all the steps he took to resolve the issue, and how to maintain it all going forward with links and instructions.
10.01am - Andrew has updated our internal documentation with all the information from Graham.
2 hours and 41 minutes to resolve the issue, at 7 am, after a bank holiday weekend. Given our context, the size of our team, and being dependent on one person to fix it, I’m really proud of the team for resolving this issue so fast.
Many will say it should never have happened, but software development is a complex process, made even more so by the growth and changes we’ve had in the last year. As you can see, we’ve taken actions to avoid this happening in the future and removed the dependency on Graham. However, this is software, and that’s why we all love working in this space so much. Maybe it will happen again, and if it does, I know this awesome team will get us up and running again as quickly as they can.
It also increases my appreciation of the number of hats team members need to wear in small companies, especially bootstrapped companies like MoT. Graham has been wearing them all for years: CTO, Developer, Ops Engineer, Security and the rest of them. As Andrew progresses with his DevBoss career, his hat collection is ever expanding as well. People rock!
Fantastic work team MoT!