Pardon the Star Wars Yoda speak, but this is a topic that has been close to my heart for the longest time.
In today’s highly competitive workplace, uttering the 3 words ‘I don’t know’ can be daunting, especially for managers. It can be construed as a sign of weakness by their peers or, worse, as a sign of incompetence by their superiors.
In their moment of weakness, these managers will utter whatever plausible cause comes to mind, just to give the impression of being on top of things.
A Server that won’t boot up
Consider a server that has been running for years without any issue, but inexplicably failed to come back up after OS patching just the night before.
“Aha! It must be the patching! ’Cos it was the last thing that happened to the server. That must be the straw that broke the camel’s back,” remarked the manager, jumping to the brilliant conclusion.
Only upon closer inspection of the boot error messages did he realise that someone had updated the boot device.
For superiors who value responsiveness above details, these technical incongruities will either be forgotten or simply not register in the first place. That is, as long as they get an answer for their boss.
Everyone would conveniently assume that whatever explanation was offered is the underlying root cause. As long as the problem goes away soon enough, they can’t be bothered to check back. Not when the next fire is already burning through the roof.
For the manager, the real cause of the issue is not important. What is important is that the problem goes away and does not linger to leave a stain on his team or his next appraisal review.
However, the sysadmin (aka DevOps, DevSecOps or Site Reliability Engineer) is left to troubleshoot the issue and must somehow coax the system back to life within the next 24 hours. For him, reality couldn’t be more different.
Cold Heartless Beings
To begin with, computers are cold heartless beings that pay no heed to corporate reporting structures, SLAs or complaints to the IT service helpdesk. They only understand ‘1’s and ‘0’s. And if the binaries are somehow misaligned or the electrons are not flowing the way they should, no amount of threats or casual explanations will get them back in shape.
Going back to our earlier example of the server outage, it could be due to:
a) A simple hardware failure
b) A software problem, like an incompatible system patch
c) An unauthorised change, e.g. the boot device updated wrongly
For a) and b), the implications are probably as simple as the remedy, especially when the system is under warranty and has a working backup in place.
It’s c) that could have long-lasting implications, and it will require some forensic work to identify who made the last change and why.
Best Practices save the day
This is where having a well-implemented Identity and Access Management (IAM) system and a proper audit trail in place will save the sysadmin’s day.
An IAM system will ensure that only authorised employees have access to the system. And if the principle of least privilege (PoLP) is followed, not everyone who has access will be able to perform ‘administrator’ tasks like updating the boot device.
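To make the least-privilege idea concrete, here is a minimal sketch in Python. It is not how any particular IAM product works; the role names, action names and the `is_allowed` helper are all made up for illustration. The point is simply that an action is denied unless a role explicitly grants it.

```python
from dataclasses import dataclass

# Hypothetical role-to-permission mapping; names are illustrative only.
ROLE_PERMISSIONS = {
    "viewer": {"login"},
    "operator": {"login", "restart_service"},
    "administrator": {"login", "restart_service", "update_boot_device"},
}


@dataclass(frozen=True)
class User:
    name: str
    role: str


def is_allowed(user: User, action: str) -> bool:
    """Return True only if the user's role explicitly grants the action."""
    # Unknown roles fall back to an empty permission set (deny by default).
    return action in ROLE_PERMISSIONS.get(user.role, set())


alice = User("alice", "operator")
bob = User("bob", "administrator")

print(is_allowed(alice, "update_boot_device"))  # False
print(is_allowed(bob, "update_boot_device"))    # True
```

With a setup like this, an operator can still log in and restart services, but only an administrator can touch the boot device, which narrows the list of suspects considerably.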
Next, having a proper audit trail will ensure that all login activities are captured and sent to an offline store that cannot be easily tampered with.
Luckily for this company, they have implemented JumpCloud, a cloud-based IAM solution that has the option of sending audit trails to AWS S3, where they can be easily analysed and queried. That will be discussed in an upcoming blog.
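As a rough illustration of the kind of forensic query an audit trail makes possible, the Python sketch below scans a few made-up audit records for boot-related commands on one host. Everything here is an assumption for the sake of the example: the JSON-lines layout, the field names (`timestamp`, `user`, `host`, `command`), the keyword list and the sample data are invented, and a real audit export will have its own schema.

```python
import json
from datetime import datetime

# Made-up sample records for illustration; a real audit export will use
# its own field names and layout.
SAMPLE_LOG = """\
{"timestamp": "2024-01-10T14:05:00+00:00", "user": "alice", "host": "srv01", "command": "systemctl restart nginx"}
{"timestamp": "2024-01-10T16:40:00+00:00", "user": "bob", "host": "srv01", "command": "efibootmgr --bootorder 0002,0001"}
{"timestamp": "2024-01-10T22:00:00+00:00", "user": "patchbot", "host": "srv01", "command": "yum update -y"}
{"timestamp": "2024-01-10T22:10:00+00:00", "user": "carol", "host": "srv02", "command": "efibootmgr --bootorder 0001"}
"""

# Hypothetical keywords that hint at a boot configuration change.
BOOT_KEYWORDS = ("efibootmgr", "grub", "bootorder")


def boot_changes(log_text, host, since):
    """Return (timestamp, user, command) tuples for boot-related
    commands run on `host` at or after `since`, oldest first."""
    hits = []
    for line in log_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        event = json.loads(line)
        when = datetime.fromisoformat(event["timestamp"])
        if event["host"] != host or when < since:
            continue
        if any(keyword in event["command"] for keyword in BOOT_KEYWORDS):
            hits.append((when, event["user"], event["command"]))
    return sorted(hits)


since = datetime.fromisoformat("2024-01-10T00:00:00+00:00")
for when, user, command in boot_changes(SAMPLE_LOG, "srv01", since):
    print(when.isoformat(), user, command)
```

On this sample data, the only match for srv01 is a boot order change made hours before the patching run, which is exactly the kind of detail a hasty “it must be the patching” explanation would miss.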
What should you utter in the next IT Incident?
So for the aspiring manager, instead of uttering “I don’t know” or offering a haphazard explanation, consider using the following response. This is guaranteed to suffice for any IT incident.
“We are in the midst of investigating the issue. Preliminary evidence suggests it could be a hardware issue or a software misconfiguration. At this moment we are not ruling out that it could be a security incident as well.
If it’s the latter, rest assured we have all the necessary best practices and systems in place to get to the bottom of the matter.”
Upon hearing this, the superior will walk away satisfied, confident of facing his boss, because after all, which IT incidents are not due to hardware issues, software errors or malicious human activities?
Henley Ho (CREST Registered Pentester, AWS Certified Solution Architect Professional)
Henley is passionate about helping SMEs lay a strong foundation for a successful Cloud adoption journey, by building a flexible identity framework that enables the company to simplify and securely access IT. He believes the key to overcoming the current manpower crunch lies in empowering individual employees and automating processes as much as possible.