I’ve at times had the “fun” of writing outage notices like the current Google statement. The wording is interesting. IT systems fail, and once it’s time to write about it, management will make sure they are well hidden. The notice will be written by some poor soul who just wants to go home to his family instead of being the bearer of bad news. If he writes something that is not correct, management will be back just in time for the crucifixion.
Things go wrong, we lose data, we lose money. It’s just something that happens.
But squeezing both the uncertainties and hard truths of IT into public statements is an art of its own.
Guess what? I’ve been in popcorn mood – Here are multiple potential translations for this status notice:
“The issue with Compute Engine network connectivity should have been resolved for nearly all instances. For the remaining few remaining instances we are working directly with the affected customers. No further updates will be posted, but we will conduct an internal investigation of this issue and make appropriate improvements to our systems to prevent or minimize future recurrence. We will also provide a more detailed analysis of this incident once we have completed our internal investigation.”
should have been resolved:
we cannot take responsibility. it might come back even just now! we’re not able to safely confirm anything 😦
but we spoke to the guy’s boss and they promised he’ll not do that thing again.
for nearly all instances:
the error has been contained, but we haven’t been able to fix it? we’ve been unable to undo the damage where it happened. we were lucky in most cases, but not in all.
the remaining few remaining instances:
the remaining remaining… they aren’t going away! Still remaining!
the instances that are left we won’t be able to fix the way we fixed the majority, where our fix worked.
Please, just let me go to bed and let someone else do those? I can’t even recall how many 1000s we just fixed and it won’t stop.
working directly with the affected customers:
only if you’re affected and we know for certain we need to come out, we will involve you. we can’t ascertain if there’s damage for you. we need to run more checks and will tell you when / how long. you get to pick between restores and giving us time to debug/fix. (with a network-related issue, that is unlikely)
no further updates will be posted:
since it is safely contained, but we don’t understand it, we can’t make any statement that doesn’t have a high chance of being wrong
we only post here for issues currently affecting more than X users. Now that we fixed it for most, the threshold is not reached
we aren’t allowed to speak about this – we don’t need to be this open once we know the potential liabilities are under a threshold.
will conduct an internal investigation:
we are unwilling to discuss any details until we completely know what happened
we are really afraid now to change the wrong thing.
it will be the right button next time, it has to!
management has signed off on the new switch at last!
to prevent or minimize future recurrence:
we’re not even sure we can avoid this. right now, we would expect this to happen again. and again. and again. recurring, you know? if at least we’d get a periodic downtime to reboot shit before this happens, but we’re the cloud and no one gives us downtimes!
and please keep in mind minimized recurrence means a state like now, so only a few affected instances, which seems under the threshold where we notify on the tracker here.
we really hope it won’t re-occur so we don’t have to write another of those.
Don’t take this too seriously, but these are some of the alarms that go off in my head when I see such phrasing.
Have a great day and better backups!