Hey, Don't Fire that Guy
A few weeks ago at Silicon Valley DevOps Days, in an open spaces discussion about outages, somebody asked a question that has been on my mind ever since. The question was: "Has anyone ever fired someone for causing an outage?" It's not a new question, but the jury is still out on the right answer.
Some of my favorite insight on the topic comes from a post by John Allspaw. In the post, he states that in order to learn from a mistake without inhibition, there must be not only complete trust but also no fear of retribution. I take this view to heart and I'm a devoted defender of the concept.
However, I feel like the view comes with a few provisos. There is possibly a greater philosophy at play in the question. What really got me thinking about this was one of the stories I heard at the conference. The story was about a site outage caused by a technician on the customer support team running a tool written by the dev team, which was known to have a very high risk of causing site outages. The technician had been trained in how to use the tool and, according to the story teller, knew full well the consequences of his actions. So, for the company it was a straightforward decision to fire him. ...Or was it?
Human Error
If you read about human error, you learn that situations within a system and organization can foster human errors. This is often the first explanation of a failure induced by a human. In our investigation here, it's difficult to imagine any resident situation in the system or org which would have led a technician to *knowingly* induce an outage of the application he was paid to support. So, that led me next to consider that perhaps the technician felt running the script would have no connection to an outage. But again, having been fully trained in the consequence of the script, this felt unlikely. Perhaps a final explanation might be that the technician felt the outcome of running the tool justified the risk of an outage. Maybe; but from the storyline, it didn't seem like that was a likely decision.
So, were the theories about human error just somehow inapplicable to this situation?
Theory Prerequisites
That's when I realized something about each of the questions. I realized that they were posited on the assumption that the technician was qualified for and satisfied with his job. I realized that if the technician didn't care about his job or hated his boss, then all of my thinking was flawed. All the explanations are based on the belief that the technician wants what is best for the company. If he had no thought for the success of the company, then it would make sense for him to take the shortest path to accomplishing an immediate goal despite negative side effects. So, rethinking the first question -- could an organizational, non-technical situation foster human errors? Yes, I believe so. A non-technical situation could certainly foster feelings of dissatisfaction or frustration, resulting in a technician who didn't care about the outcome of running a tool that would solve his immediate problem but potentially cause a site outage.
Excellent! So, the key to success is maintaining an inspired workforce, right? Partly, yes; but not exactly.
For the Workforce
If the company is fostering a disgruntled workplace, then the solution is straightforward: keep your employees happy and inspired. But what if your employees are happy? How could the tech's actions be explained?
The next idea that occurred to me was perhaps it wasn't the organizational situation at all, but actually the employee himself. What if the employee was generally just a disgruntled or careless person? DevOps and other strong cultures are built upon trust, but that trust needs to be built on the prerequisite that you hire the right people. If poor hiring choices are made, the whole ecosystem breaks down.
A Different Type of Outage
As it turns out, although hiring the right people is such an important task to do well, it's often underdone and rushed. And why? For the exact same reason we end up with technical debt: urgency. If we've learned anything from technical debt, we know that we create it to solve an immediate problem, but that solution leaves us in a much worse long-term situation. If we know this is true, then maybe we should treat firing an employee the same way we treat a site outage. Specifically, we should postmortem all the steps that led up to hiring: the phone screen, the recruiter's notes, the interviews, the round-up, the closing. I think that if firing were treated as a much costlier action, then perhaps hiring would be conducted with a much higher bar. Ultimately, in the case of the fired technician, perhaps the fault lies more with the hiring manager for putting the tech in the situation to use the tool at all.
So, to bring it back to the original question: would you ever fire anyone who caused an outage? My answer is a lot more complicated now. I will continue to apply the concepts of human error and would never fire an employee only because of an outage. Would I fire an employee who consistently underperforms, is habitually unhappy and careless, costs more than he or she produces, continues to fail at personal performance improvement plans, and is uninterested in other roles within the company? Very possibly. Whatever the case, if an engineer is fired for any reason, I want to postmortem how we came to the situation, starting with the initial phone screen in the hiring process.
Open Questions
What metrics can drive improvement?
Comments
There is a proviso that is worthwhile to underscore: if intentional malice is at play, you're not talking about error.
Having said that, the notion of a blameless postmortem or Just Culture cannot have “escape hatches” like the provisos you mention. The onus is on the organization to provide an environment where the context of the accident and the individual can (as best it can) be investigated safely in order to understand HOW the event took place. Not who or why.
The attempt is to *describe*, not *explain*.
Some thoughts:
"But again, having been fully trained in the consequence of the script, this felt unlikely."
What we (not the engineer himself, but us trying to understand the event retrospectively, since we know the outcome and he didn’t) deem unlikely or likely isn't entirely useful here. We do have someone, however, who can answer the question about how likely or not he was to connect his work with the result (the outage): the engineer himself. What did he say when he was asked that? Was he asked that? Did he feel safe in answering that question honestly?
Perhaps more importantly: do his peers at the company now feel safe in answering that question honestly, after that engineer is fired?
"I realized that they were posited on the assumption that the technician was qualified for and satisfied with his job."
This is not the case with the New View perspective of human error. There is no assumption whatsoever on qualifications or job satisfaction. Mistakes can arise from an almost infinite number of contributing factors.
Let's look at the implication: does job satisfaction correlate with fewer mistakes? If you don't like your job, are you more "prone" to making mistakes? These questions are aimed only at the individual's predispositions, which should be a red flag for those looking to understand how an event came to be.
What we don't know in the story are critical details:
a) How does the "story teller" know that the technician knew "full well" the consequences of his actions, exactly? If we can find that out, then we can all get rich selling that technique as mind-reading. :)
What ways can we make sure that the story teller's assertion isn't a manifestation of the Fundamental Attribution Error?
b) "The technician had been trained in how to use the tool" - should we infer here that the training and the tool’s design are perfect, and that it’s the human that was the problem?
c) “…tool written by the dev team, which was known to have a very high risk of causing site outages"
Should we not look at the tool as well? Is its design free of fault? If a tool brings such high risk, would firing the person rid you of similar outages in the future? In other words: is it possible that someone with the same training and high job satisfaction could bring the site down using the same tool in the future?
Retrospective analysis that connects a 'human error' all the way back to hiring practices still places the burden on the individual, while pretending to be about the organization ("If our hiring practices were good enough, we wouldn't hire people who make mistakes.") - is this possible?
Make no mistake, there are *many* reasons to terminate someone's employment.
Many of the ones that you mentioned. But these have nothing to do with accidents and making mistakes.
If they do, then you're taking what Dekker called the "Old View", not the "New View" on error that blameless postmortems sit atop.
I agree that post mortem, we seek to describe and understand what happened and how; not make excuses. Rather than escape hatches or exceptions, I'm more specifically considering that a time may come at which it is appropriate to relieve an individual of duty. That this time may come on the heels of a site outage should be coincidence and should not affect the investigation.
I'm glad you pointed out those flaws and missing details in the story. It's a good case study. The postmortem conducted was not thorough and I agree that the case for terminating the employee was weak. Several questions were not asked and others were overlooked -- especially the tool that led to the outage; that should definitely be addressed! Furthermore, I feel that termination was antithetical to a learning culture. It sent the remaining workforce a chastising message that mistakes may not always be tolerated. The story really served as the inception for the thoughts that followed, but your insight certainly identified some holes in my reasoning along the way.
About the New View: I'm curious what your stance is on role qualifications. At what point do you assume some responsibility on behalf of the humans in the equation? Would you put a pre-schooler in the pilot seat of a commercial airliner full of passengers? I believe that there is a minimum bar of skill and culture necessary to hire someone; and that bar remains or rises higher as that individual continues his or her employment with your team. Our goal is to be vigilant and retrospective, to always learn and improve, and to be tolerant and accepting; but can some mistakes be avoided altogether by putting the (theoretically) right people in the job?
The article's call to action of reviewing our hiring practices is intended as a suggestion that this may be a place to look for further learning. Can we discover patterns in our hiring that consistently lead to terminations (whatever the reason)? Is it possible to improve our rubric and practices such that firing almost never happens and people rarely quit? By no means is the suggestion that one could avoid site outages or other errors (I think this is orthogonal), but simply that the confluence of criteria necessary to encourage termination could occur less frequently.