Wednesday, July 2, 2014

Hey, Don't Fire that Guy

A few weeks ago at Silicon Valley DevOps Days, in an open spaces discussion about outages, somebody asked a question that has been on my mind ever since: "Has anyone ever fired someone for causing an outage?" It's not a new question, but the jury is still out on the right answer.

Some of my favorite insight on the topic comes from a post by John Allspaw. In the post, he states that in order to learn from a mistake without inhibition, there must be not only complete trust but also no fear of retribution. I take this view to heart and I'm a devoted defender of the concept.

However, I feel like the view comes with a few provisos. There is possibly a greater philosophy at play in the question. What really got me thinking about this was one of the stories I heard at the conference. The story was about a site outage caused by a technician on the customer support team running a tool written by the dev team, which was known to have a very high risk of causing site outages. The technician had been trained in how to use the tool and, according to the storyteller, knew full well the consequences of his actions. So, for the company it was a straightforward decision to fire him. ...Or was it?

Human Error

If you read about human error, you learn that conditions within a system and organization can foster human errors. This is often the first explanation of a failure induced by a human. In our investigation here, it's difficult to imagine any condition in the system or org that would have led a technician to *knowingly* cause an outage of the application he was paid to support. So, that led me next to consider that perhaps the technician felt running the script would have no connection to an outage. But again, since he had been fully trained in the consequences of the script, this felt unlikely. Perhaps a final explanation might be that the technician felt the outcome of running the tool justified the risk of an outage. Maybe; but from the storyline, that didn't seem like a likely decision.

So, were the theories about human error just somehow inapplicable to this situation?

Theory Prerequisites

That's when I realized something about each of the questions. I realized that they were posited on the assumption that the technician was qualified for and satisfied with his job. I realized that if the technician didn't care about his job or hated his boss, then all of my thinking was flawed. All the explanations are based on the belief that the technician wants what is best for the company. If he had no thought for the success of the company, then it would make perfect sense for him to take the shortest path to accomplishing an immediate goal despite negative side effects. So, rethinking the first question -- could an organizational, non-technical situation foster human errors? Yes, I believe so. A non-technical situation could certainly foster feelings of dissatisfaction or frustration, resulting in a technician who didn't care about the outcome of running a tool that would solve his immediate problem but potentially cause a site outage.

Excellent! So, the key to success is maintaining an inspired work force, right? Partly, yes; but not exactly.

For the Workforce

If the company is fostering a disgruntled workforce, then the solution is straightforward: keep employees happy and inspired. But what if your employees are happy? How could the tech's actions be explained?

The next idea that occurred to me was perhaps it wasn't the organizational situation at all, but actually the employee himself. What if the employee was generally just a disgruntled or careless person? DevOps and other strong cultures are built upon trust, but that trust needs to be built on the prerequisite that you hire the right people. If poor hiring choices are made, the whole ecosystem breaks down.

A Different Type of Outage

As it turns out, although hiring the right people is such an important task to do well, it's often underdone and rushed. And why? For the exact same reason we end up with technical debt: urgency. If we've learned anything from technical debt, we know that we create it to solve an immediate problem, but that solution leaves us in a much worse long-term situation. If we know this is true, then maybe we should treat firing an employee the same way we treat a site outage. Specifically, we should postmortem all the steps that led up to hiring: the phone screen, the recruiter's notes, the interviews, the round-up, the closing. I think that if firing were treated as a much costlier action, then perhaps hiring would be conducted with a much higher bar. Ultimately, in the case of the fired technician, perhaps the fault lies more with the hiring manager for putting the tech in the situation to use the tool at all.

So, to bring it back to the original question: would you ever fire anyone who caused an outage? My answer is a lot more complicated now. I will continue to apply the concepts of human error and would never fire an employee only because of an outage. Would I fire an employee who consistently underperforms, is habitually unhappy and careless, costs more than he or she produces, continues to fail at personal performance improvement plans, and is uninterested in other roles within the company? Very possibly. Whatever the case, if an engineer is fired for any reason, I want to postmortem how we came to the situation, starting with the initial phone screen in the hiring process.

Open Questions

What metrics can drive improvement?

Sunday, May 4, 2014

Debugging Gitosis "Read Access Denied"

I recently set up gitosis to serve some side projects that I'd like to share with a few friends. I've used it in the past professionally and really enjoy the sanity it brings to managing users and permissions.

Things started off pretty well and I was committing and pushing changes in no time. A week or so passed and I wanted to add a new user to a project. I made the necessary changes to my clone of the gitosis-admin project, but when I tried to push my changes upstream, I was suddenly unable to push! This was a major issue since the admin project is the heart of the configuration.

I put on my spelunking hat, ssh-ed to the box, and switched to the git user to start debugging. The first thing I did was revert the gitosis.conf file back to its original state. You can find this file in ~git/repositories/gitosis-admin.git/gitosis.conf. Changing it had no effect.

I took a closer look at the error message from my failed push command, and noticed that it was complaining "Read Access Denied," but for a different user name (I could see this because I had loglevel = DEBUG). There are a total of three users involved in the projects, one of which I'd just added and only locally. So, on the server, there were only two users at play. OK, so maybe that user is causing issues. I next removed his key file from the server. This file was at ~git/repositories/gitosis-admin.git/gitosis-export/keydir/.

I tried to push again. No luck.

Hmm, next I looked into ~git/.ssh/authorized_keys. I found there was still a reference to the user there, so I deleted that line.

I tried to push again. It worked!

Ok, so are things working now? I tried to fetch. No dice.

So, when I pushed, gitosis re-applied the configuration and undid all of my debugging steps, essentially reverting the system back to the previous state, including the additions for the new user.

At this point, it dawned on me to check my ssh agent identities. Lo and behold, I had two identities and one of them was for the other user! Oops! This was completely my mistake. I had generated his keys a few weeks ago and tested them to ensure they worked. Apparently I had not been so thoughtful as to delete the identity when done.

After running ssh-add -D, things started working again.
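
The mix-up is easier to spot by looking at what the agent is offering before touching the server. Here is a minimal sketch, not part of the original debugging session: `count_identities` and `check_agent` are hypothetical helper names, and the parsing assumes the usual `ssh-add -l` output, where each key line starts with its bit length.

```shell
#!/usr/bin/env bash
# Sketch: warn when the ssh agent holds more than one identity.
# Both function names are hypothetical helpers, not part of gitosis or OpenSSH.

# Read `ssh-add -l` output on stdin and print the number of loaded keys.
# Key lines start with the key size (e.g. "2048 SHA256:..."); the message
# "The agent has no identities." does not, so it counts as zero.
count_identities() {
    # grep -c exits non-zero when the count is 0; `|| true` keeps that from
    # being treated as a failure while still printing "0".
    grep -c -E '^[0-9]+ ' || true
}

# Print a warning and the key list when more than one identity is loaded.
check_agent() {
    local n
    n=$(ssh-add -l 2>/dev/null | count_identities)
    if [ "$n" -gt 1 ]; then
        echo "warning: agent holds $n identities; the server may match the wrong one"
        ssh-add -l
    fi
}
```

Calling `check_agent` before a push prints a warning when more than one key is loaded; `ssh-add -D` then clears the agent as described above.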

Monday, March 31, 2014

Use Jenkins REST API to Update Job Configurations Automatically

I manage several Jenkins jobs for a single git repository that has several modules (i.e. directories). Each of the modules has the same structure and the job for each of the modules is essentially the same, differentiated only by the name of the directory.

I recently needed to make a change to each of the job configurations and rather than do so by hand, I thought I'd investigate the possibility of doing so automatically using the Jenkins REST API.

I had originally created all the jobs using the API by posting to the generic createItem URI. I had since moved the jobs into folders, and hitting createItem only created new jobs rather than updating the existing ones.

I did some searching on the internet and couldn't find any help. I eventually found my answer on the api page for the folder. I'm posting my code below for anyone else with similar questions.

# First, get the template job looking like you want

read -s token
# type token from$userName/configure

# Download the configuration XML for the template job (which will be our model template)
curl -v -u "bvanevery:$token" > generic-config.xml

# My modules
declare modules=('module1' 'module2' 'module3')

# POST the updated configuration XML to Jenkins
for m in "${modules[@]}"; do
   echo "module $m"
   sed "s/MODULE/$m/g" generic-config.xml > "$m-config.xml"
   curl -v -X POST --data-binary "@$m-config.xml" -u "bvanevery:$token" \
        -H 'Content-Type: application/xml' \
        "$m/config.xml"
done

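The templating step can be tried on its own before anything is posted to Jenkins. The sketch below is an illustration under stated assumptions: `render_config` is a hypothetical helper name, and the one-line template is a made-up stand-in for the real generic-config.xml, which will be much larger.

```shell
#!/usr/bin/env bash
# Sketch of the templating step only; no Jenkins server is contacted.

# render_config is a hypothetical helper: it reads a template on stdin and
# replaces every MODULE placeholder with the module name given as $1.
render_config() {
    sed "s/MODULE/$1/g"
}

# Hypothetical, heavily trimmed stand-in for generic-config.xml.
template='<project><description>Build MODULE</description><command>make -C MODULE</command></project>'

# Print the rendered configuration for each module.
for m in module1 module2; do
    printf '%s\n' "$template" | render_config "$m"
done
```

Because the module name is spliced straight into the sed expression, a name containing `/` or `&` would need escaping first; plain directory names like the ones above are fine.
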
Tuesday, January 7, 2014

Script to Fix Gerrit: LDAP floods log for gerrit-only users

We recently upgraded to Gerrit 2.7 and started to see lots of LDAP related errors in the logs. We tracked it down to this bug report:

I wrote a quick script to fix the issue and thought I'd share it.

read -s pwd

echo "SELECT external_id FROM account_external_ids WHERE external_id LIKE 'gerrit:%';" | mysql -h -u gerrit -p${pwd} reviewdb | sed 's/^gerrit://' > usernames.txt

for u in $(< usernames.txt); do
if ! id "$u" > /dev/null 2>&1; then
   echo "DELETE FROM account_external_ids WHERE external_id = 'gerrit:$u' LIMIT 1;" | mysql -h -u gerrit -p${pwd} reviewdb