Wednesday, November 11, 2015

What Makes a Good Engineering Culture?

This question was asked several years ago on Quora and has gotten some great answers! I was going through some of my old drafts and came across an answer that I had almost completed writing and decided to finally push through (better late than never!). I've copied the full text of the answer below.


In my experience, a company's engineering culture is possibly the most important thing a person can consider when evaluating a job offer. I've seen many Quora questions touch on similar topics (e.g. "Where should I work?", etc.), and I feel like almost all of the answers are ultimately rooted in culture. Thank you for asking this question.

My answer to this question is rooted in my ten years of software engineering, numerous blogs & books on the topic, and countless conversations with engineers. It also comes with the expectation that no answer to this question can speak for all software engineers. Everyone has different and changing needs, and with them different and changing views of what makes a good culture.

So, here goes -- the prevailing concepts that have stuck with me are as follows.


High quality software engineering is the product of a team. No one individual can be expected to deliver, nor take credit for, a successful product on his or her own. This gets fuzzy in small startups where there may only be one engineer, but otherwise holds true. A culture that celebrates one individual at the cost of another is making a grievous mistake.

There is an important distinction to note here about what comprises a *team* of engineers versus a group. The distinction is that a group is not a team until everyone in the group is committed to the purpose [1]. In my experience, this commitment comes from inspirational leadership and transparency. The fact that an engineer is employed by a company is not reason enough to incite the determination, dedication, and thoughtfulness necessary to produce quality code. Committed engineers are engineers who are proud to work where they work and excited to talk about their jobs and their company's products.

Stake in the Product

A healthy culture builds a product that means something to the people building it. This can happen in a lot of ways, one of which is by connecting the engineers with the users. Some examples are giving engineers the opportunity to sit with the customer support team, to join sales or product folks on customer visits, or to attend company conventions. As an engineer, it is such a rewarding feeling to meet a customer and hear their story about how a feature you built makes their life better or how the bug you fixed made their day. This kind of connection is what makes the software you build at your job significantly more meaningful than any college project or homework assignment.

The next critical piece of this is a healthy relationship between the product and engineering teams. I've always been baffled by product management teams that treat engineers like they are incapable of contributing to product design. Product and engineering need to function together like two sides of a brain, and respect needs to flow in both directions. Giving engineers a say in the product design simultaneously gives them a stake in the outcome. People are much more likely to care about the success of something they helped design versus something they were simply told how to do.

Equally, a culture that does not respect the purpose of its product team is no better off than one that disrespects engineering. The truly successful cultures of which I have been a part were made up of cross-functional teams of product & engineering talent working together to set short term and long term goals; write achievable design specifications; and build software that makes both groups proud.


Next up: Experiments! Happy cultures promote experimenting with features over endlessly debating them [2]. One company where I've seen this really done well is Yammer, where each feature has a clear definition of success, but is assumed to be unnecessary until proven otherwise by objective usage metrics and customer feedback [3]. This precludes endless debates or analysis paralysis, and effectively lets product and engineering teams focus on what matters: building software. A positive side effect of writing software that inherently supports experimenting is that it encourages a culture that commits to small, manageable deliverables.

Encourage Learning

Another hallmark is a culture that encourages establishing a *deep understanding* of the tools, workflows, and responsibilities that go into producing production-ready software. A team that knows __why__ something must be done in a certain fashion will make substantially fewer mistakes than a team kept ignorant by relying too heavily on process scripts. Intuitively, it seems natural to reduce mistakes by automating processes and restricting influence. In my experience, though, it is a fine balance between the safety of automated process and the danger of freedom that produces a well-rounded team, one more apt to optimize the whole rather than its own piece of it. Automated process saves time and avoids unintended errors, but that intentionally dangerous freedom lets engineers know that they are respected and responsible for the success of their own products. That responsibility provides the motivation to go the extra mile to truly understand, for example, how a service is going to function in a production environment, or how version control really works.

Deployment Is Just the Beginning

Engineering is more than development; it's also deployment and support. Time spent in development is just a fraction of a project's lifetime. The majority of a product's life is in production. Engineering teams that fail to structure priorities around this concept will either produce poor quality products or endlessly miss deadlines because they are fixing bugs.

Learn From Failure

Nothing is perfect. Even rocket scientists at NASA make mistakes [4]. Expect failure and plan for it, in your code and in your culture. Learning from failures and *improving* from them makes a team stronger. I won't say too much about this because much has already been said elsewhere. Etsy has published some great material on this subject [5] and [6]. The Pragmatic Programmers also published a great book on ways to mitigate failures in production [7]. For my part, I'm proud of how we learned to learn from failure at Box [8].


Ultimately, all of this represents my own opinion based on my time spent engineering software. One prescription does not fit all, and any prescription should be an ever-changing vision.

[1] Leading Lean Software Development: Results Are Not the Point, by Mary Poppendieck and Tom Poppendieck (ISBN 9780321620705)
[2] Always ship trunk: Managing change in complex websites
[3] Why Yammer believes the traditional engineering organizational structure is dead
[4] The Martian
[5] Failure is an option (Velocity 2015)
[6] Kitchen Soap – Learning from Failure at Etsy
[7] Release It! (The Pragmatic Bookshelf)
[8] The Three Letters that Saved My Tech Debt

Sunday, November 8, 2015

Managing Runtime Configurations

Configuration Headaches

Managing runtime application configurations in large-scale, heterogeneous environments is a total pain. Over the last five years I have attacked this problem in various ways, each with its own grace and flaws. The goal of this post is to sum up the evolution of my experience and hopefully impart some insight to any folks in a similar plight.
It starts with the challenges:
  1. Different environments get different configurations
  2. Different services in the same environment get different configurations
  3. Configuration files get massive and treacherous
  4. Configuration files get complex quickly with cross references
  5. Configurations have no safety against typos in keys (or values)
  6. Configurations have essentially no type safety guarantees
  7. Configurations have no README indicating their intent or usage
  8. Configurations can be accessed from anywhere in a code base (or outside it) in inconsistent ways

Sidebar: Defining Runtime Application Configurations

For the context of this article, I use the term “runtime application configurations” to represent the configurations used by an application at runtime. I do not mean configuration-as-code tools like Puppet, Chef, or Ansible. For example, I would consider the log configurations in a logger.xml file for a JVM application that I own to be runtime app configs. I would not consider the configuration of Apache on a web server a runtime app config in this context; I would consider that a system configuration. More specifically, I would consider it configuration for an application whose source code I'm not writing.

Evolution of a solution

Compound keys

The most popular convention I've observed for managing configurations is flat files such as INI, JSON, YAML, or XML; very rarely are they in the same language as the code consuming them. My assumption is that this accomplishes a few things:
  1. They can be edited by non-programmers
  2. They can be consumed by multiple programming languages, thus freeing operations teams from managing multiple manifestations of the same configuration values
  3. In the case of compiled code, configurations do not need to be compiled (or recompiled when they change)
This restriction provides powerful guarantees of simplicity, but it is also limiting. One major limiting factor is that without a separate management system, it's impossible to have different configuration values per runtime environment. This was a major problem for us since we shared configuration files between all of our environments, but needed to set different values for the same configuration key based on the environment.
Our initial solution was to encode the environmental context onto the keys, so that we could effectively define unique values for a given key based on the environment that would use it. So instead of a single key-value pair, we would have multiple key-value pairs differentiated by environment. Here is an example of log levels for different environments:
level<dev> = DEBUG
level<staging> = INFO
level<prod> = WARN
When the INI gets parsed natively, each of these levels is parsed as a unique key. The next step is custom code in the application to split the environment suffix (e.g. <dev>) off of the keys and then figure out which environment and value applies in its running context.
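To make that concrete, here is a rough sketch of the kind of resolution code this approach required. It is a simplified reconstruction rather than the original implementation; the function name and the APP_ENV environment variable are just illustrative.

<?php
// Resolve compound keys like "level<dev>" down to plain keys for one environment.
function resolveForEnvironment(array $rawIni, string $env): array
{
    $resolved = [];
    foreach ($rawIni as $compoundKey => $value) {
        // Split "level<dev>" into the base key "level" and the environment "dev".
        if (preg_match('/^(?P<key>[^<]+)<(?P<env>[^>]+)>$/', $compoundKey, $m)) {
            if ($m['env'] === $env) {
                $resolved[$m['key']] = $value;
            }
        } else {
            // Keys without an environment suffix apply everywhere.
            $resolved[$compoundKey] = $value;
        }
    }
    return $resolved;
}

$raw = [
    'level<dev>'     => 'DEBUG',
    'level<staging>' => 'INFO',
    'level<prod>'    => 'WARN',
];
$config = resolveForEnvironment($raw, getenv('APP_ENV') ?: 'dev');
echo $config['level'];   // DEBUG when APP_ENV=dev
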
Sound complicated? It is. Making this work involved writing some complicated INI parsing code as well as figuring out a way to tell an application the environment in which it’s running. The result was a brittle system that was error prone and slowed down onboarding. Furthermore, if any configurations needed to be shared between projects, those projects also needed to solve the parsing and environment setting problems.

Moving away from flat files

I asked myself, “Why are configurations always in flat files anyway?” I couldn't come up with a convincing answer, so my next attempt was to write a straightforward configuration framework in PHP that expected different values for each key based on environmental context. All configuration files were PHP files that returned a large array. I experimented with model objects that could build configurations, but ultimately discarded the idea because I felt the time spent solving the edge cases would outweigh the incremental value.
With this system, the mind-bending key-to-environment relationship was somewhat clearer. Another gain was removing cross references to other keys within the INI values. Since the values were set with PHP, it was possible to reuse values or base values off of each other, e.g. url = "{$scheme}{$domain}{$path}".
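To illustrate, a configuration file in that framework looked roughly like the following. This is only a sketch; the keys, hosts, and file name are made up for the example.

<?php
// db.conf.php -- one PHP file returning an array that covers every environment.
$scheme  = 'https://';
$domains = ['dev' => 'dev.example.com', 'prod' => 'www.example.com'];
$path    = '/api';

return [
    'dev' => [
        'url'   => "{$scheme}{$domains['dev']}{$path}",
        'level' => 'DEBUG',
    ],
    'prod' => [
        'url'   => "{$scheme}{$domains['prod']}{$path}",
        'level' => 'WARN',
    ],
];
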
Although the system was less brittle and easier to understand, the configuration files themselves were still rather large and it wasn’t clear how defaults worked between environments. Worse, the files could only be used by PHP applications.

A separate configuration management system

My biggest problem was the size and complexity of putting configurations for every environment in the configuration files. So I decided to look into using a separate system to manage the per-environment differences, which could then deliver only the necessary content to a given consumer. I landed on Puppet since we already used it extensively elsewhere. This allowed me to write files containing only the configurations that mattered in the respective environments. Based on this, going back to flat files was possible. I could also tie configuration values together in Puppet before writing the files, so there was no need for cross references within the configuration file itself.
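As a concrete illustration, the log-level example from earlier collapses back down to one plain key per environment, with Puppet rendering a different flat file on each class of host (the file name here is just for the example):

log.conf rendered on a dev host:
level = DEBUG

log.conf rendered on a prod host:
level = WARN
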
This was great; my first five problems were solved. But still, type safety was not enforced, configuration files had no guaranteed documentation, and understanding how they were used within an application was a grep nightmare.

Configuration model objects

I decided to revisit configuration model objects, though this time not as builders but as accessors. By funneling all access of configurations through a single point, I figured that I could enforce a few things.
I buried the configuration parsing code inside an abstract class (aptly named Configuration) with protected methods for getting at the parsed values. Configuration model classes could then extend the base and expose relevant methods for their configurations. For example, a logging configuration class would have a getLevel() method, which behind the scenes would parse the configuration file and return the value. Any user that wanted to load configurations could only do so by using an existing Configuration class or writing a new one. Next, I tied the name of the configuration file to the name of the class, such that if a class were named LogConfigurations, the configuration file behind the scenes needed to be named log.conf. Finding usages of a given configuration file became trivial.
I added type expectation methods to the parsing code so that invalid values would raise exceptions. The Configuration class requires that all extending classes implement a README() method, so that users understand the class's intention and expected INI contents. This is coupled with a generic unit test that parses the README() output and ensures it can actually be used by the class. Furthermore, each of the accessor methods has its own documentation.
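Here is a stripped-down sketch of the shape this took. The real parsing, type checking, and README validation were more involved, and any method names beyond those mentioned above are illustrative.

<?php
abstract class Configuration
{
    // Every configuration class must document its intent and expected contents.
    abstract public function README(): string;

    // Derive the backing file name from the class name, e.g. LogConfigurations -> log.conf.
    protected function fileName(): string
    {
        $base = strtolower(preg_replace('/Configurations?$/', '', static::class));
        return $base . '.conf';
    }

    // Type-checked accessor: raise on missing keys or non-string values.
    protected function getString(string $key): string
    {
        $file   = $this->fileName();
        $values = parse_ini_file($file);
        if (!is_array($values) || !isset($values[$key]) || !is_string($values[$key])) {
            throw new UnexpectedValueException("Missing or invalid '$key' in $file");
        }
        return $values[$key];
    }
}

class LogConfigurations extends Configuration
{
    public function README(): string
    {
        return "Logging settings. Expects 'level' (DEBUG|INFO|WARN) in log.conf.";
    }

    public function getLevel(): string
    {
        return $this->getString('level');
    }
}

// Usage: the only way to read log settings is through the model class.
$level = (new LogConfigurations())->getLevel();
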
Configuration model classes also opened the door for more sophisticated configurations. Since access is inside a class, and not just array referencing, the class can do smart things like tie multiple values together or call out to other classes. One really great manifestation of this was the ability to create a ProjectConfiguration class which uses the project's version control to load a configuration file's contents.

Get involved

That’s where I am today. It’s been a fun journey and I’m happy (atm) with where things are. I’m excited to see where this goes next and to learn other ways folks have found to solve this problem.
One key component that I’d like to see next is the configuration delivery mechanism. Currently, Puppet solves the problem reasonably for short lived PHP processes. However, the two areas that I’d like to improve are:
  1. Automatic reloading by long-lived processes. Maybe this is just from a file watch (see the sketch after this list).
  2. A better interface for managing the configurations. This could get extensive. One tool here would be validation, which could be much more than just type checking: semantic checks and plugins for special configuration groups (e.g. validate that a database in a slave section is up and is not a master). Typos in keys would be completely avoided. Another would be the ability to see what values a given environment would get.
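For the first item, here is a very rough sketch of what a file-watch-based reload might look like in a long-lived PHP worker. None of this exists today; it's purely illustrative.

<?php
// Hypothetical reload loop: re-read the file only when its modification time changes.
$file   = 'log.conf';
$mtime  = 0;
$config = [];

while (true) {
    clearstatcache(true, $file);
    $current = filemtime($file);
    if ($current !== false && $current !== $mtime) {
        $config = parse_ini_file($file) ?: $config;   // keep old values on parse failure
        $mtime  = $current;
    }
    // ... do one unit of work with $config ...
    sleep(5);
}
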
Please post back with comments or pull requests!

Thursday, May 28, 2015

Make doing the right things easy enough that doing the wrong things is embarrassing

Make the right things easy

It's not a new concept: doing the right thing should be easy, doing the wrong thing should be hard. In blogs, books, or management guides, that's the advice you get for preventing people from doing the wrong thing. In practice, this is a subtle science and I believe this little apothegm needs some tweaking. My recommendation is "Make doing the right things easy enough that doing the wrong things is embarrassing."

What's the difference?

In practice, making the wrong things hard can itself be hard. Often the wrong thing is an established practice or convention. It may be the only way some people know how to get things done. Even worse, it may just be easy. When you're faced with this kind of ill luck, you need to look at the problem from a different angle. Making the right things easy becomes a cultural endeavor.

The subtle part

It starts with the leadership. Messaging from the leadership has a direct impact on the goals and priorities of the engineering workforce. If engineers know that taking longer to do the right thing is encouraged, then there is a much higher likelihood they will take that path than if that encouragement is absent. Leadership can reinforce the message by rewarding the engineers who go above and beyond to do the recognized right thing in challenging situations. They can sponsor development of tools that will make doing the right things easier.

All of this will *encourage* good behavior, but it won't do too much to discourage bad behavior.

For every encouragement there is an equal and opposite discouragement

It's important to me to be clear about what I mean when I say "embarrass." I do not mean that leadership or any team member should explicitly embarrass another team member. A culture that publicly shames or embarrasses its team is caustic, foments fear, and dampens collaboration [1]. You don't want that kind of culture, but you don't want to hide the bad stuff either. My suggestion is that the trick is in believing in good intentions.

Unless your hiring process is extremely dysfunctional, you should be able to assume your engineers have good intentions. That means that by encouraging the desired behavior, people will naturally be discouraged from doing otherwise. As your tools and processes continue to make the encouraged behaviors easier, it will become increasingly difficult to justify doing the wrong things.

An example of this is using leader boards to send a message. Consider a situation in which you want to encourage your engineers to write more unit tests. First, you invest in making writing tests *really* easy. Then, you start tracking unit test code coverage per commit. Soon thereafter, you can publish a leader board of "person to % code coverage per commit." You won't be explicitly calling anyone out for habitually not providing reasonable code coverage, but it will be clear over time that certain individuals may not be doing so [2].


I hope this opens up a healthy discussion about how we can stop complaining about bad behaviors and start feasibly changing them today.


[1] See Leaders Eat Last by Simon Sinek for a great exposition of this.

[2] Note I don't believe in metrics that track raw quantity. I don't feel that such measures accurately capture the intention. Metrics that people can be proud of should track the quality of their work over time, not "gamable" counts that are easily subverted.

Wednesday, July 2, 2014

Hey, Don't Fire that Guy

A few weeks ago at Silicon Valley DevOps Days, in an open spaces discussion about outages, somebody asked a question that has been on my mind ever since. The question was as follows: "Has anyone ever fired someone for causing an outage?" It's not a new question, but the jury is still out on the right answer.

Some of my favorite insight on the topic comes from a post by John Allspaw. In the post, he states that in order to learn from a mistake without inhibition, there must be not only complete trust but also no fear of retribution. I take this view to heart and I'm a devoted defender of the concept.

However, I feel like the view comes with a few provisos. There is possibly a greater philosophy at play in the question. What really got me thinking about this was one of the stories I heard at the conference. The story was about a site outage caused by a technician on the customer support team running a tool written by the dev team, which was known to have a very high risk of causing site outages. The technician had been trained in how to use the tool and, according to the storyteller, knew full well the consequences of his actions. So, for the company it was a straightforward decision to fire him. ...Or was it?

Human Error

If you read about human error, you learn that situations within a system and organization can foster human errors. This is often the first explanation of a failure induced by a human. In our investigation here, it's difficult to imagine any resident situation in the system or org which would have fostered a technician to *knowingly* induce an adverse outage of the application he was paid to support. So, that led me next to consider that perhaps the technician felt running the script would have no connection to an outage. But again, having been fully trained in the consequences of the script, this felt unlikely. Perhaps a final explanation might be that the technician felt the outcome of running the tool justified the risk of an outage. Maybe; but from the storyline, it didn't seem like that was a likely decision.

So, were the theories about human error just somehow inapplicable to this situation?

Theory Prerequisites

That's when I realized something about each of the questions: they were premised on the assumption that the technician was qualified for and satisfied with his job. I realized that if the technician didn't care about his job or hated his boss, then all of my thinking was flawed. All the explanations are based on the belief that the technician wants what is best for the company. If he had no thought for the success of the company, then it would make easy sense for him to take the shortest path to accomplishing an immediate goal despite negative side effects. So, rethinking the first question -- could an organizational, non-technical situation foster human errors? Yes, I believe so. A non-technical situation could certainly foster feelings of dissatisfaction or frustration, resulting in a technician who didn't care about the outcome of running a tool that would solve his immediate problem but potentially cause a site outage.

Excellent! So, the key to success is maintaining an inspired work force, right? Partly, yes; but not exactly.

For the Workforce

If the company is fostering a disgruntled workplace, then the solution is straightforward: keep them happy and inspired. But, what if your employees are happy? How could the tech's actions be explained?

The next idea that occurred to me was perhaps it wasn't the organizational situation at all, but actually the employee himself. What if the employee was generally just a disgruntled or careless person? DevOps and other strong cultures are built upon trust, but that trust needs to be built on the prerequisite that you hire the right people. If poor hiring choices are made, the whole ecosystem breaks down.

A Different Type of Outage

As it turns out, although hiring the right people is such an important task to do well, it's often underdone and rushed. And why? For the exact same reason we end up with technical debt: urgency. If we've learned anything from technical debt, we know that we create it to solve an immediate problem, but that solution leaves us in a much worse long term situation. If we know this is true, then maybe we should treat firing an employee the same way we treat a site outage. Specifically, we should postmortem all the steps that led up to hiring: the phone screen, the recruiter's notes, the interviews, the round-up, the closing. I think that if firing were treated as a much costlier action, then perhaps hiring would be conducted with a much higher bar. Ultimately, in the case of the fired technician, perhaps the fault lies more with the hiring manager for putting the tech in the situation to use the tool at all.

So, to bring it back to the original question: would you ever fire anyone who caused an outage? My answer is a lot more complicated now. I will continue to apply the concepts of human error and would never fire an employee only because of an outage. Would I fire an employee who consistently underperforms, is habitually unhappy and careless, costs more than he or she produces, continues to fail at personal performance improvement plans, and is uninterested in other roles within the company? Very possibly. Whatever the case, if an engineer is fired for any reason, I want to postmortem how we came to the situation, starting with the initial phone screen in the hiring process.

Open Questions

What metrics can drive improvement?

Sunday, May 4, 2014

Debugging Gitosis "Read Access Denied"

I recently set up gitosis to serve some side projects that I'd like to share with a few friends. I've used it in the past professionally and really enjoy the sanity it brings to managing users and permissions.

Things started off pretty well and I was committing and pushing changes in no time. A week or so passed and I wanted to add a new user to a project. I made the necessary changes to my clone of the gitosis-admin project, but when I tried to push my changes upstream, I was suddenly unable to push! This was a major issue since the admin project is the heart of the configuration.

I put on my spelunking hat, ssh-ed to the box, and switched to the git user to start debugging. The first thing I did was revert the gitosis.conf file back to its original state. You can find this file in ~/git/repositories/gitosis-admin.git/gitosis.conf. Changing it had no effect.

I took a closer look at the error message from my failed push command and noticed that it was complaining "Read Access Denied," but for a different user name (I could see this because I had loglevel = DEBUG). There are a total of three users involved in the projects, one of whom I'd just added, and only locally. So, on the server, there were only two users at play. OK, so maybe that user is causing issues. I next removed his key file from the server. This file was at ~git/repositories/gitosis-admin.git/gitosis-export/keydir/.

I tried to push again. No luck.

Hmm, next I looked into ~/git/.ssh/authorized_keys. I found there was still a reference to the user there, so I deleted that line.

I tried to push again. It worked!

Ok, so are things working now? I tried to fetch. No dice.

So, when I pushed, gitosis re-applied the configuration and undid all of my debugging steps, essentially reverting the system back to its previous state, including the additions for the new user.

At this point, it dawned on me to check my ssh agent identities. Lo and behold, I had two identities and one of them was for the other user! Oops! This was completely my mistake. I had generated his keys a few weeks ago and tested them to ensure they worked. Apparently I had not been so thoughtful as to delete the identity when done.

After running ssh-add -D, things started working again.

Monday, March 31, 2014

Use Jenkins REST API to Update Job Configurations Automatically

I manage several Jenkins jobs for a single git repository that has several modules (i.e. directories). Each of the modules has the same structure and the job for each of the modules is essentially the same, differentiated only by the name of the directory.

I recently needed to make a change to each of the job configurations and rather than do so by hand, I thought I'd investigate the possibility of doing so automatically using the Jenkins REST API.

I had originally created all the jobs using the API by posting their configurations to the generic createItem URI. I had since moved the jobs into folders, and hitting createItem only created new jobs and did not update the proper jobs.

I did some searching on the internet and couldn't find any help. I eventually found my answer on the API page for the folder itself. I'm posting my code below for anyone else with similar questions.

# First, get a template job configured the way you want it.
# NOTE: JENKINS_URL and the folder/job names below are placeholders --
# substitute your own Jenkins host, folder, and template job.
JENKINS_URL='https://jenkins.example.com'

read -s token
# paste the API token shown at $JENKINS_URL/user/<your-username>/configure

# Download the configuration XML for the template job (which will be our model template)
curl -v -u "bvanevery:$token" "$JENKINS_URL/job/myfolder/job/template-job/config.xml" > generic-config.xml

# My modules
declare modules=('module1' 'module2' 'module3')

# POST the updated configuration XML to Jenkins
for m in "${modules[@]}"; do
   echo "module $m"
   sed "s/MODULE/$m/g" generic-config.xml > "$m-config.xml"
   curl -v -X POST --data-binary "@$m-config.xml" -u "bvanevery:$token" \
        -H 'Content-Type: application/xml' \
        "$JENKINS_URL/job/myfolder/job/$m/config.xml"
done

Tuesday, January 7, 2014

Script to Fix Gerrit: LDAP floods log for gerrit-only users

We recently upgraded to Gerrit 2.7 and started to see lots of LDAP-related errors in the logs. We tracked it down to an upstream bug report ("LDAP floods log for gerrit-only users").

I wrote a quick script to fix the issue and thought I'd share it.

read -s pwd

# DB_HOST is a placeholder for the MySQL host; -N suppresses the column-name header
echo "SELECT external_id FROM account_external_ids WHERE external_id LIKE 'gerrit:%';" | mysql -N -h "$DB_HOST" -u gerrit -p${pwd} reviewdb | sed 's/^gerrit://' > usernames.txt

for u in $(< usernames.txt); do
  # delete the external id for any username that no longer exists on the system
  if ! id "$u" > /dev/null 2>&1; then
    echo "DELETE FROM account_external_ids WHERE external_id = 'gerrit:$u' LIMIT 1;" | mysql -h "$DB_HOST" -u gerrit -p${pwd} reviewdb
  fi
done