Wednesday, November 11, 2015

What Makes a Good Engineering Culture?

This question was asked several years ago on Quora and has gotten some great answers! I was going through some of my old drafts and came across an answer that I had almost completed writing and decided to finally push through (better late than never!). I've copied the full text of the answer below.


In my experience, a company's engineering culture is possibly the most important thing a person can consider when evaluating a job offer. I've seen many Quora questions touch on similar topics (e.g. "Where should I work?", etc.), and I feel like the root of almost all of the answers stems from the culture. Thank you for asking this question.

My answer to this question is rooted in my ten years of software engineering, numerous blogs & books on the topic, and countless conversations with engineers. It also comes with the expectation that no answer to this question can speak for all software engineers. Everyone has different and changing needs, and with them different and changing views of what makes a good culture.

So, here goes -- the prevailing concepts that have stuck with me are as follows.


It Takes a Team

High quality software engineering is the product of a team. No one individual can be expected to deliver, nor take credit for, a successful product on his or her own. This gets fuzzy in small startups where there may be only one engineer, but otherwise holds true. A culture that celebrates one individual at the cost of another is making a grievous mistake.

There is an important distinction to note here about what comprises a *team* of engineers versus a group. The distinction is that a group is not a team until everyone in the group is committed to the purpose [1]. In my experience, this commitment comes from inspirational leadership and transparency. The fact that an engineer is employed by a company is not reason enough to incite the determination, dedication, and thoughtfulness necessary to produce quality code. Committed engineers are engineers who are proud to work where they work and excited to talk about their jobs and their company's products.

Stake in the Product

A healthy culture builds a product that means something to its engineers. This can happen in a lot of ways, one of which is by connecting the engineers with the users. Some examples are providing engineers the opportunity to sit with the customer support team, to join sales or product folks on customer visits, or to attend company conventions. As an engineer, it is such a rewarding feeling to meet a customer and hear their story about how a feature you built makes their life better or how the bug you fixed made their day. This kind of connection is what makes the software you build at your job significantly more meaningful than any college project or homework assignment.

The next critical piece of this is a healthy relationship between the product and engineering teams. I've always been baffled by product management teams that treat engineers like they are incapable of contributing to product design. Product and engineering need to function together like two sides of a brain, and respect needs to flow in both directions. Giving engineers a say in the product design simultaneously gives them a stake in the outcome. People are much more likely to care about the success of something they helped design versus something they were simply told to build.

Equally, a culture that does not respect the purpose of its product team is no better off than one that does not respect engineering. The truly successful cultures of which I have been a part were made up of cross-functional teams of product & engineering talent working together to set short-term and long-term goals; write achievable design specifications; and build software that makes both groups proud.


Next up: Experiments! Happy cultures promote experimenting with features over endlessly debating them [2]. One company where I've seen this really done well is Yammer, where each feature has a clear definition of success, but is assumed to be unnecessary until proven otherwise by objective usage metrics and customer feedback [3]. This precludes endless debates or analysis paralysis, and effectively lets product and engineering teams focus on what matters: building software. A positive side effect of writing software that inherently supports experimenting is that it encourages a culture that commits to small, manageable deliverables.

Encourage Learning

Another hallmark is a culture that encourages establishing a *deep understanding* of the tools, workflows, and responsibilities that go into producing production-ready software. A team that knows __why__ something must be done in a certain fashion will make substantially fewer mistakes than a team kept ignorant by relying too heavily on process scripts. Intuitively, it seems natural to reduce mistakes by automating processes and restricting influence. In my experience, it is a fine balance between the safety of automated process and the danger of freedom that produces a well-rounded team more apt to optimize the whole rather than their own piece of it. Automated process saves time and avoids unintended errors, but that intentionally dangerous freedom lets engineers know that they are respected and responsible for the success of their own products. That responsibility provides the motivation to go the extra mile to truly understand, for example, how a service is going to function in a production environment, or how version control really works.

Deployment Is Just the Beginning

Engineering is more than development; it's also deployment and support. Time spent in development is just a fraction of a project's lifetime. The majority of a product's life is in production. Engineering teams that fail to structure priorities around this concept will either produce poor quality products or endlessly miss deadlines because they are fixing bugs.

Learn From Failure

Nothing is perfect. Even rocket scientists at NASA make mistakes [4]. Expect failure and plan for it, in your code and in your culture. Learning from failures and *improving* from them makes a team stronger. I won't say too much about this because much has already been said elsewhere. Etsy has published some great material on this subject [5] and [6]. The Pragmatic Programmers also published a great book on ways to mitigate failures in production [7]. For my part, I'm proud of how we learned to learn from failure at Box [8].


Ultimately, all of this represents my own opinion based on my time spent engineering software. One prescription does not fit all, and any prescription should be an ever-changing vision.

[1] Mary Poppendieck and Tom Poppendieck, Leading Lean Software Development: Results Are Not the Point (ISBN 9780321620705)
[2] Always ship trunk: Managing change in complex websites
[3] Why Yammer believes the traditional engineering organizational structure is dead
[4] The Martian
[5] Failure is an option - Velocity 2015
[6] Kitchen Soap – Learning from Failure at Etsy
[7] The Pragmatic Bookshelf | Release It!
[8] The Three Letters that Saved My Tech Debt

Sunday, November 8, 2015

Managing Runtime Configurations

Configuration Headaches

Managing runtime application configurations in large-scale, heterogeneous environments is a total pain. Over the last five years I have attacked this problem in various ways, each with its own grace and flaws. The goal of this post is to sum up the evolution of my experience and hopefully impart some insight to any folks in a similar plight.
It starts with the challenges:
  1. Different environments get different configurations
  2. Different services in the same environment get different configurations
  3. Configuration files get massive and treacherous
  4. Configuration files get complex quickly with cross references
  5. Configurations have no safety against typos in keys (or values)
  6. Configurations have essentially no type safety guarantees
  7. Configurations have no README indicating their intent or usage
  8. Configurations can be accessed from anywhere in a code base (or outside) in inconsistent manners

Sidebar: Defining Runtime Application Configurations

For the context of this article, I use the term “runtime application configurations” to mean the configurations used by an application at runtime. I do not mean configuration-as-code tools like Puppet, Chef, or Ansible. For example, I would consider the log configurations in a logger.xml file for a JVM application that I own to be runtime app configs. I would not consider the configuration of Apache on a web server a runtime app config in this context; I would consider that a system configuration -- more specifically, configuration for an application whose source code I'm not writing.

Evolution of a solution

Compound keys

The most popular convention I've observed for managing configurations is via flat files such as INI, JSON, YAML, or XML; very rarely are they in the same language as the code consuming them. My assumption is that this accomplishes a few things:
  1. They can be edited by non-programmers
  2. They can be consumed by multiple programming languages, thus freeing operations teams from managing multiple manifestations of the same configuration values
  3. In the case of compiled code, configurations do not need to be compiled (or recompiled when they change)
This restriction is powerful in its guarantee of simplicity, but also limiting. One major limiting factor is that without a separate management system, it’s impossible to have different configuration values per runtime environment. This was a major problem for us since we shared configuration files between all of our environments, but needed to set different values for the same configuration key based on the environment.
Our initial solution was to encode the environmental context onto the keys such that we could effectively define unique values for a given key based on the environment that would use it. So instead of a single key-value pair, we would have multiple key-value pairs differentiated by environment. Here is an example using log levels for different environments:
level<dev> = DEBUG
level<staging> = INFO
level<prod> = WARN
When the INI gets parsed natively, each of the levels will be parsed as a unique key. The next step is custom code in the application to split the environment suffixes (e.g. <dev>) off of the keys and then figure out which environment and value apply in its running context.
Sound complicated? It is. Making this work involved writing some complicated INI parsing code as well as figuring out a way to tell an application the environment in which it’s running. The result was a brittle system that was error prone and slowed down onboarding. Furthermore, if any configurations needed to be shared between projects, those projects also needed to solve the parsing and environment setting problems.
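The environment-splitting step can be sketched as follows. This is Python for illustration only (the original was custom INI parsing code in a PHP application), and the function name is mine:

```python
import re

def resolve_compound_keys(pairs, environment):
    """Collapse keys like 'level<dev>' into plain keys for one environment.

    Keys without a suffix act as defaults; a suffixed key overrides the
    default when its suffix matches the running environment.
    """
    resolved = {}
    pattern = re.compile(r"^(?P<key>[^<]+)<(?P<env>[^>]+)>$")
    for raw_key, value in pairs.items():
        match = pattern.match(raw_key)
        if match is None:
            resolved.setdefault(raw_key, value)   # plain key: default value
        elif match.group("env") == environment:
            resolved[match.group("key")] = value  # matching override wins
    return resolved

# The INI example above, as parsed "natively" into unique keys:
parsed = {"level<dev>": "DEBUG", "level<staging>": "INFO", "level<prod>": "WARN"}
```

Even in this toy form you can see the two extra problems it creates: the application must somehow be told its environment, and every consumer of the file must reimplement the same splitting logic.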

Moving away from flat files

I asked myself, “Why are configurations always in flat files anyway?” I couldn’t come up with a convincing answer, so my next attempt was to write a straightforward configuration framework in PHP that expected different values for each key based on environmental context. All configuration files were PHP files that returned a large array. I experimented with model objects that could build configurations, but ultimately discarded the idea because I felt the time to solve the edge cases would outweigh the incremental value.
With this system, the mind-bending key-to-environment relationship became somewhat clearer. Another gain was removing cross references to other keys within the INI values. Since the values were set with PHP, it was possible to reuse values or base values off of each other, e.g. url = "{$scheme}{$domain}{$path}".
Although the system was less brittle and easier to understand, the configuration files themselves were still rather large and it wasn’t clear how defaults worked between environments. Worse, the files could only be used by PHP applications.
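The shape of those configuration files can be sketched like this (Python stands in for the PHP; the keys and hostnames are made up for illustration):

```python
# A config "file" is just code returning a mapping, so values can be
# composed from one another instead of cross-referenced by key.
def build_config(environment):
    scheme = "https://"
    domain = "api.example.com" if environment == "prod" else "api.dev.example.com"
    path = "/v1"
    return {
        "level": {"dev": "DEBUG", "staging": "INFO", "prod": "WARN"}[environment],
        "url": f"{scheme}{domain}{path}",  # value reuse, no INI cross references
    }
```

The reuse is the win; the cost, as noted below, is that only one language can consume the files.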

A separate configuration management system

My biggest problem was the size and complication of putting configurations for every environment into the configuration files. So I decided to look into using a separate system to manage the differing configurations, which could then deliver only the necessary content to a given consumer. I landed on Puppet since we already use it extensively elsewhere. This allowed me to write files containing only the configurations that mattered in the respective environments. Based on this, going back to flat files was possible. I could also tie together configuration values in Puppet before writing the files, so there was no need for cross references within the configuration file itself.
This was great: my first five problems were solved. But still, type safety was not enforced, configuration files had no guaranteed documentation, and understanding how they were used within an application was a grep nightmare.
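The effect of that step can be sketched as a tiny renderer that writes only one environment's values into a flat file. This is illustrative Python; in practice the real work was done by Puppet templates:

```python
def render_ini(all_values, environment):
    """Emit a flat INI body containing only one environment's values.

    all_values maps each key to a dict of per-environment values, so the
    environment selection happens at render time, not in the application.
    """
    lines = []
    for key, per_env in sorted(all_values.items()):
        lines.append(f"{key} = {per_env[environment]}")
    return "\n".join(lines) + "\n"

all_values = {"level": {"dev": "DEBUG", "staging": "INFO", "prod": "WARN"}}
```

The application now reads a plain file with plain keys; all the compound-key machinery from the first attempt disappears.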

Configuration model objects

I decided to revisit configuration model objects. However this time not as builders, but instead as accessors. By funneling all access of configurations through a single point, I figured that I could enforce a few things.
I buried the configuration parsing code inside an abstract class (aptly named Configuration) with protected methods for getting at the parsed values. Configuration model classes could then extend the base and expose relevant methods for their configurations. For example, a logging configuration class would have a getLevel() method, which behind the scenes would parse the configuration file and return the value. Any user that wanted to load configurations could only do so by using a Configuration class, or writing a new one. Next, I tied the name of the configuration file to the name of the class, such that if a class were named LogConfigurations, the configuration file behind the scenes needed to be named log.conf. Finding usages of a given configuration file became trivial.
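The class-name-to-file-name mapping and the accessor pattern can be sketched like so (Python stands in for the original PHP; the class names, method names, and config directory are illustrative):

```python
import configparser
from pathlib import Path

class Configuration:
    """Base class: subclasses expose accessor methods over one INI file.

    The backing file name is derived from the class name
    (LogConfiguration -> log.conf), so finding usages is a simple grep.
    """
    CONFIG_DIR = Path("/etc/myapp")  # illustrative location, not from the post

    def _config_path(self):
        stem = type(self).__name__.replace("Configuration", "").lower()
        return self.CONFIG_DIR / (stem + ".conf")

    def _get(self, section, key):
        parser = configparser.ConfigParser()
        parser.read(self._config_path())  # quietly skips a missing file
        return parser.get(section, key)

class LogConfiguration(Configuration):
    def get_level(self):
        return self._get("logging", "level")
```

All access is funneled through the protected _get(), which is what makes the later additions (type checks, documentation requirements) enforceable in one place.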
I added type expectation methods to the parsing code so that invalid values would raise exceptions. The Configuration class requires that all extending classes implement a README() method, so that users understand the class's intention and expected INI contents. This is coupled with a generic unit test that parses the README() output and ensures it can actually be used by the class. Furthermore, each of the accessor methods has its own documentation.
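The type expectations and the README() contract can be sketched together, again in Python with illustrative names:

```python
import configparser
from abc import ABC, abstractmethod

class TypedConfiguration(ABC):
    """Typed accessors plus a README contract (a sketch, names are mine)."""

    def __init__(self, ini_text):
        self._parser = configparser.ConfigParser()
        self._parser.read_string(ini_text)

    def _get_int(self, section, key):
        raw = self._parser.get(section, key)
        try:
            return int(raw)
        except ValueError:
            raise ValueError(f"{section}.{key} must be an int, got {raw!r}")

    @classmethod
    @abstractmethod
    def readme(cls):
        """Return sample INI contents documenting intent and usage."""

class PoolConfiguration(TypedConfiguration):
    @classmethod
    def readme(cls):
        return "[pool]\nmax_connections = 10\n"

    def get_max_connections(self):
        return self._get_int("pool", "max_connections")

# The generic unit test idea: the README sample must parse through the class.
def readme_is_usable(config_class):
    return config_class(config_class.readme()) is not None
```

The key property is that the documentation cannot rot: if the README sample stops matching what the class expects, the generic test fails.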
Configuration model classes also opened the door for more sophisticated configurations. Since access is inside a class, and not just array referencing, the class can do smart things like tie multiple values together or call out to other classes. One really great manifestation of this was the ability to create the ProjectConfiguration class which uses the project version control to load a configuration file’s contents.

Get involved

That’s where I am today. It’s been a fun journey and I’m happy (atm) with where things are. I’m excited to see where this goes next and to learn other ways folks have found to solve this problem.
One key component that I’d like to see next is the configuration delivery mechanism. Currently, Puppet solves the problem reasonably for short lived PHP processes. However, the two areas that I’d like to improve are:
  1. Automatic reloading by long lived processes. Maybe this is just from a file watch.
  2. A better interface for managing the configurations. This could get extensive. One tool here would be validation. This could be much more than just type checking: semantics and plugins for special configuration groups (e.g. validating that a database listed in a slave section is actually up and is not a master). Typos in keys would be completely avoided. Another tool would be the ability to see what values a given environment would get.
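A minimal version of the file-watch reload from item 1 might look like this (illustrative Python; a production version would need to debounce and cope with partially written files):

```python
import os

class ReloadingConfig:
    """Re-parses a config file whenever its mtime changes (a simple file watch)."""

    def __init__(self, path, parse):
        self._path = path
        self._parse = parse  # callable: file contents -> config object
        self._mtime = None
        self._value = None

    def get(self):
        mtime = os.path.getmtime(self._path)
        if mtime != self._mtime:  # file changed on disk: reload it
            with open(self._path) as handle:
                self._value = self._parse(handle.read())
            self._mtime = mtime
        return self._value
```

A long-lived process would call get() on each use (or on a timer) and pick up new values without a restart.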
Please post back with comments or pull requests!