Wednesday, December 4, 2013

How To Build Gerrit Replication Plugin

After far too long a wait, we've finally upgraded to Gerrit Code Review 2.7.

In versions prior to 2.5, the replication feature was packaged with the main war file. That's no longer the case. Now, if you want Gerrit to replicate your changes out to other repositories, you'll need to add the replication plugin using the command line tool `gerrit plugin install`. Sadly, I could not find the replication jar hosted anywhere, and it appears that you need to build it by hand.

I ran into some confusion [1] with this end-to-end process and didn't find a sufficiently succinct answer online, so I'm writing up my own =) Here are the steps that led me to successfully building and installing the replication jar.

  1. git clone --recursive https://gerrit.googlesource.com/gerrit
    • You need the --recursive here because the plugins are actually git submodules and won't otherwise be cloned along with your repo.
    • If you've already cloned, you can run `git submodule init; git submodule update`
  2. cd gerrit
  3. git checkout -b stable-2.7 origin/stable-2.7
  4. mvn install -DskipTests=true -Dmaven.javadoc.skip=true
    • It's not necessary to skip the tests or Javadoc generation, but doing so will greatly improve your compile time and decrease the amount of memory Maven uses. If Maven still runs out of memory (see the footnotes), bumping MAVEN_OPTS (e.g. export MAVEN_OPTS="-Xmx1024m") should get you through.
  5. cd gerrit-plugin-api
  6. mvn package -Dmaven.javadoc.skip=true
    • This creates the jar that the replication plugin needs in order to build
  7. cd ../plugins/replication
  8. mvn package -Dmaven.javadoc.skip=true
  9. At this point, you have compiled and packaged the replication jar! All you need to do now is register it with your Gerrit server. For simplicity, I'll pretend your Gerrit server is running at gerrit.example.com.
  10. scp target/replication-2.7.jar gerrit.example.com:/tmp/
  11. ssh -p 29418 gerrit.example.com gerrit plugin install -n replication /tmp/replication-2.7.jar
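
Once the plugin is installed, replication itself is driven by $review_site/etc/replication.config. As a minimal illustration (the remote name, host, and paths here are placeholders for your own setup), the file might look something like:

[remote "mirror"]
    url = git@mirror.example.com:/repos/${name}.git
    push = +refs/heads/*:refs/heads/*
    push = +refs/tags/*:refs/tags/*

${name} expands to the project name; check the plugin's documentation for the full set of options.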

I hope this helps out anyone who was struggling with the same issues as I was!

PS

Our Gerrit Code Review server runs inside an environment with no outside internet access. When upgrading Gerrit, the service assumes that it has internet access and tries to download any jars that are not packaged into its bundle. In my upgrade situation, it tried to download mysql-connector-java-5.1.21.jar from http://repo2.maven.org/maven2/. It obviously failed.

In order to resolve this issue, I downloaded the jar on a system that had external access, scp-ed it to $review_site/lib, and re-ran the gerrit.war init upgrade process.
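
For anyone in the same boat, the workaround looked roughly like this (the Maven path follows from the URL in the error message; treat the host and $review_site as placeholders for your own setup):

# on a machine with internet access
wget http://repo2.maven.org/maven2/mysql/mysql-connector-java/5.1.21/mysql-connector-java-5.1.21.jar
# copy it into the Gerrit site's lib directory
scp mysql-connector-java-5.1.21.jar gerrit.example.com:$review_site/lib/
# back on the Gerrit host, re-run the init/upgrade
java -jar gerrit.war init -d $review_site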

FOOTNOTES

[1] -- Some errors I saw:
  • Maven out of memory
  • Child module gerrit/plugins/commit-message-length-validator/pom.xml of gerrit/pom.xml does not exist
  • Child module gerrit/plugins/replication/pom.xml of gerrit/pom.xml does not exist
  • Child module gerrit/plugins/reviewnotes/pom.xml of gerrit/pom.xml does not exist


Friday, November 22, 2013

Talkin’ ’bout my Generation, or How I Learned More Than I Ever Wanted About JVM Memory

Profiling a Java application is an experience many developers may never encounter. Identifying the source of a memory leak is probably even rarer. Those kinds of investigations are typically handled by teams dedicated to the subject, or simply deferred by throwing more memory at the problem. Until recently, I had been one of those lucky developers, blissfully ignorant of what goes on within a JVM. In this article, I seek to share my journey in tracking down a performance issue and what I learned along the way.
The journey begins with an open source Scala project built on the Play framework, running on a 64-bit SL6 box with Java HotSpot 1.6. The code base has a small footprint and is not overly complex. Its external resources consist of a MySQL database, an internal Solr index for searching, and minimal file system interaction. The critical feature of the app is a REST endpoint that handles very large HTTP PUT requests.
We’ve been running the app for some time in production under light load with no complaints. Just recently we started importing a large number of records into the system via the aforementioned REST endpoint. That’s when we started to observe problems: periodically the service would crash, and the error was consistently “java.lang.OutOfMemoryError: PermGen space”. We observed that given enough time, this error was guaranteed to occur. Critically, it wasn’t simply time: it was after enough requests. I determined that the issue wasn’t going to go away and we needed to face it head on. And thus I embarked on my journey.
The first step was understanding what “PermGen” even means. Research revealed that PermGen stands for “Permanent Generation,” but before that could make sense, I needed to know a little bit more about JVM garbage collection (GC), which is where the term “generation” comes from. Put simply, garbage collection is the process by which objects that are no longer referenced get removed from memory (aka “the heap”). Think of class instances created within a function or class scope: when those scopes go away, so should those instances. In more detail, the process of garbage collection consists of sweeping through the heap and determining each object’s classification. The GC has four major spaces into which it classifies objects: Eden, Survivor 1, Survivor 2, and Old. These classifications are known as generations. Objects are born into Eden and progressively promoted into the Old Generation, at which point they are referred to as tenured [Footnote 1].
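If you want to watch the generations in action on a running JVM, jstat (also bundled with the JDK) will print their utilization; a quick sketch, assuming a single java process on the box:
jstat -gcutil $(pidof java) 2000   # S0/S1 = survivor spaces, E = Eden, O = Old, P = PermGen; sample every 2 seconds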
OK, so where does the Permanent Generation fit in? The Permanent Generation is a space outside the main heap reserved for storing information about the Java classes your app uses. The PermGen has a fixed size that can be set with a JVM option. Class information in the PermGen is managed by ClassLoader instances, the most common of which is the Java system class loader, sun.misc.Launcher$AppClassLoader. Generally, the loaded class information takes up a small, fixed memory footprint and you don’t need to put much thought into it.
Classes get loaded into the PermGen on demand, so only the classes used by your app take up space. The twist comes when you start using Java reflection. The Java Reflection package is an extremely powerful metaprogramming tool that gives you the ability to query class information and even create new classes at runtime. Reflection initially works by calling through the Java Native Interface to get at the loaded class information in the JVM. By default, if the same reflective call is made more than 15 times [Footnote 2], the JVM generates a new class to perform that call directly. This process is called inflation, and the generated class is loaded by a DelegatingClassLoader.
Application frameworks that make heavy use of reflection will accumulate several DelegatingClassLoader instances in their PermGen. One such framework is the Scala Play framework.
Now that we have a pretty good idea of the PermGen’s role, it’s important to understand how garbage collection works in the PermGen space. I read in a few places that “classes are forever,” but in fact that is not the case. When garbage collection runs in the PermGen, it is true that it will not collect classes. However, what it will collect are ClassLoader instances that no longer hold references to any active classes. So, if all the classes in a given ClassLoader instance have expired, then that instance and all classes it references will go away. This is pretty much never going to happen with the Java system class loader [Footnote 3], but it could very well happen with a DelegatingClassLoader.
With all this knowledge in hand, I finally felt ready to start investigating what might be going on. The JDK comes with a tool called jmap, which is excellent for analyzing what’s happening with the memory in a JVM. The first command I tried out was jmap -permstat [Footnote 4]. Wow! Pretty cool; this shows every class loader in the PermGen, the number of classes for which it’s responsible, the space it’s occupying, and its health. The first thing to jump out at me was a good number of dead DelegatingClassLoader instances.
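For reference, the invocation looked roughly like this (the JDK path matches the one used later in this post, and per Footnote 4 it’s worth redirecting the output to a file since permstat takes a while to compute):
sudo /usr/java/jdk1.6.0_26/bin/jmap -permstat $(pidof java) > permstat.txt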
Weird. So, why aren’t those getting garbage collected if they are dead? After some more research into JVM options, I discovered the following two flags: -XX:+CMSClassUnloadingEnabled and -XX:+UseConcMarkSweepGC. In their absence, garbage collection will not remove unused class loader instances from the PermGen space [Footnote 5].
Alright! Now we’re getting somewhere. These flags might be just what I’m missing. In order to find out, I set up a test system on which I could simulate the initial problem and then apply the new JVM options, rerun the simulation, and see how it holds up.
For the simulation, I wanted to start up my app with a small PermGen and keep my eye on the usage. The PermGen size is configurable at JVM startup with the option -XX:MaxPermSize=68M. I chose 68M because I had observed that on startup the app immediately used up just under 60MB and quickly grew to near 68MB after a few requests. To keep my eye on the PermGen usage, I used jmap -heap, which shows both the total PermGen allocated as well as the current PermGen usage. Both of these values are important because although the PermGen has a max size, it will not all be used right away. jmap -heap displays a whole load of useful information, but I was only interested in PermGen, so a little grep filtering brought me to jmap -heap $pid | grep -A 4 'Perm Gen'. Finally, I wanted to watch this as it changed, so throwing watch into the mix produced:
declare pid=$(pidof java); 
watch -n 2 "sudo /usr/java/jdk1.6.0_26/bin/jmap -heap $pid 2>&1 | grep -A 4 'Perm' "
With my watch in place, I started simulating load with a while loop in bash that hit the web endpoint with curl requests. The PermGen usage quickly rose, eventually peaked at 100%, and the service crashed after about 500 total curl requests. I set 500 as my baseline for comparison and restarted the JVM with the GC flags -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC. Re-running my load simulation script, I watched the PermGen rise just like before. This time, however, at around 500 requests the PermGen usage dropped. The load simulation continued and I reached around 2000 requests before the PermGen finally ran out.
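The load script itself was nothing fancy; roughly something like the following, where the endpoint and payload are placeholders rather than the real service:
i=0
while [ $i -lt 2500 ]; do
    # hypothetical PUT against the import endpoint
    curl -s -X PUT -H 'Content-Type: application/json' --data @sample-record.json http://localhost:9000/records/$i > /dev/null
    i=$((i+1))
done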
My conclusion from the experiment is that the GC flags successfully configured the JVM to clean up the PermGen space and consequently improved performance over time. The new parameters have been deployed into production and we haven’t seen a crash!
My journey isn’t over yet. I’m not 100% satisfied with this solution because I still see dead DelegatingClassLoader instances in my jmap -permstat output, and I’m mildly suspicious there may be a memory leak in the framework. I will continue to monitor PermGen usage and see how things progress. As an ultimate fallback option, I can disable inflation.
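If it comes to that, my understanding (see Footnote 2) is that inflation can effectively be switched off by raising the threshold so high that the JNI-based accessors are always used and no DelegatingClassLoader classes ever get generated; something along these lines, where the jar name is a placeholder:
java -Dsun.reflect.inflationThreshold=2147483647 -XX:MaxPermSize=68M -XX:+CMSClassUnloadingEnabled -XX:+UseConcMarkSweepGC -jar myapp.jar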
…and that’s it! Hopefully this can serve as a helpful starting point for anyone facing similar situations.
Some More Notes
I’m also including some notes that I picked up along the way but that weren’t really pertinent to the main discussion; they concern heap dumps and analysis.
  • The JVM can dump its heap to a file when it crashes. To do this, enable -XX:+HeapDumpOnOutOfMemoryError and set a path for the dump file with -XX:HeapDumpPath=/path/for/dumps/java-.hprof [Footnote 6]. 
  • I used the Eclipse MAT tool to analyze my dump file and had a great experience. It’s worth reading the manual on its basic functions before using it. The default reports available gave me an immense amount of actionable information. If the OOM errors recur, I will likely be turning to MAT to dig to the bottom of the (potential) leak.
  • This article was immensely valuable in wrapping my head around how the JVM interacts with native memory: http://www.ibm.com/developerworks/java/library/j-nativememory-linux/

FOOTNOTES:
Footnote 1: To see details of tenure calculations, enable -XX:+PrintTenuringDistribution
Footnote 2: See a quick description of how to control the process with system properties (sun.reflect.noInflation and sun.reflect.inflationThreshold, which defaults to 15): http://anshuiitk.blogspot.com/2010/11/excessive-full-garbage-collection.html
Footnote 3: This is actually not completely out of the question. Application frameworks that employ the “hot deploy” method of swapping in new jars without stopping the JVM will ideally recycle their class loaders. See http://frankkieviet.blogspot.com/2006/10/classloader-leaks-dreaded-permgen-space.html for a good description of how this can get you into undesirable situations. To read about more people experiencing this problem, see http://stackoverflow.com/questions/5066044/java-lang-outofmemoryerror-permgen-space-on-web-app-usage
Footnote 4: The permstat information is expensive to calculate, and I’d suggest redirecting the output to a file so you can manipulate it. http://docs.oracle.com/javase/1.5.0/docs/tooldocs/share/jmap.html
Footnote 6: JVM options: http://www.oracle.com/technetwork/java/javase/tech/vmoptions-jsp-140102.html. WARNING: you must ensure there is no file with the same name at the dump path, or the JVM will refuse to overwrite it. To increase entropy in the dump file name, use the placeholder in the path you pass.

Tuesday, October 29, 2013

There's No Room for Heroes

I've been thinking lately about how to successfully grow & mature an engineering team. What are the changes and shifts in mindset and practice that need to occur, naturally or synthetically, to remain effective, productive, and happy? One idea that's caught my attention is the shift from a culture of all-nighter heroes to a culture of on-time delivery of 100% finished projects.

A culture of all-nighter heroes revolves around the star players: the ones who get all the glory by consistently saving the day. Every startup at which I've worked has had at least one person like this. You know you can turn to them whenever there's a problem and be confident that no matter the issue, it will get fixed. These are the guys or gals who are answering questions in the chat room 24/7, know every intricacy of your architecture, and have memorized the root password. These are the people the whole team looks up to and for whom you say a prayer every night, hoping they won't quit.

In a bootstrapped startup, the time for long term thinking is brief. Solutions are slapped together and shoved out the door just in time to evade catastrophe. Work is never quite finished; instead, the Pareto principle of 80% good enough is put to the test. Tech debt is an ever-looming, always-growing shadow of which no one wants to speak. "We'll fix it later" is the general motto. This is the haven of the hero: an undocumented, uncontrolled environment where only the natives know their way around. Business becomes increasingly reliant on your heroes' ability to keep the technical ship afloat, and battle scars are signs of prestige.

Some might argue that this is unavoidable, and perhaps even necessary. Yet as business becomes more stable and clients become less tolerant of failure, this method of operation becomes unsustainable. Suddenly, 80%-done work isn't good enough, tech debt has become a substantial hindrance, and that missing documentation is causing daily problems. The architecture has reached a point in which no group of heroes can reasonably hold it together.

This is the tipping point.

Simply acknowledging this is critical, but the biggest challenge to survival is the cultural shift. The heroes of this new business environment are the types who recognize long-term goals. They realize that good solutions aren't invented overnight, but via healthy design, debate, experimentation, and measurement. They create simple solutions that eliminate tech debt and avoid gotchas. They share their approaches and engage the rest of the team. They automate every possible touch point. Ultimately, the heroes of this new field do everything they can to be replaceable.

So, your challenge is to embrace automation, celebrate diligently crafted and deployed solutions, and put a damper on rewarding superhuman efforts riddled with tech debt. In my opinion, this is a major inflection point in the progression of a young team of engineers into a senior one. It must be closely shepherded and properly incentivized.

Thursday, July 25, 2013

Add Try-Catch Block in IntelliJ for PHP Using Live Templates

Choosing a powerful IDE and investing the time to learn its many features is a distinctly rewarding experience, saving you countless hours of typing and brain cycles on boilerplate code.

I find the "Live Templates" feature of IntelliJ to be a major time saver. Today, I wanted to add a missing template for PHP: "try catch." I read through the docs and wasn't immediately clear on how it worked, so I had to mess around a bit to get it going and thought I'd share my findings.


  1. Get into the preferences pane for editing live templates by following these directions http://www.jetbrains.com/idea/webhelp/creating-and-editing-template-variables.html
  2. Find the "PHP" section.
  3. Click the "+" symbol. In my version (Ultimate 12, Darcula), the plus symbol is to the upper right in the preferences pane.
  4. Follow the instructions for naming listed in the IntelliJ docs linked above.
  5. Paste the code outlined below (Exhibit A) into the "Template Text" box.
  6. In the box below the template text, be sure to choose "PHP" as an applicable context.
Exhibit A, the live template:

try
{
    $SELECTION$
}
catch (\Exception $$e)
{
    $$this->logger->warn("$END$");
}


Now, when you want to use the template,

  1. Highlight the text you'd like to wrap in the try-catch
  2. Hit the keyboard combo for "Surround with Live Template"
  3. Hit Enter.
    1. IntelliJ will inject the template and replace "$SELECTION$" with the highlighted text.
    2. It will then place your cursor between the double quotes at "$END$"
    3. There is a way to have it stop along the way at "Exception" and "$e" allowing you to change the defaults, but I haven't gotten there yet =)
  4. Start typing your log message. If you don't typically have a $logger field defined in your class, you can customize this to fit your pattern.

Friday, May 31, 2013

Log4PHP not logging?

Do you use log4php? I am a big fan of the log4-* suite. It's great to have a consistent approach to logging across languages.

I've been using Log4PHP for almost a year now and aside from one issue with logging exceptions, I've had no problems with it and, in general, enjoy any time I've spent reading the source.

I had a bit of a problem today that was my own fault, but it seems like an easy one for others to make. I'd been happily using LoggerLayoutTTCC for some time. It was deprecated in the 2.3.0 release last October in favor of LoggerLayoutPattern, so I switched to LoggerLayoutPattern. Things had been humming along just fine, but something changed recently: I noticed today that none of my regular log entries were getting written. Only errors were being written out, and they weren't in any identifiable format.

After digging into the source, I discovered that LoggerLayoutPattern had no parsed pattern. Effectively this means that when my appenders asked the layout class to format the message, it was returning an empty string, which was being thrown away and not logged.

How could this happen? The culprit was that although the LoggerLayoutPattern class has a default pattern, it is not applied by the constructor. Out of the box this would not be an issue, but since I had previously been using LoggerLayoutTTCC, which takes no constructor arguments, I was not passing a conversionPattern in the params argument to Logger::configure() when defining my appender.

The fix was straightforward: supply a conversion pattern! This feels like a minor bug in Log4PHP; it should either warn about the missing param or just fall back to the defined default. Here's the working appender configuration:

'appenders' => array(
 'default' => array(
  'class' => 'LoggerAppenderDailyFile',
  'params' => array('fileName' => 'logs/default.log'),
  'layout' => array(
   'class' => 'Bart\Log4PHP\LoggerLayoutTTCCWithException',
   'params' => array(
    'conversionPattern' => '%d %p pid:%t %c %m%n',
   )
  ),
 ),
),

Tuesday, May 21, 2013

Say Goodbye to PHP shell_exec() and PHP exec()

Dealing with the shell in PHP can be a pain. If you are executing anything more complicated than a no-argument command with a one-line string result, things start getting complicated.

To get started, if you want to check the return status of your command, you need to pass a reference parameter to exec()! OK, so it's not returning the command's exit status; then you might expect it to return the command's output instead? Nope; the return value of exec() is actually just the last line of the command's output. If you want the full output of the command, you're looking at yet another reference parameter. And wait, in which order do those reference parameters go?

After you've got your parameters sorted out, what about escaping arguments for security and accuracy? For that, you're left manually calling PHP's other global functions, escapeshellarg() and escapeshellcmd().

Alright, so now you've got a safe, accurate command. How do you know it works? You might want to test it, right? Well, if you want to test any of your code that uses exec(), you're looking at a rough road ahead, because PHPUnit has no facility to let you make assertions against parameters passed by reference. Furthermore, if you have more than one call to exec(), then you really have no way to stub results out of the box because you can't differentiate between the calls. I previously blogged about one approach I've taken to solving this: http://asheepapart.blogspot.com/2012/10/testing-shell-methods-with-phpunit.html. MockShell works alright, but it doesn't help with escaping the command for security or accuracy.

Enter the Bart Command class. Command is a simple class that acts as a facade over a single shell command. It takes a variable number of arguments to its constructor, representing the arguments to the shell command; the first constructor arg is reserved for the command itself. Each argument is escaped and then substituted for placeholders in the command string.

A command can be passed around and executed on demand. The results are returned as a full string or an array of lines. If the command fails, a custom exception of type CommandException is thrown.

We're using Command extensively in our internal code bases with a lot of success. Currently, you can see it used in a few places in the Git class.

I also created the gist below,


Testing is easy. There isn't any pass-by-reference magic; you can just apply all your standard testing know-how. You can use Diesel to inject your stub into your system under test. You can also mock the results of the shorthand Shell->command() method.

Monday, May 13, 2013

Bash Completion with SSH or SCP

I spend a lot of time ssh-ing around to the same machines. Using ~/.ssh/config helps me create nicknames for those hosts, but what's even better is bash completion. I created the following script to register the "ssh" and "scp" commands with bash completion so that I can just hit <TAB> and let bash complete the possible machine names.

I spent a lot of time reading up on how the bash completion system works and tried to make this script as clear as possible so that I can adopt the same strategy for future completion work.
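
A minimal sketch of the approach (not the full script; it assumes your nicknames live in Host entries in ~/.ssh/config and skips wildcard patterns):

_ssh_hosts_complete() {
    local cur hosts
    cur="${COMP_WORDS[COMP_CWORD]}"
    # pull host nicknames out of ~/.ssh/config, ignoring wildcard entries
    hosts=$(awk '/^Host / {for (i = 2; i <= NF; i++) if ($i !~ /[*?]/) print $i}' ~/.ssh/config 2>/dev/null)
    COMPREPLY=( $(compgen -W "$hosts" -- "$cur") )
}
# register the same completion function for both commands
complete -F _ssh_hosts_complete ssh scp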



Hope this helps!

Tuesday, May 7, 2013

CodeIgniter & PHP 5.4 -- Creating default object from empty value

I, sadly, use CodeIgniter to power several of my legacy web applications. We recently upgraded our development environment to PHP 5.4 but have not yet upgraded our production systems. We work with these apps infrequently in development, so this hasn't been a problem. However, today I needed to test moving traffic from HTTP to HTTPS on one of the apps and kept receiving several instances of the following message:

Severity: Warning
Message: Creating default object from empty value

Each of these warnings was linked to lines in my base controller (REST_Controller) that were performing assignments to properties of class fields. E.g. $this->rest->db = $this->load->database(...). I was unable to replicate the error on another machine running 5.3 and determined that my problem must be arising from PHP 5.4.

CodeIgniter works in weird ways, and my first guess was that some magic code or eval was failing due to the elevated strictness of PHP 5.4. So I embarked on a mission to discover where the class fields were being assigned inside CodeIgniter. I scoured CI_Config, CI_Loader, and Code_Igniter.php, but couldn't find my answer; there were no magic methods or variable variable assignments. Feeling rather dejected, frustrated, and confused, I took another look at the error message, "creating default object from empty value", and an idea formed in my mind: what if PHP was telling me that it was in fact creating the class field right there on the spot?

The easiest way to test that theory was to create the object myself and test if the message still appeared. I updated the code to create the field just before each of the failing lines. E.g.

$this->rest = new stdClass();
$this->rest->db = $this->load->database(...);

And, voila -- no more warnings! So it turns out that the problem had little to do with CodeIgniter itself; historic PHP is simply too friendly to lazy coders, and what was only an E_STRICT notice in PHP 5.3 was promoted to a warning in 5.4.

Now, with that out of the way, on to testing the HTTPS migration...

Thursday, April 25, 2013

Best Practices of Code Reviews

Long ago I read a great white paper on the Best Kept Secrets of Code Review published by SmartBear software. It's an excellent study of code reviews and offers some great advice. Since then, I've continued learning and owe many thanks to some great engineers and thought leaders. What follows is a compilation of my current thoughts on the subject.

Before I get started, I'll give my definition of code review:

  • Code Review -- a critical analysis by an individual or group of a functional unit of code, be it a diff, a completed work, or a combination of both.

Next, the five types of code review, as defined by SmartBear -- plus one from me.
  1. Over the shoulder
  2. Email
  3. Tools, something like Gerrit
  4. Pair programming
  5. Formal inspections
  6. Self review
I feel that the first two on their own are insufficient for any development organization with a focus on quality. The informality of over-the-shoulder or email reviews does not induce the proper level of seriousness, and emails can be lost and are not available to the team at large.

A (good) code review tool is irreplaceable. It provides history, contextual comments, patch sets, a web link to reference for the future, etc. It also provides an integration point for build systems and development workflows, a critical component to any team attempting continuous delivery. Pair programming is very valuable, for more than just code review, but is not the subject of my post.

Formal reviews can go either direction. I'll start with a bad one. At a previous company, we used to hold ad hoc, hour long code reviews. We had no tool and instead would invite the entire team into one cramped room in which we'd hook up a laptop to the TV and someone would show us part of their code. There was rarely any preparation on part of the presenter, nor any advance notice of what we'd be seeing. It was kind of like open mic night at the local coffee shop.

Looking back on this, I'm astonished that we didn't realize its ineffectuality. Roughly 70% of the engineers did not pay attention; reviews never started on time; an average max of only 15% of the attendees actually had exposure to the code under review; bugs were rarely found; meetings often devolved into tangential debates. In other words, it was a terrible waste of time.

Not all formal reviews have to be so dreadful. In fact, we conduct very successful formal reviews at my present company. We call them "Code Workshops." These entail advance notice of the code being shared, an assigned reviewer who comes prepared, and a (comfortable) room full of attentive and respectful engineers. The outcome is generally positive and we all leave a little wiser.

Finally, self reviews. A self review is the act of critically reviewing code that you've written. This can happen at any point: before you submit for review by others, before you refactor it, or just for fun on a lonely night. Self review should not be overlooked. It is a critical factor in effective peer code review and, more broadly, fundamental to developing better coding practices.


Given the preceding context, I've come to learn and adopt the following lists.

Learnings

  • The cost of finding a customer reported bug versus finding a bug in development is significantly disproportionate. The cost benefit of ensuring quality through effective code review and other means is immense.
  • Sign-off from senior engineers should not be required. This says "We don't trust you." You should inherently trust the people you hire to be smart enough to *request* code reviews when it's needed. If you're faced with habitual stealth submitters, your problems with those employees may extend beyond just code reviews.
  • Synchronous pair code reviewing for complex changes reduces feedback cycles and total time to deliver. Moreover, beyond code review, it promotes rapid knowledge transfer.
  • An engineer's time is your most precious resource. Make sure it's used as effectively as possible and avoid doing anything manually that can be automated. In the context of reviewing code, automate as much as you possibly can with static analysis tools and build systems so that engineers spend their time leaving unique comments (and not nit-picks about code standards). We use Facebook's Arcanist for our static analysis of PHP and Adam Rosien's Sniff for Scala.

Best Practices

  • Use code review as training for new hires. Keep track of good reviews in a team wiki.
  • Do NOT use code review as a way to teach people how certain code is used. Hold training sessions and record them and upload them internally (or to Box!).
  • Do not oversubscribe reviewers: although having more people look over the code will most likely result in more defects being discovered, the relationship quickly becomes logarithmic rather than linear.
  • Strictly enforce limits on the size of a review change set. Effectiveness in reviews begins decreasing after the first 15 minutes spent reviewing and drops drastically after 60 minutes. Respecting this practice will also help when bisecting code history during bug hunting.
  • Code reviews of unfamiliar code should be accompanied by a brief overview or an in person conversation.
  • Before requesting a code review, the original author should very critically review their own code. A reviewer is not a janitor. In fact, authors often find several problems in this phase.
  • Create a checklist of common issues. Checklists helped uncover 30% more defects in equal amounts of time in studies conducted by SmartBear (paper, page 126).
    •  A checklist should be supplemental to automation by a static analysis phase of your build system.
    • See sample checklist on page 112 of the SmartBear white paper.

Tips

  • Reviewers tend to spend the majority of their time on declarations and signatures.
    • Keep variable declarations close to their use.
    • Use shorter functions.
  • Reduce feedback cycles by leaving concise, intelligent comments.


Final Thoughts


If I could encourage you to do one thing, it would be to find a good tool that works for your team and diligently learn how to use it to its maximum potential. Integrate it into your development process and use it in your build system. Treat it with the same level of importance as your code repository.

The bottom line is that no matter how you're doing it, code reviews help your team do better. These bullet points are the product of several years of software development and teamwork, but nonetheless simply remain suggestions and not prescriptions.


Further Reading

http://smartbear.com/resources/whitepapers/best-kept-secrets-of-peer-code-review
http://www.ibm.com/developerworks/rational/library/11-proven-practices-for-peer-review/



Friday, April 12, 2013

Gerrit Code Review - Unpack error Missing unknown (Continued)

Several months ago I blogged about an issue I encountered administering the Gerrit Code Review system, version 2.3. Sadly, we had a recurrence today. This time, we had to take a different approach to solve it.

The problem manifested in much the same way. All of our Gerrit users began seeing the following error message when pushing to Gerrit:
fatal: Unpack error, check server log 
remote: Resolving deltas: 100% (4/4)
error: unpack failed: error Missing unknown 98aab6fe33971a17b1bfb5a5288070e14f166b79 

The difference from last time is that this Gerrit installation is not the "owner" of the repository. We use gitosis to manage all of our repositories and keep them replicated across several servers via post-receive hooks. This setup works great for our use case, until it comes to Gerrit. We do not require that all code be reviewed, so not all code gets merged through Gerrit, and for this reason Gerrit can fall out of date with the real repository. In order to fit Gerrit into our system configuration, we host one of our replicated repositories on the same server that hosts Gerrit. That repository has a special post-receive hook that pushes with --mirror and --force to the checkout that Gerrit uses. This keeps Gerrit up to date with merges happening outside its control.

At some point this afternoon a commit related to an unmerged patch set disappeared from the Gerrit repository. The result was the error message pasted above for anyone trying to push new patch sets into Gerrit.
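
If you hit the same thing, you can confirm the object really is gone from Gerrit's copy of the repository with something like the following (the repository path is a placeholder for wherever your Gerrit site keeps its git data):
cd /path/to/review_site/git/project.git
git cat-file -t 98aab6fe33971a17b1bfb5a5288070e14f166b79   # errors out if the object is missing
git fsck --full                                            # reports any missing or dangling objects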

In this situation, we had no backups of the Gerrit repository. Digging into the Gerrit database, I found a reference to the commit in the patch_sets table. The change_id led me to the commit's author. SSH-ing to his machine and inspecting his git checkout, I was able to track down the missing object. Using git cat-file -t 98aab6fe33971a17b1bfb5a5288070e14f166b79, I identified that the object was a commit, and upon asking him about it, he confirmed it was the very commit he had in code review.

I deleted the record from patch_sets, DELETE FROM patch_sets WHERE revision = '98aab6fe33971a17b1bfb5a5288070e14f166b79' LIMIT 1;.

We had him push to Gerrit again. The push was successful and I observed the record was recreated in the patch_sets table. At this point all other pushes from other users began functioning as expected.

Our problems weren't over yet. He was unable to view his review through the Web UI; instead, Gerrit showed "Internal Server Error". The following error was showing up in the error_log:


[2013-04-12 19:16:01,526] WARN  / : Error in changeDetail
java.lang.NullPointerException
        at com.google.gerrit.server.project.ChangeControl.isPatchVisible(ChangeControl.java:177)

We identified that the patch_set_id numbers were out of order in the patch_sets table for his change_id. We updated them to be in order, but this time the change detail page loaded with no content. So far, we've been unable to fix that problem. Instead, we ended up working around it by generating a new Change-Id and pushing that as a new review into Gerrit, which worked as expected.

So, two questions remain unanswered:
  1. How did we get into this situation?
  2. How can we fix the problem review?
UPDATE April 15th, 2013

We have tracked the problem with the missing patch set commit object down to a race condition between Gerrit writing the commit to the repository and our external gitosis post-receive hooks pushing an independent commit at the same time. My assumption is that internally JGit is not handling lock files correctly, or is getting confused with file descriptors. Read more at https://www.google.com/search?q=jgit+gerrit+broken

We've fixed this by removing the --mirror flag from our post-receive hook and instead specifying just refs/heads/master. This reduces the probability of race conditions on lock files.
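
For the curious, the relevant part of the hook now looks roughly like this (the destination path is a placeholder, not our actual layout):
#!/bin/sh
# post-receive hook on the replica that feeds Gerrit
# previously: git push --force --mirror /path/to/review_site/git/project.git
git push --force /path/to/review_site/git/project.git refs/heads/master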

After much fiddling, I was unable to restore UI access to the change review in question. I surrendered to Gerrit and abandoned it over the ssh interface.