Continuous Benchmarking with JMH and JUnit

Why benchmarks?

When writing code, you are confronted with countless small decisions all the time.

  • “Should I use a HashMap or a ConcurrentHashMap?”
  • “Maybe using a StringBuilder would be better than concatenating the Strings myself…”
  • “My colleague told me indexOf() is faster than contains(), is it really though?”

While the performance of an application depends on many factors (hardware, workload, software architecture, etc.), these small programming decisions also have an impact on software performance. Many small mistakes can add up pretty quickly, especially when the code you are developing is a critical component that is called very often. However, it is hard to know intuitively in advance what impact a specific decision might have on the actual software performance. That’s why measuring software performance is always a good idea.

Now, when it comes to measuring the performance of individual pieces of code, a naïve approach might be to just put a call to System.nanoTime() (or even worse, System.currentTimeMillis()!) at the beginning and end of the code in question. Most developers are probably guilty of using this approach at one point or another, but needless to say, it’s not very accurate. That’s because the JVM performs all sorts of optimizations internally, such as dead code elimination or loop unrolling. Sometimes, the JVM might change the code on the fly – for example, it might count how often certain branches are taken and then optimize the code to speed up the most likely cases, while potentially slowing down improbable cases. The list of possible optimizations is long. And then we still have Java’s Just-in-Time (JIT) compiler, which will automatically replace frequently called code with machine code to avoid the performance overhead of interpreting it time and time again. The compiled machine code will be a lot faster than the interpreted code, meaning that for a proper benchmark we will also need a sufficiently long warm-up phase. It’s easy to see that the naïve approach is not going to work out.
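
To make the problem concrete, here is a minimal sketch of such a naïve measurement (the string and the loop count are arbitrary, and the number it prints should not be trusted for exactly the reasons described above):

public class NaiveBenchmark {

    public static void main(String[] args) {
        // Naive measurement: there is no warm-up, the JIT may kick in halfway
        // through the loop, and the JVM may optimize this loop very differently
        // than it would optimize the same code in a real application.
        long start = System.nanoTime();
        int result = 0;
        for (int i = 0; i < 1_000_000; i++) {
            result += "abcdefghijklmnopqrstuvwxyz".indexOf('l');
        }
        long durationNanos = System.nanoTime() - start;
        System.out.println("Average: " + durationNanos / 1_000_000.0 + " ns/op (result: " + result + ")");
    }
}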

JMH to the Rescue

But thankfully, there are frameworks which take care of the heavy lifting for us. A popular one is the Java Microbenchmark Harness, also known as JMH, which has been an insider tip among performance enthusiasts for quite some time already. While there are still a lot of pitfalls when writing microbenchmarks in Java, JMH manages to take a lot of the pain away. Writing a benchmark becomes as easy as writing a unit test – you just have to add a few annotations:

import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(3)
public class SearchBenchmark {

    @State(Scope.Thread)
    public static class SearchState {
        public String text = "abcdefghijklmnopqrstuvwxyz";
        public String search = "l";
        public char searchChar = 'l';
    }

    @Benchmark
    public int testIndexOf(SearchState state) {
        return state.text.indexOf(state.search);
    }

    @Benchmark
    public int testIndexOfChar(SearchState state) {
        return state.text.indexOf(state.searchChar);
    }

    @Benchmark
    public boolean testContains(SearchState state) {
        return state.text.contains(state.search);
    }
}

The only annotation which is really required is the @Benchmark annotation. This tells JMH that the specified method should be used to generate a microbenchmark. All other annotations control some aspect of the benchmarks. For example, the @BenchmarkMode annotation defines that we are measuring the average execution time; other options would be to measure throughput or to sample execution times. @OutputTimeUnit defines the time unit in which the results are reported, and @Fork controls how often the whole experiment is repeated – each time, a new JVM is forked to minimize the effect of the optimizations of one specific JVM run.
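
Most of these annotations can also be placed on individual benchmark methods, in which case they override the class-level settings. As a sketch, an additional method inside SearchBenchmark could measure throughput in milliseconds while the rest of the class keeps measuring average time:

@Benchmark
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public boolean testContainsThroughput(SearchState state) {
    // Method-level annotations take precedence over the class-level ones.
    return state.text.contains(state.search);
}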

In addition to these class-level annotations, there is also a @State annotation on the inner class SearchState. This defines a state object which is injected into the individual benchmark invocations. We use this state to avoid hard-coded constants in the benchmark itself; with constants, the compiler could detect that the computation always yields the same result and pre-compute it (constant folding), which would distort the benchmark results. In a similar fashion, it is important that the results of the benchmark methods are returned, which instructs JMH to actually consume them; otherwise, the compiler might detect that the code has no side effects and eliminate the benchmarked code altogether (dead code elimination).
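
If returning a single value is not convenient – for example, because a benchmark produces several intermediate results – values can also be handed to a Blackhole (from org.openjdk.jmh.infra), which JMH injects just like a state object. A minimal sketch of another benchmark method in SearchBenchmark, reusing the SearchState from above:

@Benchmark
public void testIndexOfBlackhole(SearchState state, Blackhole blackhole) {
    // Consuming the value explicitly has the same effect as returning it:
    // the compiler cannot prove the computation is unused and remove it.
    blackhole.consume(state.text.indexOf(state.search));
}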

There are many other options when writing JMH benchmarks – simple ones like adjusting the number of warm-up and measurement iterations, or more complex ones like black holes, params and controls. Even with all the tools JMH provides, writing good benchmarks is still a non-trivial task. If you are unfamiliar with JMH, I recommend having a look at some of the excellent samples on the project website.
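
For example, the length of the warm-up and measurement phases can be configured with the @Warmup and @Measurement annotations, and @Param makes JMH repeat a benchmark for several input values. A short sketch (the class name and all values are arbitrary):

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 5, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Fork(3)
public class ParameterizedSearchBenchmark {

    @State(Scope.Thread)
    public static class SearchState {
        // JMH runs the benchmark once for every listed value.
        @Param({"abcdefghijklmnopqrstuvwxyz", "zyxwvutsrqponmlkjihgfedcba"})
        public String text;

        public String search = "l";
    }

    @Benchmark
    public int testIndexOf(SearchState state) {
        return state.text.indexOf(state.search);
    }
}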

Once you have written a benchmark, running it is very easy. The recommended way is to use JMH’s Maven archetype, which will take care of all the nasty dependencies and build steps for you:

mvn archetype:generate \
          -DinteractiveMode=false \
          -DarchetypeGroupId=org.openjdk.jmh \
          -DarchetypeArtifactId=jmh-java-benchmark-archetype \
          -DgroupId=org.sample \
          -DartifactId=test \
          -Dversion=1.19

Internally, the generated pom.xml will use the maven-shade-plugin to create an uber JAR containing the benchmarks, so after building the project you can just:

java -jar target/benchmarks.jar

JMH will then run your benchmarks with the specified number of forks, warm-up iterations and measurement iterations. The final results for the benchmark listed above might look something like this:

# Run complete. Total time: 00:06:04
Benchmark                        Mode  Cnt   Score   Error  Units
SearchBenchmark.testContains     avgt   60  10.574 ± 0.016  ns/op
SearchBenchmark.testIndexOf      avgt   60  10.736 ± 0.044  ns/op
SearchBenchmark.testIndexOfChar  avgt   60   7.529 ± 0.091  ns/op

Of course, the results depend very much on your hardware as well as your operating system and JVM version, so you will most likely have different results when running the example benchmark on your system.
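
If you just want to experiment, you also don’t need to touch the annotations for every change: the generated launcher accepts JMH’s command-line options to override, for example, the number of forks and the warm-up and measurement iterations (running it with -h prints the full list of options):

java -jar target/benchmarks.jar -f 1 -wi 3 -i 5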

Benchmarking Continuously

Oftentimes, running a benchmark once and then forgetting about it might be good enough for you. But benchmarking can also be used to get early performance feedback for critical system components during development. Please note that benchmarks are not designed to replace full-fledged performance tests – a good load test which exercises your entire system with a realistic workload will generally give you a better idea of the actual end-user experience than a benchmark which tests an isolated component of your system under high load. But setting up and running a load test requires a lot of time, even when it’s done automatically by your build and test pipeline (as outlined in a previous article: Continuous Performance Evaluation using Open Source Tools). In addition, sometimes you are not interested in the performance of the entire system, but only care about the maximum throughput that a performance-critical component can achieve. Therefore, it can be a valid strategy to continuously benchmark parts of your application to detect performance regressions as early as possible.

So, how would you go about integrating JMH into your deployment pipeline? The easiest approach would be to combine JMH benchmarks with JUnit tests. Such an integration has many benefits – for example, you could add some benchmarks to your existing test suite to get an idea about the performance of a specific component. This offers the advantage that you can simply re-use your unit tests as benchmarks and don’t need to write your benchmarks from scratch. Of course, you have to keep in mind that not all unit tests are suitable to be used as benchmarks – but if you are confident in your benchmark, you can even define assertions on your benchmark results, which can be used as an additional quality gate to determine whether your tests should pass or fail. For such a feature to work well, you need to take special care that your tests are stable and are not influenced by other parts of your build and testing process. Ideally, you would execute your benchmarks on a designated machine which doesn’t have any other active processes. While some people argue that running your benchmarks from within JUnit might also influence your test results negatively, we think that you should be fine if you run your tests with multiple forks.

In the following, we will quickly demonstrate how a simple JMH benchmark can be integrated with JUnit. First, we’ll need the dependencies. For Maven, these dependencies will look something like this:

<dependencies>
    <dependency>
        <groupId>org.openjdk.jmh</groupId>
        <artifactId>jmh-core</artifactId>
        <version>1.19</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>org.openjdk.jmh</groupId>
        <artifactId>jmh-generator-annprocess</artifactId>
        <version>1.19</version>
        <scope>test</scope>
    </dependency>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
</dependencies>

Note that in addition to the dependency on jmh-core, you will also need a dependency on the annotation processor, jmh-generator-annprocess. The annotation processor will parse your benchmarks and evaluate all the annotations you defined. If the annotation processor hasn’t run before your benchmark starts, JMH will complain with this error message:

java.lang.RuntimeException: ERROR: Unable to find the resource: /META-INF/BenchmarkList

When running Maven from the command line, the annotation processor should be executed automatically. If you want to run your tests in an IDE as well, you might first have to configure the annotation processor in the IDE – alternatively, you can just trigger the Maven build once from the command line, and it should then also work within the IDE.

Next up, we need our benchmark. For demonstration purposes, we will use a very simple benchmark like the following one:

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
public class JmhBenchmark {

    @Benchmark
    @Fork(3)
    public static double benchmarkPerformanceCriticalComponent() {
        return PerformanceCriticalComponent.performanceCriticalMethod();
    }
}

So this benchmark simply calls the method that we want to test. Ensure that you are actually consuming the result to avoid dead code elimination. You should also ensure that you are running your benchmark in a forked JVM and not in the JUnit JVM – if you want to, you can also parametrize the number of forks or iterations by setting them in the OptionsBuilder that we pass to the Runner in the next step.
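
PerformanceCriticalComponent itself simply stands for whatever part of your application you care about. To make the example self-contained, a purely hypothetical dummy implementation could look like this:

public class PerformanceCriticalComponent {

    // Dummy workload standing in for the real component under test.
    public static double performanceCriticalMethod() {
        double sum = 0;
        for (int i = 1; i <= 1_000; i++) {
            sum += Math.sqrt(i);
        }
        return sum;
    }
}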

Now comes the part where we deviate from the previously presented approach. Instead of using the maven-shade-plugin to package our benchmark into an uber JAR and executing that, we will call our benchmark from within JUnit by using JMH’s Runner class. This class will execute the benchmark with all the specified options for us, and also present us with a set of RunResult objects which we can evaluate. An example might look like this:

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.text.DecimalFormat;
import java.util.Collection;

import org.junit.Test;
import org.openjdk.jmh.results.RunResult;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class JmhBenchmarkTest {

    private static final DecimalFormat df = new DecimalFormat("0.000");
    private static final double REFERENCE_SCORE = 37.132;

    @Test
    public void runJmhBenchmark() throws RunnerException {
        Options opt = new OptionsBuilder()
                .include(JmhBenchmark.class.getSimpleName())
                .build();
        Collection<RunResult> runResults = new Runner(opt).run();
        assertFalse(runResults.isEmpty());
        for (RunResult runResult : runResults) {
            assertDeviationWithin(runResult, REFERENCE_SCORE, 0.05);
        }
    }

    private static void assertDeviationWithin(RunResult result, double referenceScore, double maxDeviation) {
        double score = result.getPrimaryResult().getScore();
        double deviation = Math.abs(score / referenceScore - 1);
        String deviationString = df.format(deviation * 100) + "%";
        String maxDeviationString = df.format(maxDeviation * 100) + "%";
        String errorMessage = "Deviation " + deviationString + " exceeds maximum allowed deviation " + maxDeviationString;
        assertTrue(errorMessage, deviation < maxDeviation);
    }
}

In this example test case, we simply create a new Runner object which gets initialized with our benchmark configuration defined using the OptionsBuilder. The Runner will take care of everything else. If you want to, you can define some assertions on the RunResult to ensure that the performance matches your expectations – in this case, we check that our score does not deviate from a reference score by more than five percent. This of course assumes that your benchmark will always run under the same stable conditions, i.e., the test will most likely fail when executed on different hardware. In the example above, the reference score was simply the result of an earlier benchmark run, but you can also use values from your production environment or specific SLAs as the reference for your benchmarks.
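
If you prefer the JUnit test rather than the annotations to control the benchmark parameters, the OptionsBuilder offers corresponding methods; options set explicitly here take precedence over the annotations. A sketch with arbitrary values (Mode and TimeUnit are the same types used in the annotations above):

Options opt = new OptionsBuilder()
        .include(JmhBenchmark.class.getSimpleName())
        .mode(Mode.Throughput)
        .timeUnit(TimeUnit.MILLISECONDS)
        .forks(3)
        .warmupIterations(5)
        .measurementIterations(10)
        .shouldFailOnError(true)
        .build();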

As you can see, writing a benchmark and integrating it with JUnit is fairly easy. By automatically executing the test for every build of your software, you can ensure that your component’s performance doesn’t deteriorate. The hardest part is designing a benchmark which is stable enough to be run continuously. To ensure this is the case, you have to think about how to isolate your benchmark from the rest of your build process (a dedicated performance testing environment might make sense), and your test data and test setup should also ensure that the test is reliable across consecutive runs.

If done right, benchmarks can help you to identify performance regressions very quickly and early on in the development process, which is much cheaper than finding them in later stages of the process. This will help you to make sure your software performance stays good – and even when it’s not yet where you’d like it to be, you can only start improving it by measuring it.

All code for the examples is also available on GitHub.