GopherCon 2018 - Micro-optimizing Go Code
Matthew Jaffee for the GopherCon Liveblog
Presenter: George Tankersley
Liveblogger: Matt Jaffee
Everyone knows: if you need to go fast, bust out the assembly. But what if there's better performance to be had just by writing Go more carefully?
Summary
In this tutorial session, George will optimize a hash function in pure Go, starting from a direct implementation of the RFC and going until he beats the standard library. This will cover the whole process - starting with the naive code and a set of tests, then progressing through some algorithmic improvements and down the rabbit hole of program tuning. Along the way you'll see the performance impact of API design, some microbenchmarking pitfalls, an introduction to profiling, and a lot of practical tips for doing this kind of optimization.
When the hash function runs out, George will draw further examples from the standard libraries (mostly crypto!) and other high performance Go projects.
Code at: https://github.com/gtank/blake2s
George got carried away with optimization - he wanted to make a Go version of a hash function as fast as the Rust/C++ implementations.
George is employed by Zcash - think Bitcoin with privacy.
Blake2s
A hash function maps variable-length input to fixed-length output.
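For example, crypto/sha256 from the standard library turns input of any length into a 32-byte digest:

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

func main() {
	sum := sha256.Sum256([]byte("an input of any length"))
	fmt.Printf("%x\n", sum) // always 32 bytes out
}
```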
Blake2 is an awesome new hash function with some great features:
- Fast
- Immune to length extension attacks
- Parallelism
- tree hashing
- prefix-MAC
- personalization
- ...
but...
It's under-specified, and no one implements all of it.
"under-specified" is not a phrase you want to hear with regards to things like crypto and hash functions.
Zcash wanted to implement Blake2 independently in several languages to validate it.
Blake2 Algorithm:
- Initialize parameters
- Split input data into fixed-size blocks
- Scramble the bits around
- Update internal state
- Finalize & output
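A rough structural sketch of that block-splitting loop in Go (not the actual blake2s code - compress here stands in for the real round function):

```go
const blockSize = 64 // blake2s operates on 64-byte blocks

type digest struct {
	h      [8]uint32       // internal chaining state
	block  [blockSize]byte // buffered input
	offset int             // bytes currently buffered
}

// Write splits incoming data into fixed-size blocks and folds each
// full block into the internal state.
func (d *digest) Write(p []byte) (int, error) {
	n := len(p)
	for len(p) > 0 {
		c := copy(d.block[d.offset:], p)
		d.offset += c
		p = p[c:]
		if d.offset == blockSize {
			d.compress() // scramble the bits, update d.h
			d.offset = 0
		}
	}
	return n, nil
}

func (d *digest) compress() {
	// the real mixing rounds (the G function) go here
}
```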
Hash functions in Go
The standard library models hashing with hash.Hash - a fairly complex interface:
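```go
// from package hash in the standard library
type Hash interface {
	// Write (via the embedded io.Writer) adds more data to the running hash.
	io.Writer

	// Sum appends the current hash to b and returns the resulting slice.
	Sum(b []byte) []byte

	// Reset resets the Hash to its initial state.
	Reset()

	// Size returns the number of bytes Sum will return.
	Size() int

	// BlockSize returns the hash's underlying block size.
	BlockSize() int
}
```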
And it doesn't work for Blake2, which has features (keys, personalization) that this interface has no room for.
Benchmarking
- go test -bench
- benchstat
- pprof
testing.B is how you write benchmarks in Go - you can put them right in your test files. Each benchmark needs to loop from 0 to b.N; go test -bench determines N, you just have to make sure that whatever you're benchmarking runs N times.
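A minimal benchmark sketch (Sum256 here stands in for whatever function you're measuring):

```go
package blake2s

import "testing"

// BenchmarkHash1K measures throughput on 1 KiB inputs.
func BenchmarkHash1K(b *testing.B) {
	buf := make([]byte, 1024)
	b.SetBytes(int64(len(buf))) // makes go test report MB/s
	for i := 0; i < b.N; i++ {
		Sum256(buf)
	}
}
```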
Quite slow initially - nowhere near 1 GB/s.
pprof
- generate a profile with go test -bench=. -cpuprofile=cpu.out, then explore it with go tool pprof
- the "top5" command lists the hottest functions in your benchmark
- the "web" command gives a graphical view of the hottest call paths
- try to find the lowest-level function that you can control
The Inliner
We found a function that is pretty small and doesn't have side effects - this is a good candidate for inlining.
Inlining is when the compiler copies the body of a function into the caller. It avoids function call overhead, but increases binary size.
- the compiler calculates a complexity score for each function to decide whether it can be inlined
- there's no manual control over the inliner
- functions accrue cost for each node in their AST
- slices cost extra
- function calls are +2
- it never inlines functions containing for/range/select or other nonlinear control flow
- it never inlines certain runtime funcs or non-intrinsic assembly
- check out inl.go in the compiler source
Is it worth it? In this case making the function inline will give a 70% speed boost.
How do you know if a function will inline? Build with go build -gcflags=-m and the compiler will print its inlining decisions.
Manually inline these math/bits function calls!
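A sketch of what that transformation looks like (blake2s rotates right by 8 in its G function; the exact code from the talk may differ):

```go
import "math/bits"

// Before: the bits.RotateLeft32 call adds to the caller's complexity score.
func rotr8(v uint32) uint32 {
	return bits.RotateLeft32(v, -8) // rotate right by 8
}

// After: the same rotation written out by hand, shaving AST nodes off
// the score so the surrounding function can stay under budget.
func rotr8Manual(v uint32) uint32 {
	return v>>8 | v<<24
}
```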
Now our cost is 81 - one over the limit!
Had to change the API to remove an argument to get it under budget - a pretty ugly hack!
Be careful - under inliner budget doesn't always mean faster - always benchmark.
What's next?
pprof
Use "weblist" this time. Gives a nice source view in the browser with line-by-line profiling data.
Searching for a hot function from our "top5" pprof output, we see two slice lookups taking over 1.5s of the total 6s. Diving into that in the weblist view gives us an assembly breakdown: two JAE (jump) instructions to runtime.panicindex - this is bounds checking!
"Safety is great... but we're trying to go fast."
Finding bounds checks: compile with go build -gcflags="-d=ssa/check_bce/debug=1" and the compiler will print every bounds check it leaves in the code.
You can do an early bounds check deep into the slice to avoid multiple bounds checks further on.
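A minimal sketch of the trick (a hypothetical helper, not the talk's code):

```go
// loadWord decodes a little-endian uint32. The `_ = b[3]` check up
// front proves to the compiler that all four accesses below are in
// range, so it emits one bounds check instead of four.
func loadWord(b []byte) uint32 {
	_ = b[3]
	return uint32(b[0]) | uint32(b[1])<<8 | uint32(b[2])<<16 | uint32(b[3])<<24
}
```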
If that won't work:
- propagate constants
- unroll loops (see the sketch below)
- reuse previously allocated variables
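For instance, a minimal unrolling sketch (hypothetical helper, not the talk's code):

```go
// xor8 XORs eight bytes of src into dst. The two checks up front
// prove all sixteen accesses are in range, and the straight-line
// body has no loop to block inlining.
func xor8(dst, src []byte) {
	_ = dst[7]
	_ = src[7]
	dst[0] ^= src[0]
	dst[1] ^= src[1]
	dst[2] ^= src[2]
	dst[3] ^= src[3]
	dst[4] ^= src[4]
	dst[5] ^= src[5]
	dst[6] ^= src[6]
	dst[7] ^= src[7]
}
```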
These transformations can make the code super ugly! A dozen lines becomes a few hundred... but it's an 80% throughput boost!
Still some bounds checks - one more technique: Replace slices with arrays.
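A sketch of why arrays help (illustrative names):

```go
func example() {
	s := make([]byte, 64) // slice: length known only at run time
	_ = s[63]             // the compiler must emit a bounds check

	var a [64]byte // array: the length is part of the type
	a[63] = 1      // constant index < 64, proven safe at compile time
}
```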
Going back to pprof:
We're seeing mallocgc, makeslice, and memmove. Allocations are expensive - once when they happen, and again later through garbage collection. Let's grep for make and see what we can get rid of.
The Go compiler contains "pattern matching" which can catch things like looping over a slice and zeroing it out, replacing the loop with a special low-level call like memset.
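For example, this zeroing loop is recognized and compiled down to a single call into the runtime's memclr routine:

```go
// The compiler spots this idiom and replaces the whole loop with one
// runtime memclr call instead of a byte-at-a-time loop.
func zero(buf []byte) {
	for i := range buf {
		buf[i] = 0
	}
}
```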
You can check the capacity of a slice with cap(), which is often larger than its length - you may be able to reuse existing allocated memory rather than making a new slice.
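A sketch of the idea (hypothetical helper):

```go
// resize returns a slice of length n, reusing buf's backing array
// when it's big enough and allocating only when it isn't.
func resize(buf []byte, n int) []byte {
	if cap(buf) >= n {
		return buf[:n] // reslice: no new allocation
	}
	return make([]byte, n)
}
```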
We've eked out another few percent!
Diminishing Returns
There's more we can do:
- don't allocate intermediate variables
- unroll all loops
- manually inline things
- hunt down more bounds checks
But, probably not worth it to go much further.
Things start to depend heavily on the compiler version - you have to watch carefully that new Go versions don't break your "optimizations".
You won't be competitive with assembly in pure Go - maybe use Rust?