Pocket Gophers

10 Packages to Instrument Your Go Application

go1.7.4

I keep writing small batch-processing applications. Go is nice for these because it allows relatively easy concurrent implementations. However, as with all things concurrent, it becomes difficult to see what is happening inside the process. I want to know what is happening so I can determine where to focus development effort to, say, improve performance.

Offline approaches, like benchmarks and profiling, can be useful, but they require simulating the load that would naturally be encountered in production. Gathering similar data from a live system is useful for directing where to apply these offline approaches.

One user on golang-nuts wants to

capture realtime (not via a dump) metrics on heap usage, thread counts, GC collections, HTTP queue length, etc.

Realtime is an interesting request because metric gathering, and potentially analysis, is done by the application itself. It also avoids having to parse log files to recreate application state: the application already has its state, and the metrics simply provide a view of it.
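
As an aside, several of the numbers in that quote are already tracked by the Go runtime; an instrumentation package mostly needs to read them and publish them somewhere. A rough illustration (not part of the example code) of reading a few of them:

package main

import (
	"fmt"
	"runtime"
)

func main() {
	// runtime.MemStats holds heap and GC statistics; NumGoroutine reports
	// the current goroutine count.
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	fmt.Println("heap bytes in use:", ms.HeapAlloc)
	fmt.Println("goroutines:", runtime.NumGoroutine())
	fmt.Println("completed GC cycles:", ms.NumGC)
}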

Dealing with realtime metrics can be split into several parts:

  - instrumenting the application so that it records the metrics
  - collecting the recorded metrics
  - storing, analyzing, and displaying them

I focus on the first point, instrumenting the application. The other points will largely determine, or be determined by, your specific operating environment. One solution that seems popular now is Prometheus. The packages I looked at either already support, or could support, various metrics handling backends.

A Working Example

As usual, I want a working example to use as a basis for comparison and that allows me to easily try the alternatives. You can download the code with:

go get -d pocketgophers.com/10-to-instrument

I have chosen to work with a simple web application both because it is common to want to know how the current load is affecting your system, and because it is fairly easy to generate some load on a web server.

To make experimentation easy, I separated the program into two parts that are run like:

go run main.go basic.go

Replacing basic.go with a test case allows me to try many options without them affecting each other and without needing a way to select which case to run. Here basic.go contains a simple web server:

basic.go
package main

import (
	"io"
	"net/http"
)

func serve(addr string) error {
	http.HandleFunc("/", handler)
	return http.ListenAndServe(addr, nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	work(w)
}

func printMetrics(addr string, w io.Writer) error {
	w.Write([]byte("no metrics recorded\n"))
	return nil
}

The test harness in main.go (a rough sketch of which follows this list)

  1. starts serve in a goroutine
  2. generates traffic on that server using github.com/rakyll/hey (formerly github.com/rakyll/boom)
  3. prints the results from rakyll/hey
  4. runs printMetrics to gather the recorded metrics
  5. waits for the user to exit the process by pressing ^C
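
Below is a rough sketch of how such a harness could look. It is an assumption, not the shipped main.go: it shells out to the hey binary with guessed flags, and its work() is a stand-in that sleeps up to half a second and writes a five-byte response, roughly matching the sample output further down.

package main

import (
	"io"
	"log"
	"math/rand"
	"os"
	"os/exec"
	"os/signal"
	"time"
)

func main() {
	addr := "127.0.0.1:3000"
	log.Println("serving on", addr)

	// 1. start the test case's web server
	go func() {
		if err := serve(addr); err != nil {
			log.Fatal(err)
		}
	}()
	time.Sleep(100 * time.Millisecond) // give the server a moment to start

	// 2. and 3. generate traffic and print rakyll/hey's summary
	// (flags are a guess: 300 requests with 2 concurrent workers)
	hey := exec.Command("hey", "-n", "300", "-c", "2", "http://"+addr+"/")
	hey.Stdout = os.Stdout
	hey.Stderr = os.Stderr
	if err := hey.Run(); err != nil {
		log.Fatal(err)
	}

	// 4. gather the metrics the test case recorded
	log.Println("metrics")
	if err := printMetrics(addr, os.Stdout); err != nil {
		log.Fatal(err)
	}

	// 5. keep the process alive until ^C so metric endpoints stay reachable
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, os.Interrupt)
	<-sig
}

// work is a stand-in for the real workload: sleep up to 500ms and write a
// small response.
func work(w io.Writer) {
	time.Sleep(time.Duration(rand.Int63n(int64(500 * time.Millisecond))))
	io.WriteString(w, "done\n")
}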

Each test case is responsible for:

  - implementing serve() to register its handlers and start the web server
  - instrumenting the handler, which calls work() to do the actual work
  - implementing printMetrics() to write out the metrics it recorded

Running go run main.go basic.go results in:

output/basic.go.txt
main.go:25: serving on 127.0.0.1:3000
main.go:44: working…
Summary:
  Total:	35.7858 secs
  Slowest:	0.5000 secs
  Fastest:	0.0003 secs
  Average:	0.2335 secs
  Requests/sec:	8.3832
  Total data:	1714 bytes
  Size/request:	5 bytes

Status code distribution:
  [200]	300 responses

Response time histogram:
  0.000 [1]	|∎
  0.050 [36]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.100 [40]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.150 [26]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.200 [29]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.250 [32]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.300 [27]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.350 [25]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.400 [35]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.450 [28]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎
  0.500 [21]	|∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎∎

Latency distribution:
  10% in 0.0434 secs
  25% in 0.0980 secs
  50% in 0.2270 secs
  75% in 0.3594 secs
  90% in 0.4341 secs
  95% in 0.4643 secs
  99% in 0.4993 secs
main.go:47: metrics
no metrics recorded

While the output from rakyll/hey and the metrics produced may look like benchmark results, in this case they are not: rakyll/hey and the web server are running in the same process, and no attempt has been made to isolate them or to make multiple runs.

What is important to notice is that the metrics are similar to the numbers gathered by rakyll/hey. This validates the metric gathering.

Speaking of metrics, I will attempt to gather:

  - the number of requests handled
  - the response time of each request

To make this article more readable, I elide the contents of printMetrics() and most of the output. The full code and output are included with the source.

expvar

expvar.go
package main

import (
	"expvar"
	"fmt"
	"io"
	"net/http"
	"time"
)

var (
	requests     = expvar.NewInt("requests")
	responseTime = &timing{}
)

func serve(addr string) error {
	expvar.Publish("responseTime", responseTime)

	http.HandleFunc("/", handler)
	return http.ListenAndServe(addr, nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	begin := time.Now()

	work(w)

	responseTime.Observe(time.Since(begin))
	requests.Add(1)
}

func printMetrics(addr string, w io.Writer) error {
	// …
}

// timing is a minimal expvar.Var that records a count and a total duration
// so it can report an average response time. Note that the handler runs
// concurrently, so a production version should protect count and sum with a
// mutex or atomic operations.
type timing struct {
	count int64
	sum   time.Duration
}

func (t *timing) Observe(d time.Duration) {
	t.count++
	t.sum += d
}

func (t timing) String() string {
	if t.count == 0 {
		return `"0s"`
	}

	avg := time.Duration(t.sum.Nanoseconds() / t.count)
	return fmt.Sprintf("\"%v\"", avg)
}

output/expvar.go.txt
{
	// …
	requests": 300, 
	responseTime": "231.796036ms"
}
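
Nothing extra is needed to expose these values: importing expvar registers a handler at /debug/vars on http.DefaultServeMux, which ListenAndServe uses here, so the JSON above can be read back with a plain GET. A minimal sketch of doing so (the actual printMetrics is elided above and included with the source):

package main

import (
	"io"
	"net/http"
)

// fetchExpvars is a hypothetical helper that copies the JSON served by
// expvar at /debug/vars to w.
func fetchExpvars(addr string, w io.Writer) error {
	resp, err := http.Get("http://" + addr + "/debug/vars")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	_, err = io.Copy(w, resp.Body)
	return err
}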

Prometheus

prometheus.go
package main

import (
	"io"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

var (
	requests = prometheus.NewCounter(prometheus.CounterOpts{
		Name: "requests",
		Help: "number of requests",
	})
	responseTime = prometheus.NewHistogram(prometheus.HistogramOpts{
		Name: "responseTime",
		Help: "response time",
	})
)

func serve(addr string) error {
	prometheus.MustRegister(requests)
	prometheus.MustRegister(responseTime)
	http.Handle("/metrics", prometheus.Handler())

	http.HandleFunc("/", handler)
	return http.ListenAndServe(addr, nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()

	work(w)

	responseTime.Observe(time.Since(start).Seconds())
	requests.Inc()
}

func printMetrics(addr string, w io.Writer) error {
	// …
}
output/prometheus.go.txt
// …
# HELP requests number of requests
# TYPE requests counter
requests 300
# HELP responseTime response time
# TYPE responseTime histogram
responseTime_bucket{le="0.005"} 5
responseTime_bucket{le="0.01"} 6
responseTime_bucket{le="0.025"} 17
responseTime_bucket{le="0.05"} 35
responseTime_bucket{le="0.1"} 74
responseTime_bucket{le="0.25"} 164
responseTime_bucket{le="0.5"} 299
responseTime_bucket{le="1"} 300
responseTime_bucket{le="2.5"} 300
responseTime_bucket{le="5"} 300
responseTime_bucket{le="10"} 300
responseTime_bucket{le="+Inf"} 300
responseTime_sum 69.97811379399995
responseTime_count 300
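
The bucket boundaries above are the client library's defaults (prometheus.DefBuckets, in seconds). If they fit your latency range poorly, HistogramOpts accepts explicit buckets; for example, a hypothetical variant of responseTime with ten linear buckets covering the roughly 0–0.5 second responses that work() produces:

package main

import "github.com/prometheus/client_golang/prometheus"

// Hypothetical variant: linear buckets from 50ms to 500ms (values in seconds).
var responseTime = prometheus.NewHistogram(prometheus.HistogramOpts{
	Name:    "responseTime",
	Help:    "response time",
	Buckets: prometheus.LinearBuckets(0.05, 0.05, 10),
})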

codahale, go-kit/metrics

codahale.go
package main

import (
	"io"
	"net/http"
	"time"

	"github.com/codahale/metrics"
)

var (
	requests     = metrics.Counter("requests")
	responseTime = metrics.NewHistogram("responseTime", 0, time.Minute.Nanoseconds(), 4)
)

func serve(addr string) error {
	http.HandleFunc("/", handler)
	return http.ListenAndServe(addr, nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	begin := time.Now()

	work(w)

	responseTime.RecordValue(time.Since(begin).Nanoseconds())
	requests.Add()
}

func printMetrics(addr string, w io.Writer) error {
	// …
}
output/codahale.go.txt
{
	// …
	"metrics": {
		"Counters": {
			"requests": 300
		},
		"Gauges": {
			"responseTime.P50": 221962239,
			"responseTime.P75": 357449727,
			"responseTime.P90": 430374911,
			"responseTime.P95": 461094911,
			"responseTime.P99": 493240319,
			"responseTime.P999": 498728959
		}
	}
}

easy-metrics

easy-metrics.go
package main

import (
	"fmt"
	"io"
	"log"
	"net/http"
	"time"

	"github.com/admobi/easy-metrics"
)

var (
	requests        = metrics.NewCounter("requests")
	avgResponseTime = metrics.NewGauge("avgResponseTime")
	responseTime    = &timing{}
)

func serve(addr string) error {
	r, err := metrics.NewTrackRegistry("stats", 100, time.Second, false)
	if err != nil {
		log.Fatalln(err)
	}

	err = r.AddMetrics(requests, avgResponseTime)
	if err != nil {
		log.Fatalln(err)
	}

	http.HandleFunc("/", handler)
	return http.ListenAndServe(addr, nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	begin := time.Now()

	work(w)

	responseTime.Observe(time.Since(begin))
	requests.Inc()
}

func printMetrics(addr string, w io.Writer) error {
	// …
}

type timing struct {
	count int64
	sum   time.Duration
}

func (t *timing) Observe(d time.Duration) {
	t.count++
	t.sum += d

	avgResponseTime.Set(t.sum.Seconds() / float64(t.count))
}

func (t timing) String() string {
	avg := time.Duration(t.sum.Nanoseconds() / t.count)
	return fmt.Sprintf("\"%v\"", avg)
}
easy-metrics output

gocraft/health

health.go
package main

import (
	"io"
	"net/http"
	"time"

	"github.com/gocraft/health"
)

var stream = health.NewStream()

func serve(addr string) error {
	// Use 2-minute intervals because an interval does not start when the
	// program starts, but at the beginning of each minute. Since traffic
	// generation takes about 35 seconds, the metrics were being split into
	// two intervals, which would be OK in real life, but not for this
	// example.
	sink := health.NewJsonPollingSink(2*time.Minute, 5*time.Minute)
	stream.AddSink(sink)
	http.Handle("/health", sink)

	http.HandleFunc("/", handler)
	return http.ListenAndServe(addr, nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	job := stream.NewJob("handler")

	work(w)

	job.Complete(health.Success)
}

func printMetrics(addr string, w io.Writer) error {
	// …
}
output/health.go.txt
{
	// …
	"interval_duration": 120000000000,
	"aggregations": [
		{
			"interval_start": "2016-12-20T10:48:00+01:00",
			"serial_number": 300,
			"jobs": {
				"handler": {
					"timers": {},
					"gauges": {},
					"events": {},
					"event_errs": {},
					"count": 300,
					"nanos_sum": 69497956718,
					"nanos_sum_squares": 2.2404587230146523e+19,
					"nanos_min": 8567,
					"nanos_max": 498093309,
					"count_success": 300,
					"count_validation_error": 0,
					"count_panic": 0,
					"count_error": 0,
					"count_junk": 0
				}
			},
			"timers": {},
			"gauges": {},
			"events": {},
			"event_errs": {}
		}
	]
}

inspect

inspect.go
package main

import (
	"io"
	"net/http"
	"time"

	"github.com/square/inspect/metrics"
)

var (
	requests     = metrics.NewCounter()
	responseTime = metrics.NewStatsTimer(time.Nanosecond, 300)
)

func serve(addr string) error {
	m := metrics.NewMetricContext("webapp")
	m.Register(requests, "requests")
	m.Register(responseTime, "responseTime")
	http.HandleFunc("/metrics", m.HttpJsonHandler)

	http.HandleFunc("/", handler)
	return http.ListenAndServe(addr, nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	t := responseTime.Start()

	work(w)

	responseTime.Stop(t)
	requests.Add(1)
}

func printMetrics(addr string, w io.Writer) error {
	// …
}
output/inspect.go.txt
// …
[
	{
		"Type": "*metrics.Counter",
		"Name": "requests",
		"Value": {
			"current": 300,
			"rate": 8.282713
		}
	},
	{
		"Type": "*metrics.StatsTimer",
		"Name": "responseTime",
		"Value": {
			"Percentiles": [{},{},{},{},{},{},{}]
		}
	}
]

instruments

instruments.go
package main

import (
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/heroku/instruments"
)

var (
	requests     = instruments.NewCounter()
	responseTime = instruments.NewTimer(-1) // in ms
)

func serve(addr string) error {
	http.HandleFunc("/", handler)
	return http.ListenAndServe(addr, nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	begin := time.Now()

	work(w)

	responseTime.Since(begin)
	requests.Update(1)
}

func printMetrics(addr string, w io.Writer) error {
	// …
}
output/instruments.go.txt
// …
requests: 300
response time distribution:
  10% in 41ms
  25% in 95ms
  50% in 224ms
  75% in 358ms
  90% in 434ms
  95% in 466ms
  99% in 497ms

spacemonkeygo/monkit.v2

monkit.go
package main

import (
	"io"
	"net/http"

	"gopkg.in/spacemonkeygo/monkit.v2"
	"gopkg.in/spacemonkeygo/monkit.v2/environment"
	"gopkg.in/spacemonkeygo/monkit.v2/present"
)

var mon = monkit.Package()

func serve(addr string) error {
	environment.Register(monkit.Default)
	http.Handle("/monkit/", http.StripPrefix("/monkit/", present.HTTP(monkit.Default)))

	http.HandleFunc("/", handler)
	return http.ListenAndServe(addr, nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	// mon.Task() starts a timed task; the function it returns is deferred
	// with a nil *error so the task is recorded as complete and successful
	// when the handler returns.
	ctx := r.Context()
	defer mon.Task()(&ctx, w, r)(nil)
	r = r.WithContext(ctx)

	work(w)
}

func printMetrics(addr string, w io.Writer) error {
	// …
}
output/monkit.go.txt
// …
/monkit/funcs:
[3354336375209924631] main.handler
  parents: entry
  current: 0, highwater: 2, success: 300, errors: 0, panics: 0
  success times:
    0.00: 25.712µs
    0.10: 37.850052ms
    0.25: 110.334474ms
    0.50: 229.859144ms
    0.75: 353.10128ms
    0.90: 452.483836ms
    0.95: 483.47486ms
    1.00: 500.321625ms
    avg: 232.048885ms
  failure times:
    0.00: 0s
    0.10: 0s
    0.25: 0s
    0.50: 0s
    0.75: 0s
    0.90: 0s
    0.95: 0s
    1.00: 0s
    avg: 0s

telemetry

telemetry.go
package main

import (
	"io"
	"net/http"
	"time"

	"github.com/arussellsaw/telemetry"
	"pocketgophers.com/10-to-instrument/reporters"
)

var (
	tel          = telemetry.New("", 2*time.Minute)
	requests     = telemetry.NewCounter(tel, "requests", 2*time.Minute)
	responseTime = telemetry.NewAverage(tel, "responseTime", 2*time.Minute)
)

func serve(addr string) error {
	http.Handle("/metrics", reporters.TelemetryHandler{Tel: tel})
	http.HandleFunc("/", handler)
	return http.ListenAndServe(addr, nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	start := time.Now()

	work(w)

	responseTime.Add(tel, time.Since(start).Seconds())
	requests.Add(tel, 1)
}

func printMetrics(addr string, w io.Writer) error {
	// …
}
output/telemetry.go.txt
{
 	// …
	"requests": 300,
	"responseTime": 0.23329017467
}

xstats

xstats.go
package main

import (
	"io"
	"net/http"
	"time"

	"github.com/rs/xstats"
	"github.com/rs/xstats/prometheus"
)

var sender = prometheus.NewHandler()
var stats = xstats.New(sender)

func serve(addr string) error {
	http.Handle("/metrics", sender)

	http.HandleFunc("/", handler)
	return http.ListenAndServe(addr, nil)
}

func handler(w http.ResponseWriter, r *http.Request) {
	begin := time.Now()

	work(w)

	stats.Histogram("responseTime", time.Since(begin).Seconds())
	stats.Count("requests", 1)
}

func printMetrics(addr string, w io.Writer) error {
	// …
}
output/xstats.go.txt
// …
# HELP requests requests
# TYPE requests counter
requests 300
# HELP responseTime responseTime
# TYPE responseTime histogram
responseTime_bucket{le="0.005"} 5
responseTime_bucket{le="0.01"} 5
responseTime_bucket{le="0.025"} 18
responseTime_bucket{le="0.05"} 34
responseTime_bucket{le="0.1"} 76
responseTime_bucket{le="0.25"} 165
responseTime_bucket{le="0.5"} 299
responseTime_bucket{le="1"} 300
responseTime_bucket{le="2.5"} 300
responseTime_bucket{le="5"} 300
responseTime_bucket{le="10"} 300
responseTime_bucket{le="+Inf"} 300
responseTime_sum 69.99046372199999
responseTime_count 300
