Pocket Gophers

Concurrency Slower?

go1.9
WTH? Shouldn’t this be faster? - Concurrency slower?

This started as a simple exercise to learn concurrency in Go. The exercise had a simple premise:

But instead of fast running goodness, it ended slower than it started:

Go's concurrency primitives are light-weight. However, they are not free. Since there is a cost, there needs to be a return on that investment. For this exercise, the expected return is a faster runtime. Balancing the costs and the benefits is the real trick.

The code and associated files for this tutorial can be gotten with:

go get -d pocketgophers.com/concurrency-slower
Each commit in the downloaded git repository corresponds with the progress made in this tutorial.

This tutorial was ran on a Late 2010 MacBook Air with 1.6GHz Intel Core 2 Duo with 4GB RAM running macOS 10.12.6.

The Posted Code

Let's start with what was posted:

main.go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

var list = [][]int{}

func main() {

	rand.Seed(time.Now().UnixNano())

	rec := make(chan []int, 10000000)

	go func(c chan []int) {
		for {
			list = append(list, <-c)
		}
	}(rec)

	start := time.Now()
	fmt.Println("Start")
	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		func() {
			defer wg.Done()
			for i := 0; i < 1000000; i++ {
				add(rec)
			}
		}()
	}
	wg.Wait()
	d := time.Now().Sub(start)
	fmt.Println("End ", len(list), " ", d.Seconds())
}

func add(c chan []int) {
	c <- []int{
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
	}
}

This code has some good things going for it:

However, it also has some problems as can be shown by running it:

$ go run main.go
Start
End  9999991   26.401582218

Notice that len(list) is not the expected 1,000,000. Another posted version had the expected len(list), but only one of the elements was set (1M times).

Let's use Go's built-in testing and benchmarking tools to improve the code.

Extract to Function

The first step is to isolate the code to be tested and benchmarked. This was done by first extracting the relevant parts from main into a new function lottoNumbers(int) [][]int that takes the number of lotto numbers to generate as its argument and returns the list of lotto numbers.

Global variables make testing and benchmarking difficult because it is hard to make sure that all the globals were reset or useable for a second run. So I made list local to lottoNumbers.

The last thing to do is make the number of lotto numbers to generate tunable. This was already partially done by making it an argument to lottoNumbers(), however adding a global constant, numberToGenerate that will only be referred to in main and the tests and benchmarks allows that value to be easily changed without leaving hard-to-spot values that only differ by a 0, such as 10000000 and 1000000 in the posted code.

This leaves us with:

main.go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

const (
	numberToGenerate = 1000000
)

func main() {
	start := time.Now()
	fmt.Println("Start")

	list := lottoNumbers(numberToGenerate)

	d := time.Now().Sub(start)
	fmt.Println("End ", len(list), " ", d.Seconds())
}

func lottoNumbers(n int) [][]int {
	var list = [][]int{}
	rand.Seed(time.Now().UnixNano())

	rec := make(chan []int, n)

	go func(c chan []int) {
		for {
			list = append(list, <-c)
		}
	}(rec)

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		func() {
			defer wg.Done()
			for i := 0; i < n; i++ {
				add(rec)
			}
		}()
	}
	wg.Wait()

	return list
}

func add(c chan []int) {
	c <- []int{
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
	}
}

A quick run shows the code works as it did before:

$ go run main.go
Start
End  9999996   20.554261007

Testing with go test

Now that the code is in a testable state, its time to write some tests. Testing results based on random numbers is really hard. The test suite for math/rand, for example, includes a statistical analysis of the generated numbers.

Since there are already two failure cases: not enough numbers, and most numbers not being set, at least these cases need to be tested for. The first case is easy while the second is harder.

Since this testing is done as an exercise, and I don't really know what the properties of lottery numbers should be, I relied on the existing code to figure out what should be tested and came up with the following criteria:

I picked a few attributes of distributions to compare against expected values:

I used github.com/gonum/stat to handle the statistics and ended up with the following test:

main_test.go
package main

import (
	"math"
	"testing"

	"github.com/gonum/stat"
	"github.com/gonum/stat/distuv"
)

func within(a, b, maxDiff float64) bool {
	absDiff := math.Abs(a - b)
	return absDiff < maxDiff
}

func TestLottoNumbers(t *testing.T) {
	list := lottoNumbers(numberToGenerate)

	const expectedNumberLength = 7

	// list should generated the correct number of numbers
	if len(list) != numberToGenerate {
		t.Errorf("len(list) is %v, not %v", len(list), numberToGenerate)
	}

	// lotto numbers are 7 long
	var withIncorrectLength []int
	for i, number := range list {
		if len(number) != expectedNumberLength {
			withIncorrectLength = append(withIncorrectLength, i)
		}
	}
	if len(withIncorrectLength) != 0 {
		t.Errorf("%v lotto numbers including index %v do not have the expected length of %d", len(withIncorrectLength), withIncorrectLength[0], expectedNumberLength)
		t.FailNow() // later tests rely on the numbers existing
	}

	// convert to form needed by stat
	// also grab max, min
	x := make([]float64, numberToGenerate*7)
	max := list[0][0]
	min := max
	for i := range x {
		j := i / 7
		k := i % 7

		n := list[j][k]

		x[i] = float64(n)

		if n > max {
			max = n
		}

		if n < min {
			min = n
		}
	}

	// check max and min
	if max > 48 {
		t.Errorf("max is %v", max)
	}

	if min < 0 {
		t.Errorf("min is %v", min)
	}

	// check distribution properties
	maxDiff := 0.02
	uniform := distuv.Uniform{
		Max: 49,
	}

	stdDev := stat.StdDev(x, nil)
	if !within(stdDev, uniform.StdDev(), maxDiff) {
		t.Errorf("std dev not within %v of uniform (%v): %v", maxDiff, uniform.StdDev(), stdDev)
	}

	skew := stat.Skew(x, nil)
	if !within(skew, uniform.Skewness(), maxDiff) {
		t.Errorf("skewness not within %v of uniform (%v): %v", maxDiff, uniform.Skewness(), skew)
	}

	exk := stat.ExKurtosis(x, nil)
	if !within(exk, uniform.ExKurtosis(), maxDiff) {
		t.Errorf("ExKurtosis not within %v of uniform (%v): %v", maxDiff, uniform.ExKurtosis(), exk)
	}
}
$ go test
--- FAIL: TestLottoNumbers (22.60s)
	main_test.go:23: len(list) is 9998657, not 1000000
FAIL
exit status 1
FAIL	concurrency-slower	22.894s

Since this version of lottoNumbers only fails by not generating enough numbers, I replaced it with the other posted version to show the other tests also work.

main.go
package main

import (
	"fmt"
	"math/rand"
	"sync"
	"time"
)

const (
	numberToGenerate = 1000000
)

func main() {
	start := time.Now()
	fmt.Println("Start")

	list := lottoNumbers(numberToGenerate)

	d := time.Now().Sub(start)
	fmt.Println("End ", len(list), " ", d.Seconds())
}

func lottoNumbers(n int) [][]int {
	var parallelCount = 10
	var iterations = n/parallelCount
	var list = make([][]int, parallelCount*iterations)
	var wg sync.WaitGroup

	wg.Add(parallelCount)
	for i := 0; i < parallelCount; i++ {
		go func() {
			defer wg.Done()

			r := rand.New(rand.NewSource(time.Now().UnixNano()))

			for i := 0; i < iterations; i++ {
				list[parallelCount*iterations-1] = add(r)
			}
		}()
	}
	wg.Wait()

	return list
}

func add(r *rand.Rand) []int {
	return []int{
		r.Intn(49),
		r.Intn(49),
		r.Intn(49),
		r.Intn(49),
		r.Intn(49),
		r.Intn(49),
		r.Intn(49),
	}
}
$ go run main.go
Start
End  1000000   0.362201313
$ go test
--- FAIL: TestLottoNumbers (0.43s)
	main_test.go:34: 999999 lotto numbers including index 0 do not have the expected length of 7
FAIL
exit status 1
FAIL	concurrency-slower	0.467s

A Serial Version

A general rule of thumb is: make it work fast, then make it work fast. Since the existing versions don't work, as shown by the failing tests, the first thing to do is get a version that consistently passes the tests. Since the concurrency used here complicates the design instead of simplifying it, a purely serial version that is correct is a better starting point. I tried to reproduce the serial version the poster started with.

main.go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	numberToGenerate = 1000000
)

func main() {
	start := time.Now()
	fmt.Println("Start")

	list := lottoNumbers(numberToGenerate)

	d := time.Now().Sub(start)
	fmt.Println("End ", len(list), " ", d.Seconds())
}

func lottoNumbers(n int) [][]int {
	list := make([][]int, 0, n)
	rand.Seed(time.Now().UnixNano())

	for i := 0; i < n; i++ {
		list = append(list, []int{
			rand.Intn(49),
			rand.Intn(49),
			rand.Intn(49),
			rand.Intn(49),
			rand.Intn(49),
			rand.Intn(49),
			rand.Intn(49),
		})
	}

	return list
}
$ go run main.go
Start
End  1000000   1.180813682
$ go test
PASS
ok  	concurrency-slower	1.764s

This version is much faster. The speedup was achieved by not doing the extra work the original version did. Also, the tests passed 5 times..

Benchmarks

Now that we have a working version, its time to make it work fast. Benchmarks are a useful tool to do so. Go benchmarks are quick to get started with: loop b.N times, calling the function to be benchmarked each iteration.

main_test.go
package main

import (
	"math"
	"testing"

	"github.com/gonum/stat"
	"github.com/gonum/stat/distuv"
)

func within(a, b, maxDiff float64) bool {
	absDiff := math.Abs(a - b)
	return absDiff < maxDiff
}

func TestLottoNumbers(t *testing.T) {
	list := lottoNumbers(numberToGenerate)

	const expectedNumberLength = 7

	// list should generated the correct number of numbers
	if len(list) != numberToGenerate {
		t.Errorf("len(list) is %v, not %v", len(list), numberToGenerate)
	}

	// lotto numbers are 7 long
	var withIncorrectLength []int
	for i, number := range list {
		if len(number) != expectedNumberLength {
			withIncorrectLength = append(withIncorrectLength, i)
		}
	}
	if len(withIncorrectLength) != 0 {
		t.Errorf("%v lotto numbers including index %v do not have the expected length of %d", len(withIncorrectLength), withIncorrectLength[0], expectedNumberLength)
		t.FailNow() // later tests rely on the numbers existing
	}

	// convert to form needed by stat
	// also grab max, min
	x := make([]float64, numberToGenerate*7)
	max := list[0][0]
	min := max
	for i := range x {
		j := i / 7
		k := i % 7

		n := list[j][k]

		x[i] = float64(n)

		if n > max {
			max = n
		}

		if n < min {
			min = n
		}
	}

	// check max and min
	if max > 48 {
		t.Errorf("max is %v", max)
	}

	if min < 0 {
		t.Errorf("min is %v", min)
	}

	// check distribution properties
	maxDiff := 0.02
	uniform := distuv.Uniform{
		Max: 49,
	}

	stdDev := stat.StdDev(x, nil)
	if !within(stdDev, uniform.StdDev(), maxDiff) {
		t.Errorf("std dev not within %v of uniform (%v): %v", maxDiff, uniform.StdDev(), stdDev)
	}

	skew := stat.Skew(x, nil)
	if !within(skew, uniform.Skewness(), maxDiff) {
		t.Errorf("skewness not within %v of uniform (%v): %v", maxDiff, uniform.Skewness(), skew)
	}

	exk := stat.ExKurtosis(x, nil)
	if !within(exk, uniform.ExKurtosis(), maxDiff) {
		t.Errorf("ExKurtosis not within %v of uniform (%v): %v", maxDiff, uniform.ExKurtosis(), exk)
	}
}

func BenchmarkLottoNumbers(b *testing.B) {
	for i := 0; i < b.N; i++ {
		lottoNumbers(numberToGenerate)
	}
}

Benchmarks are ran with go test. I save the output using tee for later comparison.

$ go test -bench=. -benchmem -count 10 | tee serial.txt
goos: darwin
goarch: amd64
pkg: concurrency-slower
BenchmarkLottoNumbers-2   	       1	1151312776 ns/op	88002560 B/op	 1000001 allocs/op
BenchmarkLottoNumbers-2   	       1	1125988040 ns/op	88002560 B/op	 1000001 allocs/op
BenchmarkLottoNumbers-2   	       1	1138461320 ns/op	88002560 B/op	 1000001 allocs/op
BenchmarkLottoNumbers-2   	       1	1117053451 ns/op	88002560 B/op	 1000001 allocs/op
BenchmarkLottoNumbers-2   	       1	1112417673 ns/op	88002560 B/op	 1000001 allocs/op
BenchmarkLottoNumbers-2   	       1	1191413534 ns/op	88002560 B/op	 1000001 allocs/op
BenchmarkLottoNumbers-2   	       1	1153963489 ns/op	88002560 B/op	 1000001 allocs/op
BenchmarkLottoNumbers-2   	       1	1188261781 ns/op	88002560 B/op	 1000001 allocs/op
BenchmarkLottoNumbers-2   	       1	1143055600 ns/op	88002560 B/op	 1000001 allocs/op
BenchmarkLottoNumbers-2   	       1	1149106696 ns/op	88002560 B/op	 1000001 allocs/op
PASS
ok  	concurrency-slower	26.868s

The output shows the benchmark, the number of times the benchmark loop ran (i.e., b.N), the average time for each run (i.e., total_time/b.N), the average bytes allocated each run, and the average number of allocatons for each run.

Go will run the test suite before running the benchmarks. This is helpful to ensure the whatever you are benchmarking still passes the tests. The benchmark (and tests) are run 5 times to create enough data for later statistical analysis.

A Parallel Version

When parallelizing code I like to start by finding the minimal unit of work. This is the smallest unit of work the parallel version needs to do. In this case the minimal unit is:

list = append(list, []int{
	rand.Intn(49),
	rand.Intn(49),
	rand.Intn(49),
	rand.Intn(49),
	rand.Intn(49),
	rand.Intn(49),
	rand.Intn(49),
})

If the minimal unit were safe for concurrent operation (this one is not), running each one in a separate goroutine maximizes concurrency. Even though go can run a million goroutines, there is cost associated with launching and scheduling them. Any interaction between goroutines through channels or mutexes also introduces work the serial version did not need to do. All this extra work is overhead.

The trick to minimizing overhead to gain the benefits of parallelization is to find the minimal scalable unit. By scalable, I mean that the amount of work done by each goroutine is adjustable. In this case the minimal scalable unit is:

for i := 0; i < n; i++ {
	list = append(list, []int{
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
	})
}
Instead of each goroutine generating one lotto number, they will generate n lotto numbers.

The next step is to make the minimal scalable unit safe for concurrent execution. The main problem here is append. Since I learned to write code on distributed systems (i.e., many separate computers), my first instinct is to build up a local list and then send it to another goroutine that combines the lists together. While this method works, building local lists, sending them around, and combining them together is all extra work that does not need to be done because each goroutine can just change its part of the list. This can be checked as safe by using the race detector. That changes the minimal scalable unit to:

for i := begin; i < end; i++ {
	list[i] = []int{
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
		rand.Intn(49),
	}
}
Where begin and end are the part of the work to be accomplished by that unit.

The work can then be split as evenly as possible between the desired number of goroutines. Since this example appears to be limited by CPU, one goroutine per processor maximizes the potential benefit (amount of CPU capacity) while minimizing overhead.

Using a sync.WaitGroup to wait until all the goroutines have completed, results in this parallel version:

main.go
package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"sync"
	"time"
)

const (
	numberToGenerate = 1000000
)

func main() {
	start := time.Now()
	fmt.Println("Start")

	list := lottoNumbers(numberToGenerate)

	d := time.Now().Sub(start)
	fmt.Println("End ", len(list), " ", d.Seconds())
}

func lottoNumbers(n int) [][]int {
	list := make([][]int, n)
	rand.Seed(time.Now().UnixNano())

	var wg sync.WaitGroup
	workers := runtime.GOMAXPROCS(-1) // one for each proc

	for i := 0; i < workers; i++ {
		work := n / workers
		begin := work * i
		end := begin + work

		if i == workers-1 {
			end += n % workers
		}

		wg.Add(1)
		go func() {
			defer wg.Done()

			for i := begin; i < end; i++ {
				list[i] = []int{
					rand.Intn(49),
					rand.Intn(49),
					rand.Intn(49),
					rand.Intn(49),
					rand.Intn(49),
					rand.Intn(49),
					rand.Intn(49),
				}
			}
		}()
	}

	wg.Wait()

	return list
}

Running the test suite with the race detector enabled checks that this version produces correct results and data races were found.

$ go test -race
PASS
ok  	concurrency-slower	28.174s

Since the race detector reduces performance, I don't use it when benchmarking.

$ go test -bench=. -count 10 | tee parallel.txt
goos: darwin
goarch: amd64
pkg: concurrency-slower
BenchmarkLottoNumbers-2   	       1	1240373864 ns/op
BenchmarkLottoNumbers-2   	       1	1285141343 ns/op
BenchmarkLottoNumbers-2   	       1	1248011574 ns/op
BenchmarkLottoNumbers-2   	       1	1229725417 ns/op
BenchmarkLottoNumbers-2   	       1	1241098623 ns/op
BenchmarkLottoNumbers-2   	       1	1241275875 ns/op
BenchmarkLottoNumbers-2   	       1	1235109691 ns/op
BenchmarkLottoNumbers-2   	       1	1234113522 ns/op
BenchmarkLottoNumbers-2   	       1	1237894941 ns/op
BenchmarkLottoNumbers-2   	       1	1226142132 ns/op
PASS
ok  	concurrency-slower	29.093s

To compare benchmark results and decide if the differences are significant (i.e., p <= 0.05), I use benchstat.

serial.txtparallel.txt
time/opdelta
LottoNumbers-21.15s ± 4%1.24s ± 1%+7.84%(p=0.000 n=10+9)
 

Even with my efforts to minimize overhead, the parallel version is slower. Go provides some tools to help figure out what is happening.

Reducing Blocking Operations

Not all blocking operations are bad. For example, wg.Wait blocks until all the workers have completed their work. What we don't want are unnecessary blocking operations. The block profile helps find blocking operations:

$ go test -blockprofile block.pprof
PASS
ok  	concurrency-slower	1.921s
$ go tool pprof -list=lottoNumbers ./concurrency-slower.test block.pprof
Total: 3.07s
ROUTINE ======================== concurrency-slower.lottoNumbers in /Users/alaster/src/pocketgophers.com/concurrency-slower/gopath/src/concurrency-slower/main.go
         0      1.30s (flat, cum) 42.31% of Total
         .          .     54:				}
         .          .     55:			}
         .          .     56:		}()
         .          .     57:	}
         .          .     58:
         .      1.30s     59:	wg.Wait()
         .          .     60:
         .          .     61:	return list
         .          .     62:}
ROUTINE ======================== concurrency-slower.lottoNumbers.func1 in /Users/alaster/src/pocketgophers.com/concurrency-slower/gopath/src/concurrency-slower/main.go
         0     3.75ms (flat, cum)  0.12% of Total
         .          .     42:		go func() {
         .          .     43:			defer wg.Done()
         .          .     44:
         .          .     45:			for i := begin; i < end; i++ {
         .          .     46:				list[i] = []int{
         .     1.54ms     47:					rand.Intn(49),
         .   287.72us     48:					rand.Intn(49),
         .      745us     49:					rand.Intn(49),
         .   137.91us     50:					rand.Intn(49),
         .   171.40us     51:					rand.Intn(49),
         .   655.17us     52:					rand.Intn(49),
         .   211.95us     53:					rand.Intn(49),
         .          .     54:				}
         .          .     55:			}
         .          .     56:		}()
         .          .     57:	}
         .          .     58:

Here we see that wg.Wait blocks as expected (and needed). However, the blocks by rand.Intn are not needed and therefore reduce efficiency.

The default Source is safe for concurrent use by multiple goroutines. - math/rand

The default Source is made safe by locking. This means that only one random number can be generated at a time. The solution is to use a separate source for each worker. To do this correctly, a little understanding of random number generators (RNGs) is needed.

Imagine a function, rand, that takes a seed and returns and infinite slice of random values. For deterministic RNGs like math/rand's, rand(seed)[n] == rand(seed)[n] is true for every integer n. This property is helpful in debugging and testing processes that depend on random number by making the RNG generate the same sequence of numbers each time.

This same property is not helpful when you want to make sure that different processes have different random numbers because many deterministic RNGs also have the property that rand(seed)[n+1] == rand(seed+1)[n] is true for every integer n. This property is helpful in that, with care, the same sequence of random numbers can be generated by many RNGs as from a single RNG. This would allow, for example, the list produced by the serial version of lottoNumbers to be directly compared to the list produced by a parallel version using separate generators.

However, the RNG implemented in math/rand “cooks” the seed first. The cooking process uses a list of random numbers embedded in the code (to keep the first property) to alter the seed before seeding the generator (i.e., rand(cook(seed))). While this makes generating different parts of the same sequence impractical, it makes getting different sequences easy. Instead of needing to know how many random numbers are to be used by each worker to avoid overlap (and therefore reduced randomness), the seeds only need to be different.

With this knowledge, I created a separate source for each worker, seeding them with the time and the number of the worker.

main.go
package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"sync"
	"time"
)

const (
	numberToGenerate = 1000000
)

func main() {
	start := time.Now()
	fmt.Println("Start")

	list := lottoNumbers(numberToGenerate)

	d := time.Now().Sub(start)
	fmt.Println("End ", len(list), " ", d.Seconds())
}

func lottoNumbers(n int) [][]int {
	list := make([][]int, n)
	seed := time.Now().UnixNano()
	var wg sync.WaitGroup
	workers := runtime.GOMAXPROCS(-1) // one for each proc

	for i := 0; i < workers; i++ {
		work := n / workers
		begin := work * i
		end := begin + work

		if i == workers-1 {
			end += n % workers
		}

		r := rand.New(rand.NewSource(seed + int64(i)))

		wg.Add(1)
		go func() {
			defer wg.Done()

			for i := begin; i < end; i++ {
				list[i] = []int{
					r.Intn(49),
					r.Intn(49),
					r.Intn(49),
					r.Intn(49),
					r.Intn(49),
					r.Intn(49),
					r.Intn(49),
				}
			}
		}()
	}

	wg.Wait()

	return list
}
$ go test -bench=. -benchmem -count 10 | tee non-blocking-rand.txt
goos: darwin
goarch: amd64
pkg: concurrency-slower
BenchmarkLottoNumbers-2   	       3	 364119147 ns/op	88013445 B/op	 1000006 allocs/op
BenchmarkLottoNumbers-2   	       3	 346998866 ns/op	88013914 B/op	 1000007 allocs/op
BenchmarkLottoNumbers-2   	       3	 350583465 ns/op	88013424 B/op	 1000006 allocs/op
BenchmarkLottoNumbers-2   	       3	 353711867 ns/op	88013498 B/op	 1000007 allocs/op
BenchmarkLottoNumbers-2   	       3	 348360381 ns/op	88013424 B/op	 1000006 allocs/op
BenchmarkLottoNumbers-2   	       3	 351494713 ns/op	88013424 B/op	 1000006 allocs/op
BenchmarkLottoNumbers-2   	       3	 348087193 ns/op	88013424 B/op	 1000006 allocs/op
BenchmarkLottoNumbers-2   	       3	 354166810 ns/op	88013424 B/op	 1000006 allocs/op
BenchmarkLottoNumbers-2   	       3	 350044345 ns/op	88013445 B/op	 1000006 allocs/op
BenchmarkLottoNumbers-2   	       3	 366790968 ns/op	88013701 B/op	 1000007 allocs/op
PASS
ok  	concurrency-slower	30.691s
serial.txtnon-blocking-rand.txt
time/opdelta
LottoNumbers-21.15s ± 4%0.35s ± 3%−69.32%(p=0.000 n=10+9)
 
alloc/opdelta
LottoNumbers-288.0MB ± 0%88.0MB ± 0%+0.01%(p=0.000 n=10+8)
 
allocs/opdelta
LottoNumbers-21.00M ± 0%1.00M ± 0%+0.00%(p=0.000 n=10+10)
 

The benchmark shows the decreased run time we expected in the first place.

Reducing Allocations

Another thing to notice is the 1.00M allocations per operation. Go allocates memory on the stack or the heap. Stack memory is managed at compile time. Heap memory, on the other hand, is managed at runtime. This includes deciding where on the heap to allocate the memory and garbage collection when that memory is no longer needed. It is because of this extra cost that -benchmem only lists heap allocations.

Go tries to keep everything on the stack, using escape analysis to figure out what must be on the heap. Compiling with -gcflags=-m displays the results of the escape analysis process:

$ go build -gcflags=-m
# concurrency-slower
./main.go:27:29: inlining call to time.time.Time.UnixNano
./main.go:27:29: inlining call to time.(*Time).time.unixSec
./main.go:27:29: inlining call to time.(*Time).time.sec
./main.go:27:29: inlining call to time.(*Time).time.nsec
./main.go:40:16: inlining call to rand.New
./main.go:22:47: inlining call to time.time.Duration.Seconds
./main.go:26:14: make([][]int, n) escapes to heap
./main.go:42:5: wg escapes to heap
./main.go:28:6: moved to heap: wg
./main.go:43:6: func literal escapes to heap
./main.go:43:6: func literal escapes to heap
./main.go:44:10: &wg escapes to heap
./main.go:40:16: &rand.Rand literal escapes to heap
./main.go:60:4: wg escapes to heap
./main.go:44:12: wg escapes to heap
./main.go:44:10: leaking closure reference wg
./main.go:47:20: []int literal escapes to heap
./main.go:27:29: lottoNumbers time.t·2 does not escape
./main.go:27:29: lottoNumbers time.t·2 does not escape
./main.go:17:14: "Start" escapes to heap
./main.go:22:14: "End " escapes to heap
./main.go:22:25: len(list) escapes to heap
./main.go:22:33: " " escapes to heap
./main.go:22:47: d.Seconds() escapes to heap
./main.go:17:13: main ... argument does not escape
./main.go:22:13: main ... argument does not escape

As with blocking operations, some variables need to be on the heap. The goal is to find the things that escape and rewrite the code so it does not need to.

Since one million is the same number as the number of lotto number to generate, most of the allocations must happen on:

list[i] = []int{
	r.Intn(49),
	r.Intn(49),
	r.Intn(49),
	r.Intn(49),
	r.Intn(49),
	r.Intn(49),
	r.Intn(49),
}
The escape analysis confirms this with []int literal escapes to heap.

A slice is a descriptor of an array segment. It consists of a pointer to the array, the length of the segment, and its capacity (the maximum length of the segment). - Go Slices: usage and internals

Making a new slice also makes the underlying array. Since the slices are required, we can instead remove the allocation of the underlying arrays by preallocating enough space for all the random numbers we need and then slicing that array into the lotto numbers.

main.go
package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"sync"
	"time"
)

const (
	numberToGenerate = 1000000
)

func main() {
	start := time.Now()
	fmt.Println("Start")

	list := lottoNumbers(numberToGenerate)

	d := time.Now().Sub(start)
	fmt.Println("End ", len(list), " ", d.Seconds())
}

func lottoNumbers(n int) [][]int {
	total := 7 * n
	all := make([]int, total)
	list := make([][]int, n)
	seed := time.Now().UnixNano()
	var wg sync.WaitGroup
	workers := runtime.GOMAXPROCS(-1) // one for each proc

	for i := 0; i < workers; i++ {
		work := total / workers
		begin := work * i
		end := begin + work

		if i == workers-1 {
			end += n % workers
		}

		r := rand.New(rand.NewSource(seed + int64(i)))

		wg.Add(1)
		go func() {
			defer wg.Done()

			for i := begin; i < end; i++ {
				all[i] = r.Intn(49)
			}
		}()
	}

	wg.Wait()

	for i := range list {
		list[i] = all[i : i+7]
	}

	return list
}

The []int literal no longer escapes to heap and, as a nice side benefit, the repetition of r.Intn(49) is removed.

The next thing to deal with is the WaitGroup. It needs to be on the heap because it is accessed from different goroutines. What doesn't need to be on the heap are the many references to it. This can be fixed by initially getting a pointer to it.

The function and &rand.Rand literal escapes can be removed by moving the creation of r into the closer and then converting the closure into a separate function.

main.go
package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"sync"
	"time"
)

const (
	numberToGenerate = 1000000
)

func main() {
	start := time.Now()
	fmt.Println("Start")

	list := lottoNumbers(numberToGenerate)

	d := time.Now().Sub(start)
	fmt.Println("End ", len(list), " ", d.Seconds())
}

func lottoNumbers(n int) [][]int {
	total := 7 * n
	all := make([]int, total)
	list := make([][]int, n)
	seed := time.Now().UnixNano()
	wg := &sync.WaitGroup{}
	workers := runtime.GOMAXPROCS(-1) // one for each proc

	for i := 0; i < workers; i++ {
		work := total / workers
		begin := work * i
		end := begin + work

		if i == workers-1 {
			end += n % workers
		}

		wg.Add(1)
		go lottoNumbersWorker(wg, seed+int64(i), all, begin, end)
	}

	wg.Wait()

	for i := range list {
		list[i] = all[i : i+7]
	}

	return list
}

func lottoNumbersWorker(wg *sync.WaitGroup, seed int64, all []int, begin, end int) {
	defer wg.Done()

	r := rand.New(rand.NewSource(seed))

	for i := begin; i < end; i++ {
		all[i] = r.Intn(49)
	}
}
$ go build -gcflags=-m
# concurrency-slower
./main.go:58:15: inlining call to rand.New
./main.go:29:29: inlining call to time.time.Time.UnixNano
./main.go:29:29: inlining call to time.(*Time).time.unixSec
./main.go:29:29: inlining call to time.(*Time).time.sec
./main.go:29:29: inlining call to time.(*Time).time.nsec
./main.go:22:47: inlining call to time.time.Duration.Seconds
./main.go:55:79: leaking param: wg
./main.go:55:79: lottoNumbersWorker all does not escape
./main.go:58:15: lottoNumbersWorker &rand.Rand literal does not escape
./main.go:27:13: make([]int, total) escapes to heap
./main.go:28:14: make([][]int, n) escapes to heap
./main.go:30:24: &sync.WaitGroup literal escapes to heap
./main.go:29:29: lottoNumbers time.t·2 does not escape
./main.go:29:29: lottoNumbers time.t·2 does not escape
./main.go:17:14: "Start" escapes to heap
./main.go:22:14: "End " escapes to heap
./main.go:22:25: len(list) escapes to heap
./main.go:22:33: " " escapes to heap
./main.go:22:47: d.Seconds() escapes to heap
./main.go:17:13: main ... argument does not escape
./main.go:22:13: main ... argument does not escape

The remaining escapes are required:

make([]int, total)
all, accessed by concurrent lottoNumberWorkers
make([][]int, n)
list, accessed by calling function after return
&sync.WaitGroup literal
wg, accessed by concurrent lottoNumberWorkers

Now to see the fruits of our labor

$ go test -bench=. -count 10 -benchmem| tee reduce-allocs.txt
goos: darwin
goarch: amd64
pkg: concurrency-slower
BenchmarkLottoNumbers-2   	       5	 284349234 ns/op	80013859 B/op	       5 allocs/op
BenchmarkLottoNumbers-2   	       5	 288585025 ns/op	80013955 B/op	       5 allocs/op
BenchmarkLottoNumbers-2   	       5	 284495151 ns/op	80013980 B/op	       5 allocs/op
BenchmarkLottoNumbers-2   	       5	 274063930 ns/op	80013865 B/op	       5 allocs/op
BenchmarkLottoNumbers-2   	       5	 275906121 ns/op	80013840 B/op	       5 allocs/op
BenchmarkLottoNumbers-2   	       5	 288819360 ns/op	80013840 B/op	       5 allocs/op
BenchmarkLottoNumbers-2   	       5	 276191900 ns/op	80013840 B/op	       5 allocs/op
BenchmarkLottoNumbers-2   	       5	 274710258 ns/op	80013929 B/op	       5 allocs/op
BenchmarkLottoNumbers-2   	       5	 274138641 ns/op	80013859 B/op	       5 allocs/op
BenchmarkLottoNumbers-2   	       5	 273619199 ns/op	80013852 B/op	       5 allocs/op
PASS
ok  	concurrency-slower	32.477s
non-blocking-rand.txtreduce-allocs.txt
time/opdelta
LottoNumbers-2352ms ± 3%279ms ± 3%−20.59%(p=0.000 n=9+10)
 
alloc/opdelta
LottoNumbers-288.0MB ± 0%80.0MB ± 0%−9.09%(p=0.000 n=8+10)
 
allocs/opdelta
LottoNumbers-21.00M ± 0%0.00M ± 0%−100.00%(p=0.000 n=10+10)
 

Success

Since you've made it this far you have seen, or even tried yourself, the following concepts and tools:

Now instead wondering why your concurrent performance is slower, you have some tools to figure why things got slower and some ideas on how to fix it. Instead of the broken slow implementation we started with, we now have a correct, fast implementation:

serial.txtreduce-allocs.txt
time/opdelta
LottoNumbers-21.15s ± 4%0.28s ± 3%−75.64%(p=0.000 n=10+10)
 
alloc/opdelta
LottoNumbers-288.0MB ± 0%80.0MB ± 0%−9.08%(p=0.000 n=10+10)
 
allocs/opdelta
LottoNumbers-21.00M ± 0%0.00M ± 0%−100.00%(p=0.000 n=10+10)
 

Dig into the Fundamentals of Go

Subscribe to receive a weekly email covering a Go fundamental. Be it the language, its tooling, or its packages, you will learn what you need to know.