## Implementing the sieve of Eratosthenes using channels in Go

The Go language tutorial describes a concurrent prime number sieve, but also noted that it is not the most computationally efficient method. In fact, the algorithm is so bad that it is even slower than testing every natural numbers using trial division, which tries to divides each `n` by all primes ```p ≤ sqrt(n)```. In contrast, the described sieve trial-divides each `n` by all primes `p ≤ n`.

Understandably, the point of the language tutorial is to introduce Go's concurrency constructs and not the algorithmic efficiency of a particular example. But just right after seeing that example, I thought it would be interesting to write an efficient prime sieve using Go's channels.

The simplest, yet effective, prime number sieve was discovered 2000 years ago by Eratosthenes. The algorithm starts off by eliminating all multiples of some known primes (usually `2`), then register the next number that has not been eliminated as a prime. Then we eliminate all multiples of this newly-found prime in order to find the next one, and so forth.

It is efficient since we do not need to do any division, and that there is little duplicate work: the number of times that we eliminate an already-eliminated candidate is the number of its distinct prime factors, which is very small for most composities and exactly equals zero for primes. In a sense, we get prime numbers for free.

The sieve of Eratosthenes is traditionally implemented using bitvector with a fixed upper bound. Using Go, it is possible to incrementally generate a channel containing an infinite stream of primes in the style of the original example. We will use three channels `candidates`, `primes`, `composites` and two main concurrent goroutines: one receives a value from the `primes` channel, generates all its multiples then merges them into the `composites` channel; the other goroutine eliminates those `composites` values from a stream of `candidates` and sends results to `primes`.

Here is the flow chart of our system:

Each channel of multiples is generated by a goroutine, and so is the channel of candidates.

The main problem with this design is: how do we merge an infinite number of channels, each containing an infinite stream of prime multiples? This is impossible in general, but not in our specific case since the heads of the channels of multiples form an increasing sequence and multiples of each prime are generated in increasing order as well.

The solution, as described by this paper, is to use a heap to do the merging by need. Since we are using Go's channels instead of cons'ed lists, we have to take an extra step to peek into its head value. But this is easy enough:

```type PeekCh struct {
ch chan int
}
```

We maintain a heap of `PeekCh` sorting by head values, and a channel (denoted `m`) which contains multiples of the next-in-line sieving prime. There are three possible cases:

• The heap minimum is smaller than the head of `m`: we pop the heap, send out `PeekCh.head`, update `PeekCh.head = ←PeekCh.ch`, then push back the modified `PeekCh` structure.
• The heap minimum is equal to the head of `m`: we do almost the same steps as above except for sending out the head value (so that duplicates are ignored).
• The heap minimum is greater than the head of `m`: we send out this head value of `m`, push `&PeehCh{←m, m}` into the heap, then update `m` to point to a channel of multiples of the next known prime.

The following code segment is the inner loop of the merging goroutine that sends values to the `composites` channel:

```for {
m := multiples(←primes)
composites ← min
minchan := heap.Pop(h).(*PeekCh)
heap.Push(h, minchan)
}
minchan := heap.Pop(h).(*PeekCh)
heap.Push(h, minchan)
}
heap.Push(h, &PeekCh{←m, m})
}
```

Having that, the writing the sieving goroutine is quite easy:

```primes ← 2
p := ←candidates
for {
c := ←composites
for p < c {
primes ← p
out ← p
p = ←candidates
}
if p == c {
p = ←candidates
}
}
```

Before skipping on, I suggest you take a look at the diagram again and make sure that you understand what is going on.

Note that the everytime we find a prime `p`, we send it to two channels: `primes` and `out`. The former is a feedback loop inside the system, and the latter is for an external goroutine to observe the produced stream of primes. The feedback loop can actually cause a deadlock if the buffer size for the `primes` channel is too small. If the sieving goroutine blocks sending to it, the merging goroutine will also block eventually since no one is receiving from the channel it sends to.

Now, the Go language requires that a channel must have a definite buffer size, declared at allocation time, which means we can not really generate an infinite stream of primes using this implementation, but only upto a limit `n`. This is a bit disappointing, as the algorithm does not impose such upper limit to be specified upfront. The problem with trying to write a program that generate an infinite stream is that sooner or later one must face the cruel finitude of computer memory :(

Anyway, we must solve the deadlock problem, in one way or another: either by writing our sieve such that it can produce just a few more than `n` primes without deadlock, or by allocating the channel with a humongous buffer -- big enough to hold all primes that fit into an `int32`.

In both cases, we need to estimate the minimum buffer size for the `primes` channel such that the system can produce `n` prime numbers without deadlock. It is not difficult. The first ```n primes``` is sieved from multiples of `primes ≤ sqrt(nth prime)`. This means the merging goroutine will only need to pull out ```sqrt(nth prime)``` values from the feedback loop. The sieving goroutine will obviously send `n` values to it. The minimum buffer size that will suffice is:

`n - (number of primes ≤ sqrt(nth prime))`

However, it turns out that the second term (let's call it `p(n)`) is always tiny compared to the first (for instance, ```p(1000000) = 546```). Without having to estimate the second term, the first solution is preferable: setting the buffer to `n` will guarantee that we will be able to generate `n` primes without deadlock. Interestingly, `p(n) + 2` is the number of goroutines that need to be spawned to generate `n` prime numbers.

We can also utilize a well-known optimization called wheel factorization. The name is a bit misleading, but the idea is that you simply pre-screen all multiples of certain small primes. For example, you could easily screen out all even numbers. Pre-screening multiples of 2 and 3 means that we only consider numbers that are `1 (mod 6)` or `5 (mod 6)`. The actual program only considers numbers coprime to 2, 3, 5 and 7, using a wheel with modulus 210.

Finally, it's time to compare some timings! As expected, this implementation completely smokes the naive sieve off the earth:

```\$ time ./sieve1 100000 | wc -l
9592

real    1m33.599s

\$ time ./sieve2 100000 | wc -l
9592

real    0m0.436s
```

What about parallel execution? If you take a moment and look at the diagram yet again, you will find that the synchrony of this system is quite breath-taking! This unfortunately also means that while being concurrent, it will not benefit from parallel execution (and neither will the naive sieve). Each goroutine will execute only a few instructions before it needs to synchronize with others by virtue of sending to or receiving from a channel. When they are run in parallel, it will often be the case that one thread needs to wait for the other. This is the result on my Linux box:

```\$ time ./sieve2 -n 1000000
15485863

real    0m12.852s
user    0m12.629s
sys     0m0.203s

\$ time ./sieve2 -ncpu 2 -n 1000000
15485863

real    0m16.763s
user    0m17.975s
sys     0m8.696s
```

Note the system time there. My guess is that the operating system now has to manage the threads' sleep and wakeup cycles, which involves more expensive operations than when goroutines are simply managed by the Go scheduler in a single thread. Clearly, for the sieve to be able to utilize multiple CPUs, a different approach must be used.

Writing concurrent programs isn't all about speed, however. Russ Cox, a Go author, says this while describing CSP:

Concurrent programming in this style is interesting for reasons not of efficiency but of clarity. That is, it is a widespread mistake to think only of concurrent programming as a means to increase performance, e.g., to overlap disk I/O requests, to reduce latency by prefetching results to expected queries, or to take advantage of multiple processors. Such advantages are important but not relevant to this discussion. After all, they can be realized in other styles, such as asynchronous event-driven programming. Instead, we are interested in concurrent programming because it provides a natural abstraction that can make some programs much simpler.

Does it make for a simple to understand implementation of the sieve of Eratosthenes? I'd let you be the judge of that.

A final, nonetheless illuminating, comparison is with the Haskell version from the mentioned paper as they both use the same algorithm and the heap data structure to build a list of primes online. Haskell's implicit lazy evaluation, while remarkably expressive, makes it harder to see what is actually going on, whereas in Go we write out the evaluation plan explicitly. Also, comparing the timings of the two programs will give us a vague idea about the overhead of scheduling goroutines and synchronizing channels versus straight-up but lazily-evaluated linked-list. This turns out to be about a factor of 2:

```\$ time ./sieve2 -n 10000000
179424673

real    2m58.758s

\$ time ./ONeillPrimesTest 10000000
(10000000,179424673)

real    1m33.626s
```

Obviously this last comparison should be taken with a grain of salt as the two programs are not the same. The Go authors have also said that the scheduler is still not quite optimized, and will change in the future.

The source code is available at github. ■