The
Go language tutorial
describes a
concurrent prime number sieve, but also noted that it is not the most
computationally efficient method. In fact, the algorithm is so bad that
it is even slower than testing every natural numbers using trial
division, which tries to divides each
n by all primes
p ≤
sqrt(n). In contrast, the described sieve trial-divides each
n
by all primes
p ≤ n.
Understandably, the point of the language tutorial is to introduce Go's concurrency constructs and not the algorithmic efficiency of a particular example. But just right after seeing that example, I thought it would be interesting to write an efficient prime sieve using Go's channels.
The simplest, yet effective, prime number sieve was discovered 2000 years
ago by Eratosthenes. The algorithm starts off by eliminating all
multiples of some known primes (usually
2), then register the next
number that has not been eliminated as a prime. Then we eliminate all
multiples of this newly-found prime in order to find the next one, and so
forth.
It is efficient since we do not need to do any division, and that there is little duplicate work: the number of times that we eliminate an already-eliminated candidate is the number of its distinct prime factors, which is very small for most composities and exactly equals zero for primes. In a sense, we get prime numbers for free.
The sieve of Eratosthenes is traditionally implemented using bitvector
with a fixed upper bound. Using Go, it is possible to incrementally
generate a channel containing an infinite stream of primes in the style of
the original example. We will use three channels
candidates,
primes,
composites and two main concurrent goroutines: one
receives a value from the
primes channel, generates all its
multiples then merges them into the
composites channel; the other
goroutine eliminates those
composites values from a stream of
candidates and sends results to
primes.
Here is the flow chart of our system:
Each channel of multiples is generated by a goroutine, and so is the channel of candidates.
The main problem with this design is: how do we merge an infinite number of channels, each containing an infinite stream of prime multiples? This is impossible in general, but not in our specific case since the heads of the channels of multiples form an increasing sequence and multiples of each prime are generated in increasing order as well.
The solution, as described by this paper, is to use a heap to do the merging by need. Since we are using Go's channels instead of cons'ed lists, we have to take an extra step to peek into its head value. But this is easy enough:
type PeekCh struct {
head int
ch chan int
}
We maintain a heap of
PeekCh sorting by head values, and a
channel (denoted
m) which contains multiples of the next-in-line
sieving prime. There are three possible cases:
-
The heap minimum is
smaller than the head of
m: we pop the heap, send outPeekCh.head, updatePeekCh.head = ←PeekCh.ch, then push back the modifiedPeekChstructure. -
The heap minimum is
equal to the head of
m: we do almost the same steps as above except for sending out the head value (so that duplicates are ignored). -
The heap minimum is
greater than the head of
m: we send out this head value ofm, push&PeehCh{←m, m}into the heap, then updatemto point to a channel of multiples of the next known prime.
The following code segment is the inner loop of the merging goroutine that
sends values to the
composites channel:
for {
m := multiples(←primes)
head := ←m
for min < head {
composites ← min
minchan := heap.Pop(h).(*PeekCh)
min = minchan.head
minchan.head = ←minchan.ch
heap.Push(h, minchan)
}
for min == head {
minchan := heap.Pop(h).(*PeekCh)
min = minchan.head
minchan.head = ←minchan.ch
heap.Push(h, minchan)
}
composites ← head
heap.Push(h, &PeekCh{←m, m})
}
Having that, the writing the sieving goroutine is quite easy:
primes ← 2
p := ←candidates
for {
c := ←composites
for p < c {
primes ← p
out ← p
p = ←candidates
}
if p == c {
p = ←candidates
}
}
Before skipping on, I suggest you take a look at the diagram again and make sure that you understand what is going on.
Note that the everytime we find a prime
p, we send it to two
channels:
primes and
out. The former is a feedback loop
inside the system, and the latter is for an external goroutine to observe
the produced stream of primes. The feedback loop can actually cause a
deadlock if the buffer size for the
primes channel is too small.
If the sieving goroutine blocks sending to it, the merging goroutine will
also block eventually since no one is receiving from the channel it sends
to.
Now, the Go language requires that a channel must have a definite buffer
size, declared at allocation time, which means we can not really generate
an infinite stream of primes using this implementation, but only upto a
limit
n. This is a bit disappointing, as the algorithm does
not impose such upper limit to be specified upfront. The problem with
trying to write a program that generate an infinite stream is that sooner
or later one must face the cruel finitude of computer memory :(
Anyway, we must solve the deadlock problem, in one way or another: either
by writing our sieve such that it can produce just a few more than
n primes without deadlock, or by allocating the channel with a
humongous buffer -- big enough to hold all primes that fit into an
int32.
In both cases, we need to estimate the minimum buffer size for the
primes channel such that the system can produce
n prime
numbers without deadlock. It is not difficult. The first
n
primes is sieved from multiples of
primes ≤ sqrt(nth prime). This
means the merging goroutine will only need to pull out
sqrt(nth
prime) values from the feedback loop. The sieving goroutine will
obviously send
n values to it. The minimum buffer size that will
suffice is:
n - (number of primes ≤ sqrt(nth prime))
However, it turns out that the second term (let's call it
p(n))
is always tiny compared to the first (for instance,
p(1000000) =
546). Without having to estimate the second term, the first solution is
preferable: setting the buffer to
n will guarantee that we will be
able to generate
n primes without deadlock. Interestingly,
p(n) + 2 is the number of goroutines that need to be spawned to
generate
n prime numbers.
We can also utilize a well-known optimization called
wheel
factorization. The name is a bit misleading, but the idea is that you
simply pre-screen all multiples of certain small primes. For example, you
could easily screen out all even numbers. Pre-screening multiples of 2
and 3 means that we only consider numbers that are
1 (mod 6) or
5 (mod 6). The
actual
program only considers numbers coprime to 2, 3, 5 and 7, using a wheel
with modulus 210.
Finally, it's time to compare some timings! As expected, this implementation completely smokes the naive sieve off the earth:
$ time ./sieve1 100000 | wc -l 9592 real 1m33.599s $ time ./sieve2 100000 | wc -l 9592 real 0m0.436s
What about parallel execution? If you take a moment and look at the diagram yet again, you will find that the synchrony of this system is quite breath-taking! This unfortunately also means that while being concurrent, it will not benefit from parallel execution (and neither will the naive sieve). Each goroutine will execute only a few instructions before it needs to synchronize with others by virtue of sending to or receiving from a channel. When they are run in parallel, it will often be the case that one thread needs to wait for the other. This is the result on my Linux box:
$ time ./sieve2 -n 1000000 15485863 real 0m12.852s user 0m12.629s sys 0m0.203s $ time ./sieve2 -ncpu 2 -n 1000000 15485863 real 0m16.763s user 0m17.975s sys 0m8.696s
Note the system time there. My guess is that the operating system now has to manage the threads' sleep and wakeup cycles, which involves more expensive operations than when goroutines are simply managed by the Go scheduler in a single thread. Clearly, for the sieve to be able to utilize multiple CPUs, a different approach must be used.
Writing concurrent programs isn't all about speed, however. Russ Cox, a Go author, says this while describing CSP:
Concurrent programming in this style is interesting for reasons not of efficiency but of clarity. That is, it is a widespread mistake to think only of concurrent programming as a means to increase performance, e.g., to overlap disk I/O requests, to reduce latency by prefetching results to expected queries, or to take advantage of multiple processors. Such advantages are important but not relevant to this discussion. After all, they can be realized in other styles, such as asynchronous event-driven programming. Instead, we are interested in concurrent programming because it provides a natural abstraction that can make some programs much simpler.
Does it make for a simple to understand implementation of the sieve of Eratosthenes? I'd let you be the judge of that.
A final, nonetheless illuminating, comparison is with the Haskell version from the mentioned paper as they both use the same algorithm and the heap data structure to build a list of primes online. Haskell's implicit lazy evaluation, while remarkably expressive, makes it harder to see what is actually going on, whereas in Go we write out the evaluation plan explicitly. Also, comparing the timings of the two programs will give us a vague idea about the overhead of scheduling goroutines and synchronizing channels versus straight-up but lazily-evaluated linked-list. This turns out to be about a factor of 2:
$ time ./sieve2 -n 10000000 179424673 real 2m58.758s $ time ./ONeillPrimesTest 10000000 (10000000,179424673) real 1m33.626s
Obviously this last comparison should be taken with a grain of salt as the two programs are not the same. The Go authors have also said that the scheduler is still not quite optimized, and will change in the future.
The source code is available at github. ■
Previous Post Next Post