
After reading this post on HackerNews, I also decided to see how efficient of a code I could write that would compute some result of all primes that fit in 32 bits. I actually thought of doing it for 64 bits at first, since that would have been really interesting, but I soˆon found out that there might be a bit too many primes at that size. But, , so we can generate up to the unsigned integer limit. In addition, I’ve decided not to write to a file (unlike the post) or even write to standard out, instead I’ve decided that I would calculate the XORs of all primes in that range, which removes any overhead and allows us to actually get the time it takes to generate the primes.
With my experience with competitive programming, I knew that I should be writing a sieve to calculate the result. You can read more at Algorithms for Competitive Programming or at the blog post linked previously. The basic idea for sieve is that instead of checking if a number has divisors, we let prime numbers “announce” their multiples as composite.
A number is prime, if none of the smaller prime numbers divides it. Since we iterate over the prime numbers in order, we already marked all numbers, which are divisible by at least one of the prime numbers, as divisible. Hence if we reach a cell and it is not marked, then it isn’t divisible by any smaller prime number and therefore has to be prime.
We know that the primes have sparsity of , or in other words . Using this fact we can prove that the running time complexity of the algorithm is .1 I wrote a quick script in C that implements this algorithm.
1#include <stdio.h>2#include <math.h>34#define u32 unsigned int5#define u64 unsigned long long6#define N (u64) 429496729678// different than usual sieve, where 0 and 1s are flipped9u32 is_not_prime[N / 32]; // 32 = 8 * sizeof(u32)1011int main() {12u64 result = 0;13is_not_prime[0] = 3; // set "0" and "1" as not primes14for (u64 i = 0; i < N; i += 32) {15for (u64 bit = (i == 0 ? 2 : 0); bit < 32; bit++) {16if ((is_not_prime[i / 32] & ((u64) 1 << bit)) || (u64) i * i >= N) continue;17result ^= i + bit;18for (u64 j = (i + bit) * (i + bit); j < N; j += i + bit) {19is_not_prime[j / 32] |= ((u64) 1 << (j % 32));20}21}22}23printf("%llu", result);24return 0;25}
Not the cleanest (yet), but it gives us an output in 23 seconds.2 Here, instead of creating a byte for 4294967295 values, we’re using a bit array, which is why the code is a bit ugly here. In bit arrays, I have a single bit for each of the 4294967295 indices I have. If I were to use a bool is_not_prime[] array, I would need to store eight times more data as modern processors architectures work much more efficiently with bytes than with bits since they usually cannot access bits directly. So what we do is, we store the bits in a large continuous memory, access the memory in blocks of a few bytes, and extract/set the bits with bit operations like bit masking and bit shifting.
Another thing here is that I’m creating an array is_not_prime instead of the usual is_prime as I’m utilising the fact that the programme will start that global variable as zero-initialised, otherwise I would’ve needed to memset it, which would be a small overhead.
Continuing on, I’ve had a couple other implementations (just small improvements/optimisations), before the next one we will talk:
- Instead of also checking for
N(which we know isn’t prime), we can go up untili < N. - Since we know the value of
N, we can write#define SQN (u32) 65535and checki > SQNinstead ofi * i >= N. - Instead of doing divisions and modulars by powers of two, we can use shifts and bitwise ands.
j & 31forj % 32andi >> 5fori / 32. Lucky for us, the compiler already did those optimisations, removing the need for us to do them. You can verify this on Compiler Explorer for yourself too! - I’ve tried using
u64instead ofu32foris_not_prime, but that doesn’t get me much improvement either, since it’s still the same amount of memory that’s being used. - One point of contention I’ve found is this check:
(is_not_prime[i / 32] & ((u64) 1 << bit)) || (u64) i * i >= N. We know that the compiler optimises these kinds of checks (i.e., if the first predicate is true, since it’s an OR, it just makes the if true, without checking the other predicate). So, we can optimise it further by seeing which one has higher amount of truthiness. And:usqn = 4294901760; up = 4091687075. Thus, by switching the two predicates we get down to 21 seconds. - Using
-O3in the compilation step gets us down to 20 seconds. - Since we have registers of size of 64 bits, we can just use
u64s for pretty much everything.
Thus, we get the following piece of code.
1#include <stdio.h>23#define u64 unsigned long long4#define N (u64) 42949672955#define SQN (u64) 6553567// different than usual sieve, where 0 and 1s are flipped8u64 is_not_prime[N / 64 + 1]; // 64 = 8 * sizeof(u64)910int main() {11u64 result = 0;12is_not_prime[0] = 3; // set "0" and "1" as not primes13for (u64 i = 0; i < N; i++) {14u64 bit = i % 64;15if (i > SQN || (is_not_prime[i / 64] & (1llu << bit))) continue;16result ^= i;17for (u64 j = i * i; j <= N; j += i) {18is_not_prime[j / 64] |= (1llu << (j % 64));19}20}21printf("%llu\n", result);22return 0;23}
SQRT
This is where things got pretty interesting. Now, I was checking the assembly of the code in Compiler Explorer, when I realised that the for loop was looping until i <= 65535, but in the code it should’ve been up to i < N.
Here, the compiler is actually pretty smart in understanding the next if block we have checks if i > SQN, and continues otherwise; thus, the compiler can be sure that the for loop only runs until SQN. Which, in either case, would’ve been our next optimisation. Obviously, to find all the prime numbers until , it will be enough just to perform the sifting only by the prime numbers, which do not exceed the root of .
Such optimisation doesn’t affect the complexity (indeed, by repeating the proof presented above we’ll get the evaluation , which is asymptotically the same according to the properties of logarithms), though the number of operations will reduce noticeably.
However, the issue with my code was that I wasn’t actually computing the correct result of XORs; because, as said, the loop would go up to SQN, and not the full way, so the operation result ^= i would not be run for majority of primes. Adding the correct for loop to actually compute the result we get the actual value in 20-ish seconds.
1#include <stdio.h>23#define u64 unsigned long long4#define N (u64) 42949672955#define SQN (u64) 6553567// different than usual sieve, where 0 and 1s are flipped8u64 is_not_prime[N / 64 + 1]; // 64 = 8 * sizeof(u64)910int main() {11u64 result = 0;12is_not_prime[0] = 3; // set "0" and "1" as not primes13for (u64 i = 2; i <= SQN; i++) {14if (is_not_prime[i / 64] & (1llu << (i % 64))) continue;15for (u64 j = i * i; j <= N; j += i) {16is_not_prime[j / 64] |= (1llu << (j % 64));17}18}19for (u64 i = 2; i <= N; i++) {20if (!(is_not_prime[i / 64] & (1llu << (i % 64)))) result ^= i;21}22printf("%llu\n", result);23return 0;24}
1gcc main.c -o main -O3 && time ./main2632302583./main 20.37s user 0.11s system 99% cpu 20.493 total
Two is an interesting number…
If you are you, you would know that two is the only even prime. And any even number is not prime, by definition. Therefore, by using this fact, we could halve our memory and time usage entirely by literally stopping checking even numbers at all. After implementing this, with some more C-like syntax with generics macros, we get the following piece of code.
1#include <stdio.h>23#define u64 unsigned long long4#define N (u64) 42949672955#define SQN (u64) 6553567// only odd numbers are stored8// bit for n is at index n >> 1.9#define WORD_INDEX(n) ((n) >> 7) // (n >> 1) / 6410#define BIT_MASK(n) (1ULL << (((n) >> 1) & 63))1112u64 is_not_prime[((N >> 1) / 64) + 1];1314int main() {15u64 result = 2; // handle the only even prime separately16is_not_prime[0] = 1;1718for (u64 i = 3; i <= SQN; i += 2) {19if (is_not_prime[WORD_INDEX(i)] & BIT_MASK(i)) continue;20// mark only odd multiples: i*i, i*i + 2*i, i*i + 4*i, ...21for (u64 j = i * i; j <= N; j += (i << 1)) {22is_not_prime[WORD_INDEX(j)] |= BIT_MASK(j);23}24}2526for (u64 i = 3; i <= N; i += 2) {27if (!(is_not_prime[WORD_INDEX(i)] & BIT_MASK(i))) {28result ^= i;29}30}3132printf("%llu\n", result);33return 0;34}
Which gives us an amazing result of 8.4 seconds:
1gcc main.c -o mainc -O3 && time ./mainc2632302583./mainc 8.31s user 0.07s system 99% cpu 8.400 total
This, I find, quite interesting actually. We have lowered our total time that it takes by more than half. We are nearly halving the number of operations that we are doing, which would’ve gotten us down to 10 seconds but with the less memory we are using, we even get around a 1.5 second boost too! Which gets us to our next point:
Let’s optimize the cache
It follows from the optimisation “sieving till root” that there is no need to keep the whole array is_prime[1, ..., n] at all times. For sieving, it is enough to just keep the prime numbers until the root of , i.e., prime[1, ..., sqrt(n)], split the complete range into blocks, and sieve each block separately.
Let be a constant which determines the size of the block, then we have blocks altogether, and the block contains the numbers in a segment . We can work on blocks by turns; i.e., for every block , we will go through all of the prime numbers (from to ) and perform sieving using them. It is worth noting that we have to modify the strategy a little bit when handling the first numbers:
- All of the prime numbers from shouldn’t remove themselves.
- The numbers and should be marked as non-prime numbers.
- While working on the last block it should not be forgotten that the last needed number is not necessarily located at the end of the block.
As discussed previously, the typical implementation of the Sieve of Eratosthenes is limited by the speed how fast you can load data into the CPU caches. By splitting the range of potential prime numbers into smaller blocks, we never have to keep multiple blocks in memory at the same time, and all operations are much more cache-friendlier.
To implement this strategy, I’m going back to my roots and I will use C++, since it has std::vector and that also std::vector<bool> has a type override as it is infact a bit array (just like a std::bitset).3
1#include <vector>2#include <iostream>3#include <algorithm>4#include <cmath>56using u64 = unsigned long long;7constexpr u64 N = 4294967295;8constexpr u64 S = 128 * 1024 * 8;9constexpr u64 SQN = 65535;1011int main() {12u64 result = 0;13std::vector<u64> primes;14std::vector<bool> is_prime(N + 1, true);1516for (u64 i = 2; i <= SQN; i++) {17if (!is_prime[i]) continue;18primes.emplace_back(i);19for (u64 j = i * i; j <= SQN; j += i) {20is_prime[j] = false;21}22}2324std::vector<bool> block(S, true);2526for (int k = 0; k * S <= N; k++) {27std::fill(block.begin(), block.end(), true);28u64 start = k * S;29for (u64 p : primes) {30u64 start_index = (start + p - 1) / p;31u64 j = std::max(start_index, p) * p - start;32for (; j < S; j += p) block[j] = false;33}34if (k == 0) block[0] = block[1] = false;35for (u64 i = 0; i < S && start + i <= N; i++)36if (block[i]) result ^= k * S + i;37}3839std::cout << result << '\n';40return 0;41}
With that code, we get the following result.
1g++ main.cpp -std=c++23 -o main -O3 -Wall -Wextra && time ./main2632302583./main 12.46s user 0.06s system 99% cpu 12.553 total
Which, worse than what we had odd-optimised sieve, so once we apply the odd numbered optimisation:
1#include <algorithm>2#include <cmath>3#include <iostream>4#include <vector>56using u64 = unsigned long long;7constexpr u64 N = 4294967295;8constexpr u64 S = 128 * 1024 * 8 * 2;9constexpr u64 SQN = 65535;1011int main() {12u64 result = 2; // we know two is prime13std::vector<u64> primes;1415// odd n maps to index n / 216std::vector<bool> is_prime(SQN / 2 + 1, true);17is_prime[0] = false; // 1 is not prime1819for (u64 i = 3; i <= SQN; i += 2) {20if (!is_prime[i / 2]) continue;21primes.emplace_back(i);22for (u64 j = i * i; j <= SQN; j += 2 * i) {23is_prime[j / 2] = false;24}25}2627// block[i] represents start + 2 * i + 128std::vector<bool> block(S / 2, true);2930for (u64 k = 0; k * S <= N; k++) {31std::fill(block.begin(), block.end(), true);32u64 start = k * S;33for (u64 p : primes) {34u64 start_index = (start + p - 1) / p;35u64 j = std::max(start_index, p) * p;3637// skip even multiples; we only store odd numbers.38if ((j & 1) == 0) j += p;3940j -= start;41for (; j < S; j += 2 * p) block[j / 2] = false;42}4344if (k == 0) block[0] = false;45for (u64 i = 0; i < S / 2 && start + 2 * i + 1 <= N; i++) {46if (block[i]) result ^= start + 2 * i + 1;47}48}4950std::cout << result << '\n';51return 0;52}
We get the following result:
1g++ main.cpp -std=c++23 -o main -O3 -Wall -Wextra && time ./main2632302583./main 4.86s user 0.03s system 99% cpu 4.889 total
Which is really good!
One thing that careful readers might notice is that I chose those interesting (and seemingly random) numbers for the block size . Why?
The reason is that, I am using a chip that has an L1 (data) cache size of 128KB = 128 * 1024 bytes, and since we’re using a bit array, we are able to store 128 * 1024 * 8 indices, which allows us to store the entirety of a block in the first level cache. Even more careful readers might notice that the odd-optimised implementation has a * 2 attached to the end. The reason is that the blocks have size , and so I can afford to increase the size of the blocks even more!
As we have seen, using a bit array allows us to have less memory overhead (of eight times!). But, that comes with quite a considerable overhead in terms of the computation that we need to do (that the compiler abstracts it away from us, but we saw what needs to be done in the C versions of our code). However, for how we did it with our first iterations of the bit arrays, we were actually limited by how fast we were loading data to the cache, since the data that we accessed was sparse (not segmentd/blocked, and that jumped all around the is_prime array), and therefore, using less memory gave us a big advantage.
As we are now no longer limited by the cache speeds, we can replace the std::vector<bool> with a std::vector<char>, and gain some additional performance as the processors can handle read and writes with bytes directly and don’t need to rely on bit operations for extracting individual bits.
1#include <algorithm>2#include <cmath>3#include <iostream>4#include <vector>56using u64 = unsigned long long;7constexpr u64 N = 4294967295;8constexpr u64 S = 128 * 1024 * 8 * 2;9constexpr u64 SQN = 65535;1011int main() {12u64 result = 2; // we know two is prime13std::vector<u64> primes;1415// odd n maps to index n / 216std::vector<char> is_prime(SQN / 2 + 1, true);17is_prime[0] = false; // 1 is not prime1819for (u64 i = 3; i <= SQN; i += 2) {20if (!is_prime[i / 2]) continue;21primes.emplace_back(i);22for (u64 j = i * i; j <= SQN; j += 2 * i) {23is_prime[j / 2] = false;24}25}2627// block[i] represents start + 2 * i + 128std::vector<char> block(S / 2, true);2930for (u64 k = 0; k * S <= N; k++) {31std::fill(block.begin(), block.end(), true);32u64 start = k * S;33for (u64 p : primes) {34u64 start_index = (start + p - 1) / p;35u64 j = std::max(start_index, p) * p;3637// skip even multiples; we only store odd numbers.38if ((j & 1) == 0) j += p;3940j -= start;41for (; j < S; j += 2 * p) block[j / 2] = false;42}4344if (k == 0) block[0] = false;45for (u64 i = 0; i < S / 2 && start + 2 * i + 1 <= N; i++) {46if (block[i]) result ^= start + 2 * i + 1;47}48}4950std::cout << result << '\n';51return 0;52}
1g++ main.cpp -std=c++23 -o main -O3 -Wall -Wextra && time ./main2632302583./main 2.38s user 0.01s system 99% cpu 2.391 total
§
Can we go even further?
There is another algorithm called linear sieve, which has linear time complexity, but it performs worse than what we have above, since also that log log n term isn’t too big when n is only a 32 bit unsigned integer. Even though it’s slow, the algorithm allows us to find integer factorisations much quickly.
In this experiment, we tried generating primes, but how fast could have we gone if we were only counting primes. It’s a very fine line of difference but that difference, that leeway, can allow us to get down to 0.046987 seconds to count all primes that fit in 32 bits. So, I might do another post on only counting primes. This algorithm is called the Meissel-Lehmer algorithm, and you can go here to read more!
Footnotes
-
See here for a quick analysis of the asymptotic running time of the algorithm. ↩
-
I have an M4 Pro chip that I’m running these computations on. ↩
-
One thing, that kinda was interesting to see was that I wanted to have SQN be a
constexprthat was calculated fromN, but I’ve realised thatstd::sqrtis notconstexpr, yet. Apparently, it will be so in C++ 26! ↩
§