Apr 16, 2026

17 min read

Generating Primes that Fit in 32 Bits

And as fast as possible!

After reading this post on HackerNews, I also decided to see how efficient of a code I could write that would compute some result of all primes that fit in 32 bits. I actually thought of doing it for 64 bits at first, since that would have been really interesting, but I so on found out that there might be a bit too many primes at that size. $\pi(2^{64}) \approx 4.2567\times 10^{17}$ But, $\pi(2^{32})=2.0328\times 10^8$ , so we can generate up to the unsigned integer limit. In addition, I’ve decided not to write to a file (unlike the post) or even write to standard out, instead I’ve decided that I would calculate the XORs of all primes in that range, which removes any overhead and allows us to actually get the time it takes to generate the primes.

With my experience with competitive programming, I knew that I should be writing a sieve to calculate the result. You can read more at Algorithms for Competitive Programming or at the blog post linked previously. The basic idea for sieve is that instead of checking if a number has divisors, we let prime numbers “announce” their multiples as composite.

A number is prime, if none of the smaller prime numbers divides it. Since we iterate over the prime numbers in order, we already marked all numbers, which are divisible by at least one of the prime numbers, as divisible. Hence if we reach a cell and it is not marked, then it isn’t divisible by any smaller prime number and therefore has to be prime.

We know that the primes have sparsity of $\frac{n}{\ln n}$ , or in other words $\pi(n)\approx\frac{n}{\ln n}$ . Using this fact we can prove that the running time complexity of the algorithm is $O(n\log \log n)$ .¹ I wrote a quick script in C that implements this algorithm.

1#include <stdio.h>
2#include <math.h>
3
4#define u32 unsigned int 
5#define u64 unsigned long long
6#define N (u64) 4294967296
7
8// different than usual sieve, where 0 and 1s are flipped
9u32 is_not_prime[N / 32]; // 32 = 8 * sizeof(u32)
10
11int main() {
12  u64 result = 0;
13  is_not_prime[0] = 3; // set "0" and "1" as not primes
14  for (u64 i = 0; i < N; i += 32) {
15    for (u64 bit = (i == 0 ? 2 : 0); bit < 32; bit++) {
16      if ((is_not_prime[i / 32] & ((u64) 1 << bit)) || (u64) i * i >= N) continue;
17      result ^= i + bit;
18      for (u64 j = (i + bit) * (i + bit); j < N; j += i + bit) {
19        is_not_prime[j / 32] |= ((u64) 1 << (j % 32));
20      }
21    }
22  }
23  printf("%llu", result);
24  return 0;
25}

Not the cleanest (yet), but it gives us an output in 23 seconds.² Here, instead of creating a byte for 4294967295 values, we’re using a bit array, which is why the code is a bit ugly here. In bit arrays, I have a single bit for each of the 4294967295 indices I have. If I were to use a bool is_not_prime[] array, I would need to store eight times more data as modern processors architectures work much more efficiently with bytes than with bits since they usually cannot access bits directly. So what we do is, we store the bits in a large continuous memory, access the memory in blocks of a few bytes, and extract/set the bits with bit operations like bit masking and bit shifting.

Another thing here is that I’m creating an array is_not_prime instead of the usual is_prime as I’m utilising the fact that the programme will start that global variable as zero-initialised, otherwise I would’ve needed to memset it, which would be a small overhead.

Continuing on, I’ve had a couple other implementations (just small improvements/optimisations), before the next one we will talk:

Instead of also checking for N (which we know isn’t prime), we can go up until i < N.
Since we know the value of N, we can write #define SQN (u32) 65535 and check i > SQN instead of i * i >= N.
Instead of doing divisions and modulars by powers of two, we can use shifts and bitwise ands. j & 31 for j % 32 and i >> 5 for i / 32. Lucky for us, the compiler already did those optimisations, removing the need for us to do them. You can verify this on Compiler Explorer for yourself too!
I’ve tried using u64 instead of u32 for is_not_prime, but that doesn’t get me much improvement either, since it’s still the same amount of memory that’s being used.
One point of contention I’ve found is this check: (is_not_prime[i / 32] & ((u64) 1 << bit)) || (u64) i * i >= N. We know that the compiler optimises these kinds of checks (i.e., if the first predicate is true, since it’s an OR, it just makes the if true, without checking the other predicate). So, we can optimise it further by seeing which one has higher amount of truthiness. And: usqn = 4294901760; up = 4091687075. Thus, by switching the two predicates we get down to 21 seconds.
Using -O3 in the compilation step gets us down to 20 seconds.
Since we have registers of size of 64 bits, we can just use u64s for pretty much everything.

Thus, we get the following piece of code.

1#include <stdio.h>
2
3#define u64 unsigned long long
4#define N (u64) 4294967295
5#define SQN (u64) 65535
6
7// different than usual sieve, where 0 and 1s are flipped
8u64 is_not_prime[N / 64 + 1]; // 64 = 8 * sizeof(u64)
9
10int main() {
11  u64 result = 0;
12  is_not_prime[0] = 3; // set "0" and "1" as not primes
13  for (u64 i = 0; i < N; i++) {
14    u64 bit = i % 64;
15    if (i > SQN || (is_not_prime[i / 64] & (1llu << bit))) continue;
16    result ^= i;
17    for (u64 j = i * i; j <= N; j += i) {
18      is_not_prime[j / 64] |= (1llu << (j % 64));
19    }
20  }
21  printf("%llu\n", result);
22  return 0;
23}

SQRT

This is where things got pretty interesting. Now, I was checking the assembly of the code in Compiler Explorer, when I realised that the for loop was looping until i <= 65535, but in the code it should’ve been up to i < N.

Here, the compiler is actually pretty smart in understanding the next if block we have checks if i > SQN, and continues otherwise; thus, the compiler can be sure that the for loop only runs until SQN. Which, in either case, would’ve been our next optimisation. Obviously, to find all the prime numbers until $n$ , it will be enough just to perform the sifting only by the prime numbers, which do not exceed the root of $n$ .

Such optimisation doesn’t affect the complexity (indeed, doing the same proof as for the other case, we’d get the evaluation $n \ln \ln \sqrt n + o(n)$ , which is asymptotically the same according to the properties of logarithms), though the number of operations will reduce noticeably.

However, the issue with my code was that I wasn’t actually computing the correct result of XORs; because, as said, the loop would go up to SQN, and not the full way, so the operation result ^= i would not be run for majority of primes. Adding the correct for loop to actually compute the result we get the actual value in 20-ish seconds.

1#include <stdio.h>
2
3#define u64 unsigned long long
4#define N (u64) 4294967295
5#define SQN (u64) 65535
6
7// different than usual sieve, where 0 and 1s are flipped
8u64 is_not_prime[N / 64 + 1]; // 64 = 8 * sizeof(u64)
9
10int main() {
11  u64 result = 0;
12  is_not_prime[0] = 3; // set "0" and "1" as not primes
13  for (u64 i = 2; i <= SQN; i++) {
14    if (is_not_prime[i / 64] & (1llu << (i % 64))) continue;
15    for (u64 j = i * i; j <= N; j += i) {
16      is_not_prime[j / 64] |= (1llu << (j % 64));
17    }
18  }
19  for (u64 i = 2; i <= N; i++) {
20    if (!(is_not_prime[i / 64] & (1llu << (i % 64)))) result ^= i;
21  }
22  printf("%llu\n", result);
23  return 0;
24}

1gcc main.c -o main -O3 && time ./main
263230258
3./main  20.37s user 0.11s system 99% cpu 20.493 total

Two is an interesting number…

If you are you, you would know that two is the only even prime. And any even number is not prime, by definition. Therefore, by using this fact, we could halve our memory and time usage entirely by literally stopping checking even numbers at all. After implementing this, with some more C-like syntax with ~~generics~~ macros, we get the following piece of code.

1#include <stdio.h>
2
3#define u64 unsigned long long
4#define N (u64) 4294967295
5#define SQN (u64) 65535
6
7// only odd numbers are stored
8// bit for n is at index n >> 1.
9#define WORD_INDEX(n) ((n) >> 7) // (n >> 1) / 64
10#define BIT_MASK(n) (1ULL << (((n) >> 1) & 63))
11
12u64 is_not_prime[((N >> 1) / 64) + 1];
13
14int main() {
15  u64 result = 2; // handle the only even prime separately
16  is_not_prime[0] = 1;
17
18  for (u64 i = 3; i <= SQN; i += 2) {
19    if (is_not_prime[WORD_INDEX(i)] & BIT_MASK(i)) continue;
20    // mark only odd multiples: i*i, i*i + 2*i, i*i + 4*i, ...
21    for (u64 j = i * i; j <= N; j += (i << 1)) {
22      is_not_prime[WORD_INDEX(j)] |= BIT_MASK(j);
23    }
24  }
25
26  for (u64 i = 3; i <= N; i += 2) {
27    if (!(is_not_prime[WORD_INDEX(i)] & BIT_MASK(i))) {
28      result ^= i;
29    }
30  }
31
32  printf("%llu\n", result);
33  return 0;
34}

Which gives us an amazing result of 8.4 seconds:

1gcc main.c -o mainc -O3 && time ./mainc
263230258
3./mainc  8.31s user 0.07s system 99% cpu 8.400 total

This, I find, quite interesting actually. We have lowered our total time that it takes by more than half. We are nearly halving the number of operations that we are doing, which would’ve gotten us down to 10 seconds but with the less memory we are using, we even get around a 1.5 second boost too! Which gets us to our next point:

Let’s optimise the cache

It follows from the optimisation “sieving till root” that there is no need to keep the whole array is_prime[1, ..., n] at all times. For sieving, it is enough to just keep the prime numbers until the root of $n$ , i.e., prime[1, ..., sqrt(n)], split the complete range into blocks, and sieve each block separately.

Let $s$ be a constant which determines the size of the block, then we have $\lceil {\frac n s} \rceil$ blocks altogether, and the block $k = 0, \dots, \lfloor {\frac n s} \rfloor$ contains the numbers in a segment $[ks; ks + s - 1]$ . We can work on blocks by turns; i.e., for every block $k$ , we will go through all of the prime numbers (from $1$ to $\sqrt n$ ) and perform sieving using them. It is worth noting that we have to modify the strategy a little bit when handling the first numbers:

All of the prime numbers from $[1; \sqrt n]$ shouldn’t remove themselves.
The numbers $0$ and $1$ should be marked as non-prime numbers.
While working on the last block it should not be forgotten that the last needed number $n$ is not necessarily located at the end of the block.

As discussed previously, the typical implementation of the Sieve of Eratosthenes is limited by the speed how fast you can load data into the CPU caches. By splitting the range of potential prime numbers $[1; n]$ into smaller blocks, we never have to keep multiple blocks in memory at the same time, and all operations are much more cache-friendlier.

To implement this strategy, I’m going back to my roots and I will use C++, since it has std::vector and that also std::vector<bool> has a type override as it is infact a bit array (just like a std::bitset).³

1#include <vector>
2#include <iostream>
3#include <algorithm>
4#include <cmath>
5
6using u64 = unsigned long long;
7constexpr u64 N = 4294967295;
8constexpr u64 S = 128 * 1024 * 8;
9constexpr u64 SQN = 65535;
10
11int main() {
12  u64 result = 0;
13  std::vector<u64> primes;
14  std::vector<bool> is_prime(N + 1, true);
15
16  for (u64 i = 2; i <= SQN; i++) {
17    if (!is_prime[i]) continue;
18    primes.emplace_back(i);
19    for (u64 j = i * i; j <= SQN; j += i) {
20      is_prime[j] = false;
21    }
22  }
23
24  std::vector<bool> block(S, true);
25
26  for (int k = 0; k * S <= N; k++) {
27    std::fill(block.begin(), block.end(), true);
28    u64 start = k * S;
29    for (u64 p : primes) {
30      u64 start_index = (start + p - 1) / p;
31      u64 j = std::max(start_index, p) * p - start;
32      for (; j < S; j += p) block[j] = false;
33    }
34    if (k == 0) block[0] = block[1] = false;
35    for (u64 i = 0; i < S && start + i <= N; i++) 
36      if (block[i]) result ^= k * S + i;
37  }
38
39  std::cout << result << '\n';
40  return 0;
41}

With that code, we get the following result.

1g++ main.cpp -std=c++23 -o main -O3 -Wall -Wextra && time ./main
263230258
3./main  12.46s user 0.06s system 99% cpu 12.553 total

Which, worse than what we had odd-optimised sieve, so once we apply the odd numbered optimisation:

1#include <algorithm>
2#include <cmath>
3#include <iostream>
4#include <vector>
5
6using u64 = unsigned long long;
7constexpr u64 N = 4294967295;
8constexpr u64 S = 128 * 1024 * 8 * 2;
9constexpr u64 SQN = 65535;
10
11int main() {
12  u64 result = 2; // we know two is prime
13  std::vector<u64> primes;
14
15  // odd n maps to index n / 2
16  std::vector<bool> is_prime(SQN / 2 + 1, true);
17  is_prime[0] = false; // 1 is not prime
18
19  for (u64 i = 3; i <= SQN; i += 2) {
20    if (!is_prime[i / 2]) continue;
21    primes.emplace_back(i);
22    for (u64 j = i * i; j <= SQN; j += 2 * i) {
23      is_prime[j / 2] = false;
24    }
25  }
26
27  // block[i] represents start + 2 * i + 1
28  std::vector<bool> block(S / 2, true);
29
30  for (u64 k = 0; k * S <= N; k++) {
31    std::fill(block.begin(), block.end(), true);
32    u64 start = k * S;
33    for (u64 p : primes) {
34      u64 start_index = (start + p - 1) / p;
35      u64 j = std::max(start_index, p) * p;
36
37      // skip even multiples; we only store odd numbers.
38      if ((j & 1) == 0) j += p;
39
40      j -= start;
41      for (; j < S; j += 2 * p) block[j / 2] = false;
42    }
43    
44    if (k == 0) block[0] = false;
45    for (u64 i = 0; i < S / 2 && start + 2 * i + 1 <= N; i++) {
46      if (block[i]) result ^= start + 2 * i + 1;
47    }
48  }
49
50  std::cout << result << '\n';
51  return 0;
52}

We get the following result:

1g++ main.cpp -std=c++23 -o main -O3 -Wall -Wextra && time ./main
263230258
3./main  4.86s user 0.03s system 99% cpu 4.889 total

Which is really good!

One thing that careful readers might notice is that I chose those interesting (and seemingly random) numbers for the block size $s$ . Why?

The reason is that, I am using a chip that has an L1 (data) cache size of 128KB = 128 * 1024 bytes, and since we’re using a bit array, we are able to store 128 * 1024 * 8 indices, which allows us to store the entirety of a block in the first level cache. Even more careful readers might notice that the odd-optimised implementation has a * 2 attached to the end. The reason is that the blocks have size $\frac{s}{2}$ , and so I can afford to increase the size of the blocks even more!

As we have seen, using a bit array allows us to have less memory overhead (of eight times!). But, that comes with quite a considerable overhead in terms of the computation that we need to do (that the compiler abstracts it away from us, but we saw what needs to be done in the C versions of our code). However, for how we did it with our first iterations of the bit arrays, we were actually limited by how fast we were loading data to the cache, since the data that we accessed was sparse (not segmentd/blocked, and that jumped all around the is_prime array), and therefore, using less memory gave us a big advantage.

As we are now no longer limited by the cache speeds, we can replace the std::vector<bool> with a std::vector<char>, and gain some additional performance as the processors can handle read and writes with bytes directly and don’t need to rely on bit operations for extracting individual bits.

We also shouldn’t forget to update our value for S; since we are taking a byte of memory for each index, S should only be 128 * 1024 * 2.⁴

1#include <algorithm>
2#include <cmath>
3#include <iostream>
4#include <vector>
5
6using u64 = unsigned long long;
7constexpr u64 N = 4294967295;
8constexpr u64 S = 128 * 1024 * 2;
9constexpr u64 SQN = 65535;
10
11int main() {
12  u64 result = 2; // we know two is prime
13  std::vector<u64> primes;
14
15  // odd n maps to index n / 2
16  std::vector<char> is_prime(SQN / 2 + 1, true);
17  is_prime[0] = false; // 1 is not prime
18
19  for (u64 i = 3; i <= SQN; i += 2) {
20    if (!is_prime[i / 2]) continue;
21    primes.emplace_back(i);
22    for (u64 j = i * i; j <= SQN; j += 2 * i) {
23      is_prime[j / 2] = false;
24    }
25  }
26
27  // block[i] represents start + 2 * i + 1
28  std::vector<char> block(S / 2, true);
29
30  for (u64 k = 0; k * S <= N; k++) {
31    std::fill(block.begin(), block.end(), true);
32    u64 start = k * S;
33    for (u64 p : primes) {
34      u64 start_index = (start + p - 1) / p;
35      u64 j = std::max(start_index, p) * p;
36
37      // skip even multiples; we only store odd numbers.
38      if ((j & 1) == 0) j += p;
39
40      j -= start;
41      for (; j < S; j += 2 * p) block[j / 2] = false;
42    }
43    
44    if (k == 0) block[0] = false;
45    for (u64 i = 0; i < S / 2 && start + 2 * i + 1 <= N; i++) {
46      if (block[i]) result ^= start + 2 * i + 1;
47    }
48  }
49
50  std::cout << result << '\n';
51  return 0;
52}

1❯ g++ main.cpp -std=c++23 -o main -O3 -Wall -Wextra && time ./main
263230258
3./main  1.78s user 0.00s system 99% cpu 1.784 total

Can we go even further?

There is another algorithm called linear sieve, which has linear time complexity, but it performs worse than what we have above, since also that log log n term isn’t too big when n is only a 32 bit unsigned integer. Even though it’s slow, the algorithm allows us to find integer factorisations much quickly.

In this experiment, we tried generating primes, but how fast could have we gone if we were only counting primes. It’s a very fine line of difference but that difference, that leeway, can allow us to get down to 0.046987 seconds to count all primes that fit in 32 bits. So, I might do another post on only counting primes. This algorithm is called the Meissel-Lehmer algorithm, and you can go here to read more!

See here for a quick analysis of the asymptotic running time of the algorithm. ↩
I have an M4 Pro chip that I’m running these computations on. ↩
One thing, that kinda was interesting to see was that I wanted to have SQN be a constexpr that was calculated from N, but I’ve realised that std::sqrt is not constexpr, yet. Apparently, it will be so in C++ 26! ↩
I actually made the mistake of not updating the value of S on the first public edition of this post, where the incorrect version had a 2.391 second result! ↩

Subscribe!

Contents

SQRT
Two is an interesting number...
Let's optimise the cache
Can we go even further?

Generating Primes that Fit in 32 Bits

SQRT

Two is an interesting number…

Let’s optimise the cache

Can we go even further?

Footnotes