
Faster Proof-of-Space (part 3)


This third part has fewer improvements and could have been called “Adventures with rust-gpu part 2” given how much time I spent wrestling with it.

Rmap optimizations

Since the last update I have mostly been focusing on the GPU implementation of the matching logic. As often happens, I kept re-reading and re-auditing the related CPU code in the process, trying to figure out what an efficient implementation would look like. Along the way I discovered one more optimization that turned out to be applicable to both CPU and GPU.

I really want to support as many usable GPUs as possible, which in turn means thinking about their constraints. One of those constraints is the amount of shared memory used. It turned out that Rmap, which was essentially a [[u32; 2]; 15113] data structure, was quite big: almost 121 kiB! That doesn't fit into the 32 kiB limit I was targeting; it doesn't even fit into many modern consumer GPUs with 48-64 kiB of shared memory! BTW, it turned out the baseline for Vulkan is actually 16 kiB, and that is the amount the Raspberry Pi 5's iGPU has. Using global memory means a massive performance hit, and even if it could work, I want to maximize GPU utilization by running more workgroups, so I'd better use less shared memory rather than more.

The key observation about the Rmap data structure is that it is sparse. Not only are most of the slots empty (remember, in the previous post I mentioned that the bucket size is now limited to just 272 elements), even the occupied ones usually hold one element, not two. So a lot of space is wasted. What can be done about that is to introduce an indirection: the Rmap table stores pointers into a separate table, which only needs to be [[u32; 2]; 272] (again, at least half of it will be empty, but dealing with that is not worth it). Since the second table is much smaller, pointers don't need to occupy 8 bytes; in fact, 9 bits is sufficient, and that is what the GPU implementation uses. For the CPU the additional arithmetic operations were not worth it, so it uses u16 pointers instead, while still enjoying a massive reduction in data structure size.
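To make the idea concrete, here is a minimal sketch of the indirection, roughly in the shape of the CPU variant with u16 pointers. The names, the "index + 1, 0 means empty" convention, and the way one-vs-two elements per slot is handled are all illustrative assumptions, not the actual code from the repository:

```rust
// Hypothetical sketch of the compact Rmap described above, not the real
// implementation. Instead of storing two u32 values per slot directly,
// each slot holds a small pointer into a dense `values` table.
const RMAP_SLOTS: usize = 15113;
const MAX_BUCKET_SIZE: usize = 272;

struct CompactRmap {
    /// One small pointer per slot; 0 means "empty", otherwise it is
    /// `index + 1` into `values`. On the CPU a u16 is enough and avoids the
    /// bit-packing arithmetic that the 9-bit GPU variant needs.
    slots: [u16; RMAP_SLOTS],
    /// Dense storage for the (at most 272) occupied entries of a bucket.
    values: [[u32; 2]; MAX_BUCKET_SIZE],
    /// Number of `values` entries used so far.
    len: u16,
}

impl CompactRmap {
    fn insert(&mut self, slot: usize, value: u32) {
        let ptr = self.slots[slot];
        if ptr == 0 {
            // First element for this slot: allocate a new `values` entry.
            // (A real implementation also has to track whether a slot holds
            // one or two elements; that detail is glossed over here.)
            self.values[usize::from(self.len)] = [value, 0];
            self.len += 1;
            self.slots[slot] = self.len; // stored as index + 1
        } else {
            // Second element for this slot (at most two per slot).
            self.values[usize::from(ptr - 1)][1] = value;
        }
    }

    fn get(&self, slot: usize) -> Option<&[u32; 2]> {
        let ptr = self.slots[slot];
        (ptr != 0).then(|| &self.values[usize::from(ptr - 1)])
    }
}
```

With this layout the pointer table plus the dense values table is far smaller than the original [[u32; 2]; 15113], which is where the shared memory savings below come from.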

As a result, we get the following change in CPU performance, which landed in PR 385:

Before:
chia/table/parallel/8x  time:   [529.15 ms 534.42 ms 540.15 ms]
                        thrpt:  [14.811  elem/s 14.969  elem/s 15.119  elem/s]
After:
chia/table/parallel/8x  time:   [518.48 ms 521.57 ms 525.13 ms]
                        thrpt:  [15.234  elem/s 15.338  elem/s 15.430  elem/s]

It is not a particularly huge difference, but it is consistently faster, and the smaller data structure has a much better chance of staying in the L1 cache during processing.

On the GPU, 9-bit pointers plus the additional table with the actual values occupy ~19 kiB of shared memory versus 121 kiB before. This still doesn't fit into the 16 kiB on the Raspberry Pi 5, so the Rmap table there will have to be moved to global memory in the hope that it gets cached, but at least everything else fits nicely with some room to spare.

Other CPU improvements

I was also exploring some other things and noticed that manual loop unrolling in the match-finding code was actually making things worse in the current state of the library, so I removed it in PR 388, which further improved performance substantially:

Before:
chia/table/single/1x    time:   [747.82 ms 756.94 ms 768.09 ms]
                        thrpt:  [1.3019  elem/s 1.3211  elem/s 1.3372  elem/s]
chia/table/parallel/8x  time:   [518.48 ms 521.57 ms 525.13 ms]
                        thrpt:  [15.234  elem/s 15.338  elem/s 15.430  elem/s]
After:
chia/table/single/1x    time:   [697.77 ms 707.91 ms 723.34 ms]
                        thrpt:  [1.3825  elem/s 1.4126  elem/s 1.4331  elem/s]
chia/table/parallel/8x  time:   [500.34 ms 506.86 ms 513.82 ms]
                        thrpt:  [15.570  elem/s 15.783  elem/s 15.989  elem/s]

Finding matches on GPU

Finding matches on the GPU was the first kernel that was not embarrassingly parallel and required some level of synchronization. The reason is that most candidate pairs do not produce matches, and those that do need to be compacted and emitted in the same deterministic order as on the CPU. Not only that, I later discovered that the whole kernel needs to progress in phases, or else the results differ in inexplicable ways and depend on the GPU vendor/implementation.

It does still waste a bit more time on divergent control flow than I’d like, but we’ll see how it performs once the whole workflow is complete.
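For illustration only, here is a CPU-side caricature of the shape of that kernel. Everything here is an assumption (names, the exact ordering key, the placement of the phases); the real shader works per workgroup, with barriers where the comments indicate:

```rust
// Hypothetical CPU-side mirror of the phased match-finding kernel, not the
// actual shader code. On the GPU each "barrier" comment corresponds to a
// workgroup barrier so that every invocation finishes a phase before the
// next phase starts.
fn find_matches_in_bucket(
    left_bucket: &[u32],
    right_bucket: &[u32],
    is_match: impl Fn(u32, u32) -> bool,
) -> Vec<(u32, u32)> {
    // Phase 1: check every (left, right) candidate pair independently.
    // Most pairs do not match, so the raw results are very sparse.
    let mut sparse = Vec::new();
    for (l_index, &l) in left_bucket.iter().enumerate() {
        for (r_index, &r) in right_bucket.iter().enumerate() {
            if is_match(l, r) {
                sparse.push((l_index, r_index, l, r));
            }
        }
    }
    // (GPU: workgroup barrier here)

    // Phase 2: compact the sparse results into a dense output in a
    // deterministic order (here: by left index, then right index),
    // regardless of which invocation found which match first, so that the
    // output matches the CPU implementation exactly.
    sparse.sort_unstable_by_key(|&(l_index, r_index, _, _)| (l_index, r_index));
    // (GPU: another barrier before the compacted results are written out)

    sparse.into_iter().map(|(_, _, l, r)| (l, r)).collect()
}
```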

Since the Raspberry Pi 5 and baseline Vulkan requirements more generally are lower than those of modern-ish consumer dGPUs, I decided to evolve the existing scheme of compiling two kernels, with and without Int64 (64-bit integer) support. Since PR 389 the shader variants are "modern" and "fallback." The modern kernel requires Int64 and 32 kiB+ of shared memory, while the fallback should theoretically run on anything compliant with Vulkan 1.2+, which covers most dGPUs from the last decade or so and even iGPUs. I checked one of the laptops I have with an Intel HD Graphics 620 iGPU, and even that one should work.
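As a rough sketch of what picking a variant on the host side could look like (the enum, function, and thresholds here are illustrative assumptions, not the project's actual API; the commented field names are the standard Vulkan capability fields):

```rust
// Illustrative variant selection, not the project's actual API. "Modern"
// requires 64-bit integer support and at least 32 kiB of shared memory;
// anything else that is Vulkan 1.2 compliant gets the fallback kernel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ShaderVariant {
    Modern,
    Fallback,
}

fn select_shader_variant(
    supports_int64: bool,    // VkPhysicalDeviceFeatures::shaderInt64
    max_shared_memory: u32,  // VkPhysicalDeviceLimits::maxComputeSharedMemorySize
) -> ShaderVariant {
    if supports_int64 && max_shared_memory >= 32 * 1024 {
        ShaderVariant::Modern
    } else {
        ShaderVariant::Fallback
    }
}
```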

Making it compile with rust-gpu

While I was writing the code, I used cargo clippy and cargo build for verification, but since I did not have the shader definition initially, I didn't yet know whether it would compile with rust-gpu. It did not, in a big way: over 60 errors.

The root of all evil, or at least most of it, was rust-gpu issue 241. The thing is that I used arrays for many data structures, since I know the upper bound on everything and want to store things compactly. However, I most often index into those arrays dynamically. The Rust standard library implements both checked and unchecked indexing by dereferencing an array to a slice, but rust-gpu doesn't allow casting *[u32; N] into *[u32], which effectively means I can't use most of the methods; I can't even use iterators.

So instead of iterators I had to do manual indexing. For unchecked access I just use [] without any way to tell the compiler that it doesn't need to do bounds checks. And for checked access I have to first check the index against the length of the array explicitly and then use []. As you can imagine, this resulted in a lot of ugly boilerplate that is much harder to read. Not only that, I had to give up some of the newtypes because I couldn't cast them to/from their inner types (for those that were backed by reusable scratch space in shared memory).
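Here is a made-up illustration of the workaround pattern (not code from the repository), just to show how much is lost compared to ordinary slice-based Rust:

```rust
// Hypothetical example of the pattern forced by rust-gpu issue 241: since
// &[u32; N] cannot be coerced into &[u32], slice methods and iterators are
// unavailable, so indexing has to be done by hand.
const LEN: usize = 64;

fn sum_prefix(values: &[u32; LEN], count: usize) -> u32 {
    let mut sum = 0u32;
    // Instead of `values.iter().take(count).sum()`: a manual index loop.
    let mut i = 0;
    while i < count {
        // "Checked" access means comparing against the array length by hand
        // before using []; and there is no way to tell the compiler that the
        // bounds check is unnecessary in the "unchecked" case either.
        if i < LEN {
            sum = sum.wrapping_add(values[i]);
        }
        i += 1;
    }
    sum
}
```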

All in all, that wasn’t a great experience at all, and I hope it’ll improve soon.

Making it run properly

Once the code compiled, it didn't run properly (surprise!), and I had to figure out why. This turned out to be much more challenging than I expected. It turns out there is no direct way to know whether a shader executed successfully or not. If the code hits a panic, it just exits execution early, leaving the memory with whatever garbage it may have had after allocation or with whatever incomplete results it managed to produce so far.

The only real way to know that it completed successfully is to explicitly write some value at the very end and check for it manually. Not only that, once you know the code has likely panicked, it is surprisingly challenging to figure out both where and why. There is no step-by-step debugger like on the CPU, and not even println!() is easily available unless you implement it yourself; and when you try to implement it yourself, you'll most likely hit the mentioned pointer casting issue again. Quite a frustrating experience overall; there must be a better way eventually if we want people to write shaders in Rust.
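The pattern is roughly this (a simplified illustration with made-up names, not the project's exact code): the kernel writes a known marker into a reserved slot of an output buffer as its very last step, and the host checks for it after readback.

```rust
// Simplified "did the kernel actually finish?" check, not the project's
// exact code. A panic inside the shader aborts execution early, so the
// marker is only present if the kernel reached its last statement.
const SUCCESS_MARKER: u32 = 0xdead_beef;

// GPU side, the very last statement of the kernel entry point (sketch;
// `output` would be a storage buffer, `SUCCESS_SLOT` a reserved index):
//     output[SUCCESS_SLOT] = SUCCESS_MARKER;

// Host side, after reading the output buffer back:
fn kernel_completed(output: &[u32], success_slot: usize) -> bool {
    output.get(success_slot) == Some(&SUCCESS_MARKER)
}
```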

The complete version of find_matches_in_buckets shader with tests landed in PR 391.

It does run

In the end I managed to make it work, tested both on an AMD GPU and with llvmpipe on the CPU (which is what is used in CI). Both produce results identical to the CPU implementation.

Upcoming plans

With the matching logic implemented, the only remaining difficult thing is bucketing. I think I'll be changing that logic a bit again, and I expect it to improve performance on the CPU too. Once bucketing is implemented, what remains is proof searching (a mostly embarrassingly parallel task) and applying the proofs to record chunks in a sector. Neither of these two is remotely as complex as bucketing, though.

It’ll be a bonus to implement erasure coding on GPU, which will help with both plotting and archiving (which is already very fast, especially compared to the original KZG-based version in Subspace). But that is an optional thing for now. If you’re reading this and would like to give it a try, let me know!

I’ll write once I have more updates to share. In the meantime you can always find me on Zulip.