It has been a couple of weeks since the last status update about performance improvements in Proof-of-Space, and I am finally at a decent stopping point where all architectural changes are done and I can share them with you.
Matching performance optimizations#
In a long series of observations, I noticed that when matching proofs we always need both the `y` values and their positions, which were previously stored separately. Combining them into a tuple, with some extra cache line alignment for some data structures, in PR 393 resulted in yet another substantial performance improvement, both for table generation and even proof searching:
Before:
chia/table/single/1x time: [702.17 ms 707.33 ms 712.59 ms]
thrpt: [1.4033 elem/s 1.4138 elem/s 1.4242 elem/s]
chia/table/parallel/8x time: [502.50 ms 505.77 ms 509.35 ms]
thrpt: [15.706 elem/s 15.818 elem/s 15.920 elem/s]
chia/proof/missing time: [107.29 ns 108.80 ns 111.58 ns]
thrpt: [8.9618 Melem/s 9.1915 Melem/s 9.3207 Melem/s]
chia/proof/present time: [374.03 ns 375.88 ns 377.16 ns]
thrpt: [2.6514 Melem/s 2.6604 Melem/s 2.6736 Melem/s]
After:
chia/table/single/1x time: [638.21 ms 642.97 ms 649.77 ms]
thrpt: [1.5390 elem/s 1.5553 elem/s 1.5669 elem/s]
chia/table/parallel/8x time: [426.94 ms 431.49 ms 436.70 ms]
thrpt: [18.319 elem/s 18.540 elem/s 18.738 elem/s]
chia/proof/missing time: [76.806 ns 76.913 ns 77.038 ns]
thrpt: [12.981 Melem/s 13.002 Melem/s 13.020 Melem/s]
chia/proof/present time: [351.19 ns 352.22 ns 354.05 ns]
thrpt: [2.8245 Melem/s 2.8391 Melem/s 2.8474 Melem/s]
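To illustrate the layout change, here is a minimal sketch under assumed types; `Y`, `Position`, and the alignment wrapper are illustrative, not the actual types from PR 393:

```rust
/// Hypothetical stand-ins for the real `y` and position types.
#[derive(Copy, Clone)]
struct Y(u32);
#[derive(Copy, Clone)]
struct Position(u32);

/// A `y` value combined with its position in one tuple, so matching reads
/// both with a single memory access instead of touching two separate arrays.
type Entry = (Y, Position);

/// Cache-line-aligned bucket (272 is the per-bucket limit discussed below),
/// so consecutive buckets start on 64-byte boundaries.
#[repr(C, align(64))]
struct Bucket {
    len: u16,
    entries: [Entry; 272],
}

fn main() {
    let bucket = Bucket {
        len: 0,
        entries: [(Y(0), Position(0)); 272],
    };
    assert_eq!(std::mem::align_of::<Bucket>(), 64);
    assert_eq!(bucket.len, 0);
    assert_eq!(bucket.entries.len(), 272);
}
```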
Proof searching and verification improvements#
In PR 394 I then further reduced memory usage and simplified the public API, which allowed reusing `LeftTargets` across all table generation instances and helps with utilization of the higher-level CPU caches. As a bonus, this PR removes heap usage from proof verification and fixes the (previously broken) `alloc` feature, for a measurable verification performance improvement:
Before:
chia/verification time: [7.0819 µs 7.0908 µs 7.0961 µs]
thrpt: [140.92 Kelem/s 141.03 Kelem/s 141.21 Kelem/s]
After:
chia/verification time: [6.4624 µs 6.4689 µs 6.4758 µs]
thrpt: [154.42 Kelem/s 154.59 Kelem/s 154.74 Kelem/s]
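The sharing pattern behind reusing `LeftTargets` is straightforward; below is a minimal sketch of the idea, where the `LeftTargets` contents and the `generate_table` signature are assumptions for illustration, not the real API:

```rust
use std::sync::Arc;
use std::thread;

/// Stand-in for the real precomputed lookup table; the actual contents do
/// not matter for the sharing pattern.
struct LeftTargets(Vec<u16>);

/// Hypothetical table generation entry point that borrows the shared,
/// read-only targets instead of recomputing or cloning them per instance.
fn generate_table(seed: u64, left_targets: &LeftTargets) -> u64 {
    // Placeholder "work": look up an entry derived from the seed.
    let idx = (seed as usize) % left_targets.0.len();
    u64::from(left_targets.0[idx])
}

fn main() {
    // Computed once and shared by every table generation instance, keeping a
    // single copy hot in the shared (higher-level) CPU caches.
    let left_targets = Arc::new(LeftTargets((0..4096).map(|i| i as u16).collect()));

    let handles: Vec<_> = (0..8u64)
        .map(|seed| {
            let targets = Arc::clone(&left_targets);
            thread::spawn(move || generate_table(seed, &targets))
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }
}
```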
Proofs API simplification#
I mentioned in one of the previous updates that the number of matches for a pair of buckets was limited to 288 (a multiple of 32, important for GPUs), and the number of elements per bucket of `y` values was limited to 272 (a multiple of 16, important for CPUs). The idea behind those numbers is that they were supposed to be the smallest values that still provide enough proofs for sector encoding. I came up with them after a series of empirical experiments, but didn't have any guarantees that the assumption would hold.
Turns out, it was crucial for further simplification: at that time the farmer code was prepared to handle an insufficient number of proofs for a sector, which complicated the logic in multiple places, especially when the GPU implementation is involved. In PR 408 I added a test case (writing a `const fn` that prevents compilation would have been much harder) to ensure that the assumption always holds, and removed support for unencoded sector chunks from the farmer.
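The divisibility part of those constants is checkable at compile time; here is a sketch of such a guard, where the constant names are mine and the harder "enough proofs for sector encoding" property is still what the runtime test case in PR 408 covers:

```rust
/// Upper bound on matches for a pair of buckets: a multiple of 32 (GPUs).
const MAX_MATCHES: usize = 288;
/// Upper bound on `y` values per bucket: a multiple of 16 (CPUs).
const MAX_BUCKET_SIZE: usize = 272;

// Evaluated at compile time: violating either bound fails the build, which
// is the "prevents compilation" style of guard mentioned above.
const _: () = {
    assert!(MAX_MATCHES % 32 == 0);
    assert!(MAX_BUCKET_SIZE % 16 == 0);
};

fn main() {
    println!("{MAX_MATCHES} matches max, {MAX_BUCKET_SIZE} elements per bucket");
}
```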
With that, it was possible to simplify proof handling. PR 410 introduced and started using a new API that, instead of generating tables that can be queried later, generates a sufficient number of proofs and a bitmap indicating which proofs are present. The bitmap is exactly the same as the one the farmer was already generating internally, so a bit of compute is saved there. The proofs themselves are also easier to handle, especially during piece decoding.
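In spirit, the new API's output looks roughly like this; the type and field names below are illustrative, not the actual PR 410 definitions:

```rust
/// Illustrative fixed-size proof; the real layout is different.
struct Proof([u8; 64]);

/// Result of the one-shot API: all proofs needed for a sector, plus a bitmap
/// marking which s-buckets actually produced a proof.
struct SectorProofs {
    proofs: Vec<Proof>,
    /// Bit `i` is set when s-bucket `i` has a proof; previously the farmer
    /// derived the same bitmap itself by querying the tables.
    present: Vec<u64>,
}

impl SectorProofs {
    fn is_present(&self, s_bucket: usize) -> bool {
        (self.present[s_bucket / 64] >> (s_bucket % 64)) & 1 == 1
    }
}

fn main() {
    // 2^15 s-buckets worth of bitmap words; no proofs in this empty stub.
    let proofs = SectorProofs {
        proofs: Vec::new(),
        present: vec![0u64; (1 << 15) / 64],
    };
    assert!(proofs.proofs.is_empty());
    assert!(!proofs.is_present(0));
}
```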
After looking for so long at the way proofs are found, I also noticed that the current design was somewhat inherited from the older architecture, where the tables were sorted by `y` values. In Chia that ordering is important; however, we're only interested in a single proof per challenge. So PR 406 changed the way s-buckets are converted into the challenges that Chia deals with, to optimize the search (a breaking change compared to the Subspace implementation), and then PR 411 refactored the internals to only retain a single proof target per s-bucket instead of the complete last (seventh) table. With the above changes, we can generate the tables and find the proofs faster than with the earlier split process of generating full tables and then finding proofs one by one:
Before:
chia/proofs/single/1x time: [728.10 ms 735.58 ms 744.74 ms]
thrpt: [1.3427 elem/s 1.3595 elem/s 1.3734 elem/s]
chia/proofs/parallel/8x time: [600.14 ms 604.93 ms 609.18 ms]
thrpt: [13.132 elem/s 13.225 elem/s 13.330 elem/s]
After:
chia/proofs/single/1x time: [710.02 ms 713.94 ms 718.19 ms]
thrpt: [1.3924 elem/s 1.4007 elem/s 1.4084 elem/s]
chia/proofs/parallel/8x time: [567.42 ms 574.66 ms 581.51 ms]
thrpt: [13.757 elem/s 13.921 elem/s 14.099 elem/s]
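To make the PR 411 change concrete, here is a minimal CPU-side sketch of the sequential "first entry per s-bucket wins" pass, with an illustrative `ProofTarget` type; the 2^15 bucket count matches the proof count mentioned below:

```rust
/// Illustrative proof target; the real one points back into the tables.
#[derive(Copy, Clone, Debug, PartialEq)]
struct ProofTarget(u32);

const NUM_S_BUCKETS: usize = 1 << 15;

/// Retain only the first proof target seen for each s-bucket instead of the
/// complete last table.
fn reduce_targets(
    candidates: impl Iterator<Item = (usize, ProofTarget)>,
) -> Vec<Option<ProofTarget>> {
    let mut targets = vec![None; NUM_S_BUCKETS];
    for (s_bucket, target) in candidates {
        if targets[s_bucket].is_none() {
            targets[s_bucket] = Some(target);
        }
    }
    targets
}

fn main() {
    let found = reduce_targets([(3, ProofTarget(7)), (3, ProofTarget(9))].into_iter());
    assert_eq!(found[3], Some(ProofTarget(7)));
}
```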
Note that while the numbers are a bit higher than before, this is no longer just table generation: it is the time to find all the proofs the farmer will need for record encoding. And now that the API is reduced in scope, more optimizations become possible (though they are not implemented yet). For example, sector encoding is actually done with a hash of the proof. Since we know we'll have exactly 2^15 proofs, we can use BLAKE3 SIMD to hash them much faster and return hashed proofs to the caller, which uses even less RAM and is much faster. Overall, several new optimization opportunities remain unimplemented, and a substantial chunk of the logic there is still sequential for now, yet it is already much faster than before I started all these optimizations.
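For a sense of what returning hashed proofs could look like, here is a simple sketch using the blake3 crate (assumed as a dependency); it hashes each proof independently, whereas the batched many-at-once SIMD hashing referred to above would need lower-level APIs:

```rust
// Assumes `blake3 = "1"` in Cargo.toml; only `blake3::hash` is used.

const NUM_PROOFS: usize = 1 << 15;
const PROOF_SIZE: usize = 64; // illustrative, not the real proof layout

fn main() {
    let proofs = vec![[0u8; PROOF_SIZE]; NUM_PROOFS];

    // Hash every proof and hand only the 32-byte hashes to the caller:
    // record encoding needs the hashes, so full proofs never leave this scope.
    let hashed: Vec<[u8; 32]> = proofs
        .iter()
        .map(|proof| *blake3::hash(proof).as_bytes())
        .collect();

    assert_eq!(hashed.len(), NUM_PROOFS);
}
```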
GPU implementation#
A lot of the changes above were driven by observing what is actually required by the protocol versus what is merely an implementation design decision that can be changed. Some things become much more efficient to implement on a GPU when the design is changed slightly, and it turns out most of those changes benefit the CPU as well. It is really beneficial to understand both the theoretical needs of the protocol and the nuances of what can be implemented efficiently on the CPU and GPU at the same time.
Overall, I have managed to implement all the pieces needed for plotting on the GPU.
PR 395 implemented a shader for sorting individual buckets. The CPU produces a deterministic order thanks to its single-threaded implementation (parallelization is not worth it there), but that is too costly on the GPU, so everything is placed in arbitrary order and sorted after the fact. Since the number of matches is small, with a hard upper bound below 512, I ended up using bitonic sort, and modified it in PR 396 to only use registers, which I think is a fairly neat implementation and should shine on a GPU.
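For reference, here is what a bitonic sorting network looks like, shown as plain CPU Rust over a fixed-size array rather than the actual shader code from the PRs:

```rust
/// Bitonic sorting network over a power-of-two-sized array. The
/// compare-and-swap schedule depends only on indices, never on the data,
/// which is what allows a GPU version to keep everything in registers.
fn bitonic_sort<const N: usize>(items: &mut [u32; N]) {
    debug_assert!(N.is_power_of_two());
    let mut k = 2;
    while k <= N {
        let mut j = k / 2;
        while j > 0 {
            for i in 0..N {
                let partner = i ^ j;
                // Visit each pair once; sort direction alternates per block.
                if partner > i {
                    let ascending = i & k == 0;
                    if (items[i] > items[partner]) == ascending {
                        items.swap(i, partner);
                    }
                }
            }
            j /= 2;
        }
        k *= 2;
    }
}

fn main() {
    let mut values = [7u32, 3, 1, 8, 5, 2, 6, 4];
    bitonic_sort(&mut values);
    assert_eq!(values, [1, 2, 3, 4, 5, 6, 7, 8]);
}
```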
To optimize the memory bandwidth, I fused the matching and `compute_fn` shaders together in PR 401, then similarly fused the `chacha8` and `compute_f1` shaders into one as well in PR 402.
I then introduced a `find_matches_and_compute_last` variant in PR 414 which, instead of grouping entries by buckets, stores proof targets as described above. However, in contrast to the CPU version, which is sequential and picks the first entry per s-bucket, this one also had to estimate an upper bound for the number of elements per s-bucket and reduce them to a single one at a later stage.
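Here is a CPU-side sketch of that two-phase shape: a bounded candidate list per s-bucket filled through atomic counters in arbitrary order, then reduced to one entry. The bound of 2 candidates and the "smallest wins" rule are made up for illustration:

```rust
use std::sync::atomic::{AtomicU32, Ordering};

const NUM_S_BUCKETS: usize = 1 << 15;
/// Assumed per-s-bucket candidate bound; the real shader has to pick an
/// upper bound that cannot overflow in practice.
const MAX_CANDIDATES: usize = 2;

fn main() {
    // Phase 1: append candidates per s-bucket via atomic counters, in
    // whatever order threads happen to produce them.
    let counters: Vec<AtomicU32> = (0..NUM_S_BUCKETS).map(|_| AtomicU32::new(0)).collect();
    let mut candidates = vec![[u32::MAX; MAX_CANDIDATES]; NUM_S_BUCKETS];

    let mut push = |s_bucket: usize, target: u32| {
        let slot = counters[s_bucket].fetch_add(1, Ordering::Relaxed) as usize;
        if slot < MAX_CANDIDATES {
            candidates[s_bucket][slot] = target;
        }
    };
    push(42, 9);
    push(42, 7);

    // Phase 2: reduce each s-bucket's candidates to a single proof target
    // (smallest value here, standing in for the real selection rule).
    let targets: Vec<u32> = candidates.iter().map(|c| *c.iter().min().unwrap()).collect();
    assert_eq!(targets[42], 7);
}
```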
And with all of those, PR 415 finally implemented the `find_proofs` shader, which looks at those s-bucket elements, reduces each to a single one, and finally generates proofs. It doesn't hash the proofs yet, which, as I mentioned above, would be a nice performance improvement. It also processes entries that have no proofs, which wastes some compute that could have been used better. But those are optimizations for another time.
Upcoming plans#
Now that all the primitives necessary for GPU plotting are present, what's left is to combine those shaders into a single pipeline. That should not be too hard, and then GPU plotting will be ready.
With that, I should be able to move the farmer and implement a version of local beacon chain block production shortly afterward. After that I'll probably get back to the database: it needs more work on upper-bound estimation to be able to run multiple shards in the same physical file/disk, which is crucial for the introduction of intermediate and leaf shards.
Hopefully future updates will be more entertaining; in the meantime, you can find me on Zulip.