In 1995 the Chinook program, built by a team of researchers at the University of Alberta, beat the world champion in checkers. A key component of their success was building the 8 piece endgame database. In June 2003 the same team computed the 10 piece (5×5) endgame database.
In 2001 we started building endgame databases. Since the 10 piece database had already been computed, we immediately set our sights on computing the 11 piece database. That computation took years, and unfortunately the 11 piece database by itself is not very useful, since the side with 6 pieces typically wins. So we moved on to the 12 piece database.
Number of pieces | Number of positions | Size on disk (positions per byte) |
---|---|---|
8 | 132,537,738,625 | 2.14 GB (62) |
10 | 10,906,624,518,065 | 87.8 GB (124) |
12 | 528,192,915,748,705 | 2.83 TB (187) |
As can be seen in the table above, the number of positions in the 12 piece database is quite daunting. Another problem is that when you add pieces to the board, it takes longer to compute the value of each position in the database! The reason for this is quite technical and is given at the end of this article.
Let’s take a look at our generator. The database is subdivided into slices, where each slice is computed in one go together with its mirror slice (which holds the colour-reversed results).
The generator goes through different phases. First the generator iterates only the jumping moves of each position to see whether the position can be resolved; then it iterates only the moves that advance the most advanced man. These two phases only need to be run once, since they only look at positions from previously generated slices and that data doesn’t change. The next phase (non-advancing) can then look only at moves that lead to positions in the slice currently being generated. This phase is repeated until no positions have changed value, at which point we can assume that the remaining unresolved positions are drawn.
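As an illustration of why the first two phases only need a single pass, a slice can be pictured as a key made of the material on the board plus the rank of each side’s most advanced man. This is an assumed scheme for the sketch below, not necessarily the exact subdivision used by the generator:

```java
// Hypothetical slice key (an assumption for illustration, not the generator's actual scheme):
// the material of both sides plus the rank of each side's most advanced man.
// A jump changes the piece counts and an advancing move changes a leading rank,
// so both kinds of moves always lead out of the slice into previously generated ones.
record SliceKey(int whiteMen, int whiteKings, int blackMen, int blackKings,
                int whiteLeadingRank, int blackLeadingRank) { }
```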
Performance
As we can see in the above video, we are running on a 10 year old Intel Xeon computer but still getting quite solid results. We are able to resolve approximately 2 million positions per second, including compression, writing to disk and a very expensive verification.
During the complete computation shown above we are performing 21,072,172 database lookups per second!
All of the code is written in Java, originally in version 1.2 but upgraded over the years, and it is now running on version 17. The total code base is well over 100,000 lines of code. Upgrading has taken zero effort. The software runs on Windows, Linux and Mac, and no effort has been required to make the program run on the different operating systems. It is very clear that the common myths of “write once, debug everywhere” or of Java being slow have very little to do with reality.
Storage
The table at the beginning of the article shows that, for the 12 piece database, each byte on disk holds the results of 187 positions! Let’s take a look at how that is possible. It is possible because only the result of each position is stored, not the position itself. When a result is looked up, an index function is typically used that maps each position to one and only one unique index. There are 4 possible results for each position: win, loss, draw or unknown. All positions in a slice start out as unknown, and when generation is finished no positions should be unknown. With 4 different results, 2 bits are required to store one result. This means that, uncompressed, the database fills the number of positions / 4 bytes (each byte can hold 4 results).
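As a rough illustration of the 2-bits-per-result packing, a minimal sketch could look like the following (the class and constant names are made up for illustration, and the real database is of course far too large for a single Java array):

```java
// Minimal sketch of a 2-bit-per-position result table (illustrative only).
public final class ResultTable {
    public static final byte UNKNOWN = 0, WIN = 1, LOSS = 2, DRAWN = 3;

    private final byte[] data;

    public ResultTable(long positionCount) {
        this.data = new byte[(int) ((positionCount + 3) / 4)];   // four results per byte, rounded up
    }

    public byte lookup(long index) {
        int shift = (int) (index & 3) * 2;                       // which 2-bit field inside the byte
        return (byte) ((data[(int) (index >>> 2)] >> shift) & 3);
    }

    public void resolve(long index, byte result) {
        int byteIndex = (int) (index >>> 2);
        int shift = (int) (index & 3) * 2;
        data[byteIndex] = (byte) ((data[byteIndex] & ~(3 << shift)) | (result << shift));
    }
}
```

Uncompressed this gives 4 positions per byte; the compression described next is what pushes it to well over a hundred positions per byte.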
Each slice is compressed before it is written to disk. A trick that is usually employed is to set the result of every position that has jumping/advancing moves to the same value as the previous/next result. That increases the compression, but every time you need to look up a position with jumping/advancing moves you have to do a mini search, and possibly look up many positions, to determine its value. We apply two layers of compression to achieve the compression rates stated at the beginning of the article.
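In the same pseudocode style as the code section at the end of the article, such a lookup with a fallback mini search could look roughly like this. Only the jump case is shown, and the Database type and the hasJumpMoves/generateJumpMoves helpers are assumptions rather than the generator’s actual API:

```java
// Illustrative only: positions with jump moves store a filler value for better compression,
// so their real result is reconstructed with a small search over the forced captures,
// which land in smaller, already computed slices.
byte probe(Position position, Database database) {
    if (!position.hasJumpMoves()) {
        return database.lookup(computeIndex(position));   // the stored result is the real one
    }
    boolean allWins = true;
    for (Move jump : position.generateJumpMoves()) {
        position.applyMove(jump);
        byte childResult = probe(position, database);     // the child may itself have further captures
        position.unapplyMove(jump);
        if (childResult == LOSS) return WIN;              // a losing reply means this position is won
        if (childResult != WIN) allWins = false;
    }
    return allWins ? LOSS : DRAWN;
}
```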
Scalability
CPU cores | Jump (s) | Advancing (s) | Non-advancing (s) | Compression (s) | Verification (s) |
---|---|---|---|---|---|
2 | 24 | 17 | 38 | 14 | 62 |
10 | 11 | 10 | 11 | 5 | 44 |
20 | 11 | 10 | 5 | 2 | 37 |
40 | 11 | 7 | 4 | 1 | 28 |
The above table shows how much time (in seconds) is spent in each phase of the generation of a specific slice, given that the generator is allowed X CPU cores to compute the slice. The computer is a dual-CPU Xeon system with 10 cores per CPU. When running 2 threads we are running a single thread per CPU, and each CPU runs at a higher clock speed than when running with more threads. When running 40 threads we are utilizing hyper-threading, so we don’t expect to scale as well above 20 threads.
Each thread “owns” a sub range of the indexes stored in RAM. For instance, for a slice with 10,000 positions and 10 threads, thread 1 owns indexes 0-999, thread 2 owns indexes 1000-1999, and so on. When compressing we don’t share any data with the other threads, so we expect linear scaling, which we also get. In the other phases we constantly have to read from the other threads’ ranges. This means data needs to be synchronized and there is competition for resources like caches. This is especially the case for the non-advancing phase, and heavy optimization has been done to achieve the quite acceptable scaling we get there.
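A minimal sketch of that partitioning is shown below, assuming a hypothetical range-taking variant of the non-advancing pass and a positionCount accessor (neither is the generator’s actual API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch only: split the slice's index range into one sub range per thread.
int runNonAdvancingPass(Slice slice, int threads) throws Exception {
    long positions = slice.positionCount();
    long chunk = (positions + threads - 1) / threads;      // round up so every index is owned by exactly one thread
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<Integer>> pending = new ArrayList<>();
    for (int t = 0; t < threads; t++) {
        long from = t * chunk;
        long to = Math.min(from + chunk, positions);       // exclusive upper bound of this thread's range
        pending.add(pool.submit(() -> resolveNonAdvancingMoves(slice, from, to)));
    }
    int changed = 0;
    for (Future<Integer> f : pending) changed += f.get();  // wait for all sub ranges and sum resolved counts
    pool.shutdown();
    return changed;
}
```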
The jumping and advancing phases don’t scale very well when given more CPU resources. The reason is that these phases need to load data from previously computed slices from the same disk(s). Also, the generator was restarted between each of these measurement runs. In practice, when the generator continuously generates slice after slice, these phases become faster once 5-10 slices have been generated, because a lot of the data has already been loaded and cached during the computation of a previous slice.
Still, it is well worth noting that both the jumping phase and the advancing phase take longer than the non-advancing phase. This is despite the fact that these phases only need to iterate over the positions of the slice once, whereas the non-advancing phase, which does the bulk of the work, has to iterate 20-50 times. One reason is that as the non-advancing phase is repeated, fewer and fewer positions need to be processed (because they are already resolved). Another reason is the heavy optimization of that phase mentioned earlier; some of those optimizations cannot be applied in the earlier phases.
Finally, the elephant in the room is how much time is spent verifying the slice. It is interesting to see how otherwise reliable hardware starts to fail after having been on at 100% CPU load for some months or years. Windows starts to refuse to open more files after 6 months, and so on. There are also, of course, many interesting anecdotes in the history of computing of a cosmic ray flipping a single unfortunate bit, leading to data corruption.
Originally, the verification only checked that the positions in the slice were consistent with the other values in the same slice (not with values in other slices). Unfortunately, after the generation of the database was finished, the occasional erroneous value was discovered. It is not the greatest of feelings when 9 years have been spent generating the database, only to find out there were errors. Fortunately it was only the last third of the database that had errors, but it took 3 extra years to regenerate.
Moral of the story: don’t skimp on verification!
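In the same pseudocode style as the code below, a fuller verification could in spirit look like this: recompute every stored result from the position’s successors, this time also following moves into other, previously generated slices. The Database type and the resultFromSuccessors helper are assumptions, not the actual verification code:

```java
// Illustrative verification sketch: recompute each stored result from its successors,
// including successors that live in other, previously generated slices, and compare.
void verify(Slice slice, Database database) {
    for (Position position : slice) {
        long index = computeIndex(position);
        byte stored = slice.lookup(index);
        byte recomputed = resultFromSuccessors(position, database);   // hypothetical helper
        if (stored != recomputed) {
            throw new IllegalStateException("Inconsistent result at index " + index);
        }
    }
}
```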
Code
Let’s take a look at some pseudo code for generating an endgame database:
void generate(Slice slice) {
    resolveJumpMoves(slice);
    resolveAdvancingMoves(slice);
    int positionsChanged = Integer.MAX_VALUE;
    while (positionsChanged > 0) {
        positionsChanged = resolveNonAdvancingMoves(slice);
    }
    convertUnknownsToDraws(slice);    // anything still unresolved is a draw
    slice.compress();
    slice.writeToDisk();
}

int resolveNonAdvancingMoves(Slice slice) {
    int positionsChanged = 0;
    for (Position position : slice) {
        long parentIndex = computeIndex(position);
        byte parentResult = slice.lookup(parentIndex);
        if (parentResult != UNKNOWN) continue;          // already resolved in an earlier pass

        boolean allWins = true;
        var moves = position.generateMoves();
        for (Move move : moves) {
            position.applyMove(move);
            long index = computeIndex(position);
            position.unapplyMove(move);

            byte result = slice.lookup(index);
            if (result == LOSS) {                       // a losing reply: this position is a win
                slice.resolve(parentIndex, WIN);
                positionsChanged++;
                allWins = false;
                break;
            }
            if (result == UNKNOWN || result == DRAWN) {
                allWins = false;
            }
        }
        if (allWins) {                                  // every reply wins for the opponent: a loss
            slice.resolve(parentIndex, LOSS);
            positionsChanged++;
        }
    }
    return positionsChanged;
}

long computeIndex(Position position) {
    long index = 0;
    var pieces = position.getPieces();
    for (Piece piece : pieces) {
        // do something expensive
    }
    return index;
}
As you can see, the computeIndex method is called a lot in the innermost loop, and it is typically a quite expensive method that becomes more expensive the more pieces you have. Also, the more pieces you have, the more legal moves there typically are. These two factors mean that it is significantly more expensive to compute each position the more pieces there are on the board.
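To give a feel for why an index function gets slower with more pieces, a typical perfect index ranks the combination of squares a group of identical pieces stands on, roughly as in the simplified sketch below. This is a generic combinatorial ranking, not the index function actually used, which also has to handle men, kings and both colours separately:

```java
// Simplified illustration: rank a set of occupied squares among all C(squares, pieces)
// possible placements (the combinatorial number system). The work grows with the
// number of pieces, and real index functions are considerably more involved.
long rankPlacement(int[] occupiedSquares) {            // squares sorted in ascending order
    long index = 0;
    for (int i = 0; i < occupiedSquares.length; i++) {
        index += binomial(occupiedSquares[i], i + 1);  // count the placements that come earlier
    }
    return index;
}

long binomial(int n, int k) {
    if (k < 0 || k > n) return 0;
    long result = 1;
    for (int i = 1; i <= k; i++) {
        result = result * (n - k + i) / i;             // exact: each intermediate result is a binomial coefficient
    }
    return result;
}
```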
I hope you found this article useful – please direct any questions to sune.thrane@gmail.com