I spent Thursday and Friday evenings and most of today working on a hierarchical cluster analysis of Chess960 starting positions. It’s really only a preliminary analysis, sort of a dry run to make sure that the hierarchical cluster analysis software I’ve been writing works. But I thought I’d share my methodology and results as a prelude to the real analysis, which needs some thought as to the correct approach.
So, many of you are probably thinking “What’s hierarchical cluster analysis?” And others of you are probably thinking “Oh God, I hope he doesn’t explain hierarchical cluster analysis.” The second group can skip a couple paragraphs. Here’s how it works. You start with a bunch of observations. You figure out a method for determining a distance between the observations. You treat each observation as a cluster of one. You calculate all the distances between the clusters. Take the two clusters that are closest to each other and join them together in a new cluster. Repeat until you only have one cluster. Looking back at the ways you combined the clusters gives you a hierarchy of clusters. There is a wrinkle. Determining the distance between two clusters of one observation each is easy: it’s just the distance between the two observations. What if one cluster has five observations and the other eight? You could take the smallest distance between two observations, the largest, the average. There are all sorts of methods with different biases, but we don’t need to go into the details here.
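For the readers who didn’t skip ahead, here is a minimal Python sketch of the procedure just described. It is not the software I wrote; the names `agglomerate`, `linkage`, and `average` are made up for illustration, and the brute-force search over all pairs is only sensible for small inputs. The `linkage` argument is the "wrinkle": pass `min` for the smallest distance, `max` for the largest, or `average` for the mean.

```python
from itertools import combinations

def agglomerate(observations, distance, linkage=min):
    """Naive agglomerative clustering; returns the merges in the order made."""
    # Each cluster is a tuple of observation indices; start with clusters of one.
    clusters = [(i,) for i in range(len(observations))]
    merges = []

    def cluster_distance(a, b):
        # Combine every pairwise observation distance with the linkage rule.
        return linkage(distance(observations[i], observations[j])
                       for i in a for j in b)

    while len(clusters) > 1:
        # Find the two closest clusters and join them into a new cluster.
        a, b = min(combinations(clusters, 2),
                   key=lambda pair: cluster_distance(*pair))
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(a + b)
        merges.append((a, b))
    return merges

def average(values):
    values = list(values)
    return sum(values) / len(values)

# Toy one-dimensional data, purely illustrative.
points = [0.0, 1.0, 5.0, 6.5]
merges = agglomerate(points, lambda x, y: abs(x - y), linkage=min)
merges_avg = agglomerate(points, lambda x, y: abs(x - y), linkage=average)
```

Reading `merges` from first to last gives the hierarchy: the two leftmost points join first, then the two rightmost, then the two pairs join each other.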
How did I do this particular hierarchical cluster analysis? The observations are the positions. I used the information in the CCRL 404 FRC database to determine the distribution of pawn moves on the first move. The relative frequencies (proportions) of the 16 possible pawn moves are treated as a point inside a 16-dimensional unit hypercube. The distance between two observations is the distance between those two points. For the distance between clusters I created a center for each cluster, which is the point used for distance calculations involving the cluster. The center is determined by summing the absolute frequencies (actual number of times played) of the 16 possible pawn moves over every observation in the cluster, and then calculating the relative frequencies from those totals. The main problems here are that I’m only using the first move (by white), which isn’t going to give a full picture of how the opening position forms; and that I’m only using pawn moves. The problem with using piece moves is that the pieces are randomized. Nf3 may be a common early move for standard Chess, but in most Chess960 positions Nf3 isn’t even legal until a few moves into the game. I’m trying to think of some way to generalize the piece moves so that using them doesn’t overly bias towards specific positions.
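To make the point and center calculations concrete, here is a small Python sketch. The function names and the move counts are hypothetical, and the sketch assumes straight-line (Euclidean) distance, which is one natural reading of "the distance between those two points" but is an assumption here.

```python
import math

# The 16 possible first pawn moves for white: a3, a4, b3, b4, ..., h3, h4.
PAWN_MOVES = [f + r for f in "abcdefgh" for r in "34"]

def counts_from(moves):
    """Turn a {move: count} dict into a 16-element count vector."""
    return [moves.get(m, 0) for m in PAWN_MOVES]

def proportions(counts):
    """Relative frequencies; assumes at least one pawn move was recorded."""
    total = sum(counts)
    return [c / total for c in counts]

def euclidean(p, q):
    # Assumed metric: ordinary straight-line distance in 16 dimensions.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(p, q)))

def center(cluster_counts):
    """Sum the raw counts over the cluster, then convert to proportions."""
    totals = [sum(column) for column in zip(*cluster_counts)]
    return proportions(totals)

# Made-up counts for two positions, not taken from the CCRL data.
pos_a = counts_from({"e4": 30, "d4": 10})
pos_b = counts_from({"e4": 10, "d4": 10})
dist = euclidean(proportions(pos_a), proportions(pos_b))
c = center([pos_a, pos_b])
e4_share = c[PAWN_MOVES.index("e4")]
```

Because the center is built from summed counts, positions with more games pull it harder. In the toy example the e4 share of the center is 40 out of 60, not the simple average of 75% and 50%.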
Once I had the hierarchy of clusters, I pulled in the data from my Chess960 Almanac. For any feature that seemed reasonable I calculated a percentage for it. For binary features this was just the percentage of positions in the cluster for which it was true. For other features it was the percentage of positions in the cluster sharing the most common value of that feature. Then I looked at the last 20 or so clusters formed, and checked to see which ones had high percentages for each feature. That is, what are the defining features of each cluster? I was able to find seven clusters that have relatively high percentages and cover about 70% of the possible positions. They are:
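The two kinds of percentages can be sketched in a few lines of Python. The function names and the sample feature values below are invented for illustration; they are not from the Almanac.

```python
from collections import Counter

def binary_percentage(flags):
    """Percentage of positions in the cluster for which the feature is true."""
    return 100.0 * sum(1 for f in flags if f) / len(flags)

def modal_percentage(values):
    """Most common value of a feature and the percentage of positions sharing it."""
    value, count = Counter(values).most_common(1)[0]
    return value, 100.0 * count / len(values)

# Hypothetical four-position cluster.
rook_on_a_file = [True, True, False, True]
queen_file = ["d", "d", "c", "e"]

rook_pct = binary_percentage(rook_on_a_file)
queen_mode, queen_pct = modal_percentage(queen_file)
```

Here `rook_pct` is 75.0 and the queen is most often on the d-file, at 50.0%.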
Cluster #1900 (The Normalish Cluster): This is the smallest of the clusters, with only 16 observations. Its defining features are a rook on the a-file 88% of the time, the king in the center (c-f file) 100% of the time, the queen on the d-file 25% of the time (twice the average), and different-colored knights 81% of the time. Top initial pawn moves are c3 (37%), e4 (17%), and d4 (12%).
Cluster #1901 (The Cavalry Charge Cluster): This cluster has 37 observations. Its defining features are a knight on the a-file 41% of the time, king on the d-file 41% of the time, and rooks 5 spaces apart 43% of the time. Top initial pawn moves are d3 (33%), d4 (19%), and g4 (18%).
Cluster #1902 (The Low Cluster): This cluster is the second biggest, with 180 observations. It doesn’t really have any defining features, but it was smack dab in the middle of the other ones, so I included it. Looking over it again, it’s really marked more by low percentages: queen on the d-file at 10%, rooks in or next to the corner 17%. Top initial pawn moves are f4 (49%), e4 (14%), and c4 (6%).
Cluster #1903 (The Patriarchal Cluster): This cluster has 99 observations. Its defining feature is bishops right next to each other (66%). The name is a nod to Harry Osh and his names for the bishop pairs. Top initial pawn moves are c4 (52%), b3 and d4 (9% each).
Cluster #1904 (The Just Right Cluster): This cluster is the medium-sized one, with 94 observations. Its defining feature is knights 6 spaces apart (86%). Note that 6 spaces is the distance between knights in the standard position. Top initial pawn moves are g3 (26%), b3 (21%), and g4 (18%). And, yes, b4 is the fourth likeliest initial pawn move at 15%.
Cluster #1905 (The Lefty Cluster): This cluster is the second smallest, with only 18 observations. Its defining features are bishop on the a-file (94%), queen on the c-file (61%), king on d- or e-file (72%), and knights next to each other (50%). This cluster has the distinction of being the last one formed from a cluster of size one: the cluster for QRBNKRNB. So in terms of initial pawn moves, QRBNKRNB is the weirdest starting position of all. Top initial pawn moves (for the whole cluster) are b4 (61%), b3 (16%), and g3 (7%).
Cluster #1906 (The Fat Rook Cluster): This cluster is the largest, with 229 observations. Its defining features are a rook on the h-file (56%), and both rooks in the corner or next to it (50%). It also has the distinction of containing the standard start position: RNBQKBNR. The top pawn moves for this cluster are e4 (47%), d4 (12%), and e3 (10%).
One thing I found interesting is that (based on initial pawn moves) the closest position to the standard start position is RNKQBBNR. In retrospect it makes sense if opening up your diagonal pieces is a primary reason for pawn moves. In RNKQBBNR, e4 still opens up a queen and a bishop, and d4 still opens up the other bishop. For this cluster of two, e4 is the top move by far (87%). I was kind of expecting the standard start position to pair with RNBKQBNR, its mirror image. That one instead pairs up with RNBKNBQR (top pawn moves are d4 (79%) and e4 (16%)).
I’m not sure if that’s useful information, but then again, it’s not necessarily supposed to be. It was, however, a successful dry run. The distance functions (between observations and between clusters) are parameters, so I can change them easily by just writing new distance functions. There is one potential problem: if more than one pair of clusters is the same distance apart, they aren’t handled at the same time, as they should be. Certainly I need to think about how to handle piece moves, so that I can bring in a more accurate picture of the position after more than just one move. I also may try to map more features, especially relating pawns to the position. How often is a pawn next to a bishop or in front of a rook moved?
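On the tie problem, one possible first step is simply detecting when a tie occurs. The Python sketch below is only an illustration, not something already in my software: `closest_pairs` and its `tol` parameter are hypothetical names, and it assumes the ties that matter are ties for the smallest distance. What to do with tied pairs once they are found is still an open question.

```python
from itertools import combinations
import math

def closest_pairs(clusters, cluster_distance, tol=1e-12):
    """Return every pair of clusters tied (within tol) for the smallest distance."""
    pairs = list(combinations(clusters, 2))
    dists = [cluster_distance(a, b) for a, b in pairs]
    best = min(dists)
    return [pair for pair, d in zip(pairs, dists)
            if math.isclose(d, best, abs_tol=tol)]

# Toy example: two pairs of points are each exactly 1.0 apart.
tied = closest_pairs([0.0, 1.0, 3.0, 4.0], lambda a, b: abs(a - b))
```

In the toy example `tied` holds both (0.0, 1.0) and (3.0, 4.0), which a merge-one-pair-at-a-time loop would treat in an arbitrary order.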