Recently I’ve been doing some work with the t-digest sketching algorithm, from the paper by Ted Dunning and Otmar Ertl. One of the appealing properties of t-digest sketches is that you can “add” them together in the monoid sense to produce a combined sketch from two separate sketches. This property is crucial for sketching data across data partitions in scale-out parallel computing platforms such as Apache Spark or MapReduce.
In the original paper, Dunning and Ertl describe an algorithm for monoidal combination of t-digests based on randomized cluster recombination: the clusters of the two input sketches are pooled, randomly shuffled, and then inserted into a new t-digest in that randomized order. In Scala code, this algorithm might look like the following:
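The actual implementation is not reproduced in this post, but the idea can be sketched as follows. Everything here is a simplified stand-in rather than a real t-digest: this toy `TDigest` keeps `(center, mass)` clusters and merges an inserted point into the nearest cluster when it lies within a fixed distance `delta`, whereas a real t-digest bounds each cluster's mass by a quantile-dependent limit. The names `TDigest`, `insert`, and `plusRandom` are my own.

```scala
import scala.util.Random

// Toy t-digest stand-in: clusters are (center, mass) pairs, kept sorted by
// center. Insertion merges into the nearest cluster when it is within
// `delta`; otherwise a new cluster is started. (A real t-digest instead
// bounds cluster mass by a quantile-dependent limit.)
case class TDigest(delta: Double, clusters: Vector[(Double, Double)]) {
  def insert(x: Double, w: Double): TDigest =
    if (clusters.isEmpty) TDigest(delta, Vector((x, w)))
    else {
      val j = clusters.indices.minBy(i => math.abs(clusters(i)._1 - x))
      val (c, m) = clusters(j)
      if (math.abs(c - x) <= delta)
        // merge into the nearest cluster, updating its weighted center
        TDigest(delta, clusters.updated(j, ((c * m + x * w) / (m + w), m + w)))
      else
        TDigest(delta, (clusters :+ ((x, w))).sortBy(_._1))
    }
  def mass: Double = clusters.map(_._2).sum
}

// Randomized recombination in the spirit of the Dunning/Ertl algorithm:
// pool the clusters of both sketches, shuffle them, and re-insert them
// into an empty sketch in that random order.
def plusRandom(ltd: TDigest, rtd: TDigest,
               rng: Random = new Random(42)): TDigest = {
  val pooled = rng.shuffle((ltd.clusters ++ rtd.clusters).toList)
  pooled.foldLeft(TDigest(ltd.delta, Vector.empty)) {
    case (td, (c, m)) => td.insert(c, m)
  }
}
```

Because the pooled clusters are re-inserted in random order, two evaluations of `plusRandom` with different seeds can produce different (though statistically similar) combined sketches.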
I implemented this algorithm and used it until I noticed that a sum over multiple sketches seemed to behave noticeably differently from either the individual inputs or the nominal underlying distribution.
To get a closer look at what was going on, I generated some random samples from a Normal distribution ~N(0,1). I then generated t-digest sketches of each sample, took a cumulative monoid sum, and kept track of how closely each successive sum adhered to the original ~N(0,1) distribution. As a measure of the difference between a t-digest sketch and the original distribution, I computed the Kolmogorov–Smirnov D-statistic, which yields a distance between two cumulative distribution functions. (Code for my data collections can be viewed here.) I ran multiple data collections and subsequent cumulative sums, and used those multiple measurements to generate the following boxplot. The result was surprising and a bit disturbing:
As the plot shows, the t-digest sketch distributions gradually diverge from the underlying “true” distribution ~N(0,1). This is a potentially significant problem for the stability of monoidal t-digest sums, and by extension for any parallel sketching based on combining the partial sketches on data partitions in MapReduce-like environments.
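The D-statistic measurement described above can be sketched as follows. The names `normalCdf`, `empiricalCdf`, and `ksD` are my own: the normal CDF uses a standard Abramowitz–Stegun-style erf approximation, a plain empirical CDF stands in for a t-digest's CDF, and the D-statistic is approximated by scanning a grid of evaluation points.

```scala
// Standard normal CDF via an Abramowitz–Stegun erf approximation
// (accurate to roughly 1e-7).
def normalCdf(x: Double): Double = {
  val z = math.abs(x) / math.sqrt(2.0)
  val t = 1.0 / (1.0 + 0.3275911 * z)
  val poly = t * (0.254829592 + t * (-0.284496736 + t * (1.421413741 +
    t * (-1.453152027 + t * 1.061405429))))
  val erf = 1.0 - poly * math.exp(-z * z)
  if (x >= 0) 0.5 * (1.0 + erf) else 0.5 * (1.0 - erf)
}

// Empirical CDF of a sample: fraction of points <= x.
def empiricalCdf(sample: Vector[Double])(x: Double): Double =
  sample.count(_ <= x).toDouble / sample.size

// Approximate Kolmogorov–Smirnov D-statistic: the maximum absolute
// difference between two CDFs, scanned over a grid on [lo, hi].
def ksD(cdf1: Double => Double, cdf2: Double => Double,
        lo: Double, hi: Double, n: Int = 1000): Double =
  (0 to n).map { i =>
    val x = lo + (hi - lo) * i / n
    math.abs(cdf1(x) - cdf2(x))
  }.max
```

A grid scan slightly underestimates the true supremum, but it is a simple and serviceable proxy for tracking divergence over successive sums.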
Seeing this divergence motivated me to think about ways to avoid it. One property of t-digest insertion logic is that the results of inserting new data can differ depending on what clusters are already present. I wondered if the results might be more stable if the largest clusters were inserted first. The t-digest algorithm allows clusters closest to the distribution median to grow the largest, so combining input clusters from largest to smallest would be like building the combined distribution from the middle outward, toward the distribution tails. In the case where one t-digest had larger weights, it would also roughly approximate inserting the smaller sketch into the larger one. In Scala code, this alternative monoid addition looks like so:
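A sketch of this alternative addition, again using a toy `TDigest` of `(center, mass)` clusters as a hypothetical stand-in (a real t-digest bounds cluster mass by a quantile-dependent limit, and the names here are my own):

```scala
// Toy t-digest stand-in: clusters are (center, mass) pairs, kept sorted by
// center; insertion merges into the nearest cluster when within `delta`.
case class TDigest(delta: Double, clusters: Vector[(Double, Double)]) {
  def insert(x: Double, w: Double): TDigest =
    if (clusters.isEmpty) TDigest(delta, Vector((x, w)))
    else {
      val j = clusters.indices.minBy(i => math.abs(clusters(i)._1 - x))
      val (c, m) = clusters(j)
      if (math.abs(c - x) <= delta)
        TDigest(delta, clusters.updated(j, ((c * m + x * w) / (m + w), m + w)))
      else
        TDigest(delta, (clusters :+ ((x, w))).sortBy(_._1))
    }
  def mass: Double = clusters.map(_._2).sum
}

// Alternative monoid addition: pool the clusters of both sketches and
// re-insert them from largest mass to smallest, building the combined
// distribution from the median outward. Note the insertion order is now
// deterministic -- no shuffling is involved.
def plusLargeToSmall(ltd: TDigest, rtd: TDigest): TDigest =
  (ltd.clusters ++ rtd.clusters)
    .sortBy { case (_, m) => -m }
    .foldLeft(TDigest(ltd.delta, Vector.empty)) {
      case (td, (c, m)) => td.insert(c, m)
    }
```

The only change relative to the randomized version is replacing the shuffle with a sort on descending cluster mass.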
As a second experiment, for each data sampling I compared the original monoid addition with the alternative method using largest-to-smallest cluster insertion. When I plotted the resulting progression of D-statistics side by side, the results were surprising:
As the plot demonstrates, not only was largest-to-smallest insertion more stable, its D-statistics appeared to be getting smaller instead of larger. To see if this trend was sustained over longer cumulative sums, I plotted the D-statistics for cumulative sums over 100 samples:
The results were even more dramatic: these longer sums show that the standard randomized-insertion method continues to diverge, but with largest-to-smallest insertion the cumulative t-digest sums continue to converge toward the underlying distribution!
To test whether this effect might depend on the particular shape of the distribution, I ran similar experiments using a Uniform distribution (no “tails”) and an Exponential distribution (one tail). I included the corresponding plots in the appendix. The convergence of this alternative monoid addition does not seem to be sensitive to the shape of the distribution.
I have upgraded my implementation of t-digest sketching to use this new definition of monoid addition for t-digests. Swapping one definition for the other is easy; one or two lines of code may be sufficient. I hope this idea proves useful for other t-digest implementations in the community. Happy sketching!