Consider how we annotate and refer to release builds for a Scala project: The version of Scala – 2.10, 2.11, etc – that was used to build the project is a qualifier for the release. For example, if I am building a project using Scala 2.11, and package P is one of my project dependencies, then the maven build tooling (or sbt, etc) looks for a version of P that was also built using Scala 2.11; the build will fail if no such incarnation of P can be located. This build constraint propagates recursively throughout the entire dependency tree for a project.
Now consider how we treat the version for the package P dependency itself: Our build tooling forces us to specify one exact release version x.y.z for P. This is superficially similar to the constraint for building with Scala 2.11, but unlike the Scala constraint, the knowledge about using P x.y.z is not propagated down the tree.
If the dependency for P appears only once in the dependency tree, everything is fine. However, as anybody who has ever worked with a large dependency tree for a project knows, package P might very well appear in multiple locations of the deptree, as a transitive dependency of different packages. Worse, these deps may be specified as different versions of P, which may be mutually incompatible.
Transitive dep incompatibilities are a particularly thorny problem to solve, but there are other annoyances related to release versioning. Often a user would like a “major” package dependency built against a particular version of that dep. For example, packages that use Apache Spark may need to work with a particular build version of Spark (2.1, 2.2, etc). If I am the package purveyor, I have no very convenient way to build my package against multiple versions of Spark, and then annotate those builds in Maven Central. At best I can bake the Spark version into the name. But what if I want to specify other package dep versions? Do I create package names with increasingly long lists of (package, version) pairs hacked into the name?
Finally, there is simply the annoyance of revving my own package purely for the purpose of building it against the latest versions of my dependencies. None of my code has changed, but I am cutting a new release just to pick up current dependency releases. And then hoping that my package users will want those particular releases, and that these won’t break their builds with incompatible transitive deps!
I have been toying with a release and build methodology for avoiding these headaches. What follows is full of vigorous handwaving, but I believe something like it could be formalized in a useful way.
The key idea is that a release build is defined by a build signature, which is the union of all (dep, ver) pairs. This includes:
(mypackage, 1.2.3)
(dep, ver) for all dependencies (taken over all transitive deps, recursively)
(tool, ver) for all impactful build tooling, e.g. (scala, 2.11), (python, 3.5), etc.
For example, if I maintain a package P, whose latest code release is 1.2.3, built with dependencies (A, 0.5), (B, 2.5.1) and (C, 1.7.8), and dependency B built against (Q, 6.7) and (R, 3.3), and C built against (Q, 6.7), and all compiled with (scala, 2.11), then the build signature will be:
{ (P, 1.2.3), (A, 0.5), (B, 2.5.1), (C, 1.7.8), (Q, 6.7), (R, 3.3), (scala, 2.11) }
Identifying a release build in this way makes several interesting things possible.
First, it can identify a build with a transitive dependency problem. For example, if C had been built against (Q, 7.0), then the resulting build signature would have two pairs for Q: (Q, 6.7) and (Q, 7.0), which is an immediate red flag for a potential problem.
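As a minimal sketch of the idea, assuming a build can be modeled as a small tree of (name, version) pairs, the signature and the red-flag check might be computed as follows. The names Dep, Build, signature, conflicts and example are illustrative only and do not correspond to any existing build tool.

```scala
object BuildSignature {
  // A (name, version) pair, e.g. Dep("Q", "6.7")
  final case class Dep(name: String, version: String)

  // A released build: its own identity, the builds it was compiled against,
  // and the impactful build tooling
  final case class Build(self: Dep, deps: List[Build], tools: Set[Dep])

  // Union of all (dep, ver) pairs over the whole dependency tree, recursively
  def signature(b: Build): Set[Dep] =
    b.deps.foldLeft(b.tools + b.self)((sig, d) => sig ++ signature(d))

  // Names that occur with more than one version in a signature
  def conflicts(sig: Set[Dep]): Map[String, Set[String]] =
    sig.groupBy(_.name).collect {
      case (name, ds) if ds.size > 1 => name -> ds.map(_.version)
    }

  // The package P from the text; qForC is the version of Q that C was built against
  def example(qForC: String): Build = {
    val tools = Set(Dep("scala", "2.11"))
    def leaf(n: String, v: String): Build = Build(Dep(n, v), Nil, tools)
    val b = Build(Dep("B", "2.5.1"), List(leaf("Q", "6.7"), leaf("R", "3.3")), tools)
    val c = Build(Dep("C", "1.7.8"), List(leaf("Q", qForC)), tools)
    Build(Dep("P", "1.2.3"), List(leaf("A", "0.5"), b, c), tools)
  }
}
```

Calling conflicts(signature(example("7.0"))) reports Q with both 6.7 and 7.0, while example("6.7") yields the seven-pair signature shown earlier and no conflicts.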
More intriguingly, it could provide a foundation for avoiding builds with incompatible dependencies. Suppose that I redefine my build logic so that I only specify dependency package names, and not specific versions. Whenever I build a project, the build system automatically searches for the most recent version of each dependency. This already addresses some of the release headaches above. As a project builder, I can get the latest versions of packages when I build. As a package maintainer, I do not have to rev a release just to update my package deps; projects using my package will get the latest by default. Moreover, because the latest package release is always pulled, I never get multiple incompatible dependency releases in a build.
Suppose that for some reason I need a particular release of some dependency. From the example above, imagine that I must use (Q, 6.7). We can imagine augmenting the build specification to allow overriding the default behavior of pulling the most recent release. We might either specify a specific version as we do currently, or possibly specify a range of releases, as systems like brew or ruby gemfiles allow. In the case where some constraint is placed on releases, this constraint would be propagated down the tree (or possibly up from the leaves), in essentially the same way that the constraint of Scala version already is. In the event that the total set of constraints over the whole dependency tree is not satisfiable, then the build will fail.
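One way to picture the satisfiability check is to collect every range constraint found in the tree and intersect the ranges per package. The sketch below assumes inclusive ranges and purely numeric dotted versions; VRange, range and resolve are hypothetical names, not part of any real build system.

```scala
object VersionConstraints {
  type Version = Vector[Int]

  // "2.5.1" -> Vector(2, 5, 1); assumes purely numeric components
  def parse(v: String): Version = v.split('.').toVector.map(_.toInt)

  // Component-wise numeric comparison, padding the shorter version with zeros
  def cmp(a: Version, b: Version): Int =
    a.zipAll(b, 0, 0)
      .map { case (x, y) => java.lang.Integer.compare(x, y) }
      .find(_ != 0)
      .getOrElse(0)

  // Inclusive version range [lo, hi]
  final case class VRange(lo: Version, hi: Version) {
    def intersect(that: VRange): Option[VRange] = {
      val l = if (cmp(lo, that.lo) >= 0) lo else that.lo
      val h = if (cmp(hi, that.hi) <= 0) hi else that.hi
      if (cmp(l, h) <= 0) Some(VRange(l, h)) else None
    }
  }

  def range(lo: String, hi: String): VRange = VRange(parse(lo), parse(hi))

  // Merge all constraints gathered over the tree.
  // Left(name) identifies a package whose constraints cannot all be met.
  def resolve(constraints: Seq[(String, VRange)]): Either[String, Map[String, VRange]] =
    constraints.foldLeft[Either[String, Map[String, VRange]]](Right(Map.empty)) {
      case (Right(acc), (name, r)) =>
        acc.get(name) match {
          case None       => Right(acc + (name -> r))
          case Some(prev) => prev.intersect(r).map(m => acc + (name -> m)).toRight(name)
        }
      case (failed, _) => failed
    }
}
```

A Left result corresponds to the failed build described above; a Right result gives the window from which the most recent available release could be chosen.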
With a build annotation system like the one I just described, one could imagine a new role for registries like Maven Central, where different builds are automatically cached. The registry could maybe even automatically run CI testing to identify the most recent versions of package dependencies that satisfy any given package build, or perhaps valid dependency release ranges.
To conclude, I believe that rethinking how we describe the dependencies used to build and annotate package releases, by generalizing release version to include the release version of all transitive deps (including build tooling as deps), may enable more flexible ways to both build software releases and specify them for pulling.
Happy Computing!
In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?”, asked Minsky. “I am training a randomly wired neural net to play Tic-Tac-Toe”, Sussman replied. “Why is the net wired randomly?”, asked Minsky. “I do not want it to have any preconceptions of how to play”, Sussman said. Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty.” At that moment, Sussman was enlightened.
Recently I’ve been doing some work with the t-digest sketching algorithm, from the paper by Ted Dunning and Otmar Ertl. One of the appealing properties of t-digest sketches is that you can “add” them together in the monoid sense to produce a combined sketch from two separate sketches. This property is crucial for sketching data across data partitions in scale-out parallel computing platforms such as Apache Spark or MapReduce.
In the original Dunning/Ertl paper, they describe an algorithm for monoidal combination of t-digests based on randomized cluster recombination. The clusters of the two input sketches are collected together, then randomly shuffled, and inserted into a new t-digest in that randomized order. In Scala code, this algorithm might look like the following:
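A minimal sketch of randomized recombination, written generically so that it does not assume any particular t-digest API. Here empty and insert are hypothetical stand-ins for an empty sketch and a cluster-insertion function from whatever implementation is in use, and a cluster is modeled simply as a (centroid, mass) pair.

```scala
import scala.util.Random

object RandomizedCombine {
  // A cluster modeled as (centroid, mass); an assumption made for illustration
  type Cluster = (Double, Double)

  // Collect the clusters of both inputs, shuffle them, and insert them
  // one by one into an initially empty sketch of type D
  def combine[D](ltd: Seq[Cluster], rtd: Seq[Cluster], empty: D)(
      insert: (D, Cluster) => D): D =
    Random.shuffle(ltd ++ rtd).foldLeft(empty)(insert)
}
```

With a real t-digest type, empty would be the empty digest and insert its weighted-update operation.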

I implemented this algorithm and used it until I noticed that a sum over multiple sketches seemed to behave noticeably differently than either the individual inputs, or the nominal underlying distribution.
To get a closer look at what was going on, I generated some random samples from a Normal distribution ~N(0,1). I then generated t-digest sketches of each sample, took a cumulative monoid sum, and kept track of how closely each successive sum adhered to the original ~N(0,1) distribution. As a measure of the difference between a t-digest sketch and the original distribution, I computed the Kolmogorov-Smirnov D-statistic, which yields a distance between two cumulative distribution functions. (Code for my data collections can be viewed here) I ran multiple data collections and subsequent cumulative sums and used those multiple measurements to generate the following boxplot. The result was surprising and a bit disturbing:
As the plot shows, the t-digest sketch distributions are gradually diverging from the underlying “true” distribution ~N(0,1). This is a potentially significant problem for the stability of monoidal t-digest sums, and by extension any parallel sketching based on combining the partial sketches on data partitions in map-reduce-like environments.
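For reference, the Kolmogorov-Smirnov D-statistic is the largest absolute gap between two cumulative distribution functions. The sketch below approximates it by evaluating both CDFs on a finite grid of points; the grid, and the idea of passing the sketch's CDF and the true CDF as plain functions, are assumptions of this example and not necessarily how the measurements above were collected.

```scala
object KSDistance {
  // Approximate sup_x |cdf1(x) - cdf2(x)| by the maximum over a finite grid.
  // A finer grid gives a closer approximation of the true supremum.
  def dStatistic(cdf1: Double => Double, cdf2: Double => Double, grid: Seq[Double]): Double = {
    require(grid.nonEmpty, "grid must contain at least one point")
    grid.map(x => math.abs(cdf1(x) - cdf2(x))).max
  }
}
```

For a t-digest, cdf1 would be the sketch's estimated CDF and cdf2 the CDF of the sampled distribution.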
Seeing this divergence motivated me to think about ways to avoid it. One property of t-digest insertion logic is that the results of inserting new data can differ depending on what clusters are already present. I wondered if the results might be more stable if the largest clusters were inserted first. The t-digest algorithm allows clusters closest to the distribution median to grow the largest. Combining input clusters from largest to smallest would be like building the combined distribution from the middle outwards, toward the distribution tails. In the case where one t-digest had larger weights, it would also somewhat approximate inserting the smaller sketch into the larger one. In Scala code, this alternative monoid addition looks like so:
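A minimal sketch of largest-to-smallest insertion, again generic over the sketch type. As before, empty and insert are hypothetical stand-ins for an actual t-digest implementation, and clusters are modeled as (centroid, mass) pairs.

```scala
object LargeToSmallCombine {
  // A cluster modeled as (centroid, mass); an assumption made for illustration
  type Cluster = (Double, Double)

  // Collect the clusters of both inputs, order them by decreasing mass,
  // and insert them in that order into an initially empty sketch of type D
  def combine[D](ltd: Seq[Cluster], rtd: Seq[Cluster], empty: D)(
      insert: (D, Cluster) => D): D =
    (ltd ++ rtd).sortBy { case (_, mass) => -mass }.foldLeft(empty)(insert)
}
```

The only difference from randomized recombination is the ordering step, a shuffle replaced by a sort on cluster mass.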

As a second experiment, for each data sampling I compared the original monoid addition with the alternative method using largest-to-smallest cluster insertion. When I plotted the resulting progression of D-statistics side-by-side, the results were surprising:
As the plot demonstrates, not only was large-to-small insertion more stable, its D-statistics appeared to be getting smaller instead of larger. To see if this trend was sustained over longer cumulative sums, I plotted the D-stats for cumulative sums over 100 samples:
The results were even more dramatic: these longer sums show that the standard randomized-insertion method continues to diverge, but in the case of large-to-small insertion the cumulative t-digest sums continue to converge towards the underlying distribution!
To test whether this effect might be dependent on particular shapes of distribution, I ran similar experiments using a Uniform distribution (no “tails”) and an Exponential distribution (one tail). I included the corresponding plots in the appendix. The convergence of this alternative monoid addition doesn’t seem to be sensitive to shape of distribution.
I have upgraded my implementation of t-digest sketching to use this new definition of monoid addition for t-digests. As you can see, it is easy to change one implementation for another. One or two lines of code may be sufficient. I hope this idea may be useful for any other implementations in the community. Happy sketching!
map operation is creating some nontrivial monoid that represents a single element of the input type. For example, if the monoidal type is Set[Int], then the mapping function (‘prepare’ in algebird) maps every input integer k into Set(k), which is somewhat expensive.
In that discussion, I was focusing on map-reduce as embodied by the algebird Aggregator type, where map appears as the prepare function. However, it is easy to see that any map-reduce implementation may be vulnerable to the same inefficiency.
I wondered if there were a way to represent map-reduce using some alternative formulation that avoids this vulnerability. There is such a formulation, which I will talk about in this post.
I’ll begin by reviewing a standard map-reduce implementation. The following Scala code sketches out the definition of a monoid over a type B and a map-reduce interface. As this code suggests, the map function maps input data of some type A into some monoidal type B, which can be reduced (aka “aggregated”) in a way that is amenable to parallelization:
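A minimal sketch of such a definition, assuming a bare-bones Monoid trait with an identity e and an associative combine operation (written ++ in the prose). The names Monoid, mapReduce, mapReducePar and mapFn are chosen for this illustration.

```scala
object MonoidMapReduce {
  trait Monoid[B] {
    def e: B                    // identity element
    def combine(x: B, y: B): B  // associative binary operation, "++" in the text
  }

  object Monoid {
    // Example instance: sets of integers under union
    implicit val intSetMonoid: Monoid[Set[Int]] = new Monoid[Set[Int]] {
      val e: Set[Int] = Set.empty[Int]
      def combine(x: Set[Int], y: Set[Int]): Set[Int] = x ++ y
    }
  }

  // Sequential map-reduce: map every element into B, then reduce
  def mapReduce[A, B](data: Seq[A])(mapFn: A => B)(implicit m: Monoid[B]): B =
    data.map(mapFn).foldLeft(m.e)(m.combine)

  // Partitioned map-reduce: each partition is mapped and reduced on its own
  // (potentially in parallel), followed by a final reduction of the partial results
  def mapReducePar[A, B](partitions: Seq[Seq[A]])(mapFn: A => B)(implicit m: Monoid[B]): B =
    partitions.map(p => mapReduce(p)(mapFn)).foldLeft(m.e)(m.combine)
}
```

Note that with B = Set[Int] the mapping function has to allocate a one-element Set(k) for every input k, which is the inefficiency described earlier.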

In the parallel version of map-reduce above, you can see that map and reduce are executed on each data partition (which may occur in parallel) to produce a monoidal B value, followed by a final reduction of those intermediate results. This is the classic form of map-reduce popularized by tools such as Hadoop and Apache Spark, where individual data partitions may reside across highly parallel commodity clusters.
Next I will present an alternative definition of map-reduce. In this implementation, the map function is replaced by a foldL function, which executes a single “left-fold” of an input object with type A into the monoid object with type B:
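A minimal sketch of this fold-based formulation, under the same bare-bones Monoid assumption as before. The function foldL is expected to obey the law foldL(b, a) = b ++ foldL(e, a); the code does not check that law, it only relies on it.

```scala
object FoldMapReduce {
  trait Monoid[B] {
    def e: B                    // identity element
    def combine(x: B, y: B): B  // associative binary operation, "++" in the text
  }

  object Monoid {
    implicit val intSetMonoid: Monoid[Set[Int]] = new Monoid[Set[Int]] {
      val e: Set[Int] = Set.empty[Int]
      def combine(x: Set[Int], y: Set[Int]): Set[Int] = x ++ y
    }
  }

  // foldL is assumed to obey: foldL(b, a) == combine(b, foldL(e, a))
  // Sequential version: a single left fold starting from the identity
  def mapReduce[A, B](data: Seq[A])(foldL: (B, A) => B)(implicit m: Monoid[B]): B =
    data.foldLeft(m.e)(foldL)

  // Partitioned version: fold each partition, then reduce the partial results
  def mapReducePar[A, B](partitions: Seq[Seq[A]])(foldL: (B, A) => B)(implicit m: Monoid[B]): B =
    partitions.map(p => mapReduce[A, B](p)(foldL)).foldLeft(m.e)(m.combine)

  // For B = Set[Int]: insert the element directly, with no intermediate Set(a)
  val insertInt: (Set[Int], Int) => Set[Int] = (b, a) => b + a
}
```

The sequential operation is now one foldLeft; only the final reduction across partitions still uses the monoid's combine.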

As the comments above indicate, the left-folding function foldL is assumed to obey the law foldL(b, a) = b ++ foldL(e, a). This law captures the idea that folding a into b should be the analog of reducing b with a monoid corresponding to the single element a.
Referring to my earlier example, if type A is Int and B is Set[Int], then foldL(b, a) => b + a. Note that b + a is directly inserting single element a into b, which is significantly more efficient than b ++ Set(a), which is how a typical map-reduce implementation would be required to operate.
This law also gives us the corresponding definition of map(a), which is foldL(e, a), or in my example: Set.empty[Int] + a or just: Set(a).
In this formulation, the basic map-reduce operation is now a single foldLeft operation, instead of a mapping followed by a monoidal reduction. The parallel version is analogous. Each partition uses the new foldLeft operation, and the final reduction of intermediate monoidal results remains the same as before.
The foldLeft function is potentially a much more general operation, and it raises the question of whether this new encoding is indeed parallelizable as before. I will conclude with a proof that this encoding is also parallelizable. Note that the law foldL(b, a) = b ++ foldL(e, a) is a significant component of this proof, as it represents the constraint that foldL behaves like an analog of reducing b with a monoidal representation of element a.
In the following proof I used a Scala-like pseudo code, described in the introduction:
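A hedged sketch of how such an argument can go, written as an equational derivation. The notation is an assumption of this sketch: ⊕ stands for the monoid operation ++, F(e, X) for the left fold of foldL over the sequence X starting from the identity e, m(a) for foldL(e, a), and X ∥ Y for the concatenation of two partitions. Only the monoid laws (identity and associativity) and the foldL law are used.

```latex
\begin{align*}
\text{Let } m(a) &:= \mathrm{foldL}(e, a), \text{ so the law reads } \mathrm{foldL}(b, a) = b \oplus m(a). \\[4pt]
F(e, [\,]) &= e \\
F(e, [a_1, \dots, a_n]) &= \mathrm{foldL}\bigl(F(e, [a_1, \dots, a_{n-1}]),\, a_n\bigr) \\
  &= F(e, [a_1, \dots, a_{n-1}]) \oplus m(a_n) && \text{(foldL law)} \\
  &= e \oplus m(a_1) \oplus \cdots \oplus m(a_n) && \text{(induction on } n\text{)} \\[4pt]
F(e, X \mathbin{\|} Y) &= \bigl(e \oplus m(x_1) \oplus \cdots \oplus m(x_j)\bigr) \oplus \bigl(m(y_1) \oplus \cdots \oplus m(y_k)\bigr) && \text{(associativity)} \\
  &= F(e, X) \oplus \bigl(e \oplus m(y_1) \oplus \cdots \oplus m(y_k)\bigr) && \text{(identity)} \\
  &= F(e, X) \oplus F(e, Y)
\end{align*}
```

The last line says that folding each partition separately and then reducing the partial results with ⊕ gives the same value as one fold over all the data, which is the parallelizability property. The argument extends to any number of partitions by repeating the same step.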

This immediately raised some interesting and thorny questions: in an ecosystem that contains not just algebird, but other popular alternatives such as cats and scalaz, what algebra API should I use in my code? How best to allow the library user to interoperate with the algebra library of their choice? Can I accomplish these things while also avoiding any problematic package dependencies in my library code?
In Scala, the second question is relatively straightforward to answer. I can write my interface using implicit conversions, and provide subpackages that provide such conversions from popular algebra libraries into the library I actually use in my code. A library user can import the predefined implicit conversions of their choice, or if necessary provide their own.
So far so good, but that leads immediately back to the first question – what API should I choose to use internally in my own library?
One obvious approach is to just pick one of the popular options (I might favor cats, for example) and write my library code using that. If a library user also prefers cats, great. Otherwise, they can import the appropriate implicit conversions from their favorite alternative into cats and be on their way.
But this solution is not without drawbacks. Anybody using my library will now be including cats as a transitive dependency in their project, even if they are already using some other alternative. Although cats is not an enormous library, that represents a fair amount of code sucked into my users’ projects, most of which isn’t going to be used at all. More insidiously, I have now introduced the possibility that the cats version I package with is out of sync with the version my library users are building against. Version misalignment in transitive dependencies is a landmine in project builds and very difficult to resolve.
A second approach I might use is to define some abstract algebraic traits of my own. I can write my libraries in terms of this new API, and then provide implicit conversions from popular APIs into mine.
This approach has some real advantages over the previous. Being entirely abstract, my internal API will be lightweight. I have the option of including only the algebraic concepts I need. It does not introduce any possibly problematic 3rd-party dependencies that might cause code bloat or versioning problems for my library users.
Although this is an effective solution, I find it dissatisfying for a couple of reasons. Firstly, my new internal API effectively represents yet another competing algebra API, and so I am essentially contributing to the proliferating-standards antipattern.
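A minimal sketch of this second approach. Everything here is hypothetical: thirdparty stands in for some community algebra library (it is not the cats or algebird API), mylib holds the purely abstract internal trait, and mylibThirdpartyInterop plays the role of the optional sub-package that bridges the two through an implicit definition.

```scala
// Stand-in for a community algebra library (hypothetical, not a real API)
object thirdparty {
  trait Monoid[A] { def empty: A; def combine(x: A, y: A): A }
  object Monoid {
    implicit val intAddition: Monoid[Int] = new Monoid[Int] {
      val empty: Int = 0
      def combine(x: Int, y: Int): Int = x + y
    }
  }
}

// The library's own minimal, purely abstract algebra API
object mylib {
  trait MonoidAPI[A] { def zero: A; def plus(x: A, y: A): A }

  // Library code is written only against the abstract trait
  def sum[A](xs: Seq[A])(implicit m: MonoidAPI[A]): A = xs.foldLeft(m.zero)(m.plus)
}

// Optional sub-package: users import it to plug in their preferred library
object mylibThirdpartyInterop {
  implicit def fromThirdparty[A](implicit m: thirdparty.Monoid[A]): mylib.MonoidAPI[A] =
    new mylib.MonoidAPI[A] {
      def zero: A = m.empty
      def plus(x: A, y: A): A = m.combine(x, y)
    }
}
```

A user who writes import mylibThirdpartyInterop._ can then call mylib.sum(Seq(1, 2, 3)) and the third-party instance is adapted automatically; the core mylib object itself has no dependency on thirdparty.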
Secondly, it means that I am not taking advantage of community knowledge. The cats library embodies a great deal of cumulative human expertise in both category theory and Scala library design. What does a good algebra library API look like? Well, it’s likely to look a lot like cats of course! The odds that I end up doing an inferior job designing my little internal vanity API are rather higher than the odds that I do as well or better. The best I can hope for is to reinvent the wheel, with a real possibility that my wheel has corners.
Is there a way to resolve this unpalatable situation? Can we design our projects to both remain flexible about interfacing with multiple 3rd-party alternatives, but avoid effectively writing yet another alternative for our own internal use?
I hardly have any authoritative answers to this problem, but I have one idea that might move toward a solution. As I alluded to above, when I write my libraries, I am most frequently only interested in the API – the abstract interface. If I did go with writing my own algebra API, I would seek to define purely abstract traits. Since my intention is that my library users would supply their own favorite library alternative, I would have no need or desire to instantiate any of my APIs. That function would be provided by the separate subprojects that provide implicit conversions from community alternatives into my API.
On the other hand, what if cats and algebird factored their libraries in a similar way? What if I could include a sub-package like cats-kernel-api, or algebird-core-api, which contained only pure abstract traits for monoid, semigroup, etc? Then I could choose my favorite community API, and code against it, with much less code bloat, and a much reduced vulnerability to any versioning drift. I would still be free to provide implicit conversions and allow my users to make their own choice of library in their projects.
Although I find this idea attractive, it is certainly not foolproof. For example, there is never a way to guarantee that versioning drift won’t break an API. APIs such as cats and algebird are likely to be unusually amenable to this kind of approach. After all, their interfaces are primarily driven by underlying mathematical definitions, which are generally as stable as such things ever get. However, APIs in general tend to be significantly more stable than underlying code. And the most-stable subsets of APIs might be encoded as traits and exposed this way, allowing other more experimental API components to change at a higher frequency. Perhaps library packages could even be factored in some way such as library-stable-api and library-unstable-api. That would clearly add a bit of complication to library trait hierarchies, but the payoff in terms of increased 3rd-party usability might be worth it.
There are some varied approaches in the community for addressing the task of identifying a good number of clusters in a data set. In this post I want to focus on an approach that I think deserves more attention than it gets: Minimum Description Length.
Many years ago I ran across a superb paper by Stephen J. Roberts on anomaly detection that described a method for automatically choosing a good value for the number of clusters based on the principle of Minimum Description Length. Minimum Description Length (MDL) is an elegant framework for evaluating the parsimony of a model. The Description Length of a model is defined as the amount of information needed to encode that model, plus the encoding-length of some data, given that model. Therefore, in an MDL framework, a good model is one that allows an efficient (i.e. short) encoding of the data, but whose own description is also efficient. (This suggests connections between MDL and the idea of learning as a form of data compression.)
For example, a model that directly memorizes all the data may allow for a very short description of the data, but the model itself will clearly require at least the size of the raw data to encode, and so direct memorization models generally stack up poorly with respect to MDL. On the other hand, consider a model of some Gaussian data. We can describe these data in a length proportional to their log-likelihood under the Gaussian density. Furthermore, the description length of the Gaussian model itself is very short; just the encoding of its mean and standard deviation. And so in this case a Gaussian distribution represents an efficient model with respect to MDL.
In summary, an MDL framework allows us to mathematically capture the idea that we only wish to consider increasing the complexity of our models if that buys us a corresponding increase in descriptive power on our data.
In the case of Roberts’ paper, the clustering model in question is a Gaussian Mixture Model (GMM), and the description length expression to be optimized can be written as:
In this expression, X represents the vector of data elements.
The first term is the (negative) log-likelihood of the data, with respect to a candidate GMM having some number (K) of Gaussians; p(x) is the GMM density at point (x).
This term represents the cost of encoding the data, given that GMM.
The second term is the cost of encoding the GMM itself.
The value P is the number of free parameters needed to describe that GMM.
Assuming a dimensionality D for the data, then
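As a hedged sketch, assuming the standard two-part MDL form that matches the term-by-term description above (the exact constant factors and logarithm base in Roberts' paper may differ), and assuming each of the K Gaussians has a full covariance matrix, the expression and the parameter count can be written as:

```latex
L(X) = -\sum_{x \in X} \log p(x) \;+\; \frac{P}{2} \log |X|

P = K \left( D + \frac{D(D+1)}{2} \right) + (K - 1)
```

Under these assumptions, each Gaussian contributes D values for its mean vector and D(D+1)/2 values for its symmetric covariance matrix, and the mixture contributes K - 1 independent mixing weights.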
I wanted to apply this same MDL principle to identifying a good value for K, in the case of a K-Medoids model. How best to adapt MDL to K-Medoids poses some problems. In the case of K-Medoids, the only structure given to the data is a distance metric. There is no vector algebra defined on data elements, much less any ability to model the points as a Gaussian Mixture.
However, any candidate clustering of my data does give me a corresponding distribution of distances from each data element to its closest medoid. I can evaluate an MDL measure on these distance values. If adding more clusters (i.e. increasing K) does not sufficiently tighten this distribution, then its description length will start to increase at larger values of K, thus indicating that more clusters are not improving our model of the data. Expressing this idea as an MDL formulation produces the following description length formula:
Note that the first two terms are similar to the equation above; however, the underlying distribution
And so, an MDL-based algorithm for automatically identifying a good number of clusters (K) in a K-Medoids model is to run a K-Medoids clustering on my data, for some set of potential K values, and evaluate the MDL measure above for each, and choose the model whose description length L(X) is the smallest!
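The selection step can be sketched as follows. This is only an illustration under stated assumptions: the caller supplies the log-density of each element's distance to its closest medoid and the number of free parameters P of the fitted distance distribution, and the cost of encoding the K medoids is assumed here to be K * log(n). The names descriptionLength and bestK are hypothetical.

```scala
object MdlSelectK {
  // Description length of one candidate clustering.
  //   distLogDensities: log p(d) for each element's distance d to its closest medoid
  //   numParams:        free parameters P of the fitted distance distribution
  //   k:                number of medoids in this candidate
  // ASSUMPTION: the medoid-encoding cost is taken as k * log(n), n = number of elements
  def descriptionLength(distLogDensities: Seq[Double], numParams: Int, k: Int): Double = {
    val logN = math.log(distLogDensities.size.toDouble)
    -distLogDensities.sum + 0.5 * numParams * logN + k * logN
  }

  // Evaluate every candidate K and keep the one with the smallest description length
  def bestK(candidates: Seq[Int])(dl: Int => Double): Int = {
    require(candidates.nonEmpty, "need at least one candidate K")
    candidates.minBy(dl)
  }
}
```

In practice dl(k) would run a K-Medoids clustering for that k, fit the distance distribution, and call descriptionLength on the result.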
As I mentioned above, there is also an implied task of choosing a form (or a set of forms) for the distance distribution
Another observation (based on my blog posts mentioned above) is that my use of the gamma distribution implies a bias toward cluster distributions that behave (more or less) like Gaussian clusters, and so in this respect its current behavior is probably somewhat analogous to the G-Means algorithm, which identifies clusterings that yield Gaussian distributions in each cluster. Adding other candidates for distance distributions is a useful subject for future work, since there is no compelling reason to either favor or assume Gaussian-like cluster distributions over all kinds of metric spaces. That said, I am seeing reasonable results even on data with clusters that I suspect are not well modeled as Gaussian distributions. Perhaps the shape-coverage of the gamma distribution is helping to add some robustness.
To demonstrate the MDL-enhanced K-Medoids in action, I will illustrate its performance on some data sets that are amenable to graphic representation. The code I used to generate these results is here.
Consider this synthetic data set of points in 2D space. You can see that I’ve generated the data to have two latent clusters:
I collected the description-length values for candidate K-Medoids models having 1 up to 10 clusters, and plotted them. This plot shows that the clustering with minimal description length had 2 clusters:
When I plot that optimal clustering at K=2 (with cluster medoids marked in black-and-yellow), the clustering looks good:
To show the behavior for a different optimal value, the following plots demonstrate the MDL K-Medoids results on data where the number of latent clusters is 4:
A final comment on Minimum Description Length approaches to clustering – although I focused on K-Medoids models in this post, the basic approach (and I suspect even the same description length formulation) would apply equally well to K-Means, and possibly other clustering models. Any clustering model that involves a distance function from elements to some kind of cluster center should be a good candidate. I intend to keep an eye out for applications of MDL to other learning models, as well.
[1] “Novelty Detection Using Extreme Value Statistics”; Stephen J. Roberts; Feb 23, 1999
[2] “Learning the k in k-means”; Hamerly, G., & Elkan, C.; Advances in Neural Information Processing Systems; 2004
Recall that the form of this PDF is the generalized gamma distribution, with scale parameter
I was interested in fitting parameters to such a distribution, using some distance data from a clustering algorithm. SciPy comes with a predefined method for fitting generalized gamma parameters; however, I wished to implement something similar using Apache Commons Math, which does not have native support for fitting a generalized gamma PDF. I even went so far as to start working out some of the math needed to augment the Commons Math Automatic Differentiation libraries with Gamma function differentiation needed to numerically fit my parameters.
Meanwhile, I have been fitting a non-generalized gamma distribution to the distance data, as a sort of rough cut, using a fast non-iterative approximation to the parameter optimization. Consistent with my habit of asking the obvious question last, I tried plotting this gamma approximation against distance data, to see how well it compared against the PDF that I derived.
Surprisingly (at least to me), my approximation using the gamma distribution is a very effective fit for spatial dimensionalities
As the plot shows, only for the 1-dimension case is the gamma approximation substantially deviating. In fact, the fit appears to get better as dimensionality increases. To address the 1D case, I can easily test the fit of a half-Gaussian as a possible model.
I’ll start by showing a simple recursion relation for these derivatives, and then give its derivation. The kth derivative of Gamma(x) can be computed as follows:
The recursive formula for the D_{k} functions has an easy inductive proof:
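A hedged reconstruction of the relation and its inductive step, consistent with the statement that D_{k} depends on D_{k-1} and its derivative. The notation is an assumption of this sketch: ψ(x) denotes the digamma function Γ'(x)/Γ(x).

```latex
\Gamma^{(k)}(x) = \Gamma(x)\, D_k(x), \qquad
D_0(x) = 1, \qquad
D_k(x) = D'_{k-1}(x) + \psi(x)\, D_{k-1}(x)

% Inductive step: assume \Gamma^{(k-1)}(x) = \Gamma(x) D_{k-1}(x). Then
\Gamma^{(k)}(x)
  = \frac{d}{dx}\bigl[\Gamma(x)\, D_{k-1}(x)\bigr]
  = \Gamma'(x)\, D_{k-1}(x) + \Gamma(x)\, D'_{k-1}(x)
  = \Gamma(x)\bigl[\psi(x)\, D_{k-1}(x) + D'_{k-1}(x)\bigr]
  = \Gamma(x)\, D_k(x)
```

For example, D_1 = ψ and D_2 = ψ' + ψ², so the second derivative is Γ(x)(ψ'(x) + ψ(x)²).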
Computing the next value D_{k} requires knowledge of D_{k-1} but also derivative D’_{k-1}. If we start expanding terms, we see the following:
Continuing the process above it is not hard to see that we can continue expanding until we are left only with terms of
What we want, to do these computations systematically, is a formula for computing the nth derivative of a term
Generalizing from the above, we see that the formula for the nth derivative is:
We are now in a position to fill in the triangular table of values, culminating in the value of
As previously mentioned, the basis row of values
Suppose that I draw some values from a classic one-dimensional Gaussian, with zero mean and unit variance, but that I am actually interested in their corresponding distances from center. Knowing that my Gaussian is centered on the origin, I can rephrase that as: the distribution of magnitudes of values drawn from that Gaussian. I can simulate this process by actually sampling Gaussian values and taking their absolute value. When I do, I get the following result:
It’s easy to see – and intuitive – that the resulting distribution is a half-Gaussian, as I confirmed by overlaying the histogrammed samples above with a half-Gaussian PDF (displayed in green).
I wanted to generalize this basic idea into some arbitrary dimensionality, (d), where I draw d-vectors from a d-dimensional Gaussian (again, centered on the origin with unit variances). When I take the magnitudes of these sampled d-vectors, what will the probability distribution of their magnitudes look like?
My intuitive assumption was that these magnitudes would also follow a half-Gaussian distribution. After all, every multivariate Gaussian is densest at its mean, just like the univariate case I examined above. In fact I was so confident in this assumption that I built my initial modeling around it. Great confusion ensued, when I saw how poorly my models were working on my higher-dimensional data!
Eventually it occurred to me to do the obvious thing and generate some visualizations from higher dimensional data. For example, here is the corresponding plot generated from a bivariate Gaussian (d=2):
Surprise – the distribution at d=2 is not even close to half-Gaussian! My intuitions couldn’t have been more misleading!
Where did I go wrong?
I’ll start by observing what happens when I take a multidimensional PDF of vectors in (d) dimensions and project it down to a one-dimensional PDF of the corresponding vector magnitudes. To keep things simple, I will be assuming a multidimensional PDF
The key observation is that this term is a polynomial function of radius (r), with degree (d-1). When d=1, it is simply a constant multiplier and so we get the half-Gaussian distribution we expect, but when
The above ideas can be expressed compactly as follows:
In my experiments, I am using multivariate Gaussians of mean 0_{d} and unit covariance matrix I_{d}, and so the form for f(r;d) becomes:
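As a hedged sketch of that form: multiplying the surface area of a sphere of radius r in d dimensions by the standard Gaussian density at radius r gives the following, which is the standard chi distribution with d degrees of freedom.

```latex
f(r; d)
  = \underbrace{\frac{2\,\pi^{d/2}}{\Gamma(d/2)}\, r^{d-1}}_{\text{surface area at radius } r}
    \cdot
    \underbrace{(2\pi)^{-d/2}\, e^{-r^2/2}}_{\text{Gaussian density at radius } r}
  = \frac{2^{\,1 - d/2}}{\Gamma(d/2)}\; r^{d-1}\, e^{-r^2/2},
  \qquad r \ge 0
```

At d=1 this reduces to sqrt(2/π) e^{-r²/2}, the half-Gaussian; for d>1 the factor r^{d-1} forces the density to zero at the origin.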
This form is in fact the generalized gamma distribution, with scale parameter
I can verify that this PDF is correct by plotting it against randomly sampled data at differing dimensions:
This plot demonstrates both that the PDF expression is correct for varying dimensionalities and also illustrates how the shape of the PDF evolves as dimensionality changes. For me, it was a great example of challenging my intuitions and learning something completely unexpected about the interplay of distances and dimension.
In this post I am going to discuss some advantages to one of my favorite approaches to measuring split quality, which is to use a test statistic significance – aka “p-value” – of the null hypothesis that the left and right sub-populations are the same after the split. The idea is that if a split is of good quality, then it ought to have caused the sub-populations to the left and right of the split to be meaningfully different. That is to say: the null hypothesis (that they are the same) should be rejected with high confidence, i.e. a small p-value. What constitutes “small” is always context dependent, but popular p-values from applied statistics are 0.05, 0.01, 0.005, etc.
update – there is now an Apache Spark JIRA and a pull request for this feature
The remainder of this post is organized in the following sections:
Consistency
Awareness of Sample Sizes
Training Results
Conclusion
Test statistic pvalues have some appealing properties as a split quality measure. The test statistic methodology has the advantage of working essentially the same way regardless of the particular test being used. We begin with two sample populations; in our case, these are the left and right subpopulations created by a candidate split. We want to assess whether these two populations have the same distribution (the null hypothesis) or different distributions. We measure some test statistic ‘S’ (Student’s t, ChiSquared, etc). We then compute the probability that S >= the value we actually measured. This probability is commonly referred to as the pvalue. The smaller the pvalue, the less likely it is that our two populations are the same. In our case, we can interpret this as: a smaller pvalue indicates a better quality split.
This consistent methodology has a couple of advantages contributing to user experience (UX). If all measures of split quality work in the same way, then there is a lower cognitive load to move between measures once the user understands the common pattern of use. A second advantage is better “unit analysis.” Since all such quality measures take the form of p-values, there is no risk of a chosen quality measure getting misaligned with a corresponding quality threshold. They are all probabilities, on the interval [0,1], and “smaller threshold” always means “higher threshold of split quality.” By way of comparison, if an application is measuring entropy and then switches to using Gini impurity, these measures are in differing units and care has to be taken that the correct quality threshold is used in each case or the model training policy will be broken. Switching between differing statistical tests does not come with the same risk. A p-value quality threshold will have the same semantics regardless of which statistical test is being applied: the probability of measuring a difference at least this large if the left and right subpopulations were in fact the same, given the particular statistic being measured.
Test statistics have another appealing property: many are “aware” of sample size in a way that captures the idea that the smaller the sample size, the larger the difference between populations should be to conclude a given significance. For one example, consider Welch’s t-test, the two-sample variation of Student’s t-test that does not assume equal variances, and which applies well to comparing left and right subpopulations of candidate decision tree splits:
Visualizing the effects of sample sizes n1 and n2 on these equations directly is a bit tricky, but assuming equal sample sizes and variances allows the equations to be simplified quite a bit, so that we can observe the effect of sample size:
These simplified equations show clearly that (all else remaining equal) as sample size grows smaller, the measured t-statistic correspondingly grows smaller (proportional to sqrt(n)), and furthermore the corresponding variance of the t distribution to be applied grows larger. For any given shift in left and right subpopulations, each of these trends yields a larger (i.e. weaker) p-value. This behavior is desirable for a split quality metric. The less data there is at a given candidate split, the less confidence there should be in split quality. Put another way: we would like to require a larger difference before a split is measured as being good quality when we have less data to work with, and that is exactly the behavior the t-test provides us.
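The sample-size effect can be checked numerically. The following is a minimal sketch, assuming the textbook form of Welch’s statistic, t = (m1 - m2) / sqrt(v1/n1 + v2/n2), with the Welch-Satterthwaite approximation for the degrees of freedom; the object and parameter names are mine and purely illustrative.

```scala
object WelchT {
  // Returns (t statistic, approximate degrees of freedom) for two samples
  // described by their means (m), sample variances (v) and sizes (n).
  def welch(m1: Double, v1: Double, n1: Int,
            m2: Double, v2: Double, n2: Int): (Double, Double) = {
    val a = v1 / n1
    val b = v2 / n2
    val t = (m1 - m2) / math.sqrt(a + b)
    // Welch-Satterthwaite approximation
    val df = (a + b) * (a + b) / (a * a / (n1 - 1) + b * b / (n2 - 1))
    (t, df)
  }
}
```

With equal sizes n and equal variances, this reduces to t proportional to sqrt(n) and 2(n - 1) degrees of freedom: for the same shift in means, `WelchT.welch(1.0, 1.0, 50, 0.0, 1.0, 50)` gives half the t value of `WelchT.welch(1.0, 1.0, 200, 0.0, 1.0, 200)`.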
These properties are pleasing, but it remains to show that test statistics can actually improve decision tree training in practice. In the following sections I will compare the effects of training with test statistics against other split quality policies based on entropy and Gini index.
To conduct these experiments, I modified a local copy of Apache Spark with the Chi-Squared test statistic for comparing categorical distributions. The demo script, which I ran in spark-shell, can be viewed here.
I generated an example data set that represents a two-class learning problem, where labels may be 0 or 1. Each sample has 10 clean binary features, such that if the bit is 1, the label is 1 with probability 90% and 0 with probability 10%. There are 5 noise features, also binary, which are completely random. There are 50 samples of each clean feature being on, for a total of 500 samples. There are also 500 samples where all clean features are 0 and the corresponding labels are 90% 0 and 10% 1. The total number of samples in the data set is 1000. The shape of the data is illustrated by the following table:
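A data set of this shape could be generated along the following lines. This is only a sketch under my own assumptions, not the demo script itself: I assume exactly one clean feature is on in each of the first 500 rows, and labels are drawn at random with the stated probabilities rather than in exact 90/10 proportions. All names are hypothetical.

```scala
import scala.util.Random

object DemoData {
  // Each row is (label, features); features 0-9 are "clean", 10-14 are noise.
  def generate(seed: Long): Vector[(Int, Vector[Int])] = {
    val rng = new Random(seed)
    // five completely random binary noise features
    def noise(): Vector[Int] = Vector.fill(5)(rng.nextInt(2))
    // label is 1 with probability pOne, otherwise 0
    def label(pOne: Double): Int = if (rng.nextDouble() < pOne) 1 else 0
    // 50 rows for each clean feature, with only that feature switched on
    val on = for (f <- 0 until 10; _ <- 0 until 50) yield {
      val clean = Vector.tabulate(10)(i => if (i == f) 1 else 0)
      (label(0.9), clean ++ noise())
    }
    // 500 rows with all clean features off, mostly labeled 0
    val off = for (_ <- 0 until 500) yield {
      (label(0.1), Vector.fill(10)(0) ++ noise())
    }
    (on ++ off).toVector
  }
}
```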

For the first run I use my customized chi-squared statistic as the split quality measure. I used a p-value threshold of 0.01 – that is, the chi-squared test must yield a p-value <= 0.01 for the hypothesis that the left and right split populations are the same, or that split will not be used. Note, this means I can expect that around 1% of the time, it will conclude a split was good when it was just luck. This is a reasonable false-positive rate; random forests are by nature robust to noise, including noise in their own split decisions:

The first thing to observe is that the resulting decision tree used exactly the 10 clean features 0 through 9, and none of the five noise features. The tree splits off each of the clean features to obtain an optimally accurate leaf node (one with 90% 1s and 10% 0s). A second observation is that the p-values shown in the demo output are extremely small (i.e. strong) values – around 1e-9 (one part in a billion) – for good-quality splits. We can also see “weak” p-values with magnitudes such as 0.7, 0.2, etc. These are poor quality splits on the noise features that it rejects and does not use in the tree, exactly as we hope to see.
Next, I will show a similar run with the standard available “entropy” quality measure, and a minimum gain threshold of 0.035, which is a value I had to determine by trial and error, as what kind of entropy gains one can expect to see, and where to cut them off, is somewhat unintuitive and likely to be very data dependent.

The first observation is that the resulting tree using entropy as a split quality measure is twice the size of the tree trained using the chi-squared policy. Worse, it is using the noise features – its quality measure is yielding many more false positives. The entropy-based model is less parsimonious and will also have performance problems since the model has included very noisy features.
Lastly, I ran a similar training using the “gini” impurity measure, and a 0.015 quality threshold (again, a hopefully optimal value that I had to run multiple experiments to identify). Its quality is better than the entropy-based measure, but this model is still substantially larger than the chi-squared model, and it still uses some noise features:

In this post I have discussed some advantages of using test statistics and p-values as split quality metrics for decision tree training:
I believe they are a useful tool for improved training of decision tree models! Happy computing!
]]>The data I clustered consisted of 135 machines, each with a list of installed RPM packages. The number of unique package names among all 135 machines was 4397. Each machine was assigned a vector of Boolean values: a value of 1 indicates that the corresponding RPM was installed on that machine. This means that the clustering data occupied a space of nearly 4400 dimensions. I discuss the implications of this later in the post, and what it has to do with Random Forest Clustering in particular.
For ease of navigation and digestion, the remainder of this post is organized in sections:
Introduction to Random Forest Clustering
(The PayOff)
Package Configuration Clustering Code
Clustering Results
(Outliers)
Full explanations of Random Forests and Random Forest Clustering could easily occupy blog posts of their own, but I will attempt to summarize them briefly here. Random Forest learning models per se are well covered in the machine learning community, and available in most machine learning toolkits. With that in mind, I will focus on their application to Random Forest Clustering, as it is less commonly used.
A Random Forest is an ensemble learning model, consisting of some number of individual decision trees, each trained on a random subset of the training data, and which choose from a random subset of candidate features when learning each internal decision node.
Random Forest Clustering begins by training a Random Forest to distinguish between the data to be clustered, and a corresponding synthetic data set created by sampling from the marginal distributions of each feature. If the data has well defined clusters in the joint feature space (a common scenario), then the model can identify these clusters as standing out from the more homogeneous distribution of synthetic data. A simple example of what this looks like in 2 dimensional data is displayed in Figure 1, where the dark red dots are the data to be clustered, and the lighter pink dots represent synthetic data generated from the marginal distributions:
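The synthetic data set described above can be produced by drawing each feature independently from the values observed for that feature. Here is a minimal sketch on plain Scala collections; it is my own illustration with hypothetical names, not the silex implementation.

```scala
import scala.util.Random

object SyntheticMarginals {
  // Builds m synthetic rows. Every feature value is picked independently from
  // a randomly chosen real row, which preserves each feature's marginal
  // distribution while discarding the joint structure between features.
  def sample(data: Vector[Vector[Double]], m: Int, seed: Long): Vector[Vector[Double]] = {
    val rng = new Random(seed)
    val dims = data.head.length
    Vector.fill(m) {
      Vector.tabulate(dims)(d => data(rng.nextInt(data.length))(d))
    }
  }
}
```

The forest is then trained to tell the real rows from the synthetic ones.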
Each interior decision node, in each tree of a Random Forest, typically divides the space of feature vectors in half: the halfspace <= some threshold, and the halfspace > that threshold. The result is that the model learned for our data can be visualized as rectilinear regions of space. In this simple example, these regions can be plotted directly over the data, and show that the Random Forest did indeed learn the location of the data clusters against the background of synthetic data:
Once this model has been trained, the actual data to be clustered are evaluated against this model. Each data element navigates the interior decision nodes and eventually arrives at a leafnode of each tree in the Random Forest ensemble, as illustrated in the following schematic:
A key insight of Random Forest Clustering is that if two objects (or, their feature vectors) are similar, then they are likely to arrive at the same leaf nodes more often than not. As the figure above suggests, it means we can cluster objects by their corresponding vectors of leaf nodes, instead of their raw feature vectors.
If we map the points in our toy example to leaf ids in this way, and then cluster the results, we obtain the following two clusters, which correspond well with the structure of the data:
A note on clustering leaf ids. A leaf id is just that – an identifier – and in that respect a vector of leaf ids has no algebra; it is not meaningful to take an average of such identifiers, any more than it would be meaningful to take the average of people’s names. Pragmatically, what this means is that the popular k-means clustering algorithm cannot be applied to this problem.
These vectors do, however, have distance: for any pair of vectors, add 1 for each corresponding pair of leaf ids that differ. If two data elements arrived at all the same leaves in the Random Forest model, all their leaf ids are the same, and their distance is zero (with respect to the model, they are the same). Therefore, we can apply k-medoids clustering.
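The distance just described is a Hamming distance over leaf ids. A minimal sketch follows, together with the medoid of a group (the member with the smallest total distance to the others), which is the notion of “center” that k-medoids uses in place of a mean. The names are illustrative only.

```scala
object LeafIdDistance {
  // Number of trees in which the two elements landed in different leaves.
  def distance(a: Seq[Int], b: Seq[Int]): Int = {
    require(a.length == b.length, "leaf-id vectors must have one entry per tree")
    a.zip(b).count { case (x, y) => x != y }
  }

  // The member of the cluster with the smallest summed distance to all members.
  def medoid(cluster: Seq[Seq[Int]]): Seq[Int] =
    cluster.minBy(c => cluster.map(distance(c, _)).sum)
}
```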
What does this somewhat indirect method of clustering buy us? Why not just cluster objects by their raw feature vectors?
The problem is that in many real-world cases (unlike in our toy example above), feature vectors computed for objects have many dimensions – hundreds, thousands, perhaps millions – instead of the two dimensions in this example. Computing distances on such objects, necessary for clustering, is often expensive, and worse yet the quality of these distances is frequently poor due to the fact that most features in large spaces will be poorly correlated with any structure in the data. This problem is so common, and so important, it has a name: the Curse of Dimensionality.
Random Forest Clustering, which clusters on vectors of leaf-node ids from the trees in the model, sidesteps the curse of dimensionality because the Random Forest training process, by learning where the data is against the background of the synthetic data, has already identified the features that are useful for identifying the structure of the data! If any particular feature was poorly correlated with that structure, it has already been ignored by the model. In other words, a Random Forest Clustering model is implicitly examining exactly those features that are most useful for clustering, thus providing a cure for the Curse of Dimensionality.
The machine package configurations whose clustering I describe for this post are a good example of high dimensional data that is vulnerable to the Curse of Dimensionality. The dimensionality of the feature space is nearly 4400, making distances between vectors potentially expensive to evaluate. Any individual feature contributes little to the distance, having to contend with over 4000 other features. Installed packages are also noisy. Many packages, such as kernels, are installed everywhere. Others may be installed but not used, making them potentially irrelevant to grouping machines. Furthermore, there are only 135 machines, and so there are far more features than data examples, making this an underdetermined data set.
All of these factors make the machine package configuration data a good test of the strengths of Random Forest Clustering.
The implementation of Random Forest Clustering I used for the results in this post is a library available from the silex project, a package of analytics libraries and utilities for Apache Spark.
In this section I will describe three code fragments that load the machine configuration data, perform a Random Forest clustering, and format some of the output. This is the code I ran to obtain the results described in the final section of this post.
The first fragment of code illustrates the logistics of loading the feature vectors from file train.txt that represent the installed-package configurations for each machine. A corresponding “parallel” file nodesclean.txt contains corresponding machine names for each vector. A third companion file rpms.txt contains names of each installed package. These are used to instantiate a specialized Scala function (InvertibleIndexFunction) between feature indexes and human-readable feature names (in this case, names of RPM packages). Finally, another specialized function (Extractor) for instantiating Spark feature vectors is created.
Note: Extractor and InvertibleIndexFunction are also component libraries of silex

The next section of code is where the work of Random Forest Clustering happens. A RandomForestCluster object is instantiated, and configured. Here, the configuration is for 7 clusters, 250 synthetic points (about twice as many synthetic points as true data), and a Random Forest of 20 trees. Training against the input data is a simple call to the run method.
The predictWithDistanceBy method is then applied to the data paired with machine names, to yield tuples of cluster id, distance to cluster center, and the associated machine name. These tuples are split by distance into data with a cluster, and data considered to be “outliers” (i.e. elements far from any cluster center). Lastly, the histFeatures method is applied, to examine the Random Forest Model and identify any commonly used features.

The final code fragment simply formats clusters and outliers into a tabular form, as displayed in the next section of this post. Note that there is neither Spark nor silex code here; standard Scala methods are sufficient to postprocess the clustering data:

The result of running the code in the previous section is seven clusters of machines. In the following files, the first column represents distance from the cluster center, and the second is the actual machine’s node name. A cluster distance of 0.0 indicates that the machine was indistinguishable from cluster center, as far as the Random Forest model was concerned. The larger the distance, the more different from the cluster’s center a machine was, in terms of its installed RPM packages.
Was the clustering meaningful? Examining the first two clusters below is promising; the machine names in these clusters are clearly similar, likely configured for some common task by the IT department. The first cluster of machines appears to be web servers and corresponding backend services. It would be unsurprising to find their RPM configurations were similar.
The second cluster is a series of executor machines of varying sizes, but presumably these would be configured similarly to one another.
The second pair of clusters (3 & 4) are small. All of their names are similar (and furthermore, similar to some machines in other clusters), and so an IT administrator might wonder why they ended up in oddball small clusters. Perhaps they have some spurious, nonstandard packages installed that ought to be cleaned up. Identifying these kinds of structure in a clustering is one common clustering application.
Cluster 5 is a series of bugzilla web servers and corresponding backend bugzilla database services. Although they were clustered together, we see that the web servers have a larger distance from the center, indicating a somewhat different configuration.
Cluster 6 represents a group of performance-related machines. Not all of these machines occupy the same distance, even though most of their names are similar. These are also the same series of machines as in clusters 3 & 4. Does this indicate spurious package installations, or some other legitimate configuration difference? A question for the IT department…
Cluster 7 is by far the largest. It is primarily a combination of OpenStack machines and yet more perf machines. This clustering was relatively stable – it appeared across multiple independent clustering runs. Because of its stability I would suggest to an IT administrator that the performance and OpenStack machines are sharing some configuration similarities, and the performance machines in other clusters suggest that there might be yet more configuration anomalies. Perhaps these were OpenStack nodes that were repurposed as performance machines? Yet another question for IT…
This last grouping represents machines which were “far” from any of the previous cluster centers. They may be interpreted as “outliers” – machines that don’t fit any model category. Of these the node frodo is clearly somebody’s personal machine, likely with a customized or idiosyncratic package configuration. Unsurprisingly, it is the farthest of all machines from any cluster, with distance 9.0. The jenkins machine is also somewhat unique among the nodes, and so it is perhaps not surprising that it registers as anomalous. The remaining machines match node series from other clusters. Their large distance is another indication of spurious configurations for IT to examine.
I will conclude with another useful feature of Random Forest Models, which is that you can interrogate them for information such as which features were used most frequently. Here is a histogram of model features (in this case, installed packages) that were used most frequently in the clustering model. This particular histogram is interesting, as no feature was used more than twice. The remaining features were all used exactly once. This is a bit unusual for a Random Forest model. Frequently some features are used commonly, with a longer tail. This histogram is rather “flat,” which may be a consequence of there being many more features (over 4000 installed packages) than there are data elements (135 machines). This makes the problem somewhat underdetermined. To its credit, the model still achieves a meaningful clustering.
Lastly I’ll note that full histogram length was 186; in other words, of the nearly 4400 installed packages, the Random Forest model used only 186 of them – a tiny fraction! A nice illustration of Random Forest Clustering performing in the face of high dimensionality!
]]>In this post I will derive an algorithm for assigning vertex locations in R^{N-1} for each of N objects, using only pairwise object distances.
I will assume that N >= 2, since at least two objects are required to define a pairwise distance. The case N=2 is easy, as I can assign vertex 1 to the origin, and vertex 2 to the point d(1,2), to form a 1-simplex (i.e. a line segment) whose single edge is just the distance between the two objects. I will also assume that my N objects are distinct; that is, each pair has a nonzero distance.
Next consider an arbitrary N, and suppose I have already added vertices 1 through k. The next vertex (k+1) must obey the pairwise distance relations, as follows:
Adding the new vertex (k+1) involves adding another dimension (k) to the simplex. I define this new kth coordinate x(k) to be zero for the existing k vertices, as annotated above; only the new vertex (k+1) will have a nonzero kth coordinate. Expanding the quadratic terms on the left yields the following form:
The squared terms for the coordinates of the new vertex (k+1) are inconvenient; however, I can get rid of them by subtracting pairs of equations above. For example, if I subtract equation 1 from the remaining k-1 equations (2 through k), these squared terms disappear, leaving me with the following system of k-1 equations, which we can see is linear in the first k-1 coordinates of the new vertex. Therefore, I know I’ll be able to solve for those coordinates. I can then solve for the remaining kth coordinate by plugging them into the first distance equation:
To clarify matters, the equations above can be rewritten as the following matrix equation, solvable by any linear systems library:
This gives me a recursion relation for adding a new vertex (k+1), given that I have already added the first k vertices. The basis case of adding the first two vertices was already described above. And so I can iteratively add all my vertices one at a time by applying the recursion relation.
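The recursion can be sketched in plain Scala as follows. Instead of calling a general linear solver, this sketch relies on an observation of mine: because vertex i is nonzero only in its first i-1 coordinates, the linear system is lower triangular and can be solved by forward substitution. It assumes the distances are Euclidean-embeddable and non-degenerate, so no diagonal entry is zero. Indices are 0-based in the code, and all names are hypothetical.

```scala
object SimplexEmbedding {
  // d(i)(j) is the symmetric matrix of pairwise distances between n objects.
  // Returns n vertices in R^(n-1); vertex 0 sits at the origin.
  def embed(d: Array[Array[Double]]): Array[Array[Double]] = {
    val n = d.length
    val x = Array.fill(n, math.max(n - 1, 0))(0.0)
    for (v <- 1 until n) {
      // Subtracting the equation for vertex 0 from the one for vertex i gives
      //   x(i) . x(v) = (|x(i)|^2 + d(0)(v)^2 - d(i)(v)^2) / 2
      // and x(i) is nonzero only in coordinates 0 until i, so the coordinates
      // of x(v) can be solved one at a time.
      for (i <- 1 until v) {
        val rhs = (x(i).map(c => c * c).sum + d(0)(v) * d(0)(v) - d(i)(v) * d(i)(v)) / 2.0
        val known = (0 until i - 1).map(c => x(i)(c) * x(v)(c)).sum
        x(v)(i - 1) = (rhs - known) / x(i)(i - 1)
      }
      // The last coordinate comes from the distance to the origin vertex.
      val used = (0 until v - 1).map(c => x(v)(c) * x(v)(c)).sum
      x(v)(v - 1) = math.sqrt(math.max(d(0)(v) * d(0)(v) - used, 0.0))
    }
    x
  }
}
```

For a 3-4-5 triangle of distances, this places the vertices at (0,0), (3,0) and (0,4).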
As a corollary, assume that I have constructed a simplex having k vertices, as shown above, and I would like to assign a spatial location to a new object, (y), given its k distances to each vertex. The corresponding distance relations are given by:
I can apply a derivation very similar to the one above, to obtain the following linear equation for the (k-1) coordinates of (y):
]]>My main working example will be the operation of splitting a collection of data elements into N randomly selected subsamples. This operation is quite common in machine learning, for the purpose of dividing data into a training and testing set, or the related task of creating folds for cross-validation.
Consider the current standard RDD method for accomplishing this task, randomSplit(). This method takes a collection of N weights, and returns N output RDDs, each of which contains a randomly sampled subset of the input, proportional to the corresponding weight. The randomSplit() method generates the jth output by running a random number generator (RNG) for each input data element and accepting all elements which are in the corresponding jth (normalized) weight range. As a diagram, the process looks like this at each RDD partition:
The observation I want to draw attention to is that to produce the N output RDDs, it has to run a random sampling over every element in the input for each output. So if you are splitting into 10 outputs (e.g. for a 10-fold cross-validation), you are resampling your input 10 times, the only difference being that each output is created using a different acceptance range for the RNG output.
To see what this looks like in code, consider a simplified version of random splitting that just takes an integer n and always produces (n) equally weighted outputs:

(Note that for this method to operate correctly, the RNG seed must be set to the same value each time, or the data will not be correctly partitioned.)
While this approach to random splitting works fine, resampling the same data N times is somewhat wasteful. However, it is possible to reorganize the computation so that the input data is sampled only once. The idea is to run the RNG once per data element, and save the element into a randomly chosen collection. To make this work in the RDD compute model, all N output collections reside in a single row of an intermediate RDD – a “manifold” RDD. Each output RDD then takes its data from the corresponding collection in the manifold RDD, as in this diagram:
If you abstract the diagram above into a generalized operation, you end up with methods that might look like the following:

Here, the operation of sampling is generalized to any user-supplied function that maps RDD partition data into a sequence of objects that are computed in a single pass, and then multiplexed to the final user-visible outputs. Note that these functions take a StorageLevel argument that can be used to control the caching level of the internal “manifold” RDD. This typically defaults to MEMORY_ONLY, so that the computation can be saved and reused for efficiency.
An efficient split-sampling method based on multiplexing, as described above, might be written using flatMuxPartitions as follows:

To test whether multiplexed RDDs actually improve compute efficiency, I collected run-time data at various split values of n (from 1 to 10), for both the non-multiplexing logic (equivalent to the standard randomSplit) and the multiplexed version:
As the timing data above show, the computation required to run a non-multiplexed version grows linearly with n, just as predicted. The multiplexed version, by computing the (n) outputs in a single pass, takes a nearly constant amount of time regardless of how many samples the input is split into.
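The difference between the two strategies can be sketched on plain Scala collections, without any of the RDD machinery. This is an illustration under my own assumptions rather than the silex code: `splitNaive` makes one pass per output, re-seeding the RNG each time, while `splitMux` draws once per element and routes it to one of n buffers. Both names are hypothetical.

```scala
import scala.util.Random
import scala.collection.mutable.ArrayBuffer

object SplitSketch {
  // n passes over the data: output j keeps the elements whose random draw
  // falls in the jth of n equal ranges. The seed must be identical on every
  // pass, or the outputs will not form a partition of the input.
  def splitNaive[T](data: Vector[T], n: Int, seed: Long): Vector[Vector[T]] =
    (0 until n).toVector.map { j =>
      val rng = new Random(seed)
      data.filter { _ => (rng.nextDouble() * n).toInt == j }
    }

  // A single pass: one random draw per element selects its output buffer.
  def splitMux[T](data: Vector[T], n: Int, seed: Long): Vector[Vector[T]] = {
    val rng = new Random(seed)
    val out = Vector.fill(n)(ArrayBuffer.empty[T])
    data.foreach { e => out((rng.nextDouble() * n).toInt) += e }
    out.map(_.toVector)
  }
}
```

Given the same seed, both functions return the same partition of the input; the naive version simply performs n times as many random draws to get there.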
There are other potential applications for multiplexed RDDs. Consider the following tuple-based versions of multiplexing:

Suppose you wanted to run an input-validation filter on some data, sending the data that pass validation into one RDD, and data that failed into a second RDD, paired with information about the error that occurred. Data validation is a potentially expensive operation. With multiplexing, you can easily write the filter to operate in a single efficient pass to obtain both the valid stream and the stream of error data:

RDD multiplexing is currently a PR against the silex project. The code I used to run the timing experiments above is saved for posterity here.
Happy multiplexing!
]]>prepare operation in a map-reduce context, has substantial inefficiencies, compared to an equivalent formulation that is more directly suited to taking advantage of Scala’s aggregate method on collections.
Consider the definition of aggregation in the Aggregator class:


You can see that it is a standard map/reduce operation, where reduce is defined as a monoidal (or semigroup – more on this later) operation. Under the hood, it boils down to an invocation of Scala’s reduceLeft method. The key thing to notice is that the role of prepare is to map a collection of data elements into the required monoids, which are then aggregated using that monoid’s plus operation. In other words, prepare converts data elements into “singleton” monoids each representing a data element.
Now, if the monoid in question is simple, say some numeric type, this conversion is free, or nearly so. For example, the conversion of an integer into the “integer monoid” is a no-op. However, there are other kinds of “non-trivial” monoids, for which the conversion of a data element into its corresponding monoid may be costly. In this post, I will be using the monoid defined by Scala Set[Int], where the monoid plus operation is set union, and of course the zero element is the empty set.
Consider the process of defining an Algebird aggregator for the task of generating the set of unique elements in a data set. The corresponding prepare operation is: prepare(e: Int) = Set(e). A monoid trait that encodes this idea might look like the following. (The code I used in this post can be found here.)

If we unpack the above code, as applied to intSetPrepared, we are instantiating a new Set object, containing a single value, for every single input data element.
But there is a potentially better model of aggregation, exemplified by the Scala aggregate method. This method does not use a prepare operation. It uses a zero value and a monoidal operator, which the Scala docs refer to as combop, but it also uses an “update” operation, that defines how to update the monoid object, directly, with a single element, referred to as seqop in Scala’s documentation. This idea can also be encoded as a flavor of monoid, enhanced with an update method:

This arrangement promises more efficiency when aggregating w.r.t. non-trivial monoids, by avoiding the construction of “singleton” monoids for each data element. The following demo confirms that for the Set-based monoid, it is over 10 times faster:

It is also possible to apply Scala’s aggregate to a monoid enhanced with prepare:

Although this turns out to be measurably faster than the literal map-reduce implementation, it is still not nearly as fast as the variation using update:

Readers familiar with Algebird may be wondering about my use of monoids above, when the Aggregator interface is actually based on semigroups. This is important, since building on Scala’s aggregate function requires a zero element that semigroups do not have. Although I believe it might be worth considering changing Aggregator to use monoids, another sensible option is to change the internal logic for the subclass AggregatorMonoid, which does require a monoid, or possibly just define a new AggregatorMonoidUpdated subclass.
A final note on compatibility: note that any monoid enhanced with prepare can be converted into an equivalent monoid enhanced with update, as demonstrated by this factory function:
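The two flavors of monoid discussed in this post, and the conversion between them, might be sketched as follows. This is a minimal stand-alone illustration, not Algebird code: the trait names, `fromPrepared`, and the two aggregate helpers are all hypothetical, and the helpers use foldLeft with the zero element rather than Algebird’s reduceLeft. Only `intSetPrepared` is a name taken from the text above.

```scala
object MonoidSketch {
  // A monoid that lifts each element into a "singleton" monoid value.
  trait MonoidPrepared[M, E] {
    def zero: M
    def plus(a: M, b: M): M
    def prepare(e: E): M
  }

  // A monoid that folds a single element directly into an existing value.
  trait MonoidUpdated[M, E] {
    def zero: M
    def plus(a: M, b: M): M
    def update(m: M, e: E): M
  }

  val intSetPrepared: MonoidPrepared[Set[Int], Int] = new MonoidPrepared[Set[Int], Int] {
    def zero: Set[Int] = Set.empty[Int]
    def plus(a: Set[Int], b: Set[Int]): Set[Int] = a ++ b
    def prepare(e: Int): Set[Int] = Set(e) // allocates a new Set per element
  }

  val intSetUpdated: MonoidUpdated[Set[Int], Int] = new MonoidUpdated[Set[Int], Int] {
    def zero: Set[Int] = Set.empty[Int]
    def plus(a: Set[Int], b: Set[Int]): Set[Int] = a ++ b
    def update(m: Set[Int], e: Int): Set[Int] = m + e // no singleton Set needed
  }

  // Any prepare-style monoid yields an update-style one: update = plus after prepare.
  def fromPrepared[M, E](p: MonoidPrepared[M, E]): MonoidUpdated[M, E] =
    new MonoidUpdated[M, E] {
      def zero: M = p.zero
      def plus(a: M, b: M): M = p.plus(a, b)
      def update(m: M, e: E): M = p.plus(m, p.prepare(e))
    }

  // Map-reduce style: prepare every element, then combine.
  def aggregatePrepared[M, E](data: Seq[E], m: MonoidPrepared[M, E]): M =
    data.map(m.prepare).foldLeft(m.zero)(m.plus)

  // Update style: fold each element straight into the accumulator.
  def aggregateUpdated[M, E](data: Seq[E], m: MonoidUpdated[M, E]): M =
    data.foldLeft(m.zero)(m.update)
}
```

Note that a monoid obtained through `fromPrepared` is compatible with the update-style interface but still builds a singleton per element, so it does not recover the speedup of a hand-written update.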

The code I used to collect the data for this post can be viewed here. I generated the plots using the quantifind WISP project.
Update (April 4, 2016): my colleague RJ Nowling ran across a paper by J.S. Vitter that shows Vitter developed the trick of accelerating sampling with a sampling-gap distribution in 1987 – I reinvented Vitter’s wheel 30 years after the fact! I’m surprised it never caught on, as it is not much harder to implement than the naive version.
In a previous post, I showed that random Bernoulli and Poisson sampling could be made much faster by modeling the sampling gap distribution for the corresponding sampling distributions. More recently, I also began exploring whether reservoir sampling might also be optimized using the gap sampling technique, by deriving the reservoir sampling gap distribution. For a sampling reservoir of size (R), starting at data element (j), the probability distribution of the sampling gap is:
Modeling a sampling gap distribution is a powerful tool for optimizing a sampling algorithm, but it presupposes that you can actually draw values from that distribution substantially faster than just applying a random process to drawing each data element. I was unable to come up with a “direct” algorithm for drawing samples from P(k) above (I suspect none exists), however I also know the CDF F(k), so it is possible to apply inversion sampling, which runs in logarithmic time w.r.t the desired accuracy. Although its logarithmic cost effectively guarantees that it will be a net efficiency win for sufficiently large (j), it still involves a substantial number of computations to yield its samples, and it seems unlikely to be competitive with straight “naive” reservoir sampling over many realworld data sizes, where (j) may never grow very large.
Well, if exact computations are too expensive, we can always look for a fast approximation. Consider the original “first principles” formula for the sampling gap P(k):
As the figure above alludes to, if (j) is relatively large compared to (k), then values (j+1),(j+2)…(j+k) are all going to be effectively “close” to (j), and so we can replace them all with (j) as an approximation. Note that the resulting approximation is just the PMF of the geometric distribution, with probability of success p=(R/j), and we already saw how to efficiently draw values from a geometric distribution from our experience with Bernoulli sampling.
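For reference, a geometric gap can be drawn with a single uniform random number using the standard inversion formula. The sketch below is my own illustration of that formula (the exact method used in the earlier Bernoulli post is not reproduced here), and the function name is hypothetical.

```scala
object GeometricGap {
  // For u uniform on (0, 1], floor(ln(u) / ln(1 - p)) is geometrically
  // distributed: P(gap = k) = (1 - p)^k * p for k = 0, 1, 2, ...
  def gap(u: Double, p: Double): Long = {
    require(u > 0.0 && u <= 1.0, "u must lie in (0, 1]")
    require(p > 0.0 && p < 1.0, "p must lie strictly between 0 and 1")
    math.floor(math.log(u) / math.log(1.0 - p)).toLong
  }
}
```

In the reservoir setting, p would be R/j, and the returned gap is the number of elements to skip before the next one is accepted.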
Do we have any reason to hope that this approximation will be useful? For reasons that are similar to those for Bernoulli gap sampling, it will only be efficient to employ gap sampling when the probability (R/j) becomes small enough. From our experiences with Bernoulli sampling that is at least j>=2R. So, we have some assurance that (j) itself will never be very small. What about (k)? Note that a geometric distribution “favors” smaller values of (k) – that is, small values of (k) have the highest probabilities. In fact, the smaller that (j) is, the larger the probability (R/j) is, and so the more likely that (k) values that are small relative to (j) will be the frequent ones. It is also promising that the true distribution for P(k) also favors smaller values of (k) (in fact it favors them even a bit more strongly than the approximation).
Although it is encouraging, it is also clear that my argument above is limited to heuristic handwaving. What does this approximation really look like, compared to the true distribution? Fortunately, it is easy to plot both distributions numerically, since we now know the formulas for both:
The plot above shows that, in fact, the geometric approximation is a surprisingly good approximation to the true distribution! Furthermore, the approximation remains good as both (j) and (k) grow larger.
Our numeric eyeballing looks quite promising. Is there an effective way to measure how good this approximation is? One useful measure is the Kolmogorov-Smirnov D statistic, which is just the maximum absolute error between two cumulative distributions. Here is a plot of the D statistic for reservoir size R=10, as (j) varies across several magnitudes:
This plot is also good news: we can see that deviation, as measured by D, remains bounded at a small value (less than 0.0262). As this is for the specific value R=10, we also want to know how things change as reservoir size changes:
The news is still good! As reservoir size grows, the approximation only gets better: the D values get smaller as R increases, and remain asymptotically bounded as (j) increases.
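For readers who want to reproduce this kind of measurement, here is a minimal sketch of computing D numerically for one choice of (R) and (j). The object and function names (GapApproxSketch, ksD) and the cutoff kMax are my own illustrative assumptions; the sketch relies on the fact that F(k) is one minus the probability of skipping elements j through j+k, for both the exact distribution and the geometric approximation:

```scala
object GapApproxSketch {
  // D = max over k in [0, kMax] of |F_true(k) - F_geo(k)|.
  // Since F = 1 - (probability of skipping elements j .. j+k), it is enough
  // to compare the two "skip" probabilities directly.
  def ksD(r: Int, j: Int, kMax: Int): Double = {
    require(r > 0 && j > r && kMax >= 0)
    val p = r.toDouble / j
    var skipTrue = 1.0 // exact: product of (1 - R/(j+i)) for i = 0 .. k
    var skipGeo = 1.0  // approximation: (1 - R/j)^(k+1)
    var d = 0.0
    var k = 0
    while (k <= kMax) {
      skipTrue *= 1.0 - r.toDouble / (j + k)
      skipGeo *= 1.0 - p
      d = math.max(d, math.abs(skipTrue - skipGeo))
      k += 1
    }
    d
  }
}
```

Calling GapApproxSketch.ksD(10, 20, 1000) evaluates the deviation at the smallest (j) where gap sampling is considered worthwhile for R=10; sweeping (j) upward gives one curve of the kind plotted above.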
Now we have some numeric assurance that the geometric approximation is a good one, and stays good as reservoir size grows and sampling runs get longer. However, we should also verify that an actual implementation of the approximation works as expected.
Here is pseudocode for an implementation of reservoir sampling using the fast geometric approximation:
// data is array to sample from
// R is the reservoir size
function reservoirFast(data: Array, R: Int) {
  n = data.length
  // Initialize reservoir with first R elements of data:
  res = data[0 until R]
  // Until this threshold, use traditional sampling. This value may
  // depend on performance characteristics of random number generation and/or
  // numeric libraries:
  t = 4 * R
  j = 1 + R
  while (j < n && j <= t) {
    k = randomInt(j) // random integer >= 0 and < j
    if (k < R) res[k] = data[j]
    j = j + 1
  }
  // Once gaps become significant, it pays to do gap sampling
  while (j < n) {
    // draw gap size (g) from geometric distribution with probability p = R/j
    p = R / j
    u = randomFloat() // random float > 0 and <= 1
    g = floor(log(u) / log(1 - p))
    j = j + g
    if (j < n) {
      k = randomInt(R)
      res[k] = data[j]
    }
    j = j + 1
  }
  // return the reservoir
  return res
}
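The pseudocode translates fairly directly into Scala. The following is a sketch under my own assumptions rather than the code used for the measurements in this post: it uses 0-based indexing (so index j holds element number j+1, which is sampled with probability R/(j+1)), takes the random source as a parameter, and guards against gap values that would overflow an Int. The name ReservoirGapSketch is illustrative only:

```scala
import scala.util.Random

object ReservoirGapSketch {
  def reservoirFast[T](data: IndexedSeq[T], r: Int, rng: Random = new Random()): Vector[T] = {
    require(r > 0 && data.length >= r)
    val n = data.length
    // Initialize the reservoir with the first r elements.
    val res = data.take(r).toBuffer
    val t = 4 * r
    var j = r // 0-based index of the next candidate element
    // Traditional sampling until the threshold: keep element with probability r/(j+1).
    while (j < n && j < t) {
      val k = rng.nextInt(j + 1)
      if (k < r) res(k) = data(j)
      j += 1
    }
    // Gap sampling: draw the number of skipped elements from a geometric distribution.
    while (j < n) {
      val p = r.toDouble / (j + 1)
      val u = 1.0 - rng.nextDouble() // uniform on (0, 1]
      val g = math.floor(math.log(u) / math.log1p(-p))
      if (g < n - j) {
        j += g.toInt
        res(rng.nextInt(r)) = data(j)
        j += 1
      } else {
        j = n // the gap runs past the end of the data
      }
    }
    res.toVector
  }
}
```

For example, ReservoirGapSketch.reservoirFast((0 until 100000).toVector, 10) returns ten distinct elements of the input.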
Following is a plot that shows two-sample D statistics, comparing the distribution in sample gaps between runs of the exact “naive” reservoir sampling with the fast geometric approximation:
As expected, the measured difference in sampling characteristics between the naive algorithm and the fast approximation is small, confirming the numeric predictions.
Since the point of this exercise was to achieve faster random sampling, it remains to measure what kind of speed improvements the fast approximation provides. As a point of reference, here is a plot of run times for reservoir sampling over 10^{8} integers:
As expected, sample time remains constant at around 1.5 seconds, regardless of reservoir size, since the naive algorithm always draws from its RNG once per data element.
Compare this to the corresponding plot for the fast geometric approximation:
Firstly, we see that the sampling times are much faster, as originally anticipated in my previous post – in the neighborhood of 3 orders of magnitude faster. Secondly, we see that the sampling times do increase as a linear function of reservoir size. Based on our experience with Bernoulli gap sampling, this is expected; the sampling probabilities are given by (R/j), and therefore the amount of sampling is proportional to R.
Another property anticipated in my previous post was that the efficiency of gap sampling should continue to increase as the amount of data sampled grows; the sampling probability being (R/j), the probability of sampling decreases as j gets larger, and so the corresponding gap sizes grow. The following plot verifies this property, holding reservoir size R constant, and increasing the data size:
The sampling time (per million elements) decreases as the sample size grows, as predicted by the formula.
In conclusion, I have demonstrated that a geometric distribution can be used as a high quality approximation to the true sampling gap distribution for reservoir sampling, which allows reservoir sampling to be performed much faster than the naive algorithm while still retaining sampling quality.
(update) Since I wrote this post, the code has evolved into a library on the isarn project. The original source files, containing the exact code fragments discussed in the remainder of this post, are preserved for posterity here.
This post eventually became a bit more sprawling and “tl;dr” than I was expecting, so by way of apology, here is a table of contents with links:
The skeptical programmer may be wondering what the point of Yet Another Map Collection really is, much less an entire class hierarchy. The use case that inspired this work was my project of implementing the t-digest algorithm. Discussion of t-digest is beyond the scope of this post, but suffice it to say that constructing a t-digest requires the maintenance of a collection of “cluster” objects that needs to satisfy the following several properties:
Properties 2, 3 and 6 are commonly satisfied by a map structure backed by some variety of balanced tree representation, of which the best-known is the Red-Black tree.
Properties 1, 4 and 5 are more interesting. Property 1 – representing a collection of multiple objects at each entry – can be accomplished in a generalizable way by noting that a collection is representable as a monoid. Supporting values that can be incremented with respect to a user-supplied monoid relation therefore satisfies property 1, and also supports many other kinds of update, including but not limited to classical numeric incrementing operations.
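To make the monoid idea concrete, here is a minimal sketch. It is not the library's API: the Monoid trait, the two instances, and the increment function are hypothetical names, and a plain immutable Map stands in for the tree-backed collection. The same increment operation handles both numeric counting and accumulating a list of objects at each key:

```scala
object IncrementSketch {
  // Hypothetical minimal monoid: an identity element plus an associative combine.
  trait Monoid[V] {
    def zero: V
    def combine(a: V, b: V): V
  }

  // Classical numeric incrementing.
  val intAddition: Monoid[Int] = new Monoid[Int] {
    def zero: Int = 0
    def combine(a: Int, b: Int): Int = a + b
  }

  // A collection of objects at each entry, combined by concatenation.
  def listConcat[A]: Monoid[List[A]] = new Monoid[List[A]] {
    def zero: List[A] = Nil
    def combine(a: List[A], b: List[A]): List[A] = a ::: b
  }

  // Increment the value at key k by v; a missing key starts from the monoid's zero.
  def increment[K, V](m: Map[K, V], k: K, v: V)(mon: Monoid[V]): Map[K, V] =
    m.updated(k, mon.combine(m.getOrElse(k, mon.zero), v))
}
```

Incrementing key "a" by 2 and then by 3 under intAddition leaves 5 at that key, while incrementing a key with List("x") and then List("y") under listConcat leaves both objects stored at the entry.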
Properties 4 and 5 – nearest-entry queries and prefix-sum queries – are also both supportable in logarithmic time using a tree data structure, provided that tree is balanced. Again, the details of the algorithms are out of the current scope; however, they are not extremely complex, and their implementations are available in the code.
A reader with their software engineering hat on will notice that these properties are orthogonal. A programmer might be interested in a data structure supporting any one of them, or in some mixed combination. This kind of situation fairly shouts “Scala traits” (or, alternatively, interfaces in Java, etc). With that idea in mind, I designed a system of Scala collection traits that support all of the above properties, in a pure trait form that is fully “mixable” by the programmer, so that one can use exactly the properties needed, but not pay for anything else.
The library consists broadly of 3 kinds of traits:
For the programmer who wishes to either create a trait mixture, or add new mixable traits, the collections also function as reference implementations.
The three tables that follow summarize the currently available traits of each kind listed above. They are (at the time of this posting) all under the package namespace com.redhat.et.silex.maps:
| trait | subpackage | description |
|---|---|---|
| Node[K] | redblack.tree | Fundamental Red-Black tree functionality |
| NodeMap[K,V] | ordered.tree | Support a mapping from keys to values |
| NodeNear[K] | nearest.tree | Nearest-entry query (key-only) |
| NodeNearMap[K,V] | nearest.tree | Nearest-entry query for key/value maps |
| NodeInc[K,V] | increment.tree | Increment values w.r.t. a monoid |
| NodePS[K,V,P] | prefixsum.tree | Prefix sum queries by key (w.r.t. a monoid) |

| trait | subpackage | description |
|---|---|---|
| OrderedSetLike[K,IN,M] | ordered | ordered set of keys |
| OrderedMapLike[K,V,IN,M] | ordered | ordered key/value map |
| NearestSetLike[K,IN,M] | nearest | nearest entry query on keys |
| NearestMapLike[K,V,IN,M] | nearest | nearest entry query on key/value map |
| IncrementMapLike[K,V,IN,M] | increment | increment values w.r.t. a monoid |
| PrefixSumMapLike[K,V,P,IN,M] | prefixsum | prefix sum queries w.r.t. a monoid |

| trait | subpackage | description |
|---|---|---|
| OrderedSet[K] | ordered | ordered set |
| OrderedMap[K,V] | ordered | ordered key/value map |
| NearestSet[K] | nearest | ordered set with nearest-entry query |
| NearestMap[K,V] | nearest | ordered map with nearest-entry query |
| IncrementMap[K,V] | increment | ordered map with value increment w.r.t. a monoid |
| PrefixSumMap[K,V,P] | prefixsum | ordered map with prefix sum query w.r.t. a monoid |
The following diagram summarizes the organization and inheritance relationships of the classes.
The most fundamental trait in this hierarchy is the trait that embodies Red-Black balancing; a “red-blackness” trait, as it were. This trait supplies the axiomatic tree operations of insertion, deletion and key lookup, where the Red-Black balancing operations are encapsulated for insertion (due to Chris Okasaki) and deletion (due to Stefan Kahrs). Note that Red-Black trees do not assume a separate value, as in a map, but require only keys (thus implementing an ordered set over the key type):
I will assume most readers are familiar with basic binary tree operations, and the Red-Black rules are described elsewhere (I adapted them from the Scala red-black implementation). For the purposes of this discussion, the most interesting feature is that this is a pure Scala trait. All val declarations are abstract. This trait, by itself, cannot function without a subclass to eventually perform dependency injection. However, this abstraction allows the trait to be inherited freely – any programmer can inherit from this trait and get a basic Red-Black balanced tree for (nearly) free, as long as a few basic principles are adhered to for proper dependency injection.
Another detail to call out is the abstraction of the usual key with a Data element. This element represents any node payload that is moved around as a unit during tree structure manipulations, such as balancing pivots. In the case of a map-like subclass, Data is extended to include a value field as well as a key field.
The other noteworthy detail is the abstract definition def iNode(color: Color, d: Data[K], lsub: Node[K], rsub: Node[K]): INode[K] – this is the function called to create any new tree node. In fact, this function, when eventually instantiated, is what performs dependency injection of other tree node fields.
A relatively simple example of node inheritance is hopefully instructive. Here is the definition for tree nodes supporting a key/value map:
Note that in this case very little is added to the red/black functionality already provided by Node[K]. A DataMap[K,V] trait is defined to add a value field in addition to the key, and the internal node INodeMap[K,V] refines the type of its data field to be DataMap[K,V]. The semantics is little more than “tree nodes now carry a value in addition to a key.”
A tree node trait inherits from its own parent class and the corresponding traits for any mixed-in functionality. So for example INodeMap[K,V] inherits from NodeMap[K,V] but also INode[K].
Continuing with the ordered map example, here is the definition of the collection trait for an ordered map:
You can see that this trait supplies collection API methods that a Scala programmer will recognize as being standard for any map-like collection. Note that this trait also inherits other standard methods from OrderedLike[K,IN,M] (common to both sets and maps) and also inherits from NodeMap[K,V]: in other words, a collection is effectively yet another kind of tree node, with additional collection API methods mixed in. Note also the use of “self types” (the type parameter M), which allows the collection to return objects of its own kind. This is crucial for allowing operations like data insertion to return an object that also supports node insertion, and to maintain consistency of type across operations. The collection type is properly “closed” with respect to its own operations.
To conclude the ordered map example, consider the task of defining a concrete instantiation of an ordered map:
You can see that (aside from a convenience override of toString) the trait OrderedMap[K,V] is nothing more than a vehicle for instantiating a particular concrete OrderedMapLike[K,V,IN,M] subtype, with particular concrete types for internal node (INodeMap[K,V]) and its own self-type.
Things become a little more interesting inside the companion object OrderedMap:
Note that the object returned by the factory method is upcast to OrderedMap[K,V], but in fact has the more complicated type: InjectMap[K,V] with LNodeMap[K,V] with OrderedMap[K,V]. There are a couple of things going on here. The trait LNodeMap[K,V] ensures that the new object is in particular a leaf node, which embodies a new empty tree in the Red-Black tree system.
The type InjectMap[K,V] has an even more interesting purpose. Here is its definition:
Firstly, note that it is a bona fide class, as opposed to a trait. This class is where, finally, all things abstract are made real – “dependency injection” in the parlance of Scala idioms. You can see that it defines the implementation of abstract method iNode, and that it does this by returning yet another InjectMap[K,V] object, mixed with both INodeMap[K,V] and OrderedMap[K,V], thus maintaining closure with respect to all three slices of functionality: dependency injection, the proper type of internal node, and map collection methods.
The various abstract val fields color, data, lsub and rsub are all given concrete values inside of iNode. Here is where the value of concrete “reference” implementations manifests. Any fields in the relevant internal-node type must be instantiated here, and the logic of instantiation cannot be inherited while still preserving the ability to mix abstract traits. Therefore, any programmer wishing to create a new concrete subclass must replicate the logic for instantiating all inherited fields in an internal node.
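The injection pattern can be illustrated with a deliberately stripped-down sketch. This is not the library's code: it is a hypothetical, unbalanced ordered set with no colors and no Data payload, and the names InjectionSketch and Inject are made up for illustration. The point is only to show abstract node traits whose fields and node factory are left abstract, a concrete class that supplies them, and a result that stays closed under its own operations:

```scala
object InjectionSketch {
  // Abstract tree node: new nodes are always created through the abstract factory iNode.
  trait Node[K] {
    def ordering: Ordering[K]
    def iNode(key: K, lsub: Node[K], rsub: Node[K]): INode[K]
    def insert(k: K): Node[K]
    def contains(k: K): Boolean
    def size: Int
  }

  // Leaf node: embodies an empty tree.
  trait LNode[K] extends Node[K] {
    def insert(k: K): Node[K] = iNode(k, this, this)
    def contains(k: K): Boolean = false
    def size: Int = 0
  }

  // Internal node: its fields are abstract vals, to be injected later.
  trait INode[K] extends Node[K] {
    val key: K
    val lsub: Node[K]
    val rsub: Node[K]
    def insert(k: K): Node[K] = {
      val c = ordering.compare(k, key)
      if (c < 0) iNode(key, lsub.insert(k), rsub)
      else if (c > 0) iNode(key, lsub, rsub.insert(k))
      else this
    }
    def contains(k: K): Boolean = {
      val c = ordering.compare(k, key)
      if (c < 0) lsub.contains(k) else if (c > 0) rsub.contains(k) else true
    }
    def size: Int = lsub.size + 1 + rsub.size
  }

  // The concrete class: the one place where abstract members are made real.
  class Inject[K](val ordering: Ordering[K]) {
    def iNode(k: K, ls: Node[K], rs: Node[K]): INode[K] =
      new Inject[K](ordering) with INode[K] {
        val key = k
        val lsub = ls
        val rsub = rs
      }
  }

  // Factory: a leaf node with injection mixed in.
  def empty[K](implicit ord: Ordering[K]): Node[K] = new Inject[K](ord) with LNode[K]
}
```

Every node produced by insert is itself an Inject mixed with a node trait, so the factory travels with the tree as it grows.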
Another example makes the implications more clear. Here is the definition of injection for a collection that mixes in all three traits for incrementable values, nearest-key queries, and prefix-sum queries:
Here you can see that all logic for both “basic” internal nodes and also for maintaining prefix sums, and key min/max information for nearest-entry queries, must be supplied. If there is a singularity in this design, here is where it is. The saving grace is that it is localized into a single well-defined place, and any logic can be transcribed from a proper reference implementation of whatever traits are being mixed.
I will conclude by showing the code for mixing tree node traits and collection traits, which is elegant. Here are type definitions for tree nodes and collection traits that inherit from incrementable values, nearest-key queries, and prefix-sum queries, and there is almost no code except the proper inheritances:
If the following ideas interest you at all, I highly recommend looking at the ‘refined’ project authored by Frank S. Thomas, which generalizes on the ideas below and supports additional static checking functionalities via macros.
As a working example, I’ll discuss a non-negative integer type NonNegInt. My proposed definition is sufficiently lightweight to view as a single code block:
The notable properties and features of NonNegInt are:
- NonNegInt is a value class around an Int, and so invokes no actual object construction or allocation.
- The factory method NonNegInt(v) constructs a non-negative integer value.
- An implicit conversion takes Int values to NonNegInt.
- There is no way to construct a NonNegInt that contains a negative integer value.
- An implicit conversion takes NonNegInt back to Int. Moving back and forth between Int and NonNegInt is effectively transparent.

The above properties work to make NonNegInt very lightweight with respect to size and runtime properties, and semantically safe in the sense that it is impossible to construct one with a negative value inside it.
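A definition with these properties can be sketched as follows. This is my own minimal reconstruction from the properties listed above, not necessarily the original listing: the enclosing object NonNegSketch, the conversion names fromInt and toInt, and the example function element are assumptions made for illustration:

```scala
import scala.language.implicitConversions

object NonNegSketch {
  // Value class: wraps an Int without allocating an object at runtime.
  // The private constructor prevents wrapping a negative value directly.
  class NonNegInt private (val value: Int) extends AnyVal

  object NonNegInt {
    // Factory method: the only way in, and it checks for negative values.
    def apply(v: Int): NonNegInt = {
      require(v >= 0, s"NonNegInt was given negative value $v")
      new NonNegInt(v)
    }
    // Implicit conversions in both directions.
    implicit def fromInt(v: Int): NonNegInt = apply(v)
    implicit def toInt(n: NonNegInt): Int = n.value
  }

  // The signature alone documents that j must be non-negative.
  def element[T](seq: Seq[T], j: NonNegInt): T = seq(j)
}
```

With this in scope, NonNegSketch.element(Seq("a", "b", "c"), 1) accepts a plain Int argument, while passing -1 fails the require check inside the implicit conversion before the function body runs.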
NonNegInt
I primarily envision NonNegInt as an easy and informative way to declare function parameters that are only well defined for non-negative values, without the need to write any explicit checking code, and yet allowing the programmer to call the function with normal Int values, due to the implicit conversions:
This short example demonstrates some appealing properties of NonNegInt. Firstly, the constraint that index j >= 0 is enforced via the type definition, and so the programmer does not have to write the usual require(j >= 0, ...) check (or worry about forgetting it). Secondly, the implicit conversion from Int to NonNegInt means the programmer can just provide a regular integer value for parameter j, instead of having to explicitly say NonNegInt(1). Third, the implicit conversion from NonNegInt to Int means that j can easily be used anywhere a regular Int is used. Last, and very definitely not least, the fact that function element requires a non-negative integer is obvious right in the function signature. There is no need for a programmer to guess whether j can be negative, and no need for the author of element to document that j cannot be negative. Its type makes that completely clear.
In this post I’ve laid out some advantages of defining lightweight non-negative numeric types, in particular using NonNegInt as a working example. Clearly, if you want to apply this idea, you’d want to also define NonNegLong, NonNegDouble, NonNegFloat and for that matter PosInt, PosLong, etc. Happy computing!
Another popular sampling algorithm is Reservoir Sampling. Its sampling logic is a bit more complicated than Bernoulli or Poisson sampling, in the sense that the probability of sampling any given (jth) element changes. For a sampling reservoir of size R, and all j>R, the probability of choosing element (j) is R/j. You can see that the potential payoff for gap sampling is big, particularly as data size becomes large; as (j) approaches infinity, the probability R/j goes to zero, and the corresponding gaps between samples grow without bound.
Modeling a sampling gap distribution is a powerful tool for optimizing a sampling algorithm, but it requires that (1) you actually know the sampling distribution, and (2) that you can effectively draw values from that distribution faster than just applying a random process to drawing each data element.
With that goal in mind, I derived the probability mass function (pmf) and cumulative distribution function (cdf) for the sampling gap distribution of reservoir sampling. In this post I will show the derivations.
In the interest of making it easy to get at the actual answers, here are the pmf and cdf for the Reservoir Sampling Gap Distribution. For a sampling reservoir of size (R), starting at data element (j), the probability distribution of the sampling gap is:
In the derivations that follow, I will keep to some conventions:
P(k) is the probability that the gap between one sample and the next is of size k. The support for P(k) is over all k>=0. I will generally assume that j>R, as the first R samples are always loaded into the reservoir and the actual random sampling logic starts at j=R+1. The constraint j>R will also be relevant to many binomial coefficient expressions, where it ensures the coefficient is well defined.
Suppose we just chose (randomly) to sample data element (j-1). Now we are interested in the probability distribution of the next sampling gap. That is, the probability P(k) that we will not sample the next (k) elements {j, j+1, …, j+k-1}, and sample element (j+k):
By arranging the product terms in descending order as above, you can see that they can be written as factorial quotients:
Now we apply Lemma A. The 2nd case (a<=b) of the Lemma applies, since (j-1-R)<=j, so we have:
And so we have now derived a compact, closed-form expression for P(k).
Now that we have a derivation for the pmf P(k), we can tackle a derivation for the cdf. First I will make note of this useful identity that I scraped off of Wikipedia (I substituted (x) => (a) and (k) => (b)):
The cumulative distribution function for the sampling gap, F(k), is of course just the sum over P(t), for (t) from 0 up to (k):
This is a closed-form solution, but we can apply a bit more simplification:
We have derived closed-form expressions for the pmf and cdf of the Reservoir Sampling gap distribution:
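Following the conventions stated earlier (reservoir size R, gap starting at element j with j > R, and k >= 0), the results can be written as below. This is a reconstruction worked directly from the first-principles product, so the particular arrangement of the factorial terms is an assumption and may differ cosmetically from the original figures:

```latex
% pmf: skip elements j .. j+k-1, then sample element j+k
P(k) \;=\; \frac{R}{j+k}\prod_{i=0}^{k-1}\frac{j+i-R}{j+i}
     \;=\; R\,\frac{(j-1)!}{(j-1-R)!}\,\frac{(j+k-1-R)!}{(j+k)!}

% cdf: one minus the probability of skipping all of elements j .. j+k
F(k) \;=\; \sum_{t=0}^{k} P(t)
     \;=\; 1-\prod_{i=0}^{k}\frac{j+i-R}{j+i}
     \;=\; 1-\frac{(j-1)!}{(j-1-R)!}\,\frac{(j+k-R)!}{(j+k)!}
```

As a quick check, both expressions give R/j at k = 0, which is the probability of sampling element (j) immediately.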
In order to apply these results to a practical gap-sampling implementation of Reservoir Sampling, we would next need a way to efficiently sample from P(k), to obtain gap sizes to skip over. How to accomplish this is an open question, but knowing a formula for P(k) and F(k) is a start.
Many thanks to RJ Nowling and Will Benton for proofreading and moral support! Any remaining errors are my own fault.
The process of implementing the Kendall’s Tau statistic, with my software engineer’s hat on, caused me to reflect a bit on how it could be generalized beyond the traditional application of ranking numeric pairs. In this post I’ll discuss the generalization of Kendall’s Tau to non-numeric data, and also generalizing from totally ordered data to partial orderings.
I’ll start with a brief review of Kendall’s Tau. For more depth, a good place to start is the Wikipedia article at the link above.
Consider a sequence of (n) observations where each observation is a pair (x,y), where we wish to measure how well a ranking by x-values agrees with a ranking by the y-values. Informally, Kendall’s Tau (aka the Kendall Rank Correlation Coefficient) is the difference between the number of observation-pairs (pairs of pairs, if you will) whose ordering agrees (“concordant” pairs) and the number of such pairs whose ordering disagrees (“discordant” pairs). This difference is divided by the total number of observation pairs.
The commonly-used formulation of Kendall’s Tau is the “Tau-B” statistic, which accounts for observed pairs having tied values in either x or y as being neither concordant nor discordant:
The formulation above has quadratic complexity, with respect to data size (n). It is possible to rearrange this computation in a way that can be computed in (n)log(n) time[1]:
The details of performing this computation can be found at [1] or on the Wikipedia entry. For my purposes, I’ll note that it requires two (n)log(n) sorts of the data, which becomes relevant below.
Generalizing Kendall’s Tau to non-numeric values is mostly just making the observation that the definition of “concordant” and “discordant” pairs is purely based on comparing x-values and y-values (and, in the (n)log(n) formulation, performing sorts on the data). From the software engineer’s perspective this means that the computations are well defined on any data type with an ordering relation, which includes numeric types but also chars, strings, sequences of any element supporting an ordering, etc. Significantly, most programming languages support the concept of defining ordering relations on arbitrary data types, which means that Kendall’s Tau can, in principle, be computed on literally any kind of data structure, provided you supply it with a well-defined ordering. Furthermore, an examination of the algorithms shows that values of x and y need not even be of the same type, nor do they require the same ordering.
When I brought this observation up, my colleague Will Benton asked the very interesting question of whether it’s also possible to compute Kendall’s Tau on objects that have only a partial ordering. It turns out that you can define Kendall’s Tau on partially ordered data, by defining the case of two non-comparable x-values, or y-values, as another kind of tie.
The big caveat with this definition is that the (n)log(n) optimization does not apply. Firstly, the optimized algorithm relies heavily on (n)log(n) sorting, and there is no unique full sorting of elements that are only partially ordered. Secondly, the formula’s definition of the quantities n1, n2 and n3 is founded on the assumption that element equality is transitive; this is why you can count a number of tied values, t, and use t(t-1)/2 as the corresponding number of tied pairs. But in a partial ordering, this assumption is violated. Consider the case where (a) < (b), but (a) is non-comparable to (c) and (b) is also non-comparable to (c). By our definition, (a) is tied with (c), and (c) is tied with (b), but transitivity is violated, as (a) < (b).
So how can we compute Tau in this case? Consider (n1) and (n2), in Figure 1. These values represent the number of pairs that were tied w.r.t. (x) and (y), respectively. We can’t use the shortcut formulas for (n1) and (n2), but we can count them directly, pair by pair, simply by conducting the traditional quadratic iteration over pairs, and incrementing (n1) whenever two x-values are non-comparable, and incrementing (n2) whenever two y-values are non-comparable, just as we increment (nc) and (nd) to count concordant and discordant pairs. With this modification, we can apply the formula in Figure 1 as-is.
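The quadratic counting procedure can be sketched in Scala using the standard PartialOrdering type, whose tryCompare returns None for non-comparable values. This is an illustrative sketch rather than code from this post: the names KendallSketch, tauB and subsetOrdering are my own, and I assume the usual Tau-B form (nc - nd) / sqrt((n0 - n1)(n0 - n2)) with n0 = n(n-1)/2:

```scala
object KendallSketch {
  // Tau-B by direct pair counting; equal or non-comparable values count as ties.
  def tauB[X, Y](data: IndexedSeq[(X, Y)], px: PartialOrdering[X], py: PartialOrdering[Y]): Double = {
    val n = data.length
    var nc = 0L // concordant pairs
    var nd = 0L // discordant pairs
    var n1 = 0L // pairs tied (or non-comparable) in x
    var n2 = 0L // pairs tied (or non-comparable) in y
    for (i <- 0 until n; j <- (i + 1) until n) {
      val cx = px.tryCompare(data(i)._1, data(j)._1).fold(0)(c => Integer.signum(c))
      val cy = py.tryCompare(data(i)._2, data(j)._2).fold(0)(c => Integer.signum(c))
      if (cx == 0) n1 += 1
      if (cy == 0) n2 += 1
      if (cx != 0 && cy != 0) { if (cx == cy) nc += 1 else nd += 1 }
    }
    val n0 = n.toLong * (n - 1) / 2
    (nc - nd).toDouble / math.sqrt((n0 - n1).toDouble * (n0 - n2).toDouble)
  }

  // Example partial ordering: sets ordered by inclusion.
  val subsetOrdering: PartialOrdering[Set[Int]] = new PartialOrdering[Set[Int]] {
    def lteq(a: Set[Int], b: Set[Int]): Boolean = a.subsetOf(b)
    def tryCompare(a: Set[Int], b: Set[Int]): Option[Int] =
      if (a == b) Some(0)
      else if (a.subsetOf(b)) Some(-1)
      else if (b.subsetOf(a)) Some(1)
      else None
  }
}
```

Any Ordering is also a PartialOrdering, so totally ordered data works unchanged by passing, say, Ordering.Int for both arguments. Note that the result is NaN when every pair is tied in x or in y, since the denominator is then zero.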
I made these observations without any particular application in mind. However, my instincts as a software engineer tell me that making generalizations in this way often paves the way for new ideas, once the generalized concept is made available. With luck, it will inspire either me or somebody else to apply Kendall’s Tau in interesting new ways.
[1] Knight, W. (1966). “A Computer Method for Calculating Kendall’s Tau with Ungrouped Data”. Journal of the American Statistical Association 61 (314): 436–439. doi:10.2307/2282833. JSTOR 2282833.
K-Medoids clustering is a relative of K-Means clustering that does not require an algebra over input data elements. That is, K-Medoids requires only a distance metric defined on elements in the data space, and can cluster objects which do not have a well-defined concept of addition or division that is necessary for computing the centroids required by K-Means. For example, K-Medoids can cluster character strings, which have a notion of distance, but no notion of summation that could be used to compute a geometric centroid.
This additional generality comes at a cost. The medoid of a collection of elements is the member of the collection that minimizes some function F of the distances from that element to all the other elements in the collection. For example, F might be the sum of distances from one element to all the elements, or perhaps the maximum distance, etc. It is not hard to see that the cost of computing a medoid of (n) elements is quadratic in (n): Evaluating F is linear in (n) and F in turn must be evaluated with respect to each element. Furthermore, unlike centroid-based computations used in K-Means, computing a medoid does not naturally lend itself to common scale-out computing formalisms such as Spark RDDs, due to the full-cross-product nature of the computation.
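As a concrete illustration of the quadratic cost, here is a minimal medoid sketch that takes F to be the sum of distances. The names MedoidSketch and medoid are illustrative and are not taken from the implementation discussed in this post:

```scala
object MedoidSketch {
  // Medoid with F = sum of distances: for each of the n candidates we make
  // n metric evaluations, hence the quadratic cost.
  def medoid[T](data: Seq[T])(metric: (T, T) => Double): T = {
    require(data.nonEmpty, "medoid of an empty collection is undefined")
    data.minBy(e => data.iterator.map(x => metric(e, x)).sum)
  }
}
```

Only the metric function is required, so the same code applies to strings under an edit distance just as well as to numeric points.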
With this in mind, a more traditional multithreading approach is a good candidate to achieve some practical parallelism on modern multicore hardware. I’ll demonstrate that this is easy to implement in Scala with parallel sequences.
Consider a baseline non-parallel implementation of K-Medoids, as in the following example skeleton code. (A working version of this code, under review at the time of this post, can be viewed here.)
If we run the code above (de-skeletonized), then we might see something like this output from our benchmarking, where I clustered a dataset of 40,000 randomly-generated (x,y,z) points by Gaussian sampling around 5 chosen centers. (This data is numeric, but I provide only a distance metric on the points. K-Medoids has no knowledge of the data except that it can run the given metric function on it):
Observe that cluster sizes are generally not the same, and we can see the time per cluster varying quadratically with respect to cluster size.
Studying our non-parallel code above, we can see that the computation of each new medoid is independent, which makes it a likely place to inject some parallelism. A Scala sequence can be transformed into a corresponding parallel sequence using the par method, and so parallelizing our code is literally this simple:
In this block, I also apply .seq at the end, which is not always necessary but can avoid type mismatches between Seq[T] and ParSeq[T] under some circumstances.
In my case I also wish to exercise some control over the threading used by the parallelism, and so I explicitly assign a ForkJoinPool thread pool to the sequence:
Minor grievance: it would be nice if Scala supported some ‘inline’ methods, like seq.par(n)... and seq.par(threadPool)..., instead of requiring the programmer to break the flow of the code to invoke tasksupport =, which returns Unit.
Now that we’ve parallelized our K-Medoids training, we should see how well it responds to additional threads. I ran the above parallelized version using {1, 2, 4, 8, 16, 32} threads, on a machine with 40 cores, so that my benchmarking would not be impacted by attempting to run more threads than there are cores to support them. I also ran two versions of test data. The first I generated with clusters of equal size (5 clusters of ~8000 elements), and the second with one cluster being twice as large (1 cluster of ~13300 and 4 clusters of ~6700). Following is a plot of throughput (iterations / second) versus threads:
In the best of all possible worlds, our throughput would increase linearly with the number of threads; double the threads, double our iterations per second. Instead, our throughput starts to increase nicely as we add threads, but hits a hard ceiling at 8 threads. It is not hard to see why: our parallelism is limited by the number of elements in our collection of clusters. In our case that is k = 5, and so we reach our ceiling at 8 threads, the first thread number >= 5. Furthermore, we see that when the size of clusters is unequal, the throughput suffers even more. The time required to complete the clustering is dominated by the most expensive element, which in our case is the cluster that is twice the size of the other clusters:
Fortunately it is not hard to improve on this situation. If parallelizing by cluster is too coarse, we can try pushing our parallelism down one level of granularity. In our case, that means parallelizing the outer loop of our medoid function, and it is just as easy as before:
Note that I retained the previous parallelism at the cluster level, otherwise the algorithm would execute parallel medoids, but one cluster at a time. Also observe that we are applying the same thread pool we supplied to the ParSeq at the cluster level. Scala’s parallel logic can utilize the same thread pool at multiple granularities without blocking. This makes it very clean to control the total number of threads used by some computation, by simply reusing the same threadpool across all points of parallelism.
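The idea of bounding total parallelism with one shared pool can also be sketched without parallel sequences. The following uses standard-library Futures on a java.util.concurrent.ForkJoinPool instead of the par method and tasksupport described above, so it is a swapped-in technique for illustration only; the names ParallelMedoidSketch and medoids are hypothetical, and for brevity it parallelizes only at the cluster level:

```scala
import java.util.concurrent.ForkJoinPool
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object ParallelMedoidSketch {
  // Compute one medoid per cluster, with at most `threads` worker threads in total.
  def medoids[T](clusters: Seq[Seq[T]], threads: Int)(metric: (T, T) => Double): Seq[T] = {
    val pool = new ForkJoinPool(threads)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
    try {
      // One task per cluster; every task runs on the same bounded pool.
      val work = Future.traverse(clusters) { c =>
        Future(c.minBy(e => c.iterator.map(x => metric(e, x)).sum))
      }
      Await.result(work, Duration.Inf)
    } finally {
      pool.shutdown()
    }
  }
}
```

Because every task is submitted to the same pool, the thread count passed in is an upper bound on the threads used by the whole computation, which mirrors the reuse of a single thread pool across levels described above.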
Now, when we rerun our experiment, we see that our throughput continues to increase as we add threads. The following plot illustrates the throughput increasing in comparison to the previous ceiling, and also that throughput is less sensitive to the cluster size, as threads can be allocated flexibly across clusters as they are available:
I hope this short case study has demonstrated how easy it is to add multithreading to computations with Scala parallel sequences, and some considerations for making the best use of available threads. Happy Parallel Programming!
To establish a bit of context, consider this simple example that obtains a function and serializes it to disk, and which does behave as expected:
object Demo extends App {
  // Serialize any object to the named file using Java serialization
  def write[A](obj: A, fname: String): Unit = {
    import java.io._
    val out = new ObjectOutputStream(new FileOutputStream(fname))
    try out.writeObject(obj) finally out.close()
  }

  object foo {
    val v = 42
    // The returned function includes 'v' in its closure
    def f() = (x: Int) => v * x
  }

  // The function 'f' will serialize as expected
  val f = foo.f()
  write(f, "/tmp/demo.f")
}
When this app is compiled and run, it will serialize f to “/tmp/demo.f”, which of course includes the value of v as part of the closure for f.
$ scalac -d /tmp closures.scala
$ scala -cp /tmp Demo
$ ls /tmp/demo*
/tmp/demo.f
Now, imagine you wanted to make a straightforward change, where object foo becomes class foo:
object Demo extends App {
  // Serialize any object to the named file using Java serialization
  def write[A](obj: A, fname: String): Unit = {
    import java.io._
    val out = new ObjectOutputStream(new FileOutputStream(fname))
    try out.writeObject(obj) finally out.close()
  }

  // foo is a class instead of an object
  class foo() {
    val v = 42
    // The returned function includes 'v' in its closure, but also a secret surprise
    def f() = (x: Int) => v * x
  }

  // This will throw an exception!
  val f = new foo().f()
  write(f, "/tmp/demo.f")
}
It would be reasonable to expect that this minor variation behaves exactly as the previous one, but instead it throws an exception!
$ scalac -d /tmp closures.scala
$ scala -cp /tmp Demo
java.io.NotSerializableException: Demo$foo
If we look at the exception message, we see that it’s complaining about not knowing how to serialize objects of class foo. But we weren’t including any values of foo in the closure for f, only a particular member ‘v’! What gives? Scala is not very helpful with diagnosing this problem, but when a class member value shows up in a closure that is defined inside the class body, the entire instance, including any and all other member values, is included in the closure. This is because a reference to v inside the class body is shorthand for this.v: a class may have any number of instances, so the closure has to hold on to the particular instance (this) in order to resolve the correct member value.
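The capture of `this` can be observed directly with a small probe. The sketch below is only an illustration: `CaptureProbe`, `canSerialize`, `viaMember` and `viaLocal` are hypothetical names, and the local-copy variant is included purely as a contrast to show that the member reference is what drags the instance along.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object CaptureProbe {
  // True if Java serialization accepts the object, false if it meets a non-serializable value
  def canSerialize(obj: AnyRef): Boolean = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try { out.writeObject(obj); true }
    catch { case _: NotSerializableException => false }
    finally out.close()
  }

  class Foo { // deliberately not Serializable
    val v = 42
    // 'v' here means 'this.v', so the closure holds a reference to 'this'
    def viaMember(): Int => Int = (x: Int) => v * x
    // Copying to a local value first: the closure holds only an Int
    def viaLocal(): Int => Int = { val localV = v; (x: Int) => localV * x }
  }

  def main(args: Array[String]): Unit = {
    val foo = new Foo
    println(s"member closure serializable:     ${canSerialize(foo.viaMember())}")
    println(s"local-copy closure serializable: ${canSerialize(foo.viaLocal())}")
  }
}
```

One would expect the first closure to be rejected, since it carries the non-serializable `Foo` instance, and the second to be accepted.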
One straightforward way to fix this is to simply make class foo serializable:
class foo() extends Serializable {
  // ...
}
If you make this change to the above code, the example with class foo now works correctly, but it is working by serializing the entire foo instance, not just the value of v.
In many cases, this is not a problem and will work fine. Serializing a few additional members may be inexpensive. In other cases, however, it can be an impractical or impossible option. For example, foo might include other very large members, which will be expensive or outright impossible to serialize:
class foo() extends Serializable {
  val v = 42 // easy to serialize
  val w = 4.5 // easy to serialize
  val data = (1 to 1000000000).toList // serialization landmine hiding in your closure
  // The returned function includes all of 'foo' instance in its closure
  def f() = (x: Int) => v * x
}
A variation on the above problem is class members that are small or moderate in size, but serialized many times. In this case, the serialization cost can become intractable via repetition of unwanted inclusions.
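To get a feel for how much an enclosing instance adds to each serialized closure, one can measure the serialized size in memory. This is a minimal sketch under stated assumptions: `ClosureSize`, `serializedSize`, `Small` and `Large` are hypothetical names, and the 100000-element array simply stands in for any bulky member.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object ClosureSize {
  // Number of bytes Java serialization produces for an object
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  class Small extends Serializable {
    val v = 42
    def f(): Int => Int = (x: Int) => v * x
  }

  class Large extends Serializable {
    val v = 42
    val data = Array.fill(100000)(0) // rides along with every closure that uses 'v'
    def f(): Int => Int = (x: Int) => v * x
  }

  def main(args: Array[String]): Unit = {
    println(s"closure from Small: ${serializedSize(new Small().f())} bytes")
    println(s"closure from Large: ${serializedSize(new Large().f())} bytes")
  }
}
```

Both functions compute the same thing, but every serialized copy of the second one also carries the whole array, so the overhead is multiplied by the number of times the closure is serialized.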
Another potential problem is class members that are not serializable, and perhaps not under your control:
class foo() extends Serializable {
  import some.library.NotSerializable // stand-in for a class you don't control
  val v = 42 // easy to serialize
  val n = new NotSerializable // I'll hide in your closure and fail to serialize
  // The returned function includes all of 'foo' instance in its closure
  def f() = (x: Int) => v * x
}
There is a relatively painless way to decouple values from their parent instance, so that only desired values are included in a closure. Passing desired values as parameters to a shim function whose job is to assemble the closure will prevent the parent instance from being pulled into the closure. In the following example, a shim function named closureFunction is defined for this purpose:
object Demo extends App {
  // Serialize any object to the named file using Java serialization
  def write[A](obj: A, fname: String): Unit = {
    import java.io._
    val out = new ObjectOutputStream(new FileOutputStream(fname))
    try out.writeObject(obj) finally out.close()
  }

  // apply a generator to create a function with safe decoupled closures
  def closureFunction[E,D,R](enclosed: E)(gen: E => (D => R)): D => R = gen(enclosed)

  class NotSerializable {}

  class foo() {
    val v1 = 42
    val v2 = 73
    val n = new NotSerializable
    // use shim function to enclose *only* the values of 'v1' and 'v2'
    def f() = closureFunction((v1, v2)) { enclosed =>
      // these locals shadow the members, so the function below never touches 'this'
      val (v1, v2) = enclosed
      (x: Int) => (v1 + v2) * x // Desired function, with 'v1' and 'v2' enclosed
    }
  }

  // This will work!
  val f = new foo().f()
  write(f, "/tmp/demo.f")
}
Being aware of the scenarios where parent instances are pulled into closures, and how to keep your closures clean, can save some frustration and wasted time. Happy programming!