Consider how we annotate and refer to release builds for a Scala project:
The *version* of Scala -- 2.10, 2.11, etc -- that was used to build the project is a *qualifier* for the release.
For example, if I am building a project using Scala 2.11, and package P is one of my project dependencies, then the Maven build tooling (or sbt, etc) looks for a version of P that was *also* built using Scala 2.11;
the build will fail if no such incarnation of P can be located.
This build constraint propagates recursively throughout the entire dependency tree for a project.

Now consider how we treat the version for the package P dependency itself:
Our build tooling forces us to specify one exact release version x.y.z for P.
This is superficially similar to the constraint for building with Scala 2.11, but *unlike* the Scala constraint, the knowledge about using P x.y.z is not propagated down the tree.

If the dependency for P appears only once in the dependency tree, everything is fine.
However, as anybody who has ever worked with a large dependency tree for a project knows, package P might very well appear in multiple locations of the dep-tree, as a transitive dependency of different packages.
Worse, these deps may be specified as *different versions* of P, which may be mutually incompatible.

Transitive dep incompatibilities are a particularly thorny problem to solve, but there are other annoyances related to release versioning. Often a user would like a "major" package dependency built against a particular version of that dep. For example, packages that use Apache Spark may need to work with a particular build version of Spark (2.1, 2.2, etc). If I am the package purveyor, I have no convenient way to build my package against multiple versions of Spark and then annotate those builds in Maven Central. At best I can bake the Spark version into the name. But what if I want to specify other package dep versions? Do I create package names with increasingly long lists of (package, version) pairs hacked into the name?

Finally, there is simply the annoyance of revving my own package purely for the purpose of building it against the latest versions of my dependencies.
None of my code has changed, but I am cutting a new release just to pick up current dependency releases.
And then hoping that my package users will want those particular releases, and that these won't break *their* builds with incompatible transitive deps!

I have been toying with a release and build methodology for avoiding these headaches. What follows is full of vigorous hand-waving, but I believe something like it could be formalized in a useful way.

The key idea is that a release *build* is defined by a *build signature*, which is the union of all `(dep, ver)` pairs.
This includes:

- The actual release version of the package code, e.g. `(mypackage, 1.2.3)`
- The `(dep, ver)` for all dependencies (taken over all transitive deps, recursively)
- The `(tool, ver)` for all impactful build tooling, e.g. `(scala, 2.11)`, `(python, 3.5)`, etc

For example, if I maintain a package `P`, whose latest code release is `1.2.3`,
built with dependencies `(A, 0.5)`, `(B, 2.5.1)` and `(C, 1.7.8)`, and dependency `B` built against `(Q, 6.7)` and `(R, 3.3)`,
and `C` built against `(Q, 6.7)`, and all compiled with `(scala, 2.11)`, then the build signature will be:

`{ (P, 1.2.3), (A, 0.5), (B, 2.5.1), (C, 1.7.8), (Q, 6.7), (R, 3.3), (scala, 2.11) }`

Identifying a release build in this way makes several interesting things possible.
First, it can identify a build with a transitive dependency problem.
For example, if `C` had been built against `(Q, 7.0)`,
then the resulting build signature would have *two* pairs for `Q`: `(Q, 6.7)` and `(Q, 7.0)`,
which is an immediate red flag for a potential problem.
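As a rough illustration of the idea, here is a minimal Scala sketch that computes a build signature as a recursive union of pairs and flags any name that appears with more than one version. The `Build` case class and the function names are hypothetical, invented for this sketch, and versions are treated as opaque strings:

```scala
object BuildSignature {
  // A (name, version) pair; versions are opaque strings in this sketch.
  type Pair = (String, String)

  // Hypothetical description of one already-built package and its direct deps.
  final case class Build(name: String, version: String, deps: Seq[Build], tools: Set[Pair])

  // The signature is the union of all (dep, ver) and (tool, ver) pairs, taken recursively.
  def signature(b: Build): Set[Pair] =
    b.deps.foldLeft(b.tools + (b.name -> b.version))((sig, d) => sig ++ signature(d))

  // Names that appear with more than one version in a signature.
  def conflicts(sig: Set[Pair]): Map[String, Set[String]] =
    sig.groupBy(_._1).map { case (n, ps) => n -> ps.map(_._2) }.filter(_._2.size > 1)

  // The example from the text: P depends on A, B and C; B on Q and R; C on Q.
  val scala211: Set[Pair] = Set("scala" -> "2.11")
  val q67 = Build("Q", "6.7", Nil, scala211)
  val q70 = Build("Q", "7.0", Nil, scala211)
  val r = Build("R", "3.3", Nil, scala211)
  val a = Build("A", "0.5", Nil, scala211)
  val b = Build("B", "2.5.1", Seq(q67, r), scala211)
  def p(q: Build): Build =
    Build("P", "1.2.3", Seq(a, b, Build("C", "1.7.8", Seq(q), scala211)), scala211)
}
```

With these definitions, `signature(p(q67))` yields the seven pairs listed above, while `conflicts(signature(p(q70)))` reports `Q` with both `6.7` and `7.0`.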

More intriguingly, it could provide a foundation for *avoiding* builds with incompatible dependencies.
Suppose that I redefine my build logic so that I only specify dependency package names, and not specific versions.
Whenever I build a project, the build system automatically searches for the most-recent version of each dependency.
This already addresses some of the release headaches above.
As a project builder, I can get the latest versions of packages when I build.
As a package maintainer, I do not have to rev a release just to update my package deps;
projects using my package will get the latest by default.
Moreover, because the latest package release is always pulled, I never get multiple incompatible dependency releases
in a build.

Suppose that for some reason I *need* a particular release of some dependency.
From the example above, imagine that I must use `(Q, 6.7)`.
We can imagine augmenting the build specification to allow overriding the default behavior of pulling the most recent release.
We might either specify a specific version as we do currently, or possibly specify a range of releases, as systems like brew or ruby gemfiles allow.
In the case where some constraint is placed on releases, this constraint would be propagated down the tree (or possibly up from the leaves),
in essentially the same way that the constraint of Scala version is already.
In the event that the total set of constraints over the whole dependency tree is not satisfiable, then the build will fail.
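To make the default-to-latest rule and its overrides a little more concrete, here is a small Scala sketch. It assumes versions are dotted integers and models each constraint as a predicate gathered from anywhere in the dependency tree; the names `parse`, `Constraint` and `resolve` are hypothetical and not part of any existing build tool:

```scala
object Resolve {
  // Versions as dotted integers, e.g. "6.10" -> Vector(6, 10); an assumption of this sketch.
  def parse(v: String): Vector[Int] = v.split('.').toVector.map(_.toInt)

  // Compare versions component-wise; if one is a prefix of the other, the shorter sorts first.
  val versionOrdering: Ordering[Vector[Int]] = new Ordering[Vector[Int]] {
    def compare(x: Vector[Int], y: Vector[Int]): Int = {
      val firstDiff = x.zip(y).map { case (a, b) => Integer.compare(a, b) }.find(_ != 0)
      firstDiff.getOrElse(Integer.compare(x.length, y.length))
    }
  }

  // A constraint is a predicate on a parsed version.
  type Constraint = Vector[Int] => Boolean

  // Most recent available version satisfying every constraint; None means the build fails.
  def resolve(available: Seq[String], constraints: Seq[Constraint]): Option[String] =
    available
      .map(v => (parse(v), v))
      .filter { case (pv, _) => constraints.forall(c => c(pv)) }
      .sortBy(_._1)(versionOrdering)
      .lastOption
      .map(_._2)
}
```

With no constraints the most recent release wins; an exact pin or a range simply narrows the candidates, and an empty result corresponds to an unsatisfiable set of constraints.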

With a build annotation system like the one I just described, one could imagine a new role for registries like Maven Central, where different builds are automatically cached. The registry could maybe even automatically run CI testing to identify the most-recent versions of package dependencies that satisfy any given package build, or perhaps valid dependency release ranges.

To conclude, I believe that re-thinking how we describe the dependencies used to build and annotate package releases, by generalizing release version to include the release version of all transitive deps (including build tooling as deps), may enable more flexible ways to both build software releases and specify them for pulling.

Happy Computing!

In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. "What are you doing?", asked Minsky. "I am training a randomly wired neural net to play Tic-tac-toe", Sussman replied. "Why is the net wired randomly?", asked Minsky. "I do not want it to have any preconceptions of how to play", Sussman said. Minsky then shut his eyes. "Why do you close your eyes?" Sussman asked his teacher. "So that the room will be empty." At that moment, Sussman was enlightened.

Recently I've been doing some work with the t-digest sketching algorithm, from the paper by Ted Dunning and Otmar Ertl. One of the appealing properties of t-digest sketches is that you can "add" them together in the monoid sense to produce a combined sketch from two separate sketches. This property is crucial for sketching data across data partitions in scale-out parallel computing platforms such as Apache Spark or Map-Reduce.

In the original Dunning/Ertl paper, they describe an algorithm for monoidal combination of t-digests based on randomized cluster recombination. The clusters of the two input sketches are collected together, then randomly shuffled, and inserted into a new t-digest in that randomized order. In Scala code, this algorithm might look like the following:

```scala
import scala.util.Random.shuffle

def combine(ltd: TDigest, rtd: TDigest): TDigest = {
  // randomly shuffle input clusters and re-insert to a new t-digest
  shuffle(ltd.clusters.toVector ++ rtd.clusters.toVector)
    .foldLeft(TDigest.empty)((d, e) => d + e)
}
```

I implemented this algorithm and used it until I noticed that a sum over multiple sketches seemed to behave noticeably differently than either the individual inputs, or the nominal underlying distribution.

To get a closer look at what was going on, I generated some random samples from a Normal distribution ~N(0,1). I then generated t-digest sketches of each sample, took a cumulative monoid sum, and kept track of how closely each successive sum adhered to the original ~N(0,1) distribution. As a measure of the difference between a t-digest sketch and the original distribution, I computed the Kolmogorov-Smirnov D-statistic, which yields a distance between two cumulative distribution functions. (Code for my data collections can be viewed here) I ran multiple data collections and subsequent cumulative sums and used those multiple measurements to generate the following box-plot. The result was surprising and a bit disturbing:

As the plot shows, the t-digest sketch distributions are gradually *diverging* from the underlying "true" distribution ~N(0,1).
This is a potentially significant problem for the stability of monoidal t-digest sums, and by extension any parallel sketching based on combining the partial sketches on data partitions in map-reduce-like environments.
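For readers unfamiliar with the Kolmogorov-Smirnov D-statistic, the following Scala sketch shows the basic computation: the largest absolute gap between two cumulative distribution functions. It is not the code used for the measurements above; the names `dStatistic` and `empiricalCDF` are made up for this example, and evaluating the gap on a finite grid is only an approximation of the true supremum:

```scala
object KS {
  // Largest absolute gap between two CDFs, evaluated on a finite grid of points.
  def dStatistic(cdf1: Double => Double, cdf2: Double => Double, grid: Seq[Double]): Double =
    grid.map(x => math.abs(cdf1(x) - cdf2(x))).foldLeft(0.0)((m, d) => math.max(m, d))

  // Empirical CDF of a sample: the fraction of sample values <= x.
  def empiricalCDF(sample: Seq[Double]): Double => Double = {
    val n = sample.length.toDouble
    x => sample.count(_ <= x) / n
  }
}
```

In the experiment described here, one of the two CDFs would be the one reported by a t-digest sketch and the other the CDF of the reference distribution.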

Seeing this divergence motivated me to think about ways to avoid it. One property of t-digest insertion logic is that the results of inserting new data can differ depending on what clusters are already present. I wondered if the results might be more stable if the largest clusters were inserted first. The t-digest algorithm allows clusters closest to the distribution median to grow the largest. Combining input clusters from largest to smallest would be like building the combined distribution from the middle outwards, toward the distribution tails. In the case where one t-digest had larger weights, it would also somewhat approximate inserting the smaller sketch into the larger one. In Scala code, this alternative monoid addition looks like so:

```scala
def combine(ltd: TDigest, rtd: TDigest): TDigest = {
  // insert clusters from largest to smallest
  (ltd.clusters.toVector ++ rtd.clusters.toVector).sortWith((a, b) => a._2 > b._2)
    .foldLeft(TDigest.empty(delta))((d, e) => d + e)
}
```

As a second experiment, for each data sampling I compared the original monoid addition with the alternative method using largest-to-smallest cluster insertion. When I plotted the resulting progression of D-statistics side-by-side, the results were surprising:

As the plot demonstrates, not only was large-to-small insertion more stable, its D-statistics appeared to be getting *smaller* instead of larger.
To see if this trend was sustained over longer cumulative sums, I plotted the D-stats for cumulative sums over 100 samples:

The results were even more dramatic: these longer sums show that the standard randomized-insertion method continues to diverge, but in the case of large-to-small insertion the cumulative t-digest sums continue to converge towards the underlying distribution!

To test whether this effect might be dependent on particular shapes of distribution, I ran similar experiments using a Uniform distribution (no "tails") and an Exponential distribution (one tail). I included the corresponding plots in the appendix. The convergence of this alternative monoid addition doesn't seem to be sensitive to shape of distribution.

I have upgraded my implementation of t-digest sketching to use this new definition of monoid addition for t-digests. As you can see, it is easy to change one implementation for another. One or two lines of code may be sufficient. I hope this idea may be useful for any other implementations in the community. Happy sketching!

`map` operation is creating some non-trivial monoid that represents a single element of the input type.
For example, if the monoidal type is `Set[Int]`, then the mapping function ('prepare' in algebird) maps every input integer `k` into `Set(k)`, which is somewhat expensive.
In that discussion, I was focusing on map-reduce as embodied by the algebird `Aggregator` type, where `map` appears as the `prepare` function.
However, it is easy to see that *any* map-reduce implementation may be vulnerable to the same inefficiency.

I wondered if there were a way to represent map-reduce using some alternative formulation that avoids this vulnerability. There is such a formulation, which I will talk about in this post.

I'll begin by reviewing a standard map-reduce implementation.
The following Scala code sketches out the definition of a monoid over a type `B` and a map-reduce interface.
As this code suggests, the `map` function maps input data of some type `A` into some *monoidal* type `B`, which can be reduced (aka "aggregated") in a way that is amenable to parallelization:

```scala
import scala.collection.parallel.ParSeq

trait Monoid[B] {
  // aka 'combine' aka '++'
  def plus: (B, B) => B

  // aka 'empty' aka 'identity'
  def e: B
}

trait MapReduce[A, B] {
  // monoid embodies the reducible type
  def monoid: Monoid[B]

  // mapping function from input type A to reducible type B
  def map: A => B

  // the basic map-reduce operation
  def apply(data: Seq[A]): B = data.map(map).fold(monoid.e)(monoid.plus)

  // map-reduce parallelized over data partitions
  def apply(data: ParSeq[Seq[A]]): B =
    data.map { part =>
      part.map(map).fold(monoid.e)(monoid.plus)
    }
    .fold(monoid.e)(monoid.plus)
}
```
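As a concrete instance of the standard formulation, here is a small self-contained sketch for the `Set[Int]` case mentioned earlier. It restates a minimal `Monoid` so that it runs on its own, and it uses plain `Seq` partitions in place of `ParSeq`; the object and method names are only illustrative:

```scala
object StandardMapReduce {
  trait Monoid[B] { def plus: (B, B) => B; def e: B }

  // Set union as a monoid over Set[Int].
  val intSetMonoid: Monoid[Set[Int]] = new Monoid[Set[Int]] {
    val plus: (Set[Int], Set[Int]) => Set[Int] = _ ++ _
    val e: Set[Int] = Set.empty[Int]
  }

  // The map step wraps every input in a singleton Set before reduction.
  val map: Int => Set[Int] = k => Set(k)

  // Basic map-reduce: map each element, then reduce with the monoid.
  def mapReduce(data: Seq[Int]): Set[Int] =
    data.map(map).fold(intSetMonoid.e)(intSetMonoid.plus)

  // Partitioned version: reduce each partition, then reduce the partial results.
  def mapReducePartitioned(parts: Seq[Seq[Int]]): Set[Int] =
    parts.map(mapReduce).fold(intSetMonoid.e)(intSetMonoid.plus)
}
```

Note that every input integer allocates its own `Set(k)` before any reduction happens, which is the cost under discussion.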

In the parallel version of map-reduce above, you can see that map and reduce are executed on each data partition (which may occur in parallel) to produce a monoidal `B` value, followed by a final reduction of those intermediate results.
This is the classic form of map-reduce popularized by tools such as Hadoop and Apache Spark, where individual data partitions may reside across highly parallel commodity clusters.

Next I will present an alternative definition of map-reduce.
In this implementation, the `map` function is replaced by a `foldL` function, which executes a single "left-fold" of an input object with type `A` into the monoid object with type `B`:

```scala
// a map reduce operation based on a monoid with left folding
trait MapReduceLF[A, B] extends MapReduce[A, B] {
  def monoid: Monoid[B]

  // left-fold an object with type A into the monoid B
  // obeys type law: foldL(b, a) = b ++ foldL(e, a)
  def foldL: (B, A) => B

  // foldL(e, a) embodies the role of map(a) in standard map-reduce
  def map = (a: A) => foldL(monoid.e, a)

  // map-reduce operation is now a single fold-left operation
  override def apply(data: Seq[A]): B = data.foldLeft(monoid.e)(foldL)

  // map-reduce parallelized over data partitions
  override def apply(data: ParSeq[Seq[A]]): B =
    data.map { part =>
      part.foldLeft(monoid.e)(foldL)
    }
    .fold(monoid.e)(monoid.plus)
}
```

As the comments above indicate, the left-folding function `foldL` is assumed to obey the law `foldL(b, a) = b ++ foldL(e, a)`.
This law captures the idea that folding `a` into `b` should be the analog of reducing `b` with a monoid corresponding to the single element `a`.
Referring to my earlier example, if type `A` is `Int` and `B` is `Set[Int]`, then `foldL(b, a) => b + a`.
Note that `b + a` is directly inserting single element `a` into `b`, which is significantly more efficient than `b ++ Set(a)`, which is how a typical map-reduce implementation would be required to operate.

This law also gives us the corresponding definition of `map(a)`, which is `foldL(e, a)`, or in my example: `Set.empty[Int] + a` or just: `Set(a)`.

In this formulation, the basic map-reduce operation is now a single `foldLeft` operation, instead of a mapping followed by a monoidal reduction.
The parallel version is analogous.
Each partition uses the new `foldLeft` operation, and the final reduction of intermediate monoidal results remains the same as before.
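The `Set[Int]` example can be written out directly in this left-folding style. The sketch below is standalone rather than an instance of the traits above, and the helper `lawHolds` is a name invented here to spot-check the law `foldL(b, a) = b ++ foldL(e, a)` on particular values:

```scala
object LeftFoldMapReduce {
  // Monoid over Set[Int]: identity and combination.
  val e: Set[Int] = Set.empty[Int]
  val plus: (Set[Int], Set[Int]) => Set[Int] = _ ++ _

  // Fold one element directly into the set, with no intermediate Set(a).
  val foldL: (Set[Int], Int) => Set[Int] = (b, a) => b + a

  // Map-reduce as a single left fold.
  def apply(data: Seq[Int]): Set[Int] = data.foldLeft(e)(foldL)

  // Partitioned version: left-fold each partition, then reduce the partial sets.
  def partitioned(parts: Seq[Seq[Int]]): Set[Int] =
    parts.map(p => apply(p)).fold(e)(plus)

  // Spot-check of the law foldL(b, a) == b ++ foldL(e, a) for given b and a.
  def lawHolds(b: Set[Int], a: Int): Boolean = foldL(b, a) == plus(b, foldL(e, a))
}
```

Here `foldL(e, a)` plays the role of `map(a)`, but it is only ever evaluated inside `lawHolds`; the fold itself never builds a singleton set.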

The `foldLeft` function is potentially a much more general operation, and it raises the question of whether this new encoding is indeed parallelizable as before.
I will conclude with a proof that this encoding is also parallelizable.
Note that the law `foldL(b, a) = b ++ foldL(e, a)` is a significant component of this proof, as it represents the constraint that `foldL` behaves like an analog of reducing `b` with a monoidal representation of element `a`.

In the following proof I used a scala-like pseudo code, described in the introduction:

```
// given an object mr of type MapReduceLF[A, B]
// and using notation:
// f <==> mr.foldL
// for b1,b2 of type B: b1 ++ b2 <==> mr.monoid.plus(b1, b2)
// e <==> mr.monoid.e
// [...] <==> Seq(...)
// d1, d2 are of type Seq[A]

// Proof that map-reduce with left-folding is parallelizable
// i.e. mr(d1 ++ d2) == mr(d1) ++ mr(d2)
mr(d1 ++ d2)
== (d1 ++ d2).foldLeft(e)(f)  // definition of map-reduce operation
== d1.foldLeft(e)(f) ++ d2.foldLeft(e)(f)  // Lemma A
== mr(d1) ++ mr(d2)  // definition of map-reduce (QED)

// Proof of Lemma A
// i.e. (d1 ++ d2).foldLeft(e)(f) == d1.foldLeft(e)(f) ++ d2.foldLeft(e)(f)

// proof is by induction on the length of data sequence d2

// case d2 where length is zero, i.e. d2 == []
(d1 ++ []).foldLeft(e)(f)
== d1.foldLeft(e)(f)  // definition of empty sequence []
== d1.foldLeft(e)(f) ++ e  // definition of identity e
== d1.foldLeft(e)(f) ++ [].foldLeft(e)(f)  // definition of foldLeft

// case d2 where length is 1, i.e. d2 == [a] for some a of type A
(d1 ++ [a]).foldLeft(e)(f)
== f(d1.foldLeft(e)(f), a)  // definition of foldLeft and f
== d1.foldLeft(e)(f) ++ f(e, a)  // the type-law f(b, a) == b ++ f(e, a)
== d1.foldLeft(e)(f) ++ [a].foldLeft(e)(f)  // definition of foldLeft

// inductive step, assuming proof for d2' of length <= n
// consider d2 of length n+1, i.e. d2 == d2' ++ [a], where d2' has length n
(d1 ++ d2).foldLeft(e)(f)
== (d1 ++ d2' ++ [a]).foldLeft(e)(f)  // definition of d2, d2', [a]
== f((d1 ++ d2').foldLeft(e)(f), a)  // definition of foldLeft and f
== (d1 ++ d2').foldLeft(e)(f) ++ f(e, a)  // type-law f(b, a) == b ++ f(e, a)
== d1.foldLeft(e)(f) ++ d2'.foldLeft(e)(f) ++ f(e, a)  // induction
== d1.foldLeft(e)(f) ++ d2'.foldLeft(e)(f) ++ [a].foldLeft(e)(f)  // def'n of foldLeft
== d1.foldLeft(e)(f) ++ (d2' ++ [a]).foldLeft(e)(f)  // induction
== d1.foldLeft(e)(f) ++ d2.foldLeft(e)(f)  // definition of d2 (QED)
```
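The statement being proved can also be spot-checked numerically. The sketch below is not part of the proof; it uses a different monoid (integer addition) with a hypothetical lawful fold that adds string lengths, plus an unlawful fold for contrast, and tests `mr(d1 ++ d2) == mr(d1) ++ mr(d2)` at every split point of a sequence:

```scala
object ParallelCheck {
  // Monoid: integer addition with identity 0.
  val e: Int = 0
  val plus: (Int, Int) => Int = _ + _

  // Obeys the law: f(b, a) == b + f(0, a).
  val lawful: (Int, String) => Int = (b, a) => b + a.length

  // Violates the law: f(b, a) == 2 * b + a.length differs from b + f(0, a) when b != 0.
  val unlawful: (Int, String) => Int = (b, a) => 2 * b + a.length

  // Map-reduce as a single left fold with the given folding function.
  def mr(f: (Int, String) => Int)(d: Seq[String]): Int = d.foldLeft(e)(f)

  // True if mr(d1 ++ d2) == mr(d1) ++ mr(d2) for every way of splitting d into d1 and d2.
  def parallelizable(f: (Int, String) => Int)(d: Seq[String]): Boolean =
    (0 to d.length).forall { i =>
      val (d1, d2) = d.splitAt(i)
      mr(f)(d1 ++ d2) == plus(mr(f)(d1), mr(f)(d2))
    }
}
```

The lawful fold passes for any input, while the unlawful one fails as soon as a split leaves a non-empty left part, which is where the law is used in Lemma A.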

This immediately raised some interesting and thorny questions: in an ecosystem that contains not just algebird, but other popular alternatives such as cats and scalaz, what algebra API should I use in my code? How best to allow the library user to interoperate with the algebra library of their choice? Can I accomplish these things while also avoiding any problematic package dependencies in my library code?

In Scala, the second question is relatively straightforward to answer. I can write my interface using implicit conversions, and provide sub-packages that provide such conversions from popular algebra libraries into the library I actually use in my code. A library user can import the predefined implicit conversions of their choice, or if necessary provide their own.

So far so good, but that leads immediately back to the first question -- what API should *I* choose to use internally in my own library?

One obvious approach is to just pick one of the popular options (I might favor `cats`, for example) and write my library code using that.
If a library user also prefers `cats`, great.
Otherwise, they can import the appropriate implicit conversions from their favorite alternative into `cats` and be on their way.

But this solution is not without drawbacks.
Anybody using my library will now be including `cats` as a transitive dependency in their project, even if they are already using some other alternative.
Although `cats` is not an enormous library, that represents a fair amount of code sucked into my users' projects, most of which isn't going to be used at all.
More insidiously, I have now introduced the possibility that the `cats` version I package with is out of sync with the version my library users are building against.
Version misalignment in transitive dependencies is a land-mine in project builds and very difficult to resolve.

A second approach I might use is to define some abstract algebraic traits of my own. I can write my libraries in terms of this new API, and then provide implicit conversions from popular APIs into mine.

This approach has some real advantages over the previous. Being entirely abstract, my internal API will be lightweight. I have the option of including only the algebraic concepts I need. It does not introduce any possibly problematic 3rd-party dependencies that might cause code bloat or versioning problems for my library users.
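A minimal sketch of what this second approach could look like is shown below. Everything in it is hypothetical: `MyLib.Monoid` stands for the library's own abstract trait, `ThirdParty.Additive` is a stand-in for some community algebra library (not the API of cats, algebird or scalaz), and `ThirdPartyInterop` plays the role of the sub-package a user would import. The bridge is written as an implicit instance derivation, which is one common way of providing such conversions:

```scala
object MyLib {
  // The library's own minimal, purely abstract algebra API.
  trait Monoid[A] { def empty: A; def combine(x: A, y: A): A }

  // Library code is written only against the abstract trait.
  def combineAll[A](xs: Seq[A])(implicit m: Monoid[A]): A =
    xs.foldLeft(m.empty)(m.combine)
}

// Stand-in for a third-party algebra library (hypothetical, not a real API).
object ThirdParty {
  trait Additive[A] { def zero: A; def add(x: A, y: A): A }
  implicit val intAdditive: Additive[Int] = new Additive[Int] {
    def zero: Int = 0
    def add(x: Int, y: Int): Int = x + y
  }
}

// Interop sub-package: users import this to bridge their library into MyLib.
object ThirdPartyInterop {
  implicit def additiveToMonoid[A](implicit a: ThirdParty.Additive[A]): MyLib.Monoid[A] =
    new MyLib.Monoid[A] {
      def empty: A = a.zero
      def combine(x: A, y: A): A = a.add(x, y)
    }
}
```

A user who prefers the third-party library would import `ThirdParty._` and `ThirdPartyInterop._`, after which `MyLib.combineAll(Seq(1, 2, 3))` resolves its monoid through the bridge. Only `ThirdPartyInterop` depends on the third-party code, so the core library carries no such dependency.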

Although this is an effective solution, I find it dissatisfying for a couple reasons.
Firstly, my new internal API effectively represents *yet another competing algebra API*, and so I am essentially contributing to the proliferating-standards antipattern.

Secondly, it means that I am not taking advantage of community knowledge.
The `cats` library embodies a great deal of cumulative human expertise in both category theory and Scala library design.
What does a good algebra library API look like?
Well, *it's likely to look a lot like cats* of course!
The odds that I end up doing an inferior job designing my little internal vanity API are rather higher than the odds that I do as well or better.
The best I can hope for is to re-invent the wheel, with a real possibility that my wheel has corners.

Is there a way to resolve this unpalatable situation?
Can we design our projects to remain flexible about interfacing with multiple 3rd-party alternatives, while avoiding effectively writing *yet another alternative* for our own internal use?

I hardly have any authoritative answers to this problem, but I have one idea that might move toward a solution.
As I alluded to above, when I write my libraries, I am most frequently *only* interested in the API -- the abstract interface.
If I did go with writing my own algebra API, I would seek to define purely abstract traits.
Since my intention is that my library users would supply their own favorite library alternative, I would have no need or desire to instantiate any of my APIs.
That function would be provided by the separate sub-projects that provide implicit conversions from community alternatives into my API.

On the other hand, what if `cats` and `algebird` factored *their* libraries in a similar way?
What if I could include a sub-package like `cats-kernel-api`, or `algebird-core-api`, which contained *only* pure abstract traits for monoid, semigroup, etc?
Then I could choose my favorite community API, and code against it, with much less code bloat, and a much reduced vulnerability to any versioning drift.
I would still be free to provide implicit conversions and allow *my* users to make their own choice of library in their projects.

Although I find this idea attractive, it is certainly not foolproof.
For example, there is never a way to *guarantee* that versioning drift won't break an API.
APIs such as `cats` and `algebird` are likely to be unusually amenable to this kind of approach.
After all, their interfaces are primarily driven by underlying mathematical definitions, which are generally as stable as such things ever get.
However, APIs in general tend to be significantly more stable than underlying code.
And the most-stable subsets of APIs might be encoded as traits and exposed this way, allowing other more experimental API components to change at a higher frequency.
Perhaps library packages could even be factored in some way such as `library-stable-api` and `library-unstable-api`.
That would clearly add a bit of complication to library trait hierarchies, but the payoff in terms of increased 3rd-party usability might be worth it.

There are some varied approaches in the community for addressing the task of identifying a good number of clusters in a data set. In this post I want to focus on an approach that I think deserves more attention than it gets: Minimum Description Length.

Many years ago I ran across a superb paper by Stephen J. Roberts on anomaly detection that described a method for *automatically* choosing a good value for the number of clusters based on the principle of Minimum Description Length.
Minimum Description Length (MDL) is an elegant framework for evaluating the parsimony of a model.
The Description Length of a model is defined as the amount of information needed to encode that model, plus the encoding-length of some data, *given* that model.
Therefore, in an MDL framework, a good model is one that allows an efficient (i.e. short) encoding of the data, but whose *own* description is *also* efficient.
(This suggests connections between MDL and the idea of learning as a form of data compression.)

For example, a model that directly memorizes all the data may allow for a very short description of the data, but the model itself will clearly require at least the size of the raw data to encode, and so direct memorization models generally stack up poorly with respect to MDL. On the other hand, consider a model of some Gaussian data. We can describe these data in a length proportional to their log-likelihood under the Gaussian density. Furthermore, the description length of the Gaussian model itself is very short; just the encoding of its mean and standard deviation. And so in this case a Gaussian distribution represents an efficient model with respect to MDL.
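The Gaussian example can be made concrete with a short Scala sketch. It computes a two-part description length in nats: the negative log-likelihood of the data under a fitted Gaussian, plus a cost for the two model parameters. The object and method names are invented for this sketch, and the `(P / 2) * log(n)` form of the model cost is an assumption (a commonly used MDL penalty), not something taken from the text above:

```scala
object GaussianMDL {
  // Natural-log density of a Gaussian with mean mu and standard deviation sigma.
  def logDensity(x: Double, mu: Double, sigma: Double): Double = {
    val z = (x - mu) / sigma
    -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.Pi)
  }

  // Two-part description length in nats: data cost plus model cost.
  // Assumes at least two distinct values, so that the fitted sigma is non-zero.
  def descriptionLength(data: Seq[Double]): Double = {
    val n = data.length.toDouble
    val mu = data.sum / n
    val sigma = math.sqrt(data.map(x => (x - mu) * (x - mu)).sum / n)
    // Cost of encoding the data, given the fitted Gaussian.
    val dataCost = -data.map(x => logDensity(x, mu, sigma)).sum
    // Assumed cost of encoding P = 2 parameters (mean and standard deviation).
    val modelCost = (2.0 / 2.0) * math.log(n)
    dataCost + modelCost
  }
}
```

Because the data cost uses a continuous density, the result is a relative measure and can be negative for tightly concentrated data; what matters is the comparison between candidate models of the same data.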

**In summary, an MDL framework allows us to mathematically capture the idea that we only wish to consider increasing the complexity of our models if that buys us a corresponding increase in descriptive power on our data.**

In the case of Roberts' paper, the clustering model in question is a Gaussian Mixture Model (GMM), and the description length expression to be optimized can be written as:
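As a sketch, a standard two-part MDL expression that matches the term-by-term description in this section is given below. The symbols $L(X)$, $p(x)$ and $P$ are the ones used in the text; the $\frac{P}{2}\log|X|$ form of the model cost is an assumption based on the usual MDL penalty, and is not quoted from Roberts' paper:

```latex
L(X) = -\sum_{x \in X} \log p(x) + \frac{P}{2} \log |X|
```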

In this expression, X represents the vector of data elements.
The first term is the (negative) log-likelihood of the data, with respect to a candidate GMM having some number (K) of Gaussians; p(x) is the GMM density at point (x).
This term represents the cost of encoding the data, given that GMM.
The second term is the cost of encoding the GMM itself.
The value P is the number of free parameters needed to describe that GMM.
Assuming a dimensionality D for the data, then

I wanted to apply this same MDL principle to identifying a good value for K, in the case of a K-Medoids model.
How best to adapt MDL to K-Medoids poses some problems.
In the case of K-Medoids, the *only* structure given to the data is a distance metric.
There is no vector algebra defined on data elements, much less any ability to model the points as a Gaussian Mixture.

However, any candidate clustering of my data *does* give me a corresponding distribution of distances from each data element to its closest medoid.
I can evaluate an MDL measure on these distance values.
If adding more clusters (i.e. increasing K) does not sufficiently tighten this distribution, then its description length will start to increase at larger values of K, thus indicating that more clusters are not improving our model of the data.
Expressing this idea as an MDL formulation produces the following description length formula:

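Note that the first two terms are similar to the equation above; however, the underlying distribution $p(\lVert x - c_x \rVert)$ is now a distribution over the distance from each data element $x$ to its closest medoid $c_x$.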

And so, an MDL-based algorithm for automatically identifying a good number of clusters (K) in a K-Medoids model is to run a K-Medoids clustering on my data, for some set of potential K values, and evaluate the MDL measure above for each, and choose the model whose description length L(X) is the smallest!
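The selection loop just described might be sketched in Scala as follows. The clustering routine and the description-length measure are passed in as functions, since neither is defined here; `cluster`, `descriptionLength` and the other names are hypothetical stand-ins rather than the API of any particular library:

```scala
object ChooseK {
  // Distance from each element to its closest medoid, under a caller-supplied metric.
  def closestDistances[T](data: Seq[T], medoids: Seq[T], dist: (T, T) => Double): Seq[Double] =
    data.map(x => medoids.map(m => dist(x, m)).reduce((a, b) => math.min(a, b)))

  // Cluster for each candidate K and keep the K with the smallest description length.
  // `cluster(k)` returns the k medoids; `descriptionLength(distances, k)` scores a model.
  def chooseK[T](
      data: Seq[T],
      candidates: Seq[Int],
      cluster: Int => Seq[T],
      dist: (T, T) => Double,
      descriptionLength: (Seq[Double], Int) => Double): Int =
    candidates
      .map(k => (k, descriptionLength(closestDistances(data, cluster(k), dist), k)))
      .reduce((a, b) => if (b._2 < a._2) b else a)
      ._1
}
```

Only a distance function is required of the data type `T`, which matches the K-Medoids setting where no vector algebra is available.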

As I mentioned above, there is also an implied task of choosing a form (or a set of forms) for the distance distribution $p(\lVert x - c_x \rVert)$.

Another observation (based on my blog posts mentioned above) is that my use of the gamma distribution implies a bias toward cluster distributions that behave (more or less) like Gaussian clusters, and so in this respect its current behavior is probably somewhat analogous to the G-Means algorithm, which identifies clusterings that yield Gaussian distributions in each cluster.
Adding other candidates for distance distributions is a useful subject for future work, since there is no compelling reason to either favor or assume Gaussian-like cluster distributions over *all* kinds of metric spaces.
That said, I am seeing reasonable results even on data with clusters that I suspect are not well modeled as Gaussian distributions.
Perhaps the shape-coverage of the gamma distribution is helping to add some robustness.

To demonstrate the MDL-enhanced K-Medoids in action, I will illustrate its performance on some data sets that are amenable to graphic representation. The code I used to generate these results is here.

Consider this synthetic data set of points in 2D space. You can see that I've generated the data to have two latent clusters:

I collected the description-length values for candidate K-Medoids models having 1 up to 10 clusters, and plotted them. This plot shows that the clustering with minimal description length had 2 clusters:

When I plot that optimal clustering at K=2 (with cluster medoids marked in black-and-yellow), the clustering looks good:

To show the behavior for a different optimal value, the following plots demonstrate the MDL K-Medoids results on data where the number of latent clusters is 4:

A final comment on Minimum Description Length approaches to clustering -- although I focused on K-Medoids models in this post, the basic approach (and I suspect even the same description length formulation) would apply equally well to K-Means, and possibly other clustering models.
Any clustering model that involves a distance function from elements to some kind of cluster center should be a good candidate.
I intend to keep an eye out for applications of MDL to *other* learning models, as well.

[1] "Novelty Detection Using Extreme Value Statistics"; Stephen J. Roberts; Feb 23, 1999

[2] "Learning the k in k-means"; Hamerly, G., & Elkan, C.; Advances in Neural Information Processing Systems; 2004
