tool monkey

adventures of an unfrozen caveman programmer

The Backtracking ULP Incident of 2018


This week I finally started applying my new convex optimization library to solve for interpolating splines with monotonic constraints. Things seemed to be going well. My convex optimization code was passing its unit tests. My monotone splines were passing their unit tests too. I cut an initial release and announced it to the world.

Because Murphy rules my world, it was barely an hour later that I was playing around with my new toys in a REPL, and when I tried splining an example data set my library call went into an infinite loop:

// It looks mostly harmless:
double[] x = { 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0 };
double[] y = { 0.0, 0.15, 0.05, 0.3, 0.5, 0.7, 0.95, 0.98, 1.0 };
MonotonicSplineInterpolator interpolator = new MonotonicSplineInterpolator();
PolynomialSplineFunction s = interpolator.interpolate(x, y);

In addition to being a bit embarrassing, it was also a real head-scratcher. There was nothing odd about the data I had just given it. In fact it was a small variation of a problem it had just solved a few seconds prior.

There was nothing to do but put my code back up on blocks and break out the print statements. I ran my problem data set and watched it spin. Fast forward a half hour or so, and I had localized the problem to the bit of code that performs the “backtracking” phase of a convex optimization:

for (double t = 1.0 ; t >= epsilon ; t *= beta) {
    tx = x.add(xDelta.mapMultiply(t));
    tv = convexObjective.value(tx);
    if (tv == Double.POSITIVE_INFINITY) continue;
    if (tv <= v + t*alpha*gdd) {
        foundStep = true;
        break;
    }
}

My infinite loop was happening because my backtracking loop above was “succeeding” – that is, reporting it had found a forward step – but not actually moving forward along its vector. The reason turned out to be that my test tv <= v + t*alpha*gdd was succeeding because v + t*alpha*gdd was evaluating to just v, so I effectively had tv == v.

I had been bitten by one of the oldest floating-point fallacies: forgetting that x + y can equal x when y is smaller than the Unit in the Last Place (ULP) of x.
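
To make the failure mode concrete, here is a tiny Scala illustration (not code from the library) of a value washing out below the ULP of a larger value:

// a small demonstration of the ULP effect
val v = 1.0e16             // math.ulp(v) is 2.0 at this magnitude
val delta = 1.0            // smaller than ulp(v)
println(math.ulp(v))       // 2.0
println((v + delta) == v)  // true: delta is lost below the ULP of v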

This was an especially evil bug, as it very frequently doesn’t manifest. My unit testing in two libraries failed to trigger it. I have since added the offending data set to my splining unit tests, in case the code ever regresses somehow.

Now that I understood my problem, it turned out I could use this effect to my advantage, as an effective test for local convergence. If I can’t find a step size that reduces my local objective function by an amount measurable at floating-point resolution, then I am as good as converged at this stage of the algorithm. I rewrote my code to reflect this insight, and added some annotations so I don’t forget what I learned:

for (double t = 1.0; t > 0.0; t *= beta) {
    tx = x.add(xDelta.mapMultiply(t));
    tv = convexObjective.value(tx);
    if (Double.isInfinite(tv)) {
        // this is barrier convention for "outside the feasible domain",
        // so try a smaller step
        continue;
    }
    double vtt = v + (t * alpha * gdd);
    if (vtt == v) {
        // (t)(alpha)(gdd) is less than ULP(v)
        // Further tests for improvement are going to fail
        break;
    }
    if (tv <= vtt) {
        // This step resulted in an improvement, so halt with success
        foundStep = true;
        break;
    }
}

I tend to pride myself on being aware that floating-point numerics are a leaky abstraction, and of the various ways those leaks can show up in computations, but pride goeth before a fall, and after all these years I can still burn myself! It never hurts to be reminded that you can never let your guard down with floating-point numbers, and that unit testing can never guarantee correctness. That goes double for numeric methods!

Equality Constraints for Cubic B-Splines


In my previous post I derived the standard-form polynomial coefficients for cubic B-splines. As part of the same project, I also need to add a feature that allows the library user to declare equality constraints of the form (x,y), where S(x) = y. Under the hood, I am invoking a convex optimization library, and so I need to convert these user inputs to a linear equation form that is consumable by the optimizer.

I expected this to be tricky, but it turns out I did most of the work already. I can take one of my previously-derived expressions for S(x) and put it into a form that gives me coefficients for the four contributing knot points Kj-3 … Kj:

eq

Recall that by the convention from my previous post, Kj is the largest knot point that is <= x.

My linear constraint equation is with respect to the vector I am solving for, in particular vector (τ), and so the equation above yields the following:

eq

In this form, it is easy to add to a convex optimization problem as a linear equality constraint.
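
As a sketch of the general shape, writing c0 … c3 for the four basis-function coefficients evaluated at x (hypothetical names, standing in for the expressions above), each user-supplied pair (x, y) contributes one linear equation in the unknowns τ:

$$c_0\,\tau_{j-3} + c_1\,\tau_{j-2} + c_2\,\tau_{j-1} + c_3\,\tau_j \;=\; y$$

with all other entries of that constraint row equal to zero.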

Gradient constraints are another common equality constraint in convex optimization, and so I can apply very similar logic to get coefficient values corresponding to the gradient of S:

eq

And so my linear equality constraint with respect to (τ) in this case is:

eq

And that gives me the tools I need to let my users supply additional equality constraints as simple (x,y) pairs, and translate them into a form that can be consumed by convex optimization routines. Happy Computing!

Putting Cubic B-Splines into Standard Polynomial Form


Lately I have been working on an implementation of monotone smoothing splines, based on [1]. As the title suggests, this technique is based on a univariate cubic B-spline. The form of the spline function used in the paper is as follows:

eq1

The knot points Kj are all equally spaced by 1/α, and so α normalizes knot intervals to 1. The function B3(t) and the four Ni(t) are defined in this transformed space, t, of unit-separated knots.

I’m interested in providing interpolated splines via the Apache Commons Math API, in particular the PolynomialSplineFunction class. In principle the above is clearly such a polynomial spline, but there are a few hitches.

  1. PolynomialSplineFunction wants its knot intervals in closed standard polynomial form ax3 + bx2 + cx + d
  2. It wants each such polynomial expressed in the translated space (x-Kj), where Kj is the greatest knot point that is <= x.
  3. The actual domain of S(x) is K0 … Km-1. The first 3 “negative” knots are there to make the summation for S(x) cleaner. PolynomialSplineFunction needs its functions to be defined purely on the actual domain.

Consider the arguments to B3 for two adjacent knots Kj-1 and Kj, where Kj is the greatest knot point that is <= x. Recalling that knot points are all equally spaced by 1/α, we have the following relationship in the transformed space t:

$$\alpha\,(x - K_{j-1}) \;=\; \alpha\left(x - K_j + \tfrac{1}{\alpha}\right) \;=\; \alpha\,(x - K_j) + 1 \;=\; t + 1, \qquad \text{where } t = \alpha\,(x - K_j)$$

We can apply this same manipulation to show that the arguments to B3, as centered around knot Kj, are simply {… t+2, t+1, t, t-1, t-2 …}.

By the definition of B3 above, you can see that B3(t) is non-zero only for t in [0,4), and so only the four corresponding knot points Kj-3 … Kj contribute to the value of S(x):

eq2

This suggests a way to manipulate the equations into a standard form. In the transformed space t, the four nonzero terms are:

eq4

and by plugging in the appropriate Ni for each term, we arrive at:

eq5

Now, PolynomialSplineFunction is going to automatically identify the appropriate Kj and subtract it, and so I can define that transform as u = x - Kj, which gives:

eq6

I substitute the argument (αu) into the definitions of the four Ni to obtain:

eq7

Lastly, collecting like terms gives me the standard-form coefficients that I need for PolynomialSplineFunction:

eq8

Now I am equipped to return a PolynomialSplineFunction to my users, which implements the cubic B-spline that I fit to their data. Happy computing!
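
For reference, here is a hedged Scala sketch of how such standard-form coefficients might be handed to the Commons Math API; the coefficient arrays are assumed to already hold the values derived above, in ascending-degree order:

// a sketch of assembling a PolynomialSplineFunction from per-interval coefficients
// (knots has length n+1; coefs has one array of {d, c, b, a} per interval)
import org.apache.commons.math3.analysis.polynomials.{ PolynomialFunction, PolynomialSplineFunction }

def splineFromCoefficients(knots: Array[Double], coefs: Array[Array[Double]]): PolynomialSplineFunction = {
  // PolynomialFunction takes coefficients in ascending-degree order, evaluated at u = x - K(j),
  // which matches the translated form that PolynomialSplineFunction expects
  val polys = coefs.map(c => new PolynomialFunction(c))
  new PolynomialSplineFunction(knots, polys)
}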

References

[1] H. Fujioka and H. Kano: Monotone smoothing spline curves using normalized uniform cubic B-splines, Trans. Institute of Systems, Control and Information Engineers, Vol. 26, No. 11, pp. 389–397, 2013

Solving Feasible Points With Smooth-Max


Overture

Lately I have been fooling around with an implementation of the Barrier Method for convex optimization with constraints. One of the characteristics of the Barrier Method is that it requires an initial guess from inside the feasible region: that is, a point which is known to satisfy all of the inequality constraints provided by the user. For some optimization problems, it is straightforward to find such a point by using knowledge about the problem domain, but in many situations it is not at all obvious how to identify such a point, or even whether a feasible point exists. The feasible region might be empty!

Boyd and Vandenberghe discuss a couple of approaches to finding feasible points in §11.4 of [1]. These methods require you to set up an “augmented” minimization problem:

$$\begin{array}{ll} \underset{x,\,s}{\text{minimize}} & s \\ \text{subject to} & f_k(x) \le s, \quad k = 1, \dots, m \end{array}$$

As you can see from the above, you have to work in an “augmented” space (x, s), where (s) represents an additional dimension, and each constraint function fk is replaced by the augmented form fk − s.

The Problem

I experimented a little with these, and while I am confident they work for most problems having multiple inequality constraints, my unit testing tripped over an ironic deficiency: when I attempted to solve for a feasible point under a single planar constraint, the numerics went a bit haywire. Specifically, a linear constraint function has a singular Hessian of all zeroes. The final Hessian coming out of the log barrier function could still be consumed by SVD to get a search direction, but the resulting gradients behaved poorly.

Part of the problem seems to be that this augmented minimization problem works by pushing (s) ever downward, letting (s) transitively push down the fk through the augmented constraint functions fk − s. When only a single linear constraint function is in play, the resulting gradient caused the augmented dimension (s) to converge against the movement in the remaining (unaugmented) subspace. The minimization did not converge to a feasible point, even though literally half of the space – everything on one side of the planar surface – is feasible!

Smooth Max

Thinking about these issues made me wonder if a more direct approach was possible. Another way to think about this problem is to minimize the maximum of the fk: if the maximum fk is < 0 at a point x, then x is a feasible point satisfying all fk. If the smallest-possible maximum fk is > 0, then we have definitive proof that no feasible point exists, and our constraints can’t be satisfied.
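
Written out, the idea in the preceding paragraph is the mini-max problem:

$$x^{\star} \;=\; \operatorname*{argmin}_{x}\; \max_k f_k(x), \qquad \max_k f_k(x^{\star}) < 0 \;\Rightarrow\; x^{\star}\ \text{is feasible}, \qquad \min_x \max_k f_k(x) > 0 \;\Rightarrow\; \text{no feasible point exists}$$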

Taking a maximum preserves convexity, which is a good start, but maximum isn’t differentiable everywhere. The boundaries between regions where different functions are the maximum are not smooth, and along those boundaries there is no gradient, and therefore no Hessian either.

However, there is a variation on this idea, known as smooth-max, defined like so:

$$\operatorname{smoothmax}_{\alpha}(f_1, \dots, f_m)(x) \;=\; \frac{1}{\alpha} \log \sum_{k=1}^{m} e^{\alpha f_k(x)}$$

Smooth-max has a well-defined gradient and Hessian, and furthermore can be computed in a numerically stable way. The sum inside the logarithm above is a sum of exponentials of convex functions. This is good news: the exponential of a convex function is log-convex, a sum of log-convex functions is log-convex, and so the logarithm of that sum – the smooth-max itself – is convex.

That means I have the necessary tools to set up my mini-max problem: for a given set of convex constraint functions fk, I create a function which is the smooth-max of these, and I minimize it.

Go Directly to Jail

I set about implementing my smooth-max idea, and immediately ran into almost the same problem as before. If I try to solve for a single planar constraint, my Hessian degenerates to all zeros! When I unpacked the smooth-max formula for a single constraint fk, it is indeed just fk, zero Hessian and all!

More is More

What to do? Well, you know what form of constraint always has a well-behaved Hessian? A circle, that’s what. More technically, an n-dimensional ball, or n-ball. What if I add a new constraint of the form:

$$f_{\text{ball}}(x) \;=\; \tfrac{1}{2}\,\lVert x - x_c \rVert^2 \;-\; r^2$$

where x_c is the ball center and r its radius.

This constraint equation is quadratic, and its Hessian is In. If I include this in my set of constraints, my smooth-max Hessian will be non-singular!

Since I do not know a priori where my feasible point might lie, I start with my n-ball centered at my initial guess, and minimize. The result might look something like this:

fig1

Because the optimization is minimizing the maximum fk, the optimal point may not be feasible, but if not it will end up closer to the feasible region than before. This suggests an iterative algorithm, where I update the location of the n-ball at each iteration, until the resulting optimized point lies on the intersection of my original constraints and my additional n-ball constraint:

fig2

Caught in the Underflow

I implemented the iterative algorithm above (you can see what this loop looks like here), and it worked exactly as I hoped… at least on my initial tests. However, eventually I started playing with its convergence behavior by moving my constraint region farther from the initial guess, to see how it would cope. Suddenly the algorithm began failing again. When I drilled down on why, I was taken aback to discover that my Hessian matrix was once again showing up as all zeros!

The reason was interesting. Recall that I used a modified formula to stabilize my smooth-max computations. In particular, the “stabilized” formula for the Hessian looks like this:

$$\nabla^2 m \;=\; \frac{\sum_k e^{\alpha (f_k - z)} \left( \alpha\, \nabla f_k \nabla f_k^T + \nabla^2 f_k \right)}{\sum_k e^{\alpha (f_k - z)}} \;-\; \alpha\, \frac{\left( \sum_k e^{\alpha (f_k - z)}\, \nabla f_k \right) \left( \sum_k e^{\alpha (f_k - z)}\, \nabla f_k \right)^{T}}{\left( \sum_k e^{\alpha (f_k - z)} \right)^{2}}, \qquad z = \max_k f_k$$

So, what was going on? As I started moving my feasible region farther away, the corresponding constraint function started to dominate the exponential terms in the equation above. In other words, the distance to the feasible region became the (z) in these equations, and this z value was large enough to drive the terms corresponding to my n-ball constraint to zero!

However, I have a lever to mitigate this problem. If I make the α parameter small enough, it will compress these exponent ranges and prevent my n-ball Hessian terms from washing out. Decreasing α makes smooth-max more rounded out, and decreases the sharpness of the approximation to the true max, but minimizing smooth-max still drives the solution toward the same minimum location as the true maximum, and so playing this trick does not undermine my results.

How small is small enough? α is essentially a free parameter, but I found that if I set it at each iteration so that my n-ball Hessian coefficient never drops below 1e-3 (but may be larger), then my Hessian is always well behaved. Note that as my iterations grow closer to the true feasible region, I can gradually allow α to grow larger. Currently, I don’t allow α to grow larger than 1, to avoid creating excessively large curvatures, but I have not experimented deeply with what actually happens if it were allowed to grow larger. You can see what this looks like in my current implementation here.
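
As a rough sketch of this α-capping policy (my reading of the idea above, not necessarily the exact logic of the linked implementation), with z the largest constraint value and fBall the n-ball constraint value at the current point:

// choose alpha so the n-ball term's weight e^{alpha*(fBall - z)} in the stabilized
// smooth-max sums stays above a small floor (1e-3 here), and cap alpha at 1
def chooseAlpha(fBall: Double, z: Double, floor: Double = 1e-3): Double = {
  val gap = fBall - z   // <= 0 whenever a user constraint dominates the n-ball term
  if (gap < 0.0) math.min(1.0, math.log(floor) / gap) else 1.0
}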

Convergence

Tuning the smooth-max α parameter gave me numeric stability, but I noticed that as the feasible region grew more distant from my initial guess, the algorithm’s time to converge grew larger fairly quickly. When I studied its behavior, I saw that at large distances, the quadratic “cost” of my n-ball constraint effectively pulled the optimal point fairly close to my n-ball center. This doesn’t prevent the algorithm from finding a solution, but it does prevent it from going long distances very fast. To solve this adaptively, I added a scaling factor s to my n-ball constraint function. The scaled version of the function looks like:

$$f_{\text{ball}}(x) \;=\; s \left( \tfrac{1}{2}\,\lVert x - x_c \rVert^2 - r^2 \right)$$

In my case, when my distance to the feasible region grows large, I want s to become small, so that the cost of the n-ball constraint grows more slowly, allowing the optimization to move farther, faster. The following diagram illustrates this intuition:

fig3

In my algorithm, I set s = 1/σ, where σ represents the “scale” of the current distance to the feasible region. The n-ball function grows as the square of the distance to the ball center, so I set σ = k·sqrt(c), where c is the current largest user constraint cost; that way σ grows proportionally to the square root of that cost. Here, (k) is a proportionality constant. It too is a somewhat magic free parameter, but I have found that k = 1.5 yields fast convergence and good results. One last trick I play is to prevent σ from becoming less than a minimum value, currently 10. This ensures that my n-ball constraint never dominates the total constraint sum, even as the optimization converges close to the feasible region. I want my “true” user constraints to dominate the behavior near the optimum, since those are the constraints that matter. The code is shorter than the explanation: you can see it here.

Conclusion

After applying all these intuitions, the resulting algorithm appears to be numerically stable and also converges pretty quickly even when the initial guess is very far from the true feasible region. To review, you can look at the main loop of this algorithm starting here.

I’ve learned a lot about convex optimization and feasible point solving from working through practical problems as I made mistakes and fixed them. I’m fairly new to the whole arena of convex optimization, and I expect I’ll learn a lot more as I go. Happy Computing!

References

[1] §§11.3–11.4 of Convex Optimization, Boyd and Vandenberghe, Cambridge University Press, 2004

Computing Smooth Max and its Gradients Without Over- and Underflow


In my previous post I derived the gradient and Hessian for the smooth max function. The Notorious JDC wrote a helpful companion post that describes computational issues of overflow and underflow with smooth max; values of fk don’t have to grow very large (or small) before floating point limitations start to force their exponentials to +inf or zero. In JDC’s post he discusses this topic in terms of a two-valued smooth max. However it isn’t hard to generalize the idea to a collection of fk. Start by taking the maximum value over our collection of functions, which I’ll define as (z):

$$z \;=\; \max_k f_k(x), \qquad m \;=\; z + \frac{1}{\alpha} \log \sum_k e^{\alpha (f_k - z)}$$

As JDC described in his post, this alternative expression for smooth max (m) is computationally stable. Individual exponential terms may underflow to zero, but those are exactly the terms that are dominated by the others, and so approximating them by zero is numerically accurate. In the limit where one value dominates all the others, the result will be exactly the value given by (z).
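
In code, the stabilized value computation is a one-liner. Here is a minimal Scala sketch (just the value, with the fk already evaluated at the point of interest):

// numerically stable smooth-max of a collection of already-evaluated fk values
def smoothMax(f: Seq[Double], alpha: Double): Double = {
  val z = f.max
  // every exponent alpha*(fk - z) is <= 0, so the exponentials may underflow
  // harmlessly toward zero but can never overflow
  z + math.log(f.map(fk => math.exp(alpha * (fk - z))).sum) / alpha
}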

It turns out that we can play a similar trick with computing the gradient:

$$\nabla m \;=\; \frac{\sum_k e^{\alpha (f_k - z)}\, \nabla f_k}{\sum_k e^{\alpha (f_k - z)}}$$

Without showing the derivation, we can apply exactly the same manipulation to the terms of the Hessian:

$$\nabla^2 m \;=\; \frac{\sum_k e^{\alpha (f_k - z)} \left( \alpha\, \nabla f_k \nabla f_k^T + \nabla^2 f_k \right)}{\sum_k e^{\alpha (f_k - z)}} \;-\; \alpha\, \frac{\left( \sum_k e^{\alpha (f_k - z)}\, \nabla f_k \right) \left( \sum_k e^{\alpha (f_k - z)}\, \nabla f_k \right)^{T}}{\left( \sum_k e^{\alpha (f_k - z)} \right)^{2}}$$

And so we now have a computationally stable form of the equations for smooth max, its gradient and its Hessian. Enjoy!

The Gradient and Hessian of the Smooth Max Over Functions


Suppose you have a set of functions over a vector space, and you are interested in taking the smooth-maximum over those functions. For example, maybe you are doing gradient descent, or convex optimization, etc, and you need a variant on “maximum” that has a defined gradient. The smooth maximum function has both a defined gradient and Hessian, and in this post I derive them.

I am using the logarithm-based definition of smooth-max, shown here:

$$m(x) \;=\; \frac{1}{\alpha} \log \sum_k e^{\alpha f_k(x)} \;=\; \frac{1}{\alpha} \log \sum_k e^{\alpha f_k}$$

I will use the second variation above, ignoring function arguments, with the hope of increasing clarity. Applying the chain rule gives the ith partial gradient of smooth-max:

$$\frac{\partial m}{\partial x_i} \;=\; \frac{1}{\alpha} \cdot \frac{\sum_k \alpha\, e^{\alpha f_k}\, \frac{\partial f_k}{\partial x_i}}{\sum_k e^{\alpha f_k}} \;=\; \frac{\sum_k e^{\alpha f_k}\, \frac{\partial f_k}{\partial x_i}}{\sum_k e^{\alpha f_k}}$$

Now that we have an ith partial gradient, we can take the jth partial gradient of that to obtain the (i,j)th element of a Hessian:

$$\frac{\partial^2 m}{\partial x_i \partial x_j} \;=\; \frac{\sum_k e^{\alpha f_k} \left( \alpha\, \frac{\partial f_k}{\partial x_i} \frac{\partial f_k}{\partial x_j} + \frac{\partial^2 f_k}{\partial x_i \partial x_j} \right)}{\sum_k e^{\alpha f_k}} \;-\; \alpha\, \frac{\left( \sum_k e^{\alpha f_k} \frac{\partial f_k}{\partial x_i} \right) \left( \sum_k e^{\alpha f_k} \frac{\partial f_k}{\partial x_j} \right)}{\left( \sum_k e^{\alpha f_k} \right)^2}$$

This last re-grouping of terms allows us to see that we can express the full gradient and Hessian in the following more compact way:

Using weights $w_k = e^{\alpha f_k} / \sum_j e^{\alpha f_j}$:

$$\nabla m \;=\; \sum_k w_k\, \nabla f_k, \qquad \nabla^2 m \;=\; \sum_k w_k \left( \alpha\, \nabla f_k \nabla f_k^T + \nabla^2 f_k \right) \;-\; \alpha\, (\nabla m)(\nabla m)^T$$

With a gradient and Hessian, we now have the tools we need to use smooth-max in algorithms such as gradient descent and convex optimization. Happy computing!

Rethinking the Concept of Release Versioning


Recently I’ve been thinking about a few related problems with our current concepts of software release versioning, release dependencies and release building. These problems apply to software releases in all languages and build systems that I’ve experienced, but in the interest of keeping this post as simple as possible I’m going to limit myself to talking about the Maven ecosystem of release management and build tooling.

Consider how we annotate and refer to release builds for a Scala project: The version of Scala – 2.10, 2.11, etc – that was used to build the project is a qualifier for the release. For example, if I am building a project using Scala 2.11, and package P is one of my project dependencies, then the maven build tooling (or sbt, etc) looks for a version of P that was also built using Scala 2.11; the build will fail if no such incarnation of P can be located. This build constraint propagates recursively throughout the entire dependency tree for a project.

Now consider how we treat the version for the package P dependency itself: Our build tooling forces us to specify one exact release version x.y.z for P. This is superficially similar to the constraint for building with Scala 2.11, but unlike the Scala constraint, the knowledge about using P x.y.z is not propagated down the tree.

If the dependency for P appears only once in the dependency tree, everything is fine. However, as anybody who has ever worked with a large dependency tree for a project knows, package P might very well appear in multiple locations of the dep-tree, as a transitive dependency of different packages. Worse, these deps may be specified as different versions of P, which may be mutually incompatible.

Transitive dep incompatibilities are a particularly thorny problem to solve, but there are other annoyances related to release versioning. Often a user would like a “major” package dependency built against a particular version of that dep. For example, packages that use Apache Spark may need to work with a particular build version of Spark (2.1, 2.2, etc). If I am the package purveyor, I have no convenient way to build my package against multiple versions of Spark and then annotate those builds in Maven Central. At best I can bake the Spark version into the name. But what if I want to specify other package dep versions? Do I create package names with increasingly-long lists of (package, version) pairs hacked into the name?

Finally, there is simply the annoyance of revving my own package purely for the purpose of building it against the latest versions of my dependencies. None of my code has changed, but I am cutting a new release just to pick up current dependency releases. And then hoping that my package users will want those particular releases, and that these won’t break their builds with incompatible transitive deps!

I have been toying with a release and build methodology for avoiding these headaches. What follows is full of vigorous hand-waving, but I believe something like it could be formalized in a useful way.

The key idea is that a release build is defined by a build signature which is the union of all (dep, ver) pairs. This includes:

  1. The actual release version of the package code, e.g. (mypackage, 1.2.3)
  2. The (dep, ver) for all dependencies (taken over all transitive deps, recursively)
  3. The (tool, ver) for all impactful build tooling, e.g. (scala, 2.11), (python, 3.5), etc

For example, if I maintain a package P, whose latest code release is 1.2.3, built with dependencies (A, 0.5), (B, 2.5.1) and (C, 1.7.8), and dependency B built against (Q, 6.7) and (R, 3.3), and C built against (Q, 6.7) and all compiled with (scala, 2.11), then the build signature will be:

{ (P, 1.2.3), (A, 0.5), (B, 2.5.1), (C, 1.7.8), (Q, 6.7), (R, 3.3), (scala, 2.11) }

Identifying a release build in this way makes several interesting things possible. First, it can identify a build with a transitive dependency problem. For example, if C had been built against (Q, 7.0), then the resulting build signature would have two pairs for Q; (Q, 6.7) and (Q, 7.0), which is an immediate red flag for a potential problem.
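
A build signature in this form is also easy to check mechanically. Here is a small Scala sketch (hypothetical types, not an existing tool) that flags any package appearing with more than one version:

// represent a build signature as a set of (package, version) pairs, and
// report any package that appears with more than one version
case class Dep(name: String, version: String)

def conflicts(sig: Set[Dep]): Map[String, Set[String]] =
  sig.groupBy(_.name).collect {
    case (name, deps) if deps.size > 1 => (name, deps.map(_.version))
  }

val sig = Set(Dep("P", "1.2.3"), Dep("Q", "6.7"), Dep("Q", "7.0"))
conflicts(sig)  // Map(Q -> Set(6.7, 7.0))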

More intriguingly, it could provide a foundation for avoiding builds with incompatible dependencies. Suppose that I redefine my build logic so that I only specify dependency package names, and not specific versions. Whenever I build a project, the build system automatically searches for the most-recent version of each dependency. This already addresses some of the release headaches above. As a project builder, I can get the latest versions of packages when I build. As a package maintainer, I do not have to rev a release just to update my package deps; projects using my package will get the latest by default. Moreover, because the latest package release is always pulled, I never get multiple incompatible dependency releases in a build.

Suppose that for some reason I need a particular release of some dependency. From the example above, imagine that I must use (Q, 6.7). We can imagine augmenting the build specification to allow overriding the default behavior of pulling the most recent release. We might either specify a specific version as we do currently, or possibly specify a range of releases, as systems like brew or ruby gemfiles allow. In the case where some constraint is placed on releases, this constraint would be propagated down the tree (or possibly up from the leaves), in essentially the same way that the Scala version constraint already is. In the event that the total set of constraints over the whole dependency tree is not satisfiable, then the build will fail.

With a build annotation system like the one I just described, one could imagine a new role for registries like Maven Central, where different builds are automatically cached. The registry could maybe even automatically run CI testing to identify the most-recent versions of package dependencies that satisfy any given package build, or perhaps valid dependency release ranges.

To conclude, I believe that re-thinking how we describe the dependencies used to build and annotate package releases, by generalizing release version to include the release version of all transitive deps (including build tooling as deps), may enable more flexible ways to both build software releases and specify them for pulling.

Happy Computing!

Converging Monoid Addition for T-Digest


In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. “What are you doing?”, asked Minsky. “I am training a randomly wired neural net to play Tic-tac-toe”, Sussman replied. “Why is the net wired randomly?”, asked Minsky. “I do not want it to have any preconceptions of how to play”, Sussman said. Minsky then shut his eyes. “Why do you close your eyes?” Sussman asked his teacher. “So that the room will be empty.” At that moment, Sussman was enlightened.

Recently I’ve been doing some work with the t-digest sketching algorithm, from the paper by Ted Dunning and Otmar Ertl. One of the appealing properties of t-digest sketches is that you can “add” them together in the monoid sense to produce a combined sketch from two separate sketches. This property is crucial for sketching data across data partitions in scale-out parallel computing platforms such as Apache Spark or Map-Reduce.

In the original Dunning/Ertl paper, they describe an algorithm for monoidal combination of t-digests based on randomized cluster recombination. The clusters of the two input sketches are collected together, then randomly shuffled, and inserted into a new t-digest in that randomized order. In Scala code, this algorithm might look like the following:

import scala.util.Random.shuffle

def combine(ltd: TDigest, rtd: TDigest): TDigest = {
  // randomly shuffle input clusters and re-insert into a new t-digest
  shuffle(ltd.clusters.toVector ++ rtd.clusters.toVector)
    .foldLeft(TDigest.empty)((d, e) => d + e)
}

I implemented this algorithm and used it until I noticed that a sum over multiple sketches seemed to behave noticeably differently from either the individual inputs or the nominal underlying distribution.

To get a closer look at what was going on, I generated some random samples from a Normal distribution ~N(0,1). I then generated t-digest sketches of each sample, took a cumulative monoid sum, and kept track of how closely each successive sum adhered to the original ~N(0,1) distribution. As a measure of the difference between a t-digest sketch and the original distribution, I computed the Kolmogorov-Smirnov D-statistic, which yields a distance between two cumulative distribution functions. (Code for my data collections can be viewed here) I ran multiple data collections and subsequent cumulative sums and used those multiple measurements to generate the following box-plot. The result was surprising and a bit disturbing:

plot1

As the plot shows, the t-digest sketch distributions are gradually diverging from the underlying “true” distribution ~N(0,1). This is a potentially significant problem for the stability of monoidal t-digest sums, and by extension any parallel sketching based on combining the partial sketches on data partitions in map-reduce-like environments.
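
For reference, the measurement itself can be sketched roughly like this, assuming the sketch exposes a cdf(x) method (a hypothetical name here) and using Commons Math for the reference Normal CDF; the actual data-collection code is linked above:

// approximate the Kolmogorov-Smirnov D-statistic between a t-digest sketch
// and a reference Normal distribution by scanning a grid of points
import org.apache.commons.math3.distribution.NormalDistribution

def ksD(td: TDigest, ref: NormalDistribution, xmin: Double, xmax: Double, points: Int = 1000): Double =
  (0 to points).map { i =>
    val x = xmin + i * (xmax - xmin) / points
    math.abs(td.cdf(x) - ref.cumulativeProbability(x))
  }.max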

Seeing this divergence motivated me to think about ways to avoid it. One property of t-digest insertion logic is that the results of inserting new data can differ depending on what clusters are already present. I wondered if the results might be more stable if the largest clusters were inserted first. The t-digest algorithm allows clusters closest to the distribution median to grow the largest. Combining input clusters from largest to smallest would be like building the combined distribution from the middle outwards, toward the distribution tails. In the case where one t-digest had larger weights, it would also somewhat approximate inserting the smaller sketch into the larger one. In Scala code, this alternative monoid addition looks like so:

def combine(ltd: TDigest, rtd: TDigest): TDigest = {
  // insert clusters from largest to smallest
  (ltd.clusters.toVector ++ rtd.clusters.toVector).sortWith((a, b) => a._2 > b._2)
    .foldLeft(TDigest.empty(delta))((d, e) => d + e)
}

As a second experiment, for each data sampling I compared the original monoid addition with the alternative method using largest-to-smallest cluster insertion. When I plotted the resulting progression of D-statistics side-by-side, the results were surprising:

plot2a

As the plot demonstrates, not only was large-to-small insertion more stable, its D-statistics appeared to be getting smaller instead of larger. To see if this trend was sustained over longer cumulative sums, I plotted the D-stats for cumulative sums over 100 samples:

plot2

The results were even more dramatic: these longer sums show that the standard randomized-insertion method continues to diverge, but in the case of large-to-small insertion the cumulative t-digest sums continue to converge towards the underlying distribution!

To test whether this effect might be dependent on particular shapes of distribution, I ran similar experiments using a Uniform distribution (no “tails”) and an Exponential distribution (one tail). I included the corresponding plots in the appendix. The convergence of this alternative monoid addition doesn’t seem to be sensitive to shape of distribution.

I have upgraded my implementation of t-digest sketching to use this new definition of monoid addition for t-digests. As you can see, it is easy to change one implementation for another. One or two lines of code may be sufficient. I hope this idea may be useful for any other implementations in the community. Happy sketching!

Appendix: Plots with Alternate Distributions

plot3

plot4

Encoding Map-Reduce As A Monoid With Left Folding


In a previous post I discussed some scenarios where traditional map-reduce (directly applying a map function, followed by some monoidal reduction) could be inefficient. To review, the source of inefficiency is in situations where the map operation is creating some non-trivial monoid that represents a single element of the input type. For example, if the monoidal type is Set[Int], then the mapping function (‘prepare’ in algebird) maps every input integer k into Set(k), which is somewhat expensive.

In that discussion, I was focusing on map-reduce as embodied by the algebird Aggregator type, where map appears as the prepare function. However, it is easy to see that any map-reduce implementation may be vulnerable to the same inefficiency.

I wondered if there were a way to represent map-reduce using some alternative formulation that avoids this vulnerability. There is such a formulation, which I will talk about in this post.

I’ll begin by reviewing a standard map-reduce implementation. The following scala code sketches out the definition of a monoid over a type B and a map-reduce interface. As this code suggests, the map function maps input data of some type A into some monoidal type B, which can be reduced (aka “aggregated”) in a way that is amenable to parallelization:

import scala.collection.parallel.ParSeq

trait Monoid[B] {
  // aka 'combine' aka '++'
  def plus: (B, B) => B

  // aka 'empty' aka 'identity'
  def e: B
}

trait MapReduce[A, B] {
  // monoid embodies the reducible type
  def monoid: Monoid[B]

  // mapping function from input type A to reducible type B
  def map: A => B

  // the basic map-reduce operation
  def apply(data: Seq[A]): B = data.map(map).fold(monoid.e)(monoid.plus)

  // map-reduce parallelized over data partitions
  def apply(data: ParSeq[Seq[A]]): B =
    data.map { part =>
      part.map(map).fold(monoid.e)(monoid.plus)
    }
    .fold(monoid.e)(monoid.plus)
}

In the parallel version of map-reduce above, you can see that map and reduce are executed on each data partition (which may occur in parallel) to produce a monoidal B value, followed by a final reduction of those intermediate results. This is the classic form of map-reduce popularized by tools such as Hadoop and Apache Spark, where individual data partitions may reside across highly parallel commodity clusters.

Next I will present an alternative definition of map-reduce. In this implementation, the map function is replaced by a foldL function, which executes a single “left-fold” of an input object with type A into the monoid object with type B:

// a map reduce operation based on a monoid with left folding
trait MapReduceLF[A, B] extends MapReduce[A, B] {
  def monoid: Monoid[B]

  // left-fold an object with type A into the monoid B
  // obeys type law: foldL(b, a) = b ++ foldL(e, a)
  def foldL: (B, A) => B

  // foldL(e, a) embodies the role of map(a) in standard map-reduce
  def map = (a: A) => foldL(monoid.e, a)

  // map-reduce operation is now a single fold-left operation
  override def apply(data: Seq[A]): B = data.foldLeft(monoid.e)(foldL)

  // map-reduce parallelized over data partitions
  override def apply(data: ParSeq[Seq[A]]): B =
    data.map { part =>
      part.foldLeft(monoid.e)(foldL)
    }
    .fold(monoid.e)(monoid.plus)
}

As the comments above indicate, the left-folding function foldL is assumed to obey the law foldL(b, a) = b ++ foldL(e, a). This law captures the idea that folding a into b should be the analog of reducing b with a monoid corresponding to the single element a. Referring to my earlier example, if type A is Int and B is Set[Int], then foldL(b, a) => b + a. Note that b + a is directly inserting single element a into b, which is significantly more efficient than b ++ Set(a), which is how a typical map-reduce implementation would be required to operate.

This law also gives us the corresponding definition of map(a), which is foldL(e, a), or in my example: Set.empty[Int] + a, which is just Set(a).
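
As a concrete illustration of that example, here is a sketch of a MapReduceLF instance that folds Ints into a Set[Int]:

// an instance of the left-folding map-reduce for the Set[Int] example above
val setMR = new MapReduceLF[Int, Set[Int]] {
  val monoid = new Monoid[Set[Int]] {
    def plus = (s1: Set[Int], s2: Set[Int]) => s1 ++ s2
    def e = Set.empty[Int]
  }
  // insert each element directly, instead of reducing with an intermediate Set(a)
  def foldL = (s: Set[Int], a: Int) => s + a
}

setMR(Seq(1, 2, 2, 3))  // Set(1, 2, 3)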

In this formulation, the basic map-reduce operation is now a single foldLeft operation, instead of a mapping followed by a monoidal reduction. The parallel version is analogous. Each partition uses the new foldLeft operation, and the final reduction of intermediate monoidal results remains the same as before.

The foldLeft function is potentially a much more general operation, and it raises the question of whether this new encoding is indeed parallelizable as before. I will conclude with a proof that this encoding is also parallelizable. Note that the law foldL(b, a) = b ++ foldL(e, a) is a significant component of this proof, as it represents the constraint that foldL behaves like an analog of reducing b with a monoidal representation of element a.

In the following proof I use a Scala-like pseudocode, with the notation described in the comments at the top:

// given an object mr of type MapReduceFL[A, B]
// and using notation:
// f <==> mr.foldL
// for b1,b2 of type B: b1 ++ b2 <==> mr.plus(b1, b2)
// e <==> mr.e
// [...] <==> Seq(...)
// d1, d2 are of type Seq[A]

// Proof that map-reduce with left-folding is parallelizable
// i.e. mr(d1 ++ d2) == mr(d1) ++ mr(d2)
mr(d1 ++ d2)
== (d1 ++ d2).foldLeft(e)(f)  // definition of map-reduce operation
== d1.foldLeft(e)(f) ++ d2.foldLeft(e)(f)  // Lemma A
== mr(d1) ++ mr(d2)  // definition of map-reduce (QED)

// Proof of Lemma A
// i.e. (d1 ++ d2).foldLeft(e)(f) == d1.foldLeft(e)(f) ++ d2.foldLeft(e)(f)

// proof is by induction on the length of data sequence d2

// case d2 where length is zero, i.e. d2 == []
(d1 ++ []).foldLeft(e)(f)
== d1.foldLeft(e)(f)  // definition of empty sequence []
== d1.foldLeft(e)(f) ++ e  // definition of identity e
== d1.foldLeft(e)(f) ++ [].foldLeft(e)(f)  // definition of foldLeft

// case d2 where length is 1, i.e. d2 == [a] for some a of type A
(d1 ++ [a]).foldLeft(e)(f)
== f(d1.foldLeft(e)(f), a)  // definition of foldLeft and f
== d1.foldLeft(e)(f) ++ f(e, a)  // the type-law f(b, a) == b ++ f(e, a)
== d1.foldLeft(e)(f) ++ [a].foldLeft(e)(f)  // definition of foldLeft

// inductive step, assuming proof for d2' of length <= n
// consider d2 of length n+1, i.e. d2 == d2' ++ [a], where d2' has length n
(d1 ++ d2).foldLeft(e)(f)
== (d1 ++ d2' ++ [a]).foldLeft(e)(f)  // definition of d2, d2', [a]
== f((d1 ++ d2').foldLeft(e)(f), a)  // definition of foldLeft and f
== (d1 ++ d2').foldLeft(e)(f) ++ f(e, a)  // type-law f(b, a) == b ++ f(e, a)
== d1.foldLeft(e)(f) ++ d2'.foldLeft(e)(f) ++ f(e, a)  // induction
== d1.foldLeft(e)(f) ++ d2'.foldLeft(e)(f) ++ [a].foldLeft(e)(f)  // def'n of foldLeft
== d1.foldLeft(e)(f) ++ (d2' ++ [a]).foldLeft(e)(f)  // induction
== d1.foldLeft(e)(f) ++ d2.foldLeft(e)(f)  // definition of d2 (QED)

Supporting Competing APIs in Scala – Can Better Package Factoring Help?


On and off over the last year, I’ve been working on a library of tree and map classes in Scala that happen to make use of some algebraic structures (mostly monoids or related concepts). In my initial implementations, I made use of the popular algebird variations on monoid and friends. In their incarnation as an algebird PR this was uncontroversial to say the least, but lately I have been re-thinking them as a third-party Scala package.

This immediately raised some interesting and thorny questions: in an ecosystem that contains not just algebird, but other popular alternatives such as cats and scalaz, what algebra API should I use in my code? How best to allow the library user to interoperate with the algebra library of their choice? Can I accomplish these things while also avoiding any problematic package dependencies in my library code?

In Scala, the second question is relatively straightforward to answer. I can write my interface using implicit conversions, and provide sub-packages that provide such conversions from popular algebra libraries into the library I actually use in my code. A library user can import the predefined implicit conversions of their choice, or if necessary provide their own.
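
For example, a sketch of this implicit-conversion approach might look like the following, where MyMonoid and CatsConversions are hypothetical names:

// an abstract trait used internally by the library
trait MyMonoid[A] {
  def combine(x: A, y: A): A
  def empty: A
}

// a sub-package of conversions adapting cats monoids into the internal trait
object CatsConversions {
  import cats.kernel.{ Monoid => CatsMonoid }
  implicit def fromCats[A](implicit m: CatsMonoid[A]): MyMonoid[A] =
    new MyMonoid[A] {
      def combine(x: A, y: A): A = m.combine(x, y)
      def empty: A = m.empty
    }
}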

So far so good, but that leads immediately back to the first question – what API should I choose to use internally in my own library?

One obvious approach is to just pick one of the popular options (I might favor cats, for example) and write my library code using that. If a library user also prefers cats, great. Otherwise, they can import the appropriate implicit conversions from their favorite alternative into cats and be on their way.

But this solution is not without drawbacks. Anybody using my library will now be including cats as a transitive dependency in their project, even if they are already using some other alternative. Although cats is not an enormous library, that represents a fair amount of code sucked into my users’ projects, most of which isn’t going to be used at all. More insidiously, I have now introduced the possibility that the cats version I package with is out of sync with the version my library users are building against. Version misalignment in transitive dependencies is a land-mine in project builds and very difficult to resolve.

A second approach I might use is to define some abstract algebraic traits of my own. I can write my libraries in terms of this new API, and then provide implicit conversions from popular APIs into mine.

This approach has some real advantages over the previous. Being entirely abstract, my internal API will be lightweight. I have the option of including only the algebraic concepts I need. It does not introduce any possibly problematic 3rd-party dependencies that might cause code bloat or versioning problems for my library users.

Although this is an effective solution, I find it dissatisfying for a couple reasons. Firstly, my new internal API effectively represents yet another competing algebra API, and so I am essentially contributing to the proliferating-standards antipattern.

standards

Secondly, it means that I am not taking advantage of community knowledge. The cats library embodies a great deal of cumulative human expertise in both category theory and Scala library design. What does a good algebra library API look like? Well, it’s likely to look a lot like cats of course! The odds that I end up doing an inferior job designing my little internal vanity API are rather higher than the odds that I do as well or better. The best I can hope for is to re-invent the wheel, with a real possibility that my wheel has corners.

Is there a way to resolve this unpalatable situation? Can we design our projects to both remain flexible about interfacing with multiple 3rd-party alternatives, but avoid effectively writing yet another alternative for our own internal use?

I hardly have any authoritative answers to this problem, but I have one idea that might move toward a solution. As I alluded to above, when I write my libraries, I am most frequently only interested in the API – the abstract interface. If I did go with writing my own algebra API, I would seek to define purely abstract traits. Since my intention is that my library users would supply their own favorite library alternative, I would have no need or desire to instantiate any of my APIs. That function would be provided by the separate sub-projects that provide implicit conversions from community alternatives into my API.

On the other hand, what if cats and algebird factored their libraries in a similar way? What if I could include a sub-package like cats-kernel-api, or algebird-core-api, which contained only pure abstract traits for monoid, semigroup, etc? Then I could choose my favorite community API, and code against it, with much less code bloat, and a much reduced vulnerability to any versioning drift. I would still be free to provide implicit conversions and allow my users to make their own choice of library in their projects.

Although I find this idea attractive, it is certainly not foolproof. For example, there is never a way to guarantee that versioning drift won’t break an API. APIs such as cats and algebird are likely to be unusually amenable to this kind of approach. After all, their interfaces are primarily driven by underlying mathematical definitions, which are generally as stable as such things ever get. However, APIs in general tend to be significantly more stable than underlying code. And the most-stable subsets of APIs might be encoded as traits and exposed this way, allowing other more experimental API components to change at a higher frequency. Perhaps library packages could even be factored in some way such as library-stable-api and library-unstable-api. That would clearly add a bit of complication to library trait hierarchies, but the payoff in terms of increased 3rd-party usability might be worth it.