Lately I have been fooling around with an implementation of the Barrier Method for convex optimization with constraints.
One of the characteristics of the Barrier Method is that it requires an initial guess from inside the
*feasible region*: that is, a point which is known to satisfy all of the inequality constraints provided
by the user.
For some optimization problems, it is straightforward to find such a point by using knowledge about the problem
domain, but in many situations it is not at all obvious how to identify such a point, or even if a
feasible point exists. The feasible region might be empty!

Boyd and Vandenberghe discuss a couple of approaches to finding feasible points in §11.4 of [1]. These methods require you to set up an "augmented" minimization problem:
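
Written with this post's symbols (s for the extra variable, f_{k} for the user's constraints), the basic phase I problem can be sketched as follows. The notation is an assumption, and any equality constraints are left out:

```latex
\begin{aligned}
\text{minimize}   \quad & s \\
\text{subject to} \quad & f_k(x) - s \le 0, \qquad k = 1, \dots, m
\end{aligned}
```

If the optimal s is negative, the corresponding x satisfies every f_{k}(x) < 0.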

These methods work in an "augmented" space x+s, where (s) represents an additional
dimension, and the constraint functions are augmented to f_{k}-s.

I experimented a little with these, and while I am confident they work for most problems having multiple inequality constraints, my unit testing tripped over an ironic deficiency: when I attempted to solve for a feasible point for a single planar constraint, the numerics went a bit haywire. Specifically, a linear constraint function has a singular Hessian of all zeros. The final Hessian, coming out of the log barrier function, could be consumed by SVD to get a search direction, but the resulting gradients behaved poorly.

Part of the problem seems to be that the nature of this augmented minimization problem forces the algorithms
to push (s) ever downward, while relying on (s) to transitively push the f_{k} via the augmented constraint
functions f_{k}-s. When only a single linear constraint function is in play, the resulting gradient
caused the augmented dimension (s) to converge *against* the movement of the remaining (unaugmented) sub-space.
The minimization did not converge to a feasible point, even though literally half of the space on one side
of the planar surface is feasible!

Thinking about these issues made me wonder if a more direct approach was possible.
Another way to think about this problem is to minimize the maximum f_{k}.
If the maximum f_{k} is < 0 at a point x, then x is a feasible point satisfying all f_{k}.
If the smallest-possible maximum f_{k} is > 0, then we have definitive proof that no
feasible point exists, and our constraints can't be satisfied.

Taking a maximum preserves convexity, which is a good start, but maximum isn't differentiable everywhere. The boundaries between regions where different functions are the maximum are not smooth, and along those boundaries there is no gradient, and therefore no Hessian either.

However, there is a variation on this idea, known as smooth-max, defined like so:
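
For reference, the logarithm-based form of smooth-max, with a sharpness parameter α > 0, is usually written like this (the symbol m for the result is an assumption borrowed from later in this series):

```latex
m(x) = \frac{1}{\alpha} \log \sum_{k} e^{\alpha f_k(x)}
```

As α grows, m(x) approaches the true maximum of the f_{k}(x) from above.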

Smooth-max has a well defined gradient and Hessian, and furthermore can be computed in a numerically stable way. The sum inside the logarithm above is a sum of exponentials of convex functions. This is good news; exponentials of convex functions are log-convex, and a sum of log-convex functions is also log-convex. Taking the logarithm of that sum therefore gives a convex function, so smooth-max of convex functions is itself convex.
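
A minimal sketch of the numerically stable evaluation, assuming the logarithm-based definition and a positive α. The object and method names are hypothetical and are not taken from the implementation this post links to:

```scala
object SmoothMaxSketch {
  // smoothMax(v) = (1/alpha) * log(sum_k exp(alpha * v_k)).
  // Factoring out z = max(v) keeps every exponent <= 0, so nothing can overflow;
  // terms that underflow to zero are the ones already dominated by the max.
  def smoothMax(values: Seq[Double], alpha: Double = 1.0): Double = {
    require(values.nonEmpty && alpha > 0.0, "need at least one value and alpha > 0")
    val z = values.max
    z + math.log(values.map(v => math.exp(alpha * (v - z))).sum) / alpha
  }
}
```

Calling `SmoothMaxSketch.smoothMax(Seq(1000.0, 0.0))` stays finite, whereas evaluating `math.exp(1000.0)` directly would overflow to infinity.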

That means I have the necessary tools to set up my mini-max problem:
For a given set of convex constraint functions f_{k}, I create a function which is the smooth-max of
these, and I minimize it.

I set about implementing my smooth-max idea, and immediately ran into almost the same problem as before.
If I try to solve for a single planar constraint, my Hessian degenerates to all-zeros!
When I unpacked the smooth-max formula for a single constraint f_{k}, it indeed is just f_{k},
zero Hessian and all!
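
The collapse is easy to check by hand, assuming the logarithm-based definition with parameter α and a single linear constraint written as f_1(x) = a^T x + b (the symbols a and b are illustrative):

```latex
m(x) = \frac{1}{\alpha} \log e^{\alpha f_1(x)} = f_1(x) = a^{T} x + b
\quad \Longrightarrow \quad
\nabla m = a, \qquad \nabla^2 m = 0
```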

What to do?
Well you know what form of constraint *always* has a well behaved Hessian? A circle, that's what.
More technically, an n-dimensional ball, or n-ball.
What if I add a new constraint of the form:
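
One form consistent with the Hessian described in this post is half the squared distance to a center c. The symbol c and the exact constant are assumptions; a subtracted radius term would not change the Hessian:

```latex
f_{\text{ball}}(x) = \tfrac{1}{2} \, \lVert x - c \rVert^2,
\qquad
\nabla f_{\text{ball}}(x) = x - c,
\qquad
\nabla^2 f_{\text{ball}}(x) = I_n
```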

This constraint equation is quadratic, and its Hessian is I_{n}.
If I include this in my set of constraints, my smooth-max Hessian will be non-singular!

Since I do not know a priori where my feasible point might lie, I start with my n-ball centered at my initial guess, and minimize. The result might look something like this:

Because the optimization is minimizing the maximum f_{k}, the optimal point may not be feasible,
but if not it *will* end up closer to the feasible region than before.
This suggests an iterative algorithm, where I update the location of the n-ball at each iteration,
until the resulting optimized point lies on the intersection of my original constraints and my
additional n-ball constraint:
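
The re-centering loop described above can be sketched as follows. This is a toy under stated assumptions, not the linked implementation: α is fixed, the n-ball term is taken to be half the squared distance to the center, `minimize` is a stand-in parameter for whatever convex minimizer is available, and `ternary1D` is a hypothetical one-dimensional minimizer included only so the sketch can run.

```scala
object FeasibleLoopSketch {
  type Vec = Vector[Double]

  // Stable smooth-max: factor out z = max so no exponent exceeds zero.
  def smoothMax(values: Seq[Double], alpha: Double): Double = {
    val z = values.max
    z + math.log(values.map(v => math.exp(alpha * (v - z))).sum) / alpha
  }

  // Assumed n-ball term centered at c: half the squared distance (Hessian = identity).
  def ball(c: Vec, x: Vec): Double =
    0.5 * x.zip(c).map { case (xi, ci) => (xi - ci) * (xi - ci) }.sum

  // Re-center the n-ball on each new optimum until every user constraint is <= 0.
  // `minimize` takes (objective, start) and returns an approximate argmin.
  def findFeasible(
      constraints: Seq[Vec => Double],
      guess: Vec,
      minimize: (Vec => Double, Vec) => Vec,
      alpha: Double = 1.0,
      maxIter: Int = 50): Option[Vec] = {
    @annotation.tailrec
    def loop(center: Vec, iter: Int): Option[Vec] =
      if (constraints.forall(f => f(center) <= 0.0)) Some(center)
      else if (iter >= maxIter) None
      else {
        val objective =
          (x: Vec) => smoothMax(constraints.map(f => f(x)) :+ ball(center, x), alpha)
        loop(minimize(objective, center), iter + 1)
      }
    loop(guess, 0)
  }

  // Toy 1-D minimizer (ternary search on a convex objective), for demonstration only.
  def ternary1D(g: Vec => Double, start: Vec, radius: Double = 100.0): Vec = {
    var lo = start(0) - radius
    var hi = start(0) + radius
    for (_ <- 0 until 200) {
      val m1 = lo + (hi - lo) / 3
      val m2 = hi - (hi - lo) / 3
      if (g(Vector(m1)) < g(Vector(m2))) hi = m2 else lo = m1
    }
    Vector((lo + hi) / 2)
  }
}
```

For example, with the single planar constraint `5 - x <= 0` and a starting guess of `Vector(0.0)`, calling `findFeasible(Seq(f), Vector(0.0), (g, x0) => ternary1D(g, x0))` walks the ball center toward the half-line x >= 5 and stops at the first center that satisfies the constraint.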

I implemented the iterative algorithm above (you can see what this loop looks like here), and it worked exactly as I hoped... at least on my initial tests. However, eventually I started playing with its convergence behavior by moving my constraint region farther from the initial guess, to see how it would cope. Suddenly the algorithm began failing again. When I drilled down on why, I was taken aback to discover that my Hessian matrix was once again showing up as all zeros!

The reason was interesting. Recall that I used a modified formula to stabilize my smooth-max computations. In particular, the "stabilized" formula for the Hessian looks like this:
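
In matrix form, and assuming the logarithm-based smooth-max with parameter α, the stabilized Hessian can be written with normalized weights w_k (the symbols w_k and z are assumed notation; z is the largest constraint value at the current point):

```latex
\nabla^2 m = \sum_k w_k \left( \nabla^2 f_k + \alpha \, \nabla f_k \nabla f_k^{T} \right)
           - \alpha \Big( \sum_k w_k \nabla f_k \Big) \Big( \sum_k w_k \nabla f_k \Big)^{T},
\qquad
w_k = \frac{e^{\alpha (f_k - z)}}{\sum_j e^{\alpha (f_j - z)}},
\quad
z = \max_k f_k
```

Any term with α(f_k - z) very negative gets a weight that underflows to zero.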

So, what was going on? As I started moving my feasible region farther away, the corresponding constraint function started to dominate the exponential terms in the equation above. In other words, the distance to the feasible region became the (z) in these equations, and this z value was large enough to drive the terms corresponding to my n-ball constraint to zero!

However, I have a lever to mitigate this problem.
If I make the α parameter *small* enough, it will compress these exponent ranges and prevent my
n-ball Hessian terms from washing out.
Decreasing α makes smooth-max more rounded-out, and decreases the sharpness of the approximation to the true max,
but minimizing smooth-max still yields the same minimum *location* as the true maximum, and so playing this
trick does not undermine my results.

How small is small enough? α is essentially a free parameter, but I found that if I set it at each iteration so that my n-ball Hessian coefficient never drops below 1e-3 (but may be larger), then my Hessian is always well behaved. Note that as my iterations grow closer to the true feasible region, I can gradually allow α to grow larger. Currently, I don't let α grow larger than 1, to avoid creating curvatures that are too large, but I have not experimented deeply with what actually happens if it were allowed to grow larger. You can see what this looks like in my current implementation here.
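
One way to read the rule above, as a rough sketch: treat the "coefficient" as the n-ball's unnormalized exponential term exp(α(f_ball - z)), where z is the largest constraint value, and pick the largest α that keeps it at or above the floor. That interpretation, and the names below, are assumptions; the linked implementation may compute this differently.

```scala
object AlphaSketch {
  // Largest alpha, capped at alphaMax, such that exp(alpha * (fBall - z)) >= minCoef.
  // fBall: current value of the n-ball constraint; z: largest constraint value.
  def chooseAlpha(
      fBall: Double,
      z: Double,
      minCoef: Double = 1e-3,
      alphaMax: Double = 1.0): Double = {
    val gap = z - fBall
    if (gap <= 0.0) alphaMax // the n-ball term is itself the max, so its term is 1
    else math.min(alphaMax, math.log(1.0 / minCoef) / gap)
  }
}
```

With a user constraint value of 1000 and an n-ball value of 0, this gives α = ln(1000)/1000, which is about 0.0069; as the gap shrinks below ln(1000), α saturates at 1.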

Tuning the smooth-max α parameter gave me numeric stability, but I noticed that as the feasible region grew more distant from my initial guess, the algorithm's time to converge grew larger fairly quickly. When I studied its behavior, I saw that at large distances, the quadratic "cost" of my n-ball constraint effectively pulled the optimal point fairly close to my n-ball center. This doesn't prevent the algorithm from finding a solution, but it does prevent it from going long distances very fast. To solve this adaptively, I added a scaling factor s to my n-ball constraint function. The scaled version of the function looks like:

In my case, when my distances to a feasible region grow large, I want s to become small, so that it causes the cost of the n-ball constraint to grow more slowly, and allow the optimization to move farther, faster. The following diagram illustrates this intuition:

In my algorithm, I set s = 1/σ, where σ represents the "scale" of the current distance to the feasible region. The n-ball function grows as the square of the distance to the ball center; therefore I set σ=(k)sqrt(s), so that it grows proportionally to the square root of the current largest user constraint cost. Here, (k) is a proportionality constant. It too is a somewhat magic free parameter, but I have found that k=1.5 yields fast convergence and good results. One last trick I play is that I prevent σ from becoming less than a minimum value, currently 10. This ensures that my n-ball constraint never dominates the total constraint sum, even as the optimization converges close to the feasible region. I want my "true" user constraints to dominate the behavior near the optimum, since those are the constraints that matter. The code is shorter than the explanation: you can see it here.
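
A small sketch of that scaling rule, under the assumption that the quantity inside the square root is the current largest user constraint value (as the prose describes). The function and parameter names are hypothetical:

```scala
object ScaleSketch {
  // s = 1 / sigma, with sigma = max(sigmaMin, k * sqrt(largest user constraint cost)).
  // Negative costs (already feasible) are clamped to zero before the square root.
  def ballScale(maxUserCost: Double, k: Double = 1.5, sigmaMin: Double = 10.0): Double = {
    val sigma = math.max(sigmaMin, k * math.sqrt(math.max(maxUserCost, 0.0)))
    1.0 / sigma
  }
}
```

Far from the feasible region, say at a cost of 10000, this gives σ = 150 and a small s, so the n-ball grows slowly; near the feasible region σ sits at its floor of 10 and s never exceeds 0.1.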

After applying all these intuitions, the resulting algorithm appears to be numerically stable and also converges pretty quickly even when the initial guess is very far from the true feasible region. To review, you can look at the main loop of this algorithm starting here.

I've learned a lot about convex optimization and feasible point solving from working through practical problems as I made mistakes and fixed them. I'm fairly new to the whole arena of convex optimization, and I expect I'll learn a lot more as I go. Happy Computing!

[1] §11.4 of *Convex Optimization*, Boyd and Vandenberghe, Cambridge University Press, 2008

As JDC described in his post, this alternative expression for smooth max (m) is computationally stable. Individual exponential terms may underflow to zero, but they are the ones which are dominated by the other terms, and so approximating them by zero is numerically accurate. In the limit where one value dominates all others, it will be exactly the value given by (z).
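
For concreteness, the stable expression for the logarithm-based smooth-max is usually written by factoring the largest value out of the sum (α is the sharpness parameter; taking z as the maximum of the f_k is assumed notation):

```latex
m = z + \frac{1}{\alpha} \log \sum_k e^{\alpha (f_k - z)},
\qquad
z = \max_k f_k
```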

It turns out that we can play a similar trick with computing the gradient:
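
Multiplying the numerator and denominator of the gradient by e^{-αz} leaves its value unchanged, which gives the following form (same assumed symbols, with z = max_k f_k):

```latex
\frac{\partial m}{\partial x_i}
  = \frac{\sum_k e^{\alpha (f_k - z)} \, \frac{\partial f_k}{\partial x_i}}
         {\sum_k e^{\alpha (f_k - z)}}
```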

Without showing the derivation, we can apply exactly the same manipulation to the terms of the Hessian:
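
Element by element, and with the same assumed symbols (z = max_k f_k), the shifted Hessian reads:

```latex
\frac{\partial^2 m}{\partial x_i \partial x_j}
  = \frac{\sum_k e^{\alpha (f_k - z)} \left( \frac{\partial^2 f_k}{\partial x_i \partial x_j}
      + \alpha \frac{\partial f_k}{\partial x_i} \frac{\partial f_k}{\partial x_j} \right)}
         {\sum_k e^{\alpha (f_k - z)}}
  - \alpha \, \frac{\left( \sum_k e^{\alpha (f_k - z)} \frac{\partial f_k}{\partial x_i} \right)
                    \left( \sum_k e^{\alpha (f_k - z)} \frac{\partial f_k}{\partial x_j} \right)}
                   {\left( \sum_k e^{\alpha (f_k - z)} \right)^2}
```

Each factor of e^{-αz} introduced in a numerator is matched by one in the denominator, so the value is the same as in the unshifted formula.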

And so we now have a computationally stable form of the equations for smooth max, its gradient and its Hessian. Enjoy!

I am using the logarithm-based definition of smooth-max, shown here:
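
In the usual notation (the symbol m and the index k are assumptions), the definition can be written once with explicit function arguments and once without:

```latex
\begin{aligned}
m(x) &= \frac{1}{\alpha} \log \sum_k e^{\alpha f_k(x)} \\
m    &= \frac{1}{\alpha} \log \sum_k e^{\alpha f_k}
\end{aligned}
```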

I will use the second variation above, ignoring function arguments, with the hope of increasing clarity. Applying the chain rule gives the ith partial gradient of smooth-max:
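
Worked out with the symbols assumed here (m for smooth-max, f_k for the component functions, x_i for the ith coordinate), the chain rule gives:

```latex
\frac{\partial m}{\partial x_i}
  = \frac{1}{\alpha} \cdot \frac{1}{\sum_k e^{\alpha f_k}} \cdot \sum_k \alpha \, e^{\alpha f_k} \frac{\partial f_k}{\partial x_i}
  = \frac{\sum_k e^{\alpha f_k} \, \frac{\partial f_k}{\partial x_i}}{\sum_k e^{\alpha f_k}}
```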

Now that we have an ith partial gradient, we can take the jth partial gradient of *that* to obtain the (i,j)th element of a Hessian:
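
Applying the quotient rule to the ith partial, with the same assumed symbols, yields:

```latex
\frac{\partial^2 m}{\partial x_i \partial x_j}
  = \frac{\sum_k e^{\alpha f_k} \left( \frac{\partial^2 f_k}{\partial x_i \partial x_j}
      + \alpha \frac{\partial f_k}{\partial x_i} \frac{\partial f_k}{\partial x_j} \right)}
         {\sum_k e^{\alpha f_k}}
  - \alpha \, \frac{\left( \sum_k e^{\alpha f_k} \frac{\partial f_k}{\partial x_i} \right)
                    \left( \sum_k e^{\alpha f_k} \frac{\partial f_k}{\partial x_j} \right)}
                   {\left( \sum_k e^{\alpha f_k} \right)^2}
```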

This last re-grouping of terms allows us to see that we can express the full gradient and Hessian in the following more compact way:
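
One compact way to write the result uses normalized weights w_k, which are assumed notation rather than symbols from this post:

```latex
\nabla m = \sum_k w_k \nabla f_k,
\qquad
\nabla^2 m = \sum_k w_k \left( \nabla^2 f_k + \alpha \, \nabla f_k \nabla f_k^{T} \right)
           - \alpha \, \nabla m \, \nabla m^{T},
\qquad
w_k = \frac{e^{\alpha f_k}}{\sum_j e^{\alpha f_j}}
```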

With a gradient and Hessian, we now have the tools we need to use smooth-max in algorithms such as gradient descent and convex optimization. Happy computing!

Consider how we annotate and refer to release builds for a Scala project:
The *version* of Scala -- 2.10, 2.11, etc -- that was used to build the project is a *qualifier* for the release.
For example, if I am building a project using Scala 2.11, and package P is one of my project dependencies, then the maven build tooling (or sbt, etc) looks for a version of P that was *also* built using Scala 2.11;
the build will fail if no such incarnation of P can be located.
This build constraint propagates recursively throughout the entire dependency tree for a project.

Now consider how we treat the version for the package P dependency itself:
Our build tooling forces us to specify one exact release version x.y.z for P.
This is superficially similar to the constraint for building with Scala 2.11, but *unlike* the Scala constraint, the knowledge about using P x.y.z is not propagated down the tree.

If the dependency for P appears only once in the dependency tree, everything is fine.
However, as anybody who has ever worked with a large dependency tree for a project knows, package P might very well appear in multiple locations of the dep-tree, as a transitive dependency of different packages.
Worse, these deps may be specified as *different versions* of P, which may be mutually incompatible.

Transitive dep incompatibilities are a particularly thorny problem to solve, but there are other annoyances related to release versioning. Often a user would like a "major" package dependency built against a particular version of that dep. For example, packages that use Apache Spark may need to work with a particular build version of Spark (2.1, 2.2, etc). If I am the package purveyor, I have no very convenient way to build my package against multiple versions of Spark, and then annotate those builds in Maven Central. At best I can bake the Spark version into the name. But what if I want to specify other package dep versions? Do I create package names with increasingly-long lists of (package,version) pairs hacked into the name?

Finally, there is simply the annoyance of revving my own package purely for the purpose of building it against the latest versions of my dependencies.
None of my code has changed, but I am cutting a new release just to pick up current dependency releases.
And then hoping that my package users will want those particular releases, and that these won't break *their* builds with incompatible transitive deps!

I have been toying with a release and build methodology for avoiding these headaches. What follows is full of vigorous hand-waving, but I believe something like it could be formalized in a useful way.

The key idea is that a release *build* is defined by a *build signature* which is the union of all `(dep, ver)` pairs.
This includes:

- The actual release version of the package code, e.g. `(mypackage, 1.2.3)`
- The `(dep, ver)` for all dependencies (taken over all transitive deps, recursively)
- The `(tool, ver)` for all impactful build tooling, e.g. `(scala, 2.11)`, `(python, 3.5)`, etc

For example, if I maintain a package `P`, whose latest code release is `1.2.3`,
built with dependencies `(A, 0.5)`, `(B, 2.5.1)` and `(C, 1.7.8)`, and dependency `B` built against `(Q, 6.7)` and `(R, 3.3)`,
and `C` built against `(Q, 6.7)` and all compiled with `(scala, 2.11)`, then the build signature will be:

`{ (P, 1.2.3), (A, 0.5), (B, 2.5.1), (C, 1.7.8), (Q, 6.7), (R, 3.3), (scala, 2.11) }`

Identifying a release build in this way makes several interesting things possible.
First, it can identify a build with a transitive dependency problem.
For example, if `C` had been built against `(Q, 7.0)`,
then the resulting build signature would have *two* pairs for `Q`; `(Q, 6.7)` and `(Q, 7.0)`,
which is an immediate red flag for a potential problem.
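
As a rough illustration of the idea, a build signature can be modeled as a set of `(name, version)` pairs, with conflicts being names that carry more than one version. Everything below (object name, method names, string-typed versions) is a hypothetical sketch and not an existing tool's API:

```scala
object BuildSignatureSketch {
  type Pair = (String, String) // (dep, ver) or (tool, ver)

  // A package's signature: its own pair, its tool pairs, and the union of the
  // signatures of everything it was built against.
  def signature(self: Pair, tools: Set[Pair], depSignatures: Seq[Set[Pair]]): Set[Pair] =
    depSignatures.foldLeft(tools + self)(_ union _)

  // Names that appear with more than one version: the "red flag" case.
  def conflicts(sig: Set[Pair]): Map[String, Set[String]] =
    sig.groupBy(_._1)
      .map { case (name, pairs) => name -> pairs.map(_._2) }
      .filter { case (_, versions) => versions.size > 1 }

  // The example from the text, with C built against (Q, 7.0) instead of (Q, 6.7).
  def example: Set[Pair] = {
    val tools = Set(("scala", "2.11"))
    val a = signature(("A", "0.5"), tools, Nil)
    val b = signature(("B", "2.5.1"), tools, Seq(Set(("Q", "6.7"), ("R", "3.3"))))
    val c = signature(("C", "1.7.8"), tools, Seq(Set(("Q", "7.0"))))
    signature(("P", "1.2.3"), tools, Seq(a, b, c))
  }
}
```

Here `conflicts(example)` reports `Q` with both `6.7` and `7.0`, while `scala` appears only once because every package agreed on `2.11`.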

More intriguingly, it could provide a foundation for *avoiding* builds with incompatible dependencies.
Suppose that I redefine my build logic so that I only specify dependency package names, and not specific versions.
Whenever I build a project, the build system automatically searches for the most-recent version of each dependency.
This already addresses some of the release headaches above.
As a project builder, I can get the latest versions of packages when I build.
As a package maintainer, I do not have to rev a release just to update my package deps;
projects using my package will get the latest by default.
Moreover, because the latest package release is always pulled, I never get multiple incompatible dependency releases
in a build.

Suppose that for some reason I *need* a particular release of some dependency.
From the example above, imagine that I must use `(Q, 6.7)`.
We can imagine augmenting the build specification to allow overriding the default behavior of pulling the most recent release.
We might either specify a specific version as we do currently, or possibly specify a range of releases, as systems like brew or ruby gemfiles allow.
In the case where some constraint is placed on releases, this constraint would be propagated down the tree (or possibly up from the leaves),
in essentially the same way that the constraint of scala version is already.
In the event that the total set of constraints over the whole dependency tree is not satisfiable, then the build will fail.
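
A flattened sketch of that resolution rule: collect every constraint placed on a package anywhere in the tree, then pick the most recent available version that satisfies all of them, failing if none does. The dotted-integer version format, the predicate-style constraints, and all names here are assumptions made for illustration:

```scala
object ResolveSketch {
  type Version = Vector[Int] // e.g. "6.7" becomes Vector(6, 7)

  def parse(v: String): Version = v.split('.').map(_.toInt).toVector

  // Compare versions component by component, padding the shorter one with zeros.
  val byComponents: Ordering[Version] = new Ordering[Version] {
    def compare(a: Version, b: Version): Int =
      a.zipAll(b, 0, 0)
        .collectFirst { case (x, y) if x != y => Integer.compare(x, y) }
        .getOrElse(0)
  }

  // available: package name -> released versions
  // constraints: (package name, predicate on its version), gathered over the whole tree
  // Returns None when some package has no version satisfying all of its constraints.
  def resolve(
      available: Map[String, Seq[String]],
      constraints: Seq[(String, Version => Boolean)]): Option[Map[String, String]] = {
    val picks: Map[String, Option[String]] = available.map { case (name, versions) =>
      val preds = constraints.filter(_._1 == name).map(_._2)
      val ok = versions.filter(v => preds.forall(p => p(parse(v))))
      name -> (if (ok.isEmpty) None else Some(ok.maxBy(v => parse(v))(byComponents)))
    }
    if (picks.values.exists(_.isEmpty)) None
    else Some(picks.map { case (name, v) => name -> v.get })
  }
}
```

With no constraints every package resolves to its latest release; adding a constraint such as "Q must be a 6.x release" narrows the pick for `Q` only, and two contradictory constraints on `Q` make the whole resolution fail.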

With a build annotation system like the one I just described, one could imagine a new role for registries like Maven Central, where different builds are automatically cached. The registry could maybe even automatically run CI testing to identify the most-recent versions of package dependencies that satisfy any given package build, or perhaps valid dependency release ranges.

To conclude, I believe that re-thinking how we describe the dependencies used to build and annotate package releases, by generalizing release version to include the release version of all transitive deps (including build tooling as deps), may enable more flexible ways to both build software releases and specify them for pulling.

Happy Computing!

In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. "What are you doing?", asked Minsky. "I am training a randomly wired neural net to play Tic-tac-toe", Sussman replied. "Why is the net wired randomly?", asked Minsky. "I do not want it to have any preconceptions of how to play", Sussman said. Minsky then shut his eyes. "Why do you close your eyes?" Sussman asked his teacher. "So that the room will be empty." At that moment, Sussman was enlightened.

Recently I've been doing some work with the t-digest sketching algorithm, from the paper by Ted Dunning and Otmar Ertl. One of the appealing properties of t-digest sketches is that you can "add" them together in the monoid sense to produce a combined sketch from two separate sketches. This property is crucial for sketching data across data partitions in scale-out parallel computing platforms such as Apache Spark or Map-Reduce.

In the original Dunning/Ertl paper, they describe an algorithm for monoidal combination of t-digests based on randomized cluster recombination. The clusters of the two input sketches are collected together, then randomly shuffled, and inserted into a new t-digest in that randomized order. In Scala code, this algorithm might look like the following:

```scala
// assuming shuffle here is the standard library's Random.shuffle
import scala.util.Random.shuffle

def combine(ltd: TDigest, rtd: TDigest): TDigest = {
  // randomly shuffle input clusters and re-insert to a new t-digest
  shuffle(ltd.clusters.toVector ++ rtd.clusters.toVector)
    .foldLeft(TDigest.empty)((d, e) => d + e)
}
```

I implemented this algorithm and used it until I noticed that a sum over multiple sketches seemed to behave noticeably differently than either the individual inputs, or the nominal underlying distribution.

To get a closer look at what was going on, I generated some random samples from a Normal distribution ~N(0,1). I then generated t-digest sketches of each sample, took a cumulative monoid sum, and kept track of how closely each successive sum adhered to the original ~N(0,1) distribution. As a measure of the difference between a t-digest sketch and the original distribution, I computed the Kolmogorov-Smirnov D-statistic, which yields a distance between two cumulative distribution functions. (Code for my data collections can be viewed here) I ran multiple data collections and subsequent cumulative sums and used those multiple measurements to generate the following box-plot. The result was surprising and a bit disturbing:

As the plot shows, the t-digest sketch distributions are gradually *diverging* from the underlying "true" distribution ~N(0,1).
This is a potentially significant problem for the stability of monoidal t-digest sums, and by extension any parallel sketching based on combining the partial sketches on data partitions in map-reduce-like environments.
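
For readers unfamiliar with the D-statistic used above: it is the largest absolute gap between two cumulative distribution functions. A minimal sketch, which approximates that maximum on an evenly spaced grid, is shown below. The grid approximation and the names are assumptions, and this is not the data-collection code referred to in this post:

```scala
object KSSketch {
  // Approximate Kolmogorov-Smirnov D: max |cdf1(x) - cdf2(x)| over n + 1 grid points
  // spanning [lo, hi]. Each CDF is passed as a plain function from x to probability.
  def ksD(
      cdf1: Double => Double,
      cdf2: Double => Double,
      lo: Double,
      hi: Double,
      n: Int = 1000): Double =
    (0 to n).map { i =>
      val x = lo + (hi - lo) * i / n
      math.abs(cdf1(x) - cdf2(x))
    }.max
}
```

In the experiment described here, one argument would be the CDF of the t-digest sketch and the other the CDF of N(0,1); a D value that grows across successive sums is what the divergence looks like numerically.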

Seeing this divergence motivated me to think about ways to avoid it. One property of t-digest insertion logic is that the results of inserting new data can differ depending on what clusters are already present. I wondered if the results might be more stable if the largest clusters were inserted first. The t-digest algorithm allows clusters closest to the distribution median to grow the largest. Combining input clusters from largest to smallest would be like building the combined distribution from the middle outwards, toward the distribution tails. In the case where one t-digest had larger weights, it would also somewhat approximate inserting the smaller sketch into the larger one. In Scala code, this alternative monoid addition looks like so:

```scala
def combine(ltd: TDigest, rtd: TDigest): TDigest = {
  // insert clusters from largest to smallest
  // (delta is assumed to be in scope, as in the original snippet)
  (ltd.clusters.toVector ++ rtd.clusters.toVector).sortWith((a, b) => a._2 > b._2)
    .foldLeft(TDigest.empty(delta))((d, e) => d + e)
}
```

As a second experiment, for each data sampling I compared the original monoid addition with the alternative method using largest-to-smallest cluster insertion. When I plotted the resulting progression of D-statistics side-by-side, the results were surprising:

As the plot demonstrates, not only was large-to-small insertion more stable, its D-statistics appeared to be getting *smaller* instead of larger.
To see if this trend was sustained over longer cumulative sums, I plotted the D-stats for cumulative sums over 100 samples:

The results were even more dramatic: these longer sums show that the standard randomized-insertion method continues to diverge, but in the case of large-to-small insertion the cumulative t-digest sums continue to converge towards the underlying distribution!

To test whether this effect might be dependent on particular shapes of distribution, I ran similar experiments using a Uniform distribution (no "tails") and an Exponential distribution (one tail). I included the corresponding plots in the appendix. The convergence of this alternative monoid addition doesn't seem to be sensitive to shape of distribution.

I have upgraded my implementation of t-digest sketching to use this new definition of monoid addition for t-digests. As you can see, it is easy to change one implementation for another. One or two lines of code may be sufficient. I hope this idea may be useful for any other implementations in the community. Happy sketching!