This week I finally started applying my new convex optimization library to solve for interpolating splines with monotonic constraints. Things seemed to be going well. My convex optimization was passing unit tests. My monotone splines were passing their unit tests too. I cut an initial release, and announced it to the world.
Because Murphy rules my world, it was barely an hour later that I was playing around with my new toys in a REPL, and when I tried splining an example data set my library call went into an infinite loop.
In addition to being a bit embarrassing, it was also a real head-scratcher. There was nothing odd about the data I had just given it. In fact it was a small variation of a problem it had just solved a few seconds prior.
There was nothing to do but put my code back up on blocks and break out the print statements. I ran my problem data set and watched it spin. Fast forward a half hour or so, and I had localized the problem to a bit of code that does the "backtracking" phase of a convex optimization.
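In spirit, the loop looked something like this (a Python sketch of the standard Armijo backtracking idea, not my library's exact code; `f` is the objective, `x` the current point, `d` the search direction, and `g` the gradient at `x`):

```python
import numpy as np

def backtrack(f, x, d, g, alpha=0.25, beta=0.5):
    # Armijo backtracking: shrink the step t until the objective
    # shows "sufficient decrease" along the direction d.
    v = f(x)            # objective value at the current point
    gdd = np.dot(g, d)  # directional derivative; negative for a descent direction
    t = 1.0
    while True:
        tv = f(x + t * d)
        if tv <= v + t * alpha * gdd:
            return t    # report success: a forward step was found
        t *= beta       # otherwise shrink the step and try again
```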
My infinite loop was happening because my backtracking loop above was "succeeding" – that is, reporting it had found a forward step – but not actually moving forward along its vector. And the reason turned out to be that my test `tv <= v + t*alpha*gdd` was succeeding because `v + t*alpha*gdd` was evaluating to just `v`, and I effectively had `tv == v`.
I had been bitten by one of the oldest floating-point fallacies: forgetting that `x + y` can equal `x` when `y` gets smaller than the Unit in the Last Place (ULP) of `x`.
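A two-line REPL experiment makes the effect easy to see (Python here, using `math.ulp` from the standard library; the particular values are just for illustration):

```python
import math

v = 1.0e16
print(math.ulp(v))   # 2.0: the gap between v and the next representable double
print(v + 1.0 == v)  # True: adding 1.0 falls below the rounding threshold, so v is unchanged
```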
This was an especially evil bug, as it very frequently doesn’t manifest. My unit testing in two libraries failed to trigger it. I have since added the offending data set to my splining unit tests, in case the code ever regresses somehow.
Now that I understood the problem, it turned out I could use it to my advantage, as an effective test for local convergence: if I can't find a step size that reduces my local objective function by an amount measurable at floating point resolution, then I am as good as converged at this stage of the algorithm. I rewrote my code to reflect this insight, and added some annotations so I don't forget what I learned.
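Again in sketch form (the same Python stand-in as above, not my library's literal code), the fix turns the old failure mode into a convergence signal:

```python
import numpy as np

def backtrack(f, x, d, g, alpha=0.25, beta=0.5):
    v = f(x)
    gdd = np.dot(g, d)  # negative along a descent direction
    t = 1.0
    while True:
        # If the required decrease t*alpha*gdd is smaller than the ULP of v,
        # then v + t*alpha*gdd == v: no step this small can improve the
        # objective by an amount measurable at floating point resolution.
        # That is exactly the local convergence condition for this stage.
        if v + t * alpha * gdd == v:
            return None  # converged: signal "no step" instead of a bogus success
        tv = f(x + t * d)
        if tv <= v + t * alpha * gdd:
            return t     # genuine sufficient decrease: take the step
        t *= beta        # shrink and retry
```

A caller that receives `None` treats the current point as locally converged, instead of looping forever on a step that goes nowhere.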
I tend to pride myself on being aware that floating point numerics are a leaky abstraction, and of the various ways these leaks can show up in computations, but pride goeth before a fall, and after all these years I can still burn myself! It never hurts to be reminded that you can never let your guard down with floating point numbers, and that unit testing can never guarantee correctness. That goes double for numeric methods!