I want to make an argument that the Algebird Aggregator design, in particular its use of the prepare
operation in a mapreduce context, has substantial inefficiencies, compared to an equivalent formulation that is more directly suited to taking advantage of Scala’s aggregate method on collections method.
Consider the definition of aggregation in the Aggregator class:
1


You can see that it is a standard map/reduce operation, where reduce
is defined as a monoidal (or semigroup – more on this later) operation. Under the hood, it boils down to an invocation of Scala’s reduceLeft
method. The key thing to notice is that the role of prepare
is to map a collection of data elements into the required monoids, which are then aggregated using that monoid’s plus
operation. In other words, prepare
converts data elements into “singleton” monoids each representing a data element.
Now, if the monoid in question is simple, say some numeric type, this conversion is free, or nearly so. For example, the conversion of an integer into the “integer monoid” is a noop. However, there are other kinds of “nontrivial” monoids, for which the conversion of a data element into its corresponding monoid may be costly. In this post, I will be using the monoid defined by Scala Set[Int], where the monoid plus
operation is set union, and of course the zero
element is the empty set.
Consider the process of defining an Algebird aggregator for the task of generating the set of unique elements in a data set. The corresponding prepare
operation is: prepare(e: Int) = Set(e)
. A monoid trait that encodes this idea might look like the following. (the code I used in this post can be found here)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 

If we unpack the above code, as applied to intSetPrepared
, we are instantiating a new Set object, containing a single value, for every single input data element.
But there is a potentially better model of aggregation, exemplified by the Scala aggregate
method. This method does not use a prepare
operation. It uses a zero value and a monoidal operator, which the Scala docs refer to as combop
, but it also uses an “update” operation, that defines how to update the monoid object, directly, with a single element, referred to as seqop
in Scala’s documentation. This idea can also be encoded as a flavor of monoid, enhanced with an update
method:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 

This arrangement promises more efficiency when aggregating w.r.t. nontrivial monoids, by avoiding the construction of “singleton” monoids for each data element. The following demo confirms that for the Setbased monoid, it is over 10 times faster:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 

It is also possible to apply Scala’s aggregate
to a monoid enhanced with prepare
:
1 2 3 4 5 6 

Although this turns out to be measurably faster than the literal mapreduce implementation, it is still not nearly as fast as the variation using update
:
1 2 

Readers familiar with Algebird may be wondering about my use of monoids above, when the Aggregator
interface is actually based on semigroups. This is important, since building on Scala’s aggregate
function requires a zero element that semigroups do not have. Although I believe it might be worth considering changing Aggregator
to use monoids, another sensible option is to change the internal logic for the subclass AggregatorMonoid
, which does require a monoid, or possibly just define a new AggregatorMonoidUpdated
subclass.
A final note on compatability: note that any monoid enhanced with prepare
can be converted into an equivalent monoid enhanced with update
, as demonstrated by this factory function:
1 2 3 4 5 6 7 8 
