# The 'prepare' operation considered harmful in Algebird aggregation

| Feedback

I want to make an argument that the Algebird Aggregator design, in particular its use of the prepare operation in a map-reduce context, has substantial inefficiencies, compared to an equivalent formulation that is more directly suited to taking advantage of Scala’s aggregate method on collections method.

Consider the definition of aggregation in the Aggregator class:

You can see that it is a standard map/reduce operation, where reduce is defined as a monoidal (or semigroup – more on this later) operation. Under the hood, it boils down to an invocation of Scala’s reduceLeft method. The key thing to notice is that the role of prepare is to map a collection of data elements into the required monoids, which are then aggregated using that monoid’s plus operation. In other words, prepare converts data elements into “singleton” monoids each representing a data element.

Now, if the monoid in question is simple, say some numeric type, this conversion is free, or nearly so. For example, the conversion of an integer into the “integer monoid” is a no-op. However, there are other kinds of “non-trivial” monoids, for which the conversion of a data element into its corresponding monoid may be costly. In this post, I will be using the monoid defined by Scala Set[Int], where the monoid plus operation is set union, and of course the zero element is the empty set.

Consider the process of defining an Algebird aggregator for the task of generating the set of unique elements in a data set. The corresponding prepare operation is: prepare(e: Int) = Set(e). A monoid trait that encodes this idea might look like the following. (the code I used in this post can be found here)

If we unpack the above code, as applied to intSetPrepared, we are instantiating a new Set object, containing a single value, for every single input data element.

But there is a potentially better model of aggregation, exemplified by the Scala aggregate method. This method does not use a prepare operation. It uses a zero value and a monoidal operator, which the Scala docs refer to as combop, but it also uses an “update” operation, that defines how to update the monoid object, directly, with a single element, referred to as seqop in Scala’s documentation. This idea can also be encoded as a flavor of monoid, enhanced with an update method:

This arrangement promises more efficiency when aggregating w.r.t. nontrivial monoids, by avoiding the construction of “singleton” monoids for each data element. The following demo confirms that for the Set-based monoid, it is over 10 times faster:

It is also possible to apply Scala’s aggregate to a monoid enhanced with prepare:

Although this turns out to be measurably faster than the literal map-reduce implementation, it is still not nearly as fast as the variation using update:

Readers familiar with Algebird may be wondering about my use of monoids above, when the Aggregator interface is actually based on semigroups. This is important, since building on Scala’s aggregate function requires a zero element that semigroups do not have. Although I believe it might be worth considering changing Aggregator to use monoids, another sensible option is to change the internal logic for the subclass AggregatorMonoid, which does require a monoid, or possibly just define a new AggregatorMonoidUpdated subclass.

A final note on compatability: note that any monoid enhanced with prepare can be converted into an equivalent monoid enhanced with update, as demonstrated by this factory function: