The 'prepare' operation considered harmful in Algebird aggregation
I want to make an argument that the Algebird Aggregator design, in particular its use of the prepare
operation in a map-reduce context, has substantial inefficiencies, compared to an equivalent formulation that is more directly suited to taking advantage of Scala’s aggregate method on collections method.
Consider the definition of aggregation in the Aggregator class:
def apply(inputs: TraversableOnce[A]): C = present(reduce(inputs.map(prepare)))
You can see that it is a standard map/reduce operation, where reduce
is defined as a monoidal (or semigroup – more on this later) operation. Under the hood, it boils down to an invocation of Scala’s reduceLeft
method. The key thing to notice is that the role of prepare
is to map a collection of data elements into the required monoids, which are then aggregated using that monoid’s plus
operation. In other words, prepare
converts data elements into “singleton” monoids each representing a data element.
Now, if the monoid in question is simple, say some numeric type, this conversion is free, or nearly so. For example, the conversion of an integer into the “integer monoid” is a no-op. However, there are other kinds of “non-trivial” monoids, for which the conversion of a data element into its corresponding monoid may be costly. In this post, I will be using the monoid defined by Scala Set[Int], where the monoid plus
operation is set union, and of course the zero
element is the empty set.
Consider the process of defining an Algebird aggregator for the task of generating the set of unique elements in a data set. The corresponding prepare
operation is: prepare(e: Int) = Set(e)
. A monoid trait that encodes this idea might look like the following. (the code I used in this post can be found here)
// an algebird-like monoid with the 'prepare' operation
trait PreparedMonoid[M, E] {
val zero: M
def plus(m1: M, m2: M): M
def prepare(e: E): M
}
// a PreparedMonoid for a set of integers. monoid operator is set union.
object intSetPrepared extends PreparedMonoid[Set[Int], Int] {
val zero = Set.empty[Int]
def plus(m1: Set[Int], m2: Set[Int]) = m1 ++ m2
def prepare(e: Int) = Set(e)
}
implicit class SeqWithMapReduce[E](seq: Seq[E]) {
// algebird map/reduce Aggregator model
def mrPrepared[M](mon: PreparedMonoid[M, E]): M = {
seq.map(mon.prepare).reduceLeft(mon.plus)
}
}
If we unpack the above code, as applied to intSetPrepared
, we are instantiating a new Set object, containing a single value, for every single input data element.
But there is a potentially better model of aggregation, exemplified by the Scala aggregate
method. This method does not use a prepare
operation. It uses a zero value and a monoidal operator, which the Scala docs refer to as combop
, but it also uses an “update” operation, that defines how to update the monoid object, directly, with a single element, referred to as seqop
in Scala’s documentation. This idea can also be encoded as a flavor of monoid, enhanced with an update
method:
// an algebird-like monoid with 'update' operation
trait UpdatedMonoid[M, E] {
val zero: M
def plus(m1: M, m2: M): M
def update(m: M, e: E): M
}
// an equivalent UpdatedMonoid for a set of integers
object intSetUpdated extends UpdatedMonoid[Set[Int], Int] {
val zero = Set.empty[Int]
def plus(m1: Set[Int], m2: Set[Int]) = m1 ++ m2
def update(m: Set[Int], e: Int) = m + e
}
implicit class SeqWithMapReduceUpdated[E](seq: Seq[E]) {
// map/reduce logic, taking advantage of scala 'aggregate'
def mrUpdatedAggregate[M](mon: UpdatedMonoid[M, E]): M = {
seq.aggregate(mon.zero)(mon.update, mon.plus)
}
}
This arrangement promises more efficiency when aggregating w.r.t. nontrivial monoids, by avoiding the construction of “singleton” monoids for each data element. The following demo confirms that for the Set-based monoid, it is over 10 times faster:
scala> :load /home/eje/scala/prepare.scala
Loading /home/eje/scala/prepare.scala...
defined module prepare
scala> import prepare._
import prepare._
scala> val data = Vector.fill(1000000) { scala.util.Random.nextInt(10) }
data: scala.collection.immutable.Vector[Int] = Vector(7, 9, 4, 2, 7,...
// Verify that output is the same for both implementations:
scala> data.mrPrepared(intSetPrepared)
res0: Set[Int] = Set(0, 5, 1, 6, 9, 2, 7, 3, 8, 4)
// results are the same
scala> data.mrUpdatedAggregate(intSetUpdated)
res1: Set[Int] = Set(0, 5, 1, 6, 9, 2, 7, 3, 8, 4)
// Compare timings of prepare-based versus update-based aggregation
// (benchmark values are returned in seconds)
scala> benchmark(10) { data.mrPrepared(intSetPrepared) }
res2: Double = 0.2957673056
// update-based aggregation is 10 times faster
scala> benchmark(10) { data.mrUpdatedAggregate(intSetUpdated) }
res3: Double = 0.027041249300000004
It is also possible to apply Scala’s aggregate
to a monoid enhanced with prepare
:
implicit class SeqWithMapReducePrepared[E](seq: Seq[E]) {
// using 'aggregate' with prepared op
def mrPreparedAggregate[M](mon: PreparedMonoid[M, E]): M = {
seq.aggregate(mon.zero)((m, e) => mon.plus(m, mon.prepare(e)), mon.plus)
}
}
Although this turns out to be measurably faster than the literal map-reduce implementation, it is still not nearly as fast as the variation using update
:
scala> benchmark(10) { data.mrPreparedAggregate(intSetPrepared) }
res2: Double = 0.1754636707
Readers familiar with Algebird may be wondering about my use of monoids above, when the Aggregator
interface is actually based on semigroups. This is important, since building on Scala’s aggregate
function requires a zero element that semigroups do not have. Although I believe it might be worth considering changing Aggregator
to use monoids, another sensible option is to change the internal logic for the subclass AggregatorMonoid
, which does require a monoid, or possibly just define a new AggregatorMonoidUpdated
subclass.
A final note on compatability: note that any monoid enhanced with prepare
can be converted into an equivalent monoid enhanced with update
, as demonstrated by this factory function:
object UpdatedMonoid {
// create an UpdatedMonoid from a PreparedMonoid
def apply[M, E](mon: PreparedMonoid[M, E]) = new UpdatedMonoid[M, E] {
val zero = mon.zero
def plus(m1: M, m2: M) = mon.plus(m1, m2)
def update(m: M, e: E) = mon.plus(m, mon.prepare(e))
}
}