In a previous post I discussed some scenarios where traditional map-reduce (directly applying a map function, followed by some monoidal reduction) could be inefficient.
To review, the source of inefficiency arises in situations where the map operation creates some non-trivial monoidal value that represents a single element of the input type. For example, if the monoidal type is Set[Int], then the mapping function (‘prepare’ in algebird) maps every input integer k to the singleton Set(k), which is somewhat expensive.
In that discussion, I was focusing on map-reduce as embodied by the algebird Aggregator type, where map appears as the prepare function.
However, it is easy to see that any map-reduce implementation may be vulnerable to the same inefficiency.
I wondered if there were a way to represent map-reduce using some alternative formulation that avoids this vulnerability. There is such a formulation, which I will talk about in this post.
I’ll begin by reviewing a standard map-reduce implementation.
The following Scala code sketches out the definition of a monoid over a type B and a map-reduce interface. As this code suggests, the map function maps input data of some type A into some monoidal type B, which can be reduced (aka “aggregated”) in a way that is amenable to parallelization.
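A minimal sketch of what these definitions might look like follows; the names mapReduce and parMapReduce are illustrative choices, and combine plays the role of the monoid operator written ++ in the text:

```scala
// A monoid over a type B: an identity element 'e' and an
// associative 'combine' operation (written ++ in the prose)
trait Monoid[B] {
  def e: B
  def combine(b1: B, b2: B): B
}

// Classic map-reduce: map each element of type A into the monoidal
// type B, then reduce the mapped values
def mapReduce[A, B](data: Seq[A], map: A => B)(implicit m: Monoid[B]): B =
  data.map(map).foldLeft(m.e)(m.combine)

// Parallel form: run map-reduce on each data partition (these
// per-partition jobs may execute in parallel), then perform a final
// reduction of the intermediate B values
def parMapReduce[A, B](partitions: Seq[Seq[A]], map: A => B)(implicit m: Monoid[B]): B =
  partitions.map(part => mapReduce(part, map)).foldLeft(m.e)(m.combine)
```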
In the parallel version of map-reduce above, you can see that map and reduce are executed on each data partition (which may occur in parallel) to produce a monoidal
B value, followed by a final reduction of those intermediate results.
This is the classic form of map-reduce popularized by tools such as Hadoop and Apache Spark, where individual data partitions may reside across highly parallel commodity clusters.
Next I will present an alternative definition of map-reduce.
In this implementation, the map function is replaced by a foldL function, which executes a single “left-fold” of an input object of type A into the monoid object of type B.
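A corresponding sketch of the fold-based interface, reusing the Monoid trait from above (the names foldReduce and parFoldReduce are again illustrative):

```scala
// Fold-based map-reduce: the map function is replaced by foldL,
// which folds a single element of type A directly into an
// accumulating monoid value of type B.
//
// foldL is assumed to obey the law:
//   foldL(b, a) == m.combine(b, foldL(m.e, a))
// that is, folding 'a' into 'b' is the analog of reducing 'b' with
// a monoidal value representing the single element 'a'.
def foldReduce[A, B](data: Seq[A], foldL: (B, A) => B)(implicit m: Monoid[B]): B =
  data.foldLeft(m.e)(foldL)

// Parallel form: left-fold each partition (potentially in parallel);
// the final reduction of intermediate B values is unchanged
def parFoldReduce[A, B](partitions: Seq[Seq[A]], foldL: (B, A) => B)(implicit m: Monoid[B]): B =
  partitions.map(part => foldReduce(part, foldL)).foldLeft(m.e)(m.combine)
```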
As the comments above indicate, the left-folding function
foldL is assumed to obey the law
foldL(b, a) = b ++ foldL(e, a).
This law captures the idea that folding a into b should be the analog of reducing b with a monoidal value corresponding to the single element a.
Referring to my earlier example, if the type B is Set[Int], then foldL can be defined as foldL(b, a) => b + a. The expression b + a directly inserts the single element a into the set b, which is significantly more efficient than b ++ Set(a), which is how a typical map-reduce implementation would be required to operate.
This law also gives us the corresponding definition of
map(a), which is
foldL(e, a), or in my example:
Set.empty[Int] + a, or just Set(a).
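Spelled out as code, the Set[Int] definitions might look like this (the value names here are mine):

```scala
// foldL for the Set[Int] example: insert the element directly,
// with no intermediate singleton Set
val setFoldL: (Set[Int], Int) => Set[Int] = (b, a) => b + a

// the corresponding map, recovered from the law as foldL(e, a)
val setMap: Int => Set[Int] = a => Set.empty[Int] + a // i.e. Set(a)
```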
In this formulation, the basic map-reduce operation is now a single
foldLeft operation, instead of a mapping followed by a monoidal reduction.
The parallel version is analogous.
Each partition uses the new
foldLeft operation, and the final reduction of intermediate monoidal results remains the same as before.
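As a hypothetical usage tying the sketches above together for the Set[Int] example:

```scala
// a Monoid instance for Set[Int], using the trait sketched earlier
implicit val setMonoid: Monoid[Set[Int]] = new Monoid[Set[Int]] {
  def e: Set[Int] = Set.empty[Int]
  def combine(b1: Set[Int], b2: Set[Int]): Set[Int] = b1 ++ b2
}

// each partition is left-folded independently; the final reduction
// unions the per-partition sets
val partitions = Seq(Seq(1, 2, 3), Seq(3, 4), Seq(5))
val distinct = parFoldReduce(partitions, (b: Set[Int], a: Int) => b + a)
// distinct == Set(1, 2, 3, 4, 5)
```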
However, a foldLeft function is potentially a much more general operation, and this raises the question of whether the new encoding is indeed parallelizable as before.
I will conclude with a proof that this encoding is also parallelizable. Note that the law foldL(b, a) = b ++ foldL(e, a) is a significant component of this proof, as it represents the constraint that foldL behaves like an analog of reducing b with a monoidal representation of element a.
In the following proof I use a Scala-like pseudocode, based on the definitions described in the introduction.
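A sketch of that argument follows. Here ++ denotes both sequence concatenation (on data) and the monoid operation (on B values), e is the monoid identity, and +: is list cons:

```scala
// Claim: for any partitions d1 and d2,
//
//   (d1 ++ d2).foldLeft(e)(foldL)
//     == d1.foldLeft(e)(foldL) ++ d2.foldLeft(e)(foldL)
//
// so each partition can be folded independently (possibly in
// parallel) and the intermediate monoid values reduced afterwards.
//
// Lemma: for any b and any data,
//   data.foldLeft(b)(foldL) == b ++ data.foldLeft(e)(foldL)
//
// Proof of lemma, by induction on data.
//
// Base case, data == Nil:
//   Nil.foldLeft(b)(foldL)
//     == b                                       // defn of foldLeft
//     == b ++ e                                  // e is the identity
//     == b ++ Nil.foldLeft(e)(foldL)             // defn of foldLeft
//
// Inductive case, data == a +: rest, assuming the lemma for rest:
//   (a +: rest).foldLeft(b)(foldL)
//     == rest.foldLeft(foldL(b, a))(foldL)       // defn of foldLeft
//     == foldL(b, a) ++ rest.foldLeft(e)(foldL)  // induction hypothesis
//     == (b ++ foldL(e, a)) ++ rest.foldLeft(e)(foldL)  // the foldL law
//     == b ++ (foldL(e, a) ++ rest.foldLeft(e)(foldL))  // associativity
//     == b ++ rest.foldLeft(foldL(e, a))(foldL)  // induction hypothesis
//     == b ++ (a +: rest).foldLeft(e)(foldL)     // defn of foldLeft
//
// Main result:
//   (d1 ++ d2).foldLeft(e)(foldL)
//     == d2.foldLeft(d1.foldLeft(e)(foldL))(foldL)       // foldLeft over ++
//     == d1.foldLeft(e)(foldL) ++ d2.foldLeft(e)(foldL)  // lemma, with
//                                                        // b = d1.foldLeft(e)(foldL)
//
// The same argument extends to any number of partitions by induction.
```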