In a previous post I discussed some scenarios where traditional mapreduce (directly applying a map function, followed by some monoidal reduction) could be inefficient.
To review, the inefficiency arises when the map operation must construct some nontrivial monoidal value to represent a single element of the input type. For example, if the monoidal type is Set[Int], then the mapping function (‘prepare’ in algebird) maps every input integer k into Set(k), which is somewhat expensive.
In that discussion, I was focusing on mapreduce as embodied by the algebird Aggregator
type, where map
appears as the prepare
function.
However, it is easy to see that any mapreduce implementation may be vulnerable to the same inefficiency.
I wondered if there were a way to represent mapreduce using some alternative formulation that avoids this vulnerability. There is such a formulation, which I will talk about in this post.
I’ll begin by reviewing a standard mapreduce implementation.
The following scala code sketches out the definition of a monoid over a type B
and a mapreduce interface.
As this code suggests, the map
function maps input data of some type A
into some monoidal type B
, which can be reduced (aka “aggregated”) in a way that is amenable to parallelization:
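A minimal sketch of such a monoid and mapreduce interface might look like the following (the names Monoid, plus, mapReduce, and mapReducePar are illustrative):

```scala
// A monoid over type B: an identity element e and an associative combine operation.
trait Monoid[B] {
  def e: B                  // identity: plus(e, b) == plus(b, e) == b
  def plus(x: B, y: B): B   // associative; written b1 ++ b2 in the text
}

// Classic mapreduce: map each input A into a monoidal B, then reduce.
def mapReduce[A, B](data: Seq[A], map: A => B, m: Monoid[B]): B =
  data.map(map).foldLeft(m.e)(m.plus)

// Parallel form: run mapReduce on each partition (possibly in parallel),
// then perform a final reduction of the intermediate monoidal results.
def mapReducePar[A, B](parts: Seq[Seq[A]], map: A => B, m: Monoid[B]): B =
  parts.map(p => mapReduce(p, map, m)).foldLeft(m.e)(m.plus)
```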

In the parallel version of mapreduce above, you can see that map and reduce are executed on each data partition (which may occur in parallel) to produce a monoidal B
value, followed by a final reduction of those intermediate results.
This is the classic form of mapreduce popularized by tools such as Hadoop and Apache Spark, where individual data partitions may reside across highly parallel commodity clusters.
Next I will present an alternative definition of mapreduce.
In this implementation, the map function is replaced by a foldL function, which executes a single “left fold” of an input object of type A into the monoid object of type B:
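A minimal sketch of this foldL-based formulation (the Monoid trait and names are again illustrative):

```scala
// A monoid over type B, as before: identity e and associative combine plus.
trait Monoid[B] {
  def e: B
  def plus(x: B, y: B): B
}

// foldL folds a single input element of type A directly into a monoidal value B.
// It is assumed to obey the law: foldL(b, a) == plus(b, foldL(m.e, a)).
def mapReduce[A, B](data: Seq[A], foldL: (B, A) => B, m: Monoid[B]): B =
  data.foldLeft(m.e)(foldL)

// Parallel form: each partition is left-folded into a monoidal value,
// and the final reduction of intermediate results is unchanged.
def mapReducePar[A, B](parts: Seq[Seq[A]], foldL: (B, A) => B, m: Monoid[B]): B =
  parts.map(p => p.foldLeft(m.e)(foldL)).foldLeft(m.e)(m.plus)
```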

As the comments above indicate, the left-folding function foldL
is assumed to obey the law foldL(b, a) = b ++ foldL(e, a)
.
This law captures the idea that folding a
into b
should be the analog of reducing b
with a monoid corresponding to the single element a
.
Referring to my earlier example, if type A is Int and B is Set[Int], then foldL(b, a) = b + a.
Note that b + a directly inserts the single element a into b, which is significantly more efficient than b ++ Set(a), the operation a typical mapreduce implementation would be required to perform.
This law also gives us the corresponding definition of map(a), which is foldL(e, a); in my example that is Set.empty[Int] + a, or just Set(a).
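For the Set[Int] example, the law can be checked directly, with foldL(b, a) = b + a, identity e = Set.empty[Int], and ++ as set union:

```scala
// The law foldL(b, a) == b ++ foldL(e, a), instantiated for Set[Int].
val foldL = (b: Set[Int], a: Int) => b + a  // insert the single element a directly
val e = Set.empty[Int]

val b = Set(1, 2)
val a = 3
assert(foldL(b, a) == (b ++ foldL(e, a)))   // both sides equal Set(1, 2, 3)
assert(foldL(e, a) == Set(a))               // map(a) = foldL(e, a) = Set(a)
```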
In this formulation, the basic mapreduce operation is now a single foldLeft
operation, instead of a mapping followed by a monoidal reduction.
The parallel version is analogous.
Each partition uses the new foldLeft
operation, and the final reduction of intermediate monoidal results remains the same as before.
The foldLeft
function is potentially a much more general operation, and it raises the question of whether this new encoding is indeed parallelizable as before.
I will conclude with a proof that this encoding is also parallelizable.
Note that the law foldL(b, a) = b ++ foldL(e, a)
is a significant component of this proof, as it represents the constraint that foldL
behaves like an analog of reducing b
with a monoidal representation of element a
.
In the following proof I use a scala-like pseudo code, as described in the introduction:
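A sketch of such a proof, in scala-like pseudo code: the claim is that folding each partition separately and then reducing the partial results yields the same value as a single left fold over all the data. It suffices to show this for two partitions d1 and d2; the general case follows by induction on the number of partitions.

```scala
// Claim: for any split of the data into d1 ++ d2,
//   (d1 ++ d2).foldLeft(e)(foldL)
//     == d1.foldLeft(e)(foldL) ++ d2.foldLeft(e)(foldL)
//
// Proof, by induction on the length of d2.
//
// Base case: d2 is empty.
//   (d1 ++ Seq()).foldLeft(e)(foldL)
//     == d1.foldLeft(e)(foldL)                                 // d1 ++ Seq() == d1
//     == d1.foldLeft(e)(foldL) ++ e                            // e is the identity
//     == d1.foldLeft(e)(foldL) ++ Seq().foldLeft(e)(foldL)     // fold of empty seq is e
//
// Inductive case: d2 == d :+ a (a sequence d followed by a final element a).
// Assume the claim holds for d.
//   (d1 ++ (d :+ a)).foldLeft(e)(foldL)
//     == foldL((d1 ++ d).foldLeft(e)(foldL), a)                // definition of foldLeft
//     == (d1 ++ d).foldLeft(e)(foldL) ++ foldL(e, a)           // law: foldL(b, a) = b ++ foldL(e, a)
//     == (d1.foldLeft(e)(foldL) ++ d.foldLeft(e)(foldL)) ++ foldL(e, a)  // inductive hypothesis
//     == d1.foldLeft(e)(foldL) ++ (d.foldLeft(e)(foldL) ++ foldL(e, a))  // associativity of ++
//     == d1.foldLeft(e)(foldL) ++ foldL(d.foldLeft(e)(foldL), a)         // law, applied in reverse
//     == d1.foldLeft(e)(foldL) ++ (d :+ a).foldLeft(e)(foldL)  // definition of foldLeft
//     == d1.foldLeft(e)(foldL) ++ d2.foldLeft(e)(foldL)
// QED
```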
