Scala supplies a parallel collections library that was designed to make it easy for a programmer to add parallel computing over the elements in a collection. In this post, I will describe a case study of applying Scala’s parallel collections to cleanly implement multithreading support for training a KMedoids clustering model.
Motivation
KMedoids clustering is a relative of KMeans clustering that does not require an algebra over input data elements. That is, KMedoids requires only a distance metric defined on elements in the data space, and can cluster objects which do not have a welldefined concept of addition or division that is necessary for computing the centroids required by KMeans. For example, KMedoids can cluster character strings, which have a notion of distance, but no notion of summation that could be used to compute a geometric centroid.
This additional generality comes at a cost. The medoid of a collection of elements is the member of the collection that minimizes some function F of the distances from that element to all the other elements in the collection. For example, F might be the sum of distances from one element to all the elements, or perhaps the maximum distance, etc. It is not hard to see that the cost of computing a medoid of (n) elements is quadratic in (n): Evaluating F is linear in (n) and F in turn must be evaluated with respect to each element. Furthermore, unlike centroidbased computations used in KMeans, computing a medoid does not naturally lend itself to common scaleout computing formalisms such as Spark RDDs, due to the fullcrossproduct nature of the computation.
With this in mind, a more traditional multithreading approach is a good candidate to achieve some practical parallelism on modern multicore hardware. I’ll demonstrate that this is easy to implement in Scala with parallel sequences.
NonParallel Code
Consider a baseline nonparallel implementation of KMedoids, as in the following example skeleton code. (A working version of this code, under review at the time of this post, can be viewed here)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 

If we run the code above (deskeletonized), then we might see something like this output from our benchmarking, where I clustered a dataset of 40,000 randomlygenerated (x,y,z) points by Gaussian sampling around 5 chosen centers. (This data is numeric, but I provide only a distance metric on the points. KMedoids has no knowledge of the data except that it can run the given metric function on it):
1 2 3 4 5 6 

Observe that cluster sizes are generally not the same, and we can see the time per cluster varying quadratically with respect to cluster size.
A First Take On Parallel KMedoids
Studying our nonparallel code above, we can see that the computation of each new medoid is independent, which makes it a likely place to inject some parallelism. A Scala sequence can be transformed into a corresponding parallel sequence using the par
method, and so parallelizing our code is literally this simple:
1 2 3 4 

In this block, I also apply .seq
at the end, which is not always necessary but can avoid type mismatches between Seq[T]
and ParSeq[T]
under some circumstances.
In my case I also wish to exercise some control over the threading used by the parallelism, and so I explicitly assign a ForkJoinPool
thread pool to the sequence:
1 2 3 4 5 6 7 8 9 10 11 

Minor grievance: it would be nice if Scala supported some ‘inline’ methods, like seq.par(n)...
and seq.par(threadPool)...
, instead of requiring the programmer to break the flow of the code to invoke tasksupport =
, which returns Unit
.
Now that we’ve parallelized our KMedoids training, we should see how well it responds to additional threads. I ran the above parallelized version using {1, 2, 4, 8, 16, 32}
threads, on a machine with 40 cores, so that my benchmarking would not be impacted by attempting to run more threads than there are cores to support them. I also ran two versions of test data. The first I generated with clusters of equal size (5 clusters of ~8000 elements), and the second with one cluster being twice as large (1 cluster of ~13300 and 4 clusters of ~6700). Following is a plot of throughput (iterations / second) versus threads:
In the best of all possible worlds, our throughput would increase linearly with the number of threads; double the threads, double our iterations per second. Instead, our throughput starts to increase nicely as we add threads, but hits a hard ceiling at 8 threads. It is not hard to see why: our parallelism is limited by the number of elements in our collection of clusters. In our case that is k = 5, and so we reach our ceiling at 8 threads, the first thread number >= 5. Furthermore, we see that when the size of clusters is unequal, the throughput suffers even more. The time required to complete the clustering is dominated by the most expensive element. In our case, the cluster that is twice the size of other clusters:
1 2 3 4 5 6 

Take 2: Improving The Use Of Threads
Fortunately it is not hard to improve on this situation. If parallelizing by cluster is too coarse, we can try pushing our parallelism down one level of granularity. In our case, that means parallelizing the outer loop of our medoid function, and it is just as easy as before:
1 2 3 4 5 6 7 8 

Note that I retained the previous parallelism at the cluster level, otherwise the algorithm would execute parallel medoids, but one cluster at a time. Also observe that we are applying the same thread pool we supplied to the ParSeq at the cluster level. Scala’s parallel logic can utilize the same thread pool at multiple granularities without blocking. This makes it very clean to control the total number of threads used by some computation, by simply reusing the same threadpool across all points of parallelism.
Now, when we rerun our experiment, we see that our throughput continues to increase as we add threads. The following plot illustrates the throughput increasing in comparison to the previous ceiling, and also that throughput is less sensitive to the cluster size, as threads can be allocated flexibly across clusters as they are available:
I hope this short case study has demonstrated how easy it is to add multithreading to computations with Scala parallel sequences, and some considerations for making the best use of available threads. Happy Parallel Programming!