Updated 2009-06-06: Due to continuing interest in this project, I have created a git repository for the project source code. Maybe someone can finish adding in all of the remaining IMB benchmarks or figure out what is wrong with my implementation of the parallel merge sort? Access it here.
In earlier posts, I mentioned my intent to study Erlang's performance characteristics relative to the MPI library for C. Here is a more in-depth explanation, source code, and final results.
Parallel applications development is hard. Erlang was designed from the ground up for massively parallel, distributed, and fault-tolerant applications. Erlang, also by design, encourages modular, easily debugged, readable, and compact code. It was the intention of my project group to see how well Erlang fared against MPI.
Given the time frame of one semester, only microbenchmark and kernel applications could be developed. A fair comparison requires a benchmark suite based on reasonably large applications. As a consequence of the small size of the benchmarks, we were unable to evaluate Erlang's fault-tolerance aspects. Regardless, we did gather some useful information. The platform that these experiments ran on was a 16-core SMP machine with 32 GB of RAM. Erlang's virtual machine does not natively support cluster architectures.
We ported a small part of the Intel MPI Benchmark suite to Erlang. This suite measures MPI message-passing performance. It can be used by engineers to compare and measure the performance of MPI implementations, network interconnects, etc. Note: all of the Erlang code listed in the IMB section was written by me.
The portions we ported included:
Code common to all benchmarks, such as the timing code and results-printing code.
Pingpong — a simple test involving only two processes alternately sending and receiving messages.
Pingping — a similar test wherein non-blocking send operations are allowed to overlap.
Sendrecv — a test wherein any number of processes are organized into a ring structure and send messages along this ring.
Alltoall — a test where every process sends a message to each and every other process. This benchmark is extremely communication-intensive.
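To give a feel for the pattern the pingpong benchmark measures, here is a minimal Erlang sketch. The module, function, and message names are mine for illustration, not the project's actual code; it simply times N round trips of a payload between two processes using timer:tc/1.

```erlang
%% Illustrative pingpong sketch: two processes alternately send and
%% receive a payload; the total round-trip time is measured.
-module(pingpong_sketch).
-export([run/2]).

%% run(Iterations, Payload) -> elapsed microseconds for all round trips.
run(N, Payload) ->
    Pong = spawn(fun() -> pong_loop() end),
    {Time, ok} = timer:tc(fun() -> ping_loop(N, Pong, Payload) end),
    Pong ! stop,
    Time.

ping_loop(0, _Pong, _Payload) -> ok;
ping_loop(N, Pong, Payload) ->
    Pong ! {ping, self(), Payload},          %% send...
    receive {pong, Payload} -> ok end,       %% ...and wait for the echo
    ping_loop(N - 1, Pong, Payload).

pong_loop() ->
    receive
        {ping, From, Payload} ->
            From ! {pong, Payload},          %% echo the payload back
            pong_loop();
        stop -> ok
    end.
```

A real benchmark would, as IMB does, repeat this for a range of payload sizes and report time per message.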
We set out to port four programming kernels: an FFT, a matrix multiplication, a merge sort, and a Monte Carlo integration. Due to time constraints, we finished MPI code for the merge sort, matrix multiplication, and Monte Carlo integration kernels, and complete Erlang code for the matrix multiplication and Monte Carlo integration kernels. Note: most of the C code was written by my project partner. The Erlang code was, again, written by me.
This kernel was very computationally intensive, with very little overhead in the code relating to parallelization of the algorithm. In computer engineering parlance, this quality is described as "embarrassingly parallel".
More information on Monte-Carlo methods: http://en.wikipedia.org/wiki/Monte_Carl
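The embarrassingly parallel structure can be sketched in a few lines of Erlang. This is an illustrative pi estimator, not the project's integration kernel, and it assumes the modern rand module (which post-dates the code described here): each worker samples independently, and only a single message per worker flows back to the parent.

```erlang
%% Illustrative Monte Carlo sketch: estimate pi by sampling points in
%% the unit square and counting hits inside the quarter circle.
-module(mc_sketch).
-export([pi/2]).

pi(Workers, SamplesPerWorker) ->
    Parent = self(),
    %% Spawn independent workers; no communication until the end.
    [spawn(fun() -> Parent ! {hits, count_hits(SamplesPerWorker, 0)} end)
     || _ <- lists:seq(1, Workers)],
    %% Collect one result message per worker and combine.
    Hits = lists:sum([receive {hits, H} -> H end
                      || _ <- lists:seq(1, Workers)]),
    4.0 * Hits / (Workers * SamplesPerWorker).

count_hits(0, Acc) -> Acc;
count_hits(N, Acc) ->
    X = rand:uniform(),
    Y = rand:uniform(),
    case X * X + Y * Y =< 1.0 of
        true  -> count_hits(N - 1, Acc + 1);
        false -> count_hits(N - 1, Acc)
    end.
```

The parallelization overhead is one spawn and one message per worker, which is why this class of kernel scales so well.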
This kernel was moderately communication-intensive: the parallelization scheme involved sending matrix rows and an entire copy of one of the input matrices to each of the worker processes, with the workers reporting similarly sized results back to the parent process.
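That row-distribution scheme can be sketched roughly as follows, assuming matrices are represented as lists of row lists. The module and helper names are hypothetical, not the project's code: each worker receives a slice of A's rows plus all of B, and sends its result rows back.

```erlang
%% Illustrative row-distribution sketch for parallel matrix multiply.
-module(matmul_sketch).
-export([mult/3]).

mult(A, B, Workers) ->
    Parent = self(),
    Bt = transpose(B),                      %% whole B goes to every worker
    Chunks = chunk(A, Workers),             %% A's rows are split up
    Pids = [spawn(fun() ->
                Parent ! {self(), [[dot(Row, Col) || Col <- Bt]
                                   || Row <- Chunk]}
            end) || Chunk <- Chunks],
    %% Receive in spawn order so the result rows stay in order.
    lists:append([receive {Pid, Rows} -> Rows end || Pid <- Pids]).

dot(R, C) -> lists:sum(lists:zipwith(fun(X, Y) -> X * Y end, R, C)).

transpose([[] | _]) -> [];
transpose(M) -> [[hd(R) || R <- M] | transpose([tl(R) || R <- M])].

%% Split a list into roughly equal pieces, one (or more) per worker.
chunk(L, N) -> chunk_(L, max(1, length(L) div N)).
chunk_([], _Size) -> [];
chunk_(L, Size) when length(L) =< Size -> [L];
chunk_(L, Size) ->
    {H, T} = lists:split(Size, L),
    [H | chunk_(T, Size)].
```

Copying all of B to every worker is what makes the scheme communication-heavy; the per-worker payload does not shrink as workers are added.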
The C code here is complete, but the Erlang code is not. The Erlang code is probably four or five critical lines away from functioning correctly: a subtle race condition surfaced as the deadline approached and was never solved. The Erlang code is included because it does contain some interesting problem-solving approaches.
The parallelization scheme was based on a really clever hypercube communication pattern. Of special interest are the "Parallel Merge" code fragment on the linked page, and the hypercube communication template here. The hypercube communication pattern is perhaps not the most efficient parallelization scheme for use on an SMP machine — it was likely designed to take advantage of older supercomputers, which often used hypercube network interconnects. A chief problem with this parallelization scheme is that the algorithm is forced to serialize into a number of steps equal to the hypercube's dimension. Each of these steps is itself parallelizable, however.
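The pairing rule behind the hypercube pattern is compact enough to sketch directly. This fragment is illustrative, not the project's merge sort: in a d-dimensional hypercube of 2^d processes, communication proceeds in d serial steps, and at step I each process exchanges data with the process whose rank differs in bit I.

```erlang
%% Illustrative hypercube pairing sketch.
-module(hypercube_sketch).
-export([partner/2, schedule/1]).

%% Partner of Rank at 0-based step Step: flip bit Step of the rank.
partner(Rank, Step) -> Rank bxor (1 bsl Step).

%% Full schedule for 2^Dims processes: one list of {Rank, Partner}
%% pairs per serial step. The outer list's length is the number of
%% forced-serial steps mentioned above.
schedule(Dims) ->
    Ranks = lists:seq(0, (1 bsl Dims) - 1),
    [[{R, partner(R, Step)} || R <- Ranks]
     || Step <- lists:seq(0, Dims - 1)].
```

For 8 processes this yields 3 serial steps, each containing 8 pairwise exchanges that can all run concurrently.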
The IMB benchmarks were run as follows:
Each benchmark was run many times for each data size: smaller data sizes were run 1000 times consecutively, with the repetition count decreasing for larger data sizes.
MPI is shown to clearly outperform Erlang in terms of message-passing overhead (as can be seen in the IMB benchmarks) and computational performance (as can be seen in the Monte Carlo kernel benchmark). Erlang, however, seems to outperform MPI when sending messages with large payloads, as the majority of the IMB benchmarks show. It should be understood, however, that while the data payload in MPI consists of raw data allocated using malloc, the data in Erlang was prepared using Erlang's bit-packing method. We don't understand Erlang internals sufficiently to know whether the bit-packing method engages in some form of compression. It is very possible that the actual magnitude of Erlang's data payloads is smaller than MPI's. If this is the case, then Erlang is shown to be more efficient at sending lists of values and raw data structures than MPI. It could also be the case that Erlang's virtual machine is allocating processes in such a way that they are able to transmit data efficiently across the test platform.
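One rough way to probe the payload-size question is to measure a term's external representation with term_to_binary/1,2, whose compressed option applies zlib compression. Whether the VM actually uses this representation for local message passing on an SMP node is precisely the part we did not verify; the sketch below (hypothetical module name) only shows how the two sizes can be compared.

```erlang
%% Illustrative sketch: size of a term's external representation,
%% plain versus zlib-compressed.
-module(payload_sketch).
-export([sizes/1]).

%% sizes(Term) -> {PlainBytes, CompressedBytes}.
sizes(Term) ->
    {byte_size(term_to_binary(Term)),
     byte_size(term_to_binary(Term, [compressed]))}.
```

Highly regular payloads, like the repeated values our benchmarks sent, compress dramatically, which would support the smaller-payload hypothesis if compression is in play.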
Note: the results of the Matrix Multiplication are misrepresented due to a data-collection error; the values reported for Erlang should be reported in seconds, not microseconds. This indicates that the methods used in the Erlang matrix multiplication program, while idiomatic to the language, are very likely algorithmically complex and inefficient.
Note: all data structures were built using native Erlang lists. I have only recently been alerted to the fact that Erlang has a built-in array module; using it would likely have improved matrix multiplication performance greatly.
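For comparison, a small sketch of the difference (hypothetical module name): lists:nth/2 must walk the list, giving O(N) element access, while the array module provides near-constant-time random access — exactly what repeated matrix element lookups need.

```erlang
%% Illustrative sketch: list access versus array access.
-module(array_sketch).
-export([nth_list/2, nth_array/2]).

%% 0-based index into a list: O(N) walk per lookup.
nth_list(I, List) -> lists:nth(I + 1, List).

%% 0-based index via the array module: near-constant lookup.
%% (In real code the array would be built once and reused.)
nth_array(I, List) ->
    A = array:from_list(List),
    array:get(I, A).
```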
The IMB benchmarks and the Monte Carlo integration kernel came together fairly easily; I programmed these benchmarks very modularly and designed them to be easy to test and read. The matrix multiplication, however, proved unexpectedly clunky and problematic to program in a straightforward manner. The merge sort benchmark was the most complex of the benchmarks I attempted, and also the most satisfying to program; however, I left it unfinished.
Despite its lower raw performance speed, Erlang might still find a home in high performance computing. Very large supercomputers suffer great performance penalties due to hardware failures. Erlang might be used as a monitoring system for jobs running on such a system or even as a communications layer wrapped around compiled C or Fortran applications that perform the heavy computation.
I will post highlights from the source code in a later post.