jd e ([info]jkndrkn) wrote,
@ 2008-05-02 05:46:00
Previous Entry  Add to memories!  Tell a Friend!  Next Entry
Erlang Versus MPI - Final Results and Source Code
In earlier posts, I mentioned my intent to study Erlang's performance characteristics relative to the MPI library for C. Here is a more in-depth explanation, source code, and final results.

Proposal

Parallel applications development is hard. Erlang was designed from the ground up for massively parallel, distributed, and fault tolerant applications. Erlang, also by design, encourages modular, easily debugged, readable, and compact code. It was the intention of my project group to see how well Erlang fared against MPI.

Experiments: Overview

Given the time frame of one semester, only microbenchmark and kernel applications could be developed. Fair comparison requires a benchmark suite based on reasonably large applications. As a consequence to the small size of the benchmarks, we were unable to evaluate Erlang's fault tolerance aspects. Regardless, we did gather some useful information. The platform that these experiments ran on was a 16-core SMP machine with 32GB of RAM. Erlang's virtual machine does not natively support cluster architectures.

Experiments: Microbenchmarks

We ported a small part of the Intel MPI Benchmark suite to Erlang. This suite measures MPI message-passing performance. It can be used by engineers to compare and measure the performance of MPI implementation, network interconnects, etc. Note: all of the Erlang code listed in the IMB section was written by me.

The portions we ported included:

Code common to all benchmarks, such as timing code and results printing code:
http://www.jkndrkn.com/proj/erlang-mpi/imb.hrl
http://www.jkndrkn.com/proj/erlang-mpi/imb.erl

Pingpong — a simple test involving only two processes alternately sending and receiving messages.

Pingping — a similar test wherein non-blocking send operations are allowed to overlap.

Sendrecv — a test wherein any number of processes are organized into a ring structure and send messages along this ring.

Alltoall — a test where every process sends a message to each and every process. This benchmark is extremely communication's intensive.

Experiments: Kernels

We set out to port four programming kernels: an FFT, a matrix multiplication, a merge sort, and a Monte-Carlo integration. Due to time constraints, we finished MPI code for the merge sort, matrix multiplication, and monte-carlo integration kernels, and complete Erlang code for the matrix multiplication and monte-carlo integration kernels. Note: most of the C code was written by my project partner. Erlang code was, again, written by me.

Monte-Carlo Integration:

This kernel was very computationally intensive. Very little overhead is seen in code relating to parallelization of the algorithm. In computer engineering parlance, this quality is described as "embarrassingly parallel".

http://www.jkndrkn.com/proj/erlang-mpi/integral.c
http://www.jkndrkn.com/proj/erlang-mpi/montecarlo.erl

More information on Monte-Carlo methods: http://en.wikipedia.org/wiki/Monte_Carlo_method

Matrix Multiplication:

This kernel was moderately communications intensive, as the parallelization scheme involved sending matrix rows and entire copies of one of the input matrices to each of the worker processes, as well as reporting similarly sized results to the parent process.

http://www.jkndrkn.com/proj/erlang-mpi/matmul.c
http://www.jkndrkn.com/proj/erlang-mpi/matmul.erl

Merge Sort:

The C code here is complete, but the Erlang code is not. The Erlang code is probably four or five critical lines away from functioning correctly. A subtle race condition set in during the approach to the deadline that was not solved. The Erlang code is included because it does contain some interesting problem-solving approaches.

http://www.jkndrkn.com/proj/erlang-mpi/mergesort.c
http://www.jkndrkn.com/proj/erlang-mpi/mergesort.erl

The parallelization scheme was based one a really clever hypercube communication pattern. Of special interest is the "Parallel Merge" code fragment in the page linked to in the preceding link, and the hypercube communication template here. The hypercube communication pattern is perhaps not the most efficient parallelization scheme for use on an SMP machine — it was likely designed to take advantage of older supercomputers which often used hypercube network interconnects. A chief problem with this parallelization scheme is that the algorithm is forced to be serialized into steps equal to the number of dimensions that the hypercube is built in. Each of these steps is parallelizable, however.

Results:


The IMB benchmarks were run as follows:


Each benchmark was run many times for each data size, with smaller data sizes being run 1000 times consecutively and larger data sizes decreasing.

Pingpong:


Pingping:


Sendrecv:






All-to-all:






Matrix Multiplication:





Monte Carlo:






Analysis:

MPI is shown to clearly outperform Erlang in terms of message passing overhead (as can be seen in the IMB benchmarks) and computational performance (as can be seen in the Monte Carlo kernel benchmark). Erlang, however, seems to outperform MPI when sending messages with large payloads, as the majority of the IMB benchmarks show. It should be understood, however, that while the data payload in MPI consists of raw data allocated using malloc, the data in Erlang was prepared using Erlang's bit-packing method. We don't understand Erlang internals sufficiently to know whether the bit-packing method engages in some form of compression. It is very possible that the actual magnitude of Erlang's data payloads are smaller than MPI's. If this is the case, then Erlang is shown to be more efficient at sending lists of values and raw data structures than MPI. It could also be the case that Erlang's virtual machine is allocating processes in such a way that they are able to transmit data efficiently across the test platform.

Note: the results of the Matrix Multiplication are misrepresented due to a data collection error. The values reported for Erlang should be reported in seconds, not microseconds. This indicates that the methods used in the Erlang matrix multiplication program, while idiomatic to the language, are very likely to be algorithmically complex and inefficient.

Note: all data structures were built using native Erlang lists. I have just been recently alerted to the fact that Erlang has a built-in array module. Matrix Multiplication performance would likely have increased greatly.

Remarks:

The IMB benchmarks and Monte-Carlo integration kernel fell together fairly easily. I programmed these benchmarks very modularly and designed them to be easy to test and read. The matrix multiplication, however, proved unexpectedly clunky and problematic to program in a straightforward manner. The merge sort benchmark was the most complex of the benchmarks I attempted, and also the most satisfying to program. However, I left it unfinished.

Despite its lower raw performance speed, Erlang might still find a home in high performance computing. Very large supercomputers suffer great performance penalties due to hardware failures. Erlang might be used as a monitoring system for jobs running on such a system or even as a communications layer wrapped around compiled C or Fortran applications that perform the heavy computation.

I will post highlights from the source code in a later post.


(Post a new comment)

Methodology question
(Anonymous)
2008-05-02 01:43 pm UTC (link)
How were the Erlang programs compiled? In particular, did you use HIPE? It's true that the BEAM VM isn't designed for heavy computational jobs, and my personal experience with HIPE is mixed. For some things, it performs somewhat better, and for others significantly worse. I expect that overall even with HIPE the C code would do better at computational stuff -- I'm actually more curious about how well BEAM Erlang does versus HIPE Erlang on a 16-core system.

(Reply to this)(Thread)


[info]jkndrkn
2008-05-02 02:08 pm UTC (link)
Hi. Thanks for your question. Yes, I made sure to use HiPE.

I didn't test HiPE vs. non-HiPE code. That is an interesting question. You might want to check out Google scholar for information regarding that. I remember coming across articles from around 2000-2001 regarding HiPE that might contain data on performance comparisons.

(Reply to this)(Parent)

another methodology question
(Anonymous)
2008-05-02 03:50 pm UTC (link)
I'm curious which MPI implementation and communication driver you used. Both of these could have a pretty substantial impact. Also, have you given any though to comparing openMP? Hybrid programming models (MPI+OpenMP) are currently a hot topic in HPC, with an eye toward supporting massively parallel multicore systems.

(Reply to this)(Thread)


[info]jkndrkn
2008-05-02 05:55 pm UTC (link)
Good question.

We used mpich2 version 1.0.6p1. Not sure about the communications driver. That's a question for my system's admin. I can find out if you are curious. However, I do know that the SMP machine used two quad-socket motherboards with four dual-core opterons per motherboard. The interconnect was the AMD hypertransport. MPI was configured to see each processor core as a compute node.

Shared memory paradigms are interesting and certainly are becoming practical and popular. Our lab has quite a few UPC users and active advocates. However, we found very few benchmark suites written for such new approaches and were forced to go with MPI for which many benchmark suites exist. We also chose MPI because it is a very reasonable and popular baseline against which to compare a more radical approach.

I'm not a huge fan of shared memory programming for larger applications because I feel that they are a step backwards in a way. Fault tolerance is already hard enough in MPI applications -- the most popular approach is checkpointing and restarting when and if any one compute node or communication link goes down. This can be quite inefficient and may result in lower utilization of a larger system. What needs to happen is Erlang-like error recovery wherein the parent process receives any errors thrown by the child process and can then recover from that error. This approach would avoid a complete restart of the system, but would likely only work well using a message passing approach since shared memory systems abstract inter-process communication away from the programmer.

When shared memory is thrown into the mix, the run-time system would then have to deal transparently with faults and take the necessary steps to restart the application and restore it to a workable state. The programmer would likely have less say in how the fault-recovery method works because the messaging layer is abstracted away behind distributed data structures and parallel data operations. I feel that maybe shared memory is a step backwards, though I am not intimately familiar with shared memory language internals.

(Reply to this)(Parent)(Thread)

Re:
(Anonymous)
2008-05-02 06:41 pm UTC (link)
I think that mpich2-1.0.6 uses ch3:sock as the default communication device. You might try using another device (--with-device), either ch3:ssm or ch3:nemesis. I know that nemesis was optimized for shared memory performance.

I bet it might close up the differences for large messages.

(Reply to this)(Parent)(Thread)


[info]jkndrkn
2008-05-02 06:50 pm UTC (link)
We weren't really interested in optimizations and wanted to use fairly "default" and unoptimized programming approaches in this comparison. Even so, this is nice information to know.

Thanks again.

(Reply to this)(Parent)(Thread)

Re: default is seldom deployed in the real world
(Anonymous)
2008-05-02 08:02 pm UTC (link)
The national labs usually pay huge amount of time to deploy infiniband of myrianet, they very rarely use the default configuration. TCP/IP header overhead is too much for the small packages.

(Reply to this)(Parent)(Thread)


[info]jkndrkn
2008-05-02 08:05 pm UTC (link)
You are right. When maximum performance needs to be extracted out of an architecture, customization is both necessary and expected.

(Reply to this)(Parent)


(Anonymous)
2008-05-02 06:04 pm UTC (link)
Very interesting (and somewhat surprising) results MPI would definitely benefit from non blocking calls. BTW, which MPI did you use? On the other hand, you are seeing communication and computation together in Erlang, where even the simplest math is cumbersome (to put it politely) to write and very slow. PS. Please consider converting the results to speedup measurements so people can make better sense out of them.

(Reply to this)(Thread)


[info]jkndrkn
2008-05-02 06:08 pm UTC (link)
MPI: mpich2 version 1.0.6p1

MPI does support non-blocking sends and receives. These are sort of tricky to use, though. Also, they are only useful if you use them to "hide" computation inside longish bursts of communication activity.

Yes, I agree that math was a bit troubling to perform in Erlang. I'm guessing that strong support for scientific applications was not on the priority list when the language was created.

Speedup measurements are a good idea, thanks.

Thanks for your comments.

(Reply to this)(Parent)


[info]trashhologram
2008-05-04 07:54 pm UTC (link)
hi :)

(Reply to this)(Thread)


[info]jkndrkn
2008-05-05 01:12 pm UTC (link)
greetings

(Reply to this)(Parent)

random access?
[info]keymone.myopenid.com
2008-05-05 07:46 am UTC (link)
using random list access is bad idea with Erlang lists
strongly suggest you to get rid of mget(M, I, J) function

(Reply to this)(Thread)


[info]jkndrkn
2008-05-05 01:13 pm UTC (link)
Yes, that is probably one of the reasons the matrix benchmark performed so poorly.

What technique do you suggest to use instead?

(Reply to this)(Parent)(Thread)


[info]keymone.myopenid.com
2008-05-19 09:34 am UTC (link)
actually the only technique you should use with Erlang is tail-recursion. Erlang stores arrays as lists with sequential access, so when you type lists:nth(N, L) it can be represented like

for(int i=0; i
[Error: Irreparable invalid markup ('<n;>') in entry. Owner must fix manually. Raw contents below.]

actually the only technique you should use with Erlang is tail-recursion. Erlang stores arrays as lists with sequential access, so when you type lists:nth(N, L) it can be represented like

for(int i=0; i<N; i++)
L.next();
return L.current();

using tail recursion you could have scalar multiplication of vectors in such simple way as:

smul([], []) ->
0;
smul([H1|T1], [H2|T2]) ->
H1*H2 + smul(T1, T2).

it is recurrent function and you would avoid it in most of languages, but in erlang as long as you have recurrent call to function in last opration you don't have to worry about stack overflow - this simply won't happen. Erlang reuses current element of stack for recurrent calls, and simply jumps to function start with new parameters.

There are some drawbacks with such approach. For example to achieve maximum performance you will need to store matrix 2 times - list of columns and list of rows. Only in that case you will be able to do 2 way multiplication - A*B and B*A. Otherwise you will have to use

column(N) ->
[lists:nth(N, M) || M <- A].

to get column of matrix and lists:nth is a bad function :)

hope that helps, have fun

(Reply to this)(Parent)


Create an Account
Forgot your login?
Login w/ OpenID
English • Español • Deutsch • Русский…