Parallelization of the Fast Multipole Algorithm: Algorithm and Architecture Design

James F. Leathrum, Jr.


Abstract

The Fast Multipole Algorithm (FMA) is an algorithm developed by Leslie Greengard and Victor Rokhlin which calculates the interactions (force or potential) between N bodies in a uniformly distributed in O(N) time. This is an improvement over the O(N^2) time of classical methods. The performance of the algorithm is improved in this work by parallelization which should ideally result in O(N/P) time for P processors. The 3-dimensional version of the algorithm is considered as it should prove useful in molecular dynamics and N-body gravitational studies. Both software and hardware issues are considered.

Initial work on the algorithm was done on the serial code to improve both the speed and memory constraints. Speed was improved by reducing the constant associated with the time complexity. Memory was reduced by modifications to the algorithm to reduce the amount of data stored at any point in time.

A parallel form of the FMA (The Parallel Fast Multipole Algorithm (PFMA)) was designed and implemented on a series of existing parallel architectures to study the characteristice of the algorithm, in particular the runtime ans speedup. An alternative to Greengard's Adaptive Fast Multipole Algorithm (AFMA) which addresses nonuniform particle systems is also proposed to facilitate its parallelization.

To prove the usefulness of the FMA to the scientific community, the serial version of the code has been applied to two sample application. The first uses FMA as the N-body solver in an existing molecular dynamics code. Results for several molecules are presented. The second uses the FMA code to solve a Gaussian distribution of bodies such as might occur in graviational force studies.

Given the experiences gathered from the parallel implementations, a new hardware architecture is proposed which takes advantage of the locally shared memories which are associated with the boundaries between processors so that each memory is shared among a subset of the processors in the system. Then a tree structure is imposed on this interconnect to provide easier treewalk operations which are required by FMA.


Note that this PhD dissertation is contained in four postscript files.

Back to the Scientific Computing Group Home Page


This page last modified 7/12/95 by Daniel Paul