Data Parallel Haskell is nested data parallelism in Haskell. It builds on NESL by developing fusion techniques to eliminate the intermediate arrays generated by data-parallel code. It does so in a way that makes it usable as a library, instead of requiring a bunch of compiler modifications (kinda like Rayon!).
Prior work and the contribution
DPH situates itself as an extension of NESL. Specifically, it notes that NESL's "crucial breakthrough" was the flattening (or vectorization) transformation, which turns nested parallelism into operations on flat arrays.
This paper builds on that work in a few key ways:
- Data-parallel programs generate lots of intermediate arrays; the authors develop fusion techniques to eliminate those intermediates.
- They introduce associated types (which Rust later adopted), a precursor to GHC's type/data families (which generalize them), to enable representing unboxed array types!
- The system handles more than just built-in, product, and array types: it also handles user-defined, sum, and function types!
How does it work
First problem: how do we represent parallel arrays? In particular, GHC's arrays are boxed by default; how can we get around that? Well, they have a datatype `[:e:]` representing parallel arrays. This datatype is defined as an associated type family in a typeclass `ArrElem`:
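A minimal sketch of the idea, writing `PArr e` for the paper's `[:e:]` notation (the method set here is abbreviated; the real class also has indexing, slicing, and so on):

```haskell
{-# LANGUAGE TypeFamilies #-}

-- `PArr e` stands in for [:e:]: the representation is chosen
-- separately for each element type e via an associated data type.
class ArrElem e where
  data PArr e               -- per-element-type array representation
  lengthP :: PArr e -> Int  -- one illustrative method
```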
Every type `a` can define its own internal representation for what parallel arrays of that type should look like. We can take advantage of this to have a flat representation of nested arrays!
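Concretely, here's a hedged sketch of the instances, with `Data.Vector.Unboxed` standing in for the paper's unboxed `ByteArray`s, the list type `[a]` standing in for a nested `[:a:]`, and the constructor names invented for illustration:

```haskell
import qualified Data.Vector.Unboxed as U

-- Ints: a genuinely unboxed array, no per-element pointers.
instance ArrElem Int where
  data PArr Int = ArrInt (U.Vector Int)
  lengthP (ArrInt v) = U.length v

-- Pairs: an array of pairs is stored as a pair of arrays.
instance (ArrElem a, ArrElem b) => ArrElem (a, b) where
  data PArr (a, b) = ArrPair (PArr a) (PArr b)
  lengthP (ArrPair as _) = lengthP as

-- Nesting: a nested array is NOT an array of pointers to arrays.
-- It is one flat array of all the elements, plus a segment
-- descriptor recording where each subarray starts and how long it is.
instance ArrElem a => ArrElem [a] where
  data PArr [a] = ArrNested (U.Vector (Int, Int))  -- (start, length) per subarray
                            (PArr a)               -- all elements, flattened
  lengthP (ArrNested segd _) = U.length segd
```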
Then, when users write code operating on their parallel arrays:
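The paper's running example is sparse matrix-vector multiplication; a reconstruction in DPH's comprehension syntax (type synonyms are mine, and this needs GHC's old parallel-array extensions to compile):

```haskell
-- A sparse matrix: each row is an array of (column index, value) pairs.
type SparseMatrix = [:[:(Int, Float):]:]
type Vector       = [:Float:]

-- For each row, multiply its values against the matching vector
-- entries and sum: nested parallelism, one inner comprehension per row.
smvm :: SparseMatrix -> Vector -> Vector
smvm m v = [: sumP [: x * (v !: i) | (i, x) <- row :] | row <- m :]
```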
After any array comprehension syntax has been desugared into regular degular function calls, we use Haskell user-level rewrite rules to lift sequential operations to parallel ones. In this case, we want to lift multiplication:
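Something like the following sketch, where `dotRow` is a name I made up for the desugared inner loop, and the exact left-hand side of the rule is an assumption:

```haskell
-- After desugaring, the inner comprehension of smvm becomes a chain
-- of flat array operations, where bpermuteP v is = [: v !: i | i <- is :]
-- (a backwards permutation).
dotRow :: Vector -> [:(Int, Float):] -> Float
dotRow v row = sumP (xs ^* bpermuteP v is)
  where (is, xs) = unzipP row

-- (^*) is multiplication lifted to whole arrays; in the real system
-- it is a backend primitive, defined here only to make the sketch hang
-- together.
infixl 7 ^*
(^*) :: [:Float:] -> [:Float:] -> [:Float:]
(^*) = zipWithP (*)

-- A rewrite rule teaching GHC to replace the sequential-style
-- operation with the lifted one (rule name hypothetical):
{-# RULES "lift (*)" forall xs ys. zipWithP (*) xs ys = xs ^* ys #-}
```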
This generates lots of intermediate arrays: `bpermuteP v is` creates one, `xs ^* ...` creates another. If we have a thread pool, we need to synchronize every time a temporary is completed. Ideally, we want each thread to do the traversal, extraction, multiplication, and summing all on its own. This is loop fusion!
To do this, the authors create the `Dist a` type, representing a value of type `a` split into chunks, one per thread. They implement operations on `Dist a` values, like `mapD` and `zipD`. Since each of these calls requires a synchronization at the end, they then use rewrite rules to fuse operations together, like `forall f g xs. mapD f (mapD g xs) = mapD (f . g) xs`.
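A hedged sketch of how these pieces might fit together (the representation of `Dist` is assumed; the real type is abstract, with one chunk per gang thread):

```haskell
-- One chunk per worker thread; a plain list is illustrative only.
newtype Dist a = Dist [a]

-- Apply f to every chunk. In the real library each chunk is processed
-- by its own thread, with a synchronization point at the end.
-- (For the rule below to fire, mapD would also need its inlining
-- delayed, e.g. with a phased INLINE pragma.)
mapD :: (a -> b) -> Dist a -> Dist b
mapD f (Dist chunks) = Dist (map f chunks)

-- Pair up corresponding chunks without touching their elements.
zipD :: Dist a -> Dist b -> Dist (a, b)
zipD (Dist xs) (Dist ys) = Dist (zip xs ys)

-- Fusion: two traversals (and two synchronizations) become one.
{-# RULES "mapD/mapD" forall f g xs. mapD f (mapD g xs) = mapD (f . g) xs #-}
```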
Data-parallel operations can be decomposed into parallel operations over distributed types and sequential operations that are executed simultaneously by all members of a thread pool.
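For example, a parallel sum decomposes in exactly this style (a sketch; `splitD`, `sumS`, and `sumD` follow the paper's `Dist` vocabulary, but their exact signatures are assumptions):

```haskell
sumP :: [:Float:] -> Float
sumP xs = sumD               -- parallel: combine the per-thread partial sums
        ( mapD sumS          -- sequential: each thread sums its own chunk
        ( splitD xs ))       -- parallel: distribute chunks to the threads
```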