Authors: Mark Hill, Vijay Reddi
This paper describes the trend in which exploiting parallelism has gradually moved from the low-abstraction instruction level (ILP), to the higher-abstraction thread/task level (TLP), to the somewhat out-of-place data level (DLP), and now to a more understandable, subsuming level: "specialized accelerator"-level parallelism (henceforth, ALP).
ALP refers to a kind of modularity in hardware in which specific functions, tasks, or, more generally, sections of applications run faster and more energy-efficiently when computed exclusively on dedicated, hardened hardware modules.
Plenty of limitations, though:
- Obviously, this is feasible only when such separable/accelerable sections of applications occur frequently and perform poorly on traditional CPUs -- in other words, the accelerators you harden should see good reuse.
- Moreover, the choice of which functions get dedicated accelerators is mostly based on current or expected application trends, and therein lies a drawback: it is difficult to identify the use cases of the future. A smart system like either of the following would be great!
	- A system that could automatically identify efficiently accelerable portions of code
	- A system that could intelligently learn to pick the best ensemble of accelerators, keeping pre-determined cost, functionality, and other parameters in mind
- Resource contention in the NoC, insufficient memory bandwidth, etc.
- Many present designs involve the CPU as a coordinator for data movement between accelerators in shared-memory systems. In fact, the redundancy of copying data within the same physical memory, but across different shared-memory sub-spaces, consumes a significant amount of energy. Proposed solutions include software- and hardware-managed coherence protocols that allow sharing data spaces without moving data from one device's sub-space to another's. Existing research directions include:
	- Something like the CPU computing in a control plane and the accelerators in a data plane
	- Important decisions like software-managed scratchpads vs. software-transparent caches need to be well thought out.
- There's also a lack of coordination between accelerator(s) and host CPUs, arising from the intricacies of the different programming models involved: the need for heterogeneous scheduling of compute kernels, the many different DSLs, SDKs, driver interfaces, etc. It would be great if we could come up with a unifying programming-model abstraction like the GPGPU SIMT paradigm, which enabled GPUs to tackle both graphics workloads and general DLP workloads.
- Tailoring portions of any code to accelerators requires expertise and time. It also involves a sometimes-uninviting amount of setup for the communication and data-movement parts of the code.
	- Raising the abstraction level of writing such code would help (as was done for pthreads with OpenMP pragmas)
The authors propose:
Holistic optimization of accelerators and the programming milieu (as opposed to a localized approach)
The general conclusion: ALP has a lot of potential, and many areas of it need improvement.
Some ideas I have come up with:
- To tackle the issue of concurrency, we could start employing reconfigurable accelerators, or programmable accelerators (with perhaps limited flexibility).
- We could study application requirements and design toward an accelerator that is broadly useful on average, with reasonable tradeoffs across each of the use cases, to achieve increased average versatility.
- One particular point in the paper makes me wonder: are there ways to mathematically decompose diverse and complex accelerator functionalities into a mixture of smaller, more frequently employable common sub-functional accelerators, plus a few select-function complementary accelerators? This would reduce area and throughput, but could lead to better overall utilization of resources (if that is at all important).
PS: I feel the pre-print version, if you can get your hands on it, is more enjoyable.
An interesting perspective from the paper: Any performance that is not perceivable to the end-user is generally a waste of energy (referring to mobile SoCs)
Rabbit holes:
If you would like to suggest corrections or ideas, please feel free to email me; details in the Contacts section.