January 27, Thursday
12:00 – 13:30
Task Superscalar Multiprocessors
Computer Science seminar
Lecturer : Yoav Etsion
Affiliation : Barcelona Supercomputing Center
Location : 202/37
Host : Dr. Danny Hendler
Parallel programming is notoriously difficult and is still considered
an artisan's job. Recently, the shift towards on-chip parallelism has
brought this issue to the front stage. Commonly referred to as the
Programmability Wall, this problem has already motivated the
development of simplified parallel programming models, most notably
task-based models.
In this talk, I will present Task Superscalar Multiprocessors,
a conceptual multiprocessor organization that operates by dynamically
uncovering task-level parallelism in a sequential stream of
tasks. Task superscalar multiprocessors target an emerging class of
task-based dataflow programming models, and thus enable programmers to
exploit manycore systems effectively while simultaneously simplifying
the programming model.
The key component in the design is the Task Superscalar Pipeline,
an abstraction of instruction-level out-of-order pipelines that operates
at the task level and can be embedded into any manycore fabric to
manage cores as functional units. Like out-of-order pipelines that
dynamically uncover parallelism in a sequential instruction stream and
drive multiple functional units, the task superscalar pipeline
uncovers task-level parallelism in a stream of tasks generated by a
sequential thread. Utilizing intuitive programmer annotations of task
inputs and outputs, the task superscalar pipeline dynamically detects
inter-task data dependencies, identifies task-level parallelism, and
executes tasks out of order. I will describe the design of the task
superscalar pipeline, and discuss how it tackles the scalability
limitations of instruction-level out-of-order pipelines.
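The dependency detection described above can be illustrated with a small
software sketch. It is not the hardware design from the talk: the function
names, the task tuples, and the conservative handling of write-after-read and
write-after-write hazards are assumptions made for illustration only. Tasks
arrive in program order with annotated inputs and outputs, and the sketch
derives which tasks could run in parallel.

```python
from collections import defaultdict

def build_dependencies(tasks):
    """tasks: list of (name, inputs, outputs) in program order.
    Returns {task index: set of earlier task indices it must wait for}."""
    last_writer = {}             # object -> index of the latest task writing it
    readers = defaultdict(set)   # object -> tasks that read it since that write
    deps = {}
    for i, (_name, inputs, outputs) in enumerate(tasks):
        waits = set()
        for obj in inputs:                 # read-after-write (true dependency)
            if obj in last_writer:
                waits.add(last_writer[obj])
        for obj in outputs:
            if obj in last_writer:         # write-after-write
                waits.add(last_writer[obj])
            waits |= readers[obj]          # write-after-read
        deps[i] = waits
        for obj in inputs:
            readers[obj].add(i)
        for obj in outputs:
            last_writer[obj] = i
            readers[obj] = set()
    return deps

def schedule_waves(tasks, deps):
    """Group tasks into waves; tasks in one wave have no mutual dependencies."""
    level = {}
    for i in range(len(tasks)):
        level[i] = 1 + max((level[d] for d in deps[i]), default=-1)
    waves = defaultdict(list)
    for i, lv in level.items():
        waves[lv].append(tasks[i][0])
    return [waves[lv] for lv in sorted(waves)]

# Hypothetical sequential task stream: (name, inputs, outputs).
tasks = [
    ("init_a",  [],         ["a"]),
    ("init_b",  [],         ["b"]),
    ("scale_a", ["a"],      ["a"]),
    ("add",     ["a", "b"], ["c"]),
    ("norm_b",  ["b"],      ["d"]),
]
deps = build_dependencies(tasks)
waves = schedule_waves(tasks, deps)
```

In this example norm_b is issued after add in program order but lands in an
earlier wave, since it depends only on init_b. That is the task-level analogue
of out-of-order instruction execution.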
Finally, I will present simulation results demonstrating that the design
can sustain a decode time of under 60ns per task and dynamically
uncover data dependencies among as many as ~50,000 in-flight tasks,
using 7MB of on-chip eDRAM storage. This configuration achieves
speedups of 95-255x (average 183x) over sequential execution for nine
scientific benchmarks, running on a simulated multiprocessor with 256
cores.
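As a back-of-the-envelope reading of the figures quoted above, the short
calculation below derives the implied decode throughput, the average task
length at which a single pipeline decoding one task per 60ns could just keep
256 cores supplied, and the parallel efficiency behind each speedup. These
derived numbers are simple arithmetic on the quoted values and an
interpretation, not results reported in the talk. The variable names are
hypothetical.

```python
DECODE_NS = 60.0     # upper bound on decode time per task, from the abstract
CORES = 256          # simulated core count, from the abstract
SPEEDUPS = {"min": 95, "average": 183, "max": 255}

# Tasks decoded per second if each decode takes the full 60ns.
tasks_per_second = 1e9 / DECODE_NS

# Average task duration (in microseconds) at which the decoder issues
# one new task per core in the time a task takes to run.
min_task_us = CORES * DECODE_NS / 1000.0

# Fraction of the ideal 256x speedup that each reported figure achieves.
efficiency = {name: s / CORES for name, s in SPEEDUPS.items()}

print(f"decode throughput: {tasks_per_second / 1e6:.1f} M tasks/s")
print(f"task length to saturate {CORES} cores: {min_task_us:.2f} us")
for name, e in efficiency.items():
    print(f"{name} speedup efficiency: {e:.1%}")
```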