June 16, Tuesday
12:00 – 14:00
Cluster Based Computation of Relational Joins
Computer Science seminar
Lecturer : Prof. Jeffrey D. Ullman
Affiliation : Stanford University
Location : 202/37
Host : Prof. Shlomi Dolev
The prevalence of large racks of interconnected processor nodes forces us to
take another look at how to exploit parallelism when taking the join of large relations.
Sometimes, there is a gain in total cost to be had by distributing pieces of each relation
to several different nodes and computing the join of several large relations at once. The
optimization problem is to pick the degree of replication of each relation, under the
constraint that the total number of compute-nodes is fixed. We set up this problem as a
nonlinear optimization and show that there is always a solution (which must be
approximated by rounding to the nearest integers). For some of the most common types
of join – star joins and chain joins – we give closed-form solutions to the optimization
problem. Finally, we point out that the join algorithm we propose can be implemented
using features already present in Hadoop, the open-source implementation of map-reduce