Introduction
The Skyline operator (Börzsönyi, Kossmann, and Stocker, 2001) has emerged as an important and very popular summarization technique for multi-dimensional datasets. A Skyline query selects those objects from a dataset D that are not dominated by any others. An object p having d attributes (dimensions) dominates an object q if p is better than q in at least one dimension and not worse than q in all other dimensions, for a given comparison function. This dominance criterion defines a partial order, and therefore transitivity holds. The Skyline is the set of points that are not dominated by any other point of D. Without loss of generality, we consider the Skyline with the min function for all attributes.
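The dominance test and a naive Skyline computation can be sketched as follows; this is a minimal illustration under min semantics (smaller is better in every dimension), not an algorithm from the literature, and the names `dominates` and `skyline` are ours:

```python
def dominates(p, q):
    """True if p dominates q under min semantics: p is no worse than q in
    every dimension and strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Return all points not dominated by any other point.

    Naive O(n^2) tuple-to-tuple comparison; assumes no duplicate points.
    """
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

For example, `skyline([(1, 9), (4, 6), (5, 5), (6, 6), (9, 1)])` drops only `(6, 6)`, which is dominated by `(5, 5)`, and keeps the four incomparable points.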
The most cited example of Skyline queries is the search for a hotel that is cheap and close to the beach (Börzsönyi et al., 2001). Unfortunately, these two goals conflict, as hotels near the beach tend to be more expensive. In Figure 1, each hotel is represented as a point in the two-dimensional space of price and distance to the beach. Of interest are all hotels that are not worse than any other hotel in both dimensions.
The hotels p6, p7, and p8 are dominated by hotel p3, and hotel p9 is dominated by p4, while the hotels p1, p2, p3, p4, and p5 are not dominated by any other hotel and form the Skyline. From the Skyline, one can then make the final decision, weighing personal preferences for price and distance to the beach.
Most of the previous work on Skyline computation has focused on the development of efficient sequential algorithms (Chomicki, Ciaccia, and Meneghetti, 2013). The most prominent algorithms follow a tuple-to-tuple comparison-based approach. Building on this idea, several algorithms have been published over the last decade, e.g., NN (Nearest Neighbor) (Kossmann, Ramsak, and Rost, 2002), BBS (Branch and Bound Skyline) (Papadias, Tao, Fu, and Seeger, 2003), SFS (Sort-Filter Skyline) (Chomicki, Godfrey, Gryz, and Liang, 2003), or LESS (Linear Elimination-Sort for Skyline) (Godfrey, Shipley, and Gryz, 2007), just to name a few.
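To give a flavor of the sort-based family, the core idea behind SFS can be sketched as follows: pre-sorting by a monotone scoring function (here the coordinate sum, under min semantics) guarantees that no point can be dominated by a point appearing later, so a single filtering pass against the growing window of Skyline points suffices. This is a simplified illustration of the principle, not the published algorithm:

```python
def dominates(p, q):
    """p dominates q: no worse in every dimension, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def sfs_skyline(points):
    """Sort-filter sketch: sort by a monotone score (coordinate sum), then
    filter each point only against the window of already-accepted points."""
    window = []
    for p in sorted(points, key=sum):  # dominators always precede dominees
        if not any(dominates(s, p) for s in window):
            window.append(p)
    return window
```

The sort makes the single pass sound because a dominator has a strictly smaller coordinate sum than any point it dominates, so it is always examined first; points with equal sums cannot dominate one another.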
However, the datasets to be processed in real-world applications are of considerable size, i.e., there is a need for improved query performance, and parallel computing is a natural choice to achieve this improvement, since multicore processors are going mainstream (Mattson and Wrinn, 2008). This is due to the fact that Moore’s law of doubling the density of transistors on a CPU every two years – and hence roughly doubling sequential performance – may come to an end in the next decade due to thermal problems. Thus, chip manufacturers tend to integrate multiple cores into a single processor instead of increasing the clock frequency. In the upcoming years, we will see processors with more than 100 cores, but not with much higher clock rates. However, since most applications are built on sequential algorithms, software developers must rethink their algorithms to take full advantage of modern multicore CPUs (Mattson and Wrinn, 2008). The potential of parallel computing is best described by Amdahl’s law (Gustafson, 1988): the speedup of any algorithm using multiple processors is strictly limited by the time needed to run its sequential fraction. Thus, only highly parallel algorithms can benefit from modern multicore processors.
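Amdahl’s law can be stated concretely: if a fraction f of the work is parallelizable, the speedup on n processors is 1 / ((1 − f) + f/n). The small sketch below (our illustration, not from the article) shows why only highly parallel algorithms pay off: even an algorithm that is 95% parallel gains less than a factor of 17 from 100 cores, and can never exceed a factor of 20 regardless of core count.

```python
def amdahl_speedup(parallel_fraction, n_processors):
    """Amdahl's law: speedup on n processors when only parallel_fraction
    of the work can run in parallel; the serial remainder caps the gain."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)
```

For instance, `amdahl_speedup(0.95, 100)` is about 16.8, while the limit for an unbounded number of processors is 1 / 0.05 = 20.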