Partition-Based Clustering in Object Bases: From Theory to Practice
C. Gerlhof, A. Kemper, Ch. Kilger, G. Moerkotte
We classify clustering algorithms into SEQUENCED-BASED
techniques---which transform the object net into
a linear sequence--- and PARTITION-BASED clustering
algorithms. Tsangaris and Naughton have shown that the
partition-based techniques are superior. However, their work
is based on a single partitioning algorithm, the Kernighan
and Lin heuristics, which is not applicable to realistically
large object bases because of its high running-time complexity.
The contribution of this paper is two-fold: (1) we devise
a new class of greedy object graph partitioning algorithms
(GGP) whose running-time complexity is moderate while still
yielding good quality results. For large object graphs GGP is
the best known heuristics with an acceptable running-time. (2)
We carry out an extensive quantitative analysis of all
well-known partitioning algorithms for clustering object
graphs. Our analysis yields that no ONE algorithm performs
superior for ALL object net characteristics. Therefore, we
derive a multi-dimensional grid: the dimensions correspond to
particular characteristics of the object base configurations
and the grid entries indicate the best clustering algorithm
for the particular configuration. We propose an ADAPTABLE
clustering strategy by determining first the characteristics
of the clustering problem---given by, e.g., number and size
of objects, degree of object sharing---and then applying
the most suitable algorithm which is obtained from the
multi-dimensional grid.