Empirical Comparison of Scalability Strategies in Density-based Clustering Algorithms for Large Datasets

Adrián J. Ramírez-Díaz, José Fco. Martínez-Trinidad, J. Ariel Carrasco-Ochoa

Abstract


Density-based clustering can discover clusters of arbitrary shape while filtering out noise, but scaling it to large datasets remains a challenge. Recent density-based clustering algorithms for large datasets address this issue with different scalability strategies, such as parallelization, sampling, or reducing the number of comparisons between objects. This study compares such algorithms under a common framework, explicitly evaluating each of these strategies. Experiments were conducted on synthetic datasets scaled up to 50×. The comparative analysis shows that parallelism improves scalability, sampling reduces runtime at a potential loss in clustering quality, and reducing comparisons between objects maintains clustering quality. These findings support selecting an algorithm according to the available computational resources and the dataset size, which is valuable in practice.

Keywords


Clustering, large datasets, density-based clustering, scalability strategies
