Integrating K-means clustering with a relational DBMS using SQL

Bibliographic Details
Published in:	IEEE transactions on knowledge and data engineering 2006-02, Vol.18 (2), p.188-201
Main Author:	Ordonez, C.
Format:	Article
Language:	English
Subjects:	Algorithms; Applied sciences; C++ (programming language); Cluster analysis; Clustering; Clustering algorithms; Computer languages; Computer science control theory systems; Convergence; Data base management systems; Data mining; Exact sciences and technology; Exports; Index Terms- Clustering; Indexing; Information systems. Data bases; International trade; K-means; Memory organisation. Data processing; Partitioning algorithms; Programming profession; Query languages; Relational databases; relational DBMS; Scalability; Software; SQL; Statistics

Description
Summary:	Integrating data mining algorithms with a relational DBMS is an important problem for database programmers. We introduce three SQL implementations of the popular K-means clustering algorithm to integrate it with a relational DBMS: 1) a straightforward translation of K-means computations into SQL, 2) an optimized version based on improved data organization, efficient indexing, sufficient statistics, and rewritten queries, and 3) an incremental version that uses the optimized version as a building block with fast convergence and automated reseeding. We experimentally show the proposed K-means implementations work correctly and can cluster large data sets. We identify which K-means computations are more critical for performance. The optimized and incremental K-means implementations exhibit linear scalability. We compare K-means implementations in SQL and C++ with respect to speed and scalability and we also study the time to export data sets outside of the DBMS. Experiments show that SQL overhead is significant for small data sets, but relatively low for large data sets, whereas export times become a bottleneck for C++.
ISSN:	1041-4347, 1558-2191
DOI:	10.1109/TKDE.2006.31