MR-DIS: democratic instance selection for big data by MapReduce

Bibliographic Details
Published in:	Progress in artificial intelligence 2017-09, Vol.6 (3), p.211-219
Main Authors:	Arnaiz-González, Álvar; González-Rogel, Alejandro; Díez-Pastor, José-Francisco; López-Nozal, Carlos
Format:	Article
Language:	English
Subjects:	Algorithms; Artificial Intelligence; Big Data; Complexity; Computation; Computational Intelligence; Computer Imaging; Computer Science; Control; Data management; Data mining; Data Mining and Knowledge Discovery; Datasets; Mechatronics; Natural Language Processing (NLP); Parallel processing; Pattern Recognition and Graphics; Preprocessing; Regular Paper; Robotics; Vision

Description
Summary:	Instance selection is a popular preprocessing task in knowledge discovery and data mining. Its purpose is to reduce the size of data sets while maintaining their predictive capabilities. A common problem is that these methods often suffer from high computational complexity, which makes them impractical for processing huge data sets. In this paper, a parallel implementation of the instance selection algorithm Democratic Instance Selection (DIS) is presented. The main advantages of the DIS algorithm are its computational complexity, which is linear in the number of instances, and its internal structure, which is naturally parallelizable. The purpose of this paper is threefold: first, the design of the DIS algorithm following the MapReduce model; second, its implementation in the popular big data framework Spark; and finally, its empirical comparison over large-scale data sets. The results show that the processing time decreases linearly as the number of Spark executors increases, which makes it suitable for big data applications. In addition, the algorithm is publicly accessible to the scientific community.
ISSN:	2192-6352 2192-6360
DOI:	10.1007/s13748-017-0117-5
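
Illustrative sketch:	The record does not describe how DIS works internally, so the following is only a rough illustration of how a voting-based instance selector might be split into map and reduce steps. Everything in it is an assumption: the idea that several rounds each partition the data into disjoint subsets and collect removal votes, the random partitioning, the fixed vote threshold, the toy one-dimensional data, and all names (`toy_selector`, `map_round`, `democratic_selection`) are hypothetical. It uses plain Python, not Spark, and is not the authors' implementation.

```python
import random
from collections import Counter
from functools import reduce


def toy_selector(subset):
    """Hypothetical base selector (an assumption, not from the paper):
    propose removing an instance whose nearest neighbour inside the
    subset carries the same label, i.e. it looks redundant."""
    proposals = []
    for idx, x, label in subset:
        others = [(abs(x - ox), olabel)
                  for oidx, ox, olabel in subset if oidx != idx]
        if others and min(others)[1] == label:
            proposals.append(idx)
    return proposals


def map_round(data, subset_size, rng):
    """Map step: one voting round over a random disjoint partition.
    Each subset is processed independently, so the work per round
    grows linearly with the number of instances."""
    shuffled = data[:]
    rng.shuffle(shuffled)
    votes = Counter()
    for start in range(0, len(shuffled), subset_size):
        subset = shuffled[start:start + subset_size]
        votes.update(toy_selector(subset))
    return votes


def democratic_selection(data, rounds=5, subset_size=4, threshold=3, seed=0):
    """Run several rounds, sum the removal votes (reduce step), and keep
    the instances whose vote count stays below the threshold."""
    rng = random.Random(seed)
    per_round = [map_round(data, subset_size, rng) for _ in range(rounds)]
    total = reduce(lambda a, b: a + b, per_round, Counter())
    return [inst for inst in data if total[inst[0]] < threshold]


if __name__ == "__main__":
    # Toy data: (index, feature value, class label); purely illustrative.
    data = [(i, float(i), 0 if i < 10 else 1) for i in range(20)]
    kept = democratic_selection(data)
    print(f"kept {len(kept)} of {len(data)} instances")
```

In this sketch the rounds are independent of one another, which is the property that would let them be distributed across workers; the fixed `threshold` stands in for whatever criterion the actual algorithm uses to decide how many votes justify removal.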