Data mining emerged to cope with the challenges that traditional data analysis techniques faced when dealing with large amounts of data. Moreover, such data often exhibit peculiarities (e.g., missing values, noise). More specifically, data mining is the main step in the process of Knowledge Discovery in Databases, or the KDD process. Knowledge Discovery in Databases is defined as the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad et al., 1996). However, the term “data mining” is very often used to describe the whole KDD process. Although the core of the process is the data mining step, where a data mining algorithm is applied to extract patterns from the data, the pre-processing and post-processing phases are also very important and contribute significantly to the quality of the extracted knowledge. The pre-processing phase usually includes the selection of an appropriate portion of the data, the cleansing of the selected data, and the transformation of the data into more appropriate representations. The post-processing phase deals with the management of the produced patterns and models and focuses on the evaluation and interpretation of the data mining results. In practice, data mining has the following two “high-level” primary goals:
• Prediction: Involves the use of some fields (variables) in a database to predict unknown or future values of other variables of interest.
• Description: Focuses on finding human-interpretable patterns that describe the data.
Prediction and description are not equally important in every data mining application. In the context of Knowledge Discovery in Databases, description tends to be more important than prediction. In contrast, machine learning and pattern recognition applications usually favor prediction as the primary goal. Prediction and description are achieved through various data mining tasks. Depending on the nature of the data and the desired knowledge, a large variety of algorithms is available for each task.
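To make the distinction between the phases and the two goals concrete, the following is a minimal sketch in Python using only the standard library. The toy records, the field names (age, income, segment), and the helper functions are hypothetical and chosen purely for illustration. The least-squares line and the per-group means stand in for whatever prediction and description algorithms a real application would use.

```python
from statistics import mean

# Hypothetical toy data; None marks a missing value.
records = [
    {"age": 25, "income": 30000, "segment": "A", "id": 1},
    {"age": 40, "income": None,  "segment": "B", "id": 2},
    {"age": 35, "income": 52000, "segment": "B", "id": 3},
    {"age": 50, "income": 61000, "segment": "B", "id": 4},
    {"age": 23, "income": 28000, "segment": "A", "id": 5},
]

def preprocess(rows):
    """Pre-processing: selection, cleansing, and transformation."""
    # Selection: keep only the fields of interest.
    selected = [{k: r[k] for k in ("age", "income", "segment")} for r in rows]
    # Cleansing: drop records with a missing income.
    cleaned = [r for r in selected if r["income"] is not None]
    # Transformation: express income in thousands.
    return [{**r, "income_k": r["income"] / 1000} for r in cleaned]

def describe(rows):
    """Description: a human-interpretable summary (mean income per segment)."""
    groups = {}
    for r in rows:
        groups.setdefault(r["segment"], []).append(r["income_k"])
    return {segment: mean(values) for segment, values in groups.items()}

def fit_line(rows):
    """Prediction: least-squares fit of income_k as a linear function of age."""
    xs = [r["age"] for r in rows]
    ys = [r["income_k"] for r in rows]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx  # (slope, intercept)

def predict(model, age):
    """Apply the fitted line to an unseen age value."""
    slope, intercept = model
    return slope * age + intercept

if __name__ == "__main__":
    data = preprocess(records)
    print("Description:", describe(data))
    print("Prediction for age 30:", round(predict(fit_line(data), 30), 2))
```

In this sketch, describe serves the description goal by producing a summary a person can read directly, whereas fit_line and predict serve the prediction goal by estimating an unknown value of one variable from another. Post-processing would then evaluate and interpret both outputs, for example by checking the fitted line against held-out records.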