MODELLING APPROACHES TO DATA MINING

In the framework of data mining, defined as a set of methods for knowledge discovery in very large databases, an interesting research problem is how to define appropriate strategies for using statistical models for data mining purposes. In this context, a partitioning criterion has been introduced that identifies homogeneous and disjoint subgroups of the original data matrix, reducing the dimensionality problem that arises with big datasets. Within each of these subgroups, it has been possible to express the link between a response variable and a set of predictors through a statistical model. The proposed partitioning criterion allows semiparametric models to be used for data mining.
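As a rough illustration of the partition-then-model idea, the sketch below splits a toy dataset into two disjoint subgroups and fits a separate model in each. It is not the criterion proposed here: the split rule (a single threshold on one covariate, chosen to minimise the total residual sum of squares), the use of simple least-squares lines in place of semiparametric models, and all function names are assumptions made for the example.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        return my, 0.0  # constant covariate: fall back to the mean
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return my - b * mx, b


def residual_ss(xs, ys):
    """Residual sum of squares of the least-squares line."""
    a, b = fit_line(xs, ys)
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def best_split(xs, ys, min_size=3):
    """Hypothetical partitioning rule: the threshold on x that minimises
    the summed residual sum of squares of the two per-subgroup lines."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(min_size, len(pairs) - min_size + 1):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between tied covariate values
        total = residual_ss(*zip(*pairs[:i])) + residual_ss(*zip(*pairs[i:]))
        if best is None or total < best[0]:
            best = (total, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return None if best is None else best[1]


# Toy data: the response follows a different linear law in each subgroup.
x = list(range(10))
y = [1 + 2 * v if v < 5 else 30 - 3 * v for v in x]

threshold = best_split(x, y)
left = [(a, b) for a, b in zip(x, y) if a <= threshold]
right = [(a, b) for a, b in zip(x, y) if a > threshold]
left_model = fit_line(*zip(*left))    # (intercept, slope) in subgroup 1
right_model = fit_line(*zip(*right))  # (intercept, slope) in subgroup 2
print(threshold, left_model, right_model)
```

Each subgroup gets its own small model, so no single global model has to describe the whole data matrix.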
Moreover, an integrated strategy for data mining has also been defined. It is based on a sequential, automatic, data-driven procedure that combines production rules derived from recursive partitioning algorithms with estimates derived from combinations of mixtures of classifiers, in decision rules resulting from a model fusion criterion. This strategy proved useful when dealing with a large number of observations or covariates, and it can be considered a general methodological framework for supervised classification and prediction problems.
Finally, the problem of dimensionality reduction in the general context of supervised statistical learning has also been considered. In the framework of recursive partitioning, we proposed a methodology aimed at improving tree-based methods as a prediction tool by introducing an alternative approach to data partitioning designed to handle large numbers of (possibly correlated) covariates. The key idea is to use suitable combinations of covariates that are identified recursively.
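To make the idea of splitting on combinations of covariates concrete, the sketch below compares a single tree node split on one covariate with a split on a linear combination of two covariates. It is not the methodology proposed here: the weights (the difference between the two class means), the Gini impurity criterion, and all names are assumptions made for the example.

```python
from statistics import mean


def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_threshold(values, labels):
    """Return (weighted Gini impurity, threshold) of the best binary split."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = (gini(labels), None)  # baseline: no split at all
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between tied values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        impurity = (len(left) * gini(left) + len(right) * gini(right)) / n
        if impurity < best[0]:
            best = (impurity, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best


def mean_difference_weights(rows, labels):
    """Hypothetical combination rule: class-1 mean minus class-0 mean."""
    weights = []
    for j in range(len(rows[0])):
        m1 = mean(r[j] for r, lab in zip(rows, labels) if lab == 1)
        m0 = mean(r[j] for r, lab in zip(rows, labels) if lab == 0)
        weights.append(m1 - m0)
    return weights


# Toy data: the class depends on x1 + x2, so no single covariate separates it.
rows = [(x1, x2) for x1 in range(4) for x2 in range(4)]
labels = [1 if x1 + x2 >= 4 else 0 for x1, x2 in rows]

single_impurity, _ = best_threshold([r[0] for r in rows], labels)

weights = mean_difference_weights(rows, labels)
combined = [sum(w * v for w, v in zip(weights, r)) for r in rows]
combined_impurity, _ = best_threshold(combined, labels)

print(single_impurity, combined_impurity)
```

On this toy grid the combined covariate separates the two classes with one split, which an axis-aligned split on x1 alone cannot do. In a full tree, the combination would be recomputed at each node from the observations that reach it.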