DATA EDITING
One of the most important steps in any survey that collects large amounts of data is (automatic) editing. The starting point of any editing application is a set of edits defined according to a check plan, i.e. a list of error sources, by a group of subject-matter specialists. A valid approach to deriving edits automatically from clean datasets is based on validation rules (i.e. conditions that the data must satisfy), which can be interpreted as conditional probability statements. In other words, a rule involving certain variables is seen as a statement about the most probable values of some of them given the others. The approach proceeds by specifying the domains of the variables involved in a given rule and then estimating the conditional probabilities on this probability space. In this way, a generic validation treatment is created that does not require rules to be formally defined.

We are currently working on automatic data validation under the INSPECTOR IST project. In particular, we are studying the possibility of using segmentation via tree-based models as a practical tool for estimating the conditional probabilities. The data are partitioned in an optimal way according to the values of the explanatory variables, and the result is a tree. Each node corresponds to certain values of the explanatory variables and contains the cases with a certain distribution of the response variable. The conditional distribution at each node constitutes a validation rule.
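The idea can be illustrated with a minimal sketch. Instead of growing an optimal tree, the toy code below simply creates one node per combination of explanatory-variable values, estimates the conditional distribution of the response variable in each node, and flags records whose response value is improbable given the others. All function names, the toy variables ("age", "marital"), and the probability threshold are illustrative assumptions, not part of the INSPECTOR system.

```python
from collections import Counter, defaultdict

def conditional_distributions(records, explanatory, response):
    """Partition the clean data into nodes, one per combination of
    explanatory-variable values (a degenerate 'tree'), and estimate
    the conditional distribution of the response in each node."""
    nodes = defaultdict(list)
    for rec in records:
        key = tuple(rec[v] for v in explanatory)
        nodes[key].append(rec[response])
    dists = {}
    for key, values in nodes.items():
        counts = Counter(values)
        total = sum(counts.values())
        dists[key] = {v: c / total for v, c in counts.items()}
    return dists

def flag_suspicious(records, dists, explanatory, response, threshold=0.05):
    """Apply the derived validation rules: flag records whose response
    value has estimated conditional probability below the threshold."""
    flagged = []
    for rec in records:
        key = tuple(rec[v] for v in explanatory)
        p = dists.get(key, {}).get(rec[response], 0.0)
        if p < threshold:
            flagged.append(rec)
    return flagged
```

For example, if the clean data contain only single children, a record with age "child" and marital status "married" falls in a node where "married" has estimated probability zero, so it is flagged without any formally defined edit. A real tree-based model would instead merge and split nodes optimally, pooling sparse combinations to stabilise the probability estimates.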