DESCRIPTION
METHOD FOR IMPUTING CORRUPTED DATA BASED ON LOCALIZING ANOMALOUS PARTS
Field of the invention
The present invention relates to a method for imputing data that are locally corrupted by extreme noise, occlusion, etc.
Background of the invention
Anomalous data detection is a well-studied problem in machine learning [1]-[4]. The abnormality measure of a data instance is computed with respect to a set of data gathered under "normal" conditions, and it is used to raise an alarm when an abnormal situation is represented by the data instance. Classification and clustering with corrupted or incomplete data is also an important problem in machine learning, cf. [5]-[8] and the references therein. The solutions proposed in the corresponding studies are closely tied to the literature on inference with incomplete data [9] and generative models for classification [10]. In the disclosed method, we propose to utilize a well-known procedure called anomaly detection for localizing corrupted/incomplete portions in a data instance, and then to fill in the data using data imputation techniques, in order to improve classification or clustering performance in subsequent stages.
Generative models are used to randomly generate observable data, using the learnt probabilistic structure encoded in their hidden variables. In contrast to discriminative models, generative models specify a joint probability distribution over the observed data and the corresponding class labels. Mixture models are perhaps the most widely used generative tools, and Expectation Maximization [10], [11] has become the standard technique for estimating the corresponding statistical parameters, i.e., the parameters of a mixture of subpopulations in the training data. Given the parameters of the subpopulation distributions, new data can be generated through sampling methods.
Classification under incomplete data conditions is a well-studied problem [6]-[8], [12]-[13]. Imputation is commonly used as a pre-processing tool before the standard classification algorithms are applied [14]. The Mixture of Factor Analyzers approach [15] assumes multiple clusters in the data, estimates the statistical parameters of these clusters and uses them for filling in the missing feature dimensions. Thus, in the imputation stage, the missing feature values are filled in with values sampled from a pre-computed distribution. Here, multiple imputation assumes that the data come from a mixture of distributions and is capable of capturing variation in the data. Pseudo-likelihood [16] and dependency network [17] approaches solve the data completion problem by learning conditional distributions that predict a data component using the rest of the components.
Data imputation and completion is essential for handling corrupted images, and this problem is addressed in image denoising studies, in which a corrupted image is restored by explicitly modeling image statistics [18]-[21] or by using neural networks [22]-[24]. Detection of occluded visual objects is investigated in a number of studies, using disparity cues [25], geometric form [26], segmentation results [27], information from a fusion of static and motion cues [28], or data reconstruction error [29]. Trivial template responses are used in [29], in which pixels left unexplained after sparse data reconstruction are considered occluded.
A method for constructing predictive models that can estimate missing values in a data structure is proposed in patent [30]. Patent [35] discloses a method for imputing unknown data components in a data structure using statistical methods. A tensor factorization method is used in patent [36] to perform multiple imputation in retail sales data. A neural network method is disclosed in patent [37] to denoise images that have been compressed and decompressed.
Objects of the invention
The object of the invention is a method that imputes locally corrupted data instances by first partitioning the attribute space in a binary tree structure, then localizing the anomalous attribute subspace using anomaly detection procedures, and finally locally filling in with uncorrupted data using imputation techniques. None of the prior art utilizes a combination of these techniques for the localization and imputation of corrupted data parts.
Detailed description of the invention
In contrast to classical imputation methods, the disclosed method assumes neither the existence nor the knowledge of the location of the corrupted portions in a given data instance. Namely, whether a corruption exists and, in case of a corruption, which attributes are affected are both assumed unknown. The disclosed method first localizes occluded (in images) or corrupted portions of the data, treats these portions as incomplete, and replaces them with a statistically meaningful sample using imputation techniques. Therefore, corrupted data detection and localization are used in tandem with incomplete data filling-in.
A method for anomaly-detection-based data imputation that fulfills the objects of the present invention is illustrated in the attached figures, where:
Figure 1 is a schematic of the flowchart of the method.
Figure 2 is a schematic of the flowchart of step 105 in Figure 1.
Figure 3 is a schematic of the flowchart of step 107 in Figure 1.
Figure 4 is an illustration of the partitioning of the set of attributes in step 200 in Figure 1.
The method in our invention takes, as an input, a "corrupted" data set $X = \{x_i\}_{i=1}^{T}$, $x_i \in \mathbb{R}^d$, where $d$ is the dimensionality and each dimension accounts for an attribute value. Here, corruption in an instance $x \in X$ is defined as the event that a subset of the attributes is affected by a large structured or unstructured noise, such as a scratch on the CD where the instance is stored or an occlusion in the case of a visual object. For example, a corruption affects an instance $x = (x_1, x_2, \ldots, x_d)$ in a subset of consecutive attributes as $x' = (x_1, x_2, \ldots, x_{j+1} = y_{j+1}, \ldots, x_{j+z} = y_{j+z}, \ldots, x_d)$, where in total $z$ consecutive attributes are overwritten with random numbers $\{y_i\}_{i=j+1}^{j+z}$ and the remaining attributes are intact. We emphasize that the number of corrupted attributes and their locations are completely unknown. In this regard, our method deals with the corrupted data set $X$ and aims at imputing, or filling in, this data set in order to reduce the adverse effects of the corruption on tasks such as classification and clustering. We also assume that the number of corrupted data points is not more than an $\alpha$ fraction of the set, i.e., the set of corrupted instances has a size less than $T\alpha$.
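For illustration, the corruption event described above can be simulated as in the following minimal Python sketch (assuming NumPy); the function name corrupt_instance and its parameters are hypothetical, introduced here only to make the model concrete:

```python
# Minimal sketch of the corruption model: z consecutive attributes of an
# instance are overwritten with random numbers; the rest remain intact.
import numpy as np

def corrupt_instance(x, start, z, rng=None):
    """Overwrite z consecutive attributes of x beginning at index `start`."""
    rng = rng if rng is not None else np.random.default_rng()
    y = x.copy()
    # The corrupted attributes are replaced by random values drawn from
    # the value range of the instance (an illustrative choice of noise).
    y[start:start + z] = rng.uniform(x.min(), x.max(), size=z)
    return y

# Example: corrupt z = 5 consecutive attributes of a d = 10 instance.
x = np.arange(10, dtype=float)
x_corrupted = corrupt_instance(x, start=3, z=5)
```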
For a given data set $X$ and an instance $x \in X$, our method (1) decides, through several statistical checks against the data set $X$, whether $x$ is corrupted; if so, (2) finds which attributes are corrupted; and finally (3) performs imputation on $x$ by overwriting those locations, i.e., the corrupted attribute values, with the mean of the corresponding attribute values derived from $X$.
In the following, we provide a summary of the method in our invention.
To achieve the goals listed above, the set of attributes $A = \{1, 2, \ldots, d\}$ is partitioned into disjoint subsets, which are organized through a corresponding binary tree of depth K as shown in Figure 4. Starting from the root, the attributes are split such that half of them are registered in the left child, and the other half in the right child. At a node n, we denote the corresponding set of attributes as the n-attributes. Note that when there is an odd number of attributes at a node n, the first half rounded down, i.e., the floor of the half, of the attributes are registered on the left, and the rest are registered on the right (or vice versa). Recursively, all the attributes are distributed over the corresponding nodes. For example (Figure 4), let us choose $d = 80$ and partition A into 8 subsets (10 attributes each), which are all assigned to the leaf nodes of the tree. Note that in this case the binary tree is of depth 3. Suppose an instance x is corrupted on its attributes spread over the subsets 6, 7, 8, where 7 and 8 are fully affected and 6 is partially affected. The binary tree that our method constructs in this case for a given instance, and the anomalous nodes that contain corrupted subsets, are shown in Figure 4.
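This partitioning can be realized with a short recursive routine. The following is a minimal sketch; the Node class and the build_tree helper are illustrative names assumed for this sketch, not part of the disclosure, and the later sketches in this section reuse these definitions.

```python
# Minimal sketch of the attribute partitioning: attribute indices are split
# recursively into a binary tree of depth K, the first floor(half) of the
# attributes going to the left child at each split.

class Node:
    def __init__(self, attributes, parent=None):
        self.attributes = attributes  # the n-attributes of this node
        self.parent = parent          # pn, the parent node (None at the root)
        self.left = None
        self.right = None
        self.label = None             # "anomalous"/"normal", set per instance

def build_tree(attributes, depth, parent=None):
    node = Node(attributes, parent)
    if depth > 0 and len(attributes) > 1:
        half = len(attributes) // 2   # floor of the half goes to the left
        node.left = build_tree(attributes[:half], depth - 1, node)
        node.right = build_tree(attributes[half:], depth - 1, node)
    return node

# Running example from the text: d = 80 and depth K = 3 yield 8 leaf nodes
# with 10 attributes each.
root = build_tree(list(range(80)), depth=3)
```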
In the course of our algorithm, every instance is checked for a possible corruption at every node of this tree. At a node n, both the instance and the data set are considered to be restricted to the corresponding n-attributes. Namely, suppose $x = (x_1, x_2, \ldots, x_d)$ and the n-attributes are the attributes $\{j, j+1, \ldots, j+n-1\} \subset \{1, 2, \ldots, d\}$. Then, the restricted instance $x'$ is simply $x' = (x_j, x_{j+1}, \ldots, x_{j+n-1})$. Similarly, the data set $X$ is also considered as restricted with respect to the n-attributes at node n. Note that this restriction differs for every node. Based on the pair $(x', X')$ at a node n, the restricted instance $x'$ is checked for a possible corruption with respect to the statistics derived from $X'$ via an "anomaly detection" algorithm. If $x'$ is found to be anomalous, then we label the node n for the instance x as "anomalous". After labeling all the nodes for a given instance x in this fashion, we decide whether x is corrupted based on these labels. We emphasize that there are a number of algorithms for anomaly detection, or outlier detection, in the machine learning literature, cf. [1]-[4] and the references therein. In the preferred embodiment, we employ the algorithm proposed in [4]. In our running example shown in Figure 4, note that the anomalous nodes are colored red, and the normal ones are colored blue.
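The per-node check can be sketched as follows. The preferred embodiment uses the nearest-neighbor-graph score functions of [4]; the simple k-nearest-neighbor distance score below is only a stand-in chosen for brevity, and the function names and the quantile threshold rule are assumptions of this sketch:

```python
# Minimal stand-in for the per-node anomaly check: the restricted instance
# x' is declared anomalous when its k-NN distance score exceeds the
# (1 - alpha) quantile of the leave-one-out scores of the restricted set X'.
import numpy as np

def knn_score(v, data, k=5):
    """Distance from v to (roughly) its k-th nearest neighbor in `data`."""
    dists = np.sort(np.linalg.norm(data - v, axis=1))
    return dists[min(k, len(dists) - 1)]

def is_anomalous(x_restricted, X_restricted, k=5, alpha=0.05):
    score = knn_score(x_restricted, X_restricted, k)
    # Leave-one-out reference scores of the "normal" restricted data.
    ref = np.array([knn_score(v, np.delete(X_restricted, i, axis=0), k)
                    for i, v in enumerate(X_restricted)])
    return score > np.quantile(ref, 1.0 - alpha)
```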
For the localization of a possible corruption in the attributes of a given instance x, we search the labeled (anomalous vs. normal) nodes in the binary tree corresponding to the instance x in a breadth-first-search manner. In this search, a node is labeled as corrupted if the sub-tree rooted at this node has at least p percent anomalous nodes. Here, p is a user-specified threshold. Once a node is labeled as corrupted, the corresponding sub-tree is pruned and the search continues. If a corruption is not detected at a node, then it is labeled as uncorrupted, and the search continues. Note that in Figure 4, the starred node is found to be corrupted, which successfully corresponds to the set of corrupted attributes.
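A minimal sketch of this breadth-first localization follows, assuming the Node objects from the earlier tree sketch, each carrying a per-instance label in {"anomalous", "normal"}:

```python
# Minimal sketch of corruption localization: a node is marked "corrupted"
# when at least p percent of the nodes in its sub-tree are anomalous, and
# its sub-tree is then pruned from the breadth-first search.
from collections import deque

def subtree_anomaly_rate(node):
    """Percentage of anomalous nodes in the sub-tree rooted at `node`."""
    nodes, stack = [], [node]
    while stack:
        n = stack.pop()
        nodes.append(n)
        stack.extend(c for c in (n.left, n.right) if c is not None)
    anomalous = sum(1 for n in nodes if n.label == "anomalous")
    return 100.0 * anomalous / len(nodes)

def localize_corruption(root, p):
    corrupted = []
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if subtree_anomaly_rate(node) >= p:
            corrupted.append(node)  # prune: do not descend into this sub-tree
        else:
            queue.extend(c for c in (node.left, node.right) if c is not None)
    return corrupted
```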
If there exist corrupted nodes in the tree corresponding to an instance x, then the instance x is considered corrupted, and it is imputed, or filled in. To achieve this, we first locate the corrupted nodes in the corresponding tree. For imputing, i.e., filling in, each of these corrupted nodes, we use the parent node statistics extracted from the data set. In general, any imputation approach can be used to impute the corrupted data; however, in the preferred implementation, k-nearest neighbor search is utilized. Let us suppose that a node n is detected as corrupted. Then we find the k nearest neighbors of the instance x by using the pn-attributes only, where pn is the parent node and k is a user-specified parameter. Thus, we are utilizing the maximum likelihood estimate of the subset of attributes corresponding to the parent node of a corrupted node. We simply calculate the mean of the attributes in this neighborhood, copy x as $y \leftarrow x$, and overwrite the n-attributes of the instance y using the attribute values in the calculated mean. We emphasize that this is performed for all instances, and for all corrupted nodes in the corresponding trees. If the corrupted node n is the root node, we do not take any action, i.e., no imputation is applied.
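A minimal sketch of this preferred imputation step, again reusing the Node objects of the earlier sketches (the attributes and parent fields are assumptions of those sketches):

```python
# Minimal sketch of the k-NN imputation: neighbors of x are found using the
# pn-attributes of the parent node only, and the mean of these neighbors
# overwrites the n-attributes of the returned copy.
import numpy as np

def impute_node(x, X, node, k):
    pn = node.parent
    if pn is None:
        return x  # a corrupted root node: no imputation is applied
    # Distances restricted to the pn-attributes of the parent node.
    dists = np.linalg.norm(X[:, pn.attributes] - x[pn.attributes], axis=1)
    neighbors = X[np.argsort(dists)[:k]]
    mx = neighbors.mean(axis=0)  # only its n-attributes are used below
    y = x.copy()
    y[node.attributes] = mx[node.attributes]
    return y
```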
We point out that the method summarized above is applied to every data instance x in the data set X. In an iterative fashion, the set of the imputed instances $Y = \{y_i\}_{i=1}^{T}$ is fed into the method again, and the data are further imputed, until a pre-specified number of iterations is reached. The algorithm returns the final imputed set as its output.
In the following, we describe our method step by step as depicted in Figure 1 (a compact code sketch of this loop is given after the steps).
Input:
• Data set: $X = \{x_i\}_{i=1}^{T}$, where $x_i \in \mathbb{R}^d$ and $d$ is the dimensionality;
• Depth: K;
• Neighborhood size: k;
• Corruption detection parameter: p;
• Max # of iterations: mc;
• iters = -1;
• Processed data are stored in Y.
1) Step 200: A binary tree of depth K is generated, an example of which is illustrated in Figure 4. Set $i = 1$, iters ← iters + 1, and Y = { }.
2) Step 112:
a. If iters = mc, then execute Step 113 in Figure 1 (return the data set X).
b. Otherwise, go to 3 (Step 101 in Figure 1).
3) Step 101: The instance $x_i$ is fetched from the data set X.
4) A breadth-first traversal is started on the binary tree. At a node n during this traversal:
a. Step 102: The instance $x_i$ is restricted to the n-attributes; the restriction is denoted $x_i'$.
b. Step 103: The data set $X \setminus \{x_i\}$ is restricted to the n-attributes; the restriction is denoted $X'$.
c. Step 104: Assign this node n a label "anomalous" or "normal", preferably using the algorithm in [4]. (Note that any algorithm in the family of anomaly detection algorithms is appropriate, cf. [1]-[4] and the references therein; however, the method in [4] is used in the preferred implementation.)
5) Step 105: For every node in the tree corresponding to $x_i$, a label "corrupted" or "uncorrupted" is assigned using the following substeps (Figure 2):
i. Step 301: A breadth-first search is started on the binary tree.
ii. Step 302: At a node n during this search, compute the rate (percentage), say L, of the anomalous nodes in the sub-tree rooted at this node.
iii. Step 303:
a. If L < p, then the node n is labeled "uncorrupted", and the breadth-first search continues.
b. Step 304: If L >= p, then the node n is labeled "corrupted". Step 305: the sub-tree rooted at this node n is pruned. Then the breadth-first search continues.
6) Step 106: The instance $x_i$ is checked as to whether it is corrupted or not. If there exists at least one corrupted node in the corresponding tree, then the instance $x_i$ is corrupted and needs to be imputed, i.e., filled in.
a. Step 107: If the instance $x_i$ is corrupted, then estimate an uncorrupted version of the instance using data imputation techniques and fill in the corrupted attributes accordingly. In the preferred implementation of our invention, we use the following procedure (i, ii, iii, iv) for imputation (Figure 3):
i. Locate the nodes in the corresponding binary tree that are labeled "corrupted". Step 401: For each of these nodes, say n, locate the parent node pn.
ii. Step 402: Using the pn-attributes only, find the k nearest neighbors of $x_i$ in the data set X.
iii. Step 403: Find the mean of this neighborhood, say mx. Note that mx has only pn-attributes.
iv. Step 404: Copy $x_i$ as $y_i \leftarrow x_i$. Replace the corrupted n-attributes of the instance $y_i$ with the corresponding values of the n-attributes in the mean vector mx.
b. If the instance $x_i$ is not corrupted, then copy $x_i$ as $y_i \leftarrow x_i$.
7) Step 109: Save $y_i$ into $Y = Y \cup \{y_i\}$. Then reset the tree corresponding to the instance $x_i$.
8) Step 108: Increment i: $i \leftarrow i + 1$.
9) Step 110: i is checked as to whether it has reached $T + 1$:
a. If $i < T + 1$, then go to 3 (Step 101).
b. If $i = T + 1$, then
i. Step 111: Set $X \leftarrow Y$, i.e., the filled-in collection is chosen as the data set for the next iteration.
ii. Go to 1 (Step 200).
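For concreteness, the iteration of Figure 1 can be tied together as in the following compact sketch, which assumes the illustrative helpers from the earlier sketches (build_tree, is_anomalous, localize_corruption, impute_node); it is an illustration of the flow above, not a definitive implementation:

```python
# Minimal end-to-end sketch of the loop in Figure 1.
import numpy as np

def label_tree(root, x, X):
    """Label every node "anomalous"/"normal" for the instance x (Step 104)."""
    stack = [root]
    while stack:
        n = stack.pop()
        n.label = ("anomalous"
                   if is_anomalous(x[n.attributes], X[:, n.attributes])
                   else "normal")
        stack.extend(c for c in (n.left, n.right) if c is not None)

def impute_data_set(X, K, k, p, mc):
    d = X.shape[1]
    for _ in range(mc):                       # Step 112: at most mc iterations
        Y = []
        for i, x in enumerate(X):             # Steps 101-109, per instance
            root = build_tree(list(range(d)), K)
            X_rest = np.delete(X, i, axis=0)  # Step 103: the set X \ {x_i}
            label_tree(root, x, X_rest)
            y = x.copy()
            for node in localize_corruption(root, p):  # Steps 105-106
                y = impute_node(y, X_rest, node, k)    # Step 107
            Y.append(y)                       # Step 109
        X = np.asarray(Y)                     # Step 111: X <- Y
    return X                                  # Step 113: the imputed set
```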
References:
1. B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and R. C. Williamson. Estimating the Support of a High-Dimensional Distribution. Neural Computation, vol. 13, no. 7, pp. 1443-1471, July 2001.
2. M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes: Theory and Application. Prentice-Hall, New Jersey, 1993.
3. A. O. Hero. Geometric entropy minimization (GEM) for anomaly detection and localization. Neural Information Processing Systems Conference, vol. 19, 2006.
4. M. Zhao and V. Saligrama. Anomaly detection with score functions based on nearest neighbor graphs. Neural Information Processing Systems Conference, vol. 22, 2009.
5. Salakhutdinov, R. and Hinton, G. E. (2012). An efficient learning procedure for deep Boltzmann machines. Neural Computation, 24, 1967-2006.
6. Chechik, G., Heitz, G., Elidan, G., Abbeel, P. and Koller, D. (2007). Max-Margin Classification of Data with Absent Features. Journal of Machine Learning Research, 9, 1-21.
7. Marlin, B. (2008). Missing Data Problems in Machine Learning. PhD thesis, University of Toronto.
8. Marlin, B. M., Zemel, R. S., Roweis, S. T. and Slaney, M. (2011). Recommender Systems: Missing Data and Statistical Model Estimation. Proceedings of the International Joint Conference on Artificial Intelligence.
9. Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
10. Smolensky, P. (1986). Information Processing in Dynamical Systems: Foundations of Harmony Theory. In Rumelhart, D. E. and McClelland, J. L. (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press, pp. 194-281.
11. McLachlan, G. J. (1988). Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker.
12. Wang, C., Liao, X., Carin, L., Dunson, D. B. (2010). Classification with Incomplete Data Using Dirichlet Process Priors. Journal of Machine Learning Research, 11, 3269-3311.
13. Schafer, J., & Graham, J. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.
14. Little, R. J. A., Rubin, D. B. (1987). Statistical Analysis with Missing Data. New York: John Wiley and Sons, Inc.
15. Ghahramani, Z. and Jordan, M. I. (1994). Supervised learning from incomplete data via an EM approach. Advances in Neural Information Processing Systems, 6, 120-127.
16. Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179-195.
17. Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R. and Kadie, C. (2000). Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1, 49-75.
18. S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984.
19. Y. Weiss and W. T. Freeman. What makes a good model of natural images? CVPR 2007.
20. S. Lyu and E. P. Simoncelli. Statistical modeling of images with fields of Gaussian scale mixtures. NIPS 2006.
21. S. C. Zhu, Y. Wu, and D. Mumford. Filters, Random Fields and Maximum Entropy (FRAME): Towards a Unified Theory for Texture Modeling. International Journal of Computer Vision, 1998.
22. LeCun, Y. (1987). Modèles connexionnistes de l'apprentissage. Doctoral dissertation, Université de Paris VI.
23. Gallinari, P., LeCun, Y., Thiria, S., & Fogelman-Soulié, F. (1987). Mémoires associatives distribuées. Proceedings of COGNITIVA 87, Paris, La Villette.
24. Jain, V. and Seung, S. H. (2008). Natural image denoising with convolutional networks. In Koller, D., Schuurmans, D., Bengio, Y., and Bottou, L. (eds.), Advances in Neural Information Processing Systems 21 (NIPS'08).
25. Toh, P.-S. and Forrest, A. K. (1990). Occlusion detection in early vision. In Proceedings of the 3rd International Conference on Computer Vision, Osaka, Japan, pp. 126-132.
26. N. H. C. Yung and A. H. S. Lai, "Detection of vehicle occlusion using a generalized deformable model," in Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '98), vol. 4, pp. 154-157, Monterey, Calif., USA, May-June 1998.
27. C. Pang, W. Lam, and N. Yung, "A novel method for resolving vehicle occlusion in a monocular traffic-image sequence," IEEE Transactions on Intelligent Transportation Systems, vol. 5(3), pp. 129-141, 2004.
28. W. Zhang, Q. Wu, X. Yang, and X. Fang, "Multilevel Framework to Detect and Handle Vehicle Occlusion," IEEE Transactions on Intelligent Transportation Systems, vol. 9, pp. 161-174, March 2008.
29. X. Mei, H. Ling, Y. Wu, E. Blasch, and L. Bai, "Minimum Error Bounded Efficient L1 Tracker with Occlusion Detection," Proc. of the Computer Vision and Pattern Recognition (CVPR), 2011.
30. Pednault, E. P. D. (2004). Mechanism for constructing predictive models that allow inputs to have missing values. US Patent No. US 6,810,368 B1.
35. Hsu, C. H., Shie, B. E., Su, J. H., and Tseng, S. M. (2012). System and method for imputing missing values and computer program product thereof. US Patent No. US 20120136896 A1.
36. Natarajan, R., Banerjee, A., and Shan, H. (2013). Multiple imputation of missing data in multi-dimensional retail sales data sets via tensor factorization. US Patent No. US 20130036082 A1.
37. Fischer, M. and Staelin, C. (2008). Image artifact reduction using a neural network. US Patent No. US 7,346,208 B2.