WO2015004502A1 - Method for imputing corrupted data based on localizing anomalous parts - Google Patents

Method for imputing corrupted data based on localizing anomalous parts Download PDF

Info

Publication number
WO2015004502A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
corrupted
node
tree
attributes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Ceased
Application number
PCT/IB2013/055640
Other languages
French (fr)
Inventor
Huseyin Ozkan
Ozgur Yilmaz
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Aselsan Elektronik Sanayi ve Ticaret AS
Original Assignee
Aselsan Elektronik Sanayi ve Ticaret AS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Aselsan Elektronik Sanayi ve Ticaret AS filed Critical Aselsan Elektronik Sanayi ve Ticaret AS
Priority to PCT/IB2013/055640 priority Critical patent/WO2015004502A1/en
Priority to UAA201600281U priority patent/UA126396U/en
Publication of WO2015004502A1 publication Critical patent/WO2015004502A1/en
Priority to PH12016500066A priority patent/PH12016500066A1/en
Anticipated expiration legal-status Critical
Ceased legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00: Computing arrangements using knowledge-based models
    • G06N5/04: Inference or reasoning models
    • G06N5/045: Explanation of inference; Explainable artificial intelligence [XAI]; Interpretable artificial intelligence
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/554: Detecting local intrusion or implementing counter-measures involving event detection and direct action
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Hardware Design (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Debugging And Monitoring (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The method in this invention is a novel, anomaly detection based imputation technique designed for handling local corruptions in a given data set. For a given data instance, the method first localizes the corruption through several statistical checks via an appropriate anomaly detection algorithm from the machine learning literature. Then the corrupted attributes, which may be considered as "missing" once the corruption is localized, are imputed using the average statistics extracted from the data set. In machine learning applications such as data classification and clustering, data imputation is an important technique to improve the performance of the algorithms, i.e., the empirical error rates in the case of a binary classification, when a fraction of the data is corrupted under severe noise conditions. For instance, this corruption might be due to a scratch on the compact disc where the data are stored, or the occlusion of a visual object in computer vision tasks. In this regard, data imputation techniques and our invention aim at replacing the corrupted or missing parts with statistically meaningful substituted values such that the corrupted parts, after imputation, become statistically consistent with the intact parts.

Description

DESCRIPTION
METHOD FOR IMPUTING CORRUPTED DATA BASED ON LOCALIZING ANOMALOUS PARTS
Field of the invention
The present invention relates to a method for imputing data that are locally corrupted by extreme noise, occlusion, etc.
Background of the invention
Anomalous data detection is a well-studied problem in machine learning [1]-[4]. The abnormality measure of a data instance is computed with respect to a set of data gathered under "normal" conditions, and it is used to set an alarm when an abnormal situation is represented by the data instance. Classification and clustering with corrupted or incomplete data is also an important problem in machine learning, cf. [5]-[8] and the references therein. The proposed solutions in the corresponding studies are closely tied to the literature on inference with incomplete data [9] and generative models for classification [10]. In the disclosed method, we propose the utilization of a well-known procedure called anomaly detection for localizing corrupted/incomplete portions in a data instance, and then filling in the data using data imputation techniques, in order to improve classification or clustering performance in subsequent stages.
Generative models are used to randomly generate observable data, using the learnt probabilistic structure encoded in their hidden variables. In contrast to discriminative models, generative models specify a joint probability distribution over the observed data and the corresponding class labels. Mixture models are perhaps the most widely used generative tools, and Expectation Maximization [10], [11] has become the standard technique for estimating the corresponding statistical parameters, i.e., the parameters of a mixture of subpopulations in the training data. Given the parameters of the subpopulation distributions, new data can be generated through sampling methods.
Classification under incomplete data conditions is a well-studied problem [6]-[8], [12]-[13]. Imputation is commonly used as a pre-processing tool before the standard classification algorithms are applied [14]. The Mixture of Factor Analyzers [15] approach assumes multiple clusters in the data, estimates the statistical parameters of these clusters and uses them for filling in the missing feature dimensions. Thus, in the imputation stage, the missing feature values are filled in with values sampled from a pre-computed distribution. Here, multiple imputation assumes that the data come from a mixture of distributions, and is capable of capturing variation in the data. Pseudo-likelihood [16] and dependency network [17] approaches solve the data completion problem by learning conditional distributions that predict a data component using the rest of the components.
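For concreteness, the following is a minimal sketch (ours, not taken from the cited works) of imputation used as such a pre-processing step: missing entries, marked here as NaN, are replaced by per-attribute means before a standard classifier is applied. The function name mean_impute is illustrative.

```python
import numpy as np

def mean_impute(X: np.ndarray) -> np.ndarray:
    """Replace NaN entries with the per-attribute (column) mean of the observed values."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)   # mean over the non-missing entries of each column
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]     # fill each missing entry with its column mean
    return X
```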
Data imputation and completion is essential for handling corrupted images, and this problem is addressed in image denoising studies, in which a corrupted image is restored by explicitly modeling image statistics [18]-[21] or by using neural networks [22]-[24]. Detection of occluded visual objects is investigated in a number of studies, using disparity cues [25], geometric form [26], segmentation results [27], information from a fusion of static and motion cues [28], or data reconstruction error [29]. Trivial template responses are used in [29], in which pixels left unexplained after sparse data reconstruction are considered occluded.
A method for constructing predictive models that can estimate missing values in a data structure is proposed in patent [30]. Patent [35] discloses a method for imputing unknown data components in a data structure using statistical methods. A tensor factorization method is used in patent [36] to perform multiple imputation on retail sales data. A neural network method is disclosed in patent [37] to denoise images that are compressed and decompressed.
Objects of the invention
The object of the invention is a method that imputes a locally corrupted data instance by first partitioning the attribute space in a binary tree structure, then localizing the anomalous attribute subspace using anomaly detection procedures, and locally filling in with uncorrupted data using imputation techniques. None of the prior art utilizes a combination of these techniques for the localization and imputation of corrupted data parts.
Detailed description of the invention
In contrast to classical imputation methods, the disclosed method assumes neither the existence nor the knowledge of the location of the corrupted portions in a given data instance. Namely, whether a corruption exists and which attributes are affected in case of a corruption are both assumed unknown. The disclosed method first localizes occluded (in images) or corrupted portions of the data, treats these portions as incomplete and replaces them with a statistically meaningful sample using imputation techniques. Therefore, corrupted data detection and localization are used in tandem with incomplete data filling-in.
A method for anomaly detection based data imputation in order to fulfill the objects of the present invention is illustrated in the attached figures, where:
Figure 1 is a schematic of the flowchart of the method
Figure 2 is a schematic of the flowchart of the step 105 in Figure 1.
Figure 3 is a schematic of the flowchart of the step 107 in Figure 1.
Figure 4 is an illustration of the partitioning of the set of attributes in the step 200 in Figure 1.
The method in our invention takes, as an input, a "corrupted" data set X = {x_i}_{i=1}^T, x_i ∈ R^d, where d is the dimensionality and each dimension accounts for an attribute value. Here, corruption in an instance x ∈ X is defined as the event that a subset of the attributes is affected by a large structured or unstructured noise, such as a scratch on the CD where the instance is stored or an occlusion in the case of a visual object. As an example, a corruption may affect an instance x = (x_1, x_2, ..., x_d) on a subset of consecutive attributes as x' = (x_1, x_2, ..., x_{j+1} = y_{j+1}, ..., x_{j+z} = y_{j+z}, ..., x_d), where in total z consecutive attributes are overwritten with random numbers {y_i} and the remaining attributes are intact. We emphasize that the number of corrupted attributes and their locations are completely unknown. In this regard, our method deals with the corrupted data set X and targets imputing, or filling in, this data set in order to reduce the adversarial effects of the corruption on tasks such as classification, clustering, etc. We also assume that the number of corrupted data points is not more than an ε fraction of the set, i.e., the number of corrupted instances is less than Tε.
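As an illustrative sketch of this corruption model (the function name, the noise distribution and its parameters are our assumptions, not part of the disclosure), z consecutive attributes of an instance can be overwritten with random values while the rest are left intact:

```python
import numpy as np

def corrupt(x: np.ndarray, j: int, z: int, rng: np.random.Generator) -> np.ndarray:
    """Overwrite z consecutive attributes starting at index j with random numbers."""
    x_prime = x.copy()
    x_prime[j:j + z] = rng.normal(loc=10.0, scale=1.0, size=z)  # stands in for a large structured/unstructured noise
    return x_prime
```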
For a given data set X and an instance x ∈ X, our method (1) decides, through several statistical checks against the data set X, whether x is corrupted and, if so, (2) finds which attributes are corrupted, and finally (3) performs imputation on x by overwriting those locations, i.e., the corrupted attribute values, with the mean of the corresponding attribute values derived from X.
In the following, we provide a summary of the method in our invention.
To achieve the goals listed above, the set of attributes A = {1, 2, ..., d} is partitioned into disjoint subsets which are organized through a corresponding binary tree of depth K, as shown in Figure 4. Starting from the root, the attributes are split such that half of them are registered in the left child and the other half in the right child. At a node n, we denote the corresponding set of attributes as the n-attributes. Note that when there is an odd number of attributes at a node n, the first half rounded down, i.e., the floor of the half, of the attributes are registered on the left, and the rest are registered on the right (or vice versa). Recursively, all the attributes are distributed over the corresponding nodes. As an example (Figure 4), let us choose d = 80 and partition A into 8 subsets (10 attributes each), which are all assigned to the leaf nodes in the tree; note that in this case the binary tree should be of depth 3. Suppose an instance x is corrupted on its attributes spread over the subsets 6, 7 and 8, where 7 and 8 are fully affected and 6 is partially affected. The binary tree that our method constructs in this case for a given instance, and the anomalous nodes that contain corrupted subsets, are shown in Figure 4.
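A minimal sketch of this attribute-space partitioning is given below; the Node class and the function name build_attribute_tree are illustrative assumptions, and the split follows the floor-of-the-half rule described above.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Node:
    attributes: List[int]            # the n-attributes registered at this node
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    parent: Optional["Node"] = None
    label: Optional[str] = None      # later set to "anomalous" or "normal" per instance

def build_attribute_tree(attributes: List[int], depth: int) -> Node:
    """Recursively split the attribute set in half down to the given depth K."""
    node = Node(attributes=list(attributes))
    if depth > 0 and len(attributes) > 1:
        half = len(attributes) // 2                  # floor of the half goes to the left child
        node.left = build_attribute_tree(attributes[:half], depth - 1)
        node.right = build_attribute_tree(attributes[half:], depth - 1)
        node.left.parent = node
        node.right.parent = node
    return node

# Running example from the text: d = 80 attributes, depth K = 3, i.e., 8 leaves of 10 attributes each.
root = build_attribute_tree(list(range(80)), depth=3)
```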
In the course of our algorithm, every instance is checked for a possible corruption at every node of this tree. For example, at a node n, both the instance and the data set are considered to be restricted to the corresponding n-attributes. Namely, suppose x = (x_1, x_2, ..., x_d) and the n-attributes are the attributes {j, j+1, ..., j+n-1} ⊂ {1, 2, ..., d}. Then, the restricted instance is simply x' = (x_j, x_{j+1}, ..., x_{j+n-1}). Similarly, the data set X is also considered as restricted with respect to the n-attributes at node n. Note that this restriction differs for every node. Based on the pair (x', X') at a node n, the restricted instance x' is checked for a possible corruption with respect to the statistics derived from X' via an "anomaly detection" algorithm. If x' is found to be anomalous, then we label the node n for the instance x as "anomalous". After labeling all the nodes for a given instance x in this fashion, we decide if x is corrupted based on these labels. We emphasize that there are a number of algorithms for anomaly detection, or outlier detection, in the machine learning literature, cf. [1]-[4] and the references therein. In the preferred embodiment, we employ the algorithm proposed in [4]. In our running example shown in Figure 4, note that the anomalous nodes are colored red and the normal ones are colored blue. For the localization of a possible corruption in the attributes of a given instance x, we search the labeled (anomalous vs. normal) nodes in the binary tree corresponding to the instance x in a breadth-first-search manner. In this search, a node is labeled as corrupted if the sub-tree rooted at this node has at least p percent anomalous nodes. Here, p is a user-specified threshold. Once a node is labeled as corrupted, the corresponding sub-tree is pruned and the search continues. If a corruption is not detected at a node, then it is labeled as uncorrupted, and the search continues. Note that in Figure 4, the starred node is found to be corrupted, which successfully corresponds to the set of corrupted attributes.
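The per-node anomaly labeling and the breadth-first corruption localization with the threshold p might be sketched as follows, building on the Node class above. The anomaly detector itself is left abstract: the preferred embodiment uses the score-function method of [4], and the simple normalized-distance rule below is only an illustrative stand-in.

```python
from collections import deque
import numpy as np

def is_anomalous(x_restricted: np.ndarray, X_restricted: np.ndarray, thresh: float = 3.0) -> bool:
    """Placeholder detector; any algorithm from [1]-[4] can be plugged in here."""
    mu = X_restricted.mean(axis=0)
    sigma = X_restricted.std(axis=0) + 1e-8
    return np.linalg.norm((x_restricted - mu) / sigma) / np.sqrt(len(mu)) > thresh

def label_anomalous(root: Node, x: np.ndarray, X_rest: np.ndarray) -> None:
    """Steps 102-104: label every node 'anomalous' or 'normal' for the instance x,
    where X_rest is the data set with x excluded."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        idx = node.attributes
        node.label = "anomalous" if is_anomalous(x[idx], X_rest[:, idx]) else "normal"
        queue.extend(child for child in (node.left, node.right) if child is not None)

def subtree_nodes(node: Node):
    """Yield all nodes of the sub-tree rooted at the given node."""
    yield node
    for child in (node.left, node.right):
        if child is not None:
            yield from subtree_nodes(child)

def localize_corruption(root: Node, p: float) -> list:
    """Step 105: a node is 'corrupted' if at least p percent of its sub-tree is anomalous,
    in which case its sub-tree is pruned from the search."""
    corrupted, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        nodes = list(subtree_nodes(node))
        rate = 100.0 * sum(n.label == "anomalous" for n in nodes) / len(nodes)
        if rate >= p:
            corrupted.append(node)           # prune: do not descend into this sub-tree
        else:
            queue.extend(child for child in (node.left, node.right) if child is not None)
    return corrupted
```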
If there exist corrupted nodes in the tree corresponding to an instance x, then the instance x is considered corrupted, and it is imputed, or filled in. To achieve this, we first locate the corrupted nodes in the corresponding tree. For imputing, i.e., filling in, each of these corrupted nodes, we use the parent-node statistics extracted from the data set. In general, any imputation approach can be used to impute the corrupted data; however, in the preferred implementation, a k-nearest-neighbor search is utilized. Let us suppose that a node n is detected as corrupted. Then we find the k nearest neighbors of the instance x by using the pn-attributes only, where pn is the parent node and k is a user-specified parameter. Thus, we are utilizing the maximum likelihood estimate of the subset of attributes corresponding to the parent node of a corrupted node. We simply calculate the mean of the attributes in this neighborhood, copy x as y ← x, and overwrite the n-attributes of the instance y using the attribute values in the calculated mean. We emphasize that this is performed for all instances, and for all corrupted nodes in the corresponding trees. If the corrupted node is the root node, which has no parent, we do not take any action, i.e., no imputation is applied.
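A sketch of this parent-node k-nearest-neighbor imputation (Steps 401-404 of Figure 3) is given next, assuming the helpers above; impute_instance is an illustrative name.

```python
def impute_instance(x: np.ndarray, X: np.ndarray, corrupted: list, k: int) -> np.ndarray:
    y = x.copy()                                  # Step 404 starts from y <- x
    for node in corrupted:
        pn = node.parent
        if pn is None:                            # the corrupted node is the root: no parent, no imputation
            continue
        p_idx = pn.attributes                     # Steps 401-402: use the pn-attributes only
        dists = np.linalg.norm(X[:, p_idx] - x[p_idx], axis=1)
        neighbors = X[np.argsort(dists)[:k]]      # k nearest neighbors in the data set
        mx = neighbors.mean(axis=0)               # Step 403: neighborhood mean (only its n-attributes are used below)
        y[node.attributes] = mx[node.attributes]  # Step 404: overwrite the corrupted n-attributes
    return y
```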
We point out that the method summarized above is applied to every data instance x in the data set X. In an iterative fashion, the set of the imputed instances Y = {y_i} is fed into the method again, and the data are further imputed, until a pre-specified number of iterations is reached. The algorithm returns the final imputed set as its output.
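The outer iteration might then be wired together as in the following sketch (assuming the helpers above; impute_dataset is an illustrative name). Each pass relabels the tree per instance, imputes the corrupted instances, and feeds the imputed collection Y back in as the data set for the next iteration.

```python
def impute_dataset(X: np.ndarray, K: int, k: int, p: float, max_iters: int) -> np.ndarray:
    d = X.shape[1]
    for _ in range(max_iters):
        tree = build_attribute_tree(list(range(d)), depth=K)
        Y = []
        for i in range(X.shape[0]):
            x = X[i]
            X_rest = np.delete(X, i, axis=0)          # Step 103: exclude x_i from its own statistics
            label_anomalous(tree, x, X_rest)          # Steps 102-104
            corrupted = localize_corruption(tree, p)  # Step 105
            if corrupted:                             # Steps 106-107
                Y.append(impute_instance(x, X, corrupted, k))
            else:
                Y.append(x.copy())
        X = np.vstack(Y)                              # Step 111: X <- Y for the next iteration
    return X
```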
In the following, we describe our method step by step, as depicted in Figure 1.
Input:
• Data set: X = {x_i}_{i=1}^T, where x_i ∈ R^d and d is the dimensionality;
• Depth: K;
• Neighborhood size: k;
• Corruption detection parameter: p;
• Maximum number of iterations: mc;
• iters = -1;
• Processed data are stored in Y.
1) Step 200: A binary tree of depth K is generated, an example of which is illustrated in Figure 4. Set i = 1, iters ← iters + 1, and Y = {}.
2) Step 112:
   a. If iters = mc, then execute Step 113 in Figure 1 (return the data set X).
   b. Otherwise, go to 3 (Step 101 in Figure 1).
3) Step 101: The instance x_i is fetched from the data set X.
4) A breadth-first traversal is started on the binary tree. At a node n during this traversal,
   a. Step 102: The instance x_i is restricted to the n-attributes, named x_i'.
   b. Step 103: The data set X - {x_i} is restricted to the n-attributes, named X'.
   c. Step 104: Assign this node n a label "anomalous" or "normal", preferably using the algorithm in [4]. (Note that any algorithm in the family of anomaly detection algorithms is appropriate, cf. [1]-[4] and the references therein; however, the method in [4] is used in the preferred implementation.)
5) Step 105: For every node in the tree corresponding to x_i, a label "corrupted" or "uncorrupted" is assigned using the following substeps (Figure 2):
   i. Step 301: A breadth-first search is started on the binary tree.
   ii. Step 302: At a node n during this search, compute the rate (percentage), say L, of the anomalous nodes in the sub-tree rooted at this node.
   iii. Step 303:
      a. If L < p, then the node n is labeled as "uncorrupted", and the breadth-first search continues.
      b. Step 304: If L >= p, then the node n is labeled as "corrupted". In Step 305, the sub-tree rooted at this node n is pruned. Then the breadth-first search continues.
6) Step 106: The instance x_i is checked for whether it is corrupted or not. If there exists at least one corrupted node in the corresponding tree, then the instance x_i is corrupted and needs to be imputed, i.e., filled in.
   a. Step 107: If the instance x_i is corrupted, then estimate an uncorrupted version of the instance using data imputation techniques and fill in the corrupted attributes accordingly. In the preferred implementation of our invention, we use the following procedure (i, ii, iii, iv) for imputation (Figure 3):
      i. Locate the nodes in the corresponding binary tree that are labeled as "corrupted". Step 401: For each of these nodes, say n, locate the parent node pn.
      ii. Step 402: Using the pn-attributes only, find the k nearest neighbors of x_i in the data set X; their mean is computed in (iii) and used for replacement in (iv).
      iii. Step 403: Find the mean in this neighborhood, say mx. Note that mx has only pn-attributes.
      iv. Step 404: Copy x_i as y_i ← x_i. Replace the corrupted n-attributes of the instance y_i with the corresponding values of the n-attributes in the mean vector mx.
   b. If the instance x_i is not corrupted, then copy x_i as y_i ← x_i.
7) Step 109: Save y_i into Y = Y ∪ {y_i}. Then reset the tree corresponding to the instance x_i.
8) Step 108: Increment i: i ← i + 1.
9) Step 110: Check whether i has reached T + 1.
   a. If i < T + 1, then go to 3 (Step 101).
   b. If i = T + 1, then
      i. Step 111: Set X ← Y, i.e., the filled-in collection is chosen as the data set for the next iteration.
      ii. Go to 1 (Step 200).
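As an illustrative end-to-end usage of the sketches above on synthetic data (d = 80 and K = 3 follow the running example; all other values are arbitrary choices of ours):

```python
rng = np.random.default_rng(0)
T, d = 200, 80
X = rng.normal(size=(T, d))
for i in rng.choice(T, size=10, replace=False):               # corrupt a small fraction of the instances
    X[i] = corrupt(X[i], j=int(rng.integers(0, d - 20)), z=20, rng=rng)
X_imputed = impute_dataset(X, K=3, k=5, p=50.0, max_iters=2)
```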
References:
1. B. Scholkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, and A. Williamson. Estimating the Support of a High-Dimensional Distribution. Neural Computation, vol. 13, no. 7, pp. 1443-1471, July 2001.
2. V. Nikiforov and M. Basseville. Detection of Abrupt Changes: Theory and Applications. Prentice-Hall, New Jersey, 1993.
3. A. O. Hero. Geometric entropy minimization (GEM) for anomaly detection and localization. Neural Information Processing Systems Conference, vol. 19, 2006.
4. M. Zhao and V. Saligrama. Anomaly detection with score functions based on nearest neighbor graphs. Neural Information Processing Systems Conference, vol. 22, 2009.
5. Salakhutdinov, R. and Hinton, G. E. (2012). An efficient learning procedure for deep Boltzmann machines. Neural Computation, 24, 1967-2006.
6. Chechik, G., Heitz, G., Elidan, G., Abbeel, P. and Koller, D. (2007). Max-Margin Classification of Data with Absent Features. Journal of Machine Learning Research, 9, 127.
7. Marlin, B. (2008). Missing Data Problems in Machine Learning. PhD thesis, University of Toronto.
8. Marlin, B. M., Zemel, R. S., Roweis, S. T. and Slaney, M. (2011). Recommender Systems: Missing Data and Statistical Model Estimation. Proceedings of the International Joint Conference on Artificial Intelligence.
9. Rubin, D. B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: Wiley.
10. Smolensky, P. (1986). Information Processing in Dynamical Systems: Foundations of Harmony Theory. In Rumelhart, D. E. and McClelland, J. L. (eds.), Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Volume 1: Foundations. MIT Press, pp. 194-281.
11. McLachlan, G. J. (1988). Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker.
12. Wang, C., Liao, X., Carin, L. and Dunson, D. B. (2010). Classification with Incomplete Data Using Dirichlet Process Priors. Journal of Machine Learning Research, 11, 3269-3311.
13. Schafer, J. and Graham, J. (2002). Missing data: Our view of the state of the art. Psychological Methods, 7, 147-177.
14. Little, R. J. A. and Rubin, D. B. (1987). Statistical Analysis with Missing Data. New York: John Wiley and Sons, Inc.
15. Ghahramani, Z. and Jordan, M. I. (1994). Supervised learning from incomplete data via an EM approach. Advances in Neural Information Processing Systems, 6, 120-127.
16. Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24(3), 179-195.
17. Heckerman, D., Chickering, D. M., Meek, C., Rounthwaite, R. and Kadie, C. (2000). Dependency networks for inference, collaborative filtering, and data visualization. Journal of Machine Learning Research, 1, 49-75.
18. S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1984.
19. Y. Weiss and W. T. Freeman. What makes a good model of natural images? CVPR, 2007.
20. S. Lyu and E. P. Simoncelli. Statistical modeling of images with fields of Gaussian scale mixtures. NIPS, 2006.
21. S. C. Zhu, Y. Wu and D. Mumford. Filters, Random Fields and Maximum Entropy (FRAME): Towards a Unified Theory for Texture Modeling. International Journal of Computer Vision, 1998.
22. LeCun, Y. (1987). Modèles connexionnistes de l'apprentissage. Doctoral dissertation, Université de Paris VI.
23. Gallinari, P., LeCun, Y., Thiria, S. and Fogelman-Soulie, F. (1987). Mémoires associatives distribuées. Proceedings of COGNITIVA 87, Paris, La Villette.
24. Jain, V. and Seung, S. H. (2008). Natural image denoising with convolutional networks. In Advances in Neural Information Processing Systems 21 (NIPS'08).
25. Toh, P.-S. and Forrest, A. K. (1990). Occlusion detection in early vision. In Proceedings of the 3rd International Conference on Computer Vision, Osaka, Japan, pp. 126-132.
26. N. H. C. Yung and A. H. S. Lai. Detection of vehicle occlusion using a generalized deformable model. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS '98), vol. 4, pp. 154-157, Monterey, Calif., USA, May-June 1998.
27. C. Pang, W. Lam and N. Yung. A novel method for resolving vehicle occlusion in a monocular traffic-image sequence. IEEE Transactions on Intelligent Transportation Systems, vol. 5(3), 2004, pp. 129-141.
28. W. Zhang, Q. Wu, X. Yang and X. Fang. Multilevel Framework to Detect and Handle Vehicle Occlusion. IEEE Transactions on Intelligent Transportation Systems, vol. 9, pp. 161-174, March 2008.
29. X. Mei, H. Ling, Y. Wu, E. Blasch and L. Bai. Minimum Error Bounded Efficient L1 Tracker with Occlusion Detection. In Proceedings of Computer Vision and Pattern Recognition (CVPR), 2011.
30. Pednault, E. P. D. (2004). Mechanism for constructing predictive models that allow inputs to have missing values. US Patent No. US 6,810,368 B1.
35. Hsu, C. H., Shie, B. E., Su, J. H. and Tseng, S. M. (2012). System and method for imputing missing values and computer program product thereof. US Patent No. US 20120136896 A1.
36. Natarajan, R., Banerjee, A. and Shan, H. (2013). Multiple imputation of missing data in multi-dimensional retail sales data sets via tensor factorization. US Patent No. US 20130036082 A1.
37. Fischer, M. and Staelin, C. (2008). Image artifact reduction using a neural network. US Patent No. US 7346208 B2.

Claims

1. A method for data imputation (100) characterized in that it comprises the steps of, for a given corrupted data set:
a. Generate a binary K-tree (K is a user-specified parameter), assign a subset of the data dimensions to a node depending on the node's location in the tree, empty the data storage from the previous iteration, increment the iteration, and let the index point to the first instance in the data set. (200)
b. If the maximum number of iterations is reached (the maximum number of iterations is a user-specified parameter), (112)
i. Return the data set. (113)
c. Fetch the i'th data point to process. (101)
d. Start a breadth-first traversal on the tree. In the course of this traversal, at a node n (here any searching scheme is appropriate; we utilize a breadth-first traversal in the preferred implementation),
1. Restrict the i'th data point to the n-attributes, which is the subset of attributes assigned to node n during tree generation. (102)
2. Restrict the data set excluding the i'th data point to the same n-attributes. (103)
3. Label each node as anomalous or normal using an anomaly detection algorithm from the machine learning literature. (104)
e. Start a breadth-first search on the labeled (anomalous vs. normal) tree. In the course of this search, label each node further as corrupted or uncorrupted. (105)
f. If the i'th data point is found to be corrupted, i.e., if at least one node of the tree is found to be corrupted, (106)
1. Impute or fill in the i'th data point with known methods in the literature. (107)
g. Save the i'th data point into the data storage Y. (109)
h. Increment the index i. (108)
i. If the index i exceeds the number of points in the data set, (110)
i. Overwrite the data set X with the stored data in Y. (111)
ii. Go to step a.
j. Go to step c.
2. A method for data imputation (100) characterized in that the step 105 further comprises the sub-steps of, for a given data instance along with the corresponding labeled (anomalous vs. normal) tree:
a. Start a breadth-first search of the nodes for a corruption. At every node, (301)
a. Compute the rate (percentage) of the anomalous nodes in the corresponding subtree. (302)
b. If this rate is greater than p (p is a user-specified threshold), (303)
i. Label this node as corrupted. (304)
ii. Prune the subtree of this node. (305)
c. If this rate is less than or equal to p, (303)
i. Label the node as uncorrupted, and continue the search.
b. Return the pruned and labeled (corrupted vs. uncorrupted) tree.
3. A method for data imputation (100) characterized in that the step 107 further comprises the sub-steps of, for a given data instance along with the corresponding labeled (corrupted vs. uncorrupted) tree and the data set:
a. Start a breadth-first search of the nodes. At every corrupted node n,
i. Locate the parent node pn. (401)
ii. Using the pn-attributes, find the k nearest neighbors of the given data instance in the data set. (402)
iii. Find the mean of these k neighbors (k is a user-specified parameter). (403)
iv. Overwrite the corrupted n-attributes of the data instance with the n-attributes of the found mean. (404)
b. Return the data instance.
4. A method for data processing that utilizes anomaly detection procedures for localizing corrupted subspaces of the data.
5. A method for data imputation that partitions the data into neighboring subspaces using a binary K-tree approach.
6. A method for data processing that iteratively replaces corrupted subspaces of the data using anomaly detection and K-tree based corruption localization.
PCT/IB2013/055640 2013-07-09 2013-07-09 Method for imputing corrupted data based on localizing anomalous parts Ceased WO2015004502A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
PCT/IB2013/055640 WO2015004502A1 (en) 2013-07-09 2013-07-09 Method for imputing corrupted data based on localizing anomalous parts
UAA201600281U UA126396U (en) 2013-07-09 2013-09-07 METHOD OF RECOVERING DAMAGED DATA USING LOCALIZATION OF ANOMALIC DATA PARTS
PH12016500066A PH12016500066A1 (en) 2013-07-09 2016-01-08 Method for imputing corrupted data based on localizing anomalous parts

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2013/055640 WO2015004502A1 (en) 2013-07-09 2013-07-09 Method for imputing corrupted data based on localizing anomalous parts

Publications (1)

Publication Number Publication Date
WO2015004502A1 true WO2015004502A1 (en) 2015-01-15

Family

ID=49261573

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2013/055640 Ceased WO2015004502A1 (en) 2013-07-09 2013-07-09 Method for imputing corrupted data based on localizing anomalous parts

Country Status (3)

Country Link
PH (1) PH12016500066A1 (en)
UA (1) UA126396U (en)
WO (1) WO2015004502A1 (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106033442A (en) * 2015-03-16 2016-10-19 北京大学 A Parallel Breadth-First Search Method Based on Shared Memory Architecture
US9928145B2 (en) 2013-12-24 2018-03-27 International Business Machines Corporation File corruption recovery in concurrent data protection
CN110097920A (en) * 2019-04-10 2019-08-06 大连理工大学 Metabonomics data missing value filling method based on neighbor stability
CN111611231A (en) * 2019-02-25 2020-09-01 新奥数能科技有限公司 Equipment operation data cleaning method and device, readable medium and electronic equipment
US11171975B2 (en) 2018-09-25 2021-11-09 Cisco Technology, Inc. Dynamic inspection of networking dependencies to enhance anomaly detection models in a network assurance service
CN114663265A (en) * 2022-03-30 2022-06-24 西安交通大学 EM-MFA algorithm-based building comprehensive energy system carbon emission monitoring method and device
US11783606B2 (en) 2021-11-01 2023-10-10 Rehrig Pacific Company Delivery system
CN117785855A (en) * 2023-12-25 2024-03-29 杭州字节方舟科技有限公司 Block chain-based wind control early warning, device, equipment and storage medium
US12493855B2 (en) 2023-12-17 2025-12-09 Rehrig Pacific Company Validation system for conveyor

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11508480B2 (en) * 2019-08-28 2022-11-22 International Business Machines Corporation Online partially rewarded learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6810368B1 (en) 1998-06-29 2004-10-26 International Business Machines Corporation Mechanism for constructing predictive models that allow inputs to have missing values
US7346208B2 (en) 2003-10-25 2008-03-18 Hewlett-Packard Development Company, L.P. Image artifact reduction using a neural network
US7792770B1 (en) * 2007-08-24 2010-09-07 Louisiana Tech Research Foundation; A Division Of Louisiana Tech University Foundation, Inc. Method to indentify anomalous data using cascaded K-Means clustering and an ID3 decision tree
US20120136896A1 (en) 2010-11-26 2012-05-31 Shin-Mu Tseng System and method for imputing missing values and computer program product thereof
US20130036082A1 (en) 2011-08-05 2013-02-07 International Business Machines Corporation Multiple imputation of missing data in multi-dimensional retail sales data sets via tensor factorization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6810368B1 (en) 1998-06-29 2004-10-26 International Business Machines Corporation Mechanism for constructing predictive models that allow inputs to have missing values
US7346208B2 (en) 2003-10-25 2008-03-18 Hewlett-Packard Development Company, L.P. Image artifact reduction using a neural network
US7792770B1 (en) * 2007-08-24 2010-09-07 Louisiana Tech Research Foundation; A Division Of Louisiana Tech University Foundation, Inc. Method to indentify anomalous data using cascaded K-Means clustering and an ID3 decision tree
US20120136896A1 (en) 2010-11-26 2012-05-31 Shin-Mu Tseng System and method for imputing missing values and computer program product thereof
US20130036082A1 (en) 2011-08-05 2013-02-07 International Business Machines Corporation Multiple imputation of missing data in multi-dimensional retail sales data sets via tensor factorization

Non-Patent Citations (31)

* Cited by examiner, † Cited by third party
Title
A. O. HERO: "Geometric entropy minimization(GEM) for anomaly detection and localization", NEURAL INFORMATION PROCESSING SYSTEMS CONFERENCE, vol. 19, 2006
ANTONIO D'AMBROSIO ET AL: "Accurate Tree-based Missing Data Imputation and Data Fusion within the Statistical Learning Paradigm", JOURNAL OF CLASSIFICATION, SPRINGER-VERLAG, NE, vol. 29, no. 2, 17 June 2012 (2012-06-17), pages 227 - 258, XP035075869, ISSN: 1432-1343, DOI: 10.1007/S00357-012-9108-1 *
B. SCHOLKOPF; J. C. PLATT; J. SHAWE-TAYLOR; A. J. SMOLA; A. WILLAIMSON: "Estimating Support of a High dimensional Distribution", NEURAL COMPUTATION, vol. 13, no. 7, July 2001 (2001-07-01), pages 1443 - 1471
BESAG, J.: "Statistical analysis of non-lattice data", THE STATISTICIAN, vol. 24, no. 3, 1975, pages 179 - 195
C. PANG; W. LAM; N. YUNG: "A novel method for resolving vehicle occlusion in a monocular traffic-image sequence", IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, vol. 5, no. 3, 2004, pages 129 - 141
CHECHIK, G.; HEITZ, G.; ELIDAN, G.; ABBEEL, P.; KOLLER, D.: "Max-Margin Classification of Data with Absent Features", JOURNAL OF MACHINE LEARNING RESEARCH, vol. 9, 2007, pages 127
GALLINARI, P.; LECUN, Y.; THIRIA, S.; FOGELMAN-SOULIE, F.: "Memoires associatives distribuees", PROCEEDINGS OF COGNITIVA 87, 1987
GHAHRAMANI, Z.; JORDAN, M.I.: "Supervised learning from incomplete data via an EM approach", ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS., vol. 6, 1994, pages 120 - 127
GUSTAVO E A P A BATISTA ET AL: "A Study of K-Nearest Neighbour as an Imputation Method", HIS, 31 December 2003 (2003-12-31), XP055107935, Retrieved from the Internet <URL:http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.3558&rep=rep1&type=pdf> [retrieved on 20140314] *
HECKERMAN, D.; CHICKERING, D.M.; MEEK, C.; ROUNTHWAITE, R.; KADIE, C.: "Dependency networks for inference, collaborative filtering, and data visualization", JOURNAL OF MACHINE LEARNING RESEARCH, vol. 1, 2000, pages 49 - 75
JAIN, V.; SEUNG S.H.: "Advances in Neural Information Processing Systems", 2008, article "Natural image denoising with convolutional networks", pages: 21
LITTLE, R.J.A.; RUBIN, D.B.: "Statistical Analysis with Missing Data", 1987, JOHN WILEY AND SONS, INC.
M. ZHAO; V. SALIGRAMA: "Anomaly detection with score functions based on nearest neighbor graphs", NEURAL INFORMATION PROCESSING SYSTEMS CONFERENCE, vol. 22, 2009
MARLIN, B. M.; ZEMEL, R.S.; ROWEIS, S.T.; SLANEY, M.: "Recommender Systems: Missing Data and Statistical Model Estimation", PROCEEDINGS OF THE INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2011
MARLIN, B.: "Missing Data Problems in Machine Learning", PHD THESIS, 2008
MCLAUGHLAN, G.J.: "Mixture Models: inference and applications to clustering", 1988, MARCEL DEKKER
N. H. C. YUNG; A. H. S. LAI: "Detection of vehicle occlusion using a generalized deformable model", PROCEEDINGS OF THE IEEE INTERNATIONAL SYMPOSIUM ON CIRCUITS AND SYSTEMS (ISCAS '98, vol. 4, May 1998 (1998-05-01), pages 154 - 157
PEERAPON VATEEKUL ET AL: "Tree-Based Approach to Missing Data Imputation", DATA MINING WORKSHOPS, 2009. ICDMW '09. IEEE INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 6 December 2009 (2009-12-06), pages 70 - 75, XP031585556, ISBN: 978-1-4244-5384-9 *
RUBIN, D. B.: "Multiple imputation for nonresponse in surveys", 1987, WILEY
S. GEMAN; D. GEMAN: "Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images", PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 1984
S. LYU; E.P. SIMONCELLI: "Statistical modeling of images with fields of Gaussian scale mixtures", NIPS, 2006
S.C. ZHU; Y. WU; D. MUMFORD: "Filters, Random Fields and Maximum Entropy (FRAME): Towards a Unified Theory for Texture Modeling", INTERNATIONAL JOURNAL OF COMPUTER VISION, 1998
SALAKHUTDINOV, R.; HINTON, G. E.: "An efficient learning procedure for deep Boltzmann machines", NEURAL COMPUTATION, vol. 24, 2012, pages 1967 - 2006
SCHAFER, J.; GRAHAM, J.: "Missing data: Our view of the state of the art", PSYCHOLOGICAL METHODS, vol. 7, 2002, pages 147 - 177
SMOLENSKY, P.: "Parallel Distributed Processing: Explorations in the Microstructure of Cognition", vol. 1, 1986, FOUNDATIONS. MIT PRESS, article "Information Processing in Dynamical Systems: Foundations of Harmony Theory", pages: 194 - 281
TOH, P.-S.; FORREST, A.K.: "Occlusion detection in early vision", PROCEEDINGS OF THE 3RD INTERNATIONAL CONFERENCE ON COMPUTER VISION, 1990, pages 126 - 132
V. NIKIFOROV; M. BASSEVILLE: "Detection of abrupt changes: theory and applications", 1993, PRENTICE-HALL
W. ZHANG; Q. WU; X. YANG; X. FANG: "Multilevel Framework to Detect and Handle Vehicle Occlusion", IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEM, vol. 9, March 2008 (2008-03-01), pages 161 - 174
WANG, C.; LIAO, X.; CARIN, L.; DUNSON, D.B.: "Classification with Incomplete Data Using Dirichlet Process Priors", JOURNAL OF MACHINE LEARNING RESEARCH, vol. 11, 2010, pages 3269 - 3311
X. MEI; H. LING; Y. WU; E. BLASCH; L. BAI: "Minimum Error Bounded Efficient L1 Tracker with Occlusion Detection", PROC. OF THE COMPUTER VISION AND PATTERN RECOGNITION (CVPR, 2011
Y. WEISS; W.T. FREEMAN: "What makes a good model of natural images?", CVPR, 2007

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10929242B2 (en) 2013-12-24 2021-02-23 International Business Machines Corporation File corruption recovery in concurrent data protection
US9928145B2 (en) 2013-12-24 2018-03-27 International Business Machines Corporation File corruption recovery in concurrent data protection
US10353781B2 (en) 2013-12-24 2019-07-16 International Business Machines Corporation File corruption recovery in concurrent data protection
CN106033442B (en) * 2015-03-16 2019-04-02 北京大学 A parallel breadth-first search method based on shared memory architecture
CN106033442A (en) * 2015-03-16 2016-10-19 北京大学 A Parallel Breadth-First Search Method Based on Shared Memory Architecture
US11171975B2 (en) 2018-09-25 2021-11-09 Cisco Technology, Inc. Dynamic inspection of networking dependencies to enhance anomaly detection models in a network assurance service
CN111611231A (en) * 2019-02-25 2020-09-01 新奥数能科技有限公司 Equipment operation data cleaning method and device, readable medium and electronic equipment
CN110097920A (en) * 2019-04-10 2019-08-06 大连理工大学 A kind of metabolism group shortage of data value fill method based on neighbour's stability
CN110097920B (en) * 2019-04-10 2022-09-20 大连理工大学 Metabonomics data missing value filling method based on neighbor stability
US11783606B2 (en) 2021-11-01 2023-10-10 Rehrig Pacific Company Delivery system
CN114663265A (en) * 2022-03-30 2022-06-24 西安交通大学 EM-MFA algorithm-based building comprehensive energy system carbon emission monitoring method and device
US12493855B2 (en) 2023-12-17 2025-12-09 Rehrig Pacific Company Validation system for conveyor
CN117785855A (en) * 2023-12-25 2024-03-29 杭州字节方舟科技有限公司 Block chain-based wind control early warning, device, equipment and storage medium

Also Published As

Publication number Publication date
UA126396U (en) 2018-06-25
PH12016500066A1 (en) 2016-04-18

Similar Documents

Publication Publication Date Title
WO2015004502A1 (en) Method for imputing corrupted data based on localizing anomalous parts
He et al. Structured pruning for deep convolutional neural networks: A survey
Zaheer et al. Old is gold: Redefining the adversarially learned one-class classifier training paradigm
Zhang et al. Context-aware surveillance video summarization
Han et al. Hyperattention: Long-context attention in near-linear time
Huang et al. Graphgdp: Generative diffusion processes for permutation invariant graph generation
Zhou et al. Accelerating online cp decompositions for higher order tensors
Zhang et al. Boosting video object segmentation via space-time correspondence learning
Ding et al. Bayesian robust principal component analysis
Mesgaran et al. Graph fairing convolutional networks for anomaly detection
Jiang et al. Incomplete graph learning via attribute-structure decoupled variational auto-encoder
TR201514432T1 (en) Method for pseudo-recurrent processing of data using a feedforward neural network architecture
Li et al. Multi-view graph learning with adaptive label propagation
Najafi et al. Outlier-Robust Multi-Aspect Streaming Tensor Completion and Factorization.
Li et al. A tensor-based online RPCA model for compressive background subtraction
Yang et al. Unbalanced Incomplete Multiview Unsupervised Feature Selection With Low-Redundancy Constraint in Low-Dimensional Space
CN119780786B (en) LED module breakpoint detection and positioning method and device
Zhang et al. DNA hypernetworks for information storage and retrieval
Atashgahi et al. Quick and robust feature selection: the strength of energy-efficient sparse training for autoencoders
Chen et al. Joint selective state space model and detrending for robust time series anomaly detection
Wu et al. Scoda: Domain adaptive shape completion for real scans
Zhang et al. Tensor graph convolutional neural network
Sun et al. Tensorial multiview representation for saliency detection via nonconvex approach
Wang et al. Explicit pairwise factorized graph neural network for semi-supervised node classification
Mo et al. Graph principal flow network for conditional graph generation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13770504

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 12016500066

Country of ref document: PH

NENP Non-entry into the national phase

Ref country code: DE

WWE Wipo information: entry into national phase

Ref document number: 2016/0044.1

Country of ref document: KZ

WWE Wipo information: entry into national phase

Ref document number: A201600281

Country of ref document: UA

122 Ep: pct application non-entry in european phase

Ref document number: 13770504

Country of ref document: EP

Kind code of ref document: A1