WO2016190841A1 - Marginal distribution in a graphical model - Google Patents
- Publication number
- WO2016190841A1 (PCT/US2015/032212)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- nodes
- time period
- graphical model
- subgraph
- impacted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/552—Detecting local intrusion or implementing counter-measures involving long-term monitoring or reporting
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/57—Certifying or maintaining trusted computer platforms, e.g. secure boots or power-downs, version controls, system software checks, secure updates or assessing vulnerabilities
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N7/00—Computing arrangements based on specific mathematical models
- G06N7/01—Probabilistic graphical models, e.g. probabilistic networks
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/42—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation
- G06V10/422—Global feature extraction by analysis of the whole pattern, e.g. using frequency domain transformations or autocorrelation for representing the structure of the pattern or shape of an object therefor
- G06V10/426—Graphical representations
Definitions
- Graphical models are graphs which compactly represent the joint distributions of many random variables. Graphical models have been widely used in various applications, such as malware detection, detecting malicious domains, topic modeling, information extraction, and many others.
- Figure 1 is a schematic illustration of an example system for maintaining inference results on evolving graphical models in accordance with an implementation of the present disclosure.
- Figure 2 illustrates a flowchart showing an example of a method for computing an approximation of a marginal distribution of all nodes in a graphical model in accordance with an implementation of the present disclosure.
- Figure 3 illustrates a flowchart showing an example of a method for identifying independent subgraphs in the graphical model in accordance with an implementation of the present disclosure.
- Figure 4 illustrates a flowchart showing an example of a method for computing an approximation of the marginal distribution of all impacted nodes for each independent subgraph in accordance with an implementation of the present disclosure.
- Figure 5 is an example block diagram illustrating a computer-readable medium in accordance with an implementation of the present disclosure.
- Graphical models are widely used in many different applications. Graphical models may include a plurality of nodes or vertices connected by edges. In many of these applications, graphical models are constantly evolving. For example, in malware detection, the graphical model may be a bipartite graph where each node is a random binary variable which represents whether a machine or a file is infected. There may be an edge between a file node and a machine node if the file is found on that machine. Such a graphical model changes when new files are installed on the machines, files are deleted, and machines are added or deleted from the system.
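The bipartite structure described above can be sketched as follows. This is a minimal illustration under assumed data structures (a dict-of-sets adjacency map; the function name and node labels are hypothetical, not the patent's own representation):

```python
# Bipartite graphical model for malware detection: binary "infected"
# variables for machines and files, with an edge whenever a file is
# observed on a machine.
from collections import defaultdict

def build_bipartite_model(observations):
    """observations: iterable of (machine_id, file_id) pairs."""
    edges = defaultdict(set)  # node -> set of neighbor nodes
    for machine, file in observations:
        m, f = ("machine", machine), ("file", file)
        edges[m].add(f)
        edges[f].add(m)
    return edges

snapshot_t = build_bipartite_model([("m1", "f1"), ("m1", "f2"), ("m2", "f2")])
# Installing f3 on m2 between time t and t+1 yields a changed snapshot:
snapshot_t1 = build_bipartite_model([("m1", "f1"), ("m1", "f2"),
                                     ("m2", "f2"), ("m2", "f3")])
```

The two snapshots differ only in the neighborhood of m2 and the new node f3, which is the kind of small change the techniques below exploit.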
- Such evolving scenarios as well as many non-evolving cases may generally create a graph inference problem: given a graphical model that encodes a joint probability distribution, it may be difficult and time consuming to determine the marginal probability distribution of a particular node in the graph conditioned on the known value of some other nodes in the graph.
- the term "graph inference result" refers to the marginal distribution of each node in the graph.
- changes in the graphical model may be identified between snapshots of the model in two different time periods.
- the proposed techniques efficiently identify relatively small subgraphs in the changed large graphical models.
- the techniques run an inference algorithm on these subgraphs to compute an approximation of marginal distribution of all impacted nodes in the subgraphs as fast as possible.
- the techniques described herein perform graph inference incrementally (e.g., on a small neighborhood of impacted nodes), and not on the entire graphical model.
- the marginal distribution of the second node may be approximately calculated because the changed first node would not have much effect on the second node. Therefore, when the changes between the graphical model at two different times are small, the proposed techniques offer a faster alternative to re-running an inference algorithm on the entire graph from scratch.
- the techniques described herein propose to first identify impacted subgraphs of the original graph that are likely to contain nodes whose marginal distribution may change significantly based on the changes in the graph between the two different time periods. Then, the techniques propose to only run the inference algorithm on those subgraphs before combining data for the subgraphs with the data for the rest of the graph.
- a processor may identify nodes and edges in a graphical model at a first time period and at a second time period that is later than the first time period.
- the processor may further: determine changes between the graphical model at the first time period and the second time period, identify impacted nodes in the graphical model at the second time period, and identify independent subgraphs in the graphical model at the second time period.
- the processor may compute an approximation of a marginal distribution of all impacted nodes for each independent subgraph, and may compute an approximation of a marginal distribution of all nodes in the graphical model at the second time period.
- the proposed techniques may consider any type of change in a graphical model (e.g., deleted nodes, inserted nodes, nodes with changed edges, and nodes whose previously unknown values are revealed, etc.) and may apply any inference algorithm to the determined subgraphs. Further, the described techniques may efficiently compute inference results on evolving power-law graphs that include high-degree nodes.
- FIG. 1 is a schematic illustration of an example system 10 for maintaining inference results on evolving graphical models.
- the term "inference result" refers to an approximation of a marginal distribution of all nodes in the graphical model.
- the illustrated system 10 is capable of carrying out the techniques described below.
- the system 10 is depicted as including at least one computing device 100.
- computing device 100 includes a processor 102, an interface 106, and a machine-readable storage medium 110.
- while the computing device 100 is described in detail below, the techniques described herein may be performed with several computing devices or by engines distributed on different devices.
- the computing device 100 may be any type of a computing device and may include at least engines 120-160. Engines 120-160 may or may not be part of the machine-readable storage medium 110. In one example, the computing device 100 may be an independent computing device. In an alternative example, engines 120-160 may be distributed between the computing device 100 and other computing devices.
- the computing device 100 may include additional components and some of the components depicted therein may be removed and/or modified without departing from a scope of the system that allows for carrying out the functionality described herein.
- Processor 102 may be central processing unit(s) (CPUs), microprocessor(s), and/or other hardware device(s) suitable for retrieval and execution of instructions (not shown) stored in machine-readable storage medium 110.
- Processor 102 may fetch, decode, and execute instructions to identify different groups in a dataset.
- processor 102 may include electronic circuits comprising a number of electronic components for performing the functionality of instructions.
- Interface 106 may include a number of electronic components for communicating with various devices.
- interface 106 may be an Ethernet interface, a Universal Serial Bus (USB) interface, an IEEE 1394 (Firewire) interface, an external Serial Advanced Technology Attachment (eSATA) interface, or any other physical connection interface suitable for communication with the computing device.
- interface 106 may be a wireless interface, such as a wireless local area network (WLAN) interface or a near-field communication (NFC) interface that is used to connect with other devices/systems and/or to a network.
- the network may be a mesh sensor network (not shown).
- the network may include any suitable type or configuration of network to allow for communication between the computing device 100, and any other devices/systems (e.g., other computing devices, displays, etc.), for example, to send and receive data to and from a corresponding interface of another device.
- Each of the engines 120-160 may include, for example, at least one hardware device including electronic circuitry for implementing the functionality described below, such as control logic and/or memory.
- the engines 120-160 may be implemented as any combination of hardware and software to implement the functionalities of the engines.
- the hardware may be a processor and the software may be a series of instructions or microcode encoded on a machine-readable storage medium and executable by the processor. Therefore, as used herein, an engine may include program code, (e.g., computer executable instructions), hardware, firmware, and/or logic, or combination thereof to perform particular actions, tasks, and functions described in more detail herein in reference to Figures 2-5.
- the identification engine 120 may identify nodes and edges in a graphical model at a first time period and at a second time period that is later than the first time period.
- the graphical model may be represented as a first snapshot of the graphical model at the first time period and a second snapshot of the graphical model at the second time period.
- the analysis engine 130 may determine changes between the graphical model at the first time period and the second time period, and may identify impacted nodes in the graphical model at the second time period.
- the changes between the graphical model at the first time period and the second time period include at least one of: deleted nodes, inserted nodes, nodes with changed edges, and nodes whose previously unknown values are revealed. It may also include changes to factors on the edges and cliques (these factors encode the joint probability distribution of the nodes in the graphical model). In other examples, other types of changes may be detected.
- each of the impacted nodes may have a topology change from the first time period to the second time period. The topology change may include adding or deleting an edge of the node, having a brand new node, etc.
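The change-detection step performed by the analysis engine can be sketched as below. The dict-of-sets snapshot representation and the function name are assumptions for illustration; the disclosure does not prescribe a data structure:

```python
# Compare two graph snapshots and collect impacted nodes: nodes that were
# deleted or inserted, plus surviving nodes whose edge sets changed.
def impacted_nodes(g_t, g_t1):
    nodes_t, nodes_t1 = set(g_t), set(g_t1)
    impacted = (nodes_t - nodes_t1) | (nodes_t1 - nodes_t)  # deleted/inserted
    for node in nodes_t & nodes_t1:
        if g_t[node] != g_t1[node]:  # edge set changed between t and t+1
            impacted.add(node)
    return impacted

g_t = {"a": {"b"}, "b": {"a"}, "c": set()}
g_t1 = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
# "a" and "c" gain an edge, "d" is inserted; "b" is untouched.
```

Nodes whose previously unknown values are revealed, and changed edge/clique factors, would need additional bookkeeping beyond this topological diff.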
- the subgraphs engine 140 may identify independent subgraphs in the graphical model at the second time period.
- the subgraphs engine 140 may: identify evidence nodes that have changed between the first time period and the second time period, identify high-degree nodes, identify active nodes, remove all edges that are not connected to active nodes, and determine a plurality of independent subgraphs.
- the term "active node" refers to a node that has at least one active edge or has no inactive edges.
- the term “evidence node” refers to a node in the graphical model that has a known value.
- the term “high-degree node” refers to a node that is connected to a very large number of other nodes and the factors on the edges are such that a change in its neighbor nodes will not dramatically change its distribution.
- the inference engine 150 may compute an approximation of a marginal distribution of all impacted nodes in each of the independent subgraphs. Thus, the inference engine 150 may apply an inference algorithm to each independent subgraph that was affected by the change in the graphical model.
- the graphical model engine 160 may compute an approximation of a marginal distribution of all nodes in the graphical model at the second time period. Thus, the graphical model engine 160 may maintain the inference results of an evolving graphical model at a much faster rate.
- Figure 2 illustrates a flowchart showing an example of a method 200 for computing an approximation of a marginal distribution of all nodes in a graphical model (i.e., for maintaining inference results on evolving graphical models).
- execution of the method 200 is described below with reference to the system 10, the components for executing the method 200 may be spread among multiple devices/systems.
- the method 200 may be implemented in the form of executable instructions stored on a machine-readable storage medium, and/or in the form of electronic circuitry.
- the method 200 can be executed by at least one processor of a computing device (e.g., processor 102 of device 100). In other examples, the method may be executed by another processor in communication with the system 10.
- Various elements or blocks described herein with respect to the method 200 are capable of being executed simultaneously, in parallel, or in an order that differs from the illustrated serial manner of execution.
- the method 200 is also capable of being executed using additional or fewer elements than are shown in the illustrated examples.
- the method 200 begins at 210, where at least one processor may identify nodes and edges in a graphical model at a first time period.
- the graphical model may be represented as a first snapshot of the graphical model at the first time period.
- the processor may identify nodes and edges in the graphical model at a second time period that is later than the first time period.
- the graphical model may be represented as a second snapshot of the graphical model at the second time period.
- the processor may receive and/or identify the following information: a first graph snapshot Gt for the first time period t, its set of nodes or vertices Vt and edges Et, and a second graph snapshot Gt+1 for the second time period t+1, its set of nodes or vertices Vt+1 and edges Et+1.
- a change or evolution in the graphical model is examined by the processor.
- the processor may use different techniques to identify nodes and edges in the graphical model at the different time periods.
- the processor may determine changes between the graphical model at the first time period t and the second time period t+1.
- the changes between the graphical model at the first time period and the second time period include at least one of: deleted nodes, inserted nodes, nodes with changed edges, and nodes whose previously unknown values are revealed.
- other changes in the graphical model may also be identified.
- the processor may use various techniques to determine the changes in the graphical model.
- the processor may identify impacted nodes I in the graphical model at the second time period (at 240). For example, the processor may use the changes in the graphical model determined in block 230 to identify the impacted nodes I in the graph.
- each of the impacted nodes has a topology change from the first time period to the second time period.
- the topology change may include adding or deleting an edge of the node, having a brand new node, etc.
- the values of the impacted nodes may be recomputed using an inference to calculate an approximation of the marginal distribution of all nodes.
- the processor may identify independent subgraphs Ct+1 in the graphical model at the second time period.
- the processor may first identify the impacted regions (i.e., subgraphs) within the original large graph that are likely to contain nodes whose marginal distribution may change significantly based on the changes in the graph between the two different time periods.
- the processor may compute, for each independent subgraph, an approximation of the marginal distribution of all impacted nodes (at 260).
- the process for computing an approximation of the marginal distribution of all impacted nodes for each independent subgraph is described below in relation to Figure 4.
- the processor applies an inference algorithm A to each subgraph to compute an approximation of the marginal distribution of all impacted nodes in each of the subgraphs. Therefore, the processor applies the inference algorithm only to the identified impacted subgraphs and not to the entire graphical model. Considering the size of many graphical models, this selective process speeds up the generation of inference results.
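The disclosure leaves inference algorithm A abstract. As one concrete illustration only (an assumption, not the patent's prescribed method), a Gibbs sampler can approximate the marginal of each binary node in a small pairwise subgraph:

```python
import random

def gibbs_marginals(neighbors, coupling=0.8, sweeps=2000, seed=0):
    """Approximate P(node = 1) for binary nodes in a pairwise model where
    each edge prefers equal neighbor values with strength `coupling`."""
    rng = random.Random(seed)
    nodes = list(neighbors)
    state = {n: rng.choice((0, 1)) for n in nodes}
    counts = {n: 0 for n in nodes}
    for _ in range(sweeps):
        for n in nodes:
            # Unnormalized weights for n = 0 and n = 1 given its neighbors.
            w = [1.0, 1.0]
            for v in (0, 1):
                for nb in neighbors[n]:
                    w[v] *= coupling if state[nb] == v else (1 - coupling)
            state[n] = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
            counts[n] += state[n]
    return {n: counts[n] / sweeps for n in nodes}

marginals = gibbs_marginals({"x": {"y"}, "y": {"x"}})
```

For two symmetrically coupled nodes with no evidence, the exact marginals are 0.5, so the sampled estimates should land near that value. Running such a sampler only on small impacted subgraphs, rather than the full graph, is the source of the speedup described above.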
- the processor may compute an approximation A(Gt+1) of the marginal distribution of all nodes in the graphical model at the second time period.
- the processor returns the approximated marginal distribution A(Gt+1) for the entire graphical model at the second period of time t+1 in order to maintain the inference results on the evolving graphical model at an expedited pace.
- the processor may merge the calculated approximation of the marginal distribution for all subgraphs A(I) with the unchanged nodes at time t+1 in the graphical model.
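This merge step might look like the following sketch. The cached-marginals dict, the per-subgraph result dicts, and the function name are all assumptions chosen for illustration:

```python
# Combine freshly computed subgraph marginals with the cached marginals
# of nodes unchanged since time t, dropping any deleted nodes.
def merge_marginals(cached_t, subgraph_results, deleted_nodes=()):
    merged = {n: p for n, p in cached_t.items() if n not in deleted_nodes}
    for result in subgraph_results:  # one dict per independent subgraph
        merged.update(result)        # impacted nodes overwrite the cache
    return merged

cached = {"a": 0.30, "b": 0.70, "c": 0.50}
merged = merge_marginals(cached, [{"a": 0.45, "d": 0.60}],
                         deleted_nodes={"c"})
# "a" is refreshed, "b" keeps its cached value, "c" is dropped, "d" is new.
```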
- a method 300 for identifying independent subgraphs in the graphical model at the second time period is described in more detail in relation to Figure 3.
- the method 300 can be executed by at least one processor of a computing device (e.g., processor 102 of device 100). In other examples, the method may be executed by another processor in communication with the system 10.
- the trained model may be generated by using a training dataset.
- the method 300 begins at 310, where the processor may identify evidence nodes E that have changed between the first time period and the second time period.
- the term "evidence node" refers to a node in the graphical model that has a known value and, therefore, there is no randomness in that respect. Thus, its distribution is unaffected by the value of any other nodes, and changes in its neighbors will not dramatically change the distribution of the node.
- the processor may identify evidence nodes E that have not changed between the first time period t and the second time period t+1 in the group of all nodes Vt+1 ∩ Vt.
- Various techniques may be used to identify evidence nodes that have changed between the two time periods.
- the processor may identify high-degree nodes in the graphical model.
- a high-degree node is a node that is connected to a very large number of other nodes such that a change in its neighbor nodes will not dramatically change its distribution.
- the graphical models include power-law graphs. In such graphs, some high-degree nodes have thousands of neighbors.
- the processor may use a predetermined degree threshold d to identify the high-degree nodes in the group of all nodes.
- High-degree nodes also block the changes or dependencies to other neighbors.
- high-degree nodes and the evidence nodes may block the dependency among nodes in the graph. Any path that contains either an evidence node or a high-degree node may be identified as an inactive path.
- the processor may identify the union H of high-degree nodes and evidence nodes.
- the processor may identify active nodes in the graphical model.
- an “active node” refers to a node that has at least one active edge or has no inactive edges.
- the processor may identify active nodes in all nodes at the second time period.
- two nodes X and Y are conditionally independent given Z if knowledge about X gives no extra information about Y once there is knowledge of Z. In other words, once there is knowledge of Z, X adds nothing to the knowledge about Y. Thus, a path between two nodes is active if it carries information or dependence. Two nodes X and Y might be connected by many paths in a graph, where all, some, or none of the paths are active. Therefore, X and Y are d-separated if all the paths that connect them are inactive, or, equivalently, if no path between them is active.
- the processor may remove all edges that are not connected to active nodes.
- the processor may find all active paths, and that information may be used to identify the active nodes and to break the graph into several smaller independent subgraphs if there is no active path connecting them.
- a first subgraph and a second subgraph of the graphical model are independent when there are no active paths connecting the subgraphs, and where changes in the first independent subgraph do not affect the marginal distribution of any nodes in the second independent subgraph.
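The blocking-and-splitting logic of method 300 can be sketched as follows, under assumed data structures (dict-of-sets adjacency; function and variable names are hypothetical). Edges incident to evidence or high-degree nodes are dropped because those nodes block dependence, and the connected components of what remains are the independent subgraphs:

```python
# Identify independent subgraphs: remove blocking nodes (evidence nodes
# and nodes at or above a degree threshold), then take connected
# components of the remaining active graph via depth-first traversal.
def independent_subgraphs(neighbors, evidence, degree_threshold):
    blocking = set(evidence) | {n for n, nbs in neighbors.items()
                                if len(nbs) >= degree_threshold}
    active = {n: {nb for nb in nbs if nb not in blocking}
              for n, nbs in neighbors.items() if n not in blocking}
    seen, components = set(), []
    for start in active:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(active[n] - comp)
        seen |= comp
        components.append(comp)
    return components

g = {"a": {"b"}, "b": {"a", "e"}, "e": {"b", "c"},
     "c": {"e", "d"}, "d": {"c"}}
comps = independent_subgraphs(g, evidence={"e"}, degree_threshold=10)
# the evidence node "e" blocks the chain, leaving {"a","b"} and {"c","d"}
```

Changes inside one component cannot reach the other through an active path, so inference can run on each component separately.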
- Figure 4 illustrates a flowchart showing an example of a method 400 for computing an approximation of the marginal distribution of all impacted nodes for each independent subgraph.
- the method 400 can be executed by at least one processor of a computing device (e.g., processor 102 of device 100). In other examples, the method may be executed by another processor in communication with the system 10.
- the method 400 begins at 410 where the processor may identify each impacted node in the subgraph. Then, at 420, the processor may identify all k-neighborhood nodes for each impacted node in the subgraph.
- the term "k-neighborhood nodes" refers to nodes that are reachable from a node and are within a predetermined distance k. The size of the k-neighborhood may vary and may be fixed or flexible.
- the processor may use the length of a path between two nodes (i.e., the number of other nodes it needs to go through to get from one node to the other) to identify all k-neighborhood nodes for each impacted node in the subgraph. For example, even if nodes X and Y are connected via some active paths, if all these active paths are very long, the dependence between X and Y is very weak. Therefore, changes in X's distribution only impact Y's distribution slightly. The processor, therefore, identifies the k-neighborhood of nodes to be considered for a subgraph. For instance, a threshold (e.g., Z number of nodes to get from node X to node Y) that would identify long paths may be used.
- the processor may add the identified k-neighborhood nodes for each impacted node to the set of impacted nodes.
- the processor "grows" the set of impacted nodes in the subgraph by adding non-impacted nodes to identify the right size subgraph to which to apply an inference algorithm.
- the right size subgraph is a subgraph where changes in the nodes would strongly impact the marginal distribution of adjacent nodes.
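The growing step of blocks 410-430 amounts to a breadth-first expansion out to hop distance k. The sketch below assumes a dict-of-sets graph and hypothetical names:

```python
# Grow the set of impacted nodes by adding every node within k hops of
# an impacted node, bounding how far weak dependence is propagated.
from collections import deque

def grow_impacted(neighbors, impacted, k):
    grown = set(impacted)
    for source in impacted:
        frontier = deque([(source, 0)])
        visited = {source}
        while frontier:
            node, dist = frontier.popleft()
            if dist == k:
                continue  # paths longer than k carry only weak dependence
            for nb in neighbors[node]:
                if nb not in visited:
                    visited.add(nb)
                    grown.add(nb)
                    frontier.append((nb, dist + 1))
    return grown

chain = {"a": {"b"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"c"}}
# With k = 1, only "b" joins the impacted node "a"; "c" is two hops away.
```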
- the processor may apply an inference algorithm A to each independent subgraph.
- the inference algorithm is applied to the entire independent subgraph - including impacted and non-impacted nodes. That way, the processor may receive inference results for all nodes within the independent subgraph.
- Figure 5 illustrates a computer 501 and a non-transitory machine- readable medium 505 according to an example.
- the computer 501 may be similar to the computing device 100 of the system 10 or may include a plurality of computers.
- the computer may be a server computer, a workstation computer, a desktop computer, a laptop, a mobile device, or the like, and may be part of a distributed system.
- the computer may include one or more processors and one or more machine-readable storage media.
- the computer may include a user interface (e.g., touch interface, mouse, keyboard, gesture input device, etc.).
- Computer 501 may perform methods 200-400 and variations thereof. Additionally, the functionality implemented by computer 501 may be part of a larger software platform, system, application, or the like. Computer 501 may be connected to a database (not shown) via a network.
- the network may be any type of communications network, including, but not limited to, wire-based networks (e.g., cable), wireless networks (e.g., cellular, satellite), cellular telecommunications network(s), and IP-based telecommunications network(s) (e.g., Voice over Internet Protocol networks).
- the network may also include traditional landline or a public switched telephone network (PSTN), or combinations of the foregoing.
- the computer 501 may include a processor 503 and non-transitory machine-readable storage medium 505.
- the processor 503 may be, for example, a central processing unit, a group of distributed processors, a microprocessor, a microcontroller, an application-specific integrated circuit (ASIC), a graphics processor, a multiprocessor, a virtual processor, a cloud processing system, or another suitable controller or programmable device.
- the storage medium 505 may include any suitable type, number, and configuration of volatile or non-volatile machine-readable storage media to store instructions and data.
- machine-readable storage media include read-only memory ("ROM"), random access memory ("RAM") (e.g., dynamic RAM ("DRAM"), synchronous DRAM ("SDRAM"), etc.), electrically erasable programmable read-only memory ("EEPROM"), magnetoresistive random access memory (MRAM), memristor, flash memory, SD card, floppy disk, compact disc read-only memory (CD-ROM), digital video disc read-only memory (DVD-ROM), and other suitable magnetic, optical, physical, or electronic memory on which software may be stored.
- Software stored on the non-transitory machine-readable storage media 505 and executed by the processor 503 includes, for example, firmware, applications, program data, filters, rules, program modules, and other executable instructions.
- the processor 503 retrieves from the machine-readable storage media 505 and executes, among other things, instructions related to the control processes and methods described herein.
- the processor 503 may fetch, decode, and execute instructions 507-515, among others, to implement various processing.
- processor 503 may include at least one integrated circuit (IC), other control logic, other electronic circuits, or combinations thereof that include a number of electronic components for performing the functionality of instructions 507-515. Accordingly, processor 503 may be implemented across multiple processing units and instructions 507-515 may be implemented by different processing units in different areas of computer 501.
- the instructions 507-515, when executed by processor 503, can cause processor 503 to perform processes, for example, methods 200-400, and/or variations and portions thereof. In other examples, the execution of these and other methods may be distributed between the processor 503 and other processors in communication with the processor 503.
- data identification instructions 507 may cause processor 503 to identify nodes and edges in a graphical model at a first time period and at a second time period that is later than the first time period.
- the graphical model may be represented as a first snapshot of the graphical model at the first time period and a second snapshot of the graphical model at the second time period.
- Analysis instructions 509 may cause the processor 503 to determine changes between the graphical model at the first time period and the second time period, and to identify impacted nodes in each of the independent subgraphs in the graphical model at the second time period.
- Each of the impacted nodes may have a topology change from the first time period to the second time period.
- each of the impacted nodes has a value in the second time period that is different from the value of the node in the first time period.
- These instructions may function similarly to the techniques described in blocks 230-240 of method 200.
- the changes between the graphical model at the first time period and the second time period include at least one of: deleted nodes, inserted nodes, nodes with changed edges, and nodes whose previously unknown values are revealed. In other examples, other types of changes may be detected.
- Subgraphs instructions 511 may cause the processor 503 to identify independent subgraphs in the graphical model at the second time period. These instructions may function similarly to the techniques described in block 250 of method 200 and to the techniques described in method 300. For example, the subgraphs instructions 511 may cause the processor 503 to: identify evidence nodes that have changed between the first time period and the second time period, identify high-degree nodes, identify active nodes, remove all edges that are not connected to active nodes, and determine a plurality of independent subgraphs.
- Inference instructions 513 may cause the processor 503 to compute an approximation of a marginal distribution of all impacted nodes in each of the independent subgraphs. These instructions may function similarly to the techniques described in block 260 of method 200 and to the techniques described in method 400. Thus, inference instructions 513 may cause the processor 503 to apply an inference algorithm to each independent subgraph that was affected by the change in the graphical model.
- Graphical model instructions 515 may cause the processor 503 to compute an approximation of a marginal distribution of all nodes in the graphical model at the second time period. These instructions may function similarly to the techniques described in block 270 of method 200. Thus, graphical model instructions 515 may cause the processor 503 to maintain the inference results of an evolving graphical model at a very fast pace.
Abstract
An example method is provided in accordance with one implementation of the present disclosure. The method comprises identifying nodes and edges in a graphical model at a first time period, identifying nodes and edges in the graphical model at a second time period that is later than the first time period, determining changes between the graphical model at the first time period and the second time period, and identifying impacted nodes in the graphical model at the second time period. The method further comprises identifying independent subgraphs in the graphical model at the second time period, computing, for each independent subgraph, an approximation of a marginal distribution of all impacted nodes, and computing an approximation of a marginal distribution of all nodes in the graphical model at the second time period.
Description
MARGINAL DISTRIBUTION IN A GRAPHICAL MODEL
BACKGROUND
[0001] Graphical models are graphs which compactly represent the joint distributions of many random variables. Graphical models have been widely used in various applications, such as malware detection, detecting malicious domains, topic modeling, information extraction, and many others.
BRIEF DESCRIPTION OF THE DRAWINGS
[0002] Figure 1 is a schematic illustration of an example system for maintaining inference results on evolving graphical models in accordance with an implementation of the present disclosure.
[0003] Figure 2 illustrates a flowchart showing an example of a method for computing an approximation of a marginal distribution of all nodes in a graphical model in accordance with an implementation of the present disclosure.
[0004] Figure 3 illustrates a flowchart showing an example of a method for identifying independent subgraphs in the graphical model in accordance with an implementation of the present disclosure.
[0005] Figure 4 illustrates a flowchart showing an example of a method for computing an approximation of the marginal distribution of all impacted nodes for each independent subgraph in accordance with an implementation of the present disclosure.
[0006] Figure 5 is an example block diagram illustrating a computer-readable medium in accordance with an implementation of the present disclosure.
DETAILED DESCRIPTION OF SPECIFIC EXAMPLES
[0007] As mentioned above, graphical models are widely used in many different applications. Graphical models may include a plurality of nodes or vertices connected by edges. In many of these applications, graphical models are constantly evolving. For example, in malware detection, the graphical model may be a bipartite graph where each node is a random binary variable which represents if a machine or a file is infected. There may be an edge between a file node and a machine node if the file is found on that machine. Such a graphical model changes
when new files are installed on the machines, files are deleted, and machines are added to or deleted from the system.
[0008] Such evolving scenarios as well as many non-evolving cases may generally create a graph inference problem: given a graphical model that encodes a joint probability distribution, it may be difficult and time consuming to determine the marginal probability distribution of a particular node in the graph conditioned on the known value of some other nodes in the graph. As used herein, the term graph inference result refers to the marginal distribution of each node in the graph.
[0009] When the graphical model evolves, to keep the inference results (e.g., the marginal distribution of each random node in the graph) up to date, one approach is to run graphical inference algorithms (e.g., belief propagation, Gibbs sampling, etc.) from scratch on the entire graph. However, on large-scale graphs (i.e., graphs containing millions of vertices), such an approach can often take hours to finish. For some areas (e.g., security applications), which need real-time or near real-time response, rerunning-from-scratch is not a viable approach.
[0010] In this regard, according to examples, techniques for computing or maintaining inference results on evolving graphical models are described herein. In one example, changes in the graphical model may be identified between snapshots of the model in two different time periods. The proposed techniques efficiently identify relatively small subgraphs in the changed large graphical models. Then, the techniques run an inference algorithm on these subgraphs to compute an approximation of the marginal distribution of all impacted nodes in the subgraphs as fast as possible. Thus, the techniques described herein perform graph inference incrementally (e.g., on a small neighborhood of impacted nodes), and not on the entire graphical model. For example, if a change happens to a first node far away from a second node, the changes for the second node may be approximately calculated because the changed first node would not have much effect on the second node. Therefore, when the changes between a graphical model in two different times are small, the proposed techniques offer a faster alternative to re-running an inference algorithm on the entire graph from scratch.
[0011] In some examples, when there are small changes in a graphical model, the inference results may not change dramatically for most of the nodes.
Therefore, the techniques described herein propose to first identify impacted subgraphs of the original graph that are likely to contain nodes whose marginal distribution may change significantly based on the changes in the graph between the two different time periods. Then, the techniques propose to only run the inference algorithm on those subgraphs before combining data for the subgraphs with the data for the rest of the graph.
[0012] In one example, a processor may identify nodes and edges in a graphical model at a first time period and at a second time period that is later than the first time period. The processor may further: determine changes between the graphical model at the first time period and the second time period, identify impacted nodes in the graphical model at the second time period, and identify independent subgraphs in the graphical model at the second time period. Finally, the processor may compute an approximation of a marginal distribution of all impacted nodes for each independent subgraph, and may compute an approximation of a marginal distribution of all nodes in the graphical model at the second time period.
[0013] Thus, the proposed techniques may consider any type of change in a graphical model (e.g., deleted nodes, inserted nodes, nodes with changed edges, and nodes whose previously unknown values are revealed, etc.) and may apply any inference algorithm to the determined subgraphs. Further, the described techniques may efficiently compute inference results on evolving power-law graphs that include high-degree nodes.
[0014] In the following detailed description, reference is made to the accompanying drawings, which form a part hereof, and in which is shown by way of illustration specific examples in which the disclosed subject matter may be practiced. It is to be understood that other examples may be utilized and structural or logical changes may be made without departing from the scope of the present disclosure. The following detailed description, therefore, is not to be taken in a limiting sense, and the scope of the present disclosure is defined by the appended claims. Also, it is to be understood that the phraseology and terminology used herein is for the purpose of description and should not be regarded as limiting. The use of "including," "comprising" or "having" and variations thereof herein is meant to encompass the items listed thereafter and equivalents thereof as well as
additional items. Furthermore, the term "based on," as used herein, means "based at least in part on." It should also be noted that a plurality of hardware and software based devices, as well as a plurality of different structural components may be used to implement the disclosed methods and devices.
[0015] Referring now to the figures, Figure 1 is a schematic illustration of an example system 10 for maintaining inference results on evolving graphical models. As noted above, the term "inference result" refers to an approximation of a marginal distribution of all nodes in the graphical model. The illustrated system 10 is capable of carrying out the techniques described below. As shown in Figure 1, the system 10 is depicted as including at least one computing device 100. In the embodiment of Figure 1, computing device 100 includes a processor 102, an interface 106, and a machine-readable storage medium 110. Although only computing device 100 is described in detail below, the techniques described herein may be performed with several computing devices or by engines distributed on different devices.
[0016] The computing device 100 may be any type of a computing device and may include at least engines 120-160. Engines 120-160 may or may not be part of the machine-readable storage medium 110. In one example, the computing device 100 may be an independent computing device. In another alternative example, engines 120-160 may be distributed between the computing device 100 and other computing devices. The computing device 100 may include additional components, and some of the components depicted therein may be removed and/or modified without departing from a scope of the system that allows for carrying out the functionality described herein. It is to be understood that the operations described as being performed by the engines 120-160 of the computing device 100 that are related to this description may, in some implementations, be performed by external engines (not shown) or distributed between the engines of the computing device 100 and other electronic/computing devices.
[0017] Processor 102 may be central processing unit(s) (CPUs), microprocessor(s), and/or other hardware device(s) suitable for retrieval and execution of instructions (not shown) stored in machine-readable storage medium 110. Processor 102 may fetch, decode, and execute instructions to identify different groups in a dataset. As an alternative or in addition to retrieving and
executing instructions, processor 102 may include electronic circuits comprising a number of electronic components for performing the functionality of instructions.
[0018] Interface 106 may include a number of electronic components for communicating with various devices. For example, interface 106 may be an Ethernet interface, a Universal Serial Bus (USB) interface, an IEEE 1394 (Firewire) interface, an external Serial Advanced Technology Attachment (eSATA) interface, or any other physical connection interface suitable for communication with the computing device. Alternatively, interface 106 may be a wireless interface, such as a wireless local area network (WLAN) interface or a near-field communication (NFC) interface that is used to connect with other devices/systems and/or to a network. In one example, the network may be a mesh sensor network (not shown). The network may include any suitable type or configuration of network to allow for communication between the computing device 100, and any other devices/systems (e.g., other computing devices, displays, etc.), for example, to send and receive data to and from a corresponding interface of another device.
[0019] Each of the engines 120-160 may include, for example, at least one hardware device including electronic circuitry for implementing the functionality described below, such as control logic and/or memory. In addition or as an alternative, the engines 120-160 may be implemented as any combination of hardware and software to implement the functionalities of the engines. For example, the hardware may be a processor and the software may be a series of instructions or microcode encoded on a machine-readable storage medium and executable by the processor. Therefore, as used herein, an engine may include program code (e.g., computer-executable instructions), hardware, firmware, and/or logic, or a combination thereof to perform particular actions, tasks, and functions described in more detail herein in reference to Figures 2-5.
[0020] In one example, the identification engine 120 may identify nodes and edges in a graphical model at a first time period and at a second time period that is later than the first time period. In some implementations, the graphical model may be represented as a first snapshot of the graphical model at the first time period and a second snapshot of the graphical model at the second time period.
[0021] The analysis engine 130 may determine changes between the graphical model at the first time period and the second time period, and may
identify impacted nodes in the graphical model at the second time period. In one example, the changes between the graphical model at the first time period and the second time period include at least one of: deleted nodes, inserted nodes, nodes with changed edges, and nodes whose previously unknown values are revealed. It may also include changes to factors on the edges and cliques (these factors encode the joint probability distribution of the nodes in the graphical model). In other examples, other types of changes may be detected. In addition, each of the impacted nodes may have a topology change from the first time period to the second time period. The topology change may include adding or deleting an edge of the node, having a brand new node, etc.
[0022] The subgraphs engine 140 may identify independent subgraphs in the graphical model at the second time period. In one example, the subgraphs engine 140 may: identify evidence nodes that have changed between the first time period and the second time period, identify high-degree nodes, identify active nodes, remove all edges that are not connected to active nodes, and determine a plurality of independent subgraphs. As used herein, the term "active node" refers to a node that has at least one active edge or has no inactive edges. As used herein, the term "evidence node" refers to a node in the graphical model that has a known value. Further, as used herein, the term "high-degree node" refers to a node that is connected to a very large number of other nodes and the factors on the edges are such that a change in its neighbor nodes will not dramatically change its distribution.
[0023] The inference engine 150 may compute an approximation of a marginal distribution of all impacted nodes in each of the independent subgraphs. Thus, the inference engine 150 may apply an inference algorithm to each independent subgraph that was affected by the change in the graphical model.
[0024] The graphical model engine 160 may compute an approximation of a marginal distribution of all nodes in the graphical model at the second time period. Thus, the graphical model engine 160 may maintain the inference results of an evolving graphical model at a much faster rate.
[0025] Figure 2 illustrates a flowchart showing an example of a method 200 for computing an approximation of a marginal distribution of all nodes in a graphical model (i.e., for maintaining inference results on evolving graphical models).
Although execution of the method 200 is described below with reference to the system 10, the components for executing the method 200 may be spread among multiple devices/systems. The method 200 may be implemented in the form of executable instructions stored on a machine-readable storage medium, and/or in the form of electronic circuitry.
[0026] In one example, the method 200 can be executed by at least one processor of a computing device (e.g., processor 102 of device 100). In other examples, the method may be executed by another processor in communication with the system 10. Various elements or blocks described herein with respect to the method 200 are capable of being executed simultaneously, in parallel, or in an order that differs from the illustrated serial manner of execution. The method 200 is also capable of being executed using additional or fewer elements than are shown in the illustrated examples.
[0027] The method 200 begins at 210, where at least one processor may identify nodes and edges in a graphical model at a first time period. For example, the graphical model may be represented as a first snapshot of the graphical model at the first time period. At 220, the processor may identify nodes and edges in the graphical model at a second time period that is later than the first time period. The graphical model may be represented as a second snapshot of the graphical model at the second time period. In one example, the processor may receive and/or identify the following information: a first graph snapshot Gt for the first time period t, with its set of nodes or vertices Vt and edges Et; and a second graph snapshot Gt+1 for the second time period t+1, with its set of nodes or vertices Vt+1 and edges Et+1. Thus, a change or evolution in the graphical model is examined by the processor. The processor may use different techniques to identify nodes and edges in the graphical model at the different time periods.
[0028] At 230, the processor may determine changes between the graphical model at the first time period t and the second time period t+1. For example, the changes between the graphical model at the first time period and the second time period include at least one of: deleted nodes, inserted nodes, nodes with changed edges, and nodes whose previously unknown values are revealed. In other implementations, other changes in the graphical model may also be identified. The
processor may use various techniques to determine the changes in the graphical model.
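The snapshot comparison in blocks 210-230 can be sketched as follows. This is an illustrative sketch only: the set-based snapshot representation, the function name, and the evidence dictionaries are assumptions, not taken from the disclosure.

```python
# Hypothetical sketch: detecting changes between snapshots Gt and Gt+1.
# Snapshots are represented as node sets, edge sets (frozensets of endpoints),
# and dicts of known "evidence" values -- all assumed representations.

def diff_snapshots(nodes_t, edges_t, nodes_t1, edges_t1, evidence_t, evidence_t1):
    """Return the node-level changes between snapshots Gt and Gt+1."""
    deleted = nodes_t - nodes_t1
    inserted = nodes_t1 - nodes_t

    def incident(edges, v):
        # Edges touching node v in the given snapshot.
        return {e for e in edges if v in e}

    # Surviving nodes whose incident edges differ between snapshots.
    changed_edges = {v for v in nodes_t & nodes_t1
                     if incident(edges_t, v) != incident(edges_t1, v)}
    # Nodes whose previously unknown values are revealed at time t+1.
    revealed = {v for v in nodes_t & nodes_t1
                if v not in evidence_t and v in evidence_t1}
    return deleted, inserted, changed_edges, revealed
```

For example, deleting node "c", inserting node "d" with an edge to "a", and revealing the value of "b" would be reported as exactly those four change sets.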
[0029] Next, the processor may identify impacted nodes I in the graphical model at the second time period (at 240). For example, the processor may use the changes in the graphical model determined in block 230 to identify the impacted nodes I in the graph. In one implementation, each of the impacted nodes has a topology change from the first time period to the second time period. The topology change may include adding or deleting an edge of the node, having a brand new node, etc. As noted below, the values of the impacted nodes may be recomputed using an inference algorithm to calculate an approximation of the marginal distribution of all nodes.
[0030] At 250, the processor may identify independent subgraphs in the graphical model at the second time period. When the changes in the graphical model between the first and the second time periods are small, the inference results may not change dramatically for most of the nodes. Thus, the processor may first identify the impacted regions (i.e., subgraphs) within the original large graph that are likely to contain nodes whose marginal distribution may change significantly based on the changes in the graph between the two different time periods.
[0031] With continued reference to Figure 2, the processor may compute, for each independent subgraph, an approximation of the marginal distribution of all impacted nodes (at 260). The process for computing an approximation of the marginal distribution of all impacted nodes for each independent subgraph is described below in relation to Figure 4. As described in additional detail below, the processor applies an inference algorithm A to each subgraph to compute an approximation of the marginal distribution of all impacted nodes in each of the subgraphs. Therefore, the processor applies the inference algorithm only to the identified impacted subgraphs and not to the entire graphical model. Considering the size of many graphical models, this selective process speeds up the generation of inference results.
[0032] At 270, the processor may compute an approximation A(Gt+1) of the marginal distribution of all nodes in the graphical model at the second time period. In other words, the processor returns the approximated marginal distribution A(Gt+1) for the entire graphical model at the second time period t+1 in order to maintain the inference results on the evolving graphical model at an expedited pace. In one example, the processor may merge the calculated approximation of the marginal distribution for all subgraphs A(I) with the unchanged nodes at time t+1 in the graphical model.
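The merge in block 270 might look like the following sketch, assuming marginals are kept in dictionaries keyed by node. The caching scheme and names are illustrative assumptions, not part of the disclosure.

```python
# Hypothetical sketch of block 270: combine freshly computed marginals for the
# impacted subgraphs A(I) with cached marginals for the untouched nodes,
# yielding an approximation A(Gt+1) over every node at time t+1.

def merge_marginals(cached_marginals, subgraph_results, nodes_t1):
    """Return approximate marginals for every node at time t+1."""
    merged = {}
    for node in nodes_t1:
        # Start from the result computed at time t, if the node existed then.
        merged[node] = cached_marginals.get(node)
    for result in subgraph_results:
        # One marginal dict per impacted subgraph; overwrite impacted nodes.
        merged.update(result)
    return merged
```

A node absent from every subgraph result keeps its cached marginal, so only the impacted regions pay any recomputation cost.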
[0033] A method 300 for identifying independent subgraphs in the graphical model at the second time period is described in more detail in relation to Figure 3. The method 300 can be executed by at least one processor of a computing device (e.g., processor 102 of device 100). In other examples, the method may be executed by another processor in communication with the system 10.
[0034] The method 300 begins at 310, where the processor may identify evidence nodes E that have changed between the first time period and the second time period. As noted above, "evidence node" refers to a node in the graphical model that has a known value and, therefore, there is no randomness in that respect. Thus, their distribution is unaffected by the value of any other nodes. For an evidence node, changes in its few neighbors will not dramatically change the distribution of the node. In other words, the processor may identify evidence nodes E that have changed between the first time period t and the second time period t+1 in the group of all nodes Vt+1 ∩ Vt. Various techniques may be used to identify evidence nodes that have changed between the two time periods.
[0035] At 320, the processor may identify high-degree nodes in the graphical model. A high-degree node is a node that is connected to a very large number of other nodes such that a change in its neighbor nodes will not dramatically change its distribution. In some examples, the graphical models include power-law graphs. In such graphs, some high-degree nodes have thousands of neighbors. In one implementation, the processor may use a predetermined degree threshold d to identify the high-degree nodes in the group of all nodes.
Changes in high-degree nodes will not dramatically change the distribution of these high-degree nodes. Further, high-degree nodes also block the changes or dependencies to other neighbors. Thus, high-degree nodes and the evidence
nodes may block the dependency among nodes in the graph. Any path that contains either an evidence node or a high-degree node may be identified as an inactive path. Thus, the processor may identify the union H of the high-degree nodes and the evidence nodes.
[0036] At 330, the processor may identify active nodes in the graphical
model at the second time period. As noted above, an "active node" refers to a node that has at least one active edge or has no inactive edges. In other words, the processor may identify active nodes among all nodes at the second time period to identify active paths between the nodes by using dependent separation. For example, two nodes X and Y are conditionally independent given Z if knowledge about X gives no extra information about Y once there is knowledge of Z. In other words, once there is knowledge of Z, X adds nothing to the knowledge about Y. Thus, a path between two nodes is active if it carries information or dependence. Two nodes X and Y might be connected by many paths in a graph, where all, some, or none of the paths are active. Therefore, X and Y are dependent-separated if all the paths that connect them are inactive, or, equivalently, if no path between them is active.
[0037] At 340, the processor may remove all edges that are not connected to active nodes and may determine a plurality of independent subgraphs in the graphical model G'. Thus, given the evidence in the graphical model (e.g., those nodes whose assignments are known, the dependent separation between the nodes, etc.), the processor may find all active paths, and that information may be used to identify the active nodes and to break the graph into several smaller independent subgraphs if there is no active path connecting them. In other words, a first subgraph and a second subgraph of the graphical model are independent when there are no active paths connecting the subgraphs, and changes in the first independent subgraph do not affect the marginal distribution of any nodes in the second independent subgraph.
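One way blocks 310-340 of method 300 might be realized is to treat evidence nodes and high-degree nodes as dependency blockers, drop the edges they touch, and take the connected components of what remains. The data layout, function name, and degree threshold below are assumptions for illustration, not the disclosed implementation.

```python
from collections import defaultdict, deque

# Hypothetical sketch of blocks 310-340: evidence and high-degree nodes block
# dependency, so edges incident to them are inactive; the connected components
# of the remaining active graph are the independent subgraphs.

def independent_subgraphs(nodes, edges, evidence, degree_threshold):
    """Split the graph into independent subgraphs after removing inactive edges."""
    degree = defaultdict(int)
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    high_degree = {v for v in nodes if degree[v] >= degree_threshold}
    blocking = set(evidence) | high_degree      # dependency-blocking nodes

    adjacency = defaultdict(set)
    for u, v in edges:
        if u not in blocking and v not in blocking:   # active edge
            adjacency[u].add(v)
            adjacency[v].add(u)

    # Connected components of the remaining active graph (breadth-first search).
    seen, components = set(), []
    for start in nodes - blocking:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            node = queue.popleft()
            component.add(node)
            for nbr in adjacency[node] - seen:
                seen.add(nbr)
                queue.append(nbr)
        components.append(component)
    return components
```

On a chain a-b-c-d-e where "c" is an evidence node, the blocked edges split the graph into the two independent subgraphs {a, b} and {d, e}, so changes on one side cannot affect marginals on the other.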
[0038] Figure 4 illustrates a flowchart showing an example of a method 400 for computing an approximation of the marginal distribution of all impacted nodes
for each independent subgraph. The method 400 can be executed by at least one processor of a computing device (e.g., processor 102 of device 100). In other examples, the method may be executed by another processor in communication with the system 10.
[0039] The method 400 begins at 410, where the processor may identify each impacted node in the subgraph. Then, at 420, the processor may identify all k-neighborhood nodes for each impacted node in the subgraph. As used herein, the term "k-neighborhood nodes" refers to nodes that are reachable from a node and are within a predetermined distance k. The size of the k-neighborhood may vary and may be fixed or flexible.
[0040] In one implementation, the processor may use the length of a path between two nodes (i.e., the number of other nodes it needs to go through to get from one node to the other) to identify all k-neighborhood nodes for each impacted node in the subgraph. For example, even if nodes X and Y are connected via some active paths, if all these active paths are very long, the dependence between X and Y is very weak. Therefore, changes in X's distribution only impact Y's distribution slightly. The processor, therefore, identifies the k-neighborhood of nodes to be considered for a subgraph. For instance, a threshold (e.g., Z number of nodes to get from node X to node Y) that would identify long paths may be used.
[0041] At 430, the processor may add the identified k-neighborhood nodes for each impacted node to the set of impacted nodes. In other words, the processor "grows" the set of impacted nodes in the subgraph by adding non-impacted nodes to identify the right size subgraph to which to apply an inference algorithm. The right size subgraph is a subgraph where changes in the nodes would strongly impact the marginal distribution of adjacent nodes.
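The k-neighborhood growth of blocks 410-430 can be sketched as a depth-limited breadth-first search. The adjacency representation and the choice of k are assumptions made for this illustration.

```python
from collections import deque

# Hypothetical sketch of blocks 410-430: "grow" the set of impacted nodes by
# adding every node within distance k of an impacted node, producing the
# right-size subgraph to hand to the inference algorithm.

def grow_impacted_set(impacted, adjacency, k):
    """Return the impacted nodes plus every node within distance k of one."""
    grown = set(impacted)
    for source in impacted:
        queue = deque([(source, 0)])
        visited = {source}
        while queue:
            node, dist = queue.popleft()
            if dist == k:
                continue                 # long paths carry little dependence
            for nbr in adjacency.get(node, ()):
                if nbr not in visited:
                    visited.add(nbr)
                    grown.add(nbr)
                    queue.append((nbr, dist + 1))
    return grown
```

With k = 2 on a chain a-b-c-d, an impacted node "a" pulls in "b" and "c" but not the more distant "d", matching the intuition that weak, long-path dependence can be ignored.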
[0042] At 440, the processor may apply an inference algorithm A to each independent subgraph. In other words, the inference algorithm is applied to the entire independent subgraph - including impacted and non-impacted nodes. That way, the processor may receive inference results for all nodes within the independent subgraph.
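Block 440 then reduces to a driver that hands each full independent subgraph, impacted and non-impacted nodes alike, to whatever inference algorithm A is plugged in (belief propagation, Gibbs sampling, etc.). The callable interface below is an assumption for illustration.

```python
# Hypothetical driver for block 440: apply a pluggable inference algorithm to
# each independent subgraph. `algorithm` stands in for any routine that maps
# a subgraph to a dict of node marginals.

def infer_subgraphs(subgraphs, algorithm):
    """Run the inference algorithm on every full independent subgraph."""
    results = []
    for subgraph in subgraphs:
        # The whole subgraph is passed, so the returned marginals cover
        # every node inside the subgraph boundary, not only impacted ones.
        results.append(algorithm(subgraph))
    return results
```

Because the algorithm is a parameter, the incremental scheme works unchanged regardless of which inference routine is chosen.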
[0043] Figure 5 illustrates a computer 501 and a non-transitory machine-readable medium 505 according to an example. In one example, the computer 501 may be similar to the computing device 100 of the system 10 or may include a plurality of computers. For example, the computer may be a server computer, a workstation computer, a desktop computer, a laptop, a mobile device, or the like, and may be part of a distributed system. The computer may include one or more processors and one or more machine-readable storage media. In one example, the computer may include a user interface (e.g., touch interface, mouse, keyboard, gesture input device, etc.).
[0044] Computer 501 may perform methods 200-400 and variations thereof. Additionally, the functionality implemented by computer 501 may be part of a larger software platform, system, application, or the like. Computer 501 may be connected to a database (not shown) via a network. The network may be any type of communications network, including, but not limited to, wire-based networks (e.g., cable), wireless networks (e.g., cellular, satellite), cellular telecommunications network(s), and IP-based telecommunications network(s) (e.g., Voice over Internet Protocol networks). The network may also include traditional landline or a public switched telephone network (PSTN), or combinations of the foregoing.
[0045] The computer 501 may include a processor 503 and non-transitory machine-readable storage medium 505. The processor 503 (e.g., a central processing unit, a group of distributed processors, a microprocessor, a microcontroller, an application-specific integrated circuit (ASIC), a graphics processor, a multiprocessor, a virtual processor, a cloud processing system, or another suitable controller or programmable device) and the storage medium 505 may be operatively coupled to a bus. Processor 503 can include single or multiple cores on a chip, multiple cores across multiple chips, multiple cores across multiple devices, or combinations thereof.
[0046] The storage medium 505 may include any suitable type, number, and configuration of volatile or non-volatile machine-readable storage media to store instructions and data. Examples of machine-readable storage media include read-only memory ("ROM"), random access memory ("RAM") (e.g., dynamic RAM ["DRAM"], synchronous DRAM ["SDRAM"], etc.), electrically erasable programmable read-only memory ("EEPROM"), magnetoresistive random access memory (MRAM), memristor, flash memory, SD card, floppy disk, compact disc read only memory (CD-ROM), digital video disc read only memory (DVD-ROM),
and other suitable magnetic, optical, physical, or electronic memory on which software may be stored.
[0047] Software stored on the non-transitory machine-readable storage media 505 and executed by the processor 503 includes, for example, firmware, applications, program data, filters, rules, program modules, and other executable instructions. The processor 503 retrieves from the machine-readable storage media 505 and executes, among other things, instructions related to the control processes and methods described herein.
[0048] The processor 503 may fetch, decode, and execute instructions 507-515, among others, to implement various processing. As an alternative or in addition to retrieving and executing instructions, processor 503 may include at least one integrated circuit (IC), other control logic, other electronic circuits, or combinations thereof that include a number of electronic components for performing the functionality of instructions 507-515. Accordingly, processor 503 may be implemented across multiple processing units and instructions 507-515 may be implemented by different processing units in different areas of computer 501.
[0049] The instructions 507-515 when executed by processor 503 (e.g., via one processing element or multiple processing elements of the processor) can cause processor 503 to perform processes, for example, methods 200-400, and/or variations and portions thereof. In other examples, the execution of these and other methods may be distributed between the processor 503 and other processors in communication with the processor 503.
[0050] For example, data identification instructions 507 may cause processor 503 to identify nodes and edges in a graphical model at a first time period and at a second time period that is later than the first time period. In some implementations, the graphical model may be represented as a first snapshot of the graphical model at the first time period and a second snapshot of the graphical model at the second time period. These instructions may function similarly to the techniques described in blocks 210 and 220 of method 200.
[0051] Analysis instructions 509 may cause the processor 503 to determine changes between the graphical model at the first time period and the second time period, and to identify impacted nodes in each of the independent subgraphs in the
graphical model at the second time period. Each of the impacted nodes may have a topology change from the first time period to the second time period. In some examples, each of the impacted nodes has a value in the second time period that is different from the value of the node in the first time period. These instructions may function similarly to the techniques described in blocks 230-240 of method 200. In one example, the changes between the graphical model at the first time period and the second time period include at least one of: deleted nodes, inserted nodes, nodes with changed edges, and nodes whose previously unknown values are revealed. In other examples, other types of changes may be detected.
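The four change types listed above (deleted nodes, inserted nodes, nodes with changed edges, and nodes whose previously unknown values are revealed) can be computed by simple set comparisons between the two snapshots. The following is a hedged Python sketch, not the patent's implementation; `diff_snapshots` and its arguments are invented names.

```python
def diff_snapshots(old_values, old_edges, new_values, new_edges):
    """Classify changes between two snapshots into the four types above.

    Values are dicts node -> value (None = unknown); edges are sets of
    frozensets. Illustrative sketch only.
    """
    old_nodes, new_nodes = set(old_values), set(new_values)
    deleted = old_nodes - new_nodes
    inserted = new_nodes - old_nodes

    def incident(node, edges):
        # the set of edges touching this node at one time period
        return {e for e in edges if node in e}

    common = old_nodes & new_nodes
    changed_edges = {n for n in common
                     if incident(n, old_edges) != incident(n, new_edges)}
    revealed = {n for n in common
                if old_values[n] is None and new_values[n] is not None}
    return deleted, inserted, changed_edges, revealed
```

The union of these four sets is one plausible starting point for the "impacted nodes" that the analysis instructions identify.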
[0052] Subgraphs instructions 511 may cause the processor 503 to identify independent subgraphs in the graphical model at the second time period. These instructions may function similarly to the techniques described in block 250 of method 200 and to the techniques described in method 300. For example, the subgraphs instructions 511 may cause the processor 503 to: identify evidence nodes that have changed between the first time period and the second time period, identify high-degree nodes, identify active nodes, remove all edges that are not connected to active nodes, and determine a plurality of independent subgraphs.
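The last two steps above (removing edges not connected to active nodes, then determining independent subgraphs) amount to a connected-components computation over the surviving edges. The sketch below assumes the set of active nodes has already been determined by the earlier steps; the function name and graph representation are hypothetical, and method 300's actual criteria for "active" are not reproduced here.

```python
from collections import defaultdict

def independent_subgraphs(nodes, edges, active_nodes):
    """Keep only edges touching an active node, then split the graph into
    connected components -- one candidate independent subgraph each.

    Sketch under assumptions stated in the text; not the patent's code.
    """
    kept = [e for e in edges if any(n in active_nodes for n in e)]
    adj = defaultdict(set)
    for e in kept:
        u, v = tuple(e)
        adj[u].add(v)
        adj[v].add(u)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:              # iterative depth-first traversal
            n = stack.pop()
            if n in seen:
                continue
            seen.add(n)
            comp.add(n)
            stack.extend(adj[n] - seen)
        components.append(comp)
    return components
```

Nodes left with no surviving edges form singleton subgraphs, which is consistent with the independence criterion of claim 6: no active path connects them to anything else.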
[0053] Inference instructions 513 may cause the processor 503 to compute an approximation of a marginal distribution of all impacted nodes in each of the independent subgraphs. These instructions may function similarly to the techniques described in block 260 of method 200 and to the techniques described in method 400. Thus, inference instructions 513 may cause the processor 503 to apply an inference algorithm to each independent subgraph that was affected by the change in the graphical model.
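Method 400's expansion of each impacted node to its k-neighborhood (see claim 3) can be sketched as a bounded breadth-first traversal: starting from the impacted set, take k hops outward through the adjacency structure. The function below is a hypothetical illustration; the patent does not fix a particular traversal or a value of k.

```python
def k_neighborhood(impacted, adj, k):
    """Expand an impacted-node set to all nodes within k hops.

    `adj` maps node -> set of neighbors. Illustrative sketch only.
    """
    frontier, reached = set(impacted), set(impacted)
    for _ in range(k):
        # one breadth-first hop: neighbors of the frontier not yet reached
        frontier = {m for n in frontier for m in adj.get(n, ())} - reached
        reached |= frontier
    return reached
```

The returned set is what an inference algorithm (for example, loopy belief propagation) would then be run over within each independent subgraph, rather than over the entire model.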
[0054] Graphical model instructions 515 may cause the processor 503 to compute an approximation of a marginal distribution of all nodes in the graphical model at the second time period. These instructions may function similarly to the techniques described in block 270 of method 200. Thus, graphical model instructions 515 may cause the processor 503 to maintain the inference results of an evolving graphical model at a very fast pace.
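One way to read block 270 is as an incremental merge: marginals of nodes in unaffected subgraphs are carried forward from the first time period, and only the re-run subgraphs overwrite their entries. The helper below is an invented sketch of that bookkeeping, not the patent's implementation.

```python
def merge_marginals(previous_marginals, subgraph_results):
    """Combine carried-forward marginals with freshly computed per-subgraph
    marginals to obtain marginals for all nodes at the second time period.

    `previous_marginals`: node -> marginal from the first time period.
    `subgraph_results`: list of node -> marginal dicts, one per re-run subgraph.
    Illustrative sketch only.
    """
    merged = dict(previous_marginals)       # untouched nodes keep old values
    for marginals in subgraph_results:
        merged.update(marginals)            # re-run subgraphs overwrite theirs
    return merged
```

Because independent subgraphs cannot affect each other's marginals (claim 6), this overwrite-only merge is exactly what makes the incremental update cheap relative to re-running inference over the full model.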
[0055] In the foregoing description, numerous details are set forth to provide an understanding of the subject matter disclosed herein. However, implementations may be practiced without some or all of these details. Other
implementations may include modifications and variations from the details discussed above. It is intended that the appended claims cover such modifications and variations.
Claims
1. A method comprising, by at least one processor:
identifying nodes and edges in a graphical model at a first time period;
identifying nodes and edges in the graphical model at a second time period that is later than the first time period;
determining changes between the graphical model at the first time period and the second time period;
identifying impacted nodes in the graphical model at the second time period;
identifying independent subgraphs in the graphical model at the second time period;
computing, for each independent subgraph, an approximation of a marginal distribution of all impacted nodes; and
computing an approximation of a marginal distribution of all nodes in the graphical model at the second time period.
2. The method of claim 1, further comprising, by at least one processor:
identifying evidence nodes that have changed between the first time period and the second time period;
identifying high-degree nodes;
identifying active nodes;
removing all edges that are not connected to active nodes; and
determining a plurality of independent subgraphs.
3. The method of claim 1, further comprising, for each independent subgraph, by at least one processor:
identifying each impacted node in the subgraph;
identifying all k-neighborhood nodes for each impacted node in the subgraph;
adding the identified k-neighborhood nodes for each impacted node to the set of impacted nodes; and
applying an inference algorithm to each independent subgraph.
4. The method of claim 1, wherein changes between the graphical model at the first time period and the second time period include at least one of: deleted nodes, inserted nodes, nodes with changed edges, and nodes whose previously unknown values are revealed.
5. The method of claim 1, wherein each of the impacted nodes has a topology change from the first time period to the second time period.
6. The method of claim 1, wherein a first subgraph and a second subgraph of the graphical model are independent when there are no active paths connecting the subgraphs, and wherein changes in the first independent subgraph do not affect the marginal distribution of any nodes in the second independent subgraph.
7. The method of claim 1, wherein the graphical model is represented as a first snapshot of the graphical model at the first time period and a second snapshot of the graphical model at the second time period.
8. A system comprising:
an identification engine to identify nodes and edges in a graphical model at a first time period and at a second time period that is later than the first time period, wherein the graphical model is represented as a first snapshot of the graphical model at the first time period and a second snapshot of the graphical model at the second time period;
an analysis engine to:
determine changes between the graphical model at the first time period and the second time period, and
identify impacted nodes in the graphical model at the second time period;
a subgraphs engine to identify independent subgraphs in the graphical model at the second time period;
an inference engine to compute an approximation of a marginal distribution of all impacted nodes in each of the independent subgraphs; and
a graphical model engine to compute an approximation of a marginal distribution of all nodes in the graphical model at the second time period.
9. The system of claim 8, wherein for each independent subgraph the subgraphs engine is further to:
identify evidence nodes that have changed between the first time period and the second time period;
identify high-degree nodes;
identify active nodes;
remove all edges that are not connected to active nodes; and
determine a plurality of independent subgraphs.
10. The system of claim 8, wherein the inference engine is further to:
identify each impacted node in the subgraph;
identify all k-neighborhood nodes for each impacted node in the subgraph;
add the identified k-neighborhood nodes for each impacted node to the set of impacted nodes; and
apply an inference algorithm to each independent subgraph.
11. The system of claim 8, wherein changes between the graphical model at the first time period and the second time period include at least one of: deleted nodes, inserted nodes, nodes with changed edges, and nodes whose previously unknown values are revealed.
12. A non-transitory machine-readable storage medium encoded with instructions executable by at least one processor, the machine-readable storage medium comprising instructions to:
identify nodes and edges in a graphical model at a first time period;
identify nodes and edges in the graphical model at a second time period that is later than the first time period;
determine changes between the graphical model at the first time period and the second time period;
identify impacted nodes in the graphical model at the second time period, wherein each of the impacted nodes has a topology change from the first time period to the second time period;
identify independent subgraphs in the graphical model at the second time period;
compute an approximation of a marginal distribution of all impacted nodes in each of the independent subgraphs; and
compute an approximation of a marginal distribution of all nodes in the graphical model at the second time period.
13. The non-transitory machine-readable storage medium of claim 12, further comprising instructions to:
identify evidence nodes that have changed between the first time period and the second time period;
identify high-degree nodes;
identify active nodes;
remove all edges that are not connected to active nodes; and
determine a plurality of independent subgraphs.
14. The non-transitory machine-readable storage medium of claim 13, further comprising instructions to:
identify each impacted node in the subgraph;
identify all k-neighborhood nodes for each impacted node in the subgraph;
add the identified k-neighborhood nodes for each impacted node to the set of impacted nodes; and
apply an inference algorithm to each independent subgraph.
15. The non-transitory machine-readable storage medium of claim 12, wherein a first subgraph and a second subgraph of the graphical model are independent when there are no active paths connecting the subgraphs, and wherein changes in the first independent subgraph do not affect the marginal distribution of any nodes in the second independent subgraph.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/US2015/032212 WO2016190841A1 (en) | 2015-05-22 | 2015-05-22 | Marginal distribution in a graphical model |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2016190841A1 (en) | 2016-12-01 |
Family
ID=57393565
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2015/032212 Ceased WO2016190841A1 (en) | 2015-05-22 | 2015-05-22 | Marginal distribution in a graphical model |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2016190841A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070016898A1 (en) * | 2005-07-13 | 2007-01-18 | International Business Machines Corporation | Method and system for finding evolving regions in graphs without persistent node identity |
| WO2007117567A2 (en) * | 2006-04-06 | 2007-10-18 | Smobile Systems Inc. | Malware detection system and method for limited access mobile platforms |
| US20130179974A1 (en) * | 2012-01-11 | 2013-07-11 | Pratyusa Kumar Manadhata | Inferring a state of behavior through marginal probability estimation |
| US20140181819A1 (en) * | 2012-12-21 | 2014-06-26 | Microsoft Corporation | Virtualization detection |
| US9021260B1 (en) * | 2014-07-03 | 2015-04-28 | Palantir Technologies Inc. | Malware data item analysis |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| TWI706273B (en) | Uniform resource locator (URL) attack detection method, device and electronic equipment | |
| Vora et al. | Kickstarter: Fast and accurate computations on streaming graphs via trimmed approximations | |
| US8280830B2 (en) | Systems and methods for using multiple in-line heuristics to reduce false positives | |
| KR102598173B1 (en) | Graph matching for optimized deep network processing | |
| US10917415B2 (en) | Machine learning-based determination of program code characteristics | |
| US9628506B1 (en) | Systems and methods for detecting security events | |
| US10679055B2 (en) | Anomaly detection using non-target clustering | |
| CN110378413A (en) | Neural network model processing method, device and electronic device | |
| CN104951465B (en) | Application recommendation method and device | |
| US9152703B1 (en) | Systems and methods for clustering data samples | |
| CN107870845A (en) | Towards the management method and system of micro services framework applications | |
| CN111859093A (en) | Sensitive word processing method, device and readable storage medium | |
| CN113890821A (en) | Log association method and device and electronic equipment | |
| CN105790967A (en) | Weblog processing method and device | |
| US11709798B2 (en) | Hash suppression | |
| JP7063274B2 (en) | Information processing equipment, neural network design method and program | |
| US20180114132A1 (en) | Controlling remote memory accesses in a multiple processing node graph inference engine | |
| CN111860554A (en) | Risk monitoring method, device, storage medium and electronic device | |
| WO2025139642A1 (en) | Target user identification method and apparatus, and electronic device | |
| CN116150746B (en) | An attack detection method, apparatus, electronic device, and storage medium | |
| WO2016190841A1 (en) | Marginal distribution in a graphical model | |
| CN109800775B (en) | File clustering method, device, equipment and readable medium | |
| CN114139711B (en) | Distributed inference method, data processing method, device, terminal and computing equipment | |
| AU2019446476A1 (en) | Information processing device, creation method, and creation program | |
| CN113377796B (en) | Method, device and storage medium for automatically updating buried events and their fields |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 15893486; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 15893486; Country of ref document: EP; Kind code of ref document: A1 |