WO2023196917A1 - Neural network architectures for invariant object representation and classification using local hebbian rule-based updates - Google Patents
- Publication number
- WO2023196917A1 (PCT/US2023/065456)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- nodes
- representation
- layer
- input
- classification
- Legal status
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/778—Active pattern-learning, e.g. online learning of image or video features
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
Description
- This disclosure is related to improved machine learning configurations and techniques for invariant object representation and classification.
- the configurations and techniques described herein can be executed to enhance various computer vision functions including, but not limited to, functions involving object detection, object classification, and/or instance segmentation.
- Computer vision systems can be configured to perform various functions, such as those that involve object detection, object classification, and/or instance segmentation. These computer vision functions can be applied in many different contexts, such as facial recognition, medical image analysis, smart surveillance, and/or image analysis tasks.
- Computer vision systems must account for a variety of technical problems to accurately implement the aforementioned computer vision functions.
- one technical problem relates to accurately extracting features from input images. This can be particularly difficult in scenarios in which the objects (e.g., facial objects) included in the input images are partially hidden or heavily occluded, and/or degraded by noise, poor illumination, and/or uneven lighting.
- Other factors that can hinder feature extraction can be attributed to variations in camera angles, motion, perspective, poses, and object appearances (e.g., variations in facial expressions) across different images.
- Frameworks for performing feature extraction suffer from a variety of other shortcomings as well. For instance, frameworks that use blind source separation techniques fail to take into account the informativeness of features based on their relative abundance. Though a framework designed to capture informative features does not need to know the exact occurrence frequency of objects, it should take the relative abundance of features into account. However, blind source separation and other related techniques are not capable of doing so.
- Frameworks that utilize sparse non-negative matrix factorization for feature extraction also have drawbacks. Though these frameworks can successfully generate invariant and efficient representations of inputs in some scenarios, the sparse non-negative matrix factorization-based approach used in obtaining the features is not always technologically plausible or feasible in its current form. In some cases, the limitations arise because the algorithm utilized by these frameworks does not incorporate the physiological constraints faced by a biological system.
- Furthermore, in certain feature extraction approaches, capturing the most informative structures from inputs is often a different process than obtaining input representations. As such, any network that accomplishes both generally incorporates two separate structures for these two goals.
- Another drawback of existing techniques is that they do not accurately mimic processes of biological systems.
- An essential aspect of a biological system is its development. Organisms grow and develop with time, reach maturation, and eventually die. During their lives, they experience their surroundings and learn to adapt to them. From the perspective of sensory processing, this constitutes a continuous period of sensory experiences, and it allows the organisms to learn and re-learn sensory events. As a corollary, a biological system does not encounter all the events and stimuli to which it adapts at one point in time. It gradually discovers these events, determines their relevance with experience, and then conforms accordingly to represent them.
- FIG. 1A is a diagram of an exemplary system for generating image analysis in accordance with certain embodiments.
- FIG. 1B is a block diagram demonstrating exemplary features of a computer vision system in accordance with certain embodiments.
- FIG. 2 is a diagram of an exemplary neural network architecture in accordance with certain embodiments.
- FIG. 3 is a diagram illustrating how inputs in an input sequence can be captured in the representation layer for a neural network architecture in accordance with certain embodiments.
- FIG. 4 is a diagram illustrating how inputs in an input sequence that are corrupted can be learned by a neural network architecture in accordance with certain embodiments.
- FIGS. 5A-5C are diagrams illustrating how characteristics of an object can be captured in the output of the representation layer for a neural network architecture in accordance with certain embodiments.
- FIG. 6 is a diagram of an exemplary neural network architecture in accordance with certain embodiments.
- FIGS. 7A-7B are diagrams illustrating characteristics of an object that are captured in the output for a neural network architecture in accordance with certain embodiments.
- FIG. 8 is a flowchart illustrating an exemplary method for a neural network architecture in accordance with certain embodiments.
- the present disclosure relates to systems, methods, apparatuses, computer program products, and techniques for providing a neural network architecture that leverages local learning rules and a shallow, bi-layer neural network architecture to extract or generate robust, invariant object representations from objects included in images.
- the neural network architecture can be trained to generate invariant responses to image inputs corrupted in various ways.
- the learning process does not require any labeling of the training set or pre-determined outcomes, and eliminates the need for large training datasets during the learning process. Instead, the neural network architecture can generate the invariant object representations using only local learning rules, without requiring backpropagation during the learning process or resorting to reconstruction error or credit assignment.
- the enhanced object representations generated by the neural network architecture can be utilized to improve performance of various computer vision functions, for example, such as those which may involve object detection, object classification, object representation, object segmentation, or the like.
- In certain embodiments, the disclosure provides a biologically-inspired, shallow, bi-layered, redundancy-capturing artificial neural network (ANN) that learns comprehensive structures from objects in an experience-dependent manner.
- the ANN comprises nodes that can be configured to extract unique input structures and efficiently represent inputs.
- a single ANN can incorporate the functionality of both blind source separation and sparse recovery techniques.
- the ANN can include a modified Hopfield network that implements learning rules that allow redundancy capturing.
- the ANN includes biased connectivity and stochastic gradient descent-type learning to sequentially identify multiple inputs without catastrophic forgetting.
- the ANN can capture structures that uniquely identify individual objects and produce sparse, de-correlated representations that are robust against various forms of input corruption.
- the ANN can learn from various corrupted input forms to extract uncorrupted features in an unsupervised manner, separate identity and rotation information from different views of rotating 3D objects, and can produce cells tuned to different object orientations under unsupervised conditions.
- the ANN can learn to represent the initial data sets (such as training set data) very well, and it can also perform well on images that are similar, but not identical, to those included in an initial (or training) data set. In such scenarios, the ANN can adapt to the new images and represent them more sparsely and more robustly because it employs continuous learning.
- the ANN includes a first layer of input nodes that can be connected in an all-to-all configuration with a second layer of representation nodes. Inhibitory recurrent connections among the representation nodes in the second layer provide negative input values and also can be connected in an all-to-all configuration.
- the input nodes can be configured to detect patterns in an input dataset, and project these patterns to the representation nodes in the second layer.
- the sparsity of the representations from the representation nodes of the ANN is generated by the inhibitory recurrent connections between the nodes in the representation layer. These inhibitory connections differ from the connections between the second layer nodes in a traditional Hopfield network, which are excitatory recurrent connections. Establishing a connection between an input node and a representation node enables the representation node to learn information related to features that are extracted by the input node.
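As a rough illustration of the bi-layer dynamics described above, the sketch below settles a layer of representation nodes that receive excitatory feedforward input from the input layer and all-to-all inhibitory recurrent input from their peers. This is not the disclosed implementation; the update rule, parameter values, and all names are illustrative assumptions.

```python
import numpy as np

def represent(x, W, M, steps=50, dt=0.1):
    """Settle the representation layer for one input pattern.

    x : (n_in,)        non-negative input activities
    W : (n_rep, n_in)  feedforward (input -> representation) weights
    M : (n_rep, n_rep) non-negative lateral inhibition strengths (zero diagonal)
    """
    r = np.zeros(W.shape[0])
    for _ in range(steps):
        drive = W @ x - M @ r                       # feedforward excitation minus lateral inhibition
        r = np.maximum(0.0, r + dt * (drive - r))   # leaky update, rectified so r stays non-negative
    return r

rng = np.random.default_rng(0)
n_in, n_rep = 16, 8
W = rng.uniform(0.0, 1.0, (n_rep, n_in))
M = rng.uniform(0.0, 0.5, (n_rep, n_rep))
np.fill_diagonal(M, 0.0)                            # no self-inhibition
x = rng.uniform(0.0, 1.0, n_in)
r = represent(x, W, M)
```

Because the recurrent term only subtracts and the update is rectified, the settled activities remain non-negative while inhibition sparsifies them.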
- the capturing of the informative structures can be reflected in the tuning properties of the representation nodes (or nodes of the second layer).
- the tuning properties are a measure of how well the ANN has adapted to extracting features (or objects) from the images input into it (such as through the updating of weights).
- the tuning properties of the representation nodes can be determined by how they are connected to the early-stage nodes (such as the input nodes) in the sensory pathway (signal path). Therefore, the adaptation to inputs can pertain to changes in the connections of the ANN.
- the ANN more accurately mimics real-world biological cognitive processes in comparison to traditional approaches to neural network design.
- many traditional artificial neural networks designed to represent objects utilize an optimization process where discrepancies between the actual and desired outputs are reduced by updating the network connections through mechanisms such as error backpropagation.
- This approach requires individual connections at all levels of the artificial neural network to be sensitive to errors found in the later stages of the network.
- learning in biological nervous systems is known to occur locally, depending on pre-synaptic and post-synaptic activities.
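The locality described above can be sketched with a basic Hebbian update, in which each weight changes using only its own pre- and post-synaptic activities and no global error signal is propagated backward. The rule and learning rate below are illustrative assumptions, not the specific rule of this disclosure.

```python
import numpy as np

def hebbian_update(W, pre, post, lr=0.1):
    """Local rule: the change to W[i, j] depends only on post[i] and pre[j]."""
    return W + lr * np.outer(post, pre)

pre = np.array([1.0, 0.0, 1.0, 0.0])    # input-layer activities
post = np.array([0.0, 1.0, 0.5])        # representation-layer activities
W = hebbian_update(np.zeros((3, 4)), pre, post)
```

Only weights whose pre- and post-synaptic nodes are both active change; no knowledge of a desired output is required.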
- traditional techniques require the artificial neural network to “know” the correct outcome for certain sets of inputs, which is not required by biological neural networks.
- biological neural networks are constantly learning (that is, weights of the connections between the various neurons/nodes are updated constantly throughout the life of the neural network). These aspects of biological neural networks make them less susceptible to adversarial attacks than many preexisting artificial neural networks, regardless of their complexity.
- the ANNs described throughout this disclosure are modeled to more accurately mimic these and other aspects of biological neural networks. Further, like biological systems, representations in the ANN can be non-negative.
- the ANNs described herein dynamically update or change tuning properties for the representation nodes as the connections of the nodes change.
- Appropriate changes in the connectivity can guide the nodes to be tuned to the most informative structures.
- the changes in these connections can similarly be positive or negative and, therefore, the updates in different connections can result in differing signs.
- Such updates may appear contradictory to the non-negativity constraint placed on the values of the nodes that helps capture informative structures.
- Although the connectivity changes can be bidirectional, the inhibitory connections only reduce the activities of the nodes without pushing the value of any node below zero. In this setting, the ANN may not subtract the tuning properties of the nodes from one another. Thus, the non-negativity constraint can be satisfied even though the nodes receive both excitatory and inhibitory inputs.
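The interplay between mixed-sign inputs and the non-negativity constraint can be illustrated with a simple rectified node: inhibition subtracts from the excitatory drive, but the resulting activity is clamped at zero. This toy function is an assumption for illustration only, not the disclosed node model.

```python
def node_activity(excitatory, inhibitory):
    """Inhibition reduces a node's activity, but rectification keeps the
    activity at or above zero, satisfying the non-negativity constraint."""
    return max(0.0, excitatory - inhibitory)

partial = node_activity(1.5, 0.5)   # inhibition reduces the activity
floored = node_activity(0.25, 0.75) # strong inhibition cannot drive it negative
```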
- the ANN can extract unique features from inputs in an experience-dependent manner and generate sparse, efficient representations of the inputs based on such structures.
- the ANN described throughout this disclosure can be designed to be adaptive.
- the connectivity between the input layer and the representation layer can change based on the input to optimize its representation. Updating the connectivity of the ANN can be accomplished using a stochastic gradient descent (SGD)-type approach. Using this SGD-like approach, the ANN can slowly adapt to new inputs in a manner that does not affect its adaptation to other previous inputs. With repeated encounters with inputs, the ANN can adapt to all of the different inputs.
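The slow, SGD-like adaptation described above can be caricatured as a small per-example step of a node's feedforward weights toward the current input: a single encounter barely disturbs weights shaped by earlier inputs, while repeated encounters accumulate. The step rule and learning rate below are illustrative assumptions.

```python
import numpy as np

def slow_adapt(w, x, lr=0.01):
    """Move the weights a small step toward the current input pattern."""
    return w + lr * (x - w)

w = np.zeros(4)
pattern = np.array([1.0, 0.0, 1.0, 0.0])
for _ in range(300):            # repeated encounters with the same input
    w = slow_adapt(w, pattern)
```

After many repetitions the weights approach the repeated pattern, while any single step changes them by at most one percent of the remaining gap.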
- the design of the ANN described herein allows for an increase in efficiency with both repeated encounters and the number of inputs. Adapting to a larger number of inputs can cause the ANN to contain more information about the inputs, and accommodating more information in the ANN can lead to proper utilization of the ANN's capacity and increases in efficiency.
- the bi-layer neural network architecture of the ANN can be extended or connected to a classification layer to create a classification network.
- the discrimination (or representation) layer of the bi-layer neural network accentuates differences between different objects received as inputs by the neural network.
- the classification layer identifies shared features between the different objects in the input.
- Nodes in the classification layer may be subject to mutual excitation from other nodes in the classification layer and general inhibition.
- these nodes can be connected in a one-to-one fashion to nodes in the discrimination layer in an excitatory manner and to nodes in the input layer in an inhibitory manner.
- the design of the classification network can enable it to classify similar objects and identify the same object from different perspectives, sizes, and/or positions. It further enables the classification network to classify representations of the same object (varied by size, perspective, etc.) even if it has not yet processed or experienced the particular representation.
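A sketch of the classification-layer wiring described above: each classification node receives one-to-one excitatory drive from its discrimination node, one-to-one inhibitory drive from its input node, mutual excitation from its peers, and a general inhibition term. All weights, names, and values are illustrative assumptions, not values from the disclosure.

```python
import numpy as np

def classification_step(disc, inp, mutual, w_exc=1.0, w_inh=0.5, g_inh=0.2):
    """One update of the classification layer.

    disc   : (n,) discrimination-layer activities (one-to-one, excitatory)
    inp    : (n,) input-layer activities (one-to-one, inhibitory)
    mutual : (n, n) mutual-excitation weights among class nodes (zero diagonal)
    """
    c = np.maximum(0.0, w_exc * disc - w_inh * inp)            # one-to-one drive
    return np.maximum(0.0, c + mutual @ c - g_inh * c.sum())   # peers + general inhibition

disc = np.array([0.9, 0.1, 0.0])
inp = np.array([0.2, 0.2, 0.6])
mutual = np.array([[0.0, 0.1, 0.1],
                   [0.1, 0.0, 0.1],
                   [0.1, 0.1, 0.0]])
c = classification_step(disc, inp, mutual)
```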
- the classification network has the additional advantages over traditional approaches of being fully interpretable (a so-called white box) and of not being subject to catastrophic forgetting, which is a commonly observed phenomenon in traditional approaches and results in the neural network forgetting how to perform one task after it is trained on another task.
- the classification network performs its analysis on inputs in a manner that is both efficient and robust.
- the identity of an object is embedded in the structural relationships among its features and the neural network architectures of this disclosure can utilize these relationships, or dependencies, to encode object identity. Moreover, as explained in further detail below, because the neural network architecture maximally captures these dependencies, it is able to identify the presence of an object without accurate details of the input patterns and to generate or extract invariant representations.
- the technologies discussed herein can be used in a variety of different contexts and environments.
- One useful application of these technologies is in the context of computer vision, which can be applied across a wide variety of different applications.
- the technologies disclosed herein may be integrated into any application, device, or system that can benefit from using the object representations described herein.
- One exemplary application of these technologies can be applied in the context of facial recognition.
- Another useful application of these technologies is in the context of surveillance systems (e.g., at security checkpoints).
- Another useful application of these technologies is in the context of scene analysis applications (e.g., which may be used in automated, unmanned, and/or autonomous vehicles that rely on automated, unmanned, and/or autonomous systems to control the vehicles).
- Another useful application of these technologies is in the context of intelligent or automated traffic control systems.
- Another useful application of these technologies is in image editing applications.
- Another useful application of these technologies is in the context of satellite imaging systems. Additional useful applications can include quality control systems (e.g., industrial sample checks and industrial flaw detection), agricultural analysis systems, and medical analysis systems (e.g., for both human and animal applications).
- the technologies discussed herein can also be applied to many other contexts as well. For example, they can be used to process and/or analyze DNA and RNA sequences, auditory data, sensory data, or data collected from other sources.
- the neural network architecture can identify, categorize, or extract other information from the inputted data related to objects in that data, which may be certain patterns or other features of the data.
- the neural network architecture can generally perform the same functions related to extracting representations and/or classifying portions of the inputted data as it can with visual images.
- the data to be analyzed and/or processed by the neural network architecture can be pre-processed in some way, such as by converting it into pixels to form an image to be input into the neural network architecture. Other preprocessing steps, such as scaling and/or applying a wavelet or Fourier transform, can be applied to inputs of all types.
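As one concrete, purely illustrative example of the pixelization step for non-visual data, a DNA sequence might be mapped to intensity values and reshaped into a small 2D "image"; the mapping and dimensions below are assumptions, not part of the disclosure.

```python
import numpy as np

def sequence_to_image(seq, mapping, width=8):
    """Pixelize a symbolic sequence into a 2D intensity array,
    zero-padding the final row."""
    vals = [mapping[s] for s in seq]
    rows = -(-len(vals) // width)       # ceiling division
    img = np.zeros(rows * width)
    img[: len(vals)] = vals
    return img.reshape(rows, width)

dna_map = {"A": 0.25, "C": 0.5, "G": 0.75, "T": 1.0}
img = sequence_to_image("ACGTACGTACG", dna_map)   # 11 symbols -> 2 x 8 image
```

The resulting array can then be fed to the neural network architecture like any other image input.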
- any aspect or feature that is described for one embodiment can be incorporated to any other embodiment mentioned in this disclosure.
- any of the embodiments described herein may be hardware-based, may be software-based, or, preferably, may comprise a mixture of both hardware and software elements.
- the description herein may describe certain embodiments, features, or components as being implemented in software or hardware, it should be recognized that any embodiment, feature and/or component referenced in this disclosure can be implemented in hardware and/or software.
- FIG. 1A is a diagram of an exemplary system 100 in accordance with certain embodiments.
- FIG. 1B is a diagram illustrating exemplary features and/or functions associated with a computer vision system 150. FIGS. 1A and 1B are discussed jointly below.
- the system 100 comprises one or more computing devices 110 and one or more servers 120 that are in communication over a network 190.
- a computer vision system 150 is stored on, and executed by, the one or more servers 120.
- the network 190 may represent any type of communication network, e.g., such as one that comprises a local area network (e.g., a Wi-Fi network), a personal area network (e.g., a Bluetooth network), a wide area network, an intranet, the Internet, a cellular network, a television network, and/or other types of networks.
- All of the components illustrated in FIGS. 1A and 1B, including the computing devices 110, servers 120, and computer vision system 150, can be configured to communicate directly with each other and/or over the network 190 via wired or wireless communication links, or a combination of the two.
- Each of the computing devices 110, servers 120, and computer vision system 150 can also be equipped with one or more communication devices, one or more computer storage devices 201, and one or more processing devices 202 (e.g., central processing units) that are capable of executing computer program instructions.
- the one or more computer storage devices 201 may include (i) non-volatile memory, such as, for example, read-only memory (ROM) and/or (ii) volatile memory, such as, for example, random access memory (RAM).
- the non-volatile memory may be removable and/or non-removable non-volatile memory.
- RAM may include dynamic RAM (DRAM), static RAM (SRAM), etc.
- ROM may include mask-programmed ROM, programmable ROM (PROM), one-time programmable ROM (OTP), erasable programmable read-only memory (EPROM), electrically erasable programmable ROM (EEPROM) (e.g., electrically alterable ROM (EAROM) and/or flash memory), etc.
- the computer storage devices 201 may be physical, non-transitory mediums. The one or more computer storage devices 201 can store instructions associated with executing the functions performed by the computer vision system 150.
- the one or more processing devices 202 may include one or more central processing units (CPUs), one or more microprocessors, one or more microcontrollers, one or more controllers, one or more complex instruction set computing (CISC) microprocessors, one or more reduced instruction set computing (RISC) microprocessors, one or more very long instruction word (VLIW) microprocessors, one or more graphics processor units (GPU), one or more digital signal processors, one or more application specific integrated circuits (ASICs), and/or any other type of processor or processing circuit capable of performing desired functions.
- the one or more processing devices 202 can be configured to execute any computer program instructions that are stored or included on the one or more computer storage devices including, but not limited to, instructions associated with executing the functions performed by the computer vision system 150.
- Each of the one or more communication devices can include wired and wireless communication devices and/or interfaces that enable communications using wired and/or wireless communication techniques.
- Wired and/or wireless communication can be implemented using any one or combination of wired and/or wireless communication network topologies (e.g., ring, line, tree, bus, mesh, star, daisy chain, hybrid, etc.) and/or protocols (e.g., personal area network (PAN) protocol(s), local area network (LAN) protocol(s), wide area network (WAN) protocol(s), cellular network protocol(s), powerline network protocol(s), etc.).
- the one or more communication devices additionally, or alternatively, can include one or more modem devices, one or more router devices, one or more access points, and/or one or more mobile hot spots.
- the computing devices 110 may represent desktop computers, laptop computers, mobile devices (e.g., smart phones, personal digital assistants, tablet devices, vehicular computing devices, or any other device that is mobile in nature), and/or other types of devices.
- the one or more servers 120 may generally represent any type of computing device, including any of the computing devices 110 mentioned above.
- the one or more servers 120 comprise one or more mainframe computing devices that execute web servers for communicating with the computing devices 110 and other devices over the network 190 (e.g., over the Internet).
- the computer vision system 150 is stored on, and executed by, the one or more servers 120.
- the computer vision system 150 can be configured to perform any and all operations associated with analyzing images 130 and/or executing computer vision functions including, but not limited to, functions for performing feature extraction, object detection, object classification, and object segmentation.
- the images 130 provided to, and analyzed by, the computer vision system 150 can include any type of image.
- the images 130 can include one or more two-dimensional (2D) images.
- the images 130 may include one or more three-dimensional (3D) images.
- the images 130 can be created from non-visual data sources, such as DNA or RNA sequences, auditory data, sensory data, and other types of data, by pixelizing (that is, converting the non-visual data into an 'image' including one or more 'pixels' representing portions of the non-visual data).
- the images 130 may be captured in any digital or analog format, and using any color space or color model.
- the images 130 can be portions excerpted from a video.
- Exemplary image formats can include, but are not limited to, bitmap (BMP), JPEG (Joint Photographic Experts Group), TIFF (Tagged Image File Format), GIF (Graphics Interchange Format), PNG (Portable Network Graphics), STEP (Standard for the Exchange of Product Data), etc.
- Exemplary color spaces or models can include, but are not limited to, sRGB (standard Red-Green-Blue), Adobe RGB, gray-scale, etc.
- some or all of the images 130 can be preprocessed and/or transformed prior to being analyzed by the computer vision system 150.
- the images 130 can be split into different color elements and/or processed via a transform, such as a Fourier or wavelet transform. Other preprocessing and transformation operations also can be applied.
- the images 130 received by the computer vision system 150 can be captured by any type of camera device.
- the camera devices can include any devices that include an imaging sensor, camera, or optical device.
- the camera devices may represent still image cameras, video cameras, and/or other devices that include image/video sensors.
- the camera device can capture and/or store both visible and invisible spectra including, but not limited to, ultraviolet (UV), infrared (IR), positron emission tomography (PET), magnetic resonance imaging (MRI), X-ray, ultrasound, and other types of medical and non-medical imaging.
- the camera devices also can include devices that comprise imaging sensors, cameras, or optical devices and which are capable of performing other functions unrelated to capturing images.
- the camera devices can include mobile devices (e.g., smart phones or cell phones), tablet devices, computing devices, desktop computers, etc.
- the camera devices can be equipped with analog-to-digital (A/D) converters and/or digital-to-analog (D/A) converters based on the configuration or design of the camera devices.
- the computing devices 110 shown in FIG. 1 can include any of the aforementioned camera devices, and other types of camera devices.
- Each of the images 130 can include one or more objects 135.
- any type of object 135 may be included in an image 130, and the types of objects 135 included in an image 130 can vary greatly.
- the objects 135 included in an image 130 may correspond to various types of inanimate articles (e.g., vehicles, beds, desks, windows, tools, appliances, industrial equipment, curtains, sporting equipment, fixtures, etc.), living things (e.g., human beings, faces, animals, plants, etc.), structures (e.g., buildings, houses, etc.), symbols (Latin letters of the alphabet, Arabic numerals, Chinese characters, etc.) and/or the like.
- the objects 135 can include any patterns or features of importance found in the data.
- the images 130 received by the computer vision system 150 can be provided to the neural network architecture 140 for processing and/or analysis.
- the neural network architecture 140 can extract enhanced or optimized object representations 165 from the images 130.
- the object representations 165 may represent features, embeddings, encodings, vectors and/or the like, and each object representation 165 may include encoded data that represents and/or identifies one or more objects 135 included in an image 130.
- the neural network architecture 140 can learn patterns presented to it in a sequential manner, and this learned knowledge can be leveraged to optimize the object representations 165 and perform other functions described herein.
- the structure or configuration of the neural network architecture 140 can vary.
- the neural network architecture 140 can include one or more recurrent neural networks (RNNs).
- the neural network architecture 140 can include a Hopfield network that has been modified and optimized to perform the tasks described herein.
- the modified Hopfield network is a shallow, bi-layer RNN that comprises a first layer of input nodes (or input neurons) and a second layer of representation nodes (or representation neurons).
- Each of the representation nodes can be connected to each of the input nodes in an all-to-all configuration, and feedforward weights between the input and representation nodes can be chosen to minimize the chances that two representation nodes are active at the same time.
- the representation nodes can be connected to each other using recurrent connections.
- the biased connectivity among the nodes coupled with a stochastic gradient descent (SGD) based learning mechanism, enable the neural network architecture 140 to sequentially identify multiple inputs without catastrophic forgetting.
- the biased connectivity and lateral inhibition in the neural network architecture 140 enable the representation nodes to encode structures that uniquely identify individual objects.
- slow synaptic weight changes allow continuous learning from individual examples.
- the slowness (relative to traditional image analysis systems) does not cause disturbances in the overall network connections, but allows specific patterns to be encoded.
- there is no normalization step at each learning iteration that prevents the production or assignment of negative synaptic weights; negative weights nonetheless do not arise.
- Such a result is due to the slow synaptic weight changes and is similar to biological systems (e.g., animal brains, where synaptic weights never become negative).
- the number of representation nodes included in the neural network architecture 140 may be proportional to the number of images or objects for which recognition is desired.
- the representational layer may contain approximately the same number of nodes as the number of images to be identified. In some embodiments, there may be 2x or more (up to 10x or more) expansion of the number of nodes from the primary layer to the representation layer. For many applications of the neural network architecture 140, more nodes in each layer yield better results. There is no upper bound on the number of total nodes comprising the neural network architecture 140.
- the neural network architecture 140 can be configured to be adaptive, such that the connectivity between the input layer and the representation layer is permitted to change based on a given input image that is being processed. This dynamic adaptation of the connections between the input layer and the representation layer enables the neural network architecture 140 to optimize the object representations 165 that are generated.
- the resulting object representations 165 are sparse, and individual nodes of the neural network architecture 140 are de- correlated, thereby leading to efficient coding of the input patterns.
- the neural network architecture 140 can extract the informative structures from the objects 135 in the images 130, the resulting object representations 165 are robust against various forms of degradation, corruption and occlusion.
- Other configurations of the neural network architecture 140 also may be employed. While certain portions of this disclosure describe embodiments in which the neural network architecture 140 includes a modified Hopfield network or RNN, it should be understood that the principles described herein can be applied to various learning models or networks. In some examples, layers of the neural network architecture 140 can be appropriately stacked and/or parallelized in various configurations to form deep neural networks that execute the functions described herein. In certain embodiments where the neural network architecture 140 is stacked, the output of its representation layer or its classification layer (in instances where the neural network architecture 140 includes a third layer), or both, can be used as input to the next neural network(s) (such as another 2- or 3-layer modified Hopfield network).
- the input to these later neural networks is derived from the activity from each node of the previous neural network architecture 140 and can be treated as a pixel of input to the next network.
- the neural network architecture 140 can include a classic perceptron as an additional layer that reads class information.
- the first neural network architecture 140 can be used as a scanning device, which allows a limited number of pixels to cover a larger scene (similar to a biological organism using its eyes to focus on one area of the visual field at a time but synthesize the whole scene). To synthesize the whole scene, the scanned images (or sub-scenes) can be treated as time-invariant even though they are obtained at different points in time.
- the principles described herein can be extended or applied to other types of RNNs that are not specifically mentioned in this disclosure.
- the principles described herein can be extended or applied to reinforced learning neural networks.
- the principles described herein can be extended or applied to convolutional neural networks (CNNs).
- the neural network architecture 140 may additionally, or alternatively, comprise a convolutional neural network (CNN), or a plurality of convolutional neural networks.
- CNN may represent an artificial neural network, and may be configured to analyze images 130 and to execute deep learning functions and/or machine learning functions on the images 130.
- Each CNN may include a plurality of layers including, but not limited to, one or more input layers, one or more output layers, one or more convolutional layers (e.g., that include learnable filters), one or more ReLU (rectified linear unit) layers, one or more pooling layers, one or more fully connected layers, one or more normalization layers, etc.
- the configuration of the CNNs and their corresponding layers can be configured to enable the CNNs to learn and execute various functions for analyzing, interpreting, and understanding the images 130, including any of the functions described in this disclosure.
- the neural network architecture 140 can be trained to extract robust object representations 165 from input images 130.
- the neural network architecture 140 also can be trained to utilize the object representations 165 to execute one or more computer vision functions.
- the object representations 165 can be utilized to perform object detection functions, which may include predicting or identifying locations of objects 135 (e.g., using bounding boxes) associated with one or more target classes in the images 130.
- the object representations 165 can be utilized to perform object classification functions (e.g., which may include predicting or determining whether objects 135 in the images 130 belong to one or more target semantic classes and/or predicting or determining labels for the objects 135 in the images 130) and/or instance segmentation functions (e.g., which may include predicting or identifying precise locations of objects 135 in the images 130 with pixel-level accuracy).
- the neural network architecture 140 can be trained to perform other types of computer vision functions as well.
- the neural network architecture 140 of the computer vision system 150 is configured to generate and output analysis information 160 based on an analysis of the images 130.
- the analysis information 160 for an image 130 can generally include any information or data associated with analyzing, interpreting, understanding, and/or classifying the images 130 and the objects 135 included in the images 130.
- the analysis information 160 can include information or data representing the object representations 165 that are extracted from the input images 130.
- the analysis information 160 may further include orientation information that indicates an angle of rotation or orientation or position of objects 135 included in the images 130.
- the analysis information 160 can include information or data that indicates the results of the computer vision functions performed by the neural network architecture 140.
- the analysis information 160 may include the predictions and/or results associated with performing object detection, object classification, and/or other computer vision functions.
- the computer vision system 150 may be stored on, and executed by, the one or more servers 120. In other exemplary systems, the computer vision system 150 can additionally, or alternatively, be stored on, and executed by, the computing devices 110 and/or other devices. For example, in certain embodiments, the computer vision system 150 can be integrated directly into a camera device to enable the camera device to analyze images using the techniques described herein.
- the computer vision system 150 can also be stored as a local application on a computing device 110, or integrated with a local application stored on a computing device 110, to implement the techniques described herein.
- the computer vision system 150 can be integrated with (or can communicate with) various applications including, but not limited to, facial recognition applications, automated vehicle applications, intelligent traffic applications, surveillance applications, security applications, industrial quality control applications, medical applications, agricultural applications, veterinarian applications, image editing applications, social media applications, and/or other applications that are stored on a computing device 110 and/or server 120.
- the neural network architecture 140 can be integrated with a facial recognition application and generates pseudo-images to aid in identification of faces or facial objects. For example, upon receiving a given image 130 that includes a facial object, the neural network architecture 140 can robustly generate a consistent pseudo-image of unknown or altered form (e.g., which may include an altered facial object), and the pseudo-image may be used for facial recognition purposes. Storage of the actual facial objects is not required, which can be beneficial both from a technical standpoint (e.g., by decreasing usage of storage space) and a privacy standpoint.
- the neural network architecture 140 can be deployed with a pre-learned weight matrix so that it is immediately available for its assigned application. In addition, the neural network architecture 140 can also perform additional learning, if preferred, even if it was deployed with a pre-learned weight matrix. In certain embodiments, where no or few new objects are expected, the neural network architecture 140 with a learned set of weights can be stored and used directly without any learning (or adaption) mechanism to accelerate its performance. Alternatively, or in addition, the neural network architecture 140 can be allowed to continuously update its weights to account for novel objects.
- the one or more computing devices 110 can enable individuals to access the computer vision system 150 over the network 190 (e.g., over the Internet via a web browser application). For example, after a camera device (e.g., which may be directly integrated into a computing device 110 or may be a device that is separate from a computing device 110) has captured one or more images 130, an individual can utilize a computing device 110 to transmit the one or more images 130 over the network 190 to the computer vision system 150.
- the computer vision system 150 can analyze the one or more images 130 using the techniques described in this disclosure.
- the analysis information 160 generated by the computer vision system 150 can be transmitted over the network 190 to the computing device 110 that transmitted the one or more images 130 and/or to other computing devices 110.
- the neural network architecture 140 can include a shallow, bi-layer ANN 200 (e.g., a modified Hopfield network) that comprises a first layer of input nodes 210a-d (which may also be referred to herein as primary layer nodes) and a second layer of representation nodes 220a-e (which may also be referred to herein as discrimination nodes, representation nodes or secondary layer nodes).
- Each of the input nodes 210a-d can be connected to each of the representation nodes 220a-e in an all-to-all configuration.
- the initial feedforward weights between the input nodes 210a-d and representation nodes 220a-e can be chosen based in part on the variance structure of the input dataset to minimize the chances that any two representation nodes 220a-e are active at the same time.
- the representation nodes 220a-e can be connected to each other in an all-to-all configuration using recurrent connections that are inhibitory.
- the biased connectivity and lateral inhibition in the neural network architecture 140 enable the nodes to encode structures that uniquely identify individual objects 135.
- the sparsity of the object representations 165 of the objects 135 embedded in the images 130 is due to the inhibitory recurrent connections between the representation nodes 220a-e. These inhibitory connections are not present in a traditional Hopfield network, which contains excitatory recurrent connections.
- the bi-layer ANN 200 can be configured to be adaptive, such that the connectivity between the input layer nodes 210a-d and the representation layer nodes 220a-e is permitted to change based on a given input image that is being processed.
- This dynamic adaptation of the connections between the input layer nodes 210a-d and the representation layer nodes 220a-e enables the bi-layer ANN 200 to optimize the object representations 165 that are generated.
- the resulting object representations 165 are sparse, and individual representation layer nodes 220a-e of the bi-layer ANN 200 are de-correlated, thereby leading to efficient coding of the input patterns.
- the bi-layer ANN 200 can extract the informative structures from the objects 135 in the images 130, the resulting object representations 165 are robust against various forms of degradation, corruption and occlusion.
- the weights between any two nodes are updated using local learning rules. For example, the connection between an input node and a representation node can be strengthened when both nodes are active. When two of the representation nodes 220a-e are active at the same time, the input connections to these two nodes are weakened, and the inhibitory weights between them can be increased when the two nodes have the same level of activity.
- the strengthening of connections between input nodes 210a-d and representation nodes 220a-e is an example of local Hebbian behavior while the weakening of any two of the representation nodes 220a-e that are active at the same time is an example of local anti-Hebbian behavior.
- Hebbian learning rules specify that, to store p patterns in a network with N units, the weights that ensure recollection of the patterns are set using $w_{ij} = \frac{1}{N} \sum_{r=1}^{p} \xi_i^{r} \xi_j^{r}$, where $\xi_i^{r}$ denotes the state of the $i$-th unit in the $r$-th pattern.
- Hebbian learning rules generally specify that when the neurons are activated and connected with other neurons, these connections start off weak, but the connections grow stronger and stronger each time the stimulus is repeated.
- connections between the input nodes 210a-d and representation nodes 220a-e are strengthened when connections are formed, thereby establishing associations between features extracted by the input nodes 210a-d and representation nodes 220a-e that can capture the related feature information.
- when two of the representation nodes 220a-e are active at the same time, the learning rules can reduce the strengths of the connections between the input nodes 210a-d and those two of the representation nodes 220a-e.
- the connectivity between the input nodes 210a-d and the representation nodes 220a-e takes the variance structure of the input dataset into account and ensures that any two of the representation nodes 220a-e are less likely to fire together for any given input. This approach to the initial bias of the ANN 200 can enhance learning speed.
- the bi-layer ANN 200 is able to quickly represent images 130 after it has been exposed to them.
- the bi-layer ANN 200 can accurately capture the structural features of input including images of symbols from world languages, reaching a plateau of performance, after less than ten exposures to the symbols.
- the bi-layer ANN 200 is capable of continuous learning.
- the bi-layer ANN 200 can learn to represent novel input types (such as faces) after learning to represent a different input type (such as symbols from world languages) without “forgetting” how to represent the earlier input type.
- the number of representation nodes 220a-e included in the neural network architecture 140 may be proportional to the number of images 130 or objects 135 for which recognition is desired.
- the representation layer 220 may contain approximately the same number of nodes as the number of images 130 to be identified.
- more nodes in each layer yield better results.
- the input layer of the bi-layer ANN 200 can have 10,000 nodes and the representation layer can have 500 nodes.
- the input layer 210 can include 10,000 nodes and the representation layer 220 can include 1,000 nodes.
- slow synaptic weight changes allow continuous learning from individual examples.
- the slowness does not cause disturbances in the overall network connections, but allows specific patterns to be encoded.
- because the synaptic weight changes are slow, no normalization step is needed with each learning iteration to prevent the production or assignment of negative synaptic weights. Such a result is similar to biological systems (e.g., in animal brains, where synaptic weights never go negative).
- the characteristics of the representation nodes 220a-e in the second layer can be modeled or based upon the characteristics of neurons observed in biological systems. For example, certain concepts such as membrane potential and firing rate, taken from biological neural networks, or neurons therein, can be used to set the attributes of the nodes in the ANN 200.
- the connections between the (primary) input layer nodes 210a-d and the (second) representation layer nodes 220a-e can be represented by a connection matrix, with the shape of the connection matrix depending on the number of input nodes 210a-d and number of representation layer nodes 220a-e (and, as such, need not be symmetric).
- the connection strength from node i to node j in the representation layer 220 is the same as the connection strength from node j to node i.
- connection strengths between the nodes can either be static or adapt over time.
- the properties of the nodes can change as the ANN 200 encounters inputs.
- the properties of the representation nodes 220a-e in the second layer arise due to their connections to the input nodes 210a-d. Therefore, the strength of the recurrent connections can be based on the similarity of the representation nodes' 220a-e connections to the primary nodes 210a-d.
- when two of the representation nodes 220a-e have similar connections to the primary nodes, any given input would activate them similarly, and their recurrent interactions would be similar as well.
- the ANN 200 can be completely dynamic in some embodiments. For example, it can adapt to the inputs not only through the changes in connections between the input nodes 210a-d and the representation nodes 220a-e but also through updating the strengths of the recurrent connections (between the representation nodes 220a-e).
- the dynamics of the ANN 200 can be modeled as $\tau \frac{du}{dt} = -u + W y - w V$, with $V = g(u)$, where W is the matrix of weights between the input nodes 210a-d in the primary layer and the representation nodes 220a-e of the second layer, $\tau$ (tau) is a time constant related to the parameters of the neuron model, y is the activity of the first layer, u is the vector of membrane potentials, w is the recurrent inhibition weight matrix, and V is the firing rate or the representation pattern of the nodes in the second layer.
- the function g can relate the membrane potential to the firing rate of neurons in a biological system.
- the membrane potential can be the same as those found in existing models.
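As an illustration, such firing-rate dynamics can be integrated numerically. The following is a minimal sketch under assumed constants and a simple threshold-linear choice for g; the layer sizes, weight values, and the exact form of the leaky-drive equation (feedforward input minus recurrent inhibition) are illustrative assumptions, not the claimed implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

M, N = 16, 8                    # primary (input) nodes and representation nodes
W = rng.random((N, M)) * 0.1    # feedforward weights (primary -> representation)
w = np.full((N, N), 0.05)       # recurrent inhibition weights
np.fill_diagonal(w, 0.0)        # no self-inhibition
tau, dt = 10.0, 1.0             # time constant and Euler step

def g(u):
    """Transfer function relating membrane potential to firing rate."""
    return np.maximum(u, 0.0)

y = rng.random(M)               # activity of the first (primary) layer
u = np.zeros(N)                 # membrane potentials of the representation nodes

# Euler integration of  tau * du/dt = -u + W y - w V,  with V = g(u)
for _ in range(200):
    V = g(u)
    u += (dt / tau) * (-u + W @ y - w @ V)

V = g(u)                        # settled firing-rate (representation) pattern
```

With small weights the system settles to a steady state in which each node's drive is balanced by its leak and the lateral inhibition.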
- the nodes in the ANN 200 can exhibit certain non-linear behavior.
- the nodes 220a-e in the representation layer can have a certain threshold, with the node inactive (or not ‘firing’) when its value is below the threshold. This value can be determined by summing the inputs to the node as multiplied by the weights applied to those inputs. After the threshold is reached, the node can respond linearly to its inputs. In certain embodiments, this region of linear response may be limited, for instance, because the node response will saturate at a certain level of activity.
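The threshold-then-linear-then-saturating response described above can be sketched as a piecewise-linear transfer function; the threshold and saturation values below are hypothetical:

```python
import numpy as np

def node_response(inputs, weights, threshold=0.5, saturation=2.0):
    """Piecewise-linear node: silent below threshold, linear above it,
    and saturating at a maximum activity level (values are hypothetical)."""
    drive = float(np.dot(weights, inputs))       # weighted sum of inputs
    if drive < threshold:
        return 0.0                               # below threshold: no firing
    return min(drive - threshold, saturation)    # linear region, then saturation

x = np.array([0.2, 0.4, 0.9])
wts = np.array([0.5, 0.5, 0.5])

low = node_response(np.zeros(3), wts)    # below threshold: inactive
mid = node_response(x, wts)              # within the linear region
high = node_response(10 * x, wts)        # strong drive: response saturates
```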
- the behavior of the nodes can be modeled in a number of ways.
- the behavior of the representation nodes 220a-e of the ANN 200 are modeled on biological structures, such as neurons.
- the behavior of these nodes is determined by certain parameters taken from the biological context: membrane potential, firing rate, etc.
- the nodes in the representation layer 220a-e can be modeled using the “Leaky Integrate and Fire” model.
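A minimal sketch of a Leaky Integrate and Fire node follows; all constants (time constant, threshold, reset value) are illustrative assumptions:

```python
def lif_spike_count(input_current, steps=1000, dt=0.1,
                    tau=10.0, v_thresh=1.0, v_reset=0.0):
    """Leaky Integrate and Fire: the membrane potential leaks toward rest,
    integrates its input, and emits a spike (then resets) at threshold."""
    v, spikes = 0.0, 0
    for _ in range(steps):
        v += (dt / tau) * (-v + input_current)   # leaky integration
        if v >= v_thresh:                        # threshold crossing -> spike
            spikes += 1
            v = v_reset
    return spikes

quiet = lif_spike_count(0.5)   # subthreshold drive: potential settles, no spikes
busy = lif_spike_count(3.0)    # suprathreshold drive: regular spiking
```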
- the fitness or quality of adaptations of the ANN 200 can be measured by the difference between an input and its reconstruction obtained from the representation nodes' 220a-e tuning properties and response values.
- this fitness of adaptation can be modeled as $E = \lVert y - \Phi V \rVert^{2}$, where $\Phi$ is the matrix of the tuning properties of the nodes, y is the input, V is the response of the representation nodes, and where E is reduced with each update.
- This term can be used to measure the discrepancy between the input into the input layer 210 and the representation derived from the representation layer 220. In certain embodiments, this term, when combined with the sparsity and non-negative constraints, can help derive the learning rules for the ANN 200 (as described in more detail below).
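As a concrete illustration, the discrepancy between an input and its reconstruction can be computed as a squared reconstruction error; the specific form E = ||y - Phi V||^2 and the names below are assumptions for this sketch:

```python
import numpy as np

def adaptation_error(y, Phi, V):
    """E = || y - Phi V ||^2 : discrepancy between the input y and the
    reconstruction obtained from tuning properties Phi and responses V."""
    residual = y - Phi @ V
    return float(residual @ residual)

rng = np.random.default_rng(1)
Phi = rng.random((6, 4))      # tuning properties of the representation nodes
V = rng.random(4)             # responses of the representation nodes
y_exact = Phi @ V             # an input perfectly explained by the responses

perfect = adaptation_error(y_exact, Phi, V)        # essentially zero
worse = adaptation_error(y_exact + 0.5, Phi, V)    # perturbed input: E grows
```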
- in embodiments where each node behaves linearly, the activity of each node is a function of the weighted sum of its inputs, so that a change in tuning properties directly corresponds to a change in its connectivity, i.e., $\Delta W \propto \Delta \Phi$.
- the connectivity of the ANN 200 can be updated in a number of ways. For example, it can be updated using the following three step procedure. First, for each state of connectivity, the tuning properties are determined. Second, a change in tuning properties that would reduce the error is then calculated from the representations, and lastly, a change proportional to that is made in the connectivity.
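The three-step procedure can be sketched as follows, assuming linear nodes (so the tuning properties are directly tied to the connectivity) and a squared reconstruction error; the function name and step size are hypothetical:

```python
import numpy as np

def update_connectivity(W, y, V, eta=0.01):
    """One pass of the three-step procedure:
    1) read the tuning properties from the current connectivity
       (linear nodes: Phi = W.T),
    2) compute the change in tuning properties that reduces the
       reconstruction error E = ||y - Phi V||^2,
    3) apply a proportional change to the connectivity."""
    Phi = W.T                                # step 1: tuning properties
    dPhi = np.outer(y - Phi @ V, V)          # step 2: error-reducing change
    return W + eta * dPhi.T                  # step 3: proportional update

rng = np.random.default_rng(2)
W = rng.random((5, 12)) * 0.1    # representation x input connectivity
y = rng.random(12)               # input pattern
V = rng.random(5)                # representation responses

err_before = np.sum((y - W.T @ V) ** 2)
W_new = update_connectivity(W, y, V)
err_after = np.sum((y - W_new.T @ V) ** 2)   # reduced by the update
```

Because the step is a small move along the negative error gradient, the reconstruction error strictly decreases for a sufficiently small step size.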
- the inability of the ANN 200 to differentiate between different inputs can undercut its effectiveness.
- so that the ANN 200 can be optimized to represent inputs based on their most informative structures and to adapt to different forms of inputs, the initial weights of the ANN 200 can be set to differentiate between different inputs from the first inputs it receives. Otherwise, the ANN 200 may not be able to distinguish between two different inputs, leading to a flawed adaptation process that results in only selective adaptation.
- the initial weights are set so as to minimize the chances of having any two of the representation nodes 220a-e activated by the same input to ensure that different inputs activate different nodes, avoiding mapping different inputs to the same representation.
- the weight matrix W can be calculated based on the variance-covariance matrix of response profiles of early nodes (denoted by $\Sigma_{yy}$) over the set of inputs as $W = \Gamma \Lambda^{-1/2} Q^{T}$, where $\Gamma$ is an N x M generalizing matrix of real numbers with orthogonal columns, $\Lambda$ is the diagonal matrix of eigenvalues of $\Sigma_{yy}$, and Q is the matrix of orthogonal eigenvectors of $\Sigma_{yy}$.
- M is the number of primary nodes and N is the number of representation nodes.
- $\Gamma$ is created by first constructing an N x N symmetric matrix (when N is greater than M) and calculating its eigenvectors. The generalizing matrix can then be created by taking M of the eigenvectors.
- a connectivity matrix W as derived above will make the variance-covariance matrix of representation nodes' response profiles match the identity matrix.
- the ANN 200 can be generalized by ensuring that $\Gamma$ has orthogonal columns (in other words, when the number of representation nodes is larger than the number of primary nodes).
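A hedged sketch of a whitening-style initialization in this spirit follows; for simplicity it takes N smaller than M and builds the generalizing matrix from rows of a random orthogonal matrix, so that the representation covariance comes out exactly decorrelated. The construction details differ from the N-greater-than-M case described above and are assumptions for this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
M, N = 8, 4    # primary nodes and representation nodes (N < M for simplicity)

# Response profiles of the primary nodes over a set of inputs
Y = rng.random((M, 200))
Sigma_yy = np.cov(Y)                       # variance-covariance matrix

Lam, Q = np.linalg.eigh(Sigma_yy)          # eigenvalues, orthogonal eigenvectors

# Generalizing matrix Gamma: N rows of a random orthogonal matrix, so the
# resulting representation covariance is exactly decorrelated
Gamma = np.linalg.qr(rng.random((M, M)))[0][:N, :]

W = Gamma @ np.diag(Lam ** -0.5) @ Q.T     # W = Gamma Lambda^(-1/2) Q^T

# Covariance of the representation responses matches the identity: any two
# representation nodes are uncorrelated under these input statistics
Sigma_vv = W @ Sigma_yy @ W.T
```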
- the updating can be stated as an optimization problem with the goal of minimizing $E = \lVert y - \Phi V \rVert^{2}$, where y is the input to the ANN 200 and V is its corresponding output.
- This optimization problem for updating the connectivity between the primary layer input nodes 210a-d and the representation layer representation nodes 220a-e can be solved by taking a gradient descent approach. In this approach, a function's value is iteratively reduced by updating its variables along its gradient. In other words, for every variable, the value which further reduces the function is found by moving along the functions' negative gradient with respect to the variable. Eventually, a minimum of the function is reached.
- the gradient descent steps can be formulated as $\Delta \Phi = \eta \left( y V^{T} - \Phi V V^{T} \right)$, where $\eta$ is the step size.
- the matrix $y V^{T}$ corresponds to the Hebbian update rule that strengthens the connection when one of the input nodes 210a-d in the primary layer and one of the representation nodes 220a-e in the representation layer fire together.
- an entry of the matrix $V V^{T}$ can be positive only when $V_i$ and $V_j$ are both positive.
- the negative sign before this update component makes it anti-Hebbian in nature, i.e., the update reduces all the connections between input nodes 210a-d in the primary layer and two similarly active nodes in the representation layer 220. In other words, if two of the representation nodes 220a-e are firing together, their input is reduced so that they can be decoupled. Overall, an update in connectivity strengthens the connections between simultaneously firing nodes in the primary layer 210 and the representation layer 220 but reduces the chances of two of the representation nodes 220a-e firing together. This process allows the ANN 200 to gradually become tuned to features from the multiple inputs presented to it.
- the ANN 200 can utilize simultaneous re-learning of features from all the previous inputs to minimize the effects of such disruptions.
- the ANN 200 can use a stochastic gradient descent (SGD) to solve the problem of disruption of the ANN’S adaption to previously encountered inputs.
- This is a stochastic approximation of gradient descent optimization.
- instead of optimizing the objective function for all the training data, the ANN 200 optimizes the function for only a randomly selected subset of the data.
- any optimization problem can be stated as a finite-sum problem, where the value of the objective function is expressed as a sum of losses for each data point, i.e., $f(x) = \sum_{i=1}^{n} f_i(x)$, where f is the objective function, $f_i$ is the loss at the i-th data point, and x is the optimization variable.
- the gradient of the objective function is the gradient of this finite-sum, which is calculated with respect to every training data point: $\nabla f(x) = \sum_{i=1}^{n} \nabla f_i(x)$
- in SGD, each step of descent is decided using only a subset of training data points, and hence, the gradient is based only on a portion of this finite-sum: $\nabla f(x) \approx \sum_{i \in S} \nabla f_i(x)$, where S is a randomly selected subset of the data points. Though this strategy does not reach the optimum exactly, it can reach very close to the objective function's optimum value.
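The subset-gradient strategy can be illustrated on a toy least-squares finite-sum problem; the problem, batch size, and step size below are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy finite-sum problem: f(x) = sum_i (a_i . x - b_i)^2 over 100 data points
A = rng.random((100, 3))
x_true = np.array([1.0, -2.0, 0.5])
b = A @ x_true                     # consistent targets, so the optimum is zero

def grad_fi(x, i):
    """Gradient of the loss at the i-th data point only."""
    return 2.0 * (A[i] @ x - b[i]) * A[i]

x = np.zeros(3)
for _ in range(10000):
    batch = rng.integers(0, 100, size=10)      # random subset of the data
    g = sum(grad_fi(x, i) for i in batch)      # gradient of a portion of the sum
    x -= 0.01 * g                              # descent step on that portion

f_final = float(np.sum((A @ x - b) ** 2))      # very close to the optimum
```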
- the ANN 200 is designed to update its connectivity so that it learns to efficiently represent a finite set of inputs based on their most informative structures.
- the measure of adaptiveness can be used as the objective function
- the matrix of tuning properties can be used as the optimization variable
- the pairs of inputs and their corresponding representations can be used as the training data points.
- the SGD method can train the ANN 200 for all the inputs presented in a sequence although the SGD does not reach the optimum.
- the step size can be any size when using the SGD method. In certain embodiments, the step size for a given implementation of the ANN 200 can be determined through an iterative process. The process begins by selecting a very small step size and running simulations of the ANN 200 against certain test input data. As the weights of the ANN 200 adjust, the output of the ANN 200 can be compared to an optimum output for the inputted test data. The value of the step size can be adjusted upwards until the output of the ANN 200 is mismatched with the input. However, since only a subset of data points is considered while estimating the gradient, taking larger gradient steps in SGD may throw the updated point very far from the optimum. In certain embodiments, only small step sizes are used.
- the adaptation process can also require that the connectivities be updated to a particular strength to make the adaptation effective (a smaller update in connectivity may not be differentiated from unadapted connectivity), so that a minimum step size or a minimal update is necessary.
- updates to the connectivity are performed with smaller step sizes and utilize multiple presentations of the same input to reach the desired adaptation level.
- the ANN 200 can perform both of these tasks (that is, solving sparse recovery problems and updating the connectivity between primary layer input nodes 210a-d and representation nodes 220a-e using SGD).
- the ANN 200 can function in two modes. In Mode 0, the ANN 200 can only perform a sparse recovery, because the connectivity between the primary layer input nodes 210a-d and the representation nodes 220a-e is held fixed.
- When functioning in Mode 0, the ANN 200 performs no update in connectivity.
- In Mode 1, the ANN 200 performs both sparse recovery and basis adaptation, with initial connectivity and input given as arguments to the ANN 200.
- the ANN 200 can also produce a sparse representation of the input and the connections between various nodes are updated using the obtained representation and the corresponding input to ensure learning.
- the ANN 200 operating in Mode 1 can learn to represent the initial sets of data (such as training set data) very well, but the ANN 200 can also perform well for images 130 that are similar, but not identical, to those included in an initial (or training) data set.
- the ANN 200 can adapt to the new images 130 and represent them more sparsely and more robustly because it can employ continuous learning.
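The two modes can be sketched as a flag that controls whether the connectivity update runs; the recovery and update steps below are simplified stand-ins, not the exact procedures described above:

```python
import numpy as np

def sparse_recovery(W, y, threshold=0.3):
    """Toy sparse recovery: drive each representation node and keep only
    responses above a threshold (a simplified stand-in)."""
    V = W @ y
    return np.where(V > threshold, V, 0.0)

def run(W, y, mode, eta=0.01):
    """Mode 0: sparse recovery only; connectivity is frozen.
    Mode 1: sparse recovery plus a local update of the connectivity."""
    V = sparse_recovery(W, y)
    if mode == 1:                              # learning enabled
        W = W + eta * np.outer(V, y)           # Hebbian-style strengthening
        W = np.maximum(W, 0.0)                 # weights stay non-negative
    return W, V

rng = np.random.default_rng(5)
W0 = rng.random((4, 9)) * 0.2
y = rng.random(9)

W_after0, _ = run(W0, y, mode=0)   # Mode 0: weights unchanged
W_after1, _ = run(W0, y, mode=1)   # Mode 1: weights may adapt
```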
- the ANN 200 described herein differs from traditional hierarchical assembly models, which attempt to explain the increasing complexity of receptive field properties along the visual pathway and later formed the foundation of convolutional neural networks. These traditional models assume that neurons in the cognitive centers recapitulate precise object details. However, accurate object image reconstruction is not always necessary for robust representation, and this deeply rooted assumption creates unwanted complexity in modeling object recognition.
- the ANN 200 described herein does not have to calculate reconstruction errors to assess its learning performance. By capturing dependencies that define objects 135 and their classes, it can produce remarkably consistent representations of the same object 135 across different conditions.
- the size, translation, and rotation invariance show that the ANN 200 can naturally link features that define an object 135 or its class together without ostensibly being designed to do so. It permits the non- linear transformation of the input signals into a representation geometry suitable for identification and discrimination.
- One aspect of the ANN 200 is that it can generate invariant responses to corrupted inputs in part because its design takes inspiration from biological systems.
- Sensory stimuli evoke high-dimensional neuronal activities that reflect not only the identities of different objects but also context, the brain’s internal state, and other sensorimotor activities.
- the high-dimensional responses can be mapped to object-specific low-dimensional manifolds that remain unperturbed by neuronal and environmental variability.
- One distinguishing feature of the ANN 200 in comparison to traditional frameworks is that the initial connectivity between the input nodes 210a-d and the representation nodes 220a-e in the discrimination (or representation) layers takes the variance structure of the input dataset into account and ensures that any two of the representation nodes 220a-e are less likely to fire together for any input. Moreover, the learning process does not utilize any label, nor require any pre-determined outcomes. It is entirely unsupervised, as the representations evolve with exposures to individual images. Thus, the recurrent weights do not reflect the correlation structure between pre-determined representation patterns.
- the learning rules are all local and can be modeled as $\Delta \Phi = \eta \left( y x^{T} - \Phi x x^{T} \right)$ and $\Delta w = \eta \, x x^{T}$, where y is an input vector, x is its representation in the discrimination (or representation) layer, $\Phi$ is the connectivity between the input nodes 210a-d and the representation nodes 220a-e in the discrimination (or representation) layer, $\eta$ is the learning rate, and w is the recurrent inhibition weight matrix.
- the updates enable the ANN 200 to learn comprehensive input structures without resorting to using reconstruction error or credit assignment.
- the learning rules are implemented through a combination of matrix operations and differential equations to compute and adjust the weights of the ANN 200.
- the ANN 200 adjusts connection strengths in an activity-dependent manner.
- the first term of the learning rule is a small increment of the connection strength when both one or more input nodes 210a-d and one of the representation nodes 220a-e in the representation layer are active. This update associates a feature (in the input) with the representation unit that captures the information.
- the second term indicates that when two of the representation nodes 220a-e in the recurrent layer are co-active (and mutually inhibited), the strengths of all connections from the nodes in the input layer 210a-d to these nodes are reduced.
- the inhibitory weights in the recurrent (second or representation) layer 220 are such that any two of the representation nodes 220a-e responding to similar inputs have strong mutual inhibition.
- These updates are essentially local Hebbian or anti-Hebbian rules, where connection updates are solely determined by the activity of the nodes.
- This configuration, i.e., the initial biased connectivity and local learning rules, distinguishes the ANN 200 from existing neural networks, which incorporate random initial connections from the input layer that do not update (e.g., the convolutional input strengths in other models).
- all activities in the nodes and the connections are non-negative, reflecting constraints from biological neural networks.
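By way of a non-limiting illustration, the local Hebbian/anti-Hebbian update and the non-negativity constraint described above may be sketched as follows in NumPy. All names, dimensions, and the one-shot rectified response are illustrative assumptions consistent with the two terms described above (a Hebbian potentiation η·x·yᵀ and an inhibition-mediated depression −η·(w x)·yᵀ), not the patent's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
n_input, n_repr = 16, 25      # e.g., input nodes 210a-d and representation nodes 220a-e
eta = 0.01                    # learning rate

Q = rng.uniform(0, 1, (n_repr, n_input))       # feedforward weights, input -> representation
w = 0.1 * rng.uniform(0, 1, (n_repr, n_repr))  # recurrent inhibition among representation nodes
np.fill_diagonal(w, 0.0)

def local_update(Q, w, y, x, eta):
    """One local step: a Hebbian term strengthens co-active input/representation
    pairs; an anti-Hebbian term weakens input connections to representation nodes
    that are co-active with (and inhibited by) other active nodes."""
    dQ = eta * (np.outer(x, y) - np.outer(w @ x, y))
    return np.clip(Q + dQ, 0.0, None)          # activities and weights remain non-negative

y = rng.uniform(0, 1, n_input)                 # one input pattern
x = np.clip(Q @ y - w @ (Q @ y), 0.0, None)    # crude one-shot rectified response
Q = local_update(Q, w, y, x, eta)
```

Note that the update uses only quantities available at the two nodes a connection joins (pre- and post-synaptic activity), with no reconstruction error or credit assignment.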
- the ANN 200 can denoise inputs and extract cleaner structures from them.
- the receptive fields of the representation nodes 220a-e of the ANN 200 can produce structures that resemble faces (along with random noise) inputted into the ANN 200 but are not specific to any input face.
- the receptive fields can be much less noisy than the inputted faces at all levels of training, as measured by average power in the highest spatial frequencies. (A higher mean power indicates higher noise content.)
- the ANN 200 can have the ability to learn from pure experience and generate consistent representations. It can achieve prospective robustness, defined as consistently representing input patterns it has never experienced.
- the ANN 200 has the ability to represent facial images not in the training set, including unseen pictures corrupted by Gaussian noises or with occlusions.
- the ANN 200 can generate sparse and consistent representations of the new faces. Representation of corrupted inputs can be nearly identical to that of the clean images with even images with large occlusion represented consistently.
- the specificity of the ANN 200 can be high for corruptions with all noise levels and occlusions.
- the ANN 200 trained on a specific set of images rapidly learns the receptive fields (in the representation, or second layer 220) that conform to the images. For example, in an ANN 200 trained using symbols from world languages, similarity between the receptive fields and the symbols increases rapidly as the ANN 200 repeatedly encounters the same characters. The specificity of symbols’ representations increases even faster, reaching a plateau with less than 10 exposures. Thus, the ANN 200 effectively captures structural features that are maximally informative about the input.
- the ANN 200 can learn to represent novel input types without compromising its previous discrimination abilities.
- the ANN 200 can be trained to represent a fixed set of symbols, followed by learning faces. Although learning faces after the characters can change the receptive field properties of a subset of nodes, for the ANN 200 the specificity of symbol representations before and after learning a different input, such as faces, remained comparably high.
- the ANN 200 can also maintain high specificity of face representations (or vice versa). In other words, the ANN 200 avoids the catastrophic forgetting problem encountered by many other neural network models.
- the ANN 200 can learn from images 130 of symbols that were corrupted, such as with different fractions of pixels flipped.
- the ANN 200 can have any number of nodes in its primary layer 210 and in its representation layer 220.
- the ANN 200 can have 256 primary nodes and 500 representation nodes.
- the ANN 200 is constructed so that it can successfully differentiate inputs before adaptation.
- the ANN 200 can be constructed in a number of ways to differentiate inputs before adaptation.
- the ANN 200 can use non-negative uniform connectivity where the connection strengths between the primary layer input nodes 210a-d and representation nodes 220a-e of the secondary layer are chosen to be values between 0 and 1.
- the connection weights are derived from a uniform distribution over (0, 1).
- the weights can be normalized such that the length of the weight vector corresponding to any representation node is 1.
- the ANN 200 can also be constructed using normally distributed connectivity where the weights are derived from a normal distribution with mean 0 and standard deviation 1.
- the weights can also be normalized to have length 1.
- the ANN 200 can also be constructed with decorrelating connectivity; in this case too, the weights are normalized to have length 1.
- the decorrelation can be based on the eigenvectors of the variance-covariance matrix of the inputs. In certain embodiments, only 150 eigenvectors were utilized as effective dimensions of the input space, since the variance of the input space along these vectors becomes saturated after 150 dimensions; however, other numbers of eigenvectors derived from the variance-covariance matrix of the inputs can be used.
- the Frobenius norm of the correlation and identity matrices' difference can be calculated and used to measure the difference between the two matrices. Lower Frobenius norms indicate better decorrelation. In certain embodiments, the Frobenius norm of the difference between the correlation matrix and the identity matrix was lowest for the decorrelating model of connectivity, indicating that it could decorrelate the nodes most.
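By way of a non-limiting illustration, the three connectivity constructions and the Frobenius-norm decorrelation measure described above may be sketched as follows. The dataset, dimensions, and the use of 50 (rather than 150) eigenvectors are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
n_input, n_repr, n_samples = 64, 50, 500

X = rng.uniform(0, 1, (n_samples, n_input))      # stand-in input dataset (rows = inputs)

# Non-negative uniform connectivity: U(0, 1) weights, each weight vector normalized to length 1.
Q_uniform = rng.uniform(0, 1, (n_repr, n_input))
Q_uniform /= np.linalg.norm(Q_uniform, axis=1, keepdims=True)

# Normally distributed connectivity: N(0, 1) weights, also normalized to length 1.
Q_normal = rng.standard_normal((n_repr, n_input))
Q_normal /= np.linalg.norm(Q_normal, axis=1, keepdims=True)

# Decorrelating connectivity: rows taken from the leading eigenvectors of the input
# variance-covariance matrix, normalized to length 1.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
Q_decor = eigvecs[:, np.argsort(eigvals)[::-1][:n_repr]].T
Q_decor /= np.linalg.norm(Q_decor, axis=1, keepdims=True)

def decorrelation_norm(Q, X):
    """Frobenius norm of (node-response correlation matrix - identity);
    lower values indicate better decorrelation."""
    C = np.corrcoef(X @ Q.T, rowvar=False)
    return float(np.linalg.norm(C - np.eye(C.shape[0])))

scores = {"uniform": decorrelation_norm(Q_uniform, X),
          "normal": decorrelation_norm(Q_normal, X),
          "decorrelating": decorrelation_norm(Q_decor, X)}
```

As in the described embodiments, the decorrelating construction yields the lowest norm here because projections onto distinct eigenvectors of the sample covariance are uncorrelated by construction.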
- each image 130 can correspond to each of the 500 representation nodes, and each of the pixels in each image correspond to each of the primary nodes.
- the ANN 200 can adapt to any number of input sets of images.
- the ANN 200 can adapt to input sets containing 500, 800, or 1000 inputs.
- Each input can be presented repeatedly (for example, up to 100 times) to allow for adaptation (for instance using SGD) with the inputs presented one at a time in a sequence (with the order of their presentation randomly chosen).
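The presentation schedule described above may be sketched, by way of a non-limiting illustration, as the following loop (the input count and dimensions are placeholders; the weight update itself is elided):

```python
import numpy as np

rng = np.random.default_rng(2)
inputs = [rng.uniform(0, 1, 16) for _ in range(5)]   # stand-in for 500, 800, or 1000 inputs
n_presentations = 100                                # each input presented up to 100 times

presented = []
for _ in range(n_presentations):
    order = rng.permutation(len(inputs))             # presentation order randomly chosen per pass
    for idx in order:
        y = inputs[idx]                              # inputs presented one at a time in sequence
        # ... apply the local weight update for input y here ...
        presented.append(int(idx))
```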
- Changes can be calculated with respect to the initial decorrelating connectivity and represent how strongly a particular node of the representation nodes 220a-e is connected to primary layer nodes 210a-d.
- each such connection links an input node that is one of the input nodes 210a-d with a representational node of the representation nodes 220a-e.
- these connections can reflect the representation nodes’ 220a-e tuning properties.
- different representation nodes 220a-e get tuned to different structures from the inputs.
- a distribution of cosine similarity of the connectivity changes for different nodes across different states can be used to determine if connectivity similarity was maintained while repeatedly encountering symbols.
- a sustained similarity level indicates that the distinctiveness of node tunings remained unaltered.
- While the connectivity structure of the ANN 200 does not change qualitatively for individual nodes, the similarity of connectivity to nodes increases slightly over states and then saturates, which illustrates that the connections to individual representation nodes 220a-e change slightly as inputs are encountered repeatedly and then reach a stable state after a certain number of encounters. This demonstrates how the attainment of such a stable state in nodes' connectivity eventually reaches saturation. This suggests that in certain embodiments of the ANN 200, only the first few encounters of any input change the structure of connectivity; the representations of the inputs change based on the immediate experience of the ANN 200 and saturate afterward. This saturation highlights the critical difference between the framework of the ANN 200 and the classical efficient coding paradigm, where the representations of inputs depend upon their overall statistics and not just immediate encounters.
- a low average similarity (< 0.5) is observed, indicating that the connections of different nodes changed differently.
- the average similarity remains consistently small and slightly decreased with the state.
- As the ANN 200 encounters an input an increasing number of times, the structures outputted by the ANN 200 become more input-like. In certain embodiments, the ANN 200 successfully identifies comprehensive, unique structures from the inputs by encountering the same inputs repeatedly; however, with an increasing number of distinct inputs, the representation nodes 220a-e tune to more localized structures.
- Cosine similarity between changes in connectivity and input to the ANN 200 can be measured at different stages. In certain embodiments, the similarity increased with the network state but decreased with the increasing number of inputs.
- In certain embodiments, the representations of the ANN 200 become sparser with more encounters of the inputs. Moreover, with an increasing number of inputs, the responses of the ANN 200 are confined to a smaller number of nodes. Representation efficiency can be quantified in three ways to highlight the changes that occur while adapting to a varying number of inputs (response profiles' correlation, kurtosis, and sparsity). These measures can be evaluated across different states of the ANN 200, as well as across the different numbers of inputs.
- the representation nodes' 220a-e response becomes increasingly non-Gaussian.
- Increasing the number of input presentations can also increase the kurtosis of node response profiles. Both experience and sampling of inputs can increase the representation efficiency of the ANN 200.
- the correlation among the representation nodes 220a-e can also decrease (as indicated by the smaller Frobenius norm of the difference of correlation and identity matrices and by the L0 and L1 sparsity measures) with more encounters of the same set of inputs, as well as encounters of new inputs.
- the responses of the ANN 200 can become sparser with the adaptation states as well as with the number of inputs.
- Nodal response profiles’ kurtosis calculations can assess the efficiency in terms of representation sparseness.
- the correlation among nodes can be measured, and the Frobenius norm of the difference between correlation and identity matrices can be calculated.
- the norm too can decrease with the states and the number of inputs, indicating a decorrelation trend.
- the sparsity of representations can also show similar trends for ANNs 200 in accordance with certain embodiments.
- Both the L0 and L1 sparsity measures can decrease with the ANN 200 network state while maintaining the levels across the number of inputs.
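The three efficiency measures discussed above (response-profile kurtosis, the Frobenius norm of the correlation matrix's deviation from identity, and L0/L1 sparsity) may be sketched, by way of a non-limiting illustration, as follows; the stand-in response matrix and the exact form of each measure are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in response matrix: rows are inputs, columns are representation nodes.
R = np.clip(rng.standard_normal((200, 50)), 0.0, None)

def response_kurtosis(R):
    """Mean excess kurtosis of node response profiles (higher = more non-Gaussian)."""
    z = (R - R.mean(axis=0)) / R.std(axis=0)
    return float((np.mean(z ** 4, axis=0) - 3.0).mean())

def decorrelation_norm(R):
    """Frobenius norm of (node correlation matrix - identity); lower = more decorrelated."""
    C = np.corrcoef(R, rowvar=False)
    return float(np.linalg.norm(C - np.eye(C.shape[0])))

def l0_sparsity(R, tol=1e-12):
    """Fraction of nonzero responses; lower = sparser (an L0-style measure)."""
    return float(np.mean(np.abs(R) > tol))

def l1_sparsity(R):
    """Mean L1 norm of the representation of each input; lower = sparser."""
    return float(np.mean(np.abs(R).sum(axis=1)))

measures = (response_kurtosis(R), decorrelation_norm(R), l0_sparsity(R), l1_sparsity(R))
```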
- the performance of the ANN 200 in accordance with certain embodiments exceeds that obtained through known approaches such as matrix factorization, where the efficiency in representation drops with increasing inputs.
- the ANN 200 can produce consistent representations at different network states across all types of corruption. For example, when experiencing five different inputs in their corrupted forms, the representations are consistent across different forms of corruption and across different states of the ANN 200.
- the specificity of representations for different forms of corruption can be calculated using the z-scored cosine similarity between the representations of uncorrupted and corrupted inputs. Specificity can increase slightly with practice, i.e., after encountering the inputs a greater number of times for all forms of corruption (with high specificity of representations being observed with a slight increase in the network's 100th state).
- the representations of the ANN 200 in the 100th state can be sparser than the representations in the 50th state.
- the specificity can decrease with increasing levels of corruption, occlusion, or addition of noise.
- the representations' consistency increased with the representation nodes 220a-e of the ANN 200 becoming more specific by getting tuned to unique features from the inputs.
- the ANN 200 does not need to know the entire input space's statistics to be efficient and can produce consistent representations of inputs under varying circumstances.
- the ANN 200 can similarly generalize an input when seeing various variations of it. When experiencing corrupted inputs (such as inputs with 10%-20% of their pixels altered), the change in connectivity in the ANN 200 can resemble uncorrupted inputs, much as in the case of adaptation to non-corrupted symbols. Although similarities can vary from input to input, the maximum similarity observed with any input to the ANN 200 is high. The ANN 200 is able to find the consistency that exists across the input variants and adapt to it, similar to complex deep or convolutional neural networks that have been shown to perform in this manner. However, unlike embodiments of the ANN 200 (including those of only two layers and learning from 800 examples), these other networks are very complex, contain multiple layers, and require numerous examples.
- FIG. 3 is a diagram illustrating how inputs in an input sequence are tuned in the representation layer for an ANN 200 in accordance with certain embodiments.
- a series of symbol images 310a-c can be input sequentially in time into the input layer input nodes 210a-d of the ANN 200.
- the ANN 200 learns each symbol in the series of symbol images 310a-c and can reconstruct the symbol from the output of the representation nodes 220a-e.
- the weights between the input nodes 210a-d and the representation nodes 220a-e or the weights between representation nodes 220a-e or both can be updated.
- the ANN 200 does not experience catastrophic forgetting.
- the ANN 200 captures its characteristics and remembers them, as represented on the sequence of grids 320a-c.
- the fact that each symbol takes up its own square of the grids 320a-c illustrates that the ANN 200 does not forget and is able to learn sequentially.
- Symbol grid 330 represents a subset of learned tuning properties of the representations. The symbol grid 330 demonstrates that the most informative components of the inputted symbols 310 are captured by the ANN 200.
- FIG. 4 is a diagram illustrating how corrupted inputs included in an input sequence can be learned by the representation layer 220 for an ANN 200 in accordance with certain embodiments.
- the series of corrupted symbol forms 410 which, for instance, may be generated by randomly flipping a certain percentage of pixels (such as 10% or 20% of the pixels) is inputted into the input nodes 210a-d of the ANN 200.
- the series of corrupted symbol forms 410 can include around 100 different corruptions of each symbol.
- the tuning properties 420 learned by the ANN 200 are clean versions of the inputted symbol forms 410.
- FIG. 5 is a diagram illustrating how characteristics of an object, varying views of which are inputted, are captured in the output of an ANN 200 in accordance with certain embodiments.
- 3D models of different objects were rotated in x and y directions to generate different object views (depicted here with an example of human face object 510).
- a subset of views 520 from all objects can be selected and presented to the ANN 200.
- Sample tuning properties 530 learned by the ANN 200 can include single views and superpositions of multiple views.
- two groups of cells 540 emerge from the response of the ANN 200 to the inputted views 520.
- One group of cells 540a is specific to the object identity while the other group of cells 540b is specific to the direction and angle of rotation.
- the output of cells 540a and 540b can be used to identify the object and its rotation, as shown in the columns of the output grid in Fig. 5C.
- FIG. 6 is a diagram of classification network 600 comprising a bi-layer ANN connected to a classification layer in accordance with certain embodiments.
- the first two layers of classification network 600 function in the same manner as the two layers of the bi-layer ANN 200 above.
- the classification network 600 comprises a first layer of input nodes 610a-d (or first layer nodes), a second layer of discrimination nodes 620a-e (or representation or second layer nodes), and a third layer of classification nodes 630a-e (or third layer nodes).
- Nodes 630a-e in the classification layer can receive direct excitatory input from a single node in the discrimination layer (nodes 620a-e) while also receiving in parallel feedforward inhibitions that mirror the excitatory input from nodes in the input layer (input nodes 610a-d).
- the nodes in the classification layer 630a-e can also have recurrent excitatory connections and receive a global inhibitory signal 640 imposed on all nodes in the classification layer 630a-e (which helps limit spurious and/or runaway activities in this layer).
- the global inhibition 640 is a constant.
- the value for global inhibition 640 can be any value capable of preventing runaway behavior in the nodes 630a-e of the classification layer.
- the global inhibition 640 can be a constant, such as 10. This value can be set based on the expected inputs to the classification nodes 630a-e.
- the excitatory connections between each of the nodes in the discrimination layer 620 and its corresponding node in the classification layer 630 can be a constant, such as 1.
- the inhibitory weights for the connections between the nodes in the input layer 610a-d and the nodes in the classification layer 630a-e can also be a constant.
- the number of nodes in the discrimination layer 620a-e can equal the number of nodes in the classification layer 630a-e.
- nodes in each layer can be associated with each other by grouping nodes in each layer and relating those nodes to a group of nodes in the other layer. For instance, in a classification network 600 where there are twice as many nodes in the discrimination layer 620 than there are in the classification layer 630, each node in the classification layer 630 can be connected to two nodes in the discrimination layer 620.
- Learning in the classification network 600 can also be based on local learning rules. Learning for the first two layers (the input layer nodes 610a-d and the discrimination layer nodes 620a-e) can be accomplished using the same technique described above with respect to the bi-layer ANN 200.
- the node(s) in the third layer (the classification layer 630a-e) are augmented when a node in the discrimination layer 620a-e and the classification layer 630a-e are active at the same time or when two nodes in the classification layer 630a-e are active at the same time.
- the weights between the nodes in the classification layer 630a-e and the input nodes 610a-d, and the weights from the global inhibition, do not change.
- the classification network 600 is designed using principles of Maximal Dependence Capturing (MDC), which prescribes that individual nodes (neurons) should capture maximum information about distinct objects.
- the classification network 600 is designed to be able to differentiate objects in its initial response.
- the weights between the input layer input nodes 610a-d and the discrimination layer nodes 620a-e are calibrated to allow distinct inputs to elicit disparate responses without specific learning.
- the initial bias in the connectivity is set to minimize the chances of co-activating any two of the discrimination nodes 620a-e at the same time, which maximizes distinctions in the classification network's 600 initial response to various inputs.
- the connectivity matrix, i.e., the matrix of weights between each node of the input layer 610a-d and each node of the discrimination layer 620a-e, can be set so that the variance-covariance matrix of the response profiles of nodes in the representation layer matches the identity matrix.
- the nodes in the discrimination layer 620a-e are modeled as leaky integrate-and-fire neurons with thresholding.
- the nodes in the discrimination layer 620a-e can have a dynamic response based on the following equation: τ dx/dt = −x + T(Qy − wx), where x is the response vector for the nodes in the discrimination layer, y is the input vector to the layer, Q is the feedforward connectivity, w is the recurrent inhibition weight matrix, and the operator T(.) is the thresholding function (ReLU) that gives rise to the thresholding activity.
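By way of a non-limiting illustration, leaky dynamics with ReLU thresholding of the kind described above may be simulated with simple Euler steps. The form τ dx/dt = −x + T(Qy − wx), the names Q and w, and all dimensions are illustrative assumptions rather than the patent's exact equation:

```python
import numpy as np

def relu(v):
    return np.clip(v, 0.0, None)     # the thresholding function T(.)

def discrimination_response(Q, w, y, tau=1.0, dt=0.1, steps=200):
    """Euler integration of assumed leaky dynamics with ReLU thresholding:
    tau * dx/dt = -x + T(Q y - w x), with Q the feedforward weights and
    w the recurrent inhibition among discrimination nodes."""
    x = np.zeros(Q.shape[0])
    for _ in range(steps):
        x = x + (dt / tau) * (-x + relu(Q @ y - w @ x))
    return x

rng = np.random.default_rng(4)
Q = rng.uniform(0, 1, (10, 8))            # toy feedforward connectivity
w = 0.05 * rng.uniform(0, 1, (10, 10))    # weak recurrent inhibition
np.fill_diagonal(w, 0.0)
x = discrimination_response(Q, w, rng.uniform(0, 1, 8))
```

With weak inhibition the integration converges toward a non-negative fixed point; the rectification keeps all activities non-negative throughout.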
- the dynamic response of the nodes in the classification layer 630a-e can function in the same way as the nodes in the discrimination layer 620a-e with two primary differences.
- the input to each node in the classification layer (to each of classification nodes 630a-e) has two components, the excitatory input from the node in the discrimination layer 620a-e and the inhibitory input from the input layer input nodes 610a-d (which can be weighted inhibitory input from a single node of the input nodes 610a-d or from some combination of the input nodes 610a-d).
- the inhibitory recurrent connection matrix w is changed to the recurrent connection matrix in the classification layer, w_class, which is equal to w_class_inhib minus w_class_excit.
- the effective layer dynamics for the classification layer 630a-e can be modelled by an equation of the same form as the discrimination layer dynamics, where the excitatory drive is the signal from the nodes in the discrimination layer and y is the signal from the nodes in the input layer 610a-d.
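By way of a non-limiting illustration, the classification-layer input structure described above (one-to-one excitation from the discrimination layer, feedforward inhibition relayed from the input layer, recurrent excitation among classification nodes, and a constant global inhibition) may be sketched as follows. The dynamical form, the names B and W_excit, and all parameter values are illustrative assumptions, not the patent's exact equation:

```python
import numpy as np

def relu(v):
    return np.clip(v, 0.0, None)

def classification_response(x_disc, y, B, W_excit, g=10.0, tau=1.0, dt=0.1, steps=200):
    """Assumed classification-layer dynamics: one-to-one excitation from the
    discrimination layer (x_disc, weight 1), constant feedforward inhibition
    B @ y relayed from the input layer, recurrent excitation W_excit among
    classification nodes, and a constant global inhibition g."""
    c = np.zeros(len(x_disc))
    for _ in range(steps):
        drive = x_disc - B @ y + W_excit @ c - g
        c = c + (dt / tau) * (-c + relu(drive))
    return c

rng = np.random.default_rng(7)
x_disc = rng.uniform(0, 1, 10)            # signal from discrimination nodes 620a-e
y = rng.uniform(0, 1, 8)                  # signal from input nodes 610a-d
B = 0.05 * np.ones((10, 8))               # constant inhibitory weights from the input layer
W_excit = np.zeros((10, 10))              # recurrent excitation initially set at 0
c = classification_response(x_disc, y, B, W_excit, g=0.1)
```

Because the weights to and from the inhibitory paths are constant, only the excitatory terms are subject to learning in this sketch.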
- the classification network 600 can update the connections from the nodes in the input layer 610a-d to optimize an objective function, where y is an input vector, x is the representational vector in the discrimination layer 620a-e, and Q is the matrix of the weights between the nodes in the input layer 610a-d and the nodes in the discrimination layer 620a-e.
- the updates in the connectivity for this function can be stated accordingly, where η is the learning rate.
- the recurrent inhibiting weights w in the discrimination layer 620a-e can be set based on the feedforward connectivity so that similarly tuned nodes inhibit each other. In certain embodiments, there is no normalization of the feedforward connectivity before calculating the recurrent weights.
- the weights between nodes in the discrimination layer 620a-e and the nodes in the classification layer 630a-e can be updated based on the activities of the relevant two nodes.
- the recurrent excitatory connections between the nodes within the classification layer 630a-e can initially be set at 0, while all of the nodes in this layer receive global inhibition.
- the weights can then be updated based on the sum of potentiation between any pair of classification nodes 630a-e. For instance, when two nodes are co-active together, the potentiation for their connection increases. Alternatively, if only one of the two nodes is active at a set time, then the potentiation of their connection decreases.
- the connection weight between any two nodes in the classification layer is set to 1 if the sum of all potentiation values after encountering an arbitrary number of inputs reaches a preset threshold. All other weights remain 0.
- the potentiation values of all possible connections are then reset to zero and the process of updating them restarts. Another way of expressing this updating of weights: the excitatory weight between two classification nodes is set to 1 once their accumulated potentiation (incremented when both nodes are co-active, decremented when exactly one is active) reaches the preset threshold, and remains 0 otherwise.
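By way of a non-limiting illustration, the potentiation-and-binarization rule described above may be sketched as follows. The reset scope (all potentiation values reset when any connection crosses threshold) and all names are illustrative assumptions:

```python
import numpy as np

def update_classification_weights(activity_history, theta=5.0):
    """Binarized recurrent excitatory weights among classification nodes:
    potentiation p[i, j] gains 1 when nodes i and j are co-active and loses 1
    when exactly one is active; when any p reaches the preset threshold theta,
    the corresponding weights are set to 1 and all potentiation values reset."""
    n = activity_history.shape[1]
    p = np.zeros((n, n))
    w = np.zeros((n, n))
    for act in activity_history:                  # one binary activity vector per time step
        a = (act > 0).astype(float)
        co_active = np.outer(a, a)                # both nodes active -> potentiate
        one_active = np.outer(a, 1 - a) + np.outer(1 - a, a)  # exactly one active -> depress
        p += co_active - one_active
        crossed = p >= theta
        if crossed.any():
            w[crossed] = 1.0                      # binarize the crossed connections
            p[:] = 0.0                            # restart the potentiation process
    np.fill_diagonal(w, 0.0)                      # no self-connections
    return w

history = np.zeros((12, 3))
history[:, 0] = 1.0                               # nodes 0 and 1 always co-active,
history[:, 1] = 1.0                               # node 2 never active
w_class = update_classification_weights(history, theta=5.0)
```

In this toy run, only the persistently co-active pair (nodes 0 and 1) acquires a reciprocal weight of 1.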
- the representation function of the classification network 600 maximizes differences between objects 135 and represents them distinctively.
- the classification network 600 can capture shared features that identify an object 135 in different perspectives, or a class.
- the distinguishing features of the same type of objects 135 can be linked together using mutual excitation and discerned from similar features of other categories using inhibition.
- recurrent excitation and broad inhibition are prevalent in the upper layers of sensory cortices.
- the design of the classification network 600 draws inspiration from these biological systems by adding a recurrent layer, the classification layer 630 (a third layer), to simulate these circuit motifs and perform computations for classification.
- Nodes in this layer receive direct excitatory input from the discrimination layer 620 (the second layer) in a column-like, one-to-one manner. In parallel, they receive feedforward inhibitions that mirror the excitatory input from the input layer 610.
- the nodes in the classification layer 630 can also have recurrent excitatory connections between each other and receive global inhibition imposed on all nodes of this layer.
- the connections between classification nodes 630a-e, and between classification nodes 630a-e and discrimination nodes 620a-e, can also be adaptive. For example, the learning rule is that the connections between two excitatory nodes (discrimination to classification, and between classification neurons, or nodes) strengthen when both are active. There is no weight change to connections to and from inhibitory neurons (or nodes).
- This architectural configuration of the classification network 600 permits capturing class-specific features from objects 135.
- nodes in the classification layer 630 receive excitatory input from the discrimination layer 620 and feedforward inhibition relayed from the input layer 610. This combination passes the difference between the updated excitatory output and non-updated inhibitory output to inform the classification layer 630 about the features learned in the discrimination layer 620.
- the lateral excitatory connection between the classification nodes 630a-e links the correlated features that provide the class information.
- global inhibition 640 ensures that only nodes receiving sufficient excitatory input can be active to reduce spurious and runaway activities. The result is that any of the classification nodes 630a-e with reciprocal excitation display attractor-like activities for class-specific features.
- the classification abilities of the classification network 600 are superior to traditional approaches. For instance, when classifying objects in the MNIST handwritten digit dataset, training with only 25% of unlabeled samples results in the receptive fields of the classification network 600 resembling the digits in the discrimination layer 620. Further, population activities in the classification layer 630 of the classification network 600 exhibit high concordance for the same digit type but maintain distinction among different classes. The classification network 600 can correctly identify 94% of the digit types when using pooled nodes from the most consistently active nodes of each group. On the other hand, the most sophisticated existing network models currently achieve 85-99% accuracy, but they all need supervision in some form. For example, the self-supervised networks require digit labels in the initial training.
- the classification network 600 is robust in recognizing and categorizing individual symbols, faces, and handwritten digits without explicitly being designed for these tasks. Specifically, in its discrimination layer 620, the classification network 600 can identify features that uniquely identify an object 135 and, in the classification layer 630, link those features to form class-specific node ensembles. This last feature allows the classification network 600 to identify 3-dimensional objects 135, from views varying in size, position, and perspective. The problem of relating various views to extract the object's identity is particularly challenging. Various other neural network models require highly sophisticated algorithms with deep convolution layers and considerable supervision to achieve good performance.
- In the classification network 600, different views of the same object form an image class that has shared features, which allows the classification network 600 to capture shared features of an image class without ostensibly being designed to do so.
- the classification network 600 can learn to consistently represent 3D objects 135 varying in size, position, and perspective.
- the classification network 600 can identify objects 135 from various sizes and positions. For example, after experiencing several short clips of contiguous movie frames of objects 135 from various positions and sizes where random clips could be partially overlapped but covered less than 33% of the entire animation sequence in total, the classification network 600 can learn specific views and superpositions of different objects 135 in the input. When analyzing the entire animation sequence (much of which the classification network 600 had not experienced, > 67% of all views), representations of different frames are distinct in the discrimination layer 620 and nodes are persistently active over large animation portions in the classification layer 630 (for all objects 135). Active node ensembles are specific for individual objects 135 even when there were high similarities between some of them. For the classification network 600, in the representation domain, the overall similarity between the same object’s views are significantly higher than the similarity between images of distinct objects.
- classification nodes 630a-e can show consistent responses to the same object 135 regardless of the presentation angle, when presented with an animation of 3D rotation sequences with training of the classification network 600 on short clips of rotation along the vertical axis. This is true even for highly irregularly shaped models. For example, with respect to inputs of four 4-legged animals, fluctuations in representations occurred at similar viewpoints, reflecting their common features. Overall, the similarity between the different perspectives of the same object is high but low between different objects for the classification network 600. Therefore, the classification network 600 is able to generate invariant identity representations even when the classification network 600 only experiences less than a third of all possible angles. Moreover, the classification network 600 has the capacity for invariant representation and does not need to encounter all possible variations to represent objects 135 consistently.
- the identity of an object 135 is embedded in the structural relationships among its features. These relationships, or dependencies, can be utilized to encode object identity.
- the classification network 600 maximally captures these dependencies to identify the presence of an object 135 without requiring accurate details of the input patterns.
- the specific configurations of classification network 600 allow dependence capturing to permit invariant representations.
- This design is distinct from the hierarchical assembly model, which explains the increasing complexity of receptive field properties along the visual pathway and later formed the foundation of convolutional neural networks. These models assume that neurons in the cognitive centers recapitulate precise object details. However, accurate object image reconstruction is not necessary for robust representation, and this deeply rooted assumption can create unwanted complexity in modeling object recognition.
- the classification network 600 does not calculate reconstruction errors to assess its learning performance.
- the classification network 600 can naturally link features that define an object or its class together without ostensibly being designed to do so. It can permit the non-linear transformation of the input signals into a representation geometry suitable for identification and discrimination.
- the classification network 600 can illustrate how dependence capturing may learn about objects 135 through local and continuous changes at individual synapses and stably represent them (in a similar fashion to biological systems).
- the two circuit architectures are based on known connectivity patterns. Although both designs capture feature dependencies defining objects 135 and classes, their connections differ and serve different functions.
- the discrimination layer 620 makes individual representations as distinctive as possible.
- the classification layer 630 binds class-specific features to highlight and distinguish different object types. This two-pronged representation may give rise to perceptual distances that are not linearly related to the distances in input space.
- the representation specificity assesses how specific an input's representation is.
- the pairwise similarity between all representations of all objects is calculated to obtain a similarity matrix S.
- the z-score of the similarity of an input's representation to all other representations is then calculated.
- dot operation (.) denotes elementwise calculations.
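The specificity calculation described above can be sketched in NumPy. This is an illustrative reconstruction, not the patent's exact formula: cosine similarity is assumed for the pairwise similarity matrix S, and the z-score is taken over each representation's similarities to all other representations.

```python
import numpy as np

def representation_specificity(R):
    """Specificity of each input's representation.

    R: (n_inputs, n_nodes) array, one representation per row.
    Pairwise cosine similarity between all representations gives the
    similarity matrix S; each row's off-diagonal similarities are then
    z-scored to measure how distinctive that representation is.
    (Illustrative sketch; the exact similarity measure is an assumption.)
    """
    norms = np.linalg.norm(R, axis=1, keepdims=True)
    Rn = R / np.clip(norms, 1e-12, None)
    S = Rn @ Rn.T                                       # similarity matrix S
    n = S.shape[0]
    off = S[~np.eye(n, dtype=bool)].reshape(n, n - 1)   # similarities to all others
    z = (off - off.mean(axis=1, keepdims=True)) / off.std(axis=1, keepdims=True)
    return S, z
```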
- a power spectrum analysis can be performed. Both the images 130 and learned images can be Fourier-transformed, and their log-power calculated. The 2D log-power of the images 130 and the learned structures can be radially averaged to obtain the 1D power spectrum. The presence of noise is indicated by a higher power in higher frequencies of the spectrum. The comparisons can be made using the highest 20% of the frequencies.
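A minimal NumPy sketch of the described power spectrum analysis, under the assumption that the radial average is taken over integer radii and that the "highest 20% of the frequencies" means the top fifth of the radial bins:

```python
import numpy as np

def radial_log_power_1d(image, top_frac=0.2):
    """Radially averaged 1D log-power spectrum of a 2D image.

    Returns the 1D spectrum and the mean log-power over the highest
    `top_frac` of frequencies, which the analysis above uses to compare
    noise levels (binning details are assumptions).
    """
    F = np.fft.fftshift(np.fft.fft2(image))
    log_power = np.log(np.abs(F) ** 2 + 1e-12)
    h, w = image.shape
    cy, cx = h // 2, w // 2
    y, x = np.indices((h, w))
    r = np.sqrt((y - cy) ** 2 + (x - cx) ** 2).astype(int)
    # radial average: mean log-power at each integer radius
    spectrum = np.bincount(r.ravel(), log_power.ravel()) / np.bincount(r.ravel())
    n_top = max(1, int(top_frac * len(spectrum)))
    return spectrum, spectrum[-n_top:].mean()
```

A noisy image should show higher mean log-power in the top frequency bins than a smooth one, matching the noise indicator described above.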
- the representation of different views of 3D objects in the classification layer 630a-e consists of nodes that are consistently active for all views of the object.
- the overall consistency of object representation in the classification layer 630a-e of the classification network 600 can be calculated.
- the cosine similarity between the representations of consecutive views of the object 135 can be measured.
- the variation in the similarity indicates the consistency in representations.
- a lower variation in the similarity measures implies higher consistency and vice versa.
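The consistency measure described above can be sketched as follows; using the standard deviation as the measure of "variation" is an assumption:

```python
import numpy as np

def view_consistency(reps):
    """Consistency of an object's representation across consecutive views.

    reps: (n_views, n_nodes) array of classification-layer activity,
    one row per view. Cosine similarity is measured between each pair
    of consecutive views; a lower spread of these similarities implies
    a more consistent representation.
    """
    a, b = reps[:-1], reps[1:]
    sims = np.sum(a * b, axis=1) / (
        np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return sims, sims.std()
```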
- FIG. 7 is an illustration demonstrating how characteristics of an object 135, varying views of which are inputted, are captured in the output of a classification network 600 in accordance with certain embodiments.
- Animations were rendered as movie frames depicting size variations (SF) 730 and position variations (PF) 740. Examples of different position variations 721a and 721b are shown for a car on a road in box 720. Examples of size variations for a minivan (711a and 711b) are shown in box 710. Short sequences of these frames 730 and 740, generally not covering more than 33% of the entire sequences of size variation frames 730 and position variation frames 740 in total, can be randomly selected and fed into the classification network 600.
- the classification network 600 can capture complete object shapes varying in sizes and positions.
- Chart 750 comparing similarity scores between the same objects and between different objects shows that the average similarities between representations of frames belonging to the same object (self) are considerably higher than the representation similarities between frames of distinct objects (other).
- Inputted images 130 to the neural network architecture 140 can include any number of pixels, such as 100 x 100 pixels.
- the number of discrimination layer 620 nodes and classification nodes 630 (when used) can vary.
- the number of discrimination layer 620 nodes and classification nodes 630 can vary depending on the pixel number of the inputs to the neural network architecture 140.
- the number of nodes in the discrimination layer 620 can be 500 or 1000.
- discrimination layer 620 size can be 500 nodes.
- the discrimination layer 620 and classification layers 630 both include 10,000 nodes.
- FIG. 8 illustrates a flow chart for an exemplary method 800, according to certain embodiments.
- Method 800 is merely exemplary and is not limited to the embodiments presented herein. Method 800 can be employed in many different embodiments or examples not specifically depicted or described herein. In some embodiments, the steps of method 800 can be performed in the order presented.
- the activities of method 800 can be performed in any suitable order.
- one or more of the steps of method 800 can be combined or skipped.
- system 100 and/or computer vision system 150 can be configured to perform method 800 and/or one or more of the steps of method 800.
- one or more of the steps of method 800 can be implemented as one or more computer instructions configured to run at one or more processing devices 201 and configured to be stored at one or more non-transitory computer storage devices 202.
- non-transitory memory storage devices 202 can be part of a computer system such as system 100 and/or computer vision system 150.
- the processing device(s) 201 can be similar or identical to the processing device(s) 201 described above with respect to computer system 100 and/or computer vision system 150.
- step 810 the weights between the input layer and the representation layer of the neural network architecture, and the recurrent weights between the nodes in the representation layer, are initialized.
- the manner in which the weights are initialized can vary.
- the initial weights between the nodes in the input layer and the nodes in the representation layer can be calculated based on the eigenvectors of the variance-covariance matrix of the inputs.
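A hedged sketch of this initialization, assuming each representation node's incoming weight vector is set to one leading eigenvector of the inputs' variance-covariance matrix (the exact mapping from eigenvectors to weights is left open by the text):

```python
import numpy as np

def init_input_weights(X, n_rep):
    """Initialize input-to-representation weights from input statistics.

    X: (n_samples, n_pixels) matrix of vectorized input images.
    The eigenvectors of the variance-covariance matrix of the inputs,
    ordered by decreasing variance, are used as initial weight rows.
    (Illustrative; assumes n_rep <= n_pixels.)
    """
    C = np.cov(X, rowvar=False)            # variance-covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)   # symmetric eigendecomposition
    order = np.argsort(eigvals)[::-1]      # largest variance first
    return eigvecs[:, order[:n_rep]].T     # shape (n_rep, n_pixels)
```

Because the eigenvectors of a symmetric matrix are orthonormal, the resulting initial weight rows are mutually orthogonal, which fits the stated goal of minimizing the chance that two representation nodes are active at the same time.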
- an image included in an input sequence is input into the nodes of the input layer.
- each pixel can be input into a separate node.
- the number of input nodes is equal to the number of pixels in the images of the data set to be analyzed.
- each pixel is input into the input layer without being preprocessed, thereby giving the corresponding input node the value of that pixel.
- the images in the data set may be preprocessed.
- the values of each image may be scaled in a certain manner, such as by scaling all image values to be within a certain range (such as from 0 to 1).
- Certain transforms, such as the Fourier transform or a wavelet transform, can be performed on the image before inputting the image data into the nodes of the input layer.
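The scaling step might look like the following minimal sketch; min-max scaling to [0, 1] follows the range mentioned above, and is only one of the possible preprocessing choices:

```python
import numpy as np

def preprocess(image):
    """Scale image values into the range [0, 1] before input.

    A minimal sketch of the scaling step described above; other
    transforms (Fourier, wavelet) could be applied here instead.
    """
    img = np.asarray(image, dtype=float)
    lo, hi = img.min(), img.max()
    if hi <= lo:                      # constant image: avoid divide-by-zero
        return np.zeros_like(img)
    return (img - lo) / (hi - lo)
```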
- step 830 initial values of the nodes included in the representation layer are calculated by multiplying the vector of values of the nodes of the input layer in step 820 by the matrix of weights for the connections in the neural network architecture between the nodes in the input layer and the nodes in the representation layer.
- these weights are the initial weights of the ANN, which were calculated in step 810.
- these weights are updated in accordance with step 850 below.
- a behavior model for the nodes in the representation layer is applied to calculate the values for the nodes in the representation layer.
- Various types of behavior models can be used, including those models drawn from biological neural networks.
- the behavior of the nodes in the representation layer of the ANN can be modeled as “Leaky Integrate-and-Fire” neurons.
- the values from the recurrent connections between the nodes in the representation layer can be used to calculate the values of the nodes in the representation layer. The calculation of the values of the nodes can be performed iteratively, until the value of each node reaches a steady state.
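One way to sketch this iterative settling is a leaky, rectified rate update, shown below as a simplified stand-in for the Leaky Integrate-and-Fire behavior; the leak rate, iteration count, and convergence tolerance are all assumptions:

```python
import numpy as np

def settle_representation(x, W_in, W_rec, n_iter=200, leak=0.1):
    """Iterate representation-node values to a steady state.

    x: input-node values; W_in: input-to-representation weights;
    W_rec: recurrent (inhibitory) weights between representation nodes.
    The feedforward drive plus recurrent input is leakily integrated
    and rectified until the activity stops changing.
    """
    r = np.maximum(W_in @ x, 0.0)          # initial values (as in step 830)
    for _ in range(n_iter):
        drive = W_in @ x + W_rec @ r       # feedforward plus recurrent input
        r_new = (1 - leak) * r + leak * np.maximum(drive, 0.0)
        if np.max(np.abs(r_new - r)) < 1e-9:
            break                          # steady state reached
        r = r_new
    return r
```

With mutually inhibitory recurrent weights, each node settles below its purely feedforward value, reflecting the competition between representation nodes.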
- the values of the nodes in the classification layer can be updated by applying the process for the behavioral model as discussed in the paragraph above.
- the initial values of the nodes in the classification layer can be calculated, for each node, by summing: a) the value of the input (multiplied by an excitatory connection weight) from the node in the discrimination (or representation) layer, b) the value of the input (multiplied by inhibitory connection weights) from the node(s) in the input layer, and c) the value of a global inhibition applied to all nodes in the classification layer.
- the number of times that any two nodes in the classification layer are active together can be tracked over a given number of inputs. If the number of times any two nodes are active together is above a certain threshold, the weight between those nodes can be set to an excitatory value (such as 1). The weights of connections between nodes in the classification layer that are not typically active together (as determined by being below the threshold) can be set to 0.
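The co-activity rule described above can be sketched directly; the binary activity encoding and the example threshold are assumptions:

```python
import numpy as np

def coactivity_weights(activity, threshold):
    """Set lateral classification-layer weights from co-activity counts.

    activity: (n_inputs, n_nodes) binary matrix (1 = node active for
    that input). Pairs of nodes active together more than `threshold`
    times get an excitatory weight of 1; all other pairs get 0.
    """
    counts = activity.T @ activity          # co-activation counts per pair
    W = (counts > threshold).astype(float)  # excitatory where frequently paired
    np.fill_diagonal(W, 0.0)                # no self-connections
    return W
```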
- step 850 the weights between the nodes in the neural network architecture are updated.
- the updating of the weight matrix for the connections between the nodes in the input layer and the nodes in the representation layer is performed using a gradient descent approach.
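As an illustration of the kind of local step a gradient-descent update could take here, the following uses an Oja-like Hebbian rule; this is a stand-in consistent with the document's local-update theme, not the patent's specified formula:

```python
import numpy as np

def update_input_weights(W, x, r, step=0.05):
    """One local update of the input-to-representation weights.

    W: (n_rep, n_in) weight matrix; x: input-node values; r: settled
    representation-node values; step: step size (between 0 and 1).
    Hebbian term (pre * post) minus an activity-scaled decay that keeps
    the weights bounded — each weight changes using only locally
    available quantities.
    """
    return W + step * (np.outer(r, x) - (r ** 2)[:, None] * W)
```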
- step 860 it is determined whether there is another image in the data set. If not, the method 800 terminates. If so, the method 800 returns to step 820.
- step 870 the method 800 terminates with the neural network architecture tuned to the inputted images.
- the data to be inputted into the neural network architecture 140 is not picture or visual data.
- the data to be analyzed can be DNA or RNA sequences, audio data, or other sensory data. This data can be ‘pixelated’ or transformed in another manner so that it can be inputted into the input layer of the neural network architecture 140.
- the neural network architecture 140 has advantages over other known neural networks.
- the neural network architecture 140 utilizes fundamentally different learning algorithms from existing models and does not rely on error propagation. It also avoids the credit assignment problem in deep learning. It can produce remarkable results that rival those of much more complicated networks, with fewer nodes, fewer parameters, and no requirement for deep layers. Although this performance may be surpassed by highly sophisticated deep learning models that rely on superior computing power, the neural network architecture 140 can also be developed into complex structures to perform additional tasks with improved performance. Given that it requires far fewer examples to learn and is much more energy efficient, the neural network architecture 140 can rival or outperform current alternatives.
- the inventive techniques set forth in this disclosure are rooted in computer technologies that overcome existing problems in known computer vision systems, including problems with extracting robust object representations from images and/or performing computer vision functions.
- the techniques described in this disclosure provide a technical solution (e.g., one that utilizes various Al-based neural networking and machine learning techniques) for overcoming the limitations associated with known techniques.
- This technology-based solution marks an improvement over existing capabilities and functionalities related to computer vision and machine learning systems by improving the accuracy of the computer vision (or machine learning) functions and reducing the information that is required to perform such functions.
- no storage of reference objects, such as faces or facial objects, is required
- this can serve to minimize storage requirements and avoid privacy issues.
- the neural network architectures disclosed herein are less complex, and therefore less computationally intensive, than other neural networks. They further do not require time- and resource-intensive creation and labeling of training set data.
- the neural network architectures described herein can additionally provide the advantages of being fully interpretable (so-called white box) and of not being subject to the “catastrophic forgetting” commonly observed in neural networks. These findings have substantial implications for understanding how biological brains achieve invariant object representation and for developing biologically realistic intelligent networks that are efficient and robust.
- a system for extracting object representations from images comprises one or more processing devices; one or more non-transitory computer-readable storage devices storing computing instructions configured to be executed on the one or more processing devices and cause the one or more processing devices to execute functions comprising: receiving, at a computing device, an image comprising pixels; and generating, at the computing device, an object representation from the image using a bi-layer neural network comprising an input layer of input nodes and a representation layer of representation nodes; wherein: all input nodes are connected to all representation nodes through a first set of weighted connections having differing values and all representation nodes are connected to all other representation nodes through a second set of weighted connections having differing values; a first set of connection weights associated with the first set of weighted connections between the input nodes of the input layer and the representation nodes of the representation layer is selected to minimize the chances that two representation nodes in the representation layer are active at the same time; a second set of connection weights for the second set of weighted
- the first set of connection weights associated with the first set of weighted connections is calculated using estimates of the eigenvectors of the variance-covariance matrix based on an input matrix created from vector representations of the images.
- a learning mechanism continuously updates the first set of connection weights as additional images are processed by the bi-layer neural network.
- the learning mechanism includes a stochastic gradient descent method.
- the second set of values for the representation nodes in the representation layer and the first set of values for the input nodes in the input layer are all non-negative values.
- the second set of connection weights for the second set of weighted connections is continuously updated based, at least in part, on changes in the first set of connection weights.
- the object representations include data related to object identification and data related to position information.
- the second set of weighted connections is inhibitory.
- the stochastic gradient descent method uses a step with a step size between 0 and 1.
- a method for extracting object representations from images implemented via execution of computing instructions configured to run at one or more processing devices and configured to be stored on non-transitory computer-readable media comprises: receiving, at a computing device, an image comprising pixels; and generating, at the computing device, an object representation from the image using a bi-layer neural network comprising an input layer of input nodes and a representation layer of representation nodes; wherein: all input nodes are connected to all representation nodes through a first set of weighted connections having differing values and all representation nodes are connected to all other representation nodes through a second set of weighted connections having differing values; a first set of connection weights associated with the first set of weighted connections between the input nodes of the input layer and the representation nodes of the representation layer is selected to minimize the chances that two representation nodes in the representation layer are active at the same time; a second set of connection weights for the second set of weighted connections is determined such that weights between any two representation nodes in
- the first set of connection weights associated with the first set of weighted connections is calculated using estimates of the eigenvectors of the variance-covariance matrix based on an input matrix created from vector representations of the images.
- a learning mechanism continuously updates the first set of connection weights as additional images are processed by the bi-layer neural network.
- the learning mechanism includes a stochastic gradient descent method.
- the second set of values for the representation nodes in the representation layer and the first set of values for the input nodes in the input layer are all non-negative values.
- the bi-layer neural network includes more representation nodes in the representation layer than input nodes in the input layer.
- the second set of connection weights for the second set of weighted connections is continuously updated based, at least in part, on changes in the first set of connection weights.
- the object representations include data related to object identification and data related to position information.
- the second set of weighted connections is inhibitory.
- a computer program product for extracting object representations from images comprising a non-transitory computer-readable medium including instructions for causing a computing device to: receive, at a computing device, an image comprising pixels; and generate, at the computing device, an object representation from the image using a bi-layer neural network comprising an input layer of input nodes and a representation layer of representation nodes; wherein: all input nodes are connected to all representation nodes through a first set of weighted connections having differing values and all representation nodes are connected to all other representation nodes through a second set of weighted connections having differing values; a first set of connection weights associated with the first set of weighted connections between the input nodes of the input layer and the representation nodes of the representation layer is selected to minimize the chances that two representation nodes in the representation layer are active at the same time; a second set of connection weights for the second set of weighted connections is determined such that weights between any two representation nodes in the representation layer are the
- the first set of connection weights associated with the first set of weighted connections is calculated using estimates of the eigenvectors of the variance-covariance matrix based on an input matrix created from vector representations of the images.
- a system for classifying object representations from images comprises: one or more processing devices; one or more non-transitory computer readable storage devices storing computing instructions configured to be executed on the one or more processing devices and cause the one or more processing devices to execute functions comprising: receiving, at a computing device, an image comprising pixels; and generating, at the computing device, classification data for one or more objects in the image using a tri-layer neural network comprising: i) an input layer comprising input nodes; ii) a representation layer comprising representation nodes; and iii) a classification layer comprising classification nodes; wherein: all input nodes are connected to all representation nodes through a first set of weighted connections having differing values and all representation nodes are connected to all other representation nodes through a second set of weighted connections having differing values; a first set of connection weights associated with the first set of weighted connections between the input nodes of the input layer and the representation nodes of the representation layer is selected to minimize the chances that
- the first set of connection weights associated with the first set of weighted connections is calculated using estimates of the eigenvectors of the variance-covariance matrix based on an input matrix created from vector representations of the images.
- a learning mechanism continuously updates the first set of connection weights as additional images are processed by the tri-layer neural network.
- the learning mechanism includes a stochastic gradient descent method.
- the third set of values for the classification nodes in the classification layer and the second set of values for the representation nodes in the representation layer and the first set of values for the input nodes in the input layer are all non-negative values.
- the second set of connection weights for the second set of weighted connections is continuously updated based, at least in part, on changes in the first set of connection weights.
- the classification data comprises identification data related to at least one object in the images.
- the second set of weighted connections is inhibitory.
- the stochastic gradient descent method uses a step with a step size between 0 and 1.
- a method for classifying object representations from images implemented via execution of computing instructions configured to run at one or more processing devices and configured to be stored on non-transitory computer-readable media comprising: receiving, at a computing device, an image comprising pixels; and generating, at the computing device, classification data for one or more objects in the image using a tri-layer neural network comprising: i) an input layer comprising input nodes; ii) a representation layer comprising representation nodes; and iii) a classification layer comprising classification nodes; wherein: all input nodes are connected to all representation nodes through a first set of weighted connections having differing values and all representation nodes are connected to all other representation nodes through a second set of weighted connections having differing values; a first set of connection weights associated with the first set of weighted connections between the input nodes of the input layer and the representation nodes of the representation layer is selected to minimize the chances that two representation nodes in the representation layer are active at the same time
- the first set of connection weights associated with the first set of weighted connections is calculated using estimates of the eigenvectors of the variance-covariance matrix based on an input matrix created from vector representations of the images.
- a learning mechanism continuously updates the first set of connection weights as additional images are processed by the tri-layer neural network.
- the learning mechanism includes a stochastic gradient descent method.
- the third set of values for the classification nodes in the classification layer, the second set of values for the representation nodes in the representation layer, and the first set of values for the input nodes in the input layer are all non-negative values.
- the second set of connection weights for the second set of weighted connections is continuously updated based, at least in part, on changes in the first set of connection weights.
- the classification data comprises identification data related to at least one object in the images.
- the second set of weighted connections is inhibitory.
- the stochastic gradient descent method uses a step with a step size between 0 and 1.
- a computer program product for classifying object representations from images comprises a non-transitory computer-readable medium including instructions for causing a computing device to: receive, at a computing device, an image comprising pixels; and generate, at the computing device, classification data for one or more objects in the image using a tri-layer neural network comprising: i) an input layer comprising input nodes; ii) a representation layer comprising representation nodes; and iii) a classification layer comprising classification nodes; wherein: all input nodes are connected to all representation nodes through a first set of weighted connections having differing values and all representation nodes are connected to all other representation nodes through a second set of weighted connections having differing values; a first set of connection weights associated with the first set of weighted connections between the input nodes of the input layer and the representation nodes of the representation layer is selected to minimize the chances that two representation nodes in the representation layer are active at the same time; a second set of connection
- the first set of connection weights associated with the first set of weighted connections is calculated using estimates of the eigenvectors of the variance-covariance matrix based on an input matrix created from vector representations of the images.
- Embodiments may include a computer program product accessible from a computer-usable or computer-readable medium providing program code for use by or in connection with a computer or any instruction execution system.
- a computer-usable or computer-readable medium may include any apparatus that stores, communicates, propagates, or transports the program for use by or in connection with the instruction execution system, apparatus, or device.
- the medium can be a magnetic, optical, electronic, electromagnetic, infrared, or semiconductor system (or apparatus or device) or a propagation medium.
- the medium may include a computer-readable storage medium, such as a semiconductor or solid-state memory, magnetic tape, a removable computer diskette, a random access memory (RAM), a read-only memory (ROM), a rigid magnetic disk, an optical disk, etc.
- a data processing system suitable for storing and/or executing program code may include at least one processor coupled directly or indirectly to memory elements through a system bus.
- the memory elements can include local memory employed during actual execution of the program code, bulk storage, and cache memories that provide temporary storage of at least some program code to reduce the number of times code is retrieved from bulk storage during execution.
- I/O devices including but not limited to keyboards, displays, pointing devices, etc. may be coupled to the system either directly or through intervening I/O controllers.
- Network adapters may also be coupled to the system to enable the data processing system to become coupled to other data processing systems or remote printers or storage devices through intervening private or public networks.
- Modems, cable modems, and Ethernet cards are just a few of the currently available types of network adapters.
Priority Applications (9)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CA3203238A CA3203238A1 (en) | 2022-04-06 | 2023-04-06 | Neural network architectures for invariant object representation and classification using local hebbian rule-based updates |
| KR1020237035779A KR20240031216A (en) | 2022-04-06 | 2023-04-06 | Neural network architecture for immutable object representation and classification using local Hebbian rule-based updating |
| CN202380009888.2A CN117203679A (en) | 2022-04-06 | 2023-04-06 | Neural network architecture using updated invariant object representation and classification based on local Hebbian rules |
| KR1020247004854A KR20240162024A (en) | 2022-04-06 | 2023-04-06 | Neural network architectures for invariant object representation and classification using local hebbian rule-based updates |
| EP23734889.1A EP4278323A4 (en) | 2022-04-06 | 2023-04-06 | ARCHITECTURES OF NEURAL NETWORKS FOR INVARIANT OBJECT REPRESENTATION AND CLASSIFICATION USING LOCAL RULE-BASED UPDATES BY HEBBIEN |
| JP2023574482A JP7610731B2 (en) | 2022-04-06 | 2023-04-06 | Neural network architecture for invariant object representation and classification using local Hebbian updates. |
| US18/343,577 US20230360367A1 (en) | 2022-04-06 | 2023-06-28 | Neural network architectures for invariant object representation and classification using local hebbian rule-based updates |
| US18/343,557 US20230360370A1 (en) | 2022-04-06 | 2023-06-28 | Neural network architectures for invariant object representation and classification using local hebbian rule-based updates |
| JP2024010122A JP2024106338A (en) | 2022-04-06 | 2024-01-26 | Neural network architecture for invariant object representation and classification using local Hebbian updates. |
Applications Claiming Priority (4)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US202263328063P | 2022-04-06 | 2022-04-06 | |
| US63/328,063 | 2022-04-06 | ||
| US202363480675P | 2023-01-19 | 2023-01-19 | |
| US63/480,675 | 2023-01-19 |
Related Child Applications (2)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US18/343,557 Continuation US20230360370A1 (en) | 2022-04-06 | 2023-06-28 | Neural network architectures for invariant object representation and classification using local hebbian rule-based updates |
| US18/343,577 Continuation US20230360367A1 (en) | 2022-04-06 | 2023-06-28 | Neural network architectures for invariant object representation and classification using local hebbian rule-based updates |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2023196917A1 true WO2023196917A1 (en) | 2023-10-12 |
Family
ID=88206808
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/US2023/065456 Ceased WO2023196917A1 (en) | 2022-04-06 | 2023-04-06 | Neural network architectures for invariant object representation and classification using local hebbian rule-based updates |
Country Status (3)
| Country | Link |
|---|---|
| CA (1) | CA3210365A1 (en) |
| TW (1) | TW202347173A (en) |
| WO (1) | WO2023196917A1 (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20200234143A1 (en) * | 2019-01-23 | 2020-07-23 | MakinaRocks Co., Ltd. | Anomaly detection |
| US20210264287A1 (en) * | 2020-02-17 | 2021-08-26 | Sas Institute Inc. | Multi-objective distributed hyperparameter tuning system |
| US20220066456A1 (en) * | 2016-02-29 | 2022-03-03 | AI Incorporated | Obstacle recognition method for autonomous robots |
- 2023-04-06 WO PCT/US2023/065456 patent/WO2023196917A1/en not_active Ceased
- 2023-04-06 TW TW112112964A patent/TW202347173A/en unknown
- 2023-04-06 CA CA3210365A patent/CA3210365A1/en active Pending
Also Published As
| Publication number | Publication date |
|---|---|
| TW202347173A (en) | 2023-12-01 |
| CA3210365A1 (en) | 2023-10-06 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | ENP | Entry into the national phase | Ref document number: 2023734889; Country of ref document: EP; Effective date: 20230707 |
| | ENP | Entry into the national phase | Ref document number: 3203238; Country of ref document: CA; Kind code of ref document: A |
| | WWE | Wipo information: entry into national phase | Ref document number: 202380009888.2; Country of ref document: CN |
| | WWE | Wipo information: entry into national phase | Ref document number: 2023574482; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |