
WO2024138492A1 - Method and apparatus for incorporating neuro-inspired adaptability for continual learning - Google Patents


Info

Publication number
WO2024138492A1
Authority
WO
WIPO (PCT)
Prior art keywords
parameters
sub
data
networks
tasks
Prior art date
Legal status
Ceased
Application number
PCT/CN2022/143198
Other languages
French (fr)
Inventor
Jun Zhu
Hang SU
Ze CHENG
Liyuan Wang
Mingtian ZHANG
Qian Li
Yi ZHONG
Xingxing Zhang
Current Assignee
Tsinghua University
Robert Bosch GmbH
Original Assignee
Tsinghua University
Robert Bosch GmbH
Priority date
Filing date
Publication date
Application filed by Tsinghua University and Robert Bosch GmbH
Priority to PCT/CN2022/143198 priority Critical patent/WO2024138492A1/en
Priority to DE112022008133.3T priority patent/DE112022008133T5/en
Priority to CN202280102864.7A priority patent/CN120500723A/en
Publication of WO2024138492A1 publication Critical patent/WO2024138492A1/en

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Definitions

  • Fig. 9 illustrates an exemplary computing system, in accordance with various aspects of the present disclosure.
  • the computing system may comprise at least one processor 910.
  • the computing system may further comprise at least one storage device 920.
  • the storage device 920 may store computer-executable instructions that, when executed, cause the processor 910 to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-8.
  • the embodiments of the present disclosure may be embodied in a computer-readable medium such as non-transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform a computer implemented method by a neural network for continual learning of a series of tasks, wherein the neural network includes a plurality of sub-networks having the same architecture and respective parameter spaces.
  • the method comprises inputting training data for a current task into the plurality of sub-networks in parallel, wherein parameters of the neural network are learned for previous tasks; generating a plurality of feature representations based on the input data by the plurality of sub-networks respectively; generating, by a shared output head of the plurality of sub-networks, a prediction related to the current task with an input of a weighted average of the plurality of feature representations; and updating the parameters of the neural network based on a loss function that has constraints on the prediction related to the current task and selectively merges parameters of each sub-network with learned parameters for the previous tasks and expanded parameters for the current task.
  • the non-transitory computer-readable medium may further comprise instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with Figs. 1-8.


Abstract

A computer implemented method by a neural network for continual learning of a series of tasks is disclosed, wherein the neural network includes a plurality of sub-networks having the same architecture and respective parameter spaces. The method comprises inputting training data for a current task into the plurality of sub-networks in parallel, wherein parameters of the neural network are learned for previous tasks; generating a plurality of feature representations based on the input data by the plurality of sub-networks respectively; generating, by a shared output head of the plurality of sub-networks, a prediction related to the current task with an input of a weighted average of the plurality of feature representations; and updating the parameters of the neural network based on a loss function that has constraints on the prediction related to the current task and selectively merges parameters of each sub-network with learned parameters for the previous tasks and expanded parameters for the current task.

Description

METHOD AND APPARATUS FOR INCORPORATING NEURO-INSPIRED ADAPTABILITY FOR CONTINUAL LEARNING
FIELD
Aspects of the present disclosure relate generally to artificial intelligence, and more particularly, to a method and apparatus for incorporating neuro-inspired adaptability for continual learning in artificial intelligence.
BACKGROUND
Continual learning, also known as lifelong learning, is one of the cornerstones towards artificial general intelligence (AGI) . Since the real world is dynamic and unpredictable, an intelligent agent needs to learn and remember throughout its lifetime, just like the biological brain, in order to adapt effectively. For this purpose, a desirable solution should properly balance memory stability with learning plasticity, and acquire sufficient compatibility to capture the observed distribution.
Numerous efforts have been devoted to preserving memory stability to mitigate catastrophic forgetting in artificial neural networks, where parameter changes for learning a new task well typically result in a dramatic performance drop of the old tasks. Representative strategies include selectively stabilizing parameters, recovering old data distributions and allocating dedicated parameter subspaces. However, they usually achieve only modest improvements in specific scenarios, with effectiveness varying widely across experimental settings (e.g., task type and similarity, input size, number of training samples, etc. ) . As a result, there is still a huge gap between existing advances and realistic applications, and further afield, AGI.
SUMMARY
The following presents a simplified summary of one or more aspects to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
In order to improve the performance of continual learning, that is, perform well on all the tasks ever seen, a desirable solution should properly balance memory stability of old tasks with learning plasticity of new tasks, while being adequately compatible to capture their distributions.
Since biological learning systems are natural continual learners that exhibit strong adaptability to real-world dynamics, it is argued that they have been equipped with  effective strategies to address the above challenges. In particular, the γ subset of the Drosophila mushroom body (γMB) is a biological learning system that is essential for coping with different tasks in succession and enjoys relatively clear and in-depth understanding at both functional and anatomical levels, which emerges as an excellent source for inspiring continual learning in artificial intelligence (AI) systems.
As a key functional advantage, the γMB system can regulate memories in distinct ways for optimizing memory-guided behaviors in changing environments. First, old memories are actively protected from new disruption by strengthening the previously-learned synaptic changes. This idea of selectively stabilizing parameters has been widely used to alleviate catastrophic forgetting in continual learning. Besides, old memories can be actively forgotten for better adapting to a new memory. Specifically, there are specialized molecular signals to regulate the speed of memory decay, whose activation reduces the persistence of outdated information, while inhibition exhibits the opposite effect.
It is proposed in the disclosure a functional strategy that incorporates active forgetting together with stability protection for a better trade-off between new and old tasks, where the active forgetting part is formulated as selectively attenuating old memories in parameter distributions and optimized by a synaptic expansion-renormalization process.
The organizing principles of the γMB system that support its function are further explored; the system adopts five compartments with dynamic modulation to perform continual learning in parallel.
Inspired by this, it is proposed in the disclosure a specialized architecture of multiple parallel learning modules, which can ensure solution compatibility for incremental changes by coordinating the diversity of learners’ expertise. Adaptive implementations of the disclosed functional strategy can naturally serve this purpose through adjusting the target distribution of each learner, suggesting that the neurological adaptive mechanisms are highly synergistic rather than operating in isolation.
The disclosed method and apparatus draw inspiration from the adaptive mechanisms of a robust biological learning system, exhibit superior generality across a variety of continual learning scenarios, and achieve state-of-the-art performance in a plug-and-play manner. The disclosed method and apparatus can facilitate realistic applications, such as autonomous driving and robotics, intelligent manufacturing and smartphones, etc., to flexibly accommodate user needs and environmental changes. Meanwhile, the deployment of continual learning avoids retraining on all previous data each time the model is updated, which provides an energy-efficient and eco-friendly path for developing AI systems.
According to an aspect of the disclosure, a computer implemented method by a neural network for continual learning of a series of tasks is disclosed, wherein the neural network includes a plurality of sub-networks having the same architecture and respective parameter spaces. The method comprises inputting training data for a current task into the plurality of sub-networks in parallel, wherein parameters of the neural network are learned for previous tasks; generating a plurality of feature representations based on the input data by the plurality of sub-networks respectively; generating, by a shared output head of the plurality of sub-networks, a prediction related to the current task with an input of a weighted average of the plurality of feature representations; and updating the parameters of the neural network based on a loss function that has constraints on the prediction related to the current task and selectively merges parameters of each sub-network with learned parameters for the previous tasks and expanded parameters for the current task.
According to a further aspect, the method further comprises generating corresponding predictions based on the plurality of feature representations generated by the plurality of sub-networks respectively; and wherein the loss function further has constraints on differences in the corresponding predictions.
According to a further aspect, the parameters of the neural network comprise the parameters of each sub-network, weights for the plurality of feature representations and parameters of the shared output head.
According to a further aspect, the expanded parameters for the current task are set as empty. According to another further aspect, the expanded parameters for the current task are the optimal solution for the current task only, without constraints on the previous tasks.
According to a further aspect, the degree of merging the parameters of each sub-network with the learned parameters for previous tasks and the expanded parameters for the current task is adaptively learned and updated for each sub-network respectively.
According to a further aspect, the shared output head is a fully connected layer.
According to a further aspect, the method is a plug-and-play approach.
According to a further aspect, the method comprises one of exploiting different random initialization and dropout for the plurality of sub-networks; or exploiting identical random initialization and dropout for the plurality of sub-networks; or exploiting identical random initialization and no dropout for the plurality of sub-networks.
According to a further aspect, the input data comprise one or more of image data, video data, graph data, gaming data, text data, statistics and sensor data; the series of tasks comprises one or more of a regression task, a classification task, or a reinforcement task; and the prediction comprises one or more of a classification of the image data, the video data or the graph data, a segmentation of the image data, the video data or the audio data, content or actions generated based on the text data, gaming data, graph data or video data, and a predicted value based on statistics and/or sensor data.
According to a further aspect, the input data comprise data obtained in one or more of an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, an industrial equipment system, an intelligent maintenance equipment system, a medical equipment system, and a camera of a smart device.
BRIEF DESCRIPTION OF THE DRAWINGS
The disclosed aspects will be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
Fig. 1 illustrates an exemplary framework of the γ subset of the Drosophila mushroom body (γMB) , according to aspects of the disclosure.
Fig. 2 illustrates an exemplary framework of a neuro-inspired computational model, according to aspects of the disclosure.
Fig. 3 illustrates an exemplary formulation of active forgetting in combination with stability protection, according to aspects of the disclosure.
Fig. 4 illustrates two equivalent ways for optimization of active forgetting, according to aspects of the disclosure.
Fig. 5 illustrates exemplary performances of active forgetting on three representative benchmarks for continual learning of visual classification tasks, according to aspects of the disclosure.
Fig. 6 illustrates exemplary γMB-like architecture of multiple continual learners, according to aspects of the disclosure.
Fig. 7 illustrates exemplary performances of collaborative continual learners with adaptive forgetting (CAF) , according to aspects of the disclosure.
Fig. 8 illustrates an exemplary flow chart for continual learning, in accordance with various aspects of the present disclosure.
Fig. 9 illustrates an exemplary computing system, in accordance with various aspects of the present disclosure.
DETAILED DESCRIPTION
The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
Various embodiments will be described in detail with reference to the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. References made to examples and  embodiments are for illustrative purposes, and are not intended to limit the scope of the disclosure.
The ability to incrementally learn a sequence of tasks, which is also named continual learning, is critical for artificial neural networks. Since the training data distribution is highly dynamic, the network needs to carefully trade off learning plasticity and memory stability. In general, excessive plasticity in learning new tasks leads to the catastrophic forgetting of old tasks, while excessive stability in remembering old tasks limits the learning of new tasks.
In order to perform well on all the tasks ever seen, a desirable solution should properly balance memory stability of old tasks with learning plasticity of new tasks, while being adequately compatible to capture their distributions. For example, if you want to accommodate a sequence of cakes (i.e., incremental tasks) into a bag (i.e., a solution) , you should (within certain limits) optimize the efficiency of space allocation for each cake as well as the total space of the bag, rather than simply freezing the old cakes.
To get inspired by biological learning systems which exhibit strong adaptability to real-world dynamics, the γ subset of the Drosophila mushroom body (γMB) is considered as a biological learning system that is essential for coping with different tasks in succession and enjoys relatively clear and in-depth understanding at both functional and anatomical levels.
Fig. 1 illustrates an exemplary framework of the γ subset of the Drosophila mushroom body (γMB) , according to aspects of the disclosure.
As a key functional advantage, the γMB system can regulate memories in distinct ways for optimizing memory-guided behaviors in changing environments. First, old memories are actively protected from new disruption by strengthening the previously-learned synaptic changes. This idea of selectively stabilizing parameters has been widely used to alleviate catastrophic forgetting in continual learning. Besides, old memories can be actively forgotten for better adapting to a new memory. Specifically, there are specialized molecular signals to regulate the speed of memory decay, whose activation reduces the persistence of outdated information, while inhibition exhibits the opposite effect. Further, the γMB system adopts five compartments (γ1-5) with dynamic modulation to perform continual learning in parallel.
As shown in Fig. 1, the sensory information (A, B, C, …) is incrementally input from Kenyon cells (KCs) in the MB, while the valence (Va) is conveyed by dopaminergic neurons (DANs) . The outputs of these compartments (γ1-γ5) are carried by distinct MB output neurons (MBONs) and integrated in a weighted-sum fashion to guide adaptive behaviors (Be) . In particular, the DANs allow for distinct learning rules and forgetting rates in each compartment, where the latter has been shown important for processing sequential conflicting experiences. The distinct learning rules could be modulated by  respective reward (shown as hollow circle in Fig. 1) and punishment (shown as gray circle in Fig. 1) .
Fig. 2 illustrates an exemplary framework of a neuro-inspired computational model, according to aspects of the disclosure.
Inspired by biological advantages of the γMB system mentioned above, it is disclosed herein to incorporate active forgetting together with stability protection for a better trade-off between new and old tasks, and accordingly coordinate multiple continual learners to ensure solution compatibility.
As an example, five continual learners L1-L5 corresponding to the five compartments of the γMB system are shown in Fig. 2; of course, other quantities of continual learners are allowed. The white blocks represent learned parameters for task A, the gray blocks represent learned parameters for task B, and the black blocks represent learned parameters for task C. With a sequence of three tasks learned in Fig. 2, the combinations of blocks named a, b and c show different degrees of merging the parameters for different tasks as a solution, wherein a shows excessive stability as it adheres too closely to an old task, b shows a proper balance between old and new tasks, and c shows excessive forgetting since it mainly focuses on the latest task. The dashed areas on the right denote the target distributions of L1-L5, which are coordinated to obtain better solutions for both old and new tasks, and the small hollow circles represent the optimal solution for each incremental task.
i. Active forgetting with stability protection
A central challenge of continual learning is to resolve the mutual interference between new and old tasks that arises from their distributional differences. The functional advantages of the γMB system suggest that stability protection and actively-regulated forgetting are both important, although current efforts mainly focus on the former to prevent catastrophic forgetting.
It is disclosed herein that this process can be formulated with the framework of Bayesian learning, which is able to model biological synaptic plasticity by tracking the probability distribution of synaptic weights under the input of dynamic sensory information. A simple case of two tasks is described first.
Let's consider a neural network with parameter θ continually learning tasks A and B from their training data D_A and D_B to perform well on their test data, which is called a "continual learner". From a Bayesian perspective, the learner first places a prior distribution p(θ) on the model parameters. After learning task A, the learner updates the belief of the parameters, resulting in a posterior distribution p(θ|D_A) ∝ p(D_A|θ) p(θ) that can incorporate the knowledge of task A. Then task A can be successfully performed by finding a mode of the posterior:

$$\theta_A^* = \arg\max_{\theta} \, \log p(\theta \mid D_A)$$

To continually learn task B, p(θ|D_A) can be used as a prior distribution and the posterior p(θ|D_A, D_B) ∝ p(D_B|θ) p(θ|D_A) will further incorporate the knowledge of task B. Similarly, the learner needs to find

$$\theta_{A,B}^* = \arg\max_{\theta} \, \log p(\theta \mid D_A, D_B),$$

corresponding to maximizing both log p(D_B|θ) for learning task B and log p(θ|D_A) for remembering task A.
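The sequential use of one task's posterior as the next task's prior can be illustrated with a toy conjugate-Gaussian example. The numbers and the helper `gaussian_posterior` below are purely illustrative assumptions and are not part of the disclosure; the point is only that the posterior after D_A carries task A's knowledge into the learning of task B.

```python
import numpy as np

# Toy 1-D example: estimate a mean theta under Gaussian noise with known variance.
# After task A, the posterior p(theta | D_A) becomes the prior for task B.
def gaussian_posterior(prior_mean, prior_var, data, noise_var=1.0):
    n = len(data)
    post_var = 1.0 / (1.0 / prior_var + n / noise_var)
    post_mean = post_var * (prior_mean / prior_var + np.sum(data) / noise_var)
    return post_mean, post_var

rng = np.random.default_rng(0)
d_a = rng.normal(2.0, 1.0, size=50)   # training data for task A
d_b = rng.normal(3.0, 1.0, size=50)   # training data for task B

mean_a, var_a = gaussian_posterior(0.0, 10.0, d_a)        # p(theta | D_A)
mean_ab, var_ab = gaussian_posterior(mean_a, var_a, d_b)  # p(theta | D_A, D_B)
print(mean_a, mean_ab)  # the joint posterior is pulled toward both tasks
```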
However, due to the differences in data distribution, remembering old tasks precisely can increase the difficulty of learning each new task well. Inspired by the biological active forgetting, a forgetting rate β is introduced in the disclosure and p(θ|D_A) is replaced with

$$\tilde{p}(\theta \mid D_A, \beta) = \frac{1}{Z}\, p(\theta \mid D_A)^{1-\beta}\, p(\theta)^{\beta}, \quad (1)$$

where p(θ) is a non-informative prior without incorporating old knowledge, and Z is a β-dependent normalizer that keeps $\tilde{p}(\theta \mid D_A, \beta)$ a normalized probability distribution. $\tilde{p}(\theta \mid D_A, \beta)$ will actively forget task A when β→1, while being dominated by p(θ|D_A) with full old knowledge when β→0. Therefore, for the new target $\tilde{p}(\theta \mid D_A, D_B, \beta) \propto p(D_B \mid \theta)\, \tilde{p}(\theta \mid D_A, \beta)$, the loss function is derived as

$$\mathcal{L}(\theta) = \mathcal{L}_B(\theta) + \frac{\lambda_{SP}}{2} \sum_m F_{A,m}\big(\theta_m - \theta^*_{A,m}\big)^2 + \frac{\lambda_{AF}}{2} \sum_m I_{e,m}\big(\theta_m - \theta_{e,m}\big)^2, \quad (2)$$

where $\mathcal{L}_B(\theta)$ is the loss function of learning task B, and λ_SP and λ_AF are hyperparameters that control the strengths of the two regularizers responsible for stability protection and active forgetting, respectively. The stability protection part selectively penalizes the deviance of each parameter θ_m from $\theta^*_{A,m}$, depending on its "importance" for the old task(s), estimated by the Fisher information F_{A,m}.
Fig. 3 illustrates an exemplary formulation of active forgetting in combination with stability protection, according to aspects of the disclosure. As shown in Fig. 3, all the connected solid lines are learned for task B, wherein black solid lines are learned for previous task A and are maintained in virtue of stability protection, however the dashed lines are learned for previous task A but are abandoned for active forgetting reasons.
The optimization of active forgetting can be achieved in two equivalent ways, AF-1 and AF-2. With the introduction of the forgetting rate β, a learner needs to find:

$$\theta^*_{A,B,\beta} = \arg\max_{\theta} \, \big[\log p(D_B \mid \theta) + \log \tilde{p}(\theta \mid D_A, \beta)\big]$$

AF-1 and AF-2 both encourage the network parameters θ to renormalize with an "expanded" set of parameters θ_e when learning task B.

Fig. 4 illustrates the two equivalent ways for optimization of active forgetting, according to aspects of the disclosure, where the network parameters θ are selectively renormalized with $\theta^*_A$ and θ_e in order to mutually balance new and old tasks in a shared solution.

For AF-1, θ_{e,m} = 0 is "empty" with equal selectivity I_{e,m} = 1 for renormalization, so the active forgetting term becomes the L2 norm of θ. The hyperparameters satisfy λ_AF ∝ β and λ_SP ∝ 1-β, indicating that the old memories are directly affected. For AF-2, θ_e is the optimal solution for task B only, obtained from optimizing $\mathcal{L}_B(\theta_e)$, and I_{e,m} = F_{B,m} is the Fisher information. The forgetting rate is fully integrated into λ_AF ∝ β/(1-β) and is independent of λ_SP. In the absence of active forgetting (β = 0), the loss function in Eqn. (2) is left with only the stability protection term and $\mathcal{L}_B(\theta)$, which is approximately equivalent to that of regular synaptic regularization approaches such as EWC. In particular, since the loss functions of such methods typically have a similar form and differ only in the metric for estimating the importance of parameters, the disclosed proposal can be naturally combined with them by plugging in the active forgetting term.
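As an illustration of how a loss of the form in Eqn. (2) could be assembled in practice, the following PyTorch-style sketch combines a task loss with the stability-protection and active-forgetting regularizers. It is a minimal sketch under stated assumptions, not the reference implementation of the disclosure: the function name `af_sp_loss` and the dictionary-based bookkeeping are illustrative, and the defaults correspond to AF-1 (θ_e = 0, I_e = 1).

```python
import torch

def af_sp_loss(task_loss, named_params, old_params, fisher_old,
               lambda_sp, lambda_af, theta_e=None, selectivity=None):
    """Eqn. (2)-style loss: task loss + stability protection + active forgetting.

    task_loss    -- scalar loss L_B(theta) on the current task
    named_params -- dict name -> current parameter tensor (theta)
    old_params   -- dict name -> parameters learned for the old task (theta*_A)
    fisher_old   -- dict name -> diagonal Fisher information F_A
    theta_e      -- expanded parameters; None means AF-1 (theta_e = 0)
    selectivity  -- I_e; None means AF-1 (I_e = 1)
    """
    sp, af = 0.0, 0.0
    for name, p in named_params.items():
        # Stability protection: penalize deviation from the old solution, weighted by Fisher.
        sp = sp + (fisher_old[name] * (p - old_params[name]) ** 2).sum()
        # Active forgetting: renormalize toward the expanded parameters theta_e.
        t_e = 0.0 if theta_e is None else theta_e[name]
        i_e = 1.0 if selectivity is None else selectivity[name]
        af = af + (i_e * (p - t_e) ** 2).sum()
    return task_loss + 0.5 * lambda_sp * sp + 0.5 * lambda_af * af
```

Under the AF-1 defaults the active-forgetting term reduces to an L2 penalty on θ; passing a task-B-only solution as `theta_e` and its Fisher as `selectivity` instead gives the AF-2 variant described above.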
For biological neural networks, the actively-regulated forgetting is able to remove outdated information and provide flexibility for adapting to a new memory. In the disclosed proposal, a properly-selected β will maximize the probability of learning each new task well through forgetting the old knowledge:

$$\beta^* = \arg\max_{\beta} \, \log p\big(D_B \mid \theta^*_{A,B,\beta}\big),$$

which can be empirically determined by a grid search of λ_AF and/or λ_SP. Additionally, when θ moves to the neighborhood of an empirical optimal solution, the active forgetting term in Eqn. (2) can directly minimize the generalization errors, so as to improve the performance of continual learning.
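A minimal sketch of how such a grid search might look in practice follows. The candidate grids and the callbacks `train_fn` and `evaluate_fn` are assumptions for illustration and are not values or interfaces fixed by the disclosure.

```python
import itertools

# Hypothetical candidate grids for the two regularizer strengths.
LAMBDA_SP_GRID = [1.0, 10.0, 100.0]
LAMBDA_AF_GRID = [0.0, 0.01, 0.1, 1.0]

def grid_search(train_fn, evaluate_fn):
    """Pick (lambda_sp, lambda_af) maximizing validation accuracy on the new task.

    train_fn(lambda_sp, lambda_af) -- trains a copy of the model with the Eqn. (2) loss
    evaluate_fn(model)             -- returns validation accuracy on the new task
    """
    best, best_acc = None, float("-inf")
    for lam_sp, lam_af in itertools.product(LAMBDA_SP_GRID, LAMBDA_AF_GRID):
        model = train_fn(lam_sp, lam_af)
        acc = evaluate_fn(model)
        if acc > best_acc:
            best, best_acc = (lam_sp, lam_af), acc
    return best
```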
In addition to the theoretical analysis of how active forgetting achieves the benefit of removing outdated information and providing flexibility for adapting to a new memory, Fig. 5 illustrates exemplary performances of active forgetting on three representative benchmarks for continual learning of visual classification tasks, according to aspects of the disclosure. The three benchmarks are all constructed from the CIFAR-100 dataset, which contains 100-class natural images with 500 training samples per class of size 32×32, but the overall knowledge transfer ranges from more negative to more positive due to different construction principles.
As shown in Fig. 5, the proposed active forgetting can largely enhance the averaged accuracy (illustrated in top row A) for all incremental tasks, using EWC as a baseline  for preserving stability. Then the effects of active forgetting on learning plasticity for new tasks and memory stability for old tasks are analyzed through evaluating forward transfer (illustrated in middle row B) and backward transfer (illustrated in bottom row C) , respectively, where the former is clearly dominant. A 6-layer convolution neural network (CNN) is used, and all results are averaged over 5 runs with different random seeds and task orders.
ii. Coordination of Multiple Continual Learners
In addition to describing actively-regulated forgetting together with stability protection in a continual learning model, the organizing principles of the γMB system, where new memories form and active forgetting happens, are now discussed. As discussed with Fig. 1, there are five compartments that process sequential experiences in parallel. The outputs of these compartments are integrated in a weighted-sum fashion to guide adaptive behaviors.
Inspired by this, a γMB-like architecture of multiple continual learners in parallel is disclosed, as illustrated in Fig. 6. Specifically, there are K identically structured neural networks $f_{\phi_i}$, corresponding to K continual learners $L_i$, i = 1, …, K, with their dedicated parameter spaces. The dedicated output head of each learner is removed, and the weighted sum of the previous layer's outputs is fed into a shared output head $h_{out}$ to make predictions, where the output weight of each learner $g_i$, i = 1, …, K, is incrementally updated. Then, the final prediction becomes

$$p(x) = h_{out}\Big(\sum_{i=1}^{K} g_i\, f_{\phi_i}(x)\Big),$$

where the set of all optimizable parameters $\theta_{MCL}$ includes $\{\phi_i\}_{i=1}^{K}$, $\{g_i\}_{i=1}^{K}$ and the parameters of the shared output head.
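To make the parallel-learner architecture concrete, here is a PyTorch-style sketch of K identically structured feature extractors whose weighted-sum output feeds a single shared head. It is only an illustration under assumed details: the class name `ParallelLearners`, the small CNN backbone and the softmax normalization of the mixing weights g_i are not specified by the disclosure.

```python
import torch
import torch.nn as nn

class ParallelLearners(nn.Module):
    """K identically structured learners f_phi_i, a learnable output weight g_i per learner,
    and one shared head applied to the weighted sum of their feature representations."""

    def __init__(self, make_backbone, num_learners, feat_dim, num_classes):
        super().__init__()
        self.learners = nn.ModuleList([make_backbone() for _ in range(num_learners)])
        self.g = nn.Parameter(torch.ones(num_learners))   # output weight of each learner
        self.head = nn.Linear(feat_dim, num_classes)      # shared output head

    def forward(self, x):
        feats = torch.stack([f(x) for f in self.learners], dim=0)  # (K, B, feat_dim)
        w = torch.softmax(self.g, dim=0).view(-1, 1, 1)            # normalized mixing weights
        fused = (w * feats).sum(dim=0)                             # weighted sum of features
        logits = self.head(fused)                                  # shared (ensemble) prediction
        per_learner = [self.head(feats[i]) for i in range(len(self.learners))]
        return logits, per_learner

def make_backbone(feat_dim=128):
    # Assumed small CNN backbone; the disclosure only requires identical architectures.
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        nn.Linear(32 * 16, feat_dim), nn.ReLU(),
    )
```

A call such as `ParallelLearners(lambda: make_backbone(128), num_learners=5, feat_dim=128, num_classes=10)` mirrors the five compartments of Fig. 2; the per-learner predictions returned alongside the fused logits are what the cooperation term of Eqn. (12) later compares.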
In such a γMB-like architecture, the relationship between learners is critical to the performance of continual learning. When the diversity of their expertise is properly coordinated, the obtained solution can provide a high degree of compatibility with both new and old tasks, vastly superior to that of a single continual learner (SCL) .
In general, each learner's expertise is directly modulated by its target distribution, and the functional strategy of active forgetting with stability protection described above can serve this purpose, from which it can be inferred that the target distribution $\tilde{p}(\theta \mid D_A, D_B, \beta)$ tends to be different for learners with different forgetting rates β. Therefore, a learnable forgetting rate for each learner is applied to adaptively coordinate their relationship. Further, to encourage cooperation of the continual learners, a supplementary modulation that explicitly constrains the differences in the predictions made from the feature representations of the respective learners is proposed, in order to modulate the learning rules for the current task.
iii. Collaborative continual learners with adaptive forgetting (CAF)
Based on the discussion set forth above, and drawing inspiration from the adaptive mechanisms of a robust biological learning system, an approach combining both active forgetting with stability protection and the architecture of multiple continual learners is disclosed, which can be named collaborative continual learners with adaptive forgetting (CAF).
For the case of two tasks as mentioned before, the learner needs to find a mode of the posterior distribution that incorporates the knowledge of tasks A and B:

$$p(\theta \mid D_A, D_B) \propto p(D_B \mid \theta)\, p(\theta \mid D_A), \quad (4)$$

where p(D_B|θ) corresponds to the loss for the current task B. Although p(θ|D_A) is generally intractable, we can locally approximate it with a second-order Taylor expansion around $\theta^*_A$, resulting in a Gaussian distribution whose mean is $\theta^*_A$ and whose precision matrix is the Hessian of the negative log posterior. To simplify the computation, the Hessian can be approximated by the diagonal of the Fisher information matrix:

$$F_{A,m} = \mathbb{E}_{(x,y) \sim D_A}\left[\Big(\frac{\partial \log p(y \mid x, \theta)}{\partial \theta_m}\Big)^{2}\right]\Bigg|_{\theta = \theta^*_A}. \quad (5)$$
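The diagonal Fisher estimate of Eqn. (5) is commonly computed from squared gradients of the log-likelihood at the old-task solution. The sketch below shows one common way to do this for a classifier in PyTorch; it is an illustration rather than the exact procedure fixed by the disclosure, and it assumes the model returns class logits.

```python
import torch
import torch.nn.functional as F

def diagonal_fisher(model, data_loader):
    """Estimate the diagonal Fisher information F_{A,m} at the current parameters."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters() if p.requires_grad}
    n_samples = 0
    for x, y in data_loader:
        model.zero_grad()
        log_probs = F.log_softmax(model(x), dim=1)
        # Squared gradients of the log-likelihood of the observed labels approximate F.
        F.nll_loss(log_probs, y, reduction="sum").backward()
        for n, p in model.named_parameters():
            if p.grad is not None and n in fisher:
                fisher[n] += p.grad.detach() ** 2
        n_samples += x.size(0)
    return {n: f / max(n_samples, 1) for n, f in fisher.items()}
```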
To improve learning plasticity, the forgetting rate β is introduced, and p(θ|D_A) in Eqn. (4) is replaced with $\tilde{p}(\theta \mid D_A, \beta)$ (i.e., Eqn. (1)), where the target distribution p(θ|D_A, D_B) becomes p(θ|D_A, D_B, β). A properly-selected β will maximize the probability of learning each new task well through forgetting the old knowledge:

$$\beta^* = \arg\max_{\beta} \, \log p\big(D_B \mid \theta^*_{A,B,\beta}\big). \quad (6)$$

With the implementation of active forgetting, the learner needs to find

$$\theta^*_{A,B,\beta} = \arg\max_{\theta} \, \big[\log p(D_B \mid \theta) + \log \tilde{p}(\theta \mid D_A, \beta)\big], \quad (7)$$

which can be optimized in two equivalent ways (AF-1 and AF-2) as discussed with Fig. 4.
For continual learning of more than two tasks, e.g., t tasks (t > 2), the learner needs to find:

$$\theta^*_{1:t} = \arg\max_{\theta} \, \big[\log p(D_t \mid \theta) + \log p(\theta \mid D_{1:t-1}, \beta_{1:t-1})\big], \quad (8)$$

where $D_{1:t-1} = \{D_1, \dots, D_{t-1}\}$ are the training datasets of the previous tasks and $\beta_{1:t-1} = \{\beta_1, \dots, \beta_{t-1}\}$ are the previously used forgetting rates. Similarly, we replace the posterior p(θ|D_{1:t-1}, β_{1:t-1}) that absorbs all the information of D_{1:t-1} with

$$\tilde{p}(\theta \mid D_{1:t-1}, \beta_{1:t-1}, \beta_t) = \frac{1}{Z}\, p(\theta \mid D_{1:t-1}, \beta_{1:t-1})^{1-\beta_t}\, p(\theta)^{\beta_t}.$$
To simplify the hyperparameter tuning, a fixed forgetting rate may be adopted for each learning task, i.e., β_i = β for i = 1, …, t. Then the loss function may be modified from Eqn. (2) into:

$$\mathcal{L}(\theta) = \mathcal{L}_t(\theta) + \frac{\lambda_{SP}}{2} \sum_m F_{1:t-1,m}\big(\theta_m - \theta^*_{1:t-1,m}\big)^2 + \frac{\lambda_{AF}}{2} \sum_m I_{e,m}\big(\theta_m - \theta_{e,m}\big)^2, \quad (9)$$

where $\theta^*_{1:t-1}$ is the solution for the previous tasks, i.e., the old parameters. Referring to the discussion with Fig. 4, for AF-1, θ_{e,m} = 0 and I_{e,m} = 1, with λ_AF ∝ β and λ_SP ∝ (1-β); for AF-2, θ_e is the optimal solution for the current task only, with I_{e,m} = I_{t,m}, λ_AF ∝ β/(1-β) and λ_SP ∝ 1. F_{1:t-1} is recursively updated by:

$$F_{1:t-1} = F_{1:t-2} + F_{t-1}. \quad (10)$$

When β = 0, the loss function in Eqn. (9) degenerates to a similar form as regular synaptic regularization approaches that only preserve stability:

$$\mathcal{L}(\theta) = \mathcal{L}_t(\theta) + \frac{\lambda_{SP}}{2} \sum_m \xi_{1:t-1,m}\big(\theta_m - \theta^*_{1:t-1,m}\big)^2. \quad (11)$$

Generally, such methods differ from Eqn. (9) mainly in the metric ξ_{1:t-1} for estimating the importance of parameters for performing old tasks. Therefore, the active forgetting term can be naturally combined with prior arts in a plug-and-play manner.
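A sketch of how the consolidated statistics behind Eqns. (9) and (10) might be maintained across a task sequence is given below. The class name `Consolidation`, and the choice to keep one running Fisher and one parameter snapshot, are illustrative assumptions rather than details taken from the disclosure.

```python
import torch

class Consolidation:
    """Running statistics for the stability-protection term across a task sequence."""

    def __init__(self):
        self.old_params = {}  # theta*_{1:t-1}
        self.fisher = {}      # F_{1:t-1}

    def penalty(self, model, lambda_sp):
        # (lambda_SP / 2) * sum_m F_{1:t-1,m} * (theta_m - theta*_{1:t-1,m})^2
        if not self.old_params:
            return torch.tensor(0.0)
        term = 0.0
        for n, p in model.named_parameters():
            term = term + (self.fisher[n] * (p - self.old_params[n]) ** 2).sum()
        return 0.5 * lambda_sp * term

    def update(self, model, new_fisher):
        # Snapshot theta*_{1:t-1} and accumulate F_{1:t-1} = F_{1:t-2} + F_{t-1} (Eqn. (10)).
        self.old_params = {n: p.detach().clone() for n, p in model.named_parameters()}
        for n, f in new_fisher.items():
            self.fisher[n] = self.fisher.get(n, torch.zeros_like(f)) + f
```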
To combine with the architecture of K continual learners, and in order to coordinate the diversity of learners' expertise, we focus on regular synaptic regularization approaches (i.e., Eqn. (11)) as the default baseline for implementation, and the proposed forgetting rate β is adaptively implemented in each learner. Moreover, differences in their predictive distributions are regularized, which can be quantified by the widely-used Kullback-Leibler (KL) divergence. Therefore, the loss function for the full version of our neuro-inspired approach is defined as:

$$\mathcal{L}_{CAF}(\theta_{MCL}) = \mathcal{L}_t(\theta_{MCL}) + \sum_{i=1}^{K}\left[\frac{\lambda_{SP}}{2} \sum_{m=1}^{M_i} \xi_{1:t-1,m}\big(\theta_{i,m} - \theta^*_{1:t-1,i,m}\big)^2 + \frac{\lambda_{AF,i}}{2} \sum_{m=1}^{M_i} \theta_{i,m}^2\right] + \frac{1}{N_t}\sum_{n=1}^{N_t}\sum_{i=1}^{K}\sum_{j \neq i} \gamma_{i,j}\, \mathrm{KL}\big(p_i(x_{t,n}) \,\|\, p_j(x_{t,n})\big), \quad (12)$$

where M_i denotes the amount of parameters θ_i = {φ_i, g_i} for learner i, the current task has N_t training data, and p_i(x_{t,n}) is the prediction of learner i for x_{t,n}. To avoid the extra training cost of computing θ_e in AF-2, AF-1 is considered in this example, while λ_SP is kept fixed for ease of hyperparameter tuning. λ_{AF,i} and γ_{i,j} can be adaptively learned by constraining the balance of their averages $\bar{\lambda}_{AF}$ and $\bar{\gamma}$ with other loss terms, or simply fixed to the same value for each learner.
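The cooperation term of Eqn. (12) can be sketched as a pairwise KL penalty between the per-learner predictive distributions. The helpers below are a minimal, hedged illustration: `gamma` is assumed to be a K x K tensor of coefficients, a single `lambda_af` is used where the disclosure allows a per-learner λ_AF,i, and `sp_penalty` is assumed to be the precomputed stability-protection term (for instance from the `Consolidation` sketch above).

```python
import torch
import torch.nn.functional as F

def cooperation_term(per_learner_logits, gamma):
    """Pairwise penalty sum_i sum_{j!=i} gamma[i, j] * KL(p_i || p_j), averaged over the batch."""
    K = len(per_learner_logits)
    log_probs = [F.log_softmax(z, dim=1) for z in per_learner_logits]
    probs = [lp.exp() for lp in log_probs]
    term = 0.0
    for i in range(K):
        for j in range(K):
            if i == j:
                continue
            # KL(p_i || p_j) = sum_c p_i (log p_i - log p_j), then mean over the batch.
            kl = (probs[i] * (log_probs[i] - log_probs[j])).sum(dim=1).mean()
            term = term + gamma[i, j] * kl
    return term

def caf_loss(task_loss, model, sp_penalty, lambda_af, per_learner_logits, gamma):
    """Eqn. (12)-style objective with AF-1: task loss + stability protection
    + L2 active forgetting + cooperation between learners."""
    af = sum((p ** 2).sum() for p in model.parameters())  # AF-1: theta_e = 0, I_e = 1
    return task_loss + sp_penalty + 0.5 * lambda_af * af + cooperation_term(per_learner_logits, gamma)
```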
Fig. 7 illustrates exemplary performances of collaborative continual learners with adaptive forgetting (CAF) , according to aspects of the disclosure.
The disclosed method of CAF is evaluated across various continual learning benchmarks, and is compared with a wide range of representative approaches. Visual classification tasks are considered as an example in different scenarios. In addition to S-CIFAR-100, R-CIFAR-100 and R-CIFAR-10/100 for overall knowledge transfer, another three benchmarks are adopted: Omniglot for a long task sequence with an imbalanced number of classes, and CUB-200-2011 and Tiny-ImageNet for larger-scale images. As shown in Fig. 7, the performance of all baselines, evaluated with average accuracy over 5 runs with different random seeds and task orders, varies widely across experimental settings, while CAF achieves consistently state-of-the-art performance in a plug-and-play manner. In particular, CPR is a recent strong method that encourages a single continual learner to converge to a flat loss landscape, but its performance lags far behind ours.
By drawing inspiration from the adaptive mechanisms of a robust biological learning system, a generic approach for continual learning in artificial neural networks is disclosed. The superior performance and generality of the method can facilitate a variety of realistic applications, such as but not limited to autonomous driving and robotics, visual classification, intelligent manufacturing, intelligent healthcare and smartphones, to flexibly accommodate user needs and environmental changes. Meanwhile, the deployment of continual learning avoids retraining on all previous data each time the model is updated, which provides an energy-efficient and eco-friendly path for developing AI systems.
An exemplary process of continual learning by collaborative continual learners with adaptive forgetting (CAF) is shown in Algorithm 1 below.
Algorithm 1: Continual learning with collaborative continual learners and adaptive forgetting (CAF).
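Since Algorithm 1 is reproduced only as an image in the original filing, the following Python sketch reconstructs the per-task training loop from the surrounding description: parallel learners, an Eqn. (12)-style objective, and consolidation of parameters and Fisher information after each task so that they act as the prior for the next task. The function names are assumptions, and the sketch reuses the `ParallelLearners`, `Consolidation` and `caf_loss` helpers introduced above.

```python
import torch
import torch.nn.functional as F

def ensemble_fisher(model, loader):
    """Diagonal Fisher of the fused prediction for a model returning (logits, per_learner)."""
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    count = 0
    for x, y in loader:
        model.zero_grad()
        logits, _ = model(x)
        F.cross_entropy(logits, y, reduction="sum").backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach() ** 2
        count += x.size(0)
    return {n: f / max(count, 1) for n, f in fisher.items()}

def train_task(model, consolidation, loader, lambda_sp, lambda_af, gamma, epochs=10, lr=1e-3):
    """Train the parallel learners on one task with the Eqn. (12)-style objective."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            logits, per_learner = model(x)
            task_loss = F.cross_entropy(logits, y)
            sp = consolidation.penalty(model, lambda_sp)
            loss = caf_loss(task_loss, model, sp, lambda_af, per_learner, gamma)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def continual_learning(model, consolidation, task_loaders, lambda_sp, lambda_af, gamma):
    for loader in task_loaders:
        # Learn the current task with stability protection, active forgetting and cooperation.
        train_task(model, consolidation, loader, lambda_sp, lambda_af, gamma)
        # Consolidate: the updated parameters and accumulated Fisher become the prior for
        # the next task, mirroring the update-and-transfer step of blocks 805-806 in Fig. 8.
        consolidation.update(model, ensemble_fisher(model, loader))
```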
Fig. 8 illustrates an exemplary flow chart for continual learning, in accordance with various aspects of the present disclosure. As described below, some or all illustrated features may be omitted in a particular implementation within the scope of the present disclosure, and some illustrated features may not be required for implementation of all  embodiments. Further, some of the blocks may be performed parallel or in a different order. In some examples, the method may be carried out by any suitable apparatus or means for carrying out the functions or algorithm described below.
The disclosed method of collaborative continual learners with adaptive forgetting aims at incorporating neuro-inspired adaptability for continual learning in artificial intelligence, which could be used across a variety of scenarios, to name a few: visual classification, autonomous driving, robotics, intelligent manufacturing, intelligent healthcare and smartphones. What the potential applications have in common is that the agents in the corresponding scenarios need to deal with incremental tasks, so a solution that can satisfy all the tasks is expected.
Now the disclosed method is illustrated with Fig. 8. The method begins at block 801, with initializing parameters of the neural network for continual learning, wherein the neural network comprises a plurality of sub-networks. The continual learning involves a series of tasks, which may be of the same or different kinds, including but not limited to a regression task, a classification task, or a reinforcement task.
In an example, the sub-networks have the same architectures and respective parameter spaces. In a further example, the same architecture includes, but is not limited to, the number of feature channels, network width, the number of layers, etc. In another example, the number of sub-networks of a neural network is fixed. In a further example, the total amount of parameters of the network is fixed. In yet a further example, to reduce the total amount of parameters of the networks, there can be a trade-off between the number of sub-networks and the network width and/or the number of feature channels.
In an example, same random initialization is exploited and dropout is removed for each sub-network to construct a low-diversity background. In another example, either same random initialization or dropout is exploited for each sub-network to construct a middle-diversity background. In yet another example, different random initialization and dropout are exploited for each sub-network to construct a high-diversity background.
In an example, the optimal forgetting rate β in eqn. (6) can be empirically determined by a grid search of hyperparameters λ SP and/or λ AF . In a further example, the forgetting rate β can be different for each sub-network and/or for each task. In another example, λ AF can be different for each sub-network and/or for each task, but λ SP can be fixed for the plurality of sub-networks and tasks.
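As one way to realize the three diversity settings of block 801, the sketch below varies the random seed and the use of dropout per sub-network. The seed handling, the dropout rate and the interpretation of the middle setting (same initialization with dropout) are illustrative assumptions.

```python
import torch
import torch.nn as nn

def build_subnetworks(make_backbone, num_learners, diversity="high", dropout_p=0.5, base_seed=0):
    """Construct K identically structured sub-networks with low/middle/high diversity.

    low    -- same random initialization, no dropout
    middle -- same random initialization, with dropout
    high   -- different random initialization and dropout
    """
    nets = []
    for i in range(num_learners):
        seed = base_seed if diversity in ("low", "middle") else base_seed + i
        torch.manual_seed(seed)  # identical seeds give identical initialization
        backbone = make_backbone()
        if diversity in ("middle", "high"):
            backbone = nn.Sequential(backbone, nn.Dropout(dropout_p))
        nets.append(backbone)
    return nn.ModuleList(nets)
```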
The method proceeds to block 802, with inputting training data for a current task into the plurality of sub-networks in parallel.
As an example, the current task could be a visual classification task, and the input data could be image data, graph data or video data, which may be collected by one or more of an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, an industrial equipment system, an intelligent maintenance equipment system, a medical equipment system, and a camera of a smart device.
As another example, the current task could be a regression task for automatic driving, and the input data could be data collected by sensors, such as road conditions and images of road signs, together with the corresponding operations of the vehicle, such as the steering wheel angle and/or the brake/throttle state.
The method proceeds to block 803, with generating a plurality of feature representations based on the input data by the plurality of sub-networks respectively. As an example, referring to the description above, the feature representation generated by sub-network i with its dedicated parameters θ_i for sample n of task t can be denoted as h_i(x_{t,n}).
The method proceeds to block 804, with generating, by a shared output head of the plurality of sub-networks, a prediction related to the current task.
In an example, the prediction related to the current task is generated with a sum of all the sub-networks' outputs, e.g., p(x_{t,n}) = h_out(Σ_{i=1}^{K} h_i(x_{t,n})), where h_1, …, h_K correspond to the K continual learners and h_out represents the shared output head. In a further example, the prediction related to the current task is generated with a weighted sum of all the sub-networks' outputs, e.g., p(x_{t,n}) = h_out(Σ_{i=1}^{K} w_i h_i(x_{t,n})), where w_i is the weight for the feature representation of sub-network i.
In an example, the shared output head is a fully connected layer.
In an example, block 804 further comprises generating corresponding predictions based on the plurality of feature representations generated by the plurality of sub-networks respectively, which can be denoted as p_i(x_{t,n}) for sub-network i and sample n of task t.
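To make blocks 803-804 concrete, here is a hedged PyTorch sketch of the weighted-sum variant together with the per-learner predictions p_i. The module name WeightedSharedHead and the softmax normalization of the weights `w` are assumptions introduced for illustration, not details specified above.

```python
import torch
import torch.nn as nn


class WeightedSharedHead(nn.Module):
    def __init__(self, feat_dim=128, num_classes=10, num_subnets=3):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_subnets))  # one weight per sub-network
        self.head = nn.Linear(feat_dim, num_classes)    # shared fully connected head

    def forward(self, feats):
        # feats: list of K tensors of shape (batch, feat_dim), one per sub-network
        stacked = torch.stack(feats, dim=0)                    # (K, batch, feat_dim)
        weights = torch.softmax(self.w, dim=0).view(-1, 1, 1)
        fused = (weights * stacked).sum(dim=0)                 # weighted combination
        prediction = self.head(fused)                          # joint prediction
        per_learner = [self.head(f) for f in feats]            # p_i(x_{t,n})
        return prediction, per_learner
```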
The method proceeds to block 805, with calculating a loss function for the current task.
In an example, the loss function is calculated based on the target of the current task itself. For example, the target of the current task can be represented as L_t(θ_MCL).
In a further example, the loss function is calculated based on two weighted terms that selectively merge the parameters of each sub-network with the learned parameters for the previous tasks and the expanded parameters for the current task. For example, one of the weighted terms could be the stability protection term in Eqn. (2), Eqn. (9) or Eqn. (12); the other weighted term could be the active forgetting term in Eqn. (2), Eqn. (9) or Eqn. (12). In a further example, the active forgetting term can be optimized in two equivalent ways: one way is to set the expanded parameters for the current task as empty, and the other is to set the expanded parameters for the current task as the optimal solution for the current task only, without constraints on the previous tasks.
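The following is a minimal sketch of how such a two-term regularizer could be coded, assuming simple quadratic penalties toward the previously learned and expanded parameters; the exact forms of the stability protection and active forgetting terms in Eqns. (2), (9) and (12) are not reproduced here, and the function name and coefficients are placeholders.

```python
import torch


def regularized_loss(task_loss, params, prev_params, expanded_params,
                     lam_sp=1.0, lam_af=0.1):
    # Stability protection: pull the current parameters toward those learned on previous tasks.
    sp_term = sum(((p - q) ** 2).sum() for p, q in zip(params, prev_params))
    # Active forgetting: pull the current parameters toward the expanded parameters for the
    # current task (all-zero tensors when the expanded parameters are "set as empty").
    af_term = sum(((p - e) ** 2).sum() for p, e in zip(params, expanded_params))
    return task_loss + lam_sp * sp_term + lam_af * af_term
```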
In a further example, the loss function is calculated based on a term putting constraints on differences in the corresponding predictions generated based on the plurality of feature representations of the plurality of sub-networks. For example, the term could be the AF-S term in Eqn. (12).
In a further example, the loss function is calculated based on constraining the forgetting rate of each sub-network and the coefficients for the constraints on differences in the corresponding predictions to remain balanced with their averages. For example, in Eqn. (12), λ_{AF,i} and γ_{i,j} can be adaptively learned by constraining the balance of their averages, (1/K) Σ_{i=1}^{K} λ_{AF,i} and the average of γ_{i,j} over all pairs of sub-networks, with the other loss terms.
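Purely for illustration, and without reproducing Eqn. (12), the loss described in this block can be summarized in the following generic form, where L_t is the target of the current task, θ_i^{1:t-1} and θ_i^{e} denote the learned and expanded parameters of sub-network i, and D(·,·) is some discrepancy measure between per-learner predictions; the notation is an editorial assumption rather than the exact formulation of this disclosure.

```latex
\mathcal{L}(\theta_{\mathrm{MCL}}) =
  \mathcal{L}_t(\theta_{\mathrm{MCL}})
  + \lambda_{\mathrm{SP}} \sum_{i=1}^{K} \lVert \theta_i - \theta_i^{1:t-1} \rVert_2^2
  + \sum_{i=1}^{K} \lambda_{\mathrm{AF},i} \lVert \theta_i - \theta_i^{e} \rVert_2^2
  + \sum_{i \neq j} \gamma_{i,j}\, D\!\left(p_i(x_{t,n}),\, p_j(x_{t,n})\right)
```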
The method proceeds to block 806, with updating the parameters of the neural network based on the calculated loss function.
In an example, the parameters of the neural network comprise the parameters of each sub-network and the parameters of the shared output head. In a further example, the parameters of the neural network comprise weights for the plurality of feature representations. In yet a further example, the parameters of the neural network comprise the degree of merging the parameters of each sub-network with the learned parameters for previous tasks and the expanded parameters for the current task.
After training for the current task is completed at block 806, e.g., after several iterations or upon convergence, the parameters of the neural network are optimized. The optimized parameters are then transferred back to block 802 to be used as a prior distribution for learning the next task. In a further example, the optimal forgetting rate β can be empirically determined based on the updated parameters, as described above with block 801.
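Putting blocks 801-806 together, the following hedged sketch shows one possible training loop over a task sequence using quadratic stability-protection and active-forgetting penalties; `model` can be any module such as the MultiLearnerNet sketch above, `tasks` is a hypothetical iterable of per-task dataloaders, and treating the expanded parameters as zero tensors corresponds to setting them "as empty". It is a sketch under these assumptions, not the reference implementation.

```python
import torch
import torch.nn.functional as F


def train_continually(model, tasks, epochs_per_task=1, lr=1e-3,
                      lam_sp=1.0, lam_af=0.1):
    # Block 801: the current parameters act as the (initial) prior.
    prev_params = [p.detach().clone() for p in model.parameters()]
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for loader in tasks:
        # "Empty" expanded parameters for the current task, i.e. all-zero tensors.
        expanded = [torch.zeros_like(p) for p in model.parameters()]
        for _ in range(epochs_per_task):
            for x, y in loader:                     # block 802: feed the current task
                logits = model(x)                   # blocks 803-804: features + prediction
                task_loss = F.cross_entropy(logits, y)
                sp = sum(((p - q) ** 2).sum()
                         for p, q in zip(model.parameters(), prev_params))
                af = sum(((p - e) ** 2).sum()
                         for p, e in zip(model.parameters(), expanded))
                loss = task_loss + lam_sp * sp + lam_af * af   # block 805
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()                    # block 806: update parameters
        # The optimized parameters become the prior for learning the next task.
        prev_params = [p.detach().clone() for p in model.parameters()]
    return model
```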
Fig. 9 illustrates an exemplary computing system, in accordance with various aspects of the present disclosure. The computing system may comprise at least one processor 910. The computing system may further comprise at least one storage device 920. It should be appreciated that the storage device 920 may store computer-executable instructions that, when executed, cause the processor 910 to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-8.
The embodiments of the present disclosure may be embodied in a computer-readable medium such as a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform a computer implemented method by a neural network for continual learning of a series of tasks, wherein the neural network includes a plurality of sub-networks having the same architecture and respective parameter spaces. The method comprises inputting training data for a current task into the plurality of sub-networks in parallel, wherein parameters of the neural network are learned for previous tasks; generating a plurality of feature representations based on the input data by the plurality of sub-networks respectively; generating, by a shared output head of the plurality of sub-networks, a prediction related to the current task with an input of a weighted average of the plurality of feature representations; and updating the parameters of the neural network based on a loss function having constraints on the prediction related to the current task and selectively merging parameters of the sub-network with learned parameters for the previous tasks and expanded parameters for the current task.
The non-transitory computer-readable medium may further comprise instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with Figs. 1-8.
It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the  various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.

Claims (15)

  1. A computer implemented method by a neural network for continual learning of a series of tasks, wherein the neural network includes a plurality of sub-networks having the same architecture and respective parameter spaces, comprising:
    inputting training data for a current task into the plurality of sub-networks in parallel, wherein parameters of the neural network are learned for previous tasks;
    generating a plurality of feature representations based on the input data by the plurality of sub-networks respectively;
    generating, by a shared output head of the plurality of sub-networks, a prediction related to the current task with an input of weighted average of the plurality of feature representations; and
    updating the parameters of the neural network based on a loss function having constraints on the prediction related to the current task and selectively merging parameters of the sub-network with learned parameters for the previous tasks and expanded parameters for the current task.
  2. The computer implemented method of claim 1, further comprising:
    generating corresponding predictions based on the plurality of feature representations generated by the plurality of sub-networks respectively; and
    wherein the loss function further has constraints on differences in the corresponding predictions.
  3. The computer implemented method of claim 1, wherein the parameters of the neural network comprise the parameters of each sub-network, weights for the plurality of feature representations and parameters of the shared output head.
  4. The computer implemented method of claim 1, wherein the expanded parameters for the current task are set as empty.
  5. The computer implemented method of claim 1, wherein the expanded parameters for the current task are an optimal solution for the current task only, without constraints on the previous tasks.
  6. The computer implemented method of claim 1, wherein the degree of merging the parameters of the sub-network with the learned parameters for previous tasks and the expanded parameters for the current task is adaptively learned and updated for each sub-network respectively.
  7. The computer implemented method of claim 1, wherein the shared output head is a fully connected layer.
  8. The computer implemented method of claim 1, wherein the number of the plurality of sub-networks is fixed.
  9. The computer implemented method of claim 1, wherein the method is a plug-and-play approach.
  10. The computer implemented method of claim 1, further comprising one of:
    exploiting different random initialization and dropout for the plurality of sub-networks; or
    exploiting identical random initialization and dropout for the plurality of sub-networks; or
    exploiting identical random initialization and no dropout for the plurality of sub-networks.
  11. The computer implemented method of claim 1, wherein the input data comprise one or more of image data, video data, graph data, gaming data, text data, statistics, and sensor data;
    wherein the series of tasks comprise one or more of a regression task, a classification task, or a reinforcement learning task; and
    wherein the prediction comprises one or more of a classification of the image data, the video data or the graph data, a segmentation of the image data, the video data or the audio data, content or an action generated based on the text data, gaming data, graph data or video data, and a predicted value based on statistics and/or sensor data.
  12. The computer implemented method of claim 1, wherein the input data comprise data obtained in one or more of an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, an industrial equipment system, an intelligent maintenance equipment system, a medical equipment system, and a camera of a smart device.
  13. A computer system, comprising:
    one or more processors; and
    one or more storage devices storing computer-executable instructions that, when executed, cause the one or more processors to perform the operations of the method of one of claims 1-12.
  14. One or more computer readable storage media storing computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method of one of claims 1-12.
  15. A computer program product comprising computer-executable instructions that, when executed, cause one or more processors to perform the operations of the method of one of claims 1-12.