METHOD AND APPARATUS FOR INCORPORATING NEURO-INSPIRED ADAPTABILITY FOR CONTINUAL LEARNING
FIELD
Aspects of the present disclosure relate generally to artificial intelligence, and more particularly, to a method and apparatus for incorporating neuro-inspired adaptability for continual learning in artificial intelligence.
BACKGROUND
Continual learning, also known as lifelong learning, is one of the cornerstones towards artificial general intelligence (AGI). Since the real world is dynamic and unpredictable, an intelligent agent needs to learn and remember throughout its lifetime, just like the biological brain, in order to adapt effectively. For this purpose, a desirable solution should properly balance memory stability with learning plasticity, and acquire sufficient compatibility to capture the observed distribution.
Numerous efforts have been devoted to preserving memory stability to mitigate catastrophic forgetting in artificial neural networks, where parameter changes for learning a new task well typically result in a dramatic performance drop on the old tasks. Representative strategies include selectively stabilizing parameters, recovering old data distributions and allocating dedicated parameter subspaces. However, they usually achieve only modest improvements in specific scenarios, with effectiveness varying widely across experimental settings (e.g., task type and similarity, input size, number of training samples, etc.). As a result, there is still a huge gap between existing advances and realistic applications, and further afield, AGI.
SUMMARY
The following presents a simplified summary of one or more aspects to provide a basic understanding of such aspects. This summary is not an extensive overview of all contemplated aspects, and is intended to neither identify key or critical elements of all aspects nor delineate the scope of any or all aspects. Its sole purpose is to present some concepts of one or more aspects in a simplified form as a prelude to the more detailed description that is presented later.
In order to improve the performance of continual learning, that is, to perform well on all the tasks ever seen, a desirable solution should properly balance memory stability of old tasks with learning plasticity of new tasks, while being adequately compatible to capture their distributions.
Since biological learning systems are natural continual learners that exhibit strong adaptability to real-world dynamics, it is argued that they are equipped with effective strategies to address the above challenges. In particular, the γ subset of the Drosophila mushroom body (γMB) is a biological learning system that is essential for coping with different tasks in succession and is relatively well understood at both functional and anatomical levels, which makes it an excellent source of inspiration for continual learning in artificial intelligence (AI) systems.
As a key functional advantage, the γMB system can regulate memories in distinct ways for optimizing memory-guided behaviors in changing environments. First, old memories are actively protected from new disruption by strengthening the previously-learned synaptic changes. This idea of selectively stabilizing parameters has been widely used to alleviate catastrophic forgetting in continual learning. Besides, old memories can be actively forgotten for better adapting to a new memory. Specifically, there are specialized molecular signals to regulate the speed of memory decay, whose activation reduces the persistence of outdated information, while inhibition exhibits the opposite effect.
It is proposed in the disclosure a functional strategy that incorporates active forgetting together with stability protection for a better trade-off between new and old tasks, where the active forgetting part is formulated as selectively attenuating old memories in parameter distributions and optimized by a synaptic expansion-renormalization process.
The organizing principles of the γMB system that support its function are further explored; the system adopts five compartments with dynamic modulation to perform continual learning in parallel.
Inspired by this, it is proposed in the disclosure a specialized architecture of multiple parallel learning modules, which can ensure solution compatibility for incremental changes by coordinating the diversity of learners’ expertise. Adaptive implementations of the disclosed functional strategy can naturally serve this purpose through adjusting the target distribution of each learner, suggesting that the neurological adaptive mechanisms are highly synergistic rather than operating in isolation.
The disclosed method and apparatus draw inspiration from the adaptive mechanisms of a robust biological learning system, exhibit superior generality across a variety of continual learning scenarios, and achieve state-of-the-art performance in a plug-and-play manner. The disclosed method and apparatus can facilitate realistic applications, such as autonomous driving and robotics, intelligent manufacturing and smartphones, etc., to flexibly accommodate user needs and environmental changes. Meanwhile, the deployment of continual learning avoids retraining on all previous data each time the model is updated, which provides an energy-efficient and eco-friendly path for developing AI systems.
According to an aspect of the disclosure, a computer-implemented method by a neural network for continual learning of a series of tasks is disclosed, wherein the neural network includes a plurality of sub-networks having the same architecture and respective parameter spaces. The method comprises inputting training data for a current task into the plurality of sub-networks in parallel, wherein parameters of the neural network have been learned for previous tasks; generating a plurality of feature representations based on the input data by the plurality of sub-networks respectively; generating, by a shared output head of the plurality of sub-networks, a prediction related to the current task with an input of a weighted average of the plurality of feature representations; and updating the parameters of the neural network based on a loss function that has constraints on the prediction related to the current task, and selectively merging parameters of each sub-network with learned parameters for the previous tasks and expanded parameters for the current task.
According to a further aspect, the method further comprises generating corresponding predictions based on the plurality of feature representations generated by the plurality of sub-networks respectively; and wherein the loss function further has constraints on differences in the corresponding predictions.
According to a further aspect, the parameters of the neural network comprise the parameters of each sub-network, weights for the plurality of feature representations and parameters of the shared output head.
According to a further aspect, the expanded parameters for the current task are set as empty. According to another further aspect, the expanded parameters for the current task are the optimal solution for the current task only, without constraints on the previous tasks.
According to a further aspect, the degree of merging the parameters of each sub-network with learned parameters for previous tasks and expanded parameters for the current task is adaptively learned and updated for each sub-network respectively.
According to a further aspect, the shared output head is a fully connected layer.
According to a further aspect, the method is a plug-and-play approach.
According to a further aspect, the method comprises one of exploiting different random initialization and dropout for the plurality of sub-networks; or exploiting identical random initialization and dropout for the plurality of sub-networks; or exploiting identical random initialization and no dropout for the plurality of sub-networks.
According to a further aspect, the input data comprise one or more of image data, video data, graph data, gaming data, text data, statistics and sensor data; the series of tasks comprises one or more of a regression task, a classification task, or a reinforcement task; and the prediction comprises one or more of a classification of the image data, the video data or the graph data, a segmentation of the image data, the video data or the audio data, content or an action generated based on the text data, gaming data, graph data or video data, and a predicted value based on statistics and/or sensor data.
According to a further aspect, the input data comprise data obtained in one or more of an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, an industrial equipment system, an intelligent maintenance equipment system, a medical equipment system, and a camera of a smart device.
BRIEF DESCRIPTION OF THE DRAWINGS
The disclosed aspects will be described in connection with the appended drawings that are provided to illustrate and not to limit the disclosed aspects.
Fig. 1 illustrates an exemplary framework of the γ subset of the Drosophila mushroom body (γMB), according to aspects of the disclosure.
Fig. 2 illustrates an exemplary framework of a neuro-inspired computational model, according to aspects of the disclosure.
Fig. 3 illustrates an exemplary formulation of active forgetting in combination with stability protection, according to aspects of the disclosure.
Fig. 4 illustrates two equivalent ways for optimization of active forgetting, according to aspects of the disclosure.
Fig. 5 illustrates exemplary performances of active forgetting on three representative benchmarks for continual learning of visual classification tasks, according to aspects of the disclosure.
Fig. 6 illustrates exemplary γMB-like architecture of multiple continual learners, according to aspects of the disclosure.
Fig. 7 illustrates exemplary performances of collaborative continual learners with adaptive forgetting (CAF), according to aspects of the disclosure.
Fig. 8 illustrates an exemplary flow chart for continual learning, in accordance with various aspects of the present disclosure.
Fig. 9 illustrates an exemplary computing system, in accordance with various aspects of the present disclosure.
DETAILED DESCRIPTION
The present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
Various embodiments will be described in detail with reference to the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts. References made to examples and embodiments are for illustrative purposes, and are not intended to limit the scope of the disclosure.
The ability to incrementally learn a sequence of tasks, which is also named continual learning, is critical for artificial neural networks. Since the training data distribution is highly dynamic, the network needs to carefully trade-off the learning plasticity and memory stability. In general, excessive plasticity in learning new tasks leads to the catastrophic forgetting of old tasks, while excessive stability in remembering old tasks limits the learning of new tasks.
In order to perform well on all the tasks ever seen, a desirable solution should properly balance memory stability of old tasks with learning plasticity of new tasks, while being adequately compatible to capture their distributions. For example, if you want to accommodate a sequence of cakes (i.e., incremental tasks) into a bag (i.e., a solution), you should (within certain limits) optimize the efficiency of space allocation for each cake as well as the total space of the bag, rather than simply freezing the old cakes.
To draw inspiration from biological learning systems, which exhibit strong adaptability to real-world dynamics, the γ subset of the Drosophila mushroom body (γMB) is considered as a biological learning system that is essential for coping with different tasks in succession and is relatively well understood at both functional and anatomical levels.
Fig. 1 illustrates an exemplary framework of the γ subset of the Drosophila mushroom body (γMB), according to aspects of the disclosure.
As a key functional advantage, the γMB system can regulate memories in distinct ways for optimizing memory-guided behaviors in changing environments. First, old memories are actively protected from new disruption by strengthening the previously-learned synaptic changes. This idea of selectively stabilizing parameters has been widely used to alleviate catastrophic forgetting in continual learning. Besides, old memories can be actively forgotten for better adapting to a new memory. Specifically, there are specialized molecular signals to regulate the speed of memory decay, whose activation reduces the persistence of outdated information, while inhibition exhibits the opposite effect. Further, the γMB system adopts five compartments (γ1-5) with dynamic modulation to perform continual learning in parallel.
As shown in Fig. 1, the sensory information (A, B, C, …) is incrementally input from Kenyon cells (KCs) in the MB, while the valence (Va) is conveyed by dopaminergic neurons (DANs). The outputs of these compartments (γ1-γ5) are carried by distinct MB output neurons (MBONs) and integrated in a weighted-sum fashion to guide adaptive behaviors (Be). In particular, the DANs allow for distinct learning rules and forgetting rates in each compartment, where the latter has been shown important for processing sequential conflicting experiences. The distinct learning rules could be modulated by respective reward (shown as hollow circles in Fig. 1) and punishment (shown as gray circles in Fig. 1).
Fig. 2 illustrates an exemplary framework of a neuro-inspired computational model, according to aspects of the disclosure.
Inspired by biological advantages of the γMB system mentioned above, it is disclosed herein to incorporate active forgetting together with stability protection for a better trade-off between new and old tasks, and accordingly coordinate multiple continual learners to ensure solution compatibility.
As an example, five continual learners L1-L5, corresponding to the five compartments of the γMB system, are shown in Fig. 2, although other quantities of continual learners are allowed. The white blocks represent learned parameters for task A, the gray blocks represent learned parameters for task B, and the black blocks represent learned parameters for task C. With a sequence of three tasks learned in Fig. 2, the combinations of blocks labeled a, b and c show different degrees of merging the parameters for different tasks into a solution: a shows excessive stability as it sticks too much to an old task, b shows a proper balance between old and new tasks, and c shows excessive forgetting since it mainly focuses on the latest task. The dashed areas on the right denote the target distributions of L1-L5, which are coordinated to obtain better solutions for both old and new tasks, and the small hollow circles represent the optimal solution for each incremental task.
i. Active forgetting with stability protection
A central challenge of continual learning is to resolve the mutual interference between new and old tasks, which arises from their distributional differences. The functional advantages of the γMB system suggest that stability protection and actively-regulated forgetting are both important, although current efforts mainly focus on the former to prevent catastrophic forgetting.
It is disclosed herein that this process can be formulated with the framework of Bayesian learning, which is able to model biological synaptic plasticity by tracking the probability distribution of synaptic weights under the input of dynamic sensory information. A simple case of two tasks is described first.
Let's consider a neural network with parameters θ continually learning tasks A and B from their training data D_A and D_B, so as to perform well on their test data; such a network is called a "continual learner". From a Bayesian perspective, the learner first places a prior distribution p (θ) on the model parameters. After learning task A, the learner updates its belief about the parameters, resulting in a posterior distribution p (θ∣D_A) ∝ p (D_A∣θ) p (θ) that incorporates the knowledge of task A. Then task A can be successfully performed by finding a mode of the posterior:

θ*_A = argmax_θ log p (θ∣D_A)
To continually learn task B, p (θ∣D_A) can be used as a prior distribution, and the posterior p (θ∣D_A, D_B) ∝ p (D_B∣θ) p (θ∣D_A) will further incorporate the knowledge of task B. Similarly, the learner needs to find

θ*_{A,B} = argmax_θ [log p (D_B∣θ) + log p (θ∣D_A)]

corresponding to maximizing both log p (D_B∣θ) for learning task B and log p (θ∣D_A) for remembering task A.
However, due to the differences in data distribution, remembering old tasks precisely can increase the difficulty of learning each new task well. Inspired by biological active forgetting, a forgetting rate β is introduced in the disclosure, and p (θ∣D_A) is replaced with

p̃ (θ∣D_A, β) = (1/Z) p (θ∣D_A)^{1−β} p (θ)^β      (1)

where p (θ) is a non-informative prior without incorporating old knowledge, and Z is a β-dependent normalizer that keeps p̃ (θ∣D_A, β) a normalized probability distribution. p̃ (θ∣D_A, β) will actively forget task A when β→1, while being dominated by p (θ∣D_A) with full old knowledge when β→0. Therefore, for the new target p (θ∣D_A, D_B, β),
the loss function is derived as

L (θ) = L_B (θ) + λ_SP Σ_m F_{A,m} (θ_m − θ*_{A,m})² + λ_AF Σ_m I_{e,m} (θ_m − θ_{e,m})²      (2)

where L_B (θ) is the loss function of learning task B, and λ_SP and λ_AF are hyperparameters that control the strengths of the two regularizers responsible for stability protection and active forgetting, respectively. The stability protection part selectively penalizes the deviation of each parameter θ_m from the old solution θ*_{A,m}, depending on its "importance" for the old task (s), estimated by the Fisher information F_{A,m}; the active forgetting part, with expanded parameters θ_{e,m} and selectivity I_{e,m}, is detailed below.
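As a concrete illustration, the combined objective of Eqn. (2) can be sketched in a few lines of NumPy. This is a minimal sketch on a toy quadratic task, not the disclosed training procedure; the function name, the toy task-B loss and all numeric values are assumptions for illustration only.

```python
import numpy as np

def af_sp_loss(theta, task_loss, theta_old, fisher_old,
               theta_e, sel_e, lam_sp, lam_af):
    # Stability protection: Fisher-weighted penalty on drifting
    # away from the solution learned for the old task(s).
    sp = lam_sp * np.sum(fisher_old * (theta - theta_old) ** 2)
    # Active forgetting: selective renormalization toward the
    # "expanded" parameters theta_e with selectivity sel_e.
    af = lam_af * np.sum(sel_e * (theta - theta_e) ** 2)
    return task_loss(theta) + sp + af

# Toy quadratic loss for task B with optimum at [1.0, -1.0]
task_b = lambda th: float(np.sum((th - np.array([1.0, -1.0])) ** 2))
theta = np.array([0.5, -1.0])
theta_old = np.array([0.4, -0.8])   # solution learned on task A
fisher_old = np.array([2.0, 0.5])   # diagonal Fisher importance for task A

beta = 0.3  # forgetting rate
# AF-1 setting: theta_e is "empty" (zeros) with uniform selectivity,
# lam_af proportional to beta and lam_sp proportional to 1 - beta.
loss = af_sp_loss(theta, task_b, theta_old, fisher_old,
                  theta_e=np.zeros(2), sel_e=np.ones(2),
                  lam_sp=1.0 - beta, lam_af=beta)
```

With β = 0 the active forgetting term vanishes and only the task loss and the stability-protection penalty remain, matching the degenerate case discussed below.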
Fig. 3 illustrates an exemplary formulation of active forgetting in combination with stability protection, according to aspects of the disclosure. As shown in Fig. 3, all the connected solid lines are learned for task B, wherein the black solid lines were learned for the previous task A and are maintained by virtue of stability protection, whereas the dashed lines were learned for the previous task A but are abandoned through active forgetting.
The optimization of active forgetting can be achieved in two equivalent ways, AF-1 and AF-2. With the introduction of the forgetting rate β, a learner needs to find:

θ* = argmax_θ [log p (D_B∣θ) + (1−β) log p (θ∣D_A) + β log p (θ)]

AF-1 and AF-2 both encourage the network parameters θ to renormalize with an "expanded" set of parameters θ_e when learning task B.

Fig. 4 illustrates the two equivalent ways for optimization of active forgetting, according to aspects of the disclosure, where the network parameters θ are selectively renormalized with θ*_A and θ_e in order to mutually balance new and old tasks in a shared solution.

For AF-1, θ_{e,m} = 0 is "empty" with equal selectivity I_{e,m} = 1 for renormalization, where the active forgetting term becomes the L2 norm of θ. The hyperparameters satisfy λ_AF ∝ β and λ_SP ∝ 1−β, indicating that the old memories are directly affected. For AF-2, θ_e = θ*_B is the optimal solution for task B only, obtained by optimizing L_B (θ_e), and I_{e,m} = F_{B,m} is the Fisher information. The forgetting rate is fully integrated into λ_AF ∝ β/(1−β) and is independent of λ_SP. In the absence of active forgetting (β = 0), the loss function in Eqn. (2) is left with only the stability protection term and L_B (θ), which is approximately equivalent to that of regular synaptic regularization approaches such as elastic weight consolidation (EWC). In particular, since the loss functions of such methods typically have a similar form and differ only in the metric for estimating the importance of parameters, the disclosed proposal can be naturally combined with them by plugging in the active forgetting term.
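The two settings above can be summarized as a small mapping from the forgetting rate β to the regularizer strengths. The function name and the base scale c are illustrative assumptions, not part of the disclosure.

```python
def af_hyperparams(beta, scheme, c=1.0):
    """Map the forgetting rate beta to (lam_sp, lam_af) under the two
    equivalent optimization schemes; c is an illustrative base scale."""
    if scheme == "AF-1":
        # lam_af grows with beta while lam_sp shrinks: old memories
        # are directly attenuated.
        return {"lam_sp": c * (1.0 - beta), "lam_af": c * beta}
    if scheme == "AF-2":
        # The forgetting rate is fully absorbed into lam_af, leaving
        # lam_sp independent of beta.
        return {"lam_sp": c, "lam_af": c * beta / (1.0 - beta)}
    raise ValueError(f"unknown scheme: {scheme}")

print(af_hyperparams(0.5, "AF-1"))  # {'lam_sp': 0.5, 'lam_af': 0.5}
```

Note that both mappings recover a pure stability-protection objective at β = 0, consistent with the EWC-like degenerate case described above.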
For biological neural networks, actively-regulated forgetting is able to remove outdated information and provide flexibility for adapting to a new memory. In the disclosed proposal, a properly-selected β will maximize the probability of learning each new task well through forgetting the old knowledge:

β* = argmax_β p (D_B∣θ* (β))

where θ* (β) denotes the solution obtained with forgetting rate β; in practice, β* can be empirically determined by a grid search of λ_AF and/or λ_SP. Additionally, when θ moves to the neighborhood of an empirical optimal solution, the active forgetting term in Eqn. (2) can directly minimize the generalization errors, so as to improve the performance of continual learning.
In addition to the theoretical analysis of how active forgetting removes outdated information and provides flexibility for adapting to a new memory, Fig. 5 illustrates exemplary performances of active forgetting on three representative benchmarks for continual learning of visual classification tasks, according to aspects of the disclosure. The three benchmarks are all constructed from the CIFAR-100 dataset, which contains 100-class natural images with 500 training samples per class of size 32×32, but the overall knowledge transfer ranges from more negative to more positive due to different construction principles.
As shown in Fig. 5, the proposed active forgetting can largely enhance the average accuracy (illustrated in top row A) for all incremental tasks, using EWC as a baseline for preserving stability. The effects of active forgetting on learning plasticity for new tasks and memory stability for old tasks are then analyzed by evaluating forward transfer (illustrated in middle row B) and backward transfer (illustrated in bottom row C), respectively, where the former is clearly dominant. A 6-layer convolutional neural network (CNN) is used, and all results are averaged over 5 runs with different random seeds and task orders.
ii. Coordination of Multiple Continual Learners
In addition to describing actively regulated forgetting together with stability protection in a continual learning model, the organizing principles of the γMB system, where new memories form and active forgetting happens, are now discussed. As discussed with Fig. 1, there are five compartments that process sequential experiences in parallel, and the outputs of these compartments are integrated in a weighted-sum fashion to guide adaptive behaviors.
Inspired by this, a γMB-like architecture of multiple continual learners in parallel is disclosed, as illustrated in Fig. 6. Specifically, there are K identically structured neural networks f_{φ_i} corresponding to K continual learners L_i, i = 1, …, K, with their dedicated parameter spaces. The dedicated output head of each learner is removed, and the weighted sum of the previous layer's outputs is fed into a shared output head h to make predictions, where the output weight g_i of each learner, i = 1, …, K, is incrementally updated. Then, the final prediction becomes

ŷ (x) = h (Σ_{i=1}^{K} g_i f_{φ_i} (x))

where the optimizable parameters θ_MCL include all φ_i, all g_i and the parameters of the shared output head h.
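A minimal forward pass of this architecture might look as follows, with linear feature extractors standing in for the sub-networks. The dimensions, the uniform output weights g and the names (phis, W_head, predict) are assumptions for illustration, not the disclosed network.

```python
import numpy as np

rng = np.random.default_rng(0)
K, d_in, d_feat, d_out = 5, 8, 16, 10  # five learners, echoing the gammaMB analogy

# K identically structured feature extractors with dedicated parameters phi_i
phis = [rng.normal(size=(d_in, d_feat)) for _ in range(K)]
g = np.ones(K) / K                         # output weight of each learner
W_head = rng.normal(size=(d_feat, d_out))  # shared output head

def predict(x):
    # Weighted sum of the learners' feature representations,
    # fed into the single shared output head.
    feats = [x @ phi for phi in phis]
    mixed = sum(gi * f for gi, f in zip(g, feats))
    return mixed @ W_head

logits = predict(rng.normal(size=(d_in,)))
```

Because the head is shared and only the features are mixed, the per-learner output heads are indeed removed, as the architecture above describes.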
In such a γMB-like architecture, the relationship between learners is critical to the performance of continual learning. When the diversity of their expertise is properly coordinated, the obtained solution can provide a high degree of compatibility with both new and old tasks, vastly superior to that of a single continual learner (SCL) .
In general, each learner's expertise is directly modulated by its target distribution, and the functional strategy described above (active forgetting with stability protection) can serve this purpose: the target distribution p̃ (θ∣D_A, β) tends to be different for learners with different forgetting rates β. Therefore, a learnable forgetting rate is applied to each learner to adaptively coordinate their relationship. Further, to encourage cooperation among the continual learners, a supplementary modulation is proposed that explicitly constrains the differences in predictions made from the feature representations of the respective learners, in order to modulate the learning rules for the current task.
iii. Collaborative continual learners with adaptive forgetting (CAF)
Based on the discussion set forth above, and drawing inspiration from the adaptive mechanisms of a robust biological learning system, an approach combining active forgetting with stability protection and the architecture of multiple continual learners is disclosed, which is named collaborative continual learners with adaptive forgetting (CAF).
For the case of two tasks as mentioned before, the learner needs to find a mode of the posterior distribution that incorporates the knowledge of tasks A and B:

θ* = argmax_θ [log p (D_B∣θ) + log p (θ∣D_A)]      (4)

where p (D_B∣θ) corresponds to the loss for the current task B. Although p (θ∣D_A) is generally intractable, it can be locally approximated with a second-order Taylor expansion around θ*_A, resulting in a Gaussian distribution whose mean is θ*_A and whose precision matrix is the Hessian of the negative log posterior. To simplify the computation, the Hessian can be approximated by the diagonal of the Fisher information matrix:

log p (θ∣D_A) ≈ −(1/2) Σ_m F_{A,m} (θ_m − θ*_{A,m})² + const
To improve learning plasticity, the forgetting rate β is introduced, and p (θ∣D_A) in Eqn. (4) is replaced with p̃ (θ∣D_A, β) (i.e., Eqn. (1)), where the target distribution p (θ∣D_A, D_B) becomes p (θ∣D_A, D_B, β). A properly-selected β will maximize the probability of learning each new task well through forgetting the old knowledge:

β* = argmax_β p (D_B∣θ* (β))      (6)

where θ* (β) denotes the solution obtained with forgetting rate β. With the implementation of active forgetting, the learner needs to find

θ* = argmax_θ [log p (D_B∣θ) + (1−β) log p (θ∣D_A) + β log p (θ)]

which can be optimized in two equivalent ways (AF-1 and AF-2) as discussed with Fig. 4.
For continual learning of more than two tasks, e.g., t tasks (t > 2), the learner needs to find:

θ* = argmax_θ [log p (D_t∣θ) + log p (θ∣D_{1:t−1}, β_{1:t−1})]

where D_{1:t−1} are the training datasets of the previous tasks and β_{1:t−1} are the previously used forgetting rates. Similarly, the posterior p (θ∣D_{1:t−1}, β_{1:t−1}) that absorbs all the information of D_{1:t−1} is replaced with a tempered distribution analogous to Eqn. (1). To simplify the hyperparameter tuning, a fixed forgetting rate may be adopted for each learning task, i.e., β_i = β for i = 1, …, t. Then the loss function may be modified from Eqn. (2) into:

L (θ) = L_t (θ) + λ_SP Σ_m F_{1:t−1,m} (θ_m − θ*_{1:t−1,m})² + λ_AF Σ_m I_{e,m} (θ_m − θ_{e,m})²      (8)
where θ*_{1:t−1} is the solution for the previous tasks, i.e., the old parameters. Referring to the discussion with Fig. 4, for AF-1, θ_{e,m} = 0 and I_{e,m} = 1, with λ_AF ∝ β and λ_SP ∝ (1−β); for AF-2, θ_e = θ*_t is the optimal solution for the current task only, I_{e,m} = F_{t,m}, with λ_AF ∝ β/(1−β) and λ_SP ∝ 1. F_{1:t−1} is recursively updated by:

F_{1:t−1} = F_{1:t−2} + F_{t−1}      (10)

When β = 0, the loss function in Eqn. (8) degenerates to a similar form as regular synaptic regularization approaches that only preserve stability:

L (θ) = L_t (θ) + λ_SP Σ_m ξ_{1:t−1,m} (θ_m − θ*_{1:t−1,m})²      (11)

Generally, such methods differ from Eqn. (8) mainly in the metric ξ_{1:t−1} for estimating the importance of parameters for performing old tasks. Therefore, the active forgetting term can be naturally combined with prior arts in a plug-and-play manner.
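The recursion in Eqn. (10) amounts to a running sum of per-task Fisher estimates, so old-task importances never need to be recomputed. The class name below is an illustrative assumption.

```python
import numpy as np

class FisherAccumulator:
    """Maintains F_{1:t-1} via the recursion F_{1:t-1} = F_{1:t-2} + F_{t-1},
    adding each finished task's diagonal Fisher estimate to a running total."""
    def __init__(self, n_params):
        self.F = np.zeros(n_params)

    def add_task(self, fisher_t):
        self.F = self.F + fisher_t  # the Eqn. (10) update
        return self.F

acc = FisherAccumulator(3)
acc.add_task(np.array([1.0, 0.5, 0.0]))              # after task 1
F_running = acc.add_task(np.array([0.2, 0.3, 0.4]))  # after task 2
```

Only the accumulated vector needs to be stored between tasks, which keeps the memory cost independent of the task sequence length.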
To combine with the architecture of K continual learners, and in order to coordinate the diversity of the learners' expertise, regular synaptic regularization approaches (i.e., Eqn. (11)) are taken as the default baseline for implementation, and the proposed forgetting rate β is adaptively implemented in each learner. Moreover, differences in the learners' predictive distributions are regularized, which can be quantified by the widely-used Kullback-Leibler (KL) divergence. Therefore, the loss function for the full version of the disclosed neuro-inspired approach is defined as:

L (θ_MCL) = Σ_{i=1}^{K} [L_t (θ_i) + λ_SP Σ_{m=1}^{M_i} ξ_{1:t−1,m} (θ_{i,m} − θ*_{1:t−1,m})² + λ_{AF,i} Σ_{m=1}^{M_i} θ_{i,m}² + (1/N_t) Σ_{n=1}^{N_t} Σ_{j≠i} γ_{i,j} KL (p_i (x_{t,n}) ∥ p_j (x_{t,n}))]

where M_i denotes the number of parameters θ_i = {φ_i, g_i} for learner i. The current task has N_t training data, and p_i (x_{t,n}) is the prediction of learner i for input x_{t,n}. To avoid the extra training cost of computing θ*_t in AF-2, AF-1 is considered in this example (so the active forgetting term reduces to the L2 norm of θ_i), while λ_SP is kept fixed for ease of hyperparameter tuning. λ_{AF,i} and γ_{i,j} can be adaptively learned by constraining the balance of their averages with the other loss terms, or simply fixed to the same value for each learner.
Fig. 7 illustrates exemplary performances of collaborative continual learners with adaptive forgetting (CAF) , according to aspects of the disclosure.
The disclosed method of CAF is evaluated across various continual learning benchmarks and compared with a wide range of representative approaches, taking visual classification tasks in different scenarios as examples. In addition to S-CIFAR-100, R-CIFAR-100 and R-CIFAR-10/100 for overall knowledge transfer, another three benchmarks are adopted: Omniglot for a long task sequence with an imbalanced number of classes, and CUB-200-2011 and Tiny-ImageNet for larger-scale images. As shown in Fig. 7, the performance of all baselines, evaluated with average accuracy over 5 runs with different random seeds and task orders, varies widely across experimental settings, while CAF achieves consistently state-of-the-art performance in a plug-and-play manner. In particular, CPR is a recent strong method that encourages a single continual learner to converge to a flat loss landscape, but its performance lags far behind that of CAF.
By drawing inspiration from the adaptive mechanisms of a robust biological learning system, a generic approach for continual learning in artificial neural networks is disclosed. The superior performance and generality of the method can facilitate a variety of realistic applications, such as but not limited to autonomous driving and robotics, visual classification, intelligent manufacturing, intelligent healthcare and smartphones, to flexibly accommodate user needs and environmental changes. Meanwhile, the deployment of continual learning avoids retraining on all previous data each time the model is updated, which provides an energy-efficient and eco-friendly path for developing AI systems.
An exemplary process of continual learning by collaborative continual learners with adaptive forgetting (CAF) is shown in Algorithm 1 below.
Fig. 8 illustrates an exemplary flow chart for continual learning, in accordance with various aspects of the present disclosure. As described below, some or all illustrated features may be omitted in a particular implementation within the scope of the present disclosure, and some illustrated features may not be required for implementation of all embodiments. Further, some of the blocks may be performed in parallel or in a different order. In some examples, the method may be carried out by any suitable apparatus or means for carrying out the functions or algorithm described below.
The disclosed method of collaborative continual learners with adaptive forgetting aims at incorporating neuro-inspired adaptability for continual learning in artificial intelligence, and could be used across a variety of scenarios, to name a few: visual classification, autonomous driving, robotics, intelligent manufacturing, intelligent healthcare and smartphones. What the potential applications have in common is that the agents in the corresponding scenarios need to deal with incremental tasks, and thus a solution that can satisfy all the tasks is expected.
Now the disclosed method is illustrated with Fig. 8. The method begins at block 801, with initializing parameters of the neural network for continual learning, wherein the neural network comprises a plurality of sub-networks. The continual learning involves a series of tasks, which may be of the same or different kinds, including but not limited to a regression task, a classification task, or a reinforcement task.
In an example, the sub-networks have the same architecture and respective parameter spaces. In a further example, the same architecture includes but is not limited to the number of feature channels, the network width, the number of layers, etc. In another example, the number of sub-networks of a neural network is fixed. In a further example, the total amount of parameters of the network is fixed. In yet a further example, to reduce the total amount of parameters of the network, there can be a trade-off between the number of sub-networks and the network width and/or the number of feature channels.
In an example, same random initialization is exploited and dropout is removed for each sub-network to construct a low-diversity background. In another example, either same random initialization or dropout is exploited for each sub-network to construct a middle-diversity background. In yet another example, different random initialization and dropout are exploited for each sub-network to construct a high-diversity background.
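The three diversity regimes can be sketched as an initialization helper. Note that the middle regime here shares the initialization while enabling dropout, which is one of the two middle-diversity options described above; the function and level names are illustrative assumptions.

```python
import numpy as np

def init_subnetworks(K, shape, diversity):
    # 'low':  one shared random init, dropout disabled.
    # 'mid':  shared init with dropout enabled (one of the two
    #         middle-diversity options described above).
    # 'high': a different random init per sub-network plus dropout.
    if diversity == "low":
        w = np.random.default_rng(0).normal(size=shape)
        return [w.copy() for _ in range(K)], False
    if diversity == "mid":
        w = np.random.default_rng(0).normal(size=shape)
        return [w.copy() for _ in range(K)], True
    if diversity == "high":
        return [np.random.default_rng(s).normal(size=shape)
                for s in range(K)], True
    raise ValueError(f"unknown diversity level: {diversity}")

low_ws, low_drop = init_subnetworks(5, (4, 4), "low")
high_ws, high_drop = init_subnetworks(5, (4, 4), "high")
```

Returning the dropout flag alongside the weights keeps the diversity choice in one place, so the same training loop can be reused across the three regimes.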
In an example, the optimal forgetting rate β in Eqn. (6) can be empirically determined by a grid search of the hyperparameters λ_SP and/or λ_AF. In a further example, the forgetting rate β can be different for each sub-network and/or for each task. In another example, λ_AF can be different for each sub-network and/or for each task, but λ_SP can be fixed for the plurality of sub-networks and tasks.
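A plain grid search over the two hyperparameters can be sketched as below; `evaluate` stands for any user-supplied validation routine, and the names are hypothetical rather than the disclosure's exact selection procedure:

```python
from itertools import product

def grid_search(candidates_sp, candidates_af, evaluate):
    """Pick the (lambda_SP, lambda_AF) pair, and hence the implied
    forgetting rate, that maximizes a validation score."""
    best, best_score = None, float("-inf")
    for lam_sp, lam_af in product(candidates_sp, candidates_af):
        score = evaluate(lam_sp, lam_af)  # e.g., held-out accuracy
        if score > best_score:
            best, best_score = (lam_sp, lam_af), score
    return best, best_score
```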
The method proceeds to block 802, with inputting training data for a current task into the plurality of sub-networks in parallel.
As an example, the current task could be a visual classification task, and the input data could be image data, graph data, or video data, which may be collected by one or more of an automatic driving system, an intelligent transportation system, an intelligent manufacturing system, an industrial equipment system, an intelligent maintenance equipment system, a medical equipment system, and a camera of a smart device.
As another example, the current task could be a regression task for automatic driving, and the input data could be data collected by sensors, such as road conditions and images of road signs, together with corresponding operations of the vehicle, such as steering wheel angle and/or brake/throttle status.
The method proceeds to block 803, with generating a plurality of feature representations based on the input data by the plurality of sub-networks respectively. As an example, referring to the description above, each sub-network i generates a feature representation of the input data with its own dedicated parameters.
The method proceeds to block 804, with generating, by a shared output head of the plurality of sub-networks, a prediction related to the current task.
In an example, the prediction related to the current task is generated with a sum of all the sub-networks' outputs, i.e., the shared output head is applied to the sum of the K feature representations produced by the K continual learners. In a further example, the prediction related to the current task is generated with a weighted sum of all the sub-networks' outputs.
In an example, the shared output head is a fully connected layer.
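Under these assumptions, blocks 802-804 can be sketched as follows, with each sub-network reduced to a single linear map standing in for the shared architecture and the names being illustrative:

```python
import numpy as np

def forward(subnet_params, head_w, x, weights=None):
    """Forward pass: K sub-networks in parallel, one shared head.

    The prediction is the shared fully connected head applied to the
    (optionally weighted) sum of the K feature representations.
    """
    feats = [x @ w for w in subnet_params]          # one feature map per learner
    if weights is None:
        weights = np.ones(len(feats)) / len(feats)  # plain average by default
    merged = sum(a * f for a, f in zip(weights, feats))
    return merged @ head_w                          # shared output head
```

With uniform weights this reduces to the plain (averaged) sum described above; task-specific weights realize the weighted-sum variant.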
In an example, block 804 further comprises generating corresponding predictions based on the plurality of feature representations generated by the plurality of sub-networks respectively, which could be represented by p_i(x_{t,n}) for sub-network i and sample n of task t.
The method proceeds to block 805, with calculating a loss function for the current task.
In an example, the loss function is calculated based on the target of the current task itself. For example, the target of the current task can be represented as L_t(θ_MCL).
In a further example, the loss function is calculated based on two weighted terms that selectively merge the parameters of each sub-network with the parameters learned for the previous tasks and the parameters expanded for the current task. For example, one of the weighted terms could be the stability protection term in Eqn. (2), Eqn. (9) or Eqn. (12); another of the weighted terms could be the active forgetting term in Eqn. (2), Eqn. (9) or Eqn. (12). In a further example, the active forgetting term can be optimized in two equivalent ways: one is to set the expanded parameters for the current task as empty, and the other is to set the expanded parameters for the current task as the optimal solution for the current task only, without constraints on the previous tasks.
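A minimal sketch of such a loss, assuming simple quadratic penalties for the stability protection and active forgetting terms (the exact form of Eqn. (2), Eqn. (9) and Eqn. (12) may differ), is:

```python
import numpy as np

def total_loss(task_loss, params, old_params, expanded_params,
               lam_sp=1.0, lam_af=0.1):
    """Illustrative continual-learning loss: the task loss, a
    stability-protection term pulling the current parameters toward
    those learned for previous tasks, and an active-forgetting term
    pulling them toward the parameters expanded for the current task.
    Setting expanded_params to zeros realizes the 'empty expansion'
    variant described in the text. Quadratic penalties are assumed
    purely for illustration.
    """
    sp = lam_sp * np.sum((params - old_params) ** 2)       # stability protection
    af = lam_af * np.sum((params - expanded_params) ** 2)  # active forgetting
    return task_loss + sp + af
```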
In a further example, the loss function is calculated based on a term constraining differences among the corresponding predictions generated from the plurality of feature representations of the plurality of sub-networks. For example, the term could be the AF-S term in Eqn. (12).
In a further example, the loss function is calculated based on constraining a balance of the forgetting rate for each sub-network and the coefficients for the constraints on differences in the corresponding predictions with their averages. For example, in Eqn. (12), λ_AF,i and γ_{i,j} can be adaptively learned by constraining the balance of their respective averages with the other loss terms.
The method proceeds to block 806, with updating the parameters of the neural network based on the calculated loss function.
In an example, the parameters of the neural network comprise the parameters of each sub-network and the parameters of the shared output head. In a further example, the parameters of the neural network comprise the weights for the plurality of feature representations. In yet a further example, the parameters of the neural network comprise the degree of merging of the parameters of each sub-network with the parameters learned for the previous tasks and the parameters expanded for the current task.
After the training for the current task completes in block 806, either after several iterations or upon convergence, the parameters of the neural network are optimized. The optimized parameters are then transferred back to block 802 and used as a prior distribution for learning the next task. In a further example, the optimal forgetting rate β can be empirically determined based on the updated parameters, as described above in connection with block 801.
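The overall flow of blocks 801-806, with the optimized parameters of one task carried forward as the prior for the next, can be sketched as follows; `train_one_task` stands for any routine implementing blocks 802-806 for a single task, and the names are hypothetical:

```python
def continual_learning_loop(tasks, init_params, train_one_task):
    """Outer loop of the disclosed method: initialize once (block 801),
    then train on each task in turn and carry the optimized parameters
    forward as the prior for the next task."""
    params = init_params                            # block 801
    history = []
    for task_data in tasks:                         # one pass per task
        params = train_one_task(params, task_data)  # blocks 802-806
        history.append(params)                      # optimized, reused as prior
    return params, history
```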
Fig. 9 illustrates an exemplary computing system, in accordance with various aspects of the present disclosure. The computing system may comprise at least one processor 910. The computing system may further comprise at least one storage device 920. It should be appreciated that the storage device 920 may store computer-executable instructions that, when executed, cause the processor 910 to perform any operations according to the embodiments of the present disclosure as described in connection with FIGs. 1-8.
The embodiments of the present disclosure may be embodied in a computer-readable medium such as a non-transitory computer-readable medium. The non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform a computer-implemented method by a neural network for continual learning of a series of tasks, wherein the neural network includes a plurality of sub-networks having the same architecture and respective parameter spaces. The method comprises inputting training data for a current task into the plurality of sub-networks in parallel, wherein parameters of the neural network are learned for previous tasks; generating a plurality of feature representations based on the input data by the plurality of sub-networks respectively; generating, by a shared output head of the plurality of sub-networks, a prediction related to the current task with an input of a weighted average of the plurality of feature representations; and updating the parameters of the neural network based on a loss function having constraints on the prediction related to the current task and selectively merging the parameters of the sub-network with the parameters learned for the previous tasks and the parameters expanded for the current task.
The non-transitory computer-readable medium may further comprise instructions that, when executed, cause one or more processors to perform any operations according to the embodiments of the present disclosure as described in connection with Figs. 1-8.
It should be appreciated that all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
It should also be appreciated that all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together.
The previous description is provided to enable any person skilled in the art to practice the various aspects described herein. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects. Thus, the claims are not intended to be limited to the aspects shown herein. All structural and functional equivalents to the elements of the various aspects described throughout the present disclosure that are known or later come to be known to those of ordinary skill in the art are expressly incorporated herein by reference and are intended to be encompassed by the claims.