WO2025046092A1 - Motion planning - Google Patents
- Publication number
- WO2025046092A1 (PCT application No. PCT/EP2024/074324)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- agent
- trajectories
- trajectory
- agents
- planner
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
- B60W60/0011—Planning or execution of driving tasks involving control alternatives for a single driving scenario, e.g. planning several paths to avoid obstacles
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W30/00—Purposes of road vehicle drive control systems not related to the control of a particular sub-unit, e.g. of systems using conjoint control of vehicle sub-units
- B60W30/08—Active safety systems predicting or avoiding probable or impending collision or attempting to minimise its consequences
- B60W30/095—Predicting travel path or likelihood of collision
- B60W30/0956—Predicting travel path or likelihood of collision the prediction being responsive to traffic or environmental parameters
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
- B60W60/0027—Planning or execution of driving tasks using trajectory prediction for other traffic participants
- B60W60/00274—Planning or execution of driving tasks using trajectory prediction for other traffic participants considering possible movement changes
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W60/00—Drive control systems specially adapted for autonomous road vehicles
- B60W60/001—Planning or execution of driving tasks
- B60W60/0027—Planning or execution of driving tasks using trajectory prediction for other traffic participants
- B60W60/00276—Planning or execution of driving tasks using trajectory prediction for other traffic participants for two or more other traffic participants
-
- B—PERFORMING OPERATIONS; TRANSPORTING
- B60—VEHICLES IN GENERAL
- B60W—CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
- B60W2554/00—Input parameters relating to objects
- B60W2554/40—Dynamic objects, e.g. animals, windblown objects
- B60W2554/404—Characteristics
- B60W2554/4049—Relationship among other objects, e.g. converging dynamic objects
-
- G—PHYSICS
- G01—MEASURING; TESTING
- G01C—MEASURING DISTANCES, LEVELS OR BEARINGS; SURVEYING; NAVIGATION; GYROSCOPIC INSTRUMENTS; PHOTOGRAMMETRY OR VIDEOGRAMMETRY
- G01C21/00—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00
- G01C21/26—Navigation; Navigational instruments not provided for in groups G01C1/00 - G01C19/00 specially adapted for navigation in a road network
- G01C21/34—Route searching; Route guidance
- G01C21/36—Input/output arrangements for on-board computers
- G01C21/3626—Details of the output of route guidance instructions
- G01C21/3658—Lane guidance
Definitions
- the present invention relates to prediction and planning methods for mobile agents. Particularly, but not exclusively, the present invention relates to a planning system and method for autonomous vehicle planning in interactive scenarios.
- An autonomous vehicle is a vehicle which is equipped with sensors and control systems which enable it to operate without a human controlling its behaviour.
- An autonomous vehicle is equipped with sensors which enable it to perceive its physical environment, such sensors including for example cameras, radar and lidar.
- Autonomous vehicles are equipped with suitably programmed computers which are capable of processing data received from the sensors and making safe and predictable decisions based on the context which has been perceived by the sensors.
- An autonomous vehicle may be fully autonomous (in that it is designed to operate with no human supervision or intervention, at least in certain circumstances) or semi-autonomous. Semi-autonomous systems require varying levels of human oversight and intervention.
- A “level 5” vehicle is one that can operate entirely autonomously in any circumstances, because it is always guaranteed to meet some minimum level of safety. Such a vehicle would not require manual controls (steering wheel, pedals etc.) at all.
- level 3 and level 4 vehicles can operate fully autonomously but only within certain defined circumstances (e.g. within geofenced areas).
- a level 3 vehicle must be equipped to autonomously handle any situation that requires an immediate response (such as emergency braking); however, a change in circumstances may trigger a “transition demand”, requiring a driver to take control of the vehicle within some limited timeframe.
- a level 4 vehicle has similar limitations; however, in the event the driver does not respond within the required timeframe, a level 4 vehicle must also be capable of autonomously implementing a “minimum risk maneuver” (MRM), i.e. some appropriate action(s) to bring the vehicle to safe conditions (e.g. slowing down and parking the vehicle).
- a level 2 vehicle requires the driver to be ready to intervene at any time, and it is the responsibility of the driver to intervene if the autonomous systems fail to respond properly at any time. With level 2 automation, it is the responsibility of the driver to determine when their intervention is required; for level 3 and level 4, this responsibility shifts to the vehicle’s autonomous systems and it is the vehicle that must alert the driver when intervention is required.
- Other mobile robots with no people on board belong to a class of mobile robot termed UAV (unmanned autonomous vehicle); autonomous air mobile robots (drones) are also being developed.
- Driving decisions in AVs and other mobile robots depend on a number of factors, including progress, safety and comfort.
- AV planners are supported by motion prediction, which reasons about the future goals and behaviours of other agents of a scenario in order to determine what actions are safe for an ego vehicle to take.
- a challenge in prediction and planning is accounting for interactivity between agents of the scenario, by anticipating how other agents will themselves account for possible future behaviour of other agents in their behaviour.
- International Patent Publication Nos. WO 2020/079066 and WO2021/073781 describe an inverse planning prediction methodology for predicting the goals and behaviours of other agents.
- Inverse planning is a prediction methodology that assumes a mobile agent (such as a vehicle or other road user) plans rationally towards some defined goal.
- a standard pipeline used to determine a planned trajectory for an autonomous vehicle (ego agent) in the presence of one or more other agents comprises a prediction component, which determines predicted future behaviour of the one or more other agents, and a planner, which determines a best trajectory for the ego agent based on the predictions for the other agents of the scenario and other scene context such as speed limit, traffic lights, static objects, etc.
- this type of pipeline makes a number of simplifying assumptions that can be limiting in highly interactive scenarios.
- This structure assumes that the predictions for the other agents are fixed in the sense that they are not influenced by a change in the ego’s planned trajectories. This also implies a priority in the planning computation. Since the ego agent plans its own trajectory based on the predicted actions of other agents, the other agents always have priority over the ego agent. The ego agent therefore would not take actions that could influence the other agents of the scene. This also limits the behaviour of the ego agent to a range of behaviours that conforms to the predicted behaviour of the other agents of the scene. Marginal predictions over trajectories account for all possible actions by all the agents of the scene, including the ego agent.
- the ego’s planning goal is only considered at the planning stage, which allows the search space to be reduced dramatically.
- Predicting joint trajectories (i.e. predicting both the ego agent and the other agents’ trajectories together) can reduce the search space to only those ego trajectories that are likely according to the prediction, but this locks the ego agent into performing certain actions.
- changes to the predictor component can have unforeseen effects on the planner when the components are implemented independently.
- the planning method described herein includes a confidence estimation that determines a confidence that the multi-agent planner’s policy for the other agents matches observed reality, based on a comparison of both the predicted policy for a given agent (i.e. single-agent policy) and the planned policy for that agent (i.e. the joint distribution) with a real observed state of the agent.
- when it is determined that the multi-agent planner is generating policies for agents that closely match the agents’ actual states, the adaptation of the policy for those agents by the multi-agent planner is greater, and when it is determined that the multi-agent planner’s policies are not closely matched with reality, the adaptation of the policy for those agents is smaller, and the policy remains closer to the policy generated by the single agent planner.
- the comparison of the observed state of the agent with the closest state along the respective trajectory may comprise computing a probability of at least one variable of the observed state according to a Gaussian distribution centred on the value of the corresponding variable of the closest state along the respective trajectory.
- the result of the comparison may be used to determine a likelihood term, wherein the planner confidence score is determined based on a multiplication of the likelihood term by a prior planner confidence value.
- the prior planner confidence value may be a predefined constant value for the first planning iteration of the plurality of planning iterations.
- the prior planner confidence value is based on the planner confidence score from the previous planning iteration.
- the planner confidence score for the ego agent may be set to a constant value.
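As a rough illustration, the sketch below shows how such a confidence score might be carried between planning iterations. The function names, the constant ego value and the default prior are assumptions for illustration only, not the patent's prescribed implementation:

```python
def update_planner_confidence(prior_confidence, likelihood, is_ego=False,
                              ego_confidence=1.0, initial_prior=0.5):
    """Planner confidence update for one agent at one planning iteration.

    likelihood:       result of comparing the observed state with the closest
                      state on the agent's previously planned trajectory
    prior_confidence: confidence score from the previous iteration, or None on
                      the first iteration, in which case a predefined constant
                      prior value is used instead
    The ego agent's score is simply held at a constant value.
    """
    if is_ego:
        return ego_confidence
    prior = initial_prior if prior_confidence is None else prior_confidence
    return likelihood * prior
```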
- the bundle of weighted trajectories may be generated by a hybrid trajectory generator comprising a data-driven motion profile prediction component and a classical trajectory generation component.
- the previous predicted trajectory may be selected as the most probable trajectory according to the predicted distribution over trajectories at the previous planning iteration.
- the estimated optimal trajectory for the ego agent may be selected as the most probable trajectory of the planned distribution over trajectories for the ego agent.
- the predicted distribution over trajectories may be generated based on a scene context, the scene context comprising a road layout, and a belief about the goals of each of the ego vehicle agent and the one or more further vehicle agents.
- the method may comprise updating, at each planning iteration, the belief about the goals of each of the ego vehicle agent and the one or more further vehicle agents based on the observed state of each vehicle agent.
- the adaptation of the predicted distribution over trajectories may comprise performing an iterative update of a set of weights to be applied to the predicted distribution over trajectories; wherein at each step of the iterative weight update, the weights of each agent are updated in turn, and wherein the order in which the weight updates are applied are based on a priority of the agents within a road layout in which the vehicle agents are driving.
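A minimal sketch of such an iterative, priority-ordered weight update follows. The multiplicative exponential update rule, the step count and all names are illustrative assumptions rather than the specific rule used by the patent:

```python
import numpy as np

def iterative_weight_update(weights, evaluate, agent_order, n_steps=10):
    """Iteratively re-weight each agent's trajectory bundle in turn.

    weights:     dict agent_id -> 1-D array of trajectory weights (sums to 1)
    evaluate:    callable (agent_id, weights) -> per-trajectory scores for that
                 agent, accounting for interactions with the other agents'
                 currently weighted trajectories
    agent_order: list of agent ids; agents updated earlier effectively have
                 lower priority, since they accommodate the later ones
    """
    weights = {a: np.asarray(w, dtype=float).copy() for a, w in weights.items()}
    for _ in range(n_steps):
        for agent in agent_order:                  # agents updated in turn
            scores = np.asarray(evaluate(agent, weights), dtype=float)
            new_w = weights[agent] * np.exp(scores)   # illustrative update rule
            weights[agent] = new_w / new_w.sum()
            # later agents in the order see this agent's updated distribution
    return weights
```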
- a further aspect provides a computer system comprising one or more processors configured to implement any of the methods described herein.
- a further aspect provides a computer program for programming a computer system to implement any of the methods described herein.
- Figure 4 shows an example set of paths from a starting position as provided by an environment model of a planner
- Figure 5 shows three alternative methods of representing collision boxes for agents
- Figure 9 shows a one-dimensional example of a distribution over trajectories before and after applying an iterative best response algorithm
- Figure 10 shows an example comparison of an observed trajectory with best trajectories generated by prediction and planning components
- Figure 11 shows a road layout and set of agents which could be used in a merging or lane change scenario.
- the planning methods described herein may be implemented within a planning component of an autonomous vehicle stack.
- a typical autonomous vehicle stack comprises multiple components implemented on a computer system, including perception stack, prediction stack, planner and controller.
- the separation of prediction and planning stacks can lead to problems in planning for the ego vehicle in the presence of other agents due to the implicit prioritisation of other agents over the ego vehicle, and the restriction of ego behaviour based on predictions.
- the methods described herein can be implemented within a planning component of the autonomous vehicle stack that implements both prediction and planning within a single iterative process, described further herein.
- the planning stack 106 may comprise sub-components referred to herein as ‘prediction’ and ‘planning’ sub-components, although these functions are not strictly separated by component.
- a prediction sub-component provides an initial distribution over agent trajectories for the various agents present in the scene, while the ‘planning’ subcomponent updates this initial distribution for both the ego vehicle and the other agents of the scene based on the possible interactions between the different trajectories (therefore effectively performing both planning and prediction).
- the autonomous vehicle stack comprises a perception stack, a planning stack (which implements both prediction and planning functions) and a controller, which implements control signals on the ego vehicle to effect the planned trajectory generated by the planner.
- Figure 1 shows an example autonomous vehicle stack having a planning component that performs an integrated prediction and planning process to generate a planned trajectory for the ego vehicle.
- the perception stack 102 receives sensor outputs from an on-board sensor system 110 of the AV.
- the on-board sensor system 110 can take different forms but generally comprises a variety of sensors such as image capture devices (cameras/optical sensors), LiDAR and/or RADAR unit(s), satellite-positioning sensor(s) (GPS etc.), motion sensor(s) (accelerometers, gyroscopes etc.) etc., which collectively provide rich sensor data from which it is possible to extract detailed information about the surrounding environment and the state of the AV and any external actors (vehicles, pedestrians, cyclists etc.) within that environment.
- the sensor outputs typically comprise sensor data of multiple sensor modalities such as stereo images from one or more stereo optical sensors, LiDAR, RADAR etc.
- Stereo imaging may be used to collect dense depth data, with LiDAR/RADAR etc. providing potentially more accurate but less dense depth data. More generally, depth data collection from multiple sensor modalities may be combined in a way that respects their respective levels of uncertainty (e.g. using Bayesian or non-Bayesian processing or some other statistical process etc.). Multiple stereo pairs of optical sensors may be located around the vehicle e.g. to provide full 360° depth perception. This provides a much richer source of information than is used in conventional cruise control systems.
- the perception stack 102 comprises multiple perception components which co-operate to interpret the sensor outputs and thereby provide perception outputs to the planning stack 106.
- the goal could for example be to enter a roundabout and leave it at a desired exit; to overtake a vehicle in front; or to stay in a current lane at a target speed (lane following).
- the goal may, for example, be determined by an autonomous route planner (not shown). Goals, maneuvers, and trajectories are described in more detail below, with reference to Figure 3.
- the controller 108 executes the decisions taken by the planner 106 by providing suitable control signals to on-board actuators 112 such as motors of the AV.
- the controller 108 controls the actuators in order to control the autonomous vehicle to follow a trajectory computed by the planner 106.
- the planner 106 plans over acceleration (magnitude) and steering angle control actions simultaneously, which are mapped to a corresponding trajectory by modelling the response of the vehicle to those control actions. This allows constraints to be imposed both on the control actions (such as limiting acceleration and steering angle) and the trajectory (such as collision-avoidance constraints), and ensures that the final trajectories produced are dynamically realisable.
- the planner 106 will determine an optimal trajectory and a corresponding sequence of control actions that would result in the optimal trajectory according to whatever vehicle dynamics model is being applied.
- the control actions determined by the planner 106 will not necessarily be in a form that can be applied directly by the controller 108 (they may or may not be).
- control data is used herein to mean any trajectory information derived from one or both of the planned trajectory and the corresponding series of control actions that can be used by the controller 108 to realize the planner’s chosen trajectory.
- the controller 108 may take the trajectory computed by the planner 106 and determine its own control strategy for realizing that trajectory, rather than operating on the planner’s determined control actions directly (in that event, the controller’s control strategy will generally mirror the control actions determined by the planner, but need not do so exactly).
- the planner 106 addresses the planning problem of determining an optimal trajectory (and/or corresponding control actions) for an ego vehicle in a given scenario.
- goals 802 refer to a goal location that a given agent intends to reach. These are represented by large areas of a road layout, such as lanes or portions of lanes, that define the eventual region that the agent wants to reach, but this does not restrict the agent’s overall intention to any specific location within the road map.
- Each goal 802 is represented by a polygon in Figure 2. For example, if the intent of the agent is to change lane, this could be achieved immediately, such that the agent trajectory crosses into the neighbouring lane quickly, or after some initial time period continuing in its current lane. In either case, the same goal 802 is represented by a region extending along the lane, and is achieved once the agent reaches that region of the road layout.
- Maneuvers 804 are high level agent behaviours which are defined based on the road layout. Examples include ‘follow path’, ‘change path left’ and ‘change path right’. As described further herein, the road layout can be represented by a connected graph of paths, and maneuvers cover all possible layouts, including junctions, roundabouts, merges and splits. Given a current state of an agent 200, the possible maneuvers of the agent can be enumerated, with each maneuver being attached to a path that the agent can follow from its current position, or change left or right before following it. This set of maneuvers 804 provides a high-level description of the behaviours that an agent can have irrespective of its goal. Maneuvers are represented in Figure 2 by dotted midlines along each lane.
- a trajectory 806 is the low-level execution of a maneuver, i.e. the path and motion taken by the agent to achieve the maneuver.
- for a ‘follow path’ maneuver, this is the path and motion profile of the agent along its current path
- for a ‘change path’ maneuver, this is the path and motion profile taken by the agent to change path, which could be fast or slow, sharp or wide, etc.
- while the space of goals and the space of maneuvers are each discrete spaces, in that, for a given state of an agent, a discrete number of possible goals and a discrete number of possible maneuvers can be defined, the space of trajectories is continuous.
- Trajectories are represented in Figure 2 by the curves starting from the agent 200 and executing different respective maneuvers to either follow the current path or to change to the right or left.
- the algorithms described below for planning and predicting trajectories for different agents use distributions over trajectories to represent maneuvers.
- a trajectory is defined by a sequence of spatial coordinates, orientations, and velocities (x_t, y_t, θ_t, v_t) of the agent over time defining the motion of the agent. Additional motion variables such as acceleration, or jerk (rate of change of acceleration) may also be included as part of the trajectory.
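By way of example, a trajectory of this form could be held in a simple structure such as the following sketch (the class and field names are illustrative only, not taken from the patent):

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Trajectory:
    """Sample-based trajectory: one row per timestep, columns x, y (position),
    theta (heading) and v (speed); acceleration or jerk columns could be
    appended in the same way."""
    states: np.ndarray      # shape (T, 4): x_t, y_t, theta_t, v_t
    times: np.ndarray       # shape (T,): timestamp of each state

    def closest_state(self, t: float) -> np.ndarray:
        """Return the state closest in time to t (used e.g. when comparing an
        observed state against a previously planned trajectory)."""
        return self.states[int(np.argmin(np.abs(self.times - t)))]
```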
- the prediction sub-component 204 of the planner is configured to generate an initial ‘predicted’ distribution over trajectories based on the scene context.
- the prediction sub-component may be implemented as a ‘hybrid’ trajectory generator which uses a data-driven motion profile generator to determine a distribution over motion profiles and a classical trajectory generator which uses this distribution over motion profiles to generate a bundle of individual trajectories.
- any kind of trajectory generator configured to generate an initial set or ‘bundle’ of trajectories, representative of a predicted distribution over trajectories, can be used.
- a bundle of trajectories is generated over all maneuvers and goals for the given agent.
- the bundle of trajectories is a discrete subset of possible trajectories according to an underlying probability distribution, it may also be referred to herein as a ‘distribution’, since it provides a sample-based representation of the underlying distribution.
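The following sketch illustrates how a bundle of this kind could be drawn from a data-driven motion-profile distribution and turned into trajectories by a classical generator. The GMM parameterisation, sample count and function names are assumptions for illustration, not the patent's specific interface:

```python
import numpy as np

def sample_trajectory_bundle(gmm_means, gmm_stds, gmm_weights, path_follower,
                             n_samples=20, rng=None):
    """Draw speed profiles from a Gaussian mixture over motion profiles and
    turn each sample into a trajectory using a classical generator.

    gmm_means, gmm_stds: (K, horizon) per-timestep speed means / std devs
    gmm_weights:         (K,) mixture weights summing to 1
    path_follower:       callable speed_profile -> trajectory states
                         (e.g. a pure-pursuit generator)
    Returns a list of trajectories with uniform initial weights, i.e. a
    sample-based representation of the predicted distribution.
    """
    rng = rng or np.random.default_rng()
    bundle = []
    for _ in range(n_samples):
        k = rng.choice(len(gmm_weights), p=gmm_weights)   # pick a mixture mode
        speeds = np.clip(rng.normal(gmm_means[k], gmm_stds[k]), 0.0, None)
        bundle.append(path_follower(speeds))
    return bundle, np.full(n_samples, 1.0 / n_samples)
```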
- the bundle of trajectories generated by the trajectory generator of the prediction component 204 are provided to an adaptation component 206 as a representation of the predicted distribution for each agent at a given iteration of the planner corresponding to a time t.
- the adaptation component 206 is configured to update the weighting of the bundle of trajectories associated with each agent according to an adaptation rule.
- the purpose of the adaptation component is to assess the predicted trajectories generated by the predictor 204 against each other, taking possible interactions between trajectories into account, and to determine a new distribution over trajectories for each agent that represents the probability of the given agent taking each possible trajectory (of the previously generated bundle of trajectories), taking possible future interactions into account.
- the adaptation updates the weights assigned to each trajectory based on the ‘goodness’ of the agent’s trajectories, given a belief about its possible goals, and the scene context. This is achieved using an algorithm to iteratively adjust the weights of the trajectory according to an update rule.
- Two alternative choices of algorithm, Multi-agent Distributional Upper Confidence Bounds (MAD-UCB) and Bayesian Iterative Best Response (BIBeR), are described in further detail below.
- the distribution for each agent is updated in turn, and once the distribution over trajectories is updated for a given agent, it is the updated distribution that is used to evaluate the possible trajectories and compute an updated weighting for subsequent agents.
- priority of agents is implicitly enforced.
- the earliest agents to be updated are the lowest priority, since their updated distributions over trajectories give consideration to the predictions of all later agents when computing a planned trajectory.
- the distributions for each agent could be fixed for all agents, and the update performed simultaneously for all agents at the end of a given planning iteration.
- planned trajectory is used generally herein to refer to a trajectory taken from an adapted distribution over trajectories, irrespective of whether this distribution is associated with an ego agent or an external agent. This term is used herein to distinguish the updated distribution over trajectories from the initial distribution over trajectories determined by the prediction component 204. However, only those trajectories generated for an ego agent, and for which control signals are subsequently generated in order to implement the given trajectory, are really ‘planned’ trajectories corresponding to those generated by a planning component of an autonomous vehicle stack. Adapted distributions over trajectories for all other agents, i.e. external agents, in effect serve as interaction-aware predictions of those agents’ behaviour.
- the planner shown in Figure 3 is a confidence-based multi-agent planner.
- the update performed by the adaptation component to generate planned trajectories for each agent is based on a computed confidence that the planning outputs are close to observed reality.
- a multi-agent planner that takes into account interactions between agents can be assertive, which is advantageous in that good progress can be made by the ego vehicle based on the planner outputs, for example by pro-actively starting a merge maneuver where the ego vehicle is reasonably confident that the other agents of the scenario will allow the ego to merge.
- this can also lead to dangerous situations when the actual behaviour of the other agents of the scenario does not match the predicted behaviour of those agents according to the planner.
- the ‘single-agent’ planner corresponding to the conventional separate prediction/planning architecture described above, in which predictions are made for each external agent and the ego trajectory planned based on those static predictions, is more conservative.
- the MPP Environment provides the means to represent agents, their footprints and states, as well as to compute collisions between agents, or combine some of the layout queries with the agent representation (e.g. longitudinal distance between agents on midline of a lane/path).
- Figure 5 shows three alternatives for approximating the collision boxes of the agents using a set of circles:
- the algorithms presented herein make use of a predefined hierarchy, where the macro-actions are maneuvers.
- This is an abstract description of the behaviour that an agent can exhibit, which allows the integration of a semantic interpretation of the road layout into the algorithms described below. It also grounds the algorithms to the map and allows reasoning about legal and/or acceptable behaviours according to the highway code. For example, it is illegal to change path if one would cross a double solid line separator. It is straightforward to implement this behaviour in the planner by excluding any maneuver that would cross a double solid line separator.
- the planner 106 is provided with a maneuver generator configured to use the MPP Environment queries to enumerate the applicable maneuvers given the current state of an agent, and to generate the goals reachable by an agent given its current state.
- the maneuver generator is also configured to use the data-driven motion profile predictors and trajectory generator(s) described in more detail below to generate trajectories for each agent according to the selected maneuvers.
- the prediction component 204 of the confidence-based planner 106 is preferably implemented with a trajectory generator that satisfies the following requirements:
- the predictor should take the scene context (map, other agents, static objects, traffic lights etc.) into account and perform marginal predictions for the other agents.
- the predictor should be able to perform predictions that are conditioned on the maneuvers of the other agents (and optionally on the goals of the other agents). Note that while conditioning on maneuvers is required for the MAD-UCB algorithm described above, the BIBeR algorithm described below does not require conditioning on maneuvers as it does not use a hierarchy of maneuvers and trajectories.
- This motion profile prediction network is described in more detail in Antonello, Morris, et al. "Flash: Fast and light motion prediction for autonomous driving with Bayesian inverse planning and learned motion profiles.” 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, as well as International Patent Publication No. WO2023/135271, each of which is incorporated by reference herein in its entirety.
- the output of the motion profile prediction is a Gaussian Mixture Model, i.e. a distribution over future speeds which can be used as inputs to a classical trajectory generator, such as a pure pursuit generator. Pure pursuit is a path and motion profile tracking algorithm that uses a controller and a vehicle model to generate a trajectory that minimises the error from the target motion profile and the target path.
- While a pure pursuit trajectory generator is used in the example hybrid trajectory generator described above, various types of classical trajectory generators can be used to generate an initial bundle of trajectories.
- Other examples of classical trajectory generators include a Bezier spline base generator, which decomposes the path and motion profile generation, and a quintic polynomial generator, which decomposes the trajectory generation on the x and y dimensions.
- Pure pursuit has advantages over other classical trajectory generation methods. Pure pursuit has a relatively simple implementation, following a path according to a kinematic vehicle model (in the present example, a simple bicycle model). Pure pursuit also uses a ‘lookahead’ parameter which dictates how far ahead the algorithm looks when selecting a point on the target path to follow. Higher lookahead values result in smoother trajectories. In lane change scenarios, for example, this value can dictate whether the lane change is sharp or smooth. Another advantage of pure pursuit is that this is the only parameter, and that it is intuitive to tune.
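A minimal single-step sketch of pure pursuit with a kinematic bicycle model is shown below. The wheelbase, timestep and target-point selection details are assumptions for illustration and are not taken from the patent:

```python
import numpy as np

def pure_pursuit_step(state, path, lookahead, wheelbase=2.7, dt=0.1):
    """One pure-pursuit update for a kinematic bicycle model.

    state: (x, y, theta, v); path: (N, 2) array of target path points.
    Selects the first path point roughly `lookahead` metres ahead, computes
    the pure-pursuit steering angle towards it, and integrates the bicycle
    model for one timestep.
    """
    x, y, theta, v = state
    dists = np.hypot(path[:, 0] - x, path[:, 1] - y)
    ahead = dists >= lookahead
    idx = int(np.argmax(ahead)) if np.any(ahead) else len(path) - 1
    tx, ty = path[idx]
    # angle to the target point relative to the current heading
    alpha = np.arctan2(ty - y, tx - x) - theta
    # pure-pursuit steering law
    steer = np.arctan2(2.0 * wheelbase * np.sin(alpha), lookahead)
    # kinematic bicycle integration for one timestep
    x += v * np.cos(theta) * dt
    y += v * np.sin(theta) * dt
    theta += v / wheelbase * np.tan(steer) * dt
    return np.array([x, y, theta, v])
```

A higher `lookahead` value makes the generated trajectory smoother, which is the behaviour described above for lane changes.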
- the hybrid trajectory generator referred to in the earlier application uses Bayesian inference to inform the goal and path extraction step of prediction.
- the present implementation also uses Bayesian inference to update beliefs about both agent goals and the planner confidence.
- the Bayesian inference step is implemented at each planning iteration based on the overall output of the planner at the previous iteration as well as the scene context, as will be described in further detail below, with reference to Figure 7.
- Figure 6 shows an example output prediction (dashed and dotted lines) of the combination of motion profile prediction with a pure pursuit trajectory generator, described in more detail below.
- the solid line represents a ground truth trajectory in the NGSIM dataset which clearly violates both kinematic constraints and motion profile constraints (i.e. acceleration and jerk limits).
- the predicted trajectories (dashed and dotted lines) are smooth and feasible without impacting the predictor’s ability to select the correct behaviour (shown as a dotted line).
- the prediction part of the algorithm uses the plans of the other agents and the real world observations to infer certain aspects regarding the other agents, such as their goal (intent), and/or their speed or comfort preferences. These beliefs are then used within the planning part of the algorithm to choose a next trajectory for the ego vehicle.
- the two algorithms attempt to infer the goal g_t^i ∈ G^i and future trajectory τ_t^i ∈ T^i for agent i at time t.
- Each algorithm models the beliefs about the agent goals as a discrete distribution over goals P_t(G^i).
- the ego trajectory is selected from such a distribution over its trajectories.
- these algorithms build a flat tree at time t over maneuvers for each agent i based on the graph of the paths in the layout. Both algorithms are composed of three main components:
- a low-level trajectory generator that provides an initial distribution over the continuous space of possible trajectories T_t^i given the scene context C_t (for example, a map defining a road layout, other agents' states, etc.). This trajectory generator is also conditioned on the maneuvers M_t^i. Therefore the initial distribution is P_t(T_t^i | M_t^i, C_t).
- the low-level trajectory generator combines a mixture density network that predicts GMMs over motion profiles with a classical trajectory generator (pure pursuit) to produce a sample-based representation of the distribution over trajectories.
- a high-level planning component corresponding to the adaptation component 206 of Figure 3, in which the two algorithms differ.
- MAD-UCB combines upper confidence bound (UCB) for sampling maneuvers with an Iterative Best Response algorithm for estimating the goodness of an agent’s maneuvers M_t^i and its trajectories T_t^i given its possible goals G_t^i and current context C_t.
- BIBeR doesn’t make use of the hierarchy of goals and maneuvers, and it skips the UCB calculation, but uses the same Iterative Best Response algorithm over all trajectories irrespective of which maneuver these belong to.
- the algorithm can return a joint distribution of these three variables (G_t^i, T_t^i, M_t^i), or marginal distributions of each.
- Figure 7 shows a simplified block diagram of the steps carried out by the algorithm.
- the goal belief and planner confidence are updated at step S702 based on the output of the previous planning step as well as the observed scene context.
- a bundle of weighted trajectories is then generated at step S704 by the planner.
- the trajectories are output in the form of a tree for each goal.
- the initial state, maneuvers and trajectories are the same for each tree, but the trajectory weights are different for each tree.
- a maneuver statistic is also computed which differs for each tree.
- An example set of trees is described with reference to Figure 8 for a simple case of four goals and two maneuvers.
- the current belief and generated bundle of trajectories are combined to return an output in the required format (S706).
- the output could be in the form of a joint distribution or a marginal distribution, based on the weighted trajectories output by the planner.
- Other post-processing steps can be applied to generate an output in any required format.
- a first example algorithm that can be used by the adaptation component 206 of the planner 106 to generate an updated distribution over trajectories is referred to herein as ‘Multi-agent Distributional Upper Confidence Bounds’ (MAD-UCB). Pseudocode of the MAD-UCB algorithm for implementing the proposed planning method is provided below.
- Run_planner (road layout, scene context, current belief)
- the behavioural planner must integrate the ego’s current belief of what the other agents’ goals could be. This is achieved by storing a tree for each goal, for each agent. The distribution is estimated using unweighted samples from the current belief; see the step of sampling goals in the algorithm above.
- the main while-loop of the algorithm combines UCB over maneuvers statistics with a sampling-based iterative best response approach for adjusting the distribution over trajectories and learning the maneuver statistics.
- This algorithm uses a set of predefined trajectories as samples, as generated by the low-level trajectory generator of the prediction component 204, described earlier. As a result, the generation of all trajectories, checking of collisions and distance violations, and checking of whether trajectories reach goals is performed before the main loop of the algorithm.
- the main loop of the algorithm uses this information to iteratively update only the weights representing the value of each trajectory for each agent. It uses the following equations to perform the update.
- the weight update is applied only to trajectories corresponding to a subset of sampled goals. In this way, the beliefs about goals are taken into account in the planning algorithm.
- a possible alternative is not to sample goals, and instead perform a weighted update. In this case, the trajectories are replicated for each goal.
- the weights for each trajectory can be initialised with the goal probabilities, or the goal probabilities may be incorporated in the trajectory evaluation step, in order to take into account the distribution over goals without sampling goals directly.
- Equation 2 defines the value of a single trajectory n by using the weights of all trajectories of the other agents across all their maneuvers and goals (where, again, goal and maneuver indices are not shown for readability’s sake). This value can be considered a ‘return’ or reward associated with the given trajectories.
- the full return can be estimated by using the full set of trajectory weights across all maneuvers and goals of the other agents. Alternatively, goals and maneuvers can also be sampled for the other agents as well.
- the Iterative Best Response algorithm iteratively updates the distribution for each agent and uses the updated policies of the agents in the same iteration as it updates the remaining ones.
- the order in which agents are updated provides an implicit ordering of the priority of agents in the scenario.
- a distribution can be recovered at iteration k using the initial distribution over trajectories: a distribution P_k(T_n^i) is computed for each goal and maneuver, though the goal and maneuver indices are not included here for readability.
- the step of computing an updated distribution over trajectories is only performed after the final iteration of the weight update described above, i.e. once the while loop of the above algorithm has completed. This step is needed as the original set of samples (i.e. ‘bundle’ of trajectories) is unweighted, and the aim of the algorithm is to return a likelihood for each trajectory.
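The role of these steps can be sketched as follows. The pairwise-reward averaging and the softmax form of the recovered distribution, along with all names, are assumptions used purely to illustrate the flow from a fixed, unweighted set of trajectory samples to a likelihood per trajectory:

```python
import numpy as np

def trajectory_return(agent, traj_idx, pairwise_reward, other_weights):
    """Estimate the return of trajectory `traj_idx` for `agent` by averaging a
    pairwise reward against every other agent's trajectories under their
    current weights (a sketch of the role played by Equation 2)."""
    value = 0.0
    for other, w in other_weights.items():
        rewards = np.array([pairwise_reward(agent, traj_idx, other, j)
                            for j in range(len(w))])
        value += float(np.dot(w, rewards))
    return value

def recover_distribution(initial_probs, learned_values, temperature=1.0):
    """After the final weight-update iteration, turn the learned per-trajectory
    values back into a likelihood for each trajectory of the originally
    unweighted bundle (softmax re-weighting of the initial distribution)."""
    logits = (np.log(np.asarray(initial_probs) + 1e-12)
              + np.asarray(learned_values) / temperature)
    logits -= logits.max()
    probs = np.exp(logits)
    return probs / probs.sum()
```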
- Select maneuvers with UCB:
- the mean of the weights can be used to update the maneuver value and increment the number of visits to the maneuver.
- Other statistics for the maneuver value can be used, including risk metrics.
- Upper confidence bound (UCB) is a reasonable fit in this case, as the distribution of the value statistics would be Gaussian if the samples were independent and identically distributed (i.i.d.) (Central Limit Theorem).
- the UCB of a maneuver m is defined as the sum of Q(m) and an exploration bonus that decreases with the number of times m has been selected, where Q(m) represents the estimated value of the maneuver (i.e. an aggregate, such as the mean, of the weights of the trajectories belonging to that maneuver).
- the upper confidence bound term encourages exploration of the available maneuvers as a function of the extent to which those maneuvers have been selected previously. As more maneuvers are explored, the uncertainty associated with those maneuvers decreases and the algorithm can more confidently select a maneuver based on the value Q( ).
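For illustration, a maneuver selection step of this kind might look like the following. The standard UCB1-style bound is used here as an assumption about the exact form of the exploration term:

```python
import numpy as np

def ucb_select(values, visits, c=1.0):
    """Select the next maneuver index by upper confidence bound: the estimated
    maneuver value Q(m) plus an exploration bonus that shrinks as the maneuver
    is visited more often."""
    values = np.asarray(values, dtype=float)
    visits = np.asarray(visits, dtype=float)
    total_visits = visits.sum() + 1.0
    bonus = c * np.sqrt(np.log(total_visits) / (visits + 1e-9))
    return int(np.argmax(values + bonus))
```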
- samples are not necessarily i.i.d and some tuning may be required to define an aggregate measure for the maneuver.
- the step of selecting maneuvers in the above pseudocode is likely to help when there are a large number of agents as it will guide the process towards the most promising part of the space and avoid spending resources on the unlikely behaviours from the other agents.
- this step can be optional and one can simply update all trajectories for an agent given all trajectories for the other agents (this is done in the BiBER algorithm, described further below).
- the two algorithms can be empirically evaluated against each other for a given planning application to determine which to use.
- a single trajectory may be selected for all available goals and maneuvers.
- a Normal distribution can be used to capture the similarity between the trajectories generated during planning at the previous timestep, for which the poses x, y, θ are retrieved, and the observed pose. Note that while the poses are taken from a planned trajectory that was generated at a previous timestep, the poses themselves correspond to a timestep k corresponding to the current observation.
- the yaw similarity can be computed using a normal distribution with 0 mean and the circular difference between the observed yaw and the planned yaw.
- x and y should be computed in the agent frame of reference.
- the σ parameters for each of these can be different and should be carefully tuned, since they can impact the magnitude of the belief update on every step. The σ parameters can be determined empirically by running a search over a large set of simulated scenarios and observing the results of the scenarios for different values of σ.
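A sketch of this pose-similarity computation is given below. The rotation into the agent frame and the placeholder σ values are illustrative assumptions; in practice the σ values would be tuned as described above:

```python
import numpy as np

def pose_similarity(observed_pose, planned_pose, sigma_x=0.5, sigma_y=0.5,
                    sigma_yaw=0.1):
    """Likelihood of the observed pose under Gaussians centred on the pose
    taken from the previously planned trajectory at the same timestep.
    x and y errors are expressed in the agent frame of reference, and the yaw
    error uses the circular difference."""
    ox, oy, oyaw = observed_pose
    px, py, pyaw = planned_pose
    # rotate the world-frame position error into the planned agent frame
    dx, dy = ox - px, oy - py
    ex = np.cos(pyaw) * dx + np.sin(pyaw) * dy
    ey = -np.sin(pyaw) * dx + np.cos(pyaw) * dy
    # circular yaw difference wrapped to [-pi, pi]
    dyaw = (oyaw - pyaw + np.pi) % (2 * np.pi) - np.pi

    def gauss(err, sigma):
        return np.exp(-0.5 * (err / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

    return float(gauss(ex, sigma_x) * gauss(ey, sigma_y) * gauss(dyaw, sigma_yaw))
```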
- Although the planner generates a trajectory as a sequence of states extending a certain time period into the future, only a portion of this trajectory ends up being followed by the ego agent before the planner is called again, and a new planning iteration is run to generate a new trajectory, based on the updated state of the ego agent as well as the updated states of all other agents of the scenario.
- the methods above can be first evaluated in simulation, on synthetic scenarios, or scenarios selected from driving/research datasets, and on a controlled test track before being evaluated on public routes.
- BIBeR as described above, including confidence estimation, and with parameters as previously described. This is referred to as BIBeR-C and is the proposed approach.
- BIBeR without updating the policies of the other agents; this is achieved by setting the P_c used in confidence estimation to 0.0 for all agents except for the ego. This is a single-agent planner that doesn’t consider the impact of the ego policy on the other agents’ policies. This is referred to as BIBeR-SA.
- the order of the updates represents priority: the earlier in the list an agent is the less priority it has as it needs to accommodate for the other agents.
- BIBeR-R: a conditional prediction method in which the prediction takes into account the ego’s policy.
- a series of gap-creation scenarios having different gap sizes and non-ego behaviours as described above, and the resulting behaviour of the ego vehicle according to each of the above planner implementations will now be described.
- the evaluation metrics for the results shown in the tables below include a percentage of scenarios that were completed (i.e. reached the ego goal), a time taken to reach the ego goal, a percentage of scenarios in which collisions occurred, fluctuations in speed, fluctuations in path, and an average minimum distance maintained between agents of the scenario (averaged over all the scenario instances run in simulation for the given planner).
- the multi-agent planner with reverse updates acts dangerously and results in even more collisions than with fixed traces.
- the minimum distance metric is higher given the shorter traces ending in collision and the averaging over-weighting the initial states. Note that the time to goal has decreased as the agent is only able to merge in the simpler cases, which are also the cases which can be completed quickly.
- Table 3 Results of planner evaluation for small gap where rear agent quickly closes gap.
- Table 5 shows the results of planner evaluation for a medium gap, where the rear agent is slowly closing the gap.
- Table 5 Results of planner evaluation for medium gap and agent slowly closing gap.
- Table 6 shows the results of the planner evaluation for a medium gap where the agent is quickly closing the gap (i.e. behaving aggressively).
- Table 6 Results of planner evaluation for medium gap and agent quickly closing gap.
- the single agent planner can no longer finish the scenario at even the slightest aggressive hint from the other agent, despite sufficient space being available.
- the low distance that the SA agent keeps to the other agents is due to it staying side by side to the rear agent that is slowly closing the gap.
- the multi-agent planners show comparable performance in these simple scenarios. Table 9 shows ... a similar story.
- the only interesting point to highlight here is the fact that the reverse update method results in collisions even when the ego has enough space to merge safely. This highlights the high and incorrect confidence of this method in the others giving way to ego. Note the low time to completion for the reverse update method which also indicates its incorrect aggressiveness.
- references herein to agents, vehicles, robots and the like include real-world entities but also simulated entities.
- the techniques described herein have application on a real-world vehicle, but also in simulation-based autonomous vehicle testing.
- the stack 100 may be used to plan ego trajectories/plans to be used as a benchmark for other AV stacks.
- software of the stack may be tested on a “generic” off-board computer system, before it is eventually uploaded to an on-board computer system of a physical vehicle.
- in “hardware-in-the-loop” testing, the testing may extend to the underlying hardware of the vehicle itself.
- the stack software may be run on the on-board computer system (or a replica thereof) that is coupled to the simulator for the purpose of testing.
- the stack under test extends to the underlying computer hardware of the vehicle.
- for certain functions of the stack (e.g. perception functions), hardware-in-the-loop testing could involve feeding synthetic sensor data to dedicated hardware perception components.
- a computer system comprises execution hardware which may be configured to execute the method/algorithmic steps disclosed herein and/or to implement a model trained using the present techniques.
- execution hardware encompasses any form/combination of hardware configured to execute the relevant method/algorithmic steps.
- the execution hardware may take the form of one or more processors, which may be programmable or non-programmable, or a combination of programmable and non-programmable hardware may be used. Examples of suitable programmable processors include general purpose processors based on an instruction set architecture, such as CPUs, GPUs/accelerator processors etc.
- Such general-purpose processors typically execute computer readable instructions held in memory coupled to or internal to the processor and carry out the relevant steps in accordance with those instructions.
- Other forms of programmable processors include field programmable gate arrays (FPGAs) having a circuit configuration programmable through circuit description code.
- Examples of non-programmable processors include application specific integrated circuits (ASICs). Code, instructions etc. may be stored as appropriate on transitory or non-transitory media (examples of the latter including solid state, magnetic and optical storage device(s) and the like).
- the subsystems 102-108 of the runtime stack of Figure 1 may be implemented in programmable or dedicated processor(s), or a combination of both, on-board a vehicle or in an off-board computer system in the context of testing and the like.
Landscapes
- Engineering & Computer Science (AREA)
- Automation & Control Theory (AREA)
- Transportation (AREA)
- Mechanical Engineering (AREA)
- Human Computer Interaction (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
A computer-implemented method of planning a trajectory for an ego agent in the presence of one or more further agents, the method comprising: receiving an observed state of each agent; and, for each agent, generating a predicted distribution over trajectories and adapting the predicted distribution over trajectories to generate a planned distribution over trajectories, the adaptation based on a planner confidence score and a trajectory evaluation score; and selecting an estimated optimal trajectory from the planned distribution over trajectories for the ego agent; wherein the trajectory evaluation score is based on the interaction of each of a set of possible trajectories of that agent with one or more possible trajectories of each other agent; and wherein the planner confidence score is based on a comparison of the observed state of the agent with a previous estimated optimal trajectory of the agent determined by the planner at a previous planning iteration.
Description
MOTION PLANNING
Technical Field
The present invention relates to prediction and planning methods for mobile agents. Particularly, but not exclusively, the present invention relates to a planning system and method for autonomous vehicle planning in interactive scenarios.
Background
There have been major and rapid developments in the field of autonomous vehicles. An autonomous vehicle (AV) is a vehicle which is equipped with sensors and control systems which enable it to operate without a human controlling its behaviour. An autonomous vehicle is equipped with sensors which enable it to perceive its physical environment, such sensors including for example cameras, radar and lidar. Autonomous vehicles are equipped with suitably programmed computers which are capable of processing data received from the sensors and making safe and predictable decisions based on the context which has been perceived by the sensors. An autonomous vehicle may be fully autonomous (in that it is designed to operate with no human supervision or intervention, at least in certain circumstances) or semi-autonomous. Semi-autonomous systems require varying levels of human oversight and intervention. An Advanced Driver Assist System (ADAS) and certain levels of Autonomous Driving System (ADS) may be classed as semi-autonomous. A “level 5” vehicle is one that can operate entirely autonomously in any circumstances, because it is always guaranteed to meet some minimum level of safety. Such a vehicle would not require manual controls (steering wheel, pedals etc.) at all. By contrast, level 3 and level 4 vehicles can operate fully autonomously but only within certain defined circumstances (e.g. within geofenced areas). A level 3 vehicle must be equipped to autonomously handle any situation that requires an immediate response (such as emergency braking); however, a change in circumstances may trigger a “transition demand”, requiring a driver to take control of the vehicle within some limited timeframe. A level 4 vehicle has similar limitations; however, in the event the driver does not respond within the required timeframe, a level 4 vehicle must also be capable of autonomously implementing a “minimum risk maneuver” (MRM), i.e. some appropriate action(s) to bring the vehicle to safe conditions (e.g. slowing down and parking the vehicle). A level 2 vehicle requires the driver to be ready to intervene at any time, and it is the responsibility of the driver to intervene if the autonomous systems fail to
respond properly at any time. With level 2 automation, it is the responsibility of the driver to determine when their intervention is required; for level 3 and level 4, this responsibility shifts to the vehicle’s autonomous systems and it is the vehicle that must alert the driver when intervention is required. Other mobile robots are being developed, for example for carrying freight supplies in internal and external industrial zones. Such mobile robots would have no people on board and belong to a class of mobile robot termed UAV (unmanned autonomous vehicle). Autonomous air mobile robots (drones) are also being developed.
Driving decisions in AVs and other mobile robots depend on a number of factors, including progress, safety and comfort. Typically, AV planners are supported by motion prediction, which reasons about the future goals and behaviours of other agents of a scenario in order to determine what actions are safe for an ego vehicle to take. A challenge in prediction and planning is accounting for interactivity between agents of the scenario, by anticipating how other agents will themselves account for possible future behaviour of other agents in their behaviour. International Patent Publication Nos. WO 2020/079066 and WO2021/073781, the contents of each of which is incorporated herein by reference in its entirety, describe an inverse planning prediction methodology for predicting the goals and behaviours of other agents. Inverse planning is a prediction methodology that assumes a mobile agent (such as a vehicle or other road user) plans rationally towards some defined goal.
Summary
A standard pipeline used to determine a planned trajectory for an autonomous vehicle (ego agent) in the presence of one or more other agents comprises a prediction component, which determines predicted future behaviour of the one or more other agents, and a planner, which determines a best trajectory for the ego agent based on the predictions for the other agents of the scenario and other scene context such as speed limit, traffic lights, static objects, etc.
However, this type of pipeline makes a number of simplifying assumptions that can be limiting in highly interactive scenarios. This structure assumes that the predictions for the other agents are fixed in the sense that they are not influenced by a change in the ego’s planned trajectories. This also implies a priority in the planning computation. Since the ego agent plans its own trajectory based on the predicted actions of other agents, the other agents always have priority over the ego agent. The ego agent therefore would not take actions that could influence the other agents of the scene. This also limits the behaviour of the ego agent
to a range of behaviours that conforms to the predicted behaviour of the other agents of the scene. Marginal predictions over trajectories account for all possible actions by all the agents of the scene, including the ego agent. The ego’s planning goal is only considered at the planning stage, which allows the search space to be reduced dramatically. Predicting joint trajectories (i.e. predicting both the ego agent and the other agents’ trajectories together) can reduce the search space to only those ego trajectories that are likely according to the prediction, but again this locks the ego agent into performing certain actions, and problems can occur if the prediction module does not anticipate a specific behaviour from the ego that is necessary based on the ego’s planning goal as determined in the planning stage. Finally, changes to the predictor component can have unforeseen effects on the planner when the components are implemented independently.
An improved planning architecture is disclosed herein which incorporates both prediction and planning for the ego agent and other agents in interactive scenarios. The proposed architecture generates predicted policies (i.e. predicted distributions over trajectories) for each agent of a scene based on the scene context, including the map, other agents’ states, etc., and consumes these predicted policies and adapts the distribution over trajectories based on an evaluation of the possible trajectories of the agents that accounts for interactions between the different agents of the scene to generate a joint distribution (policy) over all agents.
A multi-agent planner that incorporates interactions is more likely to be assertive in interactive driving scenarios than a single-agent planner (i.e. a planner in which the distribution over trajectories for other agents is not updated during planning). This can cause problems when the other agents behave aggressively, leading to potential collisions. By contrast, the single agent planner is more conservative, such that it typically plans safely and avoids collisions, but may fail to achieve planning goals when prediction uncertainty is high.
To address this, the planning method described herein includes a confidence estimation that determines a confidence that the multi-agent planner’s policy for the other agents matches observed reality, based on a comparison of both the predicted policy for a given agent (i.e. single-agent policy) and the planned policy for that agent (i.e. the joint distribution) with a real observed state of the agent. When it is determined that the multi-agent planner is generating policies for agents that closely match the agents’ actual states, the adaptation of the policy for those agents by the multi-agent planner is greater, and when it is determined that the multi-agent planner’s policies are not closely matched with reality, the adaptation of
the policy for those agents is smaller, and the policy remains closer to the policy generated by the single agent planner. The idea of this confidence-based adaptation is that the planner behaves more conservatively (akin to the single-agent planner) when the confidence in the multi-agent policy updates is low and more assertively (akin to the full multi-agent planner) when the confidence in the multi-agent policy updates is high. The confidence is based on the similarity of observed agent behaviour to the predicted behaviour of those agents according to the planner at a previous timestep.
Since the multi-agent planner considers possible interactions, the resulting predictions for the other agents are consistent with the agents being aware of and reactive to other agents of the scenario. If the observed states for a given agent are not consistent with this prediction for a given agent, this may indicate that the agent is behaving aggressively or with poor awareness of other agents of the scenario. In this case, according to the confidence estimation based planning techniques described herein, the ego planner uses a lower confidence score for that agent, thus behaving more conservatively around that agent.
A first aspect herein provides a computer-implemented method of determining an estimated optimal trajectory for an ego agent in the presence of one or more further agents, the method comprising, at each of a plurality of planning iterations: receiving an observed state of each agent; for each agent: generating a predicted distribution over trajectories; and performing an adaptation of the respective predicted distribution over trajectories to generate a planned distribution over trajectories, wherein the adaptation is based on a planner confidence score for the agent and a trajectory evaluation score for the agent; and selecting an estimated optimal trajectory from the planned distribution over trajectories for the ego agent; wherein the trajectory evaluation score for each agent is based on the interaction of each of a set of possible trajectories of that agent with one or more possible trajectories of each other agent; and wherein the planner confidence score for each of the one or more further agents is based on a comparison of the observed state of the agent with a previous estimated optimal trajectory of the agent determined by the planner at a previous planning iteration.
The planner confidence score for each of the one or more further agents may be based on a comparison of the observed state of the agent with a previous predicted trajectory selected from the predicted distribution over trajectories at a previous planning iteration.
The predicted distribution over trajectories for each agent may be generated based on a belief about the goals of each agent.
The method may further comprise updating the belief about the goals of each agent based on the observed state of each agent.
The belief about the goals of each agent may be represented by a probability distribution over possible goals for that agent.
The trajectory evaluation score may be computed based on the interaction of each of a set of possible trajectories of that agent with a set of possible trajectories of each other agent, the set of possible trajectories for each agent being sampled based on the current belief about the goals of that agent.
The probability distribution over possible goals for each agent may be updated by performing Bayesian inference based on a previous probability distribution over the goals of that agent and the observed state of the agent.
The Bayesian inference may comprise determining an updated probability value for each goal by: computing a likelihood term by comparing the observed state of the agent to the estimated optimal trajectory generated by the planner for that goal at a previous planning iteration; and multiplying the likelihood term by a previous probability value for that goal to generate a posterior probability value for that goal.
The observed state of the agent may comprise one or more of a position of the agent, an orientation of the agent, and a velocity of the agent.
The comparison of the observed state of the agent with the previous estimated optimal trajectory and/or the previous predicted trajectory may be based on a comparison of the observed state of the agent with a closest state along the respective trajectory, the closest state being the state along the trajectory that is closest in time to the observed state.
The comparison of the observed state of the agent with the closest state along the respective trajectory may comprise computing a probability of at least one variable of the observed state according to a Gaussian distribution centred on the value of the corresponding variable of the closest state along the respective trajectory.
The result of the comparison may be used to determine a likelihood term, wherein the planner confidence score is determined based on a multiplication of the likelihood term by a prior planner confidence value.
The prior planner confidence value may be a predefined constant value for the first planning iteration of the plurality of planning iterations.
For the second and subsequent planning iterations of the plurality of planning iterations, the prior planner confidence value is based on the planner confidence score from the previous planning iteration.
The planner confidence score for the ego agent may be set to a constant value.
The predicted distribution over trajectories for each agent is generated by a trajectory generator as a bundle of weighted trajectories.
The bundle of weighted trajectories may be generated by a hybrid trajectory generator comprising a data-driven motion profile prediction component and a classical trajectory generation component.
The classical trajectory generation component may generate the trajectories by applying a pure pursuit algorithm to the output of the data-driven motion profile prediction component.
A different respective bundle of weighted trajectories may be generated for each possible goal of a predetermined set of goals for each agent.
Each bundle of weighted trajectories may comprise one or more subsets of weighted trajectories, each subset associated with a respective maneuver.
For each agent, the adaptation may be performed by applying learned weights to the predicted distribution over trajectories.
The weights may be learned by iteratively multiplying an initial set of weights, each weight of the initial set of weights associated with a different respective trajectory, by an evaluation factor, the evaluation factor based on an interaction of that trajectory with each trajectory of a set of possible trajectories of the other agents.
The evaluation factor at each step of the iterative multiplication may be computed based on the weights of each agent, and at least one of a number of collisions between that trajectory and one of the set of possible trajectories of the other agents; a number of distance violations between that trajectory and one of the set of possible trajectories of the other agents; whether the trajectory reaches a possible goal of the agent.
The set of possible trajectories of the other agents may comprise all trajectories for each other agent for each possible goal of that agent and each possible maneuver of that agent.
The set of possible trajectories of the other agents may be sampled from a subset of trajectories associated with the agent, the subset of trajectories determined by sampling a set of possible goals of the other agents according to a current belief about goals.
The weights may be learned by applying an Iterative Best Response algorithm to an initial set of weights.
The adaptation may comprise sampling a set of maneuvers, and wherein the set of possible trajectories of the other agents are sampled from the subset of the trajectories associated with the sampled set of maneuvers.
The sampling of the maneuvers may be performed using Upper Confidence Bounds (UCB).
The previous estimated optimal trajectory may be the most probable trajectory according to the planned distribution over trajectories at the previous planning iteration.
The previous predicted trajectory may be selected as the most probable trajectory according to the predicted distribution over trajectories at the previous planning iteration.
The estimated optimal trajectory for the ego agent may be selected as the most probable trajectory of the planned distribution over trajectories for the ego agent.
The method may comprise outputting the selected estimated optimal trajectory for the ego agent to a controller to perform control actions to effect the estimated optimal trajectory.
The iterative multiplication may comprise updating the weights for each agent in turn, wherein, for a selected agent, the evaluation factor at each step is computed based on the updated weights for each agent that was updated before the selected agent, and the previous weights for each agent that is due to be updated after the selected agent.
The order of updating the agents may be determined based on a priority of the agents in the scene.
A second aspect herein provides a computer-implemented method of estimating an optimal trajectory for an ego vehicle agent in the presence of one or more further vehicle agents, comprising, at each of a plurality of planning iterations: receiving an observed state of each vehicle agent; generating, for each vehicle agent, a predicted distribution over trajectories;
performing, for each vehicle agent, an adaptation of the respective predicted distribution over trajectories to generate a planned distribution over trajectories for the current planning iteration, wherein the adaptation is based on a trajectory evaluation score for the agent; and selecting a trajectory from the planned distribution over trajectories for the ego agent; wherein the trajectory evaluation score for each agent is based on the interaction of each of a set of possible trajectories of that agent with one or more possible trajectories of each other agent.
The predicted distribution over trajectories may be generated based on a scene context, the scene context comprising a road layout, and a belief about the goals of each of the ego vehicle agent and the one or more further vehicle agents.
The method may comprise updating, at each planning iteration, the belief about the goals of each of the ego vehicle agent and the one or more further vehicle agents based on the observed state of each vehicle agent.
The adaptation of the predicted distribution over trajectories may comprise performing an iterative update of a set of weights to be applied to the predicted distribution over trajectories; wherein at each step of the iterative weight update, the weights of each agent are updated in turn, and wherein the order in which the weight updates are applied are based on a priority of the agents within a road layout in which the vehicle agents are driving.
A further aspect provides a computer system comprising one or more processors configured to implement any of the methods described herein.
A further aspect provides a computer program for programming a computer system to implement any of the methods described herein.
Brief Description of Figures
For a better understanding of the present invention, and to show how embodiments of the same may be carried into effect, reference is made, by way of example only, to the following figures, in which:
Figure 1 shows a schematic block diagram of an autonomous vehicle stack;
Figure 2 shows an example set of goals, maneuvers, and trajectories for an agent;
Figure 3 shows a schematic block diagram of a confidence-based multi-agent planning system;
Figure 4 shows an example set of paths from a starting position as provided by an environment model of a planner;
Figure 5 shows three alternative methods of representing collision boxes for agents;
Figure 6 shows an example set of trajectories output by a low-level trajectory generator;
Figure 7 shows a high-level schematic block diagram of an iterative planning algorithm;
Figure 8 shows a schematic diagram showing a tree data structure for an example set of agent goals;
Figure 9 shows a one-dimensional example of a distribution over trajectories before and after applying an iterative best response algorithm;
Figure 10 shows an example comparison of an observed trajectory with best trajectories generated by prediction and planning components; and
Figure 11 shows a road layout and set of agents which could be used in a merging or lane change scenario.
Detailed Description
The methods described herein relate to prediction and planning methods for mobile agents in interactive scenarios, where multiple agents move in a given environment and react to the behaviour of each other agent of the scene. The methods are described herein in the context of autonomous vehicle planning, in which the planning methods are applied to enable an autonomous vehicle agent (ego agent) to plan a safe trajectory taking into account the behaviours of other agents (vehicles, cyclists, etc.) within a driving scenario. However, the methods are equally applicable to other forms of mobile agents operating within interactive scenarios, including semi-autonomous vehicles, autonomous drones, etc.
The planning methods described herein may be implemented within a planning component of an autonomous vehicle stack. A typical autonomous vehicle stack comprises multiple components implemented on a computer system, including perception stack, prediction stack, planner and controller. However, as noted above, the separation of prediction and planning
stacks can lead to problems in planning for the ego vehicle in the presence of other agents due to the implicit prioritisation of other agents over the ego vehicle, and the restriction of ego behaviour based on predictions. The methods described herein can be implemented within a planning component of the autonomous vehicle stack that implements both prediction and planning within a single iterative process, described further herein. The planning stack 106 may comprise sub-components referred to herein as ‘prediction’ and ‘planning’ sub-components, although these functions are not strictly separated by component. As described in further detail below, a prediction sub-component provides an initial distribution over agent trajectories for the various agents present in the scene, while the ‘planning’ subcomponent updates this initial distribution for both the ego vehicle and the other agents of the scene based on the possible interactions between the different trajectories (therefore effectively performing both planning and prediction). In such cases, the autonomous vehicle stack comprises a perception stack, a planning stack (which implements both prediction and planning functions) and a controller, which implements control signals on the ego vehicle to effect the planned trajectory generated by the planner.
Figure 1 shows an example autonomous vehicle stack having a planning component that performs an integrated prediction and planning process to generate a planned trajectory for the ego vehicle.
The perception stack 102 receives sensor outputs from an on-board sensor system 110 of the AV.
The on-board sensor system 110 can take different forms but generally comprises a variety of sensors such as image capture devices (cameras/optical sensors), LiDAR and/or RADAR unit(s), satellite-positioning sensor(s) (GPS etc.), motion sensor(s) (accelerometers, gyroscopes etc.) etc., which collectively provide rich sensor data from which it is possible to extract detailed information about the surrounding environment and the state of the AV and any external actors (vehicles, pedestrians, cyclists etc.) within that environment.
Hence, the sensor outputs typically comprise sensor data of multiple sensor modalities such as stereo images from one or more stereo optical sensors, LiDAR, RADAR etc.
Stereo imaging may be used to collect dense depth data, with LiDAR/RADAR etc. providing potentially more accurate but less dense depth data. More generally, depth data collection from multiple sensor modalities may be combined in a way that respects their respective levels of uncertainty (e.g. using Bayesian or non-Bayesian processing or some other statistical process etc.).
Multiple stereo pairs of optical sensors may be located around the vehicle e.g. to provide full 360° depth perception. This provides a much richer source of information than is used in conventional cruise control systems.
The perception stack 102 comprises multiple perception components which co-operate to interpret the sensor outputs and thereby provide perception outputs to the planning stack 106.
The perception outputs from the perception stack 102 are provided to the planning stack 106 as part of what is referred to herein as ‘scene context’, and used by the planning stack 106 to predict and plan the behaviour of the agents of the scenario, as described in further detail below.
The planner 106 implements the techniques described below to plan trajectories for the AV and determine control actions for realizing such trajectories. In particular, a core function of the planner 106 is to determine a series of control actions for controlling the AV to implement a desired goal in a given scenario. In a real-time planning context, a scenario is determined using the perception stack 102. A scenario is represented as a set of scenario description parameters used by the planner 106. A typical scenario would define a drivable area and would also capture predicted movements of any obstacles within the drivable area (such as other vehicles) along with a goal. A goal would be defined within the scenario, and a trajectory would then need to be planned for that goal within that scenario. The goal could for example be to enter a roundabout and leave it at a desired exit; to overtake a vehicle in front; or to stay in a current lane at a target speed (lane following). The goal may, for example, be determined by an autonomous route planner (not shown). Goals, maneuvers, and trajectories are described in more detail below, with reference to Figure 3.
The controller 108 executes the decisions taken by the planner 106 by providing suitable control signals to on-board actuators 112 such as motors of the AV. In particular, the controller 108 controls the actuators in order to control the autonomous vehicle to follow a trajectory computed by the planner 106.
The planner 106 plans over acceleration (magnitude) and steering angle control actions simultaneously, which are mapped to a corresponding trajectory by modelling the response of the vehicle to those control actions. This allows constraints to be imposed both on the control actions (such as limiting acceleration and steering angle) and on the trajectory (such as collision-avoidance constraints), and ensures that the final trajectories produced are dynamically realisable. The planner 106 will determine an optimal trajectory and a corresponding
sequence of control actions that would result in the optimal trajectory according to whatever vehicle dynamics model is being applied. The control actions determined by the planner 106 will not necessarily be in a form that can be applied directly by the controller 108 (they may or may not be). Ultimately, the role of the planner 106 is to plan a trajectory and the role of the controller 108 is to implement that trajectory. The term “control data” is used herein to mean any trajectory information derived from one or both of the planned trajectory and the corresponding series of control actions that can be used by the controller 108 to realize the planner’s chosen trajectory. For example, the controller 108 may take the trajectory computed by the planner 106 and determine its own control strategy for realizing that trajectory, rather than operating on the planner’s determined control actions directly (in that event, the controller’s control strategy will generally mirror the control actions determined by the planner, but need not do so exactly).
The planner 106 will continue to update the planned trajectory as a scenario develops. Hence, a trajectory determined at any time may never be realized in full, because it will have been updated before then to account for changes in the scenario over time.
The functions of the planner 106 will now be described in further detail. As set out above, the present disclosure addresses the planning problem of determining an optimal trajectory (and/or corresponding control actions) for an ego vehicle in a given scenario.
Figure 2 shows an illustration of an example road layout showing the concepts of goals, maneuvers, and trajectories as used in the present description. In the description below, prediction and planning for agents is described in terms of overall distributions over trajectories. However, these distributions are also conditioned on goals and maneuvers associated with each agent.
Herein, goals 802 refer to a goal location that a given agent intends to reach. These are represented by large areas of a road layout, such as lanes or portions of lanes, that define the eventual region that the agent wants to reach, but this does not restrict the agent’s overall intention to any specific location within the road map. Each goal 802 is represented by a polygon in Figure 2. For example, if the intent of the agent is to change lane, this could be achieved immediately, such that the agent trajectory crosses into the neighbouring lane quickly, or after some initial time period continuing in its current lane. In either case, the same goal 802 is represented by a region extending along the lane, and is achieved once the agent reaches that region of the road layout.
Maneuvers 804 are high level agent behaviours which are defined based on the road layout. Examples include 'follow path', 'change path left' and 'change path right'. As described further herein, the road layout can be represented by a connected graph of paths, and maneuvers cover all possible layouts, including junctions, roundabouts, merges and splits. Given a current state of an agent 200, the possible maneuvers of the agent can be enumerated, with each maneuver being attached to a path that the agent can follow from its current position, or change left or right before following it. This set of maneuvers 804 provides a high-level description of the behaviours that an agent can have, irrespective of its goal. Maneuvers are represented in Figure 2 by dotted midlines along each lane. Whether a given midline corresponds to a follow path or a change path maneuver depends on the current path of the given agent. Goals and maneuvers are distinguished in that goals represent an agent's intent, while maneuvers represent a possible behaviour of the agent, whether or not that behaviour corresponds with the given intent. Although there is a single maneuver 804 at any given time that brings the agent to its intended goal, this maneuver does not necessarily have to be taken, and so multiple maneuvers are possible for each goal. For example, where the agent has a goal of changing to the adjacent lane to the right, the agent may determine, based on the behaviour of other agents of the scenario, that the best maneuver to take is to continue forward along the current lane, taking a 'follow path' maneuver at the current timestep, but at some later timestep, the agent may determine that a change path maneuver is safe, and achieve the intended goal at that time.
A trajectory 806 is the low-level execution of a maneuver, i.e. the path and motion taken by the agent to achieve the maneuver. For a 'follow path' maneuver, this is the path and motion profile of the agent along its current path, while for a 'change path' maneuver, this is the path and motion profile taken by the agent to change path, which could be fast or slow, sharp or wide, etc. While the space of goals and the space of maneuvers are each discrete spaces, in that, for a given state of an agent, a discrete number of possible goals and a discrete number of possible maneuvers can be defined, the space of trajectories is continuous. Trajectories are represented in Figure 2 by the curves starting from the agent 200 and executing different respective maneuvers to either follow the current path or to change to the right or left. The algorithms described below for planning and predicting trajectories for different agents use distributions over trajectories to represent maneuvers. A trajectory is defined by a sequence of spatial coordinates, orientations, and velocities $(x_t, y_t, \theta_t, v_t)$ of the agent over time
defining the motion of the agent. Additional motion variables such as acceleration, or jerk (rate of change of acceleration) may also be included as part of the trajectory.
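By way of illustration, a trajectory of this form could be represented as in the sketch below. This is a minimal sketch, not the planner's actual data structures; the class and method names (AgentState, Trajectory, state_closest_in_time) are hypothetical, and the closest-in-time lookup anticipates the state comparison used by the confidence estimation described later.

from dataclasses import dataclass
from typing import List


@dataclass
class AgentState:
    """A single point along a trajectory: position, heading and speed at a given time."""
    x: float      # position [m]
    y: float      # position [m]
    theta: float  # heading [rad]
    v: float      # speed [m/s]
    t: float      # time [s]


@dataclass
class Trajectory:
    """A trajectory as an ordered sequence of timed states."""
    states: List[AgentState]

    def state_closest_in_time(self, t: float) -> AgentState:
        # Return the state along the trajectory closest in time to t (used when
        # comparing an observed state against a previously planned/predicted trajectory).
        return min(self.states, key=lambda s: abs(s.t - t))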
Figure 3 shows an example planner 106 that implements a confidence-based multi-agent planning method to determine a trajectory for the ego vehicle. As described above, a space of possible trajectories for a given agent exists for each of multiple possible goals and maneuvers. For the purposes of describing Figure 3, unless otherwise specified, distributions over trajectories are computed over all possible maneuvers and goals. However, as described later, the planner may sample goals and/or maneuvers as part of the algorithm used to generate an updated distribution over trajectories.
The planner 106 receives a scene context, which is represented within an environment model that defines states of the agents of the scenario, as determined by the perception component 102 based on sensor data, as well as a static layout of the environment in which the autonomous vehicle is driving, including map data for the driving area, which defines a road layout comprising lanes, junctions, etc, and their characteristics (width, rules, etc.) and the connections between them.
The prediction sub-component 204 of the planner is configured to generate an initial ‘predicted’ distribution over trajectories based on the scene context. As described further below, the prediction sub-component may be implemented as a ‘hybrid’ trajectory generator which uses a data-driven motion profile generator to determine a distribution over motion profiles and a classical trajectory generator which uses this distribution over motion profiles to generate a bundle of individual trajectories. However, any kind of trajectory generator configured to generate an initial set or ‘bundle’ of trajectories, representative of a predicted distribution over trajectories, can be used. A bundle of trajectories is generated over all maneuvers and goals for the given agent. Although the bundle of trajectories is a discrete subset of possible trajectories according to an underlying probability distribution, it may also be referred to herein as a ‘distribution’, since it provides a sample-based representation of the underlying distribution.
The bundle of trajectories generated by the trajectory generator of the prediction component 204 is provided to an adaptation component 206 as a representation of the predicted distribution for each agent at a given iteration of the planner corresponding to a time t. The adaptation component 206 is configured to update the weighting of the bundle of trajectories associated with each agent according to an adaptation rule. The purpose of the adaptation
component is to assess the predicted trajectories generated by the predictor 204 against each other, taking possible interactions between trajectories into account, and to determine a new distribution over trajectories for each agent that represents the probability of the given agent taking each possible trajectory (of the previously generated bundle of trajectories), taking possible future interactions into account.
The adaptation updates the weights assigned to each trajectory based on the ‘goodness’ of the agent’s trajectories, given a belief about its possible goals, and the scene context. This is achieved using an algorithm to iteratively adjust the weights of the trajectory according to an update rule. Two alternative choices of algorithm, Multi-agent distributional upper confidence bounds (MAD-UCB) and Bayesian Iterative Best Response (BIBeR) are described in further detail below. Although, as described further below, the beliefs about the goals of each agent are updated iteratively based on observations, the methods described herein differ from the existing inverse planning methods, since these beliefs are incorporated into an iterative planning algorithm that updates the predicted trajectories for each agent based on interactions.
It should be noted that the distribution for each agent is updated in turn, and once the distribution over trajectories has been updated for a given agent, it is the updated distribution that is used to evaluate the possible trajectories and compute an updated weighting for subsequent agents. In this way, priority of agents is implicitly enforced. The earliest agents to be updated are the lowest priority, since their updated distributions over trajectories give consideration to the predictions of all later agents when computing a planned trajectory. Alternatively, the distributions for each agent could be fixed for all agents, and the update performed simultaneously for all agents at the end of a given planning iteration. An advantage of updating each agent's policy in turn is that the algorithm has convergence guarantees in this case.
It should be noted that the term ‘planned trajectory’ is used generally herein to refer to a trajectory taken from an adapted distribution over trajectories, irrespective of whether this distribution is associated with an ego agent or an external agent. This term is used herein to distinguish the updated distribution over trajectories from the initial distribution over trajectories determined by the prediction component 204. However, only those trajectories generated for an ego agent, and for which control signals are subsequently generated in order to implement the given trajectory, are really ‘planned’ trajectories corresponding to those
generated by a planning component of an autonomous vehicle stack. Adapted distributions over trajectories for all other agents, i.e. agents over which the software stack does not have control, are merely predictions for those agents, and the adaptation component 206 provides these predictions as a way to take interactions into account. Thus, any reference to a ‘planned’ trajectory for any non-ego agent should be taken to refer to a predicted trajectory for that agent that takes possible interactions between agents into account.
As mentioned above, the planner shown in Figure 3 is a confidence-based multi-agent planner. The update performed by the adaptation component to generate planned trajectories for each agent is based on a computed confidence that the planning outputs are close to observed reality. A multi-agent planner that takes into account interactions between agents can be assertive, which is advantageous in that good progress can be made by the ego vehicle based on the planner outputs, for example by pro-actively starting a merge maneuver where the ego vehicle is reasonably confident that the other agents of the scenario will allow the ego to merge. However, this can also lead to dangerous situations when the actual behaviour of the other agents of the scenario does not match the predicted behaviour of those agents according to the planner. By contrast, the ‘single-agent’ planner, corresponding to the conventional separate prediction/planning architecture described above, in which predictions are made for each external agent and the ego trajectory planned based on those static predictions, is more conservative.
The prediction component 204 and the adaptation component 206 provide the respective predicted and planned distributions over trajectories for each agent (in the form of a weighted bundle of trajectories) generated at a previous planning iteration corresponding to a time t-1 to a confidence estimation component 208 which determines a level of confidence in the planner outputs and the prediction outputs (i.e. the trajectories generated by the adaptation component 206 and the prediction component 204, respectively) based on the similarity of an observed state of each agent as received at a current timestep with each of those outputs. The confidence estimation component 208 receives the observed state 202 of each agent, which may be determined by the perception component 102 based on sensor data, and determines a relative level of confidence in each of the prediction component 204 and the adaptation component 206 based on some comparison between the observed state and the distributions over trajectories generated by each of these components. This comparison could be performed using Bayesian inference to determine a probability that either the predictor 204 (i.e. the single-agent planner) or the adaptation component 206 (i.e. the multi-agent planner)
match the real behaviour of an agent observed in the real world. This is described further below, in which the adaptation algorithms and further mathematical details of the planning method are provided.
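As an illustration of this comparison, the sketch below shows one way the planner confidence could be updated as a Bayesian weighting of the multi-agent (planned) hypothesis against the single-agent (predicted) hypothesis. It builds on the hypothetical Trajectory sketch above; the position-only Gaussian likelihood, the fixed sigma and the binary normalisation between the two hypotheses are illustrative assumptions rather than the actual confidence estimation component 208.

import math


def gaussian_likelihood(observed: float, expected: float, sigma: float) -> float:
    # Density of the observed value under a Gaussian centred on the expected value.
    return math.exp(-0.5 * ((observed - expected) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def update_planner_confidence(prior_confidence: float,
                              observed_state,        # observed AgentState at the current timestep
                              planned_trajectory,    # best planned (multi-agent) trajectory from time t-1
                              predicted_trajectory,  # best predicted (single-agent) trajectory from time t-1
                              sigma: float = 1.0) -> float:
    # Evaluate each hypothesis at the state closest in time to the observation.
    s_plan = planned_trajectory.state_closest_in_time(observed_state.t)
    s_pred = predicted_trajectory.state_closest_in_time(observed_state.t)
    # Position-only likelihoods for brevity; orientation and velocity terms could be added similarly.
    lik_plan = (gaussian_likelihood(observed_state.x, s_plan.x, sigma) *
                gaussian_likelihood(observed_state.y, s_plan.y, sigma))
    lik_pred = (gaussian_likelihood(observed_state.x, s_pred.x, sigma) *
                gaussian_likelihood(observed_state.y, s_pred.y, sigma))
    # Bayesian-style update: likelihood multiplied by the prior confidence, then normalised.
    posterior_plan = lik_plan * prior_confidence
    posterior_pred = lik_pred * (1.0 - prior_confidence)
    return posterior_plan / (posterior_plan + posterior_pred + 1e-12)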
Further details of the components of the planner 106 will now be provided. First, details of an example 'low-level' trajectory generator for generating an initial distribution over trajectories will be provided. Further mathematical details of the adaptation component 206 will then be provided with reference to two example algorithms for performing the update.
All algorithms described herein require an environment model which provides a) a queryable static layout and b) means to reason about context, e.g. information on other agents in the scene. The static layout provides information on the road layout such as the road components (lanes, junctions etc), their characteristics (width, rules etc) and the connections between them (i.e. the path graph covering adjacency, connections in junctions etc). An example map format that can be used to provide a static layout is the OpenDRIVE® standard. Currently, the algorithm uses a query engine for a static layout which is known as the Scenario Query Engine (SQE). This engine provides low-level functions for interacting with the map, for example determining what road layout elements are at a certain position or returning a reference line (a.k.a. midline) of a certain element. This low-level engine is wrapped with a model referred to herein as the Motion Planning and Prediction (MPP) Environment. The MPP Environment provides higher-level functions by combining several low-level functions from the SQE. Further details of the Scenario Query Engine and the OpenDRIVE format are provided in International Patent Publication No. WO2023062166, which is hereby incorporated by reference herein in its entirety.
One example high-level function is to provide all paths starting from the midline of the lane that an agent is currently on, up to a certain required distance. Figure 4 shows forward paths up to a required distance (solid bold tree) and neighbour paths up to that distance (dotted bold tree) given an initial query position ('X'). These paths are generated automatically based on the static layout, which defines the road layout elements of the current lane corresponding to the initial query position and the road layout elements of the neighbouring lanes of the initial query position which are reachable from the initial query position based on the static road layout information.
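A minimal sketch of such a path query is given below. The lane_graph.successors and lane.length helpers are assumed for illustration only; they are not the actual SQE/MPP Environment API.

def forward_paths(lane_graph, start_lane, required_distance: float):
    """Enumerate forward paths (sequences of lanes) from a starting lane up to a required distance."""
    paths = []

    def expand(lane, path_so_far, remaining):
        path = path_so_far + [lane]
        remaining -= lane.length
        successors = lane_graph.successors(lane)  # assumed helper: lanes reachable from this lane
        if remaining <= 0.0 or not successors:
            paths.append(path)
            return
        for nxt in successors:
            expand(nxt, path, remaining)

    expand(start_lane, [], required_distance)
    return paths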
Regarding context, the MPP Environment provides the means to represent agents, their footprints and states, as well as to compute collisions between agents, or combine some of the
layout queries with the agent representation (e.g. longitudinal distance between agents on midline of a lane/path). Figure 5 shows three alternatives for approximating the collision boxes of the agents using a set of circles:
(a) Circles aligned on the mid axis of the vehicle such that the first and last circle centre are on the front and rear bumper.
(b) As in (a), but adding width and length padding to the box before placing the circles.
(c) Using trigonometry to configure the desired approximation, i.e. to trade-off width vs length errors.
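The following sketch illustrates option (a): circles placed along the vehicle's mid axis, with the first and last circle centres on the bumpers, together with a simple circle-overlap collision check. The function names and the choice of three circles are illustrative assumptions, not the MPP Environment implementation.

import math


def footprint_circles(x: float, y: float, theta: float,
                      length: float, width: float, n_circles: int = 3):
    # Approximate a rectangular footprint with circles along the mid axis (option (a)):
    # the first and last circle centres sit on the rear and front bumpers.
    radius = width / 2.0
    offsets = [-length / 2.0 + i * length / (n_circles - 1) for i in range(n_circles)]
    return [(x + d * math.cos(theta), y + d * math.sin(theta), radius) for d in offsets]


def circles_collide(circles_a, circles_b) -> bool:
    # Two footprints collide if any pair of circles overlaps.
    return any(math.hypot(ax - bx, ay - by) < ar + br
               for ax, ay, ar in circles_a
               for bx, by, br in circles_b)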
The algorithms presented herein make use of a predefined hierarchy, where the macro-actions are maneuvers. This is an abstract description of the behaviour that an agent can make, which allows the integration of a semantic interpretation of the road layout into the algorithms described below. It also grounds the algorithms to the map and allows reasoning about legal and/or acceptable behaviours according to the highway code. For example, it is illegal to change path if one would cross a double solid line separator. It is straightforward to implement this behaviour in the planner by excluding any maneuver that would cross a double solid line separator.
The planner 106 is provided with a maneuver generator configured to use the MPP Environment queries to enumerate the applicable maneuvers given the current state of an agent, and to generate the goals reachable by an agent given its current state. The maneuver generator is also configured to use the data-driven motion profile predictors and trajectory generator(s) described in more detail below to generate trajectories for each agent according to the selected maneuvers.
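A minimal sketch of such a maneuver enumeration is shown below. The environment.current_lane, lane.left_neighbour/right_neighbour and lane.separator_to helpers are assumed purely for illustration; the actual maneuver generator is built on the MPP Environment queries described above.

def applicable_maneuvers(agent_state, environment):
    # Enumerate the high-level maneuvers available from the agent's current lane,
    # excluding maneuvers that would cross a double solid line separator.
    lane = environment.current_lane(agent_state)
    maneuvers = [("follow_path", lane)]
    for name, neighbour in (("change_path_left", lane.left_neighbour),
                            ("change_path_right", lane.right_neighbour)):
        if neighbour is None:
            continue
        if lane.separator_to(neighbour) == "double_solid":
            continue  # illegal lane change, excluded from planning
        maneuvers.append((name, neighbour))
    return maneuvers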
Multiple possible methods exist for generating an initial distribution over trajectories for each agent (either in the form of a continuous distribution, from which a bundle of trajectories can be sampled, or in the form of a representative bundle of trajectories). These include tree-search-based methods such as Monte Carlo Tree Search, which search over possible actions that can be taken by each agent to determine a possible future trajectory for each agent, as well as neural-network-based models, which are trained based on existing trajectory data to predict a most likely trajectory for each agent given an input representing the agent's state, the state of other agents of the scene, and other scene context.
The prediction component 204 of the confidence-based planner 106 is preferably implemented with a trajectory generator that satisfies the following requirements:
The predictor should take the scene context (map, other agents, static objects, traffic lights etc.) into account and perform marginal predictions for the other agents. The predictor should be able to perform predictions that are conditioned on the maneuvers of the other agents (and optionally on the goals of the other agents). Note that while conditioning on maneuvers is required for the MAD-UCB algorithm described below, the BIBeR algorithm (also described below) does not require conditioning on maneuvers, as it does not use a hierarchy of maneuvers and trajectories.
The predictor should be configured to provide a parametric distribution over trajectories which can be sampled to create bundles of weighted trajectories to be passed to the adaptation component 206 (bundles of trajectories may be sampled from such a distribution and post-processed with a classical trajectory generator). The predictor should be implemented efficiently.
The predictor should be configured to enforce kinematic feasibility (and optionally dynamic feasibility).
The predictor can consider comfort, for example by imposing limits on longitudinal and lateral jerk and/or acceleration.
One example trajectory generator that satisfies the above requirements is a ‘hybrid’ trajectory generator which combines a data-driven technique for generating predicted motion profiles with a ‘classical’ trajectory generator.
A motion profile predictor predicts the motion profile of the other agents, i.e. their future speeds over a set of future timesteps. The motion profile predictor is implemented in the form of a Mixture Density Network (MDN) trained to predict a sequence of future distances which can be turned into a sequence of future speeds and used as targets by a classical trajectory generator. The MDN predicts a Gaussian Mixture Model (GMM) over distances over the upcoming 5 seconds from a current state of the agent, with a time interval (dt) of 1 second. Constant acceleration is assumed to determine a sequence of future speeds from the predicted distances. Linear interpolation is used to reduce the time interval between predictions to a reasonable dt of 0.1 seconds. This motion profile prediction network is described in more detail in Antonello, Morris, et al. "Flash: Fast and light motion prediction for autonomous driving with Bayesian inverse planning and learned motion profiles." 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2022, as well as International Patent Publication No. WO2023/135271, each of which is incorporated by reference herein in its entirety.
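As an illustration of this step, the sketch below converts one sampled sequence of predicted distances into a fine-grained speed profile. It assumes the distances are per-interval displacements at 1 second spacing and applies the constant-acceleration and linear-interpolation assumptions described above; the function name and exact handling are hypothetical, not the trained MDN pipeline.

import numpy as np


def distances_to_speed_profile(current_speed: float,
                               segment_distances,      # distance travelled in each 1 s interval (one GMM sample)
                               coarse_dt: float = 1.0,
                               fine_dt: float = 0.1):
    # Assume constant acceleration within each interval:
    # d = v*dt + 0.5*a*dt^2  =>  a = 2*(d - v*dt)/dt^2, then v' = v + a*dt.
    speeds = [current_speed]
    v = current_speed
    for d in segment_distances:
        a = 2.0 * (d - v * coarse_dt) / coarse_dt ** 2
        v = max(0.0, v + a * coarse_dt)
        speeds.append(v)
    # Linearly interpolate the 1 s speed targets down to a 0.1 s timestep.
    coarse_t = np.arange(len(speeds)) * coarse_dt
    fine_t = np.arange(0.0, coarse_t[-1] + 1e-9, fine_dt)
    return fine_t, np.interp(fine_t, coarse_t, speeds)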
The output of the motion profile prediction is a Gaussian Mixture Model, i.e. a distribution over future speeds which can be used as inputs to a classical trajectory generator, such as a pure pursuit generator. Pure pursuit is a path and motion profile tracking algorithm that uses a controller and a vehicle model to generate a trajectory that minimises the error from the target motion profile and the target path. The algorithm generates trajectories by advancing the state of the agent given an acceleration and steering input and assuming a kinematic bicycle model. As mentioned above, a hybrid trajectory generator that uses a MDN-based motion profile generator and a pure pursuit trajectory generator is described in International Patent Publication No. WO2023/135271, and will not be discussed in any further detail herein.
While a pure pursuit trajectory generator is used in the example hybrid trajectory generator described above, various types of classical trajectory generators can be used to generate an initial bundle of trajectories. Other examples of classical trajectory generators include a Bezier spline base generator, which decomposes the path and motion profile generation, and a quintic polynomial generator, which decomposes the trajectory generation on the x and y dimensions. Pure pursuit has advantages over other classical trajectory generation methods. Pure pursuit has a relatively simple implementation, following a path according to a kinematic vehicle model (in the present example, a simple bicycle model). Pure pursuit also uses a ‘lookahead’ parameter which dictates how far ahead the algorithm looks when selecting a point on the target path to follow. Higher lookahead values result in smoother trajectories. In lane change scenarios, for example, this value can dictate whether the lane change is sharp or smooth. Another advantage of pure pursuit is that this is the only parameter, and that it is intuitive to tune.
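A single integration step of a pure pursuit tracker over a kinematic bicycle model could look like the sketch below. The clamped proportional speed controller is an illustrative assumption; the lookahead-based steering law and the bicycle-model update reflect the pure pursuit approach described above.

import math


def pure_pursuit_step(x, y, theta, v, path_points, lookahead: float,
                      wheelbase: float, target_speed: float, dt: float = 0.1):
    # Pick the first waypoint at least `lookahead` metres ahead of the vehicle.
    target = next((p for p in path_points
                   if math.hypot(p[0] - x, p[1] - y) >= lookahead), path_points[-1])
    # Angle of the lookahead point relative to the current heading.
    alpha = math.atan2(target[1] - y, target[0] - x) - theta
    # Pure pursuit steering law; larger lookahead values give smoother steering.
    steer = math.atan2(2.0 * wheelbase * math.sin(alpha), lookahead)
    # Simple clamped proportional speed controller towards the target motion profile.
    accel = max(-3.0, min(3.0, (target_speed - v) / dt))
    # Advance the kinematic bicycle model by one timestep.
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    theta += (v / wheelbase) * math.tan(steer) * dt
    v = max(0.0, v + accel * dt)
    return x, y, theta, v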
It should be noted that the hybrid trajectory generator referred to in the earlier application uses Bayesian inference to inform the goal and path extraction step of prediction. The present implementation also uses Bayesian inference to update beliefs about both agent goals and the planner confidence. The Bayesian inference step is implemented at each planning iteration based on the overall output of the planner at the previous iteration as well as the scene context, as will be described in further detail below, with reference to Figure 7.
Figure 6 shows an example output prediction (dashed and dotted lines) of the combination of motion profile prediction with a pure pursuit trajectory generator, described in more detail below. The solid line represents a ground truth trajectory in the NGSIM dataset which clearly violates both kinematic constraints and motion profile constraints (i.e. acceleration and jerk
limits). The predicted trajectories (dashed and dotted lines) are smooth and feasible without impacting the predictor’s ability to select the correct behaviour (shown as a dotted line).
A mathematical description of two example algorithms, Bayesian Iterative Best Response (BIBeR) and Multi-Agent Distributional Upper Confidence Bound (MAD-UCB), that may be implemented by the planner 106 will now be described in more detail. Both algorithms use versions of an iterative best response approach, as described in Muchen Sun et al. ‘Move Beyond Trajectories: Distribution Space Coupling for Crowd Navigation’, which is hereby incorporated by reference herein in its entirety. These algorithms encompass both the prediction component 204 and the adaptation component 206 of Figure 3. Both algorithms are multi-agent methods which incorporate the prediction of the other agents. The prediction part of the algorithm uses the plans of the other agents and the real world observations to infer certain aspects regarding the other agents, such as their goal (intent), and/or their speed or comfort preferences. These beliefs are then used within the planning part of the algorithm to choose a next trajectory for the ego vehicle.
An overview of the steps performed by both algorithms will first be provided, with reference to the mathematical formulation of goals, maneuvers and trajectories used by the planner. Each algorithm will then be described in more detail.
The two algorithms attempt to infer the goal $g_t^i \in G^i$ and future trajectory $\tau_t^i \in T^i$ for agent $i$ at time $t$. Each algorithm models the beliefs about the agent goals as a discrete distribution over goals $P_t(G_t^i \mid C_t)$ for each agent $i$ in the scene, and a continuous distribution over trajectories $P_t(T_t^i \mid G_t^i, C_t)$ for each agent $i$ in the scene, where $C_t$ is the scene context. The ego trajectory is selected from such a distribution over its trajectories. Internally, these algorithms build a flat tree at time $t$ over maneuvers $M_t^i$ for each agent $i$ based on the graph of the paths in the layout. Both algorithms are composed of three main components:
1. A low-level trajectory generator that provides an initial distribution over the continuous space of possible trajectories $T_t^i$ given the scene context $C_t$ (for example, a map defining a road layout, other agents' states, etc.). This trajectory generator is also conditioned on the maneuvers $M_t^i$. The initial distribution is therefore $P_t(T_t^i \mid M_t^i, C_t)$. As described above, one example implementation of the low-level trajectory generator combines a mixture density network that predicts GMMs over motion profiles with a classical trajectory generator (pure pursuit) to produce a sample-based representation of the distribution over trajectories.
2. A high-level planning component, corresponding to the adaptation component 206 of Figure 3, in which the two algorithms differ. MAD-UCB combines upper confidence bound (UCB) sampling of maneuvers with an Iterative Best Response algorithm for estimating the goodness of an agent's maneuvers $M_t^i$ and its trajectories $T_t^i$ given its possible goals $G_t^i$ and current context $C_t$. BIBeR does not make use of the hierarchy of goals and maneuvers, and it skips the UCB calculation, but uses the same Iterative Best Response algorithm over all trajectories irrespective of which maneuver they belong to. The initial trajectory distribution $P_t(T_t^i \mid M_t^i, C_t)$ is therefore modified to account for the agent's possible goals and all future trajectories of the other agents $T_t^{\neg i}$, giving $P_t(T_t^i \mid M_t^i, G_t^i, C_t, T_t^{\neg i})$. For MAD-UCB, the value of each maneuver $m_t^i$ is computed as a statistic of its corresponding trajectories. Optionally, this information could be used to construct a discrete probability distribution over all maneuvers.
3. A Bayesian inference step which estimates the ego's current belief of another agent's goal $P_t(G_t^i \mid C_t)$, given the previous belief $P_{t-1}(G_{t-1}^i \mid C_{t-1})$, the previous output of the behavioural planning component $P_{t-1}(T_{t-1}^i \mid M_{t-1}^i, G_{t-1}^i, C_{t-1}, T_{t-1}^{\neg i})$ and optionally, where such a distribution has been computed, $P_{t-1}(M_{t-1}^i \mid G_{t-1}^i, C_{t-1})$.
As described further below, the Bayesian inference step includes a belief update for goals, as well as a belief update on the confidence score of the planner, which is used to determine the extent to which the adaptation component takes interactions between agents into account when updating the predicted distribution over trajectories for each agent. In other words, a Bayesian inference is performed to determine the degree to which the planner is a ‘multi-agent’ planner rather than a ‘single-agent’ planner.
The algorithm can return a joint distribution of these three variables $(G_t^i, T_t^i, M_t^i)$, or marginal distributions of each.
Figure 7 shows a simplified block diagram of the steps carried out by the algorithm. At each planning iteration, the goal belief and planner confidence are updated at step S702 based on
the output of the previous planning step as well as the observed scene context. A bundle of weighted trajectories is then generated at step S704 by the planner. As described in more detail below, the trajectories are output in the form of a tree for each goal. The initial state, maneuvers and trajectories are the same for each tree, but the trajectory weights are different for each tree. As described in more detail below, for the MAD-UCB algorithm, a maneuver statistic is also computed which differs for each tree. An example set of trees is described with reference to Figure 8 for a simple case of four goals and two maneuvers. The current belief and generated bundle of trajectories are combined to return an output in the required format (S706). The output could be in the form of a joint distribution or a marginal distribution, based on the weighted trajectories output by the planner. Other post-processing steps can be applied to generate an output in any required format.
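In outline, one planning iteration could be structured as in the sketch below. The four callables passed in stand for the belief update, trajectory generation, adaptation and output-selection components described in this document; their names and signatures are hypothetical.

def planning_iteration(scene_context, observed_states, previous_output, belief,
                       update_beliefs, generate_trajectory_bundles, adapt_weights,
                       select_output):
    # S702: Bayesian update of goal beliefs and planner confidence, based on the
    # previous iteration's output and the newly observed agent states.
    belief = update_beliefs(belief, observed_states, previous_output)
    # S704: generate a weighted bundle of trajectories per agent (one tree per goal)
    # and adapt the weights with the interaction-aware update (MAD-UCB or BIBeR).
    bundles = generate_trajectory_bundles(scene_context, belief)
    planned = adapt_weights(bundles, belief, scene_context)
    # S706: combine the belief and adapted bundles into the required output format
    # (joint or marginal distributions, plus the selected ego trajectory).
    return select_output(planned, belief), belief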
A first example algorithm that can be used by the adaptation component 206 of the planner 106 to generate an updated distribution over trajectories is referred to herein as 'Multi-agent Distributional Upper Confidence Bounds' (MAD-UCB). Pseudocode of the MAD-UCB algorithm for implementing the proposed planning method is provided below.
Run_planner(road_layout, scene_context, current_belief):
    # generate all trajectories using the low-level trajectory generator and generate trees for each goal
    generate_trees(road_layout, scene_context, current_belief)
    # compute the parts of the evaluation that don't change during the algorithm:
    compute collisions between all generated trajectories
    compute distance violations between all generated trajectories
    compute goals reached by all generated trajectories
    # initial weights are set to 1
    initialise_trajectory_weights(trees)
    # initialise maneuver statistic to 0 and number of visits to 0 for UCB
    initialise_maneuver_statistics(trees)
    while within planning budget:
        # sample goals for all agents according to the current belief and select maneuvers for them
        all_agents_goals = sample_goals(current_belief)
        maneuvers = trees.select_maneuvers_with_UCB(all_agents_goals)
        # one iteration of Iterative Best Response
        for each agent in context.agents:
            # recompute the weights of each trajectory for the agent
            trees.update_trajectories_weights(agent, all_agents_goals, maneuvers,
                                              collisions, distance_violations, reached_goals)
            # compute aggregate over current trajectory weights, update statistic and number of visits
            trees.update_maneuver_statistics(agent, all_agents_goals, maneuvers)
    return trees
To account for the impossibility of knowing the goals of the other agents with certainty, the behavioural planner must integrate the ego’s current belief of what the other agent’s goals could be. This is achieved by storing a tree for each goal, for each agent. The distribution is estimated using unweighted samples from the current belief, see the step of sampling goals in the algorithm above.
The main while-loop of the algorithm combines UCB over maneuvers statistics with a sampling-based iterative best response approach for adjusting the distribution over trajectories and learning the maneuver statistics. This algorithm uses a set of predefined trajectories as samples, as generated by the low-level trajectory generator of the prediction component 204, described earlier. As a result, the generation of all trajectories, checking of collisions and distance violations, and checking of whether trajectories reach goals is performed before the main loop of the algorithm. The main loop of the algorithm uses this information to iteratively update only the weights representing the value of each trajectory for each agent. It uses the following equations to perform the update.
Equation 1 below shows the update of the weights $w_{i,n}^{q}$ for agent $i$ between iterations $q$ (i.e. for each loop of the while statement). This corresponds to the 'update_trajectories_weights' step shown in the algorithm above:

$$w_{i,n}^{q+1} = w_{i,n}^{q} \cdot y_{i,n}^{q} \tag{1}$$

where $y_{i}^{q}$ represents the evaluation of all sampled trajectories for a certain maneuver and goal for agent $i$ at iteration $q$ (note that goal and maneuver indices are not shown for readability's sake). The initial weights $w_{i,n}^{q=0}$ are set to 1.
Note that, as specified in the while-loop above, the weight update is applied only to trajectories corresponding to a subset of sampled goals. In this way, the beliefs about goals are taken into account in the planning algorithm. However, a possible alternative is not to sample goals, and instead perform a weighted update. In this case, the trajectories are replicated for each goal. In this case, the weights for each trajectory can be initialised with the goal probabilities, or the goal probabilities may be incorporated in the trajectory evaluation step, in order to take into account the distribution over goals without sampling goals directly.
Equation 2 below defines the value $y_{i,n}^{q}$ of a single trajectory $n$ by using the weights of all trajectories of the other agents across all their maneuvers and goals (where, again, goal and maneuver indices are not shown for readability's sake). This value can be considered a 'return' or reward associated with the given trajectory. The full return can be estimated by using the full set of trajectory weights across all maneuvers and goals of the other agents. Alternatively, goals and maneuvers can also be sampled for the other agents. It should be noted that the Iterative Best Response algorithm iteratively updates the distribution for each agent and uses the updated policies of the agents in the same iteration as it updates the remaining ones. As mentioned above, the order in which agents are updated provides an implicit ordering of the priority of agents in the scenario. For a given agent $i$, the algorithm evaluates the possible trajectories of that agent based on the updated weights $w_{j}^{q+1}$ of all earlier-updated agents $j = 0, \ldots, i-1$, and based on the previous weights $w_{j}^{q}$ for all agents which have yet to be updated, $j = i+1, \ldots, |A|$, where $A$ is the set of agents.

An alternative would be to fix all policies (i.e. all trajectory weightings) throughout the for-loop over agents and perform the update for all agents at the end; however, this alternative does not have the convergence guarantees of IBR.
In Equation 2, an indicator function returns 1.0 if the trajectory $\tau_{i,n}$ reaches the sampled goal and 0.0 otherwise, and a further term represents a function for evaluating a trajectory $n$ of agent $i$ against another trajectory $z$ of another agent $j$ by checking for collisions and distance violations using step functions that assign fixed penalty values to collisions and distance violations.
Note that the above values for collision and distance violation penalties are just one possible choice, and that the values can be tuned as desired to penalise collisions and distance violations to the desired degree.
Finally, a distribution $P^{q}(\tau_{i,n})$ can be recovered at a given iteration by combining the learned weights with the initial distribution over trajectories. As above, this distribution is computed for each goal and maneuver, though the goal and maneuver indices are not included here for readability.
It should be noted that the step of computing an updated distribution over trajectories is only performed after the final iteration of the weight update described above, i.e. once the while loop of the above algorithm has completed. This step is needed as the original set of samples (i.e. ‘bundle’ of trajectories) is unweighted, and the aim of the algorithm is to return a likelihood for each trajectory.
Aggregate information over $w^{(q+1)}$ (or $P^{(q)}(T_{i,n})$) is used to update the maneuver statistics (‘update maneuver statistics’ in the above pseudocode), which are used to trade off exploration and exploitation in future selections of maneuvers (‘select maneuvers with UCB’ in the pseudocode above). For example, the mean of the weights can be used to update the maneuver value and increment the number of visits to the maneuver. Other statistics for the maneuver value can be used, including risk metrics. Upper confidence bound (UCB) is a reasonable fit in this case, as the distribution of the value statistics will be approximately Gaussian if the samples were independent and identically distributed (i.i.d.) (Central Limit Theorem). The UCB of a maneuver m is defined as follows:

$UCB(m) = Q(m) + c \sqrt{\dfrac{\ln N}{N_m}}$
where Q(m) represents the estimated value of the maneuver (i.e. the aggregate statistic, such as the mean, over the values of the bundle of trajectories corresponding to this maneuver), N the number of iterations that the algorithm has performed, $N_m$ the number of times maneuver m was selected, and c is a parameter controlling the level of exploration required to reduce uncertainty. The selection of maneuvers according to the ‘select maneuvers with UCB’ step of the pseudocode above is based on the value in the equation above. Without the upper confidence bound term, selection would rely only on the estimated value Q(m). This could lead to certain maneuvers which were not previously selected often being more likely to be discounted from selection, even if they may lead to positive outcomes. The upper confidence bound term encourages exploration of the available maneuvers as a function of the extent to which those maneuvers have been selected previously. As more maneuvers are explored, the uncertainty associated with those maneuvers decreases and the algorithm can more confidently select a maneuver based on the value Q(m).
However, in practice samples are not necessarily i.i.d and some tuning may be required to define an aggregate measure for the maneuver. The step of selecting maneuvers in the above pseudocode is likely to help when there are a large number of agents as it will guide the process towards the most promising part of the space and avoid spending resources on the unlikely behaviours from the other agents. However, this step can be optional and one can simply update all trajectories for an agent given all trajectories for the other agents (this is done in the BiBER algorithm, described further below). The two algorithms can be empirically evaluated against each other for a given planning application to determine which to use.
Finally, the algorithm needs to select a trajectory for the ego vehicle to execute in the real world. This is not shown within the pseudocode above. Given the hierarchy, this final selection has two steps. First, a Lower Confidence Bound (LCB) is used to select the maneuver; the trajectory with the maximum probability is then selected from the set of trajectories that belong to the chosen maneuver. LCB allows the algorithm to consider its confidence in its estimation when breaking ties between maneuvers and trajectories, adding an extra safety check in highly uncertain situations or when few samples have been performed:

$LCB(m) = Q(m) - c \sqrt{\dfrac{\ln N}{N_m}}$
where Q(m) represents the estimated value of the maneuver (i.e. the aggregate statistic over the values of the bundle of trajectories corresponding to this maneuver such as mean), N the number of iterations that the algorithm performed, Nm the number of times maneuver m was selected and c is a parameter controlling the level of exploration required to reduce uncertainty. Note the difference between UCB and LCB is in the sign of the second term; UCB is optimistic in the face of uncertainty to encourage exploration while learning and LCB is pessimistic in the face of uncertainty to discourage selecting an uncertain maneuver to be executed in the real world.
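For illustration only, the UCB-based selection during planning and the LCB-based final selection could be implemented as sketched below; the statistics container and the handling of unvisited maneuvers are assumptions rather than the specific implementation described above.

import math

def ucb(stats, m, N, c):
    # stats[m] = (value_estimate Q(m), visit_count N_m)
    q, n_m = stats[m]
    if n_m == 0:
        return float("inf")   # force at least one visit of each maneuver
    return q + c * math.sqrt(math.log(N) / n_m)

def lcb(stats, m, N, c):
    q, n_m = stats[m]
    if n_m == 0:
        return float("-inf")  # never execute an unvisited maneuver
    return q - c * math.sqrt(math.log(N) / n_m)

def select_maneuver_for_planning(stats, N, c=1.0):
    # optimistic selection while learning (exploration)
    return max(stats, key=lambda m: ucb(stats, m, N, c))

def select_maneuver_for_execution(stats, N, c=1.0):
    # pessimistic selection of the maneuver to execute in the real world
    return max(stats, key=lambda m: lcb(stats, m, N, c))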
BIBeR uses the same equations as MAD-UCB; however, it does not use UCB during planning or LCB during the final selection. The final selection involves simply selecting the trajectory with the highest probability. The rest of the algorithm is similar to MAD-UCB, as shown in the following pseudocode:

def run_planner(layout, context, current_belief):
    # create trees with maneuvers for each agent and each of their goals
    # the low-level trajectory generator is also called here to generate all trajectories
    trees = create_trees_maneuvers_and_trajectories(layout, context, current_belief)

    # Compute parts of the evaluation which don't change during the algorithm
    collisions = compute_all_trajectories_collisions(trees)
    distance_violations = compute_all_trajectories_distance_violations(trees)
    reached_goals = check_goals_for_all_trajectories(trees)

    # Set the initial weights of the trajectories to 1
    initialise_trajectories_weights(trees)

    # budget can be iterations based, time based or more advanced convergence based stop
    while within_budget:
        # sample goals for all agents according to our belief
        all_agents_goals = sample_goals(current_belief)

        # One iteration of Iterative Best Response
        for agent in context.agents:
            # Recompute the weights of each trajectory for the agent
            trees.update_trajectories_weights(agent, all_agents_goals, collisions,
                                              distance_violations, reached_goals)
    return trees
Note that in the above case, the line that performed UCB is removed, and the statistics of the maneuvers are no longer tracked. Instead, the algorithm simply updates the weights of all trajectories on every iteration given all trajectories of the other agents.
Figure 8 shows an example tree data structure as generated by the low-level trajectory generator as part of each of the algorithms described above for a simple case where there are two maneuvers and four goals for a single agent. Each goal is associated with a different tree, with each tree having a set of trajectories as branches, as shown by the lines in Figure 8, and with the trajectories grouped into maneuvers as shown by the bounding boxes. This data structure is repeated for each agent in the scene. Each agent has several trees, one per goal. The same initial state, maneuvers and trajectories are shared between the trees but the trajectory weights are different for each tree (as represented by lines of different thickness representing the trajectories). Similarly, the maneuver value captured by the statistics in the case of MAD-UCB is different for each tree (as represented by the thickness of the box border around maneuvers).
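By way of illustration, one way this hierarchy of goals, maneuvers and trajectories could be represented in code is sketched below; the class and field names are illustrative assumptions rather than the data structure actually used.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class ManeuverNode:
    trajectories: List[object]   # sampled trajectories for this maneuver
    weights: List[float]         # one weight per trajectory (initialised to 1.0)
    value: float = 0.0           # aggregate maneuver value Q(m) (MAD-UCB only)
    visits: int = 0              # number of times the maneuver was selected (MAD-UCB only)

@dataclass
class GoalTree:
    goal: object                 # the goal this tree corresponds to
    maneuvers: Dict[str, ManeuverNode] = field(default_factory=dict)

@dataclass
class AgentTrees:
    agent_id: int
    trees: Dict[object, GoalTree] = field(default_factory=dict)   # one tree per goal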
The set of goals available to an agent is needed for a Bayesian inference step, described below, to estimate each agent’s goals. MAD-UCB and BIBeR both require goal estimation for the trajectory value estimation (note the I() function above, which considers whether a given trajectory reaches a given goal). If this part is removed, the goals, and the estimation of a probability distribution over them, can simply be ignored. While goals are sampled in both algorithms above, sampling of goals is a matter of choice; it may be important to sample goals for certain scenarios where the agents have multiple choices, e.g. change lane vs stay in current lane. MAD-UCB also uses a maneuver sampling step, as described above. BIBeR differs from the MAD-UCB implementation in that BIBeR uses a bundle of trajectories, which could cover all maneuvers, but without requiring the trajectories to be categorised into particular maneuvers. In the present example, as described above, the predictor is conditioned on goals and
maneuvers, and so the hierarchy of goals, maneuvers and trajectories is used in both cases. However, BIBeR does not sample maneuvers when computing the iterative weight update.
Figure 9 shows a one-dimensional example of a distribution over parameters used for generating trajectories (902), the equivalent samples-based representation (904), as determined by the prediction component 204, and the resulting distributions (906) after running one iteration of the IBR algorithm by the adaptation component 206. In reality, the distributions are multi-dimensional and also conditioned on goals and maneuvers.
The algorithms described above determine the initial set of trajectories and maneuvers for a given agent based on a current belief about the goals of the agent. As mentioned above with reference to Figure 7, and as described in the pseudocode of the planning algorithms above, these beliefs are updated at each planning iteration based on the output of the previous planning iteration as well as new observations. These belief updates are performed using a Bayesian inference framework. A method of inferring goals using Bayes rule will now be described.
Assuming an observed sequence of states s for times t = 1:k−1, and that a new observation has just been received at time t = k, the goal posterior is computed using the following update rule for one of the agents:

$P(g^j_{t_k} \mid s_{t_{1:k}}) \propto P(s_{t_k} \mid s_{t_{1:k-1}}, g^j_{t_{k-1}}) \, P(g^j_{t_{k-1}} \mid s_{t_{1:k-1}})$

where g are the hypothesized goals of the agent, s are states of the agent and t is time. The first term on the right, $P(s_{t_k} \mid s_{t_{1:k-1}}, g^j_{t_{k-1}})$, is the likelihood function, which describes how likely it is to observe the last observation at time t = k given the history over t = 1:k−1 and the goal j at t = k−1. (Note that the index j here refers to the goal, and not to the agent as above.) This requires a planning method to produce the space of possible observations at time t = k. This space is infinite, so it is estimated with a set of sampled possible observations given the previously planned trajectories.
The second term in this equation, $P(g^j_{t_{k-1}} \mid s_{t_{1:k-1}})$, is a prior distribution for the given update step, which is taken either from the posterior distribution from the previous update call or from a prior distribution over the hypothesized goals when k = 0.
A simplification of the above update rule can be applied in which only the last observed state $s_{t_k}$ at t = k is used instead of the full history:

$P(g^j_{t_k} \mid s_{t_{1:k}}) \propto P(s_{t_k} \mid g^j_{t_{k-1}}) \, P(g^j_{t_{k-1}} \mid s_{t_{1:k-1}})$
This update rule is called recursively on each state in the history if a longer history is available. In essence, the prior value (second term on the right) distils the information given the history up to t = k−1, since it is the posterior at t = k−1. At t = k−1, the prior is the posterior at t = k−2, and so on.
The first term on the right in the previous equation, $P(s_{t_k} \mid g^j_{t_{k-1}})$, is the likelihood, which is estimated using the planned trajectories from the previous timestep. Each maneuver m in the tree has a bundle of trajectories associated with it. To simplify the calculation of the likelihood, these bundles can be reduced to one trajectory per maneuver and goal by either selecting the most likely trajectory given a goal and a maneuver according to the planner output, or selecting the trajectory that is closest to the observation for the given goal and maneuver.

Given the selected trajectories for each goal and maneuver, the matching time t = k in the trajectory is determined and the pose is retrieved for that time: $\hat{x}_{t_k}$, $\hat{y}_{t_k}$ and $\hat{\theta}_{t_k}$. $P(m, g^j_{t_{k-1}})$ is simply set to the probability of the selected trajectory for the maneuver m and goal $g^j_{t_{k-1}}$ computed by the behavioural planner component. As mentioned above, the planner component 106 is configured to output an updated distribution over trajectories for each goal and maneuver, i.e. it is set to the normalised probability of the selected trajectory for that given goal and maneuver according to $P^{(q)}(T_{i,n})$ at the final iteration q. The likelihood is computed as shown in the following equation:

$P(s_{t_k} \mid g^j_{t_{k-1}}) \approx \sum_{m} P(m, g^j_{t_{k-1}}) \; \mathcal{N}(x_{t_k} \mid \hat{x}_{t_k}, \sigma_x^2) \, \mathcal{N}(y_{t_k} \mid \hat{y}_{t_k}, \sigma_y^2) \, \mathcal{N}(\theta_{t_k} \mid \hat{\theta}_{t_k}, \sigma_\theta^2)$
Alternatively, a single trajectory may be selected for all available goals and maneuvers. A Normal distribution can be used to capture similarity between the trajectories generated during planning at the previous timestep, for which the poses $\hat{x}$, $\hat{y}$, $\hat{\theta}$ are retrieved, and the observed pose. Note that while the poses are taken from a planned trajectory that was generated at a previous timestep, the poses themselves correspond to the timestep k of the current observation. The yaw similarity can be computed using a normal distribution with 0 mean and the circular difference between the observed yaw and the planned yaw. x and y should be computed in the agent frame of reference. The σ parameters for each of these can be different and should be carefully tuned, since they can impact the magnitude of the belief update on every step. The σ parameters can be determined empirically by running a search over a large set of simulated scenarios and observing the results of the scenarios for different values of σ.
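A minimal sketch of this simplified goal-belief update with the Gaussian likelihood described above is given below; the selection of one representative planned pose per goal, the function names and the default σ values are illustrative assumptions.

import math

def gaussian(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def circular_diff(a, b):
    # smallest signed angular difference between two yaws (radians)
    return math.atan2(math.sin(a - b), math.cos(a - b))

def update_goal_belief(prior, observed_pose, planned_poses, sigma_xy=0.05, sigma_yaw=0.1):
    # prior: {goal: probability}; planned_poses: {goal: (x_hat, y_hat, yaw_hat)}
    # observed_pose: (x, y, yaw) at the current timestep, in the agent frame of reference
    x, y, yaw = observed_pose
    posterior = {}
    for goal, p in prior.items():
        x_hat, y_hat, yaw_hat = planned_poses[goal]
        likelihood = (gaussian(x, x_hat, sigma_xy)
                      * gaussian(y, y_hat, sigma_xy)
                      * gaussian(circular_diff(yaw, yaw_hat), 0.0, sigma_yaw))
        posterior[goal] = likelihood * p
    total = sum(posterior.values())
    if total > 0.0:
        return {g: v / total for g, v in posterior.items()}
    return dict(prior)  # keep the prior if the observation matches no hypothesis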
An agent’s goal may change over time, so an optional forgetting step can be added after the posterior calculation to allow the prediction system to adapt more quickly to such a goal change:

$P'(g^j_{t_k}) = (1 - \epsilon) \, P(g^j_{t_k} \mid s_{t_{1:k}}) + \epsilon \, U(g_{t_k})$

Here the output from the previous calculation is mixed with a probability taken from a Uniform distribution $U(g_{t_k})$ over all hypothesized goals at t = k. The amount of forgetting can be modelled by tuning the mixing parameter $\epsilon$.
Another important aspect regarding the goals is that the set of hypothesized goals can change in three ways when there isn’t a 1:1 connection in the layout graph between g at time t = k and g at t = k−1:
(a) The goal is no longer achievable. In this case the mass is distributed equally to all remaining goals.
(b) The goal turns into multiple children. In this case the mass of the parent goal is distributed equally to the children goals.
(c) (Edge case) New goals that are not connected to any of the previous goals could be added to the set of goals if the agent is not following the graph as we expect it to. In this case, a portion of the mass from previous existing goals is shared with the new
goals (this is achieved by mixing the previous posterior distribution with a uniform distribution over the new set of goals).
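The three cases above could be handled along the following lines; this is a sketch in which the uniform redistribution helper and the mixing weight used for newly added goals are illustrative assumptions.

def remap_goal_belief(old_belief, new_goals, children_of, new_goal_mix=0.1):
    # old_belief: {old_goal: probability}; new_goals: list of goals at the current timestep
    # children_of: {old_goal: [new goals connected to it in the layout graph]}
    new_belief = {g: 0.0 for g in new_goals}
    lost_mass = 0.0
    for old_goal, p in old_belief.items():
        children = [g for g in children_of.get(old_goal, []) if g in new_belief]
        if not children:
            lost_mass += p                      # (a) goal no longer achievable
        else:
            for child in children:              # (b) goal splits into multiple children
                new_belief[child] += p / len(children)
    if lost_mass > 0.0 and new_belief:
        for g in new_belief:                    # (a) redistribute the lost mass equally
            new_belief[g] += lost_mass / len(new_belief)
    # (c) new goals not connected to any previous goal: mix with a uniform distribution
    connected = {g for children in children_of.values() for g in children}
    if any(g not in connected for g in new_goals):
        uniform = 1.0 / len(new_belief)
        new_belief = {g: (1.0 - new_goal_mix) * v + new_goal_mix * uniform
                      for g, v in new_belief.items()}
    return new_belief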
Initial experiments (further details of which are included below) have shown that the multi-agent planners are more assertive in interactive scenarios when the other agents are reasonable (both conservative and aggressive within reason). When the other agents are tuned to be extremely aggressive (e.g. to close the gap at all costs such that the ego cannot merge into a neighbouring lane), the multi-agent planners are able to complete some scenarios but some of these end in collision, while the single-agent planner completes very few scenarios but remains safe. Tables 3, 6 and 9 below show a summary of the performance of BIBeR on such a scenario and an ablation of BIBeR which doesn’t update the policy of the other agents during planning (i.e. a single-agent variant, BIBeR-SA). A possible reason for this is that the model used for the other agent during planning doesn’t match observed reality. The proposed solution is a monitoring system which assesses the confidence in how well the Prediction + Multi-agent Planner (PMP) system (i.e. the system described above with reference to Figure 2) matches reality versus a Prediction + Single-agent Planner (PSP) (i.e. a conventional prediction and planning framework in which the predicted trajectories for other agents are not updated during planning) and automatically moves towards one or the other mode. This behaviour balances assertiveness, when the certainty is high that the other agents have been modelled well by the multi-agent planner, against conservativeness, when it is uncertain that the modelling fits reality.
Figure 10 shows how the multi-agent planner can modify the marginal prediction (policy) of one agent. The black solid bar represents the prediction mean (corresponding to the ‘best’ trajectory according to the marginal distribution), the dashed line represents the most probable (i.e. ‘best’) trajectory after planning according to the joint prediction, and the grey bar represents the observed trajectory. Computing a measure of distance between each of these can be used to inform this confidence estimation. The idea of applying a confidence estimation to planning an ego trajectory is to adapt the ego planner according to its confidence that other agents are behaving as expected by the multi-agent planning output (i.e. the planned distribution output by the adaptation component 206). When other agents are observed as closer to the behaviour predicted by the adaptation component 206 than to the behaviour expected based on the predicted distribution generated by the prediction
component 204, the ego agent can more confidently apply the multi-agent interactive approach to that agent. By contrast, if the agent behaves closer to the predicted behaviour before the update by the adaptation component 206, the confidence in the multi-agent planning approach is lower for that agent.
This evaluation of the confidence is performed by the confidence estimation component 208 by comparing an observed state of the agent with a trajectory selected from each of the predicted and planned distributions at an earlier timestep corresponding to that observed state.
The confidence estimation is incorporated into the update of the prediction distribution by modifying the update rule of Equation 1 to the following update rule:
A separate Pc is computed for each agent, though an agent index is not shown for the sake of readability. When Pc is equal to 1, the algorithm is the multi-agent variant. When it is equal to 0, the predicted marginal distribution is used as is, i.e. it is a single-agent variant (note that Pc must always equal 1.0 when agent i is the ego agent). Pc, also referred to herein as a planner confidence score or a planner confidence value, is the confidence value that is estimated at runtime and that moves the algorithm between the two variants.
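The precise form of the modified update rule is not reproduced here. One form that satisfies the two limiting cases just described (a full multi-agent update when Pc = 1, and an unchanged marginal when Pc = 0) is to raise the evaluation factor of Equation 1 to the power Pc; this is an assumption for the purposes of the sketch below rather than the specific rule used.

def confidence_weighted_update(weights, evaluations, pc):
    # weights: current trajectory weights w^q for one agent
    # evaluations: evaluation factors y^q for the same trajectories (assumed in [0, 1])
    # pc: planner confidence score for this agent in [0, 1]
    # pc = 1 recovers the full multi-agent update; pc = 0 leaves the weights unchanged,
    # so the predicted marginal distribution is used as is.
    return [w * (y ** pc) for w, y in zip(weights, evaluations)]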
This estimation of the planner confidence value Pc is performed by directly re-purposing the Bayesian inference procedure described above for inferring agent goals as part of the multi-agent planning technique. In place of a probability assigned to a given goal for an agent, the value that is updated at each confidence belief update step is the planner confidence value for a given agent, according to the following update rule:

$P(\mathrm{PMP} \mid s_{t_k}) \propto P(s_{t_k} \mid \mathrm{PMP}) \, P(\mathrm{PMP} \mid s_{t_{k-1}})$
where $P(\mathrm{PMP} \mid s_{t_k})$ is a probability that the multi-agent planner output is the ‘correct’ model for the given agent, given the observed state $s_t$ for that agent at a timestep t = k, which is used to determine a confidence score in the multi-agent planner for that agent. Note that the timestep k here refers to the point in time corresponding to the observed agent state, and is distinct from the iteration index of the weight update above. As with the goal belief update described above, the first term on the right of the above equation, $P(s_{t_k} \mid \mathrm{PMP})$, is a likelihood, which can be estimated based on the observed state of the agent and the previous output of the planner. A probability of the single-agent planner is also computed at each update step according to the same formula. The values are computed as probabilities by normalising the result of the above update rule such that the two values (for the single-agent and multi-agent planner, respectively) sum to 1.
To compute the likelihood for each planner, a best trajectory is selected according to each of the multi-agent planner output (i.e. the joint distribution output by the adaptation component 206) and the single-agent planner output (i.e. the marginal distribution output by the prediction component 204), and a pose $\hat{x}_{t_k}$, $\hat{y}_{t_k}$, $\hat{\theta}_{t_k}$ is determined from the selected trajectory for each planner. The overall likelihood for the given planner (either the multi-agent planner or the single-agent planner) is computed as follows:

$P(s_{t_k} \mid \text{planner}) \approx \mathcal{N}(x_{t_k} \mid \hat{x}_{t_k}, \sigma^2) \times \mathcal{N}(y_{t_k} \mid \hat{y}_{t_k}, \sigma^2) \times \mathcal{N}(\theta_{t_k} \mid \hat{\theta}_{t_k}, \sigma^2)$
Alternatively, a ‘best’ trajectory can be selected for each goal and maneuver, and the resulting likelihood for the given planner is computed by summing the above expression over all maneuvers and goals, weighted by the probability of each:

$P(s_{t_k} \mid \text{planner}) \approx \sum_{m,g} P(m, g) \; \mathcal{N}(x_{t_k} \mid \hat{x}^{m,g}_{t_k}, \sigma^2) \, \mathcal{N}(y_{t_k} \mid \hat{y}^{m,g}_{t_k}, \sigma^2) \, \mathcal{N}(\theta_{t_k} \mid \hat{\theta}^{m,g}_{t_k}, \sigma^2)$
As in the case of the goal belief update above, the probability P(m, g) is the probability of the selected trajectory according to the distribution output by the given planner.
The likelihood computed for each planner is multiplied by the posterior probability for that planner at the previous step, which acts as a prior at the current step, to determine an updated posterior probability for that planner. In one example, the prior probability for the multi-agent planner is set to 0.7. The initial prior for the planner confidence can be an empirically determined value. It should be noted that the result of the multiplication of the previous prior with the above-computed likelihood is normalised to result in a probability for each of the planners, such that the total probability for both planners sums to 1. The probability of the multi-agent planner is used to determine the confidence score for the updates of the trajectory weights at the next iteration of the planner. As with the goal belief calculation above, a forgetting step can be included in this case to prevent the system from collapsing into one of the two modes (i.e. entirely single-agent or entirely multi-agent planning).
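By way of illustration, a minimal sketch of this planner-confidence update is given below, reusing the gaussian and circular_diff helpers from the goal-belief sketch above; the default σ, the forgetting weight and the 0.5 mixing target are illustrative assumptions.

def update_planner_confidence(prior_pmp, observed_pose, pmp_pose, psp_pose,
                              sigma=0.05, forget=0.05):
    # prior_pmp: previous posterior probability of the multi-agent planner (PMP);
    # the single-agent planner (PSP) has probability 1 - prior_pmp.
    # pmp_pose / psp_pose: (x_hat, y_hat, yaw_hat) of the best trajectory selected from
    # the planned (joint) and predicted (marginal) distributions at the previous iteration.
    x, y, yaw = observed_pose

    def likelihood(pose):
        x_hat, y_hat, yaw_hat = pose
        return (gaussian(x, x_hat, sigma) * gaussian(y, y_hat, sigma)
                * gaussian(circular_diff(yaw, yaw_hat), 0.0, sigma))

    pmp = likelihood(pmp_pose) * prior_pmp
    psp = likelihood(psp_pose) * (1.0 - prior_pmp)
    total = pmp + psp
    pc = pmp / total if total > 0.0 else prior_pmp
    # optional forgetting step to prevent collapsing into a single mode
    return (1.0 - forget) * pc + forget * 0.5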
Instead of goals, the Bayesian inference over planners computes a posterior over PMP and PSP that highlights which of these two systems is the most likely match to the behaviour of an agent observed in the real world. The inference is performed for each agent independently as it was for the goal estimation. To incorporate confidence estimation and goal inference into the planner 106, the algorithms described above are implemented with an initial step of updating the current belief and updating the current planner confidence for each planning iteration (corresponding to step S702 of Figure 7).
It should be noted that the planner is limited to searching within the predicted marginal distribution and is limited by the number of samples afforded. As a result, the planner (and confidence estimation) is limited in the corrections that it can perform and is still susceptible to tail events given the current distribution type used (Gaussian).
Once an updated distribution over trajectories is determined according to one of the algorithms described above, a planned trajectory can be selected for the ego for the current planning iteration based on the generated distribution, for example by selecting the trajectory corresponding to the highest probability, or, as described above for the MAD-UCB algorithm, lower confidence bounds can be used to conservatively select a best trajectory according to the distribution over possible trajectories for the ego vehicle.
Once a trajectory is selected, this can be passed to a controller 108 of the autonomous vehicle stack in order to implement the planned trajectory via control actions. As mentioned above, the planner may directly output parameters such as steering angle and acceleration values which define the planned trajectory, which can be implemented by the controller. Alternatively the planner may output the trajectory as a set of future positions, orientations and velocities of the selected trajectory, where the controller 108 is configured to map these future states to a set of control actions which can drive the vehicle to implement those future states. Provided that there is a mapping between a trajectory and a set of signals interpretable by a controller, this enables the planner to communicate with the controller so as to implement the planned trajectory in the ego agent. Although the planner generates a trajectory as a sequence of states extending a certain time period into the future, only a portion of this trajectory ends up being followed by the ego agent before the planner is called again, and a new planning iteration is run to generate a new trajectory, based on the updated state of the ego agent as well as the updated states of all other agents of the scenario.
The methods above can be first evaluated in simulation, on synthetic scenarios, or scenarios selected from driving/research datasets, and on a controlled test track before being evaluated on public routes. The methods described above are intended to address a number of interactive scenarios, including scenarios in which the ego must: change lane in a highway or urban area, merge in a highway or urban area, react to another vehicle changing lane in a highway or urban area, react to another vehicle merging in a highway or urban area, enter a roundabout, negotiate a four-way junction (in the USA) or negotiate an unsignaled junction (in Europe). The experiments and results described below are focussed on four of these scenarios: change lane, merge, reaction to lane change performed by another agent, and reaction to merge performed by another agent. The experiments are also limited to highway situations. However, the methods described above can be extended to other types of scenarios in other driving contexts.
Figure 11 shows a combination of the merge and lane change scenario. For example, the ego agent could be agent 1012 in Figure 11, wherein it aims to do a merge, or either of agents 1004 or 1006, where it aims to do a lane change. Alternatively, the ego agent could be any of the agents shown, where it aims to react to a lane change or merge executed by another agent.
The change lane and merge maneuvers in straight-road scenarios are simple to develop given the relatively small number of parameters that need to be considered. The parameters in these situations are the following:
The number of agents;
The positioning of each agent relative to the ego;
The states of each agent (particularly the velocity of each agent);
Their goals/types or behaviour choreography.
The positioning of the agents shown in Figure 11 can be determined using the longitudinal distance between each agent and the ego agent. This also determines the ego position relative to the gap, or the size of the gap. The state of the agents is the most challenging parameter. Initial scenarios can be defined where agents mostly maintain their current velocity except for the case where another agent cuts in in front of them. The reaction of the agent to a cut-in can be determined based on a predefined agent model. Examples of agent models that may be used are:
- Fixed choreography, i.e. either a fixed motion profile to be executed regardless of the other agents, or a fixed reaction such as ‘always slow down’, where the agent slows down in a predefined, specific way.
- An intelligent driver model (IDM) with tuneable parameters, including when to consider an agent as being the forward agent (a sketch of this model is given below).
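For reference, a minimal sketch of the standard intelligent driver model acceleration rule is given below; the parameter values are typical defaults rather than the values used in the experiments, and the logic for deciding which agent is treated as the forward agent is omitted.

import math

def idm_acceleration(v, v_lead, gap, v0=30.0, T=1.5, a_max=1.5, b=2.0, s0=2.0, delta=4.0):
    # v: own speed (m/s), v_lead: speed of the forward agent, gap: bumper-to-bumper gap (m)
    # v0: desired speed, T: desired time headway, a_max: maximum acceleration,
    # b: comfortable deceleration, s0: minimum gap, delta: acceleration exponent
    dv = v - v_lead
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / max(gap, 1e-3)) ** 2)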
The experiments described herein are run on synthetic scenarios in simulation and evaluate the high-level planner described above (BIBeR) independently, where the prediction and behaviour of other agents is controlled.
Progress is measured via metrics that capture the desired objectives of the methods described above. These objectives include (a) creating a more assertive ego behaviour in the above scenarios, (b) improving the estimation of the other agents’ behaviour (in particular in situations where ego actions influence them), (c) improving the response to the other agents’ behaviours, (d) maintaining or improving safety guarantees. The metrics for capturing objectives (a) and (d) include:
- Metrics capturing collisions (e.g. as a percentage)
- Metrics capturing distance to other agents (e.g. minimum distance averages over the run and averaged over all runs afterwards)
- Metrics capturing goals (e.g. percentage of scenarios where the ego reaches the goal, time to reach the goal averaged over all runs, etc.)
- Metrics capturing decisiveness (e.g. plan fluctuations from one timestep to another averaged over the run and averaged over all runs afterwards).
These metrics are not mutually exclusive given that the evaluation is closed-loop. For example, an ‘average distance’ metric may be ‘better’ (i.e. agents may be further apart on average) in scenarios where there are collisions, since a collision could terminate the scenario earlier, while a lack of collision could mean that the agents stayed closer together for longer and reached the end safely. Worse performance on a given metric should be interpreted according to the scenario. For example, decisiveness can be impacted by the need to react against more aggressive other agents in order to reach the goal. Any individual metric is not enough to evaluate the overall performance of the ego vehicle within a scenario.
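As an illustration of how such closed-loop metrics might be aggregated over a batch of simulated runs, a short sketch is given below; the per-run record fields are illustrative assumptions.

def aggregate_metrics(runs):
    # runs: list of dicts, one per simulated scenario instance, e.g.
    # {"collided": bool, "reached_goal": bool, "time_to_goal": float or None,
    #  "min_distances": [minimum distance to any other agent at each timestep]}
    n = len(runs)
    completed = [r for r in runs if r["reached_goal"]]
    return {
        "collision_pct": 100.0 * sum(r["collided"] for r in runs) / n,
        "goal_pct": 100.0 * len(completed) / n,
        "mean_time_to_goal": (sum(r["time_to_goal"] for r in completed) / len(completed)
                              if completed else None),
        # minimum distance averaged over each run, then averaged over all runs
        "avg_min_distance": sum(sum(r["min_distances"]) / len(r["min_distances"])
                                for r in runs) / n,
    }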
If the performance improves on these metrics, then it is expected that the other objectives are implicitly satisfied. However, it would be best to have a method of determining this with certainty. When applied to extending the operational design domain (ODD), the approach should involve generating a large number of scenarios covering the particular ODD extension; for example, to extend to ‘gap creation’ scenarios, variations of the ego merging between vehicles on a highway can be generated. An assumption that other agents behave well can be relaxed by replacing them with agents that have a varied set of intentions (e.g. allowing the ego to merge, not allowing it, or ignoring the ego altogether) and evaluating with the same metrics. Furthermore, the Bayesian inference output can be evaluated independently by checking the probabilities assigned to certain goals or preferences.
Objectives (b) and (c) are the most challenging to evaluate. Evaluating objective (c) (improving the response to the other agents’ behaviours) would require access to a gold standard or at least a desired behaviour/maneuver that the ego agent should execute. Any maneuver/trajectory distance based metric can be used to compare the executed maneuver/trajectory with the desired one. Objective (b) (improving the estimation of the other agents’ behaviour) requires evaluating the internals of the planner and in particular the policies for the other agents that the planner has converged to. These can be compared to the executed ones. Given the challenges of evaluating these two objectives, the evaluations described below focus on the other objectives, only performing these additional evaluations if necessary.
Finally, there is a preference to maintain an acceptable level of comfort. For this, metrics are used that capture the value or variation of jerk of the output trajectories as well as that of the executed plans.
A set of experiments to evaluate the high-level planner described above, and the results of said experiments, will now be described. This is an independent evaluation of the high-level planner by isolating planner performance metrics from other variables. This is achieved in simulation for the gap creation scenario, in which the ego vehicle needs to change lane in between two agents (for example, where the ego vehicle is agent 1006, and intends to change to the right hand lane). In this example scenario, the following parameters can be varied:
• Size of the gap; There are 3 options that experiments are run with: a large easy gap, a medium gap and a narrow gap.
• Ego starting position: In each experiment, the space between the two agents was discretised in 100 positions and one simulation was run for each longitudinal position as a starting point for ego (obviously ego being in the adjacent lane).
• Behaviour of rear agent in the gap; There are three variations that have been tried: fixed traces, slowly closing the gap, aggressively closing the gap. The latter two are achieved with an IDM model that ignores the ego agent and tries to reduce the gap to 0, while tuning the parameters that decide how quickly the IDM achieves this.
• Trigger for slowing down of the rear agent of the gap depending on ego behaviour. To simulate ‘worst case’ scenarios, the rear agent can be configured to only stop closing the gap and keep a distance to ego once the ego is fully merged in the lane. In reality any reasonable agent would slow down considerably earlier.
To isolate the high-level planner from the predictor, the motion profile prediction module is replaced with a constant velocity prediction. Gaussian noise was added to simulate uncertainty of prediction. The ego motion is also predicted with the same approach as for the other agents. The other agents are always predicted to stay on their lane, while ego can choose between change left or right and follow lane. Each behaviour is represented via 100 sampled trajectories from this motion profile prediction which means that possible ego behaviour is chosen from 300 trajectories (100 trajectories for each of ‘change left’, ‘follow lane’, and ‘change right’), and the possible behaviour of the other agents is represented with 100 trajectories (since each other agent is assumed to follow their current lane). The same trajectory generation is performed for all high-level planners tested.
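The constant velocity prediction with Gaussian noise could be sampled as sketched below; the noise scale, horizon and time step are illustrative assumptions rather than the values used in the experiments.

import numpy as np

def sample_constant_velocity_trajectories(x, y, vx, vy, n_samples=100,
                                          horizon=5.0, dt=0.1,
                                          vel_noise_std=0.5, rng=None):
    # Returns an array of shape (n_samples, n_steps, 2) of sampled (x, y) positions;
    # each sample perturbs the current velocity with Gaussian noise and rolls it out.
    rng = rng or np.random.default_rng()
    n_steps = int(horizon / dt)
    times = np.arange(1, n_steps + 1) * dt
    noisy_vx = vx + rng.normal(0.0, vel_noise_std, size=n_samples)
    noisy_vy = vy + rng.normal(0.0, vel_noise_std, size=n_samples)
    xs = x + noisy_vx[:, None] * times[None, :]
    ys = y + noisy_vy[:, None] * times[None, :]
    return np.stack([xs, ys], axis=-1)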
Given that the type of behaviour of the other agents is fixed, goal inference has no impact on these experiments.
For the confidence estimation described above, the prior belief that the multi-agent planner (PMP) approach matches the observed behaviour is set to 0.7. The parameter σ for x and y used in the likelihood estimation is set to 0.05. These parameters were manually tuned on a subset of possible scenarios.
In this study, the following ablations of BIBeR have been compared:
1. BIBeR as described above, including confidence estimation, and with parameters as previously described. This is referred to as BIBeR-C and is the proposed approach.
2. BIBeR without updating the policies of the other agents; This is achieved by setting Pc used in confidence estimation to 0.0 for all agents except for ego. This is a
single-agent planner that doesn’t consider the impact of the ego policy on the other agents’ policies. This is referred to as BIBeR-SA.
3. BIBeR for multi-agent planning without the confidence estimation. This is equivalent to setting Pc to 1.0 for all agents, and it is a pure multi-agent planner approach. This is referred to as BIBeR.
4. Similar to the method described in 3., with the exception that the ego’s policy is the last to be updated. As described above, the order of the updates represents priority: the earlier in the list an agent is, the less priority it has, as it needs to accommodate the other agents. By moving the ego last, we are imitating a planner with a conditional prediction method in which the prediction takes into account the ego’s policy. This is referred to as BIBeR-R, as the order of updates is reversed.

A series of gap-creation scenarios having different gap sizes and non-ego behaviours as described above, and the resulting behaviour of the ego vehicle according to each of the above planner implementations, will now be described. The evaluation metrics for the results shown in the tables below include a percentage of scenarios that were completed (i.e. reached the ego goal), the time taken to reach the ego goal, a percentage of scenarios in which collisions occurred, fluctuations in speed, fluctuations in path, and an average minimum distance maintained between agents of the scenario (averaged over all the scenario instances run in simulation for the given planner).
Table 1. Results of planner evaluation for small gap and fixed agent traces.
Small gap
For fixed traces for other (non-ego) agents, the results of each planner are shown in Table 1.
The single-agent planner (BIBeR-SA) does not attempt to merge.
The multi-agent planner without confidence (BIBeR) performs best, as it only merges in the situations that it has the space to, given the current limited vocabulary of the constant velocity motion profile with Gaussian noise.
The multi-agent planner with confidence estimation (BIBeR-C) is between the two, as the gap is too small and it is risky to enter given the lack of signal received from the other agent (fixed traces). The time to goal is higher than the planner without confidence estimation, which indicates that the ego agent is more careful in joining the target lane, and this is at the expense of increasing path fluctuations needed to get in position.
The multi-agent with reverse updates acts dangerously and results in collisions even with fixed traces behaviour for the other agents. This can also be seen via the time to goal metric, which highlights that the ego agent is very aggressive in joining the target lane when this planner is used. In other words, when the ego agent is prioritised above all other agents in the planner, this results in dangerously aggressive behaviour from the ego agent for this scenario.
Table 2 shows the results of each planner for a small gap scenario when the rear agent intends to slowly close the gap.
Table 2: Results of planner evaluation for small gap where rear agent slowly closes gap.
The single-agent planner (BIBeR-SA) performed similarly poorly to the fixed traces case, except for a few attempts to merge after the other agent has closed the gap.
BIBeR adjusts to the slightly more aggressive behaviour of the other agent, but does so quite late, exhibiting quite a few fluctuations and reducing the distance to the other agent too much.
BIBeR-C noticeably improves the behaviour of the multi-agent planner in both terms of fluctuations and distance keeping. This highlights the benefits of the confidence estimation in deciding quicker to act conservatively when the other agent is acting unexpectedly.
The multi-agent planner with reverse updates (BIBeR-R) acts dangerously and results in even more collisions than with fixed traces. In this case, the minimum distance metric is higher given the shorter traces ending in collision and the averaging over-weighting the initial states. Note that the time to goal has decreased, as the agent is only able to merge in the simpler cases, which are also the cases which can be completed quickly.
Table 3 shows the results of the planner evaluation for a small gap, where the rear agent intends to quickly close the gap.
Table 3: Results of planner evaluation for small gap where rear agent quickly closes gap.
The single agent planner has a similarly poor performance to the above two experiments, except for even more attempts to merge after the other agent has closed the gap.
As before, the difference between the two multi-agent planners highlights the benefits of the confidence estimation when the other agent acts unexpectedly, in that the confidence estimation maintains a greater distance and has fewer fluctuations than the version without confidence estimation.
The multi-agent with reverse updates acts dangerously and results in even more collisions than before. Similarly, the minimum distance metric is misleading given the effect of the averaging (since many scenarios terminate early due to collision).
Medium Gap
Table 4 shows the planner evaluation results for a medium gap, where the non-ego agents have fixed traces.
The results are in line with the results of the small gap experiment (Table 1), except for the time to goal metric in which the confidence estimation has helped slightly at the cost of increasing fluctuations slightly in order to get in a safer position.
Table 5 shows the results of planner evaluation for a medium gap, where the rear agent is slowly closing the gap.
Table 5: Results of planner evaluation for medium gap and agent slowly closing gap.
The results are similar to the small gap experiment (Table 2). However, note that the effect of the confidence estimation is reduced as the gap is increased but the other agent is not as aggressive. The single agent planner manages to solve some of the scenarios, however it does so by merging behind the other agent (note the large time to completion). This is not the intended behaviour, yet the current goal metric is configured to count this as a completion.
An alternative goal metric may be defined for future experiments that distinguishes goals
reached in the intended way (in this case, where the ego vehicle merges into the neighbouring lane between the front and rear agents).
Table 6 shows the results of the planner evaluation for a medium gap where the agent is quickly closing the gap (i.e. behaving aggressively).
Table 6: Results of planner evaluation for medium gap and agent quickly closing gap.
The results are similar to the small gap experiment. Different to the less aggressive agents for medium gap, here the confidence estimation is balancing safety and assertiveness more evidently. It is increasing the distance to other agents and reducing fluctuations indicating that it chose to avoid merging in front of an extremely aggressive other agent. It also prevents a few collisions that the multi-agent planner without confidence couldn’t.
Large gap
Table 7 shows the planner evaluation for scenarios with a large gap and fixed agent traces.
This is a very simple scenario in which all planners have comparable performance. In this case, even with the reverse order planner, in which the ego vehicle behaves relatively aggressively, no collisions occur, since the size of the gap is sufficient that a merge can always be safely performed. Table 8 shows the planner evaluation for scenarios with a large gap, where the rear agent is slowly closing the gap.
The single-agent planner can no longer finish the scenario at even the slightest aggressive hint from the other agent, despite the sufficient space available. The low distance that the SA agent keeps to the other agents is due to it staying side by side with the rear agent that is slowly closing the gap. The multi-agent planners show comparable performance in these simple scenarios. Table 9 shows the planner evaluation for scenarios with a large gap, where the rear agent is quickly closing the gap.
The results follow a similar pattern. The only interesting point to highlight here is the fact that the reverse update method results in collisions even when the ego has enough space to merge safely. This highlights the high and incorrect confidence of this method in the other agents giving way to the ego. Note the low time to completion for the reverse update method, which also indicates its incorrect aggressiveness.
In summary, the results show that the single-agent planner (BIBeR-SA) guarantees safety in terms of collision, but is extremely conservative, taking longer to complete scenarios and being unable to complete some scenarios even when given a large gap and where other agents behave conservatively. The multi-agent planner (BIBeR) is more assertive and adapts to more scenarios, though it resulted in some collisions where the other agents behaved aggressively. It has more fluctuations as it attempts to do lane changes and adapts to the other agent. Since it is quite confident in its estimation of the other agents’ behaviour, it completes scenarios quickly. The BIBeR-C planner, using the confidence estimation techniques described above, has similar performance to BIBeR without confidence estimation, but reduces collisions and generally maintains larger distances to other agents. BIBeR-R results in the most dangerous driving behaviour, based on the incorrect assumption made by the planner that other agents will give way to the ego agent.
It will be appreciated that the term “stack” encompasses software, but can also encompass hardware. A stack may be deployed on a real-world vehicle, where it processes sensor data from the on-board sensors, and controls the vehicle’s motion via the actor system 112.
References herein to agents, vehicles, robots and the like include real-world entities but also simulated entities. The techniques described herein have application on a real-world vehicle, but also in simulation-based autonomous vehicle testing. For example, the stack 100 may be used to plan ego trajectories/plans to be used as a benchmark for other AV stacks. In simulation, software of the stack may be tested on a “generic” off-board computer system, before it is eventually uploaded to an on-board computer system of a physical vehicle. However, in “hardware-in-the-loop” testing, the testing may extend to underlying hardware of the vehicle itself. For example, the stack software may be run on the on-board computer system (or a replica thereof) that is coupled to the simulator for the purpose of testing. In this context, the stack under test extends to the underlying computer hardware of the vehicle. As
another example, certain functions of the stack (e.g. perception functions) may be implemented in dedicated hardware. In a simulation context, hardware-in-the loop testing could involve feeding synthetic sensor data to dedicated hardware perception components.
References herein to components, functions, modules and the like, denote functional components of a computer system which may be implemented at the hardware level in various ways. A computer system comprises execution hardware which may be configured to execute the method/algorithmic steps disclosed herein and/or to implement a model trained using the present techniques. The term execution hardware encompasses any form/combination of hardware configured to execute the relevant method/algorithmic steps. The execution hardware may take the form of one or more processors, which may be programmable or non-programmable, or a combination of programmable and nonprogrammable hardware may be used. Examples of suitable programmable processors include general purpose processors based on an instruction set architecture, such as CPUs, GPUs/accelerator processors etc. Such general-purpose processors typically execute computer readable instructions held in memory coupled to or internal to the processor and carry out the relevant steps in accordance with those instructions. Other forms of programmable processors include field programmable gate arrays (FPGAs) having a circuit configuration programmable through circuit description code. Examples of nonprogrammable processors include application specific integrated circuits (ASICs). Code, instructions etc. may be stored as appropriate on transitory or non-transitory media (examples of the latter including solid state, magnetic and optical storage device(s) and the like). The subsystems 102-108 of the runtime stack of Figure 1 may be implemented in programmable or dedicated processor(s), or a combination of both, on-board a vehicle or in an off-board computer system in the context of testing and the like.
Claims
1. A computer-implemented method of determining an estimated optimal trajectory for an ego agent in the presence of one or more further agents, the method comprising, at each of a plurality of planning iterations: receiving an observed state of each agent; for each agent: generating a predicted distribution over trajectories; and performing an adaptation of the respective predicted distribution over trajectories to generate a planned distribution over trajectories, wherein the adaptation is based on a planner confidence score for the agent and a trajectory evaluation score for the agent; and selecting an estimated optimal trajectory from the planned distribution over trajectories for the ego agent; wherein the trajectory evaluation score for each agent is based on the interaction of each of a set of possible trajectories of that agent with one or more possible trajectories of each other agent; and wherein the planner confidence score for each of the one or more further agents is based on a comparison of the observed state of the agent with a previous estimated optimal trajectory of the agent determined by the planner at a previous planning iteration.
2. A method according to claim 1, wherein the planner confidence score for each of the one or more further agents is further based on a comparison of the observed state of the agent with a previous predicted trajectory selected from the predicted distribution over trajectories at a previous planning iteration.
3. A method according to claim 1 or 2, wherein the predicted distribution over trajectories for each agent is generated based on a belief about the goals of each agent.
4. A method according to claim 3, further comprising updating the current belief about the goals of each agent based on the observed state of each agent.
5. A method according to claim 3 or 4, wherein the belief about the goals of each agent is represented by a probability distribution over possible goals for that agent.
6. A method according to claim 3, wherein the trajectory evaluation score is computed based on the interaction of each of a set of possible trajectories of that agent with a set of possible trajectories of each other agent, the set of possible trajectories for each agent being sampled based on the current belief about the goals of that agent.
7. A method according to claim 5, wherein the probability distribution over possible goals for each agent is updated by performing Bayesian inference based on a previous probability distribution over the goals of that agent and the observed state of the agent.
8. A method according to claim 7, wherein the Bayesian inference comprises determining an updated probability value for each goal by: computing a likelihood term by comparing the observed state of the agent to the estimated optimal trajectory generated by the planner for that goal at a previous planning iteration; and multiplying the likelihood term by a previous probability value for that goal to generate a posterior probability value for that goal.
9. A method according to any preceding claim, wherein the observed state of the agent comprises one or more of: a position of the agent, an orientation of the agent, and a velocity of the agent.
10. A method according to any preceding claim, wherein the comparison of the observed state of the agent with the previous estimated optimal trajectory and/or the previous predicted trajectory is based on a comparison of the observed state of the agent with a closest state along the respective trajectory, the closest state being the state along the trajectory that is closest in time to the observed state.
11. A method according to claim 10, wherein the comparison of the observed state of the agent with the closest state along the respective trajectory comprises computing a probability
of at least one variable of the observed state according to a Gaussian distribution centred on the value of the corresponding variable of the closest state along the respective trajectory.
12. A method according to claim 11, wherein the result of the comparison is used to determine a likelihood term, and wherein the planner confidence score is determined based on a multiplication of the likelihood term by a prior planner confidence value.
13. A method according to claim 12, wherein, for the first planning iteration of the plurality of planning iterations, the prior planner confidence value is a predefined constant value.
14. A method according to claim 12 or 13, wherein, for the second and subsequent planning iterations of the plurality of planning iterations, the prior planner confidence value is based on the planner confidence score from the previous planning iteration.
15. A method according to any preceding claim, wherein the planner confidence score for the ego agent is set to a constant value.
16. A method according to any preceding claim, wherein the predicted distribution over trajectories for each agent is generated by a trajectory generator as a bundle of weighted trajectories.
17. A method according to claim 16 wherein the bundle of weighted trajectories is generated by a hybrid trajectory generator comprising a data-driven motion profile prediction component and a classical trajectory generation component.
18. A method according to claim 17, wherein the classical trajectory generation component generates the trajectories by applying a pure pursuit algorithm to the output of the data-driven motion profile prediction component.
19. A method according to any of claims 16 to 18, wherein a different respective bundle of weighted trajectories is generated for each possible goal of a predetermined set of goals for each agent.
20. A method according to claim 19, wherein each bundle of weighted trajectories comprises one or more subsets of weighted trajectories, each subset associated with a respective maneuver.
21. A method according to claim 16 or any claim dependent thereon, wherein, for each agent, the adaptation is performed by applying learned weights to the predicted distribution over trajectories.
22. A method according to claim 21, wherein the weights are learned by iteratively multiplying an initial set of weights by an evaluation factor, each weight of the initial set of weights associated with a different respective trajectory; wherein the evaluation factor is based on an interaction of that trajectory with each trajectory of a set of possible trajectories of the other agents.
23. A method according to claim 22, wherein the evaluation factor at each step of the iterative multiplication is computed based on the weights of each agent, and at least one of: a number of collisions between that trajectory and one of the set of possible trajectories of the other agents; a number of distance violations between that trajectory and one of the set of possible trajectories of the other agents; whether the trajectory reaches a possible goal of the agent.
24. A method according to claim 22, wherein the set of possible trajectories of the other agents comprises all trajectories for each other agent for each possible goal of that agent and each possible maneuver of that agent.
25. A method according to claim 22, wherein the set of possible trajectories of the other agents are sampled from a subset of trajectories associated with the agent, the subset of trajectories determined by sampling a set of possible goals of the other agents according to a current belief about goals.
26. A method according to any preceding claim, wherein the weights are learned by applying an Iterative Best Response algorithm to an initial set of weights.
27. A method according to claim 26, wherein the adaptation comprises sampling a set of maneuvers, and wherein the set of possible trajectories of the other agents are sampled from the subset of the trajectories associated with the sampled set of maneuvers.
28. A method according to claim 27, wherein the sampling of the maneuvers is performed using Upper Confidence Bounds (UCB).
29. A method according to any preceding claim, wherein the previous estimated optimal trajectory is the most probable trajectory according to the planned distribution over trajectories at the previous planning iteration.
30. A method according to claim 2 or any claim dependent thereon, wherein the previous predicted trajectory is selected as the most probable trajectory according to the predicted distribution over trajectories at the previous planning iteration.
31. A method according to any preceding claim, wherein the estimated optimal trajectory for the ego agent is selected as the most probable trajectory of the planned distribution over trajectories for the ego agent.
32. A method according to any preceding claim, comprising outputting the selected estimated optimal trajectory for the ego agent to a controller to perform control actions to effect the estimated optimal trajectory.
33. A method according to claim 23, wherein the iterative multiplication comprises updating the weights for each agent in turn, wherein, for a selected agent, the evaluation factor at each step is computed based on the updated weights of each agent that was updated before the selected agent, and the previous weights of each agent that is due to be updated after the selected agent.
34. A method according to claim 33, wherein the order of updating the agents is determined based on a priority of the agents in the scene.
35. A computer-implemented method of estimating an optimal trajectory for an ego vehicle agent in the presence of one or more further vehicle agents, comprising, at each of a plurality of planning iterations: receiving an observed state of each vehicle agent; generating, for each vehicle agent, a predicted distribution over trajectories; performing, for each vehicle agent, an adaptation of the respective predicted distribution over trajectories to generate a planned distribution over trajectories for the current planning iteration, wherein the adaptation is based on a trajectory evaluation score for the agent; and selecting a trajectory from the planned distribution over trajectories for the ego agent;
wherein the trajectory evaluation score for each agent is based on the interaction of each of a set of possible trajectories of that agent with one or more possible trajectories of each other agent.
36. A method according to claim 35, wherein the predicted distribution over trajectories is generated based on a scene context, the scene context comprising a road layout, and a belief about the goals of each of the ego vehicle agent and the one or more further vehicle agents.
37. A method according to claim 36, comprising updating, at each planning iteration, the belief about the goals of each of the ego vehicle agent and the one or more further vehicle agents based on the observed state of each vehicle agent.
38. A method according to any of claims 35 to 37, wherein the adaptation of the predicted distribution over trajectories comprises performing an iterative update of a set of weights to be applied to the predicted distribution over trajectories; wherein at each step of the iterative weight update, the weights of each agent are updated in turn, and wherein the order in which the weight updates are applied is based on a priority of the agents within a road layout in which the vehicle agents are driving.
39. A computer system comprising one or more processors configured to implement the method of any preceding claim.
40. A computer program for programming a computer system to implement the method of any of claims 1 to 38.
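The following is a minimal, hypothetical sketch of the planning loop set out in claims 35 to 38: predict a distribution over trajectories for each agent, adapt it into a planned distribution using an interaction-aware trajectory evaluation score, then select a trajectory for the ego agent. The function names `predict_distribution` and `adapt_distribution` are placeholders and do not correspond to any particular implementation disclosed above.

```python
from dataclasses import dataclass

@dataclass
class AgentDistribution:
    """A discrete distribution over candidate trajectories for one agent."""
    trajectories: list      # each trajectory could be, e.g., a list of (x, y, t) states
    probabilities: list     # same length as trajectories, sums to 1

def plan_step(observed_states, scene_context, predict_distribution, adapt_distribution, ego_id):
    """One planning iteration in the style of claim 35 (illustrative only)."""
    # 1. Predict a distribution over trajectories for every agent, conditioned on its
    #    observed state and the scene context (road layout, beliefs about goals).
    predicted = {aid: predict_distribution(state, scene_context)
                 for aid, state in observed_states.items()}

    # 2. Adapt the predicted distributions into planned distributions using a
    #    trajectory evaluation score that accounts for inter-agent interactions.
    planned = adapt_distribution(predicted)

    # 3. Select a trajectory for the ego agent, e.g. the most probable one (cf. claim 31).
    ego = planned[ego_id]
    best = max(range(len(ego.trajectories)), key=lambda i: ego.probabilities[i])
    return ego.trajectories[best], planned
```

In a receding-horizon setting, `plan_step` would be invoked once per planning iteration, with the goal beliefs in the scene context updated from the newly observed states (cf. claim 37).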
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| GB2313287.1 | 2023-08-31 | ||
| GBGB2313287.1A GB202313287D0 (en) | 2023-08-31 | 2023-08-31 | Motion planning |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025046092A1 true WO2025046092A1 (en) | 2025-03-06 |
Family
ID=88296712
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/EP2024/074324 Pending WO2025046092A1 (en) | 2023-08-31 | 2024-08-30 | Motion planning |
Country Status (2)
| Country | Link |
|---|---|
| GB (1) | GB202313287D0 (en) |
| WO (1) | WO2025046092A1 (en) |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020079066A1 (en) | 2018-10-16 | 2020-04-23 | Five AI Limited | Autonomous vehicle planning and prediction |
| WO2021073781A1 (en) | 2019-10-16 | 2021-04-22 | Five AI Limited | Prediction and planning for mobile robots |
| US20210309214A1 (en) * | 2020-04-06 | 2021-10-07 | Toyota Jidosha Kabushiki Kaisha | Vehicle controller, method, and computer program for controlling vehicle |
| US20220169244A1 (en) * | 2020-12-01 | 2022-06-02 | Waymo Llc | Multi-modal multi-agent trajectory prediction |
| WO2023062166A2 (en) | 2021-10-14 | 2023-04-20 | Five AI Limited | Support tools for autonomous vehicles |
| WO2023135271A1 (en) | 2022-01-13 | 2023-07-20 | Five AI Limited | Motion prediction and trajectory generation for mobile agents |
- 2023-08-31: GB application GBGB2313287.1A (published as GB202313287D0), legal status: Ceased (not active)
- 2024-08-30: PCT application PCT/EP2024/074324 (published as WO2025046092A1), legal status: Pending (active)
Patent Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2020079066A1 (en) | 2018-10-16 | 2020-04-23 | Five AI Limited | Autonomous vehicle planning and prediction |
| WO2021073781A1 (en) | 2019-10-16 | 2021-04-22 | Five AI Limited | Prediction and planning for mobile robots |
| US20230042431A1 (en) * | 2019-10-16 | 2023-02-09 | Five AI Limited | Prediction and planning for mobile robots |
| US20210309214A1 (en) * | 2020-04-06 | 2021-10-07 | Toyota Jidosha Kabushiki Kaisha | Vehicle controller, method, and computer program for controlling vehicle |
| US20220169244A1 (en) * | 2020-12-01 | 2022-06-02 | Waymo Llc | Multi-modal multi-agent trajectory prediction |
| WO2023062166A2 (en) | 2021-10-14 | 2023-04-20 | Five AI Limited | Support tools for autonomous vehicles |
| WO2023135271A1 (en) | 2022-01-13 | 2023-07-20 | Five AI Limited | Motion prediction and trajectory generation for mobile agents |
Non-Patent Citations (1)
| Title |
|---|
| ANTONELLO, MORRIS, ET AL.: "Flash: Fast and light motion prediction for autonomous driving with Bayesian inverse planning and learned motion profiles", 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), IEEE, 2022 |
Also Published As
| Publication number | Publication date |
|---|---|
| GB202313287D0 (en) | 2023-10-18 |
Similar Documents
| Publication | Title |
|---|---|
| JP7532615B2 | Planning for autonomous vehicles |
| Ding et al. | Epsilon: An efficient planning system for automated vehicles in highly interactive environments |
| Lefkopoulos et al. | Interaction-aware motion prediction for autonomous driving: A multiple model kalman filtering scheme |
| US11643106B2 | Movement prediction of pedestrians useful for autonomous driving |
| EP4150465B1 | Simulation in autonomous driving |
| Schulz et al. | Multiple model unscented kalman filtering in dynamic bayesian networks for intention estimation and trajectory prediction |
| Lefevre et al. | A learning-based framework for velocity control in autonomous driving |
| Ward et al. | Probabilistic model for interaction aware planning in merge scenarios |
| Li et al. | POMDP motion planning algorithm based on multi-modal driving intention |
| EP4330107B1 | Motion planning |
| US20250145178A1 | Trajectory generation for mobile agents |
| Pouya et al. | Expandable-partially observable Markov decision-process framework for modeling and analysis of autonomous vehicle behavior |
| DeCastro et al. | Counterexample-guided safety contracts for autonomous driving |
| Althoff et al. | Safety assessment of driving behavior in multi-lane traffic for autonomous vehicles |
| Hong et al. | Knowledge distillation-based edge-decision hierarchies for interactive behavior-aware planning in autonomous driving system |
| Tuncali | Search-based test generation for automated driving systems: From perception to control logic |
| WO2025046092A1 (en) | Motion planning |
| WO2025093753A1 | Simulation-based testing for robotic systems |
| Huang | Safe intention-aware maneuvering of autonomous vehicles |
| CN119590447B | Automatic driving decision-making method, device, computer equipment, readable storage medium and program product |
| Font et al. | Investigating the effects of roadside cover on safe speeds for autonomous driving in high-risk deer-vehicle collision areas |
| Bernhard | Risk-constrained interactive planning for balancing safety and efficiency of autonomous vehicles |
| US20250153731A1 | Context aware optimization of planner for autonomous driving |
| US20250115254A1 | Hybrid motion planner for autonomous vehicles |
| Ghebreslasie | Chance Constrained Deep Reinforcement Learning for Autonomous Vehicle Behavioral Planning |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 24765116; Country of ref document: EP; Kind code of ref document: A1 |