WO2024175684A1

WO2024175684A1 - Offline object tracking

Info

Publication number: WO2024175684A1
Application number: PCT/EP2024/054465
Authority: WO
Inventors: Andrew Lawson
Original assignee: Five AI Ltd
Current assignee: Five AI Ltd
Priority date: 2023-02-24
Filing date: 2024-02-22
Publication date: 2024-08-29
Anticipated expiration: 2025-08-24
Also published as: GB202302671D0

Abstract

A computer-implemented method of tracking objects in a multi-object scenario based on a set of observations generated over multiple frames 0,..., K, the method performed by an offline object tracker in a sequence of processing steps 0,... K based on a pruning window of size n, and comprising: in the initial 0,..., n — 1 processing steps, determining a set of data association hypotheses for frames 0,..., n — 1, each data association hypotheses comprising, for each frame 0,..., n — 1, a frame association hypothesis, the frame association hypothesis being a set of associations between one or more observations for frame k and a set of track identifiers, in the subsequent processing step n, performing operations of: determining an updated set of data association hypotheses for frames 0,..., n, calculating an updated probability of each updated data association hypothesis, identifying a most probable data association hypothesis at step n based on the updated set of data association hypotheses and their updated probabilities; selecting a final frame association hypothesis for the frame 0 based on the most probable data association hypothesis selected in step n; generating a tracking output external to the offline object tracker for the frame 0 based on the final frame association hypothesis for the frame 0, as selected based on the most probable data association hypothesis identified at step n; wherein, in each subsequent processing step k, the offline object tracker repeats said operations for frames 0,... k, with the updated set of data association hypotheses restricted to account for every final frame association hypothesis selected in the previous time steps, resulting in the selection of a final frame association hypothesis for the frame k — n based on the most probable data association hypothesis identified at step k, and the generation in step k of a tracking output external to the offline object tracker for the frame k — n.

Description

Offline Object Tracking Technical Field [0001] The present disclosure pertains to systems and methods for tracking objects in an offline (non-real-time processing) context. Applications of offline tracking include generating high-quality ‘pseudo-groundtruth’ for use in testing real-world performance of robotic systems and components for the same. Background [0002] There have been major and rapid developments in the field of autonomous vehicles. An autonomous vehicle (AV) is a vehicle which is equipped with sensors to perceive and interpret its physical environment and an online stack which enable it to operate without a human controlling its behaviour, at least in certain circumstances. Such sensors include for example cameras, radar and lidar. The online stack includes perception capabilities that enable the AV to interpret the output of its sensors, such as a perception system that includes object detectors and trackers (among others). [0003] An autonomous vehicle may be fully autonomous (in that it is designed to operate with no human supervision or intervention, at least in certain circumstances) or semi- autonomous. Semi-autonomous systems require varying levels of human oversight and intervention. For example, an Advanced Driver Assist System (ADAS) and certain levels of Autonomous Driving System (ADS) may be classed as semi-autonomous. Other mobile robots are being developed, for example for carrying freight supplies in internal and external industrial zones. Such mobile robots would have no people on board and belong to a class of mobile robot termed UAV (unmanned autonomous vehicle). Autonomous air mobile robots (drones) are also being developed. [0004] There are different facets to testing the behaviour of the sensors and control systems aboard a particular autonomous vehicle, or a type of autonomous vehicle. Real-world testing may be conducted under the supervision of a test driver in a physical AV. The AV operated autonomously but the test driver can intervene at any time. To demonstrate an acceptable level of safety, a very large number of driven ‘test miles’ are required to adequately assess driving performance, which would not be feasible through real-world testing alone. Therefore, an increasing emphasis is based on testing of autonomous driving systems in simulation environments. [0005] Nevertheless, the collection and processing of real-world driving data for testing purposes remains an important aspect of testing. [0006] ‘Groundtruth’ is an important concept in the context of testing. For example, in a driving run involving an AV under testing (the ego vehicle), the AV’s performance may be assessed based on a set of driving rules applied to a groundtruth representation of the driving run, which is a ‘true’ representation of what happened. This may be different from the ego vehicle’s own perception of the run, e.g., due to errors or limitations in the AV’s perception system. Ground truth is inherent to simulation; a simulator computes a sequence of scenario states, which is, by definition, a perfect, authoritative representation of a simulated driving run. However, in a real-world scenario run, a “perfect” representation of the driving run does not exist. Therefore, some form of pseudo-ground truth representation is needed. Manual annotation is one possibility. However, this requires significant human effort to annotate large volumes of sensor data with sufficient accuracy. Summary [0007] Tracking is one aspect of perception considered herein. Tracking refers to the problem of determining associations between observations. Two observations are said to be associated when they are generated by the same object. A track refers to a set of mutually independent observations, which is typically assigned a unique track identifier (ID) by an object tracker. For example, each frame in a time-sequence of sensor data may be processed using an object detector, and an object tracker may be used to associate object detections (such as bounding boxes) between different frames. Tracking may be combined with other forms of processing, such as filtering applied to reduce overall uncertainty in a set of object detections belonging to a given track. Tracking errors in an online perception system (such as ‘swapped’ detections between the tracks of nearby objects) can impact driving performance. [0008] According to a first aspect herein, there is provided a computer-implemented method of tracking objects in a multi-object scenario based on a set of observations generated over multiple frames 0, … , ^^, the method performed by an offline object tracker in a sequence of processing steps 0, … ^^ based on a pruning window of size ^^, and comprising: in the initial 0, … , ^^ − 1 processing steps, determining a set of data association hypotheses for frames 0, … , ^^ − 1, each data association hypotheses comprising, for each frame 0, … , ^^ − 1, a frame association hypothesis, the frame association hypothesis being a set of associations between one or more observations for frame ^^ and a set of track identifiers, in the subsequent processing step ^^, performing operations of: determining an updated set of data association hypotheses for frames 0, … , ^^, calculating an updated probability of each updated data association hypothesis, identifying a most probable data association hypothesis at step ^^ based on the updated set of data association hypotheses and their updated probabilities; selecting a final frame association hypothesis for the frame 0 based on the most probable data association hypothesis selected in step ^^; generating a tracking output external to the offline object tracker for the frame 0 based on the final frame association hypothesis for the frame 0, as selected based on the most probable data association hypothesis identified at step ^^; wherein, in each subsequent processing step ^^, the offline object tracker repeats said operations for frames 0, … ^^, with the updated set of data association hypotheses restricted to account for every final frame association hypothesis selected in the previous time steps, resulting in the selection of a final frame association hypothesis for the frame ^^ − ^^ based on the most probable data association hypothesis identified at step ^^, and the generation in step ^^ of a tracking output external to the offline object tracker for the frame ^^ − ^^. [0009] In embodiment, the method may comprise comparing the tracking output for at least one of the frames with an online tracking output for the at least one frame, the online tracking output having been extracted by an online object tracker, and thereby identifying a tracking error in the online tracking output. [0010] The method may comprise comparing the tracking output for each frame with an online tracking output for that frame. [0011] The multiple frames may have been captured in a test scenario, in which an ego agent reacts to a plurality of other agents, and the method may comprise determining from the multiple frames an ego trace of the ego robot and a plurality of agent traces comprising or derived from the tracking output of each of at least some of the multiple frames. [0012] The ego agent may be a mobile robot, and the method may comprise inputting the ego trace and the plurality of agent traces to a test oracle. The test oracle may evaluate performance of the mobile robot in the test scenario and outputs a set of performance evaluation results. [0013] The method may comprise using the set of performance evaluation results to identify and mitigate an issue with the mobile robot. [0014] The method may comprise using the ego trace and the plurality of agent traces to create a scenario for testing performance of a simulated ego agent in a simulation environment. [0015] The multiple frames may have been captured by at least on image capture device of the ego robot. [0016] The tracking output may comprise, for each frame, one or more observations and for each observation a track identifier. [0017] The tracking output of at least one frame may comprise an observation marked as a false positive. [0018] The updated probability of each updated data association hypothesis may be calculated based on the one or more observations and a motion model. [0019] Further aspects provide a computer system comprising one or more computers configured to implement the method or any embodiment thereof, and a computer program for programming a computer system to implement the same. [0020] For a better understanding of the present disclosure, and to show how embodiments of the same may be carried into effect, reference is made by way of example only to the following figures in which: Brief Description of Figures [0021] FIG. 1 shows a schematic flow chart for a unified tracking algorithm; [0022] FIG. 2 shows a schematic flow chart for an MHT tracking algorithm; [0023] FIG.3 shows an example of MHT applied to a vehicle tracking in a lane driving scenario; [0024] FIG. 4 shows an example hypothesis tree for a vehicle in a lane driving scenario; [0025] FIG.5 shows a hypothesis tracking diagram illustrating the principles of pruning over a sequence of frames; [0026] FIG. 6 shows the pruning of the hypothesis tree of the scenario of FIG.3 at time frame ^^_^ି^; [0027] FIG. 7 shows the pruning of the hypothesis tree of the scenario of FIG.3 at time frame ^^_^; [0028] FIG. 8 shows a schematic block diagram of an offline tracking system; [0029] FIG. 9A shows an example autonomous vehicle (AV) stack; [0030] FIG. 9B shows an AV testing paradigm; [0031] FIG. 9C shows an example ground truthing pipeline; and [0032] FIG. 10 shows an example testing pipeline. Detailed Description 1. Overview [0033] The described embodiments use a novel offline (non-real time) application of multiple hypothesis tracking (MHT) to generate (pseudo-)ground truth data from a set of observations relating to a driving scenario. [0034] Reference is made herein to Lawrence D. Stone, “A Bayesian Approach to Multiple- Target Tracking”, available at http://dsp-book.narod.ru/HMDF/2379ch10.pdf (Stone), which is incorporated herein by reference in its entirety, and which describes certain principles of Bayesian tracking, including MHT. [0035] Most tracking algorithms are based on a predictable prior distribution for the state of a target, with adjustments to the prediction based on observations received. Target tracking may thus be fundamentally described as a problem of Bayesian inference. The tracking problem in the general Bayesian formulation is computationally expensive to solve. Hence most modern tracking algorithms, including multiple hypothesis tracking, are approximations of full Bayesian tracking. Most approaches solve the multiple-target tracking problem by computing a posterior distribution ^^( ^^_^, ^^_^) on the joint target state space over ^^ time steps. A popular approximation to Bayesian tracking is Kalman filtering, which requires certain assumptions such as Gaussian priors and measurements that are linear functions of the target state. Many approaches include the notions of contacts and associations and assumptions such as a fixed number of targets, identical targets, discrete state space, identical sensors, amongst others. 2. Elements of Bayesian tracking [0036] In the context of tracking a target, such as a moving vehicle, a probabilistic description of the motion of the target is referred to as the prior distribution,

. Properties of the target’s motion, such as position, may be predicted within errors at any given time from the prior distribution. A motion update is first performed based on the prediction from the prior distribution. Measurements from sensors that also track the target, such as GPS or odometry measurements, may be used to improve the prediction. These sensor measurements may be characterised by likelihood functions, ^^_^, … , ^^_^. The likelihood function for a sensor captures the probability and error distribution of that sensor’s response conditioned on the value of the target state. A posterior probability distribution on the state of the target is computed by combining the motion updated prior at a given time with the likelihood function for the observations received at that time. This posterior probability distribution is the output of a Bayesian tracker. 2.1 Target tracking recursions [0037] Consider the situation where the target motion is modelled in continuous time, but observations are received at discrete, possibly random, times. This is called continuous- discrete filtering. The single target tracking problem is defined in terms of a target state. The target state is represented by a vector of components, some of which are kinematic and include position, velocity and possibly acceleration. The state space ( ^^) must be rich enough to satisfy two conditions. Firstly, the target’s motion should be Markovian in the chosen state space i.e. the conditional probability of future states of the process, given the present state and all past states, depends only upon the present state and not on any past states. Secondly, the sensor likelihood functions should depend only on the state of the target at the time of the observation. The prior information about the target is modelled by a stochastic process { ^^( ^^); ^^ ≥ 0}, where ^^( ^^) is the (unknown) target state at time ^^. Sample paths of this process correspond to possible target paths through the state space, ^^. At a given time, the likelihood of finding a possible set of values of observations is the product of the probability density functions for each observation (if sensors are independent) or the product of the joint density functions (if sensors are correlated). The single target tracking problem is that of computing the posterior distribution (of a quantity such as position) on the target for given the observed sensor measurements. For multi-tracking, the principles are extended to a joint state space of a set of multiple targets. [0038] The posterior 104 is computed (FIG. 1, step S3) by Bayes theorem and involves integrating the likelihood functions over the state space. A basic recursion for single target tracking depends on two assumptions. Firstly, as mentioned before, the stochastic process for the target’s motion must be Markovian on the state space. Secondly, sensor responses at a given time must only depend on the target state at that time. The recursion is computed numerically. The posterior distribution is initialised to the prior distribution of the initial state. By applying a Markov assumption, a motion update is performed to update the posterior distribution from time ^^ − 1 using the prediction from the prior distribution. For each value of target state x, a likelihood function calculates the probability (density) of obtaining the measurement given the target is in state x. An information update is performed to further modify the result from the motion update by the likelihoods computed from the observations. [0039] With likelihood functions, it is straightforward to unify linear and non-linear measurements of the target state. Since all information is represented on the same state space, it can easily and correctly be combined, regardless of how disparate the sources of the information. [0040] Multiple target tracking (or unified tracking) involves generalising the recursion from single-target tracking. The total number of targets is unknown but bounded by ^^, which is known. The state ^^^ା = ^^^ା ൈ … ൈ ^^^ା is the joint target state space where the product is taken ^^, times. Similarly, to single target tracking, prior knowledge about the targets and their “movements” through the state space ^^^ା is expressed as a stochastic process ^^ = { ^^( ^^); ^^ ≥ 0}. Specifically, ^^( ^^) = ( ^^_^( ^^), … , ^^_ே( ^^)) is the state of the system at time ^^ where ^^_^( ^^) ∈ ^^^ା is the state of target ^^ at time ^^. An extended state space definition ^^^ା = ^^ ∪ ∅ is used, where ∅ denotes the empty set and ^^_^( ^^) = ∅ denotes the target ^^ is not present at time ^^. A target that is present is assigned a state vector in ^^, which can include any observable property of properties of the target, such position variable(s) (e.g. 1D, 2D or 3D position) and/or motion variable(s) (such as speed and/or acceleration in any number of dimensions). The notation ^^_^ = ^^( ^^_^) may be used where ^^_^ is a value (a particular set of target states) of a random variable ^^( ^^_^) representing the joint state space. In other words, ^^_^ is a value denoting the state of multiple targets at a given time step (the system state). [0041] In formulating the update, the stochastic process ^^ is assumed to be Markovian in the state space ^^^ା as given by Equation 10.11 of Stone:

, [0042] where ^^_^( ^^_^) is an initial prior on an initial system state ^^_^ at time ^^_^ and ^^_^( ^^_^| ^^_^ି^) is a motion prior on the system state ^^_^ at time ^^_^ given the system state ^^_^ି^ at time ^^_^ି^, which models the probability ^^ ^^ ^^ ^^ { ^^( ^^_^) = ^^_^| ^^( ^^_^ି^) = ^^_^ି^} For a discrete measure, the state space, the integral reduces to a summation over time. [0043] In the general formulation, ‘motion’ is said to occur within the state space at target states change, and the motion prior ^^_^( ^^_^| ^^_^ି^) is a function encoding prior knowledge about the expected movement of targets in the state space. In the context of object/agent tracking, motion in the state space generally corresponds to real-world motion, and a physics-based motion model may be used to implement ^^_^( ^^_^| ^^_^ି^), which probabilistically encodes expected motion characteristics of, say, particular types of agent (such as vehicles, pedestrians, cyclists, animals etc.). The motion prior ^^_^( ^^_^| ^^_^ି^) may be referred to as a motion model for conciseness. In a vehicle tracking contact, for example, the motion model can encode knowledge of vehicle dynamics (for example, vehicles do not tend to exhibit large sudden changes in lateral motion). Relationships between targets can also be considered (e.g. a particular motion by one vehicle may be impossible without colliding with another vehicle). [0044] A collection of observations ^^ over K time steps is considered, where ^^ = ^{ ^^_^, … , ^^_^ ^} and ^^_^ is the set of (one or more) observations for time step ^^. The likelihood of the collection of observations ^^ is related to the likelihood functions by an independence assumption, as per Equation 10.15 of Stone:

, [0045] where ^^_^( ^^_^| ^^_^) is the likelihood of the observations ^^_^ at time ^^_^ given the joint system state ^^_^ at time ^^_^. The notation ^^_^ = ^^_^ may be used where ^^_^ denotes the value of a random variable ^^_^ representing the possible observations at time ^^_^. Similarly, the notation _{^^ = ^^} ⁽ _^^^ ⁾ _{denotes the value ^^ of a random variable ^^} ⁽ _^^^ ⁾ _{representing the collection of} observations up to time ^^_^. [0046] The likelihood function ^^_^( ^^) is defined as per 10.13 of Stone:

[0047] In other words, the likelihood of the observations at time step ^^_^ is the probability of obtaining those observations given that the joint state space at time ^^_^ is ^^. The likelihood function can be modelled using any distribution that models factors such as sensor noise and perception errors (for example, a Gaussian error distribution or other parameterized distribution representative of sensor and perception error). [0048] In contrast to single target tracking, the likelihood functions in multiple-target- tracking can depend on the joint state of all the targets. The number of targets that may be tracked is notionally bounded by the integer ^^ (noting that any given target may be absent at any given time). This is not a true bound, because this number can be arbitrarily high; ^^ does not represent any restriction, and is introduced purely to assist in the presentation. [0049] The complete generalisation of single target tracking to multiple targets yields a unified tracking recursion defined in Algorithm 1, which is quoted from Equations 10.19 to 10.21 of Stone. For each time step ^^_^, the algorithm computes an updated posterior distribution ^^( ^^_^, ^^_^) = ^^ ^^ ^^ ^^ { ^^( ^^_^) = ^^_^| ^^( ^^_^)} (estimated probability distribution on the system state at ^^_^ given the collection of observations up to and including ^^_^).

Algorithm 1 [0050] FIG.1 shows a flow chart depicting the steps of Algorithm 1. Reference numerals 102, 103 denote, respectively, an initial joint state ^^_^ and initial prior ^^_^( ^^_^) at an initial time step ^^_^ (the initialization of Algorithm 1). The latter can encode any prior knowledge on the initial set of target states. Initially, the posterior distribution ^^( ^^_^, ^^_^) at time ^^_^ is set to ^^_^( ^^_^). [0051] At step S1 (motion update), a motion update is performed based on the posterior distribution of the previous time step the motion model ^^_^( ^^_^| ^^_^ି^). The notation ^^^ି( ^^_^, ^^_^) in Algorithm 1 denotes a motion-updated prior, taking into account the observations up to the previous time step, and the motion model ^^_^( ^^_^| ^^_^ି^) that probabilistically models motion between the previous time step and the current time step. [0052] At step S2, the likelihood function ^^_^ is determined based on the ‘new’ observations associated with time step ^^_^. As noted, an appropriate error distribution may be used to model the likelihood function. [0053] Steps S1 and S2 are shown in parallel, though they can be performed in parallel or sequentially in an order. [0054] At step S3 (information update), the updated Bayesian posterior ^^⁽ ^^_^, ^^_^ ⁾ (denoted by reference numeral 104) is computed based on the motion-updated prior of step S1 and the likelihood function of step S2. [0055] The process of FIG.1 is repeated iteratively as new observations are received, based on the new observations and the posterior distribution 104 as computed in the previous time step. [0056] An important use-case, detailed below, is referred to herein as “offline tracking”. In an offline tracking context, a tracking method is used track objects in sensor data that is not captured and processed in real time. This use case contrasts with more conventional “online” tracking, where the aim is to output updated set of objects tracking outputs as new measurements are received, often compromising accuracy for speed. By contrast, in offline tracking, accuracy is prioritised over speed, and a delay in the computation of tracking outputs is acceptable. [0057] References to ‘receiving new observations’ does not necessarily imply that observations or measurements are received in real-time. In an offline context, such ‘new’ observations may, for example, be predetermined, and retrieved from computer storage as the method progresses. 3. Multiple Hypothesis tracking [0058] The complete unified tracking recursion is computationally expensive when performed in the joint state space of all targets. The combination of the likelihood functions defined on the joint state space with the joint distribution function of the targets automatically accounts for all possible association hypotheses without requiring explicit identification of these hypotheses. Multiple hypothesis tracking (MHT) is a special case of unified tracking that uses concepts of “contacts” and “association”. The unified tracking recursion produces the same joint posterior distribution as MHT does when the conditions for MHT are satisfied. 3. 1 Associating contacts with targets [0059] A ‘contact’ is an observation comprising a 'called detection’ and a measurement. For example, a basic proximity sensor may be said to call a detection when a signal-to-noise ratio at the sensor crosses a predefined threshold (implying the presence of some object in proximity to the sensor). As another example, an object detector (e.g. applied to an image or other form of sensor data such as lidar or radar) may be said to call a detection when it generates an object bounding box or other object detection output with a detection score above some threshold. In this case, a contact is a bounding box or other object detection output that has been generated with sufficient confidence. A ‘scan’ or ‘frame’ is defined as a set of contacts ^^_^ at time ^^_^ such that each contact is associated with at most one target, and each target generates at most one contact. A scan could, for example, be a set of object detections (e.g., detected bounding boxes) generated from a sensor frame (e.g. an image or lidar/radar sweep) captured at or indexed by time t. In that case, a sequence of frames processed by an object detector gives rise to a sequence of scans. A set of contacts associated with a given target (e.g. a set of contacts generated over multiple time steps) is referred to as a ‘track’, which is typically assigned a unique track identifier (ID). Note, the terms ‘scan’ and ‘frame’ are used interchangeably herein. [0060] Note that, unless otherwise indicated, a scan, contact or observation ‘at time t’ and similar terminology means the scan, observation or contact has been generated from a set of sensor data indexed by time t (that is, ‘time t’ refers to timing of the sensor data, and not necessarily the time at which the scan/observation/contact is generated. For example, in an offline processing context, observations may be generated from timestamped sensor data at any time after the sensor data has been captured). [0061] The contacts associated with each target can be used to produce an estimate of that target’s state. In practice, there may be more than one reasonable way to associate contacts with targets. Therefore, in MHT, alternative hypotheses are formed to explain the source of the observations. Given a collection of observations over multiple scans, a data association hypothesis assigns each observation to either a target or a ‘false alarm’ (the latter characterises the observation as a false positive, not associated with any target). A data association hypothesis for the ^^th scan considers all contacts of the ^^ths scan and all contacts of the previous ^^ = 1, … , ^^ − 1 scans. [0062] It is also useful to introduce the concept of a scan association hypothesis (also referred to as a scan hypothesis herein). A scan association hypothesis ^^ pertains to a single scan, and specifies, for each contact of the scan, which target generated the contact (by assigning the contact to a track ID) or that the contact was due to a false alarm (assigning the contact to a special false alarm indicator). Any form of track identifier and false alarm indicator may be used. The following examples consider track identifiers in the form of positive integers, with zero used at the false alarm indicator. Thus, for each scan ^^_^ containing ^^_^ contacts, a scan hypothesis mapping ^^ is ^^: ^{1, … , ^^_^ ^} → {0, … , ^^} such that no two contacts are associated to the same positive number. If ^^⁽ ^^⁾ = 0, then contact m is associated to a false alarm; if ^^⁽ ^^⁾ = ^^ > 0, then contact m is associated to target ^^ (multiple contacts may be characterized as false alarms, in which case multiple targets are mapped to zero). [0063] A data association hypothesis for the ^^th scan is thus composed of ^^ + 1 scan hypotheses for the scans 0, … , ^^. If two contacts from different scans have been generated by the same target in a given hypothesis, they are assigned to the same track ID, and are said to belong to the same track. 3.2 MHT probabilities [0064] MHT involves computing the probability that the observations ^^_^ and the scan hypothesis ^^ intersect, for each scan hypothesis ^^ pertaining to the ^^th scan. Treating ^^_^ as a random variable, the ‘correctness’ of the scan hypothesis ^^ is characterized in terms of the probability of the event { ^^ ∧ ^^_^ = ^^_^} (the joint probability of obtaining the observations ^^_^ = ^^_^ in the ^^th scan and the scan hypothesis ^^ being correct). [0065] A scan association likelihood function is defined for a given scan hypothesis ^^ as follows, following Equations 10.22 and 10.23 of Stone: , [0067] where ^^ ^^{ ^^| ^^( ^^_^) = ^^_^} is a prior probability that the scan hypothesis ^^ is correct, equal to ^^ ^^^{ ^^^} assuming independence of the system state ^^_^. In other words, the scan association likelihood is defined as the joint probability of the scan hypothesis ^^ being true and obtaining the contacts ^^_^ in scan ^^ given that the system state in scan ^^ is ^^_^. Here, ^^( ^^_^) is a random variable denoting the system state (the state of all targets in the extended state space) at time ^^_^. [0068] Recall that a data association hypothesis, ℎ, is a set of scan hypotheses over a sequence of ^^ scans. Given a set of observations (contacts) obtained in those ^^ scans, a set of data association hypotheses ^^( ^^) is determined, where ℎ = { ^^_^, … ,

for each ℎ ∈ ^^( ^^). A data association likelihood for data association hypothesis ℎ is defined as per Equation 10.27 of Stone:

[0069] MHT operates recursively to compute a posterior distribution on the system state at time ^^_^ for each data association hypothesis in ^^( ^^) given that the data association hypothesis ℎ is true, as per Equation 10.28 of Stone:

[0070] . This conditional posterior distribution is combined with a probability ^^(ℎ) that the data association hypothesis is true to obtain the Bayesian posterior for that data association hypothesis, as defined in Equations 10.29 to 10.30 of Stone:

[0071] The general MHT recursion is given in Algorithm 2, which is quoted from Equations 10.35 to 10.38 of Stone.

Algorithm 2 [0072] An initial state hypothesis ℎ_^ is a state with no contacts or associations and the probability that this hypothesis is true is ^^(ℎ_^) = 1. Analogously to unified tracking, a motion update is performed for each hypothesis conditioned on the hypothesis being true (Equation 10.35). Using the scan association likelihood functions an information update is performed as given in Equation 10.37. The normalisation factor over the total probability for all hypotheses is given in Equation 10.36. This normalisation factor is a function of the association probability (probability that an association hypothesis is correct) ^^ (Equation 10.38). [0073] In practice, with up to ^^ targets, there are ^^ available track identifiers (noting N can be arbitrarily high). At ^^_^ each contact can be uniquely assigned to one of the available tracking IDs arbitrarily. No two contacts are assigned to the same track ID at this point, hence there cannot be any associations at ^^_^, Typically, the number of available tracking IDs will be greater than the number of contacts observed initially, leaving a number of ‘free’ (unallocated) tracking identifiers. As noted, a contact may alternatively be characterized as a false alarm, which means it is not assigned to any of the track IDs. [0074] In subsequent time steps, a data association hypothesis ℎ may assign a contact to: [0075] i) a track ID to which at least one earlier contact has already been assigned (in other words, assigning the contact to an existing track, and thus associating it with all previous contact(s) assigned to the same track ID), [0076] ii) an unallocated track identifier (thus starting a new object track), or [0077] iii) the false alarm indicator (effectively ignoring it). [0078] False alarms can account for ‘ghost’ detections, where a single object generates multiple contacts (e.g. two nearby bounding boxes). [0079] FIG. 2 shows a flow diagram for the steps of Algorithm 2. [0080] At step S5 (scan association hypothesis), the contacts 200 in the scan 201 received at a given time are mapped to different targets 203 or to false alarms. Every such possible mapping constitutes a scan association hypothesis 204. [0081] At step S6 (scan likelihood functions), for every hypothesis ^^, a set of likelihood functions 205 is computed based on a current set of measurement-derived observation(s) ^^_^, conditioned on the scan hypothesis ^^ being correct. [0082] Given step S6 is dependent on the output of step S5, steps S5 and S6 need to be performed sequentially. [0083] At step S7 (motion update), a motion update is performed based on an initial state 202 and a scan association hypothesis 204. A motion update is performed for each hypothesis conditioned on the hypothesis being true, and is done by applying prior distributions to initial target states in 202, resulting in a motion-updated prior. [0084] Steps S6 and S7 can be performed in parallel or sequentially in an order. [0085] At step S8 (information update), MHT Bayesian posteriors 206 are computed for every scan association hypothesis based on the motion-updated priors of step S7 and the scan likelihood functions of step S2. [0086] Given step S8 is dependent on the output of steps S6 and S7, steps S8 needs to be performed after steps S6 and S7. [0087] As with the unified tracking algorithm of FIG.1, the process of FIG.2 is repeated iteratively as new observations are received. This can be in an “offline tracking context” with, for example, observations retrieved from computer storage as the method progresses. 4.1 Example [0088] FIG.3 shows an example of MHT applied to a vehicle tracking in a lane driving scenario, over three time steps ^^_^ିଶ, ^^_^ି^, ^^_^ . First and second vehicles (the targets in this example) are shown to be travelling through a road layout with four lanes ( ^^_^, ^^_ଶ, ^^_ଷ, ^^_ସ). [0089] The four lanes ( ^^_^, ^^_ଶ, ^^_ଷ, ^^_ସ) are shown on the vertical axis and the time frames ^^_^ିଶ, ^^_^ି^ and ^^_^ are shown on the horizontal axis. The contacts are shown as boxes (note, at each time step, the contacts from all three time steps are depicted). [0090] At each time step, two contacts (such as bounding boxes) have been generated, denoted A and B respectively, with a subscript denoting the time step of the contact. In the above notation, the scans generated at each time step are denoted: ^^_^ = { ^^_^, ^^_^} ^^_^ି^ = { ^^_^ି^, ^^_^ି^} ^^_^ିଶ = { ^^_^ିଶ, ^^_^ିଶ} [0091] For simplicity, FIG. 3 considers a highly simplified scenario with only two object tracks ( ^^ = 2), with track IDs

and ^^_ଶ respectively. [0092] First consider the scan at time step ^^_^ିଶ. It is possible that contact ^^_^ିଶ belongs to track and contact ^^_^ିଶ belong to track ^^_ଶ. Alternatively, it is possible that target ^^_^ିଶ belongs to track ^^_ଶ and contact ^^_^ିଶ belong to track ^^_^. Thus, as depicted, the two scan hypotheses for this timeframe are:

[0093] Similarly, for the scan at time step , ^^_^ି^, the two scan hypotheses are: ^^_^ ^{^} ି_^ = ( ^^_^ି^, ^^_^ି^)

[0094] For the scan at time step ^^_^, the two scan hypotheses for this time frame are likewise: ^^_^ ^{^} = ( ^^_^, ^^_^) ^^_^ ^ଶ = ( ^^_^, ^^_^) [0095] Considering only the three time steps shown in FIG. 3 for simplicity, even this simplified example yields eight data association hypotheses, only two of which are depicted in the bottom part of FIG. 3. [0096] According to the first hypothesis, the ‘A’ contacts have all been generated by the same target driving along lane ^^_ସ, and the ‘B’ contacts have been generated by a second vehicle moving from lane ^^_ଷ to ^^_ଶ to ^^_^. [0097] FIG. 4 shows an example hypothesis tree for the scenario of FIG. 3 at scan ^^. The three levels in the tree represent the scans ^^ − 2, ^^ − 1, and ^^. At scan ^^ − 1, the tree splits further into two branches, denoting the hypotheses ^^_^ ^{^} ି_^ and ^^_^ ^ଶ ି_^ , at each of the ^^_^ି^ nodes. At scan ^^ the tree splits further into two branches, denoting the hypotheses ^^_^ ^{^} and ^^_^ ^ଶ. [0098] The hypothesis tree encodes the eight data association hypothesis that are possible between ^^ − 2 and ^^. Each level of the tree corresponds to a scan, and each edge represents a scan association hypothesis. Each data association hypothesis is, therefore, represented as a unique path through the tree from a root node (representing scan ^^ − 2) to one of eight leaf nodes (representing scan ^^). [0099] By way of example, a first such path is shown, representing a first data association hypothesis 402 at scan ^^: ℎ^{^} ^ = ( ^^_^ ^{^} ି_ଶ , ^^_^ ^{^} ି_^ , ^^_^ ^{^}). [0100] A second path is shown, representing a second data association hypothesis 404:

[0101] The first and second data association hypotheses 402, 404 are also depicted graphically at the bottom of FIG. 3. According to the first hypothesis 402, the ‘A’ contacts have all been generated by the same target driving along lane ^^_ସ, and the ‘B’ contacts have been generated by a second vehicle moving from lane ^^_ଷ to ^^_ଶ to ^^_^. The second hypothesis 404 flips the association at time ^^ − 2, implying a first vehicle moving from ^^_ଷ to ^^_ସ and a second vehicle moving from ^^_ସ to ^^_ଶ to ^^_^ between scans. [0102] The first and second branches shown in FIG. 4 represent complete data association hypotheses only in the special case of ^^ − 2 = 0. For later times steps, the full set of data association hypotheses would be given by the paths going all the way back to the true root note at ^^_^. With each time step, the number of data association hypotheses increases exponentially, which quickly become intractable. [0103] In the example of FIG. 4, the size of the hypothesis tree grows as a power of 2 because only two object tracks are considered. In practice, a significantly larger number of object tracks may be considered, which could result in a significantly greater number of scan hypotheses at each scan, with the tree growing even faster as a result. [0104] Therefore, the hypothesis tree is ‘pruned’ at each time step, as described below. Pruning implies fixing the scan association hypothesis for scans of greater than a threshold age. Once an earlier scan moves outside of a pruning window, its system state is fixed to the most probable system state at that point, and only the most probable scan hypothesis (that is, only the most probable set of scan-track associations) is retained for that scan. 4.1.4 Scan pruning [0105] With the MHT results from the scans at ^^ − 1 and k, it is possible to go back to the scan at ^^ − 2 and make a final decision on the chosen hypothesis ( ^^_^ ^{^} ି_ଶ or ^^_^ ^ଶ ି_ଶ ). The system state at ^^_^ିଷ can also be fixed at this point based on the final scan hypothesis. In the terminology used herein, the hypothesis tree is said to be ‘pruned’ at the ^^ − 2 level when the MHT algorithm has completed the processing of scan ^^, based on a pruning window of size 2 in this example. [0106] This might, for example, involve choosing one of the scenarios 402 and 404 in FIG. 4, or any of the other six paths through the tree of FIG. 4. Note, this incorporates knowledge of the scans ^^ − 1 and ^^ subsequent to scan ^^ − 2, and is based on the most likely data association hypothesis at scan ^^. Note that the ^^ − 1 and ^^ associations are not fixed at this point, and the most probably ^^ − 1 and ^^ associations could change in subsequent time steps as the probabilities are updated. However, those changes would have no effect on the final selections for scan ^^ − 2 (or earlier), because those have now been fixed based on the information that was available at time ^^. [0107] FIG.5 shows a hypothesis tracking diagram in tabular format, to illustrate the principles of pruning over a sequence of scans. [0108] Each column in FIG. 5 represents processing operations performed in the processing step for the scan indicated in the column header. Scans from ^^ − 5 to ^^ are considered in this example, with a pruning window of size n in this example. The notation ^^_^ is used to denote the pruning window at scan ^^. For a window size n, then at scan k, the pruning window contains the previous n scans, i.e. ^^_^ = … , ^^_^ି^}. The following description considers a window of size n=2 by way of example, by the description applies equally to other values of n. [0109] The first row indicates for each processing step the scan hypotheses from preceding time steps that are considered. The second row represents the new scan hypotheses that are introduced for the current times step, describing the possible associations for the new observations. The third shows how the most probable scan hypothesis for earlier scan can change, up to the point of pruning (described below). [0110] The fourth and fifth rows indicate how the outputs of the MHT tracking algorithm are generated, in online and offline implementations respectively. [0111] The pruning works as follows. Let ℎ_^ ^∗ denote the most probable data association hypothesis once the processing of observations ^^_^ has been completed. This takes into account all observations between scan 0 and scan k. As per Algorithm 2, the processing of the ^^_^ observations yields updated scan hypothesis probabilities ^^(ℎ) for each ℎ ∈ ^^( ^^) where ℎ_^ ^∗ = ^^ ^^ ^^ ^^ ^^ ^^_^ ^^⁽ℎ⁾. Let ℎ_^ ^∗ ,_^ି^ denote the ( ^^ − ^^)th component of ℎ_^ ^∗ (i.e. the scan hypothesis for ^^ − ^^ belonging to ℎ_^ ^∗ ); this is the most probable scan hypothesis for ^^ − ^^ at the end of the scan ^^ processing, and is final scan hypothesis selected in respect of scan ^^ − ^^. In the example of FIG. 5, the final selection for ^^ − 2 is the scan hypothesis ^^_^ ^{^} ି_ଶ . This is selected at the end of the scan ^^ processing, and ‘frozen’ from that point on. The selection is shown in the column headed ^^. [0112] In the preceding processing steps ^^ − 1, ^^ − 2, ^^ − 3, … final scan hypotheses have been selected for the scans ^^ − 3, ^^ − 4, ^^ − 5, … respectively in the same way. Note that, in this example, the most probable scan hypothesis for ^^ − 5 has changed between ^^ − 5 and ^^ − 3 (from ^^_^ ^ଶ ି_ହ initially to the final selection of ^^_^ ^{^} ି_ହ at the end of scan ^^ − 3). Similarly, the most probable scan hypothesis for ^^ − 4 has changed between ^^ − 4 and ^^ − 2 (from ^^_^ ^{^} ି_ସ initially to the final selection of ^^_^ ^ଶ ି_ହ at the end of scan ^^ − 2). For ^^ − 3, it so happens the most probably scan hypothesis at the end of scan ^^ − 1 has not changed between ^^ − 2 and ^^ − 4. [0113] Thus, at scan ^^, ^^( ^^) is restricted to only data association hypotheses containing the final scan hypotheses for ^^_^ିଷ and earlier, as depicted in the first row. [0114] Thus, it is possible to prune the scan hypotheses at time ^^_^ିଶ by using the scans from the next two frames. The window of size two (covering two scans prior to the current scan) has been chosen purely for illustration purposes and can be optimised to balance performance and computational expense. [0115] Conceptually, the process of pruning involves a moving window of scans over time frames, helps to limit the number of hypotheses over the whole tracking tree at any given time. [0116] In an online application, it is necessary to provide tracking outputs in real-time, in response to observations generated in real-time. Thus, it is necessary to provide a tracking out for, say, scan ^^ − 5 as soon as the processing of the ^^ − 5 observations has been completed. This means the output must be generated before the final selection of the ^^ − 5 scan hypothesis (because that selection only occurs at the end of the ^^ − 3 processing step). [0117] At time frame ^^_^ିହ, there are two possible hypotheses, namely ^^_^ ^{^} ି_ହ and ^^_^ ^ଶ ି_ହ . Based on the information available at that time frame, the hypothesis ^^_௧ ^ଶ ି_ହ is chosen as a basis for generating a tracking output for scan ^^_^ିହ., as depicted in the final row of FIG. 5. However, both hypotheses ^^_^ ^{^} ି_ହ and ^^_^ ^ଶ ି_ହ are still retained in memory at time frames ^^_^ିସ and ^^_^ିଷ, to allow the full range of data associations to be considered. At the end of step ^^ − 3, it is found that ^^_^ ^{^} ି_ହ is the more probable hypothesis (as a consequence of updating the data association hypothesis probabilities as per Algorithm 2). Hence, at time ^^_^ିଷ, the final scan hypothesis selection for time frame ^^_^ିହ is ^^_^ ^{^} ି_ହ . Whilst that final selection will influence future tracking outputs (recall that the final selection of a scan hypothesis is made based on the data association hypothesis probabilities taking into account all previous time steps), in the online application, it is not possible to go back and correct the ‘wrong’ association output

was provided at ^^ − 5. [0118] However, in the online application considered herein, depicted in the fourth row of FIG. 4, tracking output are delayed by an amount of time corresponding to the size of the pruning window. This means that a tracking output will not be generated for any given scan until that scan has reached the end of the pruning window. This, in turn, means that the tracking output for every scan can be generated based on the final scan association hypotheses selected for this scan on reaching the end of the pruning window. This, in FIG.5, in contracts to the online case, the tracking output for scan ^^ − 5 is only outputted once the processing of the ^^ − 3 observations has completed, and there can never be a discrepancy between the output of the object tracker and the final scan hypothesis selections. [0119] This procedure is repeated for the next three time frames, namely ^^_^ିସ, ^^_^ିଷand ^^_^ିଶ. The hypothesis ^^_^ ^{^} ି_ସ is chosen for the online output at ^^_^ିସ. However, both hypotheses ^^_^ ^{^} ି_ସ and ^^_௧ ^ଶ ି_ସ are still retained in memory at time frames ^^_^ିଷand ^^_^ିଶ. At time ^^_^ିଶ, the information, from the three time frames ^^_^ିସ, ^^_^ିଷ and ^^_^ିଶ, is used to change the initial choice of hypothesis (the tracking output in an online use case). It is found that ^^_^ ^ଶ ି_ସ is the better hypothesis and a hypothesis update is performed. Hence, at time ^^_^ିଶ, the offline decision for time frame ^^_^ିସis made. The hypothesis chosen, namely ^^_^ ^ଶ ି_ସ is retained in memory while the hypothesis ^^_^ ^{^} ି_ସ is pruned. [0120] The pruning of hypotheses is illustrated further in FIG. 6 and FIG. 7. FIG. 6 shows the hypothesis tree at time frame ^^_^ି^ The hypothesis ^^_^ ^ଶ ି_ସ is retained in memory while the hypothesis ^^_^ ^{^} ି_ସ is pruned. At ^^_^ି^, the hypothesis ^^_^ ^{^} ି_^ is chosen in the online decision based on the most likely sequence at ^^_^ି^. This hypothesis is shown circled in yellow in FIG.6. The most likely sequence is shown enclosed in green in FIG.6. At time ^^_^ି^, the information, from the three time frames ^^_^ିଷ, ^^_^ିଶand ^^_^ି^ (shown as ^^_^ି^ in FIG.5) is used to make the offline decision for time frame ^^_^ିଷ. The hypothesis ^^_^ ^{^} ି_ଷ is chosen and is shown as reference numerals 601 and 701 in FIG.6. In FIG.7., the hypothesis tree at time ^^_^ is shown. The hypothesis ^^_^ ^{^} ି_ଷ is retained in memory while the hypothesis ^^_^ ^ଶ ି_ଷ is pruned. Again, the information, from the three time frames ^^_^ିଶ, ^^_^ି^ and ^^_^(shown as ^^_^ in FIG.5) is used to make the offline decision for time frame ^^_^ିଶ. The hypothesis ^^_^ ^{^} ି_ଶ is chosen and is shown by reference numeral 701 in FIG.7. At ^^_^, the hypothesis ^^_^ ^ଶ is chosen in the online decision based on the most likely sequence at ^^_^. [0121] Although, in principle, scan association hypotheses all the way back to time ^^_^ are considered at every time step, in practice an object track may be discarded after an appropriate interval to prevent the accumulation of 'stale' data. For example, if no new observations have been added to an object track for a certain period of time, the object track may be purged from memory. [0122] The principle of such a multi-frame tracker is to only output tracks when sufficient future information has been received to make the best association decision. The idea is that future data can influence how past data is interpreted and can change incorrect past associations. The use of multiple frame associations can potentially give better associations. Associations made by the multiple frame tracker may be passed downstream, for example to a path interpolator for smoothing. With an offline MHT tracker, these associations may be generated with an arbitrarily high time lag without affecting elements of online tracking. [0123] FIG. 8 shows a schematic block diagram of an offline tracking system 800, shown to comprise a processor 802 coupled to a memory 804. An offline object tracker 806 and has access to the memory 804. Typically, the object tracker 806 would only have access to a certain region or regions of the memory 804 allocated by an Operating System (not shown). Observations are stored in a first external storage region (or regions). Such observations may have been generated ‘online’, e.g. on-board a vehicle 810 equipped with sensor(s) and an online perception system 812 (the latter e.g. comprising an object detector), or by an offline perception system 814, or using a combination of both types of processing. The observations are time-stamped, and streamed from external storage to the offline object tracker 806 per scan. Observations can be streamed in order of time, or in reverse order of time for a ‘backwards pass’ of the data. In a backwards pass of the data, index ^^ = 0 now denotes the most recent scan, and as ^^ is incremented, increasing older scans are considered. [0124] Note that a ‘tracking output’ in the present context means an external output generated by an object tracker, such as an output passed to a separate processing component or to written to persistent storage in a form that is consumable by a separate processing component. Tracking output therefore excludes purely internal state of the tracker, which would typically be held in a region of processor memory assigned to the object tracker and not be accessible in any useable form to other processing components. Examples of tracking outputs include those depicted in the bottom two rows of FIG. 5 (“OUTPUT OFFLINE”, and “OUTPUT ONLINE”). [0125] FIG. 8 shows an offline processing component 816 executed on the processor 802. This is an external processing component that consumes external object tracking outputs generated by the offline object tracker 806. As depicted, the external tracking outputs comprise the final object associations ^^_^ି^ିଶ, ^^_^ି^ି^ , ^^_^ି^, delayed relative to the inputs

to the object tracker 806 for the reasons described above. Alternatively or additionally, those external object tracker outputs may be written to a second region (or regions) of persistent external storage 819 for later use. [0126] Reference numeral 818 denotes a set of purely internal state data of the object tracker 806 that is held in the memory 804, which is not an external tracking output. The internal state data 818 is temporary in nature, and would include the information contained in the hypothesis trees of FIGS. 7 and 8 at the applicable time steps, which is pruned as new scans are processed. Among other things, this internal state data 818 would include, at the applicable time steps, the top row of FIG. 5 (“HYPOTHESES IN MEMORY”). In particular, the internal state data 818 can include ‘candidate’ scan hypotheses that are retained in the memory 804 whilst the are still within the pruning window, but which are not selected on reaching the end of the pruning window, and which are therefore never provided as external tracking outputs in an offline implementation. [0127] The offline object tracker 806 is implemented in the form of code (computer-readable instructions) contained in the memory 804 and executed on the processor 802). [0128] In performing the above processing with a window size of ^^, no pruning occurs in the first ^^ − 1 processing steps, and no external outputs are generated in those steps. This is because every scan up to that point is still within the pruning window, and no scan association hypotheses have been finalized. [0129] In practice, the tracker may have an API which allows an external component (or user) to read off the set of object tracks at any given time, which can facilitate both online and offline tracking. The user provides the desired time as an input to the tracker API, and the object tracker 806 returns the final set of object tracks for that time as an external output. So, e.g., if the user requests a set of object tracks at a current time, the tracker 806 provides the tracks at that time (online use case). If they request a time in the past, the tracker 806 reads the relevant set of object tracks from an earlier point in the tree (offline use case). If the user requests a time that was "in-between" two scans, the tracker 806 interpolates between two time-points in the tree and provides an interpolated tracker output. [0130] Processing step ^^ is the first point at which pruning occurs. At the end of step ^^ (once the data association hypotheses from 0, … , ^^ have been determined, and their probabilities have been determined), a final scan association hypothesis is selected for the first scan ^^ = 0. This is the first point at which an external tracking output is generated (for scan 0). Thereafter, only the final scan association hypothesis for scan 0 is retained, as described above. The processing then continues as per FIG. 5 until the observations from all K scans have been processed. [0131] The novelty of the offline implementation lies in the novel use of the trees generated by MHT. The output of MHT is a set of track trees whose "root node" is for the scan "k-n". Usually runtime applications read associations from the leaves of the tree; whereas, in the offline use case, the tracker 806 reads the association of the root node as being the correct one. [0132] Note that, in the offline implementation, K is not, in fact, the very end of the run because, at the end of the run, it is not possible to read from the ( ^^ − ^^)^{௧^} scan up until the end. Let N be the total length of the run. The remaining associations read from the root at N-n forward to the leaves, so for the Nth scan the leaves may be read in the same way as a runtime (online) application. [0133] By contrast, in an online use-case, the final hypothesis selected on pruning would not be provided as an external output, and would only persist in the internal tracker state for use in generating tracker outputs for subsequent states. For example, referring to FIG. 5 in the online use case, the external tracker output in step ^^_^ିହ would be ^^_^ ^ଶ ି_ହ , which is not the scan hypothesis ^^_^ ^{^} ି_ହ that is finally selected for scan ^^ − 5 at the end of the ^^_^ିଷ processing step; in the online application, the selection of ^^_^ ^{^} ି_ହ would be purely internal,

would not be provided as an external output. In an online case, the "real-time" data has already been used, so there's no way to go back and "correct" the mistake. Applications of offline tracking [0134] The offline tracking described herein can, among other things, be used to support the development of autonomous vehicle stacks and other robotic systems. [0135] Figure 9A shows a highly schematic block diagram of an AV runtime stack 900. The stack 900 may be fully or semi-autonomous. For example, the stack 900 may operate as an Autonomous Driving System (ADS) or Advanced Driver Assist System (ADAS). [0136] The run time stack 900 is shown to comprise a perception system 902, a prediction system 904, a planning system (planner) 906 and a control system (controller) 908. [0137] In a real-world context, the perception system 902 receives sensor outputs from an on- board sensor system 910 of the AV, and uses those sensor outputs to detect external agents and measure their physical state, such as their position, velocity, acceleration etc. The on- board sensor system 910 can take different forms but generally comprises a variety of sensors such as image capture devices (cameras/optical sensors), lidar and/or radar unit(s), satellite- positioning sensor(s) (GPS etc.), motion/inertial sensor(s) (accelerometers, gyroscopes etc.) etc. The onboard sensor system 910 thus provides rich sensor data from which it is possible to extract detailed information about the surrounding environment, and the state of the AV and any external actors (vehicles, pedestrians, cyclists etc.) within that environment. The sensor outputs typically comprise sensor data of multiple sensor modalities such as stereo images from one or more stereo optical sensors, lidar, radar etc. Sensor data of multiple sensor modalities may be combined using filters, fusion components etc. [0138] The perception system 902 typically comprises multiple perception components which co-operate to interpret the sensor outputs and thereby provide perception outputs to the prediction system 904. [0139] The perception system 904 within the stack might implement online object tracking. For example, online tracking may be applied to bounding boxes or other object detection outputs generated at runtime within the perception system 904. Detecting online perception errors [0140] Another application of offline tracking is the evaluation of online tracker performance. In this context, object tracks generated in the online perception system 904 via online tracking may be compared with corresponding object tracks generated using offline object tracking. Online tracking errors (such as wrongly ‘swapped’ objects) may, in turn, be detected by identifying discrepancies in the online tracks relative to the offline tracks. [0141] In this case, an offline tracking output generated for frame k (delayed by n frames) may be compared with an online tracking output generated for the same frame k, to identify any error in the online tracking output. [0142] By identifying tracking errors, issues with the online tracker and/or related issues within the perception system 904 or wider stack 900 can be identified and mitigated. [0143] By way of example, our co-pending PCT Application No. PCT/EP2022/065509, incorporated herein by reference in its entirety, discloses a perception performance evaluation tool and graphical user interface for identifying and visualizing perception errors using pseudo-ground truth from a ground truthing pipeline as a baseline. The described offline tracking techniques may be used in that context in generating the pseudo-ground truth. Evaluating driving performance [0144] In a simulation-based testing context, depending on the nature of the testing – and depending, in particular, on where the stack 900 is “sliced” for the purpose of testing (see below) – it may or may not be necessary to model the on-board sensor system 900. With higher-level slicing, simulated sensor data is not required therefore complex sensor modelling is not required. [0145] The perception outputs from the perception system 902 are used by the prediction system 904 to predict future behaviour of external actors (agents), such as other vehicles in the vicinity of the AV. [0146] Predictions computed by the prediction system 904 are provided to the planner 906, which uses the predictions to make autonomous driving decisions to be executed by the AV in a given driving scenario. The inputs received by the planner 906 would typically indicate a drivable area and would also capture predicted movements of any external agents (obstacles, from the AV’s perspective) within the drivable area. The driveable area can be determined using perception outputs from the perception system 902 in combination with map information, such as an HD (high definition) map. For example, predictions about agents may be generated based on agent tracks determined via online tracking. [0147] A core function of the planner 906 is the planning of trajectories for the AV (ego trajectories), taking into account predicted agent motion. This may be referred to as trajectory planning. A trajectory is planned in order to carry out a desired goal within a scenario. The goal could for example be to enter a roundabout and leave it at a desired exit; to overtake a vehicle in front; or to stay in a current lane at a target speed (lane following). The goal may, for example, be determined by an autonomous route planner 916, also referred to as a goal generator 916. [0148] The controller 908 executes the decisions taken by the planner 906 by providing suitable control signals to an on-board actor system 912 of the AV. In particular, the planner 906 plans trajectories for the AV and the controller 908 generates control signals to implement the planned trajectories. Typically, the planner 906 will plan into the future, such that a planned trajectory may only be partially implemented at the control level before a new trajectory is planned by the planner 906. The actor system 912 includes “primary” vehicle systems, such as braking, acceleration and steering systems, as well as secondary systems (e.g., signalling, wipers, headlights etc.). [0149] The extent to which the various stack functions are integrated or separable can vary significantly between different stack implementations – in some stacks, certain aspects may be so tightly coupled as to be indistinguishable. For example, in other stacks, planning and control may be integrated (e.g. such stacks could plan in terms of control signals directly), whereas other stacks (such as that depicted in Figure 9A) may be architected in a way that draws a clear distinction between the two (e.g. with planning in terms of trajectories, and with separate control optimizations to determine how best to execute a planned trajectory at the control signal level). Similarly, in some stacks, prediction and planning may be more tightly coupled. At the extreme, in so-called “end-to-end” driving, perception, prediction, planning and control may be essentially inseparable. Unless otherwise indicated, the perception, prediction planning and control terminology used herein does not imply any particular coupling or modularity of those aspects. [0150] A “full” stack typically involves everything from processing and interpretation of low-level sensor data (perception), feeding into primary higher-level functions such as prediction and planning, as well as control logic to generate suitable control signals to implement planning-level decisions (e.g. to control braking, steering, acceleration etc.). For autonomous vehicles, level 3 stacks include some logic to implement transition demands and level 4 stacks additionally include some logic for implementing minimum risk maneuvers. The stack may also implement secondary control functions e.g. of signalling, headlights, windscreen wipers etc. [0151] The term “stack” can also refer to individual sub-systems of the full stack, such as perception, prediction, planning or control stacks 904, 906, 908, which may be tested individually or in any desired combination. A stack can refer purely to software, i.e. one or more computer programs that can be executed on one or more general-purpose computer processors. It will be appreciated that the term “stack” encompasses software, but can also encompass hardware. In simulation, software of the stack may be tested on a “generic” off- board computer system, before it is eventually uploaded to an on-board computer system of a physical vehicle. However, in “hardware-in-the-loop” testing, the testing may extend to underlying hardware of the vehicle itself. For example, the stack software may be run on the on-board computer system (or a replica thereof) that is coupled to the simulator for the purpose of testing. In this context, the stack under testing extends to the underlying computer hardware of the vehicle. As another example, certain functions of the stack 900 (e.g. perception functions) may be implemented in dedicated hardware. In a simulation context, hardware-in-the loop testing could involve feeding synthetic sensor data to dedicated hardware perception components. [0152] Figure 9B shows a highly schematic overview of a testing paradigm for autonomous vehicles. An ADS/ADAS stack 900, e.g. of the kind depicted in Figure 9A, is subject to repeated testing and evaluation in simulation, by running multiple scenario instances in a simulator 1002, and evaluating the performance of the stack 100 (and/or individual subs- stacks thereof) in a test oracle 1052. The output of the test oracle 1052 is informative to an expert 922 (team or individual), allowing them to identify issues in the stack 900 and modify the stack 900 to mitigate those issues (S924). The results also assist the expert 922 in selecting further scenarios for testing (S926), and the process continues, repeatedly modifying, testing and evaluating the performance of the stack 900 in simulation. The improved stack 900 is eventually incorporated (S925) in a real-world AV 901, equipped with a sensor system 910 and an actor system 912. The improved stack 900 typically includes program instructions (software) executed in one or more computer processors of an on-board computer system of the vehicle 901 (not shown). The software of the improved stack is uploaded to the AV 901 at step S925. Step S925 may also involve modifications to the underlying vehicle hardware. On board the AV 901, the improved stack 900 receives sensor data from the sensor system 910 and outputs control signals to the actor system 912. Real- world testing (S928) can be used in combination with simulation-based testing. For example, having reached an acceptable level of performance through the process of simulation testing and stack refinement, appropriate real-world scenarios may be selected (S930), and the performance of the AV 901 in those real scenarios may be captured and similarly evaluated in the test oracle 1052. [0153] Figure 9C shows a highly schematic block diagram of a scenario extraction pipeline. Data 940 of a real-world run is passed to a ‘ground-truthing’ pipeline 942 for the purpose of generating ‘pseudo’ ground truth for the real-world scenario. The run data 940 could comprise, for example, sensor data and/or perception outputs captured/generated on board one or more vehicles (which could be autonomous, human-driven or a combination thereof), and/or data captured from other sources such external sensors (CCTV etc.). The run data is processed within the ground truthing pipeline 942, in order to generate appropriate ground truth 944 (trace(s) and contextual data) for the real-world run. As discussed, the ground- truthing process could be based on manual annotation of the ‘raw’ run data 940, or the process could be entirely automated (e.g. using offline perception method(s)), or a combination of manual and automated ground truthing could be used. For example, 3D bounding boxes may be placed around vehicles and/or other agents captured in the run data 940, in order to determine spatial and motion states of their traces. A scenario extraction component 946 receives the scenario ground truth 944, and processes the scenario ground truth 944 to extract a more abstracted scenario description 948 that can be used for the purpose of simulation. The scenario description 948 is consumed by the simulator 1002, allowing multiple simulated runs to be performed. The simulated runs are variations of the original real-world run, with the degree of possible variation determined by the extent of abstraction. Ground truth 950 is provided for each simulated run. [0154] In the present off-board content, there is no requirement for the traces to be extracted in real-time (or, more precisely, no need for them to be extracted in a manner that would support real-time planning); rather, the traces are extracted “offline”. Examples of offline perception algorithms include non-real time and non-causal perception algorithms. Offline techniques contrast with “on-line” techniques that can feasibly be implemented within an AV stack 900 to facilitate real-time planning/decision making. [0155] For example, it is possible to use non-real time processing, which cannot be performed on-line due to hardware or other practical constraints of an AV’s onboard computer system. For example, one or more non-real time perception algorithms can be applied to the real-world run data 940 to extract the traces. A non-real time perception algorithm could be an algorithm that it would not be feasible to run in real time because of the computation or memory resources it requires. [0156] It is also possible to use “non-causal” perception algorithms in this context. A non- causal algorithm may or may not be capable of running in real-time at the point of execution, but in any event could not be implemented in an online context, because it requires knowledge of the future. For example, a perception algorithm that detects an agent state (e.g. location, pose, speed etc.) at a particular time instant based on subsequent data could not support real-time planning within the stack 900 in an on-line context, because it requires knowledge of the future (unless it was constrained to operate with a short look ahead window). For example, filtering with a backwards pass is a non-causal algorithm that can sometimes be run in real-time, but requires knowledge of the future. [0157] The term “perception” generally refers to techniques for perceiving structure in the real-world data 940, such as 2D or 3D bounding box detection, location detection, pose detection, motion detection etc. For example, a trace may be extracted as a time-series of bounding boxes or other spatial states in 3D space or 2D space (e.g. in a birds-eye-view frame of reference), with associated motion information (e.g. speed, acceleration, jerk etc.). In the context of image processing, such techniques are often classed as “computer vision”, but the term perception encompasses a broader range of sensor modalities. [0158] Further details of an example testing pipeline 1000 incorporating the test oracle 1052 will now be described. The examples that follow focus on simulation-based testing. However, the test oracle 1052 can equally be applied to evaluate stack performance on real scenarios, and the relevant description below applies equally to real scenarios. The following description refers to the stack 900 of Figure 9A by way of example. However, as noted, the testing pipeline 1000 is highly flexible and can be applied to any stack or sub-stack operating at any level of autonomy. [0159] Figure 10 shows a schematic block diagram of the testing pipeline 1000. The testing pipeline 1000 is shown to comprise the simulator 1002 and the test oracle 1052. The simulator 1002 runs simulated scenarios for the purpose of testing all or part of an AV run time stack 900, and the test oracle 1052 evaluates the performance of the stack (or sub-stack) on the simulated scenarios. As discussed, it may be that only a sub-stack of the run-time stack is tested, but for simplicity, the following description refers to the (full) AV stack 900 throughout. However, the description applies equally to a sub-stack in place of the full stack 900. The term “slicing” is used herein to the selection of a set or subset of stack components for testing. [0160] The idea of simulation-based testing is to run a simulated driving scenario that an ego agent must navigate under the control of the stack 900 being tested. Typically, the scenario includes a static drivable area (e.g. a particular static road layout) that the ego agent is required to navigate, typically in the presence of one or more other dynamic agents (such as other vehicles, bicycles, pedestrians etc.). To this end, simulated inputs 203 are provided from the simulator 202 to the stack 900 under testing. [0161] The slicing of the stack dictates the form of the simulated inputs 903. By way of example, Figure 10 shows the prediction, planning and control systems 904, 906 and 908 within the AV stack 900 being tested. To test the full AV stack of Figure 9A, the perception system 902 could also be applied during testing. In this case, the simulated inputs 203 would comprise synthetic sensor data that is generated using appropriate sensor model(s) and processed within the perception system 902 in the same way as real sensor data. This requires the generation of sufficiently realistic synthetic sensor inputs (such as photorealistic image data and/or equally realistic simulated lidar/radar data etc.). The resulting outputs of the perception system 902 would, in turn, feed into the higher-level prediction and planning systems 904, 906. [0162] By contrast, so-called “planning-level” simulation would essentially bypass the perception system 902. The simulator 1002 would instead provide simpler, higher-level inputs 1003 directly to the prediction system 904. In some contexts, it may even be appropriate to bypass the prediction system 904 as well, in order to test the planner 906 on predictions obtained directly from the simulated scenario (i.e. “perfect” predictions). [0163] Between these extremes, there is scope for many different levels of input slicing, e.g. testing only a subset of the perception system 902, such as “later” (higher-level) perception components, e.g. components such as filters or fusion components which operate on the outputs from lower-level perception components (such as object detectors, bounding box detectors, motion detectors etc.). [0164] As an alternative to synthetic sensor data, all or part of the perception system 902 may be modelled, e.g. using one or more surrogate models of the perception system that operate directly on lower fidelity inputs and introduce realistic perception error into those inputs. [0165] Whatever form they take, the simulated inputs 903 are used (directly or indirectly) as a basis for decision-making by the planner 908. The controller 908, in turn, implements the planner’s decisions by outputting control signals 909. In a real-world context, these control signals would drive the physical actor system 912 of AV. In simulation, an ego vehicle dynamics model 1004 is used to translate the resulting control signals 909 into realistic motion of the ego agent within the simulation, thereby simulating the physical response of an autonomous vehicle to the control signals 909. [0166] Alternatively, a simpler form of simulation assumes that the ego agent follows each planned trajectory exactly between planning steps. This approach bypasses the control system 908 (to the extent it is separable from planning) and removes the need for the ego vehicle dynamic model 1004. This may be sufficient for testing certain facets of planning. [0167] To the extent that external agents exhibit autonomous behaviour/decision making within the simulator 1002, some form of agent decision logic 1010 is implemented to carry out those decisions and determine agent behaviour within the scenario. The agent decision logic 1010 may be comparable in complexity to the ego stack 900 itself or it may have a more limited decision-making capability. The aim is to provide sufficiently realistic external agent behaviour within the simulator 1002 to be able to usefully test the decision-making capabilities of the ego stack 900. In some contexts, this does not require any agent decision making logic 1010 at all (open-loop simulation), and in other contexts useful testing can be provided using relatively limited agent logic 1010 such as basic adaptive cruise control (ACC). One or more agent dynamics models 1006 may be used to provide more realistic agent behaviour if appropriate. [0168] A scenario is run in accordance with a scenario description 901, which typically has both static and dynamic elements. The static element(s) typically include a static road layout. The dynamic element(s) typically include one or more external agents within the scenario, such as other vehicles, pedestrians, bicycles etc. Scenario runs are orchestrated by a test orchestration component 1060. [0169] The extent of the dynamic information provided to the simulator 1002 for each external agent can vary. For example, a scenario may be described by separable static and dynamic layers. A given static layer (e.g. defining a road layout) can be used in combination with different dynamic layers to provide different scenario instances. The dynamic layer may comprise, for each external agent, a spatial path to be followed by the agent together with one or both of motion data and behaviour data associated with the path. In simple open-loop simulation, an external actor simply follows the spatial path and motion data defined in the dynamic layer that is non-reactive i.e. does not react to the ego agent within the simulation. Such open-loop simulation can be implemented without any agent decision logic 1010. However, in closed-loop simulation, the dynamic layer instead defines at least one behaviour to be followed along a static path (such as an ACC behaviour). In this case, the agent decision logic 1010 implements that behaviour within the simulation in a reactive manner, i.e. reactive to the ego agent and/or other external agent(s). Motion data may still be associated with the static path but in this case is less prescriptive and may for example serve as a target along the path. For example, with an ACC behaviour, target speeds may be set along the path which the agent will seek to match, but the agent decision logic 1010 might be permitted to reduce the speed of the external agent below the target at any point along the path in order to maintain a target headway from a forward vehicle. [0170] The output of the simulator 1002 for a given simulation includes an ego trace 1012a of the ego agent and one or more agent traces 1012b of the one or more external agents (traces 212). Each trace 1012a, 1012b is a complete history of an agent’s behaviour within a simulation having both spatial and motion components. For example, each trace 1012a, 1012b may take the form of a spatial path having motion data associated with points along the path such as speed, acceleration, jerk (rate of change of acceleration), snap (rate of change of jerk) etc. [0171] A “trace” is a history of an agent’s location and motion over the course of a scenario. There are many ways a trace can be represented. Trace data will typically include spatial and motion data of an agent within the environment. The term is used in relation to both real scenarios (with real-world traces) and simulated scenarios (with simulated traces). [0172] The test oracle 1052 receives the traces 1012 and scores those outputs in respect of a set of performance evaluation rules 1054. The performance evaluation rules 1054 are shown to be provided as an input to the test oracle 1052. [0173] The rules 1054 are categorical in nature (e.g. pass/fail-type rules). Certain performance evaluation rules are also associated with numerical performance metrics used to “score” trajectories (e.g. indicating a degree of success or failure or some other quantity that helps explain or is otherwise relevant to the categorical results). The evaluation of the rules 1054 is time-based – a given rule may have a different outcome at different points in the scenario. The scoring is also time-based: for each performance evaluation metric, the test oracle 1052 tracks how the value of that metric (the score) changes over time as the simulation progresses. The test oracle 1052 provides an output 1056 comprising a time sequence 1056a of categorical (e.g. pass/fail) results for each rule, and a score-time plot 1056b for each performance metric. The results and scores 1056a, 1056b are informative to the expert 922 and can be used to identify and mitigate performance issues within the tested stack 900. The test oracle 1052 also provides an overall (aggregate) result for the scenario (e.g. overall pass/fail). The output 1056 of the test oracle 952 is stored in a test database 1058, in association with information about the scenario to which the output 1056 pertains. [0174] When testing on real-world scenarios, in place of the simulated traces 1012a, 1012b, real-world traces extracted from sensor data are provided to the test oracle 1052 and evaluated in the same way. Referring to FIG. 9C, the pseudo-ground truth 944 extracted in the ground-truthing pipeline may be provided to the test oracle 1052 to evaluate the ego vehicle’s real-world performance. [0175] The described offline tracking techniques may be implemented in the ground truthing pipeline 942, in order to extract pseudo-ground truth traces for the other agent’s in the scenario. For example, offline tracking may be applied to bounding boxes detected online in the vehicle 901 itself, or detected offline in an earlier part of the ground-truthing pipeline 942. The agent traces derived from real-world data may therefore comprise or be derived from agent tracks extracted using offline tracking. [0176] Alternatively in addition, real agent traces extracted using offline tracking may be used to generate scenarios that can then be run in a simulator, in the manner described with reference to figure 9C. [0177] References herein to components, functions, modules and the like, denote functional components of a computer system which may be implemented at the hardware level in various ways. A computer system comprises execution hardware which may be configured to execute the method/algorithmic steps disclosed herein and/or to implement a model trained using the present techniques. The term execution hardware encompasses any form/combination of hardware configured to execute the relevant method/algorithmic steps. The execution hardware may take the form of one or more processors, such as the processor 802 of FIG. 8, which may be programmable or non-programmable, or a combination of programmable and non-programmable hardware may be used. Examples of suitable programmable processors include general purpose processors based on an instruction set architecture, such as CPUs, GPUs/accelerator processors etc. Such general-purpose processors typically execute computer readable instructions held in memory coupled to or internal to the processor and carry out the relevant steps in accordance with those instructions. Other forms of programmable processors include field programmable gate arrays (FPGAs) having a circuit configuration programmable through circuit description code. Examples of non-programmable processors include application specific integrated circuits (ASICs). Code, instructions etc. may be stored as appropriate on transitory or non- transitory media (examples of the latter including solid state, magnetic and optical storage device(s) and the like). The subsystems 902-908 of the runtime stack Figure 9A may be implemented in programmable or dedicated processor(s), or a combination of both, on-board a vehicle or in an off-board computer system in the context of testing and the like. The various components of Figure 10, such as the simulator 1002 and the test oracle 1052 may be similarly implemented in programmable and/or dedicated hardware.

Claims

Claims 1. A computer-implemented method of tracking objects in a multi-object scenario based on a set of observations generated over multiple frames 0, … , ^^, the method performed by an offline object tracker in a sequence of processing steps 0, … ^^ based on a pruning window of size ^^, and comprising: in the initial 0, … , ^^ − 1 processing steps, determining a set of data association hypotheses for frames 0, … , ^^ − 1, each data association hypotheses comprising, for each frame 0, … , ^^ − 1, a frame association hypothesis, the frame association hypothesis being a set of associations between one or more observations for frame ^^ and a set of track identifiers, in the subsequent processing step ^^, performing operations of: determining an updated set of data association hypotheses for frames 0, … , ^^, calculating an updated probability of each updated data association hypothesis, identifying a most probable data association hypothesis at step ^^ based on the updated set of data association hypotheses and their updated probabilities; selecting a final frame association hypothesis for the frame 0 based on the most probable data association hypothesis selected in step ^^; generating a tracking output external to the offline object tracker for the frame 0 based on the final frame association hypothesis for the frame 0, as selected based on the most probable data association hypothesis identified at step ^^; wherein, in each subsequent processing step ^^, the offline object tracker repeats said operations for frames 0, … ^^, with the updated set of data association hypotheses restricted to account for every final frame association hypothesis selected in the previous time steps, resulting in the selection of a final frame association hypothesis for the frame ^^ − ^^ based on the most probable data association hypothesis identified at step ^^, and the generation in step ^^ of a tracking output external to the offline object tracker for the frame ^^ − ^^.

2. The method of claim 1, comprising comparing the tracking output for at least one of the frames with an online tracking output for the at least one frame, the online tracking output having been extracted by an online object tracker, and thereby identifying a tracking error in the online tracking output.

3. The method of claim 2, comprising comparing the tracking output for each frame with an online tracking output for that frame.

4. The method of claim 1, wherein the multiple frames have been captured in a test scenario, in which an ego agent reacts to a plurality of other agents, the method comprising determining from the multiple frames an ego trace of the ego robot and a plurality of agent traces comprising or derived from the tracking output of each of at least some of the multiple frames.

5. The method of claim 4, wherein the ego agent is a mobile robot, the method comprising inputting the ego trace and the plurality of agent traces to a test oracle, wherein the test oracle evaluates performance of the mobile robot in the test scenario and outputs a set of performance evaluation results.

6. The method of claim 5, comprising using the set of performance evaluation results to identify and mitigate an issue with the mobile robot.

7. The method of claim any of claims 4 to 6, comprising using the ego trace and the plurality of agent traces to create a scenario for testing performance of a simulated ego agent in a simulation environment.

8. The method of any of claims 5 to 7, wherein the multiple frames have been captured by at least on image capture device of the ego robot.

9. The method of any preceding claim, wherein the tracking output comprises, for each frame, one or more observations and for each observation a track identifier.

10. The method of claim 9, wherein the tracking output of at least one frame comprises an observation marked as a false positive.

11. The method of any preceding claim, wherein the updated probability of each updated data association hypothesis is calculated based on the one or more observations and a motion model.

12. A computer system comprising one or more computers configured to implement the method of any preceding claim.

13. A computer program for programming a computer system to implement the method of any of claims 1 to 11.