WO2020177876A1

WO2020177876A1 - System and method for training a model performing human-like driving

Info

Publication number: WO2020177876A1
Application number: PCT/EP2019/055786
Authority: WO
Inventors: Nicolas VIGNARD; Dengxin DAI; Simon Hecker; Luc Van Gool
Original assignee: Toyota Motor Europe NV SA; Eidgenoessische Technische Hochschule Zurich ETHZ
Current assignee: Toyota Motor Europe NV SA; Eidgenoessische Technische Hochschule Zurich ETHZ
Priority date: 2019-03-07
Filing date: 2019-03-07
Publication date: 2020-09-10
Anticipated expiration: 2021-09-07

Abstract

The invention relates to a method and system for training a human-like generative driving model for a vehicle, comprising: a - obtaining (SOI) a set of video data of driving scenes performed by a human driven vehicle, b - obtaining (S02) a data set of human driving maneuvers carried out during the driving scenes, c - training (S03) a generative driving model using the set of video data and the data set of human driving maneuvers, wherein the training step (S03) of c is augmented with an adversarial training scheme such that the prediction of the trained generative driving model becomes more human like.

Description

SYSTEM AND METHOD FOR TRAINING A MODEL PERFORMING

HUMAN-LIKE DRIVING

FIELD OF THE DISCLOSURE

[0001] The present disclosure is related to the field of image processing, in particular to a method for training a human-like generative driving model for a vehicle.

BACKGROUND OF THE DISCLOSURE

[0002] The prospect of deploying autonomously driven cars is imminent owing to the advances in perception, robotics and sensor technologies. However, it is believed that autonomous vehicles are more likely to be accepted if they drive accurately, comfortably and drive the same way as human drivers would do. This is especially true for the near future when autonomous vehicles and human-driven vehicles need to share the same road.

[0003] Classical approaches require the recognition of all driving-relevant objects, such as lanes, traffic signs, traffic lights, cars and pedestrians, and then perform motion planning, which is further used for final vehicle control, cf. e.g.:

C. Urmson, et.al. Autonomous driving in urban environments: Boss and the urban challenge. Journal of Field Robotics Special Issue on the 2007 DARPA Urban Challenge, Part I, 25(8):425-466, June 2008.

[0004] These type of systems are sophisticated, represent the current state- of-the-art for autonomous driving, but they are hard to maintain and prone to error accumulation over the pipeline. Most systems also need to use diverse sensors, such as cameras, laser scanners, radar, GPS and high-definition maps.

[0005] End-to-end mapping methods on the other hand construct a direct mapping from the sensory input to the maneuvers. In this regard the last years have seen tremendous progress in academia on learning driving models, cf. e.g.:

F. Codevilla, M. Muller, A. Lopez, V. Koltun, and A. Doso-vitskiy. End-to-end driving via conditional imitation learning. 2018, and

S. Hecker, D. Dai, and L. Van Gool. End-to-end learning of driving models with surround-view cameras and route planners. In ECCV, 2018.

[0006] However, many of these systems are deficient in terms of the sensors used, when compared to the driving systems developed by large companies. For instance, many algorithms only use a front-facing camera. Maps are exploited only for simple directional commands or rendered videos. While these setups are sufficient to allow the community to study many challenges, developing algorithms for fully autonomous cars requires the use of numerical maps of high fidelity.

[0007] Current driving algorithms, e.g. those cited above, mostly treat driving as a regression problem with i.i.d individual training samples, e.g. regressing the low-level steering angle and speed for a given data sample. Yet, driving is a continuous sequence of events over time. Longitudinal and lateral control need to be coupled and these coupled operations need to be combined over time for a comfortable ride. Thus, driving models need to be learned with continuous data sequences and proper passenger comfort measures need to be embedded into the learning system.

[0008] Other contributions have chosen the middle ground between traditional pipe-lined methods and the monolithic end-to-end approach. They learn driving models from compact intermediate representations called affordance indicators such as distance to the front car and existence of traffic light, cf. e.g.

A. Sauer, N. Savinov, and A. Geiger. Conditional affordance learning for driving in urban environments. In Conference on Robot Learning, 2018.

[0009] While research on passenger comfort started to receive some attention, it hardly did so in learning driving models, cf. e.g. :

M. Elbanhawi, M. Simic, and R. Jazar. In the passenger seat: Investigating ride comfort measures in autonomous cars. IEEE Intelligent Transportation

Systems Magazine, 7(3):4-17, 2015).

[0010] A large body of work has studied human driving styles, cf. e.g.:

G. A. M. Meiring and H. C. Myburgh. A review of intelligent driving style analysis systems and related artificial intelligence algorithms. In Sensors, 2015.

[0011] Statistical approaches have been employed to evaluate human drivers and to suggest improvements, cf. e.g.:

H. Zhao, H. Zhou, C. Chen, and J. Chen. Join driving: A smart phone-based driving behavior evaluation system. In IEEE Global Communications Conference (GLOBECOM), 2013.

However, human-like driving is hard to quantify. SUMMARY OF THE DISCLOSURE

[0012] Currently, it remains desirable to provide a system and a method for training a human-like generative driving model for a vehicle, in particular for learning to drive accurately, comfortably and to drive the same way as human drivers would do.

[0013] Therefore, according to the embodiments of the present disclosure, a (desirably computer-implemented) method for training a human-like generative driving model for a vehicle is provided. The method comprises the steps of: a - obtaining a set of video data of driving scenes performed by a human driven vehicle,

b - obtaining a data set of human driving maneuvers carried out during the driving scenes,

c - training a generative driving model using the set of video data and the data set of human driving maneuvers,

wherein the training step of c is augmented with an adversarial training scheme such that the prediction of the trained generative driving model becomes more human like.

[0014] By providing such a method, it becomes possible to take advantage of the advance of adversary learning to learn human-like driving. Specifically, a discriminator may be trained, together with the driving model, to distinguish human driving and machine driving. The driving model is trained to be accurate, comfortable, and at the same time to fool the discriminator so that it believes that the driving performed by the method was by a human driver. A new evaluation criterion is proposed to score the human-likeness of a driving model.

[0015] As a further advantage, the learning procedure is desirably improved from a pointwise prediction to a sequence-based prediction.

[0016] The generative driving model desirably outputs predicted driving maneuvers. Accordingly, the model is desirably configured to steer autonomously a vehicle based on said outputted predicted driving maneuvers. For example, the driving maneuvers may comprise any kind of maneuvers for driving control of the vehicle, e.g. simple maneuvers as steering or braking, or more complex maneuver line taking a turn by a combination of braking, steering and re-accelerating.

[0017] The adversarial training scheme may comprise: cl - training a discriminator model based on the predicted driving maneuvers outputted by the generative driving model and corresponding human driving maneuvers of the data set to discriminate between human and machine maneuvers, and

c2 - forcing the generative driving model to learn more human like driving (e.g. by penalizing the model using an adversary loss).

[0018] For example, the standard LI and/or L2 loss may be augmented by an adversary loss which is based on a discriminator model trained to distinguish human driving and machine driving.

[0019] The step of obtaining the set of video data may further comprise: al - obtaining a route planning data set representing route information according to which the human driven vehicle have performed the driving scenes of the set of video data,

a2 - enriching the set of video data by the route planning data (set), such that the accuracy of the predicted driving maneuvers outputted by the generative driving model is increased.

For example, the set of video data may be enriched with numerical map data from HERE Technologies.

[0020] The step of training the generative driving model may comprise: training the generative driving model based on a predefined loss function (e.g. the LI and/or the L2 loss) augmented by an adversary loss which is based on the output of the discriminator model.

[0021] The generative driving model may receive as an input video data and data of human driving maneuvers of past time steps, and output predicted driving maneuvers for future time steps.

[0022] In case the set of video data are enriched by the route planning data before being inputted to the generative driving model, the generative driving model may receive as a further input the vehicle location in past time steps.

[0023] For example, given the video I, the map information M, and the vehicle's location L, a deep neural network may be trained to predict the steering angle s and speed v for a future time step. All data inputs may be synchronized and sampled at the same sampling rate f, meaning the vehicle makes driving decision every 1/f seconds. The inputs and outputs may be represented in this discretized form.

[0024] The generative driving model may be a deep neural network. [0025] The discriminator model may be a deep neural network.

[0026] For example, the driving model developed in S. Hecker, D. Dai, and L. Van Gool. End-to-end learning of driving models with surround-view cameras and route planners, ECCV, 2018, may be adopted.

[0027] The present disclosure further relates to a system for training a human-like generative driving model for a vehicle, the system comprises:

a module A for obtaining a set of video data of driving scenes performed by a human driven vehicle,

a module B for obtaining a data set of human driving maneuvers carried out during the driving scenes,

a module C for training a generative driving model using the set of video data, and the data set of human driving maneuvers,

wherein training in module C is augmented with an adversarial training scheme such that the prediction of the trained generative driving model becomes more human like.

[0028] The system may comprise further (sub-) modules and features corresponding to the features of the method described above.

[0029] Moreover the present disclosure relates to a system for predicting human-like driving maneuvers of a vehicle, comprising the (trained) model of step c or of module C, as descried above.

[0030] Furthermore the present disclosure relates to a computer program including instructions for executing the steps of a method, as described above, when said program is executed by a computer.

[0031] This program can use any programming language and take the form of source code, object code or a code intermediate between source code and object code, such as a partially compiled form, or any other desirable form.

[0032] Finally, the present disclosure relates to a recording medium readable by a computer and having recorded thereon a computer program including instructions for executing the steps of a method, as described above.

[0033] The information medium can be any entity or device capable of storing the program. For example, the medium can include storage means such as a ROM, for example a CD ROM or a microelectronic circuit ROM, or magnetic storage means, for example a diskette (floppy disk) or a hard disk. [0034] Alternatively, the information medium can be an integrated circuit in which the program is incorporated, the circuit being adapted to execute the method in question or to be used in its execution.

[0035] It is intended that combinations of the above-described elements and those within the specification may be made, except where otherwise contradictory.

[0036] It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure, as claimed.

[0037] The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description, and serve to explain the principles thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

[0038] Fig. 1 shows a schematic flow chart of the steps of a method for training a human-like generative driving model according to embodiments of the present disclosure; and

[0039] Fig. 2 shows a schematic block diagram of a system according to embodiments of the present disclosure.

DESCRIPTION OF THE EMBODIMENTS

[0040] Reference will now be made in detail to exemplary embodiments of the disclosure, examples of which are illustrated in the accompanying drawings. Wherever possible, the same reference numbers will be used throughout the drawings to refer to the same or like parts.

[0041] End-to-end driving allows developing promising driving models based on camera data, cf. e.g.:

the driving model developed in S. Hecker, D. Dai, and L. Van Gool. End-to-end learning of driving models with surround-view cameras and route planners, ECCV, 2018.

[0042] The focus has mainly been though on perception, not so much navigation. Thus far, the representations for navigation are either primitive directional commands in a simulation environment or rendered videos of planned routes in real-world environments. [0043] Fig. 1 shows a schematic flow chart of the steps of a method for training a human-like generative driving model according to embodiments of the present disclosure. For example, the driving model developed in the publication cited above (S. Hecker et. al., 2018) may be adopted in the present disclosure.

[0044] In particular, the used core model may consist of a fine-tuned Resnet34 CNN (cf. e.g. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016) to process sequences of front facing camera images, followed by two regression networks to predict steering wheel angle and vehicle speed. The architecture may thus be similar to the baseline model from the publication cited above (S. Hecker et. al., 2018).

[0045] In a first step SO1 a set of video data of driving scenes performed by a human driven vehicle is obtained. The set of video data may be e.g. the Drive360 dataset, as described in the publication cited above (S. Hecker et. al., 2018).

[0046] In an optional step SOla (not shown) a route planning data set representing route information according to which the human driven vehicle have performed the driving scenes of the set of video data may be additionally obtained, e.g. map data from Here Technologies.

[0047] In a further optional step SOlb (not shown) the set of video data may be enriched by the route planning data, such that the accuracy of the predicted driving maneuvers outputted by the generative driving model is increased.

[0048] It is hence proposed by the present disclosure to (1) augment real- world driving data with numerical map data from e.g. HERE Technologies; and (2) design map features believed to be relevant for driving and integrating them into a driving model.

[0049] In a further step S02 (which may be carried out before, after or at the same time as step SOI) a data set of human driving maneuvers carried out during the driving scenes is obtained.

[0050] Accordingly, said data set of human driving maneuvers is a further input for the model proving information regarding the driving control gestures of humans when driving the vehicles in the driving scenes of the set of video data. [0051] In a subsequent step S03 a generative driving model is trained using the set of video data and the data set of human driving maneuvers. Said training is augmented with an adversarial training scheme such that the prediction of the trained generative driving model becomes more human like.

[0052] In particular, in an optional step S03a (not shown) a discriminator model may be trained based on the predicted driving maneuvers outputted by the generative driving model and corresponding human driving maneuvers of the data set to discriminate between human and machine maneuvers. The generative driving model may be forced to learn more human like driving in a further optional step S03b (not shown).

[0053] For example, given the video I , the map information M, and the vehicle's location L, a deep neural network is trained to predict the steering angle s and speed v for a future time step. All data inputs are synchronized and sampled at the same sampling rate f, meaning the vehicle makes driving decision every 1/f seconds. The inputs and outputs are represented in this discretized form. It is used t to indicate the time stamp, such that ail data can be indexed over time. For example, l_t indicates the current video frame and v_t the vehicle's current speed. Similarly, l_t_k is the k^th previous video frame and s_t-k is the k^th previous steering angle. Since predictions need to rely on data of previous time steps, the k recent video frames are denoted by l[_t-k+i,t] º <l_t_k₊i, . . . , l_t), and the k recent map representations by M[_t-k+i,t] º (M_t_k+i, . . M_t). The goal is to train a deep network that predicts desired driving actions from the visual observations and the planned route. The learning task can be defined as: (1)

where S_t+1 represents the steering angle space and V_t+1 the speed space for future time t + 1. s and V can be defined at several levels of granularity. The continuous values directly recorded from the car's CAN bus may be considered, where v = (V|0 £ v £ 180 for speed and s = {S| - 720 £ S £ 720] for steering angle in this case. Here, kilometer per hour (km/h) is the unit of v, and degree (°) the unit of s. M_t is either a rendered video frame from the TomTom route planner (cf. S. Hecker et. al., 2018), or the engineered features for the numerical maps from HERE Technologies (as described below), or the combination of both.

[0054] In order to keep notations concise, the synchronized data (I, M) may be denoted as D. Without loss of generality, the training data are assumed to consist of a long sequence of driving data with T frames in total. Then the basic driving model is to learn the prediction function for the steering angle

and the velocity

with the objective

where Ŝ and

are predicted values, and Ŝ and

are the ground truth values.

[0055] The learning under Eq. 4 is straightforward and can be implemented with any standard deep network. This objective, however, assumes the driving decisions at each time step are independent from each other. It is believed that this may be an over-simplification because driving decisions indeed exhibit strong temporal dependencies within a relatively short time range. In the following section, the objective according to the present disclosure is reformulated by introducing ride comfort and human-likeness score to better model the temporal dependency of driving actions.

Accurate and Comfortable Driving

[0056] Multiple concepts relating to driving comfort have been proposed and discussed, such as apparent safety, motion sickness, level of controllability and resulting force. While those are all relevant, some are hard to quantify. It is hence chosen to reduce motion sickness, which has been shown to be largely caused by the vehicle's longitudinal and lateral oscillations, cf. e.g.: G. M. Turner Ml. Motion sickness in public road transport: the effect of driver, route and vehicle. Ergonomics, (1646-64), 1999.

[0057] Due to the short-term predictive nature of most end-to-end driving models, substantial jerking is an inherent problem. The used comfort component aims at reducing jerk by imposing a temporal smoothness constraint on the longitudinal and lateral oscillations, by minimizing the second derivative of consecutive steering angle and speed predictions.

[0058] Before introducing ride comfort and human-like driving, Eq. 4 is reformulated. If the number of consecutive predictions that need to be optimized jointly is denoted by O, then minimizing Eq. 4 is equivalent to minimizing

[0059] Then for every 0 consecutive frames starting at time t, the loss of driving accuracy will be

[0060] Now the objective function can be presented for accurate and comfortable driving as

where

z_! is a trade-off parameter to balance the two costs. By optimizing under the objective in Eq. 7, consecutive predictions are learned and optimized together for accurate and comfortable driving. Accurate, Comfortable & Human-like Driving

[0061] If autonomous cars behave differently from human-driven cars, it is hard for humans to predict their future actions. This unpredictability can cause accidents. Thus, it is argued that it is important to design human-like driving algorithms from the very start. Hence, a human-likeness score is introduced. The higher the value, the closer to human driving. Since it is hard to manually define what a human driving style is - as was done for general comfort measures, adversarial learning is adopted to model it.

[0062] An adversarial learning method consists of a generator and discriminator. The driving model of the present disclosure as defined in Eq. 4 or in Eq. 7 is the generator G. Now the training objective for the discriminator will be described. For convenience, the short trajectories of o frames described above are named as drivelets. Given the outputs of the driving model for a drivelet B_t = (s_t+1, ... , s_t+0,v_t+1. v_t+0) and its corresponding ground truth from the human driver B_t = (s_t+1, ... , s_t+0,v_t+1, ... , v_t+o), the goal is to train a fully-connected discriminator D using the cross-entropy loss to classify the two classes (i.e. machine and human).

[0063] The drivelet at t is forwarded to G to obtain the driving actions B_t. To make autonomous driving more human-like is equivalent to letting the distribution of B_t approximate that of B_t. Thus the loss for human-like driving according to the present disclosure is defined as an adversarial loss:

where D(Bt)¹ is the probability of classifying B_t as human driving.

[0064] Putting everything together, the objective for accurate, comfortable and human-like driving according to the present disclosure is as follows:

z₂ is a trade-off parameter to control the contributions of the costs. In keeping with adversarial learning, the training is conducted under the following min-max criterion:

max_Gmin_Dz(I, M). (11) Obtaining HERE Map Data

[0065] The set of video data according to the present disclosure may be provided by panoramic videos recorded by vehicle comprising one or several cameras oriented toward the environment in at least one of the front, the sides, and the back of the vehicle, e.g. by the Drive360 video data set. Drive360 features 60 hours of real-world driving data over 3000 km. The Drive360 is e.g. augmented with HERE Technologies map data. Drive360 offers a time stamped GPS trace for each route recorded. A path-matcher is used based on a hidden markov model employing the Viterbi algorithm (cf. e.g. G. D. Forney. The viterbi algorithm. Proceedings of the IEEE, 61(3): 268-278, 1973) to calculate the most likely path traveled by the vehicle during dataset recording, snapping the GPS trace to the underlying road network. This improves the localization accuracy significantly, especially in urban environments where the GPS signal may be weak and noisy. Through the path matcher a map matched GPS coordinate is obtained for each time stamp, which is then used to query the HERE Technologies map database to obtain the various types of navigation data.

[0066] Following S. Hecker et. al., 2018, a fine-tuned AlexNet (cf. e.g. A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 2012) may be used to process the visual map representation from the TomTom Go App.

[0067] For learning comfortable driving, no extra network is needed. The respective loss may computed according to Eq. 7 and gradients are back propagated to adjust the driving network. In order to learn human-like driving, a fully-connected, three-layer discriminator network may be used to model human-like driving. The loss may be computed according to Eq. 9 to adjust the driving network.

[0068] Fig. 2 shows a schematic block diagram of a system according to embodiments of the present disclosure. [0069] In this figure, a system 200 for training a model has been represented. This system 200, which may be a computer, comprises a processor 201 and a non-volatile memory 202. The system 200 may also comprise, be configured to be integrated in or form a part of a vehicle 400. The system 200 may not only be configured for training a human-like generative driving model for a vehicle but also to apply the trained model to autonomously drive a vehicle (in particular in case it is part of a vehicle 400).

[0070] The system 200 may further be connected to a (passive) optical sensor 300, in particular a digital camera (e.g. integrated into the vehicle and being oriented to at least one of the front, the sides and the back). The digital camera 300 is configured such that it can record a scene in front of the vehicle 400, and in particular output digital data providing appearance (color, e.g. RGB) information of the scene. The camera 300 desirably generates image data comprising a 2D or 3D image of the environment. There may also be provided a set of monocular cameras which generate a panoramic 2D or 3D image. The output of the camera 300 may be used as video data of driving scenes for training the model (cf. step SOI of the method described above) and/or as input for a trained model, based on which the trained model autonomously controls driving the vehicle.

[0071] In the non-volatile memory 202, a set of instructions is stored and this set of instructions comprises instructions to perform a method for training a model.

[0072] In particular, these instructions and the processor 201 may respectively form a plurality of modules:

a module A for obtaining (SOI) a set of video data of driving scenes performed by a human driven vehicle,

a module C for training (S03) a generative driving model using the set of video data, and the data set of human driving maneuvers,

[0073] Throughout the description, including the claims, the term "comprising a" should be understood as being synonymous with "comprising at least one" unless otherwise stated. In addition, any range set forth in the description, including the claims should be understood as including its end value(s) unless otherwise stated. Specific values for described elements should be understood to be within accepted manufacturing or industry tolerances known to one of skill in the art, and any use of the terms "substantially" and/or "approximately” and/or "generally" should be understood to mean falling within such accepted tolerances.

[0074] Although the present disclosure herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present disclosure.

[0075] It is intended that the specification and examples be considered as exemplary only, with a true scope of the disclosure being indicated by the following claims.

Claims

1. A method for training a human-like generative driving model for a vehicle, comprising the steps of:

a - obtaining (SOI) a set of video data of driving scenes performed by a human driven vehicle,

b - obtaining (S02) a data set of human driving maneuvers carried out during the driving scenes,

c - training (S03) a generative driving model using the set of video data and the data set of human driving maneuvers,

wherein the training step (S03) of c is augmented with an adversarial training scheme such that the prediction of the trained generative driving model becomes more human like.

2. The method according to the claim 1, wherein

the generative driving model outputs predicted driving maneuvers.

3. The method according to claim 1 or 2, wherein

the adversarial training scheme comprises:

cl - training (S03a) a discriminator model based on the predicted driving maneuvers outputted by the generative driving model and corresponding human driving maneuvers of the data set to discriminate between human and machine maneuvers, and

c2 - forcing (S03b) the generative driving model to learn more human like driving.

4. The method according to any one of the preceding claims, wherein the step of obtaining (SOI) the set of video data further comprises:

al - obtaining (SOla) a route planning data set representing route information according to which the human driven vehicle have performed the driving scenes of the set of video data,

a2 - enriching (SOlb) the set of video data by the route planning data, such that the accuracy of the predicted driving maneuvers outputted by the generative driving model is increased.

5. The method according to any one of the preceding claims, wherein the step of training (S03) the generative driving model comprises:

training the generative driving model based on a predefined loss function augmented by an adversary loss which is based on the output of the discriminator model.

6. The method according to any one of the preceding claims, wherein the generative driving model receives as an input video data and data of human driving maneuvers of past time steps, and

outputs predicted driving maneuvers for future time steps.

7. The method according to the preceding claim, wherein

in case the set of video data are enriched by the route planning data before being inputted to the generative driving model, the generative driving model receives as a further input the vehicle location in past time steps.

8. The method according to the preceding claim, wherein

the generative driving model is a deep neural network, and/or

the discriminator model is a deep neural network.

9. A system for training a human-like generative driving model for a vehicle, comprising:

10. A system for predicting human-like driving maneuvers of a vehicle, comprising the model of step c of any one of claims 1 to 8 or of module C of claim 9.

11. A computer program including instructions for executing the steps of a method according to any one of claims 1 to 8 when said program is executed by a computer.

12. A recording medium readable by a computer and having recorded thereon a computer program including instructions for executing the steps of a method according to any one of claims 1 to 8.