WO2024189997A1 - Information processing device - Google Patents
- Publication number: WO2024189997A1 (PCT/JP2023/042953)
- Authority: WIPO (PCT)
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/40—Business processes related to the transportation industry
- G06Q50/43—Business processes related to the sharing of vehicles, e.g. car sharing
Definitions
- the present disclosure relates to an information processing device that optimizes the reallocation of vehicles among multiple ports in a shared transportation service.
- a "port" refers to a space for parking multiple shared vehicles (such as a bicycle parking lot for parking multiple shared electric bicycles), and multiple ports are provided in multiple locations to provide a shared transportation service.
- there is a known technology for vehicle sharing services that improves the vehicle utilization rate across the entire service by collecting vehicles from ports with a surplus of vehicles and placing the collected vehicles at ports with a shortage of vehicles (see Patent Document 1).
- the vehicle sharing services mentioned above are provided by private companies, so it is desirable to operate them in a way that maximizes the rewards (profits) of the service as a whole, rather than simply replenishing bicycle shortages between ports. However, no technology has been proposed that reallocates vehicles with a primary focus on maximizing the overall reward (profit) of the sharing service.
- the present disclosure takes into consideration the above circumstances and aims to optimize vehicle reallocation between multiple ports in a shared transportation service so as to maximize rewards (profits).
- reinforcement learning is a type of machine learning that deals with the problem of an agent in an environment observing the current state and determining, based on a policy, the future actions to be taken from the state information obtained by the observation.
- the agent obtains rewards from the environment by executing the determined actions, and reinforcement learning learns the policy that will obtain the greatest rewards through a series of actions.
- the applicant has focused on the above-mentioned reinforcement learning and has devised an invention that utilizes a reinforcement learning framework to optimize vehicle reallocation between multiple ports in a shared transportation service so as to maximize the reward in reinforcement learning, and discloses this invention in the present specification.
- the information processing device includes an acquisition unit that acquires information to be input into an action value function that estimates the value of an action related to the reallocation of vehicles between multiple ports in a shared transportation service in which vehicles are shared, and a learning and inference unit that performs reinforcement learning on the action value function using the information acquired by the acquisition unit in order to make the action value function select an action that will maximize the value of a future action.
- the acquisition unit acquires information to be input into an action value function that estimates the value of an action related to the reallocation of vehicles between multiple ports in a shared transportation service. Examples of this information are described in detail in the embodiments, but "acquisition" here broadly includes not only obtaining information but also calculating information using a predetermined formula or the like.
- the learning and inference unit performs reinforcement learning on the action value function, using the information acquired by the acquisition unit, so that the action value function comes to select the action that will have the highest future value. As a result, through reinforcement learning, the action value function is trained into one that selects the action with the highest future value.
- the value of actions related to vehicle reallocation between multiple ports is estimated based on such an action value function, and the action that will have the highest value in the future is selected. Therefore, when reallocating vehicles between multiple ports in a shared transportation service, vehicle reallocation can be optimized to maximize the reward in reinforcement learning.
- FIG. 1 is a functional block diagram of an information processing device according to an embodiment of the present disclosure.
- FIG. 2 is a diagram for explaining an overview of reinforcement learning.
- FIG. 3 is a diagram showing an example of state information and the like when the present disclosure is applied to reinforcement learning.
- FIG. 4 is a diagram for explaining the discounted cumulative reward.
- FIG. 5 is a diagram for explaining the state value function.
- FIG. 6 is a diagram for explaining learning of the state value function.
- FIG. 7 is a diagram for explaining an example of calculating a reward in reinforcement learning.
- FIG. 8 is a diagram for explaining calculation of the average number of times a bicycle is used by being placed at a port.
- FIG. 9 is a diagram showing an example of information used in calculating the average number of times a bicycle is used by being placed at a port.
- FIG. 10 is a diagram for explaining calculation of the average number of impressions of a bicycle.
- FIG. 11 is a diagram showing an example of information used in calculating the average number of impressions of a bicycle.
- FIG. 12 is a flow diagram showing the processing of the learning stage.
- FIG. 13 is a flowchart showing details of the process in step S4 of FIG. 12.
- FIG. 14 is a flow diagram showing the processing of the inference stage.
- FIG. 15 is a diagram illustrating an example of the hardware configuration of the information processing device.
- the information processing device 10 includes an acquisition unit 11 and a learning and inference unit 12.
- the acquisition unit 11 is a functional unit having a function of acquiring information to be input to an action value function that estimates the value of an action related to the reallocation of vehicles among multiple ports in a shared transportation service in which vehicles are shared.
- the learning and inference unit 12 is a functional unit having a function of performing reinforcement learning on the action value function using the information acquired by the acquisition unit 11 in order to make the action value function select an action that will maximize the value of a future action.
- the vehicles shared are vehicles that can be returned to service by replacing the battery and that carry advertisements on the body; examples include rental electric bicycles, rental electric kick scooters, rental motorbikes, and rental cars. In the following, a service that shares rental electric bicycles is taken as the example, and "rental electric bicycle" is abbreviated to "bicycle."
- to provide the shared transportation service, multiple ports (here, bicycle parking areas for multiple shared bicycles) are provided at multiple locations, and multiple bicycles are parked at each port.
- Service staff use bicycle transport trucks (hereafter referred to as “relocation trucks”) to "dispatch" bicycles to a port, “collect” bicycles from a port, and “replace” the batteries installed in the bicycles, as actions related to the relocation of bicycles.
- the learning and inference unit 12 uses impressions, which are the number of times an advertisement can be viewed, as one element of the "reward" included in the value of an action, and calculates (a) a placement evaluation value for placement, (b) a recovery evaluation value for recovery, and (c) a battery replacement evaluation value for battery replacement for each port using a calculation formula described below, and performs reinforcement learning based on the (a) placement evaluation value, (b) recovery evaluation value, and (c) battery replacement evaluation value obtained for all ports.
- the learning and inference unit 12 also has the function of substituting the current state information acquired by the acquisition unit 11 into the action value function obtained by reinforcement learning, thereby obtaining the action value for each of various actions performed in the current state, and selecting the action that maximizes the obtained action value as the action to be taken. The learning-stage processing for this reinforcement learning and the inference-stage processing for inferring actions are described later with reference to FIGS. 12 to 14.
- reinforcement learning is known as one type of machine learning, which deals with the problem of an agent in an environment observing the current state and determining the future action to be taken based on a policy from the state information obtained from the observation.
- the agent obtains a reward from the environment by executing the determined action, and reinforcement learning learns a policy that will obtain the most reward through a series of actions.
- the applicant focuses on the above-mentioned reinforcement learning, and below discloses a technology that utilizes a reinforcement learning framework to optimize vehicle reallocation between multiple ports in a shared transportation service so as to maximize the reward in reinforcement learning.
- a "state” in reinforcement learning is information that an agent obtains from the environment, such as the current port state, demand forecast results, weather, distance, and restricted ports.
- An “action” is a behavior that an agent takes in the environment, such as placement, recovery, battery replacement, and which port to go to.
- a “reward” is a profit that an agent obtains in the environment, such as the “discounted cumulative reward R t ,” which will be described later with reference to FIG. 4.
- a “policy” is a function ⁇ that returns an action from a state.
- examples of information for the states S_t and S_t+1 include port status, port demand, weather, truck status, bicycle status, port location, and distance between ports.
- by inputting the state S_t into the policy function π, an action a_t is returned: for example, for "truck A" as the agent, "port A" as the next port, "-23" as the number of bicycles to place/collect, and "10" as the number of batteries to exchange. A positive placement/collection count means placing that many bicycles, and a negative count means collecting that many, so this example means collecting 23 bicycles at the next port A and exchanging 10 batteries.
- the agent (truck A) performs the action a_t to reach the state S_t+1, and by inputting the state S_t+1 into the policy function π, an action a_t+1 is returned: for example, for "truck A" as the agent, "port C" as the next port, "10" as the number of bicycles to place/collect, and "0" as the number of batteries to exchange.
- this action means that 10 bicycles will be placed at the next port C and no battery replacement will be performed (none is required).
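- As an illustrative, non-limiting sketch, the action record returned by the policy π can be pictured as a small data structure. The field names below are assumptions for illustration; the patent names these fields only informally.

```python
from dataclasses import dataclass

@dataclass
class Action:
    truck: str          # agent (relocation truck)
    next_port: str      # port to go to next
    place_collect: int  # >0: place this many bicycles, <0: collect this many
    battery_swaps: int  # number of batteries to replace at the port

# The two actions from the example above:
a_t  = Action(truck="truck_A", next_port="port_A", place_collect=-23, battery_swaps=10)
a_t1 = Action(truck="truck_A", next_port="port_C", place_collect=10, battery_swaps=0)
```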
- the "reward" in reinforcement learning is defined here as the "discounted cumulative reward R_t" of the following equation (1): R_t = R(S_t, a_t) + γR_t+1 … (1)
- γ (0 < γ ≤ 1) is a constant called the discount rate.
- the discount rate γ is applied once more for each step further into the future, as follows:
- the reward R(S_t+1, a_t+1) one step ahead is discounted to γ^1 R(S_t+1, a_t+1)
- the reward R(S_t+2, a_t+2) two steps ahead is discounted to γ^2 R(S_t+2, a_t+2)
- the reward R(S_t+3, a_t+3) three steps ahead is discounted to γ^3 R(S_t+3, a_t+3)
- accordingly, the discounted cumulative reward expands to R_t = R(S_t, a_t) + γ^1 R(S_t+1, a_t+1) + γ^2 R(S_t+2, a_t+2) + γ^3 R(S_t+3, a_t+3) + …; since the terms from γ^1 R(S_t+1, a_t+1) onward (the part enclosed by the dashed line in the bottom row of FIG. 4) correspond to γR_t+1, the above equation (1) is derived.
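- As an illustrative, non-limiting sketch, equation (1) can be computed in either its expanded or its recursive form; the reward values and γ = 0.9 below are made up for illustration.

```python
def discounted_cumulative_reward(rewards: list[float], gamma: float = 0.9) -> float:
    """Expanded form: R_t = R(S_t,a_t) + gamma*R(S_t+1,a_t+1) + gamma^2*R(S_t+2,a_t+2) + ..."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

def discounted_cumulative_reward_rec(rewards: list[float], gamma: float = 0.9) -> float:
    """Recursive form of equation (1): R_t = R(S_t,a_t) + gamma * R_t+1."""
    if not rewards:
        return 0.0
    return rewards[0] + gamma * discounted_cumulative_reward_rec(rewards[1:], gamma)

rewards = [100.0, 80.0, 120.0, 60.0]  # illustrative per-step rewards R(S_t+k, a_t+k)
assert abs(discounted_cumulative_reward(rewards)
           - discounted_cumulative_reward_rec(rewards)) < 1e-9
```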
- in reinforcement learning, to learn a better policy, a state value function that estimates the value of a state and an action value function that estimates the value of an action are defined.
- the "state value function V^π(S_t)" is a function that indicates how much discounted reward can be obtained in the future starting from the state S_t if the policy π is followed.
- the "action value function Q(S_t, a_t)" is a function that indicates how much discounted reward can be obtained in the future if a certain action a_t is taken from the state S_t, and it constitutes a part of the above state value function.
- written in the standard Bellman form consistent with the notation above, the state value function is V^π(S_t) = Σ_a_t π(a_t | S_t) Σ_S_t+1 T(S_t+1 | S_t, a_t) { R(S_t, a_t) + γV^π(S_t+1) } … (2). The latter part of the second summation, T(S_t+1 | S_t, a_t) { R(S_t, a_t) + γV^π(S_t+1) }, corresponds to the action value function Q(S_t, a_t). If the action value function can be estimated accurately, rewards can be obtained efficiently; as illustrated in FIG. 6, the function is learned so as to reduce the error between the estimated value and the value obtained from experience.
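- The excerpt does not fix a specific update algorithm for reducing this estimate-versus-experience error; as one standard, non-limiting realization, a tabular Q-learning-style temporal-difference update looks like the following sketch. The state/action labels, γ, and the learning rate are illustrative assumptions.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> current estimate of the action value
gamma, lr = 0.9, 0.1     # discount rate and learning rate (illustrative values)

def td_update(s, a, r, s_next, next_actions):
    """Move the estimate Q(s, a) toward the experienced target r + gamma * max_a' Q(s', a')."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in next_actions)
    Q[(s, a)] += lr * (target - Q[(s, a)])

td_update("port_A_surplus", "collect_23_swap_10", 150.0,
          "port_A_balanced", ["place_10", "collect_5"])
```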
- the learning and inference unit 12 uses the impression, which is the number of times an advertisement can be viewed, as one element of the "reward" included in the value of the action, and calculates for each port (a) a placement evaluation value related to placement, (b) a recovery evaluation value related to recovery, and (c) a battery replacement evaluation value related to battery replacement using a calculation formula described later, and performs reinforcement learning based on the (a) placement evaluation value, (b) recovery evaluation value, and (c) battery replacement evaluation value obtained for all ports.
- the calculation of the (a) placement evaluation value, (b) recovery evaluation value, and (c) battery replacement evaluation value will be described below with reference to Figures 7 to 11.
- (a) the placement evaluation value is calculated by the following formula (5):
placement evaluation value = (average number of times used by being placed at the port × usage fee × one-time member ratio × constant α + average number of impressions of the bicycle × bicycle visibility probability × advertising unit cost × constant β) × number of bicycles to be placed … (5)
- (b) the recovery evaluation value is calculated by the following formula (6):
recovery evaluation value = ((maximum average number of times used by being placed at another port − average number of times used by being placed at the port) × usage fee × one-time member ratio × constant α − (maximum average number of impressions of bicycles at other ports − average number of impressions of the bicycle) × bicycle visibility probability × advertising unit cost × constant β) × number of bicycles to be collected … (6)
- (c) the battery replacement evaluation value is calculated by the following formula (7):
battery replacement evaluation value = (average number of times used by being placed at the port × usage fee × one-time member ratio × constant α) × number of bicycles whose batteries are to be replaced … (7)
- each element of formulas (5) to (7) is outlined below; a short code sketch of the three formulas follows the list.
- the "average number of times a device is used by being placed in a port” will be described later with reference to FIG. 8 and FIG.
- the "average number of impressions for a bicycle” will be described later with reference to FIG. 10 and FIG.
- “Usage fee” is the usage fee for the electric bicycle sharing service, for example 165 yen.
- the "one-time member ratio" is the ratio of one-time members to the total number of members. Two types of members are assumed here: monthly members, who pay a fixed monthly amount, and one-time members, who pay the above usage fee each time they use the service.
- the one-time member ratio is calculated as: number of one-time members / (number of one-time members + number of monthly members).
- the "advertising unit cost" is the cost per impression of a dress guard advertisement placed on the body of a shared bicycle.
- the "constants α and β" are constants for adjusting importance, determined in consideration of the profit obtained by placing a bicycle at a port and the profit obtained from dress guard advertising; the default for each is "1."
- the "bicycle visibility probability" is the proportion of a bicycle's impressions for which the person is estimated to have actually viewed the advertisement.
- the "maximum average number of times used by being placed at another port" is the maximum value of the "average number of times used by being placed at a port" described later.
- the "maximum average number of impressions of bicycles at other ports" is the maximum value of the "average number of impressions of a bicycle" described later.
- the "number of bicycles to be placed," the "number of bicycles to be collected," and the "number of bicycles whose batteries are to be replaced" are information contained in the action a that is returned by inputting the state S into the policy π.
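- As an illustrative, non-limiting sketch, formulas (5) to (7) read directly as arithmetic. Parameter names below mirror the elements just listed; all numbers in the usage example are made up except the 165-yen fee mentioned above.

```python
def placement_value(avg_uses, fee, one_time_ratio, avg_impr,
                    visibility, ad_unit_cost, n_bikes, alpha=1.0, beta=1.0):
    """Formula (5): placement evaluation value for one port."""
    return (avg_uses * fee * one_time_ratio * alpha
            + avg_impr * visibility * ad_unit_cost * beta) * n_bikes

def recovery_value(max_avg_uses_other, avg_uses, fee, one_time_ratio,
                   max_avg_impr_other, avg_impr, visibility, ad_unit_cost,
                   n_bikes, alpha=1.0, beta=1.0):
    """Formula (6): recovery evaluation value for one port."""
    return ((max_avg_uses_other - avg_uses) * fee * one_time_ratio * alpha
            - (max_avg_impr_other - avg_impr) * visibility * ad_unit_cost * beta) * n_bikes

def battery_swap_value(avg_uses, fee, one_time_ratio, n_bikes, alpha=1.0):
    """Formula (7): battery replacement evaluation value for one port."""
    return avg_uses * fee * one_time_ratio * alpha * n_bikes

# Illustrative call: 165-yen fee from the text, all other numbers made up.
v = placement_value(avg_uses=4, fee=165, one_time_ratio=0.6,
                    avg_impr=50, visibility=0.2, ad_unit_cost=0.5, n_bikes=10)
```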
- the "average number of times used by leaving it at a port” is calculated by the following procedure: counting the number of times each bicycle is used from a certain time (reference time) until it is collected by a truck or the battery is replaced (step 1); and calculating the average number of times the bicycle is used by weekday/holiday, time, port, and weather (step 2).
- FIG. 8 illustrates an example of a situation in which, after the reference time of 12:00, bicycle A is used from port A to port B, then from port B to port C, then from port C to port D, and then from port D to port E, and then collected by a truck at port E.
- in the example of FIG. 8, bicycle A's battery has not been replaced since the reference time of 12:00, and bicycle A has been used four times before being collected by a truck, so in step 1 bicycle A is counted as "number of times used: 4." This counting is performed for each bicycle. In step 2, the average number of times used is calculated by weekday/holiday, time, port, and weather, as shown in the table in the lower right of FIG. 8. The weather is recorded as "sunny" if the hourly precipitation is 0 mm and "rainy" if the hourly precipitation is greater than 0 mm, and the "port" in the summary table is the port where the bicycle was first used after the reference time.
- the acquisition unit 11 extracts the usage history information for each bicycle from the usage table that stores the bicycle usage history information, and obtains the usage start date/time (reference time) and number of uses from the usage history information for bicycle A that is the subject of the calculation, as shown in FIG. 8.
- the acquisition unit 11 references the weather information at the usage start date/time to determine the weather at that time, and calculates the average number of times the bicycle is used for weekdays/holidays, time, port, and weather, as shown in the table in the lower right of FIG. 9.
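- As an illustrative, non-limiting sketch, steps 1 and 2 amount to a per-bicycle count followed by a grouped average. The table and column names below are assumptions for illustration; the patent describes the tables only conceptually.

```python
import pandas as pd

usage = pd.DataFrame({
    "bike_id":   ["A", "A", "B"],
    "port":      ["port_A", "port_A", "port_B"],  # first port used after the reference time
    "start":     pd.to_datetime(["2023-06-05 12:10", "2023-06-05 13:00", "2023-06-05 12:30"]),
    "use_count": [4, 3, 2],                       # step 1: uses until collection / battery swap
    "precip_mm": [0.0, 0.0, 1.5],
})
usage["weather"] = (usage["precip_mm"] > 0).map({False: "sunny", True: "rainy"})
usage["is_holiday"] = usage["start"].dt.dayofweek >= 5   # simplistic weekday/holiday flag
usage["hour"] = usage["start"].dt.hour

# Step 2: average number of uses per (weekday/holiday, hour, port, weather)
avg_uses = (usage.groupby(["is_holiday", "hour", "port", "weather"])["use_count"]
                 .mean().rename("avg_uses"))
```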
- the "average number of impressions for a bicycle” is calculated by tallying up the number of times an advertisement is likely to be viewed for each bicycle from a certain time (reference time) until the bicycle is collected by a truck or the battery is replaced (step 1), as described below, and then calculating the average number of impressions for the bicycle for weekdays/holidays, time, port, and weather (step 2). Note that the weather is determined in the same way as in the example of Figure 8 described above.
- the acquisition unit 11 extracts usage history information for each bicycle from a usage table that stores bicycle usage history information.
- the acquisition unit 11 also extracts people who are within a certain range of the location of the target bicycle (bicycle A in FIG. 11) from the bicycle location information and person location information, and counts up the number of people, using the obtained number as the number of impressions (the number of times the advertisement can be viewed) for the target bicycle A.
- This number of impressions is calculated for each bicycle usage history and merged with the usage history information for each bicycle extracted above, to obtain the number of impressions for each bicycle for each usage start date and time (reference time).
- the acquisition unit 11 then refers to the weather information at the usage start date and time (reference time) to determine the weather at that time, and calculates the average number of impressions for the bicycle for weekdays/holidays, time, port, and weather, as shown in the table in the lower right of FIG. 11.
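- As an illustrative, non-limiting sketch, the impression count for a bicycle reduces to counting the people within a certain range of its location. A simple haversine distance and a 50 m radius are assumptions for illustration; the patent does not specify the range or the distance measure.

```python
from math import radians, sin, cos, asin, sqrt

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(h))

def impressions(bike_pos, people_pos, radius_m=50.0):
    """Number of people within radius_m of the bicycle (its impression count)."""
    return sum(haversine_m(*bike_pos, *p) <= radius_m for p in people_pos)

bike = (35.6812, 139.7671)                         # illustrative coordinates
people = [(35.6813, 139.7670), (35.6900, 139.7000)]
print(impressions(bike, people))                   # -> 1
```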
- the acquisition unit 11 extracts the time from a certain time until collection and the time from a certain time until battery replacement for each bicycle from the bicycle usage table of FIG. 9 and FIG. 11 (step S1).
- the above "certain time” may be selected from a number of predetermined reference time candidates.
- the acquisition unit 11 calculates the average number of times the bicycle is used by being placed in a port using the procedure described above with reference to FIG. 8 and FIG. 9 (step S2).
- the acquisition unit 11 calculates the average number of times the bicycle is used for each weekday/holiday, time, port, and weather, as shown in the table at the bottom right of FIG. 9.
- the acquisition unit 11 calculates the average number of impressions using the procedure described above with reference to FIG. 10 and FIG. 11 (step S3).
- the average number of impressions of the bicycle is calculated for each weekday/holiday, time, port, and weather, as shown in the table at the bottom right of FIG. 11.
- the learning and inference unit 12 executes the accumulation process for the information (S_t, a_t, R(S_t, a_t), S_t+1) shown in FIG. 13 (step S4).
- the learning and inference unit 12 acquires the state S_t acquired by the acquisition unit 11 (step S4A in FIG. 13), and obtains an action a_t by inputting the acquired state S_t into the policy π (step S4B).
- in step S4B, multiple candidate actions a_1, a_2, …, a_n are obtained.
- the learning and inference unit 12 calculates the discounted cumulative reward R(S_t, a_t) for each of the candidate actions a_t (a_1, a_2, …, a_n) executed from the state S_t, using the formula for the discounted cumulative reward R_t (step S4C). Specifically, the "placement evaluation value," "recovery evaluation value," and "battery replacement evaluation value" described with reference to FIGS. 7 to 11 are calculated for each bicycle, and the discounted cumulative reward R(S_t, a_t) is calculated from these three evaluation values over all bicycles. The learning and inference unit 12 then determines the action a_k that maximizes the calculated discounted cumulative reward (step S4D).
- the learning and inference unit 12 performs a simulation using the action a_k obtained in step S4D to obtain the next state S_t+1 (step S4E), and accumulates the information (S_t, a_t, R(S_t, a_t), S_t+1) consisting of the state S_t, the action a_t (i.e., the action a_k with the largest discounted cumulative reward), the reward R(S_t, a_t) (i.e., the discounted cumulative reward when the action a_k is executed from the state S_t), and the next state S_t+1 obtained as described above (step S4F).
- in step S5, the learning and inference unit 12 learns the action value function Q(S_t, a_t) using the accumulated information, and the processes of steps S4 and S5 are repeated a predetermined number of times. After the loop has been repeated the predetermined number of times, the process of FIG. 12 ends.
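- As an illustrative, non-limiting sketch, one pass of steps S4A to S4F can be written as follows. The callables `get_state`, `candidate_actions`, `discounted_reward`, and `simulate` are stand-ins for the components described in the text and are not defined in the patent.

```python
def accumulate(buffer, get_state, candidate_actions, discounted_reward, simulate):
    """One pass of steps S4A-S4F; returns the next state for the following iteration."""
    s_t = get_state()                                          # S4A: observe the state
    actions = candidate_actions(s_t)                           # S4B: candidates a_1 ... a_n
    rewards = {a: discounted_reward(s_t, a) for a in actions}  # S4C: score each candidate
    a_k = max(rewards, key=rewards.get)                        # S4D: best action
    s_next = simulate(s_t, a_k)                                # S4E: simulate to get S_t+1
    buffer.append((s_t, a_k, rewards[a_k], s_next))            # S4F: accumulate transition
    return s_next
```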
- FIG. 14 shows a process flow of the inference stage.
- the acquisition unit 11 acquires state information S_t in the same manner as in the learning-stage process described above (step S11), and the learning and inference unit 12 calculates the action value for each action a_t using the action value function Q(S_t, a_t) obtained by learning (step S12). The learning and inference unit 12 then selects the action a_t with the maximum action value (step S13). In this way, the action with the highest future value is selected.
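- As an illustrative, non-limiting sketch, steps S12 and S13 are an argmax over the learned action value function. `Q` is assumed here to be a callable obtained from training; the toy values are made up.

```python
def select_action(Q, state, candidate_actions):
    """Steps S12-S13: evaluate Q(state, a) for each candidate and take the argmax."""
    return max(candidate_actions, key=lambda a: Q(state, a))

# Illustrative usage with a toy learned Q:
def toy_q(state, action):
    return {"place_10": 1.2, "collect_5": 0.7}[action]

best = select_action(toy_q, "port_C_low", ["place_10", "collect_5"])  # -> "place_10"
```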
- through the above reinforcement learning, the action value function is trained into one that selects the action with the highest future value. Based on this action value function, the value of actions related to vehicle reallocation between multiple ports is estimated, and control is exercised so that the action with the highest future value is selected. Therefore, when reallocating vehicles between multiple ports in a shared transportation service, vehicle reallocation can be optimized to maximize the reward in reinforcement learning.
- the actions assumed for the reallocation of bicycles shared in a shared transportation service are "placement,” “collection,” and “battery replacement.”
- Appropriate reinforcement learning is performed based on evaluation values from multiple aspects, namely placement, collection, and battery replacement, which allows for more appropriate optimization of vehicle reallocation and maximizes rewards.
- the bicycles shared in the shared transportation service have advertisements displayed on the body of the bicycle, and the learning and inference unit 12 calculates the placement evaluation value and the collection evaluation value while taking into account impressions, which are the number of times an advertisement can be viewed, so that rewards from advertising can be maximized.
- in the above embodiment, rental electric bicycles have been given as an example of the vehicles shared in the shared transportation service, but any vehicle that can be returned to service by replacing its battery and that carries advertisements on its body can be used; the present disclosure can also be applied to other types of vehicles such as rental electric kick scooters, rental motorbikes, and rental cars.
- the present disclosure can be applied to a wide range of vehicles, including rental electric kick scooters, which have become increasingly popular in recent years, and has the potential to be widely used by users.
- the action that maximizes the action value can be selected as the action to be taken, thereby maximizing the reward (profit) for the service.
- the gist of the present disclosure lies in the following [1] to [5].
- [1] An information processing device comprising: an acquisition unit that acquires information to be input into an action value function that estimates the value of an action related to the reallocation of vehicles among a plurality of ports in a sharing transportation service in which vehicles are shared; and a learning and inference unit that performs reinforcement learning on the action value function using the information acquired by the acquisition unit in order to make the action value function select an action that will have the highest value for a future action.
- [2] The information processing device according to [1], wherein a vehicle shared in the sharing transportation service is a vehicle that can be reused by replacing the installed battery; the actions related to the relocation of the vehicle include the deployment of the vehicle to a port, the recovery of the vehicle from a port, and the replacement of a battery installed in the vehicle; and the learning and inference unit uses a predetermined calculation formula to calculate, for each port, a placement evaluation value for the placement, a recovery evaluation value for the recovery, and a battery replacement evaluation value for the battery replacement as rewards included in the value of the action.
- [3] The information processing device according to [2], wherein a vehicle shared in the sharing transportation service is a vehicle on which advertisements are displayed, and the learning and inference unit calculates the placement evaluation value and the recovery evaluation value using impressions, which are the number of times an advertisement can be viewed, as one element.
- [4] The information processing device according to [3], wherein the vehicles shared in the sharing transportation service are rental electric bicycles, rental electric kick scooters, rental motorbikes, or rental cars that can be returned to service by replacing the batteries and have advertisements displayed on the vehicle bodies.
- [5] The information processing device according to any one of [1] to [4], wherein the learning and inference unit substitutes the current state information acquired by the acquisition unit into the action value function obtained by the reinforcement learning to obtain action values for various actions taken in the current state, and selects the action that maximizes the obtained action value as the action to be taken.
- each functional block may be realized using one device that is physically or logically coupled, or may be realized using two or more devices that are physically or logically separated and directly or indirectly connected (for example, using wires, wirelessly, etc.).
- the functional blocks may be realized by combining the one device or the multiple devices with software.
- Functions include, but are not limited to, judgment, determination, calculation, computation, processing, derivation, investigation, search, confirmation, reception, transmission, output, access, resolution, selection, establishment, comparison, assumption, expectation, regarding, broadcasting, notifying, communicating, forwarding, configuring, reconfiguring, allocating, mapping, and assignment.
- a functional block (component) that performs the transmission function is called a transmitting unit or a transmitter.
- an information processing device in an embodiment of the present disclosure may function as a computer that executes the processing of the present disclosure.
- FIG. 15 is a diagram showing an example of the hardware configuration of an information processing device 10 according to an embodiment of the present disclosure.
- the information processing device 10 described above may be physically configured as a computer device including a processor 1001, memory 1002, storage 1003, a communication device 1004, an input device 1005, an output device 1006, a bus 1007, etc.
- the word “apparatus” can be interpreted as a circuit, device, unit, etc.
- the hardware configuration of the information processing device 10 may be configured to include one or more of the devices shown in the figure, or may be configured to exclude some of the devices.
- Each function of the information processing device 10 is realized by loading a specific software (program) onto hardware such as the processor 1001 and memory 1002, causing the processor 1001 to perform calculations, control communications via the communication device 1004, and control at least one of the reading and writing of data in the memory 1002 and storage 1003.
- the processor 1001 for example, runs an operating system to control the entire computer.
- the processor 1001 may be configured as a central processing unit (CPU) that includes an interface with peripheral devices, a control device, an arithmetic unit, registers, etc.
- the processor 1001 also reads out programs (program codes), software modules, data, etc. from at least one of the storage 1003 and the communication device 1004 into the memory 1002, and executes various processes according to these.
- the programs used are those that cause a computer to execute at least some of the operations described in the above-mentioned embodiments. Although it has been described that the various processes are executed by one processor 1001, they may be executed simultaneously or sequentially by two or more processors 1001.
- the processor 1001 may be implemented by one or more chips.
- the programs may be transmitted from a network via a telecommunications line.
- Memory 1002 is a computer-readable recording medium, and may be composed of at least one of, for example, ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), RAM (Random Access Memory), etc. Memory 1002 may also be called a register, cache, main memory (primary storage device), etc. Memory 1002 can store executable programs (program codes), software modules, etc. for implementing a wireless communication method according to one embodiment of the present disclosure.
- Storage 1003 is a computer-readable recording medium, and may be, for example, at least one of an optical disk such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disk, a magneto-optical disk (e.g., a compact disk, a digital versatile disk, a Blu-ray (registered trademark) disk), a smart card, a flash memory (e.g., a card, a stick, a key drive), a floppy (registered trademark) disk, a magnetic strip, etc.
- Storage 1003 may also be referred to as an auxiliary storage device.
- the above-mentioned storage medium may be, for example, a database, a server, or other suitable medium including at least one of memory 1002 and storage 1003.
- the communication device 1004 is hardware (transmitting/receiving device) for communicating between computers via at least one of a wired network and a wireless network, and is also referred to as, for example, a network device, a network controller, a network card, a communication module, etc.
- the communication device 1004 may be configured to include a high-frequency switch, a duplexer, a filter, a frequency synthesizer, etc., to realize, for example, at least one of Frequency Division Duplex (FDD) and Time Division Duplex (TDD).
- the input device 1005 is an input device (e.g., a keyboard, a mouse, a microphone, a switch, a button, a sensor, etc.) that accepts input from the outside.
- the output device 1006 is an output device (e.g., a display, a speaker, an LED lamp, etc.) that performs output to the outside. Note that the input device 1005 and the output device 1006 may be integrated into one structure (e.g., a touch panel).
- each device such as the processor 1001 and memory 1002 is connected by a bus 1007 for communicating information.
- the bus 1007 may be configured using a single bus, or may be configured using different buses between each device.
- the information processing device 10 may be configured to include hardware such as a microprocessor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a programmable logic device (PLD), or a field programmable gate array (FPGA), and some or all of the functional blocks may be realized by the hardware.
- the processor 1001 may be implemented using at least one of these pieces of hardware.
- the notification of information is not limited to the aspects/embodiments described in this disclosure, and may be performed using other methods.
- the notification of information may be performed by physical layer signaling (e.g., DCI (Downlink Control Information), UCI (Uplink Control Information)), higher layer signaling (e.g., RRC (Radio Resource Control) signaling, MAC (Medium Access Control) signaling, broadcast information (MIB (Master Information Block), SIB (System Information Block))), other signals, or a combination of these.
- RRC signaling may be referred to as an RRC message, and may be, for example, an RRC Connection Setup message, an RRC Connection Reconfiguration message, etc.
- each aspect/embodiment described in this disclosure may be applied to at least one of a mobile communication system compatible with LTE (Long Term Evolution), LTE-A (LTE-Advanced), SUPER 3G, IMT-Advanced, 4G (4th generation mobile communication system), 5G (5th generation mobile communication system), 6G (6th generation mobile communication system), xG (xth generation mobile communication system, where x is, for example, an integer or a decimal), FRA (Future Radio Access), systems using IEEE 802.11 (Wi-Fi (registered trademark)), IEEE 802.16 (WiMAX (registered trademark)), IEEE 802.20, UWB (Ultra-WideBand), Bluetooth (registered trademark), other appropriate systems, and next-generation systems that are expanded, modified, created, or defined based on these. It may also be applied to a combination of multiple systems (for example, a combination of at least one of LTE and LTE-A with 5G).
- the input and output information may be stored in a specific location (e.g., memory) or may be managed using a management table.
- the input and output information may be overwritten, updated, or added to.
- the output information may be deleted.
- the input information may be sent to another device.
- the determination may be based on a value represented by one bit (0 or 1), a Boolean value (true or false), or a numerical comparison (e.g., a comparison with a predetermined value).
- notification of specific information is not limited to being done explicitly, but may be done implicitly (e.g., not notifying the specific information).
- Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, etc., whether referred to as software, firmware, middleware, microcode, hardware description language, or otherwise.
- Software, instructions, information, etc. may also be transmitted and received via a transmission medium.
- a transmission medium For example, if the software is transmitted from a website, server, or other remote source using at least one of wired technologies (such as coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL)), and/or wireless technologies (such as infrared, microwave), then at least one of these wired and wireless technologies is included within the definition of a transmission medium.
- wired technologies such as coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL)
- wireless technologies such as infrared, microwave
- the information, signals, etc. described in this disclosure may be represented using any of a variety of different technologies.
- the data, instructions, commands, information, signals, bits, symbols, chips, etc. that may be referred to throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination thereof.
- the channel and the symbol may be a signal (signaling).
- the signal may be a message.
- the component carrier (CC) may be called a carrier frequency, a cell, a frequency carrier, etc.
- "system" and "network" are used interchangeably.
- a radio resource may be indicated by an index.
- the names used for the parameters described above are not intended to be limiting in any way. Furthermore, the formulas etc. using these parameters may differ from those explicitly disclosed in this disclosure.
- the various channels (e.g., PUCCH, PDCCH, etc.) and information elements may be identified by any suitable names, and therefore the various names assigned to these various channels and information elements are not intended to be limiting in any way.
- "determining" may encompass a wide variety of actions.
- "judging" and "determining" may include, for example, regarding judging, calculating, computing, processing, deriving, investigating, looking up or searching (e.g., searching in a table, a database, or another data structure), and ascertaining as having "judged" or "determined."
- "judging" and "determining" may also include regarding receiving (e.g., receiving information), transmitting (e.g., transmitting information), input, output, and accessing (e.g., accessing data in memory) as having "judged" or "determined."
- "judging" and "determining" may further include regarding resolving, selecting, choosing, establishing, comparing, and the like as having "judged" or "determined"; in other words, "judging" and "determining" may include regarding some action as having been "judged" or "determined." In addition, "judgment (decision)" may be interpreted as "assuming," "expecting," "considering," and the like.
- the phrase “based on” does not mean “based only on,” unless expressly stated otherwise. In other words, the phrase “based on” means both “based only on” and “based at least on.”
- any reference to an element using a designation such as "first,” “second,” etc., used in this disclosure does not generally limit the quantity or order of those elements. These designations may be used in this disclosure as a convenient method of distinguishing between two or more elements. Thus, a reference to a first and a second element does not imply that only two elements may be employed or that the first element must precede the second element in some way.
- the phrase "A and B are different" may mean "A and B are different from each other."
- the term may also mean “A and B are each different from C.”
- Terms such as “separate” and “combined” may also be interpreted in the same way as “different.”
- 10: Information processing device, 11: Acquisition unit, 12: Learning and inference unit, 1001: Processor, 1002: Memory, 1003: Storage, 1004: Communication device, 1005: Input device, 1006: Output device, 1007: Bus.
Landscapes
- Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Economics (AREA)
- General Health & Medical Sciences (AREA)
- Human Resources & Organizations (AREA)
- Marketing (AREA)
- Primary Health Care (AREA)
- Strategic Management (AREA)
- Tourism & Hospitality (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
本開示は、シェアリング交通サービスにおける複数のポート間での車両の再配置の最適化を行う情報処理装置に関する。なお、「ポート」とは、シェアされる複数の車両を停めるためのスペース(例えば、シェアされる複数の電動自転車を停めるための駐輪場など)を意味し、シェアリング交通サービスを提供するために、複数のポートが複数の場所に設けられている。 The present disclosure relates to an information processing device that optimizes the reallocation of vehicles among multiple ports in a shared transportation service. Note that a "port" refers to a space for parking multiple shared vehicles (such as a bicycle parking lot for parking multiple shared electric bicycles), and multiple ports are provided in multiple locations to provide a shared transportation service.
従来より、車両のシェアリングサービスに関し、複数のポートのうち、車両が余っているポートから車両を回収し、回収された車両を車両が不足しているポートに配置することで、サービス全体での車両の利用率を向上させる技術が知られている(特許文献1参照)。 There is a known technology for vehicle sharing services that improves the vehicle utilization rate throughout the entire service by collecting vehicles from a port with a surplus of vehicles among multiple ports and placing the collected vehicles in a port with a shortage of vehicles (see Patent Document 1).
上記の車両のシェアリングサービスは民間企業により提供されるサービスであるため、単に、ポート間で不足自転車を適切に補充し合うのみではなく、サービス全体としての報酬(利益)を最大化するように運用することが望まれる。 The vehicle sharing services mentioned above are provided by private companies, so it is desirable to operate them in a way that maximizes the rewards (profits) of the service as a whole, rather than simply allowing ports to replenish bicycle shortages appropriately.
しかし、シェアリングサービスのサービス全体としての報酬(利益)を最大化することに主眼を置いて、車両の再配置を行う技術は提案されていない。 However, no technology has been proposed that reallocates vehicles with a primary focus on maximizing the overall reward (profit) of the sharing service.
本開示は、上記のような事情を考慮し、シェアリング交通サービスにおける複数のポート間での車両の再配置において報酬(利益)を最大化するように車両の再配置の最適化を行うことを目的とする。 The present disclosure takes into consideration the above circumstances and aims to optimize vehicle reallocation between multiple ports in a shared transportation service so as to maximize rewards (profits).
機械学習の1つとして、ある環境におけるエージェントが、現在の状態を観測し、観測で得られた状態情報から、方策(policy)に基づいて将来取るべき行動を決定する問題を扱う強化学習(reinforcement learning)が知られている。上記エージェントは、決定された行動を実行することで環境から報酬を得るが、強化学習は、一連の行動を通じて報酬が最も多く得られるような方策を学習する。 One type of machine learning known as reinforcement learning is the problem of an agent in an environment observing the current state and determining the future actions to be taken based on a policy from the state information obtained from the observation. The agent obtains rewards from the environment by executing the determined actions, and reinforcement learning learns the policy that will obtain the greatest rewards through a series of actions.
出願人は、上記のような強化学習に着目し、強化学習の枠組みを利用して、シェアリング交通サービスにおける複数のポート間での車両の再配置において強化学習における報酬を最大化するように車両の再配置の最適化を行う発明をしたので、本明細書にて開示する。 The applicant has focused on the above-mentioned reinforcement learning and has devised an invention that utilizes a reinforcement learning framework to optimize vehicle reallocation between multiple ports in a shared transportation service so as to maximize the reward in reinforcement learning, and discloses this invention in the present specification.
本開示に係る情報処理装置は、車両をシェアするシェアリング交通サービスにおける複数のポート間での車両の再配置に関する行動の価値を見積もる行動価値関数に入力される情報を取得する取得部と、将来の行動の価値が最も高くなる行動を選択する前記行動価値関数にするために、前記取得部により取得された情報を用いて前記行動価値関数に対し強化学習を行う学習推論部と、を備える。 The information processing device according to the present disclosure includes an acquisition unit that acquires information to be input into an action value function that estimates the value of an action related to the reallocation of vehicles between multiple ports in a shared transportation service in which vehicles are shared, and a learning and inference unit that performs reinforcement learning on the action value function using the information acquired by the acquisition unit in order to make the action value function select an action that will maximize the value of a future action.
上記の情報処理装置では、取得部が、シェアリング交通サービスにおける複数のポート間での車両の再配置に関する行動の価値を見積もる行動価値関数に入力される情報を取得する。当該情報の例は、発明の実施形態にて詳述するが、ここでの「取得」は、情報を入手することに加え、予め定められた数式等を用いて情報を算出することも広く含む。学習推論部は、上記の行動価値関数を、将来の行動の価値が最も高くなる行動を選択する行動価値関数にするために、取得部により取得された情報を用いて行動価値関数に対し強化学習を行う。これにより、強化学習によって、行動価値関数は、将来の行動の価値が最も高くなる行動を選択するような行動価値関数に学習されていく。 In the above information processing device, the acquisition unit acquires information to be input into an action value function that estimates the value of an action related to the reallocation of vehicles between multiple ports in a shared transportation service. Examples of the information will be described in detail in the embodiments of the invention, but "acquisition" here broadly includes not only obtaining information but also calculating information using a predetermined formula or the like. The learning and inference unit performs reinforcement learning on the action value function using the information acquired by the acquisition unit to turn the above action value function into an action value function that selects an action that will have the highest value for a future action. As a result, through reinforcement learning, the action value function learns to become an action value function that selects an action that will have the highest value for a future action.
このような行動価値関数に基づき複数のポート間での車両の再配置に関する行動の価値が見積もられ、将来の行動の価値が最も高くなる行動が選択されるよう制御されるため、シェアリング交通サービスにおける複数のポート間での車両の再配置において、強化学習における報酬を最大化するように車両の再配置の最適化を行うことができる。 The value of actions related to vehicle reallocation between multiple ports is estimated based on such an action value function, and the action that will have the highest value in the future is selected. Therefore, when reallocating vehicles between multiple ports in a shared transportation service, vehicle reallocation can be optimized to maximize the reward in reinforcement learning.
本開示によれば、シェアリング交通サービスにおける複数のポート間での車両の再配置において、強化学習における報酬を最大化するように車両の再配置の最適化を行うことができる。 According to the present disclosure, when reallocating vehicles between multiple ports in a shared transportation service, it is possible to optimize the reallocation of vehicles so as to maximize rewards in reinforcement learning.
以下、図面を参照しながら、シェアリング交通サービスにおける複数のポート間での車両の再配置の最適化を行う情報処理装置の一実施形態を説明する。 Below, we will explain one embodiment of an information processing device that optimizes vehicle reallocation between multiple ports in a shared transportation service, with reference to the drawings.
(情報処理装置の構成)
図1に示すように、情報処理装置10は、取得部11、および、学習推論部12を備える。このうち取得部11は、車両をシェアするシェアリング交通サービスにおける複数のポート間での車両の再配置に関する行動の価値を見積もる行動価値関数に入力される情報を取得する機能を備えた機能部である。学習推論部12は、将来の行動の価値が最も高くなる行動を選択する行動価値関数にするために、取得部11によって取得された情報を用いて行動価値関数に対し強化学習を行う機能を備えた機能部である。
(Configuration of information processing device)
As shown in Fig. 1, the
本実施形態におけるシェアリング交通サービスにおいてシェアされる車両は、バッテリ交換により再稼働可能とされ且つ車体に広告が掲載される車両であり、レンタル用電動自転車、レンタル用電動キックボード、レンタルバイク、レンタカーなどが挙げられる。以下では、上記のうち「レンタル用電動自転車」をシェアするサービスを例にして説明するが、「レンタル用電動自転車」は「自転車」と略称する。 In the shared transportation service of this embodiment, the vehicles shared are vehicles that can be restarted by replacing the battery and have advertisements placed on the body, and examples of such vehicles include rental electric bicycles, rental electric kick scooters, rental motorbikes, and rental cars. In the following, we will use as an example a service for sharing "rental electric bicycles," which will be abbreviated to "bicycles."
また、シェアリング交通サービスを提供するために、複数のポート(ここでは、シェアされる複数の自転車を停めるための駐輪場)が複数の場所に設けられており、各ポートには複数の自転車が停められている。サービス員は、自転車の再配置に関する行動として、自転車輸送用のトラック(以下「再配置トラック」という)を用いて、あるポートへの自転車の「配置」、あるポートからの自転車の「回収」、および、自転車に搭載されたバッテリの「バッテリ交換」を行う。 Furthermore, to provide a shared transportation service, multiple ports (here, bicycle parking areas for multiple shared bicycles) are provided in multiple locations, and multiple bicycles are parked at each port. Service staff use bicycle transport trucks (hereafter referred to as "relocation trucks") to "dispatch" bicycles to a port, "collect" bicycles from a port, and "replace" the batteries installed in the bicycles, as actions related to the relocation of bicycles.
上記のようなシェアリング交通サービスを想定した上で、学習推論部12は、行動の価値に含まれる「報酬」として、広告が視認されうる回数であるインプレッションを一要素としつつ、後述する算出式を用いて、(a)配置に係る配置評価値、(b)回収に係る回収評価値、および、(c)バッテリ交換に係るバッテリ交換評価値を各ポートについて算出し、全ポートについて得られた(a)配置評価値、(b)回収評価値、および、(c)バッテリ交換評価値に基づいて、強化学習を行う。
Assuming a shared transportation service as described above, the learning and
また、強化学習を行った後の「行動を推論する段階」では、学習推論部12は、取得部11により取得された現時点の状態情報を、強化学習により得られた行動価値関数に代入することで、現時点の状態で様々な行動のそれぞれを行った場合の行動価値を取得し、得られた行動価値が最大となる行動を、取るべき行動として選択する機能を有する。
In addition, in the "stage of inferring an action" after reinforcement learning, the
上記の強化学習に係る「学習段階の処理」、および、行動の推論に係る「推論段階の処理」は、図12~図14を用いて、後述する。 The "learning stage processing" related to the above reinforcement learning and the "inference stage processing" related to behavior inference will be described later using Figures 12 to 14.
(強化学習の概要と学習方法の補足的説明(図2~図6))
前述したように、機械学習の1つとして、ある環境におけるエージェントが、現在の状態を観測し、観測で得られた状態情報から、方策(policy)に基づいて将来取るべき行動を決定する問題を扱う強化学習(reinforcement learning)が知られている。上記エージェントは、決定された行動を実行することで環境から報酬を得るが、強化学習は、一連の行動を通じて報酬が最も多く得られるような方策を学習する。
(Supplementary explanation of reinforcement learning overview and learning method (Figures 2 to 6))
As mentioned above, reinforcement learning is known as one type of machine learning, which deals with the problem of an agent in an environment observing the current state and determining the future action to be taken based on a policy from the state information obtained from the observation. The agent obtains a reward from the environment by executing the determined action, and reinforcement learning learns a policy that will obtain the most reward through a series of actions.
出願人は、上記のような強化学習に着目し、以下では、強化学習の枠組みを利用して、シェアリング交通サービスにおける複数のポート間での車両の再配置において強化学習における報酬を最大化するように車両の再配置の最適化を行う技術を開示する。 The applicant focuses on the above-mentioned reinforcement learning, and below discloses a technology that utilizes a reinforcement learning framework to optimize vehicle reallocation between multiple ports in a shared transportation service so as to maximize the reward in reinforcement learning.
図2に示すように、強化学習における「状態(State)」は、エージェントが環境から得た情報であり、例えば、現在のポートの状態、需要予測結果、天候、距離、制約ポートなどの情報が例示される。「行動(Action)」は、エージェントが環境でとる行動であり、例えば、配置、回収、バッテリ交換、どこのポートに行くか、などが例示される。「報酬(Reward)」は、エージェントが環境で得た利益であり、ここでは、図4を用いて後述する「割引累積報酬Rt」が挙げられる。「方策(Policy)」は、状態から行動を返す関数πである。 As shown in FIG. 2, a "state" in reinforcement learning is information that an agent obtains from the environment, such as the current port state, demand forecast results, weather, distance, and restricted ports. An "action" is a behavior that an agent takes in the environment, such as placement, recovery, battery replacement, and which port to go to. A "reward" is a profit that an agent obtains in the environment, such as the "discounted cumulative reward R t ," which will be described later with reference to FIG. 4. A "policy" is a function π that returns an action from a state.
図2の下段に示すように、環境から得られた状態Stを、方策に係る関数πに入力すると、行動atが返される。そこで、状態Stにおいてエージェントが行動atを実施すると、状態St+1に至る。状態St+1に遷移する確率(状態遷移確率)は、T(St+1|St, at)と表記され、状態Stにおいてエージェントが行動atを実施した場合の報酬は報酬関数R(St, at)と表記される。 As shown in the lower part of Figure 2, when state S t obtained from the environment is input to function π related to the policy, action a t is returned. Therefore, when the agent performs action a t in state S t , it reaches state S t+1 . The probability of transitioning to state S t+1 (state transition probability) is expressed as T(S t+1 | S t , a t ), and the reward when the agent performs action a t in state S t is expressed as reward function R(S t , a t ).
図3に示すように、状態St、St+1の情報例としては、ポートの状態、ポートの需要、天候、トラックの状態、自転車の状態、ポートの位置、ポート間の距離などの情報が挙げられる。状態Stを、方策に係る関数πに入力することで、行動atとして、例えばエージェントとしての「トラックA」に関し「次に行くポート」として「ポートA」、「自転車の配置・回収数」として「-23」、「バッテリ交換数」として「10」といった行動が返される。なお、自転車の配置・回収数が正の値である場合は、その数だけ自転車を配置することを意味し、負の値である場合は、その数だけ自転車を回収することを意味するため、上記例は、次に行くポートAにおいて自転車を23台回収し、10台分のバッテリを交換する行動を意味する。状態Stにおいてエージェント(トラックA)が行動atを実施して状態St+1に至り、状態St+1を、方策に係る関数πに入力することで、行動at+1として、例えばエージェントとしての「トラックA」に関し「次に行くポート」として「ポートC」、「自転車の配置・回収数」として「10」、「バッテリ交換数」として「0」といった行動が返される。この行動は、次に行くポートCにおいて自転車を10台配置し、バッテリ交換は行わない(交換不要である)ことを意味する。 As shown in Fig. 3, examples of information for states S t and S t+1 include port status, port demand, weather, truck status, bicycle status, port location, and distance between ports. By inputting state S t into function π related to the policy, as action a t , for example, for "truck A" as an agent, actions such as "port A" as the "next port,""-23" as the number of bicycles placed/recovered, and "10" as the number of batteries exchanged are returned. Note that when the number of bicycles placed/recovered is a positive value, it means that the number of bicycles will be placed, and when it is a negative value, it means that the number of bicycles will be recovered, so the above example means that 23 bicycles will be recovered at the next port A and 10 batteries will be replaced. In state S t, an agent (truck A) performs action a t to reach state S t+1 , and by inputting state S t+1 into function π related to the policy, as action a t+1 , for example, for "truck A" as an agent, actions such as "port C" as the "next port,""10" as the number of bicycles placed/collected, and "0" as the number of batteries replaced are returned. This action means that 10 bicycles will be placed at the next port C, and no battery replacement will be performed (no replacement is required).
図4に示すように、本件では、強化学習における「報酬」は「割引累積報酬Rt」として以下の式(1)のように定義される。
1世代後の報酬R(St+1, at+1)は、γ1R(St+1, at+1)
2世代後の報酬R(St+2, at+2)は、γ2R(St+2, at+2)
3世代後の報酬R(St+3, at+3)は、γ3R(St+3, at+3)
となり、ここで、
割引累積報酬Rt
=γ1R(St+1, at+1)+γ1R(St+1, at+1)+γ1R(St+1, at+1)+γ1R(St+1, at+1)+…
と表され、図4の最下行にて破線で囲んだ部分はγRt+1に相当するため、上記式(1)が導かれる。
As shown in FIG. 4, in this case, the “reward” in reinforcement learning is defined as the “discounted cumulative reward Rt” as shown in the following equation (1).
The reward R(S t+1 , a t+1 ) after one generation is γ 1 R(S t+1 , a t+1 )
The reward R(S t+2 , a t+2 ) after two generations is γ 2 R(S t+2 , a t+2 )
The reward R(S t+3 , a t+3 ) after three generations is γ 3 R(S t+3 , a t+3 )
Here,
Discounted cumulative reward R t
=γ 1 R(S t+1 , a t+1 )+γ 1 R(S t+1 , a t+1 )+γ 1 R(S t+1 , a t+1 )+γ 1 R(S t+1 , a t+1 )+…
Since the part enclosed by the dashed line in the bottom row of FIG. 4 corresponds to γR t+1 , the above formula (1) is derived.
また、図5に示すように、強化学習では、よりよい方策を学習するために、状態の価値を見積もる状態価値関数、および、行動の価値を見積もる行動価値関数を定義する。ここで、「状態価値関数Vπ(St)」は、その方策πに従えば、その状態Stからスタートして将来どれだけの割引報酬を得られるかを表す関数である。「行動価値関数」は、状態Stから、ある行動をとった場合に将来どれだけの割引報酬を得られるかを表す関数であり、上記の状態価値関数の一部を構成する。
そこで、図6に示すように、見積もり値と経験値の誤差
(強化学習における報酬の算出例の説明(図7~図11))
前述したように、学習推論部12は、行動の価値に含まれる「報酬」として、広告が視認されうる回数であるインプレッションを一要素としつつ、後述する算出式を用いて、(a)配置に係る配置評価値、(b)回収に係る回収評価値、および、(c)バッテリ交換に係るバッテリ交換評価値を各ポートについて算出し、全ポートについて得られた(a)配置評価値、(b)回収評価値、および、(c)バッテリ交換評価値に基づいて、強化学習を行う。以下では、図7~図11を用いて、(a)配置評価値、(b)回収評価値、および、(c)バッテリ交換評価値の算出について説明する。
(Explanation of Reward Calculation Example in Reinforcement Learning (FIGS. 7 to 11))
As described above, the learning and
(a) As also shown in FIG. 7, the placement evaluation value is calculated by equation (5):

Placement evaluation value = (average number of uses when placed at the port × usage fee × one-time member ratio × constant α + average number of bicycle impressions × bicycle visibility probability × advertising unit price × constant β) × number of bicycles to be placed   (5)
(b) As also shown in FIG. 7, the recovery evaluation value is calculated by equation (6):

Recovery evaluation value = ((maximum average number of uses when placed at another port − average number of uses when placed at this port) × usage fee × one-time member ratio × constant α − (maximum average number of bicycle impressions at another port − average number of bicycle impressions) × bicycle visibility probability × advertising unit price × constant β) × number of bicycles to be recovered   (6)
(c) As also shown in FIG. 7, the battery exchange evaluation value is calculated by equation (7):

Battery exchange evaluation value = (average number of uses when placed at the port × usage fee × one-time member ratio × constant α) × number of bicycles whose batteries are exchanged   (7)
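As a concrete reading of equations (5) to (7), the following is a minimal Python sketch. Only the formula structure follows the text; every function and parameter name, and all sample numbers except the 165-yen fee cited below, are illustrative assumptions.

```python
# Illustrative computation of equations (5)-(7). Names and sample values
# are assumptions for this sketch; only the formula structure is from the text.

def placement_value(avg_uses, fee, one_time_ratio, avg_impressions,
                    visibility_prob, ad_unit_price, n_placed,
                    alpha=1.0, beta=1.0):
    """Equation (5): value of placing n_placed bicycles at a port."""
    per_bike = (avg_uses * fee * one_time_ratio * alpha
                + avg_impressions * visibility_prob * ad_unit_price * beta)
    return per_bike * n_placed

def recovery_value(avg_uses, max_avg_uses_elsewhere, fee, one_time_ratio,
                   avg_impressions, max_avg_impressions_elsewhere,
                   visibility_prob, ad_unit_price, n_recovered,
                   alpha=1.0, beta=1.0):
    """Equation (6): value of recovering n_recovered bicycles from a port."""
    per_bike = ((max_avg_uses_elsewhere - avg_uses) * fee * one_time_ratio * alpha
                - (max_avg_impressions_elsewhere - avg_impressions)
                * visibility_prob * ad_unit_price * beta)
    return per_bike * n_recovered

def battery_exchange_value(avg_uses, fee, one_time_ratio, n_exchanged, alpha=1.0):
    """Equation (7): value of exchanging batteries on n_exchanged bicycles."""
    return avg_uses * fee * one_time_ratio * alpha * n_exchanged

# Example with made-up numbers (only the 165-yen fee is cited in the text).
print(placement_value(avg_uses=3.2, fee=165, one_time_ratio=0.4,
                      avg_impressions=50, visibility_prob=0.1,
                      ad_unit_price=2.0, n_placed=10))
```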
Next, each element of equations (5) to (7) is outlined.
- The "average number of uses when placed at a port" is described later with reference to FIGS. 8 and 9.
- The "average number of bicycle impressions" is described later with reference to FIGS. 10 and 11.
- The "usage fee" is the fee for using the electric bicycle sharing service, for example 165 yen.
- The "one-time member ratio" is the proportion of one-time members among all members. Two types of members are assumed here: monthly members, who pay a fixed monthly amount, and one-time members, who pay the above usage fee each time they ride. The ratio is calculated as (number of one-time members) / (number of one-time members + number of monthly members).
- The "advertising unit price" is the price per impression of the dress guard advertisement displayed on the body of a shared bicycle.
- The "constants α and β" are importance-adjustment constants set in consideration of the profit obtained by placing bicycles at ports and the profit obtained from the dress guard advertisements; both default to 1.
- The "bicycle visibility probability" is the proportion of bicycle impressions for which the person is estimated to have actually viewed the advertisement.
- The "maximum average number of uses when placed at another port" is the maximum of the "average number of uses when placed at a port" described later.
- The "maximum average number of bicycle impressions at another port" is the maximum of the "average number of bicycle impressions" described later.
- The "number of bicycles to be placed", "number of bicycles to be recovered", and "number of bicycles whose batteries are exchanged" are information contained in the action a returned by inputting state S into the policy π.
As shown in FIG. 8, the "average number of uses when placed at a port" is obtained in two steps: counting, for each bicycle, how many times it was used between a certain time (the reference time) and the moment it was collected by a truck or had its battery exchanged (step 1), and then calculating the average number of uses per bicycle for each combination of weekday/holiday, time, port, and weather (step 2). FIG. 8 illustrates a situation in which, after the reference time of 12:00, bicycle A is ridden from port A to port B, then from port B to port C, then from port C to port D, then from port D to port E, and is finally collected by a truck at port E. In this situation, bicycle A's battery is not exchanged after the reference time of 12:00 and the bicycle is used four times before collection, so step 1 tallies "number of uses: 4" for bicycle A. This tally is made for every bicycle. In step 2, as in the table at the lower right of FIG. 8, the average number of uses is calculated for each combination of weekday/holiday, time, port, and weather. Here, the weather is classified, for example, as "clear" when the hourly precipitation is 0 mm and as "rain" when it exceeds 0 mm, and the "port" in the tally table is the port where the bicycle was first used after the reference time.
More specifically, as shown in FIG. 9, the acquisition unit 11 extracts usage history information for each bicycle from the usage table that stores bicycle usage histories, and determines the usage start date and time (reference time) and the number of uses for the target bicycle A from its usage history information, as in FIG. 8. The acquisition unit 11 then determines the weather at that point by referring to weather information for the usage start time, and calculates the average number of uses per bicycle for each combination of weekday/holiday, time, port, and weather, as in the table at the lower right of FIG. 9.
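The following is a sketch of steps 1 and 2 of FIGS. 8 and 9 using pandas. The column names ("bike_id", "start_time", "start_port", "episode_id") are assumptions, since the embodiment does not spell out the usage-table schema; "episode_id" here stands for the span between the reference time and the next truck collection or battery exchange, and the weekend-equals-holiday rule is a simplification.

```python
# Sketch of FIG. 8/9, steps 1-2, under assumed column names.
import pandas as pd

def avg_uses_table(trips: pd.DataFrame, rain_mm: pd.Series) -> pd.DataFrame:
    """trips: one row per ride; rain_mm: hourly precipitation indexed by hour."""
    t = trips.sort_values("start_time").copy()
    # Step 1: count rides per bicycle between the reference time and the
    # next truck collection / battery exchange (marked by "episode_id").
    counts = (t.groupby(["bike_id", "episode_id"])
                .agg(uses=("start_time", "size"),
                     start=("start_time", "min"),
                     port=("start_port", "first"))  # first port used after the reference time
                .reset_index())
    # Step 2: average by weekday/holiday, hour, port, and weather.
    counts["day_type"] = counts["start"].dt.dayofweek.map(
        lambda d: "holiday" if d >= 5 else "weekday")  # simplification: weekend = holiday
    counts["hour"] = counts["start"].dt.hour
    counts["weather"] = counts["hour"].map(
        lambda h: "rain" if rain_mm.get(h, 0) > 0 else "clear")
    return (counts.groupby(["day_type", "hour", "port", "weather"])["uses"]
                  .mean().rename("avg_uses").reset_index())
```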
As shown in FIG. 10, the "average number of bicycle impressions" is obtained by tallying, for each bicycle, the number of times its advertisement could be viewed between a certain time (the reference time) and the moment the bicycle was collected by a truck or had its battery exchanged, in the manner described below (step 1), and then calculating the average number of impressions per bicycle for each combination of weekday/holiday, time, port, and weather (step 2). The weather is determined in the same way as in the example of FIG. 8 described above.
More specifically, as shown in FIG. 11, the acquisition unit 11 extracts usage history information for each bicycle from the usage table that stores bicycle usage histories. The acquisition unit 11 also uses bicycle position information and people's position information to extract the people present within a fixed range of the position of the target bicycle (bicycle A in FIG. 11), and takes the tallied number of such people as the impression count (the number of times the advertisement could be viewed) for bicycle A. Such an impression count is obtained for each usage history entry and merged with the per-bicycle usage history information extracted above, yielding an impression count per usage start date and time (reference time) for each bicycle. The acquisition unit 11 then determines the weather at the usage start date and time (reference time) by referring to weather information, and calculates the average number of impressions per bicycle for each combination of weekday/holiday, time, port, and weather, as in the table at the lower right of FIG. 11.
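A sketch of the per-bicycle impression count is shown below: people within a fixed range of a bicycle's position are counted as potential ad viewers. The 100 m radius, the planar x/y coordinates, and all column names are assumptions for illustration; the embodiment does not specify them. The resulting counts would then be merged with the per-bicycle usage history and averaged by day type, hour, port, and weather exactly as in the previous sketch.

```python
# Sketch of the impression count in FIG. 10/11 under assumed inputs.
import numpy as np
import pandas as pd

RADIUS_M = 100.0  # assumed "fixed range" around the bicycle

def impressions(bikes: pd.DataFrame, people: pd.DataFrame) -> pd.DataFrame:
    """bikes/people: rows with x/y positions in metres on a local plane."""
    rows = []
    for b in bikes.itertuples():
        # Distance from this bicycle to every person; count those in range.
        d = np.hypot(people["x"] - b.x, people["y"] - b.y)
        rows.append({"bike_id": b.bike_id,
                     "impressions": int((d <= RADIUS_M).sum())})
    return pd.DataFrame(rows)
```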
(Explanation of the learning stage processing (FIGS. 12 to 13))
The "learning stage processing" and the "inference stage processing" executed in the information processing device 10 are described below in order.
FIG. 12 shows the process flow of the learning stage. First, for each bicycle, the acquisition unit 11 extracts from the bicycle usage tables of FIGS. 9 and 11 the time from a certain time until collection and the time from a certain time until battery exchange (step S1). The "certain time" may be selected, for example, from a plurality of predetermined reference time candidates. Next, the acquisition unit 11 calculates the average number of uses when placed at a port, following the procedure described above with reference to FIGS. 8 and 9 (step S2). This yields the average number of uses per bicycle for each combination of weekday/holiday, time, port, and weather, as in the table at the lower right of FIG. 9. Furthermore, the acquisition unit 11 calculates the average number of impressions, following the procedure described above with reference to FIGS. 10 and 11 (step S3). This yields the average number of impressions per bicycle for each combination of weekday/holiday, time, port, and weather, as in the table at the lower right of FIG. 11.
Next, the learning and inference unit 12 executes the accumulation processing of the information (St, at, R(St, at), St+1) shown in FIG. 13 (step S4). First, the learning and inference unit 12 obtains the state St acquired by the acquisition unit 11 (step S4A in FIG. 13) and inputs the obtained state St into the policy π to obtain an action at (step S4B). In practice, a plurality of possible actions a1, a2, …, an are obtained at this point. The learning and inference unit 12 then uses the formula for the discounted cumulative reward Rt to calculate the discounted cumulative reward R(St, at) for executing each of the actions a1, a2, …, an from state St (step S4C). Concretely, the "placement evaluation value", "recovery evaluation value", and "battery exchange evaluation value" explained with reference to FIGS. 7 to 11 are calculated for each bicycle, and the discounted cumulative reward R(St, at) is calculated from these three evaluation values over all bicycles. The learning and inference unit 12 then determines the action ak for which the calculated discounted cumulative reward is maximized (step S4D); that is, in step S4D, the action ak yielding the largest discounted cumulative reward is selected from among the possible actions a1, a2, …, an, and the corresponding information (St, ak, R(St, ak), St+1) is accumulated.

Returning to FIG. 12, the learning and inference unit 12 performs reinforcement learning on the action value function using the information (St, at, R(St, at), St+1) accumulated in step S4 (step S5).
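The accumulation step can be sketched as follows. The interfaces `policy`, `step_env`, and `discounted_reward` are names invented for this illustration; the embodiment does not define such an API.

```python
# Sketch of the accumulation step S4 in FIG. 13 under assumed interfaces.

def accumulate_transition(policy, step_env, discounted_reward, s_t, buffer):
    """Run steps S4A-S4D once and append (St, ak, R(St, ak), St+1) to buffer."""
    candidates = policy(s_t)                                      # S4B: a1, a2, ..., an
    returns = {a: discounted_reward(s_t, a) for a in candidates}  # S4C
    a_k = max(returns, key=returns.get)                           # S4D: maximize R(St, at)
    s_next = step_env(s_t, a_k)                                   # transition to St+1
    buffer.append((s_t, a_k, returns[a_k], s_next))
    return s_next
```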
(Explanation of the inference stage processing (FIG. 14))
FIG. 14 shows the process flow of the inference stage. First, the acquisition unit 11 acquires the state information St, as in the learning stage processing described above (step S11), and the learning and inference unit 12 calculates the action value of each action at using the action value function Q(St, at) obtained through learning (step S12). The learning and inference unit 12 then selects the action at with the maximum action value (step S13). In this way, the action with the highest future action value is selected.
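Steps S12 and S13 amount to an argmax over the learned action value function, as in the following sketch; `q_function` and `candidate_actions` are assumed interfaces, not names from the embodiment.

```python
# Sketch of the inference stage (FIG. 14) under assumed interfaces.

def select_action(q_function, candidate_actions, s_t):
    values = {a: q_function(s_t, a) for a in candidate_actions}  # S12
    return max(values, key=values.get)                           # S13: argmax of Q(St, a)
```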
(Effects of the above embodiment)
The effects of the above embodiment are described below.
Through the reinforcement learning described above, the action value function is trained into one that selects the action with the highest future action value. Because the value of actions relating to vehicle reallocation among multiple ports is estimated on the basis of this action value function, and control is exercised so that the action with the highest future value is selected, vehicle reallocation among multiple ports in a shared transportation service can be optimized so as to maximize the reward in reinforcement learning.
Specifically, "placement", "recovery", and "battery exchange" are assumed as the actions relating to the reallocation of the bicycles shared in the shared transportation service, and appropriate reinforcement learning is performed on the basis of evaluation values covering these multiple aspects, so vehicle reallocation is optimized more appropriately and the reward can be maximized.
The bicycles shared in the shared transportation service carry advertisements on their bodies, and the learning and inference unit 12 calculates the placement evaluation value and the recovery evaluation value taking into account impressions, the number of times an advertisement can be viewed, so the reward from advertising can be maximized.
Although rental electric bicycles were given as the example of vehicles shared in the shared transportation service, any vehicle that can be returned to service by battery exchange and that carries advertisements on its body may be used, and the disclosure is also applicable to other types of vehicles such as rental electric kick scooters, rental motorcycles, and rental cars. The present disclosure is thus applicable to a wide range of vehicles, including the rental electric kick scooters that have become increasingly popular in recent years, and has the potential to be widely used.
In the stage of inferring an action, executing the processing of FIG. 14 allows the action with the maximum action value to be selected as the action to take, and consequently the reward (profit) of the service can be maximized.
The above embodiment is one example of the present disclosure, and the types of information, the content of the reward, the content of the processing, and so on are not limited to those described above.
The gist of the present disclosure lies in the following [1] to [5].
[1] An information processing device comprising: an acquisition unit that acquires information to be input into an action value function that estimates the value of an action relating to the reallocation of vehicles among a plurality of ports in a sharing transportation service in which vehicles are shared; and a learning and inference unit that performs reinforcement learning on the action value function, using the information acquired by the acquisition unit, so that the action value function selects the action with the highest future action value.
[2] The information processing device according to [1], wherein the vehicles shared in the sharing transportation service are vehicles that can be reused by exchanging an installed battery; the actions relating to vehicle reallocation include placing a vehicle at a port, recovering a vehicle from a port, and exchanging a battery installed in a vehicle; and the learning and inference unit uses predetermined calculation formulas to calculate, for each port, a placement evaluation value for the placement, a recovery evaluation value for the recovery, and a battery exchange evaluation value for the battery exchange as the reward included in the value of the action, and performs the reinforcement learning based on the placement evaluation values, recovery evaluation values, and battery exchange evaluation values obtained for all ports.
[3] The information processing device according to [2], wherein the vehicles shared in the sharing transportation service are vehicles on which advertisements are displayed, and the learning and inference unit calculates the placement evaluation value and the recovery evaluation value using impressions, the number of times an advertisement can be viewed, as one element.
[4] The information processing device according to [3], wherein the vehicles shared in the sharing transportation service are rental electric bicycles, rental electric kick scooters, rental motorcycles, and rental cars that can be returned to service by battery exchange and that carry advertisements on their bodies.
[5] The information processing device according to any one of [1] to [4], wherein, in the stage of inferring an action, the learning and inference unit substitutes the current state information acquired by the acquisition unit into the action value function obtained by the reinforcement learning, thereby obtaining the action value of performing each of various actions in the current state, and selects the action with the maximum obtained action value as the action to take.
(Explanation of terms, hardware configuration (FIG. 15), etc.)
The block diagrams used in the description of the above embodiment show blocks in functional units. These functional blocks (components) are realized by any combination of at least one of hardware and software, and the method of realizing each functional block is not particularly limited. That is, each functional block may be realized using one device that is physically or logically coupled, or using two or more physically or logically separate devices connected directly or indirectly (for example, by wire or wirelessly). A functional block may also be realized by combining the one device or the plurality of devices with software.
Functions include, but are not limited to, judging, determining, deciding, calculating, computing, processing, deriving, investigating, looking up (searching, inquiring), ascertaining, receiving, transmitting, outputting, accessing, resolving, selecting, choosing, establishing, comparing, assuming, expecting, considering, broadcasting, notifying, communicating, forwarding, configuring, reconfiguring, allocating (mapping), and assigning. For example, a functional block (component) that performs transmission is called a transmitting unit or a transmitter. In either case, as described above, the method of realization is not particularly limited.
For example, the information processing device according to an embodiment of the present disclosure may function as a computer that executes the processing of the present disclosure. FIG. 15 is a diagram showing an example of the hardware configuration of the information processing device 10 according to an embodiment of the present disclosure. The information processing device 10 described above may be physically configured as a computer device including a processor 1001, a memory 1002, a storage 1003, a communication device 1004, an input device 1005, an output device 1006, a bus 1007, and so on.
In the following description, the word "device" can be read as a circuit, a unit, or the like. The hardware configuration of the information processing device 10 may include one or more of each of the devices shown in the figure, or may be configured without some of the devices.
Each function of the information processing device 10 is realized by loading predetermined software (a program) onto hardware such as the processor 1001 and the memory 1002, whereby the processor 1001 performs computation and controls communication by the communication device 1004 and at least one of the reading and writing of data in the memory 1002 and the storage 1003.
The processor 1001, for example, operates an operating system to control the entire computer. The processor 1001 may be configured as a central processing unit (CPU) including an interface with peripheral devices, a control device, an arithmetic device, registers, and the like.
The processor 1001 also reads programs (program code), software modules, data, and the like from at least one of the storage 1003 and the communication device 1004 into the memory 1002 and executes various kinds of processing in accordance with them. A program that causes a computer to execute at least part of the operations described in the above embodiment is used. Although the various kinds of processing have been described as being executed by a single processor 1001, they may be executed simultaneously or sequentially by two or more processors 1001. The processor 1001 may be implemented by one or more chips. The program may be transmitted from a network via an electric communication line.
The memory 1002 is a computer-readable recording medium and may be configured, for example, by at least one of a ROM (Read Only Memory), an EPROM (Erasable Programmable ROM), an EEPROM (Electrically Erasable Programmable ROM), a RAM (Random Access Memory), and the like. The memory 1002 may be called a register, a cache, a main memory (main storage device), or the like. The memory 1002 can store executable programs (program code), software modules, and the like for implementing a method according to an embodiment of the present disclosure.

The storage 1003 is a computer-readable recording medium and may be configured, for example, by at least one of an optical disc such as a CD-ROM (Compact Disc ROM), a hard disk drive, a flexible disk, a magneto-optical disk (for example, a compact disc, a digital versatile disc, or a Blu-ray (registered trademark) disc), a smart card, a flash memory (for example, a card, a stick, or a key drive), a floppy (registered trademark) disk, a magnetic strip, and the like. The storage 1003 may be called an auxiliary storage device. The above-mentioned storage medium may be, for example, a database, a server, or another appropriate medium including at least one of the memory 1002 and the storage 1003.

The communication device 1004 is hardware (a transmitting/receiving device) for communication between computers via at least one of a wired network and a wireless network, and is also called, for example, a network device, a network controller, a network card, or a communication module. The communication device 1004 may be configured to include a high-frequency switch, a duplexer, a filter, a frequency synthesizer, and the like in order to realize, for example, at least one of frequency division duplex (FDD) and time division duplex (TDD).
The input device 1005 is an input device that accepts input from the outside (for example, a keyboard, a mouse, a microphone, a switch, a button, or a sensor). The output device 1006 is an output device that performs output to the outside (for example, a display, a speaker, or an LED lamp). The input device 1005 and the output device 1006 may be integrated (for example, as a touch panel).
The devices such as the processor 1001 and the memory 1002 are connected by a bus 1007 for communicating information. The bus 1007 may be configured using a single bus or using different buses between each pair of devices.
The information processing device 10 may also be configured to include hardware such as a microprocessor, a digital signal processor (DSP), an ASIC (Application Specific Integrated Circuit), a PLD (Programmable Logic Device), or an FPGA (Field Programmable Gate Array), and some or all of the functional blocks may be realized by such hardware. For example, the processor 1001 may be implemented using at least one of these kinds of hardware.
The notification of information is not limited to the aspects/embodiments described in the present disclosure and may be performed by other methods. For example, the notification of information may be performed by physical layer signaling (for example, DCI (Downlink Control Information), UCI (Uplink Control Information)), higher layer signaling (for example, RRC (Radio Resource Control) signaling, MAC (Medium Access Control) signaling, broadcast information (MIB (Master Information Block), SIB (System Information Block))), other signals, or a combination of these. RRC signaling may be called an RRC message, and may be, for example, an RRC Connection Setup message or an RRC Connection Reconfiguration message.

Each aspect/embodiment described in the present disclosure may be applied to at least one of systems using LTE (Long Term Evolution), LTE-A (LTE-Advanced), SUPER 3G, IMT-Advanced, 4G (4th generation mobile communication system), 5G (5th generation mobile communication system), 6G (6th generation mobile communication system), xG (xth generation mobile communication system, where x is, for example, an integer or a decimal), FRA (Future Radio Access), NR (New Radio), New radio access (NX), Future generation radio access (FX), W-CDMA (registered trademark), GSM (registered trademark), CDMA2000, UMB (Ultra Mobile Broadband), IEEE 802.11 (Wi-Fi (registered trademark)), IEEE 802.16 (WiMAX (registered trademark)), IEEE 802.20, UWB (Ultra-WideBand), Bluetooth (registered trademark), and other appropriate systems, as well as next-generation systems extended, modified, created, or defined based on these. A plurality of systems may also be combined and applied (for example, a combination of at least one of LTE and LTE-A with 5G).
The processing procedures, sequences, flowcharts, and the like of each aspect/embodiment described in the present disclosure may be reordered as long as no contradiction arises. For example, the methods described in the present disclosure present elements of various steps using an exemplary order and are not limited to the particular order presented.

Input and output information and the like may be stored in a specific location (for example, a memory) or may be managed using a management table. Input and output information and the like may be overwritten, updated, or appended. Output information and the like may be deleted. Input information and the like may be transmitted to another device.

A determination may be made by a value represented by one bit (0 or 1), by a Boolean value (true or false), or by a numerical comparison (for example, a comparison with a predetermined value).

Each aspect/embodiment described in the present disclosure may be used alone, in combination, or switched in accordance with execution. Notification of predetermined information (for example, notification of "being X") is not limited to being performed explicitly and may be performed implicitly (for example, by not notifying the predetermined information).
Although the present disclosure has been described in detail above, it is clear to those skilled in the art that the present disclosure is not limited to the embodiments described herein. The present disclosure can be implemented in modified and altered forms without departing from the spirit and scope of the present disclosure as defined by the claims. Accordingly, the description of the present disclosure is intended to be illustrative and has no limiting meaning with respect to the present disclosure.

Software should be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, executable files, threads of execution, procedures, functions, and the like, whether referred to as software, firmware, middleware, microcode, hardware description language, or by another name.

Software, instructions, information, and the like may also be transmitted and received via a transmission medium. For example, if software is transmitted from a website, a server, or another remote source using at least one of wired technologies (coaxial cable, optical fiber cable, twisted pair, digital subscriber line (DSL), etc.) and wireless technologies (infrared, microwave, etc.), then at least one of these wired and wireless technologies is included within the definition of a transmission medium.

The information, signals, and the like described in the present disclosure may be represented using any of a variety of different technologies. For example, data, instructions, commands, information, signals, bits, symbols, chips, and the like that may be referred to throughout the above description may be represented by voltages, currents, electromagnetic waves, magnetic fields or magnetic particles, optical fields or photons, or any combination thereof.

The terms described in the present disclosure and the terms necessary for understanding the present disclosure may be replaced with terms having the same or similar meanings. For example, at least one of a channel and a symbol may be a signal (signaling). A signal may be a message. A component carrier (CC) may be called a carrier frequency, a cell, a frequency carrier, or the like.

The terms "system" and "network" used in the present disclosure are used interchangeably.

The information, parameters, and the like described in the present disclosure may be represented using absolute values, using relative values from a predetermined value, or using other corresponding information. For example, a radio resource may be indicated by an index.

The names used for the above-described parameters are not limiting in any respect. Furthermore, formulas and the like using these parameters may differ from those explicitly disclosed in the present disclosure. Since various channels (for example, PUCCH and PDCCH) and information elements can be identified by any suitable names, the various names assigned to these various channels and information elements are not limiting in any respect.

The terms "determining" and "deciding" used in the present disclosure may encompass a wide variety of operations. "Determining" and "deciding" may include, for example, regarding judging, calculating, computing, processing, deriving, investigating, looking up (searching, inquiring) (for example, looking up in a table, a database, or another data structure), or ascertaining as "determining" or "deciding". "Determining" and "deciding" may also include regarding receiving (for example, receiving information), transmitting (for example, transmitting information), input, output, or accessing (for example, accessing data in a memory) as "determining" or "deciding". "Determining" and "deciding" may further include regarding resolving, selecting, choosing, establishing, comparing, and the like as "determining" or "deciding". That is, "determining" and "deciding" may include regarding some operation as "determining" or "deciding". "Determining (deciding)" may also be read as "assuming", "expecting", "considering", and the like.

The phrase "based on" used in the present disclosure does not mean "based only on" unless otherwise specified. In other words, the phrase "based on" means both "based only on" and "based at least on".

Any reference to elements using designations such as "first" and "second" used in the present disclosure does not generally limit the quantity or order of those elements. These designations may be used in the present disclosure as a convenient way of distinguishing between two or more elements. Accordingly, references to first and second elements do not mean that only two elements may be employed or that the first element must precede the second element in some way.

Where "include", "including", and variations thereof are used in the present disclosure, these terms are intended to be inclusive in the same way as the term "comprising". Furthermore, the term "or" used in the present disclosure is not intended to be an exclusive OR.

Where articles such as a, an, and the in English are added by translation in the present disclosure, the present disclosure may include the case where a noun following these articles is plural.

In the present disclosure, the phrase "A and B are different" may mean "A and B are different from each other". The phrase may also mean "A and B are each different from C". Terms such as "separated" and "coupled" may be interpreted in the same way as "different".
10: Information processing device, 11: Acquisition unit, 12: Learning and inference unit, 1001: Processor, 1002: Memory, 1003: Storage, 1004: Communication device, 1005: Input device, 1006: Output device, 1007: Bus.
Claims (5)
1. An information processing device comprising:
an acquisition unit that acquires information to be input into an action value function that estimates the value of an action relating to the reallocation of vehicles among a plurality of ports in a sharing transportation service in which vehicles are shared; and
a learning and inference unit that performs reinforcement learning on the action value function, using the information acquired by the acquisition unit, so that the action value function selects the action with the highest future action value.
2. The information processing device according to claim 1, wherein
the vehicles shared in the sharing transportation service are vehicles that can be reused by exchanging an installed battery,
the actions relating to vehicle reallocation include placing a vehicle at a port, recovering a vehicle from a port, and exchanging a battery installed in a vehicle, and
the learning and inference unit uses predetermined calculation formulas to calculate, for each port, a placement evaluation value for the placement, a recovery evaluation value for the recovery, and a battery exchange evaluation value for the battery exchange as the reward included in the value of the action, and performs the reinforcement learning based on the placement evaluation values, recovery evaluation values, and battery exchange evaluation values obtained for all ports.
3. The information processing device according to claim 2, wherein
the vehicles shared in the sharing transportation service are vehicles on which advertisements are displayed, and
the learning and inference unit calculates the placement evaluation value and the recovery evaluation value using impressions, the number of times an advertisement can be viewed, as one element.
4. The information processing device according to claim 3, wherein the vehicles shared in the sharing transportation service are rental electric bicycles, rental electric kick scooters, rental motorcycles, and rental cars that can be returned to service by battery exchange and that carry advertisements on their bodies.
5. The information processing device according to claim 1, wherein, in the stage of inferring an action, the learning and inference unit substitutes the current state information acquired by the acquisition unit into the action value function obtained by the reinforcement learning, thereby obtaining the action value of performing each of various actions in the current state, and selects the action with the maximum obtained action value as the action to take.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2025506474A (JPWO2024189997A1) | 2023-03-15 | 2023-11-30 | |

Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| JP2023-041118 | 2023-03-15 | | |
| JP2023041118 | 2023-03-15 | | |

Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2024189997A1 (en) | 2024-09-19 |
Family
ID=92754669
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/JP2023/042953 (WO2024189997A1, pending) | Information processing device | 2023-03-15 | 2023-11-30 |
Country Status (2)
| Country | Link |
|---|---|
| JP (1) | JPWO2024189997A1 (en) |
| WO (1) | WO2024189997A1 (en) |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| JP2021060921A (en) * | 2019-10-09 | 2021-04-15 | 株式会社豊田中央研究所 | Recommended vehicle dispatch system and recommended vehicle dispatch program |
| WO2021176632A1 (en) * | 2020-03-05 | 2021-09-10 | 日本電信電話株式会社 | Optimization function generation device, optimization function generation method, and program |
| WO2022006873A1 (en) * | 2020-07-10 | 2022-01-13 | Beijing Didi Infinity Technology And Development Co., Ltd. | Vehicle repositioning on mobility-on-demand platforms |
Non-Patent Citations (1)
| Title |
|---|
| NAKAI, ETSUJI: "Introduction to reinforcement learning theory for IT engineers", Gijutsu-Hyoron Co., Ltd., 17 July 2020 (2020-07-17), XP093210282 * |
Also Published As
| Publication number | Publication date |
|---|---|
| JPWO2024189997A1 (en) | 2024-09-19 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US9285240B2 (en) | EV route optimization through crowdsourcing | |
| CN111523968B (en) | Method and equipment for spelling bill | |
| JPWO2015012144A1 (en) | Battery secondary usage management system, battery secondary usage management device, and battery secondary usage management method | |
| JP7542540B2 (en) | Demand forecasting device | |
| WO2018207878A1 (en) | Demand forecast device | |
| CN114691969A (en) | Processing method and device for recommended swapping station, electronic equipment and storage medium | |
| CN111629390B (en) | Network slice orchestration method and device | |
| CN118396723B (en) | User carwash service information pushing method, system, equipment and storage medium | |
| JP2019159663A (en) | Information processing system, information processing method, and information processing program | |
| WO2024189997A1 (en) | Information processing device | |
| JP7499619B2 (en) | Information processing device | |
| JP7083288B2 (en) | Contribution estimation device | |
| CN114358855B (en) | Charging pricing method, server, system, equipment and medium | |
| JP7397738B2 (en) | Aggregation device | |
| JP7499683B2 (en) | Information processing device | |
| CN111339468B (en) | Information pushing method, device, electronic equipment and storage medium | |
| JP2020064394A (en) | Economic index calculation device, economic index calculation method, and economic index calculation program | |
| CN111049892B (en) | Data processing method and device of sensing terminal | |
| CN117812624A (en) | Base station cluster complaint prediction method and device, electronic equipment and storage medium | |
| JP7629806B2 (en) | Prediction model construction device and demand forecasting device | |
| JP7478847B2 (en) | Simulation Equipment | |
| JP7679250B2 (en) | Information processing device | |
| JP7519466B2 (en) | Simulation Equipment | |
| JP2021185463A (en) | Demand prediction device | |
| JP7547644B2 (en) | Service demand potential forecasting device |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23927602; Country of ref document: EP; Kind code of ref document: A1 |
| | WWE | Wipo information: entry into national phase | Ref document number: 2025506474; Country of ref document: JP |
| | NENP | Non-entry into the national phase | Ref country code: DE |