
WO2025202592A1 - A device, computer program and method - Google Patents

A device, computer program and method

Info

Publication number
WO2025202592A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
game
game audio
server
client
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/GB2025/050480
Other languages
French (fr)
Inventor
Hampus WESSMAN
Lars JÄGARE
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Europe Bv
Sony Group Corp
Original Assignee
Sony Europe Bv
Sony Group Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Europe Bv, Sony Group Corp filed Critical Sony Europe Bv
Publication of WO2025202592A1 publication Critical patent/WO2025202592A1/en
Pending legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A - HUMAN NECESSITIES
    • A63 - SPORTS; GAMES; AMUSEMENTS
    • A63F - CARD, BOARD, OR ROULETTE GAMES; INDOOR GAMES USING SMALL MOVING PLAYING BODIES; VIDEO GAMES; GAMES NOT OTHERWISE PROVIDED FOR
    • A63F13/00 - Video games, i.e. games using an electronically generated display having two or more dimensions
    • A63F13/50 - Controlling the output signals based on the game progress
    • A63F13/54 - Controlling the output signals based on the game progress involving acoustic signals, e.g. for simulating revolutions per minute [RPM] dependent engine sounds in a driving game or reverberation against a virtual wall
    • A63F13/60 - Generating or modifying game content before or while executing the game program, e.g. authoring tools specially adapted for game development or game-integrated level editor
    • A63F13/30 - Interconnection arrangements between game servers and game devices; Interconnection arrangements between game devices; Interconnection arrangements between game servers
    • A63F13/32 - Interconnection arrangements between game servers and game devices using local area network [LAN] connections
    • A63F13/33 - Interconnection arrangements between game servers and game devices using wide area network [WAN] connections
    • A63F13/35 - Details of game servers
    • A63F13/355 - Performing operations on behalf of clients with restricted processing capabilities, e.g. servers transform changing game scene into an encoded video stream for transmitting to a mobile phone or a thin client
    • A63F13/45 - Controlling the progress of the video game
    • A63F13/49 - Saving the game status; Pausing or ending the game
    • A63F13/497 - Partially or entirely replaying previous game actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Circuit For Audible Band Transducer (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A method of audio inpainting in a client device, comprising: receiving game audio from a server; identifying a gap in the received game audio; applying the game audio to a trained artificial intelligence model to generate predicted game audio; inpainting the generated predicted game audio in the gap to generate output game audio; and outputting the output game audio.

Description

A DEVICE, COMPUTER PROGRAM AND METHOD
BACKGROUND
Field of the Disclosure
The present technique relates to a device, computer program and method.
Description of the Related Art
The “background” description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in the background section, as well as aspects of the description which may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present technique.
Portable (client) devices are very popular for playing computer games. In many instances, these client devices are connected over a network to a server where the game play is rendered and the resulting video and audio are sent back to the portable device over the network. In other words, the server generates the game content and sends the audio and video content indicative of the game content to the client device.
As will be appreciated, these types of online games require a stable network connection in order to ensure that the user’s gaming experience is satisfactory. Of course, there are instances where the network connection may fluctuate. In this instance, the audio and/or video may stutter or freeze. However, the human ear is very sensitive to gaps in audio and unexpected sound and silence in audio is particularly distracting for a user.
Accordingly, it is an aim of embodiments of the present disclosure to mitigate the impact of fluctuating audio on the user of the client device.
SUMMARY
According to embodiments of the disclosure, there is provided a method of audio inpainting in a client device, comprising: receiving game audio from a server; identifying a gap in the received game audio; applying the game audio to a trained artificial intelligence model to generate predicted game audio; inpainting the generated predicted game audio in the gap to generate output game audio; and outputting the output game audio.
The foregoing paragraphs have been provided by way of general introduction, and are not intended to limit the scope of the following claims. The described embodiments, together with further advantages, will be best understood by reference to the following detailed description taken in conjunction with the accompanying drawings.
BRIEF DESCRIPTION OF THE DRAWINGS
A more complete appreciation of the disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
Figure 1 shows a system 100 according to embodiments of the disclosure;
Figure 2 shows a server 200 according to embodiments of the disclosure;
Figure 3 shows a client 300 according to embodiments of the disclosure;
Figure 4 shows a diagram explaining audio inpainting 400 according to embodiments of the disclosure;
Figure 5 shows a checking process 700 carried out by the client 300 according to embodiments;
Figure 6 shows a table of the stored game and associated trained AI model;
Figure 7 shows a diagram according to embodiments explaining the signalling between the server 200, the client 300 and the storage containing the trained AI model; and
Figure 8 shows a flow chart describing a process 900 according to embodiments performed in the predicting of game audio in a computer game run on the client 300.
DESCRIPTION OF THE EMBODIMENTS
Referring now to the drawings, wherein like reference numerals designate identical or corresponding parts throughout the several views.
Numerous modifications and variations of the present disclosure are possible in light of the above teachings. It is therefore to be understood that within the scope of the appended claims, the disclosure may be practiced otherwise than as specifically described herein.
Figure 1 shows a system 100 according to embodiments of the disclosure. The system 100 comprises a server 200 and a client 300. The server 200 and the client 300 communicate through network 110. In embodiments, the network 110 is a Local Area Network, a Wide Area Network, a mesh network, a cellular network or some other network as would be appreciated.
In embodiments, the server 200 is configured to store a catalogue of computer games which may be accessed by a user interacting with the client 300. In other words, the user of the client 300 has access to one or more computer games which are stored on the server 200. The server 200 also generates game content, such as appropriate audio and video content, which is sent to the client 300 over the network 110. The game content is generated in response to the input of the user on the client 300. In other words, the user interacts with the client 300 as the game content generated by the server 200 and sent to the client 300 over the network is displayed on the client 300.
In instances, and as mentioned above, the audio of the game content generated by the server 200 may be sent over the network for output by the client 300 but due to a delay or interrupted connection a gap in the audio played at the client 300 may result. This is undesirable as the human ear is very sensitive to gaps in audio. As will be explained later, the client 300 is configured to detect the impending presence of the gap in the audio and to inpaint the audio for play back to the user. This disguises the gap in the audio for the user of the client 300.
Figure 2 shows a server 200 according to embodiments of the disclosure. The server 200 comprises a server processor 220 and server storage 210 connected to the server processor 220. In embodiments, the server processor 220 is circuitry that is configured to perform various steps and is controlled by computer software that is stored on the server storage 210. The server processor 220 is connected to the network 110 and is configured to connect to the client 300 for transmission of audio and/or video packets to the client 300 over the network 110. In embodiments, the server processor 220 is configured to generate game content in response to instructions received from the client 300 over the network 110. The generated game content is transmitted to the client 300 over the network 110. In embodiments, the server 200 may be in the cloud or may be a games console such as a PlayStation® console.
Figure 3 shows a client 300 according to embodiments of the disclosure. The client 300 comprises a client processor 320 and client storage 310 connected to the client processor 320. In embodiments, the client processor 320 is circuitry that is configured to perform various steps and is controlled by computer software that is stored on the client storage 310. The client processor 320 is connected to the network 110 and is configured to connect to the server 200 for reception of audio and/or video packets from the server 200 over the network 110. In embodiments, the client processor 320 is configured to provide the game content (both audibly and visibly) received from the server 200 over the network 110. The generated game content is received from the server 200 over the network 110. In embodiments, the client 300 may be a portable terminal such as a mobile telephone, tablet, laptop computer or a handheld games console such as a PlayStation® Portal.
Figure 4 shows a diagram explaining audio inpainting 400 according to embodiments of the disclosure. During a gaming session, the server 200 communicates game audio and game video to the client 300 over the network 110. This game audio and game video is generated by the server 200 in response to the client 300 sending user instructions to the server 200 over the network 110. In Figure 4, the game audio 405 received by the client 300 is interrupted. In the embodiments of Figure 4, communication between the server 200 and the client 300 is interrupted and so there is a gap 410 within the game audio 405. This means that, according to embodiments, the client 300 will inpaint game audio into the gap 410 to ensure continuity of game audio 405 for the user. In other words, the client 300 will need to predict the audio missing in the gap 410, locally generate the audio, and insert the locally generated audio into the audio output to the user. As the client 300 is typically a mobile device, it is desirable to generate the audio in a computationally inexpensive manner to reduce storage requirements, processing power and electrical consumption within the client 300. In embodiments, the game audio is perceived as uninterrupted by the user. In order to achieve this, the game audio received at the client 300 is fed into a trained Artificial Intelligence (AI) model 420 which predicts the future game audio (hereinafter referred to as predicted game audio) 415. When the connection between the server 200 and the client 300 over the network 110 is stable, the audio received from the server 200 is played back on the client 300 and the predicted game audio is not output on the client 300. However, in the event of a loss of a stable connection (for example where the client 300 is moving between base stations in a wireless network, or where a handover occurs between base stations and/or between different connections such as terrestrial 5G to satellite 5G, or from one band to a mmWave band), to avoid the gap 410 in the audio, the predicted game audio 415 is inpainted into the game audio output to the user from the client 300. In other words, the predicted game audio 415 is output to the user of the client 300.
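By way of illustration, the switching between received and predicted audio described above could be organised as in the following minimal sketch; the class, the method names and `model.predict` are hypothetical stand-ins and are not specified by this disclosure:
```python
import numpy as np

class InpaintingAudioPipeline:
    """Illustrative client-side switch between streamed and predicted game audio."""

    def __init__(self, model, frame_size=1024, context_seconds=1.0, sample_rate=48000):
        self.model = model            # stands in for the trained AI model 420
        self.frame_size = frame_size
        # Rolling window of recently received game audio, used as model context.
        self.context = np.zeros(int(context_seconds * sample_rate), dtype=np.float32)

    def _push_context(self, frame):
        # Append the new frame and keep only the most recent window.
        self.context = np.concatenate([self.context, frame])[-len(self.context):]

    def next_output_frame(self, received_frame):
        """Return the frame to play: streamed audio if present, else predicted."""
        if received_frame is not None:        # stable connection: play server audio
            self._push_context(received_frame)
            return received_frame
        # Gap 410 detected: generate predicted game audio 415 from recent context.
        predicted = self.model.predict(self.context, num_samples=self.frame_size)
        self._push_context(predicted)         # keep context coherent across the gap
        return predicted
```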
When the connection between the client 300 and the server 200 is stabilised, the game audio sent from the server 200 to the client 300 resumes (identified in Figure 4 as resumed game audio 405B) and is output to the user of the client 300.
The AI model is trained such that there is not a noticeable transition between the game audio 405 and the predicted game audio 415. In embodiments, in order to avoid a noticeable transition between the predicted game audio 415 and the resumed game audio 405B, a fade-back (by means of a cross-fade, for example) is applied. This means that when the client generates the predicted game audio, it smoothly transitions from the predicted game audio 415 to the resumed game audio 405B.
In embodiments, the inpainting is such that a beginning portion is inpainted audio predicted by the trained AI model, and audio processing circuitry of the client applies a change to the audio predicted for an end portion by the trained AI model, for example in the form of a fade-back to the resumed game audio 405B. In embodiments, the change may only be applied to the end portion, when the inpainting rejoins the resumed game audio 405B, and not to the beginning portion. In embodiments, the change overcomes, at least in part, deviation of the predicted audio from the game audio at resumption. It will be appreciated that there may sometimes be a close match between the predicted game audio 415 and the game audio at resumption, whereas at other times it may deviate significantly. The deviation may increase over the time period to be inpainted, or may increase but reconverge towards the resumed game audio over that period; depending on the degree of reconvergence, the fade-back may or may not be applied by the audio processing circuitry.
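As an illustration, a simple equal-power cross-fade of the kind that could implement this fade-back is sketched below; this is one possible realisation, not a method prescribed by the disclosure:
```python
import numpy as np

def crossfade(predicted_tail: np.ndarray, resumed_head: np.ndarray) -> np.ndarray:
    """Blend the end of the predicted game audio 415 into the resumed game audio 405B.

    Uses an equal-power fade so perceived loudness stays roughly constant across
    the transition. Both inputs cover the fade region at the same sample rate.
    """
    n = min(len(predicted_tail), len(resumed_head))
    t = np.linspace(0.0, 1.0, n, dtype=np.float32)
    fade_out = np.cos(t * np.pi / 2)   # predicted audio fades out
    fade_in = np.sin(t * np.pi / 2)    # resumed server audio fades in
    return predicted_tail[:n] * fade_out + resumed_head[:n] * fade_in
```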
As noted above, in order for the client 300 to locally generate the inpainted audio, the game audio received at the client 300 is fed into a trained AI model. The trained AI model is used to predict the predicted game audio and, where gaps in the game audio exist due to a brief loss of connection between the server 200 and the client 300 over the network 110, the predicted game audio is inpainted into the gap. In embodiments, the client 300 will feed game audio received from the server 200 into the trained AI model only when the connection between the client 300 and the server 200 is unstable and likely to drop. In other words, in the event of an unstable connection between the client and the server, the gap is determined and the client 300 feeds received game audio into the trained AI model. In embodiments, audio received from the server 200 during a predetermined time before the connection is disrupted is fed into the trained AI model. This allows the most appropriate predicted game audio to be generated by the trained AI model and used locally by the client 300 when there is a discontinuity in the game audio provided by the server 200.
By feeding game audio received from the server 200 into the trained AI model only when the connection between the client 300 and the server 200 is unstable, the amount of power consumed is reduced compared to the situation where the game audio is continuously fed into the trained AI model. The predetermined time may be any time that is sufficient for the trained AI model to predict the predicted game audio. For example, the predetermined time may be 1 second, or may be defined in accordance with the size of the audio buffer within the client device. The mechanism used by the client 300 for generating the predicted game audio in embodiments will be described later.
However, of course, the disclosure is not so limited. In embodiments, the game audio received from the server 200 is continuously fed into the trained AI model, which means that the client 300 will continuously process two audio streams: the game audio received from the server 200 and the predicted game audio 415 generated by the trained AI model. The client 300 then decides which audio to output based upon the connection between the client 300 and the server 200, so that the client 300 will output the game audio when it is available and the predicted game audio when the game audio is not available.
It should be noted here that, in embodiments, other information may be used in addition to the game audio from the server 200 in order to provide the trained AI model with appropriate context so that it can generate suitable predicted game audio to be output by the client 300. For example, game audio generated in response to a user is fed into the trained AI model and is used to provide context to the game (i.e. where in the game the user is, such as which level, the location on a level or position on a track, and what is happening around the user). This, in embodiments, is fed into the trained AI model to assist it in providing appropriate predicted game audio.
In embodiments, metadata, such as a tag, may be periodically received at the client 300 from the server 200 and fed into the trained AI model. The metadata may be textual, video or audio, and provides context information indicating, for example, where in the game a user is and whether any particular sounds would be required in the event of a loss of connection. As this information is provided as metadata, it is typically small in size and so does not significantly increase the processing or network overhead.
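For instance, such a context tag could be as small as the following; the structure and all field names are hypothetical, since the disclosure does not define a metadata schema:
```python
# Hypothetical context tag periodically sent from server 200 to client 300 and
# fed to the trained AI model alongside the recent game audio.
context_tag = {
    "game_id": "example-racing-game",   # illustrative identifier
    "level": 3,
    "scene": "tunnel",                  # e.g. suggests reverberant engine audio
    "subtitles": "[engine roaring]",    # SDH-style atmospheric cue
    "timestamp_ms": 123456,
}
```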
Training of the AI model used by the client 300
The training of the AI model used by the client 300 according to embodiments will now be described. In embodiments, the AI model is trained by the server 200.
In embodiments, the game audio used to train the AI model will cover different types of games, as the audio varies significantly from one type of game to another. For example, a party type game may have audio that is very high tempo with a high timbre, whereas a first person shooter may have a slower tempo with a lower, more atmospheric timbre. In this instance, a single AI model will be trained using game audio from across a wide range of types of games. For example, the single AI model will be trained on a data set of 1000 games across a range of game types and 100 game sessions for each game. As noted above, each game type has different audio characteristics and it is useful to train the AI model across a wide range of these audio characteristics. This trained AI model is, in embodiments, a foundation AI model.
In order to train (fine-tune) the foundation AI model for any particular game, game audio from a corresponding type of game, or from the game itself, is fed into the foundation model to fine-tune it. This fine-tuning is carried out, in embodiments, using a Low-Rank Adaptation method; a sketch is given after this paragraph. As will be apparent, the Low-Rank Adaptation method updates only a small number of the parameters within the model whilst keeping the rest of the AI model constant. This means the resulting set of game-specific parameters is smaller and so can be downloaded by the client with the game. The game-specific parameters are stored in the client. By providing a smaller set of trained AI model parameters that are specific to the game, the amount of data stored on the client is smaller, which reduces the amount of data storage and makes it feasible to download the game-specific parameters on demand.
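A minimal sketch of the Low-Rank Adaptation idea follows, hand-rolled in PyTorch for clarity; the disclosure does not name a framework, and the class name is hypothetical:
```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen linear layer with a trainable low-rank update.

    Only A and B (rank r) are trained and shipped to the client, so the
    game-specific download is far smaller than the full foundation model.
    """

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # foundation weights stay constant
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale
```
With, say, r = 8 on a 4096×4096 layer, the adapter holds 2·8·4096 ≈ 65k parameters instead of roughly 16.8 million, which illustrates why per-game parameter sets remain small enough to download on demand.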
In embodiments, the various types of games include but are not limited to: first person shooter, third person shooter, sports simulation, fighting games, simulation, Role Playing Game, shooting games, stealth games, survival horror games, puzzle games, platformer, adventure, battle royale game, roguelike, survival games, storylines, multiplayer online game, party game, strategy game or the like.
Of course, the disclosure is not limited to a foundation AI model and the AI model may be trained on game audio associated with a single particular game or type of game. In embodiments, a first AI model may be trained only on game audio associated with a first type of game (for example a first person shooter) and a second AI model may be trained only on game audio associated with a second type of game, such as a party game. In this instance, the first AI model will be stored in the server 200 in association with a first person shooter game and the second AI model will be stored in the server 200 in association with a party type game. In other words, the trained AI model will be stored in association with its corresponding type of game. In embodiments, the AI model will be trained on a data set of 500 games across each type of game and 500 game sessions for each type of game. The game audio used to train the AI model will, in embodiments, be in predefined segments of time. For example, in embodiments, the game audio used to train the AI model will be in one-second segments. Of course, the disclosure is not so limited and segments of any size are envisaged. For example, the size of the segment may be the same as the size of the audio buffer within the client. In embodiments, the size of the segments used to train the AI model is determined in accordance with a statistical distribution of gap lengths in real usage; see the sketch after this paragraph. Specifically, the size of the segments may be determined by the average length of uninterrupted game audio or the like in a real-world scenario.
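For example, picking the training segment length from observed gap statistics could be as simple as the following sketch; the data and the percentile choice are hypothetical:
```python
import numpy as np

# Hypothetical gap lengths (seconds) measured from real streaming sessions.
observed_gap_lengths = np.array([0.2, 0.4, 0.4, 0.7, 1.1, 1.3, 2.0, 3.5])

# Train on segments long enough to cover most gaps seen in practice,
# e.g. the 90th percentile of the measured distribution.
segment_length_s = float(np.percentile(observed_gap_lengths, 90))
print(f"training segment length: {segment_length_s:.2f} s")
```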
The segments of game audio across the varied game types are fed into the foundation AI model which, as noted, is a generic AI model trained across a wide number of types of games and produces output audio across a wide range of game types.
Of course, the disclosure is not so limited and any appropriate AI training model is envisaged, such as Jukebox by OpenAI® or PEFT, Parameter-Efficient Fine-Tuning (of which the Low-Rank Adaptation method is one example).
In embodiments, during the training phase, metadata from the game session may be provided in addition to the game audio. For example, subtitle information may be provided in addition to the game audio. The subtitle information may include spoken words or atmospheric context. In embodiments, the subtitle information may include subtitles for the deaf or hard of hearing (SDH). This is provided to the client during gameplay and so this additional metadata assists the client in providing the context for the predicted game audio generated by the trained AI model. For example, the metadata assists the trained AI model to predict the game audio for inpainting. Of course, any metadata is envisaged, such as a key-frame of video or the like, or game pad inputs, as will be explained later.
In embodiments, and as noted above, during the training phase, user input of a game session may be provided. In other words, a user in a game session operates an input during the training phase, as patterns of inputs may indicate certain game audio or specific game audio that should be played when a certain input or pattern of inputs is provided. The input may be one or more of a controller input, touch screen input, voice command, gesture control (for example using a camera (not shown)), brain activity sensor control (for example, EEG) or the like. For example, in a first person shooter game, game audio such as a gunshot may be generated in response to a particular button being pressed by the user on a game pad, or the sound of a car accelerating when the user holds down the appropriate acceleration button. In terms of a pattern of game pad inputs, the timing between consecutive game pad inputs or the number of game pad inputs may be used during the training phase. For example, during game sessions used for training, users may fire 30 shots in quick succession at a particular point. The trained AI model may use this information when generating the predicted game audio at this particular point to generate the sounds for 30 shots if, say, the user fires 15 shots in quick succession before a breakdown in the network connection. In embodiments, the training data set may be subjected to balancing, such as undersampling, oversampling or feature selection, to ensure that less common game pad inputs do not significantly impact the trained model; a sketch of one such approach follows. Further, less common scenes and audio effects may influence the training data. Accordingly, in embodiments, balancing may be applied to other training data such as the less common visual scenes.
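As an illustration, one simple form of the balancing described above is to oversample rare input patterns; this is a sketch with a hypothetical record layout, not a procedure prescribed by the disclosure:
```python
import random
from collections import Counter

def oversample_rare(examples, key=lambda e: e["input_pattern"], seed=0):
    """Duplicate under-represented classes until all classes are equally common.

    `examples` is a list of training records, each tagged with an input pattern
    (e.g. a game pad combination); the field name is hypothetical.
    """
    rng = random.Random(seed)
    counts = Counter(key(e) for e in examples)
    target = max(counts.values())
    balanced = list(examples)
    for pattern, count in counts.items():
        pool = [e for e in examples if key(e) == pattern]
        balanced.extend(rng.choices(pool, k=target - count))
    return balanced
```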
Generating the Predicted Audio
After the AI model has been trained, it will be stored in association with an appropriate game within the server 200. As mentioned previously, a single AI model may be trained for all types of game, or a single AI model may be trained for a particular type of game, or a single AI model may be trained for a particular game, or even a single AI model may be trained for a particular level of a game, using the techniques described above. When the game is played from the server 200, the corresponding trained AI model is also provided to the client 300. The trained AI model is stored within the client storage 310 whilst the game is being played.
Figure 5 shows a checking process 700 carried out by the client 300 according to embodiments. The process starts at step 705. The process moves to step 710 where the client 300 checks the game audio buffer allocated to game audio received from the server 200. Specifically, the client 300 checks the amount of game audio stored in the game audio buffer. The process then moves to step 715 where the amount of game audio stored in the game audio buffer is compared with a first threshold amount to determine whether it is less than that amount. This comparison is made to determine whether to start producing the game audio using the trained AI model. Specifically, although it is possible to generate the game audio using the trained AI model continuously and only use the predicted game audio when required, this requires processing within the client 300 which increases the amount of processing power used and electrical power consumed. Accordingly, by selectively generating the predicted game audio when the game audio buffer is at or below a predetermined threshold amount, the amount of processing resource and electrical power consumed is reduced.
Although it is possible to use the amount of game audio stored in the buffer to determine whether to start production of the predicted game audio, the disclosure is not so limited and other playback statuses may be used. For instance, in embodiments, disruption to the network conditions at the physical layer or signal layer may be used as the playback status in addition to, or instead of, the amount of data in the buffer. For example, repeated signal loss for one or two seconds may be used to determine how much audio needs to be stored at the client 300. Recent network condition history may be used to configure the buffer so that, in periods of repeated loss of signal over a short period, the buffer size may be increased. Further, it is possible to commence production of the predicted game audio only when the game audio buffer is empty. However, it is desirable to predict the game audio using game audio from the game audio buffer to provide context to the trained AI model.
In embodiments, the first threshold amount is 2 seconds of game audio although, of course, the disclosure is not so limited and the threshold may be selected based upon the network conditions. For example, the game audio buffer may store less data when the network conditions are good and may store more data when the network conditions are not as good.
In the event that the amount of game audio stored in the game audio buffer is less than the threshold, the “Yes” path is followed to step 730. In step 730, the game audio stored in the game audio buffer is fed into the trained AI model and the predicted game audio is generated. In embodiments, one or more game pad inputs may also be provided to generate the predicted game audio. The predicted game audio is then output. The process returns to step 710 where the client 300 checks the game audio buffer allocated to game audio received from the server 200.
The predicted game audio will continue to be generated by the trained AI model until the game audio buffer has filled again with sufficient game audio received from the server 200 to reach the first threshold amount. In other words, the predicted game audio will continue to be generated by the trained AI model until the first threshold amount is reached. In this case, the “No” path is followed in step 715 to progress to step 720. In step 720, the game audio is streamed from the game audio buffer and, in the event that the predicted game audio is currently being reproduced, the client 300 outputs the game audio from the game audio buffer.
Although the above describes the client 300 checking the amount of game audio in a game audio buffer as a condition for commencing the generation of predicted game audio, the disclosure is not so limited. In embodiments, the client 300 uses a measure of network connection quality to determine when to commence the generation of the predicted game audio. For example, it is possible to analyse the amount of data packet loss over the network connection and, when the amount of data packet loss is above a threshold amount, the client 300 determines that commencement of predicted game audio generation is required. Such a measure of network connection quality may be used on its own or in combination with the amount of game audio stored in the game audio buffer to determine when to commence the generation of the predicted game audio; a combined sketch is given below.
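A compact sketch of this checking decision, combining the buffer-level test of Figure 5 with the optional packet-loss measure, might look as follows; all thresholds and names are illustrative:
```python
from typing import Optional

def should_generate_predicted_audio(buffered_seconds: float,
                                    packet_loss_ratio: Optional[float] = None,
                                    buffer_threshold_s: float = 2.0,
                                    loss_threshold: float = 0.05) -> bool:
    """Decide whether the client 300 should start the trained AI model.

    Mirrors steps 710-730: generate predicted game audio when the game audio
    buffer runs low, or (optionally) when measured network packet loss exceeds
    a threshold. Threshold values here are illustrative only.
    """
    if buffered_seconds < buffer_threshold_s:
        return True
    if packet_loss_ratio is not None and packet_loss_ratio > loss_threshold:
        return True
    return False
```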
Figure 6 shows a table of the stored game and associated trained AI model. This is stored in the server storage 210.
In the table, a plurality of games are stored. These are identified by the ‘Game Name’ column. In addition, the ‘Game Type’, which is the type of the game shown in the ‘Game Name’ column, is stored. This allows filtering of the content for display to the user of the client 300. Of course, although not shown, other features of the trained AI model may be stored in association with the AI model. For example, a certain character or characters, levels within the game, scenarios or customised music may be stored within the table.
Further, a location of the trained AI model used for the game, or of one of the other features of the trained AI model noted above, is stored in association with the game. In this case, a Uniform Resource Identifier (URI) link to the trained AI model or other feature(s) is stored. This allows the client 300 to retrieve the trained AI model or feature(s) for storage within the client storage 310. Of course, the disclosure is not so limited and any kind of identifier that uniquely identifies the location of the trained AI model may be provided instead. Further, instead of a location, the trained AI model may be stored within the server storage 210 and may be provided to the client 300 upon request of the game.
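Illustratively, the table of Figure 6 could be represented and queried as follows; the field names, game names and URIs are hypothetical:
```python
# Hypothetical representation of the Figure 6 table held in server storage 210.
MODEL_TABLE = [
    {"game_name": "Example Kart", "game_type": "party",
     "model_uri": "https://models.example.com/example-kart/lora.bin"},
    {"game_name": "Example Shooter", "game_type": "first person shooter",
     "model_uri": "https://models.example.com/example-shooter/lora.bin"},
]

def model_uri_for(game_name: str, default_uri: str) -> str:
    """Return the trained model's URI for a game, or a generic fallback model."""
    for row in MODEL_TABLE:
        if row["game_name"] == game_name:
            return row["model_uri"]
    return default_uri  # no game-specific model: fall back to the generic model
```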
Although the above describes an AI model trained on game audio of a specific game, the disclosure is not so limited and, in the event of no specifically trained AI model being available, a generic AI model is, in embodiments, provided to the client 300. This may be stored within the client 300 or may be downloaded from the server 200. Further, in embodiments, the generic or foundation model may exist in the game itself and the parameters to fine-tune the model may be downloaded and stored locally within the client.
Figure 7 shows a diagram according to embodiments explaining the signalling between the server 200, the client 300 and the storage containing the trained AI model. In step 605, the client 300 requests access to the game library stored within the server 200. The table is returned to the client 300 in step 610. This allows the client 300 to identify a list of games available to the user to play. As noted above, the client 300 may apply a filter to the returned results in order to better navigate a large number of returned games. This filter may be based upon game type. The user selects a game and this selection is returned to the server 200 in step 612. At the same time as selecting the game, the client 300 retrieves the trained AI model from the URI location noted in the table. This is step 615. The client 300 may, of course, retrieve the trained AI model from the server 200 if this is where the trained AI model is stored. The client 300 informs the server 200 that the trained AI model has been retrieved in step 620.
It should be noted that the server 200 and the client 300 communicate in step 625, allowing the user to play the computer game. In embodiments, the client 300 provides the user input and the server 200 uses the user input to generate game video and game audio using the game mechanics stored within the server 200 and returns the game video and game audio to the client 300. During the game play of step 625, the client 300 performs a checking process according to embodiments. The checking process according to embodiments is described above with reference to Figure 5.
Figure 8 shows a flow chart describing a process 900 according to embodiments performed in the predicting of game audio in a computer game run on the client 300. The process starts at step 905. The process moves to step 910 where game audio is received from the server 200. The process moves to step 915 where a gap is identified in the received game audio. The process moves to step 920 where the received game audio is applied to the trained AI model to generate predicted game audio. In some embodiments, the audio generation composes audio for the inpainting according to the type of game or the location in the game, so at the beginning of a gap the full extent of the inpainting for the gap may not yet have been determined. The audio may be assembled from one or more composed segments. The process moves to step 925 where the generated predicted game audio is inpainted in the gap to generate output game audio. The process moves to step 930 where the output game audio is output. The process moves to step 935 where the process ends.
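Tying the steps of process 900 together, an end-to-end sketch might read as follows; the `receive_frame` and `play` callables are hypothetical stand-ins for client 300 internals, and `InpaintingAudioPipeline` is the sketch given earlier:
```python
def run_process_900(receive_frame, model, play, frame_size=1024):
    """Steps 910-930: receive game audio, detect gaps, inpaint and output."""
    pipeline = InpaintingAudioPipeline(model, frame_size=frame_size)
    while True:
        frame = receive_frame()                   # step 910; None signals a gap (step 915)
        out = pipeline.next_output_frame(frame)   # steps 920-925: predict and inpaint
        play(out)                                 # step 930: output the output game audio
```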
In so far as embodiments of the disclosure have been described as being implemented, at least in part, by software-controlled data processing apparatus, it will be appreciated that a non-transitory machine- readable medium carrying such software, such as an optical disk, a magnetic disk, semiconductor memory or the like, is also considered to represent an embodiment of the present disclosure.
It will be appreciated that the above description for clarity has described embodiments with reference to different functional units, circuitry and/or processors. However, it will be apparent that any suitable distribution of functionality between different functional units, circuitry and/or processors may be used without detracting from the embodiments.
Described embodiments may be implemented in any suitable form including hardware, software, firmware or any combination of these. Described embodiments may optionally be implemented at least partly as computer software running on one or more data processors and/or digital signal processors. The elements and components of any embodiment may be physically, functionally and logically implemented in any suitable way. Indeed the functionality may be implemented in a single unit, in a plurality of units or as part of other functional units. As such, the disclosed embodiments may be implemented in a single unit or may be physically and functionally distributed between different units, circuitry and/or processors.
Although the present disclosure has been described in connection with some embodiments, it is not intended to be limited to the specific form set forth herein. Additionally, although a feature may appear to be described in connection with particular embodiments, one skilled in the art would recognize that various features of the described embodiments may be combined in any manner suitable to implement the technique.
Embodiments of the present technique can generally be described by the following numbered clauses:
1. A method of audio inpainting in a client device, comprising: receiving game audio from a server; identifying a gap in the received game audio; applying the game audio to a trained artificial intelligence model to generate predicted game audio; inpainting the generated predicted game audio in the gap to generate output game audio; and outputting the output game audio.
2. A method according to clause 1, comprising: receiving resumed game audio from the server; and outputting the resumed game audio.
3. A method according to clause 2, comprising: applying a cross-fade between the predicted game audio and the resumed game audio.
4. A method according to any preceding clause, wherein the gap is identified in the event of an unstable connection between the client device and the server.
5. A method according to any preceding clause, comprising: receiving metadata from the server and generating the predicted game audio in accordance with the received metadata.
6. A method according to any preceding clause, comprising: receiving one or more user inputs and generating the predicted game audio in accordance with the received one or more user inputs.
7. A method according to any preceding clause, wherein the artificial intelligence model is trained using audio corresponding to one or more user inputs.
8. A method according to either one of clause 6 or 7, wherein the user input is provided by one or more of a controller input, touch screen input, voice command, gesture control or brain activity sensor control.
9. A method of predicting game audio in a computer game run on a client, the method comprising: receiving game audio from a server; applying the received game audio to a game-audio trained AI model to produce predicted game audio; and outputting the predicted game audio based on a playback status, detected by circuitry, of the received game audio.
10. A computer program product comprising computer readable instructions which, when loaded onto a computer, configure the computer to perform a method according to any one of clauses 1 to 9.
11. A client device, comprising circuitry configured to: receive game audio from a server; identify a gap in the received game audio; apply the game audio to a trained artificial intelligence model to generate predicted game audio; inpaint the generated predicted game audio in the gap to generate output game audio; and output the output game audio.
12. A device according to clause 11, wherein the circuitry is configured to: receive resumed game audio from the server; and output the resumed game audio.
13. A device according to clause 12, wherein the circuitry is configured to: apply a cross-fade between the predicted game audio and the resumed game audio.
14. A device according to any one of clause 11 to 13, wherein the gap is identified in the event of an unstable connection between the client device and the server.
15. A device according to any one of clauses 11 to 14, wherein the circuitry is configured to: receive metadata from the server and generate the predicted game audio in accordance with the received metadata.
16. A device according to any one of clauses 11 to 15, wherein the circuitry is configured to: receive one or more user inputs and generate the predicted game audio in accordance with the received one or more user inputs.
17. A device according to any one of clauses 11 to 16, wherein the artificial intelligence model is trained using audio corresponding to one or more user inputs.
18. A device according to either one of clause 16 or 17, wherein the user input is provided by one or more of a controller input, touch screen input, voice command, gesture control or brain activity sensor control.
19. A device for predicting game audio in a computer game run on a client, the device comprising circuitry configured to: receive game audio from a server; apply the received game audio to a game-audio trained AI model to produce predicted game audio; and output the predicted game audio based on a playback status, detected by circuitry, of the received game audio.

Claims

1. A method of audio inpainting in a client device, comprising: receiving game audio from a server; identifying a gap in the received game audio; applying the game audio to a trained artificial intelligence model to generate predicted game audio; inpainting the generated predicted game audio in the gap to generate output game audio; and outputting the output game audio.
2. A method according to claim 1, comprising: receiving resumed game audio from the server; and outputting the resumed game audio.
3. A method according to claim 2, comprising: applying a cross-fade between the predicted game audio and the resumed game audio.
4. A method according to claim 1, wherein the gap is identified in the event of an unstable connection between the client device and the server.
5. A method according to any preceding claim, comprising: receiving metadata from the server and generating the predicted game audio in accordance with the received metadata.
6. A method according to claim 1, comprising: receiving one or more user inputs and generating the predicted game audio in accordance with the received one or more user inputs.
7. A method according to claim 1, wherein the artificial intelligence model is trained using audio corresponding to one or more user inputs.
8. A method according to claim 6, wherein the user input is provided by one or more of a controller input, touch screen input, voice command, gesture control or brain activity sensor control.
9. A method of predicting game audio in a computer game run on a client, the method comprising: receiving game audio from a server; applying the received game audio to a game-audio trained AI model to produce predicted game audio; and outputting the predicted game audio based on a playback status, detected by circuitry, of the received game audio.
10. A computer program product comprising computer readable instructions which, when loaded onto a computer, configure the computer to perform a method according to claim 1.
11. A client device, comprising circuitry configured to: receive game audio from a server; identify a gap in the received game audio; apply the game audio to a trained artificial intelligence model to generate predicted game audio; inpaint the generated predicted game audio in the gap to generate output game audio; and output the output game audio.
12. A device according to claim 11, wherein the circuitry is configured to: receive resumed game audio from the server; and output the resumed game audio.
13. A device according to claim 12, wherein the circuitry is configured to: apply a cross-fade between the predicted game audio and the resumed game audio.
14. A device according to claim 11, wherein the gap is identified in the event of an unstable connection between the client device and the server.
15. A device according to claim 11, wherein the circuitry is configured to: receive metadata from the server and generate the predicted game audio in accordance with the received metadata.
16. A device according to claim 11, wherein the circuitry is configured to: receive one or more user inputs and generate the predicted game audio in accordance with the received one or more user inputs.
17. A device according to claim 11, wherein the artificial intelligence model is trained using audio corresponding to one or more user inputs.
18. A device according to claim 16, wherein the user input is provided by one or more of a controller input, touch screen input, voice command, gesture control or brain activity sensor control.
19. A device for predicting game audio in a computer game run on a client, the device comprising circuitry configured to: receive game audio from a server; apply the received game audio to a game-audio trained AI model to produce predicted game audio; and output the predicted game audio based on a playback status, detected by circuitry, of the received game audio.
PCT/GB2025/050480 2024-03-28 2025-03-10 A device, computer program and method Pending WO2025202592A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
GB2404517.1A GB2639939A (en) 2024-03-28 2024-03-28 A device, computer program and method
GB2404517.1 2024-03-28

Publications (1)

Publication Number Publication Date
WO2025202592A1 (en) 2025-10-02

Family

ID=91023392

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2025/050480 Pending WO2025202592A1 (en) 2024-03-28 2025-03-10 A device, computer program and method

Country Status (2)

Country Link
GB (1) GB2639939A (en)
WO (1) WO2025202592A1 (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190369946A1 (en) * 2018-06-01 2019-12-05 Deepmind Technologies Limited Resolving time-delays using generative models
US20220392459A1 (en) * 2020-04-01 2022-12-08 Google Llc Audio packet loss concealment via packet replication at decoder input
US20230326468A1 (en) * 2021-10-09 2023-10-12 Tencent Technology (Shenzhen) Company Limited Audio processing of missing audio information
US20230377584A1 (en) * 2020-10-15 2023-11-23 Dolby International Ab Real-time packet loss concealment using deep generative networks

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210067735A1 (en) * 2019-09-03 2021-03-04 Nvidia Corporation Video interpolation using one or more neural networks
EP4483356A1 (en) * 2022-02-25 2025-01-01 JackTrip Labs, Inc. Digital signal processing for cloud-based live performance

Also Published As

Publication number Publication date
GB202404517D0 (en) 2024-05-15
GB2639939A (en) 2025-10-08

Similar Documents

Publication Publication Date Title
KR102782839B1 (en) Detecting and compensating for display lag in gaming systems
US11731043B2 (en) Adaptive graphics for cloud gaming
KR102096300B1 (en) Output data providing server and output data providing method
US7458894B2 (en) Online gaming spectator system
US20200406139A1 (en) Game execution device and game program
CN107547940A (en) Video playback processing method, equipment and computer-readable recording medium
US12347194B2 (en) Automated generation of haptic effects based on haptics data
JP6294379B2 (en) Asynchronous audio for network games
CN114344892B (en) Data processing method and related device
CN106385408A (en) Motion state changing indication and processing method and device
US12400649B2 (en) Customized dialogue support
US12245008B2 (en) Dynamic audio optimization
WO2025202592A1 (en) A device, computer program and method
CN118540529A (en) Multimedia data synchronization method, device, equipment, readable storage medium and product
JP2003333548A (en) Stream media distribution method and apparatus, software and recording medium
JP2024132049A (en) Program and game device
CN117654022A (en) Method, device, equipment and program product for processing key frames
JP2003309816A (en) Stream media reproduction method, apparatus, program, recording medium recording the program, stream media distribution system
WO2009038366A1 (en) Station and method for internet broadcasting interaction type-content and record media recoded program realizing the same

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 25713383

Country of ref document: EP

Kind code of ref document: A1