WO2025017358A1 - Meteorological information prediction using images from webcams - Google Patents
- Publication number: WO2025017358A1 (application PCT/IB2023/057398)
- Authority: WIPO (PCT)
- Legal status: Pending
Classifications
- G01W 1/10 — Meteorology; devices for predicting weather conditions
- G06N 3/0442 — Recurrent networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
- G06N 3/045 — Combinations of networks
- G06N 3/0464 — Convolutional networks [CNN, ConvNet]
- G06N 3/08 — Learning methods
- G06V 10/82 — Image or video recognition or understanding using neural networks
- G06V 20/70 — Labelling scene content, e.g. deriving syntactic or semantic representations
Abstract
One aspect of the present invention proposes a method of predicting meteorological information by using an artificial deep neural network. The present invention according to one example focuses on image processing, specifically on fusing a plurality of images obtained by different webcams. It introduces a novel neural network architecture that combines a convolutional neural network (CNN) and long short-term memory (LSTM). The purpose of this fusion is to enhance the accuracy of predicting meteorological information compared to using a single image. By leveraging the capabilities of CNN and LSTM, the invention aims to improve the fusion process by extracting significant features from the images (such as clouds and shades) while considering the temporal aspect of the data.
Description
METEOROLOGICAL INFORMATION PREDICTION USING IMAGES FROM WEBCAMS
TECHNICAL FIELD
The present invention relates to a method for predicting meteorological data or information, such as solar irradiance, using one or more webcam images processed by a novel neural network architecture combining a convolutional neural network (CNN) and a long short-term memory (LSTM) network. The present invention provides improved accuracy in predicting global horizontal solar irradiance (GHI) or other meteorological information. By leveraging the power of the CNN-LSTM combination, the invention aims to optimise the image fusion process by extracting meaningful features (such as clouds, shades, etc.) from the images while accounting for the temporal aspect of the data. This invention may also benefit from feeding optional weather-related information, such as wind direction, humidity, and/or satellite images, to the neural network to further improve the prediction. The present invention also relates to a prediction system configured to implement the method.
BACKGROUND OF THE INVENTION
The need to forecast GHI at different time horizons depends on the use case of photovoltaic (PV) forecasting. GHI prediction may be used for instance to better schedule and dispatch energy, and/or it may be used for storage planning and reserve activation. As researchers sought more accurate methods to forecast intraday GHI, the use of all-sky images or satellite images started gaining popularity. Traditional approaches, such as large numerical weather models, often struggled to accurately capture the dynamic and stochastic nature of cloud cover. All-sky cameras are specialised cameras used for capturing an unobstructed view of the sky. They provide a 180-degree field of view covering the sky hemisphere.
The collected images are then pre-processed to remove any distortion or noise and to extract relevant features, such as colour, texture or image channel ratios. Cloud detection algorithms are then applied to identify and classify the cloud coverage. With all these features extracted, prediction models are developed to forecast GHI. The models range from simple machine learning (ML) models, such as regression models, random forests, and support vector machines, to deep neural networks with multiple CNN and/or LSTM blocks. The development of computer vision techniques and ML models has enabled more precise GHI forecasts.
The main problem with satellite images is that they are low-resolution images. Furthermore, both satellite images and all-sky cameras fail to show any cross-sectional view of the sky around the site of interest.
SUMMARY OF THE INVENTION
It is an object of the present invention to overcome at least some of the problems identified above related to meteorological information prediction. More specifically, one of the objects of the present invention is to propose a meteorological information prediction method.
According to a first aspect of the present invention, there is provided a method of predicting meteorological information by using an artificial deep neural network as recited in claim 1.
The methodology described in the present invention partially builds on the existing literature, but for instance it uses images from webcams (public or not) instead of all-sky cameras. Additionally, it optionally combines a plurality of streams of images from different locations. An advantage of this methodology is its ease of scalability, as it only requires the availability of webcams (which are already abundant) to collect the images. Hence, instead of having to install all-sky cameras in the sites of interest, one can simply tap into any surrounding webcam. An additional benefit of webcams is the cross-sectional view of the sky that they provide, as opposed to all-sky or satellite cameras. This brings cloud height information that is not present in other types of images. More specifically, cross-sectional views contain rich information about cloud types because, unlike the other types of images, they show the different types of clouds at different elevations. Knowing the cloud type and elevation allows the cloud density and the speed at which the clouds move to be determined, because different types of clouds at different elevations in the atmosphere travel at different speeds.
Another advantage of the present invention lies in the optional combination of multiple image viewpoints, as this improves the accuracy of the predictions. This can be achieved because the network is able to extract contemporaneous (i.e., time-related) features from the different images that help better understand the dynamics of the clouds surrounding the site of interest.
Thus, the present invention provides a new concept of image processing, optionally by fusing a plurality of images, using a deep neural network comprised of CNN and LSTM layers.
Other objects and advantages of the invention will become apparent to those skilled in the art from a review of the ensuing detailed description.
According to a second aspect of the present invention, there is provided a computer program product comprising instructions for implementing the steps of the method when loaded and run on a computing apparatus.
According to a third aspect of the present invention, there is provided a prediction system configured to implement the method according to the first aspect of the present invention.
Other aspects of the invention are recited in the dependent claims attached hereto.
BRIEF DESCRIPTION OF THE DRAWINGS
Other features and advantages of the invention will become apparent from the following description of non-limiting example embodiments, with reference to the appended drawings, in which:
• Figure 1 schematically illustrates an example artificial deep neural network used to combine and analyse images to generate a GHI prediction according to an example of the present invention;
• Figure 2 schematically illustrates the concept of a time-distributed block;
• Figure 3 shows two sample images from two different webcams. These images may be used to train the proposed neural network or to generate a GHI prediction, for example;
• Figure 4 shows three different images collected from an all-sky camera. The three different images depict different weather conditions, namely starting from left to right: sunny, cloudy, and rainy;
• Figure 5 illustrates one example network configuration for adding additional data to the existing deep neural network; and
• Figures 6a and 6b show a flowchart illustrating the main steps of the process of performing a GHI prediction according to one example of the present invention.
DETAILED DESCRIPTION OF EMBODIMENTS OF THE INVENTION
Some embodiments of the present invention will now be described in detail with reference to the attached figures. As utilised herein, “and/or” means any one or more of the items in the list joined by “and/or”. As an example, “x and/or y” means any element of the three-element set {(x), (y), (x, y)}. In other words, “x and/or y” means “one or both of x and y.” As another example, “x, y, and/or z” means any element of the seven-element set {(x), (y), (z), (x, y), (x, z), (y, z), (x, y, z)}. In other words, “x, y and/or z” means “one or more of x, y, and z.” Furthermore, the term “comprise” is used herein as an open-ended term. This means that the object encompasses all the elements listed, but may also include additional, unnamed elements. Thus, the word “comprise” is interpreted by the broader meaning “include”, “contain” or “comprehend”. Identical or corresponding functional and structural elements which appear in the different drawings are assigned the same reference numerals. It is to be noted that the use of words “first”, “second” and “third”, etc. may not imply any kind of particular order or hierarchy unless this is explicitly or implicitly made clear in the context.
In the current context, the term “webcam” refers to a digital camera that is primarily used to transmit real-time images, designed for live streaming or video surveillance among other applications. Webcams in the present context are thus configured to record video and/or capture still images. There exist multiple differences between a webcam and an “all-sky camera”. Whereas the latter is installed in open areas with a clear and unobstructed view of the sky (such as rooftops or weather stations), the former can be installed almost anywhere. In this case, the interest is in webcams installed outdoors that capture the sky (a first part of the image) and the surrounding area (a second part of the image). Two example images captured by such webcams are shown in Figure 3. These are panoramic images with horizontally elongated fields of view. More precisely, the example images of Figure 3 are 360-degree panoramic images, but other types of images are equally possible. As can be seen, one portion of these images shows the sky, and the remaining portion shows the surrounding area (i.e., elements other than the sky), such as the ground. For comparison, Figure 4 shows three images captured by an all-sky camera.
Webcams provide a better cross-sectional view of the sky compared to all-sky cameras. The cross-section is taken substantially along the direction of a surface normal of the ground. Additionally, since webcams capture part of their surroundings, additional information, such as shadows and reflections, could be extracted from the images that could help predict GHI. On the other hand, images from all-sky cameras may experience some distortion due to the “fisheye lens” used to capture the entire hemisphere of the sky with a wide-angle perspective. The distortion is usually corrected, as modern all-sky cameras employ corrective measures to ensure that the distortion is minimised. However, even with this correction, some inaccuracies may remain in the captured images. The distortion could lead to geometric distortions, intensity variations of certain objects in the image or calibration errors.
When describing different layers of the proposed neural network, the term “parameters” refers to the layers’ internal variables that are chosen by the user. These include but are not limited to the number of artificial neurons (also referred to as units or filters in some layers), the activation function and the kernel initialiser. The term “weights” refers to the values of the connection between different neurons in the network. During training, their values are adjusted to optimise the network's performance. Finally, the term “bias” refers to a constant that is added to a respective layer. The bias is also learned during the training process and is used for many reasons, such as for handling zero or near-zero inputs and shifting activation thresholds (this adds flexibility to the network to help it learn complex relationships in the data).
Conv2D refers to two-dimensional (2D) convolution. In 2D convolution, the kernel (a matrix of a specific size) moves in two directions. Input and output data of the 2D convolution are three-dimensional, which is typically the case with image data. This is in contrast with Conv1D. In 1D convolution, the kernel moves in one direction. Input and output data of 1D convolution are two-dimensional, which is the case with time-series data. Input shapes of these two types of layers are expected to be: Conv1D: (size 1, channel number); Conv2D: (size 1, size 2, channel number). In the example below, we use RGB colour model images that have a height, width and depth (which are the channels).
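The following minimal sketch (not part of the patent text) illustrates the Conv1D and Conv2D input shapes described above, using the Keras API as one possible framework; the filter counts and kernel sizes are illustrative assumptions.

    # Minimal sketch of the Conv1D/Conv2D input shapes discussed above
    # (illustrative values only; not taken from the patent).
    import tensorflow as tf

    # Conv1D: input of shape (steps, channels), e.g. a univariate time series.
    conv1d = tf.keras.layers.Conv1D(filters=8, kernel_size=3)
    print(conv1d(tf.zeros((1, 96, 1))).shape)        # (1, 94, 8)

    # Conv2D: input of shape (height, width, channels), e.g. a 250x250 RGB image.
    conv2d = tf.keras.layers.Conv2D(filters=16, kernel_size=3)
    print(conv2d(tf.zeros((1, 250, 250, 3))).shape)  # (1, 248, 248, 16)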
A long short-term memory (LSTM) network is a recurrent neural network (RNN) that aims to deal with the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is its advantage over traditional RNNs, hidden Markov models and other sequence learning methods. It provides a short-term memory for the RNN that can last thousands of timesteps, leading to its name long short-term memory. It is especially applicable to classification, processing and predicting data based on time series.
The proposed neural network architecture 1 is schematically illustrated in Figure 1. The first block of the proposed network 1 contains a first convolutional layer 3. Used as a layer for feature extraction and image recognition, the CNN algorithm takes an image 5 as input and extracts relevant features of the given image using filters (or kernels). The image is captured by a webcam, which is connected to the network 1 either wirelessly or by a wired connection. The filters, whose size may be specified by the user, are generally small square matrices that slide over the input image 5 to create feature maps; in this case they slide from the top left to the bottom right part of the input image. At every pixel of the image, a new value is computed based on the convolution operation. The amount by which the filter moves horizontally and vertically in a given step is determined by the stride. A stride of (2,2), for example, means that the filter is moved 2 pixels horizontally and 2 pixels vertically in a given step. A feature map is then created by moving the filter across the image (represented by a 3D matrix in our case) while performing the convolution operation at every step. The edges, textures, or shapes of the input image are each represented by a distinct feature map. The network may extract more intricate characteristics from the input image by layering a plurality of convolutional layers on top of one another.
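As an illustration of the stride behaviour described above, the following hedged sketch (parameter values are assumptions, not taken from the patent) shows how a (2,2) stride roughly halves each spatial dimension of the resulting feature map.

    # A (2, 2) stride moves the filter 2 pixels per step in each direction,
    # roughly halving the spatial dimensions of the feature map.
    import tensorflow as tf

    conv = tf.keras.layers.Conv2D(filters=16, kernel_size=3,
                                  strides=(2, 2), padding="same")
    print(conv(tf.zeros((1, 250, 250, 3))).shape)  # (1, 125, 125, 16)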
Following the first convolution layer, a batch normalisation layer 7 is used before a second convolution layer 9. A first pooling layer 10 and a first dropout layer 11 are then added to complete a first sub-block in a CNN block 13. The pooling layer 10 is used to downsample the dimensionality in order to speed up calculations. There exist three common types of pooling layers: max pooling, min pooling and average pooling. These layers simply return the maximum, minimum or average value of their input (the output of the filter) respectively. In this example, the pooling layer is a maxpooling layer.
The dropout layer 11 is used as a regularisation technique, which helps prevent overfitting as it allows the network to generalise better when testing. During the training process, the dropout layer simply drops a pre-specified fraction of neurons (by setting their output to zero). This means that the connection of these specific neurons is ignored during that particular training step. The dropout is a stochastic process as the neurons dropped change from one training step to another. The dropping of random neurons prevents the network from relying on particular nodes or from learning patterns and relationships present in the training data but not in the test data. With this layer, the sub-block in the CNN block 13 is completed.
In this example, there are two adjacent or consecutive CNN sub-blocks for each input image 5 of the network 1. More specifically, the first sub-block is followed by a third convolutional layer 15, a second batch normalisation layer 17, a fourth convolutional layer 19, a second pooling layer 21 (which in this example is a maxpooling layer), and a second dropout layer 23.
Hence, in this example, the CNN block 13 is comprised of the following elements (a code sketch of this block follows the list):
• The two convolutional layers (Conv2D) 3, 9 with 16 filters and “ReLU” activation function. Each Conv2D layer extracts features from the input images 5 using the filters. The “ReLU” activation function is applied to the output of each filter to introduce non-linearity and increase the network’s learning capacity.
• The first batch normalisation layer 7 is inserted between the two Conv2D layers 3, 9. The batch normalisation layer normalises the output of the previous layer (in this case the previous Conv2D layer) before passing it to the next layer in contrast with normalising the network’s input data. This allows for higher learning rates and improves accuracy.
• Following the second Conv2D layer, the first maxpooling layer 10 is introduced which reduces the spatial dimension of the input by selecting the maximum value for every pooling window. This leads to downsampling the feature maps and preserving the most prominent features.
• The first dropout layer 11 is provided next. In this example, this layer has a dropout rate of 0.3. This layer acts as a regularisation technique to prevent overfitting as it randomly drops a fraction of its input which reduces the interdependencies between the neurons.
• There are also provided the two further Conv2D layers 15, 19 with the second batch normalisation layer 17 in-between. These two Conv2D layers have 32 filters and a “ReLU” activation function.
• These Conv2D layers 15, 19 are followed by the second maxpooling layer 21.
• The second maxpooling layer 21 is followed by the second dropout layer 23 with a dropout rate of 0.1 in this example.
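The sketch below is a hedged Keras-style rendering of the CNN block 13 just listed. The filter counts, activation functions and dropout rates follow the example values above; the kernel size, padding and pooling window are assumptions chosen only for illustration.

    import tensorflow as tf

    def make_cnn_block():
        """One CNN block 13: two Conv2D/BatchNorm/Conv2D sub-blocks, each
        followed by max pooling and dropout (kernel size, padding and pooling
        window are assumptions, not taken from the patent)."""
        L = tf.keras.layers
        return tf.keras.Sequential([
            # First sub-block: layers 3, 7, 9, 10 and 11.
            L.Conv2D(16, 3, activation="relu", padding="same"),
            L.BatchNormalization(),
            L.Conv2D(16, 3, activation="relu", padding="same"),
            L.MaxPooling2D(pool_size=2),
            L.Dropout(0.3),
            # Second sub-block: layers 15, 17, 19, 21 and 23.
            L.Conv2D(32, 3, activation="relu", padding="same"),
            L.BatchNormalization(),
            L.Conv2D(32, 3, activation="relu", padding="same"),
            L.MaxPooling2D(pool_size=2),
            L.Dropout(0.1),
        ])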
The CNN blocks 13 are wrapped under a time-distributed block 24. This indicates that the network is designed to process temporal sequences, in this case the sequence of images. The time-distributed block 24 allows the same set of layers to be applied to multiple time steps of the input data. This extends the capabilities of the CNN to handle time series data. One key aspect of the time-distributed block 24 is weight sharing. Some of the weights, and preferably all the weights, are shared across all the CNN blocks 13. This means that all the CNN blocks 13 have the same set of weights and biases.
Weight sharing allows the network 1 to learn and generalise temporal patterns and to extract similar features and patterns across different images. It also helps reduce the total number of parameters and hence, computational complexity.
As seen in Figure 2, each image has its dedicated CNN block 13. However, in order to reduce training complexity, time and power consumption, we allow the weights of the CNN blocks to be shared. The dashed arrows in Figure 2 represent the weights of the CNN blocks 13 that are shared between the upper and lower CNN blocks (one CNN block for each input). The intuition is that the images 5 are taken at the same instant and thus have the same timestamp, and they share many common features. Hence, sharing the CNN layers is not a problem, and it avoids having two independent CNN blocks that are trained separately on each input image.
The outputs of the CNN blocks 13 are then flattened and reshaped to be used as input to two consecutive LSTM layers. For this purpose, the network 1 comprises a flattening block or layer 27 for flattening the processed images from the CNN blocks 13, and a reshaping block or layer 29 for reshaping the data received from the flattening block. The LSTM architecture or layer comprises a memory cell and three gates: input, output, and forget gates. The input gate combines current input, previous output, and previous cell state using a sigmoid activation function. The forget gate determines which information from the previous cell state is discarded based on the current input and previous output/state. The output gate combines current input, previous output, and previous cell state. Finally, the block output is computed by combining the current cell state and the output gate.
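For reference, the gating just described corresponds to the standard LSTM cell formulation (textbook notation, not quoted from the patent), where x_t is the current input, h_{t-1} the previous output and c_{t-1} the previous cell state:

    f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)          % forget gate
    i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)          % input gate
    o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)          % output gate
    \tilde{c}_t = \tanh(W_c x_t + U_c h_{t-1} + b_c)   % candidate cell state
    c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t    % new cell state
    h_t = o_t \odot \tanh(c_t)                         % block output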
In the present example, we use two consecutive LSTM layers, namely a first LSTM layer 31 and a second LSTM layer 32 with 128 and 64 neurons, respectively, and with ReLU activation functions. To be able to use two consecutive LSTMs, the first LSTM layer 31 is set to return sequences for each input time step rather than just a single output at the last time step. This sequence will then be the input for the second LSTM layer 32. The first and second LSTM layers collectively form an LSTM architecture or block 33.
Finally, the output of the second LSTM layer 32 is fed into a fully connected dense layer 35 with 32 neurons in this case, followed by a single-neuron dense layer 37 which acts as the output layer. The output layer 37 outputs a real positive number representing the GHI. The network is then trained with a set of images, and early stopping with a patience of 10 epochs is added in this example. It should be noted that the “restore best weights” parameter is set to “True” to make sure that the model weights from the epoch with the lowest error are restored.
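Putting the pieces together, the following hedged sketch assembles the architecture of Figure 1 and the training setup described above in Keras. The layer sizes and activations follow the example values in the text; the optimizer, loss function and output activation (ReLU, to keep the prediction non-negative) are assumptions, and make_cnn_block() is the CNN block sketched after the list above.

    import tensorflow as tf
    L = tf.keras.layers

    # Two same-timestamp webcam views stacked along a pseudo-time axis.
    images = tf.keras.Input(shape=(2, 250, 250, 3))

    # Time-distributed block 24: one shared CNN block 13 applied to both views.
    x = L.TimeDistributed(make_cnn_block())(images)

    x = L.Flatten()(x)                      # flattening layer 27
    x = L.Reshape((2, -1))(x)               # reshaping layer 29
    x = L.LSTM(128, activation="relu", return_sequences=True)(x)  # LSTM layer 31
    x = L.LSTM(64, activation="relu")(x)    # LSTM layer 32
    x = L.Dense(32, activation="relu")(x)   # dense layer 35
    ghi = L.Dense(1, activation="relu")(x)  # output layer 37 (non-negative GHI)

    model = tf.keras.Model(images, ghi)
    model.compile(optimizer="adam", loss="mse")   # assumed optimizer and loss

    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=10, restore_best_weights=True)
    # model.fit(train_x, train_ghi, validation_data=(val_x, val_ghi),
    #           callbacks=[early_stop])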
With the current setup and in the present example, the neural network is trained using the images from only two cameras as input and the GHI measurement two hours ahead as a label, i.e., as an output prediction. For the training, images were collected from public webcams at regular intervals (for instance at intervals of a few seconds). This sampling allows the temporal changes in the cloud coverage, as well as the sun's positioning throughout the day, to be observed. The example images are 360° images with a square shape of 250x250x3. Once all the images are collected, the timestamps of the two sets of images are cross-checked to ensure that the respective images used are taken at the same time. Then, using the cross-matched timestamps, the GHI measurements are collected such that at each timestamp, the GHI measurement two hours in the future is used. In this example, the final dataset is then comprised of two images from two different cameras (taken at the same time) and a corresponding GHI measurement (two hours after the timestamp of the images).
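A hedged sketch of this dataset construction is shown below; the file names and column names are assumptions made for illustration only.

    import pandas as pd

    # One row per capture, indexed by timestamp (illustrative CSV layouts).
    cam_a = pd.read_csv("webcam_a_index.csv", parse_dates=["timestamp"])
    cam_b = pd.read_csv("webcam_b_index.csv", parse_dates=["timestamp"])
    ghi = pd.read_csv("ghi_measurements.csv", parse_dates=["timestamp"])

    # Keep only timestamps present in both image streams (cross-matching).
    pairs = cam_a.merge(cam_b, on="timestamp", suffixes=("_a", "_b"))

    # Shift the GHI series back by two hours so that each image pair taken at
    # time t is labelled with the GHI measured at t + 2 h.
    ghi_future = ghi.assign(timestamp=ghi["timestamp"] - pd.Timedelta(hours=2))
    dataset = pairs.merge(ghi_future, on="timestamp")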
Although only two images are used in the present example, this can be easily adapted to include just one image from one camera or more than two images, as well as to include additional data, such as air temperature, sea temperature, wind speed, wind direction, humidity, air pressure, pollution level, particle counts in the air, etc. The proposed neural network 1 depicted in Figure 1 serves only as an example to illustrate the idea of combining different images from a plurality of cameras (in this case from two cameras).
Including additional data can be achieved by adding one or more input layers and processing the new data separately before merging with the existing data. Figure 5 presents a preview of adding new data. However, this represents merely one way of adding new data, and the invention is not limited to adding only one additional stream of new information. One can experiment with the best way of modifying the architecture of the network to add new data based on the type of data to be added. For instance, adding additional images from a new camera would result in an extra input to the CNN block 13 of the network by adding a new CNN block for the new camera. On the other hand, adding a time series of meteorological data would mean adding a new input (a separate branch) to the network such that this data is not handled by the CNN blocks 13 but by another more appropriate layer, such as a concatenation layer 39 as shown in Figure 5. The concatenation layer is configured to concatenate different data sets that are fed into this layer.
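A hedged sketch of such a separate branch, building on the model sketch above (the number of weather features and the point at which the concatenation happens are assumptions), might look as follows.

    # Additional meteorological time-series data entering through its own input
    # and concatenated (layer 39) with the image-branch features before the
    # final dense layers; `x` and `images` come from the model sketch above.
    weather = tf.keras.Input(shape=(8,))                  # assumed feature count
    merged = tf.keras.layers.Concatenate()([x, weather])
    out = tf.keras.layers.Dense(32, activation="relu")(merged)
    ghi = tf.keras.layers.Dense(1, activation="relu")(out)
    model = tf.keras.Model([images, weather], ghi)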
The flowchart of Figures 6a and 6b summarises the above-described process of generating an intraday GHI prediction (typically two to six hours ahead). In step 61, a first image 5 captured by a first webcam and a second, different image 5 captured by a second, different webcam are fed into the deep neural network 1. The first and second images, which in this example are panoramic images, are taken at the same or substantially the same time instant and thus have the same timestamp. Furthermore, the images taken by these cameras at least partially overlap, i.e., they show substantially the same surroundings but from a different angle. It is to be noted that the deep neural network 1 has previously been trained by using a training data set comprising training images captured by the first and second webcams. In step 62, first sets of convolutional layers, i.e., the first and second convolutional layers 3, 9 and the first batch normalisation layer 7 of the respective CNN block 13, are applied to the images to extract low- and mid-level features from the images. In step 63, the first pooling layers of the CNN blocks are applied to the outputs of the convolutional layers to extract the maximum output of the convolutional layers. In step 64, the first dropout layers of the CNN blocks are applied to the outputs of the first pooling layers to prevent the network from depending on particular neurons.
In step 65, second sets of convolutional layers, i.e., the third and fourth convolutional layers 15, 19 and the second batch normalisation layer 17 of the respective CNN block 13, are applied to the outputs of the first dropout layers to extract mid- and high-level features from the images. In step 66, the second pooling layers 21 of the CNN blocks are applied to the outputs of the convolutional layers to extract the maximum output of the convolutional layers. In step 67, the second dropout layers of the CNN blocks are applied to the outputs of the second pooling layers to prevent the network from depending on particular neurons. The first and second dropout layers have, in this example, mutually different dropout rates. More specifically, in this example, the dropout rate of the first dropout layers is greater than the dropout rate of the second dropout layers. The output of the first CNN block is thus formed by a first feature map and the output of the second CNN block is formed by a second feature map.
In step 68, the outputs of the second dropout layers, i.e., the feature maps, are merged or fused. More specifically, in this step the feature maps are flattened into a vector. In step 69, the vector is reshaped to serve as input to the memory block 33 to obtain a reshaped data set or vector. In step 70, the memory block 33 retaining information from previous time step images is applied to the reshaped data set or vector. In this manner, the reshaped data set is processed or modified by taking into account information from one or more previous images that were processed by the neural network 1. The previous images are part of the training data set used to train the neural network. Thus, in this step, the memory block 33, which is able to extract time dependencies across different images captured at different time instants, processes the reshaped data set by considering the time dependencies to obtain a memory block output data set. A respective time dependency reflects the difference between the present images and one image from the training data set (with an older timestamp). In step 71, the dense layer 35, which is formed by one or more regression-like layers, is applied to the memory block output data set to map the memory block output data set to the final output layer 37 to obtain a mapped data set. In step 72, the output layer 37 is applied to the mapped data set to obtain a predicted GHI value from the output layer. It is to be noted that some of the layers shown in Figure 1 may be optional depending on the shape and/or size of data matrices output by various layers. For example, if the data output by the memory block is in a suitable format, then the dense layer 35 may be optional. In this case, the memory block would effectively map the reshaped data set to a suitable format. Furthermore, one of the flattening and reshaping layers may also be optional.
It is to be noted that although the above example was described in the context of generating a GHI prediction, the proposed methodology is not limited to this context but could instead or in addition be used to predict other meteorological data, such as wind speed or wind direction. Thus, one aspect of the present invention proposes a computer-implemented method of predicting meteorological data by using an artificial deep neural network. The present invention according to one example focuses on image processing, specifically on fusing a plurality of images obtained by different webcams. It introduces a novel neural network architecture that combines a CNN and an LSTM. The purpose of this fusion is to enhance the accuracy of predicting meteorological data compared to using a single image. By leveraging the capabilities of a CNN and an LSTM, the invention aims to improve the fusion process by extracting significant features from the images (such as clouds and shades) while considering the temporal aspect of the data. However, it is to be noted that the above teachings also apply to a scenario where a meteorological prediction is obtained only from one input image, optionally complemented by additional meteorological data.
While the invention has been illustrated and described in detail in the drawings and foregoing description, such illustration and description are to be considered illustrative or exemplary and not restrictive, the invention being not limited to the disclosed embodiments. Other embodiments and variants are understood, and can be achieved by those skilled in the art when carrying out the claimed invention, based on a study of the drawings, the disclosure and the appended claims. Further variants may be obtained by combining the teachings of any of the examples explained above.
In the claims, the word “comprising” does not exclude other elements or steps, and the indefinite article “a” or “an” does not exclude a plurality. The mere fact that different features are recited in mutually different dependent claims does not indicate that a combination of these features cannot be advantageously used. Any reference signs in the claims should not be construed as limiting the scope of the invention.
Claims
1. A method of predicting meteorological information by using an artificial deep neural network (1), the method comprising:
- receiving (61) by a first convolutional neural network block (13) of the artificial deep neural network (1) a first image (5) captured by a first webcam, the first image (5) showing a cross-sectional view of its surroundings such that a first portion of the first image (5) shows the sky and a second portion of the first image (5) shows a surrounding area containing elements other than the sky;
- extracting (62-67) by the first convolutional neural network block (13) a first set of features from the first image to obtain a first feature data set;
- extracting (70) by a memory block (33) one or more time dependencies across different images captured at least by the first webcam at different time instants to thereby process the first feature data set as processed or unprocessed by considering the one or more time dependencies to obtain a memory block output data set; and
- obtaining (72) a meteorological information prediction from an output layer (37) of the artificial deep neural network (1).
2. The method according to claim 1, wherein the processing comprises:
- flattening (68) the first feature data set to obtain a flattened data set, and/or
- reshaping (69) the flattened data set or the first feature data set.
3. The method according to claim 1, wherein the method further comprises:
- receiving (61) by a second convolutional neural network block (13) of the artificial deep neural network (1) a second image (5) captured by a second webcam, the second image (5) showing a cross-sectional view of its surroundings such that a first portion of the second image (5) shows the sky and a second portion of the second image (5) shows a surrounding area containing elements other than the sky;
- extracting by the second convolutional neural network block (13) a second set of features from the second image (5) to obtain a second feature data set; and
- processing the first and second feature data sets to merge them.
4. The method according to claim 3, wherein the processing comprises:
- flattening (68) the first and second feature data sets to obtain a flattened data set, and/or
- reshaping (69) the flattened data set or the first and second feature data sets.
5. The method according to claim 3 or 4, wherein the first and second images (5) are captured at the same time instant or substantially at the same time instant.
6. The method according to any one of claims 3 to 5, wherein the first and second images (5) show at least partially the same surroundings but from a different perspective.
7. The method according to any one of claims 3 to 6, wherein the first and second convolutional neural network blocks (13) share their weights or share at least some of their weights.
8. The method according to any one of the preceding claims, wherein the respective convolutional neural network block (13) comprises at least a first convolutional neural network sub-block comprising a first convolutional neural network layer (3) followed by a first batch normalisation layer (7), a second convolutional neural network layer (9), a first pooling layer (10), and a first dropout layer (11).
9. The method according to claim 8, wherein the first convolutional neural network sub-block is followed by a second convolutional neural network sub-block comprising a third convolutional neural network layer (15) followed by a second batch normalisation layer (17), a fourth convolutional neural network layer (19), a second pooling layer (21), and a second dropout layer (23).
10. The method according to any one of the preceding claims, wherein the method further comprises feeding additional non-image meteorological data to the artificial deep neural network (1) to be processed at least by the memory block (33) to influence the meteorological information prediction.
11. The method according to claim 10, wherein the additional non-image meteorological data are any of the following: air temperature, sea temperature, wind speed, wind direction, humidity, air pressure, pollution level, and particle counts in the air.
12. The method according to any one of the preceding claims, wherein the method further comprises mapping (71) the memory block output data set for the output layer (37) of the artificial deep neural network (1) to obtain a mapped data set.
13. The method according to any one of the preceding claims, wherein the meteorological information prediction is an intraday meteorological information prediction.
14. The method according to any one of the preceding claims, wherein the one or more time dependencies across different images reflect the difference between at least the first image (5) and one or more training images used to train the artificial deep neural network (1).
15. The method according to any one of the preceding claims, wherein the meteorological information prediction comprises a global horizontal solar irradiance prediction.
16. The method according to any one of the preceding claims, wherein the memory block (33) is a long short-term memory block (33).
17. The method according to claim 16, wherein the memory block (33) consists of a first long short-term memory layer (31) followed by a second long short-term memory layer (32).
18. The method according to any one of the preceding claims, wherein the output layer (37) consists of one artificial neuron.
19. The method according to any one of the preceding claims, wherein the respective image (5) is a panoramic image, and optionally a 360-degree image.
20. A computer program product comprising instructions for implementing the following steps when loaded and run on a computing apparatus:
- receiving by a first convolutional neural network block (13) of the artificial deep neural network (1) a first image (5) captured by a first webcam, the first image (5) showing a cross-sectional view of its surroundings such that a first portion of the first image (5) shows the sky and a second portion of the first image (5) shows a surrounding area containing elements other than the sky;
- extracting by the first convolutional neural network block (13) a first set of features from the first image to obtain a first feature data set;
- extracting by a memory block (33) one or more time dependencies across different images captured at least by the first webcam at different time instants to thereby process the first feature data set as processed or unprocessed by considering the one or more time dependencies to obtain a memory block output data set; and
- obtaining (72) a meteorological information prediction from an output layer (37) of the artificial deep neural network (1).
21. A prediction system for predicting meteorological information, the system comprising at least a first webcam and an artificial deep neural network (1), the artificial deep neural network (1) being configured to perform operations comprising:
- receive by a first convolutional neural network block (13) of the artificial deep neural network (1) a first image (5) captured by a first webcam, the first image (5) showing a cross-sectional view of its surroundings such that a first portion of the first image (5) shows the sky and a second portion of the first image (5) shows a surrounding area containing elements other than the sky;
- extract by the first convolutional neural network block (13) a first set of features from the first image to obtain a first feature data set;
- feed the first feature data set as processed or unprocessed to a memory block (33) able to extract one or more time dependencies across different images captured at least by the first webcam at different time instants to thereby process the first feature data set as processed or unprocessed by considering the one or more time dependencies to obtain a memory block output data set; and
- obtain a meteorological information prediction from an output layer (37) of the artificial deep neural network (1).
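For illustration only, the following minimal sketch shows one way the convolutional sub-blocks recited in claims 8 and 9, together with the weight sharing of claim 7, could be expressed against an assumed Keras/TensorFlow-style API; filter counts, kernel sizes, dropout rates and image dimensions are assumptions that do not appear in the claims.

```python
from tensorflow.keras import layers, Model, Sequential

def conv_sub_block(filters, rate=0.25):
    """One sub-block in the style of claims 8/9: conv, batch norm, conv, pooling, dropout."""
    return Sequential([
        layers.Conv2D(filters, 3, padding="same", activation="relu"),  # convolutional layer
        layers.BatchNormalization(),                                    # batch normalisation layer
        layers.Conv2D(filters, 3, padding="same", activation="relu"),  # convolutional layer
        layers.MaxPooling2D(),                                          # pooling layer
        layers.Dropout(rate),                                           # dropout layer
    ])

# A single branch made of two sub-blocks; applying the same branch object to both
# images means the two branches share all of their weights (cf. claim 7).
shared_branch = Sequential([conv_sub_block(32), conv_sub_block(64)])

img_a = layers.Input(shape=(128, 512, 3), name="webcam_a")
img_b = layers.Input(shape=(128, 512, 3), name="webcam_b")
feat_a = shared_branch(img_a)
feat_b = shared_branch(img_b)

extractor = Model(inputs=[img_a, img_b], outputs=[feat_a, feat_b])
```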
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/IB2023/057398 WO2025017358A1 (en) | 2023-07-20 | 2023-07-20 | Meteorological information prediction using images from webcams |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2025017358A1 true WO2025017358A1 (en) | 2025-01-23 |
Family
ID=87557961
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| PCT/IB2023/057398 Pending WO2025017358A1 (en) | 2023-07-20 | 2023-07-20 | Meteorological information prediction using images from webcams |
Country Status (1)
| Country | Link |
|---|---|
| WO (1) | WO2025017358A1 (en) |
Patent Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| DE102017214875A1 (en) * | 2017-08-25 | 2019-02-28 | Robert Bosch Gmbh | Method for evaluating image data, computer program and machine-readable storage medium, and system for evaluating image data |
| US20210004661A1 (en) * | 2018-03-07 | 2021-01-07 | Electricite De France | Convolutional neural network for estimating a solar energy production indicator |
| US20210158010A1 (en) * | 2018-05-31 | 2021-05-27 | Siemens Aktiengesellschaft | Solar irradiation prediction using deep learning with end-to-end training |
Non-Patent Citations (1)
| Title |
|---|
| HUERTAS-TATO JAVIER ET AL: "Using a Multi-view Convolutional Neural Network to monitor solar irradiance", NEURAL COMPUTING AND APPLICATIONS, SPRINGER LONDON, LONDON, vol. 34, no. 13, 21 April 2021 (2021-04-21), pages 10295 - 10307, XP037885356, ISSN: 0941-0643, [retrieved on 20210421], DOI: 10.1007/S00521-021-05959-Y * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 23751371; Country of ref document: EP; Kind code of ref document: A1 |