WO2021164028A1 - Method and apparatus for filling missing industrial longitudinal data

Info

Publication number: WO2021164028A1
Application number: PCT/CN2020/076273
Authority: WO
Inventors: Linfei ZHOU; Jing Li; Daniel Schneegass; Pengwei TIAN
Original assignee: Siemens Ltd China; Siemens AG; Siemens Corp
Current assignee: Siemens Ltd China; Siemens AG; Siemens Corp
Priority date: 2020-02-21
Filing date: 2020-02-21
Publication date: 2021-08-26
Anticipated expiration: 2022-08-21
Also published as: CN115151900A

Abstract

A method, apparatus, system and computer-readable medium for filling missing industrial longitudinal data are presented. In contrast to current linear regression or interpolation approaches, a slice is treated as a whole and the trend of slices over time is also considered, so that missing data can be filled in more meaningfully and in a way that reflects the real physical status.

Description

Method and apparatus for filling missing industrial longitudinal data

Technical Field

The present invention relates to techniques of industrial data processing, and more particularly to a method, apparatus and computer-readable storage medium for filling missing industrial longitudinal data.

Background Art

Industrial data is widely used in the industrial field for status monitoring of systems and devices, predictive maintenance, etc. Some industrial data are time series data. For example, load rate data of a grid can be collected at separate time points and vary over time within a day.

Furthermore, we can find that data from a grid might share a similar pattern between different days, which may indicate different working modes of power consumers. In such a case, time series data collected from a grid can be presented as longitudinal data where each slice is the observation of a corresponding day, representing the load rate of a grid, as shown in FIG. 1. Because of the periodic property of grid data, we can represent it in the form of longitudinal data. For example, each slice (or instance) is the daily running data and all the slices are arranged in chronological order. Here the missing data is defined as missing slices, as shown in FIG. 1, where several slices are missing from May 2015 to July 2015.
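
The slice representation described above can be sketched in a few lines of Python. This is only an illustration: the function name to_slices, the use of None to mark a gap, and the rule that a slice with any gap counts as a missing slice are assumptions made for the example and are not taken from the disclosure.

```python
from typing import List, Optional

# None marks a missing slice (an assumed convention for this sketch).
Slice = Optional[List[float]]


def to_slices(values: List[Optional[float]], points_per_slice: int = 24) -> List[Slice]:
    """Cut a flat, regularly sampled series into consecutive slices.

    Assumption: a slice that contains any missing point is treated as a
    missing slice as a whole. A trailing incomplete slice is dropped.
    """
    slices: List[Slice] = []
    last_start = len(values) - points_per_slice
    for start in range(0, last_start + 1, points_per_slice):
        chunk = values[start:start + points_per_slice]
        if any(v is None for v in chunk):
            slices.append(None)
        else:
            slices.append(list(chunk))
    return slices
```

With hourly load rates and points_per_slice=24, each returned entry corresponds to one day, in chronological order.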

Missing data occur when no data value is stored or collected, and can have a significant effect on the conclusions drawn from the data. This is a common occurrence, and certainly not unusual in longitudinal data.

Various approaches have been proposed to fill missing data for longitudinal data. There are also general missing data filling methods available for time series data (not longitudinal data), for example interpolation. However, to the best of our knowledge, none of them deals with longitudinal data where each slice is a piece of time series data instead of general multi-dimensional features.

Summary of the Invention

In this disclosure, we propose solutions in the industrial field to fill missing data for longitudinal data wherein each slice is time series data. In contrast to current linear regression or interpolation approaches, a slice is treated as a whole and the trend of slices over time is also considered, so that missing data can be filled in more meaningfully and in a way that reflects the real physical status.

Embodiments of the present disclosure include methods, apparatuses for filling missing industrial longitudinal data.

According to a first aspect of the present disclosure, a method for filling missing industrial longitudinal data is presented. The method includes following steps: collecting industrial longitudinal data, wherein the industrial longitudinal data comprise missing slices, each slice corresponds to a collecting time point; estimating overall trend of all slices of the industrial longitudinal data along time; calculating trend value of each missing slice based on the overall trend; for each missing slice, finding at least one similar slice based on trend value; filling each missing slice based on the at least one similar slice.

According to a second aspect of the present disclosure, an apparatus for filling missing industrial longitudinal data is presented. The apparatus includes: a data collection module, configured to collect industrial longitudinal data, wherein the industrial longitudinal data comprise missing slices, each slice corresponds to a collecting time point; a data processing module, configured to: estimate overall trend of all slices of the industrial longitudinal data along time; calculate trend value of each missing slice based on the overall trend; for each missing slice, find at least one similar slice based on trend value; fill each missing slice based on the at least one similar slice.

According to a third aspect of the present disclosure, an apparatus for filling missing industrial longitudinal data is presented. The apparatus includes at least one processor; at least one memory, coupled to the at least one processor, configured to execute method according to the first aspect.

According to a fourth aspect of the present disclosure, a computer-readable medium for filling missing industrial longitudinal data is presented. The computer-readable medium stores computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to the first aspect.

With the solutions provided in the present disclosure, the overall trend of slices of longitudinal data is estimated, and each missing slice is filled as a whole according to the overall trend and existing data with similar trend values. In comparison with currently used solutions, this provides a more reasonable filling solution, which can be widely used for periodic time series data.

Optionally, before estimating the overall trend of all slices of the industrial longitudinal data along time, each slice of the industrial longitudinal data can be normalized. It can then be determined whether all normalized slices of the industrial longitudinal data have identical shape, and if not, all slices can be split into parts, wherein slices in each part have identical shape. For each part with missing slices, the overall trend of slices in the part can be estimated, the trend value of each missing slice can be calculated based on the overall trend of the part, at least one similar slice for each missing slice can be found in the part based on trend value, and each missing slice can be filled based on the at least one similar slice. To bring the filling result closer to the real status, slices with identical shape should first be found for reference. However, because of amplitude differences, slices with the same shape but significantly different amplitudes can be mistaken for slices with different shapes. In order to make more slices available for reference, the influence of different amplitudes should first be eliminated by normalization. If different shapes really exist along time after normalization, then, to ensure that the closest slices are referenced, slices with identical shape should be processed as a separate part, and existing slices can be selected from this part for filling the missing ones. With normalization and division by slice shape, the filling result can be more accurate and closer to the real status. Methods of normalization can be customized based on customers' requirements.

Optionally, if shape differences between all normalized slices during a period of time are all smaller than a predefined threshold, it can be determined that the slices during the period of time have identical shape.

Optionally, the trend value is the mean of a slice.

Brief Description of the Drawings

The above mentioned attributes and other features and advantages of the present technique and the manner of attaining them will become more apparent and the present technique itself will be better understood by reference to the following description of embodiments of the present technique taken in conjunction with the accompanying drawings, wherein:

FIG. 1 depicts industrial longitudinal data and missing data.

FIG. 2 depicts a block diagram of an apparatus for filling missing industrial longitudinal data in accordance with one embodiment of the present disclosure.

FIG. 3A depicts a flow diagram of a method for filling missing industrial longitudinal data in accordance with one embodiment of the present disclosure.

FIG. 3B depicts a flow diagram of step S302.

FIG. 4 depicts normalized slices of data in FIG. 1.

FIG. 5A and FIG. 5B depict extracted trend from slices in FIG. 1.

FIG. 6 depicts filling results with solution provided in this disclosure.

Reference Numbers:

10, an apparatus for filling missing industrial longitudinal data

101, a data collecting module

102, a data processing module

103, at least one processor

104, at least one memory

105, a communication module

30, a data processing program

31, longitudinal data collected

300, a method for filling missing industrial longitudinal data

S301～S305, steps of method 300

S3021～S3023, sub steps of S302

Detailed Description of Example Embodiments

Hereinafter, above-mentioned and other features of the present technique are described in detail. Various embodiments are described with reference to the drawing, where like reference numerals are used to refer to like elements throughout. In the following description, for purpose of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It may be noted that the illustrated embodiments are intended to explain, and not to limit the invention. It may be evident that such embodiments may be practiced without these specific details.

When introducing elements of various embodiments of the present disclosure, the articles “a”, “an”, “the” and “said” are intended to mean that there are one or more of the elements. The terms “comprising”, “including” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.

Now the present disclosure will be described hereinafter in detail by referring to FIG. 2 to FIG. 6.

FIG. 2 depicts a block diagram of an apparatus in accordance with one embodiment of the present disclosure. The apparatus 10 for filling missing industrial longitudinal data presented in the present disclosure can be implemented as a network of computer processors, to execute the following method 300 for filling missing industrial longitudinal data presented in the present disclosure. The apparatus 10 can also be a single computer, as shown in FIG. 2, including at least one memory 104, which includes a computer-readable medium, such as a random access memory (RAM). The apparatus 10 also includes at least one processor 103, coupled with the at least one memory 104. Computer-executable instructions are stored in the at least one memory 104 and, when executed by the at least one processor 103, can cause the at least one processor 103 to perform the steps described herein. The at least one processor 103 may include a microprocessor, an application specific integrated circuit (ASIC), a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), state machines, etc. Embodiments of computer-readable medium include, but are not limited to, a floppy disk, CD-ROM, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions. Also, various other forms of computer-readable medium may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless. The instructions may include code from any computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, and JavaScript.

The at least one memory 104 shown in FIG. 2 can contain a data processing program 30 which, when executed by the at least one processor 103, causes the at least one processor 103 to execute the method 300 for filling missing industrial longitudinal data presented in the present disclosure. Longitudinal data 31 can also be stored in the at least one memory 104. These data can be received via a communication module 105 of the apparatus 10.

The data processing program 30 can include:

- a data collection module 101, configured to collect industrial longitudinal data 31;

- a data processing module 102, configured to process the collected industrial longitudinal data 31.

Here, the industrial longitudinal data 31 include missing slices, and each slice corresponds to a collecting time point.

In detail, the data processing module 102 is configured to

- estimate overall trend of all slices of the industrial longitudinal data 31 along time;

- calculate trend value of each missing slice based on the overall trend;

- for each missing slice, find at least one similar slice based on trend value;

- fill each missing slice based on the at least one similar slice.

Optionally, the data processing module 102 is further configured to: before estimating overall trend of all slices of the industrial longitudinal data along time, normalize each slice of the industrial longitudinal data; when estimating overall trend of all slices of the industrial longitudinal data along time, determine whether all normalized slices of the industrial longitudinal data have identical shape, and if not, split all slices of the industrial longitudinal data into parts wherein slices in each part have identical shape; for each part with missing slices, estimate overall trend of slices in the part; when calculating trend value of each missing slice based on the overall trend, for each part with missing slices, calculate trend value of each missing slice based on the overall trend of the part; when for each missing slice, finding at least one similar slice based on trend value, for each part with missing slices, find in the part at least one similar slice for each missing slice based on trend value; when filling each missing slice based on the at least one similar slice, for each part with missing slices, filling each missing slice based on the at least one similar slice.

Optionally, the data processing module 102 is further configured to, when determining whether all normalized slices of the industrial longitudinal data have identical shape, if shape differences between all normalized slices during a period of time are all smaller than a predefined threshold, determine that the slices during the period of time have identical shape.

Optionally, trend value is mean of a slice.

Details of data processing by the data processing module 102 will be described later in reference to FIG. 3A and FIG. 3B.

Although the data collection module 101 and the data processing module 102 are described above as software modules of the data processing program 30, they can also be implemented via hardware, such as ASIC chips. They can be integrated into one chip, or separately implemented and electrically connected.

It should be mentioned that the present disclosure may include apparatuses having different architecture than shown in FIG. 2. The architecture above is merely exemplary and used to explain the exemplary method 300 shown in FIG. 3A and FIG. 3B.

Various methods in accordance with the present disclosure may be carried out. One exemplary method 300 according to the present disclosure includes following steps:

S301: collecting industrial longitudinal data, wherein the industrial longitudinal data include missing slices, each slice corresponds to a collecting time point;

S302: estimating overall trend of all slices of the industrial longitudinal data along time;

S303: calculating trend value of each missing slice based on the overall trend;

S304: for each missing slice, finding at least one similar slice based on trend value;

S305: filling each missing slice based on the at least one similar slice.
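
The steps S301 to S305 can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the implementation of the disclosure: the function name fill_missing_slices is hypothetical, missing slices are marked with None, the trend value is the slice mean, the trend of a missing slice is estimated by plain linear interpolation between its nearest observed neighbours in time (a stand-in for whichever trend estimator is actually used), and only the single most similar slice is copied.

```python
from typing import List, Optional

Slice = Optional[List[float]]  # None marks a missing slice (assumed convention)


def fill_missing_slices(slices: List[Slice]) -> List[List[float]]:
    """Fill each missing slice as a whole, guided by the trend over time."""
    # S302: trend value (here: the mean) of every observed slice.
    trend = {i: sum(s) / len(s) for i, s in enumerate(slices) if s is not None}
    known = sorted(trend)
    if not known:
        raise ValueError("at least one observed slice is required")

    filled: List[List[float]] = []
    for i, s in enumerate(slices):
        if s is not None:
            filled.append(list(s))
            continue
        # S303: trend value of the missing slice (assumed: linear
        # interpolation between the nearest observed slices in time).
        before = [k for k in known if k < i]
        after = [k for k in known if k > i]
        if before and after:
            lo, hi = before[-1], after[0]
            w = (i - lo) / (hi - lo)
            target = (1 - w) * trend[lo] + w * trend[hi]
        else:
            target = trend[before[-1]] if before else trend[after[0]]
        # S304: observed slice whose trend value is closest to the target.
        nearest = min(known, key=lambda k: abs(trend[k] - target))
        # S305: fill the missing slice as a whole from that slice.
        filled.append(list(slices[nearest]))
    return filled
```

Note that the slice chosen in S304 need not be adjacent in time to the missing one; it is selected by trend value, so the filled slice keeps a realistic daily shape.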

Optionally, before the step S302 of estimating the overall trend of all slices of the industrial longitudinal data along time, the method 300 can further include:

S301’: normalizing each slice of the industrial longitudinal data.

Then, referring to FIG. 3B, the step S302 can include the following sub steps:

S3021: determining whether all normalized slices of the industrial longitudinal data have identical shape, and if not, proceeding to sub step S3022;

S3022: splitting all slices of the industrial longitudinal data into parts, wherein slices in each part have identical shape; and

S3023: for each part with missing slices, estimating overall trend of slices in the part.

Then the step S303 can include: for each part with missing slices, calculating trend value of each missing slice based on the overall trend of the part; the step S304 can include: for each part with missing slices, finding in the part at least one similar slice for each missing slice based on trend value; and the step S305 can include: for each part with missing slices, filling each missing slice based on the at least one similar slice.

Optionally, in the sub step S3021, if shape differences between all normalized slices during a period of time are all smaller than a predefined threshold, it can be determined that the slices during the period of time have identical shape.

Next, taking data from a grid shown in FIG. 1 as an example, an exemplary embodiment is described.

As shown in FIG. 1, grid power data in the form of longitudinal data were collected from January 2015 to June 2017. Several slices are missing from May 2015 to June 2015. With the solution provided in the present disclosure, missing slices can be filled with reference to the data trend along the time axis and to existing slices.

The basic idea is to fill missing data with the help of the overall trend and other slices. Optionally, to eliminate the influence of amplitude difference on shape judgement, normalization can be executed on the collected data in the step S301’; otherwise, slices that differ only in amplitude might be considered as having different shapes. Referring to FIG. 1, amplitudes of load rates in winter (around January and February) are significantly lower than in summer (around July, August and September in the same year), while the shapes of slices are consistent (from 0 o’clock to 24 o’clock, first keeping low, then ramping up, keeping flat and falling down).

Then in the sub step S3021, the normalized slices can be compared on their shape. Euclidean distance can be used to measure the difference between slices' shapes. If the shape differences of the normalized slices along time are greater than a predefined threshold, then in the sub step S3022, all slices can be split into parts, in each of which all the slices share a similar shape. Then in the sub step S3023, for each part (or for all the slices if there is no obvious shape difference among them), the overall trend of slices along time can be calculated. Optionally, the mean value of each slice can be used as the trend value of the slice.
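
The shape comparison of the sub steps S3021 and S3022 can be illustrated as follows. The Euclidean distance and the predefined threshold come from the description; the function names and the greedy chronological splitting rule are assumptions made for this sketch, since the disclosure leaves the splitting technique open.

```python
import math
from itertools import combinations
from typing import List, Sequence


def shape_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two normalized slices of equal length."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def have_identical_shape(normalized: Sequence[Sequence[float]], threshold: float) -> bool:
    """S3021: all pairwise shape differences are below the threshold."""
    return all(shape_distance(a, b) < threshold for a, b in combinations(normalized, 2))


def split_by_shape(normalized: Sequence[Sequence[float]], threshold: float) -> List[List[int]]:
    """S3022 (assumed greedy rule): walk the slices in chronological order and
    start a new part as soon as a slice differs from any slice in the current part.
    Returns the slice indices belonging to each part."""
    parts: List[List[int]] = []
    for i, s in enumerate(normalized):
        current = parts[-1] if parts else None
        if current is not None and all(
            shape_distance(s, normalized[j]) < threshold for j in current
        ):
            current.append(i)
        else:
            parts.append([i])
    return parts
```

By construction, every pair of slices inside a returned part is closer than the threshold, which matches the identical-shape criterion used in S3021.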

Next in the step S303, several techniques can be applied to estimate the trend values of the missing slices in each part, for instance polynomial curve fitting and Gaussian process. Given the estimated trend values for the missing slices, in the steps S304 and S305, missing slices can be filled in based on other slices that have similar trend values.
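
Polynomial curve fitting, one of the techniques named above, can be sketched with the standard library only. The function names, the use of the normal equations, and the choice of degree are assumptions for illustration; a numerical library would normally be preferred for higher degrees, where the normal equations become ill-conditioned.

```python
from typing import List, Sequence


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; try a lower polynomial degree")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_polynomial(ts: Sequence[float], ys: Sequence[float], degree: int) -> List[float]:
    """Least-squares coefficients c[0] + c[1]*t + ... + c[degree]*t**degree."""
    k = degree + 1
    ata = [[sum(t ** (i + j) for t in ts) for j in range(k)] for i in range(k)]
    aty = [sum(y * t ** i for t, y in zip(ts, ys)) for i in range(k)]
    return solve_linear(ata, aty)


def evaluate(coeffs: Sequence[float], t: float) -> float:
    """Trend value predicted by the fitted polynomial at time position t."""
    return sum(c * t ** i for i, c in enumerate(coeffs))
```

Here ts would hold the time positions of the observed slices (for example day indices) and ys their trend values; evaluate is then called at the positions of the missing slices.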

In the step S301’, for each slice, we can use feature scaling as the normalization method:

x_normalized = (x - min(x)) / (max(x) - min(x))

where x is the original slice and x_normalized is the normalized slice. All the normalized slices of data are shown in FIG. 4. Each curve in FIG. 4 can be regarded as the shape of the corresponding slice. From FIG. 4 we can see that almost all the slices share the same shape, so it is unnecessary to cut the data into different parts. Otherwise, some techniques (for instance, clustering) can be used to separate all the slices such that slices in the same part have identical shape.
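
Assuming that feature scaling here means the usual min-max scaling of each slice to the range from 0 to 1, the normalization of the step S301’ can be sketched as below. The function name and the handling of a completely flat slice are assumptions for this example.

```python
from typing import List, Sequence


def normalize_slice(x: Sequence[float]) -> List[float]:
    """Min-max scale one slice so that its values lie between 0 and 1."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # Flat slice: no shape information; mapped to zeros (assumed convention).
        return [0.0 for _ in x]
    return [(v - lo) / (hi - lo) for v in x]
```

Two slices with the same daily profile but different amplitudes, such as [2, 4, 6] and [20, 40, 60], map to the same normalized slice, which is what allows their shapes to be compared directly.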

FIG. 5A and FIG. 5B show the extracted trend for each slice. Here the mean value (bold dot) of each slice is used as the trend. In the step S302 and the step S303, referring to FIG. 5B, a Gaussian process can be applied to estimate the trend of the missing slices (the dotted line). In case there are multiple parts because the shape of slices is not consistent, this step will be executed separately for each part.
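
A Gaussian process estimate of the trend can be sketched as follows. This is a bare-bones posterior mean with a squared-exponential kernel, written with the standard library only; the kernel choice, the constant prior mean, the function names and the hyperparameter values (length_scale, noise) are all assumptions for illustration and would need tuning on real data.

```python
import math
from typing import List, Sequence


def rbf_kernel(a: float, b: float, length_scale: float) -> float:
    """Squared-exponential covariance between two time positions."""
    return math.exp(-0.5 * ((a - b) / length_scale) ** 2)


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def gp_predict_mean(
    train_t: Sequence[float],
    train_y: Sequence[float],
    query_t: Sequence[float],
    length_scale: float = 1.0,   # placeholder value, to be tuned
    noise: float = 1e-6,         # observation noise variance, placeholder
) -> List[float]:
    """Posterior mean of a GP with constant prior mean at the query positions."""
    mu = sum(train_y) / len(train_y)
    gram = [
        [rbf_kernel(ti, tj, length_scale) + (noise if i == j else 0.0)
         for j, tj in enumerate(train_t)]
        for i, ti in enumerate(train_t)
    ]
    alpha = solve_linear(gram, [y - mu for y in train_y])
    return [
        mu + sum(rbf_kernel(q, t, length_scale) * a for t, a in zip(train_t, alpha))
        for q in query_t
    ]
```

train_t and train_y would hold the time positions and mean values of the observed slices, and query_t the positions of the missing slices. When several parts exist, the function would be called once per part.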

In the step S304, at least one existing slice that has the most similar trend value can be found. Here, 2 existing slices slice_a and slice_b are used, and the missing slice can be represented as:

where slice_a and slice_b are the two slices whose trend values are closest to that of slice_missing, and trend_a, trend_b and trend_missing are their respective trend values. The number of existing slices used to calculate the missing slice can be chosen according to different application requirements.
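
One plausible way to combine the two closest slices is a linear blend in trend space, sketched below. This combination rule is an assumption made for illustration and is not asserted to be the formula of the disclosure; the function names are hypothetical as well. With the mean as trend value, the blended slice has exactly the estimated trend value trend_missing as its mean.

```python
from typing import Dict, List, Sequence


def find_closest(trends: Dict[int, float], target: float, count: int = 2) -> List[int]:
    """S304: indices of the observed slices whose trend values are closest to target."""
    return sorted(trends, key=lambda k: abs(trends[k] - target))[:count]


def fill_from_two(
    slice_a: Sequence[float],
    slice_b: Sequence[float],
    trend_a: float,
    trend_b: float,
    trend_missing: float,
) -> List[float]:
    """S305 (assumed rule): blend slice_a and slice_b point by point so that the
    trend of the result moves linearly from trend_a to trend_b."""
    if trend_a == trend_b:
        w = 0.5  # identical trends: plain average
    else:
        w = (trend_missing - trend_a) / (trend_b - trend_a)
    return [(1 - w) * a + w * b for a, b in zip(slice_a, slice_b)]
```

If both reference slices lie on the same side of trend_missing, the weight w falls outside the range from 0 to 1 and the rule extrapolates; an application may prefer to clamp w in that case.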

FIG. 6 compares the filling results of the present disclosure with those of another method. In the middle of the figure, linear regression is used for filling missing data. A point on a slice is a dimension; here there are 24 dimensions in a slice, each representing a specific hour in a day. The percentage on the right side is the value of a point on a slice, ranging from 0 to 1. With linear regression, the missing value is filled separately for every dimension of each slice, and we can see that the filled slices are no longer meaningful. Techniques other than linear regression can also be applied here; however, they have the same disadvantage. At the bottom of the figure, the missing slices are filled according to the present disclosure, which achieves a more reasonable result: a sharp increase can be found on May 1st, 2015, which is consistent with the trend in the same period in 2016 and 2017. In the middle of the figure, by contrast, the missing slices increase bit by bit, so important change information is lost with such a method.

Following are 2 use cases in which the solution provided in the present disclosure can be adopted.

Use Case 1: Condition Assessment Manager for Transformers

Data are collected at transformers and transferred into a data management system. After data processing and analysis, health reports and load-shift recommendations for the transformers can be provided to customers. For many reasons, the collected data may be incomplete, so missing data filling methods are needed in the data processing and analysis part.

Use Case 2: Distributed Energy System

There are various applications under the topic of distributed energy systems, for instance load balancing, peak avoidance, theft avoidance and so on. All these applications are based on the continuous monitoring of related devices, which has a low tolerance for missing data, making filling methods an indispensable part of data processing.

A computer-readable medium is also provided in the present disclosure, storing computer-executable instructions which, upon execution by a computer, enable the computer to execute any of the methods presented in this disclosure.

A computer program is also provided which, when executed by at least one processor, performs any of the methods presented in this disclosure.

While the present technique has been described in detail with reference to certain embodiments, it should be appreciated that the present technique is not limited to those precise embodiments. Rather, in view of the present disclosure which describes exemplary modes for practicing the invention, many modifications and variations would present themselves to those skilled in the art without departing from the scope and spirit of this invention. The scope of the invention is, therefore, indicated by the following claims rather than by the foregoing description. All changes, modifications, and variations coming within the meaning and range of equivalency of the claims are to be considered within their scope.

Claims

1. A method (300) for filling missing industrial longitudinal data, comprising:

- collecting (S301) industrial longitudinal data, wherein the industrial longitudinal data comprise missing slices, each slice corresponds to a collecting time point;

- estimating (S302) overall trend of all slices of the industrial longitudinal data along time;

- calculating (S303) trend value of each missing slice based on the overall trend;

- for each missing slice, finding (S304) at least one similar slice based on trend value;

- filling (S305) each missing slice based on the at least one similar slice.

2. The method (300) according to claim 1, wherein,

- before estimating (S302) overall trend of all slices of the industrial longitudinal data along time, the method further comprises: normalizing (S301’) each slice of the industrial longitudinal data;

- estimating (S302) overall trend of all slices of the industrial longitudinal data along time, comprises:

  - determining (S3021) whether all normalized slices of the industrial longitudinal data have identical shape, and if not,

  - splitting (S3022) all slices of the industrial longitudinal data into parts, wherein slices in each part have identical shape;

  - for each part with missing slices, estimating (S3023) overall trend of slices in the part;

- calculating (S303) trend value of each missing slice based on the overall trend, comprises: for each part with missing slices, calculating (S303) trend value of each missing slice based on the overall trend of the part;

- for each missing slice, finding (S304) at least one similar slice based on trend value, comprises: for each part with missing slices, finding (S304) in the part at least one similar slice for each missing slice based on trend value;

- filling (S305) each missing slice based on the at least one similar slice, comprises: for each part with missing slices, filling (S305) each missing slice based on the at least one similar slice.

3. The method (300) according to claim 2, wherein determining (S3021) whether all normalized slices of the industrial longitudinal data have identical shape, comprises: if shape differences between all normalized slices during a period of time are all smaller than a predefined threshold, determining (S3021) that the slices during the period of time have identical shape.

4. The method (300) according to claim 1, wherein trend value is mean of a slice.

5. An apparatus (10) for filling missing industrial longitudinal data, comprising:

- a data collection module (101), configured to collect industrial longitudinal data, wherein the industrial longitudinal data comprise missing slices, each slice corresponds to a collecting time point;

- a data processing module (102), configured to:

  - estimate overall trend of all slices of the industrial longitudinal data along time;

  - calculate trend value of each missing slice based on the overall trend;

  - for each missing slice, find at least one similar slice based on trend value;

  - fill each missing slice based on the at least one similar slice.

6. The apparatus (10) according to claim 5, wherein,

- the data processing module (102) is further configured to: before estimating overall trend of all slices of the industrial longitudinal data along time, normalize each slice of the industrial longitudinal data;

- the data processing module (102) is further configured to, when estimating overall trend of all slices of the industrial longitudinal data along time:

  - determine whether all normalized slices of the industrial longitudinal data have identical shape, and if not,

  - split all slices of the industrial longitudinal data into parts, wherein slices in each part have identical shape;

  - for each part with missing slices, estimate overall trend of slices in the part;

- the data processing module (102) is further configured to, when calculating trend value of each missing slice based on the overall trend: for each part with missing slices, calculate trend value of each missing slice based on the overall trend of the part;

- the data processing module (102) is further configured to, when for each missing slice, finding at least one similar slice based on trend value: for each part with missing slices, find in the part at least one similar slice for each missing slice based on trend value;

- the data processing module (102) is further configured to, when filling each missing slice based on the at least one similar slice: for each part with missing slices, fill each missing slice based on the at least one similar slice.

7. The apparatus (10) according to claim 6, wherein the data processing module (102) is further configured to, when determining whether all normalized slices of the industrial longitudinal data have identical shape: if shape differences between all normalized slices during a period of time are all smaller than a predefined threshold, determine that the slices during the period of time have identical shape.

8. The apparatus (10) according to claim 5, wherein trend value is mean of a slice.

9. An apparatus (10) for filling missing industrial longitudinal data, comprising:

- at least one processor (103);

- at least one memory (104), coupled to the at least one processor (103), configured to execute method according to any of claims 1～4.

10. A computer-readable medium for filling missing industrial longitudinal data, storing computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to any of claims 1～4.