WO2021164028A1 - Method and apparatus for filling missing industrial longitudinal data
- Publication number
- WO2021164028A1 (PCT/CN2020/076273)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- slices
- missing
- slice
- industrial
- trend
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
Definitions
- in the sub step S3021, if the shape differences between all normalized slices during a period of time are all smaller than a predefined threshold, it can be determined that the slices during that period have identical shape.
- missing slices can be filled with reference to the data trend along the time axis and to existing slices.
- in step S301’, normalization can be executed on the collected data; otherwise, slices that differ only in amplitude might be considered as having different shapes.
- amplitudes of load rates in winter around January and February
- summer around July, August and September in the same year
- shapes of slices are consistent (from 0 o’clock to 24 o’clock: first staying low, then ramping up, staying flat and falling down).
- comparison can be done between normalized slices on their shape. Euclidean distance can be used to measure the difference between slices’ shapes. If the shape differences of the normalized slices along time are greater than a predefined threshold, then in the sub step S3022 all slices can be split into parts, in each of which all the slices share a similar shape. Then in the sub step S3023, for each part (or for all the slices if there is no obvious shape difference among them), the overall trend of slices along time can be calculated. Optionally, the mean value of each slice can be calculated as the trend value of the slice.
- in step S303, for the missing slices in each part, several techniques can be applied to estimate their trend values, for instance polynomial curve fitting and Gaussian processes. Having the estimated trend values for the missing slices, in the steps S304 and S305 the missing slices can be filled in based on other slices that have similar trend values.
- x is the original slice and x_normalized is the normalized slice; all the normalized slices of the data are shown in FIG. 4.
- Each curve in FIG. 4 can be regarded as the shape of the corresponding slice. From FIG. 4 we can see that almost all the slices share the same shape, so it is unnecessary to cut the data into different parts. Otherwise, some techniques (for instance, clustering) can be used to separate the slices such that slices in the same part have identical shape.
- FIG. 5A and FIG. 5B show the extracted trend for each slice.
- the mean value (bold dot) of each slice is used as the trend.
- a Gaussian process can be applied to estimate the trend of the missing slices (the dotted line). In case there are multiple parts because the shape of slices is not consistent, this step is executed separately for each part.
- in step S304, at least one existing slice that has the most similar trend value can be found.
- when 2 existing slices, slice_a and slice_b, are used, the missing slice could be represented as:
- slice_a and slice_b are the two slices whose trend values are closest to that of slice_missing
- trend_a, trend_b and trend_missing are their trend values, respectively.
- The number of existing slices used to calculate the missing slice can be chosen according to different application requirements.
- a point on a slice is a dimension; here there are 24 dimensions in a slice, each representing a specific hour of the day.
- the percentage on the right side is the value of a point on a slice, ranging from 0 to 1.
- when each missing value is filled separately, we can see that the filled slices are no longer meaningful.
- Techniques other than linear regression can also be applied here; however, they have the same disadvantage.
- the missing slices are filled according to the present disclosure, which achieves a more reasonable result: a sharp increase can be found on May 1st, 2015, which is consistent with the trend in the same period in 2016 and 2017. Whereas in the middle of the figure, the missing slice increases bit by bit, and important change information is omitted with such a method.
- Data are collected at transformers and transferred into a data management system. After data processing and analysis, health reports and load-shift recommendations for transformers can be provided to customers. For many reasons, the collected data may be incomplete, so missing data filling methods are needed in the data processing and analysis part.
- a computer-readable medium is also provided in the present disclosure, storing computer-executable instructions which, upon execution by a computer, enable the computer to execute any of the methods presented in this disclosure.
- a computer program which, when executed by at least one processor, performs any of the methods presented in this disclosure.
Abstract
A method, apparatus, system and computer-readable medium for filling missing industrial longitudinal data are presented. In contrast to current linear regression or interpolation approaches, a slice is treated as a whole and the trend of slices over time is also considered, so that missing data can be filled in more meaningfully and in a way that reflects the real physical status.
Description
The present invention relates to techniques of industrial data processing, and more particularly to a method, apparatus and computer-readable storage medium for filling missing industrial longitudinal data.
Background Art
Industrial data is widely used in the industrial field for systems' and devices' status monitoring, predictive maintenance, etc. Some industrial data are time series data. For example, the load rate of a grid can be collected at separate time points and varies over the course of a day.
Furthermore, we can find that data from a grid might share a similar pattern between different days, which may indicate different working modes of power consumers. In such a case, time series data collected from a grid can be presented as longitudinal data where each slice is the observation of a corresponding day, representing the load rate of a grid, as shown in FIG. 1. Because of the periodic property of grid data, we can represent it in the form of longitudinal data. For example, each slice (or instance) is the daily running data and all the slices are arranged in chronological order. Here the missing data is defined as missing slices, as shown in FIG. 1, where several slices are missing from May 2015 to July 2015.
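As a minimal sketch of this representation, the Python below stores each daily slice under its calendar date and lists the days for which no slice exists. The 24-point slice length follows the hourly example given elsewhere in the description; the dictionary layout, the function name `find_missing_days` and the sample load rates are assumptions made for illustration, not part of the disclosure.

```python
from datetime import date, timedelta

HOURS_PER_SLICE = 24  # assumption: one reading per hour of the day


def find_missing_days(slices, start, end):
    """Return every day in [start, end] that has no slice, in chronological order."""
    missing = []
    day = start
    while day <= end:
        if slices.get(day) is None:
            missing.append(day)
        day += timedelta(days=1)
    return missing


# Hypothetical load rates: three observed days and a one-day gap.
slices = {
    date(2015, 4, 29): [0.30] * HOURS_PER_SLICE,
    date(2015, 4, 30): [0.32] * HOURS_PER_SLICE,
    date(2015, 5, 2): [0.45] * HOURS_PER_SLICE,
}
missing = find_missing_days(slices, date(2015, 4, 29), date(2015, 5, 2))
print(missing)
```

Keying slices by date keeps the chronological order explicit and makes a missing slice simply an absent (or `None`) entry.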
Missing data occur when no data value is stored or collected, and can have a significant effect on the conclusions drawn from the data. It is a common occurrence, and certainly not unusual in longitudinal data.
Various approaches have been proposed to fill missing data for longitudinal data. There are also general missing data filling methods available for time series data (not longitudinal data), for example interpolation. However, to the best of our knowledge, none of them deals with longitudinal data where each slice is a piece of time series data instead of general multi-dimensional features.
Summary of the Invention
In this disclosure, we propose solutions in the industrial field to fill missing data for longitudinal data wherein each slice is time series data. In contrast to current linear regression or interpolation approaches, a slice is treated as a whole and the trend of slices over time is also considered, so that missing data can be filled in more meaningfully and in a way that reflects the real physical status.
Embodiments of the present disclosure include methods and apparatuses for filling missing industrial longitudinal data.
According to a first aspect of the present disclosure, a method for filling missing industrial longitudinal data is presented. The method includes following steps: collecting industrial longitudinal data, wherein the industrial longitudinal data comprise missing slices, each slice corresponds to a collecting time point; estimating overall trend of all slices of the industrial longitudinal data along time; calculating trend value of each missing slice based on the overall trend; for each missing slice, finding at least one similar slice based on trend value; filling each missing slice based on the at least one similar slice.
According to a second aspect of the present disclosure, an apparatus for filling missing industrial longitudinal data is presented. The apparatus includes: a data collection module, configured to collect industrial longitudinal data, wherein the industrial longitudinal data comprise missing slices, each slice corresponds to a collecting time point; a data processing module, configured to: estimate overall trend of all slices of the industrial longitudinal data along time; calculate trend value of each missing slice based on the overall trend; for each missing slice, find at least one similar slice based on trend value; fill each missing slice based on the at least one similar slice.
According to a third aspect of the present disclosure, an apparatus for filling missing industrial longitudinal data is presented. The apparatus includes at least one memory and at least one processor, coupled to the at least one memory and configured to execute the method according to the first aspect.
According to a fourth aspect of the present disclosure, a computer-readable medium for filling missing industrial longitudinal data is presented. The computer-readable medium stores computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to the first aspect.
With the solutions provided in the present disclosure, the overall trend of the slices of longitudinal data is estimated, and each missing slice is filled as a whole according to that overall trend and to existing data with similar trend values. In comparison to currently used solutions, this presents a more reasonable filling solution, which can be widely used for periodic time series data.
Optionally, before estimating the overall trend of all slices of the industrial longitudinal data along time, each slice of the industrial longitudinal data can be normalized. It can then be determined whether all normalized slices of the industrial longitudinal data have identical shape, and if not, all slices of the industrial longitudinal data can be split into parts, wherein slices in each part have identical shape. For each part with missing slices, the overall trend of slices in the part can be estimated, the trend value of each missing slice can be calculated based on the overall trend of the part, at least one similar slice for each missing slice can be found in the part based on trend value, and each missing slice can be filled based on the at least one similar slice. To bring the filling result closer to the real status, slices with identical shape should first be found for reference. However, because of amplitude differences, slices with the same shape but significantly different amplitudes can be taken as having different shapes. In order to make more slices available for reference, the influence of different amplitudes should first be eliminated by normalization. If different shapes really exist along time after normalization, then, to ensure that the closest slices are referenced, slices with identical shape should be processed as a separate part. Existing slices can be selected from this part for filling the missing ones. With normalization and division by shape of slices, the filling result can be more accurate and closer to the real status. Methods of normalization can be customized based on customers' requirements.
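To illustrate why normalization removes the influence of amplitude, the sketch below rescales two slices that share a shape but differ in amplitude. Min-max scaling is an assumption chosen here for simplicity (the text leaves the normalization method open to customization), and `normalize_slice` and the sample values are hypothetical.

```python
def normalize_slice(values):
    """Min-max scale one slice to [0, 1]; a flat slice maps to all zeros."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Two hypothetical slices with the same shape; the second has half the amplitude.
high_day = [0.2, 0.2, 0.6, 0.8, 0.8, 0.4]
low_day = [0.1, 0.1, 0.3, 0.4, 0.4, 0.2]

norm_high = normalize_slice(high_day)
norm_low = normalize_slice(low_day)

# After normalization the two slices coincide up to floating-point rounding.
same_shape = all(abs(a - b) < 1e-9 for a, b in zip(norm_high, norm_low))
print(same_shape)
```

Before normalization the two slices are far apart point by point; afterwards only their shape is compared, which is what the subsequent shape check relies on.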
Optionally, if the shape differences between all normalized slices during a period of time are all smaller than a predefined threshold, it is determined that the slices during that period have identical shape.
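One way to express this threshold test is sketched below, using the Euclidean distance between normalized slices, which the description names as one possible measure of shape difference. The pairwise comparison strategy, the function name `have_identical_shape` and the threshold value are assumptions for illustration.

```python
import math
from itertools import combinations


def have_identical_shape(normalized_slices, threshold):
    """True if every pair of normalized slices differs by less than the threshold."""
    return all(
        math.dist(a, b) < threshold
        for a, b in combinations(normalized_slices, 2)
    )


# Hypothetical normalized slices: s1 and s2 are close, s3 is mirrored.
s1 = [0.0, 0.5, 1.0, 0.5]
s2 = [0.0, 0.55, 1.0, 0.45]
s3 = [1.0, 0.5, 0.0, 0.5]

print(have_identical_shape([s1, s2], threshold=0.2))
print(have_identical_shape([s1, s2, s3], threshold=0.2))
```

Comparing all pairs is quadratic in the number of slices; comparing each slice against a single reference slice would be a cheaper variant with slightly different behaviour.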
Optionally, the trend value is the mean of a slice.
The above mentioned attributes and other features and advantages of the present technique and the manner of attaining them will become more apparent and the present technique itself will be better understood by reference to the following description of embodiments of the present technique taken in conjunction with the accompanying drawings, wherein:
FIG. 1 depicts industrial longitudinal data and missing data.
FIG. 2 depicts a block diagram of an apparatus for filling missing industrial longitudinal data in accordance with one embodiment of the present disclosure.
FIG. 3A depicts a flow diagram of a method for filling missing industrial longitudinal data in accordance with one embodiment of the present disclosure.
FIG. 3B depicts a flow diagram of step S302.
FIG. 4 depicts normalized slices of data in FIG. 1.
FIG. 5A and FIG. 5B depict extracted trend from slices in FIG. 1.
FIG. 6 depicts filling results with solution provided in this disclosure.
Reference Numbers:
10, an apparatus for filling missing industrial longitudinal data
101, a data collecting module
102, a data processing module
103, at least one processor
104, at least one memory
105, a communication module
30, a data processing program
31, longitudinal data collected
300, a method for filling missing industrial longitudinal data
S301~S305, steps of method 300
S3021~S3023, sub steps of S302
Detailed Description of Example Embodiments
Hereinafter, above-mentioned and other features of the present technique are described in detail. Various embodiments are described with reference to the drawing, where like reference numerals are used to refer to like elements throughout. In the following description, for purpose of explanation, numerous specific details are set forth in order to provide a thorough understanding of one or more embodiments. It may be noted that the illustrated embodiments are intended to explain, and not to limit the invention. It may be evident that such embodiments may be practiced without these specific details.
When introducing elements of various embodiments of the present disclosure, the articles “a”, “an”, “the” and “said” are intended to mean that there are one or more of the elements. The terms “comprising”, “including” and “having” are intended to be inclusive and mean that there may be additional elements other than the listed elements.
Now the present disclosure will be described hereinafter in detail by referring to FIG. 2 to FIG. 6.
FIG. 2 depicts a block diagram of an apparatus in accordance with one embodiment of the present disclosure. The apparatus 10 for filling missing industrial longitudinal data presented in the present disclosure can be implemented as a network of computer processors, to execute the following method 300 for filling missing industrial longitudinal data presented in the present disclosure. The apparatus 10 can also be a single computer, as shown in FIG. 2, including at least one memory 104, which includes a computer-readable medium, such as a random access memory (RAM). The apparatus 10 also includes at least one processor 103, coupled with the at least one memory 104. Computer-executable instructions are stored in the at least one memory 104 and, when executed by the at least one processor 103, can cause the at least one processor 103 to perform the steps described herein. The at least one processor 103 may include a microprocessor, an application specific integrated circuit (ASIC), a digital signal processor (DSP), a central processing unit (CPU), a graphics processing unit (GPU), state machines, etc. Embodiments of computer-readable medium include, but are not limited to, a floppy disk, CD-ROM, magnetic disk, memory chip, ROM, RAM, an ASIC, a configured processor, all optical media, all magnetic tape or other magnetic media, or any other medium from which a computer processor can read instructions. Also, various other forms of computer-readable medium may transmit or carry instructions to a computer, including a router, private or public network, or other transmission device or channel, both wired and wireless. The instructions may include code from any computer-programming language, including, for example, C, C++, C#, Visual Basic, Java, and JavaScript.
The at least one memory 104 shown in FIG. 2 can contain a data processing program 30 which, when executed by the at least one processor 103, causes the at least one processor 103 to execute the method 300 for filling missing industrial longitudinal data presented in the present disclosure. Longitudinal data 31 can also be stored in the at least one memory 104. These data can be received via a communication module 105 of the apparatus 10.
The data processing program 30 can include:
- a data collection module 101, configured to collect industrial longitudinal data 31;
- a data processing module 102, configured to process the collected industrial longitudinal data 31.
Here, the industrial longitudinal data 31 include missing slices, and each slice corresponds to a collecting time point.
In detail, the data processing module 102 is configured to:
- estimate overall trend of all slices of the industrial longitudinal data 31 along time;
- calculate trend value of each missing slice based on the overall trend;
- for each missing slice, find at least one similar slice based on trend value;
- fill each missing slice based on the at least one similar slice.
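The first two operations of the data processing module (estimating the overall trend and deriving a trend value for each missing slice) can be sketched as follows. The trend value of an observed slice is taken as its mean, as the disclosure suggests. The description names polynomial curve fitting and Gaussian processes as candidate estimators; the straight-line least-squares fit used here is a deliberately simplified stand-in, and `fit_line`, `estimate_missing_trends` and the sample data are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (needs two distinct xs)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def estimate_missing_trends(slices):
    """slices: time-ordered list; each entry is a list of values, or None if missing.

    Returns {index: estimated trend value} for the missing entries.
    """
    # Trend value of an observed slice = its mean.
    known = [(i, mean(s)) for i, s in enumerate(slices) if s is not None]
    slope, intercept = fit_line([i for i, _ in known], [t for _, t in known])
    return {i: slope * i + intercept for i, s in enumerate(slices) if s is None}


# Hypothetical two-point slices whose means rise by 0.1 per step.
slices = [[0.1, 0.3], [0.2, 0.4], None, [0.4, 0.6]]
trends = estimate_missing_trends(slices)
print(trends)
```

A straight line cannot follow seasonal swings, so a higher-order fit or a Gaussian process would normally replace `fit_line` on real data.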
Optionally, the data processing module 102 is further configured to: before estimating the overall trend of all slices of the industrial longitudinal data along time, normalize each slice of the industrial longitudinal data; when estimating the overall trend of all slices of the industrial longitudinal data along time, determine whether all normalized slices of the industrial longitudinal data have identical shape, and if not, split all slices of the industrial longitudinal data into parts wherein slices in each part have identical shape, and for each part with missing slices, estimate the overall trend of slices in the part; when calculating the trend value of each missing slice based on the overall trend, for each part with missing slices, calculate the trend value of each missing slice based on the overall trend of the part; when finding at least one similar slice for each missing slice based on trend value, for each part with missing slices, find in the part at least one similar slice for each missing slice based on trend value; and when filling each missing slice based on the at least one similar slice, for each part with missing slices, fill each missing slice based on the at least one similar slice.
Optionally, the data processing module 102 is further configured to, when determining whether all normalized slices of the industrial longitudinal data have identical shape, if shape differences between all normalized slices during a period of time are all smaller than a predefined threshold, determine that the slices during the period of time have identical shape.
Optionally, the trend value is the mean of a slice.
Details of data processing by the data processing module 102 will be described later in reference to FIG. 3A and FIG. 3B.
Although the data collecting module 101 and the data processing module 102 are described above as software modules of the data processing program 30, they can also be implemented via hardware, such as ASIC chips. They can be integrated into one chip, or separately implemented and electrically connected.
It should be mentioned that the present disclosure may include apparatuses having different architecture than shown in FIG. 2. The architecture above is merely exemplary and used to explain the exemplary method 300 shown in FIG. 3A and FIG. 3B.
Various methods in accordance with the present disclosure may be carried out. One exemplary method 300 according to the present disclosure includes following steps:
S301: collecting industrial longitudinal data, wherein the industrial longitudinal data include missing slices, each slice corresponds to a collecting time point;
S302: estimating overall trend of all slices of the industrial longitudinal data along time;
S303: calculating trend value of each missing slice based on the overall trend;
S304: for each missing slice, finding at least one similar slice based on trend value;
S305: filling each missing slice based on the at least one similar slice.
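As a minimal sketch of the input to method 300, the longitudinal data can be held as a time-ordered list of slices, with `None` marking a slice that was not collected. The container layout and the names `Slice`, `missing_indices` and `existing_indices` are illustrative assumptions, not part of the disclosure.

```python
from typing import List, Optional

# One slice = the values collected at one time point (for example, the hourly
# load rates of one day). None marks a slice that was not collected.
Slice = Optional[List[float]]


def missing_indices(slices: List[Slice]) -> List[int]:
    """Return the positions along the time axis of the missing slices."""
    return [i for i, s in enumerate(slices) if s is None]


def existing_indices(slices: List[Slice]) -> List[int]:
    """Return the positions of the slices that were actually collected."""
    return [i for i, s in enumerate(slices) if s is not None]


# Toy data (hypothetical): four time points, three values per slice,
# one missing slice at position 2.
data: List[Slice] = [[0.2, 0.5, 0.3], [0.3, 0.6, 0.4], None, [0.5, 0.8, 0.6]]
```

Steps S302 to S305 then operate on the positions returned by `missing_indices`, using the slices at `existing_indices` as the source of trend and shape information.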
Optionally, before the step S302 of estimating the overall trend of all slices of the industrial longitudinal data along time, the method 300 can further include:
S301’: normalizing each slice of the industrial longitudinal data. Then, referring to FIG. 3B, the step S302 can include the following sub steps:
S3021: determining whether all normalized slices of the industrial longitudinal data have identical shape, and if not, proceeding to sub step S3022;
S3022: splitting all slices of the industrial longitudinal data into parts, wherein slices in each part have identical shape; and
S3023: for each part with missing slices, estimating overall trend of slices in the part.
Then the step S303 can include: for each part with missing slices, calculating trend value of each missing slice based on the overall trend of the part; the step S304 can include: for each part with missing slices, finding in the part at least one similar slice for each missing slice based on trend value; and the step S305 can include: for each part with missing slices, filling each missing slice based on the at least one similar slice.
Optionally, in the sub step S3021, if shape differences between all normalized slices during a period of time are all smaller than a predefined threshold, it can be determined that the slices during the period of time have identical shape.
Next, taking data from a grid shown in FIG. 1 as an example, an exemplary embodiment is described.
As shown in FIG. 1, grid power data in the form of longitudinal data were collected from January 2015 to June 2017. Several slices are missing from May 2015 to June 2015. With the solution provided in the present disclosure, the missing slices can be filled with reference to the data trend along the time axis and to the existing slices.
The basic idea is to fill missing data with the help of the overall trend and other slices. Optionally, to eliminate the influence of amplitude differences on shape judgement, normalization can be executed on the collected data in the step S301’; otherwise, slices that differ only in amplitude might be considered as having different shapes. Referring to FIG. 1, the amplitudes of load rates in winter (around January and February) are significantly lower than in summer (around July, August and September of the same year), while the shapes of the slices are consistent (from 0 o’clock to 24 o’clock: first staying low, then ramping up, staying flat and falling).
Then in the sub step S3021, the normalized slices can be compared by shape. Euclidean distance can be used to measure the difference between slices’ shapes. If the shape differences of the normalized slices along time are greater than a predefined threshold, then in the sub step S3022 all slices can be split into parts, in each of which all the slices share a similar shape. Then in the sub step S3023, for each part (or for all the slices if there is no obvious shape difference among them), the overall trend of the slices along time can be calculated. Optionally, the mean value of each slice can be calculated as the trend value of the slice.
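The shape check of sub steps S3021 and S3022 can be sketched as follows, assuming the slices are already normalized and of equal length. Euclidean distance is the measure named in the description; the grouping rule (compare each slice with the first slice of the current part), the function names and the threshold value are illustrative assumptions.

```python
import math
from typing import List


def shape_distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two normalized slices of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def split_by_shape(slices: List[List[float]], threshold: float) -> List[List[int]]:
    """Split time-ordered slices into contiguous parts of similar shape.

    A new part starts whenever a slice differs from the first slice of the
    current part by more than `threshold` (hypothetical grouping rule).
    Returns one list of slice indices per part.
    """
    parts: List[List[int]] = []
    for i, s in enumerate(slices):
        if parts and shape_distance(slices[parts[-1][0]], s) <= threshold:
            parts[-1].append(i)
        else:
            parts.append([i])
    return parts


normalized = [
    [0.0, 1.0, 0.5],  # peak in the middle
    [0.0, 1.0, 0.6],  # nearly the same shape
    [1.0, 0.0, 0.5],  # inverted shape
]
parts = split_by_shape(normalized, threshold=0.3)
```

If the function returns a single part, all slices are treated as having identical shape and sub step S3022 is skipped.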
Next, in the step S303, several techniques can be applied to estimate the trend values of the missing slices in each part, for instance polynomial curve fitting or a Gaussian process. With the estimated trend values for the missing slices, in the steps S304 and S305 the missing slices can be filled in based on other slices that have similar trend values.
In the step S301’, for each slice, we can use feature scaling as the normalization method:

$$x_{\text{normalized}} = \frac{x - \min(x)}{\max(x) - \min(x)}$$

where $x$ is the original slice and $x_{\text{normalized}}$ is the normalized slice. All the normalized slices of data are shown in FIG. 4. Each curve in FIG. 4 can be regarded as the shape of the corresponding slice. From FIG. 4 we can see that almost all the slices share the same shape, so it is unnecessary to cut the data into different parts. Otherwise, some technique (for instance, clustering) can be used to separate the slices such that slices in the same part have identical shape.
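A minimal sketch of the normalization in step S301’, assuming that feature scaling here means min-max scaling of each slice to the range 0 to 1. The function name and the handling of constant slices are assumptions.

```python
from typing import List


def normalize_slice(x: List[float]) -> List[float]:
    """Min-max scale one slice to [0, 1] so that only its shape remains."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # A flat slice carries no shape information; map it to zeros (assumption).
        return [0.0 for _ in x]
    return [(v - lo) / (hi - lo) for v in x]


# Hypothetical slices: same shape, different amplitude.
winter = [10.0, 20.0, 40.0, 30.0]
summer = [30.0, 60.0, 120.0, 90.0]
```

After scaling, `winter` and `summer` map to the same normalized slice, which is why slices that differ only in amplitude are no longer judged to have different shapes.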
FIG. 5A and FIG. 5B show the extracted trend for each slice. Here the mean value (bold dot) of each slice is used as the trend value. In the steps S302 and S303, referring to FIG. 5B, a Gaussian process can be applied to estimate the trend values of the missing slices (the dotted line). If there are multiple parts because the shape of the slices is not consistent, this step is executed separately for each part.
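The trend estimation of steps S302 and S303 can be sketched as below, taking the mean of each slice as its trend value. For brevity, this sketch fits a straight line (a degree-1 polynomial, in the spirit of the polynomial curve fitting mentioned in the description) by least squares rather than the Gaussian process used for FIG. 5B. That substitution, the use of the slice index as the time coordinate, and all function names are assumptions.

```python
from typing import List, Optional, Tuple


def trend_values(slices: List[Optional[List[float]]]) -> List[Optional[float]]:
    """Mean of each existing slice; None where the slice is missing."""
    return [sum(s) / len(s) if s is not None else None for s in slices]


def fit_line(ts: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y = slope * t + intercept.

    Needs at least two distinct time points.
    """
    n = len(ts)
    t_mean, y_mean = sum(ts) / n, sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    slope = sxy / sxx
    return slope, y_mean - slope * t_mean


def estimate_missing_trends(slices: List[Optional[List[float]]]) -> List[float]:
    """Return a trend value for every time point, estimating the missing ones."""
    trends = trend_values(slices)
    known = [(float(i), v) for i, v in enumerate(trends) if v is not None]
    slope, intercept = fit_line([t for t, _ in known], [v for _, v in known])
    return [v if v is not None else slope * i + intercept
            for i, v in enumerate(trends)]


# Hypothetical data: the slice at position 2 is missing.
data = [[1.0, 3.0], [2.0, 4.0], None, [4.0, 6.0]]
filled_trends = estimate_missing_trends(data)
```

A straight line is only adequate when the trend is close to linear over the part being processed; a higher-degree polynomial or a Gaussian process would follow seasonal curvature more closely.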
In the step S304, at least one existing slice with the most similar trend value can be found. Here, two existing slices, $\text{slice}_a$ and $\text{slice}_b$, are used, and the missing slice can be represented as:

where $\text{slice}_a$ and $\text{slice}_b$ are the two slices whose trend values are closest to that of $\text{slice}_{\text{missing}}$, and $\text{trend}_a$, $\text{trend}_b$ and $\text{trend}_{\text{missing}}$ are their respective trend values. The number of existing slices used to calculate the missing slice can be chosen according to different application requirements.
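As one illustrative way to combine two similar slices in steps S304 and S305, the sketch below selects the two existing slices whose trend values are closest to the estimated trend of the missing slice and interpolates between them point by point, weighted by trend value. The interpolation rule and all names are assumptions and are not necessarily the formula of the disclosure.

```python
from typing import List, Tuple


def two_closest(trend_missing: float, trends: List[float]) -> Tuple[int, int]:
    """Indices of the two existing slices whose trends are nearest to trend_missing."""
    order = sorted(range(len(trends)), key=lambda i: abs(trends[i] - trend_missing))
    return order[0], order[1]


def fill_slice(slice_a: List[float], trend_a: float,
               slice_b: List[float], trend_b: float,
               trend_missing: float) -> List[float]:
    """Interpolate point by point between slice_a and slice_b by trend value."""
    if trend_a == trend_b:
        w = 0.5  # identical trends: plain average (assumption)
    else:
        w = (trend_missing - trend_a) / (trend_b - trend_a)
    return [a + w * (b - a) for a, b in zip(slice_a, slice_b)]


# Hypothetical existing slices and their trend values (means).
existing = [[1.0, 3.0], [2.0, 4.0], [4.0, 6.0]]
trends = [2.0, 3.0, 5.0]
ia, ib = two_closest(4.0, trends)  # estimated trend of the missing slice: 4.0
filled = fill_slice(existing[ia], trends[ia], existing[ib], trends[ib], 4.0)
```

When the trend value is the mean of a slice, the slice produced by this rule has a mean equal to the estimated trend value. If both selected slices lie on the same side of the estimated trend, the weight falls outside 0 to 1 and the rule extrapolates, which may need to be bounded in practice.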
Filling results of the present disclosure and of another method are compared in FIG. 6. In the middle of the figure, linear regression is used for filling the missing data. Each point on a slice is a dimension; here there are 24 dimensions in a slice, each representing a specific hour of the day. The percentage on the right side is the value of a point on a slice, ranging from 0 to 1. For every dimension of each slice, the missing value is filled separately, and we can see that the filled slices are no longer meaningful. Techniques other than linear regression can also be applied here; however, they have the same disadvantage. At the bottom of the figure, the missing slices are filled according to the present disclosure, which achieves a more reasonable result: a sharp increase can be found on May 1st, 2015, which is consistent with the trend in the same period in 2016 and 2017. In the middle of the figure, by contrast, the missing slices increase bit by bit, so important change information is lost with such a method.
The following are two use cases in which the solution provided in the present disclosure can be adopted.
Use Case 1: Condition Assessment Manager for Transformers
Data are collected at transformers and transferred into a data management system. After data processing and analysis, health reports and load-shift recommendations for the transformers can be provided to customers. For many reasons, the collected data may be incomplete, so missing-data filling methods are needed in the data processing and analysis part.
Use Case 2: Distributed Energy System
There are various applications under the topic of distributed energy systems, for instance load balancing, peak avoidance, theft avoidance and so on. All these applications are based on continuous monitoring of the related devices and have a low tolerance for missing data, making filling methods an indispensable part of data processing.
A computer-readable medium is also provided in the present disclosure, storing computer-executable instructions which, upon execution by a computer, enable the computer to execute any of the methods presented in this disclosure.
A computer program is also provided which, when executed by at least one processor, performs any of the methods presented in this disclosure.
While the present technique has been described in detail with reference to certain embodiments, it should be appreciated that the present technique is not limited to those precise embodiments. Rather, in view of the present disclosure, which describes exemplary modes for practicing the invention, many modifications and variations would present themselves to those skilled in the art without departing from the scope and spirit of this invention. The scope of the invention is, therefore, indicated by the following claims rather than by the foregoing description. All changes, modifications, and variations coming within the meaning and range of equivalency of the claims are to be considered within their scope.
Claims (10)
1. A method (300) for filling missing industrial longitudinal data, comprising:
   - collecting (S301) industrial longitudinal data, wherein the industrial longitudinal data comprise missing slices, each slice corresponds to a collecting time point;
   - estimating (S302) overall trend of all slices of the industrial longitudinal data along time;
   - calculating (S303) trend value of each missing slice based on the overall trend;
   - for each missing slice, finding (S304) at least one similar slice based on trend value;
   - filling (S305) each missing slice based on the at least one similar slice.
2. The method (300) according to claim 1, wherein,
   - before estimating (S302) overall trend of all slices of the industrial longitudinal data along time, the method further comprises: normalizing (S301’) each slice of the industrial longitudinal data;
   - estimating (S302) overall trend of all slices of the industrial longitudinal data along time, comprises:
     - determining (S3021) whether all normalized slices of the industrial longitudinal data have identical shape, and if not,
     - splitting (S3022) all slices of the industrial longitudinal data into parts, wherein slices in each part have identical shape;
     - for each part with missing slices, estimating (S3023) overall trend of slices in the part;
   - calculating (S303) trend value of each missing slice based on the overall trend, comprises: for each part with missing slices, calculating (S303) trend value of each missing slice based on the overall trend of the part;
   - for each missing slice, finding (S304) at least one similar slice based on trend value, comprises: for each part with missing slices, finding (S304) in the part at least one similar slice for each missing slice based on trend value;
   - filling (S305) each missing slice based on the at least one similar slice, comprises: for each part with missing slices, filling (S305) each missing slice based on the at least one similar slice.
3. The method (300) according to claim 2, wherein determining (S3021) whether all normalized slices of the industrial longitudinal data have identical shape, comprises: if shape differences between all normalized slices during a period of time are all smaller than a predefined threshold, determining (S3021) that the slices during the period of time have identical shape.
4. The method (300) according to claim 1, wherein trend value is mean of a slice.
5. An apparatus (10) for filling missing industrial longitudinal data, comprising:
   - a data collection module (101), configured to collect industrial longitudinal data, wherein the industrial longitudinal data comprise missing slices, each slice corresponds to a collecting time point;
   - a data processing module (102), configured to:
     - estimate overall trend of all slices of the industrial longitudinal data along time;
     - calculate trend value of each missing slice based on the overall trend;
     - for each missing slice, find at least one similar slice based on trend value;
     - fill each missing slice based on the at least one similar slice.
6. The apparatus (10) according to claim 5, wherein,
   - the data processing module (102) is further configured to: before estimating overall trend of all slices of the industrial longitudinal data along time, normalize each slice of the industrial longitudinal data;
   - the data processing module (102) is further configured to, when estimating overall trend of all slices of the industrial longitudinal data along time:
     - determine whether all normalized slices of the industrial longitudinal data have identical shape, and if not,
     - split all slices of the industrial longitudinal data into parts, wherein slices in each part have identical shape;
     - for each part with missing slices, estimate overall trend of slices in the part;
   - the data processing module (102) is further configured to, when calculating trend value of each missing slice based on the overall trend: for each part with missing slices, calculate trend value of each missing slice based on the overall trend of the part;
   - the data processing module (102) is further configured to, when for each missing slice, finding at least one similar slice based on trend value: for each part with missing slices, find in the part at least one similar slice for each missing slice based on trend value;
   - the data processing module (102) is further configured to, when filling each missing slice based on the at least one similar slice: for each part with missing slices, filling each missing slice based on the at least one similar slice.
7. The apparatus (10) according to claim 6, wherein the data processing module (102) is further configured to, when determining whether all normalized slices of the industrial longitudinal data have identical shape: if shape differences between all normalized slices during a period of time are all smaller than a predefined threshold, determine that the slices during the period of time have identical shape.
8. The apparatus (10) according to claim 5, wherein trend value is mean of a slice.
9. An apparatus (10) for filling missing industrial longitudinal data, comprising:
   - at least one processor (103);
   - at least one memory (104), coupled to the at least one processor (103), configured to execute method according to any of claims 1~4.
10. A computer-readable medium for filling missing industrial longitudinal data, storing computer-executable instructions, wherein the computer-executable instructions when executed cause at least one processor to execute method according to any of claims 1~4.
Priority Applications (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2020/076273 WO2021164028A1 (en) | 2020-02-21 | 2020-02-21 | Method and apparatus for filling missing industrial longitudinal data |
| CN202080097170.XA CN115151900A (en) | 2020-02-21 | 2020-02-21 | Method and apparatus for filling missing industrial longitudinal data |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2020/076273 WO2021164028A1 (en) | 2020-02-21 | 2020-02-21 | Method and apparatus for filling missing industrial longitudinal data |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| WO2021164028A1 (en) | 2021-08-26 |
Family
ID=77391429
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/CN2020/076273 (Ceased), WO2021164028A1 (en) | 2020-02-21 | 2020-02-21 | Method and apparatus for filling missing industrial longitudinal data |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN115151900A (en) |
| WO (1) | WO2021164028A1 (en) |
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8941652B1 (en) * | 2012-05-23 | 2015-01-27 | Google Inc. | Incremental surface hole filling |
| CN106844781A (en) * | 2017-03-10 | 2017-06-13 | 广州视源电子科技股份有限公司 | Data processing method and device |
| CN109460775A (en) * | 2018-09-20 | 2019-03-12 | 国家计算机网络与信息安全管理中心 | A kind of data filling method and device based on comentropy |
| CN109947812A (en) * | 2018-07-09 | 2019-06-28 | 平安科技(深圳)有限公司 | Consecutive miss value fill method, data analysis set-up, terminal and storage medium |
| US20190378022A1 (en) * | 2018-06-11 | 2019-12-12 | Oracle International Corporation | Missing value imputation technique to facilitate prognostic analysis of time-series sensor data |
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CA2575310C (en) * | 2004-07-28 | 2014-11-04 | Ims Health Incorporated | A method for linking de-identified patients using encrypted and unencrypted demographic and healthcare information from multiple data sources |
| US8788291B2 (en) * | 2012-02-23 | 2014-07-22 | Robert Bosch Gmbh | System and method for estimation of missing data in a multivariate longitudinal setup |
| WO2017044082A1 (en) * | 2015-09-09 | 2017-03-16 | Intel Corporation | Separated application security management |
- 2020-02-21: WO application PCT/CN2020/076273, published as WO2021164028A1 (en), not active (Ceased)
- 2020-02-21: CN application CN202080097170.XA, published as CN115151900A (en), active (Pending)
Also Published As
| Publication number | Publication date |
|---|---|
| CN115151900A (en) | 2022-10-04 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 20920582; Country of ref document: EP; Kind code of ref document: A1 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | 122 | Ep: pct application non-entry in european phase | Ref document number: 20920582; Country of ref document: EP; Kind code of ref document: A1 |