CN111771195A - Stream processing device and data stream processing method - Google Patents
Stream processing device and data stream processing method Download PDFInfo
- Publication number
- CN111771195A CN111771195A CN201880090313.7A CN201880090313A CN111771195A CN 111771195 A CN111771195 A CN 111771195A CN 201880090313 A CN201880090313 A CN 201880090313A CN 111771195 A CN111771195 A CN 111771195A
- Authority
- CN
- China
- Prior art keywords
- stream
- order
- processor
- representation
- window
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24553—Query execution of query operations
- G06F16/24554—Unary operations; Data partitioning operations
- G06F16/24556—Aggregation; Duplicate elimination
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24564—Applying rules; Deductive queries
- G06F16/24565—Triggers; Constraints
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
- G06F16/24568—Data stream processing; Continuous queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2474—Sequence data queries, e.g. querying versioned data
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Mathematical Physics (AREA)
- Fuzzy Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
本发明提供一种流处理设备100,其具有处理器110,用于处理流窗口130中的数据流120。处理器110用于从流窗口130的内容生成高阶表示140,并且处理器110用于处理高阶表示140。
The present invention provides a stream processing device 100 having a processor 110 for processing a data stream 120 in a stream window 130. The processor 110 is used to generate a higher-order representation 140 from the contents of the stream window 130, and the processor 110 is used to process the higher-order representation 140.
Description
技术领域technical field
本发明涉及一种流处理设备和数据流处理方法。The present invention relates to a stream processing device and a data stream processing method.
背景技术Background technique
本发明针对大数据分布式流处理领域。流是事件序列,即包含各种数据的元组,由诸如传感器、机器或人类等各种来源按时间顺序生成。流处理范式涉及针对流中的事件应用分析函数。一个典型的流处理方法中,假设给定时间在特定边界内累积此类事件,并针对所得到的集合应用分析函数。这种瞬态事件集合称为窗口。The present invention is directed to the field of big data distributed stream processing. Streams are sequences of events, i.e. tuples containing various data, generated in chronological order by various sources such as sensors, machines or humans. The stream processing paradigm involves applying analytical functions to events in a stream. A typical approach to stream processing assumes that such events are accumulated within certain bounds at a given time, and an analytical function is applied to the resulting collection. This collection of transient events is called a window.
流处理引擎提供即时处理事件(当事件进入系统时)的工具。在数据采集技术方面,流处理引擎能够支持从流源实时到达的数据,以及加载预存储在存储介质中的数据。数据通常称为事件,表示不同数据片段的聚合,可能具有不同的逻辑意义。这些数据在系统中按一定的顺序生成和接收。我们可以考虑与每个事件相关联的至少三个时间概念(或者在没有时间源的情况下至少考虑顺序这一概念):到达时间是将事件通知给处理引擎(或缓冲队列)的时间;事件时间是实际生成事件的时间;处理时间是系统实际处理事件的时间。通常,处理可以按固定时间间隔(例如基于挂钟时间水印的概念)触发,或者由每个事件的到达触发。处理的逻辑通常由特定的触发函数处理。Stream processing engines provide tools to process events on the fly (as they enter the system). In terms of data acquisition technology, stream processing engines can support data arriving in real-time from stream sources, as well as loading data pre-stored in storage media. Data is often referred to as an event, which represents an aggregation of different pieces of data that may have different logical meanings. These data are generated and received in a certain order in the system. We can consider at least three concepts of time associated with each event (or at least the concept of order in the absence of a time source): arrival time is the time at which the processing engine (or buffer queue) is notified of the event; the event Time is when the event is actually generated; processing time is when the system actually processes the event. In general, processing can be triggered at fixed time intervals (eg based on the concept of wall clock time watermarks), or by the arrival of each event. The processing logic is usually handled by a specific trigger function.
应用于数据流的大部分函数都需要在任意给定时刻从流中获取一个整体处理后事件的子集。也就是说,函数应用在窗口上,其中窗口是关于时间或事件的逻辑顺序的定界,该定界包含给定边界内(例如当前时间前2小时)的事件。这些窗口的内容随着新事件的到来和旧事件超出窗口边界并删除而随时间变化。通常,待应用的窗口和处理函数指定在机器上执行。然而,由于数据大小可以快速增长,特别是在大窗口边界的情况下,设计一种针对此类窗口执行模式匹配的有效方法引起了极大的兴趣。然而,窗口流算子需要大量的资源来迭代计算大窗口上的模式匹配验证。窗口算子将所有事件保存在内存中,在每个触发时刻,所有元素都被(重新)迭代地处理以计算窗口函数,如图1所示。值得注意的是,针对大窗口的模式匹配计算既需要保持内存中的大状态,也需要对包含数百万事件的窗口重新进行迭代计算,这使得很难满足(近似)实时性要求。因此,大数据应用无法获得高效、通用、低时延、低计算时间的流处理。Most functions applied to data streams require a subset of the overall processed events from the stream at any given time. That is, the function is applied on a window, where a window is a delimitation about a time or logical sequence of events that contains events within a given boundary (eg, 2 hours before the current time). The contents of these windows change over time as new events arrive and old events go beyond the window boundaries and are deleted. Typically, the window and handler functions to be applied are specified to be executed on the machine. However, since the data size can grow rapidly, especially in the case of large window boundaries, there is great interest in designing an efficient method to perform pattern matching for such windows. However, windowed stream operators require significant resources to iteratively compute pattern matching verification over large windows. The window operator keeps all events in memory, and at each trigger moment, all elements are (re)iteratively processed to compute the window function, as shown in Figure 1. Notably, pattern matching computations for large windows require both keeping a large state in memory and re-iterative computations for windows containing millions of events, making it difficult to meet (approximately) real-time requirements. Therefore, big data applications cannot obtain efficient, general-purpose, low-latency, and low-computing time stream processing.
主要问题在于,现有的流技术没有提供默认方案来在事件窗口上实现极低时延的复杂事件处理和模式匹配。这意味着,在保持实时系统时序约束的同时,能够描述和执行事件窗口的模式匹配计算。模式的描述和分析成本高昂、效率低下,而且用规则或用户定义函数来表示很复杂。然而,存在方便描述模式和执行模式匹配的高阶表示,但是现阶段它们在流引擎上不可用,并且它们不能满足实时性要求,因为它们通常在温或冷数据上执行。The main problem is that existing streaming technologies do not provide default solutions to achieve extremely low-latency complex event processing and pattern matching on event windows. This means that pattern matching computations for event windows can be described and executed while maintaining real-time system timing constraints. Pattern description and analysis are expensive, inefficient, and complex to express in rules or user-defined functions. However, there exist higher-order representations that are convenient for describing patterns and performing pattern matching, but at this stage they are not available on streaming engines, and they cannot meet real-time requirements because they are usually performed on warm or cold data.
在线复杂事件处理和模式匹配是一个挑战,因为它可能需要重新迭代处理可能包含数百万个事件的大窗口,并在流进行的同时执行模式匹配。当根据可用的流算子(例如用户定义函数(User-Defined Function,UDF))或包含长语句列表的规则来表示时,模式匹配计算必须在滑动的事件窗口(即与流同时进行并可能在连续实例之间共享事件)上重新计算。然而,数据密集型应用导致窗口具有数百万个元素,例如在过去6个月中,一个国家的信用卡交易量,从而增加了计算开销。这导致性能下降,因为资源和计算开销随着要聚合的元素数量呈线性增长。另一相关的方面在于,包含很多语句的复杂模式很难获取,并且很难在专门定义的函数中既作为规则又作为处理语句表示。事实上,当前工具在复杂事件处理(complex event processing,CEP)模式匹配中仅支持少量表示为正则表达式的语句。Online complex event processing and pattern matching is a challenge because it may require re-iterating processing large windows that may contain millions of events and performing pattern matching while the stream is in progress. When represented in terms of available stream operators (such as User-Defined Functions (UDFs)) or rules containing long lists of statements, pattern matching computations must be performed in a sliding event window (i.e. concurrently with the stream and possibly in Shared events between consecutive instances) are recomputed. However, data-intensive applications result in windows with millions of elements, such as the number of credit card transactions in a country over the past 6 months, increasing computational overhead. This leads to performance degradation as resource and computational overhead grows linearly with the number of elements to be aggregated. Another related aspect is that complex patterns containing many statements are difficult to obtain and represent both as rules and processing statements in specially defined functions. In fact, current tools support only a small number of statements expressed as regular expressions in complex event processing (CEP) pattern matching.
本发明与数据管理和处理的一些领域相重合。如下文所述,这些技术中没有任何技术提供或构成针对本发明所解决的具体问题的方案。现有的流引擎和相关机制侧重于使用需要具有专用窗口函数的窗口算子的功能,并将所有事件保存在窗口缓存中。处理函数可以用于某些类型的函数,但需要随着窗口状态变化针对不同的函数重新进行计算。当扩展到更大的事件集、高频流和长/大窗口时,这明显影响实时性约束和资源使用。缺乏支持复杂事件和模式匹配的机制、流算子或方案:(1)依赖于二阶数据模型;(2)对此类高阶数据模型启用联机操作;(3)以极低的时延运行。本发明最具代表性的技术和概念涉及流处理。The present invention overlaps with some areas of data management and processing. As discussed below, none of these techniques provides or constitutes a solution to the specific problem addressed by the present invention. Existing streaming engines and related mechanisms focus on using features that require window operators with dedicated window functions and keep all events in a window cache. Handler functions can be used for some types of functions, but need to be recalculated for different functions as the window state changes. This significantly impacts real-time constraints and resource usage when scaling to larger event sets, high-frequency streams, and long/large windows. Lack of mechanisms, stream operators or schemes to support complex events and pattern matching: (1) rely on second-order data models; (2) enable online operations on such higher-order data models; (3) operate with extremely low latency . The most representative techniques and concepts of the present invention relate to stream processing.
流引擎(例如Spark Streaming、Flink、Storm、Samza和Dataflow)是本文讨论的主要流技术。流引擎的作用是处理动态数据,即处理移动中的数据。它们提供基于流的时间排序的计算能力。根据特定的引擎,可以进一步设置时间指的是事件时间、处理时间、计算机时间或事件的到达时间。大部分流引擎都允许以某种形式对窗口中的事件进行分组。根据流引擎的应用程序接口(API),存在不同的灵活性级别来定义和驱动窗口中的计算。主要的局限性在于窗口算子使用用户定义函数,因此它们不会基于函数属性进行优化。此外,所有窗口通常将窗口范围内的所有数据保存在内存中,即使窗口函数只使用其中一部分数据。Streaming engines such as Spark Streaming, Flink, Storm, Samza, and Dataflow are the main streaming technologies discussed in this article. The role of the streaming engine is to process dynamic data, that is, to process data in motion. They provide computational power for stream-based temporal ordering. Depending on the specific engine, time can be further set to refer to event time, processing time, computer time, or arrival time of the event. Most streaming engines allow some form of grouping of events in a window. Depending on the application programming interface (API) of the streaming engine, there are different levels of flexibility to define and drive computations in the window. The main limitation is that window operators use user-defined functions, so they are not optimized based on function properties. Also, all windows typically keep all data within the window's extent in memory, even if the window function only uses a portion of that data.
大部分流引擎都支持结构化查询语言(Structured Query Language,SQL)查询表达式形式的CEP。Spark和Ignite支持批处理模式的时间序列分析,不支持流处理模式的时间序列分析。Druid系统设计用于快速注入和时间序列分析,具有高吞吐量和低时延,但是不处理流处理问题,并且需要通过抽取转换与加载(Extract,Transform,and Load,ETL)来正确输入数据。Druid用作后续执行查询的流处理下沉节点,因此不在实时流水线之内。Most streaming engines support CEP in the form of Structured Query Language (SQL) query expressions. Spark and Ignite support time series analysis in batch mode, but not in stream processing mode. The Druid system is designed for fast injection and time series analysis with high throughput and low latency, but does not handle stream processing issues and requires Extract, Transform, and Load (ETL) to correctly input data. Druid is used as a stream processing sink node for subsequent execution of queries and is therefore not part of the real-time pipeline.
当前的流处理方法将模式匹配视为用户定义的窗口函数或相当简单的状态机,这种方法有三个主要的局限性。第一个局限性在于,目标模式描述由自定义用户实现方式或描述事件类型序列的正则表达式来决定。第二个局限性在于,每个窗口函数将整个事件序列保存在窗口缓存(或状态)中,窗口缓存随着窗口中包含的事件数而增长。第三个局限性在于,模式匹配要求在每个触发事件发生时迭代整体数据。因此,除非模式非常简单,否则定义模式匹配可能代价高昂,并且计算时间可能无法满足实时性要求。Current approaches to stream processing treat pattern matching as user-defined window functions or rather simple state machines, and this approach suffers from three main limitations. The first limitation is that the target pattern description is determined by custom user implementations or regular expressions describing the sequence of event types. The second limitation is that each window function saves the entire sequence of events in a window cache (or state), which grows with the number of events contained in the window. A third limitation is that pattern matching requires that the entire data be iterated over at each trigger event. Therefore, unless the pattern is very simple, defining pattern matching can be expensive, and the computation time may not meet real-time requirements.
通常,定义一组规则以基于在特定时间窗口内计算的不同聚合(总和、计数、平均值等)来对用户行为进行建模。然而,复杂模式要求根据分析人员对欺诈方案的解释来定义和验证很多复杂的规则。Typically, a set of rules is defined to model user behavior based on different aggregates (sums, counts, averages, etc.) computed over a specific time window. However, complex patterns require many complex rules to be defined and validated against analysts' interpretations of fraud schemes.
在默认的方法中,创建大窗口以在全局特征计算时保存所有历史数据,或在某些分区的特征计算时保存数据的子域。每个特征都需要自己的窗口。将会通过遍历整个数据集并从头开始重新计算模式匹配或规则验证来计算结果,从而导致结果计算缓慢或验证能力有限。此外,模式范围或趋势的细微变化可能导致漏检。第一种选择是定义获取模式的用户定义函数。函数应遍历整个窗口,逐条验证窗口的当前状态与目标模式之间的相应匹配。In the default approach, a large window is created to hold all historical data during global feature computation, or a subdomain of data during feature computation for certain partitions. Each feature needs its own window. Results will be computed by traversing the entire dataset and recomputing pattern matching or rule validation from scratch, resulting in slow result computation or limited validation capabilities. Additionally, small changes in pattern range or trend can lead to missed detections. The first option is to define a user-defined function that gets the schema. The function should traverse the entire window, verifying line by line the corresponding match between the current state of the window and the target pattern.
这种方法的主要局限性在于,它需要大量的资源来描述模式并执行其验证。此外,由于高度复杂性以及不可扩展性,它容易出错,因为整个逻辑将嵌入到一个独特的专用函数中。另一种选择是依赖用于定义状态机的CEP API。例如,element1=x FOLLOWED BYelement2=y FOLLOWED BY element3=z等。这种方法的主要局限性在于,它只用于相当简单的分析,不能用于分布模式,即仅限用于单机状态机。The main limitation of this approach is that it requires significant resources to describe the schema and perform its validation. Also, due to its high complexity and non-scalable, it is prone to errors because the entire logic would be embedded in a unique dedicated function. Another option is to rely on the CEP API for defining state machines. For example, element1=x FOLLOWED BY element2=y FOLLOWED BY element3=z and so on. The main limitation of this approach is that it is only used for fairly simple analyses and cannot be used for distributed models, i.e. only for single-machine state machines.
发明内容SUMMARY OF THE INVENTION
鉴于上述问题和缺点,本发明旨在改进大数据分布式流处理。因此,本发明的目的在于提供一种流处理设备和处理数据流的方法。与本领域已知的相应方案相比,所述设备和方法表现更优。In view of the above problems and disadvantages, the present invention aims to improve distributed stream processing of big data. Therefore, the object of the present invention is to provide a stream processing device and a method for processing a data stream. The apparatus and method perform better than corresponding solutions known in the art.
本发明的目的通过所附独立权利要求所提供的方案来实现。本发明的有利实现方式在从属权利要求中进一步说明。The objects of the invention are achieved by the solutions provided by the attached independent claims. Advantageous implementations of the invention are further specified in the dependent claims.
本发明提出了一种新的窗口式处理系统,其一方面提供了一种将传入数据流的窗口转换为二阶表示的机制,另一方面提供了一种随着流的进行而处理此类窗口的二阶表示的机制。更具体地,存在多种场景,例如欺诈预防、异常值检测或监控,其中必须以非常低的时延执行复杂的模式匹配过程。此外,在此类场景中,二阶表示可用于在窗口中嵌入事件的统计属性或数学模型,并且此类工具可用于支持在线机器学习模块以及预测分析。The present invention proposes a new windowed processing system, which on the one hand provides a mechanism for converting the window of an incoming data stream into a second-order representation, and on the other hand, provides a mechanism for processing this data stream as the stream progresses. Mechanism for second-order representation of class windows. More specifically, there are various scenarios, such as fraud prevention, outlier detection or monitoring, where complex pattern matching processes must be performed with very low latency. Furthermore, in such scenarios, second-order representations can be used to embed statistical properties or mathematical models of events in windows, and such tools can be used to support online machine learning modules as well as predictive analytics.
根据本发明,提出了一种新的系统,其能够在流窗口中实现高效和有效的模式匹配以及CEP,以便能够应对广泛的大数据场景。为此,构建了一个专用系统,用于转换二阶数据模型(例如时间序列索引)中的事件窗口,并定义一组能够基于这种表示执行操作的高阶算子。According to the present invention, a new system is proposed that enables efficient and effective pattern matching and CEP in streaming windows to be able to cope with a wide range of big data scenarios. To this end, a specialized system is built for transforming event windows in second-order data models (such as time series indices) and defining a set of higher-order operators capable of performing operations based on this representation.
本发明第一方面提供一种流处理设备,其具有处理器,用于处理流窗口中的数据流,其中所述处理器用于从所述流窗口的内容生成高阶表示,并且所述处理器用于处理所述高阶表示。A first aspect of the present invention provides a stream processing device having a processor for processing a data stream in a stream window, wherein the processor is configured to generate a high-order representation from the content of the stream window, and the processor uses for processing the higher-order representation.
本发明基于两种新的算子类型。首先,转换窗口算子生成二阶表示(例如时间序列),并由此实现升阶(elevate)方法。根据不同的升阶逻辑可以生成不同类型的表示。其次,高阶算子是作为流算子实现的二阶或三阶函数,以二阶表示作为输入和参数。原则上,所有现有的一阶算子类型都将具有作为高阶算子的相应实现方式。The present invention is based on two new operator types. First, a transformation window operator generates a second-order representation (eg, a time series), and thus implements an elevate method. Different types of representations can be generated according to different promotion logics. Second, higher-order operators are second- or third-order functions implemented as stream operators, with second-order representations as inputs and parameters. In principle, all existing first-order operator types will have corresponding implementations that are higher-order operators.
本发明具有以下优点:通过计算模式匹配,克服了用于在窗口中执行模式匹配的迭代方法(现有技术中的方法)的复杂性,并且通过有效的二阶表示克服了其它二阶函数的复杂性,从而:(a)降低模式匹配的计算开销(例如索引搜索的日志复杂性与多次迭代的多项式复杂性);(b)节省资源,减少数据维数;(c)通过应用统计方法来降低噪声,提高准确性。The present invention has the advantage of overcoming the complexity of iterative methods for performing pattern matching in windows (methods in the prior art) by computing pattern matching, and overcoming the complexity of other second-order functions by an efficient second-order representation complexity, thereby: (a) reducing the computational overhead of pattern matching (such as log complexity for index searches versus polynomial complexity for multiple iterations); (b) saving resources and reducing data dimensionality; (c) by applying statistical methods to reduce noise and improve accuracy.
本发明的另一优点在于,通过降低模式匹配的复杂性来提高性能,从基于窗口中事件数的多项式复杂性到对数复杂性(加上构建高阶表示的恒定开销)。复杂事件处理和模式匹配可以在高维和复杂数据上执行,而无需指定精细规则或复杂程序来表示模式。Another advantage of the present invention is that it improves performance by reducing the complexity of pattern matching, from polynomial complexity based on the number of events in the window to logarithmic complexity (plus the constant overhead of building higher order representations). Complex event processing and pattern matching can be performed on high-dimensional and complex data without specifying fine-grained rules or complex programs to represent patterns.
本发明的另一优点在于,采用更有效的表示来节省资源/开销,其中高维数据压缩高达1000倍。方法可以原生地应用于二阶表示,以减少维数,从而减少模式匹配过程中窗口表示占用的内存。Another advantage of the present invention is that it saves resources/overhead with a more efficient representation, where high dimensional data is compressed up to 1000 times. Methods can be natively applied to second-order representations to reduce dimensionality and thus the memory occupied by window representations during pattern matching.
本发明的另一优点在于,通过提高模式匹配精度来降低噪声从而提高准确性。统计方法规范化或强调窗口中数据的特定特征,使其能够适应窗口事件的变化,而无需纠正查询参数(例如平滑)。Another advantage of the present invention is to improve accuracy by reducing noise by improving pattern matching accuracy. Statistical methods normalize or emphasize specific characteristics of the data in a window, making it adaptable to changes in window events without correcting query parameters (such as smoothing).
本发明的另一优点在于,具有丰富的功能。与现有技术中的流引擎相比,提供了至少2倍的算子。本发明实现了一套全新的实时分析流程,提出了一种利用现有所有时序分析和处理(例如光谱、干预、解释和预测分析)工具的系统。Another advantage of the present invention is that it has rich functions. Compared with the stream engine in the prior art, at least 2 times more operators are provided. The present invention implements a whole new set of real-time analysis procedures, and proposes a system that utilizes all existing time-series analysis and processing (eg spectral, intervention, interpretation and predictive analysis) tools.
本发明可以应用于大量需要在大数据流上执行模式匹配的场景。其直接适用领域包括物联网(Internet of Things,IoT)、金融、欺诈检测、风险分析以及未来领域,例如人工智能(AI)和实时智能决策。The present invention can be applied to a large number of scenarios where pattern matching needs to be performed on large data streams. Its immediate application areas include the Internet of Things (IoT), finance, fraud detection, risk analysis, and future areas such as artificial intelligence (AI) and real-time intelligent decision-making.
所述第一方面的系统的主要优点在于,增强流处理系统复杂事件处理和模式匹配的能力,以及简化用于此类分析的编程模型。事实上,可以优化二阶表示以提高在大窗口中执行模式匹配的性能。所述系统将负担构建二阶表示的开销,但是在每次迭代时,将可以通过简单地对这种优化的数据结构执行转换和比较来验证与目标模式的匹配。例如,二阶表示可以是时间序列索引,例如SAX或ADS。这种时间序列表示保证了匹配执行点的数量的对数复杂性。此外,它们还包含各点之间的几何距离,以适应类似事件中的时间漂移。The main advantage of the system of the first aspect is that it enhances the capabilities of the stream processing system for complex event processing and pattern matching, and simplifies the programming model for such analysis. In fact, the second-order representation can be optimized to improve the performance of pattern matching in large windows. The system will be burdened with the overhead of building a second-order representation, but at each iteration it will be possible to verify a match with the target schema by simply performing transformations and comparisons on this optimized data structure. For example, the second order representation can be a time series index such as SAX or ADS. This time series representation guarantees a logarithmic complexity of matching the number of execution points. In addition, they contain geometric distances between points to accommodate temporal drift in similar events.
本发明与当前流处理器的前景高度相关,因为它解决了当前流引擎需要应用迭代方法来对传入事件的窗口执行模式匹配的问题。这意味着使用窗口算子时,这些算子可能缓存每次必须迭代的大量事件。一方面,复杂模式匹配作为迭代算法在当前流处理设置中的表示和验证相当复杂。另一方面,数据量随着时间增长,所需资源预算也随之增加,甚至可能使计算相较实时性要求过于昂贵,甚至不可行(例如给定执行处理的算子的计算资源)。此外,不同的方案可能需要不同类型的模式匹配技术,在问题的不同方面具有不同的容差。每种不同的方案都必须使用专门定义的模式匹配函数来处理,这增加了分析的成本和部署新分析的时间。The present invention is highly relevant to the current stream processor landscape as it solves the problem of current stream engines needing to apply an iterative method to perform pattern matching on windows of incoming events. This means that when using window operators, these operators may cache a large number of events that must be iterated over each time. On the one hand, complex pattern matching as an iterative algorithm is rather complex to represent and validate in current stream processing settings. On the other hand, as the amount of data grows over time, so does the required resource budget, which may even make computation too expensive or even infeasible compared to real-time requirements (eg, given the computational resources of an operator performing processing). Furthermore, different scenarios may require different types of pattern matching techniques, with different tolerances for different aspects of the problem. Each different scenario must be handled using a specially defined pattern matching function, which increases the cost of the analysis and the time to deploy a new analysis.
本发明应用于需要对高维数据进行实时模式匹配和复杂事件处理(complexevent processing,CEP)的用例。本发明提出了一种新的系统,以简化复杂分析,从而能够使用流处理中典型的时间序列挖掘技术。本发明描述了一种新的基于两种新型算子类型的系统,利用将事件的时间序列(窗口)在线转换为二阶表示(例如模式(1))的机制,以启用基于此类表示(2)的无缝流处理。The present invention is applied to use cases that require real-time pattern matching and complex event processing (CEP) on high-dimensional data. The present invention proposes a new system to simplify complex analysis, enabling the use of time series mining techniques typical of stream processing. The present invention describes a new system based on two novel types of operators, utilizing a mechanism for online transformation of time series (windows) of events into second-order representations (e.g. mode (1)) to enable such representations ( 2) seamless stream processing.
所述第一方面的系统为数据流分析提供了一种新的范式,为直接对模式进行流处理提供了基础。例如,这类算子包括:流相关性检测器、异常值检测器、匹配算子、滤波器、映射和分组函数。The system of the first aspect provides a new paradigm for data flow analysis and provides a basis for directly stream processing patterns. Such operators include, for example: stream correlation detectors, outlier detectors, matching operators, filters, mapping and grouping functions.
更高阶的表示将使现有的工具包和方法能够在流引擎的上下文中进行模式分析。一个应用实例是在线机器学习,因为一些时间序列表示法包含大量与金融和贸易市场也相关的统计(例如自回归滑动平均数)。Higher-order representations will enable existing toolkits and methods to perform pattern analysis in the context of streaming engines. An application example is online machine learning, since some time series representations contain a large number of statistics (such as autoregressive moving averages) that are also relevant to financial and trade markets.
在所述第一方面的一种实现方式中,所述处理器用于通过执行以下至少一项来生成所述高阶表示:降维变换,例如离散傅里叶变换、快速傅里叶变换、快速傅里叶逆变换、离散小波变换;分段聚合近似;符号聚合近似;贝塞尔曲线生成;形状描述字母表;奇异值分解;自回归滑动平均;鞅差序列;索引方法,例如iSAX2、ADS;散列函数;分段方差和。In an implementation of the first aspect, the processor is configured to generate the higher-order representation by performing at least one of the following: a dimensionality reduction transform, such as a discrete Fourier transform, a fast Fourier transform, a fast Inverse Fourier transform, discrete wavelet transform; piecewise aggregate approximation; symbolic aggregate approximation; Bezier curve generation; shape description alphabet; singular value decomposition; autoregressive moving average; martingale difference sequence; indexing methods such as iSAX2, ADS ; hash function; piecewise sum of variance.
所述高阶表示可以是二阶或三阶,甚至更高阶也是可能的。上述转换或函数不是详尽列表。所有合适的函数都可以用于生成所述高阶表示。The higher-order representation may be of second or third order, and even higher orders are possible. The above transformations or functions are not an exhaustive list. All suitable functions can be used to generate the higher-order representation.
在所述第一方面的另一实现方式中,所述处理器用于通过执行以下至少一种函数来处理所述高阶表示:基于时间序列相关性/匹配函数的滤波和多重滤波函数,例如,欧氏距离和动态时间规整;映射;扁平化映射;时间序列聚类/匹配方法,允许例如分组和联合实现;高阶分组约简,允许例如通过约简和折叠分组函数来创建一阶实体分组;功率谱变换,例如傅里叶变换、平方根、立方根、对数;季节性调整;趋势放大或去除;平滑;对事件进行滤波以简化相关性;基于时间序列匹配的流连接;时间序列函数,用于表示时间序列模型的先验、似然性,支持使用概率程序的搜索策略,以及用于回归的参数和参数模型的外推,例如自动贝叶斯协方差发现;从时间序列中提取特征以支持基于特征推导窗口、预测点和预测窗口及距离的预测;时间序列的合成度量以推导描述性特征,例如滚动平均值、滚动最大值、滚动最小值、布林线指标和统计信息、类别特征的滚动熵、类别特征的滚动多数、滚动文本统计。In another implementation of the first aspect, the processor is configured to process the higher order representation by performing at least one of the following functions: filtering based on time series correlation/matching functions and multiple filtering functions, eg, Euclidean distance and dynamic time warping; maps; flattened maps; time series clustering/matching methods, allowing for example grouping and joint implementation; higher-order grouping reduction, allowing for example to create first-order entity groupings by reducing and collapsing grouping functions ; Power spectral transforms such as Fourier transform, square root, cube root, logarithm; seasonal adjustment; trend amplification or removal; smoothing; filtering events to simplify correlations; stream joins based on time series matching; time series functions, Used to represent priors, likelihoods for time series models, supports search strategies using probabilistic procedures, and extrapolation of parametric and parametric models for regression, such as automatic Bayesian covariance discovery; feature extraction from time series to support feature derivation windows, forecast points and forecast windows and distances; synthetic measures of time series to derive descriptive features such as rolling mean, rolling maximum, rolling minimum, Bollinger Bands indicators and statistics, categories Rolling entropy of features, rolling majority of category features, rolling text statistics.
上述函数或高阶算子不是详尽列表。所有合适的一阶函数都可以用于处理所述高阶表示。这种高阶算子的输出可以作为另一个高阶算子的输入,也可以作为基于降阶(demote)方法的一阶算子的输入。当高阶表示或时间序列过于密集时,可以应用降维方法,例如时间序列平滑,以最小的准确性损失来加速模式匹配的执行。The above functions or higher-order operators are not an exhaustive list. All suitable first-order functions can be used to process the higher-order representation. The output of this higher-order operator can be used as the input of another higher-order operator, or it can be used as the input of a first-order operator based on demote methods. When higher-order representations or time series are too dense, dimensionality reduction methods, such as time series smoothing, can be applied to speed up the execution of pattern matching with minimal loss of accuracy.
在所述第一方面的另一实现方式中,所述处理器用于生成由触发事件触发的所述高阶表示,其中所述流窗口的缓存数据用于生成所述流窗口的状态在触发时刻的高阶表示。In another implementation manner of the first aspect, the processor is configured to generate the high-level representation triggered by a trigger event, wherein the buffered data of the flow window is used to generate the state of the flow window at the trigger moment high-level representation of .
这样的触发事件可以基于时间段、窗口中的事件数等。Such triggering events may be based on time periods, number of events in a window, and the like.
在所述第一方面的另一实现方式中,所述处理器用于通过处理所生成的高阶表示的元数据和/或上下文返回到一阶流处理。In another implementation of the first aspect, the processor is configured to return to first-order stream processing by processing the metadata and/or context of the generated high-order representation.
所述处理器或高阶算子可以实现降阶方法,在给定的函数输出下,所述降阶方法基于生成所述高阶表示的实体的元数据和/或上下文产生相关输出。通过降阶,结果、表示等可以从较高阶返回到一阶进行进一步处理、评估等。The processor or higher-order operator may implement a reduction-order method that, given a function output, produces a relevant output based on the metadata and/or context of the entity generating the higher-order representation. Through order reduction, results, representations, etc. can be returned from higher order to first order for further processing, evaluation, etc.
在所述第一方面的另一实现方式中,所述处理器用于将高阶对象返回逻辑作为参数传递。In another implementation manner of the first aspect, the processor is configured to pass the higher-order object return logic as a parameter.
将所述逻辑作为参数提供可以增加灵活性。Providing the logic as a parameter increases flexibility.
在所述第一方面的另一实现方式中,所述处理器用于提供接口以检索和查询所生成的高阶表示,并存储所述高阶表示以及与所述高阶表示的生成相关的上下文和/或元数据。In another implementation of the first aspect, the processor is configured to provide an interface to retrieve and query the generated higher-order representation, and to store the higher-order representation and context related to the generation of the higher-order representation and/or metadata.
提供这样的标准化接口可以简化数据传输和函数调用。Providing such a standardized interface simplifies data transfer and function calls.
在所述第一方面的另一实现方式中,所述元数据包括所述流窗口的边界描述和/或针对所述流窗口的处理的说明。In another implementation of the first aspect, the metadata includes a description of the boundaries of the flow window and/or a description of processing for the flow window.
所述元数据简化了数据处理,降低了通信带宽,从而提高了整体性能。The metadata simplifies data processing and reduces communication bandwidth, thereby improving overall performance.
在所述第一方面的另一实现方式中,所述处理器用于根据所述流窗口的状态动态更新所生成的高阶表示。In another implementation of the first aspect, the processor is configured to dynamically update the generated high-order representation according to the state of the flow window.
这样的动态更新增加了灵活性并且降低了处理需求,因为已生成的高阶表示可以针对所述流窗口进行适配。Such dynamic updating increases flexibility and reduces processing requirements as the generated higher order representation can be adapted to the flow window.
在所述第一方面的另一实现方式中,所述处理器用于在生成所述高阶表示时,将所述高阶表示的类型作为参数传递。In another implementation manner of the first aspect, the processor is configured to pass the type of the higher-order representation as a parameter when generating the higher-order representation.
将所述高阶表示的类型作为参数提供可以增加灵活性。Providing the type of the higher-order representation as a parameter increases flexibility.
在所述第一方面的另一实现方式中,所述处理器用于生成和操作包括至少一个适于检测分布变化的统计模型的高阶表示,其中所述处理器用于根据所述分布变化启动所述统计模型的更新。In another implementation of the first aspect, the processor is configured to generate and operate a higher-order representation comprising at least one statistical model adapted to detect distribution changes, wherein the processor is configured to initiate all of the distribution changes based on the distribution changes. Updates to the statistical model described.
这样的实现方式使得可以轻松和快速地适应分布变化,而无需重新设置。Such an implementation makes it easy and quick to adapt to distribution changes without having to reset.
在所述第一方面的另一实现方式中,所述处理器用于生成二阶或三阶表示形式的高阶表示。甚至更高阶的表示也是可能的,但是可能会增加处理负担。In another implementation of the first aspect, the processor is configured to generate a higher-order representation of a second- or third-order representation. Even higher order representations are possible, but may increase the processing burden.
本发明的第二方面提供一种处理流窗口中的数据流的方法,包括:从流窗口的内容生成高阶表示,并处理所述高阶表示。A second aspect of the present invention provides a method of processing a data stream in a stream window, comprising: generating a higher-order representation from the content of the stream window, and processing the higher-order representation.
上述第一方面的系统的相同优点也适用于所述第二方面的方法。进一步地,所述第二方面的方法可以具有与上述第一方面的实现方式相对应的实现方式。因此,所述方法还可以实现这些实现方式的优点。The same advantages of the system of the first aspect described above also apply to the method of the second aspect. Further, the method of the second aspect may have an implementation manner corresponding to the implementation manner of the above-mentioned first aspect. Thus, the method may also achieve the advantages of these implementations.
需要注意的是,本申请中所描述的所有设备、元件、单元和装置可以在软件或硬件元件或其任意组合中实现。本申请中描述的各种实体执行的所有步骤和所描述的将由各种实体执行的功能旨在表明各个实体适于或用于执行各自的步骤和功能。It should be noted that all devices, elements, units and means described in this application may be implemented in software or hardware elements or any combination thereof. All steps performed by the various entities described in this application and the functions described to be performed by the various entities are intended to indicate that the various entities are adapted or used to perform the respective steps and functions.
虽然在以下具体实施例的描述中,由外部实体执行的特定功能或步骤没有在执行特定步骤或功能的该实体的具体元件的描述中反映,但是技术人员应该清楚的是这些方法和功能可以在各自的硬件或软件元件或其任意组合中实现。Although in the following description of specific embodiments, specific functions or steps performed by an external entity are not reflected in the description of specific elements of that entity performing the specific steps or functions, it should be clear to those skilled in the art that these methods and functions may be implemented in implemented in respective hardware or software elements or any combination thereof.
附图说明Description of drawings
结合所附附图,下面具体实施例的描述将阐述上述本发明的各方面及其实现形式,其中:In conjunction with the accompanying drawings, the following description of specific embodiments will illustrate various aspects of the present invention described above and implementations thereof, wherein:
图1示出了根据本发明实施例的高阶流处理设备的示例。FIG. 1 shows an example of a high-order stream processing device according to an embodiment of the present invention.
图2示出了高阶流处理的第一示例。Figure 2 shows a first example of higher order stream processing.
图3示出了高阶流处理的第二示例。Figure 3 shows a second example of higher order stream processing.
图4示出了高阶流处理的第三示例。Figure 4 shows a third example of higher order stream processing.
图5示出了高阶流处理的流程图。Figure 5 shows a flow diagram of higher order stream processing.
图6示出了根据本发明实施例的处理数据流的方法的流程图。FIG. 6 shows a flowchart of a method for processing a data stream according to an embodiment of the present invention.
具体实施方式Detailed ways
图1示出了根据本发明实施例的高阶流处理设备100的示例。流处理设备100包括处理器110,用于处理流窗口130中的数据流120。处理器110可以在硬件和/或软件中实现。处理器110可以是服务器的一部分,也可以是数字信号处理器(DSP)。数据流120包括事件或数据项,这些事件或数据项随后经过流窗口130。FIG. 1 shows an example of a high-order
处理器110用于从流窗口130的内容生成或升阶高阶表示140。高阶可以是二阶或三阶,甚至可以包括更高阶。高阶可以是时间序列形式的二阶。The
处理器110用于通过执行以下至少一项来处理高阶表示140:降维变换,例如离散傅里叶变换、快速傅里叶变换、快速傅里叶逆变换、离散小波变换;分段聚合近似;符号聚合近似;贝塞尔曲线生成;形状描述字母表;奇异值分解;自回归滑动平均;鞅差序列;索引方法,例如iSAX2、ADS;散列函数;分段方差和。The
处理器110还用于例如使用一个或多个高阶算子来处理高阶表示140。高阶算子可以与高阶表示140同阶。处理器110用于通过执行以下至少一种函数来处理高阶表示140:基于时间序列相关性/匹配函数的滤波和多重滤波函数,例如,欧氏距离和动态时间规整;映射;扁平化映射;时间序列聚类/匹配方法,允许例如分组和联合实现;高阶分组约简,允许例如通过约简和折叠分组函数来创建一阶实体分组;功率谱变换,例如傅里叶变换、平方根、立方根、对数;季节性调整;趋势放大或去除;平滑;对事件进行滤波以简化相关性;基于时间序列匹配的流连接;时间序列函数,用于表示时间序列模型的先验、似然性,支持使用概率程序的搜索策略,以及用于回归的参数和参数模型的外推,例如自动贝叶斯协方差发现;从时间序列中提取特征以支持基于特征推导窗口、预测点和预测窗口及距离的预测;时间序列的合成度量以推导描述性特征,例如滚动平均值、滚动最大值、滚动最小值、布林线指标和统计信息、类别特征的滚动熵、类别特征的滚动多数、滚动文本统计。The
图2示出了高阶流处理的第一示例,其中处理流窗口130中的数据流120。流处理在根据本发明实施例的设备100中执行,所述实施例基于图1所示的设备100。FIG. 2 shows a first example of high-order stream processing, where a
首先,事件在流窗口130或流窗口算子中缓存。然后,在触发事件发生时,利用缓存的事件来生成窗口130的状态在触发时刻的二阶表示140。转换逻辑在升阶方法中执行。First, events are buffered in the
转换窗口算子200在升阶步骤中生成二阶表示140,即时间序列,并实现升阶方法。可以根据不同的升阶逻辑生成不同类型的表示140。二阶表示140提供API以检索和查询所生成的表示,并存储与升阶过程的实体主体相关的时间序列表示、上下文和元数据。参考本发明提出的高阶处理,将窗口内事件序列以二阶表示进行转换的逻辑阶段或步骤称为升阶。二阶表示140也可以包含一些元数据和针对所生成的窗口的说明,使得下游算子能够一致地处理所生成的数据模型。元数据可以包括例如升阶后窗口的边界描述,以及针对窗口处理实体主体的说明。The
一个或多个高阶算子210是作为流算子实现的二阶或三阶函数,以二阶表示140作为输入和参数。原则上,所有现有的一阶算子类型都将具有作为二阶算子的相应实现方式。高阶算子210作用于二阶表示140,并在二阶域中下发输出。The one or more higher-
此外或可替代地,高阶算子210实现降阶方法,在给定的函数输出下,降阶方法将基于生成二阶表示的实体的元数据和上下文产生相关输出230。输出230仍处于一阶域中。Additionally or alternatively, the
高阶流算子类型将二阶表示140作为输入,并针对二阶表示140执行例如业务逻辑之类的函数。针对单个事件,可以是三阶算子,但是假设二阶表示140由不可逆函数生成,我们可以依靠处理复杂和丰富表示的二阶函数来实现此类算子。为了扩展当前流引擎,高阶表示140的输出可以按照二阶例如高阶映射或滤波函数,或基于附加到二阶表示的元数据的原始一阶处理。高阶算子输出返回到一阶流处理的逻辑阶段或步骤称为降阶。Higher-order flow operator types take as input a second-
一旦将二阶表示140输入到高阶流算子210,例如根据所定义的语义执行一些业务逻辑,例如映射、分组、约简、滤波等,高阶算子的输出可以作为另一个高阶算子的输入,也可以作为基于降阶方法的一阶算子的输入。Once the second-
图3示出了高阶流处理的第二示例。流处理在根据本发明实施例的设备100中执行,所述实施例基于图1所示的设备100。图3特别示出了用于基于实体在特定时间帧内的行为对实体进行分组的高阶分组算子的示例。在第一阶段中,基于一些关键标识符(分区流)在窗口130a、130b中累积事件。然后,在触发时刻,每个特殊窗口算子生成特定的二阶表示140a、140b,每个二阶表示输入到高阶分组算子210。然后,算子210基于在所考虑的时间窗口中由实体行为定义的模式来执行聚类操作,并根据二阶表示的相似性输出二阶表示组或簇220a、220b。Figure 3 shows a second example of higher order stream processing. Stream processing is performed in a
图4示出了采用二阶滤波器的高阶流处理的第三示例。流处理在根据本发明实施例的设备100中执行,所述实施例基于图1所示的设备100。Figure 4 shows a third example of higher order stream processing employing a second order filter. Stream processing is performed in a
下面介绍系统在高阶流处理中的应用。所提出的算子将针对所考虑的每个数据分区进行实例化,并将在有限的资源上运行。在触发事件发生时,生成二阶表示140,并且可以例如由二阶滤波算子400对这种表示执行比较。可以根据所选择的二阶表示模型(例如SAX或ADS)来表示目标模式410,然后与一些距离度量进行比较,并基于阈值参数进行滤波。一个或多个目标模式410可以存储在存储器中并在需要时检索。The application of the system in high-order stream processing is described below. The proposed operator will be instantiated for each data partition considered and will run on limited resources. When a triggering event occurs, a second-
在每个触发时刻,针对窗口130中包含的所有事件,所生成的二阶表示140将与目标模式410匹配一次。当相匹配时,可以输出与事件关联实体相关的相应元数据/密钥。很多并发窗口可以生成二阶表示140且并行进行滤波,使得系统具有可扩展性和分布性。The generated
此外,高阶流处理系统可以构建和操作适于检测分布变化的二阶表示140,例如自回归差分滑动平均(ARIMA)模型,其可用于在线机器学习算法以及触发更新和训练已部署模型的需求。这类工具设计为用于滑动复杂特征(即数学函数、统计)的基于流的处理组件,通过优化互补数学函数的计算来支持在线连续机器学习,以馈送和微调已部署模型的生命周期。In addition, higher-order stream processing systems can build and operate second-
考虑了另一个用于故障检测和根因分析的大规模云基础设施的异常值检测监控的用例。目前,对基础设施事件进行批处理分析,以提取按指纹(一组特征,为轨迹比较提供上下文)分组的轨迹(操作序列),然后进行分析和比较,从而个性化显示可能指示现有问题的轨迹(异常值检测)。然后向算子显示异常值轨迹,以进行进一步分析。监控过程不依赖于任何分布式框架来分析数据,任何对轨迹的进一步操作都应该自定义实现。Another use case for outlier detection monitoring of large-scale cloud infrastructures for fault detection and root cause analysis is considered. Currently, batch analysis of infrastructure events is performed to extract trajectories (sequences of operations) grouped by fingerprints (a set of features that provide context for trajectory comparison), which are then analyzed and compared to personalize indications of existing problems. Trajectory (outlier detection). The outlier trajectories are then shown to the operator for further analysis. The monitoring process does not rely on any distributed framework to analyze the data, and any further operations on the trajectory should be custom implemented.
利用根据本发明的高阶流算子,监控过程成为流过程,其中出于监控目的,可以分析和比较由一组轨迹生成的一组时间序列。通过触发进一步分析和与标准化相关指标进行比较,例如消除趋势,可以减少针对算子的误告警数量。此外,由于监控工具可以利用分布式流引擎,因此监控能力的可伸缩性将通过数据处理框架来管理。With the higher order flow operators according to the invention, the monitoring process becomes a flow process, wherein for monitoring purposes a set of time series generated by a set of trajectories can be analyzed and compared. The number of false alarms for operators can be reduced by triggering further analysis and comparison with standardized relevant metrics, such as removing trends. Furthermore, since monitoring tools can leverage distributed streaming engines, the scalability of monitoring capabilities will be managed by the data processing framework.
图5示出了高阶流处理的流程图或工作流程图。所定义的系统的工作流程从步骤500开始。工作流程可以在根据本发明实施例的设备100中执行,所述实施例基于图1所示的设备100。Figure 5 shows a flowchart or work flow diagram of higher order stream processing. The workflow of the defined system begins at
根据步骤510,当触发事件发生时,例如达到发送时间或默认触发器被调用时,连续流在窗口式算子中读取并累积数据。在步骤520中,收集流的事件数据。According to step 510, the continuous stream reads and accumulates data in a windowed operator when a trigger event occurs, such as when the send time is reached or when a default trigger is invoked. In
接下来,在步骤530中,检查能否增量地构建二阶表示。如果二阶表示是所收集事件的索引或压缩表示,则增量构建是可能的。然后,在步骤540中,构建二阶表示。在步骤550中,添加元数据和密钥作为说明。Next, in
在步骤560中,检查是否存在触发事件。如果不存在,则流程返回到步骤520,收集事件数据。如果存在,则流程继续进行到步骤570,检查高阶表示是否已准备就绪。如果是,流程继续。如果否,则流程返回到步骤540,构建二阶表示。In
以上描述侧重于二阶表示的增量构建。对于二阶表示的非增量构建,在步骤530处确定为否,并且在存在触发事件的情况下,所述方法进行到步骤560并且进一步进行到步骤570,其中确定高阶表示尚未就绪。相应地,所述方法返回到步骤540,非增量地构建二阶表示,后续流程如上所述。The above description focuses on incremental construction of second-order representations. For non-incremental construction of the second-order representation, the determination at
随着步骤580中一个或多个二阶表示的发送,完成方法的升阶部分。现在,在步骤582中的是由高阶流算子收集的或传送到高阶流算子的二阶表示。一个或多个高阶流算子针对一个或多个二阶表示执行例如有关统计或业务的二阶或高阶函数。With the transmission of one or more second-order representations in
在步骤586中,决定是否应继续高阶处理。如果是,则流程返回到步骤580,发送所构建的二阶表示。如果否,则流程继续进行到步骤588,决定输出是否已下沉或完成。如果是,则流程继续进行到步骤590(下沉)和步骤592(流程结束)。In
如果否,则流程继续进行到步骤594,启动降阶方法。在步骤594中,从二阶表示中提取密钥和元数据,因为它们是降阶回一阶所必需的。在步骤596中,将行或元组这样的一阶对象作为降阶方法的结果发送。在步骤598中,继续或完成一阶处理。然后,流程继续进行到步骤590(下沉)和步骤592(流程结束)。If not, flow continues to step 594, where the order reduction method is initiated. In
执行升阶方法,并将二阶表示输出到执行已定义逻辑或函数的高阶算子。然后,可以通过编程使得高阶算子将处理降阶并切换回一阶事件处理,或者与在流水线中执行的其它高阶算子一起继续进行高阶处理。高阶流处理任务通常以一个输出完整流水线的下沉步骤结束。Executes an ascending method and outputs a second-order representation to a higher-order operator that performs a defined logical OR function. The higher-order operators can then be programmed to downgrade processing and switch back to first-order event processing, or to continue higher-order processing with other higher-order operators executing in the pipeline. Higher-order stream processing tasks usually end with a sinking step that outputs a full pipeline.
生成时间序列表示的特殊窗口算子的升阶方法将使用以下格式作为输入:The ascending method of a special window operator that produces a time series representation will use the following format as input:
数字列表,表示当部分结果应作为二阶对象输出时的通知时刻,默认值与窗口算子的触发机制有关。A list of numbers representing the notification moment when a partial result should be output as a second-order object, the default value is related to the triggering mechanism of the window operator.
指示是否为每个指定时序生成新的二阶输出流或仅需要创建和更新一个时序的参数。目标二阶表示必须支持更新模式。A parameter that indicates whether a new second-order output stream is generated for each specified time series or only one time series needs to be created and updated. The target second-order representation must support the update mode.
指示从时间序列生成二阶表示的具体实现方式的参数,以允许适合不同分析类型(例如模式匹配、统计提取等)的不同类型的变换,转换类型的增量枚举可以依据适于在窗口环境中执行的现有方法以及根据所选择的使用升阶方法扩展的算子的类型来定义和实现。A parameter indicating a specific implementation of generating a second-order representation from a time series to allow different types of transformations suitable for different types of analysis (e.g. pattern matching, statistical extraction, etc.) Existing methods implemented in , and defined and implemented according to the chosen type of operator extended using the ascending method.
针于生成二阶表示的每个逻辑窗口,例如可以使用滑动/翻滚/滚动/会话窗口,窗口的元数据和实际参数将作为上下文附加到二阶表示中。此外,在窗口目标分析中累积的原始记录/行/元组的特定参数指向部分也将附加到二阶表示中。原则上,特定参数指向部分可以是元组、行或标准简单无规则Java对象(POJO)类型。For each logical window that generates a second-order representation, e.g. sliding/tumbling/scrolling/session windows can be used, the window's metadata and actual parameters will be attached as context to the second-order representation. In addition, the specific parameter pointing part of the raw records/rows/tuples accumulated in the window target analysis will also be appended to the second-order representation. In principle, the part pointed to by a particular parameter can be of type tuple, row, or standard Simple Unregular Java Object (POJO).
存在有效的时间序列索引表示,例如,iSAX2+构建比特位级别的树以支持近似搜索,ADS+是最新的索引算法。目前流算子使用行或元组,而时间序列算子将使用特殊的时间序列对象,因此必须定义序列化机制以避免POJO序列化开销。Efficient time series index representations exist, for example, iSAX2+ builds bit-level trees to support approximate searches, and ADS+ is the latest indexing algorithm. Currently stream operators use rows or tuples, while time series operators will use special time series objects, so a serialization mechanism must be defined to avoid POJO serialization overhead.
高阶算子输出将输入到相应的应用逻辑的高阶算子,例如,二阶映射函数将转换应用适当函数(例如平滑)的二阶表示对象,二阶滤波器将基于模式匹配方法等实现滤波。Higher-order operator outputs will be input to corresponding higher-order operators of application logic, for example, a second-order mapping function will transform a second-order representation object applying an appropriate function (such as smoothing), a second-order filter will be implemented based on pattern matching methods, etc. filter.
高阶算子结果直接通过降阶方法输出,或者像标准数据处理流式拓扑中那样将二阶表示传递给另一个算子。由于二阶表示通常将与实体(例如客户、传感器等)相关联。与二阶时序表示一起保存的窗口分组密钥通常将用于将该过程与正常的一阶流处理集成在一起。The higher-order operator results are output directly through a reduced-order method, or the second-order representation is passed to another operator as in standard data processing streaming topologies. Since second order representations will usually be associated with entities such as customers, sensors, etc. A windowed grouping key saved with the second-order temporal representation will typically be used to integrate this process with normal first-order stream processing.
图6示出了根据本发明实施例的处理数据流的方法的流程图。在第一步骤600中,处理流窗口130中的数据流120。所述处理可以包括例如缓存、提供和执行触发事件。在第二步骤610中,从流窗口130的内容生成高阶表示140。所述生成可以由触发事件启动。在第三步骤620中,例如由高阶算子处理高阶表示140。FIG. 6 shows a flowchart of a method for processing a data stream according to an embodiment of the present invention. In a
在一个示例性用例中,可以考虑自动欺诈检测的情况,其通常试图根据某种模式(即欺诈签名)获取和验证用户行为。此场景的灵感来自检测用于信用卡交易验证的欺诈规则。欺诈检测系统通常利用很多并发指标来构建复杂的检测系统。通常,定义一组规则以基于在特定时间窗口内计算的不同聚合(总和、计数、平均值等)来对用户行为进行建模。然而,复杂模式要求根据分析人员对欺诈方案的解释来定义和验证很多复杂的规则。向高阶的转换降低了复杂性,允许实时处理大数据集或流。In an exemplary use case, consider the case of automated fraud detection, which typically attempts to obtain and verify user behavior based on some pattern (ie, fraudulent signatures). This scenario was inspired by the detection of fraud rules for credit card transaction verification. Fraud detection systems typically utilize many concurrent metrics to build complex detection systems. Typically, a set of rules is defined to model user behavior based on different aggregates (sums, counts, averages, etc.) computed over a specific time window. However, complex patterns require many complex rules to be defined and validated against analysts' interpretations of fraud schemes. Transformation to higher order reduces complexity, allowing real-time processing of large data sets or streams.
本发明将基于整套高阶流算子的开发,包括已可用于流处理的最相关的二阶函数的适配,例如映射、约简、分组以及在其他情况下不可能实现的其他新的算子。The present invention will be based on the development of a complete set of higher-order stream operators, including the adaptation of the most relevant second-order functions already available for stream processing, such as mapping, reduction, grouping, and other new algorithms that are otherwise impossible to implement son.
每个高阶表示140将根据嵌入函数的属性(例如允许或不允许更新)进行分类,并且开发人员将负责创建二阶转换和相应高阶处理的流水线,使得所选择的表示与处理目的兼容。Each higher-
已经结合作为实例的不同实施例以及实施方案描述了本发明。但本领域技术人员通过实践所请发明,研究附图、本公开以及独立权项,能够理解并获得其他变体。在权利要求以及描述中,术语“包括”不排除其他元件或步骤,且“一”并不排除复数可能。单个元件或其它单元可满足权利要求书中所叙述的若干实体或项目的功能。在仅凭某些措施被记载在相互不同的从属权利要求书中这个单纯的事实并不意味着这些措施的结合不能在有利的实现方式中使用。The present invention has been described in connection with various embodiments and embodiments by way of example. However, those skilled in the art can understand and obtain other variations by practicing the claimed invention, studying the drawings, the present disclosure and the independent claims. In the claims and descriptions, the term "comprising" does not exclude other elements or steps, and "a" does not exclude a plural possibility. A single element or other unit may fulfill the functions of several entities or items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Claims (13)
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| PCT/EP2018/054668 WO2019161931A1 (en) | 2018-02-26 | 2018-02-26 | Stream processing device and method of processing a stream of data |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111771195A true CN111771195A (en) | 2020-10-13 |
Family
ID=61521501
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201880090313.7A Pending CN111771195A (en) | 2018-02-26 | 2018-02-26 | Stream processing device and data stream processing method |
Country Status (2)
| Country | Link |
|---|---|
| CN (1) | CN111771195A (en) |
| WO (1) | WO2019161931A1 (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117150224A (en) * | 2023-10-30 | 2023-12-01 | 宜兴启明星物联技术有限公司 | User behavior data storage analysis method based on Internet of things |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112948455A (en) * | 2021-01-08 | 2021-06-11 | 四川新网银行股份有限公司 | Real-time analysis and calculation method based on Apache drive |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1776739A (en) * | 2004-11-16 | 2006-05-24 | 微软公司 | Traffic predictiong using model-setting and analuzing to probability relativity and environment data |
| US20100312727A1 (en) * | 2008-12-19 | 2010-12-09 | Pottenger William M | Systems and methods for data transformation using higher order learning |
| US20110016160A1 (en) * | 2009-07-16 | 2011-01-20 | Sap Ag | Unified window support for event stream data management |
| CN102955841A (en) * | 2011-08-15 | 2013-03-06 | 德商赛克公司 | Systems and/or methods for forecasting future behavior of event streams in complex event processing (cep) environments |
-
2018
- 2018-02-26 WO PCT/EP2018/054668 patent/WO2019161931A1/en not_active Ceased
- 2018-02-26 CN CN201880090313.7A patent/CN111771195A/en active Pending
Patent Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN1776739A (en) * | 2004-11-16 | 2006-05-24 | 微软公司 | Traffic predictiong using model-setting and analuzing to probability relativity and environment data |
| US20100312727A1 (en) * | 2008-12-19 | 2010-12-09 | Pottenger William M | Systems and methods for data transformation using higher order learning |
| US20110016160A1 (en) * | 2009-07-16 | 2011-01-20 | Sap Ag | Unified window support for event stream data management |
| CN102955841A (en) * | 2011-08-15 | 2013-03-06 | 德商赛克公司 | Systems and/or methods for forecasting future behavior of event streams in complex event processing (cep) environments |
Non-Patent Citations (2)
| Title |
|---|
| PARIS CARBONE等: "Apache Flink: Stream and Batch Processing in a Single Engine", 《BULLETIN OF THE IEEE COMPUTER SOCIETY TECHNICAL COMMITTEE ON DATA ENGINEERING》 * |
| 韩燕波: "基于系统动态重构的高适应性工作流管理", 计算机工程与应用, no. 09 * |
Cited By (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN117150224A (en) * | 2023-10-30 | 2023-12-01 | 宜兴启明星物联技术有限公司 | User behavior data storage analysis method based on Internet of things |
| CN117150224B (en) * | 2023-10-30 | 2024-01-26 | 宜兴启明星物联技术有限公司 | User behavior data storage analysis method based on Internet of things |
Also Published As
| Publication number | Publication date |
|---|---|
| WO2019161931A1 (en) | 2019-08-29 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US20240320123A1 (en) | Time series forecasting | |
| US20240184785A1 (en) | Continuous functions in a time-series database | |
| KR102627690B1 (en) | Dimensional context propagation techniques for optimizing SKB query plans | |
| US20200167360A1 (en) | Scalable architecture for a distributed time-series database | |
| US20230125566A1 (en) | Long string pattern matching of aggregated account data | |
| US10235376B2 (en) | Merging metadata for database storage regions based on overlapping range values | |
| US20200167355A1 (en) | Edge processing in a distributed time-series database | |
| US20160299947A1 (en) | Optimized exclusion filters for multistage filter processing in queries | |
| US20160246829A1 (en) | Managing time series databases | |
| CN104809242A (en) | Distributed-structure-based big data clustering method and device | |
| CN111522846B (en) | A Data Aggregation Method Based on Time-Series Intermediate State Data Structure | |
| CN104820708A (en) | Cloud computing platform based big data clustering method and device | |
| Lakshmi et al. | A survey on different trends in data streams | |
| CN118446415B (en) | Enterprise management method and system based on big data | |
| US20220019764A1 (en) | Method and device for classifying face image, electronic device and storage medium | |
| CN108536823B (en) | A cache design and query method for IoT-aware big data | |
| WO2021143010A1 (en) | Response method and device for distributed computing task | |
| CN118643444A (en) | Big data anomaly detection method, device, equipment, storage medium and product | |
| CN111771195A (en) | Stream processing device and data stream processing method | |
| CN103823881A (en) | Method and device for performance optimization of distributed database | |
| CN115374155A (en) | Data query method and device, electronic equipment and storage medium | |
| CN111858946B (en) | Construction method of tobacco monopoly market supervision big data E-R model | |
| WO2020106487A1 (en) | Scalable architecture for a distributed time-series database | |
| CN112435151B (en) | Government information data processing method and system based on association analysis | |
| Huang et al. | DTW-based subsequence similarity search on AMD heterogeneous computing platform |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| TA01 | Transfer of patent application right |
Effective date of registration: 20220223 Address after: 550025 Huawei cloud data center, jiaoxinggong Road, Qianzhong Avenue, Gui'an New District, Guiyang City, Guizhou Province Applicant after: Huawei Cloud Computing Technologies Co.,Ltd. Address before: 518129 Bantian HUAWEI headquarters office building, Longgang District, Guangdong, Shenzhen Applicant before: HUAWEI TECHNOLOGIES Co.,Ltd. |
|
| TA01 | Transfer of patent application right | ||
| WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20201013 |
|
| WD01 | Invention patent application deemed withdrawn after publication |