CN110825801B - Train signal system vehicle-mounted log analysis system and method based on distributed architecture

Info

Publication number: CN110825801B
Application number: CN201911076714.XA
Authority: CN
Inventors: 谢飞; 魏盛昕; 程浩; 李立; 张奕男; 朱存仁; 付朗; 杨辉
Original assignee: Casco Signal Cherngdu Ltd
Current assignee: Casco Signal Cherngdu Ltd
Priority date: 2019-11-06
Filing date: 2019-11-06
Publication date: 2023-03-10
Anticipated expiration: 2039-11-06
Also published as: CN110825801A

Abstract

The invention discloses a system and method for big data analysis of train signal system vehicle-mounted logs based on a distributed database, and relates to the technical field of train data analysis. The system comprises a data acquisition module, a data analysis module, a data storage module, a data cleaning module, a data statistical analysis module and the distributed database.

Description

Train signal system vehicle-mounted log analysis system and method based on distributed architecture

Technical Field

The invention relates to the technical field of train data analysis, in particular to a train signal system vehicle-mounted log big data analysis system and method based on a distributed database.

Background

In urban rail transit, the signal system is a key system for guaranteeing operational safety and improving transportation efficiency. The vehicle-mounted equipment of the signal system mainly comprises two subsystems: Automatic Train Operation (ATO) and Automatic Train Protection (ATP, i.e. train overspeed protection). The log data of the vehicle-mounted equipment records all of its running states throughout train operation, together with fault alarms and other key information, and plays an important role in the application and maintenance of the vehicle-mounted equipment.

In the data analysis and application process of the vehicle-mounted log of the existing signal system, the following problems mainly exist:

1. Analysis of massive vehicle-mounted logs: because the signal system has strict real-time requirements, each control end (located at the head of the train) of the ATO and ATP subsystems generates 10 log data packets per second. By estimate, one subway line with 40 trains generates nearly 100 million log records every day, which accumulates to more than 30 billion records in one year. Limited by single-machine storage space and computing capability, a traditional relational database cannot fully meet users' requirements for analyzing and processing such massive log data.

2. Fault location for vehicle-mounted equipment: when a vehicle-mounted device or system module fails, the cause must be located and analyzed. At present, subway companies mainly inspect the affected vehicle-mounted log files one by one using log viewing tools provided by the signal system supplier, and the whole process is inefficient and complicated.

3. Limitations of the online monitoring system for signal equipment: the existing online monitoring system can only reflect alarm and fault information of the vehicle-mounted equipment and does not store complete vehicle-mounted logs. That information therefore cannot be intelligently diagnosed and analyzed, and the subway company's need to move maintenance and repair of vehicle-mounted equipment toward intelligent operation cannot be met.

4. Maintenance of the line-network level signal system: in urban rail transit projects that are in operation or under construction, the signal systems are supplied by different manufacturers, so a Maintenance Support System (MSS) is usually deployed separately on each line. This creates serious information silos: the maintenance information, maintenance tools and maintenance personnel of the various manufacturers cannot be shared, and faults in the line-network level signal system cannot be located and repaired quickly.
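
The volume estimate in problem 1 can be checked with rough arithmetic. The sketch below is illustrative only: the 40 trains, two subsystems and 10 packets per second come from the text, while the number of control ends per train (2) and the daily service hours (18) are assumptions chosen for illustration.

```python
def packets_per_day(trains, control_ends, subsystems, packets_per_second, service_hours):
    """Estimate the number of log packets one line produces per day."""
    per_second = trains * control_ends * subsystems * packets_per_second
    return per_second * service_hours * 3600


# 40 trains, ATO + ATP and 10 packets/s are from the text;
# control_ends=2 and service_hours=18 are assumptions, not stated in the text.
daily = packets_per_day(trains=40, control_ends=2, subsystems=2,
                        packets_per_second=10, service_hours=18)
yearly = daily * 365

print(f"per day:  {daily:,}")
print(f"per year: {yearly:,}")
```

Under these assumptions the daily figure is on the order of 100 million and the yearly figure exceeds 30 billion, which is consistent with the estimate given above; other assumptions shift the result proportionally.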

In the existing signal system vehicle-mounted log analysis solution, an analysis algorithm of a vehicle-mounted log and a log analysis system based on a big data technology are mainly involved.

The analysis algorithms for vehicle-mounted logs mainly comprise pattern recognition and fusion analysis. The pattern recognition algorithm parses the vehicle-mounted log (ATP log only), extracts the valid state data, feeds it into predefined behavior patterns for recognition and matching, and thereby recognizes and predicts subway faults. The fusion analysis algorithm defines basic data and builds service models for the analyzable items in the massive logs of the train control system, collects log data according to open and standard principles, preprocesses and stores the log data according to the fusion analysis rules, performs cross-system log correlation analysis based on the service model, and displays the analysis results visually. Although these algorithms can meet the needs of vehicle-mounted log analysis, the concrete analysis platform still uses a traditional relational database, whose computing and storage performance cannot be guaranteed in the face of massive vehicle-mounted logs; an enterprise would have to build a distributed database suited to its own situation and carry out a large amount of data migration work.

For example, prior-art publication CN107256219A, published on October 17, 2017, discloses a big data fusion analysis method for the massive logs of an automatic train control system, comprising the following steps: (1) defining the basic data types of the service-analyzable items in the system logs; (2) modeling the system fusion analysis service; (3) implementing a unified log collection process based on open and standard principles; (4) preprocessing and storing the log data based on the fusion analysis data processing rules; (5) implementing cross-system log correlation analysis based on the service analysis model; (6) displaying the log analysis results visually through a unified interface. That method is stated to have advantages such as timely diagnosis of abnormalities between systems and an effective reduction in the workload of maintenance personnel.

The log analysis systems based on big data technology are mainly website log analysis systems built on the Hadoop platform. They use components such as Flume, Hive, HBase and Sqoop, include modules for file uploading, data cleaning, data statistical analysis, data exporting and data display, and can achieve millisecond-level queries over massive data. For example, prior-art publication CN108123834A, published on June 5, 2018, discloses a log analysis system based on a distributed database, which includes a network data collector, a distributed real-time data transmission channel, a distributed log processing platform, a network data protocol feature library, and a distributed database. Its main functions and processing flow are as follows: (1) data is transmitted through the distributed real-time data transmission channel: the network data collector captures network data packets on the network equipment and sends them to the distributed log processing platform as a real-time queue through the distributed real-time data transmission channel; (2) the distributed log processing platform processes the data packets in real time: it parses the network data packets in real time, performs feature matching against the network data protocol feature library, and sends the network log data confirmed as abnormal to the distributed database for storage; (3) the distributed database performs cluster analysis and classification training on the network log data and dynamically updates the network log protocol feature library.

Although a Hadoop-based big data analysis platform can also store and analyze massive log data, its performance in processing the structured data of signal system vehicle-mounted logs is slightly inferior to that of a distributed database such as Greenplum, and building a complete Hadoop-based distributed platform and putting it into use imposes a higher cost on enterprises.

Disclosure of Invention

The invention aims to overcome the defects of the prior art by providing a train signal system vehicle-mounted log analysis system based on a distributed architecture, which offers an efficient and scalable solution for offline analysis of the massive vehicle-mounted logs of a train signal system.

The purpose of the invention is realized by the following technical scheme:

A train signal system vehicle-mounted log analysis system based on a distributed architecture, characterized in that the system comprises a data acquisition module, a data analysis module, a data storage module, a data cleaning module, a data statistical analysis module and a distributed database.

The data acquisition module and the data analysis module are built on message middleware with a Kafka + Zookeeper architecture and are used for acquiring and parsing the vehicle-mounted log data of the train signal system.

To address the loss of storage efficiency caused by an excessive number of data fields, the data analysis module, when parsing the vehicle-mounted log data, classifies the fields in the data collected by the data acquisition module that are consecutive and belong to the same system module. All fields of the same category are merged, in order, into one large field, and the corresponding data is likewise merged into one data block.
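
The field-merging rule can be illustrated with a minimal sketch. It assumes, purely for illustration, that each parsed field is a tuple of (system module, field name, bit string); the module and field names below are hypothetical and the patent does not specify the actual data layout.

```python
from itertools import groupby


def merge_fields(fields):
    """Merge runs of consecutive fields that belong to the same system module.

    fields: ordered list of (module, field_name, bit_string) tuples.
    Returns one entry per run, holding the merged data block plus the
    names and bit widths needed to split it again later.
    """
    merged = []
    # groupby only groups *adjacent* items, which matches the rule that
    # fields must be both consecutive and from the same module.
    for module, run in groupby(fields, key=lambda f: f[0]):
        run = list(run)
        merged.append({
            "module": module,
            "names": [name for _, name, _ in run],
            "widths": [len(bits) for _, _, bits in run],
            "block": "".join(bits for _, _, bits in run),
        })
    return merged


# Hypothetical example data.
example = [
    ("odometry", "speed", "00001111"),
    ("odometry", "direction", "1"),
    ("door", "left_open", "0"),
    ("door", "right_open", "1"),
]
for entry in merge_fields(example):
    print(entry)
```

Keeping the field names and widths alongside each merged block is one possible way to preserve the merging rule so that the fields can be split again later.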

The data storage module is based on a GPSS (Greenplum Stream Server) + Kafka + gpfdist architecture and is used for loading the vehicle-mounted log data parsed by the data analysis module into the distributed database to form an original log table.

The data cleaning module processes the original log table after it has been loaded: following the field-merging rule that the data analysis module applied when parsing the vehicle-mounted log data, it splits each merged large field, in order, back into the original fields of the log data, and converts the binary value of each field into the corresponding decimal or Boolean value to complete the numerical conversion.
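
A minimal sketch of the splitting and numerical conversion follows. It assumes the merged field is available as a bit string and that the merging rule is expressed as an ordered list of (field name, width in bits, kind); both the representation and the field names are hypothetical.

```python
def split_block(block, layout):
    """Split a merged bit string back into named fields and convert values.

    block:  string of '0'/'1' characters (the merged large field).
    layout: ordered list of (field_name, width_in_bits, kind), where kind
            is "int" (binary to decimal) or "bool" (binary to Boolean).
    """
    values = {}
    offset = 0
    for name, width, kind in layout:
        bits = block[offset:offset + width]
        if len(bits) != width:
            raise ValueError(f"block too short for field {name!r}")
        number = int(bits, 2)  # binary to decimal
        values[name] = bool(number) if kind == "bool" else number
        offset += width
    if offset != len(block):
        raise ValueError("layout does not cover the whole block")
    return values


# Hypothetical layout: an 8-bit speed followed by a 1-bit direction flag.
print(split_block("000011111", [("speed", 8, "int"), ("direction", 1, "bool")]))
```

The length checks make a mismatch between the merging rule and the stored block fail loudly instead of silently producing shifted values.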

The data statistical analysis module performs aggregate statistics on the cleaned log data based on specific train operation events, calculates the corresponding key train operation and maintenance indexes, and stores the statistical results in the distributed database.

The specific train operation events comprise station arrival and stopping, emergency braking (EB), beacon loss and train-ground wireless communication faults.

The key train operation and maintenance indexes comprise the station dwell time, the number and average duration of over-limit dwells, the stopping accuracy and the number of out-of-tolerance stops, the number and causes of emergency braking (EB) events, the number of beacon losses, and the number of train-ground wireless communication failures.

The distributed database is a distributed cluster based on Greenplum and provides a distributed data storage and computing platform for the data acquisition module, the data analysis module, the data storage module, the data cleaning module and the data statistical analysis module. That is, the distributed database is responsible for storing and computing on the vehicle-mounted log data. It comprises a management (Master) node, computing (Segment) nodes and a standby (Standby) node, and is the basic platform on which the five modules are implemented.

Folders for storing vehicle-mounted log data are created on the three kinds of nodes. A MySQL database for storing acquisition and analysis records is deployed on the management node, after which the data acquisition and analysis services are deployed in the system and started. When vehicle-mounted log data is detected to have been uploaded to the web server, an FTP download is started.

The management node does not store vehicle-mounted log data; it is responsible for parsing SQL, forming distributed tasks, collecting calculation results and managing the other nodes. The computing nodes and the standby node are responsible for storing vehicle-mounted log data and executing distributed tasks. The vehicle-mounted log data is stored in the distributed database using a random distribution policy, which avoids data skew.

Each node is configured with 2 CPUs with 8 cores, 32 GB of memory and 20 SAS hard disks, and the nodes are connected by a gigabit network. 2 primary instances (Primary) and 2 mirror instances (Mirror) are deployed on each node, and the mirrors are cross-configured between nodes to improve the availability of the distributed cluster.

The data acquisition module and the data analysis module start their services after a corresponding log topic (Topic) has been created for each train in the Kafka cluster and the corresponding GPSS instance has been created.

The train signal system vehicle-mounted log analysis method based on the distributed architecture is characterized by comprising the following steps of:

A data acquisition step: a multi-threaded log scanning task is first started on each node in the distributed cluster of the distributed database, and the folder of vehicle-mounted log data of each train on the line-network log server is scanned periodically. When a data update is detected in a folder of vehicle-mounted log data, the updated folder is locked, and an FTP download task is created on the node running the log scanning task to fetch the vehicle-mounted log data from the line-network log server to the local node.
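
The scan-and-lock behavior of the acquisition step can be sketched as follows. This is an in-process illustration only: the remote listing is supplied as a callable instead of a real FTP client, the class and method names are hypothetical, and the lock is a local stand-in for whatever cluster-wide coordination a real deployment would need.

```python
import threading


class FolderScanner:
    """Detect updated train log folders and lock them while downloading."""

    def __init__(self, list_remote):
        self._list_remote = list_remote  # callable: folder -> iterable of file names
        self._seen = {}                  # folder -> file names already downloaded
        self._locked = set()             # folders currently being downloaded
        self._guard = threading.Lock()

    def claim_updates(self, folder):
        """Return new file names in folder and lock it; [] if locked or unchanged."""
        with self._guard:
            if folder in self._locked:
                return []
            known = self._seen.get(folder, set())
            new_files = sorted(set(self._list_remote(folder)) - known)
            if new_files:
                self._locked.add(folder)
            return new_files

    def release(self, folder, downloaded):
        """Record downloaded files and unlock the folder."""
        with self._guard:
            self._seen.setdefault(folder, set()).update(downloaded)
            self._locked.discard(folder)


# Hypothetical remote listing keyed by folder path.
remote = {"/FTPOMAP/20191106/T01": {"atp_001.zip"}}
scanner = FolderScanner(lambda folder: remote[folder])
files = scanner.claim_updates("/FTPOMAP/20191106/T01")
print(files)  # a download task would fetch these files here
scanner.release("/FTPOMAP/20191106/T01", files)
```

A second scanning thread that calls `claim_updates` on a locked folder receives an empty list, so the same update is not downloaded twice.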

A data analysis step: the vehicle-mounted log data obtained in the data acquisition step is decompressed and parsed according to the message parsing rules to obtain vehicle-mounted log data classified by system module. The parsing results are packaged, per train, into Topic messages in JSON format and published to a server (broker) of the Kafka cluster of the data analysis module for storage. During log parsing, the position and change of the timestamps in the log data are checked, which makes it possible to determine reliably whether data packets have been lost and caused a parsing abnormality. In other words, after a log file has been downloaded successfully, it is parsed by the data analysis module and the parsing result is sent to a broker of the Kafka cluster in the data analysis module for persistence. The broker is where the message data published by producers is stored in the Kafka cluster; each type of message is a topic, and a message is deleted automatically once it has been read by a consumer.
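
The timestamp check and the per-train JSON packaging can be sketched as below. The 100 ms nominal period follows from the 10 packets per second mentioned in the Background; the tolerance, the topic naming scheme and the record layout are assumptions, and no real Kafka client is used here.

```python
import json


def find_gaps(timestamps_ms, period_ms=100, tolerance_ms=20):
    """Return (previous, current) timestamp pairs that suggest lost packets.

    A pair is reported when the spacing exceeds the nominal period plus a
    tolerance, or when the timestamp does not advance at all.
    """
    gaps = []
    for prev, cur in zip(timestamps_ms, timestamps_ms[1:]):
        delta = cur - prev
        if delta <= 0 or delta > period_ms + tolerance_ms:
            gaps.append((prev, cur))
    return gaps


def to_topic_message(train_id, subsystem, records):
    """Package parsed records for one train as a (topic, JSON payload) pair."""
    topic = f"{subsystem}_log_{train_id}"  # hypothetical topic naming scheme
    payload = json.dumps({"train": train_id, "records": records}, sort_keys=True)
    return topic, payload


stamps = [0, 100, 200, 500, 600]
print(find_gaps(stamps))
print(to_topic_message("T01", "ATP", [{"t": 0, "odometry": "000011111"}]))
```

In a deployment, the returned topic and payload would be handed to a Kafka producer; that call is omitted here because the client library and its configuration are not described in the text.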

Parsing the log data according to the message parsing rules addresses the loss of storage efficiency caused by an excessive number of data fields: the fields in the vehicle-mounted log data collected to the local node that are consecutive and belong to the same system module are classified, all fields of the same category are merged, in order, into one large field, and the corresponding data is merged into one data block.

A data warehousing step: a vehicle-mounted log data processing task is started, and the GPSS in the data storage module, acting as a consumer of the Kafka cluster, monitors the Topic messages stored in the Kafka cluster. When a Topic message is updated, the gpfdist program is started and writes the log data, with high concurrency and by way of a readable external table, into the original log table of the corresponding train. gpfdist is the concurrent file distribution program that ships with Greenplum; it allows multiple instances to write to the database quickly at the same time and therefore provides high concurrency.

A data cleaning step: the vehicle-mounted log data loaded in the data warehousing step is stored in the original log table. Following the field-merging rule that the data analysis module applied when parsing the vehicle-mounted log data, the data cleaning module splits each merged large field in the original log table, in order, back into the original fields of the log data, converts the binary value of each field into the corresponding decimal or Boolean value to complete the numerical conversion, and writes the result into an intermediate log table.

A data statistical analysis step: the intermediate log table produced by the data cleaning step is aggregated according to specific train operation events to obtain the key indexes required for operating and maintaining the signal system, and the statistical results are stored in the distributed database.

Compared with the prior art, the technical scheme provided by the invention deploys the data acquisition and analysis tasks on every node in the cluster and synchronizes those tasks to Zookeeper, so that multiple acquisition and analysis tasks are managed in a coordinated way and the performance bottleneck that may occur when a single machine performs the downloading is effectively avoided. The log data is loaded using the gpfdist mechanism provided by Greenplum, so it is loaded directly and concurrently through the computing nodes of the distributed database. The load is balanced across the nodes, highly concurrent loading of log data is achieved, and the time needed to load large volumes of data is effectively shortened.

The invention also provides a field merging and splitting mechanism for the problem that the number of log data fields exceeds the database limit: fields are merged during log parsing and split during data cleaning, which effectively improves storage efficiency and greatly shortens the data loading time. The invention further provides an aggregation algorithm for determining when the various time-series variables change during a train stop. By aggregating on fields such as train number, train position, train speed and stop flag, and computing the maximum or minimum value at the points where the time-series variables change, it accurately obtains the occurrence times of key events such as train stop, train door/platform screen door open command, train door/platform screen door opened, train door/platform screen door close command, train door/platform screen door closed, departure indicator lit and departure button pressed, which improves the accuracy of the statistical data to a certain extent.
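
The idea of deriving event times by taking minimum or maximum values over grouped rows can be sketched as follows. The row layout, the field names and the restriction to a single stop per train are simplifying assumptions made for illustration; the patent does not give the actual query or schema.

```python
from collections import defaultdict


def stop_event_times(rows):
    """Derive event times for one stop per train from cleaned log rows.

    rows: dicts with keys time, train, speed, stopped, door_open.
    For rows where the train is stopped, the earliest time is taken as the
    stop time, and the earliest/latest times with door_open set are taken
    as the door-opened time and the last time the door was seen open.
    """
    events = defaultdict(dict)
    for row in rows:
        if not (row["stopped"] and row["speed"] == 0):
            continue
        ev = events[row["train"]]
        t = row["time"]
        ev["stop_time"] = min(ev.get("stop_time", t), t)
        if row["door_open"]:
            ev["door_opened"] = min(ev.get("door_opened", t), t)
            ev["door_last_open"] = max(ev.get("door_last_open", t), t)
    return dict(events)


# Hypothetical cleaned rows for one train.
sample = [
    {"time": 0, "train": "T01", "speed": 5, "stopped": False, "door_open": False},
    {"time": 1, "train": "T01", "speed": 0, "stopped": True, "door_open": False},
    {"time": 2, "train": "T01", "speed": 0, "stopped": True, "door_open": True},
    {"time": 3, "train": "T01", "speed": 0, "stopped": True, "door_open": True},
    {"time": 4, "train": "T01", "speed": 0, "stopped": True, "door_open": False},
]
print(stop_event_times(sample))
```

In the distributed database the same grouping would more naturally be written as a SQL aggregate with MIN and MAX; the loop above only shows the logic.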

Drawings

The foregoing and following detailed description of the invention will be apparent when read in conjunction with the following drawings, in which:

FIG. 1 is a schematic diagram of a train signal system vehicle-mounted log big data analysis system of the present invention;

FIG. 2 is a logic diagram of a train signal system vehicle log big data analysis method;

FIG. 3 is a schematic diagram of a vehicle-mounted log big data analysis distributed database according to the present invention;

FIG. 4 is a logic diagram of a processing method for analyzing vehicle-mounted logs according to the present invention.

Detailed Description

The technical solutions for achieving the objects of the present invention are further illustrated by the following specific examples, and it should be noted that the technical solutions claimed in the present invention include, but are not limited to, the following examples.

Example 1

As a most basic implementation scheme of the present invention, as shown in fig. 1, this embodiment discloses a train signal system vehicle-mounted log analysis system based on a distributed architecture, which includes a data acquisition module, a data analysis module, a data storage module, a data cleaning module, a data statistical analysis module, and a distributed database.

The data acquisition module and the data analysis module are built on message middleware with a Kafka + Zookeeper architecture and are used for acquiring and parsing the vehicle-mounted log data of the train signal system. To address the loss of storage efficiency caused by an excessive number of data fields, the data analysis module, when parsing the vehicle-mounted log data, classifies the fields in the data collected by the data acquisition module that are consecutive and belong to the same system module. All fields of the same category are merged, in order, into one large field, and the corresponding data is likewise merged into one data block.

The data cleaning module processes the original log table after it has been loaded: following the field-merging rule that the data analysis module applied when parsing the vehicle-mounted log data, it splits each merged large field, in order, back into the original fields of the log data, and converts the binary value of each field into the corresponding decimal or Boolean value to complete the numerical conversion.

The specific train operation events comprise station arrival and stopping, emergency braking (EB), beacon loss and train-ground wireless communication faults. The key train operation and maintenance indexes comprise the station dwell time, the number and average duration of over-limit dwells, the stopping accuracy and the number of out-of-tolerance stops, the number and causes of emergency braking (EB) events, the number of beacon losses, and the number of train-ground wireless communication failures.

The distributed database is a distributed cluster based on Greenplum and provides a distributed data storage and computing platform for the data acquisition module, the data analysis module, the data storage module, the data cleaning module and the data statistical analysis module. That is, the distributed database is responsible for storing and computing on the vehicle-mounted log data. As shown in fig. 3, it comprises 1 management (Master) node, 2 computing (Segment) nodes and 1 standby (Standby) node, and is the basic platform on which the five modules are implemented.

The management node does not store vehicle-mounted log data; it is responsible for parsing SQL, forming distributed tasks, collecting calculation results and managing the other nodes. The computing nodes and the standby node are responsible for storing vehicle-mounted log data and executing distributed tasks. The vehicle-mounted log data is stored in the distributed database using a random distribution policy, which avoids data skew.

As shown in fig. 3, each node is configured as 2 CPUs with 8 cores, a 32GB memory, and 20 SAS hard disks, gigabit network connection is adopted between each node, and 2 Primary instances (Primary) and 2 Mirror instances (Mirror) are deployed on each node, and cross Mirror configuration is performed between each node, so that availability of the distributed cluster is improved.

In addition, the embodiment also discloses a train signal system vehicle log analysis method based on the system, as shown in fig. 2, including the following steps:

A data acquisition step: a multi-threaded log scanning task is first started on each node in the distributed cluster of the distributed database, and the folder of vehicle-mounted log data of each train on the line-network log server is scanned periodically. When a data update is detected in a folder of vehicle-mounted log data, the updated folder is locked, and an FTP download task is created on the node running the log scanning task to fetch the vehicle-mounted log data from the line-network log server to the local node.

A data analysis step: the vehicle-mounted log data fetched to the local node in the data acquisition step is decompressed and parsed according to the message parsing rules to obtain vehicle-mounted log data classified by system module. The parsing results are packaged, per train, into Topic messages in JSON format and published to a server (broker) of the Kafka cluster of the data analysis module for storage. During log parsing, the position and change of the timestamps in the log data are checked, which makes it possible to determine reliably whether data packets have been lost and caused a parsing abnormality. In other words, after a log file has been downloaded successfully, it is parsed by the data analysis module and the parsing result is sent to a broker of the Kafka cluster in the data analysis module for persistence. The broker is where the message data published by producers is stored in the Kafka cluster; each type of message is a topic, and a message is deleted automatically once it has been read by a consumer.

Example 2

As a preferred and specific implementation scheme of the technical solution of the present invention, the present embodiment discloses a train signal system vehicle-mounted log analysis system based on a distributed architecture, as shown in fig. 1, which includes a data acquisition module, a data analysis module, a data storage module, a data cleaning module, a data statistical analysis module, and a distributed database.

The data acquisition module is used for downloading the vehicle-mounted log files: it first downloads each log file by FTP and then stores the log file in the distributed database. The log files are the vehicle-mounted logs of a subway signal system, comprising ATP log files and ATO log files, and are in a binary format.

The data analysis module is used for parsing the log files downloaded into the distributed database, packaging the parsed data into JSON data as required for data storage, persisting the JSON data to a broker of the Kafka cluster, and handling log files that do not meet the requirements. Such log files fall mainly into three types: damaged files, files with missing data packets, and files with abnormal data content. The JSON data is organized according to the mapping fields of the Topic configuration files in the Kafka cluster.

The data storage module is used for reading the JSON data from the broker of the Kafka cluster and writing it into the corresponding original log table in the distributed database.

The data cleaning module is used for cleaning and converting the data in the original log table and storing the cleaned data in the corresponding intermediate data table. Data cleaning includes field splitting, numerical conversion, valid-data screening and so on. The valid-data screening conditions include whether the record comes from the master control end (Master CC-Core) and whether the train is operating on the line.
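
The two screening conditions named above can be expressed as a simple filter. The record keys `is_master` and `in_service` are hypothetical names for the master control end flag and the on-line operation flag; the actual field names are not given in the text.

```python
def is_valid_record(record):
    """Keep a record only if it comes from the master control end
    and the train is operating on the line."""
    return record.get("is_master") is True and record.get("in_service") is True


def screen(records):
    """Return the records that pass valid-data screening, in original order."""
    return [r for r in records if is_valid_record(r)]


# Hypothetical cleaned records.
records = [
    {"train": "T01", "is_master": True, "in_service": True},
    {"train": "T01", "is_master": False, "in_service": True},   # non-master end
    {"train": "T02", "is_master": True, "in_service": False},   # not on the line
]
print(screen(records))
```

Records with a missing flag are rejected, which is a conservative choice made for this sketch.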

The data statistical analysis module is used for performing aggregate analysis on the intermediate data table based on specific train operation events to obtain the required operation and maintenance indexes, and for storing the statistical data in the distributed database. The specific train operation events include stopping, emergency braking (EB), beacon loss, train-ground wireless communication failure, and the like.

The distributed database is a distributed cluster based on Greenplum; it is responsible for storing and computing on all data and is the basic platform on which the five modules are implemented. As shown in fig. 3, the platform consists mainly of 4 nodes: 1 management node, 2 computing nodes and 1 standby node. Each node is configured with 2 CPUs with 8 cores, 32 GB of memory and 20 SAS hard disks, and the nodes are connected by a gigabit network. Each node hosts 2 primary instances (Primary) and 2 mirror instances (Mirror), and the mirrors are cross-configured between nodes to improve the availability of the distributed cluster.

The construction process of the vehicle-mounted log analysis system comprises the following steps.

The first step: a distributed cluster platform based on Greenplum is built, comprising three kinds of nodes: a management node, computing nodes and a standby node, of which the computing nodes and the standby node are data nodes. The Greenplum service is started on the management node.

The second step: a Zookeeper cluster is first built in the platform and then a Kafka cluster, which together form a cluster for coordinated management of distributed tasks, and a corresponding Topic is created in the cluster for the ATO log and the ATP log of each train. The Zookeeper cluster service is started first, followed by the Kafka cluster service.

The third step: a folder for storing vehicle-mounted logs is created for each train on all computing nodes in the platform, with a directory structure consistent with the file directory structure on the network server (/FTPOMAP/date/train number/). A MySQL database is deployed on the management node, and the data acquisition and analysis services are then deployed in the platform and started. Once the data scanning task detects that a vehicle-mounted log file has been uploaded to the network server, an FTP download task is started to download it, and the file is locked to prevent it from being downloaded by other threads.

The fourth step: after the log file is downloaded successfully, a data analysis task is started to analyze the log file, the analyzed data is packaged into a JSON format and sent to a server (broker) of the Kafka cluster for persistence.

The fifth step: after detecting that a new message has been published to the Kafka cluster, the GPSS instance reads the log data stored in the corresponding Topic and persists it to the original log table in the platform by means of gpfdist. Original log tables named ATO_log_[line name]_[train number] and ATP_log_[line name]_[train number] need to be created for the ATO and ATP logs of each train respectively, and because the data volume is large, the original log tables need to be partitioned (Partition) by day.
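
As an illustration of the per-train tables and daily partitions, the sketch below generates Greenplum DDL text from Python. The table naming follows the pattern given above, and random distribution follows the storage policy described earlier; the column names and types are hypothetical placeholders, and the statement is only generated, not executed.

```python
import re
from datetime import date


def raw_log_table_ddl(subsystem, line, train, start, end):
    """Build a CREATE TABLE statement for one train's original log table.

    The table is randomly distributed and range-partitioned by day
    between start (inclusive) and end (exclusive).
    """
    if subsystem not in ("ATO", "ATP"):
        raise ValueError("subsystem must be 'ATO' or 'ATP'")
    for part in (line, train):
        if not re.fullmatch(r"[A-Za-z0-9]+", part):
            raise ValueError(f"unsafe identifier part: {part!r}")
    table = f"{subsystem}_log_{line}_{train}"
    return (
        f"CREATE TABLE {table} (\n"
        "    log_date date NOT NULL,   -- hypothetical partition column\n"
        "    log_time timestamp,       -- hypothetical column\n"
        "    module_block text         -- hypothetical merged large field\n"
        ")\n"
        "DISTRIBUTED RANDOMLY\n"
        "PARTITION BY RANGE (log_date)\n"
        f"(START (date '{start.isoformat()}') INCLUSIVE\n"
        f" END (date '{end.isoformat()}') EXCLUSIVE\n"
        " EVERY (INTERVAL '1 day'));"
    )


print(raw_log_table_ddl("ATP", "L1", "T01", date(2019, 11, 1), date(2019, 12, 1)))
```

Validating the name parts before formatting them into the statement keeps the generated DDL safe when line or train identifiers come from configuration files.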

The sixth step: after the log data has been loaded, the data cleaning module is started to perform field splitting and numerical conversion on the original log table, and the cleaned and converted data is written into an intermediate data table. Intermediate data tables named ATO_log_mid_[line name]_[train number] and ATP_log_mid_[line name]_[train number] need to be created for the ATO and ATP logs of each train respectively.

The seventh step: after the data has been cleaned, the data statistical analysis module performs statistical analysis on it and writes the statistical results into the corresponding statistical table for display and retrieval by the front end.

Further, based on the system, as shown in fig. 2, an analysis method of a train signal system vehicle-mounted log analysis system based on a distributed architecture includes the following steps:

Step 1, data acquisition: first, a multi-thread data acquisition task is started on each node in the distributed cluster, and the log folder of each train on the line network log server is scanned at regular intervals. If an update to the log files of a train is detected, an FTP (File Transfer Protocol) download task is triggered on that node to complete acquisition of the log file, and the name of the downloaded log file is written into a log record table in the MySQL database to prevent repeated downloading.
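The record table that prevents repeated downloading can be sketched as below. The example uses Python's built-in sqlite3 module purely as a stand-in for the MySQL database so that it runs without a server; the table name download_record and the helper names are hypothetical.

```python
import sqlite3


def open_record_db(path=":memory:"):
    """Open the record database and create the download record table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS download_record (file_name TEXT PRIMARY KEY)"
    )
    return conn


def record_if_new(conn, file_name):
    """Record file_name; return True if it was new, False if already recorded."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO download_record (file_name) VALUES (?)", (file_name,)
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

Relying on the primary key rather than a separate lookup keeps the check and the insert in a single statement, so two tasks cannot both register the same file.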

Step 2, data analysis: after the log file is downloaded successfully, a data analysis task is started; the log file is first decompressed and then parsed according to its message protocol, and the parsing result is packaged into a JSON file as required and published to a broker of the Kafka cluster for storage.

A corresponding Topic must be created in advance in the Kafka cluster for the vehicle-mounted logs of each train, and the content of its messages is consistent with the fields in the JSON file. Field merging is performed during data packaging, and the name of each parsed log file is written into an analysis record table in the MySQL database to prevent repeated parsing.
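Field merging during packaging can be illustrated with the sketch below, which groups consecutive parsed fields belonging to the same system module into one large field and serializes the result as JSON. The input shape (module, field name, bit string), the JSON keys and the function names are assumptions for the example; publishing to Kafka is left out because it depends on the client library in use.

```python
import json
from itertools import groupby


def merge_fields(parsed_fields):
    """Merge consecutive fields of the same module into one data block each.

    parsed_fields is an ordered list of (module, field_name, bit_string).
    """
    blocks = []
    for module, group in groupby(parsed_fields, key=lambda f: f[0]):
        items = list(group)
        blocks.append({
            "module": module,
            "fields": [name for _, name, _ in items],
            "data": "".join(bits for _, _, bits in items),
        })
    return blocks


def to_message(train_no, timestamp, parsed_fields):
    """Package one parsed log record as a JSON string for the train's Topic."""
    return json.dumps({
        "train": train_no,
        "time": timestamp,
        "blocks": merge_fields(parsed_fields),
    })
```

itertools.groupby only groups adjacent items, which matches the requirement that merged fields be both consecutive and from the same module.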

Step 3, data warehousing: a log data processing task is started; the GPSS instance of each train consumes the log data of the corresponding Topic in the Kafka cluster in real time and, via gpfdist, writes it with high concurrency into the original log tables of the distributed database on each computing node. A corresponding GPSS instance must be established in advance for each train.

Step 4, data cleaning: after the log data is warehoused, field splitting and numerical conversion are performed on the original log table, useful data fields and valid log records are screened out of the original log according to the actual operating scenario of the train, and the cleaned and converted data is stored in an intermediate data table in the distributed database.

Step 5, data statistical analysis: aggregation operations are performed on the cleaned log data according to specific train operation events to obtain the key indexes that the subway company needs for maintenance decisions, such as train stop time, the number and average duration of out-of-standard train stops, train stop precision and the number of out-of-standard stops, the number and causes of emergency braking (EB), the number of beacon losses and the number of train-ground wireless communication faults; the result data is stored in the distributed database.
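The aggregation in this step can be sketched in a few lines. The event records, their keys ('type', 'reason', 'dwell_s'), the event type labels and the dwell_limit_s threshold are all hypothetical; in the system described here the aggregation runs inside the distributed database rather than in application code.

```python
from collections import Counter
from statistics import mean


def summarize(events, dwell_limit_s):
    """Aggregate cleaned events into a few operation and maintenance indexes."""
    eb_reasons = Counter(e["reason"] for e in events if e["type"] == "EB")
    dwells = [e["dwell_s"] for e in events if e["type"] == "STOP"]
    over = [d for d in dwells if d > dwell_limit_s]
    return {
        "eb_count": sum(eb_reasons.values()),
        "eb_by_reason": dict(eb_reasons),
        "stop_count": len(dwells),
        "over_limit_stop_count": len(over),
        "over_limit_avg_dwell_s": mean(over) if over else None,
        "beacon_loss_count": sum(1 for e in events if e["type"] == "BEACON_LOSS"),
    }
```

Each index is a count or an average over events of one type, which is why the cleaned intermediate table is organized around train operation events.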

Claims

1. A train signal system vehicle-mounted log analysis system based on a distributed architecture, characterized in that: the system comprises a data acquisition module, a data analysis module, a data storage module, a data cleaning module, a data statistical analysis module and a distributed database;

the data storage module is based on a GPSS (Greenplum Stream Server) and Kafka + gpfdist architecture and is used for writing the vehicle-mounted log data parsed by the data analysis module into the distributed database to form an original log table;

the data cleaning module is used for processing the original log table after being put in storage, sequentially splitting the combined large fields into original fields in the log data according to a field combination rule when the data analysis module analyzes the vehicle-mounted log data, and converting binary values corresponding to the fields into decimal or Boolean type data corresponding to the binary values to complete numerical value conversion;

the data statistical analysis module performs aggregate statistics based on specific train operation events, calculates to obtain corresponding key train operation and maintenance indexes, and finally stores the key train operation and maintenance indexes in a distributed database; the data statistical analysis module is used for performing statistical analysis on the cleaned log data to obtain required operation and maintenance indexes, and storing statistical results into a distributed database;

the distributed database is a distributed cluster based on Greenplum, provides a distributed data storage and calculation platform for the data acquisition module, the data analysis module, the data storage module, the data cleaning module and the data statistical analysis module, and comprises a management node, a computing node and a standby node.

2. The train signal system on-board log analysis system based on the distributed architecture of claim 1, wherein: when parsing the vehicle-mounted log data, the data analysis module classifies fields that are consecutive and belong to the same system module in the data acquired by the data acquisition module, merges all fields of the same category, in order, into 1 large field, and merges the corresponding data into one data block.

3. The distributed architecture based train signal system onboard log analysis system of claim 1, wherein: the train specific operation events comprise station entering and stopping, emergency Braking (EB), beacon losing and train-ground wireless communication faults.

4. The distributed architecture based train signal system onboard log analysis system of claim 1 or 3, wherein: the key operation and maintenance indexes of the train comprise train stop time, times and average duration of exceeding-standard train stop, train stop precision and exceeding-standard train stop times, times and reasons of Emergency Braking (EB), beacon loss times and train-ground wireless communication failure times.

5. The train signal system on-board log analysis system based on the distributed architecture of claim 1, wherein: folders for storing vehicle-mounted log data are created on the three nodes of the distributed database; a MySQL database for storing acquisition and analysis records is deployed on the management node; after the MySQL database is started, a data acquisition and analysis service is deployed in the system; and an FTP download is started when vehicle-mounted log data is uploaded to the network server.

6. The train signal system on-board log analysis system based on the distributed architecture of claim 5, wherein: the management node is responsible for SQL analysis, forming distributed tasks, collecting calculation results and managing other nodes; the computing node and the standby node are responsible for storing vehicle-mounted log data and executing distributed tasks; the storage strategy of the vehicle-mounted log data adopts a random distribution mode in a distributed database.

7. The train signal system onboard log analysis system based on a distributed architecture of claim 1, 5 or 6, wherein: each node is configured with 2 CPUs with 8 cores, 32GB memories and 20 SAS hard disks, gigabit network connection is adopted between each node, 2 main instances and 2 mirror instances are deployed on each node, and cross mirror configuration is carried out between each node.

8. The analysis method of the train signal system vehicle-mounted log analysis system based on the distributed architecture as claimed in claim 1, characterized by comprising the following steps:

a data acquisition step: first, a multi-thread log scanning task is started on each node in the distributed cluster of the distributed database, and the folder of vehicle-mounted log data of each train on the line network log server is scanned at regular intervals; when a data update is detected in a folder of vehicle-mounted log data, the updated folder is locked, and an FTP (File Transfer Protocol) download task is created on the node running the log scanning task to complete acquisition of the vehicle-mounted log data from the line network log server to the local node;

a data analysis step: the vehicle-mounted log data acquired by the local node in the data acquisition step is decompressed and parsed according to a message parsing rule to obtain vehicle-mounted log data classified by system module; the parsing results are encapsulated, per train, into Topic messages in JSON format and published to a Kafka cluster server (broker) of the data analysis module for storage; and, during log parsing, the position of the timestamp in the log data and whether the timestamp changes are checked to determine whether data packet loss has occurred that would cause abnormal data parsing;

a data warehousing step, in which a vehicle-mounted log data processing task is started, a GPSS serving as a Kafka cluster consumer in the data warehousing module detects the Topic message stored in the Kafka cluster, and if the Topic message is updated, a gpfdist program is started to write log data into an original log table corresponding to a train in a highly-concurrent manner in a readable external table manner;

a data cleaning step: the vehicle-mounted log data processed by the data warehousing step is stored in the original log table; the data cleaning module splits the merged large fields back into the original fields of the log data, in order, according to the field merging rule applied when the data analysis module parsed the vehicle-mounted log data, converts the binary value of each field into the corresponding decimal or Boolean data to complete numerical conversion, and writes the result into an intermediate log table;

and a data statistical analysis step, namely performing aggregation operation on the intermediate log table after the data cleaning step according to the specific operation event of the train to obtain key indexes required by the operation and maintenance of the signal system, and storing the statistical result into a distributed database.

9. The analysis method of the train signal system vehicle-mounted log analysis system based on the distributed architecture as claimed in claim 8, wherein: the analysis of the log data according to the message analysis rule is to classify fields which are continuous and belong to the same system module in the vehicle-mounted log data collected to the local node, combine all the fields of the same category into 1 large field in sequence, and combine the corresponding data into one data block.