Disclosure of Invention
Aiming at the technical problems existing in the prior art, the invention provides a user identification method, a system and a storage medium based on distributed data processing, and the efficient processing and analysis of user behavior data are realized through Redis distributed storage, a multistage ID mapping mechanism and a real-time data processing technology. The method and the device solve the problems of complexity of user identification management, insufficient data processing efficiency and poor system expansibility in the prior art.
According to a first aspect of the present invention, there is provided a method of user identification based on distributed data processing, comprising the steps of:
receiving user behavior data and performing basic verification on the data;
Constructing a Redis Key of the equipment ID, inquiring and distributing the internal ID of the equipment in batches, and distributing the session ID;
Constructing a Redis Key of the user ID, and inquiring and distributing the user internal ID in batches;
and processing the user attribute, the equipment attribute and the event attribute to complete the standardization and completion of the data structure.
On the basis of the technical scheme, the invention can also make the following improvements.
Optionally, the user behavior data comprises a message object list, wherein each message comprises app_key, original data and user and device information, and the performing basic verification on the data comprises:
And carrying out JSON format verification and analysis on each message, marking the message with failed analysis as invalid, and recording an error log.
Checking the message structure according to a predefined schema, and screening out data with illegal structure;
and inquiring a database according to the app_key in the message, acquiring a corresponding application ID (identity) appId, and complementing the application ID to the message object.
Optionally, the Redis Key of the build device ID includes:
Traversing all effective messages, extracting an application ID and a device ID from each message, and splicing the application ID and the device ID into a Redis Key: "d: appId: { deviceId }";
the Redis Key for all unique device IDs is collected.
Optionally, the batch querying and assigning device internal IDs include:
Batch inquiring Redis, and obtaining the internal device ID corresponding to the Redis Key of each device ID currently, DEVICEINTERNALID;
For unassigned device ID Key, obtaining the current maximum device ID Key of the application ID as ID d: { appId }, self-increasing and assigning a new ID, writing in a Redis: hash structure, wherein field is the device ID, value is the new ID, and simultaneously self-increasing a counter.
And maintaining Redis keys of all the device IDs and the corresponding device internal IDs in a local Map.
Optionally, the constructing the rediskey of the user ID includes:
traversing all message event attributes, extracting user-defined IDs (userId), and splicing the user-defined IDs (userId) into Redis Key: "u: appId: { userId }";
the Redis Key of all unique user IDs is collected.
Optionally, the batch querying and assigning the user internal ID includes:
Obtaining the current maximum user internal ID of the application ID for the Redis Key of the unassigned user ID, self-increasing and assigning a new ID, and writing in the Redis;
And maintaining Redis Key of all user IDs and the corresponding user internal IDs in a local Map.
Optionally, the global ID allocation and mapping maintenance includes:
The opportunity to assign new globalId is to assign new globalId to "user internal ID" and "device internal ID" only if there is no mapping in Redis for both "user internal ID" and "device internal ID";
Priority merge to the existing globalId, new creation is only performed when none exists, and the fact that the same user or the same equipment always belongs to the unique globalId under the same application is ensured;
the global ID is written into the message structure, and the multi-level mapping relation of Redis is synchronously updated.
Optionally, the user attribute processing includes:
Traversing all the messages, and extracting custom attributes aiming at the data item with the type of usr;
Judging the attribute type, checking the attribute name length, and filtering illegal or ultra-long attributes;
Each legitimate attribute is assigned a unique attribute ID and type, looked up/registered through a cache or database, and written into a message structure.
Optionally, the device attribute processing includes:
extracting custom device attributes for a data item of type "pl";
judging the attribute type and the checking length, and filtering illegal or overlength attributes;
assigning unique attribute ID and type to each legal equipment attribute, searching/registering through a cache or a database, and writing in a message structure;
the event attribute processing includes:
extracting custom event attributes aiming at various event data items with types of 'evt' and the like;
judging the attribute type and the checking length, and filtering illegal or overlength attributes;
Each legal event attribute is assigned a unique attribute ID and type, looked up/registered through a cache or database, and written into a message structure.
According to a second aspect of the present invention there is provided a distributed data processing based subscriber identity system comprising:
The user data acquisition module is used for receiving user behavior data and performing basic verification on the data;
the device ID and session ID distribution module is used for constructing a Redis Key of the device ID and inquiring and distributing the internal ID of the device in batches;
The user ID and global ID distribution module is used for constructing Redis Key of the user ID, inquiring and distributing the user internal ID in batches, and carrying out global ID distribution and mapping maintenance;
and the attribute processing and data structure complementing module is used for processing the user attribute, the equipment attribute and the event attribute and completing the standardization and the complementation of the data structure.
According to a third aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a user identification method based on distributed data processing.
The invention has the technical effects and advantages that:
The invention provides a user identification method, a system and a storage medium based on distributed data processing, which effectively solve the problems of user identification and association in a multi-platform multi-device environment by maintaining a plurality of mapping relations among device IDs, user IDs and global IDs, and provide a complete and unified data basis for user behavior analysis. By adopting a distributed computing framework based on Spark and Kafka and combining with Redis high-performance cache, the high-throughput and low-delay data processing capability is realized, and the system processing efficiency and instantaneity are remarkably improved. Through the modularized design and the Kafka theme distribution mechanism, the expandability and flexibility of the system are improved, the system can adapt to the change of requirements of different business scenes, and the sustainable development of business is supported. And a partitioning strategy based on keys is adopted in the data distribution link, so that the data of the same user are ensured to be sent to the same partition, the data consistency guarantee is provided for the subsequent user behavior analysis, and the accuracy of analysis results is improved.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that the current user behavior analysis system has the following challenges:
1. The data consistency is that the mapping relation of the equipment ID, the user ID and the global ID is difficult to ensure consistency in a distributed environment;
2. real-time requirements, namely supporting high concurrency real-time data processing;
3. data quality, namely ensuring the accuracy of data cleaning and conversion;
4. system expansibility, namely processing of large-scale user behavior data needs to be supported;
5. multidimensional analysis, namely supporting multidimensional data analysis of user attributes, equipment attributes, event attributes and the like.
Based on the defects in the background art, the embodiment of the invention provides a user identification method based on distributed data processing, and particularly as shown in fig. 1, the method comprises the following steps:
s1, receiving user behavior data, and performing basic verification on the behavior data;
the performing basic verification on the data specifically includes:
The system receives user behavior data in batches, such as a message object list, wherein each message comprises app_key, original data, user and equipment information;
And carrying out JSON format verification and analysis on each message, marking the message with failed analysis as invalid, and recording an error log.
Checking the message structure according to a predefined schema, and screening out data with illegal structure;
and inquiring a database according to the app_key in the message, acquiring a corresponding application ID (identity) appId, and complementing the application ID to the message object.
S2, constructing a Redis Key of the equipment ID, and inquiring and distributing the internal ID of the equipment in batches;
REDIS KEYKV is the Key of storage component Redis. The Redis Key construction of the device ID specifically comprises:
traversing all valid messages, extracting an application ID and a device identifier deviceId from each message, and splicing the application ID and the device identifier to form a Redis Key in a manner of'd: appId: { deviceId }';
Collecting Redis Key of all unique device IDs facilitates subsequent batch operations.
The batch querying and assigning device internal IDs include:
Batch inquiring Redis, obtaining the current corresponding internal device ID of Redis Key of each device ID (DEVICEINTERNALID), obtaining the current maximum device internal ID of the application ID (Key is ID: d: { appId }), self-increasing and distributing new ID, writing Redis (Hash structure, field is deviceId, value is new ID), and self-increasing counter.
And maintaining Redis keys of all the device IDs and the corresponding device internal IDs in a local Map.
Write-back message structure
All valid messages are traversed and the device internal ID is obtained from the local Map according to "d: appId: { deviceId }".
The device internal ID is written into the user information field of the message and synchronously written into the attribute field of each event in the data list in the message.
S3, constructing a Redis Key of the user ID, inquiring and distributing the user internal ID in batches, and carrying out global ID distribution and mapping maintenance;
The Redis Key construction of the user ID specifically comprises:
traversing all message event attributes, extracting user-defined IDs (userId), and splicing the user-defined IDs (userId) into Redis Key: "u: appId: { userId }";
And collecting Redis keys of all unique user IDs, so that subsequent batch operation is facilitated.
Batch querying and assigning user internal IDs includes:
Batch inquiring Redis, obtaining the current corresponding user internal ID of Redis Key of each user ID userInternalId, obtaining the current maximum user internal ID of the application ID (Redis Key is ID: u: $ { appId }), self-increasing and distributing new ID, writing in Redis (Hash structure, field is userId, value is new ID), and self-increasing counter.
And maintaining Redis Key of all user IDs and the corresponding user internal IDs in a local Map.
Write-back message structure
All valid messages are traversed and the user internal ID is written into the event attribute field.
Global ID allocation and mapping maintenance includes:
The following mapping relationship is mainly maintained in Redis:
Device internal ID→global ID (dz: $ { appId }: $ { DEVICEINTERNALID } → globalId);
user internal ID→global ID (uz: $ { appId }: $ { userInternalId } → globalId);
global id→user internal ID (zu: $ { appId }: $ { globalId } → userInternalId);
The opportunity to assign a new globalId is that a new globalId is assigned to the combination only if no mapping exists in Redis for both the "user internal ID" and the "device internal ID".
The attribution of new globalId, the newly allocated globalId will simultaneously establish the following mappings:
device internal ID → globalId
User internal ID→ globalId (if user ID is present)
GlobalId →user internal ID (if there is user ID)
The merging principle is that the merging is preferentially conducted to the existing globalId, and new creation is conducted only when none exists, so that the fact that the same user or the same equipment always belongs to the unique globalId under the same application is ensured. globalId's allocation strictly guarantees the uniqueness and incrementation of the same user-device merger under the same application.
And writing the global ID into a message structure, synchronously updating the multi-level mapping relation of Redis, and ensuring the uniqueness and consistency of the equipment, the user and the global ID.
And S4, processing the user attribute, the equipment attribute and the event attribute to complete standardization and completion of the data structure.
The user attribute processing includes:
traversing all the messages, and extracting the custom attribute aiming at the data item with the type of usr.
Judging the attribute type (such as character string, numerical value, boolean, etc.), checking the length of the attribute name, and filtering illegal or ultra-long attributes.
Each legitimate attribute is assigned a unique attribute ID and type, looked up/registered through a cache or database, and written into a message structure.
The device attribute processing includes:
custom device attributes are extracted for data items of type "pl".
Judging the attribute type and the check length, and filtering illegal or ultra-long attributes.
Each legitimate device attribute is assigned a unique attribute ID and type, looked up/registered through a cache or database, and written into a message structure.
The event attribute processing includes:
For various event data items with the type of evt and the like, the custom event attribute is extracted.
Judging the attribute type and the check length, and filtering illegal or ultra-long attributes.
Each legal event attribute is assigned a unique attribute ID and type, looked up/registered through a cache or database, and written into a message structure.
Data structure standardization and complementation:
And organizing information such as equipment, users, events and the like into a standardized data structure, and complementing all IDs and attribute information.
The data collection and enqueuing application buried point SDK collects user behavior data including event types, event attributes, equipment information, user information and the like, and sends the data to the Kafka message queue. The message queue uses behavior _data as a subject name and is responsible for storing buried point data from each application.
Initializing consumer system initialization SPARK STREAMING context and Kafka consumer, setting relevant parameters, creating direct current receiving Kafka message. The main parameters include consumer group ID, auto-commit configuration, maximum message size, etc.
Message consumption and processing SPARK STREAMING consumes Kafka messages in batches per second, encapsulates the messages into objects, contains information such as topics, partitions, offsets, key values and the like, and then delivers the objects to a real-time processing program for processing.
The batch message processing real-time handler receives the list of messages and processes each message according to the following steps.
The JSON parsing checks whether the message is in a legal JSON format, parses the message content, and stores the parsing result in the data field of the message object.
Data verification verifies whether the data structure meets the basic requirements according to a predefined schema, and unsatisfactory messages are marked as invalid.
The application identification queries the database according to the application key (app_key) in the message, obtains the corresponding application ID, and stores in the appId field of the message object.
The device ID is generated to generate a device ID for the message;
The format is "d: { appId }: { DEVICEIDENTIFIER }, where DEVICEIDENTIFIER is from the device identifier in the message.
Session management adds session ID and UUID for a particular type of event (evt, ss, se, mkt, abp) for tracking user sessions.
The user ID is generated to add a user ID to the data in the format "u: { appId }: { customUserId }", where customUserId is from a user-defined identifier in the message.
Global ID generation generates a global ID for a message, processing logic is as follows:
extracting a device ID and a user ID from the message;
Constructing Redis query keys, device appId, deviceId and user appId, userId;
batch inquiring Redis to obtain global ID mapping corresponding to the equipment ID and the user ID;
the global ID is determined according to the following rules:
If a user ID exists and a corresponding global ID already exists, the global ID is used, if a user ID does not exist but a device ID already exists a corresponding global ID is used, and if none exists, a new global ID is generated (by incrementing the counter of the application: ID: { appId }).
Updating the mapping relation in Redis, expressed as:
mapping of device ID to Global ID device: { appId }
Mapping of user ID to Global ID user: { appId }
Mapping of Global ID to user ID IDs: { appId }
The global ID is added to the message attributes.
The user identification system is constructed by the following three levels of user identification systems, including:
The device ID ($system_did) is generated based on the device unique identifier in the format "d: { appId }: { DEVICEIDENTIFIER }" for the device.
User ID ($system_uid) is generated based on the user-defined identifier in the format "u: { appId: { customUserId }".
Global ID ($system_id), a globally unique identifier generated by the system, for associating the device ID with the user ID.
User attribute processing processes user-related custom attributes and generates unique identifiers for the user attributes.
The device attributes process information associated with the device and generate a unique identifier for the device attributes.
The event attribute processing processes the custom event and generates a unique identifier for the event attribute.
Data structure construction a data distribution structure is constructed, comprising:
organizing device information into a device data structure;
organizing user information into a user data structure;
organizing event information into an event data structure;
Integrating global ID information and establishing association between data;
Data distribution distributes the processed data to different Kafka topics:
transmitting the complete data to a system_total theme;
Send data to the system_total_random topic with appId _ systemId as a key;
Transmitting the user data to a system_user theme;
Transmitting the device data to a system_device theme;
sending event data to a system_event topic;
the data transmission adopts an asynchronous mode, the network transmission overhead is reduced by using a Snappy compression algorithm, and the data of the same user is ensured to be transmitted to the same partition through the partition strategy of the key.
Through the detailed implementation mode, the invention realizes the unique identification and behavior association of the user in the multi-platform multi-device environment, provides high-efficiency, reliable and extensible data processing capability, and provides a solid data base for user behavior analysis.
In summary, the method for identifying the user based on the distributed data processing according to the embodiment of the present invention implements unique identification and behavior association of the user in a multi-platform and multi-device environment through a distributed data processing technology, and provides efficient data processing and distribution capabilities. The system builds a real-time data processing frame based on SPARK STREAMING and Kafka, uses Redis as a high-performance cache to store a user identification mapping relation, and realizes generation, management and association of user identifications through a multi-level identification system (equipment ID, user ID and global ID).
According to a second aspect of the present invention there is provided a distributed data processing based subscriber identity system comprising:
The user data acquisition module is used for receiving user behavior data and performing basic verification on the data;
the device ID and session ID distribution module is used for constructing a Redis Key of the device ID and inquiring and distributing the internal ID of the device in batches;
The user ID and global ID distribution module is used for constructing Redis Key of the user ID, inquiring and distributing the user internal ID in batches, and carrying out global ID distribution and mapping maintenance;
and the attribute processing and data structure complementing module is used for processing the user attribute, the equipment attribute and the event attribute and completing the standardization and the complementation of the data structure.
It may be understood that the user identification system based on distributed data processing provided by the present invention corresponds to the user identification method based on distributed data processing provided in the foregoing embodiments, and relevant technical features of the user identification system based on distributed data processing may refer to relevant technical features of the user identification method based on distributed data processing, which are not described herein again.
In yet another aspect, the present invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, is implemented to perform the steps of implementing the distributed data processing based user identification method provided by the methods above.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.
It should be noted that the foregoing description is only a preferred embodiment of the present invention, and although the present invention has been described in detail with reference to the foregoing embodiments, it should be understood that modifications, equivalents, improvements and modifications to the technical solution described in the foregoing embodiments may occur to those skilled in the art, and all modifications, equivalents, and improvements are intended to be included within the spirit and principle of the present invention.