CN120578695A

CN120578695A - User identification method, system and storage medium based on distributed data processing

Info

Publication number: CN120578695A
Application number: CN202510718267.2A
Authority: CN
Inventors: 张桃龙; 彭磊; 徐晋毅
Original assignee: Wuhan Zhongbang Bank Co Ltd
Current assignee: Wuhan Zhongbang Bank Co Ltd
Priority date: 2025-05-30
Filing date: 2025-05-30
Publication date: 2025-09-02

Abstract

The present invention provides a user identification method, system and storage medium based on distributed data processing, and relates to the fields of big data analysis and user behavior tracking. Its main purpose is to solve the problems of generating, managing and associating unique user identifiers in a multi-platform, multi-channel environment, together with the resulting problems of insufficient data processing efficiency and poor real-time performance. The main scheme of the invention includes: a distributed data stream processing framework based on Spark and Kafka, which achieves efficient parsing, processing and distribution of user behavior data; a multi-level user identification system, covering the generation and association of device ID, user ID and global ID; a Redis high-performance cache, which provides fast lookup of identifier mapping relationships; and the Kafka topic distribution mechanism, which efficiently distributes processed data to different business modules. This solution effectively addresses the complexity of identifier management in a large-scale user data environment, improves data processing efficiency and real-time performance, and provides a reliable data foundation for user behavior analysis.

Description

User identification method, system and storage medium based on distributed data processing

Technical Field

The present invention relates to the technical field of big data analysis and user behavior tracking, and more particularly, to a method, a system and a storage medium for user identification based on distributed data processing.

Background

With the rapid development of the mobile internet and of various applications, a user may use the same application on multiple platforms and multiple devices at the same time, so that user behavior data is scattered across different channels. Identifying and correlating the behavior of the same user on different devices, and generating a unified user view, is an important challenge in the field of data analysis. Traditional user identification methods based on cookies or device IDs struggle with such complex scenarios, and show obvious limitations in applications such as cross-device behavior analysis.

Meanwhile, as the business scale expands, the volume of user behavior data grows exponentially, which places higher demands on the throughput and real-time performance of the data processing system. Traditional single-machine processing or simple distributed architectures often suffer from low processing efficiency and poor system stability when facing massive data, and cannot meet the business requirement for real-time analysis of user behavior data.

In addition, the user identification system needs to support the dynamic development of the business and adapt flexibly to the changing requirements of different applications, which places higher demands on the scalability of the system. Most existing user identification systems are designed for specific scenarios, lack sufficient generality and scalability, and have difficulty supporting diversified business requirements.

Therefore, designing an efficient, reliable and scalable user identification method that achieves unique identification and behavior association of users in a multi-device, multi-platform environment, while providing real-time data processing capability, is a problem to be solved in the current data analysis field.

Disclosure of Invention

In view of the technical problems in the prior art, the invention provides a user identification method, system and storage medium based on distributed data processing, which achieve efficient processing and analysis of user behavior data through Redis distributed storage, a multi-level ID mapping mechanism and real-time data processing technology. The invention thereby solves the problems of complex user identification management, insufficient data processing efficiency and poor system scalability in the prior art.

According to a first aspect of the present invention, there is provided a method of user identification based on distributed data processing, comprising the steps of:

receiving user behavior data and performing basic verification on the data;

constructing a Redis key for the device ID, batch querying and assigning device internal IDs, and assigning a session ID;

constructing a Redis key for the user ID, and batch querying and assigning user internal IDs;

and processing the user attributes, device attributes and event attributes to complete the standardization and completion of the data structure.

On the basis of the technical scheme, the invention can also make the following improvements.

Optionally, the user behavior data comprises a list of message objects, each message containing an app_key, raw data, and user and device information, and performing basic verification on the data comprises:

performing JSON format verification and parsing on each message, marking messages that fail parsing as invalid, and recording an error log;

checking the message structure against a predefined schema and filtering out data whose structure is illegal;

and querying a database according to the app_key in the message, obtaining the corresponding application ID (appId), and completing it in the message object.
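The three checks above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation described by the invention: the field names in `REQUIRED_FIELDS`, the `APP_IDS` table (standing in for the database query), the shape of the returned objects, and the choice to treat an unknown app_key as invalid are all hypothetical.

```python
import json
import logging

logger = logging.getLogger("verification")

# Hypothetical schema: top-level fields every message must carry.
REQUIRED_FIELDS = {"app_key": str, "device_id": str, "data": list}

# Hypothetical stand-in for the database table that maps app_key -> appId.
APP_IDS = {"key-abc": 1001}


def verify_messages(raw_messages):
    """Return one dict per input message, each flagged valid or invalid."""
    results = []
    for raw in raw_messages:
        msg = {"raw": raw, "valid": False}
        results.append(msg)
        # 1. JSON format verification and parsing; failures are logged.
        try:
            body = json.loads(raw)
        except json.JSONDecodeError as exc:
            logger.error("JSON parsing failed: %s", exc)
            continue
        # 2. Structure check against the predefined schema.
        if not isinstance(body, dict) or any(
            not isinstance(body.get(name), kind)
            for name, kind in REQUIRED_FIELDS.items()
        ):
            continue
        # 3. Look up appId by app_key and complete it on the message object.
        #    Treating an unknown app_key as invalid is an assumption.
        app_id = APP_IDS.get(body["app_key"])
        if app_id is None:
            continue
        msg.update(valid=True, body=body, appId=app_id)
    return results
```

Invalid messages stay in the returned list with `valid` set to `False`, so later steps can skip them without losing the original payload.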

Optionally, constructing the Redis key of the device ID comprises:

traversing all valid messages, extracting the application ID and the device ID from each message, and concatenating them into a Redis key: "d:appId:{deviceId}";

and collecting the Redis keys of all unique device IDs.

Optionally, the batch querying and assigning of device internal IDs comprises:

batch querying Redis to obtain the device internal ID (deviceInternalId) currently corresponding to the Redis key of each device ID;

for each unassigned device ID key, obtaining the current maximum device internal ID of the application ID (the key is id:d:{appId}), incrementing it to assign a new ID, and writing the new ID to Redis (a Hash structure in which the field is the device ID and the value is the new ID), while also incrementing the counter;

and maintaining the Redis keys of all device IDs and the corresponding device internal IDs in a local Map.

Optionally, constructing the Redis key of the user ID comprises:

traversing the event attributes of all messages, extracting the user-defined ID (userId), and concatenating it into a Redis key: "u:appId:{userId}";

and collecting the Redis keys of all unique user IDs.

Optionally, the batch querying and assigning of user internal IDs comprises:

for the Redis key of each unassigned user ID, obtaining the current maximum user internal ID of the application ID, incrementing it to assign a new ID, and writing the new ID to Redis;

and maintaining the Redis keys of all user IDs and the corresponding user internal IDs in a local Map.

Optionally, the global ID allocation and mapping maintenance comprises:

timing of assigning a new globalId: a new globalId is assigned to the "user internal ID" and the "device internal ID" only when neither of them has a mapping in Redis;

merging preferentially into an existing globalId and creating a new one only when none exists, which ensures that the same user or the same device always belongs to a unique globalId under the same application;

and writing the global ID into the message structure and synchronously updating the multi-level mapping relationships in Redis.

Optionally, the user attribute processing comprises:

traversing all messages and extracting custom attributes from data items of type "usr";

determining the attribute type, checking the length of the attribute name, and filtering out illegal or overlong attributes;

and assigning each legal attribute a unique attribute ID and type, looking it up or registering it through a cache or database, and writing it into the message structure.

Optionally, the device attribute processing comprises:

extracting custom device attributes from data items of type "pl";

determining the attribute type, checking the length, and filtering out illegal or overlong attributes;

assigning each legal device attribute a unique attribute ID and type, looking it up or registering it through a cache or database, and writing it into the message structure;

the event attribute processing comprises:

extracting custom event attributes from event data items of types such as "evt";

and assigning each legal event attribute a unique attribute ID and type, looking it up or registering it through a cache or database, and writing it into the message structure.
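The attribute handling for user, device and event items follows the same pattern, which might be sketched as below. The length limit `MAX_NAME_LENGTH`, the three type names, the field names in the output, and the in-memory `AttributeRegistry` (standing in for the cache or database) are assumptions made for illustration; the source does not specify them.

```python
MAX_NAME_LENGTH = 64  # Assumed limit; the source does not state a value.


def attribute_type(value):
    """Classify a value as 'boolean', 'number' or 'string'; None if unsupported."""
    # bool must be tested first because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (int, float)):
        return "number"
    if isinstance(value, str):
        return "string"
    return None


class AttributeRegistry:
    """In-memory stand-in for the cache/database lookup-or-register step."""

    def __init__(self):
        self._ids = {}

    def lookup_or_register(self, app_id, item_type, name, kind):
        key = (app_id, item_type, name)
        if key not in self._ids:
            # First sighting: register the next attribute ID with its type.
            self._ids[key] = (len(self._ids) + 1, kind)
        return self._ids[key]


def process_attributes(registry, app_id, item_type, attributes):
    """Filter illegal or overlong attributes and attach an attribute ID and type."""
    processed = {}
    for name, value in attributes.items():
        kind = attribute_type(value)
        if kind is None or not name or len(name) > MAX_NAME_LENGTH:
            continue  # Unsupported type, empty name or overlong name.
        attr_id, registered_kind = registry.lookup_or_register(
            app_id, item_type, name, kind
        )
        processed[name] = {"id": attr_id, "type": registered_kind, "value": value}
    return processed
```

Calling `process_attributes(registry, app_id, "usr", attrs)`, then the same function with `"pl"` or `"evt"`, would cover the three item types with one registry, so an attribute name seen again receives the ID it was given the first time.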

According to a second aspect of the present invention, there is provided a user identification system based on distributed data processing, comprising:

a user data acquisition module, used for receiving user behavior data and performing basic verification on the data;

a device ID and session ID allocation module, used for constructing the Redis key of the device ID and batch querying and assigning device internal IDs;

a user ID and global ID allocation module, used for constructing the Redis key of the user ID, batch querying and assigning user internal IDs, and performing global ID allocation and mapping maintenance;

and an attribute processing and data structure completion module, used for processing the user attributes, device attributes and event attributes and completing the standardization and completion of the data structure.

According to a third aspect of the present invention, there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a user identification method based on distributed data processing.

The invention has the technical effects and advantages that:

The invention provides a user identification method, system and storage medium based on distributed data processing. By maintaining multiple mapping relationships among device IDs, user IDs and global IDs, it effectively solves the problem of identifying and associating users in a multi-platform, multi-device environment, and provides a complete and unified data foundation for user behavior analysis. By adopting a distributed computing framework based on Spark and Kafka, combined with the Redis high-performance cache, it achieves high-throughput, low-latency data processing and significantly improves processing efficiency and real-time performance. The modular design and the Kafka topic distribution mechanism improve the scalability and flexibility of the system, allowing it to adapt to the changing requirements of different business scenarios and to support sustainable business growth. In the data distribution stage, a key-based partitioning strategy ensures that data of the same user is sent to the same partition, which provides a data consistency guarantee for subsequent user behavior analysis and improves the accuracy of analysis results.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

Drawings

Fig. 1 is a flowchart of a user identification method based on distributed data processing according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.

It should be noted that the current user behavior analysis system has the following challenges:

1. Data consistency: it is difficult to keep the mapping relationships among device ID, user ID and global ID consistent in a distributed environment;

2. Real-time requirements: high-concurrency, real-time data processing must be supported;

3. Data quality: the accuracy of data cleaning and conversion must be ensured;

4. System scalability: processing of large-scale user behavior data must be supported;

5. Multidimensional analysis: multidimensional analysis of user attributes, device attributes, event attributes and the like must be supported.

In view of the defects described in the background, an embodiment of the present invention provides a user identification method based on distributed data processing. As shown in Fig. 1, the method comprises the following steps:

S1, receiving user behavior data and performing basic verification on the behavior data;

Performing basic verification on the data specifically includes:

The system receives user behavior data in batches, for example as a list of message objects, in which each message contains an app_key, raw data, and user and device information;

S2, constructing a Redis key for the device ID, and batch querying and assigning device internal IDs;

The Redis key is the key used in Redis, the key-value (KV) storage component. Constructing the Redis key of the device ID specifically comprises:

traversing all valid messages, extracting the application ID and the device identifier deviceId from each message, and concatenating them into a Redis key of the form "d:appId:{deviceId}";

collecting the Redis keys of all unique device IDs to facilitate subsequent batch operations.
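As a minimal sketch of this key construction step, assuming each message is a dictionary with hypothetical `valid`, `appId` and `deviceId` fields:

```python
def device_redis_key(app_id, device_id):
    """Build the device key from the application ID and the device identifier."""
    return f"d:{app_id}:{device_id}"


def collect_device_keys(messages):
    """Return the unique device keys of all valid messages, in first-seen order."""
    keys = {}
    for msg in messages:
        if msg.get("valid"):
            # A dict keeps insertion order and drops duplicates.
            keys.setdefault(device_redis_key(msg["appId"], msg["deviceId"]), None)
    return list(keys)
```

Deduplicating before the lookup keeps the later batch query to one field per distinct device, however many messages that device sent in the batch.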

Batch querying and assigning device internal IDs includes:

Batch querying Redis to obtain the device internal ID (deviceInternalId) currently corresponding to the Redis key of each device ID; for keys that have no internal ID yet, obtaining the current maximum device internal ID of the application ID (the key is id:d:{appId}), incrementing it to assign a new ID, writing it to Redis (a Hash structure in which the field is deviceId and the value is the new ID), and incrementing the counter.
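The lookup-then-allocate flow can be illustrated with a small in-memory stand-in for Redis. `FakeRedis` below only mimics the behavior of the Redis `HMGET`, `HSET` and `INCR` commands so the sketch runs without a server; the hash name `d:{appId}` is an assumption, since the source gives the counter key (id:d:{appId}) and the Hash layout but not the name of the Hash itself.

```python
from collections import defaultdict


class FakeRedis:
    """Tiny in-memory stand-in mimicking the Redis HMGET, HSET and INCR commands."""

    def __init__(self):
        self.hashes = defaultdict(dict)
        self.counters = defaultdict(int)

    def hmget(self, name, fields):
        return [self.hashes[name].get(f) for f in fields]

    def hset(self, name, field, value):
        self.hashes[name][field] = value

    def incr(self, name):
        self.counters[name] += 1
        return self.counters[name]


def assign_device_internal_ids(store, app_id, device_ids):
    """Return a local map {redis_key: internal_id}, allocating IDs where missing."""
    hash_name = f"d:{app_id}"        # Assumed name of the per-application Hash.
    counter_key = f"id:d:{app_id}"   # Counter holding the current maximum ID.
    unique_ids = list(dict.fromkeys(device_ids))
    found = store.hmget(hash_name, unique_ids)  # One batch lookup.
    local_map = {}
    for device_id, internal_id in zip(unique_ids, found):
        if internal_id is None:
            internal_id = store.incr(counter_key)  # Increment and take the new ID.
            store.hset(hash_name, device_id, internal_id)
        local_map[f"d:{app_id}:{device_id}"] = internal_id
    return local_map
```

With several workers writing at once, the check and the write would need to be atomic against a real Redis server, for example with `HSETNX` or a server-side script; the sketch ignores concurrency.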

Writing back to the message structure:

All valid messages are traversed, and the device internal ID is obtained from the local Map according to "d:appId:{deviceId}".

The device internal ID is written into the user information field of the message and, at the same time, into the attribute field of each event in the message's data list.

S3, constructing a Redis key for the user ID, batch querying and assigning user internal IDs, and performing global ID allocation and mapping maintenance;

Constructing the Redis key of the user ID specifically comprises:

collecting the Redis keys of all unique user IDs to facilitate subsequent batch operations.

Batch querying and assigning user internal IDs includes:

Batch querying Redis to obtain the user internal ID (userInternalId) currently corresponding to the Redis key of each user ID; for keys that have no internal ID yet, obtaining the current maximum user internal ID of the application ID (the Redis key is id:u:${appId}), incrementing it to assign a new ID, writing it to Redis (a Hash structure in which the field is userId and the value is the new ID), and incrementing the counter.

Writing back to the message structure:

All valid messages are traversed and the user internal ID is written into the event attribute field.

Global ID allocation and mapping maintenance includes:

The following mapping relationships are mainly maintained in Redis:

device internal ID → global ID (dz:${appId}:${deviceInternalId} → globalId);

user internal ID → global ID (uz:${appId}:${userInternalId} → globalId);

global ID → user internal ID (zu:${appId}:${globalId} → userInternalId);

Timing of assigning a new globalId: a new globalId is assigned to the combination only when neither the "user internal ID" nor the "device internal ID" has a mapping in Redis.

Ownership of a new globalId: a newly allocated globalId establishes the following mappings at the same time:

device internal ID → globalId

user internal ID → globalId (if a user ID is present)

globalId → user internal ID (if a user ID is present)

Merging principle: merging into an existing globalId takes priority, and a new one is created only when none exists, which ensures that the same user or the same device always belongs to a unique globalId under the same application. The allocation of globalId strictly guarantees uniqueness and incrementing assignment for the same user-device merge under the same application.

The global ID is written into the message structure and the multi-level mapping relationships in Redis are updated synchronously, which ensures the uniqueness and consistency of the device, user and global IDs.
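The allocation and merging rules can be sketched with a plain dictionary standing in for Redis. The function name, the `counters` dictionary used as the per-application counter, and the choice to leave an already existing mapping untouched are assumptions for illustration; the key prefixes dz, uz and zu follow the text.

```python
def resolve_global_id(store, counters, app_id, device_internal_id, user_internal_id=None):
    """Return the globalId for a device/user pair, creating one only if neither is mapped."""
    dz_key = f"dz:{app_id}:{device_internal_id}"
    uz_key = f"uz:{app_id}:{user_internal_id}" if user_internal_id is not None else None

    device_gid = store.get(dz_key)
    user_gid = store.get(uz_key) if uz_key else None

    # Merge into an existing globalId first; the user's mapping is checked
    # before the device's mapping.
    global_id = user_gid if user_gid is not None else device_gid
    if global_id is None:
        # Neither side is mapped: allocate by incrementing the app's counter.
        counters[app_id] = counters.get(app_id, 0) + 1
        global_id = counters[app_id]

    # Fill in only the mappings that are missing (assumed: existing ones are kept).
    store.setdefault(dz_key, global_id)
    if uz_key:
        store.setdefault(uz_key, global_id)
        store.setdefault(f"zu:{app_id}:{global_id}", user_internal_id)
    return global_id
```

For example, an anonymous device first receives a new globalId; when a user later logs in on that device, the user is merged into the same globalId, and the same user on a second device resolves to it as well.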

S4, processing the user attributes, device attributes and event attributes to complete the standardization and completion of the data structure.

User attribute processing includes:

Traversing all messages and extracting custom attributes from data items of type "usr".

Determining the attribute type (such as string, number or Boolean), checking the length of the attribute name, and filtering out illegal or overlong attributes.

Device attribute processing includes:

Extracting custom device attributes from data items of type "pl".

Determining the attribute type, checking the length, and filtering out illegal or overlong attributes.

Assigning each legal device attribute a unique attribute ID and type, looking it up or registering it through a cache or database, and writing it into the message structure.

Event attribute processing includes:

Extracting custom event attributes from event data items of types such as "evt".

Data structure standardization and completion:

Information about devices, users, events and so on is organized into a standardized data structure, and all IDs and attribute information are completed.

Data collection and enqueuing: the application's event-tracking SDK collects user behavior data, including event types, event attributes, device information and user information, and sends the data to the Kafka message queue. The message queue uses behavior_data as the topic name and is responsible for storing the event-tracking data from each application.

Consumer initialization: the system initializes the Spark Streaming context and the Kafka consumer, sets the relevant parameters, and creates a direct stream to receive Kafka messages. The main parameters include the consumer group ID, the auto-commit configuration, the maximum message size and so on.

Message consumption and processing: Spark Streaming consumes Kafka messages in one-second batches, encapsulates each message into an object containing information such as topic, partition, offset and key/value, and then hands the objects to the real-time processing program.
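The consumer parameters mentioned above might be collected as follows. The keys are standard Kafka consumer configuration names; every value (broker address, group name, size limit, and the choice to disable auto-commit) is an assumption, since the source names the parameters but not their settings.

```python
# Hypothetical values throughout; only the configuration key names are standard.
KAFKA_PARAMS = {
    "bootstrap.servers": "localhost:9092",          # Assumed broker address.
    "group.id": "behavior-data-consumers",          # Consumer group ID.
    "enable.auto.commit": False,                    # Auto-commit configuration.
    "max.partition.fetch.bytes": 10 * 1024 * 1024,  # Cap on data fetched per partition.
}

SOURCE_TOPIC = "behavior_data"  # Topic name given in the text.
BATCH_INTERVAL_SECONDS = 1      # One batch per second, as described in the text.
```

Disabling auto-commit is a common choice when offsets should be committed only after a batch has been processed successfully, but the source does not say which setting is used.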

Batch message processing: the real-time handler receives the list of messages and processes each message according to the following steps.

JSON parsing: check whether the message is in a legal JSON format, parse the message content, and store the parsing result in the data field of the message object.

Data verification: verify, according to a predefined schema, whether the data structure meets the basic requirements; messages that do not are marked as invalid.

Application identification: query the database according to the application key (app_key) in the message, obtain the corresponding application ID, and store it in the appId field of the message object.

Device ID generation: a device ID is generated for the message in the format "d:{appId}:{deviceIdentifier}", where deviceIdentifier comes from the device identifier in the message.

Session management: a session ID and a UUID are added to events of specific types (evt, ss, se, mkt, abp) for tracking user sessions.
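A minimal sketch of the session management step is shown below. The event type list comes from the text; the field names `type`, `attributes`, `session_id` and `uuid`, and the idea of passing the current session ID in as an argument, are assumptions, because the source does not describe how the session ID is derived.

```python
import uuid

# Event types that receive session fields, as listed in the text.
SESSION_EVENT_TYPES = {"evt", "ss", "se", "mkt", "abp"}


def add_session_fields(events, session_id):
    """Attach the session ID and a fresh UUID to events of the listed types."""
    for event in events:
        if event.get("type") in SESSION_EVENT_TYPES:
            attrs = event.setdefault("attributes", {})
            attrs["session_id"] = session_id
            attrs["uuid"] = str(uuid.uuid4())  # Unique per event.
    return events
```

Events of other types, such as "usr" or "pl" items, pass through unchanged.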

User ID generation: a user ID is added to the data in the format "u:{appId}:{customUserId}", where customUserId comes from the user-defined identifier in the message.

Global ID generation: a global ID is generated for the message, with the following processing logic:

extracting the device ID and the user ID from the message;

constructing the Redis query keys for the device (appId, deviceId) and for the user (appId, userId);

batch querying Redis to obtain the global ID mappings corresponding to the device ID and the user ID;

determining the global ID according to the following rules:

if a user ID exists and already has a corresponding global ID, that global ID is used; if no user ID exists but the device ID already has a corresponding global ID, that global ID is used; if neither exists, a new global ID is generated by incrementing the application's counter (id:{appId});

updating the mapping relationships in Redis, expressed as:

mapping of device ID to global ID: device:{appId}

mapping of user ID to global ID: user:{appId}

mapping of global ID to user ID: ids:{appId}

adding the global ID to the message attributes.

The user identification scheme consists of the following three levels of identifiers:

Device ID ($system_did): generated from the device's unique identifier, in the format "d:{appId}:{deviceIdentifier}".

User ID ($system_uid): generated from the user-defined identifier, in the format "u:{appId}:{customUserId}".

Global ID ($system_id): a globally unique identifier generated by the system, used to associate the device ID with the user ID.

User attribute processing: user-related custom attributes are processed and a unique identifier is generated for each user attribute.

Device attribute processing: device-related information is processed and a unique identifier is generated for each device attribute.

Event attribute processing: custom events are processed and a unique identifier is generated for each event attribute.

Data structure construction: a data distribution structure is constructed, which comprises:

organizing device information into a device data structure;

organizing user information into a user data structure;

organizing event information into an event data structure;

integrating the global ID information and establishing associations between the data.
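One possible shape for these structures is sketched below. All field names (`appId`, `globalId`, `deviceId`, `userId`, `deviceAttributes`, `userAttributes`, `events`) are hypothetical; the source describes the grouping into device, user and event structures linked by the global ID, but not the exact layout.

```python
def build_distribution_structures(message):
    """Split one processed message into device, user and event records linked by global ID."""
    # Every record carries the same appId and globalId so they can be joined later.
    link = {"appId": message["appId"], "globalId": message["globalId"]}

    device = {
        **link,
        "deviceId": message["deviceId"],
        "attributes": message.get("deviceAttributes", {}),
    }
    user = None
    if message.get("userId") is not None:
        user = {
            **link,
            "userId": message["userId"],
            "attributes": message.get("userAttributes", {}),
        }
    events = [{**link, **event} for event in message.get("events", [])]
    return {"device": device, "user": user, "events": events}
```

Carrying the global ID on every record is what lets the device, user and event data be associated again after they have been sent to separate destinations.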

Data distribution: the processed data is distributed to different Kafka topics:

sending the complete data to the system_total topic;

sending data to the system_total_random topic with appId_systemId as the key;

sending the user data to the system_user topic;

sending the device data to the system_device topic;

sending the event data to the system_event topic;

Data is sent asynchronously, the Snappy compression algorithm is used to reduce network transmission overhead, and the key-based partitioning strategy ensures that data of the same user is sent to the same partition.
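The routing and key-based partitioning described above can be illustrated without a Kafka client. In this sketch, `partition_for` uses CRC32 purely as a stand-in for a stable hash (Kafka's default partitioner for keyed records uses a different hash function), and keying the per-entity topics with the same key is an assumption; the text specifies the appId_systemId key only for system_total_random.

```python
import json
import zlib


def partition_for(key, num_partitions):
    """Map a key to a partition with a stable hash, so equal keys share a partition."""
    # CRC32 is illustrative only; the principle is hash(key) mod partition count.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def plan_distribution(record):
    """Return (topic, key, payload) tuples for one processed record."""
    key = f"{record['appId']}_{record['globalId']}"  # appId_systemId key from the text.
    sends = [
        ("system_total", None, record),            # Assumed: sent without a key.
        ("system_total_random", key, record),
        ("system_device", key, record["device"]),  # Keying here is an assumption.
    ]
    if record.get("user"):
        sends.append(("system_user", key, record["user"]))
    sends.extend(("system_event", key, event) for event in record.get("events", []))
    return [(topic, k, json.dumps(payload)) for topic, k, payload in sends]
```

Because the partition depends only on the key, all records sharing one appId_systemId key land in the same partition and are consumed in order. In a real producer, Snappy compression corresponds to the standard `compression.type` setting with the value `snappy`.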

Through the implementation described above, the invention achieves unique identification and behavior association of users in a multi-platform, multi-device environment, provides efficient, reliable and scalable data processing capability, and provides a solid data foundation for user behavior analysis.

In summary, the user identification method based on distributed data processing according to the embodiment of the present invention achieves unique identification and behavior association of users in a multi-platform, multi-device environment through distributed data processing technology, and provides efficient data processing and distribution capabilities. The system builds a real-time data processing framework based on Spark Streaming and Kafka, uses Redis as a high-performance cache to store the user identifier mapping relationships, and implements the generation, management and association of user identifiers through a multi-level identification system (device ID, user ID and global ID).

It may be understood that the user identification system based on distributed data processing provided by the present invention corresponds to the user identification method based on distributed data processing provided in the foregoing embodiments, and relevant technical features of the user identification system based on distributed data processing may refer to relevant technical features of the user identification method based on distributed data processing, which are not described herein again.

In yet another aspect, the present invention also provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the user identification method based on distributed data processing provided above.

From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus necessary general hardware platforms, or of course may be implemented by means of hardware. Based on this understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or some parts of the embodiments.

It should be noted that the foregoing describes only a preferred embodiment of the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art may still modify the technical solutions described in those embodiments or make equivalent substitutions and improvements, and all such modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within its scope.

Claims

1. A user identification method based on distributed data processing, characterized in that it comprises the following steps:

Receive user behavior data and perform basic verification on the behavior data;

Build the Redis Key of the device ID, and batch query and assign the internal ID of the device;

Build the Redis key of the user ID, batch query and assign user internal IDs, and perform global ID allocation and mapping maintenance;

Process user attributes, device attributes, and event attributes to complete data structure standardization and completion.

2. The user identification method based on distributed data processing according to claim 1, wherein the user behavior data includes: a list of message objects, each message containing an app_key, raw data, and user and device information; and the basic verification of the behavior data includes:

Perform JSON format verification and parsing on each message. Messages that fail parsing are marked as invalid and an error log is recorded.

Verify the message structure according to the predefined schema and filter out data with illegal structure;

Query the database based on the app_key in the message, obtain the corresponding application ID: appId, and complete it in the message object.

3. The user identification method based on distributed data processing according to claim 1, wherein the Redis Key for constructing the device ID comprises:

Traverse all valid messages, extract the application ID and device ID from each message, and concatenate them into a Redis key: "d:appId:{deviceId}"

Collect the Redis keys of all unique device IDs.

4. The user identification method based on distributed data processing according to claim 1, wherein the batch query and allocation of device internal IDs comprises:

Batch query Redis to obtain the Redis key of each device ID and the current corresponding internal device ID;

For unassigned device ID keys, obtain the current maximum internal device ID for the application ID: the key is id:d:{appId}, automatically assign a new ID, and write it to Redis: Hash structure, with the field being the device ID and the value being the new ID, while also automatically incrementing the counter;

Maintain the Redis keys of all device IDs and their corresponding internal device IDs in a local Map.

5. The user identification method based on distributed data processing according to claim 1, wherein the Redis Key for constructing the user ID comprises:

Traverse all message event attributes, extract the user-defined ID: userId, and concatenate them into a Redis key: "u:appId:{userId}";

Collect the Redis keys of all unique user IDs.

6. The user identification method based on distributed data processing according to claim 1, wherein the batch query and allocation of user internal IDs comprises:

Batch query Redis to obtain the current user internal ID corresponding to the Redis key of each user ID: userInternalId; for the Redis key of unassigned user IDs, obtain the current maximum user internal ID of the application ID, automatically increment and allocate a new ID, and write it to Redis;

Maintain the Redis Key of all user IDs and their corresponding internal user IDs in a local Map.

7. The user identification method based on distributed data processing according to claim 1, wherein the global ID allocation and mapping maintenance comprises:

When to allocate a new globalId: Only when neither the "user internal ID" nor the "device internal ID" is mapped in Redis, a new globalId is allocated for the "user internal ID" and "device internal ID";

Prioritize merging into existing globalIds, and only create a new one if none exist. This ensures that the same user or device always has a unique globalId under the same app.

Write the global ID into the message structure and synchronously update the multi-level mapping relationship of Redis.

8. The user identification method based on distributed data processing according to claim 1, wherein the user attribute processing comprises:

Traverse all messages and extract custom attributes for data items of type "usr";

Determine the attribute type, check the attribute name length, and filter out illegal or overlong attributes;

Assign a unique attribute ID and type to each legal attribute, look up/register it through the cache or database, and write it into the message structure;

The device attribute processing includes:

For data items of type "pl", extract custom device attributes;

Determine attribute type, check length, and filter illegal or overlength attributes;

Assign a unique attribute ID and type to each legal device attribute, look up/register it through cache or database, and write it into the message structure;

The event attribute processing includes:

Extract custom event attributes for event data items of various types such as "evt";

Assign a unique attribute ID and type to each legal event attribute, look up/register it through the cache or database, and write it into the message structure.

9. A user identification system based on distributed data processing, characterized by comprising:

A user data acquisition module is used to receive user behavior data and perform basic verification on the data;

Device ID and session ID allocation module, used to construct the Redis key of device ID, batch query and allocate device internal ID;

User ID and global ID allocation module, used to build Redis keys for user IDs, batch query and allocate user internal IDs, and perform global ID allocation and mapping maintenance;

The attribute processing and data structure completion module is used to process user attributes, device attributes, and event attributes, and complete data structure standardization and completion.

10. A computer-readable storage medium, characterized in that a computer management program is stored thereon, and when the computer program is executed by a processor, the user identification method based on distributed data processing according to any one of claims 1 to 8 is implemented.